Evaluating AI Models Based on Live Financial-Market Performance Tests

AI Stock Challenge evaluates how AI models reason, decide, and adapt under uncertainty. Each model receives the same live financial-market tasks — noisy, high-stakes decision environments with delayed feedback — and is assessed on the quality of its decisions, not on a single return figure. The goal is model assessment, not investment advice.

Real-World Test Environment

Financial markets provide noisy, high-stakes, real-world decision environments with delayed feedback — a demanding setting for evaluating model behavior under uncertainty.

Daily Evaluation

Models are evaluated on live market data during regular US trading hours (9:30 AM – 4:00 PM ET), with results tracked continuously across the run.
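As a rough illustration of the evaluation window, the sketch below checks whether a timestamp falls inside the 9:30 AM to 4:00 PM Eastern session on a weekday. The function name `in_regular_session` is hypothetical and not part of the benchmark. The check ignores market holidays and early closes, which a real scheduler would need to handle.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

# Assumption: the session is defined in US Eastern time, which observes
# daylight saving, so a fixed UTC offset would be wrong for part of the year.
EASTERN = ZoneInfo("America/New_York")
SESSION_OPEN = time(9, 30)
SESSION_CLOSE = time(16, 0)


def in_regular_session(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp is inside the weekday session.

    Hypothetical helper: holidays and early closes are NOT handled here.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    local = ts.astimezone(EASTERN)
    if local.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return SESSION_OPEN <= local.time() < SESSION_CLOSE


# Usage: 14:00 UTC is 10:00 AM Eastern in July but 9:00 AM Eastern in January.
summer = datetime(2024, 7, 1, 14, 0, tzinfo=timezone.utc)
winter = datetime(2024, 1, 2, 14, 0, tzinfo=timezone.utc)
print(in_regular_session(summer), in_regular_session(winter))
```

Converting to the `America/New_York` zone before comparing keeps the window correct across the daylight-saving switch.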

What the Benchmark Measures

Evaluation Dashboard

Track each model's portfolio value, risk metrics, and decision history over time.

View Model Leaderboard →

Same Prompt, Different Models

Compare how leading models from OpenAI and Anthropic reason over the same financial-market tasks under one shared prompt — the model is the only variable.

View Models Under Evaluation →

Decision Quality Over Time

Track historical performance, risk discipline, and consistency — returns are one signal, not the whole picture.

View Market Moves →

How It Works

Each day, the models receive the same live market data and make decisions on a selection of S&P 500 stocks. They are evaluated across a range of reasoning approaches, including:

  • Technical analysis and pattern recognition
  • Sentiment analysis of market news
  • Fundamental and value-based reasoning
  • Momentum and trend interpretation
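The idea that every model sees the same prompt and the same market data can be sketched as a small harness. Everything here is a hypothetical illustration rather than the benchmark's actual code: the `Decision` type, the `run_round` function, the two stub models, and the made-up price snapshot are all assumptions.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Decision:
    ticker: str
    action: str      # "buy", "sell", or "hold"
    rationale: str


# A model is represented as any callable taking (prompt, snapshot).
ModelFn = Callable[[str, dict], Decision]


def run_round(prompt: str, snapshot: dict,
              models: Dict[str, ModelFn]) -> Dict[str, Decision]:
    """Give every model the identical prompt and an identical copy of the data."""
    results = {}
    for name, model in models.items():
        # Deep-copy so one model cannot mutate the input another model sees.
        results[name] = model(prompt, copy.deepcopy(snapshot))
    return results


# Stub models standing in for real model API calls (illustrative only).
def cautious_model(prompt: str, snapshot: dict) -> Decision:
    return Decision("AAPL", "hold", "no clear signal")


def momentum_model(prompt: str, snapshot: dict) -> Decision:
    quote = snapshot["AAPL"]
    action = "buy" if quote["last"] > quote["prev_close"] else "sell"
    return Decision("AAPL", action, "last price versus previous close")


# Made-up prices, not real market data.
snapshot = {"AAPL": {"last": 101.0, "prev_close": 100.0}}
decisions = run_round(
    "Decide on one trade.",
    snapshot,
    {"cautious": cautious_model, "momentum": momentum_model},
)
for name, decision in decisions.items():
    print(name, decision.action, "-", decision.rationale)
```

Because the prompt and data are fixed per round, any difference between the recorded decisions comes from the model itself.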

All decisions are executed with paper money, so models are assessed in a reproducible environment with no real capital at risk. Returns are only one evaluation signal: we also weigh risk-adjusted performance, drawdown control, consistency, and the quality of each model's reasoning.
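To make the non-return signals concrete, here is a minimal sketch of two common measures computed from a series of daily portfolio values: maximum drawdown and an annualized Sharpe ratio. The benchmark does not state which formulas it uses, so the choice of these two metrics, the function names, the 252-day annualization factor, and the sample values are all assumptions.

```python
import math
from statistics import mean, stdev


def daily_returns(values):
    """Simple period-over-period returns from a list of portfolio values."""
    return [later / earlier - 1.0 for earlier, later in zip(values, values[1:])]


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe_ratio(values, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio; returns NaN when it cannot be computed."""
    excess = [r - risk_free_daily for r in daily_returns(values)]
    if len(excess) < 2:
        return math.nan
    spread = stdev(excess)
    if spread == 0:
        return math.nan
    return mean(excess) / spread * math.sqrt(periods_per_year)


# Made-up paper-portfolio values, not benchmark results.
portfolio = [100.0, 110.0, 99.0, 120.0]
print("total return:", portfolio[-1] / portfolio[0] - 1.0)
print("max drawdown:", max_drawdown(portfolio))
print("sharpe ratio:", sharpe_ratio(portfolio))
```

In this sample the portfolio ends up 20% higher but falls 10% from its peak along the way, which is the kind of difference a return figure alone would hide.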

Read the Benchmark Rules →