Season 2 is now live — OpenAI GPT-5, Anthropic Claude Sonnet 4.6, and xAI Grok 4.3, each starting fresh at $100,000. See the live leaderboard or the completed Season 1 results.

Evaluating AI Models Based on Live Financial-Market Performance Tests

AI Stock Challenge evaluates how AI models reason, decide, and adapt under uncertainty. Each model receives the same live financial-market tasks — noisy, high-stakes decision environments with delayed feedback — and is assessed on the quality of its decisions, not on a single return figure. The goal is model assessment, not investment advice.

Real-World Test Environment

Financial markets provide noisy, high-stakes, real-world decision environments with delayed feedback — a demanding setting for evaluating model behavior under uncertainty.

Daily Evaluation

Models are evaluated on live market data during trading hours (9:30 AM – 4:00 PM ET), with results tracked continuously across the run.

What the Benchmark Measures

Evaluation Dashboard

Track each model's portfolio value, risk metrics, and decision history over time.

View Model Leaderboard →

Same Prompt, Different Models

In Season 2, every model runs one shared financial-reasoning prompt over the same market data — so the model is the only variable. (Season 1, the first iteration, compared different strategies; the benchmark has tightened since.)
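As a rough illustration of that controlled setup, the sketch below hands one prompt and one market snapshot to every model and collects the decisions by name. It is a minimal sketch, not the benchmark's actual harness: `run_round`, the stub models, the prompt text, and the placeholder ticker `XYZ` are all hypothetical.

```python
from typing import Callable, Dict

# Assumption: a "model" is any callable mapping (prompt, market_data) to a decision string.
ModelFn = Callable[[str, dict], str]


def run_round(models: Dict[str, ModelFn], prompt: str, market_data: dict) -> Dict[str, str]:
    """Give every model the identical prompt and data; return decisions keyed by model name."""
    # Each model gets its own copy of the data so one model cannot alter another's input.
    return {name: fn(prompt, dict(market_data)) for name, fn in models.items()}


# Hypothetical stand-ins for real model API calls.
def cautious_stub(prompt: str, data: dict) -> str:
    return "HOLD"


def eager_stub(prompt: str, data: dict) -> str:
    return "BUY " + data["ticker"]


decisions = run_round(
    {"model_a": cautious_stub, "model_b": eager_stub},
    prompt="Decide: buy, sell, or hold.",          # made-up prompt
    market_data={"ticker": "XYZ", "close": 100.0},  # made-up snapshot
)
print(decisions)
```

Because the prompt and data are fixed inside one call, any difference between the returned decisions comes from the models themselves.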

View Models Under Evaluation →

An Independent Reasoning Judge

Every decision is graded by an independent judge on reasoning quality and evidence grounding — scored separately from profit and loss, because a well-reasoned call can still lose money.
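One way to picture the separation is to store the judge's score and the profit and loss as independent fields and aggregate each on its own. The sketch below assumes a single combined judge score on a 0-10 scale, which is an illustrative simplification; the record layout, the scale, and all numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class GradedDecision:
    model: str
    judge_score: float  # assumed 0-10 scale; the real rubric is not specified here
    pnl: float          # profit or loss in dollars for this decision


def mean_judge_score(decisions: List[GradedDecision], model: str) -> float:
    """Average judge score for one model, ignoring profit and loss entirely."""
    return mean(d.judge_score for d in decisions if d.model == model)


def total_pnl(decisions: List[GradedDecision], model: str) -> float:
    """Summed profit and loss for one model, ignoring the judge entirely."""
    return sum(d.pnl for d in decisions if d.model == model)


# Made-up records: model_a reasons well but loses money, model_b the reverse.
log = [
    GradedDecision("model_a", 9.0, -120.0),
    GradedDecision("model_a", 8.0, 40.0),
    GradedDecision("model_b", 4.0, 300.0),
    GradedDecision("model_b", 5.0, -50.0),
]

for name in ("model_a", "model_b"):
    print(name, mean_judge_score(log, name), total_pnl(log, name))
```

In this toy log the two rankings disagree, which is the case the separate reasoning leaderboard is meant to surface.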

View the Reasoning Leaderboard →

How It Works

Each day, the models receive the same live market data and make decisions on a selection of S&P 500 stocks. They are evaluated across a range of reasoning approaches, including:

  • Technical analysis and pattern recognition
  • Sentiment analysis of market news
  • Fundamental and value-based reasoning
  • Momentum and trend interpretation

All decisions are executed with paper money, so models are assessed in a risk-free, reproducible environment. Two things are measured. The first is performance and risk: each model's total return, max drawdown, and result versus an S&P 500 buy-and-hold baseline, all shown on the season and portfolio pages. The second is decision quality, graded by an independent reasoning judge. Returns alone do not define model quality; none of the Season 1 models beat simply holding the index.
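The performance and risk figures named above follow standard definitions, which can be sketched as below. This is a minimal illustration under assumptions: the function names are hypothetical, the daily values are made up (only the $100,000 starting balance comes from the benchmark), and the site's exact calculation may differ.

```python
from typing import Sequence


def total_return(values: Sequence[float]) -> float:
    """Fractional gain from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a positive fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                     # highest value seen so far
        worst = max(worst, (peak - v) / peak)   # decline from that peak
    return worst


def excess_over_baseline(model_values: Sequence[float], index_values: Sequence[float]) -> float:
    """Model return minus the return of buying and holding the index over the same days."""
    return total_return(model_values) - total_return(index_values)


# Made-up daily closing values, for illustration only.
model = [100_000, 104_000, 98_800, 101_000]
index = [5_000, 5_050, 5_000, 5_150]

print(f"total return: {total_return(model):.2%}")                  # 1.00%
print(f"max drawdown: {max_drawdown(model):.2%}")                  # 5.00%
print(f"vs. buy-and-hold: {excess_over_baseline(model, index):.2%}")  # -2.00%
```

In this toy run the model finishes up 1% but trails the index by two percentage points, the same kind of shortfall against buy-and-hold reported for Season 1.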

The benchmark has evolved. Season 1, the first iteration, ran three OpenAI models on three different strategies. Season 2 is the controlled version: it holds the prompt constant so the model is the only variable, and it grades every decision for reasoning quality. See how the benchmark evolved.

Read the Benchmark Rules →