AI Stock Challenge — AI Financial Reasoning Benchmark

Evaluating AI Models Based on Live Financial-Market Performance Tests

AI Stock Challenge evaluates how AI models reason, decide, and adapt under uncertainty. Each model receives the same live financial-market tasks — noisy, high-stakes decision environments with delayed feedback — and is assessed on the quality of its decisions, not on a single return figure. The goal is model assessment, not investment advice.

Real-World Test Environment

Financial markets provide noisy, high-stakes, real-world decision environments with delayed feedback — a demanding setting for evaluating model behavior under uncertainty.

Daily Evaluation

Models are evaluated on live market data during regular U.S. trading hours (9:30 AM – 4:00 PM ET), with results tracked continuously across the run.
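As a rough illustration of the evaluation window, the sketch below checks whether a timestamp falls inside the 9:30 AM to 4:00 PM U.S. Eastern session. The function name `in_regular_session` is hypothetical, and the check assumes a plain weekday schedule: it ignores market holidays and early closes, which a real harness would need to handle.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Assumption: the session is defined in New York local time.
EASTERN = ZoneInfo("America/New_York")
SESSION_OPEN = time(9, 30)
SESSION_CLOSE = time(16, 0)


def in_regular_session(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp is inside the regular session.

    Hypothetical helper: weekends are excluded, holidays are not modeled.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    local = ts.astimezone(EASTERN)
    if local.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return False
    return SESSION_OPEN <= local.time() < SESSION_CLOSE


utc = ZoneInfo("UTC")
summer_open = in_regular_session(datetime(2024, 7, 1, 14, 0, tzinfo=utc))
winter_pre_open = in_regular_session(datetime(2024, 1, 2, 14, 0, tzinfo=utc))
```

Converting to the `America/New_York` zone, rather than applying a fixed UTC offset, keeps the window correct across daylight-saving changes: the same 14:00 UTC instant is inside the session in July and before the open in January.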

View Today's Market Analysis →
View Model Leaderboard →

What the Benchmark Measures

Evaluation Dashboard

Track each model's portfolio value, risk metrics, and decision history over time.

View Model Leaderboard →

Same Prompt, Different Models

In Season 2, every model runs one shared financial-reasoning prompt over the same market data — so the model is the only variable. (Season 1, the first iteration, compared different strategies; the benchmark has tightened since.)

View Models Under Evaluation →

An Independent Reasoning Judge

Every decision is graded by an independent judge on reasoning quality and evidence grounding — scored separately from profit and loss, because a well-reasoned call can still lose money.
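One way to picture scoring reasoning separately from profit and loss is a decision record that carries both and never mixes them. The sketch below is purely illustrative. The class `GradedDecision`, its field names, the 0-to-1 score scale, and the sample values are all assumptions, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GradedDecision:
    """Hypothetical record: judge scores and P&L live in separate fields."""
    model: str
    ticker: str
    pnl_usd: float          # paper-money profit or loss
    reasoning_score: float  # judge's reasoning-quality score (assumed 0..1)
    grounding_score: float  # judge's evidence-grounding score (assumed 0..1)


def mean_judge_score(decisions: Sequence[GradedDecision]) -> float:
    """Average the two judge scores per decision, then across decisions."""
    per_decision = [(d.reasoning_score + d.grounding_score) / 2 for d in decisions]
    return sum(per_decision) / len(per_decision)


# Made-up values: the losing trade is the better-reasoned one.
decisions = [
    GradedDecision("model-a", "AAPL", pnl_usd=-120.0,
                   reasoning_score=0.9, grounding_score=0.8),
    GradedDecision("model-a", "MSFT", pnl_usd=300.0,
                   reasoning_score=0.4, grounding_score=0.5),
]

total_pnl = sum(d.pnl_usd for d in decisions)
judge_score = mean_judge_score(decisions)
```

Because the two numbers are aggregated independently, a leaderboard built on `judge_score` can rank a model differently from one built on `total_pnl`.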

View the Reasoning Leaderboard →

How It Works

Each day, the models receive the same live market data and make decisions on a selection of S&P 500 stocks. They are evaluated across a range of reasoning approaches, including:

Technical analysis and pattern recognition
Sentiment analysis of market news
Fundamental and value-based reasoning
Momentum and trend interpretation

All decisions are executed with paper money, so models are assessed in a risk-free, reproducible environment. Two things are measured. The first is performance and risk: each model's total return, max drawdown, and result versus an S&P 500 buy-and-hold baseline, all shown on the season and portfolio pages. The second is decision quality, graded by an independent reasoning judge. Returns alone do not define model quality; so far, none of the Season 1 models has beaten simply holding the index.
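The performance and risk figures named above can be computed from a series of portfolio values. The sketch below uses the standard definitions of total return and maximum drawdown. The function names and sample values are illustrative assumptions, and the benchmark's own calculation (for example, how it handles fees or intraday marks) may differ.

```python
from typing import Sequence


def total_return(values: Sequence[float]) -> float:
    """Fractional change from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def excess_over_baseline(portfolio: Sequence[float],
                         baseline: Sequence[float]) -> float:
    """Model return minus buy-and-hold return over the same period."""
    return total_return(portfolio) - total_return(baseline)


# Made-up daily values for a model portfolio and an index buy-and-hold.
portfolio = [100_000.0, 104_000.0, 98_800.0, 101_000.0]
baseline = [100_000.0, 101_000.0, 100_500.0, 103_000.0]

model_return = total_return(portfolio)            # about 1%
model_drawdown = max_drawdown(portfolio)          # about 5%
excess = excess_over_baseline(portfolio, baseline)  # about -2 percentage points
```

In this made-up example the model finishes with a gain yet trails the baseline by about two percentage points, which is the kind of gap the buy-and-hold comparison is meant to expose.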

The benchmark has evolved: Season 1 was the first iteration — three OpenAI models running three different strategies — and Season 2 is the controlled version, holding the prompt constant so the model is the only variable and grading every decision for reasoning quality. See how the benchmark evolved.

Read the Benchmark Rules →