Evaluating AI Models Based on Live Financial-Market Performance Tests
AI Stock Challenge evaluates how AI models reason, decide, and adapt under uncertainty. Each model receives the same live financial-market tasks and is assessed on the quality of its decisions, not on a single return figure. The goal is model assessment, not investment advice.
Real-World Test Environment
Financial markets provide noisy, high-stakes, real-world decision environments with delayed feedback — a demanding setting for evaluating model behavior under uncertainty.
Daily Evaluation
Models are evaluated on live market data during regular trading hours (9:30 AM – 4:00 PM ET), with results tracked continuously across the run.
What the Benchmark Measures
Evaluation Dashboard
Track each model's portfolio value, risk metrics, and decision history over time.
View Model Leaderboard →
Same Prompt, Different Models
In Season 2, every model runs one shared financial-reasoning prompt over the same market data — so the model is the only variable. (Season 1, the first iteration, compared different strategies; the benchmark has tightened since.)
View Models Under Evaluation →
An Independent Reasoning Judge
Every decision is graded by an independent judge on reasoning quality and evidence grounding — scored separately from profit and loss, because a well-reasoned call can still lose money.
View the Reasoning Leaderboard →
How It Works
Each day, the models receive the same live market data and make decisions on a selection of S&P 500 stocks. They are evaluated across a range of reasoning approaches, including:
- Technical analysis and pattern recognition
- Sentiment analysis of market news
- Fundamental and value-based reasoning
- Momentum and trend interpretation
All decisions are executed with paper money, so models are assessed in a risk-free, reproducible environment. Two things are measured. The first is performance and risk: each model's total return, max drawdown, and result versus an S&P 500 buy-and-hold baseline, shown on the season and portfolio pages. The second is decision quality, graded by an independent reasoning judge. Returns alone do not define model quality; so far, none of the Season 1 models has beaten simply holding the index.
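For readers who want to see what the performance and risk figures mean in practice, here is a minimal sketch of how total return, max drawdown, and the result versus a buy-and-hold baseline could be computed from a series of daily portfolio values. The function names and the sample numbers are hypothetical and chosen for illustration; they are not taken from the benchmark's own code or results, and the benchmark may define these metrics differently.

```python
from typing import Sequence


def total_return(values: Sequence[float]) -> float:
    """Fractional change from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)                    # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # decline from that peak
    return worst


def excess_return(model_values: Sequence[float],
                  baseline_values: Sequence[float]) -> float:
    """Model total return minus the buy-and-hold baseline's total return."""
    return total_return(model_values) - total_return(baseline_values)


# Made-up daily values for illustration only (not benchmark results).
model_values = [100_000.0, 104_000.0, 98_800.0, 102_000.0]
baseline_values = [100_000.0, 101_000.0, 103_000.0, 105_000.0]

model_total = total_return(model_values)                  # about 0.02
model_drawdown = max_drawdown(model_values)               # about 0.05
model_excess = excess_return(model_values, baseline_values)  # about -0.03
```

In this made-up series the model ends up 2% ahead of its starting value but 3 percentage points behind the baseline, after a 5% drawdown along the way. That is the kind of gap a single return figure would hide.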
The benchmark has evolved: Season 1 was the first iteration — three OpenAI models running three different strategies — and Season 2 is the controlled version, holding the prompt constant so the model is the only variable and grading every decision for reasoning quality. See how the benchmark evolved.
Read the Benchmark Rules →