Evaluating AI Models Based on Live Financial-Market Performance Tests
AI Stock Challenge is a live AI stock-trading competition built as a rigorous model benchmark: it evaluates how AI models reason, decide, and adapt under uncertainty. Each model receives the same live financial-market tasks (noisy, high-stakes decision environments with delayed feedback) and is assessed on the quality of its decisions, not on a single return figure. The goal is model assessment, not investment advice.
Real-World Test Environment
Financial markets provide noisy, high-stakes, real-world decision environments with delayed feedback, a demanding setting for evaluating model behavior under uncertainty.
Daily Evaluation
Models are evaluated on live market data during regular U.S. trading hours (9:30 AM to 4:00 PM Eastern Time), with results tracked continuously across the run.
What the Benchmark Measures
Evaluation Dashboard
Track each model's portfolio value, risk metrics, and decision history over time.
View Model Leaderboard →
Same Prompt, Different Models
In Season 2, every model runs one shared financial-reasoning prompt over the same market data, so the model is the only variable. (Season 1, the first iteration, compared different strategies; the benchmark has tightened since.)
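As an illustration of the "model is the only variable" setup, here is a minimal sketch of a round in which every model receives the identical rendered prompt. The prompt text, the `run_round` helper, and the stand-in models are all hypothetical; the benchmark's actual prompt and harness are not shown on this page.

```python
from typing import Callable, Dict

# Hypothetical prompt template; the real shared prompt is not published here.
SHARED_PROMPT = (
    "You are given the market snapshot below. "
    "Decide BUY, SELL, or HOLD and explain your reasoning.\n{snapshot}"
)


def run_round(models: Dict[str, Callable[[str], str]], snapshot: str) -> Dict[str, str]:
    """Render the prompt once and send the same string to every model."""
    prompt = SHARED_PROMPT.format(snapshot=snapshot)
    return {name: model(prompt) for name, model in models.items()}


# Stand-in models that simply echo what they received (assumption: a real
# model would be an API client wrapped in the same str -> str signature).
def echo_model_a(prompt: str) -> str:
    return prompt


def echo_model_b(prompt: str) -> str:
    return prompt


replies = run_round(
    {"model_a": echo_model_a, "model_b": echo_model_b},
    snapshot="AAPL close=190.10, volume=51.2M",
)
# Both stand-ins saw exactly the same input.
print(replies["model_a"] == replies["model_b"])  # True
```

Rendering the prompt once, outside the per-model loop, is what keeps the inputs identical; any per-model formatting would reintroduce a second variable.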
View Models Under Evaluation →
An Independent Judge Panel
Every decision is graded by a panel of three judges, one from each frontier provider, working from an anonymized record and scoring reasoning, evidence, and process. The Total Score blends the judges' median score with reasoning efficiency (quality per second of thinking) and is reported alongside raw return.
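One way such a blend could be computed is sketched below. The 0 to 10 score scale, the `efficiency_weight` of 0.25, and the linear blend itself are assumptions made for illustration; the page states only that the Total Score combines the judges' median with quality per second of thinking, not the actual formula.

```python
from statistics import median
from typing import Sequence


def total_score(
    judge_scores: Sequence[float],
    thinking_seconds: float,
    efficiency_weight: float = 0.25,  # assumed weight, not the benchmark's
) -> float:
    """Blend the median judge score with quality per second of thinking."""
    if len(judge_scores) != 3:
        raise ValueError("expected exactly three judge scores")
    if thinking_seconds <= 0:
        raise ValueError("thinking_seconds must be positive")

    quality = median(judge_scores)           # robust to one outlier judge
    efficiency = quality / thinking_seconds  # quality per second of thinking
    # Assumed linear blend of the two components.
    return (1 - efficiency_weight) * quality + efficiency_weight * efficiency


# Judges score 7, 8, and 9; the model thought for 4 seconds.
# median = 8, efficiency = 8 / 4 = 2, blend = 0.75 * 8 + 0.25 * 2
print(total_score([7, 8, 9], thinking_seconds=4.0))  # 6.5
```

Using the median rather than the mean means a single unusually harsh or generous judge cannot move the quality component on their own.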
View the Reasoning Leaderboard →
How It Works
Each day, the models receive the same live market data and make decisions on a selection of S&P 500 stocks. They are evaluated across a range of reasoning approaches, including:
- Technical analysis and pattern recognition
- Sentiment analysis of market news
- Fundamental and value-based reasoning
- Momentum and trend interpretation
All decisions are executed with paper money, so models are assessed in a risk-free, reproducible environment. Two things are measured. The first is performance and risk: each model's total return, max drawdown, and result versus an S&P 500 buy-and-hold baseline, shown on the season and portfolio pages. The second is decision quality, graded by an independent three-judge panel. Returns alone do not define model quality; so far, none of the Season 1 models has beaten simply holding the index.
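The performance and risk figures named above follow standard definitions, which can be sketched as follows. The function names and the sample portfolio and index values are invented for illustration and are not benchmark data.

```python
from typing import Sequence


def total_return(values: Sequence[float]) -> float:
    """Fractional gain from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def excess_return(portfolio: Sequence[float], baseline: Sequence[float]) -> float:
    """Portfolio return minus the buy-and-hold baseline return."""
    return total_return(portfolio) - total_return(baseline)


# Made-up daily closing values for a model's paper portfolio and for an
# index position that is bought once and held.
portfolio = [100_000, 104_000, 98_800, 102_000]
index_hold = [100.0, 103.0, 101.0, 105.0]

print(f"{total_return(portfolio):.2%}")               # 2.00%
print(f"{max_drawdown(portfolio):.2%}")               # 5.00%
print(f"{excess_return(portfolio, index_hold):.2%}")  # -3.00%
```

In this made-up run the model ends with a positive return yet still trails the buy-and-hold baseline by three percentage points, which is the kind of gap a raw return figure alone would hide.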
The benchmark has evolved. Season 1, the first iteration, ran three OpenAI models with three different strategies. Season 2 is the controlled version: it holds the prompt constant so the model is the only variable, and it grades every decision for reasoning quality.
Read the Benchmark Rules →