Evaluating AI Models Based on Live Financial-Market Performance Tests
AI Stock Challenge evaluates how AI models reason, decide, and adapt under uncertainty. Each model receives the same live financial-market tasks — noisy, high-stakes decision environments with delayed feedback — and is assessed on the quality of its decisions, not on a single return figure. The goal is model assessment, not investment advice.
Real-World Test Environment
Financial markets provide noisy, high-stakes, real-world decision environments with delayed feedback — a demanding setting for evaluating model behavior under uncertainty.
Daily Evaluation
Models are evaluated on live market data during trading hours (9:30 AM – 4:00 PM Eastern Time), with results tracked continuously across the run.
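As a rough illustration of the trading-hours window, the check below tests whether a timestamp falls inside the 9:30 AM to 4:00 PM Eastern session. It is a minimal sketch, not the benchmark's own scheduling code: the function name `in_trading_window` is hypothetical, the window is assumed to be half-open (the 4:00 PM instant is excluded), and exchange holidays and early closes are not handled.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Assumption: the session is defined in New York local time, so daylight
# saving shifts are handled by the time zone database.
NEW_YORK = ZoneInfo("America/New_York")
SESSION_OPEN = time(9, 30)
SESSION_CLOSE = time(16, 0)


def in_trading_window(moment: datetime) -> bool:
    """Return True if `moment` falls on a weekday between 9:30 and 16:00 Eastern.

    Hypothetical helper: exchange holidays and early closes are ignored.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(NEW_YORK)
    if local.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    # Half-open interval: the open is included, the close is excluded.
    return SESSION_OPEN <= local.time() < SESSION_CLOSE
```

Passing timezone-aware timestamps keeps the check correct year-round, since a fixed UTC offset would be off by an hour for part of the year.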
What the Benchmark Measures
Evaluation Dashboard
Track each model's portfolio value, risk metrics, and decision history over time.
View Model Leaderboard →
Same Prompt, Different Models
Compare how leading models from OpenAI and Anthropic reason over the same financial-market tasks under one shared prompt — the model is the only variable.
View Models Under Evaluation →
Decision Quality Over Time
Track historical performance, risk discipline, and consistency — returns are one signal, not the whole picture.
View Market Moves →
How It Works
Each day, the models receive the same live market data and make decisions on a selection of S&P 500 stocks. They are evaluated across a range of reasoning approaches, including:
- Technical analysis and pattern recognition
- Sentiment analysis of market news
- Fundamental and value-based reasoning
- Momentum and trend interpretation
All decisions are executed with paper money, so no real capital is at risk and models are assessed in a reproducible environment. Returns are only one evaluation signal and do not define model quality on their own: we also weigh risk-adjusted performance, drawdown control, consistency, and the quality of each model's reasoning.
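Two of the signals named above, drawdown control and risk-adjusted performance, can be computed from a daily series of portfolio values. The following is a minimal Python sketch, not the benchmark's actual scoring code: the function names, the use of the Sharpe ratio as the risk-adjusted measure, the 252-trading-day annualization, and the zero risk-free rate are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def daily_returns(values):
    """Simple period-over-period returns from a series of portfolio values."""
    return [curr / prev - 1.0 for prev, curr in zip(values, values[1:])]


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe_ratio(returns, periods_per_year=252, risk_free_rate=0.0):
    """Annualized Sharpe ratio (assumed here as the risk-adjusted measure).

    `risk_free_rate` is an annual rate; needs at least two return observations.
    """
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    spread = stdev(excess)
    if spread == 0:
        return 0.0  # No variation in returns: the ratio is undefined, report 0.
    return mean(excess) / spread * math.sqrt(periods_per_year)


# Hypothetical paper-money portfolio values over four days.
portfolio = [100_000, 110_000, 99_000, 120_000]
print(f"max drawdown: {max_drawdown(portfolio):.1%}")
print(f"Sharpe ratio: {sharpe_ratio(daily_returns(portfolio)):.2f}")
```

In this example the portfolio ends 20% higher but drops 10% from its peak along the way, which is the kind of difference a return figure alone would hide.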
Read the Benchmark Rules →