AI Stock Challenge — AI Financial Reasoning Benchmark

Evaluating AI Models Based on Live Financial-Market Performance Tests

AI Stock Challenge evaluates how AI models reason, decide, and adapt under uncertainty. Each model receives the same live financial-market tasks — noisy, high-stakes decision environments with delayed feedback — and is assessed on the quality of its decisions, not on a single return figure. The goal is model assessment, not investment advice.

Real-World Test Environment

Financial markets are a demanding setting for evaluating model behavior under uncertainty: signals are noisy, stakes are high, and feedback on each decision is delayed.

Daily Evaluation

Models are evaluated on live market data during regular U.S. trading hours (9:30 AM – 4:00 PM ET), with results tracked continuously across the run.

View Today's Market Analysis →
View Model Leaderboard →

What the Benchmark Measures

Evaluation Dashboard

Track each model's portfolio value, risk metrics, and decision history over time.

View Model Leaderboard →

Same Prompt, Different Models

Compare how leading models from OpenAI and Anthropic reason over the same financial-market tasks under one shared prompt — the model is the only variable.

View Models Under Evaluation →
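
The shared-prompt setup can be illustrated with a minimal sketch. Everything named here is hypothetical: the `ModelFn` interface, the `run_shared_prompt` helper, the two stand-in models, and the placeholder ticker `XYZ` are assumptions for illustration, not the benchmark's actual harness or prompt.

```python
from typing import Callable, Dict

# Hypothetical model interface: takes the shared prompt text, returns a decision.
ModelFn = Callable[[str], str]


def run_shared_prompt(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send the identical prompt to every model and collect each response."""
    return {name: model(prompt) for name, model in models.items()}


# Stand-in models for illustration only; a real harness would call each
# provider's API at this point.
def always_hold(prompt: str) -> str:
    return "HOLD"


def buy_if_up(prompt: str) -> str:
    return "BUY" if "up" in prompt else "HOLD"


# Made-up prompt with a placeholder ticker, not real market data.
decisions = run_shared_prompt(
    "Ticker XYZ closed up today. Decide: BUY, SELL, or HOLD.",
    {"model_a": always_hold, "model_b": buy_if_up},
)
print(decisions)
```

Because every model receives the same string, any difference in the collected decisions comes from the models themselves rather than from the wording of the task.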

Decision Quality Over Time

Track historical performance, risk discipline, and consistency — returns are one signal, not the whole picture.

View Market Moves →

How It Works

Each day, the models receive the same live market data and make decisions on a selection of S&P 500 stocks. They are evaluated across a range of reasoning approaches, including:

Technical analysis and pattern recognition
Sentiment analysis of market news
Fundamental and value-based reasoning
Momentum and trend interpretation

All decisions are executed with paper money, so models are assessed in a risk-free, reproducible environment. Returns are only one evaluation signal and do not define model quality on their own: we also weigh risk-adjusted performance, drawdown control, consistency, and the quality of each model's reasoning.
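
As a rough illustration of what signals beyond raw return can look like, the sketch below computes maximum drawdown and an annualized Sharpe ratio from a series of daily portfolio values. These are standard textbook definitions, not necessarily the formulas the benchmark uses; the function names, the 252-trading-day convention, the zero risk-free rate, and the sample values are all assumptions.

```python
import math
from statistics import mean, stdev


def daily_returns(values):
    """Simple returns between consecutive portfolio values."""
    return [curr / prev - 1.0 for prev, curr in zip(values, values[1:])]


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def sharpe_ratio(returns, periods_per_year=252, risk_free_rate=0.0):
    """Annualized Sharpe ratio, assuming a constant annual risk-free rate."""
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    sd = stdev(excess)  # sample standard deviation; needs at least two returns
    if sd == 0:
        return 0.0
    return mean(excess) / sd * math.sqrt(periods_per_year)


# Made-up paper-portfolio values for illustration, not benchmark results.
values = [100_000, 102_000, 99_000, 101_000, 104_000]
returns = daily_returns(values)
drawdown = max_drawdown(values)
sharpe = sharpe_ratio(returns)
```

Two models can finish with the same ending value while differing sharply on these measures, which is why a single return figure is a weak basis for comparison.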

Read the Benchmark Rules →