How the Reasoning Judge Works — Scoring Methodology

What the headline score is — and what it isn't

The headline number on the leaderboard and model pages — labelled Total Score — is a blended decision-quality score from 0 to 100. It is produced by a single large language model acting as a judge, grading each trading model's full day-by-day decision record against point-in-time market data.

It is not a pure reasoning-only metric, and we no longer claim it is. The composite includes a 10% outcome component, so realized results do influence it. We report Outcome and raw trading Return as separate columns precisely so you can see each axis on its own and not have to trust a single blended figure. If you want a P&L-free view, read the Reasoning and Evidence sub-scores directly; if you want results, read Return.

These scores are directional, not certified measurements. Read the bands (below), not the decimals: the difference between a 49 and a 52 is noise, the difference between a 45 and a 70 is signal.

Who the judge is

The judge is OpenAI GPT-5.4, run through the OpenAI Responses API with a strict JSON-schema structured output, so every score, dimension, and claim comes back in a fixed, machine-checkable shape rather than free prose.
It grades the whole horizon of a model's decisions — the sequence of daily trades with the rationale and context the model gave at the time — not a single trade in isolation.
It is given point-in-time market and fundamental data (price, volume, valuation, momentum, analyst, sector) for held and recently traded symbols, and is explicitly instructed not to use a later snapshot to grade earlier decisions; compliance is logged as a lookahead_check in the output.
Scoring is pointwise: each model is judged on its own, in its own call — not ranked head-to-head against another model in the same prompt.

"Independent" — independent from what?

We use the word carefully. The judge is independent in two senses and not in a third:

Independent from the trading models: the judge does not trade and is not one of the competitors. It only sees the decision record and the market data.
Independent from P&L — only partly: 90% of the composite is reasoning, evidence, and process; 10% is outcome. So it is mostly, but not wholly, return-independent. We state the weight rather than hiding it.
Not independent from the provider: the judge is an OpenAI model, and one Season 2 competitor (OpenAI GPT-5) shares that provider. That is a real self-preference risk, addressed head-on below.

The rubric

Four axes are scored 0–100. Reasoning quality and evidence grounding are each a weighted sum of the sub-components below. Outcome grades how the decisions actually played out; data reliability grades the internal consistency of the decision record.

Reasoning quality (weighted to 100)

Sub-component	Weight
Action–rationale alignment	10
Thesis quality	12
Strategy fit	10
Risk awareness	12
Portfolio discipline	10
Temporal consistency	10
Decision-update quality	10
Uncertainty discipline	8
Outcome-independent reasoning quality	8
Clarity and specificity	10

Evidence grounding (weighted to 100)

Sub-component	Weight
Claim support from market data	25
Correct use of valuation metrics	15
Correct use of momentum / technical indicators	15
Correct use of analyst data	10
Correct use of sector / industry context	10
Avoidance of unsupported claims	15
No contradiction with available data	10
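
As a rough illustration of how the two weighted sums above could be computed, here is a minimal Python sketch. It assumes each sub-component is itself scored 0–100 and that the listed weights are percentage points. The dictionary keys and the weighted_axis function are hypothetical names chosen for this example, not identifiers from the grader.

```python
# Weights copied from the two tables above; key names are hypothetical.
REASONING_WEIGHTS = {
    "action_rationale_alignment": 10,
    "thesis_quality": 12,
    "strategy_fit": 10,
    "risk_awareness": 12,
    "portfolio_discipline": 10,
    "temporal_consistency": 10,
    "decision_update_quality": 10,
    "uncertainty_discipline": 8,
    "outcome_independent_reasoning": 8,
    "clarity_specificity": 10,
}

EVIDENCE_WEIGHTS = {
    "claim_support_market_data": 25,
    "valuation_metrics": 15,
    "momentum_technical": 15,
    "analyst_data": 10,
    "sector_industry_context": 10,
    "avoid_unsupported_claims": 15,
    "no_contradiction": 10,
}


def weighted_axis(sub_scores: dict, weights: dict) -> float:
    """Weighted average of 0-100 sub-scores (assumed scale)."""
    if set(sub_scores) != set(weights):
        raise ValueError("sub_scores must cover exactly the weighted sub-components")
    total_weight = sum(weights.values())
    return sum(sub_scores[name] * weights[name] for name in weights) / total_weight


# Usage: a model scoring 50 on every reasoning sub-component gets 50 overall.
flat = {name: 50 for name in REASONING_WEIGHTS}
reasoning_quality = weighted_axis(flat, REASONING_WEIGHTS)
```

Dividing by the weight total keeps the result on a 0–100 scale even if a weight table is later edited and no longer sums to exactly 100.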

Total Score (composite)

Component	Contribution
Reasoning quality	40%
Evidence grounding	35%
Temporal / portfolio process quality	15%
Outcome awareness (if outcome data exists)	10%

When a model has no resolvable outcome yet, the 10% outcome weight is redistributed across reasoning quality and evidence grounding rather than filled with an assumed outcome score.
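
The blend can be sketched in a few lines of Python. The component weights come from the table above. The redistribution rule is an assumption: this page does not say how the 10% is split, so the sketch divides it between reasoning and evidence in proportion to their existing weights (40:35). The function and key names are hypothetical.

```python
from typing import Optional

# Composite weights from the table above.
BASE_WEIGHTS = {"reasoning": 0.40, "evidence": 0.35, "process": 0.15, "outcome": 0.10}


def total_score(reasoning: float, evidence: float, process: float,
                outcome: Optional[float] = None) -> float:
    """Blend 0-100 component scores into a 0-100 composite."""
    weights = dict(BASE_WEIGHTS)
    if outcome is None:
        # ASSUMPTION: the spare outcome weight is split pro rata between
        # reasoning and evidence; the exact rule is not documented here.
        spare = weights.pop("outcome")
        pool = BASE_WEIGHTS["reasoning"] + BASE_WEIGHTS["evidence"]
        weights["reasoning"] += spare * BASE_WEIGHTS["reasoning"] / pool
        weights["evidence"] += spare * BASE_WEIGHTS["evidence"] / pool
    scores = {"reasoning": reasoning, "evidence": evidence,
              "process": process, "outcome": outcome}
    return sum(weights[name] * scores[name] for name in weights)


# Usage: with and without a resolvable outcome.
with_outcome = total_score(reasoning=60, evidence=60, process=60, outcome=60)
without_outcome = total_score(reasoning=60, evidence=60, process=60)
```

Under this assumed rule the weights still sum to 1 when the outcome is missing, so a model with equal component scores gets the same composite either way.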

How to read a score

Every 0–100 axis (and the dimension breakdown on each model page) is read against the same three bands:

Band	Range	What it means
Strong	67–100	Consistent, well-grounded decision-making across the horizon.
Mixed	45–66	Real strengths offset by gaps — drift, thin risk discipline, or unsupported claims.
Weak	0–44	Reasoning frequently unsupported, contradictory, or incoherent over time.

So a 46 is a borderline-mixed result, a 51 sits mid-mixed, and 67 and above is the strong band. Treat single-point gaps as noise.
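
The band lookup can be written as a small Python function. This is a sketch: the name band is hypothetical, and the handling of fractional scores between bands (for example 66.5) is an assumption, since the table lists whole-number ranges only.

```python
def band(score: float) -> str:
    """Map a 0-100 score to its band from the table above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # ASSUMPTION: a fractional score belongs to the band whose lower
    # bound it has reached, so 66.5 is still Mixed.
    if score >= 67:
        return "Strong"
    if score >= 45:
        return "Mixed"
    return "Weak"


# Usage: the scores mentioned in the text above.
labels = {s: band(s) for s in (44, 46, 51, 67, 70)}
```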

Known LLM-as-judge failure modes

LLM judges have documented biases. Here's each one and what we actually do about it — including where we can only mitigate, not eliminate.

Self-preference bias (a model favouring its own family). This is the sharpest risk, because the judge is an OpenAI model and one competitor is OpenAI GPT-5. We disclose it rather than bury it. Current live evidence cuts against naive self-preference: the GPT judge ranks xAI Grok first (78), above OpenAI GPT-5 (68). That is a data point, not a calibration study — a cross-provider judge check (re-grading a sample with a non-OpenAI judge and reporting agreement) is the planned next step.
Verbosity bias (rewarding longer answers). The judge scores grounding against named market-data fields via a claim ledger — each factual claim is checked against the data — so length on its own earns nothing. We don't claim this fully eliminates the bias.
Position / order bias. Because scoring is pointwise (one model per call, no A-vs-B ordering), the classic position bias of pairwise judging largely does not apply here.
Look-ahead leakage. The judge is told to grade an earlier decision only on what was knowable then, with an explicit pass/fail lookahead_check in the output.

A worked example

Season 1's OpenAI GPT-4 Turbo scored Reasoning 46, Evidence 57, Outcome 61, Data reliability 54, for a Total Score of 51 — mid-"mixed" — with the verdict "Mixed / Improving but Inconsistent." The judge's one-line summary: it improved into a more disciplined fundamentals process, but high turnover and inconsistent thesis maintenance held it back.

The claim ledger is where the grounding score comes from. For example, the claim "GOOGL offers strong profitability and growth with analyst upside" was marked supported, citing the exact fields checked: profitMargin, operatingMarginTTM, returnOnEquityTTM, quarterlyEarningsGrowthYOY, analystTargetPrice. A claim with no backing field is marked not verifiable; one the data contradicts is marked contradicted and penalised. Every model page shows its full ledger.
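
To make the ledger's shape concrete, here is a minimal Python sketch of one entry and a status tally. The three statuses and the GOOGL example come from the paragraph above. The ClaimEntry class, its attribute names, and the summarize function are hypothetical and are not the grader's actual output schema. How the tally feeds into the grounding score is not shown, because the penalty rule is not specified on this page.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

STATUSES = {"supported", "not verifiable", "contradicted"}


@dataclass
class ClaimEntry:
    """One ledger row (hypothetical shape, for illustration only)."""
    claim: str
    status: str
    fields_checked: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        # A supported claim must name the data fields that back it.
        if self.status == "supported" and not self.fields_checked:
            raise ValueError("supported claims must cite at least one field")


def summarize(ledger: List[ClaimEntry]) -> Counter:
    """Count ledger entries per status."""
    return Counter(entry.status for entry in ledger)


# Usage: the GOOGL claim quoted above.
ledger = [
    ClaimEntry(
        claim="GOOGL offers strong profitability and growth with analyst upside",
        status="supported",
        fields_checked=["profitMargin", "operatingMarginTTM", "returnOnEquityTTM",
                        "quarterlyEarningsGrowthYOY", "analystTargetPrice"],
    ),
]
counts = summarize(ledger)
```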

Limitations & what's next

Single judge, no inter-rater study yet. One model grades everything. We have not yet published a second-judge agreement number or a self-consistency run (same input, multiple seeds). Both are planned.
The composite blends outcome. 10% is results, so it is not a clean reasoning-only metric — read the sub-scores if you want that.
Season 1 varied strategy as well as model, so it cannot isolate the model; Season 2 holds the prompt constant to fix that. See how the benchmark evolved.
Scores can shift on re-grading. LLM outputs aren't perfectly deterministic; treat the bands as stable, the decimals as not.
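
One way a self-consistency run could be summarised, once it exists, is sketched below in Python. Everything here is hypothetical: the function name, the returned keys, and the input scores are illustrative values echoing the 49-versus-52 example earlier on this page, not measured results. The band thresholds follow the table in "How to read a score".

```python
from statistics import mean
from typing import Dict, List


def consistency_summary(scores: List[float]) -> Dict[str, object]:
    """Summarise repeated gradings of the same input (hypothetical helper)."""
    if not scores:
        raise ValueError("need at least one score")

    def band(score: float) -> str:
        return "Strong" if score >= 67 else "Mixed" if score >= 45 else "Weak"

    bands = sorted({band(s) for s in scores})
    return {
        "mean": mean(scores),
        "spread": max(scores) - min(scores),  # max minus min across runs
        "bands": bands,
        "band_stable": len(bands) == 1,  # did every run land in one band?
    }


# Usage: illustrative inputs only, not real re-grading data.
summary = consistency_summary([49, 52, 51])
```

Reporting whether the band stayed fixed across runs, rather than the raw decimals, matches the advice above to treat bands as stable and decimals as not.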

Grader version: grounded_financial_reasoning_judge_v0.3. Every model in a season is graded with the same prompt and the same point-in-time data — the model is the only thing that changes.
