Models by Reasoning Quality

Judge: GPT-5.4 · scored on the full decision horizon

Reasoning and Evidence measure how the model decided; Outcome measures how the decision paid off. Compare process with payoff: a high Reasoning score paired with a low Outcome score indicates a sound process that the market did not reward, which is exactly the gap this benchmark is built to show.

Model           Reasoning  Evidence  Outcome  Composite  Verdict
OpenAI GPT-4.1  47         40        24       40         Mixed Process, Weak Discipline
OpenAI GPT-5    46         57        61       51         Mixed / Improving but Inconsistent
OpenAI GPT-4o   43         49        58       45         Mixed Process, Catalyst-Aware but Undisciplined
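
The comparison of process and payoff described above can be made concrete with a small sketch. The "gap" computed here (Reasoning minus Outcome) is an illustrative assumption, not a metric the benchmark itself reports, and `process_outcome_gap` is a hypothetical helper name. The scores are transcribed from the table; the Composite column is carried along unchanged because its formula is not stated here.

```python
# Scores transcribed from the table: (reasoning, evidence, outcome, composite).
scores = {
    "OpenAI GPT-4.1": (47, 40, 24, 40),
    "OpenAI GPT-5": (46, 57, 61, 51),
    "OpenAI GPT-4o": (43, 49, 58, 45),
}


def process_outcome_gap(reasoning: int, outcome: int) -> int:
    """Hypothetical helper: a positive value means the process scored
    higher than the payoff; a negative value means the reverse."""
    return reasoning - outcome


# Compute the gap per model; Evidence and Composite are ignored in this sketch.
gaps = {
    model: process_outcome_gap(reasoning, outcome)
    for model, (reasoning, _evidence, outcome, _composite) in scores.items()
}

# List models from the largest positive gap to the largest negative gap.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap:+d}")
```

Under this assumed definition, a large positive gap flags a model whose reasoning was judged better than its results, and a negative gap flags results that outran the judged reasoning.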

Scores are 0–100 from the v0.3 grounded financial-reasoning judge.