Models by Reasoning Quality
Judge: GPT-5.4 · scored on the full decision horizon
Reasoning and Evidence measure how the model decided; Outcome measures how the decision paid off. Compare process against payoff: a high Reasoning score with a low Outcome score indicates a sound process that the market didn't reward, which is exactly what this benchmark is built to show.
| Model | Reasoning | Evidence | Outcome | Composite | Verdict |
|---|---|---|---|---|---|
| OpenAI GPT-4.1 | 47 | 40 | 24 | 40 | Mixed Process, Weak Discipline |
| OpenAI GPT-5 | 46 | 57 | 61 | 51 | Mixed / Improving but Inconsistent |
| OpenAI GPT-4o | 43 | 49 | 58 | 45 | Mixed Process, Catalyst-Aware but Undisciplined |
Scores are 0–100 from the v0.3 grounded financial-reasoning judge.
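The process-versus-payoff comparison described above can be made concrete with a little arithmetic. The sketch below is illustrative only: it copies the Reasoning and Outcome columns from the table and subtracts one from the other. The name `process_outcome_gap` and the idea of a single "gap" number are assumptions introduced here, not part of the benchmark's published scoring.

```python
# Scores copied from the table: (reasoning, evidence, outcome, composite).
scores = {
    "OpenAI GPT-4.1": (47, 40, 24, 40),
    "OpenAI GPT-5": (46, 57, 61, 51),
    "OpenAI GPT-4o": (43, 49, 58, 45),
}


def process_outcome_gap(reasoning: int, outcome: int) -> int:
    """Hypothetical helper: positive means the process scored above the payoff."""
    return reasoning - outcome


# Compute the gap for every model in the table.
gaps = {
    model: process_outcome_gap(reasoning, outcome)
    for model, (reasoning, _evidence, outcome, _composite) in scores.items()
}

# List models from the largest positive gap to the most negative one.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap:+d}")
```

On these three rows, GPT-4.1 is the only model whose Reasoning score exceeds its Outcome score (a gap of +23); GPT-5 and GPT-4o both show a gap of -15, meaning their outcomes scored above their reasoning.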