AI Financial Reasoning Evaluations

Models by Reasoning Quality

Judge: GPT-5.4 · scored on the full decision horizon

Reasoning and Evidence measure how the model decided; Outcome measures how that decision paid off. Compare the two: a high Reasoning score with a low Outcome score indicates a sound process that the market did not reward, which is exactly what this benchmark is built to show.

Model	Reasoning	Evidence	Outcome	Composite	Verdict
OpenAI GPT-4.1	47	40	24	40	Mixed Process, Weak Discipline
OpenAI GPT-5	46	57	61	51	Mixed / Improving but Inconsistent
OpenAI GPT-4o	43	49	58	45	Mixed Process, Catalyst-Aware but Undisciplined

Scores are 0–100 from the v0.3 grounded financial-reasoning judge.
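
The process-versus-payoff comparison described above can be made explicit by subtracting Outcome from Reasoning for each model. The following is a minimal sketch, not part of the benchmark's own scoring: the `process_outcome_gap` helper and the `SCORES` layout are hypothetical names introduced here for illustration, and the numbers are copied from the table.

```python
# Scores copied from the table: (reasoning, evidence, outcome).
SCORES = {
    "OpenAI GPT-4.1": (47, 40, 24),
    "OpenAI GPT-5": (46, 57, 61),
    "OpenAI GPT-4o": (43, 49, 58),
}


def process_outcome_gap(reasoning: int, outcome: int) -> int:
    """Hypothetical helper: Reasoning minus Outcome.

    A positive value means the process scored higher than the payoff;
    a negative value means the payoff scored higher than the process.
    """
    return reasoning - outcome


# Evidence is ignored here because the comparison is Reasoning vs. Outcome.
gaps = {
    model: process_outcome_gap(reasoning, outcome)
    for model, (reasoning, _evidence, outcome) in SCORES.items()
}

# Print models from the largest positive gap to the most negative one.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap:+d}")
```

On these figures, GPT-4.1 is the only model whose Reasoning score exceeds its Outcome score (a gap of +23), while GPT-5 and GPT-4o both have an Outcome score 15 points above their Reasoning score.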