How models are graded: the judge panel and methodology

What the Total Score is

The headline number on the leaderboard and model pages — the Total Score — is a 0-to-100 composite. Its backbone is the reasoning median: three independent judge models each grade a model's full day-by-day decision record against point-in-time market data, and we take the median. That grade is itself a decision-quality score — 90% reasoning, evidence, and process, 10% outcome — a deliberate weighting so a well-written but reckless process can't hide behind its narrative, and a lucky result can't carry a weak one.

The Total Score then blends that reasoning median (90%) with a reasoning-efficiency score (10%). Efficiency is the reasoning quality a model reaches per second it spends deciding. Two models with the same reasoning score are not equal: the one that got there in less time ranks higher. Efficiency is shown as its own column, and outcome and raw trading Return get their own columns too, so you can weigh them yourself. Models with no timing captured yet fall back to the reasoning median alone.
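The blend described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the function name `total_score` and the convention of passing `None` when no timing exists are assumptions. Only the 90/10 weights and the fallback rule come from the text.

```python
from typing import Optional

REASONING_WEIGHT = 0.90   # share given to the three-judge reasoning median
EFFICIENCY_WEIGHT = 0.10  # share given to the reasoning-efficiency score


def total_score(reasoning_median: float, efficiency: Optional[float]) -> float:
    """Blend two 0-100 scores; fall back to reasoning alone without timing."""
    if efficiency is None:  # hypothetical marker for "no timing captured yet"
        return reasoning_median
    return REASONING_WEIGHT * reasoning_median + EFFICIENCY_WEIGHT * efficiency
```

With equal reasoning medians, the model with the higher efficiency score ends up with the higher Total Score, which is the tie-breaking behaviour the paragraph describes.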

Read bands, not decimals. The difference between 49 and 52 is noise; the difference between 45 and 70 is signal.

The judge panel

Three judges grade every model, each from a different frontier provider:

OpenAI GPT-5
Anthropic Claude Sonnet 4.6
xAI Grok 4.3

The headline is the median of the three, chosen over the mean because the median is robust to a single harsh or lenient judge. One outlier cannot pull the score past the other two judges; two agreeing judges set it. Beyond that:

Each judge grades the whole horizon of a model's decisions — the sequence of daily trades with the rationale and context it gave at the time — not a single trade.
Temporal consistency is an explicit criterion. Because each judge sees the full horizon, it scores thesis continuity over time: it rewards decisions that stay aligned with the model's prior reasoning (or that update deliberately when the evidence changed) and penalizes contradiction and thrashing — buying and re-selling the same name on opposite rationales. It is weighted into the reasoning-quality axis and the composite (see the rubric).
Judges see point-in-time market and fundamental data, and are instructed not to use the later snapshot to grade earlier decisions (a logged lookahead_check).
Scoring is pointwise: each model is graded on its own merits, never ranked head-to-head against another model in the same prompt.
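To make the median-over-mean choice concrete, here is a small sketch with made-up judge scores (the numbers are hypothetical, not taken from any graded model). It compares how far one harsh judge moves each statistic.

```python
from statistics import mean, median

# Hypothetical composites from three judges for one model.
agreeing = [78, 81, 80]
one_harsh = [78, 81, 20]  # same panel, but the third judge is a harsh outlier

median_shift = median(agreeing) - median(one_harsh)
mean_shift = mean(agreeing) - mean(one_harsh)

print(median(agreeing), median(one_harsh))                  # 80 78
print(round(mean(agreeing), 1), round(mean(one_harsh), 1))  # 79.7 59.7
```

The median moves by 2 points and stays between the two remaining judges, while the mean drops by 20.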

Anonymized, and independent by design

The decision record is anonymized before it reaches any judge: the model's name and provider are stripped, and any identifying tokens are scrubbed. A judge cannot tell whether it is grading its own family, a rival, or an unrelated model.
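As a rough illustration of what such a scrub could look like, the sketch below drops the name and provider fields and redacts identifying tokens from the free-text rationale. Everything here is an assumption made for the example: the record layout (`model`, `provider`, `decisions`), the token list, and the `[REDACTED]` placeholder are hypothetical, and the real pipeline may work differently.

```python
import re

# Hypothetical token list; a real scrubber would need a curated, maintained list.
IDENTIFYING_TOKENS = ["OpenAI", "GPT-5", "Anthropic", "Claude", "Sonnet", "xAI", "Grok"]

# Longest tokens first so multi-character names win over shorter overlaps.
_PATTERN = re.compile(
    "|".join(re.escape(t) for t in sorted(IDENTIFYING_TOKENS, key=len, reverse=True)),
    re.IGNORECASE,
)


def anonymize(record: dict) -> dict:
    """Return a copy without name/provider fields and with tokens redacted."""
    scrubbed = {k: v for k, v in record.items() if k not in ("model", "provider")}
    scrubbed["decisions"] = [
        _PATTERN.sub("[REDACTED]", text) for text in record.get("decisions", [])
    ]
    return scrubbed


example = anonymize({
    "model": "Claude Sonnet 4.6",
    "provider": "Anthropic",
    "decisions": ["As Claude, I am buying NVDA on earnings momentum."],
})
```

A plain token match like this can also redact legitimate content (for instance a provider name that is also a traded company), so a production scrubber would need more care than this sketch shows.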

The panel is independent in the ways that matter, and we measure the one that's hard:

Independent from the traders: judges don't trade and aren't the competitors — they only see the decision record and the market data.
Independent across providers: three labs judge every model, so no single provider defines the score.
Blind to identity: because the record is anonymized, self-preference has no signal to act on — and the audit below checks that directly.

Self-preference audit

The strongest thing a panel unlocks is a measurement of the bias it's accused of. For each judge we compare how it scores its own provider family versus everyone else, relative to the panel consensus. If a judge favoured its own, its gap would be strongly positive. With anonymized input, it should sit near zero.

Judge	Own family vs consensus	Others vs consensus	Self-preference gap
OpenAI GPT-5	+2	+4.3	-2.3
Anthropic Claude Sonnet 4.6	-17	-8.2	-8.8
xAI Grok 4.3	+19	+1.8	+17.2

A gap near zero is the expected result of anonymization working. The gaps above range from -8.8 to +17.2; two of the three are negative, so the table shows no consistent own-family inflation, and the largest gap (+17.2) rests on a single model. Honest caveat: each judge has only one own-family model among the contestants, so each gap is a single observation: directional, not yet a large-sample statistic. It's recomputed on every run and firms up as more models and seasons are graded.
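The gap column is the difference between the two columns before it. The sketch below recomputes it from the table's own figures; the function and variable names are illustrative, not part of the benchmark.

```python
def self_preference_gap(own_vs_consensus: float, others_vs_consensus: float) -> float:
    """Own-family deviation minus other-family deviation, rounded to one decimal."""
    return round(own_vs_consensus - others_vs_consensus, 1)


# (own family vs consensus, others vs consensus), copied from the table above.
audit_rows = {
    "OpenAI GPT-5": (2.0, 4.3),
    "Anthropic Claude Sonnet 4.6": (-17.0, -8.2),
    "xAI Grok 4.3": (19.0, 1.8),
}

gaps = {judge: self_preference_gap(own, others) for judge, (own, others) in audit_rows.items()}
```

A positive gap means the judge scored its own family higher, relative to consensus, than it scored everyone else.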

How much the judges agree

The three judges rarely score in lockstep, and we show the spread rather than hide it. Tight agreement means a robust score; a wide spread means the model is genuinely contested.

Each row is a contestant being graded — not a judge. The columns show the lowest, median, and highest of the three judges’ composite scores for that model.

Model	Low	Median	High
OpenAI GPT-5	57	81	83
Anthropic Claude Sonnet 4.6	59	76	83
xAI Grok 4.3	59	63	82
Google Gemini 3.5 Flash	63	68	73
Google Gemini 3.1 Pro	58	63	80
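One simple way to read the table is the high-minus-low spread per model. The sketch below computes it from the rows above; using the range as the measure of disagreement is an assumption for illustration, since the text does not name a specific spread statistic.

```python
# (model, low, median, high), copied from the table above.
agreement_rows = [
    ("OpenAI GPT-5", 57, 81, 83),
    ("Anthropic Claude Sonnet 4.6", 59, 76, 83),
    ("xAI Grok 4.3", 59, 63, 82),
    ("Google Gemini 3.5 Flash", 63, 68, 73),
    ("Google Gemini 3.1 Pro", 58, 63, 80),
]

# Range of the three judges' composites: small means tight agreement.
spreads = {model: high - low for model, low, _median, high in agreement_rows}

tightest = min(spreads, key=spreads.get)
widest = max(spreads, key=spreads.get)
```

On these figures Google Gemini 3.5 Flash has the tightest panel (10 points) and OpenAI GPT-5 the widest (26 points).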

The rubric

Each judge scores four axes 0–100. Reasoning quality and evidence grounding are each a weighted sum of the sub-components below; outcome grades how the decisions played out; data reliability grades the internal consistency of the record.

Reasoning quality (weights sum to 100)

Sub-component	Weight
Action–rationale alignment	10
Thesis quality	12
Strategy fit	10
Risk awareness	12
Portfolio discipline	10
Temporal consistency	10
Decision-update quality	10
Uncertainty discipline	8
Outcome-independent reasoning quality	8
Clarity and specificity	10

Evidence grounding (weights sum to 100)

Sub-component	Weight
Claim support from market data	25
Correct use of valuation metrics	15
Correct use of momentum / technical indicators	15
Correct use of analyst data	10
Correct use of sector / industry context	10
Avoidance of unsupported claims	15
No contradiction with available data	10
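The two weight tables can be turned into a weighted sum as follows. This is a sketch under two assumptions not stated in the rubric: that each sub-component is itself scored 0 to 100, and that the weights act as percentages. The helper name `weighted_axis` is hypothetical.

```python
REASONING_WEIGHTS = {
    "Action-rationale alignment": 10, "Thesis quality": 12, "Strategy fit": 10,
    "Risk awareness": 12, "Portfolio discipline": 10, "Temporal consistency": 10,
    "Decision-update quality": 10, "Uncertainty discipline": 8,
    "Outcome-independent reasoning quality": 8, "Clarity and specificity": 10,
}

EVIDENCE_WEIGHTS = {
    "Claim support from market data": 25, "Correct use of valuation metrics": 15,
    "Correct use of momentum / technical indicators": 15,
    "Correct use of analyst data": 10, "Correct use of sector / industry context": 10,
    "Avoidance of unsupported claims": 15, "No contradiction with available data": 10,
}


def weighted_axis(sub_scores: dict, weights: dict) -> float:
    """Weighted average of 0-100 sub-scores, normalised by the total weight."""
    total_weight = sum(weights.values())
    return sum(sub_scores[name] * w for name, w in weights.items()) / total_weight


# Usage: a model scoring 70 on every sub-component gets 70 on the axis.
flat_reasoning = weighted_axis({name: 70 for name in REASONING_WEIGHTS}, REASONING_WEIGHTS)
```

Both tables sum to exactly 100, so the normalisation step changes nothing here; it only guards against a future weight change.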

Total Score (each judge's composite)

Component	Contribution
Reasoning quality	40%
Evidence grounding	35%
Temporal / portfolio process quality	15%
Outcome	10%

Each judge produces its own composite this way, and the median of the three is the reasoning median that anchors the published Total Score (90%, with efficiency supplying the other 10%). When a model has no resolvable outcome yet, that 10% outcome weight is redistributed across reasoning and evidence rather than assumed.
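A per-judge composite with the no-outcome fallback might look like the sketch below. The weights come from the table above. The redistribution rule is an assumption: the text says the 10% goes to reasoning and evidence but not how it is split, so this example splits it in proportion to their existing weights (40:35). The function name is hypothetical.

```python
from typing import Optional

WEIGHTS = {"reasoning": 0.40, "evidence": 0.35, "process": 0.15, "outcome": 0.10}


def judge_composite(reasoning: float, evidence: float, process: float,
                    outcome: Optional[float]) -> float:
    """One judge's 0-100 composite; outcome=None means no resolvable outcome."""
    weights = dict(WEIGHTS)
    scores = {"reasoning": reasoning, "evidence": evidence, "process": process}
    if outcome is None:
        freed = weights.pop("outcome")
        base = WEIGHTS["reasoning"] + WEIGHTS["evidence"]
        # ASSUMPTION: proportional split of the freed outcome weight.
        weights["reasoning"] += freed * WEIGHTS["reasoning"] / base
        weights["evidence"] += freed * WEIGHTS["evidence"] / base
    else:
        scores["outcome"] = outcome
    return sum(weights[name] * scores[name] for name in weights)
```

Either way the weights still sum to 1, so a model scoring the same on every axis gets that same composite with or without an outcome.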

How to read the scores

Every 0–100 axis — and the dimension breakdown on each model page — reads on the same three bands:

Band	Range	What it means
Strong	67–100	Consistent, well-grounded decision-making across the horizon.
Mixed	45–66	Real strengths offset by gaps — drift, thin risk discipline, or unsupported claims.
Weak	0–44	Reasoning frequently unsupported, contradictory, or incoherent over time.
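The band table maps directly onto a small lookup. In this sketch the thresholds 67 and 45 come from the table; treating fractional scores such as 66.5 as belonging to the lower band is an assumption, since the table only lists whole numbers.

```python
def band(score: float) -> str:
    """Map a 0-100 axis score to its band label."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 67:
        return "Strong"
    if score >= 45:
        return "Mixed"
    return "Weak"
```

For example, 49 and 52 both land in Mixed, while 45 and 70 land in different bands.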

Where grounding scores come from

Evidence grounding isn't a vibe. Each judge extracts the model's factual claims into a claim ledger and checks each against named market-data fields. A claim like "strong profitability and growth with analyst upside" is marked supported only if fields such as returnOnEquityTTM, quarterlyEarningsGrowthYOY, and analystTargetPrice back it. A claim with no relevant field is not verifiable; one the data contradicts is contradicted and penalised. Every model page shows its full ledger, so grounding is auditable rather than asserted.
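A single ledger entry could be checked along these lines. This is an illustrative sketch only: the `Claim` structure, the status strings, the numeric thresholds, and the `price` field are hypothetical. The three field names `returnOnEquityTTM`, `quarterlyEarningsGrowthYOY`, and `analystTargetPrice` are the ones named in the text.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple


@dataclass
class Claim:
    text: str
    fields: Tuple[str, ...]                         # data fields the claim relies on
    holds: Callable[[Mapping[str, float]], bool]    # does the data back the claim?


def grade_claim(claim: Claim, snapshot: Mapping[str, float]) -> str:
    """Return 'supported', 'contradicted', or 'not_verifiable'."""
    if not claim.fields or any(f not in snapshot for f in claim.fields):
        return "not_verifiable"
    return "supported" if claim.holds(snapshot) else "contradicted"


claim = Claim(
    text="strong profitability and growth with analyst upside",
    fields=("returnOnEquityTTM", "quarterlyEarningsGrowthYOY", "analystTargetPrice", "price"),
    # Hypothetical thresholds, chosen only to make the example concrete.
    holds=lambda d: d["returnOnEquityTTM"] > 0.15
    and d["quarterlyEarningsGrowthYOY"] > 0
    and d["analystTargetPrice"] > d["price"],
)

good = {"returnOnEquityTTM": 0.25, "quarterlyEarningsGrowthYOY": 0.12,
        "analystTargetPrice": 120.0, "price": 100.0}
bad = dict(good, quarterlyEarningsGrowthYOY=-0.30)
```

Here `grade_claim(claim, good)` is supported, `grade_claim(claim, bad)` is contradicted, and an empty snapshot leaves the claim not verifiable.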

Limitations and what's next

One own-family observation per judge. Each judge has only one own-family model among the contestants, so the self-preference audit rests on a single observation per judge: directional, not a large sample. Adding models and seasons tightens it.
No human-calibration set yet. The panel anchors the bands against each other; a small set of hand-graded reference decisions to anchor the absolute scale is planned.
LLM outputs aren't perfectly deterministic. Scores can shift a few points on re-grading — another reason to read bands, not decimals. The panel median dampens this.
The composite blends outcome at 10% by design. If you want a pure-process view, read the Reasoning and Evidence axes directly.
Efficiency uses end-to-end decision time (10% of the Total Score) — wall-clock seconds per decision, so it includes network and API overhead, not just token generation. We take the median across a model's decisions so an occasional slow or retried call doesn't distort it, and a fresh model with no timing yet is scored on reasoning alone.
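The timing side of efficiency can be sketched as a median over per-decision wall-clock times. The helper names are hypothetical, and the sketch stops at a raw quality-per-second ratio: how that ratio is scaled into the 0 to 100 efficiency score is not described in the text, so it is not attempted here.

```python
from statistics import median
from typing import Optional, Sequence


def median_decision_seconds(seconds: Sequence[float]) -> Optional[float]:
    """Median wall-clock seconds per decision; None if no timing was captured."""
    return median(seconds) if seconds else None


def quality_per_second(reasoning_median: float,
                       seconds: Sequence[float]) -> Optional[float]:
    """Raw reasoning quality reached per second of deciding (unscaled)."""
    typical = median_decision_seconds(seconds)
    if typical is None or typical <= 0:
        return None  # caller falls back to the reasoning median alone
    return reasoning_median / typical


# One retried 60-second call barely matters: the median stays at 5 seconds.
timings = [4.0, 5.0, 60.0]
ratio = quality_per_second(80.0, timings)
```

Using the median rather than the mean is what keeps the single slow call from dominating the figure.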

Grader version: grounded_financial_reasoning_judge_v0.4 · scoring method: anonymized three-judge median. Every model in a season is graded with the same rubric and the same point-in-time data.
