
The rubric locks first: grading criteria you write after reading the answer are not criteria

A rubric that can still change after seeing model output is not testing the model; it is describing it.

Every task we ship can produce two dates on request: the day its rubric froze, and the day a candidate model first saw the task. For all 69 tasks in our reference case, the second date never comes before the first. That ordering is the entire argument of this piece, and we can document it for any task you’re holding, because the freeze gets a timestamp and a content hash — task_sha — and nothing after that point is allowed to move.

Compare that to the quiet norm in a lot of eval construction: someone drafts a rubric, runs a few candidate models against it, reads the answers that come back, and “refines” the criteria — tightens a phrase, adds a line, drops whatever made a favored model look bad. Nobody involved would call it cheating. Most would call it iterating. It is unfalsifiable by construction regardless of what anyone calls it: a rubric that can still change after it has seen model output is not testing the model, it is describing the model, dressed up as an independent standard.

The order of operations is the whole difference between a rubric and a mirror — so here is the order, dated.

The timeline

Day 0 — facts, verbatim. Before a single task is drafted, the source documents are decomposed into atomic facts — each an {id, claim, source_quote} triple, the quote required to appear character-for-character in the source. Nothing enters the ledger that cannot be quoted, and no task will later be built on a claim that isn’t in it.
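The Day 0 rule can be sketched in a few lines. This is a minimal illustration, not our production code: the `Fact` class, the `admit_facts` helper, and the sample source text are hypothetical; only the three field names and the character-for-character requirement come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    id: str
    claim: str
    source_quote: str


def admit_facts(source_text: str, candidates: list[Fact]) -> list[Fact]:
    """Keep only facts whose quote appears character-for-character in the source."""
    return [f for f in candidates if f.source_quote and f.source_quote in source_text]


# Hypothetical source document and candidate facts, for illustration only.
source = "The valve closes at 40 psi. Inspection is required every 90 days."
candidates = [
    Fact("F1", "The valve closes at 40 psi.", "The valve closes at 40 psi."),
    # Paraphrased quote: not present verbatim in the source, so it is rejected.
    Fact("F2", "Inspection is quarterly.", "Inspection is required every three months."),
]

ledger = admit_facts(source, candidates)
print([f.id for f in ledger])
```

An exact substring test is deliberately strict: a paraphrase that happens to be true still fails, because the ledger records what can be quoted, not what can be inferred.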

Quote to fact, fact to criterion, criterion to grader line: the quotes and facts are fixed on Day 0, and every later link in that chain is fixed before any model touches the question.

Days that follow — composed backward, authored adversarially. Tasks are composed backward from the ledger, never forward from a sense of what a good question looks like; every rubric criterion declares which fact it binds_to, and one with nothing to bind to does not survive. Two independent model families then author candidate tasks against the same ledger and adversarially cross-review each other’s output: one family’s work is judged by the other, and back again. Candidates must clear a scored quality bar, and judges from both families have to agree it is cleared. Where two candidate rubrics both pass but the judges disagree on which is stronger, a quality score breaks the tie. No family grades only its own work.
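The binding rule lends itself to a mechanical check. The sketch below assumes criteria are plain dictionaries with a `binds_to` key holding a single fact id; that shape, the helper name `surviving_criteria`, and the sample criteria are assumptions made for illustration.

```python
def surviving_criteria(criteria: list[dict], ledger_ids: set[str]) -> list[dict]:
    """Drop any criterion whose binds_to is missing or names a fact outside the ledger."""
    return [c for c in criteria if c.get("binds_to") in ledger_ids]


# Hypothetical ledger ids and criteria.
ledger_ids = {"F1", "F2"}
criteria = [
    {"id": "C1", "text": "States the closing pressure.", "binds_to": "F1"},
    {"id": "C2", "text": "Answer is well organized.", "binds_to": None},  # binds to nothing
    {"id": "C3", "text": "Mentions the warranty.", "binds_to": "F9"},     # fact not in ledger
]

kept = surviving_criteria(criteria, ledger_ids)
print([c["id"] for c in kept])
```

A criterion such as C2 may well describe a good answer, but with no fact behind it there is nothing to check it against, so it is dropped before the lock.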

Day N — the lock. The rubric freezes: content-hashed to a task_sha, tiered so critical criteria gate a continuous graded band anchored on proficiency levels, and closed to further edits. This is the date we can always produce on request.
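One way to picture the lock, as a sketch under stated assumptions: the text above says only that the rubric is content-hashed and that critical criteria gate a graded band. SHA-256, the canonical JSON encoding, the zero score on a failed gate, and the mean over the graded band are all choices made here for illustration, and the function `score` is hypothetical.

```python
import hashlib
import json


def task_sha(task: dict) -> str:
    """Hash a canonical JSON encoding so key order and whitespace cannot change the result."""
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def score(critical_passed: list[bool], graded: list[float]) -> float:
    """Critical criteria gate the result: the graded band counts only if all of them pass."""
    if not all(critical_passed):
        return 0.0  # assumption: a failed gate scores zero
    return sum(graded) / len(graded) if graded else 0.0


# Hypothetical task content.
task = {"prompt": "At what pressure does the valve close?",
        "criteria": [{"id": "C1", "binds_to": "F1", "tier": "critical"}]}

locked = task_sha(task)
edited = task_sha({**task, "prompt": task["prompt"] + " Explain."})
print(locked != edited)            # any edit after the lock yields a different hash
print(score([True, True], [0.5, 1.0]))
print(score([True, False], [1.0, 1.0]))
```

The point of hashing the content rather than storing a "locked" flag is that anyone holding the task can recompute the hash and compare it with the recorded one.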

First model contact — after the hash exists, never before. The candidate model that will eventually answer this task does not exist yet, or has not been shown it, at the moment the grading standard is fixed. There is no version of this process where a model’s answer arrives before the hash does.
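The ordering itself reduces to a single comparison. In this sketch the function name `contact_allowed` and the dates are hypothetical; the rule it encodes is the one stated above, that no contact is valid unless a freeze timestamp already exists and is not later than the contact.

```python
from datetime import datetime, timezone
from typing import Optional


def contact_allowed(frozen_at: Optional[datetime], first_contact: datetime) -> bool:
    """A model may see a task only if the freeze exists and does not postdate the contact."""
    return frozen_at is not None and frozen_at <= first_contact


# Illustrative dates only.
frozen = datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc)

print(contact_allowed(frozen, datetime(2025, 3, 2, tzinfo=timezone.utc)))   # after the lock
print(contact_allowed(frozen, datetime(2025, 2, 28, tzinfo=timezone.utc)))  # before the lock
print(contact_allowed(None, datetime(2025, 3, 2, tzinfo=timezone.utc)))     # never locked
```

Using timezone-aware timestamps avoids the case where two dates appear correctly ordered only because they were recorded in different local times.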

What the lock does not prove

A lock like this proves the criteria weren’t shaped by seeing outputs. It does not prove they are the right criteria, or that a judge applying them agrees with a domain expert — a frozen rubric can still be a bad rubric, since freezing early only rules out the one failure mode where the standard bends to fit the answer. That is why the lock sits next to the probe work that checks whether a task can be gamed regardless of how well the rubric reads, and next to calibration work checking whether the rubric’s judgments agree with expert judgment, using the same inter-rater tools — Cohen’s κ, Krippendorff’s α — built to validate human graders against each other. Nor does the lock mean authoring is free of judgment: two model families and practitioner supervision still decide which facts are worth a task and how a criterion should be worded. What the lock removes is narrower — the chance to adjust that judgment retroactively, after an answer is already on the table.
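For readers who have not used it, Cohen’s κ compares observed agreement between two raters with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for a rubric judge and a domain expert; the labels are invented for illustration and are not calibration results.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)  # chance agreement
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (p_o - p_e) / (1 - p_e)


# Invented verdicts on eight answers.
judge = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
expert = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass"]

kappa = cohens_kappa(judge, expert)
print(kappa)  # 0.75: observed agreement 7/8, chance agreement 0.5
```

Raw agreement here is 87.5%, but half of that would be expected by chance given how often each rater says "pass", which is why κ is the more honest number to report.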

Ask for the two dates

Every task we ship, including every task in Hindsight, carries this same timeline, and the task_sha it produces is what the probe reports and calibration numbers in evidence/ are keyed to. So ask any benchmark vendor for their two dates: when did the rubric freeze, and relative to what? The question is not whether a rubric exists but whether it existed, unchanged, before the first candidate model ever answered the task. A clean date and a content hash are a falsifiable answer. “We iterate on the rubric as we see results” is not: it is “the rubric never froze” wearing different words, and it describes a standard that will always look like it agrees with whichever model wrote the best-looking answer first.

Two dates and a content hash, for any task you hold. That is the standard we run — and the one we think you should demand from anyone selling you an eval.
