The noise floor of judgment: what 564 runs and a 69-task eval taught us about our own scoring math
A 69-task eval came back 68% zeros with a tail to 0.7: two populations, not one. The model was not the culprit; the scoring math was manufacturing the pile. Below: what we blamed first, what it actually was, and the dated one-line change we shipped in response.
Before we asked anyone to trust a grader we built, we ran one synthetic deal through it 564 times, isolated, watching what varied when nothing about the task changed. Then, in May 2026, a 69-task eval against a frontier open-weights model came back with a reward distribution that had no business looking the way it did: 68% zeros, and a long tail stretching to 0.7.
We do not assume our own reward function behaves. We measure it — and when the measurement says the scoring math is wrong, we change the math.
This is the record of the latest time that happened: what we saw, what we wrongly blamed first, what the problem actually was, what we changed, and what we now publish so it can be checked.
—What we saw
Two populations, not one distribution with noise around a mean. 68% of runs at zero. The remaining runs tailed out to 0.7. A grader working as intended, scoring answers of genuinely different quality, produces a spread — it does not produce two clusters with a gap between them.
—What we blamed
The easy read was the model: maybe it was failing outright on most of the 69 tasks and only occasionally doing real work. That story fits a pile of zeros. It does not fit zeros sitting immediately next to a tail of real, differentiated scores with almost nothing in between, on a model capable enough to be worth evaluating in the first place.
—What it actually was
Our own binary critical-criteria gate. Any single failed critical criterion zeroed the entire reward, no matter how strong the rest of the answer was — which meant a borderline judgment call on one critical line, decided the wrong way by ordinary grader noise, didn’t cost a fraction of the score. It cost all of it. An answer good enough to deserve real partial credit could still fail one critical line, sometimes correctly and sometimes on a call that could have gone either way, and lose everything instead of losing its proportionate share. That is exactly the shape of failure that produces a pile of zeros next to a tail of real scores: one kind of answer, landing on both sides of a gate that had no business being a gate.
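A minimal sketch of that cliff, under stated assumptions: the function name `binary_gated_reward`, the base score of 0.7, and the four critical criteria are all hypothetical and chosen for illustration, not taken from the grader itself. The same answer is scored twice, and only the borderline fourth critical call differs.

```python
def binary_gated_reward(base_score: float, critical_passed: list[bool]) -> float:
    """Hypothetical binary gate: any failed critical criterion zeroes the reward."""
    return base_score if all(critical_passed) else 0.0


# Same answer, same underlying quality (assumed base score of 0.7).
# Only the borderline fourth critical criterion is judged differently.
base = 0.7
lenient = binary_gated_reward(base, [True, True, True, True])
strict = binary_gated_reward(base, [True, True, True, False])

print(lenient, strict)
```

Under these assumptions one flipped call moves the reward from the top of the tail to the pile of zeros, with nothing in between.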
—What we changed — 2026-05-17
The single pass/fail switch became a continuous multiplier: critical criteria passed, divided by critical criteria total, reflecting how many an answer actually satisfied instead of collapsing that count to a single bit. Clear most of them now, and the score reflects most of them — not none of them. The revision is dated 2026-05-17 and traceable directly to this result, not a general tuning pass dressed up after the fact.
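The revised gate can be sketched as follows. This is an illustration, not the shipped code: the function name, the base score, and the choice to apply no gate when a task has zero critical criteria are assumptions. Only the ratio itself (critical criteria passed over critical criteria total) comes from the description above.

```python
def continuous_gated_reward(base_score: float, critical_passed: list[bool]) -> float:
    """Hypothetical continuous gate: scale the reward by the share of critical criteria passed."""
    if not critical_passed:
        # Assumption: a task with no critical criteria is not gated at all.
        return base_score
    multiplier = sum(critical_passed) / len(critical_passed)
    return base_score * multiplier


base = 0.5  # assumed base score, for illustration only
all_clear = continuous_gated_reward(base, [True, True, True, True])
one_miss = continuous_gated_reward(base, [True, True, True, False])
none_clear = continuous_gated_reward(base, [False, False, False, False])

print(all_clear, one_miss, none_clear)
```

With three of four critical criteria passed the multiplier is 0.75, so the answer keeps three quarters of its base score instead of losing all of it.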
We are not going to describe the post-revision distribution with numbers we have not run yet. We found the pattern, found the mechanism, and changed the mechanism; the next test-retest pass on the revised gate is what tells us whether it worked. The post-revision distribution and σ, on the same 69-task case, ship when that pass completes.
This is not an argument that LLM judges are unusable — it is an argument that unpublished variance is the actual risk. A binary gate sitting on top of an inherently graded judgment converts ordinary, boundable noise into a cliff nobody bounded; a continuous gate does not make the underlying noise disappear, it stops amplifying it past what the judgment call deserves. Nor is this a fix we make once and file away — it is one dated entry in a log we intend to keep adding to.
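The amplification can be put in rough numbers under a deliberately simple model, which is an assumption of this sketch and not a measured result: one borderline critical criterion fails with probability `p_fail` on each re-score, the other critical criteria always pass, and the base score is fixed. The reward is then a two-outcome random variable, and its standard deviation follows from the Bernoulli variance p(1-p).

```python
import math


def sigma_binary_gate(base_score: float, p_fail: float) -> float:
    """Reward sigma when one borderline failure zeroes the whole score."""
    # Reward is base_score with probability 1 - p_fail, else 0.
    return base_score * math.sqrt(p_fail * (1.0 - p_fail))


def sigma_continuous_gate(base_score: float, p_fail: float, n_critical: int) -> float:
    """Reward sigma when one borderline failure costs 1/n_critical of the score."""
    # Reward is base_score, or base_score * (n_critical - 1) / n_critical.
    return base_score * math.sqrt(p_fail * (1.0 - p_fail)) / n_critical


# Illustrative values only: a coin-flip call on one of four critical criteria.
binary_sigma = sigma_binary_gate(0.7, 0.5)
continuous_sigma = sigma_continuous_gate(0.7, 0.5, 4)
print(binary_sigma, continuous_sigma)
```

In this toy model the judgment noise (`p_fail`) is identical in both cases; the continuous gate only divides its effect on the reward by the number of critical criteria.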
—What we now publish
The 564-run analysis was the first version of a question we now ask permanently: show the grader the same thing twice, does it say the same thing twice? We re-score identical (task, answer) pairs k=5 times each and look at what moves. Per-task reward σ shows how much a single task’s score wobbles across five otherwise-identical passes. A corpus-level self-agreement figure, computed with the same ICC(2,k) statistic normally used to check whether independent human raters agree, shows whether the grader agrees with itself across the whole corpus. Because five passes come back, not one, we ship the median of the five as the production reward rather than whichever score happened to land first.
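A minimal sketch of that test-retest design, using only the Python standard library. The score matrix below is made-up illustrative data, the function names are hypothetical, and the use of the sample standard deviation for σ is an assumption. The ICC(2,k) computation follows the standard two-way random-effects, average-measures formula, (MSR - MSE) / (MSR + (MSC - MSE) / n), with tasks as rows and re-scoring passes as columns.

```python
import statistics


def per_task_sigma(scores: list[list[float]]) -> list[float]:
    """Sample standard deviation of each task's k re-scores (assumed definition of sigma)."""
    return [statistics.stdev(row) for row in scores]


def production_rewards(scores: list[list[float]]) -> list[float]:
    """Median of the k re-scores for each task."""
    return [statistics.median(row) for row in scores]


def icc_2k(scores: list[list[float]]) -> float:
    """ICC(2,k): two-way random effects, absolute agreement, average of k passes."""
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 tasks and 2 passes")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                # between-task mean square
    msc = ss_cols / (k - 1)                # between-pass mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (msc - mse) / n)


# Made-up data: 3 tasks, each re-scored k=5 times on the identical answer.
scores = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.6, 0.5, 0.4, 0.5],
    [0.7, 0.0, 0.7, 0.7, 0.7],  # one pass falls off a cliff
]
sigmas = per_task_sigma(scores)
rewards = production_rewards(scores)
agreement = icc_2k(scores)
print(sigmas, rewards, agreement)
```

In the third row a single outlying pass drags the mean down, while the median stays at 0.7, which is the reason for shipping the median of the five instead of the first score to land.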
An RL signal is only as good as its noise floor. A reward that swings across identical re-scores of the same answer is not a precise signal with a bit of jitter on it; it is gradient noise your training run absorbs without anyone having measured how much. We are not aware of an environment vendor who publishes this number. Ours ships with the values from the next test-retest pass, not before. The design behind it is described here so it can be run against any grader, including ours. It runs on the same reference case behind every other number we publish (69 graded tasks, content-hashed, reward in [0,1]) and lands in evidence/ alongside the probe reports and fork deltas from our other work: the same noise floor that sits underneath every judgment-gap score.
Ask any environment vendor for their grader’s noise floor. If the answer is a description instead of a number and a design, you have learned something. Ours ships with the artifact — the design is published today; the numbers land with the next test-retest pass.