Your reward function is lying to you: probes for graders that fail silently
Before any graded task reaches a candidate model, we run three attacks against its grader, all hunting the same thing: reward without reasoning. A task that survives ships. A task that doesn't gets rewritten — and we keep the log, including the attacks that won.
Before any of the 69 graded tasks in our reference case reaches a candidate model, we try to beat it. Three attacks, run against every task, all looking for the same thing: a way to earn a strong reward without doing the reasoning the task claims to measure. A task that survives all three ships. A task that doesn’t gets rewritten or retired — and we keep the log.
We run this because a gamable grader does not fail loudly under reinforcement-learning pressure. It does not error out, and it does not plateau at zero where someone would notice. It quietly trains the wrong policy — rewarding whatever it can be tricked into liking, rather than the reasoning the task was built to elicit. This is the mechanism Skalse et al. formalized as reward hacking [1], and left unattacked, it is close to the default outcome for any grader nobody has tried to defeat.
We assume every grader we ship is broken until an attack log says otherwise — and the log ships with every deliverable, including the attempts that succeeded.
—Attempt 1: take the documents away
What we tried: strip the entire deal room from a task — every source document, every exhibit, every timeline entry — and re-score candidate answers with the production rubric, the same one a live run is graded against.
What the probe catches: a task that still earns a reward at or above 0.5 with zero documents behind it is handing back its own answer somewhere in the question itself. The reasoning the task claims to test was never required to reach that score.
The rule that now runs per task: every task gets this stem-only ablation before it ships. A reward ≥ 0.5 on zero evidence is an automatic red flag. It does not ship until rewritten or retired.
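A minimal sketch of what a stem-only ablation could look like, not the production harness. The `Task` shape, the `answer_fn` and `grade_fn` callables, and the log-entry fields are all assumptions made for illustration; only the 0.5 threshold comes from the rule above.

```python
from dataclasses import dataclass, replace
from typing import Callable

RED_FLAG_THRESHOLD = 0.5  # from the rule above: reward >= 0.5 on zero evidence


@dataclass(frozen=True)
class Task:
    # Hypothetical task shape; field names are illustrative only.
    task_sha: str
    stem: str
    documents: tuple[str, ...] = ()


def stem_only_ablation(
    task: Task,
    answer_fn: Callable[[Task], str],
    grade_fn: Callable[[Task, str], float],
) -> dict:
    """Answer the task with every document removed, then grade as usual."""
    stripped = replace(task, documents=())  # the candidate sees the stem only
    answer = answer_fn(stripped)
    reward = grade_fn(task, answer)  # same rubric a live run is graded against
    return {
        "task_sha": task.task_sha,
        "attack": "stem_only_ablation",
        "reward": reward,
        "red_flag": reward >= RED_FLAG_THRESHOLD,
    }
```

The grader is deliberately left untouched: the only thing that changes between a live run and the probe is what the candidate is allowed to read.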
—Attempt 2: hand the answer back through arithmetic
What we tried: look for stems that already contain the two numbers a correct answer is supposed to derive, so that “solving” the task is really just arithmetic already laid out in the question.
What the probe catches, schematically — illustrative, not a task drawn from any live corpus: a stem asking “Revenue grew from $40M to $52M over two years. What is the CAGR, and is the underwriting thesis’s assumed 15% growth conservative or aggressive relative to realized growth?” does not require the deal room at all. The two figures needed to compute a roughly 14% CAGR, and to compare it against the stated 15%, are already sitting in the stem. A model can land the right verdict without opening a document.
The rule that now runs per task: a numeric-bridge check, tolerant to ±1%, flags any stem where the arithmetic behind the “correct” answer is already present in the question. A stem like that needs a calculator, not a deal room, and that disqualifies it.
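A hedged sketch of the idea behind a numeric-bridge check, using the illustrative CAGR stem above. It covers one bridge only (CAGR from two stem figures), reads ±1% as a relative tolerance, and takes the period length as a parameter because the stem spells it out as a word. All three are assumptions, and the function names are hypothetical.

```python
import re
from itertools import permutations

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def stem_numbers(stem: str) -> list[float]:
    """Pull every numeric literal out of the stem, ignoring units and symbols."""
    return [float(m) for m in NUMBER.findall(stem)]


def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.14 means 14%)."""
    return (end / start) ** (1.0 / years) - 1.0


def within_tolerance(value: float, target: float, tol: float = 0.01) -> bool:
    """Relative tolerance; 1% by default (an assumed reading of the rule)."""
    return abs(value - target) <= tol * abs(target)


def numeric_bridge_flag(stem: str, expected_value: float, years: float) -> bool:
    """True if some ordered pair of stem numbers already yields the expected value."""
    for start, end in permutations(stem_numbers(stem), 2):
        if start > 0 and end > 0:
            if within_tolerance(cagr(start, end, years), expected_value):
                return True
    return False


leaky_stem = (
    "Revenue grew from $40M to $52M over two years. What is the CAGR, and is "
    "the underwriting thesis's assumed 15% growth conservative or aggressive "
    "relative to realized growth?"
)
print(numeric_bridge_flag(leaky_stem, expected_value=0.14, years=2))
```

For the example stem, the pair (40, 52) gives (52/40)^(1/2) - 1, roughly 0.14, so the check fires. A fuller version would try other bridges as well (differences, ratios, simple growth), which this sketch does not attempt.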
—Attempt 3: pad and stuff
What we tried: two ways a task can leak without any arithmetic at all. One is naming the comparators in a comparative stem, so that the model only has to adjudicate between options handed to it. The other is phrasing a stem so its own vocabulary already overlaps with the expected answer.
What the probe catches: comparative tasks naming two or more comparators directly in the stem let a model skip the harder step of selecting its own comparators. Separately, stems whose wording echoes the expected answer’s vocabulary hand back structure, not just numbers.
The rule that now runs per task: a comparator-density check flags any comparative-mode task naming two or more comparators in its own stem. An answer-structure pass — regex plus unigram overlap — flags stems whose phrasing leaks into the expected answer.
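Both checks can be sketched in a few lines, under stated assumptions. The comparator list is assumed to come from task metadata, the stopword list is a small illustrative one, and the overlap function returns a score rather than a verdict because no cutoff is given above. The function names are hypothetical.

```python
import re

# Small illustrative stopword list; a real pass would use a fuller one.
STOPWORDS = frozenset({
    "the", "a", "an", "of", "to", "is", "and", "or", "in", "for",
    "on", "with", "than", "what", "which",
})


def tokens(text: str) -> list[str]:
    """Lowercased alphanumeric unigrams."""
    return re.findall(r"[a-z0-9]+", text.lower())


def comparator_density_flag(stem: str, comparators: list[str], limit: int = 2) -> bool:
    """True if the stem names `limit` or more of the known comparators."""
    lowered = stem.lower()
    named = [
        c for c in comparators
        if re.search(rf"\b{re.escape(c.lower())}\b", lowered)
    ]
    return len(named) >= limit


def unigram_overlap(stem: str, expected_answer: str) -> float:
    """Share of the answer's distinct content words that also appear in the stem."""
    answer_words = set(tokens(expected_answer)) - STOPWORDS
    if not answer_words:
        return 0.0
    stem_words = set(tokens(stem)) - STOPWORDS
    return len(answer_words & stem_words) / len(answer_words)
```

A stem such as "Compare Acme and Borealis on margin" (hypothetical company names) would trip the density check against a comparator list containing both, while "Which peer has the strongest margin?" would not. The overlap score still needs a cutoff chosen against tasks already known to leak.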
The reference case behind that battery is not a placeholder: 30 sources, 18 exhibits, 3 timelines, and 69 graded tasks, each identified by a content hash (task_sha), with reward compiled to [0,1]. What is pending is the published aggregate: the share of tasks that came back red before remediation, by attack. We would rather ship the table empty than fill it with a number that sounds right.
None of these three attacks proves the grader is unbeatable. They bound three specific, named failure modes and say nothing about modes nobody has attacked yet — a limit Skalse et al. [1], Krakovna et al.’s specification-gaming catalogue [2], and the broader inventory Amodei et al. laid out [3] all point to from the research side. A task can pass all three and still be graded by a rubric a domain expert would disagree with; that is a calibration problem, not a leakage problem, and it needs different evidence. And a task that passes today can fail tomorrow if a later edit reopens a path we already closed — this is a standing check, not a certificate issued once.
—Run this against your vendor
Ask any environment vendor — including us — for the attack log before you train on their tasks. Not a description of the grading philosophy: the actual fraction of tasks that came back red, and what happened to them afterward. A vendor with nothing to show has not looked, and an unattacked grader should be assumed breakable, the same way we assume it about our own.
The log ships inside the evidence/ directory of every artifact we deliver, versioned against the exact task_sha of the tasks you hold — the same discipline behind every task in Hindsight, our benchmark for financial judgment. You get the outputs alongside the tasks and can run the attacks yourself.
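One way to read such a log once you have it, sketched under assumptions: the log is taken to be JSON Lines with `task_sha`, `attack`, and `red_flag` fields, which is an illustrative format and not a description of the files in evidence/. The function reports the red share per attack for the tasks you hold, and which of your tasks the log never mentions.

```python
import json
from collections import defaultdict
from typing import Iterable


def summarize_attack_log(lines: Iterable[str], held_task_shas: set[str]):
    """Return (red share per attack, held task hashes absent from the log)."""
    tested = defaultdict(set)  # attack name -> task hashes probed
    red = defaultdict(set)     # attack name -> task hashes that came back red
    seen = set()
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        sha = entry["task_sha"]
        if sha not in held_task_shas:
            continue  # entry is for a task version you do not hold
        seen.add(sha)
        tested[entry["attack"]].add(sha)
        if entry["red_flag"]:
            red[entry["attack"]].add(sha)
    shares = {attack: len(red[attack]) / len(shas) for attack, shas in tested.items()}
    missing = set(held_task_shas) - seen
    return shares, missing
```

A non-empty `missing` set is the interesting result: it means some task you hold was edited after the log was written or was never attacked at all.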
References
[1] Skalse, Howe, Krasheninnikov, Krueger. Defining and characterizing reward gaming. NeurIPS 2022. arXiv:2209.13085.
[2] Krakovna et al. Specification gaming: the flip side of AI ingenuity. DeepMind, 2020.
[3] Amodei, Olah, Steinhardt, Christiano, Schulman, Mané. Concrete problems in AI safety. arXiv:1606.06565, 2016.