Does the model know the deal, or understand it?

Every one of the 69 tasks in our reference case is built to exist twice: once inside the original deal room, and once inside a byte-level, fingerprinted fork with every identifying detail — the company, the figures, the distinguishing language — stripped and replaced throughout all 30 sources and 18 exhibits. We built the second version because of a plain worry: a model can score well on the first version by recognizing the company, not by reading the documents in front of it.

That worry has a name. Geirhos et al. call it a shortcut [1] — solving a task by latching onto a signal correlated with the right answer, without doing the work the task was built to measure. If a company is public, was covered in the trade press, or shows up in a research note scraped into a training corpus years ago, a model can “know” how a deal turned out before it reads a page of the dataroom, and produce a fluent, confident, well-structured answer for a reason that has nothing to do with the documents in front of it. Recognizing a company is not reasoning about a deal.

A model that scores well on a real transaction may be reciting what it already knew rather than reading what you gave it — and until you measure the gap, you cannot tell which one you are grading.

The trouble is that measuring a gap requires an instrument, and an instrument nobody has validated is not evidence of anything. So the question we had to answer before any recognition number meant anything was not “does the model recognize the deal?” It was: does our own probe for recognition actually work?

—The control comes first

The instrument here is an identity probe: something that looks at a model’s answer and estimates whether it is drawing on knowledge of the real, named deal. Run that probe against the anonymized fork and get a clean reading, and it is tempting to conclude the fork carries no identifying signal. But a clean reading from a probe that could never detect identity under any circumstances is not evidence of a clean fork — it is evidence of a broken instrument, and the two look identical unless the instrument has already been tested somewhere it should obviously fire.

That somewhere is the original, unmodified case. The un-anonymized parent stem runs through the identity probe first, as a positive control, before the probe’s verdict on any fork is trusted. If the probe cannot detect identity in a document that plainly and undeniably contains it, the probe is broken, and every “clean” fork it has ever certified is unverified, not verified. The control comes first. Only once the probe fires correctly on the parent does its silence on the fork mean what we want it to mean.
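The ordering described above can be sketched as a small gate. This is a minimal illustration, not our production probe: the `probe` callable, the 0.5 threshold, the verdict strings, and the toy probes are all assumptions introduced here for the example.

```python
from typing import Callable

# Hypothetical interface: a probe maps a text to an identity score in [0, 1].
Probe = Callable[[str], float]

def certify_fork(probe: Probe, parent_text: str, fork_text: str,
                 threshold: float = 0.5) -> str:
    """Return a verdict on the fork, trusting the probe only if it fires on the parent."""
    # Positive control: the parent plainly contains identity, so the probe must fire.
    if probe(parent_text) < threshold:
        return "probe_unvalidated"  # silence on the fork would mean nothing
    # Only now is the probe's reading on the fork informative.
    if probe(fork_text) >= threshold:
        return "fork_leaks"
    return "fork_clean"

# Toy probes for illustration only; "Acme Holdings" is a made-up name.
def toy_probe(text: str) -> float:
    return 1.0 if "Acme Holdings" in text else 0.0

def broken_probe(text: str) -> float:
    return 0.0  # never fires, so it can never certify anything

parent = "Acme Holdings term sheet"
fork = "Company A term sheet"
print(certify_fork(toy_probe, parent, fork))     # fork_clean
print(certify_fork(broken_probe, parent, fork))  # probe_unvalidated
```

The broken probe returns the same silence on the fork as the working one; only the control run on the parent tells them apart.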

The fork itself has to earn the same scrutiny the probe does. Anonymization runs at the byte level across every document in a case — not a summary pass, not a find-and-replace on the company name, but a transform fingerprinted against the parent content (a parent-content hash, task hashes, and a transform version, all recorded together), so we can prove exactly which document produced which fork and catch any transform that partially failed. A pipeline that quietly misses one identifying detail is worse than no anonymization at all, because it manufactures confidence in a fork that still leaks; ours fails the build loudly instead of shipping the seam.
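As a rough sketch of what such a fingerprint and a loud failure could look like, assuming SHA-256 as the hash and a simple byte-substring check for leaks: the record layout, the `TRANSFORM_VERSION` value, and the function names are hypothetical and stand in for whatever the real pipeline uses.

```python
import hashlib
from dataclasses import dataclass

TRANSFORM_VERSION = "v1"  # hypothetical version label

@dataclass(frozen=True)
class ForkFingerprint:
    parent_content_hash: str
    task_hashes: tuple
    transform_version: str

def fingerprint(parent_docs: list, tasks: list) -> ForkFingerprint:
    """Record the parent-content hash, task hashes, and transform version together."""
    h = hashlib.sha256()
    for doc in parent_docs:
        # Length-prefix each document so document boundaries affect the hash.
        h.update(len(doc).to_bytes(8, "big"))
        h.update(doc)
    task_hashes = tuple(hashlib.sha256(t).hexdigest() for t in tasks)
    return ForkFingerprint(h.hexdigest(), task_hashes, TRANSFORM_VERSION)

def assert_no_leaks(fork_docs: list, identifiers: list) -> None:
    """Fail loudly if any known identifier survives in any fork document."""
    for i, doc in enumerate(fork_docs):
        for ident in identifiers:
            if ident in doc:
                raise RuntimeError(f"fork document {i} still contains an identifier")

fp = fingerprint([b"parent memo", b"parent exhibit"], [b"task 1 prompt"])
assert_no_leaks([b"Company A memo"], [b"Acme"])  # passes silently
```

A substring check only catches identifiers someone thought to list, which is one reason the probe-based check on the fork is still needed.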

—Recognition, measured — and then made moot

With a validated instrument, the measurement itself is simple to state: run the same task against the original case and against its anonymized fork, and take the score delta, original minus fork. Call that delta the recognition share: the portion of a model’s score on the original that evaporates once everything identifying is gone. Reasoning from the documents should score about the same on both versions. Coasting on recognition should not.
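In code, the delta is a per-task subtraction. The sketch below assumes scores are keyed by task ID and takes the delta as original minus fork. The function names are ours for illustration, and the numbers are placeholders chosen to show the arithmetic, not measured results.

```python
from statistics import fmean

def recognition_share(original_scores: dict, fork_scores: dict) -> dict:
    """Per-task score delta: original minus anonymized fork."""
    if original_scores.keys() != fork_scores.keys():
        raise ValueError("both runs must cover the same tasks")
    return {task: original_scores[task] - fork_scores[task]
            for task in original_scores}

def mean_recognition_share(original_scores: dict, fork_scores: dict) -> float:
    """Average delta across tasks."""
    return fmean(recognition_share(original_scores, fork_scores).values())

# Placeholder scores, NOT measurements.
original = {"task_01": 1.0, "task_02": 0.75}
fork = {"task_01": 0.5, "task_02": 0.75}
print(recognition_share(original, fork))       # {'task_01': 0.5, 'task_02': 0.0}
print(mean_recognition_share(original, fork))  # 0.25
```

In this made-up example, task_02 behaves like reasoning from the documents (no change), while task_01 loses half its score once the identifying detail is gone.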

We have not measured those deltas yet, and we are not going to estimate them. What is worth saying now is why anonymization alone does not settle the question, even once the instrument is trusted: a capable model does not always need the name to know the company. An unusual margin structure, a specific facility size, a covenant package that shows up in exactly one real deal can be as identifying as a name.

That is where counterfactual forks go further than anonymization. Instead of only relabeling, we consistently perturb the load-bearing figures throughout a case and regenerate the gold answer from the fact ledger to match. A model that memorized the real answer now gets it wrong, not because the label is missing, but because the correct answer changed and its memorized one didn’t.

Hindsight Open, the public tier of our benchmark, is built entirely from cases constructed this way: every task a model outside our walls can run is graded against gold answers regenerated from perturbed figures, so where the perturbation moves the answer, a memorized one is wrong by construction. Hindsight Sealed, the tier we run ourselves against a submitted model endpoint, applies the same fork logic to cases that never leave our control plane at all. That gives independent cover against the same failure mode, on a set the public can never train against by accident.
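A toy version of the counterfactual fork makes the "wrong by construction" point concrete. Everything here is assumed for illustration: the ledger is a flat dictionary, the task is a leverage ratio, the figures are invented, and the perturbation is a fixed per-figure scale factor. The real ledger and transform are not described at this level of detail.

```python
def gold_leverage(ledger: dict) -> float:
    """Gold answer derived from the ledger: total debt divided by EBITDA."""
    return ledger["total_debt"] / ledger["ebitda"]

def perturb(ledger: dict, factors: dict) -> dict:
    """Scale each load-bearing figure once, in the ledger, so every
    document rendered from the ledger sees the same perturbed value."""
    return {key: value * factors.get(key, 1.0) for key, value in ledger.items()}

# Invented figures for a made-up deal.
real_ledger = {"total_debt": 400.0, "ebitda": 100.0}
fork_ledger = perturb(real_ledger, {"total_debt": 1.25, "ebitda": 0.5})

memorized_answer = gold_leverage(real_ledger)  # 4.0, what recall would produce
fork_gold = gold_leverage(fork_ledger)         # 10.0, regenerated from the fork

# A model reciting the real deal's leverage is now graded as wrong.
print(memorized_answer == fork_gold)  # False
```

Because the gold answer is recomputed from the perturbed ledger rather than edited by hand, the fork's documents and its answer key cannot drift apart.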

None of this is a claim that contamination is zero. Writing style, document structure, and rare combinations of otherwise-unremarkable facts could in principle still correlate with a known deal in ways neither the anonymization nor the perturbation touches. The honest claim is narrower: we measure the gap on the one axis we can build a positive control for, and we say plainly which axis that is.

Our cases begin as the real datarooms of past deals dissei.credit has done: memos, financials, exhibits, and timelines as they stood at the decision, rebuilt point-in-time before any anonymization or perturbation runs. As licensing for new origination completes, that substrate grows without changing the fork methodology built on top of it.

Ask any benchmark vendor for their recognition delta, not just their leaderboard score — and ask what validates their probe before you trust the delta at all. A vendor who has never run the positive control does not know whether their instrument would ever detect the thing it claims to rule out.

References