The theatrics of precision: the model reads what it won't repeat

180 runs, three synthetic deals, the same forward margin rendered as words, an integer, and two decimals. The verdict never moved. What the model was willing to repeat moved by a factor of twenty-eight — and that gap is the tell.

A common pattern in AI-assisted diligence: a data-room extractor pulls figures out of unaudited management materials at whatever precision the source file happens to carry, and pastes them into the memo the analyst then reviews with an LLM. “Roughly mid-teens forward margin” in the CIM becomes “14.73% forward margin” in the extract. The number the model sees looks… verified?

We tested whether that rendering (words vs. integer vs. two decimals) moves the valuation the model produces from an otherwise identical memo. The variable is a single load-bearing forward projection in three renderings: the honest hedge, the cleaned-up integer, and the two-decimal version. Every other figure in the memo held its rendering constant. The full design is five cells (the three renderings, a compounding cell, and a length control) across three deal archetypes and two frontier models, with six seeds per cell: 180 valuation runs. Same task, same output contract, same underwritten thesis.

The model reads a two-decimal figure and an integer as different kinds of input — quoting the integer back verbatim and silently rounding the two decimals away. Nothing a human reader watches shifts to match.

The bid holds. The stated confidence holds. The verdict holds. Precision is being processed. It is not being surfaced. Whatever effect the increased precision has, the rationale will not tell you.

The setup

Three synthetic deals across three archetypes — a mature software buyout, a consumer DTC buyout, and an industrial buyout — each carrying a canonical forward-margin projection to two decimals in the underlying fixture. From that canonical value we derived three renderings of the same figure: P0, hedged words (“low twenties” — the honest guess); P1, the integer (“21%” — cleaned up but non-committal on the precision); and P2, two decimals (“21.43%” — the high-precision rendering). Only the headline projection’s rendering varied. Every other figure in the memo held its P0 wording constant. A fourth cell, P2C, rendered every load-bearing figure at two decimals, to test whether the effect compounds.
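To make the three rungs concrete, here is a minimal sketch of deriving P0, P1, and P2 from one canonical value. The function name `render_margin`, the low/mid/high thirds used for the hedged wording, and the decade vocabulary are illustrative assumptions, not the harness used in the study.

```python
def render_margin(value_pct: float, rung: str) -> str:
    """Render one canonical percentage at the precision of a given rung.

    Hypothetical helper: the band boundaries (thirds of a decade) and the
    wording are assumptions made for illustration only.
    """
    if rung == "P2":  # two decimals, as carried in the fixture
        return f"{value_pct:.2f}%"
    if rung == "P1":  # cleaned-up integer
        return f"{round(value_pct)}%"
    if rung == "P0":  # hedged words
        tens, units = divmod(value_pct, 10)
        band = "low" if units < 10 / 3 else "mid" if units < 20 / 3 else "high"
        decades = {1: "teens", 2: "twenties", 3: "thirties", 4: "forties"}
        return f"{band} {decades[int(tens)]}"  # KeyError outside 10-49.99
    raise ValueError(f"unknown rung: {rung}")


for rung in ("P0", "P1", "P2"):
    print(rung, render_margin(21.43, rung))
# P0 low twenties
# P1 21%
# P2 21.43%
```

All three strings carry the same underlying figure; only the apparent precision differs, which is the single variable the experiment manipulates.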

The model was asked to produce an entry EV/EBITDA multiple, a leverage stack, and a verdict from the memo. The ask price was removed from the transaction section, so the model had to compute the multiple from the fundamentals rather than echoing back a stated bid.
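The run count follows directly from the design: five cells, three archetypes, two models, six seeds. A short sketch that enumerates the grid is below. The model labels and cell identifiers are placeholders, since the study does not name the models or publish its identifiers.

```python
from itertools import product

MODELS = ["model_a", "model_b"]  # placeholders; the two models are not named
ARCHETYPES = ["software", "consumer_dtc", "industrial"]
CELLS = ["P0", "P1", "P2", "P2C", "LENGTH_CONTROL"]  # assumed labels
SEEDS = range(6)

# One valuation run per (model, archetype, cell, seed) combination.
runs = list(product(MODELS, ARCHETYPES, CELLS, SEEDS))
print(len(runs))  # 180
```

Each (model, archetype, cell) combination receives six seeds, which is what makes per-seed pairing across cells possible in the later contrasts.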

Figure: Verdict across all 180 runs. Every rung, model, and seed returned PROCEED_WITH_CONDITIONS. The bid does not respond; the rationale does.
Figure: Entry EV/EBITDA multiple by rung, mean with 95% bootstrap CI. 180 runs, two frontier models, six seeds per cell, three deal archetypes. Every interval contains every other rung’s mean: the movement that is real is in the rationale, not the bid.

What moved

Start with the change that is easy to see. The model treats a two-decimal figure and an integer as different kinds of input. The rationale at P1 quotes the integer figure back verbatim in 86% of runs. At P2 the two-decimal figure appears in the rationale in 3% of runs. Same load-bearing sentence, same underlying number, same model — and the digits the model acts on are not the digits it writes about. At P2 the two-decimal figure is either rounded to an integer, paraphrased, or dropped. The precision enters the input. It does not come out of the rationale.
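The quote-back rates above can be measured with a simple verbatim check. The sketch below is one way to do it, assuming each rationale is available as a plain string. The function names and the matching rule are assumptions, not the study's parser.

```python
import re


def quotes_back(rationale: str, rendered: str) -> bool:
    """True if the rationale repeats the rendered figure verbatim.

    The lookarounds stop "21%" from matching inside "121%" or "14.21%".
    """
    pattern = r"(?<![\d.])" + re.escape(rendered) + r"(?!\d)"
    return re.search(pattern, rationale) is not None


def quote_back_rate(rationales: list[str], rendered: str) -> float:
    """Share of rationales that quote the rendered figure back verbatim."""
    if not rationales:
        raise ValueError("need at least one rationale")
    return sum(quotes_back(r, rendered) for r in rationales) / len(rationales)


# Synthetic rationales, written for illustration only.
sample = [
    "We underwrite a 21% forward margin.",
    "Forward margin in the low twenties supports the case.",
]
print(quote_back_rate(sample, "21%"))  # 0.5
```

Applied per rung, this yields the two rates being compared. With the reported figures, 0.86 / 0.03 is about 28.7, which is the factor of twenty-eight cited in the summary.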

That is the change in the model’s processing. In the channels a human reader is looking at (the bid, the stated confidence, the verdict), nothing shifts to match. Across the panel, the two-decimal rendering moves the mean entry multiple by roughly half a turn of EBITDA, but that drift is indistinguishable from noise: the 95% confidence interval includes zero on every contrast, and no permutation test approaches significance. The compounding cell, which renders every load-bearing figure at two decimals, does not push the effect further. If precision were pulling the bid systematically, P2C is the design that would show it. It shows nothing beyond the half-turn drift already present in P2.

The model’s confidence in the figure does not respond either. The distribution of epistemic framing — the balance of verified, estimated, uncertain, and neutral language in the rationale about the projection — is essentially identical at P0, P1, and P2. Three quarters of every cell’s rationales are frame-neutral; the small residuals do not shift. The model is not narrating the two-decimal number as more certain than the integer. It is narrating it as the same claim.

Even the archetype where the model has the most room to move, a software buyout without a tight public-comp anchor, does not produce a shift whose confidence interval excludes zero.

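| Archetype | P0 → P2 (entry EV/EBITDA) | Δ turns | 95% CI | p |
|---|---|---|---|---|
| Software buyout | 25.25 → 23.83 | −1.42 | [−2.83, 0.00] | 0.26 |
| Consumer buyout | 9.58 → 9.83 | +0.25 | [−0.21, 0.71] | 0.52 |
| Industrial buyout | 9.54 → 9.46 | −0.08 | [−0.21, 0.00] | 0.85 |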
The most permissive archetype still produces no shift distinguishable from zero. The software-buyout shift of −1.42 turns would matter in a bid, but the paired 95% CI touches zero and the permutation test does not reach significance (p = 0.26).
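As a quick arithmetic check, the per-archetype deltas and the "roughly half a turn" panel drift can be recomputed from the table. The sketch assumes the panel figure is an unweighted mean of the three archetype deltas; the study does not state how it aggregated.

```python
# Mean entry multiples at P0 and P2, copied from the table above.
table = {
    "Software buyout": (25.25, 23.83),
    "Consumer buyout": (9.58, 9.83),
    "Industrial buyout": (9.54, 9.46),
}

# Delta in turns of EBITDA for each archetype (P2 minus P0).
deltas = {name: round(p2 - p0, 2) for name, (p0, p2) in table.items()}

# Assumption: unweighted mean across archetypes.
panel_mean = round(sum(deltas.values()) / len(deltas), 2)

print(deltas)
print(panel_mean)  # -0.42
```

On that assumption the panel drift is about −0.42 turns, dominated by the software archetype; the other two deals move by a quarter turn or less.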

What it is not

Not a length effect. We reran the P0 baseline with a filler sentence inserted next to the headline projection, matched in character length to the P0→P2 delta. The filler cell moved the entry multiple by +0.03 turns (CI [−0.92, 0.96], p = 0.99). The small observed precision effect is not the model getting more tokens to chew on.

Not a units artefact. All three deals carry positive LTM EBITDA and were priced as EV/EBITDA transactions throughout — no rung switched to an ARR-multiple lens or a forward-EBITDA lens. The multiple reported at P0 and P2 is the same quantity in the same units.

Not a verdict flip. Every one of the 180 runs, across every rung, deal, model, and seed, returned the same verdict: PROCEED_WITH_CONDITIONS. If the diligence stack you have reads only the categorical output — the go / no-go / conditional flag — you would see no effect at all.

What to do with it

Do not audit the model’s handling of precision by reading the rationale. The two-decimal figure the model was given appeared in the rationale it wrote in only 3% of runs. Whatever work the precision is doing inside the answer, the rationale rounds it away before the reader sees it. If your review process reads the model’s stated reasoning to check that it engaged with the numbers correctly, you will systematically fail to notice this class of effect. It is not narrated.

Assume the precision is doing work you cannot see. The bid, the verdict, and the model’s stated confidence all hold across rungs. That does not mean the two-decimal figure is being ignored — the rationale-level asymmetry shows it is not. It means any effect of the extra precision would be expressed in a channel the reader is not reading. The absence of visible movement is not the same as the absence of processing.

Someone still has to own the rendering. In most AI-diligence workflows nobody owns the choice of how figures are rendered on their way from source to model. The extractor’s default becomes the memo’s rendering, which becomes the model’s anchor. The rationale-level asymmetry is evidence that the rendering is a distinct input to the model, whether or not any given deal shows a visible price shift. Precision is a lever. Someone should be holding it deliberately.

Where we come in

Precision theatre is one variant in a family of framing effects we audit as part of building evaluation infrastructure for finance domains where being confidently wrong is expensive — credit underwriting, structured credit, private-equity evaluation, distressed and special situations. Our cases exist in controlled variants that vary the precision, the framing, and the anchoring of the figures while holding the financial substance constant. The contrast across variants is the measurement.

The class of failure that precision theatre exemplifies is the one we spend most of our time on: the framing effect that does not look like a framing effect. Two decimals reads as discipline, so nothing in a diligence review flags it as an input worth checking. That is precisely the shape of a problem worth measuring — one the reviewer cannot see just by reading the output.

Method, briefly

Two frontier models from two different providers on the panel. Six seeds per cell, independent sampling. Length-control cell run across all three archetypes, matched in character length to the P0→P2 delta. Adversarial review of harness, prompts, and parser before the sweep ran. Paired-permutation tests (10,000 shuffles) and bootstrap 95% confidence intervals (10,000 resamples) on every contrast.
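For readers who want to reproduce the style of test, here is a minimal standard-library sketch of a paired permutation test and a bootstrap confidence interval on per-seed differences in entry multiple. It assumes a sign-flip permutation scheme and a percentile bootstrap; the study does not specify which variants it used, and the function names are hypothetical.

```python
import random


def _mean(xs):
    return sum(xs) / len(xs)


def paired_permutation_p(diffs, n_shuffles=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired differences."""
    rng = random.Random(seed)
    observed = abs(_mean(diffs))
    hits = 0
    for _ in range(n_shuffles):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(_mean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


def bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(_mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic placeholder differences (P2 minus P0), not data from the study.
example = [0.5, -0.5, 0.0, 0.25, -0.25, 0.0]
print(paired_permutation_p(example), bootstrap_ci(example))
```

A contrast is treated as showing no effect when the interval includes zero and the p-value is large, which is the pattern reported for every contrast here.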

Companion to “You can scaffold away the flip. You can’t scaffold away the frame.”, which documents the framing battery this study extends.

Precision theatre is measurable before it prices a deal. We run this battery as part of the evaluation suite for teams putting models near capital decisions.
