Part II – The Shadow Divers Method: How Manual Error Analysis Finds the Truth

By Mike Munroe

Part I showed how aggregate metrics create the illusion of certainty. Part II turns to the opposite: the disciplined, primary-source research required to uncover the actual truth. Few stories illustrate this better than the six-year investigation into the identity of a single, misplaced German U-boat.

AI teams face the same contradiction the divers did. The difference is simply the environment: our search happens in traces, data analysis, and log files instead of the cold Atlantic water.

Finding the First Knife

The divers’ first meaningful clue was a simple artifact: a knife pulled from a silverware drawer, inscribed with the name Horenburg.

This moment has a precise LLM analogue: the open coding phase of error analysis. A human reviewer studies a representative set of traces and captures the first visible point of failure in free text. No scores. No abstractions. Just the artifact.
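As a concrete illustration, an open-coding record can be as small as a trace identifier plus a free-text note. The field names and example notes below are hypothetical, a minimal sketch rather than a prescribed schema:

```python
# A minimal open-coding record: no scores, no categories yet,
# just the trace and a free-text observation of the first failure.
from dataclasses import dataclass

@dataclass
class OpenCodingNote:
    trace_id: str       # identifier of the reviewed trace
    failure_note: str   # free-text description of the first visible failure

# Hypothetical notes from a review session.
notes = [
    OpenCodingNote("trace-0042", "Invented a citation that does not exist"),
    OpenCodingNote("trace-0077", "Ignored the user's date constraint entirely"),
]

for note in notes:
    print(f"{note.trace_id}: {note.failure_note}")
```

The point of the structure is restraint: by forbidding scores and categories at this stage, the record forces the reviewer to describe the artifact itself.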

But here’s the critical warning: when the divers found the Horenburg knife, they initially dismissed it. The official record, which placed the submarine somewhere near Africa, felt more authoritative than the evidence in their hands. Teams make the same mistake. They see a catastrophic failure trace and discard it as an edge case because the dashboard shows a reassuring metric.

The failure is the truth; the dashboard is the fiction.

The first artifact is not noise. It is the beginning of the map.

From Knife to Engine Tags

A single anomaly is not proof. The divers needed a pattern, so they continued gathering more artifacts: spare parts, personal effects, instruments. The breakthrough came with the recovery of engine room tags stamped “U-869”. The pattern was undeniable. The wreck had identified itself.

This investigative escalation is the same process LLM teams follow when they convert scattered free-text comments into a failure taxonomy.

  • The Horenburg knife is a lone hallucination: an event.

  • The engine tags are a cluster of similar failures: evidence.

The transformation is profound: from “This was weird” → “This is a systemic, repeatable failure mode.”

A taxonomy isn’t a collection of items; it’s the architecture that organizes the truth.

Building a Reliable Evaluation Process

Manual evaluation isn’t just a preliminary phase; it is the core of the feedback loop. Just as the divers constructed a detailed map of the wreck, LLM eval reviewers construct an evolving map of system failures.

A reliable process looks like this:

  1. Assemble a diverse dataset of 50–100 examples representing the real target task.
  2. Apply pass/fail judgments, adding a free-text comment describing the first failure in each trace*.
  3. Cluster these comments into a taxonomy: your evolving map of the failures.
  4. Repeat this cycle regularly to track how failure modes shift as you ship changes.

* Do not use a multi-point scoring system; use pass/fail. Too much nuance is lost in trying to score a failure on a scale, say from 1–5 or even 1–3.
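The four steps above can be sketched in a few lines. The trace data, field names, and keyword-based clustering below are all hypothetical: the clustering in particular stands in for the human judgment of building a taxonomy, which no keyword match can replace.

```python
# A minimal sketch of the pass/fail review loop, with hypothetical traces.
from collections import Counter

# Step 1: a small, representative sample of reviewed traces.
# Step 2: each carries a binary pass/fail judgment and a free-text comment.
reviews = [
    {"trace_id": "t1", "passed": False, "comment": "hallucinated a legal clause"},
    {"trace_id": "t2", "passed": True,  "comment": ""},
    {"trace_id": "t3", "passed": False, "comment": "hallucinated a case citation"},
    {"trace_id": "t4", "passed": False, "comment": "leaked PII from the context"},
]

# Step 3: cluster free-text comments into coarse failure modes.
# (A naive stand-in for the human work of open coding and grouping.)
def failure_mode(comment: str) -> str:
    if "hallucinat" in comment:
        return "hallucination"
    if "PII" in comment:
        return "pii_leak"
    return "other"

taxonomy = Counter(
    failure_mode(r["comment"]) for r in reviews if not r["passed"]
)
print(taxonomy.most_common())  # e.g. [('hallucination', 2), ('pii_leak', 1)]
```

Step 4, the repeat, is simply rerunning this loop after every significant change and watching how the counts in the taxonomy shift.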

This qualitative loop is what aligns an LLM system with human judgment. It is slow by design. You cannot mass-produce truth. The insight you gain from manually reviewing your failures and data is priceless when building a better LLM integration.

The Hidden Cost of Abstraction

Users experience specific, individual outputs. A single hallucination, a PII leak, or an invented legal clause defines the product experience, not the 90% success rate. Optimizing for an abstract score is not optimizing for quality. It is optimizing for a proxy of quality, creating blind spots and hidden failure modes.

Takeaways:

The divers didn’t find the truth because they trusted the maps. They found it because they trusted the artifacts. LLM teams must do the same.

  • Every failure trace is an artifact.
  • Taxonomies are maps of the wreck.
  • Truth emerges from pattern recognition, not aggregation.
  • Finding the wreck isn’t enough. You still need to understand why it sank.

Stay tuned for Part III – coming soon
