26 May 2026

Offline eval: why I test ingestion before I test reasoning

How AMICOVERED's regression battery works across three layers — document ingestion, retrieval, and inference — and why the order matters.

My first instinct when AMICOVERED gave a wrong answer was to tweak the prompt, or reach for a better model. I quickly found that instinct was often wrong.

AMICOVERED is a coverage assistant, a tool that reads a user’s Swiss private insurance policy and answers natural-language questions about what’s covered. If you read my previous article, you might remember how the architecture got here and what broke along the way. This article is about how I knew each change was an improvement, rather than a different way to be confidently wrong.

A Swiss insurance policy normally comes with two documents: the policy certificate, and the general insurance conditions (different insurers use different names). The pipeline starts with PDF ingestion, runs through structure-aware chunking for the general insurance conditions, and compiles per-policy context from the policy certificate. At question time, an agentic retrieval loop gathers the relevant evidence and hands it to the model to generate the answer. A wrong answer can originate anywhere in that chain. A better model or prompt at the top doesn’t fix a broken ingestion step at the bottom. It just reaches a wrong conclusion with more confidence.

Without the right evidence in front of it, the model will hallucinate or reason to the wrong conclusion. You’ve heard it before: garbage in, garbage out.

This is why the test and eval structure mirrors the architecture: three layers, each one verifying what the next one depends on. Testing reasoning without verifying the layers below is testing the whole stack at once. When something fails, you don’t know where. You end up chasing reasoning failures that aren’t reasoning failures.
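The ordering argument can be sketched in a few lines. This is an illustration only, not AMICOVERED's actual test runner: the function name and the lambda stand-ins for real test suites are hypothetical.

```python
from typing import Callable, Optional

def first_failing_layer(layers: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run layer checks bottom-up and return the first layer that fails, or None."""
    for name, check in layers:
        if not check():
            # Stop here: results from the layers above would not be interpretable.
            return name
    return None

# Hypothetical stand-ins for the three real suites.
layers = [
    ("ingestion", lambda: True),
    ("retrieval", lambda: False),
    ("inference", lambda: True),
]
print(first_failing_layer(layers))  # prints "retrieval"
```

Stopping at the first failing layer is the point: an inference failure reported on top of a retrieval failure says nothing about reasoning.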

Layer 1: Ingestion

General insurance conditions are hierarchical. A chunk’s meaning doesn’t come only from its words, but also from where it sits in the document structure. I extract the section tree at ingestion time and tag every chunk with its path, so the agentic loop knows not just what a chunk says, but where in the policy it belongs. The section tree is also passed to the model at retrieval time so it knows the document structure.
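A minimal sketch of path tagging, assuming the PDF step has already produced a flat list of (heading level, text) pairs. The input shape, the names, and the sample policy text are all invented for illustration; the real extraction is more involved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    path: tuple[str, ...]  # section titles from the root down to this chunk's section
    text: str

def tag_chunks(lines: list[tuple[int, str]]) -> list[Chunk]:
    """lines holds (level, text): level 0 is body text, level >= 1 is a heading depth."""
    stack: list[str] = []
    chunks: list[Chunk] = []
    for level, text in lines:
        if level > 0:
            # A heading at this depth closes every open section at the same depth or deeper.
            del stack[level - 1:]
            stack.append(text)
        else:
            chunks.append(Chunk(path=tuple(stack), text=text))
    return chunks

# Made-up sample document.
doc = [
    (1, "Household contents"),
    (2, "Theft"),
    (0, "Simple theft away from home is insured up to the agreed sum."),
    (2, "Exclusions"),
    (0, "Cash is not insured."),
    (1, "Personal liability"),
    (0, "Damage caused to third parties is insured."),
]
chunks = tag_chunks(doc)
for chunk in chunks:
    print(" > ".join(chunk.path), "|", chunk.text)
```

The second chunk reads "Cash is not insured." on its own; only its path tells you that the exclusion belongs to household contents and not to liability.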

This is also why semantic search and vector databases alone are often not enough, but that is a topic for another article.

The ingestion tests verify that tree. For a set of real policy documents, I assert that the document tree is correctly extracted: sections present, hierarchy intact, chunks tagged with the right paths. Fully deterministic assertions.
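One way such an assertion could look, as a sketch: compare the extracted section paths against a hand-written expected list and report every difference. The helper name and the sample paths are hypothetical.

```python
Path = tuple[str, ...]

def tree_problems(extracted: list[Path], expected: list[Path]) -> list[str]:
    """Return a human-readable list of differences; an empty list means the tree matches."""
    problems = []
    for path in expected:
        if path not in extracted:
            problems.append(f"missing section: {' > '.join(path)}")
    for path in extracted:
        if path not in expected:
            problems.append(f"unexpected section: {' > '.join(path)}")
        # Hierarchy intact: every nested section needs its parent in the tree.
        if len(path) > 1 and path[:-1] not in extracted:
            problems.append(f"orphan section: {' > '.join(path)}")
    return problems

expected = [
    ("Household contents",),
    ("Household contents", "Theft"),
    ("Household contents", "Exclusions"),
]
# Simulated extraction bug: "Exclusions" was detached from its parent.
broken = [
    ("Household contents",),
    ("Household contents", "Theft"),
    ("Exclusions",),
]
print(tree_problems(broken, expected))
```

Returning the full list of problems, instead of failing on the first mismatch, makes a broken extraction easier to diagnose in one run.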

The policy certificate is also part of this layer. The general conditions describe what the insurance covers in principle. The policy certificate is what a specific user actually has: which coverages are active, which tier they’re on, which deductibles apply. AMICOVERED has a context compiler that extracts this into deterministic data structures.

Without precise information about which coverages are active, the model reasons over the generic conditions and may reach a wrong conclusion. If you force the model to check coverage activation when that information is unclear, it will either be overly conservative and say “not covered”, or confidently confirm coverage the user never purchased.

Coverage activation is typically encoded as tables, often with symbols like ✓ and ✗. Flatten that into a text chunk and the model will misread it or miss it entirely. The context compiler exists for exactly this: it reads the certificate’s structure and extracts activation state as unambiguous data, not prose.
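A minimal sketch of the idea, assuming the table has already been extracted into rows of (coverage name, marker). The function name, row shape, and coverage names are invented; the actual context compiler handles more than activation flags.

```python
ACTIVE, INACTIVE = "✓", "✗"

def compile_activation(rows: list[tuple[str, str]]) -> dict[str, bool]:
    """Turn (coverage name, marker) rows into an explicit name -> active mapping."""
    activation: dict[str, bool] = {}
    for name, marker in rows:
        marker = marker.strip()
        if marker == ACTIVE:
            activation[name.strip()] = True
        elif marker == INACTIVE:
            activation[name.strip()] = False
        else:
            # Refuse to guess: an unknown marker should fail ingestion, not reach the model.
            raise ValueError(f"unrecognised activation marker {marker!r} for {name!r}")
    return activation

# Made-up certificate rows.
rows = [("Theft away from home", "✓"), ("Glass breakage", "✗"), ("Gross negligence ", " ✓")]
print(compile_activation(rows))
```

Raising on an unknown marker is a design choice in this sketch: an ambiguous cell becomes a failing deterministic test, not a silent guess.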

The certificate tests assert the extracted data structures match the expected values. Also fully deterministic.

If ingestion is broken, everything above it is broken. The chunks are wrong, or the premises are wrong. No amount of retrieval quality or model reasoning fixes that.

Layer 2: Retrieval

With the document tree and user context verified, retrieval tests check that the agentic loop surfaces the right evidence.

For a given question and its relevant frame, I verify that the relevant chunks are retrieved, that they rank high enough to make it into the model’s context, and that they contain the information needed to support a correct answer. Also deterministic: for a given query I know which chunks should come back and roughly where they should rank. Either they do or they don’t.
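The rank check can be sketched as follows, assuming each chunk has a stable id and the loop returns ids in ranked order. The function name, the ids, and the cutoff value are hypothetical.

```python
def retrieval_misses(ranked_ids: list[str], expected_ids: set[str], top_k: int) -> dict[str, str]:
    """Return {chunk_id: reason} for every expected chunk that would not reach the model."""
    misses: dict[str, str] = {}
    for chunk_id in sorted(expected_ids):
        if chunk_id not in ranked_ids:
            misses[chunk_id] = "not retrieved"
            continue
        rank = ranked_ids.index(chunk_id) + 1  # 1-based rank for readable messages
        if rank > top_k:
            misses[chunk_id] = f"ranked {rank}, cutoff is {top_k}"
    return misses

ranked = ["c7", "c2", "c9", "c4"]
print(retrieval_misses(ranked, expected_ids={"c2", "c4", "c5"}, top_k=3))
```

An empty result means the test passes. Separating "not retrieved" from "ranked too low" hints at whether to look at the query or at the ranking.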

Retrieval failures are a different class of problem from ingestion failures, and catching them separately matters. A chunk that exists in the document tree but never surfaces in retrieval is just as useless as a chunk that was never ingested correctly. The ingestion tests won’t catch that. Only retrieval tests will.

Layer 3: Inference

With ingestion and retrieval verified, the inference layer can measure what it’s supposed to measure: whether the system reasons correctly.

The battery is 120 questions across 6 real policy documents: a mix of positive outcomes (covered), negative outcomes (not covered), and edge cases. Each question gets three checks.

Coverage resolution is deterministic. The expected outcome is one of four values: covered, not covered, partially covered, or unclear. The system’s answer either matches or it doesn’t.
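As a sketch, the four outcomes map naturally onto an enum, and the check becomes a comparison. The class and function names are assumptions, as is the choice to normalise case and whitespace before comparing.

```python
from enum import Enum

class Outcome(Enum):
    COVERED = "covered"
    NOT_COVERED = "not covered"
    PARTIALLY_COVERED = "partially covered"
    UNCLEAR = "unclear"

def outcome_matches(system_answer: str, expected: Outcome) -> bool:
    """True only if the system's answer is a valid outcome and equals the expected one."""
    try:
        return Outcome(system_answer.strip().lower()) is expected
    except ValueError:
        # Anything outside the four allowed values counts as a failure.
        return False

print(outcome_matches("Covered", Outcome.COVERED))
print(outcome_matches("probably covered", Outcome.COVERED))
```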

Citations sit at the boundary between retrieval and reasoning. The model receives more chunks than it cites. The citation check verifies that the model selected and cited the right ones, the sections that actually matter for the answer. This is not a retrieval quality check. It is asking whether the model’s attention is in the right place given the context it received.
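A minimal sketch of a citation check using set arithmetic over section ids. The function name, the ids, and the three report categories are assumptions; in particular, whether an unneeded citation should fail a question is a policy decision this sketch leaves open.

```python
def citation_report(provided: set[str], cited: set[str], expected: set[str]) -> dict[str, list[str]]:
    """Compare what the model cited with what it was given and what it should have cited."""
    return {
        # Cited sections that were never in the model's context.
        "not_in_context": sorted(cited - provided),
        # Sections that matter for the answer but were not cited.
        "missing": sorted(expected - cited),
        # Sections that were in context and cited, but are not needed for the answer.
        "unneeded": sorted((cited & provided) - expected),
    }

report = citation_report(
    provided={"A1", "A2", "B4", "C3"},
    cited={"A2", "C3", "D9"},
    expected={"A2", "B4"},
)
print(report)
```

Because the expected set is judged against the context the model actually received, a failure here points at selection, not at retrieval.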

Reasoning is where LLM-as-a-judge enters. I pass the question, the cited sections, the system’s reasoning, and a reference reasoning (a few sentences describing the direction the correct conclusion should follow) to a separate model and ask it to evaluate whether the logic is sound given the evidence. Outcome and citations can be verified programmatically. Reasoning quality cannot. LLM-as-a-judge catches it, with all the caveats that come with using a model to evaluate a model.
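The judge call can be sketched with the model passed in as a plain callable, so no particular provider API is assumed. The prompt wording, the PASS/FAIL reply convention, and both function names are illustrative assumptions, not the prompt AMICOVERED uses.

```python
from typing import Callable

def build_judge_prompt(question: str, cited_sections: list[str],
                       system_reasoning: str, reference_reasoning: str) -> str:
    """Assemble the four inputs into a single prompt for the judge model."""
    sections = "\n\n".join(cited_sections)
    return (
        "Evaluate whether the reasoning is sound given the evidence.\n\n"
        f"Question:\n{question}\n\n"
        f"Cited sections:\n{sections}\n\n"
        f"System reasoning:\n{system_reasoning}\n\n"
        f"Reference reasoning:\n{reference_reasoning}\n\n"
        "Answer PASS or FAIL on the first line, then justify briefly."
    )

def judge_reasoning(judge: Callable[[str], str], prompt: str) -> bool:
    """judge is any function that sends a prompt to a model and returns its text reply."""
    lines = judge(prompt).strip().splitlines()
    # An empty or malformed reply counts as a failure, never as a pass.
    return bool(lines) and lines[0].strip().upper() == "PASS"

# Stub judge for demonstration; a real one would call a separate model.
prompt = build_judge_prompt("Is my stolen bike covered?", ["Section text 1", "Section text 2"],
                            "system reasoning here", "reference reasoning here")
print(judge_reasoning(lambda p: "PASS\nThe logic follows the cited sections.", prompt))
```

Keeping the judge behind a callable also makes the parsing logic testable without a model call.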

The pass rate started in the low 50s (percent) and ended up in the low 90s, varying by policy type. Most of that progress came from improving ingestion and retrieval.

120 questions across 6 documents is not a large test set, and there would be little point in over-optimizing for a higher rate. But it is a good enough signal to tell whether a change actually helped or just looked like it did.

For the broader product picture, architecture, diagrams, and the live app, see the AMICOVERED case study.