6 May 2026

Debugging journey: from naive RAG to an agentic loop

How AMICOVERED’s architecture moved from a standard RAG stack to a staged pipeline, then to structure-aware chunking, and finally to an agentic retrieval loop—driven by two different kinds of failure.

AMICOVERED is a coverage assistant, a tool that reads a user’s Swiss private insurance policy and answers natural-language questions about what’s covered.

I built AMICOVERED to put RAG into practice. It looked like a known problem with a known solution… A couple of months later I’d rewritten the core architecture twice and ended up somewhere I didn’t expect.

My first version was a fairly standard RAG pipeline, with some preprocessing of the question and reranking of the retrieved chunks. The recipe: chunk the general conditions and the policy certificate on paragraph boundaries, embed the chunks with a multilingual model, run a semantic search at question time, and hand the top-k chunks to the model along with the question.
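A minimal sketch of that retrieval path, under loud assumptions: the bag-of-words `embed` function is only a stand-in for the multilingual embedding model, the policy text is invented, and all function names are hypothetical. Question preprocessing and reranking are left out.

```python
import math
import re
from collections import Counter


def chunk_paragraphs(text: str) -> list[str]:
    """Split a document on blank lines, i.e. on paragraph boundaries."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector. Assumption: stands in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Invented policy text, for illustration only.
POLICY = """Personal laptops are covered. Work laptops are not covered.

Covered perils are fire, water damage and burglary.

Premiums are due annually."""

chunks = chunk_paragraphs(POLICY)
hits = top_k("Is my personal laptop covered?", chunks, k=1)
```

Note that nothing in this path asks whether the event itself is a covered peril: the closest-matching chunk wins, whatever role it plays in the policy.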

On simple questions it worked. On real ones it didn’t.

A household-insurance question like “am I covered if my personal laptop screen gets damaged by sunlight?” would come back as covered. The model found a clause stating that personal laptops are covered (as opposed to work laptops), and stopped there. The clause was correct, but it’s about which laptops are covered once an event applies. The model should have checked first whether gradual damage by sunlight is a covered peril (it isn’t). The answer was confident, plausibly cited, and wrong.

My first reaction was to divide responsibilities and impose structure. If the model skipped steps, I’d force the steps in. The next version was an eight-stage pipeline that encoded how I thought a claims handler reasons, each step with a clear responsibility: is the event in the list of covered perils, is the object covered, does any exclusion apply, are any conditions or obligations unmet, and so on. Each stage had its own retrieval, its own prompt, and a simple, clear goal.
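The shape of such a fixed pipeline can be sketched as below. This is an illustration, not the actual implementation: it includes only the four checks named above (the remaining stages are not spelled out here), and `Stage`, `run_pipeline`, the query strings, and the two stub callables are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    query: str   # fixed retrieval query for this stage
    prompt: str  # stage-specific instruction for the model


# Only the four checks named in the text; query strings are invented.
STAGES = [
    Stage("peril", "covered perils", "Is the event in the list of covered perils?"),
    Stage("object", "covered objects", "Is the object covered?"),
    Stage("exclusions", "exclusions", "Does any exclusion apply?"),
    Stage("conditions", "conditions and obligations",
          "Are any conditions or obligations unmet?"),
]


def run_pipeline(
    question: str,
    retrieve: Callable[[str], list[str]],
    ask_model: Callable[[str, str, list[str]], str],
) -> dict[str, str]:
    """Run every stage in order: one retrieval and one model call per stage."""
    findings: dict[str, str] = {}
    for stage in STAGES:
        evidence = retrieve(stage.query)  # same query regardless of the question
        findings[stage.name] = ask_model(stage.prompt, question, evidence)
    return findings


# Stubs standing in for the vector search and the LLM call.
calls: list[str] = []


def fake_retrieve(query: str) -> list[str]:
    return [f"chunk about {query}"]


def fake_model(prompt: str, question: str, evidence: list[str]) -> str:
    calls.append(prompt)
    return "unknown"


findings = run_pipeline("Is cash stolen in a burglary covered?", fake_retrieve, fake_model)
```

The point to notice is that the retrieval query for each stage is fixed in code and does not depend on the question being asked.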

The new pipeline caught the kind of failure described above. It didn’t catch the next one.

A question about “cash stolen from an apartment during a burglary” would come back as not covered. The retrieval pulled a chunk with an explicit exclusion of cash. This time the system hadn’t skipped any step. The peril is covered (burglary), and the exclusion was real, but it belonged to a different peril (“simple theft”), where cash is excluded. Nothing in the chunk itself told the model where the exclusion belonged.

That was a chunking problem, not a reasoning one. Policy documents have a structure (A. General provisions, D. Perils, D.7 Simple theft, D.8 Burglary), and the meaning of any given chunk depends on where in that structure it sits. The solution was to extract the document’s own section tree at ingestion time and tag every chunk with its section path. An exclusion chunk now carries the information that it belongs to, say, D.7 (simple theft), and can be filtered out when the question is about burglary.
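As a rough sketch of section-path tagging: the heading regex below is a hypothetical heuristic for headings shaped like “D. Perils” or “D.7 Simple theft”, the sample text is invented, and real policy PDFs would likely need a sturdier parser.

```python
import re
from dataclasses import dataclass

# Hypothetical heading heuristic: "A. Title" or "D.7 Title".
HEADING = re.compile(r"^([A-Z](?:\.\d+)+|[A-Z](?=\.))\.?\s+(.+)$")


@dataclass(frozen=True)
class Chunk:
    section_path: tuple[tuple[str, str], ...]  # ((id, title), ...) from root to leaf
    text: str


def tag_chunks(lines: list[str]) -> list[Chunk]:
    """Walk the document once, tracking the current section path for each body line."""
    path: list[tuple[str, str]] = []
    chunks: list[Chunk] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = HEADING.match(line)
        if m:
            sec_id, title = m.group(1), m.group(2)
            depth = sec_id.count(".") + 1  # "D" -> 1, "D.7" -> 2
            path = path[: depth - 1] + [(sec_id, title)]
        else:
            chunks.append(Chunk(tuple(path), line))
    return chunks


def in_section(chunks: list[Chunk], sec_id: str) -> list[Chunk]:
    """Keep only chunks that sit in the given section or below it."""
    return [c for c in chunks if any(s == sec_id for s, _ in c.section_path)]


# Invented sample text, for illustration only.
DOC = """A. General provisions
This contract covers household contents.
D. Perils
D.7 Simple theft
Cash is excluded.
D.8 Burglary
Theft by forced entry into the home is covered.
"""

chunks = tag_chunks(DOC.splitlines())
burglary_chunks = in_section(chunks, "D.8")
```

With the path attached, the cash exclusion is visibly a D.7 chunk, so a search scoped to D.8 never returns it.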

That solved the misapplied-exclusion problem. More importantly, it triggered a realization: if the system knew which section to search for an exclusion, the wrong exclusion would never end up among the evidence in the first place. But the pipeline’s searches were deterministic: one fixed query per step. How would I know, for any given question, which section to look in?

The fix was to let the model decide what to search for, and where (agentic retrieval). Claude Sonnet already sat behind each pipeline stage, so switching models wasn’t the breakthrough. What changed was orchestration: one Sonnet conversation with tools, in which the model chooses on each turn which retrieval action to run, instead of the code marching through eight fixed stages with predetermined queries. The model still had the same claims-handling checklist to work through and the section tree for document structure. It had four tools: general semantic search across the general conditions or the policy certificate, semantic search scoped to a specific section of either document, a way to discard irrelevant results, and a termination tool that ends the loop when the model has enough information to make a decision.
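The control flow of such a loop might look like the sketch below. Everything here is an assumption made for illustration: the tool names, the `(tool, args)` action format, the tiny corpus, and above all the scripted `choose_action`, which stands in for the model’s per-turn tool choice and replays a fixed list of actions.

```python
from typing import Callable, Optional

Action = tuple[str, dict]  # hypothetical format: (tool name, arguments)

# Invented two-section corpus, keyed by section id.
CORPUS = {
    "D.7": ["D.7 Simple theft: cash is excluded."],
    "D.8": ["D.8 Burglary: theft by forced entry is covered."],
}


def search(query: str, section: Optional[str]) -> list[str]:
    """Keyword match standing in for semantic search; section=None searches everything."""
    sections = [section] if section else list(CORPUS)
    words = query.lower().split()
    return [t for s in sections for t in CORPUS[s] if any(w in t.lower() for w in words)]


def run_loop(
    question: str,
    choose_action: Callable[[str, list[str]], Action],
    max_turns: int = 4,
) -> tuple[str, list[str]]:
    """Let the chooser pick one tool per turn until it finishes or the cap is hit."""
    evidence: list[str] = []
    for _ in range(max_turns):
        tool, args = choose_action(question, evidence)
        if tool == "search_all":
            evidence += search(args["query"], None)
        elif tool == "search_section":
            evidence += search(args["query"], args["section"])
        elif tool == "discard":
            evidence = [e for e in evidence if e not in args["items"]]
        elif tool == "finish":
            return args["verdict"], evidence
        else:
            raise ValueError(f"unknown tool: {tool}")
    return "undecided", evidence  # turn cap reached without a decision


# Scripted stand-in for the model; the verdict is part of the script, not a real answer.
script = iter([
    ("search_all", {"query": "cash"}),
    ("discard", {"items": ["D.7 Simple theft: cash is excluded."]}),
    ("search_section", {"query": "burglary", "section": "D.8"}),
    ("finish", {"verdict": "covered"}),
])

verdict, evidence = run_loop(
    "Is cash stolen from an apartment during a burglary covered?",
    lambda question, ev: next(script),
)
```

In a real system the chooser would be the LLM conversation itself, with the accumulated evidence and earlier reasoning kept in context between turns.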

It worked, and it was cheaper. The fixed pipeline made 8 LLM calls per question by design. After monitoring and tuning, the agentic loop was capped at 4 turns, and it often used fewer in practice. Watching those runs also surfaced a side effect I hadn’t predicted: the loop sometimes reconsiders. On some questions the model concludes one thing on an early turn, and then on a later turn (sometimes without any new retrieval, just looking again at the evidence and previous reasoning already in context) revises its own earlier conclusion. The behavior looks like a person re-reading a clause and saying “wait, this provision actually applies here.” It was the first thing that made me feel the loop was doing something structurally different from the pipeline, not the same work rearranged.

The naive RAG version assumed the model already had the judgment and just needed documents handed to it. The fixed pipeline went too far the other way and tried to encode the work as a sequence of steps: the surface of what a claims handler does, not the substance. A real claims handler doesn’t walk every section of the general conditions on every claim. They have enough domain knowledge to know which sections matter for a given event, and enough honesty to revise when a first read turns out wrong.

The agentic loop, almost by accident, reproduces that. Not the checklist, but the judgment about where to look, and the willingness to change its mind.

None of this explains how I knew each change was a general improvement, rather than just a different way to be confidently wrong. That’s mostly evaluation: a regression battery, how answers get scored, and what has to pass before a change ships. I wrote that up in a follow-up article.

For the broader product picture—architecture, diagrams, and the live app—see the AMICOVERED case study.