2025 — present · Zürich

AMICOVERED — Coverage Q&A for Swiss insurance policies

Solo product owner + engineer · AMICOVERED

A production consumer app that reads a user's own Swiss insurance policy and returns cited, structured coverage answers via an agentic-RAG architecture on Claude Sonnet.

amicovered.ch

Context

AMICOVERED is a live consumer web app at amicovered.ch that answers “am I covered?” questions about Swiss private insurance. Users upload their own policy certificate and general conditions — the two-document structure used across the main Swiss private insurance lines — and ask natural-language questions. The UI is localised to German, French, Italian, and English, but the Q&A itself crosses languages: a policy written in German can be queried in Spanish and answered in Spanish, with citations back into the German source. The system returns a structured answer (yes / no / partial / unclear) with citations back to specific sections of the user’s own contract.

Insurance Q&A is a tempting but dangerous LLM use case. A hallucinated coverage answer is worse than no answer, which makes grounding in the user’s own document, explicit citations, and fail-loud behavior non-negotiable — they are the product, not features of it.

It’s a portfolio product, built solo end-to-end: intake, async document processing, agentic reasoning, billing, the mobile app, Swiss data-protection posture. Real users, real payment flow, no team, no funding.

How it works

The architecture turns on one split: expensive reasoning happens at upload and at ask, but never on the same schedule.

Write-time and read-time split. The write-time pipeline ingests a PDF upload asynchronously: a Haiku-powered pre-analysis detects document type and language, an RQ parser worker transforms the document into a section tree with chunks, and an embedding worker turns those chunks into vectors, landing in ChromaDB and Postgres — the shared data layer, keyed by user_id / policy_id. The read-time pipeline runs synchronously: a question is normalized by Haiku into a framed question, then enters a Claude Sonnet tool-use loop (four retrieval tools, at most four turns) that retrieves from the same storage to produce a structured answer with citations.

[Diagram: the asynchronous, bursty write-time pipeline and the synchronous read-time pipeline, meeting at the shared ChromaDB/Postgres data layer.]
Expensive reasoning happens at upload and at ask, but never on the same schedule.

Write-time is asynchronous and bursty. When a user uploads, an RQ worker extracts a nested section tree (section number, title, content, page) from the PDF, chunks it so every chunk carries its own section_path, sends the chunks to a separate embedding worker, and writes them into ChromaDB keyed by user_id and policy_id. Classification calls during intake use Claude Haiku.

Read-time is synchronous. This is the agentic-RAG side of the system: a single Claude Sonnet conversation runs a tool-use loop with four tools — semantic search across the general conditions, semantic search across the policy certificate, direct retrieval of a named section from the tree, and per-chunk feedback to discard irrelevant results. The loop terminates when the model calls done_gathering, or at a hard cap of four turns. Across turns, only the system prompt and the most recent tool result carry cache_control: ephemeral — so subsequent turns don’t re-read accumulated context at full cost.
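The shape of that loop can be sketched as a plain driver around the Anthropic Messages API. This is a minimal sketch under stated assumptions, not the production code: the `run_agent_loop` name, the dispatch table, and the model id are illustrative; only `done_gathering`, the four-turn cap, and the ephemeral cache marking of the latest tool result come from the description above.

```python
import json

MAX_TURNS = 4  # hard cap on tool-use iterations

def run_agent_loop(client, question, tools, dispatch):
    """Drive a Claude tool-use loop until the model signals done_gathering
    or the turn cap is reached. `client` follows the Anthropic Messages API
    shape; `dispatch` maps tool name -> callable that runs the retrieval."""
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_TURNS):
        response = client.messages.create(
            model="claude-sonnet-4-5",   # illustrative model id
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
        tool_calls = [b for b in response.content if b.type == "tool_use"]
        # Exit when the model stops calling tools or signals it is done.
        if not tool_calls or any(c.name == "done_gathering" for c in tool_calls):
            break
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": c.id,
             "content": json.dumps(dispatch[c.name](**c.input))}
            for c in tool_calls
        ]
        # Only the most recent tool result carries the ephemeral cache marker,
        # so later turns don't re-read accumulated context at full cost.
        results[-1]["cache_control"] = {"type": "ephemeral"}
        messages.append({"role": "user", "content": results})
    return messages
```

With a real client, the returned transcript then feeds the final structured-answer generation; the loop itself stays model-agnostic because termination is decided by the `done_gathering` signal, not by inspecting retrieval quality in code.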

Agentic reasoning loop. A user question enters a Claude Sonnet conversation. The model calls any of four tools — semantic search over the general conditions, semantic search over the policy certificate, direct section lookup, and drop-irrelevant-results feedback — iterating up to four turns. When it signals done_gathering, a structured answer containing decision, reasoning, and citations is emitted.

[Diagram: user question → Claude Sonnet tool-use loop over the four tools (≤ 4 turns) → done_gathering → structured answer with decision, reasoning, and citations.]
Agentic reasoning loop — model, tools, iteration cap, and exit condition.

The whole system runs on a single Hetzner VPS as seven Docker containers behind Traefik: React + Vite frontend, FastAPI backend, two Python workers for document processing and embeddings, Postgres, ChromaDB, and Redis. Firebase for auth, Stripe for credit packs, an append-only credit ledger with FIFO lot expiration for billing.
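The billing mechanics deserve a sketch of their own. The following is a hypothetical reconstruction of an append-only ledger with FIFO lot expiration, not the shipped schema: the `Entry` shape, the one-year lot lifetime, and the lot-id scheme are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

LOT_TTL = timedelta(days=365)  # assumed lot lifetime

@dataclass(frozen=True)
class Entry:
    """One append-only ledger row: positive delta = purchased credit lot,
    negative delta = spend recorded against a specific lot."""
    delta: int
    lot_id: str
    created_at: datetime

def spend(ledger, cost, now):
    """Consume `cost` credits from the oldest unexpired lots first (FIFO),
    appending negative entries rather than mutating balances. In a real
    system this append would run inside a database transaction."""
    # Live balance per lot, in ledger (i.e. purchase) order.
    lots = {}
    for e in ledger:
        if e.delta > 0 and now - e.created_at < LOT_TTL:
            lots[e.lot_id] = lots.get(e.lot_id, 0) + e.delta
        elif e.delta < 0 and e.lot_id in lots:
            lots[e.lot_id] += e.delta   # prior spends against a live lot
    remaining = cost
    for lot_id, balance in lots.items():
        if remaining == 0:
            break
        take = min(balance, remaining)
        if take > 0:
            ledger.append(Entry(-take, lot_id, now))
            remaining -= take
    if remaining:
        raise ValueError("insufficient credits")
```

The append-only shape is what makes the ledger auditable: the current balance is always derivable by replaying entries, and expiration is a read-time filter rather than a destructive update.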

Key decisions

Agentic RAG over a fixed-stage RAG pipeline. I started with an eight-stage linear pipeline — chunk filtering, chunk labeling, activation detection, answer generation — that encoded how I thought a claims handler should reason. Multi-section questions broke it immediately. “Is theft away from home covered if I have only the BASIC tier?” needs both the policy certificate (which tier is active) and the peril section of the general conditions (what the tier actually covers), and no filter stage I could write was simultaneously strict enough to stay focused and greedy enough to catch both. Replacing the pipeline with a Sonnet loop that decides for itself when to search, when to drop results, and when to stop produced better citations and was less code. Trade-off accepted: non-deterministic call count per question, bounded by a four-turn cap.

Rule-graph extraction at write time as the alternative — and why I didn’t ship it. The other architectural path I seriously explored was to transform each policy at ingestion time into a structured rule graph: perils, exclusions, conditions, and their interrelationships represented as data, evaluated deterministically at query time with little or no LLM reasoning in the hot path. It is a knowledge-graph approach closer to symbolic reasoning than to GraphRAG: the graph is evaluated deterministically rather than used as an LLM retrieval surface. Cheap, fast, auditable. I didn’t take it because in a B2C deployment each policy is uploaded by one user, so the heavy ingestion cost is never amortized. In a B2B deployment inside an insurer — where the same general conditions are the insurer’s own published documents, ingested once and reused across thousands of customers — the math flips, and the rule-graph approach becomes the right one. This is part of why I think the product’s center of gravity is B2B.
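For concreteness, deterministic evaluation over such a graph might look like the following. This is a sketch of the road not taken: the `Rule` fields, the situation-tag scheme, and the mapping onto yes / no / partial / unclear are illustrative assumptions, not an extracted design.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    """One node of a policy rule graph: a peril covered by certain tiers,
    narrowed by exclusions and conditions. All data, no LLM in the hot path."""
    peril: str
    tiers: set
    exclusions: set = field(default_factory=set)  # situation tags that void cover
    conditions: set = field(default_factory=set)  # situation tags that must hold

def evaluate(rules, peril, active_tier, situation):
    """Deterministic coverage decision: no model call at query time."""
    rule = next((r for r in rules if r.peril == peril), None)
    if rule is None:
        return "unclear"                      # peril not represented in the graph
    if active_tier not in rule.tiers:
        return "no"                           # tier doesn't include this peril
    if rule.exclusions & situation:
        return "no"                           # an exclusion applies
    if not rule.conditions <= situation:
        return "partial"                      # covered in principle, condition unmet
    return "yes"
```

The appeal is obvious: every decision is reproducible and auditable. The cost is equally obvious: the extraction step that builds the graph has to be near-perfect, which is exactly the expensive part that only amortizes when one set of general conditions serves many customers.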

Hetzner VPS + Docker Compose over autoscaling cloud. AMICOVERED is a fully functional product with real users, real billing, and real Swiss data-protection compliance, but it isn’t a startup chasing growth — it’s a portfolio product with no expectation of bursty traffic. ECS, Lambda, or Kubernetes would have been dramatically more expensive for no near-term benefit. A single VPS running the full Compose stack is cheap, predictable, and easy to reason about. The architecture deliberately doesn’t foreclose scale: the four runtime containers are independently image-tagged and CI-pushed to GHCR, so migrating any of them to ECS Fargate or Kubernetes would be a deployment change, not an application rewrite. (I started on AWS Lightsail; Hetzner was cheaper for the same resources.)

Python + FastAPI — in hindsight, the less obvious call. The document pipeline is Python-native: unstructured, pdfminer.six, PyMuPDF, pytesseract, sentence-transformers. Keeping the stack in one language pulled me toward FastAPI. What I didn’t weigh enough was that I had no prior Python experience coming in, and the project turned into a simultaneous language-and-framework learning exercise. Today I’d likely pick Node: the ecosystem needs are reachable there too, just less conveniently, and the learning-curve tax slowed early iteration more than I predicted.

Evaluation

Every architectural change — pipeline to agentic, chunking changes, prompt tweaks, model or version bumps — runs against a 120-question regression battery before shipping. The battery covers home, liability, auto, health-supplement, travel, and accident-supplement policies across the four expected answer types.

Scoring evaluates three dimensions per answer: correctness (yes / no / partial / unclear matches expected), sources (the citations point to the same sections a correct answer should), and reasoning (the argument aligns with the expected one). The first two are deterministic checks; the third uses LLM-as-judge, because semantic alignment of reasoning can’t be checked with equality. Without the last two, a “correct” answer produced for the wrong reason passes silently — the failure mode that matters most for a coverage-answer tool: a confident, plausibly-cited wrong answer is worse than no answer at all.
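A minimal sketch of that per-answer scoring, with the LLM judge injected as a callable. The field names and payload shapes here are assumptions; only the three dimensions and the deterministic-versus-judged split come from the description above.

```python
def score_answer(answer, expected, judge):
    """Score one regression case on three dimensions.

    correctness and sources are deterministic checks; reasoning is delegated
    to `judge`, a callable wrapping the LLM-as-judge call, because semantic
    alignment of an argument can't be checked with equality."""
    correctness = answer["decision"] == expected["decision"]
    # The citations must cover every section a correct answer relies on.
    cited = {c["section"] for c in answer["citations"]}
    sources = set(expected["sections"]) <= cited
    reasoning = judge(answer["reasoning"], expected["reasoning"])
    return {"correctness": correctness, "sources": sources, "reasoning": reasoning}
```

Keeping the judge as a parameter also makes the battery cheap to dry-run: a stub judge exercises the deterministic checks without spending anything on judge calls.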

A full cycle runs around $5 in LLM spend — 120 generation calls plus a reasoning-judge call on each. Not negligible over many iterations, but worth it: given the failure mode above, I’d rather pay for evidence than make architectural decisions on intuition.

An interesting detail: section-tree-aware chunking

The ingestion pipeline doesn’t split documents into fixed-size chunks and stop there. Instead it uses a form of structure-aware, hierarchical chunking: it first extracts the document’s own table of contents — “A. General provisions”, “D.7 Theft”, “E. Exclusions” — as a nested tree where every node carries its section number, title, content, starting page, and children. Chunking then walks the tree, and every chunk ends up tagged with its section_path and section_number.

A. General provisions
B. Insured objects
C. Insured sum
D. Perils
├── D.1 Fire
├── D.7 Theft
│   ├── D.7.1 At home    ← chunk tagged section_path="D.7.1"
│   └── D.7.2 Away from home
└── D.8 Vandalism
E. Exclusions

At query time this pays for itself twice. The agent can scope a semantic search to a branch (section_filter="D.7") and skip retrievals from other parts of the document. And there’s a get_section tool that returns the full text of a named section directly from the tree with no vector search at all — which matters when a user’s question references a specific clause. Those same section identifiers are what the structured answer cites back to the reader.
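The two mechanisms can be sketched together: a depth-first walk that tags every chunk with its section_path, and a vector-free lookup of a named section. The Section fields follow the node description above (number, title, content, page, children); the path separator and the chunk dict shape are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    """One node of the extracted table-of-contents tree."""
    number: str       # e.g. "D.7.1"
    title: str
    content: str
    page: int
    children: list = field(default_factory=list)

def iter_chunks(node, path=()):
    """Walk the section tree depth-first, yielding chunks that carry
    their own section metadata for the vector store."""
    path = path + (node.number,)
    if node.content:
        yield {"text": node.content,
               "section_number": node.number,
               "section_path": "/".join(path),
               "page": node.page}
    for child in node.children:
        yield from iter_chunks(child, path)

def get_section(node, number):
    """Direct lookup of a named section: no vector search involved."""
    if node.number == number:
        return node
    for child in node.children:
        found = get_section(child, number)
        if found:
            return found
    return None
```

Because each chunk knows its own path, a branch-scoped search reduces to a simple metadata filter on section_path prefixes at retrieval time.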

Reflection

Three things worth keeping from this build.

Architectural. When the model can actually reason, fighting to constrain it into a rigid pipeline is wasted effort. I spent too long encoding how I thought a claims handler should reason before letting the model do it itself.

Distribution. Regulated-domain consumer apps hit gates that are invisible until you run into them. Both Apple and Google require a legal entity — a Swiss GmbH in practice — before approving an insurance app. The mobile app is code-complete on iOS and Android but hasn’t launched for that reason. A business decision, not a technical one, and not one I was willing to take on as a solo founder.

Market. Consumer-paid isn’t the right fit for this product. The credit model exists to prevent abuse, not to produce revenue. The real opportunity is B2B: most insurers still handle first-line coverage questions with traditional methods, and a system with this architecture — cited answers, structured output, agentic reasoning grounded in the customer’s own documents — would be more valuable inside an insurer’s workflow than as a consumer app.