
What We Actually Learned Building RAG Systems

Retrieval-Augmented Generation looked straightforward in the tutorial. In production, it was anything but. Here's what we learned the hard way.

Mal Wanstall


Retrieval-Augmented Generation seemed like the obvious solution. We had a large corpus of internal documents. People needed answers from those documents. An LLM that could retrieve relevant context and generate accurate answers felt like the perfect fit.

The prototype took two weeks. Getting it to production quality took five months. And most of those five months were spent on problems that no tutorial or blog post had warned us about.

The Prototype Was Misleadingly Good

Our first RAG prototype was impressive. We chunked a few hundred documents, embedded them, stored them in a vector database, and connected the retrieval pipeline to an LLM. The answers were good. Sometimes surprisingly good. Everyone who saw the demo wanted it immediately.

Then we tried it on our full document corpus: 15,000 documents spanning multiple formats, teams, and time periods. Performance dropped off a cliff.

The problem wasn’t the LLM. It was everything before the LLM. Retrieval quality determines answer quality, and retrieval is much harder than it looks once you move past a small, clean dataset.

Chunking Is Where Most RAG Systems Fail

Every RAG tutorial shows you how to split documents into chunks. What they don’t show you is how much the chunking strategy matters, and how domain-specific it needs to be.

We started with fixed-size chunks of 500 tokens with 50-token overlap. Standard stuff. It worked terribly for our data. A 500-token chunk from the middle of a technical specification would strip away all the context needed to understand the content. The chunk would say something about “the device” without ever specifying which device. The LLM would hallucinate the rest.
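The naive starting point can be sketched in a few lines. This is an illustrative version of fixed-size chunking with overlap, not our production code; note how a window from the middle of a document carries no surrounding context:

```python
def fixed_size_chunks(tokens, size=500, overlap=50):
    """Split a token list into fixed-size chunks with overlap.

    Anything outside the window is lost, which is exactly the
    failure mode described above: a chunk that says "the device"
    with no way to know which device.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

For a 1,000-token document this yields three chunks, the last a short remainder, and each interior chunk repeats only 50 tokens of its neighbour.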

We iterated through four different chunking strategies over two months. What we landed on was a hybrid approach: structural chunking based on document headings and sections, with metadata preserved from parent sections. Each chunk carries context about what document it came from, what section it belongs to, and what the document’s overall purpose is.
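A minimal sketch of that hybrid idea, assuming the document has already been parsed into (heading, body) sections; the field names and header format are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def structural_chunks(doc_title, purpose, sections):
    """Build one chunk per document section, carrying parent context.

    sections: list of (heading, body) pairs parsed from the document.
    Prepending document and section context means a sentence about
    "the device" is never ambiguous in isolation.
    """
    chunks = []
    for heading, body in sections:
        header = f"Document: {doc_title} | Section: {heading} | Purpose: {purpose}"
        chunks.append(Chunk(
            text=f"{header}\n\n{body}",
            metadata={"document": doc_title, "section": heading, "purpose": purpose},
        ))
    return chunks
```

The metadata dict also becomes the hook for the filtering and freshness machinery described later.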

This sounds obvious in retrospect. It wasn’t obvious at the time, and I haven’t seen a single RAG tutorial that covers it adequately.

The Retrieval Quality Problem

Vector similarity search is good at finding documents that are semantically similar to a query. It is not good at finding documents that contain the answer to a question. These are different things.

A user asks: “What is the recommended maintenance interval for Product X?” Vector search returns five chunks about Product X maintenance. Sounds great. Except two of those chunks are from an outdated manual, one is from an internal discussion that was never finalised, and the most relevant one ranks fourth because its embedding is slightly less similar to the query than the others.

The retrieval pipeline we landed on looks like this:

```mermaid
flowchart LR
    Q[User Query] --> MF[Metadata Filtering]
    MF --> VS[Vector Search]
    MF --> KW[Keyword Search]
    VS --> Merge[Merge Results]
    KW --> Merge
    Merge --> RR[Re-ranking Model]
    RR --> LLM[LLM Generation]
    LLM --> A[Answer + Sources]
```

We added three things to make retrieval work reliably.

Metadata filtering. Before vector search even runs, we filter by document type, recency, and approval status. This eliminates the “outdated manual” problem. It required us to have clean metadata, which meant a painful two-week data cleaning exercise across 15,000 documents.
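A pre-retrieval filter of this kind reduces to a predicate over chunk metadata. The field names (`doc_type`, `updated`, `approved`) are assumptions for illustration, not our actual schema:

```python
from datetime import datetime, timedelta

def metadata_filter(chunks, allowed_types, max_age_days=365, now=None):
    """Drop chunks before vector search ever runs.

    Each chunk is a dict carrying 'doc_type', 'updated', and 'approved'
    metadata. Filtering out unapproved or stale documents up front is
    what eliminates the "outdated manual" problem.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [
        c for c in chunks
        if c["doc_type"] in allowed_types
        and c["updated"] >= cutoff
        and c["approved"]
    ]
```

The code is trivial; the hard part, as noted, is having metadata clean enough for the predicate to mean anything.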

Hybrid search. We combine vector similarity with keyword matching. Some queries are best served by semantic similarity. Others, especially those asking about specific product codes or technical terms, need exact keyword matching. We run both and merge the results.
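One standard way to merge two ranked lists, sketched here as reciprocal rank fusion (a common choice, though not necessarily the exact merge we shipped):

```python
def reciprocal_rank_fusion(*ranked_lists, k=60):
    """Merge ranked result lists of document IDs.

    A document ranked highly in either list scores well; k dampens the
    dominance of the very top ranks (60 is the conventional constant).
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of RRF is that it needs no score calibration between the vector and keyword backends, only their rank orderings.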

Re-ranking. After retrieval, we run a re-ranking model that scores each chunk based on how likely it is to actually answer the question, not just how similar it is to the query. This consistently moved the best chunk from position four or five to position one. The re-ranking step added about 200ms of latency but dramatically improved answer quality.
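Structurally, the re-ranking step is a sort over a (query, chunk) scoring function. In production that function is a cross-encoder model call; here it is an injected parameter so the sketch stays self-contained:

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Re-order retrieved chunks by answer-likelihood, not query similarity.

    score_fn(query, chunk) -> float stands in for the re-ranking model;
    this is the step that moved our best chunk from position four or
    five up to position one.
    """
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Even a crude scoring function demonstrates the reordering; the real gain comes from a model trained on question-passage relevance.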

Hallucination Is Not Solved By Retrieval

One of the selling points of RAG is that it reduces hallucination by grounding the LLM in real documents. This is true in the same way that wearing a seatbelt reduces injury. It helps a lot, but it doesn’t make you invincible.

Our system still hallucinated in specific, predictable ways. When the retrieved context was insufficient to fully answer the question, the LLM would confidently fill in the gaps with plausible but incorrect information. It wouldn’t say “I don’t have enough information.” It would just make something up that sounded right.

We addressed this with explicit instructions in the system prompt to say “I don’t have enough information to answer this fully” when the context is insufficient. We also implemented a confidence scoring system based on the retrieval scores and answer consistency across multiple retrieved chunks. When confidence is low, the system tells the user it’s uncertain and points them to the source documents for manual review.
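The scoring itself can be as simple as a weighted blend. The weights and threshold below are illustrative placeholders, not our tuned production values:

```python
def confidence(retrieval_scores, agreement, threshold=0.5):
    """Blend mean retrieval score with cross-chunk answer agreement.

    retrieval_scores: per-chunk similarity scores in [0, 1].
    agreement: how consistent the answer is across chunks, in [0, 1].
    Returns (confidence, flag_as_uncertain). When flagged, the system
    tells the user it is uncertain and points at the source documents.
    """
    if not retrieval_scores:
        return 0.0, True
    mean_score = sum(retrieval_scores) / len(retrieval_scores)
    conf = 0.6 * mean_score + 0.4 * agreement
    return conf, conf < threshold
```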

This reduced but did not eliminate hallucination. We still see it in roughly 3-4% of responses. For internal use cases with knowledgeable users who can spot errors, this is acceptable. For customer-facing applications, it wasn’t, and we added a human review step for those.

The Document Freshness Problem

Documents change. New versions replace old ones. Policies get updated. Products get revised. If your RAG system doesn’t handle this, users will get answers based on outdated information and trust in the system erodes fast.

We built an ingestion pipeline that watches for document changes and re-processes affected chunks. This sounds straightforward, but handling partial updates, version conflicts, and deleted documents added significant complexity. We also had to handle the case where someone asks about a topic that spans multiple documents with different update dates.

Our current approach: each chunk carries a “last verified” timestamp. The system weights recent chunks higher and flags when an answer relies primarily on documents older than six months. Not perfect, but it prevents the worst failure mode of confidently serving stale information.
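A sketch of that weighting, assuming each chunk carries its "last verified" timestamp; the exponential decay curve and six-month half-life are illustrative stand-ins for whatever weighting you tune:

```python
from datetime import datetime

def apply_freshness(scored_chunks, now=None, half_life_days=180):
    """Down-weight retrieval scores by chunk age and flag stale answers.

    scored_chunks: list of (score, last_verified datetime) pairs.
    Returns (weighted_scores, stale_flag); stale_flag is True when no
    supporting chunk was verified within the half-life window, which is
    the cue to warn the user the answer rests on old documents.
    """
    now = now or datetime.now()
    weighted, stale = [], True
    for score, verified in scored_chunks:
        age_days = (now - verified).days
        weighted.append(score * 0.5 ** (age_days / half_life_days))
        if age_days <= half_life_days:
            stale = False
    return weighted, stale
```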

Evaluation Is the Whole Game

The single most important investment we made was building a robust evaluation framework before we started optimising. We curated a set of 500 question-answer pairs, reviewed by domain experts, covering the range of questions users actually ask.

Every change we make to the system, whether it’s a chunking strategy update, a prompt modification, or a re-ranking model swap, gets evaluated against this test set. We measure retrieval precision, answer accuracy, hallucination rate, and latency. Without this framework, we would have been tuning blind.
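The harness around such a test set can be very small. This is a simplified sketch: the substring match for answer accuracy and hit-based retrieval metric are crude placeholders for the expert-reviewed scoring we actually used, and the `system` callable's shape is an assumption:

```python
def evaluate(system, test_set):
    """Score a RAG system against a curated question-answer set.

    test_set: list of dicts with 'question', 'answer', 'relevant_ids'.
    system(question) -> (answer_text, retrieved_doc_ids).
    Returns coarse retrieval and answer metrics so every change to
    chunking, prompts, or re-ranking is compared against a fixed bar.
    """
    retrieval_hits = answer_hits = 0
    for case in test_set:
        answer, retrieved = system(case["question"])
        if set(retrieved) & set(case["relevant_ids"]):
            retrieval_hits += 1
        if case["answer"].lower() in answer.lower():
            answer_hits += 1
    n = len(test_set)
    return {
        "retrieval_hit_rate": retrieval_hits / n,
        "answer_accuracy": answer_hits / n,
    }
```

Hallucination rate and latency need their own instrumentation, but even this skeleton is enough to stop tuning blind.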

Building the evaluation set took three weeks of expert time. It was the highest-ROI investment in the entire project. I would not start another RAG project without building the evaluation framework first.

What I’d Tell Someone Starting Today

Start with a small, clean, well-curated document set. Prove the concept, then expand. Don’t start with 15,000 documents like we did.

Invest in chunking strategy early and expect to iterate on it multiple times. The right chunking approach is domain-specific and there’s no universal best practice.

Build evaluation infrastructure before you optimise anything. You can’t improve what you can’t measure.

Plan for the retrieval problem to be harder than the generation problem. The LLM is the easy part. Getting the right context to the LLM is where you’ll spend most of your engineering effort.

And be honest about the limitations. RAG systems are powerful, but they’re not magic. Setting appropriate expectations with users about what the system can and can’t do reliably is just as important as the engineering work.

Enterprise AI · Data Architecture
Mal Wanstall

AI & Innovation Strategist

15+ years shipping AI products and scaling teams across financial services, NFP, and medical technology.
