RAG is Not Magic: Honest Lessons from Production Retrieval Systems
Every RAG demo looks impressive. Production RAG is a different story. Here's what actually breaks, why naive chunking destroys quality, and how I structure retrieval pipelines that hold up under real load.
RAG (Retrieval-Augmented Generation) is one of those things that looks almost trivially easy in tutorials. Embed your documents, store them in a vector DB, retrieve the top-k chunks at query time, stuff them in the prompt. Done.
Then you put it in front of real users and it starts failing in ways that are hard to diagnose and even harder to fix.
I've built RAG pipelines across a few projects — a clinical notes system at Allia Health, a buyer intelligence platform at XiQ, and a few internal knowledge bases. Here's what I've actually learned.
The Chunking Problem is Real
Most tutorials chunk documents at fixed sizes — 512 tokens, overlap of 50, move on. It works fine for demos. In production it creates a silent quality problem: your retrieved chunks often don't contain the information the user asked about, even when that information is clearly in the documents.
The issue is that fixed-size chunking doesn't respect semantic boundaries. A single paragraph explaining a drug interaction gets split between two chunks. A table gets cut in half. The embedding for each fragment is less meaningful than the embedding for the complete thought would have been.
What actually works better:
Recursive character splitting with semantic awareness. Split on paragraph breaks first, then sentences, only falling back to character limits as a last resort.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=800,
    chunk_overlap=100,
    length_function=len,
)
```

For structured documents (PDFs, reports), parse structure first. Extract headings, tables, and body text separately. A table is not a paragraph and shouldn't be embedded the same way.
Parent-child chunking. Store small chunks for retrieval precision, but return their larger parent chunk to the LLM for context. Small chunks match queries better; large chunks give the LLM enough context to actually answer.
```python
# Index small child chunks (~256 tokens) for retrieval precision;
# return the larger parent chunk (~1024 tokens) for generation.
async def retrieve_with_parent(query: str, k: int = 5):
    child_results = await vector_store.similarity_search(query, k=k * 2)
    parent_ids = {r.metadata["parent_id"] for r in child_results}
    parents = await document_store.get_many(list(parent_ids)[:k])
    return parents
```

The Embedding Model Matters More Than the Vector DB
I've seen teams spend days evaluating Pinecone vs Weaviate vs pgvector and almost no time evaluating embedding models. This is backwards.
The vector DB is table stakes — they're all fast enough, they all support filtering, the operational differences are relatively minor. The embedding model determines the quality ceiling for your entire retrieval system.
text-embedding-ada-002 is fine. text-embedding-3-large is noticeably better for complex queries. For domain-specific content (medical, legal, financial), a fine-tuned model or a domain-specific one like voyage-law-2 can be dramatically better.
Test this properly before you build anything else. Create a set of 50 representative queries with known good answers, and measure recall@k for each embedding model you're evaluating.
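That recall@k measurement can be sketched with plain NumPy, given precomputed embeddings from each candidate model. All names here are illustrative, not a real library API; the only assumption is that you can get a query-vector and doc-vector matrix out of each model you're comparing:

```python
import numpy as np

def recall_at_k(
    query_vectors: np.ndarray,          # one row per test query
    doc_vectors: np.ndarray,            # one row per corpus document
    doc_ids: list[str],
    relevant_doc_ids: list[set[str]],   # known-good doc ids per query
    k: int = 5,
) -> float:
    # Cosine similarity = dot product of L2-normalized vectors
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
    sims = q @ d.T                      # shape: (num_queries, num_docs)

    hits = 0
    for i, relevant in enumerate(relevant_doc_ids):
        top_k = {doc_ids[j] for j in np.argsort(-sims[i])[:k]}
        hits += bool(top_k & relevant)  # did any relevant doc make top-k?
    return hits / len(relevant_doc_ids)
```

Run the same 50 queries through every model and compare the scores directly; the model with the higher recall@k wins, regardless of which vector DB sits underneath.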
Reranking is Not Optional for Production
Semantic similarity is a decent first filter but a poor final ranking. Two chunks might be similarly "related" to a query but differ massively in how useful they actually are for answering it.
A cross-encoder reranker reads each (query, chunk) pair together and scores relevance much more accurately than the two-tower embedding approach. The tradeoff is speed — you can't run a cross-encoder over your entire corpus, which is why you use it as a second stage.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

async def retrieve_and_rerank(query: str, k: int = 5):
    # First stage: fast vector search, over-fetch candidates
    candidates = await vector_store.similarity_search(query, k=k * 4)

    # Second stage: accurate cross-encoder reranking on the candidates
    pairs = [(query, c.page_content) for c in candidates]
    scores = reranker.predict(pairs)

    # Sort by score only — on tied scores, Python would otherwise try
    # to compare the chunk objects themselves and raise a TypeError
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Cohere's Rerank API is the easiest drop-in if you don't want to self-host. The quality difference is significant enough that it's usually worth the added latency (~100–200ms).
Query Transformation Fixes the "No Results" Problem
Users don't phrase queries the way documents are written. A user asking "what's the max dose?" won't match a document that says "the maximum recommended daily dosage is..." — at least not reliably.
A few techniques that help:
HyDE (Hypothetical Document Embeddings). Ask the LLM to generate what a good answer document would look like, embed that, and use it for retrieval instead of the raw query. Sounds weird, works surprisingly well.
```python
async def hyde_retrieve(query: str, k: int = 5):
    # Generate a hypothetical answer document
    hypothetical = await llm.complete(
        f"Write a short paragraph that directly answers: {query}"
    )
    # Embed the hypothetical answer, not the raw query
    embedding = await embed(hypothetical)
    return await vector_store.similarity_search_by_vector(embedding, k=k)
```

Multi-query retrieval. Generate 3–5 rephrasings of the user's query, retrieve for each, deduplicate, merge. More tokens but significantly better recall.
Evaluation: The Part Everyone Skips
Here's the most important thing I can tell you about production RAG: if you're not evaluating it systematically, you don't know if it works.
Vibe checks are not enough. You need a dataset of questions with known ground-truth answers, and you need to measure:
- Context recall: is the right information present in the retrieved chunks?
- Answer faithfulness: is the LLM's answer grounded in the retrieved context, or is it hallucinating?
- Answer relevance: does the answer actually address the question?
RAGAS is a good library for this. It automates most of these metrics using an LLM-as-judge approach, which isn't perfect but is fast enough to run in CI.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(results)
# {'faithfulness': 0.82, 'answer_relevancy': 0.91, 'context_recall': 0.74}
```

A context recall of 0.74 means that 26% of the time, the right information wasn't in the retrieved chunks. That's a chunking or retrieval problem, not an LLM problem. This kind of measurement tells you exactly where to focus.
RAG is not magic. It's an information retrieval problem wrapped around an LLM. The retrieval part is the hard part, and it rewards the same careful engineering discipline as any other data pipeline.