Chapter 7: RAG Systems
retrieval, embeddings, vector search, chunking, reranking, hybrid search, semantic search, HNSW, ColBERT, knowledge grounding
Introduction
In the summer of 2022, a team at a major technology company faced a problem. They had deployed a large language model to answer questions about their internal documentation—thousands of pages covering engineering practices, HR policies, and product specifications. The model was impressive: articulate, confident, and completely unreliable. It would cite policies that didn’t exist, invent employee benefits, and present plausible-sounding procedures that contradicted official guidelines. The fundamental issue was simple: the model’s knowledge was frozen at training time, and it had never seen the company’s documents.
Retrieval-Augmented Generation (RAG) emerged as the solution to this problem. Instead of relying solely on knowledge baked into model parameters during training, RAG systems retrieve relevant documents at inference time and include them directly in the prompt. The model generates responses grounded in specific, verifiable sources rather than interpolating from training data.
The idea is elegantly simple: look things up before answering. But the execution is nuanced. Building a RAG system that works reliably—that retrieves the right information, presents it effectively, and generates accurate responses—requires understanding a stack of interconnected techniques: document processing, embedding models, vector search, reranking, and prompt engineering. Each component has its own tradeoffs, and weaknesses in any link can break the chain.
This chapter takes you from the foundations of RAG to production-ready systems. We’ll examine not just how each component works, but why it works—the mathematical and conceptual foundations that explain when techniques succeed and when they fail. By the end, you’ll understand RAG deeply enough to debug failures, choose between alternatives, and push the boundaries of what’s possible.
The Core Insight: Why RAG Works
To understand RAG, you need to understand why language models sometimes fail at knowledge-intensive tasks.
A language model like GPT-5 or Claude Opus contains vast amounts of knowledge encoded in its parameters—facts about the world, patterns of language, reasoning strategies. But this knowledge has fundamental limitations:
Temporal cutoff: The model only knows what existed in its training data. Events after training are invisible.
Coverage gaps: Training data, however massive, doesn’t include everything. Proprietary documents, recent publications, niche domains—all are absent.
Parametric compression: Knowledge stored in parameters is lossy. The model learns statistical patterns, not a perfect database. Details blur, facts merge, and the model may “remember” something that’s a plausible interpolation rather than a precise fact.
Confidence without calibration: Models generate confident-sounding text regardless of whether they actually know the answer. They’re trained to produce fluent completions, not to say “I don’t know.”
RAG addresses these limitations by shifting from parametric knowledge (stored in model weights) to non-parametric knowledge (retrieved from external sources). When a user asks a question, the system:
- Retrieves relevant documents from a knowledge base
- Augments the prompt with these documents as context
- Generates a response grounded in the retrieved information
This is powerful because retrieval is exact: we can guarantee the model sees the current version of a document, include proprietary information without retraining, and update knowledge by simply updating the document store. The model becomes a sophisticated reader and synthesizer rather than a unreliable memory.
A Mental Model for RAG
Think of RAG as giving a language model an open-book exam. Without RAG, the model must answer from memory—and like a student who crammed but didn’t deeply learn the material, it may confuse details, conflate topics, or confidently assert things that aren’t quite right. With RAG, the model can flip through a textbook, find relevant passages, and compose an answer that synthesizes what it reads.
The key challenges then become:
- Finding the right pages: Given a question, which documents are relevant?
- Reading efficiently: With limited space in the prompt, what to include?
- Synthesizing accurately: How to ensure the model uses the documents faithfully?
These map to the core RAG components: retrieval, context assembly, and generation with grounding.
When to Use RAG vs Fine-tuning
Before diving into RAG implementation, understand when RAG is the right choice:
Choose RAG when: Knowledge updates frequently, citations needed, no training budget, quick deployment required.
Choose Fine-tuning when: Static knowledge, need specific style/behavior, latency critical, knowledge fits in training data.
Use both when: Large knowledge base with specific format requirements.
What You’ll Learn
- How embedding models encode semantic meaning, and why this enables similarity search
- Chunking strategies and the tradeoffs that govern them
- Vector search algorithms: exact, approximate, and hybrid approaches
- Reranking to improve precision when initial retrieval is noisy
- Advanced patterns: query transformation, self-reflection, multi-hop reasoning
- Evaluation methodologies that diagnose where systems fail
Prerequisites
- Understanding of embeddings and attention mechanisms (Chapter 5)
- Familiarity with prompt engineering principles (Chapter 6)
- Basic database concepts (indexes, queries, consistency)
- Python programming
The RAG Architecture
The Two-Phase Design
RAG systems operate in two distinct phases with different computational constraints and optimization objectives.
Indexing (offline): Process your document corpus once, preparing it for fast retrieval. This phase can be compute-intensive because it runs ahead of user requests.
Documents → Parsing → Chunking → Embedding → Vector Index
↓ ↓ ↓ ↓
Clean text Segments Dense vectors Searchable store
Query (online): For each user request, find relevant context and generate a response. This phase must be fast—users expect responses in seconds.
Query → Embedding → Vector Search → Reranking → Context Assembly → Generation
↓ ↓ ↓ ↓ ↓ ↓
User Dense Candidate Precise Prompt Response
question vector documents top-k construction with citations
Let’s trace through a concrete example to build intuition.
Scenario: A legal firm has 50,000 contract documents. A lawyer asks: “What are the standard termination clauses for vendor agreements?”
Indexing (done once): 1. Parse all 50,000 contracts, extracting text from PDFs 2. Split each contract into chunks (perhaps by clause or paragraph) 3. Generate an embedding vector for each chunk using a model like E5 or BGE 4. Store vectors in a searchable index (e.g., Qdrant, Pinecone) 5. Store chunk text and metadata (contract name, date, type) alongside vectors
Query (every request): 1. Embed the query “What are the standard termination clauses for vendor agreements?” using the same model 2. Search the vector index for the 50 most similar chunk vectors 3. Rerank these 50 using a more powerful cross-encoder model, keeping top 5 4. Assemble the top 5 chunks into a prompt 5. Send to LLM: “Based on these contract excerpts, explain standard termination clauses for vendor agreements” 6. Return the response with citations to specific contracts
The magic happens in step 2: because the query and documents are embedded by the same model into the same vector space, we can find semantically similar content even when the words don’t match exactly. “Termination clauses” finds documents mentioning “early exit provisions” or “contract cancellation terms.”
Why Vectors Enable Semantic Search
Traditional keyword search (like Elasticsearch’s BM25) matches documents containing query terms. It works well when users know the exact vocabulary in the documents. But it fails on semantic mismatches:
- Query: “how to cancel my subscription” → Document uses: “membership termination procedure”
- Query: “vacation policy” → Document uses: “paid time off guidelines”
- Query: “machine learning engineer salary” → Document uses: “ML practitioner compensation”
Embedding models solve this by mapping text to points in high-dimensional space where semantic similarity corresponds to geometric proximity. This is the foundational insight of modern NLP.
How does this work? Embedding models (like E5, GTE, or OpenAI’s text-embedding-3) are trained on massive datasets of text pairs that are semantically related—question-answer pairs, paraphrase pairs, document-summary pairs. Through this training, the model learns to:
- Place synonyms near each other (“cancel” ≈ “terminate” ≈ “end”)
- Group related concepts (“subscription,” “membership,” “account” cluster together)
- Separate unrelated meanings (the word “bank” in financial vs. river contexts gets different vectors)
When we embed a query and documents with the same model, we can find documents whose embeddings are geometrically close to the query embedding—meaning they’re semantically related, even without word overlap.
This is why embedding model choice matters so much: the model’s training determines what “similar” means. A model trained on scientific papers will have different similarity judgments than one trained on customer support conversations.
1970s (TF-IDF): Term Frequency-Inverse Document Frequency became the foundation of search. Documents were represented as sparse vectors of word counts, weighted by rarity. Simple, interpretable, and still used today—but purely lexical with no understanding of meaning.
1990s (LSA/LSI): Latent Semantic Analysis applied SVD to term-document matrices, discovering latent “topics.” First attempt at semantic similarity, but computationally expensive and limited by linear algebra assumptions.
2013 (Word2Vec): Mikolov’s word embeddings revolutionized NLP by learning dense vectors where similar words clustered together. “King - Man + Woman ≈ Queen” captured something like meaning. But averaging word vectors for documents lost crucial information.
2017-2018 (Transformer Encoders): BERT and its variants produced contextual embeddings—the same word got different vectors depending on context. Sentence-BERT (2019) adapted this for efficient sentence embedding, enabling practical semantic search.
2020-2022 (Contrastive Learning): Models like E5, GTE, and BGE used contrastive training on massive text pairs to create embedding models optimized specifically for retrieval. These dominate modern RAG systems.
2023-2024 (Hybrid & Learned): Hybrid search combining BM25 with dense retrieval became standard. ColBERT’s late-interaction paradigm and learned sparse encoders (SPLADE) blur the line between lexical and semantic approaches.
The pattern: Retrieval evolved from exact word matching to statistical co-occurrence to learned semantic similarity. Each generation traded interpretability for capability, but the core goal remained: finding relevant information efficiently.
Document Processing and Chunking
“When I review RAG implementations, I first check chunk size and overlap. 90% of the retrieval problems I see come from chunks that are too small or don’t overlap. Everyone reads the tutorials saying ‘256 tokens is optimal’ and blindly copies it. Then they wonder why their legal document Q&A system returns sentence fragments that make no sense.”
— Principal Engineer at legal tech startup
The Chunking Problem
Here’s a fundamental tension in RAG: embedding models and LLM context windows have limited capacity, but documents can be arbitrarily long. A 100-page contract cannot be embedded as a single unit—the embedding model would truncate or struggle, and even if it worked, a single vector for 100 pages would lose detail. We must split documents into smaller chunks.
But chunking introduces its own problems:
- Semantic fragmentation: Splitting at arbitrary boundaries can divide a coherent thought across chunks, losing meaning
- Context loss: A chunk may reference “the policy described above”—but “above” is in a different chunk
- Size variability: Very small chunks may lack context; very large chunks may dilute the signal
Good chunking strategies navigate these tradeoffs for your specific use case.
Fixed-Size Chunking: Simple But Fragile
The simplest approach splits text at regular token intervals:
def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
"""Split text into fixed-size chunks with overlap."""
tokens = tokenizer.encode(text)
chunks = []
for start in range(0, len(tokens), chunk_size - overlap):
end = start + chunk_size
chunks.append(tokenizer.decode(tokens[start:end]))
return chunksWhy overlap? Without overlap, a relevant sentence split across chunk boundaries might not match any single chunk well enough to be retrieved. Overlap ensures boundary content appears in at least one chunk.
When fixed-size works: Homogeneous documents where any segment is roughly self-contained—news articles, blog posts, social media. The simplicity is valuable: predictable memory usage, easy to reason about.
When it fails: Structured documents where meaning depends on headers, sections, or cross-references. Splitting a legal contract mid-clause produces chunks that are difficult to interpret without context.
Semantic Chunking: Following Natural Boundaries
Semantic chunking respects document structure, splitting at natural boundaries like paragraphs, sentences, or sections:
def semantic_chunk(text: str, max_tokens: int = 512) -> list[str]:
"""Split at paragraph boundaries, respecting max size."""
paragraphs = text.split('\n\n')
chunks, current, current_size = [], [], 0
for para in paragraphs:
para_tokens = len(tokenizer.encode(para))
if current_size + para_tokens > max_tokens and current:
chunks.append('\n\n'.join(current))
current, current_size = [], 0
current.append(para)
current_size += para_tokens
if current:
chunks.append('\n\n'.join(current))
return chunksThis preserves paragraph coherence. But what if a single paragraph exceeds the limit? We need fallback logic—either split it further (losing coherence) or truncate (losing content).
Recursive Chunking: Best of Both Worlds
Recursive chunking tries progressively finer splits until content fits:
def recursive_chunk(text: str, max_tokens: int = 512,
separators: list[str] = ['\n\n', '\n', '. ', ' ']) -> list[str]:
"""Try coarse splits first, fall back to finer splits."""
if len(tokenizer.encode(text)) <= max_tokens:
return [text]
for sep in separators:
parts = text.split(sep)
if len(parts) > 1:
chunks = []
for part in parts:
chunks.extend(recursive_chunk(part, max_tokens, separators))
return chunks
# Last resort: hard split
return fixed_size_chunk(text, max_tokens)This tries to split at double-newlines (paragraphs) first. If any segment is still too large, try single newlines, then sentences, then words. It maximizes semantic coherence while guaranteeing size limits.
Full implementation: See reference/07b_rag_systems_code.md for complete implementations including structure-aware chunking with header tracking.
How Chunk Size Affects Retrieval
Chunk size is the most important parameter in RAG. It creates a fundamental tradeoff:
Smaller chunks (128-256 tokens):
- ✓ More precise retrieval—each chunk is about one idea
- ✓ Higher recall—easier to match specific facts
- ✗ Loss of context—chunks may be hard to interpret alone
- ✗ More chunks to store and search
Larger chunks (512-1024 tokens):
- ✓ Self-contained—each chunk has enough context
- ✓ Fewer chunks, more efficient storage and search
- ✗ Lower precision—relevant sentence buried in irrelevant paragraph
- ✗ May exceed embedding model’s effective window
A reasonable default is to start at 512 tokens with overlap, then tune: go smaller (down to ~256) only when precision demands it, and larger (512–1024) when meaning depends on surrounding context. The instinct to shrink chunks usually backfires—too-small chunks return fragments that lose the context the model needs (this is the same default the “Chunks Too Small” mistake below recommends). The right size depends heavily on your documents:
| Document Type | Recommended Size | Rationale |
|---|---|---|
| FAQs, knowledge bases | 128-256 tokens | One question-answer pair per chunk |
| Technical documentation | 256-512 tokens | Section-level coherence |
| Legal contracts | Full clauses | Clause is atomic semantic unit |
| Research papers | 512-1024 tokens | Need surrounding context for meaning |
| Code files | Full functions | Function is natural semantic unit |
Practical advice: Start with 512 tokens and semantic boundaries. Measure retrieval quality on real queries. If retrieval finds irrelevant results, try smaller chunks. If retrieved chunks lack context, try larger or add context metadata.
The Lost Context Problem
When you chunk a document, each chunk loses its place in the whole. Consider this chunk:
“As discussed in the previous section, these conditions apply. The party responsible must notify the other party within 30 days.”
In isolation, this is nearly useless—what conditions? Which parties? The references point outside the chunk.
Solution 1: Context headers. Prepend hierarchical context:
def add_context_headers(chunk: str, headers: list[str]) -> str:
"""Prepend document structure context to chunk."""
context = " > ".join(headers) # "Contract > Section 3 > Termination"
return f"[{context}]\n{chunk}"Solution 2: Sentence-window retrieval. Embed individual sentences but retrieve with surrounding window:
def sentence_window_chunk(text: str, window: int = 3) -> list[dict]:
"""Embed sentences, but return with surrounding context."""
sentences = sent_tokenize(text)
for i, sent in enumerate(sentences):
start = max(0, i - window)
end = min(len(sentences), i + window + 1)
yield {
'embed_text': sent, # This gets embedded
'context': ' '.join(sentences[start:end]) # This gets returned
}This gives you sentence-level retrieval precision with paragraph-level context in the final prompt.
Solution 3: Parent-child indexing. Index both small chunks (for retrieval precision) and large chunks (for context). When a small chunk matches, return its parent.
Document Extraction: Handling Real-World Formats
Real documents come as PDFs, Word files, HTML pages, and scanned images. Extraction quality directly impacts RAG quality—garbage in, garbage out.
def extract_document(path: str) -> tuple[str, dict]:
"""Extract text and metadata from various formats."""
ext = path.rsplit('.', 1)[-1].lower()
extractors = {
'pdf': extract_pdf,
'docx': extract_docx,
'html': extract_html,
'md': extract_markdown,
}
if ext not in extractors:
raise ValueError(f"Unsupported format: {ext}")
return extractors[ext](path)PDF challenges are underestimated:
- Multi-column layouts: Reading order ambiguity
- Tables: Structure lost when flattened
- Scanned documents: Require OCR
- Headers/footers: Repeated noise
- Embedded images with text: Invisible without OCR
For production systems, consider specialized services (Unstructured.io, LlamaParse, AWS Textract) that handle these edge cases. The investment pays off in retrieval quality.
Full implementation: See reference/07b_rag_systems_code.md for complete extraction code.
What people do: Set chunk size to 128-256 tokens “to be safe” and ensure precise retrieval.
Why it fails: Small chunks lose context. A sentence like “This applies to returns within 30 days” is meaningless without knowing what “this” refers to. Retrieval metrics may look good (you found a relevant sentence), but generation quality suffers.
Fix: Start with 512-1024 tokens and tune based on retrieval + generation quality together. Include chunk overlap (10-20%). Consider parent-child chunking: retrieve small chunks for precision, include parent chunk in context for completeness.
Embeddings and Vector Search
How Embedding Models Work
Embedding models transform text into dense vectors (typically 384-4096 dimensions) where geometric relationships encode semantic relationships. But how?
Modern embedding models are neural networks—typically transformers—trained on massive datasets of related text pairs. The training objective shapes what “similar” means:
Contrastive learning: Given a query q, a positive document d+ (relevant), and negative documents d- (irrelevant), train the model so that:
- similarity(embed(q), embed(d+)) is high
- similarity(embed(q), embed(d-)) is low
Through millions of such examples, the model learns to map semantically related texts to nearby points in vector space.
Why this works: The model learns features—syntactic patterns, entity types, topic signals—that distinguish relevant from irrelevant. These features become the dimensions of the embedding space. When a query and document share features (i.e., are about the same topic, contain related concepts), their vectors point in similar directions.
What the dimensions represent: Unlike hand-crafted features, neural network dimensions don’t have interpretable meanings. Dimension 47 isn’t “about finance.” Instead, each dimension captures a learned combination of features. The entire vector, taken together, represents the text’s position in semantic space.
Embedding Models for Retrieval
Not all embeddings are equal for retrieval. General-purpose embeddings from language models capture meaning but aren’t optimized for finding relevant documents.
Retrieval-specific models are trained explicitly on query-document pairs:
| Model | Dimensions | Strengths |
|---|---|---|
| E5-large-v2 | 1024 | Strong general-purpose, good multilingual |
| BGE-large-en-v1.5 | 1024 | Excellent English, instruction-following |
| GTE-large-en-v1.5 | 1024 | Strong benchmark performance |
| text-embedding-3-large | 3072 | High capacity, commercial |
| Voyage-large-2 | 1024 | Domain-specific options available |
Query vs. document prefixes: Some models (E5, BGE) use different prefixes for queries and documents:
# E5 model expects these prefixes
query_embedding = model.encode("query: What is the refund policy?")
doc_embedding = model.encode("passage: Our refund policy allows returns within 30 days...")This asymmetric encoding helps because queries and documents are linguistically different—queries are short questions, documents are declarative paragraphs. The prefixes tell the model to produce appropriately calibrated embeddings.
Choosing an Embedding Model
Quick recommendations:
- Best quality (API): OpenAI text-embedding-3-large (3072 dims)
- Best quality (self-hosted): BGE-large-en-v1.5 or GTE-large
- Best efficiency: all-MiniLM-L6-v2 (384 dims, very fast)
- Multilingual: E5-multilingual-large
Similarity Metrics: The Math of “Closeness”
Given two vectors, how do we measure similarity?
Cosine similarity measures the angle between vectors, ignoring magnitude:
\[\text{cosine}(a, b) = \frac{a \cdot b}{\|a\| \|b\|}\]
Range: [-1, 1] where 1 = identical direction, 0 = orthogonal, -1 = opposite.
Dot product is simply \(a \cdot b\). For unit vectors (normalized to length 1), dot product equals cosine similarity.
Euclidean distance measures straight-line distance: \(\|a - b\|_2\). Lower = more similar.
Why cosine is standard: Embedding models typically normalize outputs, making magnitude uninformative. Cosine (or equivalently, dot product on normalized vectors) focuses on direction—which captures semantic content. Two documents about “refund policies” should be similar regardless of length.
Practical note: Most systems normalize embeddings and use dot product, which is faster (no normalization at query time) and equivalent to cosine for normalized vectors.
Exact vs. Approximate Nearest Neighbor Search
With embeddings and similarity metrics, retrieval becomes a nearest neighbor search: find the k vectors most similar to the query vector.
Exact search (brute force): Compare the query to every document. Trivially correct but O(n) per query—fine for 10,000 documents, infeasible for 10 million.
Approximate nearest neighbor (ANN): Use index structures to find probably nearest neighbors without exhaustive search. Tradeoff: small accuracy loss for massive speedup.
The key ANN algorithms:
HNSW (Hierarchical Navigable Small World):
- Builds a graph where similar vectors are connected
- Query navigates the graph, following edges toward similar vectors
- Like a social network where “similar” people know each other—start anywhere, follow connections to find who you’re looking for
- Very fast queries, but high memory (stores graph structure)
IVF (Inverted File Index):
- Clusters vectors into buckets (Voronoi cells)
- Query checks only nearby buckets
- Like organizing a library by topic—search relevant sections, not the whole library
- Tunable: check more buckets for higher recall
Product Quantization (PQ):
- Compresses vectors by representing them as combinations of codebook entries
- Like describing colors as “red-ish, somewhat green, no blue” instead of exact RGB values
- 10-100x memory reduction with modest accuracy loss
- Often combined with IVF for very large scales
import faiss
# Exact search (< 10K vectors)
index = faiss.IndexFlatIP(dimension)
# HNSW (10K - 1M vectors)
index = faiss.IndexHNSWFlat(dimension, M=32)
# IVF + PQ (> 1M vectors)
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist=1024, M=32, bits=8)
index.train(sample_vectors) # Required for IVF/PQFull implementation: See reference/07b_rag_systems_code.md for complete indexing and search code.
Vector Search Doesn’t Understand Everything
A critical insight: vector similarity captures topical relevance, not answer correctness.
When you search for “What is the return policy?”, vector search finds documents about return policies. But it doesn’t know:
- Whether the document answers the question
- Whether it’s the current policy or an outdated version
- Whether it applies to your specific situation
This is why RAG pipelines need multiple stages. Vector search finds candidates; reranking selects the best; generation synthesizes an answer; prompting ensures grounding. Weakness in any component propagates.
What people do: Store only text and embeddings, discarding document metadata (dates, authors, categories, versions).
Why it fails: Vector search alone can’t answer “What’s the current policy?” vs. “What was the policy in 2023?” Both documents embed similarly. Without metadata filtering, you return outdated information or mix answers from different contexts.
Fix: Store metadata alongside embeddings. Use metadata filters in queries (e.g., date >= '2024-01-01'). Include source attribution in generated responses so users can verify. Consider metadata-aware reranking that boosts recent documents.
Hybrid Search and Reranking
The Complementary Strengths of Vector and Keyword Search
Vector search and keyword search (BM25) fail in different ways:
Vector search failures:
- Exact identifiers: “error code XJ-4402” may not embed close to documents mentioning it
- Technical terminology: “PostgreSQL pg_dump” gets semantically diffused
- Named entities: “Dr. Sarah Chen’s 2024 paper” needs exact matching
Keyword search failures:
- Synonyms: “automobile” doesn’t match “car”
- Paraphrasing: “how to end my account” doesn’t match “subscription termination process”
- Conceptual queries: “best practices for scaling” spans many possible wordings
Hybrid search combines both, capturing documents that are either lexically or semantically similar.
BM25: The Keyword Baseline
BM25 is the standard keyword ranking algorithm, an improvement on TF-IDF that handles document length variation:
\[\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}\]
The intuition:
- IDF (Inverse Document Frequency): Rare terms matter more. “Termination” is more informative than “the.”
- TF (Term Frequency): More occurrences = stronger signal, but with diminishing returns
- Length normalization: Long documents get many terms by chance; normalize for length
BM25 requires no training—just tokenization and statistics. It’s fast, interpretable, and remains competitive with neural methods for exact-match queries.
Reciprocal Rank Fusion
Combining two ranked lists is tricky. Scores aren’t comparable (cosine similarity vs. BM25 scores live in different scales). Reciprocal Rank Fusion (RRF) sidesteps this by using ranks, not scores:
\[\text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)}\]
Where k is a constant (typically 60) that prevents very high-ranked documents from dominating.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
"""Combine multiple rankings using RRF."""
scores = defaultdict(float)
for ranking in rankings:
for rank, doc_id in enumerate(ranking, 1):
scores[doc_id] += 1.0 / (k + rank)
return sorted(scores, key=scores.get, reverse=True)RRF is robust because:
- No score normalization needed
- Outliers in one retriever don’t dominate
- Documents ranked well by multiple retrievers rise to the top
Full implementation: See reference/07b_rag_systems_code.md for complete hybrid search with BM25 and vector fusion.
Reranking: Precision Over Speed
Initial retrieval (vector or hybrid) casts a wide net—retrieve 50-100 candidates quickly. Reranking uses a more expensive model to precisely score these candidates.
The insight: bi-encoders (embedding models) score documents independently—embed query, embed document, compute similarity. Cross-encoders process query and document together, enabling rich interaction between them.
from sentence_transformers import CrossEncoder
class Reranker:
def __init__(self, model: str = "BAAI/bge-reranker-large"):
self.model = CrossEncoder(model)
def rerank(self, query: str, documents: list[str], top_k: int = 5):
pairs = [[query, doc] for doc in documents]
scores = self.model.predict(pairs)
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
return ranked[:top_k]Why cross-encoders are more accurate: They can match specific query terms to document locations, understand negation (“not eligible for” vs. “eligible for”), and reason about relevance holistically. Bi-encoders must compress all meaning into fixed vectors before comparison.
Why we don’t use cross-encoders for initial retrieval: Cross-encoders require O(n) forward passes (one per document). For millions of documents, this is impossibly slow. Bi-encoders enable precomputation—embed documents once, search repeatedly.
The two-stage pattern—fast bi-encoder retrieval, precise cross-encoder reranking—gives us the best of both worlds.
The situation: An enterprise search team spent three months building a sophisticated RAG pipeline for their internal knowledge base. Their crown jewel was a cross-encoder reranker fine-tuned on 50,000 labeled query-document pairs. Offline metrics were stellar: MRR improved from 0.65 to 0.84.
What happened: They launched with fanfare. Usage dropped 40% within a month. Users complained the system was “too slow to be useful.” Exit surveys revealed people were returning to keyword search or just asking colleagues directly.
Root cause: The reranker added 800ms of latency per query—acceptable in benchmarks, unacceptable in practice. Users expect search results in under 300ms. An 84% MRR means nothing if users don’t wait for results. The team had measured the wrong thing: they optimized for ranking quality while users optimized for time-to-answer.
How they fixed it: 1. Implemented async reranking: show initial bi-encoder results immediately, then update with reranked results if user is still on page 2. Added a lightweight reranker (100ms budget) for real-time, heavy reranker for “show more” requests 3. Built latency into their evaluation metrics: score = accuracy × (1 - latency_penalty) 4. Set up user timing dashboards alongside ranking quality dashboards
The takeaway: Users don’t experience your metrics; they experience latency. The best ranking algorithm your users never see is worthless.
What people do: Return top-k results from vector search directly to the LLM.
Why it fails: Vector similarity ≠ relevance. A document about your topic may score higher than a document that answers your question. Without reranking, you fill context with tangentially related content, forcing the LLM to sift through noise.
Fix: Always add a reranking stage, even a simple one. Cross-encoders like ms-marco-MiniLM-L-6-v2 add ~50ms latency but dramatically improve answer quality. For production, batch reranking requests and set a timeout—partial reranking beats no reranking.
Context Assembly and Generation
The Art of Prompt Construction
You’ve retrieved and reranked documents. Now you must assemble them into a prompt that elicits accurate, grounded responses.
Key principles:
Structure for clarity: Label documents clearly so the model can cite them.
RAG_PROMPT = """Answer the question based ONLY on the following documents.
If the documents don't contain the answer, say "I don't have information about that."
Cite sources using [Doc N] format.
{documents}
Question: {question}
Answer:"""
def format_documents(docs: list[str]) -> str:
return "\n\n".join(f"[Doc {i+1}]:\n{doc}" for i, doc in enumerate(docs))Explicit grounding instructions: Without clear instructions, models may blend retrieved content with parametric knowledge. “Answer based ONLY on the following documents” reduces this.
Handle insufficient context gracefully: The model should acknowledge when it can’t answer rather than hallucinate.
The Lost-in-the-Middle Problem
A surprising finding from Liu et al. (2024): language models with long contexts don’t use all the context equally. Information in the middle of long prompts is often ignored—the model attends more strongly to the beginning and end.
Implications for RAG:
- Put the most relevant documents first (or last)
- Don’t dump 50 documents into context hoping the model will find what it needs
- Quality over quantity—5 highly relevant chunks beat 20 mixed ones
Mitigation strategies:
- Aggressive reranking to ensure top documents are truly best
- Context compression (summarize less-relevant documents)
- Multi-pass generation (retrieve more, generate summaries, retrieve again)
Citation and Attribution
Users need to verify AI responses. Proper citations enable this:
def extract_citations(response: str, num_docs: int) -> list[int]:
"""Extract document citations from response."""
import re
citations = re.findall(r'\[Doc (\d+)\]', response)
valid = [int(c) for c in citations if 1 <= int(c) <= num_docs]
return list(set(valid))Trust but verify: Models sometimes hallucinate citations—referencing [Doc 5] when only 3 documents were provided, or attributing statements to the wrong source. Consider:
- Validating citation numbers are in range
- Checking that cited content actually supports the claim (expensive but possible with another LLM call)
- Surfacing source documents to users for verification
Advanced RAG Patterns
Basic RAG—retrieve, then generate—handles simple factual queries. Complex scenarios require more sophisticated orchestration.
Query Transformation
User queries are often poorly suited for retrieval:
- Too short: “refund?” lacks context
- Wrong vocabulary: user terms ≠ document terms
- Complex: multiple sub-questions combined
Query expansion adds related terms:
def expand_query(query: str, llm) -> str:
"""Generate query expansions for better retrieval."""
expansion = llm.generate(f"""Generate search terms related to: {query}
Related terms (comma-separated):""")
return f"{query} {expansion}"HyDE (Hypothetical Document Embeddings) generates a hypothetical answer, then retrieves documents similar to it:
def hyde_retrieve(query: str, retriever, llm) -> list[dict]:
"""Generate hypothetical answer, use it for retrieval."""
hypothetical = llm.generate(
f"Write a detailed paragraph answering: {query}"
)
return retriever.search(hypothetical) # Search using generated docWhy HyDE works: The hypothetical answer uses vocabulary likely to appear in actual documents—bridging the query-document lexical gap. It’s particularly effective when users ask questions in conversational language while documents are written formally.
Query decomposition breaks complex questions into simpler sub-queries:
def decompose_and_retrieve(query: str, retriever, llm) -> list[dict]:
"""Break complex query into sub-queries, retrieve for each."""
sub_queries = llm.generate(
f"Break this into simpler questions:\n{query}"
).split('\n')
all_docs = []
for sub_q in sub_queries:
all_docs.extend(retriever.search(sub_q))
return deduplicate(all_docs)Self-RAG and Corrective RAG
Advanced patterns add reflection and self-correction.
Self-RAG (Asai et al., 2023) decides whether to retrieve (some questions don’t need external knowledge) and evaluates whether the response is supported by retrieved documents:
class SelfRAG:
def query(self, question: str) -> str:
# Decide if retrieval is needed
if self.needs_retrieval(question):
docs = self.retriever.search(question)
response = self.generate_with_context(question, docs)
# Verify response is supported by documents
if not self.is_supported(response, docs):
response = self.regenerate(question, docs)
else:
response = self.generate_direct(question)
return responseCorrective RAG (CRAG) evaluates retrieval quality and falls back to alternative sources (like web search) when the knowledge base doesn’t have good answers:
class CorrectiveRAG:
def query(self, question: str) -> str:
docs = self.retriever.search(question)
quality = self.evaluate_retrieval(question, docs)
if quality == "CORRECT":
context = docs
elif quality == "INCORRECT":
context = self.web_search(question) # Fallback
else: # AMBIGUOUS
context = docs + self.web_search(question) # Combine
return self.generate(question, context)These patterns add LLM calls (increasing latency and cost) but significantly improve reliability on hard queries.
Full implementation: See reference/07b_rag_systems_code.md for complete Self-RAG, CRAG, and multi-hop implementations.
Multi-Hop RAG
Some questions require synthesizing information across multiple documents that aren’t directly connected:
“Which of our company’s products were affected by the EU AI Act regulations discussed in last month’s legal memo?”
This needs: (1) the legal memo about EU AI Act, (2) product specifications, (3) reasoning about which products fall under which regulations.
Multi-hop RAG iteratively retrieves and reasons:
class MultiHopRAG:
def query(self, question: str, max_hops: int = 3) -> str:
context = []
current_query = question
for _ in range(max_hops):
docs = self.retriever.search(current_query)
context.extend(docs)
can_answer, missing = self.check_answerability(question, context)
if can_answer:
break
# Generate follow-up query to find missing information
current_query = self.generate_followup(question, context, missing)
return self.generate_answer(question, context)GraphRAG: Knowledge Graphs Meet Retrieval
Vector search finds topically similar documents but doesn’t understand relationships between entities. GraphRAG combines vector retrieval with knowledge graph traversal:
- Extract entities and relationships from documents (using an LLM or NER models)
- Build a knowledge graph connecting entities
- At query time, find relevant entities and traverse the graph to find related information
This excels for queries like:
- “How is Project Alpha related to the Johnson contract?”
- “What other products use the same supplier as Widget X?”
- “Who worked on both the 2023 audit and the compliance review?”
Full implementation: See reference/07b_rag_systems_code.md for complete GraphRAG implementation with entity extraction and graph traversal.
Evaluation
Why RAG Evaluation is Hard
RAG systems have multiple components that can fail independently:
- Retrieval might miss relevant documents
- Reranking might demote the best documents
- Context assembly might truncate critical information
- Generation might hallucinate despite good context
A single end-to-end metric hides which component is failing. Comprehensive evaluation requires component-level and end-to-end metrics.
Retrieval Metrics
Classic information retrieval metrics:
Recall@k: Of all relevant documents, what fraction were retrieved in top k? \[\text{Recall@k} = \frac{|\text{Relevant} \cap \text{Retrieved@k}|}{|\text{Relevant}|}\]
Precision@k: Of retrieved documents, what fraction were relevant? \[\text{Precision@k} = \frac{|\text{Relevant} \cap \text{Retrieved@k}|}{k}\]
MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant document.
def evaluate_retrieval(retriever, queries, relevant_docs) -> dict:
"""Evaluate retrieval quality."""
metrics = {'recall@5': [], 'recall@10': [], 'mrr': []}
for query, relevant in zip(queries, relevant_docs):
results = retriever.search(query, k=10)
retrieved_ids = [r['id'] for r in results]
metrics['recall@5'].append(
len(set(retrieved_ids[:5]) & relevant) / len(relevant)
)
metrics['recall@10'].append(
len(set(retrieved_ids[:10]) & relevant) / len(relevant)
)
for rank, doc_id in enumerate(retrieved_ids, 1):
if doc_id in relevant:
metrics['mrr'].append(1 / rank)
break
else:
metrics['mrr'].append(0)
return {k: sum(v)/len(v) for k, v in metrics.items()}End-to-End Evaluation
For the full pipeline, evaluate answer quality:
Faithfulness: Does the response only contain information from retrieved documents? (Checks for hallucination)
Relevance: Does the response answer the question?
Completeness: Does the response cover all aspects of the question?
LLM-as-judge works well for these:
def evaluate_faithfulness(response: str, context: str, evaluator_llm) -> float:
"""Check if response is faithful to context."""
prompt = f"""Rate whether this response is faithful to the provided context.
A faithful response only contains information present in the context.
Context: {context}
Response: {response}
Rate 1-5 (5 = completely faithful):"""
return float(evaluator_llm.generate(prompt))Full implementation: See reference/07b_rag_systems_code.md for complete evaluation harness with component-level and end-to-end metrics.
Debugging Failures
When RAG fails, diagnose systematically:
- Was the relevant document retrieved? → Retrieval problem (embedding model, chunking, index)
- Was it ranked high enough? → Reranking problem
- Was it included in context? → Context assembly problem
- Did generation use it correctly? → Prompt engineering problem
class RAGDebugger:
def diagnose(self, query: str, expected_answer: str) -> dict:
"""Diagnose RAG failure point."""
retrieved = self.retriever.search(query, k=50)
# Check if answer content is in any retrieved doc
relevant = [d for d in retrieved if self.contains_answer(d, expected_answer)]
if not relevant:
return {'failure': 'retrieval', 'issue': 'Relevant content not retrieved'}
reranked = self.reranker.rerank(query, retrieved[:20])
relevant_ranks = [i for i, d in enumerate(reranked) if d in relevant]
if min(relevant_ranks) > 5:
return {'failure': 'reranking', 'issue': f'Relevant doc ranked {min(relevant_ranks)}'}
response = self.generate(query, reranked[:5])
if not self.answer_correct(response, expected_answer):
return {'failure': 'generation', 'issue': 'Good context but wrong answer'}
return {'failure': None, 'status': 'success'}Production Considerations
Vector Database Selection
For production, you need more than FAISS: persistence, updates, filtering, scaling.
Decision framework:
| Scale | Complexity | Recommendation |
|---|---|---|
| < 100K vectors | Low | PostgreSQL + pgvector |
| 100K - 10M | Medium | Qdrant or Weaviate |
| > 10M | High | Milvus or Pinecone |
Key capabilities to evaluate:
- Hybrid search support (vector + keyword)
- Metadata filtering efficiency
- Update performance (real-time vs. batch)
- Horizontal scaling options
- Backup and disaster recovery
Monitoring and Observability
Production RAG needs continuous monitoring:
@dataclass
class RAGMetrics:
retrieval_latency_ms: float
rerank_latency_ms: float
generation_latency_ms: float
num_docs_retrieved: int
top_retrieval_score: float
user_feedback: Optional[str]
def log_rag_request(query: str, metrics: RAGMetrics):
"""Log for monitoring and debugging."""
logger.info({
'query': query,
**asdict(metrics),
'timestamp': datetime.utcnow().isoformat()
})Alert on:
- Latency percentile spikes
- Low retrieval scores (indicates out-of-domain queries)
- User feedback trends
- Document freshness (stale indexes)
Common Production Pitfalls
Index-embedding mismatch: Using different embedding models for indexing vs. querying produces nonsense results. Version your embeddings with your index.
Stale documents: When source documents update, indexes become inconsistent. Implement incremental re-indexing.
Query injection: User queries become part of prompts. Sanitize inputs and use delimiters to separate user content from instructions.
Cost spiral: LLM calls per query add up. Monitor and set budgets.
def answer_question(query: str, documents: list[str]) -> str:
# Embed and search
query_vec = embed(query)
results = vector_db.search(query_vec, k=10)
# Stuff everything into context
context = "\n\n".join([doc.text for doc in results])
response = llm.generate(
f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
)
return responseasync def answer_question(
query: str,
user_id: str,
timeout: float = 10.0
) -> RAGResponse:
"""Production RAG with reranking, citation, and observability."""
start = time.monotonic()
# Stage 1: Hybrid retrieval (vector + BM25)
async with asyncio.timeout(timeout * 0.3):
vector_results = await vector_db.search(embed(query), k=50)
keyword_results = await bm25_index.search(query, k=50)
candidates = reciprocal_rank_fusion(vector_results, keyword_results)
# Stage 2: Cross-encoder reranking
async with asyncio.timeout(timeout * 0.4):
reranked = await reranker.rerank(query, candidates[:20])
top_docs = [d for d in reranked if d.score > 0.5][:5]
if not top_docs:
return RAGResponse(answer=None, status="no_relevant_docs")
# Stage 3: Generate with citation requirements
context = format_docs_with_ids(top_docs) # [1], [2], etc.
response = await llm.generate(
messages=[{
"role": "system",
"content": "Answer based ONLY on the provided context. "
"Cite sources using [1], [2], etc. "
"If unsure, say 'I don't have enough information.'"
}, {
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {query}"
}],
max_tokens=500
)
# Log for monitoring
log_rag_metrics(
query=query, user_id=user_id,
retrieval_latency=(time.monotonic() - start),
docs_used=len(top_docs),
top_score=top_docs[0].score if top_docs else 0
)
return RAGResponse(
answer=response.content,
sources=[d.metadata for d in top_docs],
latency_ms=(time.monotonic() - start) * 1000
)Key differences: Hybrid search, reranking stage, relevance threshold, citation enforcement, timeout management, observability, and structured response with sources.
Key Takeaways
Chunking strategy determines retrieval quality - Preserve semantic coherence while enabling precise retrieval. Use document-aware chunking that respects natural boundaries like paragraphs and sections.
Hybrid search outperforms single approaches - Combine vector search (semantic similarity) with BM25 (keyword matching). Vector search handles paraphrases; BM25 handles exact terms and identifiers.
Two-stage retrieval balances speed and accuracy - Use fast bi-encoders to find candidates, then expensive cross-encoders to precisely rerank. This achieves both scale and quality.
The “lost in the middle” problem affects generation - LLMs attend more strongly to content at the beginning and end of context. Put the most relevant documents first and prioritize quality over quantity.
Evaluate retrieval and generation separately - Component-level evaluation enables precise diagnosis. Measure retrieval recall@k, generation faithfulness, and end-to-end answer correctness independently.
Summary
RAG transforms language models from unreliable oracles into grounded research assistants. The pattern is simple—retrieve then generate—but production quality requires:
- Thoughtful chunking that preserves semantic coherence while enabling precise retrieval
- Embedding models trained for retrieval, not just similarity
- Hybrid search combining vector and keyword approaches
- Reranking to precision-sort noisy retrieval results
- Careful prompt engineering to ensure grounding and citation
- Component-level evaluation to diagnose failures
- Production infrastructure for monitoring, updates, and scale
The field continues to evolve rapidly. GraphRAG, adaptive retrieval, and learned sparse representations are pushing boundaries. But the fundamentals—understanding what makes retrieval work, why generation can fail, how to evaluate systematically—remain essential regardless of which specific techniques you adopt.
RAG Architecture Selection Flowchart
Practical Exercises
Chunking experiment: Take a 10-page document. Implement three chunking strategies (fixed-size, semantic, recursive). For 20 test queries, measure retrieval recall. Which strategy wins? Why?
Embedding comparison: Compare E5-large, BGE-large, and OpenAI embeddings on your documents. Measure recall@10 and latency. Is the most expensive model worth it?
Hybrid search tuning: Implement hybrid search with configurable vector/BM25 weights. Use grid search to find optimal weights for your evaluation set.
Failure analysis: Collect 20 queries where your RAG system gives wrong answers. For each, diagnose the failure point (retrieval, reranking, generation). What patterns emerge?
GraphRAG prototype: For a small document set, extract entities and relationships. Implement graph traversal. Find queries where GraphRAG outperforms standard RAG.
Self-Assessment Checkpoint
Conceptual Questions
Q1. [IC2] A RAG system uses 512-token chunks. For a legal contract where clauses reference each other heavily, what problems might this cause and how would you address them?
Answer
Problems: (1) Clauses may be split mid-sentence, losing meaning. (2) References like “as described in Section 3.2 above” become meaningless when Section 3.2 is in a different chunk. Solutions: (1) Use document-aware chunking that respects clause boundaries. (2) Add context headers showing the document structure (e.g., “[Contract > Section 3 > Termination Clause]”). (3) Use parent-child indexing: index small chunks for precision but return parent sections for context. (4) Consider sentence-window retrieval: embed sentences but return with surrounding context.Q2. [IC2] Why do we use two-stage retrieval (bi-encoder then cross-encoder reranking) instead of just using cross-encoders for everything?
Answer
Cross-encoders are more accurate because they process query and document together, enabling rich interaction. But they require O(n) forward passes—one per document. For millions of documents, this is impossibly slow. Bi-encoders precompute document embeddings, enabling sublinear search via ANN indexes. The two-stage pattern uses fast bi-encoders to find ~50-100 candidates, then expensive cross-encoders to precisely rank only these candidates. This achieves both speed and accuracy.Q3. [Senior] Explain the “lost in the middle” problem and its implications for RAG prompt design.
Answer
Liu et al. (2024) found that LLMs with long contexts don’t use context uniformly—they attend more strongly to content at the beginning and end, often ignoring the middle. For RAG: (1) Don’t dump many documents hoping the model finds what it needs. (2) Put the most relevant documents first (or last). (3) Aggressive reranking matters—ensure top documents are truly best. (4) Consider quality over quantity—5 highly relevant chunks beat 20 mixed ones. (5) For long contexts, consider summarizing less-relevant documents.Q4. [Senior] A hybrid search system combines BM25 and vector search using RRF. What query types would favor each component, and how would you tune the balance?
Answer
BM25 favors: exact terms, identifiers, technical terminology, named entities (“error XJ-4402”, “PostgreSQL pg_dump”). Vector search favors: conceptual queries, paraphrases, synonyms (“how to cancel account” vs. “subscription termination”). Tuning: Create an evaluation set with both query types. Use grid search over RRF weights or fusion parameters. Common finding: 50/50 works well as default, but domain-specific tuning (e.g., 70/30 for technical docs with many identifiers) can improve significantly.Q5. [Staff] Your RAG system handles internal company documents. The legal team says some documents are confidential and should only be retrieved for users with proper authorization. How do you architect this?
Answer
Metadata filtering approach: (1) Store access control metadata (department, clearance level, document classification) with each chunk. (2) At query time, filter retrieval to documents the user can access. Most vector DBs support metadata filtering. Considerations: (1) Filter BEFORE retrieval, not after—otherwise you leak information through similarity scores. (2) Test that filtered results still have sufficient coverage. (3) Consider separate indexes for different access levels for very sensitive data. (4) Audit logging for compliance. (5) Don’t put access control logic in the LLM prompt—it’s not reliable.Spot the Problem
Problem 1. [IC2] A developer’s chunking function:
def chunk_document(text: str, size: int = 500) -> list[str]:
words = text.split()
return [' '.join(words[i:i+size]) for i in range(0, len(words), size)]What’s wrong?
Answer
Issues: (1) Chunking by words, not tokens—500 words could be 600-800 tokens, exceeding embedding model limits. (2) No overlap—sentences at boundaries may not match queries well. (3) Splits arbitrarily, even mid-sentence. (4) Ignores document structure (paragraphs, sections). Better: Use a tokenizer to count tokens, add overlap (e.g., 10-15%), and respect semantic boundaries (at least sentence boundaries, ideally paragraphs).Problem 2. [Senior] A RAG system has 95% retrieval recall@10 on the eval set but users complain answers are often wrong. What’s happening?
# Their evaluation
retrieved = retriever.search(query, k=10)
if any(doc.id in ground_truth for doc in retrieved):
recall_hits += 1Answer
Several possible issues: (1) Recall@10 means the right document is somewhere in top 10, but generation may only use top 3-5. If the right document is at position 8, it might get cut. (2) They’re checking document ID, not whether the document actually answers the question—document might be related but not contain the answer. (3) Generation faithfulness—even with perfect retrieval, the model might hallucinate or misuse context. (4) Eval set might not represent real user queries. Need to evaluate the full pipeline: retrieval precision@k for actual k used, faithfulness of generated answers, and end-to-end answer correctness.Problem 3. [Staff] A production RAG system worked well for months, then quality suddenly dropped. No code changes were deployed. What should you investigate?
Answer
Likely causes: (1) Source document changes—documents were updated but the index is stale. Check document freshness and re-index. (2) Embedding model API changed—if using a hosted embedding service, the model may have been updated, changing the embedding space. Vectors indexed with old model won’t match queries with new model. (3) Data drift—user query patterns changed and the system wasn’t designed for them. (4) Upstream API changes—LLM provider may have updated the model. (5) Volume spike causing degraded performance (timeouts, truncation). Investigate: Check document timestamps vs index timestamps, compare embedding similarity distributions over time, review query logs for pattern changes, check latency metrics.Design Exercises
Exercise 1. [Senior] Design a RAG system for a customer support chatbot. Requirements: 50,000 support articles, need sub-2-second response time, must cite sources, handles 1,000 queries/hour peak. Outline your architecture including chunking strategy, retrieval pipeline, and infrastructure choices.
Guidance
Consider: (1) Chunking: Support articles are structured—chunk by FAQ pairs or section headers. 256-512 tokens per chunk. (2) Retrieval: Hybrid search (BM25 + vectors) with reranking. Retrieve 20, rerank to 5. (3) Infrastructure: 50K documents = ~200K chunks = fit easily in managed vector DB (Pinecone, Qdrant Cloud). (4) Latency budget: Embedding (~100ms) + Vector search (~50ms) + Rerank (~200ms) + LLM (~1000ms) = ~1.3s, leaving margin. (5) Citations: Structure prompt to output source IDs, validate citations exist. (6) Caching: Consider caching frequent queries—support has many repeat questions.Exercise 2. [Staff] You’re building a RAG system for a research organization with 2 million academic papers. Users ask complex questions requiring synthesis across multiple papers. Design the system considering: scale, multi-hop reasoning needs, and the fact that papers heavily reference each other.
Guidance
This is a GraphRAG candidate. Consider: (1) Chunking: Papers have structure (abstract, intro, methods, results)—chunk by section. Also extract figure captions separately. (2) Entity extraction: Extract paper metadata, citations, key terms, author names. Build a citation graph. (3) Multi-hop: Implement iterative retrieval—find initial papers, check if answer is complete, follow citations if not. (4) Scale: 2M papers × ~10 chunks = 20M vectors. Need serious infrastructure—Milvus or Pinecone with good filtering. (5) Consider hierarchical indexing: topic-level index first, then paper-level. (6) Latency: Accept higher latency (5-10s) for complex queries given the research use case.Connections to Other Chapters
RAG systems sit at the intersection of retrieval, language understanding, and production engineering. Here’s how this chapter connects to others:
Chapter 5 (LLM/NLP Foundations): Provides the theory behind embeddings—how text becomes vectors, why similar meanings cluster together, and how to choose embedding models. Essential background for understanding retrieval.
Chapter 6 (Prompt Engineering): Techniques for integrating retrieved context into prompts effectively. Covers context window management, few-shot examples, and structured outputs for RAG responses.
Chapter 8 (Agentic Systems): Agents often use RAG as a tool for knowledge retrieval. Understanding RAG fundamentals helps you build more capable retrieval-augmented agents.
Chapter 15 (MLOps & Evaluation): Extends evaluation concepts to RAG-specific challenges—retrieval metrics, answer faithfulness, and end-to-end quality assessment.
Chapter 30 (Data Architecture for AI): Data pipeline design and vector database infrastructure at scale. Covers feature stores, data versioning, and the data engineering required for production RAG.
Further Reading
Essential
- Lewis et al. (2020), “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” - The paper that introduced RAG. Start here.
- Karpukhin et al. (2020), “Dense Passage Retrieval for Open-Domain QA” - Dense retrieval foundations that underpin modern RAG systems.
- Liu et al. (2024), “Lost in the Middle” - How models use long contexts. Critical for understanding retrieval ordering.
Deep Dives
- Asai et al. (2023), “Self-RAG: Learning to Retrieve, Generate, and Critique” - Self-reflective retrieval for higher quality.
- Malkov & Yashunin (2018), “HNSW: Hierarchical Navigable Small World Graphs” - The algorithm behind most vector database indices.
- Es et al. (2023), “RAGAS: Automated Evaluation of RAG” - Systematic framework for measuring RAG performance.