Chapter 7: RAG Systems

Keywords

retrieval, embeddings, vector search, chunking, reranking, hybrid search, semantic search, HNSW, ColBERT, knowledge grounding

Introduction

In the summer of 2022, a team at a major technology company faced a problem. They had deployed a large language model to answer questions about their internal documentation—thousands of pages covering engineering practices, HR policies, and product specifications. The model was impressive: articulate, confident, and completely unreliable. It would cite policies that didn’t exist, invent employee benefits, and present plausible-sounding procedures that contradicted official guidelines. The fundamental issue was simple: the model’s knowledge was frozen at training time, and it had never seen the company’s documents.

Retrieval-Augmented Generation (RAG) emerged as the solution to this problem. Instead of relying solely on knowledge baked into model parameters during training, RAG systems retrieve relevant documents at inference time and include them directly in the prompt. The model generates responses grounded in specific, verifiable sources rather than interpolating from training data.

The idea is elegantly simple: look things up before answering. But the execution is nuanced. Building a RAG system that works reliably—that retrieves the right information, presents it effectively, and generates accurate responses—requires understanding a stack of interconnected techniques: document processing, embedding models, vector search, reranking, and prompt engineering. Each component has its own tradeoffs, and weaknesses in any link can break the chain.

This chapter takes you from the foundations of RAG to production-ready systems. We’ll examine not just how each component works, but why it works—the mathematical and conceptual foundations that explain when techniques succeed and when they fail. By the end, you’ll understand RAG deeply enough to debug failures, choose between alternatives, and push the boundaries of what’s possible.

The Core Insight: Why RAG Works

To understand RAG, you need to understand why language models sometimes fail at knowledge-intensive tasks.

A language model like GPT-5 or Claude Opus contains vast amounts of knowledge encoded in its parameters—facts about the world, patterns of language, reasoning strategies. But this knowledge has fundamental limitations:

Temporal cutoff: The model only knows what existed in its training data. Events after training are invisible.
Coverage gaps: Training data, however massive, doesn’t include everything. Proprietary documents, recent publications, niche domains—all are absent.
Parametric compression: Knowledge stored in parameters is lossy. The model learns statistical patterns, not a perfect database. Details blur, facts merge, and the model may “remember” something that’s a plausible interpolation rather than a precise fact.
Confidence without calibration: Models generate confident-sounding text regardless of whether they actually know the answer. They’re trained to produce fluent completions, not to say “I don’t know.”

RAG addresses these limitations by shifting from parametric knowledge (stored in model weights) to non-parametric knowledge (retrieved from external sources). When a user asks a question, the system:

Retrieves relevant documents from a knowledge base
Augments the prompt with these documents as context
Generates a response grounded in the retrieved information

This is powerful because retrieval is exact: we can guarantee the model sees the current version of a document, include proprietary information without retraining, and update knowledge by simply updating the document store. The model becomes a sophisticated reader and synthesizer rather than a unreliable memory.

A Mental Model for RAG

Think of RAG as giving a language model an open-book exam. Without RAG, the model must answer from memory—and like a student who crammed but didn’t deeply learn the material, it may confuse details, conflate topics, or confidently assert things that aren’t quite right. With RAG, the model can flip through a textbook, find relevant passages, and compose an answer that synthesizes what it reads.

The key challenges then become:

Finding the right pages: Given a question, which documents are relevant?
Reading efficiently: With limited space in the prompt, what to include?
Synthesizing accurately: How to ensure the model uses the documents faithfully?

These map to the core RAG components: retrieval, context assembly, and generation with grounding.

When to Use RAG vs Fine-tuning

Before diving into RAG implementation, understand when RAG is the right choice:

Choose RAG when: Knowledge updates frequently, citations needed, no training budget, quick deployment required.

Choose Fine-tuning when: Static knowledge, need specific style/behavior, latency critical, knowledge fits in training data.

Use both when: Large knowledge base with specific format requirements.

What You’ll Learn

How embedding models encode semantic meaning, and why this enables similarity search
Chunking strategies and the tradeoffs that govern them
Vector search algorithms: exact, approximate, and hybrid approaches
Reranking to improve precision when initial retrieval is noisy
Advanced patterns: query transformation, self-reflection, multi-hop reasoning
Evaluation methodologies that diagnose where systems fail

Prerequisites

Understanding of embeddings and attention mechanisms (Chapter 5)
Familiarity with prompt engineering principles (Chapter 6)
Basic database concepts (indexes, queries, consistency)
Python programming

The RAG Architecture

The Two-Phase Design

RAG systems operate in two distinct phases with different computational constraints and optimization objectives.

Indexing (offline): Process your document corpus once, preparing it for fast retrieval. This phase can be compute-intensive because it runs ahead of user requests.

Documents → Parsing → Chunking → Embedding → Vector Index
              ↓          ↓           ↓            ↓
         Clean text   Segments   Dense vectors  Searchable store

Query (online): For each user request, find relevant context and generate a response. This phase must be fast—users expect responses in seconds.

Query → Embedding → Vector Search → Reranking → Context Assembly → Generation
  ↓          ↓            ↓            ↓              ↓               ↓
User      Dense        Candidate    Precise       Prompt         Response
question   vector      documents    top-k        construction    with citations

Let’s trace through a concrete example to build intuition.

Scenario: A legal firm has 50,000 contract documents. A lawyer asks: “What are the standard termination clauses for vendor agreements?”

Indexing (done once): 1. Parse all 50,000 contracts, extracting text from PDFs 2. Split each contract into chunks (perhaps by clause or paragraph) 3. Generate an embedding vector for each chunk using a model like E5 or BGE 4. Store vectors in a searchable index (e.g., Qdrant, Pinecone) 5. Store chunk text and metadata (contract name, date, type) alongside vectors

Query (every request): 1. Embed the query “What are the standard termination clauses for vendor agreements?” using the same model 2. Search the vector index for the 50 most similar chunk vectors 3. Rerank these 50 using a more powerful cross-encoder model, keeping top 5 4. Assemble the top 5 chunks into a prompt 5. Send to LLM: “Based on these contract excerpts, explain standard termination clauses for vendor agreements” 6. Return the response with citations to specific contracts

The magic happens in step 2: because the query and documents are embedded by the same model into the same vector space, we can find semantically similar content even when the words don’t match exactly. “Termination clauses” finds documents mentioning “early exit provisions” or “contract cancellation terms.”

Why Vectors Enable Semantic Search

Traditional keyword search (like Elasticsearch’s BM25) matches documents containing query terms. It works well when users know the exact vocabulary in the documents. But it fails on semantic mismatches:

Query: “how to cancel my subscription” → Document uses: “membership termination procedure”
Query: “vacation policy” → Document uses: “paid time off guidelines”
Query: “machine learning engineer salary” → Document uses: “ML practitioner compensation”

Embedding models solve this by mapping text to points in high-dimensional space where semantic similarity corresponds to geometric proximity. This is the foundational insight of modern NLP.

How does this work? Embedding models (like E5, GTE, or OpenAI’s text-embedding-3) are trained on massive datasets of text pairs that are semantically related—question-answer pairs, paraphrase pairs, document-summary pairs. Through this training, the model learns to:

Place synonyms near each other (“cancel” ≈ “terminate” ≈ “end”)
Group related concepts (“subscription,” “membership,” “account” cluster together)
Separate unrelated meanings (the word “bank” in financial vs. river contexts gets different vectors)

When we embed a query and documents with the same model, we can find documents whose embeddings are geometrically close to the query embedding—meaning they’re semantically related, even without word overlap.

This is why embedding model choice matters so much: the model’s training determines what “similar” means. A model trained on scientific papers will have different similarity judgments than one trained on customer support conversations.

Historical Context: The Evolution of Information Retrieval

1970s (TF-IDF): Term Frequency-Inverse Document Frequency became the foundation of search. Documents were represented as sparse vectors of word counts, weighted by rarity. Simple, interpretable, and still used today—but purely lexical with no understanding of meaning.

1990s (LSA/LSI): Latent Semantic Analysis applied SVD to term-document matrices, discovering latent “topics.” First attempt at semantic similarity, but computationally expensive and limited by linear algebra assumptions.

2013 (Word2Vec): Mikolov’s word embeddings revolutionized NLP by learning dense vectors where similar words clustered together. “King - Man + Woman ≈ Queen” captured something like meaning. But averaging word vectors for documents lost crucial information.

2017-2018 (Transformer Encoders): BERT and its variants produced contextual embeddings—the same word got different vectors depending on context. Sentence-BERT (2019) adapted this for efficient sentence embedding, enabling practical semantic search.

2020-2022 (Contrastive Learning): Models like E5, GTE, and BGE used contrastive training on massive text pairs to create embedding models optimized specifically for retrieval. These dominate modern RAG systems.

2023-2024 (Hybrid & Learned): Hybrid search combining BM25 with dense retrieval became standard. ColBERT’s late-interaction paradigm and learned sparse encoders (SPLADE) blur the line between lexical and semantic approaches.

The pattern: Retrieval evolved from exact word matching to statistical co-occurrence to learned semantic similarity. Each generation traded interpretability for capability, but the core goal remained: finding relevant information efficiently.

Document Processing and Chunking

Staff Engineer Perspective

“When I review RAG implementations, I first check chunk size and overlap. 90% of the retrieval problems I see come from chunks that are too small or don’t overlap. Everyone reads the tutorials saying ‘256 tokens is optimal’ and blindly copies it. Then they wonder why their legal document Q&A system returns sentence fragments that make no sense.”

— Principal Engineer at legal tech startup

The Chunking Problem

Here’s a fundamental tension in RAG: embedding models and LLM context windows have limited capacity, but documents can be arbitrarily long. A 100-page contract cannot be embedded as a single unit—the embedding model would truncate or struggle, and even if it worked, a single vector for 100 pages would lose detail. We must split documents into smaller chunks.

But chunking introduces its own problems:

Semantic fragmentation: Splitting at arbitrary boundaries can divide a coherent thought across chunks, losing meaning
Context loss: A chunk may reference “the policy described above”—but “above” is in a different chunk
Size variability: Very small chunks may lack context; very large chunks may dilute the signal

Good chunking strategies navigate these tradeoffs for your specific use case.

Fixed-Size Chunking: Simple But Fragile

The simplest approach splits text at regular token intervals:

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    tokens = tokenizer.encode(text)
    chunks = []
    for start in range(0, len(tokens), chunk_size - overlap):
        end = start + chunk_size
        chunks.append(tokenizer.decode(tokens[start:end]))
    return chunks

Why overlap? Without overlap, a relevant sentence split across chunk boundaries might not match any single chunk well enough to be retrieved. Overlap ensures boundary content appears in at least one chunk.

When fixed-size works: Homogeneous documents where any segment is roughly self-contained—news articles, blog posts, social media. The simplicity is valuable: predictable memory usage, easy to reason about.

When it fails: Structured documents where meaning depends on headers, sections, or cross-references. Splitting a legal contract mid-clause produces chunks that are difficult to interpret without context.

Semantic Chunking: Following Natural Boundaries

Semantic chunking respects document structure, splitting at natural boundaries like paragraphs, sentences, or sections:

def semantic_chunk(text: str, max_tokens: int = 512) -> list[str]:
    """Split at paragraph boundaries, respecting max size."""
    paragraphs = text.split('\n\n')
    chunks, current, current_size = [], [], 0

    for para in paragraphs:
        para_tokens = len(tokenizer.encode(para))
        if current_size + para_tokens > max_tokens and current:
            chunks.append('\n\n'.join(current))
            current, current_size = [], 0
        current.append(para)
        current_size += para_tokens

    if current:
        chunks.append('\n\n'.join(current))
    return chunks

This preserves paragraph coherence. But what if a single paragraph exceeds the limit? We need fallback logic—either split it further (losing coherence) or truncate (losing content).

Recursive Chunking: Best of Both Worlds

Recursive chunking tries progressively finer splits until content fits:

def recursive_chunk(text: str, max_tokens: int = 512,
                    separators: list[str] = ['\n\n', '\n', '. ', ' ']) -> list[str]:
    """Try coarse splits first, fall back to finer splits."""
    if len(tokenizer.encode(text)) <= max_tokens:
        return [text]

    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_chunk(part, max_tokens, separators))
            return chunks

    # Last resort: hard split
    return fixed_size_chunk(text, max_tokens)

This tries to split at double-newlines (paragraphs) first. If any segment is still too large, try single newlines, then sentences, then words. It maximizes semantic coherence while guaranteeing size limits.

Full implementation: See reference/07b_rag_systems_code.md for complete implementations including structure-aware chunking with header tracking.

How Chunk Size Affects Retrieval

Chunk size is the most important parameter in RAG. It creates a fundamental tradeoff:

Smaller chunks (128-256 tokens):

✓ More precise retrieval—each chunk is about one idea
✓ Higher recall—easier to match specific facts
✗ Loss of context—chunks may be hard to interpret alone
✗ More chunks to store and search

Larger chunks (512-1024 tokens):

✓ Self-contained—each chunk has enough context
✓ Fewer chunks, more efficient storage and search
✗ Lower precision—relevant sentence buried in irrelevant paragraph
✗ May exceed embedding model’s effective window

A reasonable default is to start at 512 tokens with overlap, then tune: go smaller (down to ~256) only when precision demands it, and larger (512–1024) when meaning depends on surrounding context. The instinct to shrink chunks usually backfires—too-small chunks return fragments that lose the context the model needs (this is the same default the “Chunks Too Small” mistake below recommends). The right size depends heavily on your documents:

Document Type	Recommended Size	Rationale
FAQs, knowledge bases	128-256 tokens	One question-answer pair per chunk
Technical documentation	256-512 tokens	Section-level coherence
Legal contracts	Full clauses	Clause is atomic semantic unit
Research papers	512-1024 tokens	Need surrounding context for meaning
Code files	Full functions	Function is natural semantic unit

Practical advice: Start with 512 tokens and semantic boundaries. Measure retrieval quality on real queries. If retrieval finds irrelevant results, try smaller chunks. If retrieved chunks lack context, try larger or add context metadata.

The Lost Context Problem

When you chunk a document, each chunk loses its place in the whole. Consider this chunk:

“As discussed in the previous section, these conditions apply. The party responsible must notify the other party within 30 days.”

In isolation, this is nearly useless—what conditions? Which parties? The references point outside the chunk.

Solution 1: Context headers. Prepend hierarchical context:

def add_context_headers(chunk: str, headers: list[str]) -> str:
    """Prepend document structure context to chunk."""
    context = " > ".join(headers)  # "Contract > Section 3 > Termination"
    return f"[{context}]\n{chunk}"

Solution 2: Sentence-window retrieval. Embed individual sentences but retrieve with surrounding window:

def sentence_window_chunk(text: str, window: int = 3) -> list[dict]:
    """Embed sentences, but return with surrounding context."""
    sentences = sent_tokenize(text)
    for i, sent in enumerate(sentences):
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        yield {
            'embed_text': sent,  # This gets embedded
            'context': ' '.join(sentences[start:end])  # This gets returned
        }

This gives you sentence-level retrieval precision with paragraph-level context in the final prompt.

Solution 3: Parent-child indexing. Index both small chunks (for retrieval precision) and large chunks (for context). When a small chunk matches, return its parent.

Document Extraction: Handling Real-World Formats

Real documents come as PDFs, Word files, HTML pages, and scanned images. Extraction quality directly impacts RAG quality—garbage in, garbage out.

def extract_document(path: str) -> tuple[str, dict]:
    """Extract text and metadata from various formats."""
    ext = path.rsplit('.', 1)[-1].lower()

    extractors = {
        'pdf': extract_pdf,
        'docx': extract_docx,
        'html': extract_html,
        'md': extract_markdown,
    }

    if ext not in extractors:
        raise ValueError(f"Unsupported format: {ext}")

    return extractors[ext](path)

PDF challenges are underestimated:

Multi-column layouts: Reading order ambiguity
Tables: Structure lost when flattened
Scanned documents: Require OCR
Headers/footers: Repeated noise
Embedded images with text: Invisible without OCR

For production systems, consider specialized services (Unstructured.io, LlamaParse, AWS Textract) that handle these edge cases. The investment pays off in retrieval quality.

Full implementation: See reference/07b_rag_systems_code.md for complete extraction code.

Common Mistake: Chunks Too Small

What people do: Set chunk size to 128-256 tokens “to be safe” and ensure precise retrieval.

Why it fails: Small chunks lose context. A sentence like “This applies to returns within 30 days” is meaningless without knowing what “this” refers to. Retrieval metrics may look good (you found a relevant sentence), but generation quality suffers.

Fix: Start with 512-1024 tokens and tune based on retrieval + generation quality together. Include chunk overlap (10-20%). Consider parent-child chunking: retrieve small chunks for precision, include parent chunk in context for completeness.

Embeddings and Vector Search

How Embedding Models Work

Embedding models transform text into dense vectors (typically 384-4096 dimensions) where geometric relationships encode semantic relationships. But how?

Modern embedding models are neural networks—typically transformers—trained on massive datasets of related text pairs. The training objective shapes what “similar” means:

Contrastive learning: Given a query q, a positive document d+ (relevant), and negative documents d- (irrelevant), train the model so that:

similarity(embed(q), embed(d+)) is high
similarity(embed(q), embed(d-)) is low

Through millions of such examples, the model learns to map semantically related texts to nearby points in vector space.

Why this works: The model learns features—syntactic patterns, entity types, topic signals—that distinguish relevant from irrelevant. These features become the dimensions of the embedding space. When a query and document share features (i.e., are about the same topic, contain related concepts), their vectors point in similar directions.

What the dimensions represent: Unlike hand-crafted features, neural network dimensions don’t have interpretable meanings. Dimension 47 isn’t “about finance.” Instead, each dimension captures a learned combination of features. The entire vector, taken together, represents the text’s position in semantic space.

Embedding Models for Retrieval

Not all embeddings are equal for retrieval. General-purpose embeddings from language models capture meaning but aren’t optimized for finding relevant documents.

Retrieval-specific models are trained explicitly on query-document pairs:

Model	Dimensions	Strengths
E5-large-v2	1024	Strong general-purpose, good multilingual
BGE-large-en-v1.5	1024	Excellent English, instruction-following
GTE-large-en-v1.5	1024	Strong benchmark performance
text-embedding-3-large	3072	High capacity, commercial
Voyage-large-2	1024	Domain-specific options available

Query vs. document prefixes: Some models (E5, BGE) use different prefixes for queries and documents:

# E5 model expects these prefixes
query_embedding = model.encode("query: What is the refund policy?")
doc_embedding = model.encode("passage: Our refund policy allows returns within 30 days...")

This asymmetric encoding helps because queries and documents are linguistically different—queries are short questions, documents are declarative paragraphs. The prefixes tell the model to produce appropriately calibrated embeddings.

Choosing an Embedding Model

Embedding Model Selection Flowchart

Quick recommendations:

Best quality (API): OpenAI text-embedding-3-large (3072 dims)
Best quality (self-hosted): BGE-large-en-v1.5 or GTE-large
Best efficiency: all-MiniLM-L6-v2 (384 dims, very fast)
Multilingual: E5-multilingual-large

Similarity Metrics: The Math of “Closeness”

Given two vectors, how do we measure similarity?

Cosine similarity measures the angle between vectors, ignoring magnitude:

\[\text{cosine}(a, b) = \frac{a \cdot b}{\|a\| \|b\|}\]

Range: [-1, 1] where 1 = identical direction, 0 = orthogonal, -1 = opposite.

Dot product is simply $a \cdot b$. For unit vectors (normalized to length 1), dot product equals cosine similarity.

Euclidean distance measures straight-line distance: $\|a - b\|_2$. Lower = more similar.

Why cosine is standard: Embedding models typically normalize outputs, making magnitude uninformative. Cosine (or equivalently, dot product on normalized vectors) focuses on direction—which captures semantic content. Two documents about “refund policies” should be similar regardless of length.

Practical note: Most systems normalize embeddings and use dot product, which is faster (no normalization at query time) and equivalent to cosine for normalized vectors.

Exact vs. Approximate Nearest Neighbor Search

With embeddings and similarity metrics, retrieval becomes a nearest neighbor search: find the k vectors most similar to the query vector.

Exact search (brute force): Compare the query to every document. Trivially correct but O(n) per query—fine for 10,000 documents, infeasible for 10 million.

Approximate nearest neighbor (ANN): Use index structures to find probably nearest neighbors without exhaustive search. Tradeoff: small accuracy loss for massive speedup.

The key ANN algorithms:

HNSW (Hierarchical Navigable Small World):

Builds a graph where similar vectors are connected
Query navigates the graph, following edges toward similar vectors
Like a social network where “similar” people know each other—start anywhere, follow connections to find who you’re looking for
Very fast queries, but high memory (stores graph structure)

IVF (Inverted File Index):

Clusters vectors into buckets (Voronoi cells)
Query checks only nearby buckets
Like organizing a library by topic—search relevant sections, not the whole library
Tunable: check more buckets for higher recall

Product Quantization (PQ):

Compresses vectors by representing them as combinations of codebook entries
Like describing colors as “red-ish, somewhat green, no blue” instead of exact RGB values
10-100x memory reduction with modest accuracy loss
Often combined with IVF for very large scales

import faiss

# Exact search (< 10K vectors)
index = faiss.IndexFlatIP(dimension)

# HNSW (10K - 1M vectors)
index = faiss.IndexHNSWFlat(dimension, M=32)

# IVF + PQ (> 1M vectors)
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist=1024, M=32, bits=8)
index.train(sample_vectors)  # Required for IVF/PQ

Full implementation: See reference/07b_rag_systems_code.md for complete indexing and search code.

Vector Search Doesn’t Understand Everything

A critical insight: vector similarity captures topical relevance, not answer correctness.

When you search for “What is the return policy?”, vector search finds documents about return policies. But it doesn’t know:

Whether the document answers the question
Whether it’s the current policy or an outdated version
Whether it applies to your specific situation

This is why RAG pipelines need multiple stages. Vector search finds candidates; reranking selects the best; generation synthesizes an answer; prompting ensures grounding. Weakness in any component propagates.

Common Mistake: Ignoring Metadata

What people do: Store only text and embeddings, discarding document metadata (dates, authors, categories, versions).

Why it fails: Vector search alone can’t answer “What’s the current policy?” vs. “What was the policy in 2023?” Both documents embed similarly. Without metadata filtering, you return outdated information or mix answers from different contexts.

Fix: Store metadata alongside embeddings. Use metadata filters in queries (e.g., date >= '2024-01-01'). Include source attribution in generated responses so users can verify. Consider metadata-aware reranking that boosts recent documents.

Hybrid Search and Reranking

The Complementary Strengths of Vector and Keyword Search

Vector search and keyword search (BM25) fail in different ways:

Vector search failures:

Exact identifiers: “error code XJ-4402” may not embed close to documents mentioning it
Technical terminology: “PostgreSQL pg_dump” gets semantically diffused
Named entities: “Dr. Sarah Chen’s 2024 paper” needs exact matching

Keyword search failures:

Synonyms: “automobile” doesn’t match “car”
Paraphrasing: “how to end my account” doesn’t match “subscription termination process”
Conceptual queries: “best practices for scaling” spans many possible wordings

Hybrid search combines both, capturing documents that are either lexically or semantically similar.

BM25: The Keyword Baseline

BM25 is the standard keyword ranking algorithm, an improvement on TF-IDF that handles document length variation:

\[\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}\]

The intuition:

IDF (Inverse Document Frequency): Rare terms matter more. “Termination” is more informative than “the.”
TF (Term Frequency): More occurrences = stronger signal, but with diminishing returns
Length normalization: Long documents get many terms by chance; normalize for length

BM25 requires no training—just tokenization and statistics. It’s fast, interpretable, and remains competitive with neural methods for exact-match queries.

Reciprocal Rank Fusion

Combining two ranked lists is tricky. Scores aren’t comparable (cosine similarity vs. BM25 scores live in different scales). Reciprocal Rank Fusion (RRF) sidesteps this by using ranks, not scores:

\[\text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)}\]

Where k is a constant (typically 60) that prevents very high-ranked documents from dominating.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine multiple rankings using RRF."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

RRF is robust because:

No score normalization needed
Outliers in one retriever don’t dominate
Documents ranked well by multiple retrievers rise to the top

Full implementation: See reference/07b_rag_systems_code.md for complete hybrid search with BM25 and vector fusion.

Reranking: Precision Over Speed

Initial retrieval (vector or hybrid) casts a wide net—retrieve 50-100 candidates quickly. Reranking uses a more expensive model to precisely score these candidates.

The insight: bi-encoders (embedding models) score documents independently—embed query, embed document, compute similarity. Cross-encoders process query and document together, enabling rich interaction between them.

from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model: str = "BAAI/bge-reranker-large"):
        self.model = CrossEncoder(model)

    def rerank(self, query: str, documents: list[str], top_k: int = 5):
        pairs = [[query, doc] for doc in documents]
        scores = self.model.predict(pairs)
        ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
        return ranked[:top_k]

Why cross-encoders are more accurate: They can match specific query terms to document locations, understand negation (“not eligible for” vs. “eligible for”), and reason about relevance holistically. Bi-encoders must compress all meaning into fixed vectors before comparison.

Why we don’t use cross-encoders for initial retrieval: Cross-encoders require O(n) forward passes (one per document). For millions of documents, this is impossibly slow. Bi-encoders enable precomputation—embed documents once, search repeatedly.

The two-stage pattern—fast bi-encoder retrieval, precise cross-encoder reranking—gives us the best of both worlds.

The Reranker That Wasn’t

The situation: An enterprise search team spent three months building a sophisticated RAG pipeline for their internal knowledge base. Their crown jewel was a cross-encoder reranker fine-tuned on 50,000 labeled query-document pairs. Offline metrics were stellar: MRR improved from 0.65 to 0.84.

What happened: They launched with fanfare. Usage dropped 40% within a month. Users complained the system was “too slow to be useful.” Exit surveys revealed people were returning to keyword search or just asking colleagues directly.

Root cause: The reranker added 800ms of latency per query—acceptable in benchmarks, unacceptable in practice. Users expect search results in under 300ms. An 84% MRR means nothing if users don’t wait for results. The team had measured the wrong thing: they optimized for ranking quality while users optimized for time-to-answer.

How they fixed it: 1. Implemented async reranking: show initial bi-encoder results immediately, then update with reranked results if user is still on page 2. Added a lightweight reranker (100ms budget) for real-time, heavy reranker for “show more” requests 3. Built latency into their evaluation metrics: score = accuracy × (1 - latency_penalty) 4. Set up user timing dashboards alongside ranking quality dashboards

The takeaway: Users don’t experience your metrics; they experience latency. The best ranking algorithm your users never see is worthless.

Common Mistake: Skipping Reranking

What people do: Return top-k results from vector search directly to the LLM.

Why it fails: Vector similarity ≠ relevance. A document about your topic may score higher than a document that answers your question. Without reranking, you fill context with tangentially related content, forcing the LLM to sift through noise.

Fix: Always add a reranking stage, even a simple one. Cross-encoders like ms-marco-MiniLM-L-6-v2 add ~50ms latency but dramatically improve answer quality. For production, batch reranking requests and set a timeout—partial reranking beats no reranking.

Context Assembly and Generation

The Art of Prompt Construction

You’ve retrieved and reranked documents. Now you must assemble them into a prompt that elicits accurate, grounded responses.

Key principles:

Structure for clarity: Label documents clearly so the model can cite them.

RAG_PROMPT = """Answer the question based ONLY on the following documents.
If the documents don't contain the answer, say "I don't have information about that."
Cite sources using [Doc N] format.

{documents}

Question: {question}

Answer:"""

def format_documents(docs: list[str]) -> str:
    return "\n\n".join(f"[Doc {i+1}]:\n{doc}" for i, doc in enumerate(docs))

Explicit grounding instructions: Without clear instructions, models may blend retrieved content with parametric knowledge. “Answer based ONLY on the following documents” reduces this.

Handle insufficient context gracefully: The model should acknowledge when it can’t answer rather than hallucinate.

The Lost-in-the-Middle Problem

A surprising finding from Liu et al. (2024): language models with long contexts don’t use all the context equally. Information in the middle of long prompts is often ignored—the model attends more strongly to the beginning and end.

Implications for RAG:

Put the most relevant documents first (or last)
Don’t dump 50 documents into context hoping the model will find what it needs
Quality over quantity—5 highly relevant chunks beat 20 mixed ones

Mitigation strategies:

Aggressive reranking to ensure top documents are truly best
Context compression (summarize less-relevant documents)
Multi-pass generation (retrieve more, generate summaries, retrieve again)

Citation and Attribution

Users need to verify AI responses. Proper citations enable this:

def extract_citations(response: str, num_docs: int) -> list[int]:
    """Extract document citations from response."""
    import re
    citations = re.findall(r'\[Doc (\d+)\]', response)
    valid = [int(c) for c in citations if 1 <= int(c) <= num_docs]
    return list(set(valid))

Trust but verify: Models sometimes hallucinate citations—referencing [Doc 5] when only 3 documents were provided, or attributing statements to the wrong source. Consider:

Validating citation numbers are in range
Checking that cited content actually supports the claim (expensive but possible with another LLM call)
Surfacing source documents to users for verification

Advanced RAG Patterns

Basic RAG—retrieve, then generate—handles simple factual queries. Complex scenarios require more sophisticated orchestration.

Query Transformation

User queries are often poorly suited for retrieval:

Too short: “refund?” lacks context
Wrong vocabulary: user terms ≠ document terms
Complex: multiple sub-questions combined

Query expansion adds related terms:

def expand_query(query: str, llm) -> str:
    """Generate query expansions for better retrieval."""
    expansion = llm.generate(f"""Generate search terms related to: {query}

    Related terms (comma-separated):""")
    return f"{query} {expansion}"

HyDE (Hypothetical Document Embeddings) generates a hypothetical answer, then retrieves documents similar to it:

def hyde_retrieve(query: str, retriever, llm) -> list[dict]:
    """Generate hypothetical answer, use it for retrieval."""
    hypothetical = llm.generate(
        f"Write a detailed paragraph answering: {query}"
    )
    return retriever.search(hypothetical)  # Search using generated doc

Why HyDE works: The hypothetical answer uses vocabulary likely to appear in actual documents—bridging the query-document lexical gap. It’s particularly effective when users ask questions in conversational language while documents are written formally.

Query decomposition breaks complex questions into simpler sub-queries:

def decompose_and_retrieve(query: str, retriever, llm) -> list[dict]:
    """Break complex query into sub-queries, retrieve for each."""
    sub_queries = llm.generate(
        f"Break this into simpler questions:\n{query}"
    ).split('\n')

    all_docs = []
    for sub_q in sub_queries:
        all_docs.extend(retriever.search(sub_q))
    return deduplicate(all_docs)

Self-RAG and Corrective RAG

Advanced patterns add reflection and self-correction.

Self-RAG (Asai et al., 2023) decides whether to retrieve (some questions don’t need external knowledge) and evaluates whether the response is supported by retrieved documents:

class SelfRAG:
    def query(self, question: str) -> str:
        # Decide if retrieval is needed
        if self.needs_retrieval(question):
            docs = self.retriever.search(question)
            response = self.generate_with_context(question, docs)

            # Verify response is supported by documents
            if not self.is_supported(response, docs):
                response = self.regenerate(question, docs)
        else:
            response = self.generate_direct(question)

        return response

Corrective RAG (CRAG) evaluates retrieval quality and falls back to alternative sources (like web search) when the knowledge base doesn’t have good answers:

class CorrectiveRAG:
    def query(self, question: str) -> str:
        docs = self.retriever.search(question)
        quality = self.evaluate_retrieval(question, docs)

        if quality == "CORRECT":
            context = docs
        elif quality == "INCORRECT":
            context = self.web_search(question)  # Fallback
        else:  # AMBIGUOUS
            context = docs + self.web_search(question)  # Combine

        return self.generate(question, context)

These patterns add LLM calls (increasing latency and cost) but significantly improve reliability on hard queries.

Full implementation: See reference/07b_rag_systems_code.md for complete Self-RAG, CRAG, and multi-hop implementations.

Multi-Hop RAG

Some questions require synthesizing information across multiple documents that aren’t directly connected:

“Which of our company’s products were affected by the EU AI Act regulations discussed in last month’s legal memo?”

This needs: (1) the legal memo about EU AI Act, (2) product specifications, (3) reasoning about which products fall under which regulations.

Multi-hop RAG iteratively retrieves and reasons:

class MultiHopRAG:
    def query(self, question: str, max_hops: int = 3) -> str:
        context = []
        current_query = question

        for _ in range(max_hops):
            docs = self.retriever.search(current_query)
            context.extend(docs)

            can_answer, missing = self.check_answerability(question, context)
            if can_answer:
                break

            # Generate follow-up query to find missing information
            current_query = self.generate_followup(question, context, missing)

        return self.generate_answer(question, context)

GraphRAG: Knowledge Graphs Meet Retrieval

Vector search finds topically similar documents but doesn’t understand relationships between entities. GraphRAG combines vector retrieval with knowledge graph traversal:

Extract entities and relationships from documents (using an LLM or NER models)
Build a knowledge graph connecting entities
At query time, find relevant entities and traverse the graph to find related information

This excels for queries like:

“How is Project Alpha related to the Johnson contract?”
“What other products use the same supplier as Widget X?”
“Who worked on both the 2023 audit and the compliance review?”

Full implementation: See reference/07b_rag_systems_code.md for complete GraphRAG implementation with entity extraction and graph traversal.

Evaluation

Why RAG Evaluation is Hard

RAG systems have multiple components that can fail independently:

Retrieval might miss relevant documents
Reranking might demote the best documents
Context assembly might truncate critical information
Generation might hallucinate despite good context

A single end-to-end metric hides which component is failing. Comprehensive evaluation requires component-level and end-to-end metrics.

Retrieval Metrics

Classic information retrieval metrics:

Recall@k: Of all relevant documents, what fraction were retrieved in top k? \[\text{Recall@k} = \frac{|\text{Relevant} \cap \text{Retrieved@k}|}{|\text{Relevant}|}\]

Precision@k: Of retrieved documents, what fraction were relevant? \[\text{Precision@k} = \frac{|\text{Relevant} \cap \text{Retrieved@k}|}{k}\]

MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant document.

def evaluate_retrieval(retriever, queries, relevant_docs) -> dict:
    """Evaluate retrieval quality."""
    metrics = {'recall@5': [], 'recall@10': [], 'mrr': []}

    for query, relevant in zip(queries, relevant_docs):
        results = retriever.search(query, k=10)
        retrieved_ids = [r['id'] for r in results]

        metrics['recall@5'].append(
            len(set(retrieved_ids[:5]) & relevant) / len(relevant)
        )
        metrics['recall@10'].append(
            len(set(retrieved_ids[:10]) & relevant) / len(relevant)
        )

        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant:
                metrics['mrr'].append(1 / rank)
                break
        else:
            metrics['mrr'].append(0)

    return {k: sum(v)/len(v) for k, v in metrics.items()}

End-to-End Evaluation

For the full pipeline, evaluate answer quality:

Faithfulness: Does the response only contain information from retrieved documents? (Checks for hallucination)

Relevance: Does the response answer the question?

Completeness: Does the response cover all aspects of the question?

LLM-as-judge works well for these:

def evaluate_faithfulness(response: str, context: str, evaluator_llm) -> float:
    """Check if response is faithful to context."""
    prompt = f"""Rate whether this response is faithful to the provided context.
    A faithful response only contains information present in the context.

    Context: {context}
    Response: {response}

    Rate 1-5 (5 = completely faithful):"""

    return float(evaluator_llm.generate(prompt))

Full implementation: See reference/07b_rag_systems_code.md for complete evaluation harness with component-level and end-to-end metrics.

Debugging Failures

When RAG fails, diagnose systematically:

Was the relevant document retrieved? → Retrieval problem (embedding model, chunking, index)
Was it ranked high enough? → Reranking problem
Was it included in context? → Context assembly problem
Did generation use it correctly? → Prompt engineering problem

class RAGDebugger:
    def diagnose(self, query: str, expected_answer: str) -> dict:
        """Diagnose RAG failure point."""
        retrieved = self.retriever.search(query, k=50)

        # Check if answer content is in any retrieved doc
        relevant = [d for d in retrieved if self.contains_answer(d, expected_answer)]

        if not relevant:
            return {'failure': 'retrieval', 'issue': 'Relevant content not retrieved'}

        reranked = self.reranker.rerank(query, retrieved[:20])
        relevant_ranks = [i for i, d in enumerate(reranked) if d in relevant]

        if min(relevant_ranks) > 5:
            return {'failure': 'reranking', 'issue': f'Relevant doc ranked {min(relevant_ranks)}'}

        response = self.generate(query, reranked[:5])
        if not self.answer_correct(response, expected_answer):
            return {'failure': 'generation', 'issue': 'Good context but wrong answer'}

        return {'failure': None, 'status': 'success'}

Production Considerations

Vector Database Selection

For production, you need more than FAISS: persistence, updates, filtering, scaling.

Decision framework:

Scale	Complexity	Recommendation
< 100K vectors	Low	PostgreSQL + pgvector
100K - 10M	Medium	Qdrant or Weaviate
> 10M	High	Milvus or Pinecone

Key capabilities to evaluate:

Hybrid search support (vector + keyword)
Metadata filtering efficiency
Update performance (real-time vs. batch)
Horizontal scaling options
Backup and disaster recovery

Monitoring and Observability

Production RAG needs continuous monitoring:

@dataclass
class RAGMetrics:
    retrieval_latency_ms: float
    rerank_latency_ms: float
    generation_latency_ms: float
    num_docs_retrieved: int
    top_retrieval_score: float
    user_feedback: Optional[str]

def log_rag_request(query: str, metrics: RAGMetrics):
    """Log for monitoring and debugging."""
    logger.info({
        'query': query,
        **asdict(metrics),
        'timestamp': datetime.utcnow().isoformat()
    })

Alert on:

Latency percentile spikes
Low retrieval scores (indicates out-of-domain queries)
User feedback trends
Document freshness (stale indexes)

Common Production Pitfalls

Index-embedding mismatch: Using different embedding models for indexing vs. querying produces nonsense results. Version your embeddings with your index.

Stale documents: When source documents update, indexes become inconsistent. Implement incremental re-indexing.

Query injection: User queries become part of prompts. Sanitize inputs and use delimiters to separate user content from instructions.

Cost spiral: LLM calls per query add up. Monitor and set budgets.

Naive RAG
Production RAG

def answer_question(query: str, documents: list[str]) -> str:
    # Embed and search
    query_vec = embed(query)
    results = vector_db.search(query_vec, k=10)

    # Stuff everything into context
    context = "\n\n".join([doc.text for doc in results])

    response = llm.generate(
        f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    )
    return response

async def answer_question(
    query: str,
    user_id: str,
    timeout: float = 10.0
) -> RAGResponse:
    """Production RAG with reranking, citation, and observability."""
    start = time.monotonic()

    # Stage 1: Hybrid retrieval (vector + BM25)
    async with asyncio.timeout(timeout * 0.3):
        vector_results = await vector_db.search(embed(query), k=50)
        keyword_results = await bm25_index.search(query, k=50)
        candidates = reciprocal_rank_fusion(vector_results, keyword_results)

    # Stage 2: Cross-encoder reranking
    async with asyncio.timeout(timeout * 0.4):
        reranked = await reranker.rerank(query, candidates[:20])
        top_docs = [d for d in reranked if d.score > 0.5][:5]

    if not top_docs:
        return RAGResponse(answer=None, status="no_relevant_docs")

    # Stage 3: Generate with citation requirements
    context = format_docs_with_ids(top_docs)  # [1], [2], etc.

    response = await llm.generate(
        messages=[{
            "role": "system",
            "content": "Answer based ONLY on the provided context. "
                      "Cite sources using [1], [2], etc. "
                      "If unsure, say 'I don't have enough information.'"
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
        max_tokens=500
    )

    # Log for monitoring
    log_rag_metrics(
        query=query, user_id=user_id,
        retrieval_latency=(time.monotonic() - start),
        docs_used=len(top_docs),
        top_score=top_docs[0].score if top_docs else 0
    )

    return RAGResponse(
        answer=response.content,
        sources=[d.metadata for d in top_docs],
        latency_ms=(time.monotonic() - start) * 1000
    )

Key differences: Hybrid search, reranking stage, relevance threshold, citation enforcement, timeout management, observability, and structured response with sources.

Key Takeaways

Chunking strategy determines retrieval quality - Preserve semantic coherence while enabling precise retrieval. Use document-aware chunking that respects natural boundaries like paragraphs and sections.
Hybrid search outperforms single approaches - Combine vector search (semantic similarity) with BM25 (keyword matching). Vector search handles paraphrases; BM25 handles exact terms and identifiers.
Two-stage retrieval balances speed and accuracy - Use fast bi-encoders to find candidates, then expensive cross-encoders to precisely rerank. This achieves both scale and quality.
The “lost in the middle” problem affects generation - LLMs attend more strongly to content at the beginning and end of context. Put the most relevant documents first and prioritize quality over quantity.
Evaluate retrieval and generation separately - Component-level evaluation enables precise diagnosis. Measure retrieval recall@k, generation faithfulness, and end-to-end answer correctness independently.

Summary

RAG transforms language models from unreliable oracles into grounded research assistants. The pattern is simple—retrieve then generate—but production quality requires:

Thoughtful chunking that preserves semantic coherence while enabling precise retrieval
Embedding models trained for retrieval, not just similarity
Hybrid search combining vector and keyword approaches
Reranking to precision-sort noisy retrieval results
Careful prompt engineering to ensure grounding and citation
Component-level evaluation to diagnose failures
Production infrastructure for monitoring, updates, and scale

The field continues to evolve rapidly. GraphRAG, adaptive retrieval, and learned sparse representations are pushing boundaries. But the fundamentals—understanding what makes retrieval work, why generation can fail, how to evaluate systematically—remain essential regardless of which specific techniques you adopt.

RAG Architecture Selection Flowchart

RAG Architecture Selection Decision Flowchart

Practical Exercises

Chunking experiment: Take a 10-page document. Implement three chunking strategies (fixed-size, semantic, recursive). For 20 test queries, measure retrieval recall. Which strategy wins? Why?
Embedding comparison: Compare E5-large, BGE-large, and OpenAI embeddings on your documents. Measure recall@10 and latency. Is the most expensive model worth it?
Hybrid search tuning: Implement hybrid search with configurable vector/BM25 weights. Use grid search to find optimal weights for your evaluation set.
Failure analysis: Collect 20 queries where your RAG system gives wrong answers. For each, diagnose the failure point (retrieval, reranking, generation). What patterns emerge?
GraphRAG prototype: For a small document set, extract entities and relationships. Implement graph traversal. Find queries where GraphRAG outperforms standard RAG.

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] A RAG system uses 512-token chunks. For a legal contract where clauses reference each other heavily, what problems might this cause and how would you address them?

Answer

Problems: (1) Clauses may be split mid-sentence, losing meaning. (2) References like “as described in Section 3.2 above” become meaningless when Section 3.2 is in a different chunk. Solutions: (1) Use document-aware chunking that respects clause boundaries. (2) Add context headers showing the document structure (e.g., “[Contract > Section 3 > Termination Clause]”). (3) Use parent-child indexing: index small chunks for precision but return parent sections for context. (4) Consider sentence-window retrieval: embed sentences but return with surrounding context.

Q2. [IC2] Why do we use two-stage retrieval (bi-encoder then cross-encoder reranking) instead of just using cross-encoders for everything?

Answer

Cross-encoders are more accurate because they process query and document together, enabling rich interaction. But they require O(n) forward passes—one per document. For millions of documents, this is impossibly slow. Bi-encoders precompute document embeddings, enabling sublinear search via ANN indexes. The two-stage pattern uses fast bi-encoders to find ~50-100 candidates, then expensive cross-encoders to precisely rank only these candidates. This achieves both speed and accuracy.

Q3. [Senior] Explain the “lost in the middle” problem and its implications for RAG prompt design.

Answer

Liu et al. (2024) found that LLMs with long contexts don’t use context uniformly—they attend more strongly to content at the beginning and end, often ignoring the middle. For RAG: (1) Don’t dump many documents hoping the model finds what it needs. (2) Put the most relevant documents first (or last). (3) Aggressive reranking matters—ensure top documents are truly best. (4) Consider quality over quantity—5 highly relevant chunks beat 20 mixed ones. (5) For long contexts, consider summarizing less-relevant documents.

Q4. [Senior] A hybrid search system combines BM25 and vector search using RRF. What query types would favor each component, and how would you tune the balance?

Answer

BM25 favors: exact terms, identifiers, technical terminology, named entities (“error XJ-4402”, “PostgreSQL pg_dump”). Vector search favors: conceptual queries, paraphrases, synonyms (“how to cancel account” vs. “subscription termination”). Tuning: Create an evaluation set with both query types. Use grid search over RRF weights or fusion parameters. Common finding: 50/50 works well as default, but domain-specific tuning (e.g., 70/30 for technical docs with many identifiers) can improve significantly.

Q5. [Staff] Your RAG system handles internal company documents. The legal team says some documents are confidential and should only be retrieved for users with proper authorization. How do you architect this?

Answer

Metadata filtering approach: (1) Store access control metadata (department, clearance level, document classification) with each chunk. (2) At query time, filter retrieval to documents the user can access. Most vector DBs support metadata filtering. Considerations: (1) Filter BEFORE retrieval, not after—otherwise you leak information through similarity scores. (2) Test that filtered results still have sufficient coverage. (3) Consider separate indexes for different access levels for very sensitive data. (4) Audit logging for compliance. (5) Don’t put access control logic in the LLM prompt—it’s not reliable.

Spot the Problem

Problem 1. [IC2] A developer’s chunking function:

def chunk_document(text: str, size: int = 500) -> list[str]:
    words = text.split()
    return [' '.join(words[i:i+size]) for i in range(0, len(words), size)]

What’s wrong?

Answer

Issues: (1) Chunking by words, not tokens—500 words could be 600-800 tokens, exceeding embedding model limits. (2) No overlap—sentences at boundaries may not match queries well. (3) Splits arbitrarily, even mid-sentence. (4) Ignores document structure (paragraphs, sections). Better: Use a tokenizer to count tokens, add overlap (e.g., 10-15%), and respect semantic boundaries (at least sentence boundaries, ideally paragraphs).

Problem 2. [Senior] A RAG system has 95% retrieval recall@10 on the eval set but users complain answers are often wrong. What’s happening?

# Their evaluation
retrieved = retriever.search(query, k=10)
if any(doc.id in ground_truth for doc in retrieved):
    recall_hits += 1

Answer

Several possible issues: (1) Recall@10 means the right document is somewhere in top 10, but generation may only use top 3-5. If the right document is at position 8, it might get cut. (2) They’re checking document ID, not whether the document actually answers the question—document might be related but not contain the answer. (3) Generation faithfulness—even with perfect retrieval, the model might hallucinate or misuse context. (4) Eval set might not represent real user queries. Need to evaluate the full pipeline: retrieval precision@k for actual k used, faithfulness of generated answers, and end-to-end answer correctness.

Problem 3. [Staff] A production RAG system worked well for months, then quality suddenly dropped. No code changes were deployed. What should you investigate?

Answer

Likely causes: (1) Source document changes—documents were updated but the index is stale. Check document freshness and re-index. (2) Embedding model API changed—if using a hosted embedding service, the model may have been updated, changing the embedding space. Vectors indexed with old model won’t match queries with new model. (3) Data drift—user query patterns changed and the system wasn’t designed for them. (4) Upstream API changes—LLM provider may have updated the model. (5) Volume spike causing degraded performance (timeouts, truncation). Investigate: Check document timestamps vs index timestamps, compare embedding similarity distributions over time, review query logs for pattern changes, check latency metrics.

Design Exercises

Exercise 1. [Senior] Design a RAG system for a customer support chatbot. Requirements: 50,000 support articles, need sub-2-second response time, must cite sources, handles 1,000 queries/hour peak. Outline your architecture including chunking strategy, retrieval pipeline, and infrastructure choices.

Guidance

Consider: (1) Chunking: Support articles are structured—chunk by FAQ pairs or section headers. 256-512 tokens per chunk. (2) Retrieval: Hybrid search (BM25 + vectors) with reranking. Retrieve 20, rerank to 5. (3) Infrastructure: 50K documents = ~200K chunks = fit easily in managed vector DB (Pinecone, Qdrant Cloud). (4) Latency budget: Embedding (~100ms) + Vector search (~50ms) + Rerank (~200ms) + LLM (~1000ms) = ~1.3s, leaving margin. (5) Citations: Structure prompt to output source IDs, validate citations exist. (6) Caching: Consider caching frequent queries—support has many repeat questions.

Exercise 2. [Staff] You’re building a RAG system for a research organization with 2 million academic papers. Users ask complex questions requiring synthesis across multiple papers. Design the system considering: scale, multi-hop reasoning needs, and the fact that papers heavily reference each other.

Guidance

This is a GraphRAG candidate. Consider: (1) Chunking: Papers have structure (abstract, intro, methods, results)—chunk by section. Also extract figure captions separately. (2) Entity extraction: Extract paper metadata, citations, key terms, author names. Build a citation graph. (3) Multi-hop: Implement iterative retrieval—find initial papers, check if answer is complete, follow citations if not. (4) Scale: 2M papers × ~10 chunks = 20M vectors. Need serious infrastructure—Milvus or Pinecone with good filtering. (5) Consider hierarchical indexing: topic-level index first, then paper-level. (6) Latency: Accept higher latency (5-10s) for complex queries given the research use case.

Connections to Other Chapters

RAG systems sit at the intersection of retrieval, language understanding, and production engineering. Here’s how this chapter connects to others:

Chapter 5 (LLM/NLP Foundations): Provides the theory behind embeddings—how text becomes vectors, why similar meanings cluster together, and how to choose embedding models. Essential background for understanding retrieval.
Chapter 6 (Prompt Engineering): Techniques for integrating retrieved context into prompts effectively. Covers context window management, few-shot examples, and structured outputs for RAG responses.
Chapter 8 (Agentic Systems): Agents often use RAG as a tool for knowledge retrieval. Understanding RAG fundamentals helps you build more capable retrieval-augmented agents.
Chapter 15 (MLOps & Evaluation): Extends evaluation concepts to RAG-specific challenges—retrieval metrics, answer faithfulness, and end-to-end quality assessment.
Chapter 30 (Data Architecture for AI): Data pipeline design and vector database infrastructure at scale. Covers feature stores, data versioning, and the data engineering required for production RAG.

Lewis et al. (2020), “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” - The paper that introduced RAG. Start here.
Karpukhin et al. (2020), “Dense Passage Retrieval for Open-Domain QA” - Dense retrieval foundations that underpin modern RAG systems.
Liu et al. (2024), “Lost in the Middle” - How models use long contexts. Critical for understanding retrieval ordering.

Deep Dives

Asai et al. (2023), “Self-RAG: Learning to Retrieve, Generate, and Critique” - Self-reflective retrieval for higher quality.
Malkov & Yashunin (2018), “HNSW: Hierarchical Navigable Small World Graphs” - The algorithm behind most vector database indices.
Es et al. (2023), “RAGAS: Automated Evaluation of RAG” - Systematic framework for measuring RAG performance.

--- title: "Chapter 7: RAG Systems" keywords: [retrieval, embeddings, vector search, chunking, reranking, hybrid search, semantic search, HNSW, ColBERT, knowledge grounding] difficulty: intermediate prerequisites: [ch05, ch06] estimated_time: "4-5 hours" --- ## Introduction In the summer of 2022, a team at a major technology company faced a problem. They had deployed a large language model to answer questions about their internal documentation—thousands of pages covering engineering practices, HR policies, and product specifications. The model was impressive: articulate, confident, and completely unreliable. It would cite policies that didn't exist, invent employee benefits, and present plausible-sounding procedures that contradicted official guidelines. The fundamental issue was simple: the model's knowledge was frozen at training time, and it had never seen the company's documents. Retrieval-Augmented Generation (RAG) emerged as the solution to this problem. Instead of relying solely on knowledge baked into model parameters during training, RAG systems retrieve relevant documents at inference time and include them directly in the prompt. The model generates responses grounded in specific, verifiable sources rather than interpolating from training data. The idea is elegantly simple: look things up before answering. But the execution is nuanced. Building a RAG system that works reliably—that retrieves the right information, presents it effectively, and generates accurate responses—requires understanding a stack of interconnected techniques: document processing, embedding models, vector search, reranking, and prompt engineering. Each component has its own tradeoffs, and weaknesses in any link can break the chain. This chapter takes you from the foundations of RAG to production-ready systems. We'll examine not just *how* each component works, but *why* it works—the mathematical and conceptual foundations that explain when techniques succeed and when they fail. By the end, you'll understand RAG deeply enough to debug failures, choose between alternatives, and push the boundaries of what's possible. ### The Core Insight: Why RAG Works To understand RAG, you need to understand why language models sometimes fail at knowledge-intensive tasks. A language model like GPT-5 or Claude Opus contains vast amounts of knowledge encoded in its parameters—facts about the world, patterns of language, reasoning strategies. But this knowledge has fundamental limitations: 1. **Temporal cutoff**: The model only knows what existed in its training data. Events after training are invisible. 2. **Coverage gaps**: Training data, however massive, doesn't include everything. Proprietary documents, recent publications, niche domains—all are absent. 3. **Parametric compression**: Knowledge stored in parameters is lossy. The model learns statistical patterns, not a perfect database. Details blur, facts merge, and the model may "remember" something that's a plausible interpolation rather than a precise fact. 4. **Confidence without calibration**: Models generate confident-sounding text regardless of whether they actually know the answer. They're trained to produce fluent completions, not to say "I don't know." RAG addresses these limitations by shifting from parametric knowledge (stored in model weights) to non-parametric knowledge (retrieved from external sources). When a user asks a question, the system: 1. **Retrieves** relevant documents from a knowledge base 2. **Augments** the prompt with these documents as context 3. **Generates** a response grounded in the retrieved information This is powerful because retrieval is exact: we can guarantee the model sees the current version of a document, include proprietary information without retraining, and update knowledge by simply updating the document store. The model becomes a sophisticated reader and synthesizer rather than a unreliable memory. ### A Mental Model for RAG Think of RAG as giving a language model an open-book exam. Without RAG, the model must answer from memory—and like a student who crammed but didn't deeply learn the material, it may confuse details, conflate topics, or confidently assert things that aren't quite right. With RAG, the model can flip through a textbook, find relevant passages, and compose an answer that synthesizes what it reads. The key challenges then become: - **Finding the right pages**: Given a question, which documents are relevant? - **Reading efficiently**: With limited space in the prompt, what to include? - **Synthesizing accurately**: How to ensure the model uses the documents faithfully? These map to the core RAG components: retrieval, context assembly, and generation with grounding. ### When to Use RAG vs Fine-tuning Before diving into RAG implementation, understand when RAG is the right choice: ![RAG vs Fine-tuning Decision Flowchart](../assets/diagrams/rendered/ch07_rag_vs_finetuning.svg) **Choose RAG when**: Knowledge updates frequently, citations needed, no training budget, quick deployment required. **Choose Fine-tuning when**: Static knowledge, need specific style/behavior, latency critical, knowledge fits in training data. **Use both when**: Large knowledge base with specific format requirements. ### What You'll Learn - How embedding models encode semantic meaning, and why this enables similarity search - Chunking strategies and the tradeoffs that govern them - Vector search algorithms: exact, approximate, and hybrid approaches - Reranking to improve precision when initial retrieval is noisy - Advanced patterns: query transformation, self-reflection, multi-hop reasoning - Evaluation methodologies that diagnose where systems fail ### Prerequisites - Understanding of embeddings and attention mechanisms (Chapter 5) - Familiarity with prompt engineering principles (Chapter 6) - Basic database concepts (indexes, queries, consistency) - Python programming --- ## The RAG Architecture ### The Two-Phase Design RAG systems operate in two distinct phases with different computational constraints and optimization objectives. ![RAG Pipeline Architecture](../assets/diagrams/rendered/rag_pipeline.svg) **Indexing (offline)**: Process your document corpus once, preparing it for fast retrieval. This phase can be compute-intensive because it runs ahead of user requests. ``` Documents → Parsing → Chunking → Embedding → Vector Index ↓ ↓ ↓ ↓ Clean text Segments Dense vectors Searchable store ``` **Query (online)**: For each user request, find relevant context and generate a response. This phase must be fast—users expect responses in seconds. ``` Query → Embedding → Vector Search → Reranking → Context Assembly → Generation ↓ ↓ ↓ ↓ ↓ ↓ User Dense Candidate Precise Prompt Response question vector documents top-k construction with citations ``` Let's trace through a concrete example to build intuition. **Scenario**: A legal firm has 50,000 contract documents. A lawyer asks: "What are the standard termination clauses for vendor agreements?" **Indexing (done once)**: 1. Parse all 50,000 contracts, extracting text from PDFs 2. Split each contract into chunks (perhaps by clause or paragraph) 3. Generate an embedding vector for each chunk using a model like E5 or BGE 4. Store vectors in a searchable index (e.g., Qdrant, Pinecone) 5. Store chunk text and metadata (contract name, date, type) alongside vectors **Query (every request)**: 1. Embed the query "What are the standard termination clauses for vendor agreements?" using the same model 2. Search the vector index for the 50 most similar chunk vectors 3. Rerank these 50 using a more powerful cross-encoder model, keeping top 5 4. Assemble the top 5 chunks into a prompt 5. Send to LLM: "Based on these contract excerpts, explain standard termination clauses for vendor agreements" 6. Return the response with citations to specific contracts The magic happens in step 2: because the query and documents are embedded by the same model into the same vector space, we can find semantically similar content even when the words don't match exactly. "Termination clauses" finds documents mentioning "early exit provisions" or "contract cancellation terms." ### Why Vectors Enable Semantic Search Traditional keyword search (like Elasticsearch's BM25) matches documents containing query terms. It works well when users know the exact vocabulary in the documents. But it fails on semantic mismatches: - Query: "how to cancel my subscription" → Document uses: "membership termination procedure" - Query: "vacation policy" → Document uses: "paid time off guidelines" - Query: "machine learning engineer salary" → Document uses: "ML practitioner compensation" Embedding models solve this by mapping text to points in high-dimensional space where **semantic similarity corresponds to geometric proximity**. This is the foundational insight of modern NLP. How does this work? Embedding models (like E5, GTE, or OpenAI's text-embedding-3) are trained on massive datasets of text pairs that are semantically related—question-answer pairs, paraphrase pairs, document-summary pairs. Through this training, the model learns to: - Place synonyms near each other ("cancel" ≈ "terminate" ≈ "end") - Group related concepts ("subscription," "membership," "account" cluster together) - Separate unrelated meanings (the word "bank" in financial vs. river contexts gets different vectors) When we embed a query and documents with the same model, we can find documents whose embeddings are geometrically close to the query embedding—meaning they're semantically related, even without word overlap. This is why embedding model choice matters so much: the model's training determines what "similar" means. A model trained on scientific papers will have different similarity judgments than one trained on customer support conversations. ::: {.callout-note} ## Historical Context: The Evolution of Information Retrieval **1970s (TF-IDF)**: Term Frequency-Inverse Document Frequency became the foundation of search. Documents were represented as sparse vectors of word counts, weighted by rarity. Simple, interpretable, and still used today—but purely lexical with no understanding of meaning. **1990s (LSA/LSI)**: Latent Semantic Analysis applied SVD to term-document matrices, discovering latent "topics." First attempt at semantic similarity, but computationally expensive and limited by linear algebra assumptions. **2013 (Word2Vec)**: Mikolov's word embeddings revolutionized NLP by learning dense vectors where similar words clustered together. "King - Man + Woman ≈ Queen" captured something like meaning. But averaging word vectors for documents lost crucial information. **2017-2018 (Transformer Encoders)**: BERT and its variants produced contextual embeddings—the same word got different vectors depending on context. Sentence-BERT (2019) adapted this for efficient sentence embedding, enabling practical semantic search. **2020-2022 (Contrastive Learning)**: Models like E5, GTE, and BGE used contrastive training on massive text pairs to create embedding models optimized specifically for retrieval. These dominate modern RAG systems. **2023-2024 (Hybrid & Learned)**: Hybrid search combining BM25 with dense retrieval became standard. ColBERT's late-interaction paradigm and learned sparse encoders (SPLADE) blur the line between lexical and semantic approaches. **The pattern**: Retrieval evolved from exact word matching to statistical co-occurrence to learned semantic similarity. Each generation traded interpretability for capability, but the core goal remained: finding relevant information efficiently. ::: --- ## Document Processing and Chunking ::: {.callout-tip} ## Staff Engineer Perspective "When I review RAG implementations, I first check chunk size and overlap. 90% of the retrieval problems I see come from chunks that are too small or don't overlap. Everyone reads the tutorials saying '256 tokens is optimal' and blindly copies it. Then they wonder why their legal document Q&A system returns sentence fragments that make no sense." — *Principal Engineer at legal tech startup* ::: ### The Chunking Problem Here's a fundamental tension in RAG: embedding models and LLM context windows have limited capacity, but documents can be arbitrarily long. A 100-page contract cannot be embedded as a single unit—the embedding model would truncate or struggle, and even if it worked, a single vector for 100 pages would lose detail. We must split documents into smaller **chunks**. But chunking introduces its own problems: 1. **Semantic fragmentation**: Splitting at arbitrary boundaries can divide a coherent thought across chunks, losing meaning 2. **Context loss**: A chunk may reference "the policy described above"—but "above" is in a different chunk 3. **Size variability**: Very small chunks may lack context; very large chunks may dilute the signal Good chunking strategies navigate these tradeoffs for your specific use case. ### Fixed-Size Chunking: Simple But Fragile The simplest approach splits text at regular token intervals: ```python def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]: """Split text into fixed-size chunks with overlap.""" tokens = tokenizer.encode(text) chunks = [] for start in range(0, len(tokens), chunk_size - overlap): end = start + chunk_size chunks.append(tokenizer.decode(tokens[start:end])) return chunks ``` **Why overlap?** Without overlap, a relevant sentence split across chunk boundaries might not match any single chunk well enough to be retrieved. Overlap ensures boundary content appears in at least one chunk. **When fixed-size works**: Homogeneous documents where any segment is roughly self-contained—news articles, blog posts, social media. The simplicity is valuable: predictable memory usage, easy to reason about. **When it fails**: Structured documents where meaning depends on headers, sections, or cross-references. Splitting a legal contract mid-clause produces chunks that are difficult to interpret without context. ### Semantic Chunking: Following Natural Boundaries Semantic chunking respects document structure, splitting at natural boundaries like paragraphs, sentences, or sections: ```python def semantic_chunk(text: str, max_tokens: int = 512) -> list[str]: """Split at paragraph boundaries, respecting max size.""" paragraphs = text.split('\n\n') chunks, current, current_size = [], [], 0 for para in paragraphs: para_tokens = len(tokenizer.encode(para)) if current_size + para_tokens > max_tokens and current: chunks.append('\n\n'.join(current)) current, current_size = [], 0 current.append(para) current_size += para_tokens if current: chunks.append('\n\n'.join(current)) return chunks ``` This preserves paragraph coherence. But what if a single paragraph exceeds the limit? We need fallback logic—either split it further (losing coherence) or truncate (losing content). ### Recursive Chunking: Best of Both Worlds Recursive chunking tries progressively finer splits until content fits: ```python def recursive_chunk(text: str, max_tokens: int = 512, separators: list[str] = ['\n\n', '\n', '. ', ' ']) -> list[str]: """Try coarse splits first, fall back to finer splits.""" if len(tokenizer.encode(text)) <= max_tokens: return [text] for sep in separators: parts = text.split(sep) if len(parts) > 1: chunks = [] for part in parts: chunks.extend(recursive_chunk(part, max_tokens, separators)) return chunks # Last resort: hard split return fixed_size_chunk(text, max_tokens) ``` This tries to split at double-newlines (paragraphs) first. If any segment is still too large, try single newlines, then sentences, then words. It maximizes semantic coherence while guaranteeing size limits. > **Full implementation**: See [reference/07b_rag_systems_code.md](../reference/07b_rag_systems_code.html#chunking-strategies) for complete implementations including structure-aware chunking with header tracking. ### How Chunk Size Affects Retrieval Chunk size is the most important parameter in RAG. It creates a fundamental tradeoff: **Smaller chunks (128-256 tokens)**: - ✓ More precise retrieval—each chunk is about one idea - ✓ Higher recall—easier to match specific facts - ✗ Loss of context—chunks may be hard to interpret alone - ✗ More chunks to store and search **Larger chunks (512-1024 tokens)**: - ✓ Self-contained—each chunk has enough context - ✓ Fewer chunks, more efficient storage and search - ✗ Lower precision—relevant sentence buried in irrelevant paragraph - ✗ May exceed embedding model's effective window **A reasonable default** is to start at 512 tokens with overlap, then tune: go smaller (down to ~256) only when precision demands it, and larger (512–1024) when meaning depends on surrounding context. The instinct to shrink chunks usually backfires—too-small chunks return fragments that lose the context the model needs (this is the same default the "Chunks Too Small" mistake below recommends). The right size depends heavily on your documents: | Document Type | Recommended Size | Rationale | |---------------|------------------|-----------| | FAQs, knowledge bases | 128-256 tokens | One question-answer pair per chunk | | Technical documentation | 256-512 tokens | Section-level coherence | | Legal contracts | Full clauses | Clause is atomic semantic unit | | Research papers | 512-1024 tokens | Need surrounding context for meaning | | Code files | Full functions | Function is natural semantic unit | **Practical advice**: Start with 512 tokens and semantic boundaries. Measure retrieval quality on real queries. If retrieval finds irrelevant results, try smaller chunks. If retrieved chunks lack context, try larger or add context metadata. ### The Lost Context Problem When you chunk a document, each chunk loses its place in the whole. Consider this chunk: > "As discussed in the previous section, these conditions apply. The party responsible must notify the other party within 30 days." In isolation, this is nearly useless—what conditions? Which parties? The references point outside the chunk. **Solution 1: Context headers**. Prepend hierarchical context: ```python def add_context_headers(chunk: str, headers: list[str]) -> str: """Prepend document structure context to chunk.""" context = " > ".join(headers) # "Contract > Section 3 > Termination" return f"[{context}]\n{chunk}" ``` **Solution 2: Sentence-window retrieval**. Embed individual sentences but retrieve with surrounding window: ```python def sentence_window_chunk(text: str, window: int = 3) -> list[dict]: """Embed sentences, but return with surrounding context.""" sentences = sent_tokenize(text) for i, sent in enumerate(sentences): start = max(0, i - window) end = min(len(sentences), i + window + 1) yield { 'embed_text': sent, # This gets embedded 'context': ' '.join(sentences[start:end]) # This gets returned } ``` This gives you sentence-level retrieval precision with paragraph-level context in the final prompt. **Solution 3: Parent-child indexing**. Index both small chunks (for retrieval precision) and large chunks (for context). When a small chunk matches, return its parent. ### Document Extraction: Handling Real-World Formats Real documents come as PDFs, Word files, HTML pages, and scanned images. Extraction quality directly impacts RAG quality—garbage in, garbage out. ```python def extract_document(path: str) -> tuple[str, dict]: """Extract text and metadata from various formats.""" ext = path.rsplit('.', 1)[-1].lower() extractors = { 'pdf': extract_pdf, 'docx': extract_docx, 'html': extract_html, 'md': extract_markdown, } if ext not in extractors: raise ValueError(f"Unsupported format: {ext}") return extractors[ext](path) ``` **PDF challenges are underestimated**: - Multi-column layouts: Reading order ambiguity - Tables: Structure lost when flattened - Scanned documents: Require OCR - Headers/footers: Repeated noise - Embedded images with text: Invisible without OCR For production systems, consider specialized services (Unstructured.io, LlamaParse, AWS Textract) that handle these edge cases. The investment pays off in retrieval quality. > **Full implementation**: See [reference/07b_rag_systems_code.md](../reference/07b_rag_systems_code.html#document-extraction) for complete extraction code. ::: {.callout-warning} ## Common Mistake: Chunks Too Small **What people do**: Set chunk size to 128-256 tokens "to be safe" and ensure precise retrieval. **Why it fails**: Small chunks lose context. A sentence like "This applies to returns within 30 days" is meaningless without knowing what "this" refers to. Retrieval metrics may look good (you found a relevant sentence), but generation quality suffers. **Fix**: Start with 512-1024 tokens and tune based on retrieval + generation quality together. Include chunk overlap (10-20%). Consider parent-child chunking: retrieve small chunks for precision, include parent chunk in context for completeness. ::: --- ## Embeddings and Vector Search ### How Embedding Models Work Embedding models transform text into dense vectors (typically 384-4096 dimensions) where geometric relationships encode semantic relationships. But how? Modern embedding models are neural networks—typically transformers—trained on massive datasets of related text pairs. The training objective shapes what "similar" means: **Contrastive learning**: Given a query q, a positive document d+ (relevant), and negative documents d- (irrelevant), train the model so that: - similarity(embed(q), embed(d+)) is high - similarity(embed(q), embed(d-)) is low Through millions of such examples, the model learns to map semantically related texts to nearby points in vector space. **Why this works**: The model learns features—syntactic patterns, entity types, topic signals—that distinguish relevant from irrelevant. These features become the dimensions of the embedding space. When a query and document share features (i.e., are about the same topic, contain related concepts), their vectors point in similar directions. **What the dimensions represent**: Unlike hand-crafted features, neural network dimensions don't have interpretable meanings. Dimension 47 isn't "about finance." Instead, each dimension captures a learned combination of features. The entire vector, taken together, represents the text's position in semantic space. ### Embedding Models for Retrieval Not all embeddings are equal for retrieval. General-purpose embeddings from language models capture meaning but aren't optimized for finding relevant documents. **Retrieval-specific models** are trained explicitly on query-document pairs: | Model | Dimensions | Strengths | |-------|------------|-----------| | E5-large-v2 | 1024 | Strong general-purpose, good multilingual | | BGE-large-en-v1.5 | 1024 | Excellent English, instruction-following | | GTE-large-en-v1.5 | 1024 | Strong benchmark performance | | text-embedding-3-large | 3072 | High capacity, commercial | | Voyage-large-2 | 1024 | Domain-specific options available | **Query vs. document prefixes**: Some models (E5, BGE) use different prefixes for queries and documents: ```python # E5 model expects these prefixes query_embedding = model.encode("query: What is the refund policy?") doc_embedding = model.encode("passage: Our refund policy allows returns within 30 days...") ``` This asymmetric encoding helps because queries and documents are linguistically different—queries are short questions, documents are declarative paragraphs. The prefixes tell the model to produce appropriately calibrated embeddings. ### Choosing an Embedding Model ![Embedding Model Selection Flowchart](../assets/diagrams/rendered/ch07_embedding_model_selection.svg) **Quick recommendations**: - **Best quality (API)**: OpenAI text-embedding-3-large (3072 dims) - **Best quality (self-hosted)**: BGE-large-en-v1.5 or GTE-large - **Best efficiency**: all-MiniLM-L6-v2 (384 dims, very fast) - **Multilingual**: E5-multilingual-large ### Similarity Metrics: The Math of "Closeness" Given two vectors, how do we measure similarity? **Cosine similarity** measures the angle between vectors, ignoring magnitude: $$\text{cosine}(a, b) = \frac{a \cdot b}{\|a\| \|b\|}$$ Range: [-1, 1] where 1 = identical direction, 0 = orthogonal, -1 = opposite. **Dot product** is simply $a \cdot b$. For unit vectors (normalized to length 1), dot product equals cosine similarity. **Euclidean distance** measures straight-line distance: $\|a - b\|_2$. Lower = more similar. **Why cosine is standard**: Embedding models typically normalize outputs, making magnitude uninformative. Cosine (or equivalently, dot product on normalized vectors) focuses on direction—which captures semantic content. Two documents about "refund policies" should be similar regardless of length. **Practical note**: Most systems normalize embeddings and use dot product, which is faster (no normalization at query time) and equivalent to cosine for normalized vectors. ### Exact vs. Approximate Nearest Neighbor Search With embeddings and similarity metrics, retrieval becomes a nearest neighbor search: find the k vectors most similar to the query vector. **Exact search (brute force)**: Compare the query to every document. Trivially correct but O(n) per query—fine for 10,000 documents, infeasible for 10 million. **Approximate nearest neighbor (ANN)**: Use index structures to find *probably* nearest neighbors without exhaustive search. Tradeoff: small accuracy loss for massive speedup. The key ANN algorithms: **HNSW (Hierarchical Navigable Small World)**: - Builds a graph where similar vectors are connected - Query navigates the graph, following edges toward similar vectors - Like a social network where "similar" people know each other—start anywhere, follow connections to find who you're looking for - Very fast queries, but high memory (stores graph structure) **IVF (Inverted File Index)**: - Clusters vectors into buckets (Voronoi cells) - Query checks only nearby buckets - Like organizing a library by topic—search relevant sections, not the whole library - Tunable: check more buckets for higher recall **Product Quantization (PQ)**: - Compresses vectors by representing them as combinations of codebook entries - Like describing colors as "red-ish, somewhat green, no blue" instead of exact RGB values - 10-100x memory reduction with modest accuracy loss - Often combined with IVF for very large scales ```python import faiss # Exact search (< 10K vectors) index = faiss.IndexFlatIP(dimension) # HNSW (10K - 1M vectors) index = faiss.IndexHNSWFlat(dimension, M=32) # IVF + PQ (> 1M vectors) quantizer = faiss.IndexFlatL2(dimension) index = faiss.IndexIVFPQ(quantizer, dimension, nlist=1024, M=32, bits=8) index.train(sample_vectors) # Required for IVF/PQ ``` > **Full implementation**: See [reference/07b_rag_systems_code.md](../reference/07b_rag_systems_code.html#vector-search-with-faiss) for complete indexing and search code. ### Vector Search Doesn't Understand Everything A critical insight: vector similarity captures *topical relevance*, not *answer correctness*. When you search for "What is the return policy?", vector search finds documents *about* return policies. But it doesn't know: - Whether the document *answers* the question - Whether it's the *current* policy or an outdated version - Whether it applies to *your specific situation* This is why RAG pipelines need multiple stages. Vector search finds candidates; reranking selects the best; generation synthesizes an answer; prompting ensures grounding. Weakness in any component propagates. ::: {.callout-warning} ## Common Mistake: Ignoring Metadata **What people do**: Store only text and embeddings, discarding document metadata (dates, authors, categories, versions). **Why it fails**: Vector search alone can't answer "What's the *current* policy?" vs. "What was the policy in 2023?" Both documents embed similarly. Without metadata filtering, you return outdated information or mix answers from different contexts. **Fix**: Store metadata alongside embeddings. Use metadata filters in queries (e.g., `date >= '2024-01-01'`). Include source attribution in generated responses so users can verify. Consider metadata-aware reranking that boosts recent documents. ::: --- ## Hybrid Search and Reranking ### The Complementary Strengths of Vector and Keyword Search Vector search and keyword search (BM25) fail in different ways: **Vector search failures**: - Exact identifiers: "error code XJ-4402" may not embed close to documents mentioning it - Technical terminology: "PostgreSQL pg_dump" gets semantically diffused - Named entities: "Dr. Sarah Chen's 2024 paper" needs exact matching **Keyword search failures**: - Synonyms: "automobile" doesn't match "car" - Paraphrasing: "how to end my account" doesn't match "subscription termination process" - Conceptual queries: "best practices for scaling" spans many possible wordings **Hybrid search** combines both, capturing documents that are either lexically or semantically similar. ### BM25: The Keyword Baseline BM25 is the standard keyword ranking algorithm, an improvement on TF-IDF that handles document length variation: $$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$$ The intuition: - **IDF (Inverse Document Frequency)**: Rare terms matter more. "Termination" is more informative than "the." - **TF (Term Frequency)**: More occurrences = stronger signal, but with diminishing returns - **Length normalization**: Long documents get many terms by chance; normalize for length BM25 requires no training—just tokenization and statistics. It's fast, interpretable, and remains competitive with neural methods for exact-match queries. ### Reciprocal Rank Fusion Combining two ranked lists is tricky. Scores aren't comparable (cosine similarity vs. BM25 scores live in different scales). Reciprocal Rank Fusion (RRF) sidesteps this by using ranks, not scores: $$\text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)}$$ Where k is a constant (typically 60) that prevents very high-ranked documents from dominating. ```python def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]: """Combine multiple rankings using RRF.""" scores = defaultdict(float) for ranking in rankings: for rank, doc_id in enumerate(ranking, 1): scores[doc_id] += 1.0 / (k + rank) return sorted(scores, key=scores.get, reverse=True) ``` RRF is robust because: - No score normalization needed - Outliers in one retriever don't dominate - Documents ranked well by multiple retrievers rise to the top > **Full implementation**: See [reference/07b_rag_systems_code.md](../reference/07b_rag_systems_code.html#hybrid-search) for complete hybrid search with BM25 and vector fusion. ### Reranking: Precision Over Speed Initial retrieval (vector or hybrid) casts a wide net—retrieve 50-100 candidates quickly. **Reranking** uses a more expensive model to precisely score these candidates. The insight: bi-encoders (embedding models) score documents independently—embed query, embed document, compute similarity. Cross-encoders process query and document together, enabling rich interaction between them. ```python from sentence_transformers import CrossEncoder class Reranker: def __init__(self, model: str = "BAAI/bge-reranker-large"): self.model = CrossEncoder(model) def rerank(self, query: str, documents: list[str], top_k: int = 5): pairs = [[query, doc] for doc in documents] scores = self.model.predict(pairs) ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True) return ranked[:top_k] ``` **Why cross-encoders are more accurate**: They can match specific query terms to document locations, understand negation ("not eligible for" vs. "eligible for"), and reason about relevance holistically. Bi-encoders must compress all meaning into fixed vectors before comparison. **Why we don't use cross-encoders for initial retrieval**: Cross-encoders require O(n) forward passes (one per document). For millions of documents, this is impossibly slow. Bi-encoders enable precomputation—embed documents once, search repeatedly. The two-stage pattern—fast bi-encoder retrieval, precise cross-encoder reranking—gives us the best of both worlds. ::: {.callout-warning} ## The Reranker That Wasn't **The situation**: An enterprise search team spent three months building a sophisticated RAG pipeline for their internal knowledge base. Their crown jewel was a cross-encoder reranker fine-tuned on 50,000 labeled query-document pairs. Offline metrics were stellar: MRR improved from 0.65 to 0.84. **What happened**: They launched with fanfare. Usage dropped 40% within a month. Users complained the system was "too slow to be useful." Exit surveys revealed people were returning to keyword search or just asking colleagues directly. **Root cause**: The reranker added 800ms of latency per query—acceptable in benchmarks, unacceptable in practice. Users expect search results in under 300ms. An 84% MRR means nothing if users don't wait for results. The team had measured the wrong thing: they optimized for ranking quality while users optimized for time-to-answer. **How they fixed it**: 1. Implemented async reranking: show initial bi-encoder results immediately, then update with reranked results if user is still on page 2. Added a lightweight reranker (100ms budget) for real-time, heavy reranker for "show more" requests 3. Built latency into their evaluation metrics: score = accuracy × (1 - latency_penalty) 4. Set up user timing dashboards alongside ranking quality dashboards **The takeaway**: Users don't experience your metrics; they experience latency. The best ranking algorithm your users never see is worthless. ::: ::: {.callout-warning} ## Common Mistake: Skipping Reranking **What people do**: Return top-k results from vector search directly to the LLM. **Why it fails**: Vector similarity ≠ relevance. A document *about* your topic may score higher than a document that *answers* your question. Without reranking, you fill context with tangentially related content, forcing the LLM to sift through noise. **Fix**: Always add a reranking stage, even a simple one. Cross-encoders like `ms-marco-MiniLM-L-6-v2` add ~50ms latency but dramatically improve answer quality. For production, batch reranking requests and set a timeout—partial reranking beats no reranking. ::: --- ## Context Assembly and Generation ### The Art of Prompt Construction You've retrieved and reranked documents. Now you must assemble them into a prompt that elicits accurate, grounded responses. Key principles: **Structure for clarity**: Label documents clearly so the model can cite them. ```python RAG_PROMPT = """Answer the question based ONLY on the following documents. If the documents don't contain the answer, say "I don't have information about that." Cite sources using [Doc N] format. {documents} Question: {question} Answer:""" def format_documents(docs: list[str]) -> str: return "\n\n".join(f"[Doc {i+1}]:\n{doc}" for i, doc in enumerate(docs)) ``` **Explicit grounding instructions**: Without clear instructions, models may blend retrieved content with parametric knowledge. "Answer based ONLY on the following documents" reduces this. **Handle insufficient context gracefully**: The model should acknowledge when it can't answer rather than hallucinate. ### The Lost-in-the-Middle Problem A surprising finding from Liu et al. (2024): language models with long contexts don't use all the context equally. Information in the middle of long prompts is often ignored—the model attends more strongly to the beginning and end. **Implications for RAG**: - Put the most relevant documents first (or last) - Don't dump 50 documents into context hoping the model will find what it needs - Quality over quantity—5 highly relevant chunks beat 20 mixed ones **Mitigation strategies**: - Aggressive reranking to ensure top documents are truly best - Context compression (summarize less-relevant documents) - Multi-pass generation (retrieve more, generate summaries, retrieve again) ### Citation and Attribution Users need to verify AI responses. Proper citations enable this: ```python def extract_citations(response: str, num_docs: int) -> list[int]: """Extract document citations from response.""" import re citations = re.findall(r'\[Doc (\d+)\]', response) valid = [int(c) for c in citations if 1 <= int(c) <= num_docs] return list(set(valid)) ``` **Trust but verify**: Models sometimes hallucinate citations—referencing [Doc 5] when only 3 documents were provided, or attributing statements to the wrong source. Consider: - Validating citation numbers are in range - Checking that cited content actually supports the claim (expensive but possible with another LLM call) - Surfacing source documents to users for verification --- ## Advanced RAG Patterns Basic RAG—retrieve, then generate—handles simple factual queries. Complex scenarios require more sophisticated orchestration. ### Query Transformation User queries are often poorly suited for retrieval: - Too short: "refund?" lacks context - Wrong vocabulary: user terms ≠ document terms - Complex: multiple sub-questions combined **Query expansion** adds related terms: ```python def expand_query(query: str, llm) -> str: """Generate query expansions for better retrieval.""" expansion = llm.generate(f"""Generate search terms related to: {query} Related terms (comma-separated):""") return f"{query} {expansion}" ``` **HyDE (Hypothetical Document Embeddings)** generates a hypothetical answer, then retrieves documents similar to it: ```python def hyde_retrieve(query: str, retriever, llm) -> list[dict]: """Generate hypothetical answer, use it for retrieval.""" hypothetical = llm.generate( f"Write a detailed paragraph answering: {query}" ) return retriever.search(hypothetical) # Search using generated doc ``` Why HyDE works: The hypothetical answer uses vocabulary likely to appear in actual documents—bridging the query-document lexical gap. It's particularly effective when users ask questions in conversational language while documents are written formally. **Query decomposition** breaks complex questions into simpler sub-queries: ```python def decompose_and_retrieve(query: str, retriever, llm) -> list[dict]: """Break complex query into sub-queries, retrieve for each.""" sub_queries = llm.generate( f"Break this into simpler questions:\n{query}" ).split('\n') all_docs = [] for sub_q in sub_queries: all_docs.extend(retriever.search(sub_q)) return deduplicate(all_docs) ``` ### Self-RAG and Corrective RAG Advanced patterns add reflection and self-correction. **Self-RAG** (Asai et al., 2023) decides *whether* to retrieve (some questions don't need external knowledge) and evaluates *whether* the response is supported by retrieved documents: ```python class SelfRAG: def query(self, question: str) -> str: # Decide if retrieval is needed if self.needs_retrieval(question): docs = self.retriever.search(question) response = self.generate_with_context(question, docs) # Verify response is supported by documents if not self.is_supported(response, docs): response = self.regenerate(question, docs) else: response = self.generate_direct(question) return response ``` **Corrective RAG (CRAG)** evaluates retrieval quality and falls back to alternative sources (like web search) when the knowledge base doesn't have good answers: ```python class CorrectiveRAG: def query(self, question: str) -> str: docs = self.retriever.search(question) quality = self.evaluate_retrieval(question, docs) if quality == "CORRECT": context = docs elif quality == "INCORRECT": context = self.web_search(question) # Fallback else: # AMBIGUOUS context = docs + self.web_search(question) # Combine return self.generate(question, context) ``` These patterns add LLM calls (increasing latency and cost) but significantly improve reliability on hard queries. > **Full implementation**: See [reference/07b_rag_systems_code.md](../reference/07b_rag_systems_code.html#advanced-rag-patterns) for complete Self-RAG, CRAG, and multi-hop implementations. ### Multi-Hop RAG Some questions require synthesizing information across multiple documents that aren't directly connected: > "Which of our company's products were affected by the EU AI Act regulations discussed in last month's legal memo?" This needs: (1) the legal memo about EU AI Act, (2) product specifications, (3) reasoning about which products fall under which regulations. Multi-hop RAG iteratively retrieves and reasons: ```python class MultiHopRAG: def query(self, question: str, max_hops: int = 3) -> str: context = [] current_query = question for _ in range(max_hops): docs = self.retriever.search(current_query) context.extend(docs) can_answer, missing = self.check_answerability(question, context) if can_answer: break # Generate follow-up query to find missing information current_query = self.generate_followup(question, context, missing) return self.generate_answer(question, context) ``` ### GraphRAG: Knowledge Graphs Meet Retrieval Vector search finds topically similar documents but doesn't understand *relationships* between entities. GraphRAG combines vector retrieval with knowledge graph traversal: 1. Extract entities and relationships from documents (using an LLM or NER models) 2. Build a knowledge graph connecting entities 3. At query time, find relevant entities and traverse the graph to find related information This excels for queries like: - "How is Project Alpha related to the Johnson contract?" - "What other products use the same supplier as Widget X?" - "Who worked on both the 2023 audit and the compliance review?" > **Full implementation**: See [reference/07b_rag_systems_code.md](../reference/07b_rag_systems_code.html#graphrag) for complete GraphRAG implementation with entity extraction and graph traversal. --- ## Evaluation ### Why RAG Evaluation is Hard RAG systems have multiple components that can fail independently: - Retrieval might miss relevant documents - Reranking might demote the best documents - Context assembly might truncate critical information - Generation might hallucinate despite good context A single end-to-end metric hides which component is failing. Comprehensive evaluation requires component-level and end-to-end metrics. ### Retrieval Metrics Classic information retrieval metrics: **Recall@k**: Of all relevant documents, what fraction were retrieved in top k? $$\text{Recall@k} = \frac{|\text{Relevant} \cap \text{Retrieved@k}|}{|\text{Relevant}|}$$ **Precision@k**: Of retrieved documents, what fraction were relevant? $$\text{Precision@k} = \frac{|\text{Relevant} \cap \text{Retrieved@k}|}{k}$$ **MRR (Mean Reciprocal Rank)**: Average of 1/rank of first relevant document. ```python def evaluate_retrieval(retriever, queries, relevant_docs) -> dict: """Evaluate retrieval quality.""" metrics = {'recall@5': [], 'recall@10': [], 'mrr': []} for query, relevant in zip(queries, relevant_docs): results = retriever.search(query, k=10) retrieved_ids = [r['id'] for r in results] metrics['recall@5'].append( len(set(retrieved_ids[:5]) & relevant) / len(relevant) ) metrics['recall@10'].append( len(set(retrieved_ids[:10]) & relevant) / len(relevant) ) for rank, doc_id in enumerate(retrieved_ids, 1): if doc_id in relevant: metrics['mrr'].append(1 / rank) break else: metrics['mrr'].append(0) return {k: sum(v)/len(v) for k, v in metrics.items()} ``` ### End-to-End Evaluation For the full pipeline, evaluate answer quality: **Faithfulness**: Does the response only contain information from retrieved documents? (Checks for hallucination) **Relevance**: Does the response answer the question? **Completeness**: Does the response cover all aspects of the question? LLM-as-judge works well for these: ```python def evaluate_faithfulness(response: str, context: str, evaluator_llm) -> float: """Check if response is faithful to context.""" prompt = f"""Rate whether this response is faithful to the provided context. A faithful response only contains information present in the context. Context: {context} Response: {response} Rate 1-5 (5 = completely faithful):""" return float(evaluator_llm.generate(prompt)) ``` > **Full implementation**: See [reference/07b_rag_systems_code.md](../reference/07b_rag_systems_code.html#evaluation) for complete evaluation harness with component-level and end-to-end metrics. ### Debugging Failures When RAG fails, diagnose systematically: 1. **Was the relevant document retrieved?** → Retrieval problem (embedding model, chunking, index) 2. **Was it ranked high enough?** → Reranking problem 3. **Was it included in context?** → Context assembly problem 4. **Did generation use it correctly?** → Prompt engineering problem ```python class RAGDebugger: def diagnose(self, query: str, expected_answer: str) -> dict: """Diagnose RAG failure point.""" retrieved = self.retriever.search(query, k=50) # Check if answer content is in any retrieved doc relevant = [d for d in retrieved if self.contains_answer(d, expected_answer)] if not relevant: return {'failure': 'retrieval', 'issue': 'Relevant content not retrieved'} reranked = self.reranker.rerank(query, retrieved[:20]) relevant_ranks = [i for i, d in enumerate(reranked) if d in relevant] if min(relevant_ranks) > 5: return {'failure': 'reranking', 'issue': f'Relevant doc ranked {min(relevant_ranks)}'} response = self.generate(query, reranked[:5]) if not self.answer_correct(response, expected_answer): return {'failure': 'generation', 'issue': 'Good context but wrong answer'} return {'failure': None, 'status': 'success'} ``` --- ## Production Considerations ### Vector Database Selection For production, you need more than FAISS: persistence, updates, filtering, scaling. ![Vector Database Selection Flowchart](../assets/diagrams/rendered/ch07_vector_database_selection.svg) **Decision framework**: | Scale | Complexity | Recommendation | |-------|------------|----------------| | < 100K vectors | Low | PostgreSQL + pgvector | | 100K - 10M | Medium | Qdrant or Weaviate | | > 10M | High | Milvus or Pinecone | **Key capabilities to evaluate**: - Hybrid search support (vector + keyword) - Metadata filtering efficiency - Update performance (real-time vs. batch) - Horizontal scaling options - Backup and disaster recovery ### Monitoring and Observability Production RAG needs continuous monitoring: ```python @dataclass class RAGMetrics: retrieval_latency_ms: float rerank_latency_ms: float generation_latency_ms: float num_docs_retrieved: int top_retrieval_score: float user_feedback: Optional[str] def log_rag_request(query: str, metrics: RAGMetrics): """Log for monitoring and debugging.""" logger.info({ 'query': query, **asdict(metrics), 'timestamp': datetime.utcnow().isoformat() }) ``` **Alert on**: - Latency percentile spikes - Low retrieval scores (indicates out-of-domain queries) - User feedback trends - Document freshness (stale indexes) ### Common Production Pitfalls **Index-embedding mismatch**: Using different embedding models for indexing vs. querying produces nonsense results. Version your embeddings with your index. **Stale documents**: When source documents update, indexes become inconsistent. Implement incremental re-indexing. **Query injection**: User queries become part of prompts. Sanitize inputs and use delimiters to separate user content from instructions. **Cost spiral**: LLM calls per query add up. Monitor and set budgets. ::: {.panel-tabset} ## Naive RAG ```python def answer_question(query: str, documents: list[str]) -> str: # Embed and search query_vec = embed(query) results = vector_db.search(query_vec, k=10) # Stuff everything into context context = "\n\n".join([doc.text for doc in results]) response = llm.generate( f"Context: {context}\n\nQuestion: {query}\n\nAnswer:" ) return response ``` ## Production RAG ```python async def answer_question( query: str, user_id: str, timeout: float = 10.0 ) -> RAGResponse: """Production RAG with reranking, citation, and observability.""" start = time.monotonic() # Stage 1: Hybrid retrieval (vector + BM25) async with asyncio.timeout(timeout * 0.3): vector_results = await vector_db.search(embed(query), k=50) keyword_results = await bm25_index.search(query, k=50) candidates = reciprocal_rank_fusion(vector_results, keyword_results) # Stage 2: Cross-encoder reranking async with asyncio.timeout(timeout * 0.4): reranked = await reranker.rerank(query, candidates[:20]) top_docs = [d for d in reranked if d.score > 0.5][:5] if not top_docs: return RAGResponse(answer=None, status="no_relevant_docs") # Stage 3: Generate with citation requirements context = format_docs_with_ids(top_docs) # [1], [2], etc. response = await llm.generate( messages=[{ "role": "system", "content": "Answer based ONLY on the provided context. " "Cite sources using [1], [2], etc. " "If unsure, say 'I don't have enough information.'" }, { "role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}" }], max_tokens=500 ) # Log for monitoring log_rag_metrics( query=query, user_id=user_id, retrieval_latency=(time.monotonic() - start), docs_used=len(top_docs), top_score=top_docs[0].score if top_docs else 0 ) return RAGResponse( answer=response.content, sources=[d.metadata for d in top_docs], latency_ms=(time.monotonic() - start) * 1000 ) ``` **Key differences**: Hybrid search, reranking stage, relevance threshold, citation enforcement, timeout management, observability, and structured response with sources. ::: --- ## Key Takeaways 1. **Chunking strategy determines retrieval quality** - Preserve semantic coherence while enabling precise retrieval. Use document-aware chunking that respects natural boundaries like paragraphs and sections. 2. **Hybrid search outperforms single approaches** - Combine vector search (semantic similarity) with BM25 (keyword matching). Vector search handles paraphrases; BM25 handles exact terms and identifiers. 3. **Two-stage retrieval balances speed and accuracy** - Use fast bi-encoders to find candidates, then expensive cross-encoders to precisely rerank. This achieves both scale and quality. 4. **The "lost in the middle" problem affects generation** - LLMs attend more strongly to content at the beginning and end of context. Put the most relevant documents first and prioritize quality over quantity. 5. **Evaluate retrieval and generation separately** - Component-level evaluation enables precise diagnosis. Measure retrieval recall@k, generation faithfulness, and end-to-end answer correctness independently. --- ## Summary RAG transforms language models from unreliable oracles into grounded research assistants. The pattern is simple—retrieve then generate—but production quality requires: 1. **Thoughtful chunking** that preserves semantic coherence while enabling precise retrieval 2. **Embedding models** trained for retrieval, not just similarity 3. **Hybrid search** combining vector and keyword approaches 4. **Reranking** to precision-sort noisy retrieval results 5. **Careful prompt engineering** to ensure grounding and citation 6. **Component-level evaluation** to diagnose failures 7. **Production infrastructure** for monitoring, updates, and scale The field continues to evolve rapidly. GraphRAG, adaptive retrieval, and learned sparse representations are pushing boundaries. But the fundamentals—understanding what makes retrieval work, why generation can fail, how to evaluate systematically—remain essential regardless of which specific techniques you adopt. ### RAG Architecture Selection Flowchart ![RAG Architecture Selection Decision Flowchart](../assets/diagrams/rendered/ch07_rag_architecture_selection.svg) --- ## Practical Exercises 1. **Chunking experiment**: Take a 10-page document. Implement three chunking strategies (fixed-size, semantic, recursive). For 20 test queries, measure retrieval recall. Which strategy wins? Why? 2. **Embedding comparison**: Compare E5-large, BGE-large, and OpenAI embeddings on your documents. Measure recall@10 and latency. Is the most expensive model worth it? 3. **Hybrid search tuning**: Implement hybrid search with configurable vector/BM25 weights. Use grid search to find optimal weights for your evaluation set. 4. **Failure analysis**: Collect 20 queries where your RAG system gives wrong answers. For each, diagnose the failure point (retrieval, reranking, generation). What patterns emerge? 5. **GraphRAG prototype**: For a small document set, extract entities and relationships. Implement graph traversal. Find queries where GraphRAG outperforms standard RAG. --- ## Self-Assessment Checkpoint ### Conceptual Questions **Q1. [IC2]** A RAG system uses 512-token chunks. For a legal contract where clauses reference each other heavily, what problems might this cause and how would you address them? <details> <summary>Answer</summary> Problems: (1) Clauses may be split mid-sentence, losing meaning. (2) References like "as described in Section 3.2 above" become meaningless when Section 3.2 is in a different chunk. Solutions: (1) Use document-aware chunking that respects clause boundaries. (2) Add context headers showing the document structure (e.g., "[Contract > Section 3 > Termination Clause]"). (3) Use parent-child indexing: index small chunks for precision but return parent sections for context. (4) Consider sentence-window retrieval: embed sentences but return with surrounding context. </details> **Q2. [IC2]** Why do we use two-stage retrieval (bi-encoder then cross-encoder reranking) instead of just using cross-encoders for everything? <details> <summary>Answer</summary> Cross-encoders are more accurate because they process query and document together, enabling rich interaction. But they require O(n) forward passes—one per document. For millions of documents, this is impossibly slow. Bi-encoders precompute document embeddings, enabling sublinear search via ANN indexes. The two-stage pattern uses fast bi-encoders to find ~50-100 candidates, then expensive cross-encoders to precisely rank only these candidates. This achieves both speed and accuracy. </details> **Q3. [Senior]** Explain the "lost in the middle" problem and its implications for RAG prompt design. <details> <summary>Answer</summary> Liu et al. (2024) found that LLMs with long contexts don't use context uniformly—they attend more strongly to content at the beginning and end, often ignoring the middle. For RAG: (1) Don't dump many documents hoping the model finds what it needs. (2) Put the most relevant documents first (or last). (3) Aggressive reranking matters—ensure top documents are truly best. (4) Consider quality over quantity—5 highly relevant chunks beat 20 mixed ones. (5) For long contexts, consider summarizing less-relevant documents. </details> **Q4. [Senior]** A hybrid search system combines BM25 and vector search using RRF. What query types would favor each component, and how would you tune the balance? <details> <summary>Answer</summary> BM25 favors: exact terms, identifiers, technical terminology, named entities ("error XJ-4402", "PostgreSQL pg_dump"). Vector search favors: conceptual queries, paraphrases, synonyms ("how to cancel account" vs. "subscription termination"). Tuning: Create an evaluation set with both query types. Use grid search over RRF weights or fusion parameters. Common finding: 50/50 works well as default, but domain-specific tuning (e.g., 70/30 for technical docs with many identifiers) can improve significantly. </details> **Q5. [Staff]** Your RAG system handles internal company documents. The legal team says some documents are confidential and should only be retrieved for users with proper authorization. How do you architect this? <details> <summary>Answer</summary> Metadata filtering approach: (1) Store access control metadata (department, clearance level, document classification) with each chunk. (2) At query time, filter retrieval to documents the user can access. Most vector DBs support metadata filtering. Considerations: (1) Filter BEFORE retrieval, not after—otherwise you leak information through similarity scores. (2) Test that filtered results still have sufficient coverage. (3) Consider separate indexes for different access levels for very sensitive data. (4) Audit logging for compliance. (5) Don't put access control logic in the LLM prompt—it's not reliable. </details> ### Spot the Problem **Problem 1. [IC2]** A developer's chunking function: ```python def chunk_document(text: str, size: int = 500) -> list[str]: words = text.split() return [' '.join(words[i:i+size]) for i in range(0, len(words), size)] ``` What's wrong? <details> <summary>Answer</summary> Issues: (1) Chunking by words, not tokens—500 words could be 600-800 tokens, exceeding embedding model limits. (2) No overlap—sentences at boundaries may not match queries well. (3) Splits arbitrarily, even mid-sentence. (4) Ignores document structure (paragraphs, sections). Better: Use a tokenizer to count tokens, add overlap (e.g., 10-15%), and respect semantic boundaries (at least sentence boundaries, ideally paragraphs). </details> **Problem 2. [Senior]** A RAG system has 95% retrieval recall@10 on the eval set but users complain answers are often wrong. What's happening? ```python # Their evaluation retrieved = retriever.search(query, k=10) if any(doc.id in ground_truth for doc in retrieved): recall_hits += 1 ``` <details> <summary>Answer</summary> Several possible issues: (1) Recall@10 means the right document is somewhere in top 10, but generation may only use top 3-5. If the right document is at position 8, it might get cut. (2) They're checking document ID, not whether the document actually answers the question—document might be related but not contain the answer. (3) Generation faithfulness—even with perfect retrieval, the model might hallucinate or misuse context. (4) Eval set might not represent real user queries. Need to evaluate the full pipeline: retrieval precision@k for actual k used, faithfulness of generated answers, and end-to-end answer correctness. </details> **Problem 3. [Staff]** A production RAG system worked well for months, then quality suddenly dropped. No code changes were deployed. What should you investigate? <details> <summary>Answer</summary> Likely causes: (1) Source document changes—documents were updated but the index is stale. Check document freshness and re-index. (2) Embedding model API changed—if using a hosted embedding service, the model may have been updated, changing the embedding space. Vectors indexed with old model won't match queries with new model. (3) Data drift—user query patterns changed and the system wasn't designed for them. (4) Upstream API changes—LLM provider may have updated the model. (5) Volume spike causing degraded performance (timeouts, truncation). Investigate: Check document timestamps vs index timestamps, compare embedding similarity distributions over time, review query logs for pattern changes, check latency metrics. </details> ### Design Exercises **Exercise 1. [Senior]** Design a RAG system for a customer support chatbot. Requirements: 50,000 support articles, need sub-2-second response time, must cite sources, handles 1,000 queries/hour peak. Outline your architecture including chunking strategy, retrieval pipeline, and infrastructure choices. <details> <summary>Guidance</summary> Consider: (1) Chunking: Support articles are structured—chunk by FAQ pairs or section headers. 256-512 tokens per chunk. (2) Retrieval: Hybrid search (BM25 + vectors) with reranking. Retrieve 20, rerank to 5. (3) Infrastructure: 50K documents = ~200K chunks = fit easily in managed vector DB (Pinecone, Qdrant Cloud). (4) Latency budget: Embedding (~100ms) + Vector search (~50ms) + Rerank (~200ms) + LLM (~1000ms) = ~1.3s, leaving margin. (5) Citations: Structure prompt to output source IDs, validate citations exist. (6) Caching: Consider caching frequent queries—support has many repeat questions. </details> **Exercise 2. [Staff]** You're building a RAG system for a research organization with 2 million academic papers. Users ask complex questions requiring synthesis across multiple papers. Design the system considering: scale, multi-hop reasoning needs, and the fact that papers heavily reference each other. <details> <summary>Guidance</summary> This is a GraphRAG candidate. Consider: (1) Chunking: Papers have structure (abstract, intro, methods, results)—chunk by section. Also extract figure captions separately. (2) Entity extraction: Extract paper metadata, citations, key terms, author names. Build a citation graph. (3) Multi-hop: Implement iterative retrieval—find initial papers, check if answer is complete, follow citations if not. (4) Scale: 2M papers × ~10 chunks = 20M vectors. Need serious infrastructure—Milvus or Pinecone with good filtering. (5) Consider hierarchical indexing: topic-level index first, then paper-level. (6) Latency: Accept higher latency (5-10s) for complex queries given the research use case. </details> --- ### Connections to Other Chapters RAG systems sit at the intersection of retrieval, language understanding, and production engineering. Here's how this chapter connects to others: - **Chapter 5 (LLM/NLP Foundations)**: Provides the theory behind embeddings—how text becomes vectors, why similar meanings cluster together, and how to choose embedding models. Essential background for understanding retrieval. - **Chapter 6 (Prompt Engineering)**: Techniques for integrating retrieved context into prompts effectively. Covers context window management, few-shot examples, and structured outputs for RAG responses. - **Chapter 8 (Agentic Systems)**: Agents often use RAG as a tool for knowledge retrieval. Understanding RAG fundamentals helps you build more capable retrieval-augmented agents. - **Chapter 15 (MLOps & Evaluation)**: Extends evaluation concepts to RAG-specific challenges—retrieval metrics, answer faithfulness, and end-to-end quality assessment. - **Chapter 30 (Data Architecture for AI)**: Data pipeline design and vector database infrastructure at scale. Covers feature stores, data versioning, and the data engineering required for production RAG. --- ## Further Reading ### Essential - **Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"** - The paper that introduced RAG. Start here. - **Karpukhin et al. (2020), "Dense Passage Retrieval for Open-Domain QA"** - Dense retrieval foundations that underpin modern RAG systems. - **Liu et al. (2024), "Lost in the Middle"** - How models use long contexts. Critical for understanding retrieval ordering. ### Deep Dives - **Asai et al. (2023), "Self-RAG: Learning to Retrieve, Generate, and Critique"** - Self-reflective retrieval for higher quality. - **Malkov & Yashunin (2018), "HNSW: Hierarchical Navigable Small World Graphs"** - The algorithm behind most vector database indices. - **Es et al. (2023), "RAGAS: Automated Evaluation of RAG"** - Systematic framework for measuring RAG performance.