Chapter 7: RAG Systems

Keywords

retrieval, embeddings, vector search, chunking, reranking, hybrid search, semantic search, HNSW, ColBERT, knowledge grounding

Introduction

In the summer of 2022, a team at a major technology company faced a problem. They had deployed a large language model to answer questions about their internal documentation—thousands of pages covering engineering practices, HR policies, and product specifications. The model was impressive: articulate, confident, and completely unreliable. It would cite policies that didn’t exist, invent employee benefits, and present plausible-sounding procedures that contradicted official guidelines. The fundamental issue was simple: the model’s knowledge was frozen at training time, and it had never seen the company’s documents.

Retrieval-Augmented Generation (RAG) emerged as the solution to this problem. Instead of relying solely on knowledge baked into model parameters during training, RAG systems retrieve relevant documents at inference time and include them directly in the prompt. The model generates responses grounded in specific, verifiable sources rather than interpolating from training data.

The idea is elegantly simple: look things up before answering. But the execution is nuanced. Building a RAG system that works reliably—that retrieves the right information, presents it effectively, and generates accurate responses—requires understanding a stack of interconnected techniques: document processing, embedding models, vector search, reranking, and prompt engineering. Each component has its own tradeoffs, and weaknesses in any link can break the chain.

This chapter takes you from the foundations of RAG to production-ready systems. We’ll examine not just how each component works, but why it works—the mathematical and conceptual foundations that explain when techniques succeed and when they fail. By the end, you’ll understand RAG deeply enough to debug failures, choose between alternatives, and push the boundaries of what’s possible.

The Core Insight: Why RAG Works

To understand RAG, you need to understand why language models sometimes fail at knowledge-intensive tasks.

A language model like GPT-5 or Claude Opus contains vast amounts of knowledge encoded in its parameters—facts about the world, patterns of language, reasoning strategies. But this knowledge has fundamental limitations:

  1. Temporal cutoff: The model only knows what existed in its training data. Events after training are invisible.

  2. Coverage gaps: Training data, however massive, doesn’t include everything. Proprietary documents, recent publications, niche domains—all are absent.

  3. Parametric compression: Knowledge stored in parameters is lossy. The model learns statistical patterns, not a perfect database. Details blur, facts merge, and the model may “remember” something that’s a plausible interpolation rather than a precise fact.

  4. Confidence without calibration: Models generate confident-sounding text regardless of whether they actually know the answer. They’re trained to produce fluent completions, not to say “I don’t know.”

RAG addresses these limitations by shifting from parametric knowledge (stored in model weights) to non-parametric knowledge (retrieved from external sources). When a user asks a question, the system:

  1. Retrieves relevant documents from a knowledge base
  2. Augments the prompt with these documents as context
  3. Generates a response grounded in the retrieved information

This is powerful because retrieval is exact: we can guarantee the model sees the current version of a document, include proprietary information without retraining, and update knowledge by simply updating the document store. The model becomes a sophisticated reader and synthesizer rather than a unreliable memory.

A Mental Model for RAG

Think of RAG as giving a language model an open-book exam. Without RAG, the model must answer from memory—and like a student who crammed but didn’t deeply learn the material, it may confuse details, conflate topics, or confidently assert things that aren’t quite right. With RAG, the model can flip through a textbook, find relevant passages, and compose an answer that synthesizes what it reads.

The key challenges then become:

  • Finding the right pages: Given a question, which documents are relevant?
  • Reading efficiently: With limited space in the prompt, what to include?
  • Synthesizing accurately: How to ensure the model uses the documents faithfully?

These map to the core RAG components: retrieval, context assembly, and generation with grounding.

When to Use RAG vs Fine-tuning

Before diving into RAG implementation, understand when RAG is the right choice:

RAG vs Fine-tuning Decision Flowchart

RAG vs Fine-tuning Decision Flowchart

Choose RAG when: Knowledge updates frequently, citations needed, no training budget, quick deployment required.

Choose Fine-tuning when: Static knowledge, need specific style/behavior, latency critical, knowledge fits in training data.

Use both when: Large knowledge base with specific format requirements.

What You’ll Learn

  • How embedding models encode semantic meaning, and why this enables similarity search
  • Chunking strategies and the tradeoffs that govern them
  • Vector search algorithms: exact, approximate, and hybrid approaches
  • Reranking to improve precision when initial retrieval is noisy
  • Advanced patterns: query transformation, self-reflection, multi-hop reasoning
  • Evaluation methodologies that diagnose where systems fail

Prerequisites

  • Understanding of embeddings and attention mechanisms (Chapter 5)
  • Familiarity with prompt engineering principles (Chapter 6)
  • Basic database concepts (indexes, queries, consistency)
  • Python programming

The RAG Architecture

The Two-Phase Design

RAG systems operate in two distinct phases with different computational constraints and optimization objectives.

RAG Pipeline Architecture

RAG Pipeline Architecture

Indexing (offline): Process your document corpus once, preparing it for fast retrieval. This phase can be compute-intensive because it runs ahead of user requests.

Documents → Parsing → Chunking → Embedding → Vector Index
              ↓          ↓           ↓            ↓
         Clean text   Segments   Dense vectors  Searchable store

Query (online): For each user request, find relevant context and generate a response. This phase must be fast—users expect responses in seconds.

Query → Embedding → Vector Search → Reranking → Context Assembly → Generation
  ↓          ↓            ↓            ↓              ↓               ↓
User      Dense        Candidate    Precise       Prompt         Response
question   vector      documents    top-k        construction    with citations

Let’s trace through a concrete example to build intuition.

Scenario: A legal firm has 50,000 contract documents. A lawyer asks: “What are the standard termination clauses for vendor agreements?”

Indexing (done once): 1. Parse all 50,000 contracts, extracting text from PDFs 2. Split each contract into chunks (perhaps by clause or paragraph) 3. Generate an embedding vector for each chunk using a model like E5 or BGE 4. Store vectors in a searchable index (e.g., Qdrant, Pinecone) 5. Store chunk text and metadata (contract name, date, type) alongside vectors

Query (every request): 1. Embed the query “What are the standard termination clauses for vendor agreements?” using the same model 2. Search the vector index for the 50 most similar chunk vectors 3. Rerank these 50 using a more powerful cross-encoder model, keeping top 5 4. Assemble the top 5 chunks into a prompt 5. Send to LLM: “Based on these contract excerpts, explain standard termination clauses for vendor agreements” 6. Return the response with citations to specific contracts

The magic happens in step 2: because the query and documents are embedded by the same model into the same vector space, we can find semantically similar content even when the words don’t match exactly. “Termination clauses” finds documents mentioning “early exit provisions” or “contract cancellation terms.”

Document Processing and Chunking

Staff Engineer Perspective

“When I review RAG implementations, I first check chunk size and overlap. 90% of the retrieval problems I see come from chunks that are too small or don’t overlap. Everyone reads the tutorials saying ‘256 tokens is optimal’ and blindly copies it. Then they wonder why their legal document Q&A system returns sentence fragments that make no sense.”

Principal Engineer at legal tech startup

The Chunking Problem

Here’s a fundamental tension in RAG: embedding models and LLM context windows have limited capacity, but documents can be arbitrarily long. A 100-page contract cannot be embedded as a single unit—the embedding model would truncate or struggle, and even if it worked, a single vector for 100 pages would lose detail. We must split documents into smaller chunks.

But chunking introduces its own problems:

  1. Semantic fragmentation: Splitting at arbitrary boundaries can divide a coherent thought across chunks, losing meaning
  2. Context loss: A chunk may reference “the policy described above”—but “above” is in a different chunk
  3. Size variability: Very small chunks may lack context; very large chunks may dilute the signal

Good chunking strategies navigate these tradeoffs for your specific use case.

Fixed-Size Chunking: Simple But Fragile

The simplest approach splits text at regular token intervals:

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    tokens = tokenizer.encode(text)
    chunks = []
    for start in range(0, len(tokens), chunk_size - overlap):
        end = start + chunk_size
        chunks.append(tokenizer.decode(tokens[start:end]))
    return chunks

Why overlap? Without overlap, a relevant sentence split across chunk boundaries might not match any single chunk well enough to be retrieved. Overlap ensures boundary content appears in at least one chunk.

When fixed-size works: Homogeneous documents where any segment is roughly self-contained—news articles, blog posts, social media. The simplicity is valuable: predictable memory usage, easy to reason about.

When it fails: Structured documents where meaning depends on headers, sections, or cross-references. Splitting a legal contract mid-clause produces chunks that are difficult to interpret without context.

Semantic Chunking: Following Natural Boundaries

Semantic chunking respects document structure, splitting at natural boundaries like paragraphs, sentences, or sections:

def semantic_chunk(text: str, max_tokens: int = 512) -> list[str]:
    """Split at paragraph boundaries, respecting max size."""
    paragraphs = text.split('\n\n')
    chunks, current, current_size = [], [], 0

    for para in paragraphs:
        para_tokens = len(tokenizer.encode(para))
        if current_size + para_tokens > max_tokens and current:
            chunks.append('\n\n'.join(current))
            current, current_size = [], 0
        current.append(para)
        current_size += para_tokens

    if current:
        chunks.append('\n\n'.join(current))
    return chunks

This preserves paragraph coherence. But what if a single paragraph exceeds the limit? We need fallback logic—either split it further (losing coherence) or truncate (losing content).

Recursive Chunking: Best of Both Worlds

Recursive chunking tries progressively finer splits until content fits:

def recursive_chunk(text: str, max_tokens: int = 512,
                    separators: list[str] = ['\n\n', '\n', '. ', ' ']) -> list[str]:
    """Try coarse splits first, fall back to finer splits."""
    if len(tokenizer.encode(text)) <= max_tokens:
        return [text]

    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_chunk(part, max_tokens, separators))
            return chunks

    # Last resort: hard split
    return fixed_size_chunk(text, max_tokens)

This tries to split at double-newlines (paragraphs) first. If any segment is still too large, try single newlines, then sentences, then words. It maximizes semantic coherence while guaranteeing size limits.

Full implementation: See reference/07b_rag_systems_code.md for complete implementations including structure-aware chunking with header tracking.

How Chunk Size Affects Retrieval

Chunk size is the most important parameter in RAG. It creates a fundamental tradeoff:

Smaller chunks (128-256 tokens):

  • ✓ More precise retrieval—each chunk is about one idea
  • ✓ Higher recall—easier to match specific facts
  • ✗ Loss of context—chunks may be hard to interpret alone
  • ✗ More chunks to store and search

Larger chunks (512-1024 tokens):

  • ✓ Self-contained—each chunk has enough context
  • ✓ Fewer chunks, more efficient storage and search
  • ✗ Lower precision—relevant sentence buried in irrelevant paragraph
  • ✗ May exceed embedding model’s effective window

A reasonable default is to start at 512 tokens with overlap, then tune: go smaller (down to ~256) only when precision demands it, and larger (512–1024) when meaning depends on surrounding context. The instinct to shrink chunks usually backfires—too-small chunks return fragments that lose the context the model needs (this is the same default the “Chunks Too Small” mistake below recommends). The right size depends heavily on your documents:

Document Type Recommended Size Rationale
FAQs, knowledge bases 128-256 tokens One question-answer pair per chunk
Technical documentation 256-512 tokens Section-level coherence
Legal contracts Full clauses Clause is atomic semantic unit
Research papers 512-1024 tokens Need surrounding context for meaning
Code files Full functions Function is natural semantic unit

Practical advice: Start with 512 tokens and semantic boundaries. Measure retrieval quality on real queries. If retrieval finds irrelevant results, try smaller chunks. If retrieved chunks lack context, try larger or add context metadata.

The Lost Context Problem

When you chunk a document, each chunk loses its place in the whole. Consider this chunk:

“As discussed in the previous section, these conditions apply. The party responsible must notify the other party within 30 days.”

In isolation, this is nearly useless—what conditions? Which parties? The references point outside the chunk.

Solution 1: Context headers. Prepend hierarchical context:

def add_context_headers(chunk: str, headers: list[str]) -> str:
    """Prepend document structure context to chunk."""
    context = " > ".join(headers)  # "Contract > Section 3 > Termination"
    return f"[{context}]\n{chunk}"

Solution 2: Sentence-window retrieval. Embed individual sentences but retrieve with surrounding window:

def sentence_window_chunk(text: str, window: int = 3) -> list[dict]:
    """Embed sentences, but return with surrounding context."""
    sentences = sent_tokenize(text)
    for i, sent in enumerate(sentences):
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        yield {
            'embed_text': sent,  # This gets embedded
            'context': ' '.join(sentences[start:end])  # This gets returned
        }

This gives you sentence-level retrieval precision with paragraph-level context in the final prompt.

Solution 3: Parent-child indexing. Index both small chunks (for retrieval precision) and large chunks (for context). When a small chunk matches, return its parent.

Document Extraction: Handling Real-World Formats

Real documents come as PDFs, Word files, HTML pages, and scanned images. Extraction quality directly impacts RAG quality—garbage in, garbage out.

def extract_document(path: str) -> tuple[str, dict]:
    """Extract text and metadata from various formats."""
    ext = path.rsplit('.', 1)[-1].lower()

    extractors = {
        'pdf': extract_pdf,
        'docx': extract_docx,
        'html': extract_html,
        'md': extract_markdown,
    }

    if ext not in extractors:
        raise ValueError(f"Unsupported format: {ext}")

    return extractors[ext](path)

PDF challenges are underestimated:

  • Multi-column layouts: Reading order ambiguity
  • Tables: Structure lost when flattened
  • Scanned documents: Require OCR
  • Headers/footers: Repeated noise
  • Embedded images with text: Invisible without OCR

For production systems, consider specialized services (Unstructured.io, LlamaParse, AWS Textract) that handle these edge cases. The investment pays off in retrieval quality.

Full implementation: See reference/07b_rag_systems_code.md for complete extraction code.

Common Mistake: Chunks Too Small

What people do: Set chunk size to 128-256 tokens “to be safe” and ensure precise retrieval.

Why it fails: Small chunks lose context. A sentence like “This applies to returns within 30 days” is meaningless without knowing what “this” refers to. Retrieval metrics may look good (you found a relevant sentence), but generation quality suffers.

Fix: Start with 512-1024 tokens and tune based on retrieval + generation quality together. Include chunk overlap (10-20%). Consider parent-child chunking: retrieve small chunks for precision, include parent chunk in context for completeness.


Hybrid Search and Reranking

BM25: The Keyword Baseline

BM25 is the standard keyword ranking algorithm, an improvement on TF-IDF that handles document length variation:

\[\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}\]

The intuition:

  • IDF (Inverse Document Frequency): Rare terms matter more. “Termination” is more informative than “the.”
  • TF (Term Frequency): More occurrences = stronger signal, but with diminishing returns
  • Length normalization: Long documents get many terms by chance; normalize for length

BM25 requires no training—just tokenization and statistics. It’s fast, interpretable, and remains competitive with neural methods for exact-match queries.

Reciprocal Rank Fusion

Combining two ranked lists is tricky. Scores aren’t comparable (cosine similarity vs. BM25 scores live in different scales). Reciprocal Rank Fusion (RRF) sidesteps this by using ranks, not scores:

\[\text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)}\]

Where k is a constant (typically 60) that prevents very high-ranked documents from dominating.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine multiple rankings using RRF."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

RRF is robust because:

  • No score normalization needed
  • Outliers in one retriever don’t dominate
  • Documents ranked well by multiple retrievers rise to the top

Full implementation: See reference/07b_rag_systems_code.md for complete hybrid search with BM25 and vector fusion.

Reranking: Precision Over Speed

Initial retrieval (vector or hybrid) casts a wide net—retrieve 50-100 candidates quickly. Reranking uses a more expensive model to precisely score these candidates.

The insight: bi-encoders (embedding models) score documents independently—embed query, embed document, compute similarity. Cross-encoders process query and document together, enabling rich interaction between them.

from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model: str = "BAAI/bge-reranker-large"):
        self.model = CrossEncoder(model)

    def rerank(self, query: str, documents: list[str], top_k: int = 5):
        pairs = [[query, doc] for doc in documents]
        scores = self.model.predict(pairs)
        ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
        return ranked[:top_k]

Why cross-encoders are more accurate: They can match specific query terms to document locations, understand negation (“not eligible for” vs. “eligible for”), and reason about relevance holistically. Bi-encoders must compress all meaning into fixed vectors before comparison.

Why we don’t use cross-encoders for initial retrieval: Cross-encoders require O(n) forward passes (one per document). For millions of documents, this is impossibly slow. Bi-encoders enable precomputation—embed documents once, search repeatedly.

The two-stage pattern—fast bi-encoder retrieval, precise cross-encoder reranking—gives us the best of both worlds.

The Reranker That Wasn’t

The situation: An enterprise search team spent three months building a sophisticated RAG pipeline for their internal knowledge base. Their crown jewel was a cross-encoder reranker fine-tuned on 50,000 labeled query-document pairs. Offline metrics were stellar: MRR improved from 0.65 to 0.84.

What happened: They launched with fanfare. Usage dropped 40% within a month. Users complained the system was “too slow to be useful.” Exit surveys revealed people were returning to keyword search or just asking colleagues directly.

Root cause: The reranker added 800ms of latency per query—acceptable in benchmarks, unacceptable in practice. Users expect search results in under 300ms. An 84% MRR means nothing if users don’t wait for results. The team had measured the wrong thing: they optimized for ranking quality while users optimized for time-to-answer.

How they fixed it: 1. Implemented async reranking: show initial bi-encoder results immediately, then update with reranked results if user is still on page 2. Added a lightweight reranker (100ms budget) for real-time, heavy reranker for “show more” requests 3. Built latency into their evaluation metrics: score = accuracy × (1 - latency_penalty) 4. Set up user timing dashboards alongside ranking quality dashboards

The takeaway: Users don’t experience your metrics; they experience latency. The best ranking algorithm your users never see is worthless.

Common Mistake: Skipping Reranking

What people do: Return top-k results from vector search directly to the LLM.

Why it fails: Vector similarity ≠ relevance. A document about your topic may score higher than a document that answers your question. Without reranking, you fill context with tangentially related content, forcing the LLM to sift through noise.

Fix: Always add a reranking stage, even a simple one. Cross-encoders like ms-marco-MiniLM-L-6-v2 add ~50ms latency but dramatically improve answer quality. For production, batch reranking requests and set a timeout—partial reranking beats no reranking.


Context Assembly and Generation

The Art of Prompt Construction

You’ve retrieved and reranked documents. Now you must assemble them into a prompt that elicits accurate, grounded responses.

Key principles:

Structure for clarity: Label documents clearly so the model can cite them.

RAG_PROMPT = """Answer the question based ONLY on the following documents.
If the documents don't contain the answer, say "I don't have information about that."
Cite sources using [Doc N] format.

{documents}

Question: {question}

Answer:"""

def format_documents(docs: list[str]) -> str:
    return "\n\n".join(f"[Doc {i+1}]:\n{doc}" for i, doc in enumerate(docs))

Explicit grounding instructions: Without clear instructions, models may blend retrieved content with parametric knowledge. “Answer based ONLY on the following documents” reduces this.

Handle insufficient context gracefully: The model should acknowledge when it can’t answer rather than hallucinate.

The Lost-in-the-Middle Problem

A surprising finding from Liu et al. (2024): language models with long contexts don’t use all the context equally. Information in the middle of long prompts is often ignored—the model attends more strongly to the beginning and end.

Implications for RAG:

  • Put the most relevant documents first (or last)
  • Don’t dump 50 documents into context hoping the model will find what it needs
  • Quality over quantity—5 highly relevant chunks beat 20 mixed ones

Mitigation strategies:

  • Aggressive reranking to ensure top documents are truly best
  • Context compression (summarize less-relevant documents)
  • Multi-pass generation (retrieve more, generate summaries, retrieve again)

Citation and Attribution

Users need to verify AI responses. Proper citations enable this:

def extract_citations(response: str, num_docs: int) -> list[int]:
    """Extract document citations from response."""
    import re
    citations = re.findall(r'\[Doc (\d+)\]', response)
    valid = [int(c) for c in citations if 1 <= int(c) <= num_docs]
    return list(set(valid))

Trust but verify: Models sometimes hallucinate citations—referencing [Doc 5] when only 3 documents were provided, or attributing statements to the wrong source. Consider:

  • Validating citation numbers are in range
  • Checking that cited content actually supports the claim (expensive but possible with another LLM call)
  • Surfacing source documents to users for verification

Advanced RAG Patterns

Basic RAG—retrieve, then generate—handles simple factual queries. Complex scenarios require more sophisticated orchestration.

Query Transformation

User queries are often poorly suited for retrieval:

  • Too short: “refund?” lacks context
  • Wrong vocabulary: user terms ≠ document terms
  • Complex: multiple sub-questions combined

Query expansion adds related terms:

def expand_query(query: str, llm) -> str:
    """Generate query expansions for better retrieval."""
    expansion = llm.generate(f"""Generate search terms related to: {query}

    Related terms (comma-separated):""")
    return f"{query} {expansion}"

HyDE (Hypothetical Document Embeddings) generates a hypothetical answer, then retrieves documents similar to it:

def hyde_retrieve(query: str, retriever, llm) -> list[dict]:
    """Generate hypothetical answer, use it for retrieval."""
    hypothetical = llm.generate(
        f"Write a detailed paragraph answering: {query}"
    )
    return retriever.search(hypothetical)  # Search using generated doc

Why HyDE works: The hypothetical answer uses vocabulary likely to appear in actual documents—bridging the query-document lexical gap. It’s particularly effective when users ask questions in conversational language while documents are written formally.

Query decomposition breaks complex questions into simpler sub-queries:

def decompose_and_retrieve(query: str, retriever, llm) -> list[dict]:
    """Break complex query into sub-queries, retrieve for each."""
    sub_queries = llm.generate(
        f"Break this into simpler questions:\n{query}"
    ).split('\n')

    all_docs = []
    for sub_q in sub_queries:
        all_docs.extend(retriever.search(sub_q))
    return deduplicate(all_docs)

Self-RAG and Corrective RAG

Advanced patterns add reflection and self-correction.

Self-RAG (Asai et al., 2023) decides whether to retrieve (some questions don’t need external knowledge) and evaluates whether the response is supported by retrieved documents:

class SelfRAG:
    def query(self, question: str) -> str:
        # Decide if retrieval is needed
        if self.needs_retrieval(question):
            docs = self.retriever.search(question)
            response = self.generate_with_context(question, docs)

            # Verify response is supported by documents
            if not self.is_supported(response, docs):
                response = self.regenerate(question, docs)
        else:
            response = self.generate_direct(question)

        return response

Corrective RAG (CRAG) evaluates retrieval quality and falls back to alternative sources (like web search) when the knowledge base doesn’t have good answers:

class CorrectiveRAG:
    def query(self, question: str) -> str:
        docs = self.retriever.search(question)
        quality = self.evaluate_retrieval(question, docs)

        if quality == "CORRECT":
            context = docs
        elif quality == "INCORRECT":
            context = self.web_search(question)  # Fallback
        else:  # AMBIGUOUS
            context = docs + self.web_search(question)  # Combine

        return self.generate(question, context)

These patterns add LLM calls (increasing latency and cost) but significantly improve reliability on hard queries.

Full implementation: See reference/07b_rag_systems_code.md for complete Self-RAG, CRAG, and multi-hop implementations.

Multi-Hop RAG

Some questions require synthesizing information across multiple documents that aren’t directly connected:

“Which of our company’s products were affected by the EU AI Act regulations discussed in last month’s legal memo?”

This needs: (1) the legal memo about EU AI Act, (2) product specifications, (3) reasoning about which products fall under which regulations.

Multi-hop RAG iteratively retrieves and reasons:

class MultiHopRAG:
    def query(self, question: str, max_hops: int = 3) -> str:
        context = []
        current_query = question

        for _ in range(max_hops):
            docs = self.retriever.search(current_query)
            context.extend(docs)

            can_answer, missing = self.check_answerability(question, context)
            if can_answer:
                break

            # Generate follow-up query to find missing information
            current_query = self.generate_followup(question, context, missing)

        return self.generate_answer(question, context)

GraphRAG: Knowledge Graphs Meet Retrieval

Vector search finds topically similar documents but doesn’t understand relationships between entities. GraphRAG combines vector retrieval with knowledge graph traversal:

  1. Extract entities and relationships from documents (using an LLM or NER models)
  2. Build a knowledge graph connecting entities
  3. At query time, find relevant entities and traverse the graph to find related information

This excels for queries like:

  • “How is Project Alpha related to the Johnson contract?”
  • “What other products use the same supplier as Widget X?”
  • “Who worked on both the 2023 audit and the compliance review?”

Full implementation: See reference/07b_rag_systems_code.md for complete GraphRAG implementation with entity extraction and graph traversal.


Evaluation

Why RAG Evaluation is Hard

RAG systems have multiple components that can fail independently:

  • Retrieval might miss relevant documents
  • Reranking might demote the best documents
  • Context assembly might truncate critical information
  • Generation might hallucinate despite good context

A single end-to-end metric hides which component is failing. Comprehensive evaluation requires component-level and end-to-end metrics.

Retrieval Metrics

Classic information retrieval metrics:

Recall@k: Of all relevant documents, what fraction were retrieved in top k? \[\text{Recall@k} = \frac{|\text{Relevant} \cap \text{Retrieved@k}|}{|\text{Relevant}|}\]

Precision@k: Of retrieved documents, what fraction were relevant? \[\text{Precision@k} = \frac{|\text{Relevant} \cap \text{Retrieved@k}|}{k}\]

MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant document.

def evaluate_retrieval(retriever, queries, relevant_docs) -> dict:
    """Evaluate retrieval quality."""
    metrics = {'recall@5': [], 'recall@10': [], 'mrr': []}

    for query, relevant in zip(queries, relevant_docs):
        results = retriever.search(query, k=10)
        retrieved_ids = [r['id'] for r in results]

        metrics['recall@5'].append(
            len(set(retrieved_ids[:5]) & relevant) / len(relevant)
        )
        metrics['recall@10'].append(
            len(set(retrieved_ids[:10]) & relevant) / len(relevant)
        )

        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant:
                metrics['mrr'].append(1 / rank)
                break
        else:
            metrics['mrr'].append(0)

    return {k: sum(v)/len(v) for k, v in metrics.items()}

End-to-End Evaluation

For the full pipeline, evaluate answer quality:

Faithfulness: Does the response only contain information from retrieved documents? (Checks for hallucination)

Relevance: Does the response answer the question?

Completeness: Does the response cover all aspects of the question?

LLM-as-judge works well for these:

def evaluate_faithfulness(response: str, context: str, evaluator_llm) -> float:
    """Check if response is faithful to context."""
    prompt = f"""Rate whether this response is faithful to the provided context.
    A faithful response only contains information present in the context.

    Context: {context}
    Response: {response}

    Rate 1-5 (5 = completely faithful):"""

    return float(evaluator_llm.generate(prompt))

Full implementation: See reference/07b_rag_systems_code.md for complete evaluation harness with component-level and end-to-end metrics.

Debugging Failures

When RAG fails, diagnose systematically:

  1. Was the relevant document retrieved? → Retrieval problem (embedding model, chunking, index)
  2. Was it ranked high enough? → Reranking problem
  3. Was it included in context? → Context assembly problem
  4. Did generation use it correctly? → Prompt engineering problem
class RAGDebugger:
    def diagnose(self, query: str, expected_answer: str) -> dict:
        """Diagnose RAG failure point."""
        retrieved = self.retriever.search(query, k=50)

        # Check if answer content is in any retrieved doc
        relevant = [d for d in retrieved if self.contains_answer(d, expected_answer)]

        if not relevant:
            return {'failure': 'retrieval', 'issue': 'Relevant content not retrieved'}

        reranked = self.reranker.rerank(query, retrieved[:20])
        relevant_ranks = [i for i, d in enumerate(reranked) if d in relevant]

        if min(relevant_ranks) > 5:
            return {'failure': 'reranking', 'issue': f'Relevant doc ranked {min(relevant_ranks)}'}

        response = self.generate(query, reranked[:5])
        if not self.answer_correct(response, expected_answer):
            return {'failure': 'generation', 'issue': 'Good context but wrong answer'}

        return {'failure': None, 'status': 'success'}

Production Considerations

Vector Database Selection

For production, you need more than FAISS: persistence, updates, filtering, scaling.

Vector Database Selection Flowchart

Vector Database Selection Flowchart

Decision framework:

Scale Complexity Recommendation
< 100K vectors Low PostgreSQL + pgvector
100K - 10M Medium Qdrant or Weaviate
> 10M High Milvus or Pinecone

Key capabilities to evaluate:

  • Hybrid search support (vector + keyword)
  • Metadata filtering efficiency
  • Update performance (real-time vs. batch)
  • Horizontal scaling options
  • Backup and disaster recovery

Monitoring and Observability

Production RAG needs continuous monitoring:

@dataclass
class RAGMetrics:
    retrieval_latency_ms: float
    rerank_latency_ms: float
    generation_latency_ms: float
    num_docs_retrieved: int
    top_retrieval_score: float
    user_feedback: Optional[str]

def log_rag_request(query: str, metrics: RAGMetrics):
    """Log for monitoring and debugging."""
    logger.info({
        'query': query,
        **asdict(metrics),
        'timestamp': datetime.utcnow().isoformat()
    })

Alert on:

  • Latency percentile spikes
  • Low retrieval scores (indicates out-of-domain queries)
  • User feedback trends
  • Document freshness (stale indexes)

Common Production Pitfalls

Index-embedding mismatch: Using different embedding models for indexing vs. querying produces nonsense results. Version your embeddings with your index.

Stale documents: When source documents update, indexes become inconsistent. Implement incremental re-indexing.

Query injection: User queries become part of prompts. Sanitize inputs and use delimiters to separate user content from instructions.

Cost spiral: LLM calls per query add up. Monitor and set budgets.

def answer_question(query: str, documents: list[str]) -> str:
    # Embed and search
    query_vec = embed(query)
    results = vector_db.search(query_vec, k=10)

    # Stuff everything into context
    context = "\n\n".join([doc.text for doc in results])

    response = llm.generate(
        f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    )
    return response
async def answer_question(
    query: str,
    user_id: str,
    timeout: float = 10.0
) -> RAGResponse:
    """Production RAG with reranking, citation, and observability."""
    start = time.monotonic()

    # Stage 1: Hybrid retrieval (vector + BM25)
    async with asyncio.timeout(timeout * 0.3):
        vector_results = await vector_db.search(embed(query), k=50)
        keyword_results = await bm25_index.search(query, k=50)
        candidates = reciprocal_rank_fusion(vector_results, keyword_results)

    # Stage 2: Cross-encoder reranking
    async with asyncio.timeout(timeout * 0.4):
        reranked = await reranker.rerank(query, candidates[:20])
        top_docs = [d for d in reranked if d.score > 0.5][:5]

    if not top_docs:
        return RAGResponse(answer=None, status="no_relevant_docs")

    # Stage 3: Generate with citation requirements
    context = format_docs_with_ids(top_docs)  # [1], [2], etc.

    response = await llm.generate(
        messages=[{
            "role": "system",
            "content": "Answer based ONLY on the provided context. "
                      "Cite sources using [1], [2], etc. "
                      "If unsure, say 'I don't have enough information.'"
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
        max_tokens=500
    )

    # Log for monitoring
    log_rag_metrics(
        query=query, user_id=user_id,
        retrieval_latency=(time.monotonic() - start),
        docs_used=len(top_docs),
        top_score=top_docs[0].score if top_docs else 0
    )

    return RAGResponse(
        answer=response.content,
        sources=[d.metadata for d in top_docs],
        latency_ms=(time.monotonic() - start) * 1000
    )

Key differences: Hybrid search, reranking stage, relevance threshold, citation enforcement, timeout management, observability, and structured response with sources.


Key Takeaways

  1. Chunking strategy determines retrieval quality - Preserve semantic coherence while enabling precise retrieval. Use document-aware chunking that respects natural boundaries like paragraphs and sections.

  2. Hybrid search outperforms single approaches - Combine vector search (semantic similarity) with BM25 (keyword matching). Vector search handles paraphrases; BM25 handles exact terms and identifiers.

  3. Two-stage retrieval balances speed and accuracy - Use fast bi-encoders to find candidates, then expensive cross-encoders to precisely rerank. This achieves both scale and quality.

  4. The “lost in the middle” problem affects generation - LLMs attend more strongly to content at the beginning and end of context. Put the most relevant documents first and prioritize quality over quantity.

  5. Evaluate retrieval and generation separately - Component-level evaluation enables precise diagnosis. Measure retrieval recall@k, generation faithfulness, and end-to-end answer correctness independently.


Summary

RAG transforms language models from unreliable oracles into grounded research assistants. The pattern is simple—retrieve then generate—but production quality requires:

  1. Thoughtful chunking that preserves semantic coherence while enabling precise retrieval
  2. Embedding models trained for retrieval, not just similarity
  3. Hybrid search combining vector and keyword approaches
  4. Reranking to precision-sort noisy retrieval results
  5. Careful prompt engineering to ensure grounding and citation
  6. Component-level evaluation to diagnose failures
  7. Production infrastructure for monitoring, updates, and scale

The field continues to evolve rapidly. GraphRAG, adaptive retrieval, and learned sparse representations are pushing boundaries. But the fundamentals—understanding what makes retrieval work, why generation can fail, how to evaluate systematically—remain essential regardless of which specific techniques you adopt.

RAG Architecture Selection Flowchart

RAG Architecture Selection Decision Flowchart

RAG Architecture Selection Decision Flowchart

Practical Exercises

  1. Chunking experiment: Take a 10-page document. Implement three chunking strategies (fixed-size, semantic, recursive). For 20 test queries, measure retrieval recall. Which strategy wins? Why?

  2. Embedding comparison: Compare E5-large, BGE-large, and OpenAI embeddings on your documents. Measure recall@10 and latency. Is the most expensive model worth it?

  3. Hybrid search tuning: Implement hybrid search with configurable vector/BM25 weights. Use grid search to find optimal weights for your evaluation set.

  4. Failure analysis: Collect 20 queries where your RAG system gives wrong answers. For each, diagnose the failure point (retrieval, reranking, generation). What patterns emerge?

  5. GraphRAG prototype: For a small document set, extract entities and relationships. Implement graph traversal. Find queries where GraphRAG outperforms standard RAG.


Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] A RAG system uses 512-token chunks. For a legal contract where clauses reference each other heavily, what problems might this cause and how would you address them?

Answer Problems: (1) Clauses may be split mid-sentence, losing meaning. (2) References like “as described in Section 3.2 above” become meaningless when Section 3.2 is in a different chunk. Solutions: (1) Use document-aware chunking that respects clause boundaries. (2) Add context headers showing the document structure (e.g., “[Contract > Section 3 > Termination Clause]”). (3) Use parent-child indexing: index small chunks for precision but return parent sections for context. (4) Consider sentence-window retrieval: embed sentences but return with surrounding context.

Q2. [IC2] Why do we use two-stage retrieval (bi-encoder then cross-encoder reranking) instead of just using cross-encoders for everything?

Answer Cross-encoders are more accurate because they process query and document together, enabling rich interaction. But they require O(n) forward passes—one per document. For millions of documents, this is impossibly slow. Bi-encoders precompute document embeddings, enabling sublinear search via ANN indexes. The two-stage pattern uses fast bi-encoders to find ~50-100 candidates, then expensive cross-encoders to precisely rank only these candidates. This achieves both speed and accuracy.

Q3. [Senior] Explain the “lost in the middle” problem and its implications for RAG prompt design.

Answer Liu et al. (2024) found that LLMs with long contexts don’t use context uniformly—they attend more strongly to content at the beginning and end, often ignoring the middle. For RAG: (1) Don’t dump many documents hoping the model finds what it needs. (2) Put the most relevant documents first (or last). (3) Aggressive reranking matters—ensure top documents are truly best. (4) Consider quality over quantity—5 highly relevant chunks beat 20 mixed ones. (5) For long contexts, consider summarizing less-relevant documents.

Q4. [Senior] A hybrid search system combines BM25 and vector search using RRF. What query types would favor each component, and how would you tune the balance?

Answer BM25 favors: exact terms, identifiers, technical terminology, named entities (“error XJ-4402”, “PostgreSQL pg_dump”). Vector search favors: conceptual queries, paraphrases, synonyms (“how to cancel account” vs. “subscription termination”). Tuning: Create an evaluation set with both query types. Use grid search over RRF weights or fusion parameters. Common finding: 50/50 works well as default, but domain-specific tuning (e.g., 70/30 for technical docs with many identifiers) can improve significantly.

Q5. [Staff] Your RAG system handles internal company documents. The legal team says some documents are confidential and should only be retrieved for users with proper authorization. How do you architect this?

Answer Metadata filtering approach: (1) Store access control metadata (department, clearance level, document classification) with each chunk. (2) At query time, filter retrieval to documents the user can access. Most vector DBs support metadata filtering. Considerations: (1) Filter BEFORE retrieval, not after—otherwise you leak information through similarity scores. (2) Test that filtered results still have sufficient coverage. (3) Consider separate indexes for different access levels for very sensitive data. (4) Audit logging for compliance. (5) Don’t put access control logic in the LLM prompt—it’s not reliable.

Spot the Problem

Problem 1. [IC2] A developer’s chunking function:

def chunk_document(text: str, size: int = 500) -> list[str]:
    words = text.split()
    return [' '.join(words[i:i+size]) for i in range(0, len(words), size)]

What’s wrong?

Answer Issues: (1) Chunking by words, not tokens—500 words could be 600-800 tokens, exceeding embedding model limits. (2) No overlap—sentences at boundaries may not match queries well. (3) Splits arbitrarily, even mid-sentence. (4) Ignores document structure (paragraphs, sections). Better: Use a tokenizer to count tokens, add overlap (e.g., 10-15%), and respect semantic boundaries (at least sentence boundaries, ideally paragraphs).

Problem 2. [Senior] A RAG system has 95% retrieval recall@10 on the eval set but users complain answers are often wrong. What’s happening?

# Their evaluation
retrieved = retriever.search(query, k=10)
if any(doc.id in ground_truth for doc in retrieved):
    recall_hits += 1
Answer Several possible issues: (1) Recall@10 means the right document is somewhere in top 10, but generation may only use top 3-5. If the right document is at position 8, it might get cut. (2) They’re checking document ID, not whether the document actually answers the question—document might be related but not contain the answer. (3) Generation faithfulness—even with perfect retrieval, the model might hallucinate or misuse context. (4) Eval set might not represent real user queries. Need to evaluate the full pipeline: retrieval precision@k for actual k used, faithfulness of generated answers, and end-to-end answer correctness.

Problem 3. [Staff] A production RAG system worked well for months, then quality suddenly dropped. No code changes were deployed. What should you investigate?

Answer Likely causes: (1) Source document changes—documents were updated but the index is stale. Check document freshness and re-index. (2) Embedding model API changed—if using a hosted embedding service, the model may have been updated, changing the embedding space. Vectors indexed with old model won’t match queries with new model. (3) Data drift—user query patterns changed and the system wasn’t designed for them. (4) Upstream API changes—LLM provider may have updated the model. (5) Volume spike causing degraded performance (timeouts, truncation). Investigate: Check document timestamps vs index timestamps, compare embedding similarity distributions over time, review query logs for pattern changes, check latency metrics.

Design Exercises

Exercise 1. [Senior] Design a RAG system for a customer support chatbot. Requirements: 50,000 support articles, need sub-2-second response time, must cite sources, handles 1,000 queries/hour peak. Outline your architecture including chunking strategy, retrieval pipeline, and infrastructure choices.

Guidance Consider: (1) Chunking: Support articles are structured—chunk by FAQ pairs or section headers. 256-512 tokens per chunk. (2) Retrieval: Hybrid search (BM25 + vectors) with reranking. Retrieve 20, rerank to 5. (3) Infrastructure: 50K documents = ~200K chunks = fit easily in managed vector DB (Pinecone, Qdrant Cloud). (4) Latency budget: Embedding (~100ms) + Vector search (~50ms) + Rerank (~200ms) + LLM (~1000ms) = ~1.3s, leaving margin. (5) Citations: Structure prompt to output source IDs, validate citations exist. (6) Caching: Consider caching frequent queries—support has many repeat questions.

Exercise 2. [Staff] You’re building a RAG system for a research organization with 2 million academic papers. Users ask complex questions requiring synthesis across multiple papers. Design the system considering: scale, multi-hop reasoning needs, and the fact that papers heavily reference each other.

Guidance This is a GraphRAG candidate. Consider: (1) Chunking: Papers have structure (abstract, intro, methods, results)—chunk by section. Also extract figure captions separately. (2) Entity extraction: Extract paper metadata, citations, key terms, author names. Build a citation graph. (3) Multi-hop: Implement iterative retrieval—find initial papers, check if answer is complete, follow citations if not. (4) Scale: 2M papers × ~10 chunks = 20M vectors. Need serious infrastructure—Milvus or Pinecone with good filtering. (5) Consider hierarchical indexing: topic-level index first, then paper-level. (6) Latency: Accept higher latency (5-10s) for complex queries given the research use case.

Connections to Other Chapters

RAG systems sit at the intersection of retrieval, language understanding, and production engineering. Here’s how this chapter connects to others:

  • Chapter 5 (LLM/NLP Foundations): Provides the theory behind embeddings—how text becomes vectors, why similar meanings cluster together, and how to choose embedding models. Essential background for understanding retrieval.

  • Chapter 6 (Prompt Engineering): Techniques for integrating retrieved context into prompts effectively. Covers context window management, few-shot examples, and structured outputs for RAG responses.

  • Chapter 8 (Agentic Systems): Agents often use RAG as a tool for knowledge retrieval. Understanding RAG fundamentals helps you build more capable retrieval-augmented agents.

  • Chapter 15 (MLOps & Evaluation): Extends evaluation concepts to RAG-specific challenges—retrieval metrics, answer faithfulness, and end-to-end quality assessment.

  • Chapter 30 (Data Architecture for AI): Data pipeline design and vector database infrastructure at scale. Covers feature stores, data versioning, and the data engineering required for production RAG.


Further Reading

Essential

  • Lewis et al. (2020), “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” - The paper that introduced RAG. Start here.
  • Karpukhin et al. (2020), “Dense Passage Retrieval for Open-Domain QA” - Dense retrieval foundations that underpin modern RAG systems.
  • Liu et al. (2024), “Lost in the Middle” - How models use long contexts. Critical for understanding retrieval ordering.

Deep Dives

  • Asai et al. (2023), “Self-RAG: Learning to Retrieve, Generate, and Critique” - Self-reflective retrieval for higher quality.
  • Malkov & Yashunin (2018), “HNSW: Hierarchical Navigable Small World Graphs” - The algorithm behind most vector database indices.
  • Es et al. (2023), “RAGAS: Automated Evaluation of RAG” - Systematic framework for measuring RAG performance.