Chapter 4: Your First LLM Application

Keywords

hands-on, RAG tutorial, document QA, API integration, OpenAI, vector database, project structure

Introduction

In the fall of 2023, a software engineer at a mid-sized fintech company was tasked with building an internal documentation assistant. The engineering wiki had grown to thousands of pages over five years, and new hires were spending weeks hunting for information that existed but was impossible to find. “Can’t we just ask the AI?” became a common refrain.

The engineer had experience with APIs and databases, but had never built anything with LLMs. What followed was a journey familiar to many: the initial excitement of a working prototype, the frustration of hallucinated answers, the iterative refinement that turned a demo into a tool people actually trusted. Three months later, the documentation assistant was handling hundreds of queries daily, and the engineer had learned lessons that no tutorial could have taught.

This chapter is that tutorial. We’ll build a complete Document Q&A Assistant from scratch: a command-line tool that answers questions about a set of documents using retrieval-augmented generation (RAG). Along the way, you’ll encounter the same surprises, mistakes, and “aha moments” that every AI engineer experiences when building their first real application.

The goal isn’t just working code. It’s understanding why each piece exists, what happens when things go wrong, and how the choices we make here connect to the production systems you’ll build later. By the end, you’ll have both a working application and the intuition to extend it.

Why Build This?

You might wonder: why build a Q&A assistant when dozens exist? Three reasons:

Understanding through construction: Reading about RAG is different from debugging why your retrieval returns irrelevant chunks at 2 AM. Building forces you to internalize concepts in a way that reading cannot.
The foundation for everything else: Every production LLM application shares DNA with what we’ll build: taking user input, retrieving relevant context, constructing prompts, generating responses, handling errors. Master this pattern, and you can build chatbots, coding assistants, research tools, and more.
Demystification: LLM applications can seem magical or impossibly complex. Building one from scratch reveals that they’re just software: APIs, databases, text processing, error handling. The magic is in the model, not in the application code.

What We’re Building

Our Document Q&A Assistant will:

Load a folder of documents (markdown, text, or PDF)
Chunk documents into searchable pieces
Create vector embeddings for semantic search
Store embeddings in a local vector database
Accept questions from the command line
Retrieve relevant document chunks
Generate answers grounded in the retrieved context
Display answers with source citations

Here’s what using it looks like:

$ python qa_assistant.py index ./company_docs
Indexing 47 documents...
Created 312 chunks
Generated embeddings
Index saved to ./qa_index

$ python qa_assistant.py query "What is our vacation policy?"

Based on the company documentation:

Our vacation policy provides 20 days of paid time off per year for full-time
employees. PTO accrues monthly at 1.67 days per month. Unused days roll over
up to a maximum of 5 days. Requests should be submitted at least 2 weeks in
advance through the HR portal.

Sources:

- hr_policies/time_off.md (chunk 3)
- employee_handbook/benefits.md (chunk 7)

What You’ll Learn

How to structure an LLM application project
Making API calls to Anthropic and OpenAI
Processing documents into chunks for retrieval
Creating and using embeddings
Building a simple RAG pipeline
Error handling patterns for LLM applications
Basic evaluation and cost tracking
Common mistakes and how to avoid them

Prerequisites

Python programming (functions, classes, file I/O)
Command-line familiarity
Basic understanding of APIs and HTTP
A text editor or IDE

You don’t need prior experience with LLMs, embeddings, or vector databases. We’ll build that understanding as we go.

Setting Up

Getting API Keys

Before writing any code, you need access to an LLM API. We’ll use two providers:

Anthropic (Claude) for text generation. Claude excels at following instructions and grounding responses in provided context, making it ideal for RAG applications.

OpenAI for embeddings. OpenAI’s text-embedding-3-small model provides excellent quality at low cost for semantic search.

Why two providers? Each has strengths. This mirrors production systems that often use different models for different tasks. You could use a single provider, but learning to work with multiple APIs is valuable.

Getting an Anthropic API Key:

Go to https://console.anthropic.com/
Create an account or sign in
Navigate to API Keys
Create a new key, give it a descriptive name
Copy the key immediately (you won’t see it again)

Getting an OpenAI API Key:

Go to https://platform.openai.com/
Create an account or sign in
Navigate to API Keys in the sidebar
Create a new secret key
Copy the key immediately

Storing Keys Securely:

Never put API keys in your code. Use environment variables:

# Add to your ~/.bashrc or ~/.zshrc
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Then reload your shell or run:
source ~/.bashrc

On Windows, use the Environment Variables system settings or, in PowerShell (for the current session only):

$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:OPENAI_API_KEY = "sk-..."

Common Mistake #1: Committing API keys to git. Add .env to your .gitignore immediately. Better yet, never create a .env file. Environment variables are safer.

Project Structure

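Create a clean project structure. The files used in this chapter are:

document_qa/
├── config.py
├── indexer.py
├── retriever.py
├── generator.py
├── qa_assistant.py
├── test_setup.py
└── requirements.txt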

This separation isn’t arbitrary. Each file has a single responsibility:

indexer.py: Getting documents into a searchable form
retriever.py: Finding relevant documents for a query
generator.py: Turning retrieved context into an answer
config.py: Centralizing settings makes changes easy

Why structure matters: When something breaks (and it will), you want to know exactly where to look. If retrieval returns bad results, debug retriever.py. If answers ignore the context, debug generator.py. Tight coupling makes debugging a nightmare.

Dependencies and Virtual Environment

Create a virtual environment to isolate dependencies:

# Create project directory
mkdir document_qa
cd document_qa

# Create virtual environment
python -m venv venv

# Activate it
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Verify you're in the virtual environment
which python  # Should show .../document_qa/venv/bin/python

Create requirements.txt:

anthropic>=0.18.0
openai>=1.12.0
chromadb>=0.4.22
tiktoken>=0.6.0
rich>=13.7.0

Why these packages?

anthropic: Official Anthropic Python SDK for Claude
openai: Official OpenAI Python SDK for embeddings
chromadb: Simple, local vector database (perfect for learning)
tiktoken: OpenAI’s tokenizer (for counting tokens accurately)
rich: Beautiful terminal output (optional but nice)

Install dependencies:

pip install -r requirements.txt

Common Mistake #2: Using global pip instead of virtual environment pip. Always activate your virtual environment first. If which pip doesn’t show your venv path, something is wrong.

Configuration File

Create config.py to centralize settings:

"""Configuration for the Document Q&A Assistant."""

import os
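from dataclasses import dataclass, field


@dataclass
class Config:
    """Application configuration with sensible defaults."""

    # API keys (read from the environment when a Config is created,
    # not once at import time)
    anthropic_api_key: str = field(
        default_factory=lambda: os.getenv("ANTHROPIC_API_KEY", "")
    )
    openai_api_key: str = field(
        default_factory=lambda: os.getenv("OPENAI_API_KEY", "")
    )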

    # Model settings
    llm_model: str = "claude-sonnet-4-6"
    embedding_model: str = "text-embedding-3-small"
    embedding_dimensions: int = 1536

    # Chunking settings
    chunk_size: int = 500  # tokens
    chunk_overlap: int = 50  # tokens

    # Retrieval settings
    top_k: int = 5  # number of chunks to retrieve

    # Generation settings
    max_tokens: int = 1024
    temperature: float = 0.0  # deterministic for Q&A

    # Paths
    index_path: str = "./qa_index"

    def validate(self) -> list[str]:
        """Check configuration and return list of errors."""
        errors = []
        if not self.anthropic_api_key:
            errors.append("ANTHROPIC_API_KEY environment variable not set")
        if not self.openai_api_key:
            errors.append("OPENAI_API_KEY environment variable not set")
        return errors


# Global config instance
config = Config()

Why a dataclass? It provides:

Type hints for IDE autocompletion
Default values in one place
Easy validation
Optional immutability (pass frozen=True to the decorator)

The “aha” moment: Notice how configuration is separate from logic. When you need to change chunk size from 500 to 400 tokens, you edit one number in one place. When you deploy to production with different settings, you create a new Config instance. This seems trivial now but prevents painful debugging later.
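As a sketch of what that looks like in practice, the standard library's dataclasses.replace can derive a variant of a configuration without mutating the original. The MiniConfig class below is a trimmed, hypothetical stand-in for the Config class above so that the snippet runs on its own.

```python
from dataclasses import dataclass, replace


@dataclass
class MiniConfig:
    """Hypothetical, trimmed stand-in for the chapter's Config class."""
    chunk_size: int = 500    # tokens
    chunk_overlap: int = 50  # tokens
    top_k: int = 5


default_config = MiniConfig()

# Derive an experimental variant; default_config is left untouched.
experiment_config = replace(default_config, chunk_size=400, top_k=8)

print(default_config.chunk_size, experiment_config.chunk_size)  # prints: 500 400
```

Fields that are not overridden, such as chunk_overlap here, keep the values of the instance they were copied from.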

Verify Setup

Create a quick test script to verify everything works:

# test_setup.py
"""Verify the development environment is correctly configured."""

from config import config

def main():
    errors = config.validate()
    if errors:
        print("Configuration errors:")
        for error in errors:
            print(f"  - {error}")
        return False

    # Test imports
    try:
        import anthropic
        import openai
        import chromadb
        import tiktoken
        print("All packages imported successfully")
    except ImportError as e:
        print(f"Import error: {e}")
        return False

    # Test Anthropic connection
    try:
        client = anthropic.Anthropic()
        # Make a minimal API call
        response = client.messages.create(
            model=config.llm_model,
            max_tokens=10,
            messages=[{"role": "user", "content": "Say 'hello' and nothing else."}]
        )
        print(f"Anthropic API working: {response.content[0].text}")
    except Exception as e:
        print(f"Anthropic API error: {e}")
        return False

    # Test OpenAI connection
    try:
        client = openai.OpenAI()
        response = client.embeddings.create(
            model=config.embedding_model,
            input="test"
        )
        print(f"OpenAI API working: got embedding with {len(response.data[0].embedding)} dimensions")
    except Exception as e:
        print(f"OpenAI API error: {e}")
        return False

    print("\nSetup complete! Ready to build.")
    return True


if __name__ == "__main__":
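    # Exit with a non-zero status when a check fails, so scripts can detect it
    raise SystemExit(0 if main() else 1)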

Run it:

python test_setup.py

If you see “Setup complete!”, you’re ready. If not, the error messages will guide you.

Making Your First API Call

Understanding the Anthropic Messages API

Before we build anything complex, let’s understand the API we’re working with. Create generator.py:

"""LLM interaction for generating answers."""

import anthropic
from config import config


def generate_simple(prompt: str) -> str:
    """
    Make a simple completion request to Claude.

    This is the most basic form of LLM interaction:
    send a prompt, get a response.
    """
    client = anthropic.Anthropic()

    message = client.messages.create(
        model=config.llm_model,
        max_tokens=config.max_tokens,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    return message.content[0].text

Let’s test it interactively:

# In Python REPL or a test script
from generator import generate_simple

response = generate_simple("What is the capital of France?")
print(response)
# Example output (exact wording may vary): The capital of France is Paris.

Simple. But what’s actually happening?

Anatomy of an API Request

Let’s examine each parameter:

message = client.messages.create(
    model="claude-sonnet-4-6",  # Which Claude model to use
    max_tokens=1024,            # Maximum response length
    messages=[                  # The conversation history
        {"role": "user", "content": prompt}
    ]
)

model: Different models have different capabilities, speeds, and costs:

claude-opus-4-8: Most capable, best for complex reasoning
claude-sonnet-4-6: Great balance of capability and speed
claude-haiku-4-5-20251001: Fastest, cheapest, good for simple tasks

For our Q&A assistant, Sonnet provides the best tradeoff: capable enough to synthesize information from multiple sources, fast enough for interactive use.

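max_tokens: The hard limit on response length. Set this based on expected answer length plus buffer. Too low and answers get cut off. Too high and a runaway response can take longer and cost more than you intended. You are billed for the tokens actually generated, not for the limit itself.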

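Why we set temperature to 0: For Q&A systems, we want consistent, factual responses. Temperature controls randomness in output. At temperature 0 the model picks the most likely token at each step; at temperature 1 it samples with more variety. For creative writing, higher temperature helps. For factual Q&A, temperature 0 gives the most consistent results, though even then outputs are not guaranteed to be identical from run to run. The request above omits the parameter for brevity; generate_answer later passes config.temperature explicitly.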
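To make the effect of temperature concrete, here is a small, self-contained sketch of temperature-scaled softmax over made-up scores. It illustrates the general sampling idea only: the logits are invented, and nothing here claims to reflect how any particular provider implements sampling.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; low temperature sharpens, high flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as 'always pick the max'")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches 0, nearly all probability mass lands on the highest-scoring token, which is why low temperatures give more repeatable answers.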

messages: The conversation formatted as a list of turns. Each message has:

role: Either “user” (human) or “assistant” (Claude)
content: The text of that turn

For multi-turn conversations, you include the full history:

messages = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "What are its main uses?"}
]
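One way to manage that history is a small helper that appends turns and validates them before the list is sent. This is a sketch: add_turn and VALID_ROLES are hypothetical names introduced here, not part of the Anthropic SDK, and the checks are a conservative convention rather than a statement of what the API enforces.

```python
VALID_ROLES = {"user", "assistant"}


def add_turn(history: list[dict], role: str, content: str) -> list[dict]:
    """Return a new history with one more turn; the input list is not modified."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    if not content.strip():
        raise ValueError("Turn content must not be empty")
    return history + [{"role": role, "content": content}]


history: list[dict] = []
history = add_turn(history, "user", "What is Python?")
history = add_turn(history, "assistant", "Python is a programming language...")
history = add_turn(history, "user", "What are its main uses?")

# history now has the same shape as the messages list shown above and
# could be passed as messages=history on the next request.
print(len(history), history[-1]["role"])  # prints: 3 user
```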

Understanding the Response

The API returns a Message object. Let’s examine it:

from anthropic import Anthropic
from config import config

client = Anthropic()
message = client.messages.create(
    model=config.llm_model,
    max_tokens=100,
    messages=[{"role": "user", "content": "Say hello"}]
)

print(f"ID: {message.id}")
print(f"Model: {message.model}")
print(f"Role: {message.role}")
print(f"Content: {message.content}")
print(f"Stop reason: {message.stop_reason}")
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")

Output:

ID: msg_01XFDUDYJgAACzvnptvVoYEL
Model: claude-sonnet-4-6
Role: assistant
Content: [TextBlock(text='Hello! How can I help you today?', type='text')]
Stop reason: end_turn
Input tokens: 10
Output tokens: 11

Key fields:

content: A list of content blocks. Usually one TextBlock, but can include multiple for tool use.
stop_reason: Why generation stopped. “end_turn” means natural completion. “max_tokens” means hit the limit.
usage: Token counts for billing and monitoring.
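Because stop_reason distinguishes a natural ending from a cut-off one, it is worth checking before showing an answer. The sketch below uses a hypothetical FakeMessage stand-in so that it runs without an API call. With the real SDK, the text lives at message.content[0].text rather than in a text attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Hypothetical stand-in exposing only the fields this check reads."""
    text: str
    stop_reason: str


def answer_or_warning(message: FakeMessage) -> str:
    """Append a visible note when generation stopped at the token limit."""
    if message.stop_reason == "max_tokens":
        return message.text + "\n[Answer truncated: consider raising max_tokens]"
    return message.text


complete = FakeMessage(text="Paris.", stop_reason="end_turn")
cut_off = FakeMessage(text="The policy states that", stop_reason="max_tokens")

print(answer_or_warning(complete))
print(answer_or_warning(cut_off))
```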

The “aha” moment: Notice that usage.input_tokens and usage.output_tokens are separate. API pricing charges differently for input vs. output tokens. In RAG applications, you send lots of context (input tokens) but generate shorter answers (output tokens). Understanding this helps with cost optimization.
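A rough cost estimate follows directly from those two counts. The per-million-token prices below are placeholders chosen for illustration, not actual rates; substitute the current figures from your provider's pricing page.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate request cost in dollars from token counts and per-million prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices (assumptions for illustration only).
INPUT_PRICE = 3.00    # dollars per million input tokens
OUTPUT_PRICE = 15.00  # dollars per million output tokens

# A RAG-shaped request: five 500-token chunks of context in, a short answer out.
cost = estimate_cost(2_500, 300, INPUT_PRICE, OUTPUT_PRICE)
print(f"${cost:.4f}")  # prints: $0.0120
```

With these placeholder prices the input side accounts for more of the cost than the output side, even though output tokens are priced higher, because the retrieved context dominates the token count.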

Streaming Responses

For a responsive user experience, stream responses instead of waiting for completion:

def generate_streaming(prompt: str) -> str:
    """
    Stream a response from Claude, printing tokens as they arrive.

    Returns the complete response when done.
    """
    client = anthropic.Anthropic()

    full_response = ""

    with client.messages.stream(
        model=config.llm_model,
        max_tokens=config.max_tokens,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_response += text

    print()  # Newline at the end
    return full_response

Why streaming matters:

Perceived latency: Users see progress immediately instead of waiting seconds for a complete response
Time to first token: Critical metric for UX. Streaming shows content in ~200ms vs. waiting 2-5 seconds for full response
Cancellation: User can interrupt if the response is going in the wrong direction

Common Mistake #3: Not using streaming in user-facing applications. Waiting for full responses creates a jarring experience. Users prefer watching text appear to staring at a spinner.

Error Handling

LLM APIs fail in predictable ways. Build resilience from the start:

import anthropic
import time
from typing import Optional

from config import config


class GenerationError(Exception):
    """Custom exception for generation failures."""
    pass


def generate_with_retry(
    prompt: str,
    max_retries: int = 3,
    initial_delay: float = 1.0
) -> str:
    """
    Generate a response with exponential backoff retry.

    Handles transient API errors gracefully.
    """
    client = anthropic.Anthropic()
    delay = initial_delay
    last_error: Optional[Exception] = None

    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model=config.llm_model,
                max_tokens=config.max_tokens,
                messages=[{"role": "user", "content": prompt}]
            )
            return message.content[0].text

        except anthropic.RateLimitError as e:
            # Rate limited: wait and retry
            last_error = e
            print(f"Rate limited. Waiting {delay}s before retry {attempt + 1}/{max_retries}")
            time.sleep(delay)
            delay *= 2  # Exponential backoff

        except anthropic.APIStatusError as e:
            # API error (5xx): might be transient
            if e.status_code >= 500:
                last_error = e
                print(f"API error {e.status_code}. Waiting {delay}s before retry")
                time.sleep(delay)
                delay *= 2
            else:
                # Client error (4xx): don't retry
                raise GenerationError(f"API error: {e.message}") from e

        except anthropic.APIConnectionError as e:
            # Network error: retry
            last_error = e
            print(f"Connection error. Waiting {delay}s before retry")
            time.sleep(delay)
            delay *= 2

    raise GenerationError(f"Failed after {max_retries} attempts: {last_error}")

Why exponential backoff? When services are overloaded, everyone retrying immediately makes things worse. Exponential backoff (1s, 2s, 4s, …) spreads out retry attempts, giving the service time to recover.

Common Mistake #4: Retrying on all errors. Some errors (invalid API key, malformed request) will never succeed no matter how many times you retry. Only retry transient errors.
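A common refinement is to add random jitter so that many clients do not retry in lockstep. The sketch below is a generic helper, independent of any SDK: retry_with_backoff, TransientError, and the injectable sleep parameter are names introduced here for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying TransientError with exponential backoff and jitter."""
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error without sleeping
            # "Full jitter": wait a random time between 0 and the current cap.
            sleep(random.uniform(0, delay))
            delay *= 2
    raise RuntimeError("max_retries must be at least 1")


# Usage: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


waits: list[float] = []
result = retry_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, len(waits))  # prints: ok 2
```

Passing sleep as a parameter keeps the helper testable: the usage example records the waits instead of actually pausing. Any other exception type propagates immediately, which matches the rule of retrying only transient errors.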

Putting It Together

Update generator.py with a complete implementation:

"""LLM interaction for generating answers."""

import anthropic
import time
from typing import Optional
from dataclasses import dataclass

from config import config


class GenerationError(Exception):
    """Custom exception for generation failures."""
    pass


@dataclass
class GenerationResult:
    """Result of a generation request with metadata."""
    text: str
    input_tokens: int
    output_tokens: int
    model: str


def generate_answer(
    question: str,
    context: str,
    max_retries: int = 3
) -> GenerationResult:
    """
    Generate an answer to a question given context.

    This is the core RAG generation function: it takes retrieved
    context and synthesizes an answer.
    """
    client = anthropic.Anthropic()

    # Construct the prompt with clear instructions
    prompt = f"""Based on the following context, answer the question.
If the context doesn't contain enough information to answer, say so.
Always cite which parts of the context support your answer.

Context:
{context}

Question: {question}

Answer:"""

    delay = 1.0
    last_error: Optional[Exception] = None

    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model=config.llm_model,
                max_tokens=config.max_tokens,
                temperature=config.temperature,
                messages=[{"role": "user", "content": prompt}]
            )

            return GenerationResult(
                text=message.content[0].text,
                input_tokens=message.usage.input_tokens,
                output_tokens=message.usage.output_tokens,
                model=message.model
            )

        except (anthropic.RateLimitError, anthropic.APIConnectionError) as e:
            last_error = e
            time.sleep(delay)
            delay *= 2

        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                last_error = e
                time.sleep(delay)
                delay *= 2
            else:
                raise GenerationError(f"API error: {e.message}") from e

    raise GenerationError(f"Failed after {max_retries} attempts: {last_error}")

Notice how generate_answer is different from generate_simple:

It takes both a question and context (RAG pattern)
It constructs a prompt with clear instructions
It returns structured metadata for monitoring
It handles errors gracefully

This is the pattern we’ll use throughout: functions that do one thing well, with clear inputs, outputs, and error handling.
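The context argument is just a string, so something has to assemble it from retrieved chunks. A minimal sketch, assuming each retrieved chunk is a dict with 'text' and 'metadata' keys, where the metadata carries source_file and chunk_index. The format_context helper and the example data are hypothetical, and labeling each chunk with its source is only one way to make the requested citations possible.

```python
def format_context(results: list[dict]) -> str:
    """Join retrieved chunks into one context string with source labels."""
    sections = []
    for result in results:
        meta = result["metadata"]
        label = f"[Source: {meta['source_file']}, chunk {meta['chunk_index']}]"
        sections.append(f"{label}\n{result['text']}")
    return "\n\n---\n\n".join(sections)


# Invented example results for illustration.
example_results = [
    {
        "text": "Full-time employees receive 20 days of paid time off per year.",
        "metadata": {"source_file": "hr_policies/time_off.md", "chunk_index": 3},
    },
    {
        "text": "Unused days roll over up to a maximum of 5 days.",
        "metadata": {"source_file": "employee_handbook/benefits.md", "chunk_index": 7},
    },
]

context = format_context(example_results)
print(context)
```

The resulting string can be passed as the context argument of generate_answer, and the labels give the model something specific to cite.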

Complete code: See Answer Generator (generator.py) for the full implementation with additional error handling and token counting.

Adding Context: Simple RAG

Now we’ll build the retrieval side: turning documents into searchable chunks and finding relevant ones for a query.

Why Chunking Matters

Imagine searching a library with a single catalog card for each book. “Moby Dick: A story about whaling.” Helpful, but if you’re looking for the chapter about the whiteness of the whale, you’re out of luck. You need finer-grained index entries.

Chunking is creating those index entries for your documents. But chunk too finely (every sentence) and you lose context. Chunk too coarsely (whole documents) and search precision suffers.

The “aha” moment: Chunking is a core tradeoff in RAG. Smaller chunks = better search precision but less context per result. Larger chunks = more context but noisier search results. There’s no perfect answer, only the right answer for your use case.

Building the Indexer

Create indexer.py:

"""Document processing and indexing."""

import os
import hashlib
from pathlib import Path
from dataclasses import dataclass
from typing import Iterator

import tiktoken

from config import config


@dataclass
class Chunk:
    """A chunk of text with metadata."""
    text: str
    source_file: str
    chunk_index: int
    token_count: int

    @property
    def id(self) -> str:
        """Generate a unique ID for this chunk."""
        content_hash = hashlib.md5(self.text.encode()).hexdigest()[:8]
        return f"{self.source_file}::{self.chunk_index}::{content_hash}"


class Chunker:
    """Split documents into chunks for indexing."""

    def __init__(
        self,
        chunk_size: int = config.chunk_size,
        chunk_overlap: int = config.chunk_overlap
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Use the tokenizer for our embedding model
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.tokenizer.encode(text))

    def chunk_text(self, text: str, source_file: str) -> list[Chunk]:
        """
        Split text into overlapping chunks.

        Strategy: Split on paragraph boundaries when possible, and
        fall back to sentence boundaries for paragraphs over chunk_size.
        """
        # Normalize whitespace
        text = text.strip()
        if not text:
            return []

        # Try to split on paragraphs first
        paragraphs = text.split('\n\n')

        chunks = []
        current_chunk = []
        current_tokens = 0

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            para_tokens = self.count_tokens(para)

            # If single paragraph exceeds chunk size, split it further
            if para_tokens > self.chunk_size:
                # Flush current chunk
                if current_chunk:
                    chunk_text = '\n\n'.join(current_chunk)
                    chunks.append(self._create_chunk(
                        chunk_text, source_file, len(chunks)
                    ))
                    current_chunk = []
                    current_tokens = 0

                # Split the large paragraph
                chunks.extend(self._split_large_text(
                    para, source_file, len(chunks)
                ))
                continue

            # Check if adding this paragraph exceeds chunk size
            if current_tokens + para_tokens > self.chunk_size and current_chunk:
                # Save current chunk
                chunk_text = '\n\n'.join(current_chunk)
                chunks.append(self._create_chunk(
                    chunk_text, source_file, len(chunks)
                ))

                # Start new chunk with overlap
                # Keep last paragraph(s) that fit in overlap
                overlap_chunks = []
                overlap_tokens = 0
                for p in reversed(current_chunk):
                    p_tokens = self.count_tokens(p)
                    if overlap_tokens + p_tokens <= self.chunk_overlap:
                        overlap_chunks.insert(0, p)
                        overlap_tokens += p_tokens
                    else:
                        break

                current_chunk = overlap_chunks
                current_tokens = overlap_tokens

            current_chunk.append(para)
            current_tokens += para_tokens

        # Don't forget the last chunk
        if current_chunk:
            chunk_text = '\n\n'.join(current_chunk)
            chunks.append(self._create_chunk(
                chunk_text, source_file, len(chunks)
            ))

        return chunks

    def _split_large_text(
        self, text: str, source_file: str, start_index: int
    ) -> list[Chunk]:
        """Split text that exceeds chunk size using sentences."""
        import re

        # Split on sentence boundaries
        sentences = re.split(r'(?<=[.!?])\s+', text)

        chunks = []
        current_sentences = []
        current_tokens = 0

        for sentence in sentences:
            sentence_tokens = self.count_tokens(sentence)

            if current_tokens + sentence_tokens > self.chunk_size and current_sentences:
                chunk_text = ' '.join(current_sentences)
                chunks.append(self._create_chunk(
                    chunk_text, source_file, start_index + len(chunks)
                ))
                current_sentences = []
                current_tokens = 0

            current_sentences.append(sentence)
            current_tokens += sentence_tokens

        if current_sentences:
            chunk_text = ' '.join(current_sentences)
            chunks.append(self._create_chunk(
                chunk_text, source_file, start_index + len(chunks)
            ))

        return chunks

    def _create_chunk(self, text: str, source_file: str, index: int) -> Chunk:
        """Create a Chunk object."""
        return Chunk(
            text=text,
            source_file=source_file,
            chunk_index=index,
            token_count=self.count_tokens(text)
        )


def load_documents(directory: str) -> Iterator[tuple[str, str]]:
    """
    Load all supported documents from a directory.

    Yields (filename, content) tuples.
    """
    supported_extensions = {'.txt', '.md', '.markdown'}
    directory = Path(directory)

    if not directory.exists():
        raise FileNotFoundError(f"Directory not found: {directory}")

    for file_path in directory.rglob('*'):
        if file_path.suffix.lower() in supported_extensions:
            try:
                content = file_path.read_text(encoding='utf-8')
                # Use relative path as identifier
                relative_path = file_path.relative_to(directory)
                yield str(relative_path), content
            except Exception as e:
                print(f"Warning: Could not read {file_path}: {e}")


def index_documents(directory: str) -> list[Chunk]:
    """
    Load and chunk all documents from a directory.

    This is the main entry point for document indexing.
    """
    chunker = Chunker()
    all_chunks = []

    for filename, content in load_documents(directory):
        chunks = chunker.chunk_text(content, filename)
        all_chunks.extend(chunks)
        print(f"  {filename}: {len(chunks)} chunks")

    return all_chunks

Let’s trace through what happens when we chunk a document:

Load: Read file content as text
Split on paragraphs: Natural semantic boundaries
Accumulate: Keep adding paragraphs until we hit chunk_size tokens
Overlap: Carry over the trailing paragraphs that fit within chunk_overlap tokens when starting a new chunk
Handle edge cases: Very long paragraphs get split on sentences

Why overlap? Consider this text split into two chunks without overlap:

Chunk 1: "...The new policy takes effect on January 1st."
Chunk 2: "Employees must complete training before the deadline..."

A question about “when employees must complete training” might not match either chunk well. With overlap:

Chunk 1: "...The new policy takes effect on January 1st."
Chunk 2: "takes effect on January 1st. Employees must complete training before the deadline..."

Now chunk 2 has the context needed to answer the question.

Common Mistake #5: No chunk overlap. This creates “boundary blindness” where important information spanning chunks becomes unfindable.
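The Chunker above overlaps at paragraph granularity, so a trailing paragraph longer than chunk_overlap tokens is not carried over. The overlapping text in the example corresponds to a finer-grained sliding window. Here is a minimal sketch of that idea, using whitespace-separated words as a stand-in for tokens so that it runs without tiktoken. The sliding_window_chunks helper is hypothetical and not part of indexer.py.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size words that share overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = (
    "The new policy takes effect on January 1st. "
    "Employees must complete training before the deadline."
)
for chunk in sliding_window_chunks(text, chunk_size=8, overlap=4):
    print(repr(chunk))
```

With these settings the middle window reads "effect on January 1st. Employees must complete training", so the date and the training requirement land in the same chunk.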

Creating Embeddings

Embeddings are numeric representations of text that capture semantic meaning. Similar texts have similar embeddings, enabling semantic search.

Add to indexer.py:

import openai
from config import config


def create_embeddings(texts: list[str]) -> list[list[float]]:
    """
    Create embeddings for a list of texts.

    Uses OpenAI's embedding API with batching for efficiency.
    """
    client = openai.OpenAI()

    # OpenAI's embedding API accepts batches
    # but has a token limit per request
    batch_size = 100
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        response = client.embeddings.create(
            model=config.embedding_model,
            input=batch
        )

        # Sort by index to maintain order
        sorted_embeddings = sorted(response.data, key=lambda x: x.index)
        all_embeddings.extend([e.embedding for e in sorted_embeddings])

    return all_embeddings

Why batch embeddings?

Efficiency: One API call for 100 texts vs. 100 API calls
Cost: Same price but less overhead
Rate limits: Fewer requests means less chance of hitting limits
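The slicing loop in create_embeddings can be factored into a small generator. This is a sketch, and batched is a hypothetical helper name. Python 3.12 also ships itertools.batched, which yields tuples rather than lists.

```python
from typing import Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"chunk {i}" for i in range(250)]
batch_lengths = [len(batch) for batch in batched(texts, 100)]
print(batch_lengths)  # prints: [100, 100, 50]
```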

The “aha” moment: Document embeddings are created once during indexing and reused for every query. At search time only the question is embedded, using the same model. The embedding model never generates answers; it just provides a way to measure similarity. This is why RAG can work with any LLM, independent of the embedding model.

Complete code: See Document Indexer (indexer.py) for the full implementation with progress tracking, file type handling, and metadata extraction.

Storing in ChromaDB

ChromaDB is a simple, local vector database perfect for learning. Create retriever.py:

"""Vector search and retrieval."""

import chromadb
from chromadb.config import Settings

from config import config
from indexer import Chunk, create_embeddings


class VectorStore:
    """Simple vector store using ChromaDB."""

    def __init__(self, persist_directory: str = config.index_path):
        """Initialize or load existing vector store."""
        self.client = chromadb.PersistentClient(
            path=persist_directory,
            settings=Settings(anonymized_telemetry=False)
        )
        self.collection = self.client.get_or_create_collection(
            name="documents",
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )

    def add_chunks(self, chunks: list[Chunk]) -> None:
        """Add chunks to the vector store."""
        if not chunks:
            return

        # Extract texts and metadata
        texts = [chunk.text for chunk in chunks]
        ids = [chunk.id for chunk in chunks]
        metadatas = [
            {
                "source_file": chunk.source_file,
                "chunk_index": chunk.chunk_index,
                "token_count": chunk.token_count
            }
            for chunk in chunks
        ]

        # Create embeddings
        print("Generating embeddings...")
        embeddings = create_embeddings(texts)

        # Add to collection
        self.collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=texts,
            metadatas=metadatas
        )
        print(f"Added {len(chunks)} chunks to index")

    def search(self, query: str, top_k: int = config.top_k) -> list[dict]:
        """
        Search for chunks similar to the query.

        Returns list of dicts with 'text', 'metadata', and 'score' keys.
        """
        # Create embedding for query
        query_embedding = create_embeddings([query])[0]

        # Search
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )

        # Format results
        formatted = []
        for i in range(len(results['ids'][0])):
            formatted.append({
                'text': results['documents'][0][i],
                'metadata': results['metadatas'][0][i],
                'score': 1 - results['distances'][0][i]  # Convert distance to similarity
            })

        return formatted

    def count(self) -> int:
        """Return number of chunks in the index."""
        return self.collection.count()

    def clear(self) -> None:
        """Clear all data from the index."""
        self.client.delete_collection("documents")
        self.collection = self.client.get_or_create_collection(
            name="documents",
            metadata={"hnsw:space": "cosine"}
        )

Key concepts:

Cosine similarity: Measures the angle between two vectors. Range is -1 to 1, where 1 means identical direction (most similar). We configure ChromaDB to use cosine by setting hnsw:space to cosine in the collection metadata.

HNSW (Hierarchical Navigable Small World): The algorithm ChromaDB uses for fast approximate nearest neighbor search. You don’t need to understand the details; just know that it trades a small amount of accuracy for a large gain in speed.

Distance vs. similarity: ChromaDB returns distances (lower = more similar). We convert to similarity scores (higher = more similar) for intuitive interpretation.
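The relationship between cosine similarity and the distance that search() converts can be sketched in plain Python. This assumes the cosine distance is defined as 1 minus the cosine similarity, which is what the `1 - distance` conversion above relies on; the function names and toy vectors are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as 1 - similarity (lower = more similar)."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # longer, but points the same way
orthogonal = [0.0, 3.0]
opposite = [-1.0, 0.0]

for name, vec in [("same", same_direction), ("orthogonal", orthogonal), ("opposite", opposite)]:
    distance = cosine_distance(query, vec)
    # Converting back mirrors the 'score' computed in search().
    print(f"{name}: distance={distance:.2f}, score={1 - distance:.2f}")
```

Note that vector length does not matter: only direction does, which is why `same_direction` scores as a perfect match despite being twice as long.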

Complete code: See Vector Retriever (retriever.py) for the full implementation with hybrid search, metadata filtering, and batch retrieval.

Testing Retrieval

Let’s create some test documents and verify retrieval works:

# test_retrieval.py
"""Test the retrieval pipeline."""

from pathlib import Path
from indexer import index_documents
from retriever import VectorStore


def create_sample_docs():
    """Create sample documents for testing."""
    docs_dir = Path("sample_docs")
    docs_dir.mkdir(exist_ok=True)

    # HR Policy document
    (docs_dir / "hr_policy.md").write_text("""
# HR Policies

## Vacation Policy

All full-time employees receive 20 days of paid time off (PTO) per year.
PTO accrues at a rate of 1.67 days per month.

Unused PTO can roll over to the next year, up to a maximum of 5 days.
PTO requests should be submitted at least 2 weeks in advance.

## Remote Work Policy

Employees may work remotely up to 3 days per week with manager approval.
Full remote arrangements require VP approval and must be reviewed quarterly.

## Expense Policy

Business expenses over $50 require receipt documentation.
Expenses over $500 require pre-approval from your manager.
Submit expense reports within 30 days of the expense.
""")

    # Technical documentation
    (docs_dir / "tech_guide.md").write_text("""
# Technical Guidelines

## Code Review Process

All code changes require at least one approval before merging.
Security-sensitive changes require review from the security team.
Reviews should be completed within 2 business days.

## Deployment Process

Deployments to production happen on Tuesdays and Thursdays.
Emergency deployments require on-call approval and a rollback plan.
All deployments must pass automated tests and staging validation.

## On-Call Rotation

Engineers participate in on-call rotation after 3 months on the team.
On-call shifts are one week long, starting Monday at 9 AM.
On-call engineers receive a $500 stipend per week.
""")

    print("Sample documents created in ./sample_docs")
    return docs_dir


def main():
    # Create sample documents
    docs_dir = create_sample_docs()

    # Index documents
    print("\nIndexing documents...")
    chunks = index_documents(str(docs_dir))
    print(f"Created {len(chunks)} chunks")

    # Add to vector store
    store = VectorStore()
    store.clear()  # Start fresh
    store.add_chunks(chunks)

    # Test queries
    test_queries = [
        "How many vacation days do I get?",
        "Can I work from home?",
        "How do I submit an expense report?",
        "What is the deployment schedule?",
        "How much is the on-call stipend?"
    ]

    print("\n" + "="*50)
    print("Testing retrieval:")
    print("="*50)

    for query in test_queries:
        print(f"\nQuery: {query}")
        results = store.search(query, top_k=2)
        for i, result in enumerate(results):
            print(f"  Result {i+1} (score: {result['score']:.3f}):")
            print(f"    Source: {result['metadata']['source_file']}")
            # Show first 100 chars of text
            preview = result['text'][:100].replace('\n', ' ')
            print(f"    Preview: {preview}...")


if __name__ == "__main__":
    main()

Run it:

python test_retrieval.py

You should see each query returning relevant chunks. For “How many vacation days do I get?”, the top result should be from the vacation policy section, not the deployment process.

Common Mistake #6: Not testing retrieval in isolation. If your Q&A system gives bad answers, is it retrieval (wrong chunks) or generation (ignoring context)? Test each component separately.
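One way to test retrieval on its own is to label a few queries with the file that should come back and measure how often it does. The sketch below is a hedged illustration: `retrieval_hit_rate` and `fake_search` are hypothetical names, and `fake_search` is a stand-in that only mimics the shape of VectorStore.search() results so the example runs without an index or API key.

```python
from typing import Callable

# A search function takes (query, top_k) and returns result dicts shaped
# like the output of VectorStore.search().
SearchFn = Callable[[str, int], list[dict]]


def retrieval_hit_rate(search: SearchFn, labeled: dict[str, str], top_k: int = 2) -> float:
    """Fraction of queries whose expected source file appears in the top_k results."""
    hits = 0
    for query, expected_source in labeled.items():
        sources = [r["metadata"]["source_file"] for r in search(query, top_k)]
        if expected_source in sources:
            hits += 1
    return hits / len(labeled) if labeled else 0.0


def fake_search(query: str, top_k: int) -> list[dict]:
    """Stand-in for store.search: a crude keyword rule, not real retrieval."""
    source = "hr_policy.md" if "vacation" in query.lower() else "tech_guide.md"
    return [{"text": "", "metadata": {"source_file": source}, "score": 1.0}][:top_k]


labeled_queries = {
    "How many vacation days do I get?": "hr_policy.md",
    "What is the deployment schedule?": "tech_guide.md",
    "Can I work from home?": "hr_policy.md",  # the stand-in misses this one
}
rate = retrieval_hit_rate(fake_search, labeled_queries)
print(f"hit rate: {rate:.2f}")
```

With a real index, the same function could be called with `lambda q, k: store.search(q, top_k=k)`. No LLM call is involved, so a low hit rate points at retrieval rather than generation.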

Building the Q&A Loop

Now we connect everything into a working Q&A system.

The Complete Pipeline

Constructing Effective Prompts

The prompt is where retrieval meets generation. A good RAG prompt:

Clearly provides the context
States the task unambiguously
Instructs the model to use only the context
Requests citations for verifiability

Update generator.py with our RAG-specific generator:

"""LLM interaction for generating answers."""

import anthropic
import time
from typing import Optional
from dataclasses import dataclass

from config import config


class GenerationError(Exception):
    """Custom exception for generation failures."""
    pass


@dataclass
class GenerationResult:
    """Result of a generation request with metadata."""
    text: str
    input_tokens: int
    output_tokens: int
    model: str


# The RAG prompt template
RAG_PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided documentation.

## Instructions
- Answer the question using ONLY the information in the provided context
- If the context doesn't contain enough information to answer, say "I don't have enough information to answer that question based on the available documentation"
- Always cite your sources by mentioning which document the information comes from
- Be concise but complete
- If multiple documents contain relevant information, synthesize them

## Context
{context}

## Question
{question}

## Answer"""


def format_context(search_results: list[dict]) -> str:
    """Format search results into context for the prompt."""
    context_parts = []

    for i, result in enumerate(search_results, 1):
        source = result['metadata']['source_file']
        chunk_idx = result['metadata']['chunk_index']
        text = result['text']
        score = result['score']

        context_parts.append(
            f"[Source {i}: {source} (chunk {chunk_idx}, relevance: {score:.2f})]\n{text}"
        )

    return "\n\n---\n\n".join(context_parts)


def generate_answer(
    question: str,
    search_results: list[dict],
    max_retries: int = 3
) -> GenerationResult:
    """
    Generate an answer using retrieved context.

    This is the core RAG generation function.
    """
    client = anthropic.Anthropic()

    # Format context from search results
    context = format_context(search_results)

    # Construct the full prompt
    prompt = RAG_PROMPT_TEMPLATE.format(
        context=context,
        question=question
    )

    delay = 1.0
    last_error: Optional[Exception] = None

    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model=config.llm_model,
                max_tokens=config.max_tokens,
                temperature=config.temperature,
                messages=[{"role": "user", "content": prompt}]
            )

            return GenerationResult(
                text=message.content[0].text,
                input_tokens=message.usage.input_tokens,
                output_tokens=message.usage.output_tokens,
                model=message.model
            )

        except (anthropic.RateLimitError, anthropic.APIConnectionError) as e:
            last_error = e
            time.sleep(delay)
            delay *= 2

        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                last_error = e
                time.sleep(delay)
                delay *= 2
            else:
                raise GenerationError(f"API error: {e.message}") from e

    raise GenerationError(f"Failed after {max_retries} attempts: {last_error}")
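The retry loop above doubles its delay after every transient failure. A stripped-down sketch of the same pattern makes the schedule visible: the API call is replaced by a hypothetical flaky function, and `time.sleep` by a recorder so the example runs instantly. The names `retry_with_backoff`, `flaky_call`, and `fake_sleep` are illustrative and not part of generator.py.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, doubling the wait after each ConnectionError."""
    delay = initial_delay
    last_error: Optional[Exception] = None
    for _ in range(max_retries):
        try:
            return call()
        except ConnectionError as error:  # stand-in for a transient API error
            last_error = error
            sleep(delay)
            delay *= 2
    raise RuntimeError(f"Failed after {max_retries} attempts: {last_error}")


attempts = {"count": 0}
recorded_delays: list[float] = []


def flaky_call() -> str:
    """Hypothetical call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary network problem")
    return "ok"


def fake_sleep(seconds: float) -> None:
    recorded_delays.append(seconds)  # record instead of actually waiting


result = retry_with_backoff(flaky_call, sleep=fake_sleep)
print(result, recorded_delays)
```

Passing the sleep function as a parameter is a design choice for testability; it lets the backoff schedule be checked without waiting for real delays.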

Let’s examine the prompt template:

RAG_PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided documentation.

## Instructions
- Answer the question using ONLY the information in the provided context
- If the context doesn't contain enough information...

Why this structure?

Role setting (“You are a helpful assistant”): Establishes behavior expectations
Clear constraint (“ONLY the information in the provided context”): Reduces hallucination
Graceful failure (“If the context doesn’t contain enough information”): Better than wrong answers
Citation requirement: Enables verification
Labeled sections: Makes parsing unambiguous

The “aha” moment: The prompt is where you control the model’s behavior. Every instruction you add trades off against other instructions. “Be concise” conflicts with “be complete.” “Cite sources” adds length. Finding the right balance requires iteration.

The Main Application

Create qa_assistant.py:

"""Document Q&A Assistant - Main application."""

import argparse
import sys
from pathlib import Path

from rich.console import Console
from rich.markdown import Markdown
from rich.panel import Panel

from config import config
from indexer import index_documents
from retriever import VectorStore
from generator import generate_answer, GenerationError


console = Console()


def cmd_index(args):
    """Index documents from a directory."""
    directory = Path(args.directory)

    if not directory.exists():
        console.print(f"[red]Error: Directory not found: {directory}[/red]")
        sys.exit(1)

    console.print(f"[bold]Indexing documents from {directory}...[/bold]")

    # Load and chunk documents
    chunks = index_documents(str(directory))

    if not chunks:
        console.print("[yellow]No documents found to index[/yellow]")
        sys.exit(1)

    console.print(f"Created {len(chunks)} chunks from documents")

    # Store in vector database
    store = VectorStore()
    if args.clear:
        console.print("Clearing existing index...")
        store.clear()

    store.add_chunks(chunks)

    console.print(f"[green]Index saved. Total chunks: {store.count()}[/green]")


def cmd_query(args):
    """Query the document index."""
    # Validate configuration
    errors = config.validate()
    if errors:
        console.print("[red]Configuration errors:[/red]")
        for error in errors:
            console.print(f"  - {error}")
        sys.exit(1)

    # Load vector store
    store = VectorStore()

    if store.count() == 0:
        console.print("[yellow]Index is empty. Run 'index' command first.[/yellow]")
        sys.exit(1)

    question = args.question

    console.print(f"\n[bold]Question:[/bold] {question}\n")

    # Retrieve relevant chunks
    with console.status("Searching documents..."):
        results = store.search(question, top_k=config.top_k)

    if not results:
        console.print("[yellow]No relevant documents found.[/yellow]")
        sys.exit(1)

    # Show retrieved sources if verbose
    if args.verbose:
        console.print("[dim]Retrieved sources:[/dim]")
        for i, result in enumerate(results, 1):
            source = result['metadata']['source_file']
            score = result['score']
            console.print(f"  {i}. {source} (relevance: {score:.2f})")
        console.print()

    # Generate answer
    with console.status("Generating answer..."):
        try:
            result = generate_answer(question, results)
        except GenerationError as e:
            console.print(f"[red]Error generating answer: {e}[/red]")
            sys.exit(1)

    # Display answer
    console.print(Panel(
        Markdown(result.text),
        title="Answer",
        border_style="green"
    ))

    # Show token usage if verbose
    if args.verbose:
        console.print(f"\n[dim]Tokens: {result.input_tokens} input, {result.output_tokens} output[/dim]")
        # Estimate cost (approximate rates as of 2026)
        input_cost = result.input_tokens * 0.003 / 1000  # $3 per 1M input tokens
        output_cost = result.output_tokens * 0.015 / 1000  # $15 per 1M output tokens
        console.print(f"[dim]Estimated cost: ${input_cost + output_cost:.4f}[/dim]")


def cmd_interactive(args):
    """Interactive query mode."""
    errors = config.validate()
    if errors:
        console.print("[red]Configuration errors:[/red]")
        for error in errors:
            console.print(f"  - {error}")
        sys.exit(1)

    store = VectorStore()

    if store.count() == 0:
        console.print("[yellow]Index is empty. Run 'index' command first.[/yellow]")
        sys.exit(1)

    console.print("[bold]Document Q&A Assistant[/bold]")
    console.print(f"Index contains {store.count()} chunks")
    console.print("Type 'quit' or 'exit' to stop.\n")

    while True:
        try:
            question = console.input("[bold cyan]Question:[/bold cyan] ").strip()
        except (KeyboardInterrupt, EOFError):
            console.print("\nGoodbye!")
            break

        if not question:
            continue

        if question.lower() in ('quit', 'exit', 'q'):
            console.print("Goodbye!")
            break

        # Retrieve
        results = store.search(question, top_k=config.top_k)

        if not results:
            console.print("[yellow]No relevant documents found.[/yellow]\n")
            continue

        # Generate
        try:
            result = generate_answer(question, results)
        except GenerationError as e:
            console.print(f"[red]Error: {e}[/red]\n")
            continue

        # Display
        console.print()
        console.print(Panel(
            Markdown(result.text),
            title="Answer",
            border_style="green"
        ))
        console.print()


def main():
    parser = argparse.ArgumentParser(
        description="Document Q&A Assistant",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  Index documents:
    python qa_assistant.py index ./docs

  Query the index:
    python qa_assistant.py query "What is the vacation policy?"

  Interactive mode:
    python qa_assistant.py interactive
"""
    )

    subparsers = parser.add_subparsers(dest="command", help="Available commands")

    # Index command
    index_parser = subparsers.add_parser("index", help="Index documents")
    index_parser.add_argument("directory", help="Directory containing documents")
    index_parser.add_argument("--clear", action="store_true",
                              help="Clear existing index before adding")

    # Query command
    query_parser = subparsers.add_parser("query", help="Query the index")
    query_parser.add_argument("question", help="Question to ask")
    query_parser.add_argument("-v", "--verbose", action="store_true",
                              help="Show additional details")

    # Interactive command
    interactive_parser = subparsers.add_parser("interactive",
                                                help="Interactive query mode")

    args = parser.parse_args()

    if args.command == "index":
        cmd_index(args)
    elif args.command == "query":
        cmd_query(args)
    elif args.command == "interactive":
        cmd_interactive(args)
    else:
        parser.print_help()


if __name__ == "__main__":
    main()

Testing the Complete System

# First, create and index sample documents
python test_retrieval.py

# Now query using the Q&A assistant
python qa_assistant.py query "How many vacation days do employees get?"

# Try with verbose mode
python qa_assistant.py query -v "What is the deployment schedule?"

# Interactive mode
python qa_assistant.py interactive

You should see answers grounded in your documents, with citations to the source files.

Complete code: See Main Application (qa_assistant.py) for the full implementation with argument parsing, error handling, and graceful shutdown.

Making It Production-Ready

A working prototype is not a production system. Let’s add the scaffolding that turns demos into reliable tools.

Adding Logging

Logging is how you debug production issues. You can’t attach a debugger to production. Create or update files to include logging:

# Add to config.py
import logging

def setup_logging(level: str = "INFO") -> logging.Logger:
    """Configure logging for the application."""
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    return logging.getLogger("qa_assistant")


logger = setup_logging()

# Update generator.py to use logging
from config import logger

def generate_answer(question: str, search_results: list[dict], ...):
    logger.info(f"Generating answer for question: {question[:50]}...")
    logger.debug(f"Using {len(search_results)} context chunks")

    # ... existing code ...

    logger.info(f"Generated answer: {result.input_tokens} in, {result.output_tokens} out")
    return result

# Update retriever.py
from config import logger

def search(self, query: str, top_k: int = config.top_k):
    logger.debug(f"Searching for: {query[:50]}...")

    # ... existing code ...

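    # Guard against an empty result list before reading the top score
    if formatted:
        logger.info(f"Found {len(formatted)} results, top score: {formatted[0]['score']:.3f}")
    else:
        logger.info("Found 0 results")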
    return formatted

Why structured logging?
- Timestamps: Know when things happened
- Levels: Filter by severity (DEBUG, INFO, WARNING, ERROR)
- Context: Include relevant data for debugging
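Level filtering is easiest to see in a small experiment. The sketch below uses the standard logging module with the same format string as setup_logging, but writes to an in-memory buffer so the output can be inspected. The logger name `qa_assistant.demo` and the numbers in the messages are made up for illustration.

```python
import io
import logging

# Write log output to an in-memory buffer so the result can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)

demo_logger = logging.getLogger("qa_assistant.demo")  # hypothetical logger name
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output out of the root logger

demo_logger.debug("Using 5 context chunks")           # below INFO: dropped
demo_logger.info("Generated answer: 812 in, 96 out")  # emitted
demo_logger.warning("Top score 0.21 is low")          # emitted

log_output = buffer.getvalue()
print(log_output)
```

Switching the level to `logging.DEBUG` would make the first message appear as well, which is the usual way to get more detail while investigating a problem.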

Handling Edge Cases

Real users do unexpected things. Handle them gracefully:

# Add to qa_assistant.py

def validate_question(question: str) -> tuple[bool, str]:
    """
    Validate user question before processing.

    Returns (is_valid, error_message).
    """
    if not question or not question.strip():
        return False, "Question cannot be empty"

    if len(question) > 1000:
        return False, "Question is too long (max 1000 characters)"

    if len(question.split()) < 2:
        return False, "Please ask a complete question"

    return True, ""


def cmd_query(args):
    # Validate question first
    is_valid, error_msg = validate_question(args.question)
    if not is_valid:
        console.print(f"[red]Invalid question: {error_msg}[/red]")
        sys.exit(1)

    # ... rest of function

Common edge cases to handle:

Empty queries
Extremely long queries (will exceed context limits)
Queries with no relevant results
API timeouts
Malformed documents during indexing
File encoding issues

Common Mistake #7: Ignoring edge cases during development. The happy path works great in demos. Production users find every edge case.
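File encoding issues are a typical example. The following is a minimal sketch of one possible approach, assuming documents are mostly UTF-8 with occasional legacy Latin-1 files. The helper `read_text_with_fallback` is a hypothetical name and is not taken from indexer.py.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(
    path: Path, encodings: tuple[str, ...] = ("utf-8", "latin-1")
) -> str:
    """Try each encoding in order and return the first successful decode."""
    raw = path.read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding fails. latin-1 accepts any byte
    # sequence, so with the defaults this line is a safety net for callers
    # that pass a stricter list.
    return raw.decode("utf-8", errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    legacy_file = Path(tmp) / "legacy.txt"
    legacy_file.write_bytes("café policy".encode("latin-1"))  # not valid UTF-8
    recovered = read_text_with_fallback(legacy_file)

print(recovered)
```

A fallback like this keeps indexing from crashing on one bad file, at the cost of possibly mis-decoding text that is in neither encoding, so logging which files needed the fallback is worth considering.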

Basic Evaluation

How do you know if your system is working well? Build evaluation into the system:

# evaluation.py
"""Basic evaluation utilities for the Q&A assistant."""

from dataclasses import dataclass
from typing import Optional
import json
from pathlib import Path

from retriever import VectorStore
from generator import generate_answer


@dataclass
class EvalCase:
    """A single evaluation case."""
    question: str
    expected_answer: str
    expected_sources: list[str]


@dataclass
class EvalResult:
    """Result of evaluating a single case."""
    question: str
    generated_answer: str
    expected_answer: str
    retrieved_sources: list[str]
    expected_sources: list[str]
    answer_contains_expected: bool
    correct_source_retrieved: bool


def load_eval_set(path: str) -> list[EvalCase]:
    """Load evaluation cases from a JSON file."""
    with open(path) as f:
        data = json.load(f)

    return [
        EvalCase(
            question=case['question'],
            expected_answer=case['expected_answer'],
            expected_sources=case.get('expected_sources', [])
        )
        for case in data
    ]


def evaluate_case(case: EvalCase, store: VectorStore) -> EvalResult:
    """Evaluate a single case."""
    # Retrieve
    results = store.search(case.question)
    retrieved_sources = [r['metadata']['source_file'] for r in results]

    # Generate
    gen_result = generate_answer(case.question, results)

    # Check if expected answer content is in generated answer
    answer_contains_expected = (
        case.expected_answer.lower() in gen_result.text.lower()
    )

    # Check if expected source was retrieved
    correct_source_retrieved = any(
        expected in retrieved_sources
        for expected in case.expected_sources
    ) if case.expected_sources else True

    return EvalResult(
        question=case.question,
        generated_answer=gen_result.text,
        expected_answer=case.expected_answer,
        retrieved_sources=retrieved_sources,
        expected_sources=case.expected_sources,
        answer_contains_expected=answer_contains_expected,
        correct_source_retrieved=correct_source_retrieved
    )


def run_evaluation(eval_path: str, store: VectorStore) -> dict:
    """Run full evaluation and return metrics."""
    cases = load_eval_set(eval_path)
    results = []

    for case in cases:
        result = evaluate_case(case, store)
        results.append(result)

    # Compute metrics
    total = len(results)
    answer_correct = sum(1 for r in results if r.answer_contains_expected)
    source_correct = sum(1 for r in results if r.correct_source_retrieved)

    return {
        'total_cases': total,
        'answer_accuracy': answer_correct / total if total > 0 else 0,
        'retrieval_accuracy': source_correct / total if total > 0 else 0,
        'results': results
    }

Create an evaluation set (eval_cases.json):

[
    {
        "question": "How many vacation days do employees get?",
        "expected_answer": "20 days",
        "expected_sources": ["hr_policy.md"]
    },
    {
        "question": "What days can we deploy to production?",
        "expected_answer": "Tuesdays and Thursdays",
        "expected_sources": ["tech_guide.md"]
    },
    {
        "question": "How much is the on-call stipend?",
        "expected_answer": "$500",
        "expected_sources": ["tech_guide.md"]
    }
]

The “aha” moment: Evaluation lets you measure improvement. Without metrics, you’re guessing. “Does changing chunk size help?” becomes “Did accuracy go from 75% to 82%?”
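One caveat about the metric itself: the substring check in evaluate_case fails when a correct answer phrases things slightly differently. As a hedged alternative, the sketch below checks that every token of the expected answer appears somewhere in the generated answer. The function names and the sample answer are illustrative assumptions, and token matching is more lenient, so it can also accept wrong answers that happen to contain the same words.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and split into word tokens, keeping a leading '$' on amounts."""
    return re.findall(r"\$?\w+", text.lower())


def contains_expected(expected: str, generated: str) -> bool:
    """True if every token of the expected answer appears in the generated answer."""
    generated_tokens = set(normalize(generated))
    return all(token in generated_tokens for token in normalize(expected))


# Hypothetical model output that is correct but worded differently.
generated = "Full-time employees receive 20 paid days off per year (hr_policy.md)."

substring_match = "20 days".lower() in generated.lower()
token_match = contains_expected("20 days", generated)
print(f"substring: {substring_match}, tokens: {token_match}")
```

Whichever check is used, it should stay fixed while other parameters are varied, so that accuracy numbers remain comparable between runs.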

Cost Tracking

LLM APIs charge per token. Track costs to avoid surprises. See Appendix K (LLM Provider Comparison) for current pricing across providers.

# Add to config.py

@dataclass
class CostTracker:
    """Track API usage and costs."""

    # Pricing (check Appendix K for current rates)
    CLAUDE_INPUT_PRICE = 3.00 / 1_000_000   # $3 per 1M input tokens
    CLAUDE_OUTPUT_PRICE = 15.00 / 1_000_000  # $15 per 1M output tokens
    EMBEDDING_PRICE = 0.02 / 1_000_000       # $0.02 per 1M tokens

    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_embedding_tokens: int = 0

    def add_generation(self, input_tokens: int, output_tokens: int):
        """Record a generation API call."""
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens

    def add_embedding(self, tokens: int):
        """Record an embedding API call."""
        self.total_embedding_tokens += tokens

    @property
    def total_cost(self) -> float:
        """Calculate total cost in USD."""
        generation_cost = (
            self.total_input_tokens * self.CLAUDE_INPUT_PRICE +
            self.total_output_tokens * self.CLAUDE_OUTPUT_PRICE
        )
        embedding_cost = self.total_embedding_tokens * self.EMBEDDING_PRICE
        return generation_cost + embedding_cost

    def summary(self) -> str:
        """Return a cost summary string."""
        return (
            f"Tokens: {self.total_input_tokens:,} input, "
            f"{self.total_output_tokens:,} output, "
            f"{self.total_embedding_tokens:,} embedding\n"
            f"Estimated cost: ${self.total_cost:.4f}"
        )


# Global cost tracker
cost_tracker = CostTracker()

Update generator.py to track costs:

from config import cost_tracker

def generate_answer(...):
    # ... existing code ...

    # Track costs
    cost_tracker.add_generation(
        result.input_tokens,
        result.output_tokens
    )

    return result

Common Mistake #8: Ignoring costs during development. It’s easy to rack up hundreds of dollars in API costs during development and testing. Track from day one.
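A quick worked example shows how per-query costs add up. The rates are the ones used in CostTracker above; the token counts and the daily query volume are hypothetical numbers chosen only to illustrate the arithmetic.

```python
INPUT_PRICE_PER_TOKEN = 3.00 / 1_000_000    # $3 per 1M input tokens (rate used above)
OUTPUT_PRICE_PER_TOKEN = 15.00 / 1_000_000  # $15 per 1M output tokens (rate used above)


def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single generation call."""
    return (
        input_tokens * INPUT_PRICE_PER_TOKEN
        + output_tokens * OUTPUT_PRICE_PER_TOKEN
    )


# Hypothetical query: 2,000 prompt tokens (context + question), 300 answer tokens.
per_query = query_cost(2_000, 300)
per_day = per_query * 500  # hypothetical volume of 500 queries per day

print(f"per query: ${per_query:.4f}")
print(f"per day:   ${per_day:.2f}")
```

In this example the prompt accounts for more than half of the per-query cost, which is why the number and size of retrieved chunks is usually the first thing to look at when reducing spend.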

Complete Production-Ready Code

Here’s the final structure with all improvements:

document_qa/
├── qa_assistant.py      # CLI entry point
├── indexer.py           # Document processing
├── retriever.py         # Vector search
├── generator.py         # LLM interaction
├── evaluation.py        # Evaluation utilities
├── config.py            # Configuration, logging, cost tracking
├── requirements.txt
├── eval_cases.json      # Evaluation test cases
├── sample_docs/
└── qa_index/

Each file has clear responsibilities. The system is observable through logging and cost tracking. Evaluation provides a feedback loop for improvement.

Complete Reference: The full, runnable implementation of all components is available in Your First LLM Application - Code Reference. This includes all source files, evaluation utilities, sample documents, and test scripts discussed in this chapter.

Next Steps

What We Skipped

This tutorial intentionally omitted several important topics that production systems need:

Advanced chunking: We used simple paragraph-based chunking. Production systems often use recursive chunking, semantic chunking, or even LLM-based chunking for better results. Chapter 7 covers these in depth.

Reranking: We used raw embedding similarity. Adding a reranking step (where a more powerful model re-scores candidates) significantly improves precision. This is covered in Chapter 7.

Prompt optimization: Our prompt is functional but not optimized. Production prompts benefit from systematic testing and iteration. Chapter 6 covers prompt engineering comprehensively.

Hybrid search: We used pure vector search. Combining vector and keyword search (BM25) often works better, especially for exact matches. Covered in Chapter 7.

Caching: Every query creates new embeddings and makes API calls. Caching embeddings and common queries reduces cost and latency. Covered in Chapter 9.

Authentication and security: Our CLI has no access controls. Production systems need authentication, rate limiting, and protection against prompt injection. Covered in Chapter 12.

Deployment: We run locally. Production needs containers, load balancing, monitoring, and CI/CD. Covered in Chapter 9.

How This Connects to Later Chapters

What you built here is the foundation for more sophisticated systems:

Chapter 7 (RAG Systems): Goes deep on everything we touched lightly: advanced chunking, embedding models, hybrid search, reranking, GraphRAG. You now have context for why these matter.

Chapter 8 (Agentic Systems): Agents extend this pattern. Instead of just answering questions, agents can take actions, use tools, and reason over multiple steps. The Q&A assistant becomes one tool in a larger system.

Chapter 9 (Deployment): Taking this CLI to production. How do you handle thousands of concurrent users? How do you update the index without downtime? How do you monitor quality?

Chapter 15 (Evaluation): We built basic evaluation. Production evaluation needs systematic test sets, regression testing, and metrics pipelines. LLM-as-judge evaluation becomes important.

Chapter 16 (Security): Users will try to trick your system. Prompt injection attacks try to override your instructions. Production systems need defenses.

Ideas for Extending the Project

Now that you have a working system, try these extensions:

Add PDF support: Use a library like pypdf or pdfplumber to extract text from PDFs. Handle the messiness of PDF text extraction.
Improve chunk quality: Implement semantic chunking that splits on topic changes rather than character counts. Compare retrieval quality.
Add conversation history: Make the assistant remember previous questions in the session. How does this change the prompt?
Build a web interface: Use FastAPI or Flask to create an HTTP API, then add a simple web frontend.
Implement caching: Add an LRU cache for embeddings and common queries. Measure the cost savings.
Add source highlighting: When displaying answers, highlight the specific passages that were used. Helps users verify accuracy.
Compare embedding models: Try different embedding models (OpenAI’s older text-embedding-ada-002 vs. text-embedding-3-small vs. open-source alternatives). Measure retrieval quality differences.
Add query expansion: Before searching, generate query variations. Does retrieving across multiple queries improve results?
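As a starting point for the caching extension, here is a minimal sketch using `functools.lru_cache` from the standard library. The embedding call is replaced by a stand-in (`fake_embed`) that returns a toy vector and counts how often it is invoked; in the real project the cached function would wrap the embedding request instead.

```python
from functools import lru_cache

api_calls = 0  # counts how often the (fake) embedding API is hit


def fake_embed(text: str) -> tuple[float, ...]:
    """Stand-in for a real embedding request; returns a toy 2-d vector."""
    global api_calls
    api_calls += 1
    return (float(len(text)), float(text.count(" ")))


@lru_cache(maxsize=1024)
def cached_embed(text: str) -> tuple[float, ...]:
    """Return a cached embedding. Tuples are immutable, so sharing them is safe."""
    return fake_embed(text)


queries = [
    "What is the vacation policy?",
    "What is the deployment schedule?",
    "What is the vacation policy?",  # repeat: served from the cache
]
vectors = [cached_embed(q) for q in queries]
info = cached_embed.cache_info()
print(f"API calls: {api_calls}, cache hits: {info.hits}")
```

This cache only matches exact strings and lives in memory for one process, so differently worded questions and separate CLI runs still trigger new requests.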

Summary

You’ve built a complete Document Q&A Assistant from scratch. Along the way, you learned:

Setting Up: API keys go in environment variables, never in code. Virtual environments isolate dependencies. Configuration lives in one place.

API Interaction: LLM APIs are request-response with structured inputs and outputs. Streaming improves perceived latency. Retry with exponential backoff handles transient errors.

RAG Pipeline: Documents become chunks. Chunks become embeddings. Embeddings enable semantic search. Retrieved chunks become prompt context. The model generates grounded answers.

Production Readiness: Logging enables debugging. Edge case handling prevents crashes. Evaluation measures quality. Cost tracking prevents surprises.

Key Takeaways

RAG is retrieve-then-generate: Find relevant context, include it in the prompt, generate grounded responses. This pattern underlies most practical LLM applications.
Chunking is a tradeoff: Smaller chunks = better precision, less context. Larger chunks = more context, noisier search. Overlap prevents boundary blindness.
The prompt is your lever: You control model behavior through instructions. Clear structure, explicit constraints, and graceful failure handling matter.
Test components independently: When answers are wrong, is it retrieval or generation? Isolated testing reveals the failure point.
Build in observability: Logging, cost tracking, and evaluation aren’t optional. You can’t improve what you can’t measure.

The Architecture at a Glance

Practical Exercises

Chunk size experiment: Modify chunk_size to 200, 500, and 1000 tokens. Run the evaluation suite after each change. Which size gives the best results for your documents? Why?
Prompt engineering: The current RAG prompt is basic. Try adding:
- Instructions to format answers as bullet points
- A constraint to keep answers under 100 words
- Instructions to express uncertainty when appropriate
Measure how these changes affect answer quality.
Failure analysis: Find 5 questions where the system gives wrong answers. For each, determine if it’s a retrieval failure (wrong chunks) or generation failure (right chunks, wrong answer). What patterns emerge?
Cost optimization: Track costs for 50 queries. Identify the most expensive queries. What makes them expensive? How could you reduce costs?
Hybrid search: Implement keyword search alongside vector search. Combine results using reciprocal rank fusion. Compare retrieval quality to vector-only search.
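
For the hybrid search exercise, reciprocal rank fusion can be sketched in a few lines. Each result list contributes 1 / (k + rank) to a chunk's score, so chunks that rank well in both lists rise to the top. The chunk IDs below are made up for illustration, the function name is our own, and k = 60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs; each list is ordered best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from vector search and keyword search for one query.
vector_hits = ["time_off.md::3", "benefits.md::7", "sick_leave.md::1"]
keyword_hits = ["benefits.md::7", "holidays.md::2", "time_off.md::3"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
for chunk_id, score in fused:
    print(f"{score:.4f}  {chunk_id}")
```

Because the fusion uses only ranks, it does not need the two search methods to produce comparable scores.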

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC1] Why do we chunk documents before embedding them instead of embedding entire documents?

Answer

Two reasons: (1) Embedding models have token limits (typically 512-8192 tokens). Long documents exceed these limits. (2) Retrieval precision—if you embed a 50-page document as one vector, you retrieve the whole document even if only one paragraph is relevant. Smaller chunks enable precise retrieval of the specific content that answers the query.

Q2. [IC1] The system returns 5 chunks but the LLM only uses information from 2 of them. Is this a problem? How would you know?

Answer

It might be fine—the LLM correctly identifies the most relevant content. To evaluate: (1) Check if the answer is correct and complete. (2) Examine if the unused chunks were truly irrelevant (retrieval issue) or contained useful information the LLM missed (generation issue). (3) If unused chunks are consistently irrelevant, reduce top_k to save tokens and cost. If they contain useful information, the RAG prompt may need improvement.

Q3. [IC2] A user asks “What’s the vacation policy?” and your system returns chunks about sick leave instead. Where in the pipeline did this fail and how would you debug it?

Answer

Debug systematically: (1) Check the query embedding: is “vacation” semantically similar to “sick leave” in your embedding space? Test with embedding("vacation") · embedding("sick leave"). (2) Check if vacation policy exists in your documents. If not, retrieval is working correctly (returning the closest match). (3) If vacation docs exist, check their embeddings. The documents may have been chunked poorly, separating the word “vacation” in a header from the body text it describes. (4) Solution options: improve chunking to preserve context, use hybrid search to match keywords, or add metadata filtering.
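
The similarity check in step (1) can be sketched as follows. The three-dimensional vectors are made up for illustration; in practice the vectors would come from the embedding model and have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this demo; not real embeddings.
toy_embeddings = {
    "vacation": [0.9, 0.1, 0.3],
    "sick leave": [0.8, 0.3, 0.2],
    "expense report": [0.1, 0.9, 0.4],
}

query = toy_embeddings["vacation"]
for label, vector in toy_embeddings.items():
    print(f"{label:15s} {cosine_similarity(query, vector):.3f}")
```

If “vacation” and “sick leave” score nearly as high as “vacation” against itself, the embedding space alone cannot separate them, which points toward hybrid search or metadata filtering.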

Q4. [IC2] Your RAG system costs $0.50 per query on average. The product manager wants to reduce this to $0.10. What are your options?

Answer

Cost reduction strategies: (1) Use a smaller, cheaper LLM for simple queries and only escalate to the expensive model when needed. (2) Cache common queries and their answers. (3) Reduce context size—use fewer chunks, shorter chunks, or summarize retrieved content. (4) Use a cheaper embedding model (or cache embeddings). (5) Implement semantic caching—if a similar question was asked before, return cached answer. (6) Batch queries where possible. Trade-off: each reduction may impact quality, so measure carefully.
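
To see why context size is usually the first lever, a rough cost model helps. The sketch below uses placeholder prices (they are not real rates for any provider) and assumes each retrieved chunk is about 500 tokens, matching the chapter's default chunk_size. The `Pricing` class and `query_cost` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Placeholder prices in USD per million tokens; not real rates."""
    input_per_mtok: float
    output_per_mtok: float


def query_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Estimate the generation cost of one query in USD."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


placeholder = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)

# Input tokens are dominated by the retrieved context: top_k chunks of ~500 tokens.
for top_k in (5, 3):
    input_tokens = top_k * 500 + 200   # retrieved chunks + instructions and question
    cost = query_cost(input_tokens, output_tokens=300, pricing=placeholder)
    print(f"top_k={top_k}: {input_tokens} input tokens, ${cost:.4f} per query")
```

Plugging in the token counts from your own usage logs and your provider's current prices shows which of the strategies above moves the total most.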

Q5. [Senior] The system works well in testing but users complain it “doesn’t know” recent company updates. What’s happening and how do you fix it?

Answer

The index is stale—documents were added once but not updated. Solutions: (1) Implement incremental indexing that detects new/modified files. (2) Set up a scheduled job to re-index changed documents. (3) Add a “last indexed” timestamp to metadata so you can track freshness. (4) Consider a document change detection system (file watchers, webhooks from document system). (5) For critical updates, provide a manual “refresh index” command. Production RAG systems need operational processes for keeping the index current.
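
One way to approach solution (1) is to keep a manifest of content hashes and re-index only files whose hash has changed. This is a minimal sketch under assumptions: the manifest is a JSON file of our own design, and `find_changed_files` and `save_manifest` are hypothetical helpers, not functions from the chapter's code.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, used to detect content changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed_files(directory: Path, manifest_path: Path) -> tuple[list[Path], dict[str, str]]:
    """Return files that are new or modified since the manifest was written,
    plus the up-to-date manifest to save after re-indexing them."""
    previous: dict[str, str] = {}
    if manifest_path.exists():
        previous = json.loads(manifest_path.read_text(encoding="utf-8"))

    current: dict[str, str] = {}
    changed: list[Path] = []
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".txt", ".md", ".markdown"}:
            key = str(path.relative_to(directory))
            current[key] = file_digest(path)
            if previous.get(key) != current[key]:
                changed.append(path)
    return changed, current


def save_manifest(manifest: dict[str, str], manifest_path: Path) -> None:
    """Persist the manifest so the next run can compare against it."""
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

A scheduled job could call `find_changed_files`, re-chunk and re-embed only the returned files, then call `save_manifest`. Keys present in the old manifest but missing from the new one correspond to deleted files, whose chunks should be removed from the vector store.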

Code Challenges

Challenge 1. [IC1] The chunker splits on double newlines. Some documents use single newlines between paragraphs. Fix the chunker to handle both.

Solution

import re

def chunk_text(self, text: str, source_file: str) -> list[Chunk]:
    text = text.strip()
    if not text:
        return []

    # Normalize line endings, then collapse any run of blank or
    # whitespace-only lines to exactly one blank line
    text = text.replace("\r\n", "\n")
    text = re.sub(r'\n\s*\n', '\n\n', text)

    if "\n\n" in text:
        # The document separates paragraphs with blank lines, so single
        # newlines (lists, wrapped lines) stay inside their paragraph
        paragraphs = text.split("\n\n")
    else:
        # No blank lines anywhere: treat each single newline
        # as a paragraph break
        paragraphs = text.split("\n")

    # Rest of chunking logic...

Alternative: Use a more sophisticated paragraph detection that considers line length and indentation patterns.

Challenge 2. [IC2] Add a --verbose flag to the CLI that shows which chunks were retrieved and their similarity scores before the answer.

Solution

# In cli.py, modify the ask command:
@app.command()
def ask(
    question: str,
    verbose: bool = typer.Option(False, "--verbose", "-v", help="Show retrieved chunks"),
):
    """Ask a question about the indexed documents."""
    # ... existing setup code ...

    # Get results with scores
    results = pipeline.query(question)

    if verbose:
        console.print("\n[bold]Retrieved Chunks:[/bold]")
        for i, chunk in enumerate(results.chunks, 1):
            console.print(f"\n[cyan]Chunk {i}[/cyan] (score: {chunk.score:.3f})")
            console.print(f"Source: {chunk.source_file}")
            # Show at most the first 200 characters of each chunk
            preview = (chunk.text[:200] + "...") if len(chunk.text) > 200 else chunk.text
            console.print(preview)
        console.print("\n" + "="*50 + "\n")

    # Print answer
    console.print(Markdown(results.answer))

Connections to Other Chapters

This hands-on tutorial introduces concepts explored in depth throughout the book:

Chapter 6 (Prompt Engineering): The prompts we built here are just the beginning—Chapter 6 covers advanced patterns like chain-of-thought, few-shot learning, and structured outputs.
Chapter 7 (RAG Systems): Takes the basic retrieval we implemented and expands to production patterns including hybrid search, reranking, and GraphRAG.
Chapter 15 (MLOps & Evaluation): How to systematically evaluate the quality of responses from systems like this one.
Chapter 14 (Backend Engineering for AI): Production patterns for testing, debugging, and deploying LLM applications.
Capstone Projects (Appendix E): Extends this tutorial into full production projects with Docker, monitoring, and comprehensive testing.

Connections to Other Chapters

This tutorial chapter serves as a practical bridge between foundational knowledge and advanced techniques:

To Chapter 5 (LLM/NLP Foundations): We used embeddings without explaining how they work. Chapter 5 covers the mathematics of how text becomes vectors and why similar meanings cluster in embedding space.

To Chapter 6 (Prompt Engineering): Our RAG prompt was functional but basic. Chapter 6 teaches systematic prompt design, few-shot learning, and chain-of-thought techniques that can significantly improve answer quality.

To Chapter 7 (RAG Systems): This chapter used simple chunking and vector search. Chapter 7 goes deep: semantic chunking, hybrid search, reranking, GraphRAG, and advanced retrieval patterns.

To Chapter 8 (Agentic Systems): The Q&A assistant is reactive (answer questions). Agents are proactive (take actions). Chapter 8 shows how to build systems that use tools, reason over multiple steps, and accomplish complex goals.

To Chapter 9 (Deployment): We ran locally. Chapter 9 covers serving LLM applications at scale: batching, caching, load balancing, and production infrastructure.

To Chapter 15 (MLOps & Evaluation): Our evaluation was minimal. Chapter 15 covers systematic evaluation: test set design, metrics, LLM-as-judge patterns, and continuous quality monitoring.

Part I Checkpoint: Foundations Complete

Congratulations on completing Part I! Before moving to Part II, verify you can do the following:

Skills Checklist

Explain the AI engineering landscape (Ch01): Describe the difference between ML engineering and AI engineering, and identify where you fit in the career ladder
Write production Python code (Ch02): Use async/await, Pydantic models, and proper error handling in LLM applications
Understand ML fundamentals (Ch03): Explain embeddings, loss functions, and why neural networks work
Build a working RAG application (Ch04): Create a document Q&A system with chunking, embeddings, and retrieval

Quick Self-Test (10 minutes)

Q1. What’s the difference between embeddings and one-hot encodings? Why do embeddings enable semantic search?

Q2. Your async LLM call is blocking the event loop. What’s wrong and how do you fix it?

Q3. You built a document Q&A system but it’s returning irrelevant chunks. List three things you would check.

Q4. Why does chunk overlap matter in RAG systems?

Check Your Answers

A1. One-hot encodings are sparse vectors in which every pair of distinct words is orthogonal (dot product = 0), so related words are no closer than unrelated ones. Embeddings are dense vectors in which similar meanings are geometrically close, which is what makes similarity search via cosine similarity possible.

A2. You’re likely using a synchronous API call inside an async function. Use await with an async client, or run sync code in an executor: await asyncio.to_thread(sync_function).

A3. (1) Check chunk size—too small loses context, too large dilutes relevance. (2) Verify embedding model matches your domain. (3) Check if the query and documents use different vocabulary (consider hybrid search).

A4. Without overlap, a relevant passage that straddles a chunk boundary is split in two, and neither half may match the query well. Overlap repeats the boundary content at the start of the next chunk, so the passage is more likely to appear intact in at least one chunk.
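
The executor approach from A2 can be sketched as follows. `slow_sync_call` is a stand-in that sleeps instead of calling a real API, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


def slow_sync_call(prompt: str) -> str:
    """Stand-in for a blocking SDK call; sleeps instead of hitting an API."""
    time.sleep(0.3)
    return f"answer to: {prompt}"


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # Each blocking call runs in a worker thread, so the event loop stays
    # free and the two calls overlap instead of running back to back.
    answers = await asyncio.gather(
        asyncio.to_thread(slow_sync_call, "vacation policy"),
        asyncio.to_thread(slow_sync_call, "sick leave"),
    )
    return list(answers), time.perf_counter() - start


answers, elapsed = asyncio.run(main())
print(answers)
print(f"elapsed: {elapsed:.2f}s")
```

Calling `slow_sync_call` directly inside `main` would block the loop for the full duration of each call, and the two calls would take roughly twice as long in total. When the SDK offers an async client, awaiting it directly is usually the simpler fix.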

Ready for Part II?

If you can confidently check all boxes above, you’re ready for Part II: Core LLM Development, where you’ll learn the theory behind transformers, master prompt engineering, build production RAG systems, and create autonomous agents.

--- title: "Chapter 4: Your First LLM Application" keywords: [hands-on, RAG tutorial, document QA, API integration, OpenAI, vector database, project structure] difficulty: beginner prerequisites: [ch02, ch03] estimated_time: "4-5 hours" --- ## Introduction In the fall of 2023, a software engineer at a mid-sized fintech company was tasked with building an internal documentation assistant. The engineering wiki had grown to thousands of pages over five years, and new hires were spending weeks hunting for information that existed but was impossible to find. "Can't we just ask the AI?" became a common refrain. The engineer had experience with APIs and databases, but had never built anything with LLMs. What followed was a journey familiar to many: the initial excitement of a working prototype, the frustration of hallucinated answers, the iterative refinement that turned a demo into a tool people actually trusted. Three months later, the documentation assistant was handling hundreds of queries daily, and the engineer had learned lessons that no tutorial could have taught. This chapter is that tutorial. We'll build a complete Document Q&A Assistant from scratch: a command-line tool that answers questions about a set of documents using retrieval-augmented generation (RAG). Along the way, you'll encounter the same surprises, mistakes, and "aha moments" that every AI engineer experiences when building their first real application. The goal isn't just working code. It's understanding *why* each piece exists, what happens when things go wrong, and how the choices we make here connect to the production systems you'll build later. By the end, you'll have both a working application and the intuition to extend it. ### Why Build This? You might wonder: why build a Q&A assistant when dozens exist? Three reasons: 1. **Understanding through construction**: Reading about RAG is different from debugging why your retrieval returns irrelevant chunks at 2 AM. 
Building forces you to internalize concepts in a way that reading cannot. 2. **The foundation for everything else**: Every production LLM application shares DNA with what we'll build: taking user input, retrieving relevant context, constructing prompts, generating responses, handling errors. Master this pattern, and you can build chatbots, coding assistants, research tools, and more. 3. **Demystification**: LLM applications can seem magical or impossibly complex. Building one from scratch reveals that they're just software: APIs, databases, text processing, error handling. The magic is in the model, not in the application code. ### What We're Building Our Document Q&A Assistant will: - Load a folder of documents (markdown, text, or PDF) - Chunk documents into searchable pieces - Create vector embeddings for semantic search - Store embeddings in a local vector database - Accept questions from the command line - Retrieve relevant document chunks - Generate answers grounded in the retrieved context - Display answers with source citations Here's what using it looks like: ``` $ python qa_assistant.py index ./company_docs Indexing 47 documents... Created 312 chunks Generated embeddings Index saved to ./qa_index $ python qa_assistant.py query "What is our vacation policy?" Based on the company documentation: Our vacation policy provides 20 days of paid time off per year for full-time employees. PTO accrues monthly at 1.67 days per month. Unused days roll over up to a maximum of 5 days. Requests should be submitted at least 2 weeks in advance through the HR portal. 
Sources: - hr_policies/time_off.md (chunk 3) - employee_handbook/benefits.md (chunk 7) ``` ### What You'll Learn - How to structure an LLM application project - Making API calls to Anthropic and OpenAI - Processing documents into chunks for retrieval - Creating and using embeddings - Building a simple RAG pipeline - Error handling patterns for LLM applications - Basic evaluation and cost tracking - Common mistakes and how to avoid them ### Prerequisites - Python programming (functions, classes, file I/O) - Command-line familiarity - Basic understanding of APIs and HTTP - A text editor or IDE You don't need prior experience with LLMs, embeddings, or vector databases. We'll build that understanding as we go. --- ## Setting Up ### Getting API Keys Before writing any code, you need access to an LLM API. We'll use two providers: **Anthropic (Claude)** for text generation. Claude excels at following instructions and grounding responses in provided context, making it ideal for RAG applications. **OpenAI** for embeddings. OpenAI's text-embedding-3-small model provides excellent quality at low cost for semantic search. > **Why two providers?** Each has strengths. This mirrors production systems that often use different models for different tasks. You could use a single provider, but learning to work with multiple APIs is valuable. **Getting an Anthropic API Key:** 1. Go to https://console.anthropic.com/ 2. Create an account or sign in 3. Navigate to API Keys 4. Create a new key, give it a descriptive name 5. Copy the key immediately (you won't see it again) **Getting an OpenAI API Key:** 1. Go to https://platform.openai.com/ 2. Create an account or sign in 3. Navigate to API Keys in the sidebar 4. Create a new secret key 5. Copy the key immediately **Storing Keys Securely:** Never put API keys in your code. Use environment variables: ```bash # Add to your ~/.bashrc or ~/.zshrc export ANTHROPIC_API_KEY="sk-ant-..." export OPENAI_API_KEY="sk-..." 
# Then reload your shell or run: source ~/.bashrc ``` On Windows, use the Environment Variables system settings or: ```powershell $env:ANTHROPIC_API_KEY = "sk-ant-..." $env:OPENAI_API_KEY = "sk-..." ``` > **Common Mistake #1**: Committing API keys to git. Add `.env` to your `.gitignore` immediately. Better yet, never create a `.env` file. Environment variables are safer. ### Project Structure Create a clean project structure: ![Project Structure](../assets/diagrams/rendered/ch04_project_structure.svg) This separation isn't arbitrary. Each file has a single responsibility: - **indexer.py**: Getting documents into a searchable form - **retriever.py**: Finding relevant documents for a query - **generator.py**: Turning retrieved context into an answer - **config.py**: Centralizing settings makes changes easy > **Why structure matters**: When something breaks (and it will), you want to know exactly where to look. If retrieval returns bad results, debug `retriever.py`. If answers ignore the context, debug `generator.py`. Tight coupling makes debugging a nightmare. ### Dependencies and Virtual Environment Create a virtual environment to isolate dependencies: ```bash # Create project directory mkdir document_qa cd document_qa # Create virtual environment python -m venv venv # Activate it source venv/bin/activate # On Windows: venv\Scripts\activate # Verify you're in the virtual environment which python # Should show .../document_qa/venv/bin/python ``` Create `requirements.txt`: ``` anthropic>=0.18.0 openai>=1.12.0 chromadb>=0.4.22 tiktoken>=0.6.0 rich>=13.7.0 ``` Why these packages? 
- **anthropic**: Official Anthropic Python SDK for Claude - **openai**: Official OpenAI Python SDK for embeddings - **chromadb**: Simple, local vector database (perfect for learning) - **tiktoken**: OpenAI's tokenizer (for counting tokens accurately) - **rich**: Beautiful terminal output (optional but nice) Install dependencies: ```bash pip install -r requirements.txt ``` > **Common Mistake #2**: Using global pip instead of virtual environment pip. Always activate your virtual environment first. If `which pip` doesn't show your venv path, something is wrong. ### Configuration File Create `config.py` to centralize settings: ```python """Configuration for the Document Q&A Assistant.""" import os from dataclasses import dataclass @dataclass class Config: """Application configuration with sensible defaults.""" # API Keys (from environment) anthropic_api_key: str = os.getenv("ANTHROPIC_API_KEY", "") openai_api_key: str = os.getenv("OPENAI_API_KEY", "") # Model settings llm_model: str = "claude-sonnet-4-6" embedding_model: str = "text-embedding-3-small" embedding_dimensions: int = 1536 # Chunking settings chunk_size: int = 500 # tokens chunk_overlap: int = 50 # tokens # Retrieval settings top_k: int = 5 # number of chunks to retrieve # Generation settings max_tokens: int = 1024 temperature: float = 0.0 # deterministic for Q&A # Paths index_path: str = "./qa_index" def validate(self) -> list[str]: """Check configuration and return list of errors.""" errors = [] if not self.anthropic_api_key: errors.append("ANTHROPIC_API_KEY environment variable not set") if not self.openai_api_key: errors.append("OPENAI_API_KEY environment variable not set") return errors # Global config instance config = Config() ``` Why a dataclass? It provides: - Type hints for IDE autocompletion - Default values in one place - Easy validation - Immutability awareness > **The "aha" moment**: Notice how configuration is separate from logic. 
When you need to change chunk size from 500 to 400 tokens, you edit one number in one place. When you deploy to production with different settings, you create a new Config instance. This seems trivial now but prevents painful debugging later. ### Verify Setup Create a quick test script to verify everything works: ```python # test_setup.py """Verify the development environment is correctly configured.""" from config import config def main(): errors = config.validate() if errors: print("Configuration errors:") for error in errors: print(f" - {error}") return False # Test imports try: import anthropic import openai import chromadb import tiktoken print("All packages imported successfully") except ImportError as e: print(f"Import error: {e}") return False # Test Anthropic connection try: client = anthropic.Anthropic() # Make a minimal API call response = client.messages.create( model=config.llm_model, max_tokens=10, messages=[{"role": "user", "content": "Say 'hello' and nothing else."}] ) print(f"Anthropic API working: {response.content[0].text}") except Exception as e: print(f"Anthropic API error: {e}") return False # Test OpenAI connection try: client = openai.OpenAI() response = client.embeddings.create( model=config.embedding_model, input="test" ) print(f"OpenAI API working: got embedding with {len(response.data[0].embedding)} dimensions") except Exception as e: print(f"OpenAI API error: {e}") return False print("\nSetup complete! Ready to build.") return True if __name__ == "__main__": main() ``` Run it: ```bash python test_setup.py ``` If you see "Setup complete!", you're ready. If not, the error messages will guide you. --- ## Making Your First API Call ### Understanding the Anthropic Messages API Before we build anything complex, let's understand the API we're working with. 
Create `generator.py`: ```python """LLM interaction for generating answers.""" import anthropic from config import config def generate_simple(prompt: str) -> str: """ Make a simple completion request to Claude. This is the most basic form of LLM interaction: send a prompt, get a response. """ client = anthropic.Anthropic() message = client.messages.create( model=config.llm_model, max_tokens=config.max_tokens, messages=[ {"role": "user", "content": prompt} ] ) return message.content[0].text ``` Let's test it interactively: ```python # In Python REPL or a test script from generator import generate_simple response = generate_simple("What is the capital of France?") print(response) # Output: The capital of France is Paris. ``` Simple. But what's actually happening? ### Anatomy of an API Request Let's examine each parameter: ```python message = client.messages.create( model="claude-sonnet-4-6", # Which Claude model to use max_tokens=1024, # Maximum response length messages=[ # The conversation history {"role": "user", "content": prompt} ] ) ``` **model**: Different models have different capabilities, speeds, and costs: - `claude-opus-4-8`: Most capable, best for complex reasoning - `claude-sonnet-4-6`: Great balance of capability and speed - `claude-haiku-4-5-20251001`: Fastest, cheapest, good for simple tasks For our Q&A assistant, Sonnet provides the best tradeoff: capable enough to synthesize information from multiple sources, fast enough for interactive use. **max_tokens**: The hard limit on response length. Set this based on expected answer length plus buffer. Too low and answers get cut off. Too high and you pay for unused capacity. > **Why we set temperature to 0**: For Q&A systems, we want consistent, factual responses. Temperature controls randomness in output. Temperature 0 means the model always picks the most likely token. Temperature 1 introduces variety. For creative writing, higher temperature helps. 
For factual Q&A, temperature 0 provides deterministic, reproducible results. **messages**: The conversation formatted as a list of turns. Each message has: - `role`: Either "user" (human) or "assistant" (Claude) - `content`: The text of that turn For multi-turn conversations, you include the full history: ```python messages = [ {"role": "user", "content": "What is Python?"}, {"role": "assistant", "content": "Python is a programming language..."}, {"role": "user", "content": "What are its main uses?"} ] ``` ### Understanding the Response The API returns a `Message` object. Let's examine it: ```python from anthropic import Anthropic from config import config client = Anthropic() message = client.messages.create( model=config.llm_model, max_tokens=100, messages=[{"role": "user", "content": "Say hello"}] ) print(f"ID: {message.id}") print(f"Model: {message.model}") print(f"Role: {message.role}") print(f"Content: {message.content}") print(f"Stop reason: {message.stop_reason}") print(f"Input tokens: {message.usage.input_tokens}") print(f"Output tokens: {message.usage.output_tokens}") ``` Output: ``` ID: msg_01XFDUDYJgAACzvnptvVoYEL Model: claude-sonnet-4-6 Role: assistant Content: [TextBlock(text='Hello! How can I help you today?', type='text')] Stop reason: end_turn Input tokens: 10 Output tokens: 11 ``` Key fields: - **content**: A list of content blocks. Usually one TextBlock, but can include multiple for tool use. - **stop_reason**: Why generation stopped. "end_turn" means natural completion. "max_tokens" means hit the limit. - **usage**: Token counts for billing and monitoring. > **The "aha" moment**: Notice that `usage.input_tokens` and `usage.output_tokens` are separate. API pricing charges differently for input vs. output tokens. In RAG applications, you send lots of context (input tokens) but generate shorter answers (output tokens). Understanding this helps with cost optimization. 
### Streaming Responses For a responsive user experience, stream responses instead of waiting for completion: ```python def generate_streaming(prompt: str) -> str: """ Stream a response from Claude, printing tokens as they arrive. Returns the complete response when done. """ client = anthropic.Anthropic() full_response = "" with client.messages.stream( model=config.llm_model, max_tokens=config.max_tokens, messages=[{"role": "user", "content": prompt}] ) as stream: for text in stream.text_stream: print(text, end="", flush=True) full_response += text print() # Newline at the end return full_response ``` Why streaming matters: 1. **Perceived latency**: Users see progress immediately instead of waiting seconds for a complete response 2. **Time to first token**: Critical metric for UX. Streaming shows content in ~200ms vs. waiting 2-5 seconds for full response 3. **Cancellation**: User can interrupt if the response is going in the wrong direction > **Common Mistake #3**: Not using streaming in user-facing applications. Waiting for full responses creates a jarring experience. Users prefer watching text appear to staring at a spinner. ### Error Handling LLM APIs fail in predictable ways. Build resilience from the start: ```python import anthropic import time from typing import Optional class GenerationError(Exception): """Custom exception for generation failures.""" pass def generate_with_retry( prompt: str, max_retries: int = 3, initial_delay: float = 1.0 ) -> str: """ Generate a response with exponential backoff retry. Handles transient API errors gracefully. """ client = anthropic.Anthropic() delay = initial_delay last_error: Optional[Exception] = None for attempt in range(max_retries): try: message = client.messages.create( model=config.llm_model, max_tokens=config.max_tokens, messages=[{"role": "user", "content": prompt}] ) return message.content[0].text except anthropic.RateLimitError as e: # Rate limited: wait and retry last_error = e print(f"Rate limited. 
Waiting {delay}s before retry {attempt + 1}/{max_retries}") time.sleep(delay) delay *= 2 # Exponential backoff except anthropic.APIStatusError as e: # API error (5xx): might be transient if e.status_code >= 500: last_error = e print(f"API error {e.status_code}. Waiting {delay}s before retry") time.sleep(delay) delay *= 2 else: # Client error (4xx): don't retry raise GenerationError(f"API error: {e.message}") from e except anthropic.APIConnectionError as e: # Network error: retry last_error = e print(f"Connection error. Waiting {delay}s before retry") time.sleep(delay) delay *= 2 raise GenerationError(f"Failed after {max_retries} attempts: {last_error}") ``` Why exponential backoff? When services are overloaded, everyone retrying immediately makes things worse. Exponential backoff (1s, 2s, 4s, ...) spreads out retry attempts, giving the service time to recover. > **Common Mistake #4**: Retrying on all errors. Some errors (invalid API key, malformed request) will never succeed no matter how many times you retry. Only retry transient errors. ### Putting It Together Update `generator.py` with a complete implementation: ```python """LLM interaction for generating answers.""" import anthropic import time from typing import Optional from dataclasses import dataclass from config import config class GenerationError(Exception): """Custom exception for generation failures.""" pass @dataclass class GenerationResult: """Result of a generation request with metadata.""" text: str input_tokens: int output_tokens: int model: str def generate_answer( question: str, context: str, max_retries: int = 3 ) -> GenerationResult: """ Generate an answer to a question given context. This is the core RAG generation function: it takes retrieved context and synthesizes an answer. """ client = anthropic.Anthropic() # Construct the prompt with clear instructions prompt = f"""Based on the following context, answer the question. If the context doesn't contain enough information to answer, say so. 
Always cite which parts of the context support your answer. Context: {context} Question: {question} Answer:""" delay = 1.0 last_error: Optional[Exception] = None for attempt in range(max_retries): try: message = client.messages.create( model=config.llm_model, max_tokens=config.max_tokens, temperature=config.temperature, messages=[{"role": "user", "content": prompt}] ) return GenerationResult( text=message.content[0].text, input_tokens=message.usage.input_tokens, output_tokens=message.usage.output_tokens, model=message.model ) except (anthropic.RateLimitError, anthropic.APIConnectionError) as e: last_error = e time.sleep(delay) delay *= 2 except anthropic.APIStatusError as e: if e.status_code >= 500: last_error = e time.sleep(delay) delay *= 2 else: raise GenerationError(f"API error: {e.message}") from e raise GenerationError(f"Failed after {max_retries} attempts: {last_error}") ``` Notice how `generate_answer` is different from `generate_simple`: - It takes both a question and context (RAG pattern) - It constructs a prompt with clear instructions - It returns structured metadata for monitoring - It handles errors gracefully This is the pattern we'll use throughout: functions that do one thing well, with clear inputs, outputs, and error handling. > **Complete code**: See [Answer Generator (generator.py)](../reference/04_document_qa_app.html#answer-generator-generatorpy) for the full implementation with additional error handling and token counting. --- ## Adding Context: Simple RAG Now we'll build the retrieval side: turning documents into searchable chunks and finding relevant ones for a query. ### Why Chunking Matters Imagine searching a library with a single catalog card for each book. "Moby Dick: A story about whaling." Helpful, but if you're looking for the chapter about the whiteness of the whale, you're out of luck. You need finer-grained index entries. Chunking is creating those index entries for your documents. 
But chunk too finely (every sentence) and you lose context. Chunk too coarsely (whole documents) and search precision suffers.

> **The "aha" moment**: Chunking is a core tradeoff in RAG. Smaller chunks = better search precision but less context per result. Larger chunks = more context but noisier search results. There's no perfect answer, only the right answer for your use case.

### Building the Indexer

Create `indexer.py`:

```python
"""Document processing and indexing."""

import os
import re
import hashlib
from pathlib import Path
from dataclasses import dataclass
from typing import Iterator

import tiktoken

from config import config


@dataclass
class Chunk:
    """A chunk of text with metadata."""
    text: str
    source_file: str
    chunk_index: int
    token_count: int

    @property
    def id(self) -> str:
        """Generate a unique ID for this chunk."""
        content_hash = hashlib.md5(self.text.encode()).hexdigest()[:8]
        return f"{self.source_file}::{self.chunk_index}::{content_hash}"


class Chunker:
    """Split documents into chunks for indexing."""

    def __init__(
        self,
        chunk_size: int = config.chunk_size,
        chunk_overlap: int = config.chunk_overlap
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Use the tokenizer for our embedding model
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.tokenizer.encode(text))

    def chunk_text(self, text: str, source_file: str) -> list[Chunk]:
        """
        Split text into overlapping chunks.

        Strategy: Split on paragraph boundaries when possible, and
        fall back to sentence boundaries for oversized paragraphs.
        A single sentence longer than chunk_size is kept whole.
        """
        # Normalize whitespace
        text = text.strip()
        if not text:
            return []

        # Try to split on paragraphs first
        paragraphs = text.split('\n\n')

        chunks = []
        current_chunk = []
        current_tokens = 0

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            para_tokens = self.count_tokens(para)

            # If single paragraph exceeds chunk size, split it further
            if para_tokens > self.chunk_size:
                # Flush current chunk
                if current_chunk:
                    chunk_text = '\n\n'.join(current_chunk)
                    chunks.append(self._create_chunk(
                        chunk_text, source_file, len(chunks)
                    ))
                    current_chunk = []
                    current_tokens = 0

                # Split the large paragraph
                chunks.extend(self._split_large_text(
                    para, source_file, len(chunks)
                ))
                continue

            # Check if adding this paragraph exceeds chunk size
            if current_tokens + para_tokens > self.chunk_size and current_chunk:
                # Save current chunk
                chunk_text = '\n\n'.join(current_chunk)
                chunks.append(self._create_chunk(
                    chunk_text, source_file, len(chunks)
                ))

                # Start new chunk with overlap
                # Keep last paragraph(s) that fit in overlap
                overlap_chunks = []
                overlap_tokens = 0
                for p in reversed(current_chunk):
                    p_tokens = self.count_tokens(p)
                    if overlap_tokens + p_tokens <= self.chunk_overlap:
                        overlap_chunks.insert(0, p)
                        overlap_tokens += p_tokens
                    else:
                        break

                current_chunk = overlap_chunks
                current_tokens = overlap_tokens

            current_chunk.append(para)
            current_tokens += para_tokens

        # Don't forget the last chunk
        if current_chunk:
            chunk_text = '\n\n'.join(current_chunk)
            chunks.append(self._create_chunk(
                chunk_text, source_file, len(chunks)
            ))

        return chunks

    def _split_large_text(
        self,
        text: str,
        source_file: str,
        start_index: int
    ) -> list[Chunk]:
        """Split text that exceeds chunk size using sentences."""
        # Split on sentence boundaries
        sentences = re.split(r'(?<=[.!?])\s+', text)

        chunks = []
        current_sentences = []
        current_tokens = 0

        for sentence in sentences:
            sentence_tokens = self.count_tokens(sentence)

            if current_tokens + sentence_tokens > self.chunk_size and current_sentences:
                chunk_text = ' '.join(current_sentences)
                chunks.append(self._create_chunk(
                    chunk_text, source_file, start_index + len(chunks)
                ))
                current_sentences = []
                current_tokens = 0

            current_sentences.append(sentence)
            current_tokens += sentence_tokens

        if current_sentences:
            chunk_text = ' '.join(current_sentences)
            chunks.append(self._create_chunk(
                chunk_text, source_file, start_index + len(chunks)
            ))

        return chunks

    def _create_chunk(self, text: str, source_file: str, index: int) -> Chunk:
        """Create a Chunk object."""
        return Chunk(
            text=text,
            source_file=source_file,
            chunk_index=index,
            token_count=self.count_tokens(text)
        )


def load_documents(directory: str) -> Iterator[tuple[str, str]]:
    """
    Load all supported documents from a directory.

    Yields (filename, content) tuples.
    """
    supported_extensions = {'.txt', '.md', '.markdown'}
    directory = Path(directory)

    if not directory.exists():
        raise FileNotFoundError(f"Directory not found: {directory}")

    for file_path in directory.rglob('*'):
        if file_path.suffix.lower() in supported_extensions:
            try:
                content = file_path.read_text(encoding='utf-8')
                # Use relative path as identifier
                relative_path = file_path.relative_to(directory)
                yield str(relative_path), content
            except Exception as e:
                print(f"Warning: Could not read {file_path}: {e}")


def index_documents(directory: str) -> list[Chunk]:
    """
    Load and chunk all documents from a directory.

    This is the main entry point for document indexing.
    """
    chunker = Chunker()
    all_chunks = []

    for filename, content in load_documents(directory):
        chunks = chunker.chunk_text(content, filename)
        all_chunks.extend(chunks)
        print(f" {filename}: {len(chunks)} chunks")

    return all_chunks
```

Let's trace through what happens when we chunk a document:

1. **Load**: Read file content as text
2. **Split on paragraphs**: Natural semantic boundaries
3. **Accumulate**: Keep adding paragraphs until we hit `chunk_size` tokens
4. **Overlap**: Keep the last few paragraphs when starting a new chunk
5. **Handle edge cases**: Very long paragraphs get split on sentences

Why overlap? Consider this text split into two chunks without overlap:

```
Chunk 1: "...The new policy takes effect on January 1st."
Chunk 2: "Employees must complete training before the deadline..."
```

A question about "when employees must complete training" might not match either chunk well. With overlap:

```
Chunk 1: "...The new policy takes effect on January 1st."
Chunk 2: "takes effect on January 1st. Employees must complete training before the deadline..."
```

Now chunk 2 has the context needed to answer the question.

> **Common Mistake #5**: No chunk overlap. This creates "boundary blindness" where important information spanning chunks becomes unfindable.

### Creating Embeddings

Embeddings are numeric representations of text that capture semantic meaning. Similar texts have similar embeddings, enabling semantic search.

Add to `indexer.py`:

```python
import openai
from config import config


def create_embeddings(texts: list[str]) -> list[list[float]]:
    """
    Create embeddings for a list of texts.

    Uses OpenAI's embedding API with batching for efficiency.
    """
    client = openai.OpenAI()

    # OpenAI's embedding API accepts batches
    # but has a token limit per request
    batch_size = 100
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        response = client.embeddings.create(
            model=config.embedding_model,
            input=batch
        )

        # Sort by index to maintain order
        sorted_embeddings = sorted(response.data, key=lambda x: x.index)
        all_embeddings.extend([e.embedding for e in sorted_embeddings])

    return all_embeddings
```

Why batch embeddings?

1. **Efficiency**: One API call for 100 texts vs. 100 API calls
2. **Cost**: Same price but less overhead
3. **Rate limits**: Fewer requests means less chance of hitting limits

> **The "aha" moment**: Document embeddings are created once during indexing and reused for every query.
> Only the question itself is embedded at query time, with the same embedding model, so that it can be compared against the stored chunks. The embedding model just provides a way to measure similarity; it plays no part in writing the answer. This is why RAG can work with any LLM, regardless of which embedding model you use.

> **Complete code**: See [Document Indexer (indexer.py)](../reference/04_document_qa_app.html#document-indexer-indexerpy) for the full implementation with progress tracking, file type handling, and metadata extraction.

### Storing in ChromaDB

ChromaDB is a simple, local vector database perfect for learning.

Create `retriever.py`:

```python
"""Vector search and retrieval."""

import chromadb
from chromadb.config import Settings

from config import config
from indexer import Chunk, create_embeddings


class VectorStore:
    """Simple vector store using ChromaDB."""

    def __init__(self, persist_directory: str = config.index_path):
        """Initialize or load existing vector store."""
        self.client = chromadb.PersistentClient(
            path=persist_directory,
            settings=Settings(anonymized_telemetry=False)
        )
        self.collection = self.client.get_or_create_collection(
            name="documents",
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )

    def add_chunks(self, chunks: list[Chunk]) -> None:
        """Add chunks to the vector store."""
        if not chunks:
            return

        # Extract texts and metadata
        texts = [chunk.text for chunk in chunks]
        ids = [chunk.id for chunk in chunks]
        metadatas = [
            {
                "source_file": chunk.source_file,
                "chunk_index": chunk.chunk_index,
                "token_count": chunk.token_count
            }
            for chunk in chunks
        ]

        # Create embeddings
        print("Generating embeddings...")
        embeddings = create_embeddings(texts)

        # Add to collection
        self.collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=texts,
            metadatas=metadatas
        )

        print(f"Added {len(chunks)} chunks to index")

    def search(self, query: str, top_k: int = config.top_k) -> list[dict]:
        """
        Search for chunks similar to the query.

        Returns list of dicts with 'text', 'metadata', and 'score' keys.
        """
        # Create embedding for query
        query_embedding = create_embeddings([query])[0]

        # Search
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )

        # Format results
        formatted = []
        for i in range(len(results['ids'][0])):
            formatted.append({
                'text': results['documents'][0][i],
                'metadata': results['metadatas'][0][i],
                'score': 1 - results['distances'][0][i]  # Convert distance to similarity
            })

        return formatted

    def count(self) -> int:
        """Return number of chunks in the index."""
        return self.collection.count()

    def clear(self) -> None:
        """Clear all data from the index."""
        self.client.delete_collection("documents")
        self.collection = self.client.get_or_create_collection(
            name="documents",
            metadata={"hnsw:space": "cosine"}
        )
```

Key concepts:

**Cosine similarity**: Measures the angle between two vectors. Range is -1 to 1, where 1 means identical direction (most similar). We configure ChromaDB to use cosine by setting `hnsw:space: cosine`.

**HNSW (Hierarchical Navigable Small World)**: The algorithm ChromaDB uses for fast approximate nearest neighbor search. You don't need to understand the details, just know it trades a tiny bit of accuracy for massive speed improvements.

**Distance vs. similarity**: ChromaDB returns distances (lower = more similar). With the cosine space, the distance is 1 minus the cosine similarity, so `1 - distance` converts it back to a similarity score (higher = more similar) for intuitive interpretation.

> **Complete code**: See [Vector Retriever (retriever.py)](../reference/04_document_qa_app.html#vector-retriever-retrieverpy) for the full implementation with hybrid search, metadata filtering, and batch retrieval.
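
To make the distance-to-similarity conversion concrete, here is a minimal sketch in plain Python. It is not how ChromaDB computes anything internally; the helpers `cosine_similarity` and `cosine_distance` are hypothetical names introduced for illustration, and the tiny 2-D vectors stand in for real embeddings, which have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, assumed here to be 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors standing in for embeddings (illustrative values only).
query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],   # length differs, direction is identical
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

for name, vector in candidates.items():
    distance = cosine_distance(query, vector)
    score = 1 - distance  # the same conversion used in VectorStore.search
    print(f"{name}: distance={distance:.2f}, score={score:.2f}")
```

Note that vector length does not matter, only direction: the "same direction" candidate scores as a perfect match even though it is twice as long as the query.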
### Testing Retrieval

Let's create some test documents and verify retrieval works:

```python
# test_retrieval.py
"""Test the retrieval pipeline."""

from pathlib import Path
from indexer import index_documents
from retriever import VectorStore


def create_sample_docs():
    """Create sample documents for testing."""
    docs_dir = Path("sample_docs")
    docs_dir.mkdir(exist_ok=True)

    # HR Policy document
    (docs_dir / "hr_policy.md").write_text("""
# HR Policies

## Vacation Policy

All full-time employees receive 20 days of paid time off (PTO) per year.
PTO accrues at a rate of 1.67 days per month.
Unused PTO can roll over to the next year, up to a maximum of 5 days.
PTO requests should be submitted at least 2 weeks in advance.

## Remote Work Policy

Employees may work remotely up to 3 days per week with manager approval.
Full remote arrangements require VP approval and must be reviewed quarterly.

## Expense Policy

Business expenses over $50 require receipt documentation.
Expenses over $500 require pre-approval from your manager.
Submit expense reports within 30 days of the expense.
""")

    # Technical documentation
    (docs_dir / "tech_guide.md").write_text("""
# Technical Guidelines

## Code Review Process

All code changes require at least one approval before merging.
Security-sensitive changes require review from the security team.
Reviews should be completed within 2 business days.

## Deployment Process

Deployments to production happen on Tuesdays and Thursdays.
Emergency deployments require on-call approval and a rollback plan.
All deployments must pass automated tests and staging validation.

## On-Call Rotation

Engineers participate in on-call rotation after 3 months on the team.
On-call shifts are one week long, starting Monday at 9 AM.
On-call engineers receive a $500 stipend per week.
""")

    print("Sample documents created in ./sample_docs")
    return docs_dir


def main():
    # Create sample documents
    docs_dir = create_sample_docs()

    # Index documents
    print("\nIndexing documents...")
    chunks = index_documents(str(docs_dir))
    print(f"Created {len(chunks)} chunks")

    # Add to vector store
    store = VectorStore()
    store.clear()  # Start fresh
    store.add_chunks(chunks)

    # Test queries
    test_queries = [
        "How many vacation days do I get?",
        "Can I work from home?",
        "How do I submit an expense report?",
        "What is the deployment schedule?",
        "How much is the on-call stipend?"
    ]

    print("\n" + "="*50)
    print("Testing retrieval:")
    print("="*50)

    for query in test_queries:
        print(f"\nQuery: {query}")
        results = store.search(query, top_k=2)

        for i, result in enumerate(results):
            print(f" Result {i+1} (score: {result['score']:.3f}):")
            print(f" Source: {result['metadata']['source_file']}")
            # Show first 100 chars of text
            preview = result['text'][:100].replace('\n', ' ')
            print(f" Preview: {preview}...")


if __name__ == "__main__":
    main()
```

Run it:

```bash
python test_retrieval.py
```

You should see each query returning relevant chunks. For "How many vacation days do I get?", the top result should be from the vacation policy section, not the deployment process.

> **Common Mistake #6**: Not testing retrieval in isolation. If your Q&A system gives bad answers, is it retrieval (wrong chunks) or generation (ignoring context)? Test each component separately.

---

## Building the Q&A Loop

Now we connect everything into a working Q&A system.

### The Complete Pipeline

![Q&A Pipeline](../assets/diagrams/rendered/ch04_qa_pipeline.svg)

### Constructing Effective Prompts

The prompt is where retrieval meets generation. A good RAG prompt:

1. Clearly provides the context
2. States the task unambiguously
3. Instructs the model to use only the context
4. Requests citations for verifiability

Update `generator.py` with our RAG-specific generator:

```python
"""LLM interaction for generating answers."""

import anthropic
import time
from typing import Optional
from dataclasses import dataclass

from config import config


class GenerationError(Exception):
    """Custom exception for generation failures."""
    pass


@dataclass
class GenerationResult:
    """Result of a generation request with metadata."""
    text: str
    input_tokens: int
    output_tokens: int
    model: str


# The RAG prompt template
RAG_PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided documentation.

## Instructions
- Answer the question using ONLY the information in the provided context
- If the context doesn't contain enough information to answer, say "I don't have enough information to answer that question based on the available documentation"
- Always cite your sources by mentioning which document the information comes from
- Be concise but complete
- If multiple documents contain relevant information, synthesize them

## Context
{context}

## Question
{question}

## Answer"""


def format_context(search_results: list[dict]) -> str:
    """Format search results into context for the prompt."""
    context_parts = []

    for i, result in enumerate(search_results, 1):
        source = result['metadata']['source_file']
        chunk_idx = result['metadata']['chunk_index']
        text = result['text']
        score = result['score']

        context_parts.append(
            f"[Source {i}: {source} (chunk {chunk_idx}, relevance: {score:.2f})]\n{text}"
        )

    return "\n\n---\n\n".join(context_parts)


def generate_answer(
    question: str,
    search_results: list[dict],
    max_retries: int = 3
) -> GenerationResult:
    """
    Generate an answer using retrieved context.

    This is the core RAG generation function.
    """
    client = anthropic.Anthropic()

    # Format context from search results
    context = format_context(search_results)

    # Construct the full prompt
    prompt = RAG_PROMPT_TEMPLATE.format(
        context=context,
        question=question
    )

    delay = 1.0
    last_error: Optional[Exception] = None

    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model=config.llm_model,
                max_tokens=config.max_tokens,
                temperature=config.temperature,
                messages=[{"role": "user", "content": prompt}]
            )
            return GenerationResult(
                text=message.content[0].text,
                input_tokens=message.usage.input_tokens,
                output_tokens=message.usage.output_tokens,
                model=message.model
            )
        except (anthropic.RateLimitError, anthropic.APIConnectionError) as e:
            last_error = e
            time.sleep(delay)
            delay *= 2
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                last_error = e
                time.sleep(delay)
                delay *= 2
            else:
                raise GenerationError(f"API error: {e.message}") from e

    raise GenerationError(f"Failed after {max_retries} attempts: {last_error}")
```

Let's examine the prompt template:

```python
RAG_PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided documentation.

## Instructions
- Answer the question using ONLY the information in the provided context
- If the context doesn't contain enough information...
```

Why this structure?

1. **Role setting** ("You are a helpful assistant"): Establishes behavior expectations
2. **Clear constraint** ("ONLY the information in the provided context"): Reduces hallucination
3. **Graceful failure** ("If the context doesn't contain enough information"): Better than wrong answers
4. **Citation requirement**: Enables verification
5. **Labeled sections**: Makes parsing unambiguous

> **The "aha" moment**: The prompt is where you control the model's behavior. Every instruction you add trades off against other instructions. "Be concise" conflicts with "be complete." "Cite sources" adds length. Finding the right balance requires iteration.
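
To see what the model actually receives, it can help to render a prompt without calling any API. The sketch below is self-contained: `DEMO_TEMPLATE` is a shortened, hypothetical stand-in for `RAG_PROMPT_TEMPLATE`, `format_context` mirrors the layout of the function above, and the two search results are invented sample data shaped like the output of `VectorStore.search`.

```python
# Shortened stand-in for the chapter's RAG_PROMPT_TEMPLATE (assumption: demo only).
DEMO_TEMPLATE = """## Instructions
- Answer the question using ONLY the information in the provided context

## Context
{context}

## Question
{question}

## Answer"""


def format_context(search_results: list[dict]) -> str:
    """Render each retrieved chunk as a labeled block, separated by '---'."""
    parts = []
    for i, result in enumerate(search_results, 1):
        meta = result["metadata"]
        parts.append(
            f"[Source {i}: {meta['source_file']} "
            f"(chunk {meta['chunk_index']}, relevance: {result['score']:.2f})]\n"
            f"{result['text']}"
        )
    return "\n\n---\n\n".join(parts)


# Invented results for illustration; real ones come from the vector store.
sample_results = [
    {
        "text": "All full-time employees receive 20 days of paid time off (PTO) per year.",
        "metadata": {"source_file": "hr_policy.md", "chunk_index": 0},
        "score": 0.82,
    },
    {
        "text": "Deployments to production happen on Tuesdays and Thursdays.",
        "metadata": {"source_file": "tech_guide.md", "chunk_index": 1},
        "score": 0.31,
    },
]

prompt = DEMO_TEMPLATE.format(
    context=format_context(sample_results),
    question="How many vacation days do I get?",
)
print(prompt)
```

Printing the rendered prompt like this is a cheap way to debug generation problems: if the answer is not visible in the context section, no amount of instruction tuning will fix it.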
### The Main Application

Create `qa_assistant.py`:

```python
"""Document Q&A Assistant - Main application."""

import argparse
import sys
from pathlib import Path

from rich.console import Console
from rich.markdown import Markdown
from rich.panel import Panel

from config import config
from indexer import index_documents
from retriever import VectorStore
from generator import generate_answer, GenerationError

console = Console()


def cmd_index(args):
    """Index documents from a directory."""
    directory = Path(args.directory)

    if not directory.exists():
        console.print(f"[red]Error: Directory not found: {directory}[/red]")
        sys.exit(1)

    console.print(f"[bold]Indexing documents from {directory}...[/bold]")

    # Load and chunk documents
    chunks = index_documents(str(directory))

    if not chunks:
        console.print("[yellow]No documents found to index[/yellow]")
        sys.exit(1)

    console.print(f"Created {len(chunks)} chunks from documents")

    # Store in vector database
    store = VectorStore()

    if args.clear:
        console.print("Clearing existing index...")
        store.clear()

    store.add_chunks(chunks)
    console.print(f"[green]Index saved. Total chunks: {store.count()}[/green]")


def cmd_query(args):
    """Query the document index."""
    # Validate configuration
    errors = config.validate()
    if errors:
        console.print("[red]Configuration errors:[/red]")
        for error in errors:
            console.print(f" - {error}")
        sys.exit(1)

    # Load vector store
    store = VectorStore()

    if store.count() == 0:
        console.print("[yellow]Index is empty. Run 'index' command first.[/yellow]")
        sys.exit(1)

    question = args.question
    console.print(f"\n[bold]Question:[/bold] {question}\n")

    # Retrieve relevant chunks
    with console.status("Searching documents..."):
        results = store.search(question, top_k=config.top_k)

    if not results:
        console.print("[yellow]No relevant documents found.[/yellow]")
        sys.exit(1)

    # Show retrieved sources if verbose
    if args.verbose:
        console.print("[dim]Retrieved sources:[/dim]")
        for i, result in enumerate(results, 1):
            source = result['metadata']['source_file']
            score = result['score']
            console.print(f" {i}. {source} (relevance: {score:.2f})")
        console.print()

    # Generate answer
    with console.status("Generating answer..."):
        try:
            result = generate_answer(question, results)
        except GenerationError as e:
            console.print(f"[red]Error generating answer: {e}[/red]")
            sys.exit(1)

    # Display answer
    console.print(Panel(
        Markdown(result.text),
        title="Answer",
        border_style="green"
    ))

    # Show token usage if verbose
    if args.verbose:
        console.print(f"\n[dim]Tokens: {result.input_tokens} input, {result.output_tokens} output[/dim]")
        # Estimate cost (approximate rates as of 2026)
        input_cost = result.input_tokens * 0.003 / 1000  # $3 per 1M input tokens
        output_cost = result.output_tokens * 0.015 / 1000  # $15 per 1M output tokens
        console.print(f"[dim]Estimated cost: ${input_cost + output_cost:.4f}[/dim]")


def cmd_interactive(args):
    """Interactive query mode."""
    errors = config.validate()
    if errors:
        console.print("[red]Configuration errors:[/red]")
        for error in errors:
            console.print(f" - {error}")
        sys.exit(1)

    store = VectorStore()

    if store.count() == 0:
        console.print("[yellow]Index is empty. Run 'index' command first.[/yellow]")
        sys.exit(1)

    console.print("[bold]Document Q&A Assistant[/bold]")
    console.print(f"Index contains {store.count()} chunks")
    console.print("Type 'quit' or 'exit' to stop.\n")

    while True:
        try:
            question = console.input("[bold cyan]Question:[/bold cyan] ").strip()
        except (KeyboardInterrupt, EOFError):
            console.print("\nGoodbye!")
            break

        if not question:
            continue

        if question.lower() in ('quit', 'exit', 'q'):
            console.print("Goodbye!")
            break

        # Retrieve
        results = store.search(question, top_k=config.top_k)

        if not results:
            console.print("[yellow]No relevant documents found.[/yellow]\n")
            continue

        # Generate
        try:
            result = generate_answer(question, results)
        except GenerationError as e:
            console.print(f"[red]Error: {e}[/red]\n")
            continue

        # Display
        console.print()
        console.print(Panel(
            Markdown(result.text),
            title="Answer",
            border_style="green"
        ))
        console.print()


def main():
    parser = argparse.ArgumentParser(
        description="Document Q&A Assistant",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  Index documents:
    python qa_assistant.py index ./docs

  Query the index:
    python qa_assistant.py query "What is the vacation policy?"

  Interactive mode:
    python qa_assistant.py interactive
        """
    )

    subparsers = parser.add_subparsers(dest="command", help="Available commands")

    # Index command
    index_parser = subparsers.add_parser("index", help="Index documents")
    index_parser.add_argument("directory", help="Directory containing documents")
    index_parser.add_argument("--clear", action="store_true",
                              help="Clear existing index before adding")

    # Query command
    query_parser = subparsers.add_parser("query", help="Query the index")
    query_parser.add_argument("question", help="Question to ask")
    query_parser.add_argument("-v", "--verbose", action="store_true",
                              help="Show additional details")

    # Interactive command
    interactive_parser = subparsers.add_parser("interactive",
                                               help="Interactive query mode")

    args = parser.parse_args()

    if args.command == "index":
        cmd_index(args)
    elif args.command == "query":
        cmd_query(args)
    elif args.command == "interactive":
        cmd_interactive(args)
    else:
        parser.print_help()


if __name__ == "__main__":
    main()
```

### Testing the Complete System

```bash
# First, create and index sample documents
python test_retrieval.py

# Now query using the Q&A assistant
python qa_assistant.py query "How many vacation days do employees get?"

# Try with verbose mode
python qa_assistant.py query -v "What is the deployment schedule?"

# Interactive mode
python qa_assistant.py interactive
```

You should see answers grounded in your documents, with citations to the source files.

> **Complete code**: See [Main Application (qa_assistant.py)](../reference/04_document_qa_app.html#main-application-qa_assistantpy) for the full implementation with argument parsing, error handling, and graceful shutdown.

---

## Making It Production-Ready

A working prototype is not a production system. Let's add the scaffolding that turns demos into reliable tools.

### Adding Logging

Logging is how you debug production issues. You can't attach a debugger to production.
Create or update files to include logging:

```python
# Add to config.py
import logging


def setup_logging(level: str = "INFO") -> logging.Logger:
    """Configure logging for the application."""
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    return logging.getLogger("qa_assistant")


logger = setup_logging()
```

```python
# Update generator.py to use logging
from config import logger


def generate_answer(
    question: str,
    search_results: list[dict],
    max_retries: int = 3
) -> GenerationResult:
    logger.info(f"Generating answer for question: {question[:50]}...")
    logger.debug(f"Using {len(search_results)} context chunks")

    # ... existing code ...
    # (`result` below is the GenerationResult built from the API response)

    logger.info(f"Generated answer: {result.input_tokens} in, {result.output_tokens} out")
    return result
```

```python
# Update retriever.py
from config import logger


def search(self, query: str, top_k: int = config.top_k):
    logger.debug(f"Searching for: {query[:50]}...")

    # ... existing code ...

    # Guard against an empty result list before reading the top score
    top_score = f"{formatted[0]['score']:.3f}" if formatted else "n/a"
    logger.info(f"Found {len(formatted)} results, top score: {top_score}")
    return formatted
```

Why set up logging this way?

- **Timestamps**: Know when things happened
- **Levels**: Filter by severity (DEBUG, INFO, WARNING, ERROR)
- **Context**: Include relevant data for debugging

### Handling Edge Cases

Real users do unexpected things. Handle them gracefully:

```python
# Add to qa_assistant.py

def validate_question(question: str) -> tuple[bool, str]:
    """
    Validate user question before processing.

    Returns (is_valid, error_message).
    """
    if not question or not question.strip():
        return False, "Question cannot be empty"

    if len(question) > 1000:
        return False, "Question is too long (max 1000 characters)"

    if len(question.split()) < 2:
        return False, "Please ask a complete question"

    return True, ""


def cmd_query(args):
    # Validate question first
    is_valid, error_msg = validate_question(args.question)
    if not is_valid:
        console.print(f"[red]Invalid question: {error_msg}[/red]")
        sys.exit(1)

    # ... rest of function
```

Common edge cases to handle:

- Empty queries
- Extremely long queries (may exceed context limits)
- Queries with no relevant results
- API timeouts
- Malformed documents during indexing
- File encoding issues

> **Common Mistake #7**: Ignoring edge cases during development. The happy path works great in demos. Production users find every edge case.

### Basic Evaluation

How do you know if your system is working well? Build evaluation into the system:

```python
# evaluation.py
"""Basic evaluation utilities for the Q&A assistant."""

from dataclasses import dataclass
from typing import Optional
import json
from pathlib import Path

from retriever import VectorStore
from generator import generate_answer


@dataclass
class EvalCase:
    """A single evaluation case."""
    question: str
    expected_answer: str
    expected_sources: list[str]


@dataclass
class EvalResult:
    """Result of evaluating a single case."""
    question: str
    generated_answer: str
    expected_answer: str
    retrieved_sources: list[str]
    expected_sources: list[str]
    answer_contains_expected: bool
    correct_source_retrieved: bool


def load_eval_set(path: str) -> list[EvalCase]:
    """Load evaluation cases from a JSON file."""
    with open(path) as f:
        data = json.load(f)

    return [
        EvalCase(
            question=case['question'],
            expected_answer=case['expected_answer'],
            expected_sources=case.get('expected_sources', [])
        )
        for case in data
    ]


def evaluate_case(case: EvalCase, store: VectorStore) -> EvalResult:
    """Evaluate a single case."""
    # Retrieve
    results = store.search(case.question)
    retrieved_sources = [r['metadata']['source_file'] for r in results]

    # Generate
    gen_result = generate_answer(case.question, results)

    # Check if expected answer content is in generated answer
    answer_contains_expected = (
        case.expected_answer.lower() in gen_result.text.lower()
    )

    # Check if expected source was retrieved
    correct_source_retrieved = any(
        expected in retrieved_sources
        for expected in case.expected_sources
    ) if case.expected_sources else True

    return EvalResult(
        question=case.question,
        generated_answer=gen_result.text,
        expected_answer=case.expected_answer,
        retrieved_sources=retrieved_sources,
        expected_sources=case.expected_sources,
        answer_contains_expected=answer_contains_expected,
        correct_source_retrieved=correct_source_retrieved
    )


def run_evaluation(eval_path: str, store: VectorStore) -> dict:
    """Run full evaluation and return metrics."""
    cases = load_eval_set(eval_path)
    results = []

    for case in cases:
        result = evaluate_case(case, store)
        results.append(result)

    # Compute metrics
    total = len(results)
    answer_correct = sum(1 for r in results if r.answer_contains_expected)
    source_correct = sum(1 for r in results if r.correct_source_retrieved)

    return {
        'total_cases': total,
        'answer_accuracy': answer_correct / total if total > 0 else 0,
        'retrieval_accuracy': source_correct / total if total > 0 else 0,
        'results': results
    }
```

Create an evaluation set (`eval_cases.json`):

```json
[
  {
    "question": "How many vacation days do employees get?",
    "expected_answer": "20 days",
    "expected_sources": ["hr_policy.md"]
  },
  {
    "question": "What days can we deploy to production?",
    "expected_answer": "Tuesdays and Thursdays",
    "expected_sources": ["tech_guide.md"]
  },
  {
    "question": "How much is the on-call stipend?",
    "expected_answer": "$500",
    "expected_sources": ["tech_guide.md"]
  }
]
```

> **The "aha" moment**: Evaluation lets you measure improvement. Without metrics, you're guessing. "Does changing chunk size help?" becomes "Did accuracy go from 75% to 82%?"
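
One thing worth knowing about the substring check in `evaluate_case`: it is sensitive to formatting, so a correct answer can be scored as wrong. The sketch below illustrates this with invented answers and no API calls. The helpers `normalize`, `contains_expected`, and `accuracy` are hypothetical names for this example, not part of the chapter's `evaluation.py`, and whitespace normalization is only one possible mitigation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (hypothetical helper)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contains_expected(expected: str, generated: str) -> bool:
    """Substring match after normalizing both sides."""
    return normalize(expected) in normalize(generated)


def accuracy(flags: list[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


# (expected, generated) pairs; the generated answers are made up for the demo.
cases = [
    ("20 days", "Full-time employees receive 20\ndays of PTO per year."),
    ("Tuesdays and Thursdays", "Deployments happen on Tuesdays and Thursdays."),
    ("$500", "The stipend is 500 dollars per week."),
]

raw_flags = [expected.lower() in generated.lower() for expected, generated in cases]
normalized_flags = [contains_expected(expected, generated) for expected, generated in cases]

print(f"Raw substring accuracy:        {accuracy(raw_flags):.2f}")
print(f"Normalized substring accuracy: {accuracy(normalized_flags):.2f}")
```

The first case fails the raw check only because of a line break inside "20 days", and normalization recovers it. The third case still fails because the answer is phrased differently ("500 dollars" vs. "$500"), which is the kind of mismatch that motivates the more robust evaluation methods mentioned later in this chapter.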
### Cost Tracking

LLM APIs charge per token. Track costs to avoid surprises. See **Appendix K (LLM Provider Comparison)** for current pricing across providers.

```python
# Add to config.py
from dataclasses import dataclass  # skip if config.py already imports it


@dataclass
class CostTracker:
    """Track API usage and costs."""

    # Pricing (check Appendix K for current rates)
    CLAUDE_INPUT_PRICE = 3.00 / 1_000_000    # $3 per 1M input tokens
    CLAUDE_OUTPUT_PRICE = 15.00 / 1_000_000  # $15 per 1M output tokens
    EMBEDDING_PRICE = 0.02 / 1_000_000       # $0.02 per 1M tokens

    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_embedding_tokens: int = 0

    def add_generation(self, input_tokens: int, output_tokens: int):
        """Record a generation API call."""
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens

    def add_embedding(self, tokens: int):
        """Record an embedding API call."""
        self.total_embedding_tokens += tokens

    @property
    def total_cost(self) -> float:
        """Calculate total cost in USD."""
        generation_cost = (
            self.total_input_tokens * self.CLAUDE_INPUT_PRICE +
            self.total_output_tokens * self.CLAUDE_OUTPUT_PRICE
        )
        embedding_cost = self.total_embedding_tokens * self.EMBEDDING_PRICE
        return generation_cost + embedding_cost

    def summary(self) -> str:
        """Return a cost summary string."""
        return (
            f"Tokens: {self.total_input_tokens:,} input, "
            f"{self.total_output_tokens:,} output, "
            f"{self.total_embedding_tokens:,} embedding\n"
            f"Estimated cost: ${self.total_cost:.4f}"
        )


# Global cost tracker
cost_tracker = CostTracker()
```

Update `generator.py` to track costs:

```python
from config import cost_tracker


def generate_answer(
    question: str,
    search_results: list[dict],
    max_retries: int = 3
) -> GenerationResult:
    # ... existing code ...
    # (`result` below is the GenerationResult built from the API response)

    # Track costs
    cost_tracker.add_generation(
        result.input_tokens,
        result.output_tokens
    )

    return result
```

> **Common Mistake #8**: Ignoring costs during development. It's easy to rack up hundreds of dollars in API costs during development and testing. Track from day one.
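
A quick back-of-the-envelope calculation shows how per-token prices add up. The sketch below reuses the illustrative rates from `CostTracker`; the token counts and query volume are assumptions picked for the example, not measurements, and `estimate_query_cost` is a hypothetical helper.

```python
# Illustrative rates copied from the CostTracker above (USD per token).
INPUT_PRICE = 3.00 / 1_000_000
OUTPUT_PRICE = 15.00 / 1_000_000
EMBEDDING_PRICE = 0.02 / 1_000_000


def estimate_query_cost(
    input_tokens: int,
    output_tokens: int,
    embedding_tokens: int = 0,
) -> float:
    """Estimated USD cost of one RAG query (hypothetical helper)."""
    return (
        input_tokens * INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
        + embedding_tokens * EMBEDDING_PRICE
    )


# Assumed workload: a prompt with retrieved context plus a short answer.
per_query = estimate_query_cost(input_tokens=1_500, output_tokens=200)
embedding_only = estimate_query_cost(0, 0, embedding_tokens=30)

queries_per_day = 1_000  # assumption for illustration
daily = per_query * queries_per_day
monthly = daily * 30

print(f"Per query:       ${per_query:.4f}")
print(f"Query embedding: ${embedding_only:.8f}")
print(f"Per day:         ${daily:.2f}")
print(f"Per 30 days:     ${monthly:.2f}")
```

Under these assumptions generation dominates the bill and embedding the question is negligible, which suggests that trimming retrieved context is usually the first place to look when costs grow.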
### Complete Production-Ready Code

Here's the final structure with all improvements:

```
document_qa/
├── qa_assistant.py      # CLI entry point
├── indexer.py           # Document processing
├── retriever.py         # Vector search
├── generator.py         # LLM interaction
├── evaluation.py        # Evaluation utilities
├── config.py            # Configuration, logging, cost tracking
├── requirements.txt
├── eval_cases.json      # Evaluation test cases
├── sample_docs/
└── qa_index/
```

Each file has clear responsibilities. The system is observable through logging and cost tracking. Evaluation provides a feedback loop for improvement.

> **Complete Reference**: The full, runnable implementation of all components is available in [Your First LLM Application - Code Reference](../reference/04_document_qa_app.html). This includes all source files, evaluation utilities, sample documents, and test scripts discussed in this chapter.

---

## Next Steps

### What We Skipped

This tutorial intentionally omitted several important topics that production systems need:

**Advanced chunking**: We used simple paragraph-based chunking. Production systems often use recursive chunking, semantic chunking, or even LLM-based chunking for better results. Chapter 7 covers these in depth.

**Reranking**: We used raw embedding similarity. Adding a reranking step (where a more powerful model re-scores candidates) significantly improves precision. This is covered in Chapter 7.

**Prompt optimization**: Our prompt is functional but not optimized. Production prompts benefit from systematic testing and iteration. Chapter 6 covers prompt engineering comprehensively.

**Hybrid search**: We used pure vector search. Combining vector and keyword search (BM25) often works better, especially for exact matches. Covered in Chapter 7.

**Caching**: Every query creates new embeddings and makes API calls. Caching embeddings and common queries reduces cost and latency. Covered in Chapter 9.

**Authentication and security**: Our CLI has no access controls.
Production systems need authentication, rate limiting, and protection against prompt injection. Covered in Chapter 12.

**Deployment**: We run locally. Production needs containers, load balancing, monitoring, and CI/CD. Covered in Chapter 9.

### How This Connects to Later Chapters

What you built here is the foundation for more sophisticated systems:

**Chapter 7 (RAG Systems)**: Goes deep on everything we touched lightly: advanced chunking, embedding models, hybrid search, reranking, GraphRAG. You now have context for why these matter.

**Chapter 8 (Agentic Systems)**: Agents extend this pattern. Instead of just answering questions, agents can take actions, use tools, and reason over multiple steps. The Q&A assistant becomes one tool in a larger system.

**Chapter 9 (Deployment)**: Taking this CLI to production. How do you handle thousands of concurrent users? How do you update the index without downtime? How do you monitor quality?

**Chapter 15 (MLOps & Evaluation)**: We built basic evaluation. Production evaluation needs systematic test sets, regression testing, and metrics pipelines. LLM-as-judge evaluation becomes important.

**Chapter 16 (Security)**: Users will try to trick your system. Prompt injection attacks try to override your instructions. Production systems need defenses.

### Ideas for Extending the Project

Now that you have a working system, try these extensions:

1. **Add PDF support**: Use a library like `pypdf` or `pdfplumber` to extract text from PDFs. Handle the messiness of PDF text extraction.

2. **Improve chunk quality**: Implement semantic chunking that splits on topic changes rather than character counts. Compare retrieval quality.

3. **Add conversation history**: Make the assistant remember previous questions in the session. How does this change the prompt?

4. **Build a web interface**: Use FastAPI or Flask to create an HTTP API, then add a simple web frontend.

5. **Implement caching**: Add an LRU cache for embeddings and common queries.
   Measure the cost savings.

6. **Add source highlighting**: When displaying answers, highlight the specific passages that were used. This helps users verify accuracy.

7. **Compare embedding models**: Try different embedding models (OpenAI `text-embedding-ada-002` vs. `text-embedding-3-small` vs. open-source alternatives). Measure retrieval quality differences.

8. **Add query expansion**: Before searching, generate query variations. Does retrieving across multiple queries improve results?

---

## Summary

You've built a complete Document Q&A Assistant from scratch. Along the way, you learned:

**Setting Up**: API keys go in environment variables, never in code. Virtual environments isolate dependencies. Configuration lives in one place.

**API Interaction**: LLM APIs are request-response with structured inputs and outputs. Streaming improves perceived latency. Retry with exponential backoff handles transient errors.

**RAG Pipeline**: Documents become chunks. Chunks become embeddings. Embeddings enable semantic search. Retrieved chunks become prompt context. The model generates grounded answers.

**Production Readiness**: Logging enables debugging. Edge case handling prevents crashes. Evaluation measures quality. Cost tracking prevents surprises.

### Key Takeaways

1. **RAG is retrieve-then-generate**: Find relevant context, include it in the prompt, generate grounded responses. This pattern underlies most practical LLM applications.

2. **Chunking is a tradeoff**: Smaller chunks give better precision but less context. Larger chunks give more context but noisier search. Overlap prevents boundary blindness.

3. **The prompt is your lever**: You control model behavior through instructions. Clear structure, explicit constraints, and graceful failure handling matter.

4. **Test components independently**: When answers are wrong, is it retrieval or generation? Isolated testing reveals the failure point.

5. **Build in observability**: Logging, cost tracking, and evaluation aren't optional.
   You can't improve what you can't measure.

### The Architecture at a Glance

![Document Q&A Architecture](../assets/diagrams/rendered/ch04_architecture_overview.svg)

---

## Practical Exercises

1. **Chunk size experiment**: Modify `chunk_size` to 200, 500, and 1000 tokens. Run the evaluation suite after each change. Which size gives the best results for your documents? Why?

2. **Prompt engineering**: The current RAG prompt is basic. Try adding:
   - Instructions to format answers as bullet points
   - A constraint to keep answers under 100 words
   - Instructions to express uncertainty when appropriate

   Measure how these changes affect answer quality.

3. **Failure analysis**: Find 5 questions where the system gives wrong answers. For each, determine if it's a retrieval failure (wrong chunks) or generation failure (right chunks, wrong answer). What patterns emerge?

4. **Cost optimization**: Track costs for 50 queries. Identify the most expensive queries. What makes them expensive? How could you reduce costs?

5. **Hybrid search**: Implement keyword search alongside vector search. Combine results using reciprocal rank fusion. Compare retrieval quality to vector-only search.

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [IC1]** Why do we chunk documents before embedding them instead of embedding entire documents?

<details>
<summary>Answer</summary>

Two reasons: (1) Embedding models have token limits (typically 512-8192 tokens). Long documents exceed these limits. (2) Retrieval precision: if you embed a 50-page document as one vector, you retrieve the whole document even if only one paragraph is relevant. Smaller chunks enable precise retrieval of the specific content that answers the query.

</details>

**Q2. [IC1]** The system returns 5 chunks but the LLM only uses information from 2 of them. Is this a problem? How would you know?

<details>
<summary>Answer</summary>

It might be fine: the LLM correctly identifies the most relevant content.
To evaluate: (1) Check if the answer is correct and complete. (2) Examine if the unused chunks were truly irrelevant (retrieval issue) or contained useful information the LLM missed (generation issue). (3) If unused chunks are consistently irrelevant, reduce `top_k` to save tokens and cost. If they contain useful information, the RAG prompt may need improvement.

</details>

**Q3. [IC2]** A user asks "What's the vacation policy?" and your system returns chunks about sick leave instead. Where in the pipeline did this fail and how would you debug it?

<details>
<summary>Answer</summary>

Debug systematically: (1) Check the query embedding: is "vacation" semantically similar to "sick leave" in your embedding space? Test with `embedding("vacation") · embedding("sick leave")`. (2) Check if a vacation policy exists in your documents. If not, retrieval is working correctly (returning the closest match). (3) If vacation docs exist, check their embeddings. They may have been chunked poorly, losing the word "vacation" that appeared only in headers. (4) Solution options: improve chunking to preserve context, use hybrid search to match keywords, or add metadata filtering.

</details>

**Q4. [IC2]** Your RAG system costs $0.50 per query on average. The product manager wants to reduce this to $0.10. What are your options?

<details>
<summary>Answer</summary>

Cost reduction strategies: (1) Use a smaller, cheaper LLM for simple queries and only escalate to the expensive model when needed. (2) Cache common queries and their answers. (3) Reduce context size: use fewer chunks, shorter chunks, or summarize retrieved content. (4) Use a cheaper embedding model (or cache embeddings). (5) Implement semantic caching: if a similar question was asked before, return the cached answer. (6) Batch queries where possible. Trade-off: each reduction may impact quality, so measure carefully.

</details>

**Q5. [Senior]** The system works well in testing but users complain it "doesn't know" recent company updates.
What's happening and how do you fix it?

<details>
<summary>Answer</summary>

The index is stale: documents were indexed once but never updated. Solutions: (1) Implement incremental indexing that detects new/modified files. (2) Set up a scheduled job to re-index changed documents. (3) Add a "last indexed" timestamp to metadata so you can track freshness. (4) Consider a document change detection system (file watchers, webhooks from the document system). (5) For critical updates, provide a manual "refresh index" command. Production RAG systems need operational processes for keeping the index current.

</details>

### Code Challenges

**Challenge 1. [IC1]** The chunker splits on double newlines. Some documents use single newlines between paragraphs. Fix the chunker to handle both.

<details>
<summary>Solution</summary>

```python
import re

def chunk_text(self, text: str, source_file: str) -> list[Chunk]:
    text = text.strip()
    if not text:
        return []

    # Normalize line endings, then collapse any run of blank
    # (or whitespace-only) lines to exactly one paragraph break.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"\n\s*\n", "\n\n", text)

    # Heuristic: if the document contains no blank lines at all,
    # treat every single newline as a paragraph break. Otherwise
    # split on blank lines and keep single newlines (lists, etc.).
    separator = "\n\n" if "\n\n" in text else "\n"
    paragraphs = [p.strip() for p in text.split(separator) if p.strip()]

    # Rest of chunking logic...
```

Alternative: Use a more sophisticated paragraph detection that considers line length and indentation patterns.

</details>

**Challenge 2. [IC2]** Add a `--verbose` flag to the CLI that shows which chunks were retrieved and their similarity scores before the answer.

<details>
<summary>Solution</summary>

```python
# In qa_assistant.py (the CLI entry point), modify the ask command:
@app.command()
def ask(
    question: str,
    verbose: bool = typer.Option(False, "--verbose", "-v", help="Show retrieved chunks"),
):
    """Ask a question about the indexed documents."""
    # ... existing setup code ...

    # Get results with scores
    results = pipeline.query(question)

    if verbose:
        console.print("\n[bold]Retrieved Chunks:[/bold]")
        for i, chunk in enumerate(results.chunks, 1):
            console.print(f"\n[cyan]Chunk {i}[/cyan] (score: {chunk.score:.3f})")
            console.print(f"Source: {chunk.source_file}")
            # Show at most a 200-character preview of each chunk
            preview = chunk.text[:200] + "..." if len(chunk.text) > 200 else chunk.text
            console.print(preview)
        console.print("\n" + "="*50 + "\n")

    # Print answer
    console.print(Markdown(results.answer))
```

</details>

---

## Connections to Other Chapters

This hands-on tutorial introduces concepts explored in depth throughout the book:

- **Chapter 6 (Prompt Engineering)**: The prompts we built here are just the beginning. Chapter 6 covers advanced patterns like chain-of-thought, few-shot learning, and structured outputs.
- **Chapter 7 (RAG Systems)**: Takes the basic retrieval we implemented and expands to production patterns including hybrid search, reranking, and GraphRAG.
- **Chapter 14 (Backend Engineering for AI)**: Production patterns for testing, debugging, and deploying LLM applications.
- **Chapter 15 (MLOps & Evaluation)**: How to systematically evaluate the quality of responses from systems like this one.
- **Capstone Projects** (Appendix E): Extends this tutorial into full production projects with Docker, monitoring, and comprehensive testing.

---

## Further Reading

### Essential

- **Anthropic Claude API docs** (docs.anthropic.com) - Primary API reference.
- **Lewis et al. (2020), "Retrieval-Augmented Generation"** - The paper that introduced RAG.
- **ChromaDB documentation** (docs.trychroma.com) - Vector store reference.

### Deep Dives

- **Gao et al. (2023), "RAG for LLMs: A Survey"** - Comprehensive overview of RAG techniques.
- **Reimers & Gurevych (2019), "Sentence-BERT"** - Foundation for dense retrieval.

---

## Connections to Other Chapters

This tutorial chapter serves as a practical bridge between foundational knowledge and advanced techniques:

**To Chapter 5 (LLM/NLP Foundations)**: We used embeddings without explaining how they work. Chapter 5 covers the mathematics of how text becomes vectors and why similar meanings cluster in embedding space.

**To Chapter 6 (Prompt Engineering)**: Our RAG prompt was functional but basic. Chapter 6 teaches systematic prompt design, few-shot learning, and chain-of-thought techniques that can significantly improve answer quality.

**To Chapter 7 (RAG Systems)**: This chapter used simple chunking and vector search. Chapter 7 goes deep: semantic chunking, hybrid search, reranking, GraphRAG, and advanced retrieval patterns.

**To Chapter 8 (Agentic Systems)**: The Q&A assistant is reactive (it answers questions). Agents are proactive (they take actions). Chapter 8 shows how to build systems that use tools, reason over multiple steps, and accomplish complex goals.

**To Chapter 9 (Deployment)**: We ran locally. Chapter 9 covers serving LLM applications at scale: batching, caching, load balancing, and production infrastructure.

**To Chapter 15 (MLOps & Evaluation)**: Our evaluation was minimal. Chapter 15 covers systematic evaluation: test set design, metrics, LLM-as-judge patterns, and continuous quality monitoring.

---

## Part I Checkpoint: Foundations Complete

Congratulations on completing Part I!
Before moving to Part II, verify you can do the following:

### Skills Checklist

- [ ] **Explain the AI engineering landscape** (Ch01): Describe the difference between ML engineering and AI engineering, and identify where you fit in the career ladder
- [ ] **Write production Python code** (Ch02): Use async/await, Pydantic models, and proper error handling in LLM applications
- [ ] **Understand ML fundamentals** (Ch03): Explain embeddings, loss functions, and why neural networks work
- [ ] **Build a working RAG application** (Ch04): Create a document Q&A system with chunking, embeddings, and retrieval

### Quick Self-Test (10 minutes)

**Q1.** What's the difference between embeddings and one-hot encodings? Why do embeddings enable semantic search?

**Q2.** Your async LLM call is blocking the event loop. What's wrong and how do you fix it?

**Q3.** You built a document Q&A system but it's returning irrelevant chunks. List three things you would check.

**Q4.** Why does chunk overlap matter in RAG systems?

::: {.callout-tip collapse="true"}
## Check Your Answers

**A1.** One-hot encodings are sparse vectors in which every pair of distinct words is orthogonal (dot product = 0), so no similarity between words is encoded. Embeddings are dense vectors where similar meanings are geometrically close, enabling similarity search via cosine similarity.

**A2.** You're likely using a synchronous API call inside an async function. Use `await` with an async client, or run the sync code in a worker thread: `await asyncio.to_thread(sync_function)`.

**A3.** (1) Check chunk size: too small loses context, too large dilutes relevance. (2) Verify the embedding model matches your domain. (3) Check if the query and documents use different vocabulary (consider hybrid search).

**A4.** Without overlap, relevant sentences at chunk boundaries may be split so that neither chunk matches the query well. Overlap ensures boundary content appears in at least one complete chunk.
:::

### Ready for Part II?
If you can confidently check all boxes above, you're ready for Part II: Core LLM Development, where you'll learn the theory behind transformers, master prompt engineering, build production RAG systems, and create autonomous agents.