Chapter 4: Your First LLM Application
hands-on, RAG tutorial, document QA, API integration, OpenAI, vector database, project structure
Introduction
In the fall of 2023, a software engineer at a mid-sized fintech company was tasked with building an internal documentation assistant. The engineering wiki had grown to thousands of pages over five years, and new hires were spending weeks hunting for information that existed but was impossible to find. “Can’t we just ask the AI?” became a common refrain.
The engineer had experience with APIs and databases, but had never built anything with LLMs. What followed was a journey familiar to many: the initial excitement of a working prototype, the frustration of hallucinated answers, the iterative refinement that turned a demo into a tool people actually trusted. Three months later, the documentation assistant was handling hundreds of queries daily, and the engineer had learned lessons that no tutorial could have taught.
This chapter is that tutorial. We’ll build a complete Document Q&A Assistant from scratch: a command-line tool that answers questions about a set of documents using retrieval-augmented generation (RAG). Along the way, you’ll encounter the same surprises, mistakes, and “aha moments” that every AI engineer experiences when building their first real application.
The goal isn’t just working code. It’s understanding why each piece exists, what happens when things go wrong, and how the choices we make here connect to the production systems you’ll build later. By the end, you’ll have both a working application and the intuition to extend it.
Why Build This?
You might wonder: why build a Q&A assistant when dozens exist? Three reasons:
Understanding through construction: Reading about RAG is different from debugging why your retrieval returns irrelevant chunks at 2 AM. Building forces you to internalize concepts in a way that reading cannot.
The foundation for everything else: Every production LLM application shares DNA with what we’ll build: taking user input, retrieving relevant context, constructing prompts, generating responses, handling errors. Master this pattern, and you can build chatbots, coding assistants, research tools, and more.
Demystification: LLM applications can seem magical or impossibly complex. Building one from scratch reveals that they’re just software: APIs, databases, text processing, error handling. The magic is in the model, not in the application code.
What We’re Building
Our Document Q&A Assistant will:
- Load a folder of documents (markdown, text, or PDF)
- Chunk documents into searchable pieces
- Create vector embeddings for semantic search
- Store embeddings in a local vector database
- Accept questions from the command line
- Retrieve relevant document chunks
- Generate answers grounded in the retrieved context
- Display answers with source citations
Here’s what using it looks like:
$ python qa_assistant.py index ./company_docs
Indexing 47 documents...
Created 312 chunks
Generated embeddings
Index saved to ./qa_index
$ python qa_assistant.py query "What is our vacation policy?"
Based on the company documentation:
Our vacation policy provides 20 days of paid time off per year for full-time
employees. PTO accrues monthly at 1.67 days per month. Unused days roll over
up to a maximum of 5 days. Requests should be submitted at least 2 weeks in
advance through the HR portal.
Sources:
- hr_policies/time_off.md (chunk 3)
- employee_handbook/benefits.md (chunk 7)
What You’ll Learn
- How to structure an LLM application project
- Making API calls to Anthropic and OpenAI
- Processing documents into chunks for retrieval
- Creating and using embeddings
- Building a simple RAG pipeline
- Error handling patterns for LLM applications
- Basic evaluation and cost tracking
- Common mistakes and how to avoid them
Prerequisites
- Python programming (functions, classes, file I/O)
- Command-line familiarity
- Basic understanding of APIs and HTTP
- A text editor or IDE
You don’t need prior experience with LLMs, embeddings, or vector databases. We’ll build that understanding as we go.
Setting Up
Getting API Keys
Before writing any code, you need access to an LLM API. We’ll use two providers:
Anthropic (Claude) for text generation. Claude excels at following instructions and grounding responses in provided context, making it ideal for RAG applications.
OpenAI for embeddings. OpenAI’s text-embedding-3-small model provides excellent quality at low cost for semantic search.
Why two providers? Each has strengths. This mirrors production systems that often use different models for different tasks. You could use a single provider, but learning to work with multiple APIs is valuable.
Getting an Anthropic API Key:
- Go to https://console.anthropic.com/
- Create an account or sign in
- Navigate to API Keys
- Create a new key, give it a descriptive name
- Copy the key immediately (you won’t see it again)
Getting an OpenAI API Key:
- Go to https://platform.openai.com/
- Create an account or sign in
- Navigate to API Keys in the sidebar
- Create a new secret key
- Copy the key immediately
Storing Keys Securely:
Never put API keys in your code. Use environment variables:
# Add to your ~/.bashrc or ~/.zshrc
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
# Then reload your shell or run:
source ~/.bashrc
On Windows, use the Environment Variables system settings or:
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:OPENAI_API_KEY = "sk-..."
Common Mistake #1: Committing API keys to git. Add .env to your .gitignore immediately. Better yet, never create a .env file. Environment variables are safer.
Project Structure
Create a clean project structure:
This separation isn’t arbitrary. Each file has a single responsibility:
- indexer.py: Getting documents into a searchable form
- retriever.py: Finding relevant documents for a query
- generator.py: Turning retrieved context into an answer
- config.py: Centralizing settings makes changes easy
Why structure matters: When something breaks (and it will), you want to know exactly where to look. If retrieval returns bad results, debug retriever.py. If answers ignore the context, debug generator.py. Tight coupling makes debugging a nightmare.
Dependencies and Virtual Environment
Create a virtual environment to isolate dependencies:
# Create project directory
mkdir document_qa
cd document_qa
# Create virtual environment
python -m venv venv
# Activate it
source venv/bin/activate # On Windows: venv\Scripts\activate
# Verify you're in the virtual environment
which python # Should show .../document_qa/venv/bin/python
Create requirements.txt:
anthropic>=0.18.0
openai>=1.12.0
chromadb>=0.4.22
tiktoken>=0.6.0
rich>=13.7.0
Why these packages?
- anthropic: Official Anthropic Python SDK for Claude
- openai: Official OpenAI Python SDK for embeddings
- chromadb: Simple, local vector database (perfect for learning)
- tiktoken: OpenAI’s tokenizer (for counting tokens accurately)
- rich: Beautiful terminal output (optional but nice)
Install dependencies:
pip install -r requirements.txt
Common Mistake #2: Using global pip instead of virtual environment pip. Always activate your virtual environment first. If which pip doesn’t show your venv path, something is wrong.
Configuration File
Create config.py to centralize settings:
"""Configuration for the Document Q&A Assistant."""
import os
from dataclasses import dataclass


@dataclass
class Config:
    """Application configuration with sensible defaults."""
    # API Keys (from environment)
    anthropic_api_key: str = os.getenv("ANTHROPIC_API_KEY", "")
    openai_api_key: str = os.getenv("OPENAI_API_KEY", "")
    # Model settings
    llm_model: str = "claude-sonnet-4-6"
    embedding_model: str = "text-embedding-3-small"
    embedding_dimensions: int = 1536
    # Chunking settings
    chunk_size: int = 500  # tokens
    chunk_overlap: int = 50  # tokens
    # Retrieval settings
    top_k: int = 5  # number of chunks to retrieve
    # Generation settings
    max_tokens: int = 1024
    temperature: float = 0.0  # deterministic for Q&A
    # Paths
    index_path: str = "./qa_index"

    def validate(self) -> list[str]:
        """Check configuration and return list of errors."""
        errors = []
        if not self.anthropic_api_key:
            errors.append("ANTHROPIC_API_KEY environment variable not set")
        if not self.openai_api_key:
            errors.append("OPENAI_API_KEY environment variable not set")
        return errors


# Global config instance
config = Config()

Why a dataclass? It provides:
- Type hints for IDE autocompletion
- Default values in one place
- Easy validation
- Optional immutability: adding frozen=True to the decorator prevents accidental changes to settings
The “aha” moment: Notice how configuration is separate from logic. When you need to change chunk size from 500 to 400 tokens, you edit one number in one place. When you deploy to production with different settings, you create a new Config instance. This seems trivial now but prevents painful debugging later.
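One way to create a Config instance with different settings is `dataclasses.replace` from the standard library. The sketch below uses a trimmed stand-in for the chapter's Config class (three fields only), and the "production" values are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass, replace


@dataclass
class Config:
    """Trimmed stand-in for the chapter's Config (illustration only)."""
    chunk_size: int = 500     # tokens
    chunk_overlap: int = 50   # tokens
    top_k: int = 5            # number of chunks to retrieve


# Defaults used during development
dev_config = Config()

# A variant with different settings; dev_config itself is not modified.
# These values are hypothetical.
prod_config = replace(dev_config, chunk_size=400, top_k=8)

print(dev_config)
print(prod_config)
```

Fields that are not overridden (here, chunk_overlap) carry over from the original instance, so each variant only states what differs.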
Verify Setup
Create a quick test script to verify everything works:
# test_setup.py
"""Verify the development environment is correctly configured."""
from config import config


def main():
    errors = config.validate()
    if errors:
        print("Configuration errors:")
        for error in errors:
            print(f" - {error}")
        return False

    # Test imports
    try:
        import anthropic
        import openai
        import chromadb
        import tiktoken
        print("All packages imported successfully")
    except ImportError as e:
        print(f"Import error: {e}")
        return False

    # Test Anthropic connection
    try:
        client = anthropic.Anthropic()
        # Make a minimal API call
        response = client.messages.create(
            model=config.llm_model,
            max_tokens=10,
            messages=[{"role": "user", "content": "Say 'hello' and nothing else."}]
        )
        print(f"Anthropic API working: {response.content[0].text}")
    except Exception as e:
        print(f"Anthropic API error: {e}")
        return False

    # Test OpenAI connection
    try:
        client = openai.OpenAI()
        response = client.embeddings.create(
            model=config.embedding_model,
            input="test"
        )
        print(f"OpenAI API working: got embedding with {len(response.data[0].embedding)} dimensions")
    except Exception as e:
        print(f"OpenAI API error: {e}")
        return False

    print("\nSetup complete! Ready to build.")
    return True


if __name__ == "__main__":
    main()

Run it:
python test_setup.py
If you see “Setup complete!”, you’re ready. If not, the error messages will guide you.
Making Your First API Call
Understanding the Anthropic Messages API
Before we build anything complex, let’s understand the API we’re working with. Create generator.py:
"""LLM interaction for generating answers."""
import anthropic
from config import config


def generate_simple(prompt: str) -> str:
    """
    Make a simple completion request to Claude.

    This is the most basic form of LLM interaction:
    send a prompt, get a response.
    """
    client = anthropic.Anthropic()
    message = client.messages.create(
        model=config.llm_model,
        max_tokens=config.max_tokens,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return message.content[0].text

Let’s test it interactively:
# In Python REPL or a test script
from generator import generate_simple
response = generate_simple("What is the capital of France?")
print(response)
# Output: The capital of France is Paris.

Simple. But what’s actually happening?
Anatomy of an API Request
Let’s examine each parameter:
message = client.messages.create(
    model="claude-sonnet-4-6",  # Which Claude model to use
    max_tokens=1024,  # Maximum response length
    messages=[  # The conversation history
        {"role": "user", "content": prompt}
    ]
)

model: Different models have different capabilities, speeds, and costs:
- claude-opus-4-8: Most capable, best for complex reasoning
- claude-sonnet-4-6: Great balance of capability and speed
- claude-haiku-4-5-20251001: Fastest, cheapest, good for simple tasks
For our Q&A assistant, Sonnet provides the best tradeoff: capable enough to synthesize information from multiple sources, fast enough for interactive use.
max_tokens: The hard limit on response length. Set this based on expected answer length plus buffer. Too low and answers get cut off. Too high and you pay for unused capacity.
Why we set temperature to 0: Temperature controls randomness in the output. At temperature 0 the model always picks the most likely token; at temperature 1 it introduces variety. For creative writing, higher temperature helps. For factual Q&A, we want consistent answers, so config.py sets temperature to 0.0 and generate_answer (later in this chapter) passes it to the API. Even at temperature 0, outputs are highly repeatable rather than guaranteed to be identical across calls.
messages: The conversation formatted as a list of turns. Each message has:
- role: Either “user” (human) or “assistant” (Claude)
- content: The text of that turn
For multi-turn conversations, you include the full history:
messages = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "What are its main uses?"}
]

Understanding the Response
The API returns a Message object. Let’s examine it:
from anthropic import Anthropic
from config import config

client = Anthropic()
message = client.messages.create(
    model=config.llm_model,
    max_tokens=100,
    messages=[{"role": "user", "content": "Say hello"}]
)
print(f"ID: {message.id}")
print(f"Model: {message.model}")
print(f"Role: {message.role}")
print(f"Content: {message.content}")
print(f"Stop reason: {message.stop_reason}")
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")

Output:
ID: msg_01XFDUDYJgAACzvnptvVoYEL
Model: claude-sonnet-4-6
Role: assistant
Content: [TextBlock(text='Hello! How can I help you today?', type='text')]
Stop reason: end_turn
Input tokens: 10
Output tokens: 11
Key fields:
- content: A list of content blocks. Usually one TextBlock, but can include multiple for tool use.
- stop_reason: Why generation stopped. “end_turn” means natural completion. “max_tokens” means hit the limit.
- usage: Token counts for billing and monitoring.
The “aha” moment: Notice that usage.input_tokens and usage.output_tokens are separate. API pricing charges differently for input vs. output tokens. In RAG applications, you send lots of context (input tokens) but generate shorter answers (output tokens). Understanding this helps with cost optimization.
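To make the input/output split concrete, here is a minimal cost-estimation sketch. The function name `estimate_cost` and both per-million-token prices are assumptions made up for illustration; they are not real pricing, so substitute the current rates from your provider's pricing page.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Return the estimated cost in dollars for one request."""
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices in dollars per million tokens (NOT real pricing).
EXAMPLE_INPUT_PRICE = 3.0
EXAMPLE_OUTPUT_PRICE = 15.0

# A typical RAG request: large context in, short answer out.
cost = estimate_cost(2_500, 200, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
print(f"Estimated cost per query: ${cost:.4f}")
```

In practice you would pass message.usage.input_tokens and message.usage.output_tokens from each response and sum the results over a session.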
Streaming Responses
For a responsive user experience, stream responses instead of waiting for completion:
def generate_streaming(prompt: str) -> str:
    """
    Stream a response from Claude, printing tokens as they arrive.

    Returns the complete response when done.
    """
    client = anthropic.Anthropic()
    full_response = ""
    with client.messages.stream(
        model=config.llm_model,
        max_tokens=config.max_tokens,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_response += text
    print()  # Newline at the end
    return full_response

Why streaming matters:
1. Perceived latency: Users see progress immediately instead of waiting seconds for a complete response
2. Time to first token: Critical metric for UX. Streaming shows content in ~200ms vs. waiting 2-5 seconds for full response
3. Cancellation: User can interrupt if the response is going in the wrong direction
Common Mistake #3: Not using streaming in user-facing applications. Waiting for full responses creates a jarring experience. Users prefer watching text appear to staring at a spinner.
Error Handling
LLM APIs fail in predictable ways. Build resilience from the start:
import anthropic
import time
from typing import Optional
from config import config


class GenerationError(Exception):
    """Custom exception for generation failures."""
    pass


def generate_with_retry(
    prompt: str,
    max_retries: int = 3,
    initial_delay: float = 1.0
) -> str:
    """
    Generate a response with exponential backoff retry.

    Handles transient API errors gracefully.
    """
    client = anthropic.Anthropic()
    delay = initial_delay
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model=config.llm_model,
                max_tokens=config.max_tokens,
                messages=[{"role": "user", "content": prompt}]
            )
            return message.content[0].text
        except anthropic.RateLimitError as e:
            # Rate limited: wait and retry
            last_error = e
            print(f"Rate limited. Waiting {delay}s before retry {attempt + 1}/{max_retries}")
            time.sleep(delay)
            delay *= 2  # Exponential backoff
        except anthropic.APIStatusError as e:
            # API error (5xx): might be transient
            if e.status_code >= 500:
                last_error = e
                print(f"API error {e.status_code}. Waiting {delay}s before retry")
                time.sleep(delay)
                delay *= 2
            else:
                # Client error (4xx): don't retry
                raise GenerationError(f"API error: {e.message}") from e
        except anthropic.APIConnectionError as e:
            # Network error: retry
            last_error = e
            print(f"Connection error. Waiting {delay}s before retry")
            time.sleep(delay)
            delay *= 2
    raise GenerationError(f"Failed after {max_retries} attempts: {last_error}")

Why exponential backoff? When services are overloaded, everyone retrying immediately makes things worse. Exponential backoff (1s, 2s, 4s, …) spreads out retry attempts, giving the service time to recover.
Common Mistake #4: Retrying on all errors. Some errors (invalid API key, malformed request) will never succeed no matter how many times you retry. Only retry transient errors.
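The backoff schedule can be sketched on its own, separate from any API call. The helper name `backoff_delays` is hypothetical, and the `jitter` parameter is an addition not used in the chapter's code: a small random offset is a common way to keep many clients from retrying at the same instant.

```python
import random


def backoff_delays(
    initial_delay: float = 1.0,
    max_retries: int = 3,
    factor: float = 2.0,
    jitter: float = 0.0,
) -> list[float]:
    """Return the wait time (in seconds) before each retry.

    jitter is the maximum random fraction added to each delay:
    0.0 disables it, 0.1 lengthens each delay by up to 10%.
    """
    delays = []
    delay = initial_delay
    for _ in range(max_retries):
        delays.append(delay * (1 + random.uniform(0, jitter)))
        delay *= factor  # exponential growth between attempts
    return delays


print(backoff_delays())            # no jitter: [1.0, 2.0, 4.0]
print(backoff_delays(jitter=0.1))  # each delay up to 10% longer
```

Computing the schedule separately makes the retry policy easy to test without sleeping or calling an API.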
Putting It Together
Update generator.py with a complete implementation:
"""LLM interaction for generating answers."""
import anthropic
import time
from typing import Optional
from dataclasses import dataclass
from config import config


class GenerationError(Exception):
    """Custom exception for generation failures."""
    pass


@dataclass
class GenerationResult:
    """Result of a generation request with metadata."""
    text: str
    input_tokens: int
    output_tokens: int
    model: str


def generate_answer(
    question: str,
    context: str,
    max_retries: int = 3
) -> GenerationResult:
    """
    Generate an answer to a question given context.

    This is the core RAG generation function: it takes retrieved
    context and synthesizes an answer.
    """
    client = anthropic.Anthropic()
    # Construct the prompt with clear instructions
    prompt = f"""Based on the following context, answer the question.
If the context doesn't contain enough information to answer, say so.
Always cite which parts of the context support your answer.

Context:
{context}

Question: {question}

Answer:"""
    delay = 1.0
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model=config.llm_model,
                max_tokens=config.max_tokens,
                temperature=config.temperature,
                messages=[{"role": "user", "content": prompt}]
            )
            return GenerationResult(
                text=message.content[0].text,
                input_tokens=message.usage.input_tokens,
                output_tokens=message.usage.output_tokens,
                model=message.model
            )
        except (anthropic.RateLimitError, anthropic.APIConnectionError) as e:
            last_error = e
            time.sleep(delay)
            delay *= 2
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                last_error = e
                time.sleep(delay)
                delay *= 2
            else:
                raise GenerationError(f"API error: {e.message}") from e
    raise GenerationError(f"Failed after {max_retries} attempts: {last_error}")

Notice how generate_answer is different from generate_simple:
- It takes both a question and context (RAG pattern)
- It constructs a prompt with clear instructions
- It returns structured metadata for monitoring
- It handles errors gracefully
This is the pattern we’ll use throughout: functions that do one thing well, with clear inputs, outputs, and error handling.
Complete code: See Answer Generator (generator.py) for the full implementation with additional error handling and token counting.
Adding Context: Simple RAG
Now we’ll build the retrieval side: turning documents into searchable chunks and finding relevant ones for a query.
Why Chunking Matters
Imagine searching a library with a single catalog card for each book. “Moby Dick: A story about whaling.” Helpful, but if you’re looking for the chapter about the whiteness of the whale, you’re out of luck. You need finer-grained index entries.
Chunking is creating those index entries for your documents. But chunk too finely (every sentence) and you lose context. Chunk too coarsely (whole documents) and search precision suffers.
The “aha” moment: Chunking is a core tradeoff in RAG. Smaller chunks = better search precision but less context per result. Larger chunks = more context but noisier search results. There’s no perfect answer, only the right answer for your use case.
Building the Indexer
Create indexer.py:
"""Document processing and indexing."""
import os
import re
import hashlib
from pathlib import Path
from dataclasses import dataclass
from typing import Iterator
import tiktoken
from config import config


@dataclass
class Chunk:
    """A chunk of text with metadata."""
    text: str
    source_file: str
    chunk_index: int
    token_count: int

    @property
    def id(self) -> str:
        """Generate a unique ID for this chunk."""
        content_hash = hashlib.md5(self.text.encode()).hexdigest()[:8]
        return f"{self.source_file}::{self.chunk_index}::{content_hash}"


class Chunker:
    """Split documents into chunks for indexing."""

    def __init__(
        self,
        chunk_size: int = config.chunk_size,
        chunk_overlap: int = config.chunk_overlap
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Use the tokenizer for our embedding model
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.tokenizer.encode(text))

    def chunk_text(self, text: str, source_file: str) -> list[Chunk]:
        """
        Split text into overlapping chunks.

        Strategy: Split on paragraph boundaries when possible,
        fall back to sentence boundaries for oversized paragraphs.
        """
        # Normalize whitespace
        text = text.strip()
        if not text:
            return []
        # Try to split on paragraphs first
        paragraphs = text.split('\n\n')
        chunks = []
        current_chunk = []
        current_tokens = 0
        for para in paragraphs:
            para = para.strip()
            if not para:
                continue
            para_tokens = self.count_tokens(para)
            # If single paragraph exceeds chunk size, split it further
            if para_tokens > self.chunk_size:
                # Flush current chunk
                if current_chunk:
                    chunk_text = '\n\n'.join(current_chunk)
                    chunks.append(self._create_chunk(
                        chunk_text, source_file, len(chunks)
                    ))
                    current_chunk = []
                    current_tokens = 0
                # Split the large paragraph
                chunks.extend(self._split_large_text(
                    para, source_file, len(chunks)
                ))
                continue
            # Check if adding this paragraph exceeds chunk size
            if current_tokens + para_tokens > self.chunk_size and current_chunk:
                # Save current chunk
                chunk_text = '\n\n'.join(current_chunk)
                chunks.append(self._create_chunk(
                    chunk_text, source_file, len(chunks)
                ))
                # Start new chunk with overlap
                # Keep last paragraph(s) that fit in overlap
                overlap_chunks = []
                overlap_tokens = 0
                for p in reversed(current_chunk):
                    p_tokens = self.count_tokens(p)
                    if overlap_tokens + p_tokens <= self.chunk_overlap:
                        overlap_chunks.insert(0, p)
                        overlap_tokens += p_tokens
                    else:
                        break
                current_chunk = overlap_chunks
                current_tokens = overlap_tokens
            current_chunk.append(para)
            current_tokens += para_tokens
        # Don't forget the last chunk
        if current_chunk:
            chunk_text = '\n\n'.join(current_chunk)
            chunks.append(self._create_chunk(
                chunk_text, source_file, len(chunks)
            ))
        return chunks

    def _split_large_text(
        self, text: str, source_file: str, start_index: int
    ) -> list[Chunk]:
        """Split text that exceeds chunk size using sentences."""
        # Split on sentence boundaries
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks = []
        current_sentences = []
        current_tokens = 0
        for sentence in sentences:
            sentence_tokens = self.count_tokens(sentence)
            if current_tokens + sentence_tokens > self.chunk_size and current_sentences:
                chunk_text = ' '.join(current_sentences)
                chunks.append(self._create_chunk(
                    chunk_text, source_file, start_index + len(chunks)
                ))
                current_sentences = []
                current_tokens = 0
            current_sentences.append(sentence)
            current_tokens += sentence_tokens
        if current_sentences:
            chunk_text = ' '.join(current_sentences)
            chunks.append(self._create_chunk(
                chunk_text, source_file, start_index + len(chunks)
            ))
        return chunks

    def _create_chunk(self, text: str, source_file: str, index: int) -> Chunk:
        """Create a Chunk object."""
        return Chunk(
            text=text,
            source_file=source_file,
            chunk_index=index,
            token_count=self.count_tokens(text)
        )


def load_documents(directory: str) -> Iterator[tuple[str, str]]:
    """
    Load all supported documents from a directory.

    Yields (filename, content) tuples.
    """
    supported_extensions = {'.txt', '.md', '.markdown'}
    directory = Path(directory)
    if not directory.exists():
        raise FileNotFoundError(f"Directory not found: {directory}")
    for file_path in directory.rglob('*'):
        if file_path.suffix.lower() in supported_extensions:
            try:
                content = file_path.read_text(encoding='utf-8')
                # Use relative path as identifier
                relative_path = file_path.relative_to(directory)
                yield str(relative_path), content
            except Exception as e:
                print(f"Warning: Could not read {file_path}: {e}")


def index_documents(directory: str) -> list[Chunk]:
    """
    Load and chunk all documents from a directory.

    This is the main entry point for document indexing.
    """
    chunker = Chunker()
    all_chunks = []
    for filename, content in load_documents(directory):
        chunks = chunker.chunk_text(content, filename)
        all_chunks.extend(chunks)
        print(f" {filename}: {len(chunks)} chunks")
    return all_chunks

Let’s trace through what happens when we chunk a document:
- Load: Read file content as text
- Split on paragraphs: Natural semantic boundaries
- Accumulate: Keep adding paragraphs until we hit chunk_size tokens
- Overlap: Keep the last few paragraphs when starting a new chunk
- Handle edge cases: Very long paragraphs get split on sentences
Why overlap? Consider this text split into two chunks without overlap:
Chunk 1: "...The new policy takes effect on January 1st."
Chunk 2: "Employees must complete training before the deadline..."
A question about “when employees must complete training” might not match either chunk well. With overlap:
Chunk 1: "...The new policy takes effect on January 1st."
Chunk 2: "takes effect on January 1st. Employees must complete training before the deadline..."
Now chunk 2 has the context needed to answer the question.
Common Mistake #5: No chunk overlap. This creates “boundary blindness” where important information spanning chunks becomes unfindable.
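The effect of overlap is easiest to see with a fixed-size sliding window. The sketch below is a simplification of the chapter's Chunker, which overlaps by whole paragraphs: here whitespace-separated words stand in for tokens, and `sliding_chunks` is a hypothetical helper, not part of indexer.py.

```python
def sliding_chunks(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


text = (
    "The new policy takes effect on January 1st. "
    "Employees must complete training before the deadline."
)
words = text.split()  # whitespace words stand in for tokens

for chunk in sliding_chunks(words, chunk_size=8, overlap=3):
    print(" ".join(chunk))
```

Each window repeats the last three words of the previous one, so the date and the training requirement appear together in the second chunk.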
Creating Embeddings
Embeddings are numeric representations of text that capture semantic meaning. Similar texts have similar embeddings, enabling semantic search.
Add to indexer.py:
import openai
from config import config


def create_embeddings(texts: list[str]) -> list[list[float]]:
    """
    Create embeddings for a list of texts.

    Uses OpenAI's embedding API with batching for efficiency.
    """
    client = openai.OpenAI()
    # OpenAI's embedding API accepts batches
    # but has a token limit per request
    batch_size = 100
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model=config.embedding_model,
            input=batch
        )
        # Sort by index to maintain order
        sorted_embeddings = sorted(response.data, key=lambda x: x.index)
        all_embeddings.extend([e.embedding for e in sorted_embeddings])
    return all_embeddings

Why batch embeddings?
1. Efficiency: One API call for 100 texts vs. 100 API calls
2. Cost: Same price but less overhead
3. Rate limits: Fewer requests means less chance of hitting limits
The “aha” moment: Document embeddings are created once during indexing and reused for every query; at query time, only the question itself needs to be embedded, using the same embedding model. The LLM never sees the embeddings, only the text of the retrieved chunks. This is why RAG can work with any LLM, independent of the embedding model.
Complete code: See Document Indexer (indexer.py) for the full implementation with progress tracking, file type handling, and metadata extraction.
Storing in ChromaDB
ChromaDB is a simple, local vector database perfect for learning. Create retriever.py:
"""Vector search and retrieval."""
import chromadb
from chromadb.config import Settings
from config import config
from indexer import Chunk, create_embeddings


class VectorStore:
    """Simple vector store using ChromaDB."""

    def __init__(self, persist_directory: str = config.index_path):
        """Initialize or load existing vector store."""
        self.client = chromadb.PersistentClient(
            path=persist_directory,
            settings=Settings(anonymized_telemetry=False)
        )
        self.collection = self.client.get_or_create_collection(
            name="documents",
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )

    def add_chunks(self, chunks: list[Chunk]) -> None:
        """Add chunks to the vector store."""
        if not chunks:
            return
        # Extract texts and metadata
        texts = [chunk.text for chunk in chunks]
        ids = [chunk.id for chunk in chunks]
        metadatas = [
            {
                "source_file": chunk.source_file,
                "chunk_index": chunk.chunk_index,
                "token_count": chunk.token_count
            }
            for chunk in chunks
        ]
        # Create embeddings
        print("Generating embeddings...")
        embeddings = create_embeddings(texts)
        # Add to collection
        self.collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=texts,
            metadatas=metadatas
        )
        print(f"Added {len(chunks)} chunks to index")

    def search(self, query: str, top_k: int = config.top_k) -> list[dict]:
        """
        Search for chunks similar to the query.

        Returns list of dicts with 'text', 'metadata', and 'score' keys.
        """
        # Create embedding for query
        query_embedding = create_embeddings([query])[0]
        # Search
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )
        # Format results
        formatted = []
        for i in range(len(results['ids'][0])):
            formatted.append({
                'text': results['documents'][0][i],
                'metadata': results['metadatas'][0][i],
                'score': 1 - results['distances'][0][i]  # Convert distance to similarity
            })
        return formatted

    def count(self) -> int:
        """Return number of chunks in the index."""
        return self.collection.count()

    def clear(self) -> None:
        """Clear all data from the index."""
        self.client.delete_collection("documents")
        self.collection = self.client.get_or_create_collection(
            name="documents",
            metadata={"hnsw:space": "cosine"}
        )

Key concepts:
Cosine similarity: Measures the angle between two vectors. Range is -1 to 1, where 1 means identical direction (most similar). We configure ChromaDB to use cosine by setting hnsw:space: cosine.
HNSW (Hierarchical Navigable Small World): The algorithm ChromaDB uses for fast approximate nearest neighbor search. You don’t need to understand the details, just know it trades a tiny bit of accuracy for massive speed improvements.
Distance vs. similarity: ChromaDB returns distances (lower = more similar). We convert to similarity scores (higher = more similar) for intuitive interpretation.
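The relationship between cosine similarity and cosine distance can be checked by hand on toy vectors. This is a plain-Python sketch for intuition only; the function names are hypothetical, and real embeddings have hundreds or thousands of dimensions rather than two.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance form used for ranking: lower means more similar."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # longer vector, same direction
orthogonal = [0.0, 3.0]       # unrelated direction
opposite = [-1.0, 0.0]        # opposite direction

for name, vec in [("same", same_direction), ("orthogonal", orthogonal), ("opposite", opposite)]:
    sim = cosine_similarity(query, vec)
    print(f"{name}: similarity={sim:.1f}, distance={cosine_distance(query, vec):.1f}")
```

Vector length does not matter, only direction, which is why same_direction scores as a perfect match despite being twice as long.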
Complete code: See Vector Retriever (retriever.py) for the full implementation with hybrid search, metadata filtering, and batch retrieval.
Testing Retrieval
Let’s create some test documents and verify retrieval works:
# test_retrieval.py
"""Test the retrieval pipeline."""
from pathlib import Path
from indexer import index_documents
from retriever import VectorStore


def create_sample_docs():
    """Create sample documents for testing."""
    docs_dir = Path("sample_docs")
    docs_dir.mkdir(exist_ok=True)
    # HR Policy document
    (docs_dir / "hr_policy.md").write_text("""
# HR Policies

## Vacation Policy
All full-time employees receive 20 days of paid time off (PTO) per year.
PTO accrues at a rate of 1.67 days per month.
Unused PTO can roll over to the next year, up to a maximum of 5 days.
PTO requests should be submitted at least 2 weeks in advance.

## Remote Work Policy
Employees may work remotely up to 3 days per week with manager approval.
Full remote arrangements require VP approval and must be reviewed quarterly.

## Expense Policy
Business expenses over $50 require receipt documentation.
Expenses over $500 require pre-approval from your manager.
Submit expense reports within 30 days of the expense.
""")
    # Technical documentation
    (docs_dir / "tech_guide.md").write_text("""
# Technical Guidelines

## Code Review Process
All code changes require at least one approval before merging.
Security-sensitive changes require review from the security team.
Reviews should be completed within 2 business days.

## Deployment Process
Deployments to production happen on Tuesdays and Thursdays.
Emergency deployments require on-call approval and a rollback plan.
All deployments must pass automated tests and staging validation.

## On-Call Rotation
Engineers participate in on-call rotation after 3 months on the team.
On-call shifts are one week long, starting Monday at 9 AM.
On-call engineers receive a $500 stipend per week.
""")
    print("Sample documents created in ./sample_docs")
    return docs_dir


def main():
    # Create sample documents
    docs_dir = create_sample_docs()
    # Index documents
    print("\nIndexing documents...")
    chunks = index_documents(str(docs_dir))
    print(f"Created {len(chunks)} chunks")
    # Add to vector store
    store = VectorStore()
    store.clear()  # Start fresh
    store.add_chunks(chunks)
    # Test queries
    test_queries = [
        "How many vacation days do I get?",
        "Can I work from home?",
        "How do I submit an expense report?",
        "What is the deployment schedule?",
        "How much is the on-call stipend?"
    ]
    print("\n" + "="*50)
    print("Testing retrieval:")
    print("="*50)
    for query in test_queries:
        print(f"\nQuery: {query}")
        results = store.search(query, top_k=2)
        for i, result in enumerate(results):
            print(f" Result {i+1} (score: {result['score']:.3f}):")
            print(f" Source: {result['metadata']['source_file']}")
            # Show first 100 chars of text
            preview = result['text'][:100].replace('\n', ' ')
            print(f" Preview: {preview}...")


if __name__ == "__main__":
    main()

Run it:
python test_retrieval.py
You should see each query returning relevant chunks. For “How many vacation days do I get?”, the top result should be from the vacation policy section, not the deployment process.
Common Mistake #6: Not testing retrieval in isolation. If your Q&A system gives bad answers, is it retrieval (wrong chunks) or generation (ignoring context)? Test each component separately.
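One way to test retrieval in isolation is to measure how often the expected source file appears among the top results, with no LLM call involved. The sketch below assumes a search function that returns dictionaries shaped like the results printed in test_retrieval.py; `retrieval_hit_rate` and `fake_search` are hypothetical names used only for illustration.

```python
from typing import Callable


def retrieval_hit_rate(
    search: Callable[[str], list[dict]],
    cases: list[tuple[str, str]],
) -> float:
    """Return the fraction of queries whose expected source file is retrieved.

    `cases` holds (query, expected_source_file) pairs.
    """
    if not cases:
        return 0.0
    hits = 0
    for query, expected_source in cases:
        sources = [r["metadata"]["source_file"] for r in search(query)]
        if expected_source in sources:
            hits += 1
    return hits / len(cases)


def fake_search(query: str) -> list[dict]:
    """Stand-in for store.search so the sketch runs without an index."""
    source = "hr_policy.md" if "vacation" in query.lower() else "tech_guide.md"
    return [{"metadata": {"source_file": source}, "text": "", "score": 1.0}]


cases = [
    ("How many vacation days do I get?", "hr_policy.md"),
    ("What is the deployment schedule?", "tech_guide.md"),
    ("How do I submit an expense report?", "hr_policy.md"),
]
hit_rate = retrieval_hit_rate(fake_search, cases)
print(f"Hit rate: {hit_rate:.2f}")
```

With a real index, pass `lambda q: store.search(q, top_k=2)` in place of `fake_search`. A low hit rate points at retrieval; a high hit rate with bad answers points at generation.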
Building the Q&A Loop
Now we connect everything into a working Q&A system.
The Complete Pipeline
Constructing Effective Prompts
The prompt is where retrieval meets generation. A good RAG prompt:
- Clearly provides the context
- States the task unambiguously
- Instructs the model to use only the context
- Requests citations for verifiability
Update generator.py with our RAG-specific generator:
"""LLM interaction for generating answers."""
import anthropic
import time
from typing import Optional
from dataclasses import dataclass

from config import config


class GenerationError(Exception):
    """Custom exception for generation failures."""
    pass


@dataclass
class GenerationResult:
    """Result of a generation request with metadata."""
    text: str
    input_tokens: int
    output_tokens: int
    model: str


# The RAG prompt template
RAG_PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided documentation.
## Instructions
- Answer the question using ONLY the information in the provided context
- If the context doesn't contain enough information to answer, say "I don't have enough information to answer that question based on the available documentation"
- Always cite your sources by mentioning which document the information comes from
- Be concise but complete
- If multiple documents contain relevant information, synthesize them
## Context
{context}
## Question
{question}
## Answer"""


def format_context(search_results: list[dict]) -> str:
    """Format search results into context for the prompt."""
    context_parts = []
    for i, result in enumerate(search_results, 1):
        source = result['metadata']['source_file']
        chunk_idx = result['metadata']['chunk_index']
        text = result['text']
        score = result['score']
        context_parts.append(
            f"[Source {i}: {source} (chunk {chunk_idx}, relevance: {score:.2f})]\n{text}"
        )
    return "\n\n---\n\n".join(context_parts)


def generate_answer(
    question: str,
    search_results: list[dict],
    max_retries: int = 3
) -> GenerationResult:
    """
    Generate an answer using retrieved context.
    This is the core RAG generation function.
    """
    client = anthropic.Anthropic()
    # Format context from search results
    context = format_context(search_results)
    # Construct the full prompt
    prompt = RAG_PROMPT_TEMPLATE.format(
        context=context,
        question=question
    )
    delay = 1.0
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model=config.llm_model,
                max_tokens=config.max_tokens,
                temperature=config.temperature,
                messages=[{"role": "user", "content": prompt}]
            )
            return GenerationResult(
                text=message.content[0].text,
                input_tokens=message.usage.input_tokens,
                output_tokens=message.usage.output_tokens,
                model=message.model
            )
        except (anthropic.RateLimitError, anthropic.APIConnectionError) as e:
            # Transient failure: back off exponentially and retry
            last_error = e
            time.sleep(delay)
            delay *= 2
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                # Server-side error: also worth retrying
                last_error = e
                time.sleep(delay)
                delay *= 2
            else:
                raise GenerationError(f"API error: {e.message}") from e
    raise GenerationError(f"Failed after {max_retries} attempts: {last_error}")

Let’s examine the prompt template:
RAG_PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided documentation.
## Instructions
- Answer the question using ONLY the information in the provided context
- If the context doesn't contain enough information...
Why this structure?
- Role setting (“You are a helpful assistant”): Establishes behavior expectations
- Clear constraint (“ONLY the information in the provided context”): Reduces hallucination
- Graceful failure (“If the context doesn’t contain enough information”): Better than wrong answers
- Citation requirement: Enables verification
- Labeled sections: Makes parsing unambiguous
The “aha” moment: The prompt is where you control the model’s behavior. Every instruction you add trades off against other instructions. “Be concise” conflicts with “be complete.” “Cite sources” adds length. Finding the right balance requires iteration.
The Main Application
Create qa_assistant.py:
"""Document Q&A Assistant - Main application."""
import argparse
import sys
from pathlib import Path

from rich.console import Console
from rich.markdown import Markdown
from rich.panel import Panel

from config import config
from indexer import index_documents
from retriever import VectorStore
from generator import generate_answer, GenerationError

console = Console()


def cmd_index(args):
    """Index documents from a directory."""
    directory = Path(args.directory)
    if not directory.exists():
        console.print(f"[red]Error: Directory not found: {directory}[/red]")
        sys.exit(1)
    console.print(f"[bold]Indexing documents from {directory}...[/bold]")
    # Load and chunk documents
    chunks = index_documents(str(directory))
    if not chunks:
        console.print("[yellow]No documents found to index[/yellow]")
        sys.exit(1)
    console.print(f"Created {len(chunks)} chunks from documents")
    # Store in vector database
    store = VectorStore()
    if args.clear:
        console.print("Clearing existing index...")
        store.clear()
    store.add_chunks(chunks)
    console.print(f"[green]Index saved. Total chunks: {store.count()}[/green]")


def cmd_query(args):
    """Query the document index."""
    # Validate configuration
    errors = config.validate()
    if errors:
        console.print("[red]Configuration errors:[/red]")
        for error in errors:
            console.print(f" - {error}")
        sys.exit(1)
    # Load vector store
    store = VectorStore()
    if store.count() == 0:
        console.print("[yellow]Index is empty. Run 'index' command first.[/yellow]")
        sys.exit(1)
    question = args.question
    console.print(f"\n[bold]Question:[/bold] {question}\n")
    # Retrieve relevant chunks
    with console.status("Searching documents..."):
        results = store.search(question, top_k=config.top_k)
    if not results:
        console.print("[yellow]No relevant documents found.[/yellow]")
        sys.exit(1)
    # Show retrieved sources if verbose
    if args.verbose:
        console.print("[dim]Retrieved sources:[/dim]")
        for i, result in enumerate(results, 1):
            source = result['metadata']['source_file']
            score = result['score']
            console.print(f" {i}. {source} (relevance: {score:.2f})")
        console.print()
    # Generate answer
    with console.status("Generating answer..."):
        try:
            result = generate_answer(question, results)
        except GenerationError as e:
            console.print(f"[red]Error generating answer: {e}[/red]")
            sys.exit(1)
    # Display answer
    console.print(Panel(
        Markdown(result.text),
        title="Answer",
        border_style="green"
    ))
    # Show token usage if verbose
    if args.verbose:
        console.print(f"\n[dim]Tokens: {result.input_tokens} input, {result.output_tokens} output[/dim]")
        # Estimate cost (approximate rates as of 2026)
        input_cost = result.input_tokens * 0.003 / 1000  # $3 per 1M input tokens
        output_cost = result.output_tokens * 0.015 / 1000  # $15 per 1M output tokens
        console.print(f"[dim]Estimated cost: ${input_cost + output_cost:.4f}[/dim]")


def cmd_interactive(args):
    """Interactive query mode."""
    errors = config.validate()
    if errors:
        console.print("[red]Configuration errors:[/red]")
        for error in errors:
            console.print(f" - {error}")
        sys.exit(1)
    store = VectorStore()
    if store.count() == 0:
        console.print("[yellow]Index is empty. Run 'index' command first.[/yellow]")
        sys.exit(1)
    console.print("[bold]Document Q&A Assistant[/bold]")
    console.print(f"Index contains {store.count()} chunks")
    console.print("Type 'quit' or 'exit' to stop.\n")
    while True:
        try:
            question = console.input("[bold cyan]Question:[/bold cyan] ").strip()
        except (KeyboardInterrupt, EOFError):
            console.print("\nGoodbye!")
            break
        if not question:
            continue
        if question.lower() in ('quit', 'exit', 'q'):
            console.print("Goodbye!")
            break
        # Retrieve
        results = store.search(question, top_k=config.top_k)
        if not results:
            console.print("[yellow]No relevant documents found.[/yellow]\n")
            continue
        # Generate
        try:
            result = generate_answer(question, results)
        except GenerationError as e:
            console.print(f"[red]Error: {e}[/red]\n")
            continue
        # Display
        console.print()
        console.print(Panel(
            Markdown(result.text),
            title="Answer",
            border_style="green"
        ))
        console.print()


def main():
    parser = argparse.ArgumentParser(
        description="Document Q&A Assistant",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
Index documents:
python qa_assistant.py index ./docs
Query the index:
python qa_assistant.py query "What is the vacation policy?"
Interactive mode:
python qa_assistant.py interactive
"""
    )
    subparsers = parser.add_subparsers(dest="command", help="Available commands")
    # Index command
    index_parser = subparsers.add_parser("index", help="Index documents")
    index_parser.add_argument("directory", help="Directory containing documents")
    index_parser.add_argument("--clear", action="store_true",
                              help="Clear existing index before adding")
    # Query command
    query_parser = subparsers.add_parser("query", help="Query the index")
    query_parser.add_argument("question", help="Question to ask")
    query_parser.add_argument("-v", "--verbose", action="store_true",
                              help="Show additional details")
    # Interactive command
    interactive_parser = subparsers.add_parser("interactive",
                                               help="Interactive query mode")
    args = parser.parse_args()
    if args.command == "index":
        cmd_index(args)
    elif args.command == "query":
        cmd_query(args)
    elif args.command == "interactive":
        cmd_interactive(args)
    else:
        parser.print_help()


if __name__ == "__main__":
    main()

Testing the Complete System
# First, create and index sample documents
python test_retrieval.py
# Now query using the Q&A assistant
python qa_assistant.py query "How many vacation days do employees get?"
# Try with verbose mode
python qa_assistant.py query -v "What is the deployment schedule?"
# Interactive mode
python qa_assistant.py interactive
You should see answers grounded in your documents, with citations to the source files.
Complete code: See Main Application (qa_assistant.py) for the full implementation with argument parsing, error handling, and graceful shutdown.
Making It Production-Ready
A working prototype is not a production system. Let’s add the scaffolding that turns demos into reliable tools.
Adding Logging
Logging is how you debug production issues. You can’t attach a debugger to production. Create or update files to include logging:
# Add to config.py
import logging


def setup_logging(level: str = "INFO") -> logging.Logger:
    """Configure logging for the application."""
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    return logging.getLogger("qa_assistant")


logger = setup_logging()

# Update generator.py to use logging
from config import logger


def generate_answer(question: str, search_results: list[dict], ...):
    logger.info(f"Generating answer for question: {question[:50]}...")
    logger.debug(f"Using {len(search_results)} context chunks")
    # ... existing code ...
    logger.info(f"Generated answer: {result.input_tokens} in, {result.output_tokens} out")
    return result

# Update retriever.py
from config import logger


def search(self, query: str, top_k: int = config.top_k):
    logger.debug(f"Searching for: {query[:50]}...")
    # ... existing code ...
    # Guard against an empty result list before reading the top score
    top_score = formatted[0]['score'] if formatted else 0.0
    logger.info(f"Found {len(formatted)} results, top score: {top_score:.3f}")
    return formatted

Why structured logging?
- Timestamps: Know when things happened
- Levels: Filter by severity (DEBUG, INFO, WARNING, ERROR)
- Context: Include relevant data for debugging
Handling Edge Cases
Real users do unexpected things. Handle them gracefully:
# Add to qa_assistant.py
def validate_question(question: str) -> tuple[bool, str]:
    """
    Validate user question before processing.
    Returns (is_valid, error_message).
    """
    if not question or not question.strip():
        return False, "Question cannot be empty"
    if len(question) > 1000:
        return False, "Question is too long (max 1000 characters)"
    if len(question.split()) < 2:
        return False, "Please ask a complete question"
    return True, ""


def cmd_query(args):
    # Validate question first
    is_valid, error_msg = validate_question(args.question)
    if not is_valid:
        console.print(f"[red]Invalid question: {error_msg}[/red]")
        sys.exit(1)
    # ... rest of function

Common edge cases to handle:
- Empty queries
- Extremely long queries (will exceed context limits)
- Queries with no relevant results
- API timeouts
- Malformed documents during indexing
- File encoding issues
Common Mistake #7: Ignoring edge cases during development. The happy path works great in demos. Production users find every edge case.
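File encoding issues are a frequent cause of indexing failures. A minimal sketch of a tolerant reader follows; it assumes documents are usually UTF-8, and `read_text_safely` is a hypothetical helper, not part of the chapter's indexer.py.

```python
import tempfile
from pathlib import Path


def read_text_safely(path: Path) -> str:
    """Read a text file as UTF-8, replacing undecodable bytes on failure."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # Fall back to a lossy decode so one bad file does not abort indexing.
        return path.read_text(encoding="utf-8", errors="replace")


# Demonstrate with a file containing a byte that is invalid in UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    bad_file = Path(tmp) / "legacy.md"
    bad_file.write_bytes(b"Caf\xe9 policy")  # Latin-1 encoded "é"
    text = read_text_safely(bad_file)
    print(text)
```

The lossy fallback keeps indexing running, at the cost of a replacement character where the bad byte was. Logging the affected file name would make such files easy to find and fix later.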
Basic Evaluation
How do you know if your system is working well? Build evaluation into the system:
# evaluation.py
"""Basic evaluation utilities for the Q&A assistant."""
from dataclasses import dataclass
from typing import Optional
import json
from pathlib import Path

from retriever import VectorStore
from generator import generate_answer


@dataclass
class EvalCase:
    """A single evaluation case."""
    question: str
    expected_answer: str
    expected_sources: list[str]


@dataclass
class EvalResult:
    """Result of evaluating a single case."""
    question: str
    generated_answer: str
    expected_answer: str
    retrieved_sources: list[str]
    expected_sources: list[str]
    answer_contains_expected: bool
    correct_source_retrieved: bool


def load_eval_set(path: str) -> list[EvalCase]:
    """Load evaluation cases from a JSON file."""
    with open(path) as f:
        data = json.load(f)
    return [
        EvalCase(
            question=case['question'],
            expected_answer=case['expected_answer'],
            expected_sources=case.get('expected_sources', [])
        )
        for case in data
    ]


def evaluate_case(case: EvalCase, store: VectorStore) -> EvalResult:
    """Evaluate a single case."""
    # Retrieve
    results = store.search(case.question)
    retrieved_sources = [r['metadata']['source_file'] for r in results]
    # Generate
    gen_result = generate_answer(case.question, results)
    # Check if expected answer content is in generated answer
    answer_contains_expected = (
        case.expected_answer.lower() in gen_result.text.lower()
    )
    # Check if expected source was retrieved
    correct_source_retrieved = any(
        expected in retrieved_sources
        for expected in case.expected_sources
    ) if case.expected_sources else True
    return EvalResult(
        question=case.question,
        generated_answer=gen_result.text,
        expected_answer=case.expected_answer,
        retrieved_sources=retrieved_sources,
        expected_sources=case.expected_sources,
        answer_contains_expected=answer_contains_expected,
        correct_source_retrieved=correct_source_retrieved
    )


def run_evaluation(eval_path: str, store: VectorStore) -> dict:
    """Run full evaluation and return metrics."""
    cases = load_eval_set(eval_path)
    results = []
    for case in cases:
        result = evaluate_case(case, store)
        results.append(result)
    # Compute metrics
    total = len(results)
    answer_correct = sum(1 for r in results if r.answer_contains_expected)
    source_correct = sum(1 for r in results if r.correct_source_retrieved)
    return {
        'total_cases': total,
        'answer_accuracy': answer_correct / total if total > 0 else 0,
        'retrieval_accuracy': source_correct / total if total > 0 else 0,
        'results': results
    }

Create an evaluation set (eval_cases.json):
[
  {
    "question": "How many vacation days do employees get?",
    "expected_answer": "20 days",
    "expected_sources": ["hr_policy.md"]
  },
  {
    "question": "What days can we deploy to production?",
    "expected_answer": "Tuesdays and Thursdays",
    "expected_sources": ["tech_guide.md"]
  },
  {
    "question": "How much is the on-call stipend?",
    "expected_answer": "$500",
    "expected_sources": ["tech_guide.md"]
  }
]

The “aha” moment: Evaluation lets you measure improvement. Without metrics, you’re guessing. “Does changing chunk size help?” becomes “Did accuracy go from 75% to 82%?”
Cost Tracking
LLM APIs charge per token. Track costs to avoid surprises. See Appendix K (LLM Provider Comparison) for current pricing across providers.
# Add to config.py
@dataclass
class CostTracker:
    """Track API usage and costs."""
    # Pricing (check Appendix K for current rates)
    CLAUDE_INPUT_PRICE = 3.00 / 1_000_000  # $3 per 1M input tokens
    CLAUDE_OUTPUT_PRICE = 15.00 / 1_000_000  # $15 per 1M output tokens
    EMBEDDING_PRICE = 0.02 / 1_000_000  # $0.02 per 1M tokens
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_embedding_tokens: int = 0

    def add_generation(self, input_tokens: int, output_tokens: int):
        """Record a generation API call."""
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens

    def add_embedding(self, tokens: int):
        """Record an embedding API call."""
        self.total_embedding_tokens += tokens

    @property
    def total_cost(self) -> float:
        """Calculate total cost in USD."""
        generation_cost = (
            self.total_input_tokens * self.CLAUDE_INPUT_PRICE +
            self.total_output_tokens * self.CLAUDE_OUTPUT_PRICE
        )
        embedding_cost = self.total_embedding_tokens * self.EMBEDDING_PRICE
        return generation_cost + embedding_cost

    def summary(self) -> str:
        """Return a cost summary string."""
        return (
            f"Tokens: {self.total_input_tokens:,} input, "
            f"{self.total_output_tokens:,} output, "
            f"{self.total_embedding_tokens:,} embedding\n"
            f"Estimated cost: ${self.total_cost:.4f}"
        )


# Global cost tracker
cost_tracker = CostTracker()

Update generator.py to track costs:
from config import cost_tracker


def generate_answer(...):
    # ... existing code ...
    # Track costs
    cost_tracker.add_generation(
        result.input_tokens,
        result.output_tokens
    )
    return result

Common Mistake #8: Ignoring costs during development. It’s easy to rack up hundreds of dollars in API costs during development and testing. Track from day one.
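To make the per-token arithmetic concrete, the sketch below estimates the cost of a single query using the same illustrative rates as `CostTracker`. The token counts are hypothetical, and actual prices should be checked against your provider's current pricing.

```python
# Illustrative rates matching the CostTracker constants (USD per token).
INPUT_PRICE = 3.00 / 1_000_000
OUTPUT_PRICE = 15.00 / 1_000_000


def estimate_query_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single generation call."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


# Hypothetical query: 2,000 prompt tokens (question plus retrieved chunks)
# and 300 completion tokens.
per_query = estimate_query_cost(2_000, 300)
print(f"Per query: ${per_query:.4f}")
print(f"Per 1,000 queries: ${per_query * 1_000:.2f}")
```

Under these assumptions the input side costs $0.0060 and the output side $0.0045, so retrieved context is the larger share of the bill. That is why reducing `top_k` or chunk size is often the first cost lever to try.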
Complete Production-Ready Code
Here’s the final structure with all improvements:
document_qa/
├── qa_assistant.py # CLI entry point
├── indexer.py # Document processing
├── retriever.py # Vector search
├── generator.py # LLM interaction
├── evaluation.py # Evaluation utilities
├── config.py # Configuration, logging, cost tracking
├── requirements.txt
├── eval_cases.json # Evaluation test cases
├── sample_docs/
└── qa_index/
Each file has clear responsibilities. The system is observable through logging and cost tracking. Evaluation provides a feedback loop for improvement.
Complete Reference: The full, runnable implementation of all components is available in Your First LLM Application - Code Reference. This includes all source files, evaluation utilities, sample documents, and test scripts discussed in this chapter.
Next Steps
What We Skipped
This tutorial intentionally omitted several important topics that production systems need:
Advanced chunking: We used simple paragraph-based chunking. Production systems often use recursive chunking, semantic chunking, or even LLM-based chunking for better results. Chapter 7 covers these in depth.
Reranking: We used raw embedding similarity. Adding a reranking step (where a more powerful model re-scores candidates) significantly improves precision. This is covered in Chapter 7.
Prompt optimization: Our prompt is functional but not optimized. Production prompts benefit from systematic testing and iteration. Chapter 6 covers prompt engineering comprehensively.
Hybrid search: We used pure vector search. Combining vector and keyword search (BM25) often works better, especially for exact matches. Covered in Chapter 7.
Caching: Every query creates new embeddings and makes API calls. Caching embeddings and common queries reduces cost and latency. Covered in Chapter 9.
Authentication and security: Our CLI has no access controls. Production systems need authentication, rate limiting, and protection against prompt injection. Covered in Chapter 12.
Deployment: We run locally. Production needs containers, load balancing, monitoring, and CI/CD. Covered in Chapter 9.
How This Connects to Later Chapters
What you built here is the foundation for more sophisticated systems:
Chapter 7 (RAG Systems): Goes deep on everything we touched lightly: advanced chunking, embedding models, hybrid search, reranking, GraphRAG. You now have context for why these matter.
Chapter 8 (Agentic Systems): Agents extend this pattern. Instead of just answering questions, agents can take actions, use tools, and reason over multiple steps. The Q&A assistant becomes one tool in a larger system.
Chapter 9 (Deployment): Taking this CLI to production. How do you handle thousands of concurrent users? How do you update the index without downtime? How do you monitor quality?
Chapter 15 (Evaluation): We built basic evaluation. Production evaluation needs systematic test sets, regression testing, and metrics pipelines. LLM-as-judge evaluation becomes important.
Chapter 16 (Security): Users will try to trick your system. Prompt injection attacks try to override your instructions. Production systems need defenses.
Ideas for Extending the Project
Now that you have a working system, try these extensions:
Add PDF support: Use a library like pypdf or pdfplumber to extract text from PDFs. Handle the messiness of PDF text extraction.
Improve chunk quality: Implement semantic chunking that splits on topic changes rather than character counts. Compare retrieval quality.
Add conversation history: Make the assistant remember previous questions in the session. How does this change the prompt?
Build a web interface: Use FastAPI or Flask to create an HTTP API, then add a simple web frontend.
Implement caching: Add an LRU cache for embeddings and common queries. Measure the cost savings.
Add source highlighting: When displaying answers, highlight the specific passages that were used. Helps users verify accuracy.
Compare embedding models: Try different embedding models (OpenAI ada vs. text-embedding-3-small vs. open-source alternatives). Measure retrieval quality differences.
Add query expansion: Before searching, generate query variations. Does retrieving across multiple queries improve results?
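For the caching extension, Python's `functools.lru_cache` is one simple starting point. The sketch below uses a hypothetical `embed_text` stand-in that counts calls instead of contacting an embedding API; in a real system the function body would call your embedding provider.

```python
from functools import lru_cache

api_calls = 0  # Counts how often the (simulated) embedding API is hit.


@lru_cache(maxsize=1024)
def embed_text(text: str) -> tuple[float, ...]:
    """Return an embedding for `text`, caching results by exact string.

    The body is a placeholder: it derives a tiny fake vector from the text
    instead of calling a real embedding API.
    """
    global api_calls
    api_calls += 1
    return (float(len(text)), float(sum(map(ord, text)) % 97))


embed_text("What is the vacation policy?")
embed_text("What is the vacation policy?")  # Served from the cache
embed_text("What is the deployment schedule?")

print(f"API calls: {api_calls}")
print(embed_text.cache_info())
```

Returning a tuple keeps cached values immutable. Note that `lru_cache` only matches identical strings, so it does not help with paraphrased questions; those need a similarity-based (semantic) cache.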
Summary
You’ve built a complete Document Q&A Assistant from scratch. Along the way, you learned:
Setting Up: API keys go in environment variables, never in code. Virtual environments isolate dependencies. Configuration lives in one place.
API Interaction: LLM APIs are request-response with structured inputs and outputs. Streaming improves perceived latency. Retry with exponential backoff handles transient errors.
RAG Pipeline: Documents become chunks. Chunks become embeddings. Embeddings enable semantic search. Retrieved chunks become prompt context. The model generates grounded answers.
Production Readiness: Logging enables debugging. Edge case handling prevents crashes. Evaluation measures quality. Cost tracking prevents surprises.
Key Takeaways
RAG is retrieve-then-generate: Find relevant context, include it in the prompt, generate grounded responses. This pattern underlies most practical LLM applications.
Chunking is a tradeoff: Smaller chunks = better precision, less context. Larger chunks = more context, noisier search. Overlap prevents boundary blindness.
The prompt is your lever: You control model behavior through instructions. Clear structure, explicit constraints, and graceful failure handling matter.
Test components independently: When answers are wrong, is it retrieval or generation? Isolated testing reveals the failure point.
Build in observability: Logging, cost tracking, and evaluation aren’t optional. You can’t improve what you can’t measure.
The Architecture at a Glance
Practical Exercises
Chunk size experiment: Modify chunk_size to 200, 500, and 1000 tokens. Run the evaluation suite after each change. Which size gives the best results for your documents? Why?
Prompt engineering: The current RAG prompt is basic. Try adding:
- Instructions to format answers as bullet points
- A constraint to keep answers under 100 words
- Instructions to express uncertainty when appropriate
Measure how these changes affect answer quality.
Failure analysis: Find 5 questions where the system gives wrong answers. For each, determine if it’s a retrieval failure (wrong chunks) or generation failure (right chunks, wrong answer). What patterns emerge?
Cost optimization: Track costs for 50 queries. Identify the most expensive queries. What makes them expensive? How could you reduce costs?
Hybrid search: Implement keyword search alongside vector search. Combine results using reciprocal rank fusion. Compare retrieval quality to vector-only search.
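Reciprocal rank fusion, mentioned in the hybrid search exercise, scores each document as the sum of 1 / (k + rank) over every ranked list that contains it. The sketch below is a minimal version; the constant k = 60 is a commonly used default rather than something specified in this chapter, and the chunk IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) across the lists it appears in,
    with rank starting at 1. Higher fused scores rank first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk IDs from a vector search and a keyword (BM25) search.
vector_results = ["hr_policy_0", "hr_policy_1", "tech_guide_2"]
keyword_results = ["hr_policy_1", "tech_guide_0", "hr_policy_0"]

fused = reciprocal_rank_fusion([vector_results, keyword_results])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because the method uses only ranks, it sidesteps the problem that vector similarity scores and keyword scores live on different scales. Chunks that appear in both lists rise to the top.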
Self-Assessment Checkpoint
Conceptual Questions
Q1. [IC1] Why do we chunk documents before embedding them instead of embedding entire documents?
Answer
Two reasons: (1) Embedding models have token limits (typically 512-8192 tokens). Long documents exceed these limits. (2) Retrieval precision: if you embed a 50-page document as one vector, you retrieve the whole document even if only one paragraph is relevant. Smaller chunks enable precise retrieval of the specific content that answers the query.
Q2. [IC1] The system returns 5 chunks but the LLM only uses information from 2 of them. Is this a problem? How would you know?
Answer
It might be fine if the LLM correctly identified the most relevant content. To evaluate: (1) Check if the answer is correct and complete. (2) Examine if the unused chunks were truly irrelevant (retrieval issue) or contained useful information the LLM missed (generation issue). (3) If unused chunks are consistently irrelevant, reduce top_k to save tokens and cost. If they contain useful information, the RAG prompt may need improvement.
Q3. [IC2] A user asks “What’s the vacation policy?” and your system returns chunks about sick leave instead. Where in the pipeline did this fail and how would you debug it?
Answer
Debug systematically: (1) Check the query embedding. Is “vacation” semantically similar to “sick leave” in your embedding space? Test with embedding("vacation") · embedding("sick leave"). (2) Check if vacation policy exists in your documents. If not, retrieval is working correctly (returning the closest match). (3) If vacation docs exist, check their embeddings. They may have been chunked poorly, losing the word “vacation” from the headers. (4) Solution options: improve chunking to preserve context, use hybrid search to match keywords, or add metadata filtering.
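The embedding comparison in step (1) is usually computed as cosine similarity. A minimal sketch follows, using short hypothetical vectors in place of real embeddings, which typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" for illustration only.
vacation = [0.9, 0.1, 0.3]
sick_leave = [0.8, 0.2, 0.4]
deployment = [0.1, 0.9, 0.2]

print(f"vacation vs sick leave: {cosine_similarity(vacation, sick_leave):.3f}")
print(f"vacation vs deployment: {cosine_similarity(vacation, deployment):.3f}")
```

If real embeddings for “vacation” and “sick leave” score this close together, the sick-leave chunks ranking first is expected behavior for pure vector search, and keyword matching or metadata filtering is the more promising fix.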
Q4. [IC2] Your RAG system costs $0.50 per query on average. The product manager wants to reduce this to $0.10. What are your options?
Answer
Cost reduction strategies: (1) Use a smaller, cheaper LLM for simple queries and only escalate to the expensive model when needed. (2) Cache common queries and their answers. (3) Reduce context size: use fewer chunks, shorter chunks, or summarize retrieved content. (4) Use a cheaper embedding model (or cache embeddings). (5) Implement semantic caching: if a similar question was asked before, return the cached answer. (6) Batch queries where possible. Trade-off: each reduction may impact quality, so measure carefully.
Q5. [Senior] The system works well in testing but users complain it “doesn’t know” recent company updates. What’s happening and how do you fix it?
Answer
The index is stale: documents were added once but not updated. Solutions: (1) Implement incremental indexing that detects new/modified files. (2) Set up a scheduled job to re-index changed documents. (3) Add a “last indexed” timestamp to metadata so you can track freshness. (4) Consider a document change detection system (file watchers, webhooks from document system). (5) For critical updates, provide a manual “refresh index” command. Production RAG systems need operational processes for keeping the index current.
Code Challenges
Challenge 1. [IC1] The chunker splits on double newlines. Some documents use single newlines between paragraphs. Fix the chunker to handle both.
Solution
import re


def chunk_text(self, text: str, source_file: str) -> list[Chunk]:
    text = text.strip()
    if not text:
        return []
    # Normalize line endings and collapse runs of 3+ newlines to exactly two
    text = text.replace("\r\n", "\n")
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Split on blank lines when the document has them; otherwise fall back
    # to single newlines. Documents that use blank lines keep their
    # intentional single newlines (lists, wrapped lines) intact.
    separator = "\n\n" if "\n\n" in text else "\n"
    paragraphs = [p.strip() for p in text.split(separator) if p.strip()]
    # Rest of chunking logic...

Challenge 2. [IC2] Add a --verbose flag to the CLI that shows which chunks were retrieved and their similarity scores before the answer.
Solution
# In cli.py, modify the ask command:
@app.command()
def ask(
    question: str,
    verbose: bool = typer.Option(False, "--verbose", "-v", help="Show retrieved chunks"),
):
    """Ask a question about the indexed documents."""
    # ... existing setup code ...
    # Get results with scores
    results = pipeline.query(question)
    if verbose:
        console.print("\n[bold]Retrieved Chunks:[/bold]")
        for i, chunk in enumerate(results.chunks, 1):
            console.print(f"\n[cyan]Chunk {i}[/cyan] (score: {chunk.score:.3f})")
            console.print(f"Source: {chunk.source_file}")
            console.print(chunk.text[:200] + "..." if len(chunk.text) > 200 else chunk.text)
        console.print("\n" + "="*50 + "\n")
    # Print answer
    console.print(Markdown(results.answer))

Connections to Other Chapters
This hands-on tutorial introduces concepts explored in depth throughout the book:
Chapter 6 (Prompt Engineering): The prompts we built here are just the beginning—Chapter 6 covers advanced patterns like chain-of-thought, few-shot learning, and structured outputs.
Chapter 7 (RAG Systems): Takes the basic retrieval we implemented and expands to production patterns including hybrid search, reranking, and GraphRAG.
Chapter 15 (MLOps & Evaluation): How to systematically evaluate the quality of responses from systems like this one.
Chapter 14 (Backend Engineering for AI): Production patterns for testing, debugging, and deploying LLM applications.
Capstone Projects (Appendix E): Extends this tutorial into full production projects with Docker, monitoring, and comprehensive testing.
Further Reading
Essential
- Anthropic Claude API docs (docs.anthropic.com) - Primary API reference.
- Lewis et al. (2020), “Retrieval-Augmented Generation” - The paper that introduced RAG.
- ChromaDB documentation (docs.trychroma.com) - Vector store reference.
Deep Dives
- Gao et al. (2023), “RAG for LLMs: A Survey” - Comprehensive overview of RAG techniques.
- Reimers & Gurevych (2019), “Sentence-BERT” - Foundation for dense retrieval.
Connections to Other Chapters
This tutorial chapter serves as a practical bridge between foundational knowledge and advanced techniques:
From Chapter 5 (LLM/NLP Foundations): We used embeddings without explaining how they work. Chapter 5 covers the mathematics of how text becomes vectors and why similar meanings cluster in embedding space.
From Chapter 6 (Prompt Engineering): Our RAG prompt was functional but basic. Chapter 6 teaches systematic prompt design, few-shot learning, and chain-of-thought techniques that can significantly improve answer quality.
To Chapter 7 (RAG Systems): This chapter used simple chunking and vector search. Chapter 7 goes deep: semantic chunking, hybrid search, reranking, GraphRAG, and advanced retrieval patterns.
To Chapter 8 (Agentic Systems): The Q&A assistant is reactive: it answers questions. Agents are proactive: they take actions. Chapter 8 shows how to build systems that use tools, reason over multiple steps, and accomplish complex goals.
To Chapter 9 (Deployment): We ran locally. Chapter 9 covers serving LLM applications at scale: batching, caching, load balancing, and production infrastructure.
To Chapter 15 (MLOps & Evaluation): Our evaluation was minimal. Chapter 15 covers systematic evaluation: test set design, metrics, LLM-as-judge patterns, and continuous quality monitoring.
Part I Checkpoint: Foundations Complete
Congratulations on completing Part I! Before moving to Part II, verify you can do the following:
Skills Checklist
Quick Self-Test (10 minutes)
Q1. What’s the difference between embeddings and one-hot encodings? Why do embeddings enable semantic search?
Q2. Your async LLM call is blocking the event loop. What’s wrong and how do you fix it?
Q3. You built a document Q&A system but it’s returning irrelevant chunks. List three things you would check.
Q4. Why does chunk overlap matter in RAG systems?
A1. One-hot encodings are sparse vectors in which any two distinct words are orthogonal (dot product = 0), so they encode no notion of similarity. Embeddings are dense vectors in which similar meanings are geometrically close, which is what makes similarity search via cosine distance possible.
A2. You’re likely making a synchronous (blocking) API call inside an async function, which stalls the event loop until the call returns. Use await with an async client, or run the sync code in a worker thread: await asyncio.to_thread(sync_function).
A3. (1) Check chunk size: too small loses context, too large dilutes relevance. (2) Verify that the embedding model suits your domain. (3) Check whether the query and documents use different vocabulary (consider hybrid search).
A4. Without overlap, relevant sentences at chunk boundaries may be split and neither chunk matches the query well. Overlap ensures boundary content appears in at least one complete chunk.
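The boundary effect described in A4 is easy to reproduce. The following is a minimal sketch, not the chunker built earlier in this chapter: the chunk_text helper, its character-based windows, and the sample text are all hypothetical, chosen only to make the effect visible. It splits the same text with and without overlap and checks whether a key phrase survives intact in any chunk.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters.

    Hypothetical helper for illustration; real chunkers often split on
    tokens or sentences rather than raw characters.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Made-up sample text; the key phrase straddles the 60-character boundary.
TEXT = (
    "The cache is cleared nightly. "
    "Refunds are processed within five days. "
    "Contact support for help."
)
KEY_PHRASE = "within five days"

no_overlap = chunk_text(TEXT, chunk_size=60, overlap=0)
with_overlap = chunk_text(TEXT, chunk_size=60, overlap=20)

print("intact without overlap:", any(KEY_PHRASE in c for c in no_overlap))
print("intact with overlap:   ", any(KEY_PHRASE in c for c in with_overlap))
```

With this fixed-step scheme, any span no longer than the overlap is guaranteed to land whole in at least one chunk. Longer spans can still be split, which is one reason overlap is usually tuned together with chunk size.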
Ready for Part II?
If you can confidently check all boxes above, you’re ready for Part II: Core LLM Development, where you’ll learn the theory behind transformers, master prompt engineering, build production RAG systems, and create autonomous agents.