Appendix H: Reference Cards & Code Index

This appendix combines two essential reference resources: Quick Reference Cards with one-page summaries for each major topic, and a Code Example Index to find specific patterns and implementations across the textbook.


Part I: Quick Reference Cards

One-page summaries for each major topic. Print these or keep them handy during development.


Chapter 5: LLM/NLP Foundations

Key Concepts

┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM FUNDAMENTALS                                                            │
└─────────────────────────────────────────────────────────────────────────────┘

TOKENIZATION
├── BPE (Byte-Pair Encoding): Most common, good compression
├── WordPiece: Used by BERT
├── SentencePiece: Language-agnostic
└── Tokens ≠ Words: "tokenization" → ["token", "ization"]

EMBEDDINGS
├── Dense vectors representing semantic meaning
├── Similar meanings → similar vectors
├── Dimensionality: 768-4096 typical
└── Cosine similarity for comparison

ATTENTION
├── Query, Key, Value matrices
├── Attention(Q,K,V) = softmax(QK^T/√d)V
├── Complexity: O(n²) with sequence length
└── Enables long-range dependencies

TRANSFORMER ARCHITECTURE
├── Encoder: Bidirectional (BERT-style)
├── Decoder: Autoregressive (GPT-style)
├── Encoder-Decoder: Seq2seq (T5-style)
└── Modern LLMs: Decoder-only dominant

Quick Formulas

Concept Formula/Rule
Attention softmax(QK^T / √d_k) × V
Token estimate ~4 characters per token (English)
Context usage prompt_tokens + max_tokens ≤ context_window

Common Model Sizes

Model Parameters VRAM (FP16) VRAM (INT8)
7B 7 billion ~14 GB ~7 GB
13B 13 billion ~26 GB ~13 GB
70B 70 billion ~140 GB ~70 GB

Chapter 6: Prompt Engineering

Prompt Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│ PROMPTING PATTERNS                                                          │
└─────────────────────────────────────────────────────────────────────────────┘

ZERO-SHOT
"Classify this text as positive or negative: {text}"

FEW-SHOT
"Examples:
 Text: Great product! → positive
 Text: Terrible service → negative
 Text: {text} →"

CHAIN-OF-THOUGHT
"Let's solve this step by step:
 1. First, identify...
 2. Then, calculate...
 3. Finally, conclude..."

STRUCTURED OUTPUT
"Respond with valid JSON matching this schema:
 {\"sentiment\": \"positive|negative\", \"confidence\": 0.0-1.0}"

Best Practices

Do Don’t
Be specific and explicit Assume the model knows context
Use delimiters for data Mix instructions with data
Provide examples Give vague instructions
Set output format Leave format ambiguous
Include constraints Forget edge cases

Temperature Guide

Temperature Use Case
0.0-0.3 Factual, deterministic (code, extraction)
0.4-0.7 Balanced (most applications)
0.8-1.0 Creative (brainstorming, writing)

Chapter 7: RAG Systems

Pipeline Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│ RAG PIPELINE                                                                │
└─────────────────────────────────────────────────────────────────────────────┘

INDEXING (Offline)
Documents → Extract → Chunk → Embed → Store
                ↓         ↓        ↓       ↓
            Clean text  512 tok  1024-dim  Vector DB

RETRIEVAL (Online)
Query → Embed → Search → Rerank → Top-K chunks
          ↓        ↓        ↓          ↓
       Same model  ANN    Cross-enc   3-5 chunks

GENERATION
Context + Query → LLM → Response + Citations

Chunking Decision Tree

Document Type?
├── Structured (API docs, specs)
│   └── Use section boundaries + semantic chunking
├── Narrative (articles, reports)
│   └── Use paragraph-based with 50-token overlap
├── Code
│   └── Use function/class boundaries
└── Mixed
    └── Use hierarchical (parent + children)

Chunk Size?
├── Short answers needed → 256-512 tokens
├── Detailed answers needed → 512-1024 tokens
└── Context window limited → Keep chunks smaller

Retrieval Metrics

Metric Formula Target
Recall@K relevant_in_top_k / total_relevant >0.8
MRR 1 / rank_of_first_relevant >0.5
NDCG DCG / IDCG >0.7

Common Issues

Symptom Likely Cause Fix
Irrelevant results Wrong embedding model Try e5, bge, or domain-specific
Missing obvious matches Chunk too small Increase chunk size
Partial answers Relevant info split Add overlap, semantic chunking
Slow retrieval No index Add HNSW/IVF index

Chapter 8: Agentic Systems

Agent Loop (ReAct Pattern)

┌─────────────────────────────────────────────────────────────────────────────┐
│ ReAct AGENT LOOP                                                            │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────┐    ┌──────────┐    ┌─────────┐    ┌────────────┐
│  Query  │───▶│  Think   │───▶│  Act    │───▶│  Observe   │
└─────────┘    │ (Reason) │    │ (Tool)  │    │  (Result)  │
               └──────────┘    └─────────┘    └────────────┘
                    ▲                              │
                    │                              │
                    └──────────────────────────────┘
                         (Iterate until done)

Tool Definition Template

{
    "name": "search_database",
    "description": "Search the product database by query",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query"
            },
            "limit": {
                "type": "integer",
                "description": "Max results (default 10)"
            }
        },
        "required": ["query"]
    }
}

Safety Checklist

□ Rate limiting per user/session
□ Tool call validation (allowlists)
□ Sandboxed execution for code
□ Human confirmation for dangerous actions
□ Maximum steps/tokens limit
□ Audit logging enabled
□ Timeout on all operations

Chapter 9: LLM Deployment

Serving Options Decision Tree

┌─────────────────────────────────────────────────────────────────────────────┐
│ WHICH SERVING OPTION?                                                       │
└─────────────────────────────────────────────────────────────────────────────┘

Traffic Volume?
├── <1K req/day → API (OpenAI, Anthropic)
├── 1K-100K req/day → API or Self-hosted (depends on cost/privacy)
└── >100K req/day → Self-hosted likely cheaper

Privacy Requirements?
├── PII in prompts → Self-hosted or private API deployment
└── No PII → API is fine

Latency Requirements?
├── <500ms p99 → Self-hosted with warm instances
└── >1s OK → API works

Self-Hosted Stack:
├── Inference Engine: vLLM (recommended) or TGI
├── Load Balancer: nginx or cloud LB
├── Metrics: Prometheus + Grafana
└── Scaling: Kubernetes with GPU nodes

vLLM Quick Start

# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.9

# Test
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
         "messages": [{"role": "user", "content": "Hello!"}]}'

Optimization Checklist

□ Enable continuous batching
□ Set appropriate gpu-memory-utilization (0.85-0.95)
□ Use quantization if memory-constrained (INT8 first, then INT4)
□ Enable KV cache (default in vLLM)
□ Configure appropriate max-model-len
□ Add response caching for repeated queries
□ Monitor GPU utilization (target >70%)

Chapter 15: MLOps & Evaluation

Metrics Quick Reference

┌─────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION METRICS                                                          │
└─────────────────────────────────────────────────────────────────────────────┘

RETRIEVAL
├── Recall@K:    % of relevant docs in top K
├── MRR:         1 / position of first relevant
├── NDCG:        Ranking quality score
└── Precision@K: % of top K that are relevant

GENERATION (Automated)
├── BLEU:        N-gram overlap with reference
├── ROUGE:       Recall-oriented overlap
├── BERTScore:   Semantic similarity
└── Exact Match: Binary correct/incorrect

GENERATION (LLM-as-Judge)
├── Accuracy:    Is the answer correct?
├── Helpfulness: Does it address the question?
├── Safety:      Is it appropriate?
├── Groundedness: Is it supported by context?
└── Coherence:   Is it well-written?

LLM-as-Judge Template

JUDGE_PROMPT = """Rate this response on a scale of 1-5.

Question: {question}
Response: {response}
Reference (if available): {reference}

Criteria:

- 5: Excellent, fully correct and helpful
- 4: Good, mostly correct with minor issues
- 3: Acceptable, partially correct
- 2: Poor, significant errors
- 1: Unacceptable, wrong or harmful

Provide your rating and brief justification.
Rating:"""

Chapter 16: Security

Threat Model

┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM SECURITY THREATS                                                        │
└─────────────────────────────────────────────────────────────────────────────┘

PROMPT INJECTION
├── Direct: User input tries to override system prompt
├── Indirect: External data (RAG, tools) contains injection
└── Defense: Input validation, prompt hardening, output filtering

JAILBREAKING
├── Attempts to bypass safety training
├── Often uses roleplay, hypotheticals
└── Defense: Safety classifiers, content filtering

DATA LEAKAGE
├── System prompt extraction
├── Training data extraction
├── PII in responses
└── Defense: Output filtering, don't put secrets in prompts

AGENT-SPECIFIC
├── Malicious tool calls
├── Unauthorized actions
├── Sandbox escapes
└── Defense: Tool policies, sandboxing, confirmation

Defense Checklist

INPUT
□ Length limits (max tokens)
□ Rate limiting (per user/IP)
□ Known pattern detection
□ Unicode normalization

PROMPT
□ Clear instruction/data separation
□ Explicit boundaries ("User input below:")
□ Don't trust user input
□ No secrets in system prompts

OUTPUT
□ Content filtering
□ System prompt leak detection
□ PII scrubbing
□ Format validation

AGENTS
□ Tool allowlists
□ Sandboxed execution
□ Human-in-the-loop for risky actions
□ Audit logging

Chapter 27: Performance Engineering

Optimization Decision Tree

┌─────────────────────────────────────────────────────────────────────────────┐
│ PERFORMANCE OPTIMIZATION                                                    │
└─────────────────────────────────────────────────────────────────────────────┘

What's the bottleneck?
├── Memory (OOM)
│   ├── Try INT8 quantization first
│   ├── Then INT4 if needed
│   ├── Reduce max_model_len
│   └── Enable PagedAttention (vLLM default)
│
├── Latency (Slow TTFT)
│   ├── Reduce input tokens (shorter context)
│   ├── Use smaller model
│   ├── Add more GPU memory (larger batch)
│   └── Try speculative decoding
│
└── Throughput (Low req/s)
    ├── Enable continuous batching
    ├── Increase GPU memory utilization
    ├── Add more replicas
    └── Consider tensor parallelism for large models

Quantization Options

Method Bits Quality Loss Speed Memory
FP16 16 None Baseline Baseline
INT8 8 Minimal ~Same 50%
INT4 (GPTQ/AWQ) 4 Small Faster 25%
NF4 (QLoRA) 4 Small Faster 25%

Key Formulas

Memory (FP16) ≈ Parameters × 2 bytes + KV Cache
Memory (INT8) ≈ Parameters × 1 byte + KV Cache
KV Cache ≈ 2 × layers × batch × seq_len × hidden_dim × 2 bytes

Throughput ≈ GPU_Memory / (Model_Size + KV_Cache_per_request)

Chapter 31: Reliability Engineering

SLO Targets (Typical)

Metric Target Measurement
Availability 99.9% uptime / total_time
Latency (p50) <1s 50th percentile
Latency (p99) <5s 99th percentile
Error Rate <0.1% errors / requests
Quality Score >4.0/5.0 LLM-as-judge average

Graceful Degradation Levels

Level 1: Normal Operation
         └── Full functionality

Level 2: Degraded - Use Cache
         └── Return cached responses for similar queries

Level 3: Degraded - Simpler Model
         └── Route to faster/smaller model

Level 4: Degraded - Limited Features
         └── Disable non-essential features (streaming, personalization)

Level 5: Fallback - Static Response
         └── Return pre-written responses for common queries

Circuit Breaker Pattern

if error_rate > threshold:
    open_circuit()      # Stop sending requests
    wait(cool_down)     # Let service recover
    half_open()         # Try a few requests
    if healthy:
        close_circuit() # Resume normal operation

Chapter 32: Cost Engineering

Cost Estimation

┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM COST CALCULATION                                                        │
└─────────────────────────────────────────────────────────────────────────────┘

API Cost = (Input_Tokens × Input_Price) + (Output_Tokens × Output_Price)

Example (GPT-5):
  1000 input tokens × $0.00125/1K = $0.00125
  500 output tokens × $0.01/1K = $0.005
  Total: $0.00625 per request

Self-Hosted Cost = GPU_Hours × GPU_Cost_Per_Hour / Requests_Per_Hour

Example (A100, 70B model):
  $2/hour GPU × 24 hours = $48/day
  10,000 requests/day
  Cost: $0.0048 per request

Cost Optimization Checklist

□ Enable semantic caching (30-50% savings typical)
□ Route simple queries to smaller models
□ Set appropriate max_tokens limits
□ Compress context where possible
□ Batch requests when latency allows
□ Use spot/preemptible instances for batch jobs
□ Monitor and alert on cost anomalies

Build vs Buy Quick Check

Factor Favors Build Favors Buy
Volume >100K req/day <10K req/day
Privacy PII in prompts No sensitive data
Customization Need fine-tuning Prompt is enough
Latency <500ms required 1-2s acceptable
Team ML/infra expertise Small team
Timeline Can invest 3+ months Need it now

Common Commands

vLLM

# Start server
python -m vllm.entrypoints.openai.api_server --model MODEL_NAME

# With options
--gpu-memory-utilization 0.9  # Use 90% GPU memory
--max-model-len 8192          # Max context length
--quantization awq            # Use AWQ quantization
--tensor-parallel-size 2      # Split across 2 GPUs

Evaluation

# Run promptfoo evaluation
promptfoo eval -c promptfooconfig.yaml

# Run LM-eval benchmark
lm_eval --model vllm --model_args pretrained=MODEL --tasks hellaswag

Monitoring

# GPU monitoring
nvidia-smi -l 1                    # Refresh every 1 second
nvitop                             # Interactive GPU monitor

# Container monitoring
docker stats                       # Resource usage
kubectl top pods -n ml             # K8s pod resources

Part II: Code Example Index

A searchable quick-reference for all code implementations in this textbook. Use this to find specific patterns, classes, and utilities across the reference files. Each entry links to the source file and provides a brief description of what the code does.


By Topic

RAG and Retrieval

Example File Description
DocumentExtractor reference/07b_rag_systems_code.md Extract and clean text from PDF/DOCX/HTML with metadata
RecursiveChunker reference/07b_rag_systems_code.md Split documents into semantic chunks with overlap
SemanticChunker reference/07b_rag_systems_code.md Chunk based on semantic similarity between sentences
FAISSVectorSearch reference/07b_rag_systems_code.md Vector similarity search using FAISS index
BM25Index reference/07b_rag_systems_code.md Keyword-based search using BM25 algorithm
HybridRetriever reference/07b_rag_systems_code.md Combine vector and BM25 search with RRF fusion
Reranker reference/07b_rag_systems_code.md Cross-encoder reranking for improved relevance
ContextAssembler reference/07b_rag_systems_code.md Build context with source attribution for generation
RAGPipeline reference/07b_rag_systems_code.md Complete end-to-end RAG pipeline orchestration
GraphRAG reference/07b_rag_systems_code.md Knowledge graph-enhanced retrieval with entity linking
RAGDebugger reference/07b_rag_systems_code.md Debug RAG pipeline with stage-by-stage analysis
RAGMonitor reference/07b_rag_systems_code.md Production monitoring for retrieval quality metrics
ProductionRAG reference/07b_rag_systems_code.md Production-ready RAG with caching, fallbacks, monitoring
EmbeddingFeatureStore reference/24_data_architecture_code.md Store and retrieve embedding features with vector search

Agents and Tool Use

Example File Description
MCPServer reference/08a_mcp_server_pattern.md Model Context Protocol server for document search tools
SupervisorAgent reference/08b_agent_orchestration.md Coordinate worker agents with task decomposition and synthesis
WorkerAgent reference/08b_agent_orchestration.md Base class for specialist worker agents
CircuitBreaker reference/08b_agent_orchestration.md Prevent infinite loops with state similarity detection
agentic_loop reference/08c_agentic_systems_code.md Basic tool execution loop with Claude API
ReActAgent reference/08c_agentic_systems_code.md Reasoning and acting agent with thought-action-observation
PlanningAgent reference/08c_agentic_systems_code.md Agent that creates and executes multi-step plans
HierarchicalAgent reference/08c_agentic_systems_code.md Multi-level agent hierarchy for complex workflows
SelfReflectingAgent reference/08c_agentic_systems_code.md Agent that evaluates and improves its own outputs
MemoryAugmentedAgent reference/08c_agentic_systems_code.md Agent with episodic and working memory systems
ComputerUseAgent reference/08c_agentic_systems_code.md Agent that interacts with computer via screenshots
ScopedToolSet reference/08c_agentic_systems_code.md Capability-scoped tools with permission levels
SafeAgent reference/08c_agentic_systems_code.md Agent with safety guards and human approval workflows
SandboxedCodeExecutor reference/08c_agentic_systems_code.md Secure code execution in Docker containers
LoopDetector reference/08c_agentic_systems_code.md Detect and prevent agent infinite loops
AgentEvaluator reference/08c_agentic_systems_code.md Evaluate agent performance on task completion
react_agent reference/06b_prompt_engineering_code.md Simple ReAct implementation with tool parsing
analyze_document_chain reference/06b_prompt_engineering_code.md Multi-step document analysis with prompt chaining

LLM Deployment and Serving

Example File Description
InferenceMetrics reference/09_llm_deployment_code.md Track TTFT, TPS, latency, throughput metrics
estimate_memory_gb reference/09_llm_deployment_code.md Calculate GPU memory requirements for model loading
StaticBatcher reference/09_llm_deployment_code.md Fixed-size batch inference for predictable workloads
ContinuousBatcher reference/09_llm_deployment_code.md Dynamic batching with adaptive batch sizing
InferenceProfiler reference/09_llm_deployment_code.md Profile inference performance with bottleneck detection
KVCacheManager reference/09_llm_deployment_code.md Manage key-value cache for transformer inference
SpeculativeDecoder reference/09_llm_deployment_code.md Speculative decoding with draft model verification
PrefixCache reference/09_llm_deployment_code.md Cache and reuse common prompt prefixes
LLMLoadBalancer reference/09_llm_deployment_code.md Route requests across model replicas
LoadShedder reference/09_llm_deployment_code.md Gracefully shed load during overload
GracefulServer reference/09_llm_deployment_code.md Server with graceful shutdown and health checks
CostEstimator reference/09_llm_deployment_code.md Estimate inference costs for different configurations

Prompt Engineering

Example File Description
DynamicFewShotClassifier reference/06b_prompt_engineering_code.md Select relevant examples based on query similarity
ContextManager reference/06b_prompt_engineering_code.md Allocate context budget across prompt components
manage_conversation_with_summary reference/06b_prompt_engineering_code.md Summarize old messages to manage long conversations
chunk_document reference/06b_prompt_engineering_code.md Chunk document with overlap for continuity
chunk_by_semantic_similarity reference/06b_prompt_engineering_code.md Chunk based on semantic similarity thresholds
Classification prompts reference/06a_prompt_template_library.md Templates for intent classification, sentiment analysis
Chain-of-Thought prompts reference/06a_prompt_template_library.md Step-by-step reasoning templates
RAG prompts reference/06a_prompt_template_library.md Context injection and citation templates
Agent prompts reference/06a_prompt_template_library.md Tool use and ReAct-style prompts
Summarization prompts reference/06a_prompt_template_library.md Document and conversation summarization
Code generation prompts reference/06a_prompt_template_library.md Code writing with test generation

Security

Example File Description
InputValidator reference/12a_security_defense_patterns.md Validate and sanitize user inputs
create_sandwiched_prompt reference/12a_security_defense_patterns.md Wrap user input with system instructions
OutputFilter reference/12a_security_defense_patterns.md Filter sensitive data from model outputs
AdaptiveRateLimiter reference/12a_security_defense_patterns.md Rate limiting with adaptive thresholds
SecureToolExecutor reference/12a_security_defense_patterns.md Execute tools with capability restrictions
SecureLLMPipeline reference/12a_security_defense_patterns.md Complete request pipeline with all defenses
HardenedPromptBuilder reference/12b_security_adversarial_code.md Build prompts resistant to injection attacks
OutputValidator reference/12b_security_adversarial_code.md Validate output structure and content safety
DualLLMGuard reference/12b_security_adversarial_code.md Use separate LLM to validate requests/responses
StructuredOutputDefense reference/12b_security_adversarial_code.md Constrain outputs to safe structured formats
SafetyGuard reference/12b_security_adversarial_code.md Multi-layer safety checks for content
JailbreakResponseHandler reference/12b_security_adversarial_code.md Detect and handle jailbreak attempts
DataLeakagePreventor reference/12b_security_adversarial_code.md Prevent sensitive data in model outputs
SessionIsolator reference/12b_security_adversarial_code.md Isolate user sessions to prevent cross-contamination
SecureLogger reference/12b_security_adversarial_code.md Log requests with PII redaction
AutomatedRedTeam reference/12b_security_adversarial_code.md Automated adversarial testing framework
CISecurityGate reference/12b_security_adversarial_code.md Security validation in CI/CD pipeline
SecurityMonitor reference/12b_security_adversarial_code.md Real-time security monitoring and alerting

Evaluation and MLOps

Example File Description
RAGEvaluator reference/11a_evaluation_harness.md LLM-as-judge evaluator with calibration
JUDGE_PROMPT reference/11a_evaluation_harness.md Calibrated judge prompt with explicit rubric
EvaluationPipeline reference/11b_mlops_evaluation_code.md Run evaluation suites with parallel execution
LLMJudge reference/11b_mlops_evaluation_code.md Single-aspect LLM judgment
PairwiseJudge reference/11b_mlops_evaluation_code.md Compare two outputs for preference
MultiAspectJudge reference/11b_mlops_evaluation_code.md Evaluate multiple quality dimensions
JudgeCalibrator reference/11b_mlops_evaluation_code.md Calibrate judge consistency with known examples
ExperimentManager reference/11b_mlops_evaluation_code.md Manage A/B test experiments
ExperimentAnalyzer reference/11b_mlops_evaluation_code.md Statistical analysis of experiment results
ModelComparator reference/11b_mlops_evaluation_code.md Compare model versions with metrics
QualityMonitor reference/11b_mlops_evaluation_code.md Monitor production quality metrics
DriftDetector reference/11b_mlops_evaluation_code.md Detect feature and prediction drift

Performance and Optimization

Example File Description
InferenceProfiler reference/21_performance_engineering_code.md Profile inference with bottleneck detection
BottleneckDiagnoser reference/21_performance_engineering_code.md Diagnose performance bottlenecks
ModelOptimizer reference/21_performance_engineering_code.md Optimize models with torch.compile
TensorRTOptimizer reference/21_performance_engineering_code.md TensorRT conversion for NVIDIA GPUs
CUDAGraphOptimizer reference/21_performance_engineering_code.md CUDA graph capture for reduced overhead
TensorParallelInference reference/21_performance_engineering_code.md Shard models across multiple GPUs
OptimizedKVCache reference/21_performance_engineering_code.md Efficient KV cache implementation
PagedKVCache reference/21_performance_engineering_code.md vLLM-style paged attention cache
QuantizationAnalyzer reference/21_performance_engineering_code.md Analyze quantization impact on accuracy
GradientCheckpointAnalyzer reference/21_performance_engineering_code.md Optimize memory with gradient checkpointing
AttentionProfiler reference/21_performance_engineering_code.md Profile attention mechanisms
LatencyThroughputOptimizer reference/21_performance_engineering_code.md Balance latency vs throughput tradeoffs
StreamingOptimizer reference/21_performance_engineering_code.md Optimize token streaming performance
CostOptimizer reference/21_performance_engineering_code.md Optimize inference cost per request
BenchmarkSuite reference/21_performance_engineering_code.md Comprehensive benchmarking framework

Cost Engineering

Example File Description
calculate_cost reference/26a_cost_estimation_formulas.md Calculate token costs for API calls
project_monthly_cost reference/26a_cost_estimation_formulas.md Project monthly costs from usage patterns
calculate_rag_cost reference/26a_cost_estimation_formulas.md Calculate RAG system costs including retrieval
calculate_breakeven reference/26a_cost_estimation_formulas.md GPU self-hosting break-even analysis
CostCategory reference/26b_cost_engineering_code.md Enum of AI cost categories
CostItem reference/26b_cost_engineering_code.md Model individual cost items
AICostBreakdown reference/26b_cost_engineering_code.md Complete cost breakdown for AI system
TCOCalculator reference/26b_cost_engineering_code.md Calculate total cost of ownership
CostAttributionEngine reference/26b_cost_engineering_code.md Attribute costs to teams and projects
UnitEconomicsCalculator reference/26b_cost_engineering_code.md Calculate cost per request/user/revenue
GPUOptimizer reference/26b_cost_engineering_code.md Analyze and optimize GPU utilization
InferenceCostOptimizer reference/26b_cost_engineering_code.md Optimize inference costs
TrainingCostOptimizer reference/26b_cost_engineering_code.md Optimize training costs with strategy selection
BuildVsBuyAnalyzer reference/26b_cost_engineering_code.md Analyze build vs buy decisions
CrossoverAnalysis reference/26b_cost_engineering_code.md Calculate when build becomes cheaper than buy
APICostManager reference/26b_cost_engineering_code.md Track and manage API costs with budgets
APICostOptimizer reference/26b_cost_engineering_code.md Recommend API cost optimizations
TokenOptimizer reference/26b_cost_engineering_code.md Analyze and optimize token usage
CostDashboard reference/26b_cost_engineering_code.md Cost visibility dashboard for teams
CostGovernanceEnforcer reference/26b_cost_engineering_code.md Enforce cost governance policies
ScalingCostModel reference/26b_cost_engineering_code.md Project costs at different scale

Data Architecture

Example File Description
ArchitectureSelector reference/24_data_architecture_code.md Choose between Lambda and Kappa architectures
SchemaRegistry reference/24_data_architecture_code.md Central schema registry with compatibility checking
ContractEnforcer reference/24_data_architecture_code.md Enforce data contracts between producers and consumers
FeatureStore reference/24_data_architecture_code.md Feature store with online/offline serving
FeatureStoreSelectorTool reference/24_data_architecture_code.md Select the right feature store platform
FeatureEngineerer reference/24_data_architecture_code.md Feature engineering utilities (aggregations, cyclical encoding)
EmbeddingFeatureStore reference/24_data_architecture_code.md Specialized storage for embedding features
SkewDetector reference/24_data_architecture_code.md Detect training-serving skew
UnifiedFeatureComputation reference/24_data_architecture_code.md Single source of truth for feature computation
PreprocessingPackager reference/24_data_architecture_code.md Package preprocessing with model artifacts
SkewMonitoringDashboard reference/24_data_architecture_code.md Monitor training-serving skew in production
LineageTracker reference/24_data_architecture_code.md Track data lineage across assets
DataVersionManager reference/24_data_architecture_code.md Manage versioned data for reproducibility
DataObservabilityMonitor reference/24_data_architecture_code.md Monitor data observability (freshness, volume, schema)
DataAnomalyDetector reference/24_data_architecture_code.md Detect anomalies in data metrics
DataQualityMonitor reference/24_data_architecture_code.md Monitor data quality with configurable checks
FeatureGovernanceSystem reference/24_data_architecture_code.md Manage feature governance and deprecation
FreshnessMonitor reference/24_data_architecture_code.md Monitor feature freshness with SLAs
LeakageDetector reference/24_data_architecture_code.md Detect data leakage in training data
RealTimeQualityGate reference/24_data_architecture_code.md Lightweight quality checks for real-time pipelines

Reliability

Example File Description
AISLOFramework reference/25b_reliability_engineering_code.md Define AI-specific SLOs (availability, latency, quality, freshness, safety)
ErrorBudgetTracker reference/25b_reliability_engineering_code.md Track error budget consumption
RealTimeQualityEstimator reference/25b_reliability_engineering_code.md Estimate quality in real-time using proxy metrics
GracefulDegradationController reference/25b_reliability_engineering_code.md Control graceful degradation with multiple levels
CircuitBreaker reference/25b_reliability_engineering_code.md Circuit breaker pattern for model calls
FallbackExecutor reference/25b_reliability_engineering_code.md Execute fallback strategies by level
AIIncidentManager reference/25b_reliability_engineering_code.md Manage AI system incidents with runbooks
PostIncidentReviewTemplate reference/25b_reliability_engineering_code.md Generate post-incident review templates
IncidentMetrics reference/25b_reliability_engineering_code.md Track incident metrics (MTTR, by type, by severity)
ReliableTrainingPipeline reference/25b_reliability_engineering_code.md Training pipeline with checkpointing and retries
LoadSheddingStrategy reference/25b_reliability_engineering_code.md Priority-based load shedding
AdaptiveRateLimiter reference/25b_reliability_engineering_code.md Rate limiter that adapts to model latency
ReliableServingPipeline reference/25b_reliability_engineering_code.md Serving pipeline with fallbacks and circuit breakers
AIObservabilityPipeline reference/25b_reliability_engineering_code.md Observability pipeline for AI systems
DistributionMonitor reference/25b_reliability_engineering_code.md Monitor feature and prediction distributions
ChaosRunner reference/25b_reliability_engineering_code.md Run chaos experiments on AI systems
GameDay reference/25b_reliability_engineering_code.md Organize game day exercises for reliability

By Implementation Type

Complete Systems

End-to-end implementations that you can adapt for production use.

Example File Description
ProductionRAG reference/07b_rag_systems_code.md Production-ready RAG with caching, fallbacks, monitoring
RAGPipeline reference/07b_rag_systems_code.md Complete end-to-end RAG pipeline orchestration
GraphRAG reference/07b_rag_systems_code.md Knowledge graph-enhanced retrieval system
SecureLLMPipeline reference/12a_security_defense_patterns.md Complete request pipeline with all security defenses
MCPServer reference/08a_mcp_server_pattern.md Complete MCP server for document search
SupervisorAgent reference/08b_agent_orchestration.md Full supervisor-worker agent architecture
ReliableServingPipeline reference/25b_reliability_engineering_code.md Production serving with reliability features
ReliableTrainingPipeline reference/25b_reliability_engineering_code.md Training pipeline with checkpointing and retries
EvaluationPipeline reference/11b_mlops_evaluation_code.md Full evaluation suite with parallel execution
AIObservabilityPipeline reference/25b_reliability_engineering_code.md Complete observability for AI systems

Utility Functions

Standalone functions for common operations.

Example File Description
estimate_memory_gb reference/09_llm_deployment_code.md Calculate GPU memory requirements
calculate_cost reference/26a_cost_estimation_formulas.md Calculate token costs for API calls
calculate_breakeven reference/26a_cost_estimation_formulas.md GPU self-hosting break-even analysis
chunk_document reference/06b_prompt_engineering_code.md Chunk document with overlap
chunk_by_semantic_similarity reference/06b_prompt_engineering_code.md Semantic-based document chunking
create_sandwiched_prompt reference/12a_security_defense_patterns.md Wrap user input with system instructions
manage_conversation_with_summary reference/06b_prompt_engineering_code.md Summarize long conversations
encode_cyclical_time reference/24_data_architecture_code.md Encode cyclical time features with sin/cos
compute_embedding_similarity reference/24_data_architecture_code.md Cosine similarity between embeddings

Configuration Examples

Patterns and templates for configuration.

Example File Description
JUDGE_PROMPT reference/11a_evaluation_harness.md LLM-as-judge prompt with rubric
PLANNING_PROMPT reference/08b_agent_orchestration.md Agent task decomposition prompt
SYNTHESIS_PROMPT reference/08b_agent_orchestration.md Agent result synthesis prompt
Classification prompts reference/06a_prompt_template_library.md Intent and sentiment classification templates
CoT prompts reference/06a_prompt_template_library.md Chain-of-thought reasoning templates
RAG prompts reference/06a_prompt_template_library.md Context injection templates
FALLBACK_STRATEGIES reference/25b_reliability_engineering_code.md Fallback strategy configurations by system type
CHAOS_EXPERIMENTS reference/25b_reliability_engineering_code.md Chaos experiment definitions
AI_INCIDENT_SIGNATURES reference/25b_reliability_engineering_code.md Incident type signatures with runbooks
COST_POLICIES reference/26b_cost_engineering_code.md Cost governance policy examples
TCO_COMPONENTS reference/26b_cost_engineering_code.md Total cost of ownership components
FEATURE_STORE_PLATFORMS reference/24_data_architecture_code.md Feature store platform comparison
CALIBRATION_SET reference/11a_evaluation_harness.md Judge calibration set structure

Cross-Reference by Chapter

Use this to find all code examples related to a specific chapter.

Cross-Reference by Chapter
Chapter Reference Files Key Classes/Functions
Ch 5 05_llm_nlp_foundations_code.md Tokenizer, embedding, attention implementations
Ch 6 06a_prompt_template_library.md, 06b_prompt_engineering_code.md Prompts, DynamicFewShotClassifier, ContextManager
Ch 7 07a_rag_pipeline_architecture.md, 07b_rag_systems_code.md RAGPipeline, HybridRetriever, Reranker, GraphRAG
Ch 8 08a_mcp_server_pattern.md, 08b_agent_orchestration.md, 08c_agentic_systems_code.md SupervisorAgent, ReActAgent, SafeAgent
Ch 9 09_llm_deployment_code.md ContinuousBatcher, KVCacheManager, LLMLoadBalancer
Ch 15 11a_evaluation_harness.md, 11b_mlops_evaluation_code.md RAGEvaluator, LLMJudge, ExperimentManager
Ch 16 12a_security_defense_patterns.md, 12b_security_adversarial_code.md InputValidator, SecureLLMPipeline, AutomatedRedTeam
Ch 27 21_performance_engineering_code.md InferenceProfiler, QuantizationAnalyzer, PagedKVCache
Ch 30 24_data_architecture_code.md FeatureStore, SkewDetector, LineageTracker
Ch 31 25a_incident_runbook.md, 25b_reliability_engineering_code.md AISLOFramework, CircuitBreaker, ChaosRunner
Ch 32 26a_cost_estimation_formulas.md, 26b_cost_engineering_code.md TCOCalculator, APICostManager, BuildVsBuyAnalyzer

Runnable Code Projects

Complete, installable Python projects in the code/ directory. These are fully functional applications you can run locally.

Document Q&A (Chapter 4)

Location: code/ch04_document_qa/

Module Description
document_qa.chunker Document chunking with paragraph/sentence boundaries
document_qa.embeddings OpenAI embedding integration with batching
document_qa.store ChromaDB vector store with search
document_qa.generator Answer generation with context and citations
document_qa.pipeline End-to-end RAG pipeline orchestration
document_qa.cli Command-line interface for indexing and querying
cd code/ch04_document_qa && pip install -e .
document-qa index ../datasets/documents
document-qa ask "What is the vacation policy?"

Advanced RAG (Chapter 7)

Location: code/ch07_rag_advanced/

Module Description
src.chunking Multiple chunking strategies (recursive, semantic, sentence, sliding window)
src.embeddings Cached embeddings with OpenAI and sentence-transformers
src.retrieval Hybrid search combining BM25 and vector search with RRF fusion
src.reranking Cross-encoder reranking for improved relevance
src.pipeline RAGPipeline with caching and observability
src.evaluation Retrieval evaluation metrics (MRR, NDCG, Recall@K)

Agentic Systems (Chapter 8)

Location: code/ch08_agents/

Module Description
src.tools Tool interface, ToolRegistry, ToolResult
src.react_agent ReAct agent implementation with thought-action-observation
src.supervisor Supervisor-worker multi-agent architecture
src.safety Safety checks, rate limiting, loop detection
src.sandbox Docker-based sandboxed code execution
src.mcp_server Model Context Protocol server
tools.calculator Mathematical operations tool
tools.web_search Mock web search for testing
tools.file_ops Sandboxed file operations
tools.code_executor Sandboxed Python execution

Evaluation Harness (Chapter 15)

Location: code/ch11_evaluation/

Module Description
src.metrics RetrievalMetrics (MRR, NDCG, Precision@K, Recall@K)
src.llm_judge LLMJudge, PairwiseJudge, MultiAspectJudge
src.calibration JudgeCalibrator for consistency validation
src.ab_testing ABTest, ExperimentManager, statistical analysis
src.drift_detection DriftDetector for quality/distribution drift
src.reporting ReportGenerator for HTML/JSON reports

Additional Projects

Project Location Description
prompt_lab code/prompt_lab/ Interactive prompt testing tool
self_hosted_llm code/self_hosted_llm/ vLLM deployment with Docker
security_scanner code/security_scanner/ Prompt injection detector
cost_tracker code/cost_tracker/ Token and cost monitoring

How to Use This Index

  1. Running complete examples: Check “Runnable Code Projects” for installable Python projects
  2. Finding a specific pattern: Use the topic sections to locate implementations for your use case
  3. Starting a new project: Check “Complete Systems” for end-to-end examples
  4. Adding a utility: Look in “Utility Functions” for standalone helpers
  5. Setting up configuration: Check “Configuration Examples” for templates and prompts
  6. Deep diving into a chapter: Use the cross-reference table to find all related code

All file paths are relative to the repository root. Reference implementations are at reference/<filename>. Runnable projects are at code/<project>/.