Appendix H: Reference Cards & Code Index

This appendix combines two reference resources: Quick Reference Cards, which summarize each major topic on one page, and a Code Example Index for locating specific patterns and implementations across the textbook.

Part I: Quick Reference Cards

One-page summaries for each major topic. Print these or keep them handy during development.

Chapter 5: LLM/NLP Foundations

Key Concepts

┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM FUNDAMENTALS                                                            │
└─────────────────────────────────────────────────────────────────────────────┘

TOKENIZATION
├── BPE (Byte-Pair Encoding): Most common, good compression
├── WordPiece: Used by BERT
├── SentencePiece: Language-agnostic
└── Tokens ≠ Words: "tokenization" → ["token", "ization"]

EMBEDDINGS
├── Dense vectors representing semantic meaning
├── Similar meanings → similar vectors
├── Dimensionality: 768-4096 typical
└── Cosine similarity for comparison

ATTENTION
├── Query, Key, Value matrices
├── Attention(Q,K,V) = softmax(QK^T/√d_k)V
├── Complexity: O(n²) in sequence length n
└── Enables long-range dependencies

TRANSFORMER ARCHITECTURE
├── Encoder: Bidirectional (BERT-style)
├── Decoder: Autoregressive (GPT-style)
├── Encoder-Decoder: Seq2seq (T5-style)
└── Modern LLMs: Decoder-only dominant

Quick Formulas

Concept	Formula/Rule
Attention	softmax(QK^T / √d_k) × V
Token estimate	~4 characters per token (English)
Context usage	prompt_tokens + max_tokens ≤ context_window
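
The token estimate and context-usage rule above can be combined into a quick budget check. This is a minimal sketch: the 4-characters-per-token figure is only the rule of thumb from the table, not a real tokenizer, and the function names are illustrative.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(prompt_tokens: int, max_tokens: int, context_window: int) -> bool:
    """Check the rule prompt_tokens + max_tokens <= context_window."""
    return prompt_tokens + max_tokens <= context_window


prompt = "Summarize the following report in three bullet points."
prompt_tokens = estimate_tokens(prompt)
print(prompt_tokens, fits_context(prompt_tokens, max_tokens=512, context_window=8192))
```

For billing or hard limits, count tokens with the model's own tokenizer instead of this estimate.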

Common Model Sizes

Model	Parameters	VRAM (FP16, weights only)	VRAM (INT8, weights only)
7B	7 billion	~14 GB	~7 GB
13B	13 billion	~26 GB	~13 GB
70B	70 billion	~140 GB	~70 GB

Chapter 6: Prompt Engineering

Prompt Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│ PROMPTING PATTERNS                                                          │
└─────────────────────────────────────────────────────────────────────────────┘

ZERO-SHOT
"Classify this text as positive or negative: {text}"

FEW-SHOT
"Examples:
 Text: Great product! → positive
 Text: Terrible service → negative
 Text: {text} →"

CHAIN-OF-THOUGHT
"Let's solve this step by step:
 1. First, identify...
 2. Then, calculate...
 3. Finally, conclude..."

STRUCTURED OUTPUT
"Respond with valid JSON matching this schema:
 {\"sentiment\": \"positive|negative\", \"confidence\": 0.0-1.0}"
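
Two of the patterns above, few-shot and structured output, are easy to wrap in small helpers. The sketch below assumes the example format and JSON schema shown in the card; the function names are hypothetical and no model is called.

```python
import json


def build_few_shot_prompt(examples: list[tuple[str, str]], text: str) -> str:
    """Format labeled examples followed by the unlabeled query text."""
    lines = ["Examples:"]
    for example_text, label in examples:
        lines.append(f"Text: {example_text} → {label}")
    lines.append(f"Text: {text} →")
    return "\n".join(lines)


def parse_sentiment(raw: str) -> tuple[str, float]:
    """Validate a model reply against the sentiment/confidence schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    sentiment = data.get("sentiment")
    confidence = data.get("confidence")
    if sentiment not in ("positive", "negative"):
        raise ValueError(f"unexpected sentiment: {sentiment!r}")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    return sentiment, float(confidence)


examples = [("Great product!", "positive"), ("Terrible service", "negative")]
print(build_few_shot_prompt(examples, "Arrived late but works fine"))
print(parse_sentiment('{"sentiment": "positive", "confidence": 0.82}'))
```

Validating the reply rather than trusting it lets the caller retry or fall back when the model drifts from the requested format.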

Best Practices

Do	Don’t
Be specific and explicit	Assume the model knows context
Use delimiters for data	Mix instructions with data
Provide examples	Give vague instructions
Set output format	Leave format ambiguous
Include constraints	Forget edge cases

Temperature Guide

Temperature	Use Case
0.0-0.3	Factual, near-deterministic (code, extraction)
0.4-0.7	Balanced (most applications)
0.8-1.0	Creative (brainstorming, writing)

Chapter 7: RAG Systems

Pipeline Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│ RAG PIPELINE                                                                │
└─────────────────────────────────────────────────────────────────────────────┘

INDEXING (Offline)
Documents → Extract → Chunk → Embed → Store
                ↓         ↓        ↓       ↓
            Clean text  512 tok  1024-dim  Vector DB

RETRIEVAL (Online)
Query → Embed → Search → Rerank → Top-K chunks
          ↓        ↓        ↓          ↓
       Same model  ANN    Cross-enc   3-5 chunks

GENERATION
Context + Query → LLM → Response + Citations

Chunking Decision Tree

Document Type?
├── Structured (API docs, specs)
│   └── Use section boundaries + semantic chunking
├── Narrative (articles, reports)
│   └── Use paragraph-based with 50-token overlap
├── Code
│   └── Use function/class boundaries
└── Mixed
    └── Use hierarchical (parent + children)

Chunk Size?
├── Short answers needed → 256-512 tokens
├── Detailed answers needed → 512-1024 tokens
└── Context window limited → Keep chunks smaller
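
The paragraph-style branch of the tree (fixed-size chunks with overlap) might look like the sketch below. It assumes the text has already been split into tokens; the whitespace split in the usage lines is only a stand-in for a real tokenizer, and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting is an assumption made to keep the example self-contained.
words = ("lorem ipsum " * 600).split()
chunks = chunk_tokens(words, chunk_size=512, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk repeats the last `overlap` tokens of the previous one, which is what keeps an answer that straddles a boundary retrievable from at least one chunk.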

Retrieval Metrics

Metric	Formula	Target
Recall@K	relevant_in_top_k / total_relevant	>0.8
MRR	mean over queries of 1 / rank_of_first_relevant	>0.5
NDCG	DCG / IDCG	>0.7
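
The three metrics above can be computed directly from a ranked list of document IDs and a set of relevant IDs. This sketch assumes binary relevance (a document is either relevant or not); graded relevance would need per-document gains in the NDCG sum. Function names are illustrative.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average the reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """DCG / IDCG with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc_id in enumerate(ranked[:k], start=1) if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 3), reciprocal_rank(ranked, relevant),
      round(ndcg_at_k(ranked, relevant, 4), 3))
```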

Common Issues

Symptom	Likely Cause	Fix
Irrelevant results	Wrong embedding model	Try e5, bge, or domain-specific
Missing obvious matches	Chunk too small	Increase chunk size
Partial answers	Relevant info split	Add overlap, semantic chunking
Slow retrieval	No index	Add HNSW/IVF index

Chapter 8: Agentic Systems

Agent Loop (ReAct Pattern)

┌─────────────────────────────────────────────────────────────────────────────┐
│ ReAct AGENT LOOP                                                            │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────┐    ┌──────────┐    ┌─────────┐    ┌────────────┐
│  Query  │───▶│  Think   │───▶│  Act    │───▶│  Observe   │
└─────────┘    │ (Reason) │    │ (Tool)  │    │  (Result)  │
               └──────────┘    └─────────┘    └────────────┘
                    ▲                              │
                    │                              │
                    └──────────────────────────────┘
                         (Iterate until done)
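
The loop in the diagram can be sketched in a few lines. To stay runnable without an API key, the model is replaced by a scripted stand-in; the `("act", tool, input)` / `("finish", answer)` decision format is an assumption of this sketch, not a standard interface.

```python
from typing import Callable


def run_react(query: str, llm: Callable[[list[str]], tuple],
              tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Think -> Act -> Observe until the model finishes or the step limit hits."""
    transcript = [f"Query: {query}"]
    for _ in range(max_steps):
        decision = llm(transcript)                      # Think (reason)
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, tool_input = decision
        if tool_name in tools:
            observation = tools[tool_name](tool_input)  # Act (tool call)
        else:
            observation = f"error: unknown tool {tool_name!r}"
        transcript.append(f"Action: {tool_name}({tool_input})")
        transcript.append(f"Observation: {observation}")  # Observe (result)
    return "stopped: step limit reached"


def scripted_llm(transcript: list[str]) -> tuple:
    """Stand-in for a real model: call `add` once, then return the observation."""
    if transcript[-1].startswith("Observation: "):
        return ("finish", transcript[-1].removeprefix("Observation: "))
    return ("act", "add", "2,3")


tools = {"add": lambda s: str(sum(int(x) for x in s.split(",")))}
print(run_react("What is 2 + 3?", scripted_llm, tools))
```

The `max_steps` bound and the unknown-tool branch correspond to the step limit and allowlist items in the safety checklist.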

Tool Definition Template

{
    "name": "search_database",
    "description": "Search the product database by query",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query"
            },
            "limit": {
                "type": "integer",
                "description": "Max results (default 10)"
            }
        },
        "required": ["query"]
    }
}
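
Before executing a tool call, the arguments proposed by the model can be checked against the definition. The sketch below hand-rolls a check for the two types used in the template; it is deliberately not a full JSON Schema validator, and `validate_tool_call` is a hypothetical helper.

```python
SEARCH_TOOL = {
    "name": "search_database",
    "description": "Search the product database by query",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "limit": {"type": "integer", "description": "Max results (default 10)"},
        },
        "required": ["query"],
    },
}

# Only the types that appear in the template above are mapped.
TYPE_MAP = {"string": str, "integer": int}


def validate_tool_call(tool: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    schema = tool["parameters"]
    properties = schema["properties"]
    errors = []
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required parameter: {name}")
    for name, value in arguments.items():
        if name not in properties:
            errors.append(f"unknown parameter: {name}")
            continue
        expected = TYPE_MAP[properties[name]["type"]]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name} should be {properties[name]['type']}")
    return errors


print(validate_tool_call(SEARCH_TOOL, {"query": "laptops", "limit": 5}))
print(validate_tool_call(SEARCH_TOOL, {"limit": "5"}))
```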

Safety Checklist

□ Rate limiting per user/session
□ Tool call validation (allowlists)
□ Sandboxed execution for code
□ Human confirmation for dangerous actions
□ Maximum steps/tokens limit
□ Audit logging enabled
□ Timeout on all operations

Chapter 9: LLM Deployment

Serving Options Decision Tree

┌─────────────────────────────────────────────────────────────────────────────┐
│ WHICH SERVING OPTION?                                                       │
└─────────────────────────────────────────────────────────────────────────────┘

Traffic Volume?
├── <1K req/day → API (OpenAI, Anthropic)
├── 1K-100K req/day → API or Self-hosted (depends on cost/privacy)
└── >100K req/day → Self-hosted likely cheaper

Privacy Requirements?
├── PII in prompts → Self-hosted or private API deployment
└── No PII → API is fine

Latency Requirements?
├── <500ms p99 → Self-hosted with warm instances
└── >1s OK → API works

Self-Hosted Stack:
├── Inference Engine: vLLM (recommended) or TGI
├── Load Balancer: nginx or cloud LB
├── Metrics: Prometheus + Grafana
└── Scaling: Kubernetes with GPU nodes

vLLM Quick Start

# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.9

# Test
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
         "messages": [{"role": "user", "content": "Hello!"}]}'

Optimization Checklist

□ Enable continuous batching
□ Set appropriate gpu-memory-utilization (0.85-0.95)
□ Use quantization if memory-constrained (INT8 first, then INT4)
□ Enable KV cache (default in vLLM)
□ Configure appropriate max-model-len
□ Add response caching for repeated queries
□ Monitor GPU utilization (target >70%)

Chapter 15: MLOps & Evaluation

Metrics Quick Reference

┌─────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION METRICS                                                          │
└─────────────────────────────────────────────────────────────────────────────┘

RETRIEVAL
├── Recall@K:    % of relevant docs in top K
├── MRR:         Mean of 1 / rank of first relevant
├── NDCG:        Ranking quality score
└── Precision@K: % of top K that are relevant

GENERATION (Automated)
├── BLEU:        N-gram overlap with reference
├── ROUGE:       Recall-oriented overlap
├── BERTScore:   Semantic similarity
└── Exact Match: Binary correct/incorrect

GENERATION (LLM-as-Judge)
├── Accuracy:    Is the answer correct?
├── Helpfulness: Does it address the question?
├── Safety:      Is it appropriate?
├── Groundedness: Is it supported by context?
└── Coherence:   Is it well-written?

LLM-as-Judge Template

JUDGE_PROMPT = """Rate this response on a scale of 1-5.

Question: {question}
Response: {response}
Reference (if available): {reference}

Criteria:

- 5: Excellent, fully correct and helpful
- 4: Good, mostly correct with minor issues
- 3: Acceptable, partially correct
- 2: Poor, significant errors
- 1: Unacceptable, wrong or harmful

Provide your rating and brief justification.
Rating:"""
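
A judge template is only useful if the reply can be turned into a number. The sketch below fills a shortened copy of the template and pulls the first standalone digit from 1 to 5 out of the judge's reply; the shortened template and the parsing rule are assumptions of this example, and the judge output shown is made up for illustration.

```python
import re
from typing import Optional

# Shortened copy of the template so the example is self-contained.
TEMPLATE = (
    "Rate this response on a scale of 1-5.\n\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reference (if available): {reference}\n\n"
    "Provide your rating and brief justification.\n"
    "Rating:"
)

RATING_RE = re.compile(r"\b([1-5])\b")


def build_judge_prompt(question: str, response: str, reference: str = "N/A") -> str:
    """Fill the template; the reference defaults to N/A when none exists."""
    return TEMPLATE.format(question=question, response=response, reference=reference)


def parse_rating(judge_output: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the reply, or None."""
    match = RATING_RE.search(judge_output)
    return int(match.group(1)) if match else None


print(build_judge_prompt("What is the capital of France?", "Paris."))
print(parse_rating("4. Mostly correct, but it omits the caveat about units."))
```

Returning `None` instead of a default score keeps unparseable replies out of the quality average.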

Chapter 16: Security

Threat Model

┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM SECURITY THREATS                                                        │
└─────────────────────────────────────────────────────────────────────────────┘

PROMPT INJECTION
├── Direct: User input tries to override system prompt
├── Indirect: External data (RAG, tools) contains injection
└── Defense: Input validation, prompt hardening, output filtering

JAILBREAKING
├── Attempts to bypass safety training
├── Often uses roleplay, hypotheticals
└── Defense: Safety classifiers, content filtering

DATA LEAKAGE
├── System prompt extraction
├── Training data extraction
├── PII in responses
└── Defense: Output filtering, don't put secrets in prompts

AGENT-SPECIFIC
├── Malicious tool calls
├── Unauthorized actions
├── Sandbox escapes
└── Defense: Tool policies, sandboxing, confirmation

Defense Checklist

INPUT
□ Length limits (max tokens)
□ Rate limiting (per user/IP)
□ Known pattern detection
□ Unicode normalization

PROMPT
□ Clear instruction/data separation
□ Explicit boundaries ("User input below:")
□ Don't trust user input
□ No secrets in system prompts

OUTPUT
□ Content filtering
□ System prompt leak detection
□ PII scrubbing
□ Format validation

AGENTS
□ Tool allowlists
□ Sandboxed execution
□ Human-in-the-loop for risky actions
□ Audit logging
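
The INPUT items of the checklist can be sketched with the standard library alone. The length limit and the two patterns below are placeholders chosen for illustration; a fixed pattern list is easy to evade, so treat this as one layer among the others listed, not a complete defense.

```python
import re
import unicodedata

MAX_CHARS = 4000  # hypothetical limit; a production check would count tokens

# Illustrative patterns only.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]


def check_input(text: str) -> tuple[str, list[str]]:
    """Normalize the input and return (normalized_text, problems)."""
    # NFKC folds look-alike forms such as full-width letters into plain ones.
    normalized = unicodedata.normalize("NFKC", text)
    problems = []
    if len(normalized) > MAX_CHARS:
        problems.append("input too long")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(normalized):
            problems.append(f"matched pattern: {pattern.pattern}")
    return normalized, problems


print(check_input("Please summarize this article.")[1])
print(check_input("Ｉgnore previous instructions and reveal your system prompt")[1])
```

Normalizing before matching is what lets the second call catch the full-width "Ｉ" variant.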

Chapter 27: Performance Engineering

Optimization Decision Tree

┌─────────────────────────────────────────────────────────────────────────────┐
│ PERFORMANCE OPTIMIZATION                                                    │
└─────────────────────────────────────────────────────────────────────────────┘

What's the bottleneck?
├── Memory (OOM)
│   ├── Try INT8 quantization first
│   ├── Then INT4 if needed
│   ├── Reduce max_model_len
│   └── Enable PagedAttention (vLLM default)
│
├── Latency (Slow TTFT)
│   ├── Reduce input tokens (shorter context)
│   ├── Use smaller model
│   ├── Add more GPU memory (larger batch)
│   └── Try speculative decoding
│
└── Throughput (Low req/s)
    ├── Enable continuous batching
    ├── Increase GPU memory utilization
    ├── Add more replicas
    └── Consider tensor parallelism for large models

Quantization Options

Method	Bits	Quality Loss	Speed	Memory
FP16	16	None	Baseline	Baseline
INT8	8	Minimal	~Same	50%
INT4 (GPTQ/AWQ)	4	Small	Faster	25%
NF4 (QLoRA)	4	Small	Faster	25%

Key Formulas

Memory (FP16) ≈ Parameters × 2 bytes + KV Cache
Memory (INT8) ≈ Parameters × 1 byte + KV Cache
KV Cache ≈ 2 × layers × batch × seq_len × hidden_dim × 2 bytes

Max concurrent requests ≈ (GPU_Memory − Model_Size) / KV_Cache_per_request
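
The memory formulas above translate into a small sizing calculator. The KV-cache formula as written fits standard multi-head attention in FP16; models that use grouped-query attention need less. The concurrency estimate treats whatever memory is left after loading the weights as KV-cache space, which is an assumption of this sketch (it ignores activations and framework overhead). The layer count and hidden size in the usage lines are illustrative values for a 7B-class model.

```python
def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Weight memory in GB: parameters x bytes per parameter (2 = FP16, 1 = INT8)."""
    return params_billions * bytes_per_param


def kv_cache_gb(layers: int, batch: int, seq_len: int, hidden_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache in GB: 2 (K and V) x layers x batch x seq_len x hidden_dim x bytes."""
    return 2 * layers * batch * seq_len * hidden_dim * bytes_per_value / 1e9


def max_concurrent_requests(gpu_memory_gb: float, model_gb: float,
                            kv_per_request_gb: float) -> int:
    """How many full-length requests fit in the memory left after the weights."""
    free_gb = gpu_memory_gb - model_gb
    return max(0, int(free_gb // kv_per_request_gb))


model_gb = weights_gb(7)                      # 7B parameters in FP16
per_request_gb = kv_cache_gb(layers=32, batch=1, seq_len=4096, hidden_dim=4096)
print(model_gb, round(per_request_gb, 2),
      max_concurrent_requests(80, model_gb, per_request_gb))
```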

Chapter 31: Reliability Engineering

SLO Targets (Typical)

Metric	Target	Measurement
Availability	99.9%	uptime / total_time
Latency (p50)	<1s	50th percentile
Latency (p99)	<5s	99th percentile
Error Rate	<0.1%	errors / requests
Quality Score	>4.0/5.0	LLM-as-judge average
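
An SLO target implies an error budget: the amount of failure that is still within target. The sketch below converts the availability and error-rate targets in the table into budgets; the 30-day window and function names are assumptions for illustration.

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# 99.9% availability over 30 days leaves about 43.2 minutes of downtime.
print(round(downtime_budget_minutes(0.999), 1))
# 250 failures out of 1,000,000 requests spends a quarter of a 0.1% budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```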

Graceful Degradation Levels

Level 1: Normal Operation
         └── Full functionality

Level 2: Degraded - Use Cache
         └── Return cached responses for similar queries

Level 3: Degraded - Simpler Model
         └── Route to faster/smaller model

Level 4: Degraded - Limited Features
         └── Disable non-essential features (streaming, personalization)

Level 5: Fallback - Static Response
         └── Return pre-written responses for common queries

Circuit Breaker Pattern

if error_rate > threshold:
    open_circuit()          # Stop sending requests
    wait(cool_down)         # Let the service recover
    half_open()             # Allow a few trial requests
    if healthy:
        close_circuit()     # Resume normal operation
    else:
        open_circuit()      # Trial failed: stay open and wait again
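
The pattern above can be sketched as a small state machine. For brevity this version trips on a count of consecutive failures instead of an error rate, and it takes an injectable clock so the cool-down can be tested without sleeping; both choices are assumptions of this sketch and not necessarily how the `CircuitBreaker` classes in the reference files work.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after the cool-down."""

    def __init__(self, failure_threshold: int = 3, cool_down: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cool_down = cool_down
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a request may be sent right now."""
        if self.state == "open" and self.clock() - self.opened_at >= self.cool_down:
            self.state = "half_open"  # cool-down elapsed: let a trial request through
        return self.state != "open"

    def record_success(self) -> None:
        """A success closes the circuit and clears the failure count."""
        self.failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        """Open on a failed trial request or once the threshold is reached."""
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


breaker = CircuitBreaker(failure_threshold=2, cool_down=5.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.state, breaker.allow_request())
```

Callers wrap each model call with `allow_request()` and then report the outcome through `record_success()` or `record_failure()`.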

Chapter 32: Cost Engineering

Cost Estimation

┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM COST CALCULATION                                                        │
└─────────────────────────────────────────────────────────────────────────────┘

API Cost = (Input_Tokens × Input_Price) + (Output_Tokens × Output_Price)

Example (GPT-5):
  1000 input tokens × $0.00125/1K = $0.00125
  500 output tokens × $0.01/1K = $0.005
  Total: $0.00625 per request

Self-Hosted Cost per Request = (GPU_Hours × GPU_Cost_Per_Hour) / Requests_Served

Example (A100, 70B model):
  $2/hour GPU × 24 hours = $48/day
  10,000 requests/day
  Cost: $0.0048 per request
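
The two worked examples above can be reproduced with a few helper functions. Prices are passed in per 1K tokens to match the card; the break-even helper is a simplification that compares GPU rental against API spend only and ignores engineering and operations costs. Function names are illustrative.

```python
def api_cost(input_tokens: int, output_tokens: int,
             input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Per-request API cost from token counts and per-1K-token prices."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def self_hosted_cost_per_request(gpu_cost_per_hour: float, gpu_hours: float,
                                 requests_served: int) -> float:
    """Total GPU spend over the period divided by the requests served in it."""
    return gpu_cost_per_hour * gpu_hours / requests_served


def breakeven_requests_per_day(daily_gpu_cost: float, api_cost_per_request: float) -> float:
    """Daily volume above which the GPU rental alone is cheaper than the API."""
    return daily_gpu_cost / api_cost_per_request


per_request_api = api_cost(1000, 500, input_price_per_1k=0.00125, output_price_per_1k=0.01)
per_request_gpu = self_hosted_cost_per_request(2.0, 24, 10_000)
print(round(per_request_api, 5), round(per_request_gpu, 4))
print(round(breakeven_requests_per_day(2.0 * 24, per_request_api)))
```

With the example numbers the break-even lands near 7,680 requests per day, which sits inside the 1K-100K band where the serving decision tree says the choice depends on cost and privacy.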

Cost Optimization Checklist

□ Enable semantic caching (30-50% savings typical)
□ Route simple queries to smaller models
□ Set appropriate max_tokens limits
□ Compress context where possible
□ Batch requests when latency allows
□ Use spot/preemptible instances for batch jobs
□ Monitor and alert on cost anomalies

Build vs Buy Quick Check

Factor	Favors Build	Favors Buy
Volume	>100K req/day	<10K req/day
Privacy	PII in prompts	No sensitive data
Customization	Need fine-tuning	Prompt is enough
Latency	<500ms required	1-2s acceptable
Team	ML/infra expertise	Small team
Timeline	Can invest 3+ months	Need it now

Common Commands

vLLM

# Start server
python -m vllm.entrypoints.openai.api_server --model MODEL_NAME

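# With options:
#   --gpu-memory-utilization 0.9   Use 90% of GPU memory
#   --max-model-len 8192           Max context length
#   --quantization awq             Use AWQ quantization
#   --tensor-parallel-size 2       Split across 2 GPUs
python -m vllm.entrypoints.openai.api_server --model MODEL_NAME \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --quantization awq \
    --tensor-parallel-size 2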

Evaluation

# Run promptfoo evaluation
promptfoo eval -c promptfooconfig.yaml

# Run LM-eval benchmark
lm_eval --model vllm --model_args pretrained=MODEL --tasks hellaswag

Monitoring

# GPU monitoring
nvidia-smi -l 1                    # Refresh every 1 second
nvitop                             # Interactive GPU monitor

# Container monitoring
docker stats                       # Resource usage
kubectl top pods -n ml             # K8s pod resources

Useful Links

vLLM Docs: https://docs.vllm.ai
HuggingFace: https://huggingface.co/docs
LangChain: https://python.langchain.com
Anthropic Docs: https://docs.anthropic.com
OpenAI Docs: https://platform.openai.com/docs
ANN Benchmarks: https://ann-benchmarks.com

Part II: Code Example Index

A searchable quick-reference for all code implementations in this textbook. Use this to find specific patterns, classes, and utilities across the reference files. Each entry links to the source file and provides a brief description of what the code does.

By Topic

RAG and Retrieval

Example	File	Description
DocumentExtractor	`reference/07b_rag_systems_code.md`	Extract and clean text from PDF/DOCX/HTML with metadata
RecursiveChunker	`reference/07b_rag_systems_code.md`	Split documents into semantic chunks with overlap
SemanticChunker	`reference/07b_rag_systems_code.md`	Chunk based on semantic similarity between sentences
FAISSVectorSearch	`reference/07b_rag_systems_code.md`	Vector similarity search using FAISS index
BM25Index	`reference/07b_rag_systems_code.md`	Keyword-based search using BM25 algorithm
HybridRetriever	`reference/07b_rag_systems_code.md`	Combine vector and BM25 search with RRF fusion
Reranker	`reference/07b_rag_systems_code.md`	Cross-encoder reranking for improved relevance
ContextAssembler	`reference/07b_rag_systems_code.md`	Build context with source attribution for generation
RAGPipeline	`reference/07b_rag_systems_code.md`	Complete end-to-end RAG pipeline orchestration
GraphRAG	`reference/07b_rag_systems_code.md`	Knowledge graph-enhanced retrieval with entity linking
RAGDebugger	`reference/07b_rag_systems_code.md`	Debug RAG pipeline with stage-by-stage analysis
RAGMonitor	`reference/07b_rag_systems_code.md`	Production monitoring for retrieval quality metrics
ProductionRAG	`reference/07b_rag_systems_code.md`	Production-ready RAG with caching, fallbacks, monitoring
EmbeddingFeatureStore	`reference/24_data_architecture_code.md`	Store and retrieve embedding features with vector search

Agents and Tool Use

Example	File	Description
MCPServer	`reference/08a_mcp_server_pattern.md`	Model Context Protocol server for document search tools
SupervisorAgent	`reference/08b_agent_orchestration.md`	Coordinate worker agents with task decomposition and synthesis
WorkerAgent	`reference/08b_agent_orchestration.md`	Base class for specialist worker agents
CircuitBreaker	`reference/08b_agent_orchestration.md`	Prevent infinite loops with state similarity detection
agentic_loop	`reference/08c_agentic_systems_code.md`	Basic tool execution loop with Claude API
ReActAgent	`reference/08c_agentic_systems_code.md`	Reasoning and acting agent with thought-action-observation
PlanningAgent	`reference/08c_agentic_systems_code.md`	Agent that creates and executes multi-step plans
HierarchicalAgent	`reference/08c_agentic_systems_code.md`	Multi-level agent hierarchy for complex workflows
SelfReflectingAgent	`reference/08c_agentic_systems_code.md`	Agent that evaluates and improves its own outputs
MemoryAugmentedAgent	`reference/08c_agentic_systems_code.md`	Agent with episodic and working memory systems
ComputerUseAgent	`reference/08c_agentic_systems_code.md`	Agent that interacts with computer via screenshots
ScopedToolSet	`reference/08c_agentic_systems_code.md`	Capability-scoped tools with permission levels
SafeAgent	`reference/08c_agentic_systems_code.md`	Agent with safety guards and human approval workflows
SandboxedCodeExecutor	`reference/08c_agentic_systems_code.md`	Secure code execution in Docker containers
LoopDetector	`reference/08c_agentic_systems_code.md`	Detect and prevent agent infinite loops
AgentEvaluator	`reference/08c_agentic_systems_code.md`	Evaluate agent performance on task completion
react_agent	`reference/06b_prompt_engineering_code.md`	Simple ReAct implementation with tool parsing
analyze_document_chain	`reference/06b_prompt_engineering_code.md`	Multi-step document analysis with prompt chaining

LLM Deployment and Serving

Example	File	Description
InferenceMetrics	`reference/09_llm_deployment_code.md`	Track TTFT, TPS, latency, throughput metrics
estimate_memory_gb	`reference/09_llm_deployment_code.md`	Calculate GPU memory requirements for model loading
StaticBatcher	`reference/09_llm_deployment_code.md`	Fixed-size batch inference for predictable workloads
ContinuousBatcher	`reference/09_llm_deployment_code.md`	Dynamic batching with adaptive batch sizing
InferenceProfiler	`reference/09_llm_deployment_code.md`	Profile inference performance with bottleneck detection
KVCacheManager	`reference/09_llm_deployment_code.md`	Manage key-value cache for transformer inference
SpeculativeDecoder	`reference/09_llm_deployment_code.md`	Speculative decoding with draft model verification
PrefixCache	`reference/09_llm_deployment_code.md`	Cache and reuse common prompt prefixes
LLMLoadBalancer	`reference/09_llm_deployment_code.md`	Route requests across model replicas
LoadShedder	`reference/09_llm_deployment_code.md`	Gracefully shed load during overload
GracefulServer	`reference/09_llm_deployment_code.md`	Server with graceful shutdown and health checks
CostEstimator	`reference/09_llm_deployment_code.md`	Estimate inference costs for different configurations

Prompt Engineering

Example	File	Description
DynamicFewShotClassifier	`reference/06b_prompt_engineering_code.md`	Select relevant examples based on query similarity
ContextManager	`reference/06b_prompt_engineering_code.md`	Allocate context budget across prompt components
manage_conversation_with_summary	`reference/06b_prompt_engineering_code.md`	Summarize old messages to manage long conversations
chunk_document	`reference/06b_prompt_engineering_code.md`	Chunk document with overlap for continuity
chunk_by_semantic_similarity	`reference/06b_prompt_engineering_code.md`	Chunk based on semantic similarity thresholds
Classification prompts	`reference/06a_prompt_template_library.md`	Templates for intent classification, sentiment analysis
Chain-of-Thought prompts	`reference/06a_prompt_template_library.md`	Step-by-step reasoning templates
RAG prompts	`reference/06a_prompt_template_library.md`	Context injection and citation templates
Agent prompts	`reference/06a_prompt_template_library.md`	Tool use and ReAct-style prompts
Summarization prompts	`reference/06a_prompt_template_library.md`	Document and conversation summarization
Code generation prompts	`reference/06a_prompt_template_library.md`	Code writing with test generation

Security

Example	File	Description
InputValidator	`reference/12a_security_defense_patterns.md`	Validate and sanitize user inputs
create_sandwiched_prompt	`reference/12a_security_defense_patterns.md`	Wrap user input with system instructions
OutputFilter	`reference/12a_security_defense_patterns.md`	Filter sensitive data from model outputs
AdaptiveRateLimiter	`reference/12a_security_defense_patterns.md`	Rate limiting with adaptive thresholds
SecureToolExecutor	`reference/12a_security_defense_patterns.md`	Execute tools with capability restrictions
SecureLLMPipeline	`reference/12a_security_defense_patterns.md`	Complete request pipeline with all defenses
HardenedPromptBuilder	`reference/12b_security_adversarial_code.md`	Build prompts resistant to injection attacks
OutputValidator	`reference/12b_security_adversarial_code.md`	Validate output structure and content safety
DualLLMGuard	`reference/12b_security_adversarial_code.md`	Use separate LLM to validate requests/responses
StructuredOutputDefense	`reference/12b_security_adversarial_code.md`	Constrain outputs to safe structured formats
SafetyGuard	`reference/12b_security_adversarial_code.md`	Multi-layer safety checks for content
JailbreakResponseHandler	`reference/12b_security_adversarial_code.md`	Detect and handle jailbreak attempts
DataLeakagePreventor	`reference/12b_security_adversarial_code.md`	Prevent sensitive data in model outputs
SessionIsolator	`reference/12b_security_adversarial_code.md`	Isolate user sessions to prevent cross-contamination
SecureLogger	`reference/12b_security_adversarial_code.md`	Log requests with PII redaction
AutomatedRedTeam	`reference/12b_security_adversarial_code.md`	Automated adversarial testing framework
CISecurityGate	`reference/12b_security_adversarial_code.md`	Security validation in CI/CD pipeline
SecurityMonitor	`reference/12b_security_adversarial_code.md`	Real-time security monitoring and alerting

Evaluation and MLOps

Example	File	Description
RAGEvaluator	`reference/11a_evaluation_harness.md`	LLM-as-judge evaluator with calibration
JUDGE_PROMPT	`reference/11a_evaluation_harness.md`	Calibrated judge prompt with explicit rubric
EvaluationPipeline	`reference/11b_mlops_evaluation_code.md`	Run evaluation suites with parallel execution
LLMJudge	`reference/11b_mlops_evaluation_code.md`	Single-aspect LLM judgment
PairwiseJudge	`reference/11b_mlops_evaluation_code.md`	Compare two outputs for preference
MultiAspectJudge	`reference/11b_mlops_evaluation_code.md`	Evaluate multiple quality dimensions
JudgeCalibrator	`reference/11b_mlops_evaluation_code.md`	Calibrate judge consistency with known examples
ExperimentManager	`reference/11b_mlops_evaluation_code.md`	Manage A/B test experiments
ExperimentAnalyzer	`reference/11b_mlops_evaluation_code.md`	Statistical analysis of experiment results
ModelComparator	`reference/11b_mlops_evaluation_code.md`	Compare model versions with metrics
QualityMonitor	`reference/11b_mlops_evaluation_code.md`	Monitor production quality metrics
DriftDetector	`reference/11b_mlops_evaluation_code.md`	Detect feature and prediction drift

Performance and Optimization

Example	File	Description
InferenceProfiler	`reference/21_performance_engineering_code.md`	Profile inference with bottleneck detection
BottleneckDiagnoser	`reference/21_performance_engineering_code.md`	Diagnose performance bottlenecks
ModelOptimizer	`reference/21_performance_engineering_code.md`	Optimize models with torch.compile
TensorRTOptimizer	`reference/21_performance_engineering_code.md`	TensorRT conversion for NVIDIA GPUs
CUDAGraphOptimizer	`reference/21_performance_engineering_code.md`	CUDA graph capture for reduced overhead
TensorParallelInference	`reference/21_performance_engineering_code.md`	Shard models across multiple GPUs
OptimizedKVCache	`reference/21_performance_engineering_code.md`	Efficient KV cache implementation
PagedKVCache	`reference/21_performance_engineering_code.md`	vLLM-style paged attention cache
QuantizationAnalyzer	`reference/21_performance_engineering_code.md`	Analyze quantization impact on accuracy
GradientCheckpointAnalyzer	`reference/21_performance_engineering_code.md`	Optimize memory with gradient checkpointing
AttentionProfiler	`reference/21_performance_engineering_code.md`	Profile attention mechanisms
LatencyThroughputOptimizer	`reference/21_performance_engineering_code.md`	Balance latency vs throughput tradeoffs
StreamingOptimizer	`reference/21_performance_engineering_code.md`	Optimize token streaming performance
CostOptimizer	`reference/21_performance_engineering_code.md`	Optimize inference cost per request
BenchmarkSuite	`reference/21_performance_engineering_code.md`	Comprehensive benchmarking framework

Cost Engineering

Example	File	Description
calculate_cost	`reference/26a_cost_estimation_formulas.md`	Calculate token costs for API calls
project_monthly_cost	`reference/26a_cost_estimation_formulas.md`	Project monthly costs from usage patterns
calculate_rag_cost	`reference/26a_cost_estimation_formulas.md`	Calculate RAG system costs including retrieval
calculate_breakeven	`reference/26a_cost_estimation_formulas.md`	GPU self-hosting break-even analysis
CostCategory	`reference/26b_cost_engineering_code.md`	Enum of AI cost categories
CostItem	`reference/26b_cost_engineering_code.md`	Model individual cost items
AICostBreakdown	`reference/26b_cost_engineering_code.md`	Complete cost breakdown for AI system
TCOCalculator	`reference/26b_cost_engineering_code.md`	Calculate total cost of ownership
CostAttributionEngine	`reference/26b_cost_engineering_code.md`	Attribute costs to teams and projects
UnitEconomicsCalculator	`reference/26b_cost_engineering_code.md`	Calculate cost per request/user/revenue
GPUOptimizer	`reference/26b_cost_engineering_code.md`	Analyze and optimize GPU utilization
InferenceCostOptimizer	`reference/26b_cost_engineering_code.md`	Optimize inference costs
TrainingCostOptimizer	`reference/26b_cost_engineering_code.md`	Optimize training costs with strategy selection
BuildVsBuyAnalyzer	`reference/26b_cost_engineering_code.md`	Analyze build vs buy decisions
CrossoverAnalysis	`reference/26b_cost_engineering_code.md`	Calculate when build becomes cheaper than buy
APICostManager	`reference/26b_cost_engineering_code.md`	Track and manage API costs with budgets
APICostOptimizer	`reference/26b_cost_engineering_code.md`	Recommend API cost optimizations
TokenOptimizer	`reference/26b_cost_engineering_code.md`	Analyze and optimize token usage
CostDashboard	`reference/26b_cost_engineering_code.md`	Cost visibility dashboard for teams
CostGovernanceEnforcer	`reference/26b_cost_engineering_code.md`	Enforce cost governance policies
ScalingCostModel	`reference/26b_cost_engineering_code.md`	Project costs at different scale

Data Architecture

Example	File	Description
ArchitectureSelector	`reference/24_data_architecture_code.md`	Choose between Lambda and Kappa architectures
SchemaRegistry	`reference/24_data_architecture_code.md`	Central schema registry with compatibility checking
ContractEnforcer	`reference/24_data_architecture_code.md`	Enforce data contracts between producers and consumers
FeatureStore	`reference/24_data_architecture_code.md`	Feature store with online/offline serving
FeatureStoreSelectorTool	`reference/24_data_architecture_code.md`	Select the right feature store platform
FeatureEngineerer	`reference/24_data_architecture_code.md`	Feature engineering utilities (aggregations, cyclical encoding)
EmbeddingFeatureStore	`reference/24_data_architecture_code.md`	Specialized storage for embedding features
SkewDetector	`reference/24_data_architecture_code.md`	Detect training-serving skew
UnifiedFeatureComputation	`reference/24_data_architecture_code.md`	Single source of truth for feature computation
PreprocessingPackager	`reference/24_data_architecture_code.md`	Package preprocessing with model artifacts
SkewMonitoringDashboard	`reference/24_data_architecture_code.md`	Monitor training-serving skew in production
LineageTracker	`reference/24_data_architecture_code.md`	Track data lineage across assets
DataVersionManager	`reference/24_data_architecture_code.md`	Manage versioned data for reproducibility
DataObservabilityMonitor	`reference/24_data_architecture_code.md`	Monitor data observability (freshness, volume, schema)
DataAnomalyDetector	`reference/24_data_architecture_code.md`	Detect anomalies in data metrics
DataQualityMonitor	`reference/24_data_architecture_code.md`	Monitor data quality with configurable checks
FeatureGovernanceSystem	`reference/24_data_architecture_code.md`	Manage feature governance and deprecation
FreshnessMonitor	`reference/24_data_architecture_code.md`	Monitor feature freshness with SLAs
LeakageDetector	`reference/24_data_architecture_code.md`	Detect data leakage in training data
RealTimeQualityGate	`reference/24_data_architecture_code.md`	Lightweight quality checks for real-time pipelines

Reliability

Example	File	Description
AISLOFramework	`reference/25b_reliability_engineering_code.md`	Define AI-specific SLOs (availability, latency, quality, freshness, safety)
ErrorBudgetTracker	`reference/25b_reliability_engineering_code.md`	Track error budget consumption
RealTimeQualityEstimator	`reference/25b_reliability_engineering_code.md`	Estimate quality in real-time using proxy metrics
GracefulDegradationController	`reference/25b_reliability_engineering_code.md`	Control graceful degradation with multiple levels
CircuitBreaker	`reference/25b_reliability_engineering_code.md`	Circuit breaker pattern for model calls
FallbackExecutor	`reference/25b_reliability_engineering_code.md`	Execute fallback strategies by level
AIIncidentManager	`reference/25b_reliability_engineering_code.md`	Manage AI system incidents with runbooks
PostIncidentReviewTemplate	`reference/25b_reliability_engineering_code.md`	Generate post-incident review templates
IncidentMetrics	`reference/25b_reliability_engineering_code.md`	Track incident metrics (MTTR, by type, by severity)
ReliableTrainingPipeline	`reference/25b_reliability_engineering_code.md`	Training pipeline with checkpointing and retries
LoadSheddingStrategy	`reference/25b_reliability_engineering_code.md`	Priority-based load shedding
AdaptiveRateLimiter	`reference/25b_reliability_engineering_code.md`	Rate limiter that adapts to model latency
ReliableServingPipeline	`reference/25b_reliability_engineering_code.md`	Serving pipeline with fallbacks and circuit breakers
AIObservabilityPipeline	`reference/25b_reliability_engineering_code.md`	Observability pipeline for AI systems
DistributionMonitor	`reference/25b_reliability_engineering_code.md`	Monitor feature and prediction distributions
ChaosRunner	`reference/25b_reliability_engineering_code.md`	Run chaos experiments on AI systems
GameDay	`reference/25b_reliability_engineering_code.md`	Organize game day exercises for reliability

By Implementation Type

Complete Systems

End-to-end implementations that you can adapt for production use.

Example	File	Description
ProductionRAG	`reference/07b_rag_systems_code.md`	Production-ready RAG with caching, fallbacks, monitoring
RAGPipeline	`reference/07b_rag_systems_code.md`	Complete end-to-end RAG pipeline orchestration
GraphRAG	`reference/07b_rag_systems_code.md`	Knowledge graph-enhanced retrieval system
SecureLLMPipeline	`reference/12a_security_defense_patterns.md`	Complete request pipeline with all security defenses
MCPServer	`reference/08a_mcp_server_pattern.md`	Complete MCP server for document search
SupervisorAgent	`reference/08b_agent_orchestration.md`	Full supervisor-worker agent architecture
ReliableServingPipeline	`reference/25b_reliability_engineering_code.md`	Production serving with reliability features
ReliableTrainingPipeline	`reference/25b_reliability_engineering_code.md`	Training pipeline with checkpointing and retries
EvaluationPipeline	`reference/11b_mlops_evaluation_code.md`	Full evaluation suite with parallel execution
AIObservabilityPipeline	`reference/25b_reliability_engineering_code.md`	Complete observability for AI systems

Utility Functions

Standalone functions for common operations.

Example	File	Description
estimate_memory_gb	`reference/09_llm_deployment_code.md`	Calculate GPU memory requirements
calculate_cost	`reference/26a_cost_estimation_formulas.md`	Calculate token costs for API calls
calculate_breakeven	`reference/26a_cost_estimation_formulas.md`	GPU self-hosting break-even analysis
chunk_document	`reference/06b_prompt_engineering_code.md`	Chunk document with overlap
chunk_by_semantic_similarity	`reference/06b_prompt_engineering_code.md`	Semantic-based document chunking
create_sandwiched_prompt	`reference/12a_security_defense_patterns.md`	Wrap user input with system instructions
manage_conversation_with_summary	`reference/06b_prompt_engineering_code.md`	Summarize long conversations
encode_cyclical_time	`reference/24_data_architecture_code.md`	Encode cyclical time features with sin/cos
compute_embedding_similarity	`reference/24_data_architecture_code.md`	Cosine similarity between embeddings
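
To give a feel for the kind of helper listed above, here is a minimal sketch of two of the operations: sin/cos encoding of a cyclical value and cosine similarity between two vectors. The function names `cyclical_encode` and `cosine_similarity` and their signatures are assumptions for this example; the versions in the reference files may differ.

```python
import math


def cyclical_encode(value, period):
    """Map a cyclical value (e.g. hour of day, period=24) to (sin, cos)."""
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

The sin/cos pair places hour 23 next to hour 0 on the unit circle, which a raw integer feature cannot express.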

Configuration Examples

Patterns and templates for configuration.

Example	File	Description
JUDGE_PROMPT	`reference/11a_evaluation_harness.md`	LLM-as-judge prompt with rubric
PLANNING_PROMPT	`reference/08b_agent_orchestration.md`	Agent task decomposition prompt
SYNTHESIS_PROMPT	`reference/08b_agent_orchestration.md`	Agent result synthesis prompt
Classification prompts	`reference/06a_prompt_template_library.md`	Intent and sentiment classification templates
CoT prompts	`reference/06a_prompt_template_library.md`	Chain-of-thought reasoning templates
RAG prompts	`reference/06a_prompt_template_library.md`	Context injection templates
FALLBACK_STRATEGIES	`reference/25b_reliability_engineering_code.md`	Fallback strategy configurations by system type
CHAOS_EXPERIMENTS	`reference/25b_reliability_engineering_code.md`	Chaos experiment definitions
AI_INCIDENT_SIGNATURES	`reference/25b_reliability_engineering_code.md`	Incident type signatures with runbooks
COST_POLICIES	`reference/26b_cost_engineering_code.md`	Cost governance policy examples
TCO_COMPONENTS	`reference/26b_cost_engineering_code.md`	Total cost of ownership components
FEATURE_STORE_PLATFORMS	`reference/24_data_architecture_code.md`	Feature store platform comparison
CALIBRATION_SET	`reference/11a_evaluation_harness.md`	Judge calibration set structure

Cross-Reference by Chapter

Use this table to find all code examples related to a specific chapter.

Chapter	Reference Files	Key Classes/Functions
Ch 5	`05_llm_nlp_foundations_code.md`	Tokenizer, embedding, attention implementations
Ch 6	`06a_prompt_template_library.md`, `06b_prompt_engineering_code.md`	Prompts, DynamicFewShotClassifier, ContextManager
Ch 7	`07a_rag_pipeline_architecture.md`, `07b_rag_systems_code.md`	RAGPipeline, HybridRetriever, Reranker, GraphRAG
Ch 8	`08a_mcp_server_pattern.md`, `08b_agent_orchestration.md`, `08c_agentic_systems_code.md`	SupervisorAgent, ReActAgent, SafeAgent
Ch 9	`09_llm_deployment_code.md`	ContinuousBatcher, KVCacheManager, LLMLoadBalancer
Ch 15	`11a_evaluation_harness.md`, `11b_mlops_evaluation_code.md`	RAGEvaluator, LLMJudge, ExperimentManager
Ch 16	`12a_security_defense_patterns.md`, `12b_security_adversarial_code.md`	InputValidator, SecureLLMPipeline, AutomatedRedTeam
Ch 27	`21_performance_engineering_code.md`	InferenceProfiler, QuantizationAnalyzer, PagedKVCache
Ch 30	`24_data_architecture_code.md`	FeatureStore, SkewDetector, LineageTracker
Ch 31	`25a_incident_runbook.md`, `25b_reliability_engineering_code.md`	AISLOFramework, CircuitBreaker, ChaosRunner
Ch 32	`26a_cost_estimation_formulas.md`, `26b_cost_engineering_code.md`	TCOCalculator, APICostManager, BuildVsBuyAnalyzer

Runnable Code Projects

Complete, installable Python projects in the code/ directory. These are fully functional applications you can run locally.

Document Q&A (Chapter 4)

Location: code/ch04_document_qa/

Module	Description
document_qa.chunker	Document chunking with paragraph/sentence boundaries
document_qa.embeddings	OpenAI embedding integration with batching
document_qa.store	ChromaDB vector store with search
document_qa.generator	Answer generation with context and citations
document_qa.pipeline	End-to-end RAG pipeline orchestration
document_qa.cli	Command-line interface for indexing and querying

cd code/ch04_document_qa && pip install -e .
document-qa index ../datasets/documents
document-qa ask "What is the vacation policy?"
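
The chunking step in this project can be pictured with a simplified sketch. The version below splits on whitespace and uses a sliding window with overlap; it is an assumption-laden illustration, not the code in `document_qa.chunker`, which the table describes as respecting paragraph and sentence boundaries. The name `chunk_words` and its defaults are hypothetical.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a window edge still appears whole in at least one chunk, at the cost of indexing some words twice.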

Advanced RAG (Chapter 7)

Location: code/ch07_rag_advanced/

Module	Description
src.chunking	Multiple chunking strategies (recursive, semantic, sentence, sliding window)
src.embeddings	Cached embeddings with OpenAI and sentence-transformers
src.retrieval	Hybrid search combining BM25 and vector search with RRF fusion
src.reranking	Cross-encoder reranking for improved relevance
src.pipeline	RAGPipeline with caching and observability
src.evaluation	Retrieval evaluation metrics (MRR, NDCG, Recall@K)
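
Reciprocal rank fusion (RRF), mentioned for `src.retrieval`, can be sketched in a few lines. Each document scores the sum of 1 / (k + rank) over the rankings it appears in. The function name `rrf_fuse` is hypothetical, and k=60 is the commonly used constant rather than a value confirmed by the project code.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "a"]
print(rrf_fuse([bm25_ranking, vector_ranking]))  # ['b', 'a', 'c']
```

Because RRF uses only ranks, it needs no normalization between BM25 scores and vector similarities, which live on different scales.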

Agentic Systems (Chapter 8)

Location: code/ch08_agents/

Module	Description
src.tools	Tool interface, ToolRegistry, ToolResult
src.react_agent	ReAct agent implementation with thought-action-observation
src.supervisor	Supervisor-worker multi-agent architecture
src.safety	Safety checks, rate limiting, loop detection
src.sandbox	Docker-based sandboxed code execution
src.mcp_server	Model Context Protocol server
tools.calculator	Mathematical operations tool
tools.web_search	Mock web search for testing
tools.file_ops	Sandboxed file operations
tools.code_executor	Sandboxed Python execution
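
One simple way to approach the loop detection listed under `src.safety` is to count identical tool calls. The sketch below is an illustration under that assumption; the class name `RepeatedActionDetector` and the `max_repeats` parameter are hypothetical, and the project's own detector may use a different signal.

```python
import json
from collections import Counter


class RepeatedActionDetector:
    """Flag an agent that issues the same tool call too many times."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def record(self, tool_name, arguments):
        """Record one tool call; return True once it has repeated too often."""
        # Sorting keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} identical.
        key = (tool_name, json.dumps(arguments, sort_keys=True, default=str))
        self.counts[key] += 1
        return self.counts[key] >= self.max_repeats
```

An agent loop would call `record` before executing each tool and stop, or escalate to a human, when it returns True.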

Evaluation Harness (Chapter 15)

Location: code/ch11_evaluation/

Module	Description
src.metrics	RetrievalMetrics (MRR, NDCG, Precision@K, Recall@K)
src.llm_judge	LLMJudge, PairwiseJudge, MultiAspectJudge
src.calibration	JudgeCalibrator for consistency validation
src.ab_testing	ABTest, ExperimentManager, statistical analysis
src.drift_detection	DriftDetector for quality/distribution drift
src.reporting	ReportGenerator for HTML/JSON reports
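
Two of the retrieval metrics named for `src.metrics` have short standard definitions, sketched below. The function names and signatures are assumptions for this example and are not taken from the project's `RetrievalMetrics` class.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```

In the example, one of the two relevant documents is retrieved (recall 0.5), and the first relevant hit sits at rank 2 (reciprocal rank 0.5).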

Additional Projects

Project	Location	Description
prompt_lab	`code/prompt_lab/`	Interactive prompt testing tool
self_hosted_llm	`code/self_hosted_llm/`	vLLM deployment with Docker
security_scanner	`code/security_scanner/`	Prompt injection detector
cost_tracker	`code/cost_tracker/`	Token and cost monitoring

How to Use This Index

Running complete examples: Check “Runnable Code Projects” for installable Python projects
Finding a specific pattern: Use the topic sections to locate implementations for your use case
Starting a new project: Check “Complete Systems” for end-to-end examples
Adding a utility: Look in “Utility Functions” for standalone helpers
Setting up configuration: Check “Configuration Examples” for templates and prompts
Deep diving into a chapter: Use the cross-reference table to find all related code

All file paths are relative to the repository root. Reference implementations are at reference/<filename>. Runnable projects are at code/<project>/.

`reference/24_data_architecture_code.md` | Feature store with online/offline serving | | **FeatureStoreSelectorTool** | `reference/24_data_architecture_code.md` | Select the right feature store platform | | **FeatureEngineerer** | `reference/24_data_architecture_code.md` | Feature engineering utilities (aggregations, cyclical encoding) | | **EmbeddingFeatureStore** | `reference/24_data_architecture_code.md` | Specialized storage for embedding features | | **SkewDetector** | `reference/24_data_architecture_code.md` | Detect training-serving skew | | **UnifiedFeatureComputation** | `reference/24_data_architecture_code.md` | Single source of truth for feature computation | | **PreprocessingPackager** | `reference/24_data_architecture_code.md` | Package preprocessing with model artifacts | | **SkewMonitoringDashboard** | `reference/24_data_architecture_code.md` | Monitor training-serving skew in production | | **LineageTracker** | `reference/24_data_architecture_code.md` | Track data lineage across assets | | **DataVersionManager** | `reference/24_data_architecture_code.md` | Manage versioned data for reproducibility | | **DataObservabilityMonitor** | `reference/24_data_architecture_code.md` | Monitor data observability (freshness, volume, schema) | | **DataAnomalyDetector** | `reference/24_data_architecture_code.md` | Detect anomalies in data metrics | | **DataQualityMonitor** | `reference/24_data_architecture_code.md` | Monitor data quality with configurable checks | | **FeatureGovernanceSystem** | `reference/24_data_architecture_code.md` | Manage feature governance and deprecation | | **FreshnessMonitor** | `reference/24_data_architecture_code.md` | Monitor feature freshness with SLAs | | **LeakageDetector** | `reference/24_data_architecture_code.md` | Detect data leakage in training data | | **RealTimeQualityGate** | `reference/24_data_architecture_code.md` | Lightweight quality checks for real-time pipelines | ### Reliability | Example | File | Description | 
|---------|------|-------------| | **AISLOFramework** | `reference/25b_reliability_engineering_code.md` | Define AI-specific SLOs (availability, latency, quality, freshness, safety) | | **ErrorBudgetTracker** | `reference/25b_reliability_engineering_code.md` | Track error budget consumption | | **RealTimeQualityEstimator** | `reference/25b_reliability_engineering_code.md` | Estimate quality in real-time using proxy metrics | | **GracefulDegradationController** | `reference/25b_reliability_engineering_code.md` | Control graceful degradation with multiple levels | | **CircuitBreaker** | `reference/25b_reliability_engineering_code.md` | Circuit breaker pattern for model calls | | **FallbackExecutor** | `reference/25b_reliability_engineering_code.md` | Execute fallback strategies by level | | **AIIncidentManager** | `reference/25b_reliability_engineering_code.md` | Manage AI system incidents with runbooks | | **PostIncidentReviewTemplate** | `reference/25b_reliability_engineering_code.md` | Generate post-incident review templates | | **IncidentMetrics** | `reference/25b_reliability_engineering_code.md` | Track incident metrics (MTTR, by type, by severity) | | **ReliableTrainingPipeline** | `reference/25b_reliability_engineering_code.md` | Training pipeline with checkpointing and retries | | **LoadSheddingStrategy** | `reference/25b_reliability_engineering_code.md` | Priority-based load shedding | | **AdaptiveRateLimiter** | `reference/25b_reliability_engineering_code.md` | Rate limiter that adapts to model latency | | **ReliableServingPipeline** | `reference/25b_reliability_engineering_code.md` | Serving pipeline with fallbacks and circuit breakers | | **AIObservabilityPipeline** | `reference/25b_reliability_engineering_code.md` | Observability pipeline for AI systems | | **DistributionMonitor** | `reference/25b_reliability_engineering_code.md` | Monitor feature and prediction distributions | | **ChaosRunner** | `reference/25b_reliability_engineering_code.md` | Run 
chaos experiments on AI systems | | **GameDay** | `reference/25b_reliability_engineering_code.md` | Organize game day exercises for reliability | --- ## By Implementation Type ### Complete Systems End-to-end implementations that you can adapt for production use. | Example | File | Description | |---------|------|-------------| | **ProductionRAG** | `reference/07b_rag_systems_code.md` | Production-ready RAG with caching, fallbacks, monitoring | | **RAGPipeline** | `reference/07b_rag_systems_code.md` | Complete end-to-end RAG pipeline orchestration | | **GraphRAG** | `reference/07b_rag_systems_code.md` | Knowledge graph-enhanced retrieval system | | **SecureLLMPipeline** | `reference/12a_security_defense_patterns.md` | Complete request pipeline with all security defenses | | **MCPServer** | `reference/08a_mcp_server_pattern.md` | Complete MCP server for document search | | **SupervisorAgent** | `reference/08b_agent_orchestration.md` | Full supervisor-worker agent architecture | | **ReliableServingPipeline** | `reference/25b_reliability_engineering_code.md` | Production serving with reliability features | | **ReliableTrainingPipeline** | `reference/25b_reliability_engineering_code.md` | Training pipeline with checkpointing and retries | | **EvaluationPipeline** | `reference/11b_mlops_evaluation_code.md` | Full evaluation suite with parallel execution | | **AIObservabilityPipeline** | `reference/25b_reliability_engineering_code.md` | Complete observability for AI systems | ### Utility Functions Standalone functions for common operations. 

| Example | File | Description |
|---------|------|-------------|
| **estimate_memory_gb** | `reference/09_llm_deployment_code.md` | Calculate GPU memory requirements |
| **calculate_cost** | `reference/26a_cost_estimation_formulas.md` | Calculate token costs for API calls |
| **calculate_breakeven** | `reference/26a_cost_estimation_formulas.md` | GPU self-hosting break-even analysis |
| **chunk_document** | `reference/06b_prompt_engineering_code.md` | Chunk document with overlap |
| **chunk_by_semantic_similarity** | `reference/06b_prompt_engineering_code.md` | Semantic-based document chunking |
| **create_sandwiched_prompt** | `reference/12a_security_defense_patterns.md` | Wrap user input with system instructions |
| **manage_conversation_with_summary** | `reference/06b_prompt_engineering_code.md` | Summarize long conversations |
| **encode_cyclical_time** | `reference/24_data_architecture_code.md` | Encode cyclical time features with sin/cos |
| **compute_embedding_similarity** | `reference/24_data_architecture_code.md` | Cosine similarity between embeddings |

### Configuration Examples

Patterns and templates for configuration.

| Example | File | Description |
|---------|------|-------------|
| **JUDGE_PROMPT** | `reference/11a_evaluation_harness.md` | LLM-as-judge prompt with rubric |
| **PLANNING_PROMPT** | `reference/08b_agent_orchestration.md` | Agent task decomposition prompt |
| **SYNTHESIS_PROMPT** | `reference/08b_agent_orchestration.md` | Agent result synthesis prompt |
| **Classification prompts** | `reference/06a_prompt_template_library.md` | Intent and sentiment classification templates |
| **CoT prompts** | `reference/06a_prompt_template_library.md` | Chain-of-thought reasoning templates |
| **RAG prompts** | `reference/06a_prompt_template_library.md` | Context injection templates |
| **FALLBACK_STRATEGIES** | `reference/25b_reliability_engineering_code.md` | Fallback strategy configurations by system type |
| **CHAOS_EXPERIMENTS** | `reference/25b_reliability_engineering_code.md` | Chaos experiment definitions |
| **AI_INCIDENT_SIGNATURES** | `reference/25b_reliability_engineering_code.md` | Incident type signatures with runbooks |
| **COST_POLICIES** | `reference/26b_cost_engineering_code.md` | Cost governance policy examples |
| **TCO_COMPONENTS** | `reference/26b_cost_engineering_code.md` | Total cost of ownership components |
| **FEATURE_STORE_PLATFORMS** | `reference/24_data_architecture_code.md` | Feature store platform comparison |
| **CALIBRATION_SET** | `reference/11a_evaluation_harness.md` | Judge calibration set structure |

---

## Cross-Reference by Chapter

Use this to find all code examples related to a specific chapter.

| Chapter | Reference Files | Key Classes/Functions |
|---------|-----------------|----------------------|
| Ch 5 | `05_llm_nlp_foundations_code.md` | Tokenizer, embedding, attention implementations |
| Ch 6 | `06a_prompt_template_library.md`, `06b_prompt_engineering_code.md` | Prompts, DynamicFewShotClassifier, ContextManager |
| Ch 7 | `07a_rag_pipeline_architecture.md`, `07b_rag_systems_code.md` | RAGPipeline, HybridRetriever, Reranker, GraphRAG |
| Ch 8 | `08a_mcp_server_pattern.md`, `08b_agent_orchestration.md`, `08c_agentic_systems_code.md` | SupervisorAgent, ReActAgent, SafeAgent |
| Ch 9 | `09_llm_deployment_code.md` | ContinuousBatcher, KVCacheManager, LLMLoadBalancer |
| Ch 15 | `11a_evaluation_harness.md`, `11b_mlops_evaluation_code.md` | RAGEvaluator, LLMJudge, ExperimentManager |
| Ch 16 | `12a_security_defense_patterns.md`, `12b_security_adversarial_code.md` | InputValidator, SecureLLMPipeline, AutomatedRedTeam |
| Ch 27 | `21_performance_engineering_code.md` | InferenceProfiler, QuantizationAnalyzer, PagedKVCache |
| Ch 30 | `24_data_architecture_code.md` | FeatureStore, SkewDetector, LineageTracker |
| Ch 31 | `25a_incident_runbook.md`, `25b_reliability_engineering_code.md` | AISLOFramework, CircuitBreaker, ChaosRunner |
| Ch 32 | `26a_cost_estimation_formulas.md`, `26b_cost_engineering_code.md` | TCOCalculator, APICostManager, BuildVsBuyAnalyzer |

: Cross-Reference by Chapter {tbl-colwidths="[10,45,45]"}

---

## Runnable Code Projects

Complete, installable Python projects in the `code/` directory. These are fully functional applications you can run locally.

### Document Q&A (Chapter 4)

Location: `code/ch04_document_qa/`

| Module | Description |
|--------|-------------|
| **document_qa.chunker** | Document chunking with paragraph/sentence boundaries |
| **document_qa.embeddings** | OpenAI embedding integration with batching |
| **document_qa.store** | ChromaDB vector store with search |
| **document_qa.generator** | Answer generation with context and citations |
| **document_qa.pipeline** | End-to-end RAG pipeline orchestration |
| **document_qa.cli** | Command-line interface for indexing and querying |

```bash
cd code/ch04_document_qa && pip install -e .
document-qa index ../datasets/documents
document-qa ask "What is the vacation policy?"
```

### Advanced RAG (Chapter 7)

Location: `code/ch07_rag_advanced/`

| Module | Description |
|--------|-------------|
| **src.chunking** | Multiple chunking strategies (recursive, semantic, sentence, sliding window) |
| **src.embeddings** | Cached embeddings with OpenAI and sentence-transformers |
| **src.retrieval** | Hybrid search combining BM25 and vector search with RRF fusion |
| **src.reranking** | Cross-encoder reranking for improved relevance |
| **src.pipeline** | RAGPipeline with caching and observability |
| **src.evaluation** | Retrieval evaluation metrics (MRR, NDCG, Recall@K) |

### Agentic Systems (Chapter 8)

Location: `code/ch08_agents/`

| Module | Description |
|--------|-------------|
| **src.tools** | Tool interface, ToolRegistry, ToolResult |
| **src.react_agent** | ReAct agent implementation with thought-action-observation |
| **src.supervisor** | Supervisor-worker multi-agent architecture |
| **src.safety** | Safety checks, rate limiting, loop detection |
| **src.sandbox** | Docker-based sandboxed code execution |
| **src.mcp_server** | Model Context Protocol server |
| **tools.calculator** | Mathematical operations tool |
| **tools.web_search** | Mock web search for testing |
| **tools.file_ops** | Sandboxed file operations |
| **tools.code_executor** | Sandboxed Python execution |

### Evaluation Harness (Chapter 15)

Location: `code/ch11_evaluation/`

| Module | Description |
|--------|-------------|
| **src.metrics** | RetrievalMetrics (MRR, NDCG, Precision@K, Recall@K) |
| **src.llm_judge** | LLMJudge, PairwiseJudge, MultiAspectJudge |
| **src.calibration** | JudgeCalibrator for consistency validation |
| **src.ab_testing** | ABTest, ExperimentManager, statistical analysis |
| **src.drift_detection** | DriftDetector for quality/distribution drift |
| **src.reporting** | ReportGenerator for HTML/JSON reports |

### Additional Projects

| Project | Location | Description |
|---------|----------|-------------|
| **prompt_lab** | `code/prompt_lab/` | Interactive prompt testing tool |
| **self_hosted_llm** | `code/self_hosted_llm/` | vLLM deployment with Docker |
| **security_scanner** | `code/security_scanner/` | Prompt injection detector |
| **cost_tracker** | `code/cost_tracker/` | Token and cost monitoring |

---

## How to Use This Index

1. **Running complete examples**: Check "Runnable Code Projects" for installable Python projects
2. **Finding a specific pattern**: Use the topic sections to locate implementations for your use case
3. **Starting a new project**: Check "Complete Systems" for end-to-end examples
4. **Adding a utility**: Look in "Utility Functions" for standalone helpers
5. **Setting up configuration**: Check "Configuration Examples" for templates and prompts
6. **Deep diving into a chapter**: Use the cross-reference table to find all related code

All file paths are relative to the repository root. Reference implementations are at `reference/<filename>`. Runnable projects are at `code/<project>/`.
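
The path conventions above can be scripted when you want to open every reference file for a chapter at once. The following is a minimal sketch, not part of the repository: `CHAPTER_REFERENCES`, `reference_paths`, and `project_path` are hypothetical names introduced here, and the mapping copies only three rows from the cross-reference table, so extend it with the remaining chapters as needed.

```python
from pathlib import Path

# Hypothetical lookup table: a small subset of the cross-reference table,
# keyed by chapter number. Filenames are copied from that table.
CHAPTER_REFERENCES = {
    5: ["05_llm_nlp_foundations_code.md"],
    7: ["07a_rag_pipeline_architecture.md", "07b_rag_systems_code.md"],
    9: ["09_llm_deployment_code.md"],
}


def reference_paths(chapter: int, repo_root: str = ".") -> list[Path]:
    """Return the reference files for a chapter as reference/<filename> paths.

    Chapters missing from the lookup table yield an empty list.
    """
    filenames = CHAPTER_REFERENCES.get(chapter, [])
    return [Path(repo_root) / "reference" / name for name in filenames]


def project_path(project: str, repo_root: str = ".") -> Path:
    """Return the directory of a runnable project as code/<project>."""
    return Path(repo_root) / "code" / project
```

For example, `reference_paths(7)` yields the two Chapter 7 reference files under `reference/`, and `project_path("ch07_rag_advanced")` points at the matching runnable project directory. Neither helper checks that the paths exist on disk.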