This appendix combines two essential reference resources: Quick Reference Cards with one-page summaries for each major topic, and a Code Example Index to find specific patterns and implementations across the textbook.
Part I: Quick Reference Cards
One-page summaries for each major topic. Print these or keep them handy during development.
Chapter 5: LLM/NLP Foundations
Key Concepts
┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM FUNDAMENTALS │
└─────────────────────────────────────────────────────────────────────────────┘
TOKENIZATION
├── BPE (Byte-Pair Encoding): Most common, good compression
├── WordPiece: Used by BERT
├── SentencePiece: Language-agnostic
└── Tokens ≠ Words: "tokenization" → ["token", "ization"]
EMBEDDINGS
├── Dense vectors representing semantic meaning
├── Similar meanings → similar vectors
├── Dimensionality: 768-4096 typical
└── Cosine similarity for comparison
ATTENTION
├── Query, Key, Value matrices
├── Attention(Q,K,V) = softmax(QK^T/√d)V
├── Complexity: O(n²) with sequence length
└── Enables long-range dependencies
TRANSFORMER ARCHITECTURE
├── Encoder: Bidirectional (BERT-style)
├── Decoder: Autoregressive (GPT-style)
├── Encoder-Decoder: Seq2seq (T5-style)
└── Modern LLMs: Decoder-only dominant
Quick Formulas
Concept
Formula/Rule
Attention
softmax(QK^T / √d_k) × V
Token estimate
~4 characters per token (English)
Context usage
prompt_tokens + max_tokens ≤ context_window
Common Model Sizes
Model
Parameters
VRAM (FP16)
VRAM (INT8)
7B
7 billion
~14 GB
~7 GB
13B
13 billion
~26 GB
~13 GB
70B
70 billion
~140 GB
~70 GB
Chapter 6: Prompt Engineering
Prompt Patterns
┌─────────────────────────────────────────────────────────────────────────────┐
│ PROMPTING PATTERNS │
└─────────────────────────────────────────────────────────────────────────────┘
ZERO-SHOT
"Classify this text as positive or negative: {text}"
FEW-SHOT
"Examples:
Text: Great product! → positive
Text: Terrible service → negative
Text: {text} →"
CHAIN-OF-THOUGHT
"Let's solve this step by step:
1. First, identify...
2. Then, calculate...
3. Finally, conclude..."
STRUCTURED OUTPUT
"Respond with valid JSON matching this schema:
{\"sentiment\": \"positive|negative\", \"confidence\": 0.0-1.0}"
Best Practices
Do
Don’t
Be specific and explicit
Assume the model knows context
Use delimiters for data
Mix instructions with data
Provide examples
Give vague instructions
Set output format
Leave format ambiguous
Include constraints
Forget edge cases
Temperature Guide
Temperature
Use Case
0.0-0.3
Factual, deterministic (code, extraction)
0.4-0.7
Balanced (most applications)
0.8-1.0
Creative (brainstorming, writing)
Chapter 7: RAG Systems
Pipeline Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ RAG PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
INDEXING (Offline)
Documents → Extract → Chunk → Embed → Store
↓ ↓ ↓ ↓
Clean text 512 tok 1024-dim Vector DB
RETRIEVAL (Online)
Query → Embed → Search → Rerank → Top-K chunks
↓ ↓ ↓ ↓
Same model ANN Cross-enc 3-5 chunks
GENERATION
Context + Query → LLM → Response + Citations
□ Rate limiting per user/session
□ Tool call validation (allowlists)
□ Sandboxed execution for code
□ Human confirmation for dangerous actions
□ Maximum steps/tokens limit
□ Audit logging enabled
□ Timeout on all operations
Chapter 9: LLM Deployment
Serving Options Decision Tree
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHICH SERVING OPTION? │
└─────────────────────────────────────────────────────────────────────────────┘
Traffic Volume?
├── <1K req/day → API (OpenAI, Anthropic)
├── 1K-100K req/day → API or Self-hosted (depends on cost/privacy)
└── >100K req/day → Self-hosted likely cheaper
Privacy Requirements?
├── PII in prompts → Self-hosted or private API deployment
└── No PII → API is fine
Latency Requirements?
├── <500ms p99 → Self-hosted with warm instances
└── >1s OK → API works
Self-Hosted Stack:
├── Inference Engine: vLLM (recommended) or TGI
├── Load Balancer: nginx or cloud LB
├── Metrics: Prometheus + Grafana
└── Scaling: Kubernetes with GPU nodes
□ Enable continuous batching
□ Set appropriate gpu-memory-utilization (0.85-0.95)
□ Use quantization if memory-constrained (INT8 first, then INT4)
□ Enable KV cache (default in vLLM)
□ Configure appropriate max-model-len
□ Add response caching for repeated queries
□ Monitor GPU utilization (target >70%)
Chapter 15: MLOps & Evaluation
Metrics Quick Reference
┌─────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION METRICS │
└─────────────────────────────────────────────────────────────────────────────┘
RETRIEVAL
├── Recall@K: % of relevant docs in top K
├── MRR: 1 / position of first relevant
├── NDCG: Ranking quality score
└── Precision@K: % of top K that are relevant
GENERATION (Automated)
├── BLEU: N-gram overlap with reference
├── ROUGE: Recall-oriented overlap
├── BERTScore: Semantic similarity
└── Exact Match: Binary correct/incorrect
GENERATION (LLM-as-Judge)
├── Accuracy: Is the answer correct?
├── Helpfulness: Does it address the question?
├── Safety: Is it appropriate?
├── Groundedness: Is it supported by context?
└── Coherence: Is it well-written?
LLM-as-Judge Template
JUDGE_PROMPT ="""Rate this response on a scale of 1-5.Question: {question}Response: {response}Reference (if available): {reference}Criteria:- 5: Excellent, fully correct and helpful- 4: Good, mostly correct with minor issues- 3: Acceptable, partially correct- 2: Poor, significant errors- 1: Unacceptable, wrong or harmfulProvide your rating and brief justification.Rating:"""
Chapter 16: Security
Threat Model
┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM SECURITY THREATS │
└─────────────────────────────────────────────────────────────────────────────┘
PROMPT INJECTION
├── Direct: User input tries to override system prompt
├── Indirect: External data (RAG, tools) contains injection
└── Defense: Input validation, prompt hardening, output filtering
JAILBREAKING
├── Attempts to bypass safety training
├── Often uses roleplay, hypotheticals
└── Defense: Safety classifiers, content filtering
DATA LEAKAGE
├── System prompt extraction
├── Training data extraction
├── PII in responses
└── Defense: Output filtering, don't put secrets in prompts
AGENT-SPECIFIC
├── Malicious tool calls
├── Unauthorized actions
├── Sandbox escapes
└── Defense: Tool policies, sandboxing, confirmation
Defense Checklist
INPUT
□ Length limits (max tokens)
□ Rate limiting (per user/IP)
□ Known pattern detection
□ Unicode normalization
PROMPT
□ Clear instruction/data separation
□ Explicit boundaries ("User input below:")
□ Don't trust user input
□ No secrets in system prompts
OUTPUT
□ Content filtering
□ System prompt leak detection
□ PII scrubbing
□ Format validation
AGENTS
□ Tool allowlists
□ Sandboxed execution
□ Human-in-the-loop for risky actions
□ Audit logging
Chapter 27: Performance Engineering
Optimization Decision Tree
┌─────────────────────────────────────────────────────────────────────────────┐
│ PERFORMANCE OPTIMIZATION │
└─────────────────────────────────────────────────────────────────────────────┘
What's the bottleneck?
├── Memory (OOM)
│ ├── Try INT8 quantization first
│ ├── Then INT4 if needed
│ ├── Reduce max_model_len
│ └── Enable PagedAttention (vLLM default)
│
├── Latency (Slow TTFT)
│ ├── Reduce input tokens (shorter context)
│ ├── Use smaller model
│ ├── Add more GPU memory (larger batch)
│ └── Try speculative decoding
│
└── Throughput (Low req/s)
├── Enable continuous batching
├── Increase GPU memory utilization
├── Add more replicas
└── Consider tensor parallelism for large models
Level 1: Normal Operation
└── Full functionality
Level 2: Degraded - Use Cache
└── Return cached responses for similar queries
Level 3: Degraded - Simpler Model
└── Route to faster/smaller model
Level 4: Degraded - Limited Features
└── Disable non-essential features (streaming, personalization)
Level 5: Fallback - Static Response
└── Return pre-written responses for common queries
Circuit Breaker Pattern
if error_rate > threshold: open_circuit() # Stop sending requests wait(cool_down) # Let service recover half_open() # Try a few requestsif healthy: close_circuit() # Resume normal operation
□ Enable semantic caching (30-50% savings typical)
□ Route simple queries to smaller models
□ Set appropriate max_tokens limits
□ Compress context where possible
□ Batch requests when latency allows
□ Use spot/preemptible instances for batch jobs
□ Monitor and alert on cost anomalies
Build vs Buy Quick Check
Factor
Favors Build
Favors Buy
Volume
>100K req/day
<10K req/day
Privacy
PII in prompts
No sensitive data
Customization
Need fine-tuning
Prompt is enough
Latency
<500ms required
1-2s acceptable
Team
ML/infra expertise
Small team
Timeline
Can invest 3+ months
Need it now
Common Commands
vLLM
# Start serverpython-m vllm.entrypoints.openai.api_server --model MODEL_NAME# With options--gpu-memory-utilization 0.9 # Use 90% GPU memory--max-model-len 8192 # Max context length--quantization awq # Use AWQ quantization--tensor-parallel-size 2 # Split across 2 GPUs
Evaluation
# Run promptfoo evaluationpromptfoo eval -c promptfooconfig.yaml# Run LM-eval benchmarklm_eval--model vllm --model_args pretrained=MODEL --tasks hellaswag
Monitoring
# GPU monitoringnvidia-smi-l 1 # Refresh every 1 secondnvitop# Interactive GPU monitor# Container monitoringdocker stats # Resource usagekubectl top pods -n ml # K8s pod resources
Useful Links
vLLM Docs: https://docs.vllm.ai
HuggingFace: https://huggingface.co/docs
LangChain: https://python.langchain.com
Anthropic Docs: https://docs.anthropic.com
OpenAI Docs: https://platform.openai.com/docs
ANN Benchmarks: https://ann-benchmarks.com
Part II: Code Example Index
A searchable quick-reference for all code implementations in this textbook. Use this to find specific patterns, classes, and utilities across the reference files. Each entry links to the source file and provides a brief description of what the code does.
By Topic
RAG and Retrieval
Example
File
Description
DocumentExtractor
reference/07b_rag_systems_code.md
Extract and clean text from PDF/DOCX/HTML with metadata
RecursiveChunker
reference/07b_rag_systems_code.md
Split documents into semantic chunks with overlap
SemanticChunker
reference/07b_rag_systems_code.md
Chunk based on semantic similarity between sentences
FAISSVectorSearch
reference/07b_rag_systems_code.md
Vector similarity search using FAISS index
BM25Index
reference/07b_rag_systems_code.md
Keyword-based search using BM25 algorithm
HybridRetriever
reference/07b_rag_systems_code.md
Combine vector and BM25 search with RRF fusion
Reranker
reference/07b_rag_systems_code.md
Cross-encoder reranking for improved relevance
ContextAssembler
reference/07b_rag_systems_code.md
Build context with source attribution for generation
RAGPipeline
reference/07b_rag_systems_code.md
Complete end-to-end RAG pipeline orchestration
GraphRAG
reference/07b_rag_systems_code.md
Knowledge graph-enhanced retrieval with entity linking
RAGDebugger
reference/07b_rag_systems_code.md
Debug RAG pipeline with stage-by-stage analysis
RAGMonitor
reference/07b_rag_systems_code.md
Production monitoring for retrieval quality metrics
ProductionRAG
reference/07b_rag_systems_code.md
Production-ready RAG with caching, fallbacks, monitoring
EmbeddingFeatureStore
reference/24_data_architecture_code.md
Store and retrieve embedding features with vector search
Agents and Tool Use
Example
File
Description
MCPServer
reference/08a_mcp_server_pattern.md
Model Context Protocol server for document search tools
SupervisorAgent
reference/08b_agent_orchestration.md
Coordinate worker agents with task decomposition and synthesis
WorkerAgent
reference/08b_agent_orchestration.md
Base class for specialist worker agents
CircuitBreaker
reference/08b_agent_orchestration.md
Prevent infinite loops with state similarity detection
agentic_loop
reference/08c_agentic_systems_code.md
Basic tool execution loop with Claude API
ReActAgent
reference/08c_agentic_systems_code.md
Reasoning and acting agent with thought-action-observation
PlanningAgent
reference/08c_agentic_systems_code.md
Agent that creates and executes multi-step plans
HierarchicalAgent
reference/08c_agentic_systems_code.md
Multi-level agent hierarchy for complex workflows
SelfReflectingAgent
reference/08c_agentic_systems_code.md
Agent that evaluates and improves its own outputs
MemoryAugmentedAgent
reference/08c_agentic_systems_code.md
Agent with episodic and working memory systems
ComputerUseAgent
reference/08c_agentic_systems_code.md
Agent that interacts with computer via screenshots
ScopedToolSet
reference/08c_agentic_systems_code.md
Capability-scoped tools with permission levels
SafeAgent
reference/08c_agentic_systems_code.md
Agent with safety guards and human approval workflows
SandboxedCodeExecutor
reference/08c_agentic_systems_code.md
Secure code execution in Docker containers
LoopDetector
reference/08c_agentic_systems_code.md
Detect and prevent agent infinite loops
AgentEvaluator
reference/08c_agentic_systems_code.md
Evaluate agent performance on task completion
react_agent
reference/06b_prompt_engineering_code.md
Simple ReAct implementation with tool parsing
analyze_document_chain
reference/06b_prompt_engineering_code.md
Multi-step document analysis with prompt chaining
LLM Deployment and Serving
Example
File
Description
InferenceMetrics
reference/09_llm_deployment_code.md
Track TTFT, TPS, latency, throughput metrics
estimate_memory_gb
reference/09_llm_deployment_code.md
Calculate GPU memory requirements for model loading
StaticBatcher
reference/09_llm_deployment_code.md
Fixed-size batch inference for predictable workloads
ContinuousBatcher
reference/09_llm_deployment_code.md
Dynamic batching with adaptive batch sizing
InferenceProfiler
reference/09_llm_deployment_code.md
Profile inference performance with bottleneck detection
KVCacheManager
reference/09_llm_deployment_code.md
Manage key-value cache for transformer inference
SpeculativeDecoder
reference/09_llm_deployment_code.md
Speculative decoding with draft model verification
PrefixCache
reference/09_llm_deployment_code.md
Cache and reuse common prompt prefixes
LLMLoadBalancer
reference/09_llm_deployment_code.md
Route requests across model replicas
LoadShedder
reference/09_llm_deployment_code.md
Gracefully shed load during overload
GracefulServer
reference/09_llm_deployment_code.md
Server with graceful shutdown and health checks
CostEstimator
reference/09_llm_deployment_code.md
Estimate inference costs for different configurations
Prompt Engineering
Example
File
Description
DynamicFewShotClassifier
reference/06b_prompt_engineering_code.md
Select relevant examples based on query similarity
ContextManager
reference/06b_prompt_engineering_code.md
Allocate context budget across prompt components
manage_conversation_with_summary
reference/06b_prompt_engineering_code.md
Summarize old messages to manage long conversations
chunk_document
reference/06b_prompt_engineering_code.md
Chunk document with overlap for continuity
chunk_by_semantic_similarity
reference/06b_prompt_engineering_code.md
Chunk based on semantic similarity thresholds
Classification prompts
reference/06a_prompt_template_library.md
Templates for intent classification, sentiment analysis
Chain-of-Thought prompts
reference/06a_prompt_template_library.md
Step-by-step reasoning templates
RAG prompts
reference/06a_prompt_template_library.md
Context injection and citation templates
Agent prompts
reference/06a_prompt_template_library.md
Tool use and ReAct-style prompts
Summarization prompts
reference/06a_prompt_template_library.md
Document and conversation summarization
Code generation prompts
reference/06a_prompt_template_library.md
Code writing with test generation
Security
Example
File
Description
InputValidator
reference/12a_security_defense_patterns.md
Validate and sanitize user inputs
create_sandwiched_prompt
reference/12a_security_defense_patterns.md
Wrap user input with system instructions
OutputFilter
reference/12a_security_defense_patterns.md
Filter sensitive data from model outputs
AdaptiveRateLimiter
reference/12a_security_defense_patterns.md
Rate limiting with adaptive thresholds
SecureToolExecutor
reference/12a_security_defense_patterns.md
Execute tools with capability restrictions
SecureLLMPipeline
reference/12a_security_defense_patterns.md
Complete request pipeline with all defenses
HardenedPromptBuilder
reference/12b_security_adversarial_code.md
Build prompts resistant to injection attacks
OutputValidator
reference/12b_security_adversarial_code.md
Validate output structure and content safety
DualLLMGuard
reference/12b_security_adversarial_code.md
Use separate LLM to validate requests/responses
StructuredOutputDefense
reference/12b_security_adversarial_code.md
Constrain outputs to safe structured formats
SafetyGuard
reference/12b_security_adversarial_code.md
Multi-layer safety checks for content
JailbreakResponseHandler
reference/12b_security_adversarial_code.md
Detect and handle jailbreak attempts
DataLeakagePreventor
reference/12b_security_adversarial_code.md
Prevent sensitive data in model outputs
SessionIsolator
reference/12b_security_adversarial_code.md
Isolate user sessions to prevent cross-contamination
SecureLogger
reference/12b_security_adversarial_code.md
Log requests with PII redaction
AutomatedRedTeam
reference/12b_security_adversarial_code.md
Automated adversarial testing framework
CISecurityGate
reference/12b_security_adversarial_code.md
Security validation in CI/CD pipeline
SecurityMonitor
reference/12b_security_adversarial_code.md
Real-time security monitoring and alerting
Evaluation and MLOps
Example
File
Description
RAGEvaluator
reference/11a_evaluation_harness.md
LLM-as-judge evaluator with calibration
JUDGE_PROMPT
reference/11a_evaluation_harness.md
Calibrated judge prompt with explicit rubric
EvaluationPipeline
reference/11b_mlops_evaluation_code.md
Run evaluation suites with parallel execution
LLMJudge
reference/11b_mlops_evaluation_code.md
Single-aspect LLM judgment
PairwiseJudge
reference/11b_mlops_evaluation_code.md
Compare two outputs for preference
MultiAspectJudge
reference/11b_mlops_evaluation_code.md
Evaluate multiple quality dimensions
JudgeCalibrator
reference/11b_mlops_evaluation_code.md
Calibrate judge consistency with known examples
ExperimentManager
reference/11b_mlops_evaluation_code.md
Manage A/B test experiments
ExperimentAnalyzer
reference/11b_mlops_evaluation_code.md
Statistical analysis of experiment results
ModelComparator
reference/11b_mlops_evaluation_code.md
Compare model versions with metrics
QualityMonitor
reference/11b_mlops_evaluation_code.md
Monitor production quality metrics
DriftDetector
reference/11b_mlops_evaluation_code.md
Detect feature and prediction drift
Performance and Optimization
Example
File
Description
InferenceProfiler
reference/21_performance_engineering_code.md
Profile inference with bottleneck detection
BottleneckDiagnoser
reference/21_performance_engineering_code.md
Diagnose performance bottlenecks
ModelOptimizer
reference/21_performance_engineering_code.md
Optimize models with torch.compile
TensorRTOptimizer
reference/21_performance_engineering_code.md
TensorRT conversion for NVIDIA GPUs
CUDAGraphOptimizer
reference/21_performance_engineering_code.md
CUDA graph capture for reduced overhead
TensorParallelInference
reference/21_performance_engineering_code.md
Shard models across multiple GPUs
OptimizedKVCache
reference/21_performance_engineering_code.md
Efficient KV cache implementation
PagedKVCache
reference/21_performance_engineering_code.md
vLLM-style paged attention cache
QuantizationAnalyzer
reference/21_performance_engineering_code.md
Analyze quantization impact on accuracy
GradientCheckpointAnalyzer
reference/21_performance_engineering_code.md
Optimize memory with gradient checkpointing
AttentionProfiler
reference/21_performance_engineering_code.md
Profile attention mechanisms
LatencyThroughputOptimizer
reference/21_performance_engineering_code.md
Balance latency vs throughput tradeoffs
StreamingOptimizer
reference/21_performance_engineering_code.md
Optimize token streaming performance
CostOptimizer
reference/21_performance_engineering_code.md
Optimize inference cost per request
BenchmarkSuite
reference/21_performance_engineering_code.md
Comprehensive benchmarking framework
Cost Engineering
Example
File
Description
calculate_cost
reference/26a_cost_estimation_formulas.md
Calculate token costs for API calls
project_monthly_cost
reference/26a_cost_estimation_formulas.md
Project monthly costs from usage patterns
calculate_rag_cost
reference/26a_cost_estimation_formulas.md
Calculate RAG system costs including retrieval
calculate_breakeven
reference/26a_cost_estimation_formulas.md
GPU self-hosting break-even analysis
CostCategory
reference/26b_cost_engineering_code.md
Enum of AI cost categories
CostItem
reference/26b_cost_engineering_code.md
Model individual cost items
AICostBreakdown
reference/26b_cost_engineering_code.md
Complete cost breakdown for AI system
TCOCalculator
reference/26b_cost_engineering_code.md
Calculate total cost of ownership
CostAttributionEngine
reference/26b_cost_engineering_code.md
Attribute costs to teams and projects
UnitEconomicsCalculator
reference/26b_cost_engineering_code.md
Calculate cost per request/user/revenue
GPUOptimizer
reference/26b_cost_engineering_code.md
Analyze and optimize GPU utilization
InferenceCostOptimizer
reference/26b_cost_engineering_code.md
Optimize inference costs
TrainingCostOptimizer
reference/26b_cost_engineering_code.md
Optimize training costs with strategy selection
BuildVsBuyAnalyzer
reference/26b_cost_engineering_code.md
Analyze build vs buy decisions
CrossoverAnalysis
reference/26b_cost_engineering_code.md
Calculate when build becomes cheaper than buy
APICostManager
reference/26b_cost_engineering_code.md
Track and manage API costs with budgets
APICostOptimizer
reference/26b_cost_engineering_code.md
Recommend API cost optimizations
TokenOptimizer
reference/26b_cost_engineering_code.md
Analyze and optimize token usage
CostDashboard
reference/26b_cost_engineering_code.md
Cost visibility dashboard for teams
CostGovernanceEnforcer
reference/26b_cost_engineering_code.md
Enforce cost governance policies
ScalingCostModel
reference/26b_cost_engineering_code.md
Project costs at different scale
Data Architecture
Example
File
Description
ArchitectureSelector
reference/24_data_architecture_code.md
Choose between Lambda and Kappa architectures
SchemaRegistry
reference/24_data_architecture_code.md
Central schema registry with compatibility checking
ContractEnforcer
reference/24_data_architecture_code.md
Enforce data contracts between producers and consumers