Appendix O: Concept Index
A comprehensive index of key concepts, techniques, and terminology covered in this book. Each entry includes a brief definition and chapter references for deeper exploration.
A
A/B Testing — Comparing two variants (prompts, models, features) with statistical rigor to determine which performs better. Ch 15. See also: Experiment design, Statistical significance
Ablation study — Removing components to understand their contribution to overall performance. Ch 15
Activation function — Non-linear transformation applied to neuron outputs (ReLU, GELU, SiLU). Ch 5
Adapter layers — Small trainable layers added to frozen models for parameter-efficient fine-tuning. Ch 5. See also: LoRA, PEFT
Agent loop — The cycle of observe → think → act → observe that drives autonomous agents. Ch 8
Agentic systems — AI systems that can take autonomous actions in an environment using tools. Ch 8
Alignment — Training models to behave according to human preferences and values. Ch 20
Anomaly detection — Identifying data points that deviate significantly from expected patterns. Ch 15, Ch 31
API gateway — Entry point that handles authentication, rate limiting, and routing for API requests. Ch 14
Attention mechanism — Allows models to focus on relevant parts of input when producing output. Ch 5
Attention sink — First tokens that accumulate disproportionate attention in long contexts. Ch 5
Auto-regressive generation — Generating tokens one at a time, each conditioned on previous tokens. Ch 5
AWQ (Activation-aware Weight Quantization) — Quantization method that preserves important weights. Ch 9
B
Backpressure — Mechanism to slow down producers when consumers can’t keep up. Ch 14, Ch 31
Batch inference — Processing multiple inputs together for efficiency. Ch 9
Batching strategies — Static batching (fixed size) vs. continuous batching (dynamic). Ch 9
BERT — Bidirectional Encoder Representations from Transformers; encoder-only architecture. Ch 5
Bi-encoder — Encodes query and documents separately for efficient similarity search. Ch 7. See also: Cross-encoder
BM25 — Classic term-frequency based ranking algorithm for keyword search. Ch 7
Breakeven analysis — Calculating when self-hosting becomes cheaper than API usage. Ch 32
Byte-Pair Encoding (BPE) — Subword tokenization algorithm that builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols (bytes or characters). Ch 5
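The merge step described in the Byte-Pair Encoding entry can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: it works on characters rather than raw bytes for readability, the function names are hypothetical, and ties between equally frequent pairs go to whichever pair was seen first.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # max() returns the first maximal key, so ties go to the first-seen pair.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} mapping."""
    words = {tuple(word): count for word, count in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words
```

Calling learn_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2) first merges "e" and "s", then "es" and "t", which is how a common suffix such as "est" ends up as a single token.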
C
Caching — Storing computed results to avoid redundant computation. Types: prompt caching, KV caching, semantic caching. Ch 6, Ch 9
Calibration — Ensuring model confidence scores reflect actual accuracy. Ch 15
Chain-of-thought (CoT) — Prompting technique that elicits step-by-step reasoning. Ch 6
Chunk — A segment of a document optimized for retrieval and context fitting. Ch 7
Chunking strategies — Methods for splitting documents: fixed-size, semantic, recursive, sentence-based. Ch 7
Circuit breaker — Pattern that prevents cascading failures by failing fast when a dependency is unhealthy. Ch 31
Claude — Anthropic’s family of language models. Throughout
Computer use — Agent capability to interact with computer interfaces via screenshots and actions. Ch 8
Concept drift — When the relationship between inputs and outputs changes over time. Ch 15, Ch 31
Constitutional AI — Training approach using principles to guide model behavior. Ch 20
Content filtering — Detecting and blocking harmful or inappropriate content. Ch 16, Ch 20
Context window — Maximum number of tokens a model can process in a single request. Ch 5, Ch 6
Continuous batching — Dynamically adding/removing requests from a batch as they complete. Ch 9
Cosine similarity — Measure of similarity between two vectors: the cosine of the angle between them, equal to their dot product after each is normalized to unit length. Ch 5, Ch 7
Cost attribution — Assigning infrastructure costs to specific teams, features, or users. Ch 32
Cross-encoder — Encodes query and document together for more accurate but slower ranking. Ch 7
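As a concrete reading of the Cosine similarity entry, the sketch below computes it for plain Python lists. The function name is illustrative, and treating a zero vector as an error is an assumption made here because the measure is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.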
D
Data drift — When input data distribution changes from training distribution. Ch 15, Ch 30
Data lineage — Tracking the origin and transformations of data through pipelines. Ch 30
Decoder-only — Transformer architecture that uses only the decoder (GPT, Claude, LLaMA). Ch 5
Defense-in-depth — Multiple overlapping security layers where each catches what others miss. Ch 16
Degradation hierarchy — Ordered levels of fallback behavior as system components fail. Ch 31
Dense retrieval — Retrieval using learned vector representations (embeddings). Ch 7
Direct Preference Optimization (DPO) — Training method aligning models with preferences without reward models. Ch 5
Distributed training — Training models across multiple GPUs/machines. Ch 9
Drift detection — Monitoring for changes in data or model behavior over time. Ch 15
E
Embedding — Dense vector representation that captures semantic meaning. Ch 5, Ch 7
Embedding model — Model specialized for generating embeddings (e.g., text-embedding-3-large). Ch 7
Encoder-decoder — Transformer architecture with separate encoder and decoder (T5, BART). Ch 5
Encoder-only — Transformer architecture using only encoder (BERT, RoBERTa). Ch 5
Error budget — Allowed amount of unreliability (1 - SLO target). Ch 31
Evaluation harness — Framework for systematically evaluating model performance. Ch 15
Exact match — Evaluation metric checking if output exactly matches expected answer. Ch 15
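The Error budget entry defines the budget as 1 - SLO target. A small sketch shows what that means in minutes of allowed unavailability. The 30-day window is an assumption chosen for illustration, and the function names are hypothetical.

```python
def error_budget(slo_target):
    """Fraction of requests (or time) allowed to fail: 1 - SLO target."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return 1 - slo_target


def allowed_downtime_minutes(slo_target, window_days=30):
    """Convert an availability SLO into minutes of downtime per window."""
    return error_budget(slo_target) * window_days * 24 * 60
```

Under these assumptions, a 99.9% availability SLO leaves about 43.2 minutes of downtime per 30 days, and 99.99% leaves about 4.3 minutes.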
F
F1 score — Harmonic mean of precision and recall. Ch 15
Fallback strategy — Alternative behavior when primary path fails. Ch 31
Feature drift — When feature distributions change from training to serving. Ch 30
Feature store — Centralized repository for machine learning features. Ch 30
Few-shot learning — Learning from a small number of examples provided in context. Ch 6
Fine-tuning — Adapting a pre-trained model to a specific task with additional training. Ch 5
Flash Attention — Memory-efficient attention implementation that avoids materializing full attention matrix. Ch 5, Ch 9
FP8 — 8-bit floating point format for efficient inference. Ch 9
Function calling — Model capability to output structured tool invocations. Ch 8
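The F1 score entry, together with the Precision and Recall entries, can be illustrated for a retrieval-style setting where both the retrieved and the relevant items are sets of IDs. Returning 0.0 when a denominator is empty is a convention assumed here, not something the index specifies.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two collections of item IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: dominated by the smaller of the two values.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For retrieved items a, b, c, d and relevant items a, b, e, precision is 2/4, recall is 2/3, and F1 is 4/7 (about 0.571).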
G
Generative AI — AI that creates new content (text, images, code) rather than just classifying. Ch 1
GPT — Generative Pre-trained Transformer; OpenAI’s model family. Ch 5
GPTQ — Post-training quantization method for large language models. Ch 9
Graceful degradation — Maintaining partial functionality when components fail. Ch 31
Graph RAG — RAG enhanced with knowledge graphs for multi-hop reasoning. Ch 7
Ground truth — Known correct answers used for evaluation. Ch 15
Grouped-Query Attention (GQA) — Attention variant sharing KV heads across query heads for efficiency. Ch 5, Ch 9
Guardrails — Safety mechanisms that constrain model behavior and outputs. Ch 11, Ch 16
H
Hallucination — Model generating false or unsupported information. Ch 6, Ch 7, Ch 20
Hierarchical agents — Multi-level agent architectures with supervisors and workers. Ch 8
Human-in-the-loop (HITL) — Systems requiring human approval for certain actions. Ch 8, Ch 20
Hybrid search — Combining vector search and keyword search for better retrieval. Ch 7
Hyperparameter — Configuration value set before training (learning rate, batch size). Ch 5
I
In-context learning — Model learning from examples provided in the prompt. Ch 6
Index — Data structure enabling fast retrieval (vector index, inverted index). Ch 7
Inference — Running a trained model to generate predictions/outputs. Ch 9
Inference server — Specialized server for efficient model serving (vLLM, TensorRT-LLM). Ch 9
Injection attack — Malicious input designed to manipulate model behavior. Ch 16
Instruction hierarchy — System prompts take precedence over user messages for safety. Ch 16
Instruction tuning — Fine-tuning models to follow instructions better. Ch 5
J
Jailbreak — Attempt to bypass model safety restrictions. Ch 16
JSON mode — Constraining model output to valid JSON format. Ch 6
Judge (LLM-as-judge) — Using an LLM to evaluate outputs of another LLM. Ch 15. See also: LLM-as-judge
K
K-nearest neighbors (KNN) — Finding k most similar vectors to a query. Ch 7
Knowledge cutoff — Date after which model has no training data. Throughout
Knowledge graph — Structured representation of entities and relationships. Ch 7
KV cache — Stored attention key and value tensors from previous tokens, reused during generation to avoid recomputation. Ch 5, Ch 9
L
Latency — Time from request to response. Metrics: TTFT, TPS, P50, P99. Ch 9, Ch 27
Layer normalization — Normalizing activations across features within each sample. Ch 5
LLM (Large Language Model) — Neural network trained on massive text corpora for language tasks. Throughout
LLM-as-judge — Using language models to evaluate quality of other model outputs. Ch 15
Load balancing — Distributing requests across multiple servers/replicas. Ch 9, Ch 14
Load shedding — Intentionally dropping requests to maintain service for remaining traffic. Ch 31
LoRA (Low-Rank Adaptation) — Parameter-efficient fine-tuning using low-rank decomposition. Ch 5
Lost in the middle — Models attending less to information in the middle of long contexts. Ch 5, Ch 6
M
Max tokens — Parameter limiting the number of tokens in model output. Ch 6, Ch 32
MCP (Model Context Protocol) — Open protocol for connecting LLM applications to external tools and data sources. Ch 8, Ch 10
Mean Reciprocal Rank (MRR) — Metric averaging, across queries, the reciprocal of the rank of the first relevant result. Ch 7, Ch 15
Memory (agent) — Mechanisms for agents to remember information across turns. Types: working, episodic, semantic. Ch 8
Mental model — Conceptual framework for understanding complex systems. Appendix A
Mixture of Experts (MoE) — Architecture routing inputs to specialized sub-networks. Ch 5
Model card — Documentation describing a model’s capabilities, limitations, and intended use. Ch 20
Model drift — When model performance degrades over time due to changing data. Ch 15
Model serving — Infrastructure for running inference at scale. Ch 9
Multi-head attention — Attention with multiple parallel attention operations. Ch 5
Multi-modal — Models that process multiple input types (text, images, audio). Ch 17, Ch 18
Multi-Query Attention (MQA) — Attention variant in which all query heads share a single key/value head, reducing KV cache memory. Ch 9. See also: Grouped-Query Attention (GQA)
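The Mean Reciprocal Rank entry can be made concrete with a short sketch. Each query is represented here as a pair of (ranked result IDs, set of relevant IDs). That input shape and the function names are assumptions for illustration, and a query with no relevant result in its ranking contributes 0.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        raise ValueError("at least one query is required")
    total = sum(reciprocal_rank(ranked, relevant) for ranked, relevant in queries)
    return total / len(queries)
```

With three queries whose first relevant results sit at rank 1, rank 3, and nowhere, the score is (1 + 1/3 + 0) / 3, or about 0.444.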
N
NDCG (Normalized Discounted Cumulative Gain) — Metric measuring ranking quality with position discount. Ch 7, Ch 15
Next-token prediction — Training objective of predicting the next token given previous tokens. Ch 5
Normalization — Standardizing data or activations to improve training stability. Ch 5
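For the NDCG entry, the sketch below uses the linear-gain form of discounted cumulative gain, where the result at position i contributes rel / log2(i + 1). Another common variant uses 2^rel - 1 as the gain. Which one the book uses is not stated in this index, so treat the choice as an assumption.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    k = len(relevances) if k is None else k
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A ranking already sorted by relevance scores 1.0. Moving the only relevant result from position 1 to position 2 drops the score to 1 / log2(3), or about 0.63.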
O
Observability — Ability to understand system state from external outputs (logs, metrics, traces). Ch 11
OCR (Optical Character Recognition) — Extracting text from images. Ch 17
One-way door — Decision that is difficult or impossible to reverse. Ch 26
Output parsing — Extracting structured data from model text output. Ch 6
Overfitting — Model memorizing training data instead of learning generalizable patterns. Ch 5
P
P50/P99 — 50th/99th percentile latency (median and tail latency). Ch 9, Ch 31
PagedAttention — Memory management technique treating KV cache like virtual memory. Ch 9
Parameter-efficient fine-tuning (PEFT) — Fine-tuning methods that update only a small fraction of parameters. Ch 5
Perplexity — Measure of how well a model predicts text, computed as the exponential of the average per-token negative log-likelihood (lower is better). Ch 5
PII (Personally Identifiable Information) — Data that can identify individuals. Ch 16, Ch 20
Position encoding — Mechanism for models to understand token positions. Types: absolute, relative, RoPE. Ch 5
Precision — Fraction of retrieved items that are relevant. Ch 15
Prefix caching — Caching KV for common prompt prefixes to avoid recomputation. Ch 9
Prompt — Input text given to a language model. Ch 6
Prompt engineering — Crafting prompts to elicit desired model behavior. Ch 6
Prompt injection — Attack manipulating model via malicious input text. Ch 16
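The P50/P99 entry refers to latency percentiles. Several percentile definitions exist. The sketch below uses the nearest-rank method, which always returns an observed value. That choice, and the function name, are assumptions for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

For latencies of 10, 20, 30, 40, and 1000 ms, P50 is 30 ms while P99 is 1000 ms, which is why tail percentiles are tracked separately from the median.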
Q
QLoRA — Quantized LoRA combining 4-bit quantization with LoRA fine-tuning. Ch 5
Quality metrics — Measures of output quality: accuracy, helpfulness, harmlessness. Ch 15
Quantization — Reducing numerical precision (FP32 → INT8, FP8) for efficiency. Ch 9
R
RAG (Retrieval-Augmented Generation) — Combining retrieval with generation for grounded responses. Ch 7
Rate limiting — Restricting request frequency to prevent abuse and overload. Ch 14, Ch 16
ReAct — Agent pattern: Reasoning + Acting in interleaved steps. Ch 8
Recall — Fraction of relevant items that were retrieved. Ch 15
Recall@K — Fraction of relevant items found in top K results. Ch 7, Ch 15
Reciprocal Rank Fusion (RRF) — Method for combining multiple ranked lists. Ch 7
Red teaming — Adversarial testing to find vulnerabilities. Ch 16
Reinforcement Learning from Human Feedback (RLHF) — Training method using human preferences. Ch 5
Reranking — Re-scoring initial retrieval results for better ordering. Ch 7
Residual connection — Adding input directly to output (skip connection). Ch 5
Responsible AI — Developing AI systems that are safe, fair, and beneficial. Ch 20
Retrieval — Finding relevant documents/passages for a query. Ch 7
ROUGE — Metrics measuring overlap between generated and reference text. Ch 15
RoPE (Rotary Position Embedding) — Position encoding using rotation matrices. Ch 5
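The Reciprocal Rank Fusion entry can be sketched as follows: each document scores 1 / (k + rank) in every list where it appears, and the scores are summed. The constant k = 60 is the commonly used default. Breaking ties by document ID is an assumption added here to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; equal scores fall back to ID order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Fusing a keyword ranking ["a", "b", "c"] with a vector ranking ["b", "c", "a"] yields ["b", "a", "c"], because b sits near the top of both lists.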
S
Safety — Preventing harmful model outputs and behaviors. Ch 16, Ch 20
Sandboxing — Isolating code execution to prevent damage. Ch 8
Scaling laws — Empirical relationships between model size, training data, compute, and performance. Ch 5
Semantic caching — Caching based on meaning similarity rather than exact match. Ch 9
Semantic chunking — Splitting documents based on meaning/topic boundaries. Ch 7
Self-attention — Attention mechanism where input attends to itself. Ch 5
Self-consistency — Generating multiple responses and selecting most common answer. Ch 6
Self-hosting — Running models on your own infrastructure instead of APIs. Ch 9
Sentiment analysis — Classifying text as positive, negative, or neutral. Ch 6
Serving — Running inference for production traffic. Ch 9
SLA (Service Level Agreement) — Contractual commitment about service behavior. Ch 31
SLI (Service Level Indicator) — Measured metric of service behavior. Ch 31
SLO (Service Level Objective) — Target value for service behavior. Ch 31
Sparse attention — Attention patterns that skip some positions for efficiency. Ch 5
Speculative decoding — Using a small draft model to propose tokens that the large model then verifies in parallel. Ch 9
Stop sequence — Token or string that terminates generation. Ch 6
Streaming — Returning tokens as they’re generated rather than waiting for completion. Ch 6, Ch 14
Structured output — Constraining model output to specific formats (JSON, XML). Ch 6
Subword tokenization — Splitting text into sub-word units for vocabulary coverage. Ch 5
Summarization — Condensing longer text into shorter form. Ch 6
Supervisor agent — Agent that coordinates other agents in multi-agent systems. Ch 8
System prompt — Initial instructions that shape model behavior for a session. Ch 6
T
Temperature — Parameter controlling randomness in generation (higher = more random). Ch 6
Tensor parallelism — Splitting model weights across GPUs. Ch 9
TensorRT — NVIDIA’s inference optimization library. Ch 9
Text embedding — Vector representation of text capturing semantic meaning. Ch 5, Ch 7
Throughput — Number of requests or tokens processed per unit time. Ch 9
Token — Basic unit of text processed by LLMs (roughly 4 characters of English text on average; varies by tokenizer and language). Ch 5
Token budget — Maximum tokens allowed for a request (context + generation). Ch 6
Tokenization — Converting text into tokens for model processing. Ch 5
Tool use — Model capability to invoke external functions/APIs. Ch 8
Top-K sampling — Sampling from K most likely next tokens. Ch 6
Top-P sampling (nucleus) — Sampling from smallest set of tokens with cumulative probability >= P. Ch 6
Training-serving skew — Differences between training and serving environments. Ch 30
Transformer — Neural network architecture based on self-attention. Ch 5
TTFT (Time to First Token) — Latency until first generated token arrives. Ch 9
Two-way door — Decision that can be easily reversed. Ch 26
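The Temperature, Top-K sampling, and Top-P sampling entries describe steps that are typically applied to the model's next-token distribution before one token is drawn. The sketch below shows each step on a plain list of logits or probabilities. The function names are hypothetical, and real inference servers implement these steps on tensors.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability >= p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the range (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

For probabilities [0.5, 0.3, 0.15, 0.05], both top_k_filter(probs, 2) and top_p_filter(probs, 0.7) keep only the first two tokens and rescale them to roughly 0.625 and 0.375.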
U
Unit economics — Cost and revenue per unit (request, user, query). Ch 32
V
Vector database — Database optimized for storing and querying vector embeddings. Ch 7
Vector search — Finding similar vectors using distance metrics (cosine, L2). Ch 7
Vision-language model — Model that processes both images and text. Ch 17
vLLM — Open-source LLM serving library with PagedAttention. Ch 9
W
Weight quantization — Reducing precision of model weights for smaller size and faster inference. Ch 9
Window attention — Attending only to local context window (sliding window attention). Ch 5
Worker agent — Specialized agent performing specific tasks in multi-agent systems. Ch 8
Z
Zero-shot learning — Model performing task without any task-specific examples in prompt. Ch 6
Quick Reference by Topic
Core LLM Concepts
Transformer, Attention, Tokenization, Embedding, Context window, Temperature, Top-P, Token
RAG & Retrieval
RAG, Chunking, Vector search, Hybrid search, Reranking, Bi-encoder, Cross-encoder, BM25, NDCG, MRR
Agents & Tools
Agent loop, ReAct, Tool use, Function calling, MCP, Supervisor agent, Sandboxing, Memory
Production & Deployment
Inference server, vLLM, Continuous batching, KV cache, Quantization, Tensor parallelism, Latency, Throughput
Evaluation & Quality
LLM-as-judge, A/B testing, Calibration, Precision, Recall, F1, ROUGE, Ground truth, Drift detection
Security & Safety
Prompt injection, Jailbreak, Defense-in-depth, Guardrails, Red teaming, Content filtering, PII
Reliability & Operations
SLO, Error budget, Circuit breaker, Graceful degradation, Load shedding, Observability
Cost & Efficiency
Token economics, Cost attribution, Breakeven analysis, Unit economics, Caching, Quantization