Appendix O: Concept Index

A comprehensive index of key concepts, techniques, and terminology covered in this book. Each entry includes a brief definition and chapter references for deeper exploration.


A

A/B testing — Comparing two variants (prompts, models, features) with statistical rigor to determine which performs better. Ch 15. See also: Experiment design, Statistical significance

Ablation study — Removing components to understand their contribution to overall performance. Ch 15

Activation function — Non-linear transformation applied to neuron outputs (ReLU, GELU, SiLU). Ch 5

Adapter layers — Small trainable layers added to frozen models for parameter-efficient fine-tuning. Ch 5. See also: LoRA, PEFT

Agent loop — The cycle of observe → think → act → observe that drives autonomous agents. Ch 8

Agentic systems — AI systems that can take autonomous actions in an environment using tools. Ch 8

Alignment — Training models to behave according to human preferences and values. Ch 20

Anomaly detection — Identifying data points that deviate significantly from expected patterns. Ch 15, Ch 31

API gateway — Entry point that handles authentication, rate limiting, and routing for API requests. Ch 14

Attention mechanism — Allows models to focus on relevant parts of input when producing output. Ch 5

Attention sink — Tendency of the first few tokens in a sequence to accumulate disproportionate attention, regardless of their content, especially in long contexts. Ch 5

Auto-regressive generation — Generating tokens one at a time, each conditioned on previous tokens. Ch 5

AWQ (Activation-aware Weight Quantization) — Quantization method that preserves important weights. Ch 9


B

Backpressure — Mechanism to slow down producers when consumers can’t keep up. Ch 14, Ch 31

Batch inference — Processing multiple inputs together for efficiency. Ch 9

Batching strategies — Static batching (fixed size) vs. continuous batching (dynamic). Ch 9

BERT — Bidirectional Encoder Representations from Transformers; encoder-only architecture. Ch 5

Bi-encoder — Encodes query and documents separately for efficient similarity search. Ch 7. See also: Cross-encoder

BM25 — Classic term-frequency based ranking algorithm for keyword search. Ch 7

Breakeven analysis — Calculating when self-hosting becomes cheaper than API usage. Ch 32

Byte-Pair Encoding (BPE) — Subword tokenization algorithm that builds a vocabulary by repeatedly merging the most frequent adjacent pair of symbols (bytes or characters). Ch 5


C

Caching — Storing computed results to avoid redundant computation. Types: prompt caching, KV caching, semantic caching. Ch 6, Ch 9

Calibration — Ensuring model confidence scores reflect actual accuracy. Ch 15

Chain-of-thought (CoT) — Prompting technique that elicits step-by-step reasoning. Ch 6

Chunk — A segment of a document optimized for retrieval and context fitting. Ch 7

Chunking strategies — Methods for splitting documents: fixed-size, semantic, recursive, sentence-based. Ch 7

Circuit breaker — Pattern that prevents cascading failures by failing fast when a dependency is unhealthy. Ch 31
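
The fail-fast behavior can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: reject immediately instead of waiting on an unhealthy dependency.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each dependency call as `breaker.call(fetch, url)` and treat `CircuitOpenError` as the signal to use a fallback.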

Claude — Anthropic’s family of language models. Throughout

Computer use — Agent capability to interact with computer interfaces via screenshots and actions. Ch 8

Concept drift — When the relationship between inputs and outputs changes over time. Ch 15, Ch 31

Constitutional AI — Training approach using principles to guide model behavior. Ch 20

Content filtering — Detecting and blocking harmful or inappropriate content. Ch 16, Ch 20

Context window — Maximum number of tokens a model can process in a single request. Ch 5, Ch 6

Continuous batching — Dynamically adding/removing requests from a batch as they complete. Ch 9

Cosine similarity — Measure of similarity between two vectors: the cosine of the angle between them, equal to the dot product after both vectors are normalized to unit length. Ch 5, Ch 7
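
As a rough sketch, the computation fits in one small function. It uses only the standard library; the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1.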

Cost attribution — Assigning infrastructure costs to specific teams, features, or users. Ch 32

Cross-encoder — Encodes query and document together for more accurate but slower ranking. Ch 7


D

Data drift — When input data distribution changes from training distribution. Ch 15, Ch 30

Data lineage — Tracking the origin and transformations of data through pipelines. Ch 30

Decoder-only — Transformer architecture that uses only the decoder (GPT, Claude, LLaMA). Ch 5

Defense-in-depth — Multiple overlapping security layers where each catches what others miss. Ch 16

Degradation hierarchy — Ordered levels of fallback behavior as system components fail. Ch 31

Dense retrieval — Retrieval using learned vector representations (embeddings). Ch 7

Direct Preference Optimization (DPO) — Training method that aligns models with preference data directly, without training a separate reward model. Ch 5

Distributed training — Training models across multiple GPUs/machines. Ch 9

Drift detection — Monitoring for changes in data or model behavior over time. Ch 15


E

Embedding — Dense vector representation that captures semantic meaning. Ch 5, Ch 7

Embedding model — Model specialized for generating embeddings (e.g., text-embedding-3-large). Ch 7

Encoder-decoder — Transformer architecture with separate encoder and decoder (T5, BART). Ch 5

Encoder-only — Transformer architecture using only encoder (BERT, RoBERTa). Ch 5

Error budget — Allowed amount of unreliability (1 - SLO target). Ch 31
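
A small worked example shows how the formula turns into allowed downtime. The 30-day window and the function name are assumptions for illustration; use whatever window your SLO is defined over.

```python
def error_budget(slo_target, window_minutes=30 * 24 * 60):
    """Return (budget_fraction, allowed_bad_minutes) for an availability SLO.

    slo_target is a fraction such as 0.999; the default window is 30 days.
    """
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in (0, 1]")
    budget = 1 - slo_target
    return budget, budget * window_minutes


fraction, minutes = error_budget(0.999)
print(f"budget: {fraction:.4%} of the window, about {minutes:.1f} minutes")
```

With a 99.9% target over 30 days, the budget works out to roughly 43 minutes of unreliability.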

Evaluation harness — Framework for systematically evaluating model performance. Ch 15

Exact match — Evaluation metric checking if output exactly matches expected answer. Ch 15


F

F1 score — Harmonic mean of precision and recall. Ch 15
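
The relationship between precision, recall, and F1 can be sketched from raw counts. The function and argument names (`tp`, `fp`, `fn` for true positives, false positives, and false negatives) are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: dominated by whichever of the two is lower.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, 8 true positives, 2 false positives, and 8 false negatives give precision 0.8, recall 0.5, and F1 of about 0.615, which is below the arithmetic mean of 0.65.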

Fallback strategy — Alternative behavior when primary path fails. Ch 31

Feature drift — When feature distributions change from training to serving. Ch 30

Feature store — Centralized repository for machine learning features. Ch 30

Few-shot learning — Learning from a small number of examples provided in context. Ch 6

Fine-tuning — Adapting a pre-trained model to a specific task with additional training. Ch 5

Flash Attention — Memory-efficient attention implementation that avoids materializing full attention matrix. Ch 5, Ch 9

FP8 — 8-bit floating point format for efficient inference. Ch 9

Function calling — Model capability to output structured tool invocations. Ch 8


G

Generative AI — AI that creates new content (text, images, code) rather than just classifying. Ch 1

GPT — Generative Pre-trained Transformer; OpenAI’s model family. Ch 5

GPTQ — Post-training quantization method for large language models. Ch 9

Graceful degradation — Maintaining partial functionality when components fail. Ch 31

Graph RAG — RAG enhanced with knowledge graphs for multi-hop reasoning. Ch 7

Ground truth — Known correct answers used for evaluation. Ch 15

Grouped-Query Attention (GQA) — Attention variant sharing KV heads across query heads for efficiency. Ch 5, Ch 9

Guardrails — Safety mechanisms that constrain model behavior and outputs. Ch 11, Ch 16


H

Hallucination — Model generating false or unsupported information. Ch 6, Ch 7, Ch 20

Hierarchical agents — Multi-level agent architectures with supervisors and workers. Ch 8

Human-in-the-loop (HITL) — Systems requiring human approval for certain actions. Ch 8, Ch 20

Hybrid search — Combining vector search and keyword search for better retrieval. Ch 7

Hyperparameter — Configuration value set before training (learning rate, batch size). Ch 5


I

In-context learning — Model learning from examples provided in the prompt. Ch 6

Index — Data structure enabling fast retrieval (vector index, inverted index). Ch 7

Inference — Running a trained model to generate predictions/outputs. Ch 9

Inference server — Specialized server for efficient model serving (vLLM, TensorRT-LLM). Ch 9

Injection attack — Malicious input designed to manipulate model behavior. Ch 16

Instruction hierarchy — Principle that system prompts take precedence over user messages, so that untrusted input cannot override safety instructions. Ch 16

Instruction tuning — Fine-tuning models to follow instructions better. Ch 5


J

Jailbreak — Attempt to bypass model safety restrictions. Ch 16

JSON mode — Constraining model output to valid JSON format. Ch 6

Judge (LLM-as-judge) — Using an LLM to evaluate outputs of another LLM. Ch 15


K

K-nearest neighbors (KNN) — Finding k most similar vectors to a query. Ch 7

Knowledge cutoff — Date after which model has no training data. Throughout

Knowledge graph — Structured representation of entities and relationships. Ch 7

KV cache — Stored key-value pairs from previous tokens to avoid recomputation. Ch 5, Ch 9


L

Latency — Time from request to response. Common metrics: TTFT (time to first token), TPS (tokens per second), and P50/P99 percentiles. Ch 9, Ch 27

Layer normalization — Normalizing activations across features within each sample. Ch 5

LLM (Large Language Model) — Neural network trained on massive text corpora for language tasks. Throughout

LLM-as-judge — Using language models to evaluate quality of other model outputs. Ch 15

Load balancing — Distributing requests across multiple servers/replicas. Ch 9, Ch 14

Load shedding — Intentionally dropping requests to maintain service for remaining traffic. Ch 31

LoRA (Low-Rank Adaptation) — Parameter-efficient fine-tuning using low-rank decomposition. Ch 5

Lost in the middle — Models attending less to information in the middle of long contexts. Ch 5, Ch 6


M

Max tokens — Parameter limiting the number of tokens in model output. Ch 6, Ch 32

MCP (Model Context Protocol) — Protocol for tools to communicate with LLM applications. Ch 8, Ch 10

Mean Reciprocal Rank (MRR) — Metric measuring rank of first relevant result. Ch 7, Ch 15
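
A minimal sketch of the calculation, assuming each query has a ranked list of result IDs and a set of relevant IDs (both names are illustrative):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant result per query.

    ranked_results: list of ranked ID lists, one per query.
    relevant: list of sets of relevant IDs, aligned with ranked_results.
    A query with no relevant result in its list contributes 0.
    """
    if not ranked_results:
        raise ValueError("need at least one query")
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, item in enumerate(results, start=1):
            if item in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```

If the first relevant hit is at rank 1 for one query, rank 3 for another, and missing for a third, MRR is (1 + 1/3 + 0) / 3, or about 0.44.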

Memory (agent) — Mechanisms for agents to remember information across turns. Types: working, episodic, semantic. Ch 8

Mental model — Conceptual framework for understanding complex systems. Appendix A

Mixture of Experts (MoE) — Architecture routing inputs to specialized sub-networks. Ch 5

Model card — Documentation describing a model’s capabilities, limitations, and intended use. Ch 20

Model drift — When model performance degrades over time due to changing data. Ch 15

Model serving — Infrastructure for running inference at scale. Ch 9

Multi-head attention — Attention with multiple parallel attention operations. Ch 5

Multi-modal — Models that process multiple input types (text, images, audio). Ch 17, Ch 18

Multi-Query Attention (MQA) — Attention variant in which all query heads share a single key head and a single value head, shrinking the KV cache. Ch 9. See also: Grouped-Query Attention


N

NDCG (Normalized Discounted Cumulative Gain) — Metric measuring ranking quality with position discount. Ch 7, Ch 15
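
The position discount can be made concrete with a short sketch. It uses the linear-gain form of DCG (gain divided by log2 of position + 1); an exponential-gain variant also exists, and the function names here are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ordering divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

`relevances` holds the graded relevance of each result in ranked order; a perfectly ordered list scores 1.0, and putting highly relevant items lower in the list reduces the score.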

Next-token prediction — Training objective of predicting the next token given previous tokens. Ch 5

Normalization — Standardizing data or activations to improve training stability. Ch 5


O

Observability — Ability to understand system state from external outputs (logs, metrics, traces). Ch 11

OCR (Optical Character Recognition) — Extracting text from images. Ch 17

One-way door — Decision that is difficult or impossible to reverse. Ch 26

Output parsing — Extracting structured data from model text output. Ch 6

Overfitting — Model memorizing training data instead of learning generalizable patterns. Ch 5


P

P50/P99 — 50th/99th percentile latency (median and tail latency). Ch 9, Ch 31

PagedAttention — Memory management technique treating KV cache like virtual memory. Ch 9

Parameter-efficient fine-tuning (PEFT) — Fine-tuning methods that update only a small fraction of parameters. Ch 5

Perplexity — Measure of how well a model predicts text: the exponential of the average negative log-likelihood per token (lower is better). Ch 5
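
As a sketch, perplexity can be computed from per-token log-probabilities. The input is assumed to be natural-log probabilities, one per token; the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural logs)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

A model that assigns probability 1/4 to every token has perplexity 4, which is why perplexity is often read as the effective number of equally likely choices per token.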

PII (Personally Identifiable Information) — Data that can identify individuals. Ch 16, Ch 20

Position encoding — Mechanism for models to understand token positions. Types: absolute, relative, RoPE. Ch 5

Precision — Fraction of retrieved items that are relevant. Ch 15

Prefix caching — Caching KV for common prompt prefixes to avoid recomputation. Ch 9

Prompt — Input text given to a language model. Ch 6

Prompt engineering — Crafting prompts to elicit desired model behavior. Ch 6

Prompt injection — Attack manipulating model via malicious input text. Ch 16


Q

QLoRA — Quantized LoRA combining 4-bit quantization with LoRA fine-tuning. Ch 5

Quality metrics — Measures of output quality: accuracy, helpfulness, harmlessness. Ch 15

Quantization — Reducing numerical precision (FP32 → INT8, FP8) for efficiency. Ch 9


R

RAG (Retrieval-Augmented Generation) — Combining retrieval with generation for grounded responses. Ch 7

Rate limiting — Restricting request frequency to prevent abuse and overload. Ch 14, Ch 16

ReAct — Agent pattern: Reasoning + Acting in interleaved steps. Ch 8

Recall — Fraction of relevant items that were retrieved. Ch 15

Recall@K — Fraction of relevant items found in top K results. Ch 7, Ch 15

Reciprocal Rank Fusion (RRF) — Method for combining multiple ranked lists. Ch 7
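
A minimal sketch of the fusion step: each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is a commonly used default, taken here as an assumption, and the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: iterable of ranked ID lists (best first).
    Returns IDs sorted by fused score, highest first; ties break by ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['b', 'a', 'd', 'c']
```

Because only ranks are used, the lists need not have comparable scores, which is what makes this approach convenient for combining vector and keyword results.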

Red teaming — Adversarial testing to find vulnerabilities. Ch 16

Reinforcement Learning from Human Feedback (RLHF) — Training method using human preferences. Ch 5

Reranking — Re-scoring initial retrieval results for better ordering. Ch 7

Residual connection — Adding input directly to output (skip connection). Ch 5

Responsible AI — Developing AI systems that are safe, fair, and beneficial. Ch 20

Retrieval — Finding relevant documents/passages for a query. Ch 7

ROUGE — Metrics measuring overlap between generated and reference text. Ch 15

RoPE (Rotary Position Embedding) — Position encoding using rotation matrices. Ch 5


S

Safety — Preventing harmful model outputs and behaviors. Ch 16, Ch 20

Sandboxing — Isolating code execution to prevent damage. Ch 8

Scaling laws — Mathematical relationships between model size, data, and performance. Ch 5

Self-attention — Attention mechanism where input attends to itself. Ch 5

Self-consistency — Generating multiple responses and selecting most common answer. Ch 6

Self-hosting — Running models on your own infrastructure instead of APIs. Ch 9

Semantic caching — Caching based on meaning similarity rather than exact match. Ch 9

Semantic chunking — Splitting documents based on meaning/topic boundaries. Ch 7

Sentiment analysis — Classifying text as positive, negative, or neutral. Ch 6

Serving — Running inference for production traffic. Ch 9

SLA (Service Level Agreement) — Contractual commitment about service behavior. Ch 31

SLI (Service Level Indicator) — Measured metric of service behavior. Ch 31

SLO (Service Level Objective) — Target value for service behavior. Ch 31

Sparse attention — Attention patterns that skip some positions for efficiency. Ch 5

Speculative decoding — Using small model to draft tokens verified by large model. Ch 9

Stop sequence — Token or string that terminates generation. Ch 6

Streaming — Returning tokens as they’re generated rather than waiting for completion. Ch 6, Ch 14

Structured output — Constraining model output to specific formats (JSON, XML). Ch 6

Subword tokenization — Splitting text into sub-word units for vocabulary coverage. Ch 5

Summarization — Condensing longer text into shorter form. Ch 6

Supervisor agent — Agent that coordinates other agents in multi-agent systems. Ch 8

System prompt — Initial instructions that shape model behavior for a session. Ch 6


T

Temperature — Parameter controlling randomness in generation (higher = more random). Ch 6
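
The effect can be sketched by dividing the logits by the temperature before applying softmax. The function name is illustrative; the example covers only positive temperatures and does not model the greedy decoding that temperature 0 usually denotes.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution toward uniform.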

Tensor parallelism — Splitting model weights across GPUs. Ch 9

TensorRT — NVIDIA’s inference optimization library. Ch 9

Text embedding — Vector representation of text capturing semantic meaning. Ch 5, Ch 7

Throughput — Number of requests or tokens processed per unit time. Ch 9

Token — Basic unit of text processed by LLMs (roughly 4 characters of English text on average; the ratio varies by tokenizer and language). Ch 5

Token budget — Maximum tokens allowed for a request (context + generation). Ch 6

Tokenization — Converting text into tokens for model processing. Ch 5

Tool use — Model capability to invoke external functions/APIs. Ch 8

Top-K sampling — Sampling from K most likely next tokens. Ch 6

Top-P sampling (nucleus) — Sampling from the smallest set of most likely tokens whose cumulative probability is at least P. Ch 6
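
The selection of the nucleus can be sketched as a filter over a token-to-probability mapping. The function name and the dictionary representation are assumptions for the example; real decoders apply this to the full vocabulary distribution at each step and then sample from the result.

```python
def nucleus_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability is at least p, then renormalize so the kept values sum to 1.

    probs: dict mapping token -> probability.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}
```

With probabilities of 0.5, 0.3, 0.15, and 0.05 and P = 0.9, the first three tokens are kept and the least likely one is never sampled.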

Training-serving skew — Differences between training and serving environments. Ch 30

Transformer — Neural network architecture based on self-attention. Ch 5

TTFT (Time to First Token) — Latency until first generated token arrives. Ch 9

Two-way door — Decision that can be easily reversed. Ch 26


U

Unit economics — Cost and revenue per unit (request, user, query). Ch 32


V

Vector database — Database optimized for storing and querying vector embeddings. Ch 7

Vector search — Finding similar vectors using distance metrics (cosine, L2). Ch 7

Vision-language model — Model that processes both images and text. Ch 17

vLLM — Open-source LLM serving library with PagedAttention. Ch 9


W

Weight quantization — Reducing precision of model weights for smaller size and faster inference. Ch 9

Window attention — Attending only to local context window (sliding window attention). Ch 5

Worker agent — Specialized agent performing specific tasks in multi-agent systems. Ch 8


Z

Zero-shot learning — Model performing task without any task-specific examples in prompt. Ch 6


Quick Reference by Topic

Core LLM Concepts

Transformer, Attention, Tokenization, Embedding, Context window, Temperature, Top-P, Token

RAG & Retrieval

RAG, Chunking, Vector search, Hybrid search, Reranking, Bi-encoder, Cross-encoder, BM25, NDCG, MRR

Agents & Tools

Agent loop, ReAct, Tool use, Function calling, MCP, Supervisor agent, Sandboxing, Memory

Production & Deployment

Inference server, vLLM, Continuous batching, KV cache, Quantization, Tensor parallelism, Latency, Throughput

Evaluation & Quality

LLM-as-judge, A/B testing, Calibration, Precision, Recall, F1, ROUGE, Ground truth, Drift detection

Security & Safety

Prompt injection, Jailbreak, Defense-in-depth, Guardrails, Red teaming, Content filtering, PII

Reliability & Operations

SLO, Error budget, Circuit breaker, Graceful degradation, Load shedding, Observability

Cost & Efficiency

Token economics, Cost attribution, Breakeven analysis, Unit economics, Caching, Quantization