Appendix O: Concept Index

A comprehensive index of key concepts, techniques, and terminology covered in this book. Each entry includes a brief definition and chapter references for deeper exploration.

A

A/B Testing — Comparing two variants (prompts, models, features) with statistical rigor to determine which performs better. Ch 15. See also: Experiment design, Statistical significance

Ablation study — Removing components to understand their contribution to overall performance. Ch 15

Activation function — Non-linear transformation applied to neuron outputs (ReLU, GELU, SiLU). Ch 5

Adapter layers — Small trainable layers added to frozen models for parameter-efficient fine-tuning. Ch 5. See also: LoRA, PEFT

Agent loop — The cycle of observe → think → act → observe that drives autonomous agents. Ch 8

Agentic systems — AI systems that can take autonomous actions in an environment using tools. Ch 8

Alignment — Training models to behave according to human preferences and values. Ch 20

Anomaly detection — Identifying data points that deviate significantly from expected patterns. Ch 15, Ch 31

API gateway — Entry point that handles authentication, rate limiting, and routing for API requests. Ch 14

Attention mechanism — Allows models to focus on relevant parts of input when producing output. Ch 5

Attention sink — First tokens that accumulate disproportionate attention in long contexts. Ch 5

Auto-regressive generation — Generating tokens one at a time, each conditioned on previous tokens. Ch 5

AWQ (Activation-aware Weight Quantization) — Quantization method that preserves important weights. Ch 9

B

Backpressure — Mechanism to slow down producers when consumers can’t keep up. Ch 14, Ch 31

Batch inference — Processing multiple inputs together for efficiency. Ch 9

Batching strategies — Static batching (fixed size) vs. continuous batching (dynamic). Ch 9

BERT — Bidirectional Encoder Representations from Transformers; encoder-only architecture. Ch 5

Bi-encoder — Encodes query and documents separately for efficient similarity search. Ch 7. See also: Cross-encoder

BM25 — Classic term-frequency based ranking algorithm for keyword search. Ch 7

Breakeven analysis — Calculating when self-hosting becomes cheaper than API usage. Ch 32
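
As a rough illustration of the breakeven idea, the sketch below assumes a simple cost model: self-hosting has a fixed monthly cost plus a small per-token cost, and the API charges per token only. The function name, parameters, and all prices are hypothetical and are not taken from Ch 32.

```python
def breakeven_tokens_per_month(fixed_monthly_cost, api_price_per_1k,
                               self_host_price_per_1k=0.0):
    """Monthly token volume above which self-hosting is cheaper than the API.

    Assumed (hypothetical) cost model:
      api_cost       = tokens / 1000 * api_price_per_1k
      self_host_cost = fixed_monthly_cost + tokens / 1000 * self_host_price_per_1k
    Returns None when self-hosting never breaks even.
    """
    saving_per_1k = api_price_per_1k - self_host_price_per_1k
    if saving_per_1k <= 0:
        return None  # the API is at least as cheap per token, so no breakeven exists
    return fixed_monthly_cost / saving_per_1k * 1000


# Illustrative numbers only: $6,000/month of fixed cost, $0.01 vs $0.002 per 1K tokens.
volume = breakeven_tokens_per_month(6000, 0.01, 0.002)
print(f"Breakeven at roughly {volume:,.0f} tokens per month")
```

A real analysis would also account for engineering time, utilization, and redundancy, which this sketch leaves out.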

Byte-Pair Encoding (BPE) — Subword tokenization algorithm that builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols (bytes or characters). Ch 5

C

Caching — Storing computed results to avoid redundant computation. Types: prompt caching, KV caching, semantic caching. Ch 6, Ch 9

Calibration — Ensuring model confidence scores reflect actual accuracy. Ch 15

Chain-of-thought (CoT) — Prompting technique that elicits step-by-step reasoning. Ch 6

Chunk — A segment of a document optimized for retrieval and context fitting. Ch 7

Chunking strategies — Methods for splitting documents: fixed-size, semantic, recursive, sentence-based. Ch 7

Circuit breaker — Pattern that prevents cascading failures by failing fast when a dependency is unhealthy. Ch 31

Claude — Anthropic’s family of language models. Throughout

Computer use — Agent capability to interact with computer interfaces via screenshots and actions. Ch 8

Concept drift — When the relationship between inputs and outputs changes over time. Ch 15, Ch 31

Constitutional AI — Training approach using principles to guide model behavior. Ch 20

Content filtering — Detecting and blocking harmful or inappropriate content. Ch 16, Ch 20

Context window — Maximum number of tokens a model can process in a single request. Ch 5, Ch 6

Continuous batching — Dynamically adding/removing requests from a batch as they complete. Ch 9

Cosine similarity — Cosine of the angle between two vectors, ranging from -1 to 1; equal to the dot product of the two vectors after each is scaled to unit length. Ch 5, Ch 7
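
A minimal sketch of the computation in plain Python, assuming vectors are given as equal-length lists of numbers. The function name is illustrative; production systems would typically use a vectorized library instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (value in [-1, 1])."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Parallel vectors score close to 1, orthogonal vectors score 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```

Because the result depends only on direction, scaling either vector leaves the score unchanged.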

Cost attribution — Assigning infrastructure costs to specific teams, features, or users. Ch 32

Cross-encoder — Encodes query and document together for more accurate but slower ranking. Ch 7

D

Data drift — When input data distribution changes from training distribution. Ch 15, Ch 30

Data lineage — Tracking the origin and transformations of data through pipelines. Ch 30

Decoder-only — Transformer architecture that uses only the decoder (GPT, Claude, LLaMA). Ch 5

Defense-in-depth — Multiple overlapping security layers where each catches what others miss. Ch 16

Degradation hierarchy — Ordered levels of fallback behavior as system components fail. Ch 31

Dense retrieval — Retrieval using learned vector representations (embeddings). Ch 7

Direct Preference Optimization (DPO) — Training method that aligns a model directly on preference pairs, without training a separate reward model. Ch 5

Distributed training — Training models across multiple GPUs/machines. Ch 9

Drift detection — Monitoring for changes in data or model behavior over time. Ch 15

E

Embedding — Dense vector representation that captures semantic meaning. Ch 5, Ch 7

Embedding model — Model specialized for generating embeddings (e.g., text-embedding-3-large). Ch 7

Encoder-decoder — Transformer architecture with separate encoder and decoder (T5, BART). Ch 5

Encoder-only — Transformer architecture using only encoder (BERT, RoBERTa). Ch 5

Error budget — Allowed amount of unreliability (1 - SLO target). Ch 31
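
The arithmetic behind an error budget can be sketched as follows, assuming an availability SLO expressed as a fraction and a 30-day window. Both the function name and the default window are illustrative choices.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unreliability in the window: (1 - SLO target) * window length."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# A 99.9% SLO over 30 days leaves about 43.2 minutes of budget.
print(round(error_budget_minutes(0.999), 1))
```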

Evaluation harness — Framework for systematically evaluating model performance. Ch 15

Exact match — Evaluation metric checking if output exactly matches expected answer. Ch 15

F

F1 score — Harmonic mean of precision and recall. Ch 15
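
A small sketch relating F1 to the Precision and Recall entries in this index, assuming set-based evaluation where retrieved and relevant items are compared by identity. The function name and the zero-division conventions are assumptions.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, f1) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Convention used here: a metric with an empty denominator is reported as 0.0.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 2 of 4 retrieved items are relevant; 2 of 3 relevant items were retrieved.
p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r, f1)
```

The harmonic mean stays low whenever either precision or recall is low, which is why F1 penalizes lopsided systems more than a simple average would.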

Fallback strategy — Alternative behavior when primary path fails. Ch 31

Feature drift — When feature distributions change from training to serving. Ch 30

Feature store — Centralized repository for machine learning features. Ch 30

Few-shot learning — Learning from a small number of examples provided in context. Ch 6

Fine-tuning — Adapting a pre-trained model to a specific task with additional training. Ch 5

Flash Attention — Memory-efficient attention implementation that avoids materializing full attention matrix. Ch 5, Ch 9

FP8 — 8-bit floating point format for efficient inference. Ch 9

Function calling — Model capability to output structured tool invocations. Ch 8

G

Generative AI — AI that creates new content (text, images, code) rather than just classifying. Ch 1

GPT — Generative Pre-trained Transformer; OpenAI’s model family. Ch 5

GPTQ — Post-training quantization method for large language models. Ch 9

Graceful degradation — Maintaining partial functionality when components fail. Ch 31

Graph RAG — RAG enhanced with knowledge graphs for multi-hop reasoning. Ch 7

Ground truth — Known correct answers used for evaluation. Ch 15

Grouped-Query Attention (GQA) — Attention variant sharing KV heads across query heads for efficiency. Ch 5, Ch 9

Guardrails — Safety mechanisms that constrain model behavior and outputs. Ch 11, Ch 16

H

Hallucination — Model generating false or unsupported information. Ch 6, Ch 7, Ch 20

Hierarchical agents — Multi-level agent architectures with supervisors and workers. Ch 8

Human-in-the-loop (HITL) — Systems requiring human approval for certain actions. Ch 8, Ch 20

Hybrid search — Combining vector search and keyword search for better retrieval. Ch 7

Hyperparameter — Configuration value set before training (learning rate, batch size). Ch 5

I

In-context learning — Model learning from examples provided in the prompt. Ch 6

Index — Data structure enabling fast retrieval (vector index, inverted index). Ch 7

Inference — Running a trained model to generate predictions/outputs. Ch 9

Inference server — Specialized server for efficient model serving (vLLM, TensorRT-LLM). Ch 9

Injection attack — Malicious input designed to manipulate model behavior. Ch 16

Instruction hierarchy — Principle that system prompts take precedence over user messages, so user input cannot override safety instructions. Ch 16

Instruction tuning — Fine-tuning models to follow instructions better. Ch 5

J

Jailbreak — Attempt to bypass model safety restrictions. Ch 16

JSON mode — Constraining model output to valid JSON format. Ch 6

Judge (LLM-as-judge) — See LLM-as-judge. Ch 15

K

K-nearest neighbors (KNN) — Finding k most similar vectors to a query. Ch 7

Knowledge cutoff — Date after which model has no training data. Throughout

Knowledge graph — Structured representation of entities and relationships. Ch 7

KV cache — Stored key-value pairs from previous tokens to avoid recomputation. Ch 5, Ch 9

L

Latency — Time from request to response. Related metrics: TTFT (time to first token), TPS (tokens per second), P50, P99. Ch 9, Ch 27

Layer normalization — Normalizing activations across features within each sample. Ch 5

LLM (Large Language Model) — Neural network trained on massive text corpora for language tasks. Throughout

LLM-as-judge — Using language models to evaluate quality of other model outputs. Ch 15

Load balancing — Distributing requests across multiple servers/replicas. Ch 9, Ch 14

Load shedding — Intentionally dropping requests to maintain service for remaining traffic. Ch 31

LoRA (Low-Rank Adaptation) — Parameter-efficient fine-tuning using low-rank decomposition. Ch 5

Lost in the middle — Models attending less to information in the middle of long contexts. Ch 5, Ch 6

M

Max tokens — Parameter limiting the number of tokens in model output. Ch 6, Ch 32

MCP (Model Context Protocol) — Protocol for tools to communicate with LLM applications. Ch 8, Ch 10

Mean Reciprocal Rank (MRR) — Average, over queries, of 1/rank of the first relevant result. Ch 7, Ch 15
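
A minimal sketch of MRR, assuming each query comes with a ranked list of result IDs and a set of relevant IDs. A query with no relevant result in its list contributes 0, which is a common convention but still an assumption here.

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant result across queries."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for results, relevant in zip(ranked_results, relevant_sets):
        for rank, item in enumerate(results, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


# First relevant hit at rank 1, at rank 3, and missing entirely.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant = [{"a"}, {"z"}, {"missing"}]
print(mean_reciprocal_rank(rankings, relevant))
```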

Memory (agent) — Mechanisms for agents to remember information across turns. Types: working, episodic, semantic. Ch 8

Mental model — Conceptual framework for understanding complex systems. Appendix A

Mixture of Experts (MoE) — Architecture routing inputs to specialized sub-networks. Ch 5

Model card — Documentation describing a model’s capabilities, limitations, and intended use. Ch 20

Model drift — When model performance degrades over time due to changing data. Ch 15

Model serving — Infrastructure for running inference at scale. Ch 9

Multi-head attention — Attention with multiple parallel attention operations. Ch 5

Multi-modal — Models that process multiple input types (text, images, audio). Ch 17, Ch 18

Multi-Query Attention (MQA) — Attention variant in which all query heads share a single key/value head, shrinking the KV cache. Ch 9. See also: Grouped-Query Attention (GQA)

N

NDCG (Normalized Discounted Cumulative Gain) — Metric measuring ranking quality with position discount. Ch 7, Ch 15

Next-token prediction — Training objective of predicting the next token given previous tokens. Ch 5

Normalization — Standardizing data or activations to improve training stability. Ch 5

O

Observability — Ability to understand system state from external outputs (logs, metrics, traces). Ch 11

OCR (Optical Character Recognition) — Extracting text from images. Ch 17

One-way door — Decision that is difficult or impossible to reverse. Ch 26

Output parsing — Extracting structured data from model text output. Ch 6

Overfitting — Model memorizing training data instead of learning generalizable patterns. Ch 5

P

P50/P99 — 50th/99th percentile latency (median and tail latency). Ch 9, Ch 31

PagedAttention — Memory management technique treating KV cache like virtual memory. Ch 9

Parameter-efficient fine-tuning (PEFT) — Fine-tuning methods that update only a small fraction of parameters. Ch 5

Perplexity — Measure of how well a model predicts text: the exponential of the average negative log-likelihood per token (lower is better). Ch 5
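
Perplexity can be sketched directly from per-token log-probabilities, assuming natural-log values such as those some APIs return. The function name is illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-likelihood over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token has perplexity close to 4,
# as if it were choosing uniformly among four options at each step.
print(perplexity([math.log(0.25)] * 4))
```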

PII (Personally Identifiable Information) — Data that can identify individuals. Ch 16, Ch 20

Position encoding — Mechanism for models to understand token positions. Types: absolute, relative, RoPE. Ch 5

Precision — Fraction of retrieved items that are relevant. Ch 15

Prefix caching — Caching KV for common prompt prefixes to avoid recomputation. Ch 9

Prompt — Input text given to a language model. Ch 6

Prompt engineering — Crafting prompts to elicit desired model behavior. Ch 6

Prompt injection — Attack manipulating model via malicious input text. Ch 16

Q

QLoRA — Quantized LoRA combining 4-bit quantization with LoRA fine-tuning. Ch 5

Quality metrics — Measures of output quality: accuracy, helpfulness, harmlessness. Ch 15

Quantization — Reducing numerical precision (FP32 → INT8, FP8) for efficiency. Ch 9

R

RAG (Retrieval-Augmented Generation) — Combining retrieval with generation for grounded responses. Ch 7

Rate limiting — Restricting request frequency to prevent abuse and overload. Ch 14, Ch 16

ReAct — Agent pattern: Reasoning + Acting in interleaved steps. Ch 8

Recall — Fraction of relevant items that were retrieved. Ch 15

Recall@K — Fraction of relevant items found in top K results. Ch 7, Ch 15

Reciprocal Rank Fusion (RRF) — Method for combining multiple ranked lists. Ch 7
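
A sketch of the fusion step, assuming each input is a list of document IDs ordered best-first. Each document scores the sum of 1 / (k + rank) across lists; the smoothing constant k = 60 is a commonly used default, not a value taken from this book.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword ranker and a vector ranker.
keyword_hits = ["d1", "d2", "d3"]
vector_hits = ["d2", "d3", "d1"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Because only ranks are used, the method needs no score normalization between retrievers, which is one reason it is popular for hybrid search.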

Red teaming — Adversarial testing to find vulnerabilities. Ch 16

Reinforcement Learning from Human Feedback (RLHF) — Training method using human preferences. Ch 5

Reranking — Re-scoring initial retrieval results for better ordering. Ch 7

Residual connection — Adding input directly to output (skip connection). Ch 5

Responsible AI — Developing AI systems that are safe, fair, and beneficial. Ch 20

Retrieval — Finding relevant documents/passages for a query. Ch 7

ROUGE — Metrics measuring overlap between generated and reference text. Ch 15

RoPE (Rotary Position Embedding) — Position encoding using rotation matrices. Ch 5

S

Safety — Preventing harmful model outputs and behaviors. Ch 16, Ch 20

Sandboxing — Isolating code execution to prevent damage. Ch 8

Scaling laws — Mathematical relationships between model size, data, and performance. Ch 5

Semantic caching — Caching based on meaning similarity rather than exact match. Ch 9

Semantic chunking — Splitting documents based on meaning/topic boundaries. Ch 7

Self-attention — Attention mechanism in which each position in a sequence attends to the positions of that same sequence. Ch 5

Self-consistency — Generating multiple responses and selecting most common answer. Ch 6

Self-hosting — Running models on your own infrastructure instead of APIs. Ch 9

Sentiment analysis — Classifying text as positive, negative, or neutral. Ch 6

Serving — Running inference for production traffic. Ch 9

SLA (Service Level Agreement) — Contractual commitment about service behavior. Ch 31

SLI (Service Level Indicator) — Measured metric of service behavior. Ch 31

SLO (Service Level Objective) — Target value for service behavior. Ch 31

Sparse attention — Attention patterns that skip some positions for efficiency. Ch 5

Speculative decoding — Using a small draft model to propose several tokens that the large model then verifies in a single pass. Ch 9

Stop sequence — Token or string that terminates generation. Ch 6

Streaming — Returning tokens as they’re generated rather than waiting for completion. Ch 6, Ch 14

Structured output — Constraining model output to specific formats (JSON, XML). Ch 6

Subword tokenization — Splitting text into sub-word units for vocabulary coverage. Ch 5

Summarization — Condensing longer text into shorter form. Ch 6

Supervisor agent — Agent that coordinates other agents in multi-agent systems. Ch 8

System prompt — Initial instructions that shape model behavior for a session. Ch 6

T

Temperature — Parameter controlling randomness in generation (higher = more random). Ch 6

Tensor parallelism — Splitting model weights across GPUs. Ch 9

TensorRT — NVIDIA’s inference optimization library. Ch 9

Text embedding — Vector representation of text capturing semantic meaning. Ch 5, Ch 7

Throughput — Number of requests or tokens processed per unit time. Ch 9

Token — Basic unit of text processed by LLMs (roughly 4 characters of English text on average; varies by tokenizer and language). Ch 5

Token budget — Maximum tokens allowed for a request (context + generation). Ch 6

Tokenization — Converting text into tokens for model processing. Ch 5

Tool use — Model capability to invoke external functions/APIs. Ch 8

Top-K sampling — Sampling from K most likely next tokens. Ch 6

Top-P sampling (nucleus) — Sampling from smallest set of tokens with cumulative probability >= P. Ch 6
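
The selection rule can be sketched as below, assuming the next-token distribution is available as a dictionary from token to probability. The function names and the toy distribution are hypothetical; real inference stacks apply this to logits on the accelerator.

```python
import random


def nucleus(probs, p):
    """Smallest set of most-likely tokens with cumulative probability >= p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample_top_p(probs, p, rng=random):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus(probs, p)
    return rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]


# With p = 0.75 only the two most likely tokens survive the cut.
toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(nucleus(toy, 0.75))
print(sample_top_p(toy, 0.75, rng=random.Random(0)))
```

Unlike Top-K, the number of candidate tokens adapts to the shape of the distribution: a confident model keeps few tokens, an uncertain one keeps many.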

Training-serving skew — Differences between training and serving environments. Ch 30

Transformer — Neural network architecture based on self-attention. Ch 5

TTFT (Time to First Token) — Latency until first generated token arrives. Ch 9

Two-way door — Decision that can be easily reversed. Ch 26

U

Unit economics — Cost and revenue per unit (request, user, query). Ch 32

V

Vector database — Database optimized for storing and querying vector embeddings. Ch 7

Vector search — Finding similar vectors using distance metrics (cosine, L2). Ch 7

Vision-language model — Model that processes both images and text. Ch 17

vLLM — Open-source LLM serving library with PagedAttention. Ch 9

W

Weight quantization — Reducing precision of model weights for smaller size and faster inference. Ch 9

Window attention — Attending only to local context window (sliding window attention). Ch 5

Worker agent — Specialized agent performing specific tasks in multi-agent systems. Ch 8

Z

Zero-shot learning — Model performing task without any task-specific examples in prompt. Ch 6

Quick Reference by Topic

Core LLM Concepts

Transformer, Attention, Tokenization, Embedding, Context window, Temperature, Top-P, Token

RAG & Retrieval

RAG, Chunking, Vector search, Hybrid search, Reranking, Bi-encoder, Cross-encoder, BM25, NDCG, MRR

Agents & Tools

Agent loop, ReAct, Tool use, Function calling, MCP, Supervisor agent, Sandboxing, Memory

Production & Deployment

Inference server, vLLM, Continuous batching, KV cache, Quantization, Tensor parallelism, Latency, Throughput

Evaluation & Quality

LLM-as-judge, A/B testing, Calibration, Precision, Recall, F1, ROUGE, Ground truth, Drift detection

Security & Safety

Prompt injection, Jailbreak, Defense-in-depth, Guardrails, Red teaming, Content filtering, PII

Reliability & Operations

SLO, Error budget, Circuit breaker, Graceful degradation, Load shedding, Observability

Cost & Efficiency

Token economics, Cost attribution, Breakeven analysis, Unit economics, Caching, Quantization

# Appendix O: Concept Index {.unnumbered} A comprehensive index of key concepts, techniques, and terminology covered in this book. Each entry includes a brief definition and chapter references for deeper exploration. --- ## A **A/B Testing** — Comparing two variants (prompts, models, features) with statistical rigor to determine which performs better. *Ch 15*. See also: *Experiment design, Statistical significance* **Ablation study** — Removing components to understand their contribution to overall performance. *Ch 15* **Activation function** — Non-linear transformation applied to neuron outputs (ReLU, GELU, SiLU). *Ch 5* **Adapter layers** — Small trainable layers added to frozen models for parameter-efficient fine-tuning. *Ch 5*. See also: *LoRA, PEFT* **Agent loop** — The cycle of observe → think → act → observe that drives autonomous agents. *Ch 8* **Agentic systems** — AI systems that can take autonomous actions in an environment using tools. *Ch 8* **Alignment** — Training models to behave according to human preferences and values. *Ch 20* **Anomaly detection** — Identifying data points that deviate significantly from expected patterns. *Ch 15, Ch 31* **API gateway** — Entry point that handles authentication, rate limiting, and routing for API requests. *Ch 14* **Attention mechanism** — Allows models to focus on relevant parts of input when producing output. *Ch 5* **Attention sink** — First tokens that accumulate disproportionate attention in long contexts. *Ch 5* **Auto-regressive generation** — Generating tokens one at a time, each conditioned on previous tokens. *Ch 5* **AWQ (Activation-aware Weight Quantization)** — Quantization method that preserves important weights. *Ch 9* --- ## B **Backpressure** — Mechanism to slow down producers when consumers can't keep up. *Ch 14, Ch 31* **Batch inference** — Processing multiple inputs together for efficiency. *Ch 9* **Batching strategies** — Static batching (fixed size) vs. continuous batching (dynamic). 
*Ch 9* **BERT** — Bidirectional Encoder Representations from Transformers; encoder-only architecture. *Ch 5* **Bi-encoder** — Encodes query and documents separately for efficient similarity search. *Ch 7*. See also: *Cross-encoder* **BM25** — Classic term-frequency based ranking algorithm for keyword search. *Ch 7* **Breakeven analysis** — Calculating when self-hosting becomes cheaper than API usage. *Ch 32* **Byte-Pair Encoding (BPE)** — Subword tokenization algorithm that merges frequent byte pairs. *Ch 5* --- ## C **Caching** — Storing computed results to avoid redundant computation. Types: prompt caching, KV caching, semantic caching. *Ch 6, Ch 9* **Calibration** — Ensuring model confidence scores reflect actual accuracy. *Ch 15* **Chain-of-thought (CoT)** — Prompting technique that elicits step-by-step reasoning. *Ch 6* **Chunk** — A segment of a document optimized for retrieval and context fitting. *Ch 7* **Chunking strategies** — Methods for splitting documents: fixed-size, semantic, recursive, sentence-based. *Ch 7* **Circuit breaker** — Pattern that prevents cascading failures by failing fast when a dependency is unhealthy. *Ch 31* **Claude** — Anthropic's family of language models. *Throughout* **Computer use** — Agent capability to interact with computer interfaces via screenshots and actions. *Ch 8* **Concept drift** — When the relationship between inputs and outputs changes over time. *Ch 15, Ch 31* **Constitutional AI** — Training approach using principles to guide model behavior. *Ch 20* **Content filtering** — Detecting and blocking harmful or inappropriate content. *Ch 16, Ch 20* **Context window** — Maximum number of tokens a model can process in a single request. *Ch 5, Ch 6* **Continuous batching** — Dynamically adding/removing requests from a batch as they complete. *Ch 9* **Cosine similarity** — Measure of similarity between two vectors (dot product of unit vectors). 
*Ch 5, Ch 7* **Cost attribution** — Assigning infrastructure costs to specific teams, features, or users. *Ch 32* **Cross-encoder** — Encodes query and document together for more accurate but slower ranking. *Ch 7* --- ## D **Data drift** — When input data distribution changes from training distribution. *Ch 15, Ch 30* **Data lineage** — Tracking the origin and transformations of data through pipelines. *Ch 30* **Decoder-only** — Transformer architecture that uses only the decoder (GPT, Claude, LLaMA). *Ch 5* **Defense-in-depth** — Multiple overlapping security layers where each catches what others miss. *Ch 16* **Degradation hierarchy** — Ordered levels of fallback behavior as system components fail. *Ch 31* **Dense retrieval** — Retrieval using learned vector representations (embeddings). *Ch 7* **Direct Preference Optimization (DPO)** — Training method aligning models with preferences without reward models. *Ch 5* **Distributed training** — Training models across multiple GPUs/machines. *Ch 9* **Drift detection** — Monitoring for changes in data or model behavior over time. *Ch 15* --- ## E **Embedding** — Dense vector representation that captures semantic meaning. *Ch 5, Ch 7* **Embedding model** — Model specialized for generating embeddings (e.g., text-embedding-3-large). *Ch 7* **Encoder-decoder** — Transformer architecture with separate encoder and decoder (T5, BART). *Ch 5* **Encoder-only** — Transformer architecture using only encoder (BERT, RoBERTa). *Ch 5* **Error budget** — Allowed amount of unreliability (1 - SLO target). *Ch 31* **Evaluation harness** — Framework for systematically evaluating model performance. *Ch 15* **Exact match** — Evaluation metric checking if output exactly matches expected answer. *Ch 15* --- ## F **F1 score** — Harmonic mean of precision and recall. *Ch 15* **Fallback strategy** — Alternative behavior when primary path fails. *Ch 31* **Feature drift** — When feature distributions change from training to serving. 
*Ch 30* **Feature store** — Centralized repository for machine learning features. *Ch 30* **Few-shot learning** — Learning from a small number of examples provided in context. *Ch 6* **Fine-tuning** — Adapting a pre-trained model to a specific task with additional training. *Ch 5* **Flash Attention** — Memory-efficient attention implementation that avoids materializing full attention matrix. *Ch 5, Ch 9* **FP8** — 8-bit floating point format for efficient inference. *Ch 9* **Function calling** — Model capability to output structured tool invocations. *Ch 8* --- ## G **Generative AI** — AI that creates new content (text, images, code) rather than just classifying. *Ch 1* **GPT** — Generative Pre-trained Transformer; OpenAI's model family. *Ch 5* **GPTQ** — Post-training quantization method for large language models. *Ch 9* **Graceful degradation** — Maintaining partial functionality when components fail. *Ch 31* **Graph RAG** — RAG enhanced with knowledge graphs for multi-hop reasoning. *Ch 7* **Ground truth** — Known correct answers used for evaluation. *Ch 15* **Grouped-Query Attention (GQA)** — Attention variant sharing KV heads across query heads for efficiency. *Ch 5, Ch 9* **Guardrails** — Safety mechanisms that constrain model behavior and outputs. *Ch 11, Ch 16* --- ## H **Hallucination** — Model generating false or unsupported information. *Ch 6, Ch 7, Ch 20* **Hierarchical agents** — Multi-level agent architectures with supervisors and workers. *Ch 8* **Human-in-the-loop (HITL)** — Systems requiring human approval for certain actions. *Ch 8, Ch 20* **Hybrid search** — Combining vector search and keyword search for better retrieval. *Ch 7* **Hyperparameter** — Configuration value set before training (learning rate, batch size). *Ch 5* --- ## I **In-context learning** — Model learning from examples provided in the prompt. *Ch 6* **Index** — Data structure enabling fast retrieval (vector index, inverted index). 
*Ch 7* **Inference** — Running a trained model to generate predictions/outputs. *Ch 9* **Inference server** — Specialized server for efficient model serving (vLLM, TensorRT-LLM). *Ch 9* **Injection attack** — Malicious input designed to manipulate model behavior. *Ch 16* **Instruction hierarchy** — System prompts take precedence over user messages for safety. *Ch 16* **Instruction tuning** — Fine-tuning models to follow instructions better. *Ch 5* --- ## J **Jailbreak** — Attempt to bypass model safety restrictions. *Ch 16* **JSON mode** — Constraining model output to valid JSON format. *Ch 6* **Judge (LLM-as-judge)** — Using an LLM to evaluate outputs of another LLM. *Ch 15* --- ## K **K-nearest neighbors (KNN)** — Finding k most similar vectors to a query. *Ch 7* **Knowledge cutoff** — Date after which model has no training data. *Throughout* **Knowledge graph** — Structured representation of entities and relationships. *Ch 7* **KV cache** — Stored key-value pairs from previous tokens to avoid recomputation. *Ch 5, Ch 9* --- ## L **Latency** — Time from request to response. Metrics: TTFT, TPS, P50, P99. *Ch 9, Ch 27* **Layer normalization** — Normalizing activations across features within each sample. *Ch 5* **LLM (Large Language Model)** — Neural network trained on massive text corpora for language tasks. *Throughout* **LLM-as-judge** — Using language models to evaluate quality of other model outputs. *Ch 15* **Load balancing** — Distributing requests across multiple servers/replicas. *Ch 9, Ch 14* **Load shedding** — Intentionally dropping requests to maintain service for remaining traffic. *Ch 31* **LoRA (Low-Rank Adaptation)** — Parameter-efficient fine-tuning using low-rank decomposition. *Ch 5* **Lost in the middle** — Models attending less to information in the middle of long contexts. *Ch 5, Ch 6* --- ## M **Max tokens** — Parameter limiting the number of tokens in model output. 
*Ch 6, Ch 32* **MCP (Model Context Protocol)** — Protocol for tools to communicate with LLM applications. *Ch 8, Ch 10* **Mean Reciprocal Rank (MRR)** — Metric measuring rank of first relevant result. *Ch 7, Ch 15* **Memory (agent)** — Mechanisms for agents to remember information across turns. Types: working, episodic, semantic. *Ch 8* **Mental model** — Conceptual framework for understanding complex systems. *Appendix A* **Mixture of Experts (MoE)** — Architecture routing inputs to specialized sub-networks. *Ch 5* **Model card** — Documentation describing a model's capabilities, limitations, and intended use. *Ch 20* **Model drift** — When model performance degrades over time due to changing data. *Ch 15* **Model serving** — Infrastructure for running inference at scale. *Ch 9* **Multi-head attention** — Attention with multiple parallel attention operations. *Ch 5* **Multi-modal** — Models that process multiple input types (text, images, audio). *Ch 17, Ch 18* **Multi-Query Attention (MQA)** — All query heads share single K, V for memory efficiency. *Ch 9* --- ## N **NDCG (Normalized Discounted Cumulative Gain)** — Metric measuring ranking quality with position discount. *Ch 7, Ch 15* **Next-token prediction** — Training objective of predicting the next token given previous tokens. *Ch 5* **Normalization** — Standardizing data or activations to improve training stability. *Ch 5* --- ## O **Observability** — Ability to understand system state from external outputs (logs, metrics, traces). *Ch 11* **OCR (Optical Character Recognition)** — Extracting text from images. *Ch 17* **One-way door** — Decision that is difficult or impossible to reverse. *Ch 26* **Output parsing** — Extracting structured data from model text output. *Ch 6* **Overfitting** — Model memorizing training data instead of learning generalizable patterns. *Ch 5* --- ## P **P50/P99** — 50th/99th percentile latency (median and tail latency). 
*Ch 9, Ch 31* **PagedAttention** — Memory management technique treating KV cache like virtual memory. *Ch 9* **Parameter-efficient fine-tuning (PEFT)** — Fine-tuning methods that update only a small fraction of parameters. *Ch 5* **Perplexity** — Measure of how well a model predicts text (lower is better). *Ch 5* **PII (Personally Identifiable Information)** — Data that can identify individuals. *Ch 16, Ch 20* **Position encoding** — Mechanism for models to understand token positions. Types: absolute, relative, RoPE. *Ch 5* **Precision** — Fraction of retrieved items that are relevant. *Ch 15* **Prefix caching** — Caching KV for common prompt prefixes to avoid recomputation. *Ch 9* **Prompt** — Input text given to a language model. *Ch 6* **Prompt engineering** — Crafting prompts to elicit desired model behavior. *Ch 6* **Prompt injection** — Attack manipulating model via malicious input text. *Ch 16* --- ## Q **QLoRA** — Quantized LoRA combining 4-bit quantization with LoRA fine-tuning. *Ch 5* **Quality metrics** — Measures of output quality: accuracy, helpfulness, harmlessness. *Ch 15* **Quantization** — Reducing numerical precision (FP32 → INT8, FP8) for efficiency. *Ch 9* --- ## R **RAG (Retrieval-Augmented Generation)** — Combining retrieval with generation for grounded responses. *Ch 7* **Rate limiting** — Restricting request frequency to prevent abuse and overload. *Ch 14, Ch 16* **ReAct** — Agent pattern: Reasoning + Acting in interleaved steps. *Ch 8* **Recall** — Fraction of relevant items that were retrieved. *Ch 15* **Recall@K** — Fraction of relevant items found in top K results. *Ch 7, Ch 15* **Reciprocal Rank Fusion (RRF)** — Method for combining multiple ranked lists. *Ch 7* **Red teaming** — Adversarial testing to find vulnerabilities. *Ch 16* **Reinforcement Learning from Human Feedback (RLHF)** — Training method using human preferences. *Ch 5* **Reranking** — Re-scoring initial retrieval results for better ordering. 
*Ch 7* **Residual connection** — Adding input directly to output (skip connection). *Ch 5* **Responsible AI** — Developing AI systems that are safe, fair, and beneficial. *Ch 20* **Retrieval** — Finding relevant documents/passages for a query. *Ch 7* **ROUGE** — Metrics measuring overlap between generated and reference text. *Ch 15* **RoPE (Rotary Position Embedding)** — Position encoding using rotation matrices. *Ch 5* --- ## S **Safety** — Preventing harmful model outputs and behaviors. *Ch 16, Ch 20* **Sandboxing** — Isolating code execution to prevent damage. *Ch 8* **Scaling laws** — Mathematical relationships between model size, data, and performance. *Ch 5* **Semantic caching** — Caching based on meaning similarity rather than exact match. *Ch 9* **Semantic chunking** — Splitting documents based on meaning/topic boundaries. *Ch 7* **Self-attention** — Attention mechanism where input attends to itself. *Ch 5* **Self-consistency** — Generating multiple responses and selecting most common answer. *Ch 6* **Self-hosting** — Running models on your own infrastructure instead of APIs. *Ch 9* **Sentiment analysis** — Classifying text as positive, negative, or neutral. *Ch 6* **Serving** — Running inference for production traffic. *Ch 9* **SLA (Service Level Agreement)** — Contractual commitment about service behavior. *Ch 31* **SLI (Service Level Indicator)** — Measured metric of service behavior. *Ch 31* **SLO (Service Level Objective)** — Target value for service behavior. *Ch 31* **Sparse attention** — Attention patterns that skip some positions for efficiency. *Ch 5* **Speculative decoding** — Using small model to draft tokens verified by large model. *Ch 9* **Stop sequence** — Token or string that terminates generation. *Ch 6* **Streaming** — Returning tokens as they're generated rather than waiting for completion. *Ch 6, Ch 14* **Structured output** — Constraining model output to specific formats (JSON, XML). 
*Ch 6*

**Subword tokenization** — Splitting text into sub-word units for vocabulary coverage. *Ch 5*

**Summarization** — Condensing longer text into shorter form. *Ch 6*

**Supervisor agent** — Agent that coordinates other agents in multi-agent systems. *Ch 8*

**System prompt** — Initial instructions that shape model behavior for a session. *Ch 6*

---

## T

**Temperature** — Parameter controlling randomness in generation (higher = more random). *Ch 6*

**Tensor parallelism** — Splitting model weights across GPUs. *Ch 9*

**TensorRT** — NVIDIA's inference optimization library. *Ch 9*

**Text embedding** — Vector representation of text capturing semantic meaning. *Ch 5, Ch 7*

**Throughput** — Number of requests or tokens processed per unit time. *Ch 9*

**Token** — Basic unit of text processed by LLMs (roughly 4 characters of English text on average). *Ch 5*

**Token budget** — Maximum tokens allowed for a request (context + generation). *Ch 6*

**Tokenization** — Converting text into tokens for model processing. *Ch 5*

**Tool use** — Model capability to invoke external functions/APIs. *Ch 8*

**Top-K sampling** — Sampling from K most likely next tokens. *Ch 6*

**Top-P sampling (nucleus)** — Sampling from smallest set of tokens with cumulative probability ≥ P. *Ch 6*

**Training-serving skew** — Differences between training and serving data or environments that degrade production performance. *Ch 30*

**Transformer** — Neural network architecture based on self-attention. *Ch 5*

**TTFT (Time to First Token)** — Latency until first generated token arrives. *Ch 9*

**Two-way door** — Decision that can be easily reversed. *Ch 26*

---

## U

**Unit economics** — Cost and revenue per unit (request, user, query). *Ch 32*

---

## V

**Vector database** — Database optimized for storing and querying vector embeddings. *Ch 7*

**Vector search** — Finding similar vectors using distance metrics (cosine, L2). *Ch 7*

**Vision-language model** — Model that processes both images and text. *Ch 17*

**vLLM** — Open-source LLM serving library with PagedAttention.
*Ch 9*

---

## W

**Weight quantization** — Reducing precision of model weights for smaller size and faster inference. *Ch 9*

**Window attention** — Attending only to local context window (sliding window attention). *Ch 5*

**Worker agent** — Specialized agent performing specific tasks in multi-agent systems. *Ch 8*

---

## Z

**Zero-shot learning** — Model performing task without any task-specific examples in prompt. *Ch 6*

---

## Quick Reference by Topic

### Core LLM Concepts

Transformer, Attention, Tokenization, Embedding, Context window, Temperature, Top-P, Token

### RAG & Retrieval

RAG, Chunking, Vector search, Hybrid search, Reranking, Bi-encoder, Cross-encoder, BM25, NDCG, MRR

### Agents & Tools

Agent loop, ReAct, Tool use, Function calling, MCP, Supervisor agent, Sandboxing, Memory

### Production & Deployment

Inference server, vLLM, Continuous batching, KV cache, Quantization, Tensor parallelism, Latency, Throughput

### Evaluation & Quality

LLM-as-judge, A/B testing, Calibration, Precision, Recall, F1, ROUGE, Ground truth, Drift detection

### Security & Safety

Prompt injection, Jailbreak, Defense-in-depth, Guardrails, Red teaming, Content filtering, PII

### Reliability & Operations

SLO, Error budget, Circuit breaker, Graceful degradation, Load shedding, Observability

### Cost & Efficiency

Token economics, Cost attribution, Breakeven analysis, Unit economics, Caching, Quantization