Appendix A: Glossary

This glossary defines key terms used throughout the book. Terms are organized alphabetically within categories for easier reference.

Model Architecture & Training

Attention Mechanism: A neural network component that allows models to focus on different parts of the input when producing each part of the output. Self-attention allows every position to attend to every other position in a sequence. The mathematical operation is typically Attention(Q, K, V) = softmax(QK^T/√d_k)V, where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the key vectors.
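The formula can be traced by hand on a tiny input. The following is a minimal sketch in plain Python, assuming small list-of-lists matrices; the function names and the example values are illustrative, and real implementations use batched tensor libraries instead.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Illustrative values: one query that matches the first key more closely.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Because the query is closer to the first key, the output leans toward the first value vector without ignoring the second.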

Autoregressive Model: A model that generates output one token at a time, where each token prediction is conditioned on all previously generated tokens. GPT-style models are autoregressive. Contrast with encoder-only or encoder-decoder architectures.

Backpropagation: Algorithm for computing gradients of the loss function with respect to all parameters in a neural network. Essential for training via gradient descent.

Batch Size: The number of training examples used in one forward/backward pass. Larger batches improve GPU utilization but may affect convergence. Critical parameter for training efficiency and cost.

BERT (Bidirectional Encoder Representations from Transformers): Encoder-only transformer model trained with masked language modeling. Foundation for many NLP classification and extraction tasks. Released by Google in 2018.

Causal Attention: Attention that only allows positions to attend to earlier positions (not future ones). Essential for autoregressive generation. Also called “masked attention.”

Checkpoint: A saved snapshot of model weights and optimizer state during training. Enables resuming training after interruption and selecting the best model version.

Contrastive Learning: Training approach where the model learns to distinguish similar pairs from dissimilar pairs. Common for training embedding models (e.g., contrastive loss, triplet loss).

Cross-Entropy Loss: Standard loss function for classification tasks. Measures the difference between predicted probability distribution and true distribution.

Decoder: In transformer architecture, the component that generates output tokens autoregressively. Uses causal attention and cross-attention to encoder outputs.

Dropout: Regularization technique that randomly zeroes network activations during training. Prevents overfitting by encouraging redundant representations.

Embedding: Dense vector representation of discrete objects (tokens, entities, etc.). Learned during training to capture semantic relationships. See also: vector embedding.

Encoder: In transformer architecture, the component that processes input sequence bidirectionally. Produces contextualized representations used by decoder.

Epoch: One complete pass through the entire training dataset. Models typically train for multiple epochs.

Fine-Tuning: Adapting a pre-trained model to a specific task by continuing training on task-specific data. Usually with lower learning rate than pre-training.

Frozen Layers: Model layers whose weights are not updated during fine-tuning. Freezing early layers preserves general knowledge while adapting later layers to new tasks.

Gradient Descent: Optimization algorithm that updates model parameters in the direction that reduces the loss function. Variants include SGD, Adam, AdamW.

GPT (Generative Pre-trained Transformer): Decoder-only transformer model trained with next-token prediction. Foundation of ChatGPT and many LLMs.

KV Cache: Cached key and value matrices from attention layers during autoregressive generation. Avoids recomputing attention for previous tokens. Critical for inference efficiency.
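A rough sizing sketch shows why the cache matters for memory planning. The model shape below (32 layers, 32 key-value heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not a description of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size for standard multi-head attention."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration, 4096-token sequence, single request.
size_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
size_gib = size_bytes / 2**30
```

Under these assumptions a single 4096-token sequence needs about 2 GiB of cache, which grows linearly with both sequence length and batch size.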

Layer Normalization: Normalization technique applied to layer activations. Standard in transformers; many newer models use the simpler RMSNorm variant.

Learning Rate: Hyperparameter controlling the step size during gradient descent. Critical for training stability and convergence speed.

LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning method that freezes the pre-trained weights and adds small trainable low-rank matrices alongside selected weight matrices, most commonly the attention projections. Enables fine-tuning large models with minimal additional parameters.
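The parameter savings follow from simple arithmetic. This sketch assumes a single 4096 x 4096 weight matrix and rank 8; both numbers are illustrative assumptions.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters added by one LoRA adapter.

    The update is B @ A, where B is d_out x rank and A is rank x d_in.
    """
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 8            # illustrative sizes
full_params = d_out * d_in                    # parameters in the frozen matrix
lora_params = lora_param_count(d_out, d_in, rank)
fraction = lora_params / full_params          # share of the original size
```

With these sizes the adapter trains 65,536 parameters against 16,777,216 in the frozen matrix, or roughly 0.4%.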

Loss Function: Mathematical function that measures how well model predictions match targets. Training minimizes the loss.

Mixed Precision Training: Training with both 16-bit and 32-bit floating point to reduce memory and increase speed while maintaining accuracy.

MLP (Multi-Layer Perceptron): Feedforward neural network with fully connected layers. In transformers, the feedforward network after attention.

Multi-Head Attention: Attention mechanism that runs multiple attention operations in parallel with different learned projections. Standard in transformers.

Overfitting: When a model performs well on training data but poorly on unseen data. Indicates memorization rather than generalization.

Parameter: A learnable weight in a neural network. Model size is typically measured in parameters (e.g., 7B parameters).

PEFT (Parameter-Efficient Fine-Tuning): Family of techniques for fine-tuning large models by training only a small subset of parameters. Includes LoRA, adapters, prefix tuning.

Perplexity: Metric for language model quality. Lower perplexity indicates better prediction of held-out text. Related to cross-entropy by PP = exp(CE), where CE is the average per-token cross-entropy measured in nats (natural logarithm).
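As a small worked example, perplexity can be computed from the probabilities a model assigned to the tokens that actually occurred. The probability values below are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability assigned to each true token."""
    # Cross-entropy in nats: mean negative log-probability per token.
    ce = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(ce)

# A model that spreads probability evenly over 4 choices has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A more confident model scores closer to the minimum of 1.
confident = perplexity([0.9, 0.8, 0.95])
```

One way to read the number: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.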

Pre-training: Initial training of a model on large-scale data before task-specific fine-tuning. Creates general-purpose representations.

Quantization: Reducing the numerical precision of model weights (e.g., from 16-bit to 8-bit or 4-bit). Reduces memory and often increases inference speed with minimal quality loss.
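The idea can be sketched with symmetric per-tensor int8 quantization, which is only one of several schemes in use. The function names and weight values are illustrative assumptions, not any library's API.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]     # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of two or four, and the rounding error per weight is bounded by half the scale.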

ReLU (Rectified Linear Unit): Activation function f(x) = max(0, x). Common in neural networks though transformers often use variants like GELU.

Regularization: Techniques to prevent overfitting, including dropout, weight decay, and early stopping.

Self-Attention: Attention where queries, keys, and values all come from the same sequence. Core operation in transformers.

Softmax: Function that converts a vector of scores into a probability distribution. Used in attention and classification outputs.

Teacher Forcing: Training technique where the model receives the correct previous token rather than its own prediction. Standard for training autoregressive models.

Token: The basic unit of text processed by language models. May be words, subwords, or characters depending on the tokenizer.

Tokenizer: Algorithm that converts text to tokens and vice versa. Common approaches include BPE, WordPiece, and SentencePiece.

Transfer Learning: Using knowledge from pre-training on one task to improve performance on a different task.

Transformer: Neural network architecture based entirely on attention mechanisms. Foundation of modern LLMs. Introduced in “Attention Is All You Need” (2017).

Weight Decay: Regularization technique that adds a penalty proportional to the magnitude of weights. Prevents weights from growing too large.

Inference & Deployment

Batching: Processing multiple requests together in a single forward pass. Improves throughput at the cost of latency. See also: continuous batching.

Beam Search: Decoding strategy that maintains multiple candidate sequences and selects the highest-scoring completion. Improves generation quality but increases computation.

Continuous Batching: Dynamic batching technique that adds new requests to running batches as existing requests complete. Maximizes GPU utilization.

Flash Attention: Optimized attention implementation that reduces memory usage and improves speed through tiling and kernel fusion. Standard in modern inference systems.

Greedy Decoding: Decoding strategy that always selects the highest-probability token. Fast but may miss better overall sequences.

Inference: Using a trained model to make predictions on new inputs. Contrast with training.

Latency: Time from request to response. p50 is median, p99 is 99th percentile (tail latency).
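Percentiles can be computed in several ways; the sketch below uses the nearest-rank method, which is an assumption rather than the definition used by any particular monitoring tool. The latency samples are made up.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# 100 illustrative samples: mostly fast, with two slow outliers.
latencies_ms = [120] * 98 + [900, 2500]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Here the median is 120 ms while p99 is 900 ms, which is why tail percentiles are tracked separately from the median.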

Model Serving: Infrastructure for hosting and running inference on trained models. Handles request routing, scaling, and response delivery.

Nucleus Sampling (Top-p): Decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds p. Balances diversity and quality.
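The selection step can be sketched as follows. This version stops as soon as the cumulative probability reaches or exceeds p and then renormalizes; the function name and the example distribution are illustrative assumptions.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_probability} for the nucleus."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for idx, prob in ranked:
        nucleus.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break  # smallest set whose mass reaches p
    total = sum(prob for _, prob in nucleus)
    return {idx: prob / total for idx, prob in nucleus}

probs = [0.5, 0.3, 0.15, 0.05]       # illustrative next-token distribution
nucleus = top_p_filter(probs, p=0.9)
```

With p = 0.9 the lowest-probability token is dropped, and sampling then proceeds over the three remaining tokens.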

Speculative Decoding: Acceleration technique using a smaller draft model to propose multiple tokens that the larger model verifies in parallel. Reported speedups are often in the 2-3x range, depending on how often the draft tokens are accepted.

Temperature: Parameter controlling randomness in sampling-based decoding. Higher temperature produces more diverse but less reliable outputs.
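In most implementations, temperature divides the logits before the softmax. The sketch below assumes that convention; the function name and logit values are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                 # illustrative values
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

Lower temperature concentrates probability on the top token, while higher temperature flattens the distribution.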

TensorRT: NVIDIA’s inference optimization library. Performs graph optimization, kernel fusion, and quantization for faster inference.

Throughput: Number of requests processed per unit time. Often measured in tokens per second.

Top-k Sampling: Decoding strategy that samples from the k highest-probability tokens. Simple diversity control.
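A minimal sketch of top-k sampling, assuming the probabilities are already computed. The function name, the seeded generator, and the example distribution are illustrative.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample a token index from the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in ranked]
    # random.choices normalizes the weights, so no explicit renormalization is needed.
    return rng.choices(ranked, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]           # illustrative next-token distribution
rng = random.Random(0)                   # seeded for repeatability
samples = {top_k_sample(probs, k=2, rng=rng) for _ in range(200)}
```

With k = 2, only the two most probable tokens can ever be selected.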

vLLM: High-performance inference engine for LLMs featuring PagedAttention and continuous batching.

RAG & Retrieval

BM25: Traditional sparse retrieval algorithm that scores documents by term frequency, inverse document frequency, and document-length normalization. Still competitive for many retrieval tasks.

Chunking: Splitting documents into smaller pieces for embedding and retrieval. Chunk size significantly affects retrieval quality.
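One simple strategy is fixed-size chunks with overlap, sketched below on a word list. Splitting on words, the chunk size, and the overlap are all assumptions for illustration; production systems often split on tokens, sentences, or document structure instead.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a list of words into fixed-size chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

words = "a b c d e f g h i j".split()
chunks = chunk_words(words, chunk_size=4, overlap=1)
```

The overlap means each chunk repeats the last word of the previous one, which helps keep context that would otherwise be cut at a boundary.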

Dense Retrieval: Retrieval using learned vector embeddings rather than keyword matching. Captures semantic similarity.

Embedding Model: Neural network that produces vector representations of text. Common examples include E5, BGE, and OpenAI’s text-embedding models.

Hybrid Search: Combining sparse (keyword) and dense (semantic) retrieval. Often outperforms either alone.

Indexing: Pre-processing documents into a searchable structure. For vector search, involves computing embeddings and building ANN index.

RAG (Retrieval-Augmented Generation): Architecture that retrieves relevant context before generating responses. Grounds LLM outputs in specific knowledge.

Reranking: Second-stage retrieval that re-scores initial retrieval results with a more sophisticated model. Improves precision.

Semantic Search: Search based on meaning rather than keywords. Typically implemented with embedding similarity.

Vector Database: Database optimized for storing and searching vector embeddings. Examples: Pinecone, Weaviate, Milvus, Qdrant.

Vector Embedding: Dense numerical representation of text/data in high-dimensional space where similar items are close together.

ANN (Approximate Nearest Neighbor): Algorithm for finding similar vectors without exhaustive search. HNSW and IVF are common approaches. Trades small accuracy loss for large speedup.

Bi-Encoder: Architecture that independently encodes query and documents into vectors for similarity comparison. Fast but less accurate than cross-encoders. Used for initial retrieval.

Cross-Encoder: Architecture that jointly encodes query and document together, producing a relevance score. More accurate than bi-encoders but slower. Used for reranking.

GraphRAG: RAG approach that structures retrieved information as a knowledge graph. Enables multi-hop reasoning and relationship-aware retrieval.

RRF (Reciprocal Rank Fusion): Method for combining ranked lists from multiple retrieval systems. Score = sum(1 / (k + rank)) across systems, where rank is the document's 1-based position in each list and k is a smoothing constant, typically 60.
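The scoring rule translates directly into code. In this sketch the document IDs and the two input rankings are made up, and a document missing from a list simply contributes nothing for that list.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; returns IDs sorted by fused score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_a", "doc_b", "doc_c"]     # illustrative sparse ranking
dense_results = ["doc_b", "doc_d", "doc_a"]    # illustrative dense ranking
fused = reciprocal_rank_fusion([bm25_results, dense_results])
```

Documents that rank well in both lists (here doc_b and doc_a) rise to the top, and no score calibration between the two systems is needed.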

Agents & Tool Use

Agent: AI system that can take actions in an environment to achieve goals. Uses planning, tool use, and feedback loops.

Chain-of-Thought (CoT): Prompting technique that encourages step-by-step reasoning. Improves performance on complex tasks.

Function Calling: LLM capability to generate structured function/API calls rather than free text. Enables reliable tool use.

MCP (Model Context Protocol): Protocol for connecting AI models to external tools and data sources. Standardizes tool integration.

Multi-Agent System: Architecture using multiple AI agents that collaborate or compete to solve problems.

ReAct: Agent framework combining Reasoning and Acting. Interleaves thinking with tool use.

Tool Use: LLM capability to invoke external tools (search, calculation, APIs) during generation.

MLOps & Evaluation

A/B Testing: Comparing two system versions by exposing different users to each and measuring outcomes.

Canary Deployment: Gradual rollout exposing a small percentage of traffic to new model before full deployment.

Concept Drift: Change in the relationship between input features and target variable over time.

Data Drift: Change in the distribution of input data over time.

Feature Store: Centralized repository for storing, managing, and serving ML features. Enables feature reuse and prevents training-serving skew.

LLM-as-Judge: Using an LLM to evaluate outputs of another model. Common for evaluating generation quality.

Model Registry: Versioned repository for trained models with metadata, lineage, and deployment status.

Observability: Ability to understand system state from external outputs. For ML: metrics, logs, traces, and prediction analysis.

Point-in-Time Join: Joining features to training data using only information available at the time of the training label. Prevents data leakage.

Shadow Deployment: Running a new model in parallel with production, comparing outputs without affecting users.

Training-Serving Skew: Differences between how features are computed in training vs serving, leading to degraded model performance.

Infrastructure & Performance

CUDA: NVIDIA’s parallel computing platform and API for GPU programming.

Data Parallelism: Distributing training by splitting batches across multiple GPUs, each with a full model copy.

FLOPS: Floating-point operations per second. Measure of computational throughput.

GPU Memory: High-bandwidth memory on graphics cards used for storing model weights, activations, and gradients.

Model Parallelism: Distributing training by splitting the model itself across multiple GPUs, either within layers (tensor parallelism) or across layers (pipeline parallelism). Required for models too large for a single GPU.

Tensor Parallelism: Form of model parallelism that splits individual layer computations across GPUs.

Pipeline Parallelism: Form of model parallelism that assigns different layers to different GPUs in a pipeline.

ZeRO: Memory optimization technique that partitions optimizer states, gradients, and parameters across data parallel workers.

Security & Safety

Adversarial Attack: Input designed to cause model misbehavior. Includes prompt injection, jailbreaks, and evasion attacks.

Guardrails: Safety mechanisms that filter or modify model inputs/outputs. May include content filters, output validators.

Jailbreak: Adversarial prompt that causes a model to bypass its safety training.

Prompt Injection: Attack where malicious instructions in user input override system instructions.

Red Teaming: Systematic adversarial testing to discover model vulnerabilities and failure modes.

RLHF (Reinforcement Learning from Human Feedback): Training technique that fine-tunes models using human preference signals. Used to align models with human values.

Security is a Dialogue, Not a Wall: Mental model from Ch 16 framing LLM security as an ongoing feedback loop of layered defenses, monitoring, and red-teaming against an evolving adversary, rather than a one-time perimeter to be built.

Data & Features

Data Lineage: Tracking the origin and transformations of data through the pipeline. Essential for debugging and compliance.

Data Quality: Dimensions including completeness, accuracy, consistency, timeliness, and validity.

ETL (Extract, Transform, Load): Pipeline pattern for moving data between systems with transformations.

Feature Engineering: Creating input features from raw data to improve model performance.

Feature View: In feature stores, a collection of related features that share an entity and update schedule.

Schema: Specification of data structure including field names, types, and constraints.

Business & Operations

Error Budget: Allowed amount of downtime or SLO violations in a period. Balances reliability work with feature development.
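The budget is plain arithmetic once a target and a window are chosen. The 99.9% availability target and 30-day window below are illustrative assumptions.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes

# A 99.9% target over 30 days leaves about 43.2 minutes of budget.
budget = error_budget_minutes(0.999)
# Tightening to 99.99% cuts the budget to about 4.32 minutes.
tight_budget = error_budget_minutes(0.9999)
```

Each additional "nine" divides the budget by ten, which is one way to see why very high targets are costly.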

Incident Management: Process for detecting, responding to, and learning from system failures.

On-Call: Engineers responsible for responding to system issues outside business hours.

SLA (Service Level Agreement): Contractual commitment to service quality. Violation may trigger penalties.

SLI (Service Level Indicator): Metric measuring service quality (e.g., request latency, error rate).

SLO (Service Level Objective): Target value for an SLI. Internal goal that should be stricter than SLA.

TCO (Total Cost of Ownership): Complete cost including infrastructure, personnel, maintenance, and hidden costs.

Prompting & Context

Context Window: Maximum number of tokens a model can process in a single request, typically counting both the prompt and the generated output. Ranges from 4K to 1M+ tokens in modern models.

Few-Shot Learning: Providing a few examples in the prompt to guide model behavior. Contrast with zero-shot (no examples) and many-shot.

In-Context Learning: Model’s ability to learn from examples provided in the prompt without parameter updates.

Prompt Engineering: The practice of designing effective prompts to elicit desired model behavior.

System Prompt: Instructions that set the model’s behavior, persona, and constraints. Typically hidden from users.

Zero-Shot: Using a model without task-specific examples. Tests the model’s inherent capabilities.

Architectures & Patterns

Encoder-Decoder: Architecture with separate encoder for input processing and decoder for output generation. Used in T5, BART.

Decoder-Only: Architecture using only the decoder component with causal attention. GPT, LLaMA, and most modern LLMs.

Encoder-Only: Architecture using only the encoder component with bidirectional attention. BERT, RoBERTa.

Mixture of Experts (MoE): Architecture where a learned router sends each token to a small subset of “expert” networks that specialize in different inputs. Enables larger models with less computation per token.

Multimodal Model: Model that processes multiple types of input (text, images, audio, video).

Vision Transformer (ViT): Transformer architecture adapted for image processing by treating image patches as tokens.

CLIP: Contrastive Language-Image Pre-training. Model that learns to align image and text representations.

Optimization & Scaling

Distributed Training: Training across multiple GPUs or machines. Uses data parallelism, model parallelism (tensor or pipeline), or a combination.

Gradient Accumulation: Simulating larger batch sizes by accumulating gradients over multiple forward and backward passes on smaller micro-batches before each optimizer update.
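The equivalence with a larger batch can be checked on a toy problem. This sketch assumes a one-parameter model y = w * x with mean squared error and equally sized micro-batches; the data points are made up.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error with respect to w for y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]   # illustrative (x, y) pairs
w = 0.5

# Gradient computed on the full batch of four examples.
full_batch_grad = mse_gradient(w, data)

# Same gradient built from two micro-batches of two examples each.
micro_batches = [data[0:2], data[2:4]]
accumulated_grad = 0.0
for micro in micro_batches:
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated_grad += mse_gradient(w, micro) / len(micro_batches)
```

The two gradients match, so a single optimizer step after accumulation behaves like a step on the larger batch while only one micro-batch is held in memory at a time.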

Gradient Checkpointing: Trading compute for memory by recomputing activations during backward pass instead of storing them.

Learning Rate Schedule: Plan for changing learning rate during training (e.g., warmup, decay, cosine annealing).

Model Sharding: Splitting model weights across multiple devices. Enables training/inference of models that don’t fit on single device.

Scaling Laws: Empirical relationships between model size, data size, compute, and performance. Guide resource allocation decisions.

Evaluation Metrics

BLEU: Metric for machine translation quality based on n-gram overlap with reference translations.

F1 Score: Harmonic mean of precision and recall. Balanced metric for classification.
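A short worked example, using made-up counts of true positives, false positives, and false negatives:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 8 correct positives, 2 false alarms, 8 misses.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here precision is 0.8 and recall is 0.5, and the harmonic mean (about 0.62) sits closer to the lower of the two than an arithmetic mean would.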

Hallucination: Model generating plausible-sounding but factually incorrect information.

Human Evaluation: Quality assessment by human raters. Gold standard but expensive and slow.

ROUGE: Metric for summarization quality based on overlap with reference summaries.

Win Rate: Percentage of comparisons where one system is preferred over another. Common in LLM evaluation.

MRR (Mean Reciprocal Rank): Retrieval metric that measures how quickly relevant documents appear in results. MRR = average of (1 / rank of first relevant document) across queries.
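The definition can be sketched as follows. A query with no relevant document in its results contributes zero, which is a common convention assumed here; the document IDs are made up.

```python
def mean_reciprocal_rank(results, relevant):
    """results: ranked ID lists per query; relevant: set of relevant IDs per query."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank   # only the first relevant hit counts
                break
    return total / len(results)

results = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d6"}, {"d0"}]
mrr = mean_reciprocal_rank(results, relevant)
```

The three queries score 1, 1/3, and 0, giving an MRR of about 0.44.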

NDCG (Normalized Discounted Cumulative Gain): Retrieval metric that considers both relevance grades and position. Rewards highly relevant documents appearing early in results.
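A minimal sketch, assuming the linear-gain form of DCG (another common variant uses 2^rel - 1 as the gain) and assuming the ideal ordering is built from the same list of graded results:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances, k):
    """relevances: graded relevance of each result, in ranked order."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1], k=3)     # best documents first
reversed_order = ndcg_at_k([1, 2, 3], k=3)
```

The perfectly ordered list scores 1.0, while the reversed list scores lower because the most relevant document appears last.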

Precision@K: Fraction of top-K retrieved documents that are relevant. Measures retrieval precision at a cutoff.

Recall@K: Fraction of relevant documents that appear in the top-K results. Measures retrieval coverage at a cutoff.
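Precision@K and Recall@K share the same numerator and differ only in the denominator, as this sketch with made-up document IDs shows:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4", "d9"}            # d9 is never retrieved
p_at_5 = precision_at_k(ranked, relevant, k=5)
r_at_5 = recall_at_k(ranked, relevant, k=5)
```

Two of the five results are relevant (precision 0.4), and two of the three relevant documents were found (recall about 0.67).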

Data Management

Annotation: Adding labels or metadata to data for supervised learning. May involve human labelers or automated systems.

Data Augmentation: Artificially increasing training data size through transformations (for images) or paraphrasing (for text).

Data Cleaning: Removing or correcting erroneous, duplicate, or low-quality data.

Deduplication: Removing duplicate or near-duplicate examples from training data. Important for preventing memorization.

Synthetic Data: Artificially generated data, often using LLMs, to supplement training data.

Tokenization: Converting text to sequence of tokens for model input. Includes handling of special tokens, padding, and truncation.

Reliability & Operations

AI Systems Fail Like Biological Systems: Mental model from Ch 31 framing AI failure as gradual, diffuse drift with cascading downstream effects (more like an organism than a light bulb), demanding continuous-vitals monitoring rather than binary up/down checks.

Blue-Green Deployment: Running two identical production environments and switching traffic between them for updates.

Circuit Breaker: Pattern that stops calls to a failing service to prevent cascade failures and allow recovery.
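The pattern can be sketched as a small wrapper class. The class name, the failure threshold, the reset timeout, and the injectable clock are all illustrative assumptions rather than any library's API.

```python
import time

class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call skipped")
            # Timeout elapsed: allow one trial call (half-open state).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip the breaker
            raise
        self.failures = 0   # a success resets the failure count
        return result
```

While the circuit is open, callers fail fast instead of waiting on a dependency that is unlikely to respond, which gives that dependency time to recover.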

Graceful Degradation: System design that maintains partial functionality when components fail.

Health Check: Automated verification that a service is functioning correctly.

Load Balancing: Distributing requests across multiple servers to optimize resource use and reliability.

Rate Limiting: Restricting the number of requests a client can make in a time period.

Reliability is a Budget, Not a Goal: Mental model from Ch 31 treating SLO/error budgets as a finite resource to spend on velocity and risk, not a target to maximize; chasing 100% reliability is exponentially expensive and ultimately impossible.

Rollback: Reverting to a previous version of software or model after detecting problems.

Cost & Economics

Compute Cost: Cost of processing resources (CPU, GPU, TPU) for training and inference.

Cost Attribution: Assigning costs to specific teams, projects, or use cases.

Reserved Instances: Pre-purchased cloud capacity at a discount in exchange for commitment.

Spot Instances: Spare cloud capacity available at significant discount but with risk of interruption.

The 10x Cost Cliff: Mental model from Ch 32 for the API-versus-self-host crossover point (typically in the low hundreds of thousands of requests per month) above which managed API spend exceeds self-hosted TCO by roughly an order of magnitude.

Unit Economics: Cost analysis per unit of value (cost per request, cost per user, etc.).

Modern LLM Providers & Models

Anthropic: AI safety company that develops the Claude family of models. Known for emphasis on safety, long context windows, and strong reasoning capabilities.

Claude: Family of large language models developed by Anthropic. Current versions include Claude Opus 4.8, Sonnet 4.6, and Haiku 4.5. Known for nuanced understanding, long context (200K-1M tokens), and strong coding abilities.

Cohere: Enterprise AI company offering LLMs optimized for business applications. Known for Command models and Rerank for retrieval.

Gemini: Google’s multimodal AI model family. Current versions include Gemini 3.1 Pro, Gemini 3.5 Flash, and Gemini 3 Pro. Known for native multimodal capabilities and extremely long context windows (1M-2M tokens). Integrated with Google Cloud and Workspace.

GPT-5: OpenAI’s flagship large language model family. Current versions include GPT-5.5 (~1.05M context) and GPT-5.4. Foundation for ChatGPT Plus and enterprise applications.

Llama: Meta’s open-weight large language model family. Llama 4 includes Scout (10M context) and Maverick variants. Popular for self-hosting, fine-tuning, and extreme context applications.

Mistral: French AI company known for efficient open-weight models. Mistral 7B and Mixtral (MoE) are popular for self-hosting. Known for strong performance relative to size.

OpenAI: AI research company that developed GPT models and ChatGPT. Offers API access to GPT-5.5, GPT-5.4, and embedding models.

Advanced Training Techniques

Constitutional AI (CAI): Anthropic’s approach to training AI systems using a set of principles (a “constitution”) that guides self-improvement and reduces harmful outputs.

Direct Preference Optimization (DPO): Training technique that directly optimizes for human preferences without explicit reward modeling. Simpler alternative to RLHF.

Instruction Tuning: Fine-tuning a pre-trained model on instruction-following examples. Creates models that respond to natural language instructions.

ORPO (Odds Ratio Preference Optimization): Preference learning technique that combines supervised fine-tuning and preference alignment in a single stage.

PPO (Proximal Policy Optimization): Reinforcement learning algorithm commonly used in RLHF. Balances exploration with stable policy updates.

Rejection Sampling: Generating multiple outputs and selecting the best according to a reward model. Used in training and inference.

SFT (Supervised Fine-Tuning): Training a model on curated instruction-response pairs before preference tuning.

Inference Infrastructure

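GGML/GGUF: GGML is the tensor library behind llama.cpp and the name of its original model file format; GGUF is the successor file format. Both store quantized models and are well suited to CPU inference. Popular with llama.cpp.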

Inference is a Queue Theory Problem: Mental model from Ch 9 reframing inference optimization as scheduling (batching, KV cache paging, prefill/decode disaggregation, prefix caching) over a scarce GPU resource rather than as a pure compute problem.

llama.cpp: C++ library for running LLMs on consumer hardware including CPUs and Apple Silicon. Enables local LLM inference.

Ollama: Tool for running LLMs locally with simple model management. Wraps llama.cpp with user-friendly interface.

PagedAttention: Memory management technique for KV cache that uses paging (like OS virtual memory). Enables efficient memory utilization in vLLM.

SGLang: High-performance inference framework with RadixAttention for prefix sharing. Optimized for complex prompting patterns.

TGI (Text Generation Inference): Hugging Face’s inference server for LLMs. Supports tensor parallelism, continuous batching, and various model formats.

Triton Inference Server: NVIDIA’s model serving platform supporting multiple frameworks. Used for production deployment with GPU optimization.

vLLM: High-throughput inference engine featuring PagedAttention and continuous batching. Widely used for self-hosted LLM serving.

Tool Use & Agents

Agents are Junior Engineers with Perfect Memory and No Judgment: Mental model from Ch 8 for deciding when to give agents autonomy: trust them with rote, well-specified work; supervise anything creative, irreversible, or cross-cutting.

Anthropic Tool Use: Claude’s native function calling capability. Uses structured tool definitions and returns structured responses.

Computer Use: Capability for AI to interact with computer interfaces (mouse, keyboard, screenshots). Enables automation of desktop tasks.

LangChain: Framework for building LLM applications with chains, agents, and tool use. Popular but can add complexity.

LangGraph: Extension of LangChain for building stateful, multi-actor applications with graph-based workflows.

MCP (Model Context Protocol): Anthropic’s open protocol for connecting AI models to external tools and data sources. Standardizes tool integration.

MCP Server: Service that implements the Model Context Protocol to expose tools or data to AI models.

OpenAI Function Calling: OpenAI’s API for structured tool use. Model outputs JSON matching provided function schemas.

Tool Schema: JSON Schema definition describing a tool’s name, description, and parameters. Used by LLMs to understand available tools.
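As an illustration, a schema for a hypothetical get_weather tool might look like the dictionary below. The tool, its fields, and the helper function are invented for this sketch, and the exact envelope around the JSON Schema differs between providers.

```python
import json

# Hypothetical tool definition: name, description, and JSON Schema parameters.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def missing_required(tool, arguments):
    """Return the required parameter names absent from a model's arguments."""
    return [name for name in tool["parameters"]["required"] if name not in arguments]

# Arguments as a model might emit them, parsed from JSON text.
model_output = '{"city": "Lisbon", "unit": "celsius"}'
arguments = json.loads(model_output)
problems = missing_required(get_weather_tool, arguments)
```

Checking arguments against the schema before executing the tool catches malformed calls early; a full validator would also check types and enum values.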

Evaluation & Safety

Bias: Systematic errors in model outputs that reflect prejudices in training data or objectives. Includes demographic, cultural, and topic biases.

Capability Elicitation: Techniques for discovering what a model can do, especially latent or hidden capabilities.

Catastrophic Forgetting: Loss of previously learned knowledge when training on new data. Challenge for continual learning.

Deceptive Alignment: Theoretical concern where a model appears aligned during training but behaves differently when deployed.

Evals (Evaluations): Benchmarks and test suites for measuring model capabilities, safety, and alignment.

Frontier Model: The most capable AI models at any given time. Subject to enhanced safety requirements.

Model Card: Documentation describing a model’s intended use, capabilities, limitations, and ethical considerations.

Refusal: When a model declines to complete a request, typically due to safety training. Can be over-refusal or under-refusal.

Safety Training: Techniques to make models refuse harmful requests and behave according to guidelines.

Sandbagging: When a model intentionally performs below its actual capability, for example during an evaluation, so that its true capabilities go undetected.

Prompt Engineering Techniques

Chain-of-Thought (CoT): Prompting technique that encourages step-by-step reasoning. Improves performance on complex tasks. See Chapter 6.

Few-Shot Prompting: Providing examples in the prompt to guide model behavior. The examples demonstrate the desired format and approach.

Prompt Injection Defense: Techniques to prevent malicious user input from overriding system instructions. Includes input validation, output filtering, and architectural defenses.

Prompts are Compiled, Not Written: Mental model from Ch 6 framing prompt engineering as iterative optimization passes (clarity, examples, constraints, format) over a draft rather than one-shot authorship.

Role Prompting: Assigning a specific role or persona to the model in the system prompt to shape responses.

Self-Consistency: Generating multiple reasoning paths and taking majority vote. Improves reliability on reasoning tasks.
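The voting step can be sketched as follows, assuming the final answers have already been extracted from each sampled reasoning path. The answers shown are made up.

```python
from collections import Counter

def majority_answer(answers):
    """Return the most common answer and the share of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Final answers extracted from five independently sampled reasoning paths.
sampled_answers = ["42", "42", "41", "42", "40"]
answer, agreement = majority_answer(sampled_answers)
```

The agreement ratio (0.6 here) can also serve as a rough confidence signal: low agreement suggests the question deserves a closer look.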

Tree of Thoughts (ToT): Extension of chain-of-thought that explores multiple reasoning branches.

XML Tags: Using XML-style tags in prompts to structure input and output. Common pattern with Claude.

Zero-Shot CoT: Adding “Let’s think step by step” without examples. Simple technique that improves reasoning.

Data & Context Management

Context Compression: Techniques to fit more information in limited context windows. Includes summarization and selective retrieval.

Context Stuffing: Filling the context window with relevant information. Balance between more context and noise.

Long Context Models: Models with extended context windows (100K+ tokens). Enables processing entire documents.

Lost in the Middle: Phenomenon where models attend less to information in the middle of long contexts.

Needle in a Haystack: Evaluation testing retrieval from specific positions in long contexts.

Prompt Caching: Caching computed representations of prompt prefixes to avoid recomputation. Reduces latency and cost for repeated prefixes.

System Prompt: Initial instructions that define model behavior, persona, and constraints. Typically not visible to end users.

Acronyms

Acronym	Full Form
ANN	Approximate Nearest Neighbor
API	Application Programming Interface
BERT	Bidirectional Encoder Representations from Transformers
BPE	Byte Pair Encoding
CAI	Constitutional AI
CDC	Change Data Capture
CoT	Chain-of-Thought
CUDA	Compute Unified Device Architecture
DPO	Direct Preference Optimization
ETL	Extract, Transform, Load
FLOPS	Floating-Point Operations Per Second
GGUF	GPT-Generated Unified Format
GPT	Generative Pre-trained Transformer
GPU	Graphics Processing Unit
HF	Hugging Face
KV	Key-Value
LLM	Large Language Model
LoRA	Low-Rank Adaptation
MCP	Model Context Protocol
MLOps	Machine Learning Operations
MLP	Multi-Layer Perceptron
MoE	Mixture of Experts
NLP	Natural Language Processing
OOD	Out-of-Distribution
ORPO	Odds Ratio Preference Optimization
PEFT	Parameter-Efficient Fine-Tuning
PPO	Proximal Policy Optimization
PSI	Population Stability Index
QA	Question Answering
QLoRA	Quantized Low-Rank Adaptation
RAG	Retrieval-Augmented Generation
RLHF	Reinforcement Learning from Human Feedback
ROI	Return on Investment
SDK	Software Development Kit
SFT	Supervised Fine-Tuning
SLA	Service Level Agreement
SLI	Service Level Indicator
SLO	Service Level Objective
SRE	Site Reliability Engineering
TCO	Total Cost of Ownership
TGI	Text Generation Inference
ToT	Tree of Thoughts
TPU	Tensor Processing Unit
TTL	Time to Live
YAML	YAML Ain’t Markup Language

Additional Technical Terms

Ablation Study: Systematic removal of components to understand their contribution. Essential for understanding what works in ML systems.

Active Learning: Training strategy that selectively requests labels for the most informative examples. Reduces labeling cost.

Alignment: Ensuring AI systems behave in accordance with human values and intentions. Major research area for LLMs.

API Gateway: Service that handles API routing, authentication, rate limiting, and monitoring. Entry point for inference requests.

Attention Head: One of multiple parallel attention operations in multi-head attention. Different heads can capture different relationships.

Benchmark: Standardized dataset and evaluation protocol for comparing model performance.

Cold Start: Initial period when a system has insufficient data for personalization. Challenge for recommendation systems.

Constitutional AI: Training approach that uses a set of principles to guide model behavior.

Distillation: Training a smaller model to mimic a larger model’s behavior. Creates efficient models with similar capabilities.

Early Stopping: Halting training when validation performance stops improving. Prevents overfitting.
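
A common way to implement this is a patience counter, sketched below under the assumption that one validation loss is recorded per epoch. The function name and the `patience` and `min_delta` parameters are illustrative and not taken from a specific library.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would halt.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64]
stop_epoch = early_stopping_epoch(losses, patience=2)  # halts at index 4
```

In this run the later improvement at index 5 is never seen, which is the usual trade-off of a small patience value. In practice the checkpoint with the best validation loss (index 2 here) is the one restored.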

Elicitation: Techniques for extracting specific capabilities or behaviors from a trained model.

Ensemble: Combining predictions from multiple models. Often improves accuracy and robustness.

Explainability: Ability to understand why a model made a particular prediction.

Feature Importance: Measure of how much each input feature contributes to model predictions.

Generalization: Model’s ability to perform well on unseen data. The core goal of machine learning.

Gradient Clipping: Limiting gradient magnitude to prevent exploding gradients during training.
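
One widely used variant, clipping by global L2 norm, can be sketched in plain Python as follows. Gradients are represented as a flat list of floats for simplicity, and the function name is an assumption for this example.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Rescaling every component by the same factor preserves the gradient's direction and shrinks only its magnitude.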

Hyperparameter: Configuration that controls training but isn’t learned (learning rate, batch size, architecture choices).

Idempotent: Operation that produces the same result regardless of how many times it’s executed.
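
For operations that are not naturally idempotent, a common pattern is to attach an idempotency key so that retries replay the first result instead of repeating the side effect. The class and key format below are hypothetical.

```python
class PaymentProcessor:
    """Toy processor that applies each idempotency key at most once."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> result of the first call

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # replay, no new side effect
        self.balance += amount
        self._results[idempotency_key] = self.balance
        return self.balance


processor = PaymentProcessor()
first = processor.charge("req-123", 50)
retry = processor.charge("req-123", 50)  # e.g. a client retry after a timeout
```

Both calls return the same value and the balance changes only once, so a client can retry safely after a timeout.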

Latent Space: Hidden representation space where models encode information about inputs.

Leakage: When information from the test set or future data inadvertently influences training.

Memorization: Model learning to reproduce specific training examples rather than general patterns.

Normalization: Scaling data to a standard range. Improves training stability.
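
Two common forms are sketched below: min-max scaling to the range [0, 1], and z-score standardization to zero mean and unit variance. Standardization is included as a closely related variant, and the function names are illustrative.

```python
import statistics


def min_max_scale(values):
    """Map values linearly onto [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scaled = min_max_scale([10.0, 20.0, 30.0])   # [0.0, 0.5, 1.0]
z_scores = standardize([10.0, 20.0, 30.0])
```

To avoid leakage, the statistics (min, max, mean, standard deviation) should be computed on the training split only and then reused unchanged for validation and test data.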

Out-of-Distribution (OOD): Data that differs significantly from training distribution. Models often fail on OOD inputs.

Prompt Caching: Storing computed representations of common prompt prefixes to avoid recomputation.

Pruning: Removing unnecessary weights or connections from a neural network to reduce size and computation.

Reproducibility: Ability to obtain the same results when repeating an experiment. Essential for scientific validity.

Stochastic Gradient Descent (SGD): Optimization algorithm that estimates gradients from random subsets of data.
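
As a minimal sketch, mini-batch SGD can fit a one-variable linear model under mean squared error as shown below. The learning rate, batch size, epoch count, and toy data are assumptions chosen so the example converges.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, batch_size=2, epochs=500, seed=0):
    """Fit y ~ w * x + b by minimizing mean squared error with mini-batch SGD."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch MSE with respect to w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1, no noise
w, b = sgd_linear_fit(xs, ys)
```

Each update uses only the examples in the current batch, so the gradient is a noisy estimate of the full-dataset gradient. With this noise-free data the fitted values approach w = 2 and b = 1.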

Structured Output: Model output constrained to follow a specific format (JSON, XML, etc.).
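
On the consuming side, structured output is typically validated before use. The sketch below parses a JSON reply and checks required fields and types. The function name, the field names, and the sample reply are hypothetical, and production systems often use a full schema validator instead.

```python
import json


def parse_structured_output(raw, required_fields):
    """Parse a model reply as a JSON object and check required fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in required_fields.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} has the wrong type")
    return data


sample_reply = '{"label": "spam", "score": 0.93}'  # hypothetical model reply
result = parse_structured_output(sample_reply, {"label": str, "score": float})
```

Raising a single exception type on any violation makes it straightforward to retry the request or fall back when the model's reply does not match the expected format.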

Supervised Learning: Learning from labeled examples where the correct answer is provided.

Unsupervised Learning: Learning patterns from data without explicit labels.

Validation Set: Data held out during training to evaluate model performance and tune hyperparameters.

Warm-up: Gradually increasing learning rate at the start of training. Improves stability.
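
A linear warm-up can be sketched as a function of the training step. The function name and default values are illustrative, and in practice warm-up is usually followed by a decay schedule rather than the constant rate shown here.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=100):
    """Linearly ramp the learning rate up to base_lr, then hold it constant."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


# With 4 warm-up steps the rate rises in quarters of base_lr, then stays flat.
schedule = [warmup_lr(s, base_lr=0.001, warmup_steps=4) for s in range(6)]
```

Starting with small steps limits the impact of large, noisy gradients early in training, before optimizer statistics have settled.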

Weight Initialization: Strategy for setting initial parameter values before training begins.

# Appendix A: Glossary {.unnumbered} This glossary defines key terms used throughout the book. Terms are organized alphabetically within categories for easier reference. --- ## Model Architecture & Training **Attention Mechanism**: A neural network component that allows models to focus on different parts of the input when producing each part of the output. Self-attention allows every position to attend to every other position in a sequence. The mathematical operation is typically Attention(Q, K, V) = softmax(QK^T/√d_k)V. **Autoregressive Model**: A model that generates output one token at a time, where each token prediction is conditioned on all previously generated tokens. GPT-style models are autoregressive. Contrast with encoder-only or encoder-decoder architectures. **Backpropagation**: Algorithm for computing gradients of the loss function with respect to all parameters in a neural network. Essential for training via gradient descent. **Batch Size**: The number of training examples used in one forward/backward pass. Larger batches improve GPU utilization but may affect convergence. Critical parameter for training efficiency and cost. **BERT (Bidirectional Encoder Representations from Transformers)**: Encoder-only transformer model trained with masked language modeling. Foundation for many NLP classification and extraction tasks. Released by Google in 2018. **Causal Attention**: Attention that only allows positions to attend to earlier positions (not future ones). Essential for autoregressive generation. Also called "masked attention." **Checkpoint**: A saved snapshot of model weights and optimizer state during training. Enables resuming training after interruption and selecting the best model version. **Contrastive Learning**: Training approach where the model learns to distinguish similar pairs from dissimilar pairs. Common for training embedding models (e.g., contrastive loss, triplet loss). 
**Cross-Entropy Loss**: Standard loss function for classification tasks. Measures the difference between predicted probability distribution and true distribution. **Decoder**: In transformer architecture, the component that generates output tokens autoregressively. Uses causal attention and cross-attention to encoder outputs. **Dropout**: Regularization technique that randomly zeroes network activations during training. Prevents overfitting by encouraging redundant representations. **Embedding**: Dense vector representation of discrete objects (tokens, entities, etc.). Learned during training to capture semantic relationships. See also: vector embedding. **Encoder**: In transformer architecture, the component that processes input sequence bidirectionally. Produces contextualized representations used by decoder. **Epoch**: One complete pass through the entire training dataset. Models typically train for multiple epochs. **Fine-Tuning**: Adapting a pre-trained model to a specific task by continuing training on task-specific data. Usually with lower learning rate than pre-training. **Frozen Layers**: Model layers whose weights are not updated during fine-tuning. Freezing early layers preserves general knowledge while adapting later layers to new tasks. **Gradient Descent**: Optimization algorithm that updates model parameters in the direction that reduces the loss function. Variants include SGD, Adam, AdamW. **GPT (Generative Pre-trained Transformer)**: Decoder-only transformer model trained with next-token prediction. Foundation of ChatGPT and many LLMs. **KV Cache**: Cached key and value matrices from attention layers during autoregressive generation. Avoids recomputing attention for previous tokens. Critical for inference efficiency. **Layer Normalization**: Normalization technique applied to layer activations. Standard in transformers (RMSNorm in newer models). **Learning Rate**: Hyperparameter controlling the step size during gradient descent. 
Critical for training stability and convergence speed. **LoRA (Low-Rank Adaptation)**: Parameter-efficient fine-tuning method that adds trainable low-rank matrices to attention layers. Enables fine-tuning large models with minimal additional parameters. **Loss Function**: Mathematical function that measures how well model predictions match targets. Training minimizes the loss. **Mixed Precision Training**: Training with both 16-bit and 32-bit floating point to reduce memory and increase speed while maintaining accuracy. **MLP (Multi-Layer Perceptron)**: Feedforward neural network with fully connected layers. In transformers, the feedforward network after attention. **Multi-Head Attention**: Attention mechanism that runs multiple attention operations in parallel with different learned projections. Standard in transformers. **Overfitting**: When a model performs well on training data but poorly on unseen data. Indicates memorization rather than generalization. **Parameter**: A learnable weight in a neural network. Model size is typically measured in parameters (e.g., 7B parameters). **PEFT (Parameter-Efficient Fine-Tuning)**: Family of techniques for fine-tuning large models by training only a small subset of parameters. Includes LoRA, adapters, prefix tuning. **Perplexity**: Metric for language model quality. Lower perplexity indicates better prediction of held-out text. Related to cross-entropy by PP = exp(CE). **Pre-training**: Initial training of a model on large-scale data before task-specific fine-tuning. Creates general-purpose representations. **Quantization**: Reducing the numerical precision of model weights (e.g., from 16-bit to 8-bit or 4-bit). Reduces memory and often increases inference speed with minimal quality loss. **ReLU (Rectified Linear Unit)**: Activation function f(x) = max(0, x). Common in neural networks though transformers often use variants like GELU. 
**Regularization**: Techniques to prevent overfitting, including dropout, weight decay, and early stopping. **Self-Attention**: Attention where queries, keys, and values all come from the same sequence. Core operation in transformers. **Softmax**: Function that converts a vector of scores into a probability distribution. Used in attention and classification outputs. **Teacher Forcing**: Training technique where the model receives the correct previous token rather than its own prediction. Standard for training autoregressive models. **Token**: The basic unit of text processed by language models. May be words, subwords, or characters depending on the tokenizer. **Tokenizer**: Algorithm that converts text to tokens and vice versa. Common approaches include BPE, WordPiece, and SentencePiece. **Transfer Learning**: Using knowledge from pre-training on one task to improve performance on a different task. **Transformer**: Neural network architecture based entirely on attention mechanisms. Foundation of modern LLMs. Introduced in "Attention Is All You Need" (2017). **Weight Decay**: Regularization technique that adds a penalty proportional to the magnitude of weights. Prevents weights from growing too large. --- ## Inference & Deployment **Batching**: Processing multiple requests together in a single forward pass. Improves throughput at the cost of latency. See also: continuous batching. **Beam Search**: Decoding strategy that maintains multiple candidate sequences and selects the highest-scoring completion. Improves generation quality but increases computation. **Continuous Batching**: Dynamic batching technique that adds new requests to running batches as existing requests complete. Maximizes GPU utilization. **Flash Attention**: Optimized attention implementation that reduces memory usage and improves speed through tiling and kernel fusion. Standard in modern inference systems. **Greedy Decoding**: Decoding strategy that always selects the highest-probability token. 
Fast but may miss better overall sequences. **Inference**: Using a trained model to make predictions on new inputs. Contrast with training. **Latency**: Time from request to response. p50 is median, p99 is 99th percentile (tail latency). **Model Serving**: Infrastructure for hosting and running inference on trained models. Handles request routing, scaling, and response delivery. **Nucleus Sampling (Top-p)**: Decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds p. Balances diversity and quality. **Speculative Decoding**: Acceleration technique using a smaller draft model to propose multiple tokens that the larger model verifies in parallel. Can provide 2-3x speedup. **Temperature**: Parameter controlling randomness in sampling-based decoding. Higher temperature produces more diverse but less reliable outputs. **TensorRT**: NVIDIA's inference optimization library. Performs graph optimization, kernel fusion, and quantization for faster inference. **Throughput**: Number of requests processed per unit time. Often measured in tokens per second. **Top-k Sampling**: Decoding strategy that samples from the k highest-probability tokens. Simple diversity control. **vLLM**: High-performance inference engine for LLMs featuring PagedAttention and continuous batching. --- ## RAG & Retrieval **BM25**: Traditional sparse retrieval algorithm based on term frequency. Still competitive for many retrieval tasks. **Chunking**: Splitting documents into smaller pieces for embedding and retrieval. Chunk size significantly affects retrieval quality. **Dense Retrieval**: Retrieval using learned vector embeddings rather than keyword matching. Captures semantic similarity. **Embedding Model**: Neural network that produces vector representations of text. Common examples include E5, BGE, and OpenAI's text-embedding models. **Hybrid Search**: Combining sparse (keyword) and dense (semantic) retrieval. Often outperforms either alone. 
**Indexing**: Pre-processing documents into a searchable structure. For vector search, involves computing embeddings and building ANN index. **RAG (Retrieval-Augmented Generation)**: Architecture that retrieves relevant context before generating responses. Grounds LLM outputs in specific knowledge. **Reranking**: Second-stage retrieval that re-scores initial retrieval results with a more sophisticated model. Improves precision. **Semantic Search**: Search based on meaning rather than keywords. Typically implemented with embedding similarity. **Vector Database**: Database optimized for storing and searching vector embeddings. Examples: Pinecone, Weaviate, Milvus, Qdrant. **Vector Embedding**: Dense numerical representation of text/data in high-dimensional space where similar items are close together. **ANN (Approximate Nearest Neighbor)**: Algorithm for finding similar vectors without exhaustive search. HNSW and IVF are common approaches. Trades small accuracy loss for large speedup. **Bi-Encoder**: Architecture that independently encodes query and documents into vectors for similarity comparison. Fast but less accurate than cross-encoders. Used for initial retrieval. **Cross-Encoder**: Architecture that jointly encodes query and document together, producing a relevance score. More accurate than bi-encoders but slower. Used for reranking. **GraphRAG**: RAG approach that structures retrieved information as a knowledge graph. Enables multi-hop reasoning and relationship-aware retrieval. **RRF (Reciprocal Rank Fusion)**: Method for combining ranked lists from multiple retrieval systems. Score = sum(1 / (k + rank)) across systems, where k is typically 60. --- ## Agents & Tool Use **Agent**: AI system that can take actions in an environment to achieve goals. Uses planning, tool use, and feedback loops. **Chain-of-Thought (CoT)**: Prompting technique that encourages step-by-step reasoning. Improves performance on complex tasks. 
**Function Calling**: LLM capability to generate structured function/API calls rather than free text. Enables reliable tool use. **MCP (Model Context Protocol)**: Protocol for connecting AI models to external tools and data sources. Standardizes tool integration. **Multi-Agent System**: Architecture using multiple AI agents that collaborate or compete to solve problems. **ReAct**: Agent framework combining Reasoning and Acting. Interleaves thinking with tool use. **Tool Use**: LLM capability to invoke external tools (search, calculation, APIs) during generation. --- ## MLOps & Evaluation **A/B Testing**: Comparing two system versions by exposing different users to each and measuring outcomes. **Canary Deployment**: Gradual rollout exposing a small percentage of traffic to new model before full deployment. **Concept Drift**: Change in the relationship between input features and target variable over time. **Data Drift**: Change in the distribution of input data over time. **Feature Store**: Centralized repository for storing, managing, and serving ML features. Enables feature reuse and prevents training-serving skew. **LLM-as-Judge**: Using an LLM to evaluate outputs of another model. Common for evaluating generation quality. **Model Registry**: Versioned repository for trained models with metadata, lineage, and deployment status. **Observability**: Ability to understand system state from external outputs. For ML: metrics, logs, traces, and prediction analysis. **Point-in-Time Join**: Joining features to training data using only information available at the time of the training label. Prevents data leakage. **Shadow Deployment**: Running a new model in parallel with production, comparing outputs without affecting users. **Training-Serving Skew**: Differences between how features are computed in training vs serving, leading to degraded model performance. --- ## Infrastructure & Performance **CUDA**: NVIDIA's parallel computing platform and API for GPU programming. 
**Data Parallelism**: Distributing training by splitting batches across multiple GPUs, each with a full model copy. **FLOPS**: Floating-point operations per second. Measure of computational throughput. **GPU Memory**: High-bandwidth memory on graphics cards used for storing model weights, activations, and gradients. **Model Parallelism**: Distributing training by splitting model layers across multiple GPUs. Required for models too large for single GPU. **Tensor Parallelism**: Form of model parallelism that splits individual layer computations across GPUs. **Pipeline Parallelism**: Form of model parallelism that assigns different layers to different GPUs in a pipeline. **ZeRO**: Memory optimization technique that partitions optimizer states, gradients, and parameters across data parallel workers. --- ## Security & Safety **Adversarial Attack**: Input designed to cause model misbehavior. Includes prompt injection, jailbreaks, and evasion attacks. **Guardrails**: Safety mechanisms that filter or modify model inputs/outputs. May include content filters, output validators. **Jailbreak**: Adversarial prompt that causes a model to bypass its safety training. **Prompt Injection**: Attack where malicious instructions in user input override system instructions. **Red Teaming**: Systematic adversarial testing to discover model vulnerabilities and failure modes. **RLHF (Reinforcement Learning from Human Feedback)**: Training technique that fine-tunes models using human preference signals. Used to align models with human values. **Security is a Dialogue, Not a Wall** — Mental model from Ch 16 framing LLM security as an ongoing feedback loop of layered defenses, monitoring, and red-teaming against an evolving adversary, rather than a one-time perimeter to be built. --- ## Data & Features **Data Lineage**: Tracking the origin and transformations of data through the pipeline. Essential for debugging and compliance. 
**Data Quality**: Dimensions including completeness, accuracy, consistency, timeliness, and validity. **ETL (Extract, Transform, Load)**: Pipeline pattern for moving data between systems with transformations. **Feature Engineering**: Creating input features from raw data to improve model performance. **Feature View**: In feature stores, a collection of related features that share an entity and update schedule. **Schema**: Specification of data structure including field names, types, and constraints. --- ## Business & Operations **Error Budget**: Allowed amount of downtime or SLO violations in a period. Balances reliability work with feature development. **Incident Management**: Process for detecting, responding to, and learning from system failures. **On-Call**: Engineers responsible for responding to system issues outside business hours. **SLA (Service Level Agreement)**: Contractual commitment to service quality. Violation may trigger penalties. **SLI (Service Level Indicator)**: Metric measuring service quality (e.g., request latency, error rate). **SLO (Service Level Objective)**: Target value for an SLI. Internal goal that should be stricter than SLA. **TCO (Total Cost of Ownership)**: Complete cost including infrastructure, personnel, maintenance, and hidden costs. --- ## Prompting & Context **Context Window**: Maximum number of tokens a model can process in a single input. Ranges from 4K to 1M+ tokens in modern models. **Few-Shot Learning**: Providing a few examples in the prompt to guide model behavior. Contrast with zero-shot (no examples) and many-shot. **In-Context Learning**: Model's ability to learn from examples provided in the prompt without parameter updates. **Prompt Engineering**: The practice of designing effective prompts to elicit desired model behavior. **System Prompt**: Instructions that set the model's behavior, persona, and constraints. Typically hidden from users. **Zero-Shot**: Using a model without task-specific examples. 
Tests the model's inherent capabilities. --- ## Architectures & Patterns **Encoder-Decoder**: Architecture with separate encoder for input processing and decoder for output generation. Used in T5, BART. **Decoder-Only**: Architecture using only the decoder component with causal attention. GPT, LLaMA, and most modern LLMs. **Encoder-Only**: Architecture using only the encoder component with bidirectional attention. BERT, RoBERTa. **Mixture of Experts (MoE)**: Architecture where different "expert" networks specialize in different inputs. Enables larger models with less computation per token. **Multimodal Model**: Model that processes multiple types of input (text, images, audio, video). **Vision Transformer (ViT)**: Transformer architecture adapted for image processing by treating image patches as tokens. **CLIP**: Contrastive Language-Image Pre-training. Model that learns to align image and text representations. --- ## Optimization & Scaling **Distributed Training**: Training across multiple GPUs or machines. Requires data, model, or pipeline parallelism. **Gradient Accumulation**: Simulating larger batch sizes by accumulating gradients over multiple forward passes before updating. **Gradient Checkpointing**: Trading compute for memory by recomputing activations during backward pass instead of storing them. **Learning Rate Schedule**: Plan for changing learning rate during training (e.g., warmup, decay, cosine annealing). **Model Sharding**: Splitting model weights across multiple devices. Enables training/inference of models that don't fit on single device. **Scaling Laws**: Empirical relationships between model size, data size, compute, and performance. Guide resource allocation decisions. --- ## Evaluation Metrics **BLEU**: Metric for machine translation quality based on n-gram overlap with reference translations. **F1 Score**: Harmonic mean of precision and recall. Balanced metric for classification. 
**Hallucination**: Model generating plausible-sounding but factually incorrect information. **Human Evaluation**: Quality assessment by human raters. Gold standard but expensive and slow. **ROUGE**: Metric for summarization quality based on overlap with reference summaries. **Win Rate**: Percentage of comparisons where one system is preferred over another. Common in LLM evaluation. **MRR (Mean Reciprocal Rank)**: Retrieval metric that measures how quickly relevant documents appear in results. MRR = average of (1 / rank of first relevant document) across queries. **NDCG (Normalized Discounted Cumulative Gain)**: Retrieval metric that considers both relevance grades and position. Rewards highly relevant documents appearing early in results. **Precision@K**: Fraction of top-K retrieved documents that are relevant. Measures retrieval precision at a cutoff. **Recall@K**: Fraction of relevant documents that appear in the top-K results. Measures retrieval coverage at a cutoff. --- ## Data Management **Annotation**: Adding labels or metadata to data for supervised learning. May involve human labelers or automated systems. **Data Augmentation**: Artificially increasing training data size through transformations (for images) or paraphrasing (for text). **Data Cleaning**: Removing or correcting erroneous, duplicate, or low-quality data. **Deduplication**: Removing duplicate or near-duplicate examples from training data. Important for preventing memorization. **Synthetic Data**: Artificially generated data, often using LLMs, to supplement training data. **Tokenization**: Converting text to sequence of tokens for model input. Includes handling of special tokens, padding, and truncation. 
--- ## Reliability & Operations **AI Systems Fail Like Biological Systems** — Mental model from Ch 31 framing AI failure as gradual, diffuse drift with cascading downstream effects (more like an organism than a light bulb), demanding continuous-vitals monitoring rather than binary up/down checks. **Blue-Green Deployment**: Running two identical production environments and switching traffic between them for updates. **Circuit Breaker**: Pattern that stops calls to a failing service to prevent cascade failures and allow recovery. **Graceful Degradation**: System design that maintains partial functionality when components fail. **Health Check**: Automated verification that a service is functioning correctly. **Load Balancing**: Distributing requests across multiple servers to optimize resource use and reliability. **Rate Limiting**: Restricting the number of requests a client can make in a time period. **Reliability is a Budget, Not a Goal** — Mental model from Ch 31 treating SLO/error budgets as a finite resource to spend on velocity and risk, not a target to maximize; chasing 100% reliability is exponentially expensive and ultimately impossible. **Rollback**: Reverting to a previous version of software or model after detecting problems. --- ## Cost & Economics **Compute Cost**: Cost of processing resources (CPU, GPU, TPU) for training and inference. **Cost Attribution**: Assigning costs to specific teams, projects, or use cases. **Reserved Instances**: Pre-purchased cloud capacity at a discount in exchange for commitment. **Spot Instances**: Spare cloud capacity available at significant discount but with risk of interruption. **The 10x Cost Cliff** — Mental model from Ch 32 for the API-versus-self-host crossover point (typically in the low hundreds of thousands of requests per month) above which managed API spend exceeds self-hosted TCO by roughly an order of magnitude. **Unit Economics**: Cost analysis per unit of value (cost per request, cost per user, etc.). 
--- ## Modern LLM Providers & Models **Anthropic**: AI safety company that develops the Claude family of models. Known for emphasis on safety, long context windows, and strong reasoning capabilities. **Claude**: Family of large language models developed by Anthropic. Current versions include Claude Opus 4.8, Sonnet 4.6, and Haiku 4.5. Known for nuanced understanding, long context (200K-1M tokens), and strong coding abilities. **Cohere**: Enterprise AI company offering LLMs optimized for business applications. Known for Command models and Rerank for retrieval. **Gemini**: Google's multimodal AI model family. Current versions include Gemini 3.1 Pro, Gemini 3.5 Flash, and Gemini 3 Pro. Known for native multimodal capabilities and extremely long context windows (1M-2M tokens). Integrated with Google Cloud and Workspace. **GPT-5**: OpenAI's flagship large language model family. Current versions include GPT-5.5 (~1.05M context) and GPT-5.4. Foundation for ChatGPT Plus and enterprise applications. **Llama**: Meta's open-weight large language model family. Llama 4 includes Scout (10M context) and Maverick variants. Popular for self-hosting, fine-tuning, and extreme context applications. **Mistral**: French AI company known for efficient open-weight models. Mistral 7B and Mixtral (MoE) are popular for self-hosting. Known for strong performance relative to size. **OpenAI**: AI research company that developed GPT models and ChatGPT. Offers API access to GPT-5.5, GPT-5.4, and embedding models. --- ## Advanced Training Techniques **Constitutional AI (CAI)**: Anthropic's approach to training AI systems using a set of principles (a "constitution") that guides self-improvement and reduces harmful outputs. **Direct Preference Optimization (DPO)**: Training technique that directly optimizes for human preferences without explicit reward modeling. Simpler alternative to RLHF. **Instruction Tuning**: Fine-tuning a pre-trained model on instruction-following examples. 
Creates models that respond to natural language instructions. **ORPO (Odds Ratio Preference Optimization)**: Preference learning technique that combines supervised fine-tuning and preference alignment in a single stage. **PPO (Proximal Policy Optimization)**: Reinforcement learning algorithm commonly used in RLHF. Balances exploration with stable policy updates. **Rejection Sampling**: Generating multiple outputs and selecting the best according to a reward model. Used in training and inference. **SFT (Supervised Fine-Tuning)**: Training a model on curated instruction-response pairs before preference tuning. --- ## Inference Infrastructure **GGML/GGUF**: File formats for storing quantized models, optimized for CPU inference. Popular with llama.cpp. **Inference is a Queue Theory Problem** — Mental model from Ch 9 reframing inference optimization as scheduling (batching, KV cache paging, prefill/decode disaggregation, prefix caching) over a scarce GPU resource rather than as a pure compute problem. **llama.cpp**: C++ library for running LLMs on consumer hardware including CPUs and Apple Silicon. Enables local LLM inference. **Ollama**: Tool for running LLMs locally with simple model management. Wraps llama.cpp with user-friendly interface. **PagedAttention**: Memory management technique for KV cache that uses paging (like OS virtual memory). Enables efficient memory utilization in vLLM. **SGLang**: High-performance inference framework with RadixAttention for prefix sharing. Optimized for complex prompting patterns. **TGI (Text Generation Inference)**: Hugging Face's inference server for LLMs. Supports tensor parallelism, continuous batching, and various model formats. **Triton Inference Server**: NVIDIA's model serving platform supporting multiple frameworks. Used for production deployment with GPU optimization. **vLLM**: High-throughput inference engine featuring PagedAttention and continuous batching. De facto standard for self-hosted LLM serving. 
--- ## Tool Use & Agents **Agents are Junior Engineers with Perfect Memory and No Judgment** — Mental model from Ch 8 for deciding when to give agents autonomy: trust them with rote, well-specified work; supervise anything creative, irreversible, or cross-cutting. **Anthropic Tool Use**: Claude's native function calling capability. Uses structured tool definitions and returns structured responses. **Computer Use**: Capability for AI to interact with computer interfaces (mouse, keyboard, screenshots). Enables automation of desktop tasks. **LangChain**: Framework for building LLM applications with chains, agents, and tool use. Popular but can add complexity. **LangGraph**: Extension of LangChain for building stateful, multi-actor applications with graph-based workflows. **MCP (Model Context Protocol)**: Anthropic's open protocol for connecting AI models to external tools and data sources. Standardizes tool integration. **MCP Server**: Service that implements the Model Context Protocol to expose tools or data to AI models. **OpenAI Function Calling**: OpenAI's API for structured tool use. Model outputs JSON matching provided function schemas. **Tool Schema**: JSON Schema definition describing a tool's name, description, and parameters. Used by LLMs to understand available tools. --- ## Evaluation & Safety **Bias**: Systematic errors in model outputs that reflect prejudices in training data or objectives. Includes demographic, cultural, and topic biases. **Capability Elicitation**: Techniques for discovering what a model can do, especially latent or hidden capabilities. **Catastrophic Forgetting**: Loss of previously learned knowledge when training on new data. Challenge for continual learning. **Deceptive Alignment**: Theoretical concern where a model appears aligned during training but behaves differently when deployed. **Evals (Evaluations)**: Benchmarks and test suites for measuring model capabilities, safety, and alignment. 

**Frontier Model**: The most capable AI models at any given time. Often subject to enhanced safety requirements.

**Model Card**: Documentation describing a model's intended use, capabilities, limitations, and ethical considerations.

**Refusal**: When a model declines to complete a request, typically due to safety training. Failure modes include over-refusal (declining benign requests) and under-refusal (complying with harmful ones).

**Safety Training**: Techniques to make models refuse harmful requests and behave according to guidelines.

**Sandbagging**: When a model intentionally performs below its actual capability, for example to avoid detection during evaluation.

---

## Prompt Engineering Techniques

**Chain-of-Thought (CoT)**: Prompting technique that encourages step-by-step reasoning. Improves performance on complex tasks. See Chapter 6.

**Few-Shot Prompting**: Providing examples in the prompt to guide model behavior. The examples demonstrate the desired format and approach.

**Prompt Injection Defense**: Techniques to prevent malicious user input from overriding system instructions. Includes input validation, output filtering, and architectural defenses.

**Prompts are Compiled, Not Written**: Mental model from Ch 6 framing prompt engineering as iterative optimization passes (clarity, examples, constraints, format) over a draft rather than one-shot authorship.

**Role Prompting**: Assigning a specific role or persona to the model in the system prompt to shape responses.

**Self-Consistency**: Generating multiple reasoning paths and taking a majority vote over their final answers. Improves reliability on reasoning tasks.

**Tree of Thoughts (ToT)**: Extension of chain-of-thought that explores multiple reasoning branches.

**XML Tags**: Using XML-style tags in prompts to structure input and output. Common pattern with Claude.

**Zero-Shot CoT**: Appending "Let's think step by step" to the prompt without providing examples. Simple technique that improves reasoning.

---

## Data & Context Management

**Context Compression**: Techniques to fit more information in limited context windows. Includes summarization and selective retrieval.

**Context Stuffing**: Filling the context window with relevant information. Requires balancing additional context against added noise.

**Long Context Models**: Models with extended context windows (100K+ tokens). Enables processing entire documents.

**Lost in the Middle**: Phenomenon where models attend less to information in the middle of long contexts.

**Needle in a Haystack**: Evaluation testing retrieval from specific positions in long contexts.

**Prompt Caching**: Caching computed representations of prompt prefixes to avoid recomputation. Reduces latency and cost for repeated prefixes.

**System Prompt**: Initial instructions that define model behavior, persona, and constraints. Typically not shown to end users.

---

## Acronyms

| Acronym | Full Form |
|---------|-----------|
| ANN | Approximate Nearest Neighbor |
| API | Application Programming Interface |
| BERT | Bidirectional Encoder Representations from Transformers |
| BPE | Byte Pair Encoding |
| CAI | Constitutional AI |
| CDC | Change Data Capture |
| CoT | Chain-of-Thought |
| CUDA | Compute Unified Device Architecture |
| DPO | Direct Preference Optimization |
| ETL | Extract, Transform, Load |
| FLOPS | Floating-Point Operations Per Second |
| GGUF | GPT-Generated Unified Format |
| GPT | Generative Pre-trained Transformer |
| GPU | Graphics Processing Unit |
| HF | Hugging Face |
| KV | Key-Value |
| LLM | Large Language Model |
| LoRA | Low-Rank Adaptation |
| MCP | Model Context Protocol |
| MLOps | Machine Learning Operations |
| MLP | Multi-Layer Perceptron |
| MoE | Mixture of Experts |
| NLP | Natural Language Processing |
| OOD | Out-of-Distribution |
| ORPO | Odds Ratio Preference Optimization |
| PEFT | Parameter-Efficient Fine-Tuning |
| PPO | Proximal Policy Optimization |
| PSI | Population Stability Index |
| QA | Question Answering |
| QLoRA | Quantized Low-Rank Adaptation |
| RAG | Retrieval-Augmented Generation |
| RLHF | Reinforcement Learning from Human Feedback |
| ROI | Return on Investment |
| SDK | Software Development Kit |
| SFT | Supervised Fine-Tuning |
| SLA | Service Level Agreement |
| SLI | Service Level Indicator |
| SLO | Service Level Objective |
| SRE | Site Reliability Engineering |
| TCO | Total Cost of Ownership |
| TGI | Text Generation Inference |
| ToT | Tree of Thoughts |
| TPU | Tensor Processing Unit |
| TTL | Time to Live |
| YAML | YAML Ain't Markup Language |

---

## Additional Technical Terms

**Ablation Study**: Systematic removal of components to understand their contribution. Essential for understanding what works in ML systems.

**Active Learning**: Training strategy that selectively requests labels for the most informative examples. Reduces labeling cost.

**Alignment**: Ensuring AI systems behave in accordance with human values and intentions. Major research area for LLMs.

**API Gateway**: Service that handles API routing, authentication, rate limiting, and monitoring. Entry point for inference requests.

**Attention Head**: One of multiple parallel attention operations in multi-head attention. Different heads can capture different relationships.

**Benchmark**: Standardized dataset and evaluation protocol for comparing model performance.

**Cold Start**: Initial period when a system has insufficient data for personalization. Challenge for recommendation systems.

**Constitutional AI**: Training approach that uses a set of principles to guide model behavior.

**Distillation**: Training a smaller model to mimic a larger model's behavior. Creates efficient models with similar capabilities.

**Early Stopping**: Halting training when validation performance stops improving. Prevents overfitting.

**Elicitation**: Techniques for extracting specific capabilities or behaviors from a trained model.

**Ensemble**: Combining predictions from multiple models. Often improves accuracy and robustness.

**Explainability**: Ability to understand why a model made a particular prediction.

**Feature Importance**: Measure of how much each input feature contributes to model predictions.

**Generalization**: Model's ability to perform well on unseen data. The core goal of machine learning.

**Gradient Clipping**: Limiting gradient magnitude to prevent exploding gradients during training.

**Hyperparameter**: Configuration that controls training but isn't learned (learning rate, batch size, architecture choices).

**Idempotent**: Operation that produces the same result regardless of how many times it's executed.

**Latent Space**: Hidden representation space where models encode information about inputs.

**Leakage**: When information from the test set or future data inadvertently influences training.

**Memorization**: Model learning to reproduce specific training examples rather than general patterns.

**Normalization**: Rescaling data or activations to a standard range or distribution. Improves training stability.

**Out-of-Distribution (OOD)**: Data that differs significantly from the training distribution. Models often fail on OOD inputs.

**Prompt Caching**: See the entry under Data & Context Management.

**Pruning**: Removing unnecessary weights or connections from a neural network to reduce size and computation.

**Reproducibility**: Ability to obtain the same results when repeating an experiment. Essential for scientific validity.

**Stochastic Gradient Descent (SGD)**: Optimization algorithm that updates parameters using gradients estimated from random subsets (mini-batches) of the data.

**Structured Output**: Model output constrained to follow a specific format (JSON, XML, etc.).

**Supervised Learning**: Learning from labeled examples where the correct answer is provided.

**Unsupervised Learning**: Learning patterns from data without explicit labels.

**Validation Set**: Data held out during training to evaluate model performance and tune hyperparameters.

**Warm-up**: Gradually increasing the learning rate at the start of training. Improves stability.
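
As a small illustration of the Warm-up entry, the sketch below ramps the learning rate linearly. The linear shape, the function name `warmup_lr`, and its parameters are assumptions made for this example; other schedules (for instance, warm-up followed by decay) are also common.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate up to base_lr over the first warmup_steps steps.

    `step` is zero-based. After warm-up, the rate stays at base_lr
    (no decay is modeled in this sketch).
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps
```

A training loop would call this once per step and assign the result to the optimizer's learning rate before applying the update.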

**Weight Initialization**: Strategy for setting initial parameter values before training begins.
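
One widely used strategy, shown here only as an example of the term, is Xavier (Glorot) uniform initialization, which draws weights from a uniform range scaled by the layer's input and output sizes. The function name `xavier_uniform` and the fixed seed are choices made for this sketch; the glossary does not prescribe any particular scheme.

```python
import math
import random


def xavier_uniform(fan_in: int, fan_out: int, seed: int = 0) -> list:
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit).

    limit = sqrt(6 / (fan_in + fan_out)) is the Xavier/Glorot uniform bound.
    The seed is fixed only so the sketch is reproducible.
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]
```

Scaling the range by layer width keeps the variance of activations roughly comparable across layers at the start of training, which is the usual motivation for schemes of this kind.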