Appendix A: Glossary
This glossary defines key terms used throughout the book. Terms are grouped by category and listed roughly alphabetically within each category; a few terms appear in more than one category.
Model Architecture & Training
Attention Mechanism: A neural network component that allows models to focus on different parts of the input when producing each part of the output. Self-attention allows every position to attend to every other position in a sequence. The mathematical operation is typically Attention(Q, K, V) = softmax(QK^T/√d_k)V.
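As a rough illustration of the formula above, the following sketch computes scaled dot-product attention over plain Python lists. The function names (`softmax`, `attention`) and the `causal` flag are illustrative choices, not part of any library; real implementations use batched tensor operations.

```python
import math

def softmax(scores):
    """Convert a list of scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over lists of row vectors.

    Q and K hold vectors of length d_k; V holds one value vector per key.
    With causal=True, position i may only attend to positions <= i.
    """
    d_k = len(K[0])
    output = []
    for i, q in enumerate(Q):
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask out future positions so they receive zero weight.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output row is a weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output
```

Passing the same sequence as `Q`, `K`, and `V` gives self-attention; setting `causal=True` gives the masked variant used for autoregressive generation.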
Autoregressive Model: A model that generates output one token at a time, where each token prediction is conditioned on all previously generated tokens. GPT-style models are autoregressive. Contrast with encoder-only models such as BERT, which predict masked tokens using context from both directions.
Backpropagation: Algorithm for computing gradients of the loss function with respect to all parameters in a neural network. Essential for training via gradient descent.
Batch Size: The number of training examples used in one forward/backward pass. Larger batches improve GPU utilization but may affect convergence. Critical parameter for training efficiency and cost.
BERT (Bidirectional Encoder Representations from Transformers): Encoder-only transformer model trained with masked language modeling. Foundation for many NLP classification and extraction tasks. Released by Google in 2018.
Causal Attention: Attention that only allows positions to attend to earlier positions (not future ones). Essential for autoregressive generation. Also called “masked attention.”
Checkpoint: A saved snapshot of model weights and optimizer state during training. Enables resuming training after interruption and selecting the best model version.
Contrastive Learning: Training approach where the model learns to distinguish similar pairs from dissimilar pairs. Common for training embedding models (e.g., contrastive loss, triplet loss).
Cross-Entropy Loss: Standard loss function for classification tasks. Measures the difference between predicted probability distribution and true distribution.
Decoder: In transformer architecture, the component that generates output tokens autoregressively. Uses causal attention; in encoder-decoder models it also uses cross-attention to the encoder outputs.
Dropout: Regularization technique that randomly zeroes network activations during training. Prevents overfitting by encouraging redundant representations.
Embedding: Dense vector representation of discrete objects (tokens, entities, etc.). Learned during training to capture semantic relationships. See also: vector embedding.
Encoder: In transformer architecture, the component that processes input sequence bidirectionally. Produces contextualized representations used by decoder.
Epoch: One complete pass through the entire training dataset. Models typically train for multiple epochs.
Fine-Tuning: Adapting a pre-trained model to a specific task by continuing training on task-specific data. Usually with lower learning rate than pre-training.
Frozen Layers: Model layers whose weights are not updated during fine-tuning. Freezing early layers preserves general knowledge while adapting later layers to new tasks.
Gradient Descent: Optimization algorithm that updates model parameters in the direction that reduces the loss function. Variants include SGD, Adam, AdamW.
GPT (Generative Pre-trained Transformer): Decoder-only transformer model trained with next-token prediction. Foundation of ChatGPT and many LLMs.
KV Cache: Cached key and value matrices from attention layers during autoregressive generation. Avoids recomputing attention for previous tokens. Critical for inference efficiency.
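To make the memory cost of a KV cache concrete, here is a back-of-the-envelope sizing sketch. The function name and the example configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit values) are assumptions chosen for illustration, not figures from any specific model in this book.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration, 16-bit cache values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_context = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token / 1024**2, "MiB per token")
print(full_context / 1024**3, "GiB for 4096 tokens")
```

Under these assumptions the cache grows linearly with sequence length and batch size, which is why cache management dominates serving capacity planning.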
Layer Normalization: Normalization technique applied to layer activations. Standard in transformers; many newer models use the simpler RMSNorm variant.
Learning Rate: Hyperparameter controlling the step size during gradient descent. Critical for training stability and convergence speed.
LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning method that adds trainable low-rank matrices alongside frozen weight matrices, most often the attention projections. Enables fine-tuning large models with minimal additional parameters.
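The parameter savings can be sketched with simple arithmetic. For a frozen weight matrix of shape d_out x d_in, a rank-r update trains two factors of shapes d_out x r and r x d_in. The function name and the 4096 x 4096, rank-8 example are illustrative assumptions.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full_matrix_params, lora_trainable_params) for one weight matrix."""
    full = d_out * d_in
    # Two low-rank factors: one of shape d_out x rank, one of shape rank x d_in.
    lora = rank * (d_out + d_in)
    return full, lora

full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For this hypothetical layer the trainable update is well under 1% of the original matrix.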
Loss Function: Mathematical function that measures how well model predictions match targets. Training minimizes the loss.
Mixed Precision Training: Training with both 16-bit and 32-bit floating point to reduce memory and increase speed while maintaining accuracy.
MLP (Multi-Layer Perceptron): Feedforward neural network with fully connected layers. In transformers, the feedforward network after attention.
Multi-Head Attention: Attention mechanism that runs multiple attention operations in parallel with different learned projections. Standard in transformers.
Overfitting: When a model performs well on training data but poorly on unseen data. Indicates memorization rather than generalization.
Parameter: A learnable weight in a neural network. Model size is typically measured in parameters (e.g., 7B parameters).
PEFT (Parameter-Efficient Fine-Tuning): Family of techniques for fine-tuning large models by training only a small subset of parameters. Includes LoRA, adapters, prefix tuning.
Perplexity: Metric for language model quality. Lower perplexity indicates better prediction of held-out text. Related to cross-entropy by PP = exp(CE), where CE is the average per-token cross-entropy measured in nats.
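The relationship PP = exp(CE) can be checked with a small sketch. It assumes you already have the probability the model assigned to each actual next token; the function name is illustrative.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the true next tokens."""
    # Average negative log-likelihood per token (cross-entropy in nats).
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# A model that is always uniformly unsure among 4 options has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

One way to read the result: a perplexity of N means the model is, on average, as uncertain as if it were choosing uniformly among N tokens.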
Pre-training: Initial training of a model on large-scale data before task-specific fine-tuning. Creates general-purpose representations.
Quantization: Reducing the numerical precision of model weights (e.g., from 16-bit to 8-bit or 4-bit). Reduces memory and often increases inference speed with minimal quality loss.
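One simple scheme, shown here only as a sketch, is symmetric "absmax" quantization to 8-bit integers: scale the weights so the largest magnitude maps to 127, round, and keep the scale for dequantization. Production methods are more elaborate (per-channel or per-group scales, calibration); the function names below are illustrative.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization of a list of floats to int8 values."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and their scale."""
    return [q * scale for q in quantized]

q, scale = quantize_int8([0.5, -1.0, 0.25, 0.0])
print(q, dequantize(q, scale))
```

The reconstruction error per weight is bounded by half a quantization step, which is the source of the small quality loss mentioned above.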
ReLU (Rectified Linear Unit): Activation function f(x) = max(0, x). Common in neural networks though transformers often use variants like GELU.
Regularization: Techniques to prevent overfitting, including dropout, weight decay, and early stopping.
Self-Attention: Attention where queries, keys, and values all come from the same sequence. Core operation in transformers.
Softmax: Function that converts a vector of scores into a probability distribution. Used in attention and classification outputs.
Teacher Forcing: Training technique where the model receives the correct previous token rather than its own prediction. Standard for training autoregressive models.
Token: The basic unit of text processed by language models. May be words, subwords, or characters depending on the tokenizer.
Tokenizer: Algorithm that converts text to tokens and vice versa. Common approaches include BPE, WordPiece, and SentencePiece.
Transfer Learning: Using knowledge from pre-training on one task to improve performance on a different task.
Transformer: Neural network architecture built around attention mechanisms rather than recurrence or convolution. Foundation of modern LLMs. Introduced in “Attention Is All You Need” (2017).
Weight Decay: Regularization technique that adds a penalty proportional to the magnitude of weights. Prevents weights from growing too large.
Inference & Deployment
Batching: Processing multiple requests together in a single forward pass. Improves throughput at the cost of latency. See also: continuous batching.
Beam Search: Decoding strategy that maintains multiple candidate sequences and selects the highest-scoring completion. Improves generation quality but increases computation.
Continuous Batching: Dynamic batching technique that adds new requests to running batches as existing requests complete. Maximizes GPU utilization.
Flash Attention: Optimized attention implementation that reduces memory usage and improves speed through tiling and kernel fusion. Standard in modern inference systems.
Greedy Decoding: Decoding strategy that always selects the highest-probability token. Fast but may miss better overall sequences.
Inference: Using a trained model to make predictions on new inputs. Contrast with training.
Latency: Time from request to response. p50 is median, p99 is 99th percentile (tail latency).
Model Serving: Infrastructure for hosting and running inference on trained models. Handles request routing, scaling, and response delivery.
Nucleus Sampling (Top-p): Decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds p. Balances diversity and quality.
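A minimal sketch of the filtering step, assuming the next-token distribution is given as a dictionary from token to probability. The function name is illustrative, and the cutoff here keeps tokens until the cumulative probability reaches or exceeds p, which is one common convention.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p.

    Returns a dict of the kept tokens with probabilities renormalized to sum to 1.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

print(top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, p=0.7))
```

A sampler would then draw from the returned distribution, for example with `random.choices(list(kept), weights=list(kept.values()))`.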
Speculative Decoding: Acceleration technique using a smaller draft model to propose multiple tokens that the larger model verifies in parallel. Can provide 2-3x speedup.
Temperature: Parameter controlling randomness in sampling-based decoding. Higher temperature produces more diverse but less reliable outputs.
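In most implementations temperature divides the logits before the softmax. The sketch below, with an illustrative function name, shows how lower values sharpen the distribution and higher values flatten it.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As temperature approaches zero the distribution concentrates on the top token, which approximates greedy decoding.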
TensorRT: NVIDIA’s inference optimization library. Performs graph optimization, kernel fusion, and quantization for faster inference.
Throughput: Number of requests processed per unit time. Often measured in tokens per second.
Top-k Sampling: Decoding strategy that samples from the k highest-probability tokens. Simple diversity control.
vLLM: High-performance inference engine for LLMs featuring PagedAttention and continuous batching.
RAG & Retrieval
BM25: Traditional sparse retrieval algorithm that scores documents by term frequency, inverse document frequency, and document length normalization. Still competitive for many retrieval tasks.
Chunking: Splitting documents into smaller pieces for embedding and retrieval. Chunk size significantly affects retrieval quality.
Dense Retrieval: Retrieval using learned vector embeddings rather than keyword matching. Captures semantic similarity.
Embedding Model: Neural network that produces vector representations of text. Common examples include E5, BGE, and OpenAI’s text-embedding models.
Hybrid Search: Combining sparse (keyword) and dense (semantic) retrieval. Often outperforms either alone.
Indexing: Pre-processing documents into a searchable structure. For vector search, involves computing embeddings and building ANN index.
RAG (Retrieval-Augmented Generation): Architecture that retrieves relevant context before generating responses. Grounds LLM outputs in specific knowledge.
Reranking: Second-stage retrieval that re-scores initial retrieval results with a more sophisticated model. Improves precision.
Semantic Search: Search based on meaning rather than keywords. Typically implemented with embedding similarity.
Vector Database: Database optimized for storing and searching vector embeddings. Examples: Pinecone, Weaviate, Milvus, Qdrant.
Vector Embedding: Dense numerical representation of text/data in high-dimensional space where similar items are close together.
ANN (Approximate Nearest Neighbor): Algorithm for finding similar vectors without exhaustive search. HNSW and IVF are common approaches. Trades small accuracy loss for large speedup.
Bi-Encoder: Architecture that independently encodes query and documents into vectors for similarity comparison. Fast but less accurate than cross-encoders. Used for initial retrieval.
Cross-Encoder: Architecture that jointly encodes query and document together, producing a relevance score. More accurate than bi-encoders but slower. Used for reranking.
GraphRAG: RAG approach that structures retrieved information as a knowledge graph. Enables multi-hop reasoning and relationship-aware retrieval.
RRF (Reciprocal Rank Fusion): Method for combining ranked lists from multiple retrieval systems. Score = sum(1 / (k + rank)) across systems, where k is typically 60.
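The RRF formula above translates almost directly into code. This sketch assumes each retrieval system returns a ranked list of document IDs (best first); the function name and the example IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each system contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_results = ["d1", "d2", "d3"]  # e.g., from keyword search
dense_results = ["d2", "d3", "d1"]   # e.g., from embedding search
print(reciprocal_rank_fusion([sparse_results, dense_results]))
```

Because only ranks are used, the two systems' raw scores never need to be put on a common scale, which is the main appeal of RRF for hybrid search.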
Agents & Tool Use
Agent: AI system that can take actions in an environment to achieve goals. Uses planning, tool use, and feedback loops.
Chain-of-Thought (CoT): Prompting technique that encourages step-by-step reasoning. Improves performance on complex tasks.
Function Calling: LLM capability to generate structured function/API calls rather than free text. Enables reliable tool use.
MCP (Model Context Protocol): Protocol for connecting AI models to external tools and data sources. Standardizes tool integration.
Multi-Agent System: Architecture using multiple AI agents that collaborate or compete to solve problems.
ReAct: Agent framework combining Reasoning and Acting. Interleaves thinking with tool use.
Tool Use: LLM capability to invoke external tools (search, calculation, APIs) during generation.
MLOps & Evaluation
A/B Testing: Comparing two system versions by exposing different users to each and measuring outcomes.
Canary Deployment: Gradual rollout exposing a small percentage of traffic to new model before full deployment.
Concept Drift: Change in the relationship between input features and target variable over time.
Data Drift: Change in the distribution of input data over time.
Feature Store: Centralized repository for storing, managing, and serving ML features. Enables feature reuse and prevents training-serving skew.
LLM-as-Judge: Using an LLM to evaluate outputs of another model. Common for evaluating generation quality.
Model Registry: Versioned repository for trained models with metadata, lineage, and deployment status.
Observability: Ability to understand system state from external outputs. For ML: metrics, logs, traces, and prediction analysis.
Point-in-Time Join: Joining features to training data using only information available at the time of the training label. Prevents data leakage.
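A minimal sketch of the idea, assuming feature values are stored as a time-sorted history per entity and timestamps are comparable numbers. The function name and data layout are hypothetical; feature stores implement this with their own APIs.

```python
from bisect import bisect_right

def point_in_time_join(labels, feature_history):
    """Attach to each label the latest feature value known at label time.

    labels: list of (entity_id, label_time, label)
    feature_history: dict entity_id -> list of (feature_time, value), sorted by time
    """
    rows = []
    for entity_id, label_time, label in labels:
        history = feature_history.get(entity_id, [])
        times = [t for t, _ in history]
        # Index of the last feature recorded at or before label_time.
        idx = bisect_right(times, label_time) - 1
        value = history[idx][1] if idx >= 0 else None
        rows.append((entity_id, label_time, value, label))
    return rows

history = {"u1": [(1, 10.0), (5, 20.0), (9, 30.0)]}
print(point_in_time_join([("u1", 6, 1), ("u1", 0, 0)], history))
```

The label at time 6 sees the value written at time 5, never the later value from time 9, which is exactly the leakage the definition guards against.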
Shadow Deployment: Running a new model in parallel with production, comparing outputs without affecting users.
Training-Serving Skew: Differences between how features are computed in training vs serving, leading to degraded model performance.
Infrastructure & Performance
CUDA: NVIDIA’s parallel computing platform and API for GPU programming.
Data Parallelism: Distributing training by splitting batches across multiple GPUs, each with a full model copy.
FLOPS: Floating-point operations per second. Measure of computational throughput.
GPU Memory: High-bandwidth memory on graphics cards used for storing model weights, activations, and gradients.
Model Parallelism: Distributing training by splitting the model itself across multiple GPUs, either by layer (pipeline parallelism) or within layers (tensor parallelism). Required for models too large for a single GPU.
Tensor Parallelism: Form of model parallelism that splits individual layer computations across GPUs.
Pipeline Parallelism: Form of model parallelism that assigns different layers to different GPUs in a pipeline.
ZeRO: Memory optimization technique that partitions optimizer states, gradients, and parameters across data parallel workers.
Security & Safety
Adversarial Attack: Input designed to cause model misbehavior. Includes prompt injection, jailbreaks, and evasion attacks.
Guardrails: Safety mechanisms that filter or modify model inputs/outputs. May include content filters, output validators.
Jailbreak: Adversarial prompt that causes a model to bypass its safety training.
Prompt Injection: Attack where malicious instructions in user input or retrieved content override system instructions.
Red Teaming: Systematic adversarial testing to discover model vulnerabilities and failure modes.
RLHF (Reinforcement Learning from Human Feedback): Training technique that fine-tunes models using human preference signals. Used to align models with human values.
Security is a Dialogue, Not a Wall: Mental model from Ch 16 framing LLM security as an ongoing feedback loop of layered defenses, monitoring, and red-teaming against an evolving adversary, rather than a one-time perimeter to be built.
Data & Features
Data Lineage: Tracking the origin and transformations of data through the pipeline. Essential for debugging and compliance.
Data Quality: Dimensions including completeness, accuracy, consistency, timeliness, and validity.
ETL (Extract, Transform, Load): Pipeline pattern for moving data between systems with transformations.
Feature Engineering: Creating input features from raw data to improve model performance.
Feature View: In feature stores, a collection of related features that share an entity and update schedule.
Schema: Specification of data structure including field names, types, and constraints.
Business & Operations
Error Budget: Allowed amount of downtime or SLO violations in a period. Balances reliability work with feature development.
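The arithmetic behind an availability error budget is simple enough to sketch. The function name and the 30-day window are assumptions for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} minutes per 30 days")
```

Each additional "nine" shrinks the budget by a factor of ten, which is why tighter SLOs leave progressively less room for risky releases.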
Incident Management: Process for detecting, responding to, and learning from system failures.
On-Call: Engineers responsible for responding to system issues outside business hours.
SLA (Service Level Agreement): Contractual commitment to service quality. Violation may trigger penalties.
SLI (Service Level Indicator): Metric measuring service quality (e.g., request latency, error rate).
SLO (Service Level Objective): Target value for an SLI. Internal goal that should be stricter than SLA.
TCO (Total Cost of Ownership): Complete cost including infrastructure, personnel, maintenance, and hidden costs.
Prompting & Context
Context Window: Maximum number of tokens a model can process in a single request, counting both the prompt and the generated output. Ranges from 4K to 1M+ tokens in modern models.
Few-Shot Learning: Providing a few examples in the prompt to guide model behavior. Contrast with zero-shot (no examples) and many-shot.
In-Context Learning: Model’s ability to learn from examples provided in the prompt without parameter updates.
Prompt Engineering: The practice of designing effective prompts to elicit desired model behavior.
System Prompt: Instructions that set the model’s behavior, persona, and constraints. Typically hidden from users.
Zero-Shot: Using a model without task-specific examples. Tests the model’s inherent capabilities.
Architectures & Patterns
Encoder-Decoder: Architecture with separate encoder for input processing and decoder for output generation. Used in T5, BART.
Decoder-Only: Architecture using only the decoder component with causal attention. Used in GPT, Llama, and most modern LLMs.
Encoder-Only: Architecture using only the encoder component with bidirectional attention. Used in BERT, RoBERTa.
Mixture of Experts (MoE): Architecture where different “expert” networks specialize in different inputs and a router activates only a few of them per token. Enables larger models with less computation per token.
Multimodal Model: Model that processes multiple types of input (text, images, audio, video).
Vision Transformer (ViT): Transformer architecture adapted for image processing by treating image patches as tokens.
CLIP: Contrastive Language-Image Pre-training. Model that learns to align image and text representations.
Optimization & Scaling
Distributed Training: Training across multiple GPUs or machines. Uses data parallelism, model parallelism (tensor or pipeline), or a combination.
Gradient Accumulation: Simulating larger batch sizes by accumulating gradients over multiple forward/backward passes before each optimizer update.
Gradient Checkpointing: Trading compute for memory by recomputing activations during backward pass instead of storing them.
Learning Rate Schedule: Plan for changing learning rate during training (e.g., warmup, decay, cosine annealing).
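One widely used shape, sketched here under assumed parameter names, is linear warmup followed by cosine decay. The exact warmup convention (whether step 0 starts at zero or at a small positive rate) varies between implementations; this version starts slightly above zero.

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Ramp up linearly, reaching max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 9, 10, 55, 100):
    print(step, lr_at_step(step, max_lr=1e-3, warmup_steps=10, total_steps=100))
```

The warmup phase protects early training from large, noisy updates, and the cosine tail lets the model settle as training ends.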
Model Sharding: Splitting model weights across multiple devices. Enables training/inference of models that don’t fit on single device.
Scaling Laws: Empirical relationships between model size, data size, compute, and performance. Guide resource allocation decisions.
Evaluation Metrics
BLEU: Metric for machine translation quality based on n-gram overlap with reference translations.
F1 Score: Harmonic mean of precision and recall. Balanced metric for classification.
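As a small worked example, F1 can be computed from raw counts of true positives, false positives, and false negatives. The function name and the sample counts are illustrative.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 0.8 and recall 0.5 give an F1 below their arithmetic mean.
print(f1_score(true_positives=8, false_positives=2, false_negatives=8))
```

Because the harmonic mean is pulled toward the smaller value, a model cannot score well on F1 by maximizing only precision or only recall.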
Hallucination: Model generating plausible-sounding but factually incorrect information.
Human Evaluation: Quality assessment by human raters. Gold standard but expensive and slow.
ROUGE: Metric for summarization quality based on overlap with reference summaries.
Win Rate: Percentage of comparisons where one system is preferred over another. Common in LLM evaluation.
MRR (Mean Reciprocal Rank): Retrieval metric that measures how quickly relevant documents appear in results. MRR = average of (1 / rank of first relevant document) across queries.
NDCG (Normalized Discounted Cumulative Gain): Retrieval metric that considers both relevance grades and position. Rewards highly relevant documents appearing early in results.
Precision@K: Fraction of top-K retrieved documents that are relevant. Measures retrieval precision at a cutoff.
Recall@K: Fraction of relevant documents that appear in the top-K results. Measures retrieval coverage at a cutoff.
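The four retrieval metrics defined above (MRR, NDCG, Precision@K, Recall@K) can be sketched in a few lines each. Function names and the example data are hypothetical; the NDCG version assumes linear gains with a log2 position discount, which is one of several conventions, and Precision@K assumes k is positive and divides by k even if fewer results were returned.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mean_reciprocal_rank(results):
    """results: list of (retrieved_list, relevant_set), one pair per query."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(results)

def ndcg_at_k(retrieved, gains, k):
    """gains: dict doc_id -> graded relevance; missing documents count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

retrieved, relevant = ["a", "b", "c", "d"], {"b", "d", "e"}
print(precision_at_k(retrieved, relevant, 2), recall_at_k(retrieved, relevant, 4))
```

Precision@K and Recall@K treat relevance as binary, MRR looks only at the first relevant hit, and NDCG is the one to reach for when relevance is graded.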
Data Management
Annotation: Adding labels or metadata to data for supervised learning. May involve human labelers or automated systems.
Data Augmentation: Artificially increasing training data size through transformations (for images) or paraphrasing (for text).
Data Cleaning: Removing or correcting erroneous, duplicate, or low-quality data.
Deduplication: Removing duplicate or near-duplicate examples from training data. Important for preventing memorization.
Synthetic Data: Artificially generated data, often using LLMs, to supplement training data.
Tokenization: Converting text to sequence of tokens for model input. Includes handling of special tokens, padding, and truncation.
Reliability & Operations
AI Systems Fail Like Biological Systems: Mental model from Ch 31 framing AI failure as gradual, diffuse drift with cascading downstream effects (more like an organism than a light bulb), demanding continuous-vitals monitoring rather than binary up/down checks.
Blue-Green Deployment: Running two identical production environments and switching traffic between them for updates.
Circuit Breaker: Pattern that stops calls to a failing service to prevent cascade failures and allow recovery.
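A minimal sketch of the pattern, assuming a simple consecutive-failure threshold and a fixed reset timeout. The class name, parameters, and the injectable `clock` are hypothetical choices for illustration; production libraries add half-open trial limits, metrics, and thread safety.

```python
import time

class CircuitBreaker:
    """Reject calls for a cooldown period after repeated failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a downstream model endpoint as `breaker.call(send_request, payload)` (where `send_request` is your own function) lets callers fail fast while the dependency recovers.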
Graceful Degradation: System design that maintains partial functionality when components fail.
Health Check: Automated verification that a service is functioning correctly.
Load Balancing: Distributing requests across multiple servers to optimize resource use and reliability.
Rate Limiting: Restricting the number of requests a client can make in a time period.
Reliability is a Budget, Not a Goal: Mental model from Ch 31 treating SLO/error budgets as a finite resource to spend on velocity and risk, not a target to maximize; chasing 100% reliability is exponentially expensive and ultimately impossible.
Rollback: Reverting to a previous version of software or model after detecting problems.
Cost & Economics
Compute Cost: Cost of processing resources (CPU, GPU, TPU) for training and inference.
Cost Attribution: Assigning costs to specific teams, projects, or use cases.
Reserved Instances: Pre-purchased cloud capacity at a discount in exchange for commitment.
Spot Instances: Spare cloud capacity available at significant discount but with risk of interruption.
The 10x Cost Cliff: Mental model from Ch 32 for the API-versus-self-host crossover point (typically in the low hundreds of thousands of requests per month) above which managed API spend exceeds self-hosted TCO by roughly an order of magnitude.
Unit Economics: Cost analysis per unit of value (cost per request, cost per user, etc.).
Modern LLM Providers & Models
Anthropic: AI safety company that develops the Claude family of models. Known for emphasis on safety, long context windows, and strong reasoning capabilities.
Claude: Family of large language models developed by Anthropic. Current versions include Claude Opus 4.8, Sonnet 4.6, and Haiku 4.5. Known for nuanced understanding, long context (200K-1M tokens), and strong coding abilities.
Cohere: Enterprise AI company offering LLMs optimized for business applications. Known for Command models and Rerank for retrieval.
Gemini: Google’s multimodal AI model family. Current versions include Gemini 3.1 Pro, Gemini 3.5 Flash, and Gemini 3 Pro. Known for native multimodal capabilities and extremely long context windows (1M-2M tokens). Integrated with Google Cloud and Workspace.
GPT-5: OpenAI’s flagship large language model family. Current versions include GPT-5.5 (~1.05M context) and GPT-5.4. Foundation for ChatGPT Plus and enterprise applications.
Llama: Meta’s open-weight large language model family. Llama 4 includes Scout (10M context) and Maverick variants. Popular for self-hosting, fine-tuning, and extreme context applications.
Mistral: French AI company known for efficient open-weight models. Mistral 7B and Mixtral (MoE) are popular for self-hosting. Known for strong performance relative to size.
OpenAI: AI research company that developed GPT models and ChatGPT. Offers API access to GPT-5.5, GPT-5.4, and embedding models.
Advanced Training Techniques
Constitutional AI (CAI): Anthropic’s approach to training AI systems using a set of principles (a “constitution”) that guides self-improvement and reduces harmful outputs.
Direct Preference Optimization (DPO): Training technique that directly optimizes for human preferences without explicit reward modeling. Simpler alternative to RLHF.
Instruction Tuning: Fine-tuning a pre-trained model on instruction-following examples. Creates models that respond to natural language instructions.
ORPO (Odds Ratio Preference Optimization): Preference learning technique that combines supervised fine-tuning and preference alignment in a single stage.
PPO (Proximal Policy Optimization): Reinforcement learning algorithm commonly used in RLHF. Balances exploration with stable policy updates.
Rejection Sampling: Generating multiple outputs and selecting the best according to a reward model. Used in training and inference.
SFT (Supervised Fine-Tuning): Training a model on curated instruction-response pairs before preference tuning.
Inference Infrastructure
GGML/GGUF: File formats for storing quantized models, optimized for CPU inference. Popular with llama.cpp.
Inference is a Queue Theory Problem: Mental model from Ch 9 reframing inference optimization as scheduling (batching, KV cache paging, prefill/decode disaggregation, prefix caching) over a scarce GPU resource rather than as a pure compute problem.
llama.cpp: C++ library for running LLMs on consumer hardware including CPUs and Apple Silicon. Enables local LLM inference.
Ollama: Tool for running LLMs locally with simple model management. Wraps llama.cpp with user-friendly interface.
PagedAttention: Memory management technique for KV cache that uses paging (like OS virtual memory). Enables efficient memory utilization in vLLM.
SGLang: High-performance inference framework with RadixAttention for prefix sharing. Optimized for complex prompting patterns.
TGI (Text Generation Inference): Hugging Face’s inference server for LLMs. Supports tensor parallelism, continuous batching, and various model formats.
Triton Inference Server: NVIDIA’s model serving platform supporting multiple frameworks. Used for production deployment with GPU optimization.
vLLM: High-throughput inference engine featuring PagedAttention and continuous batching. De facto standard for self-hosted LLM serving.
Tool Use & Agents
Agents are Junior Engineers with Perfect Memory and No Judgment: Mental model from Ch 8 for deciding when to give agents autonomy: trust them with rote, well-specified work; supervise anything creative, irreversible, or cross-cutting.
Anthropic Tool Use: Claude’s native function calling capability. Uses structured tool definitions and returns structured responses.
Computer Use: Capability for AI to interact with computer interfaces (mouse, keyboard, screenshots). Enables automation of desktop tasks.
LangChain: Framework for building LLM applications with chains, agents, and tool use. Popular but can add complexity.
LangGraph: Extension of LangChain for building stateful, multi-actor applications with graph-based workflows.
MCP (Model Context Protocol): Anthropic’s open protocol for connecting AI models to external tools and data sources. Standardizes tool integration.
MCP Server: Service that implements the Model Context Protocol to expose tools or data to AI models.
OpenAI Function Calling: OpenAI’s API for structured tool use. Model outputs JSON matching provided function schemas.
Tool Schema: JSON Schema definition describing a tool’s name, description, and parameters. Used by LLMs to understand available tools.
Evaluation & Safety
Bias: Systematic errors in model outputs that reflect prejudices in training data or objectives. Includes demographic, cultural, and topic biases.
Capability Elicitation: Techniques for discovering what a model can do, especially latent or hidden capabilities.
Catastrophic Forgetting: Loss of previously learned knowledge when training on new data. Challenge for continual learning.
Deceptive Alignment: Theoretical concern where a model appears aligned during training but behaves differently when deployed.
Evals (Evaluations): Benchmarks and test suites for measuring model capabilities, safety, and alignment.
Frontier Model: The most capable AI models at any given time. Subject to enhanced safety requirements.
Model Card: Documentation describing a model’s intended use, capabilities, limitations, and ethical considerations.
Refusal: When a model declines to complete a request, typically due to safety training. Can be over-refusal or under-refusal.
Safety Training: Techniques to make models refuse harmful requests and behave according to guidelines.
Sandbagging: When a model intentionally performs worse than its capability to avoid detection or evaluation.
Prompt Engineering Techniques
Chain-of-Thought (CoT): Prompting technique that encourages step-by-step reasoning. Improves performance on complex tasks. See Chapter 6.
Few-Shot Prompting: Providing examples in the prompt to guide model behavior. The examples demonstrate the desired format and approach.
Prompt Injection Defense: Techniques to prevent malicious user input from overriding system instructions. Includes input validation, output filtering, and architectural defenses.
Prompts are Compiled, Not Written: Mental model from Ch 6 framing prompt engineering as iterative optimization passes (clarity, examples, constraints, format) over a draft rather than one-shot authorship.
Role Prompting: Assigning a specific role or persona to the model in the system prompt to shape responses.
Self-Consistency: Generating multiple reasoning paths and taking majority vote. Improves reliability on reasoning tasks.
Tree of Thoughts (ToT): Extension of chain-of-thought that explores multiple reasoning branches.
XML Tags: Using XML-style tags in prompts to structure input and output. Common pattern with Claude.
Zero-Shot CoT: Adding “Let’s think step by step” without examples. Simple technique that improves reasoning.
Data & Context Management
Context Compression: Techniques to fit more information in limited context windows. Includes summarization and selective retrieval.
Context Stuffing: Filling the context window with relevant information. Balance between more context and noise.
Long Context Models: Models with extended context windows (100K+ tokens). Enables processing entire documents.
Lost in the Middle: Phenomenon where models attend less to information in the middle of long contexts.
Needle in a Haystack: Evaluation testing retrieval from specific positions in long contexts.
Prompt Caching: Caching computed representations of prompt prefixes to avoid recomputation. Reduces latency and cost for repeated prefixes.
System Prompt: Initial instructions that define model behavior, persona, and constraints. Not visible to end users.
Acronyms
| Acronym | Full Form |
|---|---|
| ANN | Approximate Nearest Neighbor |
| API | Application Programming Interface |
| BERT | Bidirectional Encoder Representations from Transformers |
| BPE | Byte Pair Encoding |
| CAI | Constitutional AI |
| CDC | Change Data Capture |
| CoT | Chain-of-Thought |
| CUDA | Compute Unified Device Architecture |
| DPO | Direct Preference Optimization |
| ETL | Extract, Transform, Load |
| FLOPS | Floating-Point Operations Per Second |
| GGUF | GPT-Generated Unified Format |
| GPT | Generative Pre-trained Transformer |
| GPU | Graphics Processing Unit |
| HF | Hugging Face |
| KV | Key-Value |
| LLM | Large Language Model |
| LoRA | Low-Rank Adaptation |
| MCP | Model Context Protocol |
| MLOps | Machine Learning Operations |
| MLP | Multi-Layer Perceptron |
| MoE | Mixture of Experts |
| NLP | Natural Language Processing |
| OOD | Out-of-Distribution |
| ORPO | Odds Ratio Preference Optimization |
| PEFT | Parameter-Efficient Fine-Tuning |
| PPO | Proximal Policy Optimization |
| PSI | Population Stability Index |
| QA | Question Answering |
| QLoRA | Quantized Low-Rank Adaptation |
| RAG | Retrieval-Augmented Generation |
| RLHF | Reinforcement Learning from Human Feedback |
| ROI | Return on Investment |
| SDK | Software Development Kit |
| SFT | Supervised Fine-Tuning |
| SLA | Service Level Agreement |
| SLI | Service Level Indicator |
| SLO | Service Level Objective |
| SRE | Site Reliability Engineering |
| TCO | Total Cost of Ownership |
| TGI | Text Generation Inference |
| ToT | Tree of Thoughts |
| TPU | Tensor Processing Unit |
| TTL | Time to Live |
| YAML | YAML Ain’t Markup Language |
Additional Technical Terms
Ablation Study: Systematic removal of components to understand their contribution. Essential for understanding what works in ML systems.
Active Learning: Training strategy that selectively requests labels for the most informative examples. Reduces labeling cost.
Alignment: Ensuring AI systems behave in accordance with human values and intentions. Major research area for LLMs.
API Gateway: Service that handles API routing, authentication, rate limiting, and monitoring. Entry point for inference requests.
Attention Head: One of multiple parallel attention operations in multi-head attention. Different heads can capture different relationships.
Benchmark: Standardized dataset and evaluation protocol for comparing model performance.
Cold Start: Initial period when a system has insufficient data for personalization. Challenge for recommendation systems.
Constitutional AI: Training approach that uses a set of principles to guide model behavior.
Distillation: Training a smaller model to mimic a larger model’s behavior. Creates efficient models with similar capabilities.
Early Stopping: Halting training when validation performance stops improving. Prevents overfitting.
Elicitation: Techniques for extracting specific capabilities or behaviors from a trained model.
Ensemble: Combining predictions from multiple models. Often improves accuracy and robustness.
Explainability: Ability to understand why a model made a particular prediction.
Feature Importance: Measure of how much each input feature contributes to model predictions.
Generalization: Model’s ability to perform well on unseen data. The core goal of machine learning.
Gradient Clipping: Limiting gradient magnitude to prevent exploding gradients during training.
Hyperparameter: Configuration that controls training but isn’t learned (learning rate, batch size, architecture choices).
Idempotent: Operation that produces the same result regardless of how many times it’s executed.
Latent Space: Hidden representation space where models encode information about inputs.
Leakage: When information from the test set or future data inadvertently influences training.
Memorization: Model learning to reproduce specific training examples rather than general patterns.
Normalization: Scaling data to a standard range. Improves training stability.
Out-of-Distribution (OOD): Data that differs significantly from training distribution. Models often fail on OOD inputs.
Prompt Caching: Storing computed representations of common prompt prefixes to avoid recomputation.
Pruning: Removing unnecessary weights or connections from a neural network to reduce size and computation.
Reproducibility: Ability to obtain the same results when repeating an experiment. Essential for scientific validity.
Stochastic Gradient Descent (SGD): Optimization algorithm that estimates gradients from random subsets of data.
Structured Output: Model output constrained to follow a specific format (JSON, XML, etc.).
Supervised Learning: Learning from labeled examples where the correct answer is provided.
Unsupervised Learning: Learning patterns from data without explicit labels.
Validation Set: Data held out during training to evaluate model performance and tune hyperparameters.
Warm-up: Gradually increasing learning rate at the start of training. Improves stability.
Weight Initialization: Strategy for setting initial parameter values before training begins.