Chapter 3: ML Foundations for AI Engineers

Keywords

machine learning, neural networks, backpropagation, loss functions, optimization, overfitting, embeddings, GPU

Introduction

In 2021, a software engineer at a fintech startup was debugging a strange issue. Their fraud detection model, which had performed excellently in testing, was flagging legitimate transactions from a new payment processor at alarming rates. The engineer knew Python, understood APIs, and could deploy services to production. But they were stuck because they didn’t understand why the model behaved this way. Was it a bug? A data problem? Expected behavior?

The answer turned out to be distribution shift—the new payment processor’s transactions looked different from the training data in ways the model couldn’t handle. Understanding this required knowing what “training data” actually means, how models learn patterns, and why they fail when those patterns break. Once the engineer understood these fundamentals, the fix became obvious: retrain with examples from the new processor.

This story captures why AI engineers need to understand machine learning. Not to become ML researchers or train models from scratch, but to debug production issues, make informed decisions about model selection, and communicate effectively with ML teams. You don’t need to derive backpropagation—but you do need to understand what it accomplishes. You don’t need to implement a transformer—but you need intuition for why attention matters.

This chapter provides the 80/20 of ML knowledge for AI engineering: the 20% of concepts that explain 80% of what you’ll encounter working with LLMs. We’ll build mental models, not mathematical proofs. By the end, you’ll have working intuition for how these systems learn, why they fail, and what levers you can pull to improve them.

Skip This Chapter If…

You can skip this chapter if you already:

Understand overfitting, underfitting, and generalization
Know what embeddings are and why semantic similarity works
Can explain why LLMs hallucinate (training objective vs. truth)
Understand the train/validation/test split and why it matters
Know the difference between classification, regression, and generation

Skim instead if you have ML experience but haven’t worked with LLMs. Focus on the “LLM Training” and “Embeddings” sections for LLM-specific intuition.

Read carefully if you’re a software engineer with no ML background, or if you’ve used ML APIs but don’t understand what’s happening inside.

What You Can Skip (For Now)

Before diving in, let’s be explicit about scope. You don’t need to understand:

Detailed calculus of gradient descent derivations
The mathematics of specific architectures (ResNets, LSTMs)
Hyperparameter tuning strategies
Distributed training systems
Custom loss function design

These matter for ML engineers building models. As an AI engineer using models, you need different knowledge.

What You’ll Learn

In this chapter, you’ll understand:

Why models generalize (or fail to)
What embeddings represent and why they enable semantic search
How LLMs are trained and what that implies for their behavior
Why evaluation is hard and what to do about it
The relationship between data quality and model quality

These concepts appear constantly when debugging, evaluating, and deploying AI systems.

Mental Models That Help

Throughout this chapter, we’ll build several mental models:

Models as pattern extractors: They find statistical regularities in data, nothing more
Training as compression: The model learns to compress training data into parameters
Inference as pattern matching: The model retrieves and combines patterns seen during training
Failure as distribution mismatch: Most failures come from inputs that differ from training data

Keep these in mind as we explore each topic.

Core ML Concepts

The Learning Problem

Machine learning solves a specific type of problem: given examples of inputs and desired outputs, find a function that maps new inputs to correct outputs.

Consider spam detection. You have thousands of emails labeled “spam” or “not spam.” You want a function that, given a new email, predicts the correct label. You could write rules manually (“if contains ‘Nigerian prince’, then spam”), but this doesn’t scale—spammers adapt, edge cases multiply, and rules conflict.

Machine learning instead learns the mapping from data. You provide examples; the algorithm finds patterns; the result is a model that generalizes to new inputs.

This is powerful because it scales. Instead of encoding human knowledge as rules, you encode it as labeled examples. The algorithm handles pattern extraction.

Supervised vs. Unsupervised Learning

Supervised learning uses labeled examples: input-output pairs where you know the correct answer.

Input: "Congratulations! You've won $1M!"  → Output: spam
Input: "Meeting moved to 3pm tomorrow"      → Output: not spam
Input: "Your account statement is ready"    → Output: not spam

The model learns to predict outputs from inputs. Classification (discrete labels) and regression (continuous values) are both supervised tasks.

Unsupervised learning finds patterns without labels. Given only inputs, discover structure.

Inputs: [email1, email2, email3, ..., email1000]
Goal: Find natural groupings, common topics, or anomalies

Clustering, dimensionality reduction, and anomaly detection are unsupervised tasks. The model finds patterns but doesn’t know what they mean—a human must interpret.

Where LLMs fit: LLM pretraining is self-supervised, a middle ground. The model predicts the next word (more precisely, the next token) given the previous ones. The “label” comes from the text itself, not human annotation. This allows training on vast unlabeled text.

Input: "The capital of France is"  → Label: "Paris"
Input: "She opened the door and"   → Label: "walked"

The labels are free—they’re just the next word in existing text. This is why LLMs can train on trillions of tokens.
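A minimal sketch of how such training pairs fall out of raw text. Splitting on whitespace is an assumption made for readability (real LLMs split text into subword tokens), and the helper name `next_word_pairs` is hypothetical.

```python
def next_word_pairs(text):
    """Turn raw text into (context, next_word) training pairs."""
    # Assumption: a whitespace split stands in for a real tokenizer.
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])
        pairs.append((context, words[i]))
    return pairs


pairs = next_word_pairs("The capital of France is Paris")
for context, label in pairs:
    print(f"{context!r} -> {label!r}")
```

One six-word sentence yields five labeled examples with no human annotation, which is the property that lets pretraining scale.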

Training, Validation, and Test Sets

This is one of the most important concepts in ML, and getting it wrong invalidates everything.

Training set: Data the model learns from. The model sees these examples repeatedly and adjusts to predict them correctly.

Validation set: Data for tuning decisions during development. Used to compare models, select hyperparameters, and decide when to stop training. The model never trains on this data, but you make decisions based on it.

Test set: Data for final evaluation only. Touched once, at the very end, to estimate real-world performance. If you tune based on test performance, you’re cheating—your estimate becomes optimistic.

Why this matters: The goal isn’t to predict training data (you already know those answers). The goal is to predict new data you haven’t seen. Separate sets let you estimate how well the model will perform on new data.

Total Data: 10,000 examples
├── Training:   7,000 (70%) - Model learns from these
├── Validation: 1,500 (15%) - Tune hyperparameters, compare approaches
└── Test:       1,500 (15%) - Final evaluation only
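The 70/15/15 split above can be sketched in a few lines. This is one reasonable way to do it, not the only one; the function name `split_dataset` and the fixed seed are assumptions for the example.

```python
import random


def split_dataset(examples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test


train, val, test = split_dataset(range(10_000))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters: if the data is sorted (by date, by label), an unshuffled split gives the three sets different distributions. For time-dependent data a chronological split is often the better choice.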

A common mistake: Tuning on test data. If you check test performance, change something, check again, and iterate, you are fitting to the test set, and your final accuracy will be optimistic. This is a form of data leakage: information about the test set has leaked into your modeling decisions.

For LLMs: This matters in fine-tuning and evaluation. If your evaluation examples overlap with training data, the model might be “remembering” rather than “reasoning.” Many LLM benchmarks are now contaminated—models may have seen test questions during pretraining.

Overfitting and Generalization

Overfitting is memorizing the training data instead of learning general patterns.

Imagine memorizing a Spanish phrasebook: “¿Dónde está la biblioteca?” means “Where is the library?” If someone asks “¿Dónde está el banco?”, you’re stuck. You memorized specific phrases, not the underlying pattern (“dónde está” = “where is”).

A model that overfits does the same thing. It achieves perfect training accuracy but fails on new examples because it memorized instead of learned.

Generalization is the opposite: learning patterns that transfer to new data.

Signs of overfitting:

Training accuracy high, validation accuracy much lower
Small changes to input cause large changes in output
Model is very confident even on unusual inputs

Why overfitting happens:

1. Too little data: Not enough examples to distinguish patterns from noise
2. Model too complex: The model has enough capacity to memorize everything
3. Training too long: The model eventually memorizes after learning general patterns

How to prevent overfitting:

1. More data: More examples make memorization harder
2. Simpler model: Less capacity means less memorization
3. Regularization: Techniques that penalize complexity
4. Early stopping: Stop training when validation performance plateaus

The bias-variance tradeoff: Simple models may underfit (miss real patterns). Complex models may overfit (learn false patterns). The sweet spot is a model complex enough to capture real patterns but not so complex it memorizes noise.

For AI engineers, the key insight is: good training performance doesn’t mean good real-world performance. Always evaluate on held-out data that resembles your production distribution.
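To make the gap between training and held-out performance concrete, here is a toy sketch. Both classes (`Memorizer`, `ThresholdModel`) and the data are invented for illustration: the "real" pattern is simply that the label is true when the input is at least 50.

```python
def make_data(xs):
    # The underlying pattern to learn: label is True when x >= 50.
    return [(x, x >= 50) for x in xs]


train = make_data(range(0, 100, 2))  # even numbers
val = make_data(range(1, 100, 2))    # odd numbers, never seen in training


class Memorizer:
    """Stores every training example; has no idea what to do with new inputs."""
    def fit(self, data):
        self.table = dict(data)
        return self

    def predict(self, x):
        return self.table.get(x, False)  # unseen input: fall back to a fixed guess


class ThresholdModel:
    """Learns a single cut-off value, i.e. the general pattern."""
    def fit(self, data):
        def errors(t):
            return sum((x >= t) != y for x, y in data)
        self.threshold = min(sorted(x for x, _ in data), key=errors)
        return self

    def predict(self, x):
        return x >= self.threshold


def accuracy(model, data):
    return sum(model.predict(x) == y for x, y in data) / len(data)


for model in (Memorizer().fit(train), ThresholdModel().fit(train)):
    print(type(model).__name__, accuracy(model, train), accuracy(model, val))
```

Both models are perfect on the training set. Only the validation score separates the one that memorized from the one that generalized, which is why training accuracy alone tells you very little.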

Loss Functions and Optimization (Intuition)

A loss function measures how wrong the model is. Lower loss means better predictions.

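For spam classification, the simplest loss to picture is a count of misclassifications. If the model predicts spam for 10 legitimate emails and not-spam for 5 spam emails, the loss is 15. (In practice, training uses a smooth stand-in such as cross-entropy, because a raw error count does not change smoothly as the parameters change.)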

For next-word prediction in LLMs, the loss is how “surprised” the model is by the actual next word. If the model assigns high probability to the correct word, loss is low. If it assigns low probability (didn’t expect that word), loss is high.
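The "surprise" described above is usually measured with cross-entropy: the negative logarithm of the probability the model gave to the token that actually appeared. A minimal sketch for a single prediction (the function name `surprise` is ours):

```python
import math


def surprise(prob_of_actual_token):
    """Cross-entropy loss for one prediction: the negative log of the
    probability assigned to the token that actually came next."""
    return -math.log(prob_of_actual_token)


print(surprise(0.9))   # model expected this token: small loss
print(surprise(0.01))  # model did not expect it: large loss
```

The training loss reported for an LLM is this quantity averaged over every token position in a batch.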

Optimization is the process of adjusting model parameters to reduce loss. Think of it as navigating a landscape where elevation represents loss. You’re trying to find the lowest point (minimum loss).

Gradient descent is the standard approach:

1. Compute which direction increases loss fastest (the gradient)
2. Step in the opposite direction (downhill)
3. Repeat until loss stops decreasing

You don’t need to understand the calculus. The key insight is that training repeatedly adjusts parameters to reduce prediction errors on training data. Each parameter gets nudged in a direction that (locally) improves predictions.

Learning rate controls step size. Too large: overshoot the minimum, oscillate, or diverge. Too small: progress is painfully slow. Finding good learning rates is a key part of training.
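The effect of step size is easy to see on a one-parameter problem. The sketch below minimizes a made-up loss, (w - 3)², whose minimum is known to be at w = 3; the three learning rates are chosen only to show the three regimes.

```python
def gradient_descent(learning_rate, steps=100, start=0.0):
    """Minimize loss(w) = (w - 3)**2, whose minimum is at w = 3."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)            # derivative of the loss at w
        w = w - learning_rate * gradient  # step in the downhill direction
    return w


for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: w after 100 steps = {gradient_descent(lr):.4g}")
```

With 0.001 the parameter has barely moved after 100 steps, with 0.1 it lands on the minimum, and with 1.1 each step overshoots by more than the last, so the value diverges.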

The Training Process at a High Level

Putting it together, here’s what happens when you train a model:

1. Initialize parameters randomly
2. Repeat for many epochs:
   a. For each batch of training examples:
      - Forward pass: compute predictions from inputs
      - Compute loss: how wrong were the predictions?
      - Backward pass: compute how each parameter contributed to the loss
      - Update: adjust parameters to reduce loss
   b. Check validation performance
   c. If validation performance plateaus or worsens, stop
3. Evaluate on test set to estimate real-world performance

Epochs and batches: An epoch is one pass through all training data. Training typically runs for many epochs. Within each epoch, data is processed in batches (e.g., 32 examples at a time) for efficiency.

What the model learns: Through thousands of parameter updates, the model discovers patterns that predict well on training data. If the data is representative of the real world, these patterns generalize.
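The loop described in this section can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the data follows y = 2x + 1 exactly, the "model" is a single weight and bias, parameters start at zero rather than random values so the run is repeatable, and the final test-set step is omitted for brevity.

```python
import random

# Toy data that follows y = 2x + 1 exactly, so the right answer is known.
data = [(i / 100, 2 * (i / 100) + 1) for i in range(100)]
random.Random(0).shuffle(data)
train, val = data[:80], data[80:]


def mse(w, b, examples):
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


w, b = 0.0, 0.0  # 1. initialize (zeros here, for repeatability)
learning_rate, batch_size, patience = 0.1, 8, 5
best_val, epochs_without_improvement = float("inf"), 0
rng = random.Random(1)

for epoch in range(500):  # 2. repeat for many epochs
    rng.shuffle(train)
    for start in range(0, len(train), batch_size):  # 2a. one batch at a time
        batch = train[start:start + batch_size]
        # Forward pass: prediction error for each example in the batch
        errors = [(w * x + b - y, x) for x, y in batch]
        # Backward pass: gradient of the mean squared error
        grad_w = sum(2 * e * x for e, x in errors) / len(batch)
        grad_b = sum(2 * e for e, _ in errors) / len(batch)
        # Update: nudge parameters against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    val_loss = mse(w, b, val)  # 2b. check validation performance
    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:  # 2c. early stopping
        break

print(f"stopped after epoch {epoch}: w={w:.3f}, b={b:.3f}, val_loss={val_loss:.2e}")
```

Frameworks such as PyTorch automate the gradient computation, but the loop has the same shape: batches inside epochs, a validation check after each epoch, and a stopping rule.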

Neural Networks Intuition

Layers and Transformations

Neural networks are stacked layers that transform input into output. Each layer takes vectors in, applies learned weights, and passes transformed vectors to the next layer.

Input "The cat sat on the"
  → Embedding Layer (convert words to vectors)
  → Hidden Layers (transform representations)
  → Output Layer (probability distribution over vocabulary)
  → Prediction: "mat" (highest probability)

The forward pass flows input through all layers to produce output. The backward pass (backpropagation) traces backward through the network after computing loss, determining how each weight contributed to the error. Weights are then adjusted to reduce error. This cycle repeats billions of times during training.
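A forward pass is nothing more than repeated "multiply by weights, add bias, apply a nonlinearity". The sketch below uses hand-picked, hypothetical weights and a two-word vocabulary; a trained network would have learned its weights, and a real LLM has billions of them.

```python
import math


def linear(vector, weights, biases):
    """One layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * v for w, v in zip(row, vector)) + bias
            for row, bias in zip(weights, biases)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights, invented for this example.
hidden_w, hidden_b = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, 0.0]
output_w, output_b = [[1.0, 0.5, -1.0], [-0.5, 1.2, 0.3]], [0.0, 0.0]
vocab = ["dog", "mat"]

x = [1.0, 2.0]                                        # stand-in for an embedded input
hidden = relu(linear(x, hidden_w, hidden_b))          # hidden layer
probs = softmax(linear(hidden, output_w, output_b))   # output layer
prediction = vocab[probs.index(max(probs))]
print(dict(zip(vocab, probs)), "->", prediction)
```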

Why Depth Matters

A “deep” network has many layers. Depth enables hierarchical feature learning: each layer builds on the previous one, learning progressively abstract features.

In language models:

Early layers: syntax, word relationships
Middle layers: semantic meaning, entity relationships
Later layers: discourse structure, reasoning patterns

Shallow networks must learn everything in one step; deep networks build complexity gradually. This is why modern LLMs have dozens to over a hundred layers.

The trade-off: Deeper networks are harder to train. Gradients can vanish or explode as they propagate through many layers. Modern architectures use techniques like residual connections and layer normalization to mitigate this. Chapter 5 (LLM Foundations) covers transformer architecture in depth.

The Role of Data in Learning

Models learn from data. The quality and quantity of data fundamentally limits what can be learned.

Data quantity: More data generally helps. It provides more examples of each pattern and reduces overfitting. This is why LLMs train on trillions of tokens—the scale enables learning rare patterns.

Data quality: Garbage in, garbage out. If training data contains errors, biases, or irrelevant information, the model learns these too.

Common data quality issues:

Mislabeled examples: Errors in labels corrupt learning
Selection bias: Training data isn’t representative of deployment
Class imbalance: Some categories have many more examples than others
Noisy features: Inputs contain irrelevant or misleading information

Data distribution: The model learns patterns in the training distribution. If deployment data has different patterns, performance degrades.

The fundamental insight: A model can only be as good as its data. No amount of architectural cleverness overcomes fundamentally flawed training data. For LLMs, the training corpus determines what the model “knows” and how it reasons.

Understanding LLMs as ML Systems

How LLMs Are Trained

LLM training typically proceeds in three phases, each with different data and objectives.

Phase 1: Pretraining

The model learns to predict the next token from vast amounts of text.

Training objective: Given previous tokens, predict the next token
Data: Trillions of tokens from the internet, books, code, etc.
Duration: Weeks to months on thousands of GPUs
Result: A model with broad knowledge and language fluency

Pretraining creates a general-purpose language model. It learns grammar, facts, reasoning patterns, and styles—everything implicit in predicting text well. But it doesn’t know how to be helpful or follow instructions.

Phase 2: Supervised Fine-tuning (SFT)

The model learns to follow instructions and be helpful.

Training objective: Given an instruction, generate a good response
Data: Thousands of high-quality instruction-response pairs
Duration: Hours to days
Result: A model that follows instructions and generates helpful responses

SFT examples look like:

Instruction: "Explain photosynthesis in simple terms"
Response: "Photosynthesis is how plants make food from sunlight..."

Instruction: "Write a Python function to sort a list"
Response: "def sort_list(items):\n    return sorted(items)"

Phase 3: RLHF (Reinforcement Learning from Human Feedback)

The model learns to generate responses that humans prefer.

Training objective: Generate responses that maximize human preference
Data: Human rankings of model outputs (A better than B)
Duration: Days to weeks
Result: A model that's more helpful, harmless, and honest

RLHF addresses behaviors that are hard to demonstrate: being appropriately uncertain, avoiding harmful content, staying on topic. Humans rank outputs, and the model learns to generate higher-ranked responses.

Why this multi-phase approach?

Pretraining provides knowledge and fluency (requires massive data)
SFT teaches the format of helpfulness (requires quality instruction data)
RLHF refines behavior based on human values (requires preference data)

Each phase targets different aspects of capability.

What the Model “Knows” vs. What It Can Do

A crucial distinction: knowledge vs. capabilities.

Knowledge: Facts, patterns, and information encoded in model weights during pretraining.

The model “knows” that:

Paris is the capital of France
Python uses indentation for blocks
Shakespeare wrote in iambic pentameter
Water freezes at 0 °C

Capabilities: What the model can do with that knowledge.

The model can:

Answer questions about its knowledge
Generate text in various styles
Translate between languages
Write code that compiles
Reason about problems (to some extent)

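The important nuance: Knowledge is frozen at training time. A conversation does not update the model’s weights; the model can use facts you place in the prompt, but only while they remain in its context. Beyond that, what it “knows” depends entirely on what was in training data.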

But capabilities can be elicited through prompting. A model might “know” something but not surface it without the right prompt. This is why prompt engineering matters—you’re not teaching the model new information, you’re helping it access and apply what it already knows.

Implications for AI engineers:

Don’t expect the model to know recent events (after training cutoff)
Don’t expect the model to know your private data (unless fine-tuned on it)
Do expect the model to have knowledge gaps even before the cutoff
Focus on eliciting capabilities through good prompts

Why LLMs Hallucinate

Hallucination—generating plausible-sounding but false information—is intrinsic to how LLMs work.

Cause 1: Training objective

The training objective is next-token prediction, not truth. The model learns to generate plausible continuations, and false statements can be perfectly plausible.

"The first person to walk on the moon was"
Plausible continuation: "Neil Armstrong" (true)
Also plausible: "Buzz Aldrin" (false but reasonable)

The model isn’t trained to verify facts—it’s trained to generate likely text.

Cause 2: Compression loss

The model compresses trillions of tokens into billions of parameters. This is lossy compression. Rare facts get blurred, similar facts get merged, and the model may confabulate to fill gaps.

If the training data mentioned “John Smith won the 1987 prize” once, the model might remember “someone named John won a prize in the 1980s” and hallucinate details when asked.

Cause 3: Distribution mismatch

When asked about topics underrepresented in training data, the model extrapolates from related patterns. These extrapolations may be wrong.

Cause 4: Confident training signal

Training data is mostly confident, declarative text. The model learns to be confident. It doesn’t learn to say “I don’t know” because training data rarely does.

Implications for AI engineers:

Never trust model outputs for critical facts without verification
Use RAG to ground responses in verified sources
Design systems that make confidence explicit
Expect hallucination and build guardrails

Temperature, Top-p, and Sampling

When generating text, the model produces probabilities for each possible next token. How we select from these probabilities affects output characteristics.

Temperature controls randomness:

Temperature = 0: Always pick the highest-probability token (deterministic)
Temperature = 1: Sample according to probabilities (as trained)
Temperature > 1: Flatten probabilities (more random)
Temperature < 1: Sharpen probabilities (less random)

# Pseudocode for temperature scaling (assumes temperature > 0; temperature = 0 is handled separately as argmax)
probabilities = softmax(logits / temperature)
next_token = sample(probabilities)

Low temperature → more focused, repetitive, “safer” outputs
High temperature → more creative, diverse, potentially incoherent outputs
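A runnable version of temperature scaling, as a sketch. The logits are hypothetical scores for three candidate tokens, and treating temperature 0 as greedy decoding is a common convention assumed here, since dividing by zero is undefined.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; requires temperature > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0:  # assumed convention: 0 means greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

The ranking of tokens never changes with temperature; only the gap between them does. That is why low temperatures concentrate probability on the top choice and high temperatures spread it out.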

Top-p (nucleus sampling) limits the sampling pool:

Instead of considering all tokens, only consider the smallest set whose cumulative probability exceeds p.

If p = 0.9:

- Token A: 0.5
- Token B: 0.3
- Token C: 0.15  ← Cumulative = 0.95 > 0.9, stop here
- Token D: 0.03  ← Not considered
- Token E: 0.02  ← Not considered

Sample only from {A, B, C}

This removes the “long tail” of unlikely tokens that might introduce errors or non-sequiturs.

Top-k is simpler: only consider the k most probable tokens.
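Both filters can be sketched in a few lines, using the five-token distribution from the example above. The function names are ours, and renormalizing the surviving probabilities before sampling is the usual practice assumed here.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 again.
    return {token: prob / cumulative for token, prob in kept.items()}


def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}


probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.03, "E": 0.02}
print(top_p_filter(probs, 0.9))
print(top_k_filter(probs, 2))
```

The practical difference: top-k always keeps the same number of tokens, while top-p keeps few tokens when the model is confident and many when it is not.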

Practical guidelines:

Factual tasks: Low temperature (0.0-0.3), let the model be confident
Creative tasks: Higher temperature (0.7-1.0), allow diversity
Code generation: Low temperature (0.0-0.2), correctness matters
Top-p around 0.9-0.95 is often good
Temperature and top-p interact; tune together

The deeper insight: These parameters don’t change what the model knows. They change how it samples from its probability distribution. The same model with different parameters produces different outputs, but the underlying patterns are identical.

Embeddings and Vector Spaces

What Embeddings Represent

An embedding is a vector (list of numbers) that represents something—a word, sentence, document, image, or any concept.

# A word embedding might look like:
"king" → [0.2, -0.5, 0.1, 0.8, ..., -0.3]  # 768 numbers
"queen" → [0.3, -0.4, 0.0, 0.9, ..., -0.2]  # 768 numbers

But what do these numbers mean?

Each dimension captures some aspect of meaning. Through training, the model learns to assign similar vectors to related concepts. The dimensions don’t have clean interpretations like “royalty” or “gender”—they’re learned statistical features.

Analogy: Think of describing movies with ratings on various qualities: action (7/10), romance (2/10), comedy (4/10), etc. Two movies with similar ratings are similar in style. Embeddings work the same way, but with hundreds of dimensions that are learned rather than defined.

What makes a good embedding?

A good embedding captures the aspects of meaning relevant to your task:

For semantic search: similar concepts should be close
For sentiment analysis: positive and negative should be far apart
For analogy tasks: relationships should be parallel (king-queen ≈ man-woman)

Different embedding models capture different aspects of similarity.

Why Similar Things Are Close Together

The magic of embeddings: geometry reflects semantics.

During training, embedding models see millions of examples of related text. Through this exposure, they learn to place related items nearby in vector space.

Contrastive learning: Many embedding models train on pairs labeled as similar or dissimilar, pulling the former together and pushing the latter apart:

Place “how to return an item” near “refund policy”
Move “how to return an item” away from “weather forecast”

After millions of such adjustments, the space organizes semantically.

Analogy arithmetic: Famous result from Word2Vec:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

The direction from “man” to “woman” is parallel to the direction from “king” to “queen.” The embedding space encodes relationships geometrically.
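The arithmetic can be demonstrated with hand-built vectors. These six 2-D vectors are invented for illustration only: real embeddings are learned, have hundreds of dimensions, and no dimension carries a clean label like the toy axes used here.

```python
# Toy axes (royalty, masculinity), chosen by hand so the geometry is visible.
vectors = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "man":    [0.1, 0.9],
    "woman":  [0.1, 0.1],
    "prince": [0.7, 0.9],
    "apple":  [0.0, 0.5],
}


def nearest(target, exclude):
    """Return the word whose vector is closest to target, skipping excluded words."""
    def squared_distance(word):
        return sum((a - b) ** 2 for a, b in zip(vectors[word], target))
    return min((w for w in vectors if w not in exclude), key=squared_distance)


# king - man + woman, computed one dimension at a time
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))
```

Excluding the three input words from the search is standard in analogy evaluation; without it, the nearest vector is often one of the inputs.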

Why this matters for AI engineers: This is why vector search works for RAG. You can find documents about “termination policy” by searching near “how to cancel”—they’re semantically related, so they’re geometrically close.

Dimensionality and the Curse of Dimensionality

Embeddings typically have 384 to 4096 dimensions. Why so many?

More dimensions = more capacity: Each dimension can encode different aspects of meaning. Low-dimensional embeddings can’t capture the full richness of language.

But high dimensions are weird: Our geometric intuitions from 2D and 3D don’t hold.

The curse of dimensionality:

1. Distance concentration: In high dimensions, all points become roughly equidistant. The difference between nearest and farthest neighbors shrinks.
2. Sparse space: The volume of high-dimensional space is vast. Even billions of points barely cover it.
3. Corners dominate: In high dimensions, most volume is near the corners/edges of any bounded region.

Practical implications:

Similarity scores cluster in a narrow range—don’t expect a clear gap between relevant and irrelevant
Threshold selection is tricky; what counts as “similar enough” isn’t obvious
More data is needed to cover higher-dimensional spaces
Dimensionality reduction (before storage/search) can help but loses information

For AI engineers: Be skeptical of raw similarity scores. A cosine similarity of 0.7 isn’t obviously “good” or “bad”—it depends on the embedding model and your data distribution. Always evaluate on held-out examples.
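Distance concentration can be observed directly. The sketch below draws uniformly random points, which is an assumption: real embeddings are structured rather than random, so the effect in practice is milder, but the trend is the same. The measure printed is the gap between the farthest and nearest neighbor, relative to the nearest.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.2f}")
```

As the dimension grows, the contrast shrinks toward zero: the nearest neighbor is only slightly nearer than the farthest one, which is why raw similarity scores sit in a narrow band.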

How Embeddings Enable Semantic Search

Traditional keyword search matches documents containing query words. Semantic search matches documents with related meaning, regardless of word choice. A query for “how do I get my money back” can find documents about “refunds” and “returns” even without word overlap—because these concepts are geometrically close in embedding space.
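At its core, semantic search is "embed the query, then rank documents by cosine similarity". In this sketch the 3-D vectors are invented by hand; in a real system they would come from an embedding model, and the same model must embed both queries and documents.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
documents = {
    "Refund policy": [0.9, 0.1, 0.0],
    "Returns and exchanges": [0.8, 0.3, 0.1],
    "Weather forecast": [0.0, 0.2, 0.9],
}
query_text = "how do I get my money back"
query_vector = [0.85, 0.2, 0.05]  # pretend output of the same embedding model

ranked = sorted(
    documents,
    key=lambda title: cosine_similarity(query_vector, documents[title]),
    reverse=True,
)
for title in ranked:
    print(f"{cosine_similarity(query_vector, documents[title]):.3f}  {title}")
```

Note that the query shares no words with "Refund policy", yet it ranks near the top. A linear scan like this is fine for a few thousand documents; vector databases exist to make the same ranking fast at larger scale.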

This is the foundation of Retrieval-Augmented Generation (RAG). Chapter 7 (RAG Systems) covers the full implementation: embedding strategies, vector databases, retrieval optimization, and reranking.

Evaluation Fundamentals

Metrics Basics

Evaluation requires metrics—numbers that quantify how well the model performs.

For classification (predicting categories):

Accuracy: Fraction of predictions that are correct.

accuracy = correct_predictions / total_predictions

Problem: Misleading for imbalanced data. If 99% of emails are not spam, a model that always predicts “not spam” achieves 99% accuracy while catching zero spam.

Precision: Of positive predictions, how many were actually positive?

precision = true_positives / (true_positives + false_positives)

High precision = few false alarms. If the spam filter says it’s spam, it probably is.

Recall: Of actual positives, how many did we catch?

recall = true_positives / (true_positives + false_negatives)

High recall = few misses. If an email is spam, the filter probably catches it.

F1 Score: Harmonic mean of precision and recall.

F1 = 2 * (precision * recall) / (precision + recall)

Balances both concerns.

The precision-recall tradeoff: You can usually increase one by sacrificing the other. Aggressive spam filtering (high recall) means more false positives (low precision). Conservative filtering (high precision) means more spam gets through (low recall).

Confusion matrix: Shows all four outcomes:

                    Actual Spam    Actual Not-Spam
Predicted Spam      True Positive  False Positive
Predicted Not-Spam  False Negative True Negative

This reveals where the model fails, not just how often.
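All four metrics fall out of the confusion-matrix counts. The sketch below computes them from two lists of labels and then reproduces the imbalanced-data trap described earlier. Reporting 0.0 when a denominator is zero is a convention assumed here; libraries differ on this.

```python
def classification_metrics(actual, predicted, positive="spam"):
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
    }


# 1,000 emails, 1% spam, and a "model" that always answers "not spam".
actual = ["spam"] * 10 + ["not spam"] * 990
always_not_spam = ["not spam"] * 1000
print(classification_metrics(actual, always_not_spam))
```

Accuracy looks excellent while recall and F1 are zero, which is exactly the failure that accuracy alone hides.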

Why LLM Evaluation Is Different

Traditional ML metrics assume:

Clear ground truth (email is spam or isn’t)
One correct answer
Automatic comparison (predicted label vs. true label)

LLM outputs violate all of these:

Many valid answers to “explain photosynthesis”
Quality is subjective (is response helpful? clear? accurate?)
No simple way to compare generated text to a reference

What makes LLM evaluation hard:

Open-ended outputs: “Write a poem about autumn” has infinite valid responses. Which is better? By what criteria?
Multi-dimensional quality: A response can be accurate but unclear, or helpful but verbose. Single metrics miss nuance.
Context dependence: A good response for an expert differs from one for a beginner. Quality depends on the user and situation.
Hallucination is hard to detect: The model confidently states false things. Verifying requires external knowledge.

Evaluation approaches for LLMs:

Benchmark datasets: Standardized tasks with known answers (MMLU, HellaSwag). Useful but limited to what can be auto-graded.
Human evaluation: Have people rate outputs on various criteria. Gold standard but expensive and slow.
LLM-as-judge: Use another LLM to evaluate outputs. Scalable but inherits model biases.
Reference-based metrics: Compare to reference outputs (BLEU, ROUGE). Problematic because valid outputs differ from references.

The Role of Human Judgment

Ultimately, LLM quality is about human satisfaction. Does the output help the user? Is it trustworthy? Would a human expert approve?

When to use human evaluation:

Evaluating subjective qualities (helpfulness, clarity, tone)
Validating LLM-as-judge correlation
High-stakes decisions where errors are costly
Understanding failure modes qualitatively

Designing human evaluation:

Clear rubrics: Define what each score means
Calibration: Have raters practice on examples with known scores
Multiple raters: Check for agreement, investigate disagreement
Blind evaluation: Raters shouldn’t know which model produced each output

LLM-as-judge correlation: A useful approach is to have humans evaluate a sample, then check if an LLM judge agrees. If correlation is high, use the LLM for scale. If low, the LLM judge isn’t valid for your task.
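A sketch of that check, using Pearson correlation on ratings of the same responses. The scores are invented, and the 0.8 threshold is an assumption; the right bar depends on how costly a wrong judgment is for your task. Rank correlation or an agreement statistic such as Cohen's kappa are reasonable alternatives.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 ratings for the same ten responses.
human_scores = [5, 4, 2, 1, 3, 5, 2, 4, 1, 3]
judge_scores = [5, 5, 2, 2, 3, 4, 1, 4, 1, 3]

r = pearson(human_scores, judge_scores)
THRESHOLD = 0.8  # assumption: choose a bar that fits your task's risk tolerance
verdict = "use judge at scale" if r >= THRESHOLD else "keep humans in the loop"
print(f"correlation={r:.2f} -> {verdict}")
```

Ten examples is far too few for a real decision; the sample should be large enough, and varied enough, to cover the input types you expect in production.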

Building Intuition for Model Quality

Beyond metrics, develop qualitative intuition for model quality.

Stress testing: Find inputs where the model might fail.

Edge cases (empty input, very long input)
Adversarial inputs (attempts to confuse or manipulate)
Out-of-distribution inputs (topics rarely seen in training)

Error analysis: Don’t just compute an error rate; examine the errors.

What types of inputs cause failures?
Are errors random or systematic?
What would each failure feel like to a user?

Comparative evaluation: Compare models side-by-side.

For the same inputs, which model gives better outputs?
Do models fail in the same places or different ones?
Is improvement consistent across input types?

User feedback loops: In production, collect user feedback.

Thumbs up/down on responses
User corrections and complaints
Task completion rates

Metrics tell you how much the model fails. Qualitative analysis tells you why and when—information you need to improve.

Putting It All Together

The ML Mindset for AI Engineers

You now have the conceptual foundation to reason about ML systems. Here’s how to apply it:

When debugging unexpected behavior:

Consider distribution shift: Is the input different from training data?
Consider overfitting: Did the model memorize instead of generalize?
Consider data quality: What was the model trained on?
Consider hallucination: Is the model generating plausible but false content?

When choosing models:

What was it trained on? Does that match your domain?
What was the training objective? Does that align with your needs?
What are the documented limitations?
How does it perform on your specific evaluation set?

When evaluating outputs:

Don’t trust single metrics; examine failures qualitatively
Don’t trust model confidence; it is often poorly calibrated
Don’t trust impressive demos; test on your data
Do establish baselines; how would a simple approach perform?

When working with ML teams:

Speak the language: training/validation/test, overfitting, distribution shift
Ask the right questions: What data? What objective? What failure modes?
Understand trade-offs: More data vs. cleaner data, speed vs. quality
Contribute valuable feedback: Real-world failure cases, user complaints

What We Didn’t Cover

This chapter gave you working intuition, not comprehensive knowledge. Topics we skipped that you might explore later:

Model architectures: Transformers, attention mechanisms, CNNs, RNNs. Understanding architecture helps you understand capabilities and limitations.

Training at scale: Distributed training, mixed precision, gradient accumulation. Matters if you fine-tune large models.

Advanced optimization: Adam, learning rate schedules, warmup. Matters for fine-tuning and understanding training dynamics.

Regularization techniques: Dropout, weight decay, data augmentation. Matters for preventing overfitting.

Interpretability: Understanding what models learn internally. Matters for debugging and trust.

These topics become relevant as you go deeper into AI engineering. The fundamentals in this chapter are prerequisite to all of them.

Summary

Machine learning is pattern extraction from data. Neural networks are flexible function approximators that learn these patterns through iterative parameter adjustment. LLMs are neural networks trained to predict text, and this training objective explains both their capabilities and limitations.

Key concepts for AI engineers:

Training/validation/test splits prevent you from fooling yourself about model quality
Overfitting is memorizing instead of learning; combat it with more data, simpler models, and regularization
Loss functions and optimization adjust parameters to reduce prediction errors
Deep networks learn hierarchical features, enabling complex pattern recognition
LLM training proceeds in phases: pretraining for knowledge, SFT for instruction-following, RLHF for alignment
Hallucination is intrinsic to how LLMs work—they generate plausible text, not verified facts
Temperature and sampling control how the model selects from its probability distribution
Embeddings represent meaning as vectors where geometry reflects semantics
Evaluation requires multiple metrics, human judgment, and qualitative analysis

With this foundation, you can debug production issues, make informed model selection decisions, and communicate effectively with ML teams. You won’t be training models from scratch, but you’ll understand the systems you’re building on.

Practical Exercises

Train/test split experiment: Take a classification dataset. Train a model on all the data and report accuracy on that same data. Then train on 80% and test on the held-out 20%. Compare the results. Intentionally overfit by training too long. What happens to train vs. test accuracy?
Temperature exploration: Using an LLM API, generate 10 responses to the same creative prompt at temperatures 0.0, 0.5, and 1.0. How does output diversity change? At what temperature do you see degraded quality?
Embedding investigation: Embed 100 sentences from your domain. Visualize with t-SNE or UMAP. Do semantically similar sentences cluster? Find examples where the embedding space surprises you.
Hallucination audit: Ask an LLM 20 factual questions about your domain. Verify each answer. What’s the hallucination rate? What types of questions trigger hallucination?
Evaluation design: Design a human evaluation protocol for an LLM task of your choice. Define criteria, create a rubric, evaluate 50 outputs, and measure inter-rater reliability.
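
For the train/test split exercise, the gap between training and held-out accuracy can be previewed without any ML library. The sketch below is a toy under stated assumptions: the data is synthetic (one feature, 20% label noise), and the "model" is a 1-nearest-neighbor memorizer chosen because it overfits by construction.

```python
import random

def nearest_neighbor_predict(train, x):
    """Predict the label of the closest training point (pure memorization)."""
    closest = min(train, key=lambda item: abs(item[0] - x))
    return closest[1]

def accuracy(train, examples):
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in examples)
    return hits / len(examples)

rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)       # the real pattern
    if rng.random() < 0.2:     # 20% of labels are flipped (noise)
        label = 1 - label
    data.append((x, label))

split = len(data) * 4 // 5     # 80% train, 20% test
train, test = data[:split], data[split:]

print(f"train accuracy: {accuracy(train, train):.2f}")
print(f"test accuracy:  {accuracy(train, test):.2f}")
```

The memorizer scores perfectly on the data it has seen because every training point is its own nearest neighbor. The held-out score is the only number that says anything about new inputs.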

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC1] Your model gets 95% accuracy on the test set but users say it “never works.” What might explain this?

Answer

Several possibilities: (1) Class imbalance: If 95% of examples are one class, predicting that class always gets 95% accuracy but is useless. Check precision/recall for the minority class. (2) Distribution shift: Test set doesn’t match production data. Model learned patterns that don’t generalize. (3) Wrong metric: Accuracy might not matter. If false positives are expensive, users might care about precision. (4) Data leakage: Test set accidentally overlaps with training data, or features leak the label. High test accuracy but poor generalization.
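
The class-imbalance case can be made concrete with a few lines. This is a minimal sketch with made-up labels; `precision_recall` is a hypothetical helper, and the 95/5 split simply mirrors the numbers in the answer.

```python
def precision_recall(predictions, labels, positive):
    """Compute precision and recall for one class from paired predictions and labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical test set: 95 legitimate transactions, 5 fraudulent ones.
labels = ["ok"] * 95 + ["fraud"] * 5
always_ok = ["ok"] * 100  # a "model" that never flags anything

accuracy = sum(p == y for p, y in zip(always_ok, labels)) / len(labels)
precision, recall = precision_recall(always_ok, labels, positive="fraud")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.95 precision=0.00 recall=0.00
```

From the user's point of view this model "never works": it never catches the cases they care about, even though the headline accuracy looks strong.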

Q2. [IC1] Why does setting temperature=0 for an LLM not always give identical outputs?

Answer

Even at temperature=0 (greedy decoding), outputs can vary due to: (1) Floating-point non-determinism: Different GPU hardware, kernels, or batch compositions can change the order of floating-point operations and produce slightly different logits. (2) Model updates: API providers may update the model behind an endpoint without notice. (3) Batching effects: Server load changes which requests are processed together, which feeds back into (1). (4) Tie-breaking: When two tokens have nearly equal probability, a tiny numerical difference can flip the choice, and every later token can then diverge. For reproducibility, some APIs offer a seed parameter, but even then, hardware differences can cause variation.
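
The tie-breaking point can be seen in a small sampling sketch. This is an illustration, not how any particular provider implements decoding: `sample_token` and the logit values are hypothetical, and treating temperature 0 as argmax is a common convention that is assumed here.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits at the given temperature."""
    if temperature == 0:
        # Greedy decoding: take the argmax. max() returns the first maximum,
        # so exact ties resolve by position in this sketch.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(42)
logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
greedy = [sample_token(logits, 0, rng) for _ in range(5)]
sampled = [sample_token(logits, 1.0, rng) for _ in range(5)]
print("greedy: ", greedy)   # [0, 0, 0, 0, 0]
print("sampled:", sampled)

# A difference of 1e-7 between two near-tied logits flips the greedy choice.
flip_a = sample_token([1.0, 1.0 + 1e-7], 0, rng)
flip_b = sample_token([1.0 + 1e-7, 1.0], 0, rng)
print("near-tie picks:", flip_a, flip_b)  # 1 0
```

Greedy decoding is deterministic given identical logits. The variation comes from the logits themselves differing slightly between runs, and once one token flips, the rest of the sequence is conditioned on a different prefix.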

Q3. [IC2] You’re embedding product descriptions for search. Short descriptions have 10 words; long ones have 500. Should you worry about this?

Answer

Yes, length bias is real. Issues: (1) Embedding models may give different quality representations for different lengths. (2) Long documents may have their meaning “diluted” across more tokens. (3) Similarity scores may be biased toward certain lengths. Mitigations: (1) Chunk long descriptions to consistent length before embedding. (2) Compare retrieval quality across length buckets. (3) Some embedding models handle this better than others—benchmark on your data. (4) Consider mean-pooling vs. CLS token strategies.
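
The chunking mitigation might look like the sketch below. It assumes word-based windows for simplicity (production systems often chunk by tokens instead), and `chunk_words`, the window size of 50, and the overlap of 10 are illustrative choices. The embedding call itself is deliberately left out.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word windows of at most chunk_size words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

short_description = "Compact steel water bottle with leak-proof lid"
long_description = " ".join(f"word{i}" for i in range(500))  # stand-in for a 500-word text

print(len(chunk_words(short_description)))  # 1
print(len(chunk_words(long_description)))   # 13
```

Each chunk would then be embedded separately, so a 500-word description is represented by several vectors of comparable scope instead of one diluted vector. Whether that improves retrieval is an empirical question to settle on your own length buckets.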

Q4. [IC2] An LLM confidently answers “The capital of Australia is Sydney.” How does this happen and what can you do about it?

Answer

This is hallucination: the model generates plausible but incorrect content. It happens because: (1) LLMs are trained to produce fluent, confident text, not necessarily correct text. (2) Sydney is Australia’s best-known city and likely co-occurs with “Australia” in training text more often than Canberra does. (3) The model has no internal “fact checker”; it generates from learned patterns rather than looking facts up. Mitigations: (1) RAG: Ground responses in retrieved documents. (2) Prompt for uncertainty: “If you’re not sure, say so.” (3) Use structured outputs with verified data. (4) Add validation layers that check claims against known facts.

Q5. [Senior] You need to evaluate an LLM for customer support quality. Automated metrics don’t capture what matters. How do you design scalable human evaluation?

Answer

Framework: (1) Define criteria: What does “good” mean? Accuracy, helpfulness, tone, policy compliance? Make each measurable. (2) Create rubric: 1-5 scale for each criterion with clear examples for each score. (3) Select evaluators: Domain experts for accuracy, customers for helpfulness. Train them on the rubric. (4) Measure reliability: Have multiple evaluators score the same samples. Calculate inter-rater agreement (Cohen’s kappa for two raters; Fleiss’ kappa or Krippendorff’s alpha for more than two). If agreement is low, refine the rubric. (5) Scale carefully: Don’t evaluate everything. Use stratified samples across query types. (6) Combine with automation: Train a model to predict human scores, use it for monitoring, and spot-check with humans.
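
The reliability step can be sketched for the two-rater case. This is a minimal, unweighted Cohen’s kappa written from its definition (observed agreement corrected for chance agreement); the ratings are made up, and `cohens_kappa` is a hypothetical helper, not a library function.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 helpfulness scores from two evaluators on ten responses.
rater_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # kappa = 0.71
```

The raters agree on 8 of 10 items, but kappa is lower than 0.8 because some of that agreement would happen by chance. Note that the unweighted form treats a 4-vs-3 disagreement the same as a 5-vs-1 disagreement; for ordinal rubrics a weighted variant is usually a better fit.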

Quick Checks

Check 1. What’s the difference between overfitting and underfitting?

Answer

Overfitting: Model memorizes training data, performs poorly on new data. Training loss low, test loss high. Too complex for the data. Underfitting: Model is too simple to capture patterns. Both training and test loss are high. Solution for overfitting: regularization, more data, simpler model. Solution for underfitting: more complex model, better features.

Check 2. Why do we use embeddings instead of one-hot vectors?

Answer

One-hot vectors are sparse and don’t capture similarity: “cat” is exactly as far from “dog” as it is from “refrigerator.” Embeddings are dense vectors where similar meanings are close together. This enables semantic operations such as finding similar items and arithmetic like “king - man + woman ≈ queen”, and it allows efficient storage (for example, 300 dimensions vs. 50,000+ for a one-hot vector the size of the vocabulary).
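
The difference shows up directly in cosine similarity. In the sketch below, the dense vectors are hand-picked toy values chosen to make the point; they are not output from a real embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

one_hot = {
    "cat": [1, 0, 0],
    "dog": [0, 1, 0],
    "refrigerator": [0, 0, 1],
}
# Toy dense vectors (assumed values for illustration only).
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "refrigerator": [0.1, 0.0, 0.9],
}

for name, table in (("one-hot", one_hot), ("dense", dense)):
    cat_dog = cosine(table["cat"], table["dog"])
    cat_fridge = cosine(table["cat"], table["refrigerator"])
    print(f"{name}: cat~dog={cat_dog:.2f} cat~refrigerator={cat_fridge:.2f}")
```

With one-hot vectors every pair of distinct words has similarity 0, so there is nothing to rank. With dense vectors, "cat" scores much closer to "dog" than to "refrigerator", which is what makes nearest-neighbor search meaningful.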

Connections to Other Chapters

The ML concepts in this chapter underpin many techniques throughout the book:

Chapter 5 (LLM/NLP Foundations): Builds directly on the embeddings and neural network concepts here, diving deep into transformer architecture and attention mechanisms.
Chapter 7 (RAG Systems): Uses embeddings for semantic search—understanding what embeddings represent helps debug retrieval issues.
Chapter 15 (MLOps & Evaluation): Extends the evaluation concepts here to LLM-specific challenges like hallucination detection and response quality.
Chapter 27 (Performance Engineering): GPU optimization and quantization require understanding model architectures and their memory footprints.
Glossary (Appendix A): Definitions for all ML terms introduced in this chapter.

Vaswani et al. (2017), “Attention Is All You Need” - The transformer architecture paper.
Brown et al. (2020), “Language Models are Few-Shot Learners” - GPT-3 and emergent capabilities.
Rafailov et al. (2023), “Direct Preference Optimization” - Simplified alternative to RLHF.
Zheng et al. (2023), “Judging LLM-as-a-Judge” - When LLM evaluation works and when it doesn’t.

--- title: "Chapter 3: ML Foundations for AI Engineers" keywords: [machine learning, neural networks, backpropagation, loss functions, optimization, overfitting, embeddings, GPU] difficulty: beginner prerequisites: [ch02] estimated_time: "3-4 hours" --- ## Introduction In 2021, a software engineer at a fintech startup was debugging a strange issue. Their fraud detection model, which had performed excellently in testing, was flagging legitimate transactions from a new payment processor at alarming rates. The engineer knew Python, understood APIs, and could deploy services to production. But they were stuck because they didn't understand *why* the model behaved this way. Was it a bug? A data problem? Expected behavior? The answer turned out to be distribution shift—the new payment processor's transactions looked different from the training data in ways the model couldn't handle. Understanding this required knowing what "training data" actually means, how models learn patterns, and why they fail when those patterns break. Once the engineer understood these fundamentals, the fix became obvious: retrain with examples from the new processor. This story captures why AI engineers need to understand machine learning. Not to become ML researchers or train models from scratch, but to debug production issues, make informed decisions about model selection, and communicate effectively with ML teams. You don't need to derive backpropagation—but you do need to understand what it accomplishes. You don't need to implement a transformer—but you need intuition for why attention matters. This chapter provides the 80/20 of ML knowledge for AI engineering: the 20% of concepts that explain 80% of what you'll encounter working with LLMs. We'll build mental models, not mathematical proofs. By the end, you'll have working intuition for how these systems learn, why they fail, and what levers you can pull to improve them. ::: {.callout-tip} ## Skip This Chapter If... 
You can skip this chapter if you already: - Understand overfitting, underfitting, and generalization - Know what embeddings are and why semantic similarity works - Can explain why LLMs hallucinate (training objective vs. truth) - Understand the train/validation/test split and why it matters - Know the difference between classification, regression, and generation **Skim instead if** you have ML experience but haven't worked with LLMs. Focus on the "LLM Training" and "Embeddings" sections for LLM-specific intuition. **Read carefully if** you're a software engineer with no ML background, or if you've used ML APIs but don't understand what's happening inside. ::: ### What You Can Skip (For Now) Before diving in, let's be explicit about scope. You don't need to understand: - Detailed calculus of gradient descent derivations - The mathematics of specific architectures (ResNets, LSTMs) - Hyperparameter tuning strategies - Distributed training systems - Custom loss function design These matter for ML engineers building models. As an AI engineer using models, you need different knowledge. ### What You'll Learn In this chapter, you'll understand: - Why models generalize (or fail to) - What embeddings represent and why they enable semantic search - How LLMs are trained and what that implies for their behavior - Why evaluation is hard and what to do about it - The relationship between data quality and model quality These concepts appear constantly when debugging, evaluating, and deploying AI systems. ### Mental Models That Help Throughout this chapter, we'll build several mental models: 1. **Models as pattern extractors**: They find statistical regularities in data, nothing more 2. **Training as compression**: The model learns to compress training data into parameters 3. **Inference as pattern matching**: The model retrieves and combines patterns seen during training 4. 
**Failure as distribution mismatch**: Most failures come from inputs that differ from training data Keep these in mind as we explore each topic. --- ## Core ML Concepts ### The Learning Problem Machine learning solves a specific type of problem: given examples of inputs and desired outputs, find a function that maps new inputs to correct outputs. Consider spam detection. You have thousands of emails labeled "spam" or "not spam." You want a function that, given a new email, predicts the correct label. You could write rules manually ("if contains 'Nigerian prince', then spam"), but this doesn't scale—spammers adapt, edge cases multiply, and rules conflict. Machine learning instead learns the mapping from data. You provide examples; the algorithm finds patterns; the result is a model that generalizes to new inputs. This is powerful because it scales. Instead of encoding human knowledge as rules, you encode it as labeled examples. The algorithm handles pattern extraction. ### Supervised vs. Unsupervised Learning **Supervised learning** uses labeled examples: input-output pairs where you know the correct answer. ``` Input: "Congratulations! You've won $1M!" → Output: spam Input: "Meeting moved to 3pm tomorrow" → Output: not spam Input: "Your account statement is ready" → Output: not spam ``` The model learns to predict outputs from inputs. Classification (discrete labels) and regression (continuous values) are both supervised tasks. **Unsupervised learning** finds patterns without labels. Given only inputs, discover structure. ``` Inputs: [email1, email2, email3, ..., email1000] Goal: Find natural groupings, common topics, or anomalies ``` Clustering, dimensionality reduction, and anomaly detection are unsupervised tasks. The model finds patterns but doesn't know what they mean—a human must interpret. **Where LLMs fit**: LLM pretraining is self-supervised—a middle ground. The model predicts the next word given previous words. 
The "label" (next word) comes from the text itself, not human annotation. This allows training on vast unlabeled text. ``` Input: "The capital of France is" → Label: "Paris" Input: "She opened the door and" → Label: "walked" ``` The labels are free—they're just the next word in existing text. This is why LLMs can train on trillions of tokens. ### Training, Validation, and Test Sets This is one of the most important concepts in ML, and getting it wrong invalidates everything. **Training set**: Data the model learns from. The model sees these examples repeatedly and adjusts to predict them correctly. **Validation set**: Data for tuning decisions during development. Used to compare models, select hyperparameters, and decide when to stop training. The model never trains on this data, but you make decisions based on it. **Test set**: Data for final evaluation only. Touched once, at the very end, to estimate real-world performance. If you tune based on test performance, you're cheating—your estimate becomes optimistic. **Why this matters**: The goal isn't to predict training data (you already know those answers). The goal is to predict *new* data you haven't seen. Separate sets let you estimate how well the model will perform on new data. ``` Total Data: 10,000 examples ├── Training: 7,000 (70%) - Model learns from these ├── Validation: 1,500 (15%) - Tune hyperparameters, compare approaches └── Test: 1,500 (15%) - Final evaluation only ``` **A common mistake**: Tuning on test data. If you check test performance, change something, check again, and iterate—you're fitting to the test set. Your final accuracy will be optimistic. This is called data leakage. **For LLMs**: This matters in fine-tuning and evaluation. If your evaluation examples overlap with training data, the model might be "remembering" rather than "reasoning." Many LLM benchmarks are now contaminated—models may have seen test questions during pretraining. 
### Overfitting and Generalization **Overfitting** is memorizing the training data instead of learning general patterns. Imagine memorizing a Spanish phrasebook: "Donde esta la biblioteca?" means "Where is the library?" If someone asks "Donde esta el banco?", you're stuck. You memorized specific phrases, not the underlying pattern (donde esta = where is). A model that overfits does the same thing. It achieves perfect training accuracy but fails on new examples because it memorized instead of learned. **Generalization** is the opposite: learning patterns that transfer to new data. Signs of overfitting: - Training accuracy high, validation accuracy much lower - Small changes to input cause large changes in output - Model is very confident even on unusual inputs **Why overfitting happens**: 1. **Too little data**: Not enough examples to distinguish patterns from noise 2. **Too complex model**: Model has capacity to memorize everything 3. **Training too long**: Model eventually memorizes after learning general patterns **How to prevent overfitting**: 1. **More data**: More examples make memorization harder 2. **Simpler model**: Less capacity means less memorization 3. **Regularization**: Techniques that penalize complexity 4. **Early stopping**: Stop training when validation performance plateaus **The bias-variance tradeoff**: Simple models may underfit (miss real patterns). Complex models may overfit (learn false patterns). The sweet spot is a model complex enough to capture real patterns but not so complex it memorizes noise. For AI engineers, the key insight is: **good training performance doesn't mean good real-world performance**. Always evaluate on held-out data that resembles your production distribution. ### Loss Functions and Optimization (Intuition) A **loss function** measures how wrong the model is. Lower loss means better predictions. For spam classification, one loss function is: count the misclassifications. 
If the model predicts spam for 10 legitimate emails and not-spam for 5 spam emails, the loss is 15. For next-word prediction in LLMs, the loss is how "surprised" the model is by the actual next word. If the model assigns high probability to the correct word, loss is low. If it assigns low probability (didn't expect that word), loss is high. **Optimization** is the process of adjusting model parameters to reduce loss. Think of it as navigating a landscape where elevation represents loss. You're trying to find the lowest point (minimum loss). **Gradient descent** is the standard approach: 1. Compute which direction increases loss fastest (the gradient) 2. Step in the opposite direction (downhill) 3. Repeat until loss stops decreasing You don't need to understand the calculus. The key insight is that training repeatedly adjusts parameters to reduce prediction errors on training data. Each parameter gets nudged in a direction that (locally) improves predictions. **Learning rate** controls step size. Too large: overshoot the minimum, oscillate, or diverge. Too small: progress is painfully slow. Finding good learning rates is a key part of training. ### The Training Process at a High Level Putting it together, here's what happens when you train a model: ``` 1. Initialize parameters randomly 2. Repeat for many epochs: a. For each batch of training examples: - Forward pass: compute predictions from inputs - Compute loss: how wrong were the predictions? - Backward pass: compute how each parameter contributed to the loss - Update: adjust parameters to reduce loss b. Check validation performance c. If validation performance plateaus or worsens, stop 3. Evaluate on test set to estimate real-world performance ``` **Epochs and batches**: An epoch is one pass through all training data. Training typically runs for many epochs. Within each epoch, data is processed in batches (e.g., 32 examples at a time) for efficiency. 
**What the model learns**: Through thousands of parameter updates, the model discovers patterns that predict well on training data. If the data is representative of the real world, these patterns generalize. --- ## Neural Networks Intuition ### Layers and Transformations Neural networks are stacked layers that transform input into output. Each layer takes vectors in, applies learned weights, and passes transformed vectors to the next layer. ``` Input "The cat sat on the" → Embedding Layer (convert words to vectors) → Hidden Layers (transform representations) → Output Layer (probability distribution over vocabulary) → Prediction: "mat" (highest probability) ``` The **forward pass** flows input through all layers to produce output. The **backward pass** (backpropagation) traces backward through the network after computing loss, determining how each weight contributed to the error. Weights are then adjusted to reduce error. This cycle repeats billions of times during training. ### Why Depth Matters A "deep" network has many layers. Depth enables **hierarchical feature learning**: each layer builds on the previous one, learning progressively abstract features. In language models: - Early layers: syntax, word relationships - Middle layers: semantic meaning, entity relationships - Later layers: discourse structure, reasoning patterns Shallow networks must learn everything in one step; deep networks build complexity gradually. This is why modern LLMs have dozens to over a hundred layers. **The trade-off**: Deeper networks are harder to train. Gradients can vanish or explode as they propagate through many layers. Modern architectures use techniques like residual connections and layer normalization to mitigate this. **Chapter 5 (LLM Foundations)** covers transformer architecture in depth. ### The Role of Data in Learning Models learn from data. The quality and quantity of data fundamentally limits what can be learned. **Data quantity**: More data generally helps. 
It provides more examples of each pattern and reduces overfitting. This is why LLMs train on trillions of tokens—the scale enables learning rare patterns. **Data quality**: Garbage in, garbage out. If training data contains errors, biases, or irrelevant information, the model learns these too. Common data quality issues: - **Mislabeled examples**: Errors in labels corrupt learning - **Selection bias**: Training data isn't representative of deployment - **Class imbalance**: Some categories have many more examples than others - **Noisy features**: Inputs contain irrelevant or misleading information **Data distribution**: The model learns patterns in the training distribution. If deployment data has different patterns, performance degrades. **The fundamental insight**: A model can only be as good as its data. No amount of architectural cleverness overcomes fundamentally flawed training data. For LLMs, the training corpus determines what the model "knows" and how it reasons. --- ## Understanding LLMs as ML Systems ### How LLMs Are Trained LLM training typically proceeds in three phases, each with different data and objectives. **Phase 1: Pretraining** The model learns to predict the next token from vast amounts of text. ``` Training objective: Given previous tokens, predict the next token Data: Trillions of tokens from the internet, books, code, etc. Duration: Weeks to months on thousands of GPUs Result: A model with broad knowledge and language fluency ``` Pretraining creates a general-purpose language model. It learns grammar, facts, reasoning patterns, and styles—everything implicit in predicting text well. But it doesn't know how to be helpful or follow instructions. **Phase 2: Supervised Fine-tuning (SFT)** The model learns to follow instructions and be helpful. 
``` Training objective: Given an instruction, generate a good response Data: Thousands of high-quality instruction-response pairs Duration: Hours to days Result: A model that follows instructions and generates helpful responses ``` SFT examples look like: ``` Instruction: "Explain photosynthesis in simple terms" Response: "Photosynthesis is how plants make food from sunlight..." Instruction: "Write a Python function to sort a list" Response: "def sort_list(items):\n return sorted(items)" ``` **Phase 3: RLHF (Reinforcement Learning from Human Feedback)** The model learns to generate responses that humans prefer. ``` Training objective: Generate responses that maximize human preference Data: Human rankings of model outputs (A better than B) Duration: Days to weeks Result: A model that's more helpful, harmless, and honest ``` RLHF addresses behaviors that are hard to demonstrate: being appropriately uncertain, avoiding harmful content, staying on topic. Humans rank outputs, and the model learns to generate higher-ranked responses. **Why this multi-phase approach?** - Pretraining provides knowledge and fluency (requires massive data) - SFT teaches the format of helpfulness (requires quality instruction data) - RLHF refines behavior based on human values (requires preference data) Each phase targets different aspects of capability. ### What the Model "Knows" vs. What It Can Do A crucial distinction: knowledge vs. capabilities. **Knowledge**: Facts, patterns, and information encoded in model weights during pretraining. The model "knows" that: - Paris is the capital of France - Python uses indentation for blocks - Shakespeare wrote in iambic pentameter - Water freezes at 0C **Capabilities**: What the model can do with that knowledge. 
The model can: - Answer questions about its knowledge - Generate text in various styles - Translate between languages - Write code that compiles - Reason about problems (to some extent) **The important nuance**: Knowledge is frozen at training time. The model cannot learn new facts from your conversation. What it "knows" depends entirely on what was in training data. But capabilities can be elicited through prompting. A model might "know" something but not surface it without the right prompt. This is why prompt engineering matters—you're not teaching the model new information, you're helping it access and apply what it already knows. **Implications for AI engineers**: - Don't expect the model to know recent events (after training cutoff) - Don't expect the model to know your private data (unless fine-tuned on it) - Do expect the model to have knowledge gaps even before the cutoff - Focus on eliciting capabilities through good prompts ### Why LLMs Hallucinate Hallucination—generating plausible-sounding but false information—is intrinsic to how LLMs work. **Cause 1: Training objective** The training objective is next-token prediction, not truth. The model learns to generate plausible continuations, and false statements can be perfectly plausible. ``` "The first person to walk on the moon was" Plausible continuation: "Neil Armstrong" (true) Also plausible: "Buzz Aldrin" (false but reasonable) ``` The model isn't trained to verify facts—it's trained to generate likely text. **Cause 2: Compression loss** The model compresses trillions of tokens into billions of parameters. This is lossy compression. Rare facts get blurred, similar facts get merged, and the model may confabulate to fill gaps. If the training data mentioned "John Smith won the 1987 prize" once, the model might remember "someone named John won a prize in the 1980s" and hallucinate details when asked. 
**Cause 3: Distribution mismatch** When asked about topics underrepresented in training data, the model extrapolates from related patterns. These extrapolations may be wrong. **Cause 4: Confident training signal** Training data is mostly confident, declarative text. The model learns to be confident. It doesn't learn to say "I don't know" because training data rarely does. **Implications for AI engineers**: - Never trust model outputs for critical facts without verification - Use RAG to ground responses in verified sources - Design systems that make confidence explicit - Expect hallucination and build guardrails ### Temperature, Top-p, and Sampling When generating text, the model produces probabilities for each possible next token. How we select from these probabilities affects output characteristics. **Temperature** controls randomness: - Temperature = 0: Always pick the highest-probability token (deterministic) - Temperature = 1: Sample according to probabilities (as trained) - Temperature > 1: Flatten probabilities (more random) - Temperature < 1: Sharpen probabilities (less random) ```python # Pseudocode for temperature scaling probabilities = softmax(logits / temperature) next_token = sample(probabilities) ``` Low temperature → more focused, repetitive, "safer" outputs High temperature → more creative, diverse, potentially incoherent outputs **Top-p (nucleus sampling)** limits the sampling pool: Instead of considering all tokens, only consider the smallest set whose cumulative probability exceeds p. ``` If p = 0.9: - Token A: 0.5 - Token B: 0.3 - Token C: 0.15 ← Cumulative = 0.95 > 0.9, stop here - Token D: 0.03 ← Not considered - Token E: 0.02 ← Not considered Sample only from {A, B, C} ``` This removes the "long tail" of unlikely tokens that might introduce errors or non-sequiturs. **Top-k** is simpler: only consider the k most probable tokens. 
**Practical guidelines**: - Factual tasks: Low temperature (0.0-0.3), let the model be confident - Creative tasks: Higher temperature (0.7-1.0), allow diversity - Code generation: Low temperature (0.0-0.2), correctness matters - Top-p around 0.9-0.95 is often good - Temperature and top-p interact; tune together **The deeper insight**: These parameters don't change what the model knows. They change how it samples from its probability distribution. The same model with different parameters produces different outputs, but the underlying patterns are identical. --- ## Embeddings and Vector Spaces ### What Embeddings Represent An embedding is a vector (list of numbers) that represents something—a word, sentence, document, image, or any concept. ```python # A word embedding might look like: "king" → [0.2, -0.5, 0.1, 0.8, ..., -0.3] # 768 numbers "queen" → [0.3, -0.4, 0.0, 0.9, ..., -0.2] # 768 numbers ``` But what do these numbers mean? Each dimension captures some aspect of meaning. Through training, the model learns to assign similar vectors to related concepts. The dimensions don't have clean interpretations like "royalty" or "gender"—they're learned statistical features. **Analogy**: Think of describing movies with ratings on various qualities: action (7/10), romance (2/10), comedy (4/10), etc. Two movies with similar ratings are similar in style. Embeddings work the same way, but with hundreds of dimensions that are learned rather than defined. **What makes a good embedding?** A good embedding captures the aspects of meaning relevant to your task: - For semantic search: similar concepts should be close - For sentiment analysis: positive and negative should be far apart - For analogy tasks: relationships should be parallel (king-queen ≈ man-woman) Different embedding models capture different aspects of similarity. ### Why Similar Things Are Close Together The magic of embeddings: geometry reflects semantics. 
During training, embedding models see millions of examples of related text. Through this exposure, they learn to place related items nearby in vector space. **Contrastive learning**: Many embedding models train on pairs (similar, different): - Place "how to return an item" near "refund policy" - Move "how to return an item" away from "weather forecast" After millions of such adjustments, the space organizes semantically. **Analogy arithmetic**: Famous result from Word2Vec: ``` vector("king") - vector("man") + vector("woman") ≈ vector("queen") ``` The direction from "man" to "woman" is parallel to the direction from "king" to "queen." The embedding space encodes relationships geometrically. **Why this matters for AI engineers**: This is why vector search works for RAG. You can find documents about "termination policy" by searching near "how to cancel"—they're semantically related, so they're geometrically close. ### Dimensionality and the Curse of Dimensionality Embeddings typically have 384 to 4096 dimensions. Why so many? **More dimensions = more capacity**: Each dimension can encode different aspects of meaning. Low-dimensional embeddings can't capture the full richness of language. **But high dimensions are weird**: Our geometric intuitions from 2D and 3D don't hold. **The curse of dimensionality**: 1. **Distance concentration**: In high dimensions, all points become roughly equidistant. The difference between nearest and farthest neighbors shrinks. 2. **Sparse space**: The volume of high-dimensional space is vast. Even billions of points barely cover it. 3. **Corners dominate**: In high dimensions, most volume is near the corners/edges of any bounded region. 

**Practical implications**:
- Similarity scores cluster in a narrow range; don't expect a clear gap between relevant and irrelevant
- Threshold selection is tricky; what counts as "similar enough" isn't obvious
- More data is needed to cover higher-dimensional spaces
- Dimensionality reduction (before storage/search) can help but loses information

**For AI engineers**: Be skeptical of raw similarity scores. A cosine similarity of 0.7 isn't obviously "good" or "bad"; it depends on the embedding model and your data distribution. Always evaluate on held-out examples.

### How Embeddings Enable Semantic Search

Traditional keyword search matches documents containing query words. Semantic search matches documents with related meaning, regardless of word choice.

A query for "how do I get my money back" can find documents about "refunds" and "returns" even without word overlap, because these concepts are geometrically close in embedding space.

This is the foundation of Retrieval-Augmented Generation (RAG). **Chapter 7 (RAG Systems)** covers the full implementation: embedding strategies, vector databases, retrieval optimization, and reranking.

---

## Evaluation Fundamentals

### Metrics Basics

Evaluation requires metrics: numbers that quantify how well the model performs.

**For classification (predicting categories)**:

**Accuracy**: Fraction of predictions that are correct.

```
accuracy = correct_predictions / total_predictions
```

Problem: Misleading for imbalanced data. If 99% of emails are not spam, always predicting "not spam" achieves 99% accuracy while catching zero spam.

**Precision**: Of positive predictions, how many were actually positive?

```
precision = true_positives / (true_positives + false_positives)
```

High precision = few false alarms. If the spam filter says it's spam, it probably is.

**Recall**: Of actual positives, how many did we catch?

```
recall = true_positives / (true_positives + false_negatives)
```

High recall = few misses.
If an email is spam, the filter probably catches it.

**F1 Score**: Harmonic mean of precision and recall.

```
F1 = 2 * (precision * recall) / (precision + recall)
```

Balances both concerns.

**The precision-recall tradeoff**: You can usually increase one by sacrificing the other. Aggressive spam filtering (high recall) means more false positives (low precision). Conservative filtering (high precision) means more spam gets through (low recall).

**Confusion matrix**: Shows all four outcomes:

```
                     Actual Spam       Actual Not-Spam
Predicted Spam       True Positive     False Positive
Predicted Not-Spam   False Negative    True Negative
```

This reveals where the model fails, not just how often.

### Why LLM Evaluation Is Different

Traditional ML metrics assume:
- Clear ground truth (email is spam or isn't)
- One correct answer
- Automatic comparison (predicted label vs. true label)

LLM outputs violate all of these:
- Many valid answers to "explain photosynthesis"
- Quality is subjective (is the response helpful? clear? accurate?)
- No simple way to compare generated text to a reference

**What makes LLM evaluation hard**:

1. **Open-ended outputs**: "Write a poem about autumn" has infinitely many valid responses. Which is better? By what criteria?
2. **Multi-dimensional quality**: A response can be accurate but unclear, or helpful but verbose. Single metrics miss nuance.
3. **Context dependence**: A good response for an expert differs from one for a beginner. Quality depends on the user and situation.
4. **Hallucination is hard to detect**: The model can confidently state false things. Verifying them requires external knowledge.

**Evaluation approaches for LLMs**:

- **Benchmark datasets**: Standardized tasks with known answers (MMLU, HellaSwag). Useful but limited to what can be auto-graded.
- **Human evaluation**: Have people rate outputs on various criteria. Gold standard but expensive and slow.
- **LLM-as-judge**: Use another LLM to evaluate outputs. Scalable but inherits model biases.
- **Reference-based metrics**: Compare to reference outputs (BLEU, ROUGE). Problematic because valid outputs can differ substantially from the references.

### The Role of Human Judgment

Ultimately, LLM quality is about human satisfaction. Does the output help the user? Is it trustworthy? Would a human expert approve?

**When to use human evaluation**:
- Evaluating subjective qualities (helpfulness, clarity, tone)
- Validating LLM-as-judge correlation
- High-stakes decisions where errors are costly
- Understanding failure modes qualitatively

**Designing human evaluation**:
- Clear rubrics: Define what each score means
- Calibration: Have raters practice on examples with known scores
- Multiple raters: Check for agreement, investigate disagreement
- Blind evaluation: Raters shouldn't know which model produced each output

**LLM-as-judge correlation**: A useful approach is to have humans evaluate a sample, then check whether an LLM judge agrees. If correlation is high, use the LLM for scale. If it is low, the LLM judge isn't valid for your task.

### Building Intuition for Model Quality

Beyond metrics, develop qualitative intuition for model quality.

**Stress testing**: Find inputs where the model might fail.
- Edge cases (empty input, very long input)
- Adversarial inputs (attempts to confuse or manipulate)
- Out-of-distribution inputs (topics rarely seen in training)

**Error analysis**: Don't just compute the error rate; examine the errors.
- What types of inputs cause failures?
- Are errors random or systematic?
- What would the failure feel like to a user?

**Comparative evaluation**: Compare models side-by-side.
- For the same inputs, which model gives better outputs?
- Do models fail in the same places or different ones?
- Is improvement consistent across input types?

**User feedback loops**: In production, collect user feedback.
- Thumbs up/down on responses
- User corrections and complaints
- Task completion rates

Metrics tell you *how much* the model fails.
Qualitative analysis tells you *why* and *when*, which is the information you need to improve.

---

## Putting It All Together

### The ML Mindset for AI Engineers

You now have the conceptual foundation to reason about ML systems. Here's how to apply it:

**When debugging unexpected behavior**:
- Consider distribution shift: Is the input different from training data?
- Consider overfitting: Did the model memorize instead of generalize?
- Consider data quality: What was the model trained on?
- Consider hallucination: Is the model generating plausible but false content?

**When choosing models**:
- What was it trained on? Does that match your domain?
- What was the training objective? Does that align with your needs?
- What are the documented limitations?
- How does it perform on your specific evaluation set?

**When evaluating outputs**:
- Don't trust single metrics; examine failures qualitatively
- Don't trust model confidence; it's often poorly calibrated
- Don't trust impressive demos; test on your data
- Do establish baselines; how would a simple approach perform?

**When working with ML teams**:
- Speak the language: training/validation/test, overfitting, distribution shift
- Ask the right questions: What data? What objective? What failure modes?
- Understand trade-offs: More data vs. cleaner data, speed vs. quality
- Contribute valuable feedback: Real-world failure cases, user complaints

### What We Didn't Cover

This chapter gave you working intuition, not comprehensive knowledge. Topics we skipped that you might explore later:

**Model architectures**: Transformers, attention mechanisms, CNNs, RNNs. Understanding architecture helps you understand capabilities and limitations.

**Training at scale**: Distributed training, mixed precision, gradient accumulation. Matters if you fine-tune large models.

**Advanced optimization**: Adam, learning rate schedules, warmup. Matters for fine-tuning and understanding training dynamics.


**Regularization techniques**: Dropout, weight decay, data augmentation. Matters for preventing overfitting.

**Interpretability**: Understanding what models learn internally. Matters for debugging and trust.

These topics become relevant as you go deeper into AI engineering. The fundamentals in this chapter are prerequisite to all of them.

---

## Summary

Machine learning is pattern extraction from data. Neural networks are flexible function approximators that learn these patterns through iterative parameter adjustment. LLMs are neural networks trained to predict text, and this training objective explains both their capabilities and limitations.

Key concepts for AI engineers:

1. **Training/validation/test splits** prevent you from fooling yourself about model quality
2. **Overfitting** is memorizing instead of learning; combat it with more data, simpler models, and regularization
3. **Loss functions and optimization** adjust parameters to reduce prediction errors
4. **Deep networks** learn hierarchical features, enabling complex pattern recognition
5. **LLM training** proceeds in phases: pretraining for knowledge, SFT for instruction-following, RLHF for alignment
6. **Hallucination** is intrinsic to how LLMs work: they generate plausible text, not verified facts
7. **Temperature and sampling** control how the model selects from its probability distribution
8. **Embeddings** represent meaning as vectors where geometry reflects semantics
9. **Evaluation** requires multiple metrics, human judgment, and qualitative analysis

With this foundation, you can debug production issues, make informed model selection decisions, and communicate effectively with ML teams. You won't be training models from scratch, but you'll understand the systems you're building on.

---

## Practical Exercises

1. **Train/test split experiment**: Take a classification dataset. Train a model on all data and report accuracy. Then train on 80% and test on 20%. Compare results.
   Intentionally overfit by training too long. What happens to train vs. test accuracy?

2. **Temperature exploration**: Using an LLM API, generate 10 responses to the same creative prompt at temperatures 0.0, 0.5, and 1.0. How does output diversity change? At what temperature do you see degraded quality?

3. **Embedding investigation**: Embed 100 sentences from your domain. Visualize with t-SNE or UMAP. Do semantically similar sentences cluster? Find examples where the embedding space surprises you.

4. **Hallucination audit**: Ask an LLM 20 factual questions about your domain. Verify each answer. What's the hallucination rate? What types of questions trigger hallucination?

5. **Evaluation design**: Design a human evaluation protocol for an LLM task of your choice. Define criteria, create a rubric, evaluate 50 outputs, and measure inter-rater reliability.

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [IC1]** Your model gets 95% accuracy on the test set but users say it "never works." What might explain this?

<details>
<summary>Answer</summary>

Several possibilities: (1) **Class imbalance**: If 95% of examples are one class, predicting that class always gets 95% accuracy but is useless. Check precision/recall for the minority class. (2) **Distribution shift**: Test set doesn't match production data. Model learned patterns that don't generalize. (3) **Wrong metric**: Accuracy might not matter. If false positives are expensive, users might care about precision. (4) **Data leakage**: Test set accidentally overlaps with training data, or features leak the label. High test accuracy but poor generalization.

</details>

**Q2. [IC1]** Why does setting temperature=0 for an LLM not always give identical outputs?

<details>
<summary>Answer</summary>

Even at temperature=0 (greedy decoding), outputs can vary due to: (1) **Floating-point non-determinism**: Different GPU hardware or batching configurations can produce slightly different floating-point results.
(2) **Model updates**: API providers may update models without notice. (3) **Batching effects**: Some providers process requests differently based on load. (4) **Tie-breaking**: When multiple tokens have equal probability, the tie-breaker may vary. For reproducibility, some APIs offer a seed parameter, but even then, hardware differences can cause variation.

</details>

**Q3. [IC2]** You're embedding product descriptions for search. Short descriptions have 10 words, long ones have 500. Should you worry about this?

<details>
<summary>Answer</summary>

Yes, length bias is real. Issues: (1) Embedding models may give different quality representations for different lengths. (2) Long documents may have their meaning "diluted" across more tokens. (3) Similarity scores may be biased toward certain lengths. Mitigations: (1) Chunk long descriptions to consistent length before embedding. (2) Compare retrieval quality across length buckets. (3) Some embedding models handle this better than others; benchmark on your data. (4) Consider mean-pooling vs. CLS token strategies.

</details>

**Q4. [IC2]** An LLM confidently answers "The capital of Australia is Sydney." How does this happen and what can you do about it?

<details>
<summary>Answer</summary>

This is hallucination: the model generates plausible but incorrect content. It happens because: (1) LLMs are trained to produce fluent, confident text, not necessarily correct text. (2) Sydney appears more frequently in Australian contexts than Canberra. (3) The model has no internal "fact checker"; it generates based on patterns, not knowledge retrieval. Mitigations: (1) RAG: Ground responses in retrieved documents. (2) Prompt for uncertainty: "If you're not sure, say so." (3) Use structured outputs with verified data. (4) Add validation layers that check claims against known facts.

</details>

**Q5. [Senior]** You need to evaluate an LLM for customer support quality. Automated metrics don't capture what matters.
How do you design scalable human evaluation?

<details>
<summary>Answer</summary>

Framework: (1) **Define criteria**: What does "good" mean? Accuracy, helpfulness, tone, policy compliance? Make each measurable. (2) **Create rubric**: 1-5 scale for each criterion with clear examples for each score. (3) **Select evaluators**: Domain experts for accuracy, customers for helpfulness. Train them on the rubric. (4) **Measure reliability**: Have multiple evaluators score the same samples. Calculate inter-rater agreement (for example, Cohen's kappa for a pair of raters). If it is low, refine the rubric. (5) **Scale carefully**: Use sampling strategies. Don't evaluate everything; draw stratified samples across query types. (6) **Combine with automation**: Train a model to predict human scores, use it for monitoring, spot-check with humans.

</details>

### Quick Checks

**Check 1.** What's the difference between overfitting and underfitting?

<details>
<summary>Answer</summary>

**Overfitting**: Model memorizes training data, performs poorly on new data. Training loss low, test loss high. Too complex for the data. **Underfitting**: Model is too simple to capture patterns. Both training and test loss are high. Solution for overfitting: regularization, more data, simpler model. Solution for underfitting: more complex model, better features.

</details>

**Check 2.** Why do we use embeddings instead of one-hot vectors?

<details>
<summary>Answer</summary>

One-hot vectors are sparse and don't capture similarity. "Cat" is exactly as far from "dog" as it is from "refrigerator." Embeddings are dense vectors where similar meanings are close together. This enables semantic operations: finding similar items, arithmetic like "king - man + woman ≈ queen", and efficient storage (300 dimensions vs. 50,000+ for vocabulary size).

</details>

---

## Connections to Other Chapters

The ML concepts in this chapter underpin many techniques throughout the book:

- **Chapter 5 (LLM/NLP Foundations)**: Builds directly on the embeddings and neural network concepts here, diving deep into transformer architecture and attention mechanisms.
- **Chapter 7 (RAG Systems)**: Uses embeddings for semantic search; understanding what embeddings represent helps debug retrieval issues.
- **Chapter 15 (MLOps & Evaluation)**: Extends the evaluation concepts here to LLM-specific challenges like hallucination detection and response quality.
- **Chapter 27 (Performance Engineering)**: GPU optimization and quantization require understanding model architectures and their memory footprints.
- **Glossary** (Appendix A): Definitions for all ML terms introduced in this chapter.

---

## Further Reading

### Essential

- **3Blue1Brown's neural network videos** - Visual intuition for how neural networks learn.
- **Goodfellow, Bengio & Courville, "Deep Learning"** (2016) - The comprehensive ML textbook.
- **Jay Alammar's illustrated guides** - Visual explanations of transformers and embeddings.

### Deep Dives

- **Vaswani et al. (2017), "Attention Is All You Need"** - The transformer architecture paper.
- **Brown et al. (2020), "Language Models are Few-Shot Learners"** - GPT-3 and emergent capabilities.
- **Rafailov et al. (2023), "Direct Preference Optimization"** - Simplified alternative to RLHF.
- **Zheng et al. (2023), "Judging LLM-as-a-Judge"** - When LLM evaluation works and when it doesn't.