Chapter 5: LLM/NLP Foundations

Keywords

transformers, attention, tokenization, embeddings, self-attention, BPE, GPT, BERT, positional encoding, layer normalization

Introduction

In 2017, a team at Google was struggling with a fundamental limitation of their translation systems. The neural networks they used processed text sequentially, one word at a time, like reading a sentence aloud from left to right. This worked, but it was slow and had a critical flaw: by the time the model reached the end of a long sentence, information from the beginning had degraded through many transformations. The model couldn’t easily connect a pronoun at the end of a sentence to its antecedent at the beginning.

Their solution upended a decade of deep learning research in NLP. Instead of processing text sequentially, they built an architecture in which every word could “look at” every other word simultaneously, computing relevance scores that determine how much attention each word pays to every other word. The underlying mechanism, attention, had been introduced a few years earlier as an add-on to recurrent translation models; their contribution was to make it the entire architecture, which they called “the Transformer.”

The paper “Attention Is All You Need” launched a revolution. Within two years, BERT had shattered benchmarks across NLP tasks. Within five years, GPT-3 demonstrated that language models could perform tasks they were never explicitly trained for. By 2024, transformer-based models powered ChatGPT, Claude, Gemini, and nearly every other large language model in production.

Understanding transformers deeply is not optional for AI engineers. Every technique in this book, from prompt engineering to RAG to agent systems, operates on top of transformer architectures. When your production system misbehaves, when token counts explode unexpectedly, when the model hallucinates despite good retrieval, diagnosing the problem requires understanding what’s actually happening inside the model.

This chapter builds that understanding from the ground up. We start with the surprisingly consequential question of how text becomes numbers, trace the journey through embeddings and attention, and arrive at the complete transformer architecture that powers modern AI. By the end, you should be able to explain not just what these systems do, but why they work, and what happens when they fail.

The Core Insight: Learning from Prediction

Before diving into architecture, it’s worth understanding the central insight that makes language models work.

A language model is trained on a simple task: predict what comes next. Given “The cat sat on the,” what word probably follows? This seems trivially simple, and yet to do it well, the model must learn:

Syntax and grammar: “the” is a determiner, so a noun probably follows
Semantic coherence: the noun should relate to sitting
Common patterns: “mat,” “floor,” “couch” are plausible; “differential” is not
World knowledge: cats sit on soft surfaces more often than hard ones

Through billions of such predictions across massive corpora, models learn representations that capture the structure of language and much of human knowledge. The simple objective of next-token prediction turns out to be sufficient for learning remarkably sophisticated capabilities.
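To make the prediction objective concrete, here is a minimal sketch of the simplest possible next-word predictor: a bigram model that only counts which word follows which. The tiny corpus and the function names `train_bigram` and `predict_next` are our own illustrative choices, and a real language model conditions on far more than the previous word.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict[str, Counter]:
    """Count, for each word, which words follow it in the corpus."""
    words = corpus.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model: dict[str, Counter], word: str) -> dict[str, float]:
    """Turn follow-up counts into a probability distribution over next words."""
    counts = model[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Toy corpus (an assumption for illustration only)
corpus = "the cat sat on the mat . the cat slept on the couch . the dog sat on the floor ."
model = train_bigram(corpus)

print(predict_next(model, "on"))   # "on" is always followed by "the" in this corpus
print(predict_next(model, "the"))  # "cat" is the most likely word after "the"
```

Even this crude counter picks up a sliver of syntax (a determiner follows “on”). Scaling the same objective to long contexts and billions of examples is what forces a model to learn the richer structure listed above.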

This is why understanding the mechanics matters: when we debug a model’s behavior or design a prompt, we’re working with the artifacts of this prediction-based learning. Token boundaries, attention patterns, and layer representations all emerge from this training process.

A Mental Model: Text as Points in Space

The fundamental operation of neural NLP is converting discrete text into continuous vectors that can be manipulated mathematically. Think of each word (or, more precisely, each token) as a point in high-dimensional space.

In this space:

Similar meanings cluster together: “dog,” “puppy,” and “canine” are near each other
Relationships become directions: the direction from “king” to “queen” is similar to the direction from “man” to “woman”
Context shifts positions: “bank” (financial) and “bank” (river) occupy different regions based on surrounding words

The transformer’s job is to take initial points (token embeddings) and progressively refine them through attention and feed-forward layers, eventually producing points that encode meaning, relationships, and context so precisely that the next token can be predicted accurately.

This geometric intuition will guide us through the chapter. When we discuss attention, think of it as a mechanism for moving points toward other relevant points. When we discuss layers, think of progressive refinement of these positions.

What You’ll Learn

How tokenization transforms text into discrete units, and why tokenization choices have cascading effects on model behavior
The mathematics of embeddings: how lookup tables become semantic spaces
Attention mechanisms from first principles: the queries-keys-values framework and why it enables parallel processing
The complete transformer architecture, including residual connections, layer normalization, and feed-forward networks
Training objectives that create useful models, from next-token prediction to instruction tuning
Inference and generation: how models produce text, and the engineering tradeoffs involved

Prerequisites

This chapter assumes:

Python programming and NumPy array operations
Basic linear algebra: matrix multiplication, dot products, vector norms, and the intuition that dot products measure similarity
Neural network fundamentals: layers, activation functions, backpropagation, gradient descent
General ML concepts: training vs. inference, overfitting, loss functions

You don’t need prior NLP or transformer experience. We build that here.

Tokenization

The Fundamental Problem

Language models don’t see text. They see sequences of integers. The process of converting “Hello, world!” into [15496, 11, 995, 0] (the IDs produced by GPT-2’s tokenizer) is tokenization, and despite seeming like a preprocessing detail, it fundamentally shapes what models can learn and how they behave.

Consider the challenge: we need a finite vocabulary of tokens (the model needs a fixed-size output layer), but natural language has infinite expressivity. Any tokenization scheme must balance competing concerns:

Coverage: Can we represent any possible text?
Efficiency: How many tokens per semantic unit?
Learnability: Can the model learn meaningful representations for each token?
Consistency: Do similar texts tokenize similarly?

The history of NLP is littered with approaches that failed one of these tests.

Why Not Characters or Words?

Character-level tokenization treats each character as a token. This solves coverage (every text is representable) but creates efficiency nightmares. The word “understanding” becomes 13 tokens. A 4,000-word document becomes 25,000+ tokens. Worse, the model must learn long-range dependencies through many more steps, each of which can degrade information.

Word-level tokenization goes to the opposite extreme. English has hundreds of thousands of words, and that’s before misspellings, technical jargon, code, and compound words. A vocabulary this large is impractical: most words appear so rarely in training data that the model can’t learn good representations for them. And what happens when you encounter “unforeseen”? If it’s not in your vocabulary, you’re stuck.

The insight of modern tokenization is that the right granularity is neither characters nor words, but something in between: subwords.
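A short sketch makes the two failure modes tangible. The helper names and the three-word vocabulary below are hypothetical, chosen only to illustrate the tradeoff.

```python
def char_tokenize(text: str) -> list[str]:
    """Character-level: every character is its own token."""
    return list(text)

def word_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Word-level: anything outside the fixed vocabulary collapses to <UNK>."""
    return [w if w in vocab else "<UNK>" for w in text.split()]

# Hypothetical tiny vocabulary for illustration
vocab = {"the", "outcome", "was"}

print(len(char_tokenize("understanding")))                 # 13 tokens for one word
print(word_tokenize("the outcome was unforeseen", vocab))  # the last word is lost
```

Character-level tokenization never loses information but inflates sequence length; word-level tokenization is compact but discards any word it has not seen.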

Subword Tokenization: The Key Insight

Subword tokenization splits rare words into meaningful pieces while keeping common words whole. “Understanding” might be a single token, while “unforgettable” becomes “un” + “forget” + “table.” This achieves:

Finite vocabulary: Typically 32K-100K tokens
Complete coverage: Any text is representable by combining subwords
Efficient encoding: Common words are single tokens
Meaningful pieces: Subwords often correspond to morphemes (un-, -ing, -tion)

The “meaningful pieces” property is crucial. When the model sees “un” + “forget” + “table,” the “un-” prefix carries semantic information (negation) that transfers across words like “un” + “happy” and “un” + “expected.”

Byte-Pair Encoding (BPE)

BPE, originally developed for data compression, is the most widely used subword algorithm. It works by iteratively merging the most frequent adjacent pairs.

The Algorithm:

Start with a vocabulary of individual characters (or bytes)
Count all adjacent pairs of tokens in your training corpus
Merge the most frequent pair into a new token
Repeat until you reach your target vocabulary size

Let’s trace through a small example. Suppose our corpus is just the text “banana banana banana”:

Initial vocabulary: {b, a, n, _} where _ represents space
Initial representation: [b, a, n, a, n, a, _, b, a, n, a, n, a, _, b, a, n, a, n, a]

Iteration 1:
  Count pairs: (a,n)=6, (n,a)=6, (b,a)=3, (a,_)=2, (_,b)=2
  Most frequent: (a,n) - merge into "an"
  New vocabulary: {b, a, n, _, an}
  New representation: [b, an, an, a, _, b, an, an, a, _, b, an, an, a]

Iteration 2:
  Count pairs: (b,an)=3, (an,an)=3, (an,a)=3, (a,_)=2, (_,b)=2
  Most frequent: three-way tie at 3; we pick (an,a) - merge into "ana"
  New vocabulary: {b, a, n, _, an, ana}
  New representation: [b, an, ana, _, b, an, ana, _, b, an, ana]

Iteration 3:
  Most frequent: (b,an)=3
  Merge into "ban"
  ...

Iteration 4:
  Merge "ban" + "ana" into "banana"
  Final representation: [banana, _, banana, _, banana]

In practice, BPE runs on billions of tokens of text, producing vocabularies of 32K-100K tokens. The algorithm naturally discovers meaningful units: common words become single tokens, while rare words decompose into recognizable pieces.
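The merge loop is short enough to sketch directly. This is a minimal illustration of the algorithm as described above, not a production tokenizer: it treats the space as an ordinary symbol `_`, skips the pre-tokenization step real implementations use, and breaks ties by first occurrence (an arbitrary choice of ours, so intermediate merges can differ from the hand trace even though the final result matches).

```python
from collections import Counter

def count_pairs(tokens: list[str]) -> Counter:
    """Count every adjacent pair of tokens."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text: str, num_merges: int):
    """Run `num_merges` greedy merges; return the final tokens and the merge list."""
    tokens = list(text.replace(" ", "_"))
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(tokens)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties: first pair seen wins
        tokens = merge_pair(tokens, best)
        merges.append(best)
    return tokens, merges

tokens, merges = train_bpe("banana banana banana", num_merges=4)
print(merges)
print(tokens)  # ['banana', '_', 'banana', '_', 'banana']
```

The ordered `merges` list is the learned tokenizer: to encode new text, you replay the same merges in the same order.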

Why BPE Works for Language Models:

The frequency-based merging captures a crucial pattern: tokens that often appear together likely form a semantic unit. This produces:

Common English words as single tokens: “the,” “and,” “in”
Morphological units: “-ing,” “-tion,” “-ed,” “un-”
Common phrases or compounds: “however,” “cannot”
Rare words split meaningfully: “antidisestablishmentarianism” becomes recognizable pieces

Unigram and SentencePiece

While BPE builds vocabulary bottom-up through merging, the Unigram model takes a top-down approach:

Start with a large vocabulary of possible subwords
Compute how well each subword helps model the training corpus (under a unigram language model)
Iteratively remove subwords that contribute least
Stop when vocabulary reaches target size

The key difference: Unigram considers the probability of different tokenizations, not just frequency. For a given text, there may be multiple valid tokenizations; Unigram assigns probabilities to each.

SentencePiece is a library implementing both BPE and Unigram with a key innovation: it treats text as a raw stream of Unicode characters with no language-specific pre-tokenization. Whitespace is encoded as an ordinary symbol (▁), so there is no special notion of “words,” and tokenization is losslessly reversible. This matters enormously for:

Languages without spaces (Chinese, Japanese, Thai)
Code (where whitespace is syntactically meaningful)
Mixed-language text

import sentencepiece as spm

# Training a SentencePiece model
spm.SentencePieceTrainer.train(
    input='training_corpus.txt',
    model_prefix='my_model',
    vocab_size=32000,
    model_type='bpe'  # or 'unigram'
)

# Using the trained model
sp = spm.SentencePieceProcessor()
sp.load('my_model.model')

tokens = sp.encode_as_pieces("Hello, world!")
ids = sp.encode_as_ids("Hello, world!")

Real-World Tokenization: Production Case Studies

OpenAI’s Evolution (GPT-2 to GPT-4):

GPT-2 used a 50,257-token vocabulary trained primarily on English web text. This worked well for English but fragmented other languages badly. Chinese text often required 3-4x more tokens than English for equivalent semantic content, directly affecting costs and context limits.

GPT-4 introduced cl100k_base, a 100,000-token vocabulary with several key improvements:

Better multilingual coverage (more balanced tokenization across languages)
Improved code handling (common keywords, indentation patterns as tokens)
Byte-level encoding, carried over from GPT-2 (any byte sequence is representable, so there are no unknown tokens)

Anthropic’s Approach:

Claude uses a proprietary tokenizer optimized for the assistant use case. Notably, it handles code more consistently than GPT-2-era tokenizers, treating indentation and common programming patterns as coherent units.

LLaMA and Open Models:

LLaMA uses SentencePiece with a 32K vocabulary. The smaller vocabulary means some rare words require more tokens, but the model fits in less memory (smaller embedding table) and trains faster.

Decision Framework: Choosing Tokenization Strategy

Factor	BPE	Unigram	WordPiece
Primary users	GPT family, LLaMA	T5, mBART, many multilingual	BERT family
Vocabulary building	Bottom-up merging	Top-down pruning	Bottom-up with likelihood
Handling rare words	Splits into subwords	Splits with probability	Splits into subwords
Determinism	Deterministic	Probabilistic (usually use Viterbi)	Deterministic
Best for	General purpose	Multilingual	Understanding tasks

When to train your own tokenizer:

Domain-specific vocabulary: If your domain uses terminology absent from standard tokenizers (medical, legal, scientific), you’ll get better tokenization with domain-specific training
Code-heavy applications: Standard tokenizers fragment identifiers and syntax unpredictably
Non-English majority: Tokenizers trained on English treat other languages as rare, fragmenting them excessively

When to use pre-trained tokenizers:

Using a pre-trained model: Always use the model’s native tokenizer. Mismatched tokenizers cause silent failures.
General-purpose applications: Standard tokenizers handle most text well
Reproducibility: Pre-trained tokenizers are stable and well-tested

The Mathematics of Tokenization

BPE can be understood as greedy compression. At each step, we replace the most frequent pair (x, y) with a new symbol xy. This minimizes description length greedily.

Formally, if we have a corpus $C$ as a sequence of symbols, and pair $(x, y)$ appears $n$ times:

Original encoding length: $|C|$ symbols
After merge: $|C| - n$ symbols (each occurrence of $xy$ is now one symbol)
Vocabulary increases by 1

We’re trading vocabulary size for sequence length, converging toward a balance between the two.

The Unigram model is more principled. It defines a probability model:

\[P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i)\]

For a given text, we find the tokenization that maximizes this probability (typically via Viterbi algorithm). The vocabulary is pruned to keep tokens that maximize likelihood over the training corpus.
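The Viterbi search over segmentations can be sketched in a few lines of dynamic programming. The vocabulary and its probabilities below are invented for illustration; a trained Unigram model would estimate them from a corpus.

```python
import math

def viterbi_segment(text: str, log_probs: dict[str, float]):
    """Find the segmentation of `text` that maximizes the sum of token log-probs."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in text[:i]
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end to recover the pieces
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]

# Made-up unigram probabilities (assumption for illustration)
probs = {"un": 0.10, "forget": 0.05, "table": 0.04, "able": 0.05,
         "t": 0.02, "unforgettable": 1e-6}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("unforgettable", log_probs)
print(pieces)  # ['un', 'forget', 'table']
```

Here three segmentations are possible, and the product $0.10 \times 0.05 \times 0.04$ beats both the four-piece split and the rare whole-word token.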

Implementation: Token Counting and Budgeting

Production systems need accurate token counts for cost estimation and context management:

import tiktoken

def count_tokens(text: str, model: str = "gpt-5") -> int:
    """Count tokens for a specific model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Rough estimates for planning (English prose):
# - ~4 characters per token
# - ~0.75 words per token
# - ~1.3 tokens per word
# These vary significantly by content type

Token Budget Management:

class TokenBudget:
    """Manage token budgets for prompt construction."""

    def __init__(self, max_tokens: int, model: str = "gpt-5"):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.encoding_for_model(model)

    def count(self, text: str) -> int:
        return len(self.encoding.encode(text))

    def allocate(self, components: dict[str, str],
                 output_reserve: int = 500) -> dict[str, int]:
        """Check token usage for prompt components."""
        usage = {name: self.count(text) for name, text in components.items()}
        usage['_output_reserve'] = output_reserve
        usage['_total'] = sum(usage.values())
        usage['_remaining'] = self.max_tokens - usage['_total']
        return usage

    def truncate_to_fit(self, text: str, max_tokens: int) -> str:
        """Truncate text to fit within token limit."""
        tokens = self.encoding.encode(text)
        if len(tokens) <= max_tokens:
            return text
        return self.encoding.decode(tokens[:max_tokens])

Common Pitfalls and Debugging

Token boundary surprises:

tokenizer = tiktoken.encoding_for_model("gpt-5")

# Leading spaces matter!
print(tokenizer.encode("hello"))    # typically a single token
print(tokenizer.encode(" hello"))   # typically also a single token, but a different ID:
                                    # the leading space is folded into the token

# Concatenation loses space tokens
text1 = "Say hello"
text2 = "Say" + "hello"  # Missing space!
print(tokenizer.decode(tokenizer.encode(text1)))  # "Say hello"
print(tokenizer.decode(tokenizer.encode(text2)))  # "Sayhello"

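Number tokenization can be inconsistent:

def pieces(text):
    """Show the text piece behind each token ID."""
    return [tokenizer.decode([t]) for t in tokenizer.encode(text)]

# Older vocabularies such as GPT-2's split numbers wherever learned merges happen to fall
print(pieces("1234"))   # Might be ['123', '4']
print(pieces("1235"))   # Might be ['12', '35'] under GPT-2-style merges - different split!

# Newer OpenAI encodings chunk digits into groups of up to three, which is more
# regular but still breaks one number into several tokens.
# This is one reason LLMs struggle with arithmetic:
# a number like 1234 has no single, consistent representation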

Multilingual fragmentation:

# Same semantic content, vastly different token counts
english = "How are you doing today?"
chinese = "你今天好吗？"

print(f"English: {count_tokens(english)} tokens")  # ~6 tokens
print(f"Chinese: {count_tokens(chinese)} tokens")  # ~10-15 tokens with English-trained tokenizer

# This affects:
# - API costs (billed per token)
# - Context window usage
# - Model performance (less "room to think")

Common Mistake: Ignoring Tokenization Costs for Non-English Text

What people do: Build products for multilingual markets using per-token pricing estimates based on English text.

Why it fails: Tokenizers trained primarily on English fragment other languages excessively. Chinese, Japanese, Arabic, and many other languages can require 2-4x more tokens for equivalent semantic content. A product costing $0.01 per query in English might cost $0.03 per query in Japanese.

Fix: Always test tokenization on your target languages before committing to cost estimates. Use tiktoken or your provider’s tokenizer to get actual token counts. For multilingual applications, budget for the highest-cost language, not the average.
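The budgeting step is simple arithmetic once you have measured token counts. The counts and the price below are hypothetical placeholders; substitute numbers from your own tokenizer runs and your provider’s price list.

```python
def cost_per_query(tokens_per_query: dict[str, int],
                   usd_per_1k_tokens: float) -> dict[str, float]:
    """Convert measured per-language token counts into per-query cost."""
    return {lang: n / 1000 * usd_per_1k_tokens
            for lang, n in tokens_per_query.items()}

# Hypothetical measurements for the same query template in three languages
measured = {"en": 500, "ja": 1500, "ar": 1200}

costs = cost_per_query(measured, usd_per_1k_tokens=0.02)  # price is an assumption
worst_lang = max(costs, key=costs.get)

for lang, usd in costs.items():
    print(f"{lang}: ${usd:.3f} per query")
print(f"Budget for: {worst_lang}")
```

With these placeholder numbers, planning around the English figure would understate the Japanese cost by a factor of three.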

Debugging tokenization issues:

def debug_tokenization(text: str, model: str = "gpt-5"):
    """Visualize how text is tokenized."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    pieces = [encoding.decode([t]) for t in tokens]

    print(f"Text: {text}")
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"Pieces: {pieces}")
    print(f"Joined: {'|'.join(pieces)}")

# Example output (token IDs are illustrative and depend on the model's encoding):
# Text: The transformer architecture
# Tokens (3): [791, 43678, 18112]
# Pieces: ['The', ' transformer', ' architecture']
# Joined: The| transformer| architecture

Historical Context: The Evolution of Tokenization

1990s (Word-level): Early NLP used word-level tokenization with fixed vocabularies. Unknown words were mapped to <UNK>, losing information entirely.

2015 (BPE for NMT): Sennrich et al. applied Byte-Pair Encoding to neural machine translation, enabling open-vocabulary models that could handle rare words and morphologically rich languages.

2018 (SentencePiece): Google’s SentencePiece made subword tokenization language-agnostic by operating directly on raw Unicode text and treating whitespace as an ordinary symbol, eliminating language-specific preprocessing.

2022 (tiktoken): OpenAI open-sourced tiktoken, a fast byte-level BPE implementation (its README reports 3-6x higher throughput than a comparable open-source tokenizer) that serves as the reference tokenizer for OpenAI models.

The pattern: Tokenization evolved from linguistic assumptions (words) to learned statistical patterns (subwords) to raw bytes. Each step increased generality and reduced language-specific engineering. Future models may move toward byte-level transformers that learn tokenization end-to-end.

Embeddings

From Tokens to Vectors: The Core Transformation

Tokens are integers. Neural networks operate on continuous vectors. Embeddings bridge this gap, converting each token into a dense vector that the network can process.

At its simplest, an embedding is a lookup table. If your vocabulary has 100,000 tokens and you choose an embedding dimension of 4,096, your embedding matrix has shape (100000, 4096). Converting token ID 42 to its embedding means retrieving row 42.

import torch
import torch.nn as nn

vocab_size = 100000
embedding_dim = 4096

# Create embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)

# Token IDs to embeddings: just a lookup
token_ids = torch.tensor([42, 1337, 99])  # Shape: (3,)
vectors = embedding(token_ids)             # Shape: (3, 4096)

But this simple lookup table, when trained end-to-end with the language model, learns remarkable structure.
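One way to see why a lookup table can be trained like any other layer: indexing a row is mathematically the same as multiplying a one-hot vector by the embedding matrix. The sketch below checks this with NumPy on a deliberately tiny matrix; the sizes and the name `E` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4
E = rng.normal(size=(vocab_size, embedding_dim))  # toy embedding matrix

token_ids = np.array([4, 7, 4])

# Path 1: direct row lookup
looked_up = E[token_ids]                 # (3, 4)

# Path 2: one-hot vectors times the matrix
one_hot = np.eye(vocab_size)[token_ids]  # (3, 10), a single 1 per row
via_matmul = one_hot @ E                 # (3, 4)

print(np.allclose(looked_up, via_matmul))  # True
```

Because the lookup is a linear map, gradients flow back into exactly the rows that were selected, which is how each token’s vector gets updated during training. The lookup form is used in practice simply because it avoids materializing the one-hot matrix.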

Why Embeddings Capture Meaning: The Distributional Hypothesis

The power of embeddings derives from the distributional hypothesis: words that appear in similar contexts have similar meanings. “Dog” and “cat” both appear in contexts like “The ___ ran across the yard” and “She fed her ___.” Through training on next-token prediction, embeddings for words in similar contexts become geometrically close.

This creates an embedding space where:

Synonyms cluster: “big,” “large,” “huge” have similar embeddings
Antonyms are related: “hot” and “cold” are closer to each other than to “blue” (they share the concept of temperature)
Analogies are arithmetic: embed("king") - embed("man") + embed("woman") ≈ embed("queen")

The analogy result is remarkable. The direction from “man” to “woman” encodes the concept of gender, and this same direction applies to other gendered pairs. The embedding space has learned a geometric representation of abstract concepts.
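The arithmetic is easiest to see on a hand-built example. The vectors below are not learned embeddings; we constructed them so that one axis means “royalty” and another “gender,” purely to show the mechanics of the analogy lookup.

```python
import numpy as np

# Hand-built toy vectors: axis 0 = royalty, axis 1 = gender, axis 2 = fruit
vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))  # queen
```

In learned embeddings the match is approximate rather than exact, and the query words themselves are conventionally excluded from the candidates, as they are here.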

The Geometry of Embeddings

Embeddings as points in space: semantically similar items cluster together, and the direction between points carries meaning.

Cosine Similarity:

We measure embedding similarity using cosine similarity, which measures the angle between vectors, ignoring magnitude:

\[\text{cosine}(a, b) = \frac{a \cdot b}{\|a\| \|b\|} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \sqrt{\sum_i b_i^2}}\]

Range: [-1, 1] where:

1: identical direction (same meaning)
0: orthogonal (unrelated meanings)
-1: opposite direction (rarely seen in practice)

Why Cosine Over Euclidean?

Euclidean distance measures straight-line distance: $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$

For embeddings, we care about direction (what the token means) more than magnitude (how “strongly” the token is expressed). Cosine similarity is direction-only. Additionally, for normalized embeddings (vectors with length 1), cosine and dot product are equivalent, enabling fast computation.

The Mathematical Connection:

For unit vectors (normalized to length 1): \[\text{cosine}(a, b) = a \cdot b = \|a\| \|b\| \cos(\theta) = \cos(\theta)\]

And Euclidean distance relates to cosine similarity: \[\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2(a \cdot b) = 2(1 - \text{cosine}(a, b))\]

So maximizing cosine similarity minimizes Euclidean distance for normalized vectors.
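A quick numerical check of the identity, using two random vectors normalized to unit length (the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Normalize to unit length so the identity applies
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_sim = float(a @ b)                     # dot product equals cosine for unit vectors
sq_dist = float(np.sum((a - b) ** 2))      # squared Euclidean distance

print(np.isclose(sq_dist, 2 * (1 - cos_sim)))  # True
```

This is why vector databases can store normalized embeddings and use whichever metric is fastest: for unit vectors, the rankings under cosine similarity, dot product, and Euclidean distance agree.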

Position Encodings: Where Am I?

Transformers process all tokens in parallel, unlike RNNs that process sequentially. This parallelism is a key advantage, but it creates a problem: without explicit position information, the model can’t distinguish “The cat sat on the mat” from “mat the on sat cat The.”

Absolute Position Embeddings (GPT-2 Style):

Learn a separate embedding for each position:

max_seq_length = 2048
position_embedding = nn.Embedding(max_seq_length, embedding_dim)

# For a sequence of length L
positions = torch.arange(L)  # [0, 1, 2, ..., L-1]
pos_embeddings = position_embedding(positions)  # (L, embedding_dim)

# Final input: token embedding + position embedding
input_to_transformer = token_embeddings + pos_embeddings

Simple but limited: the model can’t generalize beyond max_seq_length positions it was trained on.

Sinusoidal Position Embeddings (Original Transformer):

Use deterministic sine and cosine functions at different frequencies:

\[PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})\] \[PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})\]

Where pos is position and i is dimension index.

The intuition: different dimensions oscillate at different frequencies. Low-index dimensions have short wavelengths and capture fine position (differences between adjacent positions); high-index dimensions have long wavelengths and capture coarse position (beginning vs. end). The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.

import math
import torch

def sinusoidal_positions(seq_length: int, dim: int) -> torch.Tensor:
    """Generate sinusoidal position embeddings (assumes an even dim)."""
    position = torch.arange(seq_length, dtype=torch.float32).unsqueeze(1)  # (seq_length, 1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )  # (dim/2,)

    pe = torch.zeros(seq_length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions
    return pe
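To check the geometric progression of wavelengths, we can compute them straight from the formula: the sine/cosine pair with index $i$ has wavelength $2\pi \cdot 10000^{2i/d}$. The helper name `wavelengths` and the choice of `dim=512` are ours.

```python
import math

def wavelengths(dim: int) -> list[float]:
    """Wavelength (in positions) of each sin/cos pair, i = 0 .. dim/2 - 1."""
    return [2 * math.pi * 10000 ** (2 * i / dim) for i in range(dim // 2)]

w = wavelengths(512)
ratios = [w[i + 1] / w[i] for i in range(len(w) - 1)]

print(f"shortest: {w[0]:.2f} positions")   # 2*pi
print(f"longest:  {w[-1]:.0f} positions")  # just under 10000 * 2*pi
print(f"constant ratio between neighbors: {ratios[0]:.4f}")
```

The ratio between neighboring wavelengths is constant, which is what makes the progression geometric. The largest wavelength approaches $10000 \cdot 2\pi$ without quite reaching it, because the last pair has index $d/2 - 1$.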

Rotary Position Embeddings (RoPE) - Modern Standard:

RoPE encodes positions by rotating the query and key vectors rather than adding position embeddings. This elegant approach is now standard in LLaMA, Mistral, and most modern open models.

The key insight: if we rotate query and key vectors by position-dependent angles, their dot product (attention score) depends on the relative position between them, not absolute positions.

Mathematically, for a 2D subspace of the embedding:

\[\begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} q_1 \\ q_2 \end{pmatrix}\]

Where $m$ is the position and $\theta$ is a frequency. We apply this rotation to pairs of dimensions throughout the embedding.

Why RoPE is better:

1. Relative positions naturally: The dot product between rotated vectors depends on position difference
2. Length generalization: Works for sequences longer than training (with some degradation)
3. Efficient: Rotation is cheap; just multiply by sin/cos values

def apply_rotary_embeddings(q, k, freqs_cos, freqs_sin):
    """Apply RoPE to queries and keys.

    q, k: (batch, seq_len, num_heads, head_dim)
    freqs_cos, freqs_sin: (seq_len, head_dim/2)
    """
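    # Reshape to pairs of dimensions
    q_r = q.float().reshape(*q.shape[:-1], -1, 2)  # (..., head_dim/2, 2)
    k_r = k.float().reshape(*k.shape[:-1], -1, 2)

    # Add batch and head axes so the frequencies broadcast against
    # (batch, seq_len, num_heads, head_dim/2)
    freqs_cos = freqs_cos[None, :, None, :]
    freqs_sin = freqs_sin[None, :, None, :]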

    # Apply rotation using complex number multiplication
    # (a + bi)(cos + sin*i) = (a*cos - b*sin) + (a*sin + b*cos)i
    q_out_r = q_r[..., 0] * freqs_cos - q_r[..., 1] * freqs_sin
    q_out_i = q_r[..., 0] * freqs_sin + q_r[..., 1] * freqs_cos
    k_out_r = k_r[..., 0] * freqs_cos - k_r[..., 1] * freqs_sin
    k_out_i = k_r[..., 0] * freqs_sin + k_r[..., 1] * freqs_cos

    # Interleave real and imaginary parts
    q_out = torch.stack([q_out_r, q_out_i], dim=-1).flatten(-2)
    k_out = torch.stack([k_out_r, k_out_i], dim=-1).flatten(-2)

    return q_out.type_as(q), k_out.type_as(k)
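The relative-position property can be verified numerically in a single 2D subspace. In this sketch the vectors, the frequency `theta`, and the positions are arbitrary choices; the point is only that the score depends on the offset between positions, not on the positions themselves.

```python
import numpy as np

def rotate(vec: np.ndarray, pos: int, theta: float) -> np.ndarray:
    """Rotate a 2D vector by the position-dependent angle pos * theta."""
    angle = pos * theta
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return rotation @ vec

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.5])
theta = 0.1

# Same offset (2) at very different absolute positions
score_near = rotate(q, 5, theta) @ rotate(k, 3, theta)
score_far = rotate(q, 105, theta) @ rotate(k, 103, theta)

# Different offset (1)
score_other = rotate(q, 5, theta) @ rotate(k, 4, theta)

print(np.isclose(score_near, score_far))    # True: only the offset matters
print(np.isclose(score_near, score_other))  # False: a different offset changes the score
```

Algebraically, the two rotations combine into a single rotation by $(n - m)\theta$ inside the dot product, so the absolute positions $m$ and $n$ cancel.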

Static vs. Contextual Embeddings

Early embedding methods like Word2Vec and GloVe produced static embeddings: each word has one vector, regardless of context. “Bank” has the same embedding whether it means a financial institution or a river bank.

Transformer-based models produce contextual embeddings. After passing through attention layers, each token’s representation incorporates information from its context. The representation of “bank” in “I deposited money at the bank” differs from “I sat on the river bank.”

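import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text1 = "I deposited money at the bank"
text2 = "I sat on the river bank"

inputs1 = tokenizer(text1, return_tensors="pt")
inputs2 = tokenizer(text2, return_tensors="pt")

with torch.no_grad():
    outputs1 = model(**inputs1)
    outputs2 = model(**inputs2)

# Locate "bank" in each tokenized sequence rather than hard-coding positions:
# [CLS] occupies index 0, so word indices are shifted and easy to get wrong
bank_id = tokenizer.convert_tokens_to_ids("bank")
idx1 = inputs1["input_ids"][0].tolist().index(bank_id)
idx2 = inputs2["input_ids"][0].tolist().index(bank_id)
bank_embedding_1 = outputs1.last_hidden_state[0, idx1]
bank_embedding_2 = outputs2.last_hidden_state[0, idx2]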

# These embeddings are different!
similarity = torch.cosine_similarity(bank_embedding_1, bank_embedding_2, dim=0)
print(f"Similarity: {similarity:.3f}")  # Much less than 1.0

This contextual nature is why transformers are so powerful for NLP: they can disambiguate meaning based on context.

Common Mistake: Pre-computing Embeddings Without Context

What people do: Pre-compute embeddings for individual words or short phrases (“product quality”, “user experience”) and store them for later similarity search, treating transformer embeddings as static word vectors.

Why it fails: Transformer embeddings are contextual—the embedding for a word depends on its surrounding text. The embedding for “Python” in “I have a pet python” differs significantly from “Python” in “Write this in Python.” Pre-computing embeddings for isolated terms loses this context and produces misleading similarity scores.

Fix: Always embed complete sentences or paragraphs that provide context. For retrieval, embed document chunks (not keywords). For semantic similarity, embed the full text you want to compare. If you must embed short phrases, use models specifically trained for short-text embedding (like sentence transformers) rather than raw transformer outputs.

Embedding Dimensions: Tradeoffs and Guidelines

Dimension Size:

Larger dimensions capture more information but cost more:

Memory: Embedding table scales with vocab_size * embedding_dim
Compute: All operations involve embedding_dim-sized vectors
Overfitting risk: More parameters can memorize training data

Model	Embedding Dimension	Vocabulary	Embedding Parameters
BERT-base	768	30,522	23M
BERT-large	1,024	30,522	31M
GPT-2	768-1,600	50,257	39M-80M
LLaMA-7B	4,096	32,000	131M
GPT-4 (estimated)	12,288+	100,000+	1.2B+
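The parameter counts in the table follow directly from vocab_size * embedding_dim, which we can reproduce in a few lines. The two-bytes-per-parameter figure assumes 16-bit storage and is our assumption for the memory estimate.

```python
def embedding_params(vocab_size: int, embedding_dim: int) -> int:
    """Number of parameters in the token embedding table."""
    return vocab_size * embedding_dim

def embedding_megabytes(vocab_size: int, embedding_dim: int,
                        bytes_per_param: int = 2) -> float:
    """Approximate table size in MB (default assumes 16-bit weights)."""
    return embedding_params(vocab_size, embedding_dim) * bytes_per_param / 1e6

models = [("BERT-base", 30_522, 768),
          ("BERT-large", 30_522, 1_024),
          ("LLaMA-7B", 32_000, 4_096)]

for name, vocab, dim in models:
    params = embedding_params(vocab, dim)
    print(f"{name}: {params / 1e6:.0f}M parameters, "
          f"~{embedding_megabytes(vocab, dim):.0f} MB at 16-bit")
```

Doubling either the vocabulary or the dimension doubles the table, which is one reason vocabulary size is a real design decision rather than a free parameter.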

Decision Framework:

Use Case	Recommended Dimension	Rationale
Semantic search/retrieval	256-768	Balance of quality and speed; retrieval doesn’t need full model capacity
Fine-grained similarity	768-1,024	Captures nuanced relationships
General LLM embedding	2,048-4,096	Matches model capacity
Frontier models	8,192-16,384	Maximum expressiveness for complex reasoning

Matryoshka Embeddings:

Introduced by Kusupati et al. (2022): embeddings trained to be useful at multiple truncated dimensions. You can use the first 256 dimensions for fast retrieval, then the full 1,024 for reranking. This is achieved by applying the loss function at multiple dimension checkpoints during training.
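The coarse-then-fine retrieval pattern can be sketched as follows. The random vectors here stand in for real embeddings and are not Matryoshka-trained, so this shows only the mechanics of truncating, renormalizing, and reranking; the function name `two_stage_search` and its parameters are our own.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def two_stage_search(query: np.ndarray, docs: np.ndarray,
                     coarse_dim: int = 256, shortlist: int = 10) -> np.ndarray:
    """Shortlist with truncated embeddings, then rerank with full embeddings."""
    # Stage 1: cheap cosine similarity on the first coarse_dim dimensions
    coarse_scores = normalize(docs[:, :coarse_dim]) @ normalize(query[:coarse_dim])
    candidates = np.argsort(-coarse_scores)[:shortlist]

    # Stage 2: full-dimension cosine similarity on the shortlist only
    fine_scores = normalize(docs[candidates]) @ normalize(query)
    return candidates[np.argsort(-fine_scores)]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 1024))                 # stand-in document embeddings
query = docs[7] + 0.01 * rng.normal(size=1024)       # a query very close to document 7

ranked = two_stage_search(query, docs)
print(ranked[0])  # 7
```

Truncated vectors must be renormalized before computing cosine similarity, since a prefix of a unit vector is no longer unit length.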

Pooling: From Token Embeddings to Sequence Embeddings

Transformers output one embedding per token. For tasks requiring a single sequence embedding (classification, retrieval), we must pool:

def pool_embeddings(token_embeddings, attention_mask, strategy="mean"):
    """Pool token embeddings into a single sequence embedding.

    Args:
        token_embeddings: (batch, seq_len, hidden_dim)
        attention_mask: (batch, seq_len) - 1 for real tokens, 0 for padding
        strategy: "cls", "mean", "max", or "last"
    """
    if strategy == "cls":
        # Use [CLS] token (first token for BERT-style models)
        return token_embeddings[:, 0, :]

    elif strategy == "mean":
        # Average all non-padding tokens
        mask_expanded = attention_mask.unsqueeze(-1).float()
        sum_embeddings = torch.sum(token_embeddings * mask_expanded, dim=1)
        sum_mask = mask_expanded.sum(dim=1).clamp(min=1e-9)
        return sum_embeddings / sum_mask

    elif strategy == "max":
        # Max pool over non-padding tokens
        token_embeddings = token_embeddings.masked_fill(
            attention_mask.unsqueeze(-1) == 0, float('-inf')
        )
        return torch.max(token_embeddings, dim=1).values

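    elif strategy == "last":
        # Last non-padding token (for decoder models); assumes right-padded inputs
        seq_lengths = attention_mask.sum(dim=1) - 1
        batch_indices = torch.arange(token_embeddings.size(0), device=token_embeddings.device)
        return token_embeddings[batch_indices, seq_lengths]

    else:
        raise ValueError(f"Unknown pooling strategy: {strategy!r}")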

Which pooling strategy?

CLS: Standard for BERT-style models trained with CLS pooling
Mean: Good default; robust to varying sequence lengths
Max: Emphasizes distinctive features; good for keywords
Last: Appropriate for decoder models (GPT-style) where the last token aggregates context

Attention: The Core Innovation

The Problem Attention Solves

Before transformers, sequence models were dominated by recurrent neural networks (RNNs) and their variants (LSTMs, GRUs). These process text sequentially:

"The" → [hidden state 1] → "cat" → [hidden state 2] → "sat" → [hidden state 3] → ...

Each hidden state must compress all previous information into a fixed-size vector. This creates two fundamental problems:

Information bottleneck: By the time we reach word 50, information from word 1 has passed through 49 transformations. Important details get lost or corrupted.
Sequential computation: We can’t process word 5 until words 1-4 are done. This prevents parallelization.

The sentence “The cat, which was spotted with orange and black stripes and had been living in the old barn since last winter, sat on the mat” requires understanding that “cat” is the subject of “sat,” but the two are separated by roughly 20 words. RNNs struggle with such long-range dependencies.

Attention solves both problems.

Every token can directly attend to every other token. “Sat” doesn’t need to receive information about “cat” through a chain of transformations; it can look directly at “cat” and decide to incorporate its information.

And because attention computes relationships between all pairs simultaneously, it’s fully parallelizable. We can process all positions at once on a GPU.

Queries, Keys, and Values: The Mental Model

Think of attention as a content-based memory lookup, like a database query:

Query (Q): “What am I looking for?” Each token broadcasts a query vector describing what information it needs.
Key (K): “What do I contain?” Each token provides a key vector describing what information it has.
Value (V): “What do I contribute?” Each token provides a value vector containing its actual information.

The attention mechanism:

1. Compares each query to all keys (dot product)
2. Converts similarities to weights (softmax)
3. Returns a weighted sum of values

This is like searching a database: the query specifies what you want, keys index the entries, and values are the actual content.

Query("sat"): "I'm a verb. What's my subject?"
   ↓
Compare to all keys:
   Key("The"): determiner, low relevance → low weight
   Key("cat"): noun, potential subject → HIGH weight
   Key("which"): relative pronoun → low weight
   Key("on"): preposition → low weight
   ↓
Return: weighted sum of values, dominated by Value("cat")

The Mathematics of Attention

Formally, given input embeddings $X \in \mathbb{R}^{n \times d}$ (n tokens, d dimensions):

Project to Q, K, V: \[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

Where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned projection matrices (in practice $d_v = d_k$ is common).

Compute attention scores: \[\text{scores} = QK^T \in \mathbb{R}^{n \times n}\]

Entry $(i, j)$ is how much position $i$ should attend to position $j$: the dot product of query $i$ and key $j$.

Scale: \[\text{scaled\_scores} = \frac{QK^T}{\sqrt{d_k}}\]

We’ll explain why scaling is crucial below.

Apply softmax: \[\text{weights} = \text{softmax}(\text{scaled\_scores})\]

Each row sums to 1, forming a probability distribution over positions.

Weighted sum of values: \[\text{output} = \text{weights} \cdot V \in \mathbb{R}^{n \times d_v}\]

The complete formula:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
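To make the five steps concrete, here is a minimal sketch of the formula on hand-picked 3-token inputs, written with plain Python lists so every intermediate value can be inspected. The matrices are illustrative assumptions, not learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small list-of-lists matrices."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Scaled dot product of this query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted sum of the value rows
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

# Toy example: n = 3 tokens, d_k = 2, d_v = 2 (values chosen by hand)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution: it sums to 1, and keys that align with the query receive the larger weights.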

Why Scale by $\sqrt{d_k}$?

The scaling factor is crucial for training stability. Here’s why:

Dot products grow with dimension. If $q$ and $k$ are random vectors with entries of mean 0 and variance 1, then:

\[\text{E}[q \cdot k] = 0, \quad \text{Var}[q \cdot k] = d_k\]

As $d_k$ increases, dot products have higher variance. For $d_k = 64$, scores might range from -20 to +20. After softmax, such extreme values become nearly one-hot:

import torch
import torch.nn.functional as F

# High-variance scores
scores = torch.tensor([10.0, 1.0, -5.0])
print(F.softmax(scores, dim=0))  # [0.9999, 0.0001, 0.0000] - nearly one-hot

# After scaling by sqrt(64) = 8
scaled_scores = scores / 8
print(F.softmax(scaled_scores, dim=0))  # [0.68, 0.22, 0.10] - smoother

One-hot attention distributions have near-zero gradients (softmax saturation), making training difficult. Scaling by $\sqrt{d_k}$ keeps scores in a range where softmax produces useful gradients.
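The variance claim is easy to check empirically. The following is a small simulation sketch, assuming query and key entries drawn independently from a standard normal distribution (real activations only approximate this); `dot_product_variance` is a hypothetical helper name.

```python
import random
import statistics

def dot_product_variance(d_k, trials=2000, seed=0):
    """Sample variance of q . k for random q, k with N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.variance(dots)

for d_k in (16, 64, 256):
    raw = dot_product_variance(d_k)
    scaled = raw / d_k  # Var[(q . k) / sqrt(d_k)] = Var[q . k] / d_k
    print(f"d_k={d_k:4d}  raw variance ~ {raw:7.1f}  after scaling ~ {scaled:.2f}")
```

The raw variance tracks $d_k$, while the scaled variance stays near 1 regardless of head dimension, which is the property the $\sqrt{d_k}$ factor is there to provide.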

Implementation: Scaled Dot-Product Attention

import math

def scaled_dot_product_attention(
    query: torch.Tensor,  # (batch, seq_q, d_k)
    key: torch.Tensor,    # (batch, seq_k, d_k)
    value: torch.Tensor,  # (batch, seq_k, d_v)
    mask: torch.Tensor = None  # (seq_q, seq_k) or (batch, seq_q, seq_k)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Compute scaled dot-product attention.

    Returns:
        output: (batch, seq_q, d_v) - attention output
        weights: (batch, seq_q, seq_k) - attention weights
    """
    d_k = query.size(-1)

    # Compute attention scores: (batch, seq_q, seq_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Apply mask (for causal attention or padding)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Softmax to get attention weights
    weights = F.softmax(scores, dim=-1)

    # Weighted sum of values
    output = torch.matmul(weights, value)

    return output, weights

Multi-Head Attention: Multiple Perspectives

A single attention head produces just one weighting over positions per token, so it tends to capture one kind of relationship at a time. Multi-head attention runs multiple attention operations in parallel, each with different learned projections:

Input → [Head 1: syntactic] → ↘
Input → [Head 2: semantic]  → → Concatenate → Project → Output
Input → [Head 3: positional] → ↗
...

Each head can specialize:

One head might track subject-verb relationships
Another might attend to recent context
Another might handle coreference (pronouns to antecedents)

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Projection matrices
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        query: torch.Tensor,  # (batch, seq_len, d_model)
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor = None
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # Project and reshape: (batch, seq, d_model) -> (batch, num_heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Attention for all heads in parallel: (batch, num_heads, seq, d_k)
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads: (batch, seq, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )

        # Final projection
        return self.W_o(attn_output)

Head Count Guidelines:

Model	Heads	Head Dimension	Total d_model
BERT-base	12	64	768
GPT-2 medium	16	64	1024
GPT-3 175B	96	128	12288
LLaMA-7B	32	128	4096

More heads = more types of relationships, but each head is smaller. The d_model / num_heads tradeoff is usually empirical.

Causal Masking: Looking Only Backward

For autoregressive generation (predicting the next token), we must prevent position $i$ from attending to positions $> i$. The model can’t peek at future tokens.

def create_causal_mask(seq_length: int) -> torch.Tensor:
    """Create a lower-triangular causal mask.

    Position i can attend to positions 0, 1, ..., i
    """
    return torch.tril(torch.ones(seq_length, seq_length))

# Example for sequence length 4:
# [[1, 0, 0, 0],   ← position 0 only sees itself
#  [1, 1, 0, 0],   ← position 1 sees 0 and 1
#  [1, 1, 1, 0],   ← position 2 sees 0, 1, and 2
#  [1, 1, 1, 1]]   ← position 3 sees all

The mask is applied before softmax: positions with 0 get score $-\infty$, which softmax converts to probability 0.
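A minimal sketch of that masking step in plain Python, using uniform raw scores so the effect of the mask is easy to read. `causal_softmax` is a hypothetical helper, not part of the chapter's PyTorch code.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only see positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # always finite: the diagonal entry is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform raw scores: every visible position gets an equal share
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

Position 0 puts all of its weight on itself, position 1 splits evenly between positions 0 and 1, and so on; every entry above the diagonal is exactly zero.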

Self-Attention vs. Cross-Attention

Self-attention: Q, K, V all come from the same sequence. Each token attends to all tokens in its own sequence. This is the standard attention in GPT-style decoders.

Cross-attention: Q comes from one sequence (e.g., decoder), while K and V come from another (e.g., encoder). The decoder attends to encoder outputs. Used in:

Encoder-decoder models (T5, original Transformer)
Vision-language models (image encoder → text decoder)
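Some retrieval-augmented architectures (e.g., RETRO), where the decoder cross-attends to retrieved passages; most RAG systems instead place retrieved text directly in the prompt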

Attention Complexity: The Quadratic Challenge

Standard attention has time and memory complexity $O(n^2)$ where $n$ is sequence length. Computing the $n \times n$ score matrix is unavoidable in vanilla attention.

For short sequences (< 4K tokens), this is fine. For long contexts (32K-1M tokens), it becomes prohibitive:

100K tokens → 10 billion attention scores per head, per layer
At 96 heads and 96 layers (GPT-3 scale), that’s roughly 90 trillion scores per forward pass
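The arithmetic behind these figures is worth running once. The sketch below is a back-of-the-envelope calculation, assuming fp16 scores (2 bytes each) and the GPT-3 175B shape of 96 heads and 96 layers; both helper functions are hypothetical names.

```python
def attention_score_count(seq_len, num_heads=1, num_layers=1):
    """Number of entries across the n x n score matrices."""
    return seq_len * seq_len * num_heads * num_layers

def score_matrix_bytes(seq_len, bytes_per_value=2):
    """Memory to materialize one head's score matrix (fp16 by default)."""
    return seq_len * seq_len * bytes_per_value

for n in (4_096, 32_768, 100_000):
    per_head = attention_score_count(n)
    gib = score_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}  scores per head={per_head:.2e}  fp16 matrix={gib:,.2f} GiB")

# Assumed GPT-3 175B shape: 96 heads, 96 layers
total = attention_score_count(100_000, num_heads=96, num_layers=96)
print(f"{total:.2e} scores per forward pass")
```

Growing the context 8x, from 4K to 32K tokens, multiplies the per-head score matrix by 64x, which is why long contexts need either sparsity or a kernel that never stores the full matrix.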

FlashAttention (Dao et al., 2022) is the production solution: it computes exact attention, but tiles the computation so that the full $n \times n$ matrix is never materialized in GPU memory. The arithmetic is still $O(n^2)$; what changes is memory traffic, and attention memory drops from quadratic to linear in sequence length, which substantially extends practical context lengths.

Sparse attention patterns reduce complexity by limiting which positions attend to which:

Sliding window: Each position attends only to a local window
Global tokens: Certain tokens (like [CLS]) attend to and are attended by all positions
Strided patterns: Attend to every k-th token beyond the local window

def sliding_window_mask(seq_length: int, window_size: int) -> torch.Tensor:
    """Mask allowing attention only within a sliding window."""
    mask = torch.zeros(seq_length, seq_length)
    for i in range(seq_length):
        start = max(0, i - window_size // 2)
        end = min(seq_length, i + window_size // 2 + 1)
        mask[i, start:end] = 1
    return mask

Grouped-Query Attention (GQA): Memory Efficiency

The KV cache (storing K and V for generated tokens) dominates inference memory for long sequences. Standard multi-head attention stores separate K, V per head:

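\[\text{KV cache} = 2 \times \text{num\_layers} \times \text{num\_heads} \times \text{seq\_length} \times \text{head\_dim} \times \text{bytes\_per\_value}\]

The leading 2 accounts for storing both K and V; the total is per sequence, so multiply by batch size when serving several sequences at once.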

Multi-Query Attention (MQA) uses a single K, V shared across all heads. This dramatically reduces KV cache but loses some expressiveness.

Grouped-Query Attention (GQA) is the middle ground: groups of heads share K, V. LLaMA 2 70B uses 8 KV heads shared among 64 query heads (8:1 ratio).

class GroupedQueryAttention(nn.Module):
    """GQA: Multiple query heads share fewer key-value heads."""

    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.num_groups = num_heads // num_kv_heads
        self.head_dim = d_model // num_heads

        self.W_q = nn.Linear(d_model, num_heads * self.head_dim)
        self.W_k = nn.Linear(d_model, num_kv_heads * self.head_dim)  # Fewer K heads
        self.W_v = nn.Linear(d_model, num_kv_heads * self.head_dim)  # Fewer V heads
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        K = self.W_k(x).view(batch_size, seq_len, self.num_kv_heads, self.head_dim)
        V = self.W_v(x).view(batch_size, seq_len, self.num_kv_heads, self.head_dim)

        # Repeat K, V for each group of query heads
        K = K.repeat_interleave(self.num_groups, dim=2)
        V = V.repeat_interleave(self.num_groups, dim=2)

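        # Move heads ahead of the sequence axis: (batch, num_heads, seq_len, head_dim)
        Q, K, V = (t.transpose(1, 2) for t in (Q, K, V))

        # Standard attention, reusing scaled_dot_product_attention from above
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads and project: (batch, seq_len, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, seq_len, -1
        )
        return self.W_o(attn_output)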

GQA reduces KV cache by num_groups factor with minimal quality loss. It’s now standard in production models.
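To put numbers on the savings, here is a small calculator sketch. It assumes the LLaMA 2 70B shape (80 layers, 64 query heads, head dimension 128), fp16 storage, and a single 4K-token sequence; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values (the leading 2 covers K and V)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed LLaMA 2 70B shape: 80 layers, head_dim 128, fp16, 4K context
shape = dict(num_layers=80, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=64, **shape)  # one K/V head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **shape)   # 8 shared K/V heads
mqa = kv_cache_bytes(num_kv_heads=1, **shape)   # a single shared K/V head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.2f} GiB per 4K-token sequence")
```

Under these assumptions full multi-head attention would need 10 GiB of cache per sequence, while 8 KV heads need 1.25 GiB, which is the 8x reduction implied by the 8:1 grouping.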

Historical Context: The Evolution of Attention

2014 (Bahdanau Attention): Attention was introduced for RNN-based machine translation. For the first time, models could look back at any input position when generating each output word.

2017 (Transformer): Vaswani et al.’s “Attention Is All You Need” removed recurrence entirely. Self-attention let every position attend to every other position, enabling massive parallelization.

2019 (Multi-Query Attention): Shazeer proposed sharing a single set of keys and values across all attention heads, shrinking the KV cache by a factor equal to the number of heads (8x for an 8-head model) with only a small quality loss.

2022 (FlashAttention): Dao et al. restructured attention computation to be memory-efficient, enabling 2-4x longer contexts on the same hardware.

2023 (Grouped-Query Attention): Ainslie et al. found the sweet spot between full multi-head and multi-query attention. GQA is used in LLaMA 2 70B and Llama 3 and has become common in production models.

The pattern: Attention evolved from an add-on for RNNs to the core mechanism of modern AI. Each improvement focused on the same goal: maintain quality while reducing memory and compute costs.

The Transformer Architecture

Building the Complete Model

A transformer consists of stacked blocks, each containing:

1. Multi-head self-attention
2. Feed-forward network
3. Residual connections and layer normalization

Let’s build each component and understand why it’s there.

Residual Connections: The Gradient Highway

Without residual connections, gradients must flow through each layer sequentially. In deep networks (32-100+ layers), gradients can vanish or explode.

Residual connections provide a “gradient highway”: gradients can flow directly through the addition, bypassing the layer if needed.

# Without residual:
x = layer(x)  # Gradient flows through layer only

# With residual:
x = x + layer(x)  # Gradient flows through addition AND layer

Mathematically, if $f(x)$ is the layer function:

Without residual: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial x}$ (chain rule through all layers)
With residual: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial (x + f)} \left(1 + \frac{\partial f}{\partial x}\right)$ (gradient flows via the “1”)

The “+1” term ensures gradients don’t vanish even if $\frac{\partial f}{\partial x} \approx 0$.
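A toy calculation shows how much the “+1” matters in a deep stack. This sketch makes a deliberately crude assumption: every layer has the same small scalar derivative, so the backpropagated gradient is just a product of identical factors. `gradient_through_stack` is a hypothetical helper.

```python
def gradient_through_stack(local_grad, num_layers, residual):
    """Magnitude of dL/dx after backpropagating through `num_layers` layers,
    assuming every layer has the same scalar derivative df/dx = local_grad."""
    per_layer = (1.0 + local_grad) if residual else local_grad
    return per_layer ** num_layers

plain = gradient_through_stack(0.01, num_layers=50, residual=False)
skip = gradient_through_stack(0.01, num_layers=50, residual=True)
print(f"without residual: {plain:.1e}")
print(f"with residual:    {skip:.2f}")
```

Without the skip path the gradient is numerically zero after 50 layers; with it, the gradient stays of order 1. The same product can also grow when local derivatives are large, which is one reason residual connections are paired with normalization.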

Layer Normalization: Stabilizing Training

Layer normalization normalizes activations across the feature dimension, stabilizing training dynamics:

class LayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # Learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # Learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        # Biased variance with eps inside the square root, matching nn.LayerNorm
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

The intuition: ensure each layer receives inputs with consistent scale and shift, regardless of what previous layers produced.

Pre-norm vs. Post-norm:

The original Transformer used post-norm:

x = LayerNorm(x + Sublayer(x))

Modern models use pre-norm:

x = x + Sublayer(LayerNorm(x))

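Pre-norm is more stable during training. The difference: with pre-norm, normalization is applied only to the sublayer’s input, so the residual stream itself is never normalized and gradients can flow back through an unobstructed identity path. With post-norm, every residual addition passes through a LayerNorm, which makes very deep stacks harder to train without careful warmup.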

Feed-Forward Networks: Where Knowledge Lives

Each transformer block contains a feed-forward network (FFN) applied independently to each position:

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)    # Expand
        self.w2 = nn.Linear(d_ff, d_model)    # Project back
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w2(self.dropout(F.gelu(self.w1(x))))

The expansion ratio ($d_{ff} / d_{model}$) is typically 4x in classical transformers, ~2.7x in LLaMA (with SwiGLU).

Why is the FFN important?

Research suggests attention and FFN have distinct roles:

Attention: Routes information between positions, handles relationships and coreference
FFN: Transforms information at each position, stores factual knowledge

Experiments on knowledge localization and model editing suggest that “facts” like “The Eiffel Tower is in Paris” are stored largely in FFN weights: editing specific FFN weights can change what the model recalls about a fact.

SwiGLU Activation:

Modern models replace GELU with SwiGLU, a gated activation:

\[\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_3) W_2\]

Where $\text{Swish}(x) = x \cdot \sigma(x)$ and $\odot$ is element-wise multiplication.

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

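The gate ($\text{Swish}(xW_1)$) multiplies the linear branch $xW_3$ element-wise, giving the layer multiplicative control over information flow. In practice this tends to improve quality at a matched parameter count.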
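The third weight matrix explains why SwiGLU models use a smaller expansion ratio. The sketch below assumes a sizing rule of two-thirds of $4 \times d_{model}$, rounded up to a multiple of 256; that rounding rule is an assumption chosen because it reproduces LLaMA-7B's hidden size of 11008, and both function names are hypothetical.

```python
def ffn_params(d_model, d_ff, gated):
    """Weight count of one FFN: 2 matrices for GELU, 3 for SwiGLU (no biases)."""
    return (3 if gated else 2) * d_model * d_ff

def swiglu_hidden_size(d_model, multiple_of=256):
    """Two-thirds of 4 * d_model, rounded up to a multiple of `multiple_of`."""
    target = (2 * 4 * d_model) // 3
    return multiple_of * ((target + multiple_of - 1) // multiple_of)

d_model = 4096
d_ff_swiglu = swiglu_hidden_size(d_model)
gelu_params = ffn_params(d_model, 4 * d_model, gated=False)
swiglu_params = ffn_params(d_model, d_ff_swiglu, gated=True)

print(d_ff_swiglu, round(d_ff_swiglu / d_model, 2))
print(f"GELU FFN (4x):  {gelu_params:,}")
print(f"SwiGLU FFN:     {swiglu_params:,}")
```

The two parameter counts land within about 1% of each other: shrinking the hidden size by a third pays for the extra gate matrix.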

The Complete Transformer Block

Putting it together:

class TransformerBlock(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1
    ):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = SwiGLU(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Pre-norm attention with residual
        normed = self.norm1(x)
        attn_out = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_out)

        # Pre-norm feed-forward with residual
        x = x + self.dropout(self.feed_forward(self.norm2(x)))

        return x

Decoder-Only Architecture: The Modern Standard

The original Transformer was encoder-decoder. Modern LLMs (GPT, Claude, LLaMA) are decoder-only:

class TransformerLM(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model: int,
        num_heads: int,
        num_layers: int,
        d_ff: int,
        max_seq_length: int,
        dropout: float = 0.1
    ):
        super().__init__()

        # Embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_length, d_model)

        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        # Final layer norm and output projection
        self.norm = LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying: share embedding and output weights
        self.lm_head.weight = self.token_embedding.weight

        self._init_weights()

    def _init_weights(self):
        """Initialize weights for training stability."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)

    def forward(
        self,
        input_ids: torch.Tensor,  # (batch, seq_len)
        mask: torch.Tensor = None
    ) -> torch.Tensor:
        batch_size, seq_len = input_ids.shape
        device = input_ids.device

        # Create causal mask if not provided
        if mask is None:
            mask = torch.tril(torch.ones(seq_len, seq_len, device=device))

        # Embeddings: token + position
        positions = torch.arange(seq_len, device=device)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)

        # Transformer blocks
        for block in self.blocks:
            x = block(x, mask)

        # Output projection
        logits = self.lm_head(self.norm(x))  # (batch, seq_len, vocab_size)

        return logits

Why decoder-only dominates:

Simplicity: One architecture serves all tasks (generation, classification, QA)
Emergent capabilities: Tasks like translation emerge from next-token prediction
Efficient inference: Causal attention enables KV caching
Scale: A single uniform stack trained on one objective has proven straightforward to scale with compute

Encoder-Decoder vs. Decoder-Only

Encoder-decoder (T5, BART, original Transformer):

Encoder: bidirectional attention (sees full input)
Decoder: causal attention + cross-attention to encoder
Good for: translation, summarization with clear input/output separation

Encoder-only (BERT, RoBERTa):

Bidirectional attention throughout
Good for: classification, NER, semantic similarity
Cannot generate text naturally

Decoder-only (GPT, Claude, LLaMA):

Causal attention throughout
Good for: generation, and surprisingly, everything else too
Handles classification/similarity via prompting

Mixture of Experts (MoE): Sparse Scaling

Standard transformers use every parameter for every token. Mixture of Experts (MoE) architectures activate only a subset of parameters per token, enabling much larger models without proportionally more compute.

The key insight: replace the single feed-forward network (FFN) in each transformer block with multiple “expert” FFNs, plus a router that decides which experts process each token.

Standard FFN:          MoE Layer:

Input ──→ [FFN] ──→ Output     Input ──→ [Router] ──→ Expert 1 ──→┐
                                    │            ──→ Expert 2    ├──→ Output
                                    └────────────→  ...          │
                                                 ──→ Expert N ──→┘
                                                 (most inactive)

How routing works:

# Simplified router (real implementations are more complex)
def route_tokens(hidden_states, router_linear, top_k=2):
    # router_linear: an nn.Linear(d_model, num_experts) trained with the model
    # Router computes probability for each expert
    router_logits = router_linear(hidden_states)  # (batch, seq, num_experts)
    router_probs = F.softmax(router_logits, dim=-1)

    # Select top-k experts per token
    top_probs, top_indices = router_probs.topk(top_k, dim=-1)

    # Renormalize so each token's selected weights sum to 1 (a common choice)
    top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)

    # With top_k=2 of 8 experts, each token runs 1/4 of the expert FFN compute
    return top_indices, top_probs

Why MoE matters for AI engineers:

Frontier models use MoE: Mixtral and Gemini 1.5 are documented MoE models, and GPT-4 is widely reported, though not officially confirmed, to be one. Understanding MoE explains how a model can be very large (hundreds of billions of parameters or more) and still reasonably fast.
Total parameters vs. active parameters: Mixtral 8x7B has 47B total parameters but only ~13B active per forward pass. This means:
- Inference cost closer to a 13B model
- Memory for full model: 47B weights
- Quality approaching a 47B dense model
Load balancing challenges: If all tokens route to the same experts, training fails. MoE architectures include auxiliary losses to encourage balanced expert utilization.
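The total-versus-active split can be reproduced from the architecture alone. The sketch below assumes Mixtral 8x7B's published hyperparameters (hidden size 4096, expert FFN size 14336, 32 layers, 32 query heads with 8 KV heads, vocabulary 32000) and ignores normalization weights; `moe_param_counts` is a hypothetical helper.

```python
def moe_param_counts(d_model, d_ff, num_layers, num_heads, num_kv_heads,
                     head_dim, num_experts, top_k, vocab_size):
    """Return (total, active-per-token) weight counts, ignoring norms and biases."""
    expert = 3 * d_model * d_ff                            # SwiGLU: w1, w2, w3
    attention = (2 * d_model * num_heads * head_dim        # W_q and W_o
                 + 2 * d_model * num_kv_heads * head_dim)  # W_k and W_v (GQA)
    router = d_model * num_experts
    embeddings = 2 * vocab_size * d_model                  # untied input + output

    total = num_layers * (num_experts * expert + attention + router) + embeddings
    active = num_layers * (top_k * expert + attention + router) + embeddings
    return total, active

# Assumed Mixtral 8x7B hyperparameters
total, active = moe_param_counts(
    d_model=4096, d_ff=14336, num_layers=32, num_heads=32, num_kv_heads=8,
    head_dim=128, num_experts=8, top_k=2, vocab_size=32000,
)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

The name "8x7B" suggests 56B parameters, but only the expert FFNs are replicated; attention and embeddings are shared, which is why the total comes out near 47B and the active count near 13B.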

Model	Architecture	Total Params	Active Params	Experts
Mixtral 8x7B	MoE	47B	~13B	8 (top-2)
Mixtral 8x22B	MoE	141B	~39B	8 (top-2)
GPT-4 (reported)	MoE	~1.8T	~220B	16 (top-2)
LLaMA 2 70B	Dense	70B	70B	N/A

MoE trade-offs:

Pro: More capacity at similar inference cost
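Pro: Experts can specialize, though analyses of trained models such as Mixtral suggest the specialization often follows token-level or syntactic patterns rather than clean topics like code or math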
Con: Higher memory—all expert weights must be loaded
Con: Harder to fine-tune—expert specialization can be disrupted
Con: Communication overhead in distributed training

When you’ll encounter MoE:

Inference optimization: Expert offloading, expert parallelism
Model selection: Understanding why Mixtral competes with larger dense models
Fine-tuning decisions: MoE models may need different fine-tuning strategies

Scaling Laws: More is (Usually) Better

Staff Engineer Perspective

“Scaling laws are useful for understanding trends, but dangerous for planning specific projects. ‘We’ll just use a bigger model’ became a crutch that let teams avoid fixing actual problems. I’ve seen teams throw GPT-4 at problems that GPT-3.5 with better prompts solved perfectly. Understand scaling laws, but don’t use them as an excuse to skip optimization.”

— ML Platform Architect at large tech company

Kaplan et al. (2020) and Hoffmann et al. (2022, “Chinchilla”) established that model performance follows predictable power laws:

\[L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_{\infty}\]

Where:

$L$ is loss (lower is better)
$N$ is parameter count
$D$ is training tokens
$N_c, D_c, \alpha_N, \alpha_D$ are fitted constants
$L_{\infty}$ is the irreducible loss that remains even with unlimited parameters and data

Key insights:

Smooth improvement: Loss decreases predictably with more parameters and data
Compute-optimal scaling (Chinchilla): Training tokens should be approximately 20x the parameter count. A 10B model should see ~200B tokens.
No plateau yet: Returns diminish as a power law, but loss has kept improving at frontier scales with no hard ceiling observed so far

Practical implications:

If you have a compute budget, split it between model size and training data according to Chinchilla
For inference, smaller well-trained models often beat larger undertrained ones
You can predict performance before training, enabling better planning
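The Chinchilla rule turns into a two-line planning calculation. The sketch below combines it with the common approximation that training costs about $6ND$ FLOPs, which is an assumption not stated above; solving $C = 6ND$ with $D = 20N$ gives $N = \sqrt{C / 120}$. `compute_optimal_split` is a hypothetical helper.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget C into (params N, tokens D) with C = 6*N*D and D = 20*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Budget equal to training a 70B-parameter model on 1.4T tokens
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
print(f"budget = {budget:.2e} FLOPs")
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")
```

Because both $N$ and $D$ scale with the square root of compute, a 100x larger budget buys only a 10x larger compute-optimal model.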

Parameter Counting: Know Your Model

Understanding parameter counts helps with hardware planning:

from types import SimpleNamespace

def count_parameters(config):
    """Count parameters in a transformer LM (weights only; biases ignored)."""
    params = {}

    # Token embeddings
    params['token_embed'] = config.vocab_size * config.d_model

    # Position embeddings (only if learned; rotary-position models have none)
    if getattr(config, 'learned_pos_embed', True):
        params['pos_embed'] = config.max_seq_length * config.d_model

    # Per-layer parameters
    layer_params = {}
    layer_params['attention'] = 4 * config.d_model * config.d_model  # Q, K, V, O
    layer_params['ffn'] = 2 * config.d_model * config.d_ff  # w1, w2
    if getattr(config, 'use_swiglu', False):
        layer_params['ffn'] = 3 * config.d_model * config.d_ff  # w1, w2, w3
    layer_params['layer_norm'] = 4 * config.d_model  # Two layer norms (scale + shift)

    params['layers'] = sum(layer_params.values()) * config.num_layers

    # Output projection (often tied with embedding)
    params['lm_head'] = config.vocab_size * config.d_model

    total = sum(params.values())
    if config.tie_weights:
        total -= params['lm_head']  # Subtract if tied

    return total, params

# Example: LLaMA-7B (rotary positions, untied output projection)
llama_config = SimpleNamespace(
    vocab_size=32000, d_model=4096, num_layers=32,
    d_ff=11008, max_seq_length=2048, use_swiglu=True,
    learned_pos_embed=False, tie_weights=False
)
total, breakdown = count_parameters(llama_config)
print(f"{total / 1e9:.2f}B")  # 6.74B

Memory estimation:

Component	Formula	Example (7B, 4K context)
Model weights (fp16)	params * 2 bytes	14 GB
KV cache (fp16)	2 * layers * heads * head_dim * seq_len * 2 bytes	~2.1 GB per sequence
Activations	~2x weights during forward	28 GB
Inference total	weights + KV cache	~16 GB
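Training total	~16 bytes per parameter (mixed-precision Adam: fp16 weights and gradients, fp32 master weights and optimizer states)	~112 GB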

Training and Pre-training

The Pre-training Objective

Modern LLMs are trained on a deceptively simple objective: predict the next token.

Given a sequence $[x_1, x_2, ..., x_n]$, the model predicts $P(x_i | x_1, ..., x_{i-1})$ for all positions. The loss is cross-entropy:

\[\mathcal{L} = -\sum_{i=1}^{n} \log P(x_i | x_1, ..., x_{i-1})\]

def compute_lm_loss(model, input_ids):
    """Standard causal language modeling loss."""
    logits = model(input_ids)  # (batch, seq_len, vocab_size)

    # Shift for next-token prediction:
    # - logits[:, :-1, :] predicts tokens 1, 2, ..., n-1
    # - input_ids[:, 1:] are the targets (tokens 1, 2, ..., n-1)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()

    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1)
    )
    return loss

Why next-token prediction works:

To predict text well, the model must learn:

Grammar and syntax (“The” is often followed by a noun or adjective)
Factual knowledge (“The capital of France is…” → “Paris”)
Reasoning patterns (“If all A are B, and X is an A, then X is…”)
Style and register (formal vs. casual, technical vs. conversational)
Common sense (people go to restaurants to eat, not to sleep)

All of this emerges from the simple objective of predicting what comes next.
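A tiny worked example makes the loss tangible. The sketch below assumes a hypothetical model that assigned probabilities 0.5, 0.25, and 0.125 to the correct next token at three positions, and averages the per-token loss (as `F.cross_entropy` does by default) rather than summing it.

```python
import math

def next_token_loss(correct_token_probs):
    """Mean negative log-likelihood of the correct next tokens, in nats."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)

# Probabilities a hypothetical model assigned to the correct token at 3 positions
probs = [0.5, 0.25, 0.125]
loss = next_token_loss(probs)
perplexity = math.exp(loss)
print(f"loss = {loss:.3f} nats, perplexity = {perplexity:.2f}")
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens. Training pushes this number down by raising the probability of whatever token actually came next.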

Masked Language Modeling (BERT-style)

An alternative training objective: randomly mask 15% of tokens and predict them:

def compute_mlm_loss(model, input_ids, mask_token_id, pad_token_id, mask_prob=0.15):
    """Masked language modeling loss (simplified: every selected token becomes [MASK])."""
    # Choose ~mask_prob of the non-padding positions
    rand = torch.rand(input_ids.shape, device=input_ids.device)
    mask = (rand < mask_prob) & (input_ids != pad_token_id)

    # Apply mask
    masked_input = input_ids.clone()
    masked_input[mask] = mask_token_id

    # Predict masked tokens; logits[mask] has shape (num_masked, vocab_size)
    logits = model(masked_input)
    loss = F.cross_entropy(logits[mask], input_ids[mask])
    return loss

MLM enables bidirectional context (the model sees tokens on both sides of the mask), which is useful for understanding tasks. But it can’t generate text naturally.

From Pre-training to Instruction Tuning

Raw pre-trained models complete text, but they don’t follow instructions or hold conversations. Instruction tuning bridges this gap:

Pre-training data:

The cat sat on the mat. It was a warm day and the sun was shining...

Instruction tuning data:

{"instruction": "Summarize the following text.",
 "input": "The cat sat on the mat. It was a warm day...",
 "output": "A cat rests on a mat on a sunny day."}

The model learns to map instructions to appropriate outputs, becoming useful as an assistant.

RLHF and Alignment

Even instruction-tuned models can be unhelpful, harmful, or misaligned. Reinforcement Learning from Human Feedback (RLHF) further refines behavior:

Collect comparisons: Humans rank multiple model outputs for the same prompt
Train reward model: Learn to predict which outputs humans prefer
Optimize policy: Use PPO or similar to maximize reward, with a KL penalty that keeps the policy close to the model it started from
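The reward-model step is usually trained with a pairwise preference loss. The sketch below shows one common formulation, the negative log-sigmoid of the reward margin (a Bradley-Terry style objective); the source does not specify which loss is used, so treat this as an illustrative assumption, with scalar rewards standing in for a real reward model's outputs.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as a stable softplus."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = log(1 + exp(-margin)), rearranged to avoid overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

agree = preference_loss(2.0, 0.0)     # reward model agrees with the human ranking
neutral = preference_loss(0.0, 0.0)   # no preference: loss is log(2)
disagree = preference_loss(0.0, 2.0)  # reward model disagrees: larger loss
print(f"{agree:.3f}  {neutral:.3f}  {disagree:.3f}")
```

Minimizing this loss pushes the reward of the human-preferred output above the rejected one, and only the difference between the two rewards matters.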

Constitutional AI (Anthropic’s approach) uses AI feedback: the model critiques its own outputs against a set of principles, generating training data for self-improvement.

These techniques are crucial for making models safe, helpful, and honest, but they’re outside the scope of this foundations chapter.

Training at Scale: Practical Considerations

Compute requirements:

Model Size	Training Tokens	GPU Hours (A100)	Approximate Cost
7B	1T	100,000	$150K
13B	1T	200,000	$300K
70B	2T	1,500,000	$2M
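Figures of this size can be sanity-checked from first principles. The sketch below assumes the $6ND$ training-FLOPs rule of thumb, an A100 peak of 312 TFLOP/s for dense 16-bit math, 40% hardware utilization, and a price of $1.50 per GPU-hour; the utilization and price are illustrative assumptions, and `training_gpu_hours` is a hypothetical helper.

```python
def training_gpu_hours(params, tokens, peak_flops=312e12, utilization=0.4):
    """GPU-hours from the ~6 * N * D training-FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    seconds = total_flops / (peak_flops * utilization)
    return seconds / 3600

for params, tokens in ((7e9, 1e12), (13e9, 1e12), (70e9, 2e12)):
    hours = training_gpu_hours(params, tokens)
    cost = hours * 1.50  # assumed price in dollars per A100-hour
    print(f"{params / 1e9:.0f}B on {tokens / 1e12:.0f}T tokens: "
          f"{hours:,.0f} GPU-hours, ~${cost:,.0f}")
```

The estimates land in the same range as the table, and they move in direct proportion to the utilization you assume, which is why achieved throughput matters as much as raw GPU count.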

When to fine-tune vs. prompt:

Situation	Recommendation
Few examples, simple task	Few-shot prompting
Domain-specific vocabulary	Fine-tune or RAG
Specific output format	Fine-tune
Need updated knowledge	RAG + prompting
Behavioral changes	RLHF or fine-tuning

Common Mistake: Fine-tuning Before Trying Prompting

What people do: Immediately jump to fine-tuning when a model doesn’t perform well on their task, spending weeks collecting training data and running training jobs.

Why it fails: Fine-tuning is expensive (compute, data collection, evaluation) and creates maintenance burden (model versioning, retraining as base models improve). Many tasks that seem to require fine-tuning can be solved with better prompting, few-shot examples, or retrieval-augmented generation.

Fix: Follow this hierarchy: (1) Improve your prompt—be specific, provide examples, use chain-of-thought. (2) Add few-shot examples—often 3-5 examples unlock capability. (3) Try RAG—if the issue is knowledge, retrieval is cheaper than training. (4) Use a larger model—capabilities increase with scale. Only fine-tune after exhausting these options, or when you need specific output formats, domain vocabulary, or behavioral changes that prompting cannot achieve.

Parameter-efficient fine-tuning (LoRA, QLoRA):

Instead of updating all parameters, add small trainable adapters:

import math

import torch
import torch.nn as nn

# Conceptual LoRA: low-rank updates to weight matrices
# Original: y = xW
# LoRA: y = x(W + AB) where A is (d_in, r) and B is (r, d_out), r << d

class LoRALinear(nn.Module):
    def __init__(self, original: nn.Linear, rank: int = 16):
        super().__init__()
        self.original = original
        for param in self.original.parameters():
            param.requires_grad = False  # Freeze weight and bias

        # nn.Linear stores its weight as (out_features, in_features)
        d_out, d_in = original.weight.shape
        # A starts random and B starts at zero, so the adapter is a no-op at init
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) / math.sqrt(rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))

    def forward(self, x):
        return self.original(x) + x @ self.lora_A @ self.lora_B

LoRA reduces trainable parameters by 100-1000x while achieving similar performance to full fine-tuning for many tasks.
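
To see where the reduction comes from, it helps to count parameters for a single weight matrix. This is a minimal sketch; the helper name `lora_param_counts` and the 4096-wide layer are illustrative assumptions.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters: full update vs. LoRA adapter for one matrix."""
    full = d_in * d_out              # every entry of W is trainable
    lora = rank * (d_in + d_out)     # A is (d_in, rank), B is (rank, d_out)
    return full, lora, full / lora


full, lora, ratio = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  reduction: {ratio:.0f}x")
```

For this one matrix the reduction is 128x. Across a whole model the factor is usually larger, because adapters are typically attached to only some of the weight matrices.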

Inference and Generation

The Autoregressive Generation Loop

Text generation is iterative: generate one token, append it to the input, generate the next.

def generate(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 100,
    temperature: float = 1.0,
    top_p: float = 0.9
) -> str:
    """Generate text autoregressively."""
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids)[:, -1, :]  # Last position only

        # Apply temperature
        logits = logits / temperature

        # Top-p (nucleus) sampling
        next_token = top_p_sample(logits, top_p)

        # Append and continue
        input_ids = torch.cat([input_ids, next_token], dim=-1)

        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0])

Each iteration requires a forward pass through the entire model. A forward pass costs roughly 2 FLOPs per parameter per token, so for a 70B model generating 100 tokens, that’s 100 forward passes of about 140 billion operations each.

Temperature: Controlling Randomness

Temperature scales logits before softmax, controlling the “sharpness” of the distribution:

\[P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]

Temperature	Effect	Use Case
T → 0	Greedy (argmax)	Deterministic, factual
T = 0.5	Focused	Coherent creative
T = 1.0	Model distribution	General use
T = 2.0	High entropy	Brainstorming, diversity

def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    if temperature <= 0:
        raise ValueError("Temperature must be positive; use greedy decoding for T=0")
    return logits / temperature

# Low temperature concentrates probability on top tokens
# High temperature flattens the distribution
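
The effect is easy to see on a toy distribution. The sketch below uses only the standard library; the function name `softmax_with_temperature` and the three example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T, with the usual max-subtraction for stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The top token's probability shrinks as T grows, while the ordering of the tokens never changes: temperature reshapes the distribution without reranking it.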

Top-k and Top-p Sampling

Top-k sampling restricts to the k most likely tokens:

def top_k_sample(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    top_k_logits, top_k_indices = torch.topk(logits, k, dim=-1)
    probs = F.softmax(top_k_logits, dim=-1)
    sampled_idx = torch.multinomial(probs, num_samples=1)
    return top_k_indices.gather(-1, sampled_idx)

Problem: k is fixed, but the “good” vocabulary size varies. For “The capital of France is ”, there’s really only one good answer. For “I feel ”, many tokens are valid.

Top-p (nucleus) sampling adapts to the distribution, keeping the smallest set of tokens whose cumulative probability exceeds p:

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Find cutoff index
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    sorted_logits[sorted_indices_to_remove] = float('-inf')
    probs = F.softmax(sorted_logits, dim=-1)

    sampled = torch.multinomial(probs, num_samples=1)
    return sorted_indices.gather(-1, sampled)

Top-p is a common default in production systems, often combined with temperature.
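
The adaptive behavior can be shown without a model. In this sketch the two probability lists are invented for illustration, and `nucleus` is a hypothetical helper that returns the indices top-p would keep.

```python
def nucleus(probs, p=0.9):
    """Indices of the highest-probability tokens kept by top-p filtering."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total > p:        # stop once cumulative probability exceeds p
            break
    return kept


peaked = [0.95, 0.02, 0.01, 0.01, 0.01]   # one clearly correct continuation
flat = [0.2, 0.2, 0.2, 0.2, 0.2]          # many equally plausible continuations
print(len(nucleus(peaked)), len(nucleus(flat)))
```

With the same p, the peaked distribution keeps a single token while the flat one keeps all five, which is the adaptivity a fixed k cannot provide.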

KV Caching: The Essential Optimization

Without caching, generating token $n$ requires recomputing the keys and values for all $n$ tokens and running attention over them. But tokens 1 through $n-1$ don’t change between iterations, so their K and V values are constant.

KV caching stores computed K and V values, reusing them:

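import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.k_cache = None
        self.v_cache = None

    def _split_heads(self, t):
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        batch_size, seq_len, _ = t.shape
        return t.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x, use_cache=True):
        # x holds only the new tokens: the full prompt first, then one token per step
        batch_size, seq_len, _ = x.shape

        # Compute Q, K, V for new tokens only
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        if use_cache and self.k_cache is not None:
            # Append new K, V to cache
            K = torch.cat([self.k_cache, K], dim=1)
            V = torch.cat([self.v_cache, V], dim=1)

        if use_cache:
            self.k_cache = K.detach()
            self.v_cache = V.detach()

        # Attention: Q attends to all K (including cached). Requires PyTorch 2.0+.
        # Only prefill from an empty cache needs a causal mask; a single new token sees all.
        is_prefill = seq_len > 1 and K.size(1) == seq_len
        out = F.scaled_dot_product_attention(
            self._split_heads(Q), self._split_heads(K), self._split_heads(V),
            is_causal=is_prefill,
        )
        return self.W_o(out.transpose(1, 2).reshape(batch_size, seq_len, -1))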

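KV caching reduces the number of token positions run through the model from $O(n^2)$ to $O(n)$ (for n output tokens): without the cache, every step reprocesses the whole sequence, while with it each step processes only the new token. Attention for that token still reads every cached key and value, so per-step attention cost grows linearly with context length.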
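
A quick count makes the difference concrete. This is a back-of-the-envelope sketch, not a profiler: `positions_processed` is a hypothetical helper that counts how many token positions pass through the model's layers over a whole generation.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model while generating new_tokens."""
    total = 0
    context = prompt_len
    for step in range(new_tokens):
        if use_cache:
            # Prefill the prompt once, then feed only the latest token
            total += prompt_len if step == 0 else 1
        else:
            # Re-run the model over the entire sequence so far
            total += context
        context += 1
    return total


print(positions_processed(10, 100, use_cache=False))  # quadratic growth
print(positions_processed(10, 100, use_cache=True))   # linear growth
```

For a 10-token prompt and 100 generated tokens, the uncached loop processes 5,950 positions against 109 with the cache.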

Speculative Decoding: Parallelizing Generation

Generation is bottlenecked by sequential token generation. Speculative decoding uses a small, fast “draft” model to generate candidates, then verifies them in parallel with the large model:

import torch

def speculative_decode(
    large_model,
    draft_model,
    input_ids,
    n_draft: int = 4
):
    """Generate faster using speculation (greedy variant, batch size 1)."""
    prompt_len = input_ids.shape[1]

    # Draft model proposes n_draft tokens quickly
    draft_tokens = draft_model.generate(input_ids, max_new_tokens=n_draft)
    n_proposed = draft_tokens.shape[1] - prompt_len  # May be < n_draft on EOS

    # Large model scores all proposed tokens in parallel
    with torch.no_grad():
        logits = large_model(draft_tokens)

    # logits[0, t] predicts the token at position t + 1
    large_preds = logits[0, prompt_len - 1:].argmax(dim=-1)

    # Accept draft tokens until one is rejected
    n_accepted = 0
    for i in range(n_proposed):
        if draft_tokens[0, prompt_len + i] == large_preds[i]:
            n_accepted += 1
        else:
            break

    # Return accepted tokens + one from large model: its correction at the
    # first mismatch, or a bonus token if every draft token was accepted
    new_tokens = large_preds[: n_accepted + 1]
    return torch.cat([input_ids, new_tokens.unsqueeze(0)], dim=-1)

When draft and target models agree, you get multiple tokens for one target model forward pass. Speed improvements of 2-3x are common.
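
How much speculation helps depends on how often the draft model is right. A rough estimate, under the simplifying assumption that each draft token is accepted independently with probability `alpha` (a modeling assumption, not a measured quantity), is the truncated geometric sum below. The function name is illustrative.

```python
def expected_tokens_per_pass(alpha, n_draft):
    """Expected tokens emitted per large-model pass with n_draft proposals.

    Assumes independent acceptance with probability alpha per draft token.
    The large model always contributes one token (a correction or a bonus).
    """
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1 - alpha ** (n_draft + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, n_draft=4), 2))
```

The end-to-end speedup is lower than this figure because the draft model's own forward passes are not free.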

Batching and Continuous Batching

Static batching waits for all sequences in a batch to complete:

Sequence 1: [tokens...] [pad] [pad] [pad]  <- waiting
Sequence 2: [tokens...tokens...tokens...]  <- still generating

This wastes compute on padding.

Continuous batching dynamically adds/removes sequences:

t=0: [seq1, seq2, seq3] all generating
t=5: seq1 done, add seq4: [seq2, seq3, seq4]
t=8: seq3 done, add seq5: [seq2, seq4, seq5]

This maximizes GPU utilization for serving systems.
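
The gain can be illustrated with a toy scheduler that counts decode steps. This sketch ignores prefill cost and memory limits, and both function names and the request lengths are invented for illustration.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next decode step."""
    queue = list(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Sequences with one token left finish on this step and leave the batch
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 10, 3, 9]  # output tokens per request
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

With two slots, the same four requests take 19 steps under static batching and 14 under continuous batching, because short requests no longer hold a slot idle while a long neighbor finishes.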

Production Inference Systems

Modern inference frameworks (vLLM, TensorRT-LLM, TGI) combine:

KV caching with memory management (PagedAttention)
Continuous batching
Speculative decoding
Quantization (int8, int4)
Tensor parallelism across GPUs

These optimizations can improve throughput 10-100x compared to naive implementations.

Practical Exercises

Exercise 1: Build a BPE Tokenizer

Implement BPE from scratch:

class MinimalBPE:
    """A minimal BPE tokenizer for learning."""

    def __init__(self):
        self.vocab = {}        # token -> id
        self.merges = []       # list of (a, b) merge operations

    def train(self, text: str, num_merges: int = 100):
        """Train on text, learning num_merges merge operations."""
        # 1. Initialize vocab with characters
        # 2. Repeat num_merges times:
        #    a. Count all adjacent pairs
        #    b. Find most frequent pair
        #    c. Merge it: add to vocab, replace in text
        #    d. Record the merge
        pass

    def encode(self, text: str) -> list[int]:
        """Apply learned merges to encode text."""
        # 1. Split into characters
        # 2. Apply merges in order
        # 3. Convert to IDs
        pass

    def decode(self, ids: list[int]) -> str:
        """Convert IDs back to text."""
        pass

Verification:
1. Train on a paragraph with 100 merges
2. Check that decode(encode(text)) == text
3. Examine which pairs merged first (should be common pairs like “th”, “in”)
4. Test on new text: common words should be single tokens

Exercise 2: Implement Attention from Scratch

def attention(Q, K, V, mask=None):
    """
    Implement scaled dot-product attention.

    Args:
        Q: (batch, seq_q, d_k)
        K: (batch, seq_k, d_k)
        V: (batch, seq_k, d_v)
        mask: (seq_q, seq_k) causal mask

    Returns:
        output: (batch, seq_q, d_v)
        weights: (batch, seq_q, seq_k)
    """
    pass

Verification:
1. Weights should sum to 1 along the last dimension
2. Without scaling, high-d_k should produce nearly one-hot weights
3. With causal mask, weights above the diagonal should be zero
4. Visualize attention for a real sentence

Exercise 3: Memory Estimation

For LLaMA-7B (vocab=32000, d_model=4096, layers=32, heads=32, d_ff=11008):

Calculate by hand:
1. Total parameters
2. Model memory in fp16
3. KV cache for 4K context
4. Minimum GPU memory for inference

Compare to the actual LLaMA-7B requirements.

Exercise 4: Generation with Sampling

Implement a complete generation loop:

def generate_with_sampling(
    model,
    tokenizer,
    prompt: str,
    max_tokens: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.95
) -> str:
    """Generate text with temperature and top-p sampling."""
    pass

Verification:
1. Temperature=0 should be deterministic
2. Higher temperature should increase diversity
3. Compare outputs to the model’s built-in generate method

Key Takeaways

Tokenization determines how models see text - Subword algorithms like BPE convert text to discrete tokens, and token boundaries directly affect model behavior. Mismatched tokenizers cause silent failures.
Attention is the core transformer innovation - The queries-keys-values mechanism enables parallel processing and direct connections between all token pairs, with multi-head attention capturing different relationship types.
Scale follows predictable power laws - Model capability scales predictably with parameters, data, and compute. Understanding these relationships helps make informed architecture and training decisions.
Generation is memory-bandwidth bound - During autoregressive decoding, GPUs spend most time loading weights from memory. KV caching, batching, and quantization are essential optimizations for efficient inference.
Training objectives shape capabilities - Next-token prediction enables models to acquire grammar, knowledge, and reasoning. Instruction tuning and RLHF transform text predictors into useful assistants.

Summary

This chapter established the foundational concepts underlying modern large language models:

Tokenization converts text to discrete tokens through subword algorithms like BPE. The tokenizer determines how text is represented to the model. Token boundaries affect model behavior in subtle ways, and mismatched tokenizers cause silent failures.

Embeddings are learned lookup tables that place tokens in continuous vector spaces. Through training, semantically similar tokens cluster together. Contextual embeddings from transformers vary based on surrounding context, enabling disambiguation.

Attention is the core innovation that enables parallel processing and direct connections between all token pairs. The queries-keys-values framework implements content-based lookup, and multi-head attention captures multiple types of relationships. The $\sqrt{d_k}$ scaling is crucial for training stability.

The Transformer architecture stacks attention and feed-forward layers with residual connections and layer normalization. Decoder-only architectures dominate modern LLMs. Scale (parameters, data, compute) determines capability, following predictable power laws.

Training objectives shape what models learn. Next-token prediction is remarkably powerful, enabling models to acquire grammar, knowledge, and reasoning from the single objective of predicting what comes next. Instruction tuning and RLHF transform text predictors into useful assistants.

Inference and generation is where models produce outputs. Temperature and sampling strategies control the quality-diversity tradeoff. KV caching is essential for efficient generation. Modern inference systems combine many optimizations for practical deployment.

Understanding these components mechanically enables you to:

Debug unexpected model behavior
Estimate computational requirements
Make informed architecture and configuration decisions
Reason about what models can and cannot do

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] Why does the scaling factor $\sqrt{d_k}$ appear in the attention equation? What would happen if we removed it?

Answer

As the dimension $d_k$ increases, the dot products between Q and K vectors grow larger in magnitude: if the components have unit variance, the dot product has variance $d_k$. Large inputs saturate the softmax, pushing the weights toward one-hot and making gradients very small. Dividing by $\sqrt{d_k}$ brings the variance back to approximately 1, regardless of dimension. Without it, training would be unstable, especially for models with larger head dimensions.
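
The variance claim can be checked empirically. The sketch below draws random query and key vectors with independent standard-normal components (an idealized assumption; trained Q and K are not exactly like this) and measures the variance of their dot products with and without scaling. The helper name is illustrative.

```python
import math
import random


def dot_product_variance(d_k, scaled, samples=2000, seed=0):
    """Sample variance of q.k for random unit-variance q and k of dimension d_k."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        dot = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d_k))
        values.append(dot / math.sqrt(d_k) if scaled else dot)
    mean = sum(values) / samples
    return sum((v - mean) ** 2 for v in values) / samples


print("unscaled:", round(dot_product_variance(64, scaled=False), 1))
print("scaled:  ", round(dot_product_variance(64, scaled=True), 2))
```

The unscaled variance lands near $d_k = 64$ and the scaled variance near 1, up to sampling noise.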

Q2. [IC2] A 7B parameter model uses 32 attention heads with dimension 128. If you switch to GQA with 4 KV heads, how much memory does this save for the KV cache during generation with a batch of 16 sequences at 4K context?

Answer

Original KV cache: 2 × 32 layers × 4096 context × 32 heads × 128 dim × 2 bytes × 16 batch = 34.4 GB
With GQA (4 KV heads instead of 32): 2 × 32 × 4096 × 4 × 128 × 2 × 16 = 4.3 GB
Savings: ~30 GB (8x reduction in KV cache). This is why GQA is essential for long-context serving.
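
The same arithmetic as a reusable function, which is handy for checking other configurations. The name `kv_cache_bytes` is illustrative, and the 32-layer figure is the one assumed in the answer above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch, bytes_per_value=2):
    """KV cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_value


mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, context=4096, batch=16)
gqa = kv_cache_bytes(layers=32, kv_heads=4, head_dim=128, context=4096, batch=16)
print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB, saved: {(mha - gqa) / 1e9:.1f} GB")
```

Only the number of KV heads changes between the two calls; the query heads do not appear in the formula because queries are never cached.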

Q3. [Senior] Explain why a model might produce different outputs for “The cat sat on the mat” vs. the same sentence typed with extra spaces between the words. What determines this behavior?

Answer

Tokenization determines this. BPE and similar tokenizers process whitespace as part of tokens. Extra spaces create different token sequences. “The” might be one token, but “ The” (with leading space) is typically a different token. Multiple spaces may tokenize into space tokens + word tokens differently than single spaces. The model may rarely or never have seen certain space patterns during training, so its behavior on them is unpredictable. This is why input normalization matters.

Q4. [Senior] During inference, why is the prefill phase compute-bound while the decode phase is memory-bandwidth bound?

Answer

Prefill processes all input tokens in parallel: it’s a dense matrix-matrix multiplication (weights × all token embeddings). This has high arithmetic intensity (many FLOPs per byte loaded). Decode generates one token at a time, which is a matrix-vector multiplication (weights × single token embedding). The entire model weights must be loaded from memory for each token, but each weight element takes part in only one multiply-add per load. The arithmetic intensity drops dramatically, making memory bandwidth the bottleneck.
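
A rough roofline-style calculation for a single linear layer shows the gap. The hardware numbers are approximate assumptions for an A100 (about 312 TFLOP/s dense BF16 and about 2 TB/s of memory bandwidth), and the function name is illustrative.

```python
def arithmetic_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul in fp16."""
    flops = 2 * tokens * d_in * d_out                       # one multiply + one add
    values_moved = d_in * d_out + tokens * d_in + tokens * d_out
    return flops / (bytes_per_value * values_moved)


ridge = 312e12 / 2e12   # assumed A100 balance point, about 156 FLOPs per byte
decode = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)
prefill = arithmetic_intensity(tokens=2048, d_in=4096, d_out=4096)
print(f"decode: {decode:.2f}  prefill: {prefill:.0f}  ridge: {ridge:.0f}")
```

Decode sits near 1 FLOP per byte, far below the balance point, so the GPU waits on memory. Prefill with 2,048 tokens sits well above it, so compute is the limit. Batching decode requests raises `tokens` and moves decode toward the compute-bound regime.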

Q5. [Staff] How do scaling laws inform the decision between training a smaller model longer vs. a larger model for fewer steps? What practical constraints might override the compute-optimal recommendation?

Answer

Chinchilla scaling laws show compute-optimal training has roughly 20 tokens per parameter. However: (1) Inference costs favor smaller models—a 2x larger model costs 2x more per inference forever, while training is one-time. High-volume production may justify over-training smaller models. (2) Hardware constraints—you may not have enough GPUs for a larger model. (3) Latency requirements—smaller models are faster. (4) Capability thresholds—some capabilities may only emerge at certain scales, making a larger model necessary regardless of compute efficiency.
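
The 20-tokens-per-parameter rule of thumb translates directly into a token and compute budget. This sketch combines it with the 6 FLOPs per parameter per token approximation; the function name is illustrative and the ratio is a rounded heuristic, not an exact law.

```python
def chinchilla_budget(params, tokens_per_param=20):
    """Compute-optimal token count and training FLOPs for a given model size."""
    tokens = tokens_per_param * params
    flops = 6 * params * tokens
    return tokens, flops


for params in (7_000_000_000, 70_000_000_000):
    tokens, flops = chinchilla_budget(params)
    print(f"{params / 1e9:.0f}B params -> {tokens / 1e12:.2f}T tokens, {flops:.2e} FLOPs")
```

A 7B model trained on 2T tokens uses roughly 286 tokens per parameter, far past the compute-optimal point. That is the deliberate over-training trade described in point (1): more training compute in exchange for cheaper inference.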

Spot the Problem

Problem 1. [IC2] Find the issue in this attention implementation:

def broken_attention(Q, K, V):
    scores = torch.matmul(Q, K)  # Q @ K
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V)

Answer

Two issues: (1) K should be transposed—attention computes Q @ K^T, not Q @ K. The shapes won’t match otherwise (Q is [batch, seq, d_k] and K is also [batch, seq, d_k]). (2) Missing the scaling factor √d_k. Correct: scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(K.size(-1))

Problem 2. [Senior] A team fine-tuned a model and sees good training loss but terrible inference quality. Their evaluation code:

model.eval()
with torch.no_grad():
    for batch in eval_loader:
        outputs = model(batch['input_ids'])
        # They check loss here and it looks fine

What’s likely wrong?

Answer

They’re evaluating with teacher forcing (providing the correct previous tokens) rather than autoregressive generation. During actual inference, the model generates tokens based on its own previous outputs, and errors compound. Good training loss doesn’t mean good generation. They should test with actual generation: model.generate(prompt, max_new_tokens=100) and evaluate the generated text quality.

Problem 3. [Staff] A production system shows mysteriously high memory usage. The model is 7B parameters (FP16 = 14GB), but the system needs 50GB for a single request. What’s happening?

Answer

The KV cache is the likely culprit. For long context (e.g., 32K tokens) or multiple concurrent requests, KV cache can dwarf model size. Calculation: 2 × 32 layers × 32K context × 32 heads × 128 dim × 2 bytes = 17GB for just KV cache at 32K context, and about 34GB at 64K. Add model weights (14GB), activations, and framework overhead (serving frameworks often preallocate a KV cache pool up front), and 50GB is plausible. Optimizer states would add more, but only during training. Solutions: Use GQA, reduce context length, implement KV cache quantization, or use PagedAttention.

Design Exercises

Exercise 1. [Senior] Design a system to serve a 70B model with 100K context length. You have access to 8x A100-80GB GPUs. Outline your approach for:

Model parallelism strategy
KV cache management
Expected throughput and latency characteristics

Guidance

Consider: (1) Tensor parallelism across 8 GPUs—each GPU holds 1/8 of model weights (~17GB per GPU). (2) KV cache at 100K context in FP16 would be enormous—consider KV cache quantization to INT8 or even INT4. (3) Use PagedAttention to manage KV cache efficiently. (4) Expected throughput will be low due to cross-GPU communication overhead and massive KV cache. Focus on optimizing for few concurrent long-context requests rather than high throughput. Latency for prefill will be significant (seconds) due to 100K token processing.

Exercise 2. [Staff] You’re advising a team that wants to build a code completion model. They’re debating: (a) fine-tune an existing 7B model on code, (b) train a code-specific 3B model from scratch, or (c) use a general-purpose 70B model via API. What factors should drive this decision? Create a decision framework.

Guidance

Key factors: (1) Latency requirements—code completion needs <100ms, favoring smaller self-hosted models. (2) Code-specific vs. general capability—do they need understanding of natural language requirements too? (3) Data availability—do they have enough code data to train from scratch? (4) Budget—API costs vs. training/hosting costs. (5) Privacy—can code go to external APIs? (6) Model quality—70B API likely produces better code but at higher latency/cost. Recommendation often: Start with API for quality baseline, fine-tune 7B on domain-specific code for production deployment with latency constraints.

Connections to Other Chapters

Chapter 6 (Prompt Engineering): Understanding tokenization helps manage context windows and craft effective prompts. Token boundaries affect how models parse instructions.
Chapter 7 (RAG Systems): Embeddings are the foundation of semantic search. The embedding concepts here directly enable document retrieval.
Chapter 8 (Agentic Systems): The generation process underlies agent loops. Understanding autoregressive generation helps design reliable agents.
Chapter 9 (Deployment): Memory requirements (KV cache, model weights) and the prefill/decode distinction are essential for optimization.

Further Reading

Vaswani et al. (2017), “Attention Is All You Need”: The original transformer paper. Focus on Section 3 (architecture) and the attention equations—everything else builds on this.
Brown et al. (2020), “Language Models are Few-Shot Learners” (GPT-3): Demonstrated emergent capabilities at scale. Read for understanding why scale matters.
Hoffmann et al. (2022), “Training Compute-Optimal Large Language Models” (Chinchilla): The definitive guide to scaling laws. Essential for understanding model sizing decisions.

Deep Dives

Touvron et al. (2023), “LLaMA: Open and Efficient Foundation Language Models”: Best modern reference for how production transformers are actually built.
Dao et al. (2022), “FlashAttention: Fast and Memory-Efficient Exact Attention”: How attention is computed efficiently on GPUs. Critical for understanding inference optimization.

Visual Introduction

Jay Alammar, “The Illustrated Transformer”: Start here if papers feel dense. Visual walkthrough of attention mechanics that makes the math intuitive.

--- title: "Chapter 5: LLM/NLP Foundations" keywords: [transformers, attention, tokenization, embeddings, self-attention, BPE, GPT, BERT, positional encoding, layer normalization] difficulty: intermediate prerequisites: [ch03] estimated_time: "4-5 hours" --- ## Introduction In late 2017, a team at Google was struggling with a fundamental limitation of their translation systems. The neural networks they used processed text sequentially, one word at a time, like reading a sentence aloud from left to right. This worked, but it was slow and had a critical flaw: by the time the model reached the end of a long sentence, information from the beginning had degraded through many transformations. The model couldn't easily connect a pronoun at the end of a sentence to its antecedent at the beginning. Their solution upended a decade of deep learning research in NLP. Instead of processing text sequentially, they designed a mechanism that let every word "look at" every other word simultaneously, computing relevance scores that determined how much attention each word should pay to each other word. They called this mechanism "attention," and the architecture built around it "the Transformer." The paper "Attention Is All You Need" launched a revolution. Within two years, BERT had shattered benchmarks across NLP tasks. Within five years, GPT-3 demonstrated that language models could perform tasks they were never explicitly trained for. By 2024, transformer-based models powered ChatGPT, Claude, Gemini, and every other large language model in production. Understanding transformers deeply is not optional for AI engineers. Every technique in this book, from prompt engineering to RAG to agent systems, operates on top of transformer architectures. When your production system misbehaves, when token counts explode unexpectedly, when the model hallucinates despite good retrieval, diagnosing the problem requires understanding what's actually happening inside the model. 
This chapter builds that understanding from the ground up. We start with the surprisingly consequential question of how text becomes numbers, trace the journey through embeddings and attention, and arrive at the complete transformer architecture that powers modern AI. By the end, you should be able to explain not just *what* these systems do, but *why* they work, and what happens when they fail. ### The Core Insight: Learning from Prediction Before diving into architecture, it's worth understanding the central insight that makes language models work. A language model is trained on a simple task: predict what comes next. Given "The cat sat on the," what word probably follows? This seems trivially simple, and yet to do it well, the model must learn: 1. **Syntax and grammar**: "the" is a determiner, so a noun probably follows 2. **Semantic coherence**: the noun should relate to sitting 3. **Common patterns**: "mat," "floor," "couch" are plausible; "differential" is not 4. **World knowledge**: cats sit on soft surfaces more often than hard ones Through billions of such predictions across massive corpora, models learn representations that capture the structure of language and much of human knowledge. The simple objective of next-token prediction turns out to be sufficient for learning remarkably sophisticated capabilities. This is why understanding the mechanics matters: when we debug a model's behavior or design a prompt, we're working with the artifacts of this prediction-based learning. Token boundaries, attention patterns, and layer representations all emerge from this training process. ### A Mental Model: Text as Points in Space The fundamental operation of neural NLP is converting discrete text into continuous vectors that can be manipulated mathematically. Think of each word (or, more precisely, each token) as a point in high-dimensional space. 
In this space: - **Similar meanings cluster together**: "dog," "puppy," and "canine" are near each other - **Relationships become directions**: the direction from "king" to "queen" is similar to the direction from "man" to "woman" - **Context shifts positions**: "bank" (financial) and "bank" (river) occupy different regions based on surrounding words The transformer's job is to take initial points (token embeddings) and progressively refine them through attention and feed-forward layers, eventually producing points that encode meaning, relationships, and context so precisely that the next token can be predicted accurately. This geometric intuition will guide us through the chapter. When we discuss attention, think of it as a mechanism for moving points toward other relevant points. When we discuss layers, think of progressive refinement of these positions. ### What You'll Learn - How tokenization transforms text into discrete units, and why tokenization choices have cascading effects on model behavior - The mathematics of embeddings: how lookup tables become semantic spaces - Attention mechanisms from first principles: the queries-keys-values framework and why it enables parallel processing - The complete transformer architecture, including residual connections, layer normalization, and feed-forward networks - Training objectives that create useful models, from next-token prediction to instruction tuning - Inference and generation: how models produce text, and the engineering tradeoffs involved ### Prerequisites This chapter assumes: - Python programming and NumPy array operations - Basic linear algebra: matrix multiplication, dot products, vector norms, and the intuition that dot products measure similarity - Neural network fundamentals: layers, activation functions, backpropagation, gradient descent - General ML concepts: training vs. inference, overfitting, loss functions You don't need prior NLP or transformer experience. We build that here. 
--- ## Tokenization ### The Fundamental Problem Language models don't see text. They see sequences of integers. The process of converting "Hello, world!" into [15496, 11, 995, 0] is tokenization, and despite seeming like a preprocessing detail, it fundamentally shapes what models can learn and how they behave. Consider the challenge: we need a finite vocabulary of tokens (the model needs a fixed-size output layer), but natural language has infinite expressivity. Any tokenization scheme must balance competing concerns: 1. **Coverage**: Can we represent any possible text? 2. **Efficiency**: How many tokens per semantic unit? 3. **Learnability**: Can the model learn meaningful representations for each token? 4. **Consistency**: Do similar texts tokenize similarly? The history of NLP is littered with approaches that failed one of these tests. ### Why Not Characters or Words? **Character-level tokenization** treats each character as a token. This solves coverage (every text is representable) but creates efficiency nightmares. The word "understanding" becomes 13 tokens. A 4,000-word document becomes 25,000+ tokens. Worse, the model must learn long-range dependencies through many more steps, each of which can degrade information. **Word-level tokenization** goes to the opposite extreme. English has hundreds of thousands of words, and that's before misspellings, technical jargon, code, and compound words. A vocabulary this large is impractical: most words appear so rarely in training data that the model can't learn good representations for them. And what happens when you encounter "unforeseen"? If it's not in your vocabulary, you're stuck. The insight of modern tokenization is that the right granularity is *neither* characters nor words, but something in between: subwords. ### Subword Tokenization: The Key Insight Subword tokenization splits rare words into meaningful pieces while keeping common words whole. 
"Understanding" might be a single token, while "unforgettable" becomes "un" + "forget" + "table." This achieves: - **Finite vocabulary**: Typically 32K-100K tokens - **Complete coverage**: Any text is representable by combining subwords - **Efficient encoding**: Common words are single tokens - **Meaningful pieces**: Subwords often correspond to morphemes (un-, -ing, -tion) The "meaningful pieces" property is crucial. When the model sees "un" + "forget" + "table," the "un-" prefix carries semantic information (negation) that transfers across words like "un" + "happy" and "un" + "expected." ### Byte-Pair Encoding (BPE) BPE, originally developed for data compression, is the most widely used subword algorithm. It works by iteratively merging the most frequent adjacent pairs. **The Algorithm:** 1. Start with a vocabulary of individual characters (or bytes) 2. Count all adjacent pairs of tokens in your training corpus 3. Merge the most frequent pair into a new token 4. Repeat until you reach your target vocabulary size Let's trace through a small example. Suppose our corpus is just the text "banana banana banana": ``` Initial vocabulary: {b, a, n, _} where _ represents space Initial representation: [b, a, n, a, n, a, _, b, a, n, a, n, a, _, b, a, n, a, n, a] Iteration 1: Count pairs: (a,n)=6, (n,a)=6, (b,a)=3, (a,_)=3, (_,b)=2 Most frequent: (a,n) - merge into "an" New vocabulary: {b, a, n, _, an} New representation: [b, an, an, a, _, b, an, an, a, _, b, an, an, a] Iteration 2: Count pairs: (b,an)=3, (an,an)=3, (an,a)=3, (a,_)=3, (_,b)=2 Most frequent: (an,a) - merge into "ana" New vocabulary: {b, a, n, _, an, ana} New representation: [b, an, ana, _, b, an, ana, _, b, an, ana] Iteration 3: Most frequent: (b,an)=3 Merge into "ban" ... Iteration 4: Merge "ban" + "ana" into "banana" Final representation: [banana, _, banana, _, banana] ``` In practice, BPE runs on billions of tokens of text, producing vocabularies of 32K-100K tokens. 
The algorithm naturally discovers meaningful units: common words become single tokens, while rare words decompose into recognizable pieces. **Why BPE Works for Language Models:** The frequency-based merging captures a crucial pattern: tokens that often appear together likely form a semantic unit. This produces: - Common English words as single tokens: "the," "and," "in" - Morphological units: "-ing," "-tion," "-ed," "un-" - Common phrases or compounds: "however," "cannot" - Rare words split meaningfully: "antidisestablishmentarianism" becomes recognizable pieces ### Unigram and SentencePiece While BPE builds vocabulary bottom-up through merging, the **Unigram** model takes a top-down approach: 1. Start with a large vocabulary of possible subwords 2. Compute how well each subword helps model the training corpus (under a unigram language model) 3. Iteratively remove subwords that contribute least 4. Stop when vocabulary reaches target size The key difference: Unigram considers the *probability* of different tokenizations, not just frequency. For a given text, there may be multiple valid tokenizations; Unigram assigns probabilities to each. **SentencePiece** is a library implementing both BPE and Unigram with a key innovation: it treats text as a raw byte stream with no pre-tokenization. There's no concept of "words" or "spaces" being special. 
This matters enormously for: - Languages without spaces (Chinese, Japanese, Thai) - Code (where whitespace is syntactically meaningful) - Mixed-language text ```python import sentencepiece as spm # Training a SentencePiece model spm.SentencePieceTrainer.train( input='training_corpus.txt', model_prefix='my_model', vocab_size=32000, model_type='bpe' # or 'unigram' ) # Using the trained model sp = spm.SentencePieceProcessor() sp.load('my_model.model') tokens = sp.encode_as_pieces("Hello, world!") ids = sp.encode_as_ids("Hello, world!") ``` ### Real-World Tokenization: Production Case Studies **OpenAI's Evolution (GPT-2 to GPT-4):** GPT-2 used a 50,257-token vocabulary trained primarily on English web text. This worked well for English but fragmented other languages badly. Chinese text often required 3-4x more tokens than English for equivalent semantic content, directly affecting costs and context limits. GPT-4 introduced `cl100k_base`, a 100,000-token vocabulary with several key improvements: - Better multilingual coverage (more balanced tokenization across languages) - Improved code handling (common keywords, indentation patterns as tokens) - Byte-level fallback (can represent any byte sequence, eliminating unknown tokens) **Anthropic's Approach:** Claude uses a proprietary tokenizer optimized for the assistant use case. Notably, it handles code more consistently than GPT-2-era tokenizers, treating indentation and common programming patterns as coherent units. **LLaMA and Open Models:** LLaMA uses SentencePiece with a 32K vocabulary. The smaller vocabulary means some rare words require more tokens, but the model fits in less memory (smaller embedding table) and trains faster. 
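
The memory side of that vocabulary tradeoff is simple arithmetic: the embedding table holds one vector per token. The sketch below assumes a 4,096-dimensional embedding and 2-byte (fp16) weights purely for illustration; `embedding_table_size` is a hypothetical helper, not part of any library.

```python
def embedding_table_size(vocab_size: int, embedding_dim: int, bytes_per_param: int = 2):
    """Return (parameter count, size in bytes) of a vocab_size x embedding_dim table."""
    params = vocab_size * embedding_dim
    return params, params * bytes_per_param


# Assumed dimension of 4,096 and fp16 storage; only the vocabulary size changes.
for vocab_size in (32_000, 100_000):
    params, size_bytes = embedding_table_size(vocab_size, 4_096)
    print(f"vocab={vocab_size:>7,}: {params / 1e6:.0f}M params, {size_bytes / 2**20:.0f} MiB")
```

Roughly tripling the vocabulary triples the table, which matters most for small models where the embedding table is a large share of total parameters.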
### Decision Framework: Choosing Tokenization Strategy | Factor | BPE | Unigram | WordPiece | |--------|-----|---------|-----------| | Primary users | GPT family, LLaMA | T5, mBART, many multilingual | BERT family | | Vocabulary building | Bottom-up merging | Top-down pruning | Bottom-up with likelihood | | Handling rare words | Splits into subwords | Splits with probability | Splits into subwords | | Determinism | Deterministic | Probabilistic (usually use Viterbi) | Deterministic | | Best for | General purpose | Multilingual | Understanding tasks | **When to train your own tokenizer:** 1. **Domain-specific vocabulary**: If your domain uses terminology absent from standard tokenizers (medical, legal, scientific), you'll get better tokenization with domain-specific training 2. **Code-heavy applications**: Standard tokenizers fragment identifiers and syntax unpredictably 3. **Non-English majority**: Tokenizers trained on English treat other languages as rare, fragmenting them excessively **When to use pre-trained tokenizers:** 1. **Using a pre-trained model**: Always use the model's native tokenizer. Mismatched tokenizers cause silent failures. 2. **General-purpose applications**: Standard tokenizers handle most text well 3. **Reproducibility**: Pre-trained tokenizers are stable and well-tested ### The Mathematics of Tokenization BPE can be understood as greedy compression. At each step, we replace the most frequent pair `(x, y)` with a new symbol `xy`. This minimizes description length greedily. Formally, if we have a corpus $C$ as a sequence of symbols, and pair $(x, y)$ appears $n$ times: - Original encoding length: $|C|$ symbols - After merge: $|C| - n$ symbols (each occurrence of $xy$ is now one symbol) - Vocabulary increases by 1 We're trading vocabulary size for sequence length, converging toward a balance between the two. The Unigram model is more principled. 
It defines a probability model: $$P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i)$$ For a given text, we find the tokenization that maximizes this probability (typically via Viterbi algorithm). The vocabulary is pruned to keep tokens that maximize likelihood over the training corpus. ### Implementation: Token Counting and Budgeting Production systems need accurate token counts for cost estimation and context management: ```python import tiktoken def count_tokens(text: str, model: str = "gpt-5") -> int: """Count tokens for a specific model.""" encoding = tiktoken.encoding_for_model(model) return len(encoding.encode(text)) # Rough estimates for planning (English prose): # - ~4 characters per token # - ~0.75 words per token # - ~1.3 tokens per word # These vary significantly by content type ``` **Token Budget Management:** ```python class TokenBudget: """Manage token budgets for prompt construction.""" def __init__(self, max_tokens: int, model: str = "gpt-5"): self.max_tokens = max_tokens self.encoding = tiktoken.encoding_for_model(model) def count(self, text: str) -> int: return len(self.encoding.encode(text)) def allocate(self, components: dict[str, str], output_reserve: int = 500) -> dict[str, int]: """Check token usage for prompt components.""" usage = {name: self.count(text) for name, text in components.items()} usage['_output_reserve'] = output_reserve usage['_total'] = sum(usage.values()) usage['_remaining'] = self.max_tokens - usage['_total'] return usage def truncate_to_fit(self, text: str, max_tokens: int) -> str: """Truncate text to fit within token limit.""" tokens = self.encoding.encode(text) if len(tokens) <= max_tokens: return text return self.encoding.decode(tokens[:max_tokens]) ``` ### Common Pitfalls and Debugging **Token boundary surprises:** ```python tokenizer = tiktoken.encoding_for_model("gpt-5") # Leading spaces matter! 
print(len(tokenizer.encode("hello")))   # 1 token
print(len(tokenizer.encode(" hello")))  # typically also 1 token, but a different ID:
                                        # the leading space is part of the token
print(tokenizer.encode("hello") == tokenizer.encode(" hello"))  # False

# Concatenation silently drops the space
text1 = "Say hello"
text2 = "Say" + "hello"  # Missing space!
print(tokenizer.decode(tokenizer.encode(text1)))  # "Say hello"
print(tokenizer.decode(tokenizer.encode(text2)))  # "Sayhello"
```

**Number tokenization is chaotic:**

```python
# Numbers often tokenize inconsistently (encode() returns IDs; the pieces are shown in comments)
print(tokenizer.encode("1234"))  # Pieces might be "123" + "4"
print(tokenizer.encode("1235"))  # Pieces might be "12" + "35" - different split!

# The exact split depends on the tokenizer.
# This is one reason LLMs struggle with arithmetic:
# the number 1234 doesn't have a consistent representation
```

**Multilingual fragmentation:**

```python
# Same semantic content, vastly different token counts
english = "How are you doing today?"
chinese = "你今天好吗？"

print(f"English: {count_tokens(english)} tokens")  # ~6 tokens
print(f"Chinese: {count_tokens(chinese)} tokens")  # ~10-15 tokens with English-trained tokenizer

# This affects:
# - API costs (billed per token)
# - Context window usage
# - Model performance (less "room to think")
```

::: {.callout-warning}
## Common Mistake: Ignoring Tokenization Costs for Non-English Text

**What people do:** Build products for multilingual markets using per-token pricing estimates based on English text.

**Why it fails:** Tokenizers trained primarily on English fragment other languages excessively. Chinese, Japanese, Arabic, and many other languages can require 2-4x more tokens for equivalent semantic content. A product costing $0.01 per query in English might cost $0.03 per query in Japanese.

**Fix:** Always test tokenization on your target languages before committing to cost estimates. Use `tiktoken` or your provider's tokenizer to get actual token counts. For multilingual applications, budget for the highest-cost language, not the average.
:::

**Debugging tokenization issues:**

```python
def debug_tokenization(text: str, model: str = "gpt-5"):
    """Visualize how text is tokenized."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    pieces = [encoding.decode([t]) for t in tokens]
    print(f"Text: {text}")
    print(f"Tokens ({len(tokens)}): {tokens}")
    print(f"Pieces: {pieces}")
    print(f"Joined: {'|'.join(pieces)}")

# Example output (token IDs depend on the encoding):
# Text: The transformer architecture
# Tokens (3): [791, 43678, 18112]
# Pieces: ['The', ' transformer', ' architecture']
# Joined: The| transformer| architecture
```

::: {.callout-note}
## Historical Context: The Evolution of Tokenization

**1990s (Word-level)**: Early NLP used word-level tokenization with fixed vocabularies. Unknown words were mapped to `<UNK>`, losing information entirely.

**2015 (BPE for NMT)**: Sennrich et al. applied Byte-Pair Encoding to neural machine translation, enabling open-vocabulary models that could handle rare words and morphologically rich languages.

**2018 (SentencePiece)**: Google's SentencePiece made subword tokenization language-agnostic by training directly on raw text (whitespace included), eliminating preprocessing dependencies.

**2022 (tiktoken)**: OpenAI's tiktoken optimized byte-level BPE for speed (its README reports 3-6x faster than a comparable open-source tokenizer), becoming the standard for GPT models.

**The pattern**: Tokenization evolved from linguistic assumptions (words) to learned statistical patterns (subwords) to raw bytes. Each step increased generality and reduced language-specific engineering. Future models may move toward byte-level transformers that learn tokenization end-to-end.
:::

---

## Embeddings

### From Tokens to Vectors: The Core Transformation

Tokens are integers. Neural networks operate on continuous vectors. Embeddings bridge this gap, converting each token into a dense vector that the network can process.

At its simplest, an embedding is a lookup table.
If your vocabulary has 100,000 tokens and you choose an embedding dimension of 4,096, your embedding matrix has shape `(100000, 4096)`. Converting token ID 42 to its embedding means retrieving row 42. ```python import torch import torch.nn as nn vocab_size = 100000 embedding_dim = 4096 # Create embedding layer embedding = nn.Embedding(vocab_size, embedding_dim) # Token IDs to embeddings: just a lookup token_ids = torch.tensor([42, 1337, 99]) # Shape: (3,) vectors = embedding(token_ids) # Shape: (3, 4096) ``` But this simple lookup table, when trained end-to-end with the language model, learns remarkable structure. ### Why Embeddings Capture Meaning: The Distributional Hypothesis The power of embeddings derives from the **distributional hypothesis**: words that appear in similar contexts have similar meanings. "Dog" and "cat" both appear in contexts like "The ___ ran across the yard" and "She fed her ___." Through training on next-token prediction, embeddings for words in similar contexts become geometrically close. This creates an embedding space where: 1. **Synonyms cluster**: "big," "large," "huge" have similar embeddings 2. **Antonyms are related**: "hot" and "cold" are closer to each other than to "blue" (they share the concept of temperature) 3. **Analogies are arithmetic**: `embed("king") - embed("man") + embed("woman") ≈ embed("queen")` The analogy result is remarkable. The direction from "man" to "woman" encodes the concept of gender, and this same direction applies to other gendered pairs. The embedding space has learned a geometric representation of abstract concepts. 
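
To make the analogy arithmetic concrete, here is a toy sketch. The vectors are hand-built for illustration (they are not learned embeddings), with axes chosen by assumption as `[royalty, male, female, fruit]`; `cosine` and `analogy` are hypothetical helper names.

```python
import numpy as np

# Hand-built toy vectors (NOT learned): axes are [royalty, male, female, fruit]
toy_embeddings = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, embeddings):
    """Return the word closest to embed(a) - embed(b) + embed(c), excluding the inputs."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


result = analogy("king", "man", "woman", toy_embeddings)
print(result)  # queen
```

In learned embedding spaces the match is approximate rather than exact, which is why the nearest neighbor is searched for instead of tested for equality.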
### The Geometry of Embeddings ![Embeddings as points in space: semantically similar items cluster together, and the direction between points carries meaning.](../assets/diagrams/rendered/ch05_embedding_space.svg) **Cosine Similarity:** We measure embedding similarity using cosine similarity, which measures the angle between vectors, ignoring magnitude: $$\text{cosine}(a, b) = \frac{a \cdot b}{\|a\| \|b\|} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \sqrt{\sum_i b_i^2}}$$ Range: [-1, 1] where: - 1: identical direction (same meaning) - 0: orthogonal (unrelated meanings) - -1: opposite direction (rarely seen in practice) **Why Cosine Over Euclidean?** Euclidean distance measures straight-line distance: $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$ For embeddings, we care about direction (what the token means) more than magnitude (how "strongly" the token is expressed). Cosine similarity is direction-only. Additionally, for normalized embeddings (vectors with length 1), cosine and dot product are equivalent, enabling fast computation. **The Mathematical Connection:** For unit vectors (normalized to length 1): $$\text{cosine}(a, b) = a \cdot b = \|a\| \|b\| \cos(\theta) = \cos(\theta)$$ And Euclidean distance relates to cosine similarity: $$\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2(a \cdot b) = 2(1 - \text{cosine}(a, b))$$ So maximizing cosine similarity minimizes Euclidean distance for normalized vectors. ### Position Encodings: Where Am I? Transformers process all tokens in parallel, unlike RNNs that process sequentially. This parallelism is a key advantage, but it creates a problem: without explicit position information, the model can't distinguish "The cat sat on the mat" from "mat the on sat cat The." 
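
This order-blindness can be checked numerically. The sketch below uses a bare-bones self-attention with identity projections and no learned weights (a simplifying assumption; the full mechanism is covered later in this chapter) and shows that shuffling the input tokens merely shuffles the outputs in the same way. `self_attention_no_positions` is a hypothetical helper name.

```python
import numpy as np


def self_attention_no_positions(x):
    """Softmax(x x^T / sqrt(d)) x with identity projections and no position signal."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))     # 6 "token embeddings" of dimension 8
perm = np.array([5, 4, 3, 2, 1, 0])  # reverse the sequence

out = self_attention_no_positions(tokens)
out_shuffled = self_attention_no_positions(tokens[perm])

# Each token gets the same output vector regardless of where it sits.
print(np.allclose(out[perm], out_shuffled))
# Any order-insensitive summary (such as the mean) is therefore identical.
print(np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0)))
```

Both checks hold for any permutation, which is exactly why a separate position signal has to be injected.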
**Absolute Position Embeddings (GPT-2 Style):**

Learn a separate embedding for each position:

```python
max_seq_length = 2048
position_embedding = nn.Embedding(max_seq_length, embedding_dim)

# For a sequence of length L (L must not exceed max_seq_length)
positions = torch.arange(L)  # [0, 1, 2, ..., L-1]
pos_embeddings = position_embedding(positions)  # (L, embedding_dim)

# Final input: token embedding + position embedding
input_to_transformer = token_embeddings + pos_embeddings
```

Simple but limited: the model can't generalize beyond the `max_seq_length` positions it was trained on.

**Sinusoidal Position Embeddings (Original Transformer):**

Use deterministic sine and cosine functions at different frequencies:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$$

Where `pos` is position and `i` is dimension index.

The intuition: different dimensions oscillate at different frequencies. Low dimensions oscillate quickly and capture fine position (differences between adjacent positions); high dimensions oscillate slowly and capture coarse position (beginning vs. end). The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.

```python
import math

import torch


def sinusoidal_positions(seq_length: int, dim: int) -> torch.Tensor:
    """Generate sinusoidal position embeddings (assumes an even `dim`)."""
    position = torch.arange(seq_length).unsqueeze(1)  # (seq_length, 1)
    div_term = torch.exp(
        torch.arange(0, dim, 2) * (-math.log(10000.0) / dim)
    )  # (dim/2,)

    pe = torch.zeros(seq_length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions
    return pe
```

**Rotary Position Embeddings (RoPE) - Modern Standard:**

RoPE encodes positions by *rotating* the query and key vectors rather than adding position embeddings. This elegant approach is now standard in LLaMA, Mistral, and most modern open models.
The key insight: if we rotate query and key vectors by position-dependent angles, their dot product (attention score) depends on the *relative* position between them, not absolute positions.

Mathematically, for a 2D subspace of the embedding:

$$\begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} q_1 \\ q_2 \end{pmatrix}$$

Where $m$ is the position and $\theta$ is a frequency. We apply this rotation to pairs of dimensions throughout the embedding. Writing the rotation as $R_m$, a query at position $m$ and a key at position $n$ satisfy $(R_m q)^\top (R_n k) = q^\top R_{n-m} k$, so the score depends only on the offset $n - m$.

Why RoPE is better:

1. **Relative positions naturally**: The dot product between rotated vectors depends on position difference
2. **Length generalization**: Works for sequences longer than training (with some degradation)
3. **Efficient**: Rotation is cheap; just multiply by sin/cos values

```python
def apply_rotary_embeddings(q, k, freqs_cos, freqs_sin):
    """Apply RoPE to queries and keys.

    q, k: (batch, seq_len, num_heads, head_dim)
    freqs_cos, freqs_sin: (seq_len, head_dim/2)
    """
    # Reshape to pairs of dimensions
    q_r = q.float().reshape(*q.shape[:-1], -1, 2)  # (..., head_dim/2, 2)
    k_r = k.float().reshape(*k.shape[:-1], -1, 2)

    # Add batch and head axes so the frequencies broadcast:
    # (seq_len, head_dim/2) -> (1, seq_len, 1, head_dim/2)
    cos = freqs_cos[None, :, None, :]
    sin = freqs_sin[None, :, None, :]

    # Apply rotation using complex number multiplication
    # (a + bi)(cos + sin*i) = (a*cos - b*sin) + (a*sin + b*cos)i
    q_out_r = q_r[..., 0] * cos - q_r[..., 1] * sin
    q_out_i = q_r[..., 0] * sin + q_r[..., 1] * cos
    k_out_r = k_r[..., 0] * cos - k_r[..., 1] * sin
    k_out_i = k_r[..., 0] * sin + k_r[..., 1] * cos

    # Interleave real and imaginary parts
    q_out = torch.stack([q_out_r, q_out_i], dim=-1).flatten(-2)
    k_out = torch.stack([k_out_r, k_out_i], dim=-1).flatten(-2)

    return q_out.type_as(q), k_out.type_as(k)
```

### Static vs. Contextual Embeddings

Early embedding methods like Word2Vec and GloVe produced **static embeddings**: each word has one vector, regardless of context. "Bank" has the same embedding whether it means a financial institution or a river bank.
Transformer-based models produce **contextual embeddings**. After passing through attention layers, each token's representation incorporates information from its context. The representation of "bank" in "I deposited money at the bank" differs from "I sat on the river bank."

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text1 = "I deposited money at the bank"
text2 = "I sat on the river bank"

inputs1 = tokenizer(text1, return_tensors="pt")
inputs2 = tokenizer(text2, return_tensors="pt")

with torch.no_grad():
    outputs1 = model(**inputs1)
    outputs2 = model(**inputs2)

# Locate "bank" in each sequence instead of hard-coding positions
# (index 0 is [CLS], so token indices are shifted by one from word positions)
idx1 = tokenizer.convert_ids_to_tokens(inputs1["input_ids"][0].tolist()).index("bank")
idx2 = tokenizer.convert_ids_to_tokens(inputs2["input_ids"][0].tolist()).index("bank")

bank_embedding_1 = outputs1.last_hidden_state[0, idx1]
bank_embedding_2 = outputs2.last_hidden_state[0, idx2]

# These embeddings are different!
similarity = torch.cosine_similarity(bank_embedding_1, bank_embedding_2, dim=0)
print(f"Similarity: {similarity:.3f}")  # Noticeably less than 1.0
```

This contextual nature is why transformers are so powerful for NLP: they can disambiguate meaning based on context.

::: {.callout-warning}
## Common Mistake: Pre-computing Embeddings Without Context

**What people do:** Pre-compute embeddings for individual words or short phrases ("product quality", "user experience") and store them for later similarity search, treating transformer embeddings as static word vectors.

**Why it fails:** Transformer embeddings are contextual: the embedding for a word depends on its surrounding text. The embedding for "Python" in "I have a pet python" differs significantly from "Python" in "Write this in Python." Pre-computing embeddings for isolated terms loses this context and produces misleading similarity scores.

**Fix:** Always embed complete sentences or paragraphs that provide context. For retrieval, embed document chunks (not keywords). For semantic similarity, embed the full text you want to compare.
If you must embed short phrases, use models specifically trained for short-text embedding (like sentence transformers) rather than raw transformer outputs. ::: ### Embedding Dimensions: Tradeoffs and Guidelines **Dimension Size:** Larger dimensions capture more information but cost more: - **Memory**: Embedding table scales with `vocab_size * embedding_dim` - **Compute**: All operations involve `embedding_dim`-sized vectors - **Overfitting risk**: More parameters can memorize training data | Model | Embedding Dimension | Vocabulary | Embedding Parameters | |-------|---------------------|------------|---------------------| | BERT-base | 768 | 30,522 | 23M | | BERT-large | 1,024 | 30,522 | 31M | | GPT-2 | 768-1,600 | 50,257 | 39M-80M | | LLaMA-7B | 4,096 | 32,000 | 131M | | GPT-4 (estimated) | 12,288+ | 100,000+ | 1.2B+ | **Decision Framework:** | Use Case | Recommended Dimension | Rationale | |----------|----------------------|-----------| | Semantic search/retrieval | 256-768 | Balance of quality and speed; retrieval doesn't need full model capacity | | Fine-grained similarity | 768-1,024 | Captures nuanced relationships | | General LLM embedding | 2,048-4,096 | Matches model capacity | | Frontier models | 8,192-16,384 | Maximum expressiveness for complex reasoning | **Matryoshka Embeddings:** Introduced by Kusupati et al. (2022): embeddings trained to be useful at multiple truncated dimensions. You can use the first 256 dimensions for fast retrieval, then the full 1,024 for reranking. This is achieved by applying the loss function at multiple dimension checkpoints during training. ### Pooling: From Token Embeddings to Sequence Embeddings Transformers output one embedding per token. For tasks requiring a single sequence embedding (classification, retrieval), we must pool: ```python def pool_embeddings(token_embeddings, attention_mask, strategy="mean"): """Pool token embeddings into a single sequence embedding. 
Args: token_embeddings: (batch, seq_len, hidden_dim) attention_mask: (batch, seq_len) - 1 for real tokens, 0 for padding strategy: "cls", "mean", "max", or "last" """ if strategy == "cls": # Use [CLS] token (first token for BERT-style models) return token_embeddings[:, 0, :] elif strategy == "mean": # Average all non-padding tokens mask_expanded = attention_mask.unsqueeze(-1).float() sum_embeddings = torch.sum(token_embeddings * mask_expanded, dim=1) sum_mask = mask_expanded.sum(dim=1).clamp(min=1e-9) return sum_embeddings / sum_mask elif strategy == "max": # Max pool over non-padding tokens token_embeddings = token_embeddings.masked_fill( attention_mask.unsqueeze(-1) == 0, float('-inf') ) return torch.max(token_embeddings, dim=1).values elif strategy == "last": # Last non-padding token (for decoder models) seq_lengths = attention_mask.sum(dim=1) - 1 batch_indices = torch.arange(token_embeddings.size(0)) return token_embeddings[batch_indices, seq_lengths] ``` **Which pooling strategy?** - **CLS**: Standard for BERT-style models trained with CLS pooling - **Mean**: Good default; robust to varying sequence lengths - **Max**: Emphasizes distinctive features; good for keywords - **Last**: Appropriate for decoder models (GPT-style) where the last token aggregates context --- ## Attention: The Core Innovation ### The Problem Attention Solves Before transformers, sequence models were dominated by recurrent neural networks (RNNs) and their variants (LSTMs, GRUs). These process text sequentially: ``` "The" → [hidden state 1] → "cat" → [hidden state 2] → "sat" → [hidden state 3] → ... ``` Each hidden state must compress all previous information into a fixed-size vector. This creates two fundamental problems: 1. **Information bottleneck**: By the time we reach word 50, information from word 1 has passed through 49 transformations. Important details get lost or corrupted. 2. **Sequential computation**: We can't process word 5 until words 1-4 are done. 
This prevents parallelization. The sentence "The cat, which was spotted with orange and black stripes and had been living in the old barn since last winter, sat on the mat" requires understanding that "sat" refers to "cat," but they're separated by 20+ words. RNNs struggle with such long-range dependencies. **Attention solves both problems.** Every token can directly attend to every other token. "Sat" doesn't need to receive information about "cat" through a chain of transformations; it can look directly at "cat" and decide to incorporate its information. And because attention computes relationships between all pairs simultaneously, it's fully parallelizable. We can process all positions at once on a GPU. ### Queries, Keys, and Values: The Mental Model Think of attention as a content-based memory lookup, like a database query: - **Query (Q)**: "What am I looking for?" Each token broadcasts a query vector describing what information it needs. - **Key (K)**: "What do I contain?" Each token provides a key vector describing what information it has. - **Value (V)**: "What do I contribute?" Each token provides a value vector containing its actual information. The attention mechanism: 1. Compares each query to all keys (dot product) 2. Converts similarities to weights (softmax) 3. Returns a weighted sum of values This is like searching a database: the query specifies what you want, keys index the entries, and values are the actual content. ``` Query("sat"): "I'm a verb. What's my subject?" ↓ Compare to all keys: Key("The"): determiner, low relevance → low weight Key("cat"): noun, potential subject → HIGH weight Key("which"): relative pronoun → low weight Key("on"): preposition → low weight ↓ Return: weighted sum of values, dominated by Value("cat") ``` ### The Mathematics of Attention Formally, given input embeddings $X \in \mathbb{R}^{n \times d}$ (n tokens, d dimensions): 1. 
**Project to Q, K, V:**
   $$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$
   Where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned projection matrices.

2. **Compute attention scores:**
   $$\text{scores} = QK^T \in \mathbb{R}^{n \times n}$$
   Entry $(i, j)$ is how much position $i$ should attend to position $j$: the dot product of query $i$ and key $j$.

3. **Scale:**
   $$\text{scaled\_scores} = \frac{QK^T}{\sqrt{d_k}}$$
   We'll explain why scaling is crucial below.

4. **Apply softmax:**
   $$\text{weights} = \text{softmax}(\text{scaled\_scores})$$
   Each row sums to 1, forming a probability distribution over positions.

5. **Weighted sum of values:**
   $$\text{output} = \text{weights} \cdot V \in \mathbb{R}^{n \times d_v}$$

The complete formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

### Why Scale by $\sqrt{d_k}$?

The scaling factor is crucial for training stability. Here's why:

Dot products grow with dimension. If $q$ and $k$ are random vectors with independent entries of mean 0 and variance 1, then:

$$\text{E}[q \cdot k] = 0, \quad \text{Var}[q \cdot k] = d_k$$

As $d_k$ increases, dot products have higher variance. For $d_k = 64$, the standard deviation is $\sqrt{64} = 8$, so scores might range from -20 to +20. After softmax, such extreme values become nearly one-hot:

```python
import torch
import torch.nn.functional as F

# High-variance scores
scores = torch.tensor([10.0, 1.0, -5.0])
print(F.softmax(scores, dim=0))  # [0.9999, 0.0001, 0.0000] - nearly one-hot

# After scaling by sqrt(64) = 8
scaled_scores = scores / 8
print(F.softmax(scaled_scores, dim=0))  # [0.68, 0.22, 0.10] - smoother
```

One-hot attention distributions have near-zero gradients (softmax saturation), making training difficult. Scaling by $\sqrt{d_k}$ keeps scores in a range where softmax produces useful gradients.
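
The variance claim is easy to check empirically. The following sketch samples random queries and keys with standard normal entries (the distributional assumption stated above) and compares the variance of raw and scaled dot products; `dot_product_variance` is a hypothetical helper name.

```python
import numpy as np


def dot_product_variance(d_k: int, num_samples: int = 20_000, seed: int = 0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    dots = (q * k).sum(axis=1)  # one dot product per sampled (q, k) pair
    return dots.var(), (dots / np.sqrt(d_k)).var()


for d_k in (16, 64, 256):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:>3}: raw variance ~ {raw:.1f}, scaled variance ~ {scaled:.2f}")
```

The raw variance tracks $d_k$, while the scaled variance stays near 1 regardless of head dimension, which is the property the $\sqrt{d_k}$ factor is there to provide.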
### Implementation: Scaled Dot-Product Attention ```python def scaled_dot_product_attention( query: torch.Tensor, # (batch, seq_q, d_k) key: torch.Tensor, # (batch, seq_k, d_k) value: torch.Tensor, # (batch, seq_k, d_v) mask: torch.Tensor = None # (seq_q, seq_k) or (batch, seq_q, seq_k) ) -> tuple[torch.Tensor, torch.Tensor]: """Compute scaled dot-product attention. Returns: output: (batch, seq_q, d_v) - attention output weights: (batch, seq_q, seq_k) - attention weights """ d_k = query.size(-1) # Compute attention scores: (batch, seq_q, seq_k) scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k) # Apply mask (for causal attention or padding) if mask is not None: scores = scores.masked_fill(mask == 0, float('-inf')) # Softmax to get attention weights weights = F.softmax(scores, dim=-1) # Weighted sum of values output = torch.matmul(weights, value) return output, weights ``` ### Multi-Head Attention: Multiple Perspectives Single-head attention can only capture one type of relationship. Multi-head attention runs multiple attention operations in parallel, each with different learned projections: ``` Input → [Head 1: syntactic] → ↘ Input → [Head 2: semantic] → → Concatenate → Project → Output Input → [Head 3: positional] → ↗ ... 
``` Each head can specialize: - One head might track subject-verb relationships - Another might attend to recent context - Another might handle coreference (pronouns to antecedents) ```python class MultiHeadAttention(nn.Module): def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1): super().__init__() assert d_model % num_heads == 0, "d_model must be divisible by num_heads" self.d_model = d_model self.num_heads = num_heads self.d_k = d_model // num_heads # Projection matrices self.W_q = nn.Linear(d_model, d_model) self.W_k = nn.Linear(d_model, d_model) self.W_v = nn.Linear(d_model, d_model) self.W_o = nn.Linear(d_model, d_model) self.dropout = nn.Dropout(dropout) def forward( self, query: torch.Tensor, # (batch, seq_len, d_model) key: torch.Tensor, value: torch.Tensor, mask: torch.Tensor = None ) -> torch.Tensor: batch_size = query.size(0) # Project and reshape: (batch, seq, d_model) -> (batch, num_heads, seq, d_k) Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) # Attention for all heads in parallel: (batch, num_heads, seq, d_k) attn_output, _ = scaled_dot_product_attention(Q, K, V, mask) # Concatenate heads: (batch, seq, d_model) attn_output = attn_output.transpose(1, 2).contiguous().view( batch_size, -1, self.d_model ) # Final projection return self.W_o(attn_output) ``` **Head Count Guidelines:** | Model | Heads | Head Dimension | Total d_model | |-------|-------|----------------|---------------| | BERT-base | 12 | 64 | 768 | | GPT-2 medium | 16 | 64 | 1024 | | GPT-3 175B | 96 | 128 | 12288 | | LLaMA-7B | 32 | 128 | 4096 | More heads = more types of relationships, but each head is smaller. The d_model / num_heads tradeoff is usually empirical. 
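
One consequence of splitting `d_model` across heads is that the head count changes how the capacity is divided, not how much of it there is. As a rough sketch, the helper below (a hypothetical name, `mha_projection_params`) counts the weights and biases of the four `nn.Linear(d_model, d_model)` projections used in the `MultiHeadAttention` class above.

```python
def mha_projection_params(d_model: int, num_heads: int, bias: bool = True):
    """Return (head_dim, total parameters in the W_q, W_k, W_v, W_o projections)."""
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads
    # Each projection maps d_model -> num_heads * head_dim (= d_model)
    per_projection = d_model * (num_heads * head_dim)
    if bias:
        per_projection += num_heads * head_dim
    return head_dim, 4 * per_projection


for heads in (8, 12, 16):
    head_dim, params = mha_projection_params(768, heads)
    print(f"heads={heads:>2}: head_dim={head_dim}, projection params={params:,}")
```

Every row reports the same parameter count; only `head_dim` changes, which is why the choice is usually settled empirically rather than by a parameter budget.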
### Causal Masking: Looking Only Backward

For autoregressive generation (predicting the next token), we must prevent position $i$ from attending to positions $> i$. The model can't peek at future tokens.

```python
def create_causal_mask(seq_length: int) -> torch.Tensor:
    """Create a lower-triangular causal mask.

    Position i can attend to positions 0, 1, ..., i
    """
    return torch.tril(torch.ones(seq_length, seq_length))

# Example for sequence length 4:
# [[1, 0, 0, 0],  ← position 0 only sees itself
#  [1, 1, 0, 0],  ← position 1 sees 0 and 1
#  [1, 1, 1, 0],  ← position 2 sees 0, 1, and 2
#  [1, 1, 1, 1]]  ← position 3 sees all
```

The mask is applied before softmax: positions with 0 get score $-\infty$, which softmax converts to probability 0.

### Self-Attention vs. Cross-Attention

**Self-attention**: Q, K, V all come from the same sequence. Each token attends to all tokens in its own sequence. This is the standard attention in GPT-style decoders.

**Cross-attention**: Q comes from one sequence (e.g., decoder), while K and V come from another (e.g., encoder). The decoder attends to encoder outputs. Used in:

- Encoder-decoder models (T5, original Transformer)
- Vision-language models (image encoder → text decoder)
- Some retrieval-augmented architectures (the generated sequence attends to retrieved documents)

### Attention Complexity: The Quadratic Challenge

Standard attention has time and memory complexity $O(n^2)$ where $n$ is sequence length. Computing the $n \times n$ score matrix is unavoidable in vanilla attention.

For short sequences (< 4K tokens), this is fine. For long contexts (32K-1M tokens), it becomes prohibitive:

- 100K tokens → 10 billion attention scores per head, per layer
- At 96 heads and 96 layers (GPT-3 scale), that's roughly 90 trillion scores per forward pass

**Flash Attention** (Dao et al., 2022) is the production solution: it computes exact attention but with clever memory access patterns that avoid materializing the full $n \times n$ matrix.
This is an implementation optimization, not an algorithmic change, but it extends practical context lengths severalfold on the same hardware.

**Sparse attention patterns** reduce complexity by limiting which positions attend to which:

- **Sliding window**: Each position attends only to a local window
- **Global tokens**: Certain tokens (like [CLS]) attend to and are attended by all positions
- **Strided patterns**: Attend to every k-th token beyond the local window

```python
import torch


def sliding_window_mask(seq_length: int, window_size: int) -> torch.Tensor:
    """Mask allowing attention only within a sliding window."""
    mask = torch.zeros(seq_length, seq_length)
    for i in range(seq_length):
        # Symmetric window centered on position i, clipped at the sequence edges
        start = max(0, i - window_size // 2)
        end = min(seq_length, i + window_size // 2 + 1)
        mask[i, start:end] = 1
    return mask
```

### Grouped-Query Attention (GQA): Memory Efficiency

The KV cache (storing K and V for generated tokens) dominates inference memory for long sequences. Standard multi-head attention stores separate K, V per head, so each layer caches:

$$\text{KV cache per layer} = 2 \times \text{num\_heads} \times \text{seq\_length} \times \text{head\_dim} \times \text{bytes}$$

The total is this figure multiplied by the number of layers and the batch size.

**Multi-Query Attention (MQA)** uses a single K, V shared across all heads. This dramatically reduces KV cache but loses some expressiveness.

**Grouped-Query Attention (GQA)** is the middle ground: groups of heads share K, V. LLaMA 2 70B uses 8 KV heads shared among 64 query heads (8:1 ratio).
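
To get a feel for the numbers, the cache formula above can be turned into a small calculator. This is a sketch under stated assumptions: a head dimension of 128, a 4,096-token sequence, and 2-byte (fp16) values are illustrative choices, and the result is for a single layer and a single sequence. `kv_cache_bytes_per_layer` is a hypothetical helper name.

```python
def kv_cache_bytes_per_layer(num_kv_heads: int, seq_length: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache K and V for one layer and one sequence."""
    return 2 * num_kv_heads * seq_length * head_dim * bytes_per_value


# Assumed settings: 4,096 tokens, head_dim=128, fp16 values.
configs = {
    "MHA (64 KV heads)": 64,
    "GQA (8 KV heads)": 8,
    "MQA (1 KV head)": 1,
}
for name, kv_heads in configs.items():
    size = kv_cache_bytes_per_layer(kv_heads, seq_length=4_096, head_dim=128)
    print(f"{name:<18} {size / 2**20:6.1f} MiB per layer")
```

Only the number of KV heads enters the formula, so going from 64 to 8 KV heads cuts the cache by the same 8:1 ratio regardless of how many query heads remain.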
```python
class GroupedQueryAttention(nn.Module):
    """GQA: Multiple query heads share fewer key-value heads."""

    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.num_groups = num_heads // num_kv_heads
        self.head_dim = d_model // num_heads

        self.W_q = nn.Linear(d_model, num_heads * self.head_dim)
        self.W_k = nn.Linear(d_model, num_kv_heads * self.head_dim)  # Fewer K heads
        self.W_v = nn.Linear(d_model, num_kv_heads * self.head_dim)  # Fewer V heads
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        K = self.W_k(x).view(batch_size, seq_len, self.num_kv_heads, self.head_dim)
        V = self.W_v(x).view(batch_size, seq_len, self.num_kv_heads, self.head_dim)

        # Repeat K, V so that each group of query heads shares one KV head
        # (a KV cache would store the un-repeated tensors)
        K = K.repeat_interleave(self.num_groups, dim=2)
        V = V.repeat_interleave(self.num_groups, dim=2)

        # Standard attention over (batch, heads, seq, head_dim); needs PyTorch 2.0+
        Q, K, V = (t.transpose(1, 2) for t in (Q, K, V))
        attn_mask = mask.bool() if mask is not None else None
        out = F.scaled_dot_product_attention(Q, K, V, attn_mask=attn_mask)
        out = out.transpose(1, 2).reshape(batch_size, seq_len, -1)
        return self.W_o(out)
```

GQA shrinks the KV cache by a factor of `num_groups`, because only the `num_kv_heads` key and value heads are cached, with minimal quality loss. It's now standard in production models.

::: {.callout-note}
## Historical Context: The Evolution of Attention

**2014 (Bahdanau Attention)**: Attention was introduced for RNN-based machine translation. For the first time, models could look back at any input position when generating each output word.

**2017 (Transformer)**: Vaswani et al.'s "Attention Is All You Need" removed recurrence entirely. Self-attention let every position attend to every other position, enabling massive parallelization.

**2019 (Multi-Query Attention)**: Shazeer proposed sharing keys and values across attention heads, shrinking the KV cache by a factor equal to the number of heads (8x for an 8-head model) with only a small quality loss.

**2022 (FlashAttention)**: Dao et al. restructured attention computation to be memory-efficient, enabling substantially longer contexts on the same hardware.

**2023 (Grouped-Query Attention)**: Ainslie et al.
found the sweet spot between full multi-head and multi-query attention, now standard in Llama 2/3 and most production models. **The pattern**: Attention evolved from an add-on for RNNs to the core mechanism of modern AI. Each improvement focused on the same goal: maintain quality while reducing memory and compute costs. ::: --- ## The Transformer Architecture ### Building the Complete Model A transformer consists of stacked blocks, each containing: 1. Multi-head self-attention 2. Feed-forward network 3. Residual connections and layer normalization ![Transformer Block Architecture](../assets/diagrams/rendered/transformer_block.svg) Let's build each component and understand why it's there. ### Residual Connections: The Gradient Highway Without residual connections, gradients must flow through each layer sequentially. In deep networks (32-100+ layers), gradients can vanish or explode. Residual connections provide a "gradient highway": gradients can flow directly through the addition, bypassing the layer if needed. ```python # Without residual: x = layer(x) # Gradient flows through layer only # With residual: x = x + layer(x) # Gradient flows through addition AND layer ``` Mathematically, if $f(x)$ is the layer function: - Without residual: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial x}$ (chain rule through all layers) - With residual: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial (x + f)} \left(1 + \frac{\partial f}{\partial x}\right)$ (gradient flows via the "1") The "+1" term ensures gradients don't vanish even if $\frac{\partial f}{\partial x} \approx 0$. 
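The effect of the "+1" term can be checked numerically. The following is a minimal sketch rather than a real network: each "layer" is the scalar function $f(x) = \tanh(w x)$, and the function name, the weight 0.5, and the depth of 32 are all illustrative assumptions. It multiplies the per-layer derivatives exactly as the chain rule above prescribes.

```python
import math


def input_gradient(x: float, weight: float, num_layers: int, residual: bool) -> float:
    """d(output)/d(input) for a stack of scalar layers f(x) = tanh(weight * x)."""
    grad = 1.0
    for _ in range(num_layers):
        f = math.tanh(weight * x)
        df_dx = weight * (1.0 - f * f)  # derivative of tanh(weight * x)
        if residual:
            grad *= 1.0 + df_dx         # the "1" is the identity path
            x = x + f
        else:
            grad *= df_dx               # every factor is below 1 here
            x = f
    return grad


plain = input_gradient(x=1.0, weight=0.5, num_layers=32, residual=False)
skip = input_gradient(x=1.0, weight=0.5, num_layers=32, residual=True)
print(f"without residual: {plain:.3e}")
print(f"with residual:    {skip:.3e}")
```

Without the skip path every factor is at most 0.5, so the product collapses toward zero after 32 layers. With it, every factor is at least 1, so the gradient reaching the input cannot vanish.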
### Layer Normalization: Stabilizing Training

Layer normalization normalizes activations across the feature dimension, stabilizing training dynamics:

```python
class LayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # Learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # Learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        # Biased variance with eps inside the square root, as in nn.LayerNorm
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```

The intuition: ensure each layer receives inputs with a consistent mean and scale, regardless of what previous layers produced.

**Pre-norm vs. Post-norm:**

The original Transformer used **post-norm**:

```python
x = LayerNorm(x + Sublayer(x))
```

Modern models use **pre-norm**:

```python
x = x + Sublayer(LayerNorm(x))
```

Pre-norm is more stable during training. The difference: with pre-norm, normalization is applied only to the sublayer's input, so the residual stream itself is never normalized and gradients keep a clean identity path from the output back to the embeddings.

### Feed-Forward Networks: Where Knowledge Lives

Each transformer block contains a feed-forward network (FFN) applied independently to each position:

```python
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # Expand
        self.w2 = nn.Linear(d_ff, d_model)   # Project back
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w2(self.dropout(F.gelu(self.w1(x))))
```

The expansion ratio ($d_{ff} / d_{model}$) is typically 4x in classical transformers, ~2.7x in LLaMA (with SwiGLU).
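The ~2.7x figure is not arbitrary. A gated FFN such as SwiGLU (introduced below) uses three weight matrices instead of two, so shrinking the hidden size to about two thirds of $4 \times d_{model}$ keeps the parameter count roughly unchanged. The sketch below checks that arithmetic. The function names are illustrative, and rounding up to a multiple of 256 is an assumption modeled on LLaMA's published configuration.

```python
def classic_ffn_params(d_model: int, ratio: int = 4) -> int:
    """Two matrices: d_model -> d_ff and d_ff -> d_model (biases ignored)."""
    return 2 * d_model * (ratio * d_model)


def swiglu_hidden_size(d_model: int, multiple_of: int = 256) -> int:
    """Two thirds of 4 * d_model, rounded up to a multiple of `multiple_of`."""
    hidden = (2 * 4 * d_model) // 3
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def swiglu_ffn_params(d_model: int, d_ff: int) -> int:
    """Three matrices: two d_model -> d_ff projections, one d_ff -> d_model."""
    return 3 * d_model * d_ff


d_model = 4096
d_ff = swiglu_hidden_size(d_model)
print(d_ff, round(d_ff / d_model, 2))
print(classic_ffn_params(d_model), swiglu_ffn_params(d_model, d_ff))
```

For `d_model = 4096` this yields a hidden size of 11008, and the two parameter counts differ by less than one percent.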
**Why is the FFN important?** Research suggests attention and FFN have distinct roles:

- **Attention**: Routes information between positions, handles relationships and coreference
- **FFN**: Transforms information at each position, stores factual knowledge

Model-editing experiments suggest that "facts" like "The Eiffel Tower is in Paris" are stored largely in FFN weights: modifying a small set of FFN parameters can change what the model recalls.

**SwiGLU Activation:** Modern models replace GELU with SwiGLU, a gated activation:

$$\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_3) W_2$$

Where $\text{Swish}(x) = x \cdot \sigma(x)$ and $\odot$ is element-wise multiplication.

```python
class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # Gate projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # Output projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # Value (up) projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

The gate, $\text{Swish}(xW_1)$, provides multiplicative control over which components of the linear projection $xW_3$ pass through, which empirically improves model quality.

### The Complete Transformer Block

Putting it together:

```python
class TransformerBlock(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1
    ):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = SwiGLU(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Pre-norm attention with residual
        normed = self.norm1(x)
        attn_out = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_out)

        # Pre-norm feed-forward with residual
        x = x + self.dropout(self.feed_forward(self.norm2(x)))
        return x
```

### Decoder-Only Architecture: The Modern Standard

The original Transformer was encoder-decoder.
Modern LLMs (GPT, Claude, LLaMA) are decoder-only: ```python class TransformerLM(nn.Module): def __init__( self, vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_seq_length: int, dropout: float = 0.1 ): super().__init__() # Embeddings self.token_embedding = nn.Embedding(vocab_size, d_model) self.position_embedding = nn.Embedding(max_seq_length, d_model) # Transformer blocks self.blocks = nn.ModuleList([ TransformerBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers) ]) # Final layer norm and output projection self.norm = LayerNorm(d_model) self.lm_head = nn.Linear(d_model, vocab_size, bias=False) # Weight tying: share embedding and output weights self.lm_head.weight = self.token_embedding.weight self._init_weights() def _init_weights(self): """Initialize weights for training stability.""" for module in self.modules(): if isinstance(module, nn.Linear): torch.nn.init.normal_(module.weight, mean=0.0, std=0.02) if module.bias is not None: torch.nn.init.zeros_(module.bias) def forward( self, input_ids: torch.Tensor, # (batch, seq_len) mask: torch.Tensor = None ) -> torch.Tensor: batch_size, seq_len = input_ids.shape device = input_ids.device # Create causal mask if not provided if mask is None: mask = torch.tril(torch.ones(seq_len, seq_len, device=device)) # Embeddings: token + position positions = torch.arange(seq_len, device=device) x = self.token_embedding(input_ids) + self.position_embedding(positions) # Transformer blocks for block in self.blocks: x = block(x, mask) # Output projection logits = self.lm_head(self.norm(x)) # (batch, seq_len, vocab_size) return logits ``` **Why decoder-only dominates:** 1. **Simplicity**: One architecture serves all tasks (generation, classification, QA) 2. **Emergent capabilities**: Tasks like translation emerge from next-token prediction 3. **Efficient inference**: Causal attention enables KV caching 4. **Scale**: The simplest architecture scales best with compute ### Encoder-Decoder vs. 
Decoder-Only

**Encoder-decoder** (T5, BART, original Transformer):

- Encoder: bidirectional attention (sees full input)
- Decoder: causal attention + cross-attention to encoder
- Good for: translation, summarization with clear input/output separation

**Encoder-only** (BERT, RoBERTa):

- Bidirectional attention throughout
- Good for: classification, NER, semantic similarity
- Cannot generate text naturally

**Decoder-only** (GPT, Claude, LLaMA):

- Causal attention throughout
- Good for: generation, and surprisingly, everything else too
- Handles classification/similarity via prompting

### Mixture of Experts (MoE): Sparse Scaling

Standard transformers use every parameter for every token. **Mixture of Experts (MoE)** architectures activate only a subset of parameters per token, enabling much larger models without proportionally more compute.

The key insight: replace the single feed-forward network (FFN) in each transformer block with multiple "expert" FFNs, plus a router that decides which experts process each token.

```
Standard FFN:                    MoE Layer:

Input ──→ [FFN] ──→ Output       Input ──→ [Router] ──→ Expert 1 ──→┐
                                              │     ──→ Expert 2    ├──→ Output
                                              └────────────→ ...    │
                                                    ──→ Expert N ──→┘
                                                    (most inactive)
```

**How routing works:**

```python
# Simplified router (real implementations are more complex)
def route_tokens(hidden_states, router_linear, top_k=2):
    # router_linear is an nn.Linear(d_model, num_experts), e.g. num_experts=8
    # Router computes a probability for each expert
    router_logits = router_linear(hidden_states)  # (batch, seq, num_experts)
    router_probs = F.softmax(router_logits, dim=-1)

    # Select top-k experts per token and renormalize their weights
    top_probs, top_indices = router_probs.topk(top_k, dim=-1)
    top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)

    # With 2 of 8 experts active, each token uses 4x less FFN compute
    # than it would if all experts ran
    return top_indices, top_probs
```

**Why MoE matters for AI engineers:**

1. **Frontier models use MoE**: Mixtral is an open MoE model, Google has described Gemini 1.5 as MoE, and GPT-4 is widely reported (but not officially confirmed) to be one. The architectures of other closed models, such as Claude, are undisclosed. Understanding MoE explains how a model can have a very large total parameter count and still be reasonably fast.

2. **Total parameters vs.
active parameters**: Mixtral 8x7B has 47B total parameters but only ~13B active per token. This means:
   - Inference cost closer to a 13B model
   - Memory for full model: 47B weights
   - Quality approaching a 47B dense model

3. **Load balancing challenges**: If all tokens route to the same experts, training fails. MoE architectures include auxiliary losses to encourage balanced expert utilization.

| Model | Architecture | Total Params | Active Params | Experts |
|-------|-------------|--------------|---------------|---------|
| Mixtral 8x7B | MoE | 47B | ~13B | 8 (top-2) |
| Mixtral 8x22B | MoE | 141B | ~39B | 8 (top-2) |
| GPT-4 (reported, unconfirmed) | MoE | ~1.8T | ~220B | 16 (top-2) |
| LLaMA 2 70B | Dense | 70B | 70B | N/A |

**MoE trade-offs:**

- **Pro**: More capacity at similar inference cost
- **Pro**: Experts can specialize, although in practice the specialization often follows token type or syntax rather than clean domains such as code or math
- **Con**: Higher memory: all expert weights must be loaded
- **Con**: Harder to fine-tune: expert specialization can be disrupted
- **Con**: Communication overhead in distributed training

**When you'll encounter MoE:**

- **Inference optimization**: Expert offloading, expert parallelism
- **Model selection**: Understanding why Mixtral competes with larger dense models
- **Fine-tuning decisions**: MoE models may need different fine-tuning strategies

### Scaling Laws: More is (Usually) Better

::: {.callout-tip}
## Staff Engineer Perspective

"Scaling laws are useful for understanding trends, but dangerous for planning specific projects. 'We'll just use a bigger model' became a crutch that let teams avoid fixing actual problems. I've seen teams throw GPT-4 at problems that GPT-3.5 with better prompts solved perfectly. Understand scaling laws, but don't use them as an excuse to skip optimization."

*ML Platform Architect at large tech company*
:::

Kaplan et al. (2020) and Hoffmann et al.
(2022, "Chinchilla") established that model performance follows predictable power laws:

$$L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_{\infty}$$

Where:

- $L$ is loss (lower is better)
- $N$ is parameter count
- $D$ is training tokens
- $N_c, D_c, \alpha_N, \alpha_D$ are fitted constants
- $L_{\infty}$ is the irreducible loss that remains even with unlimited parameters and data

**Key insights:**

1. **Smooth improvement**: Loss decreases predictably with more parameters and data
2. **Compute-optimal scaling** (Chinchilla): Training tokens should be approximately 20x the parameter count. A 10B model should see ~200B tokens.
3. **No plateau yet**: Frontier models continue to follow the power law, although a power law means each further reduction in loss costs multiplicatively more compute

**Practical implications:**

- If you have a compute budget, split it between model size and training data according to Chinchilla
- For inference, smaller well-trained models often beat larger undertrained ones
- You can predict performance before training, enabling better planning

### Parameter Counting: Know Your Model

Understanding parameter counts helps with hardware planning:

```python
def count_parameters(config):
    """Count parameters in a transformer LM (config exposes attributes)."""
    params = {}

    # Token embeddings
    params['token_embed'] = config.vocab_size * config.d_model

    # Position embeddings (if learned)
    params['pos_embed'] = config.max_seq_length * config.d_model

    # Per-layer parameters
    layer_params = {}
    layer_params['attention'] = 4 * config.d_model * config.d_model  # Q, K, V, O
    layer_params['ffn'] = 2 * config.d_model * config.d_ff  # w1, w2
    if getattr(config, 'use_swiglu', False):
        layer_params['ffn'] = 3 * config.d_model * config.d_ff  # w1, w2, w3
    layer_params['layer_norm'] = 4 * config.d_model  # Two layer norms (scale + shift)

    params['layers'] = sum(layer_params.values()) * config.num_layers

    # Output projection (often tied with embedding)
    params['lm_head'] = config.vocab_size * config.d_model

    total = sum(params.values())
    if config.tie_weights:
        total -= params['lm_head']  # Subtract if tied

    return total, params

# Example: LLaMA-7B
from types import SimpleNamespace

# SimpleNamespace gives the attribute access that count_parameters expects
llama_config = SimpleNamespace(
    vocab_size=32000,
    d_model=4096,
    num_layers=32,
    d_ff=11008,
    max_seq_length=2048,  # LLaMA uses rotary embeddings; this term adds < 0.2%
    use_swiglu=True,
    tie_weights=False,    # LLaMA keeps separate input and output embeddings
)
total, breakdown = count_parameters(llama_config)
print(f"{total / 1e9:.2f}B parameters")
# Result: ~6.7B parameters
```

**Memory estimation:**

| Component | Formula | Example (7B, 4K context) |
|-----------|---------|--------------------------|
| Model weights (fp16) | params * 2 bytes | 14 GB |
| KV cache (fp16) | 2 * layers * heads * head_dim * seq_len * 2 bytes | ~2.1 GB per sequence |
| Activations | ~2x weights during forward | 28 GB |
| **Inference total** | weights + KV cache | ~16 GB |
| **Training total** | ~16 bytes per parameter (fp16 weights and gradients, fp32 master weights and Adam states) | ~110 GB |

---

## Training and Pre-training

### The Pre-training Objective

Modern LLMs are trained on a deceptively simple objective: predict the next token.

Given a sequence $[x_1, x_2, ..., x_n]$, the model predicts $P(x_i | x_1, ..., x_{i-1})$ for all positions. The loss is cross-entropy:

$$\mathcal{L} = -\sum_{i=1}^{n} \log P(x_i | x_1, ..., x_{i-1})$$

```python
def compute_lm_loss(model, input_ids):
    """Standard causal language modeling loss."""
    logits = model(input_ids)  # (batch, seq_len, vocab_size)

    # Shift for next-token prediction:
    # - logits[:, :-1, :] predicts tokens 1, 2, ..., n-1
    # - input_ids[:, 1:] are the targets (tokens 1, 2, ..., n-1)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()

    # F.cross_entropy averages over tokens rather than summing
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1)
    )
    return loss
```

**Why next-token prediction works:**

To predict text well, the model must learn:

- Grammar and syntax ("The" is often followed by a noun or adjective)
- Factual knowledge ("The capital of France is..." → "Paris")
- Reasoning patterns ("If all A are B, and X is an A, then X is...")
- Style and register (formal vs. casual, technical vs. conversational)
- Common sense (people go to restaurants to eat, not to sleep)

All of this emerges from the simple objective of predicting what comes next.
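A tiny worked example can make the loss tangible. The sketch below is not tied to any model: the probabilities are made-up values standing in for what a model assigned to the correct next token at three positions, and `next_token_loss` is a hypothetical helper that averages the negative log-likelihood over those positions.

```python
import math


def next_token_loss(target_probs: list[float]) -> float:
    """Mean negative log-likelihood of the correct next tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


# Hypothetical probabilities assigned to the true next token
# at three positions of one sequence
probs_of_true_tokens = [0.5, 0.25, 0.125]

loss = next_token_loss(probs_of_true_tokens)
print(f"loss = {loss:.4f} nats")
print(f"exp(loss) = {math.exp(loss):.2f}")
```

Exponentiating the mean loss gives the perplexity, which reads as an effective number of equally likely choices per token. Here the model behaves as if it were picking among four candidates on average.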
### Masked Language Modeling (BERT-style)

An alternative training objective: randomly mask 15% of tokens and predict them:

```python
def compute_mlm_loss(model, tokenizer, input_ids, mask_prob=0.15):
    """Masked language modeling loss.

    Simplified: BERT also leaves some selected tokens unchanged
    or replaces them with random tokens.
    """
    # Choose positions to mask, never masking padding
    rand = torch.rand(input_ids.shape, device=input_ids.device)
    mask = (rand < mask_prob) & (input_ids != tokenizer.pad_token_id)

    # Replace the chosen positions with the mask token
    masked_input = input_ids.clone()
    masked_input[mask] = tokenizer.mask_token_id

    # Score only the masked positions against the original tokens
    logits = model(masked_input)  # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits[mask], input_ids[mask])
    return loss
```

MLM enables bidirectional context (the model sees tokens on both sides of the mask), which is useful for understanding tasks. But it can't generate text naturally.

### From Pre-training to Instruction Tuning

Raw pre-trained models complete text, but they don't follow instructions or hold conversations. Instruction tuning bridges this gap:

**Pre-training data:**

```
The cat sat on the mat. It was a warm day and the sun was shining...
```

**Instruction tuning data:**

```
{"instruction": "Summarize the following text.",
 "input": "The cat sat on the mat. It was a warm day...",
 "output": "A cat rests on a mat on a sunny day."}
```

The model learns to map instructions to appropriate outputs, becoming useful as an assistant.

### RLHF and Alignment

Even instruction-tuned models can be unhelpful, harmful, or misaligned. Reinforcement Learning from Human Feedback (RLHF) further refines behavior:

1. **Collect comparisons**: Humans rank multiple model outputs for the same prompt
2. **Train reward model**: Learn to predict which outputs humans prefer
3. **Optimize policy**: Use PPO or similar to maximize reward while staying close to a reference model (typically the instruction-tuned checkpoint)

**Constitutional AI** (Anthropic's approach) uses AI feedback: the model critiques its own outputs against a set of principles, generating training data for self-improvement.
These techniques are crucial for making models safe, helpful, and honest, but they're outside the scope of this foundations chapter.

### Training at Scale: Practical Considerations

**Compute requirements:**

| Model Size | Training Tokens | GPU Hours (A100) | Approximate Cost |
|------------|-----------------|------------------|------------------|
| 7B | 1T | 100,000 | $150K |
| 13B | 1T | 200,000 | $300K |
| 70B | 2T | 1,500,000 | $2M |

**When to fine-tune vs. prompt:**

| Situation | Recommendation |
|-----------|----------------|
| Few examples, simple task | Few-shot prompting |
| Domain-specific vocabulary | Fine-tune or RAG |
| Specific output format | Fine-tune |
| Need updated knowledge | RAG + prompting |
| Behavioral changes | RLHF or fine-tuning |

::: {.callout-warning}
## Common Mistake: Fine-tuning Before Trying Prompting

**What people do:** Immediately jump to fine-tuning when a model doesn't perform well on their task, spending weeks collecting training data and running training jobs.

**Why it fails:** Fine-tuning is expensive (compute, data collection, evaluation) and creates maintenance burden (model versioning, retraining as base models improve). Many tasks that seem to require fine-tuning can be solved with better prompting, few-shot examples, or retrieval-augmented generation.

**Fix:** Follow this hierarchy: (1) Improve your prompt: be specific, provide examples, use chain-of-thought. (2) Add few-shot examples: often 3-5 examples unlock capability. (3) Try RAG: if the issue is knowledge, retrieval is cheaper than training. (4) Use a larger model: capabilities increase with scale. Only fine-tune after exhausting these options, or when you need specific output formats, domain vocabulary, or behavioral changes that prompting cannot achieve.
:::

**Parameter-efficient fine-tuning (LoRA, QLoRA):**

Instead of updating all parameters, add small trainable adapters:

```python
import math

# Conceptual LoRA: low-rank updates to weight matrices
# Original: y = xW
# LoRA:     y = x(W + AB) where A is (d_in, r) and B is (r, d_out), r << d
class LoRALinear(nn.Module):
    def __init__(self, original: nn.Linear, rank: int = 16):
        super().__init__()
        self.original = original
        for param in self.original.parameters():
            param.requires_grad = False  # Freeze weight and bias

        # nn.Linear stores its weight as (out_features, in_features)
        d_out, d_in = original.weight.shape
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) / math.sqrt(rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))  # Zero init: no change at start

    def forward(self, x):
        return self.original(x) + x @ self.lora_A @ self.lora_B
```

LoRA reduces trainable parameters by 100-1000x while achieving similar performance to full fine-tuning for many tasks.

---

## Inference and Generation

### The Autoregressive Generation Loop

Text generation is iterative: generate one token, append it to the input, generate the next.

```python
def generate(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 100,
    temperature: float = 1.0,
    top_p: float = 0.9
) -> str:
    """Generate text autoregressively (batch size 1)."""
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids)[:, -1, :]  # Last position only

        # Apply temperature
        logits = logits / temperature

        # Top-p (nucleus) sampling; top_p_sample is defined later in this section
        next_token = top_p_sample(logits, top_p)

        # Append and continue
        input_ids = torch.cat([input_ids, next_token], dim=-1)

        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0])
```

Each iteration requires a forward pass through the entire model. For a 70B model generating 100 tokens, that's 100 forward passes of at least ~140 billion floating-point operations each (about 2 per parameter per token), and more in this naive loop, which reprocesses the whole prefix at every step.
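The cost of the loop can be estimated with back-of-the-envelope arithmetic. The sketch below relies on two common rules of thumb, both stated as assumptions: a cached decode step costs roughly 2 floating-point operations per parameter, and every weight must be read from memory once per generated token. The function names and the 2 TB/s bandwidth figure are hypothetical.

```python
def decode_flops(num_params: float, num_new_tokens: int) -> float:
    """Approximate FLOPs to generate tokens with a KV cache.

    Assumes ~2 FLOPs per parameter per token (one multiply, one add)
    and ignores the attention-over-context term.
    """
    return 2.0 * num_params * num_new_tokens


def min_seconds_per_token(num_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on decode latency: every weight is read once per token."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s


params = 70e9
print(f"{decode_flops(params, 1):.2e} FLOPs per token")
print(f"{decode_flops(params, 100):.2e} FLOPs for 100 tokens")
# Hypothetical accelerator with 2 TB/s of memory bandwidth, fp16 weights
print(f"{min_seconds_per_token(params, 2, 2e12) * 1e3:.0f} ms per token (lower bound)")
```

Estimates like these are only order-of-magnitude guides, but they show why single-sequence decoding tends to be limited by memory traffic rather than by arithmetic.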
### Temperature: Controlling Randomness

Temperature scales logits before softmax, controlling the "sharpness" of the distribution:

$$P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

| Temperature | Effect | Use Case |
|-------------|--------|----------|
| T → 0 | Greedy (argmax) | Deterministic, factual |
| T = 0.5 | Focused | Coherent creative |
| T = 1.0 | Model distribution | General use |
| T = 2.0 | High entropy | Brainstorming, diversity |

```python
def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    if temperature == 0:
        raise ValueError("Use greedy decoding for T=0")
    return logits / temperature

# Low temperature concentrates probability on top tokens
# High temperature flattens the distribution
```

### Top-k and Top-p Sampling

**Top-k sampling** restricts to the k most likely tokens:

```python
def top_k_sample(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    top_k_logits, top_k_indices = torch.topk(logits, k, dim=-1)
    probs = F.softmax(top_k_logits, dim=-1)
    sampled_idx = torch.multinomial(probs, num_samples=1)
    return top_k_indices.gather(-1, sampled_idx)
```

Problem: k is fixed, but the "good" vocabulary size varies. For "The capital of France is ___", there's really only one good answer. For "I feel ___", many tokens are valid.
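Both effects can be seen on a toy vocabulary of eight tokens. This is a dependency-free sketch with made-up logits, not output from a real model. `softmax` applies the temperature formula above, and the hypothetical helper `tokens_to_cover` counts how many of the most likely tokens are needed to reach a given share of the probability mass.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax (temperature must be > 0)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def tokens_to_cover(probs: list[float], mass: float = 0.9) -> int:
    """How many of the most likely tokens are needed to reach `mass`."""
    covered = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        covered += p
        if covered >= mass:
            return count
    return len(probs)


peaked = [8.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]  # one obvious continuation
flat = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]    # many plausible continuations

for t in (0.5, 1.0, 2.0):
    print(f"T={t}: top-token probability = {max(softmax(peaked, t)):.3f}")
print("tokens needed for 90% of the mass:",
      tokens_to_cover(softmax(peaked)), "vs", tokens_to_cover(softmax(flat)))
```

Raising the temperature lowers the top token's share. The peaked distribution needs a single token to cover 90% of the mass while the flat one needs seven of the eight, so no fixed k suits both cases.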
**Top-p (nucleus) sampling** adapts to the distribution, keeping the smallest set of tokens whose cumulative probability exceeds p:

```python
def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Find cutoff index; shift right so the token that crosses p is kept
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    sorted_logits[sorted_indices_to_remove] = float('-inf')
    probs = F.softmax(sorted_logits, dim=-1)

    sampled = torch.multinomial(probs, num_samples=1)
    return sorted_indices.gather(-1, sampled)
```

Top-p is now the standard for production systems, often combined with temperature.

### KV Caching: The Essential Optimization

Without caching, generating token $n$ requires recomputing K and V for all $n$ tokens. But tokens 1 through $n-1$ don't change between iterations, so their K and V values are constant.

**KV caching** stores computed K and V values, reusing them:

```python
import math

class CachedAttention(nn.Module):
    """Single-head attention with a KV cache (head splitting omitted for clarity)."""

    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.k_cache = None
        self.v_cache = None

    def forward(self, x, use_cache=True):
        # x holds only the new tokens: (batch, new_len, d_model)
        # Compute Q, K, V for new tokens only
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        if use_cache and self.k_cache is not None:
            # Append new K, V to cache
            K = torch.cat([self.k_cache, K], dim=1)
            V = torch.cat([self.v_cache, V], dim=1)

        if use_cache:
            self.k_cache = K.detach()
            self.v_cache = V.detach()

        # Attention: Q attends to all K (including cached).
        # One new token per step needs no causal mask; a multi-token prefill would.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        return F.softmax(scores, dim=-1) @ V
```

KV caching cuts the number of per-token forward computations needed to generate $n$ tokens from $O(n^2)$ to $O(n)$. Each step still attends over the whole cache, so attention cost per step continues to grow linearly with sequence length.

### Speculative Decoding: Parallelizing Generation

Generation is bottlenecked by sequential token generation.
Speculative decoding uses a small, fast "draft" model to generate candidates, then verifies them in parallel with the large model:

```python
def speculative_decode(
    large_model,
    draft_model,
    input_ids,
    n_draft: int = 4
):
    """One step of greedy speculative decoding (batch size 1)."""
    prompt_len = input_ids.shape[1]

    # Draft model proposes up to n_draft tokens quickly
    draft_tokens = draft_model.generate(input_ids, max_new_tokens=n_draft)

    # Large model scores all proposed tokens in parallel
    with torch.no_grad():
        logits = large_model(draft_tokens)

    # preds[j] is the large model's greedy choice for position j + 1
    preds = logits[0].argmax(dim=-1)

    # Accept draft tokens until one is rejected
    n_accepted = 0
    for i in range(draft_tokens.shape[1] - prompt_len):
        if draft_tokens[0, prompt_len + i] == preds[prompt_len + i - 1]:
            n_accepted += 1
        else:
            break

    # Large model's own token: the correction at the first mismatch,
    # or a bonus token if every draft token was accepted
    next_token = preds[prompt_len + n_accepted - 1].view(1, 1)

    # Return accepted tokens + one from large model
    return torch.cat([draft_tokens[:, :prompt_len + n_accepted], next_token], dim=-1)
```

When draft and target models agree, you get multiple tokens for one target model forward pass. Speed improvements of 2-3x are common. The version above verifies against greedy choices; sampling-based variants use a rejection step so that the output distribution matches the large model's.

### Batching and Continuous Batching

**Static batching** waits for all sequences in a batch to complete:

```
Sequence 1: [tokens...] [pad] [pad] [pad]    <- waiting
Sequence 2: [tokens...tokens...tokens...]    <- still generating
```

This wastes compute on padding.

**Continuous batching** dynamically adds/removes sequences:

```
t=0: [seq1, seq2, seq3] all generating
t=5: seq1 done, add seq4: [seq2, seq3, seq4]
t=8: seq3 done, add seq5: [seq2, seq4, seq5]
```

This maximizes GPU utilization for serving systems.

### Production Inference Systems

Modern inference frameworks (vLLM, TensorRT-LLM, TGI) combine:

- KV caching with memory management (PagedAttention)
- Continuous batching
- Speculative decoding
- Quantization (int8, int4)
- Tensor parallelism across GPUs

These optimizations can improve throughput 10-100x compared to naive implementations.
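The benefit of continuous batching described above can be illustrated with a small scheduling simulation. This is a toy model under simplifying assumptions, not how any particular serving framework works: each request is reduced to the number of tokens it still has to generate (at least one), every decode step produces one token for each active request, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Finished sequences free their slot for the next waiting request."""
    queue = deque(lengths)
    active: list[int] = []  # tokens remaining for each running sequence
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step produces one token per active sequence
        active = [remaining - 1 for remaining in active]
        steps += 1
        # Drop sequences that just finished
        active = [remaining for remaining in active if remaining > 0]
    return steps


output_lengths = [2, 8, 3, 7, 4, 6]  # tokens each request will generate
print("static:    ", static_batching_steps(output_lengths, batch_size=3))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=3))
```

In this toy workload the continuous scheduler finishes in fewer decode steps because short requests no longer hold a slot idle while the longest request in their batch completes. The gap widens as output lengths become more uneven.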
--- ## Practical Exercises ### Exercise 1: Build a BPE Tokenizer Implement BPE from scratch: ```python class MinimalBPE: """A minimal BPE tokenizer for learning.""" def __init__(self): self.vocab = {} # token -> id self.merges = [] # list of (a, b) merge operations def train(self, text: str, num_merges: int = 100): """Train on text, learning num_merges merge operations.""" # 1. Initialize vocab with characters # 2. Repeat num_merges times: # a. Count all adjacent pairs # b. Find most frequent pair # c. Merge it: add to vocab, replace in text # d. Record the merge pass def encode(self, text: str) -> list[int]: """Apply learned merges to encode text.""" # 1. Split into characters # 2. Apply merges in order # 3. Convert to IDs pass def decode(self, ids: list[int]) -> str: """Convert IDs back to text.""" pass ``` **Verification:** 1. Train on a paragraph with 100 merges 2. Check that `decode(encode(text)) == text` 3. Examine which pairs merged first (should be common pairs like "th", "in") 4. Test on new text: common words should be single tokens ### Exercise 2: Implement Attention from Scratch ```python def attention(Q, K, V, mask=None): """ Implement scaled dot-product attention. Args: Q: (batch, seq_q, d_k) K: (batch, seq_k, d_k) V: (batch, seq_k, d_v) mask: (seq_q, seq_k) causal mask Returns: output: (batch, seq_q, d_v) weights: (batch, seq_q, seq_k) """ pass ``` **Verification:** 1. Weights should sum to 1 along the last dimension 2. Without scaling, high-d_k should produce nearly one-hot weights 3. With causal mask, weights above the diagonal should be zero 4. Visualize attention for a real sentence ### Exercise 3: Memory Estimation For LLaMA-7B (vocab=32000, d_model=4096, layers=32, heads=32, d_ff=11008): Calculate by hand: 1. Total parameters 2. Model memory in fp16 3. KV cache for 4K context 4. Minimum GPU memory for inference Compare to the actual LLaMA-7B requirements. 
### Exercise 4: Generation with Sampling Implement a complete generation loop: ```python def generate_with_sampling( model, tokenizer, prompt: str, max_tokens: int = 100, temperature: float = 0.8, top_p: float = 0.95 ) -> str: """Generate text with temperature and top-p sampling.""" pass ``` **Verification:** 1. Temperature=0 should be deterministic 2. Higher temperature should increase diversity 3. Compare outputs to the model's built-in generate method --- ## Key Takeaways 1. **Tokenization determines how models see text** - Subword algorithms like BPE convert text to discrete tokens, and token boundaries directly affect model behavior. Mismatched tokenizers cause silent failures. 2. **Attention is the core transformer innovation** - The queries-keys-values mechanism enables parallel processing and direct connections between all token pairs, with multi-head attention capturing different relationship types. 3. **Scale follows predictable power laws** - Model capability scales predictably with parameters, data, and compute. Understanding these relationships helps make informed architecture and training decisions. 4. **Generation is memory-bandwidth bound** - During autoregressive decoding, GPUs spend most time loading weights from memory. KV caching, batching, and quantization are essential optimizations for efficient inference. 5. **Training objectives shape capabilities** - Next-token prediction enables models to acquire grammar, knowledge, and reasoning. Instruction tuning and RLHF transform text predictors into useful assistants. --- ## Summary This chapter established the foundational concepts underlying modern large language models: **Tokenization** converts text to discrete tokens through subword algorithms like BPE. The tokenizer determines how text is represented to the model. Token boundaries affect model behavior in subtle ways, and mismatched tokenizers cause silent failures. 
**Embeddings** are learned lookup tables that place tokens in continuous vector spaces. Through training, semantically similar tokens cluster together. Contextual embeddings from transformers vary based on surrounding context, enabling disambiguation. **Attention** is the core innovation that enables parallel processing and direct connections between all token pairs. The queries-keys-values framework implements content-based lookup, and multi-head attention captures multiple types of relationships. The $\sqrt{d_k}$ scaling is crucial for training stability. **The Transformer architecture** stacks attention and feed-forward layers with residual connections and layer normalization. Decoder-only architectures dominate modern LLMs. Scale (parameters, data, compute) determines capability, following predictable power laws. **Training objectives** shape what models learn. Next-token prediction is remarkably powerful, enabling models to acquire grammar, knowledge, and reasoning from the single objective of predicting what comes next. Instruction tuning and RLHF transform text predictors into useful assistants. **Inference and generation** is where models produce outputs. Temperature and sampling strategies control the quality-diversity tradeoff. KV caching is essential for efficient generation. Modern inference systems combine many optimizations for practical deployment. Understanding these components mechanically enables you to: - Debug unexpected model behavior - Estimate computational requirements - Make informed architecture and configuration decisions - Reason about what models can and cannot do --- ## Self-Assessment Checkpoint ### Conceptual Questions **Q1. [IC2]** Why does the scaling factor $\sqrt{d_k}$ appear in the attention equation? What would happen if we removed it? <details> <summary>Answer</summary> As the dimension $d_k$ increases, the dot products between Q and K vectors grow larger in magnitude. 
Large values entering softmax push it into a saturated regime where gradients become very small, making training slow and unstable. Dividing by $\sqrt{d_k}$ normalizes the variance back to approximately 1, regardless of dimension. Without it, training would be unstable, especially for larger models with larger head dimensions.

</details>

**Q2. [IC2]** A 7B parameter model uses 32 attention heads with dimension 128. If you switch to GQA with 4 KV heads, how much memory does this save for the KV cache during generation with a batch of 16 sequences at 4K context?

<details>
<summary>Answer</summary>

The calculation assumes 32 layers and FP16 storage (2 bytes per value); the leading factor of 2 counts keys and values.

Original KV cache: 2 × 32 layers × 4096 context × 32 heads × 128 dim × 2 bytes × 16 batch = 34.4 GB

With GQA (4 KV heads instead of 32): 2 × 32 × 4096 × 4 × 128 × 2 × 16 = 4.3 GB

Savings: ~30 GB (8x reduction in KV cache). This is why GQA is essential for long-context serving.

</details>

**Q3. [Senior]** Explain why a model might produce different outputs for "The cat sat on the mat" vs "The cat sat on the mat" (extra spaces). What determines this behavior?

<details>
<summary>Answer</summary>

Tokenization determines this. BPE and similar tokenizers process whitespace as part of tokens. Extra spaces create different token sequences. "The" might be one token, but " The" (with leading space) is typically a different token. Multiple spaces may tokenize into space tokens + word tokens differently than single spaces. The model may have seen certain spacing patterns rarely or never during training, so its behavior on them is unpredictable. This is why input normalization matters.

</details>

**Q4. [Senior]** During inference, why is the prefill phase compute-bound while the decode phase is memory-bandwidth bound?

<details>
<summary>Answer</summary>

Prefill processes all input tokens in parallel: it is a dense matrix-matrix multiplication (weights × all token embeddings). This has high arithmetic intensity (many FLOPs per byte loaded).
Decode generates one token at a time, which is a matrix-vector multiplication (weights × single token embedding). The entire set of model weights must be loaded from memory for each token, but each weight takes part in only about one multiply-add per load. The arithmetic intensity drops dramatically, making memory bandwidth the bottleneck.

</details>

**Q5. [Staff]** How do scaling laws inform the decision between training a smaller model longer vs. a larger model for fewer steps? What practical constraints might override the compute-optimal recommendation?

<details>
<summary>Answer</summary>

Chinchilla scaling laws show compute-optimal training uses roughly 20 tokens per parameter. However: (1) Inference costs favor smaller models: a 2x larger model costs roughly 2x more per inference for as long as it is deployed, while training is a one-time cost. High-volume production may justify over-training smaller models. (2) Hardware constraints: you may not have enough GPUs for a larger model. (3) Latency requirements: smaller models are faster. (4) Capability thresholds: some capabilities may only emerge at certain scales, making a larger model necessary regardless of compute efficiency.

</details>

### Spot the Problem

**Problem 1. [IC2]** Find the issue in this attention implementation:

```python
def broken_attention(Q, K, V):
    scores = torch.matmul(Q, K)  # Q @ K
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```

<details>
<summary>Answer</summary>

Two issues: (1) K should be transposed: attention computes Q @ K^T, not Q @ K. The shapes won't match otherwise (Q is [batch, seq, d_k] and K is also [batch, seq, d_k]). (2) Missing the scaling factor √d_k. Correct: `scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(K.size(-1))`

</details>

**Problem 2. [Senior]** A team fine-tuned a model and sees good training loss but terrible inference quality.
Their evaluation code:

```python
model.eval()
with torch.no_grad():
    for batch in eval_loader:
        outputs = model(batch['input_ids'])
        # They check loss here and it looks fine
```

What's likely wrong?

<details>
<summary>Answer</summary>

They're evaluating with teacher forcing (providing the correct previous tokens) rather than autoregressive generation. During actual inference, the model generates tokens based on its own previous outputs, and errors compound. Good training loss doesn't mean good generation. They should test with actual generation: `model.generate(prompt, max_new_tokens=100)` and evaluate the generated text quality.

</details>

**Problem 3. [Staff]** A production system shows mysteriously high memory usage. The model is 7B parameters (FP16 = 14GB), but the system needs 50GB for a single request. What's happening?

<details>
<summary>Answer</summary>

The KV cache is the likely culprit. For long context (e.g., 32K tokens) or multiple concurrent requests, KV cache can dwarf model size. Calculation: 2 × 32 layers × 32K context × 32 heads × 128 dim × 2 bytes ≈ 17GB for just KV cache at 32K context. Add model weights (14GB), activations, optimizer states (if training), and framework overhead. Solutions: Use GQA, reduce context length, implement KV cache quantization, or use PagedAttention.

</details>

### Design Exercises

**Exercise 1. [Senior]** Design a system to serve a 70B model with 100K context length. You have access to 8x A100-80GB GPUs. Outline your approach for:

- Model parallelism strategy
- KV cache management
- Expected throughput and latency characteristics

<details>
<summary>Guidance</summary>

Consider: (1) Tensor parallelism across 8 GPUs: each GPU holds 1/8 of model weights (~17.5GB per GPU in FP16). (2) KV cache at 100K context in FP16 would be enormous; consider KV cache quantization to INT8 or even INT4. (3) Use PagedAttention to manage KV cache efficiently. (4) Expected throughput will be low due to cross-GPU communication overhead and massive KV cache.
Focus on optimizing for few concurrent long-context requests rather than high throughput. Latency for prefill will be significant (seconds) due to 100K token processing.

</details>

**Exercise 2. [Staff]** You're advising a team that wants to build a code completion model. They're debating: (a) fine-tune an existing 7B model on code, (b) train a code-specific 3B model from scratch, or (c) use a general-purpose 70B model via API. What factors should drive this decision? Create a decision framework.

<details>
<summary>Guidance</summary>

Key factors: (1) Latency requirements: code completion needs under 100ms, favoring smaller self-hosted models. (2) Code-specific vs. general capability: do they need understanding of natural language requirements too? (3) Data availability: do they have enough code data to train from scratch? (4) Budget: API costs vs. training/hosting costs. (5) Privacy: can code go to external APIs? (6) Model quality: a 70B API model likely produces better code but at higher latency/cost. Recommendation often: Start with API for quality baseline, fine-tune 7B on domain-specific code for production deployment with latency constraints.

</details>

---

## Connections to Other Chapters

- **Chapter 6 (Prompt Engineering)**: Understanding tokenization helps manage context windows and craft effective prompts. Token boundaries affect how models parse instructions.
- **Chapter 7 (RAG Systems)**: Embeddings are the foundation of semantic search. The embedding concepts here directly enable document retrieval.
- **Chapter 8 (Agentic Systems)**: The generation process underlies agent loops. Understanding autoregressive generation helps design reliable agents.
- **Chapter 9 (Deployment)**: Memory requirements (KV cache, model weights) and the prefill/decode distinction are essential for optimization.

---

## Further Reading

### Essential

- **Vaswani et al. (2017), "Attention Is All You Need"**: The original transformer paper.
Focus on Section 3 (architecture) and the attention equations; everything else builds on this.
- **Brown et al. (2020), "Language Models are Few-Shot Learners"** (GPT-3): Demonstrated emergent capabilities at scale. Read for understanding why scale matters.
- **Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models"** (Chinchilla): The definitive guide to scaling laws. Essential for understanding model sizing decisions.

### Deep Dives

- **Touvron et al. (2023), "LLaMA: Open and Efficient Foundation Language Models"**: Best modern reference for how production transformers are actually built.
- **Dao et al. (2022), "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"**: How attention is computed efficiently on GPUs. Critical for understanding inference optimization.

### Visual Introduction

- **Jay Alammar, "The Illustrated Transformer"**: Start here if papers feel dense. Visual walkthrough of attention mechanics that makes the math intuitive.
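
---

The KV cache figures quoted in Q2 and Problem 3 of the self-assessment can be reproduced with a few lines of arithmetic. The sketch below is one way to do that check; the helper name `kv_cache_bytes` and its parameter names are illustrative, not from any library, and it assumes FP16 storage (2 bytes per value), 32 layers, and decimal gigabytes (1 GB = 10^9 bytes), matching the assumptions behind those answers.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # assumption: FP16 storage
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # Leading factor of 2: one cached tensor for keys, one for values.
    return (2 * n_layers * context_len * n_kv_heads
            * head_dim * bytes_per_value * batch_size)


GB = 1e9  # decimal gigabytes, as used in the self-assessment answers

# Q2: 32 KV heads (standard multi-head attention) vs. 4 KV heads (GQA),
# batch of 16 sequences at 4K context.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                     context_len=4096, batch_size=16)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=4, head_dim=128,
                     context_len=4096, batch_size=16)

# Problem 3: a single request at 32K context with 32 KV heads.
long_ctx = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                          context_len=32 * 1024)

print(f"MHA, batch 16 @ 4K:  {mha / GB:.1f} GB")       # 34.4 GB
print(f"GQA, batch 16 @ 4K:  {gqa / GB:.1f} GB")       # 4.3 GB
print(f"Savings:             {(mha - gqa) / GB:.1f} GB")  # 30.1 GB
print(f"MHA, batch 1 @ 32K:  {long_ctx / GB:.1f} GB")  # 17.2 GB
```

Because the cache size is linear in every factor, doubling the context length, batch size, or number of KV heads each doubles the memory. That is why reducing KV heads from 32 to 4 gives exactly an 8x reduction.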
