Chapter 27: Performance Engineering

Keywords

performance, profiling, optimization, GPU utilization, latency, throughput, bottleneck analysis, CUDA

Introduction

In early 2023, a startup building an AI coding assistant faced a crisis. Their prototype worked beautifully in demos—the model understood context, generated accurate code, and responded naturally to follow-up questions. But when they tried to scale beyond a hundred users, the economics collapsed. Each GPU could serve only 8 concurrent users at acceptable latency. At $3 per GPU-hour, that translated to $0.375 per user-hour—roughly $270 per user per month just in inference costs. For a product targeting $20/month subscriptions, they were losing money on every customer.

The team spent three months on performance engineering. They implemented continuous batching, added speculative decoding with a small draft model, optimized their KV cache with paged attention, and carefully profiled every component. By the end, the same GPU served 64 concurrent users—an 8x improvement. Their unit economics went from catastrophic to viable.

This story illustrates a fundamental truth about AI systems: performance isn’t just about speed—it’s about feasibility. A model that costs $1 per query can’t power a consumer product. An inference pipeline with 5-second latency can’t enable real-time applications. Understanding performance engineering transforms “this would be nice” into “this is possible.”

Why AI Performance is Different

Traditional software performance optimization focuses on algorithmic complexity, database queries, and network latency. AI workloads introduce fundamentally different challenges:

Memory bandwidth, not compute, is often the bottleneck. Modern GPUs have enormous compute capacity—the H100 can perform nearly 2,000 trillion floating-point operations per second. But loading the 140GB of weights for a 70B parameter model at FP16 precision takes 47 milliseconds at the H100’s peak 3TB/s memory bandwidth. For autoregressive generation, where we must load the entire model for each token, this creates a hard floor on latency that adding more compute cannot break.

The quadratic attention bottleneck. Self-attention—the mechanism that gives transformers their power—requires computing interactions between all pairs of tokens. For a 128K context window, that’s 16 billion attention computations per layer. Naive implementations run out of memory long before running out of time.

Batching dynamics are complex. Unlike web servers where requests are largely independent, AI inference can batch requests to amortize fixed costs—but different requests may need different numbers of output tokens. A batch is limited by its slowest request, creating intricate scheduling challenges.

The latency-throughput tradeoff is acute. Users expect sub-second responses. Providers want to maximize GPU utilization. These goals conflict: high utilization requires large batches, but large batches increase latency for individual requests.

The Core Insight: Roofline Thinking

To reason about AI performance, you need a mental model for what limits computation. The roofline model, introduced by Williams et al. (2009), provides exactly this framework.

Every computation has two potential bottlenecks:

Compute bound: Limited by how fast we can perform arithmetic operations
Memory bound: Limited by how fast we can load data from memory

The ratio between these determines which bottleneck dominates:

\[\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Loaded}}\]

If your arithmetic intensity is high (many operations per byte loaded), you’re compute-bound. If it’s low (few operations per byte loaded), you’re memory-bound.

Why LLM inference is memory-bound: For a single token prediction with a 7B parameter model:

FLOPs: ~14 billion (roughly 2 * parameter count for forward pass)
Bytes loaded: ~14 billion (loading all parameters once at FP16)
Arithmetic intensity: ~1 FLOP/byte

Modern GPUs achieve 100+ FLOPs per byte of memory bandwidth. At 1 FLOP/byte arithmetic intensity, we’re using less than 1% of the GPU’s compute capability. We’re completely memory-bound.

This insight transforms how we think about optimization. Adding more compute power won’t help a memory-bound workload. Instead, we must either: 1. Reduce memory traffic (quantization, caching) 2. Increase arithmetic intensity (batching, operator fusion) 3. Use memory more efficiently (Flash Attention, paged KV cache)

Common Mistake: Throwing More GPUs at Memory-Bound Workloads

What people do: Inference is slow, so upgrade from A100 to H100 expecting a proportional speedup. Or add more GPUs expecting near-linear scaling.

Why it fails: LLM inference at small batch sizes is memory-bandwidth bound, not compute-bound. The H100 has 3x the compute of A100 but only 2x the memory bandwidth. For batch-size-1 inference, you get ~2x speedup, not 3x. Adding more GPUs often makes things worse due to communication overhead—you’re splitting a memory-bound workload across more devices that now must synchronize.

Fix: First identify your bottleneck using profiling. For memory-bound workloads: quantize to reduce bytes loaded, increase batch size to amortize weight loading, or use speculative decoding. Only upgrade hardware after you’ve exhausted algorithmic optimizations—often you can get 4-8x improvements without buying anything.

What You’ll Learn

How GPU architecture determines what optimizations are possible
Why LLMs are memory-bound and how to reason about memory hierarchy
Flash Attention: the mathematics and why it works
Quantization theory: trading precision for performance
Profiling methodology: finding the real bottleneck
Speculative decoding: breaking the autoregressive barrier
Production optimization: batching, caching, and cost engineering

Prerequisites

Understanding of transformer architecture (Chapter 5)
Familiarity with PyTorch or similar frameworks
Basic linear algebra (matrix multiplication, attention computation)
Helpful but not required: computer architecture concepts

GPU Architecture for AI Engineers

The Parallel Processing Paradigm

To optimize for GPUs, you must understand how they differ from CPUs.

A CPU is designed for latency: execute a single thread of instructions as fast as possible, with sophisticated branch prediction, out-of-order execution, and large caches to hide memory latency. A high-end CPU might have 64 cores, each executing 1-2 threads.

A GPU is designed for throughput: execute thousands of threads simultaneously, hiding memory latency through massive parallelism rather than caching. An H100 has 132 Streaming Multiprocessors (SMs), each capable of executing thousands of threads concurrently.

CPU Execution Model:
Thread 1: ████████████████████████████████████████████████████████████
Thread 2: ████████████████████████████████████████████████████████████
[Few threads, each runs fast with sophisticated optimization]

GPU Execution Model:
Thread 1:    ██░░░░██░░░░██░░░░██░░░░██░░░░  (waiting for memory)
Thread 2:    ░░██░░░░██░░░░██░░░░██░░░░██░░
Thread 3:    ░░░░██░░░░██░░░░██░░░░██░░░░██
Thread 1024: ██░░░░██░░░░██░░░░██░░░░██░░░░
[Many threads, each slow, but overlapped to hide latency]

The key insight: GPUs hide memory latency through parallelism, not caching. While one group of threads waits for memory, others execute. This only works if you have enough threads to keep the execution units busy—a concept called occupancy.

The Memory Hierarchy

Understanding the GPU memory hierarchy is essential for optimization:

Memory Level	Size (H100)	Bandwidth	Latency	Scope
Registers	256KB per SM	~10 TB/s	0 cycles	Per thread
Shared Memory / L1	228KB per SM	~19 TB/s	~20 cycles	Per thread block
L2 Cache	50MB	~6 TB/s	~200 cycles	Global
HBM3 (VRAM)	80GB	3.35 TB/s	~400 cycles	Global
CPU RAM (via PCIe)	System	~64 GB/s	~10,000 cycles	System

The bandwidth gap between levels is staggering. Moving from HBM to shared memory is a 6x bandwidth improvement. Moving from HBM to registers is 3x more. The single most important optimization principle is keeping data close to computation.

SIMT Execution and Warps

GPUs execute instructions in lockstep groups of 32 threads called warps (NVIDIA terminology) or wavefronts (AMD). All 32 threads in a warp execute the same instruction simultaneously—a model called SIMT (Single Instruction, Multiple Thread).

This has profound implications:

Warp divergence: If threads in a warp take different branches, both branches must execute serially:

# This code causes warp divergence
if thread_id % 2 == 0:
    do_even_work()  # Half the warp executes
else:
    do_odd_work()   # Other half executes (serialized!)

Memory coalescing: Adjacent threads should access adjacent memory addresses. The memory system fetches data in 128-byte chunks. If 32 threads each request one float (4 bytes) from adjacent addresses, one fetch serves all of them. If they request scattered addresses, you might need 32 separate fetches—a 32x slowdown.

# Coalesced access (fast)
data[thread_id]  # Threads 0-31 access addresses 0-31

# Strided access (slow)
data[thread_id * stride]  # Threads access scattered addresses

# Random access (very slow)
data[random_indices[thread_id]]  # Even worse

Kernel Launch Overhead

Each GPU operation (kernel) has launch overhead: 5-10 microseconds to dispatch work from the CPU to the GPU. This seems tiny, but for a model with hundreds of small operations, it adds up. A forward pass through a transformer layer might involve:

4 linear projections (Q, K, V, O)
Attention computation
Softmax
2 FFN linear layers
2 layer normalizations
Various reshapes and transposes

That’s easily 15+ kernel launches per layer. For a 70-layer model, that’s 1,000+ kernels, potentially 5-10ms of pure launch overhead.

Kernel fusion combines multiple operations into a single kernel, eliminating these round-trips. This is why Flash Attention, which fuses the entire attention computation into one kernel, provides such dramatic speedups.

Memory-Bound Inference: The Fundamental Challenge

Why LLM Inference is Memory-Bound

Let’s derive why large language model inference is memory-bound from first principles.

For a transformer with:

$P$ = number of parameters
$d$ = hidden dimension
$n$ = sequence length
$b$ = batch size

Forward pass computation (simplified, for attention + FFN): \[\text{FLOPs} \approx 2 \cdot P \cdot b\]

Memory traffic (loading all parameters): \[\text{Bytes} = P \cdot \text{bytes\_per\_param}\]

Arithmetic intensity: \[\text{AI} = \frac{2 \cdot P \cdot b}{P \cdot \text{bytes\_per\_param}} = \frac{2 \cdot b}{\text{bytes\_per\_param}}\]

For FP16 (2 bytes per parameter): \[\text{AI} = b \text{ FLOPs/byte}\]

The H100’s compute-to-bandwidth ratio is approximately: \[\frac{990 \text{ TFLOPS}}{3.35 \text{ TB/s}} \approx 295 \text{ FLOPs/byte}\]

To be compute-bound, we need arithmetic intensity of at least 295. That means batch size must be at least 295 to fully utilize the H100’s compute capability!

For interactive inference (batch size 1), we’re at 1/295 = 0.3% compute utilization. Almost all time is spent waiting for memory.

The Memory Bandwidth Calculation

This analysis gives us a powerful tool for performance prediction. For a 70B parameter model at FP16:

\[\text{Minimum time per token} = \frac{70\text{B params} \times 2 \text{ bytes}}{3.35 \text{ TB/s}} = 42\text{ms}\]

That’s the theoretical minimum—loading every parameter once at peak bandwidth. In practice, we also have:

KV cache reads and writes
Attention computation
Activation transfers
Memory access patterns that don’t achieve peak bandwidth

Real-world single-request latency is typically 2-3x this theoretical minimum.

Implications for Optimization

This analysis reveals which optimizations can help:

Optimization	How It Helps	Expected Impact
Quantization (FP16 → INT8)	2x fewer bytes to load	~2x faster
Quantization (FP16 → INT4)	4x fewer bytes to load	~4x faster (with quality tradeoff)
Batching	Amortizes weight loading	Linear throughput improvement
KV caching	Avoids recomputing past	Essential for practical generation
Flash Attention	Reduces memory traffic	Enables long contexts

And which cannot help:

Adding more GPU compute: Already underutilized
Faster GPU clock speeds: Memory-bound, not compute-bound
More CUDA cores: Same limitation

Flash Attention: A Deep Dive

The Attention Bottleneck

Standard attention computes: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

For sequence length $n$ and head dimension $d$:

$QK^T$ produces an $n \times n$ matrix
This matrix must be stored for the backward pass (training) or for the softmax (inference)
Memory: $O(n^2)$ for the attention matrix

For $n = 128,000$ tokens: \[\text{Attention matrix size} = 128,000^2 \times 2 \text{ bytes} = 32 \text{ GB per layer per head}\]

This is clearly infeasible. Even for modest sequence lengths, the attention matrix becomes the dominant memory consumer.

The Flash Attention Insight

Tri Dao’s Flash Attention (2022) solves this with a key insight: we never need to materialize the full attention matrix. We can compute attention in tiles, maintaining only partial results in fast SRAM (shared memory), never writing the full $n \times n$ matrix to slow HBM.

The algorithm: 1. Split Q, K, V into blocks that fit in SRAM 2. For each block of Q, iterate over blocks of K, V 3. Compute block attention scores in SRAM 4. Use online softmax (computing softmax incrementally) to combine blocks 5. Write only the final output to HBM

The mathematics of online softmax:

For standard softmax: \[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

The problem: we need the full sum before computing any output.

Online softmax maintains running statistics: \[m_{\text{new}} = \max(m_{\text{old}}, m_{\text{block}})\] \[\ell_{\text{new}} = \ell_{\text{old}} \cdot e^{m_{\text{old}} - m_{\text{new}}} + \ell_{\text{block}} \cdot e^{m_{\text{block}} - m_{\text{new}}}\]

This allows computing exact softmax while processing data in blocks, never needing to store the full attention matrix.

Flash Attention Performance Analysis

Aspect	Standard Attention	Flash Attention
Memory	$O(n^2)$	$O(n)$
HBM reads	$O(n^2 d)$	$O(n^2 d^2 / M)$
HBM writes	$O(n^2 + nd)$	$O(nd)$
Kernel launches	3+ (matmul, softmax, matmul)	1 (fused)

Where $M$ is SRAM size. For typical values ($d = 128$, $M = 100$KB), Flash Attention reduces HBM traffic by 5-10x.

Flash Attention 2 and 3

Flash Attention 2 (2023) improved on the original with:

Better work partitioning across thread blocks
Reduced non-matmul FLOPs
~2x speedup over Flash Attention 1

Flash Attention 3 (2024) further optimized for Hopper architecture:

Exploits asynchronous operations
Uses FP8 tensor cores when available
Achieves 75%+ of theoretical maximum FLOPS

# Using Flash Attention is now trivial in modern frameworks
import torch.nn.functional as F

# PyTorch 2.0+ automatically uses Flash Attention when possible
output = F.scaled_dot_product_attention(query, key, value)

# HuggingFace models: specify implementation
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B",
    attn_implementation="flash_attention_2"
)

Full implementation: See reference/21_performance_engineering_code.md for benchmarking code comparing attention implementations.

Quantization: Theory and Practice

Why Quantization Works

Neural network weights are stored as floating-point numbers, typically FP32 (32 bits) or FP16 (16 bits). Quantization represents these weights with fewer bits—INT8 (8 bits) or INT4 (4 bits).

Why does this work? Neural networks are remarkably robust to noise. The weights exist in a continuous space where small perturbations don’t significantly change behavior. Quantization introduces small errors, but these errors are bounded and the network’s overall function is preserved.

Mathematically, for a weight matrix $W$ and quantized approximation $\hat{W}$: \[\hat{W} = s \cdot \text{round}\left(\frac{W}{s}\right)\]

Where $s$ is a scaling factor. The error $W - \hat{W}$ is bounded by $s/2$.

Quantization Schemes

Per-tensor quantization: One scale factor for the entire tensor. - Simplest implementation - May lose precision if weights have different magnitudes

Per-channel quantization: One scale factor per output channel. - Better accuracy than per-tensor - Standard for INT8 quantization

Per-group quantization: One scale factor per group of weights (e.g., 128 weights). - Used by GPTQ, AWQ for 4-bit quantization - Good accuracy-efficiency tradeoff

Mixed precision: Different layers quantized to different precisions. - Sensitive layers (embeddings, attention) kept at higher precision - Less sensitive layers (FFN) more aggressively quantized

GPTQ: Post-Training Quantization for LLMs

GPTQ (Frantar et al., 2022) introduced an efficient method for post-training quantization of LLMs:

Layer-wise quantization: Process one layer at a time
Hessian-based error compensation: Use second-order information to minimize quantization error
Lazy batch updates: Efficient implementation that scales to large models

The key insight: when quantizing weight $w_i$, we can adjust remaining weights $w_{i+1:}$ to compensate for the error introduced. GPTQ computes optimal adjustments using approximate Hessian information from calibration data.

# GPTQ quantization with auto-gptq library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,                 # 4-bit quantization
    group_size=128,         # Per-group scaling
    desc_act=False,         # Activation order
    damp_percent=0.1        # Dampening for numerical stability
)

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B",
    quantize_config=quantize_config
)
model.quantize(calibration_data)

AWQ: Activation-Aware Quantization

AWQ (Lin et al., 2023) improves on GPTQ by focusing on salient weights—weights that, when activated by typical inputs, have outsized impact on outputs.

Key insight: 1% of weights contribute to 50%+ of the output. These salient weights should be preserved more carefully.

AWQ: 1. Profiles activations on calibration data 2. Identifies salient weights based on activation magnitudes 3. Scales salient weights to protect them from quantization error 4. Applies standard quantization to the scaled weights

Results: AWQ achieves better accuracy than GPTQ at the same bit width, with similar inference speed.

Quantization Impact Analysis

Method	Bits	Memory Reduction	Speed Improvement	Quality Impact
FP16 baseline	16	1x	1x	Baseline
INT8 dynamic	8	2x	1.5-2x	<1% degradation
GPTQ 4-bit	4	4x	2-3x	1-2% degradation
AWQ 4-bit	4	4x	2-3x	<1% degradation
GGUF Q4_K_M	4	4x	2-3x	<1% degradation
2-bit	2	8x	3-4x	Significant degradation

The sweet spot for most applications is 4-bit quantization with per-group scaling (GPTQ or AWQ). This achieves 4x memory reduction with minimal quality loss.

Full implementation: See reference/21_performance_engineering_code.md for complete quantization code with benchmarking.

The KV Cache: Memory Management for Generation

Why KV Caching is Essential

Autoregressive generation produces tokens one at a time, each conditioned on all previous tokens. Without caching, generating the $n$-th token requires recomputing attention over all $n-1$ previous tokens—an $O(n^2)$ total cost for generating $n$ tokens.

KV caching stores the key and value projections from previous tokens:

First token: Compute K, V for position 0, store in cache
Second token: Compute K, V for position 1, append to cache; attend to positions 0-1
$n$-th token: Compute K, V for position $n-1$, append to cache; attend to positions 0 through $n-1$

This reduces generation from $O(n^2)$ to $O(n)$—essential for any practical system.

KV Cache Memory Requirements

For a model with:

$L$ layers
$h_{kv}$ key/value heads (equals query heads $h$ for MHA; smaller for GQA/MQA)
$d_h$ head dimension (typically $d_{model}/h$, where $h$ is the number of query heads)
$n$ sequence length
$b$ batch size

KV cache size: \[\text{KV cache} = 2 \cdot L \cdot h_{kv} \cdot d_h \cdot n \cdot b \cdot \text{bytes\_per\_value}\]

The leading factor of 2 accounts for storing both keys and values. The key correction for modern models is that we use $h_{kv}$ (KV heads), not $h$ (query heads)—a frequent source of overestimated KV cache budgets.

For Llama 2 70B ($L=80$, $h_{kv}=8$ GQA heads, $d_h=128$) at 8K context with FP16: \[\text{KV cache} = 2 \cdot 80 \cdot 8 \cdot 128 \cdot 8192 \cdot 2 = 2.7 \text{ GB per request}\]

Note: Llama 2 70B uses Grouped-Query Attention (GQA) with 8 KV heads (not all 64 query heads), which reduces KV cache 8x compared to standard MHA. Without GQA, this would be 21.5 GB per request—illustrating why GQA is essential for long-context serving.

This creates the memory wall for long contexts: even with Flash Attention enabling the computation, we may not have memory for the KV cache.

Common Mistake: Ignoring KV Cache in Capacity Planning

What people do: Calculate GPU memory requirements based only on model weights. “70B at FP16 is 140GB, so it fits on 2x H100 80GB with room to spare.”

Why it fails: KV cache grows with sequence length and concurrent requests. At 8K context, Llama 2 70B needs ~2.7GB of KV cache per request (with GQA). With 32 concurrent requests at 8K context, that’s ~86GB just for KV cache—comparable to the model weights. Your “fits with room to spare” calculation may now be OOM.

Fix: Always include KV cache in memory planning: total_memory = model_weights + (kv_cache_per_token × max_tokens × max_concurrent_requests). Use GQA models when possible (8x smaller KV cache). Implement paged attention to avoid over-allocation. Set conservative concurrent request limits and monitor memory utilization in production.

Paged Attention (vLLM)

Traditional KV cache pre-allocates a contiguous block of memory for the maximum sequence length. This is wasteful:

Most sequences are shorter than max length
Memory fragmentation prevents fitting more requests
Early requests hog memory even after completing

PagedAttention (Kwon et al., 2023, introduced in vLLM) applies virtual memory concepts:

Divide KV cache into fixed-size blocks (pages)
Allocate pages on demand as sequences grow
Track page mappings in a page table
When sequences complete, return pages to free pool

# Conceptual PagedAttention implementation
class PagedKVCache:
    def __init__(self, num_blocks, block_size, num_layers, num_heads, head_dim):
        # Pre-allocate block pool
        self.key_blocks = torch.zeros(num_blocks, num_layers, block_size, num_heads, head_dim)
        self.value_blocks = torch.zeros(num_blocks, num_layers, block_size, num_heads, head_dim)
        self.free_blocks = list(range(num_blocks))
        self.sequence_block_tables = {}  # seq_id -> list of block indices

    def allocate(self, seq_id):
        """Allocate a new block for a sequence."""
        if not self.free_blocks:
            raise RuntimeError("Out of KV cache memory")
        block_idx = self.free_blocks.pop()
        if seq_id not in self.sequence_block_tables:
            self.sequence_block_tables[seq_id] = []
        self.sequence_block_tables[seq_id].append(block_idx)
        return block_idx

    def free(self, seq_id):
        """Return all blocks for a sequence to the free pool."""
        blocks = self.sequence_block_tables.pop(seq_id, [])
        self.free_blocks.extend(blocks)

PagedAttention enables 2-4x more concurrent requests through efficient memory utilization.

Multi-Query and Grouped-Query Attention

Another approach: reduce the KV cache size by sharing key-value heads.

Multi-Query Attention (MQA): All query heads share one K and one V. \[\text{KV cache reduction: } h \times \text{ smaller}\]

Grouped-Query Attention (GQA): Groups of query heads share K, V. - Llama 2 70B uses 8 KV heads for 64 query heads (8x reduction) - Good tradeoff between quality and memory

Architecture	Query Heads	KV Heads	KV Cache Reduction
MHA (standard)	64	64	1x
GQA (Llama 2 70B)	64	8	8x
MQA	64	1	64x

Full implementation: See reference/21_performance_engineering_code.md for complete KV cache code with paged attention.

Speculative Decoding: Breaking the Sequential Barrier

The Autoregressive Bottleneck

Autoregressive generation is inherently sequential: each token depends on all previous tokens. With batch size 1, we can only generate one token at a time, leaving most GPU compute idle.

For a 70B model generating 100 tokens:

100 sequential forward passes
Each pass: ~50ms (memory-bound)
Total: ~5 seconds
GPU compute utilization: <1%

The Speculative Decoding Insight

Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) introduces parallelism through speculation:

Use a small, fast draft model to generate $\gamma$ tokens quickly
Send all $\gamma$ tokens to the large target model in one batch
Target model verifies draft tokens in parallel
Accept verified tokens; resample if rejected

Key insight: verification is parallel, but generation is sequential. The target model can check all $\gamma$ draft tokens simultaneously—work that would otherwise require $\gamma$ sequential passes.

The Mathematics of Speculative Decoding

Let $p(x)$ be the target model distribution and $q(x)$ be the draft model distribution.

For each draft token $x_i$:

If $p(x_i) \geq q(x_i)$: always accept (target is at least as likely)
If $p(x_i) < q(x_i)$: accept with probability $p(x_i)/q(x_i)$

If rejected, sample from an adjusted distribution: \[p'(x) = \frac{\max(0, p(x) - q(x))}{\sum_x \max(0, p(x) - q(x))}\]

Theorem: This procedure produces samples from exactly the target distribution $p(x)$.

This is remarkable: we get speedup without any approximation to the output distribution.

Expected Speedup

The speedup depends on:

$\gamma$: number of speculative tokens
$\alpha$: acceptance rate (how well draft matches target)
$c$: relative cost of draft vs target inference

Expected tokens per target model call: \[E[\text{tokens}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}\]

For $\alpha = 0.8$ (80% acceptance) and $\gamma = 4$: \[E[\text{tokens}] = \frac{1 - 0.8^5}{1 - 0.8} = \frac{1 - 0.328}{0.2} = 3.36\]

If the draft model adds negligible overhead, this is a 3.36x speedup!

Implementation Considerations

Draft model selection:

Same architecture family (shares vocabulary, tokenizer)
10-20x smaller than target (e.g., 68M draft for 7B target)
Trained on similar data for high acceptance rate

Optimal speculation length:

Too few: Underutilizes parallelism
Too many: Wastes compute on likely-rejected tokens
Sweet spot: 4-8 tokens for most models

class SpeculativeDecoder:
    def __init__(self, target_model, draft_model, gamma=4):
        self.target = target_model
        self.draft = draft_model
        self.gamma = gamma

    def generate_step(self, input_ids):
        # 1. Draft model generates gamma tokens
        draft_tokens = []
        draft_probs = []
        context = input_ids
        for _ in range(self.gamma):
            logits = self.draft(context)
            probs = F.softmax(logits[:, -1], dim=-1)
            token = torch.multinomial(probs, 1)
            draft_tokens.append(token)
            draft_probs.append(probs[0, token].item())
            context = torch.cat([context, token], dim=1)

        # 2. Target verifies all at once
        all_draft = torch.stack(draft_tokens, dim=1)
        target_logits = self.target(torch.cat([input_ids, all_draft], dim=1))
        target_probs = F.softmax(target_logits[:, -self.gamma-1:-1], dim=-1)

        # 3. Accept/reject each token
        accepted = []
        for i, (token, q, p_dist) in enumerate(zip(draft_tokens, draft_probs, target_probs)):
            p = p_dist[0, token].item()
            if random.random() < min(1, p / q):
                accepted.append(token)
            else:
                # Sample from adjusted distribution
                adjusted = torch.clamp(p_dist - draft_probs_dist, min=0)
                adjusted = adjusted / adjusted.sum()
                accepted.append(torch.multinomial(adjusted, 1))
                break

        return torch.cat(accepted, dim=1)

Full implementation: See reference/21_performance_engineering_code.md for production-ready speculative decoding.

Profiling: Finding the Real Bottleneck

Staff Engineer Perspective

“On my second week as TL I was bored in a Grafana dashboard and noticed our embeddings service had a flat 90ms floor on every request, even tiny ones. Nobody had ever questioned it. I traced it to a synchronous DNS lookup we were doing per call because somebody had disabled the resolver cache in 2022 to debug a different issue and never re-enabled it. One line change. Service p50 dropped from 110ms to 18ms. Eight months of cumulative latency burned because the original ‘fix’ became invisible the moment the incident closed. Now I do a ‘weird flat line’ tour of dashboards once a quarter. Cheap wins live in things everyone has stopped looking at.”

— Performance Engineer at a real-time messaging company

The Performance Engineering Mindset

The first rule of optimization: measure before you optimize. Intuition about performance is often wrong. Engineers routinely spend weeks optimizing the wrong component because they didn’t profile first.

The Performance Engineering Loop:

1. DEFINE goals (what latency/throughput do we need?)
       ↓
2. MEASURE current performance (profile systematically)
       ↓
3. ANALYZE results (identify the bottleneck)
       ↓
4. HYPOTHESIZE (what optimization might help?)
       ↓
5. IMPLEMENT (make the change)
       ↓
6. VALIDATE (measure again, compare)
       ↓
   [Repeat until goals met]

Profiling Tools for AI Workloads

Different tools reveal different aspects of performance:

Tool	What It Measures	When to Use	Overhead
`torch.profiler`	PyTorch ops, GPU kernels	First-line profiling	Low
`py-spy`	Python call stacks	CPU bottlenecks	Very low
`nsys` (Nsight Systems)	GPU timeline, CPU-GPU sync	GPU pipeline analysis	Low
`ncu` (Nsight Compute)	Detailed kernel metrics	Deep kernel optimization	High
`nvidia-smi`	GPU utilization, memory	Quick health check	None

PyTorch Profiler Deep Dive

import torch
from torch.profiler import profile, record_function, ProfilerActivity

def profile_model(model, input_data, warmup=3, active=5):
    """Profile model inference with proper warmup."""

    # Warmup runs (crucial for accurate measurement)
    for _ in range(warmup):
        with torch.no_grad():
            _ = model(input_data)
    torch.cuda.synchronize()

    # Profile with GPU activities, memory, and stack traces
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        for _ in range(active):
            with record_function("model_inference"):
                with torch.no_grad():
                    _ = model(input_data)

    # Analyze results
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

    # Export for detailed visualization
    prof.export_chrome_trace("trace.json")

    return prof

Interpreting Profiler Output

A typical PyTorch profiler output:

---------------------------------  -------  -------  -------  -------
                             Name  CPU Time  CUDA Time  Calls  CPU Mem
---------------------------------  -------  -------  -------  -------
                     aten::linear   1.2ms    15.3ms      96     0 b
                     aten::matmul   0.8ms    12.1ms      48     0 b
               aten::_softmax      0.3ms     2.4ms      48     0 b
                 aten::layer_norm   0.4ms     1.8ms      48     0 b
                     aten::copy_   0.6ms     1.2ms     128     0 b

Key insights:

CUDA time >> CPU time: GPU-bound (good, GPU is doing work)
CPU time >> CUDA time: CPU-bound (investigate Python overhead)
Many calls with small time each: Kernel launch overhead
Large memory in specific ops: Memory bottleneck candidates

Common Bottleneck Patterns

Pattern 1: CPU-Bound Preprocessing

Symptom: Low GPU utilization, high CPU time
Profile: Data loading, tokenization dominate
Solution: Async data loading, batch preprocessing, compiled tokenizers

Pattern 2: Kernel Launch Overhead

Symptom: Many small CUDA calls, GPU gaps in timeline
Profile: Hundreds of ops with <0.1ms each
Solution: torch.compile, CUDA graphs, operator fusion

Pattern 3: Memory Bandwidth Bound

Symptom: High GPU utilization, but lower throughput than expected
Profile: Large tensors, many memory copies
Solution: Quantization, better data layout, reduce memory traffic

Pattern 4: CPU-GPU Synchronization

Symptom: GPU stalls, CPU waiting
Profile: Explicit syncs, CPU operations on GPU tensors
Solution: Async operations, keep computation on GPU

Full implementation: See reference/21_performance_engineering_code.md for complete profiling framework with analysis.

Common Mistake: Optimizing Without Profiling

What people do: Assume the model inference is slow and spend weeks implementing CUDA optimizations, only to discover that 60% of latency was actually in Python preprocessing or database queries.

Why it fails: Intuition about performance is notoriously unreliable. The bottleneck is rarely where you expect. Without measurement, you might 10x the speed of something that contributes 5% to total latency—a 0.5% overall improvement after weeks of work.

Fix: Always profile before optimizing. Use torch.profiler for GPU kernels, py-spy for Python overhead, and end-to-end tracing for system-level bottlenecks. Identify which component dominates latency (preprocessing? inference? postprocessing? network?). Optimize the actual bottleneck. Re-profile after each change to validate impact.

Production Optimization Techniques

Continuous Batching

Traditional batching waits for a batch to fill, processes all requests, then returns results. This is inefficient because:

Short requests wait for long requests to complete
New requests wait for the entire batch to finish
GPU may be idle while waiting for requests

Continuous batching (iteration-level batching) processes requests at the token level:

class ContinuousBatcher:
    def __init__(self, model, max_batch_size, max_waiting_time_ms=50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_waiting_time = max_waiting_time_ms
        self.active_requests = []
        self.waiting_queue = []

    async def run_loop(self):
        """Main loop: process one token for all active requests."""
        while True:
            # Add waiting requests (up to capacity)
            self._fill_batch()

            if not self.active_requests:
                await asyncio.sleep(0.001)
                continue

            # Generate one token for all active requests
            await self._step()

            # Return completed requests
            self._retire_completed()

    async def _step(self):
        """Generate one token for all active requests in batch."""
        # Batch all active requests
        input_ids = self._prepare_batch()

        # Single forward pass for all requests
        logits = self.model(input_ids)

        # Sample next token for each request
        for i, request in enumerate(self.active_requests):
            next_token = sample(logits[i])
            request.append_token(next_token)

    def _retire_completed(self):
        """Return completed requests, freeing batch slots."""
        completed = [r for r in self.active_requests if r.is_done()]
        for request in completed:
            self.active_requests.remove(request)
            request.future.set_result(request.output)

Continuous batching enables:

Higher throughput: GPU always has work
Lower latency: Short requests don’t wait for long ones
Better utilization: Batch stays full as requests come and go

torch.compile and TorchInductor

PyTorch 2.0’s torch.compile applies automatic optimizations:

import torch

# Basic compilation
model = torch.compile(model)

# With optimization mode
model = torch.compile(model, mode="reduce-overhead")  # CUDA graphs
model = torch.compile(model, mode="max-autotune")     # Best perf, slow compile

# With dynamic shapes
model = torch.compile(model, dynamic=True)  # Handle varying input sizes

What torch.compile does: 1. Traces the model to capture computation graph 2. Fuses operations to reduce kernel launches 3. Generates optimized kernels via TorchInductor 4. Optionally uses CUDA graphs to eliminate launch overhead

Expected speedups:

Typical models: 1.3-2x
Memory-bound models: 1.1-1.5x (less kernel fusion benefit)
Compute-bound models: 1.5-3x (more fusion benefit)

CUDA Graphs

CUDA graphs capture a sequence of GPU operations and replay them without CPU involvement:

class CUDAGraphRunner:
    def __init__(self, model, sample_input):
        self.model = model

        # Static buffers for graph
        self.static_input = sample_input.clone()
        self.static_output = None

        # Warmup
        for _ in range(3):
            _ = model(self.static_input)
        torch.cuda.synchronize()

        # Capture graph
        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph):
            self.static_output = model(self.static_input)

    def __call__(self, x):
        # Copy input to static buffer
        self.static_input.copy_(x)
        # Replay captured graph
        self.graph.replay()
        # Return output (already computed in static buffer)
        return self.static_output.clone()

Limitations:

Input shapes must be fixed
No dynamic control flow during graph replay
Memory layout must be consistent

Best for: Fixed-shape inference with repetitive workloads.

Tensor Parallelism

When models don’t fit on a single GPU, tensor parallelism splits weight matrices across GPUs:

Without Tensor Parallelism:
GPU 0: Full 70B model (doesn't fit!)

With 4-way Tensor Parallelism:
GPU 0: 1/4 of each weight matrix
GPU 1: 1/4 of each weight matrix
GPU 2: 1/4 of each weight matrix
GPU 3: 1/4 of each weight matrix

Each GPU computes partial results, then all-reduce to combine.

For a linear layer $Y = XW$: 1. Split $W$ into $W_1, W_2, ..., W_n$ (column-wise) 2. Each GPU computes $Y_i = XW_i$ 3. All-reduce to get $Y = [Y_1, Y_2, ..., Y_n]$

Communication overhead:

One all-reduce per linear layer: ~8 all-reduces per transformer block
For 80 layers: ~640 all-reduces per forward pass
With NVLink (900 GB/s): Acceptable overhead
With PCIe (64 GB/s): Often prohibitive

Full implementation: See reference/21_performance_engineering_code.md for tensor parallelism setup.

Decision Frameworks for Performance Optimization

When to Optimize

Not all optimization is worthwhile. Consider:

The 80/20 rule: 80% of time is spent in 20% of code. Profile to find that 20%.

Amdahl’s Law: If a component is 10% of total time, even infinite speedup yields only 11% overall improvement.

Development cost: A week of engineering time costs $5,000+. Will the optimization save that much in compute?

Optimization Priority Framework

Do First (High Impact, Low Effort):

Enable Flash Attention (one line of code)
torch.compile (one line of code)
Use production serving framework (vLLM, TGI)

Do Later (High Impact, High Effort):

Custom CUDA kernels
Model architecture changes
Custom quantization schemes

Skip (Low Impact, Low Effort):

Micro-optimizations in preprocessing
Minor memory layout changes

Do If Time (Low Impact, High Effort):

Exotic optimization techniques
Hardware-specific tuning

Latency vs. Throughput Decision Matrix

Scenario	Optimize For	Key Techniques
Interactive chat	Latency	Speculative decoding, tensor parallelism, small batch
Batch processing	Throughput	Large batches, continuous batching, quantization
Mixed workload	Balance	Priority queues, adaptive batching
Cost-constrained	Throughput	Aggressive quantization, spot instances
Real-time (<100ms)	Latency	CUDA graphs, distillation, edge deployment

Cost-Performance Tradeoff Analysis

def analyze_configuration(config):
    """Analyze cost-performance tradeoff for a configuration."""

    # Calculate throughput
    tokens_per_second = estimate_throughput(config)

    # Calculate cost
    gpu_cost_per_hour = GPU_COSTS[config.gpu_type]
    cost_per_million_tokens = (gpu_cost_per_hour / 3600) / tokens_per_second * 1_000_000

    # Calculate quality (if using quantization)
    quality_score = estimate_quality(config)

    return {
        'throughput': tokens_per_second,
        'cost_per_1M_tokens': cost_per_million_tokens,
        'quality': quality_score,
        'latency_p99': estimate_latency(config)
    }

# Example configurations
configs = [
    {'gpu': 'A10', 'model': '7B', 'quant': 'fp16', 'batch': 8},
    {'gpu': 'A10', 'model': '7B', 'quant': 'int8', 'batch': 16},
    {'gpu': 'A100', 'model': '70B', 'quant': 'int4', 'batch': 32},
]

# Find Pareto-optimal configurations
for config in configs:
    metrics = analyze_configuration(config)
    if meets_latency_slo(metrics) and meets_quality_threshold(metrics):
        print(f"Viable: {config} -> ${metrics['cost_per_1M_tokens']:.4f}/1M tokens")

Case Studies: Real-World Optimization Wins

Case Study 1: The Startup Cost Crisis

Situation: AI coding assistant, 7B parameter model, $0.375/user-hour serving cost.

Analysis:

Single user per GPU (batch size 1)
FP16 inference (no quantization)
Standard attention (no Flash Attention)
No caching of common prompts

Optimization Journey:

Stage	Change	Before	After	Improvement
Baseline	Measure	-	50ms/token, 8 users/GPU	-
1	Flash Attention	8 users	8 users	No change (not attention-bound)
2	INT8 quantization	50ms	35ms	1.4x latency
3	Continuous batching	8 users	32 users	4x throughput
4	KV cache optimization	32 users	48 users	1.5x throughput
5	Speculative decoding	35ms	20ms	1.75x latency
Final	Combined	8 users, $0.375/hr	64 users, $0.047/hr	8x cost reduction

Key Insight: The biggest win came from continuous batching, not the “obvious” optimization (Flash Attention). This is why profiling matters.

Case Study 2: Long-Context Document Analysis

Situation: Legal document analysis requiring 100K+ context windows.

Problem: Standard attention OOMs at 32K tokens.

Analysis:

32K tokens * 32K tokens * 2 bytes * 40 heads = 82GB per layer for attention
Not feasible on any single GPU

Solution:

Technique	Purpose	Impact
Flash Attention 2	Eliminate O(n^2) memory	Enable 100K context
Sliding window (partial)	Reduce effective sequence for most layers	2x memory reduction
Gradient checkpointing	Trade compute for memory	3x memory reduction for training
GQA (8 KV heads)	Reduce KV cache size	8x KV cache reduction

Result: 128K context windows on A100-80GB, enabling full document analysis.

Case Study 3: Real-Time Voice Assistant

Situation: Voice assistant requiring <200ms end-to-end latency.

Budget breakdown:

Audio processing: 20ms
Speech-to-text: 50ms
LLM inference: 100ms (target)
Text-to-speech: 30ms

Challenge: 100ms for LLM inference requires aggressive optimization.

Solution:

Technique	Latency Impact
3B distilled model (from 7B)	2x faster
INT8 quantization	1.5x faster
Speculative decoding	2x faster
CUDA graphs	1.2x faster
Tensor parallelism (2 GPUs)	1.8x faster

Result:

Time to first token: 25ms
Tokens per second: 80
End-to-end latency: 180ms (under budget)

Key Insight: Distillation provided the biggest win. A faster model beats a heavily optimized slow model.

Common Pitfalls and How to Avoid Them

Pitfall 1: Optimizing Without Profiling

The mistake: “Attention is slow, so I’ll implement custom attention kernels.”

Reality: Attention might be 10% of your runtime. Data loading might be 60%.

The fix: Always profile first. Optimize the actual bottleneck.

Pitfall 2: Forgetting Warmup

The mistake: First inference takes 2 seconds, subsequent ones take 50ms. Reporting 2s latency.

Reality: First call includes JIT compilation, CUDA context creation, memory allocation.

The fix: Always run warmup iterations before benchmarking.

# Wrong
start = time.time()
output = model(input)
print(f"Latency: {time.time() - start}")  # Includes warmup!

# Right
for _ in range(5):  # Warmup
    _ = model(input)
torch.cuda.synchronize()

times = []
for _ in range(100):  # Benchmark
    start = time.perf_counter()
    _ = model(input)
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)
print(f"Latency: {np.mean(times):.3f}s +/- {np.std(times):.3f}s")

Pitfall 3: Missing GPU Synchronization

The mistake: Timing shows 1ms, but actual latency is 50ms.

Reality: CUDA operations are asynchronous. The Python call returns immediately.

The fix: Always torch.cuda.synchronize() before timing endpoints.

Pitfall 4: Testing at Wrong Scale

The mistake: Optimizations work great on 100 tokens, fail at 10,000 tokens.

Reality: Algorithmic complexity differences only show at scale.

The fix: Test with production-representative input sizes.

Pitfall 5: Ignoring Memory Fragmentation

The mistake: OOM with 40GB free, 60GB reserved.

Reality: Memory fragmentation prevents allocation even with “free” memory.

The fix: Monitor fragmentation, use paged memory, periodic cleanup.

def check_memory():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    fragmentation = (reserved - allocated) / reserved if reserved > 0 else 0

    print(f"Allocated: {allocated:.1f}GB, Reserved: {reserved:.1f}GB")
    print(f"Fragmentation: {fragmentation:.1%}")

    if fragmentation > 0.2:
        print("Warning: High fragmentation. Consider torch.cuda.empty_cache()")

Pitfall 6: Premature Optimization

The mistake: Spending a week writing custom CUDA kernels before validating the product.

Reality: The product might pivot, making the optimization worthless.

The fix: Start with high-level optimizations (torch.compile, vLLM). Only write custom code when high-level tools aren’t enough.

Key Takeaways

Memory bandwidth is the primary bottleneck - Modern GPUs have abundant compute but limited memory bandwidth. For LLM inference at small batch sizes, you use less than 1% of available compute.
Use the roofline model to identify bottlenecks - Calculate arithmetic intensity (FLOPs per byte) to determine if you’re compute-bound or memory-bound. Then optimize the actual constraint, not the assumed one.
Flash Attention enables long contexts - By computing attention in tiles that fit in fast SRAM, Flash Attention achieves O(N) memory instead of O(N^2) without approximating the computation.
Quantization trades precision for throughput - 4-bit quantization achieves 4x memory reduction with minimal quality loss for most applications. Quantize to INT8 or INT4 for production inference.
Always profile before optimizing - Intuition about bottlenecks is often wrong. Use torch.profiler, nsys, and systematic measurement to find the real bottleneck before investing in optimization.

Summary

Performance engineering for AI systems requires understanding the fundamental differences between AI workloads and traditional software:

Memory bandwidth is the bottleneck. Modern GPUs have abundant compute but limited memory bandwidth. For LLM inference at small batch sizes, you’re using less than 1% of available compute—all time is spent loading weights from memory.

The roofline model provides a framework for reasoning about bottlenecks. Calculate arithmetic intensity (FLOPs per byte) to determine if you’re compute-bound or memory-bound, then optimize accordingly.

Flash Attention eliminates the quadratic memory bottleneck of attention, enabling long context windows by never materializing the full attention matrix.

Quantization reduces memory traffic by representing weights with fewer bits. 4-bit quantization achieves 4x memory reduction with minimal quality loss for most applications.

The KV cache is critical for efficient generation. Paged attention enables efficient memory management, supporting 2-4x more concurrent requests.

Speculative decoding breaks the autoregressive barrier by using a small draft model to generate candidates that a large model verifies in parallel.

Always profile first. Intuition about bottlenecks is often wrong. Use torch.profiler, nsys, and systematic measurement to find the real bottleneck before optimizing.

Understand the tradeoffs. Latency and throughput often conflict. Cost and quality often conflict. The right configuration depends on your specific requirements.

Practical Exercises

Exercise 1: Roofline Analysis

For a model you’re working with: 1. Calculate parameter count and FLOPs per forward pass 2. Calculate memory traffic (bytes loaded) 3. Determine arithmetic intensity 4. Compare to your GPU’s compute-to-bandwidth ratio 5. Are you compute-bound or memory-bound?

Exercise 2: Profiling Deep Dive

Profile a model inference with torch.profiler
Export to Chrome trace and visualize the timeline
Identify the top 5 operations by time
Classify each operation: compute-bound, memory-bound, or launch-bound
Propose optimizations for each bottleneck

Exercise 3: Quantization Impact

Measure baseline model performance (latency, throughput, quality)
Apply INT8 dynamic quantization and re-measure
Apply 4-bit GPTQ quantization and re-measure
Plot the Pareto frontier of quality vs. performance
Recommend the best configuration for interactive use

Exercise 4: Batching Optimization

Measure single-request latency and throughput
Implement static batching with sizes 1, 2, 4, 8, 16, 32
Plot latency and throughput vs. batch size
Identify the optimal batch size for your latency SLO
Implement continuous batching and compare

Exercise 5: End-to-End Optimization

Starting with a baseline model: 1. Profile to identify the primary bottleneck 2. Apply the single most impactful optimization 3. Re-profile and identify the new bottleneck 4. Repeat until you achieve 3x improvement or reach diminishing returns 5. Document the optimization journey with metrics

Self-Assessment Checkpoint

Conceptual Questions

Q1. [Senior] What does it mean for LLM inference to be “memory-bandwidth bound”? How does this differ from being “compute bound”?

Answer

Memory-bandwidth bound: The GPU spends more time waiting for data to load from memory than doing computation. During LLM decode, we load the entire model (14GB for 7B FP16) to do a matrix-vector multiply for one token—massive data movement, minimal computation. The arithmetic intensity (FLOPs per byte loaded) is ~1-2, while GPUs can do ~600 FLOPs per byte loaded. Compute bound: The GPU is fully utilized doing math, and loading more data wouldn’t help. This occurs during prefill (matrix-matrix multiply) where we process many tokens simultaneously, achieving high arithmetic intensity. Key implication: For decode phase, buying faster compute doesn’t help—need faster memory or smarter use of memory (batching, quantization, caching).

Q2. [Senior] Why does batching improve throughput for LLM inference? What limits how large the batch can be?

Answer

Batching amortizes weight loading: Instead of loading 14GB of weights to generate 1 token, load them once and generate B tokens for B batched requests. Throughput scales nearly linearly with batch size until limits hit. Limits: (1) Memory—KV cache grows with batch size × context length. Eventually GPU memory is exhausted. (2) Latency—larger batches mean each request waits longer (for its slot, for the batch to complete). SLA limits batch size. (3) Diminishing returns—at very large batches, compute becomes the bottleneck instead of memory bandwidth. (4) Request arrival—need enough concurrent requests to form batches. Trade-off: Higher batch = higher throughput but higher latency. Optimal batch size depends on latency SLO and memory available.

Q3. [Staff] Explain how FlashAttention achieves better performance without approximating attention. What’s the key insight?

Answer

FlashAttention is exact attention (same mathematical result) but with better memory access patterns. Key insight: Standard attention materializes the N×N attention matrix in GPU memory (for softmax), requiring O(N²) memory reads/writes. For long sequences, this is slow (memory-bound) and uses massive memory. FlashAttention uses tiling: computes attention in blocks that fit in fast SRAM (on-chip memory), never materializing the full N×N matrix in slow HBM. It’s IO-aware—designed around the memory hierarchy rather than just minimizing FLOPs. Result: Same computation, but ~2-4x faster and O(N) memory instead of O(N²). This is why FlashAttention enabled long contexts (100K+ tokens) that were previously impractical.

Q4. [Staff] What’s the roofline model and how do you use it to identify bottlenecks?

Answer

Roofline plots achievable performance (FLOPs/s) against arithmetic intensity (FLOPs/byte). The “roof” has two parts: (1) Sloped section (memory-bound)—performance limited by memory bandwidth × arithmetic intensity. (2) Flat section (compute-bound)—performance limited by peak compute. To use: Calculate your kernel’s arithmetic intensity. Plot it on the roofline. If you’re in the sloped region, you’re memory-bound—optimize memory access (batching, quantization, caching). If you’re in the flat region, you’re compute-bound—optimize computation (better algorithms, hardware utilization). For LLM inference: Decode is typically far in the memory-bound region. Prefill may be closer to the knee or compute-bound. This tells you which optimizations will help for each phase.

Q5. [Staff] How do you decide between INT8, INT4, and FP16 for model deployment? What are the trade-offs?

Answer

Trade-offs: FP16: Full precision, no quality loss, 2 bytes/parameter. Baseline. INT8: ~4x faster memory bandwidth, usually <1% quality loss for well-quantized models. Good default for production. INT4: ~8x faster memory bandwidth, 2-5% quality loss typical, can be more for reasoning-heavy tasks. Best for resource-constrained or cost-sensitive deployments. Decision factors: (1) Quality requirements—measure on YOUR task, not benchmarks. Some tasks degrade more than others. (2) Hardware—INT4 needs kernels that support it efficiently (modern GPUs). (3) Model—some models quantize better than others. Always test. (4) Latency vs. throughput—INT4 helps throughput more than single-request latency. Process: Start with INT8 (usually safe). If quality acceptable and need more efficiency, try INT4. Always validate on task-specific evaluation set before deploying.

Spot the Problem

Problem 1. [Senior] Optimization approach:

"Our model inference is slow. I'll rewrite the attention mechanism
in custom CUDA to make it faster."

Answer

Problems: (1) Premature optimization—haven’t profiled to confirm attention is the bottleneck. Might be optimizing the wrong thing. (2) Reinventing the wheel—FlashAttention already exists and is highly optimized by experts. Unlikely to beat it. (3) Maintenance burden—custom CUDA is hard to maintain, debug, and port to new hardware. (4) May break other optimizations—custom kernels might not integrate with torch.compile, quantization, etc. Correct approach: Profile first (torch.profiler, nsys). Use established optimizations (FlashAttention, torch.compile). Only write custom CUDA if profiling shows a specific bottleneck that existing tools don’t address—and even then, consider contributing to existing libraries.

Problem 2. [Staff] Performance report:

"After applying all optimizations (FlashAttention, INT4 quantization,
continuous batching), we achieved 100 tokens/second throughput."

What’s missing?

Answer

Missing context: (1) What model?—100 tok/s means different things for 7B vs. 70B. (2) What hardware?—100 tok/s on H100 might be poor; on consumer GPU might be excellent. (3) What batch size?—100 tok/s at batch=1 vs. batch=32 are very different. (4) What sequence lengths?—Long contexts are harder. (5) Latency metrics?—What’s P50/P99 per-request latency? Throughput without latency is incomplete. (6) Quality impact?—Did INT4 quantization hurt accuracy? (7) Baseline comparison?—100 tok/s improvement from what? A complete performance report includes: model, hardware, batch size, sequence lengths, latency percentiles, throughput, quality metrics, and comparison to baseline.

Problem 3. [Staff] Deployment configuration:

# Using maximum batch size for best throughput
config = {
    "max_batch_size": 64,
    "max_sequence_length": 32768,
    "dtype": "int4"
}

Answer

Likely memory issues: Batch size 64 × 32K context × KV cache per token = enormous memory. Even INT4 weights won’t save you if KV cache (typically FP16) explodes. Problems: (1) May OOM—calculate actual memory needed before setting config. (2) Latency impact—batch 64 means requests wait for huge batches. (3) 32K context for all requests?—Most requests probably don’t need it. Wasting memory reservation. (4) INT4 doesn’t shrink KV cache—just weights. Long context is still memory-hungry. Better: Profile actual request distribution (context lengths, arrival rate). Set max_sequence_length to actual needs. Use dynamic batching that adapts to load. Calculate memory budget: model weights + max_batch_size × max_context × KV_cache_per_token < GPU_memory.

Design Exercises

Exercise 1. [Staff] You have an A100 80GB GPU and need to serve a 70B parameter model with sub-100ms time to first token for prompts up to 4K tokens. Design your deployment strategy, including quantization, parallelism, and batching decisions.

Guidance

Constraints: (1) 70B model in FP16 = 140GB—doesn’t fit in 80GB. Need quantization or tensor parallelism. (2) TTFT < 100ms for 4K tokens is aggressive—prefill of 4K tokens takes significant time. Analysis: (1) INT4 quantization: 70B × 0.5 bytes = 35GB weights. Fits with room for KV cache. (2) Single A100 with INT4 might achieve ~100ms TTFT for shorter prompts. 4K is pushing it. (3) Alternative: 2x A100 with tensor parallelism, FP16. Better quality, meets latency. Strategy options: (1) Single A100 + INT4 + aggressive optimization—risky for 4K, may work for shorter. (2) 2x A100 + tensor parallelism + INT8—more headroom, meets quality and latency. (3) Consider if 70B is necessary—would 13B meet quality needs at much lower cost? Batching: Limited—sub-100ms TTFT limits batch wait time. Likely batch=1-4 for real-time path.

Exercise 2. [Staff] Your LLM inference service is costing $500K/month. Leadership wants 30% cost reduction without impacting user experience. Design an optimization roadmap with prioritized initiatives and expected savings.

Guidance

First, understand cost breakdown: (1) GPU hours by instance type. (2) Utilization patterns (peak vs. off-peak). (3) Request distribution (simple vs. complex). Optimization roadmap: (1) Caching—identify repeat/similar queries. Even 20% cache hit rate at near-zero marginal cost is significant. (2) Model routing—route simple queries to smaller/cheaper models. Requires quality validation but can save 40%+ on routed queries. (3) Quantization—if not already INT8, significant savings. Validate quality first. (4) Batching optimization—improve utilization through better batching strategies. (5) Reserved capacity—convert on-demand to reserved/spot for predictable baseline load. 30-50% savings. (6) Off-peak processing—batch non-urgent requests to off-peak hours with cheaper capacity. Prioritize by: Effort vs. impact. Caching and routing often highest ROI. Always validate quality metrics before and after.

Connections to Other Chapters

Performance engineering intersects with deployment, cost, and system design. Here’s how this chapter connects:

Chapter 9 (LLM Deployment & Infrastructure): Core deployment architecture, batching strategies, and KV cache optimization. This chapter goes deeper into the performance theory behind those patterns.
Chapter 25 (System Design at Scale): Distributed systems patterns and architecture for high-throughput AI systems. Performance engineering at the component level enables system-level scalability.
Chapter 32 (Cost Engineering): How performance optimizations translate to cost savings and ROI analysis. Faster inference means lower GPU costs.
Tool & Framework Reference (Appendix B): Inference engines like vLLM, TGI, and TensorRT-LLM covered in this chapter.

Dao et al. (2022), “FlashAttention” - The paper that made long contexts practical. Critical for understanding modern attention.
Kwon et al. (2023), “vLLM/PagedAttention” - Efficient memory management for LLM serving.
Horace He, “Making Deep Learning Go Brrrr From First Principles” - Excellent practical GPU performance introduction.

Deep Dives

Frantar et al. (2022), “GPTQ” - Post-training quantization for transformers. Key for model compression.
Leviathan et al. (2022), “Speculative Decoding” - Using small models to accelerate large model inference.
Williams et al. (2009), “Roofline Model” - Visual performance model for understanding compute bounds.

--- title: "Chapter 27: Performance Engineering" keywords: [performance, profiling, optimization, GPU utilization, latency, throughput, bottleneck analysis, CUDA] difficulty: advanced prerequisites: [ch05, ch09] estimated_time: "4-5 hours" --- ## Introduction In early 2023, a startup building an AI coding assistant faced a crisis. Their prototype worked beautifully in demos—the model understood context, generated accurate code, and responded naturally to follow-up questions. But when they tried to scale beyond a hundred users, the economics collapsed. Each GPU could serve only 8 concurrent users at acceptable latency. At $3 per GPU-hour, that translated to $0.375 per user-hour—roughly $270 per user per month just in inference costs. For a product targeting $20/month subscriptions, they were losing money on every customer. The team spent three months on performance engineering. They implemented continuous batching, added speculative decoding with a small draft model, optimized their KV cache with paged attention, and carefully profiled every component. By the end, the same GPU served 64 concurrent users—an 8x improvement. Their unit economics went from catastrophic to viable. This story illustrates a fundamental truth about AI systems: **performance isn't just about speed—it's about feasibility**. A model that costs $1 per query can't power a consumer product. An inference pipeline with 5-second latency can't enable real-time applications. Understanding performance engineering transforms "this would be nice" into "this is possible." ### Why AI Performance is Different Traditional software performance optimization focuses on algorithmic complexity, database queries, and network latency. AI workloads introduce fundamentally different challenges: **Memory bandwidth, not compute, is often the bottleneck.** Modern GPUs have enormous compute capacity—the H100 can perform nearly 2,000 trillion floating-point operations per second. But loading the 140GB of weights for a 70B parameter model at FP16 precision takes 47 milliseconds at the H100's peak 3TB/s memory bandwidth. For autoregressive generation, where we must load the entire model for each token, this creates a hard floor on latency that adding more compute cannot break. **The quadratic attention bottleneck.** Self-attention—the mechanism that gives transformers their power—requires computing interactions between all pairs of tokens. For a 128K context window, that's 16 billion attention computations per layer. Naive implementations run out of memory long before running out of time. **Batching dynamics are complex.** Unlike web servers where requests are largely independent, AI inference can batch requests to amortize fixed costs—but different requests may need different numbers of output tokens. A batch is limited by its slowest request, creating intricate scheduling challenges. **The latency-throughput tradeoff is acute.** Users expect sub-second responses. Providers want to maximize GPU utilization. These goals conflict: high utilization requires large batches, but large batches increase latency for individual requests. ### The Core Insight: Roofline Thinking To reason about AI performance, you need a mental model for what limits computation. The **roofline model**, introduced by Williams et al. (2009), provides exactly this framework. Every computation has two potential bottlenecks: - **Compute bound**: Limited by how fast we can perform arithmetic operations - **Memory bound**: Limited by how fast we can load data from memory The ratio between these determines which bottleneck dominates: $$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Loaded}}$$ If your arithmetic intensity is high (many operations per byte loaded), you're compute-bound. If it's low (few operations per byte loaded), you're memory-bound. **Why LLM inference is memory-bound**: For a single token prediction with a 7B parameter model: - FLOPs: ~14 billion (roughly 2 * parameter count for forward pass) - Bytes loaded: ~14 billion (loading all parameters once at FP16) - Arithmetic intensity: ~1 FLOP/byte Modern GPUs achieve 100+ FLOPs per byte of memory bandwidth. At 1 FLOP/byte arithmetic intensity, we're using less than 1% of the GPU's compute capability. We're completely memory-bound. This insight transforms how we think about optimization. **Adding more compute power won't help a memory-bound workload.** Instead, we must either: 1. Reduce memory traffic (quantization, caching) 2. Increase arithmetic intensity (batching, operator fusion) 3. Use memory more efficiently (Flash Attention, paged KV cache) ::: {.callout-warning} ## Common Mistake: Throwing More GPUs at Memory-Bound Workloads **What people do:** Inference is slow, so upgrade from A100 to H100 expecting a proportional speedup. Or add more GPUs expecting near-linear scaling. **Why it fails:** LLM inference at small batch sizes is memory-bandwidth bound, not compute-bound. The H100 has 3x the compute of A100 but only 2x the memory bandwidth. For batch-size-1 inference, you get ~2x speedup, not 3x. Adding more GPUs often makes things worse due to communication overhead—you're splitting a memory-bound workload across more devices that now must synchronize. **Fix:** First identify your bottleneck using profiling. For memory-bound workloads: quantize to reduce bytes loaded, increase batch size to amortize weight loading, or use speculative decoding. Only upgrade hardware after you've exhausted algorithmic optimizations—often you can get 4-8x improvements without buying anything. ::: ### What You'll Learn - How GPU architecture determines what optimizations are possible - Why LLMs are memory-bound and how to reason about memory hierarchy - Flash Attention: the mathematics and why it works - Quantization theory: trading precision for performance - Profiling methodology: finding the real bottleneck - Speculative decoding: breaking the autoregressive barrier - Production optimization: batching, caching, and cost engineering ### Prerequisites - Understanding of transformer architecture (Chapter 5) - Familiarity with PyTorch or similar frameworks - Basic linear algebra (matrix multiplication, attention computation) - Helpful but not required: computer architecture concepts --- ## GPU Architecture for AI Engineers ### The Parallel Processing Paradigm To optimize for GPUs, you must understand how they differ from CPUs. A CPU is designed for **latency**: execute a single thread of instructions as fast as possible, with sophisticated branch prediction, out-of-order execution, and large caches to hide memory latency. A high-end CPU might have 64 cores, each executing 1-2 threads. A GPU is designed for **throughput**: execute thousands of threads simultaneously, hiding memory latency through massive parallelism rather than caching. An H100 has 132 Streaming Multiprocessors (SMs), each capable of executing thousands of threads concurrently. ``` CPU Execution Model: Thread 1: ████████████████████████████████████████████████████████████ Thread 2: ████████████████████████████████████████████████████████████ [Few threads, each runs fast with sophisticated optimization] GPU Execution Model: Thread 1: ██░░░░██░░░░██░░░░██░░░░██░░░░ (waiting for memory) Thread 2: ░░██░░░░██░░░░██░░░░██░░░░██░░ Thread 3: ░░░░██░░░░██░░░░██░░░░██░░░░██ Thread 1024: ██░░░░██░░░░██░░░░██░░░░██░░░░ [Many threads, each slow, but overlapped to hide latency] ``` The key insight: **GPUs hide memory latency through parallelism, not caching**. While one group of threads waits for memory, others execute. This only works if you have enough threads to keep the execution units busy—a concept called **occupancy**. ### The Memory Hierarchy Understanding the GPU memory hierarchy is essential for optimization: | Memory Level | Size (H100) | Bandwidth | Latency | Scope | |-------------|-------------|-----------|---------|-------| | Registers | 256KB per SM | ~10 TB/s | 0 cycles | Per thread | | Shared Memory / L1 | 228KB per SM | ~19 TB/s | ~20 cycles | Per thread block | | L2 Cache | 50MB | ~6 TB/s | ~200 cycles | Global | | HBM3 (VRAM) | 80GB | 3.35 TB/s | ~400 cycles | Global | | CPU RAM (via PCIe) | System | ~64 GB/s | ~10,000 cycles | System | The bandwidth gap between levels is staggering. Moving from HBM to shared memory is a 6x bandwidth improvement. Moving from HBM to registers is 3x more. **The single most important optimization principle is keeping data close to computation**. ### SIMT Execution and Warps GPUs execute instructions in lockstep groups of 32 threads called **warps** (NVIDIA terminology) or wavefronts (AMD). All 32 threads in a warp execute the same instruction simultaneously—a model called SIMT (Single Instruction, Multiple Thread). This has profound implications: **Warp divergence**: If threads in a warp take different branches, both branches must execute serially: ```python # This code causes warp divergence if thread_id % 2 == 0: do_even_work() # Half the warp executes else: do_odd_work() # Other half executes (serialized!) ``` **Memory coalescing**: Adjacent threads should access adjacent memory addresses. The memory system fetches data in 128-byte chunks. If 32 threads each request one float (4 bytes) from adjacent addresses, one fetch serves all of them. If they request scattered addresses, you might need 32 separate fetches—a 32x slowdown. ```python # Coalesced access (fast) data[thread_id] # Threads 0-31 access addresses 0-31 # Strided access (slow) data[thread_id * stride] # Threads access scattered addresses # Random access (very slow) data[random_indices[thread_id]] # Even worse ``` ### Kernel Launch Overhead Each GPU operation (kernel) has launch overhead: 5-10 microseconds to dispatch work from the CPU to the GPU. This seems tiny, but for a model with hundreds of small operations, it adds up. A forward pass through a transformer layer might involve: - 4 linear projections (Q, K, V, O) - Attention computation - Softmax - 2 FFN linear layers - 2 layer normalizations - Various reshapes and transposes That's easily 15+ kernel launches per layer. For a 70-layer model, that's 1,000+ kernels, potentially 5-10ms of pure launch overhead. **Kernel fusion** combines multiple operations into a single kernel, eliminating these round-trips. This is why Flash Attention, which fuses the entire attention computation into one kernel, provides such dramatic speedups. --- ## Memory-Bound Inference: The Fundamental Challenge ### Why LLM Inference is Memory-Bound Let's derive why large language model inference is memory-bound from first principles. For a transformer with: - $P$ = number of parameters - $d$ = hidden dimension - $n$ = sequence length - $b$ = batch size **Forward pass computation** (simplified, for attention + FFN): $$\text{FLOPs} \approx 2 \cdot P \cdot b$$ **Memory traffic** (loading all parameters): $$\text{Bytes} = P \cdot \text{bytes\_per\_param}$$ **Arithmetic intensity**: $$\text{AI} = \frac{2 \cdot P \cdot b}{P \cdot \text{bytes\_per\_param}} = \frac{2 \cdot b}{\text{bytes\_per\_param}}$$ For FP16 (2 bytes per parameter): $$\text{AI} = b \text{ FLOPs/byte}$$ The H100's compute-to-bandwidth ratio is approximately: $$\frac{990 \text{ TFLOPS}}{3.35 \text{ TB/s}} \approx 295 \text{ FLOPs/byte}$$ To be compute-bound, we need arithmetic intensity of at least 295. That means **batch size must be at least 295** to fully utilize the H100's compute capability! For interactive inference (batch size 1), we're at 1/295 = 0.3% compute utilization. Almost all time is spent waiting for memory. ### The Memory Bandwidth Calculation This analysis gives us a powerful tool for performance prediction. For a 70B parameter model at FP16: $$\text{Minimum time per token} = \frac{70\text{B params} \times 2 \text{ bytes}}{3.35 \text{ TB/s}} = 42\text{ms}$$ That's the theoretical minimum—loading every parameter once at peak bandwidth. In practice, we also have: - KV cache reads and writes - Attention computation - Activation transfers - Memory access patterns that don't achieve peak bandwidth Real-world single-request latency is typically 2-3x this theoretical minimum. ### Implications for Optimization This analysis reveals which optimizations can help: | Optimization | How It Helps | Expected Impact | |-------------|--------------|-----------------| | Quantization (FP16 → INT8) | 2x fewer bytes to load | ~2x faster | | Quantization (FP16 → INT4) | 4x fewer bytes to load | ~4x faster (with quality tradeoff) | | Batching | Amortizes weight loading | Linear throughput improvement | | KV caching | Avoids recomputing past | Essential for practical generation | | Flash Attention | Reduces memory traffic | Enables long contexts | And which cannot help: - **Adding more GPU compute**: Already underutilized - **Faster GPU clock speeds**: Memory-bound, not compute-bound - **More CUDA cores**: Same limitation --- ## Flash Attention: A Deep Dive ### The Attention Bottleneck Standard attention computes: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$ For sequence length $n$ and head dimension $d$: - $QK^T$ produces an $n \times n$ matrix - This matrix must be stored for the backward pass (training) or for the softmax (inference) - Memory: $O(n^2)$ for the attention matrix For $n = 128,000$ tokens: $$\text{Attention matrix size} = 128,000^2 \times 2 \text{ bytes} = 32 \text{ GB per layer per head}$$ This is clearly infeasible. Even for modest sequence lengths, the attention matrix becomes the dominant memory consumer. ### The Flash Attention Insight Tri Dao's Flash Attention (2022) solves this with a key insight: **we never need to materialize the full attention matrix**. We can compute attention in tiles, maintaining only partial results in fast SRAM (shared memory), never writing the full $n \times n$ matrix to slow HBM. The algorithm: 1. Split Q, K, V into blocks that fit in SRAM 2. For each block of Q, iterate over blocks of K, V 3. Compute block attention scores in SRAM 4. Use online softmax (computing softmax incrementally) to combine blocks 5. Write only the final output to HBM The mathematics of online softmax: For standard softmax: $$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$ The problem: we need the full sum before computing any output. Online softmax maintains running statistics: $$m_{\text{new}} = \max(m_{\text{old}}, m_{\text{block}})$$ $$\ell_{\text{new}} = \ell_{\text{old}} \cdot e^{m_{\text{old}} - m_{\text{new}}} + \ell_{\text{block}} \cdot e^{m_{\text{block}} - m_{\text{new}}}$$ This allows computing exact softmax while processing data in blocks, never needing to store the full attention matrix. ### Flash Attention Performance Analysis | Aspect | Standard Attention | Flash Attention | |--------|-------------------|-----------------| | Memory | $O(n^2)$ | $O(n)$ | | HBM reads | $O(n^2 d)$ | $O(n^2 d^2 / M)$ | | HBM writes | $O(n^2 + nd)$ | $O(nd)$ | | Kernel launches | 3+ (matmul, softmax, matmul) | 1 (fused) | Where $M$ is SRAM size. For typical values ($d = 128$, $M = 100$KB), Flash Attention reduces HBM traffic by 5-10x. ### Flash Attention 2 and 3 Flash Attention 2 (2023) improved on the original with: - Better work partitioning across thread blocks - Reduced non-matmul FLOPs - ~2x speedup over Flash Attention 1 Flash Attention 3 (2024) further optimized for Hopper architecture: - Exploits asynchronous operations - Uses FP8 tensor cores when available - Achieves 75%+ of theoretical maximum FLOPS ```python # Using Flash Attention is now trivial in modern frameworks import torch.nn.functional as F # PyTorch 2.0+ automatically uses Flash Attention when possible output = F.scaled_dot_product_attention(query, key, value) # HuggingFace models: specify implementation model = AutoModelForCausalLM.from_pretrained( "meta-llama/Llama-4-Scout-17B", attn_implementation="flash_attention_2" ) ``` > **Full implementation**: See [reference/21_performance_engineering_code.md](../reference/21_performance_engineering_code.html#flash-attention-comparison) for benchmarking code comparing attention implementations. --- ## Quantization: Theory and Practice ### Why Quantization Works Neural network weights are stored as floating-point numbers, typically FP32 (32 bits) or FP16 (16 bits). Quantization represents these weights with fewer bits—INT8 (8 bits) or INT4 (4 bits). Why does this work? Neural networks are remarkably robust to noise. The weights exist in a continuous space where small perturbations don't significantly change behavior. Quantization introduces small errors, but these errors are bounded and the network's overall function is preserved. Mathematically, for a weight matrix $W$ and quantized approximation $\hat{W}$: $$\hat{W} = s \cdot \text{round}\left(\frac{W}{s}\right)$$ Where $s$ is a scaling factor. The error $W - \hat{W}$ is bounded by $s/2$. ### Quantization Schemes **Per-tensor quantization**: One scale factor for the entire tensor. - Simplest implementation - May lose precision if weights have different magnitudes **Per-channel quantization**: One scale factor per output channel. - Better accuracy than per-tensor - Standard for INT8 quantization **Per-group quantization**: One scale factor per group of weights (e.g., 128 weights). - Used by GPTQ, AWQ for 4-bit quantization - Good accuracy-efficiency tradeoff **Mixed precision**: Different layers quantized to different precisions. - Sensitive layers (embeddings, attention) kept at higher precision - Less sensitive layers (FFN) more aggressively quantized ### GPTQ: Post-Training Quantization for LLMs GPTQ (Frantar et al., 2022) introduced an efficient method for post-training quantization of LLMs: 1. **Layer-wise quantization**: Process one layer at a time 2. **Hessian-based error compensation**: Use second-order information to minimize quantization error 3. **Lazy batch updates**: Efficient implementation that scales to large models The key insight: when quantizing weight $w_i$, we can adjust remaining weights $w_{i+1:}$ to compensate for the error introduced. GPTQ computes optimal adjustments using approximate Hessian information from calibration data. ```python # GPTQ quantization with auto-gptq library from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig quantize_config = BaseQuantizeConfig( bits=4, # 4-bit quantization group_size=128, # Per-group scaling desc_act=False, # Activation order damp_percent=0.1 # Dampening for numerical stability ) model = AutoGPTQForCausalLM.from_pretrained( "meta-llama/Llama-4-Scout-17B", quantize_config=quantize_config ) model.quantize(calibration_data) ``` ### AWQ: Activation-Aware Quantization AWQ (Lin et al., 2023) improves on GPTQ by focusing on **salient weights**—weights that, when activated by typical inputs, have outsized impact on outputs. Key insight: 1% of weights contribute to 50%+ of the output. These salient weights should be preserved more carefully. AWQ: 1. Profiles activations on calibration data 2. Identifies salient weights based on activation magnitudes 3. Scales salient weights to protect them from quantization error 4. Applies standard quantization to the scaled weights Results: AWQ achieves better accuracy than GPTQ at the same bit width, with similar inference speed. ### Quantization Impact Analysis | Method | Bits | Memory Reduction | Speed Improvement | Quality Impact | |--------|------|------------------|-------------------|----------------| | FP16 baseline | 16 | 1x | 1x | Baseline | | INT8 dynamic | 8 | 2x | 1.5-2x | <1% degradation | | GPTQ 4-bit | 4 | 4x | 2-3x | 1-2% degradation | | AWQ 4-bit | 4 | 4x | 2-3x | <1% degradation | | GGUF Q4_K_M | 4 | 4x | 2-3x | <1% degradation | | 2-bit | 2 | 8x | 3-4x | Significant degradation | The sweet spot for most applications is **4-bit quantization with per-group scaling** (GPTQ or AWQ). This achieves 4x memory reduction with minimal quality loss. > **Full implementation**: See [reference/21_performance_engineering_code.md](../reference/21_performance_engineering_code.html#quantization-implementation) for complete quantization code with benchmarking. --- ## The KV Cache: Memory Management for Generation ### Why KV Caching is Essential Autoregressive generation produces tokens one at a time, each conditioned on all previous tokens. Without caching, generating the $n$-th token requires recomputing attention over all $n-1$ previous tokens—an $O(n^2)$ total cost for generating $n$ tokens. KV caching stores the key and value projections from previous tokens: - First token: Compute K, V for position 0, store in cache - Second token: Compute K, V for position 1, append to cache; attend to positions 0-1 - $n$-th token: Compute K, V for position $n-1$, append to cache; attend to positions 0 through $n-1$ This reduces generation from $O(n^2)$ to $O(n)$—essential for any practical system. ### KV Cache Memory Requirements For a model with: - $L$ layers - $h_{kv}$ key/value heads (equals query heads $h$ for MHA; smaller for GQA/MQA) - $d_h$ head dimension (typically $d_{model}/h$, where $h$ is the number of query heads) - $n$ sequence length - $b$ batch size KV cache size: $$\text{KV cache} = 2 \cdot L \cdot h_{kv} \cdot d_h \cdot n \cdot b \cdot \text{bytes\_per\_value}$$ The leading factor of 2 accounts for storing both keys and values. The key correction for modern models is that we use $h_{kv}$ (KV heads), not $h$ (query heads)—a frequent source of overestimated KV cache budgets. For Llama 2 70B ($L=80$, $h_{kv}=8$ GQA heads, $d_h=128$) at 8K context with FP16: $$\text{KV cache} = 2 \cdot 80 \cdot 8 \cdot 128 \cdot 8192 \cdot 2 = 2.7 \text{ GB per request}$$ Note: Llama 2 70B uses Grouped-Query Attention (GQA) with 8 KV heads (not all 64 query heads), which reduces KV cache 8x compared to standard MHA. Without GQA, this would be 21.5 GB per request—illustrating why GQA is essential for long-context serving. This creates the **memory wall for long contexts**: even with Flash Attention enabling the computation, we may not have memory for the KV cache. ::: {.callout-warning} ## Common Mistake: Ignoring KV Cache in Capacity Planning **What people do:** Calculate GPU memory requirements based only on model weights. "70B at FP16 is 140GB, so it fits on 2x H100 80GB with room to spare." **Why it fails:** KV cache grows with sequence length and concurrent requests. At 8K context, Llama 2 70B needs ~2.7GB of KV cache *per request* (with GQA). With 32 concurrent requests at 8K context, that's ~86GB just for KV cache—comparable to the model weights. Your "fits with room to spare" calculation may now be OOM. **Fix:** Always include KV cache in memory planning: `total_memory = model_weights + (kv_cache_per_token × max_tokens × max_concurrent_requests)`. Use GQA models when possible (8x smaller KV cache). Implement paged attention to avoid over-allocation. Set conservative concurrent request limits and monitor memory utilization in production. ::: ### Paged Attention (vLLM) Traditional KV cache pre-allocates a contiguous block of memory for the maximum sequence length. This is wasteful: - Most sequences are shorter than max length - Memory fragmentation prevents fitting more requests - Early requests hog memory even after completing **PagedAttention** (Kwon et al., 2023, introduced in vLLM) applies virtual memory concepts: 1. Divide KV cache into fixed-size blocks (pages) 2. Allocate pages on demand as sequences grow 3. Track page mappings in a page table 4. When sequences complete, return pages to free pool ```python # Conceptual PagedAttention implementation class PagedKVCache: def __init__(self, num_blocks, block_size, num_layers, num_heads, head_dim): # Pre-allocate block pool self.key_blocks = torch.zeros(num_blocks, num_layers, block_size, num_heads, head_dim) self.value_blocks = torch.zeros(num_blocks, num_layers, block_size, num_heads, head_dim) self.free_blocks = list(range(num_blocks)) self.sequence_block_tables = {} # seq_id -> list of block indices def allocate(self, seq_id): """Allocate a new block for a sequence.""" if not self.free_blocks: raise RuntimeError("Out of KV cache memory") block_idx = self.free_blocks.pop() if seq_id not in self.sequence_block_tables: self.sequence_block_tables[seq_id] = [] self.sequence_block_tables[seq_id].append(block_idx) return block_idx def free(self, seq_id): """Return all blocks for a sequence to the free pool.""" blocks = self.sequence_block_tables.pop(seq_id, []) self.free_blocks.extend(blocks) ``` PagedAttention enables **2-4x more concurrent requests** through efficient memory utilization. ### Multi-Query and Grouped-Query Attention Another approach: reduce the KV cache size by sharing key-value heads. **Multi-Query Attention (MQA)**: All query heads share one K and one V. $$\text{KV cache reduction: } h \times \text{ smaller}$$ **Grouped-Query Attention (GQA)**: Groups of query heads share K, V. - Llama 2 70B uses 8 KV heads for 64 query heads (8x reduction) - Good tradeoff between quality and memory | Architecture | Query Heads | KV Heads | KV Cache Reduction | |-------------|-------------|----------|-------------------| | MHA (standard) | 64 | 64 | 1x | | GQA (Llama 2 70B) | 64 | 8 | 8x | | MQA | 64 | 1 | 64x | > **Full implementation**: See [reference/21_performance_engineering_code.md](../reference/21_performance_engineering_code.html#kv-cache-implementation) for complete KV cache code with paged attention. --- ## Speculative Decoding: Breaking the Sequential Barrier ### The Autoregressive Bottleneck Autoregressive generation is inherently sequential: each token depends on all previous tokens. With batch size 1, we can only generate one token at a time, leaving most GPU compute idle. For a 70B model generating 100 tokens: - 100 sequential forward passes - Each pass: ~50ms (memory-bound) - Total: ~5 seconds - GPU compute utilization: <1% ### The Speculative Decoding Insight Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) introduces parallelism through speculation: 1. Use a small, fast **draft model** to generate $\gamma$ tokens quickly 2. Send all $\gamma$ tokens to the large **target model** in one batch 3. Target model **verifies** draft tokens in parallel 4. Accept verified tokens; resample if rejected Key insight: **verification is parallel, but generation is sequential**. The target model can check all $\gamma$ draft tokens simultaneously—work that would otherwise require $\gamma$ sequential passes. ### The Mathematics of Speculative Decoding Let $p(x)$ be the target model distribution and $q(x)$ be the draft model distribution. For each draft token $x_i$: - If $p(x_i) \geq q(x_i)$: **always accept** (target is at least as likely) - If $p(x_i) < q(x_i)$: **accept with probability** $p(x_i)/q(x_i)$ If rejected, sample from an **adjusted distribution**: $$p'(x) = \frac{\max(0, p(x) - q(x))}{\sum_x \max(0, p(x) - q(x))}$$ **Theorem**: This procedure produces samples from exactly the target distribution $p(x)$. This is remarkable: we get speedup without any approximation to the output distribution. ### Expected Speedup The speedup depends on: - $\gamma$: number of speculative tokens - $\alpha$: acceptance rate (how well draft matches target) - $c$: relative cost of draft vs target inference Expected tokens per target model call: $$E[\text{tokens}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}$$ For $\alpha = 0.8$ (80% acceptance) and $\gamma = 4$: $$E[\text{tokens}] = \frac{1 - 0.8^5}{1 - 0.8} = \frac{1 - 0.328}{0.2} = 3.36$$ If the draft model adds negligible overhead, this is a 3.36x speedup! ### Implementation Considerations **Draft model selection**: - Same architecture family (shares vocabulary, tokenizer) - 10-20x smaller than target (e.g., 68M draft for 7B target) - Trained on similar data for high acceptance rate **Optimal speculation length**: - Too few: Underutilizes parallelism - Too many: Wastes compute on likely-rejected tokens - Sweet spot: 4-8 tokens for most models ```python class SpeculativeDecoder: def __init__(self, target_model, draft_model, gamma=4): self.target = target_model self.draft = draft_model self.gamma = gamma def generate_step(self, input_ids): # 1. Draft model generates gamma tokens draft_tokens = [] draft_probs = [] context = input_ids for _ in range(self.gamma): logits = self.draft(context) probs = F.softmax(logits[:, -1], dim=-1) token = torch.multinomial(probs, 1) draft_tokens.append(token) draft_probs.append(probs[0, token].item()) context = torch.cat([context, token], dim=1) # 2. Target verifies all at once all_draft = torch.stack(draft_tokens, dim=1) target_logits = self.target(torch.cat([input_ids, all_draft], dim=1)) target_probs = F.softmax(target_logits[:, -self.gamma-1:-1], dim=-1) # 3. Accept/reject each token accepted = [] for i, (token, q, p_dist) in enumerate(zip(draft_tokens, draft_probs, target_probs)): p = p_dist[0, token].item() if random.random() < min(1, p / q): accepted.append(token) else: # Sample from adjusted distribution adjusted = torch.clamp(p_dist - draft_probs_dist, min=0) adjusted = adjusted / adjusted.sum() accepted.append(torch.multinomial(adjusted, 1)) break return torch.cat(accepted, dim=1) ``` > **Full implementation**: See [reference/21_performance_engineering_code.md](../reference/21_performance_engineering_code.html#speculative-decoding) for production-ready speculative decoding. --- ## Profiling: Finding the Real Bottleneck ::: {.callout-tip} ## Staff Engineer Perspective "On my second week as TL I was bored in a Grafana dashboard and noticed our embeddings service had a flat 90ms floor on every request, even tiny ones. Nobody had ever questioned it. I traced it to a synchronous DNS lookup we were doing per call because somebody had disabled the resolver cache in 2022 to debug a different issue and never re-enabled it. One line change. Service p50 dropped from 110ms to 18ms. Eight months of cumulative latency burned because the original 'fix' became invisible the moment the incident closed. Now I do a 'weird flat line' tour of dashboards once a quarter. Cheap wins live in things everyone has stopped looking at." — *Performance Engineer at a real-time messaging company* ::: ### The Performance Engineering Mindset The first rule of optimization: **measure before you optimize**. Intuition about performance is often wrong. Engineers routinely spend weeks optimizing the wrong component because they didn't profile first. ``` The Performance Engineering Loop: 1. DEFINE goals (what latency/throughput do we need?) ↓ 2. MEASURE current performance (profile systematically) ↓ 3. ANALYZE results (identify the bottleneck) ↓ 4. HYPOTHESIZE (what optimization might help?) ↓ 5. IMPLEMENT (make the change) ↓ 6. VALIDATE (measure again, compare) ↓ [Repeat until goals met] ``` ### Profiling Tools for AI Workloads Different tools reveal different aspects of performance: | Tool | What It Measures | When to Use | Overhead | |------|------------------|-------------|----------| | `torch.profiler` | PyTorch ops, GPU kernels | First-line profiling | Low | | `py-spy` | Python call stacks | CPU bottlenecks | Very low | | `nsys` (Nsight Systems) | GPU timeline, CPU-GPU sync | GPU pipeline analysis | Low | | `ncu` (Nsight Compute) | Detailed kernel metrics | Deep kernel optimization | High | | `nvidia-smi` | GPU utilization, memory | Quick health check | None | ### PyTorch Profiler Deep Dive ```python import torch from torch.profiler import profile, record_function, ProfilerActivity def profile_model(model, input_data, warmup=3, active=5): """Profile model inference with proper warmup.""" # Warmup runs (crucial for accurate measurement) for _ in range(warmup): with torch.no_grad(): _ = model(input_data) torch.cuda.synchronize() # Profile with GPU activities, memory, and stack traces with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True, profile_memory=True, with_stack=True ) as prof: for _ in range(active): with record_function("model_inference"): with torch.no_grad(): _ = model(input_data) # Analyze results print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20)) # Export for detailed visualization prof.export_chrome_trace("trace.json") return prof ``` ### Interpreting Profiler Output A typical PyTorch profiler output: ``` --------------------------------- ------- ------- ------- ------- Name CPU Time CUDA Time Calls CPU Mem --------------------------------- ------- ------- ------- ------- aten::linear 1.2ms 15.3ms 96 0 b aten::matmul 0.8ms 12.1ms 48 0 b aten::_softmax 0.3ms 2.4ms 48 0 b aten::layer_norm 0.4ms 1.8ms 48 0 b aten::copy_ 0.6ms 1.2ms 128 0 b ``` **Key insights**: - CUDA time >> CPU time: GPU-bound (good, GPU is doing work) - CPU time >> CUDA time: CPU-bound (investigate Python overhead) - Many calls with small time each: Kernel launch overhead - Large memory in specific ops: Memory bottleneck candidates ### Common Bottleneck Patterns **Pattern 1: CPU-Bound Preprocessing** ``` Symptom: Low GPU utilization, high CPU time Profile: Data loading, tokenization dominate Solution: Async data loading, batch preprocessing, compiled tokenizers ``` **Pattern 2: Kernel Launch Overhead** ``` Symptom: Many small CUDA calls, GPU gaps in timeline Profile: Hundreds of ops with <0.1ms each Solution: torch.compile, CUDA graphs, operator fusion ``` **Pattern 3: Memory Bandwidth Bound** ``` Symptom: High GPU utilization, but lower throughput than expected Profile: Large tensors, many memory copies Solution: Quantization, better data layout, reduce memory traffic ``` **Pattern 4: CPU-GPU Synchronization** ``` Symptom: GPU stalls, CPU waiting Profile: Explicit syncs, CPU operations on GPU tensors Solution: Async operations, keep computation on GPU ``` > **Full implementation**: See [reference/21_performance_engineering_code.md](../reference/21_performance_engineering_code.html#profiling-framework) for complete profiling framework with analysis. ::: {.callout-warning} ## Common Mistake: Optimizing Without Profiling **What people do:** Assume the model inference is slow and spend weeks implementing CUDA optimizations, only to discover that 60% of latency was actually in Python preprocessing or database queries. **Why it fails:** Intuition about performance is notoriously unreliable. The bottleneck is rarely where you expect. Without measurement, you might 10x the speed of something that contributes 5% to total latency—a 0.5% overall improvement after weeks of work. **Fix:** Always profile before optimizing. Use `torch.profiler` for GPU kernels, `py-spy` for Python overhead, and end-to-end tracing for system-level bottlenecks. Identify which component dominates latency (preprocessing? inference? postprocessing? network?). Optimize the actual bottleneck. Re-profile after each change to validate impact. ::: --- ## Production Optimization Techniques ### Continuous Batching Traditional batching waits for a batch to fill, processes all requests, then returns results. This is inefficient because: - Short requests wait for long requests to complete - New requests wait for the entire batch to finish - GPU may be idle while waiting for requests **Continuous batching** (iteration-level batching) processes requests at the token level: ```python class ContinuousBatcher: def __init__(self, model, max_batch_size, max_waiting_time_ms=50): self.model = model self.max_batch_size = max_batch_size self.max_waiting_time = max_waiting_time_ms self.active_requests = [] self.waiting_queue = [] async def run_loop(self): """Main loop: process one token for all active requests.""" while True: # Add waiting requests (up to capacity) self._fill_batch() if not self.active_requests: await asyncio.sleep(0.001) continue # Generate one token for all active requests await self._step() # Return completed requests self._retire_completed() async def _step(self): """Generate one token for all active requests in batch.""" # Batch all active requests input_ids = self._prepare_batch() # Single forward pass for all requests logits = self.model(input_ids) # Sample next token for each request for i, request in enumerate(self.active_requests): next_token = sample(logits[i]) request.append_token(next_token) def _retire_completed(self): """Return completed requests, freeing batch slots.""" completed = [r for r in self.active_requests if r.is_done()] for request in completed: self.active_requests.remove(request) request.future.set_result(request.output) ``` Continuous batching enables: - **Higher throughput**: GPU always has work - **Lower latency**: Short requests don't wait for long ones - **Better utilization**: Batch stays full as requests come and go ### torch.compile and TorchInductor PyTorch 2.0's `torch.compile` applies automatic optimizations: ```python import torch # Basic compilation model = torch.compile(model) # With optimization mode model = torch.compile(model, mode="reduce-overhead") # CUDA graphs model = torch.compile(model, mode="max-autotune") # Best perf, slow compile # With dynamic shapes model = torch.compile(model, dynamic=True) # Handle varying input sizes ``` What torch.compile does: 1. **Traces** the model to capture computation graph 2. **Fuses** operations to reduce kernel launches 3. **Generates** optimized kernels via TorchInductor 4. Optionally uses **CUDA graphs** to eliminate launch overhead Expected speedups: - Typical models: 1.3-2x - Memory-bound models: 1.1-1.5x (less kernel fusion benefit) - Compute-bound models: 1.5-3x (more fusion benefit) ### CUDA Graphs CUDA graphs capture a sequence of GPU operations and replay them without CPU involvement: ```python class CUDAGraphRunner: def __init__(self, model, sample_input): self.model = model # Static buffers for graph self.static_input = sample_input.clone() self.static_output = None # Warmup for _ in range(3): _ = model(self.static_input) torch.cuda.synchronize() # Capture graph self.graph = torch.cuda.CUDAGraph() with torch.cuda.graph(self.graph): self.static_output = model(self.static_input) def __call__(self, x): # Copy input to static buffer self.static_input.copy_(x) # Replay captured graph self.graph.replay() # Return output (already computed in static buffer) return self.static_output.clone() ``` Limitations: - Input shapes must be fixed - No dynamic control flow during graph replay - Memory layout must be consistent Best for: Fixed-shape inference with repetitive workloads. ### Tensor Parallelism When models don't fit on a single GPU, tensor parallelism splits weight matrices across GPUs: ``` Without Tensor Parallelism: GPU 0: Full 70B model (doesn't fit!) With 4-way Tensor Parallelism: GPU 0: 1/4 of each weight matrix GPU 1: 1/4 of each weight matrix GPU 2: 1/4 of each weight matrix GPU 3: 1/4 of each weight matrix Each GPU computes partial results, then all-reduce to combine. ``` For a linear layer $Y = XW$: 1. Split $W$ into $W_1, W_2, ..., W_n$ (column-wise) 2. Each GPU computes $Y_i = XW_i$ 3. All-reduce to get $Y = [Y_1, Y_2, ..., Y_n]$ Communication overhead: - One all-reduce per linear layer: ~8 all-reduces per transformer block - For 80 layers: ~640 all-reduces per forward pass - With NVLink (900 GB/s): Acceptable overhead - With PCIe (64 GB/s): Often prohibitive > **Full implementation**: See [reference/21_performance_engineering_code.md](../reference/21_performance_engineering_code.html#tensor-parallelism) for tensor parallelism setup. --- ## Decision Frameworks for Performance Optimization ### When to Optimize Not all optimization is worthwhile. Consider: **The 80/20 rule**: 80% of time is spent in 20% of code. Profile to find that 20%. **Amdahl's Law**: If a component is 10% of total time, even infinite speedup yields only 11% overall improvement. **Development cost**: A week of engineering time costs $5,000+. Will the optimization save that much in compute? ### Optimization Priority Framework ![Optimization Priority Matrix](../assets/diagrams/rendered/ch21_optimization_matrix.svg) **Do First** (High Impact, Low Effort): - Enable Flash Attention (one line of code) - torch.compile (one line of code) - Use production serving framework (vLLM, TGI) **Do Later** (High Impact, High Effort): - Custom CUDA kernels - Model architecture changes - Custom quantization schemes **Skip** (Low Impact, Low Effort): - Micro-optimizations in preprocessing - Minor memory layout changes **Do If Time** (Low Impact, High Effort): - Exotic optimization techniques - Hardware-specific tuning ### Latency vs. Throughput Decision Matrix | Scenario | Optimize For | Key Techniques | |----------|--------------|----------------| | Interactive chat | Latency | Speculative decoding, tensor parallelism, small batch | | Batch processing | Throughput | Large batches, continuous batching, quantization | | Mixed workload | Balance | Priority queues, adaptive batching | | Cost-constrained | Throughput | Aggressive quantization, spot instances | | Real-time (<100ms) | Latency | CUDA graphs, distillation, edge deployment | ### Cost-Performance Tradeoff Analysis ```python def analyze_configuration(config): """Analyze cost-performance tradeoff for a configuration.""" # Calculate throughput tokens_per_second = estimate_throughput(config) # Calculate cost gpu_cost_per_hour = GPU_COSTS[config.gpu_type] cost_per_million_tokens = (gpu_cost_per_hour / 3600) / tokens_per_second * 1_000_000 # Calculate quality (if using quantization) quality_score = estimate_quality(config) return { 'throughput': tokens_per_second, 'cost_per_1M_tokens': cost_per_million_tokens, 'quality': quality_score, 'latency_p99': estimate_latency(config) } # Example configurations configs = [ {'gpu': 'A10', 'model': '7B', 'quant': 'fp16', 'batch': 8}, {'gpu': 'A10', 'model': '7B', 'quant': 'int8', 'batch': 16}, {'gpu': 'A100', 'model': '70B', 'quant': 'int4', 'batch': 32}, ] # Find Pareto-optimal configurations for config in configs: metrics = analyze_configuration(config) if meets_latency_slo(metrics) and meets_quality_threshold(metrics): print(f"Viable: {config} -> ${metrics['cost_per_1M_tokens']:.4f}/1M tokens") ``` --- ## Case Studies: Real-World Optimization Wins ### Case Study 1: The Startup Cost Crisis **Situation**: AI coding assistant, 7B parameter model, $0.375/user-hour serving cost. **Analysis**: - Single user per GPU (batch size 1) - FP16 inference (no quantization) - Standard attention (no Flash Attention) - No caching of common prompts **Optimization Journey**: | Stage | Change | Before | After | Improvement | |-------|--------|--------|-------|-------------| | Baseline | Measure | - | 50ms/token, 8 users/GPU | - | | 1 | Flash Attention | 8 users | 8 users | No change (not attention-bound) | | 2 | INT8 quantization | 50ms | 35ms | 1.4x latency | | 3 | Continuous batching | 8 users | 32 users | 4x throughput | | 4 | KV cache optimization | 32 users | 48 users | 1.5x throughput | | 5 | Speculative decoding | 35ms | 20ms | 1.75x latency | | **Final** | Combined | 8 users, $0.375/hr | 64 users, $0.047/hr | **8x cost reduction** | **Key Insight**: The biggest win came from continuous batching, not the "obvious" optimization (Flash Attention). This is why profiling matters. ### Case Study 2: Long-Context Document Analysis **Situation**: Legal document analysis requiring 100K+ context windows. **Problem**: Standard attention OOMs at 32K tokens. **Analysis**: - 32K tokens * 32K tokens * 2 bytes * 40 heads = 82GB per layer for attention - Not feasible on any single GPU **Solution**: | Technique | Purpose | Impact | |-----------|---------|--------| | Flash Attention 2 | Eliminate O(n^2) memory | Enable 100K context | | Sliding window (partial) | Reduce effective sequence for most layers | 2x memory reduction | | Gradient checkpointing | Trade compute for memory | 3x memory reduction for training | | GQA (8 KV heads) | Reduce KV cache size | 8x KV cache reduction | **Result**: 128K context windows on A100-80GB, enabling full document analysis. ### Case Study 3: Real-Time Voice Assistant **Situation**: Voice assistant requiring <200ms end-to-end latency. **Budget breakdown**: - Audio processing: 20ms - Speech-to-text: 50ms - LLM inference: 100ms (target) - Text-to-speech: 30ms **Challenge**: 100ms for LLM inference requires aggressive optimization. **Solution**: | Technique | Latency Impact | |-----------|---------------| | 3B distilled model (from 7B) | 2x faster | | INT8 quantization | 1.5x faster | | Speculative decoding | 2x faster | | CUDA graphs | 1.2x faster | | Tensor parallelism (2 GPUs) | 1.8x faster | **Result**: - Time to first token: 25ms - Tokens per second: 80 - End-to-end latency: 180ms (under budget) **Key Insight**: Distillation provided the biggest win. A faster model beats a heavily optimized slow model. --- ## Common Pitfalls and How to Avoid Them ### Pitfall 1: Optimizing Without Profiling **The mistake**: "Attention is slow, so I'll implement custom attention kernels." **Reality**: Attention might be 10% of your runtime. Data loading might be 60%. **The fix**: Always profile first. Optimize the actual bottleneck. ### Pitfall 2: Forgetting Warmup **The mistake**: First inference takes 2 seconds, subsequent ones take 50ms. Reporting 2s latency. **Reality**: First call includes JIT compilation, CUDA context creation, memory allocation. **The fix**: Always run warmup iterations before benchmarking. ```python # Wrong start = time.time() output = model(input) print(f"Latency: {time.time() - start}") # Includes warmup! # Right for _ in range(5): # Warmup _ = model(input) torch.cuda.synchronize() times = [] for _ in range(100): # Benchmark start = time.perf_counter() _ = model(input) torch.cuda.synchronize() times.append(time.perf_counter() - start) print(f"Latency: {np.mean(times):.3f}s +/- {np.std(times):.3f}s") ``` ### Pitfall 3: Missing GPU Synchronization **The mistake**: Timing shows 1ms, but actual latency is 50ms. **Reality**: CUDA operations are asynchronous. The Python call returns immediately. **The fix**: Always `torch.cuda.synchronize()` before timing endpoints. ### Pitfall 4: Testing at Wrong Scale **The mistake**: Optimizations work great on 100 tokens, fail at 10,000 tokens. **Reality**: Algorithmic complexity differences only show at scale. **The fix**: Test with production-representative input sizes. ### Pitfall 5: Ignoring Memory Fragmentation **The mistake**: OOM with 40GB free, 60GB reserved. **Reality**: Memory fragmentation prevents allocation even with "free" memory. **The fix**: Monitor fragmentation, use paged memory, periodic cleanup. ```python def check_memory(): allocated = torch.cuda.memory_allocated() / 1e9 reserved = torch.cuda.memory_reserved() / 1e9 fragmentation = (reserved - allocated) / reserved if reserved > 0 else 0 print(f"Allocated: {allocated:.1f}GB, Reserved: {reserved:.1f}GB") print(f"Fragmentation: {fragmentation:.1%}") if fragmentation > 0.2: print("Warning: High fragmentation. Consider torch.cuda.empty_cache()") ``` ### Pitfall 6: Premature Optimization **The mistake**: Spending a week writing custom CUDA kernels before validating the product. **Reality**: The product might pivot, making the optimization worthless. **The fix**: Start with high-level optimizations (torch.compile, vLLM). Only write custom code when high-level tools aren't enough. --- ## Key Takeaways 1. **Memory bandwidth is the primary bottleneck** - Modern GPUs have abundant compute but limited memory bandwidth. For LLM inference at small batch sizes, you use less than 1% of available compute. 2. **Use the roofline model to identify bottlenecks** - Calculate arithmetic intensity (FLOPs per byte) to determine if you're compute-bound or memory-bound. Then optimize the actual constraint, not the assumed one. 3. **Flash Attention enables long contexts** - By computing attention in tiles that fit in fast SRAM, Flash Attention achieves O(N) memory instead of O(N^2) without approximating the computation. 4. **Quantization trades precision for throughput** - 4-bit quantization achieves 4x memory reduction with minimal quality loss for most applications. Quantize to INT8 or INT4 for production inference. 5. **Always profile before optimizing** - Intuition about bottlenecks is often wrong. Use torch.profiler, nsys, and systematic measurement to find the real bottleneck before investing in optimization. --- ## Summary Performance engineering for AI systems requires understanding the fundamental differences between AI workloads and traditional software: **Memory bandwidth is the bottleneck**. Modern GPUs have abundant compute but limited memory bandwidth. For LLM inference at small batch sizes, you're using less than 1% of available compute—all time is spent loading weights from memory. **The roofline model** provides a framework for reasoning about bottlenecks. Calculate arithmetic intensity (FLOPs per byte) to determine if you're compute-bound or memory-bound, then optimize accordingly. **Flash Attention** eliminates the quadratic memory bottleneck of attention, enabling long context windows by never materializing the full attention matrix. **Quantization** reduces memory traffic by representing weights with fewer bits. 4-bit quantization achieves 4x memory reduction with minimal quality loss for most applications. **The KV cache** is critical for efficient generation. Paged attention enables efficient memory management, supporting 2-4x more concurrent requests. **Speculative decoding** breaks the autoregressive barrier by using a small draft model to generate candidates that a large model verifies in parallel. **Always profile first**. Intuition about bottlenecks is often wrong. Use torch.profiler, nsys, and systematic measurement to find the real bottleneck before optimizing. **Understand the tradeoffs**. Latency and throughput often conflict. Cost and quality often conflict. The right configuration depends on your specific requirements. --- ## Practical Exercises ### Exercise 1: Roofline Analysis For a model you're working with: 1. Calculate parameter count and FLOPs per forward pass 2. Calculate memory traffic (bytes loaded) 3. Determine arithmetic intensity 4. Compare to your GPU's compute-to-bandwidth ratio 5. Are you compute-bound or memory-bound? ### Exercise 2: Profiling Deep Dive 1. Profile a model inference with torch.profiler 2. Export to Chrome trace and visualize the timeline 3. Identify the top 5 operations by time 4. Classify each operation: compute-bound, memory-bound, or launch-bound 5. Propose optimizations for each bottleneck ### Exercise 3: Quantization Impact 1. Measure baseline model performance (latency, throughput, quality) 2. Apply INT8 dynamic quantization and re-measure 3. Apply 4-bit GPTQ quantization and re-measure 4. Plot the Pareto frontier of quality vs. performance 5. Recommend the best configuration for interactive use ### Exercise 4: Batching Optimization 1. Measure single-request latency and throughput 2. Implement static batching with sizes 1, 2, 4, 8, 16, 32 3. Plot latency and throughput vs. batch size 4. Identify the optimal batch size for your latency SLO 5. Implement continuous batching and compare ### Exercise 5: End-to-End Optimization Starting with a baseline model: 1. Profile to identify the primary bottleneck 2. Apply the single most impactful optimization 3. Re-profile and identify the new bottleneck 4. Repeat until you achieve 3x improvement or reach diminishing returns 5. Document the optimization journey with metrics --- ## Self-Assessment Checkpoint ### Conceptual Questions **Q1. [Senior]** What does it mean for LLM inference to be "memory-bandwidth bound"? How does this differ from being "compute bound"? <details> <summary>Answer</summary> Memory-bandwidth bound: The GPU spends more time waiting for data to load from memory than doing computation. During LLM decode, we load the entire model (14GB for 7B FP16) to do a matrix-vector multiply for one token—massive data movement, minimal computation. The arithmetic intensity (FLOPs per byte loaded) is ~1-2, while GPUs can do ~600 FLOPs per byte loaded. Compute bound: The GPU is fully utilized doing math, and loading more data wouldn't help. This occurs during prefill (matrix-matrix multiply) where we process many tokens simultaneously, achieving high arithmetic intensity. Key implication: For decode phase, buying faster compute doesn't help—need faster memory or smarter use of memory (batching, quantization, caching). </details> **Q2. [Senior]** Why does batching improve throughput for LLM inference? What limits how large the batch can be? <details> <summary>Answer</summary> Batching amortizes weight loading: Instead of loading 14GB of weights to generate 1 token, load them once and generate B tokens for B batched requests. Throughput scales nearly linearly with batch size until limits hit. Limits: (1) Memory—KV cache grows with batch size × context length. Eventually GPU memory is exhausted. (2) Latency—larger batches mean each request waits longer (for its slot, for the batch to complete). SLA limits batch size. (3) Diminishing returns—at very large batches, compute becomes the bottleneck instead of memory bandwidth. (4) Request arrival—need enough concurrent requests to form batches. Trade-off: Higher batch = higher throughput but higher latency. Optimal batch size depends on latency SLO and memory available. </details> **Q3. [Staff]** Explain how FlashAttention achieves better performance without approximating attention. What's the key insight? <details> <summary>Answer</summary> FlashAttention is exact attention (same mathematical result) but with better memory access patterns. Key insight: Standard attention materializes the N×N attention matrix in GPU memory (for softmax), requiring O(N²) memory reads/writes. For long sequences, this is slow (memory-bound) and uses massive memory. FlashAttention uses tiling: computes attention in blocks that fit in fast SRAM (on-chip memory), never materializing the full N×N matrix in slow HBM. It's IO-aware—designed around the memory hierarchy rather than just minimizing FLOPs. Result: Same computation, but ~2-4x faster and O(N) memory instead of O(N²). This is why FlashAttention enabled long contexts (100K+ tokens) that were previously impractical. </details> **Q4. [Staff]** What's the roofline model and how do you use it to identify bottlenecks? <details> <summary>Answer</summary> Roofline plots achievable performance (FLOPs/s) against arithmetic intensity (FLOPs/byte). The "roof" has two parts: (1) Sloped section (memory-bound)—performance limited by memory bandwidth × arithmetic intensity. (2) Flat section (compute-bound)—performance limited by peak compute. To use: Calculate your kernel's arithmetic intensity. Plot it on the roofline. If you're in the sloped region, you're memory-bound—optimize memory access (batching, quantization, caching). If you're in the flat region, you're compute-bound—optimize computation (better algorithms, hardware utilization). For LLM inference: Decode is typically far in the memory-bound region. Prefill may be closer to the knee or compute-bound. This tells you which optimizations will help for each phase. </details> **Q5. [Staff]** How do you decide between INT8, INT4, and FP16 for model deployment? What are the trade-offs? <details> <summary>Answer</summary> Trade-offs: FP16: Full precision, no quality loss, 2 bytes/parameter. Baseline. INT8: ~4x faster memory bandwidth, usually <1% quality loss for well-quantized models. Good default for production. INT4: ~8x faster memory bandwidth, 2-5% quality loss typical, can be more for reasoning-heavy tasks. Best for resource-constrained or cost-sensitive deployments. Decision factors: (1) Quality requirements—measure on YOUR task, not benchmarks. Some tasks degrade more than others. (2) Hardware—INT4 needs kernels that support it efficiently (modern GPUs). (3) Model—some models quantize better than others. Always test. (4) Latency vs. throughput—INT4 helps throughput more than single-request latency. Process: Start with INT8 (usually safe). If quality acceptable and need more efficiency, try INT4. Always validate on task-specific evaluation set before deploying. </details> ### Spot the Problem **Problem 1. [Senior]** Optimization approach: ``` "Our model inference is slow. I'll rewrite the attention mechanism in custom CUDA to make it faster." ``` <details> <summary>Answer</summary> Problems: (1) Premature optimization—haven't profiled to confirm attention is the bottleneck. Might be optimizing the wrong thing. (2) Reinventing the wheel—FlashAttention already exists and is highly optimized by experts. Unlikely to beat it. (3) Maintenance burden—custom CUDA is hard to maintain, debug, and port to new hardware. (4) May break other optimizations—custom kernels might not integrate with torch.compile, quantization, etc. Correct approach: Profile first (torch.profiler, nsys). Use established optimizations (FlashAttention, torch.compile). Only write custom CUDA if profiling shows a specific bottleneck that existing tools don't address—and even then, consider contributing to existing libraries. </details> **Problem 2. [Staff]** Performance report: ``` "After applying all optimizations (FlashAttention, INT4 quantization, continuous batching), we achieved 100 tokens/second throughput." ``` What's missing? <details> <summary>Answer</summary> Missing context: (1) What model?—100 tok/s means different things for 7B vs. 70B. (2) What hardware?—100 tok/s on H100 might be poor; on consumer GPU might be excellent. (3) What batch size?—100 tok/s at batch=1 vs. batch=32 are very different. (4) What sequence lengths?—Long contexts are harder. (5) Latency metrics?—What's P50/P99 per-request latency? Throughput without latency is incomplete. (6) Quality impact?—Did INT4 quantization hurt accuracy? (7) Baseline comparison?—100 tok/s improvement from what? A complete performance report includes: model, hardware, batch size, sequence lengths, latency percentiles, throughput, quality metrics, and comparison to baseline. </details> **Problem 3. [Staff]** Deployment configuration: ``` # Using maximum batch size for best throughput config = { "max_batch_size": 64, "max_sequence_length": 32768, "dtype": "int4" } ``` <details> <summary>Answer</summary> Likely memory issues: Batch size 64 × 32K context × KV cache per token = enormous memory. Even INT4 weights won't save you if KV cache (typically FP16) explodes. Problems: (1) May OOM—calculate actual memory needed before setting config. (2) Latency impact—batch 64 means requests wait for huge batches. (3) 32K context for all requests?—Most requests probably don't need it. Wasting memory reservation. (4) INT4 doesn't shrink KV cache—just weights. Long context is still memory-hungry. Better: Profile actual request distribution (context lengths, arrival rate). Set max_sequence_length to actual needs. Use dynamic batching that adapts to load. Calculate memory budget: model weights + max_batch_size × max_context × KV_cache_per_token < GPU_memory. </details> ### Design Exercises **Exercise 1. [Staff]** You have an A100 80GB GPU and need to serve a 70B parameter model with sub-100ms time to first token for prompts up to 4K tokens. Design your deployment strategy, including quantization, parallelism, and batching decisions. <details> <summary>Guidance</summary> Constraints: (1) 70B model in FP16 = 140GB—doesn't fit in 80GB. Need quantization or tensor parallelism. (2) TTFT < 100ms for 4K tokens is aggressive—prefill of 4K tokens takes significant time. Analysis: (1) INT4 quantization: 70B × 0.5 bytes = 35GB weights. Fits with room for KV cache. (2) Single A100 with INT4 might achieve ~100ms TTFT for shorter prompts. 4K is pushing it. (3) Alternative: 2x A100 with tensor parallelism, FP16. Better quality, meets latency. Strategy options: (1) Single A100 + INT4 + aggressive optimization—risky for 4K, may work for shorter. (2) 2x A100 + tensor parallelism + INT8—more headroom, meets quality and latency. (3) Consider if 70B is necessary—would 13B meet quality needs at much lower cost? Batching: Limited—sub-100ms TTFT limits batch wait time. Likely batch=1-4 for real-time path. </details> **Exercise 2. [Staff]** Your LLM inference service is costing $500K/month. Leadership wants 30% cost reduction without impacting user experience. Design an optimization roadmap with prioritized initiatives and expected savings. <details> <summary>Guidance</summary> First, understand cost breakdown: (1) GPU hours by instance type. (2) Utilization patterns (peak vs. off-peak). (3) Request distribution (simple vs. complex). Optimization roadmap: (1) Caching—identify repeat/similar queries. Even 20% cache hit rate at near-zero marginal cost is significant. (2) Model routing—route simple queries to smaller/cheaper models. Requires quality validation but can save 40%+ on routed queries. (3) Quantization—if not already INT8, significant savings. Validate quality first. (4) Batching optimization—improve utilization through better batching strategies. (5) Reserved capacity—convert on-demand to reserved/spot for predictable baseline load. 30-50% savings. (6) Off-peak processing—batch non-urgent requests to off-peak hours with cheaper capacity. Prioritize by: Effort vs. impact. Caching and routing often highest ROI. Always validate quality metrics before and after. </details> --- ### Connections to Other Chapters Performance engineering intersects with deployment, cost, and system design. Here's how this chapter connects: - **Chapter 9 (LLM Deployment & Infrastructure)**: Core deployment architecture, batching strategies, and KV cache optimization. This chapter goes deeper into the performance theory behind those patterns. - **Chapter 25 (System Design at Scale)**: Distributed systems patterns and architecture for high-throughput AI systems. Performance engineering at the component level enables system-level scalability. - **Chapter 32 (Cost Engineering)**: How performance optimizations translate to cost savings and ROI analysis. Faster inference means lower GPU costs. - **Tool & Framework Reference** (Appendix B): Inference engines like vLLM, TGI, and TensorRT-LLM covered in this chapter. --- ## Further Reading ### Essential - **Dao et al. (2022), "FlashAttention"** - The paper that made long contexts practical. Critical for understanding modern attention. - **Kwon et al. (2023), "vLLM/PagedAttention"** - Efficient memory management for LLM serving. - **Horace He, "Making Deep Learning Go Brrrr From First Principles"** - Excellent practical GPU performance introduction. ### Deep Dives - **Frantar et al. (2022), "GPTQ"** - Post-training quantization for transformers. Key for model compression. - **Leviathan et al. (2022), "Speculative Decoding"** - Using small models to accelerate large model inference. - **Williams et al. (2009), "Roofline Model"** - Visual performance model for understanding compute bounds.