Chapter 27: Performance Engineering
performance, profiling, optimization, GPU utilization, latency, throughput, bottleneck analysis, CUDA
Introduction
In early 2023, a startup building an AI coding assistant faced a crisis. Their prototype worked beautifully in demos—the model understood context, generated accurate code, and responded naturally to follow-up questions. But when they tried to scale beyond a hundred users, the economics collapsed. Each GPU could serve only 8 concurrent users at acceptable latency. At $3 per GPU-hour, that translated to $0.375 per user-hour—roughly $270 per user per month just in inference costs. For a product targeting $20/month subscriptions, they were losing money on every customer.
The team spent three months on performance engineering. They implemented continuous batching, added speculative decoding with a small draft model, optimized their KV cache with paged attention, and carefully profiled every component. By the end, the same GPU served 64 concurrent users—an 8x improvement. Their unit economics went from catastrophic to viable.
This story illustrates a fundamental truth about AI systems: performance isn’t just about speed—it’s about feasibility. A model that costs $1 per query can’t power a consumer product. An inference pipeline with 5-second latency can’t enable real-time applications. Understanding performance engineering transforms “this would be nice” into “this is possible.”
Why AI Performance is Different
Traditional software performance optimization focuses on algorithmic complexity, database queries, and network latency. AI workloads introduce fundamentally different challenges:
Memory bandwidth, not compute, is often the bottleneck. Modern GPUs have enormous compute capacity—the H100 can perform nearly 2,000 trillion floating-point operations per second. But loading the 140GB of weights for a 70B parameter model at FP16 precision takes 47 milliseconds at the H100’s peak 3TB/s memory bandwidth. For autoregressive generation, where we must load the entire model for each token, this creates a hard floor on latency that adding more compute cannot break.
The quadratic attention bottleneck. Self-attention—the mechanism that gives transformers their power—requires computing interactions between all pairs of tokens. For a 128K context window, that’s 16 billion attention computations per layer. Naive implementations run out of memory long before running out of time.
Batching dynamics are complex. Unlike web servers where requests are largely independent, AI inference can batch requests to amortize fixed costs—but different requests may need different numbers of output tokens. A batch is limited by its slowest request, creating intricate scheduling challenges.
The latency-throughput tradeoff is acute. Users expect sub-second responses. Providers want to maximize GPU utilization. These goals conflict: high utilization requires large batches, but large batches increase latency for individual requests.
The Core Insight: Roofline Thinking
To reason about AI performance, you need a mental model for what limits computation. The roofline model, introduced by Williams et al. (2009), provides exactly this framework.
Every computation has two potential bottlenecks:
- Compute bound: Limited by how fast we can perform arithmetic operations
- Memory bound: Limited by how fast we can load data from memory
The ratio between these determines which bottleneck dominates:
\[\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Loaded}}\]
If your arithmetic intensity is high (many operations per byte loaded), you’re compute-bound. If it’s low (few operations per byte loaded), you’re memory-bound.
Why LLM inference is memory-bound: For a single token prediction with a 7B parameter model:
- FLOPs: ~14 billion (roughly 2 * parameter count for forward pass)
- Bytes loaded: ~14 billion (loading all parameters once at FP16)
- Arithmetic intensity: ~1 FLOP/byte
Modern GPUs achieve 100+ FLOPs per byte of memory bandwidth. At 1 FLOP/byte arithmetic intensity, we’re using less than 1% of the GPU’s compute capability. We’re completely memory-bound.
This insight transforms how we think about optimization. Adding more compute power won’t help a memory-bound workload. Instead, we must either: 1. Reduce memory traffic (quantization, caching) 2. Increase arithmetic intensity (batching, operator fusion) 3. Use memory more efficiently (Flash Attention, paged KV cache)
What people do: Inference is slow, so upgrade from A100 to H100 expecting a proportional speedup. Or add more GPUs expecting near-linear scaling.
Why it fails: LLM inference at small batch sizes is memory-bandwidth bound, not compute-bound. The H100 has 3x the compute of A100 but only 2x the memory bandwidth. For batch-size-1 inference, you get ~2x speedup, not 3x. Adding more GPUs often makes things worse due to communication overhead—you’re splitting a memory-bound workload across more devices that now must synchronize.
Fix: First identify your bottleneck using profiling. For memory-bound workloads: quantize to reduce bytes loaded, increase batch size to amortize weight loading, or use speculative decoding. Only upgrade hardware after you’ve exhausted algorithmic optimizations—often you can get 4-8x improvements without buying anything.
What You’ll Learn
- How GPU architecture determines what optimizations are possible
- Why LLMs are memory-bound and how to reason about memory hierarchy
- Flash Attention: the mathematics and why it works
- Quantization theory: trading precision for performance
- Profiling methodology: finding the real bottleneck
- Speculative decoding: breaking the autoregressive barrier
- Production optimization: batching, caching, and cost engineering
Prerequisites
- Understanding of transformer architecture (Chapter 5)
- Familiarity with PyTorch or similar frameworks
- Basic linear algebra (matrix multiplication, attention computation)
- Helpful but not required: computer architecture concepts
GPU Architecture for AI Engineers
The Parallel Processing Paradigm
To optimize for GPUs, you must understand how they differ from CPUs.
A CPU is designed for latency: execute a single thread of instructions as fast as possible, with sophisticated branch prediction, out-of-order execution, and large caches to hide memory latency. A high-end CPU might have 64 cores, each executing 1-2 threads.
A GPU is designed for throughput: execute thousands of threads simultaneously, hiding memory latency through massive parallelism rather than caching. An H100 has 132 Streaming Multiprocessors (SMs), each capable of executing thousands of threads concurrently.
CPU Execution Model:
Thread 1: ████████████████████████████████████████████████████████████
Thread 2: ████████████████████████████████████████████████████████████
[Few threads, each runs fast with sophisticated optimization]
GPU Execution Model:
Thread 1: ██░░░░██░░░░██░░░░██░░░░██░░░░ (waiting for memory)
Thread 2: ░░██░░░░██░░░░██░░░░██░░░░██░░
Thread 3: ░░░░██░░░░██░░░░██░░░░██░░░░██
Thread 1024: ██░░░░██░░░░██░░░░██░░░░██░░░░
[Many threads, each slow, but overlapped to hide latency]
The key insight: GPUs hide memory latency through parallelism, not caching. While one group of threads waits for memory, others execute. This only works if you have enough threads to keep the execution units busy—a concept called occupancy.
The Memory Hierarchy
Understanding the GPU memory hierarchy is essential for optimization:
| Memory Level | Size (H100) | Bandwidth | Latency | Scope |
|---|---|---|---|---|
| Registers | 256KB per SM | ~10 TB/s | 0 cycles | Per thread |
| Shared Memory / L1 | 228KB per SM | ~19 TB/s | ~20 cycles | Per thread block |
| L2 Cache | 50MB | ~6 TB/s | ~200 cycles | Global |
| HBM3 (VRAM) | 80GB | 3.35 TB/s | ~400 cycles | Global |
| CPU RAM (via PCIe) | System | ~64 GB/s | ~10,000 cycles | System |
The bandwidth gap between levels is staggering. Moving from HBM to shared memory is a 6x bandwidth improvement. Moving from HBM to registers is 3x more. The single most important optimization principle is keeping data close to computation.
SIMT Execution and Warps
GPUs execute instructions in lockstep groups of 32 threads called warps (NVIDIA terminology) or wavefronts (AMD). All 32 threads in a warp execute the same instruction simultaneously—a model called SIMT (Single Instruction, Multiple Thread).
This has profound implications:
Warp divergence: If threads in a warp take different branches, both branches must execute serially:
# This code causes warp divergence
if thread_id % 2 == 0:
do_even_work() # Half the warp executes
else:
do_odd_work() # Other half executes (serialized!)Memory coalescing: Adjacent threads should access adjacent memory addresses. The memory system fetches data in 128-byte chunks. If 32 threads each request one float (4 bytes) from adjacent addresses, one fetch serves all of them. If they request scattered addresses, you might need 32 separate fetches—a 32x slowdown.
# Coalesced access (fast)
data[thread_id] # Threads 0-31 access addresses 0-31
# Strided access (slow)
data[thread_id * stride] # Threads access scattered addresses
# Random access (very slow)
data[random_indices[thread_id]] # Even worseKernel Launch Overhead
Each GPU operation (kernel) has launch overhead: 5-10 microseconds to dispatch work from the CPU to the GPU. This seems tiny, but for a model with hundreds of small operations, it adds up. A forward pass through a transformer layer might involve:
- 4 linear projections (Q, K, V, O)
- Attention computation
- Softmax
- 2 FFN linear layers
- 2 layer normalizations
- Various reshapes and transposes
That’s easily 15+ kernel launches per layer. For a 70-layer model, that’s 1,000+ kernels, potentially 5-10ms of pure launch overhead.
Kernel fusion combines multiple operations into a single kernel, eliminating these round-trips. This is why Flash Attention, which fuses the entire attention computation into one kernel, provides such dramatic speedups.
Memory-Bound Inference: The Fundamental Challenge
Why LLM Inference is Memory-Bound
Let’s derive why large language model inference is memory-bound from first principles.
For a transformer with:
- \(P\) = number of parameters
- \(d\) = hidden dimension
- \(n\) = sequence length
- \(b\) = batch size
Forward pass computation (simplified, for attention + FFN): \[\text{FLOPs} \approx 2 \cdot P \cdot b\]
Memory traffic (loading all parameters): \[\text{Bytes} = P \cdot \text{bytes\_per\_param}\]
Arithmetic intensity: \[\text{AI} = \frac{2 \cdot P \cdot b}{P \cdot \text{bytes\_per\_param}} = \frac{2 \cdot b}{\text{bytes\_per\_param}}\]
For FP16 (2 bytes per parameter): \[\text{AI} = b \text{ FLOPs/byte}\]
The H100’s compute-to-bandwidth ratio is approximately: \[\frac{990 \text{ TFLOPS}}{3.35 \text{ TB/s}} \approx 295 \text{ FLOPs/byte}\]
To be compute-bound, we need arithmetic intensity of at least 295. That means batch size must be at least 295 to fully utilize the H100’s compute capability!
For interactive inference (batch size 1), we’re at 1/295 = 0.3% compute utilization. Almost all time is spent waiting for memory.
The Memory Bandwidth Calculation
This analysis gives us a powerful tool for performance prediction. For a 70B parameter model at FP16:
\[\text{Minimum time per token} = \frac{70\text{B params} \times 2 \text{ bytes}}{3.35 \text{ TB/s}} = 42\text{ms}\]
That’s the theoretical minimum—loading every parameter once at peak bandwidth. In practice, we also have:
- KV cache reads and writes
- Attention computation
- Activation transfers
- Memory access patterns that don’t achieve peak bandwidth
Real-world single-request latency is typically 2-3x this theoretical minimum.
Implications for Optimization
This analysis reveals which optimizations can help:
| Optimization | How It Helps | Expected Impact |
|---|---|---|
| Quantization (FP16 → INT8) | 2x fewer bytes to load | ~2x faster |
| Quantization (FP16 → INT4) | 4x fewer bytes to load | ~4x faster (with quality tradeoff) |
| Batching | Amortizes weight loading | Linear throughput improvement |
| KV caching | Avoids recomputing past | Essential for practical generation |
| Flash Attention | Reduces memory traffic | Enables long contexts |
And which cannot help:
- Adding more GPU compute: Already underutilized
- Faster GPU clock speeds: Memory-bound, not compute-bound
- More CUDA cores: Same limitation
Flash Attention: A Deep Dive
The Attention Bottleneck
Standard attention computes: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]
For sequence length \(n\) and head dimension \(d\):
- \(QK^T\) produces an \(n \times n\) matrix
- This matrix must be stored for the backward pass (training) or for the softmax (inference)
- Memory: \(O(n^2)\) for the attention matrix
For \(n = 128,000\) tokens: \[\text{Attention matrix size} = 128,000^2 \times 2 \text{ bytes} = 32 \text{ GB per layer per head}\]
This is clearly infeasible. Even for modest sequence lengths, the attention matrix becomes the dominant memory consumer.
The Flash Attention Insight
Tri Dao’s Flash Attention (2022) solves this with a key insight: we never need to materialize the full attention matrix. We can compute attention in tiles, maintaining only partial results in fast SRAM (shared memory), never writing the full \(n \times n\) matrix to slow HBM.
The algorithm: 1. Split Q, K, V into blocks that fit in SRAM 2. For each block of Q, iterate over blocks of K, V 3. Compute block attention scores in SRAM 4. Use online softmax (computing softmax incrementally) to combine blocks 5. Write only the final output to HBM
The mathematics of online softmax:
For standard softmax: \[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
The problem: we need the full sum before computing any output.
Online softmax maintains running statistics: \[m_{\text{new}} = \max(m_{\text{old}}, m_{\text{block}})\] \[\ell_{\text{new}} = \ell_{\text{old}} \cdot e^{m_{\text{old}} - m_{\text{new}}} + \ell_{\text{block}} \cdot e^{m_{\text{block}} - m_{\text{new}}}\]
This allows computing exact softmax while processing data in blocks, never needing to store the full attention matrix.
Flash Attention Performance Analysis
| Aspect | Standard Attention | Flash Attention |
|---|---|---|
| Memory | \(O(n^2)\) | \(O(n)\) |
| HBM reads | \(O(n^2 d)\) | \(O(n^2 d^2 / M)\) |
| HBM writes | \(O(n^2 + nd)\) | \(O(nd)\) |
| Kernel launches | 3+ (matmul, softmax, matmul) | 1 (fused) |
Where \(M\) is SRAM size. For typical values (\(d = 128\), \(M = 100\)KB), Flash Attention reduces HBM traffic by 5-10x.
Flash Attention 2 and 3
Flash Attention 2 (2023) improved on the original with:
- Better work partitioning across thread blocks
- Reduced non-matmul FLOPs
- ~2x speedup over Flash Attention 1
Flash Attention 3 (2024) further optimized for Hopper architecture:
- Exploits asynchronous operations
- Uses FP8 tensor cores when available
- Achieves 75%+ of theoretical maximum FLOPS
# Using Flash Attention is now trivial in modern frameworks
import torch.nn.functional as F
# PyTorch 2.0+ automatically uses Flash Attention when possible
output = F.scaled_dot_product_attention(query, key, value)
# HuggingFace models: specify implementation
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-4-Scout-17B",
attn_implementation="flash_attention_2"
)Full implementation: See reference/21_performance_engineering_code.md for benchmarking code comparing attention implementations.
Quantization: Theory and Practice
Why Quantization Works
Neural network weights are stored as floating-point numbers, typically FP32 (32 bits) or FP16 (16 bits). Quantization represents these weights with fewer bits—INT8 (8 bits) or INT4 (4 bits).
Why does this work? Neural networks are remarkably robust to noise. The weights exist in a continuous space where small perturbations don’t significantly change behavior. Quantization introduces small errors, but these errors are bounded and the network’s overall function is preserved.
Mathematically, for a weight matrix \(W\) and quantized approximation \(\hat{W}\): \[\hat{W} = s \cdot \text{round}\left(\frac{W}{s}\right)\]
Where \(s\) is a scaling factor. The error \(W - \hat{W}\) is bounded by \(s/2\).
Quantization Schemes
Per-tensor quantization: One scale factor for the entire tensor. - Simplest implementation - May lose precision if weights have different magnitudes
Per-channel quantization: One scale factor per output channel. - Better accuracy than per-tensor - Standard for INT8 quantization
Per-group quantization: One scale factor per group of weights (e.g., 128 weights). - Used by GPTQ, AWQ for 4-bit quantization - Good accuracy-efficiency tradeoff
Mixed precision: Different layers quantized to different precisions. - Sensitive layers (embeddings, attention) kept at higher precision - Less sensitive layers (FFN) more aggressively quantized
GPTQ: Post-Training Quantization for LLMs
GPTQ (Frantar et al., 2022) introduced an efficient method for post-training quantization of LLMs:
- Layer-wise quantization: Process one layer at a time
- Hessian-based error compensation: Use second-order information to minimize quantization error
- Lazy batch updates: Efficient implementation that scales to large models
The key insight: when quantizing weight \(w_i\), we can adjust remaining weights \(w_{i+1:}\) to compensate for the error introduced. GPTQ computes optimal adjustments using approximate Hessian information from calibration data.
# GPTQ quantization with auto-gptq library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
quantize_config = BaseQuantizeConfig(
bits=4, # 4-bit quantization
group_size=128, # Per-group scaling
desc_act=False, # Activation order
damp_percent=0.1 # Dampening for numerical stability
)
model = AutoGPTQForCausalLM.from_pretrained(
"meta-llama/Llama-4-Scout-17B",
quantize_config=quantize_config
)
model.quantize(calibration_data)AWQ: Activation-Aware Quantization
AWQ (Lin et al., 2023) improves on GPTQ by focusing on salient weights—weights that, when activated by typical inputs, have outsized impact on outputs.
Key insight: 1% of weights contribute to 50%+ of the output. These salient weights should be preserved more carefully.
AWQ: 1. Profiles activations on calibration data 2. Identifies salient weights based on activation magnitudes 3. Scales salient weights to protect them from quantization error 4. Applies standard quantization to the scaled weights
Results: AWQ achieves better accuracy than GPTQ at the same bit width, with similar inference speed.
Quantization Impact Analysis
| Method | Bits | Memory Reduction | Speed Improvement | Quality Impact |
|---|---|---|---|---|
| FP16 baseline | 16 | 1x | 1x | Baseline |
| INT8 dynamic | 8 | 2x | 1.5-2x | <1% degradation |
| GPTQ 4-bit | 4 | 4x | 2-3x | 1-2% degradation |
| AWQ 4-bit | 4 | 4x | 2-3x | <1% degradation |
| GGUF Q4_K_M | 4 | 4x | 2-3x | <1% degradation |
| 2-bit | 2 | 8x | 3-4x | Significant degradation |
The sweet spot for most applications is 4-bit quantization with per-group scaling (GPTQ or AWQ). This achieves 4x memory reduction with minimal quality loss.
Full implementation: See reference/21_performance_engineering_code.md for complete quantization code with benchmarking.
The KV Cache: Memory Management for Generation
Why KV Caching is Essential
Autoregressive generation produces tokens one at a time, each conditioned on all previous tokens. Without caching, generating the \(n\)-th token requires recomputing attention over all \(n-1\) previous tokens—an \(O(n^2)\) total cost for generating \(n\) tokens.
KV caching stores the key and value projections from previous tokens:
- First token: Compute K, V for position 0, store in cache
- Second token: Compute K, V for position 1, append to cache; attend to positions 0-1
- \(n\)-th token: Compute K, V for position \(n-1\), append to cache; attend to positions 0 through \(n-1\)
This reduces generation from \(O(n^2)\) to \(O(n)\)—essential for any practical system.
KV Cache Memory Requirements
For a model with:
- \(L\) layers
- \(h_{kv}\) key/value heads (equals query heads \(h\) for MHA; smaller for GQA/MQA)
- \(d_h\) head dimension (typically \(d_{model}/h\), where \(h\) is the number of query heads)
- \(n\) sequence length
- \(b\) batch size
KV cache size: \[\text{KV cache} = 2 \cdot L \cdot h_{kv} \cdot d_h \cdot n \cdot b \cdot \text{bytes\_per\_value}\]
The leading factor of 2 accounts for storing both keys and values. The key correction for modern models is that we use \(h_{kv}\) (KV heads), not \(h\) (query heads)—a frequent source of overestimated KV cache budgets.
For Llama 2 70B (\(L=80\), \(h_{kv}=8\) GQA heads, \(d_h=128\)) at 8K context with FP16: \[\text{KV cache} = 2 \cdot 80 \cdot 8 \cdot 128 \cdot 8192 \cdot 2 = 2.7 \text{ GB per request}\]
Note: Llama 2 70B uses Grouped-Query Attention (GQA) with 8 KV heads (not all 64 query heads), which reduces KV cache 8x compared to standard MHA. Without GQA, this would be 21.5 GB per request—illustrating why GQA is essential for long-context serving.
This creates the memory wall for long contexts: even with Flash Attention enabling the computation, we may not have memory for the KV cache.
What people do: Calculate GPU memory requirements based only on model weights. “70B at FP16 is 140GB, so it fits on 2x H100 80GB with room to spare.”
Why it fails: KV cache grows with sequence length and concurrent requests. At 8K context, Llama 2 70B needs ~2.7GB of KV cache per request (with GQA). With 32 concurrent requests at 8K context, that’s ~86GB just for KV cache—comparable to the model weights. Your “fits with room to spare” calculation may now be OOM.
Fix: Always include KV cache in memory planning: total_memory = model_weights + (kv_cache_per_token × max_tokens × max_concurrent_requests). Use GQA models when possible (8x smaller KV cache). Implement paged attention to avoid over-allocation. Set conservative concurrent request limits and monitor memory utilization in production.
Paged Attention (vLLM)
Traditional KV cache pre-allocates a contiguous block of memory for the maximum sequence length. This is wasteful:
- Most sequences are shorter than max length
- Memory fragmentation prevents fitting more requests
- Early requests hog memory even after completing
PagedAttention (Kwon et al., 2023, introduced in vLLM) applies virtual memory concepts:
- Divide KV cache into fixed-size blocks (pages)
- Allocate pages on demand as sequences grow
- Track page mappings in a page table
- When sequences complete, return pages to free pool
# Conceptual PagedAttention implementation
class PagedKVCache:
def __init__(self, num_blocks, block_size, num_layers, num_heads, head_dim):
# Pre-allocate block pool
self.key_blocks = torch.zeros(num_blocks, num_layers, block_size, num_heads, head_dim)
self.value_blocks = torch.zeros(num_blocks, num_layers, block_size, num_heads, head_dim)
self.free_blocks = list(range(num_blocks))
self.sequence_block_tables = {} # seq_id -> list of block indices
def allocate(self, seq_id):
"""Allocate a new block for a sequence."""
if not self.free_blocks:
raise RuntimeError("Out of KV cache memory")
block_idx = self.free_blocks.pop()
if seq_id not in self.sequence_block_tables:
self.sequence_block_tables[seq_id] = []
self.sequence_block_tables[seq_id].append(block_idx)
return block_idx
def free(self, seq_id):
"""Return all blocks for a sequence to the free pool."""
blocks = self.sequence_block_tables.pop(seq_id, [])
self.free_blocks.extend(blocks)PagedAttention enables 2-4x more concurrent requests through efficient memory utilization.
Multi-Query and Grouped-Query Attention
Another approach: reduce the KV cache size by sharing key-value heads.
Multi-Query Attention (MQA): All query heads share one K and one V. \[\text{KV cache reduction: } h \times \text{ smaller}\]
Grouped-Query Attention (GQA): Groups of query heads share K, V. - Llama 2 70B uses 8 KV heads for 64 query heads (8x reduction) - Good tradeoff between quality and memory
| Architecture | Query Heads | KV Heads | KV Cache Reduction |
|---|---|---|---|
| MHA (standard) | 64 | 64 | 1x |
| GQA (Llama 2 70B) | 64 | 8 | 8x |
| MQA | 64 | 1 | 64x |
Full implementation: See reference/21_performance_engineering_code.md for complete KV cache code with paged attention.
Speculative Decoding: Breaking the Sequential Barrier
The Autoregressive Bottleneck
Autoregressive generation is inherently sequential: each token depends on all previous tokens. With batch size 1, we can only generate one token at a time, leaving most GPU compute idle.
For a 70B model generating 100 tokens:
- 100 sequential forward passes
- Each pass: ~50ms (memory-bound)
- Total: ~5 seconds
- GPU compute utilization: <1%
The Speculative Decoding Insight
Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) introduces parallelism through speculation:
- Use a small, fast draft model to generate \(\gamma\) tokens quickly
- Send all \(\gamma\) tokens to the large target model in one batch
- Target model verifies draft tokens in parallel
- Accept verified tokens; resample if rejected
Key insight: verification is parallel, but generation is sequential. The target model can check all \(\gamma\) draft tokens simultaneously—work that would otherwise require \(\gamma\) sequential passes.
The Mathematics of Speculative Decoding
Let \(p(x)\) be the target model distribution and \(q(x)\) be the draft model distribution.
For each draft token \(x_i\):
- If \(p(x_i) \geq q(x_i)\): always accept (target is at least as likely)
- If \(p(x_i) < q(x_i)\): accept with probability \(p(x_i)/q(x_i)\)
If rejected, sample from an adjusted distribution: \[p'(x) = \frac{\max(0, p(x) - q(x))}{\sum_x \max(0, p(x) - q(x))}\]
Theorem: This procedure produces samples from exactly the target distribution \(p(x)\).
This is remarkable: we get speedup without any approximation to the output distribution.
Expected Speedup
The speedup depends on:
- \(\gamma\): number of speculative tokens
- \(\alpha\): acceptance rate (how well draft matches target)
- \(c\): relative cost of draft vs target inference
Expected tokens per target model call: \[E[\text{tokens}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}\]
For \(\alpha = 0.8\) (80% acceptance) and \(\gamma = 4\): \[E[\text{tokens}] = \frac{1 - 0.8^5}{1 - 0.8} = \frac{1 - 0.328}{0.2} = 3.36\]
If the draft model adds negligible overhead, this is a 3.36x speedup!
Implementation Considerations
Draft model selection:
- Same architecture family (shares vocabulary, tokenizer)
- 10-20x smaller than target (e.g., 68M draft for 7B target)
- Trained on similar data for high acceptance rate
Optimal speculation length:
- Too few: Underutilizes parallelism
- Too many: Wastes compute on likely-rejected tokens
- Sweet spot: 4-8 tokens for most models
class SpeculativeDecoder:
def __init__(self, target_model, draft_model, gamma=4):
self.target = target_model
self.draft = draft_model
self.gamma = gamma
def generate_step(self, input_ids):
# 1. Draft model generates gamma tokens
draft_tokens = []
draft_probs = []
context = input_ids
for _ in range(self.gamma):
logits = self.draft(context)
probs = F.softmax(logits[:, -1], dim=-1)
token = torch.multinomial(probs, 1)
draft_tokens.append(token)
draft_probs.append(probs[0, token].item())
context = torch.cat([context, token], dim=1)
# 2. Target verifies all at once
all_draft = torch.stack(draft_tokens, dim=1)
target_logits = self.target(torch.cat([input_ids, all_draft], dim=1))
target_probs = F.softmax(target_logits[:, -self.gamma-1:-1], dim=-1)
# 3. Accept/reject each token
accepted = []
for i, (token, q, p_dist) in enumerate(zip(draft_tokens, draft_probs, target_probs)):
p = p_dist[0, token].item()
if random.random() < min(1, p / q):
accepted.append(token)
else:
# Sample from adjusted distribution
adjusted = torch.clamp(p_dist - draft_probs_dist, min=0)
adjusted = adjusted / adjusted.sum()
accepted.append(torch.multinomial(adjusted, 1))
break
return torch.cat(accepted, dim=1)Full implementation: See reference/21_performance_engineering_code.md for production-ready speculative decoding.
Profiling: Finding the Real Bottleneck
“On my second week as TL I was bored in a Grafana dashboard and noticed our embeddings service had a flat 90ms floor on every request, even tiny ones. Nobody had ever questioned it. I traced it to a synchronous DNS lookup we were doing per call because somebody had disabled the resolver cache in 2022 to debug a different issue and never re-enabled it. One line change. Service p50 dropped from 110ms to 18ms. Eight months of cumulative latency burned because the original ‘fix’ became invisible the moment the incident closed. Now I do a ‘weird flat line’ tour of dashboards once a quarter. Cheap wins live in things everyone has stopped looking at.”
— Performance Engineer at a real-time messaging company
The Performance Engineering Mindset
The first rule of optimization: measure before you optimize. Intuition about performance is often wrong. Engineers routinely spend weeks optimizing the wrong component because they didn’t profile first.
The Performance Engineering Loop:
1. DEFINE goals (what latency/throughput do we need?)
↓
2. MEASURE current performance (profile systematically)
↓
3. ANALYZE results (identify the bottleneck)
↓
4. HYPOTHESIZE (what optimization might help?)
↓
5. IMPLEMENT (make the change)
↓
6. VALIDATE (measure again, compare)
↓
[Repeat until goals met]
Profiling Tools for AI Workloads
Different tools reveal different aspects of performance:
| Tool | What It Measures | When to Use | Overhead |
|---|---|---|---|
torch.profiler |
PyTorch ops, GPU kernels | First-line profiling | Low |
py-spy |
Python call stacks | CPU bottlenecks | Very low |
nsys (Nsight Systems) |
GPU timeline, CPU-GPU sync | GPU pipeline analysis | Low |
ncu (Nsight Compute) |
Detailed kernel metrics | Deep kernel optimization | High |
nvidia-smi |
GPU utilization, memory | Quick health check | None |
PyTorch Profiler Deep Dive
import torch
from torch.profiler import profile, record_function, ProfilerActivity
def profile_model(model, input_data, warmup=3, active=5):
"""Profile model inference with proper warmup."""
# Warmup runs (crucial for accurate measurement)
for _ in range(warmup):
with torch.no_grad():
_ = model(input_data)
torch.cuda.synchronize()
# Profile with GPU activities, memory, and stack traces
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
for _ in range(active):
with record_function("model_inference"):
with torch.no_grad():
_ = model(input_data)
# Analyze results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
# Export for detailed visualization
prof.export_chrome_trace("trace.json")
return profInterpreting Profiler Output
A typical PyTorch profiler output:
--------------------------------- ------- ------- ------- -------
Name CPU Time CUDA Time Calls CPU Mem
--------------------------------- ------- ------- ------- -------
aten::linear 1.2ms 15.3ms 96 0 b
aten::matmul 0.8ms 12.1ms 48 0 b
aten::_softmax 0.3ms 2.4ms 48 0 b
aten::layer_norm 0.4ms 1.8ms 48 0 b
aten::copy_ 0.6ms 1.2ms 128 0 b
Key insights:
- CUDA time >> CPU time: GPU-bound (good, GPU is doing work)
- CPU time >> CUDA time: CPU-bound (investigate Python overhead)
- Many calls with small time each: Kernel launch overhead
- Large memory in specific ops: Memory bottleneck candidates
Common Bottleneck Patterns
Pattern 1: CPU-Bound Preprocessing
Symptom: Low GPU utilization, high CPU time
Profile: Data loading, tokenization dominate
Solution: Async data loading, batch preprocessing, compiled tokenizers
Pattern 2: Kernel Launch Overhead
Symptom: Many small CUDA calls, GPU gaps in timeline
Profile: Hundreds of ops with <0.1ms each
Solution: torch.compile, CUDA graphs, operator fusion
Pattern 3: Memory Bandwidth Bound
Symptom: High GPU utilization, but lower throughput than expected
Profile: Large tensors, many memory copies
Solution: Quantization, better data layout, reduce memory traffic
Pattern 4: CPU-GPU Synchronization
Symptom: GPU stalls, CPU waiting
Profile: Explicit syncs, CPU operations on GPU tensors
Solution: Async operations, keep computation on GPU
Full implementation: See reference/21_performance_engineering_code.md for complete profiling framework with analysis.
What people do: Assume the model inference is slow and spend weeks implementing CUDA optimizations, only to discover that 60% of latency was actually in Python preprocessing or database queries.
Why it fails: Intuition about performance is notoriously unreliable. The bottleneck is rarely where you expect. Without measurement, you might 10x the speed of something that contributes 5% to total latency—a 0.5% overall improvement after weeks of work.
Fix: Always profile before optimizing. Use torch.profiler for GPU kernels, py-spy for Python overhead, and end-to-end tracing for system-level bottlenecks. Identify which component dominates latency (preprocessing? inference? postprocessing? network?). Optimize the actual bottleneck. Re-profile after each change to validate impact.
Production Optimization Techniques
Continuous Batching
Traditional batching waits for a batch to fill, processes all requests, then returns results. This is inefficient because:
- Short requests wait for long requests to complete
- New requests wait for the entire batch to finish
- GPU may be idle while waiting for requests
Continuous batching (iteration-level batching) processes requests at the token level:
class ContinuousBatcher:
def __init__(self, model, max_batch_size, max_waiting_time_ms=50):
self.model = model
self.max_batch_size = max_batch_size
self.max_waiting_time = max_waiting_time_ms
self.active_requests = []
self.waiting_queue = []
async def run_loop(self):
"""Main loop: process one token for all active requests."""
while True:
# Add waiting requests (up to capacity)
self._fill_batch()
if not self.active_requests:
await asyncio.sleep(0.001)
continue
# Generate one token for all active requests
await self._step()
# Return completed requests
self._retire_completed()
async def _step(self):
"""Generate one token for all active requests in batch."""
# Batch all active requests
input_ids = self._prepare_batch()
# Single forward pass for all requests
logits = self.model(input_ids)
# Sample next token for each request
for i, request in enumerate(self.active_requests):
next_token = sample(logits[i])
request.append_token(next_token)
def _retire_completed(self):
"""Return completed requests, freeing batch slots."""
completed = [r for r in self.active_requests if r.is_done()]
for request in completed:
self.active_requests.remove(request)
request.future.set_result(request.output)Continuous batching enables:
- Higher throughput: GPU always has work
- Lower latency: Short requests don’t wait for long ones
- Better utilization: Batch stays full as requests come and go
torch.compile and TorchInductor
PyTorch 2.0’s torch.compile applies automatic optimizations:
import torch
# Basic compilation
model = torch.compile(model)
# With optimization mode
model = torch.compile(model, mode="reduce-overhead") # CUDA graphs
model = torch.compile(model, mode="max-autotune") # Best perf, slow compile
# With dynamic shapes
model = torch.compile(model, dynamic=True) # Handle varying input sizesWhat torch.compile does: 1. Traces the model to capture computation graph 2. Fuses operations to reduce kernel launches 3. Generates optimized kernels via TorchInductor 4. Optionally uses CUDA graphs to eliminate launch overhead
Expected speedups:
- Typical models: 1.3-2x
- Memory-bound models: 1.1-1.5x (less kernel fusion benefit)
- Compute-bound models: 1.5-3x (more fusion benefit)
CUDA Graphs
CUDA graphs capture a sequence of GPU operations and replay them without CPU involvement:
class CUDAGraphRunner:
def __init__(self, model, sample_input):
self.model = model
# Static buffers for graph
self.static_input = sample_input.clone()
self.static_output = None
# Warmup
for _ in range(3):
_ = model(self.static_input)
torch.cuda.synchronize()
# Capture graph
self.graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(self.graph):
self.static_output = model(self.static_input)
def __call__(self, x):
# Copy input to static buffer
self.static_input.copy_(x)
# Replay captured graph
self.graph.replay()
# Return output (already computed in static buffer)
return self.static_output.clone()Limitations:
- Input shapes must be fixed
- No dynamic control flow during graph replay
- Memory layout must be consistent
Best for: Fixed-shape inference with repetitive workloads.
Tensor Parallelism
When models don’t fit on a single GPU, tensor parallelism splits weight matrices across GPUs:
Without Tensor Parallelism:
GPU 0: Full 70B model (doesn't fit!)
With 4-way Tensor Parallelism:
GPU 0: 1/4 of each weight matrix
GPU 1: 1/4 of each weight matrix
GPU 2: 1/4 of each weight matrix
GPU 3: 1/4 of each weight matrix
Each GPU computes partial results, then all-reduce to combine.
For a linear layer \(Y = XW\): 1. Split \(W\) into \(W_1, W_2, ..., W_n\) (column-wise) 2. Each GPU computes \(Y_i = XW_i\) 3. All-reduce to get \(Y = [Y_1, Y_2, ..., Y_n]\)
Communication overhead:
- One all-reduce per linear layer: ~8 all-reduces per transformer block
- For 80 layers: ~640 all-reduces per forward pass
- With NVLink (900 GB/s): Acceptable overhead
- With PCIe (64 GB/s): Often prohibitive
Full implementation: See reference/21_performance_engineering_code.md for tensor parallelism setup.
Decision Frameworks for Performance Optimization
When to Optimize
Not all optimization is worthwhile. Consider:
The 80/20 rule: 80% of time is spent in 20% of code. Profile to find that 20%.
Amdahl’s Law: If a component is 10% of total time, even infinite speedup yields only 11% overall improvement.
Development cost: A week of engineering time costs $5,000+. Will the optimization save that much in compute?
Optimization Priority Framework
Do First (High Impact, Low Effort):
- Enable Flash Attention (one line of code)
- torch.compile (one line of code)
- Use production serving framework (vLLM, TGI)
Do Later (High Impact, High Effort):
- Custom CUDA kernels
- Model architecture changes
- Custom quantization schemes
Skip (Low Impact, Low Effort):
- Micro-optimizations in preprocessing
- Minor memory layout changes
Do If Time (Low Impact, High Effort):
- Exotic optimization techniques
- Hardware-specific tuning
Latency vs. Throughput Decision Matrix
| Scenario | Optimize For | Key Techniques |
|---|---|---|
| Interactive chat | Latency | Speculative decoding, tensor parallelism, small batch |
| Batch processing | Throughput | Large batches, continuous batching, quantization |
| Mixed workload | Balance | Priority queues, adaptive batching |
| Cost-constrained | Throughput | Aggressive quantization, spot instances |
| Real-time (<100ms) | Latency | CUDA graphs, distillation, edge deployment |
Cost-Performance Tradeoff Analysis
def analyze_configuration(config):
"""Analyze cost-performance tradeoff for a configuration."""
# Calculate throughput
tokens_per_second = estimate_throughput(config)
# Calculate cost
gpu_cost_per_hour = GPU_COSTS[config.gpu_type]
cost_per_million_tokens = (gpu_cost_per_hour / 3600) / tokens_per_second * 1_000_000
# Calculate quality (if using quantization)
quality_score = estimate_quality(config)
return {
'throughput': tokens_per_second,
'cost_per_1M_tokens': cost_per_million_tokens,
'quality': quality_score,
'latency_p99': estimate_latency(config)
}
# Example configurations
configs = [
{'gpu': 'A10', 'model': '7B', 'quant': 'fp16', 'batch': 8},
{'gpu': 'A10', 'model': '7B', 'quant': 'int8', 'batch': 16},
{'gpu': 'A100', 'model': '70B', 'quant': 'int4', 'batch': 32},
]
# Find Pareto-optimal configurations
for config in configs:
metrics = analyze_configuration(config)
if meets_latency_slo(metrics) and meets_quality_threshold(metrics):
print(f"Viable: {config} -> ${metrics['cost_per_1M_tokens']:.4f}/1M tokens")Case Studies: Real-World Optimization Wins
Case Study 1: The Startup Cost Crisis
Situation: AI coding assistant, 7B parameter model, $0.375/user-hour serving cost.
Analysis:
- Single user per GPU (batch size 1)
- FP16 inference (no quantization)
- Standard attention (no Flash Attention)
- No caching of common prompts
Optimization Journey:
| Stage | Change | Before | After | Improvement |
|---|---|---|---|---|
| Baseline | Measure | - | 50ms/token, 8 users/GPU | - |
| 1 | Flash Attention | 8 users | 8 users | No change (not attention-bound) |
| 2 | INT8 quantization | 50ms | 35ms | 1.4x latency |
| 3 | Continuous batching | 8 users | 32 users | 4x throughput |
| 4 | KV cache optimization | 32 users | 48 users | 1.5x throughput |
| 5 | Speculative decoding | 35ms | 20ms | 1.75x latency |
| Final | Combined | 8 users, $0.375/hr | 64 users, $0.047/hr | 8x cost reduction |
Key Insight: The biggest win came from continuous batching, not the “obvious” optimization (Flash Attention). This is why profiling matters.
Case Study 2: Long-Context Document Analysis
Situation: Legal document analysis requiring 100K+ context windows.
Problem: Standard attention OOMs at 32K tokens.
Analysis:
- 32K tokens * 32K tokens * 2 bytes * 40 heads = 82GB per layer for attention
- Not feasible on any single GPU
Solution:
| Technique | Purpose | Impact |
|---|---|---|
| Flash Attention 2 | Eliminate O(n^2) memory | Enable 100K context |
| Sliding window (partial) | Reduce effective sequence for most layers | 2x memory reduction |
| Gradient checkpointing | Trade compute for memory | 3x memory reduction for training |
| GQA (8 KV heads) | Reduce KV cache size | 8x KV cache reduction |
Result: 128K context windows on A100-80GB, enabling full document analysis.
Case Study 3: Real-Time Voice Assistant
Situation: Voice assistant requiring <200ms end-to-end latency.
Budget breakdown:
- Audio processing: 20ms
- Speech-to-text: 50ms
- LLM inference: 100ms (target)
- Text-to-speech: 30ms
Challenge: 100ms for LLM inference requires aggressive optimization.
Solution:
| Technique | Latency Impact |
|---|---|
| 3B distilled model (from 7B) | 2x faster |
| INT8 quantization | 1.5x faster |
| Speculative decoding | 2x faster |
| CUDA graphs | 1.2x faster |
| Tensor parallelism (2 GPUs) | 1.8x faster |
Result:
- Time to first token: 25ms
- Tokens per second: 80
- End-to-end latency: 180ms (under budget)
Key Insight: Distillation provided the biggest win. A faster model beats a heavily optimized slow model.
Common Pitfalls and How to Avoid Them
Pitfall 1: Optimizing Without Profiling
The mistake: “Attention is slow, so I’ll implement custom attention kernels.”
Reality: Attention might be 10% of your runtime. Data loading might be 60%.
The fix: Always profile first. Optimize the actual bottleneck.
Pitfall 2: Forgetting Warmup
The mistake: First inference takes 2 seconds, subsequent ones take 50ms. Reporting 2s latency.
Reality: First call includes JIT compilation, CUDA context creation, memory allocation.
The fix: Always run warmup iterations before benchmarking.
# Wrong
start = time.time()
output = model(input)
print(f"Latency: {time.time() - start}") # Includes warmup!
# Right
for _ in range(5): # Warmup
_ = model(input)
torch.cuda.synchronize()
times = []
for _ in range(100): # Benchmark
start = time.perf_counter()
_ = model(input)
torch.cuda.synchronize()
times.append(time.perf_counter() - start)
print(f"Latency: {np.mean(times):.3f}s +/- {np.std(times):.3f}s")Pitfall 3: Missing GPU Synchronization
The mistake: Timing shows 1ms, but actual latency is 50ms.
Reality: CUDA operations are asynchronous. The Python call returns immediately.
The fix: Always torch.cuda.synchronize() before timing endpoints.
Pitfall 4: Testing at Wrong Scale
The mistake: Optimizations work great on 100 tokens, fail at 10,000 tokens.
Reality: Algorithmic complexity differences only show at scale.
The fix: Test with production-representative input sizes.
Pitfall 5: Ignoring Memory Fragmentation
The mistake: OOM with 40GB free, 60GB reserved.
Reality: Memory fragmentation prevents allocation even with “free” memory.
The fix: Monitor fragmentation, use paged memory, periodic cleanup.
def check_memory():
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
fragmentation = (reserved - allocated) / reserved if reserved > 0 else 0
print(f"Allocated: {allocated:.1f}GB, Reserved: {reserved:.1f}GB")
print(f"Fragmentation: {fragmentation:.1%}")
if fragmentation > 0.2:
print("Warning: High fragmentation. Consider torch.cuda.empty_cache()")Pitfall 6: Premature Optimization
The mistake: Spending a week writing custom CUDA kernels before validating the product.
Reality: The product might pivot, making the optimization worthless.
The fix: Start with high-level optimizations (torch.compile, vLLM). Only write custom code when high-level tools aren’t enough.
Key Takeaways
Memory bandwidth is the primary bottleneck - Modern GPUs have abundant compute but limited memory bandwidth. For LLM inference at small batch sizes, you use less than 1% of available compute.
Use the roofline model to identify bottlenecks - Calculate arithmetic intensity (FLOPs per byte) to determine if you’re compute-bound or memory-bound. Then optimize the actual constraint, not the assumed one.
Flash Attention enables long contexts - By computing attention in tiles that fit in fast SRAM, Flash Attention achieves O(N) memory instead of O(N^2) without approximating the computation.
Quantization trades precision for throughput - 4-bit quantization achieves 4x memory reduction with minimal quality loss for most applications. Quantize to INT8 or INT4 for production inference.
Always profile before optimizing - Intuition about bottlenecks is often wrong. Use torch.profiler, nsys, and systematic measurement to find the real bottleneck before investing in optimization.
Summary
Performance engineering for AI systems requires understanding the fundamental differences between AI workloads and traditional software:
Memory bandwidth is the bottleneck. Modern GPUs have abundant compute but limited memory bandwidth. For LLM inference at small batch sizes, you’re using less than 1% of available compute—all time is spent loading weights from memory.
The roofline model provides a framework for reasoning about bottlenecks. Calculate arithmetic intensity (FLOPs per byte) to determine if you’re compute-bound or memory-bound, then optimize accordingly.
Flash Attention eliminates the quadratic memory bottleneck of attention, enabling long context windows by never materializing the full attention matrix.
Quantization reduces memory traffic by representing weights with fewer bits. 4-bit quantization achieves 4x memory reduction with minimal quality loss for most applications.
The KV cache is critical for efficient generation. Paged attention enables efficient memory management, supporting 2-4x more concurrent requests.
Speculative decoding breaks the autoregressive barrier by using a small draft model to generate candidates that a large model verifies in parallel.
Always profile first. Intuition about bottlenecks is often wrong. Use torch.profiler, nsys, and systematic measurement to find the real bottleneck before optimizing.
Understand the tradeoffs. Latency and throughput often conflict. Cost and quality often conflict. The right configuration depends on your specific requirements.
Practical Exercises
Exercise 1: Roofline Analysis
For a model you’re working with: 1. Calculate parameter count and FLOPs per forward pass 2. Calculate memory traffic (bytes loaded) 3. Determine arithmetic intensity 4. Compare to your GPU’s compute-to-bandwidth ratio 5. Are you compute-bound or memory-bound?
Exercise 2: Profiling Deep Dive
- Profile a model inference with torch.profiler
- Export to Chrome trace and visualize the timeline
- Identify the top 5 operations by time
- Classify each operation: compute-bound, memory-bound, or launch-bound
- Propose optimizations for each bottleneck
Exercise 3: Quantization Impact
- Measure baseline model performance (latency, throughput, quality)
- Apply INT8 dynamic quantization and re-measure
- Apply 4-bit GPTQ quantization and re-measure
- Plot the Pareto frontier of quality vs. performance
- Recommend the best configuration for interactive use
Exercise 4: Batching Optimization
- Measure single-request latency and throughput
- Implement static batching with sizes 1, 2, 4, 8, 16, 32
- Plot latency and throughput vs. batch size
- Identify the optimal batch size for your latency SLO
- Implement continuous batching and compare
Exercise 5: End-to-End Optimization
Starting with a baseline model: 1. Profile to identify the primary bottleneck 2. Apply the single most impactful optimization 3. Re-profile and identify the new bottleneck 4. Repeat until you achieve 3x improvement or reach diminishing returns 5. Document the optimization journey with metrics
Self-Assessment Checkpoint
Conceptual Questions
Q1. [Senior] What does it mean for LLM inference to be “memory-bandwidth bound”? How does this differ from being “compute bound”?
Answer
Memory-bandwidth bound: The GPU spends more time waiting for data to load from memory than doing computation. During LLM decode, we load the entire model (14GB for 7B FP16) to do a matrix-vector multiply for one token—massive data movement, minimal computation. The arithmetic intensity (FLOPs per byte loaded) is ~1-2, while GPUs can do ~600 FLOPs per byte loaded. Compute bound: The GPU is fully utilized doing math, and loading more data wouldn’t help. This occurs during prefill (matrix-matrix multiply) where we process many tokens simultaneously, achieving high arithmetic intensity. Key implication: For decode phase, buying faster compute doesn’t help—need faster memory or smarter use of memory (batching, quantization, caching).Q2. [Senior] Why does batching improve throughput for LLM inference? What limits how large the batch can be?
Answer
Batching amortizes weight loading: Instead of loading 14GB of weights to generate 1 token, load them once and generate B tokens for B batched requests. Throughput scales nearly linearly with batch size until limits hit. Limits: (1) Memory—KV cache grows with batch size × context length. Eventually GPU memory is exhausted. (2) Latency—larger batches mean each request waits longer (for its slot, for the batch to complete). SLA limits batch size. (3) Diminishing returns—at very large batches, compute becomes the bottleneck instead of memory bandwidth. (4) Request arrival—need enough concurrent requests to form batches. Trade-off: Higher batch = higher throughput but higher latency. Optimal batch size depends on latency SLO and memory available.Q3. [Staff] Explain how FlashAttention achieves better performance without approximating attention. What’s the key insight?
Answer
FlashAttention is exact attention (same mathematical result) but with better memory access patterns. Key insight: Standard attention materializes the N×N attention matrix in GPU memory (for softmax), requiring O(N²) memory reads/writes. For long sequences, this is slow (memory-bound) and uses massive memory. FlashAttention uses tiling: computes attention in blocks that fit in fast SRAM (on-chip memory), never materializing the full N×N matrix in slow HBM. It’s IO-aware—designed around the memory hierarchy rather than just minimizing FLOPs. Result: Same computation, but ~2-4x faster and O(N) memory instead of O(N²). This is why FlashAttention enabled long contexts (100K+ tokens) that were previously impractical.Q4. [Staff] What’s the roofline model and how do you use it to identify bottlenecks?
Answer
Roofline plots achievable performance (FLOPs/s) against arithmetic intensity (FLOPs/byte). The “roof” has two parts: (1) Sloped section (memory-bound)—performance limited by memory bandwidth × arithmetic intensity. (2) Flat section (compute-bound)—performance limited by peak compute. To use: Calculate your kernel’s arithmetic intensity. Plot it on the roofline. If you’re in the sloped region, you’re memory-bound—optimize memory access (batching, quantization, caching). If you’re in the flat region, you’re compute-bound—optimize computation (better algorithms, hardware utilization). For LLM inference: Decode is typically far in the memory-bound region. Prefill may be closer to the knee or compute-bound. This tells you which optimizations will help for each phase.Q5. [Staff] How do you decide between INT8, INT4, and FP16 for model deployment? What are the trade-offs?
Answer
Trade-offs: FP16: Full precision, no quality loss, 2 bytes/parameter. Baseline. INT8: ~4x faster memory bandwidth, usually <1% quality loss for well-quantized models. Good default for production. INT4: ~8x faster memory bandwidth, 2-5% quality loss typical, can be more for reasoning-heavy tasks. Best for resource-constrained or cost-sensitive deployments. Decision factors: (1) Quality requirements—measure on YOUR task, not benchmarks. Some tasks degrade more than others. (2) Hardware—INT4 needs kernels that support it efficiently (modern GPUs). (3) Model—some models quantize better than others. Always test. (4) Latency vs. throughput—INT4 helps throughput more than single-request latency. Process: Start with INT8 (usually safe). If quality acceptable and need more efficiency, try INT4. Always validate on task-specific evaluation set before deploying.Spot the Problem
Problem 1. [Senior] Optimization approach:
"Our model inference is slow. I'll rewrite the attention mechanism
in custom CUDA to make it faster."
Answer
Problems: (1) Premature optimization—haven’t profiled to confirm attention is the bottleneck. Might be optimizing the wrong thing. (2) Reinventing the wheel—FlashAttention already exists and is highly optimized by experts. Unlikely to beat it. (3) Maintenance burden—custom CUDA is hard to maintain, debug, and port to new hardware. (4) May break other optimizations—custom kernels might not integrate with torch.compile, quantization, etc. Correct approach: Profile first (torch.profiler, nsys). Use established optimizations (FlashAttention, torch.compile). Only write custom CUDA if profiling shows a specific bottleneck that existing tools don’t address—and even then, consider contributing to existing libraries.Problem 2. [Staff] Performance report:
"After applying all optimizations (FlashAttention, INT4 quantization,
continuous batching), we achieved 100 tokens/second throughput."
What’s missing?
Answer
Missing context: (1) What model?—100 tok/s means different things for 7B vs. 70B. (2) What hardware?—100 tok/s on H100 might be poor; on consumer GPU might be excellent. (3) What batch size?—100 tok/s at batch=1 vs. batch=32 are very different. (4) What sequence lengths?—Long contexts are harder. (5) Latency metrics?—What’s P50/P99 per-request latency? Throughput without latency is incomplete. (6) Quality impact?—Did INT4 quantization hurt accuracy? (7) Baseline comparison?—100 tok/s improvement from what? A complete performance report includes: model, hardware, batch size, sequence lengths, latency percentiles, throughput, quality metrics, and comparison to baseline.Problem 3. [Staff] Deployment configuration:
# Using maximum batch size for best throughput
config = {
"max_batch_size": 64,
"max_sequence_length": 32768,
"dtype": "int4"
}
Answer
Likely memory issues: Batch size 64 × 32K context × KV cache per token = enormous memory. Even INT4 weights won’t save you if KV cache (typically FP16) explodes. Problems: (1) May OOM—calculate actual memory needed before setting config. (2) Latency impact—batch 64 means requests wait for huge batches. (3) 32K context for all requests?—Most requests probably don’t need it. Wasting memory reservation. (4) INT4 doesn’t shrink KV cache—just weights. Long context is still memory-hungry. Better: Profile actual request distribution (context lengths, arrival rate). Set max_sequence_length to actual needs. Use dynamic batching that adapts to load. Calculate memory budget: model weights + max_batch_size × max_context × KV_cache_per_token < GPU_memory.Design Exercises
Exercise 1. [Staff] You have an A100 80GB GPU and need to serve a 70B parameter model with sub-100ms time to first token for prompts up to 4K tokens. Design your deployment strategy, including quantization, parallelism, and batching decisions.
Guidance
Constraints: (1) 70B model in FP16 = 140GB—doesn’t fit in 80GB. Need quantization or tensor parallelism. (2) TTFT < 100ms for 4K tokens is aggressive—prefill of 4K tokens takes significant time. Analysis: (1) INT4 quantization: 70B × 0.5 bytes = 35GB weights. Fits with room for KV cache. (2) Single A100 with INT4 might achieve ~100ms TTFT for shorter prompts. 4K is pushing it. (3) Alternative: 2x A100 with tensor parallelism, FP16. Better quality, meets latency. Strategy options: (1) Single A100 + INT4 + aggressive optimization—risky for 4K, may work for shorter. (2) 2x A100 + tensor parallelism + INT8—more headroom, meets quality and latency. (3) Consider if 70B is necessary—would 13B meet quality needs at much lower cost? Batching: Limited—sub-100ms TTFT limits batch wait time. Likely batch=1-4 for real-time path.Exercise 2. [Staff] Your LLM inference service is costing $500K/month. Leadership wants 30% cost reduction without impacting user experience. Design an optimization roadmap with prioritized initiatives and expected savings.
Guidance
First, understand cost breakdown: (1) GPU hours by instance type. (2) Utilization patterns (peak vs. off-peak). (3) Request distribution (simple vs. complex). Optimization roadmap: (1) Caching—identify repeat/similar queries. Even 20% cache hit rate at near-zero marginal cost is significant. (2) Model routing—route simple queries to smaller/cheaper models. Requires quality validation but can save 40%+ on routed queries. (3) Quantization—if not already INT8, significant savings. Validate quality first. (4) Batching optimization—improve utilization through better batching strategies. (5) Reserved capacity—convert on-demand to reserved/spot for predictable baseline load. 30-50% savings. (6) Off-peak processing—batch non-urgent requests to off-peak hours with cheaper capacity. Prioritize by: Effort vs. impact. Caching and routing often highest ROI. Always validate quality metrics before and after.Connections to Other Chapters
Performance engineering intersects with deployment, cost, and system design. Here’s how this chapter connects:
Chapter 9 (LLM Deployment & Infrastructure): Core deployment architecture, batching strategies, and KV cache optimization. This chapter goes deeper into the performance theory behind those patterns.
Chapter 25 (System Design at Scale): Distributed systems patterns and architecture for high-throughput AI systems. Performance engineering at the component level enables system-level scalability.
Chapter 32 (Cost Engineering): How performance optimizations translate to cost savings and ROI analysis. Faster inference means lower GPU costs.
Tool & Framework Reference (Appendix B): Inference engines like vLLM, TGI, and TensorRT-LLM covered in this chapter.
Further Reading
Essential
- Dao et al. (2022), “FlashAttention” - The paper that made long contexts practical. Critical for understanding modern attention.
- Kwon et al. (2023), “vLLM/PagedAttention” - Efficient memory management for LLM serving.
- Horace He, “Making Deep Learning Go Brrrr From First Principles” - Excellent practical GPU performance introduction.
Deep Dives
- Frantar et al. (2022), “GPTQ” - Post-training quantization for transformers. Key for model compression.
- Leviathan et al. (2022), “Speculative Decoding” - Using small models to accelerate large model inference.
- Williams et al. (2009), “Roofline Model” - Visual performance model for understanding compute bounds.