Chapter 9: LLM Deployment & Infrastructure
deployment, inference, KV cache, quantization, vLLM, batching, GPU, PagedAttention, speculative decoding, self-hosting
Introduction
In March 2023, a startup launched their AI-powered legal document analysis tool. During development, their prototype felt snappy—responses streamed in under a second. On launch day, they watched their system collapse. Response times climbed from 500 milliseconds to 30 seconds. GPU memory overflowed, causing crashes. Their cloud bill for that single day exceeded their projected monthly budget. The model worked perfectly; the infrastructure failed catastrophically.
This story illustrates a harsh truth about LLM deployment: training a model is the beginning, not the end. The techniques that make inference fast, efficient, and economical are fundamentally different from those used in training. Understanding these techniques—why they exist, how they work mathematically, and when to apply them—separates production-ready systems from expensive science projects.
LLM inference presents unique challenges compared to traditional ML serving. A classification model processes fixed-size inputs and produces fixed-size outputs in a single forward pass. An LLM generates tokens one at a time, each requiring a full forward pass through billions of parameters. A 1000-token response requires 1000 forward passes—and each pass must load the entire model from memory. The computational pattern is fundamentally different, and so are the optimization strategies.
This chapter examines the engineering of LLM deployment: the mathematical foundations of inference optimization, the systems that implement them, and the decision frameworks for choosing among alternatives. We focus primarily on self-hosted deployment because that’s where you need the deepest understanding—but even if you use managed APIs, these concepts help you reason about performance, cost, and architectural decisions.
The Core Challenge: Memory Bandwidth
To understand why LLM inference is hard, you need to understand one number: the ratio of compute to memory bandwidth on modern GPUs.
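An NVIDIA H100 GPU is rated at up to 1979 TFLOPS of FP16 tensor-core compute (NVIDIA's peak figure, which assumes structured sparsity; dense throughput is roughly half that). It can transfer 3.35 TB/s from its HBM3 memory. At peak, that is approximately 600 floating-point operations for every byte loaded from memory, and still about 300 for dense math.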
Now consider LLM inference. During token generation, each forward pass multiplies the model’s weight matrices by small vectors (the current token’s hidden states). For a 7 billion parameter model in FP16, this means loading 14 GB of weights to perform matrix-vector multiplications on vectors of size ~4096. The computation is trivial compared to the data transfer.
The arithmetic intensity (operations per byte loaded) of this workload is roughly 1-2, not 600. LLM generation is memory-bandwidth bound, not compute bound. This single insight explains nearly every optimization technique in this chapter:
- Batching: Process multiple requests together so the same weight load serves multiple computations
- Quantization: Reduce weight size to load fewer bytes
- KV caching: Avoid recomputing attention by storing intermediate results
- Continuous batching: Keep GPUs busy by dynamically managing requests
Understanding that inference is memory-bound changes how you think about every optimization decision.
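The gap between what the GPU can do per byte and what decode asks of it can be checked with a few lines of arithmetic. The sketch below is a rough roofline estimate using the H100 figures quoted above; the function names are illustrative, and it ignores activation and KV cache traffic, so treat the crossover batch size as an order-of-magnitude guide only.

```python
# Rough roofline arithmetic for decode (hardware figures taken from the text).
PEAK_FLOPS = 1979e12      # H100 peak FP16 rating quoted above, FLOP/s
MEM_BANDWIDTH = 3.35e12   # H100 HBM3 bandwidth, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs the GPU can perform for each byte it loads from memory."""
    return peak_flops / bandwidth


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per weight byte when one weight load serves `batch_size` requests.

    Each weight contributes one multiply and one add per request, and is
    loaded once per step regardless of batch size (simplifying assumption:
    activation and KV cache traffic are ignored).
    """
    return 2.0 * batch_size / bytes_per_weight


ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
for batch in (1, 8, 64, 1024):
    intensity = decode_intensity(batch)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"batch={batch:5d}  intensity={intensity:7.1f} FLOP/byte  {regime}")
```

Passing `bytes_per_weight=0.5` models INT4 weights: the same batch then does four times more work per byte loaded, which is the sense in which quantization and batching attack the same bottleneck.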
Most engineers reach for inference optimization expecting a compute problem—more FLOPS, faster kernels. The reality is closer to an airport boarding gate: requests arrive at unpredictable rates, share an expensive scarce resource (the GPU), and have wildly varying service times (prefill versus decode, short versus long outputs). KV cache management, continuous batching, prefill/decode disaggregation, and prefix caching are all scheduling policies, not math tricks. Heuristic: when latency or cost surprises you, draw the queue first—arrival rate, service time, queue depth—before touching the model.
Reframing inference as a queue theory problem rather than a compute problem is what unlocks the 10-20x throughput wins modern serving stacks routinely achieve.
What You’ll Learn
- The mathematics of LLM inference: prefill vs. decode, KV cache growth, memory bandwidth limits
- How continuous batching and PagedAttention achieve 10-20x throughput improvements
- Quantization theory: why INT4 weights often preserve quality, and when they don’t
- Decision frameworks for GPU selection, inference servers, and self-hosting vs. APIs
- How leading providers (Anthropic, OpenAI, Together, Fireworks) architect their serving systems
- Cost modeling and optimization for sustainable deployment
Prerequisites
- Understanding of transformer architecture and attention mechanisms (Chapter 5)
- Familiarity with GPU computing concepts (CUDA, memory hierarchy)
- Basic knowledge of containerization and orchestration
- Python programming
The Physics of LLM Inference
Two Phases, Two Bottlenecks
LLM inference has two distinct computational phases with fundamentally different characteristics.
Prefill (prompt processing): The model processes all input tokens simultaneously, building the initial attention state. This phase involves dense matrix-matrix multiplications—multiplying weight matrices against the entire prompt sequence. The arithmetic intensity is high (many operations per byte loaded), making this phase compute-bound. Prefill time scales approximately linearly with prompt length, though the quadratic attention computation becomes significant for very long prompts.
Decode (generation): The model generates tokens one at a time. Each token requires a forward pass where weight matrices multiply against a single vector (the current token’s hidden state). The arithmetic intensity drops dramatically—we load the entire model to generate one token. This phase is memory-bandwidth bound. Decode time per token is nearly constant regardless of how many tokens have been generated (with a slight increase due to growing KV cache reads).
This asymmetry has profound implications. A 2000-token prompt processes in prefill in perhaps 200ms. Generating 500 tokens in decode might take 5000ms—25x longer for 1/4 the tokens. The bottleneck isn’t computation; it’s moving 14+ GB of weights from memory to compute units for each generated token.
The Mathematics of Memory Bandwidth Limits
Let’s derive why decode is memory-bound. Consider an H100 GPU (3.35 TB/s bandwidth) serving a 7B parameter model in FP16 (14 GB weights).
Time to load model weights once: 14 GB / 3.35 TB/s = 4.2 ms
Theoretical maximum generation speed (single request): 1 token / 4.2 ms = 238 tokens/second
Observed speed in practice: ~150-200 tokens/second (due to overheads, KV cache reads)
This is the fundamental ceiling: you cannot generate tokens faster than you can read the model weights from memory. No amount of additional compute helps—the H100’s massive 1979 TFLOPS sits mostly idle during single-request generation.
This analysis reveals why batching is so critical. If you process B requests simultaneously, you load the weights once and use them B times. The effective bandwidth cost per token becomes (14 GB / B), and throughput scales nearly linearly with batch size—until you hit other limits.
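The derivation above fits in one small function. This is a back-of-the-envelope sketch, not a performance model: the function name is illustrative, and it assumes weight loading is the only cost, so it overstates throughput once KV cache reads or compute limits start to matter at large batch sizes.

```python
def decode_ceiling_tokens_per_s(weight_bytes: float,
                                bandwidth_bytes_per_s: float,
                                batch_size: int = 1) -> float:
    """Upper bound on decode throughput when weight loading dominates.

    One decode step must stream every weight from memory once; all
    requests in the batch share that single load.
    """
    seconds_per_step = weight_bytes / bandwidth_bytes_per_s
    return batch_size / seconds_per_step


WEIGHTS_7B_FP16 = 14e9    # bytes, from the text
H100_BANDWIDTH = 3.35e12  # bytes/s, from the text

single = decode_ceiling_tokens_per_s(WEIGHTS_7B_FP16, H100_BANDWIDTH)
batched = decode_ceiling_tokens_per_s(WEIGHTS_7B_FP16, H100_BANDWIDTH, batch_size=8)
print(f"single request ceiling: {single:.0f} tokens/s")
print(f"batch of 8 ceiling:     {batched:.0f} tokens/s")
```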
The KV Cache: Trading Memory for Compute
The attention mechanism in transformers computes:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Where Q (queries) represents the current token, and K (keys) and V (values) represent all tokens seen so far. Without caching, generating the 500th token would require recomputing attention over all 500 tokens—quadratic complexity.
The KV cache stores the K and V projections for all previous tokens. When generating a new token, we only compute its Q and look up the cached K, V. This transforms attention from O(n^2) to O(n) per token, but introduces a memory cost that grows linearly with sequence length.
KV Cache Size Formula: \[\text{KV Cache per request} = 2 \times L \times S \times H \times D \times \text{bytes\_per\_value}\]
Where:
- 2: Key and Value tensors
- L: Number of transformer layers
- S: Sequence length (prompt + generated tokens)
- H: Number of attention heads
- D: Dimension per head (typically hidden_dim / num_heads)
- bytes_per_value: 2 for FP16, 1 for INT8
Example: LLaMA-2 7B with 4K context in FP16: \[2 \times 32 \times 4096 \times 32 \times 128 \times 2 = 2.15 \text{ GB per request}\]
At 4K context, the KV cache for a single request (2.15 GB) is more than half the size of the model weights stored at INT4 precision (about 3.5 GB), and two such requests exceed them. This is why long-context serving is expensive: the memory cost grows linearly with context length and linearly with concurrent requests.
| Context Length | KV Cache Per Request | With 10 Concurrent | With 50 Concurrent |
|---|---|---|---|
| 512 | 268 MB | 2.7 GB | 13.4 GB |
| 2,048 | 1.07 GB | 10.7 GB | 53.5 GB |
| 4,096 | 2.15 GB | 21.5 GB | 107 GB |
| 8,192 | 4.3 GB | 43 GB | 215 GB |
This table illustrates why serving long contexts at high concurrency requires either massive GPU memory, aggressive KV cache compression, or accepting lower concurrency.
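The table can be reproduced directly from the KV cache formula. The sketch below uses the LLaMA-2 7B shape from the worked example (32 layers, 32 heads, head dimension 128); the function and constant names are illustrative. For models with grouped-query attention, the head count to pass is the number of key/value heads, which is an adjustment the formula above does not spell out.

```python
def kv_cache_bytes(layers: int, seq_len: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one request: 2 x L x S x H x D x bytes_per_value."""
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_value


# Shape used in the worked example above (LLaMA-2 7B, FP16 cache).
LLAMA2_7B = {"layers": 32, "kv_heads": 32, "head_dim": 128}

for seq_len in (512, 2048, 4096, 8192):
    per_request = kv_cache_bytes(seq_len=seq_len, **LLAMA2_7B)
    print(f"{seq_len:>5} tokens: {per_request / 1e9:5.2f} GB per request, "
          f"{10 * per_request / 1e9:6.1f} GB x10, "
          f"{50 * per_request / 1e9:6.1f} GB x50")
```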
Key Performance Metrics
Understanding inference metrics is essential for reasoning about systems and comparing configurations.
Time to First Token (TTFT): The delay before generation starts. Dominated by prefill time. Users perceive this as “thinking time.” For interactive applications, TTFT under 500ms feels responsive; over 2 seconds feels sluggish.
Inter-Token Latency (ITL) or Time Per Output Token (TPOT): The average time between consecutive generated tokens. Determines streaming speed. For readable streaming, ITL under 50ms (20 tokens/second) feels natural.
Throughput: Total tokens generated per second across all concurrent requests. The key metric for cost efficiency. A system generating 1000 tokens/second across 10 concurrent requests is 10x more efficient than one generating 100 tokens/second for a single request.
Tokens Per Dollar: The ultimate efficiency metric. Combines throughput with infrastructure cost.
These metrics trade off against each other. Increasing batch size improves throughput but increases TTFT (new requests wait for batch slots) and ITL (GPU time shared among more requests). Optimizing for one metric often degrades others.
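These definitions translate directly into measurements over per-token timestamps. The sketch below assumes a hypothetical `RequestTrace` record holding the arrival time and the emission time of each output token; real servers expose similar data through their own metrics, under different names.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record; times are in seconds."""
    arrival: float
    token_times: list[float]  # emission time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between arrival and the first output token."""
    return trace.token_times[0] - trace.arrival


def mean_itl(trace: RequestTrace) -> float:
    """Average gap between consecutive output tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def throughput(traces: list[RequestTrace]) -> float:
    """Output tokens per second across all requests in the window."""
    start = min(t.arrival for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return sum(len(t.token_times) for t in traces) / (end - start)


trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(ttft(trace), mean_itl(trace), throughput([trace]))
```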
The KV Cache Revolution: PagedAttention
The Memory Fragmentation Problem
Before 2023, LLM serving systems pre-allocated contiguous memory for each request’s KV cache, sized for the maximum possible sequence length. A system configured for 4K context would allocate 2.15 GB per request, even if the actual sequence used only 500 tokens. This led to severe memory fragmentation and waste.
Consider a server with 80GB GPU memory, 14GB used by model weights, leaving 66GB for KV cache. With conservative pre-allocation:
- Maximum concurrent requests: 66 GB / 2.15 GB ≈ 30 requests
- But average sequences are 1K tokens, using only 537 MB
- Actual memory usage: 30 × 537 MB = 16 GB
- Wasted memory: 50 GB (76%)
Even worse, variable-length sequences caused external fragmentation—freed memory was scattered in non-contiguous chunks, making it unusable for new requests.
PagedAttention: Virtual Memory for KV Cache
The vLLM paper (Kwon et al., 2023) introduced PagedAttention, applying operating system virtual memory concepts to KV cache management. The insight: manage KV cache as small, fixed-size blocks rather than contiguous per-request allocations.
Instead of pre-allocating maximum sequence length, PagedAttention:
1. Divides KV cache into fixed-size blocks (typically 16 tokens each)
2. Allocates blocks on-demand as sequences grow
3. Maintains a block table mapping logical sequence positions to physical memory locations
4. Frees blocks immediately when requests complete
The result is near-perfect memory utilization. Memory waste drops from 60-80% to under 4%. The same hardware can serve 2-4x more concurrent requests.
Block Tables: Each request maintains a block table mapping logical positions to physical block IDs:
# Conceptual block table structure
request_1_blocks = [
    {"logical_range": (0, 15), "physical_block": 7},
    {"logical_range": (16, 31), "physical_block": 23},
    {"logical_range": (32, 47), "physical_block": 12},
]

During attention computation, the system translates logical sequence positions to physical memory locations using these tables. The indirection adds negligible overhead compared to the memory efficiency gains.
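To make the bookkeeping concrete, here is a toy allocator that follows the four steps listed earlier: fixed-size blocks, on-demand allocation, a per-request block table, and immediate release. It is a sketch of the idea only; the class and method names are invented for illustration and do not correspond to vLLM's internal API, and it tracks block IDs rather than real GPU memory.

```python
class PagedKVAllocator:
    """Toy block-table allocator (illustrative, not vLLM's implementation)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request -> physical block ids
        self.lengths: dict[str, int] = {}             # request -> tokens stored

    def append_token(self, request_id: str) -> None:
        """Reserve space for one more token, grabbing a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def locate(self, request_id: str, position: int) -> tuple[int, int]:
        """Translate a logical token position into (physical block, offset)."""
        if not 0 <= position < self.lengths[request_id]:
            raise IndexError(position)
        block = self.block_tables[request_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):  # a 20-token sequence occupies two 16-token blocks
    allocator.append_token("req-1")
print(allocator.block_tables["req-1"], allocator.locate("req-1", 17))
```

A 20-token sequence holds two blocks (32 token slots) instead of a 4K-token reservation, and at most one block per request is ever partially empty.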
Practical Impact: Real-World Benchmarks
The vLLM paper demonstrated significant improvements over previous systems:
| Comparison | vLLM Improvement | Notes |
|---|---|---|
| vs HuggingFace Transformers (baseline) | 14-24x throughput | Naive batching baseline |
| vs HuggingFace TGI | 2.2-3.5x throughput | Both use continuous batching |
| vs FasterTransformer/Orca | 2-4x throughput | At similar latency |
| Memory utilization | ~97% vs ~35% baseline | Near-zero KV cache waste |
Note: The improvement magnitude depends heavily on workload characteristics. High-concurrency scenarios with shared prefixes see the largest gains. Interactive single-user scenarios may see smaller improvements. TGI v3 (2024) has closed some of this gap, particularly for long-context workloads with prefix caching.
What people do: Focus only on model weight memory when sizing GPUs, then wonder why inference fails at moderate concurrency.
Why it fails: KV cache grows with sequence length × batch size × layers × heads × head dimension. A 70B model’s weights fit in 40GB when quantized to INT4, but 32 concurrent 4K-context requests need another 50GB+ of KV cache. Memory exhaustion causes OOM crashes or aggressive eviction.
Fix: Calculate worst-case KV cache before deployment: kv_bytes = 2 × layers × hidden_dim × seq_len × batch_size × bytes_per_element, where hidden_dim = heads × head_dim (for models with grouped-query attention, count only the key/value heads). Use this to set max_concurrent_requests. Monitor cache hit rates and eviction rates in production.
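A sizing check along these lines can run before any deployment. The sketch below is an assumption-laden estimate: the function name and the `headroom_fraction` parameter (memory held back for activations and allocator overhead) are illustrative choices, not settings from any particular server.

```python
def max_concurrent_requests(gpu_bytes: float, weight_bytes: float,
                            kv_bytes_per_request: float,
                            headroom_fraction: float = 0.1) -> int:
    """Worst-case number of full-length requests that fit beside the weights.

    headroom_fraction is an assumed safety margin for activations and
    fragmentation; tune it against measurements from your own stack.
    """
    usable = gpu_bytes * (1.0 - headroom_fraction) - weight_bytes
    if usable <= 0:
        return 0
    return int(usable // kv_bytes_per_request)


# Numbers from this chapter: 80 GB GPU, 14 GB of FP16 weights,
# and the 4K-context KV cache of a 7B model (2 x 32 x 4096 x 32 x 128 x 2 bytes).
KV_4K = 2 * 32 * 4096 * 32 * 128 * 2

print(max_concurrent_requests(80e9, 14e9, KV_4K, headroom_fraction=0.0))
print(max_concurrent_requests(80e9, 14e9, KV_4K))
```

With no headroom this reproduces the roughly 30 requests computed earlier; holding back 10% of memory lowers the safe limit by a few requests.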
Continuous Batching: Maximizing GPU Utilization
Most engineers think about LLM inference as a compute problem: “How fast can we do matrix multiplications?” This framing leads to chasing faster GPUs and more FLOPS. It’s wrong.
LLM inference is fundamentally a queue theory problem: “How do we keep tokens flowing through the system with minimum waiting?” The bottleneck isn’t computation—it’s memory bandwidth and scheduling.
This reframing changes everything:
- Traditional compute thinking: “We need more GPU horsepower” → Buy more A100s
- Queue theory thinking: “Requests are waiting in line too long” → Optimize scheduling
Key queue theory concepts that apply directly:
- Arrival rate (λ): Requests per second hitting your system
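- Service rate (μ): Requests per second your system can complete (token throughput divided by average tokens per request)
- Utilization (ρ = λ/μ): As ρ approaches 1, latency grows without bound; in a simple M/M/1 model, time in the system scales with 1/(1 − ρ), so it climbs steeply above ρ ≈ 0.8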
- Little’s Law (L = λW): Average queue length equals arrival rate times average wait time
When you see high latency, don’t immediately add GPUs. Ask: “Is my queue poorly scheduled?” Continuous batching, PagedAttention, and priority queues are scheduling optimizations, not compute optimizations. Often, better scheduling on existing hardware outperforms naive scaling by 5-10x.
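To see how sharply latency reacts to utilization, it helps to plug numbers into the simplest textbook queue. The sketch below assumes an M/M/1 model (Poisson arrivals, exponential service times, one server), which real LLM serving only loosely resembles; the function name is illustrative, and the point is the shape of the curve rather than the exact values.

```python
def mm1_metrics(arrival_rate: float, service_rate: float) -> tuple[float, float, float]:
    """Return (utilization, mean time in system, mean requests in system).

    Assumes an M/M/1 queue. Rates are in requests per second.
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    utilization = arrival_rate / service_rate
    time_in_system = 1.0 / (service_rate - arrival_rate)  # W
    requests_in_system = arrival_rate * time_in_system    # Little's Law: L = λW
    return utilization, time_in_system, requests_in_system


SERVICE_RATE = 10.0  # requests/second the server can complete (assumed)
for rho in (0.5, 0.8, 0.9, 0.95):
    u, w, l = mm1_metrics(rho * SERVICE_RATE, SERVICE_RATE)
    print(f"rho={u:.2f}  mean latency={w * 1000:7.1f} ms  in system={l:5.1f}")
```

Going from 80% to 95% utilization quadruples mean latency in this model, which is why shaving service time through better scheduling can matter more than adding raw capacity.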
The Static Batching Problem
Traditional batching collects requests, processes them together, and waits for the entire batch to complete before returning results. This creates two problems for LLM serving:
Head-of-line blocking: If one request generates 500 tokens while others generate 50, the short requests wait for the long one. Response time becomes max(request_lengths), not mean.
Batch assembly latency: New requests must wait for the current batch to finish before joining. During low traffic, requests queue unnecessarily.
Consider a batch of 8 requests:
- Request 1-7: Generate 50 tokens each (500ms)
- Request 8: Generates 500 tokens (5000ms)
With static batching, all requests return after 5000ms. Requests 1-7 could have returned 4500ms earlier.
Continuous Batching: The Solution
Continuous batching (also called “iteration-level scheduling” or “dynamic batching”) processes one decode step at a time, allowing requests to enter and exit the batch at any iteration.
The key insight: during decode, we generate one token per request per iteration. There’s no mathematical requirement that all requests generate the same number of tokens before any can return.
The Mathematics of Continuous Batching
Let’s formalize the throughput advantage. Define:
- B: Maximum batch size
- T_decode: Time for one decode iteration (one token per request)
- N_tokens(i): Number of tokens generated by request i
Static batching throughput: \[\text{Throughput}_{static} = \frac{\sum_{i=1}^{B} N_{tokens}(i)}{T_{decode} \times \max_{i}(N_{tokens}(i))}\]
Continuous batching throughput (ideal): \[\text{Throughput}_{continuous} = \frac{B}{T_{decode}}\]
The ratio depends on request length variance. If all requests generate exactly the same number of tokens, the approaches are equivalent. But real workloads have high variance—some requests are “tell me yes or no” while others are “write a detailed essay.”
Example calculation:
- Batch of 8 requests: 7 generate 50 tokens, 1 generates 500 tokens
- T_decode = 10ms per iteration
Static: (7×50 + 500) / (10ms × 500) = 850 tokens / 5000ms = 170 tokens/second
Continuous: During the first 50 iterations, 8 requests are active; in iterations 51-500, only 1 request is active. Total time = 50×10ms + 450×10ms = 5000ms, but 7 slots are freed after iteration 50.
With continuous batching, those 7 freed slots can accept new requests. In steady state, throughput approaches the ideal B/T_decode = 800 tokens/second—a 4.7x improvement for this workload.
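The steady-state claim can be checked with a small simulation. The sketch below assumes a backlog of requests with the same 7-short, 1-long mix, a fixed per-iteration time, and no prefill cost; the function names are illustrative. Continuous batching lands below the ideal 800 tokens/second because the batch drains at the end of the run, and the exact figure depends on arrival order.

```python
from collections import deque


def static_throughput(lengths: list[int], batch_size: int, t_decode: float) -> float:
    """Tokens/second when each batch runs until its longest request finishes."""
    tokens, elapsed = 0, 0.0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        tokens += sum(batch)
        elapsed += max(batch) * t_decode
    return tokens / elapsed


def continuous_throughput(lengths: list[int], batch_size: int, t_decode: float) -> float:
    """Tokens/second when freed slots are refilled at every iteration."""
    waiting = deque(lengths)
    active: list[int] = []  # tokens still to generate, one entry per busy slot
    tokens = iterations = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        tokens += len(active)  # one token per active request this iteration
        iterations += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return tokens / (iterations * t_decode)


workload = ([50] * 7 + [500]) * 20  # 20 batches' worth of the example mix
static = static_throughput(workload, batch_size=8, t_decode=0.010)
continuous = continuous_throughput(workload, batch_size=8, t_decode=0.010)
print(f"static: {static:.0f} tokens/s, continuous: {continuous:.0f} tokens/s")
```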
Implementation: Iteration-Level Scheduling
Modern inference servers implement continuous batching with iteration-level schedulers:
# Conceptual iteration scheduler (simplified)
EOS_TOKEN = 2  # Placeholder end-of-sequence id; the real value is model-specific

class IterationScheduler:
    def __init__(self, max_batch_size, memory_manager):
        self.max_batch_size = max_batch_size
        self.memory = memory_manager
        self.active_requests = []  # Currently generating
        self.waiting_queue = []    # Pending prefill

    def schedule_iteration(self):
        """Select requests for this decode iteration."""
        # Step 1: Keep all active decode requests (don't interrupt)
        scheduled = [r for r in self.active_requests if not r.done]

        # Step 2: Add waiting requests if slots and memory available
        while (len(scheduled) < self.max_batch_size
               and self.waiting_queue
               and self.memory.can_fit(self.waiting_queue[0])):
            request = self.waiting_queue.pop(0)
            # Track the request as active so it survives to the next iteration
            # and can be removed cleanly when it completes
            self.active_requests.append(request)
            scheduled.append(request)
        return scheduled

    def run_iteration(self, scheduled):
        """Execute one decode step for all scheduled requests."""
        # Prefill any newly added requests
        # (prefill_batch and decode_step wrap the model forward pass; not shown)
        new_requests = [r for r in scheduled if not r.prefilled]
        if new_requests:
            self.prefill_batch(new_requests)

        # Decode one token for all active requests
        tokens = self.decode_step(scheduled)

        # Return completed requests, keep others active
        for request, token in zip(scheduled, tokens):
            request.generated_tokens.append(token)
            if token == EOS_TOKEN or len(request.generated_tokens) >= request.max_tokens:
                request.complete()
                self.memory.free(request.kv_blocks)
                self.active_requests.remove(request)

Full implementation: See reference/09_llm_deployment_code.md
The scheduler’s decisions—which requests to prefill, how to balance prefill vs. decode, when to preempt—significantly impact performance. Advanced schedulers consider request priorities, memory pressure, and fairness guarantees.
Chunked Prefill: Optimizing for Mixed Workloads
A subtle problem arises when long prompts enter the system. Prefill is compute-bound and can monopolize the GPU, blocking decode iterations for active requests. If a 10,000-token prompt enters, other requests might stall for several hundred milliseconds.
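Chunked prefill splits long prefills into chunks (e.g., 512 tokens) and interleaves them with decode iterations, so active requests keep producing tokens between chunks instead of stalling until the whole prompt has been processed.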
This increases the new request’s TTFT (prefill takes longer overall) but maintains acceptable ITL for existing requests—a worthwhile trade-off for interactive applications.
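The interleaving can be sketched as a simple schedule. This is a deliberately simplified picture: the function names are illustrative, and production schedulers typically pack a prefill chunk and the pending decode tokens into the same forward pass under a token budget rather than strictly alternating as shown here.

```python
def chunk_prefill(prompt_tokens: int, chunk_size: int = 512) -> list[tuple[int, int]]:
    """Split a prompt into (start, end) token ranges of at most chunk_size."""
    return [(start, min(start + chunk_size, prompt_tokens))
            for start in range(0, prompt_tokens, chunk_size)]


def interleaved_schedule(prompt_tokens: int, chunk_size: int = 512) -> list[tuple]:
    """Alternate one prefill chunk with one decode iteration for active requests."""
    schedule = []
    for start, end in chunk_prefill(prompt_tokens, chunk_size):
        schedule.append(("prefill", start, end))
        schedule.append(("decode", None, None))  # existing requests emit a token
    return schedule


chunks = chunk_prefill(10_000)
schedule = interleaved_schedule(10_000)
print(f"{len(chunks)} chunks; last covers tokens {chunks[-1]}")
print(schedule[:4])
```

For the 10,000-token prompt mentioned above, active requests wait for one 512-token chunk between tokens rather than for the entire prompt.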
What people do: Set max_batch_size to whatever the GPU can fit in memory, maximizing throughput.
Why it fails: Maximum batch size ≠ optimal batch size. Large batches increase latency for all requests, because each decode iteration takes longer as more sequences share the GPU. For interactive applications, high p99 latency destroys user experience even if throughput is great.
Fix: Set batch size based on latency SLOs, not memory capacity. Start with max_batch_size=8-16 for interactive workloads. Monitor p99 latency under load. Increase batch size only for batch/offline processing where latency doesn’t matter. Use separate pools for interactive vs. batch traffic.
Quantization: Fitting More with Less
“We spent three months building a complex serving optimization before someone suggested trying INT8 quantization. It took a day to implement and gave us 2x the throughput improvement of everything else combined. Now my rule is: try quantization before any other optimization. It’s free performance.”
— ML Platform Engineer at AI startup
Why Quantization Works
Quantization reduces numerical precision—replacing 16-bit floats with 8-bit or 4-bit integers. Intuitively, this should destroy model quality. Why doesn’t it?
The key insight comes from analyzing weight distributions in trained transformers. Weights are not uniformly distributed; they cluster around zero with heavy tails. Most weights are small, and the model’s behavior depends more on preserving the relative relationships between weights than their exact values.
Consider a simplified example. If we have weights [0.0012, 0.0015, 0.0014, 0.0013], the exact values matter less than their ordering and rough magnitudes. Quantization that maps these to [1, 2, 2, 1] (scaled appropriately) preserves most of the computational effect.
Formal perspective: Neural networks are overparameterized. The function computed by the network depends on the weight manifold (the set of weight configurations producing near-equivalent outputs), not individual weight values. Small perturbations from quantization often stay close to this manifold.
Quantization Methods: From Theory to Practice
Uniform quantization divides the value range into equal intervals:
\[q = \text{round}\left(\frac{x - x_{min}}{x_{max} - x_{min}} \times (2^b - 1)\right)\]
Where b is the number of bits. Dequantization reverses the process:
\[\hat{x} = q \times \frac{x_{max} - x_{min}}{2^b - 1} + x_{min}\]
The problem: outliers. If most weights are between -0.1 and 0.1 but a few are at 1.0, the quantization range wastes precision on the sparse outliers.
Group quantization addresses this by applying separate scales to groups of weights (typically 128 weights per group). This captures local distributions better but adds storage overhead for the scales.
GPTQ: Optimal Brain Quantization for Transformers
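GPTQ (Frantar et al., 2022) applies Optimal Brain Quantization principles to LLMs. The key insight: quantize weights one column at a time in a fixed order, using second-order information (the Hessian) to adjust the remaining weights and compensate for the quantization error. Unlike the original Optimal Brain Quantization procedure, GPTQ does not rely on a greedy, sensitivity-based ordering, which is a large part of what makes it fast enough for billion-parameter models.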
The algorithm processes weights column by column:
1. Compute the inverse Hessian H^(-1) (approximated efficiently)
2. For each weight w_i, compute the quantization error δ
3. Adjust remaining weights to minimize output change:
\[w_j \leftarrow w_j - \delta \times \frac{[H^{-1}]_{ij}}{[H^{-1}]_{ii}}\]
This “error correction” is what makes GPTQ achieve better quality than naive quantization. The second-order update propagates the correction optimally through the layer.
Calibration requirement: GPTQ requires a small calibration dataset (typically 128-1024 samples) to estimate the Hessian. The calibration data should represent your actual workload; using generic text when you serve code will degrade quality.
AWQ: Activation-Aware Weight Quantization
AWQ (Lin et al., 2023) observes that not all weights are equally important—some channels carry more information than others. Rather than treating all weights equally, AWQ protects “salient” weights identified by activation magnitudes.
The insight: channels with large activations amplify weight quantization errors. AWQ scales these channels before quantization, reducing their relative quantization error:
Channel importance from activations:
────────────────────────────────────────────────────────────────
Channel 1: avg |activation| = 0.1 → Low importance
Channel 2: avg |activation| = 2.3 → High importance
Channel 3: avg |activation| = 0.05 → Low importance
Channel 4: avg |activation| = 1.8 → High importance
AWQ strategy: Scale channels 2 and 4 up before quantization,
scale them back down during inference.
Higher precision preserved where it matters.
AWQ typically achieves 0.5-1% better perplexity than GPTQ at the same bit-width, with the advantage increasing for aggressive 3-bit quantization.
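The scaling argument can be illustrated numerically. The sketch below is not the AWQ algorithm itself, which searches for per-channel scales from activation statistics; it only demonstrates the underlying effect under one stated assumption: the quantization step is fixed by the largest weight in the group (taken to be 0.8 here), so scaling a salient channel up does not change the grid. All values and names are illustrative.

```python
def fake_quantize(weight: float, step: float) -> float:
    """Round a weight to the nearest point on a uniform grid with spacing `step`."""
    return round(weight / step) * step


def mean_error(weights: list[float], step: float, channel_scale: float = 1.0) -> float:
    """Mean absolute weight error when the channel is scaled up before
    quantization and scaled back down afterwards."""
    total = 0.0
    for w in weights:
        restored = fake_quantize(w * channel_scale, step) / channel_scale
        total += abs(w - restored)
    return total / len(weights)


# Assumed 4-bit symmetric grid whose step is set by the group's largest weight (0.8).
STEP = 0.8 / 7

# Weights of one salient channel; even after doubling they stay below 0.8,
# so the grid spacing is unaffected by the scaling.
salient_channel = [i / 1000 for i in range(1, 400)]

plain = mean_error(salient_channel, STEP)
scaled = mean_error(salient_channel, STEP, channel_scale=2.0)
print(f"unscaled error: {plain:.4f}, scaled-by-2 error: {scaled:.4f}")
```

Scaling by 2 roughly halves the effective rounding error on that channel, which is the sense in which precision is preserved where it matters.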
FP8: The Hardware-Supported Middle Ground
NVIDIA H100 GPUs introduced native FP8 (8-bit floating point) support. Unlike INT8, FP8 maintains a floating-point format. The E4M3 variant uses 4 bits of exponent and 3 bits of mantissa (better precision, smaller range — typically used for forward-pass activations and weights), while E5M2 uses 5 bits of exponent and 2 bits of mantissa (wider dynamic range — typically used for gradients in training).
Advantages of FP8:
- Native hardware support: 2x throughput vs. FP16 on H100
- Floating-point range: Handles outliers better than fixed-point
- Minimal quality loss: Typically <0.5% perplexity increase
- Simple adoption: No complex calibration or algorithm changes
FP8 is increasingly the default for H100 deployments: it provides the same 2x memory savings as INT8, with simpler implementation and better quality.
Quantization Decision Framework
Quality degradation by method (typical, task-dependent):
| Method | Bits | Memory vs. FP16 | Perplexity increase | Notes |
|---|---|---|---|---|
| FP16 | 16 | 1x | Baseline | Reference quality |
| FP8 | 8 | 2x savings | <0.5% | Excellent; H100 optimized |
| INT8 (GPTQ) | 8 | 2x savings | 0.5-1.5% | Good general-purpose |
| INT8 (AWQ) | 8 | 2x savings | 0.3-1.0% | Better than GPTQ |
| INT4 (GPTQ) | 4 | 4x savings | 2-15% | Degrades on complex tasks |
| INT4 (AWQ) | 4 | 4x savings | 1-10% | Preferred INT4; test first |
FlashAttention: Compute-Efficient Attention
The Memory Hierarchy Problem
Standard attention implementation materializes the full N×N attention matrix in GPU memory:
# Standard attention (memory-inefficient)
import math
import torch

Q = K = V = torch.randn(1, 32, 1024, 128)  # (batch, heads, seq_len, head_dim); example shapes
d_k = Q.size(-1)
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, heads, seq_len, seq_len)
attention = torch.softmax(scores, dim=-1)  # Materializes N×N matrix
output = torch.matmul(attention, V)

For sequence length N=4096 and 32 heads, the attention matrix requires:
- 32 × 4096 × 4096 × 4 bytes (FP32) = 2 GB of temporary storage per sequence
This memory is written to HBM, read back, and then discarded, which wastes precious GPU memory bandwidth.
FlashAttention: Tiled, Memory-Efficient Attention
FlashAttention (Dao et al., 2022) restructures attention to minimize memory transfers by keeping computation in fast SRAM rather than slow HBM.
The key insight: we don’t need the full attention matrix simultaneously. We can compute attention in tiles, accumulating results without ever materializing the full N×N matrix.
The algorithm:
1. Divide Q, K, V into blocks that fit in SRAM
2. For each Q block, iterate over K, V blocks
3. Compute partial attention scores in SRAM
4. Apply softmax correction (requires tracking running max/sum)
5. Accumulate to output without writing intermediate results to HBM
Standard Attention Memory Traffic:
────────────────────────────────────────────────────────────────
HBM ──read Q, K──► SRAM ──compute S=QK^T──► HBM ──write S
HBM ──read S──────► SRAM ──softmax(S)────► HBM ──write P
HBM ──read P, V──► SRAM ──compute PV─────► HBM ──write O
Memory I/O: O(N² + Nd) ≈ O(N²) for large N
FlashAttention Memory Traffic:
────────────────────────────────────────────────────────────────
HBM ──read Q, K, V blocks──► SRAM ──compute in tiles──► HBM ──write O
Memory I/O: O(N²d²/M) for SRAM size M, with no N×N matrix written to HBM
Performance impact: FlashAttention achieves 2-4x speedup for long sequences. More importantly, it enables longer contexts without OOM—a qualitative capability improvement.
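The softmax correction in step 4 is the part that makes tiling possible, and it is easy to verify on a toy example. The sketch below processes keys and values block by block for a single query while tracking a running maximum and running sum, then compares the result with the standard computation. It is plain Python for clarity, with illustrative names and data; it demonstrates the arithmetic and says nothing about the GPU kernel.

```python
import math


def naive_attention(q, keys, values):
    """Standard softmax attention for one query; holds all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.3, -0.4], [1.0, 1.0], [-0.5, 0.7], [0.2, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```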
FlashAttention-2: Further Optimizations
FlashAttention-2 (Dao, 2023) improves parallelization across thread blocks:
- Parallelize over sequence length: Process Q blocks in parallel (independent)
- Better work partitioning: Reduce synchronization overhead
- Leverage NVIDIA specific instructions: Exploit tensor cores more efficiently
FlashAttention-2 achieves up to 2x additional speedup over FlashAttention-1, approaching the hardware theoretical maximum.
Why this matters for inference: During decode, the query is a single token attending to a potentially long KV cache. FlashAttention’s memory efficiency makes long-context inference practical. Without it, 100K+ context windows would be prohibitively slow and memory-intensive.
Inference Server Architectures
The Inference Server Landscape
The complexity of efficient LLM serving—PagedAttention, continuous batching, tensor parallelism, quantization kernels—has been packaged into specialized inference servers. Understanding their architectures and trade-offs is essential for deployment decisions.
# Sequential processing - wastes GPU cycles
# (assumes a Hugging Face `tokenizer` and `model` are already loaded on the GPU)
def generate_responses(prompts: list[str]) -> list[str]:
    results = []
    for prompt in prompts:
        # One request at a time, GPU sits idle between requests
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        # Unpack the encoding so input_ids and attention_mask are both passed
        output = model.generate(**inputs, max_new_tokens=256)
        results.append(tokenizer.decode(output[0]))
    return results

# 10 requests x 2 seconds each = 20 seconds total
# GPU utilization: ~15%

from dataclasses import dataclass

from vllm import LLM, SamplingParams

@dataclass
class GenerationResult:
    """Small result container used by this example."""
    text: str
    tokens: int
    finish_reason: str | None

# Production: continuous batching with KV cache management
class ProductionInference:
    def __init__(self, model_path: str):
        self.llm = LLM(
            model=model_path,
            tensor_parallel_size=2,       # Split across GPUs
            gpu_memory_utilization=0.9,   # Maximize KV cache
            max_num_seqs=64,              # Concurrent sequences
            enable_prefix_caching=True,   # Cache common prefixes
            quantization="fp8",           # Memory efficiency
        )

    # LLM.generate is a blocking call, so this is a regular (non-async) method
    def generate_batch(
        self,
        prompts: list[str],
        max_tokens: int = 256
    ) -> list[GenerationResult]:
        sampling_params = SamplingParams(
            max_tokens=max_tokens,
            temperature=0.7,
            top_p=0.95,
        )
        # Continuous batching: requests processed as tokens complete
        # The engine schedules all prompts at the iteration level internally
        outputs = self.llm.generate(prompts, sampling_params)
        return [
            GenerationResult(
                text=output.outputs[0].text,
                tokens=len(output.outputs[0].token_ids),
                finish_reason=output.outputs[0].finish_reason
            )
            for output in outputs
        ]

# 10 requests batched = ~3 seconds total
# GPU utilization: ~85%

Key differences: Continuous batching (vs sequential), PagedAttention for KV cache, tensor parallelism across GPUs, prefix caching for repeated system prompts, FP8 quantization, and concurrent sequence handling. Result: 6x throughput improvement.
vLLM: The PagedAttention Reference Implementation
vLLM, developed at UC Berkeley, introduced PagedAttention and remains the most widely deployed open-source inference server.
Key configuration parameters:
- gpu-memory-utilization: Fraction of GPU memory vLLM may use in total, covering weights, activations, and KV cache (default 0.9)
- max-model-len: Maximum sequence length
- tensor-parallel-size: GPUs for tensor parallelism
- max-num-seqs: Maximum concurrent sequences
- quantization: awq, gptq, fp8, or none
When to choose vLLM:
- General-purpose LLM serving
- High throughput requirements
- Need for prefix caching / shared prompts
- OpenAI API compatibility required
TensorRT-LLM: Maximum NVIDIA Performance
TensorRT-LLM is NVIDIA’s inference stack, optimizing specifically for NVIDIA GPUs.
Key differences from vLLM:
1. Compilation required: Models must be compiled to TensorRT engines
2. Kernel optimization: Custom kernels for each GPU architecture
3. In-flight batching: NVIDIA’s continuous batching implementation
4. FP8 native: Best FP8 support on H100
Compilation process:
# Convert model to TensorRT-LLM checkpoint
python convert_checkpoint.py --model_dir llama-7b-hf --output_dir llama-7b-trt
# Build TensorRT engine
trtllm-build --checkpoint_dir llama-7b-trt \
    --output_dir llama-7b-engine \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_output_len 512
When to choose TensorRT-LLM:
- Maximum throughput on NVIDIA GPUs (often 20-40% faster than vLLM)
- Using H100s with FP8
- Willing to invest in complex setup and model compilation
- Latency-critical applications
SGLang: Structured Generation Optimization
SGLang (LMSYS) optimizes for workloads with structured outputs—JSON, code, constrained generation.
Key innovation: RadixAttention, a radix-tree (trie-based) KV cache that efficiently shares prefixes across requests. SGLang pairs it with compressed finite-state-machine decoding for fast constraint checking during generation.
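To make the prefix-sharing idea concrete, here is a toy trie keyed by token IDs. It is a simplified sketch, not SGLang's implementation: the class names are hypothetical, and the `kv_block` strings stand in for real cached tensors.

```python
class PrefixCacheNode:
    """One token position in the trie (toy stand-in for a cached KV entry)."""

    def __init__(self):
        self.children = {}    # token id -> PrefixCacheNode
        self.kv_block = None  # placeholder for the cached key/value data


class PrefixCache:
    """Toy trie that records which token prefixes already have cached KV."""

    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens):
        """Record a processed sequence so later requests can reuse its prefix."""
        node = self.root
        for pos, tok in enumerate(tokens):
            if tok not in node.children:
                child = PrefixCacheNode()
                child.kv_block = f"kv(pos={pos}, tok={tok})"
                node.children[tok] = child
            node = node.children[tok]

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                  # e.g. system prompt + first user turn
print(cache.match_prefix([1, 2, 3, 9, 9]))  # tokens that can skip prefill
```

A request that matches N cached tokens only needs prefill for the remaining tokens, which is where the TTFT savings come from.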
When to choose SGLang:
- Structured output (JSON schemas, grammar constraints)
- Multi-turn conversations with prefix sharing
- Research / rapid prototyping
Hugging Face TGI: Production-Ready Simplicity
Text Generation Inference (TGI) prioritizes production stability and Hugging Face ecosystem integration.
Key features:
- Watermarking support (detecting AI-generated text)
- Token streaming with standard SSE
- Prometheus metrics built-in
- Simple Docker deployment
When to choose TGI:
- Hugging Face model hub workflow
- Need watermarking
- Prefer stability over maximum performance
Inference Server Decision Framework
How the Leaders Do It: Industry Case Studies
Understanding how leading providers architect their serving infrastructure provides both practical insights and validation for the techniques we’ve discussed.
OpenAI: Scale Through Specialization
OpenAI serves billions of API requests daily across GPT-5, GPT-5.5, and GPT-4 variants. Their architecture, while not fully public, reveals key patterns through their API behavior and published work.
Observed characteristics:
- Multiple model variants: Different price/capability tiers (GPT-5.5 vs GPT-5 vs GPT-4)
- Aggressive batching: High latency tolerance for batch API (50% cost reduction)
- Speculative decoding: Faster tokens on GPT-5.5 suggest draft model usage
- Regional routing: Requests routed to nearest data center
Inferred architecture patterns:
- Separate clusters for different model sizes/speeds
- Request classification to route to appropriate backend
- Heavy use of prefix caching for system prompts
- Custom hardware (rumored TPU/custom ASIC usage)
Key insight: OpenAI offers explicit price/latency trade-offs (batch API, prompt caching discount), suggesting their infrastructure can dynamically trade these resources.
Anthropic: Optimizing for Long Context
Anthropic’s Claude models support 200K token contexts—far beyond typical deployments. This requires specific architectural choices.
Observed characteristics:
- Streaming by default: Long outputs stream immediately
- Context caching: Explicit prompt caching API with cost reduction
- Graceful degradation: Long context requests have higher latency tolerance
Inferred architecture patterns:
- Extensive KV cache optimization (likely PagedAttention or custom equivalent)
- Potentially hierarchical or compressed KV cache for very long contexts
- Dedicated capacity for long-context workloads
- Heavy prefix caching for repeated system prompts
Key insight: Anthropic’s explicit prompt caching API (5-minute cache, 90% cost reduction for cache hits) reveals their infrastructure heavily leverages prefix sharing.
Together AI: Inference as a Service
Together AI specializes in hosted inference for open-source models. Their architecture is more transparent.
Public architecture details:
- Custom inference stack: “Together Inference Engine” with claimed 2-3x throughput vs. standard vLLM
- Speculative decoding: Available for supported models
- Multiple model sizes: From small to 70B+ parameters
- Dedicated endpoints: Option for reserved capacity
Key optimizations (from their blog):
- Custom CUDA kernels for specific model architectures
- Aggressive operator fusion
- Memory-optimized attention implementations
- Dynamic batching with priority support
Throughput claims: Up to 117 tokens/second per request for Llama 4 Scout on 8x A100s with speculative decoding.
Fireworks AI: Speed as the Value Proposition
Fireworks AI positions itself on inference speed, claiming industry-leading latency.
Public architecture details:
- FireAttention: Custom attention kernel optimized for inference
- Speculative decoding: Enabled by default where applicable
- Optimized routing: Request-aware load balancing
Performance claims:
- TTFT under 200ms for most requests
- Up to 4x throughput improvement over standard implementations
Key insight: Fireworks’ focus on latency suggests architecture choices that prioritize TTFT over raw throughput—potentially smaller batch sizes, eager prefill, and latency-optimized kernels.
Lessons from Production Deployments
Common patterns across all providers:
- Tiered service levels: Different price/performance points for different workloads
- Prefix caching: Universal adoption, often with explicit API support
- Speculative decoding: Increasingly common for throughput improvement
- Custom kernels: All serious providers have custom CUDA implementations
- Batching intelligence: Request-aware routing and scheduling
- Regional distribution: Low-latency access requires geographic presence
GPU Selection and Capacity Planning
Understanding GPU Specifications for Inference
GPU selection for inference differs from training. Memory bandwidth often matters more than raw compute.
Key specifications to evaluate:
| Spec | Why It Matters | Trade-off |
|---|---|---|
| Memory Size | Determines model size + max concurrent requests | Larger = more expensive |
| Memory Bandwidth | Bounds single-request generation speed | Higher bandwidth GPUs cost more |
| FP16 TFLOPS | Affects prefill speed | Often over-provisioned for inference |
| FP8 Support | Enables up to ~2x throughput with minimal quality loss | Requires Hopper (H100/H200) or Ada (L40S, RTX 4090); not available on A100 or A10 |
| Interconnect | Needed for tensor parallelism | NVLink vastly outperforms PCIe |
GPU Comparison for Inference
| GPU | Memory | BW (TB/s) | $/hr | Best For |
|---|---|---|---|---|
| RTX 4090 | 24 GB | 1.0 | $0.50 | Dev/testing, 7B models |
| A10 | 24 GB | 0.6 | $1.50 | 7B production, cost-sensitive |
| L40S | 48 GB | 0.9 | $2.00 | 13B models, balanced |
| A100 40GB | 40 GB | 1.6 | $4.00 | 13-30B, high throughput |
| A100 80GB | 80 GB | 2.0 | $6.00 | 70B single GPU, large batches |
| H100 80GB | 80 GB | 3.4 | $10.00 | Maximum throughput, FP8 |
| H200 141GB | 141 GB | 4.8 | $15.00 | 200B+, longest contexts |
Memory Budget Calculation
A systematic approach to GPU selection:
Step 1: Model memory
Weight memory = Parameters × Bytes per parameter
7B FP16: 7 × 10^9 × 2 = 14 GB
7B INT4: 7 × 10^9 × 0.5 = 3.5 GB
70B INT4: 70 × 10^9 × 0.5 = 35 GB
Step 2: KV cache budget
Available for KV = GPU Memory - Model Memory - Overhead (2-4 GB)
A100 80GB with 7B INT4:
80 - 3.5 - 3 = 73.5 GB for KV cache
Step 3: Max concurrent requests
KV cache per request = 2 × Layers × Seq_len × Heads × Head_dim × Bytes
LLaMA-7B at 4K context, FP16 KV:
2 × 32 × 4096 × 32 × 128 × 2 = 2.15 GB per request
Max concurrent: 73.5 GB / 2.15 GB ≈ 34 requests
Step 4: Throughput estimation
Tokens/second ≈ Memory_bandwidth / (Model_size × bytes_per_weight)
× Batch_efficiency_factor × Kernel_efficiency
A100 (2 TB/s) with 7B INT4, batch 16:
2 × 10^12 / (7 × 10^9 × 0.5) × 0.8 × 0.7 ≈ 320 tok/s per request
Total throughput: ~5,000 tok/s (batch 16)
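The four steps above can be sketched as a small calculator. This is a back-of-the-envelope model under the same assumptions as the worked numbers (decimal GB, fixed overhead, and the efficiency factors chosen above); the function names are illustrative.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Step 1: weight memory in GB (1e9 params x bytes / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


def kv_cache_gb_per_request(layers: int, seq_len: int, heads: int,
                            head_dim: int, bytes_per_value: int = 2) -> float:
    """Step 3: KV cache for one request; the leading 2 covers keys and values."""
    return 2 * layers * seq_len * heads * head_dim * bytes_per_value / 1e9


def max_concurrent_requests(gpu_gb: float, weights_gb: float,
                            overhead_gb: float, kv_gb: float) -> int:
    """Steps 2-3: requests that fit in the memory left after weights and overhead."""
    return int((gpu_gb - weights_gb - overhead_gb) // kv_gb)


def decode_tokens_per_second(bandwidth_tb_s: float, weights_gb: float,
                             batch_eff: float, kernel_eff: float) -> float:
    """Step 4: bandwidth-bound per-request decode rate."""
    return bandwidth_tb_s * 1e12 / (weights_gb * 1e9) * batch_eff * kernel_eff


weights = weight_memory_gb(7, 0.5)                      # 7B INT4
kv = kv_cache_gb_per_request(32, 4096, 32, 128)         # LLaMA-7B, 4K, FP16 KV
print(max_concurrent_requests(80, weights, 3, kv))      # A100 80GB
print(round(decode_tokens_per_second(2.0, weights, 0.8, 0.7)))
```

Changing one input at a time (context length, KV precision, GPU size) shows quickly which term dominates the budget.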
Multi-GPU Considerations
When a model doesn’t fit on one GPU, you need parallelism:
Tensor Parallelism (TP): Split each layer across GPUs. Requires fast interconnect (NVLink). Linear speedup for prefill, limited improvement for decode.
Pipeline Parallelism (PP): Different layers on different GPUs. Works across nodes. Introduces pipeline bubbles, more complex scheduling.
Self-Hosting vs. API: The Decision Framework
Cost Analysis Framework
2020 (GPT-3 Beta): $60/1M tokens. Early access was expensive and rate-limited. Cost prohibited most production use cases. A customer support bot at scale could cost $1M+/month.
2022 (ChatGPT Launch): $20/1M tokens (text-davinci). Still expensive, but 3x cheaper. Companies started building production applications, carefully counting tokens.
2023 (GPT-3.5 Turbo): $2/1M tokens. A 10x drop. The “ChatGPT API” made LLMs economically viable for mainstream applications. Token counting became less critical.
Late 2023 (GPT-4 Turbo, Claude 2.1): Frontier models at $10-30/1M input, $30-60/1M output. Smaller/faster models hit $0.50-2/1M. Tiered pricing became standard: pay for capability you need.
2024 (GPT-4o, Claude 3.5 Sonnet): $2.50-5/1M for production-ready models. Cached input at 50-90% discount. Quality that was frontier-only a year prior now available at commodity prices.
2025 (Current): Frontier models at $10-15/1M, production models at $1-3/1M, distilled models at $0.10-0.50/1M. 100-600x cheaper than 2020 for equivalent capability.
The pattern: LLM costs dropped roughly 10x every 18 months—faster than Moore’s Law. This changes build-vs-buy economics: self-hosting makes sense at ever-higher volume thresholds. More importantly, capabilities that were impossibly expensive become routine—enabling entirely new application categories.
The self-host vs. API decision ultimately comes down to total cost of ownership. Let’s build a rigorous comparison framework.
“The most common mistake I see is teams doing a spreadsheet analysis—‘API costs $X, self-hosting costs $Y, Y < X, let’s self-host.’ That analysis is always wrong because it ignores the hidden costs.
Self-hosting means you need: GPU procurement expertise (6-month lead times), MLOps engineers who understand CUDA, on-call rotation for inference infrastructure, capacity planning for traffic spikes, and a model upgrade path that doesn’t require re-architecting.
My rule of thumb: don’t self-host until you’re spending $50K/month on API costs AND you have at least two engineers who’ve operated GPU infrastructure before. Below that threshold, the operational overhead will eat any savings. Above it, the economics usually work—but only if you’ve built the team first.”
— Staff ML Engineer, Infrastructure
API costs (straightforward):
Monthly API Cost = Requests × (Input_tokens × $/input + Output_tokens × $/output)
Example: GPT-5.5 at 1M requests/month, 500 input, 200 output tokens avg
= 1M × (500 × $0.005/1K + 200 × $0.03/1K)
= 1M × ($0.0025 + $0.006)
= $8,500/month
Self-hosted costs (complex):
Hardware: GPU rental or amortization
= GPU_cost × Number_of_GPUs × Hours_per_month
Operations: Engineer time for setup, maintenance, on-call
= Loaded_engineer_cost × Hours_per_month
Infrastructure: Networking, storage, monitoring
= Platform_costs (varies widely)
Quality risk: Model may not match API capability
= Hard to quantify, often underestimated
Break-Even Analysis
def calculate_breakeven(
    api_cost_per_request: float,
    gpu_hourly_cost: float,
    throughput_requests_per_second: float,
    ops_hours_per_month: float = 20,
    engineer_hourly_cost: float = 100
) -> dict:
    """Calculate when self-hosting becomes cheaper."""
    # GPU cost for 24/7 operation
    monthly_gpu_cost = gpu_hourly_cost * 24 * 30
    # Operations overhead
    monthly_ops_cost = ops_hours_per_month * engineer_hourly_cost
    # Total self-hosted monthly cost
    total_self_hosted = monthly_gpu_cost + monthly_ops_cost
    # Maximum requests the GPU can serve at 70% utilization
    max_requests_per_month = throughput_requests_per_second * 3600 * 24 * 30 * 0.7
    # Break-even point
    break_even_requests = total_self_hosted / api_cost_per_request
    return {
        "monthly_self_hosted_cost": total_self_hosted,
        "break_even_requests": break_even_requests,
        "max_capacity_requests": max_requests_per_month,
        "cost_per_request_at_capacity": total_self_hosted / max_requests_per_month
    }
Example: Llama 4 Scout vs. GPT-5.5
| Factor | GPT-5.5 API | Self-hosted Llama 4 Scout |
|---|---|---|
| Cost per request* | $0.0085 | Variable |
| GPU cost | N/A | 2x A100 80GB @ $12/hr = $8,640/mo |
| Ops cost | N/A | ~$2,000/mo |
| Throughput | N/A | ~50 req/s ≈ 130M req/mo peak (≈91M at 70% utilization) |
| Break-even | N/A | $10,640 / $0.0085 = 1.25M req/mo |
| Cost at capacity | $0.0085 | ~$0.00012 at 70% utilization (~70x cheaper) |
The catch: This analysis assumes:
- You achieve API-level quality with open-source models (not always true)
- You maintain high utilization (often difficult)
- Your ops costs stay low (they often don’t)
Decision Framework
Key decision factors:
- Data residency requirements → May force self-host regardless of cost
- Burst traffic patterns → APIs handle spikes better without over-provisioning
- Model customization needs → Self-host enables fine-tuning and modification
- Team expertise → Self-hosting requires GPU/ML infrastructure expertise
Hybrid Architectures
Many organizations find hybrid approaches optimal:
API for peak, self-host for baseline: Self-host for predictable baseline traffic, overflow to API during peaks
Self-host for sensitive, API for general: Private data stays on-premise, general queries to API
Self-host for experimentation, API for production: Develop and test locally, deploy proven prompts to API
API for quality-critical, self-host for volume: Important use cases get GPT-5, commodity tasks get self-hosted
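A minimal routing sketch combining the first two patterns (sensitive data stays self-hosted, overflow goes to the API) might look like the following. The function name, the backend labels, and the capacity figure in the usage line are illustrative assumptions, not part of any particular framework.

```python
def choose_backend(in_flight: int, self_hosted_capacity: int,
                   contains_sensitive_data: bool = False) -> str:
    """Pick a backend for one request under a simple hybrid policy."""
    # Private data never leaves the self-hosted cluster, even under load;
    # such requests queue rather than overflow.
    if contains_sensitive_data:
        return "self_hosted"
    # Baseline traffic is served in-house while capacity remains.
    if in_flight < self_hosted_capacity:
        return "self_hosted"
    # Anything beyond provisioned capacity overflows to the external API.
    return "api"


# Assumed capacity of 64 concurrent requests, purely for illustration.
print(choose_backend(in_flight=80, self_hosted_capacity=64))
```

In practice the capacity threshold would come from the memory budget calculation rather than a constant, and sensitive requests would need their own queue limits.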
Self-Host vs. API Decision Flowchart
Speculative Decoding: Trading Compute for Latency
The Speculative Decoding Insight
Speculative decoding addresses a fundamental inefficiency: during decode, we load the entire model to generate a single token. What if we could verify multiple tokens with a single model load?
The key insight: verification is cheaper than generation. Given candidate tokens, checking if the large model agrees is faster than generating those tokens directly.
The Algorithm
- Draft: A small, fast model generates K candidate tokens autoregressively
- Verify: The large model processes all K candidates in a single forward pass
- Accept/Reject: Compare probabilities; accept matching tokens, reject and correct on disagreement
The Mathematics of Speculation
Let:
- p_target(x): Target model’s probability of token x
- p_draft(x): Draft model’s probability of token x
- K: Number of speculated tokens
- α: Average acceptance rate
Acceptance criterion: Accept the drafted token x with probability \(\min\left(1, \frac{p_{target}(x)}{p_{draft}(x)}\right)\). On rejection, that position is resampled from the normalized residual distribution \(\max(0, p_{target} - p_{draft})\). This rejection-sampling rule is the whole point: it makes speculative decoding lossless, so the output distribution exactly matches what the target model would have produced on its own. (A naive “accept if the target likes it at least as much as the draft” test would bias the distribution.)
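The accept/reject/resample rule can be sketched for a single verification step as follows. This is a toy illustration on explicit probability lists, assuming the target distributions for all K positions (plus one bonus position) have already been computed in one forward pass; the function names are hypothetical.

```python
import random


def _sample(weights, rng):
    """Draw one token id in proportion to (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(draft_dists, target_dists, drafted, rng=random):
    """Verify K drafted tokens and return the tokens actually emitted.

    draft_dists[i] / target_dists[i]: probability lists over the vocabulary
    at drafted position i. target_dists has K + 1 entries; the last one is
    used for the bonus token when every draft is accepted.
    """
    emitted = []
    for i, tok in enumerate(drafted):
        p_t, p_d = target_dists[i][tok], draft_dists[i][tok]
        # Accept with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_t / p_d):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p_target - p_draft).
        # rng.choices normalizes the weights for us.
        residual = [max(0.0, t - d)
                    for t, d in zip(target_dists[i], draft_dists[i])]
        emitted.append(_sample(residual, rng))
        return emitted  # the first rejection ends the window
    # All K accepted: the target's own distribution yields one extra token.
    emitted.append(_sample(target_dists[len(drafted)], rng))
    return emitted
```

Each call emits between 1 and K + 1 tokens, which is why even a fully rejected window still makes progress.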
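Expected tokens per iteration: \[E[\text{tokens}] = \sum_{k=0}^{K-1} (k+1) \cdot P(\text{accept } k \text{ then reject}) + (K+1) \cdot P(\text{accept all } K)\] The sum stops at K−1 because a rejection can only follow fewer than K acceptances; the k+1 counts the corrected token, and K+1 counts the bonus token sampled when every draft is accepted. If each drafted token is accepted independently with probability α, this simplifies to \(E[\text{tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}\).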
Speedup factor: \[\text{Speedup} = \frac{E[\text{tokens}]}{1 + K \cdot r_{draft}}\]
Where r_draft is the ratio of draft model cost to target model cost.
For well-matched draft models (high α, low r_draft), speedups of 2-3x are typical.
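To get a feel for these numbers, the sketch below evaluates the speedup formula under one simplifying assumption: each drafted token is accepted independently with probability α, so the expected-token sum collapses to the geometric series (1 − α^(K+1)) / (1 − α). The function names are illustrative.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step (i.i.d. acceptance)."""
    if alpha >= 1.0:
        return k + 1.0  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speculative_speedup(alpha: float, k: int, r_draft: float) -> float:
    """Speedup = E[tokens] / (1 + K * r_draft)."""
    return expected_tokens(alpha, k) / (1 + k * r_draft)


# Sweep speculation depth for an 80% acceptance rate and a draft model
# costing 5% of the target per token (both values are assumptions).
for k in (2, 4, 8):
    print(k, round(speculative_speedup(0.8, k, 0.05), 2))
```

The sweep shows diminishing returns: deeper speculation adds draft cost linearly while the expected accepted tokens saturate at 1 / (1 − α).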
Draft Model Selection
The draft model must balance:
- Speed: Small enough to be fast
- Alignment: Similar output distribution to minimize rejection
- Architecture: Compatible tokenizer and vocabulary
Options:
1. Smaller same-family model: A small model from the same family as the target, sharing its tokenizer (for example, a 7B model drafting for a 70B model)
2. Distilled model: Model trained specifically to match the target
3. Quantized self-draft: INT4 version of the target model
Alignment matters enormously: A misaligned draft model produces candidates the target rejects, wasting the speculation cost with no benefit.
# Evaluating draft model alignment
import random

def measure_acceptance_rate(draft, target, prompts, K=4):
    """Measure how often target accepts draft's predictions."""
    total_drafted = 0
    total_accepted = 0
    for prompt in prompts:
        draft_tokens = draft.generate(prompt, num_tokens=K)
        target_probs = target.get_token_probs(prompt + draft_tokens)
        draft_probs = draft.get_token_probs(prompt + draft_tokens)
        for d_prob, t_prob in zip(draft_probs, target_probs):
            total_drafted += 1
            # Correct speculative-sampling rule: accept with probability
            # min(1, p_target / p_draft). On rejection the real algorithm
            # resamples this position from the residual; we stop measuring here.
            accept_prob = min(1.0, t_prob / d_prob)
            if random.random() < accept_prob:
                total_accepted += 1
            else:
                break  # first rejection ends this speculation window
    return total_accepted / total_drafted

# Good alignment: >70% acceptance
# Poor alignment: <50% acceptance
When Speculative Decoding Helps
Speculative decoding provides the most benefit when:
- Target model is large (high per-token cost)
- Good draft model exists (high acceptance rate)
- Latency matters more than throughput
- Batch size is small (decode is memory-bound, so the GPU has spare compute to verify drafts)
It provides less benefit when:
- Already batching heavily (batching already amortizes weight loads, leaving little spare compute for verification)
- No suitable draft model (poor alignment wastes speculation)
- Throughput matters more than latency (better to increase batch size)
Production Operations
Monitoring Essentials
Production LLM serving requires comprehensive monitoring:
Request-level metrics:
- TTFT (time to first token) - P50, P95, P99
- ITL (inter-token latency) - P50, P95, P99
- Total latency - P50, P95, P99
- Request success rate
- Tokens generated per request
System-level metrics:
- GPU memory utilization (total and KV cache)
- GPU compute utilization
- Request queue depth
- Active concurrent requests
- Memory bandwidth utilization (if available)
Business metrics:
- Tokens generated per second (throughput)
- Cost per 1000 tokens
- Requests served per dollar
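As a small illustration of how the request-level latency metrics are derived, the sketch below computes TTFT, mean ITL, and total latency from per-token arrival timestamps. The function name and the timestamp format (seconds on a shared clock) are assumptions for the example.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, mean ITL, and total latency for one streamed response."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay from request arrival to the first generated token.
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive tokens after the first.
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft": ttft,
        "mean_itl": mean_itl,
        "total": token_times[-1] - request_start,
    }


print(latency_metrics(0.0, [0.5, 0.55, 0.6, 0.65]))
```

Per-request values like these are what feed the P50/P95/P99 aggregations; averaging them before aggregation would hide tail latency.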
# Essential Prometheus metrics for LLM serving
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
requests_total = Counter('llm_requests_total', 'Total requests', ['model', 'status'])
tokens_generated = Counter('llm_tokens_generated_total', 'Tokens generated', ['model'])
ttft_seconds = Histogram('llm_ttft_seconds', 'Time to first token', ['model'],
                         buckets=[0.1, 0.2, 0.5, 1, 2, 5, 10])
request_latency_seconds = Histogram('llm_request_latency_seconds', 'Total request latency',
                                    ['model'], buckets=[0.5, 1, 2, 5, 10, 30, 60])

# System metrics
active_requests = Gauge('llm_active_requests', 'Currently processing', ['model'])
queue_depth = Gauge('llm_queue_depth', 'Requests waiting', ['model'])
gpu_memory_used_bytes = Gauge('llm_gpu_memory_bytes', 'GPU memory', ['gpu_id'])
kv_cache_utilization = Gauge('llm_kv_cache_utilization', 'KV cache usage ratio', ['model'])
Full implementation: See reference/09_llm_deployment_code.md
Load Shedding and Overload Protection
LLM inference is expensive. Under overload, accepting all requests leads to:
- Memory exhaustion (KV cache overflow)
- Latency explosion (queue buildup)
- Cascading failures (timeouts causing retries)
Load shedding strategies:
import random

class LoadShedder:
    """Protect the system by rejecting excess load."""
    def should_accept(self, request) -> tuple[bool, str]:
        # Check queue depth
        if self.queue_depth >= self.max_queue:
            return False, "queue_full"
        # Check KV cache headroom (estimate_kv_usage is a helper not shown in this excerpt)
        estimated_kv = estimate_kv_usage(request.max_tokens)
        if self.kv_used + estimated_kv > self.kv_limit * 0.95:
            return False, "memory_pressure"
        # Probabilistic shedding at high load: rejection probability
        # ramps from 0 at 80% queue occupancy to 1 at 100%
        load_factor = self.queue_depth / self.max_queue
        if load_factor > 0.8 and random.random() < (load_factor - 0.8) / 0.2:
            return False, "load_shed"
        return True, "accepted"
Client-side handling: Return HTTP 503 with Retry-After header. Well-behaved clients will back off.
Full implementation: See reference/09_llm_deployment_code.md
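On the client side, "backing off" can be sketched as a delay function that honors `Retry-After` when the server supplies it in seconds and otherwise falls back to capped exponential backoff with full jitter. The function name and default constants are illustrative; the HTTP-date form of the header is deliberately not handled here.

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # The server's hint wins when it is a plain number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # e.g. an HTTP-date value; fall through to backoff
    # Exponential backoff, capped, with full jitter so that many clients
    # rejected at the same moment do not retry in lockstep.
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)


print(retry_delay(0, retry_after="7"))
print(retry_delay(3))
```

Jitter matters for LLM serving in particular: synchronized retries after a shed event recreate the same overload the shedder was protecting against.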
Graceful Shutdown
Rolling deployments and autoscaling require graceful shutdown:
- Stop accepting new requests: Return 503 to new connections
- Drain active requests: Allow in-progress generation to complete
- Set timeout: Force-terminate after grace period
- Release resources: Free GPU memory, close connections
import asyncio
import logging
import time

logger = logging.getLogger(__name__)

class GracefulServer:
    async def shutdown(self, grace_period=30):
        """Gracefully shut down the server."""
        self.accepting = False  # Stop new requests
        # Wait for active requests with timeout
        deadline = time.monotonic() + grace_period
        while self.active_requests and time.monotonic() < deadline:
            await asyncio.sleep(0.1)
        if self.active_requests:
            logger.warning(f"Force-closing {len(self.active_requests)} requests")
            # Iterate over a copy in case cancel() removes entries
            for req in list(self.active_requests):
                req.cancel()
Full implementation: See reference/09_llm_deployment_code.md
Kubernetes Deployment Patterns
LLM workloads have unique Kubernetes requirements:
- GPU scheduling: Request GPU resources, handle node affinity
- Slow startup: Models take minutes to load; use appropriate health probes
- Memory requirements: Set resource limits carefully to prevent OOM
- Autoscaling: Scale on queue depth, not CPU (GPU is the constraint)
# Key Kubernetes patterns for LLM inference
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "96Gi" # More than GPU memory for safety
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180 # Model loading time
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 300 # Even longer for liveness
          periodSeconds: 30
---
# Scale on queue depth, not CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: llm_queue_depth
      target:
        type: AverageValue
        averageValue: "10" # Scale when queue > 10
Full implementation: See reference/09_llm_deployment_code.md
What people do: Use simple TCP or HTTP health checks that return 200 if the server is listening.
Why it fails: LLM servers can be “up” but non-functional—GPU memory exhausted, model weights corrupted, or inference stuck. Kubernetes keeps routing traffic to broken replicas, causing cascading timeouts.
Fix: Implement deep health checks that verify actual inference capability. Include a lightweight inference test (e.g., 10-token completion) in your readiness probe. Set appropriate timeouts (30-60s) for these probes. Use liveness probes to restart truly stuck processes, but be conservative—killing a loaded server makes things worse.
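A deep health check along these lines can be sketched as follows. It assumes a hypothetical `generate(prompt, max_tokens)` callable that wraps the serving engine and returns text; the timeout default and the "ping" prompt are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def deep_health_check(generate, timeout_s: float = 30.0) -> bool:
    """Return True only if a tiny generation finishes in time with output.

    `generate` is a hypothetical callable (prompt, max_tokens) -> str.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(generate, "ping", 10)  # ~10-token completion
    try:
        text = future.result(timeout=timeout_s)
        return isinstance(text, str) and len(text) > 0
    except FutureTimeout:
        return False  # inference is stuck or too slow to be useful
    except Exception:
        return False  # e.g. out-of-memory or a crashed worker
    finally:
        # Do not block the probe on a hung generation thread.
        pool.shutdown(wait=False)


print(deep_health_check(lambda prompt, max_tokens: "pong"))
```

A readiness endpoint would return 200 when this check passes and 503 otherwise; running it on every probe costs a few tokens of GPU time, so the probe period should reflect that.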
Key Takeaways
Memory bandwidth is the inference bottleneck - During decode, GPUs spend most time loading weights, not computing. Batching, quantization, and caching all help by increasing useful work per byte loaded.
KV cache dominates memory at scale - For long contexts or high concurrency, KV cache memory exceeds model weight memory. PagedAttention eliminates 60-80% memory waste through block-based allocation.
Continuous batching provides 5-10x throughput - Unlike static batching, continuous batching eliminates head-of-line blocking by adding requests as others complete, maximizing GPU utilization.
Quantization works because networks are overdetermined - INT4 quantization reduces memory 4x while preserving most model capability. The function computed depends on the weight manifold, not exact values.
Know your latency metrics: TTFT vs ITL - Time to First Token (prefill-dominated) and Inter-Token Latency (decode-dominated) require different optimizations. Prefix caching helps TTFT; batch size affects ITL.
Summary
LLM deployment is fundamentally different from traditional ML serving. The unique characteristics of autoregressive generation—memory bandwidth bounds, growing KV cache, token-by-token processing—require specialized techniques.
Key Insights
Memory bandwidth is the bottleneck: During decode, GPUs spend most of their time loading weights, not computing. This explains why batching, quantization, and speculative decoding all help—they increase the useful work done per byte loaded.
The KV cache dominates at scale: For long contexts or high concurrency, KV cache memory exceeds model weight memory. PagedAttention’s block-based management was revolutionary because it eliminated the 60-80% memory waste of naive allocation.
Continuous batching is essential: The shift from static to continuous batching provides 5-10x throughput improvement by eliminating head-of-line blocking and maximizing GPU utilization.
Quantization works because neural networks are overdetermined: The function computed depends on the weight manifold, not exact values. INT4 quantization preserves most of this manifold while reducing memory 4x.
Inference is a queueing theory problem. Batching, KV cache paging, prefill/decode disaggregation, and prefix caching are scheduling policies over a scarce GPU resource. Optimizing serving means optimizing the queue, not just the kernel.
Architecture Summary
Connections to Other Chapters
- Chapter 5 (LLM Foundations): The transformer architecture, attention mechanism, and KV cache explained here drive all deployment constraints
- Chapter 6 (Prompt Engineering): Understanding deployment costs helps design efficient prompts that minimize tokens while maintaining quality
- Chapter 7 (RAG Systems): RAG often requires serving embedding models alongside LLMs; similar deployment principles apply
- Chapter 8 (Agentic Systems): Agents create multi-turn conversations that accumulate KV cache; understanding memory growth is essential
- Chapter 27 (Performance Engineering): Covers deeper GPU optimization, profiling, and CUDA-level tuning
- Chapter 32 (Cost Engineering): Extends cost analysis to full TCO, including engineering time and opportunity cost
Practical Exercises
Memory Budget Calculation: For a 13B parameter model, calculate the maximum concurrent requests you can serve at 4K, 8K, and 16K context lengths on an A100 80GB. Use both FP16 and INT4 quantization scenarios. What’s the break-even context length where KV cache exceeds model weight memory?
Batching Experiment: Deploy vLLM and measure throughput at batch sizes 1, 2, 4, 8, 16, 32. Plot throughput vs. latency. At what batch size does throughput plateau? What limits further scaling?
Quantization Quality Evaluation: Compare a base model vs. AWQ INT4 quantized version on your specific use case. Measure both perplexity on representative data and task-specific quality (accuracy, human preference). Is the quality loss acceptable for your application?
Cost Comparison: For your workload (estimate request volume, input/output lengths), calculate the monthly cost of: (a) OpenAI API, (b) self-hosted on A100, (c) self-hosted on H100 with FP8. Include realistic ops overhead for self-hosting. At what volume does each option become optimal?
Speculative Decoding Implementation: Using a small (1B) and large (7B) model from the same family, implement speculative decoding. Measure acceptance rate and end-to-end speedup. How does K (speculation depth) affect the trade-off?
Self-Assessment Checkpoint
Conceptual Questions
Q1. [IC2] Why is LLM generation memory-bandwidth bound rather than compute-bound? What does this mean for optimization strategies?
Answer
During generation, each forward pass multiplies model weights against a single token’s hidden state (matrix-vector multiplication). The compute per weight element is tiny, but all weights must be loaded from GPU memory. On an H100, peak compute is ~2000 TFLOPS (dense FP8) but memory bandwidth is ~3.35 TB/s. The arithmetic intensity (operations per byte) of generation is ~1-2, far below the ~600 the hardware can sustain. This means: (1) Adding more compute doesn’t help, because the GPU sits idle waiting for memory. (2) Batching helps: load weights once, use them for multiple requests. (3) Quantization helps: smaller weights mean less to load. (4) KV caching is essential: it avoids recomputing attention states.
Q2. [IC2] A 7B model in FP16 needs 14GB for weights. Why might you need 80GB of GPU memory to serve it at 4K context?
Answer
KV cache: For a typical 7B model architecture (32 layers, 32 heads, 128 dim per head), KV cache per request at 4K context in FP16 is: 2 × 32 × 4096 × 32 × 128 × 2 bytes = 2.15 GB per request. With model weights (14GB) + 20-30 concurrent requests × 2.15GB ≈ 14 + 43-65 GB. Add activations, framework overhead, and buffer for continuous batching, and you’re at 80GB. At long contexts or high concurrency, KV cache dominates memory usage.
Q3. [Senior] Explain the difference between Time to First Token (TTFT) and Inter-Token Latency (ITL). Which optimizations affect each?
Answer
TTFT is the delay before generation starts, dominated by prefill (processing the prompt). Affected by: prompt length, batching wait time, prefill compute capacity, prefix caching (can dramatically reduce TTFT if the prompt shares a prefix with a cached entry). ITL is the time between consecutive tokens during generation, dominated by decode memory bandwidth. Affected by: batch size (more batched requests = each request gets fewer cycles), quantization (faster weight loads), model size, hardware memory bandwidth. You can have low TTFT but high ITL (small prompts, many concurrent requests) or high TTFT but low ITL (long prompts, few concurrent requests).
Q4. [Senior] How does PagedAttention improve over fixed KV cache allocation? When does it matter most?
Answer
Traditional systems pre-allocate KV cache for max context length per request. If requests vary in length (common), short requests waste memory. PagedAttention stores KV cache in fixed-size blocks (pages), allocated as needed. Benefits: (1) No wasted memory for short sequences. (2) Can serve more concurrent requests with same memory. (3) Enables prefix sharing: common prefixes share the same physical pages. Matters most when: (1) High request concurrency. (2) Variable sequence lengths. (3) Many requests share common prefixes (same system prompt). Production improvement: 2-4x throughput is common.
Q5. [Staff] You’re deploying a 70B model. Should you use tensor parallelism across 2 GPUs or pipeline parallelism? What factors determine this?
Answer
Tensor parallelism (TP): Splits each layer across GPUs. Every token requires all GPUs to communicate. Lower latency per token (all GPUs work in parallel on each token), but high communication overhead (all-reduce after each layer). Pipeline parallelism (PP): Different layers on different GPUs. Lower communication (only between pipeline stages), but can have bubbles (idle time). For latency-sensitive serving, TP is usually better because each token completes faster. For throughput-focused batch processing, PP can be better because multiple microbatches fill the pipeline bubbles. 70B on 2 GPUs: TP is the typical choice for serving. NVLink bandwidth is critical; without it, TP communication becomes the bottleneck.
Spot the Problem
Problem 1. [IC2] A deployment configuration:
model = vllm.LLM(
    model="llama-7b",
    max_num_batched_tokens=32768,
    gpu_memory_utilization=0.95
)
What issues might this cause?
Answer
Issues: (1) 95% GPU memory utilization leaves very little buffer. CUDA operations occasionally need temporary memory, which can cause OOM under load. Safe: 0.85-0.90. (2) max_num_batched_tokens=32768 is very high. If requests have long prompts, this allows huge prefill batches that may cause memory spikes or time out individual requests. (3) No explicit max_model_len. Defaults vary by version and might not match your use case. Better: Start conservative (0.85 utilization), tune max_num_batched_tokens based on your request distribution, set explicit max_model_len.
Problem 2. [Senior] Cost analysis claiming self-hosting saves 10x:
API cost: $0.01 per request
Self-hosted cost: $2000/month for GPU
At 200K requests/month: API = $2000, Self-hosted = $2000
At 2M requests/month: API = $20000, Self-hosted = $2000 → 10x savings!
Answer
Missing costs: (1) Operations overhead: engineer time for setup, maintenance, on-call. At $150/hr loaded cost, 20 hours/month = $3000. (2) Infrastructure: load balancing, monitoring, logging, networking. (3) Opportunity cost: engineering time not spent on product. (4) Quality assumption: the analysis implicitly assumes the self-hosted model matches API quality. That may not be true, especially for hard tasks. (5) Utilization: assumes 100% GPU utilization. Realistic is 70-80% for variable traffic. (6) Scaling: what happens at 5M requests? May need another GPU. Real comparison: Should be (GPU + ops + infra) vs. API, and account for quality differences.
Problem 3. [Staff] A team’s quantization approach:
“We quantized our model to INT4 and ran our eval suite. Accuracy dropped from 92% to 90%. We decided 2% loss is acceptable.”
Answer
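A small sketch shows how the team's aggregate numbers can hide a much larger drop in one slice. The data is synthetic and chosen for illustration: an assumed 800 easy and 200 hard examples, with correct counts picked so the overall accuracy reproduces the reported 92% and 90%. The function names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_stratum(records):
    """records: iterable of (stratum, is_correct). Returns accuracy per stratum
    plus an "overall" entry computed over all records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for stratum, is_correct in records:
        for key in (stratum, "overall"):
            total[key] += 1
            correct[key] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


def make_records(easy_correct, hard_correct, n_easy=800, n_hard=200):
    """Synthetic eval results, for illustration only."""
    easy = [("easy", i < easy_correct) for i in range(n_easy)]
    hard = [("hard", i < hard_correct) for i in range(n_hard)]
    return easy + hard


baseline = accuracy_by_stratum(make_records(easy_correct=760, hard_correct=160))
quantized = accuracy_by_stratum(make_records(easy_correct=760, hard_correct=140))

for key in ("overall", "easy", "hard"):
    print(f"{key:8s} {baseline[key]:.2f} -> {quantized[key]:.2f}")
```

In this constructed case the overall score moves from 0.92 to 0.90 while the hard slice falls from 0.80 to 0.70, which is the pattern a single aggregate number cannot reveal.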
Flawed analysis: (1) Aggregate accuracy hides the distribution. It might be 95%→95% on easy cases but 80%→70% on hard cases, so check performance by difficulty and category. (2) The eval suite may not cover edge cases, and quantization can amplify errors on unusual inputs. (3) Perplexity should also be checked because it correlates better with subtle quality issues. (4) Production distribution matters: if hard cases are 20% of production traffic, a 10-point drop on hard cases is significant. (5) The comparison should be run on a sample of actual production traffic. Better: Stratified analysis by case difficulty, perplexity comparison, production traffic evaluation, and user-facing quality assessment.
Design Exercises
Exercise 1. [Senior] Design the deployment architecture for a chatbot that needs to handle 10,000 concurrent users, with P95 latency under 2 seconds for the first token. Users have conversations averaging 10 turns. The model is 13B parameters. Outline your hardware, software stack, and capacity planning.
Guidance
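The arithmetic in this guidance can be checked with a short script. This is a sketch under the guidance's own assumptions: the 30% active fraction, the ~50GB KV cache budget per A100, and the per-request KV cache sizes are taken as given rather than derived, and the function names are hypothetical.

```python
import math


def weights_gb(params_billion, bytes_per_param):
    """Approximate weight memory in GB (1B parameters at 1 byte is about 1GB)."""
    return params_billion * bytes_per_param


def gpus_needed(concurrent_users, active_fraction, kv_budget_gb, kv_per_request_gb):
    """GPUs required when the KV cache budget limits concurrency per GPU."""
    active_requests = round(concurrent_users * active_fraction)
    requests_per_gpu = math.floor(kv_budget_gb / kv_per_request_gb)
    return math.ceil(active_requests / requests_per_gpu)


print(weights_gb(13, 2.0))   # FP16 weights
print(weights_gb(13, 0.5))   # INT4 weights, rounded up to ~7GB in the text

# KV cache per request: ~0.5GB (INT8) or ~1GB (FP16), as stated in the guidance.
print(gpus_needed(10_000, 0.3, kv_budget_gb=50, kv_per_request_gb=0.5))
print(gpus_needed(10_000, 0.3, kv_budget_gb=50, kv_per_request_gb=1.0))
```

Because the GPU count scales inversely with the per-request KV cache size, the context-length and KV-precision assumptions dominate the estimate and are worth measuring before buying hardware.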
Key calculations: (1) At 10K concurrent users, assume 30% are actively waiting for a response at any moment, which gives 3K active requests. (2) A 13B model at FP16 is 26GB. With INT4 quantization, the weights are ~7GB. (3) KV cache for a 10-turn conversation (~4K context) is ~1GB per request in FP16, or ~0.5GB in INT8. (4) Per A100 80GB: ~7GB for the model plus ~50GB for KV cache supports ~50 concurrent requests with FP16 KV cache or ~100 with INT8 KV cache. That means ~30-60 A100s for 3K concurrent requests. Software: vLLM with continuous batching, behind a load balancer that distributes by user session (session affinity for KV cache reuse). Scaling: Start with the peak estimate, monitor, and right-size. Use auto-scaling if traffic is variable.
Exercise 2. [Staff] You’re responsible for reducing LLM serving costs by 50% without impacting user-facing quality. Current setup: API provider, $500K/month, mix of tasks (customer support, content generation, data extraction). Design your cost reduction strategy with specific actions, measurements, and risk mitigation.
Guidance
Strategies: (1) Task routing: not all tasks need the best model. Route simple tasks to cheaper or smaller models. This requires building a classifier and validating quality per task. (2) Caching: identify common queries and cache their responses. Effective for support FAQs. (3) Prompt optimization: shorter prompts mean fewer input tokens. Be aggressive about trimming examples and instructions. (4) Output optimization: constrain response length where appropriate. (5) Self-hosting evaluation: calculate the break-even point for the highest-volume tasks. Risk mitigation: (1) A/B test each change against the current baseline. (2) Monitor quality metrics per task category. (3) Use a staged rollout (10% → 50% → 100%). (4) Keep a kill switch to revert to the original setup. Measurement: Track cost per task, quality per task, and user satisfaction. Target: 50% cost reduction with <2% quality regression on any metric.
See Also
- Chapter 5: LLM/NLP Foundations - Transformer architecture and attention mechanisms that determine inference characteristics
- Chapter 21: Performance Engineering - Advanced optimization techniques and latency engineering strategies
- Chapter 26: Cost Engineering - Economic analysis of deployment decisions and total cost of ownership
- Appendix K: LLM Provider Comparison - Detailed comparison of managed API providers vs self-hosting options
Recommended Reading
Essential (Read These)
“Efficient Memory Management for Large Language Model Serving with PagedAttention” (Kwon et al., 2023). The vLLM paper that introduced PagedAttention. Essential reading for understanding modern inference systems.
“FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning” (Dao, 2023). The attention optimization that enables long contexts. Well-written explanation of the tiling algorithm.
“GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers” (Frantar et al., 2022). Foundation for understanding weight quantization. Explains the Optimal Brain Quantization approach.
Deep Dives (For Specialists)
“AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration” (Lin et al., 2023). State-of-the-art quantization technique. Explains the activation-aware approach.
“Efficiently Scaling Transformer Inference” (Pope et al., 2022). Google’s analysis of inference efficiency. Excellent derivation of memory bandwidth bounds.
“Fast Inference from Transformers via Speculative Decoding” (Leviathan et al., 2022). Original speculative decoding paper. Rigorous analysis of acceptance rates and speedup bounds.
Systems and Infrastructure
vLLM Documentation (docs.vllm.ai). Comprehensive guide to the most popular open-source inference server.
TensorRT-LLM Documentation (NVIDIA). Deep technical details on NVIDIA’s optimized inference stack.
Historical Context
“Orca: A Distributed Serving System for Transformer-Based Generative Models” (Yu et al., 2022). Introduced continuous batching. Foundational for understanding modern scheduling.