Chapter 9: LLM Deployment & Infrastructure

Keywords

deployment, inference, KV cache, quantization, vLLM, batching, GPU, PagedAttention, speculative decoding, self-hosting

Introduction

In March 2023, a startup launched their AI-powered legal document analysis tool. During development, their prototype felt snappy—responses streamed in under a second. On launch day, they watched their system collapse. Response times climbed from 500 milliseconds to 30 seconds. GPU memory overflowed, causing crashes. Their cloud bill for that single day exceeded their projected monthly budget. The model worked perfectly; the infrastructure failed catastrophically.

This story illustrates a harsh truth about LLM deployment: training a model is the beginning, not the end. The techniques that make inference fast, efficient, and economical are fundamentally different from those used in training. Understanding these techniques—why they exist, how they work mathematically, and when to apply them—separates production-ready systems from expensive science projects.

LLM inference presents unique challenges compared to traditional ML serving. A classification model processes fixed-size inputs and produces fixed-size outputs in a single forward pass. An LLM generates tokens one at a time, each requiring a full forward pass through billions of parameters. A 1000-token response requires 1000 forward passes—and each pass must load the entire model from memory. The computational pattern is fundamentally different, and so are the optimization strategies.

This chapter examines the engineering of LLM deployment: the mathematical foundations of inference optimization, the systems that implement them, and the decision frameworks for choosing among alternatives. We focus primarily on self-hosted deployment because that’s where you need the deepest understanding—but even if you use managed APIs, these concepts help you reason about performance, cost, and architectural decisions.

The Core Challenge: Memory Bandwidth

To understand why LLM inference is hard, you need to understand one number: the ratio of compute to memory bandwidth on modern GPUs.

An NVIDIA H100 GPU has a peak rating of 1979 TFLOPS for FP16 tensor operations (NVIDIA's headline figure, which assumes structured sparsity; dense throughput is roughly half that). It can transfer 3.35 TB/s from its HBM3 memory. At the peak figure, the GPU can perform approximately 600 floating-point operations for every byte loaded from memory.

Now consider LLM inference. During token generation, each forward pass multiplies the model’s weight matrices by small vectors (the current token’s hidden states). For a 7 billion parameter model in FP16, this means loading 14 GB of weights to perform matrix-vector multiplications on vectors of size ~4096. The computation is trivial compared to the data transfer.

The arithmetic intensity (operations per byte loaded) of this workload is roughly 1-2, not 600. LLM generation is memory-bandwidth bound, not compute bound. This single insight explains nearly every optimization technique in this chapter:

Batching: Process multiple requests together so the same weight load serves multiple computations
Quantization: Reduce weight size to load fewer bytes
KV caching: Avoid recomputing attention by storing intermediate results
Continuous batching: Keep GPUs busy by dynamically managing requests

Understanding that inference is memory-bound changes how you think about every optimization decision.
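The gap between a workload's arithmetic intensity and the GPU's compute-to-bandwidth ratio can be estimated with a few lines of arithmetic. The sketch below is illustrative only: the function name and the 4096×4096 layer shape are assumptions, it counts two FLOPs per multiply-add, and it counts bytes for the weights plus input and output activations while ignoring caches and kernel overheads.

```python
def arithmetic_intensity(batch_size: int, d_in: int, d_out: int,
                         bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for one (batch_size x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * batch_size * d_in * d_out               # multiply + add per weight per row
    weight_bytes = d_in * d_out * bytes_per_value       # loaded once, shared by the batch
    activation_bytes = batch_size * (d_in + d_out) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


# H100 figures quoted in the text: 1979 TFLOPS peak, 3.35 TB/s bandwidth
ridge_point = 1979e12 / 3.35e12   # FLOPs per byte needed to become compute-bound

for batch in (1, 8, 64, 512):
    ai = arithmetic_intensity(batch, 4096, 4096)
    bound = "memory-bound" if ai < ridge_point else "compute-bound"
    print(f"batch={batch:4d}  intensity={ai:7.1f} FLOPs/byte  ({bound})")
```

Under these assumptions a single request sits near 1 FLOP per byte, and intensity rises almost in proportion to batch size, which is the effect batching exploits.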

Mental Model: Inference is a queue theory problem

Most engineers reach for inference optimization expecting a compute problem—more FLOPS, faster kernels. The reality is closer to an airport boarding gate: requests arrive at unpredictable rates, share an expensive scarce resource (the GPU), and have wildly varying service times (prefill versus decode, short versus long outputs). KV cache management, continuous batching, prefill/decode disaggregation, and prefix caching are all scheduling policies, not math tricks. Heuristic: when latency or cost surprises you, draw the queue first—arrival rate, service time, queue depth—before touching the model.

Reframing inference as a queue theory problem rather than a compute problem is what unlocks the 10-20x throughput wins modern serving stacks routinely achieve.

What You’ll Learn

The mathematics of LLM inference: prefill vs. decode, KV cache growth, memory bandwidth limits
How continuous batching and PagedAttention achieve 10-20x throughput improvements
Quantization theory: why INT4 weights often preserve quality, and when they don’t
Decision frameworks for GPU selection, inference servers, and self-hosting vs. APIs
How leading providers (Anthropic, OpenAI, Together, Fireworks) architect their serving systems
Cost modeling and optimization for sustainable deployment

Prerequisites

Understanding of transformer architecture and attention mechanisms (Chapter 5)
Familiarity with GPU computing concepts (CUDA, memory hierarchy)
Basic knowledge of containerization and orchestration
Python programming

The Physics of LLM Inference

Two Phases, Two Bottlenecks

LLM inference has two distinct computational phases with fundamentally different characteristics.

Prefill (prompt processing): The model processes all input tokens simultaneously, building the initial attention state. This phase involves dense matrix-matrix multiplications—multiplying weight matrices against the entire prompt sequence. The arithmetic intensity is high (many operations per byte loaded), making this phase compute-bound. Prefill time scales approximately linearly with prompt length, though the quadratic attention computation becomes significant for very long prompts.

Decode (generation): The model generates tokens one at a time. Each token requires a forward pass where weight matrices multiply against a single vector (the current token’s hidden state). The arithmetic intensity drops dramatically—we load the entire model to generate one token. This phase is memory-bandwidth bound. Decode time per token is nearly constant regardless of how many tokens have been generated (with a slight increase due to growing KV cache reads).

This asymmetry has profound implications. A 2000-token prompt might complete prefill in perhaps 200ms. Generating 500 tokens in decode might take 5000ms: 25x longer for 1/4 the tokens. The bottleneck isn’t computation; it’s moving 14+ GB of weights from memory to compute units for each generated token.

The Mathematics of Memory Bandwidth Limits

Let’s derive why decode is memory-bound. Consider an H100 GPU (3.35 TB/s bandwidth) serving a 7B parameter model in FP16 (14 GB weights).

Time to load model weights once: 14 GB / 3.35 TB/s = 4.2 ms

Theoretical maximum generation speed (single request): 1 token / 4.2 ms = 238 tokens/second

Observed speed in practice: ~150-200 tokens/second (due to overheads, KV cache reads)

This is the fundamental ceiling: you cannot generate tokens faster than you can read the model weights from memory. No amount of additional compute helps—the H100’s massive 1979 TFLOPS sits mostly idle during single-request generation.

This analysis reveals why batching is so critical. If you process B requests simultaneously, you load the weights once and use them B times. The effective bandwidth cost per token becomes (14 GB / B), and throughput scales nearly linearly with batch size—until you hit other limits.
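The ceiling derived above, and its scaling with batch size, can be reproduced with a short calculation. This is a rough sketch under the same simplification as the text: weight loading is treated as the only cost, so KV cache reads, compute limits, and scheduling overheads are ignored. The function and constant names are illustrative.

```python
def max_decode_tokens_per_second(weight_bytes: float,
                                 bandwidth_bytes_per_s: float,
                                 batch_size: int = 1) -> float:
    """Upper bound on decode throughput when weight loading is the only cost."""
    seconds_per_iteration = weight_bytes / bandwidth_bytes_per_s  # one full weight read
    return batch_size / seconds_per_iteration  # one token per request per read


WEIGHTS_7B_FP16 = 14e9     # 7B parameters x 2 bytes
H100_BANDWIDTH = 3.35e12   # bytes per second

for b in (1, 4, 16, 64):
    ceiling = max_decode_tokens_per_second(WEIGHTS_7B_FP16, H100_BANDWIDTH, b)
    print(f"batch={b:3d}  ceiling={ceiling:8.0f} tokens/s")
```

The linear scaling holds only until another limit takes over, such as KV cache capacity or the point where the workload becomes compute-bound.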

The KV Cache: Trading Memory for Compute

The attention mechanism in transformers computes:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where Q (queries) represents the current token, and K (keys) and V (values) represent all tokens seen so far. Without caching, generating the 500th token would require recomputing the key and value projections for all 499 earlier tokens, and repeating that work at every step makes total generation cost quadratic in sequence length.

The KV cache stores the K and V projections for all previous tokens. When generating a new token, we only compute its Q and look up the cached K, V. This transforms attention from O(n^2) to O(n) per token, but introduces a memory cost that grows linearly with sequence length.

KV Cache Size Formula: \[\text{KV Cache per request} = 2 \times L \times S \times H \times D \times \text{bytes\_per\_value}\]

Where:

2: Key and Value tensors
L: Number of transformer layers
S: Sequence length (prompt + generated tokens)
H: Number of attention heads (for models that use grouped-query attention, the number of key/value heads)
D: Dimension per head (typically hidden_dim / num_heads)
bytes_per_value: 2 for FP16, 1 for INT8

Example: LLaMA-2 7B with 4K context in FP16: \[2 \times 32 \times 4096 \times 32 \times 128 \times 2 = 2.15 \text{ GB per request}\]

At 4K context, the KV cache for a single request (2.15 GB) is already more than half the size of the model weights stored at INT4 precision (3.5 GB), and two concurrent requests exceed it. This is why long-context serving is expensive: the memory cost grows linearly with context length and linearly with concurrent requests.

Memory Growth with Context Length (LLaMA-2 7B, FP16 KV cache). Model weights add 14GB (FP16) or 3.5GB (INT4) to all scenarios.
Context Length	KV Cache Per Request	With 10 Concurrent	With 50 Concurrent
512	268 MB	2.7 GB	13.4 GB
2,048	1.07 GB	10.7 GB	53.5 GB
4,096	2.15 GB	21.5 GB	107 GB
8,192	4.3 GB	43 GB	215 GB

This table illustrates why serving long contexts at high concurrency requires either massive GPU memory, aggressive KV cache compression, or accepting lower concurrency.
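The table values follow directly from the KV cache size formula. The sketch below reproduces them; the function name and the configuration dictionary are illustrative, and the LLaMA-2 7B shape (32 layers, 32 heads, head dimension 128) is taken from the worked example above.

```python
def kv_cache_bytes(num_layers: int, seq_len: int, num_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one request: 2 (K and V) x L x S x H x D x bytes."""
    return 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_value


LLAMA2_7B = dict(num_layers=32, num_heads=32, head_dim=128)

for context in (512, 2048, 4096, 8192):
    per_request = kv_cache_bytes(seq_len=context, **LLAMA2_7B)
    print(f"{context:5d} tokens: {per_request / 1e9:5.2f} GB per request, "
          f"{10 * per_request / 1e9:6.1f} GB at 10 concurrent, "
          f"{50 * per_request / 1e9:6.1f} GB at 50 concurrent")
```

For an INT8 KV cache, passing bytes_per_value=1 halves every figure.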

Key Performance Metrics

Understanding inference metrics is essential for reasoning about systems and comparing configurations.

Time to First Token (TTFT): The delay before generation starts. Dominated by prefill time. Users perceive this as “thinking time.” For interactive applications, TTFT under 500ms feels responsive; over 2 seconds feels sluggish.

Inter-Token Latency (ITL) or Time Per Output Token (TPOT): The average time between consecutive generated tokens. Determines streaming speed. For readable streaming, ITL under 50ms (20 tokens/second) feels natural.

Throughput: Total tokens generated per second across all concurrent requests. The key metric for cost efficiency. A system generating 1000 tokens/second across 10 concurrent requests is 10x more efficient than one generating 100 tokens/second for a single request.

Tokens Per Dollar: The ultimate efficiency metric. Combines throughput with infrastructure cost.

These metrics trade off against each other. Increasing batch size improves throughput but increases TTFT (new requests wait for batch slots) and ITL (GPU time shared among more requests). Optimizing for one metric often degrades others.
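As a minimal sketch of how these metrics might be computed from logged timestamps, the example below assumes each request records its arrival time and the time at which each token was emitted. The RequestTrace structure and function names are hypothetical and not part of any particular serving framework.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrival_time: float        # seconds
    token_times: list[float]   # timestamp of each generated token, in seconds


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between arrival and the first emitted token."""
    return trace.token_times[0] - trace.arrival_time


def mean_itl(trace: RequestTrace) -> float:
    """Average gap between consecutive tokens (0.0 if fewer than two tokens)."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def throughput(traces: list[RequestTrace]) -> float:
    """Total tokens per second across all requests in the observed window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.arrival_time for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


trace = RequestTrace(arrival_time=0.0, token_times=[0.40, 0.45, 0.50, 0.55])
print(f"TTFT={ttft(trace):.2f}s  ITL={mean_itl(trace) * 1000:.0f}ms  "
      f"throughput={throughput([trace]):.1f} tokens/s")
```

In production, percentiles (p50, p99) of TTFT and ITL are usually more informative than means, since tail latency is what users notice.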

The KV Cache Revolution: PagedAttention

The Memory Fragmentation Problem

Before 2023, LLM serving systems pre-allocated contiguous memory for each request’s KV cache, sized for the maximum possible sequence length. A system configured for 4K context would allocate 2.15 GB per request, even if the actual sequence used only 500 tokens. Most of each reservation sat unused (internal fragmentation).

Consider a server with 80GB GPU memory, 14GB used by model weights, leaving 66GB for KV cache. With conservative pre-allocation:

Maximum concurrent requests: 66 GB / 2.15 GB ≈ 30 requests
But average sequences are 1K tokens, using only 537 MB
Actual memory usage: 30 × 537 MB = 16 GB
Wasted memory: 50 GB (76%)

Even worse, variable-length sequences caused external fragmentation—freed memory was scattered in non-contiguous chunks, making it unusable for new requests.

PagedAttention: Virtual Memory for KV Cache

The vLLM paper (Kwon et al., 2023) introduced PagedAttention, applying operating system virtual memory concepts to KV cache management. The insight: manage KV cache as small, fixed-size blocks rather than contiguous per-request allocations.

Instead of pre-allocating maximum sequence length, PagedAttention:

1. Divides KV cache into fixed-size blocks (typically 16 tokens each)
2. Allocates blocks on-demand as sequences grow
3. Maintains a block table mapping logical sequence positions to physical memory locations
4. Frees blocks immediately when requests complete

Traditional vs PagedAttention Memory Allocation

The result is near-perfect memory utilization. Memory waste drops from 60-80% to under 4%. The same hardware can serve 2-4x more concurrent requests.

Block Tables: Each request maintains a block table mapping logical positions to physical block IDs:

# Conceptual block table structure
request_1_blocks = [
    {"logical_range": (0, 15), "physical_block": 7},
    {"logical_range": (16, 31), "physical_block": 23},
    {"logical_range": (32, 47), "physical_block": 12},
]

During attention computation, the system translates logical sequence positions to physical memory locations using these tables. The indirection adds negligible overhead compared to the memory efficiency gains.
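The allocation and translation steps can be sketched as a toy allocator. This is a simplified illustration, not vLLM's implementation: the BlockAllocator class and its method names are hypothetical, and it tracks only block IDs rather than actual GPU memory.

```python
BLOCK_SIZE = 16  # tokens per block, as in the text


class BlockAllocator:
    """Toy on-demand block allocator with per-request block tables."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.seq_lens: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.seq_lens.get(request_id, 0)
        if length == len(table) * BLOCK_SIZE:        # every allocated block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[request_id] = length + 1

    def physical_location(self, request_id: str, position: int) -> tuple[int, int]:
        """Translate a logical token position to (physical block, offset in block)."""
        block = self.block_tables[request_id][position // BLOCK_SIZE]
        return block, position % BLOCK_SIZE

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.seq_lens.pop(request_id, None)


allocator = BlockAllocator(num_physical_blocks=8)
for _ in range(40):                                  # a 40-token sequence needs 3 blocks
    allocator.append_token("request_1")
print(allocator.block_tables["request_1"])
print(allocator.physical_location("request_1", 35))
```

Because blocks are allocated only when the previous one fills, the waste per request is bounded by one partially filled block.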

Copy-on-Write for Shared Prefixes

PagedAttention enables another optimization: copy-on-write sharing for common prefixes.

Many applications use the same system prompt for all requests. Without sharing, each of 50 concurrent requests stores identical KV cache entries for the shared prefix—pure waste.

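With PagedAttention, requests sharing a prefix can share the same physical blocks.

For example, suppose the shared system prompt occupies 13 blocks (about 200 tokens at 16 tokens per block) and each request’s unique content is 300 tokens (19 blocks), so each request spans 32 blocks. Across 50 concurrent requests, total memory drops from 50×32 = 1600 blocks without sharing to 13 + 50×19 = 963 blocks with sharing, a 40% reduction.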
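The block arithmetic in this example can be checked with a short sketch. The function names are illustrative; the inputs (a 13-block shared prefix, 300 unique tokens per request, 50 requests, 16-token blocks) come from the figures in the text.

```python
import math

BLOCK_SIZE = 16  # tokens per block


def blocks_needed(num_tokens: int) -> int:
    """Number of fixed-size blocks required to hold num_tokens."""
    return math.ceil(num_tokens / BLOCK_SIZE)


def total_blocks(num_requests: int, prefix_blocks: int,
                 unique_tokens: int, share_prefix: bool) -> int:
    """Total physical blocks with or without sharing the common prefix."""
    unique_blocks = blocks_needed(unique_tokens)
    if share_prefix:
        return prefix_blocks + num_requests * unique_blocks   # prefix stored once
    return num_requests * (prefix_blocks + unique_blocks)     # prefix stored per request


without = total_blocks(50, prefix_blocks=13, unique_tokens=300, share_prefix=False)
shared = total_blocks(50, prefix_blocks=13, unique_tokens=300, share_prefix=True)
print(f"{without} blocks -> {shared} blocks ({1 - shared / without:.0%} reduction)")
```

The saving grows with both the prefix length and the number of concurrent requests that share it.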

Practical Impact: Real-World Benchmarks

The vLLM paper demonstrated significant improvements over previous systems:

Comparison	vLLM Improvement	Notes
vs HuggingFace Transformers (baseline)	14-24x throughput	Naive batching baseline
vs HuggingFace TGI	2.2-3.5x throughput	Both use continuous batching
vs FasterTransformer/Orca	2-4x throughput	At similar latency
Memory utilization	~97% vs ~35% baseline	Near-zero KV cache waste

Note: The improvement magnitude depends heavily on workload characteristics. High-concurrency scenarios with shared prefixes see the largest gains. Interactive single-user scenarios may see smaller improvements. TGI v3 (2024) has closed some of this gap, particularly for long-context workloads with prefix caching.

Common Mistake: Ignoring KV Cache Sizing

What people do: Focus only on model weight memory when sizing GPUs, then wonder why inference fails at moderate concurrency.

Why it fails: KV cache grows with sequence length × batch size × layers × head dimension. A 70B model’s weights fit in about 40GB once quantized to 4 bits, but 32 concurrent 4K-context requests need another 50GB+ of KV cache. Memory exhaustion causes OOM crashes or aggressive eviction.

Fix: Calculate worst-case KV cache before deployment: kv_bytes = 2 × layers × hidden_dim × seq_len × batch_size × bytes_per_element. Use this to set max_concurrent_requests. Monitor cache hit rates and eviction rates in production.
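One way to turn the worst-case calculation into a concurrency limit is sketched below. The function name and the headroom_fraction parameter (memory held back for activations and allocator overhead) are assumptions for illustration; the right headroom depends on your serving stack.

```python
def max_concurrent_requests(gpu_memory_gb: float, weight_gb: float,
                            kv_gb_per_request: float,
                            headroom_fraction: float = 0.1) -> int:
    """Worst-case number of requests whose KV cache fits beside the weights."""
    usable_gb = gpu_memory_gb * (1 - headroom_fraction) - weight_gb
    return max(0, int(usable_gb // kv_gb_per_request))


# 80 GB GPU, 14 GB of FP16 weights, 2.15 GB of KV cache per 4K-context request
print(max_concurrent_requests(80, 14, 2.15, headroom_fraction=0.0))  # no headroom
print(max_concurrent_requests(80, 14, 2.15))                         # 10% held back
```

With paged KV cache allocation, actual concurrency is usually higher than this bound because most sequences are shorter than the maximum, but the worst case is the safe number to plan around.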

Continuous Batching: Maximizing GPU Utilization

Mental Model: “Inference Is a Queue Theory Problem”

Most engineers think about LLM inference as a compute problem: “How fast can we do matrix multiplications?” This framing leads to chasing faster GPUs and more FLOPS. It’s wrong.

LLM inference is fundamentally a queue theory problem: “How do we keep tokens flowing through the system with minimum waiting?” The bottleneck isn’t computation—it’s memory bandwidth and scheduling.

This reframing changes everything:

Traditional compute thinking: “We need more GPU horsepower” → Buy more A100s
Queue theory thinking: “Requests are waiting in line too long” → Optimize scheduling

Key queue theory concepts that apply directly:

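Arrival rate (λ): Requests per second hitting your system
Service rate (μ): Requests per second your system can complete (token throughput divided by average tokens per request)
Utilization (ρ = λ/μ): At ρ > 0.8, latency climbs steeply; in a simple M/M/1 queue, mean time in system grows as 1/(1 − ρ)
Little’s Law (L = λW): Average number of requests in the system equals arrival rate times the average time each request spends in it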

When you see high latency, don’t immediately add GPUs. Ask: “Is my queue poorly scheduled?” Continuous batching, PagedAttention, and priority queues are scheduling optimizations, not compute optimizations. Often, better scheduling on existing hardware outperforms naive scaling by 5-10x.
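To see how quickly latency grows with utilization, the sketch below uses the textbook M/M/1 queue (Poisson arrivals, exponential service times, one server). That model is an assumption chosen for simplicity: real LLM servers batch requests and have variable service times, so treat the numbers as qualitative. The function names are illustrative.

```python
def mm1_mean_latency(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system (waiting plus service) for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def mean_requests_in_system(arrival_rate: float, mean_latency: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate * mean_latency


SERVICE_RATE = 10.0  # requests per second the system can complete (assumed)

for utilization in (0.5, 0.8, 0.95, 0.99):
    lam = utilization * SERVICE_RATE
    latency = mm1_mean_latency(lam, SERVICE_RATE)
    in_system = mean_requests_in_system(lam, latency)
    print(f"rho={utilization:.2f}  mean latency={latency:6.2f}s  "
          f"requests in system={in_system:6.1f}")
```

In this model, moving from 80% to 95% utilization quadruples mean latency, which is why headroom matters more than peak throughput for interactive traffic.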

The Static Batching Problem

Traditional batching collects requests, processes them together, and waits for the entire batch to complete before returning results. This creates two problems for LLM serving:

Head-of-line blocking: If one request generates 500 tokens while others generate 50, the short requests wait for the long one. Response time becomes max(request_lengths), not mean.
Batch assembly latency: New requests must wait for the current batch to finish before joining. During low traffic, requests queue unnecessarily.

Consider a batch of 8 requests:

Request 1-7: Generate 50 tokens each (500ms)
Request 8: Generates 500 tokens (5000ms)

With static batching, all requests return after 5000ms. Requests 1-7 could have returned 4500ms earlier.

Continuous Batching: The Solution

Continuous batching (also called “iteration-level scheduling” or “in-flight batching”) processes one decode step at a time, allowing requests to enter and exit the batch at any iteration.

The key insight: during decode, we generate one token per request per iteration. There’s no mathematical requirement that all requests generate the same number of tokens before any can return.

The Mathematics of Continuous Batching

Let’s formalize the throughput advantage. Define:

B: Maximum batch size
T_decode: Time for one decode iteration (one token per request)
N_tokens(i): Number of tokens generated by request i

Static batching throughput: \[\text{Throughput}_{static} = \frac{\sum_{i=1}^{B} N_{tokens}(i)}{T_{decode} \times \max_{i}(N_{tokens}(i))}\]

Continuous batching throughput (ideal): \[\text{Throughput}_{continuous} = \frac{B}{T_{decode}}\]

The ratio depends on request length variance. If all requests generate exactly the same number of tokens, the approaches are equivalent. But real workloads have high variance—some requests are “tell me yes or no” while others are “write a detailed essay.”

Example calculation:

Batch of 8 requests: 7 generate 50 tokens, 1 generates 500 tokens
T_decode = 10ms per iteration

Static: (7×50 + 500) / (10ms × 500) = 850 tokens / 5000ms = 170 tokens/second
Continuous: During the first 50 iterations, 8 requests are active; during iterations 51-500, only 1 request is active. Total time is still 50×10ms + 450×10ms = 5000ms, but 7 slots are freed after iteration 50.

With continuous batching, those 7 freed slots can accept new requests. In steady state, throughput approaches the ideal B/T_decode = 800 tokens/second—a 4.7x improvement for this workload.
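The two throughput formulas can be evaluated directly for this workload. The function names below are illustrative, and the continuous figure is the ideal steady-state bound that assumes freed slots are refilled immediately.

```python
def static_batch_throughput(token_counts: list[int], t_decode_s: float) -> float:
    """Tokens per second when the whole batch waits for its longest request."""
    return sum(token_counts) / (t_decode_s * max(token_counts))


def ideal_continuous_throughput(batch_size: int, t_decode_s: float) -> float:
    """Tokens per second when every batch slot is always occupied."""
    return batch_size / t_decode_s


token_counts = [50] * 7 + [500]   # seven short requests, one long request
T_DECODE = 0.010                  # 10 ms per decode iteration

static = static_batch_throughput(token_counts, T_DECODE)
continuous = ideal_continuous_throughput(len(token_counts), T_DECODE)
print(f"static={static:.0f} tokens/s  continuous={continuous:.0f} tokens/s  "
      f"ratio={continuous / static:.1f}x")
```

The ratio shrinks toward 1 as request lengths become more uniform, matching the observation that the two approaches are equivalent when all requests generate the same number of tokens.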

Implementation: Iteration-Level Scheduling

Modern inference servers implement continuous batching with iteration-level schedulers:

# Conceptual iteration scheduler (simplified)
from collections import deque

EOS_TOKEN = 2  # Placeholder end-of-sequence token ID; the real value is model-specific


class IterationScheduler:
    def __init__(self, max_batch_size, memory_manager):
        self.max_batch_size = max_batch_size
        self.memory = memory_manager
        self.active_requests = []      # Currently generating
        self.waiting_queue = deque()   # Pending prefill (FIFO)

    def schedule_iteration(self):
        """Select requests for this decode iteration."""
        # Step 1: Keep all active decode requests (don't interrupt)
        # Step 2: Admit waiting requests if slots and memory are available
        while (len(self.active_requests) < self.max_batch_size
               and self.waiting_queue
               and self.memory.can_fit(self.waiting_queue[0])):
            # Track admitted requests as active so they can be removed on completion
            self.active_requests.append(self.waiting_queue.popleft())

        return list(self.active_requests)

    def run_iteration(self, scheduled):
        """Execute one decode step for all scheduled requests."""
        # Prefill any newly added requests
        new_requests = [r for r in scheduled if not r.prefilled]
        if new_requests:
            self.prefill_batch(new_requests)

        # Decode one token for all active requests
        tokens = self.decode_step(scheduled)

        # Return completed requests, keep others active
        for request, token in zip(scheduled, tokens):
            request.generated_tokens.append(token)
            if token == EOS_TOKEN or len(request.generated_tokens) >= request.max_tokens:
                request.complete()
                self.memory.free(request.kv_blocks)
                self.active_requests.remove(request)

    def prefill_batch(self, requests):
        """Model-specific: run prefill and mark each request as prefilled."""
        raise NotImplementedError

    def decode_step(self, requests):
        """Model-specific: return one new token per scheduled request."""
        raise NotImplementedError

Full implementation: See reference/09_llm_deployment_code.md

The scheduler’s decisions—which requests to prefill, how to balance prefill vs. decode, when to preempt—significantly impact performance. Advanced schedulers consider request priorities, memory pressure, and fairness guarantees.

Chunked Prefill: Optimizing for Mixed Workloads

A subtle problem arises when long prompts enter the system. Prefill is compute-bound and can monopolize the GPU, blocking decode iterations for active requests. If a 10,000-token prompt enters, other requests might stall for several hundred milliseconds.

Chunked prefill splits long prefills into chunks (e.g., 512 tokens) and interleaves them with decode iterations, so active requests keep producing tokens while the long prompt is processed piece by piece.

This increases the new request’s TTFT (prefill takes longer overall) but maintains acceptable ITL for existing requests—a worthwhile trade-off for interactive applications.
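A minimal sketch of the chunking and interleaving idea follows. It only builds a schedule and does not run a model; the function names and the dictionary layout are hypothetical, and real schedulers also apply a per-iteration token budget.

```python
def chunk_prompt(prompt_len: int, chunk_size: int = 512) -> list[tuple[int, int]]:
    """Split a prompt into (start, end) token ranges of at most chunk_size tokens."""
    return [(start, min(start + chunk_size, prompt_len))
            for start in range(0, prompt_len, chunk_size)]


def interleaved_schedule(prompt_len: int, active_decodes: list[str],
                         chunk_size: int = 512) -> list[dict]:
    """Each iteration pairs one prefill chunk with a decode step for active requests."""
    return [{"prefill_range": chunk, "decode_requests": list(active_decodes)}
            for chunk in chunk_prompt(prompt_len, chunk_size)]


schedule = interleaved_schedule(10_000, active_decodes=["req_a", "req_b"])
print(len(schedule), "iterations")
print(schedule[0])
print(schedule[-1])
```

In this sketch the 10,000-token prompt is spread over 20 iterations, and the two active requests receive a decode step in every one of them instead of stalling until the full prefill finishes.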

Common Mistake: Wrong Max Batch Size

What people do: Set max_batch_size to whatever the GPU can fit in memory, maximizing throughput.

Why it fails: Maximum batch size ≠ optimal batch size. Large batches increase latency for all requests: each decode iteration takes longer as more sequences share the GPU, and under static batching everyone also waits for the longest generation. For interactive applications, high p99 latency destroys user experience even if throughput is great.

Fix: Set batch size based on latency SLOs, not memory capacity. Start with max_batch_size=8-16 for interactive workloads. Monitor p99 latency under load. Increase batch size only for batch/offline processing where latency doesn’t matter. Use separate pools for interactive vs. batch traffic.

Quantization: Fitting More with Less

Staff Engineer Perspective

“We spent three months building a complex serving optimization before someone suggested trying INT8 quantization. It took a day to implement and gave us 2x the throughput improvement of everything else combined. Now my rule is: try quantization before any other optimization. It’s free performance.”

— ML Platform Engineer at AI startup

Why Quantization Works

Quantization reduces numerical precision—replacing 16-bit floats with 8-bit or 4-bit integers. Intuitively, this should destroy model quality. Why doesn’t it?

The key insight comes from analyzing weight distributions in trained transformers. Weights are not uniformly distributed; they cluster around zero with heavy tails. Most weights are small, and the model’s behavior depends more on preserving the relative relationships between weights than their exact values.

Consider a simplified example. If we have weights [0.0012, 0.0015, 0.0014, 0.0013], the exact values matter less than their ordering and rough magnitudes. Quantization that maps these to [1, 2, 2, 1] (scaled appropriately) preserves most of the computational effect.

Formal perspective: Neural networks are overparameterized. The function computed by the network depends on the weight manifold (the set of weight configurations producing equivalent outputs), not individual weight values. Small perturbations from quantization often stay close to this manifold.

Quantization Methods: From Theory to Practice

Uniform quantization divides the value range into equal intervals:

\[q = \text{round}\left(\frac{x - x_{min}}{x_{max} - x_{min}} \times (2^b - 1)\right)\]

Where b is the number of bits. Dequantization reverses the process:

\[\hat{x} = q \times \frac{x_{max} - x_{min}}{2^b - 1} + x_{min}\]

The problem: outliers. If most weights are between -0.1 and 0.1 but a few are at 1.0, the quantization range wastes precision on the sparse outliers.

Group quantization addresses this by applying separate scales to groups of weights (typically 128 weights per group). This captures local distributions better but adds storage overhead for the scales.
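The outlier problem and the effect of group-wise scales can be seen in a small pure-Python sketch of the uniform quantization formulas above. The function names, the example weights, and the group size of 4 (chosen to keep the example tiny; the text cites 128 as typical) are illustrative assumptions.

```python
def quantize_roundtrip(values: list[float], bits: int) -> list[float]:
    """Uniformly quantize to 2**bits levels over [min, max], then dequantize."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]   # integer codes in [0, levels]
    return [c * scale + lo for c in codes]


def grouped_roundtrip(values: list[float], bits: int, group_size: int) -> list[float]:
    """Apply quantize_roundtrip separately to each group of consecutive values."""
    restored = []
    for start in range(0, len(values), group_size):
        restored.extend(quantize_roundtrip(values[start:start + group_size], bits))
    return restored


def max_error(original: list[float], restored: list[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


weights = [-0.10, -0.05, 0.00, 0.03, 0.08, 0.10, 0.02, 1.00]  # one outlier at 1.0
small = weights[:4]                                            # the outlier-free group

per_tensor = quantize_roundtrip(weights, bits=4)
per_group = grouped_roundtrip(weights, bits=4, group_size=4)
print(f"max error on small weights, one scale:    {max_error(small, per_tensor[:4]):.4f}")
print(f"max error on small weights, group scales: {max_error(small, per_group[:4]):.4f}")
```

With a single scale, the outlier stretches the range so that the small weights share only a couple of quantization levels; with per-group scales, the first group gets the full 16 levels over its own narrow range.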

GPTQ: Optimal Brain Quantization for Transformers

GPTQ (Frantar et al., 2022) applies Optimal Brain Quantization principles to LLMs. The key insight: quantizing weights in a fixed order works nearly as well as greedily picking the least sensitive weight first, so GPTQ quantizes column by column and uses second-order information (the Hessian) to adjust the remaining weights and compensate for quantization error.

The algorithm processes weights column by column:

1. Compute the inverse Hessian H^(-1) (approximated efficiently)
2. For each weight w_i, compute the quantization error δ
3. Adjust remaining weights to minimize output change:

\[w_j \leftarrow w_j - \delta \times \frac{[H^{-1}]_{ij}}{[H^{-1}]_{ii}}\]

This “error correction” is what makes GPTQ achieve better quality than naive quantization. The second-order update propagates the correction optimally through the layer.

Calibration requirement: GPTQ requires a small calibration dataset (typically 128-1024 samples) to estimate the Hessian. The calibration data should represent your actual workload; using generic text when you serve code will degrade quality.
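The error-compensation update can be illustrated with a single step on a two-weight toy problem. This is only a sketch of the update rule, not the full GPTQ algorithm, which also updates the inverse Hessian after each column and processes weights in blocks. The Hessian, weights, and integer quantization grid are made-up values, and the function names are hypothetical.

```python
def quantize_with_compensation(weights, hessian_inv, index, grid_step=1.0):
    """Quantize weights[index] and adjust the other weights to offset the error."""
    updated = list(weights)
    quantized = round(weights[index] / grid_step) * grid_step
    error = weights[index] - quantized                     # delta in the text
    for j in range(len(weights)):
        if j != index:
            updated[j] -= error * hessian_inv[index][j] / hessian_inv[index][index]
    updated[index] = quantized
    return updated


def output_error(original, updated, hessian):
    """Quadratic proxy for the change in layer output: d^T H d."""
    d = [u - o for o, u in zip(original, updated)]
    n = len(d)
    return sum(d[i] * hessian[i][j] * d[j] for i in range(n) for j in range(n))


hessian = [[2.0, 1.0], [1.0, 2.0]]              # toy Hessian (assumed)
hessian_inv = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]  # its exact inverse
weights = [0.3, 0.7]

compensated = quantize_with_compensation(weights, hessian_inv, index=0)
naive = [0.0, 0.7]                              # same quantization, no adjustment

print("compensated weights:", compensated)
print("error with compensation:   ", output_error(weights, compensated, hessian))
print("error without compensation:", output_error(weights, naive, hessian))
```

In this toy case, shifting the second weight absorbs part of the error introduced by rounding the first, so the quadratic output error is lower than with plain rounding.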

AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) observes that not all weights are equally important—some channels carry more information than others. Rather than treating all weights equally, AWQ protects “salient” weights identified by activation magnitudes.

The insight: channels with large activations amplify weight quantization errors. AWQ scales these channels before quantization, reducing their relative quantization error:

Channel importance from activations:
────────────────────────────────────────────────────────────────
Channel 1:  avg |activation| = 0.1   → Low importance
Channel 2:  avg |activation| = 2.3   → High importance
Channel 3:  avg |activation| = 0.05  → Low importance
Channel 4:  avg |activation| = 1.8   → High importance

AWQ strategy: Scale channels 2 and 4 up before quantization,
              scale them back down during inference.
              Higher precision preserved where it matters.

AWQ typically achieves 0.5-1% better perplexity than GPTQ at the same bit-width, with the advantage increasing for aggressive 3-bit quantization.
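The scaling idea can be demonstrated on a single group of weights. The sketch below is a simplified illustration rather than the AWQ algorithm itself: the weights, the choice of salient index, and the scale of 2.0 are made up, and real AWQ searches for per-channel scales using calibration activations and folds the inverse scale into the activations instead of dividing the weight back.

```python
def quantize_group(weights: list[float], bits: int = 4) -> list[float]:
    """Symmetric round-to-nearest with one shared step size for the group."""
    step = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)
    return [round(w / step) * step for w in weights]


def quantize_with_channel_scale(weights: list[float], salient_index: int,
                                scale: float, bits: int = 4) -> list[float]:
    """Scale one salient weight up before quantization, then undo the scale."""
    scaled = list(weights)
    scaled[salient_index] *= scale
    restored = quantize_group(scaled, bits)
    restored[salient_index] /= scale      # equivalent to scaling the activation down
    return restored


weights = [0.30, 1.00, -0.80, 0.06]       # index 3 is assumed to meet large activations
plain = quantize_group(weights)
scaled = quantize_with_channel_scale(weights, salient_index=3, scale=2.0)

print(f"salient weight error, plain:  {abs(plain[3] - weights[3]):.4f}")
print(f"salient weight error, scaled: {abs(scaled[3] - weights[3]):.4f}")
```

The scaled weight sees an effective step size half as large, as long as scaling does not change the group maximum; that constraint is one reason the scale cannot be increased indefinitely.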

FP8: The Hardware-Supported Middle Ground

NVIDIA H100 GPUs introduced native FP8 (8-bit floating point) support. Unlike INT8, FP8 maintains a floating-point format. The E4M3 variant uses 4 bits of exponent and 3 bits of mantissa (better precision, smaller range — typically used for forward-pass activations and weights), while E5M2 uses 5 bits of exponent and 2 bits of mantissa (wider dynamic range — typically used for gradients in training).

Advantages of FP8:

Native hardware support: 2x throughput vs. FP16 on H100
Floating-point range: Handles outliers better than fixed-point
Minimal quality loss: Typically <0.5% perplexity increase
Simple adoption: No complex calibration or algorithm changes

FP8 is increasingly the default for H100 deployments: it provides the same 2x memory savings as INT8 with simpler implementation and better quality.
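The range-versus-precision trade-off between the two FP8 variants can be made concrete by decoding bit fields by hand. This sketch assumes the common convention in which E4M3 reserves only the all-ones exponent with all-ones mantissa for NaN, while E5M2 reserves the whole top exponent for infinities and NaN; the function name is illustrative and special values are not handled.

```python
def fp8_value(sign: int, exponent: int, mantissa: int,
              exp_bits: int, man_bits: int) -> float:
    """Decode a normal or subnormal FP8 bit pattern (inf/NaN not handled)."""
    bias = 2 ** (exp_bits - 1) - 1
    fraction = mantissa / (2 ** man_bits)
    if exponent == 0:                                    # subnormal numbers
        magnitude = fraction * 2.0 ** (1 - bias)
    else:                                                # normal numbers
        magnitude = (1 + fraction) * 2.0 ** (exponent - bias)
    return -magnitude if sign else magnitude


# Largest finite values under the convention described above
e4m3_max = fp8_value(0, 0b1111, 0b110, exp_bits=4, man_bits=3)
e5m2_max = fp8_value(0, 0b11110, 0b11, exp_bits=5, man_bits=2)

# Spacing between 1.0 and the next representable value
e4m3_step = fp8_value(0, 0b0111, 0b001, 4, 3) - 1.0
e5m2_step = fp8_value(0, 0b01111, 0b01, 5, 2) - 1.0

print(f"E4M3: max={e4m3_max}, step near 1.0={e4m3_step}")
print(f"E5M2: max={e5m2_max}, step near 1.0={e5m2_step}")
```

E4M3 resolves values near 1.0 twice as finely, while E5M2 reaches a far larger maximum, which matches the split between forward-pass tensors and gradients described above.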

Quantization Decision Framework

Quantization quality degradation by method (typical, task-dependent).
Method	Bits	Memory vs. FP16	Perplexity Increase	Notes
FP16	16	1x	Baseline	Reference quality
FP8	8	2x savings	<0.5%	Excellent; H100 optimized
INT8 (GPTQ)	8	2x savings	0.5-1.5%	Good general-purpose
INT8 (AWQ)	8	2x savings	0.3-1.0%	Better than GPTQ
INT4 (GPTQ)	4	4x savings	2-15%	Degrades on complex tasks
INT4 (AWQ)	4	4x savings	1-10%	Preferred INT4; test first

FlashAttention: Compute-Efficient Attention

The Memory Hierarchy Problem

Standard attention implementation materializes the full N×N attention matrix in GPU memory:

# Standard attention (memory-inefficient)
import math
import torch

batch, heads, seq_len, d_k = 1, 32, 4096, 128
Q, K, V = (torch.randn(batch, heads, seq_len, d_k) for _ in range(3))  # each (batch, heads, seq_len, head_dim)
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, heads, seq_len, seq_len)
attention = torch.softmax(scores, dim=-1)  # Materializes N×N matrix
output = torch.matmul(attention, V)

For sequence length N=4096 and 32 heads, the attention matrix requires:

32 × 4096 × 4096 × 4 bytes (FP32) = 2 GB of temporary storage

This memory is touched exactly once then discarded—pure waste of precious GPU memory bandwidth.

FlashAttention: Tiled, Memory-Efficient Attention

FlashAttention (Dao et al., 2022) restructures attention to minimize memory transfers by keeping computation in fast SRAM rather than slow HBM.

The key insight: we don’t need the full attention matrix simultaneously. We can compute attention in tiles, accumulating results without ever materializing the full N×N matrix.

The algorithm:

1. Divide Q, K, V into blocks that fit in SRAM
2. For each Q block, iterate over K, V blocks
3. Compute partial attention scores in SRAM
4. Apply softmax correction (requires tracking running max/sum)
5. Accumulate to output without writing intermediate results to HBM

Standard Attention Memory Traffic:
────────────────────────────────────────────────────────────────
HBM ──read Q, K──► SRAM ──compute S=QK^T──► HBM ──write S
HBM ──read S──────► SRAM ──softmax(S)────► HBM ──write P
HBM ──read P, V──► SRAM ──compute PV─────► HBM ──write O

Memory I/O: O(N² + Nd) ≈ O(N²) for large N

FlashAttention Memory Traffic:
────────────────────────────────────────────────────────────────
HBM ──read Q, K, V blocks──► SRAM ──compute in tiles──► HBM ──write O

Memory I/O: O(N²d²/M) for SRAM size M, far fewer HBM accesses in practice
Extra memory: O(N), the N×N matrix is never stored

Performance impact: FlashAttention achieves 2-4x speedup for long sequences. More importantly, it enables longer contexts without OOM—a qualitative capability improvement.
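The softmax correction in step 4 is the part that makes tiling possible, and it can be shown in plain Python for a single query. This is a sketch of the online-softmax idea only: the function names and test data are illustrative, and real FlashAttention kernels apply the same recurrence to tiles held in GPU SRAM.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query: materializes all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, processing keys/values block by block with a running max and sum."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_block]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)   # rescale earlier partial results
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_block))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values, block_size=2))
```

Only one block of scores exists at any time, yet the final output matches the reference up to floating-point rounding, because each earlier partial sum is rescaled whenever a larger maximum is found.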

FlashAttention-2: Further Optimizations

FlashAttention-2 (Dao, 2023) improves parallelization across thread blocks:

Parallelize over sequence length: Process Q blocks in parallel (independent)
Better work partitioning: Reduce synchronization overhead
Leverage NVIDIA-specific instructions: Exploit tensor cores more efficiently

FlashAttention-2 achieves up to 2x additional speedup over FlashAttention-1, approaching the hardware theoretical maximum.

Why this matters for inference: During decode, the query is a single token attending to a potentially long KV cache. FlashAttention’s memory efficiency makes long-context inference practical. Without it, 100K+ context windows would be prohibitively slow and memory-intensive.

Inference Server Architectures

The Inference Server Landscape

The complexity of efficient LLM serving—PagedAttention, continuous batching, tensor parallelism, quantization kernels—has been packaged into specialized inference servers. Understanding their architectures and trade-offs is essential for deployment decisions.

Naive inference (first listing) versus production inference (second listing):

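# Sequential processing - wastes GPU cycles
# Assumes `tokenizer` and `model` are an already-loaded Hugging Face tokenizer and causal LM
def generate_responses(prompts: list[str]) -> list[str]:
    results = []
    for prompt in prompts:
        # One request at a time, GPU sits idle between requests
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        # Unpack input_ids and attention_mask; generate() expects tensors, not the encoding object
        output = model.generate(**inputs, max_new_tokens=256)
        results.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return results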

# 10 requests x 2 seconds each = 20 seconds total
# GPU utilization: ~15%

from dataclasses import dataclass
from typing import Optional

from vllm import LLM, SamplingParams


@dataclass
class GenerationResult:
    """Simple container for one completed generation."""
    text: str
    tokens: int
    finish_reason: Optional[str]


# Production: continuous batching with KV cache management
class ProductionInference:
    def __init__(self, model_path: str):
        self.llm = LLM(
            model=model_path,
            tensor_parallel_size=2,          # Split across GPUs
            gpu_memory_utilization=0.9,      # Maximize KV cache
            max_num_seqs=64,                 # Concurrent sequences
            enable_prefix_caching=True,      # Cache common prefixes
            quantization="fp8",              # Memory efficiency
        )

    def generate_batch(
        self,
        prompts: list[str],
        max_tokens: int = 256
    ) -> list[GenerationResult]:
        sampling_params = SamplingParams(
            max_tokens=max_tokens,
            temperature=0.7,
            top_p=0.95,
        )

        # LLM.generate is a blocking call; the engine applies continuous batching
        # across the submitted prompts internally. For online serving where new
        # requests join mid-batch, use vLLM's async engine or its API server.
        outputs = self.llm.generate(prompts, sampling_params)

        return [
            GenerationResult(
                text=output.outputs[0].text,
                tokens=len(output.outputs[0].token_ids),
                finish_reason=output.outputs[0].finish_reason
            )
            for output in outputs
        ]

# 10 requests batched = ~3 seconds total
# GPU utilization: ~85%

Key differences: Continuous batching (vs sequential), PagedAttention for KV cache, tensor parallelism across GPUs, prefix caching for repeated system prompts, FP8 quantization, and concurrent sequence handling. Result: roughly 6x higher throughput in this illustrative comparison (20 seconds versus about 3 seconds for the same 10 requests).
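To make the scheduling difference concrete, here is a minimal simulation sketch. It only counts decode steps (one forward pass per step) and ignores prefill, memory limits, and per-step cost; the function names and the example request lengths are illustrative assumptions, not part of any inference server.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when a freed slot is refilled from the queue immediately."""
    waiting = deque(lengths)
    running: list[int] = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one forward pass emits one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: two long responses mixed with short ones.
lengths = [200, 10, 10, 10, 200, 10, 10, 10]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With this workload the static scheduler needs 400 steps because each batch waits for its 200-token request, while the continuous scheduler needs 210 because short requests release their slots as soon as they finish.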

vLLM: The PagedAttention Reference Implementation

vLLM, developed at UC Berkeley, introduced PagedAttention and remains the most widely deployed open-source inference server.

Architecture:

Key configuration parameters:

gpu-memory-utilization: Fraction of GPU memory the engine may use in total (default 0.9); whatever remains after weights and activations becomes KV cache
max-model-len: Maximum sequence length
tensor-parallel-size: GPUs for tensor parallelism
max-num-seqs: Maximum concurrent sequences
quantization: awq, gptq, fp8, or none

When to choose vLLM:

General-purpose LLM serving
High throughput requirements
Need for prefix caching / shared prompts
OpenAI API compatibility required

TensorRT-LLM: Maximum NVIDIA Performance

TensorRT-LLM is NVIDIA’s inference stack, optimizing specifically for NVIDIA GPUs.

Key differences from vLLM:

Compilation required: Models must be compiled to TensorRT engines
Kernel optimization: Custom kernels for each GPU architecture
In-flight batching: NVIDIA’s continuous batching implementation
FP8 native: Best FP8 support on H100

Architecture:

Compilation process:

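# Convert model to TensorRT-LLM checkpoint
# (convert_checkpoint.py ships with the per-model examples in the TensorRT-LLM repository)
python convert_checkpoint.py --model_dir llama-7b-hf --output_dir llama-7b-trt

# Build TensorRT engine
# Flag names vary by release; newer trtllm-build versions use --max_seq_len in place of --max_output_len
trtllm-build --checkpoint_dir llama-7b-trt \
             --output_dir llama-7b-engine \
             --max_batch_size 64 \
             --max_input_len 2048 \
             --max_output_len 512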

When to choose TensorRT-LLM:

Maximum throughput on NVIDIA GPUs (often 20-40% faster than vLLM)
Using H100s with FP8
Willing to invest in complex setup and model compilation
Latency-critical applications

SGLang: Structured Generation Optimization

SGLang (LMSYS) optimizes for workloads with structured outputs—JSON, code, constrained generation.

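Key innovation: RadixAttention, which indexes the KV cache in a radix tree (a compressed trie) so that requests automatically reuse the cached state of any shared prefix. Fast constrained decoding comes from a separate mechanism: SGLang compiles JSON schemas and grammars into a compressed finite-state machine that restricts which tokens can be sampled at each step.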
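The prefix sharing idea can be illustrated with a toy index. The sketch below is a plain token-level trie, not SGLang's implementation: a real radix tree compresses runs of tokens into single edges, stores references to KV blocks, and evicts entries under memory pressure. The class and method names are hypothetical.

```python
class PrefixCacheNode:
    """One cached token position; children are keyed by the next token id."""

    def __init__(self) -> None:
        self.children: dict[int, "PrefixCacheNode"] = {}


class ToyPrefixCache:
    """Token-level trie standing in for a radix-tree KV cache index."""

    def __init__(self) -> None:
        self.root = PrefixCacheNode()

    def match_and_insert(self, token_ids: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        reused = 0
        for tok in token_ids:
            child = node.children.get(tok)
            if child is None:
                # First miss: every later token also misses, since the new node is empty.
                child = node.children[tok] = PrefixCacheNode()
            else:
                reused += 1
            node = child
        return reused


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]  # hypothetical token ids for a shared system prompt
print(cache.match_and_insert(system_prompt + [10, 11]))  # cold cache, nothing reused
print(cache.match_and_insert(system_prompt + [20]))      # system prompt reused
print(cache.match_and_insert(system_prompt + [10, 12]))  # prompt plus one more token reused
```

Every reused token is a position whose keys and values do not need to be recomputed during prefill, which is why workloads with long shared system prompts or multi-turn histories benefit most.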

When to choose SGLang:

Structured output (JSON schemas, grammar constraints)
Multi-turn conversations with prefix sharing
Research / rapid prototyping

Hugging Face TGI: Production-Ready Simplicity

Text Generation Inference (TGI) prioritizes production stability and Hugging Face ecosystem integration.

Key features:

Watermarking support (detecting AI-generated text)
Token streaming with standard SSE
Prometheus metrics built-in
Simple Docker deployment

When to choose TGI:

Hugging Face model hub workflow
Need watermarking
Prefer stability over maximum performance

Inference Server Decision Framework

How the Leaders Do It: Industry Case Studies

Understanding how leading providers architect their serving infrastructure provides both practical insights and validation for the techniques we’ve discussed.

OpenAI: Scale Through Specialization

OpenAI serves billions of API requests daily across GPT-5, GPT-5.5, and GPT-4 variants. Their architecture, while not fully public, reveals key patterns through their API behavior and published work.

Observed characteristics:

Multiple model variants: Different price/capability tiers (GPT-5.5 vs GPT-5 vs GPT-4)
Aggressive batching: High latency tolerance for batch API (50% cost reduction)
Speculative decoding: Faster tokens on GPT-5.5 suggest draft model usage
Regional routing: Requests routed to nearest data center

Inferred architecture patterns:

Separate clusters for different model sizes/speeds
Request classification to route to appropriate backend
Heavy use of prefix caching for system prompts
Custom hardware (rumored TPU/custom ASIC usage)

Key insight: OpenAI offers explicit price/latency trade-offs (batch API, prompt caching discount), suggesting their infrastructure can dynamically trade these resources.

Anthropic: Optimizing for Long Context

Anthropic’s Claude models support 200K token contexts—far beyond typical deployments. This requires specific architectural choices.

Observed characteristics:

Streaming by default: Long outputs stream immediately
Context caching: Explicit prompt caching API with cost reduction
Graceful degradation: Long context requests have higher latency tolerance

Inferred architecture patterns:

Extensive KV cache optimization (likely PagedAttention or custom equivalent)
Potentially hierarchical or compressed KV cache for very long contexts
Dedicated capacity for long-context workloads
Heavy prefix caching for repeated system prompts

Key insight: Anthropic’s explicit prompt caching API (5-minute cache, 90% cost reduction for cache hits) reveals their infrastructure heavily leverages prefix sharing.

Together AI: Inference as a Service

Together AI specializes in hosted inference for open-source models. Their architecture is more transparent.

Public architecture details:

Custom inference stack: “Together Inference Engine” with claimed 2-3x throughput vs. standard vLLM
Speculative decoding: Available for supported models
Multiple model sizes: From small to 70B+ parameters
Dedicated endpoints: Option for reserved capacity

Key optimizations (from their blog):

Custom CUDA kernels for specific model architectures
Aggressive operator fusion
Memory-optimized attention implementations
Dynamic batching with priority support

Throughput claims: Up to 117 tokens/second per request for Llama 4 Scout on 8x A100s with speculative decoding.

Fireworks AI: Speed as the Value Proposition

Fireworks AI positions itself on inference speed, claiming industry-leading latency.

Public architecture details:

FireAttention: Custom attention kernel optimized for inference
Speculative decoding: Enabled by default where applicable
Optimized routing: Request-aware load balancing

Performance claims:

TTFT under 200ms for most requests
Up to 4x throughput improvement over standard implementations

Key insight: Fireworks’ focus on latency suggests architecture choices that prioritize TTFT over raw throughput—potentially smaller batch sizes, eager prefill, and latency-optimized kernels.

Lessons from Production Deployments

Common patterns across all providers:

Tiered service levels: Different price/performance points for different workloads
Prefix caching: Universal adoption, often with explicit API support
Speculative decoding: Increasingly common for reducing per-request latency
Custom kernels: All serious providers have custom CUDA implementations
Batching intelligence: Request-aware routing and scheduling
Regional distribution: Low-latency access requires geographic presence

GPU Selection and Capacity Planning

Understanding GPU Specifications for Inference

GPU selection for inference differs from training. Memory bandwidth often matters more than raw compute.

Key specifications to evaluate:

Spec	Why It Matters	Trade-off
Memory Size	Determines model size + max concurrent requests	Larger = more expensive
Memory Bandwidth	Bounds single-request generation speed	Higher bandwidth GPUs cost more
FP16 TFLOPS	Affects prefill speed	Often over-provisioned for inference
FP8 Support	Enables up to ~2x throughput with minimal quality loss	Requires Hopper (H100/H200), Ada (L40S, RTX 4090), or newer GPUs
Interconnect	Needed for tensor parallelism	NVLink vastly outperforms PCIe

GPU Comparison for Inference

GPU Comparison for LLM Inference (2024-2025). Approximate on-demand cloud pricing; reserved/spot significantly lower.
GPU	Memory	BW (TB/s)	$/hr	Best For
RTX 4090	24 GB	1.0	$0.50	Dev/testing, 7B models
A10	24 GB	0.6	$1.50	7B production, cost-sensitive
L40S	48 GB	0.9	$2.00	13B models, balanced
A100 40GB	40 GB	1.6	$4.00	13-30B, high throughput
A100 80GB	80 GB	2.0	$6.00	70B (INT4) on a single GPU, large batches
H100 80GB	80 GB	3.4	$10.00	Maximum throughput, FP8
H200 141GB	141 GB	4.8	$15.00	200B+, longest contexts

Memory Budget Calculation

A systematic approach to GPU selection:

Step 1: Model memory

Weight memory = Parameters × Bytes per parameter
7B FP16:  7 × 10^9 × 2 = 14 GB
7B INT4:  7 × 10^9 × 0.5 = 3.5 GB
70B INT4: 70 × 10^9 × 0.5 = 35 GB

Step 2: KV cache budget

Available for KV = GPU Memory - Model Memory - Overhead (2-4 GB)

A100 80GB with 7B INT4:
80 - 3.5 - 3 = 73.5 GB for KV cache

Step 3: Max concurrent requests

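KV cache per request = 2 × Layers × Seq_len × KV_heads × Head_dim × Bytes
(The leading 2 counts keys and values. With grouped-query attention, KV_heads is smaller than the number of query heads.)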

LLaMA-7B at 4K context, FP16 KV:
2 × 32 × 4096 × 32 × 128 × 2 = 2.15 GB per request

Max concurrent: 73.5 GB / 2.15 GB ≈ 34 requests
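Steps 1 through 3 are easy to wrap in a small calculator. The following is a sketch under the same simplifying assumptions as the arithmetic above (decimal gigabytes, a fixed overhead allowance, every request reserved at full context length); the function names are illustrative.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Step 1: weight memory in GB (1e9 params x bytes/param = GB)."""
    return params_billion * bytes_per_param


def kv_cache_gb_per_request(layers: int, seq_len: int, kv_heads: int,
                            head_dim: int, bytes_per_value: int = 2) -> float:
    """Step 3 formula: keys and values for every layer and token, in GB."""
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_value / 1e9


def max_concurrent_requests(gpu_gb: float, weights_gb: float,
                            kv_gb_per_request: float, overhead_gb: float = 3.0) -> int:
    """Steps 2 and 3: requests that fit in the memory left after weights and overhead."""
    available = gpu_gb - weights_gb - overhead_gb
    return max(0, int(available // kv_gb_per_request))


# Reproduce the worked example: 7B INT4 on an A100 80GB at 4K context, FP16 KV cache.
weights = weight_memory_gb(7, 0.5)
kv = kv_cache_gb_per_request(layers=32, seq_len=4096, kv_heads=32, head_dim=128)
print(f"weights: {weights:.1f} GB, KV per request: {kv:.2f} GB")
print("max concurrent requests:", max_concurrent_requests(80, weights, kv))
```

Because paged KV caches allocate blocks on demand, real servers usually fit more requests than this worst-case bound whenever typical sequences are shorter than the maximum context.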

Step 4: Throughput estimation

Tokens/second ≈ Memory_bandwidth / (Model_size × bytes_per_weight)
             × Batch_efficiency_factor × Kernel_efficiency

A100 (2 TB/s) with 7B INT4, batch 16:
2 × 10^12 / (7 × 10^9 × 0.5) × 0.8 × 0.7 ≈ 320 tok/s per request
Total throughput: ~5,000 tok/s (batch 16)
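The Step 4 estimate can be expressed as a one-line function. This is a rough sketch that follows the formula above exactly: it models only weight traffic, ignores KV cache reads, and treats the two efficiency factors as assumed constants rather than measured values.

```python
def decode_tokens_per_second(bandwidth_bytes_per_s: float, params: float,
                             bytes_per_weight: float,
                             batch_efficiency: float = 0.8,
                             kernel_efficiency: float = 0.7) -> float:
    """Per-request decode rate when every step must stream all weights once."""
    weight_bytes = params * bytes_per_weight
    return bandwidth_bytes_per_s / weight_bytes * batch_efficiency * kernel_efficiency


# A100 (2 TB/s) with a 7B INT4 model at batch size 16, as in the worked example.
per_request = decode_tokens_per_second(2e12, 7e9, 0.5)
batch_size = 16
print(f"per request: {per_request:.0f} tok/s")
print(f"batch total: {per_request * batch_size:.0f} tok/s")
```

Swapping in the bandwidth figures from the GPU comparison table gives a quick first-order ranking of candidate GPUs before running real benchmarks.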

Multi-GPU Considerations

When a model doesn’t fit on one GPU, you need parallelism:

Tensor Parallelism (TP): Split each layer across GPUs. Requires fast interconnect (NVLink). Near-linear speedup for prefill; decode also gains from the combined memory bandwidth, but the all-reduce communication in every layer caps the improvement.

Pipeline Parallelism (PP): Different layers on different GPUs. Works across nodes. Introduces pipeline bubbles, more complex scheduling.
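The two strategies partition the same computation along different axes. The sketch below uses plain Python lists in place of GPU tensors to show that both reproduce the single-device result; the function names are illustrative, and real systems add communication (all-gather or all-reduce for TP, point-to-point transfers for PP) at the marked points.

```python
Matrix = list[list[float]]


def matvec(weight_rows: Matrix, x: list[float]) -> list[float]:
    """y = W x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def tensor_parallel_matvec(weight_rows: Matrix, x: list[float], num_gpus: int) -> list[float]:
    """Each device holds a slice of W's rows and computes its slice of the output."""
    shard = (len(weight_rows) + num_gpus - 1) // num_gpus
    partials = [matvec(weight_rows[i:i + shard], x)
                for i in range(0, len(weight_rows), shard)]
    # Communication point: gather the output slices from every device.
    return [y for part in partials for y in part]


def pipeline_parallel_forward(layers: list[Matrix], x: list[float], num_gpus: int) -> list[float]:
    """Each device holds a contiguous group of layers; activations flow stage to stage."""
    per_stage = (len(layers) + num_gpus - 1) // num_gpus
    stages = [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]
    for stage in stages:
        # Communication point: x is sent from the previous stage's device.
        for weight in stage:
            x = matvec(weight, x)
    return x


w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(tensor_parallel_matvec(w, [1.0, 1.0], num_gpus=2) == matvec(w, [1.0, 1.0]))
```

In the tensor-parallel version all devices work on every token, which is why it lowers per-token latency; in the pipeline version only one stage is busy for a given token unless several microbatches are in flight.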

Self-Hosting vs. API: The Decision Framework

Cost Analysis Framework

Historical Context: The Dramatic Decline of LLM Costs

2020 (GPT-3 Beta): $60/1M tokens. Early access was expensive and rate-limited. Cost prohibited most production use cases. A customer support bot at scale could cost $1M+/month.

2022 (ChatGPT Launch): $20/1M tokens (text-davinci). Still expensive, but 3x cheaper. Companies started building production applications, carefully counting tokens.

2023 (GPT-3.5 Turbo): $2/1M tokens. A 10x drop. The “ChatGPT API” made LLMs economically viable for mainstream applications. Token counting became less critical.

Late 2023 (GPT-4 Turbo, Claude 2.1): Frontier models at $10-30/1M input, $30-60/1M output. Smaller/faster models hit $0.50-2/1M. Tiered pricing became standard: pay for capability you need.

2024 (GPT-4o, Claude 3.5 Sonnet): $2.50-5/1M for production-ready models. Cached input at 50-90% discount. Quality that was frontier-only a year prior now available at commodity prices.

2025 (Current): Frontier models at $10-15/1M, production models at $1-3/1M, distilled models at $0.10-0.50/1M. 100-600x cheaper than 2020 for equivalent capability.

The pattern: LLM costs dropped roughly 10x every 18 months—faster than Moore’s Law. This changes build-vs-buy economics: self-hosting makes sense at ever-higher volume thresholds. More importantly, capabilities that were impossibly expensive become routine—enabling entirely new application categories.

The self-host vs. API decision ultimately comes down to total cost of ownership. Let’s build a rigorous comparison framework.

Staff Engineer Perspective: The Self-Hosting Decision

“The most common mistake I see is teams doing a spreadsheet analysis—‘API costs $X, self-hosting costs $Y, Y < X, let’s self-host.’ That analysis is always wrong because it ignores the hidden costs.

Self-hosting means you need: GPU procurement expertise (6-month lead times), MLOps engineers who understand CUDA, on-call rotation for inference infrastructure, capacity planning for traffic spikes, and a model upgrade path that doesn’t require re-architecting.

My rule of thumb: don’t self-host until you’re spending $50K/month on API costs AND you have at least two engineers who’ve operated GPU infrastructure before. Below that threshold, the operational overhead will eat any savings. Above it, the economics usually work—but only if you’ve built the team first.”

— Staff ML Engineer, Infrastructure

API costs (straightforward):

Monthly API Cost = Requests × (Input_tokens × $/input + Output_tokens × $/output)

Example: GPT-5.5 at 1M requests/month, 500 input, 200 output tokens avg
= 1M × (500 × $0.005/1K + 200 × $0.03/1K)
= 1M × ($0.0025 + $0.006)
= $8,500/month
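The same formula as a small helper, useful for trying different traffic assumptions. This is a sketch; the function name is illustrative and the prices are simply the ones used in the example above, expressed per 1K tokens.

```python
def monthly_api_cost(requests: int, input_tokens: int, output_tokens: int,
                     usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Monthly spend given average token counts per request and per-1K-token prices."""
    per_request = (input_tokens / 1000 * usd_per_1k_input
                   + output_tokens / 1000 * usd_per_1k_output)
    return requests * per_request


cost = monthly_api_cost(1_000_000, 500, 200, usd_per_1k_input=0.005, usd_per_1k_output=0.03)
print(f"${cost:,.0f}/month")
```

Note that output tokens account for most of the bill here even though each request has fewer of them, so shortening responses often saves more than trimming prompts.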

Self-hosted costs (complex):

Hardware: GPU rental or amortization
= GPU_cost × Number_of_GPUs × Hours_per_month

Operations: Engineer time for setup, maintenance, on-call
= Loaded_engineer_cost × Hours_per_month

Infrastructure: Networking, storage, monitoring
= Platform_costs (varies widely)

Quality risk: Model may not match API capability
= Hard to quantify, often underestimated

Break-Even Analysis

def calculate_breakeven(
    api_cost_per_request: float,
    gpu_hourly_cost: float,
    throughput_requests_per_second: float,
    ops_hours_per_month: float = 20,
    engineer_hourly_cost: float = 100
) -> dict:
    """Calculate when self-hosting becomes cheaper."""

    # GPU cost for 24/7 operation
    monthly_gpu_cost = gpu_hourly_cost * 24 * 30

    # Operations overhead
    monthly_ops_cost = ops_hours_per_month * engineer_hourly_cost

    # Total self-hosted monthly cost
    total_self_hosted = monthly_gpu_cost + monthly_ops_cost

    # Maximum requests the GPU can serve
    max_requests_per_month = throughput_requests_per_second * 3600 * 24 * 30 * 0.7  # 70% utilization

    # Break-even point
    break_even_requests = total_self_hosted / api_cost_per_request

    return {
        "monthly_self_hosted_cost": total_self_hosted,
        "break_even_requests": break_even_requests,
        "max_capacity_requests": max_requests_per_month,
        "cost_per_request_at_capacity": total_self_hosted / max_requests_per_month
    }

Example: Llama 4 Scout vs. GPT-5.5

Llama 4 Scout vs GPT-5.5 cost comparison. *500 input + 200 output tokens.
Factor	GPT-5.5 API	Self-hosted Llama 4 Scout
Cost per request*	$0.0085	Variable
GPU cost	N/A	2x A100 80GB @ $12/hr = $8,640/mo
Ops cost	N/A	~$2,000/mo
Throughput	N/A	~50 req/s = ~130M req/mo capacity at 100% utilization (~91M at 70%)
Break-even	N/A	$10,640 / $0.0085 = 1.25M req/mo
Cost at capacity	$0.0085	$0.00008 at 100% utilization (~100x cheaper)

The catch: This analysis assumes:

You achieve API-level quality with open-source models (not always true)
You maintain high utilization (often difficult)
Your ops costs stay low (they often don’t)

Decision Framework

Key decision factors:

Data residency requirements → May force self-host regardless of cost
Burst traffic patterns → APIs handle spikes better without over-provisioning
Model customization needs → Self-host enables fine-tuning and modification
Team expertise → Self-hosting requires GPU/ML infrastructure expertise

Hybrid Architectures

Many organizations find hybrid approaches optimal:

API for peak, self-host for baseline: Self-host for predictable baseline traffic, overflow to API during peaks
Self-host for sensitive, API for general: Private data stays on-premise, general queries to API
Self-host for experimentation, API for production: Develop and test locally, deploy proven prompts to API
API for quality-critical, self-host for volume: Important use cases get GPT-5, commodity tasks get self-hosted
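The first two patterns combine naturally into a single routing rule. The sketch below is one possible shape for it, with hypothetical names throughout (`Request`, `HybridRouter`, the `contains_sensitive_data` flag); a production router would also handle timeouts, fallbacks, and per-tenant policy.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    contains_sensitive_data: bool = False


class HybridRouter:
    """Self-host for baseline and sensitive traffic, overflow the rest to an API."""

    def __init__(self, self_host_capacity: int) -> None:
        self.self_host_capacity = self_host_capacity  # concurrent requests the cluster handles well
        self.in_flight = 0                            # requests currently on the self-hosted cluster

    def route(self, request: Request) -> str:
        # Sensitive data never leaves your infrastructure, even if it has to queue.
        if request.contains_sensitive_data or self.in_flight < self.self_host_capacity:
            self.in_flight += 1
            return "self_hosted"
        return "api"

    def complete_self_hosted(self) -> None:
        """Call when a self-hosted request finishes to free its slot."""
        self.in_flight = max(0, self.in_flight - 1)


router = HybridRouter(self_host_capacity=2)
for i in range(3):
    print(router.route(Request(prompt=f"general question {i}")))
print(router.route(Request(prompt="contract text", contains_sensitive_data=True)))
```

Sizing `self_host_capacity` to the predictable baseline keeps the GPUs well utilized, while the API absorbs peaks that would otherwise require over-provisioning.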

Self-Host vs. API Decision Flowchart

Speculative Decoding: Trading Compute for Latency

The Speculative Decoding Insight

Speculative decoding addresses a fundamental inefficiency: during decode, we load the entire model to generate a single token. What if we could verify multiple tokens with a single model load?

The key insight: verification is cheaper than generation. Given candidate tokens, checking if the large model agrees is faster than generating those tokens directly.

The Algorithm

Draft: A small, fast model generates K candidate tokens autoregressively
Verify: The large model processes all K candidates in a single forward pass
Accept/Reject: Compare probabilities; accept matching tokens, reject and correct on disagreement

The Mathematics of Speculation

Let:

p_target(x): Target model’s probability of token x
p_draft(x): Draft model’s probability of token x
K: Number of speculated tokens
α: Average acceptance rate

Acceptance criterion: Accept the drafted token x with probability $\min\left(1, \frac{p_{target}(x)}{p_{draft}(x)}\right)$. On rejection, that position is resampled from the normalized residual distribution $\max(0, p_{target} - p_{draft})$. This rejection-sampling rule is the whole point: it makes speculative decoding lossless, so the output distribution exactly matches what the target model would have produced on its own. (A naive “accept if the target likes it at least as much as the draft” test would bias the distribution.)
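A minimal sketch of this accept-or-resample rule for a single token position, using small dictionaries as stand-ins for the two models' output distributions. The function name and the toy vocabulary are assumptions for illustration; both distributions are assumed to cover the same tokens.

```python
import random


def speculative_sample(p_target: dict[str, float], p_draft: dict[str, float],
                       rng: random.Random) -> str:
    """Draw one token whose distribution matches p_target, using a draft proposal."""
    tokens = list(p_draft)
    # Draft step: propose a token from the draft distribution.
    x = rng.choices(tokens, weights=[p_draft[t] for t in tokens])[0]
    # Verify step: accept with probability min(1, p_target(x) / p_draft(x)).
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x
    # Rejected: resample from the residual max(0, p_target - p_draft).
    # rng.choices normalizes the weights, and a rejection implies the residual is nonzero.
    residual = [max(0.0, p_target[t] - p_draft[t]) for t in tokens]
    return rng.choices(tokens, weights=residual)[0]


rng = random.Random(0)
p_target = {"a": 0.6, "b": 0.3, "c": 0.1}
p_draft = {"a": 0.3, "b": 0.5, "c": 0.2}
n = 50_000
counts = {t: 0 for t in p_target}
for _ in range(n):
    counts[speculative_sample(p_target, p_draft, rng)] += 1
print({t: round(c / n, 3) for t, c in counts.items()})
```

The printed frequencies should track `p_target` closely even though every proposal came from the quite different `p_draft`, which is the lossless property in miniature.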

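Expected tokens per iteration: \[E[\text{tokens}] = \sum_{k=0}^{K-1} (k+1) \cdot P(\text{accept } k \text{ then reject}) + (K+1) \cdot P(\text{accept all } K)\]

The sum stops at K-1 because a rejection can only follow fewer than K accepted tokens; that iteration yields the k accepted tokens plus one corrected token. When all K are accepted, the target's verification pass supplies one extra token. If each drafted token is accepted independently with probability α, then P(accept k then reject) = α^k(1-α) and P(accept all K) = α^K, and the expression simplifies to \[E[\text{tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}\]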

Speedup factor: \[\text{Speedup} = \frac{E[\text{tokens}]}{1 + K \cdot r_{draft}}\]

Where r_draft is the ratio of draft model cost to target model cost.

For well-matched draft models (high α, low r_draft), speedups of 2-3x are typical.
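To get a feel for these formulas, the sketch below evaluates them numerically. It assumes each drafted token is accepted independently with probability α, in which case the expected-token sum has the closed form (1 - α^(K+1)) / (1 - α); the function names and the example values of α, K, and r_draft are illustrative.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per iteration with K drafted tokens and acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_tokens_by_sum(alpha: float, k: int) -> float:
    """Same quantity computed term by term, as a cross-check on the closed form."""
    rejected_early = sum((j + 1) * alpha ** j * (1 - alpha) for j in range(k))
    return rejected_early + (k + 1) * alpha ** k


def speculative_speedup(alpha: float, k: int, r_draft: float) -> float:
    """Speedup over plain decoding; r_draft is draft cost relative to target cost."""
    return expected_tokens(alpha, k) / (1 + k * r_draft)


for alpha in (0.5, 0.7, 0.8, 0.9):
    print(alpha, round(expected_tokens(alpha, 4), 2),
          round(speculative_speedup(alpha, 4, r_draft=0.05), 2))
```

Running the loop shows how strongly the result depends on α: with K = 4 and a draft that costs 5% of the target, the speedup stays below 2x at α = 0.5 and approaches 3x only around α = 0.8.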

Draft Model Selection

The draft model must balance:

Speed: Small enough to be fast
Alignment: Similar output distribution to minimize rejection
Architecture: Compatible tokenizer and vocabulary

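Options:

Smaller same-family model: A small model from the target’s own family and tokenizer generation (a first-generation LLaMA-7B, for example, does not share Llama 4 Scout’s vocabulary and would not qualify)
Distilled model: Model trained specifically to match the target
Quantized self-draft: INT4 version of the target model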

Alignment matters enormously: A misaligned draft model produces candidates the target rejects, wasting the speculation cost with no benefit.

# Evaluating draft model alignment
import random

def measure_acceptance_rate(draft, target, prompts, K=4):
    """Measure how often target accepts draft's predictions.

    `draft` and `target` are placeholder model wrappers: generate() returns K
    drafted tokens, and get_token_probs() returns one probability per drafted
    token (the probability that model assigns to it).
    """
    total_drafted = 0
    total_accepted = 0

    for prompt in prompts:
        draft_tokens = draft.generate(prompt, num_tokens=K)
        target_probs = target.get_token_probs(prompt + draft_tokens)
        draft_probs = draft.get_token_probs(prompt + draft_tokens)

        for d_prob, t_prob in zip(draft_probs, target_probs):
            total_drafted += 1
            # Correct speculative-sampling rule: accept with probability
            # min(1, p_target / p_draft). On rejection the real algorithm
            # resamples this position from the residual; we stop measuring here.
            accept_prob = min(1.0, t_prob / d_prob)
            if random.random() < accept_prob:
                total_accepted += 1
            else:
                break  # first rejection ends this speculation window

    # Positions after the first rejection are never evaluated, so they are not counted
    return total_accepted / total_drafted if total_drafted else 0.0

# Good alignment: >70% acceptance
# Poor alignment: <50% acceptance

When Speculative Decoding Helps

Speculative decoding provides the most benefit when:

Target model is large (high per-token cost)
Good draft model exists (high acceptance rate)
Latency matters more than throughput
Batch size is small (at large batch sizes, batching already amortizes the weight loads)

It provides less benefit when:

Already batching heavily (little spare compute left for verifying drafted tokens)
No suitable draft model (poor alignment wastes speculation)
Throughput matters more than latency (better to increase batch size)

Production Operations

Monitoring Essentials

Production LLM serving requires comprehensive monitoring:

Request-level metrics:

TTFT (time to first token) - P50, P95, P99
ITL (inter-token latency) - P50, P95, P99
Total latency - P50, P95, P99
Request success rate
Tokens generated per request

System-level metrics:

GPU memory utilization (total and KV cache)
GPU compute utilization
Request queue depth
Active concurrent requests
Memory bandwidth utilization (if available)

Business metrics:

Tokens generated per second (throughput)
Cost per 1000 tokens
Requests served per dollar
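The request-level latency metrics above can all be derived from per-token arrival timestamps. The following sketch shows one way to do that, with a nearest-rank percentile; the function names and sample timestamps are illustrative assumptions.

```python
import math


def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, list[float]]:
    """Split a request's timeline into time-to-first-token and inter-token gaps."""
    ttft = token_times[0] - request_start
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical timestamps (seconds) for one streamed response.
ttft, itl = ttft_and_itl(request_start=0.0, token_times=[0.50, 0.55, 0.60, 0.70])
print(f"TTFT: {ttft:.2f}s, ITL P50: {percentile(itl, 50):.2f}s, ITL P99: {percentile(itl, 99):.2f}s")
```

Tracking P95 and P99 alongside P50 matters because batching and queueing effects show up in the tail long before they move the median.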

# Essential Prometheus metrics for LLM serving
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
requests_total = Counter('llm_requests_total', 'Total requests', ['model', 'status'])
tokens_generated = Counter('llm_tokens_generated_total', 'Tokens generated', ['model'])
ttft_seconds = Histogram('llm_ttft_seconds', 'Time to first token', ['model'],
                         buckets=[0.1, 0.2, 0.5, 1, 2, 5, 10])
request_latency_seconds = Histogram('llm_request_latency_seconds', 'Total request latency',
                                    ['model'], buckets=[0.5, 1, 2, 5, 10, 30, 60])

# System metrics
active_requests = Gauge('llm_active_requests', 'Currently processing', ['model'])
queue_depth = Gauge('llm_queue_depth', 'Requests waiting', ['model'])
gpu_memory_used_bytes = Gauge('llm_gpu_memory_bytes', 'GPU memory', ['gpu_id'])
kv_cache_utilization = Gauge('llm_kv_cache_utilization', 'KV cache usage ratio', ['model'])

Full implementation: See reference/09_llm_deployment_code.md

Load Shedding and Overload Protection

LLM inference is expensive. Under overload, accepting all requests leads to:

Memory exhaustion (KV cache overflow)
Latency explosion (queue buildup)
Cascading failures (timeouts causing retries)

Load shedding strategies:

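import random

def estimate_kv_usage(max_tokens: int, kv_bytes_per_token: int = 512 * 1024) -> int:
    """Worst-case KV cache bytes for a request (default 0.5 MiB/token, as for LLaMA-7B in FP16)."""
    return max_tokens * kv_bytes_per_token

class LoadShedder:
    """Protect the system by rejecting excess load."""

    def __init__(self, max_queue: int, kv_limit: int):
        self.max_queue = max_queue  # Queue depth at which every request is rejected
        self.kv_limit = kv_limit    # KV cache budget in bytes
        self.queue_depth = 0        # Kept up to date by the serving loop
        self.kv_used = 0            # KV cache bytes currently reserved

    def should_accept(self, request) -> tuple[bool, str]:
        # Check queue depth
        if self.queue_depth >= self.max_queue:
            return False, "queue_full"

        # Check KV cache headroom
        estimated_kv = estimate_kv_usage(request.max_tokens)
        if self.kv_used + estimated_kv > self.kv_limit * 0.95:
            return False, "memory_pressure"

        # Probabilistic shedding at high load: rejection probability ramps
        # linearly from 0 at 80% queue occupancy to 1 at 100%
        load_factor = self.queue_depth / self.max_queue
        if load_factor > 0.8 and random.random() < (load_factor - 0.8) / 0.2:
            return False, "load_shed"

        return True, "accepted"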

Client-side handling: Return HTTP 503 with Retry-After header. Well-behaved clients will back off.

Full implementation: See reference/09_llm_deployment_code.md
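The client-side half of this contract can be sketched as a retry loop that honors Retry-After and otherwise backs off exponentially with jitter. Everything here is an illustrative assumption: `send` is a hypothetical callable returning `(status, headers, body)`, and only the delay-in-seconds form of Retry-After is handled, not the HTTP-date form.

```python
import random
import time
from typing import Callable

Response = tuple[int, dict[str, str], str]


def call_with_backoff(send: Callable[[], Response], max_attempts: int = 5,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep,
                      jitter: Callable[[], float] = random.random) -> tuple[int, str]:
    """Retry on HTTP 503, waiting as instructed by the server or backing off exponentially."""
    status, body = 503, ""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 503:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; report the last 503 to the caller
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)          # the server told us how long to wait
        else:
            delay = base_delay * 2 ** attempt   # exponential backoff
        # Jitter spreads out clients that were all rejected at the same moment.
        sleep(delay + jitter() * base_delay)
    return status, body


# Simulated server: sheds load twice, then succeeds.
responses = iter([(503, {"Retry-After": "2"}, ""), (503, {}, ""), (200, {}, "ok")])
print(call_with_backoff(lambda: next(responses), sleep=lambda s: None))
```

Without the jitter term, clients rejected in the same overload burst would all retry at the same instant and recreate the spike, which is the cascading-failure pattern described above.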

Graceful Shutdown

Rolling deployments and autoscaling require graceful shutdown:

Stop accepting new requests: Return 503 to new connections
Drain active requests: Allow in-progress generation to complete
Set timeout: Force-terminate after grace period
Release resources: Free GPU memory, close connections

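import asyncio
import logging
import time

logger = logging.getLogger(__name__)

class GracefulServer:
    def __init__(self):
        self.accepting = True
        self.active_requests = set()  # In-flight asyncio tasks

    async def shutdown(self, grace_period=30):
        """Gracefully shut down the server."""
        self.accepting = False  # Stop new requests

        # Wait for active requests with timeout (monotonic clock is immune to wall-clock jumps)
        deadline = time.monotonic() + grace_period
        while self.active_requests and time.monotonic() < deadline:
            await asyncio.sleep(0.1)

        if self.active_requests:
            logger.warning(f"Force-closing {len(self.active_requests)} requests")
            # Iterate over a copy: cancelled requests may remove themselves from the set
            for req in list(self.active_requests):
                req.cancel()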
Full implementation: See reference/09_llm_deployment_code.md

Kubernetes Deployment Patterns

LLM workloads have unique Kubernetes requirements:

GPU scheduling: Request GPU resources, handle node affinity
Slow startup: Models take minutes to load; use appropriate health probes
Memory requirements: Set resource limits carefully to prevent OOM
Autoscaling: Scale on queue depth, not CPU (GPU is the constraint)

# Key Kubernetes patterns for LLM inference (fragment: metadata, selector, and image are omitted)
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "96Gi"  # More than GPU memory for safety
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180  # Model loading time
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 300  # Even longer for liveness
          periodSeconds: 30
---
# Scale on queue depth, not CPU (fragment: scaleTargetRef is omitted; External metrics need a metrics adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: llm_queue_depth
      target:
        type: AverageValue
        averageValue: "10"  # Scale out when average queue depth per replica exceeds 10

Full implementation: See reference/09_llm_deployment_code.md

Common Mistake: No LLM-Aware Health Checks

What people do: Use simple TCP or HTTP health checks that return 200 if the server is listening.

Why it fails: LLM servers can be “up” but non-functional—GPU memory exhausted, model weights corrupted, or inference stuck. Kubernetes keeps routing traffic to broken replicas, causing cascading timeouts.

Fix: Implement deep health checks that verify actual inference capability. Include a lightweight inference test (e.g., 10-token completion) in your readiness probe. Set appropriate timeouts (30-60s) for these probes. Use liveness probes to restart truly stuck processes, but be conservative—killing a loaded server makes things worse.
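One way to structure such a deep check is sketched below. It assumes a hypothetical `generate(prompt, max_tokens)` callable that wraps your inference engine; the probe runs a tiny completion under a timeout and reports a reason on failure. Note that Python cannot forcibly stop a stuck worker thread, so the probe only reports the failure and leaves restarting to the orchestrator.

```python
import concurrent.futures
from typing import Callable


def deep_health_check(generate: Callable[[str, int], str],
                      timeout_s: float = 30.0) -> tuple[bool, str]:
    """Return (healthy, reason) based on a real, tiny inference call."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(generate, "ping", 10)  # lightweight 10-token completion
    try:
        text = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, "inference_timeout"
    except Exception as exc:  # e.g. out-of-memory or a crashed engine
        return False, f"inference_error: {type(exc).__name__}"
    finally:
        pool.shutdown(wait=False)  # do not block the probe on a stuck call
    if not text:
        return False, "empty_output"
    return True, "ok"


# Stand-in engine for demonstration; a real one would call the model.
print(deep_health_check(lambda prompt, max_tokens: "pong", timeout_s=1.0))
```

A readiness endpoint would map a failing result to a non-200 status so that traffic stops flowing to the replica, while the liveness probe stays on a cheaper check to avoid restarting servers that are merely busy.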

Key Takeaways

Memory bandwidth is the inference bottleneck - During decode, GPUs spend most time loading weights, not computing. Batching, quantization, and caching all help by increasing useful work per byte loaded.
KV cache dominates memory at scale - For long contexts or high concurrency, KV cache memory exceeds model weight memory. PagedAttention eliminates 60-80% memory waste through block-based allocation.
Continuous batching provides 5-10x throughput - Unlike static batching, continuous batching eliminates head-of-line blocking by adding requests as others complete, maximizing GPU utilization.
Quantization works because networks are overdetermined - INT4 quantization reduces memory 4x while preserving most model capability. The function computed depends on the weight manifold, not exact values.
Know your latency metrics: TTFT vs ITL - Time to First Token (prefill-dominated) and Inter-Token Latency (decode-dominated) require different optimizations. Prefix caching helps TTFT; batch size affects ITL.

Summary

LLM deployment is fundamentally different from traditional ML serving. The unique characteristics of autoregressive generation—memory bandwidth bounds, growing KV cache, token-by-token processing—require specialized techniques.

Key Insights

Memory bandwidth is the bottleneck: During decode, GPUs spend most of their time loading weights, not computing. This explains why batching, quantization, and speculative decoding all help—they increase the useful work done per byte loaded.

The KV cache dominates at scale: For long contexts or high concurrency, KV cache memory exceeds model weight memory. PagedAttention’s block-based management was revolutionary because it eliminated the 60-80% memory waste of naive allocation.

Continuous batching is essential: The shift from static to continuous batching provides 5-10x throughput improvement by eliminating head-of-line blocking and maximizing GPU utilization.

Quantization works because neural networks are overdetermined: The function computed depends on the weight manifold, not exact values. INT4 quantization preserves most of this manifold while reducing memory 4x.

Inference is a queueing theory problem: Batching, KV cache paging, prefill/decode disaggregation, and prefix caching are scheduling policies over a scarce GPU resource. Optimizing serving means optimizing the queue, not just the kernel.

Architecture Summary

Connections to Other Chapters

Chapter 5 (LLM Foundations): The transformer architecture, attention mechanism, and KV cache explained here drive all deployment constraints
Chapter 6 (Prompt Engineering): Understanding deployment costs helps design efficient prompts that minimize tokens while maintaining quality
Chapter 7 (RAG Systems): RAG often requires serving embedding models alongside LLMs; similar deployment principles apply
Chapter 8 (Agentic Systems): Agents create multi-turn conversations that accumulate KV cache; understanding memory growth is essential
Chapter 27 (Performance Engineering): Covers deeper GPU optimization, profiling, and CUDA-level tuning
Chapter 32 (Cost Engineering): Extends cost analysis to full TCO, including engineering time and opportunity cost

Practical Exercises

Memory Budget Calculation: For a 13B parameter model, calculate the maximum concurrent requests you can serve at 4K, 8K, and 16K context lengths on an A100 80GB. Use both FP16 and INT4 quantization scenarios. What’s the break-even context length where KV cache exceeds model weight memory?
Batching Experiment: Deploy vLLM and measure throughput at batch sizes 1, 2, 4, 8, 16, 32. Plot throughput vs. latency. At what batch size does throughput plateau? What limits further scaling?
Quantization Quality Evaluation: Compare a base model vs. AWQ INT4 quantized version on your specific use case. Measure both perplexity on representative data and task-specific quality (accuracy, human preference). Is the quality loss acceptable for your application?
Cost Comparison: For your workload (estimate request volume, input/output lengths), calculate the monthly cost of: (a) OpenAI API, (b) self-hosted on A100, (c) self-hosted on H100 with FP8. Include realistic ops overhead for self-hosting. At what volume does each option become optimal?
Speculative Decoding Implementation: Using a small (1B) and large (7B) model from the same family, implement speculative decoding. Measure acceptance rate and end-to-end speedup. How does K (speculation depth) affect the trade-off?

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] Why is LLM generation memory-bandwidth bound rather than compute-bound? What does this mean for optimization strategies?

Answer

During generation, each forward pass multiplies model weights against a single token’s hidden state (matrix-vector multiplication). The compute per weight element is tiny, but all weights must be loaded from GPU memory. On an H100, peak compute is ~2000 TFLOPS (dense FP8; roughly half that for FP16) while memory bandwidth is ~3.35 TB/s. The ratio (operations per byte) for generation is ~1-2, far below the ~600 operations per byte the hardware can sustain at FP8 (~300 at FP16). This means: (1) Adding more compute doesn’t help, because the GPU sits idle waiting for memory. (2) Batching helps: load weights once, use them for multiple requests. (3) Quantization helps: smaller weights mean less to load. (4) KV caching is essential: it avoids recomputing attention states.

Q2. [IC2] A 7B model in FP16 needs 14GB for weights. Why might you need 80GB of GPU memory to serve it at 4K context?

Answer

KV cache: For a typical 7B model architecture (32 layers, 32 heads, 128 dim per head), KV cache per request at 4K context in FP16 is: 2 × 32 × 4096 × 32 × 128 × 2 bytes = 2.15 GB per request. With model weights (14GB) + 20-30 concurrent requests × 2.15GB = 14 + 43-65 GB. Add activations, framework overhead, and buffer for continuous batching, and you’re at 80GB. At long contexts or high concurrency, KV cache dominates memory usage.

Q3. [Senior] Explain the difference between Time to First Token (TTFT) and Inter-Token Latency (ITL). Which optimizations affect each?

Answer

TTFT is the delay before generation starts—dominated by prefill (processing the prompt). Affected by: prompt length, batching wait time, prefill compute capacity, prefix caching (can dramatically reduce if prompt shares prefix with cached entry). ITL is the time between consecutive tokens during generation—dominated by decode memory bandwidth. Affected by: batch size (more batched requests = each request gets fewer cycles), quantization (faster weight loads), model size, hardware memory bandwidth. You can have low TTFT but high ITL (small prompts, many concurrent requests) or high TTFT but low ITL (long prompts, few concurrent requests).

Q4. [Senior] How does PagedAttention improve over fixed KV cache allocation? When does it matter most?

Answer

Traditional systems pre-allocate KV cache for max context length per request. If requests vary in length (common), short requests waste memory. PagedAttention allocates KV cache in fixed-size blocks (pages), allocated as needed. Benefits: (1) No wasted memory for short sequences. (2) Can serve more concurrent requests with same memory. (3) Enables prefix sharing—common prefixes share the same physical pages. Matters most when: (1) High request concurrency. (2) Variable sequence lengths. (3) Many requests share common prefixes (same system prompt). Production improvement: 2-4x throughput is common.

Q5. [Staff] You’re deploying a 70B model. Should you use tensor parallelism across 2 GPUs or pipeline parallelism? What factors determine this?

Answer

Tensor parallelism (TP) splits each layer across GPUs, so every token requires all GPUs to communicate. Per-token latency is lower because all GPUs work on each token in parallel, but communication overhead is high (all-reduce operations in every layer). Pipeline parallelism (PP) places different layers on different GPUs. Communication is lower (activations pass only between pipeline stages), but stages can sit idle in pipeline bubbles. For latency-sensitive serving, TP is usually better because each token completes faster. For throughput-focused batch processing, PP can be better because multiple microbatches fill the bubbles. For 70B on 2 GPUs, TP is the typical choice for serving. Interconnect bandwidth is the deciding factor: NVLink keeps TP communication cheap, while without it the all-reduce traffic becomes the bottleneck.

Spot the Problem

Problem 1. [IC2] A deployment configuration:

import vllm

model = vllm.LLM(
    model="llama-7b",
    max_num_batched_tokens=32768,
    gpu_memory_utilization=0.95
)

What issues might this cause?

Answer

Issues: (1) gpu_memory_utilization=0.95 leaves very little headroom. CUDA operations occasionally need temporary memory, so this can cause OOM under load; 0.85-0.90 is safer. (2) max_num_batched_tokens=32768 is very high. If requests have long prompts, it allows huge prefill batches that can cause memory spikes or stall other requests long enough to time out. (3) There is no explicit max_model_len, so the context limit falls back to a default that varies by version and might not match your use case. Better: start conservative (0.85 utilization), tune max_num_batched_tokens against your request distribution, and set max_model_len explicitly.
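
A more conservative version of the configuration, following the answer above, might look like the sketch below. The specific values (0.85, 8192, 4096) are illustrative starting points chosen for this example, not recommendations from vLLM; the right numbers depend on your GPU and request distribution. The keyword names are the ones used in the problem and the answer.

```python
# Illustrative starting values (assumptions); tune against your own traffic.
engine_kwargs = {
    "model": "llama-7b",              # placeholder name carried over from the problem
    "gpu_memory_utilization": 0.85,   # leave headroom for temporary CUDA allocations
    "max_num_batched_tokens": 8192,   # cap on tokens processed per scheduler step
    "max_model_len": 4096,            # explicit context limit instead of a default
}

# With vLLM installed, the dictionary would be passed straight through:
# import vllm
# model = vllm.LLM(**engine_kwargs)

for name, value in engine_kwargs.items():
    print(f"{name} = {value}")
```

Keeping the settings in a plain dictionary makes them easy to log and to compare between deployments while tuning.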

Problem 2. [Senior] Cost analysis claiming self-hosting saves 10x:

API cost: $0.01 per request
Self-hosted cost: $2000/month for GPU
At 200K requests/month: API = $2000, Self-hosted = $2000
At 2M requests/month: API = $20000, Self-hosted = $2000 → 10x savings!

Answer

Missing costs: (1) Operations overhead: engineer time for setup, maintenance, and on-call. At a $150/hr loaded cost, 20 hours/month is $3000, more than the GPU itself. (2) Infrastructure: load balancing, monitoring, logging, networking. (3) Opportunity cost: engineering time not spent on product. (4) Quality assumption: the analysis implicitly assumes the self-hosted model matches API quality, which may not hold, especially for hard tasks. (5) Utilization: the analysis assumes 100% GPU utilization. Realistic utilization is 70-80% for variable traffic. (6) Scaling: what happens at 5M requests? You may need another GPU. The real comparison is (GPU + ops + infra) vs. API, adjusted for quality differences.
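
To see how much the conclusion moves, here is a small sketch that adds only the operations line to the original comparison. The function `monthly_costs` and its parameter names are hypothetical; the prices come from the problem and the answer, `infra_cost` is left at zero as a placeholder, and a single GPU is assumed to handle every volume shown.

```python
def monthly_costs(requests, api_price=0.01, gpu_cost=2000.0,
                  ops_hours=20, hourly_rate=150.0, infra_cost=0.0):
    """Return (api_cost, self_hosted_cost) in dollars for one month."""
    api = requests * api_price
    self_hosted = gpu_cost + ops_hours * hourly_rate + infra_cost
    return api, self_hosted


for requests in (200_000, 500_000, 2_000_000):
    api, hosted = monthly_costs(requests)
    print(f"{requests:>9,} requests: API ${api:,.0f} vs self-hosted ${hosted:,.0f} "
          f"({api / hosted:.1f}x)")
```

With operations included, break-even moves from 200K to 500K requests per month, and the claimed 10x saving at 2M requests shrinks to 4x before infrastructure, utilization, and quality are considered.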

Problem 3. [Staff] A team’s quantization approach:

“We quantized our model to INT4 and ran our eval suite. Accuracy dropped from 92% to 90%. We decided 2% loss is acceptable.”

Answer

Flawed analysis: (1) Aggregate accuracy hides the distribution. The result might be 95%→95% on easy cases but 80%→70% on hard cases, so check performance by difficulty and category. (2) The eval suite may not cover edge cases, and quantization can amplify errors on unusual inputs. (3) Perplexity should also be compared, because it picks up subtle quality degradation that a pass/fail accuracy metric misses. (4) The production distribution matters: if hard cases are 20% of production traffic, a 10-point drop on them is significant. (5) The comparison should be run on a sample of actual production traffic. Better: stratified analysis by case difficulty, perplexity comparison, production traffic evaluation, and user-facing quality assessment.
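
The first point is easy to demonstrate. In the sketch below the per-stratum numbers are hypothetical: an 80/20 easy/hard split is assumed so that the example accuracies from the answer reproduce the team's reported 92% and 90% aggregates.

```python
# Hypothetical per-stratum results: (share of eval set, accuracy before, accuracy after)
strata = {
    "easy": (0.80, 0.95, 0.95),
    "hard": (0.20, 0.80, 0.70),
}

before = sum(share * acc_fp16 for share, acc_fp16, _ in strata.values())
after = sum(share * acc_int4 for share, _, acc_int4 in strata.values())
print(f"aggregate: {before:.0%} -> {after:.0%}")

for name, (share, acc_fp16, acc_int4) in strata.items():
    print(f"{name:>5}: {acc_fp16:.0%} -> {acc_int4:.0%} ({share:.0%} of cases)")
```

The aggregate moves by two points while the hard stratum loses ten, which is the pattern a single headline number cannot reveal.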

Design Exercises

Exercise 1. [Senior] Design the deployment architecture for a chatbot that needs to handle 10,000 concurrent users, with P95 latency under 2 seconds for the first token. Users have conversations averaging 10 turns. The model is 13B parameters. Outline your hardware, software stack, and capacity planning.

Guidance

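Key calculations: (1) At 10K concurrent users, assume 30% are actively waiting for a response at any moment, which gives 3K active requests. (2) A 13B model at FP16 needs 26GB for weights; with INT4 quantization, about 7GB. (3) KV cache for a 10-turn conversation (~4K context): budget ~1GB per request in FP16 or ~0.5GB in INT8. These figures assume a compact KV layout such as grouped-query attention; a 13B model with full multi-head attention (for example 40 layers, 40 heads, 128 dim per head) needs about 3.4GB per request in FP16 by the formula in Q2, so check your model’s configuration. (4) Per A100 80GB: ~7GB of weights plus ~50GB reserved for KV cache supports ~50 (FP16 KV) to ~100 (INT8 KV) concurrent requests, so 3K concurrent requests need ~30-60 A100s. Software: vLLM with continuous batching, behind a load balancer that routes by user session (session affinity lets later turns reuse the cached KV prefix). Scaling: provision for the peak estimate, monitor, then right-size. Use auto-scaling if traffic is variable.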
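
The capacity estimate can be written as a short calculation so that each assumption is easy to vary. This is a sketch using the guidance's own numbers (30% active, 50GB of KV budget per GPU, 1GB or 0.5GB per request); the function name `gpus_needed` is hypothetical.

```python
import math


def gpus_needed(active_requests, kv_budget_gb, kv_per_request_gb):
    """Return (GPU count, requests per GPU) for a given per-GPU KV cache budget."""
    per_gpu = math.floor(kv_budget_gb / kv_per_request_gb)
    return math.ceil(active_requests / per_gpu), per_gpu


active = round(10_000 * 0.30)  # 30% of users waiting on a response at once

for label, kv_gb in (("FP16 KV", 1.0), ("INT8 KV", 0.5)):
    gpus, per_gpu = gpus_needed(active, kv_budget_gb=50, kv_per_request_gb=kv_gb)
    print(f"{label}: {per_gpu} requests per GPU -> {gpus} GPUs")
```

The GPU count is inversely proportional to the per-request KV figure, so an error in that one number changes the hardware bill by the same factor.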

Exercise 2. [Staff] You’re responsible for reducing LLM serving costs by 50% without impacting user-facing quality. Current setup: API provider, $500K/month, mix of tasks (customer support, content generation, data extraction). Design your cost reduction strategy with specific actions, measurements, and risk mitigation.

Guidance

Strategies: (1) Task routing: not all tasks need the best model, so route simple tasks to cheaper, smaller models. This requires a task classifier and per-task quality validation. (2) Caching: identify common queries and cache their responses. This is most effective for support FAQs. (3) Prompt optimization: shorter prompts mean fewer input tokens, so trim examples and instructions aggressively. (4) Output optimization: constrain response length where appropriate. (5) Self-hosting evaluation: calculate the break-even point for the highest-volume tasks. Risk mitigation: (1) A/B test each change against the current baseline. (2) Monitor quality metrics per task category. (3) Use a staged rollout (10% → 50% → 100%). (4) Keep a kill switch that reverts to the original setup. Measurement: track cost per task, quality per task, and user satisfaction. Target: 50% cost reduction with <2% quality regression on any metric.

Recommended Reading

Essential (Read These)

“Efficient Memory Management for Large Language Model Serving with PagedAttention” (Kwon et al., 2023) The vLLM paper that introduced PagedAttention. Essential reading for understanding modern inference systems.

“FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning” (Dao, 2023) The attention optimization that enables long contexts. Well-written explanation of the tiling algorithm.

“GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers” (Frantar et al., 2022) Foundation for understanding weight quantization. Explains the Optimal Brain Quantization approach.

Deep Dives (For Specialists)

“AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration” (Lin et al., 2023) State-of-the-art quantization technique. Explains the activation-aware approach.

“Efficiently Scaling Transformer Inference” (Pope et al., 2022) Google’s analysis of inference efficiency. Excellent derivation of memory bandwidth bounds.

“Fast Inference from Transformers via Speculative Decoding” (Leviathan et al., 2022) Original speculative decoding paper. Rigorous analysis of acceptance rates and speedup bounds.

Systems and Infrastructure

vLLM Documentation (docs.vllm.ai) Comprehensive guide to the most popular open-source inference server.

TensorRT-LLM Documentation (NVIDIA) Deep technical details on NVIDIA’s optimized inference stack.

Historical Context

“Orca: A Distributed Serving System for Transformer-Based Generative Models” (Yu et al., 2022) Introduced continuous batching. Foundational for understanding modern scheduling.

--- title: "Chapter 9: LLM Deployment & Infrastructure" keywords: [deployment, inference, KV cache, quantization, vLLM, batching, GPU, PagedAttention, speculative decoding, self-hosting] difficulty: advanced prerequisites: [ch05] estimated_time: "4-5 hours" --- ## Introduction In March 2023, a startup launched their AI-powered legal document analysis tool. During development, their prototype felt snappy—responses streamed in under a second. On launch day, they watched their system collapse. Response times climbed from 500 milliseconds to 30 seconds. GPU memory overflowed, causing crashes. Their cloud bill for that single day exceeded their projected monthly budget. The model worked perfectly; the infrastructure failed catastrophically. This story illustrates a harsh truth about LLM deployment: training a model is the beginning, not the end. The techniques that make inference fast, efficient, and economical are fundamentally different from those used in training. Understanding these techniques—why they exist, how they work mathematically, and when to apply them—separates production-ready systems from expensive science projects. LLM inference presents unique challenges compared to traditional ML serving. A classification model processes fixed-size inputs and produces fixed-size outputs in a single forward pass. An LLM generates tokens one at a time, each requiring a full forward pass through billions of parameters. A 1000-token response requires 1000 forward passes—and each pass must load the entire model from memory. The computational pattern is fundamentally different, and so are the optimization strategies. This chapter examines the engineering of LLM deployment: the mathematical foundations of inference optimization, the systems that implement them, and the decision frameworks for choosing among alternatives. 
We focus primarily on self-hosted deployment because that's where you need the deepest understanding, but even if you use managed APIs, these concepts help you reason about performance, cost, and architectural decisions.

### The Core Challenge: Memory Bandwidth

To understand why LLM inference is hard, you need to understand one number: the ratio of compute to memory bandwidth on modern GPUs.

An NVIDIA H100 (SXM) GPU is rated at 1979 TFLOPS of FP16 Tensor Core compute. That headline figure assumes structured sparsity; dense throughput is roughly half. The same GPU can transfer 3.35 TB/s from its HBM3 memory. At the headline rate, the GPU can perform approximately 600 floating-point operations for every byte loaded from memory, and still about 300 at the dense rate.

Now consider LLM inference. During token generation, each forward pass multiplies the model's weight matrices by small vectors (the current token's hidden states). For a 7 billion parameter model in FP16, this means loading 14 GB of weights to perform matrix-vector multiplications on vectors of size ~4096. The computation is trivial compared to the data transfer.

The arithmetic intensity (operations per byte loaded) of this workload is roughly 1-2, hundreds of times lower than the hardware's ratio.

**LLM generation is memory-bandwidth bound, not compute bound.**

This single insight explains nearly every optimization technique in this chapter:

- **Batching**: Process multiple requests together so the same weight load serves multiple computations
- **Quantization**: Reduce weight size to load fewer bytes
- **KV caching**: Avoid recomputing attention by storing intermediate results
- **Continuous batching**: Keep GPUs busy by dynamically managing requests

Understanding that inference is memory-bound changes how you think about every optimization decision.

::: {.callout-important}
## Mental Model: Inference is a queue theory problem

Most engineers reach for inference optimization expecting a compute problem: more FLOPS, faster kernels.
The reality is closer to an airport boarding gate: requests arrive at unpredictable rates, share an expensive scarce resource (the GPU), and have wildly varying service times (prefill versus decode, short versus long outputs). KV cache management, continuous batching, prefill/decode disaggregation, and prefix caching are all *scheduling policies*, not math tricks. **Heuristic: when latency or cost surprises you, draw the queue first—arrival rate, service time, queue depth—before touching the model.** ::: Reframing **inference as a queue theory problem** rather than a compute problem is what unlocks the 10-20x throughput wins modern serving stacks routinely achieve. ### What You'll Learn - The mathematics of LLM inference: prefill vs. decode, KV cache growth, memory bandwidth limits - How continuous batching and PagedAttention achieve 10-20x throughput improvements - Quantization theory: why INT4 weights often preserve quality, and when they don't - Decision frameworks for GPU selection, inference servers, and self-hosting vs. APIs - How leading providers (Anthropic, OpenAI, Together, Fireworks) architect their serving systems - Cost modeling and optimization for sustainable deployment ### Prerequisites - Understanding of transformer architecture and attention mechanisms (Chapter 5) - Familiarity with GPU computing concepts (CUDA, memory hierarchy) - Basic knowledge of containerization and orchestration - Python programming --- ## The Physics of LLM Inference ### Two Phases, Two Bottlenecks LLM inference has two distinct computational phases with fundamentally different characteristics. **Prefill (prompt processing)**: The model processes all input tokens simultaneously, building the initial attention state. This phase involves dense matrix-matrix multiplications—multiplying weight matrices against the entire prompt sequence. The arithmetic intensity is high (many operations per byte loaded), making this phase **compute-bound**. 
Prefill time scales approximately linearly with prompt length, though the quadratic attention computation becomes significant for very long prompts. **Decode (generation)**: The model generates tokens one at a time. Each token requires a forward pass where weight matrices multiply against a single vector (the current token's hidden state). The arithmetic intensity drops dramatically—we load the entire model to generate one token. This phase is **memory-bandwidth bound**. Decode time per token is nearly constant regardless of how many tokens have been generated (with a slight increase due to growing KV cache reads). ![Compute Intensity Comparison](../assets/diagrams/rendered/ch09_compute_intensity.svg) This asymmetry has profound implications. A 2000-token prompt processes in prefill in perhaps 200ms. Generating 500 tokens in decode might take 5000ms—25x longer for 1/4 the tokens. The bottleneck isn't computation; it's moving 14+ GB of weights from memory to compute units for each generated token. ### The Mathematics of Memory Bandwidth Limits Let's derive why decode is memory-bound. Consider an H100 GPU (3.35 TB/s bandwidth) serving a 7B parameter model in FP16 (14 GB weights). **Time to load model weights once**: 14 GB / 3.35 TB/s = 4.2 ms **Theoretical maximum generation speed (single request)**: 1 token / 4.2 ms = 238 tokens/second **Observed speed in practice**: ~150-200 tokens/second (due to overheads, KV cache reads) This is the fundamental ceiling: you cannot generate tokens faster than you can read the model weights from memory. No amount of additional compute helps—the H100's massive 1979 TFLOPS sits mostly idle during single-request generation. This analysis reveals why batching is so critical. If you process B requests simultaneously, you load the weights once and use them B times. The effective bandwidth cost per token becomes (14 GB / B), and throughput scales nearly linearly with batch size—until you hit other limits. 
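
The bandwidth ceiling derived above is simple enough to compute directly. The sketch below is a rough model under stated assumptions: the function name `decode_ceiling` is hypothetical, every decode step is assumed to read all weights exactly once, and KV cache reads and compute limits are ignored, so the larger batch sizes are optimistic.

```python
def decode_ceiling(weight_gb, bandwidth_tb_s, batch_size=1):
    """Return (ms per decode step, tokens/second upper bound) for a weight-bound decode."""
    step_s = weight_gb / (bandwidth_tb_s * 1000)  # GB divided by GB per second
    return step_s * 1000, batch_size / step_s


for batch in (1, 8, 32):
    step_ms, tokens_per_s = decode_ceiling(weight_gb=14, bandwidth_tb_s=3.35, batch_size=batch)
    print(f"batch {batch:>2}: {step_ms:.2f} ms per step, up to {tokens_per_s:,.0f} tokens/s")
```

The step time does not change with batch size in this model, which is exactly why batching raises throughput almost for free until memory capacity or compute becomes the limit.
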
### The KV Cache: Trading Memory for Compute

The attention mechanism in transformers computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where Q (queries) represents the current token, and K (keys) and V (values) represent all tokens seen so far. Without caching, generating the 500th token would require recomputing the key and value projections for all 499 earlier tokens, so the total work across a generation grows quadratically.

The **KV cache** stores the K and V projections for all previous tokens. When generating a new token, we only compute its Q and look up the cached K, V. This transforms attention from O(n^2) to O(n) per token, but introduces a memory cost that grows linearly with sequence length.

**KV Cache Size Formula**:

$$\text{KV Cache per request} = 2 \times L \times S \times H \times D \times \text{bytes\_per\_value}$$

Where:

- 2: Key and Value tensors
- L: Number of transformer layers
- S: Sequence length (prompt + generated tokens)
- H: Number of attention heads
- D: Dimension per head (typically hidden_dim / num_heads)
- bytes_per_value: 2 for FP16, 1 for INT8

**Example: LLaMA-2 7B with 4K context in FP16**:

$$2 \times 32 \times 4096 \times 32 \times 128 \times 2 = 2.15 \text{ GB per request}$$

At 4K context, the KV cache for a single request (2.15 GB) is already more than half the size of the model weights stored at INT4 precision (3.5 GB), and at 8K context it exceeds them. This is why long-context serving is expensive: the memory cost grows linearly with context length and linearly with concurrent requests.

| Context Length | KV Cache Per Request | With 10 Concurrent | With 50 Concurrent |
|---------------|---------------------|-------------------|-------------------|
| 512 | 268 MB | 2.7 GB | 13.4 GB |
| 2,048 | 1.07 GB | 10.7 GB | 53.7 GB |
| 4,096 | 2.15 GB | 21.5 GB | 107 GB |
| 8,192 | 4.3 GB | 43 GB | 215 GB |

: Memory Growth with Context Length (LLaMA-2 7B, FP16 KV cache). Model weights add 14GB (FP16) or 3.5GB (INT4) to all scenarios.
{tbl-colwidths="[20,25,25,30]"} This table illustrates why serving long contexts at high concurrency requires either massive GPU memory, aggressive KV cache compression, or accepting lower concurrency. ### Key Performance Metrics Understanding inference metrics is essential for reasoning about systems and comparing configurations. **Time to First Token (TTFT)**: The delay before generation starts. Dominated by prefill time. Users perceive this as "thinking time." For interactive applications, TTFT under 500ms feels responsive; over 2 seconds feels sluggish. **Inter-Token Latency (ITL)** or **Time Per Output Token (TPOT)**: The average time between consecutive generated tokens. Determines streaming speed. For readable streaming, ITL under 50ms (20 tokens/second) feels natural. **Throughput**: Total tokens generated per second across all concurrent requests. The key metric for cost efficiency. A system generating 1000 tokens/second across 10 concurrent requests is 10x more efficient than one generating 100 tokens/second for a single request. **Tokens Per Dollar**: The ultimate efficiency metric. Combines throughput with infrastructure cost. These metrics trade off against each other. Increasing batch size improves throughput but increases TTFT (new requests wait for batch slots) and ITL (GPU time shared among more requests). Optimizing for one metric often degrades others. --- ## The KV Cache Revolution: PagedAttention ### The Memory Fragmentation Problem Before 2023, LLM serving systems pre-allocated contiguous memory for each request's KV cache, sized for the maximum possible sequence length. A system configured for 4K context would allocate 2.15 GB per request, even if the actual sequence used only 500 tokens. This led to severe memory fragmentation and waste. Consider a server with 80GB GPU memory, 14GB used by model weights, leaving 66GB for KV cache. 
With conservative pre-allocation: - Maximum concurrent requests: 66 GB / 2.15 GB ≈ 30 requests - But average sequences are 1K tokens, using only 537 MB - Actual memory usage: 30 × 537 MB = 16 GB - **Wasted memory: 50 GB (76%)** Even worse, variable-length sequences caused external fragmentation—freed memory was scattered in non-contiguous chunks, making it unusable for new requests. ### PagedAttention: Virtual Memory for KV Cache The vLLM paper (Kwon et al., 2023) introduced **PagedAttention**, applying operating system virtual memory concepts to KV cache management. The insight: manage KV cache as small, fixed-size blocks rather than contiguous per-request allocations. Instead of pre-allocating maximum sequence length, PagedAttention: 1. Divides KV cache into fixed-size blocks (typically 16 tokens each) 2. Allocates blocks on-demand as sequences grow 3. Maintains a block table mapping logical sequence positions to physical memory locations 4. Frees blocks immediately when requests complete ![Traditional vs PagedAttention Memory Allocation](../assets/diagrams/rendered/ch09_paged_attention.svg) The result is near-perfect memory utilization. Memory waste drops from 60-80% to under 4%. The same hardware can serve 2-4x more concurrent requests. **Block Tables**: Each request maintains a block table mapping logical positions to physical block IDs: ```python # Conceptual block table structure request_1_blocks = [ {"logical_range": (0, 15), "physical_block": 7}, {"logical_range": (16, 31), "physical_block": 23}, {"logical_range": (32, 47), "physical_block": 12}, ] ``` During attention computation, the system translates logical sequence positions to physical memory locations using these tables. The indirection adds negligible overhead compared to the memory efficiency gains. ### Copy-on-Write for Shared Prefixes PagedAttention enables another optimization: **copy-on-write sharing** for common prefixes. Many applications use the same system prompt for all requests. 
Without sharing, each of 50 concurrent requests stores identical KV cache entries for the shared prefix—pure waste. With PagedAttention, requests sharing a prefix can share the same physical blocks: ![Shared System Prompt with Copy-on-Write](../assets/diagrams/rendered/ch09_copy_on_write.svg) If each request's unique content is 300 tokens (19 blocks), total memory drops from 50×32 = 1600 blocks to 13 + 50×19 = 963 blocks—a 40% reduction. ### Practical Impact: Real-World Benchmarks The vLLM paper demonstrated significant improvements over previous systems: | Comparison | vLLM Improvement | Notes | |------------|------------------|-------| | vs HuggingFace Transformers (baseline) | 14-24x throughput | Naive batching baseline | | vs HuggingFace TGI | 2.2-3.5x throughput | Both use continuous batching | | vs FasterTransformer/Orca | 2-4x throughput | At similar latency | | Memory utilization | ~97% vs ~35% baseline | Near-zero KV cache waste | Note: The improvement magnitude depends heavily on workload characteristics. High-concurrency scenarios with shared prefixes see the largest gains. Interactive single-user scenarios may see smaller improvements. TGI v3 (2024) has closed some of this gap, particularly for long-context workloads with prefix caching. ::: {.callout-warning} ## Common Mistake: Ignoring KV Cache Sizing **What people do**: Focus only on model weight memory when sizing GPUs, then wonder why inference fails at moderate concurrency. **Why it fails**: KV cache grows with sequence length × batch size × layers × head dimension. A 70B model's weights fit in 40GB, but 32 concurrent 4K-context requests need another 50GB+ of KV cache. Memory exhaustion causes OOM crashes or aggressive eviction. **Fix**: Calculate worst-case KV cache before deployment: `kv_bytes = 2 × layers × hidden_dim × seq_len × batch_size × bytes_per_element`. Use this to set max_concurrent_requests. Monitor cache hit rates and eviction rates in production. 
:::

---

## Continuous Batching: Maximizing GPU Utilization

::: {.callout-note}
## Mental Model: "Inference Is a Queue Theory Problem"

Most engineers think about LLM inference as a compute problem: "How fast can we do matrix multiplications?" This framing leads to chasing faster GPUs and more FLOPS. It's wrong.

LLM inference is fundamentally a **queue theory problem**: "How do we keep tokens flowing through the system with minimum waiting?" The bottleneck isn't computation; it's memory bandwidth and scheduling.

This reframing changes everything:

**Traditional compute thinking**: "We need more GPU horsepower" → Buy more A100s

**Queue theory thinking**: "Requests are waiting in line too long" → Optimize scheduling

Key queue theory concepts that apply directly:

- **Arrival rate (λ)**: Requests per second hitting your system
- **Service rate (μ)**: Requests per second your system can complete (token throughput divided by average tokens per request)
- **Utilization (ρ = λ/μ)**: At ρ > 0.8, latency climbs steeply (roughly in proportion to 1/(1 − ρ) in simple queueing models)
- **Little's Law (L = λW)**: Average number of requests in the system equals arrival rate times average time each request spends in it

When you see high latency, don't immediately add GPUs. Ask: "Is my queue poorly scheduled?" Continuous batching, PagedAttention, and priority queues are scheduling optimizations, not compute optimizations. Often, better scheduling on existing hardware outperforms naive scaling by 5-10x.
:::

### The Static Batching Problem

Traditional batching collects requests, processes them together, and waits for the entire batch to complete before returning results. This creates two problems for LLM serving:

1. **Head-of-line blocking**: If one request generates 500 tokens while others generate 50, the short requests wait for the long one. Response time becomes max(request_lengths), not mean.
2. **Batch assembly latency**: New requests must wait for the current batch to finish before joining. During low traffic, requests queue unnecessarily.
Consider a batch of 8 requests: - Request 1-7: Generate 50 tokens each (500ms) - Request 8: Generates 500 tokens (5000ms) With static batching, all requests return after 5000ms. Requests 1-7 could have returned 4500ms earlier. ### Continuous Batching: The Solution **Continuous batching** (also called "iteration-level scheduling" or "dynamic batching") processes one decode step at a time, allowing requests to enter and exit the batch at any iteration. ![Continuous Batching Timeline](../assets/diagrams/rendered/ch09_continuous_batching.svg) The key insight: during decode, we generate one token per request per iteration. There's no mathematical requirement that all requests generate the same number of tokens before any can return. ### The Mathematics of Continuous Batching Let's formalize the throughput advantage. Define: - B: Maximum batch size - T_decode: Time for one decode iteration (one token per request) - N_tokens(i): Number of tokens generated by request i **Static batching throughput**: $$\text{Throughput}_{static} = \frac{\sum_{i=1}^{B} N_{tokens}(i)}{T_{decode} \times \max_{i}(N_{tokens}(i))}$$ **Continuous batching throughput** (ideal): $$\text{Throughput}_{continuous} = \frac{B}{T_{decode}}$$ The ratio depends on request length variance. If all requests generate exactly the same number of tokens, the approaches are equivalent. But real workloads have high variance—some requests are "tell me yes or no" while others are "write a detailed essay." **Example calculation**: - Batch of 8 requests: 7 generate 50 tokens, 1 generates 500 tokens - T_decode = 10ms per iteration Static: (7×50 + 500) / (10ms × 500) = 850/5000ms = 170 tokens/second Continuous: During first 50 iterations, 8 requests active; iterations 51-500, only 1 request active. Total time = 50×10ms + 450×10ms = 5000ms but 7 slots freed after iteration 50. With continuous batching, those 7 freed slots can accept new requests. 
In steady state, throughput approaches the ideal B/T_decode = 800 tokens/second, a 4.7x improvement for this workload.

### Implementation: Iteration-Level Scheduling

Modern inference servers implement continuous batching with iteration-level schedulers:

```python
# Conceptual iteration scheduler (simplified).
# prefill_batch() and decode_step() stand in for the model runner and are not shown.
EOS_TOKEN = 2  # placeholder end-of-sequence token id (assumption for this sketch)


class IterationScheduler:
    def __init__(self, max_batch_size, memory_manager):
        self.max_batch_size = max_batch_size
        self.memory = memory_manager
        self.active_requests = []  # Currently generating
        self.waiting_queue = []    # Pending prefill

    def schedule_iteration(self):
        """Select requests for this decode iteration."""
        # Step 1: Keep all active decode requests (don't interrupt)
        scheduled = [r for r in self.active_requests if not r.done]

        # Step 2: Admit waiting requests while slots and memory are available
        while (len(scheduled) < self.max_batch_size
               and self.waiting_queue
               and self.memory.can_fit(self.waiting_queue[0])):
            request = self.waiting_queue.pop(0)
            self.active_requests.append(request)  # track it so it can be retired later
            scheduled.append(request)

        return scheduled

    def run_iteration(self, scheduled):
        """Execute one decode step for all scheduled requests."""
        # Prefill any newly added requests
        new_requests = [r for r in scheduled if not r.prefilled]
        if new_requests:
            self.prefill_batch(new_requests)

        # Decode one token for every scheduled request
        tokens = self.decode_step(scheduled)

        # Retire completed requests; the rest stay active for the next iteration
        for request, token in zip(scheduled, tokens):
            request.generated_tokens.append(token)
            if token == EOS_TOKEN or len(request.generated_tokens) >= request.max_tokens:
                request.complete()
                self.memory.free(request.kv_blocks)
                self.active_requests.remove(request)
```

> **Full implementation**: See [reference/09_llm_deployment_code.md](../reference/09_llm_deployment_code.html#iteration-level-scheduler)

The scheduler's decisions (which requests to prefill, how to balance prefill vs. decode, when to preempt) significantly impact performance. Advanced schedulers consider request priorities, memory pressure, and fairness guarantees.
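
The static-versus-continuous comparison from the worked example can also be simulated. The sketch below is a simplified model, not a real scheduler: it ignores prefill cost and memory limits, assumes every decode iteration takes 10 ms regardless of how many slots are filled, and repeats the example mix (seven 50-token requests and one 500-token request) as a steady stream. All function names are hypothetical.

```python
from collections import deque

T_DECODE_S = 0.010  # 10 ms per decode iteration, as in the worked example
MAX_BATCH = 8


def make_workload(num_batches):
    """Repeat the example mix: seven 50-token requests, then one 500-token request."""
    return ([50] * 7 + [500]) * num_batches


def static_throughput(lengths):
    """Each batch of MAX_BATCH requests runs until its longest member finishes."""
    iterations = 0
    for start in range(0, len(lengths), MAX_BATCH):
        iterations += max(lengths[start:start + MAX_BATCH])
    return sum(lengths) / (iterations * T_DECODE_S)


def continuous_throughput(lengths):
    """Freed slots are refilled from the queue before every decode iteration."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    iterations = 0
    while waiting or active:
        while waiting and len(active) < MAX_BATCH:
            active.append(waiting.popleft())
        # One decode step: each active request emits a token; finished ones leave
        active = [remaining - 1 for remaining in active if remaining > 1]
        iterations += 1
    return sum(lengths) / (iterations * T_DECODE_S)


lengths = make_workload(100)
print(f"static:     {static_throughput(lengths):.0f} tokens/s")
print(f"continuous: {continuous_throughput(lengths):.0f} tokens/s")
```

The continuous figure lands slightly below the 800 tokens/second ideal because the last long request finishes with idle slots beside it; a longer stream shrinks that tail effect.
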
### Chunked Prefill: Optimizing for Mixed Workloads A subtle problem arises when long prompts enter the system. Prefill is compute-bound and can monopolize the GPU, blocking decode iterations for active requests. If a 10,000-token prompt enters, other requests might stall for several hundred milliseconds. **Chunked prefill** splits long prefills into chunks (e.g., 512 tokens), interleaving them with decode iterations: ![Chunked Prefill Interleaving](../assets/diagrams/rendered/ch09_chunked_prefill.svg) This increases the new request's TTFT (prefill takes longer overall) but maintains acceptable ITL for existing requests—a worthwhile trade-off for interactive applications. ::: {.callout-warning} ## Common Mistake: Wrong Max Batch Size **What people do**: Set max_batch_size to whatever the GPU can fit in memory, maximizing throughput. **Why it fails**: Maximum batch size ≠ optimal batch size. Large batches increase latency for all requests (everyone waits for the longest generation). For interactive applications, high p99 latency destroys user experience even if throughput is great. **Fix**: Set batch size based on latency SLOs, not memory capacity. Start with max_batch_size=8-16 for interactive workloads. Monitor p99 latency under load. Increase batch size only for batch/offline processing where latency doesn't matter. Use separate pools for interactive vs. batch traffic. ::: --- ## Quantization: Fitting More with Less ::: {.callout-tip} ## Staff Engineer Perspective "We spent three months building a complex serving optimization before someone suggested trying INT8 quantization. It took a day to implement and gave us 2x the throughput improvement of everything else combined. Now my rule is: try quantization before any other optimization. It's free performance." — *ML Platform Engineer at AI startup* ::: ### Why Quantization Works Quantization reduces numerical precision—replacing 16-bit floats with 8-bit or 4-bit integers. 
Intuitively, this should destroy model quality. Why doesn't it? The key insight comes from analyzing weight distributions in trained transformers. Weights are not uniformly distributed; they cluster around zero with heavy tails. Most weights are small, and the model's behavior depends more on preserving the relative relationships between weights than their exact values. Consider a simplified example. If we have weights [0.0012, 0.0015, 0.0014, 0.0013], the exact values matter less than their ordering and rough magnitudes. Quantization that maps these to [1, 2, 2, 1] (scaled appropriately) preserves most of the computational effect. **Formal perspective**: Neural network weights are overdetermined. The function computed by the network depends on the weight manifold (the subspace of weight configurations producing equivalent outputs), not individual weight values. Small perturbations from quantization often stay within this manifold. ### Quantization Methods: From Theory to Practice **Uniform quantization** divides the value range into equal intervals: $$q = \text{round}\left(\frac{x - x_{min}}{x_{max} - x_{min}} \times (2^b - 1)\right)$$ Where b is the number of bits. Dequantization reverses the process: $$\hat{x} = q \times \frac{x_{max} - x_{min}}{2^b - 1} + x_{min}$$ The problem: outliers. If most weights are between -0.1 and 0.1 but a few are at 1.0, the quantization range wastes precision on the sparse outliers. **Group quantization** addresses this by applying separate scales to groups of weights (typically 128 weights per group). This captures local distributions better but adds storage overhead for the scales. ### GPTQ: Optimal Brain Quantization for Transformers GPTQ (Frantar et al., 2022) applies **Optimal Brain Quantization** principles to LLMs. The key insight: quantize weights in order of increasing sensitivity, using second-order information (Hessian) to adjust remaining weights and compensate for quantization error. 
The GPTQ algorithm processes weights column by column:

1. Compute the inverse Hessian $H^{-1}$ (approximated efficiently)
2. For each weight $w_i$, compute the quantization error $\delta = w_i - \text{quant}(w_i)$
3. Adjust the remaining weights to minimize output change: $w_j \leftarrow w_j - \delta \cdot H^{-1}_{ij} / H^{-1}_{ii}$

This "error correction" is what makes GPTQ achieve better quality than naive quantization. The second-order update propagates the correction optimally through the layer.

**Calibration requirement**: GPTQ requires a small calibration dataset (typically 128-1024 samples) to estimate the Hessian. The calibration data should represent your actual workload; using generic text when you serve code will degrade quality.

### AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) observes that not all weights are equally important—some channels carry more information than others. Rather than treating all weights equally, AWQ protects "salient" weights identified by activation magnitudes.

The insight: channels with large activations amplify weight quantization errors. AWQ scales these channels before quantization, reducing their relative quantization error:

```
Channel importance from activations:
────────────────────────────────────────────────────────────────
Channel 1: avg |activation| = 0.1   → Low importance
Channel 2: avg |activation| = 2.3   → High importance
Channel 3: avg |activation| = 0.05  → Low importance
Channel 4: avg |activation| = 1.8   → High importance

AWQ strategy: Scale channels 2 and 4 up before quantization,
              scale them back down during inference.
              Higher precision preserved where it matters.
```

AWQ typically achieves 0.5-1% better perplexity than GPTQ at the same bit-width, with the advantage increasing for aggressive 3-bit quantization.

### FP8: The Hardware-Supported Middle Ground

NVIDIA H100 GPUs introduced native FP8 (8-bit floating point) support. Unlike INT8, FP8 maintains a floating-point format.
The E4M3 variant uses 4 bits of exponent and 3 bits of mantissa (better precision, smaller range — typically used for forward-pass activations and weights), while E5M2 uses 5 bits of exponent and 2 bits of mantissa (wider dynamic range — typically used for gradients in training).

Advantages of FP8:

- Native hardware support: 2x throughput vs. FP16 on H100
- Floating-point range: Handles outliers better than fixed-point
- Minimal quality loss: Typically <0.5% perplexity increase
- Simple adoption: No complex calibration or algorithm changes

FP8 is increasingly the default for H100 deployments—it provides the same 2x memory savings as INT8 with simpler implementation and better quality.

### Quantization Decision Framework

![Quantization Decision Framework](../assets/diagrams/rendered/ch09_quantization_decision.svg)

**Quality degradation by method (typical, task-dependent)**:

| Method | Bits | Memory | Perplexity increase | Quality |
|--------|------|--------|------------|---------|
| FP16 | 16 | 1x | Baseline | Reference quality |
| FP8 | 8 | 2x savings | <0.5% | Excellent; H100 optimized |
| INT8 (GPTQ) | 8 | 2x savings | 0.5-1.5% | Good general-purpose |
| INT8 (AWQ) | 8 | 2x savings | 0.3-1.0% | Better than GPTQ |
| INT4 (GPTQ) | 4 | 4x savings | 2-15% | Degrades on complex tasks |
| INT4 (AWQ) | 4 | 4x savings | 1-10% | Preferred INT4; test first |

: Quantization quality degradation by method (typical, task-dependent). {tbl-colwidths="[20,10,15,20,35]"}

---

## FlashAttention: Compute-Efficient Attention

### The Memory Hierarchy Problem

Standard attention implementation materializes the full N×N attention matrix in GPU memory:

```python
import math
import torch

# Standard attention (memory-inefficient)
Q, K, V = (torch.randn(1, 2, 16, 8) for _ in range(3))  # toy (batch, heads, seq_len, head_dim)
d_k = Q.size(-1)
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, heads, seq_len, seq_len)
attention = torch.softmax(scores, dim=-1)  # Materializes N×N matrix
output = torch.matmul(attention, V)
```

For sequence length N=4096 and 32 heads, the attention matrix requires, per sequence:

- 32 × 4096 × 4096 × 4 bytes (FP32) = 2 GB of temporary storage

The matrix is written to HBM, read back, and then discarded—pure waste of precious GPU memory bandwidth.

### FlashAttention: Tiled, Memory-Efficient Attention

FlashAttention (Dao et al., 2022) restructures attention to minimize memory transfers by keeping computation in fast SRAM rather than slow HBM.

The key insight: we don't need the full attention matrix simultaneously. We can compute attention in tiles, accumulating results without ever materializing the full N×N matrix.

The algorithm:

1. Divide Q, K, V into blocks that fit in SRAM
2. For each Q block, iterate over K, V blocks
3. Compute partial attention scores in SRAM
4. Apply softmax correction (requires tracking running max/sum)
5. Accumulate to output without writing intermediate results to HBM

```
Standard Attention Memory Traffic:
────────────────────────────────────────────────────────────────
HBM ──read Q, K──► SRAM ──compute S=QK^T──► HBM ──write S
HBM ──read S─────► SRAM ──softmax(S)──────► HBM ──write P
HBM ──read P, V──► SRAM ──compute PV──────► HBM ──write O

Memory I/O: O(N² + Nd) ≈ O(N²) for large N

FlashAttention Memory Traffic:
────────────────────────────────────────────────────────────────
HBM ──read Q, K, V blocks──► SRAM ──compute in tiles──► HBM ──write O

Memory I/O: O(N²d²/M) for SRAM size M; the N×N matrix never reaches HBM
```

**Performance impact**: FlashAttention achieves 2-4x speedup for long sequences. More importantly, it enables longer contexts without OOM—a qualitative capability improvement.

### FlashAttention-2: Further Optimizations

FlashAttention-2 (Dao, 2023) improves parallelization across thread blocks:

1. **Parallelize over sequence length**: Process Q blocks in parallel (independent)
2. **Better work partitioning**: Reduce synchronization overhead
3. **Leverage NVIDIA-specific instructions**: Exploit tensor cores more efficiently

FlashAttention-2 achieves up to 2x additional speedup over FlashAttention-1, reaching a much larger fraction of the hardware's theoretical peak.

**Why this matters for inference**: During decode, the query is a single token attending to a potentially long KV cache. FlashAttention's memory efficiency makes long-context inference practical. Without it, 100K+ context windows would be prohibitively slow and memory-intensive.

---

## Inference Server Architectures

### The Inference Server Landscape

The complexity of efficient LLM serving—PagedAttention, continuous batching, tensor parallelism, quantization kernels—has been packaged into specialized inference servers. Understanding their architectures and trade-offs is essential for deployment decisions.

::: {.panel-tabset}
## Naive Inference

```python
# Sequential processing - wastes GPU cycles
# Assumes `tokenizer` and `model` are an already-loaded Hugging Face tokenizer and causal LM on the GPU.
def generate_responses(prompts: list[str]) -> list[str]:
    results = []
    for prompt in prompts:
        # One request at a time, GPU sits idle between requests
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        output = model.generate(**inputs, max_new_tokens=256)
        results.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return results

# 10 requests x 2 seconds each = 20 seconds total
# GPU utilization: ~15%
```

## Production Inference

```python
from dataclasses import dataclass
from typing import Optional

from vllm import LLM, SamplingParams


@dataclass
class GenerationResult:
    text: str
    tokens: int
    finish_reason: Optional[str]


# Production: continuous batching with KV cache management
class ProductionInference:
    def __init__(self, model_path: str):
        self.llm = LLM(
            model=model_path,
            tensor_parallel_size=2,        # Split across GPUs
            gpu_memory_utilization=0.9,    # Memory budget for weights + KV cache
            max_num_seqs=64,               # Concurrent sequences
            enable_prefix_caching=True,    # Cache common prefixes
            quantization="fp8",            # Memory efficiency
        )

    # LLM.generate is a blocking call, so this method is synchronous.
    def generate_batch(
        self, prompts: list[str], max_tokens: int = 256
    ) -> list[GenerationResult]:
        sampling_params = SamplingParams(
            max_tokens=max_tokens,
            temperature=0.7,
            top_p=0.95,
        )
        # Continuous batching: requests processed as tokens complete
        # New requests can join mid-batch
        outputs = self.llm.generate(prompts, sampling_params)
        return [
            GenerationResult(
                text=output.outputs[0].text,
                tokens=len(output.outputs[0].token_ids),
                finish_reason=output.outputs[0].finish_reason,
            )
            for output in outputs
        ]

# 10 requests batched = ~3 seconds total
# GPU utilization: ~85%
```

**Key differences**: Continuous batching (vs sequential), PagedAttention for KV cache, tensor parallelism across GPUs, prefix caching for repeated system prompts, FP8 quantization, and concurrent sequence handling. Result: roughly 6x throughput improvement in this illustrative comparison.
:::

### vLLM: The PagedAttention Reference Implementation

vLLM, developed at UC Berkeley, introduced PagedAttention and remains the most widely deployed open-source inference server.

**Architecture**:

![vLLM Architecture](../assets/diagrams/rendered/ch09_vllm_architecture.svg)

**Key configuration parameters**:

- `gpu-memory-utilization`: Fraction of GPU memory vLLM may use for weights, activations, and KV cache (default 0.9)
- `max-model-len`: Maximum sequence length
- `tensor-parallel-size`: GPUs for tensor parallelism
- `max-num-seqs`: Maximum concurrent sequences
- `quantization`: awq, gptq, fp8, or none

**When to choose vLLM**:

- General-purpose LLM serving
- High throughput requirements
- Need for prefix caching / shared prompts
- OpenAI API compatibility required

### TensorRT-LLM: Maximum NVIDIA Performance

TensorRT-LLM is NVIDIA's inference stack, optimizing specifically for NVIDIA GPUs.

**Key differences from vLLM**:

1. **Compilation required**: Models must be compiled to TensorRT engines
2. **Kernel optimization**: Custom kernels for each GPU architecture
3. **In-flight batching**: NVIDIA's continuous batching implementation
4. **FP8 native**: Best FP8 support on H100

**Architecture**:

![TensorRT-LLM Architecture](../assets/diagrams/rendered/ch09_tensorrt_architecture.svg)

**Compilation process**:

```bash
# Flag names vary across TensorRT-LLM releases; check `trtllm-build --help` for your version.

# Convert model to TensorRT-LLM checkpoint
python convert_checkpoint.py --model_dir llama-7b-hf --output_dir llama-7b-trt

# Build TensorRT engine
trtllm-build --checkpoint_dir llama-7b-trt \
    --output_dir llama-7b-engine \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_output_len 512
```

**When to choose TensorRT-LLM**:

- Maximum throughput on NVIDIA GPUs (often 20-40% faster than vLLM)
- Using H100s with FP8
- Willing to invest in complex setup and model compilation
- Latency-critical applications

### SGLang: Structured Generation Optimization

SGLang (LMSYS) optimizes for workloads with structured outputs—JSON, code, constrained generation.

**Key innovation**: RadixAttention, a radix-tree-based KV cache that efficiently shares prefixes across requests. SGLang pairs it with fast constraint checking during generation for structured outputs.

**When to choose SGLang**:

- Structured output (JSON schemas, grammar constraints)
- Multi-turn conversations with prefix sharing
- Research / rapid prototyping

### Hugging Face TGI: Production-Ready Simplicity

Text Generation Inference (TGI) prioritizes production stability and Hugging Face ecosystem integration.

**Key features**:

- Watermarking support (detecting AI-generated text)
- Token streaming with standard SSE
- Prometheus metrics built-in
- Simple Docker deployment

**When to choose TGI**:

- Hugging Face model hub workflow
- Need watermarking
- Prefer stability over maximum performance

### Inference Server Decision Framework

![Inference Server Selection](../assets/diagrams/rendered/ch09_inference_server_selection.svg)

---

## How the Leaders Do It: Industry Case Studies

Understanding how leading providers architect their serving infrastructure provides both practical insights and validation for the techniques we've discussed.
### OpenAI: Scale Through Specialization

OpenAI serves billions of API requests daily across GPT-5, GPT-5.5, and GPT-4 variants. Their architecture, while not fully public, reveals key patterns through their API behavior and published work.

**Observed characteristics**:

- **Multiple model variants**: Different price/capability tiers (GPT-5.5 vs GPT-5 vs GPT-4)
- **Aggressive batching**: High latency tolerance for batch API (50% cost reduction)
- **Speculative decoding**: Faster tokens on GPT-5.5 suggest draft model usage
- **Regional routing**: Requests routed to nearest data center

**Inferred architecture patterns**:

- Separate clusters for different model sizes/speeds
- Request classification to route to appropriate backend
- Heavy use of prefix caching for system prompts
- Custom hardware (rumored TPU/custom ASIC usage)

**Key insight**: OpenAI offers explicit price/latency trade-offs (batch API, prompt caching discount), suggesting their infrastructure can dynamically trade these resources.

### Anthropic: Optimizing for Long Context

Anthropic's Claude models support 200K token contexts—far beyond typical deployments. This requires specific architectural choices.

**Observed characteristics**:

- **Streaming by default**: Long outputs stream immediately
- **Context caching**: Explicit prompt caching API with cost reduction
- **Graceful degradation**: Long context requests have higher latency tolerance

**Inferred architecture patterns**:

- Extensive KV cache optimization (likely PagedAttention or custom equivalent)
- Potentially hierarchical or compressed KV cache for very long contexts
- Dedicated capacity for long-context workloads
- Heavy prefix caching for repeated system prompts

**Key insight**: Anthropic's explicit prompt caching API (5-minute cache, 90% cost reduction for cache hits) reveals their infrastructure heavily leverages prefix sharing.

### Together AI: Inference as a Service

Together AI specializes in hosted inference for open-source models.
Their architecture is more transparent.

**Public architecture details**:

- **Custom inference stack**: "Together Inference Engine" with claimed 2-3x throughput vs. standard vLLM
- **Speculative decoding**: Available for supported models
- **Multiple model sizes**: From small to 70B+ parameters
- **Dedicated endpoints**: Option for reserved capacity

**Key optimizations (from their blog)**:

- Custom CUDA kernels for specific model architectures
- Aggressive operator fusion
- Memory-optimized attention implementations
- Dynamic batching with priority support

**Throughput claims**: Up to 117 tokens/second per request for Llama 4 Scout on 8x A100s with speculative decoding.

### Fireworks AI: Speed as the Value Proposition

Fireworks AI positions itself on inference speed, claiming industry-leading latency.

**Public architecture details**:

- **FireAttention**: Custom attention kernel optimized for inference
- **Speculative decoding**: Enabled by default where applicable
- **Optimized routing**: Request-aware load balancing

**Performance claims**:

- TTFT under 200ms for most requests
- Up to 4x throughput improvement over standard implementations

**Key insight**: Fireworks' focus on latency suggests architecture choices that prioritize TTFT over raw throughput—potentially smaller batch sizes, eager prefill, and latency-optimized kernels.

### Lessons from Production Deployments

Common patterns across all providers:

1. **Tiered service levels**: Different price/performance points for different workloads
2. **Prefix caching**: Universal adoption, often with explicit API support
3. **Speculative decoding**: Increasingly common for throughput improvement
4. **Custom kernels**: All serious providers have custom CUDA implementations
5. **Batching intelligence**: Request-aware routing and scheduling
6. **Regional distribution**: Low-latency access requires geographic presence

---

## GPU Selection and Capacity Planning

### Understanding GPU Specifications for Inference

GPU selection for inference differs from training. Memory bandwidth often matters more than raw compute.

**Key specifications to evaluate**:

| Spec | Why It Matters | Trade-off |
|------|----------------|-----------|
| Memory Size | Determines model size + max concurrent requests | Larger = more expensive |
| Memory Bandwidth | Bounds single-request generation speed | Higher bandwidth GPUs cost more |
| FP16 TFLOPS | Affects prefill speed | Often over-provisioned for inference |
| FP8 Support | Enables 2x throughput with minimal quality loss | Hopper (H100/H200) and Ada (L40S, RTX 4090) only |
| Interconnect | Needed for tensor parallelism | NVLink vastly outperforms PCIe |

### GPU Comparison for Inference

| GPU | Memory | BW (TB/s) | $/hr | Best For |
|-----|--------|-----------|------|----------|
| RTX 4090 | 24 GB | 1.0 | $0.50 | Dev/testing, 7B models |
| A10 | 24 GB | 0.6 | $1.50 | 7B production, cost-sensitive |
| L40S | 48 GB | 0.9 | $2.00 | 13B models, balanced |
| A100 40GB | 40 GB | 1.6 | $4.00 | 13-30B, high throughput |
| A100 80GB | 80 GB | 2.0 | $6.00 | 70B single GPU, large batches |
| H100 80GB | 80 GB | 3.4 | $10.00 | Maximum throughput, FP8 |
| H200 141GB | 141 GB | 4.8 | $15.00 | 200B+, longest contexts |

: GPU Comparison for LLM Inference (2024-2025). Approximate on-demand cloud pricing; reserved/spot significantly lower.
{tbl-colwidths="[18,12,13,12,45]"}

### Memory Budget Calculation

A systematic approach to GPU selection:

**Step 1: Model memory**

```
Weight memory = Parameters × Bytes per parameter

7B FP16:  7 × 10^9 × 2    = 14 GB
7B INT4:  7 × 10^9 × 0.5  = 3.5 GB
70B INT4: 70 × 10^9 × 0.5 = 35 GB
```

**Step 2: KV cache budget**

```
Available for KV = GPU Memory - Model Memory - Overhead (2-4 GB)

A100 80GB with 7B INT4:
80 - 3.5 - 3 = 73.5 GB for KV cache
```

**Step 3: Max concurrent requests**

```
KV cache per request = 2 × Layers × Seq_len × Heads × Head_dim × Bytes

LLaMA-7B at 4K context, FP16 KV:
2 × 32 × 4096 × 32 × 128 × 2 = 2.15 GB per request

Max concurrent: 73.5 GB / 2.15 GB ≈ 34 requests
```

**Step 4: Throughput estimation**

```
Tokens/second ≈ Memory_bandwidth / (Model_size × bytes_per_weight)
                × Batch_efficiency_factor × Kernel_efficiency

A100 (2 TB/s) with 7B INT4, batch 16:
2 × 10^12 / (7 × 10^9 × 0.5) × 0.8 × 0.7 ≈ 320 tok/s per request
Total throughput: ~5,000 tok/s (batch 16)
```

### Multi-GPU Considerations

When a model doesn't fit on one GPU, you need parallelism:

**Tensor Parallelism (TP)**: Split each layer across GPUs. Requires fast interconnect (NVLink). Linear speedup for prefill, limited improvement for decode.

**Pipeline Parallelism (PP)**: Different layers on different GPUs. Works across nodes. Introduces pipeline bubbles, more complex scheduling.

![Multi-GPU Strategy Selection](../assets/diagrams/rendered/ch09_multi_gpu_strategy.svg)

---

## Self-Hosting vs. API: The Decision Framework

### Cost Analysis Framework

::: {.callout-note}
## Historical Context: The Dramatic Decline of LLM Costs

**2020 (GPT-3 Beta)**: $60/1M tokens. Early access was expensive and rate-limited. Cost prohibited most production use cases. A customer support bot at scale could cost $1M+/month.

**2022 (ChatGPT Launch)**: $20/1M tokens (text-davinci). Still expensive, but 3x cheaper. Companies started building production applications, carefully counting tokens.

**2023 (GPT-3.5 Turbo)**: $2/1M tokens. A 10x drop. The "ChatGPT API" made LLMs economically viable for mainstream applications. Token counting became less critical.

**Late 2023 (GPT-4 Turbo, Claude 2.1)**: Frontier models at $10-30/1M input, $30-60/1M output. Smaller/faster models hit $0.50-2/1M. Tiered pricing became standard: pay for capability you need.

**2024 (GPT-4o, Claude 3.5 Sonnet)**: $2.50-5/1M for production-ready models. Cached input at 50-90% discount. Quality that was frontier-only a year prior now available at commodity prices.

**2025 (Current)**: Frontier models at $10-15/1M, production models at $1-3/1M, distilled models at $0.10-0.50/1M. 100-600x cheaper than 2020 for equivalent capability.

**The pattern**: LLM costs dropped roughly 10x every 18 months—faster than Moore's Law. This changes build-vs-buy economics: self-hosting makes sense at ever-higher volume thresholds. More importantly, capabilities that were impossibly expensive become routine—enabling entirely new application categories.
:::

The self-host vs. API decision ultimately comes down to total cost of ownership. Let's build a rigorous comparison framework.

::: {.callout-tip}
## Staff Engineer Perspective: The Self-Hosting Decision

"The most common mistake I see is teams doing a spreadsheet analysis—'API costs $X, self-hosting costs $Y, Y < X, let's self-host.' That analysis is always wrong because it ignores the hidden costs.

Self-hosting means you need: GPU procurement expertise (6-month lead times), MLOps engineers who understand CUDA, on-call rotation for inference infrastructure, capacity planning for traffic spikes, and a model upgrade path that doesn't require re-architecting.

My rule of thumb: don't self-host until you're spending $50K/month on API costs AND you have at least two engineers who've operated GPU infrastructure before. Below that threshold, the operational overhead will eat any savings. Above it, the economics usually work—but only if you've built the team first."

— *Staff ML Engineer, Infrastructure*
:::

**API costs** (straightforward):

```
Monthly API Cost = Requests × (Input_tokens × $/input + Output_tokens × $/output)

Example: GPT-5.5 at 1M requests/month, 500 input, 200 output tokens avg
= 1M × (500 × $0.005/1K + 200 × $0.03/1K)
= 1M × ($0.0025 + $0.006)
= $8,500/month
```

**Self-hosted costs** (complex):

```
Hardware: GPU rental or amortization
  = GPU_cost × Number_of_GPUs × Hours_per_month

Operations: Engineer time for setup, maintenance, on-call
  = Loaded_engineer_cost × Hours_per_month

Infrastructure: Networking, storage, monitoring
  = Platform_costs (varies widely)

Quality risk: Model may not match API capability
  = Hard to quantify, often underestimated
```

### Break-Even Analysis

```python
def calculate_breakeven(
    api_cost_per_request: float,
    gpu_hourly_cost: float,
    throughput_requests_per_second: float,
    ops_hours_per_month: float = 20,
    engineer_hourly_cost: float = 100
) -> dict:
    """Calculate when self-hosting becomes cheaper."""
    # GPU cost for 24/7 operation
    monthly_gpu_cost = gpu_hourly_cost * 24 * 30

    # Operations overhead
    monthly_ops_cost = ops_hours_per_month * engineer_hourly_cost

    # Total self-hosted monthly cost
    total_self_hosted = monthly_gpu_cost + monthly_ops_cost

    # Maximum requests the GPU can serve
    max_requests_per_month = throughput_requests_per_second * 3600 * 24 * 30 * 0.7  # 70% utilization

    # Break-even point
    break_even_requests = total_self_hosted / api_cost_per_request

    return {
        "monthly_self_hosted_cost": total_self_hosted,
        "break_even_requests": break_even_requests,
        "max_capacity_requests": max_requests_per_month,
        "cost_per_request_at_capacity": total_self_hosted / max_requests_per_month
    }
```

**Example: Llama 4 Scout vs. GPT-5.5**

| Factor | GPT-5.5 API | Self-hosted Llama 4 Scout |
|--------|-----------------|----------------------|
| Cost per request* | $0.0085 | Variable |
| GPU cost | N/A | 2x A100 80GB @ $12/hr = $8,640/mo |
| Ops cost | N/A | ~$2,000/mo |
| Throughput | N/A | ~50 req/s = 130M req/mo capacity |
| Break-even | N/A | $10,640 / $0.0085 = 1.25M req/mo |
| Cost at capacity | $0.0085 | $0.00008 (106x cheaper!) |

: Llama 4 Scout vs GPT-5.5 cost comparison. *500 input + 200 output tokens. {tbl-colwidths="[25,25,50]"}

**The catch**: This analysis assumes:

- You achieve API-level quality with open-source models (not always true)
- You maintain high utilization (often difficult)
- Your ops costs stay low (they often don't)

### Decision Framework

![LLM Deployment Decision Framework](../assets/diagrams/rendered/ch09_deployment_decision.svg)

**Key decision factors:**

- **Data residency requirements** → May force self-host regardless of cost
- **Burst traffic patterns** → APIs handle spikes better without over-provisioning
- **Model customization needs** → Self-host enables fine-tuning and modification
- **Team expertise** → Self-hosting requires GPU/ML infrastructure expertise

### Hybrid Architectures

Many organizations find hybrid approaches optimal:

1. **API for peak, self-host for baseline**: Self-host for predictable baseline traffic, overflow to API during peaks
2. **Self-host for sensitive, API for general**: Private data stays on-premise, general queries to API
3. **Self-host for experimentation, API for production**: Develop and test locally, deploy proven prompts to API
4. **API for quality-critical, self-host for volume**: Important use cases get GPT-5, commodity tasks get self-hosted

### Self-Host vs. API Decision Flowchart

![Self-Host vs API Decision Flowchart](../assets/diagrams/rendered/ch09_selfhost_vs_api.svg)

---

## Speculative Decoding: Trading Compute for Latency

### The Speculative Decoding Insight

Speculative decoding addresses a fundamental inefficiency: during decode, we load the entire model to generate a single token. What if we could verify multiple tokens with a single model load?

The key insight: **verification is cheaper than generation**. Given candidate tokens, checking if the large model agrees is faster than generating those tokens directly.

### The Algorithm

1. **Draft**: A small, fast model generates K candidate tokens autoregressively
2. **Verify**: The large model processes all K candidates in a single forward pass
3. **Accept/Reject**: Compare probabilities; accept matching tokens, reject and correct on disagreement

![Speculative Decoding Timeline](../assets/diagrams/rendered/ch09_speculative_decoding.svg)

### The Mathematics of Speculation

Let:

- p_target(x): Target model's probability of token x
- p_draft(x): Draft model's probability of token x
- K: Number of speculated tokens
- α: Average acceptance rate

**Acceptance criterion**: Accept the drafted token x with probability $\min\left(1, \frac{p_{target}(x)}{p_{draft}(x)}\right)$. On rejection, that position is resampled from the normalized residual distribution $\max(0, p_{target} - p_{draft})$. This rejection-sampling rule is the whole point: it makes speculative decoding **lossless**, so the output distribution exactly matches what the target model would have produced on its own. (A naive "accept if the target likes it at least as much as the draft" test would bias the distribution.)
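The acceptance rule above can be sketched for a single drafted token. This is a minimal pure-Python illustration under stated assumptions, not a serving implementation: the function names, the three-token vocabulary, and the two toy distributions are hypothetical, and a real system would apply the same rule position by position to the K drafted tokens using the models' actual probabilities.

```python
import random


def residual_distribution(p_target, p_draft):
    """Normalize max(0, p_target - p_draft) over the vocabulary."""
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    total = sum(residual)
    # total == 0 only if the distributions are identical, in which case a
    # rejection cannot occur; fall back to the target distribution anyway.
    return [r / total for r in residual] if total > 0 else list(p_target)


def verify_token(token, p_target, p_draft, rng=random):
    """Return (emitted_token, accepted) for one drafted token."""
    # p_draft[token] > 0 because the token was sampled from the draft model.
    accept_prob = min(1.0, p_target[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: resample this position from the normalized residual.
    residual = residual_distribution(p_target, p_draft)
    corrected = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return corrected, False


# Toy check of losslessness on a hypothetical 3-token vocabulary.
rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]
p_draft = [0.3, 0.5, 0.2]
trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    drafted = rng.choices(range(3), weights=p_draft, k=1)[0]
    emitted, _accepted = verify_token(drafted, p_target, p_draft, rng)
    counts[emitted] += 1
empirical = [c / trials for c in counts]
print(empirical)
```

Even though every candidate comes from the draft distribution, the emitted tokens should follow `p_target` up to sampling noise. Working the toy case by hand gives the same result: token 0 is emitted with probability 0.3 (drafted and always accepted) plus 0.3 (the total rejection mass, all of which the residual assigns to token 0), which equals the target's 0.6.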
**Expected tokens per iteration**: $$E[\text{tokens}] = \sum_{k=0}^{K} (k+1) \cdot P(\text{accept } k \text{ then reject}) + (K+1) \cdot P(\text{accept all } K)$$ **Speedup factor**: $$\text{Speedup} = \frac{E[\text{tokens}]}{1 + K \cdot r_{draft}}$$ Where r_draft is the ratio of draft model cost to target model cost. For well-matched draft models (high α, low r_draft), speedups of 2-3x are typical. ### Draft Model Selection The draft model must balance: - **Speed**: Small enough to be fast - **Alignment**: Similar output distribution to minimize rejection - **Architecture**: Compatible tokenizer and vocabulary **Options**: 1. **Smaller same-family model**: LLaMA-7B as draft for Llama 4 Scout 2. **Distilled model**: Model trained specifically to match the target 3. **Quantized self-draft**: INT4 version of the target model **Alignment matters enormously**: A misaligned draft model produces candidates the target rejects, wasting the speculation cost with no benefit. ```python # Evaluating draft model alignment import random def measure_acceptance_rate(draft, target, prompts, K=4): """Measure how often target accepts draft's predictions.""" total_drafted = 0 total_accepted = 0 for prompt in prompts: draft_tokens = draft.generate(prompt, num_tokens=K) target_probs = target.get_token_probs(prompt + draft_tokens) draft_probs = draft.get_token_probs(prompt + draft_tokens) for d_prob, t_prob in zip(draft_probs, target_probs): total_drafted += 1 # Correct speculative-sampling rule: accept with probability # min(1, p_target / p_draft). On rejection the real algorithm # resamples this position from the residual; we stop measuring here. 
accept_prob = min(1.0, t_prob / d_prob) if random.random() < accept_prob: total_accepted += 1 else: break # first rejection ends this speculation window return total_accepted / total_drafted # Good alignment: >70% acceptance # Poor alignment: <50% acceptance ``` ### When Speculative Decoding Helps Speculative decoding provides the most benefit when: - Target model is large (high per-token cost) - Good draft model exists (high acceptance rate) - Latency matters more than throughput - Batch size is small (batching already amortizes model loads) It provides less benefit when: - Already batching heavily (less memory bandwidth to spare) - No suitable draft model (poor alignment wastes speculation) - Throughput matters more than latency (better to increase batch size) --- ## Production Operations ### Monitoring Essentials Production LLM serving requires comprehensive monitoring: **Request-level metrics**: - TTFT (time to first token) - P50, P95, P99 - ITL (inter-token latency) - P50, P95, P99 - Total latency - P50, P95, P99 - Request success rate - Tokens generated per request **System-level metrics**: - GPU memory utilization (total and KV cache) - GPU compute utilization - Request queue depth - Active concurrent requests - Memory bandwidth utilization (if available) **Business metrics**: - Tokens generated per second (throughput) - Cost per 1000 tokens - Requests served per dollar ```python # Essential Prometheus metrics for LLM serving from prometheus_client import Counter, Histogram, Gauge # Request metrics requests_total = Counter('llm_requests_total', 'Total requests', ['model', 'status']) tokens_generated = Counter('llm_tokens_generated_total', 'Tokens generated', ['model']) ttft_seconds = Histogram('llm_ttft_seconds', 'Time to first token', ['model'], buckets=[0.1, 0.2, 0.5, 1, 2, 5, 10]) request_latency_seconds = Histogram('llm_request_latency_seconds', 'Total request latency', ['model'], buckets=[0.5, 1, 2, 5, 10, 30, 60]) # System metrics active_requests = 
Gauge('llm_active_requests', 'Currently processing', ['model']) queue_depth = Gauge('llm_queue_depth', 'Requests waiting', ['model']) gpu_memory_used_bytes = Gauge('llm_gpu_memory_bytes', 'GPU memory', ['gpu_id']) kv_cache_utilization = Gauge('llm_kv_cache_utilization', 'KV cache usage ratio', ['model']) ``` > **Full implementation**: See [reference/09_llm_deployment_code.md](../reference/09_llm_deployment_code.html#prometheus-metrics-exporter) ### Load Shedding and Overload Protection LLM inference is expensive. Under overload, accepting all requests leads to: - Memory exhaustion (KV cache overflow) - Latency explosion (queue buildup) - Cascading failures (timeouts causing retries) **Load shedding strategies**: ```python class LoadShedder: """Protect the system by rejecting excess load.""" def should_accept(self, request) -> tuple[bool, str]: # Check queue depth if self.queue_depth >= self.max_queue: return False, "queue_full" # Check KV cache headroom estimated_kv = estimate_kv_usage(request.max_tokens) if self.kv_used + estimated_kv > self.kv_limit * 0.95: return False, "memory_pressure" # Probabilistic shedding at high load load_factor = self.queue_depth / self.max_queue if load_factor > 0.8 and random.random() < (load_factor - 0.8) / 0.2: return False, "load_shed" return True, "accepted" ``` **Client-side handling**: Return HTTP 503 with Retry-After header. Well-behaved clients will back off. > **Full implementation**: See [reference/09_llm_deployment_code.md](../reference/09_llm_deployment_code.html#load-shedder) ### Graceful Shutdown Rolling deployments and autoscaling require graceful shutdown: 1. **Stop accepting new requests**: Return 503 to new connections 2. **Drain active requests**: Allow in-progress generation to complete 3. **Set timeout**: Force-terminate after grace period 4. 
**Release resources**: Free GPU memory, close connections ```python class GracefulServer: async def shutdown(self, grace_period=30): """Gracefully shut down the server.""" self.accepting = False # Stop new requests # Wait for active requests with timeout deadline = time.time() + grace_period while self.active_requests and time.time() < deadline: await asyncio.sleep(0.1) if self.active_requests: logger.warning(f"Force-closing {len(self.active_requests)} requests") for req in self.active_requests: req.cancel() ``` > **Full implementation**: See [reference/09_llm_deployment_code.md](../reference/09_llm_deployment_code.html#graceful-server) ### Kubernetes Deployment Patterns LLM workloads have unique Kubernetes requirements: **GPU scheduling**: Request GPU resources, handle node affinity **Slow startup**: Models take minutes to load; use appropriate health probes **Memory requirements**: Set resource limits carefully to prevent OOM **Autoscaling**: Scale on queue depth, not CPU (GPU is the constraint) ```yaml # Key Kubernetes patterns for LLM inference apiVersion: apps/v1 kind: Deployment spec: template: spec: containers: - name: vllm resources: limits: nvidia.com/gpu: 1 memory: "96Gi" # More than GPU memory for safety readinessProbe: httpGet: path: /health port: 8000 initialDelaySeconds: 180 # Model loading time periodSeconds: 10 failureThreshold: 3 livenessProbe: httpGet: path: /health port: 8000 initialDelaySeconds: 300 # Even longer for liveness periodSeconds: 30 --- # Scale on queue depth, not CPU apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler spec: minReplicas: 1 maxReplicas: 10 metrics: - type: External external: metric: name: llm_queue_depth target: type: AverageValue averageValue: "10" # Scale when queue > 10 ``` > **Full implementation**: See [reference/09_llm_deployment_code.md](../reference/09_llm_deployment_code.html#kubernetes-deployment) ::: {.callout-warning} ## Common Mistake: No LLM-Aware Health Checks **What people do**: Use simple TCP or 
HTTP health checks that return 200 if the server is listening.

**Why it fails**: LLM servers can be "up" but non-functional: GPU memory exhausted, model weights corrupted, or inference stuck. Kubernetes keeps routing traffic to broken replicas, causing cascading timeouts.

**Fix**: Implement deep health checks that verify actual inference capability. Include a lightweight inference test (e.g., 10-token completion) in your readiness probe. Set appropriate timeouts (30-60s) for these probes. Use liveness probes to restart truly stuck processes, but be conservative: killing a loaded server makes things worse.
:::

---

## Key Takeaways

1. **Memory bandwidth is the inference bottleneck** - During decode, GPUs spend most time loading weights, not computing. Batching, quantization, and caching all help by increasing useful work per byte loaded.

2. **KV cache dominates memory at scale** - For long contexts or high concurrency, KV cache memory exceeds model weight memory. PagedAttention eliminates 60-80% memory waste through block-based allocation.

3. **Continuous batching provides 5-10x throughput** - Unlike static batching, continuous batching eliminates head-of-line blocking by adding requests as others complete, maximizing GPU utilization.

4. **Quantization works because networks are overdetermined** - INT4 quantization reduces memory 4x while preserving most model capability. The function computed depends on the weight manifold, not exact values.

5. **Know your latency metrics: TTFT vs ITL** - Time to First Token (prefill-dominated) and Inter-Token Latency (decode-dominated) require different optimizations. Prefix caching helps TTFT; batch size affects ITL.

---

## Summary

LLM deployment is fundamentally different from traditional ML serving. The unique characteristics of autoregressive generation (memory bandwidth bounds, a growing KV cache, and token-by-token processing) require specialized techniques.
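
The memory bandwidth bound can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name `decode_tokens_per_second_bound` is hypothetical, it assumes each decode step streams every weight from GPU memory exactly once, and it ignores KV cache reads, compute limits, and kernel overhead, so real throughput will be lower. The 3.35 TB/s figure is the approximate H100 bandwidth quoted in this chapter's self-assessment.

```python
def decode_tokens_per_second_bound(params_billion: float,
                                   bytes_per_param: float,
                                   bandwidth_tb_s: float,
                                   batch_size: int = 1) -> float:
    """Rough upper bound on decode throughput for a bandwidth-bound model.

    Assumption: one decode step reads all weights once and produces one
    token per request in the batch. KV cache traffic is ignored.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    steps_per_second = bandwidth_tb_s * 1e12 / weight_bytes
    return steps_per_second * batch_size


H100_BANDWIDTH_TB_S = 3.35  # approximate, as quoted in this chapter

fp16 = decode_tokens_per_second_bound(7, 2.0, H100_BANDWIDTH_TB_S)
int4 = decode_tokens_per_second_bound(7, 0.5, H100_BANDWIDTH_TB_S)
fp16_batch8 = decode_tokens_per_second_bound(7, 2.0, H100_BANDWIDTH_TB_S, batch_size=8)

print(f"7B FP16, batch 1: {fp16:.0f} tokens/s")         # about 239
print(f"7B INT4, batch 1: {int4:.0f} tokens/s")         # about 957
print(f"7B FP16, batch 8: {fp16_batch8:.0f} tokens/s")  # about 1914
```

Under these assumptions, quantizing to INT4 and batching eight requests each raise the bound by reducing bytes loaded per generated token, which is the same effect the insights below describe in prose.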

### Key Insights

**Memory bandwidth is the bottleneck**: During decode, GPUs spend most of their time loading weights, not computing. This explains why batching, quantization, and speculative decoding all help: they increase the useful work done per byte loaded.

**The KV cache dominates at scale**: For long contexts or high concurrency, KV cache memory exceeds model weight memory. PagedAttention's block-based management was revolutionary because it eliminated the 60-80% memory waste of naive allocation.

**Continuous batching is essential**: The shift from static to continuous batching provides 5-10x throughput improvement by eliminating head-of-line blocking and maximizing GPU utilization.

**Quantization works because neural networks are overdetermined**: The function computed depends on the weight manifold, not exact values. INT4 quantization preserves most of this manifold while reducing memory 4x.

**Inference is a queueing theory problem**: Batching, KV cache paging, prefill/decode disaggregation, and prefix caching are scheduling policies over a scarce GPU resource. Optimizing serving means optimizing the queue, not just the kernel.
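
The queueing view can be sketched with Little's law (average requests in flight = arrival rate × average time in the system) combined with the KV cache sizing formula used in this chapter. This is a minimal sketch, not a capacity planner: all function names are hypothetical, the 10% memory reserve is an assumption, the traffic numbers are invented for illustration, and sizing every request at the full 4K context is a worst case that block-based allocation improves on in practice.

```python
def kv_bytes_per_request(layers: int, heads: int, head_dim: int,
                         context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one request: a K and a V tensor per layer and token."""
    return 2 * layers * context_tokens * heads * head_dim * bytes_per_value


def max_concurrent_requests(gpu_bytes: float, weight_bytes: float,
                            kv_per_request: int,
                            reserve_fraction: float = 0.1) -> int:
    """Requests whose KV cache fits after weights and a safety reserve (assumed 10%)."""
    usable = gpu_bytes * (1 - reserve_fraction) - weight_bytes
    return max(0, int(usable // kv_per_request))


def requests_in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's law: L = lambda * W."""
    return arrival_rate_per_s * mean_latency_s


# 7B-class model (32 layers, 32 heads, 128 dims per head), 4K context, FP16
kv = kv_bytes_per_request(layers=32, heads=32, head_dim=128, context_tokens=4096)
capacity = max_concurrent_requests(gpu_bytes=80e9, weight_bytes=14e9, kv_per_request=kv)

# Illustrative traffic: 5 requests/s, each spending 6 s in the system
demand = requests_in_flight(arrival_rate_per_s=5, mean_latency_s=6)

print(f"KV cache per request: {kv / 1e9:.2f} GB")
print(f"Capacity: {capacity} concurrent requests, demand: {demand:.0f}")
if demand > capacity:
    print("Demand exceeds capacity: the queue grows unless load is shed or replicas are added")
```

With these inputs the steady-state demand (30 requests in flight) slightly exceeds what one 80GB GPU can hold (27), so the scheduler must queue, shed, or scale out. Shortening time in the system, for example through faster decode or prefix caching, lowers demand as directly as adding hardware raises capacity.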

### Architecture Summary

![Production LLM Serving Stack](../assets/diagrams/rendered/ch09_production_serving_stack.svg)

### Connections to Other Chapters

- **Chapter 5 (LLM Foundations)**: The transformer architecture, attention mechanism, and KV cache explained here drive all deployment constraints
- **Chapter 6 (Prompt Engineering)**: Understanding deployment costs helps design efficient prompts that minimize tokens while maintaining quality
- **Chapter 7 (RAG Systems)**: RAG often requires serving embedding models alongside LLMs; similar deployment principles apply
- **Chapter 8 (Agentic Systems)**: Agents create multi-turn conversations that accumulate KV cache; understanding memory growth is essential
- **Chapter 27 (Performance Engineering)**: Covers deeper GPU optimization, profiling, and CUDA-level tuning
- **Chapter 32 (Cost Engineering)**: Extends cost analysis to full TCO, including engineering time and opportunity cost

---

## Practical Exercises

1. **Memory Budget Calculation**: For a 13B parameter model, calculate the maximum concurrent requests you can serve at 4K, 8K, and 16K context lengths on an A100 80GB. Use both FP16 and INT4 quantization scenarios. What's the break-even context length where KV cache exceeds model weight memory?

2. **Batching Experiment**: Deploy vLLM and measure throughput at batch sizes 1, 2, 4, 8, 16, 32. Plot throughput vs. latency. At what batch size does throughput plateau? What limits further scaling?

3. **Quantization Quality Evaluation**: Compare a base model vs. AWQ INT4 quantized version on your specific use case. Measure both perplexity on representative data and task-specific quality (accuracy, human preference). Is the quality loss acceptable for your application?

4. **Cost Comparison**: For your workload (estimate request volume, input/output lengths), calculate the monthly cost of: (a) OpenAI API, (b) self-hosted on A100, (c) self-hosted on H100 with FP8. Include realistic ops overhead for self-hosting.
At what volume does each option become optimal?

5. **Speculative Decoding Implementation**: Using a small (1B) and large (7B) model from the same family, implement speculative decoding. Measure acceptance rate and end-to-end speedup. How does K (speculation depth) affect the trade-off?

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [IC2]** Why is LLM generation memory-bandwidth bound rather than compute-bound? What does this mean for optimization strategies?

<details>
<summary>Answer</summary>

During generation, each forward pass multiplies model weights against a single token's hidden state (matrix-vector multiplication). The compute per weight element is tiny, but all weights must be loaded from GPU memory. On an H100, compute capability is ~2000 TFLOPS (FP8, dense) but memory bandwidth is ~3.35 TB/s. The ratio (operations per byte) for generation is ~1-2, far below the ~600 ratio the hardware can sustain.

This means: (1) Adding more compute doesn't help: the GPU sits idle waiting for memory. (2) Batching helps: load weights once, use them for multiple requests. (3) Quantization helps: smaller weights mean less to load. (4) KV caching is essential: it avoids recomputing attention states.

</details>

**Q2. [IC2]** A 7B model in FP16 needs 14GB for weights. Why might you need 80GB of GPU memory to serve it at 4K context?

<details>
<summary>Answer</summary>

KV cache: For a typical 7B model architecture (32 layers, 32 heads, 128 dim per head), KV cache per request at 4K context in FP16 is: 2 × 32 × 4096 × 32 × 128 × 2 bytes = 2.15 GB per request. Model weights (14GB) plus 20-30 concurrent requests × 2.15GB comes to 14GB + 43-65GB. Add activations, framework overhead, and buffer for continuous batching, and you're at 80GB. At long contexts or high concurrency, KV cache dominates memory usage.

</details>

**Q3. [Senior]** Explain the difference between Time to First Token (TTFT) and Inter-Token Latency (ITL). Which optimizations affect each?

<details>
<summary>Answer</summary>

TTFT is the delay before generation starts, dominated by prefill (processing the prompt). Affected by: prompt length, batching wait time, prefill compute capacity, and prefix caching (which can dramatically reduce TTFT if the prompt shares a prefix with a cached entry).

ITL is the time between consecutive tokens during generation, dominated by decode memory bandwidth. Affected by: batch size (more batched requests means each request gets fewer cycles), quantization (faster weight loads), model size, and hardware memory bandwidth.

You can have low TTFT but high ITL (small prompts, many concurrent requests) or high TTFT but low ITL (long prompts, few concurrent requests).

</details>

**Q4. [Senior]** How does PagedAttention improve over fixed KV cache allocation? When does it matter most?

<details>
<summary>Answer</summary>

Traditional systems pre-allocate KV cache for max context length per request. If requests vary in length (common), short requests waste memory. PagedAttention allocates KV cache in fixed-size blocks (pages), allocated as needed.

Benefits: (1) No wasted memory for short sequences. (2) Can serve more concurrent requests with the same memory. (3) Enables prefix sharing: common prefixes share the same physical pages.

Matters most when: (1) High request concurrency. (2) Variable sequence lengths. (3) Many requests share common prefixes (same system prompt). Production improvement: 2-4x throughput is common.

</details>

**Q5. [Staff]** You're deploying a 70B model. Should you use tensor parallelism across 2 GPUs or pipeline parallelism? What factors determine this?

<details>
<summary>Answer</summary>

Tensor parallelism (TP): Splits each layer across GPUs. Every token requires all GPUs to communicate. Lower latency per token (all GPUs work in parallel on each token), but high communication overhead (all-reduce after each layer).

Pipeline parallelism (PP): Different layers on different GPUs.
Lower communication (only between pipeline stages), but can have bubbles (idle time).

For latency-sensitive serving: TP is usually better because each token completes faster. For throughput-focused batch processing: PP can be better, since pipeline bubbles can be filled with multiple microbatches. 70B on 2 GPUs: TP is the typical choice for serving. NVLink bandwidth is critical; without it, TP communication becomes the bottleneck.

</details>

### Spot the Problem

**Problem 1. [IC2]** A deployment configuration:

```python
import vllm

model = vllm.LLM(
    model="llama-7b",
    max_num_batched_tokens=32768,
    gpu_memory_utilization=0.95
)
```

What issues might this cause?

<details>
<summary>Answer</summary>

Issues: (1) 95% GPU memory utilization leaves very little buffer: CUDA operations occasionally need temporary memory, which can cause OOM under load. Safe: 0.85-0.90. (2) max_num_batched_tokens=32768 is very high: if requests have long prompts, this allows huge prefill batches that may cause memory spikes or time out individual requests. (3) No explicit max_model_len: when unset, vLLM derives it from the model config, which might not match your use case.

Better: Start conservative (0.85 utilization), tune max_num_batched_tokens based on your request distribution, and set an explicit max_model_len.

</details>

**Problem 2. [Senior]** Cost analysis claiming self-hosting saves 10x:

```
API cost: $0.01 per request
Self-hosted cost: $2000/month for GPU
At 200K requests/month: API = $2000, Self-hosted = $2000
At 2M requests/month: API = $20000, Self-hosted = $2000
→ 10x savings!
```

<details>
<summary>Answer</summary>

Missing costs: (1) Operations overhead: engineer time for setup, maintenance, on-call. At $150/hr loaded cost, 20 hours/month = $3000. (2) Infrastructure: load balancing, monitoring, logging, networking. (3) Opportunity cost: engineering time not spent on product. (4) Quality assumption: the analysis implicitly assumes the self-hosted model matches API quality. That may not be true, especially for hard tasks. (5) Utilization: assumes 100% GPU utilization.
Realistic is 70-80% for variable traffic. (6) Scaling: what happens at 5M requests? May need another GPU.

Real comparison: Should be (GPU + ops + infra) vs. API, and account for quality differences.

</details>

**Problem 3. [Staff]** A team's quantization approach: "We quantized our model to INT4 and ran our eval suite. Accuracy dropped from 92% to 90%. We decided 2% loss is acceptable."

<details>
<summary>Answer</summary>

Flawed analysis: (1) Aggregate accuracy hides the distribution: it might be 95%→95% on easy cases but 80%→70% on hard cases. Check performance by difficulty/category. (2) The eval suite may not cover edge cases: quantization can amplify errors on unusual inputs. (3) Perplexity should be checked: it correlates better with subtle quality issues. (4) Production distribution matters: if hard cases are 20% of production traffic, that 10-point drop on hard cases is significant. (5) Should compare on an actual production traffic sample.

Better: Stratified analysis by case difficulty, perplexity comparison, production traffic evaluation, user-facing quality assessment.

</details>

### Design Exercises

**Exercise 1. [Senior]** Design the deployment architecture for a chatbot that needs to handle 10,000 concurrent users, with P95 latency under 2 seconds for the first token. Users have conversations averaging 10 turns. The model is 13B parameters. Outline your hardware, software stack, and capacity planning.

<details>
<summary>Guidance</summary>

Key calculations: (1) At 10K concurrent users, assume 30% are actively waiting for a response at any moment = 3K active requests. (2) 13B model at FP16 = 26GB. With quantization (INT4) = ~7GB weights. (3) KV cache for a 10-turn conversation (~4K context) = ~1GB per request in FP16, ~0.5GB in INT8. These figures assume a KV-efficient architecture such as grouped-query attention; full multi-head attention with 40 layers, 40 heads, and 128 dims per head would need about 3.4GB in FP16 by the Q2 formula, so check your model's config. (4) Per A100 80GB: ~7GB model + ~50GB for KV = ~50-100 concurrent requests with INT4+INT8 KV. Need ~30-60 A100s for 3K concurrent.

Software: vLLM with continuous batching, load balancer distributing by user session (session affinity for KV cache reuse).
Scaling: Start with peak estimate, monitor, right-size. Use auto-scaling if traffic is variable.

</details>

**Exercise 2. [Staff]** You're responsible for reducing LLM serving costs by 50% without impacting user-facing quality. Current setup: API provider, $500K/month, mix of tasks (customer support, content generation, data extraction). Design your cost reduction strategy with specific actions, measurements, and risk mitigation.

<details>
<summary>Guidance</summary>

Strategies: (1) Task routing: not all tasks need the best model. Route simple tasks to cheaper/smaller models. Requires building a classifier and per-task quality validation. (2) Caching: identify common queries, cache responses. Effective for support FAQs. (3) Prompt optimization: shorter prompts mean fewer input tokens. Be aggressive about trimming examples and instructions. (4) Output optimization: constrain response length where appropriate. (5) Self-hosting evaluation: calculate break-even for highest-volume tasks.

Risk mitigation: (1) A/B test each change against current baseline. (2) Monitor quality metrics per task category. (3) Staged rollout (10% → 50% → 100%). (4) Kill switch to revert to original setup.

Measurement: Track cost per task, quality per task, user satisfaction. Target: 50% cost reduction with <2% quality regression on any metric.

</details>

---

## See Also

- **Chapter 5: LLM/NLP Foundations** - Transformer architecture and attention mechanisms that determine inference characteristics
- **Chapter 21: Performance Engineering** - Advanced optimization techniques and latency engineering strategies
- **Chapter 26: Cost Engineering** - Economic analysis of deployment decisions and total cost of ownership
- **Appendix K: LLM Provider Comparison** - Detailed comparison of managed API providers vs self-hosting options

---

## Recommended Reading

### Essential (Read These)

**"Efficient Memory Management for Large Language Model Serving with PagedAttention"** (Kwon et al., 2023)
*The vLLM paper that introduced PagedAttention. Essential reading for understanding modern inference systems.*

**"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning"** (Dao, 2023)
*The attention optimization that enables long contexts. Well-written explanation of the tiling algorithm.*

**"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"** (Frantar et al., 2022)
*Foundation for understanding weight quantization. Explains the Optimal Brain Quantization approach.*

### Deep Dives (For Specialists)

**"AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration"** (Lin et al., 2023)
*State-of-the-art quantization technique. Explains the activation-aware approach.*

**"Efficiently Scaling Transformer Inference"** (Pope et al., 2022)
*Google's analysis of inference efficiency. Excellent derivation of memory bandwidth bounds.*

**"Fast Inference from Transformers via Speculative Decoding"** (Leviathan et al., 2022)
*Original speculative decoding paper. Rigorous analysis of acceptance rates and speedup bounds.*

### Systems and Infrastructure

**vLLM Documentation** (docs.vllm.ai)
*Comprehensive guide to the most popular open-source inference server.*

**TensorRT-LLM Documentation** (NVIDIA)
*Deep technical details on NVIDIA's optimized inference stack.*

### Historical Context

**"Orca: A Distributed Serving System for Transformer-Based Generative Models"** (Yu et al., 2022)
*Introduced continuous batching. Foundational for understanding modern scheduling.*