Chapter 25: System Design at Scale
distributed systems, scalability, multi-region, load balancing, CAP theorem, sharding, replication, circuit breakers
Introduction
In March 2023, a startup launched their AI assistant to a modest beta user base. The system worked beautifully: sub-second responses, personalized interactions, glowing user feedback. Three months later, a viral Twitter thread brought 100,000 users in a single day. By noon, latencies had climbed from 200ms to 15 seconds. By evening, the service was completely offline. The engineering team spent 72 hours in triage mode, cobbling together fixes: adding servers, implementing hasty caching, and shedding features. When stability returned, they faced a harder question: how did a system designed by competent engineers fail so catastrophically?
The answer lies in a fundamental truth about distributed systems: scale changes everything. Patterns that work at small scale become liabilities at large scale. Assumptions that hold for hundreds of users shatter at millions. Failures that are theoretical edge cases become daily occurrences. System design at scale isn’t about doing more of the same—it’s about fundamentally rethinking how systems behave under pressure.
AI systems face unique scaling challenges that traditional web services don’t encounter. A typical web request might involve a database lookup and some business logic, completing in 50 milliseconds. An LLM inference request might require loading billions of parameters, executing trillions of floating-point operations, and generating output tokens sequentially—taking seconds, not milliseconds, while consuming GPU resources that cost 100x what CPU resources cost. The economics, physics, and failure modes are entirely different.
This chapter builds your intuition for designing AI systems that remain reliable as they grow. We’ll start with distributed systems foundations—the theoretical constraints that govern what’s possible—then examine how these principles apply to AI workloads specifically. You’ll learn to reason about bottlenecks before they manifest, design for graceful degradation when failures inevitably occur, and make principled trade-offs between competing concerns like consistency, availability, and cost.
The Core Insight: Why AI Systems Scale Differently
To design scalable AI systems, you must first understand why they’re different from traditional web services.
Resource intensity and economics: A single LLM inference request might require 16GB of GPU memory and 100 TFLOPS of compute. Compare this to a typical API request that uses megabytes of RAM and negligible CPU. The cost differential is enormous—GPU hours cost 10-100x CPU hours—which means inefficiency at scale translates to millions of dollars.
Sequential generation: Autoregressive language models generate tokens one at a time, with each token depending on all previous tokens. You cannot parallelize this within a single request. A 500-token response requires 500 sequential forward passes, creating a latency floor that no amount of hardware can reduce without architectural changes.
Memory as the bottleneck: Unlike CPU-bound services where you can scale horizontally almost indefinitely, GPU memory is finite and expensive. A 70B parameter model requires ~140GB just for weights in FP16. The KV cache for context grows with sequence length. Memory fragmentation becomes a first-order concern.
Batching dynamics: Individual inference requests underutilize GPUs dramatically. Throughput increases nearly linearly with batch size until memory limits. This creates a fundamental tension: latency-sensitive applications want small batches (fast individual requests); throughput-sensitive applications want large batches (efficient GPU utilization).
Stateful computation: For conversational AI, each turn might need context from all previous turns. This state must be managed somewhere—cached in GPU memory, stored externally, or regenerated. Each choice has profound implications for latency, cost, and system design.
Understanding these differences is essential because they determine which distributed systems patterns apply and how. Techniques designed for stateless web services often fail for AI workloads, while AI-specific optimizations can yield 10x improvements that would be impossible in traditional systems.
What You’ll Learn
- Why distributed systems guarantees constrain AI system design
- How to choose between consistency models based on your use case
- Horizontal vs. vertical scaling for inference and training workloads
- Multi-region deployment patterns and their trade-offs
- Identifying and eliminating bottlenecks before they manifest
- Designing for graceful degradation when components fail
- The operational practices that distinguish reliable systems from fragile ones
Prerequisites
- Familiarity with AI inference fundamentals (Chapter 9)
- Understanding of basic networking and storage concepts
- Experience with production systems and debugging distributed issues
Distributed Systems Foundations
“Every outage we’ve had in the last two years traces back to violating one of the eight fallacies. The worst was when we assumed network latency was negligible and set our timeout to 100ms. Worked perfectly in us-east-1. Then we went multi-region and calls to eu-west-1 took 150ms on a good day. Our entire Europe presence was unusable for a week while we fixed it. Now we start every design review by asking: which fallacy are we accidentally assuming?”
— Principal Engineer at global AI platform
Before designing scalable AI systems, you need to understand the fundamental constraints that govern all distributed systems. These aren’t engineering choices—they’re mathematical limitations, as inescapable as the speed of light.
The Eight Fallacies of Distributed Computing
In 1994, Peter Deutsch articulated seven fallacies (later expanded to eight by James Gosling) that engineers often assume when building distributed systems:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Every one of these is false, and AI systems violate these assumptions with particular severity.
Network reliability matters more for AI: A web request that fails can be retried with minimal cost. An AI inference request that fails after 10 seconds of computation wastes expensive GPU time. Partial failures are worse—a request might complete on the GPU but fail during response transmission, leaving the system in an ambiguous state.
Latency compounds for AI workloads: An LLM might call external APIs for retrieval, consult a feature store, and make multiple model calls. Each network hop adds latency. When your baseline inference is 2 seconds, a 100ms network delay might seem trivial—but 10 such delays add a full second, a 50% increase.
Bandwidth limits model distribution: Downloading a 70B parameter model (140GB in FP16) takes 20 minutes at 1 Gbps. Cross-region replication of model weights becomes a significant operational concern. Even within a datacenter, NVLink bandwidth between GPUs matters for distributed inference.
The CAP Theorem: Fundamental Trade-offs
Eric Brewer’s CAP theorem (formally proven by Gilbert and Lynch in 2002) states that a distributed system can provide at most two of three guarantees:
- Consistency: Every read receives the most recent write or an error
- Availability: Every request receives a response (without guarantee of most recent data)
- Partition tolerance: The system continues operating despite network partitions
Since network partitions are inevitable in distributed systems, the real choice is between consistency and availability during partitions. This isn’t a one-time architectural choice—different components of your system can make different trade-offs based on their requirements.
How to think about CAP for AI systems:
For model metadata (which model version is deployed, routing rules), you typically want consistency. If one gateway thinks model v2 is active while another thinks v1 is active, users get inconsistent behavior. During a partition, it’s better to refuse requests than serve the wrong model.
For inference results caching, you typically want availability. Serving a cached result from a slightly stale model is better than returning an error. Users might see marginally different responses, but the system remains functional.
For feature stores in fraud detection, the choice is nuanced. Stale features could miss fraud patterns, suggesting you want consistency. But blocking all transactions during a partition could cost millions. The right answer depends on your specific risk tolerance and fallback strategies.
# Framework for documenting CAP decisions
@dataclass
class CAPDecision:
component: str
choice: str # "CP" or "AP"
rationale: str
partition_behavior: str
consistency_guarantee: str
# Example decisions for an AI system
CAP_DECISIONS = {
"model_registry": CAPDecision(
component="Model Registry",
choice="CP",
rationale="Wrong model version causes incorrect predictions",
partition_behavior="Reject deployment requests; serving continues with last-known-good",
consistency_guarantee="Linearizable for writes, sequential for reads"
),
"inference_cache": CAPDecision(
component="Inference Result Cache",
choice="AP",
rationale="Stale cached result better than no result",
partition_behavior="Serve local cache; async reconciliation",
consistency_guarantee="Eventual, with bounded staleness (5 min TTL)"
),
"feature_store_recommendations": CAPDecision(
component="Feature Store (Recommendations)",
choice="AP",
rationale="Slightly stale recommendations acceptable",
partition_behavior="Serve from local replica",
consistency_guarantee="Eventual, 30-second replication lag target"
),
"feature_store_fraud": CAPDecision(
component="Feature Store (Fraud Detection)",
choice="CP",
rationale="Stale features could miss fraud patterns",
partition_behavior="Fail closed (reject suspicious transactions)",
consistency_guarantee="Linearizable"
),
}Consistency Models: A Spectrum of Guarantees
CAP presents a binary choice, but consistency exists on a spectrum. Understanding these models helps you choose the minimum consistency required, maximizing availability and performance.
Linearizability (strongest): Operations appear to execute instantaneously at some point between invocation and response. All clients see operations in the same order. Required for distributed locks and leader election.
Sequential consistency: All clients see operations in the same order, but that order need not match real-time order. Weaker than linearizable but sufficient for many applications.
Causal consistency: Operations that are causally related appear in the same order to all clients. Concurrent operations may appear in different orders. A good balance for many AI applications.
Eventual consistency (weakest): All replicas will eventually converge to the same value if updates stop. No guarantees about ordering in the meantime.
For AI systems, the key insight is that different data has different consistency requirements:
| Data Type | Typical Requirement | Rationale |
|---|---|---|
| Model weights | Eventual (with versioning) | Updates are rare; versioning prevents conflicts |
| Inference results | No consistency needed | Stateless, ephemeral |
| User conversation history | Causal | Messages should appear in order |
| Feature values | Depends on use case | Real-time features need stronger guarantees |
| Model routing rules | Sequential | All traffic should see same routing |
| Billing/usage counters | Eventual with reconciliation | Accuracy matters but timing doesn’t |
The Network as Unreliable Medium
Networks fail in more ways than you might expect, and AI systems must handle all of them:
Message loss: Packets disappear. Requests never arrive; responses never return. Your system must use timeouts and retries, but retries for expensive AI operations need careful thought.
Message delay: Variable latency makes it impossible to distinguish a slow response from a failed node. If your timeout is 5 seconds and inference takes 4 seconds, you’ll see spurious timeouts under load.
Message duplication: Retries can cause duplicate processing. For stateless inference, this wastes resources. For stateful operations (like charging for usage), it causes correctness problems.
Message reordering: TCP guarantees ordering within a connection, but complex systems have multiple paths. A user might see response B before response A if they were routed differently.
Partial failure: The hardest class of problems. A request might succeed on the server but fail during response transmission. Did the operation happen? You don’t know, and neither does the client.
For AI workloads, these network issues interact with the long processing times. Consider this scenario:
- Client sends inference request
- Server receives request, begins 10-second inference
- Network hiccups; client connection drops at second 5
- Server completes inference at second 10, tries to send response—connection gone
- Client’s timeout fires at second 5, it retries
- Server receives retry, begins another 10-second inference
You’ve now wasted 15 seconds of GPU time and the client is still waiting. Solutions include request deduplication (with unique request IDs), connection-independent result storage, and client polling rather than blocking.
Clocks and Time in Distributed Systems
There is no global clock in a distributed system. Each machine’s clock drifts independently. Even with NTP synchronization, clocks can differ by hundreds of milliseconds—enough to cause ordering issues.
For AI systems, clock skew matters for:
Request deadlines: If a client sets a 5-second deadline but the server’s clock is 500ms behind, the server might process a request the client has already abandoned.
Cache TTLs: A cache entry with a 5-minute TTL might be considered valid or expired depending on which server’s clock you ask.
Log correlation: Debugging distributed issues requires correlating logs across machines. Clock skew makes this difficult without logical timestamps.
Consistent ordering: If you need to determine which of two events happened first, clock timestamps are unreliable.
The solution is to use logical clocks (like Lamport timestamps or vector clocks) when ordering matters, and to build systems that don’t rely on synchronized time.
What people do: Build distributed AI systems expecting both strong consistency (same embedding for the same query everywhere) and high availability (always respond, even during partitions).
Why it fails: CAP theorem says you can’t have both. During a network partition, you must choose: return potentially stale embeddings (availability) or reject queries until consistency is restored (consistency). Systems that don’t explicitly choose often behave unpredictably under failure.
Fix: Decide your consistency model upfront. For most AI applications, eventual consistency is acceptable—a slightly stale cache or older model version usually doesn’t matter. Document the consistency guarantees, design your application to tolerate the inconsistency you’ve chosen, and test behavior during partitions.
Why AI Systems Have Unique Scaling Challenges
Traditional web services scale horizontally: need more capacity? Add more servers. Load balancing is straightforward because requests are largely interchangeable and stateless. AI systems break these assumptions in fundamental ways.
The Memory Bandwidth Bottleneck
GPU compute power has grown exponentially, but memory bandwidth hasn’t kept pace. This creates a fundamental bottleneck for large language models.
A 70B parameter model in FP16 uses 140GB of memory. During inference, these parameters must be loaded from memory for each forward pass. At 3TB/s memory bandwidth (H100), loading all parameters takes about 47ms—the lower bound for inference latency, regardless of compute power.
This is why LLM inference is memory-bandwidth bound, not compute-bound. The GPU’s tensor cores sit idle waiting for data. Optimization strategies must focus on reducing memory transfers:
- Batching: Amortize parameter loading across multiple requests
- Quantization: Smaller datatypes (INT8, INT4) reduce memory bandwidth by 2-4x
- KV caching: Don’t recompute what you can store
- Model parallelism: Distribute across GPUs with faster interconnects
The Batching Paradox
Batching dramatically improves throughput but harms latency:
| Batch Size | Throughput (tokens/sec) | Per-Request Latency |
|---|---|---|
| 1 | 50 | 200ms |
| 8 | 350 | 230ms |
| 32 | 1000 | 320ms |
| 128 | 2500 | 510ms |
With batch size 1, you process 50 tokens/second but each request completes in 200ms. With batch size 128, you process 2500 tokens/second (50x throughput!) but each request takes 510ms.
This creates a fundamental tension:
- Latency-sensitive applications (interactive chat) want small batches
- Throughput-sensitive applications (batch embeddings) want large batches
- Cost-sensitive deployments want high throughput (utilization)
The key insight is that you cannot optimize for all three. You must choose your priority and design accordingly:
Trade-off choices:
- Interactive chat: Optimize latency and cost, accept lower throughput
- Batch processing: Optimize throughput and cost, accept higher latency
- Real-time at scale: Optimize latency and throughput, accept higher cost
State and the Statefulness Spectrum
AI systems exist on a spectrum of statefulness, and where you sit determines your scaling strategy:
Stateless inference: Each request is independent. Embeddings, single-turn classification. Scales horizontally without constraint.
Context-dependent inference: Each request depends on conversation history. The history must be provided (expensive) or cached (complex). Most LLM applications.
Session state: User preferences, personalization features. Can be stored externally and loaded per-request.
Model state: KV cache for long contexts, adapter weights per user. Must be co-located with inference or expensively transferred.
The more stateful your workload, the harder it is to scale horizontally. Stateless requests can go to any server; stateful requests need routing to specific servers (consistent hashing) or expensive state transfer.
# State management patterns for scalable AI inference
class InferenceStateStrategy:
"""Different strategies for managing inference state."""
def __init__(self, state_type: str):
self.strategies = {
"stateless": {
"description": "No state between requests",
"routing": "Round-robin / least-connections",
"scaling": "Add replicas freely",
"examples": ["Embeddings", "Classification", "Single-turn QA"]
},
"externalized_context": {
"description": "Context stored externally, loaded per-request",
"routing": "Any server, load context from store",
"scaling": "Add replicas freely, scale context store separately",
"tradeoff": "Latency for context loading",
"examples": ["RAG systems", "Document QA"]
},
"sticky_sessions": {
"description": "Route same user to same server",
"routing": "Consistent hashing on user_id",
"scaling": "Careful rebalancing on scale events",
"tradeoff": "Load imbalance possible",
"examples": ["Multi-turn chat with KV cache"]
},
"replicated_state": {
"description": "State replicated across servers",
"routing": "Any server",
"scaling": "Replication overhead limits scale",
"tradeoff": "Memory and network overhead",
"examples": ["Personalization features"]
},
}The Cold Start Problem
AI inference has significant cold start costs that don’t exist for traditional services:
Model loading: A 70B model takes 30-60 seconds to load from disk to GPU. During this time, the server can’t serve requests.
JIT compilation: Frameworks like PyTorch compile CUDA kernels on first use. The first request is always slower.
KV cache warm-up: For efficient inference, the KV cache should be pre-allocated. This takes time and memory.
Connection establishment: HTTP/2 connections, gRPC channels, and database connections all take time to establish.
These cold start costs compound when autoscaling. If a traffic spike triggers scaling at 10:00:00, new servers might not be ready until 10:01:30. During that 90 seconds, existing servers are overwhelmed.
Mitigation strategies:
- Keep warm pools: Maintain minimum replicas with models loaded
- Predictive scaling: Scale up before anticipated traffic (batch jobs, marketing campaigns)
- Progressive loading: Serve with smaller model while large model loads
- Pre-warming: Run dummy inference to trigger JIT compilation
Scaling Patterns for AI Inference
With the foundations established, let’s examine concrete patterns for scaling AI inference systems.
Horizontal vs. Vertical Scaling: A Decision Framework
The classic scaling question takes on new dimensions for AI workloads:
Vertical Scaling (bigger machines):
- Add more GPUs to a single node
- Use faster GPUs (A10 -> H100)
- Add more memory
Horizontal Scaling (more machines):
- Add more inference servers
- Distribute load across replicas
- Add more regions
For AI systems, the decision depends on your bottleneck:
When to choose vertical scaling:
- Model larger than single-GPU memory (you have no choice)
- Latency is the primary constraint
- Simpler operational model is valuable
- Hardware costs less than engineering time
When to choose horizontal scaling:
- Need availability guarantees (single machine is SPOF)
- Throughput scales linearly with machines
- Cost optimization via spot/preemptible instances
- Already at hardware limits vertically
Most production systems use both: scale vertically to a reasonable machine size (e.g., 4-GPU node), then scale horizontally for capacity and availability.
Model Parallelism Strategies
When a model doesn’t fit on a single GPU, you must distribute it. Three main strategies exist:
Tensor Parallelism: Split individual layers across GPUs. Each GPU holds part of each weight matrix. Requires high-bandwidth interconnects (NVLink) because GPUs must communicate for every layer.
- Pros: Low latency (all GPUs work simultaneously)
- Cons: Requires fast interconnect, communication every layer
Pipeline Parallelism: Split layers across GPUs. Each GPU holds complete layers. GPUs communicate only at layer boundaries, reducing communication but creating pipeline bubbles.
- Pros: Less communication, can use slower network
- Cons: Pipeline bubbles (some GPUs idle), higher latency per request
Data Parallelism: Each GPU has complete model copy, different data. For inference, this is simply multiple replicas. For training, gradients must be synchronized.
- Pros: Simple, linear throughput scaling
- Cons: Each GPU needs enough memory for full model
Choosing a strategy:
| Model Size | GPU Memory | Strategy |
|---|---|---|
| Fits on 1 GPU | < 80GB | Data parallelism (replicas) |
| Fits on 1 node (8 GPUs) | 80-640GB | Tensor parallelism within node |
| Requires multiple nodes | > 640GB | Pipeline parallelism across nodes, tensor within |
Load Balancing for AI Workloads
Standard load balancing algorithms assume requests are roughly equivalent. AI requests are not—a request to generate 1000 tokens costs 50x more than a 20-token request.
Naive round-robin fails:
Server A: [100 tokens] [1000 tokens] [50 tokens] = 1150 total
Server B: [30 tokens] [25 tokens] [20 tokens] = 75 total
Server A is 15x more loaded despite equal request count!
Better algorithms for AI:
Least-tokens-in-progress: Route to the server with the fewest tokens currently being processed. Requires servers to report their current load.
class TokenAwareLoadBalancer:
"""Load balancer that considers request size."""
def select_server(self, request: dict, server_stats: dict[str, dict]) -> str:
"""Select server based on estimated request cost."""
estimated_tokens = self._estimate_tokens(request)
candidates = []
for server, stats in server_stats.items():
if not stats['healthy']:
continue
# Score by available capacity (memory and compute)
memory_available = stats['gpu_memory_free'] / stats['gpu_memory_total']
tokens_in_flight = stats['tokens_in_progress']
capacity_score = memory_available * 100 - tokens_in_flight / 10
# Penalty if this request might not fit
if estimated_tokens > stats['max_tokens_available']:
capacity_score -= 1000
candidates.append((server, capacity_score))
if not candidates:
raise NoHealthyServersError()
return max(candidates, key=lambda x: x[1])[0]Consistent hashing for stateful requests: When requests benefit from server affinity (KV cache reuse), use consistent hashing on session ID. But handle rebalancing carefully—when servers are added/removed, some sessions must migrate.
Two-tier routing: Route first by request type (streaming vs. batch), then by load within the tier. This prevents batch requests from starving interactive users.
Continuous Batching and PagedAttention
Two advances have transformed LLM serving efficiency:
Continuous batching (Orca, 2022): Instead of waiting for all requests in a batch to complete before starting new ones, continuously add new requests to the batch as old ones complete. This eliminates the “batch waiting” inefficiency.
Traditional batching:
Time 0: Start batch [A, B, C, D]
Time 2: A completes, but batch waits
Time 4: B, C complete, but batch waits
Time 6: D completes, batch released
Time 6: Start new batch [E, F, G, H]
Continuous batching:
Time 0: Start [A, B, C, D]
Time 2: A completes, immediately add E
Time 3: C completes, immediately add F
...
No idle time between requests!
PagedAttention (vLLM, 2023): Manage KV cache memory like virtual memory pages. Instead of pre-allocating maximum context length for each request, allocate pages on demand and reclaim when done. This dramatically improves memory efficiency.
Traditional KV cache:
Request A (needs 100 tokens): Allocates 2048 tokens (max context)
Request B (needs 200 tokens): Allocates 2048 tokens
Request C (needs 50 tokens): Allocates 2048 tokens
Total allocated: 6144 tokens
Total used: 350 tokens
Efficiency: 5.7%
PagedAttention:
Request A (needs 100 tokens): Allocates 100 tokens
Request B (needs 200 tokens): Allocates 200 tokens
Request C (needs 50 tokens): Allocates 50 tokens
Total allocated: 350 tokens
Total used: 350 tokens
Efficiency: 100%
These techniques are implemented in modern serving frameworks (vLLM, TGI, TensorRT-LLM) and can improve throughput by 2-5x over naive implementations.
Speculative Decoding
Autoregressive generation is inherently sequential: each token depends on all previous tokens. Speculative decoding breaks this limitation by using a smaller “draft” model to speculate multiple tokens, then verifying them in parallel with the large model.
Traditional generation (70B model):
Step 1: Generate token 1 (200ms)
Step 2: Generate token 2 (200ms)
Step 3: Generate token 3 (200ms)
...
100 tokens = 20 seconds
Speculative decoding:
Step 1: Draft model generates tokens 1-4 (20ms)
Step 2: Target model verifies tokens 1-4 in parallel (250ms)
- Tokens 1-3 accepted, token 4 rejected
- Resample token 4 from target model
Step 3: Draft model generates tokens 5-8 (20ms)
...
100 tokens with 75% acceptance = ~7 seconds (3x speedup)
The speedup depends on how well the draft model matches the target model. If they agree often (high acceptance rate), speculative decoding provides significant gains. If they disagree frequently, you’re doing extra work for little benefit.
What people do: Start with speculative decoding, custom quantization, and multi-region architecture before validating basic functionality.
Why it fails: Each optimization adds complexity, debugging difficulty, and operational burden. Speculative decoding needs draft model tuning. Quantization can introduce quality regressions. Multi-region adds consistency challenges. You’re solving scaling problems you may never have.
Fix: Start simple: single region, API-based inference, standard batching. Measure actual bottlenecks under real load. Optimize the measured bottleneck, not the theoretical one. Most AI systems never grow large enough to need the sophisticated techniques discussed in this chapter. Simple and reliable beats complex and fast.
Multi-Region Architecture
Global AI systems must serve users across continents. This introduces challenges that don’t exist for single-region deployments.
Why Multi-Region is Hard for AI
Physics: Light travels at ~200km/ms in fiber. US East to US West is ~4000km, adding 40ms round-trip latency minimum. To Europe, ~120ms. To Asia, ~200ms. For interactive AI with 2-second generation time, this matters less than for millisecond APIs—but it still impacts user perception.
Model size: A 70B model is 140GB. Replicating to 5 regions means 700GB of cross-region transfer every model update. At $0.02/GB egress, that’s $14 per update—and you might update daily.
Consistency challenges: If a user in Europe and their collaborator in Asia both update a shared conversation, which version wins? With LLM systems, “merging” conversation histories is semantically complex.
GPU availability: Not all regions have the same GPU availability. H100s might be plentiful in US regions but scarce elsewhere. Your architecture must handle heterogeneous hardware.
Multi-Region Deployment Patterns
Pattern 1: Active-Passive (Single Primary)
One region handles all traffic; others are standby for disaster recovery.
Pros:
- Simple consistency (one source of truth)
- No conflict resolution needed
- Lower cost (standby resources minimal)
Cons:
- High latency for distant users
- Failover is disruptive (minutes)
- Standby resources often underutilized
Pattern 2: Active-Active (Multi-Primary)
All regions serve traffic; users route to nearest region.
Pros:
- Low latency for all users
- High availability (no single point of failure)
- Better resource utilization
Cons:
- Complex conflict resolution
- Higher replication costs
- Consistency challenges
Pattern 3: Follow-the-Sun / Read-Local-Write-Global
Reads served locally; writes route to primary region.
Data Sovereignty and Compliance
GDPR, CCPA, and other regulations may require data to stay within certain jurisdictions. For AI systems, this affects:
User data: Conversation history, preferences, uploaded documents must stay in-region.
Model inputs/outputs: Even inference requests may constitute personal data if they contain user information.
Model weights: Some models trained on sensitive data may have restrictions on where they can be deployed.
Audit logs: Records of AI system usage may need to remain in-jurisdiction.
Architectural implications:
class DataSovereigntyRouter:
"""Route requests based on data residency requirements."""
def __init__(self, region_capabilities: dict):
self.regions = region_capabilities
def route_request(self, request: dict) -> str:
"""Determine which region should handle this request."""
user_jurisdiction = request.get('user_jurisdiction')
data_classification = request.get('data_classification', 'standard')
# Strict residency: must stay in specific regions
if data_classification == 'pii_eu':
eligible_regions = [r for r, caps in self.regions.items()
if 'gdpr_compliant' in caps]
elif data_classification == 'pii_cn':
eligible_regions = [r for r, caps in self.regions.items()
if 'china_data_resident' in caps]
else:
eligible_regions = list(self.regions.keys())
# From eligible regions, pick lowest latency
return self._select_lowest_latency(eligible_regions, user_jurisdiction)Cross-Region Data Replication
Replicating AI system data across regions requires careful design:
Model weights: Large and immutable per version. Replicate asynchronously; use content-addressable storage. Include checksums to detect corruption.
User conversation history: Append-only is easier to replicate. If users can edit/delete, need conflict resolution.
Feature store data: Depends on feature freshness requirements. Real-time features may need synchronous replication (expensive) or region-local computation.
Cache data: Generally don’t replicate caches. Each region warms its own cache. On failover, accept cold start.
class CrossRegionReplicator:
"""Replicate data across regions with appropriate consistency."""
async def replicate(self, data_type: str, key: str, value: Any) -> dict:
"""Replicate based on data type's consistency requirements."""
if data_type == 'model_weights':
# Async replication with verification
return await self._replicate_async_verified(key, value)
elif data_type == 'user_conversation':
# Causal consistency via vector clocks
return await self._replicate_causal(key, value)
elif data_type == 'feature_realtime':
# Synchronous to primary + one secondary
return await self._replicate_sync_quorum(key, value)
elif data_type == 'cache':
# Don't replicate, local only
return {'replicated': False, 'reason': 'cache_local_only'}Identifying and Eliminating Bottlenecks
Effective system design requires understanding where bottlenecks will emerge before they manifest. This section builds intuition for bottleneck analysis.
The Theory of Constraints Applied to AI Systems
Every system has exactly one constraint that limits throughput. Optimizing non-constraints provides zero benefit. This simple insight, from Goldratt’s “The Goal,” transforms how you approach scaling.
Finding the constraint:
- Measure utilization of each component under load
- The component at highest utilization is likely the constraint
- Improvements to other components won’t help until you address the constraint
For AI systems, common constraints:
Constraint Analysis Framework:
1. GPU Compute
Symptoms: GPU utilization near 100%, scaling GPUs improves throughput
Solutions: Batching, quantization, faster GPUs, model optimization
2. GPU Memory
Symptoms: OOM errors, limited batch size, can't fit longer contexts
Solutions: Quantization, PagedAttention, offloading, model parallelism
3. Memory Bandwidth
Symptoms: GPU utilization low despite compute availability
Solutions: Batching, quantization, better memory access patterns
4. Network (for distributed)
Symptoms: Latency increases with model parallelism degree
Solutions: Better interconnects, reduce communication, overlap compute
5. CPU Preprocessing
Symptoms: GPUs idle waiting for batches, tokenization slow
Solutions: Async preprocessing, faster tokenizers, preprocessing tier
6. I/O (model loading, data)
Symptoms: Cold starts slow, batch processing I/O bound
Solutions: Caching, faster storage, prefetching
Worked example:
System: 8x A100 inference cluster
Observation: 1000 requests/second max, want 2000
Measurements:
- GPU utilization: 65%
- GPU memory: 45% used
- Network: 10% bandwidth
- CPU: 85% utilization
- Tokenization: 40ms/request
Analysis:
CPU at 85% is likely constraint. Tokenization at 40ms means
CPU can handle 1/0.04 = 25 requests/sec per core.
With 40 cores, max 1000 req/sec matches observation.
GPUs are underutilized because CPU can't feed them fast enough.
Solution: Add tokenization tier, use async processing,
or use faster tokenizer (HuggingFace Rust tokenizers are 10x faster)
After fix, GPU becomes new constraint at 65% utilization.
Now can improve batching or add GPUs.
Little’s Law for Capacity Planning
Little’s Law states: L = lambda * W
Where:
- L = average number of requests in system
- lambda = arrival rate (requests/second)
- W = average time in system (latency)
Rearranging: lambda = L / W
This tells you the throughput given your concurrency and latency.
Applied to AI inference:
Given:
- 8 GPUs
- Each GPU handles 1 request at a time (no batching)
- Average inference time: 2 seconds
L = 8 (8 GPUs, each processing 1 request)
W = 2 seconds
lambda = L / W = 8 / 2 = 4 requests/second
With batching (batch size 8):
L = 8 GPUs * 8 requests/batch = 64 concurrent
W = 2.5 seconds (slightly higher with batching)
lambda = 64 / 2.5 = 25.6 requests/second
Batching increased throughput 6.4x!
For capacity planning:
Want to serve: 1000 requests/minute = 16.7 requests/second
Average latency target: 3 seconds
L = lambda * W = 16.7 * 3 = 50 concurrent requests needed
If each GPU handles batch size 8:
GPUs needed = 50 / 8 = 6.25 -> 7 GPUs minimum
Add 50% headroom for traffic spikes: 11 GPUs
Amdahl’s Law and Parallelization Limits
Amdahl’s Law quantifies the speedup possible from parallelization:
Speedup = 1 / (S + P/N)
Where:
- S = serial fraction (cannot be parallelized)
- P = parallel fraction (1 - S)
- N = number of parallel workers
Applied to AI systems:
LLM autoregressive generation is inherently serial—each token depends on all previous tokens. If generation is 90% of latency:
S = 0.90 (serial - generation)
P = 0.10 (parallel - preprocessing, networking)
N = 100 (lots of parallel workers)
Speedup = 1 / (0.90 + 0.10/100) = 1 / 0.901 = 1.11x
Even with infinite parallel workers, max speedup = 1/0.90 = 1.11x!
This explains why:
- Horizontal scaling helps throughput but not latency
- Speculative decoding attacks the serial fraction
- Faster GPUs help latency; more GPUs help throughput
Queuing Theory Intuitions
Requests arriving at a system form queues. Understanding queuing behavior helps predict latency under load.
Key insight: Latency explodes as utilization approaches 100%.
For M/M/1 queue (random arrivals, random service, 1 server):
- Average wait time = (utilization) / (1 - utilization) * service_time
At 50% utilization: wait = 0.5/0.5 * service = 1x service time At 80% utilization: wait = 0.8/0.2 * service = 4x service time At 90% utilization: wait = 0.9/0.1 * service = 9x service time At 95% utilization: wait = 0.95/0.05 * service = 19x service time
Practical implications:
- Don’t run systems at > 70-80% utilization
- The “last 10%” of capacity costs disproportionate latency
- Autoscale triggers should fire well before saturation
- Headroom for traffic spikes is essential
Capacity Planning Rule of Thumb:
Sustainable capacity = Peak capacity * 0.7
If you need to handle 1000 req/sec peak,
provision for 1000 / 0.7 = 1430 req/sec capacity.
For AI with expensive cold starts, be more conservative:
Sustainable capacity = Peak capacity * 0.5
Designing for Failure
Large-scale systems experience constant failures. The question isn’t whether components will fail—they will—but how the system behaves when they do.
Failure Mode Analysis
Before building, enumerate failure modes and their consequences:
Failure Mode and Effects Analysis (FMEA) for AI Inference:
| Component | Failure Mode | Effect | Severity | Mitigation |
|---|---|---|---|---|
| Model Server | Process crash | Requests fail, timeout | High | Health checks, auto-restart |
| Model Server | GPU OOM | Single request fails | Medium | Memory limits, admission ctrl |
| Model Server | Model corrupt | Wrong/garbage output | Critical | Checksums, output validation |
| Load Balancer | Process crash | All traffic fails | Critical | Redundant LBs, health checks |
| Load Balancer | Misrouting | Wrong model version served | High | Canary detection, rollback |
| Feature Store | Unavailable | Features missing | High | Fallback features, cache |
| Feature Store | Stale data | Outdated predictions | Medium | Freshness alerts, TTL |
| Cache | Unavailable | Higher latency, load on GPUs | Medium | Cache tiers, graceful degrade |
| Network | Partition | Subset of users affected | High | Multi-path, failover |
| Network | High latency | Timeout, poor UX | Medium | Adaptive timeouts, shed load |
Circuit Breakers
When a dependency fails, retrying makes things worse—you overwhelm a recovering service. Circuit breakers prevent cascade failures:
class CircuitBreaker:
"""Prevent cascade failures by failing fast."""
def __init__(self, failure_threshold: int = 5,
recovery_timeout: float = 30.0,
half_open_max_calls: int = 3):
self.state = "CLOSED"
self.failure_count = 0
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.last_failure_time = None
self.half_open_successes = 0
self.half_open_max_calls = half_open_max_calls
async def call(self, func, *args, fallback=None, **kwargs):
"""Execute function with circuit breaker protection."""
if self.state == "OPEN":
if self._should_attempt_reset():
self.state = "HALF_OPEN"
self.half_open_successes = 0
elif fallback:
return await fallback(*args, **kwargs)
else:
raise CircuitOpenError(f"Circuit open, retry after {self._time_until_retry():.1f}s")
try:
result = await func(*args, **kwargs)
self._record_success()
return result
except Exception as e:
self._record_failure()
if fallback:
return await fallback(*args, **kwargs)
raise
def _record_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
self.state = "OPEN"
def _record_success(self):
if self.state == "HALF_OPEN":
self.half_open_successes += 1
if self.half_open_successes >= self.half_open_max_calls:
self.state = "CLOSED"
self.failure_count = 0
elif self.state == "CLOSED":
self.failure_count = max(0, self.failure_count - 1) # Gradual recoveryGraceful Degradation Levels
Define degradation levels before you need them:
Degradation Levels for AI Chat Application:
LEVEL 0 - NORMAL
All features available
Full model, full context window
Personalization enabled
LEVEL 1 - MINOR DEGRADATION
Trigger: Feature store p99 > 500ms OR GPU utilization > 85%
Actions:
- Disable non-essential features (personalization)
- Reduce context window (8K -> 4K)
- Enable aggressive caching
LEVEL 2 - MAJOR DEGRADATION
Trigger: Model server errors > 5% OR Feature store unavailable
Actions:
- Switch to smaller/faster model
- Use cached responses where possible
- Reduce max output tokens
- Show "high demand" message to users
LEVEL 3 - EMERGENCY
Trigger: Model servers unavailable > 50% OR cascading failures
Actions:
- Static fallback responses for common queries
- Queue non-urgent requests
- Reject new users, serve existing sessions
- Page on-call engineering
LEVEL 4 - MAINTENANCE MODE
Trigger: Manual activation OR unrecoverable failure
Actions:
- Return maintenance page
- Log all requests for replay
- All hands on deck
class GracefulDegradation:
"""Manage system degradation based on health signals."""
def __init__(self, config: dict):
self.config = config
self.current_level = 0
self.level_actions = self._build_level_actions()
async def evaluate(self, metrics: dict) -> int:
"""Evaluate current metrics and adjust degradation level."""
if metrics['model_error_rate'] > 0.5 or metrics['cascade_detected']:
target_level = 3
elif metrics['model_error_rate'] > 0.05 or not metrics['feature_store_healthy']:
target_level = 2
elif metrics['gpu_utilization'] > 0.85 or metrics['feature_store_p99'] > 500:
target_level = 1
else:
target_level = 0
if target_level != self.current_level:
await self._transition(self.current_level, target_level)
self.current_level = target_level
return self.current_level
async def _transition(self, from_level: int, to_level: int):
"""Execute transition actions."""
if to_level > from_level:
# Degrading: execute in order
for level in range(from_level + 1, to_level + 1):
for action in self.level_actions[level]['enter']:
await action()
else:
# Recovering: execute in reverse
for level in range(from_level, to_level, -1):
for action in self.level_actions[level]['exit']:
await action()Timeout and Retry Strategies
Timeouts and retries seem simple but are surprisingly subtle for AI workloads:
Timeout considerations:
- LLM inference can legitimately take 10-60 seconds
- Setting timeout too low causes spurious failures
- Setting timeout too high wastes resources on truly failed requests
- Different request types need different timeouts
class AdaptiveTimeout:
"""Adjust timeouts based on request characteristics and system state."""
def calculate_timeout(self, request: dict, system_metrics: dict) -> float:
"""Calculate appropriate timeout for this request."""
# Base timeout from request characteristics
estimated_tokens = request.get('max_tokens', 512)
tokens_per_second = system_metrics.get('current_tokens_per_second', 50)
base_timeout = estimated_tokens / tokens_per_second + 2.0 # +2s buffer
# Adjust for system load
queue_depth = system_metrics.get('queue_depth', 0)
load_factor = 1 + (queue_depth / 100) # Add 1% per queued request
# Adjust for observed latency
p99_latency = system_metrics.get('p99_latency', 5.0)
latency_factor = max(1.0, p99_latency / 5.0)
adjusted_timeout = base_timeout * load_factor * latency_factor
# Clamp to reasonable bounds
return min(max(adjusted_timeout, 5.0), 120.0)Retry considerations:
- Retrying expensive AI operations wastes GPU resources
- Must distinguish retryable (network timeout) from non-retryable (invalid input)
- Exponential backoff prevents thundering herd
- Budget-based retries prevent infinite retry loops
class AIRetryStrategy:
"""Retry strategy aware of AI workload costs."""
def __init__(self, max_retries: int = 3, base_delay: float = 1.0):
self.max_retries = max_retries
self.base_delay = base_delay
self.cost_per_retry = {} # Track retry costs
def should_retry(self, error: Exception, attempt: int,
request: dict) -> tuple[bool, float]:
"""Determine if we should retry and with what delay."""
if attempt >= self.max_retries:
return False, 0
# Non-retryable errors
if isinstance(error, (InvalidInputError, AuthenticationError)):
return False, 0
# Expensive requests get fewer retries
estimated_cost = self._estimate_cost(request)
if estimated_cost > 0.01 and attempt >= 1: # $0.01 threshold
return False, 0
# Retryable errors with exponential backoff
if isinstance(error, (TimeoutError, ConnectionError, ServiceUnavailableError)):
delay = self.base_delay * (2 ** attempt) + random.uniform(0, 1)
return True, delay
# Unknown errors: retry once with long delay
if attempt < 1:
return True, 5.0
return False, 0What people do: Build failure handling (retries, circuit breakers, fallbacks) and assume it works because the code looks correct.
Why it fails: Failure handling code is the least-tested code in your system—it runs when things go wrong, which is rare in development. Untested failure paths often don’t work: retries that amplify failures, circuit breakers that never trip, fallbacks that have their own bugs.
Fix: Regularly inject failures in production (or production-like environments). Kill random pods, inject latency, corrupt responses, exhaust memory. Start small (one failure type, limited blast radius) and expand. Use chaos engineering tools (Chaos Monkey, LitmusChaos) for automation. The goal isn’t to cause outages—it’s to find weaknesses before real failures do.
Operational Excellence
A well-designed system that’s impossible to operate is not a well-designed system. Operations must be a first-class concern.
Observability: Metrics, Logs, and Traces
Metrics provide quantitative measurements over time. For AI systems, track:
Essential AI System Metrics:
Request Metrics (RED):
- Rate: requests per second, by model/endpoint
- Errors: error rate, by error type
- Duration: latency p50/p95/p99
Resource Metrics (USE):
- GPU Utilization: percent, per GPU
- GPU Memory: used/free/fragmentation
- CPU Utilization: for preprocessing
- Network: bandwidth, connections
AI-Specific Metrics:
- Tokens per second: throughput
- Tokens per request: distribution
- Queue depth: pending requests
- Batch size: actual vs. configured
- Cache hit rate: L1/L2/semantic
- Model load time: cold start latency
- KV cache utilization: memory efficiency
Logs provide context for debugging. Structure them for searchability:
import structlog
logger = structlog.get_logger()
def log_inference_request(request_id: str, request: dict, result: dict,
timing: dict, error: Exception = None):
"""Structured logging for inference requests."""
logger.info(
"inference_complete",
request_id=request_id,
model=request.get('model'),
input_tokens=result.get('input_tokens'),
output_tokens=result.get('output_tokens'),
queue_time_ms=timing.get('queue_time') * 1000,
inference_time_ms=timing.get('inference_time') * 1000,
total_time_ms=timing.get('total_time') * 1000,
cache_hit=result.get('cache_hit'),
error_type=type(error).__name__ if error else None,
error_message=str(error) if error else None,
)Traces follow requests across services. Essential for debugging distributed issues:
from opentelemetry import trace
tracer = trace.get_tracer("ai-inference")
async def inference_with_tracing(request: dict) -> dict:
"""Trace a request through the inference pipeline."""
with tracer.start_as_current_span("inference") as span:
span.set_attribute("model", request.get('model'))
span.set_attribute("max_tokens", request.get('max_tokens'))
with tracer.start_span("preprocess"):
processed = await preprocess(request)
with tracer.start_span("feature_fetch"):
features = await fetch_features(request)
with tracer.start_span("model_inference"):
result = await run_inference(processed, features)
with tracer.start_span("postprocess"):
final = await postprocess(result)
span.set_attribute("output_tokens", final.get('token_count'))
return finalAlerting Philosophy
Good alerting is an art. Too few alerts and you miss problems; too many and you get alert fatigue.
Alert on symptoms, not causes: Alert on “user-facing error rate > 1%” rather than “GPU memory > 90%”. High GPU memory might be fine; user errors are never fine.
Every alert should be actionable: If there’s no action to take, it’s not an alert—it’s a log. Define the runbook before enabling the alert.
Alert hierarchy:
Critical (page immediately):
- User-facing error rate > 5% for 5 minutes
- Complete service outage
- Data corruption detected
- Security incident indicators
Warning (page during business hours):
- User-facing error rate > 1% for 15 minutes
- Latency p99 > 2x normal for 30 minutes
- GPU utilization > 90% for 1 hour
- Approaching capacity limits
Info (notify slack, no page):
- Unusual traffic patterns
- Elevated cache miss rate
- Single server unhealthy
Capacity Planning
Plan capacity before you need it:
class CapacityPlanner:
"""Project capacity needs based on growth trends."""
def project_capacity_needs(self, historical_usage: list[dict],
months_ahead: int = 6) -> dict:
"""Project future capacity needs."""
# Fit growth curve to historical data
growth_rate = self._calculate_growth_rate(historical_usage)
projections = {}
current = historical_usage[-1]
for month in range(1, months_ahead + 1):
projected = {}
for resource, current_usage in current.items():
projected_usage = current_usage * ((1 + growth_rate) ** month)
projected[resource] = {
'usage': projected_usage,
'utilization': projected_usage / self.capacity[resource],
'at_risk': projected_usage / self.capacity[resource] > 0.7
}
projections[f'month_{month}'] = projected
return projections
def recommend_scaling(self, projections: dict) -> list[dict]:
"""Recommend scaling actions based on projections."""
recommendations = []
for month, resources in projections.items():
for resource, data in resources.items():
if data['at_risk']:
recommendations.append({
'timeframe': month,
'resource': resource,
'current_capacity': self.capacity[resource],
'projected_need': data['usage'] * 1.5, # 50% headroom
'action': f"Scale {resource} by {(data['usage'] * 1.5 / self.capacity[resource] - 1) * 100:.0f}%"
})
return recommendationsRunbook Essentials
Every alert needs a runbook. Every runbook needs these elements:
# Runbook: High Inference Latency
## Summary
P99 inference latency exceeds threshold (>5 seconds for >5 minutes)
## Impact
Users experience slow responses. May lead to timeouts and errors.
## Diagnosis Steps
### 1. Check Dashboard
- Link: [Inference Latency Dashboard](...)
- Look for: Which models affected? All or specific?
### 2. Check GPU Utilization
```bash
kubectl exec -it inference-server-0 -- nvidia-smi
```
- If >95%: Likely resource saturation. Go to Mitigation Option A.
- If <50%: Likely preprocessing bottleneck. Go to Mitigation Option B.
### 3. Check Queue Depth
```bash
curl http://inference-api:8080/metrics | grep queue_depth
```
- If >100: Traffic spike. Go to Mitigation Option C.
## Mitigation Options
### Option A: Resource Saturation
1. Scale up inference replicas: `kubectl scale deployment inference --replicas=+2`
2. Enable aggressive caching: `kubectl set env deployment/inference CACHE_AGGRESSIVE=true`
3. Reduce max tokens: `kubectl set env deployment/inference MAX_TOKENS=256`
### Option B: Preprocessing Bottleneck
1. Check tokenization service: `kubectl logs -l app=tokenizer`
2. Restart if stuck: `kubectl rollout restart deployment/tokenizer`
3. Scale tokenizers: `kubectl scale deployment tokenizer --replicas=+2`
### Option C: Traffic Spike
1. Enable rate limiting: `kubectl apply -f rate-limit-strict.yaml`
2. Scale inference: see Option A step 1
3. If sustained, engage capacity planning
## Escalation
If not resolved in 30 minutes, page senior on-call.Case Study: How Companies Architect for Scale
Learning from real-world systems provides insights that theoretical analysis cannot. This section examines patterns from companies that have solved AI scaling challenges.
Pattern: Prefill-Decode Disaggregation
The problem: LLM inference has two phases with different characteristics:
- Prefill: Process all input tokens (compute-bound, parallelizable)
- Decode: Generate output tokens one by one (memory-bound, sequential)
Running both on the same hardware means neither is optimized.
The solution: Separate prefill and decode to different hardware.
Why it works: Prefill benefits from high compute density. Decode benefits from high memory bandwidth. Different GPU configurations optimize each. The orchestrator routes appropriately.
Trade-off: Increased system complexity. KV cache transfer adds latency. Only worthwhile at significant scale.
Pattern: Semantic Caching at Scale
The problem: Users often ask similar questions. Computing the same response repeatedly wastes expensive GPU time.
The solution: Cache responses based on semantic similarity, not just exact match.
Implementation considerations:
- Similarity threshold affects hit rate vs. accuracy trade-off
- Cache invalidation for stale information
- Handle personalization (cache per-user or use placeholders)
- Multi-tenant isolation
class SemanticCache:
"""Cache with semantic similarity matching."""
def __init__(self, embedding_model, vector_store,
similarity_threshold: float = 0.95):
self.embedder = embedding_model
self.store = vector_store
self.threshold = similarity_threshold
async def get(self, query: str, context: dict = None) -> Optional[dict]:
"""Check cache for semantically similar query."""
query_embedding = await self.embedder.embed(query)
# Search for similar cached queries
results = await self.store.search(
vector=query_embedding,
top_k=1,
filter=self._build_filter(context) # Tenant, user, etc.
)
if results and results[0].score >= self.threshold:
# Check freshness
cached = results[0].metadata
if self._is_fresh(cached):
return {
'response': cached['response'],
'cache_hit': True,
'similarity': results[0].score,
'cache_age_seconds': time.time() - cached['timestamp']
}
return None
async def put(self, query: str, response: str, context: dict = None):
"""Store response in cache."""
query_embedding = await self.embedder.embed(query)
await self.store.upsert(
id=self._generate_id(query, context),
vector=query_embedding,
metadata={
'query': query,
'response': response,
'timestamp': time.time(),
'context': context
}
)Pattern: Model Cascading
The problem: Not all requests need the largest, most capable model. Using GPT-5-class models for simple questions wastes resources.
The solution: Route requests to appropriately-sized models based on complexity.
Implementation:
class ModelCascade:
"""Route requests to appropriately-sized models."""
def __init__(self, models: list[dict], classifier):
self.models = sorted(models, key=lambda m: m['cost'])
self.classifier = classifier
async def inference(self, request: dict) -> dict:
"""Route to appropriate model, retry if needed."""
complexity = await self.classifier.estimate_complexity(request)
starting_tier = self._select_tier(complexity)
for tier in range(starting_tier, len(self.models)):
model = self.models[tier]
result = await model['client'].inference(request)
if await self._validate_response(result, request):
return {
**result,
'model_used': model['name'],
'model_tier': tier,
'estimated_cost': model['cost'] * result['tokens']
}
# Response inadequate, try larger model
continue
# All models failed validation
return await self._handle_cascade_failure(request)Pattern: Regional Sharding with Global Routing
The problem: Global users need low latency, but maintaining identical copies of everything in all regions is expensive and complex.
The solution: Shard data by user region; route globally but process locally.
Data Placement:
- Model weights: Replicated to all regions (immutable, versioned)
- User data: Sharded by user’s home region
- Conversation history: Lives in user’s region
- Global metadata: Replicated with eventual consistency
Handling cross-region users:
- Traveling user: Route to nearest region, fetch data from home region
- Cross-region collaboration: Route to data owner’s region
- Emergency failover: Full data copy in standby region (async, potentially stale)
Key Takeaways
CAP theorem forces real choices - During network partitions, you must choose between consistency and availability. Different data types warrant different choices: user sessions favor availability; billing favors consistency.
AI systems have unique scaling challenges - Memory bandwidth bottlenecks before compute. Batching creates latency-throughput tension. Cold starts are expensive. State management complexity increases with scale.
Match scaling strategy to your bottleneck - Horizontal scaling improves throughput but not single-request latency. Vertical scaling has hardware limits. Model parallelism enables larger models but adds complexity.
Design for failure from the start - Circuit breakers prevent cascade failures. Graceful degradation maintains partial service. Timeouts and retries need careful calibration for expensive AI operations.
Operational excellence enables sustainability - Observability (metrics, logs, traces) is non-negotiable. Alerting should be actionable. Capacity planning must precede growth. Scale changes everything.
Summary
System design at scale is fundamentally about understanding constraints and making principled trade-offs. This chapter covered the essential concepts:
Distributed systems foundations constrain what’s possible. The CAP theorem forces a choice between consistency and availability during partitions. Consistency models exist on a spectrum—choose the minimum required. Networks fail in complex ways; design for it.
AI systems have unique challenges that traditional web services don’t face. Memory bandwidth often bottlenecks before compute. Batching creates a fundamental tension between latency and throughput. State management complexity increases with scale. Cold starts are expensive and must be planned for.
Scaling patterns must match your constraints. Horizontal scaling improves throughput and availability but not single-request latency. Vertical scaling has hardware limits but simpler operations. Model parallelism enables larger models but adds complexity. Choose based on your actual bottleneck.
Multi-region architecture is necessary for global systems but hard for AI. Consider active-active vs. active-passive based on your consistency needs. Data sovereignty requirements may constrain your options. Plan for cross-region replication costs.
Bottleneck analysis should precede optimization. Use the theory of constraints to find the single limiting factor. Little’s Law and queuing theory provide mathematical foundations. Amdahl’s Law explains why some optimizations provide limited benefit.
Designing for failure assumes components will fail. Circuit breakers prevent cascade failures. Graceful degradation maintains partial service. Timeouts and retries need careful calibration for expensive AI operations.
Operational excellence enables sustainable systems. Observability (metrics, logs, traces) is non-negotiable. Alerting should be actionable and hierarchical. Capacity planning must precede growth. Runbooks enable consistent incident response.
The most important insight is this: scale changes everything. Patterns that work at small scale fail at large scale. Assumptions that hold for hundreds of users shatter at millions. The best time to design for scale is before you need it—because by the time you need it, you won’t have time to redesign.
Practical Exercises
Exercise 1: Bottleneck Analysis
Given a system with these metrics under load:
- GPU utilization: 65%
- GPU memory: 80% used
- CPU utilization: 30%
- Network bandwidth: 15%
- Queue depth: 200 requests
- Latency p99: 8 seconds (target: 3 seconds)
Identify the bottleneck and propose solutions. Consider: 1. What is the most likely constraint? 2. What additional metrics would you gather? 3. What optimizations would you try first?
Exercise 2: CAP Trade-off Design
Design the data architecture for a multi-region AI chat application serving users globally. Document: 1. What data needs to be stored (conversations, preferences, features) 2. CAP choice for each data type with justification 3. Replication strategy for each 4. Behavior during a region partition
Exercise 3: Failure Mode Analysis
Perform FMEA for an AI system with:
- Load balancer
- 10 inference servers with GPUs
- Feature store (external service)
- Result cache (Redis cluster)
- Model registry
For each component, identify: 1. Possible failure modes 2. Impact on users 3. Mitigation strategy 4. Monitoring/detection approach
Exercise 4: Capacity Planning
Your AI service currently handles:
- 10,000 requests/hour average
- 20,000 requests/hour peak
- 8 GPUs at 60% average utilization
- 3-second average latency
Plan for 10x growth over 12 months: 1. Project monthly capacity needs 2. Identify when scaling is needed (assume 80% max utilization) 3. Recommend scaling approach (horizontal, vertical, or both) 4. Estimate cost implications
Self-Assessment Checkpoint
Conceptual Questions
Q1. [Senior] Explain the CAP theorem and what “choosing two of three” actually means in practice for AI systems.
Answer
CAP states that during a network partition, you must choose between consistency (all nodes see the same data) and availability (every request gets a response). “Choose two” is misleading—you always want C and A when there’s no partition. The real choice is: when a partition happens, do you fail requests (choose C) or serve potentially stale data (choose A)? For AI systems: (1) User preferences/sessions—usually choose A (serve with stale data rather than fail). (2) Feature store data—often choose A (stale features better than no features). (3) Model versioning—usually choose C (wrong model version could be dangerous). (4) Billing/quotas—choose C (over-serving is costly). Most systems need different CAP choices for different data types.Q2. [Senior] What’s the difference between horizontal and vertical scaling? When would you choose each for an LLM serving system?
Answer
Horizontal: Add more machines (scale out). Vertical: Add more resources to existing machines (scale up). For LLM serving: Vertical scaling first for model size—if model needs 40GB VRAM, you need a GPU with 40GB+. Can’t split a model across machines without significant complexity (tensor parallelism). Horizontal scaling for throughput—once you have GPUs big enough for the model, add more GPUs/machines to handle more concurrent requests. Hybrid is common: vertical to meet per-model requirements, horizontal for capacity. Limits: Vertical has ceiling (biggest available GPU). Horizontal adds coordination overhead, network latency, state synchronization complexity.Q3. [Staff] Describe the “tail at scale” problem and techniques to mitigate it in AI inference systems.
Answer
Problem: At scale, even rare slow requests compound. If each server has 1% chance of being slow, a request touching 100 servers has 63% chance of hitting at least one slow server (1 - 0.99^100). P99 latency becomes typical latency at scale. Mitigation: (1) Hedged requests—send request to multiple servers, use first response. (2) Tied requests—queue at multiple servers, cancel others when one starts. (3) Canary requests—test with small probe before full request. (4) Request deadlines—kill requests that can’t meet SLA anyway. (5) Isolation—prevent slow requests from blocking fast ones (separate queues by predicted latency). (6) Reduce variability sources—consistent hardware, avoid GC pauses, prewarming. For LLM inference: Variable generation length is major tail source. Consider: Batching strategy affects tail (waiting for batch vs. immediate processing).Q4. [Staff] You’re designing a multi-region AI application. What are the key architectural decisions around data placement and request routing?
Answer
Data placement decisions: (1) User data—usually regional (data residency requirements, latency). (2) Model weights—replicated globally (read-only, large, can lag). (3) Feature stores—depends on freshness requirements; often eventual consistency with regional caching. (4) Conversation history—usually regional for latency, global for users who travel. Routing decisions: (1) Latency-based—route to nearest healthy region. (2) Data locality—route to where user data resides. (3) Cost-based—route to cheaper regions when latency tolerance allows. (4) Capacity-based—overflow to other regions during peak. Complexity: Cross-region requests (user in Region A, data in Region B) add latency. Either accept latency, replicate data, or design for locality. Consider: Failover strategy when region fails—where does traffic go? Is data available there?Q5. [Staff] How do you approach capacity planning for a system that’s growing 10x over the next year? What are the pitfalls?
Answer
Approach: (1) Baseline current capacity—throughput per GPU, memory per request, storage growth rate. (2) Project growth curve—is it linear, exponential, seasonal? (3) Calculate capacity needs at checkpoints (monthly/quarterly). (4) Add buffer—target 70-80% utilization, not 100%. (5) Account for operational overhead—deployment capacity, testing, incidents. (6) Consider lead times—GPU procurement can take months. Pitfalls: (1) Assuming linear scaling—10x traffic doesn’t mean exactly 10x servers (coordination overhead grows). (2) Ignoring dependent systems—database, network, storage may bottleneck before compute. (3) Under-estimating growth spikes—10x average might mean 50x peak. (4) Over-provisioning too early—burns money. (5) Forgetting about new features—roadmap may add load beyond organic growth. (6) Missing cost projections—can you afford 10x?Spot the Problem
Problem 1. [Senior] Architecture description:
"For high availability, we run three replicas of our inference service.
If one fails, the load balancer routes to the others."
Answer
Incomplete HA strategy: (1) Three replicas where?—Same availability zone? Same datacenter? If so, zone/datacenter failure takes all three. (2) How does LB detect failure?—Health checks? What’s the detection delay? (3) What about cascading failure?—If one fails, others get 50% more load. Can they handle it? (4) State handling?—If service has state (sessions, KV cache), is it replicated? (5) What about dependencies?—Model registry, feature store—are they also HA? Proper HA includes: Multi-AZ or multi-region deployment, health checks with appropriate timeouts, capacity to handle N-1 failures, dependency HA, and tested failover procedures.Problem 2. [Staff] Scaling strategy:
"Our inference system auto-scales based on CPU utilization.
When CPU > 70%, we add instances."
Answer
Wrong metric for LLM inference: (1) LLM inference is typically GPU-bound and memory-bandwidth bound, not CPU-bound. CPU utilization may be low while GPUs are saturated. (2) Should scale on: GPU utilization, request queue depth, latency percentiles. (3) Reactive scaling has lag—by the time utilization triggers scaling, latency may already be bad. Consider predictive scaling based on time-of-day patterns. (4) Scale-up time—how long to provision GPU instances? May need pre-provisioned warm pool. Better: Scale on queue depth or latency percentile. Maintain warm capacity. Use predictive scaling for known patterns.Problem 3. [Staff] Multi-region design:
"We deployed our AI service in three regions for global coverage.
Users are routed to the nearest region. Each region has its own database."
Answer
Issues: (1) Nearest region may not have user’s data—if user traveled, their data might be in a different region. (2) How is data synchronized?—If user creates data in Region A, can they access it from Region B? (3) Conflict resolution—if user updates from two regions, what wins? (4) Failover complexity—if Region A fails, can Region B serve Region A users? Does it have the data? (5) Consistency model—what guarantees does the user get? Should document: Data placement strategy, replication approach, consistency guarantees, cross-region access patterns, failover behavior.Design Exercises
Exercise 1. [Staff] Design a global AI chat application serving 100M users across NA, EU, and APAC. Requirements: <500ms time to first token, conversation history must be accessible anywhere, comply with GDPR data residency. Document your architecture including data placement, request routing, and failure handling.
Guidance
Key decisions: (1) Model serving—GPU clusters in each region, route to nearest. Models replicated globally. (2) Conversation data—regional primary storage (GDPR: EU data stays in EU). Cross-region access via read replicas or on-demand fetch with latency cost. (3) User routing—latency-based initially, data-locality override for returning users. (4) Caching—edge caches for static assets, regional caches for frequent patterns. (5) Failover—within region: AZ failover automatic. Cross-region: explicit, requires user data access strategy. GDPR specifics: EU users’ data in EU region only. Non-EU users can be anywhere. Clear data flows documenting where data goes. Consider: Do you need to process EU users’ prompts only on EU servers? Legal review needed.Exercise 2. [Staff] You’re designing an AI inference platform that must handle both real-time requests (100ms P99) and batch processing (throughput optimized, cost sensitive). Both use the same models. Design the system architecture addressing resource sharing, isolation, and cost optimization.
Guidance
Architecture options: (1) Separate clusters—real-time and batch isolated. Simple, but utilization inefficient. (2) Shared cluster with priority—real-time preempts batch. Better utilization, complexity in scheduling. (3) Time-sharing—batch runs during off-peak real-time hours. Works if patterns are predictable. Considerations: (1) Batch can use lower-priority/spot instances—cost savings. (2) Batch can queue and wait—better batching efficiency. (3) Real-time needs head-of-line prevention—batch requests shouldn’t block real-time. (4) Resource limits—prevent batch from starving real-time. (5) Different SLOs—batch failures can retry, real-time cannot. Design: Likely hybrid—dedicated real-time capacity for SLA, shared overflow to batch pool, batch uses idle real-time capacity + dedicated cheap instances.Connections to Other Chapters
System design at scale integrates knowledge from across the book. Here’s how this chapter connects:
Chapter 9 (LLM Deployment & Infrastructure): Deployment patterns and inference optimization for individual model instances. This chapter shows how to scale those patterns across regions and handle global traffic.
Chapter 27 (Performance Engineering): Latency optimization, caching strategies, and profiling techniques. Performance engineering at the component level feeds into system-wide performance at scale.
Chapter 31 (Reliability Engineering): Fault tolerance, redundancy patterns, and chaos engineering. Reliability is essential when operating distributed AI systems across regions.
Chapter 32 (Cost Engineering): Cost optimization strategies and capacity planning. System design decisions directly impact infrastructure costs at scale.
Architecture Decision Records (Appendix G): Example ADRs for common architectural decisions. Use ADRs to document key system design choices.
Further Reading
Essential
- Kleppmann, “Designing Data-Intensive Applications” - The definitive practitioner’s guide to distributed systems.
- Dean and Barroso (2013), “The Tail at Scale” - Why tail latencies matter. Essential Google experience.
- Kwon et al. (2023), “PagedAttention” - The vLLM paper that transformed LLM serving efficiency.
Deep Dives
- Lamport (1978), “Time, Clocks, and the Ordering of Events” - Foundational work on logical clocks.
- Yu et al. (2022), “Orca” - Introduces continuous batching for LLM serving.
- Beyer et al., “Site Reliability Engineering” - Google’s approach to reliable systems at scale.