Chapter 25: System Design at Scale

Keywords

distributed systems, scalability, multi-region, load balancing, CAP theorem, sharding, replication, circuit breakers

Introduction

In March 2023, a startup launched their AI assistant to a modest beta user base. The system worked beautifully: sub-second responses, personalized interactions, glowing user feedback. Three months later, a viral Twitter thread brought 100,000 users in a single day. By noon, latencies had climbed from 200ms to 15 seconds. By evening, the service was completely offline. The engineering team spent 72 hours in triage mode, cobbling together fixes: adding servers, implementing hasty caching, and shedding features. When stability returned, they faced a harder question: how did a system designed by competent engineers fail so catastrophically?

The answer lies in a fundamental truth about distributed systems: scale changes everything. Patterns that work at small scale become liabilities at large scale. Assumptions that hold for hundreds of users shatter at millions. Failures that are theoretical edge cases become daily occurrences. System design at scale isn’t about doing more of the same—it’s about fundamentally rethinking how systems behave under pressure.

AI systems face unique scaling challenges that traditional web services don’t encounter. A typical web request might involve a database lookup and some business logic, completing in 50 milliseconds. An LLM inference request might require loading billions of parameters, executing trillions of floating-point operations, and generating output tokens sequentially—taking seconds, not milliseconds, while consuming GPU resources that cost 100x what CPU resources cost. The economics, physics, and failure modes are entirely different.

This chapter builds your intuition for designing AI systems that remain reliable as they grow. We’ll start with distributed systems foundations—the theoretical constraints that govern what’s possible—then examine how these principles apply to AI workloads specifically. You’ll learn to reason about bottlenecks before they manifest, design for graceful degradation when failures inevitably occur, and make principled trade-offs between competing concerns like consistency, availability, and cost.

The Core Insight: Why AI Systems Scale Differently

To design scalable AI systems, you must first understand why they’re different from traditional web services.

Resource intensity and economics: A single LLM inference request might require 16GB of GPU memory and 100 TFLOPS of compute. Compare this to a typical API request that uses megabytes of RAM and negligible CPU. The cost differential is enormous—GPU hours cost 10-100x CPU hours—which means inefficiency at scale translates to millions of dollars.

Sequential generation: Autoregressive language models generate tokens one at a time, with each token depending on all previous tokens. You cannot parallelize this within a single request. A 500-token response requires 500 sequential forward passes, creating a latency floor that no amount of hardware can reduce without architectural changes.

Memory as the bottleneck: Unlike CPU-bound services where you can scale horizontally almost indefinitely, GPU memory is finite and expensive. A 70B parameter model requires ~140GB just for weights in FP16. The KV cache for context grows with sequence length. Memory fragmentation becomes a first-order concern.

Batching dynamics: Individual inference requests underutilize GPUs dramatically. Throughput increases nearly linearly with batch size until memory limits. This creates a fundamental tension: latency-sensitive applications want small batches (fast individual requests); throughput-sensitive applications want large batches (efficient GPU utilization).

Stateful computation: For conversational AI, each turn might need context from all previous turns. This state must be managed somewhere—cached in GPU memory, stored externally, or regenerated. Each choice has profound implications for latency, cost, and system design.

Understanding these differences is essential because they determine which distributed systems patterns apply and how. Techniques designed for stateless web services often fail for AI workloads, while AI-specific optimizations can yield 10x improvements that would be impossible in traditional systems.

What You’ll Learn

  • Why distributed systems guarantees constrain AI system design
  • How to choose between consistency models based on your use case
  • Horizontal vs. vertical scaling for inference and training workloads
  • Multi-region deployment patterns and their trade-offs
  • Identifying and eliminating bottlenecks before they manifest
  • Designing for graceful degradation when components fail
  • The operational practices that distinguish reliable systems from fragile ones

Prerequisites

  • Familiarity with AI inference fundamentals (Chapter 9)
  • Understanding of basic networking and storage concepts
  • Experience with production systems and debugging distributed issues

Distributed Systems Foundations

Staff Engineer Perspective

“Every outage we’ve had in the last two years traces back to violating one of the eight fallacies. The worst was when we assumed network latency was negligible and set our timeout to 100ms. Worked perfectly in us-east-1. Then we went multi-region and calls to eu-west-1 took 150ms on a good day. Our entire Europe presence was unusable for a week while we fixed it. Now we start every design review by asking: which fallacy are we accidentally assuming?”

Principal Engineer at global AI platform

Before designing scalable AI systems, you need to understand the fundamental constraints that govern all distributed systems. These aren’t engineering choices—they’re mathematical limitations, as inescapable as the speed of light.

The Eight Fallacies of Distributed Computing

In 1994, Peter Deutsch articulated seven fallacies (later expanded to eight by James Gosling) that engineers often assume when building distributed systems:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Every one of these is false, and AI systems violate these assumptions with particular severity.

Network reliability matters more for AI: A web request that fails can be retried with minimal cost. An AI inference request that fails after 10 seconds of computation wastes expensive GPU time. Partial failures are worse—a request might complete on the GPU but fail during response transmission, leaving the system in an ambiguous state.

Latency compounds for AI workloads: An LLM might call external APIs for retrieval, consult a feature store, and make multiple model calls. Each network hop adds latency. When your baseline inference is 2 seconds, a 100ms network delay might seem trivial—but 10 such delays add a full second, a 50% increase.

Bandwidth limits model distribution: Downloading a 70B parameter model (140GB in FP16) takes 20 minutes at 1 Gbps. Cross-region replication of model weights becomes a significant operational concern. Even within a datacenter, NVLink bandwidth between GPUs matters for distributed inference.

The CAP Theorem: Fundamental Trade-offs

Eric Brewer’s CAP theorem (formally proven by Gilbert and Lynch in 2002) states that a distributed system can provide at most two of three guarantees:

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a response (without guarantee of most recent data)
  • Partition tolerance: The system continues operating despite network partitions

Since network partitions are inevitable in distributed systems, the real choice is between consistency and availability during partitions. This isn’t a one-time architectural choice—different components of your system can make different trade-offs based on their requirements.

How to think about CAP for AI systems:

For model metadata (which model version is deployed, routing rules), you typically want consistency. If one gateway thinks model v2 is active while another thinks v1 is active, users get inconsistent behavior. During a partition, it’s better to refuse requests than serve the wrong model.

For inference results caching, you typically want availability. Serving a cached result from a slightly stale model is better than returning an error. Users might see marginally different responses, but the system remains functional.

For feature stores in fraud detection, the choice is nuanced. Stale features could miss fraud patterns, suggesting you want consistency. But blocking all transactions during a partition could cost millions. The right answer depends on your specific risk tolerance and fallback strategies.

# Framework for documenting CAP decisions
@dataclass
class CAPDecision:
    component: str
    choice: str  # "CP" or "AP"
    rationale: str
    partition_behavior: str
    consistency_guarantee: str

# Example decisions for an AI system
CAP_DECISIONS = {
    "model_registry": CAPDecision(
        component="Model Registry",
        choice="CP",
        rationale="Wrong model version causes incorrect predictions",
        partition_behavior="Reject deployment requests; serving continues with last-known-good",
        consistency_guarantee="Linearizable for writes, sequential for reads"
    ),
    "inference_cache": CAPDecision(
        component="Inference Result Cache",
        choice="AP",
        rationale="Stale cached result better than no result",
        partition_behavior="Serve local cache; async reconciliation",
        consistency_guarantee="Eventual, with bounded staleness (5 min TTL)"
    ),
    "feature_store_recommendations": CAPDecision(
        component="Feature Store (Recommendations)",
        choice="AP",
        rationale="Slightly stale recommendations acceptable",
        partition_behavior="Serve from local replica",
        consistency_guarantee="Eventual, 30-second replication lag target"
    ),
    "feature_store_fraud": CAPDecision(
        component="Feature Store (Fraud Detection)",
        choice="CP",
        rationale="Stale features could miss fraud patterns",
        partition_behavior="Fail closed (reject suspicious transactions)",
        consistency_guarantee="Linearizable"
    ),
}

Consistency Models: A Spectrum of Guarantees

CAP presents a binary choice, but consistency exists on a spectrum. Understanding these models helps you choose the minimum consistency required, maximizing availability and performance.

Linearizability (strongest): Operations appear to execute instantaneously at some point between invocation and response. All clients see operations in the same order. Required for distributed locks and leader election.

Sequential consistency: All clients see operations in the same order, but that order need not match real-time order. Weaker than linearizable but sufficient for many applications.

Causal consistency: Operations that are causally related appear in the same order to all clients. Concurrent operations may appear in different orders. A good balance for many AI applications.

Eventual consistency (weakest): All replicas will eventually converge to the same value if updates stop. No guarantees about ordering in the meantime.

For AI systems, the key insight is that different data has different consistency requirements:

Data Type Typical Requirement Rationale
Model weights Eventual (with versioning) Updates are rare; versioning prevents conflicts
Inference results No consistency needed Stateless, ephemeral
User conversation history Causal Messages should appear in order
Feature values Depends on use case Real-time features need stronger guarantees
Model routing rules Sequential All traffic should see same routing
Billing/usage counters Eventual with reconciliation Accuracy matters but timing doesn’t

The Network as Unreliable Medium

Networks fail in more ways than you might expect, and AI systems must handle all of them:

Message loss: Packets disappear. Requests never arrive; responses never return. Your system must use timeouts and retries, but retries for expensive AI operations need careful thought.

Message delay: Variable latency makes it impossible to distinguish a slow response from a failed node. If your timeout is 5 seconds and inference takes 4 seconds, you’ll see spurious timeouts under load.

Message duplication: Retries can cause duplicate processing. For stateless inference, this wastes resources. For stateful operations (like charging for usage), it causes correctness problems.

Message reordering: TCP guarantees ordering within a connection, but complex systems have multiple paths. A user might see response B before response A if they were routed differently.

Partial failure: The hardest class of problems. A request might succeed on the server but fail during response transmission. Did the operation happen? You don’t know, and neither does the client.

For AI workloads, these network issues interact with the long processing times. Consider this scenario:

  1. Client sends inference request
  2. Server receives request, begins 10-second inference
  3. Network hiccups; client connection drops at second 5
  4. Server completes inference at second 10, tries to send response—connection gone
  5. Client’s timeout fires at second 5, it retries
  6. Server receives retry, begins another 10-second inference

You’ve now wasted 15 seconds of GPU time and the client is still waiting. Solutions include request deduplication (with unique request IDs), connection-independent result storage, and client polling rather than blocking.

Clocks and Time in Distributed Systems

There is no global clock in a distributed system. Each machine’s clock drifts independently. Even with NTP synchronization, clocks can differ by hundreds of milliseconds—enough to cause ordering issues.

For AI systems, clock skew matters for:

Request deadlines: If a client sets a 5-second deadline but the server’s clock is 500ms behind, the server might process a request the client has already abandoned.

Cache TTLs: A cache entry with a 5-minute TTL might be considered valid or expired depending on which server’s clock you ask.

Log correlation: Debugging distributed issues requires correlating logs across machines. Clock skew makes this difficult without logical timestamps.

Consistent ordering: If you need to determine which of two events happened first, clock timestamps are unreliable.

The solution is to use logical clocks (like Lamport timestamps or vector clocks) when ordering matters, and to build systems that don’t rely on synchronized time.

Common Mistake: Ignoring CAP Theorem Implications

What people do: Build distributed AI systems expecting both strong consistency (same embedding for the same query everywhere) and high availability (always respond, even during partitions).

Why it fails: CAP theorem says you can’t have both. During a network partition, you must choose: return potentially stale embeddings (availability) or reject queries until consistency is restored (consistency). Systems that don’t explicitly choose often behave unpredictably under failure.

Fix: Decide your consistency model upfront. For most AI applications, eventual consistency is acceptable—a slightly stale cache or older model version usually doesn’t matter. Document the consistency guarantees, design your application to tolerate the inconsistency you’ve chosen, and test behavior during partitions.


Why AI Systems Have Unique Scaling Challenges

Traditional web services scale horizontally: need more capacity? Add more servers. Load balancing is straightforward because requests are largely interchangeable and stateless. AI systems break these assumptions in fundamental ways.

The Memory Bandwidth Bottleneck

GPU compute power has grown exponentially, but memory bandwidth hasn’t kept pace. This creates a fundamental bottleneck for large language models.

A 70B parameter model in FP16 uses 140GB of memory. During inference, these parameters must be loaded from memory for each forward pass. At 3TB/s memory bandwidth (H100), loading all parameters takes about 47ms—the lower bound for inference latency, regardless of compute power.

This is why LLM inference is memory-bandwidth bound, not compute-bound. The GPU’s tensor cores sit idle waiting for data. Optimization strategies must focus on reducing memory transfers:

  • Batching: Amortize parameter loading across multiple requests
  • Quantization: Smaller datatypes (INT8, INT4) reduce memory bandwidth by 2-4x
  • KV caching: Don’t recompute what you can store
  • Model parallelism: Distribute across GPUs with faster interconnects

The Batching Paradox

Batching dramatically improves throughput but harms latency:

Batch Size Throughput (tokens/sec) Per-Request Latency
1 50 200ms
8 350 230ms
32 1000 320ms
128 2500 510ms

With batch size 1, you process 50 tokens/second but each request completes in 200ms. With batch size 128, you process 2500 tokens/second (50x throughput!) but each request takes 510ms.

This creates a fundamental tension:

  • Latency-sensitive applications (interactive chat) want small batches
  • Throughput-sensitive applications (batch embeddings) want large batches
  • Cost-sensitive deployments want high throughput (utilization)

The key insight is that you cannot optimize for all three. You must choose your priority and design accordingly:

Latency-Throughput-Cost Tradeoff

Latency-Throughput-Cost Tradeoff

Trade-off choices:

  • Interactive chat: Optimize latency and cost, accept lower throughput
  • Batch processing: Optimize throughput and cost, accept higher latency
  • Real-time at scale: Optimize latency and throughput, accept higher cost

State and the Statefulness Spectrum

AI systems exist on a spectrum of statefulness, and where you sit determines your scaling strategy:

Stateless inference: Each request is independent. Embeddings, single-turn classification. Scales horizontally without constraint.

Context-dependent inference: Each request depends on conversation history. The history must be provided (expensive) or cached (complex). Most LLM applications.

Session state: User preferences, personalization features. Can be stored externally and loaded per-request.

Model state: KV cache for long contexts, adapter weights per user. Must be co-located with inference or expensively transferred.

The more stateful your workload, the harder it is to scale horizontally. Stateless requests can go to any server; stateful requests need routing to specific servers (consistent hashing) or expensive state transfer.

# State management patterns for scalable AI inference
class InferenceStateStrategy:
    """Different strategies for managing inference state."""

    def __init__(self, state_type: str):
        self.strategies = {
            "stateless": {
                "description": "No state between requests",
                "routing": "Round-robin / least-connections",
                "scaling": "Add replicas freely",
                "examples": ["Embeddings", "Classification", "Single-turn QA"]
            },
            "externalized_context": {
                "description": "Context stored externally, loaded per-request",
                "routing": "Any server, load context from store",
                "scaling": "Add replicas freely, scale context store separately",
                "tradeoff": "Latency for context loading",
                "examples": ["RAG systems", "Document QA"]
            },
            "sticky_sessions": {
                "description": "Route same user to same server",
                "routing": "Consistent hashing on user_id",
                "scaling": "Careful rebalancing on scale events",
                "tradeoff": "Load imbalance possible",
                "examples": ["Multi-turn chat with KV cache"]
            },
            "replicated_state": {
                "description": "State replicated across servers",
                "routing": "Any server",
                "scaling": "Replication overhead limits scale",
                "tradeoff": "Memory and network overhead",
                "examples": ["Personalization features"]
            },
        }

The Cold Start Problem

AI inference has significant cold start costs that don’t exist for traditional services:

Model loading: A 70B model takes 30-60 seconds to load from disk to GPU. During this time, the server can’t serve requests.

JIT compilation: Frameworks like PyTorch compile CUDA kernels on first use. The first request is always slower.

KV cache warm-up: For efficient inference, the KV cache should be pre-allocated. This takes time and memory.

Connection establishment: HTTP/2 connections, gRPC channels, and database connections all take time to establish.

These cold start costs compound when autoscaling. If a traffic spike triggers scaling at 10:00:00, new servers might not be ready until 10:01:30. During that 90 seconds, existing servers are overwhelmed.

Mitigation strategies:

  • Keep warm pools: Maintain minimum replicas with models loaded
  • Predictive scaling: Scale up before anticipated traffic (batch jobs, marketing campaigns)
  • Progressive loading: Serve with smaller model while large model loads
  • Pre-warming: Run dummy inference to trigger JIT compilation

Scaling Patterns for AI Inference

With the foundations established, let’s examine concrete patterns for scaling AI inference systems.

Horizontal vs. Vertical Scaling: A Decision Framework

The classic scaling question takes on new dimensions for AI workloads:

Vertical Scaling (bigger machines):

  • Add more GPUs to a single node
  • Use faster GPUs (A10 -> H100)
  • Add more memory

Horizontal Scaling (more machines):

  • Add more inference servers
  • Distribute load across replicas
  • Add more regions

For AI systems, the decision depends on your bottleneck:

Scaling Decision Tree

Scaling Decision Tree

When to choose vertical scaling:

  • Model larger than single-GPU memory (you have no choice)
  • Latency is the primary constraint
  • Simpler operational model is valuable
  • Hardware costs less than engineering time

When to choose horizontal scaling:

  • Need availability guarantees (single machine is SPOF)
  • Throughput scales linearly with machines
  • Cost optimization via spot/preemptible instances
  • Already at hardware limits vertically

Most production systems use both: scale vertically to a reasonable machine size (e.g., 4-GPU node), then scale horizontally for capacity and availability.

Model Parallelism Strategies

When a model doesn’t fit on a single GPU, you must distribute it. Three main strategies exist:

Model Parallelism Strategies

Model Parallelism Strategies

Tensor Parallelism: Split individual layers across GPUs. Each GPU holds part of each weight matrix. Requires high-bandwidth interconnects (NVLink) because GPUs must communicate for every layer.

  • Pros: Low latency (all GPUs work simultaneously)
  • Cons: Requires fast interconnect, communication every layer

Pipeline Parallelism: Split layers across GPUs. Each GPU holds complete layers. GPUs communicate only at layer boundaries, reducing communication but creating pipeline bubbles.

  • Pros: Less communication, can use slower network
  • Cons: Pipeline bubbles (some GPUs idle), higher latency per request

Data Parallelism: Each GPU has complete model copy, different data. For inference, this is simply multiple replicas. For training, gradients must be synchronized.

  • Pros: Simple, linear throughput scaling
  • Cons: Each GPU needs enough memory for full model

Choosing a strategy:

Model Size GPU Memory Strategy
Fits on 1 GPU < 80GB Data parallelism (replicas)
Fits on 1 node (8 GPUs) 80-640GB Tensor parallelism within node
Requires multiple nodes > 640GB Pipeline parallelism across nodes, tensor within

Load Balancing for AI Workloads

Standard load balancing algorithms assume requests are roughly equivalent. AI requests are not—a request to generate 1000 tokens costs 50x more than a 20-token request.

Naive round-robin fails:

Server A: [100 tokens] [1000 tokens] [50 tokens]   = 1150 total
Server B: [30 tokens]  [25 tokens]   [20 tokens]   = 75 total

Server A is 15x more loaded despite equal request count!

Better algorithms for AI:

Least-tokens-in-progress: Route to the server with the fewest tokens currently being processed. Requires servers to report their current load.

class TokenAwareLoadBalancer:
    """Load balancer that considers request size."""

    def select_server(self, request: dict, server_stats: dict[str, dict]) -> str:
        """Select server based on estimated request cost."""
        estimated_tokens = self._estimate_tokens(request)

        candidates = []
        for server, stats in server_stats.items():
            if not stats['healthy']:
                continue

            # Score by available capacity (memory and compute)
            memory_available = stats['gpu_memory_free'] / stats['gpu_memory_total']
            tokens_in_flight = stats['tokens_in_progress']
            capacity_score = memory_available * 100 - tokens_in_flight / 10

            # Penalty if this request might not fit
            if estimated_tokens > stats['max_tokens_available']:
                capacity_score -= 1000

            candidates.append((server, capacity_score))

        if not candidates:
            raise NoHealthyServersError()

        return max(candidates, key=lambda x: x[1])[0]

Consistent hashing for stateful requests: When requests benefit from server affinity (KV cache reuse), use consistent hashing on session ID. But handle rebalancing carefully—when servers are added/removed, some sessions must migrate.

Two-tier routing: Route first by request type (streaming vs. batch), then by load within the tier. This prevents batch requests from starving interactive users.

Continuous Batching and PagedAttention

Two advances have transformed LLM serving efficiency:

Continuous batching (Orca, 2022): Instead of waiting for all requests in a batch to complete before starting new ones, continuously add new requests to the batch as old ones complete. This eliminates the “batch waiting” inefficiency.

Traditional batching:
Time 0: Start batch [A, B, C, D]
Time 2: A completes, but batch waits
Time 4: B, C complete, but batch waits
Time 6: D completes, batch released
Time 6: Start new batch [E, F, G, H]

Continuous batching:
Time 0: Start [A, B, C, D]
Time 2: A completes, immediately add E
Time 3: C completes, immediately add F
...
No idle time between requests!

PagedAttention (vLLM, 2023): Manage KV cache memory like virtual memory pages. Instead of pre-allocating maximum context length for each request, allocate pages on demand and reclaim when done. This dramatically improves memory efficiency.

Traditional KV cache:
Request A (needs 100 tokens): Allocates 2048 tokens (max context)
Request B (needs 200 tokens): Allocates 2048 tokens
Request C (needs 50 tokens):  Allocates 2048 tokens
Total allocated: 6144 tokens
Total used: 350 tokens
Efficiency: 5.7%

PagedAttention:
Request A (needs 100 tokens): Allocates 100 tokens
Request B (needs 200 tokens): Allocates 200 tokens
Request C (needs 50 tokens):  Allocates 50 tokens
Total allocated: 350 tokens
Total used: 350 tokens
Efficiency: 100%

These techniques are implemented in modern serving frameworks (vLLM, TGI, TensorRT-LLM) and can improve throughput by 2-5x over naive implementations.

Speculative Decoding

Autoregressive generation is inherently sequential: each token depends on all previous tokens. Speculative decoding breaks this limitation by using a smaller “draft” model to speculate multiple tokens, then verifying them in parallel with the large model.

Traditional generation (70B model):
Step 1: Generate token 1 (200ms)
Step 2: Generate token 2 (200ms)
Step 3: Generate token 3 (200ms)
...
100 tokens = 20 seconds

Speculative decoding:
Step 1: Draft model generates tokens 1-4 (20ms)
Step 2: Target model verifies tokens 1-4 in parallel (250ms)
        - Tokens 1-3 accepted, token 4 rejected
        - Resample token 4 from target model
Step 3: Draft model generates tokens 5-8 (20ms)
...
100 tokens with 75% acceptance = ~7 seconds (3x speedup)

The speedup depends on how well the draft model matches the target model. If they agree often (high acceptance rate), speculative decoding provides significant gains. If they disagree frequently, you’re doing extra work for little benefit.

Common Mistake: Premature Optimization

What people do: Start with speculative decoding, custom quantization, and multi-region architecture before validating basic functionality.

Why it fails: Each optimization adds complexity, debugging difficulty, and operational burden. Speculative decoding needs draft model tuning. Quantization can introduce quality regressions. Multi-region adds consistency challenges. You’re solving scaling problems you may never have.

Fix: Start simple: single region, API-based inference, standard batching. Measure actual bottlenecks under real load. Optimize the measured bottleneck, not the theoretical one. Most AI systems never grow large enough to need the sophisticated techniques discussed in this chapter. Simple and reliable beats complex and fast.


Multi-Region Architecture

Global AI systems must serve users across continents. This introduces challenges that don’t exist for single-region deployments.

Why Multi-Region is Hard for AI

Physics: Light travels at ~200km/ms in fiber. US East to US West is ~4000km, adding 40ms round-trip latency minimum. To Europe, ~120ms. To Asia, ~200ms. For interactive AI with 2-second generation time, this matters less than for millisecond APIs—but it still impacts user perception.

Model size: A 70B model is 140GB. Replicating to 5 regions means 700GB of cross-region transfer every model update. At $0.02/GB egress, that’s $14 per update—and you might update daily.

Consistency challenges: If a user in Europe and their collaborator in Asia both update a shared conversation, which version wins? With LLM systems, “merging” conversation histories is semantically complex.

GPU availability: Not all regions have the same GPU availability. H100s might be plentiful in US regions but scarce elsewhere. Your architecture must handle heterogeneous hardware.

Multi-Region Deployment Patterns

Pattern 1: Active-Passive (Single Primary)

One region handles all traffic; others are standby for disaster recovery.

Active-Passive Replication Pattern

Active-Passive Replication Pattern

Pros:

  • Simple consistency (one source of truth)
  • No conflict resolution needed
  • Lower cost (standby resources minimal)

Cons:

  • High latency for distant users
  • Failover is disruptive (minutes)
  • Standby resources often underutilized

Pattern 2: Active-Active (Multi-Primary)

All regions serve traffic; users route to nearest region.

Active-Active Multi-Primary Pattern

Active-Active Multi-Primary Pattern

Pros:

  • Low latency for all users
  • High availability (no single point of failure)
  • Better resource utilization

Cons:

  • Complex conflict resolution
  • Higher replication costs
  • Consistency challenges

Pattern 3: Follow-the-Sun / Read-Local-Write-Global

Reads served locally; writes route to primary region.

Read-Local, Write-Global Pattern

Read-Local, Write-Global Pattern

Data Sovereignty and Compliance

GDPR, CCPA, and other regulations may require data to stay within certain jurisdictions. For AI systems, this affects:

User data: Conversation history, preferences, uploaded documents must stay in-region.

Model inputs/outputs: Even inference requests may constitute personal data if they contain user information.

Model weights: Some models trained on sensitive data may have restrictions on where they can be deployed.

Audit logs: Records of AI system usage may need to remain in-jurisdiction.

Architectural implications:

class DataSovereigntyRouter:
    """Route requests based on data residency requirements."""

    def __init__(self, region_capabilities: dict):
        self.regions = region_capabilities

    def route_request(self, request: dict) -> str:
        """Determine which region should handle this request."""
        user_jurisdiction = request.get('user_jurisdiction')
        data_classification = request.get('data_classification', 'standard')

        # Strict residency: must stay in specific regions
        if data_classification == 'pii_eu':
            eligible_regions = [r for r, caps in self.regions.items()
                              if 'gdpr_compliant' in caps]
        elif data_classification == 'pii_cn':
            eligible_regions = [r for r, caps in self.regions.items()
                              if 'china_data_resident' in caps]
        else:
            eligible_regions = list(self.regions.keys())

        # From eligible regions, pick lowest latency
        return self._select_lowest_latency(eligible_regions, user_jurisdiction)

Cross-Region Data Replication

Replicating AI system data across regions requires careful design:

Model weights: Large and immutable per version. Replicate asynchronously; use content-addressable storage. Include checksums to detect corruption.

User conversation history: Append-only is easier to replicate. If users can edit/delete, need conflict resolution.

Feature store data: Depends on feature freshness requirements. Real-time features may need synchronous replication (expensive) or region-local computation.

Cache data: Generally don’t replicate caches. Each region warms its own cache. On failover, accept cold start.

class CrossRegionReplicator:
    """Replicate data across regions with appropriate consistency."""

    async def replicate(self, data_type: str, key: str, value: Any) -> dict:
        """Replicate based on data type's consistency requirements."""

        if data_type == 'model_weights':
            # Async replication with verification
            return await self._replicate_async_verified(key, value)

        elif data_type == 'user_conversation':
            # Causal consistency via vector clocks
            return await self._replicate_causal(key, value)

        elif data_type == 'feature_realtime':
            # Synchronous to primary + one secondary
            return await self._replicate_sync_quorum(key, value)

        elif data_type == 'cache':
            # Don't replicate, local only
            return {'replicated': False, 'reason': 'cache_local_only'}

Identifying and Eliminating Bottlenecks

Effective system design requires understanding where bottlenecks will emerge before they manifest. This section builds intuition for bottleneck analysis.

The Theory of Constraints Applied to AI Systems

Every system has exactly one constraint that limits throughput. Optimizing non-constraints provides zero benefit. This simple insight, from Goldratt’s “The Goal,” transforms how you approach scaling.

Finding the constraint:

  1. Measure utilization of each component under load
  2. The component at highest utilization is likely the constraint
  3. Improvements to other components won’t help until you address the constraint

For AI systems, common constraints:

Constraint Analysis Framework:

1. GPU Compute
   Symptoms: GPU utilization near 100%, scaling GPUs improves throughput
   Solutions: Batching, quantization, faster GPUs, model optimization

2. GPU Memory
   Symptoms: OOM errors, limited batch size, can't fit longer contexts
   Solutions: Quantization, PagedAttention, offloading, model parallelism

3. Memory Bandwidth
   Symptoms: GPU utilization low despite compute availability
   Solutions: Batching, quantization, better memory access patterns

4. Network (for distributed)
   Symptoms: Latency increases with model parallelism degree
   Solutions: Better interconnects, reduce communication, overlap compute

5. CPU Preprocessing
   Symptoms: GPUs idle waiting for batches, tokenization slow
   Solutions: Async preprocessing, faster tokenizers, preprocessing tier

6. I/O (model loading, data)
   Symptoms: Cold starts slow, batch processing I/O bound
   Solutions: Caching, faster storage, prefetching

Worked example:

System: 8x A100 inference cluster
Observation: 1000 requests/second max, want 2000

Measurements:

- GPU utilization: 65%
- GPU memory: 45% used
- Network: 10% bandwidth
- CPU: 85% utilization
- Tokenization: 40ms/request

Analysis:
CPU at 85% is likely constraint. Tokenization at 40ms means
CPU can handle 1/0.04 = 25 requests/sec per core.
With 40 cores, max 1000 req/sec matches observation.

GPUs are underutilized because CPU can't feed them fast enough.

Solution: Add tokenization tier, use async processing,
or use faster tokenizer (HuggingFace Rust tokenizers are 10x faster)

After fix, GPU becomes new constraint at 65% utilization.
Now can improve batching or add GPUs.

Little’s Law for Capacity Planning

Little’s Law states: L = lambda * W

Where:

  • L = average number of requests in system
  • lambda = arrival rate (requests/second)
  • W = average time in system (latency)

Rearranging: lambda = L / W

This tells you the throughput given your concurrency and latency.

Applied to AI inference:

Given:

- 8 GPUs
- Each GPU handles 1 request at a time (no batching)
- Average inference time: 2 seconds

L = 8 (8 GPUs, each processing 1 request)
W = 2 seconds

lambda = L / W = 8 / 2 = 4 requests/second

With batching (batch size 8):
L = 8 GPUs * 8 requests/batch = 64 concurrent
W = 2.5 seconds (slightly higher with batching)

lambda = 64 / 2.5 = 25.6 requests/second

Batching increased throughput 6.4x!

For capacity planning:

Want to serve: 1000 requests/minute = 16.7 requests/second
Average latency target: 3 seconds

L = lambda * W = 16.7 * 3 = 50 concurrent requests needed

If each GPU handles batch size 8:
GPUs needed = 50 / 8 = 6.25 -> 7 GPUs minimum

Add 50% headroom for traffic spikes: 11 GPUs

Amdahl’s Law and Parallelization Limits

Amdahl’s Law quantifies the speedup possible from parallelization:

Speedup = 1 / (S + P/N)

Where:

  • S = serial fraction (cannot be parallelized)
  • P = parallel fraction (1 - S)
  • N = number of parallel workers

Applied to AI systems:

LLM autoregressive generation is inherently serial—each token depends on all previous tokens. If generation is 90% of latency:

S = 0.90 (serial - generation)
P = 0.10 (parallel - preprocessing, networking)
N = 100 (lots of parallel workers)

Speedup = 1 / (0.90 + 0.10/100) = 1 / 0.901 = 1.11x

Even with infinite parallel workers, max speedup = 1/0.90 = 1.11x!

This explains why:

  • Horizontal scaling helps throughput but not latency
  • Speculative decoding attacks the serial fraction
  • Faster GPUs help latency; more GPUs help throughput

Queuing Theory Intuitions

Requests arriving at a system form queues. Understanding queuing behavior helps predict latency under load.

Key insight: Latency explodes as utilization approaches 100%.

For M/M/1 queue (random arrivals, random service, 1 server):

  • Average wait time = (utilization) / (1 - utilization) * service_time

At 50% utilization: wait = 0.5/0.5 * service = 1x service time At 80% utilization: wait = 0.8/0.2 * service = 4x service time At 90% utilization: wait = 0.9/0.1 * service = 9x service time At 95% utilization: wait = 0.95/0.05 * service = 19x service time

Practical implications:

  • Don’t run systems at > 70-80% utilization
  • The “last 10%” of capacity costs disproportionate latency
  • Autoscale triggers should fire well before saturation
  • Headroom for traffic spikes is essential
Capacity Planning Rule of Thumb:

Sustainable capacity = Peak capacity * 0.7

If you need to handle 1000 req/sec peak,
provision for 1000 / 0.7 = 1430 req/sec capacity.

For AI with expensive cold starts, be more conservative:
Sustainable capacity = Peak capacity * 0.5

Designing for Failure

Large-scale systems experience constant failures. The question isn’t whether components will fail—they will—but how the system behaves when they do.

Failure Mode Analysis

Before building, enumerate failure modes and their consequences:

Failure Mode and Effects Analysis (FMEA) for AI Inference:

Component Failure Mode Effect Severity Mitigation
Model Server Process crash Requests fail, timeout High Health checks, auto-restart
Model Server GPU OOM Single request fails Medium Memory limits, admission ctrl
Model Server Model corrupt Wrong/garbage output Critical Checksums, output validation
Load Balancer Process crash All traffic fails Critical Redundant LBs, health checks
Load Balancer Misrouting Wrong model version served High Canary detection, rollback
Feature Store Unavailable Features missing High Fallback features, cache
Feature Store Stale data Outdated predictions Medium Freshness alerts, TTL
Cache Unavailable Higher latency, load on GPUs Medium Cache tiers, graceful degrade
Network Partition Subset of users affected High Multi-path, failover
Network High latency Timeout, poor UX Medium Adaptive timeouts, shed load

Circuit Breakers

When a dependency fails, retrying makes things worse—you overwhelm a recovering service. Circuit breakers prevent cascade failures:

Circuit Breaker State Machine

Circuit Breaker State Machine
class CircuitBreaker:
    """Prevent cascade failures by failing fast."""

    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: float = 30.0,
                 half_open_max_calls: int = 3):
        self.state = "CLOSED"
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.half_open_successes = 0
        self.half_open_max_calls = half_open_max_calls

    async def call(self, func, *args, fallback=None, **kwargs):
        """Execute function with circuit breaker protection."""
        if self.state == "OPEN":
            if self._should_attempt_reset():
                self.state = "HALF_OPEN"
                self.half_open_successes = 0
            elif fallback:
                return await fallback(*args, **kwargs)
            else:
                raise CircuitOpenError(f"Circuit open, retry after {self._time_until_retry():.1f}s")

        try:
            result = await func(*args, **kwargs)
            self._record_success()
            return result
        except Exception as e:
            self._record_failure()
            if fallback:
                return await fallback(*args, **kwargs)
            raise

    def _record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

    def _record_success(self):
        if self.state == "HALF_OPEN":
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max_calls:
                self.state = "CLOSED"
                self.failure_count = 0
        elif self.state == "CLOSED":
            self.failure_count = max(0, self.failure_count - 1)  # Gradual recovery

Graceful Degradation Levels

Define degradation levels before you need them:

Degradation Levels for AI Chat Application:

LEVEL 0 - NORMAL
  All features available
  Full model, full context window
  Personalization enabled

LEVEL 1 - MINOR DEGRADATION
  Trigger: Feature store p99 > 500ms OR GPU utilization > 85%
  Actions:
    - Disable non-essential features (personalization)
    - Reduce context window (8K -> 4K)
    - Enable aggressive caching

LEVEL 2 - MAJOR DEGRADATION
  Trigger: Model server errors > 5% OR Feature store unavailable
  Actions:
    - Switch to smaller/faster model
    - Use cached responses where possible
    - Reduce max output tokens
    - Show "high demand" message to users

LEVEL 3 - EMERGENCY
  Trigger: Model servers unavailable > 50% OR cascading failures
  Actions:
    - Static fallback responses for common queries
    - Queue non-urgent requests
    - Reject new users, serve existing sessions
    - Page on-call engineering

LEVEL 4 - MAINTENANCE MODE
  Trigger: Manual activation OR unrecoverable failure
  Actions:
    - Return maintenance page
    - Log all requests for replay
    - All hands on deck
class GracefulDegradation:
    """Manage system degradation based on health signals."""

    def __init__(self, config: dict):
        self.config = config
        self.current_level = 0
        self.level_actions = self._build_level_actions()

    async def evaluate(self, metrics: dict) -> int:
        """Evaluate current metrics and adjust degradation level."""
        if metrics['model_error_rate'] > 0.5 or metrics['cascade_detected']:
            target_level = 3
        elif metrics['model_error_rate'] > 0.05 or not metrics['feature_store_healthy']:
            target_level = 2
        elif metrics['gpu_utilization'] > 0.85 or metrics['feature_store_p99'] > 500:
            target_level = 1
        else:
            target_level = 0

        if target_level != self.current_level:
            await self._transition(self.current_level, target_level)
            self.current_level = target_level

        return self.current_level

    async def _transition(self, from_level: int, to_level: int):
        """Execute transition actions."""
        if to_level > from_level:
            # Degrading: execute in order
            for level in range(from_level + 1, to_level + 1):
                for action in self.level_actions[level]['enter']:
                    await action()
        else:
            # Recovering: execute in reverse
            for level in range(from_level, to_level, -1):
                for action in self.level_actions[level]['exit']:
                    await action()

Timeout and Retry Strategies

Timeouts and retries seem simple but are surprisingly subtle for AI workloads:

Timeout considerations:

  • LLM inference can legitimately take 10-60 seconds
  • Setting timeout too low causes spurious failures
  • Setting timeout too high wastes resources on truly failed requests
  • Different request types need different timeouts
class AdaptiveTimeout:
    """Adjust timeouts based on request characteristics and system state."""

    def calculate_timeout(self, request: dict, system_metrics: dict) -> float:
        """Calculate appropriate timeout for this request."""
        # Base timeout from request characteristics
        estimated_tokens = request.get('max_tokens', 512)
        tokens_per_second = system_metrics.get('current_tokens_per_second', 50)
        base_timeout = estimated_tokens / tokens_per_second + 2.0  # +2s buffer

        # Adjust for system load
        queue_depth = system_metrics.get('queue_depth', 0)
        load_factor = 1 + (queue_depth / 100)  # Add 1% per queued request

        # Adjust for observed latency
        p99_latency = system_metrics.get('p99_latency', 5.0)
        latency_factor = max(1.0, p99_latency / 5.0)

        adjusted_timeout = base_timeout * load_factor * latency_factor

        # Clamp to reasonable bounds
        return min(max(adjusted_timeout, 5.0), 120.0)

Retry considerations:

  • Retrying expensive AI operations wastes GPU resources
  • Must distinguish retryable (network timeout) from non-retryable (invalid input)
  • Exponential backoff prevents thundering herd
  • Budget-based retries prevent infinite retry loops
class AIRetryStrategy:
    """Retry strategy aware of AI workload costs."""

    def __init__(self, max_retries: int = 3, base_delay: float = 1.0):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.cost_per_retry = {}  # Track retry costs

    def should_retry(self, error: Exception, attempt: int,
                     request: dict) -> tuple[bool, float]:
        """Determine if we should retry and with what delay."""
        if attempt >= self.max_retries:
            return False, 0

        # Non-retryable errors
        if isinstance(error, (InvalidInputError, AuthenticationError)):
            return False, 0

        # Expensive requests get fewer retries
        estimated_cost = self._estimate_cost(request)
        if estimated_cost > 0.01 and attempt >= 1:  # $0.01 threshold
            return False, 0

        # Retryable errors with exponential backoff
        if isinstance(error, (TimeoutError, ConnectionError, ServiceUnavailableError)):
            delay = self.base_delay * (2 ** attempt) + random.uniform(0, 1)
            return True, delay

        # Unknown errors: retry once with long delay
        if attempt < 1:
            return True, 5.0

        return False, 0
Common Mistake: No Chaos Testing

What people do: Build failure handling (retries, circuit breakers, fallbacks) and assume it works because the code looks correct.

Why it fails: Failure handling code is the least-tested code in your system—it runs when things go wrong, which is rare in development. Untested failure paths often don’t work: retries that amplify failures, circuit breakers that never trip, fallbacks that have their own bugs.

Fix: Regularly inject failures in production (or production-like environments). Kill random pods, inject latency, corrupt responses, exhaust memory. Start small (one failure type, limited blast radius) and expand. Use chaos engineering tools (Chaos Monkey, LitmusChaos) for automation. The goal isn’t to cause outages—it’s to find weaknesses before real failures do.


Operational Excellence

A well-designed system that’s impossible to operate is not a well-designed system. Operations must be a first-class concern.

Observability: Metrics, Logs, and Traces

Metrics provide quantitative measurements over time. For AI systems, track:

Essential AI System Metrics:

Request Metrics (RED):

- Rate: requests per second, by model/endpoint
- Errors: error rate, by error type
- Duration: latency p50/p95/p99

Resource Metrics (USE):

- GPU Utilization: percent, per GPU
- GPU Memory: used/free/fragmentation
- CPU Utilization: for preprocessing
- Network: bandwidth, connections

AI-Specific Metrics:

- Tokens per second: throughput
- Tokens per request: distribution
- Queue depth: pending requests
- Batch size: actual vs. configured
- Cache hit rate: L1/L2/semantic
- Model load time: cold start latency
- KV cache utilization: memory efficiency

Logs provide context for debugging. Structure them for searchability:

import structlog

logger = structlog.get_logger()

def log_inference_request(request_id: str, request: dict, result: dict,
                          timing: dict, error: Exception = None):
    """Structured logging for inference requests."""
    logger.info(
        "inference_complete",
        request_id=request_id,
        model=request.get('model'),
        input_tokens=result.get('input_tokens'),
        output_tokens=result.get('output_tokens'),
        queue_time_ms=timing.get('queue_time') * 1000,
        inference_time_ms=timing.get('inference_time') * 1000,
        total_time_ms=timing.get('total_time') * 1000,
        cache_hit=result.get('cache_hit'),
        error_type=type(error).__name__ if error else None,
        error_message=str(error) if error else None,
    )

Traces follow requests across services. Essential for debugging distributed issues:

from opentelemetry import trace

tracer = trace.get_tracer("ai-inference")

async def inference_with_tracing(request: dict) -> dict:
    """Trace a request through the inference pipeline."""
    with tracer.start_as_current_span("inference") as span:
        span.set_attribute("model", request.get('model'))
        span.set_attribute("max_tokens", request.get('max_tokens'))

        with tracer.start_span("preprocess"):
            processed = await preprocess(request)

        with tracer.start_span("feature_fetch"):
            features = await fetch_features(request)

        with tracer.start_span("model_inference"):
            result = await run_inference(processed, features)

        with tracer.start_span("postprocess"):
            final = await postprocess(result)

        span.set_attribute("output_tokens", final.get('token_count'))
        return final

Alerting Philosophy

Good alerting is an art. Too few alerts and you miss problems; too many and you get alert fatigue.

Alert on symptoms, not causes: Alert on “user-facing error rate > 1%” rather than “GPU memory > 90%”. High GPU memory might be fine; user errors are never fine.

Every alert should be actionable: If there’s no action to take, it’s not an alert—it’s a log. Define the runbook before enabling the alert.

Alert hierarchy:

Critical (page immediately):

- User-facing error rate > 5% for 5 minutes
- Complete service outage
- Data corruption detected
- Security incident indicators

Warning (page during business hours):

- User-facing error rate > 1% for 15 minutes
- Latency p99 > 2x normal for 30 minutes
- GPU utilization > 90% for 1 hour
- Approaching capacity limits

Info (notify slack, no page):

- Unusual traffic patterns
- Elevated cache miss rate
- Single server unhealthy

Capacity Planning

Plan capacity before you need it:

class CapacityPlanner:
    """Project capacity needs based on growth trends."""

    def project_capacity_needs(self, historical_usage: list[dict],
                               months_ahead: int = 6) -> dict:
        """Project future capacity needs."""
        # Fit growth curve to historical data
        growth_rate = self._calculate_growth_rate(historical_usage)

        projections = {}
        current = historical_usage[-1]

        for month in range(1, months_ahead + 1):
            projected = {}
            for resource, current_usage in current.items():
                projected_usage = current_usage * ((1 + growth_rate) ** month)
                projected[resource] = {
                    'usage': projected_usage,
                    'utilization': projected_usage / self.capacity[resource],
                    'at_risk': projected_usage / self.capacity[resource] > 0.7
                }
            projections[f'month_{month}'] = projected

        return projections

    def recommend_scaling(self, projections: dict) -> list[dict]:
        """Recommend scaling actions based on projections."""
        recommendations = []

        for month, resources in projections.items():
            for resource, data in resources.items():
                if data['at_risk']:
                    recommendations.append({
                        'timeframe': month,
                        'resource': resource,
                        'current_capacity': self.capacity[resource],
                        'projected_need': data['usage'] * 1.5,  # 50% headroom
                        'action': f"Scale {resource} by {(data['usage'] * 1.5 / self.capacity[resource] - 1) * 100:.0f}%"
                    })

        return recommendations

Runbook Essentials

Every alert needs a runbook. Every runbook needs these elements:

# Runbook: High Inference Latency

## Summary
P99 inference latency exceeds threshold (>5 seconds for >5 minutes)

## Impact
Users experience slow responses. May lead to timeouts and errors.

## Diagnosis Steps

### 1. Check Dashboard
- Link: [Inference Latency Dashboard](...)
- Look for: Which models affected? All or specific?

### 2. Check GPU Utilization
```bash
kubectl exec -it inference-server-0 -- nvidia-smi
```
- If >95%: Likely resource saturation. Go to Mitigation Option A.
- If <50%: Likely preprocessing bottleneck. Go to Mitigation Option B.

### 3. Check Queue Depth
```bash
curl http://inference-api:8080/metrics | grep queue_depth
```
- If >100: Traffic spike. Go to Mitigation Option C.

## Mitigation Options

### Option A: Resource Saturation
1. Scale up inference replicas: `kubectl scale deployment inference --replicas=+2`
2. Enable aggressive caching: `kubectl set env deployment/inference CACHE_AGGRESSIVE=true`
3. Reduce max tokens: `kubectl set env deployment/inference MAX_TOKENS=256`

### Option B: Preprocessing Bottleneck
1. Check tokenization service: `kubectl logs -l app=tokenizer`
2. Restart if stuck: `kubectl rollout restart deployment/tokenizer`
3. Scale tokenizers: `kubectl scale deployment tokenizer --replicas=+2`

### Option C: Traffic Spike
1. Enable rate limiting: `kubectl apply -f rate-limit-strict.yaml`
2. Scale inference: see Option A step 1
3. If sustained, engage capacity planning

## Escalation
If not resolved in 30 minutes, page senior on-call.

Case Study: How Companies Architect for Scale

Learning from real-world systems provides insights that theoretical analysis cannot. This section examines patterns from companies that have solved AI scaling challenges.

Pattern: Prefill-Decode Disaggregation

The problem: LLM inference has two phases with different characteristics:

  • Prefill: Process all input tokens (compute-bound, parallelizable)
  • Decode: Generate output tokens one by one (memory-bound, sequential)

Running both on the same hardware means neither is optimized.

The solution: Separate prefill and decode to different hardware.

Prefill-Decode Separation Architecture

Prefill-Decode Separation Architecture

Why it works: Prefill benefits from high compute density. Decode benefits from high memory bandwidth. Different GPU configurations optimize each. The orchestrator routes appropriately.

Trade-off: Increased system complexity. KV cache transfer adds latency. Only worthwhile at significant scale.

Pattern: Semantic Caching at Scale

The problem: Users often ask similar questions. Computing the same response repeatedly wastes expensive GPU time.

The solution: Cache responses based on semantic similarity, not just exact match.

Semantic Caching Architecture

Semantic Caching Architecture

Implementation considerations:

  • Similarity threshold affects hit rate vs. accuracy trade-off
  • Cache invalidation for stale information
  • Handle personalization (cache per-user or use placeholders)
  • Multi-tenant isolation
class SemanticCache:
    """Cache with semantic similarity matching."""

    def __init__(self, embedding_model, vector_store,
                 similarity_threshold: float = 0.95):
        self.embedder = embedding_model
        self.store = vector_store
        self.threshold = similarity_threshold

    async def get(self, query: str, context: dict = None) -> Optional[dict]:
        """Check cache for semantically similar query."""
        query_embedding = await self.embedder.embed(query)

        # Search for similar cached queries
        results = await self.store.search(
            vector=query_embedding,
            top_k=1,
            filter=self._build_filter(context)  # Tenant, user, etc.
        )

        if results and results[0].score >= self.threshold:
            # Check freshness
            cached = results[0].metadata
            if self._is_fresh(cached):
                return {
                    'response': cached['response'],
                    'cache_hit': True,
                    'similarity': results[0].score,
                    'cache_age_seconds': time.time() - cached['timestamp']
                }

        return None

    async def put(self, query: str, response: str, context: dict = None):
        """Store response in cache."""
        query_embedding = await self.embedder.embed(query)
        await self.store.upsert(
            id=self._generate_id(query, context),
            vector=query_embedding,
            metadata={
                'query': query,
                'response': response,
                'timestamp': time.time(),
                'context': context
            }
        )

Pattern: Model Cascading

The problem: Not all requests need the largest, most capable model. Using GPT-5-class models for simple questions wastes resources.

The solution: Route requests to appropriately-sized models based on complexity.

Model Cascade Architecture

Model Cascade Architecture

Implementation:

class ModelCascade:
    """Route requests to appropriately-sized models."""

    def __init__(self, models: list[dict], classifier):
        self.models = sorted(models, key=lambda m: m['cost'])
        self.classifier = classifier

    async def inference(self, request: dict) -> dict:
        """Route to appropriate model, retry if needed."""
        complexity = await self.classifier.estimate_complexity(request)
        starting_tier = self._select_tier(complexity)

        for tier in range(starting_tier, len(self.models)):
            model = self.models[tier]
            result = await model['client'].inference(request)

            if await self._validate_response(result, request):
                return {
                    **result,
                    'model_used': model['name'],
                    'model_tier': tier,
                    'estimated_cost': model['cost'] * result['tokens']
                }

            # Response inadequate, try larger model
            continue

        # All models failed validation
        return await self._handle_cascade_failure(request)

Pattern: Regional Sharding with Global Routing

The problem: Global users need low latency, but maintaining identical copies of everything in all regions is expensive and complex.

The solution: Shard data by user region; route globally but process locally.

Regional Sharding with Global Routing

Regional Sharding with Global Routing

Data Placement:

  • Model weights: Replicated to all regions (immutable, versioned)
  • User data: Sharded by user’s home region
  • Conversation history: Lives in user’s region
  • Global metadata: Replicated with eventual consistency

Handling cross-region users:

  • Traveling user: Route to nearest region, fetch data from home region
  • Cross-region collaboration: Route to data owner’s region
  • Emergency failover: Full data copy in standby region (async, potentially stale)

Key Takeaways

  1. CAP theorem forces real choices - During network partitions, you must choose between consistency and availability. Different data types warrant different choices: user sessions favor availability; billing favors consistency.

  2. AI systems have unique scaling challenges - Memory bandwidth bottlenecks before compute. Batching creates latency-throughput tension. Cold starts are expensive. State management complexity increases with scale.

  3. Match scaling strategy to your bottleneck - Horizontal scaling improves throughput but not single-request latency. Vertical scaling has hardware limits. Model parallelism enables larger models but adds complexity.

  4. Design for failure from the start - Circuit breakers prevent cascade failures. Graceful degradation maintains partial service. Timeouts and retries need careful calibration for expensive AI operations.

  5. Operational excellence enables sustainability - Observability (metrics, logs, traces) is non-negotiable. Alerting should be actionable. Capacity planning must precede growth. Scale changes everything.


Summary

System design at scale is fundamentally about understanding constraints and making principled trade-offs. This chapter covered the essential concepts:

Distributed systems foundations constrain what’s possible. The CAP theorem forces a choice between consistency and availability during partitions. Consistency models exist on a spectrum—choose the minimum required. Networks fail in complex ways; design for it.

AI systems have unique challenges that traditional web services don’t face. Memory bandwidth often bottlenecks before compute. Batching creates a fundamental tension between latency and throughput. State management complexity increases with scale. Cold starts are expensive and must be planned for.

Scaling patterns must match your constraints. Horizontal scaling improves throughput and availability but not single-request latency. Vertical scaling has hardware limits but simpler operations. Model parallelism enables larger models but adds complexity. Choose based on your actual bottleneck.

Multi-region architecture is necessary for global systems but hard for AI. Consider active-active vs. active-passive based on your consistency needs. Data sovereignty requirements may constrain your options. Plan for cross-region replication costs.

Bottleneck analysis should precede optimization. Use the theory of constraints to find the single limiting factor. Little’s Law and queuing theory provide mathematical foundations. Amdahl’s Law explains why some optimizations provide limited benefit.

Designing for failure assumes components will fail. Circuit breakers prevent cascade failures. Graceful degradation maintains partial service. Timeouts and retries need careful calibration for expensive AI operations.

Operational excellence enables sustainable systems. Observability (metrics, logs, traces) is non-negotiable. Alerting should be actionable and hierarchical. Capacity planning must precede growth. Runbooks enable consistent incident response.

The most important insight is this: scale changes everything. Patterns that work at small scale fail at large scale. Assumptions that hold for hundreds of users shatter at millions. The best time to design for scale is before you need it—because by the time you need it, you won’t have time to redesign.


Practical Exercises

Exercise 1: Bottleneck Analysis

Given a system with these metrics under load:

  • GPU utilization: 65%
  • GPU memory: 80% used
  • CPU utilization: 30%
  • Network bandwidth: 15%
  • Queue depth: 200 requests
  • Latency p99: 8 seconds (target: 3 seconds)

Identify the bottleneck and propose solutions. Consider: 1. What is the most likely constraint? 2. What additional metrics would you gather? 3. What optimizations would you try first?

Exercise 2: CAP Trade-off Design

Design the data architecture for a multi-region AI chat application serving users globally. Document: 1. What data needs to be stored (conversations, preferences, features) 2. CAP choice for each data type with justification 3. Replication strategy for each 4. Behavior during a region partition

Exercise 3: Failure Mode Analysis

Perform FMEA for an AI system with:

  • Load balancer
  • 10 inference servers with GPUs
  • Feature store (external service)
  • Result cache (Redis cluster)
  • Model registry

For each component, identify: 1. Possible failure modes 2. Impact on users 3. Mitigation strategy 4. Monitoring/detection approach

Exercise 4: Capacity Planning

Your AI service currently handles:

  • 10,000 requests/hour average
  • 20,000 requests/hour peak
  • 8 GPUs at 60% average utilization
  • 3-second average latency

Plan for 10x growth over 12 months: 1. Project monthly capacity needs 2. Identify when scaling is needed (assume 80% max utilization) 3. Recommend scaling approach (horizontal, vertical, or both) 4. Estimate cost implications


Self-Assessment Checkpoint

Conceptual Questions

Q1. [Senior] Explain the CAP theorem and what “choosing two of three” actually means in practice for AI systems.

Answer CAP states that during a network partition, you must choose between consistency (all nodes see the same data) and availability (every request gets a response). “Choose two” is misleading—you always want C and A when there’s no partition. The real choice is: when a partition happens, do you fail requests (choose C) or serve potentially stale data (choose A)? For AI systems: (1) User preferences/sessions—usually choose A (serve with stale data rather than fail). (2) Feature store data—often choose A (stale features better than no features). (3) Model versioning—usually choose C (wrong model version could be dangerous). (4) Billing/quotas—choose C (over-serving is costly). Most systems need different CAP choices for different data types.

Q2. [Senior] What’s the difference between horizontal and vertical scaling? When would you choose each for an LLM serving system?

Answer Horizontal: Add more machines (scale out). Vertical: Add more resources to existing machines (scale up). For LLM serving: Vertical scaling first for model size—if model needs 40GB VRAM, you need a GPU with 40GB+. Can’t split a model across machines without significant complexity (tensor parallelism). Horizontal scaling for throughput—once you have GPUs big enough for the model, add more GPUs/machines to handle more concurrent requests. Hybrid is common: vertical to meet per-model requirements, horizontal for capacity. Limits: Vertical has ceiling (biggest available GPU). Horizontal adds coordination overhead, network latency, state synchronization complexity.

Q3. [Staff] Describe the “tail at scale” problem and techniques to mitigate it in AI inference systems.

Answer Problem: At scale, even rare slow requests compound. If each server has 1% chance of being slow, a request touching 100 servers has 63% chance of hitting at least one slow server (1 - 0.99^100). P99 latency becomes typical latency at scale. Mitigation: (1) Hedged requests—send request to multiple servers, use first response. (2) Tied requests—queue at multiple servers, cancel others when one starts. (3) Canary requests—test with small probe before full request. (4) Request deadlines—kill requests that can’t meet SLA anyway. (5) Isolation—prevent slow requests from blocking fast ones (separate queues by predicted latency). (6) Reduce variability sources—consistent hardware, avoid GC pauses, prewarming. For LLM inference: Variable generation length is major tail source. Consider: Batching strategy affects tail (waiting for batch vs. immediate processing).

Q4. [Staff] You’re designing a multi-region AI application. What are the key architectural decisions around data placement and request routing?

Answer Data placement decisions: (1) User data—usually regional (data residency requirements, latency). (2) Model weights—replicated globally (read-only, large, can lag). (3) Feature stores—depends on freshness requirements; often eventual consistency with regional caching. (4) Conversation history—usually regional for latency, global for users who travel. Routing decisions: (1) Latency-based—route to nearest healthy region. (2) Data locality—route to where user data resides. (3) Cost-based—route to cheaper regions when latency tolerance allows. (4) Capacity-based—overflow to other regions during peak. Complexity: Cross-region requests (user in Region A, data in Region B) add latency. Either accept latency, replicate data, or design for locality. Consider: Failover strategy when region fails—where does traffic go? Is data available there?

Q5. [Staff] How do you approach capacity planning for a system that’s growing 10x over the next year? What are the pitfalls?

Answer Approach: (1) Baseline current capacity—throughput per GPU, memory per request, storage growth rate. (2) Project growth curve—is it linear, exponential, seasonal? (3) Calculate capacity needs at checkpoints (monthly/quarterly). (4) Add buffer—target 70-80% utilization, not 100%. (5) Account for operational overhead—deployment capacity, testing, incidents. (6) Consider lead times—GPU procurement can take months. Pitfalls: (1) Assuming linear scaling—10x traffic doesn’t mean exactly 10x servers (coordination overhead grows). (2) Ignoring dependent systems—database, network, storage may bottleneck before compute. (3) Under-estimating growth spikes—10x average might mean 50x peak. (4) Over-provisioning too early—burns money. (5) Forgetting about new features—roadmap may add load beyond organic growth. (6) Missing cost projections—can you afford 10x?

Spot the Problem

Problem 1. [Senior] Architecture description:

"For high availability, we run three replicas of our inference service.
If one fails, the load balancer routes to the others."
Answer Incomplete HA strategy: (1) Three replicas where?—Same availability zone? Same datacenter? If so, zone/datacenter failure takes all three. (2) How does LB detect failure?—Health checks? What’s the detection delay? (3) What about cascading failure?—If one fails, others get 50% more load. Can they handle it? (4) State handling?—If service has state (sessions, KV cache), is it replicated? (5) What about dependencies?—Model registry, feature store—are they also HA? Proper HA includes: Multi-AZ or multi-region deployment, health checks with appropriate timeouts, capacity to handle N-1 failures, dependency HA, and tested failover procedures.

Problem 2. [Staff] Scaling strategy:

"Our inference system auto-scales based on CPU utilization.
When CPU > 70%, we add instances."
Answer Wrong metric for LLM inference: (1) LLM inference is typically GPU-bound and memory-bandwidth bound, not CPU-bound. CPU utilization may be low while GPUs are saturated. (2) Should scale on: GPU utilization, request queue depth, latency percentiles. (3) Reactive scaling has lag—by the time utilization triggers scaling, latency may already be bad. Consider predictive scaling based on time-of-day patterns. (4) Scale-up time—how long to provision GPU instances? May need pre-provisioned warm pool. Better: Scale on queue depth or latency percentile. Maintain warm capacity. Use predictive scaling for known patterns.

Problem 3. [Staff] Multi-region design:

"We deployed our AI service in three regions for global coverage.
Users are routed to the nearest region. Each region has its own database."
Answer Issues: (1) Nearest region may not have user’s data—if user traveled, their data might be in a different region. (2) How is data synchronized?—If user creates data in Region A, can they access it from Region B? (3) Conflict resolution—if user updates from two regions, what wins? (4) Failover complexity—if Region A fails, can Region B serve Region A users? Does it have the data? (5) Consistency model—what guarantees does the user get? Should document: Data placement strategy, replication approach, consistency guarantees, cross-region access patterns, failover behavior.

Design Exercises

Exercise 1. [Staff] Design a global AI chat application serving 100M users across NA, EU, and APAC. Requirements: <500ms time to first token, conversation history must be accessible anywhere, comply with GDPR data residency. Document your architecture including data placement, request routing, and failure handling.

Guidance Key decisions: (1) Model serving—GPU clusters in each region, route to nearest. Models replicated globally. (2) Conversation data—regional primary storage (GDPR: EU data stays in EU). Cross-region access via read replicas or on-demand fetch with latency cost. (3) User routing—latency-based initially, data-locality override for returning users. (4) Caching—edge caches for static assets, regional caches for frequent patterns. (5) Failover—within region: AZ failover automatic. Cross-region: explicit, requires user data access strategy. GDPR specifics: EU users’ data in EU region only. Non-EU users can be anywhere. Clear data flows documenting where data goes. Consider: Do you need to process EU users’ prompts only on EU servers? Legal review needed.

Exercise 2. [Staff] You’re designing an AI inference platform that must handle both real-time requests (100ms P99) and batch processing (throughput optimized, cost sensitive). Both use the same models. Design the system architecture addressing resource sharing, isolation, and cost optimization.

Guidance Architecture options: (1) Separate clusters—real-time and batch isolated. Simple, but utilization inefficient. (2) Shared cluster with priority—real-time preempts batch. Better utilization, complexity in scheduling. (3) Time-sharing—batch runs during off-peak real-time hours. Works if patterns are predictable. Considerations: (1) Batch can use lower-priority/spot instances—cost savings. (2) Batch can queue and wait—better batching efficiency. (3) Real-time needs head-of-line prevention—batch requests shouldn’t block real-time. (4) Resource limits—prevent batch from starving real-time. (5) Different SLOs—batch failures can retry, real-time cannot. Design: Likely hybrid—dedicated real-time capacity for SLA, shared overflow to batch pool, batch uses idle real-time capacity + dedicated cheap instances.

Connections to Other Chapters

System design at scale integrates knowledge from across the book. Here’s how this chapter connects:

  • Chapter 9 (LLM Deployment & Infrastructure): Deployment patterns and inference optimization for individual model instances. This chapter shows how to scale those patterns across regions and handle global traffic.

  • Chapter 27 (Performance Engineering): Latency optimization, caching strategies, and profiling techniques. Performance engineering at the component level feeds into system-wide performance at scale.

  • Chapter 31 (Reliability Engineering): Fault tolerance, redundancy patterns, and chaos engineering. Reliability is essential when operating distributed AI systems across regions.

  • Chapter 32 (Cost Engineering): Cost optimization strategies and capacity planning. System design decisions directly impact infrastructure costs at scale.

  • Architecture Decision Records (Appendix G): Example ADRs for common architectural decisions. Use ADRs to document key system design choices.


Further Reading

Essential

  • Kleppmann, “Designing Data-Intensive Applications” - The definitive practitioner’s guide to distributed systems.
  • Dean and Barroso (2013), “The Tail at Scale” - Why tail latencies matter. Essential Google experience.
  • Kwon et al. (2023), “PagedAttention” - The vLLM paper that transformed LLM serving efficiency.

Deep Dives

  • Lamport (1978), “Time, Clocks, and the Ordering of Events” - Foundational work on logical clocks.
  • Yu et al. (2022), “Orca” - Introduces continuous batching for LLM serving.
  • Beyer et al., “Site Reliability Engineering” - Google’s approach to reliable systems at scale.