Chapter 31: Reliability Engineering
reliability, SLOs, SLAs, incident response, chaos engineering, fault tolerance, graceful degradation, on-call
Introduction
On March 14, 2023, an AI-powered customer service system at a major financial institution began providing subtly incorrect information about loan eligibility requirements. The system remained “available” by every traditional metric: response times were normal, error rates were low, and the service was handling its usual traffic. But the model had drifted, and for three weeks, it told customers they qualified for loans they would ultimately be denied, and told others they didn’t qualify when they actually did.
The incident wasn’t detected until a compliance audit flagged an unusual pattern in loan application outcomes. By then, thousands of customers had been given incorrect information, support tickets had spiked, and the institution faced both regulatory scrutiny and reputational damage. The total cost exceeded $4 million, yet the monitoring dashboards had shown green the entire time.
This incident captures the fundamental challenge of reliability engineering for AI systems: availability is not the same as correctness. A traditional service either works or it doesn’t. An AI system can be running perfectly while producing garbage. It can degrade slowly without triggering any alerts. It can fail in ways that only become apparent days or weeks later, when downstream effects manifest.
Reliability engineering for AI systems builds on decades of Site Reliability Engineering (SRE) practices developed at companies like Google, but extends them to handle the unique characteristics of machine learning: probabilistic outputs that are inherently uncertain, model drift that accumulates silently over time, data dependencies that create invisible failure modes, and the “long tail” of edge cases where models behave unexpectedly.
This chapter provides the conceptual foundations and practical frameworks you need to build AI systems that are truly reliable, not just available. We’ll start with the theoretical foundations of reliability engineering, examine what makes AI reliability uniquely challenging, and then work through the concrete practices that help organizations maintain AI systems at scale.
Why AI Reliability is Different
Consider a traditional web service that serves user profiles. Its reliability properties are relatively simple:
- The service is either up or down (binary state)
- Responses are either correct or they’re errors (deterministic)
- Failures are usually immediate and observable (fast feedback)
- The service behavior doesn’t change unless you deploy new code (stable)
Now consider an AI system that recommends products to users:
- The service can be “up” while producing useless recommendations (quality as a spectrum)
- “Correct” is probabilistic; there’s no single right answer (inherent uncertainty)
- Quality degradation may take weeks to manifest in business metrics (delayed feedback)
- Behavior changes even without deployment as data distributions shift (dynamic)
This creates a fundamental tension: traditional reliability engineering assumes you can distinguish working from broken, but AI systems exist on a continuum of quality. A recommendation system might be 60% as good as it was last week. Is that an incident? It depends on context, business impact, and whether you can even measure the degradation in real-time.
A web server fails like a light bulb: it works, then it doesn’t, and the failure is loud. An AI system fails like an aging organism—slow drift, diffuse symptoms, redundant pathways masking real degradation, and a single organ failure that cascades through everything downstream. You diagnose biological systems with continuous vitals and population baselines, not pass/fail tests. The same applies here: graceful degradation, redundancy, drift monitoring, and slack in the system (error budgets) keep AI systems alive. Heuristic: if your only signal is “up or down,” you’ll miss the disease that actually kills your system.
Because AI systems fail like biological systems, reliability engineering for them looks less like building a fortress and more like running a clinic—continuous monitoring of many soft signals, not just hard up/down checks. The analogy rewards being taken seriously:
Gradual onset: Quality degrades slowly as data drifts. By the time you notice, the damage is done. Like high blood pressure, there’s no alarm when things start going wrong.
Silent progression: The system keeps running, keeps returning responses. But the responses are subtly worse. Users may notice before your metrics do—if your metrics measure the right thing at all.
Interconnected symptoms: A quality problem in one area causes compensation elsewhere. The model generates longer responses to hedge its bets. Latency increases. Costs rise. Users lose trust. Each symptom triggers others.
Diagnosis requires investigation: You can’t just read a stack trace. You need to compare distributions, analyze samples, test hypotheses. Incident response becomes epidemiology.
Treatment vs. cure: Quick fixes (filtering bad outputs, tightening thresholds) treat symptoms but don’t address root causes. Real fixes require understanding what changed in the data, the model, or the world.
This biological framing changes how you build monitoring. You don’t just watch for crashes—you track vital signs. You establish baselines, detect trends, and investigate anomalies before they become incidents. You treat reliability as health maintenance, not just break-fix.
Companies that operate AI systems at scale, such as Google, Meta, Netflix, and Uber, have developed sophisticated practices for navigating this complexity. Their approaches share common foundations:
- Expanded definitions of reliability that include model quality, not just availability
- Probabilistic thinking about failures, error budgets, and acceptable degradation
- Multi-layer defense with graceful degradation rather than binary failure modes
- Feedback loops that close the gap between prediction time and outcome observation
- Chaos engineering adapted for ML-specific failure modes
What You’ll Learn
- The theoretical foundations of reliability engineering: SLOs, SLIs, SLAs, and error budgets
- Why traditional reliability metrics are insufficient for AI systems
- How to define meaningful SLOs that capture AI quality, not just availability
- Graceful degradation patterns that maintain service value even when components fail
- Circuit breaker patterns adapted for AI inference
- Chaos engineering principles and how to apply them to ML systems
- Incident response frameworks tailored for AI-specific failure modes
- Real-world case studies and postmortems from production AI systems
Prerequisites
- Understanding of ML systems architecture (Chapter 9)
- Familiarity with system design at scale (Chapter 25)
- Basic monitoring and observability concepts (Chapter 11)
Theoretical Foundations
The Language of Reliability: SLIs, SLOs, and SLAs
Reliability engineering has developed a precise vocabulary for discussing system behavior. Understanding these concepts deeply, not just their definitions, is essential for reasoning about AI reliability.
Service Level Indicator (SLI) is a quantitative measure of some aspect of service behavior. SLIs are the raw measurements, the numbers you read off your monitoring systems. Examples include:
- Request latency (p50, p95, p99)
- Error rate (percentage of requests returning errors)
- Throughput (requests per second)
- Availability (percentage of time the service responds successfully)
For AI systems, we need additional SLIs that capture model quality:
- Prediction confidence distribution
- Feature coverage (percentage of features successfully retrieved)
- Model freshness (time since last model update)
- Drift metrics (distance between current and baseline distributions)
Service Level Objective (SLO) is a target value or range for an SLI. SLOs represent the reliability you’re committing to achieve. They answer the question: “How much reliability is enough?” Examples:
- 99.9% of requests will complete successfully (availability SLO)
- 95% of requests will complete in under 200ms (latency SLO)
- Model accuracy will remain within 5% of baseline (quality SLO)
SLOs embody a crucial insight: 100% reliability is the wrong target. Pursuing perfect reliability has diminishing returns and eventually becomes impossibly expensive. The goal is to be reliable enough to satisfy users while preserving engineering velocity for improvements.
Service Level Agreement (SLA) is a contractual commitment about service behavior, typically with consequences for violation. SLAs are external promises; SLOs are internal targets. Smart organizations set internal SLOs more stringently than external SLAs, creating a buffer that prevents SLA violations.
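The relationship between the three terms can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `Objective` class, its field names, and the 99.9% / 99.5% targets are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """One SLI with an internal SLO and a looser external SLA (illustrative values)."""
    name: str
    internal_slo: float  # stricter internal target
    external_sla: float  # contractual commitment, deliberately looser

    def evaluate(self, good_events: int, total_events: int) -> str:
        sli = good_events / total_events  # the SLI is the raw measurement
        if sli >= self.internal_slo:
            return "meeting SLO"
        if sli >= self.external_sla:
            return "SLO missed, SLA intact"
        return "SLA violated"


availability = Objective("availability", internal_slo=0.999, external_sla=0.995)
print(availability.evaluate(good_events=99_800, total_events=100_000))
```

The gap between the two thresholds is the buffer: the team is alerted by the SLO miss while the contractual commitment still holds.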
Error Budgets: The Currency of Reliability
The instinct of every well-meaning engineer is to chase 100% uptime. That instinct is wrong. Reliability is a finite resource you allocate—each “nine” you add costs roughly 10x the previous one, and 100% is mathematically impossible once you depend on networks, hardware, and humans. Error budgets make this explicit: spend the budget on velocity, experimentation, and risk. Hoard it, and you’ve over-invested; blow through it, and you slow down to refill it. Heuristic: if your team has never burned an error budget, your SLO is too loose or your investment is too high.
Treating reliability as a budget, not a goal, is what turns SLOs from aspirational dashboards into actionable engineering decisions. The mental shift is worth unpacking in more detail.
Engineers often think of reliability as a goal to maximize: “We want to be as reliable as possible.” This framing is subtly wrong and leads to bad decisions. Reliability is a resource you spend, not a target you hit. Think of your error budget like a financial budget. You have a fixed amount (say, 43 minutes of downtime per month for 99.9% availability). The question isn’t “how do we never spend this?”—it’s “how do we spend it wisely?”
Spending nothing is waste: If you never touch your error budget, you’re over-invested in reliability. That engineering time could have shipped features. An unused budget is a missed opportunity.
Spending everything is bankruptcy: If you routinely exhaust your budget, users lose trust, and you’re forced into emergency reliability work. You’ve borrowed against future velocity.
The goal is sustainable spending: Burn the budget on valuable risks—deploys that ship important features, experiments that teach you something, migrations that reduce tech debt. Don’t burn it on preventable incidents or reckless deploys.
This mental shift changes conversations:
- “Is this deploy risky?” becomes “Is this deploy worth 10% of our monthly budget?”
- “We need better reliability” becomes “Our budget burn rate is unsustainable—let’s find the biggest sources”
- “Can we ship this untested change?” becomes “We have budget to burn—let’s try it in staging first”
A reliability budget makes tradeoffs explicit and decisions rational.
The concept of an error budget transforms reliability from a constraint into a resource. If your SLO allows 0.1% errors, you have an “error budget” of 0.1% to spend.
\[\text{Error Budget} = 1 - \text{SLO Target}\]
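For a 99.9% availability SLO over a 30-day window, the budget fraction is converted to time by multiplying by the length of the window: \[\text{Error Budget} = (1 - 0.999) \times 30 \times 24 \times 60 \text{ min} = 43.2 \text{ minutes of downtime}\]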
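The same arithmetic as a small helper, so other targets and windows can be compared. The function name and the 30-day default are assumptions for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    # Rounded for display; each extra "nine" shrinks the budget tenfold.
    print(f"{target:.2%} -> {error_budget_minutes(target):.2f} minutes")
```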
Error budgets serve several purposes:
Aligning incentives. Development teams want to ship features quickly. Operations teams want stability. Error budgets align these goals: you can ship fast as long as you don’t exhaust the budget. If the budget is depleted, feature development pauses while the team focuses on reliability.
Enabling risk-taking. Without error budgets, any risk feels like irresponsible gambling with reliability. With error budgets, teams can make rational decisions: “This deployment is risky but valuable. If it fails, we’ll use 20% of our remaining budget, and we can afford that.”
Quantifying tradeoffs. How much should you invest in reliability improvement versus new features? Error budget burn rate provides data. If you’re regularly exhausting your budget, you’re under-invested in reliability. If you never use your budget, you might be over-invested (or your SLO is too loose).
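Burn rate can be made concrete as the ratio of the observed error rate to the error rate the SLO allows. The sketch below uses that common definition; the function names are hypothetical, and the exhaustion estimate assumes a full budget at the start of the window and a constant burn rate.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate.

    1.0 means the budget lasts exactly the SLO window; 2.0 means it is
    gone in half the window.
    """
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is spent at a constant burn rate (an assumption)."""
    return float("inf") if rate <= 0 else window_days / rate


rate = burn_rate(bad_events=20, total_events=10_000, slo_target=0.999)
print(f"burn rate {rate:.1f}, budget gone in {days_until_exhausted(rate):.0f} days")
```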
For AI systems, error budgets become more nuanced because we track multiple SLOs simultaneously:
from dataclasses import dataclass


@dataclass
class ErrorBudgetState:
    """Track error budget consumption across multiple SLOs."""
    availability_remaining: float  # e.g., 0.73 = 73% of budget remaining
    latency_remaining: float
    quality_remaining: float
    safety_remaining: float

    def overall_status(self) -> str:
        """Return overall health based on the most depleted budget."""
        min_remaining = min(
            self.availability_remaining,
            self.latency_remaining,
            self.quality_remaining,
            self.safety_remaining,
        )
        if min_remaining > 0.5:
            return "healthy"
        elif min_remaining > 0.2:
            return "warning"
        elif min_remaining > 0:
            return "critical"
        else:
            return "exhausted"

The Reliability Stack Model
A useful mental model for AI reliability is the “reliability stack,” a hierarchy of concerns where each layer depends on the layers below.
Traditional SRE focuses primarily on Layers 1-2. AI reliability must address all five layers:
- Layer 1 (Infrastructure): Compute resources, network, storage are operational
- Layer 2 (Service Availability): Inference endpoints respond to requests
- Layer 3 (Data Quality): Features are fresh, complete, and correctly formatted
- Layer 4 (Model Quality): Predictions are accurate, well-calibrated, and appropriate
- Layer 5 (Business Outcomes): The system achieves its intended business purpose
A system can be perfectly reliable at lower layers while failing at higher ones. The financial services incident in our introduction had Layers 1-3 operating normally, Layer 4 degraded silently, and Layer 5 failing badly. This is the distinctive challenge of AI reliability.
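One way to make the stack operational is to evaluate the layers bottom-up and report the lowest one that is unhealthy. This is a sketch only: the layer keys and the boolean health inputs are assumptions, and in practice each flag would come from its own set of SLIs.

```python
from typing import Optional

# Ordered bottom-up, matching Layers 1-5 above.
LAYERS = [
    "infrastructure",
    "service_availability",
    "data_quality",
    "model_quality",
    "business_outcomes",
]


def first_failing_layer(health: dict[str, bool]) -> Optional[str]:
    """Return the lowest unhealthy layer, or None if all five are healthy.

    A layer with no signal is treated as unhealthy: no measurement is not
    evidence of health.
    """
    for layer in LAYERS:
        if not health.get(layer, False):
            return layer
    return None


# The incident from the introduction: Layers 1-3 healthy, Layer 4 degraded.
incident = {
    "infrastructure": True,
    "service_availability": True,
    "data_quality": True,
    "model_quality": False,
    "business_outcomes": False,
}
print(first_failing_layer(incident))
```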
Thinking Probabilistically About Failure
Traditional software tends toward binary thinking: the system works or it doesn’t. AI systems require probabilistic thinking, and this shift is more fundamental than it might appear.
Binary framing: “The service is up” or “The service is down”
Probabilistic framing: “There’s a 99.9% chance any given request will succeed, and of those that succeed, 95% will return high-quality predictions”
This probabilistic perspective changes how we reason about incidents. Consider model quality degradation: a model might be producing predictions that are only 80% as accurate as baseline. Is this an incident? The answer depends on:
How confident are we in the measurement? Quality metrics often have high variance. A 20% drop might be within normal measurement noise, or it might be statistically significant.
What’s the business impact? A 20% quality drop in a recommendation system might mean slightly worse user engagement. A 20% quality drop in a fraud detection system might mean millions in losses.
Is it transient or persistent? A temporary quality dip during an unusual traffic pattern might be acceptable. A sustained degradation requires intervention.
Are we consuming error budget? If we have quality error budget remaining, we can tolerate the degradation while investigating. If budget is exhausted, we need immediate action.
This probabilistic thinking also informs graceful degradation strategies. Rather than asking “Should we fail over to the fallback?”, we ask “What’s the expected value of continuing with the primary system versus switching to the fallback?” The answer depends on current error rates, fallback quality, switching costs, and our error budget position.
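The expected-value comparison can be written down directly. The sketch below is deliberately simplified: it treats a good primary response as worth 1.0 and a bad one as worth 0, and it expresses fallback quality and switching cost on the same scale. All names and numbers are assumptions.

```python
def expected_value(success_prob: float, value_when_good: float,
                   value_when_bad: float = 0.0) -> float:
    """Probability-weighted value of a single request."""
    return success_prob * value_when_good + (1.0 - success_prob) * value_when_bad


def should_switch_to_fallback(primary_success_prob: float,
                              fallback_quality: float,
                              switching_cost: float = 0.0) -> bool:
    """Switch only if the fallback's value, net of switching cost, is higher."""
    primary_ev = expected_value(primary_success_prob, value_when_good=1.0)
    fallback_ev = fallback_quality - switching_cost
    return fallback_ev > primary_ev


# Primary succeeds half the time; a 60%-quality fallback is the better bet.
print(should_switch_to_fallback(0.5, fallback_quality=0.6, switching_cost=0.05))
# Primary succeeds 90% of the time; stay on it.
print(should_switch_to_fallback(0.9, fallback_quality=0.6))
```

A fuller version would also weigh the remaining error budget, for example by lowering the switching threshold as the budget runs out.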
SLOs for AI Systems
“Our 99.9% availability SLO was a lie. We measured ‘service returns HTTP 200’ but didn’t check if the response was actually useful. For three months, 5% of our responses were low-confidence fallbacks that users hated. When we redefined availability as ‘returns a high-confidence response,’ our real SLO was 94.7%. Embarrassing, but at least now we’re measuring what actually matters.”
— Site Reliability Engineer at AI infrastructure company
“The first time I set SLOs, I made them aspirational: 99.99% availability, p95 under 100ms. We never hit them. Eventually I realized SLOs aren’t goals—they’re contracts with your users and your on-call team.
An SLO of 99.9% means you’re allowed 43 minutes of downtime per month. If you’re burning error budget faster than that, you stop feature work and fix reliability. If you’re under budget, you can take risks. The SLO defines what ‘reliable enough’ means.
For AI systems, the hard part is defining ‘available.’ I’ve seen teams celebrate 99.99% uptime while the model served garbage 10% of the time. Add quality SLOs: ‘Among successful responses, 95% must have confidence above threshold.’ Now you have a meaningful contract.
One more lesson: review SLOs quarterly. User expectations change. System capabilities change. An SLO set two years ago may be way too lenient—or impossibly strict—for today’s system.”
—Staff SRE, AI Platform
Defining Meaningful AI SLOs
Traditional SLOs focus on availability, latency, and throughput. AI systems need all of these plus SLOs that capture model quality, data freshness, and safety. Let’s examine each category.
Availability SLOs
For AI systems, “availability” should mean more than “the service responds without error.” A prediction service that returns random noise is technically “available” but useless.
Narrow definition: The service returns a non-error response \[\text{Availability}_{\text{narrow}} = \frac{\text{Successful Responses}}{\text{Total Requests}}\]
Expanded definition: The service returns a valid, usable prediction \[\text{Availability}_{\text{expanded}} = \frac{\text{Successful Responses with Confidence} > \theta}{\text{Total Requests}}\]
The expanded definition excludes predictions where the model admits it doesn’t know (low confidence) or where quality checks fail. This is a more meaningful measure for AI systems.
Example SLO: “99.9% of inference requests will return a valid prediction with confidence greater than 0.5, measured over a rolling 30-day window.”
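The two definitions differ only in what counts as a success. A minimal sketch, assuming each logged response records whether it errored and, if not, the model's confidence (the `Response` type is hypothetical):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    ok: bool                            # non-error response
    confidence: Optional[float] = None  # model confidence, if a prediction was made


def availability(responses: list[Response], theta: Optional[float] = None) -> float:
    """Narrow availability when theta is None; expanded availability otherwise."""
    if not responses:
        raise ValueError("no requests in the measurement window")

    def counts_as_success(r: Response) -> bool:
        if not r.ok:
            return False
        if theta is None:
            return True
        return r.confidence is not None and r.confidence > theta

    return sum(counts_as_success(r) for r in responses) / len(responses)


window = (
    [Response(ok=False)]
    + [Response(ok=True, confidence=0.3)] * 2
    + [Response(ok=True, confidence=0.9)] * 7
)
print(availability(window), availability(window, theta=0.5))
```

On this sample the narrow figure is 0.9 while the expanded one is 0.7, which is the gap the first quotation in this section describes.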
Latency SLOs
AI inference latency has unique characteristics. The same model can have very different latencies depending on input complexity (a simple query versus a complex one), batch size, hardware utilization, and whether features need to be computed or retrieved.
Key insight: Percentile-based latency SLOs (p95, p99) are more meaningful than averages for AI systems because the distribution often has a heavy tail.
Example SLOs:
- “p50 latency < 50ms” (typical user experience)
- “p95 latency < 200ms” (worst-case for most users)
- “p99 latency < 500ms” (acceptable even for tail cases)
For LLM systems with streaming responses, time-to-first-token (TTFT) becomes an important additional metric:
- “p95 time-to-first-token < 500ms”
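To see why percentiles matter more than the mean, here is a small sketch using the nearest-rank percentile definition (one of several conventions; monitoring systems may interpolate differently). The sample latencies and helper names are made up for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency_slos(latencies_ms: list[float],
                       targets: dict[str, tuple[float, float]]) -> dict[str, bool]:
    """targets maps an SLO name to (percentile, limit in ms); True means the SLO is met."""
    return {
        name: percentile(latencies_ms, pct) < limit_ms
        for name, (pct, limit_ms) in targets.items()
    }


# Heavy-tailed sample: most requests are fast, a few are very slow.
latencies = [20.0] * 90 + [150.0] * 8 + [900.0] * 2
targets = {"p50": (50, 50.0), "p95": (95, 200.0), "p99": (99, 500.0)}
print(sum(latencies) / len(latencies))  # the mean looks healthy
print(check_latency_slos(latencies, targets))
```

The mean of this sample is 48 ms, yet the p99 target is missed because the slowest 2% of requests take 900 ms.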
Quality SLOs
Quality SLOs are the distinctive element of AI reliability. They capture whether the model’s outputs are actually good, not just whether outputs are produced.
The challenge is that “quality” requires ground truth labels, which often aren’t available in real-time. We handle this through several strategies:
Delayed evaluation: For systems where outcomes become known (did the user click? was the transaction fraudulent?), compute quality metrics with a delay.
\[\text{Quality}_{t} = f(\text{predictions}_{t-\Delta}, \text{outcomes}_{t})\]
Proxy metrics: When ground truth isn’t available, use proxy metrics that correlate with quality:
- Prediction confidence distribution
- Feature coverage
- Output distribution stability
- Agreement with a reference model
Example SLOs:
- “Model accuracy will remain within 5% of baseline over any 7-day window” (delayed)
- “Mean prediction confidence will remain within 2 standard deviations of baseline” (proxy)
- “Prediction distribution KL divergence from baseline < 0.1” (proxy)
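The KL-divergence proxy can be computed from two histograms over the same bins. The sketch below makes two assumptions the text does not specify: the direction is KL(current || baseline), and a small epsilon is added to every bin so that empty bins do not produce infinities. The result is in nats.

```python
import math


def kl_divergence(current: list[float], baseline: list[float],
                  eps: float = 1e-9) -> float:
    """KL(current || baseline) over matching histogram bins, in nats."""
    if len(current) != len(baseline):
        raise ValueError("histograms must use the same bins")

    def normalize(counts: list[float]) -> list[float]:
        smoothed = [c + eps for c in counts]  # avoid log(0) and division by zero
        total = sum(smoothed)
        return [c / total for c in smoothed]

    p, q = normalize(current), normalize(baseline)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


baseline_counts = [50, 30, 20]   # predicted-class counts at baseline
current_counts = [20, 30, 50]    # counts in the current window
divergence = kl_divergence(current_counts, baseline_counts)
print(divergence, "violates SLO" if divergence >= 0.1 else "within SLO")
```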
Freshness SLOs
AI systems have temporal dependencies that traditional services lack. Models go stale as the world changes. Features go stale as data pipelines lag. Freshness SLOs capture these time-based requirements.
- Model freshness: “Production model will be no more than 7 days old”
- Feature freshness: “Real-time features will be no more than 5 minutes stale”
- Embedding freshness: “User embeddings will be refreshed at least every 24 hours”
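Freshness SLOs reduce to comparing timestamps against age limits. A minimal sketch using the three example targets above; the artifact names and the `last_updated` mapping are assumptions about how timestamps would be collected.

```python
from datetime import datetime, timedelta, timezone

# Maximum allowed age per artifact, taken from the example SLOs above.
FRESHNESS_SLOS = {
    "model": timedelta(days=7),
    "realtime_features": timedelta(minutes=5),
    "user_embeddings": timedelta(hours=24),
}


def stale_artifacts(last_updated: dict[str, datetime], now: datetime) -> list[str]:
    """Return the names of artifacts older than their freshness SLO allows."""
    return sorted(
        name
        for name, max_age in FRESHNESS_SLOS.items()
        if now - last_updated[name] > max_age
    )


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "model": now - timedelta(days=8),
    "realtime_features": now - timedelta(minutes=2),
    "user_embeddings": now - timedelta(hours=30),
}
print(stale_artifacts(last_updated, now))
```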
Safety SLOs
For systems with potential for harm, safety SLOs set hard limits on unacceptable outputs.
Example SLOs:
- “Less than 0.01% of outputs will be flagged by content safety filters”
- “Zero outputs will contain PII from training data”
- “Zero high-confidence predictions on known out-of-distribution inputs”
Safety SLOs often have zero error budget (any violation is unacceptable) and trigger immediate incident response.
The SLO Setting Framework
How do you choose SLO targets? The answer involves balancing user expectations, business requirements, technical feasibility, and cost.
Step 1: Understand user expectations. What reliability would users notice and care about? For a recommendation system, users might not notice if recommendations are 10% worse, but they’d notice if the page fails to load. For a medical diagnostic system, any quality degradation is unacceptable.
Step 2: Assess current baseline. What reliability are you achieving today? SLOs should be achievable targets, not aspirations. If you’re currently at 99% availability, jumping to 99.99% is probably unrealistic without major investment.
Step 3: Model the cost-reliability curve. Each “nine” of availability costs more than the last. 99% to 99.9% might require redundancy. 99.9% to 99.99% might require multi-region failover. 99.99% to 99.999% might require custom infrastructure.
Step 4: Align with business impact. What is the cost of unreliability? For an ad-serving system, each minute of downtime has a calculable revenue impact. For a customer service chatbot, the impact is harder to measure but includes customer satisfaction and support costs. SLO targets should be set where the marginal cost of improving reliability equals the marginal benefit.
Step 5: Start conservative and iterate. It’s easier to tighten an SLO later than to relax one that users and teams have come to depend on. Start with achievable targets, build operational muscle, then raise the bar.
What people do: Define reliability SLOs purely in terms of availability (e.g., “99.9% uptime”) without measuring latency or quality.
Why it fails: An AI system can be “up” but useless. If latency spikes from 200ms to 5 seconds, users abandon requests—but the system is technically available. If model quality degrades (hallucination rate doubles), users lose trust—but uptime is 100%. Availability-only SLOs miss the failures that actually hurt users.
Fix: Define SLOs across all three dimensions: availability (is it up?), latency (is it fast enough?), and quality (is it correct?). Example: “99.9% availability AND p99 latency under 2 seconds AND hallucination rate under 5%.” Measure what users actually experience, not just what’s easy to measure.
What people do: Set aggressive SLO targets (99.99% availability) without validating that current systems can meet them.
Why it fails: Unrealistic SLOs become meaningless. If you’re constantly over error budget, teams stop paying attention. “We’re always in violation anyway” becomes the culture. SLOs are supposed to drive action—if they’re impossible, they drive nothing.
Fix: Measure your current reliability for 2-4 weeks before setting SLOs. Set initial targets at or slightly below your observed performance. Improve the system, then tighten the SLO. A 99% SLO that’s actually met is more valuable than a 99.99% SLO that’s always violated.
Real-Time Quality Estimation
Since quality labels often arrive with a delay, production systems need techniques to estimate quality in real-time. Here are proven approaches:
Confidence monitoring: Track the distribution of prediction confidence scores. A sudden drop in mean confidence, or an unusual number of low-confidence predictions, often indicates quality issues.
Using mean confidence as a quality signal assumes confidence tracks correctness—but neural models are frequently miscalibrated, and the failure mode is the dangerous direction: they tend to be overconfident precisely on out-of-distribution inputs, the cases where quality has actually degraded. A model can sail through a distribution shift while reporting high, stable confidence on inputs it is getting wrong, so confidence monitoring would show “all green” during the exact incident you want to catch. Before trusting confidence as a quality proxy, verify the model is calibrated (e.g., expected calibration error, reliability diagrams) and ideally recalibrate (temperature scaling). Otherwise pair it with a signal that doesn’t depend on the model’s self-assessment, such as shadow scoring or outcome correlation below.
def detect_confidence_anomaly(current_confidence: list[float],
                              baseline_mean: float,
                              baseline_std: float) -> dict:
    """Detect anomalies in the mean of the confidence distribution."""
    if not current_confidence:
        raise ValueError("current_confidence must not be empty")
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    current_mean = sum(current_confidence) / len(current_confidence)
    # baseline_std should describe how the windowed mean varies at baseline
    z_score = (current_mean - baseline_mean) / baseline_std
    return {
        "current_mean": current_mean,
        "z_score": z_score,
        "status": "anomaly" if abs(z_score) > 3 else "normal",
    }

Distribution shift detection: Compare the distribution of model inputs or outputs to a baseline. Significant shifts suggest the model is seeing different data than it was trained on.
Shadow scoring: Route a sample of traffic to a known-good reference model and compare predictions. Divergence suggests quality degradation.
Outcome correlation: For systems with fast feedback loops (e.g., user clicks), monitor the correlation between predictions and outcomes. Declining correlation indicates quality problems.
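Shadow scoring, for example, comes down to an agreement rate between the production model and the reference model on the sampled traffic. A minimal sketch for classification-style outputs; the function names, the 0.05 tolerance, and the baseline agreement are assumptions to be calibrated per system.

```python
def agreement_rate(primary: list[str], reference: list[str]) -> float:
    """Fraction of sampled requests where both models returned the same label."""
    if not primary or len(primary) != len(reference):
        raise ValueError("need two non-empty prediction lists of equal length")
    matches = sum(p == r for p, r in zip(primary, reference))
    return matches / len(primary)


def shadow_divergence_alert(primary: list[str], reference: list[str],
                            baseline_agreement: float,
                            tolerance: float = 0.05) -> bool:
    """Alert when agreement falls more than `tolerance` below its usual level."""
    return agreement_rate(primary, reference) < baseline_agreement - tolerance


production = ["approve", "deny", "approve", "approve"]
shadow = ["approve", "deny", "deny", "approve"]
print(agreement_rate(production, shadow))
print(shadow_divergence_alert(production, shadow, baseline_agreement=0.95))
```

Unlike confidence monitoring, this signal does not depend on the production model's own self-assessment, though it is only as good as the reference model.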
Graceful Degradation
The Philosophy of Graceful Degradation
When systems fail, they should fail gracefully. This means maintaining some level of service even when components are broken, rather than collapsing entirely.
The key insight is that partial service is usually better than no service. If your recommendation model is down, showing popular items is better than showing an error page. If your personalized ranking is too slow, returning unranked results is better than timing out.
Graceful degradation requires designing degradation modes in advance. You can’t figure out fallback strategies during an incident. You need to:
- Enumerate failure modes: What can break? Models, features, infrastructure, dependencies.
- Design degradation levels: For each failure mode, what’s an acceptable fallback?
- Implement automatic transitions: How does the system detect failures and switch modes?
- Test regularly: Untested fallbacks will fail when needed most.
def get_recommendation(user_id: str) -> list[str]:
    # Any failure = complete outage
    features = feature_store.get(user_id)
    embedding = model.encode(features)
    recommendations = index.search(embedding, k=10)
    return recommendations

# Feature store timeout? 500 error.
# Model OOM? 500 error.
# Index corrupt? 500 error.

import asyncio
import logging

logger = logging.getLogger(__name__)


class ResilientRecommender:
    # CircuitBreaker is defined later in this chapter; ResponseCache,
    # RecommendationResponse, and load_popular_items are assumed to exist.
    def __init__(self, feature_store, model, simple_model):
        # Dependencies are injected so every attribute used below is defined.
        self.feature_store = feature_store
        self.model = model
        self.simple_model = simple_model
        self.circuit_breaker = CircuitBreaker(failure_threshold=5)
        self.cache = ResponseCache(ttl_seconds=300)
        self.fallback_items = self.load_popular_items()

    async def get_recommendation(
        self,
        user_id: str,
        timeout: float = 2.0
    ) -> RecommendationResponse:
        # Level 0: Try primary path with circuit breaker.
        # can_proceed() lets an open breaker move to half-open and recover.
        if self.circuit_breaker.can_proceed():
            try:
                async with asyncio.timeout(timeout):  # Python 3.11+
                    features = await self.feature_store.get(user_id)
                    recs = await self.model.predict(features)
                self.circuit_breaker.record_success()
                self.cache.set(user_id, recs)  # Cache for fallback
                return RecommendationResponse(
                    items=recs, quality="full", source="model"
                )
            except Exception as e:
                self.circuit_breaker.record_failure()
                logger.warning("Primary path failed: %s", e)

        # Level 1: Try cached response
        cached = self.cache.get(user_id)
        if cached is not None:
            return RecommendationResponse(
                items=cached, quality="stale", source="cache"
            )

        # Level 2: Try simplified model (no personalization features)
        try:
            async with asyncio.timeout(timeout * 0.5):
                recs = await self.simple_model.predict(user_id)
            return RecommendationResponse(
                items=recs, quality="degraded", source="simple_model"
            )
        except Exception:
            pass  # Fall through to the static fallback

        # Level 3: Static fallback (always works)
        return RecommendationResponse(
            items=self.fallback_items,
            quality="minimal",
            source="popular_items"
        )

Key differences: the circuit breaker prevents cascading failures; there are multiple fallback levels (cache → simple model → static); each level manages its own timeout; every response carries a quality indicator; and no request returns a 500 error, so users always get something.
The Degradation Hierarchy
A well-designed AI system has multiple degradation levels, each trading quality for resilience.
Each level should have clearly defined:
- Entry conditions: What failures trigger this level?
- Exit conditions: When can we return to the higher level?
- Quality metrics: How much worse is this level?
- Capacity requirements: Can this level handle full traffic?
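These four properties can be captured as data, with entry and exit thresholds kept apart so the system does not flap between levels. The sketch below is one possible encoding, not a standard one: it assumes transitions are driven by the primary path's error rate, and every name and threshold is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradationLevel:
    name: str
    enter_error_rate: float  # entry condition: degrade to this level at or above this rate
    exit_error_rate: float   # exit condition: leave this level only below this rate
    relative_quality: float  # quality metric: fraction of full-quality value
    max_qps: float           # capacity requirement: traffic this level can absorb


def next_level(levels: list[DegradationLevel], current: int, error_rate: float) -> int:
    """Move at most one level per evaluation; the gap between entry and exit adds hysteresis."""
    if current + 1 < len(levels) and error_rate >= levels[current + 1].enter_error_rate:
        return current + 1  # degrade further
    if current > 0 and error_rate < levels[current].exit_error_rate:
        return current - 1  # recover one level
    return current


levels = [
    DegradationLevel("full_model", 0.0, 0.0, 1.00, 5_000),
    DegradationLevel("cached", 0.05, 0.01, 0.85, 20_000),
    DegradationLevel("static", 0.20, 0.10, 0.40, 100_000),
]
print(levels[next_level(levels, current=0, error_rate=0.06)].name)
```

With these numbers, an error rate of 3% keeps the system at the cached level: it is below the 5% entry threshold but not yet below the 1% exit threshold.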
Circuit Breaker Pattern
The circuit breaker pattern prevents cascading failures by “tripping” when a dependency fails too often. Named after electrical circuit breakers, it has three states:
Closed (Normal): Requests flow through normally. Failures are counted.
Open (Blocking): After too many failures, the circuit “trips.” All requests are immediately rejected or routed to fallback without attempting the primary path. This protects the failing component from additional load and protects callers from slow failures.
Half-Open (Testing): After a cooling-off period, a limited number of requests are allowed through to test if the component has recovered. Success returns to Closed; failure returns to Open.
import time


class CircuitBreaker:
    """Circuit breaker for AI model calls."""

    def __init__(self, failure_threshold: int = 5,
                 reset_timeout_seconds: float = 30,
                 half_open_max_calls: int = 3):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout_seconds
        self.half_open_max_calls = half_open_max_calls
        self.state = "closed"
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_successes = 0

    def _reset_timeout_expired(self) -> bool:
        """True once the cooling-off period since the last failure has passed."""
        if self.last_failure_time is None:
            return True
        return time.monotonic() - self.last_failure_time >= self.reset_timeout

    def can_proceed(self) -> bool:
        """Check if request should proceed or use fallback."""
        if self.state == "closed":
            return True
        elif self.state == "open":
            if self._reset_timeout_expired():
                self.state = "half_open"
                self.half_open_successes = 0
                return True
            return False
        else:  # half_open
            return self.half_open_successes < self.half_open_max_calls

    def record_success(self):
        """Record successful call."""
        if self.state == "half_open":
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max_calls:
                self.state = "closed"
                self.failure_count = 0
        else:
            # Count consecutive failures only, so rare errors never accumulate into a trip
            self.failure_count = 0

    def record_failure(self):
        """Record failed call."""
        self.failure_count += 1
        # Monotonic clock: elapsed-time checks are unaffected by wall-clock changes
        self.last_failure_time = time.monotonic()
        # A failed probe in half-open reopens immediately; otherwise trip at the threshold
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"

For AI systems, circuit breakers need enhancement to handle quality degradation, not just hard failures:
from collections import deque


class AICircuitBreaker(CircuitBreaker):
    """Circuit breaker that trips on quality degradation."""

    def __init__(self, quality_threshold: float = 0.8, **kwargs):
        super().__init__(**kwargs)
        self.quality_threshold = quality_threshold
        # Rolling window of the last 100 confidences; old entries drop off automatically
        self.recent_confidences = deque(maxlen=100)

    def record_prediction(self, confidence: float):
        """Record prediction confidence; may trip on low quality."""
        self.recent_confidences.append(confidence)
        avg_confidence = sum(self.recent_confidences) / len(self.recent_confidences)
        if avg_confidence < self.quality_threshold:
            self.record_failure()
        else:
            self.record_success()

Fallback Strategies by System Type
Different AI systems require different fallback strategies:
Recommendation Systems
| Level | Strategy | Quality | Example |
|---|---|---|---|
| 0 | Personalized ML model | 100% | User-specific recommendations |
| 1 | Collaborative filtering with cached embeddings | 85% | “Users like you also liked…” |
| 2 | Popularity-based | 60% | “Trending in your region” |
| 3 | Category defaults | 40% | “Popular in Electronics” |
| 4 | Editorial picks | 25% | Static curated list |
Search Ranking
| Level | Strategy | Quality | Example |
|---|---|---|---|
| 0 | ML ranker with full features | 100% | Personalized relevance |
| 1 | ML ranker with cached features | 90% | Recent user history only |
| 2 | BM25 keyword matching | 70% | Text relevance only |
| 3 | Exact match + recency | 50% | Newest matching results |
| 4 | Random sample | 20% | Any matching documents |
LLM Chat Applications
| Level | Strategy | Quality | Example |
|---|---|---|---|
| 0 | Primary LLM with full context | 100% | GPT-5 with RAG |
| 1 | Primary LLM with reduced context | 85% | Shorter context window |
| 2 | Smaller/faster model | 60% | Claude Haiku 4.5 fallback |
| 3 | Retrieval without generation | 40% | “Here are relevant docs…” |
| 4 | Scripted responses | 20% | “I’m having trouble. Please try again.” |
Fraud Detection
| Level | Strategy | Quality | Example |
|---|---|---|---|
| 0 | ML model with full signals | 100% | Real-time fraud scoring |
| 1 | ML model with cached signals | 90% | Slightly stale features |
| 2 | Rules-based detection | 70% | Velocity checks, blacklists |
| 3 | Conservative blocking | 50% | Block high-risk categories |
| 4 | Allow with manual review | 30% | Queue for human review |
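The four tables share one control-flow pattern: try each level in order and serve the first one that succeeds. The following is a minimal sketch of that pattern, assuming each level is a plain callable. The `FallbackChain` class and the three example strategies are hypothetical names chosen for illustration, not part of any library.

```python
from typing import Any, Callable, Optional


class FallbackChain:
    """Try strategies in order, from highest to lowest quality."""

    def __init__(self, levels: list[tuple[str, Callable[[dict], Any]]]):
        if not levels:
            raise ValueError("at least one fallback level is required")
        self.levels = levels

    def run(self, request: dict) -> tuple[int, str, Any]:
        """Return (level_index, level_name, result) from the first level that succeeds."""
        last_error: Optional[Exception] = None
        for index, (name, strategy) in enumerate(self.levels):
            try:
                return index, name, strategy(request)
            except Exception as exc:  # any failure moves on to the next level
                last_error = exc
        raise RuntimeError("all fallback levels failed") from last_error


# Hypothetical strategies for a recommendation system
def personalized(request: dict) -> list[str]:
    raise TimeoutError("model server unavailable")  # simulated outage


def popularity(request: dict) -> list[str]:
    return ["trending-1", "trending-2"]


def editorial(request: dict) -> list[str]:
    return ["staff-pick-1"]


chain = FallbackChain([
    ("personalized", personalized),
    ("popularity", popularity),
    ("editorial", editorial),
])
level, name, items = chain.run({"user_id": "u123"})
print(level, name, items)
```

Returning the level index alongside the result lets you emit it as a metric, so a system that is quietly serving level 2 for hours shows up on a dashboard instead of looking healthy.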
Load Shedding
When systems are overloaded, you can’t serve everyone well. Load shedding means intentionally rejecting some requests to maintain quality for others.
Smart load shedding for AI systems considers request priority:
class PriorityLoadShedder:
    """Shed low-priority requests under load."""

    def __init__(self, capacity_qps: float,
                 priority_thresholds: dict[int, float]):
        """
        priority_thresholds maps priority level to the load factor at which
        requests at that priority start being shed.
        e.g., {0: 1.0, 1: 0.95, 2: 0.85, 3: 0.70}
        Priority 0 = critical, shed only once load reaches 100% of capacity
        Priority 3 = low, shed once load reaches 70%
        """
        self.capacity_qps = capacity_qps
        self.priority_thresholds = priority_thresholds

    def should_accept(self, request_priority: int,
                      current_qps: float) -> bool:
        """Determine whether to accept or shed this request."""
        load_factor = current_qps / self.capacity_qps
        # Unknown priorities are treated like the lowest tier (0.7)
        threshold = self.priority_thresholds.get(request_priority, 0.7)
        return load_factor < threshold

Priority assignment depends on your domain:
- E-commerce: Checkout requests are higher priority than browsing
- Ads: Requests from high-value advertisers are higher priority
- Healthcare: Emergency alerts are higher priority than routine checks
Chaos Engineering for AI Systems
Principles of Chaos Engineering
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. The core insight is that you don’t truly know if your system can handle failure until you’ve actually failed it.
The foundational work on chaos engineering emerged from Netflix’s development of Chaos Monkey in 2010, which randomly terminated production instances to ensure the system could tolerate instance failures. This evolved into a broader discipline with principles articulated by the Netflix team and formalized in subsequent research:
Build a hypothesis around steady-state behavior. Define what “normal” looks like in measurable terms. For AI systems, this includes quality metrics, not just availability.
Vary real-world events. Inject failures that actually occur: network partitions, dependency failures, hardware problems, traffic spikes. For AI, add: model corruption, feature staleness, data quality issues.
Run experiments in production. Staging environments can’t replicate production complexity. Start with small blast radius but aim for production validation.
Automate experiments to run continuously. One-time tests provide one-time confidence. Continuous chaos provides continuous confidence.
Minimize blast radius. Start small, monitor closely, have kill switches. The goal is learning, not outages.
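The first principle is easier to apply once the steady state is written down as data. The sketch below is one way to do that, under stated assumptions: the `SteadyStateCheck` class, the metric names, and every baseline and tolerance value are illustrative, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class SteadyStateCheck:
    metric: str
    baseline: float
    tolerance: float              # allowed relative deviation, e.g. 0.05 = 5%
    higher_is_better: bool = True

    def holds(self, observed: float) -> bool:
        """True if the observed value stays within tolerance of the baseline."""
        if self.higher_is_better:
            return observed >= self.baseline * (1 - self.tolerance)
        return observed <= self.baseline * (1 + self.tolerance)


def evaluate_hypothesis(checks: list[SteadyStateCheck],
                        observed: dict[str, float]) -> dict[str, bool]:
    """Evaluate every check; a missing metric counts as a violation."""
    return {
        check.metric: (check.metric in observed
                       and check.holds(observed[check.metric]))
        for check in checks
    }


checks = [
    SteadyStateCheck("availability", baseline=0.999, tolerance=0.001),
    SteadyStateCheck("p99_latency_ms", baseline=400.0, tolerance=0.25,
                     higher_is_better=False),
    SteadyStateCheck("quality_proxy", baseline=0.82, tolerance=0.05),
]
# Simulated readings taken during an experiment
observed = {"availability": 0.9992, "p99_latency_ms": 470.0, "quality_proxy": 0.75}
results = evaluate_hypothesis(checks, observed)
print(results)
```

In this simulated run the availability and latency checks hold while the quality check fails, which is exactly the case an availability-only hypothesis would miss.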
AI-Specific Chaos Experiments
Traditional chaos engineering focuses on infrastructure failures. AI systems have additional failure modes that require specific experiments:
Model Failure Injection
from dataclasses import dataclass, field

@dataclass
class ModelChaosExperiment:
    name: str = "model_failure"
    description: str = "Simulate primary model becoming unavailable"
    injection: str = "Return error for all model inference calls"
    expected_behavior: str = "System falls back to secondary model or rules"
    success_criteria: list[str] = field(default_factory=lambda: [
        "Availability remains above 99%",
        "Fallback activates within 5 seconds",
        "Quality degradation is within expected bounds",
        "No cascading failures to other services"
    ])
    blast_radius: str = "5% of traffic in one region"
    duration_minutes: int = 30
    rollback: str = "Disable fault injection flag"

Feature Store Latency
@dataclass
class FeatureLatencyExperiment:
    name: str = "feature_store_latency"
    description: str = "Inject latency into feature retrieval"
    injection: str = "Add 500ms delay to feature store responses"
    expected_behavior: str = "System uses cached features or proceeds with partial features"
    success_criteria: list[str] = field(default_factory=lambda: [
        "p99 latency remains under SLO",
        "Cached features are used correctly",
        "Graceful degradation activates",
        "No requests timeout entirely"
    ])
    blast_radius: str = "10% of requests"
    duration_minutes: int = 60
    rollback: str = "Remove latency injection"

Data Quality Degradation
@dataclass
class DataQualityChaosExperiment:
    name: str = "data_quality_degradation"
    description: str = "Inject noise or missing values into input features"
    injection: str = "Randomly null 20% of feature values"
    expected_behavior: str = "System detects quality issue, alerts, may degrade"
    success_criteria: list[str] = field(default_factory=lambda: [
        "Data quality monitoring detects the issue",
        "Alerts fire within 5 minutes",
        "Model handles missing features gracefully",
        "No silent quality degradation"
    ])
    blast_radius: str = "Shadow traffic only initially"
    duration_minutes: int = 120
    rollback: str = "Stop feature injection"

Model Drift Simulation
@dataclass
class DriftChaosExperiment:
    name: str = "model_drift_simulation"
    description: str = "Simulate gradual model drift"
    injection: str = "Slowly shift input feature distributions"
    expected_behavior: str = "Drift detection triggers, alerts, possible retraining"
    success_criteria: list[str] = field(default_factory=lambda: [
        "Drift monitoring detects shift",
        "Quality proxy metrics respond",
        "Alerts reach appropriate teams",
        "Runbook is actionable"
    ])
    blast_radius: str = "Canary traffic only"
    duration_minutes: int = 1440  # 24 hours
    rollback: str = "Restore normal feature distribution"

Running Chaos Experiments Safely
Chaos experiments require careful execution to provide value without causing incidents:
Pre-experiment checklist:
During experiment:
- Continuous monitoring of key metrics
- Clear communication channel (Slack, etc.) for observations
- Pre-authorized decision maker for early termination
- Timestamped log of observations
Post-experiment:
- Document findings regardless of outcome
- If experiment failed (system didn’t behave as expected), create action items
- If experiment succeeded, consider increasing blast radius for next iteration
- Update runbooks based on learnings
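The monitoring, early-termination, and logging items above can be combined into a small guarded runner. This is a sketch under stated assumptions rather than a production harness: `run_guarded_experiment`, its parameters, and the simulated metric source are all hypothetical, and a real setup would read metrics from your monitoring system.

```python
import time
from typing import Callable


def run_guarded_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_metrics: Callable[[], dict[str, float]],
    abort_if: dict[str, Callable[[float], bool]],
    duration_seconds: float,
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Run a fault injection, aborting early if any guardrail trips.

    Returns a timestamped log of observations. Rollback always runs,
    including when the experiment aborts or raises.
    """
    log: list[str] = []
    started = time.monotonic()
    inject()
    log.append(f"{time.strftime('%H:%M:%S')} injection started")
    try:
        while time.monotonic() - started < duration_seconds:
            metrics = read_metrics()
            for name, tripped in abort_if.items():
                if name in metrics and tripped(metrics[name]):
                    log.append(f"{time.strftime('%H:%M:%S')} ABORT: {name}={metrics[name]}")
                    return log
            sleep(poll_seconds)
        log.append(f"{time.strftime('%H:%M:%S')} completed full duration")
        return log
    finally:
        rollback()
        log.append(f"{time.strftime('%H:%M:%S')} rollback executed")


# Simulated system: the error rate climbs on every poll while the fault is active
state = {"fault": False, "polls": 0}


def inject() -> None:
    state["fault"] = True


def rollback() -> None:
    state["fault"] = False


def read_metrics() -> dict[str, float]:
    state["polls"] += 1
    return {"error_rate": 0.01 * state["polls"]}


log = run_guarded_experiment(
    inject, rollback, read_metrics,
    abort_if={"error_rate": lambda value: value > 0.025},
    duration_seconds=60, poll_seconds=0.0, sleep=lambda seconds: None,
)
print("\n".join(log))
```

Putting the rollback in a `finally` block is the design choice that matters: the fault is removed whether the experiment finishes, aborts on a guardrail, or crashes.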
Game Days
Game days are structured exercises where teams intentionally break things and practice response. They’re like fire drills for reliability.
Tabletop exercises: Discussion-based scenarios. The team walks through how they would respond to a hypothetical incident without actually injecting failures.
“Scenario: It’s 2 AM and you’re paged. The recommendation model is serving predictions, but CTR has dropped 30% over the past hour. Walk me through your investigation.”
Live exercises: Actual failure injection with coordinated response. More realistic but higher risk.
Value of game days:
- Validates that runbooks are accurate and useful
- Builds team muscle memory for incident response
- Identifies gaps in monitoring and tooling
- Creates psychological safety around failures
- Surfaces organizational issues (unclear escalation paths, missing documentation)
Incident Response for AI Systems
AI-Specific Incident Types
AI systems have distinctive failure modes that require specialized incident response. Understanding these patterns helps teams detect and respond faster.
Model Quality Degradation
Signals: Quality proxy metrics declining, prediction confidence dropping, downstream business metrics falling, A/B test metrics diverging
Common causes: Data drift (input distribution changed), concept drift (relationship between inputs and outputs changed), feature pipeline failure, model corruption, dependency changes
Investigation approach: Compare current feature distributions to training data; check for upstream data pipeline changes; verify model version; look for recent deployments or dependency updates
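One common way to compare a current feature distribution against training data is the Population Stability Index (PSI). The sketch below is a minimal pure-Python version that assumes equal-width bins over the training range; the `psi` function name, the bin count, and the sample data are illustrative choices. A frequently cited rule of thumb treats values above roughly 0.25 as a significant shift, but you should calibrate that against your own features.

```python
import math
from bisect import bisect_right


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    # Equal-width bin edges over the reference range; values outside the
    # range fall into the first or last bin
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            counts[bisect_right(edges, value)] += 1
        # A small floor avoids log(0) for empty bins
        return [max(count / len(values), 1e-6) for count in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(1000)]      # roughly uniform on [0, 10)
shifted = [value + 3 for value in training]    # simulated upstream shift

print(f"same distribution: {psi(training, training):.4f}")
print(f"shifted by +3:     {psi(training, shifted):.4f}")
```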
Data Quality Incident
Signals: Feature coverage dropping, null rates increasing, distribution anomalies, feature computation failures, data freshness violations
Common causes: Upstream data source changed schema or semantics, ETL pipeline failure, data provider outage, timezone or encoding issues
Investigation approach: Check data pipeline health; compare schemas to expected; look for nulls, outliers, or encoding issues; verify upstream system status
Feature Pipeline Incident
Signals: Feature freshness SLO violation, feature store latency spike, cache miss rate increase, partial feature vectors
Common causes: Pipeline job failure, resource exhaustion, dependency timeout, storage issues
Investigation approach: Check job scheduler; verify compute resources; look for dependency health; examine logs for errors
Safety Violation
Signals: Content safety flags triggered, PII detected in output, harmful output reported, compliance alert
Common causes: Adversarial input, jailbreak attempt, training data leakage, filter failure, prompt injection
Investigation approach: Immediately restrict affected functionality; examine flagged outputs; check filter effectiveness; look for attack patterns
Infrastructure Incident
Signals: Availability drop, latency spike, error rate increase, resource exhaustion
Common causes: Hardware failure, network issues, resource limits, deployment problem, DDoS
Investigation approach: Check infrastructure dashboards; verify deployment state; examine resource utilization; look for external factors
Incident Response Framework
Effective incident response follows a structured process:
Phase 1: Detection (0-5 minutes)
The goal is to recognize that an incident is occurring and assess initial severity.
- Alert fires from monitoring system
- On-call engineer acknowledges within SLA (typically 5-15 minutes)
- Initial severity assessment based on impact and scope
- Decision to open incident channel or handle as normal alert
Severity levels for AI systems should include quality, not just availability:
| Severity | Criteria | Response |
|---|---|---|
| SEV1 | Complete outage OR safety violation OR quality below minimum threshold | Page everyone, all-hands |
| SEV2 | Significant degradation OR major quality drop | Page on-call, escalation ready |
| SEV3 | Partial impact OR moderate quality decline | On-call investigates, normal hours |
| SEV4 | Minor issue OR early warning signals | Track, investigate during business hours |
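To make the table actionable, the criteria can be encoded so that a quality drop alone is enough to raise severity. The following sketch assumes numeric cutoffs that the table does not specify: every threshold here, along with the `IncidentSignals` and `classify_severity` names, is an illustrative assumption to be replaced with values from your own SLOs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentSignals:
    availability: float        # fraction of successful requests, 0-1
    quality: float             # current quality proxy metric, 0-1
    baseline_quality: float    # normal value of the same metric
    safety_violation: bool = False


def classify_severity(s: IncidentSignals, min_quality: float = 0.60) -> Optional[str]:
    """Map signals to SEV1-SEV4, or None when nothing looks wrong.

    All numeric cutoffs are placeholders, not recommendations.
    """
    quality_drop = (s.baseline_quality - s.quality) / s.baseline_quality
    if s.safety_violation or s.availability < 0.10 or s.quality < min_quality:
        return "SEV1"
    if s.availability < 0.95 or quality_drop >= 0.15:
        return "SEV2"
    if s.availability < 0.99 or quality_drop >= 0.05:
        return "SEV3"
    if s.availability < 0.999 or quality_drop >= 0.01:
        return "SEV4"
    return None


# The service is fully available, but quality has collapsed
silent_failure = IncidentSignals(availability=0.9995, quality=0.55, baseline_quality=0.85)
print(classify_severity(silent_failure))
```

The example input is the scenario from the chapter introduction: availability looks healthy, yet the incident classifies as SEV1 because quality fell below the minimum threshold.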
Phase 2: Triage (5-30 minutes)
The goal is to understand the scope, impact, and immediate mitigation options.
- Open incident channel, start incident document
- Assign roles: Incident Commander (IC), Technical Lead, Communications Lead
- Gather initial data: What’s broken? Since when? What’s the impact?
- Decide on immediate mitigation (fail over? rollback? degrade?)
Key questions for AI incidents:
- Is this a model issue, data issue, or infrastructure issue?
- Is quality degraded or completely broken?
- Are fallbacks activating correctly?
- What’s the business impact per minute?
Phase 3: Mitigation (30 minutes - hours)
The goal is to restore service to acceptable levels, even if not fully fixed.
- Execute mitigation playbook (rollback, failover, manual intervention)
- Verify mitigation is effective
- Communicate status to stakeholders
- If mitigation insufficient, escalate and try alternative approaches
Phase 4: Resolution (hours - days)
The goal is to fully resolve the issue and restore normal operations.
- Identify and fix root cause
- Verify fix in staging/canary
- Deploy fix to production
- Monitor for recurrence
- Close incident when stable
Phase 5: Post-Incident Review (days later)
The goal is to learn from the incident and prevent recurrence.
- Schedule blameless post-mortem
- Document timeline, root cause, contributing factors
- Identify what went well and what could improve
- Define and assign action items
- Share learnings broadly
The Blameless Post-Mortem
The post-incident review is where organizations learn from failures. The key principle is blamelessness: focus on systems and processes, not individuals. People make mistakes when systems allow them to; fix the systems.
Template for AI incident post-mortems:
## Incident Summary
[One paragraph describing what happened and impact]
## Timeline
[Timestamped sequence of events from first signal to resolution]
## Root Cause
[Technical explanation of why the incident occurred]
## Contributing Factors
[Systemic issues that enabled or worsened the incident]
- Why didn't monitoring catch this earlier?
- Why did the fallback not activate correctly?
- Why was recovery slow?
## What Went Well
[Things that worked, to reinforce good practices]
## What Could Improve
[Opportunities for improvement, framed constructively]
## Action Items
[Specific, assigned, time-bound improvements]
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add monitoring for X | @alice | 2024-02-15 | P1 |
| Update runbook for Y | @bob | 2024-02-20 | P2 |
Managing Model Transitions
One of the most underappreciated reliability challenges is swapping foundation models in production. Unlike traditional software deployments where the same code produces the same output, model changes can cause subtle behavioral shifts that break downstream systems, confuse users, and invalidate prompts that were carefully tuned for the previous model.
This section covers the reliability practices for managing these transitions safely.
Semantic Versioning for Prompts
When you change a model, prompts that worked perfectly may fail silently. The model still produces output—it just produces different output. This is harder to detect than a crash and can cause significant damage before anyone notices.
Adopt semantic versioning for your prompts, treating them as first-class artifacts:
Version Format: MAJOR.MINOR.PATCH
- MAJOR: Model family change (GPT-4 → GPT-5, Claude 3 → Claude 4). Requires full regression testing.
- MINOR: Significant prompt restructuring, new sections, format changes. Requires targeted testing.
- PATCH: Wording tweaks, clarifications, typo fixes. Requires smoke testing.
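The mapping from version bump to required testing is mechanical enough to enforce in CI. A minimal sketch, assuming versions are plain `MAJOR.MINOR.PATCH` strings; the function names and the returned labels are hypothetical:

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def required_testing(old: str, new: str) -> str:
    """Return the testing level implied by the bump from old to new."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "full regression"   # MAJOR: model family change
    if new_v[1] != old_v[1]:
        return "targeted"          # MINOR: prompt restructuring
    return "smoke"                 # PATCH: wording tweaks


print(required_testing("1.2.3", "2.0.0"))
print(required_testing("1.2.3", "1.3.0"))
print(required_testing("1.2.3", "1.2.4"))
```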
Prompt Registry Pattern:
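import logging
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class PromptVersion:
    """Versioned prompt with metadata."""
    id: str
    version: str  # "1.2.3"
    template: str
    target_model: str  # "claude-sonnet-4-6"
    tested_models: list[str]  # Models this prompt has been validated against
    created_at: datetime
    deprecated_at: Optional[datetime] = None
    deprecation_reason: Optional[str] = None

class PromptRegistry:
    """Central registry for prompt versions."""

    def __init__(self):
        # In-memory storage for illustration; a real registry would persist this
        self._versions: dict[str, list[PromptVersion]] = {}

    def register(self, version: PromptVersion) -> None:
        self._versions.setdefault(version.id, []).append(version)

    def _get_versions(self, prompt_id: str) -> list[PromptVersion]:
        """Return all versions of a prompt, newest first."""
        versions = self._versions.get(prompt_id, [])
        return sorted(versions, key=lambda v: v.created_at, reverse=True)

    def get_prompt(self, prompt_id: str, model: str) -> PromptVersion:
        """Get the appropriate prompt version for a model."""
        versions = self._get_versions(prompt_id)
        if not versions:
            raise KeyError(f"Unknown prompt: {prompt_id}")
        # Find the newest version validated against the target model
        for version in versions:
            if model in version.tested_models:
                return version
        # Fall back to latest, but log warning
        logger.warning(f"No tested prompt for {prompt_id} on {model}")
        return versions[0]

    def validate_compatibility(self, prompt_id: str, old_model: str, new_model: str) -> bool:
        """Check if prompt is validated for new model."""
        version = self.get_prompt(prompt_id, old_model)
        return new_model in version.tested_models

Best Practices: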
- Never change prompts and models simultaneously. Change one, validate, then change the other.
- Store prompt history. You need to know what prompt was running when an issue occurred.
- Include model name in prompt metadata. Prompts are not model-agnostic.
- Test prompts on new models before switching. Add new models to tested_models only after validation.
A/B Rollout for Model Swaps
Never switch models all at once. Use progressive rollout to catch issues early:
Phase 1: Shadow Mode (0% user traffic)
- Route production queries to both old and new models
- Compare outputs offline
- Measure latency, cost, output length differences
- Duration: 24-48 hours minimum
Phase 2: Canary (1-5% traffic)
- Route small percentage of real traffic to new model
- Monitor all quality metrics: latency, error rate, user feedback
- Set automatic rollback thresholds
- Duration: 24-72 hours
Phase 3: Progressive Rollout (5% → 25% → 50% → 100%)
- Increase traffic in stages
- Wait for statistical significance at each stage
- Watch for segment-specific issues (certain user types, query types)
- Duration: Several days per stage
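import hashlib

class ModelRollout:
    """Progressive model rollout manager."""

    def __init__(self, old_model: str, new_model: str,
                 max_error_rate: float, max_latency: float,
                 min_quality: float):
        self.old_model = old_model
        self.new_model = new_model
        # Automatic rollback thresholds; set these from your SLOs
        self.max_error_rate = max_error_rate
        self.max_latency = max_latency
        self.min_quality = min_quality
        self.rollout_percentage = 0.0  # 0-100
        self.rollback_triggered = False

    def get_model(self, request_id: str) -> str:
        """Determine which model to use for a request."""
        if self.rollback_triggered:
            return self.old_model
        # Use a stable digest rather than built-in hash(), which is salted
        # per process for strings. Pass a user or session ID as the key if
        # the same user should consistently get the same model.
        digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
        hash_value = int(digest, 16) % 100
        if hash_value < self.rollout_percentage:
            return self.new_model
        return self.old_model

    def check_rollback_conditions(self, metrics: dict) -> bool:
        """Check if we should automatically rollback."""
        if metrics['error_rate'] > self.max_error_rate:
            return True
        if metrics['p99_latency'] > self.max_latency:
            return True
        if metrics['quality_score'] < self.min_quality:
            return True
        return False

Prompt Compatibility Testing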
Before any model change, run systematic compatibility tests:
Test Categories:
- Functional tests: Does the prompt still produce valid outputs?
- Format compliance (JSON structure, required fields)
- Output length within expected range
- No new error cases
- Semantic tests: Does the output mean the same thing?
- Compare outputs on golden examples
- Use LLM-as-judge to evaluate equivalence
- Check for new hallucinations or omissions
- Edge case tests: Does it handle boundary conditions?
- Very long inputs
- Adversarial inputs
- Empty or minimal inputs
- Non-English content (if applicable)
- Performance tests: Is the new model acceptable?
- Latency comparison
- Cost per request
- Token usage patterns
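Of these categories, the functional checks are the cheapest to automate. The sketch below covers format compliance and output length for a prompt that is expected to return a JSON object. The `check_format` helper and the example schema are hypothetical, and the type check is deliberately shallow (top-level fields only).

```python
import json


def check_format(raw_output: str, required_fields: dict[str, type],
                 max_chars: int) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    problems = []
    if len(raw_output) > max_chars:
        problems.append(f"output length {len(raw_output)} exceeds {max_chars}")
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return problems + [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return problems + ["top-level JSON value is not an object"]
    for name, expected_type in required_fields.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"field {name} should be {expected_type.__name__}")
    return problems


schema = {"intent": str, "confidence": float}

good = check_format('{"intent": "refund", "confidence": 0.92}', schema, 500)
missing = check_format('{"intent": "refund"}', schema, 500)
chatty = check_format('Sure! Here is the JSON you asked for.', schema, 500)
print(good, missing, chatty, sep="\n")
```

The third case is a typical model-swap regression: the new model wraps its answer in conversational text, so the output no longer parses even though the content may be correct.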
Automated Compatibility Suite:
# TestCase, TestResult, CompatibilityReport, generate, validate_format,
# compare_semantic, and generate_warnings are application-specific and
# assumed to be defined elsewhere.
async def test_prompt_compatibility(
    prompt: PromptVersion,
    old_model: str,
    new_model: str,
    test_cases: list[TestCase]
) -> CompatibilityReport:
    """Test prompt compatibility across model versions."""
    results = []
    for test in test_cases:
        old_output = await generate(prompt.template, test.input, old_model)
        new_output = await generate(prompt.template, test.input, new_model)
        result = TestResult(
            test_id=test.id,
            old_output=old_output,
            new_output=new_output,
            format_valid=validate_format(new_output, test.expected_schema),
            semantic_similar=await compare_semantic(old_output, new_output),
            latency_ratio=new_output.latency / old_output.latency,
            cost_ratio=new_output.cost / old_output.cost,
        )
        results.append(result)
    return CompatibilityReport(
        prompt_id=prompt.id,
        old_model=old_model,
        new_model=new_model,
        results=results,
        compatible=all(r.format_valid and r.semantic_similar for r in results),
        warnings=generate_warnings(results),
    )

Rollback Strategies
When things go wrong—and they will—you need fast rollback:
Immediate Rollback (< 1 minute)
- Feature flag to switch all traffic back to old model
- No code deployment required
- Keep old model warm and ready
Data Rollback
- If the new model corrupted data (bad extractions, wrong classifications), you may need to reprocess
- Keep audit logs of which model produced which outputs
- Have scripts ready to identify and reprocess affected records
Prompt Rollback
- If you changed prompts alongside the model, roll back prompts too
- This is why you should change one thing at a time
Rollback Checklist:
□ Flip feature flag to old model
□ Verify traffic is routing to old model
□ Check error rates returning to baseline
□ Notify stakeholders of rollback
□ Preserve logs from failed rollout
□ Schedule post-mortem
□ Identify data that needs reprocessing (if any)
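The last checklist item depends on the audit log described under Data Rollback. Here is a minimal sketch of the identification step, assuming each audit entry records which model produced an output and when. The `AuditEntry` shape, the model names, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditEntry:
    record_id: str
    model: str
    produced_at: datetime


def records_to_reprocess(audit_log: list[AuditEntry], bad_model: str,
                         rollout_start: datetime,
                         rollback_time: datetime) -> list[str]:
    """IDs of records whose most recent output came from bad_model
    during the rollout window."""
    latest: dict[str, AuditEntry] = {}
    for entry in audit_log:
        current = latest.get(entry.record_id)
        if current is None or entry.produced_at > current.produced_at:
            latest[entry.record_id] = entry
    return sorted(
        entry.record_id for entry in latest.values()
        if entry.model == bad_model
        and rollout_start <= entry.produced_at <= rollback_time
    )


audit_log = [
    AuditEntry("r1", "old-model", datetime(2025, 1, 10, 9, 0)),
    AuditEntry("r1", "new-model", datetime(2025, 1, 10, 10, 30)),
    AuditEntry("r2", "new-model", datetime(2025, 1, 10, 10, 45)),
    AuditEntry("r2", "old-model", datetime(2025, 1, 10, 12, 30)),  # already redone
    AuditEntry("r3", "old-model", datetime(2025, 1, 10, 10, 15)),
]
affected = records_to_reprocess(
    audit_log, "new-model",
    rollout_start=datetime(2025, 1, 10, 10, 0),
    rollback_time=datetime(2025, 1, 10, 12, 0),
)
print(affected)
```

Only the latest output per record is considered, so records that were already regenerated by the old model after rollback are not reprocessed a second time.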
Model Deprecation Planning
Providers deprecate models regularly. Plan for this:
- Track deprecation announcements: Subscribe to provider changelogs
- Maintain model inventory: Know which models you use where
- Test new models proactively: Don’t wait until deprecation date
- Budget for migration work: Model transitions take engineering time
- Have fallback providers: If your primary model is deprecated suddenly, can you switch providers?
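A model inventory becomes useful for deprecation planning once it is joined against announced shutdown dates. The sketch below assumes you maintain both mappings yourself; the model names, service names, dates, and the 90-day lead time are invented for illustration.

```python
from datetime import date, timedelta


def deprecation_risks(
    inventory: dict[str, list[str]],
    shutdown_dates: dict[str, date],
    today: date,
    lead_time: timedelta = timedelta(days=90),
) -> list[tuple[str, list[str], int]]:
    """Return (model, services, days_left) for models that shut down
    within lead_time, soonest first."""
    risks = []
    for model, services in inventory.items():
        shutdown = shutdown_dates.get(model)
        if shutdown is None:
            continue  # no announced shutdown date for this model
        days_left = (shutdown - today).days
        if days_left <= lead_time.days:
            risks.append((model, sorted(services), days_left))
    return sorted(risks, key=lambda risk: risk[2])


inventory = {
    "model-a-v1": ["search", "chat"],
    "model-b-v2": ["summaries"],
    "model-c-v1": ["triage"],
}
shutdown_dates = {
    "model-a-v1": date(2025, 3, 1),
    "model-b-v2": date(2025, 12, 1),
}
risks = deprecation_risks(inventory, shutdown_dates, today=date(2025, 1, 15))
print(risks)
```

Running a check like this on a schedule turns "track deprecation announcements" from a habit someone has to remember into an alert with a named list of affected services.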
Case Studies
Case Study 1: The Silent Model Regression
Context: A major e-commerce platform used an ML model to rank search results. The model was retrained weekly and automatically deployed after passing quality checks in offline evaluation.
What happened: A routine retraining run produced a model that performed well on the held-out test set but had subtly different behavior on a specific category of products (electronics). The offline evaluation didn’t include category-specific metrics, only aggregate accuracy.
After deployment, click-through rates on electronics searches dropped 15%. However, aggregate metrics (across all categories) only dropped 2%, within normal variance. The issue wasn’t detected for 11 days, until the electronics team noticed their sales were down and traced it to search visibility.
Root cause: The training data had been inadvertently filtered to exclude recent electronics listings due to a join bug. The model learned to underweight electronics-related signals. Offline evaluation used the same filtered data, so the model passed quality checks.
Lessons and changes:
1. Added category-specific metrics to offline evaluation (detect similar issues earlier)
2. Implemented real-time quality monitoring with category breakdown (faster detection)
3. Added data lineage verification to training pipeline (prevent data bugs)
4. Established canary deployment with automatic rollback triggers (limit blast radius)
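The arithmetic behind this incident is worth seeing once: a 15% drop in one category can appear as a 2% drop in the aggregate. The sketch below reproduces that effect with invented click and impression counts chosen to match the percentages in the case study; the `segment_report` function and the 5% alert threshold are illustrative assumptions.

```python
def segment_report(
    baseline: dict[str, tuple[int, int]],
    current: dict[str, tuple[int, int]],
    alert_drop: float = 0.05,
) -> tuple[dict[str, float], list[str]]:
    """Relative CTR change per segment and overall.

    Each input maps segment -> (clicks, impressions). Returns the changes
    and the sorted names of entries that dropped by at least alert_drop.
    """
    def ctr(clicks: int, impressions: int) -> float:
        return clicks / impressions

    report = {}
    for segment, (b_clicks, b_imps) in baseline.items():
        c_clicks, c_imps = current[segment]
        before, after = ctr(b_clicks, b_imps), ctr(c_clicks, c_imps)
        report[segment] = (after - before) / before

    before_all = ctr(*map(sum, zip(*baseline.values())))
    after_all = ctr(*map(sum, zip(*current.values())))
    report["ALL"] = (after_all - before_all) / before_all

    alerts = sorted(name for name, change in report.items() if change <= -alert_drop)
    return report, alerts


baseline = {"electronics": (1000, 10000), "apparel": (3000, 30000), "home": (3500, 35000)}
current = {"electronics": (850, 10000), "apparel": (3000, 30000), "home": (3500, 35000)}

report, alerts = segment_report(baseline, current)
for name, change in report.items():
    print(f"{name:12s} {change:+.1%}")
print("alerts:", alerts)
```

With a 5% threshold, only the per-category view fires: the aggregate change stays inside what the team considered normal variance.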
Case Study 2: The Feedback Loop Collapse
Context: A content moderation system used ML to flag potentially harmful content for human review. Human reviewers made final decisions, and their labels were fed back to improve the model.
What happened: A subtle bug caused the feedback pipeline to over-represent “not harmful” labels from reviewers who were faster but less careful. Over several weeks, the model learned to be more permissive, flagging less content. This reduced human review workload (which seemed like efficiency), but allowed more harmful content through.
The issue was discovered when external users reported an increase in harmful content. By then, the model had degraded significantly, and the feedback loop had compounded the error over months of retraining.
Root cause: The feedback pipeline didn’t weight labels by reviewer quality or handle the selection bias of faster reviewers. The feedback loop amplified initial errors.
Lessons and changes:
1. Implemented reviewer quality scores and weighted feedback accordingly
2. Added golden set evaluation (known-label test cases) to detect model drift
3. Established a feedback loop circuit breaker (halt training if quality drops)
4. Created a model quality SLO with mandatory manual review before deployment
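The first three changes can be sketched in a few lines. The weighting scheme below is an assumption made for illustration (the case study does not say how reviewer quality was scored): each reviewer's weight grows with measured accuracy on a golden set, and a reviewer at coin-flip accuracy contributes nothing. The function names and example data are hypothetical.

```python
from typing import Callable


def weighted_harmful_score(votes: list[tuple[str, bool]],
                           reviewer_accuracy: dict[str, float]) -> float:
    """Weighted share of 'harmful' votes, from 0.0 to 1.0.

    Weight is 0 at 50% golden-set accuracy and 1 at 100%. Reviewers
    without a measured accuracy are treated as coin-flip.
    """
    harmful = total = 0.0
    for reviewer, is_harmful in votes:
        weight = max(reviewer_accuracy.get(reviewer, 0.5) - 0.5, 0.0) * 2
        total += weight
        if is_harmful:
            harmful += weight
    if total == 0:
        raise ValueError("no votes from reviewers with accuracy above 50%")
    return harmful / total


def passes_golden_set(predict: Callable[[str], bool],
                      golden: list[tuple[str, bool]],
                      min_accuracy: float) -> bool:
    """Gate for retraining or deployment: halt if accuracy on known labels drops."""
    correct = sum(predict(text) == label for text, label in golden)
    return correct / len(golden) >= min_accuracy


# Two fast reviewers outvote one careful reviewer 2-to-1
votes = [("fast_a", False), ("fast_b", False), ("careful_c", True)]
accuracy = {"fast_a": 0.60, "fast_b": 0.60, "careful_c": 0.95}
score = weighted_harmful_score(votes, accuracy)
print(f"unweighted harmful share: {1 / 3:.2f}, weighted: {score:.2f}")

# A model that has drifted into flagging nothing fails the golden-set gate
golden = [("spam link", True), ("threat", True), ("recipe", False), ("weather", False)]
print(passes_golden_set(lambda text: False, golden, min_accuracy=0.9))
```

In the example the unweighted majority says "not harmful", while the weighted score sides with the careful reviewer. The golden-set gate catches the permissive model regardless of what the feedback labels say, which is what breaks the loop.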
Case Study 3: The Feature Store Cascade
Context: A ride-sharing platform used a feature store to serve real-time features for pricing, matching, and fraud detection. The feature store was a shared dependency for multiple ML systems.
What happened: A deployment to the feature store introduced a memory leak. Under normal load, the leak was slow enough to be invisible. During a surge event (holiday weekend), the increased load accelerated the leak. Feature store latency increased, then pods started OOMing.
The cascade: Feature store slowdown caused pricing model timeouts, which caused fallback to static pricing, which underpriced rides in high-demand areas, which caused driver shortages, which caused even higher surge that the static pricing didn’t reflect.
The incident lasted 4 hours and cost millions in lost revenue and driver/rider dissatisfaction.
Root cause: Memory leak in feature store deployment, but the deeper cause was lack of isolation between critical systems and insufficient fallback testing under realistic load.
Lessons and changes:
1. Implemented resource isolation for critical feature paths
2. Added feature store chaos experiments (latency, partial failure)
3. Improved fallback pricing model to use recent data rather than static values
4. Created cascading failure runbooks with decision trees
5. Established feature store latency SLO with automatic traffic shedding
Case Study 4: The LLM Safety Incident
Context: A customer service platform deployed an LLM-based chatbot to handle initial customer inquiries before routing to humans.
What happened: A coordinated attack by external users discovered that certain prompt patterns could cause the chatbot to generate inappropriate content and reveal training data. The attack was shared on social media, and within hours, thousands of users were triggering the vulnerability.
The content safety filters caught some outputs, but the attack patterns were novel enough to bypass others. The team scrambled to add keyword filters, but attackers found workarounds faster than defenses could be deployed.
The incident required taking the chatbot offline for 48 hours while implementing robust prompt injection defenses.
Root cause: Insufficient adversarial testing before deployment, over-reliance on static content filters, no rate limiting on suspicious query patterns.
Lessons and changes:
1. Red team testing before any LLM deployment
2. Implemented query anomaly detection (unusual patterns trigger review)
3. Added rate limiting and circuit breakers for flagged users
4. Deployed multi-layer content filtering (pre-generation and post-generation)
5. Created a rapid response capability for novel attack patterns
Case Study 5: The 3AM Cascade
Context: A large e-commerce platform used LLM-powered product descriptions and recommendations. The inference service ran on a fleet of GPU servers behind a load balancer, with a 5-second timeout and automatic retry-on-failure.
What happened: At 3:07 AM on a holiday weekend, a GPU server started experiencing memory pressure from an unrelated batch job. Response times on that server increased from 200ms to 4.8 seconds—just under the timeout. Requests didn’t fail; they just became slow.
The load balancer treated “slow” as “available” and kept routing traffic. As more requests queued on the slow server, other servers saw their request rates drop. But here’s where it got worse: queueing pushed the slow server’s latency past the 5-second timeout, and the retry logic kicked in. Every timed-out request was retried, so each user request became two requests. The slow server couldn’t catch up; the retry storm spread to healthy servers.
By 3:45 AM, the entire inference cluster was saturated. Every request was timing out and retrying. The cascade triggered rate limits on the upstream API, which caused the search ranking service to fail, which caused the homepage to show error messages. On a holiday weekend. With the on-call engineer’s phone on silent.
Root cause: No circuit breakers between layers. The retry policy assumed failures were transient, not systemic. The load balancer couldn’t distinguish “slow” from “available.” And a single slow server could poison the entire fleet.
How they fixed it:
1. Implemented circuit breakers at every service boundary: fail fast, don’t retry into failures
2. Added latency-based load balancing (route away from slow servers, not just failed ones)
3. Set retry budgets: a request can trigger at most N total attempts across all layers
4. Built cascade detection into monitoring (alert if retry rate exceeds 10%)
5. Added “deployment freeze” monitoring for high-traffic periods
The takeaway: In distributed systems, retry is a form of amplification. Without circuit breakers, a single slow node can cascade into a total outage. Circuit breakers don’t just protect systems; they protect careers.
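A retry budget is one way to cap that amplification. The sketch below uses a simple credit scheme, which is an assumption on my part rather than the platform's actual implementation: every first attempt earns one credit, and a retry costs ten, so retries can never exceed roughly 10% of traffic. The `RetryBudget` class and its parameters are hypothetical.

```python
class RetryBudget:
    """Permit retries only as a bounded fraction of first attempts."""

    def __init__(self, requests_per_retry: int = 10, max_banked_retries: int = 5):
        self.cost = requests_per_retry
        # Cap the balance so a long quiet period cannot fund a retry storm
        self.cap = requests_per_retry * max_banked_retries
        self.credits = 0

    def record_request(self) -> None:
        """Call once for every first attempt."""
        self.credits = min(self.cap, self.credits + 1)

    def try_retry(self) -> bool:
        """Spend credits for one retry; False means fail fast instead."""
        if self.credits >= self.cost:
            self.credits -= self.cost
            return True
        return False


# Simulated total outage: 100 requests arrive and every one fails
budget = RetryBudget(requests_per_retry=10)
retries_granted = 0
for _ in range(100):
    budget.record_request()
    if budget.try_retry():      # each failed request asks for one retry
        retries_granted += 1

print(f"attempts with budget: {100 + retries_granted}")
print("attempts with unconditional retry: 200")
```

Integer credits are used deliberately: accumulating a fractional token such as 0.1 per request is subject to floating-point rounding, which can delay a retry that the budget should have allowed.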
Building a Reliability Culture
Reliability is Everyone’s Responsibility
The most common failure mode for AI reliability isn’t technical; it’s organizational. Reliability becomes “someone else’s job,” and incentives favor feature velocity over system resilience.
Signs of unhealthy reliability culture:
- Incidents are followed by blame, not learning
- Error budgets exist on paper but don’t influence decisions
- Fallbacks exist but are never tested
- On-call is punishment, not professional development
- Post-mortems are perfunctory or skipped
Signs of healthy reliability culture:
- Incidents are welcomed as learning opportunities
- Error budget status is visible and influences roadmap
- Chaos experiments run regularly
- On-call is well-supported and respected
- Post-mortems are thorough and action items are completed
Organizational Structures that Work
Embedded SRE model: Reliability engineers are embedded in product teams, combining deep domain knowledge with reliability expertise. Works well when teams are large enough to justify dedicated reliability focus.
Platform team model: A central platform team provides reliability primitives (monitoring, circuit breakers, deployment pipelines) that product teams use. Works well for consistency but requires strong API design and documentation.
Hybrid model: Central platform for common infrastructure, embedded specialists for domain-specific reliability concerns. Most large organizations evolve toward this.
Making the Business Case for Reliability
Reliability investments compete with feature development for engineering resources. Making the business case requires quantifying the cost of unreliability:
Direct costs:
- Revenue lost during outages
- Customer compensation/credits
- Incident response labor costs
Indirect costs:
- Customer churn from poor experiences
- Brand/reputation damage
- Engineering time diverted to firefighting
- Technical debt accumulated during quick fixes
Example calculation:
- Revenue per hour: $500,000
- Availability target: 99.9% (8.76 hours downtime/year allowed)
- Current availability: 99.5% (43.8 hours downtime/year)
- Gap: about 35 hours of additional downtime per year
- Annual cost of gap: 35 * $500,000 = $17.5 million
- Investment to close gap: $2 million
- ROI: 8.75x
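The same calculation as a small reusable function, so you can plug in your own numbers. The function names are illustrative, and the model is deliberately simple: it assumes revenue is lost uniformly for every hour of downtime and ignores the indirect costs listed above.

```python
HOURS_PER_YEAR = 24 * 365


def annual_downtime_hours(availability: float) -> float:
    """Hours of downtime per year implied by an availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


def reliability_roi(revenue_per_hour: float, current: float, target: float,
                    investment: float) -> tuple[float, float, float]:
    """Return (gap_hours, annual_cost_of_gap, roi_multiple)."""
    gap_hours = annual_downtime_hours(current) - annual_downtime_hours(target)
    annual_cost = gap_hours * revenue_per_hour
    return gap_hours, annual_cost, annual_cost / investment


gap, cost, roi = reliability_roi(
    revenue_per_hour=500_000, current=0.995, target=0.999, investment=2_000_000
)
print(f"gap: {gap:.2f} hours, annual cost: ${cost:,.0f}, ROI: {roi:.2f}x")
```

The unrounded figures are 35.04 hours, $17.52 million, and 8.76x; the example above rounds the gap to 35 hours before multiplying, which gives $17.5 million and 8.75x.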
Key Takeaways
Availability is necessary but not sufficient - AI systems can be “up” while producing useless or harmful outputs. Quality must be part of your reliability definition alongside uptime and latency.
Error budgets enable rational tradeoffs - Use budget consumption to balance feature velocity against reliability investment. Healthy budget means ship faster; exhausted budget means slow down.
Design multiple fallback levels - Graceful degradation with levels trading quality for resilience is better than binary success/failure. Plan for partial service when components fail.
Test your failure modes proactively - Chaos engineering and game days build confidence. Untested fallbacks are assumptions, not guarantees. Simulate incidents before they happen.
Incidents are learning opportunities - Blameless post-mortems and systematic follow-through on action items prevent recurrence. Culture matters as much as technical practices.
Summary
Reliability engineering for AI systems extends traditional SRE practices to address the unique challenges of machine learning: probabilistic outputs, silent degradation, data dependencies, and complex failure modes.
Core principles:
Availability is necessary but not sufficient. AI systems can be “up” while producing useless or harmful outputs. Quality must be part of your reliability definition.
SLOs should capture what users care about. Include model quality and safety SLOs alongside traditional availability and latency targets.
Error budgets enable rational tradeoffs. Use budget consumption to balance feature velocity against reliability investment.
Design for graceful degradation. Multiple fallback levels, each trading quality for resilience, are better than binary success/failure.
Test your failure modes. Chaos engineering and game days build confidence that systems behave correctly when components fail.
Incidents are learning opportunities. Blameless post-mortems and systematic follow-through on action items prevent recurrence.
Reliability is a team sport. Culture and organizational design matter as much as technical practices.
Two mental models tie this together. First, AI systems fail like biological systems: drift, diffuse symptoms, and cascading organ failures—so build clinic-style monitoring, not pass/fail alarms. Second, reliability is a budget, not a goal: stop chasing 100% and start spending error budget to buy velocity. Together they reframe reliability from a constraint to a discipline of resource allocation under uncertainty.
The companies that operate AI systems at scale have learned these lessons through painful experience. The goal of this chapter is to help you learn from their experience rather than repeating their mistakes.
Practical Exercises
Exercise 1: Define SLOs for an AI System
Choose an AI system you work on (or a hypothetical one: recommendation engine, search ranking, fraud detection, chatbot).
- Define at least one SLO in each category: availability, latency, quality, freshness
- For each SLO, specify the SLI (what you measure), target (what level), and window (over what period)
- Calculate the error budget for a 30-day window
- Identify proxy metrics you could use for real-time quality estimation
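As a starting point for the error budget step, here is a minimal sketch for a time-based SLO. The function names are hypothetical, and the sketch assumes the SLI is measured as "bad time" within the window; a request-based SLO would count bad requests instead.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed bad time within the window for a time-based SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     bad_time: timedelta) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1.0 - bad_time / error_budget(slo_target, window)


window = timedelta(days=30)
budget = error_budget(0.999, window)
print(budget.total_seconds() / 60)  # budget in minutes for a 99.9% target
print(budget_remaining(0.999, window, timedelta(minutes=10)))
```

For a 99.9% target over 30 days this gives 43.2 minutes of budget. A 10-minute incident would leave roughly 77% of it unspent.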
Exercise 2: Design a Degradation Hierarchy
For the same system:
- Enumerate the major failure modes (model, features, infrastructure, dependencies)
- Design at least 4 degradation levels, from full service to minimal fallback
- For each level, specify: entry conditions, exit conditions, expected quality, capacity requirements
- Write pseudo-code for the degradation controller
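One possible shape for the degradation controller is sketched below. Everything in it is an illustrative assumption rather than a prescribed design: the five level names, the four boolean health signals, and the rule that recovery needs a fixed number of consecutive healthy checks. The sketch degrades immediately when the current serving path becomes unhealthy and recovers one healthy level at a time.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    FULL = 0      # primary model, all features
    REDUCED = 1   # simpler or smaller fallback model
    CACHED = 2    # recent responses for similar queries
    STATIC = 3    # pre-computed responses for common cases
    MINIMAL = 4   # acknowledge, queue, or hand off to a human


@dataclass
class Health:
    primary_model_ok: bool
    fallback_model_ok: bool
    cache_ok: bool
    static_ok: bool


def healthy_levels(h: Health) -> set:
    """Levels whose serving path is currently usable."""
    levels = {Level.MINIMAL}  # assumed to have no dependencies
    if h.primary_model_ok:
        levels.add(Level.FULL)
    if h.fallback_model_ok:
        levels.add(Level.REDUCED)
    if h.cache_ok:
        levels.add(Level.CACHED)
    if h.static_ok:
        levels.add(Level.STATIC)
    return levels


class DegradationController:
    def __init__(self, recovery_checks: int = 3):
        self.level = Level.FULL
        self.recovery_checks = recovery_checks
        self._stable = 0  # consecutive checks with a better level available

    def update(self, h: Health) -> Level:
        healthy = healthy_levels(h)
        if self.level not in healthy:
            # Entry condition: current path is down, so drop to the
            # nearest lower-quality level that is still healthy.
            self.level = min(l for l in healthy if l > self.level)
            self._stable = 0
            return self.level
        better = [l for l in healthy if l < self.level]
        if not better:
            self._stable = 0
            return self.level
        # Exit condition: a better level has been healthy for enough
        # consecutive checks, so step up to the nearest one.
        self._stable += 1
        if self._stable >= self.recovery_checks:
            self.level = max(better)
            self._stable = 0
        return self.level


controller = DegradationController(recovery_checks=2)
print(controller.update(Health(False, True, True, True)).name)
```

A real controller would also need per-level capacity checks and hysteresis tuned to how noisy the health signals are. The step-by-step recovery here exists only to show stability being verified before returning to full service.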
Exercise 3: Plan a Chaos Experiment
Design a chaos experiment for your system:
- State your hypothesis about system behavior under failure
- Define the failure injection (what will you break?)
- Specify success criteria (how do you know the experiment passed?)
- Define blast radius and rollback procedure
- Create a pre-experiment checklist
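One way to keep an experiment plan honest is to write it down as data and refuse to start while anything is missing. The sketch below is a hypothetical structure, not a standard tool: the class name, field names, and checklist items are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ChaosExperiment:
    hypothesis: str = ""
    failure_injection: str = ""
    success_criteria: list = field(default_factory=list)
    blast_radius: str = ""
    rollback_procedure: str = ""
    checklist: dict = field(default_factory=dict)  # item -> done?

    def blockers(self) -> list:
        """Reasons the experiment should not start yet; empty means ready."""
        problems = []
        for name in ("hypothesis", "failure_injection",
                     "blast_radius", "rollback_procedure"):
            if not getattr(self, name).strip():
                problems.append(f"missing {name}")
        if not self.success_criteria:
            problems.append("missing success_criteria")
        problems += [f"checklist item not done: {item}"
                     for item, done in self.checklist.items() if not done]
        return problems


plan = ChaosExperiment(
    hypothesis="If the feature store is unreachable, the service falls "
               "back to cached features with <1% request failures.",
    failure_injection="Block traffic to the feature store for one replica",
    success_criteria=["request failure rate < 1%", "fallback path served"],
    blast_radius="One replica in staging",
    rollback_procedure="Remove the network block; restart the replica",
    checklist={"system healthy": True, "no recent deploys": True,
               "stakeholders notified": False},
)
print(plan.blockers())
```

With the example plan above, the only blocker reported is the unfinished "stakeholders notified" checklist item.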
Exercise 4: Incident Response Simulation
Walk through an incident response for this scenario:
“It’s Tuesday at 3 PM. An alert fires indicating that your model’s prediction confidence has dropped 25% over the past hour. Business metrics haven’t moved yet, but the trend is concerning.”
- What’s your initial severity assessment?
- What data would you gather in the first 15 minutes?
- What are your mitigation options?
- Who would you involve and when?
- Write the key sections of a post-mortem assuming you discovered the root cause was a feature pipeline bug.
Self-Assessment Checkpoint
Conceptual Questions
Q1. [Senior] What’s the difference between an SLI, SLO, SLA, and error budget? How do they relate to each other?
Answer
SLI (Service Level Indicator): What you measure, a metric representing service health. Example: “Proportion of requests with latency < 200ms.” SLO (Service Level Objective): Your internal target for the SLI. Example: “99.9% of requests should have latency < 200ms over a 30-day window.” SLA (Service Level Agreement): External commitment with consequences (usually financial). Example: “We guarantee 99.5% uptime; if violated, customers get credits.” Error budget: The permitted failure rate derived from the SLO. For 99.9% availability, error budget = 0.1% = 43.2 minutes of downtime per 30 days. Relationship: SLI measures reality, SLO sets your target, error budget is the slack between perfect and target. SLA is typically less stringent than SLO (you want buffer). Error budget gets “spent” on incidents, deploys, and experiments.
Q2. [Senior] Explain the “error budget” concept. How does it change how teams make decisions?
Answer
Error budget is the acceptable amount of unreliability implied by your SLO. If the SLO is 99.9% availability, you can afford 0.1% downtime. This changes decision-making: (1) Velocity vs. stability trade-off becomes concrete: if budget is healthy, ship features faster. If budget is exhausted, slow down and focus on reliability. (2) Risk quantification: a risky deployment that might cause 10 minutes of issues can be evaluated against remaining budget. (3) Alignment: product and engineering agree on the trade-off explicitly. (4) Accountability: if you burn through budget, there are consequences (slow down deploys, reliability sprint). (5) Resource allocation: if you’re always exhausting budget, invest in reliability. If you never use it, maybe the SLO is too loose. Key insight: Some unreliability is acceptable and planned for, not a failure.
Q3. [Staff] How do you define quality SLOs for an AI system where “correctness” is subjective and ground truth is often unknown?
Answer
Approaches: (1) Proxy metrics: measurable signals that correlate with quality. User engagement (clicks, time spent), explicit feedback (thumbs up/down), regeneration rate, session completion. (2) Canary evaluation: run sample traffic through LLM-as-judge, track score distribution. (3) Consistency metrics: for same/similar inputs, how consistent are outputs? Sudden inconsistency suggests quality issues. (4) Business metrics: downstream metrics (conversion, retention) that quality impacts, with lag. (5) Benchmark slices: maintain golden set of examples with known-good answers, continuously evaluate. (6) Regression detection: statistical tests for output distribution changes. Setting SLOs: Target based on historical baselines + acceptable variance. “Quality proxy score P50 > 4.2” or “Regeneration rate < 15%.” Review and calibrate regularly; quality SLOs need more tuning than latency SLOs.
Q4. [Staff] Describe the concept of “graceful degradation” for an AI system. How do you design degradation levels?
Answer
Graceful degradation: When components fail, reduce capability gradually rather than failing completely. Users get reduced service instead of no service. Design levels: (1) Full service: everything works. (2) Reduced quality: fall back to simpler/smaller model, faster but less capable. (3) Cached responses: serve recent results for similar queries. (4) Static fallback: pre-computed responses for common cases. (5) Minimal service: acknowledgment, queue for later, or redirect to human. For each level, define: Entry conditions (what triggers), exit conditions (when to recover), quality impact, capacity impact. Implementation: Circuit breakers detect failures. Degradation controller decides level based on health signals. Each level has its own serving path. Recovery is gradual (don’t immediately return to full service; verify stability first).
Q5. [Staff] How do you approach post-mortem analysis for an AI system incident where the model started producing poor outputs but no infrastructure failed?
Answer
AI-specific investigation: (1) What changed? Model update, prompt change, feature pipeline change, upstream data change? The timeline is crucial. (2) Input drift: did the input distribution change (e.g., new user segment, seasonal pattern, external event)? Compare recent vs. historical input embeddings. (3) Feature issues: did any feature start returning nulls, stale values, or anomalous distributions? (4) Dependency changes: did an upstream service change its API or data format? (5) Model confidence: were confidence scores low, indicating the model was uncertain? (6) A/B test interaction: was a conflicting experiment running? (7) Resource pressure: was the model under load that caused degraded behavior (timeouts, truncation)? Post-mortem should include: detection timeline (how long until noticed?), impact quantification, root cause with evidence, why existing monitoring didn’t catch it, specific action items with owners/dates.
Spot the Problem
Problem 1. [Senior] SLO definition:
"SLO: 99.99% availability"
Answer
Incomplete definition: (1) What counts as “available”? Need SLI definition. Is a slow response “available”? A wrong response? (2) What’s the measurement window? 99.99% over a year is different from over a day. (3) What traffic counts? All requests? Excluding health checks? Excluding known bad clients? (4) How measured? Server-side vs. client-side availability differ (network issues affect client-side only). Complete SLO: “99.99% of user-facing requests (excluding health checks) return a valid response with status 2xx within 500ms, measured over a 30-day rolling window from client-side synthetic monitoring.”
Problem 2. [Staff] Incident response:
"We noticed latency spikes at 2 PM. We restarted the service at 2:15 PM
and things got better. Closing the incident."
Answer
Problems: (1) No root cause: why did latency spike? Will it happen again? “Restart fixed it” is not understanding. (2) No impact quantification: how many users affected? What was the error budget cost? (3) No detection analysis: how was it noticed? Alert or luck? (4) No follow-up actions: what prevents recurrence? (5) No post-mortem: team doesn’t learn. Restart may have masked the real problem (memory leak? queue buildup? resource exhaustion?). The underlying issue will likely recur. Proper closure requires: Confirmed root cause, impact measured, detection improved if needed, preventive action items with owners.
Problem 3. [Staff] Chaos engineering plan:
"Tomorrow we'll shut down the primary database to test our failover."
Answer
Problems: (1) No hypothesis: what do you expect to happen? Without a hypothesis, you can’t evaluate results. (2) No blast radius limitation: “primary database” affects everything. Start smaller. (3) No rollback plan: what if failover doesn’t work? (4) No success criteria: how do you know if it passed? (5) No stakeholder communication: does everyone know? Customer support? Leadership? (6) No pre-checks: is the system healthy now? Recent deploys? Planned events? Proper chaos experiment: (1) Hypothesis: “If primary DB fails, secondary takes over within 30s with <1% request failures.” (2) Start with smaller scope (single table, read traffic only). (3) Defined rollback. (4) Communication plan. (5) Run during low-traffic hours. (6) Monitoring in place to detect issues immediately.
Design Exercises
Exercise 1. [Staff] Design the reliability architecture for an AI recommendation system serving 10M users. The system must: maintain 99.9% availability, degrade gracefully under partial failures, and detect quality regressions within 5 minutes. Include SLOs, degradation strategy, and monitoring approach.
Guidance
SLOs: (1) Availability: 99.9% of requests return valid recommendations within 500ms. (2) Latency: P99 < 500ms, P50 < 100ms. (3) Quality: CTR delta from baseline < 5% (measured via A/B holdout). (4) Freshness: Model updated within 24 hours of new training data. Degradation levels: (1) Full: ML model + personalization. (2) Reduced: ML model, no personalization (simpler, faster). (3) Cached: user’s recent recommendations. (4) Popular: globally popular items (no ML). (5) Empty state: graceful “no recommendations” UX. Quality monitoring: (1) Real-time proxy: user engagement signals (clicks, dwell time) vs. baseline. (2) Canary evaluation: 1% traffic through quality scorer. (3) A/B holdout: continuous comparison to control. Alert if any deviates >2 standard deviations for 5 minutes.
Exercise 2. [Staff] Design an incident response process for an AI platform team supporting 50 AI products. Consider: severity classification, escalation paths, communication protocols, and post-mortem process. How do you balance standardization with product-specific needs?
Guidance
Severity classification: (1) SEV1: complete outage, >50% users affected, revenue impact. (2) SEV2: major degradation, specific products affected. (3) SEV3: partial degradation, workarounds available. (4) SEV4: minor issues, no immediate user impact. Escalation: for SEV1, immediate all-hands and executive notification within 15 min; for SEV2, on-call + product owner within 30 min; for SEV3/4, on-call handles and escalates if needed. Communication: (1) Status page for external. (2) Internal chat channel per incident. (3) Periodic updates (every 30 min for SEV1/2). (4) Post-incident email to stakeholders. Post-mortem: Required for SEV1/2, optional for SEV3. Blameless. Template: timeline, impact, root cause, action items with owners/dates. Review meeting within 1 week. Action items tracked to completion. Standardization vs. flexibility: Common process/templates/tools. Product-specific: severity thresholds, escalation contacts, product-specific runbooks.
Connections to Other Chapters
Reliability engineering connects to infrastructure and operations throughout the book:
Chapter 9 (LLM Deployment & Infrastructure): Deployment patterns that affect reliability, including load balancing and failover.
Chapter 12 (Cloud AI Deployment): Cloud-specific reliability features like availability zones, managed failover, and provider SLAs.
Chapter 25 (System Design at Scale): Distributed system patterns that require reliability considerations.
Chapter 32 (Cost Engineering): Tradeoffs between reliability investments and cost, including redundancy costs.
Appendix F (Debugging & Troubleshooting): Systematic diagnosis techniques for production incidents.
Further Reading
Essential
- Beyer et al. (2016), “Site Reliability Engineering” - The foundational SRE text. Focus on SLOs, error budgets, and incident response.
- Chen et al. (2022), “Reliable Machine Learning” - Extends SRE to ML systems with Google case studies.
- Sculley et al. (2015), “Hidden Technical Debt in Machine Learning Systems” - Classic on ML-specific failure modes.
Deep Dives
- Rosenthal et al. (2020), “Chaos Engineering” - Comprehensive guide from the Netflix team that pioneered the discipline.
- Allspaw, “Etsy’s Debriefing Facilitation Guide” - Excellent resource on blameless post-mortems.