Incident Response Runbook

Incident Severity Levels

| Severity | Definition | Response Time | Examples |
|----------|------------|---------------|----------|
| SEV1 | Complete outage, all users affected | < 15 min | API down, data breach |
| SEV2 | Major degradation, many users affected | < 30 min | High error rate, severe latency |
| SEV3 | Partial degradation, some users affected | < 2 hrs | Single region down, quality drop |
| SEV4 | Minor issue, minimal impact | < 24 hrs | Intermittent errors, slow queries |
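
The response-time targets in the severity table can be turned into a deadline check for paging or escalation tooling. The following is a minimal sketch, not part of the original runbook: the names `RESPONSE_TARGETS`, `response_deadline`, and `is_response_overdue` are hypothetical, and it assumes the clock starts when the alert fires.

```python
from datetime import datetime, timedelta

# Response-time targets copied from the severity table.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": timedelta(hours=24),
}


def response_deadline(severity: str, alert_fired_at: datetime) -> datetime:
    """Return the time by which a responder should have engaged."""
    return alert_fired_at + RESPONSE_TARGETS[severity]


def is_response_overdue(severity: str, alert_fired_at: datetime, now: datetime) -> bool:
    """True once the response-time target for this severity has passed."""
    return now >= response_deadline(severity, alert_fired_at)
```

A scheduler could call `is_response_overdue` every minute and escalate to the next on-call tier when it returns `True`.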

Runbook 1: High Latency / Timeout Spike

Symptoms

  • P95 latency > 2x normal
  • Timeout errors increasing
  • User complaints about slow responses
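
The first symptom compares the current P95 latency against a known baseline. As a rough illustration, assuming raw latency samples are available and the baseline P95 is tracked elsewhere, the check could look like the sketch below. The function names and the nearest-rank percentile method are choices made for this example, not something the runbook prescribes.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_symptom(samples: list[float], baseline_p95: float, factor: float = 2.0) -> bool:
    """True when the current P95 exceeds `factor` times the baseline P95."""
    return p95(samples) > factor * baseline_p95
```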

Immediate Checks (< 5 min)

# 1. Check current latency metrics
curl -s $METRICS_ENDPOINT/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(request_latency_seconds_bucket[5m])))'

# 2. Check GPU utilization
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv

# 3. Check queue depth
curl -s $VLLM_ENDPOINT/metrics | grep "queue_size"

# 4. Check provider status pages
# - status.anthropic.com
# - status.openai.com

Decision Tree

Is GPU utilization > 90%?
├─ YES → Scale up or enable overflow routing
└─ NO → Is queue depth high?
         ├─ YES → Check for stuck requests, restart workers
         └─ NO → Is provider having issues?
                  ├─ YES → Enable fallback provider
                  └─ NO → Check network/load balancer
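
The latency decision tree can be encoded as a small function so that an alert handler suggests the same next step every time. This is a sketch under assumptions: the tree does not define what counts as a "high" queue depth, so `queue_depth_limit` (default 100) is a hypothetical threshold, and the function name is illustrative.

```python
def latency_next_step(
    gpu_utilization_pct: float,
    queue_depth: int,
    provider_degraded: bool,
    queue_depth_limit: int = 100,  # hypothetical: tune to your normal queue depth
) -> str:
    """Walk the latency decision tree from top to bottom."""
    if gpu_utilization_pct > 90:
        return "Scale up or enable overflow routing"
    if queue_depth > queue_depth_limit:
        return "Check for stuck requests, restart workers"
    if provider_degraded:
        return "Enable fallback provider"
    return "Check network/load balancer"
```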

Mitigation Actions

| Action | Command/Steps | When to Use |
|--------|---------------|-------------|
| Enable overflow routing | Set OVERFLOW_THRESHOLD=0.7 | GPU > 85% |
| Activate fallback provider | Set FALLBACK_ENABLED=true | Provider issues |
| Scale up inference pool | kubectl scale deployment vllm --replicas=<current + 2> (kubectl scale takes an absolute count, not a relative +2) | Sustained load |
| Reduce batch size | Set MAX_BATCH_SIZE=16 | Memory pressure |
| Enable request shedding | Set SHED_RATE=0.1 | Emergency only |

Resolution Checklist


Runbook 2: High Error Rate

Symptoms

  • Error rate > 1%
  • 4xx/5xx responses increasing
  • Failed request alerts firing

Immediate Checks (< 5 min)

# 1. Check error breakdown by type
curl -s $METRICS_ENDPOINT/api/v1/query \
  --data-urlencode 'query=sum by (status_code) (rate(http_requests_total[5m]))'

# 2. Check recent error logs
kubectl logs -l app=api-gateway --since=5m | grep -i error | tail -50

# 3. Check model health
curl -s $VLLM_ENDPOINT/health

# 4. Check rate limit status
redis-cli GET "ratelimit:global:count"

Error Type Decision Tree

What's the primary error type?

400 Bad Request
└─ Check: Input validation, schema changes, client bugs

401/403 Auth Errors
└─ Check: API key rotation, permission changes, token expiry

429 Rate Limited
└─ Check: Traffic spike, abuse, limit misconfiguration

500 Internal Error
└─ Check: Model OOM, GPU failure, code bug

502/503 Upstream Error
└─ Check: Provider outage, network issues, health checks

504 Timeout
└─ Check: Long-running requests, resource exhaustion
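
Once per-status-code request counts are in hand, the error rate and the dominant error type can be derived mechanically. The sketch below assumes the counts arrive as a plain `{status_code: count}` dictionary; `CHECKS_BY_STATUS`, `error_rate`, and `primary_error` are hypothetical names, and the check lists are copied from the decision tree.

```python
from typing import Optional

# Checks copied from the error type decision tree.
CHECKS_BY_STATUS = {
    400: "Input validation, schema changes, client bugs",
    401: "API key rotation, permission changes, token expiry",
    403: "API key rotation, permission changes, token expiry",
    429: "Traffic spike, abuse, limit misconfiguration",
    500: "Model OOM, GPU failure, code bug",
    502: "Provider outage, network issues, health checks",
    503: "Provider outage, network issues, health checks",
    504: "Long-running requests, resource exhaustion",
}


def error_rate(status_counts: dict[int, int]) -> float:
    """Fraction of requests that returned a 4xx or 5xx status."""
    total = sum(status_counts.values())
    errors = sum(n for code, n in status_counts.items() if code >= 400)
    return errors / total if total else 0.0


def primary_error(status_counts: dict[int, int]) -> Optional[tuple[int, str]]:
    """Return the most frequent error code and what to check for it."""
    errors = {code: n for code, n in status_counts.items() if code >= 400 and n > 0}
    if not errors:
        return None
    code = max(errors, key=errors.get)
    return code, CHECKS_BY_STATUS.get(code, "Not in the decision tree; inspect logs")
```

With counts of `{200: 980, 429: 15, 500: 5}`, the error rate is 2% and the primary error is 429, which points at the rate-limiting checks.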

Mitigation Actions

| Error Type | Action | Command |
|------------|--------|---------|
| 429 spike | Increase limits temporarily | redis-cli SET ratelimit:rpm 1000 |
| 500 OOM | Restart workers | kubectl rollout restart deployment/vllm |
| 502 provider | Enable fallback | kubectl set env deployment/api FALLBACK=true |
| 503 overload | Enable circuit breaker | Set CIRCUIT_BREAKER=true |

Runbook 3: Quality Degradation / Hallucination Outbreak

Symptoms

  • User reports of incorrect answers
  • Evaluation scores dropping
  • Hallucination rate > 2%

Immediate Checks (< 15 min)

# 1. Check recent eval scores
psql -c "SELECT date, avg(score) FROM evals
         WHERE date > now() - interval '24 hours'
         GROUP BY date ORDER BY date"

# 2. Sample recent responses for manual review
psql -c "SELECT query, response FROM requests
         WHERE created_at > now() - interval '1 hour'
         ORDER BY random() LIMIT 10"

# 3. Check if retrieval is working (expect a non-zero result count)
curl -s -X POST $RAG_ENDPOINT/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "test query", "top_k": 5}' | jq '.results | length'

# 4. Check embedding model health
curl -s $EMBEDDING_ENDPOINT/health

Root Cause Decision Tree

Are retrieval results relevant?
├─ NO → Check embedding model, index freshness, query preprocessing
└─ YES → Are citations being used?
          ├─ NO → Check prompt template, context injection
          └─ YES → Is model following instructions?
                   ├─ NO → Check for prompt drift, model version change
                   └─ YES → Likely edge case - add to eval set
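
The three questions in the root cause tree are usually answered by a human reviewing sampled responses. A short function, sketched here with a hypothetical name and boolean inputs, keeps the mapping from answers to next steps consistent across reviewers.

```python
def quality_next_step(
    retrieval_relevant: bool,
    citations_used: bool,
    follows_instructions: bool,
) -> str:
    """Map a reviewer's answers to the next step in the root cause tree."""
    if not retrieval_relevant:
        return "Check embedding model, index freshness, query preprocessing"
    if not citations_used:
        return "Check prompt template, context injection"
    if not follows_instructions:
        return "Check for prompt drift, model version change"
    return "Likely edge case - add to eval set"
```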

Mitigation Actions

| Root Cause | Action | Steps |
|------------|--------|-------|
| Stale index | Trigger reindex | Run indexing pipeline |
| Embedding drift | Rollback embedding model | Deploy previous version |
| Prompt regression | Rollback prompt version | Restore from prompt versioning |
| Model behavior change | Switch model version | Update model config |
| Missing context | Increase retrieval K | Set TOP_K=10 |

Quality Gate Actions

# Emergency quality gate - block responses below threshold
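def apply_quality_gate(response, eval_score, quality_threshold):
    if eval_score < quality_threshold:
        return {
            "response": "I'm not confident in my answer. Please contact support.",
            "flagged": True,
            "reason": "quality_gate",
        }
    # Score meets the threshold: pass the response through unchanged
    # (the pass-through shape is an assumption; adapt it to your API)
    return {"response": response, "flagged": False, "reason": None}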

Runbook 4: Cost Spike

Symptoms

  • Daily spend > 120% of budget
  • Unexpected provider invoice
  • Cost alerts firing

Immediate Checks (< 15 min)

# 1. Check token usage by endpoint
psql -c "SELECT endpoint, sum(input_tokens), sum(output_tokens),
         sum(cost_usd) FROM requests
         WHERE date = current_date GROUP BY endpoint"

# 2. Check for unusual traffic patterns
psql -c "SELECT user_id, count(*), sum(cost_usd) FROM requests
         WHERE date = current_date
         GROUP BY user_id ORDER BY sum(cost_usd) DESC LIMIT 10"

# 3. Check cache hit rate
# Redis reports raw counters, not a ratio: hit rate = hits / (hits + misses)
redis-cli INFO stats | grep -E 'keyspace_(hits|misses)'

# 4. Check average tokens per request (looking for prompt bloat)
psql -c "SELECT avg(input_tokens), avg(output_tokens) FROM requests
         WHERE date = current_date"
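
Redis `INFO stats` exposes `keyspace_hits` and `keyspace_misses` counters rather than a precomputed ratio, so the hit rate has to be derived. The sketch below parses the text output of `redis-cli INFO stats`; the function names are hypothetical. Note that these counters cover every key lookup on the instance since the last restart or `CONFIG RESETSTAT`, so they only approximate the response cache's hit rate if the cache shares the instance with other data.

```python
from typing import Optional


def parse_info_stats(info_text: str) -> dict[str, str]:
    """Parse `key:value` lines from Redis INFO output, skipping section headers."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def cache_hit_rate(info_text: str) -> Optional[float]:
    """Return hits / (hits + misses), or None if there were no lookups."""
    fields = parse_info_stats(info_text)
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None
```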

Cost Spike Decision Tree

Is traffic volume normal?
├─ NO (higher) → Check for abuse, viral usage, bot traffic
└─ YES → Is cost per request higher?
          ├─ YES → Check model routing, prompt size, cache misses
          └─ NO → Check for billing anomaly, pricing change
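
The cost spike tree separates volume-driven spikes from per-request cost increases. A sketch of that comparison follows, assuming today's totals and a baseline (for example, a trailing daily average) are available. The `tolerance` parameter is a hypothetical threshold for what counts as "higher"; the 20% default mirrors the 120% budget symptom but is not specified by the tree.

```python
def cost_next_step(
    requests_today: int,
    cost_today: float,
    baseline_requests: int,
    baseline_cost: float,
    tolerance: float = 0.2,  # hypothetical: how far above baseline counts as "higher"
) -> str:
    """Walk the cost spike decision tree using totals and a baseline."""
    if requests_today > baseline_requests * (1 + tolerance):
        return "Check for abuse, viral usage, bot traffic"
    cost_per_request = cost_today / requests_today if requests_today else 0.0
    baseline_cpr = baseline_cost / baseline_requests if baseline_requests else 0.0
    if cost_per_request > baseline_cpr * (1 + tolerance):
        return "Check model routing, prompt size, cache misses"
    return "Check for billing anomaly, pricing change"
```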

Mitigation Actions

| Cause | Action | Implementation |
|-------|--------|----------------|
| Abuse/bot traffic | Block suspicious users | Add to blocklist |
| Cache misses | Increase cache TTL | Set CACHE_TTL=3600 |
| Prompt bloat | Audit and trim prompts | Review recent prompt changes |
| Model misrouting | Fix routing logic | Check model selection code |
| High-cost model overuse | Adjust routing thresholds | Lower quality threshold |

Emergency Cost Controls

# Emergency spend limit
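def emergency_cost_controls(daily_spend, daily_budget, normal_limit):
    """Return config overrides once spend exceeds 120% of budget, else {}."""
    if daily_spend <= daily_budget * 1.2:
        return {}
    # The options below are alternatives; apply whichever your service supports.
    # The override key names are illustrative.
    overrides = {
        # Option 1: Switch to cheaper model
        "MODEL": "claude-3-5-haiku",
        # Option 2: Enable aggressive caching
        "CACHE_EVERYTHING": True,
        # Option 3: Rate limit all users
        "GLOBAL_RATE_LIMIT": normal_limit * 0.5,
    }
    # Option 4: Graceful degradation message above 150% of budget
    if daily_spend > daily_budget * 1.5:
        overrides["DEGRADED_MESSAGE"] = "Service temporarily limited. Please try again later."
    return overrides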

Runbook 5: Security Incident

Symptoms

  • Unusual API access patterns
  • Prompt injection attempts detected
  • Data exfiltration alerts
  • User reports of unexpected behavior

Immediate Actions (< 5 min)

# 1. Check for injection attempts
grep -i "ignore previous\|system prompt\|you are now" /var/log/api/requests.log | tail -100

# 2. Check for data exfiltration patterns
# Parentheses are required: AND binds tighter than OR in SQL
psql -c "SELECT user_id, query FROM requests
         WHERE (response LIKE '%API_KEY%' OR response LIKE '%password%')
         AND date = current_date"

# 3. Check rate limit violations
# --scan iterates incrementally; KEYS blocks the server on large keyspaces
redis-cli --scan --pattern "blocked:*" | wc -l

# 4. Review high-risk user activity
psql -c "SELECT * FROM requests
         WHERE risk_score > 0.5 AND date = current_date"
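
The same injection phrases used in the grep check can be matched in application code, for instance to flag requests as they are logged. This is only a sketch with hypothetical names: a fixed phrase list catches the crudest attempts and is easy to evade, so treat a match as a signal for review rather than a complete defense.

```python
import re

# Same phrases as the grep check, matched case-insensitively.
INJECTION_PATTERN = re.compile(
    r"ignore previous|system prompt|you are now", re.IGNORECASE
)


def find_injection_attempts(lines, limit=100):
    """Return up to `limit` of the most recent lines that match a known phrase."""
    matches = [line.rstrip("\n") for line in lines if INJECTION_PATTERN.search(line)]
    return matches[-limit:] if limit > 0 else []
```

Because `lines` can be any iterable of strings, an open log file can be passed directly: `find_injection_attempts(open("/var/log/api/requests.log"))`.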

Security Response Actions

| Threat | Immediate Action | Follow-up |
|--------|------------------|-----------|
| Active injection attack | Block user/IP | Analyze attack vector |
| Data leak detected | Rotate exposed credentials | Audit all responses |
| Credential compromise | Rotate API keys | Review access logs |
| Abuse detected | Enable strict rate limits | Implement additional validation |

Containment Checklist


Post-Incident Process

Incident Documentation Template

# Incident Report: [TITLE]

## Summary
- **Severity**: SEV[1-4]
- **Duration**: [START] to [END]
- **Impact**: [Users/requests affected]

## Timeline
- HH:MM - Alert fired
- HH:MM - Incident declared
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Resolved

## Root Cause
[Description of what caused the incident]

## Resolution
[What was done to fix it]

## Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]
- [ ] [Detection improvement]

## Lessons Learned
[What we learned and how to prevent recurrence]

Post-Incident Checklist