Incident Response Runbook

Incident Severity Levels

Severity	Definition	Response Time	Examples
SEV1	Complete outage, all users affected	< 15 min	API down, data breach
SEV2	Major degradation, many users affected	< 30 min	High error rate, severe latency
SEV3	Partial degradation, some users affected	< 2 hrs	Single region down, quality drop
SEV4	Minor issue, minimal impact	< 24 hrs	Intermittent errors, slow queries

Runbook 1: High Latency / Timeout Spike

Symptoms

P95 latency > 2x normal
Timeout errors increasing
User complaints about slow responses

Immediate Checks (< 5 min)

# 1. Check current latency metrics {.unnumbered}
curl -s $METRICS_ENDPOINT/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, rate(request_latency_seconds_bucket[5m]))'

# 2. Check GPU utilization {.unnumbered}
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv

# 3. Check queue depth {.unnumbered}
curl -s $VLLM_ENDPOINT/metrics | grep "queue_size"

# 4. Check provider status pages {.unnumbered}
# - status.anthropic.com {.unnumbered}
# - status.openai.com {.unnumbered}

Decision Tree

Is GPU utilization > 90%?
├─ YES → Scale up or enable overflow routing
└─ NO → Is queue depth high?
         ├─ YES → Check for stuck requests, restart workers
         └─ NO → Is provider having issues?
                  ├─ YES → Enable fallback provider
                  └─ NO → Check network/load balancer

Mitigation Actions

Action	Command/Steps	When to Use
Enable overflow routing	Set `OVERFLOW_THRESHOLD=0.7`	GPU > 85%
Activate fallback provider	Set `FALLBACK_ENABLED=true`	Provider issues
Scale up inference pool	`kubectl scale deployment vllm --replicas=+2`	Sustained load
Reduce batch size	Set `MAX_BATCH_SIZE=16`	Memory pressure
Enable request shedding	Set `SHED_RATE=0.1`	Emergency only

Runbook 2: High Error Rate

Symptoms

Error rate > 1%
4xx/5xx responses increasing
Failed request alerts firing

Immediate Checks (< 5 min)

# 1. Check error breakdown by type {.unnumbered}
curl -s $METRICS_ENDPOINT/api/v1/query \
  --data-urlencode 'query=sum by (status_code) (rate(http_requests_total[5m]))'

# 2. Check recent error logs {.unnumbered}
kubectl logs -l app=api-gateway --since=5m | grep -i error | tail -50

# 3. Check model health {.unnumbered}
curl -s $VLLM_ENDPOINT/health

# 4. Check rate limit status {.unnumbered}
redis-cli GET "ratelimit:global:count"

Error Type Decision Tree

What's the primary error type?

400 Bad Request
└─ Check: Input validation, schema changes, client bugs

401/403 Auth Errors
└─ Check: API key rotation, permission changes, token expiry

429 Rate Limited
└─ Check: Traffic spike, abuse, limit misconfiguration

500 Internal Error
└─ Check: Model OOM, GPU failure, code bug

502/503 Upstream Error
└─ Check: Provider outage, network issues, health checks

504 Timeout
└─ Check: Long-running requests, resource exhaustion

Mitigation Actions

Error Type	Action	Command
429 spike	Increase limits temporarily	`redis-cli SET ratelimit:rpm 1000`
500 OOM	Restart workers	`kubectl rollout restart deployment/vllm`
502 provider	Enable fallback	`kubectl set env deployment/api FALLBACK=true`
503 overload	Enable circuit breaker	Set `CIRCUIT_BREAKER=true`

Runbook 3: Quality Degradation / Hallucination Outbreak

Symptoms

User reports of incorrect answers
Evaluation scores dropping
Hallucination rate > 2%

Immediate Checks (< 15 min)

# 1. Check recent eval scores {.unnumbered}
psql -c "SELECT date, avg(score) FROM evals
         WHERE date > now() - interval '24 hours'
         GROUP BY date ORDER BY date"

# 2. Sample recent responses for manual review {.unnumbered}
psql -c "SELECT query, response FROM requests
         WHERE created_at > now() - interval '1 hour'
         ORDER BY random() LIMIT 10"

# 3. Check if retrieval is working {.unnumbered}
curl -X POST $RAG_ENDPOINT/search \
  -d '{"query": "test query", "top_k": 5}' | jq '.results | length'

# 4. Check embedding model health {.unnumbered}
curl -s $EMBEDDING_ENDPOINT/health

Root Cause Decision Tree

Are retrieval results relevant?
├─ NO → Check embedding model, index freshness, query preprocessing
└─ YES → Are citations being used?
          ├─ NO → Check prompt template, context injection
          └─ YES → Is model following instructions?
                   ├─ NO → Check for prompt drift, model version change
                   └─ YES → Likely edge case - add to eval set

Mitigation Actions

Root Cause	Action	Steps
Stale index	Trigger reindex	Run indexing pipeline
Embedding drift	Rollback embedding model	Deploy previous version
Prompt regression	Rollback prompt version	Restore from prompt versioning
Model behavior change	Switch model version	Update model config
Missing context	Increase retrieval K	Set `TOP_K=10`

Quality Gate Actions

# Emergency quality gate - block responses below threshold {.unnumbered}
if eval_score < QUALITY_THRESHOLD:
    return {
        "response": "I'm not confident in my answer. Please contact support.",
        "flagged": True,
        "reason": "quality_gate"
    }

Runbook 4: Cost Spike

Symptoms

Daily spend > 120% of budget
Unexpected provider invoice
Cost alerts firing

Immediate Checks (< 15 min)

# 1. Check token usage by endpoint {.unnumbered}
psql -c "SELECT endpoint, sum(input_tokens), sum(output_tokens),
         sum(cost_usd) FROM requests
         WHERE date = current_date GROUP BY endpoint"

# 2. Check for unusual traffic patterns {.unnumbered}
psql -c "SELECT user_id, count(*), sum(cost_usd) FROM requests
         WHERE date = current_date
         GROUP BY user_id ORDER BY sum(cost_usd) DESC LIMIT 10"

# 3. Check cache hit rate {.unnumbered}
redis-cli INFO stats | grep hit_rate

# 4. Check average tokens per request (looking for prompt bloat) {.unnumbered}
psql -c "SELECT avg(input_tokens), avg(output_tokens) FROM requests
         WHERE date = current_date"

Cost Spike Decision Tree

Is traffic volume normal?
├─ NO (higher) → Check for abuse, viral usage, bot traffic
└─ YES → Is cost per request higher?
          ├─ YES → Check model routing, prompt size, cache misses
          └─ NO → Check for billing anomaly, pricing change

Mitigation Actions

Cause	Action	Implementation
Abuse/bot traffic	Block suspicious users	Add to blocklist
Cache misses	Increase cache TTL	Set `CACHE_TTL=3600`
Prompt bloat	Audit and trim prompts	Review recent prompt changes
Model misrouting	Fix routing logic	Check model selection code
High-cost model overuse	Adjust routing thresholds	Lower quality threshold

Emergency Cost Controls

# Emergency spend limit {.unnumbered}
if daily_spend > DAILY_BUDGET * 1.2:
    # Option 1: Switch to cheaper model
    MODEL = "claude-3-5-haiku"

    # Option 2: Enable aggressive caching
    CACHE_EVERYTHING = True

    # Option 3: Rate limit all users
    GLOBAL_RATE_LIMIT = NORMAL_LIMIT * 0.5

    # Option 4: Graceful degradation message
    if daily_spend > DAILY_BUDGET * 1.5:
        return "Service temporarily limited. Please try again later."

Runbook 5: Security Incident

Symptoms

Unusual API access patterns
Prompt injection attempts detected
Data exfiltration alerts
User reports of unexpected behavior

Immediate Actions (< 5 min)

# 1. Check for injection attempts {.unnumbered}
grep -i "ignore previous\|system prompt\|you are now" /var/log/api/requests.log | tail -100

# 2. Check for data exfiltration patterns {.unnumbered}
psql -c "SELECT user_id, query FROM requests
         WHERE response LIKE '%API_KEY%' OR response LIKE '%password%'
         AND date = current_date"

# 3. Check rate limit violations {.unnumbered}
redis-cli KEYS "blocked:*" | wc -l

# 4. Review high-risk user activity {.unnumbered}
psql -c "SELECT * FROM requests
         WHERE risk_score > 0.5 AND date = current_date"

Security Response Actions

Threat	Immediate Action	Follow-up
Active injection attack	Block user/IP	Analyze attack vector
Data leak detected	Rotate exposed credentials	Audit all responses
Credential compromise	Rotate API keys	Review access logs
Abuse detected	Enable strict rate limits	Implement additional validation

Containment Checklist

Post-Incident Process

Incident Documentation Template

# Incident Report: [TITLE] {.unnumbered}

## Summary
- **Severity**: SEV[1-4]
- **Duration**: [START] to [END]
- **Impact**: [Users/requests affected]

## Timeline
- HH:MM - Alert fired
- HH:MM - Incident declared
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Resolved

## Root Cause
[Description of what caused the incident]

## Resolution
[What was done to fix it]

## Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]
- [ ] [Detection improvement]

## Lessons Learned
[What we learned and how to prevent recurrence]

Post-Incident Checklist

Incident documented in incident tracker
Timeline reconstructed from logs
Root cause identified
Action items created and assigned
Monitoring/alerting gaps addressed
Runbook updated if needed
Stakeholders notified of resolution

--- number-sections: false execute: enabled: false --- # Incident Response Runbook {.unnumbered} ## Incident Severity Levels | Severity | Definition | Response Time | Examples | |----------|------------|---------------|----------| | **SEV1** | Complete outage, all users affected | < 15 min | API down, data breach | | **SEV2** | Major degradation, many users affected | < 30 min | High error rate, severe latency | | **SEV3** | Partial degradation, some users affected | < 2 hrs | Single region down, quality drop | | **SEV4** | Minor issue, minimal impact | < 24 hrs | Intermittent errors, slow queries | --- ## Runbook 1: High Latency / Timeout Spike ### Symptoms - P95 latency > 2x normal - Timeout errors increasing - User complaints about slow responses ### Immediate Checks (< 5 min) ```bash # 1. Check current latency metrics {.unnumbered} curl -s $METRICS_ENDPOINT/api/v1/query \ --data-urlencode 'query=histogram_quantile(0.95, rate(request_latency_seconds_bucket[5m]))' # 2. Check GPU utilization {.unnumbered} nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv # 3. Check queue depth {.unnumbered} curl -s $VLLM_ENDPOINT/metrics | grep "queue_size" # 4. Check provider status pages {.unnumbered} # - status.anthropic.com {.unnumbered} # - status.openai.com {.unnumbered} ``` ### Decision Tree ``` Is GPU utilization > 90%? ├─ YES → Scale up or enable overflow routing └─ NO → Is queue depth high? ├─ YES → Check for stuck requests, restart workers └─ NO → Is provider having issues? ├─ YES → Enable fallback provider └─ NO → Check network/load balancer ``` ### Mitigation Actions | Action | Command/Steps | When to Use | |--------|---------------|-------------| | Enable overflow routing | Set `OVERFLOW_THRESHOLD=0.7` | GPU > 85% | | Activate fallback provider | Set `FALLBACK_ENABLED=true` | Provider issues | | Scale up inference pool | `kubectl scale deployment vllm --replicas=+2` | Sustained load | | Reduce batch size | Set `MAX_BATCH_SIZE=16` | Memory pressure | | Enable request shedding | Set `SHED_RATE=0.1` | Emergency only | ### Resolution Checklist - [ ] Latency returned to baseline - [ ] No elevated error rates - [ ] Queue depth normal - [ ] Alerts cleared - [ ] Incident documented --- ## Runbook 2: High Error Rate ### Symptoms - Error rate > 1% - 4xx/5xx responses increasing - Failed request alerts firing ### Immediate Checks (< 5 min) ```bash # 1. Check error breakdown by type {.unnumbered} curl -s $METRICS_ENDPOINT/api/v1/query \ --data-urlencode 'query=sum by (status_code) (rate(http_requests_total[5m]))' # 2. Check recent error logs {.unnumbered} kubectl logs -l app=api-gateway --since=5m | grep -i error | tail -50 # 3. Check model health {.unnumbered} curl -s $VLLM_ENDPOINT/health # 4. Check rate limit status {.unnumbered} redis-cli GET "ratelimit:global:count" ``` ### Error Type Decision Tree ``` What's the primary error type? 400 Bad Request └─ Check: Input validation, schema changes, client bugs 401/403 Auth Errors └─ Check: API key rotation, permission changes, token expiry 429 Rate Limited └─ Check: Traffic spike, abuse, limit misconfiguration 500 Internal Error └─ Check: Model OOM, GPU failure, code bug 502/503 Upstream Error └─ Check: Provider outage, network issues, health checks 504 Timeout └─ Check: Long-running requests, resource exhaustion ``` ### Mitigation Actions | Error Type | Action | Command | |------------|--------|---------| | 429 spike | Increase limits temporarily | `redis-cli SET ratelimit:rpm 1000` | | 500 OOM | Restart workers | `kubectl rollout restart deployment/vllm` | | 502 provider | Enable fallback | `kubectl set env deployment/api FALLBACK=true` | | 503 overload | Enable circuit breaker | Set `CIRCUIT_BREAKER=true` | --- ## Runbook 3: Quality Degradation / Hallucination Outbreak ### Symptoms - User reports of incorrect answers - Evaluation scores dropping - Hallucination rate > 2% ### Immediate Checks (< 15 min) ```bash # 1. Check recent eval scores {.unnumbered} psql -c "SELECT date, avg(score) FROM evals WHERE date > now() - interval '24 hours' GROUP BY date ORDER BY date" # 2. Sample recent responses for manual review {.unnumbered} psql -c "SELECT query, response FROM requests WHERE created_at > now() - interval '1 hour' ORDER BY random() LIMIT 10" # 3. Check if retrieval is working {.unnumbered} curl -X POST $RAG_ENDPOINT/search \ -d '{"query": "test query", "top_k": 5}' | jq '.results | length' # 4. Check embedding model health {.unnumbered} curl -s $EMBEDDING_ENDPOINT/health ``` ### Root Cause Decision Tree ``` Are retrieval results relevant? ├─ NO → Check embedding model, index freshness, query preprocessing └─ YES → Are citations being used? ├─ NO → Check prompt template, context injection └─ YES → Is model following instructions? ├─ NO → Check for prompt drift, model version change └─ YES → Likely edge case - add to eval set ``` ### Mitigation Actions | Root Cause | Action | Steps | |------------|--------|-------| | Stale index | Trigger reindex | Run indexing pipeline | | Embedding drift | Rollback embedding model | Deploy previous version | | Prompt regression | Rollback prompt version | Restore from prompt versioning | | Model behavior change | Switch model version | Update model config | | Missing context | Increase retrieval K | Set `TOP_K=10` | ### Quality Gate Actions ```python # Emergency quality gate - block responses below threshold {.unnumbered} if eval_score < QUALITY_THRESHOLD: return { "response": "I'm not confident in my answer. Please contact support.", "flagged": True, "reason": "quality_gate" } ``` --- ## Runbook 4: Cost Spike ### Symptoms - Daily spend > 120% of budget - Unexpected provider invoice - Cost alerts firing ### Immediate Checks (< 15 min) ```bash # 1. Check token usage by endpoint {.unnumbered} psql -c "SELECT endpoint, sum(input_tokens), sum(output_tokens), sum(cost_usd) FROM requests WHERE date = current_date GROUP BY endpoint" # 2. Check for unusual traffic patterns {.unnumbered} psql -c "SELECT user_id, count(*), sum(cost_usd) FROM requests WHERE date = current_date GROUP BY user_id ORDER BY sum(cost_usd) DESC LIMIT 10" # 3. Check cache hit rate {.unnumbered} redis-cli INFO stats | grep hit_rate # 4. Check average tokens per request (looking for prompt bloat) {.unnumbered} psql -c "SELECT avg(input_tokens), avg(output_tokens) FROM requests WHERE date = current_date" ``` ### Cost Spike Decision Tree ``` Is traffic volume normal? ├─ NO (higher) → Check for abuse, viral usage, bot traffic └─ YES → Is cost per request higher? ├─ YES → Check model routing, prompt size, cache misses └─ NO → Check for billing anomaly, pricing change ``` ### Mitigation Actions | Cause | Action | Implementation | |-------|--------|----------------| | Abuse/bot traffic | Block suspicious users | Add to blocklist | | Cache misses | Increase cache TTL | Set `CACHE_TTL=3600` | | Prompt bloat | Audit and trim prompts | Review recent prompt changes | | Model misrouting | Fix routing logic | Check model selection code | | High-cost model overuse | Adjust routing thresholds | Lower quality threshold | ### Emergency Cost Controls ```python # Emergency spend limit {.unnumbered} if daily_spend > DAILY_BUDGET * 1.2: # Option 1: Switch to cheaper model MODEL = "claude-3-5-haiku" # Option 2: Enable aggressive caching CACHE_EVERYTHING = True # Option 3: Rate limit all users GLOBAL_RATE_LIMIT = NORMAL_LIMIT * 0.5 # Option 4: Graceful degradation message if daily_spend > DAILY_BUDGET * 1.5: return "Service temporarily limited. Please try again later." ``` --- ## Runbook 5: Security Incident ### Symptoms - Unusual API access patterns - Prompt injection attempts detected - Data exfiltration alerts - User reports of unexpected behavior ### Immediate Actions (< 5 min) ```bash # 1. Check for injection attempts {.unnumbered} grep -i "ignore previous\|system prompt\|you are now" /var/log/api/requests.log | tail -100 # 2. Check for data exfiltration patterns {.unnumbered} psql -c "SELECT user_id, query FROM requests WHERE response LIKE '%API_KEY%' OR response LIKE '%password%' AND date = current_date" # 3. Check rate limit violations {.unnumbered} redis-cli KEYS "blocked:*" | wc -l # 4. Review high-risk user activity {.unnumbered} psql -c "SELECT * FROM requests WHERE risk_score > 0.5 AND date = current_date" ``` ### Security Response Actions | Threat | Immediate Action | Follow-up | |--------|------------------|-----------| | Active injection attack | Block user/IP | Analyze attack vector | | Data leak detected | Rotate exposed credentials | Audit all responses | | Credential compromise | Rotate API keys | Review access logs | | Abuse detected | Enable strict rate limits | Implement additional validation | ### Containment Checklist - [ ] Identified affected users/requests - [ ] Blocked malicious actors - [ ] Rotated any exposed credentials - [ ] Enabled enhanced logging - [ ] Notified security team - [ ] Preserved evidence for analysis --- ## Post-Incident Process ### Incident Documentation Template ```markdown # Incident Report: [TITLE] {.unnumbered} ## Summary - **Severity**: SEV[1-4] - **Duration**: [START] to [END] - **Impact**: [Users/requests affected] ## Timeline - HH:MM - Alert fired - HH:MM - Incident declared - HH:MM - Root cause identified - HH:MM - Mitigation applied - HH:MM - Resolved ## Root Cause [Description of what caused the incident] ## Resolution [What was done to fix it] ## Action Items - [ ] [Preventive measure 1] - [ ] [Preventive measure 2] - [ ] [Detection improvement] ## Lessons Learned [What we learned and how to prevent recurrence] ``` ### Post-Incident Checklist - [ ] Incident documented in incident tracker - [ ] Timeline reconstructed from logs - [ ] Root cause identified - [ ] Action items created and assigned - [ ] Monitoring/alerting gaps addressed - [ ] Runbook updated if needed - [ ] Stakeholders notified of resolution