Is GPU utilization > 90%?
├─ YES → Scale up or enable overflow routing
└─ NO → Is queue depth high?
├─ YES → Check for stuck requests, restart workers
└─ NO → Is provider having issues?
├─ YES → Enable fallback provider
└─ NO → Check network/load balancer
Mitigation Actions
Action
Command/Steps
When to Use
Enable overflow routing
Set OVERFLOW_THRESHOLD=0.7
GPU > 85%
Activate fallback provider
Set FALLBACK_ENABLED=true
Provider issues
Scale up inference pool
kubectl scale deployment vllm --replicas=+2
Sustained load
Reduce batch size
Set MAX_BATCH_SIZE=16
Memory pressure
Enable request shedding
Set SHED_RATE=0.1
Emergency only
Resolution Checklist
Runbook 2: High Error Rate
Symptoms
Error rate > 1%
4xx/5xx responses increasing
Failed request alerts firing
Immediate Checks (< 5 min)
# 1. Check error breakdown by type {.unnumbered}curl-s$METRICS_ENDPOINT/api/v1/query \--data-urlencode'query=sum by (status_code) (rate(http_requests_total[5m]))'# 2. Check recent error logs {.unnumbered}kubectl logs -l app=api-gateway --since=5m |grep-i error |tail-50# 3. Check model health {.unnumbered}curl-s$VLLM_ENDPOINT/health# 4. Check rate limit status {.unnumbered}redis-cli GET "ratelimit:global:count"
# 1. Check recent eval scores {.unnumbered}psql-c"SELECT date, avg(score) FROM evals WHERE date > now() - interval '24 hours' GROUP BY date ORDER BY date"# 2. Sample recent responses for manual review {.unnumbered}psql-c"SELECT query, response FROM requests WHERE created_at > now() - interval '1 hour' ORDER BY random() LIMIT 10"# 3. Check if retrieval is working {.unnumbered}curl-X POST $RAG_ENDPOINT/search \-d'{"query": "test query", "top_k": 5}'|jq'.results | length'# 4. Check embedding model health {.unnumbered}curl-s$EMBEDDING_ENDPOINT/health
Root Cause Decision Tree
Are retrieval results relevant?
├─ NO → Check embedding model, index freshness, query preprocessing
└─ YES → Are citations being used?
├─ NO → Check prompt template, context injection
└─ YES → Is model following instructions?
├─ NO → Check for prompt drift, model version change
└─ YES → Likely edge case - add to eval set
Mitigation Actions
Root Cause
Action
Steps
Stale index
Trigger reindex
Run indexing pipeline
Embedding drift
Rollback embedding model
Deploy previous version
Prompt regression
Rollback prompt version
Restore from prompt versioning
Model behavior change
Switch model version
Update model config
Missing context
Increase retrieval K
Set TOP_K=10
Quality Gate Actions
# Emergency quality gate - block responses below threshold {.unnumbered}if eval_score < QUALITY_THRESHOLD:return {"response": "I'm not confident in my answer. Please contact support.","flagged": True,"reason": "quality_gate" }
Runbook 4: Cost Spike
Symptoms
Daily spend > 120% of budget
Unexpected provider invoice
Cost alerts firing
Immediate Checks (< 15 min)
# 1. Check token usage by endpoint {.unnumbered}psql-c"SELECT endpoint, sum(input_tokens), sum(output_tokens), sum(cost_usd) FROM requests WHERE date = current_date GROUP BY endpoint"# 2. Check for unusual traffic patterns {.unnumbered}psql-c"SELECT user_id, count(*), sum(cost_usd) FROM requests WHERE date = current_date GROUP BY user_id ORDER BY sum(cost_usd) DESC LIMIT 10"# 3. Check cache hit rate {.unnumbered}redis-cli INFO stats |grep hit_rate# 4. Check average tokens per request (looking for prompt bloat) {.unnumbered}psql-c"SELECT avg(input_tokens), avg(output_tokens) FROM requests WHERE date = current_date"
Cost Spike Decision Tree
Is traffic volume normal?
├─ NO (higher) → Check for abuse, viral usage, bot traffic
└─ YES → Is cost per request higher?
├─ YES → Check model routing, prompt size, cache misses
└─ NO → Check for billing anomaly, pricing change
Mitigation Actions
Cause
Action
Implementation
Abuse/bot traffic
Block suspicious users
Add to blocklist
Cache misses
Increase cache TTL
Set CACHE_TTL=3600
Prompt bloat
Audit and trim prompts
Review recent prompt changes
Model misrouting
Fix routing logic
Check model selection code
High-cost model overuse
Adjust routing thresholds
Lower quality threshold
Emergency Cost Controls
# Emergency spend limit {.unnumbered}if daily_spend > DAILY_BUDGET *1.2:# Option 1: Switch to cheaper model MODEL ="claude-3-5-haiku"# Option 2: Enable aggressive caching CACHE_EVERYTHING =True# Option 3: Rate limit all users GLOBAL_RATE_LIMIT = NORMAL_LIMIT *0.5# Option 4: Graceful degradation messageif daily_spend > DAILY_BUDGET *1.5:return"Service temporarily limited. Please try again later."
Runbook 5: Security Incident
Symptoms
Unusual API access patterns
Prompt injection attempts detected
Data exfiltration alerts
User reports of unexpected behavior
Immediate Actions (< 5 min)
# 1. Check for injection attempts {.unnumbered}grep-i"ignore previous\|system prompt\|you are now" /var/log/api/requests.log |tail-100# 2. Check for data exfiltration patterns {.unnumbered}psql-c"SELECT user_id, query FROM requests WHERE response LIKE '%API_KEY%' OR response LIKE '%password%' AND date = current_date"# 3. Check rate limit violations {.unnumbered}redis-cli KEYS "blocked:*"|wc-l# 4. Review high-risk user activity {.unnumbered}psql-c"SELECT * FROM requests WHERE risk_score > 0.5 AND date = current_date"
Security Response Actions
Threat
Immediate Action
Follow-up
Active injection attack
Block user/IP
Analyze attack vector
Data leak detected
Rotate exposed credentials
Audit all responses
Credential compromise
Rotate API keys
Review access logs
Abuse detected
Enable strict rate limits
Implement additional validation
Containment Checklist
Post-Incident Process
Incident Documentation Template
# Incident Report: [TITLE] {.unnumbered}## Summary- **Severity**: SEV[1-4]- **Duration**: [START] to [END]- **Impact**: [Users/requests affected]## Timeline- HH:MM - Alert fired- HH:MM - Incident declared- HH:MM - Root cause identified- HH:MM - Mitigation applied- HH:MM - Resolved## Root Cause[Description of what caused the incident]## Resolution[What was done to fix it]## Action Items- [ ][Preventive measure 1]- [ ][Preventive measure 2]- [ ][Detection improvement]## Lessons Learned[What we learned and how to prevent recurrence]