---
number-sections: false
execute:
enabled: false
---
# Production Deployment Checklists for AI Systems {.unnumbered}
Actionable yes/no checklists for deploying AI systems to production. Each item should be verified before launch.
---
## 1. RAG System Deployment Checklist
### Indexing Pipeline
- [ ] **Chunking strategy validated** - Tested chunk sizes on representative documents; overlap configured appropriately (see Ch 7)
- [ ] **Chunk size tuned for retrieval** - Evaluated different sizes (256, 512, 1024 tokens) against retrieval accuracy metrics
- [ ] **Document extraction working** - PDF, DOCX, HTML parsers tested on edge cases (scanned docs, tables, multi-column)
- [ ] **Metadata extraction configured** - Source, date, author, section headers captured for filtering
- [ ] **Embedding model versioned** - Model version tracked; re-indexing plan documented for model upgrades
- [ ] **Embedding dimension matched** - Vector DB index dimension matches embedding model output dimension
- [ ] **Index versioning in place** - Can roll back to previous index version if quality degrades
- [ ] **Duplicate detection enabled** - Near-duplicate documents identified and handled (dedup or flag)
- [ ] **Quality gates configured** - Documents below quality threshold (OCR confidence, parsing errors) are flagged or rejected
- [ ] **Indexing pipeline idempotent** - Re-running indexing on the same documents produces identical index contents
- [ ] **Incremental indexing tested** - New documents can be added without full reindex
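
The idempotency and dimension-match items above can be checked in code. A minimal sketch, assuming a plain dict stands in for the vector DB; `chunk_id`, `check_dimension`, and `upsert_chunks` are hypothetical helper names, not part of any specific library:

```python
import hashlib


def chunk_id(doc_id: str, chunk_text: str, embedding_model: str) -> str:
    """Derive a stable ID from the document, chunk content, and model version."""
    payload = "\x1f".join([doc_id, embedding_model, chunk_text]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]


def check_dimension(vector: list[float], index_dimension: int) -> None:
    """Fail fast when a vector does not match the index dimension."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"embedding has {len(vector)} dims, index expects {index_dimension}"
        )


def upsert_chunks(index: dict, doc_id: str, chunks: list[str],
                  vectors: list[list[float]], embedding_model: str,
                  index_dimension: int) -> int:
    """Upsert chunks into the stand-in index; return how many keys were new."""
    added = 0
    for text, vector in zip(chunks, vectors):
        check_dimension(vector, index_dimension)
        key = chunk_id(doc_id, text, embedding_model)
        if key not in index:
            added += 1
        # Same content always maps to the same key, so re-runs overwrite in place.
        index[key] = {"doc_id": doc_id, "text": text, "vector": vector}
    return added
```

Because the ID includes the embedding model version, upgrading the model produces new keys rather than silently mixing vectors from two models in one index.
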
### Retrieval Quality
- [ ] **Baseline metrics established** - Recall@K, MRR, NDCG measured on eval set before launch
- [ ] **Retrieval latency within SLO** - P95 retrieval latency meets target (typically <100ms for vector search)
- [ ] **Hybrid search configured** - If using hybrid, weights tuned (e.g., 0.7 vector / 0.3 BM25)
- [ ] **Reranker evaluated** - Cross-encoder reranker tested; latency/quality tradeoff acceptable
- [ ] **Query preprocessing tested** - Query expansion, spelling correction, entity extraction working correctly
- [ ] **Empty result handling** - Graceful fallback when no relevant documents found
- [ ] **Multi-query retrieval tested** - If decomposing queries, sub-query results properly merged
- [ ] **Filter accuracy verified** - Metadata filters (date ranges, categories) return correct subsets
- [ ] **Minimum similarity threshold set** - Low-confidence results filtered or flagged (see [07b](07b_rag_systems_code.html))
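
For the baseline metrics item, Recall@K and MRR are simple enough to compute without a framework. A minimal sketch, assuming each eval case is a ranked list of retrieved document IDs plus a set of relevant IDs (the function names and data layout are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average Recall@K and MRR over a list of (retrieved, relevant) pairs."""
    if not runs:
        raise ValueError("eval set is empty")
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Storing the resulting dict alongside the index version gives a baseline to compare against after chunking or embedding changes.
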
### Infrastructure
- [ ] **Vector DB properly sized** - Capacity for current docs + 2x growth; memory/disk allocated
- [ ] **Vector DB replicated** - At least one replica for availability (if using managed service, confirm SLA)
- [ ] **Embedding service scaled** - Can handle peak query load for embedding new queries
- [ ] **Caching layer configured** - Query result caching for common queries; TTL set appropriately
- [ ] **Connection pooling enabled** - Database connections pooled to avoid connection exhaustion
- [ ] **Index backup scheduled** - Regular backups of vector index; restore procedure tested
- [ ] **Health checks configured** - Vector DB and embedding service health endpoints monitored
- [ ] **Rate limiting on indexing** - Bulk indexing rate limited to avoid overwhelming embedding service
### Monitoring
- [ ] **Retrieval latency tracked** - P50, P95, P99 latency dashboards configured
- [ ] **Retrieval quality metrics logged** - Click-through rates, user satisfaction signals captured
- [ ] **Empty result rate monitored** - Alert if more than X% of queries return no results
- [ ] **Index freshness tracked** - Time since last document update visible; staleness alerts configured
- [ ] **Embedding model drift detection** - Periodic checks that embedding quality hasn't degraded
- [ ] **Query volume dashboards** - Traffic patterns visible for capacity planning
- [ ] **Retrieval errors alerted** - Vector DB errors, timeouts trigger alerts (see [25a](25a_incident_runbook.html))
**Related chapters:** [Ch 7 - RAG Systems](../chapters/ch07_rag_systems.html), [07a](07a_rag_pipeline_architecture.html), [07b](07b_rag_systems_code.html)
---
## 2. Agent Deployment Checklist
### Safety
- [ ] **Tool permissions scoped** - Each tool has minimum required permissions; no admin/root access
- [ ] **Sensitive tools require approval** - Human-in-the-loop for: delete operations, financial transactions, external communications, data exports
- [ ] **File system access restricted** - Agent can only access designated directories; path traversal prevented
- [ ] **Network access controlled** - Allowlist of permitted domains/IPs for external requests
- [ ] **Command execution sandboxed** - If shell access needed, use restricted sandbox (Docker, gVisor, etc.)
- [ ] **Resource limits set** - Max memory, CPU time, disk space per agent execution
- [ ] **Output size limits configured** - Prevent agents from generating unbounded output
- [ ] **PII handling rules enforced** - Agents cannot store or transmit PII without explicit safeguards (see [12a](12a_security_defense_patterns.html))
- [ ] **Audit trail for sensitive actions** - All tool invocations logged with full parameters
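
The file system restriction above is easy to get wrong with string prefix checks. A minimal sketch of path confinement using `pathlib`, assuming a single designated base directory (`resolve_inside` is a hypothetical helper name):

```python
from pathlib import Path


def resolve_inside(base_dir: str, requested: str) -> Path:
    """Resolve `requested` relative to `base_dir`; reject anything outside it."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks that exist,
    # so the check runs against the real target path.
    candidate = (base / requested).resolve()
    if candidate != base and base not in candidate.parents:
        raise PermissionError(f"path escapes sandbox: {requested}")
    return candidate
```

An absolute `requested` path replaces the base when joined, so it fails the same check as a `..` traversal. This guards path construction only; it does not replace OS-level sandboxing.
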
### Reliability
- [ ] **Loop detection implemented** - Agent terminated after N iterations or when state repeats (see [08b](08b_agent_orchestration.html))
- [ ] **Maximum iteration limit set** - Hard cap on tool calls per request (e.g., 20 iterations)
- [ ] **Per-tool timeout configured** - Individual tool calls timeout (e.g., 30s for API calls, 60s for code execution)
- [ ] **Total execution timeout set** - Maximum wall-clock time for entire agent task (e.g., 5 minutes)
- [ ] **Fallback behavior defined** - Graceful degradation when tools fail or timeout
- [ ] **Stuck state detection** - Agent detects when making no progress; escalates or terminates
- [ ] **Retry logic with backoff** - Transient tool failures retried with exponential backoff
- [ ] **Partial result handling** - If task partially completes, useful output still returned
- [ ] **State checkpointing** - For long-running tasks, intermediate state saved for recovery
- [ ] **Dead letter queue** - Failed agent tasks captured for debugging and retry
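
Loop detection and the hard iteration cap can share one guard object. A minimal sketch, assuming "state repeats" is approximated by the same tool being called with the same arguments; `LoopGuard` and its defaults are illustrative, not taken from any agent framework:

```python
import hashlib
import json


class LoopGuard:
    """Stops an agent after too many steps or too many identical tool calls."""

    def __init__(self, max_iterations: int = 20, max_repeats: int = 2):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.iterations = 0
        self.seen: dict[str, int] = {}

    def check(self, tool_name: str, arguments: dict) -> None:
        """Call before each tool invocation; raises when the agent should stop."""
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise RuntimeError("iteration limit reached")
        # Fingerprint the call so identical (tool, arguments) pairs collide.
        raw = json.dumps([tool_name, arguments], sort_keys=True, default=str)
        fingerprint = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        self.seen[fingerprint] = self.seen.get(fingerprint, 0) + 1
        if self.seen[fingerprint] > self.max_repeats:
            raise RuntimeError(f"repeated call detected: {tool_name}")
```

The orchestrator would catch the `RuntimeError`, return any partial result, and record the failure category for the dashboards described under Observability.
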
### Observability
- [ ] **All tool calls logged** - Tool name, parameters, return value, duration for every invocation
- [ ] **Decision traces captured** - Agent reasoning/planning steps logged for debugging
- [ ] **Token usage tracked** - Input/output tokens per agent turn; cost attribution enabled
- [ ] **Tool success rate monitored** - Per-tool success/failure rates visible in dashboards
- [ ] **Agent completion rate tracked** - What percentage of tasks complete successfully
- [ ] **Latency breakdown visible** - Time spent in LLM vs tools vs orchestration
- [ ] **Error categorization** - Failures categorized (tool error, timeout, loop, safety block)
- [ ] **Trace correlation IDs** - Agent execution linked to parent request for end-to-end tracing
- [ ] **Replay capability** - Can re-execute agent with logged inputs for debugging
### Testing
- [ ] **Adversarial testing completed** - Prompt injection attacks against agent tested (see [12b](12b_security_adversarial_code.html))
- [ ] **Tool misuse scenarios tested** - Agent tested with prompts trying to misuse tools
- [ ] **Edge cases covered** - Empty inputs, very long inputs, malformed tool responses
- [ ] **Concurrent execution tested** - Multiple agent instances running simultaneously
- [ ] **Resource exhaustion tested** - Behavior when hitting memory/timeout limits
- [ ] **Recovery testing done** - Agent behavior after partial failures verified
- [ ] **Multi-turn conversation tested** - State maintained correctly across turns
- [ ] **Integration tests with real tools** - End-to-end tests against actual tool implementations
**Related chapters:** [Ch 8 - Agentic Systems](../chapters/ch08_agentic_systems.html), [08b](08b_agent_orchestration.html), [08c](08c_agentic_systems_code.html)
---
## 3. LLM API Integration Checklist
### Error Handling
- [ ] **Retry configuration set** - Max retries (e.g., 3), backoff multiplier (e.g., 2x), max delay (e.g., 60s)
- [ ] **Retryable errors identified** - 429, 500, 502, 503, 504 retried; 400, 401, 403 not retried
- [ ] **Fallback model configured** - Secondary model/provider activated on primary failure
- [ ] **Fallback trigger conditions defined** - Switch after N consecutive failures or error rate >X%
- [ ] **Circuit breaker implemented** - Stops calling failing provider; auto-recovers after cooldown (see [25b](25b_reliability_engineering_code.html))
- [ ] **Timeout values set** - Connection timeout (e.g., 5s) and read timeout (e.g., 60-120s, based on expected response length)
- [ ] **Graceful degradation message** - User-friendly error when all providers fail
- [ ] **Error details logged** - Full error response captured for debugging (sanitize sensitive data)
- [ ] **Provider status monitoring** - Subscribe to provider status pages; alert on incidents
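
The retry items above translate into a small wrapper. A minimal sketch using the example values from the checklist (3 retries, 2x multiplier, 60s cap); `ApiError` is a hypothetical exception standing in for whatever your HTTP client raises, and the full-jitter step is an added assumption rather than something the checklist requires:

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt: int, base: float = 1.0, multiplier: float = 2.0,
                  max_delay: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based), capped at max_delay."""
    return min(max_delay, base * multiplier ** attempt)


def call_with_retry(fn, max_retries: int = 3, sleep=time.sleep,
                    jitter: bool = True):
    """Call fn(); retry retryable ApiErrors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ApiError as err:
            # Client errors (400, 401, 403) and exhausted retries propagate.
            if err.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            delay = backoff_delay(attempt)
            if jitter:
                delay = random.uniform(0, delay)  # spread out retry storms
            sleep(delay)
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting. A fallback provider or circuit breaker would sit one level above this function.
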
### Cost Controls
- [ ] **Per-request budget set** - Maximum spend per API call enforced
- [ ] **Daily budget limit configured** - Hard cap on daily spend; alerts at 80%, 100%, 120%
- [ ] **Monthly budget tracked** - Projection dashboards; alerts before exceeding budget
- [ ] **Per-user limits enforced** - Individual users have request/token quotas
- [ ] **Cost attribution enabled** - Costs tagged by feature, team, customer (see [26b](26b_cost_engineering_code.html))
- [ ] **Token counting before request** - Estimate tokens before sending to avoid surprise costs
- [ ] **Output token limits set** - `max_tokens` parameter set appropriately (not unlimited)
- [ ] **Alerting on cost anomalies** - Spike detection for unusual spending patterns
- [ ] **Unused context optimization** - System prompts aren't unnecessarily repeated (use caching)
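
A daily budget with pre-request estimation might look like the following. This is a minimal in-memory sketch: the per-million-token prices are caller-supplied placeholders, and `DailyBudget` is a hypothetical class, not a provider feature. Since the cap is checked against an estimate, actual spend can overshoot it, which is one reason to keep an alert above 100%:

```python
def estimate_cost_usd(input_tokens: int, max_output_tokens: int,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Upper-bound cost of one request, assuming the full output budget is used."""
    return (input_tokens * input_price_per_mtok
            + max_output_tokens * output_price_per_mtok) / 1_000_000


class DailyBudget:
    """Tracks spend against a daily cap and reports each alert threshold once."""

    def __init__(self, limit_usd: float, alert_fractions=(0.8, 1.0, 1.2)):
        self.limit_usd = limit_usd
        self.alert_fractions = sorted(alert_fractions)
        self.spent_usd = 0.0
        self.fired: set[float] = set()

    def can_spend(self, estimated_usd: float) -> bool:
        """Pre-request check: would this estimate push spend past the cap?"""
        return self.spent_usd + estimated_usd <= self.limit_usd

    def record(self, actual_usd: float) -> list[float]:
        """Add actual spend; return alert fractions crossed for the first time."""
        self.spent_usd += actual_usd
        crossed = [f for f in self.alert_fractions
                   if f not in self.fired
                   and self.spent_usd >= f * self.limit_usd]
        self.fired.update(crossed)
        return crossed
```

In production the counter would live in shared storage and reset on a daily boundary; both are omitted here.
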
### Performance
- [ ] **Response caching enabled** - Identical queries return cached responses; TTL configured
- [ ] **Prompt caching utilized** - Provider prompt caching enabled for repeated system prompts (if available)
- [ ] **Batching configured** - Where possible, batch multiple requests (Batch API for non-real-time)
- [ ] **Connection keep-alive enabled** - HTTP connections reused to reduce latency
- [ ] **Streaming configured correctly** - Use streaming for long responses to improve time to first token (TTFT)
- [ ] **Async patterns used** - Non-blocking calls for concurrent request handling
- [ ] **Request queuing implemented** - Queue requests during rate limits instead of immediate failure
- [ ] **Latency monitoring in place** - TTFT, total latency tracked with percentile dashboards
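
Response caching with a TTL can be sketched in a few lines. This in-memory version assumes exact-match keys built from the model name, prompt, and sampling parameters; `ResponseCache` is illustrative, and a shared store would typically replace the dict in a multi-process deployment:

```python
import hashlib
import json
import time


class ResponseCache:
    """In-memory response cache with a per-entry TTL and hit-rate counters."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries: dict[str, tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model: str, prompt: str, params: dict) -> str:
        # Identical model + prompt + sampling params map to the same key.
        raw = json.dumps([model, prompt, params], sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, key: str):
        """Return the cached response, or None if missing or expired."""
        entry = self.entries.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl_seconds:
            self.hits += 1
            return entry[1]
        self.entries.pop(key, None)  # drop the expired entry, if any
        self.misses += 1
        return None

    def put(self, key: str, response: str) -> None:
        self.entries[key] = (self.clock(), response)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `hit_rate` counter feeds the cache hit rate item in the Cost Optimization checklist.
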
### Security
- [ ] **API keys in secrets manager** - Never hard-coded; plain environment variables acceptable in development only
- [ ] **API key rotation procedure** - Keys can be rotated without downtime; tested
- [ ] **Key scoping enforced** - Separate keys for dev/staging/prod; different keys per service
- [ ] **Request logging sanitized** - API keys, PII redacted from logs
- [ ] **Input validation enabled** - User input sanitized before sending to LLM (see [12a](12a_security_defense_patterns.html))
- [ ] **Output validation enabled** - LLM responses checked for PII leakage, harmful content
- [ ] **TLS enforced** - All API calls over HTTPS; certificate validation enabled
- [ ] **Request signing** - If provider supports, request signing enabled for integrity
- [ ] **Audit logging** - All LLM API calls logged with user ID, timestamp, sanitized content
**Related chapters:** [Ch 9 - LLM Deployment](../chapters/ch09_llm_deployment_infrastructure.html), [Ch 10 - Backend Engineering](../chapters/ch10_backend_engineering_for_ai.html), [02](02_python_ai_patterns.html)
---
## 4. Model Upgrade Checklist
### Evaluation
- [ ] **Benchmark suite run** - New model evaluated on standard eval set (see [11a](11a_evaluation_harness.html))
- [ ] **Comparison against current model** - Side-by-side metrics on identical test cases
- [ ] **Regression tests passed** - No degradation on key use cases; regressions documented if accepted
- [ ] **Edge case evaluation** - Model tested on known difficult inputs from production
- [ ] **Latency benchmarked** - New model latency within acceptable range vs current
- [ ] **Cost impact calculated** - Token pricing and efficiency differences quantified (see [26a](26a_cost_estimation_formulas.html))
- [ ] **Prompt compatibility verified** - Existing prompts work correctly with new model
- [ ] **Output format consistency** - Structured outputs (JSON, etc.) still parse correctly
- [ ] **Safety evaluation completed** - Red team testing on new model for safety regressions
- [ ] **Calibration check done** - LLM-as-judge re-calibrated if judge model also changes
### Rollout
- [ ] **Canary deployment planned** - Start with X% traffic (e.g., 5%), monitor, then expand
- [ ] **Canary success criteria defined** - Metrics thresholds for expanding rollout
- [ ] **Rollback procedure documented** - One-command rollback to previous model version
- [ ] **Rollback tested** - Actually performed a rollback in staging environment
- [ ] **Feature flags configured** - Can enable/disable new model per user segment
- [ ] **Shadow mode option available** - Can run new model in parallel without serving responses
- [ ] **A/B test configured** - Statistical comparison framework ready (see [11b](11b_mlops_evaluation_code.html))
- [ ] **Rollout timeline documented** - Schedule for 5% -> 25% -> 50% -> 100%
- [ ] **On-call notified** - Team aware of rollout timing; escalation path clear
- [ ] **Stakeholders informed** - Product, support teams aware of potential behavior changes
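
Percentage-based canary routing is usually made sticky per user, so that raising the percentage only adds users to the canary. A minimal sketch using a hash bucket; the salt and the model names `current-model` and `candidate-model` are placeholders:

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "model-upgrade") -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_model(user_id: str, canary_percent: int,
                 current: str = "current-model",
                 candidate: str = "candidate-model") -> str:
    """Route users whose bucket falls below canary_percent to the candidate."""
    return candidate if rollout_bucket(user_id) < canary_percent else current
```

Setting `canary_percent` to 0 is the rollback path: every user returns to the current model with no other state to clean up. Changing the salt reshuffles which users land in the canary for the next rollout.
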
### Monitoring
- [ ] **Comparison dashboards ready** - Old vs new model metrics side-by-side
- [ ] **Alert thresholds updated** - Latency, error, quality alerts adjusted for new model baseline
- [ ] **Quality score monitoring** - LLM-as-judge running on sample of new model responses
- [ ] **User feedback collection** - Thumbs up/down, explicit feedback tracked during rollout
- [ ] **Error rate comparison** - New model error rate vs baseline visualized
- [ ] **Cost tracking by model version** - Spend attributed to each model version during transition
- [ ] **Latency comparison** - TTFT, total latency tracked per model version
- [ ] **Rollback trigger alerts** - Automatic alerts if new model exceeds error/latency thresholds
- [ ] **Post-rollout review scheduled** - Meeting to review metrics after full rollout
**Related chapters:** [Ch 11 - MLOps & Evaluation](../chapters/ch11_mlops_evaluation.html), [Ch 22 - Research to Production](../chapters/ch22_research_to_production.html)
---
## 5. Security Review Checklist
### Input Validation
- [ ] **Prompt injection defenses active** - Input screened for injection patterns (see [12a](12a_security_defense_patterns.html))
- [ ] **Sandwich defense implemented** - User input placed between the system instructions and a trailing restatement of those instructions
- [ ] **Input length limits enforced** - Maximum input size to prevent resource exhaustion
- [ ] **Unicode normalization applied** - Homoglyph and invisible character attacks mitigated
- [ ] **Special character handling** - Escape sequences, control characters sanitized
- [ ] **Multi-language injection tested** - Injection attempts in non-English languages blocked
- [ ] **Encoded payload detection** - Base64, URL-encoded injection attempts detected
- [ ] **Rate limiting on suspicious inputs** - Users sending risky inputs get throttled
- [ ] **Input risk scoring logged** - Suspicious inputs flagged for review
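
Unicode normalization, control-character stripping, and the length limit can run as one preprocessing step. A minimal sketch using the standard `unicodedata` module; the 8000-character limit is an arbitrary placeholder. NFKC folds compatibility forms such as fullwidth letters, but it does not map cross-script lookalikes (for example Cyrillic "а" to Latin "a"), which need a separate confusables check:

```python
import unicodedata

MAX_INPUT_CHARS = 8000  # placeholder limit; tune for your context window


def normalize_input(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Normalize to NFKC, strip control/format characters, enforce a length cap."""
    text = unicodedata.normalize("NFKC", text)
    cleaned = "".join(
        ch for ch in text
        # Keep newlines and tabs; drop other control (Cc) and invisible
        # format (Cf) characters such as zero-width spaces and bidi overrides.
        if ch in "\n\t" or unicodedata.category(ch) not in {"Cc", "Cf"}
    )
    if len(cleaned) > max_chars:
        raise ValueError(f"input exceeds {max_chars} characters")
    return cleaned
```

Stripping every `Cf` character also removes zero-width joiners, which some emoji sequences and scripts rely on, so consider an allowlist for those if your users need them.
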
### Output Filtering
- [ ] **PII detection enabled** - SSN, credit card, phone, email patterns redacted (see [12a](12a_security_defense_patterns.html))
- [ ] **Custom PII patterns added** - Organization-specific sensitive data patterns configured
- [ ] **Content filtering active** - Harmful, inappropriate content blocked
- [ ] **Source code detection** - Credentials (API keys, passwords, tokens) appearing in generated code are detected and redacted
- [ ] **Jailbreak output detection** - Responses indicating successful jailbreak flagged
- [ ] **Maximum output length enforced** - Prevent unbounded output generation
- [ ] **Citation verification** - If generating citations, verify they exist in context
- [ ] **Structured output validation** - JSON/XML outputs validated against schema
- [ ] **Logging of filtered content** - Filtering events logged for audit, with the sensitive values themselves redacted
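
Pattern-based PII redaction can be sketched with regular expressions. The patterns below are deliberately simple illustrations (US-style SSNs, basic email shapes, card numbers confirmed with a Luhn check) and will miss many real-world formats; treat them as a starting point rather than a complete detector:

```python
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits


def luhn_valid(number: str) -> bool:
    """Luhn checksum; filters out digit runs that are not card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> tuple[str, dict[str, int]]:
    """Return redacted text plus per-type counts that are safe to log."""
    counts = {"email": 0, "ssn": 0, "card": 0}

    def replace_card(match: re.Match) -> str:
        if luhn_valid(match.group()):
            counts["card"] += 1
            return "[REDACTED_CARD]"
        return match.group()  # not a valid card number; leave untouched

    text, counts["email"] = EMAIL_RE.subn("[REDACTED_EMAIL]", text)
    text, counts["ssn"] = SSN_RE.subn("[REDACTED_SSN]", text)
    text = CARD_RE.sub(replace_card, text)
    return text, counts
```

Logging only the returned counts gives an audit trail of what was filtered without writing the sensitive values to the log.
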
### Access Control
- [ ] **Authentication required** - All API endpoints require valid authentication
- [ ] **Authorization checks in place** - Users can only access their own data/conversations
- [ ] **API key permissions scoped** - Keys have minimum required permissions
- [ ] **Rate limiting enforced** - Per-user, per-IP, per-API-key limits configured
- [ ] **Abuse detection active** - Unusual patterns (high volume, repeated failures) trigger blocks (see [12a](12a_security_defense_patterns.html))
- [ ] **IP allowlisting** - For sensitive endpoints, restrict to known IPs (if applicable)
- [ ] **Session management secure** - Sessions expire, tokens can be revoked
- [ ] **CORS configured correctly** - Only authorized origins can call API
- [ ] **Admin access audited** - Admin actions logged with user identity
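
Per-user, per-IP, or per-key rate limits are commonly implemented as token buckets. A minimal single-process sketch; `TokenBucket` and `PerKeyLimiter` are illustrative names, and a multi-instance deployment would need the bucket state in shared storage:

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerKeyLimiter:
    """One bucket per key (user ID, IP address, or API key)."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock=time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, key: str) -> bool:
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate_per_sec,
                                            self.capacity, self.clock)
        return self.buckets[key].allow()
```

Running one limiter per dimension (user, IP, key) and requiring all of them to allow a request gives the layered limits described in the checklist.
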
### Audit Logging
- [ ] **All LLM calls logged** - Every API call recorded with timestamp, user, model, tokens
- [ ] **Sensitive data redacted in logs** - PII, API keys masked before logging
- [ ] **Log retention configured** - Logs retained for compliance period (e.g., 90 days)
- [ ] **Log integrity protected** - Logs stored in tamper-evident system (or hashed)
- [ ] **Log access controlled** - Only authorized personnel can access logs
- [ ] **Security event alerting** - Suspicious activity triggers security team notification
- [ ] **Compliance requirements met** - GDPR, HIPAA, SOC2 logging requirements satisfied
- [ ] **Incident investigation capability** - Can reconstruct user sessions from logs
- [ ] **Regular log review process** - Scheduled review of security-relevant logs
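
Where no tamper-evident log store is available, hashing each entry together with its predecessor is one way to make edits detectable. A minimal sketch, assuming records are JSON-serializable dicts that have already been redacted; `append_entry` and `verify_chain` are hypothetical helper names:

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], record: dict) -> dict:
    """Append a record whose hash covers both its content and the prior hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev_hash": prev_hash,
             "hash": _entry_hash(prev_hash, record)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry fails."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != _entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True
```

The chain only proves integrity if the latest hash is also stored somewhere the log writer cannot modify; otherwise an attacker could rewrite the whole chain.
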
**Related chapters:** [Ch 12 - Security & Adversarial Robustness](../chapters/ch12_security_adversarial_robustness.html), [12a](12a_security_defense_patterns.html), [12b](12b_security_adversarial_code.html)
---
## 6. Cost Optimization Checklist
### Token Efficiency
- [ ] **System prompts optimized** - Prompts trimmed to essential instructions; no redundancy
- [ ] **Prompt caching enabled** - Provider caching features used for repeated prompts (if available)
- [ ] **Response caching implemented** - Identical queries served from cache
- [ ] **Cache hit rate monitored** - Targeting >30% cache hit rate for typical workloads
- [ ] **Context window usage tracked** - Monitoring how much of context window is typically used
- [ ] **Unnecessary context removed** - Not including full documents when excerpts suffice
- [ ] **Output length controlled** - `max_tokens` set appropriately; not requesting more than needed
- [ ] **Token counting accurate** - Using correct tokenizer for accurate cost estimation
- [ ] **Prompt versioning tracked** - Prompt changes linked to cost changes (see [26a](26a_cost_estimation_formulas.html))
### Model Routing
- [ ] **Task complexity assessment** - Simple tasks routed to smaller/cheaper models
- [ ] **Model routing rules defined** - Clear criteria for when to use each model tier
- [ ] **Routing accuracy monitored** - Tracking if routing decisions are correct
- [ ] **Fallback model configured** - More capable (and more expensive) model available for complex tasks that fail on the cheaper model
- [ ] **A/B testing routing strategies** - Experimenting with different routing thresholds
- [ ] **Quality vs cost tradeoffs documented** - Decision matrix for model selection
- [ ] **Batch API usage** - Non-real-time workloads using discounted batch endpoints (often around 50% cheaper; confirm with your provider)
- [ ] **Fine-tuned model evaluation** - Assessed if fine-tuning smaller model beats larger generic model (see [26b](26b_cost_engineering_code.html))
### Infrastructure
- [ ] **Compute right-sized** - GPU/CPU instances matched to actual workload (not over-provisioned)
- [ ] **Auto-scaling configured** - Scale up for peak traffic, scale down during quiet periods
- [ ] **Spot/preemptible instances used** - Non-critical workloads on discounted instances (where applicable)
- [ ] **Reserved capacity evaluated** - For predictable workloads, committed use discounts calculated
- [ ] **Multi-region cost comparison** - Deployed in cost-effective regions (with latency tradeoffs noted)
- [ ] **Idle resource detection** - Alerting on underutilized infrastructure
- [ ] **Vector DB tiering** - Frequently accessed data on fast storage; archival on cheaper tiers
- [ ] **Cold start optimization** - Serverless functions kept warm only where the cold-start latency penalty outweighs the cost of keeping them warm
### Budget Management
- [ ] **Cost dashboards deployed** - Real-time visibility into spend by service, feature, team
- [ ] **Budget alerts configured** - Notifications at 50%, 80%, 100% of budget
- [ ] **Anomaly detection active** - Alerts on unexpected spend spikes
- [ ] **Cost attribution working** - Spend tagged to teams/products for chargeback
- [ ] **Monthly cost reviews scheduled** - Regular review of optimization opportunities
- [ ] **Cost vs quality metrics linked** - Understanding cost per successful response, not just cost per request
- [ ] **Forecasting in place** - Projected costs based on traffic trends
- [ ] **Emergency cost controls documented** - Runbook for rapid cost reduction if needed (see [25a](25a_incident_runbook.html))
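
The difference between cost per request and cost per successful response is simple arithmetic, but it changes which optimizations look worthwhile. A minimal sketch with illustrative function names; the linear monthly projection is a naive assumption that ignores traffic growth and weekly seasonality:

```python
def cost_per_request(total_cost_usd: float, requests: int) -> float:
    """Average spend per request, successful or not."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return total_cost_usd / requests


def cost_per_success(total_cost_usd: float, successful_responses: int) -> float:
    """Spend divided by responses that met the quality bar."""
    if successful_responses <= 0:
        raise ValueError("successful_responses must be positive")
    return total_cost_usd / successful_responses


def projected_monthly_spend(spend_to_date_usd: float, days_elapsed: int,
                            days_in_month: int = 30) -> float:
    """Linear projection from the average daily spend so far."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return spend_to_date_usd / days_elapsed * days_in_month
```

For example, $120 spent on 10,000 requests is $0.012 per request, but if only 8,000 responses were successful the cost per success is $0.015. A cheaper model that lowers the success rate can therefore raise the second number while lowering the first.
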
**Related chapters:** [Ch 26 - Cost Engineering](../chapters/ch26_cost_engineering.html), [26a](26a_cost_estimation_formulas.html), [26b](26b_cost_engineering_code.html)
---
## Quick Reference: Pre-Launch Gate
Before any production launch, verify these critical items are complete:
| Category | Critical Items |
|----------|----------------|
| **Safety** | Tool permissions scoped, human-in-the-loop approval for sensitive actions, input validation active |
| **Reliability** | Timeouts configured, fallbacks working, circuit breakers enabled |
| **Security** | API keys secured, input/output filtering active, audit logging enabled |
| **Monitoring** | Dashboards deployed, alerts configured, on-call rotation set |
| **Cost** | Budgets set, alerts configured, cost tracking by feature |
| **Rollout** | Canary plan ready, rollback tested, stakeholders notified |
---
## Usage Guide
### Before First Production Deployment
1. Complete all items in the relevant checklists above
2. Have a second person verify critical items (safety, security)
3. Document any intentionally skipped items with justification
4. Schedule a post-launch review (e.g., 1 week after launch) to reassess
### For Model/System Updates
1. Complete Model Upgrade Checklist
2. Re-run relevant portions of original deployment checklist
3. Use canary deployment; do not jump from 0% to 100% of traffic
### Regular Review Cadence
| Checklist | Review Frequency |
|-----------|------------------|
| RAG System | Monthly + after index changes |
| Agent Deployment | Monthly + after tool changes |
| LLM API Integration | Quarterly + after provider changes |
| Security | Quarterly + after any security incident |
| Cost Optimization | Monthly |
---
*These checklists complement the incident runbooks ([25a](25a_incident_runbook.html)) and should be reviewed together before production deployment.*