---
number-sections: false
execute:
  enabled: false
---

# Production Deployment Checklists for AI Systems {.unnumbered}

Actionable yes/no checklists for deploying AI systems to production. Each item should be verified before launch.

---

## 1. RAG System Deployment Checklist

### Indexing Pipeline

- [ ] **Chunking strategy validated** - Tested chunk sizes on representative documents; overlap configured appropriately (see Ch 7)
- [ ] **Chunk size tuned for retrieval** - Evaluated different sizes (256, 512, 1024 tokens) against retrieval accuracy metrics
- [ ] **Document extraction working** - PDF, DOCX, HTML parsers tested on edge cases (scanned docs, tables, multi-column)
- [ ] **Metadata extraction configured** - Source, date, author, section headers captured for filtering
- [ ] **Embedding model versioned** - Model version tracked; re-indexing plan documented for model upgrades
- [ ] **Embedding dimension matched** - Vector DB index dimension matches embedding model output dimension
- [ ] **Index versioning in place** - Can roll back to previous index version if quality degrades
- [ ] **Duplicate detection enabled** - Near-duplicate documents identified and handled (dedup or flag)
- [ ] **Quality gates configured** - Documents below quality threshold (OCR confidence, parsing errors) are flagged or rejected
- [ ] **Indexing pipeline idempotent** - Re-running indexing on same docs produces identical results
- [ ] **Incremental indexing tested** - New documents can be added without full reindex

### Retrieval Quality

- [ ] **Baseline metrics established** - Recall@K, MRR, NDCG measured on eval set before launch
- [ ] **Retrieval latency within SLO** - P95 retrieval latency meets target (typically <100ms for vector search)
- [ ] **Hybrid search configured** - If using hybrid, weights tuned (e.g., 0.7 vector / 0.3 BM25)
- [ ] **Reranker evaluated** - Cross-encoder reranker tested; latency/quality tradeoff acceptable
- [ ] **Query preprocessing tested** - Query expansion, spelling correction, entity extraction working correctly
- [ ] **Empty result handling** - Graceful fallback when no relevant documents found
- [ ] **Multi-query retrieval tested** - If decomposing queries, sub-query results properly merged
- [ ] **Filter accuracy verified** - Metadata filters (date ranges, categories) return correct subsets
- [ ] **Minimum similarity threshold set** - Low-confidence results filtered or flagged (see [07b](07b_rag_systems_code.html))

### Infrastructure

- [ ] **Vector DB properly sized** - Capacity for current docs + 2x growth; memory/disk allocated
- [ ] **Vector DB replicated** - At least one replica for availability (if using managed service, confirm SLA)
- [ ] **Embedding service scaled** - Can handle peak query load for embedding new queries
- [ ] **Caching layer configured** - Query result caching for common queries; TTL set appropriately
- [ ] **Connection pooling enabled** - Database connections pooled to avoid connection exhaustion
- [ ] **Index backup scheduled** - Regular backups of vector index; restore procedure tested
- [ ] **Health checks configured** - Vector DB and embedding service health endpoints monitored
- [ ] **Rate limiting on indexing** - Bulk indexing rate limited to avoid overwhelming embedding service

### Monitoring

- [ ] **Retrieval latency tracked** - P50, P95, P99 latency dashboards configured
- [ ] **Retrieval quality metrics logged** - Click-through rates, user satisfaction signals captured
- [ ] **Empty result rate monitored** - Alert if >X% queries return no results
- [ ] **Index freshness tracked** - Time since last document update visible; staleness alerts configured
- [ ] **Embedding model drift detection** - Periodic checks that embedding quality hasn't degraded
- [ ] **Query volume dashboards** - Traffic patterns visible for capacity planning
- [ ] **Retrieval errors alerted** - Vector DB errors, timeouts trigger alerts (see [25a](25a_incident_runbook.html))

**Related chapters:** [Ch 7 - RAG Systems](../chapters/ch07_rag_systems.html), [07a](07a_rag_pipeline_architecture.html), [07b](07b_rag_systems_code.html)

---

## 2. Agent Deployment Checklist

### Safety

- [ ] **Tool permissions scoped** - Each tool has minimum required permissions; no admin/root access
- [ ] **Sensitive tools require approval** - Human-in-the-loop for: delete operations, financial transactions, external communications, data exports
- [ ] **File system access restricted** - Agent can only access designated directories; path traversal prevented
- [ ] **Network access controlled** - Allowlist of permitted domains/IPs for external requests
- [ ] **Command execution sandboxed** - If shell access needed, use restricted sandbox (Docker, gVisor, etc.)
- [ ] **Resource limits set** - Max memory, CPU time, disk space per agent execution
- [ ] **Output size limits configured** - Prevent agents from generating unbounded output
- [ ] **PII handling rules enforced** - Agents cannot store or transmit PII without explicit safeguards (see [12a](12a_security_defense_patterns.html))
- [ ] **Audit trail for sensitive actions** - All tool invocations logged with full parameters

### Reliability

- [ ] **Loop detection implemented** - Agent terminated after N iterations or when state repeats (see [08b](08b_agent_orchestration.html))
- [ ] **Maximum iteration limit set** - Hard cap on tool calls per request (e.g., 20 iterations)
- [ ] **Per-tool timeout configured** - Individual tool calls time out (e.g., 30s for API calls, 60s for code execution)
- [ ] **Total execution timeout set** - Maximum wall-clock time for entire agent task (e.g., 5 minutes)
- [ ] **Fallback behavior defined** - Graceful degradation when tools fail or time out
- [ ] **Stuck state detection** - Agent detects when making no progress; escalates or terminates
- [ ] **Retry logic with backoff** - Transient tool failures retried with exponential backoff
- [ ] **Partial result handling** - If task partially completes, useful output still returned
- [ ] **State checkpointing** - For long-running tasks, intermediate state saved for recovery
- [ ] **Dead letter queue** - Failed agent tasks captured for debugging and retry

### Observability

- [ ] **All tool calls logged** - Tool name, parameters, return value, duration for every invocation
- [ ] **Decision traces captured** - Agent reasoning/planning steps logged for debugging
- [ ] **Token usage tracked** - Input/output tokens per agent turn; cost attribution enabled
- [ ] **Tool success rate monitored** - Per-tool success/failure rates visible in dashboards
- [ ] **Agent completion rate tracked** - What percentage of tasks complete successfully
- [ ] **Latency breakdown visible** - Time spent in LLM vs tools vs orchestration
- [ ] **Error categorization** - Failures categorized (tool error, timeout, loop, safety block)
- [ ] **Trace correlation IDs** - Agent execution linked to parent request for end-to-end tracing
- [ ] **Replay capability** - Can re-execute agent with logged inputs for debugging

### Testing

- [ ] **Adversarial testing completed** - Prompt injection attacks against agent tested (see [12b](12b_security_adversarial_code.html))
- [ ] **Tool misuse scenarios tested** - Agent tested with prompts trying to misuse tools
- [ ] **Edge cases covered** - Empty inputs, very long inputs, malformed tool responses
- [ ] **Concurrent execution tested** - Multiple agent instances running simultaneously
- [ ] **Resource exhaustion tested** - Behavior when hitting memory/timeout limits
- [ ] **Recovery testing done** - Agent behavior after partial failures verified
- [ ] **Multi-turn conversation tested** - State maintained correctly across turns
- [ ] **Integration tests with real tools** - End-to-end tests against actual tool implementations

**Related chapters:** [Ch 8 - Agentic Systems](../chapters/ch08_agentic_systems.html), [08b](08b_agent_orchestration.html), [08c](08c_agentic_systems_code.html)

---

## 3. LLM API Integration Checklist

### Error Handling

- [ ] **Retry configuration set** - Max retries (e.g., 3), backoff multiplier (e.g., 2x), max delay (e.g., 60s)
- [ ] **Retryable errors identified** - 429, 500, 502, 503, 504 retried; 400, 401, 403 not retried
- [ ] **Fallback model configured** - Secondary model/provider activated on primary failure
- [ ] **Fallback trigger conditions defined** - Switch after N consecutive failures or error rate >X%
- [ ] **Circuit breaker implemented** - Stops calling failing provider; auto-recovers after cooldown (see [25b](25b_reliability_engineering_code.html))
- [ ] **Timeout values set** - Connection timeout (5s), read timeout (60-120s based on expected response)
- [ ] **Graceful degradation message** - User-friendly error when all providers fail
- [ ] **Error details logged** - Full error response captured for debugging (sanitize sensitive data)
- [ ] **Provider status monitoring** - Subscribe to provider status pages; alert on incidents

### Cost Controls

- [ ] **Per-request budget set** - Maximum spend per API call enforced
- [ ] **Daily budget limit configured** - Hard cap on daily spend; alerts at 80%, 100%, 120%
- [ ] **Monthly budget tracked** - Projection dashboards; alerts before exceeding budget
- [ ] **Per-user limits enforced** - Individual users have request/token quotas
- [ ] **Cost attribution enabled** - Costs tagged by feature, team, customer (see [26b](26b_cost_engineering_code.html))
- [ ] **Token counting before request** - Estimate tokens before sending to avoid surprise costs
- [ ] **Output token limits set** - max_tokens parameter set appropriately (not unlimited)
- [ ] **Alerting on cost anomalies** - Spike detection for unusual spending patterns
- [ ] **Unused context optimization** - System prompts aren't unnecessarily repeated (use caching)

### Performance

- [ ] **Response caching enabled** - Identical queries return cached responses; TTL configured
- [ ] **Prompt caching utilized** - Provider prompt caching enabled for repeated system prompts (if available)
- [ ] **Batching configured** - Where possible, batch multiple requests (Batch API for non-real-time)
- [ ] **Connection keep-alive enabled** - HTTP connections reused to reduce latency
- [ ] **Streaming configured correctly** - Use streaming for long responses to improve TTFT
- [ ] **Async patterns used** - Non-blocking calls for concurrent request handling
- [ ] **Request queuing implemented** - Queue requests during rate limits instead of immediate failure
- [ ] **Latency monitoring in place** - TTFT, total latency tracked with percentile dashboards

### Security

- [ ] **API keys in secrets manager** - Never in code; environment variables in development only
- [ ] **API key rotation procedure** - Keys can be rotated without downtime; tested
- [ ] **Key scoping enforced** - Separate keys for dev/staging/prod; different keys per service
- [ ] **Request logging sanitized** - API keys, PII redacted from logs
- [ ] **Input validation enabled** - User input sanitized before sending to LLM (see [12a](12a_security_defense_patterns.html))
- [ ] **Output validation enabled** - LLM responses checked for PII leakage, harmful content
- [ ] **TLS enforced** - All API calls over HTTPS; certificate validation enabled
- [ ] **Request signing** - If provider supports, request signing enabled for integrity
- [ ] **Audit logging** - All LLM API calls logged with user ID, timestamp, sanitized content

**Related chapters:** [Ch 9 - LLM Deployment](../chapters/ch09_llm_deployment_infrastructure.html), [Ch 10 - Backend Engineering](../chapters/ch10_backend_engineering_for_ai.html), [02](02_python_ai_patterns.html)

---

## 4. Model Upgrade Checklist

### Evaluation

- [ ] **Benchmark suite run** - New model evaluated on standard eval set (see [11a](11a_evaluation_harness.html))
- [ ] **Comparison against current model** - Side-by-side metrics on identical test cases
- [ ] **Regression tests passed** - No degradation on key use cases; regressions documented if accepted
- [ ] **Edge case evaluation** - Model tested on known difficult inputs from production
- [ ] **Latency benchmarked** - New model latency within acceptable range vs current
- [ ] **Cost impact calculated** - Token pricing and efficiency differences quantified (see [26a](26a_cost_estimation_formulas.html))
- [ ] **Prompt compatibility verified** - Existing prompts work correctly with new model
- [ ] **Output format consistency** - Structured outputs (JSON, etc.) still parse correctly
- [ ] **Safety evaluation completed** - Red team testing on new model for safety regressions
- [ ] **Calibration check done** - LLM-as-judge re-calibrated if judge model also changes

### Rollout

- [ ] **Canary deployment planned** - Start with X% traffic (e.g., 5%), monitor, then expand
- [ ] **Canary success criteria defined** - Metrics thresholds for expanding rollout
- [ ] **Rollback procedure documented** - One-command rollback to previous model version
- [ ] **Rollback tested** - Actually performed a rollback in staging environment
- [ ] **Feature flags configured** - Can enable/disable new model per user segment
- [ ] **Shadow mode option available** - Can run new model in parallel without serving responses
- [ ] **A/B test configured** - Statistical comparison framework ready (see [11b](11b_mlops_evaluation_code.html))
- [ ] **Rollout timeline documented** - Schedule for 5% -> 25% -> 50% -> 100%
- [ ] **On-call notified** - Team aware of rollout timing; escalation path clear
- [ ] **Stakeholders informed** - Product, support teams aware of potential behavior changes

### Monitoring

- [ ] **Comparison dashboards ready** - Old vs new model metrics side-by-side
- [ ] **Alert thresholds updated** - Latency, error, quality alerts adjusted for new model baseline
- [ ] **Quality score monitoring** - LLM-as-judge running on sample of new model responses
- [ ] **User feedback collection** - Thumbs up/down, explicit feedback tracked during rollout
- [ ] **Error rate comparison** - New model error rate vs baseline visualized
- [ ] **Cost tracking by model version** - Spend attributed to each model version during transition
- [ ] **Latency comparison** - TTFT, total latency tracked per model version
- [ ] **Rollback trigger alerts** - Automatic alerts if new model exceeds error/latency thresholds
- [ ] **Post-rollout review scheduled** - Meeting to review metrics after full rollout

**Related chapters:** [Ch 11 - MLOps & Evaluation](../chapters/ch11_mlops_evaluation.html), [Ch 22 - Research to Production](../chapters/ch22_research_to_production.html)

---

## 5. Security Review Checklist

### Input Validation

- [ ] **Prompt injection defenses active** - Input screened for injection patterns (see [12a](12a_security_defense_patterns.html))
- [ ] **Sandwich defense implemented** - User input wrapped between instruction reinforcement
- [ ] **Input length limits enforced** - Maximum input size to prevent resource exhaustion
- [ ] **Unicode normalization applied** - Homoglyph and invisible character attacks mitigated
- [ ] **Special character handling** - Escape sequences, control characters sanitized
- [ ] **Multi-language injection tested** - Injection attempts in non-English languages blocked
- [ ] **Encoded payload detection** - Base64, URL-encoded injection attempts detected
- [ ] **Rate limiting on suspicious inputs** - Users sending risky inputs get throttled
- [ ] **Input risk scoring logged** - Suspicious inputs flagged for review

### Output Filtering

- [ ] **PII detection enabled** - SSN, credit card, phone, email patterns redacted (see [12a](12a_security_defense_patterns.html))
- [ ] **Custom PII patterns added** - Organization-specific sensitive data patterns configured
- [ ] **Content filtering active** - Harmful, inappropriate content blocked
- [ ] **Source code detection** - Credentials in generated code detected and redacted
- [ ] **Jailbreak output detection** - Responses indicating successful jailbreak flagged
- [ ] **Maximum output length enforced** - Prevent unbounded output generation
- [ ] **Citation verification** - If generating citations, verify they exist in context
- [ ] **Structured output validation** - JSON/XML outputs validated against schema
- [ ] **Logging of filtered content** - Redacted items logged for audit (with redaction)

### Access Control

- [ ] **Authentication required** - All API endpoints require valid authentication
- [ ] **Authorization checks in place** - Users can only access their own data/conversations
- [ ] **API key permissions scoped** - Keys have minimum required permissions
- [ ] **Rate limiting enforced** - Per-user, per-IP, per-API-key limits configured
- [ ] **Abuse detection active** - Unusual patterns (high volume, repeated failures) trigger blocks (see [12a](12a_security_defense_patterns.html))
- [ ] **IP allowlisting** - For sensitive endpoints, restrict to known IPs (if applicable)
- [ ] **Session management secure** - Sessions expire, tokens can be revoked
- [ ] **CORS configured correctly** - Only authorized origins can call API
- [ ] **Admin access audited** - Admin actions logged with user identity

### Audit Logging

- [ ] **All LLM calls logged** - Every API call recorded with timestamp, user, model, tokens
- [ ] **Sensitive data redacted in logs** - PII, API keys masked before logging
- [ ] **Log retention configured** - Logs retained for compliance period (e.g., 90 days)
- [ ] **Log integrity protected** - Logs stored in tamper-evident system (or hashed)
- [ ] **Log access controlled** - Only authorized personnel can access logs
- [ ] **Security event alerting** - Suspicious activity triggers security team notification
- [ ] **Compliance requirements met** - GDPR, HIPAA, SOC2 logging requirements satisfied
- [ ] **Incident investigation capability** - Can reconstruct user sessions from logs
- [ ] **Regular log review process** - Scheduled review of security-relevant logs

**Related chapters:** [Ch 12 - Security & Adversarial Robustness](../chapters/ch12_security_adversarial_robustness.html), [12a](12a_security_defense_patterns.html), [12b](12b_security_adversarial_code.html)

---

## 6. Cost Optimization Checklist

### Token Efficiency

- [ ] **System prompts optimized** - Prompts trimmed to essential instructions; no redundancy
- [ ] **Prompt caching enabled** - Provider caching features used for repeated prompts (if available)
- [ ] **Response caching implemented** - Identical queries served from cache
- [ ] **Cache hit rate monitored** - Targeting >30% cache hit rate for typical workloads
- [ ] **Context window usage tracked** - Monitoring how much of context window is typically used
- [ ] **Unnecessary context removed** - Not including full documents when excerpts suffice
- [ ] **Output length controlled** - max_tokens set appropriately; not requesting more than needed
- [ ] **Token counting accurate** - Using correct tokenizer for accurate cost estimation
- [ ] **Prompt versioning tracked** - Prompt changes linked to cost changes (see [26a](26a_cost_estimation_formulas.html))

### Model Routing

- [ ] **Task complexity assessment** - Simple tasks routed to smaller/cheaper models
- [ ] **Model routing rules defined** - Clear criteria for when to use each model tier
- [ ] **Routing accuracy monitored** - Tracking if routing decisions are correct
- [ ] **Fallback model configured** - Expensive model available for complex tasks that fail on cheaper models
- [ ] **A/B testing routing strategies** - Experimenting with different routing thresholds
- [ ] **Quality vs cost tradeoffs documented** - Decision matrix for model selection
- [ ] **Batch API usage** - Non-real-time workloads using discounted batch endpoints (e.g., 50% savings with some providers)
- [ ] **Fine-tuned model evaluation** - Assessed if fine-tuning smaller model beats larger generic model (see [26b](26b_cost_engineering_code.html))

### Infrastructure

- [ ] **Compute right-sized** - GPU/CPU instances matched to actual workload (not over-provisioned)
- [ ] **Auto-scaling configured** - Scale up for peak traffic, scale down during quiet periods
- [ ] **Spot/preemptible instances used** - Non-critical workloads on discounted instances (where applicable)
- [ ] **Reserved capacity evaluated** - For predictable workloads, committed use discounts calculated
- [ ] **Multi-region cost comparison** - Deployed in cost-effective regions (with latency tradeoffs noted)
- [ ] **Idle resource detection** - Alerting on underutilized infrastructure
- [ ] **Vector DB tiering** - Frequently accessed data on fast storage; archival on cheaper tiers
- [ ] **Cold start optimization** - Serverless functions kept warm if cold start costs exceed benefit

### Budget Management

- [ ] **Cost dashboards deployed** - Real-time visibility into spend by service, feature, team
- [ ] **Budget alerts configured** - Notifications at 50%, 80%, 100% of budget
- [ ] **Anomaly detection active** - Alerts on unexpected spend spikes
- [ ] **Cost attribution working** - Spend tagged to teams/products for chargeback
- [ ] **Monthly cost reviews scheduled** - Regular review of optimization opportunities
- [ ] **Cost vs quality metrics linked** - Understanding cost per successful response, not just cost per request
- [ ] **Forecasting in place** - Projected costs based on traffic trends
- [ ] **Emergency cost controls documented** - Runbook for rapid cost reduction if needed (see [25a](25a_incident_runbook.html))

**Related chapters:** [Ch 26 - Cost Engineering](../chapters/ch26_cost_engineering.html), [26a](26a_cost_estimation_formulas.html), [26b](26b_cost_engineering_code.html)

---

## Quick Reference: Pre-Launch Gate

Before any production launch, verify these critical items are complete:

| Category | Critical Items |
|----------|----------------|
| **Safety** | Tool permissions scoped, human-in-loop for sensitive actions, input validation active |
| **Reliability** | Timeouts configured, fallbacks working, circuit breakers enabled |
| **Security** | API keys secured, input/output filtering active, audit logging enabled |
| **Monitoring** | Dashboards deployed, alerts configured, on-call rotation set |
| **Cost** | Budgets set, alerts configured, cost tracking by feature |
| **Rollout** | Canary plan ready, rollback tested, stakeholders notified |

---

## Usage Guide

### Before First Production Deployment

1. Complete all items in the relevant checklists above
2. Have a second person verify critical items (safety, security)
3. Document any intentionally skipped items with justification
4. Schedule post-launch review (1 week) to reassess

### For Model/System Updates

1. Complete Model Upgrade Checklist
2. Re-run relevant portions of original deployment checklist
3. Use canary deployment; do not go 0% to 100%

### Regular Review Cadence

| Checklist | Review Frequency |
|-----------|------------------|
| RAG System | Monthly + after index changes |
| Agent Safety | Monthly + after tool changes |
| LLM API Integration | Quarterly + after provider changes |
| Security | Quarterly + after any security incident |
| Cost Optimization | Monthly |

---

*These checklists complement the incident runbooks ([25a](25a_incident_runbook.html)) and should be reviewed together before production deployment.*
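
The "Loop detection implemented" and "Maximum iteration limit set" items in the Agent Deployment Checklist can be illustrated with a small guard object. This is a minimal sketch, not the implementation from 08b: the `LoopGuard` class, its `check` method, and the repeat threshold are hypothetical names and values chosen for illustration, and it assumes tool parameters are JSON-serializable.

```python
import hashlib
import json


class LoopGuard:
    """Hypothetical guard: stops an agent on too many steps or repeated calls."""

    def __init__(self, max_iterations=20, max_repeats=2):
        self.max_iterations = max_iterations  # hard cap on tool calls per request
        self.max_repeats = max_repeats        # identical calls tolerated before stopping
        self.iterations = 0
        self.seen = {}

    def check(self, tool_name, params):
        """Return None to continue, or a reason string to terminate."""
        self.iterations += 1
        if self.iterations > self.max_iterations:
            return "max_iterations"
        # Fingerprint the call; sort_keys makes dict ordering irrelevant.
        payload = json.dumps([tool_name, params], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_repeats:
            return "repeated_state"
        return None
```

An orchestrator would call `check` before each tool invocation and terminate or escalate on a non-`None` result. Fingerprinting only the tool call is a simplification; a fuller version might hash the whole agent state.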
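
The retry items in the LLM API Integration Checklist (max retries of 3, a 2x backoff multiplier, a 60s delay cap, and the retryable status list) can be sketched as follows. `ApiError`, `backoff_delay`, and `call_with_retry` are hypothetical names, not part of any provider SDK, and the jitter scheme is one common choice rather than a requirement of the checklist.

```python
import random
import time

# Status codes the checklist marks as retryable; 400/401/403 are not retried.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt, base=1.0, multiplier=2.0, max_delay=60.0):
    """Exponential backoff: base * multiplier**attempt, capped at max_delay."""
    return min(max_delay, base * multiplier ** attempt)


def call_with_retry(fn, max_retries=3, sleep=time.sleep, jitter=random.random):
    """Call fn(), retrying retryable ApiErrors up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ApiError as err:
            # Give up on non-retryable errors or once retries are exhausted.
            if err.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            # Scale the delay into [50%, 100%] to avoid synchronized retries.
            sleep(backoff_delay(attempt) * (0.5 + 0.5 * jitter()))
```

Passing `sleep` and `jitter` as parameters keeps the function testable without real waiting.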
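
The "Circuit breaker implemented" item (stop calling a failing provider, auto-recover after a cooldown) might look like the sketch below. It is a simplified illustration and not the code from 25b: the class name, the thresholds, and the injectable `clock` are assumptions, and the half-open state here lets every request through after the cooldown instead of a single probe.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after cooldown."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow_request(self):
        """True if the provider may be called right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, allow trial requests (half-open).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        """A successful call closes the circuit and resets the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Callers check `allow_request()` before each provider call and switch to the fallback model when it returns `False`.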
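
The rollout schedule in the Model Upgrade Checklist (5% -> 25% -> 50% -> 100%, gated by canary success criteria) can be expressed as a small decision function. This is a sketch under stated assumptions: the function name and the error-rate and latency thresholds are hypothetical placeholders, and real criteria would come from the team's own SLOs.

```python
# Traffic stages from the checklist's rollout timeline.
ROLLOUT_STAGES = (5, 25, 50, 100)


def next_traffic_percent(current, error_rate, p95_latency_ms,
                         max_error_rate=0.01, max_p95_latency_ms=2000.0):
    """Return the next traffic percentage, or 0 to signal a rollback.

    The default thresholds are illustrative placeholders only.
    """
    # Any breached criterion triggers rollback to the previous model.
    if error_rate > max_error_rate or p95_latency_ms > max_p95_latency_ms:
        return 0
    # Otherwise advance to the first stage above the current percentage.
    for stage in ROLLOUT_STAGES:
        if stage > current:
            return stage
    return 100  # already fully rolled out
```

Stepping through stages this way enforces the usage guide's rule of never going from 0% to 100% in one move.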
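
For the "PII detection enabled" item in the Security Review Checklist, a regex-based redaction pass is one simple starting point. The patterns below are deliberately naive illustrations covering only US-style SSNs and email addresses; they are not the patterns from 12a, and `redact_pii` is a hypothetical helper. Production systems usually combine patterns like these with a dedicated PII detection service.

```python
import re

# Naive illustrative patterns; real deployments need broader coverage.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def redact_pii(text):
    """Replace matches with [LABEL] and return (redacted_text, counts)."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        # subn returns the new string and the number of substitutions made.
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts
```

Returning the counts alongside the text supports the "Logging of filtered content" item: the audit log can record what was redacted without storing the sensitive values.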
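
The "Budget alerts configured" item in the Cost Optimization Checklist (notifications at 50%, 80%, 100% of budget) reduces to detecting which thresholds a spend update has just crossed. The helper below is a hypothetical sketch; the function name and signature are assumptions for illustration.

```python
def crossed_thresholds(previous_spend, current_spend, budget,
                       thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions crossed between two spend readings."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # A threshold fires once: when spend moves from below it to at or above it.
    return [t for t in thresholds
            if previous_spend < t * budget <= current_spend]
```

Comparing against the previous reading means each alert fires once per crossing instead of on every check while spend stays above the threshold.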