Appendix G: Architecture Decision Records (ADRs)
This appendix provides example ADRs for common AI engineering decisions. Use these as templates for documenting your own architectural choices.
ADR Template
# ADR-XXX: [Title]
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]
## Context
[What is the issue we're facing? What forces are at play?]
## Decision
[What is the change being proposed/made?]
## Consequences
[What are the positive and negative effects of this decision?]
## Alternatives Considered
[What other options were evaluated?]
## References
[Related documents, chapters, papers]
ADR-001: RAG vs Fine-Tuning for Domain Knowledge
Status
Accepted
Context
We need to add domain-specific knowledge to our LLM application. The knowledge includes:
- 50,000 internal documents (policies, procedures, product specs)
- Content that updates weekly
- Specialized terminology users expect precise answers about
We must decide between:
1. Fine-tuning a model on this domain
2. Using RAG to retrieve relevant documents at query time
3. A hybrid approach
Decision
We will use RAG as the primary approach with the following implementation:
- Vector database for document storage (Qdrant)
- Hybrid search (dense embeddings + BM25)
- Reranking with cross-encoder
- Weekly incremental indexing
Fine-tuning will only be considered if:
- RAG retrieval quality remains below 80% after optimization
- We identify specific terminology gaps that prompting can’t address
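The decision lists hybrid search (dense embeddings + BM25) but does not say how the two result lists are combined. One common option is reciprocal rank fusion (RRF). The sketch below assumes each retriever returns an ordered list of document IDs; the function name, the sample IDs, and the constant `k=60` are illustrative assumptions, not part of this ADR.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumption).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical dense-retriever results
bm25_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only ranks, so the dense and BM25 scores do not need to be on a comparable scale. The fused list would then go to the cross-encoder reranker.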
Consequences
Positive:
- Documents can be updated without model retraining
- Full traceability (we can cite sources)
- Lower compute costs (no training infrastructure)
- Faster iteration on content
- No risk of catastrophic forgetting

Negative:
- Added latency from retrieval step (~200ms)
- Retrieval quality becomes critical path
- Context window limits affect how much we can include
- Infrastructure complexity (vector DB, embeddings)
Alternatives Considered
1. Full Fine-Tuning
   - Pros: Knowledge “baked in,” lower inference latency
   - Cons: Expensive to retrain on updates, no citations, risk of hallucination
   - Rejected because: Update frequency (weekly) makes retraining impractical
2. Few-Shot Prompting Only
   - Pros: Simplest approach
   - Cons: Context window can’t fit 50K documents
   - Rejected because: Volume of knowledge far exceeds context limits
3. Fine-Tune + RAG Hybrid
   - Pros: Best of both worlds potentially
   - Cons: Highest complexity and cost
   - Deferred: Will revisit if RAG alone proves insufficient
References
- Chapter 7: RAG Systems Deep Dive
- Chapter 15: MLOps & Evaluation (Fine-Tuning approaches)
- Paper: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”
ADR-002: Vector Database Selection
Status
Accepted
Context
Our RAG system requires a vector database for storing and searching embeddings. Key requirements:
- Scale: 10M vectors initially, growing to 100M
- Latency: <100ms p99 for search
- Availability: 99.9% uptime for production
- Features: Filtering, hybrid search
- Team: Small team (3 engineers), limited ops capacity
Candidates evaluated:
1. Pinecone (managed)
2. Weaviate (self-hosted or managed)
3. Qdrant (self-hosted or managed)
4. Milvus (self-hosted)
5. pgvector (PostgreSQL extension)
Decision
We will use Qdrant Cloud (managed) with the following configuration:
- Managed deployment to minimize ops burden
- HNSW index with default parameters initially
- Enable hybrid search (sparse + dense)
- Start with single region, plan multi-region for phase 2
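A back-of-envelope memory estimate helps with sizing for the 10M to 100M vector range in the requirements. The sketch below counts only float32 vector storage plus HNSW layer-0 links. The embedding dimension (768) and the HNSW parameter `m=16` are assumptions, since the ADR does not state them, and payloads, upper graph layers, and allocator overhead are ignored.

```python
def estimate_hnsw_memory_gb(num_vectors: int, dim: int, m: int = 16,
                            bytes_per_float: int = 4) -> float:
    """Rough RAM estimate in GB (10^9 bytes) for vectors plus HNSW links.

    Assumes up to 2*m neighbor links per vector on layer 0, 4 bytes each.
    """
    vector_bytes = num_vectors * dim * bytes_per_float
    link_bytes = num_vectors * 2 * m * 4
    return (vector_bytes + link_bytes) / 1e9


initial = estimate_hnsw_memory_gb(10_000_000, dim=768)   # 32.0
target = estimate_hnsw_memory_gb(100_000_000, dim=768)   # 320.0
print(f"{initial:.1f} GB initially, {target:.1f} GB at target scale")
```

Under these assumptions the index grows from roughly 32 GB to roughly 320 GB, which is why quantization or on-disk storage tends to come up before the 100M mark.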
Consequences
Positive:
- Managed service reduces ops burden significantly
- Strong hybrid search support out of the box
- Good balance of cost and features
- Easy to migrate from cloud to self-hosted later if needed
- Active community and good documentation

Negative:
- Vendor lock-in for managed service
- Less customization than fully self-hosted
- Cost scales with usage (vs fixed self-hosted cost)
Alternatives Considered
1. Pinecone
   - Pros: Simplest managed option, excellent reliability
   - Cons: Higher cost, less flexibility
   - Rejected because: Cost at 100M vectors scale was 3x Qdrant
2. Weaviate
   - Pros: Good hybrid search, managed option available
   - Cons: More complex configuration
   - Rejected because: Qdrant had better performance benchmarks for our workload
3. Milvus (Self-Hosted)
   - Pros: Most scalable, full control
   - Cons: Significant ops overhead (a distributed deployment depends on etcd, object storage, and a message queue such as Pulsar or Kafka)
   - Rejected because: Team size can’t support ops complexity
4. pgvector
   - Pros: Familiar Postgres, simple ops
   - Cons: Best suited for under ~10M vectors, limited hybrid search support
   - Rejected because: Won’t scale to 100M vectors without sharding
References
- Chapter 7: Vector Databases in Production
- Appendix B: Tool & Framework Reference
- ANN Benchmarks: ann-benchmarks.com
ADR-003: Self-Hosted vs API for LLM Inference
Status
Accepted
Context
We need to decide how to serve LLM inference for our production application:
Requirements:
- ~100K requests/day initially, growing to 1M/day
- Latency: <2s p95 for generation
- Privacy: Some queries contain PII that shouldn’t leave our infrastructure
- Budget: $20K/month initially, scaling with usage
- Models: Need GPT-5 class quality for complex queries, smaller for simple
Options:
1. API only (OpenAI, Anthropic, etc.)
2. Self-hosted only (vLLM, TGI)
3. Hybrid (API for complex, self-hosted for simple/private)
Decision
We will implement a hybrid approach:
- Self-hosted (vLLM) for:
  - Simple queries (classification, extraction)
  - Queries containing PII
  - High-volume, cost-sensitive workloads
  - Model: Llama 3 70B with INT8 quantization
- API (Anthropic Claude) for:
  - Complex reasoning queries
  - When self-hosted is at capacity
  - New capabilities not yet available in open models
- Router that directs queries based on:
  - Complexity score (LLM-based classification)
  - PII detection
  - Current load on self-hosted
  - Cost optimization targets
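The routing rules above can be sketched as a small decision function. This is a minimal illustration, not the production router: the regex-based PII check, the 0-1 complexity score, the thresholds, and all names (`RouterState`, `route`, `contains_pii`) are assumptions introduced for the example. Cost optimization targets are left out for brevity.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately simple PII patterns; a production system would
# use a dedicated PII detector.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email address
]


@dataclass
class RouterState:
    self_hosted_load: float  # 0.0 (idle) to 1.0 (saturated)
    max_load: float = 0.9    # assumed capacity threshold


def contains_pii(query: str) -> bool:
    return any(p.search(query) for p in PII_PATTERNS)


def route(query: str, complexity: float, state: RouterState,
          complexity_threshold: float = 0.7) -> str:
    """Return "self_hosted" or "api" for a query.

    complexity is assumed to be a 0-1 score from an upstream classifier.
    """
    # PII must stay on our infrastructure regardless of load or complexity.
    if contains_pii(query):
        return "self_hosted"
    if complexity >= complexity_threshold:
        return "api"
    # Simple query: overflow to the API only when self-hosted is at capacity.
    if state.self_hosted_load >= state.max_load:
        return "api"
    return "self_hosted"
```

The PII check runs first on purpose: a complex query that contains PII still stays on self-hosted infrastructure, even if answer quality is lower.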
Consequences
Positive:
- Privacy compliance for PII data
- Cost optimization (self-hosted for bulk)
- Flexibility to use best model for each task
- Not fully dependent on any single provider
- Can gradually shift workload as self-hosted scales

Negative:
- Operational complexity (managing both)
- Router adds latency and complexity
- Need expertise in both API integration and GPU ops
- More complex debugging when issues span systems
Cost Analysis
| Scenario | API Only | Self-Hosted Only | Hybrid |
|---|---|---|---|
| 100K req/day | $30K/mo | $8K/mo | $15K/mo |
| 1M req/day | $300K/mo | $40K/mo | $80K/mo |
Assumes 60% of queries are simple and routed to self-hosted in the hybrid scenario.
Alternatives Considered
1. API Only
   - Pros: Simplest, no ops burden
   - Cons: Cost at scale, PII concerns, vendor dependency
   - Rejected because: PII requirement forces some self-hosting
2. Self-Hosted Only
   - Pros: Full control, lowest cost at scale
   - Cons: Miss out on frontier capabilities, more ops burden
   - Rejected because: Want access to best models for complex queries
References
- Chapter 9: LLM Deployment & Infrastructure
- Chapter 32: Cost Engineering
- Appendix B: Inference engines comparison
ADR-004: Chunking Strategy for Technical Documentation
Status
Accepted
Context
We’re indexing technical documentation for RAG. The content includes:
- API references with code examples
- Tutorial articles with sequential steps
- Conceptual explanations with diagrams
- Changelog entries
We need a chunking strategy that:
- Preserves semantic coherence
- Works well with embedding models
- Produces chunks that fit in context windows
- Maintains useful metadata
Decision
We will use hierarchical semantic chunking:
- First pass: Split by major sections (H1, H2 headers)
- Second pass: Within sections, split by paragraphs
- Merge: Combine small paragraphs up to 512 tokens
- Overlap: 50 token overlap between chunks
- Metadata: Preserve document title, section hierarchy, source URL
Special handling:
- Code blocks: Keep code and surrounding explanation together
- Lists: Keep complete lists together up to 800 tokens
- Tables: Keep tables as single chunks with surrounding context
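The merge and overlap steps above can be sketched as a greedy packer. This is an illustration under assumptions: tokens are approximated by whitespace splitting rather than the embedding model's tokenizer, and the function name `merge_paragraphs` is hypothetical. A paragraph that is itself longer than the budget is kept whole, so a chunk can exceed `max_tokens` in that case.

```python
def merge_paragraphs(paragraphs: list[str], max_tokens: int = 512,
                     overlap: int = 50) -> list[str]:
    """Greedily pack paragraphs into chunks targeting max_tokens tokens.

    Each new chunk starts with the last `overlap` tokens of the previous
    chunk, unless that would push it over the budget.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    for para in paragraphs:
        tokens = para.split()  # whitespace tokens (assumption)
        if current and len(current) + len(tokens) > max_tokens:
            chunks.append(current)
            tail = current[-overlap:] if overlap > 0 else []
            # Drop the overlap if it would push the next chunk over budget.
            current = tail if len(tail) + len(tokens) <= max_tokens else []
        current.extend(tokens)
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]


demo = merge_paragraphs(["a a a a a a", "b b b b b b", "c c"],
                        max_tokens=10, overlap=2)
print(demo)  # ['a a a a a a', 'a a b b b b b b c c']
```

Splitting only at paragraph boundaries keeps sentences intact. The cost is the variable chunk size listed under the negative consequences.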
Consequences
Positive:
- Preserves semantic meaning within chunks
- Code examples stay with their explanations
- Section hierarchy aids retrieval (can boost by header match)
- Manageable chunk sizes for embedding models

Negative:
- Variable chunk sizes (100-800 tokens)
- More complex implementation than fixed-size
- Requires format-specific parsing (Markdown, HTML)
Implementation
    def hierarchical_chunk(doc: str, metadata: dict) -> list[Chunk]:
        # Parse document structure
        sections = parse_sections(doc)
        chunks = []
        for section in sections:
            section_metadata = {
                **metadata,
                "section_title": section.title,
                "section_path": section.path,  # e.g., "Installation > Prerequisites"
            }
            # Handle code blocks specially
            if section.has_code_blocks:
                chunks.extend(chunk_with_code(section, section_metadata))
            else:
                chunks.extend(semantic_chunk(section.text, section_metadata))
        return chunks

Alternatives Considered
1. Fixed-Size Chunking (512 tokens)
   - Pros: Simple, predictable
   - Cons: Splits code blocks and semantic units
   - Rejected because: Poor quality on technical content
2. Sentence-Based Chunking
   - Pros: Natural boundaries
   - Cons: Very small chunks, loses context
   - Rejected because: Too fine-grained for our content
3. Document-Level (No Chunking)
   - Pros: Full context preserved
   - Cons: Many docs exceed context limits
   - Rejected because: Documents average 5000 tokens
References
- Chapter 7: Document Processing and Chunking
- Chapter 7: Evaluation section on retrieval metrics
ADR-005: Agent Safety Architecture
Status
Accepted
Context
We’re building an agent that can:
- Read and write files in user workspaces
- Execute code in sandboxed environments
- Make API calls to external services
- Search the web
We need safety measures that:
- Prevent prompt injection attacks
- Limit blast radius of failures
- Require confirmation for dangerous operations
- Maintain audit trail
Decision
We will implement defense-in-depth with four layers:
Layer 1: Input Validation
- Scan all inputs for injection patterns
- Rate limit by user and session
- Log suspicious inputs for review
Layer 2: Tool Policies
    TOOL_POLICIES = {
        "read_file": {
            "allowed_paths": ["/workspace/"],
            "blocked_patterns": [".env", "credentials"],
            "requires_confirmation": False,
        },
        "write_file": {
            "allowed_paths": ["/workspace/"],
            "blocked_patterns": [".git/", "node_modules/"],
            "requires_confirmation": True,
        },
        "execute_code": {
            "sandboxed": True,
            "timeout_seconds": 30,
            "blocked_imports": ["os", "subprocess", "socket"],
            "requires_confirmation": True,
        },
        "web_request": {
            "allowed_domains": ["api.example.com", "docs.example.com"],
            "requires_confirmation": False,
        },
    }

Layer 3: Sandboxed Execution
- Code runs in Docker containers
- Network disabled by default
- Read-only filesystem
- Resource limits (CPU, memory, time)
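The Layer 3 constraints map fairly directly onto `docker run` flags. The sketch below only builds the argument list; it does not start a container. The helper name, the image name, and the specific memory, CPU, and PID limits are illustrative assumptions. The time limit is assumed to be enforced by the caller, for example through `subprocess.run(..., timeout=30)`.

```python
def docker_sandbox_args(image: str, memory: str = "256m",
                        cpus: float = 1.0, pids_limit: int = 64) -> list[str]:
    """Build a `docker run` command line for a locked-down container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",               # network disabled by default
        "--read-only",                     # read-only root filesystem
        "--memory", memory,                # memory limit
        "--cpus", str(cpus),               # CPU limit
        "--pids-limit", str(pids_limit),   # cap on process count
        image,
    ]


args = docker_sandbox_args("python:3.12-slim")  # example image (assumption)
print(" ".join(args))
```

Code that needs scratch space under a read-only root filesystem would also need a writable mount such as `--tmpfs /tmp`, which is omitted here.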
Layer 4: Human-in-the-Loop
- Dangerous operations require explicit approval
- Summary of proposed actions shown before execution
- User can cancel or modify
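The Layer 2 path policies only help if paths are normalized before they are checked. Otherwise a path such as `/workspace/../etc/passwd` passes a naive prefix test. The following sketch shows one way to enforce `allowed_paths` and `blocked_patterns`. The `POLICIES` dict is a trimmed copy of the tool policies for self-containment, and `is_path_allowed` is a hypothetical helper. It does not resolve symlinks, which a real implementation would also need to handle.

```python
import posixpath

# Trimmed copy of the Layer 2 policy for a self-contained example.
POLICIES = {
    "read_file": {
        "allowed_paths": ["/workspace/"],
        "blocked_patterns": [".env", "credentials"],
    },
}


def is_path_allowed(tool: str, path: str, policies: dict = POLICIES) -> bool:
    """Check a requested path against the tool's policy."""
    policy = policies.get(tool)
    if policy is None:
        return False  # unknown tools are denied by default
    # Collapse "." and ".." segments so traversal cannot escape the root.
    normalized = posixpath.normpath(path)
    in_allowed_root = any(
        normalized.startswith(root) for root in policy["allowed_paths"]
    )
    is_blocked = any(pat in normalized for pat in policy["blocked_patterns"])
    return in_allowed_root and not is_blocked


print(is_path_allowed("read_file", "/workspace/notes.txt"))       # True
print(is_path_allowed("read_file", "/workspace/../etc/passwd"))   # False
```

Denying unknown tools by default means that adding a tool without a policy entry fails closed rather than open.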
Consequences
Positive:
- Defense in depth limits blast radius
- Audit trail for compliance
- User control over risky operations
- Can tune policies without code changes

Negative:
- Added latency for confirmations
- Some user friction
- Maintenance overhead for policies
- Complex testing (security scenarios)
Alternatives Considered
1. No Safety Measures (Trust the Model)
   - Rejected because: Unacceptable risk for file/code execution
2. Confirmation for All Operations
   - Pros: Maximum safety
   - Cons: Too much friction, users would disable
   - Rejected because: Poor user experience
3. Static Allowlist Only
   - Pros: Simple
   - Cons: Too restrictive or requires frequent updates
   - Rejected because: Need flexibility of policy system
References
- Chapter 8: Agent Safety and Constraints
- Chapter 16: Security for Agentic Systems
- Chapter 16: Defense-in-Depth Architecture
ADR-006: Evaluation Strategy for LLM Application
Status
Accepted
Context
We need a comprehensive evaluation strategy for our LLM-powered application. Requirements:
- Catch regressions before deployment
- Measure quality across dimensions (accuracy, safety, style)
- Handle subjective quality judgments
- Scale to thousands of test cases
- Run in CI/CD pipeline
Decision
We will implement a multi-tier evaluation system:
Tier 1: Deterministic Tests (CI - Every Commit)
- Format validation (JSON schema compliance)
- Safety filters (blocked content detection)
- Response length bounds
- Required field presence
- ~500 test cases, runs in <5 minutes
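A Tier 1 check can be as small as the function below. It covers only three of the listed checks: valid JSON, required fields, and a length bound. The field names and the character limit are hypothetical, because the ADR does not define the response schema.

```python
import json

REQUIRED_FIELDS = {"answer", "sources"}  # hypothetical response schema
MAX_CHARS = 4000                         # hypothetical length bound


def tier1_check(raw_response: str) -> list[str]:
    """Return failure messages; an empty list means the response passed."""
    if len(raw_response) > MAX_CHARS:
        return [f"response exceeds {MAX_CHARS} characters"]
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value must be an object"]
    missing = sorted(REQUIRED_FIELDS - set(payload))
    return [f"missing field: {name}" for name in missing]
```

The checks are deterministic and make no model calls, so running hundreds of them on every commit is inexpensive.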
Tier 2: LLM-as-Judge (Pre-Merge - Every PR)
- Accuracy rating (1-5 scale)
- Helpfulness rating
- Safety rating
- Factual grounding check
- ~200 test cases, runs in <15 minutes
- Uses Claude Haiku as judge for cost efficiency

Tier 3: Human Evaluation (Weekly)
- Random sample of 100 production queries
- Expert review for domain correctness
- Edge case identification
- Calibrates Tier 2 judges
Tier 4: A/B Testing (Major Changes)
- Live traffic comparison
- Business metrics (conversion, satisfaction)
- Statistical significance requirements
- 1-2 week evaluation periods
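Tier 4 mentions statistical significance requirements without naming a test. For a conversion metric, one standard choice is a two-proportion z-test, sketched below with only the standard library. The function name and the sample counts are illustrative, and the choice of test is an assumption rather than something this ADR specifies.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: control converts 100/1000, variant converts 130/1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z={z:.2f}, p={p_value:.3f}")
```

Fixing the sample size and significance threshold before the experiment starts avoids the inflated false-positive rate that comes from checking results repeatedly and stopping early.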
Consequences
Positive:
- Fast feedback on obvious issues (Tier 1)
- Catches quality regressions (Tier 2)
- Human calibration prevents drift (Tier 3)
- Business impact validation (Tier 4)
- Appropriate cost at each tier

Negative:
- Complex pipeline to maintain
- LLM-as-judge has known biases
- Human evaluation is expensive and slow
- A/B tests delay launches
Metrics Targets
| Tier | Metric | Target | Blocking |
|---|---|---|---|
| 1 | Format compliance | 100% | Yes |
| 2 | Accuracy score | >4.0/5.0 | Yes |
| 2 | Safety score | >4.5/5.0 | Yes |
| 3 | Human accuracy | >90% | No (alerts) |
| 4 | Conversion lift | >0% | Yes |
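The targets table can be encoded as a small gate that separates blocking failures from alert-only ones. This is a sketch: the metric keys and the `evaluate_gate` helper are hypothetical names, percentages are expressed as fractions, and "100%" is read as "at least 1.0" while the other targets are strict inequalities, following the table.

```python
import operator

# metric -> (comparison, threshold, blocking), mirroring the targets table.
TARGETS = {
    "format_compliance": (operator.ge, 1.00, True),
    "accuracy_score": (operator.gt, 4.0, True),
    "safety_score": (operator.gt, 4.5, True),
    "human_accuracy": (operator.gt, 0.90, False),  # alerts only
    "conversion_lift": (operator.gt, 0.0, True),
}


def evaluate_gate(results: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (deploy_allowed, alerts) for the metrics present in results."""
    allowed = True
    alerts: list[str] = []
    for name, value in results.items():
        compare, threshold, blocking = TARGETS[name]
        if compare(value, threshold):
            continue
        alerts.append(f"{name}={value} misses target {threshold}")
        if blocking:
            allowed = False
    return allowed, alerts
```

Because each tier runs at a different stage, the gate evaluates only the metrics it is given. A pre-merge run would pass Tier 1 and Tier 2 results and leave out the others.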
References
- Chapter 15: MLOps & Evaluation
- Chapter 15: LLM-as-Judge methodology
- Chapter 15: A/B Testing for AI
ADR-007: Observability Stack for AI Systems
Status
Accepted
Context
We need observability for our LLM application covering:
- Traditional metrics (latency, errors, throughput)
- AI-specific metrics (token usage, quality scores)
- Request/response logging for debugging
- Cost tracking
- Prompt/response inspection for issues
Decision
We will use the following stack:
Metrics: Prometheus + Grafana
- Standard infrastructure metrics
- Custom AI metrics (tokens, costs, quality)
- Alerting through Grafana
Tracing and Logging: OpenTelemetry + Jaeger
- Distributed tracing across services
- Request/response capture (redacted)
- Span-level timing breakdown
AI-Specific: Custom LLM Logger
    from dataclasses import dataclass
    from datetime import datetime


    @dataclass
    class LLMRequestLog:
        request_id: str
        timestamp: datetime
        model: str
        prompt_hash: str      # For grouping similar prompts
        prompt_preview: str   # First 200 chars
        input_tokens: int
        output_tokens: int
        latency_ms: float
        status: str
        cost_usd: float
        quality_score: float  # From automated eval
        user_id: str          # For cost attribution

Dashboards:
1. System health (latency, errors, throughput)
2. Model performance (quality scores over time)
3. Cost tracking (by model, user, endpoint)
4. Debug view (inspect individual requests)
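Two of the `LLMRequestLog` fields are derived rather than observed: `prompt_hash` and `cost_usd`. The sketch below shows one way to compute them. The price table is hypothetical, so substitute your provider's actual rates. The normalization rule (lowercase, collapsed whitespace) is an assumption. It groups prompts that differ only in case or spacing, not prompts that are semantically similar.

```python
import hashlib
import re

# Hypothetical per-million-token prices in USD.
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def prompt_hash(prompt: str) -> str:
    """Hash a case- and whitespace-normalized prompt for grouping."""
    normalized = re.sub(r"\s+", " ", prompt.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute request cost from token counts and the price table."""
    price = PRICES_PER_MTOK[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


print(prompt_hash("Hello   World") == prompt_hash(" hello world\n"))  # True
print(cost_usd("example-model", 1000, 500))
```

Storing the hash and a short preview instead of the full prompt also limits how much user text ends up in the logs, which bears on the privacy concern noted under the consequences.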
Consequences
Positive:
- Complete visibility into system behavior
- AI-specific insights beyond traditional ops
- Cost attribution for business planning
- Debugging capability for quality issues

Negative:
- Storage costs for detailed logs
- Privacy considerations for logging prompts
- Dashboard maintenance overhead
- Learning curve for the team
References
- Chapter 11: Observability section
- Chapter 31: Reliability Engineering
- Chapter 32: Cost Attribution
ADR Index
| ADR | Title | Status | Key Decision |
|---|---|---|---|
| 001 | RAG vs Fine-Tuning | Accepted | Use RAG for updatable knowledge |
| 002 | Vector Database | Accepted | Qdrant Cloud (managed) |
| 003 | LLM Inference | Accepted | Hybrid: self-hosted + API |
| 004 | Chunking Strategy | Accepted | Hierarchical semantic chunking |
| 005 | Agent Safety | Accepted | Defense-in-depth, 4 layers |
| 006 | Evaluation Strategy | Accepted | Multi-tier (deterministic → human) |
| 007 | Observability | Accepted | Prometheus + OTel + custom |
Creating Your Own ADRs
When writing ADRs for your team:
- Write them at decision time: Not after the fact
- Keep them concise: 1-2 pages max
- Include context: Future readers need to understand why
- Document alternatives: Shows due diligence
- Accept uncertainty: It’s OK to decide with incomplete info
- Update status: Mark deprecated/superseded when things change
- Link to code: Reference implementations where relevant
ADRs are historical documents. Don’t modify the decision section after acceptance—write a new ADR that supersedes it.