Appendix G: Architecture Decision Records (ADRs)

This appendix provides example ADRs for common AI engineering decisions. Use these as templates for documenting your own architectural choices.


ADR Template

# ADR-XXX: [Title]

## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]

## Context
[What is the issue we're facing? What forces are at play?]

## Decision
[What is the change being proposed/made?]

## Consequences
[What are the positive and negative effects of this decision?]

## Alternatives Considered
[What other options were evaluated?]

## References
[Related documents, chapters, papers]

ADR-001: RAG vs Fine-Tuning for Domain Knowledge

Status

Accepted

Context

We need to add domain-specific knowledge to our LLM application. The knowledge includes:

  • 50,000 internal documents (policies, procedures, product specs)
  • Content that updates weekly
  • Specialized terminology users expect precise answers about

We must decide between:

  1. Fine-tuning a model on this domain
  2. Using RAG to retrieve relevant documents at query time
  3. A hybrid approach

Decision

We will use RAG as the primary approach with the following implementation:

  • Vector database for document storage (Qdrant)
  • Hybrid search (dense embeddings + BM25)
  • Reranking with cross-encoder
  • Weekly incremental indexing

Fine-tuning will only be considered if:

  • RAG retrieval quality remains below 80% after optimization
  • We identify specific terminology gaps that prompting can’t address
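The decision calls for hybrid search but does not say how the dense and BM25 result lists are combined. One common option is reciprocal rank fusion (RRF); the sketch below assumes that choice. The function name, the document IDs, and the constant k=60 are illustrative and do not come from this ADR.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
dense_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents that rank well in both lists rise to the top. The fused list would then go to the cross-encoder for reranking.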

Consequences

Positive:

  • Documents can be updated without model retraining
  • Full traceability (we can cite sources)
  • Lower compute costs (no training infrastructure)
  • Faster iteration on content
  • No risk of catastrophic forgetting

Negative:

  • Added latency from retrieval step (~200ms)
  • Retrieval quality becomes critical path
  • Context window limits affect how much we can include
  • Infrastructure complexity (vector DB, embeddings)

Alternatives Considered

  1. Full Fine-Tuning
    • Pros: Knowledge “baked in,” lower inference latency
    • Cons: Expensive to retrain on updates, no citations, risk of hallucination
    • Rejected because: Update frequency (weekly) makes retraining impractical
  2. Few-Shot Prompting Only
    • Pros: Simplest approach
    • Cons: Context window can’t fit 50K documents
    • Rejected because: Volume of knowledge far exceeds context limits
  3. Fine-Tune + RAG Hybrid
    • Pros: Potentially the best of both worlds
    • Cons: Highest complexity and cost
    • Deferred: Will revisit if RAG alone proves insufficient

References

  • Chapter 7: RAG Systems Deep Dive
  • Chapter 15: MLOps & Evaluation (Fine-Tuning approaches)
  • Paper: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”

ADR-002: Vector Database Selection

Status

Accepted

Context

Our RAG system requires a vector database for storing and searching embeddings. Key requirements:

  • Scale: 10M vectors initially, growing to 100M
  • Latency: <100ms p99 for search
  • Availability: 99.9% uptime for production
  • Features: Filtering, hybrid search
  • Team: Small team (3 engineers), limited ops capacity

Candidates evaluated:

  1. Pinecone (managed)
  2. Weaviate (self-hosted or managed)
  3. Qdrant (self-hosted or managed)
  4. Milvus (self-hosted)
  5. pgvector (PostgreSQL extension)

Decision

We will use Qdrant Cloud (managed) with the following configuration:

  • Managed deployment to minimize ops burden
  • HNSW index with default parameters initially
  • Enable hybrid search (sparse + dense)
  • Start with single region, plan multi-region for phase 2

Consequences

Positive:

  • Managed service reduces ops burden significantly
  • Strong hybrid search support out of the box
  • Good balance of cost and features
  • Easy to migrate from cloud to self-hosted later if needed
  • Active community and good documentation

Negative:

  • Vendor lock-in for managed service
  • Less customization than fully self-hosted
  • Cost scales with usage (vs fixed self-hosted cost)

Alternatives Considered

  1. Pinecone
    • Pros: Simplest managed option, excellent reliability
    • Cons: Higher cost, less flexibility
    • Rejected because: Cost at 100M vectors scale was 3x Qdrant
  2. Weaviate
    • Pros: Good hybrid search, managed option available
    • Cons: More complex configuration
    • Rejected because: Qdrant had better performance benchmarks for our workload
  3. Milvus (Self-Hosted)
    • Pros: Most scalable, full control
    • Cons: Significant ops overhead (external dependencies such as etcd, object storage, and a message broker)
    • Rejected because: Team size can’t support ops complexity
  4. pgvector
    • Pros: Familiar Postgres, simple ops
    • Cons: Best suited for under ~10M vectors, limited hybrid search support
    • Rejected because: Won’t scale to 100M vectors without sharding

References

  • Chapter 7: Vector Databases in Production
  • Appendix B: Tool & Framework Reference
  • ANN Benchmarks: ann-benchmarks.com

ADR-003: Self-Hosted vs API for LLM Inference

Status

Accepted

Context

We need to decide how to serve LLM inference for our production application:

Requirements:

  • ~100K requests/day initially, growing to 1M/day
  • Latency: <2s p95 for generation
  • Privacy: Some queries contain PII that shouldn’t leave our infrastructure
  • Budget: $20K/month initially, scaling with usage
  • Models: Need GPT-5 class quality for complex queries, smaller for simple

Options:

  1. API only (OpenAI, Anthropic, etc.)
  2. Self-hosted only (vLLM, TGI)
  3. Hybrid (API for complex, self-hosted for simple/private)

Decision

We will implement a hybrid approach:

  1. Self-hosted (vLLM) for:
    • Simple queries (classification, extraction)
    • Queries containing PII
    • High-volume, cost-sensitive workloads
    • Model: Llama 3 70B with INT8 quantization
  2. API (Anthropic Claude) for:
    • Complex reasoning queries
    • When self-hosted is at capacity
    • New capabilities not yet available in open models
  3. Router that directs queries based on:
    • Complexity score (LLM-based classification)
    • PII detection
    • Current load on self-hosted
    • Cost optimization targets
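The routing rules above can be sketched as a small pure function. This is a minimal illustration, not the production router. The `QuerySignals` fields, the two thresholds, and the rule ordering (PII first, then complexity, then load) are assumptions. Cost optimization targets are not modeled.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    SELF_HOSTED = "self_hosted"
    API = "api"


@dataclass
class QuerySignals:
    contains_pii: bool       # output of a PII detector
    complexity: float        # 0.0 (simple) to 1.0 (complex), from a classifier
    self_hosted_load: float  # current utilization, 0.0 to 1.0


# Hypothetical thresholds; tune against real traffic.
COMPLEXITY_THRESHOLD = 0.7
LOAD_THRESHOLD = 0.9


def route(signals: QuerySignals) -> Backend:
    # PII never leaves our infrastructure, regardless of load or complexity.
    if signals.contains_pii:
        return Backend.SELF_HOSTED
    # Complex reasoning goes to the API model.
    if signals.complexity >= COMPLEXITY_THRESHOLD:
        return Backend.API
    # Simple queries overflow to the API when self-hosted is at capacity.
    if signals.self_hosted_load >= LOAD_THRESHOLD:
        return Backend.API
    return Backend.SELF_HOSTED


print(route(QuerySignals(contains_pii=True, complexity=0.95, self_hosted_load=0.99)))
```

Checking PII first makes the privacy requirement a hard constraint. A PII query at full load waits in the self-hosted queue and does not spill over to the API.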

Consequences

Positive:

  • Privacy compliance for PII data
  • Cost optimization (self-hosted for bulk)
  • Flexibility to use best model for each task
  • Not fully dependent on any single provider
  • Can gradually shift workload as self-hosted scales

Negative:

  • Operational complexity (managing both)
  • Router adds latency and complexity
  • Need expertise in both API integration and GPU ops
  • More complex debugging when issues span systems

Cost Analysis

| Scenario     | API Only | Self-Hosted Only | Hybrid  |
|--------------|----------|------------------|---------|
| 100K req/day | $30K/mo  | $8K/mo           | $15K/mo |
| 1M req/day   | $300K/mo | $40K/mo          | $80K/mo |

Assumes 60% simple queries routed to self-hosted in hybrid
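To compare the options on a per-request basis, the monthly figures in the cost table can be divided by monthly request volume. The sketch below assumes a 30-day month, which the table does not state. The dictionary keys are illustrative names.

```python
DAYS_PER_MONTH = 30  # assumption; the table does not state the month length

# Monthly costs in USD, copied from the cost analysis table.
MONTHLY_COST = {
    100_000: {"api_only": 30_000, "self_hosted_only": 8_000, "hybrid": 15_000},
    1_000_000: {"api_only": 300_000, "self_hosted_only": 40_000, "hybrid": 80_000},
}


def cost_per_request(requests_per_day: int, option: str) -> float:
    """Implied USD cost of a single request for one row and column of the table."""
    monthly_requests = requests_per_day * DAYS_PER_MONTH
    return MONTHLY_COST[requests_per_day][option] / monthly_requests


for volume, costs in MONTHLY_COST.items():
    for option in costs:
        print(f"{volume:>9,} req/day  {option:<17} ${cost_per_request(volume, option):.4f}/request")
```

Under this assumption the API-only cost stays flat at $0.01 per request. The self-hosted cost per request halves between the two volumes as fixed infrastructure is amortized over more traffic.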

Alternatives Considered

  1. API Only
    • Pros: Simplest, no ops burden
    • Cons: Cost at scale, PII concerns, vendor dependency
    • Rejected because: PII requirement forces some self-hosting
  2. Self-Hosted Only
    • Pros: Full control, lowest cost at scale
    • Cons: Miss out on frontier capabilities, more ops burden
    • Rejected because: Want access to best models for complex queries

References

  • Chapter 9: LLM Deployment & Infrastructure
  • Chapter 32: Cost Engineering
  • Appendix B: Inference engines comparison

ADR-004: Chunking Strategy for Technical Documentation

Status

Accepted

Context

We’re indexing technical documentation for RAG. The content includes:

  • API references with code examples
  • Tutorial articles with sequential steps
  • Conceptual explanations with diagrams
  • Changelog entries

We need a chunking strategy that:

  • Preserves semantic coherence
  • Works well with embedding models
  • Produces chunks that fit in context windows
  • Maintains useful metadata

Decision

We will use hierarchical semantic chunking:

  1. First pass: Split by major sections (H1, H2 headers)
  2. Second pass: Within sections, split by paragraphs
  3. Merge: Combine small paragraphs up to 512 tokens
  4. Overlap: 50-token overlap between chunks
  5. Metadata: Preserve document title, section hierarchy, source URL

Special handling:

  • Code blocks: Keep code and surrounding explanation together
  • Lists: Keep complete lists together up to 800 tokens
  • Tables: Keep tables as single chunks with surrounding context

Consequences

Positive:

  • Preserves semantic meaning within chunks
  • Code examples stay with their explanations
  • Section hierarchy aids retrieval (can boost by header match)
  • Manageable chunk sizes for embedding models

Negative:

  • Variable chunk sizes (100-800 tokens)
  • More complex implementation than fixed-size
  • Requires format-specific parsing (Markdown, HTML)

Implementation

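def hierarchical_chunk(doc: str, metadata: dict) -> list[Chunk]:
    """Split a document into chunks section by section, attaching section metadata.

    Chunk, parse_sections, chunk_with_code, and semantic_chunk are assumed
    to be defined elsewhere in the project.
    """
    # Parse document structure (H1/H2 sections)
    sections = parse_sections(doc)

    chunks = []
    for section in sections:
        # Every chunk inherits document metadata plus its section's position
        section_metadata = {
            **metadata,
            "section_title": section.title,
            "section_path": section.path,  # e.g., "Installation > Prerequisites"
        }

        # Keep code blocks together with their surrounding explanation
        if section.has_code_blocks:
            chunks.extend(chunk_with_code(section, section_metadata))
        else:
            chunks.extend(semantic_chunk(section.text, section_metadata))

    return chunks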
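The merge and overlap steps of the decision (combine small paragraphs up to 512 tokens, with a 50-token overlap) could look like the sketch below. `merge_paragraphs` is a hypothetical helper, not part of the implementation above. It counts tokens by whitespace splitting as a stand-in for the embedding model's tokenizer, so real token counts will differ.

```python
def merge_paragraphs(paragraphs: list[str], max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens tokens.

    Each new chunk starts with the last `overlap` tokens of the previous
    chunk when they fit. A single paragraph longer than the budget is kept
    whole rather than split. Paragraph breaks are flattened to spaces.
    """
    chunks: list[str] = []
    current: list[str] = []  # tokens of the chunk being built
    for para in paragraphs:
        tokens = para.split()  # stand-in tokenizer: whitespace split
        if not tokens:
            continue
        if current and len(current) + len(tokens) > max_tokens:
            chunks.append(" ".join(current))
            carry = current[-overlap:] if overlap > 0 else []
            # Drop the overlap if it would push the new chunk over budget.
            current = carry if len(carry) + len(tokens) <= max_tokens else []
        current.extend(tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Tiny budget so the behavior is visible.
demo = merge_paragraphs(
    ["a1 a2 a3 a4 a5 a6", "b1 b2 b3 b4 b5", "c1 c2 c3 c4"],
    max_tokens=10,
    overlap=2,
)
for chunk in demo:
    print(chunk)
```

In the demo each chunk after the first begins with the last two tokens of its predecessor. The same mechanism gives the 50-token overlap at the default settings.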

Alternatives Considered

  1. Fixed-Size Chunking (512 tokens)
    • Pros: Simple, predictable
    • Cons: Splits code blocks and semantic units
    • Rejected because: Poor quality on technical content
  2. Sentence-Based Chunking
    • Pros: Natural boundaries
    • Cons: Very small chunks, loses context
    • Rejected because: Too fine-grained for our content
  3. Document-Level (No Chunking)
    • Pros: Full context preserved
    • Cons: Many docs exceed context limits
    • Rejected because: Documents average 5000 tokens

References

  • Chapter 7: Document Processing and Chunking
  • Chapter 7: Evaluation section on retrieval metrics

ADR-005: Agent Safety Architecture

Status

Accepted

Context

We’re building an agent that can:

  • Read and write files in user workspaces
  • Execute code in sandboxed environments
  • Make API calls to external services
  • Search the web

We need safety measures that:

  • Prevent prompt injection attacks
  • Limit blast radius of failures
  • Require confirmation for dangerous operations
  • Maintain audit trail

Decision

We will implement defense-in-depth with four layers:

Layer 1: Input Validation

  • Scan all inputs for injection patterns
  • Rate limit by user and session
  • Log suspicious inputs for review

Layer 2: Tool Policies

TOOL_POLICIES = {
    "read_file": {
        "allowed_paths": ["/workspace/"],
        "blocked_patterns": [".env", "credentials"],
        "requires_confirmation": False
    },
    "write_file": {
        "allowed_paths": ["/workspace/"],
        "blocked_patterns": [".git/", "node_modules/"],
        "requires_confirmation": True
    },
    "execute_code": {
        "sandboxed": True,
        "timeout_seconds": 30,
        "blocked_imports": ["os", "subprocess", "socket"],
        "requires_confirmation": True
    },
    "web_request": {
        "allowed_domains": ["api.example.com", "docs.example.com"],
        "requires_confirmation": False
    }
}
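A policy table only helps if something enforces it. The sketch below shows one way a file-access check might apply `allowed_paths` and `blocked_patterns`. `is_path_allowed` is a hypothetical helper, and the matching rules are assumptions: prefix match on allowed roots that end with "/", and substring match on blocked patterns. It normalizes paths but does not resolve symlinks, which a real implementation would also need to handle.

```python
import posixpath

# Subset of the read_file policy, repeated here so the example is self-contained.
FILE_POLICY = {
    "allowed_paths": ["/workspace/"],
    "blocked_patterns": [".env", "credentials"],
}


def is_path_allowed(path: str, policy: dict) -> bool:
    # Normalize first so "/workspace/../etc/passwd" cannot escape the allowed root.
    normalized = posixpath.normpath(path)
    in_allowed_root = any(
        normalized == root.rstrip("/") or normalized.startswith(root)
        for root in policy["allowed_paths"]
    )
    if not in_allowed_root:
        return False
    # Assumed semantics: a blocked pattern matches anywhere in the normalized path.
    return not any(pattern in normalized for pattern in policy["blocked_patterns"])


print(is_path_allowed("/workspace/src/main.py", FILE_POLICY))
print(is_path_allowed("/workspace/../etc/passwd", FILE_POLICY))
```

Normalizing before the prefix check is the important design choice. Without it, a path containing ".." segments would pass the allowlist while pointing outside the workspace.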

Layer 3: Sandboxed Execution

  • Code runs in Docker containers
  • Network disabled by default
  • Read-only filesystem
  • Resource limits (CPU, memory, time)

Layer 4: Human-in-the-Loop

  • Dangerous operations require explicit approval
  • Summary of proposed actions shown before execution
  • User can cancel or modify
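The Layer 3 constraints map onto standard `docker run` flags. The sketch below only builds the command line and does not execute it. The function name, the default limits, and the example image are illustrative assumptions rather than values from this ADR.

```python
def build_sandbox_command(
    image: str,
    code: str,
    memory: str = "256m",
    cpus: str = "0.5",
    pids_limit: int = 64,
) -> list[str]:
    """Build a docker run argv for executing untrusted Python code."""
    return [
        "docker", "run",
        "--rm",                           # remove the container when it exits
        "--network", "none",              # network disabled by default
        "--read-only",                    # read-only root filesystem
        "--memory", memory,               # memory limit
        "--cpus", cpus,                   # CPU limit
        "--pids-limit", str(pids_limit),  # cap on the number of processes
        image,
        "python", "-c", code,
    ]


# Example image name; any image with a Python interpreter would do.
cmd = build_sandbox_command("python:3.12-slim", "print(1 + 1)")
print(" ".join(cmd[:-1]), repr(cmd[-1]))
```

These flags do not cover the time limit from the tool policy (30 seconds). The caller would need to enforce it and stop the container when it expires.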

Consequences

Positive:

  • Defense in depth limits blast radius
  • Audit trail for compliance
  • User control over risky operations
  • Can tune policies without code changes

Negative:

  • Added latency for confirmations
  • Some user friction
  • Maintenance overhead for policies
  • Complex testing (security scenarios)

Alternatives Considered

  1. No Safety Measures (Trust the Model)
    • Rejected because: Unacceptable risk for file/code execution
  2. Confirmation for All Operations
    • Pros: Maximum safety
    • Cons: Too much friction, users would disable
    • Rejected because: Poor user experience
  3. Static Allowlist Only
    • Pros: Simple
    • Cons: Too restrictive or requires frequent updates
    • Rejected because: Need flexibility of policy system

References

  • Chapter 8: Agent Safety and Constraints
  • Chapter 16: Security for Agentic Systems
  • Chapter 16: Defense-in-Depth Architecture

ADR-006: Evaluation Strategy for LLM Application

Status

Accepted

Context

We need a comprehensive evaluation strategy for our LLM-powered application. Requirements:

  • Catch regressions before deployment
  • Measure quality across dimensions (accuracy, safety, style)
  • Handle subjective quality judgments
  • Scale to thousands of test cases
  • Run in CI/CD pipeline

Decision

We will implement a multi-tier evaluation system:

Tier 1: Deterministic Tests (CI - Every Commit)

  • Format validation (JSON schema compliance)
  • Safety filters (blocked content detection)
  • Response length bounds
  • Required field presence
  • ~500 test cases, runs in <5 minutes

Tier 2: LLM-as-Judge (Pre-Merge - Every PR)

  • Accuracy rating (1-5 scale)
  • Helpfulness rating
  • Safety rating
  • Factual grounding check
  • ~200 test cases, runs in <15 minutes
  • Uses Claude Haiku as judge for cost efficiency

Tier 3: Human Evaluation (Weekly)

  • Random sample of 100 production queries
  • Expert review for domain correctness
  • Edge case identification
  • Calibrates Tier 2 judges

Tier 4: A/B Testing (Major Changes)

  • Live traffic comparison
  • Business metrics (conversion, satisfaction)
  • Statistical significance requirements
  • 1-2 week evaluation periods
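Tier 1 checks are cheap because they need no model call. A minimal sketch of three of them (valid JSON, required fields, length bounds) follows. The response schema (`answer`, `sources`) and the 2000-character bound are hypothetical, and safety filtering is left out.

```python
import json

REQUIRED_FIELDS = {"answer", "sources"}  # hypothetical response schema
MAX_ANSWER_CHARS = 2000                  # hypothetical length bound


def tier1_failures(raw_response: str) -> list[str]:
    """Return the names of failed checks; an empty list means the response passes."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(payload, dict):
        return ["not_an_object"]

    failures = []
    missing = REQUIRED_FIELDS - set(payload)
    if missing:
        failures.append("missing_fields:" + ",".join(sorted(missing)))
    answer = payload.get("answer")
    if isinstance(answer, str) and not (1 <= len(answer) <= MAX_ANSWER_CHARS):
        failures.append("answer_length_out_of_bounds")
    return failures


print(tier1_failures('{"answer": "Use RAG.", "sources": ["doc_17"]}'))
print(tier1_failures('{"answer": ""}'))
```

Returning failure names (not a single boolean) makes CI output easier to read when several hundred cases run on each commit.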

Consequences

Positive:

  • Fast feedback on obvious issues (Tier 1)
  • Catches quality regressions (Tier 2)
  • Human calibration prevents drift (Tier 3)
  • Business impact validation (Tier 4)
  • Appropriate cost at each tier

Negative:

  • Complex pipeline to maintain
  • LLM-as-judge has known biases
  • Human evaluation is expensive and slow
  • A/B tests delay launches

Metrics Targets

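| Tier | Metric            | Target   | Blocking    |
|------|-------------------|----------|-------------|
| 1    | Format compliance | 100%     | Yes         |
| 2    | Accuracy score    | >4.0/5.0 | Yes         |
| 2    | Safety score      | >4.5/5.0 | Yes         |
| 3    | Human accuracy    | >90%     | No (alerts) |
| 4    | Conversion lift   | >0%      | Yes         |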
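The targets above can be encoded as data so that a pipeline step decides whether to block or only alert. The sketch below is one possible encoding. The metric key names and the fraction representation (1.00 for 100%, 0.90 for 90%) are assumptions, and metrics missing from the results are skipped because not every tier runs at every stage.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Target:
    tier: int
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float
    blocking: bool


TARGETS = [
    Target(1, "format_compliance", operator.ge, 1.00, True),  # 100%
    Target(2, "accuracy_score", operator.gt, 4.0, True),
    Target(2, "safety_score", operator.gt, 4.5, True),
    Target(3, "human_accuracy", operator.gt, 0.90, False),    # alerts only
    Target(4, "conversion_lift", operator.gt, 0.0, True),
]


def evaluate_gate(results: dict[str, float]) -> tuple[list[str], list[str]]:
    """Return (blocking_failures, alerts) for the metrics present in results."""
    blocking: list[str] = []
    alerts: list[str] = []
    for target in TARGETS:
        if target.metric not in results:
            continue  # tier not run at this pipeline stage
        if not target.compare(results[target.metric], target.threshold):
            (blocking if target.blocking else alerts).append(target.metric)
    return blocking, alerts


blocking, alerts = evaluate_gate(
    {"format_compliance": 1.0, "accuracy_score": 4.2, "safety_score": 4.4, "human_accuracy": 0.88}
)
print("blocking:", blocking, "alerts:", alerts)
```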

References

  • Chapter 15: MLOps & Evaluation
  • Chapter 15: LLM-as-Judge methodology
  • Chapter 15: A/B Testing for AI

ADR-007: Observability Stack for AI Systems

Status

Accepted

Context

We need observability for our LLM application covering:

  • Traditional metrics (latency, errors, throughput)
  • AI-specific metrics (token usage, quality scores)
  • Request/response logging for debugging
  • Cost tracking
  • Prompt/response inspection for issues

Decision

We will use the following stack:

Metrics: Prometheus + Grafana

  • Standard infrastructure metrics
  • Custom AI metrics (tokens, costs, quality)
  • Alerting through Grafana

Tracing and Logging: OpenTelemetry + Jaeger

  • Distributed tracing across services
  • Request/response capture (redacted)
  • Span-level timing breakdown

AI-Specific: Custom LLM Logger

from dataclasses import dataclass
from datetime import datetime


@dataclass
class LLMRequestLog:
    """One record per LLM call, written by the custom logger."""
    request_id: str
    timestamp: datetime
    model: str
    prompt_hash: str  # For grouping similar prompts
    prompt_preview: str  # First 200 chars
    input_tokens: int
    output_tokens: int
    latency_ms: float
    status: str
    cost_usd: float
    quality_score: float  # From automated eval
    user_id: str  # For cost attribution
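Three of the log fields are derived values: `prompt_hash`, `prompt_preview`, and `cost_usd`. The helpers below are a sketch of how they might be computed. The model name and the per-token prices are placeholders, not real rates. The hash only groups prompts that are identical after whitespace normalization, so grouping genuinely similar prompts would need something like hashing the prompt template instead.

```python
import hashlib

# Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
PRICE_PER_MILLION = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def prompt_hash(prompt: str) -> str:
    # Collapse runs of whitespace so trivial formatting differences hash alike.
    normalized = " ".join(prompt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def prompt_preview(prompt: str, limit: int = 200) -> str:
    # First `limit` characters, matching the "First 200 chars" field comment.
    return prompt[:limit]


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


print(prompt_hash("Summarize   this document"))
print(f"${cost_usd('example-model', input_tokens=1000, output_tokens=500):.4f}")
```

Because the preview stores raw prompt text, it should pass through the same redaction step as other request/response captures before it is written.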

Dashboards:

  1. System health (latency, errors, throughput)
  2. Model performance (quality scores over time)
  3. Cost tracking (by model, user, endpoint)
  4. Debug view (inspect individual requests)

Consequences

Positive:

  • Complete visibility into system behavior
  • AI-specific insights beyond traditional ops
  • Cost attribution for business planning
  • Debugging capability for quality issues

Negative:

  • Storage costs for detailed logs
  • Privacy considerations for logging prompts
  • Dashboard maintenance overhead
  • Learning curve for the team

References

  • Chapter 11: Observability section
  • Chapter 31: Reliability Engineering
  • Chapter 32: Cost Attribution

ADR Index

| ADR | Title               | Status   | Key Decision                       |
|-----|---------------------|----------|------------------------------------|
| 001 | RAG vs Fine-Tuning  | Accepted | Use RAG for updatable knowledge    |
| 002 | Vector Database     | Accepted | Qdrant Cloud (managed)             |
| 003 | LLM Inference       | Accepted | Hybrid: self-hosted + API          |
| 004 | Chunking Strategy   | Accepted | Hierarchical semantic chunking     |
| 005 | Agent Safety        | Accepted | Defense-in-depth, 4 layers         |
| 006 | Evaluation Strategy | Accepted | Multi-tier (deterministic → human) |
| 007 | Observability       | Accepted | Prometheus + OTel + custom         |

Creating Your Own ADRs

When writing ADRs for your team:

  1. Write them at decision time: Not after the fact
  2. Keep them concise: 1-2 pages max
  3. Include context: Future readers need to understand why
  4. Document alternatives: Shows due diligence
  5. Accept uncertainty: It’s OK to decide with incomplete info
  6. Update status: Mark deprecated/superseded when things change
  7. Link to code: Reference implementations where relevant

ADRs are historical documents. Don’t modify the decision section after acceptance; instead, write a new ADR that supersedes it.