Appendix G: Architecture Decision Records (ADRs)

This appendix provides example ADRs for common AI engineering decisions. Use these as templates for documenting your own architectural choices.

ADR Template

# ADR-XXX: [Title]

## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]

## Context
[What is the issue we're facing? What forces are at play?]

## Decision
[What is the change being proposed/made?]

## Consequences
[What are the positive and negative effects of this decision?]

## Alternatives Considered
[What other options were evaluated?]

## References
[Related documents, chapters, papers]

ADR-001: RAG vs Fine-Tuning for Domain Knowledge

Status

Accepted

Context

We need to add domain-specific knowledge to our LLM application. The knowledge includes:

50,000 internal documents (policies, procedures, product specs)
Content that updates weekly
Specialized terminology users expect precise answers about

We must decide between:

1. Fine-tuning a model on this domain
2. Using RAG to retrieve relevant documents at query time
3. A hybrid approach

Decision

We will use RAG as the primary approach with the following implementation:

Vector database for document storage (Qdrant)
Hybrid search (dense embeddings + BM25)
Reranking with cross-encoder
Weekly incremental indexing
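
The ADR does not say how the dense and BM25 result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the method this decision prescribes. The function name, the document IDs, and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every document it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers for one query.
dense_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that appear in both lists rise to the top. The fused list would then be passed to the cross-encoder reranker.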

Fine-tuning will only be considered if:

RAG retrieval quality remains below 80% after optimization
We identify specific terminology gaps that prompting can’t address

Consequences

Positive:
- Documents can be updated without model retraining
- Full traceability (we can cite sources)
- Lower compute costs (no training infrastructure)
- Faster iteration on content
- No risk of catastrophic forgetting

Negative:
- Added latency from retrieval step (~200ms)
- Retrieval quality becomes critical path
- Context window limits affect how much we can include
- Infrastructure complexity (vector DB, embeddings)

Alternatives Considered

1. Full Fine-Tuning
- Pros: Knowledge “baked in,” lower inference latency
- Cons: Expensive to retrain on updates, no citations, risk of hallucination
- Rejected because: Update frequency (weekly) makes retraining impractical

2. Few-Shot Prompting Only
- Pros: Simplest approach
- Cons: Context window can’t fit 50K documents
- Rejected because: Volume of knowledge far exceeds context limits

3. Fine-Tune + RAG Hybrid
- Pros: Best of both worlds potentially
- Cons: Highest complexity and cost
- Deferred: Will revisit if RAG alone proves insufficient

References

Chapter 7: RAG Systems Deep Dive
Chapter 15: MLOps & Evaluation (Fine-Tuning approaches)
Paper: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”

ADR-002: Vector Database Selection

Status

Accepted

Context

Our RAG system requires a vector database for storing and searching embeddings. Key requirements:

Scale: 10M vectors initially, growing to 100M
Latency: <100ms p99 for search
Availability: 99.9% uptime for production
Features: Filtering, hybrid search
Team: Small team (3 engineers), limited ops capacity
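
To put the scale requirement in perspective, the sketch below estimates the raw storage for the vectors alone. The embedding dimension (768) and float32 storage are assumptions, since the ADR does not state either. The estimate excludes the HNSW graph, payloads, and replication.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the vectors themselves, with no index overhead."""
    return num_vectors * dim * bytes_per_value


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


DIM = 768  # assumed embedding dimension, not taken from the ADR

initial_gib = to_gib(raw_vector_bytes(10_000_000, DIM))   # 10M vectors
target_gib = to_gib(raw_vector_bytes(100_000_000, DIM))   # 100M vectors
print(f"initial: {initial_gib:.1f} GiB, target: {target_gib:.1f} GiB")
```

Under these assumptions the raw vectors grow from roughly 29 GiB to roughly 286 GiB, which is worth keeping in mind when comparing candidates on memory and cost.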

Candidates evaluated:

1. Pinecone (managed)
2. Weaviate (self-hosted or managed)
3. Qdrant (self-hosted or managed)
4. Milvus (self-hosted)
5. pgvector (PostgreSQL extension)

Decision

We will use Qdrant Cloud (managed) with the following configuration:

Managed deployment to minimize ops burden
HNSW index with default parameters initially
Enable hybrid search (sparse + dense)
Start with single region, plan multi-region for phase 2

Consequences

Positive:
- Managed service reduces ops burden significantly
- Strong hybrid search support out of the box
- Good balance of cost and features
- Easy to migrate from cloud to self-hosted later if needed
- Active community and good documentation

Negative:
- Vendor lock-in for managed service
- Less customization than fully self-hosted
- Cost scales with usage (vs fixed self-hosted cost)

Alternatives Considered

1. Pinecone
- Pros: Simplest managed option, excellent reliability
- Cons: Higher cost, less flexibility
- Rejected because: Cost at 100M vectors scale was 3x Qdrant

2. Weaviate
- Pros: Good hybrid search, managed option available
- Cons: More complex configuration
- Rejected because: Qdrant had better performance benchmarks for our workload

3. Milvus (Self-Hosted)
- Pros: Most scalable, full control
- Cons: Significant ops overhead (a distributed deployment depends on etcd, object storage, and a message broker such as Pulsar or Kafka)
- Rejected because: Team size can’t support ops complexity

4. pgvector
- Pros: Familiar Postgres, simple ops
- Cons: Best suited for under ~10M vectors, limited hybrid search support
- Rejected because: Won’t scale to 100M vectors without sharding

References

Chapter 7: Vector Databases in Production
Appendix B: Tool & Framework Reference
ANN Benchmarks: ann-benchmarks.com

ADR-003: Self-Hosted vs API for LLM Inference

Status

Accepted

Context

We need to decide how to serve LLM inference for our production application:

Requirements:

~100K requests/day initially, growing to 1M/day
Latency: <2s p95 for generation
Privacy: Some queries contain PII that shouldn’t leave our infrastructure
Budget: $20K/month initially, scaling with usage
Models: Need GPT-5 class quality for complex queries, smaller for simple

Options:

1. API only (OpenAI, Anthropic, etc.)
2. Self-hosted only (vLLM, TGI)
3. Hybrid (API for complex, self-hosted for simple/private)

Decision

We will implement a hybrid approach:

1. Self-hosted (vLLM) for:
- Simple queries (classification, extraction)
- Queries containing PII
- High-volume, cost-sensitive workloads
- Model: Llama 3 70B with INT8 quantization

2. API (Anthropic Claude) for:
- Complex reasoning queries
- When self-hosted is at capacity
- New capabilities not yet available in open models

3. Router that directs queries based on:
- Complexity score (LLM-based classification)
- PII detection
- Current load on self-hosted
- Cost optimization targets
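
The routing rules above can be sketched as a small pure function. This is a minimal illustration under stated assumptions: the `QuerySignals` fields, the thresholds, and the precedence order (PII first, then complexity, then load) are hypothetical, and cost optimization targets are left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    SELF_HOSTED = "self_hosted"
    API = "api"


@dataclass
class QuerySignals:
    contains_pii: bool       # output of a PII detector
    complexity: float        # 0.0 (simple) to 1.0 (complex), from a classifier
    self_hosted_load: float  # 0.0 (idle) to 1.0 (saturated)


def route(
    signals: QuerySignals,
    complexity_threshold: float = 0.6,  # assumed cutoff
    load_limit: float = 0.9,            # assumed capacity limit
) -> Backend:
    # PII must stay on our infrastructure, regardless of load or complexity.
    if signals.contains_pii:
        return Backend.SELF_HOSTED
    # Complex reasoning goes to the API model.
    if signals.complexity >= complexity_threshold:
        return Backend.API
    # Simple queries overflow to the API when self-hosted is at capacity.
    if signals.self_hosted_load >= load_limit:
        return Backend.API
    return Backend.SELF_HOSTED
```

Checking PII first means a saturated self-hosted cluster queues or rejects PII queries rather than sending them to an external provider.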

Consequences

Positive:
- Privacy compliance for PII data
- Cost optimization (self-hosted for bulk)
- Flexibility to use best model for each task
- Not fully dependent on any single provider
- Can gradually shift workload as self-hosted scales

Negative:
- Operational complexity (managing both)
- Router adds latency and complexity
- Need expertise in both API integration and GPU ops
- More complex debugging when issues span systems

Cost Analysis

Scenario	API Only	Self-Hosted Only	Hybrid
100K req/day	$30K/mo	$8K/mo	$15K/mo
1M req/day	$300K/mo	$40K/mo	$80K/mo

Assumes 60% simple queries routed to self-hosted in hybrid

Alternatives Considered

1. API Only
- Pros: Simplest, no ops burden
- Cons: Cost at scale, PII concerns, vendor dependency
- Rejected because: PII requirement forces some self-hosting

2. Self-Hosted Only
- Pros: Full control, lowest cost at scale
- Cons: Miss out on frontier capabilities, more ops burden
- Rejected because: Want access to best models for complex queries

References

Chapter 9: LLM Deployment & Infrastructure
Chapter 32: Cost Engineering
Appendix B: Inference engines comparison

ADR-004: Chunking Strategy for Technical Documentation

Status

Accepted

Context

We’re indexing technical documentation for RAG. The content includes:

API references with code examples
Tutorial articles with sequential steps
Conceptual explanations with diagrams
Changelog entries

We need a chunking strategy that:

Preserves semantic coherence
Works well with embedding models
Produces chunks that fit in context windows
Maintains useful metadata

Decision

We will use hierarchical semantic chunking:

First pass: Split by major sections (H1, H2 headers)
Second pass: Within sections, split by paragraphs
Merge: Combine small paragraphs up to 512 tokens
Overlap: 50 token overlap between chunks
Metadata: Preserve document title, section hierarchy, source URL

Special handling:

Code blocks: Keep code and surrounding explanation together
Lists: Keep complete lists together up to 800 tokens
Tables: Keep tables as single chunks with surrounding context

Consequences

Positive:
- Preserves semantic meaning within chunks
- Code examples stay with their explanations
- Section hierarchy aids retrieval (can boost by header match)
- Manageable chunk sizes for embedding models

Negative:
- Variable chunk sizes (100-800 tokens)
- More complex implementation than fixed-size
- Requires format-specific parsing (Markdown, HTML)

Implementation

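def hierarchical_chunk(doc: str, metadata: dict) -> list[Chunk]:
    """Split a document into chunks, one section at a time.

    Chunk, parse_sections, chunk_with_code, and semantic_chunk are
    project-specific helpers that are not shown here.
    """
    # Parse document structure (H1/H2 sections)
    sections = parse_sections(doc)

    chunks = []
    for section in sections:
        # Every chunk inherits the document metadata plus its section location
        section_metadata = {
            **metadata,
            "section_title": section.title,
            "section_path": section.path,  # e.g., "Installation > Prerequisites"
        }

        # Handle code blocks specially so code stays with its explanation
        if section.has_code_blocks:
            chunks.extend(chunk_with_code(section, section_metadata))
        else:
            chunks.extend(semantic_chunk(section.text, section_metadata))

    return chunks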
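
The merge and overlap steps from the Decision are not shown in the function above. The sketch below is one possible reading of them, not the reference implementation: `merge_paragraphs` is a hypothetical helper, and whitespace-separated words stand in for tokenizer tokens.

```python
def merge_paragraphs(
    paragraphs: list[str], max_tokens: int = 512, overlap: int = 50
) -> list[str]:
    """Greedily merge paragraphs into chunks of at most max_tokens tokens.

    Assumption: a "token" here is a whitespace-separated word. A real
    pipeline would count tokens with the embedding model's tokenizer.
    A single paragraph longer than max_tokens is kept whole.
    """
    chunks: list[str] = []
    current: list[str] = []  # tokens of the chunk being built
    for paragraph in paragraphs:
        tokens = paragraph.split()
        if current and len(current) + len(tokens) > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap > 0 else []
        current.extend(tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Three synthetic paragraphs of 300, 300, and 100 tokens.
paragraphs = [" ".join(["a"] * 300), " ".join(["b"] * 300), " ".join(["c"] * 100)]
chunks = merge_paragraphs(paragraphs)
```

In this example the first chunk holds the first paragraph, and the second chunk starts with the last 50 tokens of the first before continuing with the remaining two paragraphs.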

Alternatives Considered

1. Fixed-Size Chunking (512 tokens)
- Pros: Simple, predictable
- Cons: Splits code blocks and semantic units
- Rejected because: Poor quality on technical content

2. Sentence-Based Chunking
- Pros: Natural boundaries
- Cons: Very small chunks, loses context
- Rejected because: Too fine-grained for our content

3. Document-Level (No Chunking)
- Pros: Full context preserved
- Cons: Many docs exceed context limits
- Rejected because: Documents average 5000 tokens

References

Chapter 7: Document Processing and Chunking
Chapter 7: Evaluation section on retrieval metrics

ADR-005: Agent Safety Architecture

Status

Accepted

Context

We’re building an agent that can:

Read and write files in user workspaces
Execute code in sandboxed environments
Make API calls to external services
Search the web

We need safety measures that:

Prevent prompt injection attacks
Limit blast radius of failures
Require confirmation for dangerous operations
Maintain an audit trail

Decision

We will implement defense-in-depth with four layers:

Layer 1: Input Validation
- Scan all inputs for injection patterns
- Rate limit by user and session
- Log suspicious inputs for review

Layer 2: Tool Policies

TOOL_POLICIES = {
    "read_file": {
        "allowed_paths": ["/workspace/"],
        "blocked_patterns": [".env", "credentials"],
        "requires_confirmation": False
    },
    "write_file": {
        "allowed_paths": ["/workspace/"],
        "blocked_patterns": [".git/", "node_modules/"],
        "requires_confirmation": True
    },
    "execute_code": {
        "sandboxed": True,
        "timeout_seconds": 30,
        "blocked_imports": ["os", "subprocess", "socket"],
        "requires_confirmation": True
    },
    "web_request": {
        "allowed_domains": ["api.example.com", "docs.example.com"],
        "requires_confirmation": False
    }
}
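
A policy table only helps if something enforces it. The sketch below shows one way a tool dispatcher might check paths and domains against such a table. The helper names are hypothetical, the `POLICIES` dict repeats only two entries from above so the example is self-contained, and symlinks are not resolved here.

```python
import posixpath
from urllib.parse import urlparse

# Trimmed copy of two entries from the policy table above.
POLICIES = {
    "read_file": {
        "allowed_paths": ["/workspace/"],
        "blocked_patterns": [".env", "credentials"],
    },
    "web_request": {
        "allowed_domains": ["api.example.com", "docs.example.com"],
    },
}


def is_path_allowed(tool: str, path: str) -> bool:
    policy = POLICIES[tool]
    # Normalize first so "/workspace/../etc/passwd" cannot pass the prefix check.
    normalized = posixpath.normpath(path)
    in_allowed_root = any(
        normalized.startswith(root) for root in policy["allowed_paths"]
    )
    blocked = any(pattern in normalized for pattern in policy["blocked_patterns"])
    return in_allowed_root and not blocked


def is_domain_allowed(tool: str, url: str) -> bool:
    # Compare the exact hostname, so "api.example.com.evil.net" is rejected.
    host = urlparse(url).hostname
    return host in POLICIES[tool]["allowed_domains"]
```

Checks like these belong in the dispatcher that executes tool calls, so they apply no matter what the model generates.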

Layer 3: Sandboxed Execution
- Code runs in Docker containers
- Network disabled by default
- Read-only filesystem
- Resource limits (CPU, memory, time)
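
As a rough illustration of the sandbox settings, the helper below builds a `docker run` command that applies them. The function name, image, limits, and script path are assumptions. The command is only constructed, not executed, and the time limit is not covered by these flags and needs separate enforcement.

```python
def sandbox_command(
    image: str,
    script_path: str,
    memory: str = "256m",  # assumed memory cap
    cpus: str = "1.0",     # assumed CPU cap
) -> list[str]:
    """Build a docker run command with no network and a read-only root filesystem."""
    return [
        "docker", "run", "--rm",
        "--network", "none",    # network disabled by default
        "--read-only",          # read-only root filesystem
        "--memory", memory,     # memory limit
        "--cpus", cpus,         # CPU limit
        "--pids-limit", "64",   # cap on process count inside the container
        image, "python", script_path,
    ]


# Hypothetical image and script path.
command = sandbox_command("python:3.12-slim", "/sandbox/main.py")
```

Building the command as a list rather than a shell string avoids shell interpolation of any model-supplied values.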

Layer 4: Human-in-the-Loop
- Dangerous operations require explicit approval
- Summary of proposed actions shown before execution
- User can cancel or modify

Consequences

Positive:
- Defense in depth limits blast radius
- Audit trail for compliance
- User control over risky operations
- Can tune policies without code changes

Negative:
- Added latency for confirmations
- Some user friction
- Maintenance overhead for policies
- Complex testing (security scenarios)

Alternatives Considered

1. No Safety Measures (Trust the Model)
- Rejected because: Unacceptable risk for file/code execution

2. Confirmation for All Operations
- Pros: Maximum safety
- Cons: Too much friction, users would disable
- Rejected because: Poor user experience

3. Static Allowlist Only
- Pros: Simple
- Cons: Too restrictive or requires frequent updates
- Rejected because: Need flexibility of policy system

References

Chapter 8: Agent Safety and Constraints
Chapter 16: Security for Agentic Systems
Chapter 16: Defense-in-Depth Architecture

ADR-006: Evaluation Strategy for LLM Application

Status

Accepted

Context

We need a comprehensive evaluation strategy for our LLM-powered application. Requirements:

Catch regressions before deployment
Measure quality across dimensions (accuracy, safety, style)
Handle subjective quality judgments
Scale to thousands of test cases
Run in CI/CD pipeline

Decision

We will implement a multi-tier evaluation system:

Tier 1: Deterministic Tests (CI - Every Commit)
- Format validation (JSON schema compliance)
- Safety filters (blocked content detection)
- Response length bounds
- Required field presence
- ~500 test cases, runs in <5 minutes
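
A Tier 1 check can be a plain function with no model calls. The sketch below covers JSON parsing, required fields, and a length bound. The field names and the 2,000-character limit are hypothetical, and it does not perform full JSON Schema validation.

```python
import json

REQUIRED_FIELDS = ("answer", "sources")  # hypothetical response fields
MAX_ANSWER_CHARS = 2000                  # hypothetical length bound


def check_response(raw: str) -> list[str]:
    """Return failure messages; an empty list means the response passes."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(payload, dict):
        return ["response is not a JSON object"]

    failures = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload
    ]
    answer = payload.get("answer")
    if isinstance(answer, str) and not 1 <= len(answer) <= MAX_ANSWER_CHARS:
        failures.append("answer length out of bounds")
    return failures
```

Because the function is deterministic and fast, it can run against every stored test case on each commit.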

Tier 2: LLM-as-Judge (Pre-Merge - Every PR)
- Accuracy rating (1-5 scale)
- Helpfulness rating
- Safety rating
- Factual grounding check
- ~200 test cases, runs in <15 minutes
- Uses Claude Haiku as judge for cost efficiency

Tier 3: Human Evaluation (Weekly)
- Random sample of 100 production queries
- Expert review for domain correctness
- Edge case identification
- Calibrates Tier 2 judges

Tier 4: A/B Testing (Major Changes)
- Live traffic comparison
- Business metrics (conversion, satisfaction)
- Statistical significance requirements
- 1-2 week evaluation periods
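
The ADR does not name a significance test for Tier 4. For a conversion-rate comparison, a two-proportion z-test is one standard choice, sketched here as an assumption. The traffic numbers are invented for illustration, and the function assumes the pooled rate is strictly between 0 and 1.

```python
import math


def two_proportion_z_test(
    conv_a: int, n_a: int, conv_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented example: control converts 500 of 10,000, variant 560 of 10,000.
z, p_value = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Fixing the sample size and significance threshold before the test starts avoids stopping early on a favorable fluctuation.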

Consequences

Positive:
- Fast feedback on obvious issues (Tier 1)
- Catches quality regressions (Tier 2)
- Human calibration prevents drift (Tier 3)
- Business impact validation (Tier 4)
- Appropriate cost at each tier

Negative:
- Complex pipeline to maintain
- LLM-as-judge has known biases
- Human evaluation is expensive and slow
- A/B tests delay launches

Metrics Targets

Tier	Metric	Target	Blocking
1	Format compliance	100%	Yes
2	Accuracy score	>4.0/5.0	Yes
2	Safety score	>4.5/5.0	Yes
3	Human accuracy	>90%	No (alerts)
4	Conversion lift	>0%	Yes
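
The targets in the table can be encoded as data so the pipeline applies them uniformly. This is a sketch: the metric keys are hypothetical names, percentages are expressed as fractions, and the 100% target is treated as "at least" while the others are strict comparisons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float
    inclusive: bool  # True: value >= threshold; False: value > threshold
    blocking: bool   # False: failure raises an alert instead of blocking


GATES = [
    Gate("format_compliance", 1.00, inclusive=True, blocking=True),
    Gate("accuracy_score", 4.0, inclusive=False, blocking=True),
    Gate("safety_score", 4.5, inclusive=False, blocking=True),
    Gate("human_accuracy", 0.90, inclusive=False, blocking=False),
    Gate("conversion_lift", 0.0, inclusive=False, blocking=True),
]


def evaluate(results: dict[str, float]) -> tuple[list[str], list[str]]:
    """Return (blocking_failures, alerts) for the metrics present in results."""
    blocking: list[str] = []
    alerts: list[str] = []
    for gate in GATES:
        if gate.metric not in results:
            continue  # that tier did not run in this pipeline stage
        value = results[gate.metric]
        passed = value >= gate.threshold if gate.inclusive else value > gate.threshold
        if not passed:
            (blocking if gate.blocking else alerts).append(gate.metric)
    return blocking, alerts
```

A CI job would fail the build when the first list is non-empty and send a notification for anything in the second.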

References

Chapter 15: MLOps & Evaluation
Chapter 15: LLM-as-Judge methodology
Chapter 15: A/B Testing for AI

ADR-007: Observability Stack for AI Systems

Status

Accepted

Context

We need observability for our LLM application covering:

Traditional metrics (latency, errors, throughput)
AI-specific metrics (token usage, quality scores)
Request/response logging for debugging
Cost tracking
Prompt/response inspection for issues

Decision

We will use the following stack:

Metrics: Prometheus + Grafana
- Standard infrastructure metrics
- Custom AI metrics (tokens, costs, quality)
- Alerting through Grafana

Tracing and logging: OpenTelemetry + Jaeger
- Distributed tracing across services
- Request/response capture (redacted)
- Span-level timing breakdown

AI-Specific: Custom LLM Logger

from dataclasses import dataclass
from datetime import datetime


@dataclass
class LLMRequestLog:
    request_id: str
    timestamp: datetime
    model: str
    prompt_hash: str  # For grouping similar prompts
    prompt_preview: str  # First 200 chars
    input_tokens: int
    output_tokens: int
    latency_ms: float
    status: str
    cost_usd: float
    quality_score: float  # From automated eval
    user_id: str  # For cost attribution
Dashboards:

1. System health (latency, errors, throughput)
2. Model performance (quality scores over time)
3. Cost tracking (by model, user, endpoint)
4. Debug view (inspect individual requests)

Consequences

Positive:
- Complete visibility into system behavior
- AI-specific insights beyond traditional ops
- Cost attribution for business planning
- Debugging capability for quality issues

Negative:
- Storage costs for detailed logs
- Privacy considerations for logging prompts
- Dashboard maintenance overhead
- Learning curve for the team

References

Chapter 11: Observability section
Chapter 31: Reliability Engineering
Chapter 32: Cost Attribution

ADR Index

ADR	Title	Status	Key Decision
001	RAG vs Fine-Tuning	Accepted	Use RAG for updatable knowledge
002	Vector Database	Accepted	Qdrant Cloud (managed)
003	LLM Inference	Accepted	Hybrid: self-hosted + API
004	Chunking Strategy	Accepted	Hierarchical semantic chunking
005	Agent Safety	Accepted	Defense-in-depth, 4 layers
006	Evaluation Strategy	Accepted	Multi-tier (deterministic → human)
007	Observability	Accepted	Prometheus + OTel + custom

Creating Your Own ADRs

When writing ADRs for your team:

Write them at decision time: Not after the fact
Keep them concise: 1-2 pages max
Include context: Future readers need to understand why
Document alternatives: Shows due diligence
Accept uncertainty: It’s OK to decide with incomplete info
Update status: Mark deprecated/superseded when things change
Link to code: Reference implementations where relevant

ADRs are historical documents. Don’t modify the decision section after acceptance—write a new ADR that supersedes it.
