Chapter 14: Backend Engineering for AI

Keywords

testing, debugging, integration, API design, async patterns, caching, streaming, error handling

Introduction

In late 2023, a fintech startup deployed their first LLM-powered feature: an intelligent assistant that helped users understand their spending patterns. The engineering team had built many successful backend systems before—payment processors handling thousands of transactions per second, real-time fraud detection, complex financial reporting pipelines. They applied their usual practices: comprehensive unit tests, deterministic integration tests, and monitoring dashboards tracking latency percentiles.

Within two weeks, they were drowning in bug reports they couldn’t reproduce. Users complained the assistant was “wrong” but the logs showed the system working correctly. Tests passed in CI but failed intermittently in production. Latency varied wildly—the same query taking 500ms one moment and 8 seconds the next. Most confusingly, a prompt that worked perfectly for weeks suddenly started producing garbled outputs after a model provider update they hadn’t even noticed.

The team had encountered a fundamental truth: LLM backends are not traditional backends. The patterns that served them well for deterministic systems—exact assertion testing, reproducible debugging, predictable performance—needed rethinking for systems whose core component is a black-box probabilistic function.

This chapter bridges the gap between traditional backend engineering and the unique demands of LLM applications. You’ll learn not just the techniques, but the underlying principles that explain why LLM backends behave differently and how to design systems that embrace rather than fight this reality.

The Core Insight: Probabilistic Systems Require Probabilistic Engineering

Traditional backend engineering operates in a world of determinism. Given the same inputs and state, the same code produces the same outputs. This assumption underlies everything: we write unit tests expecting exact matches, debug by reproducing issues, and monitor for binary outcomes (success or failure).

LLM applications violate this assumption at their core. A language model is a probabilistic function—the same prompt can produce different outputs due to:

Sampling temperature: Non-zero temperature introduces intentional randomness
Infrastructure variation: Different GPU instances may produce slightly different floating-point results
Model versioning: Providers update models without explicit notification
Context sensitivity: Slight prompt variations can dramatically shift outputs
Batching effects: Some providers batch requests in ways that affect generation

This isn’t a bug to be fixed; it’s the nature of the technology. The engineering challenge is building robust systems atop this probabilistic foundation.
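To make the first source of variation concrete, here is a minimal sketch of temperature-scaled sampling over a toy next-token distribution. The logits and token names are invented for illustration, and real inference servers usually add further steps (top-p filtering, for example), so treat this as a model of the idea and not of any provider's implementation.

```python
import math
import random

def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top logit
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # Subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical logits for the token following "Your largest expense was"
toy_logits = {"rent": 2.0, "groceries": 1.5, "travel": 0.5}
rng = random.Random(0)
greedy = {sample_token(toy_logits, 0.0, rng) for _ in range(50)}
sampled = {sample_token(toy_logits, 1.0, rng) for _ in range(50)}
```

At temperature 0 every draw returns the top token, while at temperature 1 the same prompt yields several different continuations. The other sources in the list (hardware, model versions, batching) add variation even when sampling is greedy.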

What Makes LLM Backends Different

Consider how traditional backend concerns transform for LLM applications:

Traditional Backend	LLM Backend
Latency is predictable	Latency varies 10x depending on output length
Errors are binary (works/fails)	“Wrong” exists on a spectrum
Tests use exact assertions	Tests use statistical assertions
Debugging reproduces the issue	Debugging analyzes distributions
Caching is straightforward	Caching requires semantic similarity
Retry logic is simple	Retry may produce different (better or worse) results
Load testing scales linearly	Token-based pricing creates non-linear costs

Understanding these differences is the foundation for effective LLM backend engineering.

Common Mistake: Applying Traditional Backend Patterns Without Adaptation

What people do: Build LLM backends using the exact same patterns as deterministic services: exact-match caching, binary success/fail monitoring, retry-until-success logic, unit tests with assertEqual.

Why it fails: LLM outputs are probabilistic. Exact-match caching misses semantically equivalent queries. Binary monitoring misses “technically successful but wrong” responses. Retries might produce different (possibly worse) results. Exact assertions fail randomly.

Fix: Adapt each pattern to probabilistic reality. Use semantic similarity for cache lookups. Monitor response quality, not just HTTP status. Implement smart retry logic that detects whether retry improved results. Use statistical assertions or LLM-as-judge for testing. Accept that some techniques (like deterministic debugging) need complete rethinking.

What You’ll Learn

Architectural foundations for LLM applications: async patterns, connection pooling, streaming
Why testing LLMs is hard and how to build statistical test frameworks
Debugging strategies for non-deterministic systems
Caching strategies: when semantic caching helps and when it hurts
Data pipeline design for training and inference
Decision frameworks for key architectural choices

Prerequisites

Understanding of LLM fundamentals (Chapter 5)
Familiarity with prompt engineering (Chapter 6)
Experience with backend development (APIs, databases, async programming)
Python programming

Foundations of LLM Application Architecture

The Anatomy of an LLM Backend

Before diving into specific techniques, let’s establish a mental model for LLM backend architecture. Every LLM application shares common structural elements, though implementations vary.

Request flow for a typical query:

1. Request arrives at API gateway (authentication, rate limiting)
2. Cache check: have we seen this exact request before?
3. Router decides: synchronous response or queue for async processing?
4. Orchestrator assembles context (retrieval, tool calls, conversation history)
5. LLM API call with constructed prompt
6. Post-processing (format validation, safety checks)
7. Response with observability logging

Each component has design decisions that differ from traditional backends due to LLM characteristics.

Async Patterns for Streaming

LLM generation is inherently sequential—tokens are produced one at a time. For long responses, waiting for complete generation creates poor user experience. Streaming delivers tokens as they’re generated, improving perceived latency dramatically.

The key insight: streaming fundamentally changes your architecture. It’s not just a performance optimization; it affects API design, error handling, caching, and observability.

from typing import AsyncIterator

# The streaming mental model: think of LLM output as a generator, not a return value

async def stream_response(prompt: str) -> AsyncIterator[str]:
    """Stream tokens as they're generated."""
    async for chunk in llm_client.stream(prompt):
        # Each chunk contains one or more tokens
        yield chunk.content

        # You can process tokens incrementally
        # - Detect early termination signals
        # - Update progress indicators
        # - Accumulate for logging

# Consumer handles the stream
async def handle_request(request):
    buffer = []
    async for token in stream_response(request.prompt):
        buffer.append(token)
        await websocket.send(token)  # Real-time delivery

    # Post-completion processing
    full_response = "".join(buffer)
    await log_interaction(request, full_response)

Architectural implications of streaming:

Error handling becomes complex: An error mid-stream means partial content was already sent to the user. You need strategies for this:
- Sentinel tokens to signal errors in the stream
- Client-side buffer with commit semantics
- Graceful degradation with “generation interrupted” messages
Caching requires decisions: Do you cache the complete response after streaming finishes? Do you cache per-token for replay? The answer depends on your use case.
Observability changes: You can’t log “response” as a single event. You need streaming-aware metrics: time-to-first-token, tokens-per-second, stream completion rate.
Connection management differs: HTTP/1.1 streaming ties up connections. Consider Server-Sent Events, WebSockets, or HTTP/2 streaming based on your client requirements.

Full implementation: See reference/10_backend_engineering_code.md for complete streaming implementation with error handling.

Connection Pooling for API Calls

LLM API calls are expensive—both in latency (typically 500ms-10s) and in connection overhead. Connection pooling amortizes connection establishment costs across requests.

But LLM workloads differ from typical API workloads:

Traditional API call: 50-200ms response time, high request volume, connection reuse is critical for throughput.

LLM API call: 500ms-10s response time, lower volume, connection lifespan varies dramatically based on output length.

This changes optimal pooling strategy:

import httpx
from contextlib import asynccontextmanager

class LLMConnectionPool:
    """Connection pool tuned for LLM API characteristics."""

    def __init__(
        self,
        base_url: str,
        max_connections: int = 100,
        max_keepalive: int = 20,  # Fewer keepalive for long-running requests
        timeout_config: dict | None = None
    ):
        # LLM-specific timeout defaults (seconds)
        timeouts = {
            "connect": 10.0,     # Connection establishment
            "read": 120.0,       # Long reads for generation
            "write": 10.0,       # Quick writes (prompts)
            "pool": 30.0,        # Wait for connection from pool
        }
        # Caller-supplied values override individual defaults
        timeouts.update(timeout_config or {})
        self.timeout = httpx.Timeout(**timeouts)

        self.limits = httpx.Limits(
            max_connections=max_connections,
            max_keepalive_connections=max_keepalive,
            keepalive_expiry=30.0  # Release idle connections faster
        )

        self.client = httpx.AsyncClient(
            base_url=base_url,
            timeout=self.timeout,
            limits=self.limits,
            http2=True  # Better multiplexing for concurrent requests
        )

Key tuning considerations:

Timeout configuration: Read timeouts must accommodate the longest expected generation (a function of max_tokens). Connect timeouts should be short to fail fast.
Pool sizing: More connections than typical because each request holds a connection for seconds, not milliseconds. Size based on expected concurrent LLM calls, not total request volume.
Keep-alive strategy: LLM providers may have different keep-alive semantics. Test with your specific provider.
HTTP/2 consideration: HTTP/2 multiplexing can handle multiple requests over fewer connections, but streaming behavior varies by implementation.
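The pool-sizing advice follows from Little's law: average in-flight requests equal arrival rate times average time in system. A rough sizing helper, with the `headroom` multiplier as an assumed safety margin and not a recommended value:

```python
import math

def required_connections(
    requests_per_second: float,
    mean_latency_s: float,
    headroom: float = 1.5,
) -> int:
    """Estimate concurrent in-flight LLM calls via Little's law (L = lambda * W)."""
    in_flight = requests_per_second * mean_latency_s
    return math.ceil(in_flight * headroom)

# Same request rate, very different pool sizes
traditional = required_connections(20, 0.1)  # 20 requests/s at 100 ms each
llm_backed = required_connections(20, 4.0)   # 20 requests/s at 4 s each
```

With HTTP/2, several in-flight requests can share one connection, so the result is better read as a bound on concurrent streams than on sockets.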

The Cost of Token-Based Pricing

LLM APIs typically charge per token, creating a cost model unlike traditional APIs. This has architectural implications:

Traditional API: Fixed cost per call regardless of payload size. Optimization focuses on reducing call count.

LLM API: Cost scales with input + output tokens. A 4K token prompt costs 10x a 400 token prompt. Long outputs compound costs.

This means:

Caching ROI is higher: Cached response saves both latency AND cost
Prompt optimization is cost optimization: Shorter prompts save money
Output limiting is essential: max_tokens isn’t just for safety; it’s cost control
“Batching” means three different things—keep them straight: (a) Request-batching for throughput—running many requests through one model pass via continuous batching is a large win when you self-host, because it raises GPU utilization and cuts cost-per-token, though it does not change the token count itself; (b) Provider Batch APIs—submitting an offline job to a provider’s async batch endpoint earns a real ~50% token discount (see the Batch Inference pricing in Chapters 12-13); (c) the narrow true caveat—merely concatenating several prompts into one synchronous call does not reduce the number of tokens billed, so it saves no token cost (and can hurt latency). Earlier editions over-generalized (c) into “batching doesn’t help,” which is wrong: (a) and (b) are among the highest-leverage cost levers available.

# Cost-aware request handling
class CostAwareLLMClient:
    def __init__(self, budget_per_request: float = 0.10):
        self.budget = budget_per_request
        self.pricing = {
            'gpt-5.5': {'input': 0.005/1000, 'output': 0.030/1000},
            'gpt-5.4': {'input': 0.0025/1000, 'output': 0.015/1000},
            'claude-opus-4-8': {'input': 0.005/1000, 'output': 0.025/1000},
        }

    def estimate_cost(self, model: str, prompt: str, max_tokens: int) -> float:
        """Estimate request cost before making call."""
        input_tokens = count_tokens(prompt)
        prices = self.pricing[model]

        # Worst case: max_tokens fully used
        max_cost = (
            input_tokens * prices['input'] +
            max_tokens * prices['output']
        )
        return max_cost

    async def generate(self, prompt: str, model: str, max_tokens: int) -> str:
        estimated = self.estimate_cost(model, prompt, max_tokens)
        if estimated > self.budget:
            # Decision: reject, use cheaper model, or reduce max_tokens
            raise CostLimitExceeded(f"Estimated ${estimated:.4f} exceeds ${self.budget:.4f}")

        return await self._call_api(prompt, model, max_tokens)

Why Testing LLMs is Hard

The Fundamental Challenge

Traditional testing rests on a simple assumption: given the same inputs, code produces the same outputs. This enables assertions like assertEqual(actual, expected). When tests pass, we have confidence the code works. When they fail, we have a reproducible bug to fix.

LLM applications break this assumption. The same prompt may produce:

Different word choices expressing the same meaning
Different valid formats for the same content
Different reasoning paths to the same conclusion
Occasionally, genuinely different (and possibly wrong) answers

This creates a testing trilemma:

There’s no perfect solution—only tradeoffs appropriate to your situation.

Testing Strategies and When to Use Each

Strategy 1: Deterministic testing (temperature=0)

Set temperature to 0 for reproducible outputs. Useful for:

Regression testing against known-good outputs
Format validation (does output parse correctly?)
Exact-match requirements (code generation, structured data)

Limitations:

Doesn’t test production behavior (where temperature > 0)
May give false confidence (model could be “luckily” correct at temp=0)
Provider changes can still break determinism

def test_classification_deterministic():
    """Test with temperature=0 for reproducibility."""
    response = llm.generate(
        "Classify this support ticket: 'My order never arrived'",
        temperature=0
    )
    # Can use exact assertion because deterministic
    assert response.strip() == "shipping_issue"

Strategy 2: Statistical testing

Run tests multiple times, assert on distributions. Useful for:

Production-representative quality testing
Measuring improvement/regression across model versions
Understanding consistency boundaries

Limitations:

Slower (multiple API calls per test)
More expensive
Requires defining statistical thresholds

def test_classification_statistical():
    """Test accuracy over multiple runs."""
    test_cases = [
        ("My order never arrived", "shipping_issue"),
        ("The product was damaged", "quality_issue"),
        # ... more cases
    ]

    correct = 0
    for input_text, expected in test_cases:
        response = llm.generate(f"Classify: {input_text}", temperature=0.7)
        if expected in response.lower():
            correct += 1

    accuracy = correct / len(test_cases)

    # Threshold assertion on aggregate accuracy (one sample per case, no confidence interval)
    assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
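A pass rate measured on a small sample is itself noisy, so a plain threshold can pass or fail by chance. One common way to account for this is to assert on the lower bound of a confidence interval. The sketch below uses the Wilson score interval; the choice of interval and the 95% level (z = 1.96) are assumptions for illustration.

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = p_hat + z**2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return (centre - margin) / denom

# The same observed accuracy (95%) supports very different claims
small_sample = wilson_lower_bound(19, 20)
large_sample = wilson_lower_bound(190, 200)
```

With 19 of 20 correct, the lower bound is roughly 0.76, which does not support a "90% accurate" claim. With 190 of 200 it is roughly 0.91, which does. This is one reason statistical suites need more test cases, or repeated runs per case, than their deterministic counterparts.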

Strategy 3: Behavioral/property testing

Test for properties rather than exact outputs. Useful for:

Format compliance (valid JSON, expected schema)
Safety constraints (no harmful content)
Consistency properties (same entity names throughout)

import json
import pytest

def test_json_output_valid():
    """Test that output is always valid JSON."""
    for _ in range(10):  # Multiple runs
        response = llm.generate(
            "Return a JSON object with fields: name, age, city",
            temperature=0.7
        )
        try:
            parsed = json.loads(response)
            assert 'name' in parsed
            assert 'age' in parsed
            assert 'city' in parsed
        except json.JSONDecodeError:
            pytest.fail(f"Invalid JSON: {response}")

Strategy 4: LLM-as-judge

Use a (usually stronger) LLM to evaluate outputs. Useful for:

Subjective quality assessment
Complex criteria (helpfulness, accuracy, tone)
Scaling evaluation beyond human capacity

Limitations:

Introduces LLM bias into evaluation
Expensive (two LLM calls per test)
Judge LLM can be wrong

def test_with_llm_judge():
    """Use LLM to evaluate response quality."""
    response = llm.generate("Explain quantum entanglement simply")

    judge_prompt = f"""
    Evaluate this explanation of quantum entanglement for a general audience.

    Explanation: {response}

    Rate 1-5 on:
    - Accuracy (scientifically correct)
    - Clarity (understandable to non-experts)
    - Completeness (covers key concepts)

    Return JSON: {{"accuracy": N, "clarity": N, "completeness": N}}
    """

    scores = json.loads(judge_llm.generate(judge_prompt, temperature=0))
    assert all(score >= 4 for score in scores.values())

Strategy 5: Mock/fixture testing

Replace LLM calls with recorded responses. Useful for:

Testing non-LLM logic (routing, formatting, error handling)
Fast CI/CD pipelines
Consistent behavior across environments

Limitations:

Doesn’t test actual model behavior
Fixtures become stale
May mask prompt quality issues
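A minimal sketch of fixture replay, assuming responses were recorded earlier and keyed by a hash of model and prompt. `ReplayLLMClient`, `route_ticket`, and the model name `recorded-model` are hypothetical names for this example.

```python
import hashlib
import json

class ReplayLLMClient:
    """Hypothetical stand-in for an LLM client that replays recorded responses."""

    def __init__(self, fixtures: dict[str, str]):
        self._fixtures = fixtures
        self.calls: list[str] = []

    @staticmethod
    def fixture_key(prompt: str, model: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def generate(self, prompt: str, model: str = "recorded-model", **_ignored) -> str:
        self.calls.append(prompt)
        key = self.fixture_key(prompt, model)
        if key not in self._fixtures:
            # Fail loudly so a changed prompt is noticed instead of silently passing
            raise KeyError(f"No recorded response for prompt: {prompt[:60]!r}")
        return self._fixtures[key]

def route_ticket(llm, ticket_text: str) -> str:
    """Non-LLM logic under test: map a classification label to a queue."""
    label = llm.generate(f"Classify: {ticket_text}").strip().lower()
    return {"shipping_issue": "logistics", "quality_issue": "returns"}.get(label, "triage")

fixtures = {
    ReplayLLMClient.fixture_key("Classify: My order never arrived", "recorded-model"): " Shipping_Issue\n",
}
client = ReplayLLMClient(fixtures)
queue = route_ticket(client, "My order never arrived")
```

Raising on an unknown prompt is a deliberate choice here: when a prompt template changes, the fixture lookup fails and prompts a re-recording, which limits the staleness problem noted above.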

Common Mistake: Using Exact Assertions for LLM Tests

What people do: Write tests like assert response == "The answer is 42". Tests pass locally, fail randomly in CI, get marked as flaky, then get ignored.

Why it fails: LLM outputs vary even at temperature=0 across providers, model versions, and infrastructure. “The answer is 42” and “42 is the answer” are equivalent responses but fail exact matching. This creates tests that are either too strict (fail randomly) or too loose (miss real bugs).

Fix: Match the assertion to what you’re actually testing. For format validation, parse the output and check structure. For content correctness, use semantic similarity or LLM-as-judge. For regression testing, run multiple times and assert on statistical properties. Reserve exact matching only for truly deterministic operations like response parsing.

Full implementation: See reference/10_backend_engineering_code.md for complete test suite implementations.

Building a Testing Pyramid for LLM Applications

Run at different frequencies:

- Every commit: Mocked tests, deterministic tests
- Per PR: Statistical tests on samples, behavioral tests
- Daily: Full statistical suite, LLM-judge evaluation
- Weekly: Human evaluation of samples, production A/B analysis

Regression Testing: Catching Quality Degradation

LLM applications can silently degrade. A model update, prompt change, or infrastructure shift can cause quality regression that traditional tests miss.

Build regression baselines:

class LLMRegressionSuite:
    """Track quality metrics over time to catch regressions."""

    def __init__(self, baseline_path: str):
        self.baseline = self.load_baseline(baseline_path)
        self.current_metrics = {}

    def run_evaluation(self, llm_client, test_set: list[dict]) -> dict:
        """Evaluate current model on test set."""
        # Measure latency once; every measurement issues real (billed) LLM calls
        latency = self.measure_latency(llm_client, test_set)
        results = {
            'accuracy': self.measure_accuracy(llm_client, test_set),
            'format_compliance': self.measure_format_compliance(llm_client, test_set),
            'latency_p50': latency['p50'],
            'latency_p95': latency['p95'],
            'consistency': self.measure_consistency(llm_client, test_set),
        }
        self.current_metrics = results
        return results

    def check_regressions(self, tolerance: float = 0.05) -> list[dict]:
        """Identify metrics that regressed beyond tolerance."""
        regressions = []
        for metric, baseline_value in self.baseline.items():
            current_value = self.current_metrics.get(metric, 0)

            # For latency, higher is worse
            if 'latency' in metric:
                if current_value > baseline_value * (1 + tolerance):
                    regressions.append({
                        'metric': metric,
                        'baseline': baseline_value,
                        'current': current_value,
                        'change': f"+{(current_value/baseline_value - 1)*100:.1f}%"
                    })
            # For other metrics, lower is worse
            else:
                if current_value < baseline_value * (1 - tolerance):
                    regressions.append({
                        'metric': metric,
                        'baseline': baseline_value,
                        'current': current_value,
                        'change': f"{(current_value/baseline_value - 1)*100:.1f}%"
                    })

        return regressions

Debugging Non-Deterministic Systems

The Debugging Mindset Shift

Traditional debugging follows a pattern: reproduce the issue, isolate the cause, fix the code, verify the fix. LLM debugging requires a different approach because:

Issues may not reproduce: The same input might work fine on retry
“Wrong” is often subjective: Is the response wrong, or just different than expected?
Root causes span systems: The issue might be in prompting, retrieval, model behavior, or their interaction
Fixes don’t guarantee prevention: Changing a prompt might fix one case while breaking others

The LLM debugging process:

Common Mistake: Retrying Until You Get a “Good” Response

What people do: When an LLM response seems wrong, automatically retry the request (sometimes multiple times) until it produces an acceptable output.

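Why it fails: If your prompt produces bad outputs 20% of the time, retrying just masks the problem. Retrying until success costs about 1.25x the API spend on average (1 / 0.8 expected calls per request), depends on reliably detecting a bad output in the first place, and gives users inconsistent latency because some requests need two or three attempts. Worse, you might be training users to expect retry-based “best of N” quality that production can’t sustain.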

Fix: Treat frequent retries as a signal to fix the prompt, not a solution. Log retry rates per prompt template. If a prompt requires retries >10% of the time, that’s a prompt engineering problem. For genuinely ambiguous cases, consider using a verification step that checks output quality and only retries when detection is reliable.
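Logging retry rates per prompt template needs very little machinery. A sketch, with `RetryRateTracker` and the `min_requests` floor (to avoid alerting on tiny samples) as assumptions for this example:

```python
from collections import defaultdict

class RetryRateTracker:
    """Track how often each prompt template needs a retry."""

    def __init__(self, alert_threshold: float = 0.10, min_requests: int = 50):
        self.alert_threshold = alert_threshold
        self.min_requests = min_requests
        self._requests: dict[str, int] = defaultdict(int)
        self._retried: dict[str, int] = defaultdict(int)

    def record(self, template_id: str, attempts: int) -> None:
        """Record one finished request and how many attempts it took."""
        self._requests[template_id] += 1
        if attempts > 1:
            self._retried[template_id] += 1

    def retry_rate(self, template_id: str) -> float:
        total = self._requests.get(template_id, 0)
        return self._retried.get(template_id, 0) / total if total else 0.0

    def templates_needing_attention(self) -> list[str]:
        """Templates with enough traffic whose retry rate exceeds the threshold."""
        return sorted(
            t for t, total in self._requests.items()
            if total >= self.min_requests and self.retry_rate(t) > self.alert_threshold
        )

tracker = RetryRateTracker()
for i in range(100):
    tracker.record("spending_summary", attempts=2 if i < 20 else 1)
    tracker.record("greeting", attempts=2 if i < 2 else 1)
flagged = tracker.templates_needing_attention()
```

Here `spending_summary` retries on 20% of requests and is flagged, while `greeting` at 2% is not.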

Staff Engineer Perspective

“Classic backend instinct: wrap the LLM call in our standard retry decorator. Three retries, exponential backoff, the same pattern we use for every internal service. The day OpenAI had a 30-minute degradation, our system generated 4x normal traffic against them, triggered our org-wide rate limit, and brought down two unrelated products that shared the key. The retry logic that protects you for idempotent REST calls actively hurts you for shared, rate-limited, expensive endpoints. Now our LLM client has a single retry, a circuit breaker, and a degraded-mode response. Retries against an LLM provider aren’t free—they’re a denial-of-service against yourself.”

— Tech Lead at a B2B analytics company
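The pattern described in the quote (a single retry, a circuit breaker, and a degraded-mode response) might look roughly like the sketch below. The class, thresholds, and fallback text are illustrative assumptions. A production breaker would typically also limit how many probe requests pass through once the cooldown expires.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after repeated failures, probe after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: float | None = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cooldown, let probe requests through (half-open state)
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

def call_with_breaker(breaker: CircuitBreaker, llm_call, fallback: str, max_attempts: int = 2) -> str:
    """At most one retry; return a degraded-mode response when the circuit is open."""
    for _ in range(max_attempts):
        if not breaker.allow_request():
            break
        try:
            result = llm_call()
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return result
    return fallback
```

Once the breaker opens, requests return the fallback without touching the provider at all, which is what keeps a provider degradation from turning into amplified traffic against a shared rate limit.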

Logging for LLM Debugging

You cannot debug what you did not log. LLM debugging requires comprehensive logging of the complete interaction context.

What to log:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class LLMInteractionLog:
    """Complete record of an LLM interaction for debugging."""

    # Request identity
    request_id: str
    timestamp: datetime
    user_id: str | None

    # Model configuration
    model: str
    temperature: float
    max_tokens: int

    # Prompt components (logged separately for analysis)
    system_prompt: str
    user_message: str
    conversation_history: list[dict]

    # RAG context (if applicable)
    retrieval_query: str | None
    retrieved_documents: list[dict] | None
    retrieval_scores: list[float] | None

    # Assembled prompt (what actually went to the API)
    full_prompt: str
    prompt_token_count: int

    # Response
    response: str
    completion_token_count: int

    # Timing
    retrieval_latency_ms: float | None
    llm_latency_ms: float | None
    total_latency_ms: float

    # Outcome (filled later via feedback)
    user_feedback: str | None
    automated_quality_score: float | None

Storage considerations:

Hot storage (recent, frequently accessed): Keep 7-30 days in queryable database for debugging
Cold storage (historical, audit): Archive older logs to object storage
Sampling: For high-volume applications, log 100% of errors, sample successful requests
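The sampling rule can be made deterministic by hashing the request ID, so that every service handling the same request makes the same keep-or-drop decision. A sketch, assuming a string `request_id` and a 10% sample rate chosen only for illustration:

```python
import hashlib

def should_log(request_id: str, is_error: bool, success_sample_rate: float = 0.10) -> bool:
    """Keep every error; keep a deterministic sample of successful requests."""
    if is_error:
        return True
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_sample_rate

kept = sum(should_log(f"req-{i}", is_error=False) for i in range(10_000))
```

Because the decision depends only on the ID, a request that was sampled in the API layer is also sampled in the retrieval and LLM layers, which keeps sampled traces complete.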

Common Debugging Scenarios

Scenario 1: “The model ignores my instructions”

Symptoms: Output doesn’t follow specified format, ignores constraints, uses wrong tone.

Diagnostic approach:

def diagnose_instruction_following(
    prompt: str,
    response: str,
    expected_behavior: str
) -> dict:
    """Diagnose why model might ignore instructions."""

    diagnostics = {
        'findings': [],
        'likely_causes': [],
        'suggestions': []
    }

    # Check 1: Prompt length and instruction position
    prompt_tokens = count_tokens(prompt)
    if prompt_tokens > 3000:
        diagnostics['findings'].append(f"Long prompt ({prompt_tokens} tokens)")
        diagnostics['likely_causes'].append("Lost-in-the-middle effect")
        diagnostics['suggestions'].append("Move critical instructions to beginning and end")

    # Check 2: Instruction clarity
    instruction_patterns = ['must', 'should', 'always', 'never', 'required']
    # Count occurrences, not distinct patterns (five patterns could never exceed a count of 5)
    instruction_count = sum(prompt.lower().count(p) for p in instruction_patterns)
    if instruction_count > 5:
        diagnostics['findings'].append(f"Many directive words ({instruction_count})")
        diagnostics['likely_causes'].append("Conflicting or overwhelming instructions")
        diagnostics['suggestions'].append("Simplify to core requirements")

    # Check 3: Example-instruction conflict
    if 'example' in prompt.lower() or '```' in prompt:
        diagnostics['findings'].append("Contains examples")
        diagnostics['likely_causes'].append("Examples may contradict instructions")
        diagnostics['suggestions'].append("Verify examples match desired behavior")

    # Check 4: Negative instructions
    negative_patterns = ["don't", "do not", "never", "avoid"]
    # Count every occurrence so repeated "don't" instructions are all included
    negatives = sum(prompt.lower().count(p) for p in negative_patterns)
    if negatives > 2:
        diagnostics['findings'].append(f"Multiple negative instructions ({negatives})")
        diagnostics['likely_causes'].append("Negative framing is less effective")
        diagnostics['suggestions'].append("Rephrase as positive: 'do X' instead of 'don't do Y'")

    return diagnostics

Scenario 2: “RAG returns irrelevant context”

Symptoms: Generated response is off-topic or misses obvious relevant information.

Diagnostic approach:

def diagnose_retrieval(
    query: str,
    retrieved_docs: list[dict],
    expected_topic: str
) -> dict:
    """Diagnose retrieval quality issues."""

    diagnostics = {
        'query_analysis': {},
        'retrieval_analysis': {},
        'suggestions': []
    }

    # Analyze query
    query_tokens = query.split()
    diagnostics['query_analysis'] = {
        'token_count': len(query_tokens),
        'is_question': query.strip().endswith('?'),
        'contains_technical_terms': detect_technical_terms(query),
    }

    if len(query_tokens) < 5:
        diagnostics['suggestions'].append("Query may be too short; consider query expansion")

    # Analyze retrieved documents
    scores = [doc.get('score', 0) for doc in retrieved_docs]
    diagnostics['retrieval_analysis'] = {
        'top_score': max(scores) if scores else 0,
        'score_range': max(scores) - min(scores) if scores else 0,
        'all_scores_low': all(s < 0.5 for s in scores),
    }

    if diagnostics['retrieval_analysis']['all_scores_low']:
        diagnostics['suggestions'].append("All scores low; query may be out-of-domain")
        diagnostics['suggestions'].append("Check if relevant documents exist in corpus")

    if diagnostics['retrieval_analysis']['score_range'] < 0.1:
        diagnostics['suggestions'].append("Scores very similar; retriever not discriminating")
        diagnostics['suggestions'].append("Consider reranking or different embedding model")

    # Check term overlap
    query_terms = set(query.lower().split())
    for i, doc in enumerate(retrieved_docs[:3]):
        doc_terms = set(doc['text'].lower().split())
        overlap = query_terms & doc_terms
        diagnostics[f'doc_{i}_overlap'] = len(overlap)

    return diagnostics

Scenario 3: “Output quality is inconsistent”

Symptoms: Same type of query produces varying quality responses.

Diagnostic approach:

def diagnose_inconsistency(
    prompt: str,
    num_samples: int = 10
) -> dict:
    """Diagnose output consistency issues."""

    responses = []
    for _ in range(num_samples):
        response = llm.generate(prompt, temperature=0.7)
        responses.append(response)

    # Cluster responses by similarity
    embeddings = [embed(r) for r in responses]
    similarities = compute_pairwise_similarity(embeddings)

    diagnostics = {
        'sample_count': num_samples,
        'unique_responses': len(set(responses)),
        'avg_similarity': float(similarities.mean()),
        'min_similarity': float(similarities.min()),
        'similarity_std': float(similarities.std()),
    }

    # Compare with temperature=0
    deterministic = llm.generate(prompt, temperature=0)
    diagnostics['deterministic_response'] = deterministic
    diagnostics['deterministic_in_samples'] = deterministic in responses

    # Suggestions based on findings
    if diagnostics['avg_similarity'] < 0.7:
        diagnostics['suggestions'] = [
            "High variance in outputs",
            "Consider: lower temperature, more specific instructions, few-shot examples",
        ]
    elif diagnostics['unique_responses'] == 1:
        diagnostics['suggestions'] = [
            "All responses identical despite temperature>0",
            "Prompt may be overly constrained or model very confident",
        ]

    return diagnostics

Production Debugging: Working with Incomplete Information

In production, you often can’t reproduce issues exactly. You have logs, but the exact conditions may be unreproducible due to:

Model updates by the provider
Non-deterministic sampling
State that wasn’t logged

Strategies for production debugging:

Pattern analysis: Look for commonalities across failures

from collections import Counter

def analyze_failure_patterns(failures: list[dict]) -> dict:
    """Find patterns in logged failures."""
    patterns = {
        'by_user_type': Counter(),
        'by_prompt_length': [],
        'by_time_of_day': Counter(),
        'common_keywords': Counter(),
    }

    for f in failures:
        patterns['by_prompt_length'].append(len(f['prompt']))
        patterns['by_time_of_day'][f['timestamp'].hour] += 1
        for word in f['user_message'].split():
            patterns['common_keywords'][word.lower()] += 1

    return {
        'prompt_length_correlation': analyze_correlation(patterns['by_prompt_length']),
        'peak_failure_hours': patterns['by_time_of_day'].most_common(3),
        'keyword_patterns': patterns['common_keywords'].most_common(10),
    }

Reproduction scripts: Generate scripts from logs to attempt reproduction

def generate_reproduction_script(log_entry: dict) -> str:
    """Generate runnable script from log entry."""
    return f'''
# Reproduction script for request {log_entry["request_id"]}
# Original timestamp: {log_entry["timestamp"]}
# Original outcome: {log_entry.get("user_feedback", "unknown")}

from your_app import LLMClient

client = LLMClient(model={log_entry["model"]!r})

# The prompt is embedded via repr() so quotes, backslashes, and newlines stay valid Python
prompt = {log_entry["full_prompt"]!r}

# Try to reproduce
for i in range(5):
    response = client.generate(prompt, temperature={log_entry["temperature"]})
    print(f"Attempt {{i+1}}: {{response[:200]}}...")
'''

Canary testing: Deploy fixes to a small percentage of traffic first

import random

def route_request(request, fix_enabled: bool) -> str:
    """Route requests with canary for new fix."""
    if fix_enabled and random.random() < 0.05:  # 5% canary
        return process_with_fix(request)
    return process_standard(request)
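A per-request coin flip means the same user can see the fix on one request and the old behavior on the next, which makes feedback hard to attribute. One common alternative is to assign users to the canary deterministically. The sketch below assumes requests carry a stable user identifier; the function name `in_canary` and the `salt` parameter are hypothetical.

```python
import hashlib

def in_canary(user_id: str, canary_percent: float = 5.0, salt: str = "fix-v1") -> bool:
    """Deterministically assign a user to the canary group.

    Hashing salt + user_id gives a stable bucket in [0, 10000), so the same
    user always lands in the same arm for a given salt.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    # canary_percent=5.0 selects buckets 0..499, i.e. 5% of the bucket space
    return bucket < canary_percent * 100
```

Changing the salt for each new fix reshuffles users, so the same people are not always the first to receive experimental changes.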

Caching Strategies for LLM Applications

When to Cache (and When Not To)

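Caching is more complex for LLM applications than for traditional backends. Prompts are free text, so two requests are rarely byte-identical, and outputs may be sampled, so a stored response is only one of many the model could have returned.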

Decision framework for LLM caching:

Exact match caching works when:

Temperature = 0 (deterministic)
Prompts are identical (including system prompt, examples, etc.)
Model version is controlled
Freshness requirements are met

Semantic caching (caching similar queries) is risky because:

“Similar” queries may require different answers
Embedding similarity doesn’t guarantee answer transferability
Edge cases can produce very wrong cached results
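To make the first risk concrete, here is a toy illustration that uses cosine similarity over word counts as a stand-in for embedding similarity. Real embedding models score text differently, so the number is only illustrative; the point is that two questions can look nearly identical while needing different answers.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (a toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

q1 = "what did i spend in march"
q2 = "what did i spend in may"
similarity = cosine_similarity(q1, q2)

# The queries share five of six words, so they score as near-duplicates
# even though the correct answers differ.
print(f"similarity = {similarity:.3f}")
```

Any threshold low enough to treat these two queries as a match would return March's spending for a question about May.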

Implementing a Safe LLM Cache

import hashlib
import json
from datetime import datetime

class LLMCache:
    """Cache layer for LLM responses with safety checks."""

    def __init__(
        self,
        redis_client,
        ttl_seconds: int = 3600,
        enable_semantic: bool = False,
        semantic_threshold: float = 0.98  # Very high threshold if enabled
    ):
        self.redis = redis_client
        self.ttl = ttl_seconds
        self.enable_semantic = enable_semantic
        self.semantic_threshold = semantic_threshold

    def cache_key(
        self,
        prompt: str,
        model: str,
        temperature: float,
        system_prompt: str = ""
    ) -> str | None:
        """Generate cache key if caching is appropriate."""

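        # Never cache non-deterministic requests. Semantic lookup is not
        # implemented in this class, so enable_semantic does not change this
        if temperature > 0:
            return None

        # Include all inputs that affect output. JSON-encoding the parts avoids
        # collisions when a field itself contains the separator
        cache_input = json.dumps([model, system_prompt, prompt])
        return f"llm:{hashlib.sha256(cache_input.encode()).hexdigest()}"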

    async def get(self, key: str) -> dict | None:
        """Retrieve cached response with metadata."""
        if not key:
            return None

        cached = await self.redis.get(key)
        if cached:
            data = json.loads(cached)
            # Track cache hits for monitoring
            await self.redis.incr(f"cache_hits:{key[:20]}")
            return data
        return None

    async def set(
        self,
        key: str,
        response: str,
        metadata: dict | None = None
    ) -> None:
        """Cache response with metadata."""
        if not key:
            return

        data = {
            'response': response,
            'cached_at': datetime.utcnow().isoformat(),
            'metadata': metadata or {}
        }
        await self.redis.setex(key, self.ttl, json.dumps(data))

Cache Invalidation Strategies

LLM caches face unique invalidation challenges:

Model updates: Providers may update models without notice
- Strategy: Include model version in cache key when possible
- Strategy: Set shorter TTLs to naturally expire stale entries
Prompt changes: Even small prompt changes should invalidate
- Strategy: Hash complete prompt (including system prompt)
- Strategy: Version prompts explicitly in cache key
Knowledge freshness: RAG responses depend on document corpus
- Strategy: Include corpus version or document hashes in cache key
- Strategy: Invalidate cache when documents are updated

class RAGAwareCache(LLMCache):
    """Cache that accounts for RAG document freshness."""

    def cache_key(
        self,
        prompt: str,
        model: str,
        temperature: float,
        retrieved_doc_ids: list[str],
        system_prompt: str = ""
    ) -> str | None:
        # Include document IDs in cache key
        # Cache is only valid for this exact retrieval result.
        # IDs alone do not detect edits to a document; use content hashes or
        # versioned IDs if documents can change in place.
        # Join with a separator so ["ab", "c"] and ["a", "bc"] hash differently
        doc_hash = hashlib.sha256(
            "\n".join(sorted(retrieved_doc_ids)).encode()
        ).hexdigest()[:16]

        base_key = super().cache_key(prompt, model, temperature, system_prompt)
        if base_key:
            return f"{base_key}:docs:{doc_hash}"
        return None
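The first two strategies in the list above (model version and prompt version in the key) can be sketched in the same way. This is a minimal example, assuming your application tracks a version string for the model and for each prompt template; the function name `versioned_cache_key` and its parameters are hypothetical.

```python
import hashlib
import json

def versioned_cache_key(
    rendered_prompt: str,
    model: str,
    model_version: str,
    prompt_template_version: str,
) -> str:
    """Build a cache key that changes whenever the model or template changes."""
    payload = json.dumps(
        {
            "model": model,
            "model_version": model_version,
            "template": prompt_template_version,
            "prompt": rendered_prompt,
        },
        sort_keys=True,  # Stable ordering so equal inputs give equal keys
    )
    return f"llm:{hashlib.sha256(payload.encode()).hexdigest()}"
```

Bumping either version string makes every old entry unreachable, so stale responses age out through the TTL without an explicit purge.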

Data Pipeline Architecture

Training Data Pipelines

If you’re fine-tuning models, training data quality determines model quality. “Garbage in, garbage out” is especially true for LLMs because:

Models learn patterns from examples, including unintended patterns
Small amounts of bad data can disproportionately affect behavior
Data issues manifest as subtle quality problems, not obvious errors

Training data pipeline stages:

Key quality gates:

class TrainingDataValidator:
    """Validate training examples before fine-tuning."""

    def validate_example(self, example: dict) -> tuple[bool, list[str]]:
        """Check single example, return (valid, issues)."""
        issues = []

        # Structure check
        if 'messages' not in example:
            issues.append("Missing 'messages' key")
            return False, issues

        messages = example['messages']
        if not messages:
            issues.append("'messages' is empty")
            return False, issues

        # Role sequence check
        roles = [m.get('role') for m in messages]
        if roles[-1] != 'assistant':
            issues.append("Last message must be from assistant")

        # Empty content check (the 'or' also covers content set to None)
        for i, msg in enumerate(messages):
            if not (msg.get('content') or '').strip():
                issues.append(f"Empty content in message {i}")

        # Token limit check. count_tokens is your tokenizer helper (not shown);
        # set the limit to match the target model
        total_tokens = sum(
            count_tokens(m.get('content') or '')
            for m in messages
        )
        if total_tokens > 4096:
            issues.append(f"Example exceeds token limit: {total_tokens}")

        # Quality heuristics
        assistant_msgs = [m for m in messages if m.get('role') == 'assistant']
        for msg in assistant_msgs:
            content = msg.get('content') or ''
            if len(content) < 10:
                issues.append("Assistant response very short")
            if content.startswith(("I cannot", "I can't")):
                issues.append("Assistant refusal - review if intentional")

        return len(issues) == 0, issues
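Validation catches malformed examples but not repeated ones, and duplicates skew what the model learns. The sketch below shows exact-match deduplication after light normalization, assuming the same `messages` structure as the validator. The function name `deduplicate_examples` is hypothetical, and near-duplicate detection (for example by embedding similarity) is out of scope here.

```python
import hashlib
import json

def deduplicate_examples(examples: list[dict]) -> list[dict]:
    """Drop exact duplicates, ignoring case and surrounding whitespace."""
    seen: set[str] = set()
    unique = []
    for example in examples:
        # Normalize each message to (role, cleaned content)
        normalized = [
            (m.get("role"), (m.get("content") or "").strip().lower())
            for m in example.get("messages", [])
        ]
        fingerprint = hashlib.sha256(json.dumps(normalized).encode()).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(example)  # Keep the first occurrence unchanged
    return unique
```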

Inference Data Pipelines (RAG Ingestion)

For RAG systems, document ingestion quality directly affects retrieval quality. The pipeline must handle:

Multiple document formats (PDF, HTML, DOCX, etc.)
Extraction errors and partial failures
Chunking with semantic coherence
Embedding at scale
Incremental updates

Robust ingestion pipeline:

class DocumentIngestionPipeline:
    """Production-grade document ingestion for RAG."""

    def __init__(
        self,
        vector_store,
        embedding_model,
        chunk_size: int = 512,
        chunk_overlap: int = 50
    ):
        self.vector_store = vector_store
        self.embedder = embedding_model
        self.chunker = SemanticChunker(chunk_size, chunk_overlap)

    async def ingest_document(
        self,
        doc_path: str,
        metadata: dict | None = None
    ) -> dict:
        """Ingest single document with error handling."""

        result = {
            'doc_path': doc_path,
            'status': 'pending',
            'chunks_indexed': 0,
            'errors': []
        }

        try:
            # Extract text
            text, extracted_metadata = await self.extract(doc_path)
            if not text.strip():
                result['status'] = 'failed'
                result['errors'].append("No text extracted")
                return result

            # Chunk
            chunks = self.chunker.chunk(text)

            # Embed in batches
            embeddings = await self.embed_batch(chunks, batch_size=32)

            # Store with metadata
            combined_metadata = {
                **(metadata or {}),
                **extracted_metadata,
                'ingested_at': datetime.utcnow().isoformat()
            }

            chunk_ids = await self.store(
                chunks, embeddings, combined_metadata
            )

            result['status'] = 'success'
            result['chunks_indexed'] = len(chunk_ids)
            result['chunk_ids'] = chunk_ids

        except ExtractionError as e:
            result['status'] = 'failed'
            result['errors'].append(f"Extraction failed: {e}")
        except EmbeddingError as e:
            # Embedding runs before storage, so nothing was indexed
            result['status'] = 'failed'
            result['errors'].append(f"Embedding failed: {e}")

        return result
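The pipeline above relies on a `SemanticChunker` that is not shown. As a rough stand-in that illustrates what `chunk_size` and `chunk_overlap` do, here is a minimal fixed-size chunker. It counts whitespace-separated words, whereas a production chunker would more likely count tokens and respect sentence or section boundaries; the function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share chunk_overlap words, so text near a boundary
    appears in both neighbouring chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # This chunk already reaches the end of the text
    return chunks
```

Larger overlap reduces the chance that an answer is split across two chunks, at the cost of more embeddings to compute and store.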

Integration Patterns

Circuit Breaker for LLM APIs

LLM APIs can fail in various ways: timeouts, rate limits, server errors, degraded quality. The circuit breaker pattern prevents cascading failures.

Why circuit breakers matter more for LLMs:

LLM calls are expensive (time and money)
Retrying a failing LLM API burns budget
User experience degrades rapidly when LLM latency increases
Provider outages can last minutes to hours

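from enum import Enum
from datetime import datetime, timedelta

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

    def __init__(self, message: str, retry_after: timedelta):
        super().__init__(message)
        self.retry_after = retry_after

class CircuitState(Enum):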
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Blocking calls
    HALF_OPEN = "half_open"  # Testing recovery

class LLMCircuitBreaker:
    """Protect against LLM API failures."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: timedelta = timedelta(seconds=60),
        half_open_max_calls: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self.state = CircuitState.CLOSED
        self.failures = 0
        self.last_failure_time = None
        self.half_open_successes = 0

    async def call(self, llm_func, *args, **kwargs):
        """Execute LLM call with circuit breaker protection."""

        if self.state == CircuitState.OPEN:
            if self._should_attempt_recovery():
                self.state = CircuitState.HALF_OPEN
                self.half_open_successes = 0
            else:
                raise CircuitOpenError(
                    "LLM circuit breaker open",
                    retry_after=self._time_until_recovery()
                )

        try:
            result = await llm_func(*args, **kwargs)
            self._record_success()
            return result
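        except (TimeoutError, APIError, RateLimitError):
            # APIError and RateLimitError stand in for your provider SDK's
            # exception types; other exceptions do not trip the breaker
            self._record_failure()
            raise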

    def _record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max_calls:
                self.state = CircuitState.CLOSED
                self.failures = 0
        elif self.state == CircuitState.CLOSED:
            self.failures = 0  # Reset on success

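    def _record_failure(self):
        self.failures += 1
        self.last_failure_time = datetime.now()

        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN  # Recovery failed
        elif self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN

    def _time_until_recovery(self) -> timedelta:
        elapsed = datetime.now() - self.last_failure_time
        return max(self.recovery_timeout - elapsed, timedelta(0))

    def _should_attempt_recovery(self) -> bool:
        return self._time_until_recovery() == timedelta(0)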
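A circuit breaker only counts timeouts if something enforces them. One way to do that, sketched here with the standard library, is to wrap each call in `asyncio.wait_for`. The wrapper name `call_with_timeout` and the 30-second default are assumptions for illustration, not a recommendation for any particular provider.

```python
import asyncio

async def call_with_timeout(llm_func, *args, timeout_seconds: float = 30.0, **kwargs):
    """Await llm_func, raising the built-in TimeoutError if it runs too long."""
    try:
        return await asyncio.wait_for(
            llm_func(*args, **kwargs), timeout=timeout_seconds
        )
    except asyncio.TimeoutError as exc:
        # On Python 3.11+ asyncio.TimeoutError is the built-in TimeoutError.
        # On older versions it is a separate class, so convert it explicitly;
        # otherwise "except TimeoutError" in the caller would miss it.
        raise TimeoutError(f"LLM call exceeded {timeout_seconds}s") from exc
```

Passing a wrapped call to the breaker means a hung request is recorded as a failure instead of holding a worker indefinitely.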

Retry Strategies for LLM Calls

Retries for LLM calls require special consideration:

Retrying may produce different results: Unlike a traditional API call, an LLM retry may succeed with different output
Cost accumulates: Each retry costs tokens
Some errors shouldn’t be retried: Content policy violations won’t succeed on retry

import asyncio
import random

class LLMRetryStrategy:
    """Intelligent retry for LLM calls."""

    # Errors that should not be retried
    NON_RETRYABLE = {
        'content_policy_violation',
        'invalid_api_key',
        'model_not_found',
        'context_length_exceeded',
    }

    def __init__(
        self,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        exponential_base: float = 2.0
    ):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.exponential_base = exponential_base

    async def execute(
        self,
        llm_func,
        *args,
        **kwargs
    ):
        """Execute with retry logic."""
        last_exception = None

        for attempt in range(self.max_retries + 1):
            try:
                return await llm_func(*args, **kwargs)
            except LLMError as e:
                last_exception = e

                if e.error_code in self.NON_RETRYABLE:
                    raise  # Don't retry

                if attempt < self.max_retries:
                    delay = self._calculate_delay(attempt, e)
                    await asyncio.sleep(delay)

        raise last_exception

    def _calculate_delay(self, attempt: int, error: LLMError) -> float:
        # Exponential backoff
        delay = self.base_delay * (self.exponential_base ** attempt)

        # Add jitter to prevent thundering herd
        delay *= (0.5 + random.random())
        delay = min(delay, self.max_delay)

        # Rate limit errors should respect retry-after header.
        # Apply it last so neither jitter nor the cap shortens the server's wait
        retry_after = getattr(error, 'retry_after', None)
        if retry_after:
            delay = max(delay, retry_after)

        return delay
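It helps to work out what a retry configuration costs in waiting time before deploying it. The sketch below computes the delay before each retry without jitter, using the same defaults as the class above; the function name `backoff_schedule` is hypothetical.

```python
def backoff_schedule(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
) -> list[float]:
    """Return the un-jittered delay before each retry, capped at max_delay."""
    return [
        min(base_delay * exponential_base ** attempt, max_delay)
        for attempt in range(max_retries)
    ]

schedule = backoff_schedule()
total_wait = sum(schedule)  # Seconds spent sleeping if every attempt fails
```

With the defaults, a request that fails every attempt sleeps for 1 + 2 + 4 = 7 seconds before jitter and makes four calls in total. Jitter in the range 0.5x to 1.5x moves that total to somewhere between 3.5 and 10.5 seconds.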

Graceful Degradation

When LLM services fail, have fallback strategies:

class DegradingLLMService:
    """LLM service with graceful degradation."""

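    def __init__(self, cache: LLMCache):
        self.primary_model = "claude-opus-4-8"
        self.fallback_models = ["claude-sonnet-4-6", "claude-haiku-4-5"]
        self.cache = cache  # Injected, because LLMCache needs a Redis client

    async def generate(self, prompt: str, **kwargs) -> dict:
        """Generate with fallback chain."""

        # Try cache first. The key is None (no caching) unless temperature is 0;
        # an unspecified temperature is treated as non-deterministic
        key = self.cache.cache_key(
            prompt, self.primary_model, kwargs.get('temperature', 1.0)
        )
        cached = await self.cache.get(key)
        if cached:
            return {'response': cached['response'], 'source': 'cache'}

        # Try primary model
        try:
            response = await self.call_model(self.primary_model, prompt, **kwargs)
            await self.cache.set(key, response)  # Only primary responses are cached
            return {'response': response, 'source': 'primary'}
        except LLMError:
            pass  # Fall through to fallbacks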

        # Try fallback models
        for model in self.fallback_models:
            try:
                response = await self.call_model(model, prompt, **kwargs)
                return {
                    'response': response,
                    'source': f'fallback:{model}',
                    'degraded': True
                }
            except LLMError:
                continue

        # All models failed - return graceful error
        return {
            'response': "I'm sorry, I'm having trouble processing your request right now. Please try again in a moment.",
            'source': 'error_fallback',
            'degraded': True,
            'failed': True
        }

Case Study: Production LLM Backend at Scale

Context

A document intelligence company processes millions of documents monthly, extracting structured data and answering questions. Their architecture evolved through several iterations as they scaled.

Architecture Evolution

Phase 1: Synchronous monolith

Single service handled everything
LLM calls blocked request threads
Worked fine until 100 concurrent users

Problems encountered:

Thread pool exhaustion from long LLM calls
No isolation between fast and slow requests
Single LLM provider outage took down everything

Phase 2: Async with queue separation

Split into sync API (fast responses) and async workers (LLM calls)
Job queue for document processing
Results stored in database, polled by clients

Client → API Gateway → {
    Fast requests → Sync Service → Cache/DB
    Slow requests → Queue → Workers → LLM APIs → Results DB
}

Problems encountered:

Queue backup during LLM provider issues
Difficult to estimate completion times
No streaming for interactive use cases

Phase 3: Streaming with circuit breakers

Streaming responses for interactive queries
Circuit breakers per LLM provider
Automatic fallback to cheaper models under load

Key patterns:

# Request routing based on complexity and latency requirements
class RequestRouter:
    def route(self, request) -> str:
        if request.requires_streaming:
            return "streaming_service"
        elif request.estimated_tokens > 10000:
            return "batch_queue"
        elif self.is_cache_candidate(request):
            return "cache_service"
        else:
            return "sync_service"
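The router depends on `request.estimated_tokens`, which has to come from somewhere. A rough sketch follows, assuming the common rule of thumb of about four characters per token for English text. The ratio varies by tokenizer and language, so treat it as a routing heuristic and use the provider's tokenizer where accuracy matters. Both function names are hypothetical.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length, rounded up."""
    return math.ceil(len(text) / chars_per_token)

def estimate_request_tokens(prompt: str, max_output_tokens: int) -> int:
    """Upper-bound estimate: prompt tokens plus the full output budget."""
    return estimate_tokens(prompt) + max_output_tokens
```

Using the output budget as the completion estimate overstates most requests, which is usually the safer direction for routing and cost limits.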

Lessons Learned

Token estimation is critical: Estimating prompt and completion tokens allows better routing, cost prediction, and capacity planning.
Model fallback is table stakes: Primary model unavailability happens. Design for it from day one.
Streaming changes everything: Once you support streaming, you need streaming-aware caching, logging, and error handling.
Observability is different: Track time-to-first-token, tokens-per-second, and completion rates, not just latency.
Cost monitoring must be real-time: Token costs can spike quickly. Monitor and alert on cost, not just errors.
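The streaming metrics named in the lessons above can be captured by wrapping the stream itself. This is a minimal sketch that measures time to first chunk and chunk throughput for any async iterator of text. It counts chunks, not tokens, because a chunk may hold several tokens; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a provider's streaming response.

```python
import asyncio
import time
from typing import AsyncIterator

async def measure_stream(stream: AsyncIterator[str]) -> dict:
    """Consume a stream of text chunks and report timing metrics."""
    start = time.perf_counter()
    first_chunk_at = None
    chunks = []

    async for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()  # First content arrived
        chunks.append(chunk)

    total = time.perf_counter() - start
    return {
        "text": "".join(chunks),
        "time_to_first_chunk": (
            None if first_chunk_at is None else first_chunk_at - start
        ),
        "total_seconds": total,
        "chunks_per_second": len(chunks) / total if total > 0 else 0.0,
    }

async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for a provider's streaming response."""
    for piece in ["Hello", ", ", "world"]:
        await asyncio.sleep(0.01)
        yield piece

metrics = asyncio.run(measure_stream(fake_stream()))
```

In a real handler the chunks would also be forwarded to the client inside the loop; the timing logic stays the same.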

Decision Frameworks

Sync vs Async Architecture

Testing Strategy Selection

Caching Decision Matrix

Scenario	Cache?	Strategy
Temperature = 0, exact prompt match	Yes	Exact match cache
Temperature > 0, exact prompt match	Risky	Cache only if variation OK
Similar prompts, same intent	Risky	Only with very high similarity threshold
RAG with changing documents	Complex	Include doc versions in key
User-specific personalization	No	Different users need different responses
High-cost queries, stable inputs	Yes	Longer TTL, careful invalidation

Key Takeaways

Design for variable latency and streaming - LLM calls take seconds to minutes. Async patterns are mandatory for scale. Implement streaming for long outputs to improve perceived responsiveness.
Embrace statistical testing over exact assertions - LLM outputs are non-deterministic with many valid variations. Use property-based tests, statistical significance, and LLM-as-judge for quality evaluation.
Log comprehensively for debugging - Non-reproducible issues require pattern analysis. Log complete prompts, responses, latency, and tokens. Build tools for pattern analysis, not just log search.
Cache conservatively - Exact-match caching for temperature=0 is safe. Include model version and prompt template version in cache keys. Semantic caching is risky and requires careful validation.
Implement circuit breakers and fallback chains - LLMs have unique failure modes. Design fallback hierarchies: primary model, fallback model, cached response, graceful error message.

Summary

Building backends for LLM applications requires rethinking traditional patterns:

Architecture: Design for variable latency, streaming, and graceful degradation. LLM calls can take seconds to minutes—async patterns are not optional for scale.

Testing: Embrace statistical testing. Exact assertions rarely make sense for LLM outputs. Build testing pyramids with different strategies at different layers.

Debugging: Log comprehensively, analyze patterns, and accept that some issues won’t reproduce exactly. Build tools for pattern analysis, not just log search.

Caching: Be conservative. Exact-match caching for deterministic calls is safe. Semantic caching is risky and requires careful thought.

Integration: Implement circuit breakers and retry strategies that account for LLM-specific failure modes. Have fallback strategies for when primary models are unavailable.

The key mindset shift: you’re building systems around probabilistic components. This doesn’t mean abandoning engineering rigor—it means applying that rigor with appropriate tools and expectations.

Connections to Other Chapters

Chapter 5 (LLM Foundations) explains why LLMs are non-deterministic—understanding this is essential for debugging
Chapter 6 (Prompt Engineering) covers prompt design that affects testability and consistency
Chapter 7 (RAG Systems) uses the data pipelines and caching strategies discussed here
Chapter 9 (Deployment) covers infrastructure for serving LLM applications
Chapter 15 (MLOps & Evaluation) extends testing into continuous evaluation and monitoring

Practical Exercises

Implement a statistical test suite: Build tests for a classification prompt that:
- Run 100 test cases
- Calculate accuracy with confidence intervals
- Fail if lower confidence bound drops below 90%
- Report per-category accuracy breakdown
Build a debugging toolkit: Create a system that:
- Logs complete LLM interactions (prompt, response, latency, tokens)
- Generates reproduction scripts from log entries
- Identifies patterns in failed requests (common keywords, time patterns)
Implement caching with invalidation: Build a cache layer that:
- Caches temperature=0 responses
- Includes model version in cache key
- Invalidates on prompt template changes
- Tracks cache hit rates
Design a graceful degradation system: Implement:
- Circuit breaker for LLM API calls
- Fallback chain (primary → fallback model → cached response → error message)
- Metrics to track degradation events
Build a training data pipeline: Create a pipeline that:
- Extracts conversation data from logs
- Validates format and quality
- Deduplicates
- Splits into train/val/test
- Versions the output
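For the first exercise, one reasonable choice of confidence interval is the Wilson score interval, which behaves better than the normal approximation when the pass rate is close to 100%. The sketch below is a starting point under that assumption; the function names are hypothetical, and other interval methods are equally valid.

```python
import math

def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a pass rate.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    margin = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    return centre - margin

def suite_passes(passes: int, trials: int, required: float = 0.90) -> bool:
    """Pass only if the lower confidence bound meets the required rate."""
    return wilson_lower_bound(passes, trials) >= required
```

Note how strict this is: 95 passes out of 100 gives a lower bound of roughly 0.89, so the suite fails a 90% requirement even though the observed accuracy is 95%. Running more test cases narrows the interval.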

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] Why do traditional unit tests (assert expected == actual) work poorly for LLM outputs? What testing approach works better?

Answer

LLM outputs are non-deterministic and have many valid variations. “The capital of France is Paris” and “Paris is France’s capital city” are both correct but don’t match exactly. Better approaches: (1) Property-based testing—check that outputs have required properties (contains “Paris”, is valid JSON, length within range). (2) Statistical testing—run N times, check pass rate meets threshold with confidence interval. (3) LLM-as-judge—use another model to evaluate semantic correctness. (4) Behavioral testing—test the system’s behavior, not exact outputs.

Q2. [IC2] A cache stores LLM responses keyed by prompt hash. What problems might arise, and how do you address them?

Answer

Problems: (1) Model version changes—same prompt, different model = different output. Include model version in cache key. (2) Temperature > 0—non-deterministic outputs shouldn’t be cached, or cache becomes lottery. Only cache temp=0. (3) Stale data—if answers reference time-sensitive info, cache invalidation needed. Add TTL. (4) Prompt template changes—code changes prompt format but cache has old responses. Include prompt template version in key. (5) Memory growth—cache unbounded. Implement LRU or size limits. (6) Cache poisoning—bad response gets cached. Validate before caching or allow manual invalidation.

Q3. [Senior] Explain the difference between full fine-tuning and LoRA. When would you choose each?

Answer

Full fine-tuning updates all model parameters—computationally expensive, needs multiple GPUs, creates a completely new model. LoRA adds small trainable matrices to frozen base model—much smaller (0.1-1% of parameters), trains on single GPU, creates small adapter that combines with base model. Choose full fine-tuning when: you have massive data, need maximum quality, have compute budget. Choose LoRA when: limited compute, want to maintain base model benefits, need to deploy multiple specialized versions efficiently, training data is moderate-sized. Most practical scenarios favor LoRA due to cost and flexibility.

Q4. [Senior] How do you debug an LLM application where users report “wrong answers” but you can’t reproduce the issues?

Answer

Systematic approach: (1) Comprehensive logging—log full prompt, model response, all parameters, timestamps. Without logs, you’re blind. (2) Reproduction system—ability to replay exact request with same inputs. (3) Check for non-determinism—temperature setting, model version changes, A/B tests. (4) User context—what were the previous turns? Context matters. (5) Timing—did it work before? Check for model updates, prompt changes, data updates. (6) Pattern analysis—are failures correlated (time of day, user segment, input length, topic)? (7) Diff analysis—compare working vs. failing cases, find distinguishing features. Prevention: Build observability in from day one.

Q5. [Staff] You’re designing the backend architecture for an LLM application expecting 10M requests/day with strict latency requirements. What are the key architectural decisions?

Answer

Key decisions: (1) Caching strategy: at 10M requests, even a 20% cache hit rate saves 2M API calls/day. Cache exact matches aggressively; use semantic similarity for near-matches only with a very high threshold and validation, since similar queries can need different answers. (2) Request routing: route simple queries to faster/cheaper models, complex queries to more capable models. (3) Async processing: queue non-urgent requests, batch for efficiency. (4) Rate limiting and backpressure: protect against traffic spikes, degrade gracefully. (5) Redundancy: multiple LLM providers, failover capability. (6) Edge caching: for global users, cache common queries at the edge. (7) Streaming: stream responses to improve perceived latency. (8) Cost controls: budget enforcement, alerting. (9) Observability: distributed tracing, metrics at every stage. Architecture: Queue → Router → Cache check → LLM pool → Response processor → Cache write.

Spot the Problem

Problem 1. [IC2] Test code for an LLM feature:

def test_summarization():
    result = summarize("Long article text here...")
    assert result == "Expected summary of the article."

Answer

Problems: (1) Exact string match will fail—LLM outputs vary in wording even with temp=0. (2) Single test case—doesn’t establish statistical confidence. (3) No property checks—doesn’t verify summary is actually shorter, contains key points, etc. Better: assert len(result) < len(input) * 0.3 (compression), assert "key_topic" in result.lower() (coverage), run multiple test cases with quality scoring.
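A property-based version of the test might look like the sketch below. The 30% compression limit and the required terms are illustrative choices, not fixed rules, and `fake_summarize` is a stand-in for the real model call so that the example runs offline.

```python
def check_summary_properties(
    source: str, summary: str, required_terms: list[str]
) -> list[str]:
    """Return the violated properties (an empty list means the summary passes)."""
    problems = []
    if not summary.strip():
        problems.append("summary is empty")
    if len(summary) >= 0.3 * len(source):
        problems.append("summary is not short enough")
    for term in required_terms:
        if term.lower() not in summary.lower():
            problems.append(f"missing key term: {term}")
    return problems

def fake_summarize(text: str) -> str:
    """Stand-in for the real LLM call."""
    return "Revenue grew 12% on strong cloud demand."

def test_summarization():
    article = "Quarterly revenue grew 12% year over year. " * 20
    problems = check_summary_properties(
        article, fake_summarize(article), ["revenue", "12%"]
    )
    assert problems == [], problems
```

Returning a list of violations instead of a single boolean makes failures easier to diagnose when the check runs across many test cases.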

Problem 2. [Senior] Error handling pattern:

async def call_llm(prompt: str) -> str:
    try:
        return await client.generate(prompt)
    except Exception:
        return "An error occurred. Please try again."

Answer

Problems: (1) Catches all exceptions—hides different failure modes (rate limit, auth, timeout, content filter). (2) No logging—failures are invisible to monitoring. (3) Generic message—user can’t distinguish transient vs. permanent failure. (4) No retry—transient failures should retry automatically. (5) No circuit breaker—if service is down, keeps trying. (6) Silent failure—downstream code may not realize it got an error, not a real response. Better: Specific exception handling, structured logging, retry with backoff for transient errors, circuit breaker, clear error types returned to caller.

Problem 3. [Staff] Fine-tuning approach:

"We fine-tuned on 50K examples from production logs.
Training loss decreased nicely. We deployed it."

Answer

Problems: (1) No validation set—can’t detect overfitting. Training loss means nothing without val loss. (2) No quality evaluation—loss decreasing doesn’t mean outputs improved for your task. (3) Logs may have bias—production logs reflect what model already does, may not improve weaknesses. (4) No A/B test—deployed without comparing to baseline. (5) Data quality unknown—were those 50K examples actually good? PII issues? (6) No held-out test set—can’t measure true performance. Process should include: train/val/test split, task-specific evaluation metrics, comparison to baseline, staged rollout with monitoring.

Design Exercises

Exercise 1. [Senior] Design a test strategy for a RAG-powered Q&A system. The system retrieves from 100K documents and generates answers. Cover: unit tests, integration tests, evaluation metrics, regression testing, and CI/CD integration.

Guidance

Unit tests: (1) Chunking—verify chunks are correct size, overlap works. (2) Embedding—mock embedding service, verify vectors have expected dimensions. (3) Retrieval—test with known documents, verify ranking. (4) Generation—property tests on output format. Integration tests: (1) End-to-end with small doc set—known questions with known answers. (2) Latency tests—verify P95 within bounds. (3) Error handling—test behavior when retrieval fails, LLM fails. Evaluation metrics: (1) Retrieval recall@k—are relevant docs retrieved? (2) Answer quality—LLM-as-judge for faithfulness, relevance. (3) Factual accuracy on golden set. Regression: Golden set of critical Q&A pairs, must pass 100%. Run on every PR. CI/CD: Fast unit tests on every commit, integration tests on PR, full evaluation nightly.

Exercise 2. [Staff] Design a system for managing training data for fine-tuning across multiple LLM applications. Requirements: version control, quality validation, deduplication, PII detection, and the ability to create task-specific datasets from a common pool.

Guidance

Architecture: (1) Raw data lake—all collected data, immutable, versioned (S3 + metadata DB). (2) Processing pipeline—PII detection/redaction, quality scoring, deduplication. (3) Curated datasets—versioned, task-specific, derived from pool. (4) Lineage tracking—which raw examples went into which datasets. Components: (1) Ingestion—from logs, human annotation, synthetic generation. Schema validation on entry. (2) PII detection—regex patterns, NER models, human review for high-risk. (3) Quality scoring—automated (length, formatting) + sampled human review. (4) Deduplication—embedding similarity, exact match, near-duplicate detection. (5) Dataset builder—queries pool by criteria, creates versioned snapshot. (6) Export—formats for different training frameworks. Governance: Access controls, audit logging, retention policies, deletion capability for GDPR.

Kwon et al. (2023), “PagedAttention” - The paper behind vLLM’s memory efficiency. Essential for understanding inference optimization.
Kleppmann (2017), “Designing Data-Intensive Applications” - Foundation for data pipeline design. Backend engineers’ bible.
Nygard (2018), “Release It!” - Circuit breakers, bulkheads, and other stability patterns that carry over to LLM services.

Deep Dives

Hu et al. (2021), “LoRA” - Parameter-efficient fine-tuning foundation. Critical if you’re fine-tuning.
Zheng et al. (2023), “Judging LLM-as-a-Judge” - Best practices for using LLMs as evaluators.

Practical Resources

OpenAI Cookbook - Production patterns for building with LLM APIs.
Anthropic’s Production Best Practices - Guidelines for reliable Claude integration.

--- title: "Chapter 14: Backend Engineering for AI" keywords: [testing, debugging, integration, API design, async patterns, caching, streaming, error handling] difficulty: intermediate prerequisites: [ch02, ch06] estimated_time: "4-5 hours" --- ## Introduction In late 2023, a fintech startup deployed their first LLM-powered feature: an intelligent assistant that helped users understand their spending patterns. The engineering team had built many successful backend systems before—payment processors handling thousands of transactions per second, real-time fraud detection, complex financial reporting pipelines. They applied their usual practices: comprehensive unit tests, deterministic integration tests, and monitoring dashboards tracking latency percentiles. Within two weeks, they were drowning in bug reports they couldn't reproduce. Users complained the assistant was "wrong" but the logs showed the system working correctly. Tests passed in CI but failed intermittently in production. Latency varied wildly—the same query taking 500ms one moment and 8 seconds the next. Most confusingly, a prompt that worked perfectly for weeks suddenly started producing garbled outputs after a model provider update they hadn't even noticed. The team had encountered a fundamental truth: **LLM backends are not traditional backends**. The patterns that served them well for deterministic systems—exact assertion testing, reproducible debugging, predictable performance—needed rethinking for systems whose core component is a black-box probabilistic function. This chapter bridges the gap between traditional backend engineering and the unique demands of LLM applications. You'll learn not just the techniques, but the underlying principles that explain why LLM backends behave differently and how to design systems that embrace rather than fight this reality. 
### The Core Insight: Probabilistic Systems Require Probabilistic Engineering Traditional backend engineering operates in a world of determinism. Given the same inputs and state, the same code produces the same outputs. This assumption underlies everything: we write unit tests expecting exact matches, debug by reproducing issues, and monitor for binary outcomes (success or failure). LLM applications violate this assumption at their core. A language model is a probabilistic function—the same prompt can produce different outputs due to: 1. **Sampling temperature**: Non-zero temperature introduces intentional randomness 2. **Infrastructure variation**: Different GPU instances may produce slightly different floating-point results 3. **Model versioning**: Providers update models without explicit notification 4. **Context sensitivity**: Slight prompt variations can dramatically shift outputs 5. **Batching effects**: Some providers batch requests in ways that affect generation This isn't a bug to be fixed; it's the nature of the technology. The engineering challenge is building robust systems atop this probabilistic foundation. ### What Makes LLM Backends Different Consider how traditional backend concerns transform for LLM applications: | Traditional Backend | LLM Backend | |---------------------|-------------| | Latency is predictable | Latency varies 10x depending on output length | | Errors are binary (works/fails) | "Wrong" exists on a spectrum | | Tests use exact assertions | Tests use statistical assertions | | Debugging reproduces the issue | Debugging analyzes distributions | | Caching is straightforward | Caching requires semantic similarity | | Retry logic is simple | Retry may produce different (better or worse) results | | Load testing scales linearly | Token-based pricing creates non-linear costs | Understanding these differences is the foundation for effective LLM backend engineering. 
::: {.callout-warning}
## Common Mistake: Applying Traditional Backend Patterns Without Adaptation

**What people do:** Build LLM backends using the exact same patterns as deterministic services: exact-match caching, binary success/fail monitoring, retry-until-success logic, unit tests with `assertEqual`.

**Why it fails:** LLM outputs are probabilistic. Exact-match caching misses semantically equivalent queries. Binary monitoring misses "technically successful but wrong" responses. Retries might produce different (possibly worse) results. Exact assertions fail randomly.

**Fix:** Adapt each pattern to probabilistic reality. Use semantic similarity for cache lookups. Monitor response quality, not just HTTP status. Implement smart retry logic that detects whether retry improved results. Use statistical assertions or LLM-as-judge for testing. Accept that some techniques (like deterministic debugging) need complete rethinking.
:::

### What You'll Learn

- Architectural foundations for LLM applications: async patterns, connection pooling, streaming
- Why testing LLMs is hard and how to build statistical test frameworks
- Debugging strategies for non-deterministic systems
- Caching strategies: when semantic caching helps and when it hurts
- Data pipeline design for training and inference
- Decision frameworks for key architectural choices

### Prerequisites

- Understanding of LLM fundamentals (Chapter 5)
- Familiarity with prompt engineering (Chapter 6)
- Experience with backend development (APIs, databases, async programming)
- Python programming

---

## Foundations of LLM Application Architecture

### The Anatomy of an LLM Backend

Before diving into specific techniques, let's establish a mental model for LLM backend architecture. Every LLM application shares common structural elements, though implementations vary.

![LLM Application Architecture](../assets/diagrams/rendered/ch10_llm_app_architecture.svg)

**Request flow for a typical query:**

1. Request arrives at API gateway (authentication, rate limiting)
2. Cache check: have we seen this exact request before?
3. Router decides: synchronous response or queue for async processing?
4. Orchestrator assembles context (retrieval, tool calls, conversation history)
5. LLM API call with constructed prompt
6. Post-processing (format validation, safety checks)
7. Response with observability logging

Each component has design decisions that differ from traditional backends due to LLM characteristics.

### Async Patterns for Streaming

LLM generation is inherently sequential: tokens are produced one at a time. For long responses, waiting for complete generation creates poor user experience. Streaming delivers tokens as they're generated, improving perceived latency dramatically.

The key insight: **streaming fundamentally changes your architecture**. It's not just a performance optimization; it affects API design, error handling, caching, and observability.

```python
from collections.abc import AsyncIterator

# The streaming mental model: think of LLM output as a generator, not a return value.
# llm_client, websocket, and log_interaction are assumed to be provided by your application.

async def stream_response(prompt: str) -> AsyncIterator[str]:
    """Stream tokens as they're generated."""
    async for chunk in llm_client.stream(prompt):
        # Each chunk contains one or more tokens
        yield chunk.content
        # You can process tokens incrementally
        # - Detect early termination signals
        # - Update progress indicators
        # - Accumulate for logging

# Consumer handles the stream
async def handle_request(request):
    buffer = []
    async for token in stream_response(request.prompt):
        buffer.append(token)
        await websocket.send(token)  # Real-time delivery

    # Post-completion processing
    full_response = "".join(buffer)
    await log_interaction(request, full_response)
```

**Architectural implications of streaming:**

1. **Error handling becomes complex**: An error mid-stream means partial content was already sent to the user. You need strategies for this:
   - Sentinel tokens to signal errors in the stream
   - Client-side buffer with commit semantics
   - Graceful degradation with "generation interrupted" messages
2. **Caching requires decisions**: Do you cache the complete response after streaming finishes? Do you cache per-token for replay? The answer depends on your use case.
3. **Observability changes**: You can't log "response" as a single event. You need streaming-aware metrics: time-to-first-token, tokens-per-second, stream completion rate.
4. **Connection management differs**: HTTP/1.1 streaming ties up connections. Consider Server-Sent Events, WebSockets, or HTTP/2 streaming based on your client requirements.

> **Full implementation**: See [reference/10_backend_engineering_code.md](../reference/10_backend_engineering_code.html#streaming-patterns) for complete streaming implementation with error handling.

### Connection Pooling for API Calls

LLM API calls are expensive, both in latency (typically 500ms-10s) and in connection overhead. Connection pooling amortizes connection establishment costs across requests. But LLM workloads differ from typical API workloads:

**Traditional API call:** 50-200ms response time, high request volume, connection reuse is critical for throughput.

**LLM API call:** 500ms-10s response time, lower volume, connection lifespan varies dramatically based on output length.

This changes the optimal pooling strategy:

```python
import httpx


class LLMConnectionPool:
    """Connection pool tuned for LLM API characteristics."""

    def __init__(
        self,
        base_url: str,
        max_connections: int = 100,
        max_keepalive: int = 20,  # Fewer keepalive for long-running requests
        timeout_config: dict | None = None,  # Optional overrides, e.g. {"read": 300.0}
    ):
        # LLM-specific timeouts in seconds; caller overrides replace these defaults
        timeouts = {
            "connect": 10.0,  # Connection establishment
            "read": 120.0,    # Long reads for generation
            "write": 10.0,    # Quick writes (prompts)
            "pool": 30.0,     # Wait for connection from pool
        }
        timeouts.update(timeout_config or {})
        self.timeout = httpx.Timeout(**timeouts)

        self.limits = httpx.Limits(
            max_connections=max_connections,
            max_keepalive_connections=max_keepalive,
            keepalive_expiry=30.0,  # Release idle connections faster
        )

        self.client = httpx.AsyncClient(
            base_url=base_url,
            timeout=self.timeout,
            limits=self.limits,
            http2=True,  # Better multiplexing; needs the httpx[http2] extra installed
        )
```

**Key tuning considerations:**

1. **Timeout configuration**: Read timeouts must accommodate the longest expected generation (a function of `max_tokens`). Connect timeouts should be short to fail fast.
2. **Pool sizing**: More connections than typical because each request holds a connection for seconds, not milliseconds. Size based on expected concurrent LLM calls, not total request volume.
3. **Keep-alive strategy**: LLM providers may have different keep-alive semantics. Test with your specific provider.
4. **HTTP/2 consideration**: HTTP/2 multiplexing can handle multiple requests over fewer connections, but streaming behavior varies by implementation.

### The Cost of Token-Based Pricing

LLM APIs typically charge per token, creating a cost model unlike traditional APIs. This has architectural implications:

**Traditional API:** Fixed cost per call regardless of payload size. Optimization focuses on reducing call count.

**LLM API:** Cost scales with input + output tokens. A 4K-token prompt costs 10x as much in input tokens as a 400-token prompt. Long outputs compound costs.

This means:

- **Caching ROI is higher**: Cached response saves both latency AND cost
- **Prompt optimization is cost optimization**: Shorter prompts save money
- **Output limiting is essential**: `max_tokens` isn't just for safety; it's cost control
- **"Batching" means three different things, so keep them straight**: (a) *Request-batching for throughput*: running many requests through one model pass via continuous batching is a large win when you self-host, because it raises GPU utilization and cuts cost-per-token, though it does not change the token count itself; (b) *Provider Batch APIs*: submitting an offline job to a provider's async batch endpoint earns a real ~50% token discount (see the Batch Inference pricing in Chapters 12-13); (c) the *narrow true caveat*: merely concatenating several prompts into one synchronous call does not reduce the number of tokens billed, so it saves no token cost (and can hurt latency). Earlier editions over-generalized (c) into "batching doesn't help," which is wrong: (a) and (b) are among the highest-leverage cost levers available.

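The difference between cases (b) and (c) can be checked with simple arithmetic. The sketch below uses made-up per-token prices and a hypothetical `token_cost` helper; only the roughly 50% batch discount comes from the text above. It also assumes that concatenating prompts leaves the total token count unchanged.

```python
import math


def token_cost(input_tokens: int, output_tokens: int,
               input_price: float, output_price: float,
               discount: float = 0.0) -> float:
    """Cost of one request: billed tokens times price, minus any discount."""
    gross = input_tokens * input_price + output_tokens * output_price
    return gross * (1.0 - discount)


# Hypothetical prices in dollars per token (not any provider's real rates)
IN_PRICE, OUT_PRICE = 2e-6, 10e-6

# Three independent requests as (input_tokens, output_tokens)
requests = [(400, 150), (900, 300), (250, 100)]

# Baseline: three separate synchronous calls
separate = sum(token_cost(i, o, IN_PRICE, OUT_PRICE) for i, o in requests)

# Case (c): one concatenated synchronous call bills the same tokens
concatenated = token_cost(sum(i for i, _ in requests),
                          sum(o for _, o in requests),
                          IN_PRICE, OUT_PRICE)

# Case (b): the same requests through a batch endpoint with a ~50% discount
batched = sum(token_cost(i, o, IN_PRICE, OUT_PRICE, discount=0.5)
              for i, o in requests)

print(f"separate=${separate:.6f} concatenated=${concatenated:.6f} "
      f"batched=${batched:.6f}")
```

Concatenation leaves the bill where it was, while the discounted batch path halves it. In practice the batch path trades that saving for delayed results, so it suits offline workloads rather than interactive requests.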
```python # Cost-aware request handling class CostAwareLLMClient: def __init__(self, budget_per_request: float = 0.10): self.budget = budget_per_request self.pricing = { 'gpt-5.5': {'input': 0.005/1000, 'output': 0.030/1000}, 'gpt-5.4': {'input': 0.0025/1000, 'output': 0.015/1000}, 'claude-opus-4-8': {'input': 0.005/1000, 'output': 0.025/1000}, } def estimate_cost(self, model: str, prompt: str, max_tokens: int) -> float: """Estimate request cost before making call.""" input_tokens = count_tokens(prompt) prices = self.pricing[model] # Worst case: max_tokens fully used max_cost = ( input_tokens * prices['input'] + max_tokens * prices['output'] ) return max_cost async def generate(self, prompt: str, model: str, max_tokens: int) -> str: estimated = self.estimate_cost(model, prompt, max_tokens) if estimated > self.budget: # Decision: reject, use cheaper model, or reduce max_tokens raise CostLimitExceeded(f"Estimated ${estimated:.4f} exceeds ${self.budget:.4f}") return await self._call_api(prompt, model, max_tokens) ``` --- ## Why Testing LLMs is Hard ### The Fundamental Challenge Traditional testing rests on a simple assumption: given the same inputs, code produces the same outputs. This enables assertions like `assertEqual(actual, expected)`. When tests pass, we have confidence the code works. When they fail, we have a reproducible bug to fix. LLM applications break this assumption. The same prompt may produce: - Different word choices expressing the same meaning - Different valid formats for the same content - Different reasoning paths to the same conclusion - Occasionally, genuinely different (and possibly wrong) answers This creates a testing trilemma: ![Testing Trilemma](../assets/diagrams/rendered/ch10_testing_trilemma.svg) There's no perfect solution—only tradeoffs appropriate to your situation. ### Testing Strategies and When to Use Each **Strategy 1: Deterministic testing (temperature=0)** Set temperature to 0 for reproducible outputs. 
Useful for:

- Regression testing against known-good outputs
- Format validation (does output parse correctly?)
- Exact-match requirements (code generation, structured data)

Limitations:

- Doesn't test production behavior (where temperature > 0)
- May give false confidence (model could be "luckily" correct at temp=0)
- Provider changes can still break determinism

```python
def test_classification_deterministic():
    """Test with temperature=0 for reproducibility."""
    response = llm.generate(
        "Classify this support ticket: 'My order never arrived'",
        temperature=0
    )
    # An exact assertion is reasonable here: temperature=0 is near-deterministic
    # for a pinned model version, though provider updates can still change output
    assert response.strip() == "shipping_issue"
```

**Strategy 2: Statistical testing**

Run tests multiple times, assert on distributions.

Useful for:

- Production-representative quality testing
- Measuring improvement/regression across model versions
- Understanding consistency boundaries

Limitations:

- Slower (multiple API calls per test)
- More expensive
- Requires defining statistical thresholds

```python
def test_classification_statistical():
    """Test accuracy over multiple sampled runs per case."""
    test_cases = [
        ("My order never arrived", "shipping_issue"),
        ("The product was damaged", "quality_issue"),
        # ... more cases
    ]
    runs_per_case = 5  # Repeat each case to sample the output distribution

    correct = 0
    total = 0
    for input_text, expected in test_cases:
        for _ in range(runs_per_case):
            response = llm.generate(f"Classify: {input_text}", temperature=0.7)
            total += 1
            if expected in response.lower():
                correct += 1

    accuracy = correct / total
    # Threshold assertion on the observed pass rate
    # (a point estimate, not a confidence interval)
    assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
```

**Strategy 3: Behavioral/property testing**

Test for properties rather than exact outputs.

Useful for: - Format compliance (valid JSON, expected schema) - Safety constraints (no harmful content) - Consistency properties (same entity names throughout) ```python def test_json_output_valid(): """Test that output is always valid JSON.""" for _ in range(10): # Multiple runs response = llm.generate( "Return a JSON object with fields: name, age, city", temperature=0.7 ) try: parsed = json.loads(response) assert 'name' in parsed assert 'age' in parsed assert 'city' in parsed except json.JSONDecodeError: pytest.fail(f"Invalid JSON: {response}") ``` **Strategy 4: LLM-as-judge** Use a (usually stronger) LLM to evaluate outputs. Useful for: - Subjective quality assessment - Complex criteria (helpfulness, accuracy, tone) - Scaling evaluation beyond human capacity Limitations: - Introduces LLM bias into evaluation - Expensive (two LLM calls per test) - Judge LLM can be wrong ```python def test_with_llm_judge(): """Use LLM to evaluate response quality.""" response = llm.generate("Explain quantum entanglement simply") judge_prompt = f""" Evaluate this explanation of quantum entanglement for a general audience. Explanation: {response} Rate 1-5 on: - Accuracy (scientifically correct) - Clarity (understandable to non-experts) - Completeness (covers key concepts) Return JSON: {{"accuracy": N, "clarity": N, "completeness": N}} """ scores = json.loads(judge_llm.generate(judge_prompt, temperature=0)) assert all(score >= 4 for score in scores.values()) ``` **Strategy 5: Mock/fixture testing** Replace LLM calls with recorded responses. Useful for: - Testing non-LLM logic (routing, formatting, error handling) - Fast CI/CD pipelines - Consistent behavior across environments Limitations: - Doesn't test actual model behavior - Fixtures become stale - May mask prompt quality issues ::: {.callout-warning} ## Common Mistake: Using Exact Assertions for LLM Tests **What people do:** Write tests like `assert response == "The answer is 42"`. 
Tests pass locally, fail randomly in CI, get marked as flaky, then get ignored. **Why it fails:** LLM outputs vary even at temperature=0 across providers, model versions, and infrastructure. "The answer is 42" and "42 is the answer" are equivalent responses but fail exact matching. This creates tests that are either too strict (fail randomly) or too loose (miss real bugs). **Fix:** Match the assertion to what you're actually testing. For format validation, parse the output and check structure. For content correctness, use semantic similarity or LLM-as-judge. For regression testing, run multiple times and assert on statistical properties. Reserve exact matching only for truly deterministic operations like response parsing. ::: > **Full implementation**: See [reference/10_backend_engineering_code.md](../reference/10_backend_engineering_code.html#testing-strategies) for complete test suite implementations. ### Building a Testing Pyramid for LLM Applications ![Testing Pyramid](../assets/diagrams/rendered/ch10_testing_pyramid.svg) **Run at different frequencies:** - Every commit: Mocked tests, deterministic tests - Per PR: Statistical tests on samples, behavioral tests - Daily: Full statistical suite, LLM-judge evaluation - Weekly: Human evaluation of samples, production A/B analysis ### Regression Testing: Catching Quality Degradation LLM applications can silently degrade. A model update, prompt change, or infrastructure shift can cause quality regression that traditional tests miss. 
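Because a regression check compares pass rates measured on finite samples, it helps to know how much of an observed difference could be sampling noise. The following is a minimal sketch, assuming the Wilson score interval as the estimator. That choice and the `wilson_lower_bound` helper are illustrative assumptions, not something this chapter prescribes.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - margin


# Same observed pass rate (90%), very different amounts of evidence
small = wilson_lower_bound(passes=18, trials=20)
large = wilson_lower_bound(passes=450, trials=500)

print(f"n=20: lower bound {small:.3f}; n=500: lower bound {large:.3f}")
```

A suite could assert on the lower bound instead of the raw rate, so that a 90% result from 20 samples is not treated as equivalent to 90% from 500. The cost is more API calls per evaluation, which is one reason to reserve the larger sample sizes for daily rather than per-commit runs.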
Build regression baselines:

```python
class LLMRegressionSuite:
    """Track quality metrics over time to catch regressions.

    Helper methods (load_baseline and the measure_* functions) are not shown.
    """

    def __init__(self, baseline_path: str):
        self.baseline = self.load_baseline(baseline_path)
        self.current_metrics = {}

    def run_evaluation(self, llm_client, test_set: list[dict]) -> dict:
        """Evaluate current model on test set."""
        # Measure latency once so p50 and p95 come from the same run
        latency = self.measure_latency(llm_client, test_set)
        results = {
            'accuracy': self.measure_accuracy(llm_client, test_set),
            'format_compliance': self.measure_format_compliance(llm_client, test_set),
            'latency_p50': latency['p50'],
            'latency_p95': latency['p95'],
            'consistency': self.measure_consistency(llm_client, test_set),
        }
        self.current_metrics = results
        return results

    def check_regressions(self, tolerance: float = 0.05) -> list[dict]:
        """Identify metrics that regressed beyond tolerance."""
        regressions = []
        for metric, baseline_value in self.baseline.items():
            current_value = self.current_metrics.get(metric)
            if current_value is None or not baseline_value:
                continue  # Metric not measured, or baseline unusable for a ratio

            # For latency, higher is worse; for other metrics, lower is worse
            if 'latency' in metric:
                regressed = current_value > baseline_value * (1 + tolerance)
            else:
                regressed = current_value < baseline_value * (1 - tolerance)

            if regressed:
                change = (current_value / baseline_value - 1) * 100
                regressions.append({
                    'metric': metric,
                    'baseline': baseline_value,
                    'current': current_value,
                    'change': f"{change:+.1f}%"
                })
        return regressions
```

---

## Debugging Non-Deterministic Systems

### The Debugging Mindset Shift

Traditional debugging follows a pattern: reproduce the issue, isolate the cause, fix the code, verify the fix. LLM debugging requires a different approach because:

1. **Issues may not reproduce**: The same input might work fine on retry
2. **"Wrong" is often subjective**: Is the response wrong, or just different than expected?
3. **Root causes span systems**: The issue might be in prompting, retrieval, model behavior, or their interaction
4. **Fixes don't guarantee prevention**: Changing a prompt might fix one case while breaking others

**The LLM debugging process:**

![Debugging Workflow](../assets/diagrams/rendered/ch10_debugging_workflow.svg)

::: {.callout-warning}
## Common Mistake: Retrying Until You Get a "Good" Response

**What people do:** When an LLM response seems wrong, automatically retry the request (sometimes multiple times) until it produces an acceptable output.

**Why it fails:** If your prompt produces bad outputs 20% of the time, retrying just masks the problem. Every retry is another full-price call (about 1.25 calls per request on average at a 20% failure rate, and more in the tail), and users experience inconsistent latency. Worse, you might be training users to expect retry-based "best of N" quality that production can't sustain.

**Fix:** Treat frequent retries as a signal to fix the prompt, not a solution. Log retry rates per prompt template. If a prompt requires retries more than 10% of the time, that's a prompt engineering problem. For genuinely ambiguous cases, consider using a verification step that checks output quality and only retries when detection is reliable.
:::

::: {.callout-tip}
## Staff Engineer Perspective

"Classic backend instinct: wrap the LLM call in our standard retry decorator. Three retries, exponential backoff, the same pattern we use for every internal service. The day OpenAI had a 30-minute degradation, our system generated 4x normal traffic against them, triggered our org-wide rate limit, and brought down two unrelated products that shared the key. The retry logic that protects you for idempotent REST calls actively hurts you for shared, rate-limited, expensive endpoints. Now our LLM client has a single retry, a circuit breaker, and a degraded-mode response. Retries against an LLM provider aren't free; they're a denial-of-service against yourself."
— *Tech Lead at a B2B analytics company* ::: ### Logging for LLM Debugging You cannot debug what you did not log. LLM debugging requires comprehensive logging of the complete interaction context. **What to log:** ```python @dataclass class LLMInteractionLog: """Complete record of an LLM interaction for debugging.""" # Request identity request_id: str timestamp: datetime user_id: str | None # Model configuration model: str temperature: float max_tokens: int # Prompt components (logged separately for analysis) system_prompt: str user_message: str conversation_history: list[dict] # RAG context (if applicable) retrieval_query: str | None retrieved_documents: list[dict] | None retrieval_scores: list[float] | None # Assembled prompt (what actually went to the API) full_prompt: str prompt_token_count: int # Response response: str completion_token_count: int # Timing retrieval_latency_ms: float | None llm_latency_ms: float | None total_latency_ms: float # Outcome (filled later via feedback) user_feedback: str | None automated_quality_score: float | None ``` **Storage considerations:** - **Hot storage** (recent, frequently accessed): Keep 7-30 days in queryable database for debugging - **Cold storage** (historical, audit): Archive older logs to object storage - **Sampling**: For high-volume applications, log 100% of errors, sample successful requests ### Common Debugging Scenarios **Scenario 1: "The model ignores my instructions"** Symptoms: Output doesn't follow specified format, ignores constraints, uses wrong tone. 
Diagnostic approach: ```python def diagnose_instruction_following( prompt: str, response: str, expected_behavior: str ) -> dict: """Diagnose why model might ignore instructions.""" diagnostics = { 'findings': [], 'likely_causes': [], 'suggestions': [] } # Check 1: Prompt length and instruction position prompt_tokens = count_tokens(prompt) if prompt_tokens > 3000: diagnostics['findings'].append(f"Long prompt ({prompt_tokens} tokens)") diagnostics['likely_causes'].append("Lost-in-the-middle effect") diagnostics['suggestions'].append("Move critical instructions to beginning and end") # Check 2: Instruction clarity instruction_patterns = ['must', 'should', 'always', 'never', 'required'] instruction_count = sum(1 for p in instruction_patterns if p in prompt.lower()) if instruction_count > 5: diagnostics['findings'].append(f"Many directive words ({instruction_count})") diagnostics['likely_causes'].append("Conflicting or overwhelming instructions") diagnostics['suggestions'].append("Simplify to core requirements") # Check 3: Example-instruction conflict if 'example' in prompt.lower() or '```' in prompt: diagnostics['findings'].append("Contains examples") diagnostics['likely_causes'].append("Examples may contradict instructions") diagnostics['suggestions'].append("Verify examples match desired behavior") # Check 4: Negative instructions negative_patterns = ["don't", "do not", "never", "avoid"] negatives = sum(1 for p in negative_patterns if p in prompt.lower()) if negatives > 2: diagnostics['findings'].append(f"Multiple negative instructions ({negatives})") diagnostics['likely_causes'].append("Negative framing is less effective") diagnostics['suggestions'].append("Rephrase as positive: 'do X' instead of 'don't do Y'") return diagnostics ``` **Scenario 2: "RAG returns irrelevant context"** Symptoms: Generated response is off-topic or misses obvious relevant information. 
Diagnostic approach: ```python def diagnose_retrieval( query: str, retrieved_docs: list[dict], expected_topic: str ) -> dict: """Diagnose retrieval quality issues.""" diagnostics = { 'query_analysis': {}, 'retrieval_analysis': {}, 'suggestions': [] } # Analyze query query_tokens = query.split() diagnostics['query_analysis'] = { 'token_count': len(query_tokens), 'is_question': query.strip().endswith('?'), 'contains_technical_terms': detect_technical_terms(query), } if len(query_tokens) < 5: diagnostics['suggestions'].append("Query may be too short; consider query expansion") # Analyze retrieved documents scores = [doc.get('score', 0) for doc in retrieved_docs] diagnostics['retrieval_analysis'] = { 'top_score': max(scores) if scores else 0, 'score_range': max(scores) - min(scores) if scores else 0, 'all_scores_low': all(s < 0.5 for s in scores), } if diagnostics['retrieval_analysis']['all_scores_low']: diagnostics['suggestions'].append("All scores low; query may be out-of-domain") diagnostics['suggestions'].append("Check if relevant documents exist in corpus") if diagnostics['retrieval_analysis']['score_range'] < 0.1: diagnostics['suggestions'].append("Scores very similar; retriever not discriminating") diagnostics['suggestions'].append("Consider reranking or different embedding model") # Check term overlap query_terms = set(query.lower().split()) for i, doc in enumerate(retrieved_docs[:3]): doc_terms = set(doc['text'].lower().split()) overlap = query_terms & doc_terms diagnostics[f'doc_{i}_overlap'] = len(overlap) return diagnostics ``` **Scenario 3: "Output quality is inconsistent"** Symptoms: Same type of query produces varying quality responses. 
Diagnostic approach: ```python def diagnose_inconsistency( prompt: str, num_samples: int = 10 ) -> dict: """Diagnose output consistency issues.""" responses = [] for _ in range(num_samples): response = llm.generate(prompt, temperature=0.7) responses.append(response) # Cluster responses by similarity embeddings = [embed(r) for r in responses] similarities = compute_pairwise_similarity(embeddings) diagnostics = { 'sample_count': num_samples, 'unique_responses': len(set(responses)), 'avg_similarity': float(similarities.mean()), 'min_similarity': float(similarities.min()), 'similarity_std': float(similarities.std()), } # Compare with temperature=0 deterministic = llm.generate(prompt, temperature=0) diagnostics['deterministic_response'] = deterministic diagnostics['deterministic_in_samples'] = deterministic in responses # Suggestions based on findings if diagnostics['avg_similarity'] < 0.7: diagnostics['suggestions'] = [ "High variance in outputs", "Consider: lower temperature, more specific instructions, few-shot examples", ] elif diagnostics['unique_responses'] == 1: diagnostics['suggestions'] = [ "All responses identical despite temperature>0", "Prompt may be overly constrained or model very confident", ] return diagnostics ``` ### Production Debugging: Working with Incomplete Information In production, you often can't reproduce issues exactly. You have logs, but the exact conditions may be unreproducible due to: - Model updates by the provider - Non-deterministic sampling - State that wasn't logged **Strategies for production debugging:** 1. 
**Pattern analysis**: Look for commonalities across failures ```python def analyze_failure_patterns(failures: list[dict]) -> dict: """Find patterns in logged failures.""" patterns = { 'by_user_type': Counter(), 'by_prompt_length': [], 'by_time_of_day': Counter(), 'common_keywords': Counter(), } for f in failures: patterns['by_prompt_length'].append(len(f['prompt'])) patterns['by_time_of_day'][f['timestamp'].hour] += 1 for word in f['user_message'].split(): patterns['common_keywords'][word.lower()] += 1 return { 'prompt_length_correlation': analyze_correlation(patterns['by_prompt_length']), 'peak_failure_hours': patterns['by_time_of_day'].most_common(3), 'keyword_patterns': patterns['common_keywords'].most_common(10), } ``` 2. **Reproduction scripts**: Generate scripts from logs to attempt reproduction ```python def generate_reproduction_script(log_entry: dict) -> str: """Generate runnable script from log entry.""" return f''' # Reproduction script for request {log_entry["request_id"]} # Original timestamp: {log_entry["timestamp"]} # Original outcome: {log_entry.get("user_feedback", "unknown")} from your_app import LLMClient client = LLMClient(model="{log_entry["model"]}") prompt = """{log_entry["full_prompt"]}""" # Try to reproduce for i in range(5): response = client.generate(prompt, temperature={log_entry["temperature"]}) print(f"Attempt {{i+1}}: {{response[:200]}}...") ''' ``` 3. **Canary testing**: Deploy fixes to a small percentage of traffic first ```python def route_request(request, fix_enabled: bool) -> str: """Route requests with canary for new fix.""" if fix_enabled and random.random() < 0.05: # 5% canary return process_with_fix(request) return process_standard(request) ``` --- ## Caching Strategies for LLM Applications ### When to Cache (and When Not To) Caching is more complex for LLM applications than traditional backends due to the nature of LLM inputs and outputs. 
**Decision framework for LLM caching:**

![LLM Caching Decision Tree](../assets/diagrams/rendered/ch12_caching_decision.svg)

**Exact match caching** works when:

- Temperature = 0 (deterministic)
- Prompts are identical (including system prompt, examples, etc.)
- Model version is controlled
- Freshness requirements are met

**Semantic caching** (caching similar queries) is risky because:

- "Similar" queries may require different answers
- Embedding similarity doesn't guarantee answer transferability
- Edge cases can produce very wrong cached results

### Implementing a Safe LLM Cache

```python
import hashlib
import json
from datetime import datetime, timezone


class LLMCache:
    """Cache layer for LLM responses with safety checks.

    Assumes redis_client is an async Redis client (for example redis.asyncio).
    """

    def __init__(
        self,
        redis_client,
        ttl_seconds: int = 3600,
        enable_semantic: bool = False,
        semantic_threshold: float = 0.98  # Very high threshold if enabled
    ):
        self.redis = redis_client
        self.ttl = ttl_seconds
        self.enable_semantic = enable_semantic
        self.semantic_threshold = semantic_threshold

    def cache_key(
        self,
        prompt: str,
        model: str,
        temperature: float,
        system_prompt: str = ""
    ) -> str | None:
        """Generate cache key if caching is appropriate."""
        # Never cache sampled (temperature > 0) requests unless semantic
        # caching has been explicitly enabled
        if temperature > 0 and not self.enable_semantic:
            return None

        # Include all inputs that affect output, temperature included
        cache_input = f"{model}:{temperature}:{system_prompt}:{prompt}"
        return f"llm:{hashlib.sha256(cache_input.encode()).hexdigest()}"

    async def get(self, key: str) -> dict | None:
        """Retrieve cached response with metadata."""
        if not key:
            return None

        cached = await self.redis.get(key)
        if cached:
            data = json.loads(cached)
            # Track cache hits for monitoring
            await self.redis.incr(f"cache_hits:{key[:20]}")
            return data
        return None

    async def set(
        self,
        key: str,
        response: str,
        metadata: dict | None = None
    ) -> None:
        """Cache response with metadata."""
        if not key:
            return

        data = {
            'response': response,
            'cached_at': datetime.now(timezone.utc).isoformat(),
            'metadata': metadata or {}
        }
        await self.redis.setex(key, self.ttl, json.dumps(data))
```

### Cache Invalidation Strategies

LLM caches face unique invalidation challenges:

1. **Model updates**: Provider updates models without notice
   - Strategy: Include model version in cache key when possible
   - Strategy: Set shorter TTLs to naturally expire stale entries
2. **Prompt changes**: Even small prompt changes should invalidate
   - Strategy: Hash complete prompt (including system prompt)
   - Strategy: Version prompts explicitly in cache key
3. **Knowledge freshness**: RAG responses depend on document corpus
   - Strategy: Include corpus version or document hashes in cache key
   - Strategy: Invalidate cache when documents are updated

```python
class RAGAwareCache(LLMCache):
    """Cache that accounts for RAG document freshness."""

    def cache_key(
        self,
        prompt: str,
        model: str,
        temperature: float,
        retrieved_doc_ids: list[str],
        system_prompt: str = ""
    ) -> str | None:
        # Include document IDs in cache key
        # Cache is only valid for this exact retrieval result
        # (IDs are joined with a separator so ["ab", "c"] and ["a", "bc"] differ)
        doc_hash = hashlib.md5(
            "|".join(sorted(retrieved_doc_ids)).encode()
        ).hexdigest()[:8]

        base_key = super().cache_key(prompt, model, temperature, system_prompt)
        if base_key:
            return f"{base_key}:docs:{doc_hash}"
        return None
```

---

## Data Pipeline Architecture

### Training Data Pipelines

If you're fine-tuning models, training data quality determines model quality.
"Garbage in, garbage out" is especially true for LLMs because: - Models learn patterns from examples, including unintended patterns - Small amounts of bad data can disproportionately affect behavior - Data issues manifest as subtle quality problems, not obvious errors **Training data pipeline stages:** ![Training Data Pipeline](../assets/diagrams/rendered/ch12_training_data_pipeline.svg) **Key quality gates:** ```python class TrainingDataValidator: """Validate training examples before fine-tuning.""" def validate_example(self, example: dict) -> tuple[bool, list[str]]: """Check single example, return (valid, issues).""" issues = [] # Structure check if 'messages' not in example: issues.append("Missing 'messages' key") return False, issues messages = example['messages'] # Role sequence check roles = [m.get('role') for m in messages] if roles[-1] != 'assistant': issues.append("Last message must be from assistant") # Empty content check for i, msg in enumerate(messages): if not msg.get('content', '').strip(): issues.append(f"Empty content in message {i}") # Token limit check total_tokens = sum( count_tokens(m.get('content', '')) for m in messages ) if total_tokens > 4096: issues.append(f"Example exceeds token limit: {total_tokens}") # Quality heuristics assistant_msgs = [m for m in messages if m['role'] == 'assistant'] for msg in assistant_msgs: content = msg['content'] if len(content) < 10: issues.append("Assistant response very short") if content.startswith("I cannot") or content.startswith("I can't"): issues.append("Assistant refusal - review if intentional") return len(issues) == 0, issues ``` ### Inference Data Pipelines (RAG Ingestion) For RAG systems, document ingestion quality directly affects retrieval quality. The pipeline must handle: - Multiple document formats (PDF, HTML, DOCX, etc.) 
- Extraction errors and partial failures
- Chunking with semantic coherence
- Embedding at scale
- Incremental updates

**Robust ingestion pipeline:**

```python
from datetime import datetime, timezone


# SemanticChunker, ExtractionError, and EmbeddingError are assumed to be
# defined elsewhere, as are the extract/embed_batch/store helper methods.
class DocumentIngestionPipeline:
    """Production-grade document ingestion for RAG."""

    def __init__(
        self,
        vector_store,
        embedding_model,
        chunk_size: int = 512,
        chunk_overlap: int = 50
    ):
        self.vector_store = vector_store
        self.embedder = embedding_model
        self.chunker = SemanticChunker(chunk_size, chunk_overlap)

    async def ingest_document(
        self,
        doc_path: str,
        metadata: dict | None = None
    ) -> dict:
        """Ingest single document with error handling."""
        result = {
            'doc_path': doc_path,
            'status': 'pending',
            'chunks_indexed': 0,
            'errors': []
        }

        try:
            # Extract text
            text, extracted_metadata = await self.extract(doc_path)

            if not text.strip():
                result['status'] = 'failed'
                result['errors'].append("No text extracted")
                return result

            # Chunk
            chunks = self.chunker.chunk(text)

            # Embed in batches
            embeddings = await self.embed_batch(chunks, batch_size=32)

            # Store with metadata (timezone-aware UTC timestamp)
            combined_metadata = {
                **(metadata or {}),
                **extracted_metadata,
                'ingested_at': datetime.now(timezone.utc).isoformat()
            }

            chunk_ids = await self.store(
                chunks, embeddings, combined_metadata
            )

            result['status'] = 'success'
            result['chunks_indexed'] = len(chunk_ids)
            result['chunk_ids'] = chunk_ids

        except ExtractionError as e:
            result['status'] = 'failed'
            result['errors'].append(f"Extraction failed: {e}")
        except EmbeddingError as e:
            # Extraction succeeded, but no chunks were indexed
            result['status'] = 'partial'
            result['errors'].append(f"Embedding failed: {e}")

        return result
```

---

## Integration Patterns

### Circuit Breaker for LLM APIs

LLM APIs can fail in various ways: timeouts, rate limits, server errors, degraded quality. The circuit breaker pattern prevents cascading failures.
**Why circuit breakers matter more for LLMs:**

- LLM calls are expensive (time and money)
- Retrying a failing LLM API burns budget
- User experience degrades rapidly when LLM latency increases
- Provider outages can last minutes to hours

```python
from enum import Enum
from datetime import datetime, timedelta


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Blocking calls
    HALF_OPEN = "half_open"  # Testing recovery


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected."""

    def __init__(self, message: str, retry_after: timedelta):
        super().__init__(message)
        self.retry_after = retry_after


class LLMCircuitBreaker:
    """Protect against LLM API failures."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: timedelta = timedelta(seconds=60),
        half_open_max_calls: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self.state = CircuitState.CLOSED
        self.failures = 0
        self.last_failure_time = None
        self.half_open_successes = 0

    async def call(self, llm_func, *args, **kwargs):
        """Execute LLM call with circuit breaker protection."""
        if self.state == CircuitState.OPEN:
            if self._should_attempt_recovery():
                self.state = CircuitState.HALF_OPEN
                self.half_open_successes = 0
            else:
                raise CircuitOpenError(
                    "LLM circuit breaker open",
                    retry_after=self._time_until_recovery()
                )

        try:
            result = await llm_func(*args, **kwargs)
            self._record_success()
            return result
        # APIError and RateLimitError stand in for your provider SDK's exceptions
        except (TimeoutError, APIError, RateLimitError):
            self._record_failure()
            raise

    def _time_until_recovery(self) -> timedelta:
        # last_failure_time is always set before the breaker opens
        elapsed = datetime.now() - self.last_failure_time
        return max(self.recovery_timeout - elapsed, timedelta(0))

    def _should_attempt_recovery(self) -> bool:
        return self._time_until_recovery() == timedelta(0)

    def _record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max_calls:
                self.state = CircuitState.CLOSED
                self.failures = 0
        elif self.state == CircuitState.CLOSED:
            self.failures = 0  # Reset on success

    def _record_failure(self):
        self.failures += 1
        self.last_failure_time = datetime.now()

        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN  # Recovery failed
        elif self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
```

### Retry Strategies for LLM Calls

Retries for LLM calls
require special consideration:

1. **Retrying may produce different results**: Unlike a traditional API call, a retried LLM call may succeed with different output
2. **Cost accumulates**: Each retry costs tokens
3. **Some errors shouldn't be retried**: Content policy violations won't succeed on retry

```python
import asyncio
import random


# LLMError is assumed to be defined elsewhere and to expose `error_code`
# (and optionally `retry_after`, in seconds).
class LLMRetryStrategy:
    """Intelligent retry for LLM calls."""

    # Errors that should not be retried
    NON_RETRYABLE = {
        'content_policy_violation',
        'invalid_api_key',
        'model_not_found',
        'context_length_exceeded',
    }

    def __init__(
        self,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        exponential_base: float = 2.0
    ):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.exponential_base = exponential_base

    async def execute(
        self,
        llm_func,
        *args,
        **kwargs
    ):
        """Execute with retry logic."""
        last_exception = None

        for attempt in range(self.max_retries + 1):
            try:
                return await llm_func(*args, **kwargs)
            except LLMError as e:
                last_exception = e

                if e.error_code in self.NON_RETRYABLE:
                    raise  # Don't retry

                if attempt < self.max_retries:
                    delay = self._calculate_delay(attempt, e)
                    await asyncio.sleep(delay)

        raise last_exception

    def _calculate_delay(self, attempt: int, error: LLMError) -> float:
        # Exponential backoff
        delay = self.base_delay * (self.exponential_base ** attempt)

        # Add jitter to prevent thundering herd, then cap the backoff
        delay *= (0.5 + random.random())
        delay = min(delay, self.max_delay)

        # Rate limit errors should respect the retry-after header.
        # Applied last so jitter and the cap never cut the wait short.
        retry_after = getattr(error, 'retry_after', None)
        if retry_after:
            delay = max(delay, retry_after)

        return delay
```

### Graceful Degradation

When LLM services fail, have fallback strategies:

```python
class DegradingLLMService:
    """LLM service with graceful degradation."""

    def __init__(self):
        self.primary_model = "claude-opus-4-8"
        self.fallback_models = ["claude-sonnet-4-6", "claude-haiku-4-5"]
        self.cache = LLMCache()

    async def generate(self, prompt: str, **kwargs) -> dict:
        """Generate with fallback chain."""

        # Try cache first
        cached = await self.cache.get(prompt, kwargs)
        if cached:
            return {'response': cached, 'source': 'cache'}

        # Try primary model
        try:
            response = await self.call_model(self.primary_model, prompt, **kwargs)
            return {'response': response, 'source': 'primary'}
        except LLMError:
            pass  # Fall through to fallbacks

        # Try fallback models
        for model in self.fallback_models:
            try:
                response = await self.call_model(model, prompt, **kwargs)
                return {
                    'response': response,
                    'source': f'fallback:{model}',
                    'degraded': True
                }
            except LLMError:
                continue

        # All models failed - return graceful error
        return {
            'response': "I'm sorry, I'm having trouble processing your request right now. Please try again in a moment.",
            'source': 'error_fallback',
            'degraded': True,
            'failed': True
        }
```

---

## Case Study: Production LLM Backend at Scale

### Context

A document intelligence company processes millions of documents monthly, extracting structured data and answering questions. Their architecture evolved through several iterations as they scaled.
### Architecture Evolution

**Phase 1: Synchronous monolith**

- Single service handled everything
- LLM calls blocked request threads
- Worked fine until 100 concurrent users

Problems encountered:

- Thread pool exhaustion from long LLM calls
- No isolation between fast and slow requests
- Single LLM provider outage took down everything

**Phase 2: Async with queue separation**

- Split into sync API (fast responses) and async workers (LLM calls)
- Job queue for document processing
- Results stored in database, polled by clients

```
Client → API Gateway → {
    Fast requests → Sync Service → Cache/DB
    Slow requests → Queue → Workers → LLM APIs → Results DB
}
```

Problems encountered:

- Queue backup during LLM provider issues
- Difficult to estimate completion times
- No streaming for interactive use cases

**Phase 3: Streaming with circuit breakers**

- Streaming responses for interactive queries
- Circuit breakers per LLM provider
- Automatic fallback to cheaper models under load

Key patterns:

```python
# Request routing based on complexity and latency requirements
class RequestRouter:
    def route(self, request) -> str:
        if request.requires_streaming:
            return "streaming_service"
        elif request.estimated_tokens > 10000:
            return "batch_queue"
        elif self.is_cache_candidate(request):
            return "cache_service"
        else:
            return "sync_service"
```

### Lessons Learned

1. **Token estimation is critical**: Estimating prompt and completion tokens allows better routing, cost prediction, and capacity planning.

2. **Model fallback is table stakes**: Primary model unavailability happens. Design for it from day one.

3. **Streaming changes everything**: Once you support streaming, you need streaming-aware caching, logging, and error handling.

4. **Observability is different**: Track time-to-first-token, tokens-per-second, and completion rates, not just latency.

5. **Cost monitoring must be real-time**: Token costs can spike quickly. Monitor and alert on cost, not just errors.
---

## Decision Frameworks

### Sync vs Async Architecture

![Sync vs Async Decision Framework](../assets/diagrams/rendered/ch12_sync_async_decision.svg)

### Testing Strategy Selection

![Testing Strategy Selection](../assets/diagrams/rendered/ch12_testing_strategy.svg)

### Caching Decision Matrix

| Scenario | Cache? | Strategy |
|----------|--------|----------|
| Temperature = 0, exact prompt match | Yes | Exact match cache |
| Temperature > 0, exact prompt match | Risky | Cache only if variation OK |
| Similar prompts, same intent | Risky | Only with very high similarity threshold |
| RAG with changing documents | Complex | Include doc versions in key |
| User-specific personalization | No | Different users need different responses |
| High-cost queries, stable inputs | Yes | Longer TTL, careful invalidation |

---

## Key Takeaways

1. **Design for variable latency and streaming** - LLM calls take seconds to minutes. Async patterns are mandatory for scale. Implement streaming for long outputs to improve perceived responsiveness.

2. **Embrace statistical testing over exact assertions** - LLM outputs are non-deterministic with many valid variations. Use property-based tests, statistical significance, and LLM-as-judge for quality evaluation.

3. **Log comprehensively for debugging** - Non-reproducible issues require pattern analysis. Log complete prompts, responses, latency, and tokens. Build tools for pattern analysis, not just log search.

4. **Cache conservatively** - Exact-match caching for temperature=0 is safe. Include model version and prompt template version in cache keys. Semantic caching is risky and requires careful validation.

5. **Implement circuit breakers and fallback chains** - LLMs have unique failure modes. Design fallback hierarchies: primary model, fallback model, cached response, graceful error message.
---

## Summary

Building backends for LLM applications requires rethinking traditional patterns:

**Architecture**: Design for variable latency, streaming, and graceful degradation. LLM calls can take seconds to minutes, so async patterns are not optional for scale.

**Testing**: Embrace statistical testing. Exact assertions rarely make sense for LLM outputs. Build testing pyramids with different strategies at different layers.

**Debugging**: Log comprehensively, analyze patterns, and accept that some issues won't reproduce exactly. Build tools for pattern analysis, not just log search.

**Caching**: Be conservative. Exact-match caching for deterministic calls is safe. Semantic caching is risky and requires careful thought.

**Integration**: Implement circuit breakers and retry strategies that account for LLM-specific failure modes. Have fallback strategies for when primary models are unavailable.

The key mindset shift: **you're building systems around probabilistic components**. This doesn't mean abandoning engineering rigor; it means applying that rigor with appropriate tools and expectations.

### Connections to Other Chapters

- **Chapter 5 (LLM Foundations)** explains why LLMs are non-deterministic; understanding this is essential for debugging
- **Chapter 6 (Prompt Engineering)** covers prompt design that affects testability and consistency
- **Chapter 7 (RAG Systems)** uses the data pipelines and caching strategies discussed here
- **Chapter 9 (Deployment)** covers infrastructure for serving LLM applications
- **Chapter 15 (MLOps & Evaluation)** extends testing into continuous evaluation and monitoring

---

## Practical Exercises

1. **Implement a statistical test suite**: Build tests for a classification prompt that:
   - Run 100 test cases
   - Calculate accuracy with confidence intervals
   - Fail if lower confidence bound drops below 90%
   - Report per-category accuracy breakdown

2. **Build a debugging toolkit**: Create a system that:
   - Logs complete LLM interactions (prompt, response, latency, tokens)
   - Generates reproduction scripts from log entries
   - Identifies patterns in failed requests (common keywords, time patterns)

3. **Implement caching with invalidation**: Build a cache layer that:
   - Caches temperature=0 responses
   - Includes model version in cache key
   - Invalidates on prompt template changes
   - Tracks cache hit rates

4. **Design a graceful degradation system**: Implement:
   - Circuit breaker for LLM API calls
   - Fallback chain (primary → fallback model → cached response → error message)
   - Metrics to track degradation events

5. **Build a training data pipeline**: Create a pipeline that:
   - Extracts conversation data from logs
   - Validates format and quality
   - Deduplicates
   - Splits into train/val/test
   - Versions the output

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [IC2]** Why do traditional unit tests (assert expected == actual) work poorly for LLM outputs? What testing approach works better?

<details>
<summary>Answer</summary>

LLM outputs are non-deterministic and have many valid variations. "The capital of France is Paris" and "Paris is France's capital city" are both correct but don't match exactly. Better approaches: (1) Property-based testing: check that outputs have required properties (contains "Paris", is valid JSON, length within range). (2) Statistical testing: run N times, check pass rate meets threshold with confidence interval. (3) LLM-as-judge: use another model to evaluate semantic correctness. (4) Behavioral testing: test the system's behavior, not exact outputs.

</details>

**Q2. [IC2]** A cache stores LLM responses keyed by prompt hash. What problems might arise, and how do you address them?

<details>
<summary>Answer</summary>

Problems: (1) Model version changes: same prompt, different model = different output. Include model version in cache key.
(2) Temperature > 0: non-deterministic outputs shouldn't be cached, or the cache becomes a lottery. Only cache temp=0. (3) Stale data: if answers reference time-sensitive info, cache invalidation is needed. Add a TTL. (4) Prompt template changes: code changes the prompt format but the cache has old responses. Include prompt template version in key. (5) Memory growth: the cache is unbounded. Implement LRU or size limits. (6) Cache poisoning: a bad response gets cached. Validate before caching or allow manual invalidation.

</details>

**Q3. [Senior]** Explain the difference between full fine-tuning and LoRA. When would you choose each?

<details>
<summary>Answer</summary>

Full fine-tuning updates all model parameters: it is computationally expensive, typically needs multiple GPUs, and creates a completely new model. LoRA adds small trainable low-rank matrices to a frozen base model: the trainable part is much smaller (0.1-1% of parameters), can often train on a single GPU, and yields a small adapter that combines with the base model. Choose full fine-tuning when: you have massive data, need maximum quality, have compute budget. Choose LoRA when: limited compute, want to maintain base model benefits, need to deploy multiple specialized versions efficiently, training data is moderate-sized. Most practical scenarios favor LoRA due to cost and flexibility.

</details>

**Q4. [Senior]** How do you debug an LLM application where users report "wrong answers" but you can't reproduce the issues?

<details>
<summary>Answer</summary>

Systematic approach: (1) Comprehensive logging: log full prompt, model response, all parameters, timestamps. Without logs, you're blind. (2) Reproduction system: ability to replay exact request with same inputs. (3) Check for non-determinism: temperature setting, model version changes, A/B tests. (4) User context: what were the previous turns? Context matters. (5) Timing: did it work before? Check for model updates, prompt changes, data updates. (6) Pattern analysis: are failures correlated (time of day, user segment, input length, topic)?
(7) Diff analysis: compare working vs. failing cases, find distinguishing features. Prevention: Build observability in from day one.

</details>

**Q5. [Staff]** You're designing the backend architecture for an LLM application expecting 10M requests/day with strict latency requirements. What are the key architectural decisions?

<details>
<summary>Answer</summary>

Key decisions: (1) Caching strategy: at 10M requests, even a 20% cache hit rate saves 2M API calls/day. Cache exact matches aggressively; add semantic similarity for near-matches only with a very high similarity threshold and validation. (2) Request routing: route simple queries to faster/cheaper models, complex to capable models. (3) Async processing: queue non-urgent requests, batch for efficiency. (4) Rate limiting & backpressure: protect against traffic spikes, graceful degradation. (5) Redundancy: multiple LLM providers, failover capability. (6) Edge caching: for global users, cache at edge for common queries. (7) Streaming: for latency perception, stream responses. (8) Cost controls: budget enforcement, alerting. (9) Observability: distributed tracing, metrics at every stage. Architecture: Queue → Router → Cache check → LLM pool → Response processor → Cache write.

</details>

### Spot the Problem

**Problem 1. [IC2]** Test code for an LLM feature:

```python
def test_summarization():
    result = summarize("Long article text here...")
    assert result == "Expected summary of the article."
```

<details>
<summary>Answer</summary>

Problems: (1) Exact string match will fail: LLM outputs vary in wording even with temp=0. (2) Single test case: doesn't establish statistical confidence. (3) No property checks: doesn't verify summary is actually shorter, contains key points, etc. Better: `assert len(result) < len(article) * 0.3` (compression), `assert "key_topic" in result.lower()` (coverage), run multiple test cases with quality scoring.

</details>

**Problem 2.
[Senior]** Error handling pattern:

```python
async def call_llm(prompt: str) -> str:
    try:
        return await client.generate(prompt)
    except Exception:
        return "An error occurred. Please try again."
```

<details>
<summary>Answer</summary>

Problems: (1) Catches all exceptions: hides different failure modes (rate limit, auth, timeout, content filter). (2) No logging: failures are invisible to monitoring. (3) Generic message: user can't distinguish transient vs. permanent failure. (4) No retry: transient failures should retry automatically. (5) No circuit breaker: if service is down, keeps trying. (6) Silent failure: downstream code may not realize it got an error, not a real response. Better: Specific exception handling, structured logging, retry with backoff for transient errors, circuit breaker, clear error types returned to caller.

</details>

**Problem 3. [Staff]** Fine-tuning approach:

```
"We fine-tuned on 50K examples from production logs. Training loss decreased nicely. We deployed it."
```

<details>
<summary>Answer</summary>

Problems: (1) No validation set: can't detect overfitting. Training loss means nothing without val loss. (2) No quality evaluation: loss decreasing doesn't mean outputs improved for your task. (3) Logs may have bias: production logs reflect what model already does, may not improve weaknesses. (4) No A/B test: deployed without comparing to baseline. (5) Data quality unknown: were those 50K examples actually good? PII issues? (6) No held-out test set: can't measure true performance. Process should include: train/val/test split, task-specific evaluation metrics, comparison to baseline, staged rollout with monitoring.

</details>

### Design Exercises

**Exercise 1. [Senior]** Design a test strategy for a RAG-powered Q&A system. The system retrieves from 100K documents and generates answers. Cover: unit tests, integration tests, evaluation metrics, regression testing, and CI/CD integration.

<details>
<summary>Guidance</summary>

Unit tests: (1) Chunking: verify chunks are correct size, overlap works. (2) Embedding: mock embedding service, verify vectors have expected dimensions. (3) Retrieval: test with known documents, verify ranking. (4) Generation: property tests on output format.

Integration tests: (1) End-to-end with small doc set: known questions with known answers. (2) Latency tests: verify P95 within bounds. (3) Error handling: test behavior when retrieval fails, LLM fails.

Evaluation metrics: (1) Retrieval recall@k: are relevant docs retrieved? (2) Answer quality: LLM-as-judge for faithfulness, relevance. (3) Factual accuracy on golden set.

Regression: Golden set of critical Q&A pairs, must pass 100%. Run on every PR.

CI/CD: Fast unit tests on every commit, integration tests on PR, full evaluation nightly.

</details>

**Exercise 2. [Staff]** Design a system for managing training data for fine-tuning across multiple LLM applications. Requirements: version control, quality validation, deduplication, PII detection, and the ability to create task-specific datasets from a common pool.

<details>
<summary>Guidance</summary>

Architecture: (1) Raw data lake: all collected data, immutable, versioned (S3 + metadata DB). (2) Processing pipeline: PII detection/redaction, quality scoring, deduplication. (3) Curated datasets: versioned, task-specific, derived from pool. (4) Lineage tracking: which raw examples went into which datasets.

Components: (1) Ingestion: from logs, human annotation, synthetic generation. Schema validation on entry. (2) PII detection: regex patterns, NER models, human review for high-risk. (3) Quality scoring: automated (length, formatting) + sampled human review. (4) Deduplication: embedding similarity, exact match, near-duplicate detection. (5) Dataset builder: queries pool by criteria, creates versioned snapshot. (6) Export: formats for different training frameworks.
Governance: Access controls, audit logging, retention policies, deletion capability for GDPR.

</details>

---

## Further Reading

### Essential

- **Kwon et al. (2023), "Efficient Memory Management for Large Language Model Serving with PagedAttention"** - The paper behind vLLM's memory efficiency. Essential for understanding inference optimization.
- **Kleppmann (2017), "Designing Data-Intensive Applications"** - Foundation for data pipeline design. Backend engineers' bible.
- **Nygard (2018), "Release It!"** - Circuit breakers, bulkheads, and stability patterns that apply directly to LLM services.

### Deep Dives

- **Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models"** - Parameter-efficient fine-tuning foundation. Critical if you're fine-tuning.
- **Zheng et al. (2023), "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"** - Best practices for using LLMs as evaluators.

### Practical Resources

- **OpenAI Cookbook** - Production patterns for building with LLM APIs.
- **Anthropic's Production Best Practices** - Guidelines for reliable Claude integration.