Chapter 14: Backend Engineering for AI
testing, debugging, integration, API design, async patterns, caching, streaming, error handling
Introduction
In late 2023, a fintech startup deployed their first LLM-powered feature: an intelligent assistant that helped users understand their spending patterns. The engineering team had built many successful backend systems before—payment processors handling thousands of transactions per second, real-time fraud detection, complex financial reporting pipelines. They applied their usual practices: comprehensive unit tests, deterministic integration tests, and monitoring dashboards tracking latency percentiles.
Within two weeks, they were drowning in bug reports they couldn’t reproduce. Users complained the assistant was “wrong” but the logs showed the system working correctly. Tests passed in CI but failed intermittently in production. Latency varied wildly—the same query taking 500ms one moment and 8 seconds the next. Most confusingly, a prompt that worked perfectly for weeks suddenly started producing garbled outputs after a model provider update they hadn’t even noticed.
The team had encountered a fundamental truth: LLM backends are not traditional backends. The patterns that served them well for deterministic systems—exact assertion testing, reproducible debugging, predictable performance—needed rethinking for systems whose core component is a black-box probabilistic function.
This chapter bridges the gap between traditional backend engineering and the unique demands of LLM applications. You’ll learn not just the techniques, but the underlying principles that explain why LLM backends behave differently and how to design systems that embrace rather than fight this reality.
The Core Insight: Probabilistic Systems Require Probabilistic Engineering
Traditional backend engineering operates in a world of determinism. Given the same inputs and state, the same code produces the same outputs. This assumption underlies everything: we write unit tests expecting exact matches, debug by reproducing issues, and monitor for binary outcomes (success or failure).
LLM applications violate this assumption at their core. A language model is a probabilistic function—the same prompt can produce different outputs due to:
- Sampling temperature: Non-zero temperature introduces intentional randomness
- Infrastructure variation: Different GPU instances may produce slightly different floating-point results
- Model versioning: Providers update models without explicit notification
- Context sensitivity: Slight prompt variations can dramatically shift outputs
- Batching effects: Some providers batch requests in ways that affect generation
This isn’t a bug to be fixed; it’s the nature of the technology. The engineering challenge is building robust systems atop this probabilistic foundation.
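To make the first source of variation concrete, here is a minimal sketch of temperature sampling over a toy three-token distribution. The logits, the label names, and the helper functions are illustrative assumptions, not any provider's implementation.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Draw one token from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy logits for a single decoding step (assumed values).
toy_logits = {"shipping_issue": 2.0, "billing_issue": 1.0, "other": 0.0}
rng = random.Random(0)
greedy = Counter(sample_token(toy_logits, 0.0, rng) for _ in range(200))
sampled = Counter(sample_token(toy_logits, 1.0, rng) for _ in range(200))
print(greedy)
print(sampled)
```

At temperature 0 every draw picks the top token; at temperature 1 the same logits yield a mix of labels. The other sources in the list can still change outputs even when sampling is greedy.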
What Makes LLM Backends Different
Consider how traditional backend concerns transform for LLM applications:
| Traditional Backend | LLM Backend |
|---|---|
| Latency is predictable | Latency varies 10x depending on output length |
| Errors are binary (works/fails) | “Wrong” exists on a spectrum |
| Tests use exact assertions | Tests use statistical assertions |
| Debugging reproduces the issue | Debugging analyzes distributions |
| Caching is straightforward | Caching requires semantic similarity |
| Retry logic is simple | Retry may produce different (better or worse) results |
| Load testing scales linearly | Token-based pricing creates non-linear costs |
Understanding these differences is the foundation for effective LLM backend engineering.
What people do: Build LLM backends using the exact same patterns as deterministic services: exact-match caching, binary success/fail monitoring, retry-until-success logic, unit tests with assertEqual.
Why it fails: LLM outputs are probabilistic. Exact-match caching misses semantically equivalent queries. Binary monitoring misses “technically successful but wrong” responses. Retries might produce different (possibly worse) results. Exact assertions fail randomly.
Fix: Adapt each pattern to probabilistic reality. Use semantic similarity for cache lookups. Monitor response quality, not just HTTP status. Implement smart retry logic that detects whether retry improved results. Use statistical assertions or LLM-as-judge for testing. Accept that some techniques (like deterministic debugging) need complete rethinking.
What You’ll Learn
- Architectural foundations for LLM applications: async patterns, connection pooling, streaming
- Why testing LLMs is hard and how to build statistical test frameworks
- Debugging strategies for non-deterministic systems
- Caching strategies: when semantic caching helps and when it hurts
- Data pipeline design for training and inference
- Decision frameworks for key architectural choices
Prerequisites
- Understanding of LLM fundamentals (Chapter 5)
- Familiarity with prompt engineering (Chapter 6)
- Experience with backend development (APIs, databases, async programming)
- Python programming
Foundations of LLM Application Architecture
The Anatomy of an LLM Backend
Before diving into specific techniques, let’s establish a mental model for LLM backend architecture. Every LLM application shares common structural elements, though implementations vary.
Request flow for a typical query:
1. Request arrives at API gateway (authentication, rate limiting)
2. Cache check: have we seen this exact request before?
3. Router decides: synchronous response or queue for async processing?
4. Orchestrator assembles context (retrieval, tool calls, conversation history)
5. LLM API call with constructed prompt
6. Post-processing (format validation, safety checks)
7. Response with observability logging
Each component has design decisions that differ from traditional backends due to LLM characteristics.
Async Patterns for Streaming
LLM generation is inherently sequential—tokens are produced one at a time. For long responses, waiting for complete generation creates poor user experience. Streaming delivers tokens as they’re generated, improving perceived latency dramatically.
The key insight: streaming fundamentally changes your architecture. It’s not just a performance optimization; it affects API design, error handling, caching, and observability.
# The streaming mental model: think of LLM output as a generator, not a return value
from typing import AsyncIterator
async def stream_response(prompt: str) -> AsyncIterator[str]:
    """Stream tokens as they're generated."""
    async for chunk in llm_client.stream(prompt):
        # Each chunk contains one or more tokens
        yield chunk.content
        # You can process tokens incrementally
        # - Detect early termination signals
        # - Update progress indicators
        # - Accumulate for logging
# Consumer handles the stream
async def handle_request(request):
    buffer = []
    async for token in stream_response(request.prompt):
        buffer.append(token)
        await websocket.send(token)  # Real-time delivery
    # Post-completion processing
    full_response = "".join(buffer)
    await log_interaction(request, full_response)
Architectural implications of streaming:
Error handling becomes complex: An error mid-stream means partial content was already sent to the user. You need strategies for this:
- Sentinel tokens to signal errors in the stream
- Client-side buffer with commit semantics
- Graceful degradation with “generation interrupted” messages
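The first and third strategies can be combined by wrapping the token stream in typed events, so the client can always tell whether a response finished or was cut off. The sketch below is one possible shape; the event dictionary format and the `flaky_upstream` stand-in are assumptions made for illustration.

```python
import asyncio
from typing import AsyncIterator


async def flaky_upstream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a provider stream that fails mid-generation."""
    yield "Your spending "
    yield "rose in "
    raise ConnectionError("upstream reset")


async def stream_with_sentinels(tokens: AsyncIterator[str]) -> AsyncIterator[dict]:
    """Wrap a token stream in typed events so clients can tell how it ended."""
    try:
        async for token in tokens:
            yield {"type": "token", "data": token}
    except Exception as exc:
        # Partial content has already reached the client; mark the stream as incomplete.
        yield {"type": "error", "data": f"generation interrupted: {exc}"}
        return
    yield {"type": "done", "data": ""}


async def collect(events: AsyncIterator[dict]) -> list[dict]:
    """Drain an event stream into a list (used here only to inspect the result)."""
    return [event async for event in events]


events = asyncio.run(collect(stream_with_sentinels(flaky_upstream())))
for event in events:
    print(event)
```

A client that buffers tokens until it sees a `done` event gets commit semantics; a client that renders tokens immediately can show a "generation interrupted" notice when it sees `error`.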
Caching requires decisions: Do you cache the complete response after streaming finishes? Do you cache per-token for replay? The answer depends on your use case.
Observability changes: You can’t log “response” as a single event. You need streaming-aware metrics: time-to-first-token, tokens-per-second, stream completion rate.
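A pass-through wrapper is one way to collect these metrics without changing the consumer. This is a sketch under assumptions: the `StreamMetrics` fields and `fake_stream` are invented for the example, and it counts chunks rather than tokens because a chunk may carry several tokens (a real tokens-per-second figure needs a tokenizer count).

```python
import asyncio
import time
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass
class StreamMetrics:
    """Timing data for one streamed response (illustrative field names)."""
    started_at: float
    first_chunk_at: Optional[float] = None
    finished_at: Optional[float] = None
    chunk_count: int = 0
    completed: bool = False

    @property
    def time_to_first_chunk(self) -> Optional[float]:
        if self.first_chunk_at is None:
            return None
        return self.first_chunk_at - self.started_at

    @property
    def chunks_per_second(self) -> Optional[float]:
        if self.finished_at is None or self.finished_at <= self.started_at:
            return None
        return self.chunk_count / (self.finished_at - self.started_at)


async def instrumented(chunks: AsyncIterator[str], metrics: StreamMetrics) -> AsyncIterator[str]:
    """Pass chunks through unchanged while recording streaming metrics."""
    try:
        async for chunk in chunks:
            if metrics.first_chunk_at is None:
                metrics.first_chunk_at = time.monotonic()
            metrics.chunk_count += 1
            yield chunk
        metrics.completed = True  # only reached if the stream ended normally
    finally:
        metrics.finished_at = time.monotonic()


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical provider stream used only for the demo."""
    for piece in ["Hello", ", ", "world"]:
        await asyncio.sleep(0.01)
        yield piece


async def demo() -> tuple[str, StreamMetrics]:
    metrics = StreamMetrics(started_at=time.monotonic())
    text = "".join([c async for c in instrumented(fake_stream(), metrics)])
    return text, metrics


text, metrics = asyncio.run(demo())
print(text, metrics.time_to_first_chunk, metrics.chunks_per_second)
```

Aggregating `completed` across requests gives the stream completion rate; the two timing properties feed latency dashboards.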
Connection management differs: HTTP/1.1 streaming ties up connections. Consider Server-Sent Events, WebSockets, or HTTP/2 streaming based on your client requirements.
Full implementation: See reference/10_backend_engineering_code.md for complete streaming implementation with error handling.
Connection Pooling for API Calls
LLM API calls are expensive—both in latency (typically 500ms-10s) and in connection overhead. Connection pooling amortizes connection establishment costs across requests.
But LLM workloads differ from typical API workloads:
Traditional API call: 50-200ms response time, high request volume, connection reuse is critical for throughput.
LLM API call: 500ms-10s response time, lower volume, connection lifespan varies dramatically based on output length.
This changes optimal pooling strategy:
import httpx
from contextlib import asynccontextmanager
class LLMConnectionPool:
    """Connection pool tuned for LLM API characteristics."""
    def __init__(
        self,
        base_url: str,
        max_connections: int = 100,
        max_keepalive: int = 20,  # Fewer keepalive for long-running requests
        timeout_config: dict | None = None
    ):
        # LLM-specific timeouts
        self.timeout = httpx.Timeout(
            connect=10.0,  # Connection establishment
            read=120.0,    # Long reads for generation
            write=10.0,    # Quick writes (prompts)
            pool=30.0      # Wait for connection from pool
        )
        self.limits = httpx.Limits(
            max_connections=max_connections,
            max_keepalive_connections=max_keepalive,
            keepalive_expiry=30.0  # Release idle connections faster
        )
        self.client = httpx.AsyncClient(
            base_url=base_url,
            timeout=self.timeout,
            limits=self.limits,
            http2=True  # Better multiplexing; requires the httpx[http2] extra
        )
Key tuning considerations:
Timeout configuration: Read timeouts must accommodate longest expected generation (function of max_tokens). Connect timeouts should be short to fail fast.
Pool sizing: More connections than typical because each request holds a connection for seconds, not milliseconds. Size based on expected concurrent LLM calls, not total request volume.
Keep-alive strategy: LLM providers may have different keep-alive semantics. Test with your specific provider.
HTTP/2 consideration: HTTP/2 multiplexing can handle multiple requests over fewer connections, but streaming behavior varies by implementation.
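The pool-sizing advice follows from Little's law: the number of calls in flight equals the arrival rate multiplied by the average call duration. The sketch below applies it; the traffic figures and the 1.5x headroom factor are assumptions chosen for illustration, not recommendations.

```python
import math


def required_connections(requests_per_second: float, avg_call_seconds: float,
                         headroom: float = 1.5) -> int:
    """Little's law: in-flight calls = arrival rate x average duration, plus headroom."""
    in_flight = requests_per_second * avg_call_seconds
    return math.ceil(in_flight * headroom)


# Same arrival rate, very different connection needs (assumed example numbers).
traditional = required_connections(requests_per_second=20, avg_call_seconds=0.125)
llm = required_connections(requests_per_second=20, avg_call_seconds=4.0)
print(traditional, llm)
```

With HTTP/2 multiplexing the number of physical connections can be lower than the in-flight count, but the concurrency budget itself is still governed by the same product.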
The Cost of Token-Based Pricing
LLM APIs typically charge per token, creating a cost model unlike traditional APIs. This has architectural implications:
Traditional API: Fixed cost per call regardless of payload size. Optimization focuses on reducing call count.
LLM API: Cost scales with input + output tokens. A 4K token prompt costs 10x a 400 token prompt. Long outputs compound costs.
This means:
- Caching ROI is higher: Cached response saves both latency AND cost
- Prompt optimization is cost optimization: Shorter prompts save money
- Output limiting is essential: max_tokens isn’t just for safety; it’s cost control
- “Batching” means three different things, so keep them straight:
  - (a) Request batching for throughput: running many requests through one model pass via continuous batching is a large win when you self-host, because it raises GPU utilization and cuts cost per token, though it does not change the token count itself.
  - (b) Provider Batch APIs: submitting an offline job to a provider’s async batch endpoint earns a real ~50% token discount (see the Batch Inference pricing in Chapters 12-13).
  - (c) The narrow true caveat: merely concatenating several prompts into one synchronous call does not reduce the number of tokens billed, so it saves no token cost (and can hurt latency).
  Earlier editions over-generalized (c) into “batching doesn’t help,” which is wrong: (a) and (b) are among the highest-leverage cost levers available.
# Cost-aware request handling
# count_tokens and _call_api are assumed to be provided elsewhere in the application
class CostLimitExceeded(Exception):
    """Raised when the worst-case cost of a request exceeds its budget."""
class CostAwareLLMClient:
    def __init__(self, budget_per_request: float = 0.10):
        self.budget = budget_per_request
        self.pricing = {
            'gpt-5.5': {'input': 0.005/1000, 'output': 0.030/1000},
            'gpt-5.4': {'input': 0.0025/1000, 'output': 0.015/1000},
            'claude-opus-4-8': {'input': 0.005/1000, 'output': 0.025/1000},
        }
    def estimate_cost(self, model: str, prompt: str, max_tokens: int) -> float:
        """Estimate request cost before making call."""
        input_tokens = count_tokens(prompt)
        prices = self.pricing[model]
        # Worst case: max_tokens fully used
        max_cost = (
            input_tokens * prices['input'] +
            max_tokens * prices['output']
        )
        return max_cost
    async def generate(self, prompt: str, model: str, max_tokens: int) -> str:
        estimated = self.estimate_cost(model, prompt, max_tokens)
        if estimated > self.budget:
            # Decision: reject, use cheaper model, or reduce max_tokens
            raise CostLimitExceeded(f"Estimated ${estimated:.4f} exceeds ${self.budget:.4f}")
        return await self._call_api(prompt, model, max_tokens)
Why Testing LLMs is Hard
The Fundamental Challenge
Traditional testing rests on a simple assumption: given the same inputs, code produces the same outputs. This enables assertions like assertEqual(actual, expected). When tests pass, we have confidence the code works. When they fail, we have a reproducible bug to fix.
LLM applications break this assumption. The same prompt may produce:
- Different word choices expressing the same meaning
- Different valid formats for the same content
- Different reasoning paths to the same conclusion
- Occasionally, genuinely different (and possibly wrong) answers
This creates a testing trilemma:
There’s no perfect solution—only tradeoffs appropriate to your situation.
Testing Strategies and When to Use Each
Strategy 1: Deterministic testing (temperature=0)
Set temperature to 0 for reproducible outputs. Useful for:
- Regression testing against known-good outputs
- Format validation (does output parse correctly?)
- Exact-match requirements (code generation, structured data)
Limitations:
- Doesn’t test production behavior (where temperature > 0)
- May give false confidence (model could be “luckily” correct at temp=0)
- Provider changes can still break determinism
def test_classification_deterministic():
    """Test with temperature=0 for reproducibility."""
    response = llm.generate(
        "Classify this support ticket: 'My order never arrived'",
        temperature=0
    )
    # Can use exact assertion because deterministic
    assert response.strip() == "shipping_issue"
Strategy 2: Statistical testing
Run tests multiple times, assert on distributions. Useful for:
- Production-representative quality testing
- Measuring improvement/regression across model versions
- Understanding consistency boundaries
Limitations:
- Slower (multiple API calls per test)
- More expensive
- Requires defining statistical thresholds
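One way to define such a threshold is to assert on the lower end of a confidence interval rather than on the raw pass rate, so that a small sample cannot pass by luck. The sketch below uses the Wilson score interval; the choice of interval and the example counts are assumptions, and other intervals would work as well.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = p_hat + z * z / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


# The same 90% observed pass rate supports very different claims at different sample sizes.
small_sample = wilson_lower_bound(18, 20)
large_sample = wilson_lower_bound(180, 200)
print(f"{small_sample:.3f} {large_sample:.3f}")
```

A test could then assert, for example, that the lower bound stays above 0.80, which 180 of 200 satisfies and 18 of 20 does not. The cost is more API calls per test run.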
def test_classification_statistical():
    """Test accuracy over multiple runs."""
    test_cases = [
        ("My order never arrived", "shipping_issue"),
        ("The product was damaged", "quality_issue"),
        # ... more cases
    ]
    correct = 0
    for input_text, expected in test_cases:
        response = llm.generate(f"Classify: {input_text}", temperature=0.7)
        if expected in response.lower():
            correct += 1
    accuracy = correct / len(test_cases)
    # Threshold on aggregate accuracy; each case is sampled once here,
    # so repeat cases or enlarge the set for a tighter estimate
    assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
Strategy 3: Behavioral/property testing
Test for properties rather than exact outputs. Useful for:
- Format compliance (valid JSON, expected schema)
- Safety constraints (no harmful content)
- Consistency properties (same entity names throughout)
import json
import pytest
def test_json_output_valid():
    """Test that output is always valid JSON."""
    for _ in range(10):  # Multiple runs
        response = llm.generate(
            "Return a JSON object with fields: name, age, city",
            temperature=0.7
        )
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            pytest.fail(f"Invalid JSON: {response}")
        # Schema checks run only after parsing succeeded
        assert 'name' in parsed
        assert 'age' in parsed
        assert 'city' in parsed
Strategy 4: LLM-as-judge
Use a (usually stronger) LLM to evaluate outputs. Useful for:
- Subjective quality assessment
- Complex criteria (helpfulness, accuracy, tone)
- Scaling evaluation beyond human capacity
Limitations:
- Introduces LLM bias into evaluation
- Expensive (two LLM calls per test)
- Judge LLM can be wrong
def test_with_llm_judge():
    """Use LLM to evaluate response quality."""
    response = llm.generate("Explain quantum entanglement simply")
    judge_prompt = f"""
    Evaluate this explanation of quantum entanglement for a general audience.
    Explanation: {response}
    Rate 1-5 on:
    - Accuracy (scientifically correct)
    - Clarity (understandable to non-experts)
    - Completeness (covers key concepts)
    Return JSON: {{"accuracy": N, "clarity": N, "completeness": N}}
    """
    scores = json.loads(judge_llm.generate(judge_prompt, temperature=0))
    assert all(score >= 4 for score in scores.values())
Strategy 5: Mock/fixture testing
Replace LLM calls with recorded responses. Useful for:
- Testing non-LLM logic (routing, formatting, error handling)
- Fast CI/CD pipelines
- Consistent behavior across environments
Limitations:
- Doesn’t test actual model behavior
- Fixtures become stale
- May mask prompt quality issues
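A recorded-response test double might look like the sketch below. `RecordedLLM` and `route_ticket` are hypothetical names invented for this example; the point is that the routing logic is tested deterministically while a changed prompt fails loudly instead of silently hitting the real API.

```python
import hashlib


class RecordedLLM:
    """Replays recorded responses keyed by prompt hash (hypothetical test double)."""

    def __init__(self, fixtures: dict[str, str]):
        self._fixtures = fixtures

    @staticmethod
    def key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def generate(self, prompt: str, **_kwargs) -> str:
        try:
            return self._fixtures[self.key(prompt)]
        except KeyError:
            # Fail loudly so a changed prompt is noticed and the fixture re-recorded.
            raise LookupError(f"No fixture recorded for prompt: {prompt[:60]!r}") from None


def route_ticket(llm, ticket: str) -> str:
    """Non-LLM logic under test: map a model label to a support queue."""
    label = llm.generate(f"Classify: {ticket}", temperature=0).strip().lower()
    return {"shipping_issue": "logistics", "quality_issue": "returns"}.get(label, "general")


ticket = "My order never arrived"
fixtures = {RecordedLLM.key(f"Classify: {ticket}"): "shipping_issue\n"}
recorded_llm = RecordedLLM(fixtures)
print(route_ticket(recorded_llm, ticket))
```

Keying fixtures by prompt hash makes staleness visible: any prompt edit invalidates the recording, which addresses the second limitation at the price of more frequent re-recording.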
What people do: Write tests like assert response == "The answer is 42". Tests pass locally, fail randomly in CI, get marked as flaky, then get ignored.
Why it fails: LLM outputs vary even at temperature=0 across providers, model versions, and infrastructure. “The answer is 42” and “42 is the answer” are equivalent responses but fail exact matching. This creates tests that are either too strict (fail randomly) or too loose (miss real bugs).
Fix: Match the assertion to what you’re actually testing. For format validation, parse the output and check structure. For content correctness, use semantic similarity or LLM-as-judge. For regression testing, run multiple times and assert on statistical properties. Reserve exact matching only for truly deterministic operations like response parsing.
Full implementation: See reference/10_backend_engineering_code.md for complete test suite implementations.
Building a Testing Pyramid for LLM Applications
Run at different frequencies:
- Every commit: Mocked tests, deterministic tests
- Per PR: Statistical tests on samples, behavioral tests
- Daily: Full statistical suite, LLM-judge evaluation
- Weekly: Human evaluation of samples, production A/B analysis
Regression Testing: Catching Quality Degradation
LLM applications can silently degrade. A model update, prompt change, or infrastructure shift can cause quality regression that traditional tests miss.
Build regression baselines:
class LLMRegressionSuite:
    """Track quality metrics over time to catch regressions."""
    def __init__(self, baseline_path: str):
        self.baseline = self.load_baseline(baseline_path)
        self.current_metrics = {}
    def run_evaluation(self, llm_client, test_set: list[dict]) -> dict:
        """Evaluate current model on test set."""
        # Measure latency once and reuse it for both percentiles
        latency = self.measure_latency(llm_client, test_set)
        results = {
            'accuracy': self.measure_accuracy(llm_client, test_set),
            'format_compliance': self.measure_format_compliance(llm_client, test_set),
            'latency_p50': latency['p50'],
            'latency_p95': latency['p95'],
            'consistency': self.measure_consistency(llm_client, test_set),
        }
        self.current_metrics = results
        return results
    def check_regressions(self, tolerance: float = 0.05) -> list[dict]:
        """Identify metrics that regressed beyond tolerance."""
        regressions = []
        for metric, baseline_value in self.baseline.items():
            current_value = self.current_metrics.get(metric, 0)
            # For latency, higher is worse
            if 'latency' in metric:
                if current_value > baseline_value * (1 + tolerance):
                    regressions.append({
                        'metric': metric,
                        'baseline': baseline_value,
                        'current': current_value,
                        'change': f"+{(current_value/baseline_value - 1)*100:.1f}%"
                    })
            # For other metrics, lower is worse
            else:
                if current_value < baseline_value * (1 - tolerance):
                    regressions.append({
                        'metric': metric,
                        'baseline': baseline_value,
                        'current': current_value,
                        'change': f"{(current_value/baseline_value - 1)*100:.1f}%"
                    })
        return regressions
Debugging Non-Deterministic Systems
The Debugging Mindset Shift
Traditional debugging follows a pattern: reproduce the issue, isolate the cause, fix the code, verify the fix. LLM debugging requires a different approach because:
- Issues may not reproduce: The same input might work fine on retry
- “Wrong” is often subjective: Is the response wrong, or just different than expected?
- Root causes span systems: The issue might be in prompting, retrieval, model behavior, or their interaction
- Fixes don’t guarantee prevention: Changing a prompt might fix one case while breaking others
The LLM debugging process:
What people do: When an LLM response seems wrong, automatically retry the request (sometimes multiple times) until it produces an acceptable output.
Why it fails: If your prompt produces bad outputs 20% of the time, retrying just masks the problem. Retry-until-success then averages 1.25 calls per request (1 / 0.8), so you pay about 25% more for a prompt that still fails one time in five on the first attempt, and users experience inconsistent latency. Worse, you might be training users to expect retry-based “best of N” quality that production can’t sustain.
Fix: Treat frequent retries as a signal to fix the prompt, not a solution. Log retry rates per prompt template. If a prompt requires retries >10% of the time, that’s a prompt engineering problem. For genuinely ambiguous cases, consider using a verification step that checks output quality and only retries when detection is reliable.
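Logging retry rates per template can be as simple as two counters. The sketch below is one possible shape; the class name, the template identifiers, and the minimum-sample guard are assumptions added for illustration, while the 10% threshold comes from the guidance above.

```python
from collections import defaultdict


class RetryRateTracker:
    """Counts how often each prompt template needed more than one attempt."""

    def __init__(self, alert_threshold: float = 0.10, min_requests: int = 50):
        self.alert_threshold = alert_threshold
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self._requests = defaultdict(int)
        self._retried = defaultdict(int)

    def record(self, template_id: str, attempts: int) -> None:
        """Record one finished request and how many attempts it took."""
        self._requests[template_id] += 1
        if attempts > 1:
            self._retried[template_id] += 1

    def retry_rate(self, template_id: str) -> float:
        total = self._requests.get(template_id, 0)
        return self._retried.get(template_id, 0) / total if total else 0.0

    def templates_needing_attention(self) -> list[str]:
        """Templates whose retry rate exceeds the threshold on a large enough sample."""
        return sorted(
            t for t, total in self._requests.items()
            if total >= self.min_requests and self.retry_rate(t) > self.alert_threshold
        )


tracker = RetryRateTracker()
for i in range(100):
    tracker.record("spending_summary_v2", attempts=2 if i < 15 else 1)
    tracker.record("ticket_classifier_v1", attempts=2 if i < 5 else 1)
flagged = tracker.templates_needing_attention()
print(flagged)
```

In production the counters would normally live in the metrics system rather than in process memory; the in-memory version only shows what to count.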
“Classic backend instinct: wrap the LLM call in our standard retry decorator. Three retries, exponential backoff, the same pattern we use for every internal service. The day OpenAI had a 30-minute degradation, our system generated 4x normal traffic against them, triggered our org-wide rate limit, and brought down two unrelated products that shared the key. The retry logic that protects you for idempotent REST calls actively hurts you for shared, rate-limited, expensive endpoints. Now our LLM client has a single retry, a circuit breaker, and a degraded-mode response. Retries against an LLM provider aren’t free—they’re a denial-of-service against yourself.”
— Tech Lead at a B2B analytics company
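The "single retry, circuit breaker, degraded-mode response" arrangement described in the quote could be sketched as follows. This is a simplified synchronous illustration, not the quoted team's code: the thresholds, the injectable clock, and the function names are assumptions, and backoff between attempts is omitted for brevity.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Opens after consecutive failures, then allows a probe after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let a probe request through.
        return self._clock() - self._opened_at >= self.reset_after_seconds

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_breaker(breaker: CircuitBreaker, call: Callable[[], str],
                      fallback: str, max_retries: int = 1) -> str:
    """Try the call at most 1 + max_retries times; otherwise return the fallback."""
    for _ in range(1 + max_retries):
        if not breaker.allow():
            break  # circuit is open: stop sending traffic to the provider
        try:
            result = call()
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return result
    return fallback


# Demo with a fake clock and a provider that always times out.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
attempts = []


def failing_call() -> str:
    attempts.append(1)
    raise TimeoutError("provider degraded")


results = [call_with_breaker(breaker, failing_call, "degraded-mode answer") for _ in range(5)]
print(len(attempts), results[0])
```

Five user requests during the outage produce only three provider calls, after which the breaker stops sending traffic until the cool-down has elapsed.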
Logging for LLM Debugging
You cannot debug what you did not log. LLM debugging requires comprehensive logging of the complete interaction context.
What to log:
from dataclasses import dataclass
from datetime import datetime
@dataclass
class LLMInteractionLog:
    """Complete record of an LLM interaction for debugging."""
    # Request identity
    request_id: str
    timestamp: datetime
    user_id: str | None
    # Model configuration
    model: str
    temperature: float
    max_tokens: int
    # Prompt components (logged separately for analysis)
    system_prompt: str
    user_message: str
    conversation_history: list[dict]
    # RAG context (if applicable)
    retrieval_query: str | None
    retrieved_documents: list[dict] | None
    retrieval_scores: list[float] | None
    # Assembled prompt (what actually went to the API)
    full_prompt: str
    prompt_token_count: int
    # Response
    response: str
    completion_token_count: int
    # Timing
    retrieval_latency_ms: float | None
    llm_latency_ms: float | None
    total_latency_ms: float
    # Outcome (filled later via feedback)
    user_feedback: str | None
    automated_quality_score: float | None
Storage considerations:
- Hot storage (recent, frequently accessed): Keep 7-30 days in queryable database for debugging
- Cold storage (historical, audit): Archive older logs to object storage
- Sampling: For high-volume applications, log 100% of errors, sample successful requests
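A sampling rule of this kind can be made deterministic by hashing the request ID, so every service that handles the same request reaches the same decision. The function name and the 5% default rate below are assumptions for illustration.

```python
import hashlib


def should_log(request_id: str, is_error: bool, success_sample_rate: float = 0.05) -> bool:
    """Log every error; log a stable, hash-based sample of successful requests."""
    if is_error:
        return True
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_sample_rate


sampled_count = sum(should_log(f"req-{i}", is_error=False) for i in range(10_000))
print(sampled_count)
```

Because the decision depends only on the request ID, retrieval logs and generation logs for a sampled request are either all kept or all dropped, which keeps traces complete.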
Common Debugging Scenarios
Scenario 1: “The model ignores my instructions”
Symptoms: Output doesn’t follow specified format, ignores constraints, uses wrong tone.
Diagnostic approach:
def diagnose_instruction_following(
    prompt: str,
    response: str,
    expected_behavior: str
) -> dict:
    """Diagnose why model might ignore instructions."""
    diagnostics = {
        'findings': [],
        'likely_causes': [],
        'suggestions': []
    }
    lowered = prompt.lower()
    # Check 1: Prompt length and instruction position
    prompt_tokens = count_tokens(prompt)
    if prompt_tokens > 3000:
        diagnostics['findings'].append(f"Long prompt ({prompt_tokens} tokens)")
        diagnostics['likely_causes'].append("Lost-in-the-middle effect")
        diagnostics['suggestions'].append("Move critical instructions to beginning and end")
    # Check 2: Instruction clarity
    # Count occurrences, not distinct words: with five patterns, a
    # presence-only count could never exceed the threshold of 5
    instruction_patterns = ['must', 'should', 'always', 'never', 'required']
    instruction_count = sum(lowered.count(p) for p in instruction_patterns)
    if instruction_count > 5:
        diagnostics['findings'].append(f"Many directive words ({instruction_count})")
        diagnostics['likely_causes'].append("Conflicting or overwhelming instructions")
        diagnostics['suggestions'].append("Simplify to core requirements")
    # Check 3: Example-instruction conflict
    if 'example' in lowered or '```' in prompt:
        diagnostics['findings'].append("Contains examples")
        diagnostics['likely_causes'].append("Examples may contradict instructions")
        diagnostics['suggestions'].append("Verify examples match desired behavior")
    # Check 4: Negative instructions
    negative_patterns = ["don't", "do not", "never", "avoid"]
    negatives = sum(lowered.count(p) for p in negative_patterns)
    if negatives > 2:
        diagnostics['findings'].append(f"Multiple negative instructions ({negatives})")
        diagnostics['likely_causes'].append("Negative framing is less effective")
        diagnostics['suggestions'].append("Rephrase as positive: 'do X' instead of 'don't do Y'")
    return diagnostics
Scenario 2: “RAG returns irrelevant context”
Symptoms: Generated response is off-topic or misses obvious relevant information.
Diagnostic approach:
def diagnose_retrieval(
    query: str,
    retrieved_docs: list[dict],
    expected_topic: str
) -> dict:
    """Diagnose retrieval quality issues."""
    diagnostics = {
        'query_analysis': {},
        'retrieval_analysis': {},
        'suggestions': []
    }
    # Analyze query
    query_tokens = query.split()
    diagnostics['query_analysis'] = {
        'token_count': len(query_tokens),
        'is_question': query.strip().endswith('?'),
        'contains_technical_terms': detect_technical_terms(query),
    }
    if len(query_tokens) < 5:
        diagnostics['suggestions'].append("Query may be too short; consider query expansion")
    # Analyze retrieved documents
    scores = [doc.get('score', 0) for doc in retrieved_docs]
    diagnostics['retrieval_analysis'] = {
        'top_score': max(scores) if scores else 0,
        'score_range': max(scores) - min(scores) if scores else 0,
        'all_scores_low': all(s < 0.5 for s in scores),
    }
    if diagnostics['retrieval_analysis']['all_scores_low']:
        diagnostics['suggestions'].append("All scores low; query may be out-of-domain")
        diagnostics['suggestions'].append("Check if relevant documents exist in corpus")
    if diagnostics['retrieval_analysis']['score_range'] < 0.1:
        diagnostics['suggestions'].append("Scores very similar; retriever not discriminating")
        diagnostics['suggestions'].append("Consider reranking or different embedding model")
    # Check term overlap
    query_terms = set(query.lower().split())
    for i, doc in enumerate(retrieved_docs[:3]):
        doc_terms = set(doc['text'].lower().split())
        overlap = query_terms & doc_terms
        diagnostics[f'doc_{i}_overlap'] = len(overlap)
    return diagnostics
Scenario 3: “Output quality is inconsistent”
Symptoms: Same type of query produces varying quality responses.
Diagnostic approach:
def diagnose_inconsistency(
    prompt: str,
    num_samples: int = 10
) -> dict:
    """Diagnose output consistency issues."""
    responses = []
    for _ in range(num_samples):
        response = llm.generate(prompt, temperature=0.7)
        responses.append(response)
    # Cluster responses by similarity
    embeddings = [embed(r) for r in responses]
    similarities = compute_pairwise_similarity(embeddings)
    diagnostics = {
        'sample_count': num_samples,
        'unique_responses': len(set(responses)),
        'avg_similarity': float(similarities.mean()),
        'min_similarity': float(similarities.min()),
        'similarity_std': float(similarities.std()),
    }
    # Compare with temperature=0
    deterministic = llm.generate(prompt, temperature=0)
    diagnostics['deterministic_response'] = deterministic
    diagnostics['deterministic_in_samples'] = deterministic in responses
    # Suggestions based on findings
    if diagnostics['avg_similarity'] < 0.7:
        diagnostics['suggestions'] = [
            "High variance in outputs",
            "Consider: lower temperature, more specific instructions, few-shot examples",
        ]
    elif diagnostics['unique_responses'] == 1:
        diagnostics['suggestions'] = [
            "All responses identical despite temperature>0",
            "Prompt may be overly constrained or model very confident",
        ]
    return diagnostics
Production Debugging: Working with Incomplete Information
In production, you often can’t reproduce issues exactly. You have logs, but the exact conditions may be unreproducible due to:
- Model updates by the provider
- Non-deterministic sampling
- State that wasn’t logged
Strategies for production debugging:
Pattern analysis: Look for commonalities across failures
import random
from collections import Counter
def analyze_failure_patterns(failures: list[dict]) -> dict:
    """Find patterns in logged failures."""
    patterns = {
        'by_user_type': Counter(),
        'by_prompt_length': [],
        'by_time_of_day': Counter(),
        'common_keywords': Counter(),
    }
    for f in failures:
        patterns['by_prompt_length'].append(len(f['prompt']))
        patterns['by_time_of_day'][f['timestamp'].hour] += 1
        for word in f['user_message'].split():
            patterns['common_keywords'][word.lower()] += 1
    return {
        'prompt_length_correlation': analyze_correlation(patterns['by_prompt_length']),
        'peak_failure_hours': patterns['by_time_of_day'].most_common(3),
        'keyword_patterns': patterns['common_keywords'].most_common(10),
    }
Reproduction scripts: Generate scripts from logs to attempt reproduction
def generate_reproduction_script(log_entry: dict) -> str:
    """Generate runnable script from log entry."""
    # The prompt is embedded with !r so quotes and newlines in it stay valid Python
    return f'''
# Reproduction script for request {log_entry["request_id"]}
# Original timestamp: {log_entry["timestamp"]}
# Original outcome: {log_entry.get("user_feedback", "unknown")}
from your_app import LLMClient
client = LLMClient(model="{log_entry["model"]}")
prompt = {log_entry["full_prompt"]!r}
# Try to reproduce
for i in range(5):
    response = client.generate(prompt, temperature={log_entry["temperature"]})
    print(f"Attempt {{i+1}}: {{response[:200]}}...")
'''
Canary testing: Deploy fixes to a small percentage of traffic first
def route_request(request, fix_enabled: bool) -> str:
    """Route requests with canary for new fix."""
    if fix_enabled and random.random() < 0.05:  # 5% canary
        return process_with_fix(request)
    return process_standard(request)
Caching Strategies for LLM Applications
When to Cache (and When Not To)
Caching is more complex for LLM applications than traditional backends due to the nature of LLM inputs and outputs.
Decision framework for LLM caching:
Exact match caching works when:
- Temperature = 0 (deterministic)
- Prompts are identical (including system prompt, examples, etc.)
- Model version is controlled
- Freshness requirements are met
Semantic caching (caching similar queries) is risky because:
- “Similar” queries may require different answers
- Embedding similarity doesn’t guarantee answer transferability
- Edge cases can produce very wrong cached results
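The first risk is easy to reproduce with a threshold-based lookup. In the sketch below the two vectors are hand-picked toy values standing in for the embeddings of "spending in March" and "spending in April"; they are not output from any real embedding model, and the function names are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(query_vec: list[float], entries: list[tuple[list[float], str]],
                    threshold: float):
    """Return the most similar cached (answer, similarity) at or above threshold, else None."""
    best = None
    for vec, answer in entries:
        sim = cosine_similarity(query_vec, vec)
        if sim >= threshold and (best is None or sim > best[1]):
            best = (answer, sim)
    return best


# Toy vectors (assumed): near-identical wording, different correct answers.
march_vec = [0.90, 0.10, 0.42]
april_vec = [0.90, 0.12, 0.40]
cache_entries = [(march_vec, "March total: $2,140")]

hit = semantic_lookup(april_vec, cache_entries, threshold=0.98)
print(hit)
```

The April question clears a 0.98 threshold and would be served the March answer. Raising the threshold reduces such errors only by shrinking the hit rate toward that of exact-match caching.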
Implementing a Safe LLM Cache
import hashlib
import json
from datetime import datetime, timezone
class LLMCache:
    """Cache layer for LLM responses with safety checks."""
    def __init__(
        self,
        redis_client,
        ttl_seconds: int = 3600,
        enable_semantic: bool = False,
        semantic_threshold: float = 0.98  # Very high threshold if enabled
    ):
        self.redis = redis_client
        self.ttl = ttl_seconds
        self.enable_semantic = enable_semantic
        self.semantic_threshold = semantic_threshold
    def cache_key(
        self,
        prompt: str,
        model: str,
        temperature: float,
        system_prompt: str = ""
    ) -> str | None:
        """Generate cache key if caching is appropriate."""
        # Skip sampled (temperature > 0) requests unless semantic caching is enabled
        if temperature > 0 and not self.enable_semantic:
            return None
        # Include all inputs that affect output, temperature included
        cache_input = f"{model}:{temperature}:{system_prompt}:{prompt}"
        return f"llm:{hashlib.sha256(cache_input.encode()).hexdigest()}"
    async def get(self, key: str) -> dict | None:
        """Retrieve cached response with metadata."""
        if not key:
            return None
        cached = await self.redis.get(key)
        if cached:
            data = json.loads(cached)
            # Track cache hits for monitoring
            await self.redis.incr(f"cache_hits:{key[:20]}")
            return data
        return None
    async def set(
        self,
        key: str,
        response: str,
        metadata: dict | None = None
    ) -> None:
        """Cache response with metadata."""
        if not key:
            return
        data = {
            'response': response,
            # datetime.utcnow() is deprecated; use an explicit UTC timestamp
            'cached_at': datetime.now(timezone.utc).isoformat(),
            'metadata': metadata or {}
        }
        await self.redis.setex(key, self.ttl, json.dumps(data))
Cache Invalidation Strategies
LLM caches face unique invalidation challenges:
- Model updates: Providers update models without notice
  - Strategy: Include model version in cache key when possible
  - Strategy: Set shorter TTLs to naturally expire stale entries
- Prompt changes: Even small prompt changes should invalidate cached responses
  - Strategy: Hash complete prompt (including system prompt)
  - Strategy: Version prompts explicitly in cache key
- Knowledge freshness: RAG responses depend on document corpus
  - Strategy: Include corpus version or document hashes in cache key
  - Strategy: Invalidate cache when documents are updated
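The versioning strategies above can be sketched as a single key builder. This is an illustrative helper, not part of any library: the function name and the `model_version`, `prompt_template_version`, and `corpus_version` parameters are assumptions about what your application tracks.

```python
import hashlib
import json


def versioned_cache_key(
    prompt: str,
    system_prompt: str,
    model: str,
    model_version: str,
    prompt_template_version: str,
    corpus_version: str = "",
) -> str:
    """Build a cache key that changes whenever any output-affecting version changes."""
    # Serialize as a JSON object with sorted keys so the encoding is stable
    # and field boundaries cannot be confused by characters inside the prompt.
    payload = json.dumps(
        {
            "prompt": prompt,
            "system_prompt": system_prompt,
            "model": model,
            "model_version": model_version,
            "prompt_template_version": prompt_template_version,
            "corpus_version": corpus_version,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_v1 = versioned_cache_key("What is our refund policy?", "Be concise.", "model-x", "2024-01", "t3")
key_v2 = versioned_cache_key("What is our refund policy?", "Be concise.", "model-x", "2024-06", "t3")
```

Because every version is part of the hashed payload, bumping any one of them makes old entries unreachable; they then expire through the TTL rather than needing an explicit purge.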
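import hashlib


class RAGAwareCache(LLMCache):
    """Cache that accounts for RAG document freshness."""

    def cache_key(
        self,
        prompt: str,
        model: str,
        temperature: float,
        retrieved_doc_ids: list[str],
        system_prompt: str = "",
    ) -> str | None:
        # Include document IDs in cache key
        # Cache is only valid for this exact retrieval result.
        # IDs should change when content changes (e.g. carry a version),
        # otherwise edits to a document will not invalidate the entry.
        # Join with a separator so ["ab", "c"] and ["a", "bc"] hash differently.
        doc_hash = hashlib.sha256(
            "\n".join(sorted(retrieved_doc_ids)).encode()
        ).hexdigest()[:8]
        base_key = super().cache_key(prompt, model, temperature, system_prompt)
        if base_key:
            return f"{base_key}:docs:{doc_hash}"
        return None
Data Pipeline Architecture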
Training Data Pipelines
If you’re fine-tuning models, training data quality determines model quality. “Garbage in, garbage out” is especially true for LLMs because:
- Models learn patterns from examples, including unintended patterns
- Small amounts of bad data can disproportionately affect behavior
- Data issues manifest as subtle quality problems, not obvious errors
Training data pipeline stages:
Key quality gates:
class TrainingDataValidator:
    """Validate training examples before fine-tuning."""

    def validate_example(self, example: dict) -> tuple[bool, list[str]]:
        """Check single example, return (valid, issues)."""
        issues = []
        # Structure check
        if 'messages' not in example:
            issues.append("Missing 'messages' key")
            return False, issues
        messages = example['messages']
        if not messages:
            issues.append("Empty 'messages' list")
            return False, issues
        # Role sequence check
        roles = [m.get('role') for m in messages]
        if roles[-1] != 'assistant':
            issues.append("Last message must be from assistant")
        # Empty content check
        for i, msg in enumerate(messages):
            if not (msg.get('content') or '').strip():
                issues.append(f"Empty content in message {i}")
        # Token limit check (count_tokens is assumed to wrap your tokenizer)
        total_tokens = sum(
            count_tokens(m.get('content') or '')
            for m in messages
        )
        if total_tokens > 4096:
            issues.append(f"Example exceeds token limit: {total_tokens}")
        # Quality heuristics
        assistant_msgs = [m for m in messages if m.get('role') == 'assistant']
        for msg in assistant_msgs:
            content = msg.get('content') or ''
            if len(content) < 10:
                issues.append("Assistant response very short")
            if content.startswith(("I cannot", "I can't")):
                issues.append("Assistant refusal - review if intentional")
        return len(issues) == 0, issues
Inference Data Pipelines (RAG Ingestion)
For RAG systems, document ingestion quality directly affects retrieval quality. The pipeline must handle:
- Multiple document formats (PDF, HTML, DOCX, etc.)
- Extraction errors and partial failures
- Chunking with semantic coherence
- Embedding at scale
- Incremental updates
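As a rough illustration of the chunking requirement, the sketch below splits text into fixed-size word windows with overlap. It is a simplified stand-in, not semantic chunking: it counts whitespace-separated words rather than model tokens and ignores sentence or section boundaries. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words, each sharing `overlap` words
    with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end, to avoid a trailing
        # chunk that only repeats already-covered words.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
sample_chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

The overlap means a sentence that straddles a boundary appears intact in at least one chunk, at the cost of embedding some words twice.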
Robust ingestion pipeline:
from datetime import datetime, timezone


class DocumentIngestionPipeline:
    """Production-grade document ingestion for RAG."""

    # SemanticChunker, ExtractionError, EmbeddingError and the extract(),
    # embed_batch() and store() methods are assumed to be defined elsewhere.

    def __init__(
        self,
        vector_store,
        embedding_model,
        chunk_size: int = 512,
        chunk_overlap: int = 50,
    ):
        self.vector_store = vector_store
        self.embedder = embedding_model
        self.chunker = SemanticChunker(chunk_size, chunk_overlap)

    async def ingest_document(
        self,
        doc_path: str,
        metadata: dict | None = None,
    ) -> dict:
        """Ingest single document with error handling."""
        result = {
            'doc_path': doc_path,
            'status': 'pending',
            'chunks_indexed': 0,
            'errors': [],
        }
        try:
            # Extract text
            text, extracted_metadata = await self.extract(doc_path)
            if not text.strip():
                result['status'] = 'failed'
                result['errors'].append("No text extracted")
                return result
            # Chunk
            chunks = self.chunker.chunk(text)
            # Embed in batches
            embeddings = await self.embed_batch(chunks, batch_size=32)
            # Store with metadata
            combined_metadata = {
                **(metadata or {}),
                **extracted_metadata,
                'ingested_at': datetime.now(timezone.utc).isoformat(),
            }
            chunk_ids = await self.store(
                chunks, embeddings, combined_metadata
            )
            result['status'] = 'success'
            result['chunks_indexed'] = len(chunk_ids)
            result['chunk_ids'] = chunk_ids
        except ExtractionError as e:
            result['status'] = 'failed'
            result['errors'].append(f"Extraction failed: {e}")
        except EmbeddingError as e:
            # Text was extracted, but no chunks were indexed
            result['status'] = 'partial'
            result['errors'].append(f"Embedding failed: {e}")
        return result
Integration Patterns
Circuit Breaker for LLM APIs
LLM APIs can fail in various ways: timeouts, rate limits, server errors, degraded quality. The circuit breaker pattern prevents cascading failures.
Why circuit breakers matter more for LLMs:
- LLM calls are expensive (time and money)
- Retrying a failing LLM API burns budget
- User experience degrades rapidly when LLM latency increases
- Provider outages can last minutes to hours
from enum import Enum
from datetime import datetime, timedelta


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Blocking calls
    HALF_OPEN = "half_open"  # Testing recovery


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

    def __init__(self, message: str, retry_after: timedelta):
        super().__init__(message)
        self.retry_after = retry_after


class LLMCircuitBreaker:
    """Protect against LLM API failures."""

    # APIError and RateLimitError are assumed to come from your LLM client library.

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: timedelta = timedelta(seconds=60),
        half_open_max_calls: int = 3,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failures = 0
        self.last_failure_time = None
        self.half_open_successes = 0

    async def call(self, llm_func, *args, **kwargs):
        """Execute LLM call with circuit breaker protection."""
        if self.state == CircuitState.OPEN:
            if self._should_attempt_recovery():
                self.state = CircuitState.HALF_OPEN
                self.half_open_successes = 0
            else:
                raise CircuitOpenError(
                    "LLM circuit breaker open",
                    retry_after=self._time_until_recovery(),
                )
        try:
            result = await llm_func(*args, **kwargs)
            self._record_success()
            return result
        except (TimeoutError, APIError, RateLimitError):
            self._record_failure()
            raise

    def _time_until_recovery(self) -> timedelta:
        # last_failure_time is always set before the circuit opens
        elapsed = datetime.now() - self.last_failure_time
        return max(self.recovery_timeout - elapsed, timedelta(0))

    def _should_attempt_recovery(self) -> bool:
        return self._time_until_recovery() <= timedelta(0)

    def _record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max_calls:
                self.state = CircuitState.CLOSED
                self.failures = 0
        elif self.state == CircuitState.CLOSED:
            self.failures = 0  # Reset on success

    def _record_failure(self):
        self.failures += 1
        self.last_failure_time = datetime.now()
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN  # Recovery failed
        elif self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
Retry Strategies for LLM Calls
Retries for LLM calls require special consideration:
- Retrying may produce different results: Unlike traditional APIs, LLM retry may succeed with different output
- Cost accumulates: Each retry costs tokens
- Some errors shouldn’t be retried: Content policy violations won’t succeed on retry
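Before choosing retry parameters, it helps to work out what a retry policy costs in the worst case. The sketch below computes the un-jittered exponential backoff schedule and the maximum spend for one logical request. The function names and the example per-call cost are illustrative assumptions, not figures from any provider.

```python
def backoff_schedule(
    max_retries: int = 3,
    base_delay: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 60.0,
) -> list[float]:
    """Delay in seconds before each retry, without jitter, capped at max_delay."""
    return [min(base_delay * factor ** attempt, max_delay) for attempt in range(max_retries)]


def worst_case_cost(cost_per_call: float, max_retries: int) -> float:
    """Cost if the first attempt and every retry are all billed."""
    return cost_per_call * (max_retries + 1)


schedule = backoff_schedule(max_retries=3)   # [1.0, 2.0, 4.0]
total_wait = sum(schedule)                   # 7.0 seconds of added latency
spend = worst_case_cost(0.02, max_retries=3)  # four billed calls at a hypothetical $0.02 each
```

With three retries, a failing request can add seven seconds of waiting on top of the call latency itself and cost up to four times a single call, which is why the retry budget belongs in the same conversation as the latency budget.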
import asyncio
import random


class LLMRetryStrategy:
    """Intelligent retry for LLM calls."""

    # LLMError is assumed to be your client's base exception, exposing
    # error_code and (for rate limits) an optional retry_after in seconds.

    # Errors that should not be retried
    NON_RETRYABLE = {
        'content_policy_violation',
        'invalid_api_key',
        'model_not_found',
        'context_length_exceeded',
    }

    def __init__(
        self,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        exponential_base: float = 2.0,
    ):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.exponential_base = exponential_base

    async def execute(
        self,
        llm_func,
        *args,
        **kwargs
    ):
        """Execute with retry logic."""
        last_exception = None
        for attempt in range(self.max_retries + 1):
            try:
                return await llm_func(*args, **kwargs)
            except LLMError as e:
                last_exception = e
                if e.error_code in self.NON_RETRYABLE:
                    raise  # Don't retry
                if attempt < self.max_retries:
                    delay = self._calculate_delay(attempt, e)
                    await asyncio.sleep(delay)
        raise last_exception

    def _calculate_delay(self, attempt: int, error: LLMError) -> float:
        # Exponential backoff
        delay = self.base_delay * (self.exponential_base ** attempt)
        # Add jitter to prevent thundering herd
        delay *= (0.5 + random.random())
        delay = min(delay, self.max_delay)
        # Rate limit errors should respect retry-after header. Apply it last
        # so that jitter and the cap cannot shorten the server-requested wait.
        retry_after = getattr(error, 'retry_after', None)
        if retry_after:
            delay = max(delay, retry_after)
        return delay
Graceful Degradation
When LLM services fail, have fallback strategies:
class DegradingLLMService:
    """LLM service with graceful degradation."""

    def __init__(self, cache: LLMCache):
        self.primary_model = "claude-opus-4-8"
        self.fallback_models = ["claude-sonnet-4-6", "claude-haiku-4-5"]
        self.cache = cache  # LLMCache needs a Redis client, so it is injected

    async def generate(self, prompt: str, **kwargs) -> dict:
        """Generate with fallback chain."""
        # Try cache first. cache_key returns None when the request should not
        # be cached; a missing temperature is treated as non-deterministic.
        key = self.cache.cache_key(
            prompt,
            self.primary_model,
            kwargs.get('temperature', 1.0),
            kwargs.get('system_prompt', ''),
        )
        cached = await self.cache.get(key)
        if cached:
            return {'response': cached['response'], 'source': 'cache'}
        # Try primary model (call_model is assumed to be implemented elsewhere)
        try:
            response = await self.call_model(self.primary_model, prompt, **kwargs)
            await self.cache.set(key, response)  # no-op when key is None
            return {'response': response, 'source': 'primary'}
        except LLMError:
            pass  # Fall through to fallbacks
        # Try fallback models
        for model in self.fallback_models:
            try:
                response = await self.call_model(model, prompt, **kwargs)
                return {
                    'response': response,
                    'source': f'fallback:{model}',
                    'degraded': True,
                }
            except LLMError:
                continue
        # All models failed - return graceful error
        return {
            'response': "I'm sorry, I'm having trouble processing your request right now. Please try again in a moment.",
            'source': 'error_fallback',
            'degraded': True,
            'failed': True,
        }
Case Study: Production LLM Backend at Scale
Context
A document intelligence company processes millions of documents monthly, extracting structured data and answering questions. Their architecture evolved through several iterations as they scaled.
Architecture Evolution
Phase 1: Synchronous monolith
- Single service handled everything
- LLM calls blocked request threads
- Worked fine until 100 concurrent users
Problems encountered:
- Thread pool exhaustion from long LLM calls
- No isolation between fast and slow requests
- Single LLM provider outage took down everything
Phase 2: Async with queue separation
- Split into sync API (fast responses) and async workers (LLM calls)
- Job queue for document processing
- Results stored in database, polled by clients
Client → API Gateway → {
    Fast requests → Sync Service → Cache/DB
    Slow requests → Queue → Workers → LLM APIs → Results DB
}
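The slow-request path in the diagram can be sketched in a single process with `asyncio`. This is a toy model under stated assumptions: an in-memory `asyncio.Queue` stands in for the job queue, a dictionary stands in for the results database, and `fake_llm_call` is a hypothetical placeholder for a real LLM client. The `JobQueue` class is illustrative, not the company's implementation.

```python
import asyncio
import uuid


class JobQueue:
    """In-memory stand-in for the queue, workers, and results store."""

    def __init__(self):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.results: dict[str, dict] = {}

    async def submit(self, payload: str) -> str:
        """Enqueue a job and return an ID the client can poll."""
        job_id = uuid.uuid4().hex
        self.results[job_id] = {"status": "pending", "result": None}
        await self.queue.put((job_id, payload))
        return job_id

    def poll(self, job_id: str) -> dict:
        """What a client sees when it polls for a result."""
        return self.results[job_id]

    async def worker(self, slow_call) -> None:
        """Pull jobs forever; record success or failure per job."""
        while True:
            job_id, payload = await self.queue.get()
            try:
                result = await slow_call(payload)
                self.results[job_id] = {"status": "done", "result": result}
            except Exception as exc:  # one bad job must not kill the worker
                self.results[job_id] = {"status": "failed", "result": str(exc)}
            finally:
                self.queue.task_done()


async def fake_llm_call(payload: str) -> str:
    await asyncio.sleep(0.01)  # simulated LLM latency
    return payload.upper()


async def demo() -> list[dict]:
    jobs = JobQueue()
    workers = [asyncio.create_task(jobs.worker(fake_llm_call)) for _ in range(2)]
    job_ids = [await jobs.submit(text) for text in ("doc a", "doc b", "doc c")]
    await jobs.queue.join()  # wait until every job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return [jobs.poll(job_id) for job_id in job_ids]
```

Running `asyncio.run(demo())` returns one result record per submitted job. The API process only ever touches `submit` and `poll`, so a slow LLM call occupies a worker rather than a request thread.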
Problems encountered:
- Queue backup during LLM provider issues
- Difficult to estimate completion times
- No streaming for interactive use cases
Phase 3: Streaming with circuit breakers
- Streaming responses for interactive queries
- Circuit breakers per LLM provider
- Automatic fallback to cheaper models under load
Key patterns:
# Request routing based on complexity and latency requirements
class RequestRouter:
    def route(self, request) -> str:
        if request.requires_streaming:
            return "streaming_service"
        elif request.estimated_tokens > 10000:
            return "batch_queue"
        elif self.is_cache_candidate(request):
            return "cache_service"
        else:
            return "sync_service"
Lessons Learned
Token estimation is critical: Estimating prompt and completion tokens allows better routing, cost prediction, and capacity planning.
Model fallback is table stakes: Primary model unavailability happens. Design for it from day one.
Streaming changes everything: Once you support streaming, you need streaming-aware caching, logging, and error handling.
Observability is different: Track time-to-first-token, tokens-per-second, and completion rates, not just latency.
Cost monitoring must be real-time: Token costs can spike quickly. Monitor and alert on cost, not just errors.
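The streaming metrics named in the observability lesson can be derived from chunk arrival times. The sketch below is one possible definition, assuming you record a timestamp and token count for each streamed chunk; `StreamMetrics` and `summarize_stream` are hypothetical names, and throughput is measured only over the window after the first chunk so that queueing and prefill time do not distort it.

```python
from dataclasses import dataclass


@dataclass
class StreamMetrics:
    time_to_first_token: float  # seconds from request start to first chunk
    tokens_per_second: float    # throughput after the first chunk arrived
    total_tokens: int


def summarize_stream(
    request_start: float,
    chunk_events: list[tuple[float, int]],
) -> StreamMetrics:
    """Summarize a stream from (arrival_time_seconds, tokens_in_chunk) pairs
    given in arrival order."""
    if not chunk_events:
        raise ValueError("stream produced no chunks")
    first_time, first_tokens = chunk_events[0]
    last_time = chunk_events[-1][0]
    total_tokens = sum(tokens for _, tokens in chunk_events)
    window = last_time - first_time
    # A single-chunk stream has no generation window to measure
    tokens_per_second = (total_tokens - first_tokens) / window if window > 0 else 0.0
    return StreamMetrics(first_time - request_start, tokens_per_second, total_tokens)


metrics = summarize_stream(0.0, [(0.5, 1), (1.0, 10), (1.5, 10)])
```

In this example the first token arrives after 0.5 seconds and the remaining 20 tokens arrive over the following second. A single end-to-end latency number would hide the difference between a slow start and slow generation.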
Decision Frameworks
Sync vs Async Architecture
Testing Strategy Selection
Caching Decision Matrix
| Scenario | Cache? | Strategy |
|---|---|---|
| Temperature = 0, exact prompt match | Yes | Exact match cache |
| Temperature > 0, exact prompt match | Risky | Cache only if variation OK |
| Similar prompts, same intent | Risky | Only with very high similarity threshold |
| RAG with changing documents | Complex | Include doc versions in key |
| User-specific personalization | No | Different users need different responses |
| High-cost queries, stable inputs | Yes | Longer TTL, careful invalidation |
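The matrix can also be expressed as a small decision function that a router or cache layer consults. This is a sketch: the `CacheRequest` fields, the returned strategy labels, and the order in which the rules are checked are assumptions layered on the table, and TTL selection for high-cost queries is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheRequest:
    temperature: float
    exact_match: bool                  # True if the full prompt matches a stored key
    user_specific: bool = False        # response depends on who is asking
    uses_rag: bool = False             # response depends on retrieved documents
    variation_acceptable: bool = False # caller accepts one sampled answer for all


def cache_decision(req: CacheRequest) -> str:
    """Map a request onto a caching strategy, most restrictive rule first."""
    if req.user_specific:
        return "no_cache"
    if not req.exact_match:
        # Similar-but-not-identical prompts: risky, only with a very high threshold
        return "semantic_high_threshold"
    if req.temperature > 0 and not req.variation_acceptable:
        return "no_cache"
    if req.uses_rag:
        return "exact_match_with_doc_versions"
    return "exact_match"
```

For example, `cache_decision(CacheRequest(temperature=0.7, exact_match=True))` yields `"no_cache"`, while the same request with `variation_acceptable=True` yields `"exact_match"`.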
Key Takeaways
Design for variable latency and streaming - LLM calls take seconds to minutes. Async patterns are mandatory for scale. Implement streaming for long outputs to improve perceived responsiveness.
Embrace statistical testing over exact assertions - LLM outputs are non-deterministic with many valid variations. Use property-based tests, statistical significance, and LLM-as-judge for quality evaluation.
Log comprehensively for debugging - Non-reproducible issues require pattern analysis. Log complete prompts, responses, latency, and tokens. Build tools for pattern analysis, not just log search.
Cache conservatively - Exact-match caching for temperature=0 is safe. Include model version and prompt template version in cache keys. Semantic caching is risky and requires careful validation.
Implement circuit breakers and fallback chains - LLMs have unique failure modes. Design fallback hierarchies: primary model, fallback model, cached response, graceful error message.
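The statistical-testing takeaway can be made concrete with a confidence bound on a pass rate. The sketch below uses the Wilson score interval, which is one common choice for binomial proportions (the chapter does not prescribe a specific interval); the function name is an assumption.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    p = passes / trials
    z2 = z * z
    denominator = 1 + z2 / trials
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / denominator


bound_95 = wilson_lower_bound(95, 100)  # roughly 0.888
bound_98 = wilson_lower_bound(98, 100)  # roughly 0.930
```

With 100 trials, an observed 95% pass rate has a lower bound just under 0.89, so it would not clear a gate that requires the lower bound to stay above 90%, while 98 passes would. A test suite built this way fails on evidence of regression rather than on a single unlucky sample.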
Summary
Building backends for LLM applications requires rethinking traditional patterns:
Architecture: Design for variable latency, streaming, and graceful degradation. LLM calls can take seconds to minutes—async patterns are not optional for scale.
Testing: Embrace statistical testing. Exact assertions rarely make sense for LLM outputs. Build testing pyramids with different strategies at different layers.
Debugging: Log comprehensively, analyze patterns, and accept that some issues won’t reproduce exactly. Build tools for pattern analysis, not just log search.
Caching: Be conservative. Exact-match caching for deterministic calls is safe. Semantic caching is risky and requires careful thought.
Integration: Implement circuit breakers and retry strategies that account for LLM-specific failure modes. Have fallback strategies for when primary models are unavailable.
The key mindset shift: you’re building systems around probabilistic components. This doesn’t mean abandoning engineering rigor—it means applying that rigor with appropriate tools and expectations.
Connections to Other Chapters
- Chapter 5 (LLM Foundations) explains why LLMs are non-deterministic—understanding this is essential for debugging
- Chapter 6 (Prompt Engineering) covers prompt design that affects testability and consistency
- Chapter 7 (RAG Systems) uses the data pipelines and caching strategies discussed here
- Chapter 9 (Deployment) covers infrastructure for serving LLM applications
- Chapter 15 (MLOps & Evaluation) extends testing into continuous evaluation and monitoring
Practical Exercises
- Implement a statistical test suite: Build tests for a classification prompt that:
  - Run 100 test cases
  - Calculate accuracy with confidence intervals
  - Fail if lower confidence bound drops below 90%
  - Report per-category accuracy breakdown
- Build a debugging toolkit: Create a system that:
  - Logs complete LLM interactions (prompt, response, latency, tokens)
  - Generates reproduction scripts from log entries
  - Identifies patterns in failed requests (common keywords, time patterns)
- Implement caching with invalidation: Build a cache layer that:
  - Caches temperature=0 responses
  - Includes model version in cache key
  - Invalidates on prompt template changes
  - Tracks cache hit rates
- Design a graceful degradation system: Implement:
  - Circuit breaker for LLM API calls
  - Fallback chain (primary → fallback model → cached response → error message)
  - Metrics to track degradation events
- Build a training data pipeline: Create a pipeline that:
  - Extracts conversation data from logs
  - Validates format and quality
  - Deduplicates
  - Splits into train/val/test
  - Versions the output
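As a possible starting point for the last exercise, the sketch below shows exact-match deduplication and a deterministic split driven by a content hash. The function names and the 80/10/10 default proportions are assumptions; near-duplicate detection and versioning are left for the exercise.

```python
import hashlib
import json


def example_fingerprint(example: dict) -> str:
    """Stable content hash: the same example always yields the same value."""
    canonical = json.dumps(example, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_and_split(
    examples: list[dict],
    val_pct: int = 10,
    test_pct: int = 10,
) -> dict[str, list[dict]]:
    """Drop exact duplicates, then assign each example to a split by its hash."""
    seen: set[str] = set()
    splits: dict[str, list[dict]] = {"train": [], "val": [], "test": []}
    for example in examples:
        fingerprint = example_fingerprint(example)
        if fingerprint in seen:
            continue  # exact duplicate of an earlier example
        seen.add(fingerprint)
        # Map the hash onto 0..99 so the split depends only on content,
        # not on input order or a random seed.
        bucket = int(fingerprint[:8], 16) % 100
        if bucket < test_pct:
            splits["test"].append(example)
        elif bucket < test_pct + val_pct:
            splits["val"].append(example)
        else:
            splits["train"].append(example)
    return splits
```

Because assignment depends only on content, an example stays in the same split when the dataset is rebuilt or extended, which prevents test examples from drifting into the training set between versions.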
Self-Assessment Checkpoint
Conceptual Questions
Q1. [IC2] Why do traditional unit tests (assert expected == actual) work poorly for LLM outputs? What testing approach works better?
Answer
LLM outputs are non-deterministic and have many valid variations. “The capital of France is Paris” and “Paris is France’s capital city” are both correct but don’t match exactly. Better approaches: (1) Property-based testing: check that outputs have required properties (contains “Paris”, is valid JSON, length within range). (2) Statistical testing: run N times, check pass rate meets threshold with confidence interval. (3) LLM-as-judge: use another model to evaluate semantic correctness. (4) Behavioral testing: test the system’s behavior, not exact outputs.
Q2. [IC2] A cache stores LLM responses keyed by prompt hash. What problems might arise, and how do you address them?
Answer
Problems: (1) Model version changes: same prompt, different model = different output. Include model version in cache key. (2) Temperature > 0: non-deterministic outputs shouldn’t be cached, or the cache becomes a lottery. Only cache temp=0. (3) Stale data: if answers reference time-sensitive info, cache invalidation is needed. Add TTL. (4) Prompt template changes: code changes the prompt format but the cache has old responses. Include prompt template version in key. (5) Memory growth: the cache is unbounded. Implement LRU or size limits. (6) Cache poisoning: a bad response gets cached. Validate before caching or allow manual invalidation.
Q3. [Senior] Explain the difference between full fine-tuning and LoRA. When would you choose each?
Answer
Full fine-tuning updates all model parameters: it is computationally expensive, typically needs multiple GPUs, and creates a completely new model. LoRA adds small trainable matrices to a frozen base model: the trainable portion is much smaller (0.1-1% of parameters), can often be trained on a single GPU, and yields a small adapter that combines with the base model. Choose full fine-tuning when: you have massive data, need maximum quality, have compute budget. Choose LoRA when: limited compute, want to maintain base model benefits, need to deploy multiple specialized versions efficiently, training data is moderate-sized. Most practical scenarios favor LoRA due to cost and flexibility.
Q4. [Senior] How do you debug an LLM application where users report “wrong answers” but you can’t reproduce the issues?
Answer
Systematic approach: (1) Comprehensive logging: log full prompt, model response, all parameters, timestamps. Without logs, you’re blind. (2) Reproduction system: ability to replay the exact request with the same inputs. (3) Check for non-determinism: temperature setting, model version changes, A/B tests. (4) User context: what were the previous turns? Context matters. (5) Timing: did it work before? Check for model updates, prompt changes, data updates. (6) Pattern analysis: are failures correlated (time of day, user segment, input length, topic)? (7) Diff analysis: compare working vs. failing cases, find distinguishing features. Prevention: Build observability in from day one.
Q5. [Staff] You’re designing the backend architecture for an LLM application expecting 10M requests/day with strict latency requirements. What are the key architectural decisions?
Answer
Key decisions: (1) Caching strategy: at 10M requests, even a 20% cache hit rate saves 2M API calls/day. Cache exact matches aggressively; use semantic similarity for near-matches only with a very high threshold and validation. (2) Request routing: route simple queries to faster/cheaper models, complex to capable models. (3) Async processing: queue non-urgent requests, batch for efficiency. (4) Rate limiting & backpressure: protect against traffic spikes, graceful degradation. (5) Redundancy: multiple LLM providers, failover capability. (6) Edge caching: for global users, cache at edge for common queries. (7) Streaming: for latency perception, stream responses. (8) Cost controls: budget enforcement, alerting. (9) Observability: distributed tracing, metrics at every stage. Architecture: Queue → Router → Cache check → LLM pool → Response processor → Cache write.
Spot the Problem
Problem 1. [IC2] Test code for an LLM feature:
def test_summarization():
    result = summarize("Long article text here...")
    assert result == "Expected summary of the article."
Answer
Problems: (1) Exact string match will fail: LLM outputs vary in wording even with temp=0. (2) Single test case: doesn’t establish statistical confidence. (3) No property checks: doesn’t verify the summary is actually shorter, contains key points, etc. Better: assert len(result) < len(input) * 0.3 (compression), assert "key_topic" in result.lower() (coverage), and run multiple test cases with quality scoring.
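The property checks in this answer could be collected into one helper that reports every violated property instead of stopping at the first. This is a sketch: `check_summary_properties` is a hypothetical name, the 0.3 compression ratio mirrors the answer above, and `required_terms` is something the test author supplies per article.

```python
def check_summary_properties(
    source: str,
    summary: str,
    required_terms: list[str],
    max_ratio: float = 0.3,
) -> list[str]:
    """Return a list of violated properties; an empty list means the summary passes."""
    problems = []
    if not summary.strip():
        problems.append("summary is empty")
    if len(summary) > len(source) * max_ratio:
        problems.append(
            f"summary too long: {len(summary)} chars vs limit {int(len(source) * max_ratio)}"
        )
    lowered = summary.lower()
    for term in required_terms:
        if term.lower() not in lowered:
            problems.append(f"missing key term: {term}")
    return problems


article = "Paris is the capital of France. " * 20
good = check_summary_properties(article, "Paris is France's capital.", ["Paris"])
bad = check_summary_properties(article, "A short text.", ["Paris"])
```

A test then asserts that the returned list is empty and prints it on failure, which makes intermittent failures far easier to diagnose than a bare string mismatch.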
Problem 2. [Senior] Error handling pattern:
async def call_llm(prompt: str) -> str:
    try:
        return await client.generate(prompt)
    except Exception:
        return "An error occurred. Please try again."
Answer
Problems: (1) Catches all exceptions: hides different failure modes (rate limit, auth, timeout, content filter). (2) No logging: failures are invisible to monitoring. (3) Generic message: user can’t distinguish transient vs. permanent failure. (4) No retry: transient failures should retry automatically. (5) No circuit breaker: if service is down, keeps trying. (6) Silent failure: downstream code may not realize it got an error, not a real response. Better: Specific exception handling, structured logging, retry with backoff for transient errors, circuit breaker, clear error types returned to caller.
Problem 3. [Staff] Fine-tuning approach:
"We fine-tuned on 50K examples from production logs.
Training loss decreased nicely. We deployed it."
Answer
Problems: (1) No validation set: can’t detect overfitting. Training loss means nothing without val loss. (2) No quality evaluation: loss decreasing doesn’t mean outputs improved for your task. (3) Logs may have bias: production logs reflect what the model already does, and may not improve weaknesses. (4) No A/B test: deployed without comparing to baseline. (5) Data quality unknown: were those 50K examples actually good? PII issues? (6) No held-out test set: can’t measure true performance. Process should include: train/val/test split, task-specific evaluation metrics, comparison to baseline, staged rollout with monitoring.
Design Exercises
Exercise 1. [Senior] Design a test strategy for a RAG-powered Q&A system. The system retrieves from 100K documents and generates answers. Cover: unit tests, integration tests, evaluation metrics, regression testing, and CI/CD integration.
Guidance
Unit tests: (1) Chunking: verify chunks are correct size, overlap works. (2) Embedding: mock embedding service, verify vectors have expected dimensions. (3) Retrieval: test with known documents, verify ranking. (4) Generation: property tests on output format. Integration tests: (1) End-to-end with small doc set: known questions with known answers. (2) Latency tests: verify P95 within bounds. (3) Error handling: test behavior when retrieval fails, LLM fails. Evaluation metrics: (1) Retrieval recall@k: are relevant docs retrieved? (2) Answer quality: LLM-as-judge for faithfulness, relevance. (3) Factual accuracy on golden set. Regression: Golden set of critical Q&A pairs, must pass 100%. Run on every PR. CI/CD: Fast unit tests on every commit, integration tests on PR, full evaluation nightly.
Exercise 2. [Staff] Design a system for managing training data for fine-tuning across multiple LLM applications. Requirements: version control, quality validation, deduplication, PII detection, and the ability to create task-specific datasets from a common pool.
Guidance
Architecture: (1) Raw data lake: all collected data, immutable, versioned (S3 + metadata DB). (2) Processing pipeline: PII detection/redaction, quality scoring, deduplication. (3) Curated datasets: versioned, task-specific, derived from pool. (4) Lineage tracking: which raw examples went into which datasets. Components: (1) Ingestion: from logs, human annotation, synthetic generation. Schema validation on entry. (2) PII detection: regex patterns, NER models, human review for high-risk. (3) Quality scoring: automated (length, formatting) + sampled human review. (4) Deduplication: embedding similarity, exact match, near-duplicate detection. (5) Dataset builder: queries pool by criteria, creates versioned snapshot. (6) Export: formats for different training frameworks. Governance: Access controls, audit logging, retention policies, deletion capability for GDPR.
Further Reading
Essential
- Kwon et al. (2023), “PagedAttention” - The paper behind vLLM’s memory efficiency. Essential for understanding inference optimization.
- Kleppmann (2017), “Designing Data-Intensive Applications” - Foundation for data pipeline design. Backend engineers’ bible.
- Nygard (2018), “Release It!” - Circuit breakers, bulkheads, and other stability patterns that carry over directly to LLM services.
Deep Dives
- Hu et al. (2021), “LoRA” - Parameter-efficient fine-tuning foundation. Critical if you’re fine-tuning.
- Zheng et al. (2023), “Judging LLM-as-a-Judge” - Best practices for using LLMs as evaluators.
Practical Resources
- OpenAI Cookbook - Production patterns for building with LLM APIs.
- Anthropic’s Production Best Practices - Guidelines for reliable Claude integration.