Appendix F: Debugging & Troubleshooting Guide
This appendix provides systematic approaches to diagnosing and fixing common issues in AI/ML systems. Each section covers a specific problem category with symptoms, diagnostic steps, and solutions.
Quick Diagnostic Flowchart
Section 1: Output Quality Issues
Problem: RAG System Returns Irrelevant Results
Symptoms:
- Retrieved documents don’t match the query
- Good documents exist but aren’t retrieved
- Retrieval seems random
Diagnostic Steps:
# Step 1: Check embedding quality
import numpy as np

def diagnose_embeddings(query: str, expected_doc: str, embedder):
    """Check if embeddings capture semantic similarity."""
    query_emb = np.asarray(embedder.encode(query))
    doc_emb = np.asarray(embedder.encode(expected_doc))
    # Cosine similarity: normalize so the score does not depend on vector length
    similarity = np.dot(query_emb, doc_emb) / (
        np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
    )
    print(f"Query-Doc similarity: {similarity:.3f}")
    # Rule of thumb: > 0.7 for clearly related content (thresholds vary by model)
    if similarity < 0.5:
        print("WARNING: Low similarity - embedder may not capture relationship")
        print("Consider: Different embedding model, query expansion, or HyDE")

# Step 2: Check what IS being retrieved
def diagnose_retrieval(query: str, retriever, expected_doc_id: str):
    """See what's actually being retrieved."""
    results = retriever.search(query, top_k=20)
    print(f"Query: {query}")
    print("\nTop 20 results:")
    for i, r in enumerate(results):
        marker = "<<<" if r['id'] == expected_doc_id else ""
        print(f"  {i+1}. [{r['score']:.3f}] {r['id'][:50]} {marker}")
    if expected_doc_id not in [r['id'] for r in results]:
        print("\nExpected doc not in top 20!")
        print("Possible issues:")
        print("  - Document not indexed")
        print("  - Chunking split relevant content")
        print("  - Query-document vocabulary mismatch")

# Step 3: Inspect chunks around the expected document
def diagnose_chunking(doc_id: str, vector_store):
    """Check how a document was chunked."""
    chunks = vector_store.get_by_doc_id(doc_id)
    print(f"Document {doc_id} has {len(chunks)} chunks:")
    for chunk in chunks:
        print(f"\n--- Chunk {chunk['id']} ---")
        print(chunk['text'][:200] + "...")
Common Causes & Solutions:
| Cause | Diagnostic Sign | Solution |
|---|---|---|
| Wrong embedding model | Low similarity scores for related content | Try e5-large-v2, bge-large, or domain-specific models |
| Chunks too small | Relevant info split across chunks | Increase chunk size, use overlap |
| Chunks too large | Chunks contain mixed topics | Decrease chunk size, use semantic boundaries |
| Query-doc vocabulary mismatch | Query uses different terms than docs | Add query expansion, use HyDE |
| Missing documents | Expected doc not in index | Re-run indexing, check filters |
Reference: Chapter 7 (Chunking, Embeddings, Hybrid Search)
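The "increase chunk size, use overlap" fix from the table can be sketched as a sliding window over tokens. This is a minimal illustration, not the chunker from Chapter 7: the function name `chunk_with_overlap`, the whitespace tokenization in the usage line, and the default sizes are all assumptions chosen for the example.

```python
def chunk_with_overlap(
    words: list[str], chunk_size: int = 200, overlap: int = 50
) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the document
    return chunks


# Hypothetical usage: naive whitespace tokenization of one document
text = "one two three four five six seven eight nine ten"
for chunk in chunk_with_overlap(text.split(), chunk_size=4, overlap=2):
    print(" ".join(chunk))
```

Because consecutive windows share tokens, a sentence that straddles a boundary appears whole in at least one chunk, at the cost of a larger index.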
Problem: Model Hallucinations
Symptoms:
- Model makes up facts not in the context
- Confident but wrong answers
- Plausible-sounding fabrications
Diagnostic Steps:
# Step 1: Check if relevant context was provided
# extract_claims and is_supported_by_context are placeholders: back them
# with an NLI model or an LLM judge of your choice.
def diagnose_hallucination(query: str, response: str, retrieved_chunks: list):
    """Check if response is grounded in retrieved context."""
    context = "\n".join([c['text'] for c in retrieved_chunks])
    # Look for claims in response
    claims = extract_claims(response)  # Use NLI model or LLM
    grounded_claims = []
    ungrounded_claims = []
    for claim in claims:
        if is_supported_by_context(claim, context):
            grounded_claims.append(claim)
        else:
            ungrounded_claims.append(claim)
    print(f"Grounded claims: {len(grounded_claims)}")
    print(f"Ungrounded claims: {len(ungrounded_claims)}")
    if ungrounded_claims:
        print("\nUngrounded claims:")
        for claim in ungrounded_claims:
            print(f"  - {claim}")

# Step 2: Check prompt for grounding instructions
def check_prompt_grounding(system_prompt: str):
    """Verify prompt encourages grounding."""
    grounding_phrases = [
        "based on the provided",
        "according to the context",
        "only use information from",
        "if the answer is not in",
        "say you don't know"
    ]
    found = [p for p in grounding_phrases if p.lower() in system_prompt.lower()]
    print(f"Found {len(found)}/{len(grounding_phrases)} grounding phrases")
    if len(found) < 2:
        print("WARNING: Prompt may not encourage grounding")
Common Causes & Solutions:
| Cause | Diagnostic Sign | Solution |
|---|---|---|
| No grounding instructions | Prompt doesn’t mention “based on context” | Add explicit grounding instructions |
| Irrelevant context | Retrieved docs don’t help | Fix retrieval (see above) |
| Model overconfidence | Claims made without hedging | Add “if unsure, say so” to prompt |
| Temperature too high | Random variation in answers | Lower temperature (0.3-0.5) |
| Context too long | Model loses focus | Reduce context, put key info first |
Reference: Chapter 6 (Prompt Engineering), Chapter 17 (Evaluation)
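The diagnostic above leaves `extract_claims` and `is_supported_by_context` undefined. As a rough stand-in for a first pass, the sketch below treats each sentence as a claim and calls it supported when most of its words also occur in the context. This lexical-overlap heuristic is an assumption made for illustration; it misses paraphrases and negation, so an NLI model or LLM judge remains the better check. The 0.8 threshold is arbitrary.

```python
import re


def _content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens longer than two characters."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 2}


def extract_claims(response: str) -> list[str]:
    """Naive claim extraction: one claim per sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if s]


def is_supported_by_context(claim: str, context: str, threshold: float = 0.8) -> bool:
    """True when at least `threshold` of the claim's tokens appear in the context."""
    claim_tokens = _content_tokens(claim)
    if not claim_tokens:
        return True  # nothing checkable in this claim
    overlap = len(claim_tokens & _content_tokens(context)) / len(claim_tokens)
    return overlap >= threshold


context = "The Eiffel Tower is located in Paris and was completed in 1889."
response = "The Eiffel Tower was completed in 1889. It was designed by Gustave Eiffel."
for claim in extract_claims(response):
    print(is_supported_by_context(claim, context), "-", claim)
```

The second sentence is flagged because "designed" and "Gustave" never appear in the retrieved context, which is the pattern to look for when triaging suspected fabrications.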
Problem: Inconsistent Output Format
Symptoms:
- JSON parsing fails intermittently
- Output structure varies between requests
- Missing required fields
Diagnostic Steps:
# Step 1: Check if using structured outputs
import json

def diagnose_format_issues(responses: list[str], expected_schema: dict):
    """Analyze format consistency across responses."""
    if not responses:
        print("No responses to analyze")
        return
    parse_results = []
    for i, response in enumerate(responses):
        try:
            parsed = json.loads(response)
            # Validate against schema; non-object JSON counts as missing every field
            fields = parsed if isinstance(parsed, dict) else {}
            missing_fields = [
                k for k in expected_schema.get('required', [])
                if k not in fields
            ]
            parse_results.append({
                'index': i,
                'parsed': True,
                'missing_fields': missing_fields
            })
        except json.JSONDecodeError as e:
            parse_results.append({
                'index': i,
                'parsed': False,
                'error': str(e),
                'raw': response[:200]
            })
    success_rate = sum(1 for r in parse_results if r['parsed']) / len(parse_results)
    print(f"Parse success rate: {success_rate:.1%}")
    failures = [r for r in parse_results if not r['parsed']]
    if failures:
        print(f"\nFailed responses ({len(failures)}):")
        for f in failures[:3]:
            print(f"  Error: {f['error']}")
            print(f"  Raw: {f['raw']}...")
Common Causes & Solutions:
| Cause | Diagnostic Sign | Solution |
|---|---|---|
| Not using structured outputs | Parsing fails randomly | Use JSON mode or function calling |
| Schema not in prompt | Missing/wrong fields | Include explicit schema in prompt |
| Temperature too high | High variance in format | Lower temperature for structured output |
| Conflicting instructions | Mixed format signals | Simplify prompt, one format instruction |
Reference: Chapter 6 (Structured Outputs), Chapter 16 (Testing LLM Apps)
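Intermittent parse failures often come from the model wrapping valid JSON in prose or a markdown fence. While moving to JSON mode or function calling, a tolerant parser can help separate "wrapped but valid" from "actually malformed". The sketch below is one possible fallback; the name `extract_json` and the greedy brace-matching strategy are assumptions for illustration, and it only recovers a single top-level JSON object.

```python
import json
import re
from typing import Any, Optional


def extract_json(response: str) -> Optional[Any]:
    """Parse strictly first; otherwise try the span from the first '{' to the last '}'."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass

    # Greedy match: first opening brace through the final closing brace
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # truly malformed; log the raw text for inspection


print(extract_json('Sure! Here is the result: {"status": "ok", "items": [1, 2]}'))
print(extract_json("I could not produce JSON."))
```

If most failures are recovered by this fallback, the prompt is leaking extra text around the payload; if they are not, the model is producing invalid JSON and structured outputs are the fix.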
Section 2: Performance Issues
Problem: High Inference Latency
Symptoms:
- Time to first token (TTFT) > 2 seconds
- Total response time > 10 seconds
- User-facing timeouts
Diagnostic Steps:
import time
from dataclasses import dataclass

@dataclass
class LatencyBreakdown:
    embedding_ms: float = 0
    retrieval_ms: float = 0
    context_assembly_ms: float = 0
    llm_ttft_ms: float = 0
    llm_generation_ms: float = 0
    total_ms: float = 0

def profile_rag_latency(query: str, rag_pipeline) -> LatencyBreakdown:
    """Profile latency at each stage."""
    breakdown = LatencyBreakdown()
    # Embedding
    start = time.perf_counter()
    query_embedding = rag_pipeline.embed(query)
    breakdown.embedding_ms = (time.perf_counter() - start) * 1000
    # Retrieval
    start = time.perf_counter()
    chunks = rag_pipeline.retrieve(query_embedding)
    breakdown.retrieval_ms = (time.perf_counter() - start) * 1000
    # Context assembly
    start = time.perf_counter()
    context = rag_pipeline.assemble_context(chunks)
    breakdown.context_assembly_ms = (time.perf_counter() - start) * 1000
    # LLM generation (with streaming to measure TTFT)
    start = time.perf_counter()
    first_token_time = None
    for token in rag_pipeline.generate_stream(query, context):
        if first_token_time is None:
            first_token_time = time.perf_counter()
            breakdown.llm_ttft_ms = (first_token_time - start) * 1000
    breakdown.llm_generation_ms = (time.perf_counter() - start) * 1000
    # TTFT is already included in llm_generation_ms, so it is not added again
    breakdown.total_ms = (
        breakdown.embedding_ms +
        breakdown.retrieval_ms +
        breakdown.context_assembly_ms +
        breakdown.llm_generation_ms
    )
    return breakdown

# Run and print breakdown (pipeline is your own RAG pipeline object)
breakdown = profile_rag_latency("How do I deploy a model?", pipeline)
print(f"""
Latency Breakdown:
  Embedding:        {breakdown.embedding_ms:>7.1f} ms
  Retrieval:        {breakdown.retrieval_ms:>7.1f} ms
  Context Assembly: {breakdown.context_assembly_ms:>7.1f} ms
  LLM TTFT:         {breakdown.llm_ttft_ms:>7.1f} ms
  LLM Total:        {breakdown.llm_generation_ms:>7.1f} ms
  ─────────────────────────────
  Total:            {breakdown.total_ms:>7.1f} ms
""")
Common Causes & Solutions:
| Stage | Symptom | Cause | Solution |
|---|---|---|---|
| Embedding | >100ms | Large model, CPU inference | Use smaller model, GPU, or batch |
| Retrieval | >200ms | Unoptimized index | Add HNSW index, increase replicas |
| Context | >50ms | Complex assembly | Pre-compute, cache context |
| LLM TTFT | >2s | Model loading, queue | Warm instances, autoscale |
| LLM Generation | >10s | Too many tokens | Reduce max_tokens, shorter context |
Reference: Chapter 9 (Deployment), Chapter 24 (Performance Engineering)
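A single profiling run can be misleading, because latency is usually dominated by its tail. One way to compare measurements against the thresholds in the table is to collect several samples per stage and look at percentiles. The helper below is a minimal sketch using only the standard library; the name `summarize_latencies` and the choice of p50/p95 are assumptions for illustration.

```python
import statistics


def summarize_latencies(samples_ms: list[float]) -> dict[str, float]:
    """Return median, 95th percentile, and max for a list of latency samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples_ms)}


# Hypothetical samples: mostly fast requests with two slow outliers
samples = [120.0, 130.0, 125.0, 140.0, 118.0, 122.0, 900.0, 135.0, 128.0, 1500.0]
summary = summarize_latencies(samples)
print({name: round(value, 1) for name, value in summary.items()})
```

In practice the samples could be the `total_ms` or `llm_ttft_ms` fields gathered from repeated `profile_rag_latency` calls. A large gap between p50 and p95 points to queuing or cold starts more than to a uniformly slow stage.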
Problem: Low Throughput
Symptoms:
- Queue depth keeps growing
- Can’t handle expected request rate
- GPU utilization < 50%
Diagnostic Steps:
# Step 1: Check GPU utilization
import subprocess
import numpy as np

def check_gpu_utilization():
    """Check NVIDIA GPU stats."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used,memory.total',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    for i, line in enumerate(result.stdout.strip().split('\n')):
        util, mem_used, mem_total = line.split(', ')
        print(f"GPU {i}: {util}% utilization, {mem_used}/{mem_total} MB memory")
        if int(util) < 50:
            print("  WARNING: Low utilization - batching may help")
        if int(mem_used) < int(mem_total) * 0.5:
            print("  WARNING: Memory underutilized - larger batches possible")

# Step 2: Check batch sizes
def analyze_batching(request_logs: list[dict]):
    """Analyze actual batch sizes being processed."""
    batch_sizes = [log['batch_size'] for log in request_logs]
    print("Batch size stats:")
    print(f"  Mean: {np.mean(batch_sizes):.1f}")
    print(f"  Max: {max(batch_sizes)}")
    print(f"  % size 1: {sum(1 for b in batch_sizes if b == 1) / len(batch_sizes):.1%}")
    if np.mean(batch_sizes) < 4:
        print("  WARNING: Low average batch size - enable continuous batching")
Common Causes & Solutions:
| Cause | Diagnostic Sign | Solution |
|---|---|---|
| No batching | Batch size = 1 | Enable continuous batching (vLLM default) |
| Small KV cache | Memory available but not used | Increase --gpu-memory-utilization |
| Wrong tensor parallelism | Multi-GPU but sequential | Adjust --tensor-parallel-size |
| Long context sequences | High memory per request | Reduce context, use PagedAttention |
| Model too large | OOM or swapping | Quantize model or use smaller variant |
Reference: Chapter 9 (Batching), Chapter 24 (GPU Optimization)
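A growing queue means requests arrive faster than the system completes them. Little's Law (average requests in flight = arrival rate × average latency) gives a quick estimate of how much concurrency a target request rate needs. The sketch below just applies that formula; the function names and the example numbers are illustrative assumptions.

```python
def required_concurrency(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Average number of in-flight requests needed to keep up (Little's Law)."""
    return arrival_rate_rps * avg_latency_s


def max_sustainable_rps(concurrency: int, avg_latency_s: float) -> float:
    """Highest arrival rate a fixed number of concurrent slots can absorb."""
    if avg_latency_s <= 0:
        raise ValueError("avg_latency_s must be positive")
    return concurrency / avg_latency_s


# Example: 20 requests/s at 2.5 s average latency needs about 50 slots in flight
print(required_concurrency(20, 2.5))  # 50.0
# Example: 8 concurrent slots at 2.0 s each sustain at most 4 requests/s
print(max_sustainable_rps(8, 2.0))    # 4.0
```

If the required concurrency exceeds what the server batches together, the queue grows no matter how fast each individual request is, which is why batching appears first in the table.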
Problem: Costs Are Too High
Symptoms:
- API bills higher than expected
- Self-hosted GPU costs exceed budget
- Cost per query not sustainable
Diagnostic Steps:
import numpy as np

def analyze_llm_costs(usage_logs: list[dict], pricing: dict):
    """Analyze LLM API costs."""
    cost_by_endpoint = {}
    cost_by_user = {}
    for log in usage_logs:
        # Calculate cost (pricing is expressed per 1,000 tokens)
        input_cost = log['input_tokens'] * pricing['input_per_1k'] / 1000
        output_cost = log['output_tokens'] * pricing['output_per_1k'] / 1000
        total_cost = input_cost + output_cost
        # Aggregate
        endpoint = log['endpoint']
        cost_by_endpoint[endpoint] = cost_by_endpoint.get(endpoint, 0) + total_cost
        user = log.get('user_id', 'unknown')
        cost_by_user[user] = cost_by_user.get(user, 0) + total_cost
    # Report
    print("Cost by endpoint:")
    for endpoint, cost in sorted(cost_by_endpoint.items(), key=lambda x: -x[1])[:10]:
        print(f"  {endpoint}: ${cost:.2f}")
    print("\nTop users by cost:")
    for user, cost in sorted(cost_by_user.items(), key=lambda x: -x[1])[:10]:
        print(f"  {user}: ${cost:.2f}")
    # Check for optimization opportunities
    avg_input_tokens = np.mean([log['input_tokens'] for log in usage_logs])
    avg_output_tokens = np.mean([log['output_tokens'] for log in usage_logs])
    print(f"\nAverage tokens: {avg_input_tokens:.0f} input, {avg_output_tokens:.0f} output")
    if avg_input_tokens > 2000:
        print("  TIP: High input tokens - consider context compression")
    if avg_output_tokens > 500:
        print("  TIP: High output tokens - consider max_tokens limits")
Common Causes & Solutions:
| Cause | Diagnostic Sign | Solution |
|---|---|---|
| No caching | Same queries repeated | Add semantic caching |
| Over-sized model | Using GPT-5 for simple tasks | Route simple queries to smaller models |
| Long contexts | >4K tokens average | Compress context, better retrieval |
| No max_tokens | Verbose outputs | Set appropriate max_tokens |
| Wasteful retries | Many failed requests | Fix root cause, add backoff |
Reference: Chapter 29 (Cost Engineering)
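Before investing in semantic caching, it can be worth measuring how many queries repeat verbatim. The sketch below is a deliberately simpler exact-match cache keyed on a normalized query string; it is not semantic caching, which would compare embeddings instead. The class name `ExactMatchCache` and its methods are assumptions for illustration.

```python
import hashlib
from typing import Callable


class ExactMatchCache:
    """Cache responses for queries that are identical after normalization."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one entry
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(query)  # the expensive LLM call goes here
        self._store[key] = value
        return value


cache = ExactMatchCache()
fake_llm = lambda q: f"answer to: {q}"  # stand-in for a real API call
cache.get_or_compute("How do I deploy a model?", fake_llm)
cache.get_or_compute("  how do I deploy a MODEL? ", fake_llm)
print(f"hits={cache.hits}, misses={cache.misses}")
```

A high hit rate on production logs suggests caching will pay off immediately; a low one suggests the repeated work is paraphrased queries, where embedding-based matching is needed.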
Section 3: Errors and Failures
Problem: Out of Memory (OOM) Errors
Symptoms:
- CUDA out of memory errors
- Process killed by OOM killer
- Gradual memory growth leading to crash
Diagnostic Steps:
# Step 1: Monitor memory over time
import torch

def log_memory_stats():
    """Log current GPU memory usage."""
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            reserved = torch.cuda.memory_reserved(i) / 1024**3
            total = torch.cuda.get_device_properties(i).total_memory / 1024**3
            print(f"GPU {i}: {allocated:.1f}GB allocated, {reserved:.1f}GB reserved, {total:.1f}GB total")
            if allocated / total > 0.9:
                print("  WARNING: Near OOM threshold")

# Step 2: Check for memory leaks
def check_memory_leak(func, num_iterations: int = 100):
    """Check if a function leaks memory."""
    initial_memory = torch.cuda.memory_allocated()
    for i in range(num_iterations):
        func()
        if i % 10 == 0:
            current = torch.cuda.memory_allocated()
            growth = (current - initial_memory) / 1024**2
            print(f"Iteration {i}: {growth:.1f} MB growth")
    final_memory = torch.cuda.memory_allocated()
    total_growth = (final_memory - initial_memory) / 1024**2
    print(f"\nTotal memory growth: {total_growth:.1f} MB over {num_iterations} iterations")
    if total_growth > 100:
        print("WARNING: Possible memory leak detected")
Common Causes & Solutions:
| Cause | Diagnostic Sign | Solution |
|---|---|---|
| KV cache too large | OOM on long sequences | Reduce max_model_len, enable PagedAttention |
| Batch size too large | OOM on high load | Reduce batch size, enable continuous batching |
| Memory leak | Gradual growth | Clear caches, check for tensor accumulation |
| Model too large | OOM on load | Quantize (INT8/INT4), use smaller model |
| Multiple models | Combined size exceeds memory | Model sharing, offload unused models |
Reference: Chapter 24 (Memory Optimization), Chapter 9 (Quantization)
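To judge whether the KV cache is the likely cause of an OOM, a back-of-the-envelope estimate helps: each token stores one key and one value vector per layer and per KV head. The sketch below applies that formula. The function name and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not measurements of a specific model; models using grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Estimate KV cache size: 2 (K and V) x layers x heads x dim x tokens x bytes."""
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )


# Illustrative config: 32 layers, 32 KV heads, head_dim 128, fp16
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(f"{per_token / 1024**2:.2f} MiB per token")       # 0.50 MiB per token
print(f"{full / 1024**3:.2f} GiB for 4096 tokens")      # 2.00 GiB for 4096 tokens
```

Multiplying the per-sequence figure by the number of concurrent sequences shows quickly whether reducing the maximum sequence length or the batch size is the more effective lever.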
Problem: API Rate Limits and Failures
Symptoms:
- 429 (Too Many Requests) errors
- 503 (Service Unavailable) errors
- Intermittent failures under load
Diagnostic Steps:
import numpy as np
from collections import defaultdict

def analyze_api_errors(error_logs: list[dict]):
    """Analyze API error patterns."""
    errors_by_type = defaultdict(list)
    errors_by_hour = defaultdict(int)
    for log in error_logs:
        errors_by_type[log['status_code']].append(log)
        hour = log['timestamp'].hour  # expects a datetime object
        errors_by_hour[hour] += 1
    print("Errors by status code:")
    for code, errors in sorted(errors_by_type.items()):
        print(f"  {code}: {len(errors)} errors")
        if code == 429:
            # Check rate limit headers
            retry_afters = [e.get('retry_after', 0) for e in errors]
            print(f"    Avg retry-after: {np.mean(retry_afters):.1f}s")
        if code == 503:
            print("    Likely provider overload")
    print("\nErrors by hour:")
    for hour in range(24):
        count = errors_by_hour[hour]
        bar = "█" * (count // 10)
        print(f"  {hour:02d}:00 - {bar} ({count})")
Retry Strategy Implementation:
import asyncio
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)

# RateLimitError and ServiceUnavailableError stand for the exception types
# raised by your client SDK; import them from that library.

class RetryableAPIError(Exception):
    """Marks transient failures that are safe to retry."""

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type(RetryableAPIError)
)
async def call_llm_with_retry(client, request):
    """Call LLM API with exponential backoff."""
    try:
        return await client.generate(request)
    except RateLimitError as e:
        raise RetryableAPIError(f"Rate limited: {e}") from e
    except ServiceUnavailableError as e:
        raise RetryableAPIError(f"Service unavailable: {e}") from e
    # Any other exception is treated as non-transient and propagates without retry
Common Causes & Solutions:
| Error | Cause | Solution |
|---|---|---|
| 429 | Exceeding rate limits | Add rate limiting, request quota increase |
| 503 | Provider overload | Retry with backoff, add fallback provider |
| Timeout | Long generation, network issues | Increase timeout, add streaming |
| 401 | Invalid/expired API key | Rotate keys, check expiration |
| 400 | Invalid request | Validate inputs, check for changes in API |
Reference: Chapter 28 (Reliability Engineering)
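The "add rate limiting" fix for 429 errors usually means throttling on the client side so requests never exceed the provider's quota. A token bucket is one common way to do that: tokens refill at a steady rate, and each request spends one. The class below is a minimal single-threaded sketch; the name `TokenBucket`, its parameters, and the injectable clock are assumptions for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False when the caller should wait."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_s=5, capacity=10)
allowed = sum(bucket.try_acquire() for _ in range(25))
print(f"{allowed} of 25 back-to-back requests allowed")
```

Requests that are refused can be queued or delayed locally, which is cheaper than sending them and waiting out the provider's retry-after window.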
Problem: Agent Gets Stuck in Loops
Symptoms:
- Agent repeats same action multiple times
- Token usage spikes without progress
- Tasks never complete
Diagnostic Steps:
def analyze_agent_trace(trace: list[dict]):
    """Analyze agent execution trace for issues."""
    # Check for repeated actions
    action_sequence = [step['action'] for step in trace]
    # Find repeating patterns (a window immediately followed by an identical window)
    for pattern_len in range(1, 5):
        # +1 so the final pair of windows at the end of the trace is also checked
        for i in range(len(action_sequence) - pattern_len * 2 + 1):
            pattern = tuple(action_sequence[i:i+pattern_len])
            next_pattern = tuple(action_sequence[i+pattern_len:i+pattern_len*2])
            if pattern == next_pattern:
                print(f"WARNING: Repeated pattern detected at step {i}: {pattern}")
    # Check for increasing token usage
    if len(trace) > 5:
        recent_tokens = [step['tokens'] for step in trace[-5:]]
        if all(recent_tokens[i] >= recent_tokens[i-1] for i in range(1, len(recent_tokens))):
            print("WARNING: Token usage monotonically increasing - possible context explosion")
    # Check for lack of progress
    if len(trace) > 10:
        recent_observations = [step.get('observation', '') for step in trace[-5:]]
        if len(set(recent_observations)) == 1:
            print("WARNING: Same observation repeated - agent may be stuck")
Common Causes & Solutions:
| Cause | Diagnostic Sign | Solution |
|---|---|---|
| No termination condition | Runs forever | Add max_steps, explicit “done” action |
| Bad tool output | Same error repeated | Improve error messages, add error handling |
| Context growing | Token count increasing | Summarize history, limit context |
| Conflicting instructions | Oscillating actions | Clarify system prompt |
| Missing capability | Keeps trying impossible action | Add needed tools or handle gracefully |
Reference: Chapter 8 (Agent Architectures, Safety)
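The first fix in the table, a hard termination condition, can be enforced outside the model so that a confused agent cannot run indefinitely. The sketch below combines a step budget with a cap on identical repeated actions. The class name `LoopGuard`, its limits, and the shape of the agent loop in the usage example are assumptions for illustration, not a specific framework's API.

```python
from collections import Counter
from typing import Optional


class LoopGuard:
    """Stop an agent after too many steps or too many identical actions."""

    def __init__(self, max_steps: int = 20, max_repeats: int = 3) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self._seen: Counter = Counter()

    def check(self, action: str, action_input: str) -> Optional[str]:
        """Return a stop reason, or None if the agent may continue."""
        self.steps += 1
        if self.steps > self.max_steps:
            return f"max_steps ({self.max_steps}) exceeded"
        self._seen[(action, action_input)] += 1
        if self._seen[(action, action_input)] > self.max_repeats:
            return f"action {action!r} repeated more than {self.max_repeats} times"
        return None


# Hypothetical agent loop in which the agent keeps issuing the same search
guard = LoopGuard(max_steps=10, max_repeats=2)
for _ in range(10):
    reason = guard.check("search", "deploy model")
    if reason is not None:
        print(f"Stopping agent: {reason}")
        break
```

Returning the stop reason, instead of raising, lets the caller feed it back to the agent as an observation or surface it to the user as a partial result.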
Debugging Checklists
Pre-Deployment Checklist
□ Unit tests pass
□ Integration tests pass
□ Load testing completed
□ Error handling tested
□ Monitoring/alerting configured
□ Rollback procedure documented
□ Rate limits configured
□ Cost projections validated
□ Security review completed
□ Documentation updated
Incident Response Checklist
□ Identify the symptom (what's broken?)
□ Check monitoring dashboards
□ Check recent deployments/changes
□ Reproduce the issue
□ Isolate the component
□ Check logs for errors
□ Test fix in staging
□ Deploy fix with monitoring
□ Document root cause
□ Add tests to prevent recurrence
Model Update Checklist
□ Run evaluation suite on new model
□ Compare metrics to baseline
□ Test with production-like traffic
□ Check output format compatibility
□ Verify prompt compatibility
□ Update documentation
□ Plan staged rollout
□ Monitor closely after deployment
□ Have rollback ready
Common Error Messages and Solutions
| Error Message | Likely Cause | Solution |
|---|---|---|
| CUDA out of memory | GPU memory exhausted | Reduce batch size, quantize, use smaller model |
| Connection reset by peer | API timeout/overload | Add retry logic, increase timeout |
| Invalid API key | Auth issue | Check key, rotation, permissions |
| Context length exceeded | Input too long | Truncate input, use smaller chunks |
| Model not found | Wrong model name or not loaded | Check model availability, warm up |
| Rate limit exceeded | Too many requests | Add rate limiting, request quota increase |
| JSON decode error | Malformed model output | Use structured outputs, add validation |
| Timeout waiting for response | Slow generation | Increase timeout, use streaming |
Getting Help
When debugging complex issues:
- Reproduce minimally: Create the smallest example that shows the issue
- Check logs thoroughly: Often the real error is buried in logs
- Binary search: Disable components to isolate the problem
- Compare to working state: What changed since it last worked?
- Search issues: GitHub issues, Stack Overflow, forums
- Ask with context: Include error messages, code, and what you’ve tried
Useful Resources:
- vLLM GitHub Issues: github.com/vllm-project/vllm/issues
- HuggingFace Forums: discuss.huggingface.co
- LangChain Community (Slack): langchain.com/join-community
- r/MachineLearning: reddit.com/r/MachineLearning