Appendix F: Debugging & Troubleshooting Guide

This appendix provides systematic approaches to diagnosing and fixing common issues in AI/ML systems. Each section covers a specific problem category with symptoms, diagnostic steps, and solutions.

Quick Diagnostic Flowchart

[Figure: AI System Debugging Flowchart]

Section 1: Output Quality Issues

Problem: RAG System Returns Irrelevant Results

Symptoms:

Retrieved documents don’t match the query
Good documents exist but aren’t retrieved
Retrieval seems random

Diagnostic Steps:

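# Step 1: Check embedding quality
import numpy as np

def diagnose_embeddings(query: str, expected_doc: str, embedder):
    """Check if embeddings capture semantic similarity."""
    query_emb = np.asarray(embedder.encode(query), dtype=float)
    doc_emb = np.asarray(embedder.encode(expected_doc), dtype=float)

    # Cosine similarity; a raw dot product is only comparable for unit-length vectors
    similarity = np.dot(query_emb, doc_emb) / (
        np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
    )
    print(f"Query-Doc similarity: {similarity:.3f}")

    # Rule of thumb: clearly related content often scores above 0.7.
    # Score ranges vary by embedding model, so calibrate on known related pairs.
    if similarity < 0.5:
        print("WARNING: Low similarity - embedder may not capture relationship")
        print("Consider: Different embedding model, query expansion, or HyDE")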

# Step 2: Check what IS being retrieved
def diagnose_retrieval(query: str, retriever, expected_doc_id: str):
    """See what's actually being retrieved."""
    results = retriever.search(query, top_k=20)

    print(f"Query: {query}")
    print(f"\nTop 20 results:")
    for i, r in enumerate(results):
        marker = "<<<" if r['id'] == expected_doc_id else ""
        print(f"  {i+1}. [{r['score']:.3f}] {r['id'][:50]} {marker}")

    if expected_doc_id not in [r['id'] for r in results]:
        print(f"\nExpected doc not in top 20!")
        print("Possible issues:")
        print("  - Document not indexed")
        print("  - Chunking split relevant content")
        print("  - Query-document vocabulary mismatch")

# Step 3: Inspect chunks around the expected document
def diagnose_chunking(doc_id: str, vector_store):
    """Check how a document was chunked."""
    chunks = vector_store.get_by_doc_id(doc_id)

    print(f"Document {doc_id} has {len(chunks)} chunks:")
    for chunk in chunks:
        print(f"\n--- Chunk {chunk['id']} ---")
        print(chunk['text'][:200] + "...")

Common Causes & Solutions:

Cause	Diagnostic Sign	Solution
Wrong embedding model	Low similarity scores for related content	Try `e5-large-v2`, `bge-large`, or domain-specific models
Chunks too small	Relevant info split across chunks	Increase chunk size, use overlap
Chunks too large	Chunks contain mixed topics	Decrease chunk size, use semantic boundaries
Query-doc vocabulary mismatch	Query uses different terms than docs	Add query expansion, use HyDE
Missing documents	Expected doc not in index	Re-run indexing, check filters
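
The "use overlap" fix in the table can be sketched as follows. This is a minimal word-based splitter, assuming whitespace tokenization is good enough for a first experiment; the function name `chunk_with_overlap` and its default sizes are illustrative, not part of any library.

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

With overlap, a sentence that straddles a chunk boundary appears whole in at least one chunk as long as it is shorter than `overlap` words. Production splitters usually count model tokens rather than words and prefer sentence or section boundaries.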

Reference: Chapter 7 (Chunking, Embeddings, Hybrid Search)

Problem: Model Hallucinations

Symptoms:

Model makes up facts not in the context
Confident but wrong answers
Plausible-sounding fabrications

Diagnostic Steps:

# Step 1: Check if relevant context was provided
def diagnose_hallucination(query: str, response: str, retrieved_chunks: list):
    """Check if response is grounded in retrieved context."""
    context = "\n".join([c['text'] for c in retrieved_chunks])

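    # Split the response into individual claims.
    # extract_claims and is_supported_by_context are placeholders you must
    # supply, e.g. backed by an NLI model or an LLM judge.
    claims = extract_claims(response)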

    grounded_claims = []
    ungrounded_claims = []

    for claim in claims:
        if is_supported_by_context(claim, context):
            grounded_claims.append(claim)
        else:
            ungrounded_claims.append(claim)

    print(f"Grounded claims: {len(grounded_claims)}")
    print(f"Ungrounded claims: {len(ungrounded_claims)}")

    if ungrounded_claims:
        print("\nUngrounded claims:")
        for claim in ungrounded_claims:
            print(f"  - {claim}")

# Step 2: Check prompt for grounding instructions
def check_prompt_grounding(system_prompt: str):
    """Verify prompt encourages grounding."""
    grounding_phrases = [
        "based on the provided",
        "according to the context",
        "only use information from",
        "if the answer is not in",
        "say you don't know"
    ]

    found = [p for p in grounding_phrases if p.lower() in system_prompt.lower()]
    print(f"Found {len(found)}/{len(grounding_phrases)} grounding phrases")

    if len(found) < 2:
        print("WARNING: Prompt may not encourage grounding")

Common Causes & Solutions:

Cause	Diagnostic Sign	Solution
No grounding instructions	Prompt doesn’t mention “based on context”	Add explicit grounding instructions
Irrelevant context	Retrieved docs don’t help	Fix retrieval (see above)
Model overconfidence	Claims made without hedging	Add “if unsure, say so” to prompt
Temperature too high	Random variation in answers	Lower temperature (0.3-0.5)
Context too long	Model loses focus	Reduce context, put key info first
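
The grounding check above relies on a support test that the diagnostic leaves undefined. The sketch below is one crude stand-in, assuming a lexical-overlap heuristic is acceptable for a first pass; `token_overlap_support`, its stopword list, and the 0.6 threshold are all illustrative choices. It does not replace an NLI model or LLM judge, because it cannot detect paraphrase or negation.

```python
import re

# Small illustrative stopword list; extend it for real use.
STOPWORDS = {
    "the", "a", "an", "is", "are", "was", "were", "of", "in", "on",
    "to", "and", "or", "for", "with", "by", "it", "that", "this",
}

def _content_tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def token_overlap_support(claim: str, context: str, threshold: float = 0.6) -> bool:
    """Return True if enough of the claim's content words appear in the context."""
    claim_tokens = _content_tokens(claim)
    if not claim_tokens:
        return True  # nothing checkable in the claim
    overlap = len(claim_tokens & _content_tokens(context)) / len(claim_tokens)
    return overlap >= threshold
```

A heuristic like this is mainly useful for triage: claims it rejects are good candidates for manual review or a stronger model-based check.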

Reference: Chapter 6 (Prompt Engineering), Chapter 17 (Evaluation)

Problem: Inconsistent Output Format

Symptoms:

JSON parsing fails intermittently
Output structure varies between requests
Missing required fields

Diagnostic Steps:

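# Step 1: Measure how often responses parse and which required fields are missing
import json

def diagnose_format_issues(responses: list[str], expected_schema: dict):
    """Analyze format consistency across responses."""
    if not responses:
        print("No responses to analyze")
        return

    parse_results = []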

    for i, response in enumerate(responses):
        try:
            parsed = json.loads(response)
            # Validate against schema
            missing_fields = [
                k for k in expected_schema.get('required', [])
                if k not in parsed
            ]
            parse_results.append({
                'index': i,
                'parsed': True,
                'missing_fields': missing_fields
            })
        except json.JSONDecodeError as e:
            parse_results.append({
                'index': i,
                'parsed': False,
                'error': str(e),
                'raw': response[:200]
            })

    success_rate = sum(1 for r in parse_results if r['parsed']) / len(parse_results)
    print(f"Parse success rate: {success_rate:.1%}")

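    failures = [r for r in parse_results if not r['parsed']]
    if failures:
        print(f"\nFailed responses ({len(failures)}):")
        for f in failures[:3]:
            print(f"  Error: {f['error']}")
            print(f"  Raw: {f['raw']}...")

    # Responses that parsed but lack required fields are failures too
    incomplete = [r for r in parse_results if r['parsed'] and r['missing_fields']]
    if incomplete:
        print(f"\nParsed but missing required fields ({len(incomplete)}):")
        for r in incomplete[:3]:
            print(f"  Response {r['index']}: missing {r['missing_fields']}")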

Common Causes & Solutions:

Cause	Diagnostic Sign	Solution
Not using structured outputs	Parsing fails randomly	Use JSON mode or function calling
Schema not in prompt	Missing/wrong fields	Include explicit schema in prompt
Temperature too high	High variance in format	Lower temperature for structured output
Conflicting instructions	Mixed format signals	Simplify prompt, one format instruction
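
When structured-output modes are unavailable, a tolerant parser can recover many near-miss responses. The following is a sketch under the assumption that the model returns a single JSON object, possibly wrapped in a markdown code fence or surrounded by prose; `extract_json_object` is a hypothetical helper name.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence marker
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_json_object(response: str, required: list[str]) -> dict:
    """Parse one JSON object from a model response and check required keys."""
    text = response.strip()

    # Prefer the contents of a fenced block if the model added one
    fenced = _FENCED.search(text)
    if fenced:
        text = fenced.group(1).strip()

    # Otherwise fall back to the outermost pair of braces
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")

    parsed = json.loads(text[start:end + 1])  # raises json.JSONDecodeError if malformed
    missing = [key for key in required if key not in parsed]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return parsed
```

Because `json.JSONDecodeError` subclasses `ValueError`, callers can catch a single exception type and then retry the request or log the raw response.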

Reference: Chapter 6 (Structured Outputs), Chapter 16 (Testing LLM Apps)

Section 2: Performance Issues

Problem: High Inference Latency

Symptoms:

Time to first token (TTFT) > 2 seconds
Total response time > 10 seconds
User-facing timeouts

Diagnostic Steps:

import time
from dataclasses import dataclass

@dataclass
class LatencyBreakdown:
    embedding_ms: float = 0
    retrieval_ms: float = 0
    context_assembly_ms: float = 0
    llm_ttft_ms: float = 0
    llm_generation_ms: float = 0
    total_ms: float = 0

def profile_rag_latency(query: str, rag_pipeline) -> LatencyBreakdown:
    """Profile latency at each stage."""
    breakdown = LatencyBreakdown()

    # Embedding
    start = time.perf_counter()
    query_embedding = rag_pipeline.embed(query)
    breakdown.embedding_ms = (time.perf_counter() - start) * 1000

    # Retrieval
    start = time.perf_counter()
    chunks = rag_pipeline.retrieve(query_embedding)
    breakdown.retrieval_ms = (time.perf_counter() - start) * 1000

    # Context assembly
    start = time.perf_counter()
    context = rag_pipeline.assemble_context(chunks)
    breakdown.context_assembly_ms = (time.perf_counter() - start) * 1000

    # LLM generation (with streaming to measure TTFT)
    start = time.perf_counter()
    first_token_time = None
    for token in rag_pipeline.generate_stream(query, context):
        if first_token_time is None:
            first_token_time = time.perf_counter()
            breakdown.llm_ttft_ms = (first_token_time - start) * 1000
    breakdown.llm_generation_ms = (time.perf_counter() - start) * 1000

    breakdown.total_ms = (
        breakdown.embedding_ms +
        breakdown.retrieval_ms +
        breakdown.context_assembly_ms +
        breakdown.llm_generation_ms
    )

    return breakdown

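# Run and print breakdown
# (`pipeline` is assumed to be an already-constructed RAG pipeline object
# exposing embed, retrieve, assemble_context, and generate_stream)
breakdown = profile_rag_latency("How do I deploy a model?", pipeline)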
print(f"""
Latency Breakdown:
  Embedding:         {breakdown.embedding_ms:>7.1f} ms
  Retrieval:         {breakdown.retrieval_ms:>7.1f} ms
  Context Assembly:  {breakdown.context_assembly_ms:>7.1f} ms
  LLM TTFT:          {breakdown.llm_ttft_ms:>7.1f} ms
  LLM Total:         {breakdown.llm_generation_ms:>7.1f} ms
  ─────────────────────────────
  Total:             {breakdown.total_ms:>7.1f} ms
""")

Common Causes & Solutions:

Stage	Symptom	Cause	Solution
Embedding	>100ms	Large model, CPU inference	Use smaller model, GPU, or batch
Retrieval	>200ms	Unoptimized index	Add HNSW index, increase replicas
Context	>50ms	Complex assembly	Pre-compute, cache context
LLM TTFT	>2s	Model loading, queue	Warm instances, autoscale
LLM Generation	>10s	Too many tokens	Reduce max_tokens, shorter context
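
The thresholds in the table can be turned into an automated check. This sketch assumes the per-stage timings are available as a plain dict keyed by the `LatencyBreakdown` field names; the budget values are copied from the table and should be tuned to your own latency targets. `STAGE_BUDGETS_MS` and `flag_slow_stages` are illustrative names.

```python
# Per-stage latency budgets in milliseconds, taken from the table above
STAGE_BUDGETS_MS = {
    "embedding_ms": 100,
    "retrieval_ms": 200,
    "context_assembly_ms": 50,
    "llm_ttft_ms": 2_000,
    "llm_generation_ms": 10_000,
}

def flag_slow_stages(
    breakdown: dict[str, float],
    budgets: dict[str, float] = STAGE_BUDGETS_MS,
) -> list[str]:
    """Return the stages whose measured latency exceeds its budget."""
    return [
        stage
        for stage, budget in budgets.items()
        if breakdown.get(stage, 0.0) > budget
    ]
```

A dataclass instance can be passed through `dataclasses.asdict(breakdown)` first. Single measurements are noisy, so it is usually worth profiling several queries and checking a high percentile rather than one run.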

Reference: Chapter 9 (Deployment), Chapter 24 (Performance Engineering)

Problem: Low Throughput

Symptoms:

Queue depth keeps growing
Can’t handle expected request rate
GPU utilization < 50%

Diagnostic Steps:

# Step 1: Check GPU utilization
import subprocess

def check_gpu_utilization():
    """Check NVIDIA GPU stats."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used,memory.total',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True,
        check=True  # raise CalledProcessError if nvidia-smi fails
    )

    for i, line in enumerate(result.stdout.strip().split('\n')):
        util, mem_used, mem_total = line.split(', ')
        print(f"GPU {i}: {util}% utilization, {mem_used}/{mem_total} MB memory")

        if int(util) < 50:
            print(f"  WARNING: Low utilization - batching may help")
        if int(mem_used) < int(mem_total) * 0.5:
            print(f"  WARNING: Memory underutilized - larger batches possible")

# Step 2: Check batch sizes
import numpy as np

def analyze_batching(request_logs: list[dict]):
    """Analyze actual batch sizes being processed."""
    batch_sizes = [log['batch_size'] for log in request_logs]

    print(f"Batch size stats:")
    print(f"  Mean: {np.mean(batch_sizes):.1f}")
    print(f"  Max:  {max(batch_sizes)}")
    print(f"  % size 1: {sum(1 for b in batch_sizes if b == 1) / len(batch_sizes):.1%}")

    if np.mean(batch_sizes) < 4:
        print("  WARNING: Low average batch size - enable continuous batching")

Common Causes & Solutions:

Cause	Diagnostic Sign	Solution
No batching	Batch size = 1	Enable continuous batching (vLLM default)
Small KV cache	Memory available but not used	Increase `--gpu-memory-utilization`
Wrong tensor parallelism	Multi-GPU but sequential	Adjust `--tensor-parallel-size`
Long context sequences	High memory per request	Reduce context, use PagedAttention
Model too large	OOM or swapping	Quantize model or use smaller variant
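
A growing queue means requests arrive faster than the system completes them. Little's law (average requests in flight = arrival rate × average latency) gives a quick estimate of the concurrency the serving stack must sustain. The sketch below is back-of-the-envelope arithmetic; the function names and example numbers are illustrative.

```python
import math

def required_concurrency(arrival_rate_rps: float, avg_latency_s: float) -> int:
    """Little's law: average requests in flight = arrival rate x average latency."""
    return math.ceil(arrival_rate_rps * avg_latency_s)

def is_overloaded(arrival_rate_rps: float, avg_latency_s: float, max_concurrency: int) -> bool:
    """True if steady-state demand exceeds what the server can hold in flight."""
    return required_concurrency(arrival_rate_rps, avg_latency_s) > max_concurrency

# Example: 20 requests/s at 4 s average latency needs about 80 concurrent slots
print(required_concurrency(20, 4.0))
```

If the server's effective batch capacity is below this number, the queue will grow regardless of GPU utilization, which points toward larger batches, more replicas, or shorter generations.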

Reference: Chapter 9 (Batching), Chapter 24 (GPU Optimization)

Problem: Costs Are Too High

Symptoms:

API bills higher than expected
Self-hosted GPU costs exceed budget
Cost per query not sustainable

Diagnostic Steps:

import numpy as np

def analyze_llm_costs(usage_logs: list[dict], pricing: dict):
    """Analyze LLM API costs."""
    cost_by_endpoint = {}
    cost_by_user = {}

    for log in usage_logs:
        # Calculate cost
        input_cost = log['input_tokens'] * pricing['input_per_1k'] / 1000
        output_cost = log['output_tokens'] * pricing['output_per_1k'] / 1000
        total_cost = input_cost + output_cost

        # Aggregate
        endpoint = log['endpoint']
        cost_by_endpoint[endpoint] = cost_by_endpoint.get(endpoint, 0) + total_cost

        user = log.get('user_id', 'unknown')
        cost_by_user[user] = cost_by_user.get(user, 0) + total_cost

    # Report
    print("Cost by endpoint:")
    for endpoint, cost in sorted(cost_by_endpoint.items(), key=lambda x: -x[1])[:10]:
        print(f"  {endpoint}: ${cost:.2f}")

    print("\nTop users by cost:")
    for user, cost in sorted(cost_by_user.items(), key=lambda x: -x[1])[:10]:
        print(f"  {user}: ${cost:.2f}")

    # Check for optimization opportunities
    avg_input_tokens = np.mean([log['input_tokens'] for log in usage_logs])
    avg_output_tokens = np.mean([log['output_tokens'] for log in usage_logs])

    print(f"\nAverage tokens: {avg_input_tokens:.0f} input, {avg_output_tokens:.0f} output")
    if avg_input_tokens > 2000:
        print("  TIP: High input tokens - consider context compression")
    if avg_output_tokens > 500:
        print("  TIP: High output tokens - consider max_tokens limits")

Common Causes & Solutions:

Cause	Diagnostic Sign	Solution
No caching	Same queries repeated	Add semantic caching
Over-sized model	Using GPT-5 for simple tasks	Route simple queries to smaller models
Long contexts	>4K tokens average	Compress context, better retrieval
No max_tokens	Verbose outputs	Set appropriate max_tokens
Wasteful retries	Many failed requests	Fix root cause, add backoff
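
Caching repeated queries is often the cheapest fix in the table. The sketch below is a simplified exact-match cache, not the semantic cache the table recommends: it only catches prompts that are identical after case and whitespace normalization. `ExactMatchCache` and the `llm_call` callable are hypothetical names standing in for your own client code.

```python
import hashlib
from typing import Callable

class ExactMatchCache:
    """Cache LLM responses keyed by a hash of the normalized prompt."""

    def __init__(self, llm_call: Callable[[str], str]):
        self._llm_call = llm_call
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize case and whitespace so trivial variations share a key
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._llm_call(prompt)
        self._store[key] = response
        return response
```

The `hits / (hits + misses)` ratio on real traffic indicates whether a full semantic cache, which matches on embedding similarity instead of exact text, is likely to pay off. A production cache would also need eviction and expiry.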

Reference: Chapter 29 (Cost Engineering)

Section 3: Errors and Failures

Problem: Out of Memory (OOM) Errors

Symptoms:

CUDA out of memory errors
Process killed by OOM killer
Gradual memory growth leading to crash

Diagnostic Steps:

# Step 1: Monitor memory over time
import torch

def log_memory_stats():
    """Log current GPU memory usage."""
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            reserved = torch.cuda.memory_reserved(i) / 1024**3
            total = torch.cuda.get_device_properties(i).total_memory / 1024**3

            print(f"GPU {i}: {allocated:.1f}GB allocated, {reserved:.1f}GB reserved, {total:.1f}GB total")

            if allocated / total > 0.9:
                print(f"  WARNING: Near OOM threshold")

# Step 2: Check for memory leaks
def check_memory_leak(func, num_iterations: int = 100):
    """Check if a function leaks memory."""
    initial_memory = torch.cuda.memory_allocated()

    for i in range(num_iterations):
        func()

        if i % 10 == 0:
            current = torch.cuda.memory_allocated()
            growth = (current - initial_memory) / 1024**2
            print(f"Iteration {i}: {growth:.1f} MB growth")

    final_memory = torch.cuda.memory_allocated()
    total_growth = (final_memory - initial_memory) / 1024**2

    print(f"\nTotal memory growth: {total_growth:.1f} MB over {num_iterations} iterations")
    if total_growth > 100:
        print("WARNING: Possible memory leak detected")

Common Causes & Solutions:

Cause	Diagnostic Sign	Solution
KV cache too large	OOM on long sequences	Reduce `max_model_len`, enable PagedAttention
Batch size too large	OOM on high load	Reduce batch size, enable continuous batching
Memory leak	Gradual growth	Clear caches, check for tensor accumulation
Model too large	OOM on load	Quantize (INT8/INT4), use smaller model
Multiple models	Combined size exceeds memory	Model sharing, offload unused models
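
To judge whether the KV cache is the cause of an OOM, it helps to estimate its size. For standard transformer attention the cache stores one key and one value tensor per layer, so its size is 2 × layers × KV heads × head dimension × sequence length × batch size × bytes per value. The sketch below applies that formula; the example dimensions are illustrative of a 7B-class model without grouped-query attention, not measurements from a specific deployment.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes for FP16/BF16
) -> int:
    """Estimate KV cache size: keys and values for every layer, head, and token."""
    # Leading factor 2 = one key tensor plus one value tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Example: 32 layers, 32 KV heads, head_dim 128, 4096 tokens, FP16
size = kv_cache_bytes(32, 32, 128, 4096)
print(f"{size / 1024**3:.1f} GiB per sequence")
```

Multiplying by the expected number of concurrent sequences shows how quickly long contexts consume memory, and why reducing `max_model_len` or the batch size relieves OOM pressure.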

Reference: Chapter 24 (Memory Optimization), Chapter 9 (Quantization)

Problem: API Rate Limits and Failures

Symptoms:

429 (Too Many Requests) errors
503 (Service Unavailable) errors
Intermittent failures under load

Diagnostic Steps:

import numpy as np
from collections import defaultdict

def analyze_api_errors(error_logs: list[dict]):
    """Analyze API error patterns."""
    errors_by_type = defaultdict(list)
    errors_by_hour = defaultdict(int)

    for log in error_logs:
        errors_by_type[log['status_code']].append(log)
        hour = log['timestamp'].hour
        errors_by_hour[hour] += 1

    print("Errors by status code:")
    for code, errors in sorted(errors_by_type.items()):
        print(f"  {code}: {len(errors)} errors")
        if code == 429:
            # Check rate limit headers
            retry_afters = [e.get('retry_after', 0) for e in errors]
            print(f"    Avg retry-after: {np.mean(retry_afters):.1f}s")
        if code == 503:
            print(f"    Likely provider overload")

    print("\nErrors by hour:")
    for hour in range(24):
        count = errors_by_hour[hour]
        bar = "█" * (count // 10)
        print(f"  {hour:02d}:00 - {bar} ({count})")

Retry Strategy Implementation:

import asyncio
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)

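class RetryableAPIError(Exception):
    """Raised for transient failures that are safe to retry."""

# RateLimitError and ServiceUnavailableError stand in for the exception
# classes exposed by your client library; import them from there.

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type(RetryableAPIError),
    reraise=True  # surface the last error instead of tenacity.RetryError
)
async def call_llm_with_retry(client, request):
    """Call LLM API with exponential backoff."""
    try:
        return await client.generate(request)
    except RateLimitError as e:
        raise RetryableAPIError(f"Rate limited: {e}") from e
    except ServiceUnavailableError as e:
        raise RetryableAPIError(f"Service unavailable: {e}") from e
    # Any other exception is treated as non-transient and propagates without retry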

Common Causes & Solutions:

Error	Cause	Solution
429	Exceeding rate limits	Add rate limiting, request quota increase
503	Provider overload	Retry with backoff, add fallback provider
Timeout	Long generation, network issues	Increase timeout, add streaming
401	Invalid/expired API key	Rotate keys, check expiration
400	Invalid request	Validate inputs, check for changes in API
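
The "add rate limiting" fix for 429 errors can be sketched with a client-side token bucket, which caps the request rate before the provider has to reject anything. This is a minimal single-threaded sketch; `TokenBucket` and its parameters are illustrative names, and the injectable `clock` exists only to make the behavior testable.

```python
import time
from typing import Callable

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def try_acquire(self) -> bool:
        """Consume one token if available; return False if the caller should wait."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

Set `rate_per_s` slightly below the provider's documented limit so that retries and clock skew do not push traffic over it. Multi-threaded or multi-process callers would need a lock or a shared store around the bucket state.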

Reference: Chapter 28 (Reliability Engineering)

Problem: Agent Gets Stuck in Loops

Symptoms:

Agent repeats same action multiple times
Token usage spikes without progress
Tasks never complete

Diagnostic Steps:

def analyze_agent_trace(trace: list[dict]):
    """Analyze agent execution trace for issues."""
    # Check for repeated actions
    action_sequence = [step['action'] for step in trace]

    # Find immediately repeating patterns (e.g. A, B, A, B)
    for pattern_len in range(1, 5):
        # +1 so that a repeat at the very end of the trace is also checked
        for i in range(len(action_sequence) - pattern_len * 2 + 1):
            pattern = tuple(action_sequence[i:i+pattern_len])
            next_pattern = tuple(action_sequence[i+pattern_len:i+pattern_len*2])
            if pattern == next_pattern:
                print(f"WARNING: Repeated pattern detected at step {i}: {pattern}")

    # Check for increasing token usage
    if len(trace) > 5:
        recent_tokens = [step['tokens'] for step in trace[-5:]]
        if all(recent_tokens[i] >= recent_tokens[i-1] for i in range(1, len(recent_tokens))):
            print("WARNING: Token usage monotonically increasing - possible context explosion")

    # Check for lack of progress
    if len(trace) > 10:
        recent_observations = [step.get('observation', '') for step in trace[-5:]]
        if len(set(recent_observations)) == 1:
            print("WARNING: Same observation repeated - agent may be stuck")

Common Causes & Solutions:

Cause	Diagnostic Sign	Solution
No termination condition	Runs forever	Add max_steps, explicit “done” action
Bad tool output	Same error repeated	Improve error messages, add error handling
Context growing	Token count increasing	Summarize history, limit context
Conflicting instructions	Oscillating actions	Clarify system prompt
Missing capability	Keeps trying impossible action	Add needed tools or handle gracefully
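
The first fix in the table, a step limit plus an explicit "done" action, can be sketched as a guard around the agent loop. The sketch assumes each step is a dict with a string `action` key, matching the trace format used in the diagnostic; `run_agent` and `step_fn` are hypothetical names for your own agent driver and its single-step function.

```python
from typing import Callable

def run_agent(
    step_fn: Callable[[list[dict]], dict],
    max_steps: int = 20,
    max_repeats: int = 3,
) -> tuple[list[dict], str]:
    """Run an agent until it signals 'done', repeats itself, or hits max_steps."""
    trace: list[dict] = []
    for _ in range(max_steps):
        step = step_fn(trace)  # the agent decides its next action from the history
        trace.append(step)

        if step["action"] == "done":
            return trace, "completed"

        # Abort if the last `max_repeats` actions are all identical
        recent = [s["action"] for s in trace[-max_repeats:]]
        if len(recent) == max_repeats and len(set(recent)) == 1:
            return trace, "aborted: repeated action"

    return trace, "aborted: max_steps reached"
```

Returning a status string alongside the trace lets the caller distinguish a finished task from a guarded abort and log the latter for the trace analysis shown earlier.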

Reference: Chapter 8 (Agent Architectures, Safety)

Debugging Checklists

Pre-Deployment Checklist

□ Unit tests pass
□ Integration tests pass
□ Load testing completed
□ Error handling tested
□ Monitoring/alerting configured
□ Rollback procedure documented
□ Rate limits configured
□ Cost projections validated
□ Security review completed
□ Documentation updated

Incident Response Checklist

□ Identify the symptom (what's broken?)
□ Check monitoring dashboards
□ Check recent deployments/changes
□ Reproduce the issue
□ Isolate the component
□ Check logs for errors
□ Test fix in staging
□ Deploy fix with monitoring
□ Document root cause
□ Add tests to prevent recurrence

Model Update Checklist

□ Run evaluation suite on new model
□ Compare metrics to baseline
□ Test with production-like traffic
□ Check output format compatibility
□ Verify prompt compatibility
□ Update documentation
□ Plan staged rollout
□ Monitor closely after deployment
□ Have rollback ready

Common Error Messages and Solutions

Error Message	Likely Cause	Solution
`CUDA out of memory`	GPU memory exhausted	Reduce batch size, quantize, use smaller model
`Connection reset by peer`	API timeout/overload	Add retry logic, increase timeout
`Invalid API key`	Auth issue	Check key, rotation, permissions
`Context length exceeded`	Input too long	Truncate input, use smaller chunks
`Model not found`	Wrong model name or not loaded	Check model availability, warm up
`Rate limit exceeded`	Too many requests	Add rate limiting, request quota increase
`JSON decode error`	Malformed model output	Use structured outputs, add validation
`Timeout waiting for response`	Slow generation	Increase timeout, use streaming

Getting Help

When debugging complex issues:

Reproduce minimally: Create the smallest example that shows the issue
Check logs thoroughly: Often the real error is buried in logs
Binary search: Disable components to isolate the problem
Compare to working state: What changed since it last worked?
Search issues: GitHub issues, Stack Overflow, forums
Ask with context: Include error messages, code, and what you’ve tried

Useful Resources:

vLLM GitHub Issues: github.com/vllm-project/vllm/issues
HuggingFace Forums: discuss.huggingface.co
LangChain Community (Slack): langchain.com/join-community
r/MachineLearning: reddit.com/r/MachineLearning

torch.cuda.memory_allocated(i) / 1024**3 reserved = torch.cuda.memory_reserved(i) / 1024**3 total = torch.cuda.get_device_properties(i).total_memory / 1024**3 print(f"GPU {i}: {allocated:.1f}GB allocated, {reserved:.1f}GB reserved, {total:.1f}GB total") if allocated / total > 0.9: print(f" WARNING: Near OOM threshold") # Step 2: Check for memory leaks def check_memory_leak(func, num_iterations: int = 100): """Check if a function leaks memory.""" initial_memory = torch.cuda.memory_allocated() for i in range(num_iterations): func() if i % 10 == 0: current = torch.cuda.memory_allocated() growth = (current - initial_memory) / 1024**2 print(f"Iteration {i}: {growth:.1f} MB growth") final_memory = torch.cuda.memory_allocated() total_growth = (final_memory - initial_memory) / 1024**2 print(f"\nTotal memory growth: {total_growth:.1f} MB over {num_iterations} iterations") if total_growth > 100: print("WARNING: Possible memory leak detected") ``` **Common Causes & Solutions**: | Cause | Diagnostic Sign | Solution | |-------|-----------------|----------| | KV cache too large | OOM on long sequences | Reduce `max_model_len`, enable PagedAttention | | Batch size too large | OOM on high load | Reduce batch size, enable continuous batching | | Memory leak | Gradual growth | Clear caches, check for tensor accumulation | | Model too large | OOM on load | Quantize (INT8/INT4), use smaller model | | Multiple models | Combined size exceeds memory | Model sharing, offload unused models | **Reference**: Chapter 24 (Memory Optimization), Chapter 9 (Quantization) --- ### Problem: API Rate Limits and Failures **Symptoms**: - 429 (Too Many Requests) errors - 503 (Service Unavailable) errors - Intermittent failures under load **Diagnostic Steps**: ```python import time from collections import defaultdict def analyze_api_errors(error_logs: list[dict]): """Analyze API error patterns.""" errors_by_type = defaultdict(list) errors_by_hour = defaultdict(int) for log in error_logs: 
errors_by_type[log['status_code']].append(log) hour = log['timestamp'].hour errors_by_hour[hour] += 1 print("Errors by status code:") for code, errors in sorted(errors_by_type.items()): print(f" {code}: {len(errors)} errors") if code == 429: # Check rate limit headers retry_afters = [e.get('retry_after', 0) for e in errors] print(f" Avg retry-after: {np.mean(retry_afters):.1f}s") if code == 503: print(f" Likely provider overload") print("\nErrors by hour:") for hour in range(24): count = errors_by_hour[hour] bar = "█" * (count // 10) print(f" {hour:02d}:00 - {bar} ({count})") ``` **Retry Strategy Implementation**: ```python import asyncio from tenacity import ( retry, stop_after_attempt, wait_exponential, retry_if_exception_type ) class RetryableAPIError(Exception): pass @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=60), retry=retry_if_exception_type(RetryableAPIError) ) async def call_llm_with_retry(client, request): """Call LLM API with exponential backoff.""" try: return await client.generate(request) except RateLimitError as e: raise RetryableAPIError(f"Rate limited: {e}") except ServiceUnavailableError as e: raise RetryableAPIError(f"Service unavailable: {e}") except Exception as e: # Don't retry non-transient errors raise ``` **Common Causes & Solutions**: | Error | Cause | Solution | |-------|-------|----------| | 429 | Exceeding rate limits | Add rate limiting, request quota increase | | 503 | Provider overload | Retry with backoff, add fallback provider | | Timeout | Long generation, network issues | Increase timeout, add streaming | | 401 | Invalid/expired API key | Rotate keys, check expiration | | 400 | Invalid request | Validate inputs, check for changes in API | **Reference**: Chapter 28 (Reliability Engineering) --- ### Problem: Agent Gets Stuck in Loops **Symptoms**: - Agent repeats same action multiple times - Token usage spikes without progress - Tasks never complete **Diagnostic Steps**: ```python def 
analyze_agent_trace(trace: list[dict]): """Analyze agent execution trace for issues.""" # Check for repeated actions action_sequence = [step['action'] for step in trace] # Find repeating patterns for pattern_len in range(1, 5): for i in range(len(action_sequence) - pattern_len * 2): pattern = tuple(action_sequence[i:i+pattern_len]) next_pattern = tuple(action_sequence[i+pattern_len:i+pattern_len*2]) if pattern == next_pattern: print(f"WARNING: Repeated pattern detected at step {i}: {pattern}") # Check for increasing token usage if len(trace) > 5: recent_tokens = [step['tokens'] for step in trace[-5:]] if all(recent_tokens[i] >= recent_tokens[i-1] for i in range(1, len(recent_tokens))): print("WARNING: Token usage monotonically increasing - possible context explosion") # Check for lack of progress if len(trace) > 10: recent_observations = [step.get('observation', '') for step in trace[-5:]] if len(set(recent_observations)) == 1: print("WARNING: Same observation repeated - agent may be stuck") ``` **Common Causes & Solutions**: | Cause | Diagnostic Sign | Solution | |-------|-----------------|----------| | No termination condition | Runs forever | Add max_steps, explicit "done" action | | Bad tool output | Same error repeated | Improve error messages, add error handling | | Context growing | Token count increasing | Summarize history, limit context | | Conflicting instructions | Oscillating actions | Clarify system prompt | | Missing capability | Keeps trying impossible action | Add needed tools or handle gracefully | **Reference**: Chapter 8 (Agent Architectures, Safety) --- ## Debugging Checklists ### Pre-Deployment Checklist ``` □ Unit tests pass □ Integration tests pass □ Load testing completed □ Error handling tested □ Monitoring/alerting configured □ Rollback procedure documented □ Rate limits configured □ Cost projections validated □ Security review completed □ Documentation updated ``` ### Incident Response Checklist ``` □ Identify the symptom (what's 
broken?) □ Check monitoring dashboards □ Check recent deployments/changes □ Reproduce the issue □ Isolate the component □ Check logs for errors □ Test fix in staging □ Deploy fix with monitoring □ Document root cause □ Add tests to prevent recurrence ``` ### Model Update Checklist ``` □ Run evaluation suite on new model □ Compare metrics to baseline □ Test with production-like traffic □ Check output format compatibility □ Verify prompt compatibility □ Update documentation □ Plan staged rollout □ Monitor closely after deployment □ Have rollback ready ``` --- ## Common Error Messages and Solutions | Error Message | Likely Cause | Solution | |--------------|--------------|----------| | `CUDA out of memory` | GPU memory exhausted | Reduce batch size, quantize, use smaller model | | `Connection reset by peer` | API timeout/overload | Add retry logic, increase timeout | | `Invalid API key` | Auth issue | Check key, rotation, permissions | | `Context length exceeded` | Input too long | Truncate input, use smaller chunks | | `Model not found` | Wrong model name or not loaded | Check model availability, warm up | | `Rate limit exceeded` | Too many requests | Add rate limiting, request quota increase | | `JSON decode error` | Malformed model output | Use structured outputs, add validation | | `Timeout waiting for response` | Slow generation | Increase timeout, use streaming | --- ## Getting Help When debugging complex issues: 1. **Reproduce minimally**: Create the smallest example that shows the issue 2. **Check logs thoroughly**: Often the real error is buried in logs 3. **Binary search**: Disable components to isolate the problem 4. **Compare to working state**: What changed since it last worked? 5. **Search issues**: GitHub issues, Stack Overflow, forums 6. 
**Ask with context**: Include error messages, code, and what you've tried **Useful Resources**: - vLLM GitHub Issues: github.com/vllm-project/vllm/issues - HuggingFace Forums: discuss.huggingface.co - LangChain Community (Slack): langchain.com/join-community - r/MachineLearning: reddit.com/r/MachineLearning
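
The binary-search step under Getting Help can be automated when the pipeline is an ordered list of components that can be enabled one prefix at a time. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper `bisect_failing_component`, the `fails_with` callback, and the stage names are all hypothetical, and the search assumes a single deterministic culprit (the check fails for a prefix if and only if that prefix contains the bad component).

```python
from typing import Callable, Optional, Sequence


def bisect_failing_component(
    components: Sequence[str],
    fails_with: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the first component whose inclusion makes the check fail.

    Assumption: exactly one deterministic culprit, so the check fails for a
    prefix of `components` if and only if that prefix contains the culprit.
    """
    if not components or not fails_with(components):
        return None  # nothing to isolate: the full set passes

    lo, hi = 0, len(components) - 1
    # Invariant: the culprit's index lies in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if fails_with(components[: mid + 1]):
            hi = mid      # culprit is among the first mid + 1 components
        else:
            lo = mid + 1  # culprit is later in the list
    return components[lo]


# Hypothetical pipeline stages; "reranker" plays the culprit in this demo.
stages = ["query_rewrite", "retriever", "reranker", "prompt_builder", "llm_call"]
culprit = bisect_failing_component(stages, lambda enabled: "reranker" in enabled)
print(culprit)  # prints: reranker
```

In practice `fails_with` would run the reproduction case with only the given components enabled. If the failure is intermittent or depends on an interaction between two components, the single-culprit assumption does not hold and the result should be treated as a hint rather than a diagnosis.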