Appendix F: Debugging & Troubleshooting Guide

This appendix provides systematic approaches to diagnosing and fixing common issues in AI/ML systems. Each section covers a specific problem category with symptoms, diagnostic steps, and solutions.


Quick Diagnostic Flowchart

[Figure: AI System Debugging Flowchart]

Section 1: Output Quality Issues

Problem: RAG System Returns Irrelevant Results

Symptoms:

  • Retrieved documents don’t match the query
  • Good documents exist but aren’t retrieved
  • Retrieval seems random

Diagnostic Steps:

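# Step 1: Check embedding quality
import numpy as np

def diagnose_embeddings(query: str, expected_doc: str, embedder):
    """Check if embeddings capture semantic similarity."""
    query_emb = np.asarray(embedder.encode(query), dtype=float)
    doc_emb = np.asarray(embedder.encode(expected_doc), dtype=float)

    # Normalize so the dot product is a cosine similarity in [-1, 1];
    # a raw dot product is only comparable if the embedder returns unit vectors.
    similarity = np.dot(query_emb, doc_emb) / (
        np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
    )
    print(f"Query-Doc similarity: {similarity:.3f}")

    # Rough guide: clearly related content often scores above 0.7, but the
    # usable range depends on the embedding model, so calibrate on known pairs.
    if similarity < 0.5:
        print("WARNING: Low similarity - embedder may not capture relationship")
        print("Consider: Different embedding model, query expansion, or HyDE")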

# Step 2: Check what IS being retrieved
def diagnose_retrieval(query: str, retriever, expected_doc_id: str):
    """See what's actually being retrieved."""
    results = retriever.search(query, top_k=20)

    print(f"Query: {query}")
    print(f"\nTop 20 results:")
    for i, r in enumerate(results):
        marker = "<<<" if r['id'] == expected_doc_id else ""
        print(f"  {i+1}. [{r['score']:.3f}] {r['id'][:50]} {marker}")

    if expected_doc_id not in [r['id'] for r in results]:
        print(f"\nExpected doc not in top 20!")
        print("Possible issues:")
        print("  - Document not indexed")
        print("  - Chunking split relevant content")
        print("  - Query-document vocabulary mismatch")

# Step 3: Inspect chunks around the expected document
def diagnose_chunking(doc_id: str, vector_store):
    """Check how a document was chunked."""
    chunks = vector_store.get_by_doc_id(doc_id)

    print(f"Document {doc_id} has {len(chunks)} chunks:")
    for chunk in chunks:
        print(f"\n--- Chunk {chunk['id']} ---")
        print(chunk['text'][:200] + "...")

Common Causes & Solutions:

Cause | Diagnostic Sign | Solution
Wrong embedding model | Low similarity scores for related content | Try e5-large-v2, bge-large, or domain-specific models
Chunks too small | Relevant info split across chunks | Increase chunk size, use overlap
Chunks too large | Chunks contain mixed topics | Decrease chunk size, use semantic boundaries
Query-doc vocabulary mismatch | Query uses different terms than docs | Add query expansion, use HyDE
Missing documents | Expected doc not in index | Re-run indexing, check filters
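
The "use overlap" fix for chunks that split relevant content can be sketched as a sliding window. This is a minimal illustration, assuming the document has already been split into words or tokens; the function name and default sizes are hypothetical and should be tuned for your corpus.

```python
def chunk_with_overlap(
    words: list[str], chunk_size: int = 200, overlap: int = 40
) -> list[list[str]]:
    """Split a word/token list into windows that share `overlap` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk
        if start + chunk_size >= len(words):
            break
    return chunks
```

With overlap, a sentence that straddles a chunk boundary appears whole in at least one chunk, at the cost of indexing some text twice.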

Reference: Chapter 7 (Chunking, Embeddings, Hybrid Search)


Problem: Model Hallucinations

Symptoms:

  • Model makes up facts not in the context
  • Confident but wrong answers
  • Plausible-sounding fabrications

Diagnostic Steps:

# Step 1: Check if relevant context was provided
def diagnose_hallucination(query: str, response: str, retrieved_chunks: list):
    """Check if response is grounded in retrieved context."""
    context = "\n".join([c['text'] for c in retrieved_chunks])

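    # Look for claims in response.
    # extract_claims and is_supported_by_context are placeholders you must
    # supply, e.g. backed by an NLI model or an LLM judge.
    claims = extract_claims(response)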

    grounded_claims = []
    ungrounded_claims = []

    for claim in claims:
        if is_supported_by_context(claim, context):
            grounded_claims.append(claim)
        else:
            ungrounded_claims.append(claim)

    print(f"Grounded claims: {len(grounded_claims)}")
    print(f"Ungrounded claims: {len(ungrounded_claims)}")

    if ungrounded_claims:
        print("\nUngrounded claims:")
        for claim in ungrounded_claims:
            print(f"  - {claim}")

# Step 2: Check prompt for grounding instructions
def check_prompt_grounding(system_prompt: str):
    """Verify prompt encourages grounding."""
    grounding_phrases = [
        "based on the provided",
        "according to the context",
        "only use information from",
        "if the answer is not in",
        "say you don't know"
    ]

    found = [p for p in grounding_phrases if p.lower() in system_prompt.lower()]
    print(f"Found {len(found)}/{len(grounding_phrases)} grounding phrases")

    if len(found) < 2:
        print("WARNING: Prompt may not encourage grounding")

Common Causes & Solutions:

Cause | Diagnostic Sign | Solution
No grounding instructions | Prompt doesn’t mention “based on context” | Add explicit grounding instructions
Irrelevant context | Retrieved docs don’t help | Fix retrieval (see above)
Model overconfidence | Claims made without hedging | Add “if unsure, say so” to prompt
Temperature too high | Random variation in answers | Lower temperature (0.3-0.5)
Context too long | Model loses focus | Reduce context, put key info first
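
The diagnostic above relies on an is_supported_by_context helper. As a crude stand-in while you wire up an NLI model or LLM judge, a lexical-overlap heuristic can flag the most obvious fabrications. This is only a sketch: the stopword list and the 0.8 threshold are assumptions, and word overlap misses paraphrases and cannot detect negation.

```python
import re

# Small illustrative stopword list; extend it for real use
_STOPWORDS = {
    "the", "a", "an", "is", "are", "was", "were", "of", "in", "on",
    "to", "and", "or", "for", "with", "by", "at", "it", "this", "that",
}

def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in _STOPWORDS}

def lexical_support(claim: str, context: str) -> float:
    """Fraction of the claim's content words that also appear in the context."""
    claim_words = content_words(claim)
    if not claim_words:
        return 0.0
    return len(claim_words & content_words(context)) / len(claim_words)

def is_supported_by_context(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Heuristic grounding check; the threshold is an assumption to calibrate."""
    return lexical_support(claim, context) >= threshold
```

Treat a low score as a prompt for manual review, not as proof of hallucination.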

Reference: Chapter 6 (Prompt Engineering), Chapter 17 (Evaluation)


Problem: Inconsistent Output Format

Symptoms:

  • JSON parsing fails intermittently
  • Output structure varies between requests
  • Missing required fields

Diagnostic Steps:

# Step 1: Measure how often responses parse and contain the required fields
import json

def diagnose_format_issues(responses: list[str], expected_schema: dict):
    """Analyze format consistency across responses."""
    parse_results = []

    for i, response in enumerate(responses):
        try:
            parsed = json.loads(response)
            # Validate against schema
            missing_fields = [
                k for k in expected_schema.get('required', [])
                if k not in parsed
            ]
            parse_results.append({
                'index': i,
                'parsed': True,
                'missing_fields': missing_fields
            })
        except json.JSONDecodeError as e:
            parse_results.append({
                'index': i,
                'parsed': False,
                'error': str(e),
                'raw': response[:200]
            })

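    if not parse_results:
        print("No responses to analyze")
        return

    success_rate = sum(1 for r in parse_results if r['parsed']) / len(parse_results)
    print(f"Parse success rate: {success_rate:.1%}")

    failures = [r for r in parse_results if not r['parsed']]
    if failures:
        print(f"\nFailed responses ({len(failures)}):")
        for f in failures[:3]:
            print(f"  Error: {f['error']}")
            print(f"  Raw: {f['raw']}...")

    # Responses that parsed but lack required fields
    incomplete = [r for r in parse_results if r['parsed'] and r['missing_fields']]
    if incomplete:
        print(f"\nResponses missing required fields ({len(incomplete)}):")
        for r in incomplete[:3]:
            print(f"  Response {r['index']}: missing {r['missing_fields']}")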

Common Causes & Solutions:

Cause | Diagnostic Sign | Solution
Not using structured outputs | Parsing fails randomly | Use JSON mode or function calling
Schema not in prompt | Missing/wrong fields | Include explicit schema in prompt
Temperature too high | High variance in format | Lower temperature for structured output
Conflicting instructions | Mixed format signals | Simplify prompt, one format instruction
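
When structured-output modes are unavailable, a common source of intermittent parse failures is a valid JSON object wrapped in prose or a code fence. The sketch below, with hypothetical function names, tries a strict parse first and then scans for the first decodable object. It is a fallback for diagnosis, not a substitute for fixing the prompt.

```python
import json
import re
from typing import Optional

def extract_json_object(response: str) -> Optional[dict]:
    """Return the first parseable JSON object found in `response`, or None."""
    # Fast path: the whole response is already valid JSON
    try:
        parsed = json.loads(response)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass

    # Fallback: try decoding from each '{' until one yields an object
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", response):
        try:
            parsed, _end = decoder.raw_decode(response, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None

def missing_required(parsed: dict, required: list[str]) -> list[str]:
    """List required keys that are absent from the parsed object."""
    return [key for key in required if key not in parsed]
```

If the fallback path succeeds often, the model is producing the right structure but ignoring the "JSON only" instruction, which points to the prompt and not the schema.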

Reference: Chapter 6 (Structured Outputs), Chapter 16 (Testing LLM Apps)


Section 2: Performance Issues

Problem: High Inference Latency

Symptoms:

  • Time to first token (TTFT) > 2 seconds
  • Total response time > 10 seconds
  • User-facing timeouts

Diagnostic Steps:

import time
from dataclasses import dataclass

@dataclass
class LatencyBreakdown:
    embedding_ms: float = 0
    retrieval_ms: float = 0
    context_assembly_ms: float = 0
    llm_ttft_ms: float = 0
    llm_generation_ms: float = 0
    total_ms: float = 0

def profile_rag_latency(query: str, rag_pipeline) -> LatencyBreakdown:
    """Profile latency at each stage."""
    breakdown = LatencyBreakdown()

    # Embedding
    start = time.perf_counter()
    query_embedding = rag_pipeline.embed(query)
    breakdown.embedding_ms = (time.perf_counter() - start) * 1000

    # Retrieval
    start = time.perf_counter()
    chunks = rag_pipeline.retrieve(query_embedding)
    breakdown.retrieval_ms = (time.perf_counter() - start) * 1000

    # Context assembly
    start = time.perf_counter()
    context = rag_pipeline.assemble_context(chunks)
    breakdown.context_assembly_ms = (time.perf_counter() - start) * 1000

    # LLM generation (with streaming to measure TTFT)
    start = time.perf_counter()
    first_token_time = None
    for token in rag_pipeline.generate_stream(query, context):
        if first_token_time is None:
            first_token_time = time.perf_counter()
            breakdown.llm_ttft_ms = (first_token_time - start) * 1000
    breakdown.llm_generation_ms = (time.perf_counter() - start) * 1000

    breakdown.total_ms = (
        breakdown.embedding_ms +
        breakdown.retrieval_ms +
        breakdown.context_assembly_ms +
        breakdown.llm_generation_ms
    )

    return breakdown

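# Run and print breakdown (`pipeline` is your own RAG pipeline object that
# exposes the embed/retrieve/assemble_context/generate_stream methods used above)
breakdown = profile_rag_latency("How do I deploy a model?", pipeline)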
print(f"""
Latency Breakdown:
  Embedding:         {breakdown.embedding_ms:>7.1f} ms
  Retrieval:         {breakdown.retrieval_ms:>7.1f} ms
  Context Assembly:  {breakdown.context_assembly_ms:>7.1f} ms
  LLM TTFT:          {breakdown.llm_ttft_ms:>7.1f} ms
  LLM Total:         {breakdown.llm_generation_ms:>7.1f} ms
  ─────────────────────────────
  Total:             {breakdown.total_ms:>7.1f} ms
""")

Common Causes & Solutions:

Stage | Symptom | Cause | Solution
Embedding | >100ms | Large model, CPU inference | Use smaller model, GPU, or batch
Retrieval | >200ms | Unoptimized index | Add HNSW index, increase replicas
Context | >50ms | Complex assembly | Pre-compute, cache context
LLM TTFT | >2s | Model loading, queue | Warm instances, autoscale
LLM Generation | >10s | Too many tokens | Reduce max_tokens, shorter context
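
A single profiled request can be misleading because latency varies from call to call. One way to compare a stage against the thresholds above is to collect many samples and look at percentiles. The helper below is a small sketch with a hypothetical name; it uses the standard library's statistics.quantiles with the inclusive method.

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Summarize latency samples (in ms) as p50/p95/p99."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tail percentiles (p95, p99) are usually what user-facing timeouts hit, so a healthy median can still hide a problem.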

Reference: Chapter 9 (Deployment), Chapter 24 (Performance Engineering)


Problem: Low Throughput

Symptoms:

  • Queue depth keeps growing
  • Can’t handle expected request rate
  • GPU utilization < 50%

Diagnostic Steps:

# Step 1: Check GPU utilization
import subprocess

def check_gpu_utilization():
    """Check NVIDIA GPU stats."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used,memory.total',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )

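    # splitlines() yields no rows when nvidia-smi prints nothing, so the
    # unpacking below is never attempted on an empty string
    for i, line in enumerate(result.stdout.strip().splitlines()):
        util, mem_used, mem_total = (field.strip() for field in line.split(','))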
        print(f"GPU {i}: {util}% utilization, {mem_used}/{mem_total} MB memory")

        if int(util) < 50:
            print(f"  WARNING: Low utilization - batching may help")
        if int(mem_used) < int(mem_total) * 0.5:
            print(f"  WARNING: Memory underutilized - larger batches possible")

# Step 2: Check batch sizes
import numpy as np

def analyze_batching(request_logs: list[dict]):
    """Analyze actual batch sizes being processed."""
    batch_sizes = [log['batch_size'] for log in request_logs]

    print(f"Batch size stats:")
    print(f"  Mean: {np.mean(batch_sizes):.1f}")
    print(f"  Max:  {max(batch_sizes)}")
    print(f"  % size 1: {sum(1 for b in batch_sizes if b == 1) / len(batch_sizes):.1%}")

    if np.mean(batch_sizes) < 4:
        print("  WARNING: Low average batch size - enable continuous batching")

Common Causes & Solutions:

Cause | Diagnostic Sign | Solution
No batching | Batch size = 1 | Enable continuous batching (vLLM default)
Small KV cache | Memory available but not used | Increase --gpu-memory-utilization
Wrong tensor parallelism | Multi-GPU but sequential | Adjust --tensor-parallel-size
Long context sequences | High memory per request | Reduce context, use PagedAttention
Model too large | OOM or swapping | Quantize model or use smaller variant
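
A growing queue means requests arrive faster than they complete. Little's Law (requests in flight = arrival rate × mean latency) gives a quick capacity estimate. The function below is a back-of-the-envelope sketch with hypothetical parameter names; it assumes steady traffic and ignores burstiness.

```python
import math

def required_replicas(
    arrival_rate_rps: float,
    mean_latency_s: float,
    max_concurrent_per_replica: int,
) -> int:
    """Estimate replicas needed so in-flight requests fit without queueing."""
    if max_concurrent_per_replica <= 0:
        raise ValueError("max_concurrent_per_replica must be positive")
    # Little's Law: average number of requests in the system
    in_flight = arrival_rate_rps * mean_latency_s
    return max(1, math.ceil(in_flight / max_concurrent_per_replica))
```

For example, 20 requests per second at 4 seconds each means about 80 requests in flight; if one replica sustains 32 concurrent requests, the estimate is 3 replicas. Add headroom for traffic spikes.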

Reference: Chapter 9 (Batching), Chapter 24 (GPU Optimization)


Problem: Costs Are Too High

Symptoms:

  • API bills higher than expected
  • Self-hosted GPU costs exceed budget
  • Cost per query not sustainable

Diagnostic Steps:

import numpy as np

def analyze_llm_costs(usage_logs: list[dict], pricing: dict):
    """Analyze LLM API costs."""
    cost_by_endpoint = {}
    cost_by_user = {}

    for log in usage_logs:
        # Calculate cost
        input_cost = log['input_tokens'] * pricing['input_per_1k'] / 1000
        output_cost = log['output_tokens'] * pricing['output_per_1k'] / 1000
        total_cost = input_cost + output_cost

        # Aggregate
        endpoint = log['endpoint']
        cost_by_endpoint[endpoint] = cost_by_endpoint.get(endpoint, 0) + total_cost

        user = log.get('user_id', 'unknown')
        cost_by_user[user] = cost_by_user.get(user, 0) + total_cost

    # Report
    print("Cost by endpoint:")
    for endpoint, cost in sorted(cost_by_endpoint.items(), key=lambda x: -x[1])[:10]:
        print(f"  {endpoint}: ${cost:.2f}")

    print("\nTop users by cost:")
    for user, cost in sorted(cost_by_user.items(), key=lambda x: -x[1])[:10]:
        print(f"  {user}: ${cost:.2f}")

    # Check for optimization opportunities
    avg_input_tokens = np.mean([log['input_tokens'] for log in usage_logs])
    avg_output_tokens = np.mean([log['output_tokens'] for log in usage_logs])

    print(f"\nAverage tokens: {avg_input_tokens:.0f} input, {avg_output_tokens:.0f} output")
    if avg_input_tokens > 2000:
        print("  TIP: High input tokens - consider context compression")
    if avg_output_tokens > 500:
        print("  TIP: High output tokens - consider max_tokens limits")

Common Causes & Solutions:

Cause | Diagnostic Sign | Solution
No caching | Same queries repeated | Add semantic caching
Over-sized model | Using GPT-5 for simple tasks | Route simple queries to smaller models
Long contexts | >4K tokens average | Compress context, better retrieval
No max_tokens | Verbose outputs | Set appropriate max_tokens
Wasteful retries | Many failed requests | Fix root cause, add backoff
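
The semantic caching fix can be sketched as a lookup keyed on embedding similarity instead of exact text. The class below is a minimal in-memory illustration: the SemanticCache name, the embed callable, and the 0.95 threshold are all assumptions, and a production cache would use a vector index plus expiry instead of a linear scan.

```python
import math
from typing import Callable, Optional, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    """Return a cached response when a new query is close to a stored one."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed          # caller-supplied embedding function (assumed)
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries: list[tuple[Sequence[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        query_emb = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:  # linear scan: fine for a sketch
            score = cosine(query_emb, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

A threshold that is too low returns answers to the wrong question, so start strict and loosen it only after reviewing cache hits.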

Reference: Chapter 29 (Cost Engineering)


Section 3: Errors and Failures

Problem: Out of Memory (OOM) Errors

Symptoms:

  • CUDA out of memory errors
  • Process killed by OOM killer
  • Gradual memory growth leading to crash

Diagnostic Steps:

# Step 1: Monitor memory over time
import torch

def log_memory_stats():
    """Log current GPU memory usage."""
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            reserved = torch.cuda.memory_reserved(i) / 1024**3
            total = torch.cuda.get_device_properties(i).total_memory / 1024**3

            print(f"GPU {i}: {allocated:.1f}GB allocated, {reserved:.1f}GB reserved, {total:.1f}GB total")

            if allocated / total > 0.9:
                print(f"  WARNING: Near OOM threshold")

# Step 2: Check for memory leaks
def check_memory_leak(func, num_iterations: int = 100):
    """Check if a function leaks memory."""
    initial_memory = torch.cuda.memory_allocated()

    for i in range(num_iterations):
        func()

        if i % 10 == 0:
            current = torch.cuda.memory_allocated()
            growth = (current - initial_memory) / 1024**2
            print(f"Iteration {i}: {growth:.1f} MB growth")

    final_memory = torch.cuda.memory_allocated()
    total_growth = (final_memory - initial_memory) / 1024**2

    print(f"\nTotal memory growth: {total_growth:.1f} MB over {num_iterations} iterations")
    if total_growth > 100:
        print("WARNING: Possible memory leak detected")

Common Causes & Solutions:

Cause | Diagnostic Sign | Solution
KV cache too large | OOM on long sequences | Reduce max_model_len, enable PagedAttention
Batch size too large | OOM on high load | Reduce batch size, enable continuous batching
Memory leak | Gradual growth | Clear caches, check for tensor accumulation
Model too large | OOM on load | Quantize (INT8/INT4), use smaller model
Multiple models | Combined size exceeds memory | Model sharing, offload unused models
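
To judge whether the KV cache explains an OOM on long sequences, a rough size estimate helps: 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. The sketch below applies that formula; the function name is hypothetical and the result ignores allocator overhead and block granularity.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # 2 for FP16/BF16
) -> int:
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # Factor of 2: one tensor for keys and one for values per layer
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_element
    )
```

For a hypothetical model with 32 layers, 8 KV heads, and head dimension 128 in FP16, each token costs 128 KiB, so one 8,192-token sequence needs about 1 GiB. Multiply by the number of concurrent sequences to see how quickly long contexts exhaust memory.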

Reference: Chapter 24 (Memory Optimization), Chapter 9 (Quantization)


Problem: API Rate Limits and Failures

Symptoms:

  • 429 (Too Many Requests) errors
  • 503 (Service Unavailable) errors
  • Intermittent failures under load

Diagnostic Steps:

from collections import defaultdict

import numpy as np

def analyze_api_errors(error_logs: list[dict]):
    """Analyze API error patterns."""
    errors_by_type = defaultdict(list)
    errors_by_hour = defaultdict(int)

    for log in error_logs:
        errors_by_type[log['status_code']].append(log)
        hour = log['timestamp'].hour
        errors_by_hour[hour] += 1

    print("Errors by status code:")
    for code, errors in sorted(errors_by_type.items()):
        print(f"  {code}: {len(errors)} errors")
        if code == 429:
            # Check rate limit headers
            retry_afters = [e.get('retry_after', 0) for e in errors]
            print(f"    Avg retry-after: {np.mean(retry_afters):.1f}s")
        if code == 503:
            print(f"    Likely provider overload")

    print("\nErrors by hour:")
    for hour in range(24):
        count = errors_by_hour[hour]
        bar = "█" * (count // 10)
        print(f"  {hour:02d}:00 - {bar} ({count})")

Retry Strategy Implementation:

import asyncio
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)

class RetryableAPIError(Exception):
    pass

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type(RetryableAPIError)
)
async def call_llm_with_retry(client, request):
    """Call LLM API with exponential backoff."""
    # RateLimitError and ServiceUnavailableError stand in for the exception
    # classes your provider's SDK raises on 429 and 503 responses.
    try:
        return await client.generate(request)
    except RateLimitError as e:
        raise RetryableAPIError(f"Rate limited: {e}") from e
    except ServiceUnavailableError as e:
        raise RetryableAPIError(f"Service unavailable: {e}") from e
    # Any other exception propagates unchanged and is not retried.

Common Causes & Solutions:

Error | Cause | Solution
429 | Exceeding rate limits | Add rate limiting, request quota increase
503 | Provider overload | Retry with backoff, add fallback provider
Timeout | Long generation, network issues | Increase timeout, add streaming
401 | Invalid/expired API key | Rotate keys, check expiration
400 | Invalid request | Validate inputs, check for changes in API
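
Retries handle 429s after the fact; client-side rate limiting avoids sending the excess requests in the first place. A token bucket is one common approach. The class below is a single-threaded sketch with hypothetical names; the rate and capacity should come from your provider's documented limits.

```python
import time
from typing import Callable

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity   # start full so initial bursts are allowed
        self.clock = clock       # injectable for testing
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False without blocking otherwise."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

When try_acquire returns False, the caller can queue the request or wait briefly before trying again. Add a lock if the bucket is shared across threads.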

Reference: Chapter 28 (Reliability Engineering)


Problem: Agent Gets Stuck in Loops

Symptoms:

  • Agent repeats same action multiple times
  • Token usage spikes without progress
  • Tasks never complete

Diagnostic Steps:

def analyze_agent_trace(trace: list[dict]):
    """Analyze agent execution trace for issues."""
    # Check for repeated actions
    action_sequence = [step['action'] for step in trace]

    # Find immediately repeating patterns (e.g. A, B, A, B)
    for pattern_len in range(1, 5):
        # +1 so the final window at the end of the trace is also checked
        for i in range(len(action_sequence) - pattern_len * 2 + 1):
            pattern = tuple(action_sequence[i:i+pattern_len])
            next_pattern = tuple(action_sequence[i+pattern_len:i+pattern_len*2])
            if pattern == next_pattern:
                print(f"WARNING: Repeated pattern detected at step {i}: {pattern}")

    # Check for increasing token usage
    if len(trace) > 5:
        recent_tokens = [step['tokens'] for step in trace[-5:]]
        if all(recent_tokens[i] >= recent_tokens[i-1] for i in range(1, len(recent_tokens))):
            print("WARNING: Token usage monotonically increasing - possible context explosion")

    # Check for lack of progress
    if len(trace) > 10:
        recent_observations = [step.get('observation', '') for step in trace[-5:]]
        if len(set(recent_observations)) == 1:
            print("WARNING: Same observation repeated - agent may be stuck")

Common Causes & Solutions:

Cause | Diagnostic Sign | Solution
No termination condition | Runs forever | Add max_steps, explicit “done” action
Bad tool output | Same error repeated | Improve error messages, add error handling
Context growing | Token count increasing | Summarize history, limit context
Conflicting instructions | Oscillating actions | Clarify system prompt
Missing capability | Keeps trying impossible action | Add needed tools or handle gracefully
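
The max_steps fix can be combined with a check for repeated identical actions so the agent stops at run time instead of being diagnosed afterwards. The guard below is a sketch with hypothetical names and limits; it assumes each step can be summarized as an action name plus a string form of its input.

```python
from collections import Counter
from typing import Optional

class LoopGuard:
    """Stop an agent loop on too many steps or too many identical actions."""

    def __init__(self, max_steps: int = 20, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen: Counter = Counter()

    def check(self, action: str, action_input: str) -> Optional[str]:
        """Record one step; return a stop reason, or None to continue."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps exceeded"
        key = (action, action_input)
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return f"repeated action {action!r} more than {self.max_repeats} times"
        return None
```

Call check before executing each tool call and end the run, or escalate to a human, when it returns a reason.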

Reference: Chapter 8 (Agent Architectures, Safety)


Debugging Checklists

Pre-Deployment Checklist

□ Unit tests pass
□ Integration tests pass
□ Load testing completed
□ Error handling tested
□ Monitoring/alerting configured
□ Rollback procedure documented
□ Rate limits configured
□ Cost projections validated
□ Security review completed
□ Documentation updated

Incident Response Checklist

□ Identify the symptom (what's broken?)
□ Check monitoring dashboards
□ Check recent deployments/changes
□ Reproduce the issue
□ Isolate the component
□ Check logs for errors
□ Test fix in staging
□ Deploy fix with monitoring
□ Document root cause
□ Add tests to prevent recurrence

Model Update Checklist

□ Run evaluation suite on new model
□ Compare metrics to baseline
□ Test with production-like traffic
□ Check output format compatibility
□ Verify prompt compatibility
□ Update documentation
□ Plan staged rollout
□ Monitor closely after deployment
□ Have rollback ready

Common Error Messages and Solutions

Error Message | Likely Cause | Solution
CUDA out of memory | GPU memory exhausted | Reduce batch size, quantize, use smaller model
Connection reset by peer | API timeout/overload | Add retry logic, increase timeout
Invalid API key | Auth issue | Check key, rotation, permissions
Context length exceeded | Input too long | Truncate input, use smaller chunks
Model not found | Wrong model name or not loaded | Check model availability, warm up
Rate limit exceeded | Too many requests | Add rate limiting, request quota increase
JSON decode error | Malformed model output | Use structured outputs, add validation
Timeout waiting for response | Slow generation | Increase timeout, use streaming

Getting Help

When debugging complex issues:

  1. Reproduce minimally: Create the smallest example that shows the issue
  2. Check logs thoroughly: Often the real error is buried in logs
  3. Binary search: Disable components to isolate the problem
  4. Compare to working state: What changed since it last worked?
  5. Search issues: GitHub issues, Stack Overflow, forums
  6. Ask with context: Include error messages, code, and what you’ve tried

Useful Resources:

  • vLLM GitHub Issues: github.com/vllm-project/vllm/issues
  • HuggingFace Forums: discuss.huggingface.co
  • LangChain Community (Slack): langchain.com/join-community
  • r/MachineLearning: reddit.com/r/MachineLearning