Chapter 2: Python for AI Engineering

Keywords

Python, async, Pydantic, type hints, error handling, observability, logging, data classes

Introduction

In early 2023, a machine learning engineer at a healthcare startup faced a familiar crisis. Her team had built a diagnostic assistant using GPT-4—a Jupyter notebook that analyzed patient symptoms and suggested possible conditions. In demos, it was impressive. Doctors were excited. The CEO scheduled a board presentation.

Then the engineering team tried to deploy it.

The notebook made synchronous API calls that blocked the web server, causing timeouts when multiple doctors used it simultaneously. Error handling was optimistic at best—a single API hiccup crashed the entire application. Configuration was scattered across cells, with API keys occasionally committed to version control. The code that had dazzled stakeholders became a liability that took six weeks to untangle.

This story plays out constantly across the industry. The gap between “it works in a notebook” and “it runs reliably in production” is vast, and Python—for all its strengths—makes it easy to fall into this chasm. The same flexibility that enables rapid prototyping allows sloppy practices that become expensive later.

This chapter is about writing Python that works both for exploration and production. Not Python that’s technically correct but unmaintainable. Not Python that’s elegant but impractical. Python that lets you iterate quickly during development and sleep soundly when your code runs in production.

Why Python Dominates AI Engineering

Python’s dominance in AI isn’t a historical accident or mere popularity contest. Several forces converge to make it uniquely suited for this domain.

The library ecosystem creates a gravitational pull. NumPy, PyTorch, TensorFlow, Hugging Face Transformers, LangChain—the tools that power AI development are Python-first. Other languages have bindings, but Python has native support with full documentation and community knowledge. When you encounter an obscure error from a transformer model, solutions in Python are abundant. Solutions in Rust or Go often don’t exist.

Rapid iteration matches AI’s experimental nature. AI development isn’t like building a web form. You try a prompting strategy, evaluate results, adjust, repeat. You experiment with different models, embedding approaches, chunking strategies. Python’s dynamic nature—the same trait that causes production bugs—enables this experimentation. You can modify a pipeline mid-execution in a debugger, reload modules without restarting, test ideas in a REPL.

Glue language capabilities matter for integration. Production AI systems rarely stand alone. They connect to databases, message queues, APIs, authentication systems, monitoring infrastructure. Python’s extensive standard library and third-party packages provide connectors for virtually anything. A single language spans your entire stack from data preprocessing to model serving to observability.

But Python’s flexibility is a double-edged sword. The language that lets you quickly test an idea also lets you quickly create unmaintainable code. Dynamic typing that speeds prototyping causes production bugs. The absence of required structure means discipline must come from the developer, not the language.

AI engineering requires disciplined Python—leveraging flexibility for exploration while building in safeguards for reliability.

The Mental Model: Concentric Circles of Trust

A useful framework for thinking about Python in AI systems is concentric circles of trust. At the center are components you fully control—your own code, tested and typed. Moving outward, each ring represents less control and more uncertainty:

Your code: Full control, can be tested and typed exhaustively
Your dependencies: Trusted but external, can break with updates
External APIs: Unreliable, slow, may change without notice, may have outages
User input: Completely untrusted, potentially malicious, definitely surprising

Good AI engineering Python acknowledges these boundaries. You validate data as it crosses from outer rings to inner rings. You handle failures at API boundaries. You type your internal interfaces strictly while accepting that external data needs runtime validation.

This isn’t paranoia—it’s pragmatism. Every production incident teaches the same lesson: something you assumed would work, didn’t.
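
The boundary idea can be sketched with nothing but the standard library: untrusted input is parsed once, at the edge, into a typed object, and everything inside works only with that object. The names Query and parse_query, and the limits used here, are illustrative assumptions rather than part of any library.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Query:
    """Inner-ring type: once constructed, the rest of the code can trust it."""
    text: str
    top_k: int


def parse_query(raw: Any, max_chars: int = 10_000) -> Query:
    """Validate untrusted input at the boundary and return a trusted Query."""
    if not isinstance(raw, dict):
        raise ValueError("payload must be a JSON object")
    text = raw.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")
    if len(text) > max_chars:
        raise ValueError(f"'text' exceeds {max_chars} characters")
    top_k = raw.get("top_k", 5)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 50:
        raise ValueError("'top_k' must be an integer between 1 and 50")
    return Query(text=text.strip(), top_k=top_k)
```

Later sections replace this kind of hand-written check with Pydantic, but the shape stays the same: validate at the ring boundary, trust inside it.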

Skip This Chapter If…

You can skip this chapter if you already:

Write production Python with type hints and Pydantic models
Use async/await for concurrent I/O operations
Implement retry logic with exponential backoff
Manage dependencies with lock files (Poetry, pip-tools)
Structure code with proper error handling and logging

Skim instead if you know most of this but want to see AI-specific patterns. Jump to the “Patterns That Appear Everywhere” section.

Read carefully if you’re coming from data science/notebooks, another language, or haven’t built production Python services.

What You’ll Learn

This chapter covers Python for AI engineering along four dimensions:

Language features that matter: Type hints, async programming, context managers, and dataclasses—not as syntax tutorials but as tools for specific problems
Library choices and trade-offs: When to use httpx versus requests, Pydantic versus dataclasses, Poetry versus pip
Environment discipline: Dependency management, secrets handling, and containerization that prevent “works on my machine” problems
Patterns that appear everywhere: Retry logic, rate limiting, streaming, and observability—the infrastructure code that separates prototypes from products

The goal isn’t comprehensive Python coverage. It’s focused preparation for the AI engineering work ahead.

Type Hints: Documentation That Catches Bugs

The Problem with Dynamic Typing

Python’s dynamic typing is liberating during exploration. You don’t declare types, don’t satisfy a compiler, just write code and run it. This speed has real value when you’re iterating on prompt strategies or testing different embedding models.

But dynamic typing has a cost that compounds over time. Consider this function from an AI codebase:

def process_response(response, config):
    return response['choices'][0]['message']['content'] if config.get('raw') else clean(response)

What is response? A dictionary? What shape? What is config? What does clean() return? Without reading the implementation, all call sites, and possibly the API documentation, you cannot know.

This ambiguity creates several problems:

Code review becomes archaeology. Reviewers must trace through implementations to understand what functions expect and return. Review fatigue leads to bugs slipping through.

Refactoring becomes dangerous. Change a function’s return type, and you have no way to find all affected code without running every path—which in AI systems with many conditional branches is effectively impossible.

Onboarding takes longer. New team members face a codebase where understanding requires not reading code but deciphering it.

IDE support degrades. Without type information, autocomplete is limited to guessing, and errors that could be caught immediately require runtime to surface.

Type Hints as Executable Documentation

Python 3.5 introduced type hints—optional annotations that specify expected types without runtime enforcement. Modern Python type hints are sophisticated enough to express complex data structures clearly:

from typing import TypedDict, Literal

class Message(TypedDict):
    role: Literal["user", "assistant", "system"]
    content: str

class Choice(TypedDict):
    message: Message
    index: int
    finish_reason: Literal["stop", "length", "content_filter"]

class ChatCompletion(TypedDict):
    id: str
    model: str
    choices: list[Choice]
    usage: dict[str, int]

def process_response(response: ChatCompletion, raw: bool = False) -> str:
    """Extract content from a chat completion response."""
    if raw:
        return response['choices'][0]['message']['content']
    return clean_text(response['choices'][0]['message']['content'])

The function signature now documents itself. IDEs provide accurate autocomplete. Static analyzers catch typos like response['choice'] (missing ‘s’) before runtime. Refactoring tools can find all usages and verify compatibility.

Why Type Hints Matter Especially for AI Code

AI applications handle unusually complex data structures. LLM API responses are deeply nested dictionaries. Embeddings are lists of floats of specific dimensions. Batch operations return lists of complex objects with optional fields depending on processing outcomes.

Without types, bugs hide in data shape mismatches that only surface at runtime—often in production, often at 3 AM, often in subtle ways that take hours to diagnose.

Consider the difference in debugging experience:

Without types: “TypeError: ‘NoneType’ object is not subscriptable” somewhere in a 500-line pipeline. Which variable is None? Where did it come from?

With types and static analysis: “error: Item ‘None’ of ‘Optional[Message]’ has no attribute ‘content’” at line 47, during development, before you even run the code.
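
A small sketch of the situation behind that second message, using hypothetical names (Message, last_assistant_message, last_reply_text). Because the lookup is annotated as returning Optional[Message], a checker such as mypy flags any caller that touches .content without first handling None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    role: str
    content: str


def last_assistant_message(history: list[Message]) -> Optional[Message]:
    """Return the most recent assistant message, or None if there is none."""
    for message in reversed(history):
        if message.role == "assistant":
            return message
    return None


def last_reply_text(history: list[Message]) -> str:
    message = last_assistant_message(history)
    # Without this check, a type checker reports that the value may be None
    # before the code is ever run.
    if message is None:
        return ""
    return message.content
```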

The Pragmatic Approach: Gradual Adoption

Type hints are opt-in. You don’t need to annotate everything immediately. The highest-value strategy is gradual adoption focused on boundaries:

Start with function signatures. The function’s contract is the minimum viable type coverage. Even without typing internal variables, annotated parameters and returns catch most interface mismatches.

Prioritize data structures that cross modules. When data flows between components written by different people or at different times, types prevent miscommunication.

Type complex nested structures explicitly. API responses, configuration objects, and intermediate data formats benefit most from explicit typing.

Run a type checker in CI. Tools like mypy or pyright catch errors where you’ve added types. Start with lenient settings and tighten as coverage improves.

The goal isn’t type purity. It’s catching bugs early and making code self-documenting where it matters most.

Complete examples: See reference/02_python_ai_patterns.md for TypedDict patterns, Pydantic response models, and generic type patterns for AI applications.

Async Programming: Concurrency for I/O-Bound Work

Understanding the Problem

AI applications make many external calls. Retrieving embeddings, querying LLM APIs, searching vector databases, fetching context documents—each call takes time, and that time is mostly waiting. Your CPU sits idle while data travels across networks and remote servers process requests.

Synchronous code handles this poorly. Consider embedding 100 documents:

import requests

def embed_documents_sync(texts: list[str]) -> list[list[float]]:
    embeddings = []
    for text in texts:
        # Each call blocks until the response arrives before the next starts.
        response = requests.post(EMBED_URL, json={"input": text})
        embeddings.append(response.json()["embedding"])
    return embeddings

If each API call takes 100 milliseconds, this function takes 10 seconds. During those 10 seconds, your program is 99% idle—waiting for network responses, doing nothing useful.

The Async Solution

Asynchronous programming lets you initiate many I/O operations without waiting for each to complete:

import asyncio
import logging

import httpx

logger = logging.getLogger(__name__)

async def embed_documents_async(texts: list[str]) -> list[list[float] | None]:
    async with httpx.AsyncClient() as client:
        tasks = [
            client.post(EMBED_URL, json={"input": text})
            for text in texts
        ]
        # return_exceptions=True is critical for batch work: by default the
        # first failed request makes gather raise, and the caller loses every
        # other result. Here we keep the successes and handle failures.
        results = await asyncio.gather(*tasks, return_exceptions=True)

    embeddings: list[list[float] | None] = []
    for text, result in zip(texts, results):
        if isinstance(result, BaseException):
            logger.warning("embedding failed for %r: %s", text[:50], result)
            embeddings.append(None)  # or queue for retry, per your contract
            continue
        try:
            # httpx does not raise on 4xx/5xx by itself, so check explicitly.
            result.raise_for_status()
            embeddings.append(result.json()["embedding"])
        except (httpx.HTTPStatusError, KeyError, ValueError) as exc:
            logger.warning("embedding failed for %r: %s", text[:50], exc)
            embeddings.append(None)
    return embeddings

All 100 requests launch nearly simultaneously, so the function completes when the slowest single request finishes: perhaps 150 milliseconds instead of 10 seconds. The speedup is dramatic. The subtle part is failure handling. A bare asyncio.gather(*tasks) propagates the first exception to the caller; the other requests keep running, but their results are lost, so one failed call would sink an entire batch of embeddings. return_exceptions=True makes partial failure a first-class case you handle deliberately. Note also that httpx returns 4xx and 5xx responses as ordinary response objects, so a rate-limited (429) call only becomes an error when you call raise_for_status().

The Mental Model: Restaurant Servers vs. Cashiers

Synchronous code is like a single cashier serving a queue. Each customer waits while the current one is served. The cashier is always busy, but throughput is limited by processing one customer at a time.

Asynchronous code is like a restaurant server. Take an order, submit it to the kitchen, take another order while the first cooks. The server isn’t cooking faster—they’re not waiting idle while food prepares.

This mental model clarifies when async helps and when it doesn’t:

Async helps when you’re waiting for external resources. API calls, database queries, file I/O—operations where your program waits for something else.

Async doesn’t help for CPU-bound work. If you’re computing embeddings locally, async won’t speed anything up. The CPU is the bottleneck, and async doesn’t parallelize computation.

Async adds complexity. Concurrent code is harder to reason about, debug, and test. The speedup must justify this cost.
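
The restaurant analogy can be checked with a short simulation. In this sketch, fake_api_call is a stand-in for a network request (it only sleeps), and the 50 ms latency is an arbitrary assumption chosen to keep the run short.

```python
import asyncio
import time


async def fake_api_call(i: int, latency: float = 0.05) -> int:
    """Stand-in for a network call: all of its time is spent waiting."""
    await asyncio.sleep(latency)
    return i * 2


async def run_sequential(n: int) -> list[int]:
    # Cashier: each call must finish before the next one starts.
    return [await fake_api_call(i) for i in range(n)]


async def run_concurrent(n: int) -> list[int]:
    # Server: start every call, then wait for all of them together.
    return list(await asyncio.gather(*(fake_api_call(i) for i in range(n))))


def timed(coro) -> tuple[list[int], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    seq_result, seq_time = timed(run_sequential(10))
    con_result, con_time = timed(run_concurrent(10))
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both versions return the same results in the same order; only the waiting overlaps. Replace the sleep with a CPU-heavy loop and the two timings converge, which is the CPU-bound case described above.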

Essential Async Patterns

Four patterns appear repeatedly in AI applications:

Concurrent execution with gather. When you have multiple independent operations, run them concurrently:

async def fetch_all_contexts(queries: list[str]) -> list[Context]:
    tasks = [fetch_context(q) for q in queries]
    return await asyncio.gather(*tasks)

Controlled concurrency with semaphores. APIs have rate limits. Launching 10,000 concurrent requests will get you blocked. Semaphores limit concurrency:

async def fetch_with_limit(items: list[str], max_concurrent: int = 10) -> list[Result]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(item: str) -> Result:
        async with semaphore:
            return await fetch_item(item)

    return await asyncio.gather(*[fetch_one(i) for i in items])
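
To see what the semaphore actually guarantees, the following sketch counts how many fake calls are in flight at once. The function name run_with_limit and the sleep standing in for network latency are assumptions for illustration.

```python
import asyncio


async def run_with_limit(n_items: int, max_concurrent: int) -> int:
    """Run n_items fake calls and return the peak number in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the network wait
            in_flight -= 1

    await asyncio.gather(*(fake_call() for _ in range(n_items)))
    return peak
```

Note what this bounds: the number of simultaneous requests, not requests per second. If a provider enforces a per-minute quota, a semaphore alone may not be enough and a rate limiter is the better fit.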

Timeouts for operations that might hang. External services sometimes stall. Timeouts prevent indefinite waits:

async def fetch_with_timeout(item: str, timeout: float = 30.0) -> Result:
    try:
        return await asyncio.wait_for(fetch_item(item), timeout=timeout)
    except asyncio.TimeoutError as e:
        # Chain the original exception so the traceback shows the cause.
        raise FetchError(f"Timed out after {timeout}s") from e

First-completed for redundant providers. When multiple providers can serve a request, take the first response:

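async def fetch_from_any(item: str, providers: list[Provider]) -> Result:
    # asyncio.wait needs Task objects; passing bare coroutines is an error
    # on Python 3.11+ and was deprecated before that.
    tasks = [asyncio.create_task(provider.fetch(item)) for provider in providers]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

    for task in pending:
        task.cancel()

    # "First completed" includes a provider that failed fast: .result()
    # re-raises that provider's exception in this case.
    return done.pop().result()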

When to Use Async

Async isn’t always the right choice. Use it when:

Making multiple independent external calls (embedding batches, parallel LLM queries)
Building servers that handle concurrent requests
Processing pipelines where each step involves I/O

Don’t use async for:

Single sequential API calls (no concurrency to exploit)
CPU-bound work (use multiprocessing instead)
Simple scripts where complexity isn’t justified

The decision isn’t about async being “better”—it’s about whether the speedup justifies the complexity in your specific context.
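
One case falls between the two lists: an async application that depends on a library offering only a blocking client. A possible bridge, sketched here with a hypothetical blocking_lookup function, is asyncio.to_thread (Python 3.9+). This helps with blocking I/O; it does not speed up CPU-bound work, which still calls for multiprocessing.

```python
import asyncio
import time


def blocking_lookup(key: str) -> str:
    """Stand-in for a synchronous client call that blocks the calling thread."""
    time.sleep(0.05)
    return key.upper()


async def lookup_all(keys: list[str]) -> list[str]:
    # Each blocking call runs in a worker thread, so the event loop stays
    # free to serve other coroutines while the threads wait.
    return list(
        await asyncio.gather(*(asyncio.to_thread(blocking_lookup, k) for k in keys))
    )
```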

Complete examples: See reference/02_python_ai_patterns.md for semaphore-based rate limiting, timeout handling, and bridging sync/async contexts.

Data Validation: Pydantic and the Trust Boundary

The Problem with Unvalidated Data

Data enters AI systems from many sources: API responses, user inputs, configuration files, database queries. Each source can produce unexpected data—missing fields, wrong types, values outside expected ranges.

Without validation, these problems surface far from their origin. A missing field in an API response causes an error three functions later. A malformed user input corrupts results without raising exceptions. Debugging becomes archaeology, tracing backward from symptoms to causes.

Pydantic: Validation as Type Definition

Pydantic models define data structures that validate on instantiation. If validation fails, you get an immediate, specific error:

from pydantic import BaseModel, Field
from typing import Literal

class ChatMessage(BaseModel):
    role: Literal["user", "assistant", "system"]
    content: str = Field(min_length=1, max_length=100000)

class ChatRequest(BaseModel):
    messages: list[ChatMessage]
    model: str = "gpt-5"
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

Attempting to create a ChatRequest with invalid data fails immediately:

# Raises ValidationError: temperature must be less than or equal to 2.0
ChatRequest(messages=[ChatMessage(role="user", content="Hi")], temperature=2.5)

The error is specific, occurs at the point of creation, and includes context about what went wrong.

Validation at Trust Boundaries

Pydantic is most valuable at boundaries where data crosses trust levels:

External API responses. When you call an embedding API, the response should match your expectations. Parsing through Pydantic validates this assumption:

response = await client.post(url, json=payload)
result = EmbeddingResponse.model_validate(response.json())
# If parsing succeeds, result matches your expected structure

User input. Everything from users is suspect. Pydantic enforces constraints:

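from pydantic import BaseModel, Field, field_validator

class UserQuery(BaseModel):
    text: str = Field(min_length=1, max_length=10000)
    filters: dict[str, str] = Field(default_factory=dict)

    @field_validator('text')
    @classmethod
    def sanitize_text(cls, v: str) -> str:
        # Runs after the length checks, so re-check for whitespace-only input.
        v = v.strip()
        if not v:
            raise ValueError("text must not be blank")
        return v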

Configuration files. Settings loaded from YAML or JSON should be validated before use:

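from pathlib import Path
from typing import Literal

import yaml  # provided by the PyYAML package
from pydantic import BaseModel, Field

class AppConfig(BaseModel):
    llm_provider: Literal["openai", "anthropic"]
    model: str
    temperature: float = Field(ge=0.0, le=2.0)

    @classmethod
    def from_yaml(cls, path: Path) -> "AppConfig":
        with open(path) as f:
            return cls.model_validate(yaml.safe_load(f))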

Pydantic vs. Dataclasses

Python’s built-in dataclasses are lighter weight but don’t validate. The choice depends on context:

Use dataclasses for internal data structures where you control all inputs and validation overhead isn’t warranted:

from dataclasses import dataclass

@dataclass
class EmbeddingResult:
    vector: list[float]
    model: str
    cached: bool = False

Use Pydantic at boundaries where data comes from external sources or needs validation:

class APIResponse(BaseModel):
    embedding: list[float]
    model: str
    usage: dict[str, int]

The distinction matters for performance and complexity. Pydantic validation has overhead—meaningful when processing millions of records. For internal data you trust, dataclasses avoid that cost.
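
It helps to see what "don't validate" means in practice. In the sketch below, the annotations on a dataclass are not enforced at runtime, and the __post_init__ check is a cheap hand-written invariant rather than real validation. The "demo-model" value is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingResult:
    vector: list[float]
    model: str
    cached: bool = False

    def __post_init__(self) -> None:
        # Dataclasses do not check annotations at runtime; this guards one
        # invariant only and leaves element types unchecked.
        if not self.vector:
            raise ValueError("vector must not be empty")


# Annotations alone are not enforced: this constructs without error.
unchecked = EmbeddingResult(vector=["not", "floats"], model="demo-model")  # type: ignore[list-item]
```

If you find yourself adding many such checks, that is usually a sign the data is crossing a trust boundary and a Pydantic model is the better tool.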

Complete examples: See reference/02_python_ai_patterns.md for field validators, model validators, configuration loading, and structured output patterns for LLM responses.

Error Handling: Expecting Failures

The Reality of External Dependencies

AI applications depend heavily on external services, and external services fail. Networks partition, servers crash, rate limits trigger, APIs change without notice. The question isn’t whether failures will occur but how your code responds.

Robust error handling requires categorizing failures by how they should be handled:

Retryable failures might succeed on the next attempt. Network timeouts, rate limits, server errors (5xx) often resolve themselves. These warrant retry logic.

Non-retryable failures won’t be fixed by trying again. Authentication errors, malformed requests, resource not found—these require human intervention or code changes.

Partial failures in batch operations leave some work completed and some failed. These require careful handling to avoid reprocessing successful items.
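
Partial failure is the category most often handled by accident, so here is one possible shape for it. BatchResult and process_batch are hypothetical names; the idea is simply to record per-item outcomes so that a second pass touches only what failed.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class BatchResult:
    succeeded: dict[str, str] = field(default_factory=dict)
    failed: dict[str, Exception] = field(default_factory=dict)


def process_batch(
    items: list[str],
    fn: Callable[[str], str],
    previous: Optional[BatchResult] = None,
) -> BatchResult:
    """Process items, skipping any that already succeeded in a previous run."""
    result = previous if previous is not None else BatchResult()
    for item in items:
        if item in result.succeeded:
            continue  # do not redo (or pay again for) completed work
        try:
            result.succeeded[item] = fn(item)
            result.failed.pop(item, None)
        except Exception as exc:  # in real code, catch narrower error types
            result.failed[item] = exc
    return result
```

Calling process_batch again with the previous result retries only the failed items, which matters when each item costs an API call.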

Structured Error Handling

Define error types that distinguish these categories:

class LLMError(Exception):
    """Base exception for LLM operations."""
    pass

class RetryableError(LLMError):
    """Error that may succeed on retry."""
    pass

class NonRetryableError(LLMError):
    """Error that won't be fixed by retrying."""
    pass

Then handle API errors by mapping them to these categories:

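from openai import (
    APIConnectionError,
    AsyncOpenAI,
    AuthenticationError,
    InternalServerError,
    RateLimitError,
)

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def call_llm(prompt: str) -> str:
    try:
        response = await client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}]
        )
        # content is typed as optional in the SDK; normalize None to "".
        return response.choices[0].message.content or ""

    except AuthenticationError as e:
        # Invalid credentials - don't retry
        raise NonRetryableError("Invalid API credentials") from e

    except RateLimitError as e:
        # Rate limited - retry after backoff
        raise RetryableError("Rate limit exceeded") from e

    except InternalServerError as e:
        # Provider-side 5xx - retry
        raise RetryableError("Server error") from e

    except APIConnectionError as e:
        # Network issue - retry
        raise RetryableError("Connection failed") from e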
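
When you call an HTTP API directly rather than through an SDK, the same split can be made on status codes. The sketch below is a deliberate simplification: treating 408, 429, and every 5xx as retryable is an assumption that suits many providers but not all (a 501, for example, will not fix itself). The function name error_for_status is hypothetical.

```python
from typing import Optional


class LLMError(Exception):
    """Base exception for LLM operations."""


class RetryableError(LLMError):
    """Error that may succeed on retry."""


class NonRetryableError(LLMError):
    """Error that won't be fixed by retrying."""


RETRYABLE_STATUS = {408, 429}  # request timeout, rate limited


def error_for_status(status_code: int) -> Optional[LLMError]:
    """Map an HTTP status code to an error category (None means no error)."""
    if status_code < 400:
        return None
    if status_code in RETRYABLE_STATUS or status_code >= 500:
        return RetryableError(f"HTTP {status_code}")
    return NonRetryableError(f"HTTP {status_code}")
```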

Retry Logic with Exponential Backoff

Retry logic must balance persistence with politeness. Hammering a failing server with immediate retries makes problems worse. Exponential backoff spaces retries increasingly apart:

import asyncio
import logging
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)

async def retry_with_backoff(
    func: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Execute function with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except RetryableError as e:
            if attempt == max_attempts - 1:
                raise  # Final attempt failed

            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            delay *= (0.5 + random.random())     # Jitter: scale by 0.5x to 1.5x

            logger.warning(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

    raise RuntimeError("Should not reach here")

The jitter (random variation in delay) prevents the “thundering herd” problem where many clients retry simultaneously after an outage, overwhelming the recovering server.
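
To make the schedule concrete, this sketch computes the delays without sleeping. It uses a variant often called "full jitter" (each delay drawn uniformly between zero and the exponential ceiling) and adds a cap, both of which differ from the multiplier used in the function above. The name backoff_delays and the default values are assumptions.

```python
import random
from typing import Optional


def backoff_delays(
    max_attempts: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Delays to sleep between attempts: capped exponential with full jitter."""
    rng = rng if rng is not None else random.Random()
    delays = []
    for attempt in range(max_attempts - 1):  # no sleep after the final attempt
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Ceilings for five attempts with a 5s cap: 1s, 2s, 4s, 5s.
    print([round(d, 2) for d in backoff_delays(5, max_delay=5.0)])
```

The cap matters in long retry loops: without it, the tenth attempt would wait roughly eight and a half minutes.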

Libraries for Retry Logic

The tenacity library provides sophisticated retry logic without reinventing it:

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_exception_type(RetryableError),
    reraise=True,  # surface the last RetryableError instead of tenacity's RetryError
)
async def robust_llm_call(prompt: str) -> str:
    return await call_llm(prompt)

Use libraries for complex retry scenarios. Understand the underlying concepts to debug when things go wrong.

Complete examples: See reference/02_python_ai_patterns.md for comprehensive API error handling, tenacity configurations, and rate limiter implementations.

Environment Management: Reproducibility and Isolation

The Dependency Problem

Python’s package ecosystem is both blessing and curse. Thousands of libraries are available for any task. But these libraries depend on other libraries, which depend on others, creating a web of transitive dependencies that can conflict.

The symptoms are familiar:

“It works on my machine” (but not in production)
Package A requires numpy>=1.20 while package B requires numpy<1.20
An update to a minor dependency breaks your application
A security vulnerability in a transitive dependency requires untangling the entire chain

Virtual Environments: Basic Isolation

Virtual environments isolate project dependencies from system Python and from each other. Every project should have its own:

# Create environment
python -m venv .venv

# Activate
source .venv/bin/activate  # Unix
.venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

This is the minimum standard. Developing without virtual environments causes problems that seem small initially but compound over time.

Dependency Locking: Reproducible Installs

requirements.txt with loose version constraints isn’t reproducible:

# This installs different versions on different days
openai>=1.0
pydantic>=2.0

Dependency locking captures the exact versions of all packages, including transitive dependencies:

pip-tools generates locked requirements from loose specifications:

# requirements.in (your direct dependencies)
openai>=1.0
pydantic>=2.0

# Generate locked requirements.txt
pip-compile requirements.in

# Install exactly those versions
pip-sync requirements.txt

Poetry manages dependencies with built-in locking:

# Add dependencies (updates pyproject.toml and poetry.lock)
poetry add openai pydantic

# Install exactly what's in poetry.lock
poetry install

The lock file is your reproducibility guarantee. Commit it to version control. CI and production install from the lock file, not from loose constraints.
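
A CI job can enforce that rule cheaply. The check below is a rough sketch, not a requirements parser: it flags any line without an exact == pin and ignores comments and pip options. The function name and the sample package lines are illustrative.

```python
def unpinned_requirements(lines: list[str]) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = ["openai==1.30.1", "pydantic>=2.0  # loose", "httpx"]
    print(unpinned_requirements(sample))
```

Failing the build when this list is non-empty catches the case where someone edits requirements.txt by hand instead of regenerating it.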

Secrets Management: The API Key Problem

API keys are among the most common security leaks in AI projects. Automated scanners watch public repositories, so a key committed to GitHub can be found and abused within minutes of the push.

Never hardcode secrets. Not even temporarily. Not even in example code you’ll “fix later.”

Use environment variables with .env files for local development:

# .env (never commit this)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# .env.example (commit this as documentation)
OPENAI_API_KEY=your-key-here
ANTHROPIC_API_KEY=your-key-here

Load environment variables at application start:

from dotenv import load_dotenv
import os

load_dotenv()  # Loads .env into environment

openai_key = os.environ["OPENAI_API_KEY"]  # Fails fast if missing
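
As the number of keys grows, it can help to load them in one place, report every missing variable at once, and make sure the values never appear in logs. One possible sketch follows; Secrets and load_secrets are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Secrets:
    openai_api_key: str
    anthropic_api_key: str

    def __repr__(self) -> str:
        # Never let key material reach logs or tracebacks.
        return "Secrets(openai_api_key='***', anthropic_api_key='***')"


def load_secrets(env: Mapping[str, str] = os.environ) -> Secrets:
    """Read required keys, reporting every missing variable in one error."""
    required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Secrets(env["OPENAI_API_KEY"], env["ANTHROPIC_API_KEY"])
```

Accepting the environment as a parameter keeps the function testable without touching the real process environment.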

For production, use secrets managers—AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault. These provide audit trails, rotation, and access control that environment variables lack.

Docker: Consistent Environments

Docker containers package your application with its complete environment—operating system, Python version, dependencies, configuration. This eliminates environment discrepancies between development and production.

A minimal Dockerfile for an AI application:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (better layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Run as non-root user
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

CMD ["python", "-m", "app.main"]

The key insight is layer caching: by copying and installing dependencies before copying application code, you avoid reinstalling dependencies on every code change.

Complete examples: See reference/02_python_ai_patterns.md for AWS/GCP secrets manager patterns, pyproject.toml configurations, and Docker Compose setups for AI development.

Observability: Seeing What Your Code Does

The Debugging Challenge

AI applications are hard to debug. LLM outputs are non-deterministic. Errors in prompt construction might produce plausible but wrong results. Performance problems might stem from any of a dozen external service calls.

Traditional debugging—setting breakpoints, stepping through code—helps for some problems. But production issues often can’t be reproduced locally, and the interactive nature of debugging doesn’t scale to understanding system behavior over time.

Observability—logging, metrics, and tracing—provides visibility into running systems.

Structured Logging

Unstructured log messages are human-readable but machine-hostile:

2024-01-15 10:23:45 INFO Processing request for user 123
2024-01-15 10:23:46 INFO LLM call completed in 1.2s

Structured logging outputs machine-parseable data that enables querying and aggregation:

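import structlog

# structlog's default output is a human-readable console format;
# configure a JSON renderer to get machine-parseable logs.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()

logger.info(
    "llm_call_completed",
    model="gpt-5",
    latency_ms=1200,
    prompt_tokens=150,
    completion_tokens=80,
    user_id="123",
)

With a JSON renderer configured, this outputs JSON that log aggregation systems (Datadog, CloudWatch, Loki) can query: “Show me all LLM calls taking more than 2 seconds” or “What’s the average token usage per model?”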
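
If you would rather not add a dependency, the standard library can produce similar output. This is a minimal sketch: JsonFormatter and the convention of passing structured data under a "fields" key are illustrative choices, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Data passed via extra={"fields": {...}} is attached to the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("llm")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("llm_call_completed", extra={"fields": {"model": "gpt-5", "latency_ms": 1200}})
```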

What to Log for AI Applications

AI applications have specific logging needs beyond standard web applications:

LLM calls: Model, latency, token counts, prompt preview (truncated), response preview. This data is essential for cost tracking and debugging.

Retrieval operations: Query, number of results, relevance scores, latency. When RAG produces bad results, these logs help diagnose whether retrieval or generation failed.

Validation failures: What data failed validation and why. When external APIs change format, these logs reveal the problem quickly.

Rate limit events: When you hit limits, log it. This informs capacity planning and helps debug intermittent failures.
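
The LLM-call fields are easiest to capture consistently when one helper owns them. The context manager below is a sketch under stated assumptions: logged_llm_call, preview, and the emit callback are hypothetical names, and emit would normally be your structured logger rather than print.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


def preview(text: str, limit: int = 80) -> str:
    """Truncate text so full prompts do not flood (or leak into) log storage."""
    return text if len(text) <= limit else text[:limit] + "..."


@contextmanager
def logged_llm_call(
    model: str, prompt: str, emit: Callable[[dict], None] = print
) -> Iterator[dict]:
    """Collect and emit the fields worth logging for one LLM call."""
    fields: dict = {"model": model, "prompt_preview": preview(prompt)}
    start = time.perf_counter()
    try:
        yield fields  # the caller adds token counts, response preview, etc.
        fields["success"] = True
    except Exception as exc:
        fields["success"] = False
        fields["error"] = type(exc).__name__
        raise
    finally:
        fields["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        emit(fields)
```

Inside the with block the caller makes the API call and adds whatever it learns, for example fields["completion_tokens"] = 80. Failures are logged with the same shape as successes, which keeps latency and error-rate queries simple.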

Request Tracing

In systems with many components, a single user request touches multiple services. Tracing connects these interactions:

import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="unknown")

def process_request(user_input: str) -> str:
    request_id = str(uuid.uuid4())[:8]
    request_id_var.set(request_id)

    logger.info("request_started", request_id=request_id)

    try:
        result = run_pipeline(user_input)
        logger.info("request_completed", request_id=request_id, success=True)
        return result
    except Exception as e:
        logger.error("request_failed", request_id=request_id, error=str(e))
        raise

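Every log message emitted by process_request includes the request ID. Code deeper in the pipeline can read request_id_var.get() and attach the same ID to its own log calls, without the ID being threaded through every function signature. When investigating an issue, filter logs by request ID to see the complete picture.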
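
One way to attach the ID automatically, sketched here with the standard logging module, is a filter that copies the context variable onto every record. RequestIdFilter and retrieve_documents are hypothetical names.

```python
import logging
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="unknown")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def retrieve_documents(query: str) -> list[str]:
    # No request_id argument needed: the filter reads it from the context.
    logger.info("retrieval_started")
    return [f"doc for {query}"]
```

Because context variables are tracked per asyncio task, concurrent requests each see their own ID.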

Complete examples: See reference/02_python_ai_patterns.md for JSON formatter implementations, request tracing middleware, and LLM call logging patterns.

Development Workflow: From Notebooks to Production

The Notebook-to-Production Pipeline

Jupyter notebooks excel at exploration: interactive execution, inline visualization, rapid iteration. But they are poorly suited for production: they have no natural place for tests, they version poorly (diffs of the underlying JSON are hard to read), and their execution model encourages global state.

The solution isn’t abandoning notebooks but establishing a clear pipeline:

Explore in notebooks: Test ideas, visualize data, prototype approaches
Extract to modules: When code stabilizes, move it to .py files
Add types and tests: Production code needs both
Run in production: Modules imported by application, not notebooks

Making Notebooks Production-Adjacent

Some practices keep notebooks closer to production quality:

Keep cells small and focused. Each cell should do one thing. This makes extraction to functions natural.

Use functions, not just sequential code. Even in notebooks, wrap logic in functions. This forces you to think about inputs, outputs, and encapsulation.

Strip outputs before committing. Notebook outputs include cell execution results, which bloat repositories and make diffs useless. Use nbstripout to automatically remove outputs:

pip install nbstripout
nbstripout --install  # Configures git to strip outputs on commit

Consider jupytext for version control. Jupytext syncs notebooks with plain Python files, enabling meaningful diffs:

jupytext --set-formats ipynb,py:percent notebook.ipynb
# Now notebook.py stays in sync with notebook.ipynb

Testing AI Code

AI applications present unique testing challenges. Outputs are non-deterministic. External APIs are slow and expensive. “Correct” is often subjective.

Mock external services to test your code in isolation:

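from unittest.mock import AsyncMock, patch

import pytest

@pytest.fixture
def mock_llm_response():
    return ChatCompletion(
        id="mock",
        model="gpt-5",
        choices=[Choice(
            message=Message(role="assistant", content="Mock response"),
            index=0,
            finish_reason="stop"
        )],
        usage={"prompt_tokens": 0, "completion_tokens": 0},
    )

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_processing_logic(mock_llm_response):
    # Patch the name where your code looks it up; adjust the target if your
    # module does `from openai import AsyncOpenAI`.
    with patch("openai.AsyncOpenAI") as mock:
        # create() is awaited, so it must be an AsyncMock, not a plain MagicMock.
        mock.return_value.chat.completions.create = AsyncMock(
            return_value=mock_llm_response
        )

        result = await process_user_query("test input")

        assert "Mock response" in result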

Test deterministic components separately. Prompt construction, response parsing, and business logic can be tested without LLM calls:

def test_prompt_construction():
    prompt = build_rag_prompt(
        query="What is Python?",
        contexts=["Python is a programming language."]
    )

    assert "What is Python?" in prompt
    assert "Python is a programming language." in prompt
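The test above assumes a build_rag_prompt function exists. The sketch below is one hypothetical shape for it; the template wording and the numbering of contexts are assumptions made for illustration, not the chapter's reference implementation. What matters is that it is a pure function of its arguments, which is why it can be tested without any LLM call.

```python
def build_rag_prompt(query: str, contexts: list[str]) -> str:
    """Assemble a prompt from retrieved contexts and a user query.

    Hypothetical implementation: same inputs always produce the same
    string, so tests need no mocks and no network access.
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    # Number each context so the model (and a human reviewer) can refer to it.
    context_block = "\n".join(
        f"[{i}] {ctx}" for i, ctx in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\n"
    )
```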

Use snapshot testing for prompts. When prompts change, you want to review the diff deliberately:

def test_prompt_template(snapshot):
    prompt = build_prompt(template="summarize", document="...")
    assert prompt == snapshot  # Fails if prompt changes, update with --snapshot-update

Complete examples: See reference/02_python_ai_patterns.md for pytest fixtures, mock API responses, VCR for recorded interactions, and streaming response tests.

Practical Exercises

Exercise 1: Type-Safe LLM Response Handler

Build a type-safe response handler for multiple LLM providers:

Define TypedDict classes for OpenAI and Anthropic API responses
Create a unified LLMResponse Pydantic model that normalizes both formats
Implement converter functions with proper type hints
Write tests that verify type correctness using mypy

Acceptance criteria:
- Running mypy --strict produces no errors
- Both API formats convert correctly to the unified model
- Invalid responses raise ValidationError with clear messages

Exercise 2: Async Rate-Limited API Client

Implement an async HTTP client with production-ready features:

Create a client class that wraps httpx.AsyncClient
Implement a token bucket rate limiter (100 requests/minute)
Add automatic retry with exponential backoff for 429 and 5xx errors
Include request tracing with correlation IDs
Log all requests with structured logging

Acceptance criteria:
- Concurrent requests respect rate limits
- Retries use jitter to prevent thundering herd
- Each request can be traced through logs
- Test with 200 concurrent requests

Exercise 3: Pydantic Configuration System

Build a configuration system for an AI application:

Define Pydantic BaseSettings for:
- API keys (from environment variables)
- Model parameters (temperature, max_tokens, etc.)
- Retry configuration (max_retries, base_delay)
- Feature flags (enable_caching, debug_mode)
Support loading from .env files
Add custom validators for value ranges
Implement environment-specific overrides (dev, staging, prod)

Acceptance criteria:
- Sensitive values are masked in logs and __repr__
- Invalid configurations fail fast with clear error messages
- Can switch environments without code changes

Exercise 4: Streaming Response Processor

Implement a streaming processor for LLM responses:

Create an async generator that yields chunks from the OpenAI streaming API
Implement a StreamBuffer class that accumulates chunks and detects complete JSON objects
Add timeout handling (fail if no chunk received in 30 seconds)
Include progress callbacks for UI updates

async def process_stream():
    async for chunk in stream_completion(prompt="..."):
        print(chunk, end="", flush=True)
    # Should print incrementally as chunks arrive

Acceptance criteria:
- Prints text as it arrives, not buffered
- Handles connection drops gracefully
- Works with both OpenAI and Anthropic streaming formats

Exercise 5: Observable Service Module

Build a Python module with comprehensive observability:

Create a service class with structured logging:
- Log all method entries and exits
- Include timing information
- Use correlation IDs for request tracing
Add metrics using prometheus_client:
- Request count by method
- Latency histograms
- Error rates
Implement a context manager for automatic instrumentation
Write a decorator that adds observability to any function

Acceptance criteria:
- Can reconstruct a full request flow from logs
- Metrics are exposed on /metrics endpoint
- Latency percentiles (p50, p95, p99) are available

Exercise 6: Docker Development Environment

Create a complete development environment for an AI project:

Write a Dockerfile with:
- Multi-stage build for smaller image
- Non-root user for security
- Proper layer caching for dependencies
Create docker-compose.yml with:
- Application service with hot reload
- Redis for caching
- Mock API server for testing
Include a Makefile with common commands:
- make dev - Start development environment
- make test - Run tests in container
- make lint - Run mypy, ruff, black

Acceptance criteria:
- Container builds in under 2 minutes (with cache)
- Code changes reflect immediately without rebuild
- Tests can run in isolation from external services

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] Why does async programming provide significant speedups for LLM API calls but not for local embedding computations?

Answer

LLM API calls are I/O-bound: the CPU waits for network responses. During this wait, async lets other tasks run. With 10 API calls that each take 1 second, sequential execution takes about 10 seconds, while issuing them concurrently takes about 1 second (assuming the provider’s rate limits allow it).

Local embedding computations are CPU-bound—the CPU is actively processing. Async doesn’t help because there’s no waiting; the CPU is always busy. For CPU-bound work, you need multiprocessing (multiple CPU cores) rather than async (concurrency during I/O wait).
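The I/O-bound half of this answer can be checked with a small standard-library experiment. In the sketch below, fake_api_call is a hypothetical stand-in that simulates network latency with asyncio.sleep; no real API is involved, and exact timings will vary by machine.

```python
import asyncio
import time


async def fake_api_call(i: int, latency: float = 0.1) -> int:
    """Stand-in for an I/O-bound call: sleeping yields control to the event loop."""
    await asyncio.sleep(latency)
    return i


async def run_sequential(n: int) -> float:
    """Await each call before starting the next; total time is roughly n * latency."""
    start = time.perf_counter()
    for i in range(n):
        await fake_api_call(i)
    return time.perf_counter() - start


async def run_concurrent(n: int) -> float:
    """Start all calls at once; total time is roughly one latency."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_api_call(i) for i in range(n)))
    return time.perf_counter() - start


async def main() -> tuple[float, float]:
    return await run_sequential(10), await run_concurrent(10)


if __name__ == "__main__":
    sequential, concurrent = asyncio.run(main())
    print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

Replacing the sleep with a tight CPU loop would remove the gap, because a coroutine that never awaits never gives the event loop a chance to run anything else.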

Q2. [IC2] When should you use TypedDict versus a Pydantic model?

Answer

TypedDict: Use for typing dictionaries from external sources where you want type checking without runtime validation. Good for API responses you’ve already validated elsewhere, or when you need dict compatibility with existing code.

Pydantic model: Use when you need runtime validation, default values, custom validators, or serialization. Better for data at trust boundaries (user input, config files) and when you want automatic parsing (strings to ints, etc.).

In practice: External API responses often start as TypedDict for typing, then get converted to Pydantic models when entering your application’s core.
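The "no runtime validation" half of this distinction is easy to see with the standard library alone. The sketch below reuses a Message shape like the one from earlier in the chapter and deliberately passes invalid values. A static checker flags the call, but nothing happens at runtime; a Pydantic model with the same fields would instead raise ValidationError on construction.

```python
from typing import Literal, TypedDict


class Message(TypedDict):
    role: Literal["user", "assistant", "system"]
    content: str


# A static checker such as mypy rejects this call, but at runtime
# TypedDict builds a plain dict and checks nothing.
bad = Message(role="robot", content=123)  # type: ignore

print(type(bad) is dict)  # the result is an ordinary dict
print(bad)                # the invalid values are stored unchanged
```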

Q3. [Senior] How would you implement graceful degradation for a service that calls multiple LLM providers?

Answer

Implement a multi-provider client with:

Health checking: Periodically test each provider, track success rates
Circuit breaker: Open circuit after N consecutive failures, half-open after timeout
Fallback chain: Primary → Secondary → Cached response → Graceful error
Request hedging: For latency-sensitive requests, send to multiple providers, use first response

async def generate(self, prompt: str) -> str:  # method on the multi-provider client
    for provider in self.healthy_providers():
        try:
            return await provider.generate(prompt)
        except ProviderError:
            self.mark_unhealthy(provider)
    return await self.fallback_response(prompt)  # Cached or default

Key insight: Never fail completely if any fallback is possible. Users prefer a slower or slightly degraded response over an error.
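The circuit-breaker item in the list above can be sketched in a few lines. This is a minimal, single-threaded illustration, not a production implementation: the class name, method names, and default thresholds are assumptions, and the clock is injectable only so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal circuit breaker: closed, open after N failures, half-open after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow a trial request
        return "open"

    def allow_request(self) -> bool:
        # Closed and half-open both let a request through; open rejects it.
        return self.state != "open"

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open the circuit (or re-open it after a failed trial request).
            self._opened_at = self._clock()
```

In the generate loop above, each provider would carry its own breaker: skip the provider when allow_request() returns False, and call record_success() or record_failure() after each attempt.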

Q4. [Senior] What are the trade-offs between using poetry versus pip-tools for dependency management?

Answer

Poetry:

Pros: Single tool for dependency management, virtual env, building, and publishing. Excellent lockfile. Nice CLI. Growing adoption.
Cons: Can be slow for large dependency trees. Some compatibility issues with editable installs. Adds a layer of abstraction over pip.

pip-tools:

Pros: Minimal abstraction—just generates requirements.txt from requirements.in. Fast. Works with any pip-compatible tooling.
Cons: Separate tools needed for virtual env (venv) and building (setuptools/build). More manual workflow.

Decision guide: Poetry for new projects where you control the environment. pip-tools for existing projects, Docker containers (where simplicity matters), or teams unfamiliar with Poetry.

Summary

Python for AI engineering requires a mindset shift from exploration to reliability without sacrificing the ability to iterate quickly. The key principles:

Type hints are documentation that catches bugs. They don’t slow you down—they prevent hours of debugging later. Start at function boundaries and data structures that cross modules.

Async programming enables concurrency for I/O-bound work. When making many external calls, async provides dramatic speedups. When making single calls or doing CPU-bound work, it adds complexity without benefit.

Pydantic validates data at trust boundaries. External API responses, user input, and configuration files all warrant validation. Internal data structures can use lighter-weight dataclasses.

Error handling must expect failures. External services fail. Categorize errors as retryable or not, implement backoff for retries, and fail fast for non-retryable errors.

Environment discipline enables reproducibility. Virtual environments, locked dependencies, proper secrets management, and Docker containers prevent “works on my machine” problems.

Observability reveals system behavior. Structured logging, request tracing, and comprehensive metrics let you understand what your code actually does in production.

These practices compound. Each one prevents a category of problems, and together they transform AI code from fragile prototypes to reliable systems. The investment in discipline pays dividends every time you debug an issue in minutes instead of hours, every time a code change deploys without incident, every time you sleep through the night while your system handles traffic.

The patterns in this chapter appear throughout AI engineering. Master them, and you’ll spend less time fighting Python and more time building systems that work.

Connections to Other Chapters

The Python patterns in this chapter form the foundation for production AI code throughout the book:

Chapter 4 (Your First LLM Application): Applies these Python patterns to build a complete RAG application with async API calls, Pydantic validation, and proper error handling.
Chapter 14 (Backend Engineering for AI): Extends these patterns to production systems including testing strategies, debugging techniques, and integration patterns.
Chapter 9 (LLM Deployment & Infrastructure): Uses async patterns for high-throughput inference and discusses retry/circuit breaker patterns for reliability.
Chapter 31 (Reliability Engineering): Builds on the observability patterns here to implement comprehensive monitoring, alerting, and incident response.
Tool & Framework Reference (Appendix B): Comprehensive guide to the Python libraries mentioned in this chapter with version recommendations.

Code Reference

For complete, runnable implementations of all patterns discussed in this chapter, see:

reference/02_python_ai_patterns.md — Full code examples including:

Type hints and TypedDict for LLM responses
Pydantic models with validators and configuration loading
Async patterns: gather, semaphores, timeouts, first-completed
Context managers for resource management
HTTP client configuration with httpx
OpenAI and Anthropic SDK patterns with provider abstraction
NumPy operations for embeddings
Pandas for evaluation data manipulation
JSON extraction from LLM outputs
Streaming response handling
Token bucket rate limiter implementation
Structured logging and request tracing
Comprehensive error handling with retry logic
Testing patterns: fixtures, mocks, snapshots, VCR
Docker and Docker Compose configurations

--- title: "Chapter 2: Python for AI Engineering" keywords: [Python, async, Pydantic, type hints, error handling, observability, logging, data classes] difficulty: beginner prerequisites: [] estimated_time: "3-4 hours" --- ## Introduction In early 2023, a machine learning engineer at a healthcare startup faced a familiar crisis. Her team had built a diagnostic assistant using GPT-4—a Jupyter notebook that analyzed patient symptoms and suggested possible conditions. In demos, it was impressive. Doctors were excited. The CEO scheduled a board presentation. Then the engineering team tried to deploy it. The notebook made synchronous API calls that blocked the web server, causing timeouts when multiple doctors used it simultaneously. Error handling was optimistic at best—a single API hiccup crashed the entire application. Configuration was scattered across cells, with API keys occasionally committed to version control. The code that had dazzled stakeholders became a liability that took six weeks to untangle. This story plays out constantly across the industry. The gap between "it works in a notebook" and "it runs reliably in production" is vast, and Python—for all its strengths—makes it easy to fall into this chasm. The same flexibility that enables rapid prototyping allows sloppy practices that become expensive later. This chapter is about writing Python that works both for exploration and production. Not Python that's technically correct but unmaintainable. Not Python that's elegant but impractical. Python that lets you iterate quickly during development and sleep soundly when your code runs in production. ### Why Python Dominates AI Engineering Python's dominance in AI isn't a historical accident or mere popularity contest. Several forces converge to make it uniquely suited for this domain. **The library ecosystem creates a gravitational pull.** NumPy, PyTorch, TensorFlow, Hugging Face Transformers, LangChain—the tools that power AI development are Python-first. 
Other languages have bindings, but Python has native support with full documentation and community knowledge. When you encounter an obscure error from a transformer model, solutions in Python are abundant. Solutions in Rust or Go often don't exist. **Rapid iteration matches AI's experimental nature.** AI development isn't like building a web form. You try a prompting strategy, evaluate results, adjust, repeat. You experiment with different models, embedding approaches, chunking strategies. Python's dynamic nature—the same trait that causes production bugs—enables this experimentation. You can modify a pipeline mid-execution in a debugger, reload modules without restarting, test ideas in a REPL. **Glue language capabilities matter for integration.** Production AI systems rarely stand alone. They connect to databases, message queues, APIs, authentication systems, monitoring infrastructure. Python's extensive standard library and third-party packages provide connectors for virtually anything. A single language spans your entire stack from data preprocessing to model serving to observability. But Python's flexibility is a double-edged sword. The language that lets you quickly test an idea also lets you quickly create unmaintainable code. Dynamic typing that speeds prototyping causes production bugs. The absence of required structure means discipline must come from the developer, not the language. AI engineering requires *disciplined* Python—leveraging flexibility for exploration while building in safeguards for reliability. ### The Mental Model: Concentric Circles of Trust A useful framework for thinking about Python in AI systems is concentric circles of trust. At the center are components you fully control—your own code, tested and typed. Moving outward, each ring represents less control and more uncertainty: 1. **Your code**: Full control, can be tested and typed exhaustively 2. **Your dependencies**: Trusted but external, can break with updates 3. 
**External APIs**: Unreliable, slow, may change without notice, may have outages 4. **User input**: Completely untrusted, potentially malicious, definitely surprising Good AI engineering Python acknowledges these boundaries. You validate data as it crosses from outer rings to inner rings. You handle failures at API boundaries. You type your internal interfaces strictly while accepting that external data needs runtime validation. This isn't paranoia—it's pragmatism. Every production incident teaches the same lesson: something you assumed would work, didn't. ::: {.callout-tip} ## Skip This Chapter If... You can skip this chapter if you already: - Write production Python with type hints and Pydantic models - Use async/await for concurrent I/O operations - Implement retry logic with exponential backoff - Manage dependencies with lock files (Poetry, pip-tools) - Structure code with proper error handling and logging **Skim instead if** you know most of this but want to see AI-specific patterns. Jump to the "Patterns That Appear Everywhere" section. **Read carefully if** you're coming from data science/notebooks, another language, or haven't built production Python services. ::: ### What You'll Learn This chapter covers Python for AI engineering along four dimensions: - **Language features that matter**: Type hints, async programming, context managers, and dataclasses—not as syntax tutorials but as tools for specific problems - **Library choices and trade-offs**: When to use httpx versus requests, Pydantic versus dataclasses, Poetry versus pip - **Environment discipline**: Dependency management, secrets handling, and containerization that prevent "works on my machine" problems - **Patterns that appear everywhere**: Retry logic, rate limiting, streaming, and observability—the infrastructure code that separates prototypes from products The goal isn't comprehensive Python coverage. It's focused preparation for the AI engineering work ahead. 
--- ## Type Hints: Documentation That Catches Bugs ### The Problem with Dynamic Typing Python's dynamic typing is liberating during exploration. You don't declare types, don't satisfy a compiler, just write code and run it. This speed has real value when you're iterating on prompt strategies or testing different embedding models. But dynamic typing has a cost that compounds over time. Consider this function from an AI codebase: ```python def process_response(response, config): return response['choices'][0]['message']['content'] if config.get('raw') else clean(response) ``` What is `response`? A dictionary? What shape? What is `config`? What does `clean()` return? Without reading the implementation, all call sites, and possibly the API documentation, you cannot know. This ambiguity creates several problems: **Code review becomes archaeology.** Reviewers must trace through implementations to understand what functions expect and return. Review fatigue leads to bugs slipping through. **Refactoring becomes dangerous.** Change a function's return type, and you have no way to find all affected code without running every path—which in AI systems with many conditional branches is effectively impossible. **Onboarding takes longer.** New team members face a codebase where understanding requires not reading code but deciphering it. **IDE support degrades.** Without type information, autocomplete is limited to guessing, and errors that could be caught immediately require runtime to surface. ### Type Hints as Executable Documentation Python 3.5 introduced type hints—optional annotations that specify expected types without runtime enforcement. 
Modern Python type hints are sophisticated enough to express complex data structures clearly: ```python from typing import TypedDict, Literal class Message(TypedDict): role: Literal["user", "assistant", "system"] content: str class Choice(TypedDict): message: Message index: int finish_reason: Literal["stop", "length", "content_filter"] class ChatCompletion(TypedDict): id: str model: str choices: list[Choice] usage: dict[str, int] def process_response(response: ChatCompletion, raw: bool = False) -> str: """Extract content from a chat completion response.""" if raw: return response['choices'][0]['message']['content'] return clean_text(response['choices'][0]['message']['content']) ``` The function signature now documents itself. IDEs provide accurate autocomplete. Static analyzers catch typos like `response['choice']` (missing 's') before runtime. Refactoring tools can find all usages and verify compatibility. ### Why Type Hints Matter Especially for AI Code AI applications handle unusually complex data structures. LLM API responses are deeply nested dictionaries. Embeddings are lists of floats of specific dimensions. Batch operations return lists of complex objects with optional fields depending on processing outcomes. Without types, bugs hide in data shape mismatches that only surface at runtime—often in production, often at 3 AM, often in subtle ways that take hours to diagnose. Consider the difference in debugging experience: **Without types**: "TypeError: 'NoneType' object is not subscriptable" somewhere in a 500-line pipeline. Which variable is None? Where did it come from? **With types and static analysis**: "error: Item 'None' of 'Optional[Message]' has no attribute 'content'" at line 47, during development, before you even run the code. ### The Pragmatic Approach: Gradual Adoption Type hints are opt-in. You don't need to annotate everything immediately. 
The highest-value strategy is gradual adoption focused on boundaries: **Start with function signatures.** The function's contract is the minimum viable type coverage. Even without typing internal variables, annotated parameters and returns catch most interface mismatches. **Prioritize data structures that cross modules.** When data flows between components written by different people or at different times, types prevent miscommunication. **Type complex nested structures explicitly.** API responses, configuration objects, and intermediate data formats benefit most from explicit typing. **Run a type checker in CI.** Tools like `mypy` or `pyright` catch errors where you've added types. Start with lenient settings and tighten as coverage improves. The goal isn't type purity. It's catching bugs early and making code self-documenting where it matters most. > **Complete examples**: See [reference/02_python_ai_patterns.md](../reference/02_python_ai_patterns.html#type-hints-and-typeddict) for TypedDict patterns, Pydantic response models, and generic type patterns for AI applications. --- ## Async Programming: Concurrency for I/O-Bound Work ### Understanding the Problem AI applications make many external calls. Retrieving embeddings, querying LLM APIs, searching vector databases, fetching context documents—each call takes time, and that time is mostly waiting. Your CPU sits idle while data travels across networks and remote servers process requests. Synchronous code handles this poorly. Consider embedding 100 documents: ```python def embed_documents_sync(texts: list[str]) -> list[list[float]]: embeddings = [] for text in texts: response = requests.post(EMBED_URL, json={"input": text}) embeddings.append(response.json()["embedding"]) return embeddings ``` If each API call takes 100 milliseconds, this function takes 10 seconds. During those 10 seconds, your program is 99% idle—waiting for network responses, doing nothing useful. 
### The Async Solution Asynchronous programming lets you initiate many I/O operations without waiting for each to complete: ```python async def embed_documents_async(texts: list[str]) -> list[list[float] | None]: async with httpx.AsyncClient() as client: tasks = [ client.post(EMBED_URL, json={"input": text}) for text in texts ] # return_exceptions=True is critical for batch work: by default a # single failed request makes gather raise and DISCARD every other # in-flight result. Here we keep the successes and handle failures. results = await asyncio.gather(*tasks, return_exceptions=True) embeddings: list[list[float] | None] = [] for text, result in zip(texts, results): if isinstance(result, Exception): logger.warning("embedding failed for %r: %s", text[:50], result) embeddings.append(None) # or queue for retry, per your contract else: embeddings.append(result.json()["embedding"]) return embeddings ``` All 100 requests launch nearly simultaneously. The function completes when the slowest single request finishes—perhaps 150 milliseconds instead of 10 seconds. The speedup is dramatic. The subtle part is failure handling: a bare `asyncio.gather(*tasks)` raises on the first exception and throws away every other result, so one rate-limited call would sink an entire batch of embeddings. `return_exceptions=True` makes partial failure a first-class case you handle deliberately. ### The Mental Model: Restaurant Servers vs. Cashiers Synchronous code is like a single cashier serving a queue. Each customer waits while the current one is served. The cashier is always busy, but throughput is limited by processing one customer at a time. Asynchronous code is like a restaurant server. Take an order, submit it to the kitchen, take another order while the first cooks. The server isn't cooking faster—they're not waiting idle while food prepares. 
This mental model clarifies when async helps and when it doesn't: **Async helps when you're waiting for external resources.** API calls, database queries, file I/O—operations where your program waits for something else. **Async doesn't help for CPU-bound work.** If you're computing embeddings locally, async won't speed anything up. The CPU is the bottleneck, and async doesn't parallelize computation. **Async adds complexity.** Concurrent code is harder to reason about, debug, and test. The speedup must justify this cost. ### Essential Async Patterns Four patterns appear repeatedly in AI applications: **Concurrent execution with `gather`.** When you have multiple independent operations, run them concurrently: ```python async def fetch_all_contexts(queries: list[str]) -> list[Context]: tasks = [fetch_context(q) for q in queries] return await asyncio.gather(*tasks) ``` **Controlled concurrency with semaphores.** APIs have rate limits. Launching 10,000 concurrent requests will get you blocked. Semaphores limit concurrency: ```python async def fetch_with_limit(items: list[str], max_concurrent: int = 10) -> list[Result]: semaphore = asyncio.Semaphore(max_concurrent) async def fetch_one(item: str) -> Result: async with semaphore: return await fetch_item(item) return await asyncio.gather(*[fetch_one(i) for i in items]) ``` **Timeouts for operations that might hang.** External services sometimes stall. 
Timeouts prevent indefinite waits: ```python async def fetch_with_timeout(item: str, timeout: float = 30.0) -> Result: try: return await asyncio.wait_for(fetch_item(item), timeout=timeout) except asyncio.TimeoutError: raise FetchError(f"Timed out after {timeout}s") ``` **First-completed for redundant providers.** When multiple providers can serve a request, take the first response: ```python async def fetch_from_any(item: str, providers: list[Provider]) -> Result: tasks = [provider.fetch(item) for provider in providers] done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED) for task in pending: task.cancel() return done.pop().result() ``` ### When to Use Async Async isn't always the right choice. Use it when: - Making multiple independent external calls (embedding batches, parallel LLM queries) - Building servers that handle concurrent requests - Processing pipelines where each step involves I/O Don't use async for: - Single sequential API calls (no concurrency to exploit) - CPU-bound work (use multiprocessing instead) - Simple scripts where complexity isn't justified The decision isn't about async being "better"—it's about whether the speedup justifies the complexity in your specific context. > **Complete examples**: See [reference/02_python_ai_patterns.md](../reference/02_python_ai_patterns.html#asyncawait-patterns) for semaphore-based rate limiting, timeout handling, and bridging sync/async contexts. --- ## Data Validation: Pydantic and the Trust Boundary ### The Problem with Unvalidated Data Data enters AI systems from many sources: API responses, user inputs, configuration files, database queries. Each source can produce unexpected data—missing fields, wrong types, values outside expected ranges. Without validation, these problems surface far from their origin. A missing field in an API response causes an error three functions later. A malformed user input corrupts results without raising exceptions. 
Debugging becomes archaeology, tracing backward from symptoms to causes. ### Pydantic: Validation as Type Definition Pydantic models define data structures that validate on instantiation. If validation fails, you get an immediate, specific error: ```python from pydantic import BaseModel, Field from typing import Literal class ChatMessage(BaseModel): role: Literal["user", "assistant", "system"] content: str = Field(min_length=1, max_length=100000) class ChatRequest(BaseModel): messages: list[ChatMessage] model: str = "gpt-5" temperature: float = Field(default=0.7, ge=0.0, le=2.0) ``` Attempting to create a `ChatRequest` with invalid data fails immediately: ```python # Raises ValidationError: temperature must be <= 2.0 ChatRequest(messages=[...], temperature=2.5) ``` The error is specific, occurs at the point of creation, and includes context about what went wrong. ### Validation at Trust Boundaries Pydantic is most valuable at boundaries where data crosses trust levels: **External API responses.** When you call an embedding API, the response should match your expectations. Parsing through Pydantic validates this assumption: ```python response = await client.post(url, json=payload) result = EmbeddingResponse.model_validate(response.json()) # If parsing succeeds, result matches your expected structure ``` **User input.** Everything from users is suspect. 
Pydantic enforces constraints: ```python class UserQuery(BaseModel): text: str = Field(min_length=1, max_length=10000) filters: dict[str, str] = Field(default_factory=dict) @field_validator('text') @classmethod def sanitize_text(cls, v: str) -> str: return v.strip() ``` **Configuration files.** Settings loaded from YAML or JSON should be validated before use: ```python class AppConfig(BaseModel): llm_provider: Literal["openai", "anthropic"] model: str temperature: float = Field(ge=0.0, le=2.0) @classmethod def from_yaml(cls, path: Path) -> "AppConfig": with open(path) as f: return cls.model_validate(yaml.safe_load(f)) ``` ### Pydantic vs. Dataclasses Python's built-in `dataclasses` are lighter weight but don't validate. The choice depends on context: **Use dataclasses for internal data structures** where you control all inputs and validation overhead isn't warranted: ```python @dataclass class EmbeddingResult: vector: list[float] model: str cached: bool = False ``` **Use Pydantic at boundaries** where data comes from external sources or needs validation: ```python class APIResponse(BaseModel): embedding: list[float] model: str usage: dict[str, int] ``` The distinction matters for performance and complexity. Pydantic validation has overhead—meaningful when processing millions of records. For internal data you trust, dataclasses avoid that cost. > **Complete examples**: See [reference/02_python_ai_patterns.md](../reference/02_python_ai_patterns.html#pydantic-models) for field validators, model validators, configuration loading, and structured output patterns for LLM responses. --- ## Error Handling: Expecting Failures ### The Reality of External Dependencies AI applications depend heavily on external services, and external services fail. Networks partition, servers crash, rate limits trigger, APIs change without notice. The question isn't whether failures will occur but how your code responds. 
Robust error handling requires categorizing failures by how they should be handled: **Retryable failures** might succeed on the next attempt. Network timeouts, rate limits, server errors (5xx) often resolve themselves. These warrant retry logic. **Non-retryable failures** won't be fixed by trying again. Authentication errors, malformed requests, resource not found—these require human intervention or code changes. **Partial failures** in batch operations leave some work completed and some failed. These require careful handling to avoid reprocessing successful items. ### Structured Error Handling Define error types that distinguish these categories: ```python class LLMError(Exception): """Base exception for LLM operations.""" pass class RetryableError(LLMError): """Error that may succeed on retry.""" pass class NonRetryableError(LLMError): """Error that won't be fixed by retrying.""" pass ``` Then handle API errors by mapping them to these categories: ```python async def call_llm(prompt: str) -> str: try: response = await client.chat.completions.create( model="gpt-5", messages=[{"role": "user", "content": prompt}] ) return response.choices[0].message.content except AuthenticationError as e: # Invalid credentials - don't retry raise NonRetryableError("Invalid API credentials") from e except RateLimitError as e: # Rate limited - retry after backoff raise RetryableError("Rate limit exceeded") from e except APIConnectionError as e: # Network issue - retry raise RetryableError("Connection failed") from e ``` ### Retry Logic with Exponential Backoff Retry logic must balance persistence with politeness. Hammering a failing server with immediate retries makes problems worse. 
Exponential backoff spaces retries increasingly apart:

```python
import asyncio
import logging
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


async def retry_with_backoff(
    func: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Execute function with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except RetryableError as e:
            if attempt == max_attempts - 1:
                raise  # Final attempt failed
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            delay *= (0.5 + random.random())  # Add jitter: scale by a factor in [0.5, 1.5)
            logger.warning(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
    # Only reached if max_attempts < 1
    raise RuntimeError("Should not reach here")
```

The jitter (random variation in delay) prevents the "thundering herd" problem where many clients retry simultaneously after an outage, overwhelming the recovering server.

### Libraries for Retry Logic

The `tenacity` library provides sophisticated retry logic without reinventing it:

```python
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_exception_type(RetryableError),
)
async def robust_llm_call(prompt: str) -> str:
    return await call_llm(prompt)
```

Use libraries for complex retry scenarios, but understand the underlying concepts so you can debug when things go wrong.

> **Complete examples**: See [reference/02_python_ai_patterns.md](../reference/02_python_ai_patterns.html#error-handling) for comprehensive API error handling, tenacity configurations, and rate limiter implementations.

---

## Environment Management: Reproducibility and Isolation

### The Dependency Problem

Python's package ecosystem is both blessing and curse. Thousands of libraries are available for any task. But these libraries depend on other libraries, which depend on others, creating a web of transitive dependencies that can conflict.
The symptoms are familiar:

- "It works on my machine" (but not in production)
- Package A requires numpy>=1.20 while package B requires numpy<1.20
- An update to a minor dependency breaks your application
- A security vulnerability in a transitive dependency requires untangling the entire chain

### Virtual Environments: Basic Isolation

Virtual environments isolate project dependencies from system Python and from each other. Every project should have its own:

```bash
# Create environment
python -m venv .venv

# Activate
source .venv/bin/activate  # Unix
.venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt
```

This is the minimum standard. Developing without virtual environments causes problems that seem small initially but compound over time.

### Dependency Locking: Reproducible Installs

`requirements.txt` with loose version constraints isn't reproducible:

```
# This installs different versions on different days
openai>=1.0
pydantic>=2.0
```

Dependency locking captures the exact versions of all packages, including transitive dependencies.

**pip-tools** generates locked requirements from loose specifications:

```bash
# requirements.in lists your direct dependencies:
#   openai>=1.0
#   pydantic>=2.0

# Generate locked requirements.txt
pip-compile requirements.in

# Install exactly those versions
pip-sync requirements.txt
```

**Poetry** manages dependencies with built-in locking:

```bash
# Add dependencies (updates pyproject.toml and poetry.lock)
poetry add openai pydantic

# Install exactly what's in poetry.lock
poetry install
```

The lock file is your reproducibility guarantee. Commit it to version control. CI and production install from the lock file, not from loose constraints.

### Secrets Management: The API Key Problem

API keys are the most common security leak in AI projects. OpenAI keys committed to public GitHub repositories can be found and exploited within minutes by automated scanners.

**Never hardcode secrets.** Not even temporarily.
Not even in example code you'll "fix later."

**Use environment variables** with `.env` files for local development:

```bash
# .env (never commit this)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# .env.example (commit this as documentation)
OPENAI_API_KEY=your-key-here
ANTHROPIC_API_KEY=your-key-here
```

Load environment variables at application start:

```python
from dotenv import load_dotenv
import os

load_dotenv()  # Loads .env into environment

openai_key = os.environ["OPENAI_API_KEY"]  # Fails fast if missing
```

**For production**, use a secrets manager such as AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault. These provide audit trails, rotation, and access control that environment variables lack.

### Docker: Consistent Environments

Docker containers package your application with its complete environment: operating system, Python version, dependencies, configuration. This eliminates environment discrepancies between development and production.

A minimal Dockerfile for an AI application:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (better layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Run as non-root user
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

CMD ["python", "-m", "app.main"]
```

The key insight is layer caching: by copying and installing dependencies before copying application code, you avoid reinstalling dependencies on every code change.

> **Complete examples**: See [reference/02_python_ai_patterns.md](../reference/02_python_ai_patterns.html#environment-management) for AWS/GCP secrets manager patterns, pyproject.toml configurations, and Docker Compose setups for AI development.

---

## Observability: Seeing What Your Code Does

### The Debugging Challenge

AI applications are hard to debug. LLM outputs are non-deterministic. Errors in prompt construction might produce plausible but wrong results.
Performance problems might stem from any of a dozen external service calls.

Traditional debugging (setting breakpoints, stepping through code) helps for some problems. But production issues often can't be reproduced locally, and the interactive nature of debugging doesn't scale to understanding system behavior over time.

Observability (logging, metrics, and tracing) provides visibility into running systems.

### Structured Logging

Unstructured log messages are human-readable but machine-hostile:

```
2024-01-15 10:23:45 INFO Processing request for user 123
2024-01-15 10:23:46 INFO LLM call completed in 1.2s
```

Structured logging outputs machine-parseable data that enables querying and aggregation:

```python
import structlog

logger = structlog.get_logger()

logger.info(
    "llm_call_completed",
    model="gpt-5",
    latency_ms=1200,
    prompt_tokens=150,
    completion_tokens=80,
    user_id="123",
)
```

When structlog is configured with a JSON renderer (`structlog.processors.JSONRenderer`), this emits JSON that log aggregation systems (Datadog, CloudWatch, Loki) can query: "Show me all LLM calls taking more than 2 seconds" or "What's the average token usage per model?"

### What to Log for AI Applications

AI applications have specific logging needs beyond standard web applications:

**LLM calls**: Model, latency, token counts, prompt preview (truncated), response preview. This data is essential for cost tracking and debugging.

**Retrieval operations**: Query, number of results, relevance scores, latency. When RAG produces bad results, these logs help diagnose whether retrieval or generation failed.

**Validation failures**: What data failed validation and why. When external APIs change format, these logs reveal the problem quickly.

**Rate limit events**: When you hit limits, log it. This informs capacity planning and helps debug intermittent failures.

### Request Tracing

In systems with many components, a single user request touches multiple services.
Tracing connects these interactions:

```python
import uuid
from contextvars import ContextVar

import structlog

logger = structlog.get_logger()

request_id_var: ContextVar[str] = ContextVar("request_id", default="unknown")


def process_request(user_input: str) -> str:
    request_id = str(uuid.uuid4())[:8]
    request_id_var.set(request_id)  # Visible to code further down the call path

    logger.info("request_started", request_id=request_id)
    try:
        result = run_pipeline(user_input)
        logger.info("request_completed", request_id=request_id, success=True)
        return result
    except Exception as e:
        logger.error("request_failed", request_id=request_id, error=str(e))
        raise
```

Code further down the call path can read `request_id_var.get()`, so every log message in the request path can include the request ID. When investigating an issue, filter logs by request ID to see the complete picture.

> **Complete examples**: See [reference/02_python_ai_patterns.md](../reference/02_python_ai_patterns.html#logging-and-observability) for JSON formatter implementations, request tracing middleware, and LLM call logging patterns.

---

## Development Workflow: From Notebooks to Production

### The Notebook-to-Production Pipeline

Jupyter notebooks excel at exploration: interactive execution, inline visualization, rapid iteration. But they're poorly suited for production: they have no testing framework, version control is problematic (JSON diffs are useless), and they encourage global state.

The solution isn't abandoning notebooks but establishing a clear pipeline:

1. **Explore in notebooks**: Test ideas, visualize data, prototype approaches
2. **Extract to modules**: When code stabilizes, move it to `.py` files
3. **Add types and tests**: Production code needs both
4. **Run in production**: Modules imported by application, not notebooks

### Making Notebooks Production-Adjacent

Some practices keep notebooks closer to production quality:

**Keep cells small and focused.** Each cell should do one thing. This makes extraction to functions natural.

**Use functions, not just sequential code.** Even in notebooks, wrap logic in functions. This forces you to think about inputs, outputs, and encapsulation.
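
As a rough illustration of wrapping notebook logic in functions, the sketch below turns what might otherwise be a run of top-level cell statements into two small, typed functions. The names `clean_text` and `chunk_text` and the word-count chunking rule are hypothetical, chosen only for this example.

```python
def clean_text(raw: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(raw.split())


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into chunks of at most max_words words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


# In a notebook cell, the exploratory call stays a one-liner:
chunks = chunk_text(clean_text("  Python   is a\nprogramming language.  "), max_words=3)
```

Because inputs and outputs are explicit, functions like these can later move into a `.py` module and gain tests without being rewritten.
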

**Strip outputs before committing.** Notebook outputs include cell execution results, which bloat repositories and make diffs useless. Use `nbstripout` to automatically remove outputs:

```bash
pip install nbstripout
nbstripout --install  # Configures git to strip outputs on commit
```

**Consider jupytext for version control.** Jupytext syncs notebooks with plain Python files, enabling meaningful diffs:

```bash
jupytext --set-formats ipynb,py:percent notebook.ipynb
# Now notebook.py stays in sync with notebook.ipynb
```

### Testing AI Code

AI applications present unique testing challenges. Outputs are non-deterministic. External APIs are slow and expensive. "Correct" is often subjective.

**Mock external services** to test your code in isolation:

```python
from unittest.mock import AsyncMock, patch

import pytest


@pytest.fixture
def mock_llm_response():
    # ChatCompletion, Choice, and Message are simplified stand-ins for the
    # SDK's response types; the real models may require additional fields.
    return ChatCompletion(
        id="mock",
        model="gpt-5",
        choices=[Choice(
            message=Message(role="assistant", content="Mock response"),
            finish_reason="stop"
        )]
    )


@pytest.mark.asyncio  # Assumes the pytest-asyncio plugin is installed
async def test_processing_logic(mock_llm_response):
    with patch("openai.AsyncOpenAI") as mock:
        # create() is awaited, so it must be an AsyncMock
        mock.return_value.chat.completions.create = AsyncMock(
            return_value=mock_llm_response
        )
        result = await process_user_query("test input")
        assert "Mock response" in result
```

**Test deterministic components separately.** Prompt construction, response parsing, and business logic can be tested without LLM calls:

```python
def test_prompt_construction():
    prompt = build_rag_prompt(
        query="What is Python?",
        contexts=["Python is a programming language."]
    )
    assert "What is Python?" in prompt
    assert "Python is a programming language." in prompt
```

**Use snapshot testing for prompts.** When prompts change, you want to review the diff deliberately:

```python
def test_prompt_template(snapshot):
    prompt = build_prompt(template="summarize", document="...")
    assert prompt == snapshot  # Fails if prompt changes, update with --snapshot-update
```

> **Complete examples**: See [reference/02_python_ai_patterns.md](../reference/02_python_ai_patterns.html#testing-patterns) for pytest fixtures, mock API responses, VCR for recorded interactions, and streaming response tests.

---

## Practical Exercises

### Exercise 1: Type-Safe LLM Response Handler

Build a type-safe response handler for multiple LLM providers:

1. Define `TypedDict` classes for OpenAI and Anthropic API responses
2. Create a unified `LLMResponse` Pydantic model that normalizes both formats
3. Implement converter functions with proper type hints
4. Write tests that verify type correctness using `mypy`

**Acceptance criteria:**

- Running `mypy --strict` produces no errors
- Both API formats convert correctly to the unified model
- Invalid responses raise `ValidationError` with clear messages

### Exercise 2: Async Rate-Limited API Client

Implement an async HTTP client with production-ready features:

1. Create a client class that wraps `httpx.AsyncClient`
2. Implement a token bucket rate limiter (100 requests/minute)
3. Add automatic retry with exponential backoff for 429 and 5xx errors
4. Include request tracing with correlation IDs
5. Log all requests with structured logging

**Acceptance criteria:**

- Concurrent requests respect rate limits
- Retries use jitter to prevent thundering herd
- Each request can be traced through logs
- Test with 200 concurrent requests

### Exercise 3: Pydantic Configuration System

Build a configuration system for an AI application:

1. Define Pydantic `BaseSettings` (provided by the `pydantic-settings` package in Pydantic v2) for:
   - API keys (from environment variables)
   - Model parameters (temperature, max_tokens, etc.)
   - Retry configuration (max_retries, base_delay)
   - Feature flags (enable_caching, debug_mode)
2. Support loading from `.env` files
3. Add custom validators for value ranges
4. Implement environment-specific overrides (dev, staging, prod)

**Acceptance criteria:**

- Sensitive values are masked in logs and `__repr__`
- Invalid configurations fail fast with clear error messages
- Can switch environments without code changes

### Exercise 4: Streaming Response Processor

Implement a streaming processor for LLM responses:

1. Create an async generator that yields chunks from the OpenAI streaming API
2. Implement a `StreamBuffer` class that accumulates chunks and detects complete JSON objects
3. Add timeout handling (fail if no chunk received in 30 seconds)
4. Include progress callbacks for UI updates

```python
async def process_stream():
    async for chunk in stream_completion(prompt="..."):
        # Should print incrementally as chunks arrive
        print(chunk, end="", flush=True)
```

**Acceptance criteria:**

- Prints text as it arrives, not buffered
- Handles connection drops gracefully
- Works with both OpenAI and Anthropic streaming formats

### Exercise 5: Observable Service Module

Build a Python module with comprehensive observability:

1. Create a service class with structured logging:
   - Log all method entries and exits
   - Include timing information
   - Use correlation IDs for request tracing
2. Add metrics using `prometheus_client`:
   - Request count by method
   - Latency histograms
   - Error rates
3. Implement a context manager for automatic instrumentation
4. Write a decorator that adds observability to any function

**Acceptance criteria:**

- Can reconstruct a full request flow from logs
- Metrics are exposed on `/metrics` endpoint
- Latency percentiles (p50, p95, p99) are available

### Exercise 6: Docker Development Environment

Create a complete development environment for an AI project:

1. Write a `Dockerfile` with:
   - Multi-stage build for smaller image
   - Non-root user for security
   - Proper layer caching for dependencies
2. Create `docker-compose.yml` with:
   - Application service with hot reload
   - Redis for caching
   - Mock API server for testing
3. Include a `Makefile` with common commands:
   - `make dev` - Start development environment
   - `make test` - Run tests in container
   - `make lint` - Run mypy, ruff, black

**Acceptance criteria:**

- Container builds in under 2 minutes (with cache)
- Code changes reflect immediately without rebuild
- Tests can run in isolation from external services

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [IC2]** Why does async programming provide significant speedups for LLM API calls but not for local embedding computations?

<details>
<summary>Answer</summary>

LLM API calls are I/O-bound: the CPU waits for network responses. During this wait time, async allows other tasks to execute. With 10 concurrent API calls that each take 1 second, sequential execution takes 10 seconds while async takes ~1 second.

Local embedding computations are CPU-bound: the CPU is actively processing. Async doesn't help because there's no waiting; the CPU is always busy. For CPU-bound work, you need multiprocessing (multiple CPU cores) rather than async (concurrency during I/O wait).

</details>

**Q2. [IC2]** When should you use `TypedDict` versus a Pydantic model?

<details>
<summary>Answer</summary>

**TypedDict**: Use for typing dictionaries from external sources where you want type checking without runtime validation. Good for API responses you've already validated elsewhere, or when you need dict compatibility with existing code.

**Pydantic model**: Use when you need runtime validation, default values, custom validators, or serialization. Better for data at trust boundaries (user input, config files) and when you want automatic parsing (strings to ints, etc.).

In practice: External API responses often start as TypedDict for typing, then get converted to Pydantic models when entering your application's core.

</details>

**Q3. [Senior]** How would you implement graceful degradation for a service that calls multiple LLM providers?

<details>
<summary>Answer</summary>

Implement a multi-provider client with:

1. **Health checking**: Periodically test each provider, track success rates
2. **Circuit breaker**: Open circuit after N consecutive failures, half-open after timeout
3. **Fallback chain**: Primary → Secondary → Cached response → Graceful error
4. **Request hedging**: For latency-sensitive requests, send to multiple providers, use first response

```python
# Method on the multi-provider client class
async def generate(self, prompt: str) -> str:
    for provider in self.healthy_providers():
        try:
            return await provider.generate(prompt)
        except ProviderError:
            self.mark_unhealthy(provider)
    return await self.fallback_response(prompt)  # Cached or default
```

Key insight: Never fail completely if any fallback is possible. Users prefer a slower or slightly degraded response over an error.

</details>

**Q4. [Senior]** What are the trade-offs between using `poetry` versus `pip-tools` for dependency management?

<details>
<summary>Answer</summary>

**Poetry**:

- Pros: Single tool for dependency management, virtual env, building, and publishing. Excellent lockfile. Nice CLI. Growing adoption.
- Cons: Can be slow for large dependency trees. Some compatibility issues with editable installs. Adds a layer of abstraction over pip.

**pip-tools**:

- Pros: Minimal abstraction: it just generates requirements.txt from requirements.in. Fast. Works with any pip-compatible tooling.
- Cons: Separate tools needed for virtual env (venv) and building (setuptools/build). More manual workflow.

**Decision guide**: Poetry for new projects where you control the environment. pip-tools for existing projects, Docker containers (where simplicity matters), or teams unfamiliar with Poetry.

</details>

---

## Summary

Python for AI engineering requires a mindset shift from exploration to reliability without sacrificing the ability to iterate quickly. The key principles:

**Type hints are documentation that catches bugs.** They don't slow you down; they prevent hours of debugging later. Start at function boundaries and data structures that cross modules.

**Async programming enables concurrency for I/O-bound work.** When making many external calls, async provides dramatic speedups. When making single calls or doing CPU-bound work, it adds complexity without benefit.

**Pydantic validates data at trust boundaries.** External API responses, user input, and configuration files all warrant validation. Internal data structures can use lighter-weight dataclasses.

**Error handling must expect failures.** External services fail. Categorize errors as retryable or not, implement backoff for retries, and fail fast for non-retryable errors.

**Environment discipline enables reproducibility.** Virtual environments, locked dependencies, proper secrets management, and Docker containers prevent "works on my machine" problems.

**Observability reveals system behavior.** Structured logging, request tracing, and comprehensive metrics let you understand what your code actually does in production.

These practices compound. Each one prevents a category of problems, and together they transform AI code from fragile prototypes to reliable systems. The investment in discipline pays dividends every time you debug an issue in minutes instead of hours, every time a code change deploys without incident, every time you sleep through the night while your system handles traffic.

The patterns in this chapter appear throughout AI engineering. Master them, and you'll spend less time fighting Python and more time building systems that work.

---

## Connections to Other Chapters

The Python patterns in this chapter form the foundation for production AI code throughout the book:

- **Chapter 4 (Your First LLM Application)**: Applies these Python patterns to build a complete RAG application with async API calls, Pydantic validation, and proper error handling.
- **Chapter 14 (Backend Engineering for AI)**: Extends these patterns to production systems including testing strategies, debugging techniques, and integration patterns.
- **Chapter 9 (LLM Deployment & Infrastructure)**: Uses async patterns for high-throughput inference and discusses retry/circuit breaker patterns for reliability.
- **Chapter 31 (Reliability Engineering)**: Builds on the observability patterns here to implement comprehensive monitoring, alerting, and incident response.
- **Tool & Framework Reference** (Appendix B): Comprehensive guide to the Python libraries mentioned in this chapter with version recommendations.

---

## Further Reading

- **Fluent Python** by Luciano Ramalho: deep coverage of Python's features and idioms
- **Architecture Patterns with Python** by Harry Percival and Bob Gregory: patterns for maintainable Python applications
- **Pydantic documentation** (docs.pydantic.dev): comprehensive guide to data validation
- **Real Python's async guide** (realpython.com/async-io-python): practical async programming tutorial
- **The Twelve-Factor App** (12factor.net): principles for production applications, including configuration and dependencies

---

## Code Reference

For complete, runnable implementations of all patterns discussed in this chapter, see:

**[reference/02_python_ai_patterns.md](../reference/02_python_ai_patterns.html)** contains full code examples, including:

- Type hints and TypedDict for LLM responses
- Pydantic models with validators and configuration loading
- Async patterns: gather, semaphores, timeouts, first-completed
- Context managers for resource management
- HTTP client configuration with httpx
- OpenAI and Anthropic SDK patterns with provider abstraction
- NumPy operations for embeddings
- Pandas for evaluation data manipulation
- JSON extraction from LLM outputs
- Streaming response handling
- Token bucket rate limiter implementation
- Structured logging and request tracing
- Comprehensive error handling with retry logic
- Testing patterns: fixtures, mocks, snapshots, VCR
- Docker and Docker Compose configurations