Chapter 2: Python for AI Engineering
Python, async, Pydantic, type hints, error handling, observability, logging, data classes
Introduction
In early 2023, a machine learning engineer at a healthcare startup faced a familiar crisis. Her team had built a diagnostic assistant using GPT-4—a Jupyter notebook that analyzed patient symptoms and suggested possible conditions. In demos, it was impressive. Doctors were excited. The CEO scheduled a board presentation.
Then the engineering team tried to deploy it.
The notebook made synchronous API calls that blocked the web server, causing timeouts when multiple doctors used it simultaneously. Error handling was optimistic at best—a single API hiccup crashed the entire application. Configuration was scattered across cells, with API keys occasionally committed to version control. The code that had dazzled stakeholders became a liability that took six weeks to untangle.
This story plays out constantly across the industry. The gap between “it works in a notebook” and “it runs reliably in production” is vast, and Python—for all its strengths—makes it easy to fall into this chasm. The same flexibility that enables rapid prototyping allows sloppy practices that become expensive later.
This chapter is about writing Python that works both for exploration and production. Not Python that’s technically correct but unmaintainable. Not Python that’s elegant but impractical. Python that lets you iterate quickly during development and sleep soundly when your code runs in production.
Why Python Dominates AI Engineering
Python’s dominance in AI isn’t a historical accident or mere popularity contest. Several forces converge to make it uniquely suited for this domain.
The library ecosystem creates a gravitational pull. NumPy, PyTorch, TensorFlow, Hugging Face Transformers, LangChain—the tools that power AI development are Python-first. Other languages have bindings, but Python has native support with full documentation and community knowledge. When you encounter an obscure error from a transformer model, solutions in Python are abundant. Solutions in Rust or Go often don’t exist.
Rapid iteration matches AI’s experimental nature. AI development isn’t like building a web form. You try a prompting strategy, evaluate results, adjust, repeat. You experiment with different models, embedding approaches, chunking strategies. Python’s dynamic nature—the same trait that causes production bugs—enables this experimentation. You can modify a pipeline mid-execution in a debugger, reload modules without restarting, test ideas in a REPL.
Glue language capabilities matter for integration. Production AI systems rarely stand alone. They connect to databases, message queues, APIs, authentication systems, monitoring infrastructure. Python’s extensive standard library and third-party packages provide connectors for virtually anything. A single language spans your entire stack from data preprocessing to model serving to observability.
But Python’s flexibility is a double-edged sword. The language that lets you quickly test an idea also lets you quickly create unmaintainable code. Dynamic typing that speeds prototyping causes production bugs. The absence of required structure means discipline must come from the developer, not the language.
AI engineering requires disciplined Python—leveraging flexibility for exploration while building in safeguards for reliability.
The Mental Model: Concentric Circles of Trust
A useful framework for thinking about Python in AI systems is concentric circles of trust. At the center are components you fully control—your own code, tested and typed. Moving outward, each ring represents less control and more uncertainty:
- Your code: Full control, can be tested and typed exhaustively
- Your dependencies: Trusted but external, can break with updates
- External APIs: Unreliable, slow, may change without notice, may have outages
- User input: Completely untrusted, potentially malicious, definitely surprising
Good AI engineering Python acknowledges these boundaries. You validate data as it crosses from outer rings to inner rings. You handle failures at API boundaries. You type your internal interfaces strictly while accepting that external data needs runtime validation.
This isn’t paranoia—it’s pragmatism. Every production incident teaches the same lesson: something you assumed would work, didn’t.
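As a minimal sketch of validating data as it crosses from an outer ring to an inner one, the following uses only the standard library. The names `Query` and `parse_query`, and the limits on `top_k`, are hypothetical and chosen for illustration; later sections show the same idea with Pydantic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Inner-ring type: once constructed, the rest of the code can trust it."""
    text: str
    top_k: int


def parse_query(raw: dict) -> Query:
    """Validate untrusted input once, at the boundary (hypothetical helper)."""
    text = raw.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")
    top_k = raw.get("top_k", 5)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 50:
        raise ValueError("'top_k' must be an integer between 1 and 50")
    return Query(text=text.strip(), top_k=top_k)
```

Code past `parse_query` accepts a `Query` rather than a raw dictionary, so the checks live in one place instead of being repeated defensively throughout the pipeline.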
You can skip this chapter if you already:
- Write production Python with type hints and Pydantic models
- Use async/await for concurrent I/O operations
- Implement retry logic with exponential backoff
- Manage dependencies with lock files (Poetry, pip-tools)
- Structure code with proper error handling and logging
Skim instead if you know most of this but want to see AI-specific patterns. Jump to the “Patterns That Appear Everywhere” section.
Read carefully if you’re coming from data science/notebooks, another language, or haven’t built production Python services.
What You’ll Learn
This chapter covers Python for AI engineering along four dimensions:
- Language features that matter: Type hints, async programming, context managers, and dataclasses—not as syntax tutorials but as tools for specific problems
- Library choices and trade-offs: When to use httpx versus requests, Pydantic versus dataclasses, Poetry versus pip
- Environment discipline: Dependency management, secrets handling, and containerization that prevent “works on my machine” problems
- Patterns that appear everywhere: Retry logic, rate limiting, streaming, and observability—the infrastructure code that separates prototypes from products
The goal isn’t comprehensive Python coverage. It’s focused preparation for the AI engineering work ahead.
Type Hints: Documentation That Catches Bugs
The Problem with Dynamic Typing
Python’s dynamic typing is liberating during exploration. You don’t declare types, don’t satisfy a compiler, just write code and run it. This speed has real value when you’re iterating on prompt strategies or testing different embedding models.
But dynamic typing has a cost that compounds over time. Consider this function from an AI codebase:
def process_response(response, config):
    return response['choices'][0]['message']['content'] if config.get('raw') else clean(response)

What is response? A dictionary? What shape? What is config? What does clean() return? Without reading the implementation, all call sites, and possibly the API documentation, you cannot know.
This ambiguity creates several problems:
Code review becomes archaeology. Reviewers must trace through implementations to understand what functions expect and return. Review fatigue leads to bugs slipping through.
Refactoring becomes dangerous. Change a function’s return type, and you have no way to find all affected code without running every path—which in AI systems with many conditional branches is effectively impossible.
Onboarding takes longer. New team members face a codebase where understanding requires not reading code but deciphering it.
IDE support degrades. Without type information, autocomplete is limited to guessing, and errors that could be caught immediately require runtime to surface.
Type Hints as Executable Documentation
Python 3.5 introduced type hints—optional annotations that specify expected types without runtime enforcement. Modern Python type hints are sophisticated enough to express complex data structures clearly:
from typing import TypedDict, Literal

class Message(TypedDict):
    role: Literal["user", "assistant", "system"]
    content: str

class Choice(TypedDict):
    message: Message
    index: int
    finish_reason: Literal["stop", "length", "content_filter"]

class ChatCompletion(TypedDict):
    id: str
    model: str
    choices: list[Choice]
    usage: dict[str, int]

def process_response(response: ChatCompletion, raw: bool = False) -> str:
    """Extract content from a chat completion response."""
    if raw:
        return response['choices'][0]['message']['content']
    return clean_text(response['choices'][0]['message']['content'])

The function signature now documents itself. IDEs provide accurate autocomplete. Static analyzers catch typos like response['choice'] (missing ‘s’) before runtime. Refactoring tools can find all usages and verify compatibility.
Why Type Hints Matter Especially for AI Code
AI applications handle unusually complex data structures. LLM API responses are deeply nested dictionaries. Embeddings are lists of floats of specific dimensions. Batch operations return lists of complex objects with optional fields depending on processing outcomes.
Without types, bugs hide in data shape mismatches that only surface at runtime—often in production, often at 3 AM, often in subtle ways that take hours to diagnose.
Consider the difference in debugging experience:
Without types: “TypeError: ‘NoneType’ object is not subscriptable” somewhere in a 500-line pipeline. Which variable is None? Where did it come from?
With types and static analysis: “error: Item ‘None’ of ‘Optional[Message]’ has no attribute ‘content’” at line 47, during development, before you even run the code.
The Pragmatic Approach: Gradual Adoption
Type hints are opt-in. You don’t need to annotate everything immediately. The highest-value strategy is gradual adoption focused on boundaries:
Start with function signatures. The function’s contract is the minimum viable type coverage. Even without typing internal variables, annotated parameters and returns catch most interface mismatches.
Prioritize data structures that cross modules. When data flows between components written by different people or at different times, types prevent miscommunication.
Type complex nested structures explicitly. API responses, configuration objects, and intermediate data formats benefit most from explicit typing.
Run a type checker in CI. Tools like mypy or pyright catch errors where you’ve added types. Start with lenient settings and tighten as coverage improves.
The goal isn’t type purity. It’s catching bugs early and making code self-documenting where it matters most.
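A small sketch of what annotated signatures buy you in practice. The helper names here are hypothetical; the point is that the `Optional` return type forces the caller to handle the missing case, and a checker such as mypy or pyright would flag the subscript if the `None` check were removed.

```python
from typing import Optional, TypedDict


class Message(TypedDict):
    role: str
    content: str


def last_assistant_message(messages: list[Message]) -> Optional[Message]:
    """Return the most recent assistant message, or None if there is none."""
    for message in reversed(messages):
        if message["role"] == "assistant":
            return message
    return None


def last_reply_text(messages: list[Message], default: str = "") -> str:
    message = last_assistant_message(messages)
    # Without this check, a type checker reports that `message` may be None
    # at the subscript below, during development rather than at runtime.
    if message is None:
        return default
    return message["content"]
```

Only the two signatures carry annotations, which matches the "start with function signatures" advice: the function bodies needed no extra typing for the checker to do useful work.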
Complete examples: See reference/02_python_ai_patterns.md for TypedDict patterns, Pydantic response models, and generic type patterns for AI applications.
Async Programming: Concurrency for I/O-Bound Work
Understanding the Problem
AI applications make many external calls. Retrieving embeddings, querying LLM APIs, searching vector databases, fetching context documents—each call takes time, and that time is mostly waiting. Your CPU sits idle while data travels across networks and remote servers process requests.
Synchronous code handles this poorly. Consider embedding 100 documents:
import requests

def embed_documents_sync(texts: list[str]) -> list[list[float]]:
    embeddings = []
    for text in texts:
        # Each request blocks until the response arrives.
        response = requests.post(EMBED_URL, json={"input": text})
        embeddings.append(response.json()["embedding"])
    return embeddings

If each API call takes 100 milliseconds, this function takes 10 seconds. During those 10 seconds, your program is 99% idle: waiting for network responses, doing nothing useful.
The Async Solution
Asynchronous programming lets you initiate many I/O operations without waiting for each to complete:
import asyncio
import logging

import httpx

logger = logging.getLogger(__name__)

async def embed_documents_async(texts: list[str]) -> list[list[float] | None]:
    async with httpx.AsyncClient() as client:
        tasks = [
            client.post(EMBED_URL, json={"input": text})
            for text in texts
        ]
        # return_exceptions=True is critical for batch work: by default a
        # single failed request makes gather raise and DISCARD every other
        # in-flight result. Here we keep the successes and handle failures.
        results = await asyncio.gather(*tasks, return_exceptions=True)

    embeddings: list[list[float] | None] = []
    for text, result in zip(texts, results):
        if isinstance(result, Exception):
            logger.warning("embedding failed for %r: %s", text[:50], result)
            embeddings.append(None)  # or queue for retry, per your contract
        else:
            embeddings.append(result.json()["embedding"])
    return embeddings

All 100 requests launch nearly simultaneously. The function completes when the slowest single request finishes, perhaps 150 milliseconds instead of 10 seconds. The speedup is dramatic. The subtle part is failure handling: a bare asyncio.gather(*tasks) raises on the first exception and throws away every other result, so one rate-limited call would sink an entire batch of embeddings. return_exceptions=True makes partial failure a first-class case you handle deliberately.
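The partial-failure behavior can be checked without a network using a self-contained sketch. Here `fake_embed` is a stand-in for a real embedding call, and `embed_batch` is a hypothetical helper that splits results into successes and failures; keying by input text assumes the texts are unique.

```python
import asyncio


async def fake_embed(text: str) -> list[float]:
    """Stand-in for an embedding call; fails on empty input."""
    await asyncio.sleep(0.01)
    if not text:
        raise ValueError("empty input")
    return [float(len(text))]


async def embed_batch(
    texts: list[str],
) -> tuple[dict[str, list[float]], dict[str, Exception]]:
    results = await asyncio.gather(
        *(fake_embed(t) for t in texts), return_exceptions=True
    )
    succeeded: dict[str, list[float]] = {}
    failed: dict[str, Exception] = {}
    for text, result in zip(texts, results):
        if isinstance(result, Exception):
            failed[text] = result  # candidates for a later retry
        else:
            succeeded[text] = result
    return succeeded, failed
```

Returning the failures alongside the successes lets the caller retry only the inputs that failed, instead of re-embedding the whole batch.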
The Mental Model: Restaurant Servers vs. Cashiers
Synchronous code is like a single cashier serving a queue. Each customer waits while the current one is served, and the cashier cannot start on anyone else while a payment is processing. Throughput is limited to one customer at a time.
Asynchronous code is like a restaurant server. Take an order, submit it to the kitchen, take another order while the first cooks. The server isn’t cooking faster—they’re not waiting idle while food prepares.
This mental model clarifies when async helps and when it doesn’t:
Async helps when you’re waiting for external resources. API calls, database queries, file I/O—operations where your program waits for something else.
Async doesn’t help for CPU-bound work. If you’re computing embeddings locally, async won’t speed anything up. The CPU is the bottleneck, and async doesn’t parallelize computation.
Async adds complexity. Concurrent code is harder to reason about, debug, and test. The speedup must justify this cost.
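To make the waiting-versus-working distinction concrete, here is a small timing sketch. `fake_api_call` simulates network latency with `asyncio.sleep`; the 50 ms delay is an arbitrary assumption, and real timings will vary.

```python
import asyncio
import time


async def fake_api_call(delay: float = 0.05) -> str:
    """Simulates network latency; asyncio.sleep yields control while waiting."""
    await asyncio.sleep(delay)
    return "ok"


async def run_sequential(n: int) -> float:
    """Await each call before starting the next; total time is about n * delay."""
    start = time.perf_counter()
    for _ in range(n):
        await fake_api_call()
    return time.perf_counter() - start


async def run_concurrent(n: int) -> float:
    """Start all calls together; total time is about one delay."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_api_call() for _ in range(n)))
    return time.perf_counter() - start
```

Replacing `asyncio.sleep` with a CPU-heavy loop would erase the difference, because a busy coroutine never yields to the event loop. That is the CPU-bound case where async does not help.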
Essential Async Patterns
Four patterns appear repeatedly in AI applications:
Concurrent execution with gather. When you have multiple independent operations, run them concurrently:
async def fetch_all_contexts(queries: list[str]) -> list[Context]:
    tasks = [fetch_context(q) for q in queries]
    return await asyncio.gather(*tasks)

Controlled concurrency with semaphores. APIs have rate limits. Launching 10,000 concurrent requests will get you blocked. Semaphores limit concurrency:

async def fetch_with_limit(items: list[str], max_concurrent: int = 10) -> list[Result]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(item: str) -> Result:
        async with semaphore:
            return await fetch_item(item)

    return await asyncio.gather(*[fetch_one(i) for i in items])

Timeouts for operations that might hang. External services sometimes stall. Timeouts prevent indefinite waits:

async def fetch_with_timeout(item: str, timeout: float = 30.0) -> Result:
    try:
        return await asyncio.wait_for(fetch_item(item), timeout=timeout)
    except asyncio.TimeoutError:
        raise FetchError(f"Timed out after {timeout}s")

First-completed for redundant providers. When multiple providers can serve a request, take the first response:

async def fetch_from_any(item: str, providers: list[Provider]) -> Result:
    # asyncio.wait needs Task objects; passing bare coroutines is an error on Python 3.11+.
    tasks = [asyncio.create_task(provider.fetch(item)) for provider in providers]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # The first task to finish may have failed; result() re-raises its exception.
    return done.pop().result()

When to Use Async
Async isn’t always the right choice. Use it when:
- Making multiple independent external calls (embedding batches, parallel LLM queries)
- Building servers that handle concurrent requests
- Processing pipelines where each step involves I/O
Don’t use async for:
- Single sequential API calls (no concurrency to exploit)
- CPU-bound work (use multiprocessing instead)
- Simple scripts where complexity isn’t justified
The decision isn’t about async being “better”—it’s about whether the speedup justifies the complexity in your specific context.
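One way to convince yourself that a semaphore really bounds concurrency is to measure it. The sketch below is illustrative only: `run_limited` is a hypothetical helper, and the `asyncio.sleep` call stands in for a real request.

```python
import asyncio


async def run_limited(n_items: int, max_concurrent: int) -> int:
    """Run n_items fake calls and return the peak number in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fetch_one() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real request
            in_flight -= 1

    await asyncio.gather(*(fetch_one() for _ in range(n_items)))
    return peak
```

Calling `asyncio.run(run_limited(20, 3))` schedules twenty tasks, but no more than three ever hold the semaphore at the same time. Note that a semaphore caps simultaneous requests, not requests per minute; a per-minute quota needs a separate rate limiter.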
Complete examples: See reference/02_python_ai_patterns.md for semaphore-based rate limiting, timeout handling, and bridging sync/async contexts.
Data Validation: Pydantic and the Trust Boundary
The Problem with Unvalidated Data
Data enters AI systems from many sources: API responses, user inputs, configuration files, database queries. Each source can produce unexpected data—missing fields, wrong types, values outside expected ranges.
Without validation, these problems surface far from their origin. A missing field in an API response causes an error three functions later. A malformed user input corrupts results without raising exceptions. Debugging becomes archaeology, tracing backward from symptoms to causes.
Pydantic: Validation as Type Definition
Pydantic models define data structures that validate on instantiation. If validation fails, you get an immediate, specific error:
from pydantic import BaseModel, Field
from typing import Literal

class ChatMessage(BaseModel):
    role: Literal["user", "assistant", "system"]
    content: str = Field(min_length=1, max_length=100000)

class ChatRequest(BaseModel):
    messages: list[ChatMessage]
    model: str = "gpt-5"
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

Attempting to create a ChatRequest with invalid data fails immediately:

# Raises ValidationError: temperature must be <= 2.0
ChatRequest(messages=[...], temperature=2.5)

The error is specific, occurs at the point of creation, and includes context about what went wrong.
Validation at Trust Boundaries
Pydantic is most valuable at boundaries where data crosses trust levels:
External API responses. When you call an embedding API, the response should match your expectations. Parsing through Pydantic validates this assumption:
response = await client.post(url, json=payload)
result = EmbeddingResponse.model_validate(response.json())
# If parsing succeeds, result matches your expected structure

User input. Everything from users is suspect. Pydantic enforces constraints:

from pydantic import BaseModel, Field, field_validator

class UserQuery(BaseModel):
    text: str = Field(min_length=1, max_length=10000)
    filters: dict[str, str] = Field(default_factory=dict)

    @field_validator('text')
    @classmethod
    def sanitize_text(cls, v: str) -> str:
        return v.strip()

Configuration files. Settings loaded from YAML or JSON should be validated before use:

from pathlib import Path
from typing import Literal

import yaml  # provided by the PyYAML package
from pydantic import BaseModel, Field

class AppConfig(BaseModel):
    llm_provider: Literal["openai", "anthropic"]
    model: str
    temperature: float = Field(ge=0.0, le=2.0)

    @classmethod
    def from_yaml(cls, path: Path) -> "AppConfig":
        with open(path) as f:
            return cls.model_validate(yaml.safe_load(f))

Pydantic vs. Dataclasses
Python’s built-in dataclasses are lighter weight but don’t validate. The choice depends on context:
Use dataclasses for internal data structures where you control all inputs and validation overhead isn’t warranted:
from dataclasses import dataclass

@dataclass
class EmbeddingResult:
    vector: list[float]
    model: str
    cached: bool = False

Use Pydantic at boundaries where data comes from external sources or needs validation:

class APIResponse(BaseModel):
    embedding: list[float]
    model: str
    usage: dict[str, int]

The distinction matters for performance and complexity. Pydantic validation has overhead, which becomes meaningful when processing millions of records. For internal data you trust, dataclasses avoid that cost.
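The claim that dataclasses do not validate is easy to demonstrate. In this sketch, `CheckedEmbeddingResult` is a hypothetical subclass showing the kind of hand-written check you would need to add yourself; Pydantic derives equivalent checks from the annotations.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingResult:
    vector: list[float]
    model: str
    cached: bool = False


# Annotations are not enforced at runtime: this constructs without error,
# even though a static type checker would flag both arguments.
bad = EmbeddingResult(vector="not a vector", model=123)


@dataclass
class CheckedEmbeddingResult(EmbeddingResult):
    def __post_init__(self) -> None:
        # Manual validation, run automatically after the generated __init__.
        if not isinstance(self.vector, list):
            raise TypeError("vector must be a list of floats")
```

This is why plain dataclasses fit trusted internal data: they give you structure and readable reprs, but any runtime guarantee is one you write by hand.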
Complete examples: See reference/02_python_ai_patterns.md for field validators, model validators, configuration loading, and structured output patterns for LLM responses.
Error Handling: Expecting Failures
The Reality of External Dependencies
AI applications depend heavily on external services, and external services fail. Networks partition, servers crash, rate limits trigger, APIs change without notice. The question isn’t whether failures will occur but how your code responds.
Robust error handling requires categorizing failures by how they should be handled:
Retryable failures might succeed on the next attempt. Network timeouts, rate limits, server errors (5xx) often resolve themselves. These warrant retry logic.
Non-retryable failures won’t be fixed by trying again. Authentication errors, malformed requests, resource not found—these require human intervention or code changes.
Partial failures in batch operations leave some work completed and some failed. These require careful handling to avoid reprocessing successful items.
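For HTTP-based APIs, the retryable versus non-retryable split often comes down to the status code. The following is a rough sketch; the exact set of retryable codes is an assumption and should be checked against each provider's documentation.

```python
# Assumed retryable client errors: 408 (request timeout) and 429 (rate limited).
RETRYABLE_STATUS = {408, 429}


def is_retryable(status_code: int) -> bool:
    """Rough classification of HTTP status codes, not a standard."""
    return status_code in RETRYABLE_STATUS or 500 <= status_code <= 599
```

Under this rule a 401 or 404 fails immediately, while a 429 or 503 is worth another attempt after a delay.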
Structured Error Handling
Define error types that distinguish these categories:
class LLMError(Exception):
    """Base exception for LLM operations."""

class RetryableError(LLMError):
    """Error that may succeed on retry."""

class NonRetryableError(LLMError):
    """Error that won't be fixed by retrying."""

Then handle API errors by mapping them to these categories:

from openai import (
    APIConnectionError,
    AsyncOpenAI,
    AuthenticationError,
    RateLimitError,
)

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def call_llm(prompt: str) -> str:
    try:
        response = await client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}]
        )
        # content can be None (for example on tool calls); normalize to str
        return response.choices[0].message.content or ""
    except AuthenticationError as e:
        # Invalid credentials - don't retry
        raise NonRetryableError("Invalid API credentials") from e
    except RateLimitError as e:
        # Rate limited - retry after backoff
        raise RetryableError("Rate limit exceeded") from e
    except APIConnectionError as e:
        # Network issue - retry
        raise RetryableError("Connection failed") from e

Retry Logic with Exponential Backoff
Retry logic must balance persistence with politeness. Hammering a failing server with immediate retries makes problems worse. Exponential backoff spaces retries increasingly apart:
import asyncio
import logging
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)

async def retry_with_backoff(
    func: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Execute function with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except RetryableError as e:
            if attempt == max_attempts - 1:
                raise  # Final attempt failed
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ... before jitter
            delay *= (0.5 + random.random())  # Add jitter: scale by 0.5x to 1.5x
            logger.warning(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
    raise RuntimeError("Should not reach here")

The jitter (random variation in delay) prevents the “thundering herd” problem where many clients retry simultaneously after an outage, overwhelming the recovering server.
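To see what delays this scheme actually produces, the arithmetic can be pulled out into a pure function. `backoff_delays` is a hypothetical helper that mirrors the formula used in this section (base delay doubled per attempt, then scaled by a random factor between 0.5 and 1.5); the injectable `rng` is there only to make it testable.

```python
import random
from typing import Optional


def backoff_delays(
    max_attempts: int,
    base_delay: float = 1.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Delays slept between attempts: base * 2**attempt, jittered by [0.5, 1.5)."""
    rng = rng or random.Random()
    return [
        base_delay * (2 ** attempt) * (0.5 + rng.random())
        for attempt in range(max_attempts - 1)  # no sleep after the final attempt
    ]
```

With `max_attempts=4` and `base_delay=1.0`, the three delays fall in the ranges 0.5 to 1.5 s, 1 to 3 s, and 2 to 6 s. Two clients that fail at the same moment are therefore unlikely to retry at the same moment.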
Libraries for Retry Logic
The tenacity library provides sophisticated retry logic without reinventing it:
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_exception_type(RetryableError),
)
async def robust_llm_call(prompt: str) -> str:
    return await call_llm(prompt)

Use libraries for complex retry scenarios. Understand the underlying concepts to debug when things go wrong.
Complete examples: See reference/02_python_ai_patterns.md for comprehensive API error handling, tenacity configurations, and rate limiter implementations.
Environment Management: Reproducibility and Isolation
The Dependency Problem
Python’s package ecosystem is both blessing and curse. Thousands of libraries are available for any task. But these libraries depend on other libraries, which depend on others, creating a web of transitive dependencies that can conflict.
The symptoms are familiar:
- “It works on my machine” (but not in production)
- Package A requires numpy>=1.20 while package B requires numpy<1.20
- An update to a minor dependency breaks your application
- A security vulnerability in a transitive dependency requires untangling the entire chain
Virtual Environments: Basic Isolation
Virtual environments isolate project dependencies from system Python and from each other. Every project should have its own:
# Create environment
python -m venv .venv

# Activate
source .venv/bin/activate  # Unix
.venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

This is the minimum standard. Developing without virtual environments causes problems that seem small initially but compound over time.
Dependency Locking: Reproducible Installs
requirements.txt with loose version constraints isn’t reproducible:
# This installs different versions on different days
openai>=1.0
pydantic>=2.0
Dependency locking captures the exact versions of all packages, including transitive dependencies:
pip-tools generates locked requirements from loose specifications:
# requirements.in (your direct dependencies)
openai>=1.0
pydantic>=2.0

# Generate locked requirements.txt
pip-compile requirements.in

# Install exactly those versions
pip-sync requirements.txt

Poetry manages dependencies with built-in locking:

# Add dependencies (updates pyproject.toml and poetry.lock)
poetry add openai pydantic

# Install exactly what's in poetry.lock
poetry install

The lock file is your reproducibility guarantee. Commit it to version control. CI and production install from the lock file, not from loose constraints.
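A quick way to detect drift between a lock file and the running environment is to compare pinned versions against installed ones. This is a simplified sketch, not a replacement for pip-sync or poetry install: `parse_pins` and `find_drift` are hypothetical helpers, and the parser assumes plain `name==version` lines (no hashes, extras, or line continuations).

```python
from importlib import metadata
from typing import Optional


def parse_pins(lock_text: str) -> dict[str, str]:
    """Parse 'name==version' lines; skips comments, blanks, and other forms."""
    pins: dict[str, str] = {}
    for line in lock_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Drop any environment marker after ';'
        pins[name.strip().lower()] = version.split(";", 1)[0].strip()
    return pins


def find_drift(pins: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Map package name to (pinned, installed) wherever the two differ."""
    drift: dict[str, tuple[str, Optional[str]]] = {}
    for name, pinned in pins.items():
        try:
            installed: Optional[str] = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # pinned but not installed at all
        if installed != pinned:
            drift[name] = (pinned, installed)
    return drift
```

Running a check like this at service startup or in CI turns "works on my machine" into an explicit, early failure.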
Secrets Management: The API Key Problem
API keys are among the most common security leaks in AI projects. Automated scanners watch public repositories, so an OpenAI key committed to GitHub can be found and exploited within minutes.
Never hardcode secrets. Not even temporarily. Not even in example code you’ll “fix later.”
Use environment variables with .env files for local development:
# .env (never commit this)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# .env.example (commit this as documentation)
OPENAI_API_KEY=your-key-here
ANTHROPIC_API_KEY=your-key-here

Load environment variables at application start:

from dotenv import load_dotenv
import os

load_dotenv()  # Loads .env into environment
openai_key = os.environ["OPENAI_API_KEY"]  # Fails fast if missing

For production, use a secrets manager such as AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault. These provide audit trails, rotation, and access control that environment variables lack.
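The fail-fast idea extends naturally to checking all required variables at once and keeping secrets out of reprs. This is a stdlib-only sketch; `Secrets`, `REQUIRED`, and `load_secrets` are hypothetical names, and the optional `env` parameter exists so the function can be tested without touching the real environment.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional

REQUIRED = ("OPENAI_API_KEY",)


@dataclass(frozen=True)
class Secrets:
    # repr=False keeps the key out of logs and tracebacks that print the object.
    openai_api_key: str = field(repr=False)


def load_secrets(env: Optional[Mapping[str, str]] = None) -> Secrets:
    """Read required secrets, reporting every missing variable in one error."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return Secrets(openai_api_key=env["OPENAI_API_KEY"])
```

Listing every missing variable in a single error saves a round of fix-and-redeploy for each one.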
Docker: Consistent Environments
Docker containers package your application with its complete environment—operating system, Python version, dependencies, configuration. This eliminates environment discrepancies between development and production.
A minimal Dockerfile for an AI application:
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (better layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Run as non-root user
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

CMD ["python", "-m", "app.main"]

The key insight is layer caching: by copying and installing dependencies before copying application code, you avoid reinstalling dependencies on every code change.
Complete examples: See reference/02_python_ai_patterns.md for AWS/GCP secrets manager patterns, pyproject.toml configurations, and Docker Compose setups for AI development.
Observability: Seeing What Your Code Does
The Debugging Challenge
AI applications are hard to debug. LLM outputs are non-deterministic. Errors in prompt construction might produce plausible but wrong results. Performance problems might stem from any of a dozen external service calls.
Traditional debugging—setting breakpoints, stepping through code—helps for some problems. But production issues often can’t be reproduced locally, and the interactive nature of debugging doesn’t scale to understanding system behavior over time.
Observability—logging, metrics, and tracing—provides visibility into running systems.
Structured Logging
Unstructured log messages are human-readable but machine-hostile:
2024-01-15 10:23:45 INFO Processing request for user 123
2024-01-15 10:23:46 INFO LLM call completed in 1.2s
Structured logging outputs machine-parseable data that enables querying and aggregation:
import structlog

logger = structlog.get_logger()

logger.info(
    "llm_call_completed",
    model="gpt-5",
    latency_ms=1200,
    prompt_tokens=150,
    completion_tokens=80,
    user_id="123",
)

With a JSON renderer configured (for example structlog.processors.JSONRenderer), this emits JSON that log aggregation systems (Datadog, CloudWatch, Loki) can query: “Show me all LLM calls taking more than 2 seconds” or “What’s the average token usage per model?”
What to Log for AI Applications
AI applications have specific logging needs beyond standard web applications:
LLM calls: Model, latency, token counts, prompt preview (truncated), response preview. This data is essential for cost tracking and debugging.
Retrieval operations: Query, number of results, relevance scores, latency. When RAG produces bad results, these logs help diagnose whether retrieval or generation failed.
Validation failures: What data failed validation and why. When external APIs change format, these logs reveal the problem quickly.
Rate limit events: When you hit limits, log it. This informs capacity planning and helps debug intermittent failures.
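The LLM-call fields listed above can be logged as JSON with nothing but the standard library. This is a minimal sketch, not a replacement for structlog: `JsonFormatter` and `log_llm_call` are hypothetical names, the `fields` key passed through `extra` is a local convention, and the 80-character preview length is arbitrary.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            # Structured fields passed via extra={"fields": {...}}.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)


def log_llm_call(
    logger: logging.Logger,
    *,
    model: str,
    latency_ms: int,
    prompt_tokens: int,
    completion_tokens: int,
    prompt: str,
) -> None:
    logger.info(
        "llm_call_completed",
        extra={"fields": {
            "model": model,
            "latency_ms": latency_ms,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "prompt_preview": prompt[:80],  # truncate: prompts can be large or sensitive
        }},
    )
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())`. Keeping the field names stable across call sites is what makes the later aggregation queries possible.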
Request Tracing
In systems with many components, a single user request touches multiple services. Tracing connects these interactions:
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="unknown")

def process_request(user_input: str) -> str:
    request_id = str(uuid.uuid4())[:8]
    # Store the ID so code deeper in the call stack can read it
    # with request_id_var.get() when it logs.
    request_id_var.set(request_id)
    logger.info("request_started", request_id=request_id)
    try:
        result = run_pipeline(user_input)
        logger.info("request_completed", request_id=request_id, success=True)
        return result
    except Exception as e:
        logger.error("request_failed", request_id=request_id, error=str(e))
        raise

Every log message in the request path includes the request ID. When investigating an issue, filter logs by request ID to see the complete picture.
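One way to get the request ID onto every log line without passing it by hand is a logging filter that reads the context variable. This sketch uses the standard `logging` module rather than structlog; `RequestIdFilter` and `start_request` are hypothetical names.

```python
import logging
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="unknown")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


def start_request() -> str:
    """Generate a short request ID and bind it to the current context."""
    request_id = uuid.uuid4().hex[:8]
    request_id_var.set(request_id)
    return request_id
```

Attach the filter with `handler.addFilter(RequestIdFilter())` and include `%(request_id)s` in the format string. Because context variables are copied per asyncio task, concurrent requests keep their own IDs.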
Complete examples: See reference/02_python_ai_patterns.md for JSON formatter implementations, request tracing middleware, and LLM call logging patterns.
Development Workflow: From Notebooks to Production
The Notebook-to-Production Pipeline
Jupyter notebooks excel at exploration: interactive execution, inline visualization, rapid iteration. But they’re poorly suited for production: they have no testing framework, version control is problematic (JSON diffs are useless), and they encourage global state.
The solution isn’t abandoning notebooks but establishing a clear pipeline:
- Explore in notebooks: Test ideas, visualize data, prototype approaches
- Extract to modules: When code stabilizes, move it to .py files
- Add types and tests: Production code needs both
- Run in production: Modules imported by application, not notebooks
Making Notebooks Production-Adjacent
Some practices keep notebooks closer to production quality:
Keep cells small and focused. Each cell should do one thing. This makes extraction to functions natural.
Use functions, not just sequential code. Even in notebooks, wrap logic in functions. This forces you to think about inputs, outputs, and encapsulation.
Strip outputs before committing. Notebook outputs include cell execution results, which bloat repositories and make diffs useless. Use nbstripout to automatically remove outputs:
pip install nbstripout
nbstripout --install  # Configures git to strip outputs on commit

Consider jupytext for version control. Jupytext syncs notebooks with plain Python files, enabling meaningful diffs:

jupytext --set-formats ipynb,py:percent notebook.ipynb
# Now notebook.py stays in sync with notebook.ipynb

Testing AI Code
AI applications present unique testing challenges. Outputs are non-deterministic. External APIs are slow and expensive. “Correct” is often subjective.
Mock external services to test your code in isolation:
from unittest.mock import AsyncMock, patch

import pytest

@pytest.fixture
def mock_llm_response():
    return ChatCompletion(
        id="mock",
        model="gpt-5",
        choices=[Choice(
            message=Message(role="assistant", content="Mock response"),
            index=0,
            finish_reason="stop",
        )],
        usage={"prompt_tokens": 0, "completion_tokens": 0},
    )

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_processing_logic(mock_llm_response):
    with patch("openai.AsyncOpenAI") as mock:
        # create() is awaited, so it must be an AsyncMock, not a plain MagicMock.
        mock.return_value.chat.completions.create = AsyncMock(return_value=mock_llm_response)
        result = await process_user_query("test input")
        assert "Mock response" in result

Test deterministic components separately. Prompt construction, response parsing, and business logic can be tested without LLM calls:

def test_prompt_construction():
    prompt = build_rag_prompt(
        query="What is Python?",
        contexts=["Python is a programming language."]
    )
    assert "What is Python?" in prompt
    assert "Python is a programming language." in prompt

Use snapshot testing for prompts. When prompts change, you want to review the diff deliberately:

def test_prompt_template(snapshot):
    prompt = build_prompt(template="summarize", document="...")
    assert prompt == snapshot  # Fails if prompt changes, update with --snapshot-update

Complete examples: See reference/02_python_ai_patterns.md for pytest fixtures, mock API responses, VCR for recorded interactions, and streaming response tests.
Practical Exercises
Exercise 1: Type-Safe LLM Response Handler
Build a type-safe response handler for multiple LLM providers:
- Define TypedDict classes for OpenAI and Anthropic API responses
- Create a unified LLMResponse Pydantic model that normalizes both formats
- Implement converter functions with proper type hints
- Write tests that verify type correctness using mypy
Acceptance criteria:
- Running mypy --strict produces no errors
- Both API formats convert correctly to the unified model
- Invalid responses raise ValidationError with clear messages
Exercise 2: Async Rate-Limited API Client
Implement an async HTTP client with production-ready features:
- Create a client class that wraps httpx.AsyncClient
- Implement a token bucket rate limiter (100 requests/minute)
- Add automatic retry with exponential backoff for 429 and 5xx errors
- Include request tracing with correlation IDs
- Log all requests with structured logging
Acceptance criteria:
- Concurrent requests respect rate limits
- Retries use jitter to prevent thundering herd
- Each request can be traced through logs
- Test with 200 concurrent requests
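As a possible starting point for the rate limiter and the jittered retry delay in this exercise, here is a minimal standard-library sketch. The names TokenBucket and backoff_delay and their default values are illustrative assumptions, not part of the chapter's reference code, and the httpx wrapper, tracing, and logging are left to you.

```python
import asyncio
import random
import time


class TokenBucket:
    """Async token bucket: `rate` tokens accrue per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        # Credit the tokens earned since the last update, never exceeding capacity.
        now = time.monotonic()
        self._tokens = min(self.capacity, self._tokens + (now - self._updated) * self.rate)
        self._updated = now

    async def acquire(self) -> None:
        """Wait until one token is available, then consume it."""
        # Holding the lock while sleeping makes waiters queue up one at a time.
        async with self._lock:
            self._refill()
            while self._tokens < 1:
                # Sleep roughly long enough for the missing fraction of a token to accrue.
                await asyncio.sleep((1 - self._tokens) / self.rate)
                self._refill()
            self._tokens -= 1


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


async def main() -> None:
    # 100 requests/minute; capacity sets how large an initial burst is allowed.
    bucket = TokenBucket(rate=100 / 60, capacity=10)
    for i in range(12):
        await bucket.acquire()
        print(f"request {i} allowed")


if __name__ == "__main__":
    asyncio.run(main())
```

Create the bucket inside a running event loop (as main does here) so the lock is bound to the right loop on older Python versions. The capacity is a design choice: a large value permits bursts, while a capacity of 1 spaces requests evenly.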
Exercise 3: Pydantic Configuration System
Build a configuration system for an AI application:
- Define Pydantic BaseSettings for:
  - API keys (from environment variables)
  - Model parameters (temperature, max_tokens, etc.)
  - Retry configuration (max_retries, base_delay)
  - Feature flags (enable_caching, debug_mode)
- Support loading from .env files
- Add custom validators for value ranges
- Implement environment-specific overrides (dev, staging, prod)
Acceptance criteria:
- Sensitive values are masked in logs and __repr__
- Invalid configurations fail fast with clear error messages
- Can switch environments without code changes
Exercise 4: Streaming Response Processor
Implement a streaming processor for LLM responses:
- Create an async generator that yields chunks from the OpenAI streaming API
- Implement a StreamBuffer class that accumulates chunks and detects complete JSON objects
- Add timeout handling (fail if no chunk received in 30 seconds)
- Include progress callbacks for UI updates
async def process_stream():
    async for chunk in stream_completion(prompt="..."):
        print(chunk, end="", flush=True)
        # Should print incrementally as chunks arrive
Acceptance criteria:
- Prints text as it arrives, not buffered
- Handles connection drops gracefully
- Works with both OpenAI and Anthropic streaming formats
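One way to approach the StreamBuffer part of this exercise is to lean on json.JSONDecoder.raw_decode, which parses one JSON value from the front of a string and reports where it ended. The sketch below is an assumption about what such a class could look like, not the reference implementation. It assumes top-level values are objects or arrays (a bare number split across chunks could be parsed too early), and it does not distinguish invalid JSON from incomplete JSON.

```python
import json
from typing import Any


class StreamBuffer:
    """Accumulates text chunks and returns each JSON value once it is complete."""

    def __init__(self) -> None:
        self._buffer = ""
        self._decoder = json.JSONDecoder()

    def feed(self, chunk: str) -> list[Any]:
        """Append a chunk and return the JSON values completed by it (possibly none)."""
        self._buffer += chunk
        completed: list[Any] = []
        while True:
            # raw_decode does not skip leading whitespace, so strip it first.
            text = self._buffer.lstrip()
            if not text:
                self._buffer = ""
                break
            try:
                obj, end = self._decoder.raw_decode(text)
            except json.JSONDecodeError:
                # Incomplete so far: keep the text and wait for more chunks.
                self._buffer = text
                break
            completed.append(obj)
            self._buffer = text[end:]
        return completed


if __name__ == "__main__":
    buf = StreamBuffer()
    for piece in ['{"step": 1', '} {"step"', ': 2}']:
        print(buf.feed(piece))
```

A production version would likely cap the buffer size and give up after a timeout, so that malformed input cannot stall the stream indefinitely.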
Exercise 5: Observable Service Module
Build a Python module with comprehensive observability:
- Create a service class with structured logging:
  - Log all method entries and exits
  - Include timing information
  - Use correlation IDs for request tracing
- Add metrics using prometheus_client:
  - Request count by method
  - Latency histograms
  - Error rates
- Implement a context manager for automatic instrumentation
- Write a decorator that adds observability to any function
Acceptance criteria:
- Can reconstruct a full request flow from logs
- Metrics are exposed on /metrics endpoint
- Latency percentiles (p50, p95, p99) are available
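The decorator requirement can be prototyped with the standard library before prometheus_client is wired in. The following is a rough sketch under stated assumptions: the decorator name observed, the logger name "service", and the extra field names are all hypothetical, it handles synchronous functions only, and it generates a fresh correlation ID per call where a real service would propagate one from the incoming request.

```python
import functools
import logging
import time
import uuid
from typing import Any, Callable, TypeVar

F = TypeVar("F", bound=Callable[..., Any])
logger = logging.getLogger("service")  # hypothetical logger name


def observed(func: F) -> F:
    """Log entry, exit, duration, and errors for a synchronous function."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Assumption: a new ID per call; normally this comes from the request context.
        extra = {"correlation_id": uuid.uuid4().hex, "func_name": func.__qualname__}
        logger.info("enter", extra=extra)
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            duration = time.perf_counter() - start
            # logger.exception records the traceback at ERROR level, then we re-raise.
            logger.exception("error", extra={**extra, "duration_s": duration})
            raise
        logger.info("exit", extra={**extra, "duration_s": time.perf_counter() - start})
        return result

    return wrapper  # type: ignore[return-value]


@observed
def summarize(text: str) -> str:
    return text[:20]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    summarize("A long document that needs a short summary.")
```

The fields passed through extra become attributes on each log record, so a JSON formatter can emit them as structured keys. The same wrapper is a natural place to increment a request counter and observe a latency histogram.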
Exercise 6: Docker Development Environment
Create a complete development environment for an AI project:
- Write a
Dockerfilewith:- Multi-stage build for smaller image
- Non-root user for security
- Proper layer caching for dependencies
- Create
docker-compose.ymlwith:- Application service with hot reload
- Redis for caching
- Mock API server for testing
- Include a
Makefilewith common commands:make dev- Start development environmentmake test- Run tests in containermake lint- Run mypy, ruff, black
Acceptance criteria: - Container builds in under 2 minutes (with cache) - Code changes reflect immediately without rebuild - Tests can run in isolation from external services
Self-Assessment Checkpoint
Conceptual Questions
Q1. [IC2] Why does async programming provide significant speedups for LLM API calls but not for local embedding computations?
Answer
LLM API calls are I/O-bound—the CPU waits for network responses. During this wait time, async allows other tasks to execute. With 10 concurrent API calls that each take 1 second, sequential execution takes 10 seconds while async takes ~1 second.
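The effect is easy to reproduce without a real API. In this small sketch, fake_llm_call is a hypothetical stand-in that simulates network latency with asyncio.sleep, and the timings it prints will vary slightly from machine to machine.

```python
import asyncio
import time


async def fake_llm_call(i: int, latency: float = 0.1) -> str:
    """Stand-in for a network-bound API call; sleeping yields control to the event loop."""
    await asyncio.sleep(latency)
    return f"response {i}"


async def run_sequential(n: int) -> list[str]:
    # Each call finishes before the next one starts: total time is about n * latency.
    return [await fake_llm_call(i) for i in range(n)]


async def run_concurrent(n: int) -> list[str]:
    # All calls wait at the same time: total time is about one latency.
    return await asyncio.gather(*(fake_llm_call(i) for i in range(n)))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, seq = timed(run_sequential(10))
    _, con = timed(run_concurrent(10))
    print(f"sequential: {seq:.2f}s, concurrent: {con:.2f}s")
```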
Local embedding computations are CPU-bound: the CPU is actively processing. Async doesn’t help because there’s no waiting; the CPU is always busy. For CPU-bound work, you need multiprocessing (multiple CPU cores) rather than async (concurrency during I/O wait).
Q2. [IC2] When should you use TypedDict versus a Pydantic model?
Answer
TypedDict: Use for typing dictionaries from external sources where you want type checking without runtime validation. Good for API responses you’ve already validated elsewhere, or when you need dict compatibility with existing code.
Pydantic model: Use when you need runtime validation, default values, custom validators, or serialization. Better for data at trust boundaries (user input, config files) and when you want automatic parsing (strings to ints, etc.).
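The "no runtime validation" half of this comparison can be checked with the standard library alone. In the sketch below, Usage and total_tokens are hypothetical names chosen for illustration. A type checker flags the bad assignment, but nothing stops it at runtime, whereas a Pydantic model would validate or coerce the value when the object is constructed.

```python
from typing import TypedDict


class Usage(TypedDict):
    """Hypothetical shape of a token-usage payload."""

    prompt_tokens: int
    completion_tokens: int


def total_tokens(usage: Usage) -> int:
    return usage["prompt_tokens"] + usage["completion_tokens"]


good: Usage = {"prompt_tokens": 12, "completion_tokens": 30}

# mypy rejects this assignment, but at runtime nothing checks the value types.
bad: Usage = {"prompt_tokens": "12", "completion_tokens": 30}  # type: ignore[typeddict-item]

print(type(bad) is dict)   # True: a TypedDict value is a plain dict at runtime
print(total_tokens(good))  # 42

try:
    total_tokens(bad)
except TypeError as exc:
    # The mistake only surfaces later, where the value is actually used.
    print(f"failed late: {exc}")
```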
In practice: External API responses often start as TypedDict for typing, then get converted to Pydantic models when entering your application’s core.
Q3. [Senior] How would you implement graceful degradation for a service that calls multiple LLM providers?
Answer
Implement a multi-provider client with:
- Health checking: Periodically test each provider, track success rates
- Circuit breaker: Open circuit after N consecutive failures, half-open after timeout
- Fallback chain: Primary → Secondary → Cached response → Graceful error
- Request hedging: For latency-sensitive requests, send to multiple providers, use first response
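The circuit breaker item in this list can be sketched as a small state machine. This is an illustrative outline rather than a production implementation: the class name, method names, and defaults are assumptions, it is not thread-safe, and in the half-open state it lets every caller through instead of a single probe request.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; half-opens after `reset_after` seconds."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable so tests can fake time
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half-open"  # timeout elapsed: allow a trial request
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure streak.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            # Open the circuit, or re-open it and restart the timeout after a failed trial.
            self._opened_at = self._clock()
```

With one breaker per provider, a health check reduces to calling allow_request(), and the provider loop calls record_success() or record_failure() after each attempt.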
async def generate(self, prompt: str) -> str:
    # Try each healthy provider in priority order
    for provider in self.healthy_providers():
        try:
            return await provider.generate(prompt)
        except ProviderError:
            self.mark_unhealthy(provider)
    return await self.fallback_response(prompt)  # Cached or default
Q4. [Senior] What are the trade-offs between using poetry versus pip-tools for dependency management?
Answer
Poetry:
- Pros: Single tool for dependency management, virtual env, building, and publishing. Excellent lockfile. Nice CLI. Growing adoption.
- Cons: Can be slow for large dependency trees. Some compatibility issues with editable installs. Adds a layer of abstraction over pip.
pip-tools:
- Pros: Minimal abstraction—just generates requirements.txt from requirements.in. Fast. Works with any pip-compatible tooling.
- Cons: Separate tools needed for virtual env (venv) and building (setuptools/build). More manual workflow.
Summary
Python for AI engineering requires a mindset shift from exploration to reliability without sacrificing the ability to iterate quickly. The key principles:
Type hints are documentation that catches bugs. They don’t slow you down—they prevent hours of debugging later. Start at function boundaries and data structures that cross modules.
Async programming enables concurrency for I/O-bound work. When making many external calls, async provides dramatic speedups. When making single calls or doing CPU-bound work, it adds complexity without benefit.
Pydantic validates data at trust boundaries. External API responses, user input, and configuration files all warrant validation. Internal data structures can use lighter-weight dataclasses.
Error handling must expect failures. External services fail. Categorize errors as retryable or not, implement backoff for retries, and fail fast for non-retryable errors.
Environment discipline enables reproducibility. Virtual environments, locked dependencies, proper secrets management, and Docker containers prevent “works on my machine” problems.
Observability reveals system behavior. Structured logging, request tracing, and comprehensive metrics let you understand what your code actually does in production.
These practices compound. Each one prevents a category of problems, and together they transform AI code from fragile prototypes to reliable systems. The investment in discipline pays dividends every time you debug an issue in minutes instead of hours, every time a code change deploys without incident, every time you sleep through the night while your system handles traffic.
The patterns in this chapter appear throughout AI engineering. Master them, and you’ll spend less time fighting Python and more time building systems that work.
Connections to Other Chapters
The Python patterns in this chapter form the foundation for production AI code throughout the book:
Chapter 4 (Your First LLM Application): Applies these Python patterns to build a complete RAG application with async API calls, Pydantic validation, and proper error handling.
Chapter 14 (Backend Engineering for AI): Extends these patterns to production systems including testing strategies, debugging techniques, and integration patterns.
Chapter 9 (LLM Deployment & Infrastructure): Uses async patterns for high-throughput inference and discusses retry/circuit breaker patterns for reliability.
Chapter 31 (Reliability Engineering): Builds on the observability patterns here to implement comprehensive monitoring, alerting, and incident response.
Tool & Framework Reference (Appendix B): Comprehensive guide to the Python libraries mentioned in this chapter with version recommendations.
Further Reading
- Fluent Python by Luciano Ramalho—Deep coverage of Python’s features and idioms
- Architecture Patterns with Python by Harry Percival and Bob Gregory—Patterns for maintainable Python applications
- Pydantic documentation (docs.pydantic.dev)—Comprehensive guide to data validation
- Real Python’s async guide (realpython.com/async-io-python)—Practical async programming tutorial
- The Twelve-Factor App (12factor.net)—Principles for production applications, including configuration and dependencies
Code Reference
For complete, runnable implementations of all patterns discussed in this chapter, see:
reference/02_python_ai_patterns.md — Full code examples including:
- Type hints and TypedDict for LLM responses
- Pydantic models with validators and configuration loading
- Async patterns: gather, semaphores, timeouts, first-completed
- Context managers for resource management
- HTTP client configuration with httpx
- OpenAI and Anthropic SDK patterns with provider abstraction
- NumPy operations for embeddings
- Pandas for evaluation data manipulation
- JSON extraction from LLM outputs
- Streaming response handling
- Token bucket rate limiter implementation
- Structured logging and request tracing
- Comprehensive error handling with retry logic
- Testing patterns: fixtures, mocks, snapshots, VCR
- Docker and Docker Compose configurations