Chapter 32: Cost Engineering

Keywords

cost optimization, GPU costs, TCO, FinOps, capacity planning, build vs buy, unit economics, ROI

Introduction

In early 2024, an AI startup celebrated a milestone: their LLM-powered product had reached 100,000 daily active users. The celebration was short-lived. Their cloud bill arrived at $847,000 for the month—more than triple their revenue. The CTO’s post-mortem revealed a cascade of cost failures: inference endpoints running at 15% GPU utilization, verbose prompts averaging 8,000 tokens when 2,000 would suffice, no caching despite 60% query similarity, and premium API pricing for workloads that could have used smaller models. The technical architecture was sound; the economic architecture was nonexistent.

This story repeats across the industry. AI systems are extraordinarily expensive compared to traditional software. A single GPU-hour can cost as much as a full day of CPU time. Token costs compound with every user interaction. Training runs can consume more compute in hours than a conventional application uses in years. Yet many teams approach costs reactively: surprised by bills rather than anticipating them, cutting features rather than optimizing systems.

Cost engineering is the discipline of designing AI systems with economic sustainability as a first-class constraint. It is not about minimizing spend—sometimes spending more is precisely right. It is about understanding cost structures deeply enough to make informed tradeoffs, allocating resources to maximize value, and building systems that remain economically viable as they scale.

At Staff and Principal levels, you become accountable not just for systems that work, but for systems that work economically. You must translate GPU utilization metrics into business terms executives understand. You must make build-versus-buy decisions with real numbers rather than intuition. You must design architectures where costs grow sub-linearly with usage rather than exploding at scale.

This chapter provides the theoretical foundations and practical frameworks for AI cost engineering. We begin with why AI is expensive—the fundamental economics that make cost engineering essential. We then develop cost modeling techniques, explore GPU economics, and build decision frameworks for the choices that dominate AI budgets.

The Core Insight: Why AI Cost Engineering is Different

Traditional software cost engineering focuses on compute, storage, and bandwidth—resources that are relatively cheap and well-understood. AI cost engineering introduces fundamentally different challenges:

Compute intensity: A single inference call through a large language model requires more floating-point operations than a traditional web request by factors of millions. GPUs that cost $10,000+ sit idle waiting for the next request.
Marginal costs that compound: Every user interaction consumes tokens, GPU cycles, or API calls. Unlike traditional software where marginal costs approach zero, AI systems have meaningful per-request costs that scale with usage.
Nonlinear scaling: Larger models don’t just cost proportionally more—costs often scale superlinearly with capability. A 70B parameter model costs far more than 10x a 7B model to serve.
Hidden costs everywhere: Fine-tuning data curation, evaluation dataset construction, monitoring for drift, retraining cycles—costs that don’t appear in simple infrastructure budgets but dominate total cost of ownership.
Rapid obsolescence: Models that were state-of-the-art six months ago may be outperformed by models that cost 10x less to run. Cost optimization is a moving target.

These dynamics mean that cost engineering for AI requires different mental models than traditional infrastructure. You must think about cost per token, GPU utilization curves, and inference optimization in ways that don’t apply to conventional systems.

A Mental Model for AI Costs

Mental Model: The 10x cost cliff

Plot API spend and self-hosted TCO on the same chart as request volume grows, and the two lines cross at a brutal inflection point—somewhere in the low hundreds of thousands of requests per month for a typical 7B-13B workload. Below the cliff, managed APIs win on every axis: zero fixed cost, no ops burden, model upgrades for free. Above the cliff, the same API bill is suddenly ~10x what disciplined self-hosting would cost, and the gap widens monotonically. Heuristic: forecast your 12-month volume; if you’ll cross the cliff, start the self-hosting build before you hit it, not after.

Once you’ve mapped where you sit relative to the 10x cost cliff, the rest of cost engineering falls into place. And the API-vs-self-host crossover is only the most visible cliff. AI costs don’t scale linearly anywhere—they have cliffs everywhere, thresholds where costs suddenly jump by 10x or more. Understanding all of them prevents budget surprises:

The context cliff: Going from 4K to 32K context doesn’t cost 8x more—the raw attention FLOPs scale quadratically, so 8x the sequence is ~64x the attention compute. The multipliers in this section are illustrative orders of magnitude, not measured constants. In practice the served cost of long context is far below the naive 64x: FlashAttention and paged KV-cache (see Chapter 27) keep attention memory-bandwidth-bound rather than compute-bound, and prefix/KV caching amortizes shared context across requests. The quadratic term still dominates eventually, but you pay something closer to linear-plus-a-quadratic-tail, not a clean 64x. A document that barely fit in one API call might still need to be chunked, and chunking has its own costs (overlap, multiple calls, synthesis).
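
The "linear-plus-a-quadratic-tail" shape can be made concrete with a small sketch. The function name `relative_context_cost` and the `quadratic_share` parameter are illustrative assumptions rather than measured values: the parameter stands for the fraction of cost at the baseline context length that comes from the quadratic attention term.

```python
def relative_context_cost(
    base_tokens: int,
    new_tokens: int,
    quadratic_share: float,
) -> float:
    """Cost multiplier when context grows from base_tokens to new_tokens.

    quadratic_share is a hypothetical knob: the fraction of cost at the
    baseline length that scales quadratically (attention). The remainder
    is assumed to scale linearly with sequence length.
    """
    if not 0.0 <= quadratic_share <= 1.0:
        raise ValueError("quadratic_share must be between 0 and 1")
    ratio = new_tokens / base_tokens
    linear_part = (1.0 - quadratic_share) * ratio
    quadratic_part = quadratic_share * ratio ** 2
    return linear_part + quadratic_part


# 4K -> 32K context under three assumed cost mixes
naive = relative_context_cost(4_000, 32_000, quadratic_share=1.0)   # 64x
linear = relative_context_cost(4_000, 32_000, quadratic_share=0.0)  # 8x
mixed = relative_context_cost(4_000, 32_000, quadratic_share=0.1)   # about 13.6x
```

Even a 10% quadratic share at the baseline length pushes the multiplier well past 8x, which is why the tail matters as context keeps growing.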

The quality cliff: The gap between “good enough” and “excellent” is often 10x cost. A task that Haiku handles 80% correctly might need Opus for 95%. At the list prices quoted later in this chapter, that is about 5x the per-token price for the last 15 percentage points of quality. Sometimes it’s worth it. Often it isn’t.

The latency cliff: Reducing P99 latency from 2s to 200ms might require 10x more GPU capacity. Tail latencies are expensive because they require provisioning for worst-case, not average-case.

The freshness cliff: Real-time features cost exponentially more than daily batch features. Going from daily to hourly might be 3x. Hourly to minute-level might be 10x. Minute to second might be 50x. (As with the other cliffs, treat these multipliers as illustrative magnitudes, not benchmarked figures—the actual factor depends on your infrastructure.) Each step requires more infrastructure, more monitoring, more on-call burden.

The reliability cliff: Going from 99% to 99.9% availability requires roughly 10x investment. Going from 99.9% to 99.99% requires another 10x. Each “nine” costs an order of magnitude.
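
One way to see why each nine is so expensive is to look at the downtime budget it leaves, which shrinks by 10x per nine. The sketch below only computes that budget; the function name is illustrative, and it says nothing about what the engineering to meet the budget costs.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for an availability target."""
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


budgets = {
    target: allowed_downtime_minutes(target)
    for target in (0.99, 0.999, 0.9999)
}
# 0.99   -> about 5,256 minutes (roughly 3.65 days)
# 0.999  -> about 526 minutes (under 9 hours)
# 0.9999 -> about 53 minutes
```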

The strategic implication: know where your cliffs are and stay on the cheap side unless you have a compelling reason to jump. Many teams accidentally fall off cliffs by adding features without understanding cost implications. A product manager asks for “just a bit longer context” and suddenly infrastructure costs triple. Before any architectural decision, ask: “Are we near a cliff? What happens if requirements push us over?”

Think of an AI system as a factory with unusual economics. Traditional software factories have high fixed costs (development, infrastructure) but near-zero marginal costs—serving the millionth user costs almost nothing more than the first. AI factories have high fixed costs and significant marginal costs—every inference consumes real resources.

This creates a different optimization landscape:

Traditional software: Minimize fixed costs, scale efficiently (marginal costs negligible)
AI systems: Balance fixed costs (infrastructure, development) against marginal costs (compute per request, tokens per interaction)

The optimal operating point depends on scale. At low volume, using external APIs with high per-unit cost but zero fixed cost dominates. At high volume, building infrastructure with high fixed costs but low marginal costs wins. Finding the crossover point—and anticipating how quickly you’ll reach it—is a core cost engineering skill.
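
The crossover described above can be estimated with a few lines of arithmetic. This is a minimal sketch under simplifying assumptions: self-hosting is modeled as a flat monthly fixed cost plus a constant marginal cost per request, and all dollar inputs in the example are hypothetical.

```python
import math


def breakeven_requests_per_month(
    selfhost_fixed_monthly: float,
    selfhost_cost_per_request: float,
    api_cost_per_request: float,
) -> float:
    """Monthly request volume at which self-hosting and API spend are equal.

    Returns math.inf when self-hosting never catches up, i.e. its marginal
    cost per request is not lower than the API's.
    """
    saving_per_request = api_cost_per_request - selfhost_cost_per_request
    if saving_per_request <= 0:
        return math.inf
    return selfhost_fixed_monthly / saving_per_request


# Hypothetical inputs: $6,000/month fixed (GPUs plus ops time),
# $0.002 marginal cost per self-hosted request, $0.025 per API request.
crossover = breakeven_requests_per_month(6_000, 0.002, 0.025)
# roughly 261,000 requests per month

never = breakeven_requests_per_month(6_000, 0.030, 0.025)
# math.inf: the API is cheaper at every volume
```

Comparing the crossover against a 12-month volume forecast gives a rough date by which a self-hosting build would need to be ready.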

What You’ll Learn

The fundamental economics of GPU compute and why utilization matters so much
Total cost of ownership analysis for AI systems
Token cost calculations and optimization strategies
Cost attribution methods that create accountability
Build-versus-buy decision frameworks with real numbers
Self-hosting breakeven analysis
Strategies for building cost-aware engineering culture

Prerequisites

System design fundamentals (Chapter 25)
Performance engineering concepts (Chapter 27)
Basic understanding of GPU architectures and inference pipelines

The Economics of AI Compute

Why GPUs Are Expensive—and Why Utilization Matters

To understand AI costs, you must understand GPU economics. A single NVIDIA H100 GPU costs approximately $30,000 to purchase. Cloud providers charge $3-5 per hour for access. Why so expensive?

The hardware reality: GPUs contain thousands of specialized cores optimized for the matrix multiplications that dominate neural network computation. An H100 delivers up to 1,979 teraflops of FP16 Tensor Core compute with sparsity (about half that for dense math), roughly 100x the throughput of a high-end CPU for AI workloads. This specialized capability commands a premium.

The supply constraint: GPU manufacturing requires cutting-edge fabrication facilities that cost tens of billions of dollars to build. TSMC’s advanced processes are capacity-constrained, limiting the number of high-end chips that can be produced. Demand from AI training and inference consistently exceeds supply.

The utilization imperative: Because GPUs are expensive, utilization is everything. Consider the math:

H100 cloud cost: $4.00/hour
If utilized at 100%: $4.00/effective-GPU-hour
If utilized at 25%: $16.00/effective-GPU-hour
If utilized at 10%: $40.00/effective-GPU-hour

A GPU running at 10% utilization costs four times as much per unit of useful compute as one running at 40%. Yet production systems routinely run at 20-30% utilization because:

Request variability: Traffic spikes require headroom
Batch accumulation delays: Waiting for batches to fill adds latency
Memory fragmentation: Different request sizes prevent efficient packing
Cold start overhead: Model loading and warmup consume cycles

Research by Patterson et al. (2022) at Google found that typical ML infrastructure achieves 30-50% GPU utilization in production—meaning half or more of GPU spend is wasted on idle cycles.

The GPU Utilization Curve

Effective GPU cost does not scale linearly with utilization: the cost of each unit of useful work is inversely proportional to it, and pushing utilization higher trades against latency and throughput headroom.

The shape of this curve explains why the difference between 20% and 40% utilization matters far more than the difference between 70% and 90%. The first step halves the cost of useful work; the second saves only about 22%.
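
The utilization arithmetic in this section can be reproduced with a short sketch. The function names are illustrative; the $4.00/hour rate is the H100 figure used above.

```python
def effective_gpu_hour_cost(hourly_rate: float, utilization: float) -> float:
    """Cost per hour of useful GPU work at a given utilization in (0, 1]."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


def saving_from_utilization_gain(before: float, after: float) -> float:
    """Fractional cost reduction per unit of useful work."""
    # The hourly rate cancels out, so the saving depends only on the ratio.
    return 1.0 - before / after


rate = 4.00  # $/hour
at_10_percent = effective_gpu_hour_cost(rate, 0.10)  # about $40 per useful hour
at_40_percent = effective_gpu_hour_cost(rate, 0.40)  # about $10 per useful hour
first_step = saving_from_utilization_gain(0.20, 0.40)   # 0.5
second_step = saving_from_utilization_gain(0.70, 0.90)  # about 0.22
```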

Practical implication: The first priority in GPU cost optimization is always moving from low utilization to moderate utilization. This provides the highest return on engineering investment.

Common Mistake: Ignoring Idle GPU Costs

What people do: Focus on optimizing inference latency and throughput while leaving GPUs running 24/7, regardless of actual demand.

Why it fails: GPU costs are dominated by idle time, not inference time. If your traffic is bursty, as it is for most applications, GPUs sitting idle during low-traffic hours can represent 70%+ of total compute spend. A $30K/month GPU bill might include more than $20K of idle time.

Fix: Implement auto-scaling based on queue depth, not just CPU metrics. Use spot/preemptible instances for batch workloads. Consider serverless inference for low-volume or bursty traffic (for example, Google Cloud Run, or AWS Lambda for models small enough to run on CPU, since Lambda offers no GPUs). Track and alert on utilization: if it’s consistently below 40%, you’re likely over-provisioned.

Common Mistake: No Cost Caps on LLM Calls

What people do: Call LLM APIs without max_tokens limits or budget caps, trusting that normal usage won’t be expensive.

Why it fails: One bug can cause a cost explosion. A retry loop without backoff. A prompt that triggers verbose output. An agent that loops forever. We’ve seen teams wake up to $50K+ surprise bills from overnight batch jobs or runaway agents. API providers are happy to let you spend.

Fix: Always set max_tokens to the actual maximum you need. Implement per-request, per-user, and per-hour cost limits. Add circuit breakers that stop LLM calls if spend exceeds thresholds. Log and alert on cost anomalies (e.g., any hour whose spend is 3x the previous day’s hourly average).
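
A spend circuit breaker and an anomaly check can be sketched in a few lines. This is a minimal, single-process illustration, not a production design: the class and function names are hypothetical, and timestamps are passed in explicitly (as seconds) to keep the behavior deterministic.

```python
from collections import deque


class SpendCircuitBreaker:
    """Blocks further LLM calls once spend inside a rolling window hits a cap."""

    def __init__(self, hourly_cap_usd: float, window_seconds: int = 3600):
        self.hourly_cap_usd = hourly_cap_usd
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp_seconds, cost_usd) pairs, oldest first

    def _evict_old(self, now: float) -> None:
        # Drop spend records that have aged out of the rolling window
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def spend_in_window(self, now: float) -> float:
        self._evict_old(now)
        return sum(cost for _, cost in self._events)

    def allow(self, now: float, estimated_cost_usd: float) -> bool:
        """True if the call fits under the cap; check this before calling."""
        return self.spend_in_window(now) + estimated_cost_usd <= self.hourly_cap_usd

    def record(self, now: float, actual_cost_usd: float) -> None:
        self._events.append((now, actual_cost_usd))


def is_cost_anomaly(hour_spend: float, prev_day_hourly_avg: float,
                    factor: float = 3.0) -> bool:
    """Flag an hour whose spend exceeds `factor` times yesterday's hourly average."""
    return hour_spend > factor * prev_day_hourly_avg


breaker = SpendCircuitBreaker(hourly_cap_usd=50.0)
breaker.record(now=0, actual_cost_usd=49.0)
blocked = breaker.allow(now=10, estimated_cost_usd=2.0)    # False: would exceed cap
allowed = breaker.allow(now=3700, estimated_cost_usd=2.0)  # True: old spend aged out
```

A real deployment would keep this state in a shared store so that every worker sees the same running total.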

Token Economics

For LLM-based systems, costs are denominated in tokens. Understanding token economics is essential for cost modeling.

What determines token cost?

Token cost reflects the compute required for inference, which depends on:

Model size: Larger models require more parameters to load and more operations per token
Attention complexity: Self-attention scales quadratically with sequence length
Input vs. output tokens: Output tokens are generated one at a time, each requiring its own forward pass; input tokens can be processed in parallel in a single pass
Hardware efficiency: Newer GPUs provide more tokens per dollar

The API pricing landscape (as of early 2026):

Model	Input (per 1M tokens)	Output (per 1M tokens)	Relative Cost (by output price)
GPT-5.5	$5.00	$30.00	1.0x
Claude Opus 4.8	$5.00	$25.00	0.83x
GPT-5.4	$2.50	$15.00	0.5x
Claude Sonnet 4.6	$3.00	$15.00	0.5x
Claude Haiku 4.5	$1.00	$5.00	0.17x
Llama 4 Maverick (API)	$0.22	$0.85	0.03x

These prices encode profound economics. The ~35x gap in output price between GPT-5.5 and Llama 4 Maverick (about 23x on input) reflects the premium for proprietary frontier models versus open-weight alternatives. Be careful, though: list API prices are strategic, margin-driven numbers, not a direct proxy for the underlying cost of compute. Providers price to capture value, segment customers, and fund R&D, so a 35x list-price gap does not mean a 35x difference in GPU cost to serve. The honest cost-of-compute comparison is the self-hosting analysis later in this chapter, which prices the GPU-seconds directly; API prices tell you what you’ll pay, not what the inference costs to produce. The 5-6x ratio between output and input prices for the proprietary models, by contrast, does track a real computational asymmetry of autoregressive generation.

The token cost calculation:

def calculate_request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float
) -> float:
    """Calculate cost for a single LLM request."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_million
    output_cost = (output_tokens / 1_000_000) * output_price_per_million
    return input_cost + output_cost

# Example: GPT-5.5 request with 2000 input, 500 output tokens
cost = calculate_request_cost(2000, 500, 5.00, 30.00)
# = (2000/1M * $5.00) + (500/1M * $30.00)
# = $0.010 + $0.015 = $0.025 per request

At ~$0.025 per request, 1 million daily requests costs $25,000/day—$750,000/month. This is the marginal cost that traditional software doesn’t have—and it’s why cost engineering matters.

The $2M Token Bill

Setup: A fintech startup ran nightly batch jobs to analyze customer transactions using GPT-5.4. The jobs had worked flawlessly for months, processing about 50,000 transactions per night at a cost of roughly $500.

What happened: On a Thursday night, an engineer deployed a “minor” change to improve analysis quality—they removed the max_tokens parameter to “let the model fully express its reasoning.” The batch job started at 11 PM. By Friday morning, the cloud bill had reached $2.1 million.

Without output limits, the model generated verbose analyses averaging 15,000 tokens per transaction instead of the usual 500. The job ran for 8 hours, burning through 750 million output tokens at $15 per million. The engineer’s phone buzzed with a fraud alert from their credit card company before anyone noticed the API costs.

Why it happened (root cause): No programmatic cost caps. The API would happily accept unlimited requests at unlimited output lengths. The team assumed “max_tokens was just for speed” and didn’t realize it was their primary cost control.

Fix:

Implemented hard per-request cost limits in a gateway layer
Added hourly cost monitoring with automatic shutoff at 120% of expected spend
Required max_tokens in all API calls via a wrapper library
Created nightly budget alerts with escalation to on-call

Takeaway: Every API call needs a cost ceiling. Hope is not a budget strategy.

Unbounded Calls

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def analyze_document(doc: str) -> str:
    # No meaningful limits - cost grows with document size and model verbosity
    response = client.messages.create(
        model="claude-opus-4-8",
        # The Messages API requires max_tokens, so the unbounded pattern
        # shows up as a ceiling set far above anything the task needs
        max_tokens=16_000,
        messages=[{
            "role": "user",
            "content": f"Analyze this document thoroughly:\n\n{doc}"
        }]
        # Long doc = expensive input tokens
        # "thoroughly" = model writes essays
    )
    return response.content[0].text

Cost-Capped Generation

class CostAwareLLM:
    def __init__(self, budget_per_request: float = 0.10):
        self.budget = budget_per_request
        self.pricing = ModelPricing()  # Current rates
        self.daily_spend = DailySpendTracker()

    async def analyze_document(
        self,
        doc: str,
        user_id: str,
        priority: str = "normal"
    ) -> AnalysisResponse:
        # Check daily budget before processing
        if self.daily_spend.remaining(user_id) <= 0:
            return AnalysisResponse(
                error="daily_budget_exhausted",
                fallback=self.get_cached_or_summary(doc)
            )

        # Calculate max affordable tokens
        input_tokens = count_tokens(doc)
        input_cost = self.pricing.input_cost(input_tokens)
        remaining_budget = self.budget - input_cost

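        if remaining_budget <= 0:
            # Document too long for budget - truncate or chunk
            doc = self.smart_truncate(doc, target_tokens=5000)
            input_tokens = count_tokens(doc)
            input_cost = self.pricing.input_cost(input_tokens)
            remaining_budget = self.budget - input_cost

        if remaining_budget <= 0:
            # Still over budget after truncation: return a fallback rather
            # than send a request with a zero or negative output allowance
            return AnalysisResponse(
                error="request_budget_exceeded",
                fallback=self.get_cached_or_summary(doc)
            )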

        max_output_tokens = self.pricing.tokens_for_budget(
            remaining_budget, output=True
        )
        max_output_tokens = min(max_output_tokens, 2000)  # Hard cap

        # Select model based on priority and budget
        model = self.select_model(priority, remaining_budget)

        # client is assumed to be an anthropic.AsyncAnthropic() instance here
        response = await client.messages.create(
            model=model,
            max_tokens=max_output_tokens,  # Always set!
            messages=[{
                "role": "user",
                "content": f"""Analyze this document concisely.
Focus on: key findings, action items, risks.
Be direct - no filler.

<document>
{doc}
</document>"""
            }]
        )

        # Track actual spend
        actual_cost = self.pricing.calculate_cost(
            response.usage.input_tokens,
            response.usage.output_tokens,
            model
        )
        self.daily_spend.record(user_id, actual_cost)

        return AnalysisResponse(
            content=response.content[0].text,
            cost=actual_cost,
            model_used=model,
            tokens_used=response.usage.input_tokens + response.usage.output_tokens
        )

    def select_model(self, priority: str, budget: float) -> str:
        if priority == "high" and budget > 0.05:
            return "claude-opus-4-8"
        elif budget > 0.02:
            return "claude-sonnet-4-6"
        else:
            return "claude-haiku-4-5-20251001"  # Always affordable

Key differences: Per-request budget caps, input size management, calculated max_tokens based on remaining budget, model selection by cost tier, daily spend tracking per user, and concise prompt design. The $2M incident becomes impossible.

The Inference Efficiency Frontier

Different inference optimization techniques trade off along multiple dimensions, chiefly cost and latency.

The efficient frontier represents the best achievable combinations of cost and latency. Points above the frontier (more cost or more latency than necessary) are achievable but wasteful; points below it require engineering innovation or accepting worse tradeoffs elsewhere.

Key techniques on the frontier:

Continuous batching: Process requests as they arrive rather than waiting for fixed batches. Reduces latency while maintaining throughput. 30-50% latency reduction typical.
Speculative decoding: Use a small model to draft tokens, verify with the large model. Reduces large model calls by 2-3x for standard text.
Quantization: Reduce precision from FP16 to INT8 or INT4. Cuts memory and compute by 2-4x with 1-3% quality degradation.
KV cache optimization: Reuse computed attention states for prefix-sharing requests. Enables prompt caching with 50-80% input cost reduction.
Model distillation: Train smaller models on larger model outputs. Can achieve 90% quality at 10% cost for domain-specific tasks.
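
The effect of prompt caching on input cost can be estimated with a simple blended-price calculation. This is a sketch under stated assumptions: `cached_price_multiplier` is a hypothetical parameter (what a cached token costs relative to an uncached one), and any premium charged for writing to the cache is ignored.

```python
def input_cost_with_caching(
    input_tokens: int,
    price_per_million: float,
    cached_fraction: float,
    cached_price_multiplier: float,
) -> float:
    """Input cost in dollars when part of the prompt hits a prefix cache.

    cached_fraction: share of input tokens served from the cache (0 to 1).
    cached_price_multiplier: cached-token price relative to the normal
    input price (assumed; providers and self-hosted stacks differ).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_tokens = input_tokens * cached_fraction
    uncached_tokens = input_tokens - cached_tokens
    billable_tokens = uncached_tokens + cached_tokens * cached_price_multiplier
    return billable_tokens / 1_000_000 * price_per_million


# Hypothetical: 8,000-token prompt at $5.00 per 1M input tokens, with a
# shared 6,000-token prefix cached and cached tokens billed at 10%.
without_cache = input_cost_with_caching(8_000, 5.00, 0.0, 0.1)
with_cache = input_cost_with_caching(8_000, 5.00, 0.75, 0.1)
reduction = 1 - with_cache / without_cache  # about 0.675
```

Under these assumed inputs the reduction lands inside the 50-80% range quoted above; the result is driven almost entirely by how much of the prompt is a shared prefix.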

Full implementation: See reference/26b_cost_engineering_code.md for implementation details of each technique.

Total Cost of Ownership Analysis

The TCO Framework

Total Cost of Ownership (TCO) captures all costs associated with an AI system over its lifetime. TCO analysis prevents the common failure of optimizing visible costs while ignoring larger hidden costs.

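TCO components fall into three groups: direct costs (training and inference compute, storage, external APIs, data processing), indirect costs (engineering time for development, operations, and maintenance), and hidden costs (compliance and security, monitoring and observability, disaster recovery).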

Why indirect costs matter:

A 2023 study by Andreessen Horowitz found that for enterprise AI systems, engineering time represents 40-60% of total costs, often exceeding infrastructure spend. A system that costs $50,000/month in compute but requires a full-time engineer ($15,000/month fully loaded) to maintain carries a hidden cost equal to 30% of its compute bill. Staffing needs can flip a comparison: a system that costs $100,000/month but runs autonomously is cheaper overall than a $50,000/month system that needs four such engineers ($110,000/month in total).

Staff Engineer Perspective: The Art of Vendor Negotiation

“I’ve negotiated LLM API contracts with every major provider. The list prices are the starting point, not the answer.

At scale, everything is negotiable: per-token rates, minimum commits, reserved capacity, support tiers, even payment terms. The key is leverage: either you have enough volume that they want your business, or you have a credible alternative (open-weight models, self-hosting, competitors).

My playbook: (1) Get usage data—monthly spend, growth trajectory, use case breakdown. (2) Identify your alternative—even if you don’t want to use it, knowing you could gives you leverage. (3) Start conversations 3 months before renewals. (4) Ask for multi-year discounts; providers love predictable revenue. (5) Get the enterprise sales team involved; they have flexibility that developer relations doesn’t.

The discount range varies, but at $10K+/month spend, you should be getting at least 15-20% off list price. At $100K+/month, 30-40% is common. If you’re not asking, you’re leaving money on the table.”

—Principal Engineer, formerly Head of AI Platform

Calculating TCO

from dataclasses import dataclass

@dataclass
class TCOAnalysis:
    """Complete TCO analysis for an AI system."""

    # Direct costs (monthly)
    compute_training: float
    compute_inference: float
    storage: float
    external_apis: float
    data_processing: float

    # Indirect costs (monthly, derived from annual rates)
    engineering_development: float  # FTE allocation * fully-loaded cost
    engineering_operations: float
    engineering_maintenance: float

    # Hidden costs (often forgotten)
    compliance_security: float  # ~5-10% of direct costs
    monitoring_observability: float  # ~3-5% of direct costs
    disaster_recovery: float  # ~2-5% of direct costs

    def monthly_direct(self) -> float:
        return (self.compute_training + self.compute_inference +
                self.storage + self.external_apis + self.data_processing)

    def monthly_indirect(self) -> float:
        return (self.engineering_development + self.engineering_operations +
                self.engineering_maintenance)

    def monthly_hidden(self) -> float:
        return (self.compliance_security + self.monitoring_observability +
                self.disaster_recovery)

    def monthly_tco(self) -> float:
        return self.monthly_direct() + self.monthly_indirect() + self.monthly_hidden()

    def annual_tco(self) -> float:
        return self.monthly_tco() * 12

Example: Recommendation System TCO

Consider a mid-scale recommendation system serving 10 million requests per day:

Category	Monthly Cost	% of TCO
Direct Costs
Inference compute (4x A100 instances)	$28,000	35%
Training compute (periodic retraining)	$8,000	10%
Feature store (1TB)	$3,000	3.8%
Embedding API calls	$5,000	6.3%
Data processing pipeline	$4,000	5%
Indirect Costs
ML Engineer (0.5 FTE)	$12,500	15.6%
Platform Engineer (0.3 FTE)	$7,500	9.4%
Data Engineer (0.2 FTE)	$5,000	6.3%
Hidden Costs
Monitoring & observability	$3,000	3.8%
Security & compliance	$2,500	3.1%
Backup & DR	$1,500	1.9%
Total Monthly TCO	$80,000	100%

This analysis reveals that engineering time (31% of TCO) rivals compute costs (45%). Optimizing only the cloud bill misses nearly a third of the cost structure.
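
The shares quoted above can be checked directly from the table. The sketch below is a plain-dictionary version of that arithmetic, with key names chosen for illustration; it does not use the TCOAnalysis class.

```python
# Monthly figures copied from the recommendation-system table
monthly_costs = {
    "direct": {
        "inference_compute": 28_000,
        "training_compute": 8_000,
        "feature_store": 3_000,
        "embedding_api": 5_000,
        "data_pipeline": 4_000,
    },
    "indirect": {
        "ml_engineer": 12_500,
        "platform_engineer": 7_500,
        "data_engineer": 5_000,
    },
    "hidden": {
        "monitoring": 3_000,
        "security_compliance": 2_500,
        "backup_dr": 1_500,
    },
}

category_totals = {
    name: sum(items.values()) for name, items in monthly_costs.items()
}
total_tco = sum(category_totals.values())
shares = {name: amount / total_tco for name, amount in category_totals.items()}

# Compute alone (inference plus training) as a share of TCO
compute_share = (
    monthly_costs["direct"]["inference_compute"]
    + monthly_costs["direct"]["training_compute"]
) / total_tco
```

Here `shares["indirect"]` is the engineering-time share and `compute_share` is the compute share discussed in the text.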

Full implementation: See reference/26b_cost_engineering_code.md for the complete TCOAnalysis class with scenario modeling.

Unit Economics

TCO tells you total costs; unit economics tells you whether your business model works. Unit economics translates aggregate costs into per-unit metrics that can be compared against revenue or value generated.

Key unit economics metrics:

@dataclass
class UnitEconomics:
    """Unit economics for an AI service."""

    total_monthly_cost: float
    monthly_requests: int
    monthly_active_users: int
    monthly_revenue: float

    @property
    def cost_per_request(self) -> float:
        """Cost to serve one request."""
        return self.total_monthly_cost / max(1, self.monthly_requests)

    @property
    def cost_per_user(self) -> float:
        """Cost to serve one active user per month."""
        return self.total_monthly_cost / max(1, self.monthly_active_users)

    @property
    def gross_margin(self) -> float:
        """(Revenue - Cost) / Revenue."""
        if self.monthly_revenue <= 0:
            return float('-inf')
        return (self.monthly_revenue - self.total_monthly_cost) / self.monthly_revenue

    @property
    def ai_cost_ratio(self) -> float:
        """AI costs as percentage of revenue."""
        if self.monthly_revenue <= 0:
            return float('inf')
        return self.total_monthly_cost / self.monthly_revenue

Healthy unit economics benchmarks:

Metric	Concerning	Acceptable	Good
AI cost ratio	>50%	20-50%	<20%
Cost per request	>$0.10	$0.01-$0.10	<$0.01
Gross margin	<30%	30-60%	>60%

The unit economics trap:

Many AI startups fall into a trap: their unit economics worsen as they scale. This happens when:

Usage grows faster than revenue: Free tiers or flat-rate pricing with increasing usage
Costs scale linearly while value doesn’t: Each request costs the same, but marginal value to users decreases
No cost optimization investment: Early-stage focus on features over efficiency

The solution is modeling unit economics at scale from day one:

def project_unit_economics(
    current: UnitEconomics,
    growth_rate: float,  # Monthly user growth
    months: int,
    cost_efficiency_gain: float = 0.02  # Monthly efficiency improvement
) -> list[UnitEconomics]:
    """Project unit economics over time with growth and optimization."""
    projections = [current]

    for _ in range(months):
        prev = projections[-1]

        # Users grow, costs grow (but more slowly with optimization)
        new_users = int(prev.monthly_active_users * (1 + growth_rate))
        new_requests = int(prev.monthly_requests * (1 + growth_rate * 1.2))  # Power users

        # Costs track request volume, discounted by the monthly efficiency gain
        cost_scale = (new_requests / max(1, prev.monthly_requests)) * (1 - cost_efficiency_gain)
        new_costs = prev.total_monthly_cost * cost_scale

        # Revenue assumed to scale with users
        new_revenue = prev.monthly_revenue * (new_users / max(1, prev.monthly_active_users))

        projections.append(UnitEconomics(
            total_monthly_cost=new_costs,
            monthly_requests=new_requests,
            monthly_active_users=new_users,
            monthly_revenue=new_revenue
        ))

    return projections
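
The trap shows up clearly in a closed-form version of the same projection. This sketch follows the assumptions of the function above (requests grow 1.2x as fast as users, revenue follows users) but skips integer rounding; the function names and example inputs are illustrative.

```python
def cost_ratio_after(
    months: int,
    starting_cost_ratio: float,
    user_growth: float,
    efficiency_gain: float,
    request_multiplier: float = 1.2,
) -> float:
    """AI cost as a fraction of revenue after `months` of compounding."""
    # Costs follow requests, less the monthly efficiency gain
    cost_factor = (1 + user_growth * request_multiplier) * (1 - efficiency_gain)
    # Revenue follows users
    revenue_factor = 1 + user_growth
    return starting_cost_ratio * (cost_factor / revenue_factor) ** months


def breakeven_efficiency_gain(
    user_growth: float, request_multiplier: float = 1.2
) -> float:
    """Monthly efficiency gain needed just to keep the cost ratio flat."""
    return 1 - (1 + user_growth) / (1 + user_growth * request_multiplier)


# Hypothetical: cost ratio starts at 30%, users grow 20% per month
after_year = cost_ratio_after(12, 0.30, user_growth=0.20, efficiency_gain=0.02)
needed_gain = breakeven_efficiency_gain(0.20)  # about 0.032
held_flat = cost_ratio_after(12, 0.30, user_growth=0.20,
                             efficiency_gain=needed_gain)
```

With these inputs a 2% monthly efficiency gain is not enough: the cost ratio drifts upward over the year, and roughly 3.2% per month is needed just to hold it steady.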

Cost Attribution and Accountability

Why Attribution Matters

You cannot optimize what you cannot measure—and you cannot measure what you cannot attribute. Cost attribution assigns costs to the teams, projects, and features that incur them, creating visibility and accountability.

Without attribution: A single cloud bill shows $200,000/month. No one knows which team is responsible for what. Cost optimization is nobody’s job because it’s everybody’s job.

With attribution: Team A’s recommendation service costs $80,000/month, Team B’s search service costs $50,000/month, and shared infrastructure costs $70,000. Each team can optimize their portion; leadership can evaluate ROI by team.

Attribution Models

Different costs require different attribution approaches:

1. Direct attribution: Costs clearly tied to a specific owner

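# Example: API costs by team
from dataclasses import dataclass, field

@dataclass
class CostItem:
    """One billed line item; a minimal shape assumed by the rules in this section."""
    amount: float
    category: str = ''
    metadata: dict = field(default_factory=dict)

@dataclass
class DirectAttributionRule: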
    """Attribute costs directly to the owner in metadata."""

    def apply(self, cost_item: CostItem) -> dict[str, float]:
        owner = cost_item.metadata.get('team') or cost_item.metadata.get('project')
        if owner:
            return {owner: cost_item.amount}
        return {'unattributed': cost_item.amount}

Direct attribution works for:

Team-specific compute instances
Project-tagged API calls
Dedicated storage buckets

2. Proportional attribution: Shared costs split by usage

@dataclass
class ProportionalAttributionRule:
    """Split shared costs by usage proportion."""

    usage_metric: str  # e.g., 'requests', 'compute_seconds', 'tokens'

    def apply(self, cost_item: CostItem, usage_by_team: dict[str, float]) -> dict[str, float]:
        total_usage = sum(usage_by_team.values())
        if total_usage == 0:
            return {'unattributed': cost_item.amount}

        return {
            team: cost_item.amount * (usage / total_usage)
            for team, usage in usage_by_team.items()
        }

Proportional attribution works for:

Shared GPU clusters (split by GPU-hours consumed)
Shared vector databases (split by queries or storage)
Platform infrastructure (split by requests handled)
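
As a worked example of a proportional split, the sketch below divides a $70,000 shared infrastructure bill by GPU-hours consumed. It is a standalone function rather than the rule class above, and the team names and GPU-hour figures are hypothetical.

```python
def split_shared_cost(
    amount: float, usage_by_team: dict[str, float]
) -> dict[str, float]:
    """Split a shared cost in proportion to each team's usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        # Nothing to split on, so leave the cost visibly unowned
        return {"unattributed": amount}
    return {
        team: amount * usage / total_usage
        for team, usage in usage_by_team.items()
    }


# Hypothetical GPU-hours for one month on a shared cluster
gpu_hours = {"recommendations": 4_200, "search": 2_100, "experiments": 700}
allocation = split_shared_cost(70_000, gpu_hours)
# recommendations: 42,000   search: 21,000   experiments: 7,000
```

The allocations always sum back to the original bill, which makes the split easy to reconcile against the invoice.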

3. Activity-based attribution: Costs assigned based on specific activities

from collections import defaultdict

@dataclass
class ActivityBasedRule:
    """Attribute based on specific activities that drive costs."""

    activity_costs: dict[str, float]  # Cost per activity type

    def apply(self, activities: list[Activity]) -> dict[str, float]:
        attribution = defaultdict(float)

        for activity in activities:
            cost = self.activity_costs.get(activity.type, 0)
            attribution[activity.team] += cost * activity.count

        return dict(attribution)

Activity-based attribution works for:

Training jobs (cost per GPU-hour by job owner)
Experiment runs (cost per experiment by researcher)
Model deployments (cost per deployment by team)

Implementing Cost Attribution

A production cost attribution system requires:

1. Tagging discipline: All resources tagged with owner, project, environment

REQUIRED_TAGS = ['team', 'project', 'environment', 'cost_center']

def validate_resource_tags(resource: Resource) -> list[str]:
    """Return list of missing required tags."""
    return [tag for tag in REQUIRED_TAGS if tag not in resource.tags]

def enforce_tagging_policy(resource: Resource) -> bool:
    """Block resource creation if tags are missing."""
    missing = validate_resource_tags(resource)
    if missing:
        raise PolicyViolation(f"Resource missing required tags: {missing}")
    return True

2. Usage tracking: Collect usage metrics that drive costs

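from collections import defaultdict
from datetime import datetime, timezone

class UsageTracker:
    """Track usage metrics for cost attribution."""

    def __init__(self):
        self.metrics: list[dict] = []

    def record_request(self, team: str, service: str, tokens: int, gpu_ms: int):
        """Record a request with its resource consumption."""
        self.metrics.append({
            'timestamp': datetime.now(timezone.utc),  # timezone-aware UTC
            'team': team,
            'service': service,
            'tokens': tokens,
            'gpu_ms': gpu_ms
        })

    def aggregate_by_team(self, start: datetime, end: datetime) -> dict[str, dict]:
        """Aggregate usage by team for [start, end); both must be timezone-aware."""
        totals = defaultdict(lambda: {'tokens': 0, 'gpu_ms': 0})
        for m in self.metrics:
            if start <= m['timestamp'] < end:
                totals[m['team']]['tokens'] += m['tokens']
                totals[m['team']]['gpu_ms'] += m['gpu_ms']
        # Returns: {'team_a': {'tokens': 1000000, 'gpu_ms': 50000}, ...}
        return dict(totals)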

3. Attribution engine: Apply rules to calculate allocations

class CostAttributionEngine:
    """Engine for attributing costs to owners."""

    def __init__(self, rules: list[AttributionRule]):
        self.rules = rules

    def calculate_attribution(
        self,
        costs: list[CostItem],
        usage: dict[str, dict]
    ) -> dict[str, float]:
        """Calculate cost attribution by team/project."""
        attribution = defaultdict(float)

        for cost in costs:
            rule = self._get_rule(cost.category)
            team_costs = rule.apply(cost, usage)
            for team, amount in team_costs.items():
                attribution[team] += amount

        return dict(attribution)

Full implementation: See reference/26b_cost_engineering_code.md for the complete attribution system.

Creating Accountability

Attribution data only matters if it drives behavior. Effective accountability requires:

1. Visibility: Teams can see their costs in real-time

class TeamCostDashboard:
    """Real-time cost visibility for engineering teams."""

    def get_team_summary(self, team: str) -> dict:
        return {
            'current_month': {
                'total_cost': self.get_mtd_cost(team),
                'budget': self.get_budget(team),
                'burn_rate': self.get_daily_average(team) * 30,
                'forecast': self.get_month_end_forecast(team)
            },
            'by_service': self.get_cost_by_service(team),
            'trends': self.get_weekly_trends(team),
            'top_cost_drivers': self.get_top_costs(team, n=5),
            'optimization_opportunities': self.get_recommendations(team)
        }

2. Budgets: Teams have cost targets they’re accountable for

@dataclass
class CostBudget:
    """Team cost budget with alerting thresholds."""
    team: str
    monthly_budget: float
    alert_threshold: float = 0.80  # Alert at 80%
    hard_limit: float = 1.20  # Require approval above 120%

    def check_against_spend(self, current_spend: float) -> BudgetStatus:
        ratio = current_spend / self.monthly_budget

        if ratio > self.hard_limit:
            return BudgetStatus.EXCEEDED_HARD_LIMIT
        elif ratio > 1.0:
            return BudgetStatus.OVER_BUDGET
        elif ratio > self.alert_threshold:
            return BudgetStatus.APPROACHING_LIMIT
        else:
            return BudgetStatus.ON_TRACK

3. Reviews: Regular discussion of cost performance

Effective cost reviews happen at multiple cadences:

Cadence	Focus	Attendees
Weekly	Anomalies, quick wins	Engineering leads
Monthly	Trends, budget performance	Team leads + Finance
Quarterly	TCO, strategic investments	Leadership

GPU Selection and Self-Hosting Economics

The GPU Selection Decision

Choosing the right GPU is one of the highest-leverage cost decisions in AI. The wrong GPU wastes money through underutilization, overpayment, or insufficient performance.

GPU selection framework:

Key GPU comparisons (cloud pricing, late 2024):

GPU	VRAM	FP16 TFLOPS	Hourly Cost	$/TFLOP-hr	Best For
H100 SXM	80GB	989 (dense; 1,979 with sparsity)	$4.50	$0.0045	Large training
A100 80GB	80GB	312	$2.50	$0.0080	Training, large inference
A100 40GB	40GB	312	$1.75	$0.0056	Medium training/inference
L4	24GB	121	$0.50	$0.0041	Cost-efficient inference
A10G	24GB	125	$0.75	$0.0060	Balanced inference
T4	16GB	65	$0.35	$0.0054	Budget inference

Decision heuristics:

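For training: Match GPU memory to the full training footprint, not just model size. A 7B parameter model in FP16 needs ~14GB for weights alone; gradients and Adam optimizer states bring mixed-precision full fine-tuning to roughly 16 bytes per parameter (~112GB before activations), which calls for multiple 80GB GPUs or sharding and offloading. A single 40GB A100 is a realistic minimum only for parameter-efficient methods such as LoRA; 80GB is preferred for larger batch sizes.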
For inference: Match to your batch size and latency requirements. High-throughput batch inference benefits from larger GPUs (more parallelism); low-latency single requests are limited by memory bandwidth, not compute.
For cost optimization: Calculate cost per token or cost per request across GPU options. The cheapest GPU per hour is often not the cheapest per unit of work.
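The third heuristic can be made concrete with a short calculation. The sketch below converts an hourly price and a throughput figure into cost per million generated tokens. The hourly prices are taken from the comparison table; the throughput values are hypothetical placeholders, not benchmarks, and the names `GpuOption` and `cost_per_million_tokens` are introduced here only for illustration. Replace the throughputs with numbers measured on your own model and batch size.

```python
from dataclasses import dataclass


@dataclass
class GpuOption:
    name: str
    hourly_cost: float     # $ per GPU-hour (from the comparison table)
    tokens_per_sec: float  # HYPOTHETICAL throughput; measure your own


def cost_per_million_tokens(gpu: GpuOption, utilization: float = 1.0) -> float:
    """Dollars per 1M generated tokens at the given utilization, in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = gpu.tokens_per_sec * 3600 * utilization
    return gpu.hourly_cost / tokens_per_hour * 1_000_000


# Illustrative throughputs only: these are assumptions, not measurements.
options = [
    GpuOption("T4", 0.35, 300),
    GpuOption("L4", 0.50, 700),
    GpuOption("A100 40GB", 1.75, 3000),
]

# Rank by cost per unit of work rather than by hourly price
ranked = sorted(options, key=cost_per_million_tokens)
for gpu in ranked:
    print(f"{gpu.name:10s} ${cost_per_million_tokens(gpu):.3f} per 1M tokens")
```

Under these assumed throughputs, the GPU with the lowest hourly price ends up with the highest cost per token, which is the pattern the heuristic warns about. With different measured throughputs the ranking may well change.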

The Self-Hosting Breakeven Analysis

When should you self-host models instead of using APIs? This is a classic build-versus-buy decision with quantifiable economics.

The cost structure comparison:

Breakeven calculation:

import math
from dataclasses import dataclass

@dataclass
class SelfHostingAnalysis:
    """Analyze self-hosting vs API economics."""

    # API costs
    api_cost_per_request: float  # $ per request (input + output tokens priced out)

    # Self-hosting costs
    gpu_hourly_cost: float  # $ per GPU hour
    num_gpus: int
    gpu_utilization: float  # Expected utilization (0.0 - 1.0)
    engineering_monthly: float  # Engineering time for operations
    infrastructure_monthly: float  # Networking, monitoring, etc.

    # Throughput model (token-based, NOT a flat per-request latency)
    output_tokens_per_sec_per_system: float  # Aggregate decode throughput of the GPU set
    avg_output_tokens_per_request: float  # Output-length assumption — the key swing factor

    @property
    def monthly_fixed_cost(self) -> float:
        """Fixed monthly cost of self-hosting."""
        gpu_monthly = self.gpu_hourly_cost * 24 * 30 * self.num_gpus
        return gpu_monthly + self.engineering_monthly + self.infrastructure_monthly

    @property
    def requests_per_gpu_month(self) -> float:
        """Requests the GPU set can serve per month at expected utilization.

        Per-request time is dominated by OUTPUT token count, not a flat latency:
        a 500-token reply takes ~10-25x as long to generate as a 20-token one
        because decode is sequential. So we model capacity as tokens/sec, then
        divide by the average output length. Prefill (input tokens) is parallel
        and comparatively cheap, so it is omitted here for clarity.
        """
        output_tokens_per_month = (
            self.output_tokens_per_sec_per_system
            * self.gpu_utilization
            * 3600 * 24 * 30
        )
        return output_tokens_per_month / self.avg_output_tokens_per_request

    @property
    def marginal_cost_per_request(self) -> float:
        """Marginal cost per request once the infrastructure is running.

        With cloud-rented GPUs, power and hardware wear are already included
        in the hourly rate, so this is modeled as zero. Use a small positive
        value if you own the hardware and pay for electricity separately.
        """
        return 0.0

    def total_cost(self, monthly_requests: int) -> dict:
        """Calculate total monthly cost for both options."""
        api_cost = monthly_requests * self.api_cost_per_request

        # Despite its name, requests_per_gpu_month is the capacity of the
        # whole GPU set, not of a single GPU.
        capacity = self.requests_per_gpu_month
        if monthly_requests <= capacity:
            fixed_cost = self.monthly_fixed_cost
        else:
            # Need more GPUs: scale up in whole GPUs
            gpus_needed = math.ceil(monthly_requests / capacity * self.num_gpus)
            # Simplification: engineering and infrastructure scale with GPU count
            fixed_cost = self.monthly_fixed_cost * (gpus_needed / self.num_gpus)

        # Self-hosted: fixed + marginal
        self_hosted_cost = fixed_cost + monthly_requests * self.marginal_cost_per_request

        return {
            'api_cost': api_cost,
            'self_hosted_cost': self_hosted_cost,
            'recommendation': 'self-host' if self_hosted_cost < api_cost else 'api'
        }

    def breakeven_requests(self) -> float:
        """Monthly volume where API spend equals the current fixed cost.

        This ignores capacity: compare the result with requests_per_gpu_month
        to check that the GPU set can actually serve the break-even volume.
        """
        # Fixed / (API marginal - Self marginal) = breakeven volume
        marginal_diff = self.api_cost_per_request - self.marginal_cost_per_request
        if marginal_diff <= 0:
            return float('inf')  # Self-hosting never cheaper on marginal
        return self.monthly_fixed_cost / marginal_diff

Example breakeven analysis:

Scenario: Self-hosting a Llama 3.1 70B model versus using an API. There are two break-evens to track, and both move with the output-length distribution:

Cost break-even: the monthly volume at which API spend equals self-hosting’s fixed cost. Because API requests are priced per token, the per-request price — and therefore this break-even — scales with average output length.
Capacity: whether your GPU set can actually serve that volume. Capacity is governed by decode throughput in tokens/sec, divided by average output length. A reply that is 5x longer cuts servable requests/month by ~5x.

Parameter	Value
Self-hosted: 2x A100 80GB	$2.50/hr each = $3,600/month
Engineering (0.1 FTE)	$1,500/month
Infrastructure overhead	$500/month
Total fixed cost	$5,600/month
GPU utilization target	50%
Aggregate decode throughput (70B, continuous batching)	~2,000 output tok/sec

We evaluate two output-length regimes that bracket realistic workloads — short replies (~100 output tokens, e.g. classification or extraction) and long replies (~500 output tokens, e.g. summarization or chat). Input/prefill is parallel and comparatively cheap, so we approximate API price per request from output tokens at $0.85/1M (Llama 4 Maverick output rate from the pricing table above):

# --- Short-reply regime: ~100 output tokens/request ---
short = SelfHostingAnalysis(
    api_cost_per_request=0.85 / 1_000_000 * 100,   # ~$0.000085/req
    gpu_hourly_cost=2.50,
    num_gpus=2,
    gpu_utilization=0.50,
    engineering_monthly=1500,
    infrastructure_monthly=500,
    output_tokens_per_sec_per_system=2000,
    avg_output_tokens_per_request=100,
)
# Cost break-even = $5,600 / $0.000085 ≈ 65.9M requests/month
# Capacity on 2 GPUs   = 2000 * 0.5 * 2.592e6 s/mo / 100 ≈ 25.9M requests/month

# --- Long-reply regime: ~500 output tokens/request ---
long = SelfHostingAnalysis(
    api_cost_per_request=0.85 / 1_000_000 * 500,   # ~$0.000425/req
    gpu_hourly_cost=2.50,
    num_gpus=2,
    gpu_utilization=0.50,
    engineering_monthly=1500,
    infrastructure_monthly=500,
    output_tokens_per_sec_per_system=2000,
    avg_output_tokens_per_request=500,
)
# Cost break-even = $5,600 / $0.000425 ≈ 13.2M requests/month
# Capacity on 2 GPUs   = 2000 * 0.5 * 2.592e6 s/mo / 500 ≈ 5.2M requests/month

(Here 2.592e6 is the seconds in a 30-day month: 3600 * 24 * 30.)

The headline number is not a constant — it moves ~5x with output length:

Regime	API $/req	Cost break-even (req/mo)	Capacity on 2 GPUs (req/mo)
Short reply (~100 out tok)	~$0.000085	~66M	~26M
Long reply (~500 out tok)	~$0.000425	~13M	~5.2M

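Read this carefully: in both regimes the cost break-even volume is about 2.5x what the 2-GPU set can serve (~66M vs. ~26M requests, ~13M vs. ~5.2M). Adding GPUs to close that gap raises the fixed cost in step, so at 50% utilization self-hosting does not pay off against this cheap open-weight rate at any volume: the self-hosted cost works out to about $2.16 per 1M output tokens ($5,600 / 2.592B tokens per month) versus $0.85 from the API. Output length shifts where the thresholds sit (the long-reply workload reaches both at roughly a fifth of the short-reply volume) but not the ratio between them. Against a premium frontier API (e.g. $25-30/1M output, ~30-35x the rate used here), the break-even drops by the same factor, to roughly 0.4M requests/month for long replies and about 2M for short ones, well within the 2-GPU capacity. That is the cliff described at the top of this chapter. The break-even is a function of your output-length distribution, your achieved utilization, and the API tier you are displacing, not a fixed constant; re-derive it whenever any of them changes.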

Critical caveats:

Output-length sensitivity: As the table shows, doubling average output length roughly halves both the cost break-even and the servable capacity. Always model break-even against your measured output-length distribution, not an assumed flat per-request cost.
Utilization uncertainty: The analysis assumes 50% utilization. If you achieve only 25%, effective costs double.
Engineering underestimation: Operations, troubleshooting, and updates often take more time than estimated.
Opportunity cost: Engineering time on infrastructure is time not spent on product.
Flexibility cost: APIs scale instantly; self-hosted requires capacity planning.
Quality differences: Self-hosted open models may not match API quality for your use case.

The right decision depends on your confidence in volume projections, engineering capacity, and strategic importance of ownership.
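The utilization caveat is easiest to see per token. The sketch below uses the example's figures ($5,600/month fixed cost, 2,000 output tokens/sec aggregate throughput) to compute self-hosted cost per 1M output tokens at several utilization levels, and the utilization needed to match a given API price. The function names are hypothetical, and the model assumes throughput and fixed cost stay constant as utilization varies, which is a simplification.

```python
SECONDS_PER_MONTH = 3600 * 24 * 30  # 30-day month, as in the example above


def self_hosted_cost_per_million(
    monthly_fixed_cost: float,
    tokens_per_sec: float,
    utilization: float,
) -> float:
    """Self-hosted dollars per 1M output tokens at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = tokens_per_sec * utilization * SECONDS_PER_MONTH
    return monthly_fixed_cost / tokens_per_month * 1_000_000


def breakeven_utilization(
    monthly_fixed_cost: float,
    tokens_per_sec: float,
    api_price_per_million: float,
) -> float:
    """Utilization at which self-hosting matches the API price.

    A result above 1.0 means the hardware cannot reach parity even when
    it is fully busy.
    """
    cost_at_full_load = self_hosted_cost_per_million(
        monthly_fixed_cost, tokens_per_sec, 1.0
    )
    return cost_at_full_load / api_price_per_million


for u in (0.25, 0.50, 0.75, 1.00):
    cost = self_hosted_cost_per_million(5600, 2000, u)
    print(f"utilization {u:.0%}: ${cost:.2f} per 1M output tokens")

cheap = breakeven_utilization(5600, 2000, 0.85)    # open-weight API rate
premium = breakeven_utilization(5600, 2000, 25.0)  # premium frontier rate
print(f"break-even utilization: cheap={cheap:.2f}, premium={premium:.3f}")
```

With these inputs the break-even utilization against the $0.85 rate comes out above 1.0, while against a $25 rate it is only a few percent. Framing the decision as a required utilization can be more useful than a request count, because utilization is the number teams most often overestimate.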

Full implementation: See reference/26b_cost_engineering_code.md for the complete analysis framework.

Cost Optimization Strategies

Staff Engineer Perspective

“Our AI bill went from $50K to $400K in three months. Leadership was panicking about ‘AI costs.’ When I actually looked at the data, 80% of the spend was one feature that product had added without telling anyone—auto-summarization that ran on every document load. We added caching and moved to a smaller model. Costs dropped 90%. The moral: before optimizing infrastructure, find out what’s actually generating the calls.”

— Engineering Director at enterprise software company

The Optimization Priority Matrix

Not all cost optimizations are equal. Prioritize based on impact and effort:

Tier 1: Quick wins (days to implement)

Model selection optimization: Use smaller models for simpler tasks

class ModelRouter:
    """Route requests to appropriate model based on complexity."""

    MODELS = {
        'simple': {'name': 'claude-haiku-4-5', 'cost_per_1k_tokens': 0.003},
        'moderate': {'name': 'gpt-5', 'cost_per_1k_tokens': 0.0056},
        'complex': {'name': 'claude-opus-4-8', 'cost_per_1k_tokens': 0.012}
    }

    def route(self, request: Request) -> str:
        """Route request to cheapest sufficient model."""
        complexity = self.assess_complexity(request)

        if complexity < 0.3:
            return self.MODELS['simple']['name']
        elif complexity < 0.7:
            return self.MODELS['moderate']['name']
        else:
            return self.MODELS['complex']['name']

    def assess_complexity(self, request: Request) -> float:
        """Estimate task complexity (0-1). Simple heuristics."""
        score = 0.0

        # Simple tasks: short, common patterns
        if len(request.prompt) < 500:
            score += 0.0
        elif len(request.prompt) < 2000:
            score += 0.3
        else:
            score += 0.5

        # Complex indicators
        if any(kw in request.prompt.lower() for kw in ['analyze', 'compare', 'reason']):
            score += 0.2
        if request.requires_structured_output:
            score += 0.1

        return min(1.0, score)

Savings: 40-70% by routing 60% of requests to smaller models
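The savings range above depends on the traffic mix after routing, so it is worth working through once. The sketch below computes a traffic-weighted cost per 1K tokens using the per-tier prices from `ModelRouter.MODELS`. The 60/25/15 traffic split is a hypothetical assumption, and the baseline assumes every request previously went to the most expensive tier.

```python
def blended_cost_per_1k(
    traffic_share: dict[str, float],
    cost_per_1k: dict[str, float],
) -> float:
    """Traffic-weighted average cost per 1K tokens."""
    total_share = sum(traffic_share.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_1k[tier] for tier, share in traffic_share.items())


# Per-1K-token costs copied from ModelRouter.MODELS above
COSTS = {"simple": 0.003, "moderate": 0.0056, "complex": 0.012}

# Baseline: everything goes to the most expensive model
baseline = blended_cost_per_1k({"complex": 1.0}, COSTS)

# HYPOTHETICAL split after routing: 60% simple, 25% moderate, 15% complex
routed = blended_cost_per_1k(
    {"simple": 0.60, "moderate": 0.25, "complex": 0.15}, COSTS
)

savings = 1 - routed / baseline
print(f"baseline ${baseline:.4f}, routed ${routed:.4f}, savings {savings:.0%}")
```

Under this assumed split the blended cost falls from $0.012 to $0.005 per 1K tokens, about 58% lower. The estimate ignores any cost of the complexity assessment itself and any re-routing after a poor answer from the small model, both of which reduce realized savings.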

Prompt optimization: Reduce token usage without sacrificing quality

import re

class PromptOptimizer:
    """Optimize prompts for token efficiency."""

    def optimize(self, prompt: str) -> tuple[str, dict]:
        """Return optimized prompt and savings metrics."""
        # count_tokens is assumed to wrap the tokenizer of the target model
        original_tokens = self.count_tokens(prompt)

        optimizations = []
        optimized = prompt

        # Collapse runs of spaces and tabs; keep newlines, which often carry structure
        optimized = re.sub(r'[ \t]+', ' ', optimized).strip()
        if len(optimized) < len(prompt):
            optimizations.append('whitespace_reduction')

        # Compress verbose instructions
        verbose_patterns = [
            (r'Please (.+?) for me', r'\1'),
            (r'I would like you to\s*', ''),
            (r'Can you please\s*', ''),
        ]
        for pattern, replacement in verbose_patterns:
            compressed = re.sub(pattern, replacement, optimized, flags=re.IGNORECASE)
            if compressed != optimized:
                optimized = compressed
                if 'verbose_phrase_removal' not in optimizations:
                    optimizations.append('verbose_phrase_removal')

        # Flag an excessive example count (flag only; the pruning is not applied here)
        if prompt.count('Example:') > 3:
            optimizations.append('reduce_examples')
            # Keep first and last example, remove middle
            # (Implementation detail omitted)

        new_tokens = self.count_tokens(optimized)
        # Guard against empty prompts
        savings = (original_tokens - new_tokens) / original_tokens if original_tokens else 0.0

        return optimized, {
            'original_tokens': original_tokens,
            'optimized_tokens': new_tokens,
            'savings_pct': savings,
            'optimizations_applied': optimizations
        }

Savings: 15-30% token reduction typical

Caching: Avoid redundant computation

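import numpy as np
from typing import Optional

class SemanticCache:
    """Cache responses for semantically similar queries."""

    def __init__(self, embedding_model, similarity_threshold: float = 0.95):
        self.entries: list[tuple[np.ndarray, str]] = []  # (unit-norm embedding, response)
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold

    def _embed(self, text: str) -> np.ndarray:
        """Encode text and normalize so a dot product equals cosine similarity."""
        vec = np.asarray(self.embedding_model.encode(text), dtype=float)
        return vec / np.linalg.norm(vec)

    def get(self, query: str) -> Optional[str]:
        """Return the response for the most similar cached query, if above threshold."""
        query_embedding = self._embed(query)

        best_response, best_similarity = None, self.threshold
        # Linear scan: fine for small caches; use a vector index at scale
        for cached_embedding, response in self.entries:
            similarity = float(np.dot(query_embedding, cached_embedding))
            if similarity >= best_similarity:
                best_response, best_similarity = response, similarity

        return best_response

    def set(self, query: str, response: str):
        """Cache response for future similar queries."""
        self.entries.append((self._embed(query), response))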

Savings: 20-60% depending on query similarity in your workload
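The realized savings depend on the hit rate and on what each lookup costs, since every request pays for the embedding and similarity search whether or not it hits. The sketch below is a minimal model of that tradeoff. The function name and the per-request prices are hypothetical, chosen only to illustrate the arithmetic.

```python
def net_cache_savings(
    hit_rate: float,
    llm_cost_per_request: float,
    lookup_cost_per_request: float,
) -> float:
    """Fraction of LLM spend saved after paying for cache lookups.

    Every request pays the lookup (embedding plus similarity search);
    only cache misses go on to pay for the LLM call.
    """
    if not 0 <= hit_rate <= 1:
        raise ValueError("hit_rate must be in [0, 1]")
    cost_without_cache = llm_cost_per_request
    cost_with_cache = lookup_cost_per_request + (1 - hit_rate) * llm_cost_per_request
    return 1 - cost_with_cache / cost_without_cache


# HYPOTHETICAL prices: $0.01 per LLM call, $0.0002 per cache lookup
for hit_rate in (0.10, 0.25, 0.60):
    saved = net_cache_savings(hit_rate, 0.01, 0.0002)
    print(f"hit rate {hit_rate:.0%}: net savings {saved:.1%}")
```

When the lookup is cheap relative to the LLM call, net savings track the hit rate closely. With a zero hit rate the result is negative, a reminder that a cache on a workload with little repetition is a pure cost.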

Tier 2: Medium effort (weeks to implement)

Batching optimization: Process requests in efficient batches

import asyncio
import time

class DynamicBatcher:
    """Batch requests for efficient GPU utilization."""

    def __init__(self, max_batch_size: int = 32, max_wait_ms: int = 50):
        self.queue = asyncio.Queue()
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms

    async def process_loop(self):
        """Continuously process batches."""
        while True:
            # Block until the first request arrives, then start the wait window
            batch = [await self.queue.get()]
            deadline = time.monotonic() + self.max_wait_ms / 1000

            # Collect more requests until the batch is full or the window closes
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    request = await asyncio.wait_for(self.queue.get(), timeout=timeout)
                    batch.append(request)
                except asyncio.TimeoutError:
                    break

            try:
                responses = await self.process_batch(batch)
                for request, response in zip(batch, responses):
                    request.future.set_result(response)
            except Exception as exc:
                # Fail the waiting callers instead of leaving them hanging
                for request in batch:
                    if not request.future.done():
                        request.future.set_exception(exc)

Savings: 30-50% through better GPU utilization

Quantization: Reduce model precision for faster, cheaper inference

# Using bitsandbytes for 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",  # 70B model, matching the sizing notes below
    quantization_config=quantization_config,
    device_map="auto"
)

# 70B model weights: FP16 = 140GB, INT4 = 35GB (before KV cache and activations)
# Can run on 2x A100 40GB instead of 4x A100 80GB

Savings: 50-75% through GPU memory reduction
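The sizing figures in the comments follow from simple arithmetic: weight memory is parameter count times bits per parameter. The sketch below reproduces that calculation and estimates a minimum GPU count. The 25% VRAM headroom reserved for KV cache and activations is an assumption, not a measured value, and real 4-bit formats carry a small overhead for quantization constants that this ignores.

```python
import math


def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9


def min_gpus(
    params_billions: float,
    bits_per_param: int,
    gpu_vram_gb: float,
    headroom: float = 0.25,
) -> int:
    """Smallest GPU count whose usable VRAM fits the weights.

    `headroom` is an ASSUMED fraction of VRAM reserved for KV cache and
    activations; tune it for your context length and batch size.
    """
    usable_per_gpu = gpu_vram_gb * (1 - headroom)
    return math.ceil(weight_memory_gb(params_billions, bits_per_param) / usable_per_gpu)


for bits in (16, 8, 4):
    gb = weight_memory_gb(70, bits)
    print(
        f"{bits}-bit: {gb:.0f} GB weights, "
        f"{min_gpus(70, bits, 80)}x 80GB or {min_gpus(70, bits, 40)}x 40GB"
    )
```

Under this headroom assumption a 4-bit 70B model fits on two 40GB GPUs, consistent with the comment above. The FP16 estimate comes out at three 80GB GPUs; tensor-parallel deployments commonly round up to a power of two, which is one reason four is the usual choice in practice.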

Tier 3: Strategic investments (months to implement)

Architecture redesign: Reduce unnecessary LLM calls

Pipeline Step	Before (3 LLM calls)	After (1 LLM call)
Classification	LLM call	Local classifier
Entity Extraction	LLM call	Local NER model
Response Generation	LLM call	LLM call
Total LLM calls	3 per request	1 per request

Savings: 60-70% by replacing LLM calls with specialized models

Custom model training: Fine-tune smaller models for your domain

# Distillation: train small model on large model outputs
class ModelDistillation:
    """Distill large model knowledge to smaller model."""

    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def generate_training_data(self, prompts: list[str]) -> list[dict]:
        """Generate training data from teacher model."""
        data = []
        for prompt in prompts:
            response = self.teacher.generate(prompt)
            data.append({'prompt': prompt, 'response': response})
        return data

    def train_student(self, training_data: list[dict]):
        """Fine-tune student on teacher outputs."""
        # Standard fine-tuning loop
        # Student learns to mimic teacher behavior

Savings: 80-95% by replacing API calls with fine-tuned small model

Full implementation: See reference/26b_cost_engineering_code.md for complete implementations.

Build vs. Buy Decision Framework

The Decision Matrix

Build-versus-buy decisions in AI involve more factors than traditional software:

Quantitative Analysis Framework

@dataclass
class BuildVsBuyAnalysis:
    """Comprehensive build vs buy analysis."""

    # Build costs
    development_months: int
    developers_required: int
    developer_monthly_cost: float
    infrastructure_monthly: float
    maintenance_fte: float

    # Buy costs
    per_request_cost: float
    monthly_minimum: float

    # Usage projections
    monthly_requests_year1: list[int]  # 12 months
    monthly_requests_year2: list[int]  # 12 months

    # Qualitative factors (scored 1-5, weight 0-1)
    factors: list[tuple[str, float, int, int]]  # name, weight, build_score, buy_score

    def calculate_costs(self) -> dict:
        """Calculate 2-year TCO for both options."""

        # Build: Development + infrastructure + maintenance
        development_cost = (
            self.development_months *
            self.developers_required *
            self.developer_monthly_cost
        )

        infrastructure_cost = sum(
            self.infrastructure_monthly for _ in range(24 - self.development_months)
        )

        maintenance_cost = (
            (24 - self.development_months) *
            self.maintenance_fte *
            self.developer_monthly_cost
        )

        build_total = development_cost + infrastructure_cost + maintenance_cost

        # Buy: Per-request + minimums
        all_requests = self.monthly_requests_year1 + self.monthly_requests_year2
        buy_total = sum(
            max(self.monthly_minimum, requests * self.per_request_cost)
            for requests in all_requests
        )

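        return {
            'build_2yr_tco': build_total,
            'buy_2yr_tco': buy_total,
            'cost_difference': buy_total - build_total,
            'break_even_month': self._find_breakeven()
        }

    def _find_breakeven(self) -> int | None:
        """First month (1-24) where cumulative buy spend reaches cumulative build spend."""
        build_cumulative = buy_cumulative = 0.0
        all_requests = self.monthly_requests_year1 + self.monthly_requests_year2

        for month, requests in enumerate(all_requests, start=1):
            if month <= self.development_months:
                # Development phase: pay the build team
                build_cumulative += self.developers_required * self.developer_monthly_cost
            else:
                # Operating phase: infrastructure plus maintenance
                build_cumulative += (
                    self.infrastructure_monthly
                    + self.maintenance_fte * self.developer_monthly_cost
                )
            buy_cumulative += max(self.monthly_minimum, requests * self.per_request_cost)
            if buy_cumulative >= build_cumulative:
                return month

        return None  # Buy stays cheaper for the whole 24-month horizon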

    def calculate_scores(self) -> dict:
        """Calculate weighted qualitative scores."""
        build_score = sum(weight * score for name, weight, score, _ in self.factors)
        buy_score = sum(weight * score for name, weight, _, score in self.factors)

        return {
            'build_score': build_score,
            'buy_score': buy_score,
            'recommendation': 'build' if build_score > buy_score else 'buy'
        }

    def full_analysis(self) -> dict:
        """Combined cost and qualitative analysis."""
        costs = self.calculate_costs()
        scores = self.calculate_scores()

        # Cost decides the recommendation; qualitative agreement sets the confidence
        cost_favors = 'build' if costs['build_2yr_tco'] < costs['buy_2yr_tco'] else 'buy'
        qual_favors = scores['recommendation']

        if cost_favors == qual_favors:
            confidence = 'high'
        else:
            confidence = 'medium - cost and qualitative factors disagree'

        return {
            **costs,
            **scores,
            'overall_recommendation': cost_favors,  # Default to cost
            'confidence': confidence
        }

Example: Embedding Service Build vs. Buy

analysis = BuildVsBuyAnalysis(
    # Build costs
    development_months=3,
    developers_required=2,
    developer_monthly_cost=15000,  # Fully loaded
    infrastructure_monthly=6000,   # 2x A100
    maintenance_fte=0.25,

    # Buy costs (OpenAI embeddings)
    per_request_cost=0.0001,  # $0.10 per 1M tokens, ~1K tokens/request
    monthly_minimum=0,

    # Usage: Growing from 10M to 50M requests/month
    monthly_requests_year1=[10e6, 12e6, 15e6, 18e6, 22e6, 26e6,
                            30e6, 34e6, 38e6, 42e6, 46e6, 50e6],
    monthly_requests_year2=[50e6] * 12,  # Plateau at 50M

    # Qualitative factors
    factors=[
        ('data_privacy', 0.3, 5, 2),      # Build much better for privacy
        ('customization', 0.2, 4, 2),      # Build allows fine-tuning
        ('engineering_capacity', 0.2, 3, 5),  # Limited ML team
        ('time_to_market', 0.2, 2, 5),     # Need it fast
        ('strategic_importance', 0.1, 3, 3)  # Neutral
    ]
)

result = analysis.full_analysis()
# Build 2yr TCO: $294,750 ($90,000 development + $126,000 infrastructure + $78,750 maintenance)
# Buy 2yr TCO: $94,300 (943M requests x $0.0001)
# Qualitative: Build 3.6, Buy 3.3
# Recommendation: Buy (cost wins despite qualitative edge for build)

The Hidden Costs of Building

Build-versus-buy analyses often underestimate build costs:

Scope creep: Initial estimates rarely include production hardening, monitoring, edge cases
Ongoing investment: ML systems require continuous improvement to maintain quality
Opportunity cost: Engineers building infrastructure aren’t building product
Knowledge concentration: What happens when the ML engineer leaves?
Obsolescence risk: The landscape evolves; your custom solution may fall behind

Heuristic: Multiply initial build estimates by 2-3x to account for hidden costs. If build still wins, proceed confidently.
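The heuristic amounts to a simple stress test, sketched below. It checks whether build still beats buy under each multiplier and reports the largest overrun the build case can absorb. The function names and the dollar figures are hypothetical and serve only to illustrate the check.

```python
def build_wins_under(
    build_estimate: float,
    buy_cost: float,
    multipliers: tuple[float, ...] = (1.0, 2.0, 3.0),
) -> dict[float, bool]:
    """For each hidden-cost multiplier, report whether build is still cheaper."""
    return {m: build_estimate * m < buy_cost for m in multipliers}


def max_tolerable_overrun(build_estimate: float, buy_cost: float) -> float:
    """Multiplier on the build estimate at which build and buy cost the same."""
    if build_estimate <= 0:
        raise ValueError("build_estimate must be positive")
    return buy_cost / build_estimate


# HYPOTHETICAL 2-year figures: $200K build estimate, $500K buy cost
stress = build_wins_under(200_000, 500_000)
headroom = max_tolerable_overrun(200_000, 500_000)
print(stress)
print(f"build can overrun its estimate by up to {headroom:.1f}x")
```

In this illustration build survives a 2x overrun but not a 3x one, so the decision hinges on how much confidence the team has in its estimate. A case that only wins at 1x is usually a buy.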

Full implementation: See reference/26b_cost_engineering_code.md for the complete framework.

Building a Cost-Aware Culture

The FinOps Mindset for AI

FinOps—the practice of bringing financial accountability to cloud spend—is especially important for AI systems where costs are higher and more variable.

Core FinOps principles applied to AI:

Teams own their costs: The team that deploys a model is accountable for its cost, not a central platform team.
Decisions are data-driven: Cost tradeoffs are made with real numbers, not intuition.
Optimization is continuous: Cost efficiency is an ongoing practice, not a one-time project.
Value, not just cost: The goal is maximizing value per dollar, not minimizing dollars.

Implementing Cost Governance

1. Cost visibility dashboards

Every engineering team should have real-time visibility into their AI costs:

class AICostDashboard:
    """Team-level AI cost visibility."""

    def get_team_view(self, team: str) -> dict:
        return {
            'summary': {
                'mtd_cost': self.get_mtd_cost(team),
                'budget': self.get_budget(team),
                'utilization': self.get_budget_utilization(team),
                'forecast': self.forecast_month_end(team)
            },
            'by_model': self.get_costs_by_model(team),
            'by_service': self.get_costs_by_service(team),
            'daily_trend': self.get_daily_trend(team, days=30),
            'anomalies': self.detect_anomalies(team),
            'opportunities': self.identify_optimizations(team)
        }

2. Budget controls with escalation

@dataclass
class BudgetPolicy:
    """Budget policy with tiered escalation."""

    team: str
    monthly_budget: float

    # Escalation thresholds
    warn_threshold: float = 0.75    # Alert at 75%
    review_threshold: float = 0.90  # Require review at 90%
    block_threshold: float = 1.10   # Block new requests at 110%

    def check_request(self, request_cost: float, current_spend: float) -> dict:
        """Check if request should proceed given budget."""
        new_spend = current_spend + request_cost
        utilization = new_spend / self.monthly_budget

        if utilization > self.block_threshold:
            return {
                'allowed': False,
                'reason': 'Budget exceeded - requires VP approval',
                'escalation': 'vp'
            }
        elif utilization > self.review_threshold:
            return {
                'allowed': True,
                'warning': 'Approaching budget limit',
                'escalation': 'manager'
            }
        elif utilization > self.warn_threshold:
            return {
                'allowed': True,
                'warning': f'Budget {utilization:.0%} utilized'
            }
        else:
            return {'allowed': True}

3. Cost review cadence

Review	Frequency	Participants	Focus
Anomaly review	Daily (automated)	On-call	Spikes, unusual patterns
Team cost review	Weekly	Tech lead	Trends, quick wins
Service review	Monthly	Engineering + Finance	Budget performance, projections
Strategic review	Quarterly	Leadership	TCO trends, build/buy decisions
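The daily automated anomaly review in the table can start from something very simple. The sketch below flags days whose spend deviates sharply from a trailing window, using a z-score. It is one possible approach under stated assumptions, not the dashboard's actual `detect_anomalies` implementation; the function name, window, threshold, and sample data are all hypothetical.

```python
from statistics import mean, stdev


def detect_cost_anomalies(
    daily_costs: list[float],
    window: int = 7,
    threshold: float = 3.0,
) -> list[int]:
    """Return indices of days whose cost deviates sharply from recent history.

    A day is flagged when it lies more than `threshold` standard deviations
    from the mean of the previous `window` days.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation
            if daily_costs[i] != mu:
                anomalies.append(i)
            continue
        if abs(daily_costs[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# HYPOTHETICAL daily spend in dollars; index 9 contains a spike
costs = [1000, 1020, 980, 1010, 990, 1005, 995, 1015, 1000, 2400, 1010]
flagged = detect_cost_anomalies(costs)
print(flagged)
```

One known limitation of this approach: a spike inflates the standard deviation of the following windows, so a second spike shortly after the first may go undetected. Using a median-based spread or excluding flagged days from the history are common refinements.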

Incentive Alignment

Cost awareness requires aligned incentives:

What works:

Teams that reduce costs get some savings allocated to their budget
Cost efficiency is part of performance reviews for tech leads
Cost per unit of value is a team metric alongside reliability and velocity

What doesn’t work:

Centralized cost management with no team accountability
Pure cost minimization targets without value consideration
Punishing teams for high costs without context

Example: Cost efficiency OKRs

Q1 Objectives for ML Platform Team:

Objective: Improve AI cost efficiency
  KR1: Reduce cost per inference request by 33%
       (from $0.015 to $0.010)
  KR2: Increase GPU utilization from 35% to 55%
  KR3: Implement model routing, directing 50% of
       requests to smaller models

Objective: Maintain quality while optimizing
  KR1: Response quality score stays above 4.2/5
  KR2: P95 latency stays under 500ms
  KR3: No customer-facing incidents from optimization changes

Key Takeaways

AI costs are fundamentally different - GPU economics, marginal per-request costs, and hidden engineering time make AI 10-100x more expensive per request than traditional software. Plan accordingly.
Model total cost of ownership completely - Direct infrastructure is often only 60-70% of true TCO. Include engineering time, maintenance, monitoring, and opportunity costs in your calculations.
Attribute costs to create accountability - You cannot optimize what you cannot measure. Tag resources, track usage per team, and allocate costs to create ownership and visibility.
Model routing and caching provide biggest wins - These optimizations often deliver 40-70% savings with days of effort. Start here before complex architectural changes or custom optimization.
Build cost-aware culture through visibility - Dashboards, budgets, and aligned incentives make cost efficiency everyone’s responsibility. Teams that can see costs make better decisions.

Summary

Cost engineering transforms AI systems from economically precarious to sustainable. The key insights:

Understand why AI is expensive: GPU economics, marginal per-request costs, and hidden costs in engineering time make AI fundamentally different from traditional software.

Model total cost of ownership: Direct infrastructure costs are often only 60-70% of true TCO. Include engineering time, maintenance, and hidden costs.

Attribute costs to create accountability: You cannot optimize what you cannot measure. Tag resources, track usage, and allocate costs to teams.

Make data-driven build vs. buy decisions: Calculate crossover points with realistic assumptions. Multiply build estimates by 2-3x for hidden costs.

Prioritize optimizations by impact: Model routing and caching often provide 40-70% savings with days of effort. Start there before complex architectural changes.

Build cost-aware culture: Visibility, budgets, and aligned incentives make cost efficiency everyone’s responsibility.

Plan around the 10x cost cliff: Forecast your 12-month volume against the API-versus-self-host crossover. Below the cliff, managed APIs win; above it, self-hosting wins by roughly an order of magnitude. The expensive mistake is discovering which side you’re on from a surprise invoice.

Cost engineering is a Staff+ skill because it requires translating technical metrics into business terms, making tradeoffs across multiple dimensions, and building organizational systems that create sustainable economics at scale.

Practical Exercises

Exercise 1: TCO Analysis

Scenario: Your company runs a customer support chatbot. Calculate complete TCO.

Given:

Inference: 4x L4 GPUs, $0.50/hr each, running 24/7
API calls: External LLM for complex queries, 50,000 calls/month at $0.02 average
Storage: 500GB conversation logs, 50GB model artifacts
Engineering: 0.5 FTE ML engineer, 0.2 FTE DevOps

Tasks:

1. Calculate monthly direct costs
2. Estimate engineering costs (assume $180k fully-loaded annual salary)
3. Add hidden costs (monitoring, security, DR) at 15% of direct
4. Calculate cost per conversation (assume 500,000 conversations/month)
5. If conversations grow to 2M/month, how does cost per conversation change?

Exercise 2: GPU Selection

Scenario: You need to serve a 7B parameter model.

Requirements:

P95 latency: <200ms
Throughput: 100 requests/second peak
Budget: <$10,000/month

GPU options:

| GPU | VRAM | Latency (7B) | Max throughput (per GPU) | Monthly cost (per GPU) |
|-----|------|--------------|--------------------------|------------------------|
| L4 | 24GB | 80ms | 40 req/s | $360 |
| A10G | 24GB | 65ms | 50 req/s | $540 |
| A100 40GB | 40GB | 40ms | 100 req/s | $1,260 |

Tasks:

1. Calculate minimum instances needed to meet throughput
2. Calculate total monthly cost for each GPU option
3. Verify latency requirements are met
4. Make a recommendation with justification

Exercise 3: Build vs. Buy Analysis

Scenario: Evaluate building vs. buying a document extraction service.

Build option:

Development: 2 engineers, 4 months
Infrastructure: 2x A100 for fine-tuned model serving
Maintenance: 0.3 FTE ongoing

Buy option:

API cost: $0.10 per page
Current volume: 100,000 pages/month
Projected growth: 20% monthly for 12 months

Tasks:

1. Calculate 24-month TCO for build (assume $15k/month per engineer)
2. Calculate 24-month TCO for buy
3. Find the monthly volume crossover point
4. List 3 qualitative factors and score them for build vs. buy
5. Make a recommendation

Exercise 4: Optimization Prioritization

Scenario: Your LLM service has the following profile:

2M requests/day
Average 3,000 tokens per request (2,000 input, 1,000 output)
Using GPT-5.5 at $5/1M input, $30/1M output
35% of queries are very similar to previous queries
60% of queries could be handled by a smaller model

Tasks:

1. Calculate current daily and monthly cost
2. Estimate savings from semantic caching (assume 25% cache hit rate)
3. Estimate savings from model routing (Claude Haiku 4.5 at $1.00 per 1M input tokens and $5.00 per 1M output tokens)
4. Estimate savings from 20% prompt optimization
5. Rank optimizations by estimated ROI
6. Calculate combined savings if all three are implemented

Self-Assessment Checkpoint

Conceptual Questions

Q1. [Senior] What components make up the Total Cost of Ownership (TCO) for an AI system? Which costs are commonly underestimated?

Answer

TCO components: (1) Direct costs—compute (GPU hours), storage, API calls, networking. (2) Engineering costs—development, maintenance, on-call, debugging. (3) Infrastructure overhead—monitoring, logging, security, DR. (4) Opportunity cost—what else could engineers/resources do? (5) Risk costs—incidents, data loss, compliance issues. Commonly underestimated: (1) Engineering maintenance—“we’ll build it and forget it” is a myth. Budget 20-30% of initial build for ongoing maintenance. (2) Opportunity cost—engineers building infrastructure aren’t building features. (3) Hidden infrastructure—observability, secrets management, deployment tooling. (4) Scaling costs—unit economics change as you scale. (5) Data costs—storage, transfer, processing grow over time.

Q2. [Senior] Explain the cost difference between input tokens and output tokens for LLM APIs. Why does this matter for optimization?

Answer

Output tokens typically cost 5-6x as much as input tokens at the list prices quoted in this chapter (e.g., GPT-5.5: $5/M input vs. $30/M output). Why: Output generation is more computationally intensive. Each output token requires its own sequential forward pass through the entire model, while input tokens are processed in parallel during prefill. Optimization implications: (1) Prompt optimization matters, but output optimization matters more. (2) Limit output length where possible: constrain max_tokens, use structured outputs. (3) For Q&A, long documents as input are cheaper than long generated answers. (4) Caching input (prompt caching) saves less than you might think if outputs dominate cost. (5) Model routing can save dramatically: route simple queries to cheaper models. Cost analysis: For a request with 2K input / 1K output at GPT-5.5 prices: Input = $0.010, Output = $0.030. Output is 75% of cost despite being 33% of tokens.
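The cost analysis above can be reproduced in a few lines. This is a minimal sketch; the function name `cost_split` is an illustrative assumption, and the prices are the GPT-5.5 figures quoted in the answer.

```python
def cost_split(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost, output_share_of_cost) for one request."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    total = input_cost + output_cost
    return input_cost, output_cost, output_cost / total


# 2K input / 1K output at $5 per 1M input tokens and $30 per 1M output tokens
inp, out, share = cost_split(2_000, 1_000, 5.00, 30.00)
print(f"input=${inp:.3f} output=${out:.3f} output share of cost={share:.0%}")
```

Running the same function over a sample of real request logs shows whether a workload is input-dominated or output-dominated before any optimization work is chosen.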

Q3. [Staff] How do you build a business case for AI infrastructure investment? What metrics do stakeholders care about?

Answer

Stakeholder metrics: (1) The CFO cares about ROI, payback period, cost per unit (transaction, user, query), and TCO vs. alternatives. (2) The VP of Engineering cares about engineering efficiency, reliability, and time to market. (3) Product cares about user impact, feature velocity, and competitive position. Building the case: (1) Define the counterfactual: what happens if we don’t invest? Status quo cost, opportunity cost, risk. (2) Quantify benefits: cost savings, revenue enablement, risk reduction. (3) Calculate payback period: when does the investment pay for itself? Under 12 months is usually an easy sell. (4) Include non-financial benefits: developer productivity, user experience, competitive positioning. (5) Address risks: what could go wrong, and what are the mitigation plans? (6) Propose a staged approach: invest incrementally, prove value before scaling.

Q4. [Staff] What’s the cost difference between self-hosting LLMs vs. using APIs? How do you calculate the break-even point?

Answer

API costs: Simple—request volume × (input tokens × input price + output tokens × output price). Self-hosting costs: Complex—GPU cost + storage + engineering (setup + maintenance) + infrastructure overhead. Add 15-30% for hidden costs. Break-even calculation: (1) Calculate monthly self-hosted cost: GPU_monthly + storage + (maintenance_FTE × salary). (2) Calculate cost per request at full utilization: monthly_cost / max_requests_per_month. (3) Calculate API cost per request. (4) Break-even volume = monthly_self_hosted_cost / api_cost_per_request. Example: Self-hosted 7B model on A100 costs ~$1,500/month (GPU) + $2,500/month (0.2 FTE maintenance) = $4,000/month. Can serve ~100M requests at full utilization. API at $0.001/request. Break-even = 4M requests/month. Below 4M, API is cheaper. Above, self-host wins. Caveats: Quality difference, utilization reality (rarely 100%), setup time/cost.
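The break-even steps above can be sketched as a small calculator. This is an illustration under the example's own figures; the class name `SelfHostedPlan` and its fields are assumptions, and it treats self-hosted cost as fixed up to the stated capacity.

```python
from dataclasses import dataclass


@dataclass
class SelfHostedPlan:
    gpu_monthly: float             # GPU rental or amortized hardware, $/month
    maintenance_monthly: float     # FTE fraction x fully loaded salary, $/month
    max_requests_per_month: float  # capacity at 100% utilization

    @property
    def monthly_cost(self) -> float:
        return self.gpu_monthly + self.maintenance_monthly

    def break_even_requests(self, api_cost_per_request: float) -> float:
        """Monthly volume at which API spend equals the fixed self-hosted cost."""
        return self.monthly_cost / api_cost_per_request

    def cost_per_request(self, monthly_requests: float) -> float:
        """Self-hosted unit cost when the fixed cost is spread over actual volume."""
        if monthly_requests > self.max_requests_per_month:
            raise ValueError("volume exceeds capacity; add GPUs and recompute")
        return self.monthly_cost / monthly_requests


# Figures from the example: $1,500 GPU + $2,500 maintenance, ~100M requests capacity
plan = SelfHostedPlan(1_500, 2_500, 100_000_000)
print(f"break-even: {plan.break_even_requests(0.001):,.0f} requests/month")
print(f"unit cost at 1M requests/month: ${plan.cost_per_request(1_000_000):.4f}")
print(f"unit cost at 8M requests/month: ${plan.cost_per_request(8_000_000):.4f}")
```

Comparing `cost_per_request` at the real expected volume against the API price makes the utilization caveat concrete: the self-hosted unit cost only beats the API once volume passes the break-even point.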

Q5. [Staff] How do you implement cost attribution in a platform serving multiple teams/products?

Answer

Implementation: (1) Tagging—every resource/request tagged with team, product, environment. Enforced at API/infrastructure level. (2) Metering—collect usage data at appropriate granularity (request-level for APIs, hourly for compute). (3) Aggregation—roll up by tag dimensions daily/weekly/monthly. (4) Allocation—shared resources (infrastructure, on-call) allocated by usage ratio or evenly. (5) Reporting—dashboards showing cost by team/product, trends, anomalies. (6) Showback vs. chargeback—showback (visibility only) is simpler to start. Chargeback (actual budget transfer) drives stronger behavior change but adds overhead. Challenges: (1) Shared resources—how to allocate platform team costs? (2) Bursty usage—one team’s spike affects others’ costs. (3) Tag enforcement—missing/incorrect tags corrupt data. (4) Granularity vs. overhead—too fine-grained is expensive to track. Best practice: Start with showback at product level. Add granularity over time as needed.
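The tagging, aggregation, and allocation steps above might look like the following showback roll-up. It is a sketch under stated assumptions: the record layout (`team`, `cost`) and the function name `attribute_costs` are hypothetical, and shared cost is split by usage ratio, one of the two allocation options mentioned.

```python
from collections import defaultdict


def attribute_costs(usage_records, shared_cost):
    """Roll up tagged usage into per-team direct, shared, and total cost.

    usage_records: iterable of dicts with a "cost" value and an optional "team" tag.
    shared_cost: platform cost (infrastructure, on-call) split by usage ratio.
    """
    direct = defaultdict(float)
    for record in usage_records:
        # Missing or empty tags go to an explicit bucket so they stay visible.
        team = record.get("team") or "untagged"
        direct[team] += record["cost"]

    total_direct = sum(direct.values())
    report = {}
    for team, cost in direct.items():
        share = cost / total_direct if total_direct else 0.0
        shared = shared_cost * share
        report[team] = {"direct": cost, "shared": shared, "total": cost + shared}
    return report


records = [
    {"team": "search", "cost": 400.0},
    {"team": "support", "cost": 300.0},
    {"team": "search", "cost": 200.0},
    {"cost": 100.0},  # request that arrived without a team tag
]
for team, row in attribute_costs(records, shared_cost=500.0).items():
    print(team, row)
```

Keeping an explicit `untagged` bucket turns the tag-enforcement challenge into a number on the dashboard instead of silently corrupted data.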

Spot the Problem

Problem 1. [Senior] Cost analysis:

"Our LLM API costs $50K/month. We could self-host for $10K/month in GPU costs.
That's $40K savings per month!"

Answer

Missing costs: (1) Engineering—who builds, deploys, maintains the self-hosted system? Add 0.3-0.5 FTE minimum = $4-8K/month. (2) Infrastructure—load balancing, monitoring, logging, security, CI/CD. (3) Quality risk—is your self-hosted model as good as the API? Quality degradation has business cost. (4) Setup time—how many months until you’re running? API cost continues during development. (5) Opportunity cost—what else could engineers do? (6) Utilization assumption—$10K assumes high utilization. Real utilization is often 40-60%. Realistic comparison: Self-hosted = $10K GPU + $6K engineering + $2K infrastructure + quality risk = $18K+/month + 3-month setup. Savings are real but smaller, and come with risk/complexity.

Problem 2. [Staff] Optimization prioritization:

"We're spending $100K/month on LLMs. I'll focus on reducing prompt length by 20%.
That should save $20K/month."

Answer

Flawed analysis: (1) Output tokens cost more—if 60% of cost is output tokens, 20% input reduction saves only 8% total, not 20%. (2) Prompt reduction may hurt quality—shorter prompts may degrade performance, increasing downstream costs or losing users. (3) Other optimizations may have higher ROI—caching, model routing, batch processing might save more with less risk. (4) Implementation effort—how hard is prompt reduction? May require rewriting many prompts. Proper approach: (1) Break down cost by input vs. output. (2) Identify all optimization opportunities with estimated savings and effort. (3) Prioritize by ROI: savings / (effort + risk). (4) Test optimizations before committing—measure actual savings and quality impact.
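The proper approach above can be sketched in code. The prompt-trimming savings use the figures from the answer (a 40% input share of a $100K bill, 20% reduction). Every other number, and the `Optimization` class itself, is a hypothetical placeholder to show the ranking mechanics.

```python
from dataclasses import dataclass


def prompt_reduction_savings(monthly_spend, input_cost_share, input_reduction):
    """Savings from trimming prompts: only the input share of spend shrinks."""
    return monthly_spend * input_cost_share * input_reduction


@dataclass
class Optimization:
    name: str
    monthly_savings: float  # estimated $/month
    effort_cost: float      # estimated engineering cost in $
    risk_cost: float        # $ allowance for quality regressions

    @property
    def roi(self) -> float:
        # ROI as defined in the answer: savings / (effort + risk)
        return self.monthly_savings / (self.effort_cost + self.risk_cost)


prompt_savings = prompt_reduction_savings(100_000, input_cost_share=0.40, input_reduction=0.20)

# Every figure below except prompt_savings is a hypothetical placeholder.
candidates = [
    Optimization("prompt trimming", prompt_savings, effort_cost=20_000, risk_cost=10_000),
    Optimization("semantic caching", 15_000, effort_cost=25_000, risk_cost=5_000),
    Optimization("model routing", 25_000, effort_cost=30_000, risk_cost=10_000),
]
ranked = sorted(candidates, key=lambda o: o.roi, reverse=True)
for opt in ranked:
    print(f"{opt.name}: ${opt.monthly_savings:,.0f}/month, ROI {opt.roi:.2f}")
```

The ranking depends entirely on the estimates fed in, which is the point of step (4): measure actual savings and quality impact on a small test before committing.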

Problem 3. [Staff] Budget planning:

"AI costs last year: $1.2M. Budget for next year: $1.5M (25% growth buffer)."

Answer

Problems: (1) Usage growth not considered—if usage grows 100%, 25% buffer is wildly insufficient. (2) New projects not considered—what’s planned? Each new AI feature adds cost. (3) Unit economics not tracked—is cost per query stable or increasing? (4) No efficiency targets—just padding previous spend doesn’t drive improvement. (5) External factors—API pricing changes, new model releases, infrastructure changes. Better approach: (1) Forecast usage growth by product/feature. (2) Apply unit economics (cost per query/user/transaction). (3) Add planned new projects with estimated costs. (4) Set efficiency targets (reduce cost per query by X%). (5) Build in contingency (10-15%) for unknowns. Bottom-up budgeting: Forecast = Σ(product usage × unit cost) + new projects + contingency - efficiency gains.
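The bottom-up formula above can be written as a short function. This is a sketch with assumptions labeled: the function name, the choice to apply the efficiency target to the base forecast, and the choice to size contingency on the post-efficiency subtotal are all illustrative, and the example volumes and unit cost are hypothetical.

```python
def bottom_up_budget(products, new_projects, contingency_rate=0.10, efficiency_target=0.0):
    """Forecast = sum(usage x unit cost) - efficiency gains + new projects + contingency.

    products: list of (forecast_units, unit_cost) pairs, one per product or feature.
    new_projects: list of estimated annual costs for planned work.
    """
    base = sum(units * unit_cost for units, unit_cost in products)
    efficiency_gains = base * efficiency_target
    subtotal = base - efficiency_gains + sum(new_projects)
    contingency = subtotal * contingency_rate
    return subtotal + contingency


# Hypothetical inputs: last year's $1.2M was 150M queries at $0.008 each,
# and usage is forecast to double.
budget = bottom_up_budget(
    products=[(300_000_000, 0.008)],
    new_projects=[200_000],
    contingency_rate=0.10,
    efficiency_target=0.10,
)
print(f"bottom-up forecast: ${budget:,.0f} vs. padded budget: $1,500,000")
```

With these illustrative inputs the forecast lands well above the padded $1.5M figure, which is the failure mode the answer describes when usage growth is ignored.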

Design Exercises

Exercise 1. [Staff] Design a cost optimization roadmap for an AI platform spending $500K/month. Current state: No caching, single model for all requests, on-demand GPU instances, no cost visibility by team. Target: 30% cost reduction in 6 months. Outline initiatives, expected savings, effort, and sequencing.

Guidance

Initiative prioritization: (1) Quick wins (Month 1-2): Cost visibility dashboard (shows where money goes), basic request caching (semantic similarity, estimate 10-15% savings), reserved/spot instances for predictable workloads (20-30% infra savings). (2) Medium effort (Month 2-4): Model routing: classify requests and route simple ones to cheaper/smaller models. If 50% of requests can use a cheaper model at 10% of the price, you save 90% on those requests, a ceiling of ~45% of LLM spend; budget a more conservative ~20% to allow for misrouted requests and quality fallbacks. (3) Longer term (Month 4-6): Prompt optimization (systematic review), batch processing for non-urgent requests, cost allocation to drive team accountability. Expected savings: Caching (10%) + model routing (20%) + instance optimization (15% of infra) + prompt optimization (5%) sums to 35%+ on paper; because each saving applies to spend already reduced by the others, plan on roughly 30-35%, which still meets the target. Sequencing: Start with visibility (can’t optimize what you can’t measure), then quick infrastructure wins, then application-level optimizations requiring more effort.
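Stacked optimizations do not add linearly, because each one applies to spend the previous ones have already reduced. A minimal sketch of that effect, using the caching, routing, and prompt figures from the guidance (the function name is an assumption, and the infrastructure saving is left out because it applies to a different cost base):

```python
import math


def combined_savings(fractions):
    """Total fractional saving when each optimization applies to what is left."""
    remaining = math.prod(1 - f for f in fractions)
    return 1 - remaining


request_level = [0.10, 0.20, 0.05]  # caching, model routing, prompt optimization
additive = sum(request_level)
compounded = combined_savings(request_level)
print(f"additive estimate:   {additive:.1%}")
print(f"compounded estimate: {compounded:.1%}")
```

The compounded figure is the safer number to commit to in a roadmap; the additive sum is an upper bound.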

Exercise 2. [Staff] You’re presenting an AI infrastructure investment ($2M over 2 years) to the CFO. The investment will enable self-hosting models currently costing $80K/month in API fees, plus enable new capabilities. Build the business case with financial projections, risks, and alternatives.

Guidance

Business case structure: (1) Current state: $80K/month API = $1.92M over 2 years. With spend growing 30%/year, that becomes roughly $2.5M. (2) Investment: $2M (hardware + setup + engineering). (3) Operating cost: $15K/month ongoing (maintenance, power, networking) = $360K over 2 years. (4) Total self-hosted cost: $2M + $360K = $2.36M. (5) Direct ROI: Negative initially. On these figures, cumulative API spend overtakes cumulative self-hosted cost only around month 23, and later if API fees continue during the build-out. Positive ROI in year 3. (6) Additional benefits: New capabilities (fine-tuning, data privacy, latency control). Quantify if possible. (7) Risks: Technology obsolescence, a wrong usage forecast, quality issues. Include mitigation plans. (8) Alternatives: Continue with the API (higher long-term cost) or a hybrid approach (partial self-hosting). Recommendation: Present as a staged investment: an initial pilot to prove value, then scale upon validation. CFOs prefer optionality and proven results over big upfront bets.
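The break-even month is sensitive to how growth and spend timing are modeled, so it is worth recomputing rather than trusting a single figure. The sketch below makes three simplifying assumptions that are not in the guidance: the full $2M is counted in month 0, self-hosting replaces API spend immediately, and API spend compounds monthly at the annual growth rate. The function name is illustrative.

```python
def break_even_month(api_monthly, annual_growth, upfront, opex_monthly, horizon_months=36):
    """First month in which cumulative API spend reaches cumulative self-hosted cost.

    Returns None if the lines do not cross within the horizon.
    """
    monthly_growth = (1 + annual_growth) ** (1 / 12)
    api_cumulative = 0.0
    for month in range(1, horizon_months + 1):
        api_cumulative += api_monthly * monthly_growth ** (month - 1)
        self_hosted_cumulative = upfront + opex_monthly * month
        if api_cumulative >= self_hosted_cumulative:
            return month
    return None


with_growth = break_even_month(80_000, 0.30, upfront=2_000_000, opex_monthly=15_000)
flat_usage = break_even_month(80_000, 0.00, upfront=2_000_000, opex_monthly=15_000)
print("break-even month with 30%/year growth:", with_growth)
print("break-even month with flat usage:", flat_usage)
```

Showing the CFO both the growth case and the flat-usage case makes the forecast risk explicit, which supports the staged-investment recommendation.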

Connections to Other Chapters

Cost engineering connects to technical decisions across the book. Here’s how this chapter relates:

Chapter 9 (LLM Deployment & Infrastructure): Technical foundations of deployment that drive cost structure. Understanding serving architectures helps optimize costs.
Chapter 27 (Performance Engineering): Performance optimizations that directly impact cost efficiency. Faster inference means lower GPU-hours per request.
Chapter 31 (Reliability Engineering): Reliability investments affect cost—redundancy costs money but prevents expensive incidents.
LLM Provider Comparison (Appendix K): Detailed pricing models and cost comparison across major LLM providers. Essential reference for build vs. buy decisions.

Part V Checkpoint: Staff+ Engineering Complete

Congratulations on completing Part V and the entire book! You’ve covered the architectural thinking, strategic decision-making, and leadership skills required at Staff+ levels.

Skills Checklist

Design systems at scale (Ch25): Apply distributed systems patterns, handle multi-region deployment, and design for failure
Make technical decisions (Ch26): Use decision frameworks, write ADRs, and build consensus on architectural choices
Optimize performance (Ch27): Profile systems, identify bottlenecks, and apply GPU/inference optimizations
Bridge research and production (Ch28): Evaluate papers critically, prototype responsibly, and know when to adopt new techniques
Lead across teams (Ch29): Influence without authority, build technical consensus, and drive organization-wide initiatives
Architect data systems (Ch30): Design feature stores, implement data pipelines, and ensure data quality at scale
Engineer reliability (Ch31): Set SLOs, implement graceful degradation, and run incident response
Manage costs (Ch32): Model TCO, optimize resource utilization, and make build-vs-buy decisions

Quick Self-Test (10 minutes)

Q1. Your global AI system needs 99.9% availability. How do you architect for this across regions?

Q2. A promising paper shows a 30% latency improvement, and your team wants to adopt it immediately. What’s your response?

Q3. Two teams disagree on whether to build a custom solution or use an existing framework. How do you drive resolution?

Q4. Your LLM serving costs are 3x budget. Walk through your cost reduction approach.

Check Your Answers

A1. Multi-region active-active with: (1) Load balancing across regions with health checks, (2) Data replication with eventual consistency where acceptable, (3) Circuit breakers to isolate failures, (4) Graceful degradation to simpler models during outages, (5) Chaos testing to verify resilience.

A2. (1) Reproduce the results in your environment first, (2) Understand the tradeoffs (what did they sacrifice?), (3) Benchmark on your actual workload, (4) Prototype in staging, (5) Plan incremental rollout with rollback capability. Enthusiasm is good; prudent adoption is better.

A3. (1) Get both teams to articulate their criteria for success, (2) Document requirements and constraints objectively, (3) Evaluate options against criteria with evidence, (4) Identify the reversibility of each decision, (5) Propose a time-boxed experiment if uncertainty is high, (6) Make the call if consensus isn’t emerging and document the reasoning.

A4. (1) Profile to find the biggest cost drivers (usually GPU utilization, batch efficiency, or over-provisioning), (2) Quick wins: enable batching, right-size instances, use spot/preemptible, (3) Medium-term: implement caching, route simple requests to smaller models, (4) Strategic: evaluate self-hosting vs. API tradeoffs at your scale, (5) Set cost budgets per feature and make teams accountable.

You’ve Completed the Book!

If you can confidently check all boxes above, you have the knowledge to operate as a Staff+ AI Engineer. Remember:

Technical depth is necessary but not sufficient—Staff+ impact comes from multiplying your expertise through others
The field moves fast—commit to continuous learning with the practices from Chapter 21
Production is the test—everything in this book is theory until you ship systems that handle real traffic

The appendices contain reference materials you’ll return to throughout your career: the glossary, tool comparisons, paper reading lists, case studies, and templates.

Welcome to Staff+ AI Engineering.

---
title: "Chapter 32: Cost Engineering"
keywords: [cost optimization, GPU costs, TCO, FinOps, capacity planning, build vs buy, unit economics, ROI]
difficulty: advanced
prerequisites: [ch09, ch25]
estimated_time: "3-4 hours"
---

## Introduction

In early 2024, an AI startup celebrated a milestone: their LLM-powered product had reached 100,000 daily active users. The celebration was short-lived. Their cloud bill arrived at $847,000 for the month, more than triple their revenue. The CTO's post-mortem revealed a cascade of cost failures: inference endpoints running at 15% GPU utilization, verbose prompts averaging 8,000 tokens when 2,000 would suffice, no caching despite 60% query similarity, and premium API pricing for workloads that could have used smaller models. The technical architecture was sound; the economic architecture was nonexistent.

This story repeats across the industry. AI systems are extraordinarily expensive compared to traditional software. A single GPU-hour costs what a CPU-hour costs for an entire day. Token costs compound with every user interaction. Training runs can consume more compute in hours than a conventional application uses in years. Yet many teams approach costs reactively: surprised by bills rather than anticipating them, cutting features rather than optimizing systems.

Cost engineering is the discipline of designing AI systems with economic sustainability as a first-class constraint. It is not about minimizing spend; sometimes spending more is precisely right. It is about understanding cost structures deeply enough to make informed tradeoffs, allocating resources to maximize value, and building systems that remain economically viable as they scale.

At Staff and Principal levels, you become accountable not just for systems that work, but for systems that work *economically*. You must translate GPU utilization metrics into business terms executives understand.
You must make build-versus-buy decisions with real numbers rather than intuition. You must design architectures where costs grow sub-linearly with usage rather than exploding at scale.

This chapter provides the theoretical foundations and practical frameworks for AI cost engineering. We begin with why AI is expensive: the fundamental economics that make cost engineering essential. We then develop cost modeling techniques, explore GPU economics, and build decision frameworks for the choices that dominate AI budgets.

### The Core Insight: Why AI Cost Engineering is Different

Traditional software cost engineering focuses on compute, storage, and bandwidth, resources that are relatively cheap and well-understood. AI cost engineering introduces fundamentally different challenges:

1. **Compute intensity**: A single inference call through a large language model requires more floating-point operations than a traditional web request by factors of millions. GPUs that cost $10,000+ sit idle waiting for the next request.
2. **Marginal costs that compound**: Every user interaction consumes tokens, GPU cycles, or API calls. Unlike traditional software where marginal costs approach zero, AI systems have meaningful per-request costs that scale with usage.
3. **Nonlinear scaling**: Larger models don't just cost proportionally more; costs often scale superlinearly with capability. A 70B parameter model costs far more than 10x a 7B model to serve.
4. **Hidden costs everywhere**: Fine-tuning data curation, evaluation dataset construction, monitoring for drift, retraining cycles: costs that don't appear in simple infrastructure budgets but dominate total cost of ownership.
5. **Rapid obsolescence**: Models that were state-of-the-art six months ago may be outperformed by models that cost 10x less to run. Cost optimization is a moving target.

These dynamics mean that cost engineering for AI requires different mental models than traditional infrastructure.
You must think about cost per token, GPU utilization curves, and inference optimization in ways that don't apply to conventional systems.

### A Mental Model for AI Costs

::: {.callout-important}
## Mental Model: The 10x cost cliff

Plot API spend and self-hosted TCO on the same chart as request volume grows, and the two lines cross at a brutal inflection point, somewhere in the low hundreds of thousands of requests per month for a typical 7B-13B workload. Below the cliff, managed APIs win on every axis: zero fixed cost, no ops burden, model upgrades for free. Above the cliff, the same API bill is suddenly ~10x what disciplined self-hosting would cost, and the gap widens monotonically. **Heuristic: forecast your 12-month volume; if you'll cross the cliff, start the self-hosting build *before* you hit it, not after.**
:::

Once you've mapped where you sit relative to **the 10x cost cliff**, the rest of cost engineering falls into place. And the API-vs-self-host crossover is only the most visible cliff. AI costs don't scale linearly anywhere; they have cliffs everywhere, thresholds where costs suddenly jump by 10x or more. Understanding *all* of them prevents budget surprises:

**The context cliff**: Going from 4K to 32K context doesn't cost 8x more: the raw attention FLOPs scale quadratically, so 8x the sequence is ~64x the attention compute. The multipliers in this section are illustrative orders of magnitude, not measured constants. In practice the *served* cost of long context is far below the naive 64x: FlashAttention and paged KV-cache (see Chapter 27) keep attention memory-bandwidth-bound rather than compute-bound, and prefix/KV caching amortizes shared context across requests. The quadratic term still dominates eventually, but you pay something closer to linear-plus-a-quadratic-tail, not a clean 64x. A document that barely fit in one API call might still need to be chunked, and chunking has its own costs (overlap, multiple calls, synthesis).
**The quality cliff**: The gap between "good enough" and "excellent" is often 10x cost. A task that Haiku handles 80% correctly might need Opus for 95%. At the list prices quoted later in this chapter, that is 5x the per-token price for the last 15 points of accuracy. Sometimes it's worth it. Often it isn't.

**The latency cliff**: Reducing P99 latency from 2s to 200ms might require 10x more GPU capacity. Tail latencies are expensive because they require provisioning for worst-case, not average-case.

**The freshness cliff**: Real-time features cost exponentially more than daily batch features. Going from daily to hourly might be 3x. Hourly to minute-level might be 10x. Minute to second might be 50x. (As with the other cliffs, treat these multipliers as illustrative magnitudes, not benchmarked figures; the actual factor depends on your infrastructure.) Each step requires more infrastructure, more monitoring, more on-call burden.

**The reliability cliff**: Going from 99% to 99.9% availability requires roughly 10x investment. Going from 99.9% to 99.99% requires another 10x. Each "nine" costs an order of magnitude.

The strategic implication: **know where your cliffs are and stay on the cheap side unless you have a compelling reason to jump.** Many teams accidentally fall off cliffs by adding features without understanding cost implications. A product manager asks for "just a bit longer context" and suddenly infrastructure costs triple. Before any architectural decision, ask: "Are we near a cliff? What happens if requirements push us over?"

Think of an AI system as a factory with unusual economics. Traditional software factories have high fixed costs (development, infrastructure) but near-zero marginal costs: serving the millionth user costs almost nothing more than the first. AI factories have high fixed costs *and* significant marginal costs: every inference consumes real resources.
This creates a different optimization landscape:

- **Traditional software**: Minimize fixed costs, scale efficiently (marginal costs negligible)
- **AI systems**: Balance fixed costs (infrastructure, development) against marginal costs (compute per request, tokens per interaction)

The optimal operating point depends on scale. At low volume, using external APIs with high per-unit cost but zero fixed cost dominates. At high volume, building infrastructure with high fixed costs but low marginal costs wins. Finding the crossover point, and anticipating how quickly you'll reach it, is a core cost engineering skill.

### What You'll Learn

- The fundamental economics of GPU compute and why utilization matters so much
- Total cost of ownership analysis for AI systems
- Token cost calculations and optimization strategies
- Cost attribution methods that create accountability
- Build-versus-buy decision frameworks with real numbers
- Self-hosting breakeven analysis
- Strategies for building cost-aware engineering culture

### Prerequisites

- System design fundamentals (Chapter 25)
- Performance engineering concepts (Chapter 27)
- Basic understanding of GPU architectures and inference pipelines

---

## The Economics of AI Compute

### Why GPUs Are Expensive, and Why Utilization Matters

To understand AI costs, you must understand GPU economics. A single NVIDIA H100 GPU costs approximately $30,000 to purchase. Cloud providers charge $3-5 per hour for access. Why so expensive?

**The hardware reality**: GPUs contain thousands of specialized cores optimized for the matrix multiplications that dominate neural network computation. An H100 delivers up to 1,979 teraflops of FP16 compute (NVIDIA's peak tensor-core figure with sparsity; dense throughput is about half that), roughly 100x the throughput of a high-end CPU for AI workloads. This specialized capability commands a premium.

**The supply constraint**: GPU manufacturing requires cutting-edge fabrication facilities that cost tens of billions of dollars to build.
TSMC's advanced processes are capacity-constrained, limiting the number of high-end chips that can be produced. Demand from AI training and inference consistently exceeds supply.

**The utilization imperative**: Because GPUs are expensive, utilization is everything. Consider the math:

```
H100 cloud cost: $4.00/hour
If utilized at 100%: $4.00/effective-GPU-hour
If utilized at 25%: $16.00/effective-GPU-hour
If utilized at 10%: $40.00/effective-GPU-hour
```

A GPU sitting idle at 10% utilization costs four times as much per effective compute unit as one running at 40%. Yet production systems routinely run at 20-30% utilization because:

- **Request variability**: Traffic spikes require headroom
- **Batch accumulation delays**: Waiting for batches to fill adds latency
- **Memory fragmentation**: Different request sizes prevent efficient packing
- **Cold start overhead**: Model loading and warmup consume cycles

Research by Patterson et al. (2022) at Google found that typical ML infrastructure achieves 30-50% GPU utilization in production, meaning half or more of GPU spend is wasted on idle cycles.

### The GPU Utilization Curve

GPU costs don't scale linearly with utilization. There's a complex relationship between utilization, latency, and throughput:

![GPU Cost per Request vs. Utilization](../assets/diagrams/rendered/ch29_gpu_cost_curve.svg)

The shape of this curve explains why the difference between 20% and 40% utilization matters far more than the difference between 70% and 90%. Cost per request is inversely proportional to utilization, so moving from 20% to 40% halves it, while moving from 70% to 90% cuts it by only about 22%.

**Practical implication**: The first priority in GPU cost optimization is always moving from low utilization to moderate utilization. This provides the highest return on engineering investment.

::: {.callout-warning}
## Common Mistake: Ignoring Idle GPU Costs

**What people do:** Focus on optimizing inference latency and throughput while leaving GPUs running 24/7, regardless of actual demand.
**Why it fails:** GPU costs are dominated by idle time, not inference time. If your traffic is bursty (most applications), GPUs sitting idle during low-traffic hours can represent 70%+ of total compute spend. A $30K/month GPU bill might include $20K of idle time.

**Fix:** Implement auto-scaling based on queue depth, not just CPU metrics. Use spot/preemptible instances for batch workloads. Consider serverless inference (AWS Lambda, Google Cloud Run) for low-volume or bursty traffic. Track and alert on utilization: if it's consistently below 40%, you're likely over-provisioned.
:::

::: {.callout-warning}
## Common Mistake: No Cost Caps on LLM Calls

**What people do:** Call LLM APIs without max_tokens limits or budget caps, trusting that normal usage won't be expensive.

**Why it fails:** One bug can cause a cost explosion. A retry loop without backoff. A prompt that triggers verbose output. An agent that loops forever. We've seen teams wake up to $50K+ surprise bills from overnight batch jobs or runaway agents. API providers are happy to let you spend.

**Fix:** Always set max_tokens to the actual maximum you need. Implement per-request, per-user, and per-hour cost limits. Add circuit breakers that stop LLM calls if spend exceeds thresholds. Log and alert on cost anomalies (e.g., any hour that's 3x the previous day's average).
:::

### Token Economics

For LLM-based systems, costs are denominated in tokens. Understanding token economics is essential for cost modeling.

**What determines token cost?**

Token cost reflects the compute required for inference, which depends on:

1. **Model size**: Larger models require more parameters to load and more operations per token
2. **Attention complexity**: Self-attention scales quadratically with sequence length
3. **Input vs. output tokens**: Output tokens require full forward passes; input tokens can be processed in parallel
4. **Hardware efficiency**: Newer GPUs provide more tokens per dollar

**The API pricing landscape (as of early 2026)**:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Relative Cost |
|-------|----------------------|------------------------|---------------|
| GPT-5.5 | $5.00 | $30.00 | 1.0x |
| Claude Opus 4.8 | $5.00 | $25.00 | 0.83x |
| GPT-5.4 | $2.50 | $15.00 | 0.5x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 0.5x |
| Claude Haiku 4.5 | $1.00 | $5.00 | 0.17x |
| Llama 4 Maverick (API) | $0.22 | $0.85 | 0.03x |

These prices encode profound economics. The ~35x price difference between GPT-5.5 and Llama 4 Maverick reflects the premium for proprietary frontier models versus open-weight alternatives. Be careful, though: **list API prices are strategic, margin-driven numbers; they are not a direct proxy for the underlying cost of compute.** Providers price to capture value, segment customers, and fund R&D, so a 35x list-price gap does not mean a 35x difference in GPU cost to serve. The honest cost-of-compute comparison is the self-hosting analysis later in this chapter, which prices the GPU-seconds directly; API prices tell you what you'll pay, not what the inference costs to produce. The 5-6x ratio between input and output tokens, by contrast, does track a real computational asymmetry of autoregressive generation.
**The token cost calculation**: ```python def calculate_request_cost( input_tokens: int, output_tokens: int, input_price_per_million: float, output_price_per_million: float ) -> float: """Calculate cost for a single LLM request.""" input_cost = (input_tokens / 1_000_000) * input_price_per_million output_cost = (output_tokens / 1_000_000) * output_price_per_million return input_cost + output_cost # Example: GPT-5.5 request with 2000 input, 500 output tokens cost = calculate_request_cost(2000, 500, 5.00, 30.00) # = (2000/1M * $5.00) + (500/1M * $30.00) # = $0.010 + $0.015 = $0.025 per request ``` At ~$0.025 per request, 1 million daily requests costs $25,000/day—$750,000/month. This is the marginal cost that traditional software doesn't have—and it's why cost engineering matters. ::: {.callout-warning} ## The $2M Token Bill **Setup**: A fintech startup ran nightly batch jobs to analyze customer transactions using GPT-5.4. The jobs had worked flawlessly for months, processing about 50,000 transactions per night at a cost of roughly $500. **What happened**: On a Thursday night, an engineer deployed a "minor" change to improve analysis quality—they removed the `max_tokens` parameter to "let the model fully express its reasoning." The batch job started at 11 PM. By Friday morning, the cloud bill had reached $2.1 million. Without output limits, the model generated verbose analyses averaging 15,000 tokens per transaction instead of the usual 500. The job ran for 8 hours, burning through 750 million output tokens at $15 per million. The engineer's phone buzzed with a fraud alert from their credit card company before anyone noticed the API costs. **Why it happened (root cause)**: No programmatic cost caps. The API would happily accept unlimited requests at unlimited output lengths. The team assumed "max_tokens was just for speed" and didn't realize it was their primary cost control. **Fix**: 1. Implemented hard per-request cost limits in a gateway layer 2. 
Added hourly cost monitoring with automatic shutoff at 120% of expected spend 3. Required `max_tokens` in all API calls via a wrapper library 4. Created nightly budget alerts with escalation to on-call

**Takeaway**: Every API call needs a cost ceiling. Hope is not a budget strategy.
:::

::: {.panel-tabset}
## Unbounded Calls

```python
def analyze_document(doc: str) -> str:
    # No budget-aware limits: cost grows with document size and model verbosity
    response = client.messages.create(
        model="claude-opus-4-8",
        # The Messages API requires max_tokens, so the anti-pattern here is a
        # large constant that is never sized to the task (16_000 is illustrative)
        max_tokens=16_000,
        messages=[{
            "role": "user",
            "content": f"Analyze this document thoroughly:\n\n{doc}"
        }]
        # Oversized max_tokens = model effectively decides output length
        # Long doc = expensive input tokens
        # "thoroughly" = model writes essays
    )
    return response.content[0].text
```

## Cost-Capped Generation

```python
# Assumes `client` is an anthropic.AsyncAnthropic instance. ModelPricing,
# DailySpendTracker, AnalysisResponse, and count_tokens are application helpers.
class CostAwareLLM:
    def __init__(self, budget_per_request: float = 0.10):
        self.budget = budget_per_request
        self.pricing = ModelPricing()  # Current rates
        self.daily_spend = DailySpendTracker()

    async def analyze_document(
        self,
        doc: str,
        user_id: str,
        priority: str = "normal"
    ) -> AnalysisResponse:
        # Check daily budget before processing
        if self.daily_spend.remaining(user_id) <= 0:
            return AnalysisResponse(
                error="daily_budget_exhausted",
                fallback=self.get_cached_or_summary(doc)
            )

        # Calculate max affordable tokens
        input_tokens = count_tokens(doc)
        input_cost = self.pricing.input_cost(input_tokens)
        remaining_budget = self.budget - input_cost

        if remaining_budget <= 0:
            # Document too long for budget - truncate or chunk
            doc = self.smart_truncate(doc, target_tokens=5000)
            input_tokens = count_tokens(doc)
            input_cost = self.pricing.input_cost(input_tokens)
            remaining_budget = self.budget - input_cost

        max_output_tokens = self.pricing.tokens_for_budget(
            remaining_budget, output=True
        )
        # Hard cap at 2000; floor at 1 so an exhausted budget cannot
        # produce a zero or negative max_tokens
        max_output_tokens = max(1, min(max_output_tokens, 2000))

        # Select model based on priority and budget
        model = self.select_model(priority, remaining_budget)

        response = await client.messages.create(
            model=model,
            max_tokens=max_output_tokens,  # Always set!
            messages=[{
                "role": "user",
                "content": f"""Analyze this document concisely.
Focus on: key findings, action items, risks.
Be direct - no filler.

<document>
{doc}
</document>"""
            }]
        )

        # Track actual spend
        actual_cost = self.pricing.calculate_cost(
            response.usage.input_tokens,
            response.usage.output_tokens,
            model
        )
        self.daily_spend.record(user_id, actual_cost)

        return AnalysisResponse(
            content=response.content[0].text,
            cost=actual_cost,
            model_used=model,
            tokens_used=response.usage.input_tokens + response.usage.output_tokens
        )

    def select_model(self, priority: str, budget: float) -> str:
        if priority == "high" and budget > 0.05:
            return "claude-opus-4-8"
        elif budget > 0.02:
            return "claude-sonnet-4-6"
        else:
            return "claude-haiku-4-5-20251001"  # Always affordable
```

**Key differences**: Per-request budget caps, input size management, calculated max_tokens based on remaining budget, model selection by cost tier, daily spend tracking per user, and concise prompt design. The $2M incident becomes impossible.
:::

### The Inference Efficiency Frontier

Different inference optimization techniques trade off along multiple dimensions:

![Latency vs Cost Tradeoff](../assets/diagrams/rendered/ch29_latency_cost_tradeoff.svg)

The efficient frontier represents the best achievable combinations of cost and latency. Points below the frontier are achievable; points above require engineering innovation or accepting worse tradeoffs.

Key techniques on the frontier:

1. **Continuous batching**: Process requests as they arrive rather than waiting for fixed batches. Reduces latency while maintaining throughput. A 30-50% latency reduction is typical.
2. **Speculative decoding**: Use a small model to draft tokens, verify with the large model. Reduces large model calls by 2-3x for standard text.
3. **Quantization**: Reduce precision from FP16 to INT8 or INT4. Cuts memory and compute by 2-4x with 1-3% quality degradation.
4. **KV cache optimization**: Reuse computed attention states for prefix-sharing requests. Enables prompt caching with 50-80% input cost reduction.
5. **Model distillation**: Train smaller models on larger model outputs. Can achieve 90% quality at 10% cost for domain-specific tasks.

> **Full implementation**: See [reference/26b_cost_engineering_code.md](../reference/26b_cost_engineering_code.html#inference-optimization-techniques) for implementation details of each technique.

---

## Total Cost of Ownership Analysis

### The TCO Framework

Total Cost of Ownership (TCO) captures all costs associated with an AI system over its lifetime. TCO analysis prevents the common failure of optimizing visible costs while ignoring larger hidden costs.

**TCO components**:

![AI System TCO Structure](../assets/diagrams/rendered/ch29_tco_structure.svg)

**Why indirect costs matter**: A 2023 study by Andreessen Horowitz found that for enterprise AI systems, engineering time represents 40-60% of total costs, often exceeding infrastructure spend. A system that costs $50,000/month in compute but requires a full-time engineer ($15,000/month fully loaded) to maintain carries hidden costs equal to 30% of its compute bill. A system that costs $100,000/month but runs autonomously may be cheaper overall.

::: {.callout-tip}
## Staff Engineer Perspective: The Art of Vendor Negotiation

"I've negotiated LLM API contracts with every major provider. The list prices are the starting point, not the answer. At scale, everything is negotiable: per-token rates, minimum commits, reserved capacity, support tiers, even payment terms. The key is leverage: either you have enough volume that they want your business, or you have a credible alternative (open-weight models, self-hosting, competitors).

My playbook: (1) Get usage data: monthly spend, growth trajectory, use case breakdown. (2) Identify your alternative; even if you don't want to use it, knowing you could gives you leverage. (3) Start conversations 3 months before renewals.
(4) Ask for multi-year discounts; providers love predictable revenue. (5) Get the enterprise sales team involved; they have flexibility that developer relations doesn't.

The discount range varies, but at $10K+/month spend, you should be getting at least 15-20% off list price. At $100K+/month, 30-40% is common. If you're not asking, you're leaving money on the table."

—*Principal Engineer, formerly Head of AI Platform*
:::

### Calculating TCO

```python
from dataclasses import dataclass


@dataclass
class TCOAnalysis:
    """Complete TCO analysis for an AI system."""

    # Direct costs (monthly)
    compute_training: float
    compute_inference: float
    storage: float
    external_apis: float
    data_processing: float

    # Indirect costs (monthly, derived from annual rates)
    engineering_development: float  # FTE allocation * fully-loaded cost
    engineering_operations: float
    engineering_maintenance: float

    # Hidden costs (often forgotten)
    compliance_security: float  # ~5-10% of direct costs
    monitoring_observability: float  # ~3-5% of direct costs
    disaster_recovery: float  # ~2-5% of direct costs

    def monthly_direct(self) -> float:
        return (self.compute_training + self.compute_inference +
                self.storage + self.external_apis + self.data_processing)

    def monthly_indirect(self) -> float:
        return (self.engineering_development + self.engineering_operations +
                self.engineering_maintenance)

    def monthly_hidden(self) -> float:
        return (self.compliance_security + self.monitoring_observability +
                self.disaster_recovery)

    def monthly_tco(self) -> float:
        return self.monthly_direct() + self.monthly_indirect() + self.monthly_hidden()

    def annual_tco(self) -> float:
        return self.monthly_tco() * 12
```

**Example: Recommendation System TCO**

Consider a mid-scale recommendation system serving 10 million requests per day:

| Category | Monthly Cost | % of TCO |
|----------|-------------|----------|
| **Direct Costs** | | |
| Inference compute (4x A100 instances) | $28,000 | 35% |
| Training compute (periodic retraining) | $8,000 | 10% |
| Feature store (1TB) | $3,000 | 3.8% |
| Embedding API calls | $5,000 | 6.3% |
| Data processing pipeline | $4,000 | 5% |
| **Indirect Costs** | | |
| ML Engineer (0.5 FTE) | $12,500 | 15.6% |
| Platform Engineer (0.3 FTE) | $7,500 | 9.4% |
| Data Engineer (0.2 FTE) | $5,000 | 6.3% |
| **Hidden Costs** | | |
| Monitoring & observability | $3,000 | 3.8% |
| Security & compliance | $2,500 | 3.1% |
| Backup & DR | $1,500 | 1.9% |
| **Total Monthly TCO** | **$80,000** | **100%** |

This analysis reveals that engineering time (31% of TCO) rivals compute costs (45%). Optimizing only the cloud bill misses nearly a third of the cost structure.

> **Full implementation**: See [reference/26b_cost_engineering_code.md](../reference/26b_cost_engineering_code.html#tco-calculator) for the complete TCOAnalysis class with scenario modeling.

### Unit Economics

TCO tells you total costs; unit economics tells you whether your business model works. Unit economics translates aggregate costs into per-unit metrics that can be compared against revenue or value generated.
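As a rough illustration of that translation, the sketch below converts the recommendation system's TCO from the example table into per-request figures. It assumes a 30-day month and constant daily traffic, and the variable names are illustrative.

```python
# Figures from the recommendation system example above.
MONTHLY_TCO_USD = 80_000        # total monthly TCO
MONTHLY_DIRECT_USD = 48_000     # sum of the five direct-cost rows
REQUESTS_PER_DAY = 10_000_000   # "10 million requests per day"
DAYS_PER_MONTH = 30             # assumption: 30-day month, flat traffic

monthly_requests = REQUESTS_PER_DAY * DAYS_PER_MONTH

# Fully loaded cost per request (direct + indirect + hidden).
cost_per_request = MONTHLY_TCO_USD / monthly_requests

# The same metric computed from the direct (cloud bill) costs only.
direct_cost_per_request = MONTHLY_DIRECT_USD / monthly_requests

# Share of the true per-request cost that a bill-only view misses.
understatement = 1 - direct_cost_per_request / cost_per_request

print(f"Fully loaded: ${cost_per_request * 1000:.3f} per 1,000 requests")
print(f"Direct only:  ${direct_cost_per_request * 1000:.3f} per 1,000 requests")
print(f"Bill-only view understates unit cost by {understatement:.0%}")
```

The per-request number is only meaningful next to a per-request revenue or value figure, which is what the metrics that follow are for.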
**Key unit economics metrics**:

```python
from dataclasses import dataclass


@dataclass
class UnitEconomics:
    """Unit economics for an AI service."""
    total_monthly_cost: float
    monthly_requests: int
    monthly_active_users: int
    monthly_revenue: float

    @property
    def cost_per_request(self) -> float:
        """Cost to serve one request."""
        return self.total_monthly_cost / max(1, self.monthly_requests)

    @property
    def cost_per_user(self) -> float:
        """Cost to serve one active user per month."""
        return self.total_monthly_cost / max(1, self.monthly_active_users)

    @property
    def gross_margin(self) -> float:
        """(Revenue - Cost) / Revenue."""
        if self.monthly_revenue <= 0:
            return float('-inf')
        return (self.monthly_revenue - self.total_monthly_cost) / self.monthly_revenue

    @property
    def ai_cost_ratio(self) -> float:
        """AI costs as percentage of revenue."""
        if self.monthly_revenue <= 0:
            return float('inf')
        return self.total_monthly_cost / self.monthly_revenue
```

**Healthy unit economics benchmarks**:

| Metric | Concerning | Acceptable | Good |
|--------|------------|------------|------|
| AI cost ratio | >50% | 20-50% | <20% |
| Cost per request | >$0.10 | $0.01-$0.10 | <$0.01 |
| Gross margin | <30% | 30-60% | >60% |

**The unit economics trap**: Many AI startups fall into a trap: their unit economics worsen as they scale. This happens when:

1. **Usage grows faster than revenue**: Free tiers or flat-rate pricing with increasing usage
2. **Costs scale linearly while value doesn't**: Each request costs the same, but marginal value to users decreases
3. **No cost optimization investment**: Early-stage focus on features over efficiency

The solution is modeling unit economics at scale from day one:

```python
def project_unit_economics(
    current: UnitEconomics,
    growth_rate: float,  # Monthly user growth
    months: int,
    cost_efficiency_gain: float = 0.02  # Monthly efficiency improvement
) -> list[UnitEconomics]:
    """Project unit economics over time with growth and optimization."""
    projections = [current]

    for _ in range(months):
        prev = projections[-1]

        # Users grow, costs grow (but more slowly with optimization)
        new_users = int(prev.monthly_active_users * (1 + growth_rate))
        new_requests = int(prev.monthly_requests * (1 + growth_rate * 1.2))  # Power users

        # Costs track request volume, discounted by the monthly efficiency gain
        # (max(1, ...) guards against a zero-request starting month)
        cost_scale = (new_requests / max(1, prev.monthly_requests)) * (1 - cost_efficiency_gain)
        new_costs = prev.total_monthly_cost * cost_scale

        # Revenue assumed to scale with users
        new_revenue = prev.monthly_revenue * (new_users / max(1, prev.monthly_active_users))

        projections.append(UnitEconomics(
            total_monthly_cost=new_costs,
            monthly_requests=new_requests,
            monthly_active_users=new_users,
            monthly_revenue=new_revenue
        ))

    return projections
```

---

## Cost Attribution and Accountability

### Why Attribution Matters

You cannot optimize what you cannot measure, and you cannot measure what you cannot attribute. Cost attribution assigns costs to the teams, projects, and features that incur them, creating visibility and accountability.

**Without attribution**: A single cloud bill shows $200,000/month. No one knows which team is responsible for what. Cost optimization is nobody's job because it's everybody's job.

**With attribution**: Team A's recommendation service costs $80,000/month, Team B's search service costs $50,000/month, and shared infrastructure costs $70,000. Each team can optimize their portion; leadership can evaluate ROI by team.

### Attribution Models

Different costs require different attribution approaches:

**1.
Direct attribution**: Costs clearly tied to a specific owner

```python
from dataclasses import dataclass

# CostItem is an application type with `amount`, `category`, and `metadata` fields.

# Example: API costs by team
@dataclass
class DirectAttributionRule:
    """Attribute costs directly to the owner in metadata."""

    def apply(self, cost_item: CostItem) -> dict[str, float]:
        owner = cost_item.metadata.get('team') or cost_item.metadata.get('project')
        if owner:
            return {owner: cost_item.amount}
        return {'unattributed': cost_item.amount}
```

Direct attribution works for:

- Team-specific compute instances
- Project-tagged API calls
- Dedicated storage buckets

**2. Proportional attribution**: Shared costs split by usage

```python
@dataclass
class ProportionalAttributionRule:
    """Split shared costs by usage proportion."""
    usage_metric: str  # e.g., 'requests', 'compute_seconds', 'tokens'

    def apply(self, cost_item: CostItem, usage_by_team: dict[str, float]) -> dict[str, float]:
        total_usage = sum(usage_by_team.values())
        if total_usage == 0:
            return {'unattributed': cost_item.amount}

        return {
            team: cost_item.amount * (usage / total_usage)
            for team, usage in usage_by_team.items()
        }
```

Proportional attribution works for:

- Shared GPU clusters (split by GPU-hours consumed)
- Shared vector databases (split by queries or storage)
- Platform infrastructure (split by requests handled)

**3.
Activity-based attribution**: Costs assigned based on specific activities

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ActivityBasedRule:
    """Attribute based on specific activities that drive costs."""
    activity_costs: dict[str, float]  # Cost per activity type

    def apply(self, activities: list[Activity]) -> dict[str, float]:
        attribution = defaultdict(float)
        for activity in activities:
            cost = self.activity_costs.get(activity.type, 0)
            attribution[activity.team] += cost * activity.count
        return dict(attribution)
```

Activity-based attribution works for:

- Training jobs (cost per GPU-hour by job owner)
- Experiment runs (cost per experiment by researcher)
- Model deployments (cost per deployment by team)

### Implementing Cost Attribution

A production cost attribution system requires:

**1. Tagging discipline**: All resources tagged with owner, project, environment

```python
REQUIRED_TAGS = ['team', 'project', 'environment', 'cost_center']

def validate_resource_tags(resource: Resource) -> list[str]:
    """Return list of missing required tags."""
    return [tag for tag in REQUIRED_TAGS if tag not in resource.tags]

def enforce_tagging_policy(resource: Resource) -> bool:
    """Block resource creation if tags are missing."""
    missing = validate_resource_tags(resource)
    if missing:
        raise PolicyViolation(f"Resource missing required tags: {missing}")
    return True
```

**2. Usage tracking**: Collect usage metrics that drive costs

```python
from datetime import datetime, timezone


class UsageTracker:
    """Track usage metrics for cost attribution."""

    def __init__(self):
        self.metrics: list[dict] = []

    def record_request(self, team: str, service: str, tokens: int, gpu_ms: int):
        """Record a request with its resource consumption."""
        self.metrics.append({
            'timestamp': datetime.now(timezone.utc),
            'team': team,
            'service': service,
            'tokens': tokens,
            'gpu_ms': gpu_ms
        })

    def aggregate_by_team(self, start: datetime, end: datetime) -> dict[str, dict]:
        """Aggregate usage metrics by team for a period.

        `start` and `end` must be timezone-aware to compare with timestamps.
        """
        # Returns: {'team_a': {'tokens': 1M, 'gpu_ms': 50000}, ...}
        totals: dict[str, dict] = {}
        for m in self.metrics:
            if start <= m['timestamp'] < end:
                team = totals.setdefault(m['team'], {'tokens': 0, 'gpu_ms': 0})
                team['tokens'] += m['tokens']
                team['gpu_ms'] += m['gpu_ms']
        return totals
```

**3.
Attribution engine**: Apply rules to calculate allocations

```python
from collections import defaultdict


class CostAttributionEngine:
    """Engine for attributing costs to owners."""

    def __init__(self, rules: dict[str, AttributionRule]):
        # Assumption: rules are keyed by cost category (e.g. 'api', 'shared_gpu')
        self.rules = rules

    def _get_rule(self, category: str) -> AttributionRule:
        return self.rules[category]

    def calculate_attribution(
        self,
        costs: list[CostItem],
        usage: dict[str, dict]
    ) -> dict[str, float]:
        """Calculate cost attribution by team/project."""
        attribution = defaultdict(float)

        for cost in costs:
            rule = self._get_rule(cost.category)
            if isinstance(rule, ProportionalAttributionRule):
                # Proportional rules need per-team usage for their chosen metric
                usage_by_team = {
                    team: metrics.get(rule.usage_metric, 0.0)
                    for team, metrics in usage.items()
                }
                team_costs = rule.apply(cost, usage_by_team)
            else:
                # Direct rules need only the cost item; activity-based rules
                # are applied separately, against activity logs
                team_costs = rule.apply(cost)

            for team, amount in team_costs.items():
                attribution[team] += amount

        return dict(attribution)
```

> **Full implementation**: See [reference/26b_cost_engineering_code.md](../reference/26b_cost_engineering_code.html#cost-attribution-engine) for the complete attribution system.

### Creating Accountability

Attribution data only matters if it drives behavior. Effective accountability requires:

**1. Visibility**: Teams can see their costs in real-time

```python
class TeamCostDashboard:
    """Real-time cost visibility for engineering teams."""

    def get_team_summary(self, team: str) -> dict:
        return {
            'current_month': {
                'total_cost': self.get_mtd_cost(team),
                'budget': self.get_budget(team),
                'burn_rate': self.get_daily_average(team) * 30,
                'forecast': self.get_month_end_forecast(team)
            },
            'by_service': self.get_cost_by_service(team),
            'trends': self.get_weekly_trends(team),
            'top_cost_drivers': self.get_top_costs(team, n=5),
            'optimization_opportunities': self.get_recommendations(team)
        }
```

**2.
Budgets**: Teams have cost targets they're accountable for

```python
from dataclasses import dataclass
from enum import Enum


class BudgetStatus(Enum):
    ON_TRACK = "on_track"
    APPROACHING_LIMIT = "approaching_limit"
    OVER_BUDGET = "over_budget"
    EXCEEDED_HARD_LIMIT = "exceeded_hard_limit"


@dataclass
class CostBudget:
    """Team cost budget with alerting thresholds."""
    team: str
    monthly_budget: float
    alert_threshold: float = 0.80  # Alert at 80%
    hard_limit: float = 1.20  # Require approval above 120%

    def check_against_spend(self, current_spend: float) -> BudgetStatus:
        ratio = current_spend / self.monthly_budget

        if ratio > self.hard_limit:
            return BudgetStatus.EXCEEDED_HARD_LIMIT
        elif ratio > 1.0:
            return BudgetStatus.OVER_BUDGET
        elif ratio > self.alert_threshold:
            return BudgetStatus.APPROACHING_LIMIT
        else:
            return BudgetStatus.ON_TRACK
```

**3. Reviews**: Regular discussion of cost performance

Effective cost reviews happen at multiple cadences:

| Cadence | Focus | Attendees |
|---------|-------|-----------|
| Weekly | Anomalies, quick wins | Engineering leads |
| Monthly | Trends, budget performance | Team leads + Finance |
| Quarterly | TCO, strategic investments | Leadership |

---

## GPU Selection and Self-Hosting Economics

### The GPU Selection Decision

Choosing the right GPU is one of the highest-leverage cost decisions in AI. The wrong GPU wastes money through underutilization, overpayment, or insufficient performance.

**GPU selection framework**:

![GPU Selection Decision Tree](../assets/diagrams/rendered/ch26_gpu_decision_tree.svg)

**Key GPU comparisons (cloud pricing, late 2024)**:

| GPU | VRAM | FP16 Tensor TFLOPS (dense) | Hourly Cost | $/TFLOP-hr | Best For |
|-----|------|-------------|-------------|------------|----------|
| H100 SXM | 80GB | ~990 | $4.50 | $0.0045 | Large training |
| A100 80GB | 80GB | 312 | $2.50 | $0.0080 | Training, large inference |
| A100 40GB | 40GB | 312 | $1.75 | $0.0056 | Medium training/inference |
| L4 | 24GB | 121 | $0.50 | $0.0041 | Cost-efficient inference |
| A10G | 24GB | 125 | $0.75 | $0.0060 | Balanced inference |
| T4 | 16GB | 65 | $0.35 | $0.0054 | Budget inference |

**Decision heuristics**:

1. **For training**: Match GPU memory to model size.
Training a 7B parameter model in FP16 requires ~14GB for the weights alone, with gradients and optimizer states on top: a 40GB A100 is the minimum, and 80GB is preferred for larger batch sizes.
2. **For inference**: Match to your batch size and latency requirements. High-throughput batch inference benefits from larger GPUs (more parallelism); low-latency single requests are limited by memory bandwidth, not compute.
3. **For cost optimization**: Calculate cost per token or cost per request across GPU options. The cheapest GPU per hour is often not the cheapest per unit of work.

### The Self-Hosting Breakeven Analysis

When should you self-host models instead of using APIs? This is a classic build-versus-buy decision with quantifiable economics.

**The cost structure comparison**:

![Self-Hosting vs API Cost Structure](../assets/diagrams/rendered/ch29_self_hosting_vs_api.svg)

**Breakeven calculation**:

```python
import math
from dataclasses import dataclass


@dataclass
class SelfHostingAnalysis:
    """Analyze self-hosting vs API economics."""

    # API costs
    api_cost_per_request: float  # $ per request (input + output tokens priced out)

    # Self-hosting costs
    gpu_hourly_cost: float  # $ per GPU hour
    num_gpus: int
    gpu_utilization: float  # Expected utilization (0.0 - 1.0)
    engineering_monthly: float  # Engineering time for operations
    infrastructure_monthly: float  # Networking, monitoring, etc.

    # Throughput model (token-based, NOT a flat per-request latency)
    output_tokens_per_sec_per_system: float  # Aggregate decode throughput of the GPU set
    avg_output_tokens_per_request: float  # Output-length assumption, the key swing factor

    # Marginal cost per request once infrastructure is running. For hourly-billed
    # cloud GPUs, power and wear are already in the hourly rate, so default to 0.
    marginal_cost_per_request: float = 0.0

    @property
    def monthly_fixed_cost(self) -> float:
        """Fixed monthly cost of self-hosting."""
        gpu_monthly = self.gpu_hourly_cost * 24 * 30 * self.num_gpus
        return gpu_monthly + self.engineering_monthly + self.infrastructure_monthly

    @property
    def requests_per_gpu_month(self) -> float:
        """Requests the whole GPU set can serve per month at expected utilization.

        Per-request time is dominated by OUTPUT token count, not a flat latency:
        a 500-token reply takes ~10-25x as long to generate as a 20-token one
        because decode is sequential. So we model capacity as tokens/sec, then
        divide by the average output length. Prefill (input tokens) is parallel
        and comparatively cheap, so it is omitted here for clarity.
        """
        output_tokens_per_month = (
            self.output_tokens_per_sec_per_system
            * self.gpu_utilization
            * 3600 * 24 * 30
        )
        return output_tokens_per_month / self.avg_output_tokens_per_request

    def total_cost(self, monthly_requests: int) -> dict[str, object]:
        """Calculate total costs for both options."""
        api_cost = monthly_requests * self.api_cost_per_request

        # Self-hosted: fixed + marginal
        if monthly_requests <= self.requests_per_gpu_month:
            fixed = self.monthly_fixed_cost
        else:
            # Need more GPUs: scale the fixed cost with the GPU count
            gpus_needed = math.ceil(
                monthly_requests / self.requests_per_gpu_month * self.num_gpus
            )
            fixed = self.monthly_fixed_cost * (gpus_needed / self.num_gpus)
        self_hosted_cost = fixed + monthly_requests * self.marginal_cost_per_request

        return {
            'api_cost': api_cost,
            'self_hosted_cost': self_hosted_cost,
            'recommendation': 'self-host' if self_hosted_cost < api_cost else 'api'
        }

    def breakeven_requests(self) -> float:
        """Find monthly request volume where self-hosting breaks even.

        Assumes the current GPU set can serve that volume; compare the result
        with requests_per_gpu_month before relying on it.
        """
        # Fixed / (API marginal - Self marginal) = breakeven volume
        marginal_diff = self.api_cost_per_request - self.marginal_cost_per_request
        if marginal_diff <= 0:
            return float('inf')  # Self-hosting never cheaper on marginal
        return int(self.monthly_fixed_cost / marginal_diff)
```

**Example breakeven analysis**:

Scenario: Considering self-hosting a Llama 3.1 70B model versus using an API.
Note that there are *two* break-evens to track, and both move with the output-length distribution:

- **Cost break-even**: the monthly volume at which API spend equals self-hosting's fixed cost. Because API requests are priced per token, the per-request price (and therefore this break-even) scales with average output length.
- **Capacity**: whether your GPU set can actually serve that volume. Capacity is governed by decode *throughput in tokens/sec*, divided by average output length. A reply that is 5x longer cuts servable requests/month by ~5x.

| Parameter | Value |
|-----------|-------|
| Self-hosted: 2x A100 80GB | $2.50/hr each = $3,600/month |
| Engineering (0.1 FTE) | $1,500/month |
| Infrastructure overhead | $500/month |
| **Total fixed cost** | **$5,600/month** |
| GPU utilization target | 50% |
| Aggregate decode throughput (70B, continuous batching) | ~2,000 output tok/sec |

We evaluate two output-length regimes that bracket realistic workloads: short replies (~100 output tokens, e.g. classification or extraction) and long replies (~500 output tokens, e.g. summarization or chat).
Input/prefill is parallel and comparatively cheap, so we approximate API price per request from output tokens at $0.85/1M (Llama 4 Maverick output rate from the pricing table above):

```python
# Hand calculations below treat the per-request marginal cost as negligible.

# --- Short-reply regime: ~100 output tokens/request ---
short = SelfHostingAnalysis(
    api_cost_per_request=0.85 / 1_000_000 * 100,  # ~$0.000085/req
    gpu_hourly_cost=2.50,
    num_gpus=2,
    gpu_utilization=0.50,
    engineering_monthly=1500,
    infrastructure_monthly=500,
    output_tokens_per_sec_per_system=2000,
    avg_output_tokens_per_request=100,
)
# Cost break-even = $5,600 / $0.000085 ≈ 65.9M requests/month
# Capacity on 2 GPUs = 2000 * 0.5 * 2.592e6 s/mo / 100 ≈ 25.9M requests/month

# --- Long-reply regime: ~500 output tokens/request ---
long = SelfHostingAnalysis(
    api_cost_per_request=0.85 / 1_000_000 * 500,  # ~$0.000425/req
    gpu_hourly_cost=2.50,
    num_gpus=2,
    gpu_utilization=0.50,
    engineering_monthly=1500,
    infrastructure_monthly=500,
    output_tokens_per_sec_per_system=2000,
    avg_output_tokens_per_request=500,
)
# Cost break-even = $5,600 / $0.000425 ≈ 13.2M requests/month
# Capacity on 2 GPUs = 2000 * 0.5 * 2.592e6 s/mo / 500 ≈ 5.2M requests/month
```

(Here `2.592e6` is the seconds in a 30-day month: `3600 * 24 * 30`.)

**The headline number is not a constant: it moves ~5x with output length.**

| Regime | API $/req | Cost break-even (req/mo) | Capacity on 2 GPUs (req/mo) |
|--------|-----------|--------------------------|-----------------------------|
| Short reply (~100 out tok) | ~$0.000085 | ~66M | ~26M |
| Long reply (~500 out tok) | ~$0.000425 | ~13M | ~5.2M |

Read this carefully: against a cheap open-weight API rate, the cost break-even sits at *tens of millions* of requests per month, and the long-reply workload reaches it at roughly a fifth of the short-reply volume. In both regimes the break-even is about 2.5x what the 2-GPU set can serve, and adding GPUs raises the fixed cost and pushes the break-even out further, so at this API rate this configuration does not pay off at any volume. Against a premium frontier API (e.g.
$25-30/1M output, ~30-35x the rate used here), the break-even drops by the same factor: into the hundreds of thousands of requests per month for long replies (about 2 million for short ones), the cliff described at the top of this chapter. **The break-even is a function of your output-length distribution and the API tier you are displacing, not a fixed constant; re-derive it whenever either changes.**

**Critical caveats**:

1. **Output-length sensitivity**: As the table shows, doubling average output length roughly halves both the cost break-even and the servable capacity. Always model break-even against your *measured* output-length distribution, not an assumed flat per-request cost.
2. **Utilization uncertainty**: The analysis assumes 50% utilization. If you achieve only 25%, effective costs double.
3. **Engineering underestimation**: Operations, troubleshooting, and updates often take more time than estimated.
4. **Opportunity cost**: Engineering time on infrastructure is time not spent on product.
5. **Flexibility cost**: APIs scale instantly; self-hosted requires capacity planning.
6. **Quality differences**: Self-hosted open models may not match API quality for your use case.

The right decision depends on your confidence in volume projections, engineering capacity, and strategic importance of ownership.

> **Full implementation**: See [reference/26b_cost_engineering_code.md](../reference/26b_cost_engineering_code.html#self-hosting-breakeven) for the complete analysis framework.

---

## Cost Optimization Strategies

::: {.callout-tip}
## Staff Engineer Perspective

"Our AI bill went from $50K to $400K in three months. Leadership was panicking about 'AI costs.' When I actually looked at the data, 80% of the spend was one feature that product had added without telling anyone: auto-summarization that ran on every document load. We added caching and moved to a smaller model. Costs dropped 90%. The moral: before optimizing infrastructure, find out what's actually generating the calls."
— *Engineering Director at enterprise software company*
:::

### The Optimization Priority Matrix

Not all cost optimizations are equal. Prioritize based on impact and effort:

![Cost Optimization Priority Matrix](../assets/diagrams/rendered/ch29_optimization_priority.svg)

**Tier 1: Quick wins (days to implement)**

1. **Model selection optimization**: Use smaller models for simpler tasks

```python
class ModelRouter:
    """Route requests to appropriate model based on complexity."""

    MODELS = {
        'simple': {'name': 'claude-haiku-4-5', 'cost_per_1k_tokens': 0.003},
        'moderate': {'name': 'gpt-5', 'cost_per_1k_tokens': 0.0056},
        'complex': {'name': 'claude-opus-4-8', 'cost_per_1k_tokens': 0.012}
    }

    def route(self, request: Request) -> str:
        """Route request to cheapest sufficient model."""
        complexity = self.assess_complexity(request)

        if complexity < 0.3:
            return self.MODELS['simple']['name']
        elif complexity < 0.7:
            return self.MODELS['moderate']['name']
        else:
            return self.MODELS['complex']['name']

    def assess_complexity(self, request: Request) -> float:
        """Estimate task complexity (0-1). Simple heuristics."""
        score = 0.0

        # Simple tasks: short, common patterns
        if len(request.prompt) < 500:
            score += 0.0
        elif len(request.prompt) < 2000:
            score += 0.3
        else:
            score += 0.5

        # Complex indicators
        if any(kw in request.prompt.lower() for kw in ['analyze', 'compare', 'reason']):
            score += 0.2
        if request.requires_structured_output:
            score += 0.1

        return min(1.0, score)
```

**Savings: 40-70% by routing 60% of requests to smaller models**

2.
**Prompt optimization**: Reduce token usage without sacrificing quality

```python
import re


class PromptOptimizer:
    """Optimize prompts for token efficiency."""

    def optimize(self, prompt: str) -> tuple[str, dict]:
        """Return optimized prompt and savings metrics."""
        # count_tokens is an assumed tokenizer helper on this class
        original_tokens = self.count_tokens(prompt)
        optimizations = []

        optimized = prompt

        # Remove redundant whitespace
        optimized = ' '.join(optimized.split())
        if len(optimized) < len(prompt):
            optimizations.append('whitespace_reduction')

        # Compress verbose instructions
        verbose_patterns = [
            (r'Please (.+) for me', r'\1'),
            (r'I would like you to', ''),
            (r'Can you please', ''),
        ]
        for pattern, replacement in verbose_patterns:
            optimized = re.sub(pattern, replacement, optimized, flags=re.IGNORECASE)

        # Reduce example count if excessive
        if prompt.count('Example:') > 3:
            optimizations.append('reduce_examples')
            # Keep first and last example, remove middle
            # (Implementation detail omitted)

        new_tokens = self.count_tokens(optimized)
        # Guard against an empty prompt (zero original tokens)
        savings = (original_tokens - new_tokens) / original_tokens if original_tokens else 0.0

        return optimized, {
            'original_tokens': original_tokens,
            'optimized_tokens': new_tokens,
            'savings_pct': savings,
            'optimizations_applied': optimizations
        }
```

**Savings: 15-30% token reduction typical**

3.
**Caching**: Avoid redundant computation

```python
from typing import Optional

# cosine_similarity is an assumed helper that compares two embedding vectors.


class SemanticCache:
    """Cache responses for semantically similar queries."""

    def __init__(self, embedding_model, similarity_threshold: float = 0.95):
        self.cache = {}  # embedding -> response
        self.embeddings = []
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold

    def get(self, query: str) -> Optional[str]:
        """Return cached response if similar query exists."""
        query_embedding = self.embedding_model.encode(query)

        # Linear scan over cached embeddings; use a vector index at scale
        for cached_embedding, response in self.cache.items():
            similarity = cosine_similarity(query_embedding, cached_embedding)
            if similarity >= self.threshold:
                return response

        return None

    def set(self, query: str, response: str):
        """Cache response for future similar queries."""
        embedding = tuple(self.embedding_model.encode(query))
        self.cache[embedding] = response
```

**Savings: 20-60% depending on query similarity in your workload**

**Tier 2: Medium effort (weeks to implement)**

4. **Batching optimization**: Process requests in efficient batches

```python
import asyncio
import time


class DynamicBatcher:
    """Batch requests for efficient GPU utilization."""

    def __init__(self, max_batch_size: int = 32, max_wait_ms: int = 50):
        self.queue = asyncio.Queue()
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms

    async def process_loop(self):
        """Continuously process batches."""
        while True:
            batch = []
            deadline = time.time() + self.max_wait_ms / 1000

            # Collect requests until batch full or timeout
            while len(batch) < self.max_batch_size and time.time() < deadline:
                try:
                    timeout = deadline - time.time()
                    request = await asyncio.wait_for(
                        self.queue.get(),
                        timeout=max(0, timeout)
                    )
                    batch.append(request)
                except asyncio.TimeoutError:
                    break

            if batch:
                # process_batch is the model-specific batched inference call
                responses = await self.process_batch(batch)
                for request, response in zip(batch, responses):
                    request.future.set_result(response)
```

**Savings: 30-50% through better GPU utilization**

5.
**Quantization**: Reduce model precision for faster, cheaper inference

```python
# Using bitsandbytes for 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

# A 70B checkpoint, chosen to match the memory arithmetic below
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# 70B weights: FP16 ≈ 140GB, 4-bit ≈ 35GB (before KV cache and activations)
# FP16 needs about 4x A100 40GB; 4-bit fits on 2x A100 40GB with headroom
```

**Savings: 50-75% through GPU memory reduction**

**Tier 3: Strategic investments (months to implement)**

6. **Architecture redesign**: Reduce unnecessary LLM calls

| Pipeline Step | Before (3 LLM calls) | After (1 LLM call) |
|--------------|---------------------|---------------------|
| Classification | LLM call | Local classifier |
| Entity Extraction | LLM call | Local NER model |
| Response Generation | LLM call | LLM call |
| **Total LLM calls** | **3 per request** | **1 per request** |

**Savings: 60-70% by replacing LLM calls with specialized models**

7. 
**Custom model training**: Fine-tune smaller models for your domain

```python
# Distillation: train small model on large model outputs
class ModelDistillation:
    """Distill large model knowledge to smaller model."""

    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def generate_training_data(self, prompts: list[str]) -> list[dict]:
        """Generate training data from teacher model."""
        data = []
        for prompt in prompts:
            response = self.teacher.generate(prompt)
            data.append({'prompt': prompt, 'response': response})
        return data

    def train_student(self, training_data: list[dict]):
        """Fine-tune student on teacher outputs."""
        # Standard fine-tuning loop
        # Student learns to mimic teacher behavior
```

**Savings: 80-95% by replacing API calls with fine-tuned small model**

> **Full implementation**: See [reference/26b_cost_engineering_code.md](../reference/26b_cost_engineering_code.html#cost-optimization-strategies) for complete implementations.

---

## Build vs. 
Buy Decision Framework

### The Decision Matrix

Build-versus-buy decisions in AI involve more factors than traditional software:

![Build vs Buy Decision Factors](../assets/diagrams/rendered/ch29_build_vs_buy.svg)

### Quantitative Analysis Framework

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildVsBuyAnalysis:
    """Comprehensive build vs buy analysis."""
    # Build costs
    development_months: int
    developers_required: int
    developer_monthly_cost: float
    infrastructure_monthly: float
    maintenance_fte: float

    # Buy costs
    per_request_cost: float
    monthly_minimum: float

    # Usage projections
    monthly_requests_year1: list[int]  # 12 months
    monthly_requests_year2: list[int]  # 12 months

    # Qualitative factors (scored 1-5, weight 0-1)
    factors: list[tuple[str, float, int, int]]  # name, weight, build_score, buy_score

    def calculate_costs(self) -> dict:
        """Calculate 2-year TCO for both options."""
        # Build: Development + infrastructure + maintenance
        operating_months = 24 - self.development_months
        development_cost = (
            self.development_months *
            self.developers_required *
            self.developer_monthly_cost
        )
        infrastructure_cost = self.infrastructure_monthly * operating_months
        maintenance_cost = (
            operating_months *
            self.maintenance_fte *
            self.developer_monthly_cost
        )
        build_total = development_cost + infrastructure_cost + maintenance_cost

        # Buy: Per-request + minimums
        all_requests = self.monthly_requests_year1 + self.monthly_requests_year2
        buy_total = sum(
            max(self.monthly_minimum, requests * self.per_request_cost)
            for requests in all_requests
        )

        return {
            'build_2yr_tco': build_total,
            'buy_2yr_tco': buy_total,
            'cost_difference': buy_total - build_total,
            'break_even_month': self._find_breakeven()
        }

    def _find_breakeven(self) -> Optional[int]:
        """First month (1-24) where cumulative buy cost reaches cumulative build cost.

        One possible definition of break-even; returns None if it never happens.
        """
        all_requests = self.monthly_requests_year1 + self.monthly_requests_year2
        dev_cost = self.developers_required * self.developer_monthly_cost
        run_cost = (
            self.infrastructure_monthly +
            self.maintenance_fte * self.developer_monthly_cost
        )
        build_cum = buy_cum = 0.0
        for month, requests in enumerate(all_requests, start=1):
            build_cum += dev_cost if month <= self.development_months else run_cost
            buy_cum += max(self.monthly_minimum, requests * self.per_request_cost)
            if buy_cum >= build_cum:
                return month
        return None

    def calculate_scores(self) -> dict:
        """Calculate weighted qualitative scores."""
        build_score = sum(weight * score for name, weight, score, _ in self.factors)
        buy_score = sum(weight * score for name, weight, _, score in self.factors)

        return {
            'build_score': build_score,
            'buy_score': buy_score,
            'recommendation': 'build' if build_score > buy_score else 'buy'
        }

    def full_analysis(self) -> dict:
        """Combined cost and qualitative analysis."""
        costs = self.calculate_costs()
        scores = self.calculate_scores()

        # Cost decides the recommendation; qualitative scores set the confidence
        cost_favors = 'build' if costs['build_2yr_tco'] < costs['buy_2yr_tco'] else 'buy'
        qual_favors = scores['recommendation']

        if cost_favors == qual_favors:
            confidence = 'high'
        else:
            confidence = 'medium - cost and qualitative factors disagree'

        return {
            **costs,
            **scores,
            'overall_recommendation': cost_favors,  # Default to cost
            'confidence': confidence
        }
```

**Example: Embedding Service Build vs. Buy**

```python
analysis = BuildVsBuyAnalysis(
    # Build costs
    development_months=3,
    developers_required=2,
    developer_monthly_cost=15000,  # Fully loaded
    infrastructure_monthly=6000,   # 2x A100
    maintenance_fte=0.25,

    # Buy costs (OpenAI embeddings)
    per_request_cost=0.0001,  # $0.10 per 1M tokens, ~1K tokens/request
    monthly_minimum=0,

    # Usage: Growing from 10M to 50M requests/month
    monthly_requests_year1=[10e6, 12e6, 15e6, 18e6, 22e6, 26e6,
                            30e6, 34e6, 38e6, 42e6, 46e6, 50e6],
    monthly_requests_year2=[50e6] * 12,  # Plateau at 50M

    # Qualitative factors
    factors=[
        ('data_privacy', 0.3, 5, 2),          # Build much better for privacy
        ('customization', 0.2, 4, 2),         # Build allows fine-tuning
        ('engineering_capacity', 0.2, 3, 5),  # Limited ML team
        ('time_to_market', 0.2, 2, 5),        # Need it fast
        ('strategic_importance', 0.1, 3, 3)   # Neutral
    ]
)

result = analysis.full_analysis()
# Build 2yr TCO: $294,750 (90,000 development + 126,000 infrastructure + 78,750 maintenance)
# Buy 2yr TCO: $94,300 (943M requests x $0.0001)
# Qualitative: Build 3.6, Buy 3.3
# Recommendation: Buy (cost wins despite qualitative edge for build)
```

### The Hidden Costs of Building

Build-versus-buy analyses often underestimate build costs:

1. **Scope creep**: Initial estimates rarely include production hardening, monitoring, edge cases
2. **Ongoing investment**: ML systems require continuous improvement to maintain quality
3. **Opportunity cost**: Engineers building infrastructure aren't building product
4. 
**Knowledge concentration**: What happens when the ML engineer leaves?
5. **Obsolescence risk**: The landscape evolves; your custom solution may fall behind

**Heuristic**: Multiply initial build estimates by 2-3x to account for hidden costs. If build still wins, proceed confidently.

> **Full implementation**: See [reference/26b_cost_engineering_code.md](../reference/26b_cost_engineering_code.html#build-vs-buy-framework) for the complete framework.

---

## Building a Cost-Aware Culture

### The FinOps Mindset for AI

FinOps, the practice of bringing financial accountability to cloud spend, is especially important for AI systems, where costs are higher and more variable.

**Core FinOps principles applied to AI**:

1. **Teams own their costs**: The team that deploys a model is accountable for its cost, not a central platform team.
2. **Decisions are data-driven**: Cost tradeoffs are made with real numbers, not intuition.
3. **Optimization is continuous**: Cost efficiency is an ongoing practice, not a one-time project.
4. **Value, not just cost**: The goal is maximizing value per dollar, not minimizing dollars.

### Implementing Cost Governance

**1. Cost visibility dashboards**

Every engineering team should have real-time visibility into their AI costs:

```python
class AICostDashboard:
    """Team-level AI cost visibility."""

    def get_team_view(self, team: str) -> dict:
        return {
            'summary': {
                'mtd_cost': self.get_mtd_cost(team),
                'budget': self.get_budget(team),
                'utilization': self.get_budget_utilization(team),
                'forecast': self.forecast_month_end(team)
            },
            'by_model': self.get_costs_by_model(team),
            'by_service': self.get_costs_by_service(team),
            'daily_trend': self.get_daily_trend(team, days=30),
            'anomalies': self.detect_anomalies(team),
            'opportunities': self.identify_optimizations(team)
        }
```

**2. 
Budget controls with escalation**

```python
from dataclasses import dataclass


@dataclass
class BudgetPolicy:
    """Budget policy with tiered escalation."""
    team: str
    monthly_budget: float

    # Escalation thresholds
    warn_threshold: float = 0.75    # Alert at 75%
    review_threshold: float = 0.90  # Require review at 90%
    block_threshold: float = 1.10   # Block new requests at 110%

    def check_request(self, request_cost: float, current_spend: float) -> dict:
        """Check if request should proceed given budget."""
        new_spend = current_spend + request_cost
        utilization = new_spend / self.monthly_budget

        if utilization > self.block_threshold:
            return {
                'allowed': False,
                'reason': 'Budget exceeded - requires VP approval',
                'escalation': 'vp'
            }
        elif utilization > self.review_threshold:
            return {
                'allowed': True,
                'warning': 'Approaching budget limit',
                'escalation': 'manager'
            }
        elif utilization > self.warn_threshold:
            return {
                'allowed': True,
                'warning': f'Budget {self.warn_threshold:.0%} utilized'
            }
        else:
            return {'allowed': True}
```

**3. Cost review cadence**

| Review | Frequency | Participants | Focus |
|--------|-----------|--------------|-------|
| Anomaly review | Daily (automated) | On-call | Spikes, unusual patterns |
| Team cost review | Weekly | Tech lead | Trends, quick wins |
| Service review | Monthly | Engineering + Finance | Budget performance, projections |
| Strategic review | Quarterly | Leadership | TCO trends, build/buy decisions |

### Incentive Alignment

Cost awareness requires aligned incentives:

**What works**:

- Teams that reduce costs get some savings allocated to their budget
- Cost efficiency is part of performance reviews for tech leads
- Cost per unit of value is a team metric alongside reliability and velocity

**What doesn't work**:

- Centralized cost management with no team accountability
- Pure cost minimization targets without value consideration
- Punishing teams for high costs without context

**Example: Cost efficiency OKRs**

```
Q1 Objectives for ML Platform Team:

Objective: Improve AI cost efficiency
  KR1: Reduce cost per inference request by one-third (from $0.015 to $0.010)
  KR2: Increase GPU utilization from 35% to 55%
  KR3: Implement model routing, directing 50% of requests to smaller models

Objective: Maintain quality while optimizing
  KR1: Response quality score stays above 4.2/5
  KR2: P95 latency stays under 500ms
  KR3: No customer-facing incidents from optimization changes
```

---

## Key Takeaways

1. **AI costs are fundamentally different** - GPU economics, marginal per-request costs, and hidden engineering time make AI 10-100x more expensive per request than traditional software. Plan accordingly.
2. **Model total cost of ownership completely** - Direct infrastructure is often only 60-70% of true TCO. Include engineering time, maintenance, monitoring, and opportunity costs in your calculations.
3. **Attribute costs to create accountability** - You cannot optimize what you cannot measure. Tag resources, track usage per team, and allocate costs to create ownership and visibility.
4. **Model routing and caching provide the biggest wins** - These optimizations often deliver 40-70% savings with days of effort. Start here before complex architectural changes or custom optimization.
5. **Build cost-aware culture through visibility** - Dashboards, budgets, and aligned incentives make cost efficiency everyone's responsibility. Teams that can see costs make better decisions.

---

## Summary

Cost engineering transforms AI systems from economically precarious to sustainable. The key insights:

**Understand why AI is expensive**: GPU economics, marginal per-request costs, and hidden costs in engineering time make AI fundamentally different from traditional software.

**Model total cost of ownership**: Direct infrastructure costs are often only 60-70% of true TCO. Include engineering time, maintenance, and hidden costs.

**Attribute costs to create accountability**: You cannot optimize what you cannot measure. Tag resources, track usage, and allocate costs to teams.

**Make data-driven build vs. 
buy decisions**: Calculate crossover points with realistic assumptions. Multiply build estimates by 2-3x for hidden costs.

**Prioritize optimizations by impact**: Model routing and caching often provide 40-70% savings with days of effort. Start there before complex architectural changes.

**Build cost-aware culture**: Visibility, budgets, and aligned incentives make cost efficiency everyone's responsibility.

**Plan around the 10x cost cliff**: Forecast your 12-month volume against the API-versus-self-host crossover. Below the cliff, managed APIs win; above it, self-hosting wins by roughly an order of magnitude. The expensive mistake is discovering which side you're on from a surprise invoice.

Cost engineering is a Staff+ skill because it requires translating technical metrics into business terms, making tradeoffs across multiple dimensions, and building organizational systems that create sustainable economics at scale.

---

## Practical Exercises

### Exercise 1: TCO Analysis

**Scenario**: Your company runs a customer support chatbot. Calculate complete TCO.

**Given**:

- Inference: 4x L4 GPUs, $0.50/hr each, running 24/7
- API calls: External LLM for complex queries, 50,000 calls/month at $0.02 average
- Storage: 500GB conversation logs, 50GB model artifacts
- Engineering: 0.5 FTE ML engineer, 0.2 FTE DevOps

**Tasks**:

1. Calculate monthly direct costs
2. Estimate engineering costs (assume $180k fully-loaded annual salary)
3. Add hidden costs (monitoring, security, DR) at 15% of direct
4. Calculate cost per conversation (assume 500,000 conversations/month)
5. If conversations grow to 2M/month, how does cost per conversation change?

### Exercise 2: GPU Selection

**Scenario**: You need to serve a 7B parameter model.
**Requirements**:

- P95 latency: <200ms
- Throughput: 100 requests/second peak
- Budget: <$10,000/month

**GPU options**:

| GPU | VRAM | Latency (7B) | Max throughput | Monthly cost |
|-----|------|--------------|----------------|--------------|
| L4 | 24GB | 80ms | 40 req/s | $360 |
| A10G | 24GB | 65ms | 50 req/s | $540 |
| A100 40GB | 40GB | 40ms | 100 req/s | $1,260 |

**Tasks**:

1. Calculate minimum instances needed to meet throughput
2. Calculate total monthly cost for each GPU option
3. Verify latency requirements are met
4. Make a recommendation with justification

### Exercise 3: Build vs. Buy Analysis

**Scenario**: Evaluate building vs. buying a document extraction service.

**Build option**:

- Development: 2 engineers, 4 months
- Infrastructure: 2x A100 for fine-tuned model serving
- Maintenance: 0.3 FTE ongoing

**Buy option**:

- API cost: $0.10 per page
- Current volume: 100,000 pages/month
- Projected growth: 20% monthly for 12 months

**Tasks**:

1. Calculate 24-month TCO for build (assume $15k/month per engineer)
2. Calculate 24-month TCO for buy
3. Find the monthly volume crossover point
4. List 3 qualitative factors and score them for build vs. buy
5. Make a recommendation

### Exercise 4: Optimization Prioritization

**Scenario**: Your LLM service has the following profile:

- 2M requests/day
- Average 3,000 tokens per request (2,000 input, 1,000 output)
- Using GPT-5.5 at $5/1M input, $30/1M output
- 35% of queries are very similar to previous queries
- 60% of queries could be handled by a smaller model

**Tasks**:

1. Calculate current daily and monthly cost
2. Estimate savings from semantic caching (assume 25% cache hit rate)
3. Estimate savings from model routing (Claude Haiku 4.5 at $1.00/$5.00)
4. Estimate savings from 20% prompt optimization
5. Rank optimizations by estimated ROI
6. Calculate combined savings if all three are implemented

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. 
[Senior]** What components make up the Total Cost of Ownership (TCO) for an AI system? Which costs are commonly underestimated?

<details>
<summary>Answer</summary>

TCO components: (1) Direct costs: compute (GPU hours), storage, API calls, networking. (2) Engineering costs: development, maintenance, on-call, debugging. (3) Infrastructure overhead: monitoring, logging, security, DR. (4) Opportunity cost: what else could engineers/resources do? (5) Risk costs: incidents, data loss, compliance issues.

Commonly underestimated: (1) Engineering maintenance: "we'll build it and forget it" is a myth. Budget 20-30% of initial build for ongoing maintenance. (2) Opportunity cost: engineers building infrastructure aren't building features. (3) Hidden infrastructure: observability, secrets management, deployment tooling. (4) Scaling costs: unit economics change as you scale. (5) Data costs: storage, transfer, processing grow over time.

</details>

**Q2. [Senior]** Explain the cost difference between input tokens and output tokens for LLM APIs. Why does this matter for optimization?

<details>
<summary>Answer</summary>

Output tokens typically cost 5-8x more than input tokens (e.g., GPT-5.5: $5/M input vs. $30/M output).

Why: Output generation is more computationally intensive because each output token requires a forward pass through the entire model, while input tokens are processed in parallel during prefill.

Optimization implications: (1) Prompt optimization matters, but output optimization matters more. (2) Limit output length where possible: constrain `max_tokens`, use structured outputs. (3) For Q&A, long documents as input are cheaper than long generated answers. (4) Caching input (prompt caching) saves less than you might think if outputs dominate cost. (5) Model routing can save dramatically: route simple queries to cheaper models.

Cost analysis: For a request with 2K input / 1K output at GPT-5.5 prices: Input = $0.010, Output = $0.030. Output is 75% of cost despite being 33% of tokens.
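
The arithmetic in this answer can be checked with a few lines of Python. This is a minimal sketch: the function name `request_cost_split` is hypothetical, and the prices are the illustrative per-million-token figures quoted above rather than a live price list.

```python
def request_cost_split(input_tokens: int, output_tokens: int,
                       input_price_per_m: float, output_price_per_m: float) -> dict:
    """Split one request's cost into input and output components.

    Prices are in dollars per one million tokens.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    total_cost = input_cost + output_cost
    total_tokens = input_tokens + output_tokens
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total_cost,
        # Guard against empty requests to avoid division by zero
        "output_cost_share": output_cost / total_cost if total_cost else 0.0,
        "output_token_share": output_tokens / total_tokens if total_tokens else 0.0,
    }


# 2K input / 1K output at $5 and $30 per million tokens
split = request_cost_split(2_000, 1_000, input_price_per_m=5.0, output_price_per_m=30.0)
print(f"input ${split['input_cost']:.3f}, output ${split['output_cost']:.3f}")
print(f"output share of cost: {split['output_cost_share']:.0%}")
print(f"output share of tokens: {split['output_token_share']:.0%}")
```

Running the same function over a sample of real request logs shows quickly whether input or output tokens dominate spend for a given workload, which determines whether prompt trimming or output limits deserve attention first.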
</details>

**Q3. [Staff]** How do you build a business case for AI infrastructure investment? What metrics do stakeholders care about?

<details>
<summary>Answer</summary>

Stakeholder metrics: (1) The CFO cares about ROI, payback period, cost per unit (transaction, user, query), and TCO vs. alternatives. (2) The VP of Engineering cares about engineering efficiency, reliability, and time to market. (3) Product cares about user impact, feature velocity, and competitive position.

Building the case: (1) Define the counterfactual: what happens if we don't invest? Status quo cost, opportunity cost, risk. (2) Quantify benefits: cost savings, revenue enablement, risk reduction. (3) Calculate the payback period: when does the investment pay for itself? Under 12 months is usually an easy sell. (4) Include non-financial benefits: developer productivity, user experience, competitive positioning. (5) Address risks: what could go wrong? Mitigation plans. (6) Propose a staged approach: invest incrementally, prove value before scaling.

</details>

**Q4. [Staff]** What's the cost difference between self-hosting LLMs vs. using APIs? How do you calculate the break-even point?

<details>
<summary>Answer</summary>

API costs: Simple. Request volume × (input tokens × input price + output tokens × output price).

Self-hosting costs: Complex. GPU cost + storage + engineering (setup + maintenance) + infrastructure overhead. Add 15-30% for hidden costs.

Break-even calculation: (1) Calculate monthly self-hosted cost: GPU_monthly + storage + (maintenance_FTE × salary). (2) Calculate cost per request at full utilization: monthly_cost / max_requests_per_month. (3) Calculate API cost per request. (4) Break-even volume = monthly_self_hosted_cost / api_cost_per_request.

Example: Self-hosted 7B model on A100 costs ~$1,500/month (GPU) + $2,500/month (0.2 FTE maintenance) = $4,000/month. Can serve ~100M requests at full utilization. API at $0.001/request. Break-even = 4M requests/month. Below 4M, API is cheaper. Above, self-host wins.
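
The break-even steps above can be sketched in Python. The function and parameter names are hypothetical, and the `utilization` parameter is an added assumption reflecting that usable capacity is normally below the theoretical maximum.

```python
def break_even_requests(monthly_self_hosted_cost: float,
                        api_cost_per_request: float) -> float:
    """Monthly volume at which API spend equals the fixed self-hosted cost."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return monthly_self_hosted_cost / api_cost_per_request


def cheaper_option(monthly_requests: float,
                   monthly_self_hosted_cost: float,
                   api_cost_per_request: float,
                   max_requests_per_month: float,
                   utilization: float = 1.0) -> str:
    """Pick the cheaper option for a given monthly volume.

    `utilization` (0 to 1) scales usable capacity. Volumes beyond that
    capacity are reported separately because they need more hardware.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if monthly_requests > max_requests_per_month * utilization:
        return "exceeds self-hosted capacity"
    api_cost = monthly_requests * api_cost_per_request
    return "self-host" if api_cost > monthly_self_hosted_cost else "api"


# Figures from the example above
self_hosted = 1_500 + 2_500  # GPU + 0.2 FTE maintenance, dollars per month
volume = break_even_requests(self_hosted, api_cost_per_request=0.001)
print(f"break-even: {volume:,.0f} requests/month")

# Compare both options at two volumes, assuming 50% usable capacity
for requests in (1_000_000, 10_000_000):
    choice = cheaper_option(requests, self_hosted, 0.001,
                            max_requests_per_month=100_000_000, utilization=0.5)
    print(f"{requests:,} requests/month -> {choice}")
```

The capacity check is a deliberate simplification: once volume exceeds what one deployment can serve, the self-hosted cost steps up with each added GPU, so the comparison should be rerun with the larger fixed cost.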
Caveats: Quality difference, utilization reality (rarely 100%), setup time/cost.

</details>

**Q5. [Staff]** How do you implement cost attribution in a platform serving multiple teams/products?

<details>
<summary>Answer</summary>

Implementation: (1) Tagging: every resource/request tagged with team, product, environment. Enforced at API/infrastructure level. (2) Metering: collect usage data at appropriate granularity (request-level for APIs, hourly for compute). (3) Aggregation: roll up by tag dimensions daily/weekly/monthly. (4) Allocation: shared resources (infrastructure, on-call) allocated by usage ratio or evenly. (5) Reporting: dashboards showing cost by team/product, trends, anomalies. (6) Showback vs. chargeback: showback (visibility only) is simpler to start. Chargeback (actual budget transfer) drives stronger behavior change but adds overhead.

Challenges: (1) Shared resources: how to allocate platform team costs? (2) Bursty usage: one team's spike affects others' costs. (3) Tag enforcement: missing/incorrect tags corrupt data. (4) Granularity vs. overhead: too fine-grained is expensive to track.

Best practice: Start with showback at product level. Add granularity over time as needed.

</details>

### Spot the Problem

**Problem 1. [Senior]** Cost analysis:

```
"Our LLM API costs $50K/month. We could self-host for $10K/month in GPU costs. That's $40K savings per month!"
```

<details>
<summary>Answer</summary>

Missing costs: (1) Engineering: who builds, deploys, maintains the self-hosted system? Add 0.3-0.5 FTE minimum = $4-8K/month. (2) Infrastructure: load balancing, monitoring, logging, security, CI/CD. (3) Quality risk: is your self-hosted model as good as the API? Quality degradation has business cost. (4) Setup time: how many months until you're running? API cost continues during development. (5) Opportunity cost: what else could engineers do? (6) Utilization assumption: $10K assumes high utilization. Real utilization is often 40-60%.
Realistic comparison: Self-hosted = $10K GPU + $6K engineering + $2K infrastructure + quality risk = $18K+/month + 3-month setup. Savings are real but smaller, and come with risk/complexity.

</details>

**Problem 2. [Staff]** Optimization prioritization:

```
"We're spending $100K/month on LLMs. I'll focus on reducing prompt length by 20%. That should save $20K/month."
```

<details>
<summary>Answer</summary>

Flawed analysis: (1) Output tokens cost more: if 60% of cost is output tokens, 20% input reduction saves only 8% total, not 20%. (2) Prompt reduction may hurt quality: shorter prompts may degrade performance, increasing downstream costs or losing users. (3) Other optimizations may have higher ROI: caching, model routing, batch processing might save more with less risk. (4) Implementation effort: how hard is prompt reduction? May require rewriting many prompts.

Proper approach: (1) Break down cost by input vs. output. (2) Identify all optimization opportunities with estimated savings and effort. (3) Prioritize by ROI: savings / (effort + risk). (4) Test optimizations before committing: measure actual savings and quality impact.

</details>

**Problem 3. [Staff]** Budget planning:

```
"AI costs last year: $1.2M. Budget for next year: $1.5M (25% growth buffer)."
```

<details>
<summary>Answer</summary>

Problems: (1) Usage growth not considered: if usage grows 100%, 25% buffer is wildly insufficient. (2) New projects not considered: what's planned? Each new AI feature adds cost. (3) Unit economics not tracked: is cost per query stable or increasing? (4) No efficiency targets: just padding previous spend doesn't drive improvement. (5) External factors: API pricing changes, new model releases, infrastructure changes.

Better approach: (1) Forecast usage growth by product/feature. (2) Apply unit economics (cost per query/user/transaction). (3) Add planned new projects with estimated costs. (4) Set efficiency targets (reduce cost per query by X%).
(5) Build in contingency (10-15%) for unknowns.

Bottom-up budgeting: Forecast = Σ(product usage × unit cost) + new projects + contingency - efficiency gains.

</details>

### Design Exercises

**Exercise 1. [Staff]** Design a cost optimization roadmap for an AI platform spending $500K/month. Current state: No caching, single model for all requests, on-demand GPU instances, no cost visibility by team. Target: 30% cost reduction in 6 months. Outline initiatives, expected savings, effort, and sequencing.

<details>
<summary>Guidance</summary>

Initiative prioritization: (1) Quick wins (Month 1-2): Cost visibility dashboard (shows where money goes), basic request caching (semantic similarity, estimate 10-15% savings), reserved/spot instances for predictable workloads (20-30% infra savings). (2) Medium effort (Month 2-4): Model routing. Classify requests and route simple ones to cheaper/smaller models. If 50% of requests can use a cheaper model at 10% of the price, that saves 90% on those requests, or up to ~45% of model spend; planning for about half of that (~20-22% of total) allows for misrouted requests and for spend that routing does not touch. (3) Longer term (Month 4-6): Prompt optimization (systematic review), batch processing for non-urgent requests, cost allocation to drive team accountability.

Expected savings: Caching (10%) + model routing (20%) + instance optimization (15% of infra) + prompt optimization (5%) = 35%+ achievable.

Sequencing: Start with visibility (can't optimize what you can't measure), then quick infrastructure wins, then application-level optimizations requiring more effort.

</details>

**Exercise 2. [Staff]** You're presenting an AI infrastructure investment ($2M over 2 years) to the CFO. The investment will enable self-hosting models currently costing $80K/month in API fees, plus enable new capabilities. Build the business case with financial projections, risks, and alternatives.

<details>
<summary>Guidance</summary>

Business case structure: (1) Current state: $80K/month API = $1.92M over 2 years. Growing 30%/year = $2.5M+ with growth. (2) Investment: $2M (hardware + setup + engineering).
(3) Operating cost: $15K/month ongoing (maintenance, power, networking) = $360K over 2 years. (4) Total self-hosted cost: $2M + $360K = $2.36M. (5) Direct ROI: Negative initially. Cumulative API spend overtakes self-hosting late in year 2 (around month 23 at 30%/year growth compounded monthly, assuming the $2M is spent up front), and not until about month 31 if usage stays flat. Positive ROI in year 3. (6) Additional benefits: New capabilities (fine-tuning, data privacy, latency control). Quantify if possible. (7) Risks: Technology obsolescence, usage forecast wrong, quality issues. Mitigation plans. (8) Alternatives: Continue API (higher long-term cost), hybrid approach (partial self-hosting).

Recommendation: Present as a staged investment: an initial pilot to prove value, then scale upon validation. CFOs prefer optionality and proven results over big upfront bets.

</details>

---

### Connections to Other Chapters

Cost engineering connects to technical decisions across the book. Here's how this chapter relates:

- **Chapter 9 (LLM Deployment & Infrastructure)**: Technical foundations of deployment that drive cost structure. Understanding serving architectures helps optimize costs.
- **Chapter 27 (Performance Engineering)**: Performance optimizations that directly impact cost efficiency. Faster inference means lower GPU-hours per request.
- **Chapter 31 (Reliability Engineering)**: Reliability investments affect cost: redundancy costs money but prevents expensive incidents.
- **LLM Provider Comparison** (Appendix K): Detailed pricing models and cost comparison across major LLM providers. Essential reference for build vs. buy decisions.

---

## Further Reading

### Essential

- **Storment & Fuller, "Cloud FinOps"** - The definitive guide to cloud cost management for AI infrastructure.
- **Patterson et al. (2021), "Carbon Emissions and Large Neural Network Training"** - Quantifies compute costs in production ML.
- **Zaharia et al. (2024), "The Shift from Models to Compound AI Systems"** - Cost implications of multi-component AI.

### Deep Dives

- **Kwon et al. 
(2023), "PagedAttention/vLLM"** - Memory optimization enabling 2-4x throughput improvement.
- **Leviathan et al. (2023), "Speculative Decoding"** - Technique for 2-3x speedup in autoregressive generation.

### Practical Resources

- **FinOps Certified Practitioner curriculum** - Industry standard for cloud financial management.

---

## Part V Checkpoint: Staff+ Engineering Complete

Congratulations on completing Part V and the entire book! You've covered the architectural thinking, strategic decision-making, and leadership skills required at Staff+ levels.

### Skills Checklist

- [ ] **Design systems at scale** (Ch25): Apply distributed systems patterns, handle multi-region deployment, and design for failure
- [ ] **Make technical decisions** (Ch26): Use decision frameworks, write ADRs, and build consensus on architectural choices
- [ ] **Optimize performance** (Ch27): Profile systems, identify bottlenecks, and apply GPU/inference optimizations
- [ ] **Bridge research and production** (Ch28): Evaluate papers critically, prototype responsibly, and know when to adopt new techniques
- [ ] **Lead across teams** (Ch29): Influence without authority, build technical consensus, and drive organization-wide initiatives
- [ ] **Architect data systems** (Ch30): Design feature stores, implement data pipelines, and ensure data quality at scale
- [ ] **Engineer reliability** (Ch31): Set SLOs, implement graceful degradation, and run incident response
- [ ] **Manage costs** (Ch32): Model TCO, optimize resource utilization, and make build-vs-buy decisions

### Quick Self-Test (10 minutes)

**Q1.** Your global AI system needs 99.9% availability. How do you architect for this across regions?

**Q2.** A promising paper shows a 30% latency improvement, and your team wants to adopt it immediately. What's your response?

**Q3.** Two teams disagree on whether to build a custom solution or use an existing framework. How do you drive resolution?

**Q4.** Your LLM serving costs are 3x budget.
Walk through your cost reduction approach.

::: {.callout-tip collapse="true"}
## Check Your Answers

**A1.** Multi-region active-active with: (1) Load balancing across regions with health checks, (2) Data replication with eventual consistency where acceptable, (3) Circuit breakers to isolate failures, (4) Graceful degradation to simpler models during outages, (5) Chaos testing to verify resilience.

**A2.** (1) Reproduce the results in your environment first, (2) Understand the tradeoffs (what did they sacrifice?), (3) Benchmark on your actual workload, (4) Prototype in staging, (5) Plan incremental rollout with rollback capability. Enthusiasm is good; prudent adoption is better.

**A3.** (1) Get both teams to articulate their criteria for success, (2) Document requirements and constraints objectively, (3) Evaluate options against criteria with evidence, (4) Identify the reversibility of each decision, (5) Propose a time-boxed experiment if uncertainty is high, (6) Make the call if consensus isn't emerging and document the reasoning.

**A4.** (1) Profile to find the biggest cost drivers (usually GPU utilization, batch efficiency, or over-provisioning), (2) Quick wins: enable batching, right-size instances, use spot/preemptible, (3) Medium-term: implement caching, route simple requests to smaller models, (4) Strategic: evaluate self-hosting vs. API tradeoffs at your scale, (5) Set cost budgets per feature and make teams accountable.

:::

### You've Completed the Book!

If you can confidently check all boxes above, you have the knowledge to operate as a Staff+ AI Engineer.
Remember:

- **Technical depth** is necessary but not sufficient: Staff+ impact comes from multiplying your expertise through others
- **The field moves fast**: commit to continuous learning with the practices from Chapter 21
- **Production is the test**: everything in this book is theory until you ship systems that handle real traffic

The appendices contain reference materials you'll return to throughout your career: the glossary, tool comparisons, paper reading lists, case studies, and templates.

Welcome to Staff+ AI Engineering.