Chapter 32: Cost Engineering
cost optimization, GPU costs, TCO, FinOps, capacity planning, build vs buy, unit economics, ROI
Introduction
In early 2024, an AI startup celebrated a milestone: their LLM-powered product had reached 100,000 daily active users. The celebration was short-lived. Their cloud bill arrived at $847,000 for the month—more than triple their revenue. The CTO’s post-mortem revealed a cascade of cost failures: inference endpoints running at 15% GPU utilization, verbose prompts averaging 8,000 tokens when 2,000 would suffice, no caching despite 60% query similarity, and premium API pricing for workloads that could have used smaller models. The technical architecture was sound; the economic architecture was nonexistent.
This story repeats across the industry. AI systems are extraordinarily expensive compared to traditional software. A single GPU-hour can cost as much as a full day of CPU time. Token costs compound with every user interaction. Training runs can consume more compute in hours than a conventional application uses in years. Yet many teams approach costs reactively: surprised by bills rather than anticipating them, cutting features rather than optimizing systems.
Cost engineering is the discipline of designing AI systems with economic sustainability as a first-class constraint. It is not about minimizing spend—sometimes spending more is precisely right. It is about understanding cost structures deeply enough to make informed tradeoffs, allocating resources to maximize value, and building systems that remain economically viable as they scale.
At Staff and Principal levels, you become accountable not just for systems that work, but for systems that work economically. You must translate GPU utilization metrics into business terms executives understand. You must make build-versus-buy decisions with real numbers rather than intuition. You must design architectures where costs grow sub-linearly with usage rather than exploding at scale.
This chapter provides the theoretical foundations and practical frameworks for AI cost engineering. We begin with why AI is expensive—the fundamental economics that make cost engineering essential. We then develop cost modeling techniques, explore GPU economics, and build decision frameworks for the choices that dominate AI budgets.
The Core Insight: Why AI Cost Engineering is Different
Traditional software cost engineering focuses on compute, storage, and bandwidth—resources that are relatively cheap and well-understood. AI cost engineering introduces fundamentally different challenges:
Compute intensity: A single inference call through a large language model requires more floating-point operations than a traditional web request by factors of millions. GPUs that cost $10,000+ sit idle waiting for the next request.
Marginal costs that compound: Every user interaction consumes tokens, GPU cycles, or API calls. Unlike traditional software where marginal costs approach zero, AI systems have meaningful per-request costs that scale with usage.
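Nonlinear scaling: Larger models don’t just cost proportionally more; costs often scale superlinearly with capability. A 70B parameter model can cost well over 10x what a 7B model costs to serve, because its FP16 weights (~140GB) no longer fit on a single 80GB GPU and must be sharded across several.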
Hidden costs everywhere: Fine-tuning data curation, evaluation dataset construction, monitoring for drift, retraining cycles—costs that don’t appear in simple infrastructure budgets but dominate total cost of ownership.
Rapid obsolescence: Models that were state-of-the-art six months ago may be outperformed by models that cost 10x less to run. Cost optimization is a moving target.
These dynamics mean that cost engineering for AI requires different mental models than traditional infrastructure. You must think about cost per token, GPU utilization curves, and inference optimization in ways that don’t apply to conventional systems.
A Mental Model for AI Costs
Plot API spend and self-hosted TCO on the same chart as request volume grows, and the two lines cross at a brutal inflection point—somewhere in the low hundreds of thousands of requests per month for a typical 7B-13B workload. Below the cliff, managed APIs win on every axis: zero fixed cost, no ops burden, model upgrades for free. Above the cliff, the same API bill is suddenly ~10x what disciplined self-hosting would cost, and the gap widens monotonically. Heuristic: forecast your 12-month volume; if you’ll cross the cliff, start the self-hosting build before you hit it, not after.
Once you’ve mapped where you sit relative to the 10x cost cliff, the rest of cost engineering falls into place. And the API-vs-self-host crossover is only the most visible cliff. AI costs don’t scale linearly anywhere—they have cliffs everywhere, thresholds where costs suddenly jump by 10x or more. Understanding all of them prevents budget surprises:
The context cliff: Going from 4K to 32K context doesn’t cost 8x more. Raw attention FLOPs scale quadratically, so 8x the sequence is ~64x the attention compute. The multipliers in this section are illustrative orders of magnitude, not measured constants. In practice the served cost of long context is far below the naive 64x: FlashAttention avoids materializing the full attention matrix and sharply reduces memory traffic, paged KV-cache cuts memory waste (see Chapter 27 for both), and prefix/KV caching amortizes shared context across requests. The quadratic term still dominates eventually, but you pay something closer to linear-plus-a-quadratic-tail, not a clean 64x. A document that barely fit in one API call might still need to be chunked, and chunking has its own costs (overlap, multiple calls, synthesis).
The quality cliff: The gap between “good enough” and “excellent” is often 10x cost. A task that Haiku handles 80% correctly might need Opus for 95%. At the list prices shown later in this chapter, you pay about 5x more per token for those last 15 percentage points of quality. Sometimes it’s worth it. Often it isn’t.
The latency cliff: Reducing P99 latency from 2s to 200ms might require 10x more GPU capacity. Tail latencies are expensive because they require provisioning for worst-case, not average-case.
The freshness cliff: Real-time features cost exponentially more than daily batch features. Going from daily to hourly might be 3x. Hourly to minute-level might be 10x. Minute to second might be 50x. (As with the other cliffs, treat these multipliers as illustrative magnitudes, not benchmarked figures—the actual factor depends on your infrastructure.) Each step requires more infrastructure, more monitoring, more on-call burden.
The reliability cliff: Going from 99% to 99.9% availability requires roughly 10x investment. Going from 99.9% to 99.99% requires another 10x. Each “nine” costs an order of magnitude.
The strategic implication: know where your cliffs are and stay on the cheap side unless you have a compelling reason to jump. Many teams accidentally fall off cliffs by adding features without understanding cost implications. A product manager asks for “just a bit longer context” and suddenly infrastructure costs triple. Before any architectural decision, ask: “Are we near a cliff? What happens if requirements push us over?”
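To make the context cliff concrete, the sketch below compares a linear cost multiplier with the naive quadratic one for attention. The function name and the pure-quadratic model are assumptions for illustration; as noted above, served cost usually lands well below the quadratic figure.

```python
def context_cost_multipliers(old_ctx: int, new_ctx: int) -> dict[str, float]:
    """Return illustrative cost multipliers for growing the context window.

    Assumes attention FLOPs scale with the square of sequence length and
    everything else scales linearly. Real serving costs fall somewhere
    between the two numbers.
    """
    if old_ctx <= 0 or new_ctx <= 0:
        raise ValueError("context lengths must be positive")
    growth = new_ctx / old_ctx
    return {"linear": growth, "quadratic_attention": growth ** 2}


multipliers = context_cost_multipliers(4_096, 32_768)
print(multipliers)  # {'linear': 8.0, 'quadratic_attention': 64.0}
```

Running the same function on a proposed requirement change (say, 8K to 16K) gives a quick upper bound to bring to the product discussion before committing.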
Think of an AI system as a factory with unusual economics. Traditional software factories have high fixed costs (development, infrastructure) but near-zero marginal costs—serving the millionth user costs almost nothing more than the first. AI factories have high fixed costs and significant marginal costs—every inference consumes real resources.
This creates a different optimization landscape:
- Traditional software: Minimize fixed costs, scale efficiently (marginal costs negligible)
- AI systems: Balance fixed costs (infrastructure, development) against marginal costs (compute per request, tokens per interaction)
The optimal operating point depends on scale. At low volume, using external APIs with high per-unit cost but zero fixed cost dominates. At high volume, building infrastructure with high fixed costs but low marginal costs wins. Finding the crossover point—and anticipating how quickly you’ll reach it—is a core cost engineering skill.
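The crossover point can be estimated with one line of algebra: self-hosting wins once the per-request savings have paid for the fixed cost. The sketch below assumes a simple linear cost model; the function name and the example prices are hypothetical, not figures from a real deployment.

```python
import math


def breakeven_requests_per_month(
    fixed_monthly_cost: float,
    api_cost_per_request: float,
    self_hosted_cost_per_request: float,
) -> float:
    """Monthly volume above which self-hosting is cheaper than the API.

    Solves fixed + v * self_hosted = v * api for v. Returns math.inf when
    self-hosting never wins (its marginal cost is not lower than the API's).
    """
    saving_per_request = api_cost_per_request - self_hosted_cost_per_request
    if saving_per_request <= 0:
        return math.inf
    return fixed_monthly_cost / saving_per_request


# Hypothetical inputs: $6,000/month of GPUs and ops time, $0.025 per API
# request, $0.005 per self-hosted request.
volume = breakeven_requests_per_month(6_000, 0.025, 0.005)
print(f"{volume:,.0f} requests/month")  # prints 300,000 requests/month
```

Comparing this number against a 12-month volume forecast indicates whether the crossover is a near-term planning item or a distant one.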
What You’ll Learn
- The fundamental economics of GPU compute and why utilization matters so much
- Total cost of ownership analysis for AI systems
- Token cost calculations and optimization strategies
- Cost attribution methods that create accountability
- Build-versus-buy decision frameworks with real numbers
- Self-hosting breakeven analysis
- Strategies for building cost-aware engineering culture
Prerequisites
- System design fundamentals (Chapter 25)
- Performance engineering concepts (Chapter 27)
- Basic understanding of GPU architectures and inference pipelines
The Economics of AI Compute
Why GPUs Are Expensive—and Why Utilization Matters
To understand AI costs, you must understand GPU economics. A single NVIDIA H100 GPU costs approximately $30,000 to purchase. Cloud providers charge $3-5 per hour for access. Why so expensive?
The hardware reality: GPUs contain thousands of specialized cores optimized for the matrix multiplications that dominate neural network computation. An H100 SXM is rated at 1,979 teraflops of FP16 Tensor Core compute (the figure with structured sparsity; dense throughput is roughly half that), on the order of 100x the throughput of a high-end CPU for AI workloads. This specialized capability commands a premium.
The supply constraint: GPU manufacturing requires cutting-edge fabrication facilities that cost tens of billions of dollars to build. TSMC’s advanced processes are capacity-constrained, limiting the number of high-end chips that can be produced. Demand from AI training and inference consistently exceeds supply.
The utilization imperative: Because GPUs are expensive, utilization is everything. Consider the math:
H100 cloud cost: $4.00/hour
If utilized at 100%: $4.00/effective-GPU-hour
If utilized at 25%: $16.00/effective-GPU-hour
If utilized at 10%: $40.00/effective-GPU-hour
A GPU running at 10% utilization costs four times as much per effective compute unit as one running at 40%. Yet production systems routinely run at 20-30% utilization because:
- Request variability: Traffic spikes require headroom
- Batch accumulation delays: Waiting for batches to fill adds latency
- Memory fragmentation: Different request sizes prevent efficient packing
- Cold start overhead: Model loading and warmup consume cycles
Research by Patterson et al. (2022) at Google found that typical ML infrastructure achieves 30-50% GPU utilization in production—meaning half or more of GPU spend is wasted on idle cycles.
The GPU Utilization Curve
GPU costs don’t scale linearly with utilization. Cost per unit of useful work falls as 1/utilization, and pushing utilization higher trades against latency and throughput headroom.
The shape of this curve explains why the difference between 20% and 40% utilization matters far more than the difference between 70% and 90%. Moving from 20% to 40% halves the cost per unit of useful work (a 50% saving); moving from 70% to 90% saves only about 22%.
Practical implication: The first priority in GPU cost optimization is always moving from low utilization to moderate utilization. This provides the highest return on engineering investment.
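The arithmetic behind this priority ordering is short enough to check directly. The sketch below assumes the $4.00/hour H100 price used earlier and treats cost per useful GPU-hour as price divided by utilization; the function names are illustrative.

```python
def effective_cost_per_gpu_hour(hourly_price: float, utilization: float) -> float:
    """Price paid per hour of useful GPU work at a given utilization in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


def savings_from_utilization_gain(before: float, after: float) -> float:
    """Fractional cost reduction per unit of useful work when utilization rises."""
    return 1 - before / after


for before, after in [(0.20, 0.40), (0.70, 0.90)]:
    old = effective_cost_per_gpu_hour(4.00, before)
    new = effective_cost_per_gpu_hour(4.00, after)
    saved = savings_from_utilization_gain(before, after)
    print(f"{before:.0%} -> {after:.0%}: ${old:.2f} -> ${new:.2f} ({saved:.0%} saved)")
# 20% -> 40%: $20.00 -> $10.00 (50% saved)
# 70% -> 90%: $5.71 -> $4.44 (22% saved)
```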
What people do: Focus on optimizing inference latency and throughput while leaving GPUs running 24/7, regardless of actual demand.
Why it fails: GPU costs are dominated by idle time, not inference time. If your traffic is bursty (most applications), GPUs sitting idle during low-traffic hours can represent 70%+ of total compute spend. A $30K/month GPU bill might include $20K of idle time.
Fix: Implement auto-scaling based on queue depth, not just CPU metrics. Use spot/preemptible instances for batch workloads. Consider serverless or scale-to-zero inference for low-volume or bursty traffic (for example, Google Cloud Run, which offers GPU-backed services; AWS Lambda is CPU-only and so fits only small models). Track and alert on utilization: if it’s consistently below 40%, you’re likely over-provisioned.
What people do: Call LLM APIs without max_tokens limits or budget caps, trusting that normal usage won’t be expensive.
Why it fails: One bug can cause a cost explosion. A retry loop without backoff. A prompt that triggers verbose output. An agent that loops forever. We’ve seen teams wake up to $50K+ surprise bills from overnight batch jobs or runaway agents. API providers are happy to let you spend.
Fix: Always set max_tokens to the actual maximum you need. Implement per-request, per-user, and per-hour cost limits. Add circuit breakers that stop LLM calls if spend exceeds thresholds. Log and alert on cost anomalies (e.g., any hour that’s 3x the previous day’s average).
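A minimal sketch of the circuit-breaker idea follows. It assumes spend is tracked in-process per hour and that the caller can estimate a request's cost up front; the class, its fields, and the thresholds are hypothetical names chosen for illustration, not part of any provider SDK.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a request would push spend past the hourly cap."""


@dataclass
class SpendCircuitBreaker:
    hourly_cap_usd: float            # hard ceiling for the current hour
    baseline_hourly_usd: float       # e.g. the previous day's average hourly spend
    anomaly_multiple: float = 3.0    # alert once spend reaches 3x the baseline
    spent_this_hour: float = 0.0
    alert_sent: bool = False
    alerts: list[str] = field(default_factory=list)

    def authorize(self, estimated_cost_usd: float) -> None:
        """Call before each LLM request; raises if the cap would be exceeded."""
        projected = self.spent_this_hour + estimated_cost_usd
        if projected > self.hourly_cap_usd:
            raise BudgetExceeded(
                f"projected ${projected:.2f} exceeds cap ${self.hourly_cap_usd:.2f}"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Call after each request with the billed cost; alerts once per hour."""
        self.spent_this_hour += actual_cost_usd
        threshold = self.anomaly_multiple * self.baseline_hourly_usd
        if self.spent_this_hour >= threshold and not self.alert_sent:
            self.alert_sent = True
            self.alerts.append(
                f"hourly spend ${self.spent_this_hour:.2f} is at least "
                f"{self.anomaly_multiple:g}x the baseline"
            )

    def start_new_hour(self) -> None:
        """Reset counters at the top of each hour."""
        self.spent_this_hour = 0.0
        self.alert_sent = False


breaker = SpendCircuitBreaker(hourly_cap_usd=50.0, baseline_hourly_usd=10.0)
breaker.authorize(0.025)   # passes: well under the cap
breaker.record(0.025)
```

In a multi-process service the counters would need to live in shared storage rather than in memory; the in-process version here only illustrates the control flow.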
Token Economics
For LLM-based systems, costs are denominated in tokens. Understanding token economics is essential for cost modeling.
What determines token cost?
Token cost reflects the compute required for inference, which depends on:
- Model size: Larger models have more parameters to load and perform more operations per token
- Attention complexity: Self-attention scales quadratically with sequence length
- Input vs. output tokens: Output tokens are generated one at a time, each needing its own forward pass; input tokens are processed in parallel in a single prefill pass
- Hardware efficiency: Newer GPUs provide more tokens per dollar
The API pricing landscape (as of early 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Relative Cost (by output price) |
|---|---|---|---|
| GPT-5.5 | $5.00 | $30.00 | 1.0x |
| Claude Opus 4.8 | $5.00 | $25.00 | 0.83x |
| GPT-5.4 | $2.50 | $15.00 | 0.5x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 0.5x |
| Claude Haiku 4.5 | $1.00 | $5.00 | 0.17x |
| Llama 4 Maverick (API) | $0.22 | $0.85 | 0.03x |
These prices encode profound economics. The ~35x price difference between GPT-5.5 and Llama 4 Maverick reflects the premium for proprietary frontier models versus open-weight alternatives. Be careful, though: list API prices are strategic, margin-driven numbers—they are not a direct proxy for the underlying cost of compute. Providers price to capture value, segment customers, and fund R&D, so a 35x list-price gap does not mean a 35x difference in GPU cost to serve. The honest cost-of-compute comparison is the self-hosting analysis later in this chapter, which prices the GPU-seconds directly; API prices tell you what you’ll pay, not what the inference costs to produce. The 5-6x ratio between input and output tokens, by contrast, does track a real computational asymmetry of autoregressive generation.
The token cost calculation:
```python
def calculate_request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Calculate cost for a single LLM request."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_million
    output_cost = (output_tokens / 1_000_000) * output_price_per_million
    return input_cost + output_cost


# Example: GPT-5.5 request with 2000 input, 500 output tokens
cost = calculate_request_cost(2000, 500, 5.00, 30.00)
# = (2000/1M * $5.00) + (500/1M * $30.00)
# = $0.010 + $0.015 = $0.025 per request
```

At ~$0.025 per request, 1 million daily requests costs $25,000/day, or $750,000/month. This is the marginal cost that traditional software doesn’t have, and it’s why cost engineering matters.
Setup: A fintech startup ran nightly batch jobs to analyze customer transactions using GPT-5.4. The jobs had worked flawlessly for months, processing about 50,000 transactions per night at a cost of roughly $500.
What happened: On a Thursday night, an engineer deployed a “minor” change to improve analysis quality: they removed the max_tokens parameter to “let the model fully express its reasoning.” The batch job started at 11 PM. By Friday morning, it had run up more than $11,000 in API charges, over 20 times the usual nightly cost.
Without output limits, the model generated verbose analyses averaging 15,000 tokens per transaction instead of the usual 500. The job ran for 8 hours, burning through 750 million output tokens at $15 per million, about $11,250 in output charges alone. The engineer’s phone buzzed with a fraud alert from their credit card company before anyone noticed the API costs.
Why it happened (root cause): No programmatic cost caps. The API would happily accept unlimited requests at unlimited output lengths. The team assumed “max_tokens was just for speed” and didn’t realize it was their primary cost control.
Fix:
- Implemented hard per-request cost limits in a gateway layer
- Added hourly cost monitoring with automatic shutoff at 120% of expected spend
- Required max_tokens in all API calls via a wrapper library
- Created nightly budget alerts with escalation to on-call
Takeaway: Every API call needs a cost ceiling. Hope is not a budget strategy.
```python
def analyze_document(doc: str) -> str:
    # No limits - cost grows with document size and model verbosity
    response = client.messages.create(
        model="claude-opus-4-8",
        messages=[{
            "role": "user",
            "content": f"Analyze this document thoroughly:\n\n{doc}"
        }]
        # No max_tokens = model decides output length
        # Long doc = expensive input tokens
        # "thoroughly" = model writes essays
    )
    return response.content[0].text
```

```python
# ModelPricing, DailySpendTracker, AnalysisResponse, count_tokens, and the
# async `client` are assumed to be defined elsewhere.
class CostAwareLLM:
    def __init__(self, budget_per_request: float = 0.10):
        self.budget = budget_per_request
        self.pricing = ModelPricing()  # Current rates
        self.daily_spend = DailySpendTracker()

    async def analyze_document(
        self,
        doc: str,
        user_id: str,
        priority: str = "normal"
    ) -> AnalysisResponse:
        # Check daily budget before processing
        if self.daily_spend.remaining(user_id) <= 0:
            return AnalysisResponse(
                error="daily_budget_exhausted",
                fallback=self.get_cached_or_summary(doc)
            )

        # Calculate max affordable tokens
        input_tokens = count_tokens(doc)
        input_cost = self.pricing.input_cost(input_tokens)
        remaining_budget = self.budget - input_cost
        if remaining_budget <= 0:
            # Document too long for budget - truncate or chunk
            doc = self.smart_truncate(doc, target_tokens=5000)
            input_tokens = count_tokens(doc)
            input_cost = self.pricing.input_cost(input_tokens)
            remaining_budget = self.budget - input_cost

        max_output_tokens = self.pricing.tokens_for_budget(
            remaining_budget, output=True
        )
        max_output_tokens = min(max_output_tokens, 2000)  # Hard cap

        # Select model based on priority and budget
        model = self.select_model(priority, remaining_budget)
        response = await client.messages.create(
            model=model,
            max_tokens=max_output_tokens,  # Always set!
            messages=[{
                "role": "user",
                "content": f"""Analyze this document concisely.
Focus on: key findings, action items, risks.
Be direct - no filler.
<document>
{doc}
</document>"""
            }]
        )

        # Track actual spend
        actual_cost = self.pricing.calculate_cost(
            response.usage.input_tokens,
            response.usage.output_tokens,
            model
        )
        self.daily_spend.record(user_id, actual_cost)
        return AnalysisResponse(
            content=response.content[0].text,
            cost=actual_cost,
            model_used=model,
            tokens_used=response.usage.input_tokens + response.usage.output_tokens
        )

    def select_model(self, priority: str, budget: float) -> str:
        if priority == "high" and budget > 0.05:
            return "claude-opus-4-8"
        elif budget > 0.02:
            return "claude-sonnet-4-6"
        else:
            return "claude-haiku-4-5-20251001"  # Always affordable
```

Key differences: Per-request budget caps, input size management, calculated max_tokens based on remaining budget, model selection by cost tier, daily spend tracking per user, and concise prompt design. The runaway-bill incident above becomes impossible.
The Inference Efficiency Frontier
Different inference optimization techniques trade off along multiple dimensions:
The efficient frontier represents the best achievable combinations of cost and latency. Points on the dominated side of the frontier (higher cost or higher latency than necessary) are achievable; points beyond it require engineering innovation or accepting worse tradeoffs.
Key techniques on the frontier:
Continuous batching: Process requests as they arrive rather than waiting for fixed batches. Reduces latency while maintaining throughput. 30-50% latency reduction typical.
Speculative decoding: Use a small model to draft tokens, verify with the large model. Reduces large model calls by 2-3x for standard text.
Quantization: Reduce precision from FP16 to INT8 or INT4. Cuts memory and compute by 2-4x with 1-3% quality degradation.
KV cache optimization: Reuse computed attention states for prefix-sharing requests. Enables prompt caching with 50-80% input cost reduction.
Model distillation: Train smaller models on larger model outputs. Can achieve 90% quality at 10% cost for domain-specific tasks.
Full implementation: See reference/26b_cost_engineering_code.md for implementation details of each technique.
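The 2-4x memory claim for quantization follows directly from bytes per parameter. The sketch below covers weights only and ignores KV cache, activations, and quantization metadata; the function and table names are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights only (no KV cache or activations)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: {weight_memory_gb(7e9, precision):.1f} GB")
# 7B @ fp16: 14.0 GB
# 7B @ int8: 7.0 GB
# 7B @ int4: 3.5 GB
```

The practical consequence is GPU choice: a model that needs an 80GB card at FP16 may fit on a 24GB card after quantization, which changes the hourly price as well as the memory footprint.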
Total Cost of Ownership Analysis
The TCO Framework
Total Cost of Ownership (TCO) captures all costs associated with an AI system over its lifetime. TCO analysis prevents the common failure of optimizing visible costs while ignoring larger hidden costs.
TCO components:
Why indirect costs matter:
A 2023 study by Andreessen Horowitz found that for enterprise AI systems, engineering time represents 40-60% of total costs, often exceeding infrastructure spend. A system that costs $50,000/month in compute but requires a full-time engineer ($15,000/month fully loaded) to maintain carries hidden costs equal to 30% of its compute bill. A system that costs $100,000/month but runs autonomously may be cheaper overall once the first needs four or more such engineers.
“I’ve negotiated LLM API contracts with every major provider. The list prices are the starting point, not the answer.
At scale, everything is negotiable: per-token rates, minimum commits, reserved capacity, support tiers, even payment terms. The key is leverage: either you have enough volume that they want your business, or you have a credible alternative (open-weight models, self-hosting, competitors).
My playbook: (1) Get usage data—monthly spend, growth trajectory, use case breakdown. (2) Identify your alternative—even if you don’t want to use it, knowing you could gives you leverage. (3) Start conversations 3 months before renewals. (4) Ask for multi-year discounts; providers love predictable revenue. (5) Get the enterprise sales team involved; they have flexibility that developer relations doesn’t.
The discount range varies, but at $10K+/month spend, you should be getting at least 15-20% off list price. At $100K+/month, 30-40% is common. If you’re not asking, you’re leaving money on the table.”
—Principal Engineer, formerly Head of AI Platform
Calculating TCO
```python
from dataclasses import dataclass


@dataclass
class TCOAnalysis:
    """Complete TCO analysis for an AI system."""
    # Direct costs (monthly)
    compute_training: float
    compute_inference: float
    storage: float
    external_apis: float
    data_processing: float
    # Indirect costs (monthly, derived from annual rates)
    engineering_development: float  # FTE allocation * fully-loaded cost
    engineering_operations: float
    engineering_maintenance: float
    # Hidden costs (often forgotten)
    compliance_security: float       # ~5-10% of direct costs
    monitoring_observability: float  # ~3-5% of direct costs
    disaster_recovery: float         # ~2-5% of direct costs

    def monthly_direct(self) -> float:
        return (self.compute_training + self.compute_inference +
                self.storage + self.external_apis + self.data_processing)

    def monthly_indirect(self) -> float:
        return (self.engineering_development + self.engineering_operations +
                self.engineering_maintenance)

    def monthly_hidden(self) -> float:
        return (self.compliance_security + self.monitoring_observability +
                self.disaster_recovery)

    def monthly_tco(self) -> float:
        return self.monthly_direct() + self.monthly_indirect() + self.monthly_hidden()

    def annual_tco(self) -> float:
        return self.monthly_tco() * 12
```

Example: Recommendation System TCO
Consider a mid-scale recommendation system serving 10 million requests per day:
| Category | Monthly Cost | % of TCO |
|---|---|---|
| Direct Costs | | |
| Inference compute (4x A100 instances) | $28,000 | 35% |
| Training compute (periodic retraining) | $8,000 | 10% |
| Feature store (1TB) | $3,000 | 3.8% |
| Embedding API calls | $5,000 | 6.3% |
| Data processing pipeline | $4,000 | 5% |
| Indirect Costs | | |
| ML Engineer (0.5 FTE) | $12,500 | 15.6% |
| Platform Engineer (0.3 FTE) | $7,500 | 9.4% |
| Data Engineer (0.2 FTE) | $5,000 | 6.3% |
| Hidden Costs | | |
| Monitoring & observability | $3,000 | 3.8% |
| Security & compliance | $2,500 | 3.1% |
| Backup & DR | $1,500 | 1.9% |
| Total Monthly TCO | $80,000 | 100% |
This analysis reveals that engineering time (31% of TCO) rivals compute costs (45%). Optimizing only the cloud bill misses nearly a third of the cost structure.
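The shares quoted above can be reproduced from the table in a few lines. The sketch below copies the monthly figures into a plain dictionary; the key names are illustrative labels, not fields from the referenced implementation.

```python
# Monthly costs in USD, copied from the table above.
monthly_costs = {
    "direct": {
        "inference_compute": 28_000,
        "training_compute": 8_000,
        "feature_store": 3_000,
        "embedding_api": 5_000,
        "data_pipeline": 4_000,
    },
    "indirect": {"ml_engineer": 12_500, "platform_engineer": 7_500, "data_engineer": 5_000},
    "hidden": {"monitoring": 3_000, "security_compliance": 2_500, "backup_dr": 1_500},
}

total = sum(sum(items.values()) for items in monthly_costs.values())
engineering_share = sum(monthly_costs["indirect"].values()) / total
compute_share = (
    monthly_costs["direct"]["inference_compute"]
    + monthly_costs["direct"]["training_compute"]
) / total

print(f"Total monthly TCO: ${total:,}")             # Total monthly TCO: $80,000
print(f"Engineering share: {engineering_share:.0%}")  # Engineering share: 31%
print(f"Compute share: {compute_share:.0%}")          # Compute share: 45%
```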
Full implementation: See reference/26b_cost_engineering_code.md for the complete TCOAnalysis class with scenario modeling.
Unit Economics
TCO tells you total costs; unit economics tells you whether your business model works. Unit economics translates aggregate costs into per-unit metrics that can be compared against revenue or value generated.
Key unit economics metrics:
```python
from dataclasses import dataclass


@dataclass
class UnitEconomics:
    """Unit economics for an AI service."""
    total_monthly_cost: float
    monthly_requests: int
    monthly_active_users: int
    monthly_revenue: float

    @property
    def cost_per_request(self) -> float:
        """Cost to serve one request."""
        return self.total_monthly_cost / max(1, self.monthly_requests)

    @property
    def cost_per_user(self) -> float:
        """Cost to serve one active user per month."""
        return self.total_monthly_cost / max(1, self.monthly_active_users)

    @property
    def gross_margin(self) -> float:
        """(Revenue - Cost) / Revenue."""
        if self.monthly_revenue <= 0:
            return float('-inf')
        return (self.monthly_revenue - self.total_monthly_cost) / self.monthly_revenue

    @property
    def ai_cost_ratio(self) -> float:
        """AI costs as percentage of revenue."""
        if self.monthly_revenue <= 0:
            return float('inf')
        return self.total_monthly_cost / self.monthly_revenue
```

Healthy unit economics benchmarks:
| Metric | Concerning | Acceptable | Good |
|---|---|---|---|
| AI cost ratio | >50% | 20-50% | <20% |
| Cost per request | >$0.10 | $0.01-$0.10 | <$0.01 |
| Gross margin | <30% | 30-60% | >60% |
The unit economics trap:
Many AI startups fall into a trap: their unit economics worsen as they scale. This happens when:
- Usage grows faster than revenue: Free tiers or flat-rate pricing with increasing usage
- Costs scale linearly while value doesn’t: Each request costs the same, but marginal value to users decreases
- No cost optimization investment: Early-stage focus on features over efficiency
The solution is modeling unit economics at scale from day one:
```python
def project_unit_economics(
    current: UnitEconomics,
    growth_rate: float,                 # Monthly user growth
    months: int,
    cost_efficiency_gain: float = 0.02  # Monthly efficiency improvement
) -> list[UnitEconomics]:
    """Project unit economics over time with growth and optimization."""
    projections = [current]
    for _ in range(months):
        prev = projections[-1]
        # Users grow, costs grow (but more slowly with optimization)
        new_users = int(prev.monthly_active_users * (1 + growth_rate))
        new_requests = int(prev.monthly_requests * (1 + growth_rate * 1.2))  # Power users
        # Costs grow sublinearly with optimization
        cost_scale = (new_requests / prev.monthly_requests) * (1 - cost_efficiency_gain)
        new_costs = prev.total_monthly_cost * cost_scale
        # Revenue assumed to scale with users
        new_revenue = prev.monthly_revenue * (new_users / prev.monthly_active_users)
        projections.append(UnitEconomics(
            total_monthly_cost=new_costs,
            monthly_requests=new_requests,
            monthly_active_users=new_users,
            monthly_revenue=new_revenue
        ))
    return projections
```

Cost Attribution and Accountability
Why Attribution Matters
You cannot optimize what you cannot measure—and you cannot measure what you cannot attribute. Cost attribution assigns costs to the teams, projects, and features that incur them, creating visibility and accountability.
Without attribution: A single cloud bill shows $200,000/month. No one knows which team is responsible for what. Cost optimization is nobody’s job because it’s everybody’s job.
With attribution: Team A’s recommendation service costs $80,000/month, Team B’s search service costs $50,000/month, and shared infrastructure costs $70,000. Each team can optimize their portion; leadership can evaluate ROI by team.
Attribution Models
Different costs require different attribution approaches:
1. Direct attribution: Costs clearly tied to a specific owner
```python
from dataclasses import dataclass


# Example: API costs by team
# CostItem (with amount and metadata fields) is assumed to be defined elsewhere.
@dataclass
class DirectAttributionRule:
    """Attribute costs directly to the owner in metadata."""

    def apply(self, cost_item: CostItem) -> dict[str, float]:
        owner = cost_item.metadata.get('team') or cost_item.metadata.get('project')
        if owner:
            return {owner: cost_item.amount}
        return {'unattributed': cost_item.amount}
```

Direct attribution works for:
- Team-specific compute instances
- Project-tagged API calls
- Dedicated storage buckets
2. Proportional attribution: Shared costs split by usage
```python
from dataclasses import dataclass


@dataclass
class ProportionalAttributionRule:
    """Split shared costs by usage proportion."""
    usage_metric: str  # e.g., 'requests', 'compute_seconds', 'tokens'

    def apply(self, cost_item: CostItem, usage_by_team: dict[str, float]) -> dict[str, float]:
        total_usage = sum(usage_by_team.values())
        if total_usage == 0:
            return {'unattributed': cost_item.amount}
        return {
            team: cost_item.amount * (usage / total_usage)
            for team, usage in usage_by_team.items()
        }
```

Proportional attribution works for:
- Shared GPU clusters (split by GPU-hours consumed)
- Shared vector databases (split by queries or storage)
- Platform infrastructure (split by requests handled)
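As a worked example of the proportional split, the sketch below divides the $70,000 of shared infrastructure mentioned earlier by GPU-hours consumed. The standalone function mirrors the rule above without the CostItem type, and the per-team GPU-hour figures are hypothetical.

```python
def split_shared_cost(amount: float, usage_by_team: dict[str, float]) -> dict[str, float]:
    """Split a shared cost in proportion to each team's usage."""
    total = sum(usage_by_team.values())
    if total == 0:
        return {"unattributed": amount}
    return {team: amount * usage / total for team, usage in usage_by_team.items()}


# Hypothetical GPU-hours consumed on the shared cluster this month.
gpu_hours = {"team_a": 1_200, "team_b": 600, "team_c": 200}
allocation = split_shared_cost(70_000, gpu_hours)
print(allocation)  # {'team_a': 42000.0, 'team_b': 21000.0, 'team_c': 7000.0}
```

The choice of usage metric matters more than the arithmetic: splitting by requests instead of GPU-hours would shift cost toward teams with many cheap calls.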
3. Activity-based attribution: Costs assigned based on specific activities
```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ActivityBasedRule:
    """Attribute based on specific activities that drive costs."""
    activity_costs: dict[str, float]  # Cost per activity type

    def apply(self, activities: list[Activity]) -> dict[str, float]:
        attribution = defaultdict(float)
        for activity in activities:
            cost = self.activity_costs.get(activity.type, 0)
            attribution[activity.team] += cost * activity.count
        return dict(attribution)
```

Activity-based attribution works for:
- Training jobs (cost per GPU-hour by job owner)
- Experiment runs (cost per experiment by researcher)
- Model deployments (cost per deployment by team)
Implementing Cost Attribution
A production cost attribution system requires:
1. Tagging discipline: All resources tagged with owner, project, environment
```python
REQUIRED_TAGS = ['team', 'project', 'environment', 'cost_center']


def validate_resource_tags(resource: Resource) -> list[str]:
    """Return list of missing required tags."""
    return [tag for tag in REQUIRED_TAGS if tag not in resource.tags]


def enforce_tagging_policy(resource: Resource) -> bool:
    """Block resource creation if tags are missing."""
    missing = validate_resource_tags(resource)
    if missing:
        raise PolicyViolation(f"Resource missing required tags: {missing}")
    return True
```

2. Usage tracking: Collect usage metrics that drive costs
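```python
from collections import defaultdict
from datetime import datetime, timezone


class UsageTracker:
    """Track usage metrics for cost attribution."""

    def __init__(self) -> None:
        self.metrics: list[dict] = []

    def record_request(self, team: str, service: str, tokens: int, gpu_ms: int):
        """Record a request with its resource consumption."""
        self.metrics.append({
            'timestamp': datetime.now(timezone.utc),
            'team': team,
            'service': service,
            'tokens': tokens,
            'gpu_ms': gpu_ms
        })

    def aggregate_by_team(self, start: datetime, end: datetime) -> dict[str, dict]:
        """Aggregate usage by team; start/end must be timezone-aware (UTC)."""
        # Returns: {'team_a': {'tokens': 1M, 'gpu_ms': 50000}, ...}
        # Assumes a half-open window [start, end).
        totals = defaultdict(lambda: {'tokens': 0, 'gpu_ms': 0})
        for m in self.metrics:
            if start <= m['timestamp'] < end:
                totals[m['team']]['tokens'] += m['tokens']
                totals[m['team']]['gpu_ms'] += m['gpu_ms']
        return dict(totals)
```

3. Attribution engine: Apply rules to calculate allocations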
```python
from collections import defaultdict


# AttributionRule and CostItem are assumed to be defined elsewhere; rules are
# assumed to share an apply(cost, usage) interface.
class CostAttributionEngine:
    """Engine for attributing costs to owners."""

    def __init__(self, rules: list[AttributionRule]):
        self.rules = rules

    def calculate_attribution(
        self,
        costs: list[CostItem],
        usage: dict[str, dict]
    ) -> dict[str, float]:
        """Calculate cost attribution by team/project."""
        attribution = defaultdict(float)
        for cost in costs:
            # _get_rule (not shown) is assumed to pick the rule for a category
            rule = self._get_rule(cost.category)
            team_costs = rule.apply(cost, usage)
            for team, amount in team_costs.items():
                attribution[team] += amount
        return dict(attribution)
```

Full implementation: See reference/26b_cost_engineering_code.md for the complete attribution system.
Creating Accountability
Attribution data only matters if it drives behavior. Effective accountability requires:
1. Visibility: Teams can see their costs in real-time
```python
class TeamCostDashboard:
    """Real-time cost visibility for engineering teams."""

    def get_team_summary(self, team: str) -> dict:
        return {
            'current_month': {
                'total_cost': self.get_mtd_cost(team),
                'budget': self.get_budget(team),
                'burn_rate': self.get_daily_average(team) * 30,
                'forecast': self.get_month_end_forecast(team)
            },
            'by_service': self.get_cost_by_service(team),
            'trends': self.get_weekly_trends(team),
            'top_cost_drivers': self.get_top_costs(team, n=5),
            'optimization_opportunities': self.get_recommendations(team)
        }
```

2. Budgets: Teams have cost targets they’re accountable for
@dataclass
class CostBudget:
    """Team cost budget with alerting thresholds."""
    team: str
    monthly_budget: float
    alert_threshold: float = 0.80  # Alert at 80%
    hard_limit: float = 1.20  # Require approval above 120%

    def check_against_spend(self, current_spend: float) -> BudgetStatus:
        ratio = current_spend / self.monthly_budget
        if ratio > self.hard_limit:
            return BudgetStatus.EXCEEDED_HARD_LIMIT
        elif ratio > 1.0:
            return BudgetStatus.OVER_BUDGET
        elif ratio > self.alert_threshold:
            return BudgetStatus.APPROACHING_LIMIT
        else:
            return BudgetStatus.ON_TRACK

3. Reviews: Regular discussion of cost performance
Effective cost reviews happen at multiple cadences:
| Cadence | Focus | Attendees |
|---|---|---|
| Weekly | Anomalies, quick wins | Engineering leads |
| Monthly | Trends, budget performance | Team leads + Finance |
| Quarterly | TCO, strategic investments | Leadership |
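The dashboard and budget code above both depend on a month-end forecast that is not shown. A minimal sketch, assuming a simple run-rate projection from month-to-date spend, follows. The function name and the example figures are illustrative; a production forecast would likely account for weekday patterns and growth.

```python
import calendar
from datetime import date


def forecast_month_end(mtd_cost: float, today: date) -> float:
    """Project month-end spend by extending the month-to-date daily average."""
    days_elapsed = today.day
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return mtd_cost / days_elapsed * days_in_month


# Example: $42,000 spent by March 10 against a $100,000 monthly budget.
forecast = forecast_month_end(mtd_cost=42_000.0, today=date(2025, 3, 10))
budget_ratio = forecast / 100_000.0
print(f"forecast=${forecast:,.0f}, ratio={budget_ratio:.2f}")
```

Comparing the forecast ratio, rather than current spend, against the 0.80 and 1.20 thresholds gives teams warning weeks before the budget is actually breached.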
GPU Selection and Self-Hosting Economics
The GPU Selection Decision
Choosing the right GPU is one of the highest-leverage cost decisions in AI. The wrong GPU wastes money through underutilization, overpayment, or insufficient performance.
GPU selection framework:
Key GPU comparisons (cloud pricing, late 2024):
| GPU | VRAM | FP16 TFLOPS (dense) | Hourly Cost | $/TFLOP-hr | Best For |
|---|---|---|---|---|---|
| H100 SXM | 80GB | 989 | $4.50 | $0.0045 | Large training |
| A100 80GB | 80GB | 312 | $2.50 | $0.0080 | Training, large inference |
| A100 40GB | 40GB | 312 | $1.75 | $0.0056 | Medium training/inference |
| L4 | 24GB | 121 | $0.50 | $0.0041 | Cost-efficient inference |
| A10G | 24GB | 125 | $0.75 | $0.0060 | Balanced inference |
| T4 | 16GB | 65 | $0.35 | $0.0054 | Budget inference |
Decision heuristics:
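For training: Match GPU memory to model size and training method. A 7B parameter model needs ~14GB for FP16 weights alone. Full fine-tuning with Adam in mixed precision adds gradients and optimizer states, roughly 16 bytes per parameter in total (~112GB before activations), so it needs several 80GB GPUs with sharding. With a parameter-efficient method such as LoRA, a 40GB A100 is the practical minimum and 80GB is preferred for larger batch sizes.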
For inference: Match to your batch size and latency requirements. High-throughput batch inference benefits from larger GPUs (more parallelism); low-latency single requests are limited by memory bandwidth, not compute.
For cost optimization: Calculate cost per token or cost per request across GPU options. The cheapest GPU per hour is often not the cheapest per unit of work.
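The cost-per-unit heuristic can be made concrete with a short calculation. In the sketch below, the hourly prices come from the table above, but the tokens-per-second figures are placeholders invented for illustration. Replace them with throughput measured on your own model and batch sizes.

```python
def cost_per_million_tokens(hourly_cost: float, tokens_per_sec: float,
                            utilization: float = 1.0) -> float:
    """Dollars per 1M generated tokens for one GPU at a given utilization."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# (hourly price from the table, ASSUMED tokens/sec for some fixed model)
candidates = {
    "T4": (0.35, 150.0),
    "L4": (0.50, 400.0),
    "A100 80GB": (2.50, 2500.0),
}

costs = {
    name: cost_per_million_tokens(price, tps)
    for name, (price, tps) in candidates.items()
}
cheapest = min(costs, key=costs.get)

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.3f} per 1M tokens")
```

Under these assumed throughputs the most expensive GPU per hour is the cheapest per token. The ranking can flip with different measurements, which is why the comparison should be run on benchmarked numbers.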
The Self-Hosting Breakeven Analysis
When should you self-host models instead of using APIs? This is a classic build-versus-buy decision with quantifiable economics.
The cost structure comparison:
Breakeven calculation:
import math
from dataclasses import dataclass

@dataclass
class SelfHostingAnalysis:
    """Analyze self-hosting vs API economics."""
    # API costs
    api_cost_per_request: float  # $ per request (input + output tokens priced out)
    # Self-hosting costs
    gpu_hourly_cost: float  # $ per GPU hour
    num_gpus: int
    gpu_utilization: float  # Expected utilization (0.0 - 1.0)
    engineering_monthly: float  # Engineering time for operations
    infrastructure_monthly: float  # Networking, monitoring, etc.
    # Throughput model (token-based, NOT a flat per-request latency)
    output_tokens_per_sec_per_system: float  # Aggregate decode throughput of the GPU set
    avg_output_tokens_per_request: float  # Output-length assumption, the key swing factor

    @property
    def monthly_fixed_cost(self) -> float:
        """Fixed monthly cost of self-hosting."""
        gpu_monthly = self.gpu_hourly_cost * 24 * 30 * self.num_gpus
        return gpu_monthly + self.engineering_monthly + self.infrastructure_monthly

    @property
    def requests_per_gpu_month(self) -> float:
        """Requests the GPU set can serve per month at expected utilization.

        Per-request time is dominated by OUTPUT token count, not a flat latency:
        a 500-token reply takes ~10-25x as long to generate as a 20-token one
        because decode is sequential. So we model capacity as tokens/sec, then
        divide by the average output length. Prefill (input tokens) is parallel
        and comparatively cheap, so it is omitted here for clarity.
        """
        output_tokens_per_month = (
            self.output_tokens_per_sec_per_system
            * self.gpu_utilization
            * 3600 * 24 * 30
        )
        return output_tokens_per_month / self.avg_output_tokens_per_request

    @property
    def marginal_cost_per_request(self) -> float:
        """Marginal cost once infrastructure is running."""
        # Rented GPUs bill power and wear into the hourly price, so the
        # per-request marginal cost is treated as zero in this model.
        return 0.0

    def total_cost(self, monthly_requests: int) -> dict:
        """Calculate total costs for both options."""
        api_cost = monthly_requests * self.api_cost_per_request
        # Self-hosted: fixed + marginal
        if monthly_requests <= self.requests_per_gpu_month:
            fixed_cost = self.monthly_fixed_cost
        else:
            # Need more GPUs. Simplification: all fixed costs scale with GPU count.
            gpus_needed = math.ceil(monthly_requests / self.requests_per_gpu_month * self.num_gpus)
            fixed_cost = self.monthly_fixed_cost * (gpus_needed / self.num_gpus)
        self_hosted_cost = fixed_cost + monthly_requests * self.marginal_cost_per_request
        return {
            'api_cost': api_cost,
            'self_hosted_cost': self_hosted_cost,
            'recommendation': 'self-host' if self_hosted_cost < api_cost else 'api'
        }

    def breakeven_requests(self) -> float:
        """Find monthly request volume where self-hosting breaks even."""
        # Fixed / (API marginal - Self marginal) = breakeven volume
        marginal_diff = self.api_cost_per_request - self.marginal_cost_per_request
        if marginal_diff <= 0:
            return math.inf  # Self-hosting never cheaper on marginal
        return int(self.monthly_fixed_cost / marginal_diff)

Example breakeven analysis:
Scenario: Considering self-hosting a Llama 3.1 70B model vs. using API. Note that there are two break-evens to track, and both move with the output-length distribution:
- Cost break-even: the monthly volume at which API spend equals self-hosting’s fixed cost. Because API requests are priced per token, the per-request price — and therefore this break-even — scales with average output length.
- Capacity: whether your GPU set can actually serve that volume. Capacity is governed by decode throughput in tokens/sec, divided by average output length. A reply that is 5x longer cuts servable requests/month by ~5x.
| Parameter | Value |
|---|---|
| Self-hosted: 2x A100 80GB | $2.50/hr each = $3,600/month |
| Engineering (0.1 FTE) | $1,500/month |
| Infrastructure overhead | $500/month |
| Total fixed cost | $5,600/month |
| GPU utilization target | 50% |
| Aggregate decode throughput (70B, continuous batching) | ~2,000 output tok/sec |
We evaluate two output-length regimes that bracket realistic workloads — short replies (~100 output tokens, e.g. classification or extraction) and long replies (~500 output tokens, e.g. summarization or chat). Input/prefill is parallel and comparatively cheap, so we approximate API price per request from output tokens at $0.85/1M (Llama 4 Maverick output rate from the pricing table above):
# --- Short-reply regime: ~100 output tokens/request ---
short = SelfHostingAnalysis(
api_cost_per_request=0.85 / 1_000_000 * 100, # ~$0.000085/req
gpu_hourly_cost=2.50,
num_gpus=2,
gpu_utilization=0.50,
engineering_monthly=1500,
infrastructure_monthly=500,
output_tokens_per_sec_per_system=2000,
avg_output_tokens_per_request=100,
)
# Cost break-even = $5,600 / $0.000085 ≈ 65.9M requests/month
# Capacity on 2 GPUs = 2000 * 0.5 * 2.592e6 s/mo / 100 ≈ 25.9M requests/month
# --- Long-reply regime: ~500 output tokens/request ---
long = SelfHostingAnalysis(
api_cost_per_request=0.85 / 1_000_000 * 500, # ~$0.000425/req
gpu_hourly_cost=2.50,
num_gpus=2,
gpu_utilization=0.50,
engineering_monthly=1500,
infrastructure_monthly=500,
output_tokens_per_sec_per_system=2000,
avg_output_tokens_per_request=500,
)
# Cost break-even = $5,600 / $0.000425 ≈ 13.2M requests/month
# Capacity on 2 GPUs = 2000 * 0.5 * 2.592e6 s/mo / 500 ≈ 5.2M requests/month

(Here 2.592e6 is the seconds in a 30-day month: 3600 * 24 * 30.)
The headline number is not a constant — it moves ~5x with output length:
| Regime | API $/req | Cost break-even (req/mo) | Capacity on 2 GPUs (req/mo) |
|---|---|---|---|
| Short reply (~100 out tok) | ~$0.000085 | ~66M | ~26M |
| Long reply (~500 out tok) | ~$0.000425 | ~13M | ~5.2M |
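Read this carefully. The long-reply workload reaches cost break-even at roughly a fifth of the short-reply volume, but in both regimes the break-even volume is about 2.5x what two GPUs can serve at 50% utilization. Serving that volume means adding GPUs, which raises fixed cost again: under these assumptions the self-hosted system costs about $2.16 per 1M output tokens ($5,600 / 2.592B tokens per month), so it does not undercut a $0.85/1M open-weight API rate at any volume. Against a premium frontier API (e.g. $25-30/1M output, ~30-35x the rate used here), the break-even drops by the same factor, to roughly 2M requests/month for short replies and 400K for long ones, both well within the capacity of two GPUs. That is the cliff described at the top of this chapter. The break-even is a function of your output-length distribution and the API tier you are displacing, not a fixed constant; re-derive it whenever either changes.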
Critical caveats:
Output-length sensitivity: As the table shows, doubling average output length roughly halves both the cost break-even and the servable capacity. Always model break-even against your measured output-length distribution, not an assumed flat per-request cost.
Utilization uncertainty: The analysis assumes 50% utilization. If you achieve only 25%, effective costs double.
Engineering underestimation: Operations, troubleshooting, and updates often take more time than estimated.
Opportunity cost: Engineering time on infrastructure is time not spent on product.
Flexibility cost: APIs scale instantly; self-hosted requires capacity planning.
Quality differences: Self-hosted open models may not match API quality for your use case.
The right decision depends on your confidence in volume projections, engineering capacity, and strategic importance of ownership.
Full implementation: See reference/26b_cost_engineering_code.md for the complete analysis framework.
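The utilization caveat is easiest to see as an effective price per token. The sketch below reuses the scenario's figures ($5,600/month fixed cost, ~2,000 output tokens/sec) and compares the result with the $0.85/1M open-weight rate and a $25/1M premium rate. The helper name and the set of utilization levels are illustrative assumptions.

```python
SECONDS_PER_MONTH = 3600 * 24 * 30  # 30-day month, as in the analysis above


def self_hosted_cost_per_million(fixed_monthly: float, tokens_per_sec: float,
                                 utilization: float) -> float:
    """Effective $ per 1M output tokens for a self-hosted system."""
    tokens_per_month = tokens_per_sec * utilization * SECONDS_PER_MONTH
    return fixed_monthly / tokens_per_month * 1_000_000


api_rates = {"open-weight API": 0.85, "premium API": 25.0}  # $ per 1M output tokens

costs = {
    utilization: self_hosted_cost_per_million(5600.0, 2000.0, utilization)
    for utilization in (0.25, 0.50, 0.80)
}

for utilization, cost in costs.items():
    verdicts = ", ".join(
        f"{'cheaper' if cost < rate else 'dearer'} than {name}"
        for name, rate in api_rates.items()
    )
    print(f"{utilization:.0%} utilization: ${cost:.2f}/1M tokens ({verdicts})")
```

Halving utilization doubles the effective price, so a utilization estimate that is off by 2x moves the decision as much as a 2x error in the API rate would.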
Cost Optimization Strategies
“Our AI bill went from $50K to $400K in three months. Leadership was panicking about ‘AI costs.’ When I actually looked at the data, 80% of the spend was one feature that product had added without telling anyone—auto-summarization that ran on every document load. We added caching and moved to a smaller model. Costs dropped 90%. The moral: before optimizing infrastructure, find out what’s actually generating the calls.”
— Engineering Director at enterprise software company
The Optimization Priority Matrix
Not all cost optimizations are equal. Prioritize based on impact and effort:
Tier 1: Quick wins (days to implement)
- Model selection optimization: Use smaller models for simpler tasks
class ModelRouter:
    """Route requests to appropriate model based on complexity."""

    MODELS = {
        'simple': {'name': 'claude-haiku-4-5', 'cost_per_1k_tokens': 0.003},
        'moderate': {'name': 'gpt-5', 'cost_per_1k_tokens': 0.0056},
        'complex': {'name': 'claude-opus-4-8', 'cost_per_1k_tokens': 0.012}
    }

    def route(self, request: Request) -> str:
        """Route request to cheapest sufficient model."""
        complexity = self.assess_complexity(request)
        if complexity < 0.3:
            return self.MODELS['simple']['name']
        elif complexity < 0.7:
            return self.MODELS['moderate']['name']
        else:
            return self.MODELS['complex']['name']

    def assess_complexity(self, request: Request) -> float:
        """Estimate task complexity (0-1). Simple heuristics."""
        score = 0.0
        # Simple tasks: short, common patterns
        if len(request.prompt) < 500:
            score += 0.0
        elif len(request.prompt) < 2000:
            score += 0.3
        else:
            score += 0.5
        # Complex indicators
        if any(kw in request.prompt.lower() for kw in ['analyze', 'compare', 'reason']):
            score += 0.2
        if request.requires_structured_output:
            score += 0.1
        return min(1.0, score)

Savings: 40-70% by routing 60% of requests to smaller models
- Prompt optimization: Reduce token usage without sacrificing quality
import re

class PromptOptimizer:
    """Optimize prompts for token efficiency."""

    def optimize(self, prompt: str) -> tuple[str, dict]:
        """Return optimized prompt and savings metrics."""
        original_tokens = self.count_tokens(prompt)
        optimizations = []
        optimized = prompt
        # Remove redundant whitespace. This also flattens newlines, so skip it
        # for prompts whose line structure carries meaning (code, tables).
        optimized = ' '.join(optimized.split())
        if len(optimized) < len(prompt):
            optimizations.append('whitespace_reduction')
        # Compress verbose instructions
        verbose_patterns = [
            (r'Please (.+) for me', r'\1'),
            (r'I would like you to', ''),
            (r'Can you please', ''),
        ]
        before_patterns = optimized
        for pattern, replacement in verbose_patterns:
            optimized = re.sub(pattern, replacement, optimized, flags=re.IGNORECASE)
        if optimized != before_patterns:
            optimizations.append('verbose_phrase_removal')
        # Reduce example count if excessive
        if prompt.count('Example:') > 3:
            optimizations.append('reduce_examples')
            # Keep first and last example, remove middle
            # (Implementation detail omitted)
        new_tokens = self.count_tokens(optimized)
        # Guard against empty prompts to avoid division by zero
        savings = (original_tokens - new_tokens) / original_tokens if original_tokens else 0.0
        return optimized, {
            'original_tokens': original_tokens,
            'optimized_tokens': new_tokens,
            'savings_pct': savings,
            'optimizations_applied': optimizations
        }

Savings: 15-30% token reduction typical
- Caching: Avoid redundant computation
from typing import Optional

class SemanticCache:
    """Cache responses for semantically similar queries."""

    def __init__(self, embedding_model, similarity_threshold: float = 0.95):
        self.entries = []  # list of (embedding, response) pairs
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold

    def get(self, query: str) -> Optional[str]:
        """Return the most similar cached response above the threshold, if any."""
        query_embedding = self.embedding_model.encode(query)
        best_response, best_similarity = None, self.threshold
        # Linear scan is O(n); switch to a vector index once the cache grows large.
        # cosine_similarity is assumed to return a scalar for two 1-D vectors.
        for cached_embedding, response in self.entries:
            similarity = cosine_similarity(query_embedding, cached_embedding)
            if similarity >= best_similarity:
                best_response, best_similarity = response, similarity
        return best_response

    def set(self, query: str, response: str):
        """Cache response for future similar queries."""
        embedding = self.embedding_model.encode(query)
        self.entries.append((embedding, response))

Savings: 20-60% depending on query similarity in your workload
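The Tier 1 savings figures do not simply add up when the optimizations are combined, because each one applies to whatever spend the previous one left behind. A rough sketch of the arithmetic follows. It uses the per-1K-token prices from `ModelRouter`, while the 60/20/20 traffic split, the 20% prompt reduction, and the 25% cache hit rate are assumptions chosen for illustration.

```python
def blended_price(mix: dict[str, float], prices: dict[str, float]) -> float:
    """Average price per 1K tokens for a given traffic mix across model tiers."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    return sum(share * prices[tier] for tier, share in mix.items())


def stacked_savings(*fractions: float) -> float:
    """Combine independent savings multiplicatively: 1 - product(1 - s_i)."""
    remaining = 1.0
    for fraction in fractions:
        remaining *= 1.0 - fraction
    return 1.0 - remaining


prices = {"simple": 0.003, "moderate": 0.0056, "complex": 0.012}  # from ModelRouter
mix = {"simple": 0.6, "moderate": 0.2, "complex": 0.2}  # ASSUMED traffic split

# Baseline: every request goes to the 'complex' model.
routing_savings = 1.0 - blended_price(mix, prices) / prices["complex"]

# ASSUMED: 20% fewer tokens from prompt optimization, 25% cache hit rate.
combined = stacked_savings(routing_savings, 0.20, 0.25)

print(f"routing alone: {routing_savings:.1%}, all three: {combined:.1%}")
```

Adding the three percentages would suggest savings above 100%, which is impossible. Multiplying the remaining fractions gives about 73% here, and even that treats the three effects as independent, which real workloads only approximate.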
Tier 2: Medium effort (weeks to implement)
- Batching optimization: Process requests in efficient batches
class DynamicBatcher:
    """Batch requests for efficient GPU utilization."""

    def __init__(self, max_batch_size: int = 32, max_wait_ms: int = 50):
        self.queue = asyncio.Queue()
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms

    async def process_loop(self):
        """Continuously process batches."""
        while True:
            batch = []
            deadline = time.time() + self.max_wait_ms / 1000
            # Collect requests until batch full or timeout
            while len(batch) < self.max_batch_size and time.time() < deadline:
                try:
                    timeout = deadline - time.time()
                    request = await asyncio.wait_for(
                        self.queue.get(),
                        timeout=max(0, timeout)
                    )
                    batch.append(request)
                except asyncio.TimeoutError:
                    break
            if batch:
                responses = await self.process_batch(batch)
                for request, response in zip(batch, responses):
                    request.future.set_result(response)

Savings: 30-50% through better GPU utilization
- Quantization: Reduce model precision for faster, cheaper inference
# Using bitsandbytes for 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
# 70B model weights: FP16 = 140GB, INT4 = 35GB
# Can run on 2x A100 40GB instead of 4x A100 80GB

Savings: 50-75% through GPU memory reduction
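The 140GB and 35GB figures follow directly from parameter count times bits per parameter. The sketch below reproduces them and estimates a GPU count. The 80% usable-memory fraction, which reserves headroom for KV cache and activations, is an assumption for illustration, and real 4-bit formats add a small overhead for quantization metadata that is ignored here.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def gpus_needed(memory_gb: float, vram_gb: float, usable_fraction: float = 0.8) -> int:
    """GPUs required if only `usable_fraction` of VRAM may hold weights (ASSUMED 80%)."""
    return math.ceil(memory_gb / (vram_gb * usable_fraction))


fp16_gb = weight_memory_gb(70, 16)
int4_gb = weight_memory_gb(70, 4)

print(f"FP16: {fp16_gb:.0f}GB -> {gpus_needed(fp16_gb, 80)}x A100 80GB")
print(f"INT4: {int4_gb:.0f}GB -> {gpus_needed(int4_gb, 40)}x A100 40GB")
```

Memory savings translate into cost savings only if the smaller footprint lets you drop GPUs or move to a cheaper tier, so it is worth checking the GPU count rather than the gigabytes.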
Tier 3: Strategic investments (months to implement)
- Architecture redesign: Reduce unnecessary LLM calls
| Pipeline Step | Before (3 LLM calls) | After (1 LLM call) |
|---|---|---|
| Classification | LLM call | Local classifier |
| Entity Extraction | LLM call | Local NER model |
| Response Generation | LLM call | LLM call |
| Total LLM calls | 3 per request | 1 per request |
Savings: 60-70% by replacing LLM calls with specialized models
- Custom model training: Fine-tune smaller models for your domain
# Distillation: train small model on large model outputs
class ModelDistillation:
    """Distill large model knowledge to smaller model."""

    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def generate_training_data(self, prompts: list[str]) -> list[dict]:
        """Generate training data from teacher model."""
        data = []
        for prompt in prompts:
            response = self.teacher.generate(prompt)
            data.append({'prompt': prompt, 'response': response})
        return data

    def train_student(self, training_data: list[dict]):
        """Fine-tune student on teacher outputs."""
        # Standard fine-tuning loop
        # Student learns to mimic teacher behavior
        raise NotImplementedError("Training loop omitted; see the full implementation")

Savings: 80-95% by replacing API calls with fine-tuned small model
Full implementation: See reference/26b_cost_engineering_code.md for complete implementations.
Build vs. Buy Decision Framework
The Decision Matrix
Build-versus-buy decisions in AI involve more factors than traditional software:
Quantitative Analysis Framework
from dataclasses import dataclass
from typing import Optional

@dataclass
class BuildVsBuyAnalysis:
    """Comprehensive build vs buy analysis."""
    # Build costs
    development_months: int
    developers_required: int
    developer_monthly_cost: float
    infrastructure_monthly: float
    maintenance_fte: float
    # Buy costs
    per_request_cost: float
    monthly_minimum: float
    # Usage projections
    monthly_requests_year1: list[int]  # 12 months
    monthly_requests_year2: list[int]  # 12 months
    # Qualitative factors (scored 1-5, weight 0-1)
    factors: list[tuple[str, float, int, int]]  # name, weight, build_score, buy_score

    def calculate_costs(self) -> dict:
        """Calculate 2-year TCO for both options."""
        # Build: Development + infrastructure + maintenance
        development_cost = (
            self.development_months *
            self.developers_required *
            self.developer_monthly_cost
        )
        # Infrastructure and maintenance start once development finishes
        operating_months = 24 - self.development_months
        infrastructure_cost = self.infrastructure_monthly * operating_months
        maintenance_cost = (
            operating_months *
            self.maintenance_fte *
            self.developer_monthly_cost
        )
        build_total = development_cost + infrastructure_cost + maintenance_cost
        # Buy: Per-request + minimums
        all_requests = self.monthly_requests_year1 + self.monthly_requests_year2
        buy_total = sum(
            max(self.monthly_minimum, requests * self.per_request_cost)
            for requests in all_requests
        )
        return {
            'build_2yr_tco': build_total,
            'buy_2yr_tco': buy_total,
            'cost_difference': buy_total - build_total,
            'break_even_month': self._find_breakeven()
        }

    def _find_breakeven(self) -> Optional[int]:
        """First month (1-24) in which cumulative buy cost reaches cumulative build cost."""
        all_requests = self.monthly_requests_year1 + self.monthly_requests_year2
        build_cumulative = buy_cumulative = 0.0
        for month, requests in enumerate(all_requests, start=1):
            if month <= self.development_months:
                build_cumulative += self.developers_required * self.developer_monthly_cost
            else:
                build_cumulative += (
                    self.infrastructure_monthly
                    + self.maintenance_fte * self.developer_monthly_cost
                )
            buy_cumulative += max(self.monthly_minimum, requests * self.per_request_cost)
            if buy_cumulative >= build_cumulative:
                return month
        return None  # Buying stays cheaper over the whole 24-month horizon

    def calculate_scores(self) -> dict:
        """Calculate weighted qualitative scores."""
        build_score = sum(weight * score for name, weight, score, _ in self.factors)
        buy_score = sum(weight * score for name, weight, _, score in self.factors)
        return {
            'build_score': build_score,
            'buy_score': buy_score,
            'recommendation': 'build' if build_score > buy_score else 'buy'
        }

    def full_analysis(self) -> dict:
        """Combined cost and qualitative analysis."""
        costs = self.calculate_costs()
        scores = self.calculate_scores()
        # Cost decides the recommendation; qualitative scores only adjust confidence
        cost_favors = 'build' if costs['build_2yr_tco'] < costs['buy_2yr_tco'] else 'buy'
        qual_favors = scores['recommendation']
        if cost_favors == qual_favors:
            confidence = 'high'
        else:
            confidence = 'medium - cost and qualitative factors disagree'
        return {
            **costs,
            **scores,
            'overall_recommendation': cost_favors,  # Default to cost
            'confidence': confidence
        }

Example: Embedding Service Build vs. Buy
analysis = BuildVsBuyAnalysis(
# Build costs
development_months=3,
developers_required=2,
developer_monthly_cost=15000, # Fully loaded
infrastructure_monthly=6000, # 2x A100
maintenance_fte=0.25,
# Buy costs (OpenAI embeddings)
per_request_cost=0.0001, # $0.10 per 1M tokens, ~1K tokens/request
monthly_minimum=0,
# Usage: Growing from 10M to 50M requests/month
monthly_requests_year1=[10e6, 12e6, 15e6, 18e6, 22e6, 26e6,
30e6, 34e6, 38e6, 42e6, 46e6, 50e6],
monthly_requests_year2=[50e6] * 12, # Plateau at 50M
# Qualitative factors
factors=[
('data_privacy', 0.3, 5, 2), # Build much better for privacy
('customization', 0.2, 4, 2), # Build allows fine-tuning
('engineering_capacity', 0.2, 3, 5), # Limited ML team
('time_to_market', 0.2, 2, 5), # Need it fast
('strategic_importance', 0.1, 3, 3) # Neutral
]
)
result = analysis.full_analysis()
# Build 2yr TCO: $294,750 ($90,000 development + $126,000 infrastructure + $78,750 maintenance)
# Buy 2yr TCO: $94,300 (943M requests x $0.0001)
# Qualitative: Build 3.6, Buy 3.3
# Recommendation: Buy (cost wins despite qualitative edge for build)

Building a Cost-Aware Culture
The FinOps Mindset for AI
FinOps—the practice of bringing financial accountability to cloud spend—is especially important for AI systems where costs are higher and more variable.
Core FinOps principles applied to AI:
Teams own their costs: The team that deploys a model is accountable for its cost, not a central platform team.
Decisions are data-driven: Cost tradeoffs are made with real numbers, not intuition.
Optimization is continuous: Cost efficiency is an ongoing practice, not a one-time project.
Value, not just cost: The goal is maximizing value per dollar, not minimizing dollars.
Implementing Cost Governance
1. Cost visibility dashboards
Every engineering team should have real-time visibility into their AI costs:
class AICostDashboard:
    """Team-level AI cost visibility."""

    def get_team_view(self, team: str) -> dict:
        return {
            'summary': {
                'mtd_cost': self.get_mtd_cost(team),
                'budget': self.get_budget(team),
                'utilization': self.get_budget_utilization(team),
                'forecast': self.forecast_month_end(team)
            },
            'by_model': self.get_costs_by_model(team),
            'by_service': self.get_costs_by_service(team),
            'daily_trend': self.get_daily_trend(team, days=30),
            'anomalies': self.detect_anomalies(team),
            'opportunities': self.identify_optimizations(team)
        }

2. Budget controls with escalation
@dataclass
class BudgetPolicy:
    """Budget policy with tiered escalation."""
    team: str
    monthly_budget: float
    # Escalation thresholds
    warn_threshold: float = 0.75  # Alert at 75%
    review_threshold: float = 0.90  # Require review at 90%
    block_threshold: float = 1.10  # Block new requests at 110%

    def check_request(self, request_cost: float, current_spend: float) -> dict:
        """Check if request should proceed given budget."""
        new_spend = current_spend + request_cost
        utilization = new_spend / self.monthly_budget
        if utilization > self.block_threshold:
            return {
                'allowed': False,
                'reason': 'Budget exceeded - requires VP approval',
                'escalation': 'vp'
            }
        elif utilization > self.review_threshold:
            return {
                'allowed': True,
                'warning': 'Approaching budget limit',
                'escalation': 'manager'
            }
        elif utilization > self.warn_threshold:
            return {
                'allowed': True,
                'warning': 'Budget 75% utilized'
            }
        else:
            return {'allowed': True}

3. Cost review cadence
| Review | Frequency | Participants | Focus |
|---|---|---|---|
| Anomaly review | Daily (automated) | On-call | Spikes, unusual patterns |
| Team cost review | Weekly | Tech lead | Trends, quick wins |
| Service review | Monthly | Engineering + Finance | Budget performance, projections |
| Strategic review | Quarterly | Leadership | TCO trends, build/buy decisions |
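The automated daily anomaly review in the table can start very simply. As a minimal sketch, assume a trailing-window z-score on daily spend that flags only upward spikes. The function name, the 7-day window, and the threshold of 3 are illustrative choices, not the chapter's reference implementation.

```python
from statistics import mean, stdev


def detect_cost_anomalies(daily_costs: list[float], window: int = 7,
                          z_threshold: float = 3.0) -> list[int]:
    """Return indices of days whose cost spikes above the trailing window."""
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Perfectly flat history: any increase counts as a spike.
            if daily_costs[i] > mu:
                anomalies.append(i)
            continue
        if (daily_costs[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Day 8 simulates a runaway job more than doubling spend.
daily = [1000.0, 1020.0, 980.0, 1010.0, 990.0, 1005.0, 995.0, 1015.0, 2400.0, 1000.0]
flagged = detect_cost_anomalies(daily)
print(flagged)
```

One known weakness of this approach is that a spike stays in the trailing window for the following days and inflates the standard deviation, which can mask a second spike shortly after the first. Median-based statistics are a common way to reduce that effect.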
Incentive Alignment
Cost awareness requires aligned incentives:
What works:
- Teams that reduce costs get some savings allocated to their budget
- Cost efficiency is part of performance reviews for tech leads
- Cost per unit of value is a team metric alongside reliability and velocity
What doesn’t work:
- Centralized cost management with no team accountability
- Pure cost minimization targets without value consideration
- Punishing teams for high costs without context
Example: Cost efficiency OKRs
Q1 Objectives for ML Platform Team:
Objective: Improve AI cost efficiency
KR1: Reduce cost per inference request by one-third
(from $0.015 to $0.010)
KR2: Increase GPU utilization from 35% to 55%
KR3: Implement model routing, directing 50% of
requests to smaller models
Objective: Maintain quality while optimizing
KR1: Response quality score stays above 4.2/5
KR2: P95 latency stays under 500ms
KR3: No customer-facing incidents from optimization changes
Key Takeaways
AI costs are fundamentally different - GPU economics, marginal per-request costs, and hidden engineering time make AI 10-100x more expensive per request than traditional software. Plan accordingly.
Model total cost of ownership completely - Direct infrastructure is often only 60-70% of true TCO. Include engineering time, maintenance, monitoring, and opportunity costs in your calculations.
Attribute costs to create accountability - You cannot optimize what you cannot measure. Tag resources, track usage per team, and allocate costs to create ownership and visibility.
Model routing and caching provide biggest wins - These optimizations often deliver 40-70% savings with days of effort. Start here before complex architectural changes or custom optimization.
Build cost-aware culture through visibility - Dashboards, budgets, and aligned incentives make cost efficiency everyone’s responsibility. Teams that can see costs make better decisions.
Summary
Cost engineering transforms AI systems from economically precarious to sustainable. The key insights:
Understand why AI is expensive: GPU economics, marginal per-request costs, and hidden costs in engineering time make AI fundamentally different from traditional software.
Model total cost of ownership: Direct infrastructure costs are often only 60-70% of true TCO. Include engineering time, maintenance, and hidden costs.
Attribute costs to create accountability: You cannot optimize what you cannot measure. Tag resources, track usage, and allocate costs to teams.
Make data-driven build vs. buy decisions: Calculate crossover points with realistic assumptions. Multiply build estimates by 2-3x for hidden costs.
Prioritize optimizations by impact: Model routing and caching often provide 40-70% savings with days of effort. Start there before complex architectural changes.
Build cost-aware culture: Visibility, budgets, and aligned incentives make cost efficiency everyone’s responsibility.
Plan around the 10x cost cliff: Forecast your 12-month volume against the API-versus-self-host crossover. Below the cliff, managed APIs win; above it, self-hosting wins by roughly an order of magnitude. The expensive mistake is discovering which side you’re on from a surprise invoice.
Cost engineering is a Staff+ skill because it requires translating technical metrics into business terms, making tradeoffs across multiple dimensions, and building organizational systems that create sustainable economics at scale.
Practical Exercises
Exercise 1: TCO Analysis
Scenario: Your company runs a customer support chatbot. Calculate complete TCO.
Given:
- Inference: 4x L4 GPUs, $0.50/hr each, running 24/7
- API calls: External LLM for complex queries, 50,000 calls/month at $0.02 average
- Storage: 500GB conversation logs, 50GB model artifacts
- Engineering: 0.5 FTE ML engineer, 0.2 FTE DevOps
Tasks:
1. Calculate monthly direct costs
2. Estimate engineering costs (assume $180k fully-loaded annual salary)
3. Add hidden costs (monitoring, security, DR) at 15% of direct
4. Calculate cost per conversation (assume 500,000 conversations/month)
5. If conversations grow to 2M/month, how does cost per conversation change?
Exercise 2: GPU Selection
Scenario: You need to serve a 7B parameter model.
Requirements:
- P95 latency: <200ms
- Throughput: 100 requests/second peak
- Budget: <$10,000/month
GPU options:
| GPU | VRAM | Latency (7B) | Max throughput | Monthly cost |
|---|---|---|---|---|
| L4 | 24GB | 80ms | 40 req/s | $360 |
| A10G | 24GB | 65ms | 50 req/s | $540 |
| A100 40GB | 40GB | 40ms | 100 req/s | $1,260 |
Tasks:
1. Calculate minimum instances needed to meet throughput
2. Calculate total monthly cost for each GPU option
3. Verify latency requirements are met
4. Make a recommendation with justification
Exercise 3: Build vs. Buy Analysis
Scenario: Evaluate building vs. buying a document extraction service.
Build option:
- Development: 2 engineers, 4 months
- Infrastructure: 2x A100 for fine-tuned model serving
- Maintenance: 0.3 FTE ongoing
Buy option:
- API cost: $0.10 per page
- Current volume: 100,000 pages/month
- Projected growth: 20% monthly for 12 months
Tasks:
1. Calculate 24-month TCO for build (assume $15k/month per engineer)
2. Calculate 24-month TCO for buy
3. Find the monthly volume crossover point
4. List 3 qualitative factors and score them for build vs. buy
5. Make a recommendation
Exercise 4: Optimization Prioritization
Scenario: Your LLM service has the following profile:
- 2M requests/day
- Average 3,000 tokens per request (2,000 input, 1,000 output)
- Using GPT-5.5 at $5/1M input, $30/1M output
- 35% of queries are very similar to previous queries
- 60% of queries could be handled by a smaller model
Tasks:
1. Calculate current daily and monthly cost
2. Estimate savings from semantic caching (assume 25% cache hit rate)
3. Estimate savings from model routing (Claude Haiku 4.5 at $1.00/$5.00)
4. Estimate savings from 20% prompt optimization
5. Rank optimizations by estimated ROI
6. Calculate combined savings if all three are implemented
Self-Assessment Checkpoint
Conceptual Questions
Q1. [Senior] What components make up the Total Cost of Ownership (TCO) for an AI system? Which costs are commonly underestimated?
Answer
TCO components: (1) Direct costs: compute (GPU hours), storage, API calls, networking. (2) Engineering costs: development, maintenance, on-call, debugging. (3) Infrastructure overhead: monitoring, logging, security, DR. (4) Opportunity cost: what else could engineers/resources do? (5) Risk costs: incidents, data loss, compliance issues. Commonly underestimated: (1) Engineering maintenance: “we’ll build it and forget it” is a myth. Budget 20-30% of initial build for ongoing maintenance. (2) Opportunity cost: engineers building infrastructure aren’t building features. (3) Hidden infrastructure: observability, secrets management, deployment tooling. (4) Scaling costs: unit economics change as you scale. (5) Data costs: storage, transfer, processing grow over time.
Q2. [Senior] Explain the cost difference between input tokens and output tokens for LLM APIs. Why does this matter for optimization?
Answer
Output tokens typically cost 5-8x more than input tokens (e.g., GPT-5.5: $5/M input vs. $30/M output). Why: Output generation is more computationally intensive because each output token requires its own sequential forward pass through the entire model, while input tokens are processed in parallel during prefill. Optimization implications: (1) Prompt optimization matters, but output optimization matters more. (2) Limit output length where possible: constrain max_tokens, use structured outputs. (3) For Q&A, long documents as input are cheaper than long generated answers. (4) Caching input (prompt caching) saves less than you might think if outputs dominate cost. (5) Model routing can save dramatically: route simple queries to cheaper models. Cost analysis: For a request with 2K input / 1K output at GPT-5.5 prices: Input = $0.010, Output = $0.030. Output is 75% of cost despite being 33% of tokens.
Q3. [Staff] How do you build a business case for AI infrastructure investment? What metrics do stakeholders care about?
Answer
Stakeholder metrics: (1) CFO cares about: ROI, payback period, cost per unit (transaction, user, query), TCO vs. alternatives. (2) VP Engineering cares about: engineering efficiency, reliability, time to market. (3) Product cares about: user impact, feature velocity, competitive position. Building the case: (1) Define the counterfactual: what happens if we don’t invest? Status quo cost, opportunity cost, risk. (2) Quantify benefits: cost savings, revenue enablement, risk reduction. (3) Calculate payback period: when does investment pay for itself? <12 months is usually an easy sell. (4) Include non-financial benefits: developer productivity, user experience, competitive positioning. (5) Address risks: what could go wrong? Mitigation plans. (6) Propose staged approach: invest incrementally, prove value before scaling.
Q4. [Staff] What’s the cost difference between self-hosting LLMs vs. using APIs? How do you calculate the break-even point?
Answer
API costs: Simple: request volume × (input tokens × input price + output tokens × output price). Self-hosting costs: Complex: GPU cost + storage + engineering (setup + maintenance) + infrastructure overhead. Add 15-30% for hidden costs. Break-even calculation: (1) Calculate monthly self-hosted cost: GPU_monthly + storage + (maintenance_FTE × salary). (2) Calculate cost per request at full utilization: monthly_cost / max_requests_per_month. (3) Calculate API cost per request. (4) Break-even volume = monthly_self_hosted_cost / api_cost_per_request. Example: Self-hosted 7B model on A100 costs ~$1,500/month (GPU) + $2,500/month (0.2 FTE maintenance) = $4,000/month. Can serve ~100M requests at full utilization. API at $0.001/request. Break-even = 4M requests/month. Below 4M, API is cheaper. Above, self-host wins. Caveats: Quality difference, utilization reality (rarely 100%), setup time/cost.
Q5. [Staff] How do you implement cost attribution in a platform serving multiple teams/products?
Answer
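The tagging, aggregation, and allocation steps in this answer can be illustrated with a small sketch. The record fields (`team`, `product`, `env`, `cost`) and the `attribute_costs` helper are hypothetical, and shared cost is allocated by usage ratio, one of the two options the answer mentions.

```python
from collections import defaultdict


def attribute_costs(usage_records, shared_cost):
    """Roll up direct cost per team, then spread shared cost by usage ratio."""
    direct = defaultdict(float)
    for record in usage_records:
        # Missing tags corrupt attribution, so surface them in their own bucket.
        team = record.get("team") or "UNTAGGED"
        direct[team] += record["cost"]

    total_direct = sum(direct.values())
    report = {}
    for team, cost in direct.items():
        # Fall back to an even split when there is no usage to weight by.
        share = cost / total_direct if total_direct else 1 / len(direct)
        shared = shared_cost * share
        report[team] = {"direct": cost, "shared": shared, "total": cost + shared}
    return report


# Hypothetical metered records, already priced in dollars.
records = [
    {"team": "search", "product": "ranking", "env": "prod", "cost": 600.0},
    {"team": "search", "product": "ranking", "env": "dev", "cost": 200.0},
    {"team": "support", "product": "chatbot", "env": "prod", "cost": 150.0},
    {"team": None, "product": "chatbot", "env": "prod", "cost": 50.0},
]

report = attribute_costs(records, shared_cost=500.0)
for team in sorted(report):
    print(team, report[team])
```

Keeping an explicit `UNTAGGED` bucket makes tag-enforcement gaps visible in the showback report instead of hiding them inside other teams’ totals.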
Implementation: (1) Tagging: every resource/request is tagged with team, product, and environment, enforced at the API/infrastructure level. (2) Metering: collect usage data at appropriate granularity (request-level for APIs, hourly for compute). (3) Aggregation: roll up by tag dimensions daily/weekly/monthly. (4) Allocation: shared resources (infrastructure, on-call) are allocated by usage ratio or evenly. (5) Reporting: dashboards showing cost by team/product, trends, and anomalies. (6) Showback vs. chargeback: showback (visibility only) is simpler to start. Chargeback (actual budget transfer) drives stronger behavior change but adds overhead. Challenges: (1) Shared resources: how to allocate platform team costs? (2) Bursty usage: one team’s spike affects others’ costs. (3) Tag enforcement: missing/incorrect tags corrupt data. (4) Granularity vs. overhead: too fine-grained is expensive to track. Best practice: Start with showback at product level. Add granularity over time as needed.
Spot the Problem
Problem 1. [Senior] Cost analysis:
"Our LLM API costs $50K/month. We could self-host for $10K/month in GPU costs.
That's $40K savings per month!"
Answer
Missing costs: (1) Engineering: who builds, deploys, and maintains the self-hosted system? Add 0.3-0.5 FTE minimum = $4-8K/month. (2) Infrastructure: load balancing, monitoring, logging, security, CI/CD. (3) Quality risk: is your self-hosted model as good as the API? Quality degradation has a business cost. (4) Setup time: how many months until you’re running? API cost continues during development. (5) Opportunity cost: what else could the engineers do? (6) Utilization assumption: $10K assumes high utilization. Real utilization is often 40-60%. Realistic comparison: Self-hosted = $10K GPU + $6K engineering + $2K infrastructure = $18K+/month, plus quality risk and a 3-month setup. Savings are real but smaller (about $32K/month at best, not $40K), and they come with risk and complexity.
Problem 2. [Staff] Optimization prioritization:
"We're spending $100K/month on LLMs. I'll focus on reducing prompt length by 20%.
That should save $20K/month."
Answer
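A short sketch makes the two calculations in this answer concrete: the share of total spend that a prompt-length cut can actually save, and ranking candidate optimizations by savings / (effort + risk). The 40/60 input/output split comes from the answer; the candidate list, its dollar figures, and the use of engineer-weeks as a common unit for effort and risk are hypothetical.

```python
def total_savings_fraction(input_cost_share: float, input_reduction: float) -> float:
    """Fraction of total spend saved when only input-token cost shrinks."""
    return input_cost_share * input_reduction


monthly_spend = 100_000
saving = total_savings_fraction(input_cost_share=0.40, input_reduction=0.20)
print(f"Prompt trimming saves {saving:.0%} of spend "
      f"(${monthly_spend * saving:,.0f}/month)")

# Hypothetical candidates: (name, monthly savings in $, effort, risk),
# with effort and risk both expressed in engineer-weeks.
candidates = [
    ("prompt trimming", 8_000, 6, 2),
    ("response caching", 15_000, 4, 1),
    ("model routing", 20_000, 8, 4),
]


def roi(candidate) -> float:
    """ROI score from the answer: savings / (effort + risk)."""
    _, savings, effort, risk = candidate
    return savings / (effort + risk)


ranked = sorted(candidates, key=roi, reverse=True)
for candidate in ranked:
    print(f"{candidate[0]}: ROI score {roi(candidate):,.0f}")
```

With these made-up inputs, prompt trimming ranks last, which illustrates why the answer asks for a full list of opportunities before committing to one.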
Flawed analysis: (1) Input tokens are only part of the bill, and output tokens usually cost more per token: if 60% of cost is output tokens, a 20% input reduction saves only 0.40 × 0.20 = 8% of total spend ($8K/month), not 20%. (2) Prompt reduction may hurt quality: shorter prompts may degrade performance, increasing downstream costs or losing users. (3) Other optimizations may have higher ROI: caching, model routing, or batch processing might save more with less risk. (4) Implementation effort: how hard is prompt reduction? It may require rewriting many prompts. Proper approach: (1) Break down cost by input vs. output. (2) Identify all optimization opportunities with estimated savings and effort. (3) Prioritize by ROI: savings / (effort + risk). (4) Test optimizations before committing: measure actual savings and quality impact.
Problem 3. [Staff] Budget planning:
"AI costs last year: $1.2M. Budget for next year: $1.5M (25% growth buffer)."
Answer
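The bottom-up formula at the end of this answer can be sketched as follows. All inputs are hypothetical (usage roughly doubling from a $1.2M year), and sizing the contingency on net planned spend is an assumption, since the answer does not say what base the 10-15% applies to.

```python
def bottom_up_budget(products, new_projects, contingency_rate, efficiency_target):
    """Forecast = sum(usage x unit cost) + new projects + contingency - efficiency gains."""
    usage_cost = sum(p["forecast_queries"] * p["cost_per_query"] for p in products)
    efficiency_gains = usage_cost * efficiency_target
    new_project_cost = sum(new_projects.values())
    # Assumption: contingency is a percentage of the net planned spend.
    contingency = (usage_cost - efficiency_gains + new_project_cost) * contingency_rate
    total = usage_cost + new_project_cost + contingency - efficiency_gains
    return {
        "usage_cost": usage_cost,
        "new_projects": new_project_cost,
        "contingency": contingency,
        "efficiency_gains": efficiency_gains,
        "total": total,
    }


# Hypothetical forecast inputs.
products = [
    {"name": "assistant", "forecast_queries": 100_000_000, "cost_per_query": 0.02},
    {"name": "search", "forecast_queries": 40_000_000, "cost_per_query": 0.01},
]
new_projects = {"document summarization": 300_000}

budget = bottom_up_budget(products, new_projects,
                          contingency_rate=0.10, efficiency_target=0.10)
for line_item, amount in budget.items():
    print(f"{line_item}: ${amount:,.0f}")
```

Under these inputs the forecast lands near $2.7M, far above the $1.5M that a flat 25% buffer would have produced.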
Problems: (1) Usage growth not considered: if usage grows 100%, a 25% buffer is wildly insufficient. (2) New projects not considered: what’s planned? Each new AI feature adds cost. (3) Unit economics not tracked: is cost per query stable or increasing? (4) No efficiency targets: just padding previous spend doesn’t drive improvement. (5) External factors: API pricing changes, new model releases, infrastructure changes. Better approach: (1) Forecast usage growth by product/feature. (2) Apply unit economics (cost per query/user/transaction). (3) Add planned new projects with estimated costs. (4) Set efficiency targets (reduce cost per query by X%). (5) Build in contingency (10-15%) for unknowns. Bottom-up budgeting: Forecast = Σ(product usage × unit cost) + new projects + contingency - efficiency gains.
Design Exercises
Exercise 1. [Staff] Design a cost optimization roadmap for an AI platform spending $500K/month. Current state: No caching, single model for all requests, on-demand GPU instances, no cost visibility by team. Target: 30% cost reduction in 6 months. Outline initiatives, expected savings, effort, and sequencing.
Guidance
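Two pieces of arithmetic in this guidance are easy to get wrong, so here is a hedged sketch of both: how much model routing saves as a fraction of total spend, and how several savings combine. The helper names are hypothetical, and the compounding model assumes each initiative applies to whatever spend the earlier ones leave behind.

```python
def routing_savings(routable_cost_share: float, price_ratio: float) -> float:
    """Fraction of total spend saved by moving a share of spend to a cheaper model.

    routable_cost_share is the share of *spend* (not request count) that can move;
    price_ratio is the cheaper model's price relative to the current one.
    """
    return routable_cost_share * (1 - price_ratio)


def combined_savings(fractions) -> float:
    """Combine savings multiplicatively rather than summing them."""
    remaining = 1.0
    for fraction in fractions:
        remaining *= 1 - fraction
    return 1 - remaining


# Upper bound: the routable half of requests is also half of spend.
print(f"Routing, 50% of spend routable: {routing_savings(0.50, 0.10):.1%}")
# More conservative: simple requests are short, so they are only 25% of spend.
print(f"Routing, 25% of spend routable: {routing_savings(0.25, 0.10):.1%}")

# Caching 10%, routing 20%, prompt optimization 5% (infrastructure savings excluded).
total = combined_savings([0.10, 0.20, 0.05])
print(f"Combined: {total:.1%} of a $500K/month bill = ${500_000 * total:,.0f}/month")
```

The combined figure comes out slightly below the simple sum of the three percentages, which is worth stating up front when committing to a 30% target.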
Initiative prioritization: (1) Quick wins (Month 1-2): Cost visibility dashboard (shows where money goes), basic request caching (semantic similarity, estimate 10-15% savings), reserved/spot instances for predictable workloads (20-30% infra savings). (2) Medium effort (Month 2-4): Model routing: classify requests and route simple ones to cheaper/smaller models. If 50% of requests can use a cheaper model at 10% of the price, those requests cost 90% less. That would be 45% of total spend if they accounted for half the bill, but simple requests tend to be the cheap ones, so plan for ~20% total. (3) Longer term (Month 4-6): Prompt optimization (systematic review), batch processing for non-urgent requests, cost allocation to drive team accountability. Expected savings: Caching (10%) + model routing (20%) + prompt optimization (5%) sum to 35%, or about 32% once you account for each saving applying only to the spend the others leave behind. Instance optimization (15% of infra) adds to that, so the 30% target is achievable. Sequencing: Start with visibility (can’t optimize what you can’t measure), then quick infrastructure wins, then application-level optimizations requiring more effort.
Exercise 2. [Staff] You’re presenting an AI infrastructure investment ($2M over 2 years) to the CFO. The investment will enable self-hosting models currently costing $80K/month in API fees, plus enable new capabilities. Build the business case with financial projections, risks, and alternatives.
Guidance
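The break-even timing in this business case depends heavily on the growth model, so a small projection is worth running before the CFO meeting. The sketch below uses the exercise’s figures ($2M investment, $15K/month operating cost, $80K/month API spend, 30% annual growth). Treating the full $2M as spent up front and compounding growth smoothly each month are both modeling assumptions, and the function names are hypothetical.

```python
def cumulative_api_cost(months: int, monthly_start: float, annual_growth: float) -> float:
    """Cumulative API spend, compounding the annual growth rate monthly."""
    monthly_growth = (1 + annual_growth) ** (1 / 12)
    return sum(monthly_start * monthly_growth ** m for m in range(months))


def break_even_month(upfront: float, monthly_opex: float, monthly_api_start: float,
                     annual_growth: float, horizon: int = 60):
    """First month in which cumulative API spend reaches cumulative self-hosted cost."""
    for month in range(1, horizon + 1):
        self_hosted = upfront + monthly_opex * month
        if cumulative_api_cost(month, monthly_api_start, annual_growth) >= self_hosted:
            return month
    return None  # No break-even within the horizon.


with_growth = break_even_month(2_000_000, 15_000, 80_000, annual_growth=0.30)
flat_usage = break_even_month(2_000_000, 15_000, 80_000, annual_growth=0.0)
two_year_api = cumulative_api_cost(24, 80_000, 0.30)

print(f"Break-even with 30% annual growth: month {with_growth}")
print(f"Break-even with flat usage: month {flat_usage}")
print(f"Two-year API spend with growth: ${two_year_api:,.0f}")
```

Presenting both the growth and flat-usage cases shows the CFO how sensitive the payback is to the usage forecast, which supports the staged-investment recommendation.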
Business case structure: (1) Current state: $80K/month API = $1.92M over 2 years at flat usage, or roughly $2.5M if spend grows 30%/year. (2) Investment: $2M (hardware + setup + engineering). (3) Operating cost: $15K/month ongoing (maintenance, power, networking) = $360K over 2 years. (4) Total self-hosted cost: $2M + $360K = $2.36M. (5) Direct ROI: Negative initially. At flat usage, cumulative API spend does not pass self-hosted cost until month 31; with 30% annual growth, break-even moves to roughly month 23-26, depending on how the growth is modeled. ROI is clearly positive in year 3. (6) Additional benefits: New capabilities (fine-tuning, data privacy, latency control). Quantify if possible. (7) Risks: Technology obsolescence, a wrong usage forecast, quality issues. Include mitigation plans. (8) Alternatives: Continue with the API (higher long-term cost) or take a hybrid approach (partial self-hosting). Recommendation: Present as a staged investment, with an initial pilot to prove value and scaling upon validation. CFOs prefer optionality and proven results over big upfront bets.
Connections to Other Chapters
Cost engineering connects to technical decisions across the book. Here’s how this chapter relates:
Chapter 9 (LLM Deployment & Infrastructure): Technical foundations of deployment that drive cost structure. Understanding serving architectures helps optimize costs.
Chapter 27 (Performance Engineering): Performance optimizations that directly impact cost efficiency. Faster inference means lower GPU-hours per request.
Chapter 31 (Reliability Engineering): Reliability investments affect cost—redundancy costs money but prevents expensive incidents.
Appendix K (LLM Provider Comparison): Detailed pricing models and cost comparison across major LLM providers. Essential reference for build vs. buy decisions.
Further Reading
Essential
- Storment & Fuller, “Cloud FinOps” - A standard guide to cloud cost management; its practices carry over to AI infrastructure.
- Patterson et al. (2021), “Carbon Emissions and Large Neural Network Training” - Quantifies the energy and carbon cost of training large models.
- Zaharia et al. (2024), “The Shift from Models to Compound AI Systems” - Cost implications of multi-component AI.
Deep Dives
- Kwon et al. (2023), “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM) - Memory optimization enabling 2-4x throughput improvement.
- Leviathan et al. (2023), “Fast Inference from Transformers via Speculative Decoding” - Technique for 2-3x speedup in autoregressive generation.
Practical Resources
- FinOps Certified Practitioner curriculum - Industry standard for cloud financial management.
Part V Checkpoint: Staff+ Engineering Complete
Congratulations on completing Part V and the entire book! You’ve covered the architectural thinking, strategic decision-making, and leadership skills required at Staff+ levels.
Skills Checklist
Quick Self-Test (10 minutes)
Q1. Your global AI system needs 99.9% availability. How do you architect for this across regions?
Q2. A promising paper shows a 30% latency improvement, and your team wants to adopt it immediately. What’s your response?
Q3. Two teams disagree on whether to build a custom solution or use an existing framework. How do you drive resolution?
Q4. Your LLM serving costs are 3x budget. Walk through your cost reduction approach.
A1. Multi-region active-active with: (1) Load balancing across regions with health checks, (2) Data replication with eventual consistency where acceptable, (3) Circuit breakers to isolate failures, (4) Graceful degradation to simpler models during outages, (5) Chaos testing to verify resilience.
A2. (1) Reproduce the results in your environment first, (2) Understand the tradeoffs (what did they sacrifice?), (3) Benchmark on your actual workload, (4) Prototype in staging, (5) Plan incremental rollout with rollback capability. Enthusiasm is good; prudent adoption is better.
A3. (1) Get both teams to articulate their criteria for success, (2) Document requirements and constraints objectively, (3) Evaluate options against criteria with evidence, (4) Identify the reversibility of each decision, (5) Propose a time-boxed experiment if uncertainty is high, (6) Make the call if consensus isn’t emerging and document the reasoning.
A4. (1) Profile to find the biggest cost drivers (usually GPU utilization, batch efficiency, or over-provisioning), (2) Quick wins: enable batching, right-size instances, use spot/preemptible, (3) Medium-term: implement caching, route simple requests to smaller models, (4) Strategic: evaluate self-hosting vs. API tradeoffs at your scale, (5) Set cost budgets per feature and make teams accountable.
You’ve Completed the Book!
If you can confidently check all boxes above, you have the knowledge to operate as a Staff+ AI Engineer. Remember:
- Technical depth is necessary but not sufficient—Staff+ impact comes from multiplying your expertise through others
- The field moves fast—commit to continuous learning with the practices from Chapter 21
- Production is the test—everything in this book is theory until you ship systems that handle real traffic
The appendices contain reference materials you’ll return to throughout your career: the glossary, tool comparisons, paper reading lists, case studies, and templates.
Welcome to Staff+ AI Engineering.