---
number-sections: false
execute:
enabled: false
---
# Cost Estimation Formulas {.unnumbered}
## Token Cost Calculator
### API Pricing Reference (as of 2025)
| Provider | Model | Input ($/1M) | Output ($/1M) | Context |
|----------|-------|--------------|---------------|---------|
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Anthropic | Claude 3.5 Haiku | $0.25 | $1.25 | 200K |
| Anthropic | Claude 3 Opus | $15.00 | $75.00 | 200K |
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128K |
| OpenAI | GPT-4 Turbo | $10.00 | $30.00 | 128K |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | 1M |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
*Note: Prices change frequently. Verify current pricing before budgeting.*
### Basic Cost Formula
```
Cost per request = (input_tokens × input_price / 1M) + (output_tokens × output_price / 1M)
Example (Claude 3.5 Sonnet):
- Input: 2,000 tokens
- Output: 500 tokens
- Cost = (2000 × $3/1M) + (500 × $15/1M)
= $0.006 + $0.0075
= $0.0135 per request
```
### Monthly Cost Projection
```
Monthly cost = requests_per_day × 30 × cost_per_request
Example:
- 10,000 requests/day
- $0.0135/request
- Monthly = 10,000 × 30 × $0.0135 = $4,050
```
## Python Cost Calculator
```python
from dataclasses import dataclass
from typing import Optional
@dataclass
class ModelPricing:
name: str
input_per_million: float
output_per_million: float
PRICING = {
"claude-3-5-sonnet": ModelPricing("Claude 3.5 Sonnet", 3.00, 15.00),
"claude-3-5-haiku": ModelPricing("Claude 3.5 Haiku", 0.25, 1.25),
"claude-3-opus": ModelPricing("Claude 3 Opus", 15.00, 75.00),
"gpt-4o": ModelPricing("GPT-4o", 2.50, 10.00),
"gpt-4o-mini": ModelPricing("GPT-4o-mini", 0.15, 0.60),
}
def calculate_cost(
model: str,
input_tokens: int,
output_tokens: int
) -> float:
"""Calculate cost for a single request."""
pricing = PRICING[model]
input_cost = (input_tokens / 1_000_000) * pricing.input_per_million
output_cost = (output_tokens / 1_000_000) * pricing.output_per_million
return input_cost + output_cost
def project_monthly_cost(
model: str,
avg_input_tokens: int,
avg_output_tokens: int,
requests_per_day: int
) -> dict:
"""Project monthly costs with breakdown."""
cost_per_request = calculate_cost(model, avg_input_tokens, avg_output_tokens)
daily_cost = cost_per_request * requests_per_day
monthly_cost = daily_cost * 30
return {
"model": model,
"cost_per_request": cost_per_request,
"daily_cost": daily_cost,
"monthly_cost": monthly_cost,
"annual_cost": monthly_cost * 12,
}
# Example usage {.unnumbered}
projection = project_monthly_cost(
model="claude-3-5-sonnet",
avg_input_tokens=2000,
avg_output_tokens=500,
requests_per_day=10000
)
print(f"Monthly cost: ${projection['monthly_cost']:,.2f}")
```
## RAG System Cost Model
```
RAG Request Cost = Embedding Cost + Retrieval Cost + Generation Cost
Embedding Cost:
- Query embedding: ~100 tokens × embedding_price
- Typically negligible ($0.0001 per query)
Retrieval Cost:
- Vector DB query: ~$0.00001 per query (self-hosted)
- Managed service: $0.0001-0.001 per query
Generation Cost (dominant):
- Context tokens: retrieved_chunks × tokens_per_chunk
- System prompt: ~500 tokens
- Query: ~100 tokens
- Response: ~300-1000 tokens
```
### RAG Cost Calculator
```python
def calculate_rag_cost(
model: str,
chunks_retrieved: int = 5,
tokens_per_chunk: int = 500,
system_prompt_tokens: int = 500,
query_tokens: int = 100,
response_tokens: int = 500,
embedding_cost_per_query: float = 0.0001,
retrieval_cost_per_query: float = 0.00001
) -> dict:
"""Calculate full RAG request cost."""
# Input tokens = system prompt + context + query
context_tokens = chunks_retrieved * tokens_per_chunk
input_tokens = system_prompt_tokens + context_tokens + query_tokens
# LLM cost
llm_cost = calculate_cost(model, input_tokens, response_tokens)
# Total
total = embedding_cost_per_query + retrieval_cost_per_query + llm_cost
return {
"input_tokens": input_tokens,
"output_tokens": response_tokens,
"embedding_cost": embedding_cost_per_query,
"retrieval_cost": retrieval_cost_per_query,
"llm_cost": llm_cost,
"total_cost": total,
"cost_breakdown_pct": {
"embedding": embedding_cost_per_query / total * 100,
"retrieval": retrieval_cost_per_query / total * 100,
"llm": llm_cost / total * 100,
}
}
```
## GPU Self-Hosting Cost Model
### GPU Instance Pricing (Approximate)
| GPU | Cloud ($/hr) | Spot ($/hr) | On-Prem ($/mo amortized) |
|-----|--------------|-------------|--------------------------|
| A100 40GB | $3.50-4.00 | $1.20-1.50 | ~$800 |
| A100 80GB | $4.50-5.00 | $1.50-2.00 | ~$1,200 |
| H100 80GB | $8.00-10.00 | $3.00-4.00 | ~$2,500 |
| A10G | $1.50-2.00 | $0.50-0.75 | ~$300 |
| L4 | $0.80-1.00 | $0.30-0.40 | ~$250 |
### Self-Hosted Cost Formula
```
Monthly GPU Cost = instances × hours_per_month × hourly_rate
= instances × 730 × hourly_rate
Cost per Request = Monthly GPU Cost / Monthly Requests
Break-even Analysis:
API cost at volume V = V × cost_per_api_request
GPU cost = fixed_monthly_cost
Break-even volume = fixed_monthly_cost / cost_per_api_request
```
### Break-Even Calculator
```python
def calculate_breakeven(
gpu_monthly_cost: float,
api_cost_per_request: float,
requests_per_month: int
) -> dict:
"""Calculate self-hosting vs API break-even."""
# API cost at this volume
api_monthly_cost = requests_per_month * api_cost_per_request
# Break-even point
breakeven_requests = gpu_monthly_cost / api_cost_per_request
# Savings at current volume
savings = api_monthly_cost - gpu_monthly_cost
savings_pct = (savings / api_monthly_cost * 100) if api_monthly_cost > 0 else 0
return {
"gpu_monthly_cost": gpu_monthly_cost,
"api_monthly_cost": api_monthly_cost,
"breakeven_requests": int(breakeven_requests),
"current_requests": requests_per_month,
"monthly_savings": savings,
"savings_percent": savings_pct,
"recommendation": "self-host" if savings > 0 else "use API"
}
# Example: Should we self-host Llama 70B? {.unnumbered}
result = calculate_breakeven(
gpu_monthly_cost=4 * 730 * 4.00, # 4x A100 at $4/hr
api_cost_per_request=0.015, # Comparable API cost
requests_per_month=1_500_000 # 1.5M requests/month
)
print(f"Recommendation: {result['recommendation']}")
print(f"Monthly savings: ${result['monthly_savings']:,.2f}")
```
## Evaluation Cost Model
```
Daily Eval Cost = (samples × metrics × judge_cost_per_call)
judge_cost_per_call = calculate_cost(
judge_model,
input_tokens=~2000, # context + response + rubric
output_tokens=~200 # rating + reasoning
)
Example (500 daily samples, 3 metrics, Claude 3.5 Sonnet judge):
- Judge cost = (2000 × $3/1M) + (200 × $15/1M) = $0.009
- Daily cost = 500 × 3 × $0.009 = $13.50
- Monthly cost = $13.50 × 30 = $405
```
## Cost Optimization Decision Tree
```
Is latency critical?
│
┌────────────┴────────────┐
│ YES │ NO
▼ ▼
Need fastest model? Cost sensitive?
│ │
┌─────────┴─────────┐ ┌─────────┴─────────┐
│ YES │ NO │ YES │ NO
▼ ▼ ▼ ▼
Opus/GPT-4 Sonnet/4o Haiku/Mini Sonnet/4o
(highest $) (balanced) (lowest $) (best quality)
Volume > 1M requests/month?
│
┌────────────┴────────────┐
│ YES │ NO
▼ ▼
Consider self-hosting Stick with API
or committed use pricing (simpler ops)
```
## Quick Reference: Cost Reduction Strategies
| Strategy | Effort | Savings | When to Use |
|----------|--------|---------|-------------|
| Prompt caching | Low | 10-30% | Repeated system prompts |
| Response caching | Low | 20-50% | Common queries |
| Smaller model for routing | Medium | 30-50% | Multi-step workflows |
| Batch processing | Medium | 50% | Non-real-time workloads |
| Fine-tuned smaller model | High | 60-80% | Narrow domain, high volume |
| Self-hosted inference | High | 50-80% | Very high volume, predictable |
## Budget Alert Thresholds
```python
# Recommended alert thresholds {.unnumbered}
ALERT_THRESHOLDS = {
"daily_spend": {
"warning": 0.8, # 80% of daily budget
"critical": 1.0, # 100% of daily budget
},
"cost_per_request": {
"warning": 1.5, # 50% above baseline
"critical": 2.0, # 100% above baseline
},
"monthly_projection": {
"warning": 0.9, # 90% of monthly budget
"critical": 1.1, # 110% of monthly budget
}
}
```