Cost Estimation Formulas

Token Cost Calculator

API Pricing Reference (as of 2025)

Provider	Model	Input ($/1M) \| Output ($/1M)	Context
Anthropic	Claude 3.5 Sonnet	$3.00	$15.00	200K
Anthropic	Claude 3.5 Haiku	$0.25	$1.25	200K
Anthropic	Claude 3 Opus	$15.00	$75.00	200K
OpenAI	GPT-4o	$2.50	$10.00	128K
OpenAI	GPT-4o-mini	$0.15	$0.60	128K
OpenAI	GPT-4 Turbo	$10.00	$30.00	128K
Google	Gemini 1.5 Pro	$1.25	$5.00	1M
Google	Gemini 1.5 Flash	$0.075	$0.30	1M

Note: Prices change frequently. Verify current pricing before budgeting.

Basic Cost Formula

Cost per request = (input_tokens × input_price / 1M) + (output_tokens × output_price / 1M)

Example (Claude 3.5 Sonnet):
- Input: 2,000 tokens
- Output: 500 tokens
- Cost = (2000 × $3/1M) + (500 × $15/1M)
       = $0.006 + $0.0075
       = $0.0135 per request

Monthly Cost Projection

Monthly cost = requests_per_day × 30 × cost_per_request

Example:
- 10,000 requests/day
- $0.0135/request
- Monthly = 10,000 × 30 × $0.0135 = $4,050

Python Cost Calculator

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelPricing:
    name: str
    input_per_million: float
    output_per_million: float

PRICING = {
    "claude-3-5-sonnet": ModelPricing("Claude 3.5 Sonnet", 3.00, 15.00),
    "claude-3-5-haiku": ModelPricing("Claude 3.5 Haiku", 0.25, 1.25),
    "claude-3-opus": ModelPricing("Claude 3 Opus", 15.00, 75.00),
    "gpt-4o": ModelPricing("GPT-4o", 2.50, 10.00),
    "gpt-4o-mini": ModelPricing("GPT-4o-mini", 0.15, 0.60),
}

def calculate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int
) -> float:
    """Calculate cost for a single request."""
    pricing = PRICING[model]
    input_cost = (input_tokens / 1_000_000) * pricing.input_per_million
    output_cost = (output_tokens / 1_000_000) * pricing.output_per_million
    return input_cost + output_cost

def project_monthly_cost(
    model: str,
    avg_input_tokens: int,
    avg_output_tokens: int,
    requests_per_day: int
) -> dict:
    """Project monthly costs with breakdown."""
    cost_per_request = calculate_cost(model, avg_input_tokens, avg_output_tokens)
    daily_cost = cost_per_request * requests_per_day
    monthly_cost = daily_cost * 30

    return {
        "model": model,
        "cost_per_request": cost_per_request,
        "daily_cost": daily_cost,
        "monthly_cost": monthly_cost,
        "annual_cost": monthly_cost * 12,
    }

# Example usage {.unnumbered}
projection = project_monthly_cost(
    model="claude-3-5-sonnet",
    avg_input_tokens=2000,
    avg_output_tokens=500,
    requests_per_day=10000
)
print(f"Monthly cost: ${projection['monthly_cost']:,.2f}")

RAG System Cost Model

RAG Request Cost = Embedding Cost + Retrieval Cost + Generation Cost

Embedding Cost:
- Query embedding: ~100 tokens × embedding_price
- Typically negligible ($0.0001 per query)

Retrieval Cost:
- Vector DB query: ~$0.00001 per query (self-hosted)
- Managed service: $0.0001-0.001 per query

Generation Cost (dominant):
- Context tokens: retrieved_chunks × tokens_per_chunk
- System prompt: ~500 tokens
- Query: ~100 tokens
- Response: ~300-1000 tokens

RAG Cost Calculator

def calculate_rag_cost(
    model: str,
    chunks_retrieved: int = 5,
    tokens_per_chunk: int = 500,
    system_prompt_tokens: int = 500,
    query_tokens: int = 100,
    response_tokens: int = 500,
    embedding_cost_per_query: float = 0.0001,
    retrieval_cost_per_query: float = 0.00001
) -> dict:
    """Calculate full RAG request cost."""

    # Input tokens = system prompt + context + query
    context_tokens = chunks_retrieved * tokens_per_chunk
    input_tokens = system_prompt_tokens + context_tokens + query_tokens

    # LLM cost
    llm_cost = calculate_cost(model, input_tokens, response_tokens)

    # Total
    total = embedding_cost_per_query + retrieval_cost_per_query + llm_cost

    return {
        "input_tokens": input_tokens,
        "output_tokens": response_tokens,
        "embedding_cost": embedding_cost_per_query,
        "retrieval_cost": retrieval_cost_per_query,
        "llm_cost": llm_cost,
        "total_cost": total,
        "cost_breakdown_pct": {
            "embedding": embedding_cost_per_query / total * 100,
            "retrieval": retrieval_cost_per_query / total * 100,
            "llm": llm_cost / total * 100,
        }
    }

GPU Self-Hosting Cost Model

GPU Instance Pricing (Approximate)

GPU	Cloud ($/hr) \| Spot ($/hr)	On-Prem ($/mo amortized)
A100 40GB	$3.50-4.00	$1.20-1.50	~$800
A100 80GB	$4.50-5.00	$1.50-2.00	~$1,200
H100 80GB	$8.00-10.00	$3.00-4.00	~$2,500
A10G	$1.50-2.00	$0.50-0.75	~$300
L4	$0.80-1.00	$0.30-0.40	~$250

Self-Hosted Cost Formula

Monthly GPU Cost = instances × hours_per_month × hourly_rate
                 = instances × 730 × hourly_rate

Cost per Request = Monthly GPU Cost / Monthly Requests

Break-even Analysis:
  API cost at volume V = V × cost_per_api_request
  GPU cost = fixed_monthly_cost

  Break-even volume = fixed_monthly_cost / cost_per_api_request

Break-Even Calculator

def calculate_breakeven(
    gpu_monthly_cost: float,
    api_cost_per_request: float,
    requests_per_month: int
) -> dict:
    """Calculate self-hosting vs API break-even."""

    # API cost at this volume
    api_monthly_cost = requests_per_month * api_cost_per_request

    # Break-even point
    breakeven_requests = gpu_monthly_cost / api_cost_per_request

    # Savings at current volume
    savings = api_monthly_cost - gpu_monthly_cost
    savings_pct = (savings / api_monthly_cost * 100) if api_monthly_cost > 0 else 0

    return {
        "gpu_monthly_cost": gpu_monthly_cost,
        "api_monthly_cost": api_monthly_cost,
        "breakeven_requests": int(breakeven_requests),
        "current_requests": requests_per_month,
        "monthly_savings": savings,
        "savings_percent": savings_pct,
        "recommendation": "self-host" if savings > 0 else "use API"
    }

# Example: Should we self-host Llama 70B? {.unnumbered}
result = calculate_breakeven(
    gpu_monthly_cost=4 * 730 * 4.00,  # 4x A100 at $4/hr
    api_cost_per_request=0.015,        # Comparable API cost
    requests_per_month=1_500_000       # 1.5M requests/month
)
print(f"Recommendation: {result['recommendation']}")
print(f"Monthly savings: ${result['monthly_savings']:,.2f}")

Evaluation Cost Model

Daily Eval Cost = (samples × metrics × judge_cost_per_call)

judge_cost_per_call = calculate_cost(
    judge_model,
    input_tokens=~2000,  # context + response + rubric
    output_tokens=~200   # rating + reasoning
)

Example (500 daily samples, 3 metrics, Claude 3.5 Sonnet judge):
- Judge cost = (2000 × $3/1M) + (200 × $15/1M) = $0.009
- Daily cost = 500 × 3 × $0.009 = $13.50
- Monthly cost = $13.50 × 30 = $405

Cost Optimization Decision Tree

                    Is latency critical?
                           │
              ┌────────────┴────────────┐
              │ YES                     │ NO
              ▼                         ▼
    Need fastest model?          Cost sensitive?
              │                         │
    ┌─────────┴─────────┐     ┌─────────┴─────────┐
    │ YES               │ NO  │ YES               │ NO
    ▼                   ▼     ▼                   ▼
  Opus/GPT-4      Sonnet/4o  Haiku/Mini     Sonnet/4o
  (highest $)    (balanced)  (lowest $)    (best quality)


              Volume > 1M requests/month?
                           │
              ┌────────────┴────────────┐
              │ YES                     │ NO
              ▼                         ▼
       Consider self-hosting      Stick with API
       or committed use pricing   (simpler ops)

Quick Reference: Cost Reduction Strategies

Strategy	Effort	Savings	When to Use
Prompt caching	Low	10-30%	Repeated system prompts
Response caching	Low	20-50%	Common queries
Smaller model for routing	Medium	30-50%	Multi-step workflows
Batch processing	Medium	50%	Non-real-time workloads
Fine-tuned smaller model	High	60-80%	Narrow domain, high volume
Self-hosted inference	High	50-80%	Very high volume, predictable

Budget Alert Thresholds

# Recommended alert thresholds {.unnumbered}
ALERT_THRESHOLDS = {
    "daily_spend": {
        "warning": 0.8,   # 80% of daily budget
        "critical": 1.0,  # 100% of daily budget
    },
    "cost_per_request": {
        "warning": 1.5,   # 50% above baseline
        "critical": 2.0,  # 100% above baseline
    },
    "monthly_projection": {
        "warning": 0.9,   # 90% of monthly budget
        "critical": 1.1,  # 110% of monthly budget
    }
}

--- number-sections: false execute: enabled: false --- # Cost Estimation Formulas {.unnumbered} ## Token Cost Calculator ### API Pricing Reference (as of 2025) | Provider | Model | Input ($/1M) | Output ($/1M) | Context | |----------|-------|--------------|---------------|---------| | Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | | Anthropic | Claude 3.5 Haiku | $0.25 | $1.25 | 200K | | Anthropic | Claude 3 Opus | $15.00 | $75.00 | 200K | | OpenAI | GPT-4o | $2.50 | $10.00 | 128K | | OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128K | | OpenAI | GPT-4 Turbo | $10.00 | $30.00 | 128K | | Google | Gemini 1.5 Pro | $1.25 | $5.00 | 1M | | Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M | *Note: Prices change frequently. Verify current pricing before budgeting.* ### Basic Cost Formula ``` Cost per request = (input_tokens × input_price / 1M) + (output_tokens × output_price / 1M) Example (Claude 3.5 Sonnet): - Input: 2,000 tokens - Output: 500 tokens - Cost = (2000 × $3/1M) + (500 × $15/1M) = $0.006 + $0.0075 = $0.0135 per request ``` ### Monthly Cost Projection ``` Monthly cost = requests_per_day × 30 × cost_per_request Example: - 10,000 requests/day - $0.0135/request - Monthly = 10,000 × 30 × $0.0135 = $4,050 ``` ## Python Cost Calculator ```python from dataclasses import dataclass from typing import Optional @dataclass class ModelPricing: name: str input_per_million: float output_per_million: float PRICING = { "claude-3-5-sonnet": ModelPricing("Claude 3.5 Sonnet", 3.00, 15.00), "claude-3-5-haiku": ModelPricing("Claude 3.5 Haiku", 0.25, 1.25), "claude-3-opus": ModelPricing("Claude 3 Opus", 15.00, 75.00), "gpt-4o": ModelPricing("GPT-4o", 2.50, 10.00), "gpt-4o-mini": ModelPricing("GPT-4o-mini", 0.15, 0.60), } def calculate_cost( model: str, input_tokens: int, output_tokens: int ) -> float: """Calculate cost for a single request.""" pricing = PRICING[model] input_cost = (input_tokens / 1_000_000) * pricing.input_per_million output_cost = (output_tokens / 1_000_000) * pricing.output_per_million return input_cost + output_cost def project_monthly_cost( model: str, avg_input_tokens: int, avg_output_tokens: int, requests_per_day: int ) -> dict: """Project monthly costs with breakdown.""" cost_per_request = calculate_cost(model, avg_input_tokens, avg_output_tokens) daily_cost = cost_per_request * requests_per_day monthly_cost = daily_cost * 30 return { "model": model, "cost_per_request": cost_per_request, "daily_cost": daily_cost, "monthly_cost": monthly_cost, "annual_cost": monthly_cost * 12, } # Example usage {.unnumbered} projection = project_monthly_cost( model="claude-3-5-sonnet", avg_input_tokens=2000, avg_output_tokens=500, requests_per_day=10000 ) print(f"Monthly cost: ${projection['monthly_cost']:,.2f}") ``` ## RAG System Cost Model ``` RAG Request Cost = Embedding Cost + Retrieval Cost + Generation Cost Embedding Cost: - Query embedding: ~100 tokens × embedding_price - Typically negligible ($0.0001 per query) Retrieval Cost: - Vector DB query: ~$0.00001 per query (self-hosted) - Managed service: $0.0001-0.001 per query Generation Cost (dominant): - Context tokens: retrieved_chunks × tokens_per_chunk - System prompt: ~500 tokens - Query: ~100 tokens - Response: ~300-1000 tokens ``` ### RAG Cost Calculator ```python def calculate_rag_cost( model: str, chunks_retrieved: int = 5, tokens_per_chunk: int = 500, system_prompt_tokens: int = 500, query_tokens: int = 100, response_tokens: int = 500, embedding_cost_per_query: float = 0.0001, retrieval_cost_per_query: float = 0.00001 ) -> dict: """Calculate full RAG request cost.""" # Input tokens = system prompt + context + query context_tokens = chunks_retrieved * tokens_per_chunk input_tokens = system_prompt_tokens + context_tokens + query_tokens # LLM cost llm_cost = calculate_cost(model, input_tokens, response_tokens) # Total total = embedding_cost_per_query + retrieval_cost_per_query + llm_cost return { "input_tokens": input_tokens, "output_tokens": response_tokens, "embedding_cost": embedding_cost_per_query, "retrieval_cost": retrieval_cost_per_query, "llm_cost": llm_cost, "total_cost": total, "cost_breakdown_pct": { "embedding": embedding_cost_per_query / total * 100, "retrieval": retrieval_cost_per_query / total * 100, "llm": llm_cost / total * 100, } } ``` ## GPU Self-Hosting Cost Model ### GPU Instance Pricing (Approximate) | GPU | Cloud ($/hr) | Spot ($/hr) | On-Prem ($/mo amortized) | |-----|--------------|-------------|--------------------------| | A100 40GB | $3.50-4.00 | $1.20-1.50 | ~$800 | | A100 80GB | $4.50-5.00 | $1.50-2.00 | ~$1,200 | | H100 80GB | $8.00-10.00 | $3.00-4.00 | ~$2,500 | | A10G | $1.50-2.00 | $0.50-0.75 | ~$300 | | L4 | $0.80-1.00 | $0.30-0.40 | ~$250 | ### Self-Hosted Cost Formula ``` Monthly GPU Cost = instances × hours_per_month × hourly_rate = instances × 730 × hourly_rate Cost per Request = Monthly GPU Cost / Monthly Requests Break-even Analysis: API cost at volume V = V × cost_per_api_request GPU cost = fixed_monthly_cost Break-even volume = fixed_monthly_cost / cost_per_api_request ``` ### Break-Even Calculator ```python def calculate_breakeven( gpu_monthly_cost: float, api_cost_per_request: float, requests_per_month: int ) -> dict: """Calculate self-hosting vs API break-even.""" # API cost at this volume api_monthly_cost = requests_per_month * api_cost_per_request # Break-even point breakeven_requests = gpu_monthly_cost / api_cost_per_request # Savings at current volume savings = api_monthly_cost - gpu_monthly_cost savings_pct = (savings / api_monthly_cost * 100) if api_monthly_cost > 0 else 0 return { "gpu_monthly_cost": gpu_monthly_cost, "api_monthly_cost": api_monthly_cost, "breakeven_requests": int(breakeven_requests), "current_requests": requests_per_month, "monthly_savings": savings, "savings_percent": savings_pct, "recommendation": "self-host" if savings > 0 else "use API" } # Example: Should we self-host Llama 70B? {.unnumbered} result = calculate_breakeven( gpu_monthly_cost=4 * 730 * 4.00, # 4x A100 at $4/hr api_cost_per_request=0.015, # Comparable API cost requests_per_month=1_500_000 # 1.5M requests/month ) print(f"Recommendation: {result['recommendation']}") print(f"Monthly savings: ${result['monthly_savings']:,.2f}") ``` ## Evaluation Cost Model ``` Daily Eval Cost = (samples × metrics × judge_cost_per_call) judge_cost_per_call = calculate_cost( judge_model, input_tokens=~2000, # context + response + rubric output_tokens=~200 # rating + reasoning ) Example (500 daily samples, 3 metrics, Claude 3.5 Sonnet judge): - Judge cost = (2000 × $3/1M) + (200 × $15/1M) = $0.009 - Daily cost = 500 × 3 × $0.009 = $13.50 - Monthly cost = $13.50 × 30 = $405 ``` ## Cost Optimization Decision Tree ``` Is latency critical? │ ┌────────────┴────────────┐ │ YES │ NO ▼ ▼ Need fastest model? Cost sensitive? │ │ ┌─────────┴─────────┐ ┌─────────┴─────────┐ │ YES │ NO │ YES │ NO ▼ ▼ ▼ ▼ Opus/GPT-4 Sonnet/4o Haiku/Mini Sonnet/4o (highest $) (balanced) (lowest $) (best quality) Volume > 1M requests/month? │ ┌────────────┴────────────┐ │ YES │ NO ▼ ▼ Consider self-hosting Stick with API or committed use pricing (simpler ops) ``` ## Quick Reference: Cost Reduction Strategies | Strategy | Effort | Savings | When to Use | |----------|--------|---------|-------------| | Prompt caching | Low | 10-30% | Repeated system prompts | | Response caching | Low | 20-50% | Common queries | | Smaller model for routing | Medium | 30-50% | Multi-step workflows | | Batch processing | Medium | 50% | Non-real-time workloads | | Fine-tuned smaller model | High | 60-80% | Narrow domain, high volume | | Self-hosted inference | High | 50-80% | Very high volume, predictable | ## Budget Alert Thresholds ```python # Recommended alert thresholds {.unnumbered} ALERT_THRESHOLDS = { "daily_spend": { "warning": 0.8, # 80% of daily budget "critical": 1.0, # 100% of daily budget }, "cost_per_request": { "warning": 1.5, # 50% above baseline "critical": 2.0, # 100% above baseline }, "monthly_projection": { "warning": 0.9, # 90% of monthly budget "critical": 1.1, # 110% of monthly budget } } ```