Cost Estimation Formulas

Token Cost Calculator

API Pricing Reference (as of 2025)

Provider Model Input (\(/1M) | Output (\)/1M) Context
Anthropic Claude 3.5 Sonnet $3.00 $15.00 200K
Anthropic Claude 3.5 Haiku $0.25 $1.25 200K
Anthropic Claude 3 Opus $15.00 $75.00 200K
OpenAI GPT-4o $2.50 $10.00 128K
OpenAI GPT-4o-mini $0.15 $0.60 128K
OpenAI GPT-4 Turbo $10.00 $30.00 128K
Google Gemini 1.5 Pro $1.25 $5.00 1M
Google Gemini 1.5 Flash $0.075 $0.30 1M

Note: Prices change frequently. Verify current pricing before budgeting.

Basic Cost Formula

Cost per request = (input_tokens × input_price / 1M) + (output_tokens × output_price / 1M)

Example (Claude 3.5 Sonnet):
- Input: 2,000 tokens
- Output: 500 tokens
- Cost = (2000 × $3/1M) + (500 × $15/1M)
       = $0.006 + $0.0075
       = $0.0135 per request

Monthly Cost Projection

Monthly cost = requests_per_day × 30 × cost_per_request

Example:
- 10,000 requests/day
- $0.0135/request
- Monthly = 10,000 × 30 × $0.0135 = $4,050

Python Cost Calculator

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelPricing:
    name: str
    input_per_million: float
    output_per_million: float

PRICING = {
    "claude-3-5-sonnet": ModelPricing("Claude 3.5 Sonnet", 3.00, 15.00),
    "claude-3-5-haiku": ModelPricing("Claude 3.5 Haiku", 0.25, 1.25),
    "claude-3-opus": ModelPricing("Claude 3 Opus", 15.00, 75.00),
    "gpt-4o": ModelPricing("GPT-4o", 2.50, 10.00),
    "gpt-4o-mini": ModelPricing("GPT-4o-mini", 0.15, 0.60),
}

def calculate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int
) -> float:
    """Calculate cost for a single request."""
    pricing = PRICING[model]
    input_cost = (input_tokens / 1_000_000) * pricing.input_per_million
    output_cost = (output_tokens / 1_000_000) * pricing.output_per_million
    return input_cost + output_cost

def project_monthly_cost(
    model: str,
    avg_input_tokens: int,
    avg_output_tokens: int,
    requests_per_day: int
) -> dict:
    """Project monthly costs with breakdown."""
    cost_per_request = calculate_cost(model, avg_input_tokens, avg_output_tokens)
    daily_cost = cost_per_request * requests_per_day
    monthly_cost = daily_cost * 30

    return {
        "model": model,
        "cost_per_request": cost_per_request,
        "daily_cost": daily_cost,
        "monthly_cost": monthly_cost,
        "annual_cost": monthly_cost * 12,
    }

# Example usage {.unnumbered}
projection = project_monthly_cost(
    model="claude-3-5-sonnet",
    avg_input_tokens=2000,
    avg_output_tokens=500,
    requests_per_day=10000
)
print(f"Monthly cost: ${projection['monthly_cost']:,.2f}")

RAG System Cost Model

RAG Request Cost = Embedding Cost + Retrieval Cost + Generation Cost

Embedding Cost:
- Query embedding: ~100 tokens × embedding_price
- Typically negligible ($0.0001 per query)

Retrieval Cost:
- Vector DB query: ~$0.00001 per query (self-hosted)
- Managed service: $0.0001-0.001 per query

Generation Cost (dominant):
- Context tokens: retrieved_chunks × tokens_per_chunk
- System prompt: ~500 tokens
- Query: ~100 tokens
- Response: ~300-1000 tokens

RAG Cost Calculator

def calculate_rag_cost(
    model: str,
    chunks_retrieved: int = 5,
    tokens_per_chunk: int = 500,
    system_prompt_tokens: int = 500,
    query_tokens: int = 100,
    response_tokens: int = 500,
    embedding_cost_per_query: float = 0.0001,
    retrieval_cost_per_query: float = 0.00001
) -> dict:
    """Calculate full RAG request cost."""

    # Input tokens = system prompt + context + query
    context_tokens = chunks_retrieved * tokens_per_chunk
    input_tokens = system_prompt_tokens + context_tokens + query_tokens

    # LLM cost
    llm_cost = calculate_cost(model, input_tokens, response_tokens)

    # Total
    total = embedding_cost_per_query + retrieval_cost_per_query + llm_cost

    return {
        "input_tokens": input_tokens,
        "output_tokens": response_tokens,
        "embedding_cost": embedding_cost_per_query,
        "retrieval_cost": retrieval_cost_per_query,
        "llm_cost": llm_cost,
        "total_cost": total,
        "cost_breakdown_pct": {
            "embedding": embedding_cost_per_query / total * 100,
            "retrieval": retrieval_cost_per_query / total * 100,
            "llm": llm_cost / total * 100,
        }
    }

GPU Self-Hosting Cost Model

GPU Instance Pricing (Approximate)

GPU Cloud (\(/hr) | Spot (\)/hr) On-Prem ($/mo amortized)
A100 40GB $3.50-4.00 $1.20-1.50 ~$800
A100 80GB $4.50-5.00 $1.50-2.00 ~$1,200
H100 80GB $8.00-10.00 $3.00-4.00 ~$2,500
A10G $1.50-2.00 $0.50-0.75 ~$300
L4 $0.80-1.00 $0.30-0.40 ~$250

Self-Hosted Cost Formula

Monthly GPU Cost = instances × hours_per_month × hourly_rate
                 = instances × 730 × hourly_rate

Cost per Request = Monthly GPU Cost / Monthly Requests

Break-even Analysis:
  API cost at volume V = V × cost_per_api_request
  GPU cost = fixed_monthly_cost

  Break-even volume = fixed_monthly_cost / cost_per_api_request

Break-Even Calculator

def calculate_breakeven(
    gpu_monthly_cost: float,
    api_cost_per_request: float,
    requests_per_month: int
) -> dict:
    """Calculate self-hosting vs API break-even."""

    # API cost at this volume
    api_monthly_cost = requests_per_month * api_cost_per_request

    # Break-even point
    breakeven_requests = gpu_monthly_cost / api_cost_per_request

    # Savings at current volume
    savings = api_monthly_cost - gpu_monthly_cost
    savings_pct = (savings / api_monthly_cost * 100) if api_monthly_cost > 0 else 0

    return {
        "gpu_monthly_cost": gpu_monthly_cost,
        "api_monthly_cost": api_monthly_cost,
        "breakeven_requests": int(breakeven_requests),
        "current_requests": requests_per_month,
        "monthly_savings": savings,
        "savings_percent": savings_pct,
        "recommendation": "self-host" if savings > 0 else "use API"
    }

# Example: Should we self-host Llama 70B? {.unnumbered}
result = calculate_breakeven(
    gpu_monthly_cost=4 * 730 * 4.00,  # 4x A100 at $4/hr
    api_cost_per_request=0.015,        # Comparable API cost
    requests_per_month=1_500_000       # 1.5M requests/month
)
print(f"Recommendation: {result['recommendation']}")
print(f"Monthly savings: ${result['monthly_savings']:,.2f}")

Evaluation Cost Model

Daily Eval Cost = (samples × metrics × judge_cost_per_call)

judge_cost_per_call = calculate_cost(
    judge_model,
    input_tokens=~2000,  # context + response + rubric
    output_tokens=~200   # rating + reasoning
)

Example (500 daily samples, 3 metrics, Claude 3.5 Sonnet judge):
- Judge cost = (2000 × $3/1M) + (200 × $15/1M) = $0.009
- Daily cost = 500 × 3 × $0.009 = $13.50
- Monthly cost = $13.50 × 30 = $405

Cost Optimization Decision Tree

                    Is latency critical?
                           │
              ┌────────────┴────────────┐
              │ YES                     │ NO
              ▼                         ▼
    Need fastest model?          Cost sensitive?
              │                         │
    ┌─────────┴─────────┐     ┌─────────┴─────────┐
    │ YES               │ NO  │ YES               │ NO
    ▼                   ▼     ▼                   ▼
  Opus/GPT-4      Sonnet/4o  Haiku/Mini     Sonnet/4o
  (highest $)    (balanced)  (lowest $)    (best quality)


              Volume > 1M requests/month?
                           │
              ┌────────────┴────────────┐
              │ YES                     │ NO
              ▼                         ▼
       Consider self-hosting      Stick with API
       or committed use pricing   (simpler ops)

Quick Reference: Cost Reduction Strategies

Strategy Effort Savings When to Use
Prompt caching Low 10-30% Repeated system prompts
Response caching Low 20-50% Common queries
Smaller model for routing Medium 30-50% Multi-step workflows
Batch processing Medium 50% Non-real-time workloads
Fine-tuned smaller model High 60-80% Narrow domain, high volume
Self-hosted inference High 50-80% Very high volume, predictable

Budget Alert Thresholds

# Recommended alert thresholds {.unnumbered}
ALERT_THRESHOLDS = {
    "daily_spend": {
        "warning": 0.8,   # 80% of daily budget
        "critical": 1.0,  # 100% of daily budget
    },
    "cost_per_request": {
        "warning": 1.5,   # 50% above baseline
        "critical": 2.0,  # 100% above baseline
    },
    "monthly_projection": {
        "warning": 0.9,   # 90% of monthly budget
        "critical": 1.1,  # 110% of monthly budget
    }
}