Appendix K: LLM Provider Comparison

This appendix provides a comprehensive comparison of major LLM providers to help you make informed decisions about which provider to use for different use cases. The AI landscape evolves rapidly, and pricing and capabilities change frequently.

Last Updated: May 2026 Important: Pricing and specifications change frequently. Always verify current rates on provider websites before making decisions. This appendix provides a snapshot for comparison purposes.


1. Provider Overview

The following table summarizes the major LLM providers, their flagship models, and key characteristics.

Provider Models and Context Windows
Provider Flagship Models Context Window
Anthropic Claude Opus 4.8, Sonnet 4.6, Haiku 4.5 1M (Opus/Sonnet), 200K (Haiku)
OpenAI GPT-5.5, GPT-5.4, GPT-5.4-mini ~1.05M (GPT-5.5)
Google Gemini 3.1 Pro, 3.5 Flash, 3 Pro 2M tokens
Meta Llama 4 Scout, Llama 4 Maverick 10M (Scout), 1M (Maverick)
Mistral Mistral Large 3, Medium 3 128K tokens
Cohere Command R+, Command R, Embed v3 128K tokens
Provider Pricing and Use Cases
Provider Pricing (Input / Output per 1M tokens) Best For
Anthropic Opus: $5/$25, Sonnet: $3/$15, Haiku: $1/$5 Complex reasoning, safety-critical, coding
OpenAI GPT-5.5: $5/$30, GPT-5.4: $2.50/$15 General purpose, ecosystem, vision
Google 3.1 Pro: $2/$12 (≤200K), $4/$18 (>200K), 3.5 Flash: $1.50/$9 Multimodal, long context, GCP integration
Meta Scout: $0.15/$0.50, Maverick: $0.22/$0.85 Self-hosting, cost control, customization
Mistral Large 3: $2/$6, Medium 3: $0.40/$2 EU data residency, efficiency
Cohere R+: $2.50/$10, R: $0.50/$1.50 Enterprise RAG, embeddings, multilingual

Model Tiers Explained

Frontier/Flagship Models (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Llama 4 Maverick):

  • Highest capability across reasoning, coding, and complex tasks
  • Best for: Critical applications, complex analysis, creative work
  • Trade-off: Higher cost, slower response times

Balanced Models (Claude Sonnet 4.6, GPT-5.4, Gemini 3 Pro, Llama 4 Scout):

  • Strong performance at moderate cost
  • Best for: Production workloads, most business applications
  • Trade-off: Good balance of quality and cost

Fast/Efficient Models (Claude Haiku 4.5, GPT-5.4-mini, Gemini 3.5 Flash, Mistral Medium 3):

  • Optimized for speed and cost
  • Best for: High-volume tasks, simple queries, classification
  • Trade-off: Reduced capability on complex tasks

2. Capability Comparison Matrix

This matrix compares providers across key capabilities. Ratings are relative comparisons (Excellent/Good/Adequate/Limited) based on benchmarks and practical experience as of early 2026.

Capability Comparison - Frontier Providers
Capability Anthropic (Claude 4.x) OpenAI (GPT-5.x) Google (Gemini 3.x)
Code Generation Excellent Excellent Excellent
Reasoning Excellent Excellent Good
Long Context Excellent (200K-1M) Excellent (~1M) Excellent (2M)
Tool Use Excellent Excellent Excellent
Vision Excellent Excellent Excellent
Structured Output Excellent Excellent Excellent
Streaming Excellent Excellent Good
Instruction Following Excellent Excellent Good
Multilingual Good Good Excellent
Math/Science Excellent Excellent Good
Safety/Alignment Excellent Good Good
Consistency Excellent Good Good
Capability Comparison - Open-Weight and Specialized Providers
Capability Meta (Llama 4) Mistral Cohere
Code Generation Good Good Adequate
Reasoning Good Good Good
Long Context Excellent (10M) Good (128K) Good (128K)
Tool Use Good Good Good
Vision Excellent Good Limited
Structured Output Good Good Good
Streaming Good Good Good
Instruction Following Good Good Good
Multilingual Good Good Excellent
Math/Science Good Good Adequate
Safety/Alignment Good Adequate Good
Consistency Good Good Good

Capability Details

Code Generation - Claude 4.x: Exceptional at understanding codebases, refactoring, and generating production-quality code - GPT-5.x: Strong code generation with excellent debugging capabilities; Codex variants optimized for agentic coding - Gemini 3.x: Significantly improved, competitive for most tasks - Llama 4: Competitive for many use cases, especially with 10M context for large codebases

Reasoning - Claude Opus 4.8: Best-in-class for complex multi-step reasoning - GPT-5.5 Thinking: Extended reasoning mode with chain-of-thought - Others: Competent but less consistent on edge cases

Long Context - Llama 4 Scout: Industry-leading 10M token context window - Gemini 3.1 Pro: 2M token windows with excellent recall - GPT-5.5: ~1.05M tokens with strong retrieval - Claude 4.x: 1M context (Opus/Sonnet), 200K (Haiku)

Tool Use / Function Calling - Claude & GPT-5.x: Native support, reliable parameter extraction - Gemini 3.x: Now excellent with native tool use - Others: Support available but may require more prompt engineering

Vision / Multimodal - Gemini 3.x: Native multimodal architecture, best for video understanding - GPT-5.x: Excellent vision with 86% accuracy on GUI understanding (ScreenSpot-Pro) - Claude 4.x: Excellent image understanding - Llama 4: Native multimodal with natively fused vision capabilities


3. API Feature Comparison

API Features - Frontier Providers
Feature Anthropic OpenAI Google
Rate Limits (Tier 1) 60 RPM, 60K TPM 60 RPM, 60K TPM 60 RPM
Rate Limits (Enterprise) Custom Custom Custom
Batch API Yes Yes Yes
Function Calling Native (tools) Native (functions) Native
Prompt Caching Yes (explicit cache_control) Yes Yes
Fine-tuning Limited access Yes (GPT-5.x) Yes (Gemini)
Embeddings API No (use Voyage) Yes Yes
Streaming Yes Yes Yes
JSON Mode Yes Yes Yes
System Prompts Yes Yes Yes
Message History Stateless Stateless Stateful option
Content Filtering Built-in Built-in Built-in
Usage Analytics Dashboard Dashboard Cloud Console
API Features - Open-Weight and Specialized Providers
Feature Meta (via providers) Mistral Cohere
Rate Limits (Tier 1) Varies by provider 60 RPM 100 RPM
Rate Limits (Enterprise) N/A Custom Custom
Batch API N/A No Yes
Function Calling Via frameworks Native Native
Prompt Caching N/A (local) No Yes
Fine-tuning Full (self-host) Yes Yes
Embeddings API Yes (self-host) Yes Yes (excellent)
Streaming Yes Yes Yes
JSON Mode Via prompting Yes Yes
System Prompts Yes Yes Yes
Message History Stateless Stateless Stateless
Content Filtering User-managed Built-in Built-in
Usage Analytics Self-managed Dashboard Dashboard

API Feature Details

Batch API Batch APIs allow you to submit many requests at once for asynchronous processing, typically at 50% discount:

  • Anthropic: Message Batches API for high-volume processing
  • OpenAI: Batch API with 24-hour processing window
  • Google: Batch prediction available via Vertex AI
  • Cohere: Bulk endpoints available

Prompt Caching Caching reduces costs for repeated context (system prompts, documents):

  • Anthropic: Prompt caching via explicit cache_control markers, up to 90% cost reduction on cached tokens
  • OpenAI: Automatic caching for identical prefixes
  • Google: Context caching available

Fine-tuning Availability

Provider Models Available Data Requirements Cost Model
OpenAI GPT-5.4, GPT-5.4-mini, GPT-4.1 10+ examples Per-token training + hosting
Google Gemini 3 Pro/Flash Varies Per-token training
Mistral Mistral models 100+ examples Per-token training
Cohere Command models 100+ examples Per-token training
Meta All Llama models Any amount Free (compute costs)

4. When to Use Each Provider

Anthropic (Claude)

Best Use Cases: - Complex reasoning and analysis tasks - Long document processing (contracts, research papers, codebases) - Safety-critical applications requiring predictable behavior - Coding assistants and code review - Tasks requiring nuanced understanding and careful responses - Applications where consistency and reliability are paramount

Considerations: - No native embeddings API (use Voyage AI or alternatives) - Limited fine-tuning access - Slightly higher latency for Opus

Example Scenarios:

- Legal document analysis requiring 100+ page context
- Medical information systems with safety requirements
- Enterprise coding assistants
- Complex customer support requiring nuanced responses

OpenAI (GPT-5)

Best Use Cases: - General-purpose applications requiring broad capabilities - Integration with existing OpenAI ecosystem (DALL-E, Whisper, TTS) - Applications benefiting from plugins and function calling - Real-time applications (GPT-5.4 mini/nano optimized for speed) - Mathematical reasoning (o1 models) - Rapid prototyping with extensive documentation and examples

Considerations: - o1 models have different API patterns (no streaming, no system prompts) - Rate limits can be restrictive at lower tiers - Pricing has increased over time for flagship models

Example Scenarios:

- Customer-facing chatbots with plugin integrations
- Content generation pipelines
- Real-time voice applications
- Complex math/science problems (o1)

Google (Gemini)

Best Use Cases: - Multimodal applications (image, video, audio understanding) - Extremely long context requirements (1M+ tokens) - Integration with Google Cloud services (BigQuery, Vertex AI) - Applications requiring native multimodal reasoning - Video understanding and analysis

Considerations: - API patterns differ from OpenAI standard - Some features only available via Vertex AI - Regional availability varies

Example Scenarios:

- Video content analysis and summarization
- Processing entire codebases or book-length documents
- Multimodal search and retrieval
- Google Workspace integrations

Meta (Llama)

Best Use Cases: - Self-hosting requirements (data sovereignty, air-gapped environments) - Cost control at high volume (no per-token fees) - Full customization and fine-tuning flexibility - On-premise deployment requirements - Research and experimentation - Building custom model variants

Considerations: - Requires ML infrastructure expertise - Compute costs can exceed API costs at low volumes - No managed service from Meta directly - Quality gap vs. frontier models on complex tasks

Example Scenarios:

- Healthcare applications with strict data requirements
- High-volume applications where API costs are prohibitive
- Custom fine-tuned models for specific domains
- Edge deployment on-device

Mistral

Best Use Cases: - European data residency requirements (EU-based company) - Efficiency-focused applications - Open-weight model preferences - Balance of capability and cost - GDPR compliance needs

Considerations: - Smaller context windows than competitors - Less extensive documentation than OpenAI/Anthropic - Fewer third-party integrations

Example Scenarios:

- EU-based applications with data residency requirements
- Cost-efficient production deployments
- Self-hosted with Mixtral for high capability

Cohere

Best Use Cases: - Enterprise search and RAG systems - Embedding-heavy applications - Multilingual applications (100+ languages) - Reranking in retrieval pipelines - Enterprise with existing Cohere relationship

Considerations: - Less powerful for general reasoning vs. Claude/GPT-5 - Strongest in retrieval-specific use cases - Good enterprise support and compliance

Example Scenarios:

- Enterprise knowledge base search
- Multilingual customer support
- Document retrieval and ranking systems
- Semantic search applications

Decision Flowchart

Start
  |
  v
Do you need European data residency?
  |-- Yes --> Mistral (or self-hosted Llama)
  |-- No
  v
Do you need to self-host?
  |-- Yes --> Llama (or Mistral open models)
  |-- No
  v
Is this a multimodal application (video/long docs)?
  |-- Yes --> Gemini 3.1 Pro
  |-- No
  v
Is complex reasoning the primary requirement?
  |-- Yes --> Claude Opus or OpenAI o1
  |-- No
  v
Is this enterprise RAG/search?
  |-- Yes --> Cohere Command R+
  |-- No
  v
Is this high-volume, cost-sensitive?
  |-- Yes --> Claude Haiku / GPT-5.4-mini / Gemini Flash
  |-- No
  v
Default recommendation:
  - Claude Sonnet 4.6 (balanced choice)
  - GPT-5.4 (if OpenAI ecosystem needed)

5. Cost Comparison Scenarios

The following scenarios illustrate typical costs across providers. Assumes average request of 1,000 input tokens and 500 output tokens.

Scenario 1: Low Volume (10K requests/month)

Typical use case: Internal tool, early-stage startup, development/testing

Provider Model Input Cost Output Cost Monthly Total
Anthropic Claude Sonnet 4.6 $30 $75 $105
Anthropic Claude Haiku 4.5 $10 $25 $35
OpenAI GPT-5.4 $25 $75 $100
Meta Llama 4 Scout (API) $1.50 $2.50 $4.00
Google Gemini 3 Pro $20 $60 $80
Google Gemini 3.5 Flash $15 $45 $60
Mistral Mistral Medium 3 $4 $10 $14
Cohere Command R+ $25 $50 $75

Recommendation for Low Volume: Use flagship models (Sonnet, GPT-5.4) for quality. Cost difference is modest at this scale—even Sonnet costs only ~$105/month at 10K requests.

Scenario 2: Medium Volume (100K requests/month)

Typical use case: Production application, B2B SaaS, customer support

Provider Model Input Cost Output Cost Monthly Total
Anthropic Claude Sonnet 4.6 $300 $750 $1,050
Anthropic Claude Haiku 4.5 $100 $250 $350
OpenAI GPT-5.4 $250 $750 $1,000
Meta Llama 4 Scout (API) $15 $25 $40
Google Gemini 3 Pro $200 $600 $800
Google Gemini 3.5 Flash $150 $450 $600
Mistral Mistral Medium 3 $40 $100 $140
Cohere Command R+ $250 $500 $750

Recommendation for Medium Volume: Consider model tiering - use flagship for complex queries, fast models for simple ones. Implement caching to reduce costs 20-50%.

Scenario 3: High Volume (1M+ requests/month)

Typical use case: Consumer application, high-traffic API, enterprise deployment

Provider Model Input Cost Output Cost Monthly Total With Caching (est.)
Anthropic Claude Sonnet 4.6 $3,000 $7,500 $10,500 ~$7,000
Anthropic Claude Haiku 4.5 $1,000 $2,500 $3,500 ~$2,300
OpenAI GPT-5.4 $2,500 $7,500 $10,000 ~$6,700
Meta Llama 4 Scout (API) $150 $250 $400 ~$300
Google Gemini 3 Pro $2,000 $6,000 $8,000 ~$5,400
Google Gemini 3.5 Flash $1,500 $4,500 $6,000 ~$4,000
Llama 4 Self-hosted* - - $2,000-5,000 N/A

*Self-hosted costs include GPU rental (e.g., 2x A100 80GB at ~$2-3/hr) plus infrastructure. Break-even vs. API typically at 500K-2M requests/month depending on complexity.

Recommendation for High Volume: 1. Implement aggressive caching (prompt caching, response caching) 2. Use model routing (simple queries to fast models) 3. Consider self-hosting for predictable, high-volume workloads 4. Negotiate enterprise pricing (typically 20-40% discount) 5. Use batch APIs for non-real-time workloads (50% savings)

Cost Optimization Strategies

Strategy Potential Savings Implementation Effort
Prompt caching 30-90% on cached tokens Low (some providers require explicit setup)
Model routing 40-70% Medium
Batch processing 50% Low
Response caching 20-50% Medium
Self-hosting Variable (can be negative) High
Enterprise contracts 20-40% Low
Prompt optimization 10-30% Medium

6. Open Source vs API Trade-offs

Decision Framework

Use this framework to decide between self-hosting open models vs. using API providers.

When to Use APIs

Factor API Advantage
Volume < 500K requests/month
Team No ML infrastructure expertise
Time to market Need to ship quickly
Variability Traffic is unpredictable/bursty
Capability Need frontier model performance
Compliance Provider has necessary certifications
Budget Limited upfront capital

When to Self-Host

Factor Self-Host Advantage
Volume > 1M requests/month sustained
Data sovereignty Strict requirements (healthcare, government)
Customization Need fine-tuned or modified models
Latency Need <100ms inference
Cost predictability Need fixed monthly costs
Control Need full control over model behavior
Air-gapped No external network access allowed

Total Cost of Ownership Comparison

API Model (Example: Claude Sonnet 4.6 at 1M requests/month)

Cost Category Monthly Cost
API usage $10,500
Engineering time (integration) ~$500 (amortized)
Monitoring/observability ~$100
Total ~$11,100/month

Self-Hosted Model (Example: Llama 4 Scout at 1M requests/month)

Cost Category Monthly Cost
GPU compute (2x A100 80GB) $2,000-4,000
Infrastructure (networking, storage) $200-500
Engineering time (ops) $1,000-2,000
Monitoring/observability $100-200
Total $3,300-6,700/month

Break-Even Analysis

API cost per request = $0.0105 (Sonnet) or $0.00045 (GPT-5.4-mini)
Self-hosted cost per request = $0.003-0.007 (including ops overhead)

For flagship-equivalent quality:
  Break-even: ~300K-500K requests/month
  (Self-hosting is cheaper at 1M+ requests for most workloads)

For efficient model replacement (GPT-5.4-mini equivalent):
  Break-even: ~2-5M requests/month
  Exception: Air-gapped or data sovereignty requirements justify any volume

Hybrid Approaches

Many organizations use a hybrid approach:

  1. Primary API + Fallback Self-Hosted
    • Use API for normal traffic
    • Self-hosted handles overflow or API outages
  2. Tiered by Sensitivity
    • Sensitive data: Self-hosted
    • General queries: API
  3. Development vs Production
    • Development: Self-hosted (no per-query cost)
    • Production: API (reliability)
  4. Task-Based Routing
    • Complex tasks: Frontier API models
    • Simple tasks: Self-hosted efficient models

Self-Hosting Technology Stack

If you decide to self-host, here’s a recommended stack:

Component Recommended Options
Inference Engine vLLM, TGI, TensorRT-LLM
Model Format Safetensors, GGUF (for llama.cpp)
Quantization AWQ, GPTQ, or FP8 (H100)
Orchestration Kubernetes + KServe
Load Balancing Nginx, HAProxy, or cloud LB
Monitoring Prometheus + Grafana
GPU Provider AWS, GCP, Azure, Lambda Labs, RunPod

Self-Hosting Checklist

Before self-hosting, ensure you can answer “yes” to these questions:


Quick Reference Tables

Model Selection by Task

Task Recommended Models
General chat/assistant Claude Sonnet 4.6, GPT-5.4
Complex reasoning Claude Opus 4.8, OpenAI o1
Code generation Claude Sonnet 4.6, GPT-5.4
Long document analysis Claude Sonnet 4.6, Gemini 3.1 Pro
Very long context (500K+) Gemini 3.1 Pro
Fast/cheap classification Claude Haiku 4.5, GPT-5.4-mini, Gemini Flash
Embeddings OpenAI text-embedding-3, Cohere Embed v3
Multilingual Cohere Command R+, Gemini
Self-hosted production Llama 4 Maverick, Llama 4 Scout
Self-hosted efficient Llama 3.2 8B, Mistral 7B
European data residency Mistral (La Plateforme)
Enterprise RAG Cohere Command R+

Pricing Quick Reference (per 1M tokens)

Model Tier Input Output Provider
Frontier
Claude Opus 4.8 $5 $25 Anthropic
GPT-5.5 $5 $30 OpenAI
Gemini 3.1 Pro (≤200K) $2 $12 Google
Balanced
Claude Sonnet 4.6 $3 $15 Anthropic
GPT-5.4 $2.50 $15 OpenAI
Gemini 3 Pro $2 $12 Google
Mistral Large 3 $2 $6 Mistral
Command R+ $2.50 $10 Cohere
Efficient
Claude Haiku 4.5 $1 $5 Anthropic
Llama 4 Scout $0.15 $0.50 API providers
Gemini 3.5 Flash $1.50 $9 Google
Mistral Medium 3 $0.40 $2 Mistral

Changelog and Updates

This appendix will be updated as providers release new models and change pricing. Key changes to watch for:

  • Model releases: New model versions typically offer better price/performance
  • Pricing changes: Providers frequently adjust pricing (usually downward)
  • Feature additions: New capabilities (longer context, new modalities)
  • Fine-tuning availability: Access to fine-tuning expands over time

Verification Resources

Always verify current information at these official sources:

Provider Pricing Page Documentation
Anthropic anthropic.com/pricing docs.anthropic.com
OpenAI openai.com/pricing platform.openai.com/docs
Google cloud.google.com/vertex-ai/pricing ai.google.dev/docs
Meta N/A (self-hosted) llama.meta.com
Mistral mistral.ai/pricing docs.mistral.ai
Cohere cohere.com/pricing docs.cohere.com

Summary

Choosing an LLM provider involves balancing multiple factors:

  1. Capability requirements: Match model capabilities to your task complexity
  2. Cost at scale: Model pricing varies 100x between tiers
  3. Integration needs: Consider existing infrastructure and ecosystem
  4. Data requirements: Sovereignty, compliance, and privacy needs
  5. Operational capacity: Self-hosting requires significant expertise

For most applications, starting with API providers and optimizing costs through caching, model routing, and batch processing is the pragmatic approach. Reserve self-hosting for specific requirements around data sovereignty, extreme cost optimization at scale, or specialized fine-tuned models.

The landscape evolves rapidly. What’s optimal today may change in 6 months. Build abstractions that allow switching providers, and regularly reassess your choices as new models and pricing emerge.