Appendix K: LLM Provider Comparison
This appendix provides a comprehensive comparison of major LLM providers to help you make informed decisions about which provider to use for different use cases. The AI landscape evolves rapidly, and pricing and capabilities change frequently.
Last Updated: May 2026 Important: Pricing and specifications change frequently. Always verify current rates on provider websites before making decisions. This appendix provides a snapshot for comparison purposes.
1. Provider Overview
The following table summarizes the major LLM providers, their flagship models, and key characteristics.
| Provider | Flagship Models | Context Window |
|---|---|---|
| Anthropic | Claude Opus 4.8, Sonnet 4.6, Haiku 4.5 | 1M (Opus/Sonnet), 200K (Haiku) |
| OpenAI | GPT-5.5, GPT-5.4, GPT-5.4-mini | ~1.05M (GPT-5.5) |
| Gemini 3.1 Pro, 3.5 Flash, 3 Pro | 2M tokens | |
| Meta | Llama 4 Scout, Llama 4 Maverick | 10M (Scout), 1M (Maverick) |
| Mistral | Mistral Large 3, Medium 3 | 128K tokens |
| Cohere | Command R+, Command R, Embed v3 | 128K tokens |
| Provider | Pricing (Input / Output per 1M tokens) | Best For |
|---|---|---|
| Anthropic | Opus: $5/$25, Sonnet: $3/$15, Haiku: $1/$5 | Complex reasoning, safety-critical, coding |
| OpenAI | GPT-5.5: $5/$30, GPT-5.4: $2.50/$15 | General purpose, ecosystem, vision |
| 3.1 Pro: $2/$12 (≤200K), $4/$18 (>200K), 3.5 Flash: $1.50/$9 | Multimodal, long context, GCP integration | |
| Meta | Scout: $0.15/$0.50, Maverick: $0.22/$0.85 | Self-hosting, cost control, customization |
| Mistral | Large 3: $2/$6, Medium 3: $0.40/$2 | EU data residency, efficiency |
| Cohere | R+: $2.50/$10, R: $0.50/$1.50 | Enterprise RAG, embeddings, multilingual |
Model Tiers Explained
Frontier/Flagship Models (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Llama 4 Maverick):
- Highest capability across reasoning, coding, and complex tasks
- Best for: Critical applications, complex analysis, creative work
- Trade-off: Higher cost, slower response times
Balanced Models (Claude Sonnet 4.6, GPT-5.4, Gemini 3 Pro, Llama 4 Scout):
- Strong performance at moderate cost
- Best for: Production workloads, most business applications
- Trade-off: Good balance of quality and cost
Fast/Efficient Models (Claude Haiku 4.5, GPT-5.4-mini, Gemini 3.5 Flash, Mistral Medium 3):
- Optimized for speed and cost
- Best for: High-volume tasks, simple queries, classification
- Trade-off: Reduced capability on complex tasks
2. Capability Comparison Matrix
This matrix compares providers across key capabilities. Ratings are relative comparisons (Excellent/Good/Adequate/Limited) based on benchmarks and practical experience as of early 2026.
| Capability | Anthropic (Claude 4.x) | OpenAI (GPT-5.x) | Google (Gemini 3.x) |
|---|---|---|---|
| Code Generation | Excellent | Excellent | Excellent |
| Reasoning | Excellent | Excellent | Good |
| Long Context | Excellent (200K-1M) | Excellent (~1M) | Excellent (2M) |
| Tool Use | Excellent | Excellent | Excellent |
| Vision | Excellent | Excellent | Excellent |
| Structured Output | Excellent | Excellent | Excellent |
| Streaming | Excellent | Excellent | Good |
| Instruction Following | Excellent | Excellent | Good |
| Multilingual | Good | Good | Excellent |
| Math/Science | Excellent | Excellent | Good |
| Safety/Alignment | Excellent | Good | Good |
| Consistency | Excellent | Good | Good |
| Capability | Meta (Llama 4) | Mistral | Cohere |
|---|---|---|---|
| Code Generation | Good | Good | Adequate |
| Reasoning | Good | Good | Good |
| Long Context | Excellent (10M) | Good (128K) | Good (128K) |
| Tool Use | Good | Good | Good |
| Vision | Excellent | Good | Limited |
| Structured Output | Good | Good | Good |
| Streaming | Good | Good | Good |
| Instruction Following | Good | Good | Good |
| Multilingual | Good | Good | Excellent |
| Math/Science | Good | Good | Adequate |
| Safety/Alignment | Good | Adequate | Good |
| Consistency | Good | Good | Good |
Capability Details
Code Generation - Claude 4.x: Exceptional at understanding codebases, refactoring, and generating production-quality code - GPT-5.x: Strong code generation with excellent debugging capabilities; Codex variants optimized for agentic coding - Gemini 3.x: Significantly improved, competitive for most tasks - Llama 4: Competitive for many use cases, especially with 10M context for large codebases
Reasoning - Claude Opus 4.8: Best-in-class for complex multi-step reasoning - GPT-5.5 Thinking: Extended reasoning mode with chain-of-thought - Others: Competent but less consistent on edge cases
Long Context - Llama 4 Scout: Industry-leading 10M token context window - Gemini 3.1 Pro: 2M token windows with excellent recall - GPT-5.5: ~1.05M tokens with strong retrieval - Claude 4.x: 1M context (Opus/Sonnet), 200K (Haiku)
Tool Use / Function Calling - Claude & GPT-5.x: Native support, reliable parameter extraction - Gemini 3.x: Now excellent with native tool use - Others: Support available but may require more prompt engineering
Vision / Multimodal - Gemini 3.x: Native multimodal architecture, best for video understanding - GPT-5.x: Excellent vision with 86% accuracy on GUI understanding (ScreenSpot-Pro) - Claude 4.x: Excellent image understanding - Llama 4: Native multimodal with natively fused vision capabilities
3. API Feature Comparison
| Feature | Anthropic | OpenAI | |
|---|---|---|---|
| Rate Limits (Tier 1) | 60 RPM, 60K TPM | 60 RPM, 60K TPM | 60 RPM |
| Rate Limits (Enterprise) | Custom | Custom | Custom |
| Batch API | Yes | Yes | Yes |
| Function Calling | Native (tools) | Native (functions) | Native |
| Prompt Caching | Yes (explicit cache_control) | Yes | Yes |
| Fine-tuning | Limited access | Yes (GPT-5.x) | Yes (Gemini) |
| Embeddings API | No (use Voyage) | Yes | Yes |
| Streaming | Yes | Yes | Yes |
| JSON Mode | Yes | Yes | Yes |
| System Prompts | Yes | Yes | Yes |
| Message History | Stateless | Stateless | Stateful option |
| Content Filtering | Built-in | Built-in | Built-in |
| Usage Analytics | Dashboard | Dashboard | Cloud Console |
| Feature | Meta (via providers) | Mistral | Cohere |
|---|---|---|---|
| Rate Limits (Tier 1) | Varies by provider | 60 RPM | 100 RPM |
| Rate Limits (Enterprise) | N/A | Custom | Custom |
| Batch API | N/A | No | Yes |
| Function Calling | Via frameworks | Native | Native |
| Prompt Caching | N/A (local) | No | Yes |
| Fine-tuning | Full (self-host) | Yes | Yes |
| Embeddings API | Yes (self-host) | Yes | Yes (excellent) |
| Streaming | Yes | Yes | Yes |
| JSON Mode | Via prompting | Yes | Yes |
| System Prompts | Yes | Yes | Yes |
| Message History | Stateless | Stateless | Stateless |
| Content Filtering | User-managed | Built-in | Built-in |
| Usage Analytics | Self-managed | Dashboard | Dashboard |
API Feature Details
Batch API Batch APIs allow you to submit many requests at once for asynchronous processing, typically at 50% discount:
- Anthropic: Message Batches API for high-volume processing
- OpenAI: Batch API with 24-hour processing window
- Google: Batch prediction available via Vertex AI
- Cohere: Bulk endpoints available
Prompt Caching Caching reduces costs for repeated context (system prompts, documents):
- Anthropic: Prompt caching via explicit
cache_controlmarkers, up to 90% cost reduction on cached tokens - OpenAI: Automatic caching for identical prefixes
- Google: Context caching available
Fine-tuning Availability
| Provider | Models Available | Data Requirements | Cost Model |
|---|---|---|---|
| OpenAI | GPT-5.4, GPT-5.4-mini, GPT-4.1 | 10+ examples | Per-token training + hosting |
| Gemini 3 Pro/Flash | Varies | Per-token training | |
| Mistral | Mistral models | 100+ examples | Per-token training |
| Cohere | Command models | 100+ examples | Per-token training |
| Meta | All Llama models | Any amount | Free (compute costs) |
4. When to Use Each Provider
Anthropic (Claude)
Best Use Cases: - Complex reasoning and analysis tasks - Long document processing (contracts, research papers, codebases) - Safety-critical applications requiring predictable behavior - Coding assistants and code review - Tasks requiring nuanced understanding and careful responses - Applications where consistency and reliability are paramount
Considerations: - No native embeddings API (use Voyage AI or alternatives) - Limited fine-tuning access - Slightly higher latency for Opus
Example Scenarios:
- Legal document analysis requiring 100+ page context
- Medical information systems with safety requirements
- Enterprise coding assistants
- Complex customer support requiring nuanced responses
OpenAI (GPT-5)
Best Use Cases: - General-purpose applications requiring broad capabilities - Integration with existing OpenAI ecosystem (DALL-E, Whisper, TTS) - Applications benefiting from plugins and function calling - Real-time applications (GPT-5.4 mini/nano optimized for speed) - Mathematical reasoning (o1 models) - Rapid prototyping with extensive documentation and examples
Considerations: - o1 models have different API patterns (no streaming, no system prompts) - Rate limits can be restrictive at lower tiers - Pricing has increased over time for flagship models
Example Scenarios:
- Customer-facing chatbots with plugin integrations
- Content generation pipelines
- Real-time voice applications
- Complex math/science problems (o1)
Google (Gemini)
Best Use Cases: - Multimodal applications (image, video, audio understanding) - Extremely long context requirements (1M+ tokens) - Integration with Google Cloud services (BigQuery, Vertex AI) - Applications requiring native multimodal reasoning - Video understanding and analysis
Considerations: - API patterns differ from OpenAI standard - Some features only available via Vertex AI - Regional availability varies
Example Scenarios:
- Video content analysis and summarization
- Processing entire codebases or book-length documents
- Multimodal search and retrieval
- Google Workspace integrations
Meta (Llama)
Best Use Cases: - Self-hosting requirements (data sovereignty, air-gapped environments) - Cost control at high volume (no per-token fees) - Full customization and fine-tuning flexibility - On-premise deployment requirements - Research and experimentation - Building custom model variants
Considerations: - Requires ML infrastructure expertise - Compute costs can exceed API costs at low volumes - No managed service from Meta directly - Quality gap vs. frontier models on complex tasks
Example Scenarios:
- Healthcare applications with strict data requirements
- High-volume applications where API costs are prohibitive
- Custom fine-tuned models for specific domains
- Edge deployment on-device
Mistral
Best Use Cases: - European data residency requirements (EU-based company) - Efficiency-focused applications - Open-weight model preferences - Balance of capability and cost - GDPR compliance needs
Considerations: - Smaller context windows than competitors - Less extensive documentation than OpenAI/Anthropic - Fewer third-party integrations
Example Scenarios:
- EU-based applications with data residency requirements
- Cost-efficient production deployments
- Self-hosted with Mixtral for high capability
Cohere
Best Use Cases: - Enterprise search and RAG systems - Embedding-heavy applications - Multilingual applications (100+ languages) - Reranking in retrieval pipelines - Enterprise with existing Cohere relationship
Considerations: - Less powerful for general reasoning vs. Claude/GPT-5 - Strongest in retrieval-specific use cases - Good enterprise support and compliance
Example Scenarios:
- Enterprise knowledge base search
- Multilingual customer support
- Document retrieval and ranking systems
- Semantic search applications
Decision Flowchart
Start
|
v
Do you need European data residency?
|-- Yes --> Mistral (or self-hosted Llama)
|-- No
v
Do you need to self-host?
|-- Yes --> Llama (or Mistral open models)
|-- No
v
Is this a multimodal application (video/long docs)?
|-- Yes --> Gemini 3.1 Pro
|-- No
v
Is complex reasoning the primary requirement?
|-- Yes --> Claude Opus or OpenAI o1
|-- No
v
Is this enterprise RAG/search?
|-- Yes --> Cohere Command R+
|-- No
v
Is this high-volume, cost-sensitive?
|-- Yes --> Claude Haiku / GPT-5.4-mini / Gemini Flash
|-- No
v
Default recommendation:
- Claude Sonnet 4.6 (balanced choice)
- GPT-5.4 (if OpenAI ecosystem needed)
5. Cost Comparison Scenarios
The following scenarios illustrate typical costs across providers. Assumes average request of 1,000 input tokens and 500 output tokens.
Scenario 1: Low Volume (10K requests/month)
Typical use case: Internal tool, early-stage startup, development/testing
| Provider | Model | Input Cost | Output Cost | Monthly Total |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4.6 | $30 | $75 | $105 |
| Anthropic | Claude Haiku 4.5 | $10 | $25 | $35 |
| OpenAI | GPT-5.4 | $25 | $75 | $100 |
| Meta | Llama 4 Scout (API) | $1.50 | $2.50 | $4.00 |
| Gemini 3 Pro | $20 | $60 | $80 | |
| Gemini 3.5 Flash | $15 | $45 | $60 | |
| Mistral | Mistral Medium 3 | $4 | $10 | $14 |
| Cohere | Command R+ | $25 | $50 | $75 |
Recommendation for Low Volume: Use flagship models (Sonnet, GPT-5.4) for quality. Cost difference is modest at this scale—even Sonnet costs only ~$105/month at 10K requests.
Scenario 2: Medium Volume (100K requests/month)
Typical use case: Production application, B2B SaaS, customer support
| Provider | Model | Input Cost | Output Cost | Monthly Total |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4.6 | $300 | $750 | $1,050 |
| Anthropic | Claude Haiku 4.5 | $100 | $250 | $350 |
| OpenAI | GPT-5.4 | $250 | $750 | $1,000 |
| Meta | Llama 4 Scout (API) | $15 | $25 | $40 |
| Gemini 3 Pro | $200 | $600 | $800 | |
| Gemini 3.5 Flash | $150 | $450 | $600 | |
| Mistral | Mistral Medium 3 | $40 | $100 | $140 |
| Cohere | Command R+ | $250 | $500 | $750 |
Recommendation for Medium Volume: Consider model tiering - use flagship for complex queries, fast models for simple ones. Implement caching to reduce costs 20-50%.
Scenario 3: High Volume (1M+ requests/month)
Typical use case: Consumer application, high-traffic API, enterprise deployment
| Provider | Model | Input Cost | Output Cost | Monthly Total | With Caching (est.) |
|---|---|---|---|---|---|
| Anthropic | Claude Sonnet 4.6 | $3,000 | $7,500 | $10,500 | ~$7,000 |
| Anthropic | Claude Haiku 4.5 | $1,000 | $2,500 | $3,500 | ~$2,300 |
| OpenAI | GPT-5.4 | $2,500 | $7,500 | $10,000 | ~$6,700 |
| Meta | Llama 4 Scout (API) | $150 | $250 | $400 | ~$300 |
| Gemini 3 Pro | $2,000 | $6,000 | $8,000 | ~$5,400 | |
| Gemini 3.5 Flash | $1,500 | $4,500 | $6,000 | ~$4,000 | |
| Llama 4 | Self-hosted* | - | - | $2,000-5,000 | N/A |
*Self-hosted costs include GPU rental (e.g., 2x A100 80GB at ~$2-3/hr) plus infrastructure. Break-even vs. API typically at 500K-2M requests/month depending on complexity.
Recommendation for High Volume: 1. Implement aggressive caching (prompt caching, response caching) 2. Use model routing (simple queries to fast models) 3. Consider self-hosting for predictable, high-volume workloads 4. Negotiate enterprise pricing (typically 20-40% discount) 5. Use batch APIs for non-real-time workloads (50% savings)
Cost Optimization Strategies
| Strategy | Potential Savings | Implementation Effort |
|---|---|---|
| Prompt caching | 30-90% on cached tokens | Low (some providers require explicit setup) |
| Model routing | 40-70% | Medium |
| Batch processing | 50% | Low |
| Response caching | 20-50% | Medium |
| Self-hosting | Variable (can be negative) | High |
| Enterprise contracts | 20-40% | Low |
| Prompt optimization | 10-30% | Medium |
6. Open Source vs API Trade-offs
Decision Framework
Use this framework to decide between self-hosting open models vs. using API providers.
When to Use APIs
| Factor | API Advantage |
|---|---|
| Volume | < 500K requests/month |
| Team | No ML infrastructure expertise |
| Time to market | Need to ship quickly |
| Variability | Traffic is unpredictable/bursty |
| Capability | Need frontier model performance |
| Compliance | Provider has necessary certifications |
| Budget | Limited upfront capital |
When to Self-Host
| Factor | Self-Host Advantage |
|---|---|
| Volume | > 1M requests/month sustained |
| Data sovereignty | Strict requirements (healthcare, government) |
| Customization | Need fine-tuned or modified models |
| Latency | Need <100ms inference |
| Cost predictability | Need fixed monthly costs |
| Control | Need full control over model behavior |
| Air-gapped | No external network access allowed |
Total Cost of Ownership Comparison
API Model (Example: Claude Sonnet 4.6 at 1M requests/month)
| Cost Category | Monthly Cost |
|---|---|
| API usage | $10,500 |
| Engineering time (integration) | ~$500 (amortized) |
| Monitoring/observability | ~$100 |
| Total | ~$11,100/month |
Self-Hosted Model (Example: Llama 4 Scout at 1M requests/month)
| Cost Category | Monthly Cost |
|---|---|
| GPU compute (2x A100 80GB) | $2,000-4,000 |
| Infrastructure (networking, storage) | $200-500 |
| Engineering time (ops) | $1,000-2,000 |
| Monitoring/observability | $100-200 |
| Total | $3,300-6,700/month |
Break-Even Analysis
API cost per request = $0.0105 (Sonnet) or $0.00045 (GPT-5.4-mini)
Self-hosted cost per request = $0.003-0.007 (including ops overhead)
For flagship-equivalent quality:
Break-even: ~300K-500K requests/month
(Self-hosting is cheaper at 1M+ requests for most workloads)
For efficient model replacement (GPT-5.4-mini equivalent):
Break-even: ~2-5M requests/month
Exception: Air-gapped or data sovereignty requirements justify any volume
Hybrid Approaches
Many organizations use a hybrid approach:
- Primary API + Fallback Self-Hosted
- Use API for normal traffic
- Self-hosted handles overflow or API outages
- Tiered by Sensitivity
- Sensitive data: Self-hosted
- General queries: API
- Development vs Production
- Development: Self-hosted (no per-query cost)
- Production: API (reliability)
- Task-Based Routing
- Complex tasks: Frontier API models
- Simple tasks: Self-hosted efficient models
Self-Hosting Technology Stack
If you decide to self-host, here’s a recommended stack:
| Component | Recommended Options |
|---|---|
| Inference Engine | vLLM, TGI, TensorRT-LLM |
| Model Format | Safetensors, GGUF (for llama.cpp) |
| Quantization | AWQ, GPTQ, or FP8 (H100) |
| Orchestration | Kubernetes + KServe |
| Load Balancing | Nginx, HAProxy, or cloud LB |
| Monitoring | Prometheus + Grafana |
| GPU Provider | AWS, GCP, Azure, Lambda Labs, RunPod |
Self-Hosting Checklist
Before self-hosting, ensure you can answer “yes” to these questions:
Quick Reference Tables
Model Selection by Task
| Task | Recommended Models |
|---|---|
| General chat/assistant | Claude Sonnet 4.6, GPT-5.4 |
| Complex reasoning | Claude Opus 4.8, OpenAI o1 |
| Code generation | Claude Sonnet 4.6, GPT-5.4 |
| Long document analysis | Claude Sonnet 4.6, Gemini 3.1 Pro |
| Very long context (500K+) | Gemini 3.1 Pro |
| Fast/cheap classification | Claude Haiku 4.5, GPT-5.4-mini, Gemini Flash |
| Embeddings | OpenAI text-embedding-3, Cohere Embed v3 |
| Multilingual | Cohere Command R+, Gemini |
| Self-hosted production | Llama 4 Maverick, Llama 4 Scout |
| Self-hosted efficient | Llama 3.2 8B, Mistral 7B |
| European data residency | Mistral (La Plateforme) |
| Enterprise RAG | Cohere Command R+ |
Pricing Quick Reference (per 1M tokens)
| Model Tier | Input | Output | Provider |
|---|---|---|---|
| Frontier | |||
| Claude Opus 4.8 | $5 | $25 | Anthropic |
| GPT-5.5 | $5 | $30 | OpenAI |
| Gemini 3.1 Pro (≤200K) | $2 | $12 | |
| Balanced | |||
| Claude Sonnet 4.6 | $3 | $15 | Anthropic |
| GPT-5.4 | $2.50 | $15 | OpenAI |
| Gemini 3 Pro | $2 | $12 | |
| Mistral Large 3 | $2 | $6 | Mistral |
| Command R+ | $2.50 | $10 | Cohere |
| Efficient | |||
| Claude Haiku 4.5 | $1 | $5 | Anthropic |
| Llama 4 Scout | $0.15 | $0.50 | API providers |
| Gemini 3.5 Flash | $1.50 | $9 | |
| Mistral Medium 3 | $0.40 | $2 | Mistral |
Changelog and Updates
This appendix will be updated as providers release new models and change pricing. Key changes to watch for:
- Model releases: New model versions typically offer better price/performance
- Pricing changes: Providers frequently adjust pricing (usually downward)
- Feature additions: New capabilities (longer context, new modalities)
- Fine-tuning availability: Access to fine-tuning expands over time
Verification Resources
Always verify current information at these official sources:
| Provider | Pricing Page | Documentation |
|---|---|---|
| Anthropic | anthropic.com/pricing | docs.anthropic.com |
| OpenAI | openai.com/pricing | platform.openai.com/docs |
| cloud.google.com/vertex-ai/pricing | ai.google.dev/docs | |
| Meta | N/A (self-hosted) | llama.meta.com |
| Mistral | mistral.ai/pricing | docs.mistral.ai |
| Cohere | cohere.com/pricing | docs.cohere.com |
Summary
Choosing an LLM provider involves balancing multiple factors:
- Capability requirements: Match model capabilities to your task complexity
- Cost at scale: Model pricing varies 100x between tiers
- Integration needs: Consider existing infrastructure and ecosystem
- Data requirements: Sovereignty, compliance, and privacy needs
- Operational capacity: Self-hosting requires significant expertise
For most applications, starting with API providers and optimizing costs through caching, model routing, and batch processing is the pragmatic approach. Reserve self-hosting for specific requirements around data sovereignty, extreme cost optimization at scale, or specialized fine-tuned models.
The landscape evolves rapidly. What’s optimal today may change in 6 months. Build abstractions that allow switching providers, and regularly reassess your choices as new models and pricing emerge.