Appendix K: LLM Provider Comparison

This appendix provides a comprehensive comparison of major LLM providers to help you make informed decisions about which provider to use for different use cases. The AI landscape evolves rapidly, and pricing and capabilities change frequently.

Last Updated: May 2026 Important: Pricing and specifications change frequently. Always verify current rates on provider websites before making decisions. This appendix provides a snapshot for comparison purposes.

1. Provider Overview

The following table summarizes the major LLM providers, their flagship models, and key characteristics.

Provider Models and Context Windows
Provider	Flagship Models	Context Window
Anthropic	Claude Opus 4.8, Sonnet 4.6, Haiku 4.5	1M (Opus/Sonnet), 200K (Haiku)
OpenAI	GPT-5.5, GPT-5.4, GPT-5.4-mini	~1.05M (GPT-5.5)
Google	Gemini 3.1 Pro, 3.5 Flash, 3 Pro	2M tokens
Meta	Llama 4 Scout, Llama 4 Maverick	10M (Scout), 1M (Maverick)
Mistral	Mistral Large 3, Medium 3	128K tokens
Cohere	Command R+, Command R, Embed v3	128K tokens

Provider Pricing and Use Cases
Provider	Pricing (Input / Output per 1M tokens)	Best For
Anthropic	Opus: $5/$25, Sonnet: $3/$15, Haiku: $1/$5	Complex reasoning, safety-critical, coding
OpenAI	GPT-5.5: $5/$30, GPT-5.4: $2.50/$15	General purpose, ecosystem, vision
Google	3.1 Pro: $2/$12 (≤200K), $4/$18 (>200K), 3.5 Flash: $1.50/$9	Multimodal, long context, GCP integration
Meta	Scout: $0.15/$0.50, Maverick: $0.22/$0.85	Self-hosting, cost control, customization
Mistral	Large 3: $2/$6, Medium 3: $0.40/$2	EU data residency, efficiency
Cohere	R+: $2.50/$10, R: $0.50/$1.50	Enterprise RAG, embeddings, multilingual

Model Tiers Explained

Frontier/Flagship Models (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Llama 4 Maverick):

Highest capability across reasoning, coding, and complex tasks
Best for: Critical applications, complex analysis, creative work
Trade-off: Higher cost, slower response times

Balanced Models (Claude Sonnet 4.6, GPT-5.4, Gemini 3 Pro, Llama 4 Scout):

Strong performance at moderate cost
Best for: Production workloads, most business applications
Trade-off: Good balance of quality and cost

Fast/Efficient Models (Claude Haiku 4.5, GPT-5.4-mini, Gemini 3.5 Flash, Mistral Medium 3):

Optimized for speed and cost
Best for: High-volume tasks, simple queries, classification
Trade-off: Reduced capability on complex tasks

2. Capability Comparison Matrix

This matrix compares providers across key capabilities. Ratings are relative comparisons (Excellent/Good/Adequate/Limited) based on benchmarks and practical experience as of early 2026.

Capability Comparison - Frontier Providers
Capability	Anthropic (Claude 4.x)	OpenAI (GPT-5.x)	Google (Gemini 3.x)
Code Generation	Excellent	Excellent	Excellent
Reasoning	Excellent	Excellent	Good
Long Context	Excellent (200K-1M)	Excellent (~1M)	Excellent (2M)
Tool Use	Excellent	Excellent	Excellent
Vision	Excellent	Excellent	Excellent
Structured Output	Excellent	Excellent	Excellent
Streaming	Excellent	Excellent	Good
Instruction Following	Excellent	Excellent	Good
Multilingual	Good	Good	Excellent
Math/Science	Excellent	Excellent	Good
Safety/Alignment	Excellent	Good	Good
Consistency	Excellent	Good	Good

Capability Comparison - Open-Weight and Specialized Providers
Capability	Meta (Llama 4)	Mistral	Cohere
Code Generation	Good	Good	Adequate
Reasoning	Good	Good	Good
Long Context	Excellent (10M)	Good (128K)	Good (128K)
Tool Use	Good	Good	Good
Vision	Excellent	Good	Limited
Structured Output	Good	Good	Good
Streaming	Good	Good	Good
Instruction Following	Good	Good	Good
Multilingual	Good	Good	Excellent
Math/Science	Good	Good	Adequate
Safety/Alignment	Good	Adequate	Good
Consistency	Good	Good	Good

Capability Details

Code Generation - Claude 4.x: Exceptional at understanding codebases, refactoring, and generating production-quality code - GPT-5.x: Strong code generation with excellent debugging capabilities; Codex variants optimized for agentic coding - Gemini 3.x: Significantly improved, competitive for most tasks - Llama 4: Competitive for many use cases, especially with 10M context for large codebases

Reasoning - Claude Opus 4.8: Best-in-class for complex multi-step reasoning - GPT-5.5 Thinking: Extended reasoning mode with chain-of-thought - Others: Competent but less consistent on edge cases

Long Context - Llama 4 Scout: Industry-leading 10M token context window - Gemini 3.1 Pro: 2M token windows with excellent recall - GPT-5.5: ~1.05M tokens with strong retrieval - Claude 4.x: 1M context (Opus/Sonnet), 200K (Haiku)

Tool Use / Function Calling - Claude & GPT-5.x: Native support, reliable parameter extraction - Gemini 3.x: Now excellent with native tool use - Others: Support available but may require more prompt engineering

Vision / Multimodal - Gemini 3.x: Native multimodal architecture, best for video understanding - GPT-5.x: Excellent vision with 86% accuracy on GUI understanding (ScreenSpot-Pro) - Claude 4.x: Excellent image understanding - Llama 4: Native multimodal with natively fused vision capabilities

3. API Feature Comparison

API Features - Frontier Providers
Feature	Anthropic	OpenAI	Google
Rate Limits (Tier 1)	60 RPM, 60K TPM	60 RPM, 60K TPM	60 RPM
Rate Limits (Enterprise)	Custom	Custom	Custom
Batch API	Yes	Yes	Yes
Function Calling	Native (tools)	Native (functions)	Native
Prompt Caching	Yes (explicit cache_control)	Yes	Yes
Fine-tuning	Limited access	Yes (GPT-5.x)	Yes (Gemini)
Embeddings API	No (use Voyage)	Yes	Yes
Streaming	Yes	Yes	Yes
JSON Mode	Yes	Yes	Yes
System Prompts	Yes	Yes	Yes
Message History	Stateless	Stateless	Stateful option
Content Filtering	Built-in	Built-in	Built-in
Usage Analytics	Dashboard	Dashboard	Cloud Console

API Features - Open-Weight and Specialized Providers
Feature	Meta (via providers)	Mistral	Cohere
Rate Limits (Tier 1)	Varies by provider	60 RPM	100 RPM
Rate Limits (Enterprise)	N/A	Custom	Custom
Batch API	N/A	No	Yes
Function Calling	Via frameworks	Native	Native
Prompt Caching	N/A (local)	No	Yes
Fine-tuning	Full (self-host)	Yes	Yes
Embeddings API	Yes (self-host)	Yes	Yes (excellent)
Streaming	Yes	Yes	Yes
JSON Mode	Via prompting	Yes	Yes
System Prompts	Yes	Yes	Yes
Message History	Stateless	Stateless	Stateless
Content Filtering	User-managed	Built-in	Built-in
Usage Analytics	Self-managed	Dashboard	Dashboard

API Feature Details

Batch API Batch APIs allow you to submit many requests at once for asynchronous processing, typically at 50% discount:

Anthropic: Message Batches API for high-volume processing
OpenAI: Batch API with 24-hour processing window
Google: Batch prediction available via Vertex AI
Cohere: Bulk endpoints available

Prompt Caching Caching reduces costs for repeated context (system prompts, documents):

Anthropic: Prompt caching via explicit cache_control markers, up to 90% cost reduction on cached tokens
OpenAI: Automatic caching for identical prefixes
Google: Context caching available

Fine-tuning Availability

Provider	Models Available	Data Requirements	Cost Model
OpenAI	GPT-5.4, GPT-5.4-mini, GPT-4.1	10+ examples	Per-token training + hosting
Google	Gemini 3 Pro/Flash	Varies	Per-token training
Mistral	Mistral models	100+ examples	Per-token training
Cohere	Command models	100+ examples	Per-token training
Meta	All Llama models	Any amount	Free (compute costs)

4. When to Use Each Provider

Anthropic (Claude)

Best Use Cases: - Complex reasoning and analysis tasks - Long document processing (contracts, research papers, codebases) - Safety-critical applications requiring predictable behavior - Coding assistants and code review - Tasks requiring nuanced understanding and careful responses - Applications where consistency and reliability are paramount

Considerations: - No native embeddings API (use Voyage AI or alternatives) - Limited fine-tuning access - Slightly higher latency for Opus

Example Scenarios:

- Legal document analysis requiring 100+ page context
- Medical information systems with safety requirements
- Enterprise coding assistants
- Complex customer support requiring nuanced responses

OpenAI (GPT-5)

Best Use Cases: - General-purpose applications requiring broad capabilities - Integration with existing OpenAI ecosystem (DALL-E, Whisper, TTS) - Applications benefiting from plugins and function calling - Real-time applications (GPT-5.4 mini/nano optimized for speed) - Mathematical reasoning (o1 models) - Rapid prototyping with extensive documentation and examples

Considerations: - o1 models have different API patterns (no streaming, no system prompts) - Rate limits can be restrictive at lower tiers - Pricing has increased over time for flagship models

Example Scenarios:

- Customer-facing chatbots with plugin integrations
- Content generation pipelines
- Real-time voice applications
- Complex math/science problems (o1)

Google (Gemini)

Best Use Cases: - Multimodal applications (image, video, audio understanding) - Extremely long context requirements (1M+ tokens) - Integration with Google Cloud services (BigQuery, Vertex AI) - Applications requiring native multimodal reasoning - Video understanding and analysis

Considerations: - API patterns differ from OpenAI standard - Some features only available via Vertex AI - Regional availability varies

Example Scenarios:

- Video content analysis and summarization
- Processing entire codebases or book-length documents
- Multimodal search and retrieval
- Google Workspace integrations

Meta (Llama)

Best Use Cases: - Self-hosting requirements (data sovereignty, air-gapped environments) - Cost control at high volume (no per-token fees) - Full customization and fine-tuning flexibility - On-premise deployment requirements - Research and experimentation - Building custom model variants

Considerations: - Requires ML infrastructure expertise - Compute costs can exceed API costs at low volumes - No managed service from Meta directly - Quality gap vs. frontier models on complex tasks

Example Scenarios:

- Healthcare applications with strict data requirements
- High-volume applications where API costs are prohibitive
- Custom fine-tuned models for specific domains
- Edge deployment on-device

Mistral

Best Use Cases: - European data residency requirements (EU-based company) - Efficiency-focused applications - Open-weight model preferences - Balance of capability and cost - GDPR compliance needs

Considerations: - Smaller context windows than competitors - Less extensive documentation than OpenAI/Anthropic - Fewer third-party integrations

Example Scenarios:

- EU-based applications with data residency requirements
- Cost-efficient production deployments
- Self-hosted with Mixtral for high capability

Cohere

Best Use Cases: - Enterprise search and RAG systems - Embedding-heavy applications - Multilingual applications (100+ languages) - Reranking in retrieval pipelines - Enterprise with existing Cohere relationship

Considerations: - Less powerful for general reasoning vs. Claude/GPT-5 - Strongest in retrieval-specific use cases - Good enterprise support and compliance

Example Scenarios:

- Enterprise knowledge base search
- Multilingual customer support
- Document retrieval and ranking systems
- Semantic search applications

Decision Flowchart

Start
  |
  v
Do you need European data residency?
  |-- Yes --> Mistral (or self-hosted Llama)
  |-- No
  v
Do you need to self-host?
  |-- Yes --> Llama (or Mistral open models)
  |-- No
  v
Is this a multimodal application (video/long docs)?
  |-- Yes --> Gemini 3.1 Pro
  |-- No
  v
Is complex reasoning the primary requirement?
  |-- Yes --> Claude Opus or OpenAI o1
  |-- No
  v
Is this enterprise RAG/search?
  |-- Yes --> Cohere Command R+
  |-- No
  v
Is this high-volume, cost-sensitive?
  |-- Yes --> Claude Haiku / GPT-5.4-mini / Gemini Flash
  |-- No
  v
Default recommendation:
  - Claude Sonnet 4.6 (balanced choice)
  - GPT-5.4 (if OpenAI ecosystem needed)

5. Cost Comparison Scenarios

The following scenarios illustrate typical costs across providers. Assumes average request of 1,000 input tokens and 500 output tokens.

Scenario 1: Low Volume (10K requests/month)

Typical use case: Internal tool, early-stage startup, development/testing

Provider	Model	Input Cost	Output Cost	Monthly Total
Anthropic	Claude Sonnet 4.6	$30	$75	$105
Anthropic	Claude Haiku 4.5	$10	$25	$35
OpenAI	GPT-5.4	$25	$75	$100
Meta	Llama 4 Scout (API)	$1.50	$2.50	$4.00
Google	Gemini 3 Pro	$20	$60	$80
Google	Gemini 3.5 Flash	$15	$45	$60
Mistral	Mistral Medium 3	$4	$10	$14
Cohere	Command R+	$25	$50	$75

Recommendation for Low Volume: Use flagship models (Sonnet, GPT-5.4) for quality. Cost difference is modest at this scale—even Sonnet costs only ~$105/month at 10K requests.

Scenario 2: Medium Volume (100K requests/month)

Typical use case: Production application, B2B SaaS, customer support

Provider	Model	Input Cost	Output Cost	Monthly Total
Anthropic	Claude Sonnet 4.6	$300	$750	$1,050
Anthropic	Claude Haiku 4.5	$100	$250	$350
OpenAI	GPT-5.4	$250	$750	$1,000
Meta	Llama 4 Scout (API)	$15	$25	$40
Google	Gemini 3 Pro	$200	$600	$800
Google	Gemini 3.5 Flash	$150	$450	$600
Mistral	Mistral Medium 3	$40	$100	$140
Cohere	Command R+	$250	$500	$750

Recommendation for Medium Volume: Consider model tiering - use flagship for complex queries, fast models for simple ones. Implement caching to reduce costs 20-50%.

Scenario 3: High Volume (1M+ requests/month)

Typical use case: Consumer application, high-traffic API, enterprise deployment

Provider	Model	Input Cost	Output Cost	Monthly Total	With Caching (est.)
Anthropic	Claude Sonnet 4.6	$3,000	$7,500	$10,500	~$7,000
Anthropic	Claude Haiku 4.5	$1,000	$2,500	$3,500	~$2,300
OpenAI	GPT-5.4	$2,500	$7,500	$10,000	~$6,700
Meta	Llama 4 Scout (API)	$150	$250	$400	~$300
Google	Gemini 3 Pro	$2,000	$6,000	$8,000	~$5,400
Google	Gemini 3.5 Flash	$1,500	$4,500	$6,000	~$4,000
Llama 4	Self-hosted*	-	-	$2,000-5,000	N/A

*Self-hosted costs include GPU rental (e.g., 2x A100 80GB at ~$2-3/hr) plus infrastructure. Break-even vs. API typically at 500K-2M requests/month depending on complexity.

Recommendation for High Volume: 1. Implement aggressive caching (prompt caching, response caching) 2. Use model routing (simple queries to fast models) 3. Consider self-hosting for predictable, high-volume workloads 4. Negotiate enterprise pricing (typically 20-40% discount) 5. Use batch APIs for non-real-time workloads (50% savings)

Cost Optimization Strategies

Strategy	Potential Savings	Implementation Effort
Prompt caching	30-90% on cached tokens	Low (some providers require explicit setup)
Model routing	40-70%	Medium
Batch processing	50%	Low
Response caching	20-50%	Medium
Self-hosting	Variable (can be negative)	High
Enterprise contracts	20-40%	Low
Prompt optimization	10-30%	Medium

6. Open Source vs API Trade-offs

Decision Framework

Use this framework to decide between self-hosting open models vs. using API providers.

When to Use APIs

Factor	API Advantage
Volume	< 500K requests/month
Team	No ML infrastructure expertise
Time to market	Need to ship quickly
Variability	Traffic is unpredictable/bursty
Capability	Need frontier model performance
Compliance	Provider has necessary certifications
Budget	Limited upfront capital

When to Self-Host

Factor	Self-Host Advantage
Volume	> 1M requests/month sustained
Data sovereignty	Strict requirements (healthcare, government)
Customization	Need fine-tuned or modified models
Latency	Need <100ms inference
Cost predictability	Need fixed monthly costs
Control	Need full control over model behavior
Air-gapped	No external network access allowed

Total Cost of Ownership Comparison

API Model (Example: Claude Sonnet 4.6 at 1M requests/month)

Cost Category	Monthly Cost
API usage	$10,500
Engineering time (integration)	~$500 (amortized)
Monitoring/observability	~$100
Total	~$11,100/month

Self-Hosted Model (Example: Llama 4 Scout at 1M requests/month)

Cost Category	Monthly Cost
GPU compute (2x A100 80GB)	$2,000-4,000
Infrastructure (networking, storage)	$200-500
Engineering time (ops)	$1,000-2,000
Monitoring/observability	$100-200
Total	$3,300-6,700/month

Break-Even Analysis

API cost per request = $0.0105 (Sonnet) or $0.00045 (GPT-5.4-mini)
Self-hosted cost per request = $0.003-0.007 (including ops overhead)

For flagship-equivalent quality:
  Break-even: ~300K-500K requests/month
  (Self-hosting is cheaper at 1M+ requests for most workloads)

For efficient model replacement (GPT-5.4-mini equivalent):
  Break-even: ~2-5M requests/month
  Exception: Air-gapped or data sovereignty requirements justify any volume

Hybrid Approaches

Many organizations use a hybrid approach:

Primary API + Fallback Self-Hosted
- Use API for normal traffic
- Self-hosted handles overflow or API outages
Tiered by Sensitivity
- Sensitive data: Self-hosted
- General queries: API
Development vs Production
- Development: Self-hosted (no per-query cost)
- Production: API (reliability)
Task-Based Routing
- Complex tasks: Frontier API models
- Simple tasks: Self-hosted efficient models

Self-Hosting Technology Stack

If you decide to self-host, here’s a recommended stack:

Component	Recommended Options
Inference Engine	vLLM, TGI, TensorRT-LLM
Model Format	Safetensors, GGUF (for llama.cpp)
Quantization	AWQ, GPTQ, or FP8 (H100)
Orchestration	Kubernetes + KServe
Load Balancing	Nginx, HAProxy, or cloud LB
Monitoring	Prometheus + Grafana
GPU Provider	AWS, GCP, Azure, Lambda Labs, RunPod

Self-Hosting Checklist

Before self-hosting, ensure you can answer “yes” to these questions:

Do we have ML infrastructure expertise on the team?
Do we have budget for dedicated GPU resources?
Can we handle model updates and security patches?
Do we have monitoring and alerting in place?
Can we match API provider reliability (99.9%+ uptime)?
Do we have a plan for scaling up/down?
Is the performance gap vs. frontier models acceptable?

Quick Reference Tables

Model Selection by Task

Task	Recommended Models
General chat/assistant	Claude Sonnet 4.6, GPT-5.4
Complex reasoning	Claude Opus 4.8, OpenAI o1
Code generation	Claude Sonnet 4.6, GPT-5.4
Long document analysis	Claude Sonnet 4.6, Gemini 3.1 Pro
Very long context (500K+)	Gemini 3.1 Pro
Fast/cheap classification	Claude Haiku 4.5, GPT-5.4-mini, Gemini Flash
Embeddings	OpenAI text-embedding-3, Cohere Embed v3
Multilingual	Cohere Command R+, Gemini
Self-hosted production	Llama 4 Maverick, Llama 4 Scout
Self-hosted efficient	Llama 3.2 8B, Mistral 7B
European data residency	Mistral (La Plateforme)
Enterprise RAG	Cohere Command R+

Pricing Quick Reference (per 1M tokens)

Model Tier	Input	Output	Provider
Frontier
Claude Opus 4.8	$5	$25	Anthropic
GPT-5.5	$5	$30	OpenAI
Gemini 3.1 Pro (≤200K)	$2	$12	Google
Balanced
Claude Sonnet 4.6	$3	$15	Anthropic
GPT-5.4	$2.50	$15	OpenAI
Gemini 3 Pro	$2	$12	Google
Mistral Large 3	$2	$6	Mistral
Command R+	$2.50	$10	Cohere
Efficient
Claude Haiku 4.5	$1	$5	Anthropic
Llama 4 Scout	$0.15	$0.50	API providers
Gemini 3.5 Flash	$1.50	$9	Google
Mistral Medium 3	$0.40	$2	Mistral

Changelog and Updates

This appendix will be updated as providers release new models and change pricing. Key changes to watch for:

Model releases: New model versions typically offer better price/performance
Pricing changes: Providers frequently adjust pricing (usually downward)
Feature additions: New capabilities (longer context, new modalities)
Fine-tuning availability: Access to fine-tuning expands over time

Verification Resources

Always verify current information at these official sources:

Provider	Pricing Page	Documentation
Anthropic	anthropic.com/pricing	docs.anthropic.com
OpenAI	openai.com/pricing	platform.openai.com/docs
Google	cloud.google.com/vertex-ai/pricing	ai.google.dev/docs
Meta	N/A (self-hosted)	llama.meta.com
Mistral	mistral.ai/pricing	docs.mistral.ai
Cohere	cohere.com/pricing	docs.cohere.com

Summary

Choosing an LLM provider involves balancing multiple factors:

Capability requirements: Match model capabilities to your task complexity
Cost at scale: Model pricing varies 100x between tiers
Integration needs: Consider existing infrastructure and ecosystem
Data requirements: Sovereignty, compliance, and privacy needs
Operational capacity: Self-hosting requires significant expertise

For most applications, starting with API providers and optimizing costs through caching, model routing, and batch processing is the pragmatic approach. Reserve self-hosting for specific requirements around data sovereignty, extreme cost optimization at scale, or specialized fine-tuned models.

The landscape evolves rapidly. What’s optimal today may change in 6 months. Build abstractions that allow switching providers, and regularly reassess your choices as new models and pricing emerge.

# Appendix K: LLM Provider Comparison {.unnumbered} This appendix provides a comprehensive comparison of major LLM providers to help you make informed decisions about which provider to use for different use cases. The AI landscape evolves rapidly, and pricing and capabilities change frequently. **Last Updated:** May 2026 **Important:** Pricing and specifications change frequently. Always verify current rates on provider websites before making decisions. This appendix provides a snapshot for comparison purposes. --- ## 1. Provider Overview The following table summarizes the major LLM providers, their flagship models, and key characteristics. | Provider | Flagship Models | Context Window | |----------|----------------|----------------| | **Anthropic** | Claude Opus 4.8, Sonnet 4.6, Haiku 4.5 | 1M (Opus/Sonnet), 200K (Haiku) | | **OpenAI** | GPT-5.5, GPT-5.4, GPT-5.4-mini | ~1.05M (GPT-5.5) | | **Google** | Gemini 3.1 Pro, 3.5 Flash, 3 Pro | 2M tokens | | **Meta** | Llama 4 Scout, Llama 4 Maverick | 10M (Scout), 1M (Maverick) | | **Mistral** | Mistral Large 3, Medium 3 | 128K tokens | | **Cohere** | Command R+, Command R, Embed v3 | 128K tokens | : Provider Models and Context Windows {tbl-colwidths="[20,40,40]"} | Provider | Pricing (Input / Output per 1M tokens) | Best For | |----------|----------------------------------------|----------| | **Anthropic** | Opus: $5/$25, Sonnet: $3/$15, Haiku: $1/$5 | Complex reasoning, safety-critical, coding | | **OpenAI** | GPT-5.5: $5/$30, GPT-5.4: $2.50/$15 | General purpose, ecosystem, vision | | **Google** | 3.1 Pro: $2/$12 (≤200K), $4/$18 (>200K), 3.5 Flash: $1.50/$9 | Multimodal, long context, GCP integration | | **Meta** | Scout: $0.15/$0.50, Maverick: $0.22/$0.85 | Self-hosting, cost control, customization | | **Mistral** | Large 3: $2/$6, Medium 3: $0.40/$2 | EU data residency, efficiency | | **Cohere** | R+: $2.50/$10, R: $0.50/$1.50 | Enterprise RAG, embeddings, multilingual | : Provider Pricing and Use Cases {tbl-colwidths="[20,40,40]"} ### Model Tiers Explained **Frontier/Flagship Models** (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Llama 4 Maverick): - Highest capability across reasoning, coding, and complex tasks - Best for: Critical applications, complex analysis, creative work - Trade-off: Higher cost, slower response times **Balanced Models** (Claude Sonnet 4.6, GPT-5.4, Gemini 3 Pro, Llama 4 Scout): - Strong performance at moderate cost - Best for: Production workloads, most business applications - Trade-off: Good balance of quality and cost **Fast/Efficient Models** (Claude Haiku 4.5, GPT-5.4-mini, Gemini 3.5 Flash, Mistral Medium 3): - Optimized for speed and cost - Best for: High-volume tasks, simple queries, classification - Trade-off: Reduced capability on complex tasks --- ## 2. Capability Comparison Matrix This matrix compares providers across key capabilities. Ratings are relative comparisons (Excellent/Good/Adequate/Limited) based on benchmarks and practical experience as of early 2026. | Capability | Anthropic (Claude 4.x) | OpenAI (GPT-5.x) | Google (Gemini 3.x) | |------------|------------------------|----------------|-------------------| | **Code Generation** | Excellent | Excellent | Excellent | | **Reasoning** | Excellent | Excellent | Good | | **Long Context** | Excellent (200K-1M) | Excellent (~1M) | Excellent (2M) | | **Tool Use** | Excellent | Excellent | Excellent | | **Vision** | Excellent | Excellent | Excellent | | **Structured Output** | Excellent | Excellent | Excellent | | **Streaming** | Excellent | Excellent | Good | | **Instruction Following** | Excellent | Excellent | Good | | **Multilingual** | Good | Good | Excellent | | **Math/Science** | Excellent | Excellent | Good | | **Safety/Alignment** | Excellent | Good | Good | | **Consistency** | Excellent | Good | Good | : Capability Comparison - Frontier Providers {tbl-colwidths="[25,25,25,25]"} | Capability | Meta (Llama 4) | Mistral | Cohere | |------------|----------------|---------|--------| | **Code Generation** | Good | Good | Adequate | | **Reasoning** | Good | Good | Good | | **Long Context** | Excellent (10M) | Good (128K) | Good (128K) | | **Tool Use** | Good | Good | Good | | **Vision** | Excellent | Good | Limited | | **Structured Output** | Good | Good | Good | | **Streaming** | Good | Good | Good | | **Instruction Following** | Good | Good | Good | | **Multilingual** | Good | Good | Excellent | | **Math/Science** | Good | Good | Adequate | | **Safety/Alignment** | Good | Adequate | Good | | **Consistency** | Good | Good | Good | : Capability Comparison - Open-Weight and Specialized Providers {tbl-colwidths="[25,25,25,25]"} ### Capability Details **Code Generation** - *Claude 4.x*: Exceptional at understanding codebases, refactoring, and generating production-quality code - *GPT-5.x*: Strong code generation with excellent debugging capabilities; Codex variants optimized for agentic coding - *Gemini 3.x*: Significantly improved, competitive for most tasks - *Llama 4*: Competitive for many use cases, especially with 10M context for large codebases **Reasoning** - *Claude Opus 4.8*: Best-in-class for complex multi-step reasoning - *GPT-5.5 Thinking*: Extended reasoning mode with chain-of-thought - *Others*: Competent but less consistent on edge cases **Long Context** - *Llama 4 Scout*: Industry-leading 10M token context window - *Gemini 3.1 Pro*: 2M token windows with excellent recall - *GPT-5.5*: ~1.05M tokens with strong retrieval - *Claude 4.x*: 1M context (Opus/Sonnet), 200K (Haiku) **Tool Use / Function Calling** - *Claude & GPT-5.x*: Native support, reliable parameter extraction - *Gemini 3.x*: Now excellent with native tool use - *Others*: Support available but may require more prompt engineering **Vision / Multimodal** - *Gemini 3.x*: Native multimodal architecture, best for video understanding - *GPT-5.x*: Excellent vision with 86% accuracy on GUI understanding (ScreenSpot-Pro) - *Claude 4.x*: Excellent image understanding - *Llama 4*: Native multimodal with natively fused vision capabilities --- ## 3. API Feature Comparison | Feature | Anthropic | OpenAI | Google | |---------|-----------|--------|--------| | **Rate Limits (Tier 1)** | 60 RPM, 60K TPM | 60 RPM, 60K TPM | 60 RPM | | **Rate Limits (Enterprise)** | Custom | Custom | Custom | | **Batch API** | Yes | Yes | Yes | | **Function Calling** | Native (tools) | Native (functions) | Native | | **Prompt Caching** | Yes (explicit cache_control) | Yes | Yes | | **Fine-tuning** | Limited access | Yes (GPT-5.x) | Yes (Gemini) | | **Embeddings API** | No (use Voyage) | Yes | Yes | | **Streaming** | Yes | Yes | Yes | | **JSON Mode** | Yes | Yes | Yes | | **System Prompts** | Yes | Yes | Yes | | **Message History** | Stateless | Stateless | Stateful option | | **Content Filtering** | Built-in | Built-in | Built-in | | **Usage Analytics** | Dashboard | Dashboard | Cloud Console | : API Features - Frontier Providers {tbl-colwidths="[30,23,24,23]"} | Feature | Meta (via providers) | Mistral | Cohere | |---------|---------------------|---------|--------| | **Rate Limits (Tier 1)** | Varies by provider | 60 RPM | 100 RPM | | **Rate Limits (Enterprise)** | N/A | Custom | Custom | | **Batch API** | N/A | No | Yes | | **Function Calling** | Via frameworks | Native | Native | | **Prompt Caching** | N/A (local) | No | Yes | | **Fine-tuning** | Full (self-host) | Yes | Yes | | **Embeddings API** | Yes (self-host) | Yes | Yes (excellent) | | **Streaming** | Yes | Yes | Yes | | **JSON Mode** | Via prompting | Yes | Yes | | **System Prompts** | Yes | Yes | Yes | | **Message History** | Stateless | Stateless | Stateless | | **Content Filtering** | User-managed | Built-in | Built-in | | **Usage Analytics** | Self-managed | Dashboard | Dashboard | : API Features - Open-Weight and Specialized Providers {tbl-colwidths="[30,23,24,23]"} ### API Feature Details **Batch API** Batch APIs allow you to submit many requests at once for asynchronous processing, typically at 50% discount: - *Anthropic*: Message Batches API for high-volume processing - *OpenAI*: Batch API with 24-hour processing window - *Google*: Batch prediction available via Vertex AI - *Cohere*: Bulk endpoints available **Prompt Caching** Caching reduces costs for repeated context (system prompts, documents): - *Anthropic*: Prompt caching via explicit `cache_control` markers, up to 90% cost reduction on cached tokens - *OpenAI*: Automatic caching for identical prefixes - *Google*: Context caching available **Fine-tuning Availability** | Provider | Models Available | Data Requirements | Cost Model | |----------|-----------------|-------------------|------------| | OpenAI | GPT-5.4, GPT-5.4-mini, GPT-4.1 | 10+ examples | Per-token training + hosting | | Google | Gemini 3 Pro/Flash | Varies | Per-token training | | Mistral | Mistral models | 100+ examples | Per-token training | | Cohere | Command models | 100+ examples | Per-token training | | Meta | All Llama models | Any amount | Free (compute costs) | --- ## 4. When to Use Each Provider ### Anthropic (Claude) **Best Use Cases:** - Complex reasoning and analysis tasks - Long document processing (contracts, research papers, codebases) - Safety-critical applications requiring predictable behavior - Coding assistants and code review - Tasks requiring nuanced understanding and careful responses - Applications where consistency and reliability are paramount **Considerations:** - No native embeddings API (use Voyage AI or alternatives) - Limited fine-tuning access - Slightly higher latency for Opus **Example Scenarios:** ``` - Legal document analysis requiring 100+ page context - Medical information systems with safety requirements - Enterprise coding assistants - Complex customer support requiring nuanced responses ``` ### OpenAI (GPT-5) **Best Use Cases:** - General-purpose applications requiring broad capabilities - Integration with existing OpenAI ecosystem (DALL-E, Whisper, TTS) - Applications benefiting from plugins and function calling - Real-time applications (GPT-5.4 mini/nano optimized for speed) - Mathematical reasoning (o1 models) - Rapid prototyping with extensive documentation and examples **Considerations:** - o1 models have different API patterns (no streaming, no system prompts) - Rate limits can be restrictive at lower tiers - Pricing has increased over time for flagship models **Example Scenarios:** ``` - Customer-facing chatbots with plugin integrations - Content generation pipelines - Real-time voice applications - Complex math/science problems (o1) ``` ### Google (Gemini) **Best Use Cases:** - Multimodal applications (image, video, audio understanding) - Extremely long context requirements (1M+ tokens) - Integration with Google Cloud services (BigQuery, Vertex AI) - Applications requiring native multimodal reasoning - Video understanding and analysis **Considerations:** - API patterns differ from OpenAI standard - Some features only available via Vertex AI - Regional availability varies **Example Scenarios:** ``` - Video content analysis and summarization - Processing entire codebases or book-length documents - Multimodal search and retrieval - Google Workspace integrations ``` ### Meta (Llama) **Best Use Cases:** - Self-hosting requirements (data sovereignty, air-gapped environments) - Cost control at high volume (no per-token fees) - Full customization and fine-tuning flexibility - On-premise deployment requirements - Research and experimentation - Building custom model variants **Considerations:** - Requires ML infrastructure expertise - Compute costs can exceed API costs at low volumes - No managed service from Meta directly - Quality gap vs. frontier models on complex tasks **Example Scenarios:** ``` - Healthcare applications with strict data requirements - High-volume applications where API costs are prohibitive - Custom fine-tuned models for specific domains - Edge deployment on-device ``` ### Mistral **Best Use Cases:** - European data residency requirements (EU-based company) - Efficiency-focused applications - Open-weight model preferences - Balance of capability and cost - GDPR compliance needs **Considerations:** - Smaller context windows than competitors - Less extensive documentation than OpenAI/Anthropic - Fewer third-party integrations **Example Scenarios:** ``` - EU-based applications with data residency requirements - Cost-efficient production deployments - Self-hosted with Mixtral for high capability ``` ### Cohere **Best Use Cases:** - Enterprise search and RAG systems - Embedding-heavy applications - Multilingual applications (100+ languages) - Reranking in retrieval pipelines - Enterprise with existing Cohere relationship **Considerations:** - Less powerful for general reasoning vs. Claude/GPT-5 - Strongest in retrieval-specific use cases - Good enterprise support and compliance **Example Scenarios:** ``` - Enterprise knowledge base search - Multilingual customer support - Document retrieval and ranking systems - Semantic search applications ``` ### Decision Flowchart ``` Start | v Do you need European data residency? |-- Yes --> Mistral (or self-hosted Llama) |-- No v Do you need to self-host? |-- Yes --> Llama (or Mistral open models) |-- No v Is this a multimodal application (video/long docs)? |-- Yes --> Gemini 3.1 Pro |-- No v Is complex reasoning the primary requirement? |-- Yes --> Claude Opus or OpenAI o1 |-- No v Is this enterprise RAG/search? |-- Yes --> Cohere Command R+ |-- No v Is this high-volume, cost-sensitive? |-- Yes --> Claude Haiku / GPT-5.4-mini / Gemini Flash |-- No v Default recommendation: - Claude Sonnet 4.6 (balanced choice) - GPT-5.4 (if OpenAI ecosystem needed) ``` --- ## 5. Cost Comparison Scenarios The following scenarios illustrate typical costs across providers. Assumes average request of 1,000 input tokens and 500 output tokens. ### Scenario 1: Low Volume (10K requests/month) Typical use case: Internal tool, early-stage startup, development/testing | Provider | Model | Input Cost | Output Cost | Monthly Total | |----------|-------|------------|-------------|---------------| | Anthropic | Claude Sonnet 4.6 | $30 | $75 | **$105** | | Anthropic | Claude Haiku 4.5 | $10 | $25 | **$35** | | OpenAI | GPT-5.4 | $25 | $75 | **$100** | | Meta | Llama 4 Scout (API) | $1.50 | $2.50 | **$4.00** | | Google | Gemini 3 Pro | $20 | $60 | **$80** | | Google | Gemini 3.5 Flash | $15 | $45 | **$60** | | Mistral | Mistral Medium 3 | $4 | $10 | **$14** | | Cohere | Command R+ | $25 | $50 | **$75** | **Recommendation for Low Volume:** Use flagship models (Sonnet, GPT-5.4) for quality. Cost difference is modest at this scale—even Sonnet costs only ~$105/month at 10K requests. ### Scenario 2: Medium Volume (100K requests/month) Typical use case: Production application, B2B SaaS, customer support | Provider | Model | Input Cost | Output Cost | Monthly Total | |----------|-------|------------|-------------|---------------| | Anthropic | Claude Sonnet 4.6 | $300 | $750 | **$1,050** | | Anthropic | Claude Haiku 4.5 | $100 | $250 | **$350** | | OpenAI | GPT-5.4 | $250 | $750 | **$1,000** | | Meta | Llama 4 Scout (API) | $15 | $25 | **$40** | | Google | Gemini 3 Pro | $200 | $600 | **$800** | | Google | Gemini 3.5 Flash | $150 | $450 | **$600** | | Mistral | Mistral Medium 3 | $40 | $100 | **$140** | | Cohere | Command R+ | $250 | $500 | **$750** | **Recommendation for Medium Volume:** Consider model tiering - use flagship for complex queries, fast models for simple ones. Implement caching to reduce costs 20-50%. ### Scenario 3: High Volume (1M+ requests/month) Typical use case: Consumer application, high-traffic API, enterprise deployment | Provider | Model | Input Cost | Output Cost | Monthly Total | With Caching (est.) | |----------|-------|------------|-------------|---------------|---------------------| | Anthropic | Claude Sonnet 4.6 | $3,000 | $7,500 | **$10,500** | ~$7,000 | | Anthropic | Claude Haiku 4.5 | $1,000 | $2,500 | **$3,500** | ~$2,300 | | OpenAI | GPT-5.4 | $2,500 | $7,500 | **$10,000** | ~$6,700 | | Meta | Llama 4 Scout (API) | $150 | $250 | **$400** | ~$300 | | Google | Gemini 3 Pro | $2,000 | $6,000 | **$8,000** | ~$5,400 | | Google | Gemini 3.5 Flash | $1,500 | $4,500 | **$6,000** | ~$4,000 | | Llama 4 | Self-hosted* | - | - | **$2,000-5,000** | N/A | *Self-hosted costs include GPU rental (e.g., 2x A100 80GB at ~$2-3/hr) plus infrastructure. Break-even vs. API typically at 500K-2M requests/month depending on complexity. **Recommendation for High Volume:** 1. Implement aggressive caching (prompt caching, response caching) 2. Use model routing (simple queries to fast models) 3. Consider self-hosting for predictable, high-volume workloads 4. Negotiate enterprise pricing (typically 20-40% discount) 5. Use batch APIs for non-real-time workloads (50% savings) ### Cost Optimization Strategies | Strategy | Potential Savings | Implementation Effort | |----------|------------------|----------------------| | Prompt caching | 30-90% on cached tokens | Low (some providers require explicit setup) | | Model routing | 40-70% | Medium | | Batch processing | 50% | Low | | Response caching | 20-50% | Medium | | Self-hosting | Variable (can be negative) | High | | Enterprise contracts | 20-40% | Low | | Prompt optimization | 10-30% | Medium | --- ## 6. Open Source vs API Trade-offs ### Decision Framework Use this framework to decide between self-hosting open models vs. using API providers. #### When to Use APIs | Factor | API Advantage | |--------|--------------| | **Volume** | < 500K requests/month | | **Team** | No ML infrastructure expertise | | **Time to market** | Need to ship quickly | | **Variability** | Traffic is unpredictable/bursty | | **Capability** | Need frontier model performance | | **Compliance** | Provider has necessary certifications | | **Budget** | Limited upfront capital | #### When to Self-Host | Factor | Self-Host Advantage | |--------|---------------------| | **Volume** | > 1M requests/month sustained | | **Data sovereignty** | Strict requirements (healthcare, government) | | **Customization** | Need fine-tuned or modified models | | **Latency** | Need <100ms inference | | **Cost predictability** | Need fixed monthly costs | | **Control** | Need full control over model behavior | | **Air-gapped** | No external network access allowed | ### Total Cost of Ownership Comparison #### API Model (Example: Claude Sonnet 4.6 at 1M requests/month) | Cost Category | Monthly Cost | |---------------|-------------| | API usage | $10,500 | | Engineering time (integration) | ~$500 (amortized) | | Monitoring/observability | ~$100 | | **Total** | **~$11,100/month** | #### Self-Hosted Model (Example: Llama 4 Scout at 1M requests/month) | Cost Category | Monthly Cost | |---------------|-------------| | GPU compute (2x A100 80GB) | $2,000-4,000 | | Infrastructure (networking, storage) | $200-500 | | Engineering time (ops) | $1,000-2,000 | | Monitoring/observability | $100-200 | | **Total** | **$3,300-6,700/month** | #### Break-Even Analysis ``` API cost per request = $0.0105 (Sonnet) or $0.00045 (GPT-5.4-mini) Self-hosted cost per request = $0.003-0.007 (including ops overhead) For flagship-equivalent quality: Break-even: ~300K-500K requests/month (Self-hosting is cheaper at 1M+ requests for most workloads) For efficient model replacement (GPT-5.4-mini equivalent): Break-even: ~2-5M requests/month Exception: Air-gapped or data sovereignty requirements justify any volume ``` ### Hybrid Approaches Many organizations use a hybrid approach: 1. **Primary API + Fallback Self-Hosted** - Use API for normal traffic - Self-hosted handles overflow or API outages 2. **Tiered by Sensitivity** - Sensitive data: Self-hosted - General queries: API 3. **Development vs Production** - Development: Self-hosted (no per-query cost) - Production: API (reliability) 4. **Task-Based Routing** - Complex tasks: Frontier API models - Simple tasks: Self-hosted efficient models ### Self-Hosting Technology Stack If you decide to self-host, here's a recommended stack: | Component | Recommended Options | |-----------|---------------------| | Inference Engine | vLLM, TGI, TensorRT-LLM | | Model Format | Safetensors, GGUF (for llama.cpp) | | Quantization | AWQ, GPTQ, or FP8 (H100) | | Orchestration | Kubernetes + KServe | | Load Balancing | Nginx, HAProxy, or cloud LB | | Monitoring | Prometheus + Grafana | | GPU Provider | AWS, GCP, Azure, Lambda Labs, RunPod | ### Self-Hosting Checklist Before self-hosting, ensure you can answer "yes" to these questions: - [ ] Do we have ML infrastructure expertise on the team? - [ ] Do we have budget for dedicated GPU resources? - [ ] Can we handle model updates and security patches? - [ ] Do we have monitoring and alerting in place? - [ ] Can we match API provider reliability (99.9%+ uptime)? - [ ] Do we have a plan for scaling up/down? - [ ] Is the performance gap vs. frontier models acceptable? --- ## Quick Reference Tables ### Model Selection by Task | Task | Recommended Models | |------|-------------------| | General chat/assistant | Claude Sonnet 4.6, GPT-5.4 | | Complex reasoning | Claude Opus 4.8, OpenAI o1 | | Code generation | Claude Sonnet 4.6, GPT-5.4 | | Long document analysis | Claude Sonnet 4.6, Gemini 3.1 Pro | | Very long context (500K+) | Gemini 3.1 Pro | | Fast/cheap classification | Claude Haiku 4.5, GPT-5.4-mini, Gemini Flash | | Embeddings | OpenAI text-embedding-3, Cohere Embed v3 | | Multilingual | Cohere Command R+, Gemini | | Self-hosted production | Llama 4 Maverick, Llama 4 Scout | | Self-hosted efficient | Llama 3.2 8B, Mistral 7B | | European data residency | Mistral (La Plateforme) | | Enterprise RAG | Cohere Command R+ | ### Pricing Quick Reference (per 1M tokens) | Model Tier | Input | Output | Provider | |------------|-------|--------|----------| | **Frontier** | | | | | Claude Opus 4.8 | $5 | $25 | Anthropic | | GPT-5.5 | $5 | $30 | OpenAI | | Gemini 3.1 Pro (≤200K) | $2 | $12 | Google | | **Balanced** | | | | | Claude Sonnet 4.6 | $3 | $15 | Anthropic | | GPT-5.4 | $2.50 | $15 | OpenAI | | Gemini 3 Pro | $2 | $12 | Google | | Mistral Large 3 | $2 | $6 | Mistral | | Command R+ | $2.50 | $10 | Cohere | | **Efficient** | | | | | Claude Haiku 4.5 | $1 | $5 | Anthropic | | Llama 4 Scout | $0.15 | $0.50 | API providers | | Gemini 3.5 Flash | $1.50 | $9 | Google | | Mistral Medium 3 | $0.40 | $2 | Mistral | --- ## Changelog and Updates This appendix will be updated as providers release new models and change pricing. Key changes to watch for: - **Model releases**: New model versions typically offer better price/performance - **Pricing changes**: Providers frequently adjust pricing (usually downward) - **Feature additions**: New capabilities (longer context, new modalities) - **Fine-tuning availability**: Access to fine-tuning expands over time ### Verification Resources Always verify current information at these official sources: | Provider | Pricing Page | Documentation | |----------|-------------|---------------| | Anthropic | anthropic.com/pricing | docs.anthropic.com | | OpenAI | openai.com/pricing | platform.openai.com/docs | | Google | cloud.google.com/vertex-ai/pricing | ai.google.dev/docs | | Meta | N/A (self-hosted) | llama.meta.com | | Mistral | mistral.ai/pricing | docs.mistral.ai | | Cohere | cohere.com/pricing | docs.cohere.com | --- ## Summary Choosing an LLM provider involves balancing multiple factors: 1. **Capability requirements**: Match model capabilities to your task complexity 2. **Cost at scale**: Model pricing varies 100x between tiers 3. **Integration needs**: Consider existing infrastructure and ecosystem 4. **Data requirements**: Sovereignty, compliance, and privacy needs 5. **Operational capacity**: Self-hosting requires significant expertise For most applications, starting with API providers and optimizing costs through caching, model routing, and batch processing is the pragmatic approach. Reserve self-hosting for specific requirements around data sovereignty, extreme cost optimization at scale, or specialized fine-tuned models. The landscape evolves rapidly. What's optimal today may change in 6 months. Build abstractions that allow switching providers, and regularly reassess your choices as new models and pricing emerge. ### Related Chapters - **Chapter 7 (RAG Systems)**: How to build retrieval-augmented generation systems that work with any provider - **Chapter 9 (LLM Deployment & Infrastructure)**: Self-hosting options, vLLM, and deployment patterns - **Chapter 12 (Cloud AI Providers)**: Deep dives into AWS Bedrock, Azure OpenAI, and GCP Vertex AI - **Chapter 32 (Cost Engineering)**: Comprehensive cost optimization strategies at scale