System Diagram
MULTI-REGION INFERENCE ARCHITECTURE
═══════════════════════════════════════════════════════════════════════════════
                                 USERS
                                   │
                                   ▼
                       ┌───────────────────────┐
                       │      Global Load      │
                       │       Balancer        │
                       │     (Cloudflare)      │
                       │                       │
                       │  • Geo-DNS routing    │
                       │  • Health checks      │
                       │  • DDoS protection    │
                       └───────────┬───────────┘
                                   │
           ┌───────────────────────┼───────────────────────┐
           │                       │                       │
           ▼                       ▼                       ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│       US-WEST       │ │       US-EAST       │ │       EU-WEST       │
│     (us-west-2)     │ │     (us-east-1)     │ │     (eu-west-1)     │
└──────────┬──────────┘ └──────────┬──────────┘ └──────────┬──────────┘
           │                       │                       │
           ▼                       ▼                       ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│     API Gateway     │ │     API Gateway     │ │     API Gateway     │
│       (Kong)        │ │       (Kong)        │ │       (Kong)        │
│                     │ │                     │ │                     │
│  • Rate limiting    │ │  • Rate limiting    │ │  • Rate limiting    │
│  • Auth             │ │  • Auth             │ │  • Auth             │
│  • Request routing  │ │  • Request routing  │ │  • Request routing  │
└──────────┬──────────┘ └──────────┬──────────┘ └──────────┬──────────┘
           │                       │                       │
           ▼                       ▼                       ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│      vLLM Pool      │ │      vLLM Pool      │ │      vLLM Pool      │
│      (4x A100)      │ │      (4x A100)      │ │      (4x A100)      │
│                     │ │                     │ │                     │
│  Models:            │ │  Models:            │ │  Models:            │
│  • LLaMA 3.1 70B    │ │  • LLaMA 3.1 70B    │ │  • LLaMA 3.1 70B    │
│  • Mistral 7B       │ │  • Mistral 7B       │ │  • Mistral 7B       │
└──────────┬──────────┘ └──────────┬──────────┘ └──────────┬──────────┘
           │                       │                       │
           ▼                       ▼                       ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│      Vector DB      │ │      Vector DB      │ │      Vector DB      │
│      (Milvus)       │ │      (Milvus)       │ │      (Milvus)       │
│                     │ │                     │ │                     │
│    Read Replica     │ │       PRIMARY       │ │    Read Replica     │
└──────────┬──────────┘ └──────────┬──────────┘ └──────────┬──────────┘
           │                       │                       │
           └───────────────────────┼───────────────────────┘
                                   │
                          ┌────────┴────────┐
                          │   Replication   │
                          │    < 30s lag    │
                          └─────────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │    FALLBACK: Claude API     │
                    │   (when self-hosted full)   │
                    └─────────────────────────────┘
Mermaid Diagram (for rendered viewing)
flowchart TB
Users["👥 Users"] --> GLB["Global Load Balancer<br/>Cloudflare"]
GLB --> USW["US-WEST"]
GLB --> USE["US-EAST"]
GLB --> EU["EU-WEST"]
subgraph USW["US-WEST (us-west-2)"]
APIGW1["API Gateway"] --> VLLM1["vLLM Pool<br/>4x A100"]
VLLM1 --> VDB1[("Vector DB<br/>Read Replica")]
end
subgraph USE["US-EAST (us-east-1)"]
APIGW2["API Gateway"] --> VLLM2["vLLM Pool<br/>4x A100"]
VLLM2 --> VDB2[("Vector DB<br/>PRIMARY")]
end
subgraph EU["EU-WEST (eu-west-1)"]
APIGW3["API Gateway"] --> VLLM3["vLLM Pool<br/>4x A100"]
VLLM3 --> VDB3[("Vector DB<br/>Read Replica")]
end
VDB2 -.->|"Replication<br/>less than 30s lag"| VDB1
VDB2 -.->|"Replication<br/>less than 30s lag"| VDB3
USW & USE & EU -.->|"Fallback when<br/>self-hosted full"| Claude["Claude API<br/>Fallback"]
style USW fill:#e3f2fd
style USE fill:#e8f5e9
style EU fill:#fff3e0
Routing Logic
| Priority | Condition | Action |
|----------|-----------|--------|
| 1 | Default | Geo-DNS routes to nearest region |
| 2 | Region unhealthy | Failover to next nearest region |
| 3 | Region >80% utilized | Overflow to another region |
| 4 | EU user | Force EU-WEST only (GDPR) |
| 5 | All self-hosted exhausted | Fallback to Claude API |
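
The routing rules above can be sketched as a single decision function. This is a minimal illustration, not the gateway's actual logic: `RegionState`, `choose_target`, and the `"claude-api"` label are hypothetical names, and the reading of rule 5 (a region counts as exhausted when it is unhealthy or above the 80% threshold) is an assumption. The GDPR rule is applied first so that failover and overflow can never move an EU user out of EU-WEST.

```python
from dataclasses import dataclass
from typing import Dict, List

FALLBACK = "claude-api"      # hypothetical label for the Claude API fallback
EU_REGION = "eu-west-1"
OVERFLOW_THRESHOLD = 0.80    # rule 3: region >80% utilized


@dataclass
class RegionState:
    healthy: bool        # result of the load balancer health check
    utilization: float   # fraction of capacity in use, 0.0 to 1.0


def choose_target(nearest_first: List[str],
                  states: Dict[str, RegionState],
                  is_eu_user: bool) -> str:
    """Pick a region for one request, or FALLBACK if none qualifies."""
    # Rule 4: EU users are restricted to EU-WEST before any other rule runs.
    candidates = [EU_REGION] if is_eu_user else nearest_first
    for region in candidates:
        state = states[region]
        # Rule 2: skip unhealthy regions. Rule 3: skip regions over threshold.
        if state.healthy and state.utilization <= OVERFLOW_THRESHOLD:
            return region  # rule 1 when this is the nearest region
    # Rule 5: no self-hosted region can take the request.
    return FALLBACK
```

With `nearest_first` ordered for a US-WEST user, a healthy but 92%-utilized us-west-2 is skipped in favor of us-east-1, while an EU user whose eu-west-1 is unhealthy goes straight to the fallback.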
Key Metrics Per Region
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| Request latency | P50/P95/P99 | P95 > 3s |
| GPU utilization | % of compute used | > 85% sustained |
| Queue depth | Pending requests | > 100 |
| Error rate | Failed requests | > 1% |
| Replication lag | Vector DB sync delay | > 30s |
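
The thresholds above can be checked per region with a few comparisons. The sketch below is illustrative only: `RegionMetrics` and `breached_alerts` are hypothetical names, and because the source does not define the window for "sustained", the caller is assumed to pass an already-averaged GPU utilization value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionMetrics:
    p95_latency_s: float      # P95 request latency, in seconds
    gpu_utilization: float    # sustained average, fraction 0.0 to 1.0 (assumed)
    queue_depth: int          # pending requests
    error_rate: float         # failed / total requests, fraction 0.0 to 1.0
    replication_lag_s: float  # vector DB sync delay, in seconds


def breached_alerts(m: RegionMetrics) -> List[str]:
    """Return the names of metrics that exceed their alert threshold."""
    checks = [
        ("request_latency", m.p95_latency_s > 3.0),
        ("gpu_utilization", m.gpu_utilization > 0.85),
        ("queue_depth", m.queue_depth > 100),
        ("error_rate", m.error_rate > 0.01),
        ("replication_lag", m.replication_lag_s > 30.0),
    ]
    return [name for name, breached in checks if breached]
```

All comparisons are strict, matching the `>` in the table, so a queue depth of exactly 100 does not alert.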
Regional Failover Sequence
US-WEST user request:
1. Try US-WEST (primary)
2. If unavailable → US-EAST (secondary)
3. If unavailable → EU-WEST (tertiary, higher latency)
4. If all unavailable → Claude API fallback
EU user request (GDPR):
1. Try EU-WEST only
2. If unavailable → Claude API (EU endpoint)
3. Never route to US regions
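
The two sequences above can be expressed as ordered target lists that a client walks until one succeeds. This is a sketch under stated assumptions: `FAILOVER_CHAINS`, `RegionUnavailable`, `send_with_failover`, and the `"claude-api"` / `"claude-api-eu"` labels are hypothetical, and `send` stands in for whatever transport actually issues the request. Only the two origins described above are listed.

```python
from typing import Any, Callable, Dict, List, Tuple

# Ordered targets per request origin, copied from the sequences above.
FAILOVER_CHAINS: Dict[str, List[str]] = {
    "us-west": ["us-west-2", "us-east-1", "eu-west-1", "claude-api"],
    # GDPR: the EU chain contains no US region, so one can never be tried.
    "eu": ["eu-west-1", "claude-api-eu"],
}


class RegionUnavailable(Exception):
    """Raised by `send` when a target cannot serve the request."""


def send_with_failover(origin: str,
                       request: Any,
                       send: Callable[[str, Any], Any]) -> Tuple[str, Any]:
    """Try each target in order; return (target, response) from the first success."""
    last_error = None
    for target in FAILOVER_CHAINS[origin]:
        try:
            return target, send(target, request)
        except RegionUnavailable as exc:
            last_error = exc  # remember the failure and move to the next target
    raise RuntimeError(f"all targets failed for origin {origin!r}") from last_error
```

Encoding the GDPR constraint in the data rather than in a conditional keeps the "never route to US regions" rule easy to audit: it holds as long as the `"eu"` list contains no US entry.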