Chapter 13: Multi-Cloud & Cloud-Native Patterns
multi-cloud, Kubernetes, serverless, hybrid deployment, cloud abstraction, failover, disaster recovery
Introduction
Chapter 12 covered the fundamentals of individual cloud AI providers. This chapter goes deeper into advanced deployment patterns: multi-cloud architectures, serverless and container-based deployments, security and compliance, and disaster recovery.
When to Read This Chapter:
- You’re deploying across multiple cloud providers
- You need serverless or container-based LLM deployments
- You have strict security or compliance requirements
- You’re designing disaster recovery for AI systems
What You’ll Learn:
1. Multi-Cloud Strategies - When and how to use multiple providers
2. Cloud-Native Patterns - Serverless, containers, and auto-scaling
3. Security & Compliance - VPCs, private endpoints, IAM, and certifications
Multi-Cloud Strategies
When to Use Multiple Providers
Reasons for Multi-Cloud:
- Best-of-breed models: Claude on AWS, GPT-5 on Azure, Gemini on GCP
- Disaster recovery: Fallback if one provider has outage
- Cost arbitrage: Use cheapest provider per task
- Regional presence: Some providers better in certain regions
- Compliance: Different data residency requirements
Anti-patterns (When NOT to multi-cloud):
- Premature optimization: Start with one, add others when needed
- Complexity overhead: Multi-cloud adds operational burden
- Vendor negotiation leverage: Often a myth for smaller customers
What people do: Build abstraction layers across AWS, Azure, and GCP from day one “in case we need to switch providers later.” The team then spends months building and maintaining compatibility across three platforms.
Why it fails: Multi-cloud adds operational complexity that compounds over time: three IAM systems, three monitoring stacks, three incident response procedures, three billing dashboards. The “flexibility” costs more than it saves. Most teams never actually switch providers—when they do, the abstraction layer is usually incomplete anyway.
Fix: Start with one cloud, the one your team knows best. Build a clean internal API that could support multiple providers, but only implement the one you’re using. Add a second provider only when you have a concrete, immediate need (required model, compliance mandate, disaster recovery requirement). Never maintain three production implementations “just in case.”
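The "clean internal API, one implementation" advice can be sketched as follows. This is a minimal illustration, not a prescribed design: `ChatProvider`, `EchoProvider`, and `LLMGateway` are hypothetical names, and `EchoProvider` stands in for whichever single real provider a team implements first.

```python
from typing import Dict, List, Protocol


class ChatProvider(Protocol):
    """Hypothetical internal interface; one method is enough to start."""

    def chat(self, messages: List[Dict[str, str]]) -> str: ...


class EchoProvider:
    """Stand-in for the one real provider implemented on day one."""

    def chat(self, messages: List[Dict[str, str]]) -> str:
        return f"echo: {messages[-1]['content']}"


class LLMGateway:
    """Application code depends on this class, never on a vendor SDK."""

    def __init__(self, providers: Dict[str, ChatProvider], default: str):
        if default not in providers:
            raise ValueError(f"unknown default provider: {default}")
        self._providers = providers
        self._default = default

    def chat(self, messages: List[Dict[str, str]]) -> str:
        return self._providers[self._default].chat(messages)


# Only one provider is registered; a second is added when a concrete need appears.
gateway = LLMGateway({"primary": EchoProvider()}, default="primary")
reply = gateway.chat([{"role": "user", "content": "hello"}])
```

Adding a second provider later means writing one new class and one new dictionary entry; nothing in the calling code changes.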
Abstraction Layers and Portability
LiteLLM for Multi-Cloud:
# requirements.txt
litellm>=1.30.0
# multi_cloud_client.py
import litellm
from typing import Optional, List, Dict
import os
class MultiCloudLLM:
"""
Unified interface for AWS, Azure, GCP.
Uses LiteLLM for provider abstraction.
"""
def __init__(self):
# Set API keys for all providers
os.environ["AZURE_API_KEY"] = "..."
os.environ["AZURE_API_BASE"] = "https://..."
os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/creds.json"
# Define model routing
self.model_routing = {
"coding": "azure/gpt-5", # GPT-5 best for code
"analysis": "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0", # Claude best for analysis
"simple": "vertex_ai/gemini-3.1-pro", # Gemini cheapest
"vision": "vertex_ai/gemini-3.5-flash"
}
def chat(
self,
messages: List[Dict],
task_type: str = "analysis",
fallback: bool = True
) -> Dict:
"""
Send chat request with automatic provider routing.
Args:
messages: Chat messages
task_type: Type of task (determines model)
fallback: Try alternative provider if primary fails
Returns:
Response dict with content and metadata
"""
model = self.model_routing.get(task_type, "azure/gpt-5")
try:
response = litellm.completion(
model=model,
messages=messages
)
return {
"content": response.choices[0].message.content,
"provider": model.split('/')[0],
"model": model,
"usage": dict(response.usage)  # usage is a pydantic-style object, not a namedtuple
}
except Exception as e:
if fallback:
# Try alternative provider
fallback_model = self._get_fallback_model(model)
print(f"Primary failed ({e}), trying {fallback_model}")
response = litellm.completion(
model=fallback_model,
messages=messages
)
return {
"content": response.choices[0].message.content,
"provider": fallback_model.split('/')[0],
"model": fallback_model,
"usage": dict(response.usage),  # usage is a pydantic-style object, not a namedtuple
"fallback": True
}
else:
raise
def _get_fallback_model(self, primary_model: str) -> str:
"""Get fallback model for a primary model."""
fallbacks = {
"azure/gpt-5": "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0",
"bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0": "vertex_ai/gemini-3.1-pro",
"vertex_ai/gemini-3.1-pro": "azure/gpt-3.5-turbo"
}
return fallbacks.get(primary_model, "azure/gpt-3.5-turbo")
def cost_optimized_routing(
self,
messages: List[Dict],
max_cost: float = 0.01
) -> Dict:
"""
Route to cheapest model that meets quality threshold.
Strategy: Try cheap model first, upgrade if needed.
"""
# Try cheapest first
models_by_cost = [
"vertex_ai/gemini-3.1-pro", # Cheapest
"azure/gpt-3.5-turbo",
"bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0",
"azure/gpt-5" # Most expensive
]
for model in models_by_cost:
response = litellm.completion(
model=model,
messages=messages
)
# Check if response meets quality bar
if self._is_good_quality(response.choices[0].message.content):
return {
"content": response.choices[0].message.content,
"model": model
}
# If no response passes the quality check, fall back to the default "analysis" route
return self.chat(messages, task_type="analysis")
def _is_good_quality(self, response: str) -> bool:
"""Heuristic quality check."""
# Simple heuristics - in production, use eval model
return len(response) > 50 and not response.startswith("I don't")
# Usage
client = MultiCloudLLM()
# Automatic routing by task
response = client.chat(
messages=[{"role": "user", "content": "Write a Python function to reverse a string"}],
task_type="coding" # Routes to GPT-5 on Azure
)
print(f"Response from {response['provider']}: {response['content']}")
# Cost-optimized routing
response = client.cost_optimized_routing(
messages=[{"role": "user", "content": "What is 2+2?"}]
)
What people do: Build a multi-cloud abstraction layer assuming all providers work the same way. “It’s just an LLM API: swap the endpoint and credentials, right?”
Why it fails: Subtle differences cause production issues. Token counting varies (OpenAI vs Anthropic tokenizers produce different counts for the same text). Rate limiting works differently. Error codes and retry semantics differ. Streaming response formats aren’t identical. The “same” model on different providers can have slightly different outputs.
Fix: Treat each provider as distinct until proven equivalent. Test edge cases on each provider. Build provider-specific error handling. Don’t assume token counts transfer—recalculate per provider. Run integration tests against each provider separately before deploying multi-cloud.
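One way to approach provider-specific error handling is to map each provider's error codes onto a small shared vocabulary before deciding whether to retry. The sketch below is illustrative only: `ErrorKind`, `ProviderError`, and `classify` are hypothetical names, and the table is a tiny sample that should be checked against each provider's current documentation.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    RATE_LIMIT = "rate_limit"  # back off, then retry
    TRANSIENT = "transient"    # retry or fail over
    FATAL = "fatal"            # do not retry (bad request, auth, unknown)


@dataclass(frozen=True)
class ProviderError:
    provider: str
    code: str
    kind: ErrorKind

    @property
    def retryable(self) -> bool:
        return self.kind is not ErrorKind.FATAL


# Sample mapping (assumption: extend and verify per provider before relying on it).
_ERROR_TABLE = {
    "aws": {
        "ThrottlingException": ErrorKind.RATE_LIMIT,
        "ServiceUnavailableException": ErrorKind.TRANSIENT,
    },
    "azure": {"429": ErrorKind.RATE_LIMIT, "503": ErrorKind.TRANSIENT},
    "gcp": {
        "RESOURCE_EXHAUSTED": ErrorKind.RATE_LIMIT,
        "UNAVAILABLE": ErrorKind.TRANSIENT,
    },
}


def classify(provider: str, code: str) -> ProviderError:
    """Unknown providers or codes are treated as fatal, the conservative default."""
    kind = _ERROR_TABLE.get(provider, {}).get(code, ErrorKind.FATAL)
    return ProviderError(provider, code, kind)
```

Treating unknown codes as fatal keeps a new, unmapped error from triggering a retry storm; the table grows as integration tests surface real failures.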
“We ran our first real failover drill after eight months of confidently telling leadership we were multi-region. The primary went ‘down’ at 10am. Failover kicked in cleanly—then p95 latency spiked from 800ms to 14 seconds and stayed there for 40 minutes. Reason: the standby region had cold model caches, cold vector index pages, and no warm DB connection pool. The runbook said ‘traffic shifts in 90 seconds.’ Reality was ‘recovery completes in 40 minutes.’ Now we run a 5 percent shadow load to every standby region 24/7. It costs us about $8K/month and saves the postmortem we don’t want to write.”
— SRE Lead at a fintech infrastructure company
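The shadow-load idea from the quote can be sketched roughly as below. This is an assumption-laden illustration rather than the quoted team's implementation: `ShadowSampler` and `handle_request` are hypothetical names, the providers are plain callables, and a production version would likely send the shadow call asynchronously so it adds no user-facing latency.

```python
import random
from typing import Callable, Optional


class ShadowSampler:
    """Decides which requests are mirrored to the standby (5% by default)."""

    def __init__(self, fraction: float = 0.05, rng: Optional[random.Random] = None):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0 and 1")
        self.fraction = fraction
        self._rng = rng or random.Random()

    def should_shadow(self) -> bool:
        return self._rng.random() < self.fraction


def handle_request(
    prompt: str,
    primary: Callable[[str], str],
    standby: Callable[[str], object],
    sampler: ShadowSampler,
) -> str:
    """Serve from the primary; mirror a sample to the standby and discard its result."""
    result = primary(prompt)
    if sampler.should_shadow():
        try:
            standby(prompt)  # keeps caches and connection pools warm
        except Exception:
            pass  # a shadow failure must never affect the user-facing response
    return result
```

In practice the swallowed shadow failures would be logged and alerted on, since they are the early signal that the standby cannot take real traffic.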
Custom Abstraction Layer:
# llm_abstraction.py
import os  # used by MultiCloudRouter to read credentials from the environment
from abc import ABC, abstractmethod
from typing import List, Dict, Optional
import boto3
from openai import AzureOpenAI
import vertexai
from vertexai.generative_models import GenerativeModel
class LLMProvider(ABC):
"""Abstract base class for LLM providers."""
@abstractmethod
def chat(
self,
messages: List[Dict[str, str]],
temperature: float = 0.7,
max_tokens: int = 1000
) -> Dict:
"""Chat completion."""
pass
@abstractmethod
def embed(self, texts: List[str]) -> List[List[float]]:
"""Get embeddings."""
pass
class AWSBedrockProvider(LLMProvider):
"""AWS Bedrock implementation."""
def __init__(self, region: str = "us-east-1"):
self.client = boto3.client('bedrock-runtime', region_name=region)
def chat(self, messages, temperature=0.7, max_tokens=1000):
import json
response = self.client.invoke_model(
modelId="anthropic.claude-sonnet-4-6-20250929-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": max_tokens,
"messages": messages,
"temperature": temperature
})
)
result = json.loads(response['body'].read())
return {
"content": result['content'][0]['text'],
"usage": result['usage']
}
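def embed(self, texts):
import json
# Titan Text Embeddings takes a single inputText per call, so invoke once per text
embeddings = []
for text in texts:
response = self.client.invoke_model(
modelId="amazon.titan-embed-text-v1",
body=json.dumps({"inputText": text})
)
result = json.loads(response['body'].read())
embeddings.append(result['embedding'])
return embeddings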
class AzureOpenAIProvider(LLMProvider):
"""Azure OpenAI implementation."""
def __init__(self, endpoint: str, api_key: str):
self.client = AzureOpenAI(
azure_endpoint=endpoint,
api_key=api_key,
api_version="2024-02-15-preview"
)
def chat(self, messages, temperature=0.7, max_tokens=1000):
response = self.client.chat.completions.create(
model="gpt-5", # deployment name
messages=messages,
temperature=temperature,
max_tokens=max_tokens
)
return {
"content": response.choices[0].message.content,
"usage": response.usage.model_dump()
}
def embed(self, texts):
response = self.client.embeddings.create(
model="text-embedding-3-small",
input=texts
)
return [item.embedding for item in response.data]
class VertexAIProvider(LLMProvider):
"""Vertex AI implementation."""
def __init__(self, project_id: str, location: str = "us-central1"):
vertexai.init(project=project_id, location=location)
self.model = GenerativeModel("gemini-3.1-pro")
def chat(self, messages, temperature=0.7, max_tokens=1000):
# Convert to Gemini format
prompt = "\n".join([f"{m['role']}: {m['content']}" for m in messages])
response = self.model.generate_content(
prompt,
generation_config={
"temperature": temperature,
"max_output_tokens": max_tokens
}
)
return {
"content": response.text,
"usage": {} # Vertex doesn't return usage in same format
}
def embed(self, texts):
from vertexai.language_models import TextEmbeddingModel
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
embeddings = model.get_embeddings(texts)
return [emb.values for emb in embeddings]
class MultiCloudRouter:
"""Route requests across providers."""
def __init__(self):
self.providers = {
"aws": AWSBedrockProvider(),
"azure": AzureOpenAIProvider(
endpoint=os.getenv("AZURE_ENDPOINT"),
api_key=os.getenv("AZURE_KEY")
),
"gcp": VertexAIProvider(project_id=os.getenv("GCP_PROJECT"))
}
self.default_provider = "aws"
def chat(
self,
messages: List[Dict],
provider: Optional[str] = None,
**kwargs
) -> Dict:
"""Chat with optional provider override."""
provider_name = provider or self.default_provider
llm = self.providers[provider_name]
try:
result = llm.chat(messages, **kwargs)
result["provider"] = provider_name
return result
except Exception as e:
# Fallback to next provider
print(f"{provider_name} failed: {e}")
for name, fallback_llm in self.providers.items():
if name != provider_name:
try:
result = fallback_llm.chat(messages, **kwargs)
result["provider"] = name
result["fallback"] = True
return result
except Exception:
continue
raise Exception("All providers failed")
# Usage
router = MultiCloudRouter()
# Use default provider (AWS)
response = router.chat([
{"role": "user", "content": "Explain Kubernetes"}
])
# Override to Azure
response = router.chat(
messages=[{"role": "user", "content": "Write a haiku about clouds"}],
provider="azure"
)
print(f"Response from {response['provider']}: {response['content']}")
Disaster Recovery Patterns
Active-Passive DR:
# dr_pattern.py
import time
from typing import Dict, List
from llm_abstraction import LLMProvider  # base class from llm_abstraction.py above
class ActivePassiveLLM:
"""
Active-passive DR pattern.
Primary provider handles all traffic.
Automatic failover to secondary on failure.
"""
def __init__(
self,
primary_provider: LLMProvider,
secondary_provider: LLMProvider,
health_check_interval: int = 60
):
self.primary = primary_provider
self.secondary = secondary_provider
self.health_check_interval = health_check_interval
self.primary_healthy = True
self.last_health_check = time.time()
def chat(self, messages: List[Dict], **kwargs) -> Dict:
"""Chat with automatic failover."""
# Health check
if time.time() - self.last_health_check > self.health_check_interval:
self._health_check()
# Try primary
if self.primary_healthy:
try:
result = self.primary.chat(messages, **kwargs)
result["provider"] = "primary"
return result
except Exception as e:
print(f"Primary failed: {e}, failing over to secondary")
self.primary_healthy = False
# Use secondary
result = self.secondary.chat(messages, **kwargs)
result["provider"] = "secondary"
result["failover"] = True
return result
def _health_check(self):
"""Check if primary is healthy."""
try:
# Simple ping
self.primary.chat(
[{"role": "user", "content": "ping"}],
max_tokens=5
)
self.primary_healthy = True
except Exception:
self.primary_healthy = False
self.last_health_check = time.time()
# Usage
dr_llm = ActivePassiveLLM(
primary_provider=AWSBedrockProvider(),
secondary_provider=AzureOpenAIProvider(endpoint="...", api_key="...")
)
response = dr_llm.chat([{"role": "user", "content": "Hello"}])
Active-Active with Load Balancing:
# load_balancer.py
import random
from collections import defaultdict
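import time
from typing import Dict, List, Optional
from llm_abstraction import LLMProvider  # base class from llm_abstraction.py above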
class LoadBalancedLLM:
"""
Active-active load balancing.
Distribute traffic across multiple providers.
"""
def __init__(self, providers: Dict[str, LLMProvider]):
self.providers = providers
# Track metrics per provider
self.metrics = defaultdict(lambda: {
"requests": 0,
"errors": 0,
"latency": []
})
def chat(
self,
messages: List[Dict],
strategy: str = "round-robin",
**kwargs
) -> Dict:
"""
Chat with load balancing.
Strategies:
- round-robin: Distribute evenly
- random: Random selection
- least-latency: Route to fastest provider
- weighted: Based on capacity/cost
"""
if strategy == "round-robin":
provider_name = self._round_robin()
elif strategy == "random":
provider_name = random.choice(list(self.providers.keys()))
elif strategy == "least-latency":
provider_name = self._least_latency()
else:
provider_name = self._weighted()
provider = self.providers[provider_name]
# Execute with metrics
start = time.time()
try:
result = provider.chat(messages, **kwargs)
latency = time.time() - start
# Track metrics
self.metrics[provider_name]["requests"] += 1
self.metrics[provider_name]["latency"].append(latency)
result["provider"] = provider_name
result["latency"] = latency
return result
except Exception as e:
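# Count the failed attempt as a request too, so error_rate stays between 0 and 1
self.metrics[provider_name]["requests"] += 1
self.metrics[provider_name]["errors"] += 1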
# Retry with different provider
providers_to_try = [p for p in self.providers.keys() if p != provider_name]
for retry_provider in providers_to_try:
try:
result = self.providers[retry_provider].chat(messages, **kwargs)
result["provider"] = retry_provider
result["retry"] = True
return result
except Exception:
continue
raise
def _round_robin(self) -> str:
"""Simple round-robin."""
total_requests = sum(m["requests"] for m in self.metrics.values())
provider_idx = total_requests % len(self.providers)
return list(self.providers.keys())[provider_idx]
def _least_latency(self) -> str:
"""Route to provider with lowest average latency."""
avg_latencies = {}
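# Iterate over all providers, not just those already in self.metrics,
# so providers with no traffic yet are still considered (and tried first)
for name in self.providers:
metrics = self.metrics[name]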
if metrics["latency"]:
avg_latencies[name] = sum(metrics["latency"]) / len(metrics["latency"])
else:
avg_latencies[name] = 0
if not avg_latencies:
return random.choice(list(self.providers.keys()))
return min(avg_latencies, key=avg_latencies.get)
def _weighted(self, weights: Optional[Dict[str, float]] = None) -> str:
"""Weighted selection based on capacity or cost."""
if not weights:
# Default weights (example: prefer cheaper providers)
weights = {
"gcp": 0.5, # Cheapest, 50% of traffic
"azure": 0.3,
"aws": 0.2
}
return random.choices(
list(weights.keys()),
weights=list(weights.values())
)[0]
def get_metrics(self) -> Dict:
"""Get performance metrics."""
summary = {}
for provider, metrics in self.metrics.items():
avg_latency = (
sum(metrics["latency"]) / len(metrics["latency"])
if metrics["latency"] else 0
)
error_rate = (
metrics["errors"] / metrics["requests"]
if metrics["requests"] > 0 else 0
)
summary[provider] = {
"requests": metrics["requests"],
"errors": metrics["errors"],
"error_rate": error_rate,
"avg_latency": avg_latency
}
return summary
# Usage
lb = LoadBalancedLLM({
"aws": AWSBedrockProvider(),
"azure": AzureOpenAIProvider(endpoint="...", api_key="..."),
"gcp": VertexAIProvider(project_id="...")
})
# Send requests
for _ in range(100):
response = lb.chat(
[{"role": "user", "content": "Hello"}],
strategy="least-latency"
)
# Check metrics
print(lb.get_metrics())
Cloud-Native Patterns
Serverless Inference
AWS Lambda with Bedrock:
# lambda_handler.py
import json
import boto3
import os
bedrock = boto3.client('bedrock-runtime')
def lambda_handler(event, context):
"""
Serverless LLM inference.
Benefits:
- No idle costs
- Auto-scaling
- Simple deployment
Limitations:
- 15 min max runtime
- Cold starts (~1-2s)
- No GPU access
"""
body = json.loads(event['body'])
prompt = body['prompt']
response = bedrock.invoke_model(
modelId="anthropic.claude-sonnet-4-6-20250929-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": [{"role": "user", "content": prompt}]
})
)
result = json.loads(response['body'].read())
return {
'statusCode': 200,
'body': json.dumps({
'response': result['content'][0]['text']
})
}
# Deploy with SAM or Serverless Framework
# serverless.yml
"""
service: llm-api
provider:
  name: aws
  runtime: python3.11
  region: us-east-1
  iam:
    role:
      statements:
        - Effect: Allow
          Action:
            - bedrock:InvokeModel
          Resource: '*'
functions:
  chat:
    handler: lambda_handler.lambda_handler
    timeout: 60
    events:
      - http:
          path: chat
          method: post
"""
GCP Cloud Functions with Vertex AI:
# main.py
import functions_framework
from vertexai.generative_models import GenerativeModel
import vertexai
import os
# Initialize outside handler for warm starts
vertexai.init(project=os.getenv("GCP_PROJECT"), location="us-central1")
model = GenerativeModel("gemini-3.1-pro")
@functions_framework.http
def chat(request):
"""
Cloud Function for Vertex AI.
Warm start optimization: Initialize model globally.
"""
request_json = request.get_json()
prompt = request_json['prompt']
response = model.generate_content(prompt)
return {'response': response.text}
# Deploy
# gcloud functions deploy chat \
# --runtime python311 \
# --trigger-http \
# --allow-unauthenticated \
# --region us-central1
Azure Functions with OpenAI:
# function_app.py
import azure.functions as func
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import os
app = func.FunctionApp()
# Warm start: Initialize outside handler
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
credential,
"https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
azure_ad_token_provider=token_provider,
api_version="2024-02-15-preview"
)
@app.function_name(name="chat")
@app.route(route="chat", methods=["POST"])
def chat(req: func.HttpRequest) -> func.HttpResponse:
"""Azure Function with managed identity."""
req_body = req.get_json()
prompt = req_body['prompt']
response = client.chat.completions.create(
model=os.getenv("DEPLOYMENT_NAME"),
messages=[{"role": "user", "content": prompt}]
)
return func.HttpResponse(
response.choices[0].message.content,
mimetype="text/plain"
)
# Deploy
# func azure functionapp publish <app-name>
Container-Based Deployment
AWS ECS with Bedrock:
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# main.py
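from fastapi import Body, FastAPI
import boto3
import json
import os
app = FastAPI()
bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
# Plain def (not async): FastAPI runs it in a threadpool, so the blocking
# boto3 call does not stall the event loop. Expects a JSON body {"prompt": "..."}.
@app.post("/chat")
def chat(prompt: str = Body(..., embed=True)):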
"""
FastAPI app running on ECS.
Benefits over Lambda:
- No runtime limits
- Keep connections warm
- More memory/CPU options
"""
response = bedrock.invoke_model(
modelId="anthropic.claude-sonnet-4-6-20250929-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": [{"role": "user", "content": prompt}]
})
)
result = json.loads(response['body'].read())
return {"response": result['content'][0]['text']}
# Health check for ECS
@app.get("/health")
async def health():
return {"status": "healthy"}
# ecs-task-definition.json
{
"family": "llm-api",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [
{
"name": "llm-api",
"image": "<account-id>.dkr.ecr.us-east-1.amazonaws.com/llm-api:latest",
"portMappings": [
{
"containerPort": 8000,
"protocol": "tcp"
}
],
"environment": [
{
"name": "AWS_REGION",
"value": "us-east-1"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/llm-api",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}
]
}
GKE with Vertex AI:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      serviceAccountName: vertex-ai-sa  # Kubernetes service account (bind it to a GCP service account via Workload Identity)
      containers:
        - name: llm-api
          image: gcr.io/project-id/llm-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: GCP_PROJECT
              value: "my-project"
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: llm-api
spec:
  selector:
    app: llm-api
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
GPU Instances and Spot Pricing
Self-Hosted with Spot Instances:
# spot_instance_manager.py
import boto3
from typing import Dict
class SpotInstanceManager:
"""
Manage EC2 spot instances for self-hosted LLM inference.
Use case: Host your own LLM (Llama, Mistral) on GPU spot instances.
Savings: roughly 60-70% compared to on-demand (spot prices vary by region and over time).
"""
def __init__(self):
self.ec2 = boto3.client('ec2')
def request_spot_instance(
self,
instance_type: str = "g5.xlarge", # 1x A10G GPU
max_price: str = "0.50", # USD per hour
ami_id: str = "ami-xxxxx" # Custom AMI with vLLM
) -> str:
"""
Request spot instance for LLM hosting.
Instance types:
- g5.xlarge: 1x A10G (24GB) - ~$0.40/hr spot vs $1.01 on-demand
- g5.2xlarge: 1x A10G (24GB), more CPU/RAM
- g5.12xlarge: 4x A10G (96GB) - for larger models
- p4d.24xlarge: 8x A100 (320GB) - for huge models
"""
user_data = """#!/bin/bash
# Install vLLM and start server
pip install vllm
python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-4-Scout-17B \
--port 8000 \
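--gpu-memory-utilization 0.9
"""
# request_spot_instances expects UserData to be base64-encoded
# (boto3 only encodes it automatically for run_instances)
import base64
encoded_user_data = base64.b64encode(user_data.encode("utf-8")).decode("ascii")
response = self.ec2.request_spot_instances(
SpotPrice=max_price,
InstanceCount=1,
LaunchSpecification={
'ImageId': ami_id,
'InstanceType': instance_type,
'KeyName': 'my-key',
'SecurityGroupIds': ['sg-xxxxx'],
'UserData': encoded_user_data,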
'IamInstanceProfile': {
'Name': 'EC2-vLLM-Role'
},
'BlockDeviceMappings': [
{
'DeviceName': '/dev/sda1',
'Ebs': {
'VolumeSize': 100,
'VolumeType': 'gp3'
}
}
]
}
)
request_id = response['SpotInstanceRequests'][0]['SpotInstanceRequestId']
print(f"Spot request: {request_id}")
return request_id
def handle_spot_interruption(self):
"""
Handle spot instance interruption.
AWS gives 2-minute warning before terminating spot instance.
Use this to gracefully drain connections.
"""
# Run this in background on instance
script = """
import requests
import time
import signal
import sys
def graceful_shutdown(signum, frame):
print("Spot interruption detected, draining...")
# Stop accepting new requests
# Finish existing requests
# Save state if needed
sys.exit(0)
signal.signal(signal.SIGTERM, graceful_shutdown)
while True:
try:
# Check for interruption notice
r = requests.get(
'http://169.254.169.254/latest/meta-data/spot/instance-action',
timeout=1
)
if r.status_code == 200:
graceful_shutdown(None, None)
except requests.RequestException:  # network errors only; SystemExit from sys.exit() must propagate
pass
time.sleep(5)
"""
return script
# Usage
manager = SpotInstanceManager()
# Request spot instance
request_id = manager.request_spot_instance(
instance_type="g5.2xlarge",
max_price="0.80"
)
# Cost comparison:
# On-demand: $1.21/hr * 720 hrs/month = $871/month
# Spot: $0.40/hr * 720 hrs/month = $288/month
# Savings: $583/month (67%)
Auto-Scaling Patterns
AWS Bedrock Provisioned Throughput (manual scaling):
Bedrock Provisioned Throughput is not currently a registered Application Auto Scaling target: there is no ServiceNamespace='bedrock' / ScalableDimension='bedrock:model:ProvisionedThroughput' combination supported by AWS. Capacity is purchased in Model Units (MUs), optionally with a one-month or six-month commitment, so most teams scale Bedrock manually based on CloudWatch metrics, or rely on on-demand inference for elasticity and reserve provisioned capacity only for steady baseline load.
# bedrock_capacity.py
import boto3
bedrock = boto3.client('bedrock')
# Create provisioned throughput (Model Units, not auto-scaled)
response = bedrock.create_provisioned_model_throughput(
modelUnits=2, # baseline capacity
provisionedModelName='claude-sonnet-prod',
modelId='anthropic.claude-sonnet-4-6-20250929-v1:0',
commitmentDuration='OneMonth' # or 'SixMonths', omit for no commitment
)
provisioned_arn = response['provisionedModelArn']
# To "scale", create a new provisioned model with more MUs and migrate
# traffic, or fall back to on-demand inference during spikes. Watch
# CloudWatch metrics like InvocationThrottles to decide when to expand.
For workloads that need true elastic scaling, prefer on-demand inference (throttled per account quota) or host the model yourself on SageMaker endpoints, which do support Application Auto Scaling target tracking via the sagemaker service namespace.
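Because Provisioned Throughput is scaled by hand, it can help to write the "when do we expand?" rule down as code. The sketch below is one possible rule, not an AWS recommendation: `advise_capacity`, the 1% throttle-ratio threshold, and the three-hour window are all hypothetical, and the hourly counts are assumed to have been pulled from CloudWatch already.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CapacityAdvice:
    action: str  # "hold", "overflow_to_on_demand", or "expand"
    reason: str


def advise_capacity(
    throttles_per_hour: List[int],
    invocations_per_hour: List[int],
    threshold: float = 0.01,   # assumed tolerance: 1% of calls throttled
    sustained_hours: int = 3,  # assumed window before committing to more MUs
) -> CapacityAdvice:
    """Turn hourly throttle and invocation counts into a manual scaling suggestion."""
    if len(throttles_per_hour) != len(invocations_per_hour):
        raise ValueError("series must have the same length")
    ratios = [
        t / i if i else 0.0
        for t, i in zip(throttles_per_hour, invocations_per_hour)
    ]
    recent = ratios[-sustained_hours:]
    if len(recent) == sustained_hours and all(r > threshold for r in recent):
        return CapacityAdvice("expand", "throttling sustained across the whole window")
    if recent and recent[-1] > threshold:
        return CapacityAdvice("overflow_to_on_demand", "short spike; absorb it with on-demand")
    return CapacityAdvice("hold", "throttle ratio within tolerance")
```

The asymmetry is deliberate: a short spike is routed to on-demand inference, while only sustained throttling justifies buying more committed capacity.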
Kubernetes HPA for LLM Workloads:
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
Security & Compliance
VPC and Private Endpoints
AWS Private VPC for Bedrock:
# Use Bedrock from private subnet without internet access
# 1. Create VPC endpoint for Bedrock (via AWS Console or CloudFormation)
# Type: Interface
# Service: com.amazonaws.us-east-1.bedrock-runtime
# 2. Use from private subnet
import boto3
bedrock = boto3.client(
'bedrock-runtime',
region_name='us-east-1',
# Format: https://<vpce-id>.bedrock-runtime.<region>.vpce.amazonaws.com
# If Private DNS is enabled on the endpoint, you can omit endpoint_url
# and use the standard service DNS automatically.
endpoint_url='https://vpce-xxxxxxxxxxxxxxxxx.bedrock-runtime.us-east-1.vpce.amazonaws.com'
)
# Traffic never leaves AWS network
Azure Private Link:
# Azure OpenAI with private endpoint
# 1. Create private endpoint (Azure Portal or ARM)
# 2. Configure DNS
# 3. Use from VNet
from openai import AzureOpenAI
client = AzureOpenAI(
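# Keep the standard hostname: with Private Link, private DNS resolves it to the
# private endpoint IP, and the TLS certificate matches this name.
azure_endpoint="https://my-resource.openai.azure.com/",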
api_key="...",
api_version="2024-02-15-preview"
)
# Traffic stays within Azure backbone
IAM and Service Accounts
AWS IAM Best Practices:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "BedrockInvokeOnly",
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": [
"arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-6-20250929-v1:0"
],
"Condition": {
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
},
{
"Sid": "DenyDataLogging",
"Effect": "Deny",
"Action": [
"bedrock:PutModelInvocationLoggingConfiguration"
],
"Resource": "*"
}
]
}
GCP Service Account:
# Create service account with minimal permissions
from google.cloud import iam_admin_v1
client = iam_admin_v1.IAMClient()
# Create service account
service_account = client.create_service_account(
request={
"name": "projects/my-project",
"account_id": "vertex-ai-sa",
"service_account": {
"display_name": "Vertex AI Service Account"
}
}
)
# Grant only necessary roles
# - roles/aiplatform.user (use Vertex AI)
# - roles/storage.objectViewer (read from GCS)
# - NO roles/owner or roles/editor
# Use from application
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/sa-key.json"
Data Residency
AWS Regions:
# EU data residency
bedrock_eu = boto3.client('bedrock-runtime', region_name='eu-west-1')
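# Inference requests are processed in eu-west-1 (avoid cross-region
# inference profiles if data must stay in the EU)
# Regional processing supports GDPR compliance but does not establish it on its own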
# Available Bedrock regions:
# - us-east-1, us-west-2 (most models)
# - eu-west-1, eu-central-1 (EU data residency)
# - ap-southeast-1, ap-northeast-1 (Asia Pacific)
Azure Geographies:
# Azure OpenAI data residency
# Europe geography
client_eu = AzureOpenAI(
azure_endpoint="https://my-resource-westeurope.openai.azure.com/",
api_key="...",
api_version="2024-02-15-preview"
)
# Data stays in European data centers
# GDPR compliant
Compliance Certifications
AWS Bedrock:
- SOC 2 Type II
- HIPAA eligible (BAA available)
- ISO 27001
- GDPR compliant
- PCI DSS (for payment card data)
Azure OpenAI:
- SOC 2 Type II
- HIPAA/HITECH (BAA available)
- ISO 27001, 27018
- GDPR compliant
- FedRAMP (US government)
GCP Vertex AI:
- SOC 2 Type II
- HIPAA compliant
- ISO 27001
- GDPR compliant
- FedRAMP (in progress)
Enabling Compliance Features:
# AWS HIPAA example
# 1. Sign BAA with AWS
# 2. Use HIPAA-eligible services only
# 3. Enable encryption at rest and in transit
# 4. Enable audit logging
import boto3
# Enable CloudTrail for audit logs
cloudtrail = boto3.client('cloudtrail')
cloudtrail.create_trail(
Name='bedrock-audit-trail',
S3BucketName='my-audit-bucket',
IsMultiRegionTrail=True,
EnableLogFileValidation=True
)
# Enable encryption for S3 buckets (for Knowledge Bases)
s3 = boto3.client('s3')
s3.put_bucket_encryption(
Bucket='my-kb-bucket',
ServerSideEncryptionConfiguration={
'Rules': [
{
'ApplyServerSideEncryptionByDefault': {
'SSEAlgorithm': 'AES256'
}
}
]
}
)
Comparison Table
| Feature | AWS Bedrock | Azure OpenAI | GCP Vertex AI |
|---|---|---|---|
| Best Models | Claude Sonnet 4.6, Llama 4 | GPT-5.5, GPT-5.4 | Gemini 3.1 Pro, Gemini 3.5 Flash |
| Pricing (GPT-5 equivalent) | Claude Sonnet 4.6: $3/$15 per 1M tokens | GPT-5.4: $2.50/$15 per 1M tokens | Gemini 3.1 Pro: $2/$12 per 1M tokens |
| Enterprise Auth | IAM roles | Azure AD, Managed Identity | GCP Service Accounts |
| Private Network | VPC Endpoints | Private Link | VPC Service Controls |
| Fine-tuning | Limited (Titan only) | Yes (GPT-5-mini, GPT-5) | Yes (PaLM, Gemini) |
| Multimodal | Claude 4.x (images) | GPT-5 Vision, DALL-E | Gemini 3.1 Pro, Imagen |
| RAG Support | Knowledge Bases | Azure AI Search | Vertex AI Search |
| Agents | Bedrock Agents | Azure AI Studio | Vertex AI Agents |
| Regions | 6+ regions | 10+ regions | 20+ regions |
| Compliance | SOC2, HIPAA, GDPR | SOC2, HIPAA, FedRAMP | SOC2, HIPAA, GDPR |
| Best For | Claude users, AWS ecosystem | Enterprise, Microsoft stack | Google ecosystem, multimodal |
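To make the pricing row concrete, the per-request cost can be worked out from the listed per-1M-token prices. The sketch below uses the table's figures as given; the 2,000-input / 500-output token request size is an assumed example, and `request_cost` is a hypothetical helper.

```python
# (input price, output price) in USD per 1M tokens, copied from the table above
PRICES = {
    "Claude Sonnet 4.6 (Bedrock)": (3.00, 15.00),
    "GPT-5.4 (Azure OpenAI)": (2.50, 15.00),
    "Gemini 3.1 Pro (Vertex AI)": (2.00, 12.00),
}


def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Cost of one request in USD, given per-1M-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Assumed example workload: 2,000 input tokens and 500 output tokens per request
costs = {name: request_cost(2000, 500, *price) for name, price in PRICES.items()}
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: ${cost:.4f} per request, ${cost * 1000:.2f} per 1,000 requests")
```

For this request shape the output tokens account for roughly half the cost or more on every provider, so the input/output mix matters as much as the headline input price.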
Practical Exercises
Multi-Cloud RAG: Build a RAG system using Azure AI Search (formerly Azure Cognitive Search) for indexing, AWS Bedrock for embeddings, and GCP Vertex AI for generation. Implement fallback if any provider fails using LiteLLM’s routing capabilities.
Cost Optimizer: Create a cost optimization service that tracks usage across all three providers, routes requests to the cheapest provider for each task type, switches to provisioned throughput when usage crosses threshold, and provides daily cost reports.
Serverless Agent: Deploy an agent on AWS Lambda that uses Bedrock for reasoning, calls tools (weather API, database), stores conversation history in DynamoDB, and auto-scales based on traffic. Target: sub-3-second p95 latency, 1000 req/s capacity, <$100/month at low traffic.
HIPAA-Compliant Chat: Build a HIPAA-compliant medical chatbot on Azure with private endpoints, Azure OpenAI with BAA, encryption at rest/transit, audit logging, and PHI storage in encrypted Cosmos DB. Create a compliance checklist document.
Disaster Recovery Drill: Set up a multi-cloud failover system and conduct a disaster recovery drill. Simulate primary provider failure, measure failover time, document data loss (if any), and test failback procedures.
Self-Assessment Checkpoint
Conceptual Questions
Q1. [IC2] What is LiteLLM and when would you use it? What are its limitations?
Answer
LiteLLM provides a unified interface for calling 100+ LLM providers with the same OpenAI-compatible API. Use cases: (1) Switching between providers without code changes. (2) Building fallback chains (try Claude, fall back to GPT-4). (3) A/B testing different models. (4) Abstracting provider differences for teams.
Limitations: (1) Adds a dependency and potential point of failure. (2) May lag behind provider feature releases. (3) Some provider-specific features aren’t supported. (4) Debugging is harder—errors may be obscured. (5) Performance overhead (minimal but exists).
When to use: Multi-provider deployments, rapid prototyping, teams evaluating models. When to avoid: Single-provider production deployments (unnecessary abstraction), applications needing cutting-edge provider features.
Q2. [IC2] What’s the difference between a VPC endpoint and a private endpoint? When do you need each?
Answer
VPC endpoint (AWS): Connects your VPC to AWS services without traffic leaving Amazon’s network. Types: Gateway endpoints (S3, DynamoDB—free) and Interface endpoints (most services—costs per hour + data transfer). Use for: Keeping Bedrock traffic within AWS.
Private endpoint (Azure): Similar concept—connects your VNet to Azure services privately. Uses Azure Private Link. Use for: Keeping Azure OpenAI traffic within Azure.
You need these when: (1) Compliance requires no internet egress. (2) Security policy prohibits public API calls. (3) You need to reduce attack surface. (4) Data classification requires private networking.
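On the AWS side, an interface endpoint for Bedrock is created through the EC2 API. The sketch below only builds the arguments for that call so it runs without credentials; the resource IDs are placeholders, the helper name is hypothetical, and the `bedrock-runtime` service name follows the pattern AWS documents for interface endpoints, which should be confirmed for your region.

```python
def bedrock_endpoint_params(
    region: str,
    vpc_id: str,
    subnet_ids: list[str],
    security_group_ids: list[str],
) -> dict:
    """Build keyword arguments for EC2's create_vpc_endpoint call."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Service name pattern for the Bedrock runtime API; confirm it for your region.
        "ServiceName": f"com.amazonaws.{region}.bedrock-runtime",
        "SubnetIds": subnet_ids,
        "SecurityGroupIds": security_group_ids,
        # Private DNS lets existing SDK clients resolve the regional hostname
        # to the endpoint's private IPs without code changes.
        "PrivateDnsEnabled": True,
    }


# Placeholder IDs for illustration; replace them with real resources.
params = bedrock_endpoint_params(
    region="us-east-1",
    vpc_id="vpc-0123456789abcdef0",
    subnet_ids=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)
print(params["ServiceName"])

# With AWS credentials configured, the endpoint would be created with:
#   import boto3
#   boto3.client("ec2", region_name="us-east-1").create_vpc_endpoint(**params)
```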
Both add cost and complexity. Only implement if you have a concrete security or compliance requirement.
Q3. [Senior] Explain the tradeoffs between serverless (Lambda) and container-based (EKS) deployments for LLM applications.
Answer
Serverless (Lambda): Pros: Zero ops, auto-scales to zero, pay-per-invocation. Cons: Cold starts (1-5s for LLM clients), 15-minute timeout, limited memory (10GB), no persistent connections, harder to maintain state.
Containers (EKS): Pros: No cold starts, unlimited execution time, persistent connections to LLM APIs, more control over scaling. Cons: Operational overhead, minimum cost even at zero traffic, need to manage cluster.
Decision factors: (1) Traffic pattern: Bursty with long idle periods → serverless. Steady traffic → containers. (2) Latency requirements: p99 <500ms → containers (cold starts kill serverless). (3) Execution time: >15 minutes → containers. (4) Team expertise: No Kubernetes experience → serverless. (5) Cost optimization: Very low volume → serverless; high volume → containers.
Hybrid pattern: Use Lambda for simple, short-lived tasks (embeddings, classification). Use containers for complex agents and long conversations.
Q4. [Senior] A company wants to use Claude on Bedrock but keep all data in Europe for GDPR. What are their options?
Answer
Options: (1) Bedrock in EU regions: Check if Claude is available in Bedrock’s European regions (for example Frankfurt, eu-central-1, or Ireland, eu-west-1). Availability varies by model. If available, this is simplest. (2) Self-hosted Claude: Not available. Anthropic doesn’t offer Claude for self-hosting. (3) API direct to Anthropic: Anthropic has EU data processing options. Check their DPA and data residency commitments. May meet GDPR with appropriate contract. (4) Alternative model: Use a model available in EU regions (Titan, open-source on SageMaker, or Vertex AI in europe-west regions). (5) Data anonymization: Remove PII before sending to non-EU Claude, ensuring no personal data crosses borders.
GDPR considerations: GDPR is about data protection, not just location. With an appropriate DPA, EU-to-US transfers may be legal under the EU-US Data Privacy Framework. Consult legal counsel for specific requirements.
Q5. [Staff] Design a disaster recovery architecture for a critical AI application. Requirements: RPO <1 hour, RTO <15 minutes, must survive complete cloud region failure.
Answer
Architecture: (1) Active-passive multi-region: Primary in us-east-1, standby in us-west-2. Async replication of state (conversation history, user data) with <1 hour lag. (2) Cross-region database: Use Aurora Global Database or Cosmos DB with multi-region writes. (3) Health monitoring: External health checks every 30 seconds. Automated failover when primary fails 3 consecutive checks. (4) DNS-based failover: Route 53 health checks with failover routing. TTL of 60 seconds. (5) LLM provider diversity: Primary uses Bedrock us-east-1, failover uses Azure OpenAI or direct Anthropic API. (6) State synchronization: Conversation history replicated to standby. On failover, users may see slight regression but can continue. (7) Failback procedure: Once primary recovers, sync any changes from standby, verify consistency, switch traffic back with gradual rollout.
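RTO breakdown: Detection (1-2 min) + DNS propagation (1-2 min) + connection warmup (1-2 min) = 3-6 minutes, about 5 minutes typical and well inside the 15-minute target. RPO: Depends on replication lag; target <15 minutes with continuous sync, which leaves margin under the 1-hour requirement.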
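The RTO arithmetic can be checked against the design parameters with a few lines of code. In this sketch the 30-second check interval, the three-failure threshold, and the 60-second TTL come from the architecture above; the 120-second warmup is an assumed figure at the top of the stated 1-2 minute range, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTimings:
    check_interval_s: int   # seconds between external health checks
    failures_to_trip: int   # consecutive failed checks before failover
    dns_ttl_s: int          # time for clients to pick up the new DNS record
    warmup_s: int           # assumed time to warm connections in the standby


def estimated_rto_seconds(t: FailoverTimings) -> int:
    """Sum detection, DNS propagation, and warmup into one RTO estimate."""
    detection = t.check_interval_s * t.failures_to_trip
    return detection + t.dns_ttl_s + t.warmup_s


timings = FailoverTimings(check_interval_s=30, failures_to_trip=3, dns_ttl_s=60, warmup_s=120)
rto = estimated_rto_seconds(timings)
RTO_BUDGET_S = 15 * 60

print(f"Estimated RTO: {rto} s ({rto / 60:.1f} min), budget {RTO_BUDGET_S} s")
```

Keeping the estimate in code makes it easy to see how a change such as a longer DNS TTL or a slower health check eats into the budget before the quarterly drill exposes it.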
Test quarterly: Conduct actual failover tests to validate assumptions. Document the runbook.
Spot the Problem
Problem 1. [IC2] A developer’s multi-cloud fallback isn’t working:
def call_llm(prompt: str) -> str:
    try:
        return call_bedrock(prompt)
    except Exception:
        return call_azure_openai(prompt)
What’s wrong with this fallback implementation?
Answer
Issues: (1) Catches all exceptions—includes programming errors, not just provider issues. A bug in call_bedrock triggers fallback instead of failing fast. (2) No retry before fallback—single transient failure triggers expensive cross-cloud call. (3) No logging—you won’t know fallback is happening or why. (4) No timeout—if Bedrock hangs, you wait forever before fallback. (5) No circuit breaker—if Bedrock is down, every request tries it first before falling back.
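The circuit breaker from issue (5) can be sketched as follows. This is a minimal single-threaded illustration, not a production implementation: the class, the `call_with_breaker` helper, the thresholds, and the stub providers are all hypothetical, and a real deployment would likely use a maintained library and add locking.

```python
import time


class CircuitBreaker:
    """Skip a failing provider for a cooldown period after repeated failures."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Once the cooldown has passed, let a trial request through (half-open).
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_breaker(breaker, primary, fallback, prompt,
                      provider_errors=(TimeoutError, ConnectionError)):
    """Try the primary unless its circuit is open; fall back on provider errors only."""
    if breaker.allow_request():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except provider_errors:
            breaker.record_failure()
    return fallback(prompt)


# Demo with stub providers and a fake clock (all hypothetical).
fake_now = [0.0]
primary_calls = []


def flaky_primary(prompt):
    primary_calls.append(prompt)
    raise TimeoutError("primary is down")


def stub_fallback(prompt):
    return f"fallback: {prompt}"


breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10.0, clock=lambda: fake_now[0])
results = []
for _ in range(3):
    results.append(call_with_breaker(breaker, flaky_primary, stub_fallback, "hi"))
print(results, len(primary_calls))
```

In the demo the third request skips the primary entirely because two consecutive failures opened the circuit; only after the cooldown does a trial request reach the primary again. Exceptions outside `provider_errors`, such as a programming bug, still propagate instead of triggering the fallback.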
Better implementation:
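import logging

logger = logging.getLogger(__name__)

# ProviderUnavailable, RateLimitError, and retry_with_backoff are assumed
# application-level helpers, defined elsewhere alongside call_bedrock.
def call_llm(prompt: str) -> str:
    try:
        return retry_with_backoff(call_bedrock, prompt, max_retries=2)
    except (ProviderUnavailable, RateLimitError, TimeoutError) as e:
        logger.warning("Bedrock failed: %s, falling back to Azure", e)
        return call_azure_openai(prompt)
    # Any other exception propagates to the caller
Problem 2. [Senior] A serverless LLM application has high cold start latency: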
# AWS Lambda handler
import boto3
from anthropic import Anthropic

def handler(event, context):
    client = Anthropic()  # Initialized on every request
    response = client.messages.create(...)
    return response
Cold starts are taking 3-5 seconds. How would you optimize this?
Answer
Issues: (1) Client initialization on every request: the Anthropic client sets up a connection pool, and creating it inside the handler throws that pool away after each invocation, so warm invocations pay the setup cost too. (2) No connection reuse between invocations.
Optimizations: (1) Initialize client outside handler:
from anthropic import Anthropic

client = Anthropic()  # Created once per execution environment, reused across warm invocations

def handler(event, context):
    response = client.messages.create(...)
    return response
(2) Use provisioned concurrency: Pre-warm Lambda instances to eliminate cold starts for critical paths. Costs more but guarantees latency. (3) Reduce package size: Smaller deployment package = faster cold starts. Remove unused dependencies. (4) Use Lambda SnapStart: Originally Java-only; AWS has since extended it to newer Python and .NET runtimes, so check whether your runtime version is supported. (5) Keep warm: Schedule periodic invocations to prevent cold starts (hacky but works).
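Client reuse and the keep-warm approach can be combined in one handler. The sketch below uses a stub client so it runs without credentials; `StubClient`, `get_client`, and the response shape are hypothetical, and the warm-up check assumes the periodic invocation comes from an EventBridge schedule, whose events carry `"source": "aws.events"`.

```python
from functools import lru_cache


class StubClient:
    """Stand-in for an LLM SDK client so the sketch runs without credentials."""

    instances = 0  # counts how many times a client was constructed

    def __init__(self):
        StubClient.instances += 1

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


@lru_cache(maxsize=1)
def get_client() -> StubClient:
    # Runs once per execution environment; later calls return the cached client.
    return StubClient()


def handler(event, context):
    # Treat scheduled events as keep-warm pings: initialize the client,
    # then return before doing any model work.
    if event.get("source") == "aws.events":
        get_client()
        return {"warmed": True}
    return {"body": get_client().complete(event["prompt"])}


# Simulate one warm-up ping followed by two real requests in the same environment.
ping = handler({"source": "aws.events"}, None)
first = handler({"prompt": "hello"}, None)
second = handler({"prompt": "again"}, None)
print(ping, first, second, StubClient.instances)
```

The lazy getter behaves like module-level initialization for warm invocations, while deferring construction until the first event, which can be convenient in unit tests.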
Problem 3. [Staff] A multi-cloud deployment has inconsistent behavior:
# Same prompt, different results
bedrock_response = call_bedrock("Summarize this document")
azure_response = call_azure("Summarize this document")
# bedrock_response: 150 words, formal tone
# azure_response: 300 words, casual tone
Users are confused when failover happens because responses feel different. How would you address this?
Answer
Root causes: (1) Different models have different default behaviors. (2) System prompts may differ between providers. (3) Temperature and parameter defaults vary. (4) Token limits may differ.
Solutions: (1) Standardize prompts: Include explicit instructions for length, tone, format in the prompt itself, not just system settings. “Summarize in exactly 150 words using formal academic tone.” (2) Normalize parameters: Set explicit temperature, max_tokens, etc., for all providers. Don’t rely on defaults. (3) System prompt parity: Use identical system prompts across providers. (4) Output post-processing: If format still differs, add a normalization layer (truncate to length, adjust formatting). (5) A/B testing: Evaluate both providers on the same test set. Choose the one closer to your target, make the other match it. (6) Provider-specific tuning: Sometimes you need different prompts per provider to get equivalent outputs.
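Solutions (2) and (3) amount to keeping one source of truth for generation settings and translating it into each provider's request shape. The following sketch shows one way to do that; the `GenerationSettings` class and the two builder functions are hypothetical, and the field names follow the Bedrock Converse and OpenAI-style chat request formats as commonly documented, so verify them against current API references before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    system_prompt: str
    temperature: float = 0.2
    max_tokens: int = 400


def to_bedrock_converse(settings: GenerationSettings, user_text: str) -> dict:
    """Keyword arguments shaped like a Bedrock Converse request (modelId omitted)."""
    return {
        "system": [{"text": settings.system_prompt}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {
            "temperature": settings.temperature,
            "maxTokens": settings.max_tokens,
        },
    }


def to_openai_chat(settings: GenerationSettings, user_text: str) -> dict:
    """Keyword arguments shaped like an OpenAI-style chat request (model omitted)."""
    return {
        "messages": [
            {"role": "system", "content": settings.system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": settings.temperature,
        # Some newer models expect max_completion_tokens instead; check the docs.
        "max_tokens": settings.max_tokens,
    }


settings = GenerationSettings(
    system_prompt="Summarize in about 150 words using a formal tone.",
)
bedrock_request = to_bedrock_converse(settings, "Summarize this document")
azure_request = to_openai_chat(settings, "Summarize this document")
print(bedrock_request["inferenceConfig"], azure_request["temperature"])
```

Because both requests are derived from the same frozen settings object, a change to tone, length, or temperature cannot be applied to one provider and forgotten on the other.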
Architectural consideration: If consistency is critical, avoid multi-model fallback. Use the same model across regions (e.g., Claude on Bedrock us-east-1 → Claude on Bedrock us-west-2, not → Azure GPT-4).
Design Exercises
Exercise 1. [Senior] Design a multi-cloud LLM deployment for a global e-commerce company. Requirements: Serve customers in North America, Europe, and Asia-Pacific with <500ms p99 latency for simple queries. Use the best available model in each region. Handle 10K requests/minute peak. Ensure GDPR compliance for EU users.
Guidance
Architecture: (1) Regional deployments: NA (us-east-1): Bedrock with Claude. EU (eu-west-1): Bedrock EU if Claude available, otherwise Vertex AI in europe-west with Gemini. APAC (ap-northeast-1): Bedrock Tokyo if available, otherwise consider Azure Japan or Singapore. (2) Traffic routing: Cloudflare or Route 53 latency-based routing. Each region is independent, with no cross-region LLM calls. (3) Data residency: EU user data stays in EU region. Implement at the routing layer so that EU requests never route to non-EU backends. (4) Capacity: 10K req/min ≈ 167 req/sec peak. Provisioned throughput in each region for guaranteed latency. (5) Consistency: Standardized prompts and parameters across providers. Test outputs for consistency. (6) Fallback: Within-region fallback (Bedrock → direct Anthropic API), not cross-region. (7) Monitoring: Per-region latency and error dashboards. Alert on regional degradation.
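Points (3) and (6) can be enforced in a small routing function that only ever considers backends inside the user's own residency zone. This is a sketch under stated assumptions: the backend names and the `pick_backend` helper are hypothetical, and the health set would come from whatever monitoring the deployment already has.

```python
# Hypothetical backend names, ordered by preference within each residency zone.
BACKENDS = {
    "NA": ["bedrock-us-east-1", "anthropic-direct-us"],
    "EU": ["bedrock-eu-west-1", "vertex-europe-west"],
    "APAC": ["bedrock-ap-northeast-1", "azure-japan-east"],
}


class NoBackendAvailable(RuntimeError):
    """Raised when every backend in the user's residency zone is unhealthy."""


def pick_backend(user_zone: str, healthy: set[str]) -> str:
    """Return the first healthy backend in the user's own zone, never another zone's."""
    for backend in BACKENDS[user_zone]:
        if backend in healthy:
            return backend
    raise NoBackendAvailable(f"no healthy backend in zone {user_zone}")


# Simulate the EU primary being down while everything else is healthy.
healthy_now = {
    "bedrock-us-east-1",
    "anthropic-direct-us",
    "vertex-europe-west",
    "bedrock-ap-northeast-1",
}
eu_choice = pick_backend("EU", healthy_now)
na_choice = pick_backend("NA", healthy_now)
print(eu_choice, na_choice)
```

Raising an error when a zone has no healthy backend is a deliberate choice here: failing an EU request is treated as preferable to silently serving it from a non-EU region.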
Cost estimate: 3 regions × provisioned capacity + global traffic routing ≈ $50-100K/month depending on model and volume.
Exercise 2. [Staff] Your organization is moving from a single-cloud (AWS) to a multi-cloud architecture. Currently running 20 AI applications on Bedrock. The CTO wants “cloud portability” but the VP of Engineering is concerned about operational complexity. Write a proposal that addresses both concerns. Include: phased approach, success metrics, rollback plan, and resource requirements.
Guidance
Proposal structure:
Phase 1 (Months 1-3): Abstraction without migration. Build internal API that wraps Bedrock. All 20 apps migrate to internal API. Low risk: still single cloud, but it lays the foundation for portability.
Phase 2 (Months 4-6): Secondary provider for DR only. Add Azure OpenAI as cold standby. Test failover quarterly. One team owns the integration.
Phase 3 (Months 7-12): Selective multi-cloud. Identify 2-3 apps that would benefit from multi-cloud (cost arbitrage, specific model needs). Migrate those only. Majority stays single-cloud.
Success metrics: (1) Abstraction adoption: 100% of apps using internal API. (2) Failover capability: Successfully tested quarterly. (3) Cost optimization: 10-15% reduction for apps using cost-based routing.
Rollback plan: Each phase is reversible. Internal API can route 100% to Bedrock. Secondary provider can be decommissioned without affecting apps.
Resource requirements: 2 engineers for Phase 1, 1 engineer ongoing for Phases 2-3. Platform team owns the abstraction layer.
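The internal API from Phase 1 and the "route 100% to Bedrock" rollback can be pictured as a thin gateway with adjustable routing weights. The sketch below is one possible shape, with stub backends in place of real provider calls; `LLMGateway`, its method names, and the backend labels are all hypothetical.

```python
import random
from typing import Callable, Dict

Backend = Callable[[str], str]


class LLMGateway:
    """Internal API that hides which provider serves a request."""

    def __init__(self, backends: Dict[str, Backend], weights: Dict[str, float], seed=None):
        self._backends = backends
        self._rng = random.Random(seed)
        self.set_weights(weights)

    def set_weights(self, weights: Dict[str, float]) -> None:
        unknown = set(weights) - set(self._backends)
        if unknown:
            raise ValueError(f"unknown backends: {sorted(unknown)}")
        if not any(weight > 0 for weight in weights.values()):
            raise ValueError("at least one backend needs a positive weight")
        self._weights = dict(weights)

    def complete(self, prompt: str) -> str:
        # Pick a backend in proportion to its weight; zero-weight backends are excluded.
        names = [name for name, weight in self._weights.items() if weight > 0]
        chosen = self._rng.choices(names, weights=[self._weights[n] for n in names], k=1)[0]
        return self._backends[chosen](prompt)


# Stub backends stand in for real Bedrock and Azure OpenAI calls.
gateway = LLMGateway(
    backends={"bedrock": lambda p: f"bedrock: {p}", "azure": lambda p: f"azure: {p}"},
    weights={"bedrock": 1.0},
)
before = {gateway.complete("ping") for _ in range(20)}
gateway.set_weights({"azure": 1.0})      # simulated failover drill
during = {gateway.complete("ping") for _ in range(20)}
gateway.set_weights({"bedrock": 1.0})    # rollback: 100% back to Bedrock
after = {gateway.complete("ping") for _ in range(20)}
print(before, during, after)
```

Applications call only `complete`, so moving traffic between providers, or back again, is a configuration change in the platform layer and not a change to any of the 20 apps.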
Key message to CTO: Portability is achieved through abstraction, not by running on multiple clouds. To VP: We’re not adding complexity to all 20 apps; we’re building one platform layer that makes multi-cloud optional.
Connections to Other Chapters
This chapter’s advanced cloud patterns connect to several other topics:
Chapter 12 (Cloud AI Providers): Provider-specific details for AWS Bedrock, Azure OpenAI, and Vertex AI.
Chapter 9 (LLM Deployment & Infrastructure): Foundational deployment concepts for self-hosted deployments.
Chapter 16 (Security & Adversarial Robustness): Deeper dive into AI security patterns.
Chapter 31 (Reliability Engineering): SLOs, graceful degradation, and incident management.
Chapter 32 (Cost Engineering): Detailed cost optimization strategies building on the provider comparisons here.
Summary
Multi-cloud AI deployments offer resilience and flexibility but come with significant operational complexity. For most teams, starting with a single cloud provider and building proper abstraction layers is the pragmatic path—you can add multi-cloud capabilities later when specific needs arise.
Key Takeaways
Managed services win for most teams: Unless you’re at massive scale (100M+ tokens/day), managed services (Bedrock, Azure OpenAI, Vertex AI) are cheaper and faster to deploy than self-hosting.
Choose provider based on model quality, not just price: Claude Sonnet 4.6 (Bedrock) is better for analysis, GPT-5 (Azure) for general tasks, Gemini (Vertex) for multimodal. Pick the best model for each task.
Multi-cloud adds complexity: Only go multi-cloud if you have a specific need (DR, model access, compliance). Single cloud is simpler.
Security is non-negotiable: Use managed identities, private endpoints, encryption at rest/transit. Never put API keys in code.
Monitor and optimize costs: Set up CloudWatch/Azure Monitor/Cloud Monitoring alerts. Use cheaper models where possible. Implement caching.
Serverless for variable workloads: Lambda/Cloud Functions/Azure Functions suit unpredictable or bursty traffic. Containers (ECS/EKS/GKE/AKS) suit steady traffic.
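Compliance is easier with managed services: AWS, Azure, and GCP all hold SOC 2 attestations and support HIPAA and GDPR workloads, although configuring your own application correctly remains your responsibility. Self-hosting means you handle everything.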
Interview-Ready: Understand the tradeoffs between providers, when to use provisioned throughput vs pay-as-you-go, and how to implement multi-cloud failover. Know the pricing models and breakeven points.