Chapter 13: Multi-Cloud & Cloud-Native Patterns

Keywords

multi-cloud, Kubernetes, serverless, hybrid deployment, cloud abstraction, failover, disaster recovery

Introduction

Chapter 12 covered the fundamentals of individual cloud AI providers. This chapter goes deeper into advanced deployment patterns: multi-cloud architectures, serverless and container-based deployments, security and compliance, and disaster recovery.

When to Read This Chapter:

You’re deploying across multiple cloud providers
You need serverless or container-based LLM deployments
You have strict security or compliance requirements
You’re designing disaster recovery for AI systems

What You’ll Learn:

1. Multi-Cloud Strategies - When and how to use multiple providers
2. Cloud-Native Patterns - Serverless, containers, and auto-scaling
3. Security & Compliance - VPCs, private endpoints, IAM, and certifications

Multi-Cloud Strategies

When to Use Multiple Providers

Reasons for Multi-Cloud:

Best-of-breed models: Claude on AWS, GPT-5 on Azure, Gemini on GCP
Disaster recovery: Fall back to another provider if one has an outage
Cost arbitrage: Use the cheapest provider per task
Regional presence: Some providers have better coverage in certain regions
Compliance: Different data residency requirements

Anti-patterns (When NOT to multi-cloud):

Premature optimization: Start with one, add others when needed
Complexity overhead: Multi-cloud adds operational burden
Vendor negotiation leverage: Often a myth for smaller customers

Common Mistake: Multi-Cloud for “Flexibility” Before You Need It

What people do: Build abstraction layers across AWS, Azure, and GCP from day one “in case we need to switch providers later.” The team then spends months building and maintaining compatibility across three platforms.

Why it fails: Multi-cloud adds operational complexity that compounds over time: three IAM systems, three monitoring stacks, three incident response procedures, three billing dashboards. The “flexibility” costs more than it saves. Most teams never actually switch providers, and when they do, the abstraction layer is usually incomplete anyway.

Fix: Start with one cloud, the one your team knows best. Build a clean internal API that could support multiple providers, but only implement the one you’re using. Add a second provider only when you have a concrete, immediate need (required model, compliance mandate, disaster recovery requirement). Never maintain three production implementations “just in case.”
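The fix above can be sketched as a thin internal interface with exactly one implementation behind it. This is a minimal illustration, not a prescribed design: ChatBackend, SingleProviderBackend, answer, and fake_send are hypothetical names, and the provider call is injected as a plain function so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

Messages = List[Dict[str, str]]


@dataclass
class ChatResult:
    content: str
    provider: str


class ChatBackend(Protocol):
    """Hypothetical internal interface; application code depends only on this."""

    def chat(self, messages: Messages) -> ChatResult: ...


class SingleProviderBackend:
    """The one backend actually implemented today.

    `send` stands in for the real SDK call (for example a Bedrock or
    Azure OpenAI request). It is injected so this sketch runs offline.
    """

    def __init__(self, name: str, send: Callable[[Messages], str]):
        self.name = name
        self._send = send

    def chat(self, messages: Messages) -> ChatResult:
        return ChatResult(content=self._send(messages), provider=self.name)


def answer(backend: ChatBackend, question: str) -> ChatResult:
    """Application code: knows about ChatBackend, not about any cloud SDK."""
    return backend.chat([{"role": "user", "content": question}])


def fake_send(messages: Messages) -> str:
    """Offline stand-in for the provider call, used only for illustration."""
    return f"echo: {messages[-1]['content']}"


backend = SingleProviderBackend("aws", fake_send)
result = answer(backend, "hello")
print(result.provider, result.content)
```

Adding a second provider later means writing one more class that satisfies ChatBackend; nothing in answer changes, and until then there is only one production implementation to maintain.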

Abstraction Layers and Portability

LiteLLM for Multi-Cloud:

# requirements.txt
litellm>=1.30.0

# multi_cloud_client.py
import litellm
from typing import Optional, List, Dict
import os

class MultiCloudLLM:
    """
    Unified interface for AWS, Azure, GCP.

    Uses LiteLLM for provider abstraction.
    """

    def __init__(self):
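        # LiteLLM reads provider credentials from these environment variables.
        # The placeholder assignments only show which variables are needed;
        # in production, set them outside the code (for example from a
        # secrets manager) and delete these lines so real values are not
        # overwritten.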
        os.environ["AZURE_API_KEY"] = "..."
        os.environ["AZURE_API_BASE"] = "https://..."
        os.environ["AWS_ACCESS_KEY_ID"] = "..."
        os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/creds.json"

        # Define model routing
        self.model_routing = {
            "coding": "azure/gpt-5",  # GPT-5 best for code
            "analysis": "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0",  # Claude best for analysis
            "simple": "vertex_ai/gemini-3.1-pro",  # Gemini cheapest
            "vision": "vertex_ai/gemini-3.5-flash"
        }

    def chat(
        self,
        messages: List[Dict],
        task_type: str = "analysis",
        fallback: bool = True
    ) -> Dict:
        """
        Send chat request with automatic provider routing.

        Args:
            messages: Chat messages
            task_type: Type of task (determines model)
            fallback: Try alternative provider if primary fails

        Returns:
            Response dict with content and metadata
        """

        model = self.model_routing.get(task_type, "azure/gpt-5")

        try:
            response = litellm.completion(
                model=model,
                messages=messages
            )

            return {
                "content": response.choices[0].message.content,
                "provider": model.split('/')[0],
                "model": model,
                # usage is a pydantic model, not a namedtuple
                "usage": response.usage.model_dump()
            }

        except Exception as e:
            if fallback:
                # Try alternative provider
                fallback_model = self._get_fallback_model(model)

                print(f"Primary failed ({e}), trying {fallback_model}")

                response = litellm.completion(
                    model=fallback_model,
                    messages=messages
                )

                return {
                    "content": response.choices[0].message.content,
                    "provider": fallback_model.split('/')[0],
                    "model": fallback_model,
                    # usage is a pydantic model, not a namedtuple
                    "usage": response.usage.model_dump(),
                    "fallback": True
                }
            else:
                raise

    def _get_fallback_model(self, primary_model: str) -> str:
        """Get fallback model for a primary model."""

        fallbacks = {
            "azure/gpt-5": "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0",
            "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0": "vertex_ai/gemini-3.1-pro",
            "vertex_ai/gemini-3.1-pro": "azure/gpt-3.5-turbo"
        }

        return fallbacks.get(primary_model, "azure/gpt-3.5-turbo")

    def cost_optimized_routing(
        self,
        messages: List[Dict],
        max_cost: float = 0.01
    ) -> Dict:
        """
        Route to the cheapest model that meets the quality threshold.

        Strategy: Try the cheapest model first, upgrade if needed.
        Note: every attempt is billed, and max_cost is not enforced in
        this sketch. Compare it against a per-call cost estimate before
        escalating if you need a hard budget.
        """

        # Ordered from cheapest to most expensive
        models_by_cost = [
            "vertex_ai/gemini-3.1-pro",  # Cheapest
            "azure/gpt-3.5-turbo",
            "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0",
            "azure/gpt-5"  # Most expensive
        ]

        for model in models_by_cost:
            try:
                response = litellm.completion(
                    model=model,
                    messages=messages
                )
            except Exception as e:
                # One failing provider should not abort the whole cascade
                print(f"{model} failed ({e}), trying next model")
                continue

            content = response.choices[0].message.content

            # Check if response meets quality bar
            if self._is_good_quality(content):
                return {
                    "content": content,
                    "model": model
                }

        # No response passed the quality check: use default routing
        return self.chat(messages, task_type="analysis")

    def _is_good_quality(self, response: str) -> bool:
        """Heuristic quality check."""
        # Simple heuristics - in production, use an eval model.
        # bool(response) guards against None or empty content.
        return bool(response) and len(response) > 50 and not response.startswith("I don't")

# Usage
client = MultiCloudLLM()

# Automatic routing by task
response = client.chat(
    messages=[{"role": "user", "content": "Write a Python function to reverse a string"}],
    task_type="coding"  # Routes to GPT-5 on Azure
)

print(f"Response from {response['provider']}: {response['content']}")

# Cost-optimized routing
response = client.cost_optimized_routing(
    messages=[{"role": "user", "content": "What is 2+2?"}]
)

Common Mistake: Assuming API Compatibility Across Providers

What people do: Build a multi-cloud abstraction layer assuming all providers work the same way. “It’s just an LLM API—swap the endpoint and credentials, right?”

Why it fails: Subtle differences cause production issues. Token counting varies (OpenAI vs Anthropic tokenizers produce different counts for the same text). Rate limiting works differently. Error codes and retry semantics differ. Streaming response formats aren’t identical. The “same” model on different providers can have slightly different outputs.

Fix: Treat each provider as distinct until proven equivalent. Test edge cases on each provider. Build provider-specific error handling. Don’t assume token counts transfer—recalculate per provider. Run integration tests against each provider separately before deploying multi-cloud.
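One way to approach the provider-specific error handling mentioned above is to normalize failures into a small set of kinds before deciding whether to retry. The sketch below is an assumption-laden illustration: ErrorKind, classify, should_retry, and the PROVIDER_OVERRIDES table are hypothetical, and the single override entry is an example rather than a verified catalogue of provider error codes.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class ErrorKind(Enum):
    RATE_LIMITED = "rate_limited"
    TRANSIENT = "transient"
    FATAL = "fatal"


def classify_http_error(status_code: int) -> ErrorKind:
    """Default mapping from an HTTP status code to a normalized kind."""
    if status_code == 429:
        return ErrorKind.RATE_LIMITED
    if status_code == 408 or 500 <= status_code < 600:
        return ErrorKind.TRANSIENT
    return ErrorKind.FATAL


# Per-provider overrides keyed by (provider, provider error code).
# Illustrative only: fill this in from each provider's own documentation
# and from what your integration tests actually observe.
PROVIDER_OVERRIDES: Dict[Tuple[str, Optional[str]], ErrorKind] = {
    ("aws", "ThrottlingException"): ErrorKind.RATE_LIMITED,
}


def classify(
    provider: str,
    status_code: int,
    error_code: Optional[str] = None,
) -> ErrorKind:
    """Prefer a provider-specific rule, then fall back to the HTTP status."""
    override = PROVIDER_OVERRIDES.get((provider, error_code))
    if override is not None:
        return override
    return classify_http_error(status_code)


def should_retry(kind: ErrorKind) -> bool:
    """Retry rate limits and transient faults; surface everything else."""
    return kind in (ErrorKind.RATE_LIMITED, ErrorKind.TRANSIENT)


print(classify("azure", 429), classify("gcp", 400))
```

Keeping the override table explicit makes the differences between providers visible in code review instead of hiding them inside a generic retry loop.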

Staff Engineer Perspective

“We ran our first real failover drill after eight months of confidently telling leadership we were multi-region. The primary went ‘down’ at 10am. Failover kicked in cleanly—then p95 latency spiked from 800ms to 14 seconds and stayed there for 40 minutes. Reason: the standby region had cold model caches, cold vector index pages, and no warm DB connection pool. The runbook said ‘traffic shifts in 90 seconds.’ Reality was ‘recovery completes in 40 minutes.’ Now we run a 5 percent shadow load to every standby region 24/7. It costs us about $8K/month and saves the postmortem we don’t want to write.”

— SRE Lead at a fintech infrastructure company
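The shadow-load idea in the quote can be sketched as a wrapper that serves every request from the primary and copies a sampled fraction to the standby, discarding the standby's answer. This is a simplified illustration: ShadowMirror and its parameters are hypothetical, the providers are plain callables, and the mirror call is synchronous here, whereas a real deployment would send it in the background so it never adds user-facing latency.

```python
import random
from typing import Callable, Dict, List, Optional

Messages = List[Dict[str, str]]
ChatFn = Callable[[Messages], str]


class ShadowMirror:
    """Serve from the primary; copy a sampled fraction of traffic to standby."""

    def __init__(
        self,
        primary: ChatFn,
        standby: ChatFn,
        fraction: float = 0.05,
        rng: Optional[random.Random] = None,
    ):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0 and 1")
        self.primary = primary
        self.standby = standby
        self.fraction = fraction
        self.rng = rng or random.Random()
        self.mirrored = 0
        self.shadow_errors = 0

    def chat(self, messages: Messages) -> str:
        result = self.primary(messages)
        # random() is in [0, 1), so fraction=1.0 mirrors everything
        # and fraction=0.0 mirrors nothing.
        if self.rng.random() < self.fraction:
            try:
                self.standby(messages)  # response is intentionally discarded
                self.mirrored += 1
            except Exception:
                # A standby failure must never affect the user's response
                self.shadow_errors += 1
        return result


def primary_fn(messages: Messages) -> str:
    return "primary-answer"


def standby_fn(messages: Messages) -> str:
    return "standby-answer"


mirror = ShadowMirror(primary_fn, standby_fn, fraction=1.0)
reply = mirror.chat([{"role": "user", "content": "ping"}])
print(reply, mirror.mirrored)
```

The shadow_errors counter is the useful signal: if it climbs while the primary is healthy, the standby would not have survived a real failover.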

Custom Abstraction Layer:

# llm_abstraction.py
from abc import ABC, abstractmethod
from typing import List, Dict, Optional
import json
import os  # used by MultiCloudRouter to read configuration
import boto3
from openai import AzureOpenAI
import vertexai
from vertexai.generative_models import GenerativeModel

class LLMProvider(ABC):
    """Abstract base class for LLM providers."""

    @abstractmethod
    def chat(
        self,
        messages: List[Dict[str, str]],
        temperature: float = 0.7,
        max_tokens: int = 1000
    ) -> Dict:
        """Chat completion."""
        pass

    @abstractmethod
    def embed(self, texts: List[str]) -> List[List[float]]:
        """Get embeddings."""
        pass

class AWSBedrockProvider(LLMProvider):
    """AWS Bedrock implementation."""

    def __init__(self, region: str = "us-east-1"):
        self.client = boto3.client('bedrock-runtime', region_name=region)

    def chat(self, messages, temperature=0.7, max_tokens=1000):
        import json

        response = self.client.invoke_model(
            modelId="anthropic.claude-sonnet-4-6-20250929-v1:0",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": messages,
                "temperature": temperature
            })
        )

        result = json.loads(response['body'].read())

        return {
            "content": result['content'][0]['text'],
            "usage": result['usage']
        }

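    def embed(self, texts):
        import json

        # This Titan model takes one input text per request,
        # so embed each text in the batch separately.
        embeddings = []
        for text in texts:
            response = self.client.invoke_model(
                modelId="amazon.titan-embed-text-v1",
                body=json.dumps({"inputText": text})
            )
            result = json.loads(response['body'].read())
            embeddings.append(result['embedding'])

        return embeddings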

class AzureOpenAIProvider(LLMProvider):
    """Azure OpenAI implementation."""

    def __init__(self, endpoint: str, api_key: str):
        self.client = AzureOpenAI(
            azure_endpoint=endpoint,
            api_key=api_key,
            api_version="2024-02-15-preview"
        )

    def chat(self, messages, temperature=0.7, max_tokens=1000):
        response = self.client.chat.completions.create(
            model="gpt-5",  # deployment name
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )

        return {
            "content": response.choices[0].message.content,
            "usage": response.usage.model_dump()
        }

    def embed(self, texts):
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )

        return [item.embedding for item in response.data]

class VertexAIProvider(LLMProvider):
    """Vertex AI implementation."""

    def __init__(self, project_id: str, location: str = "us-central1"):
        vertexai.init(project=project_id, location=location)
        self.model = GenerativeModel("gemini-3.1-pro")

    def chat(self, messages, temperature=0.7, max_tokens=1000):
        # Convert to Gemini format
        prompt = "\n".join([f"{m['role']}: {m['content']}" for m in messages])

        response = self.model.generate_content(
            prompt,
            generation_config={
                "temperature": temperature,
                "max_output_tokens": max_tokens
            }
        )

        return {
            "content": response.text,
            "usage": {}  # Vertex doesn't return usage in same format
        }

    def embed(self, texts):
        from vertexai.language_models import TextEmbeddingModel

        model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
        embeddings = model.get_embeddings(texts)

        return [emb.values for emb in embeddings]

class MultiCloudRouter:
    """Route requests across providers."""

    def __init__(self):
        self.providers = {
            "aws": AWSBedrockProvider(),
            "azure": AzureOpenAIProvider(
                endpoint=os.getenv("AZURE_ENDPOINT"),
                api_key=os.getenv("AZURE_KEY")
            ),
            "gcp": VertexAIProvider(project_id=os.getenv("GCP_PROJECT"))
        }

        self.default_provider = "aws"

    def chat(
        self,
        messages: List[Dict],
        provider: Optional[str] = None,
        **kwargs
    ) -> Dict:
        """Chat with optional provider override."""

        provider_name = provider or self.default_provider
        llm = self.providers[provider_name]

        try:
            result = llm.chat(messages, **kwargs)
            result["provider"] = provider_name
            return result
        except Exception as e:
            # Fallback to next provider
            print(f"{provider_name} failed: {e}")

            for name, fallback_llm in self.providers.items():
                if name != provider_name:
                    try:
                        result = fallback_llm.chat(messages, **kwargs)
                        result["provider"] = name
                        result["fallback"] = True
                        return result
                    except Exception as fallback_error:
                        # Catch Exception, not everything, so Ctrl-C still works
                        print(f"{name} failed: {fallback_error}")
                        continue

            raise RuntimeError("All providers failed") from e

# Usage
router = MultiCloudRouter()

# Use default provider (AWS)
response = router.chat([
    {"role": "user", "content": "Explain Kubernetes"}
])

# Override to Azure
response = router.chat(
    messages=[{"role": "user", "content": "Write a haiku about clouds"}],
    provider="azure"
)

print(f"Response from {response['provider']}: {response['content']}")

Disaster Recovery Patterns

Active-Passive DR:

# dr_pattern.py
import time
from typing import Dict, List

from llm_abstraction import LLMProvider  # base class from llm_abstraction.py

class ActivePassiveLLM:
    """
    Active-passive DR pattern.

    Primary provider handles all traffic.
    Automatic failover to secondary on failure.
    """

    def __init__(
        self,
        primary_provider: LLMProvider,
        secondary_provider: LLMProvider,
        health_check_interval: int = 60
    ):
        self.primary = primary_provider
        self.secondary = secondary_provider
        self.health_check_interval = health_check_interval

        self.primary_healthy = True
        self.last_health_check = time.time()

    def chat(self, messages: List[Dict], **kwargs) -> Dict:
        """Chat with automatic failover."""

        # Health check
        if time.time() - self.last_health_check > self.health_check_interval:
            self._health_check()

        # Try primary
        if self.primary_healthy:
            try:
                result = self.primary.chat(messages, **kwargs)
                result["provider"] = "primary"
                return result
            except Exception as e:
                print(f"Primary failed: {e}, failing over to secondary")
                self.primary_healthy = False

        # Use secondary
        result = self.secondary.chat(messages, **kwargs)
        result["provider"] = "secondary"
        result["failover"] = True

        return result

    def _health_check(self):
        """Check if primary is healthy."""
        try:
            # Simple ping
            self.primary.chat(
                [{"role": "user", "content": "ping"}],
                max_tokens=5
            )
            self.primary_healthy = True
        except Exception:
            self.primary_healthy = False

        self.last_health_check = time.time()

# Usage
dr_llm = ActivePassiveLLM(
    primary_provider=AWSBedrockProvider(),
    secondary_provider=AzureOpenAIProvider(endpoint="...", api_key="...")
)

response = dr_llm.chat([{"role": "user", "content": "Hello"}])
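The class above marks the primary unhealthy after a single failed request, which can cause needless failovers on one-off errors. A common refinement is to fail over only after several consecutive failures and to retry the primary after a cooldown. The following is a minimal sketch of that idea; FailureThresholdBreaker and its parameters are hypothetical, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Optional


class FailureThresholdBreaker:
    """Fail over after N consecutive failures; retry primary after a cooldown."""

    def __init__(
        self,
        threshold: int = 3,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def allow_primary(self) -> bool:
        """True if the next request should go to the primary."""
        if self._opened_at is None:
            return True
        # After the cooldown, let a trial request through
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.threshold:
            # (Re)start the cooldown from the latest failure
            self._opened_at = self._clock()


# Demonstration with a fake clock
now = [0.0]
breaker = FailureThresholdBreaker(
    threshold=2, cooldown_seconds=10.0, clock=lambda: now[0]
)
breaker.record_failure()
first = breaker.allow_primary()    # one failure: still on primary
breaker.record_failure()
second = breaker.allow_primary()   # two failures: fail over
now[0] = 10.0
third = breaker.allow_primary()    # cooldown elapsed: trial request allowed
print(first, second, third)
```

In ActivePassiveLLM this would replace the primary_healthy flag: call allow_primary() before trying the primary, and record_success() or record_failure() after each attempt.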

Active-Active with Load Balancing:

# load_balancer.py
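import random
from collections import defaultdict
import time
from typing import Dict, List, Optional

from llm_abstraction import LLMProvider  # base class from llm_abstraction.py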

class LoadBalancedLLM:
    """
    Active-active load balancing.

    Distribute traffic across multiple providers.
    """

    def __init__(self, providers: Dict[str, LLMProvider]):
        self.providers = providers

        # Track metrics per provider
        self.metrics = defaultdict(lambda: {
            "requests": 0,
            "errors": 0,
            "latency": []
        })

    def chat(
        self,
        messages: List[Dict],
        strategy: str = "round-robin",
        **kwargs
    ) -> Dict:
        """
        Chat with load balancing.

        Strategies:
        - round-robin: Distribute evenly
        - random: Random selection
        - least-latency: Route to fastest provider
        - weighted: Based on capacity/cost
        """

        if strategy == "round-robin":
            provider_name = self._round_robin()
        elif strategy == "random":
            provider_name = random.choice(list(self.providers.keys()))
        elif strategy == "least-latency":
            provider_name = self._least_latency()
        else:
            provider_name = self._weighted()

        provider = self.providers[provider_name]

        # Execute with metrics. Count every attempt, including failures,
        # so round-robin keeps advancing and error_rate stays within 0..1.
        self.metrics[provider_name]["requests"] += 1
        start = time.time()
        try:
            result = provider.chat(messages, **kwargs)
            latency = time.time() - start

            # Track latency for successful calls only
            self.metrics[provider_name]["latency"].append(latency)

            result["provider"] = provider_name
            result["latency"] = latency

            return result

        except Exception:
            self.metrics[provider_name]["errors"] += 1

            # Retry with different provider
            providers_to_try = [p for p in self.providers.keys() if p != provider_name]

            for retry_provider in providers_to_try:
                try:
                    result = self.providers[retry_provider].chat(messages, **kwargs)
                    result["provider"] = retry_provider
                    result["retry"] = True
                    return result
                except Exception:
                    continue

            # Every provider failed: re-raise the original error
            raise

    def _round_robin(self) -> str:
        """Simple round-robin."""
        total_requests = sum(m["requests"] for m in self.metrics.values())
        provider_idx = total_requests % len(self.providers)
        return list(self.providers.keys())[provider_idx]

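    def _least_latency(self) -> str:
        """Route to provider with lowest average latency."""
        avg_latencies = {}

        # Iterate over all configured providers, not only those already in
        # self.metrics; otherwise the first provider used is chosen forever.
        for name in self.providers:
            metrics = self.metrics[name]
            if metrics["latency"]:
                avg_latencies[name] = sum(metrics["latency"]) / len(metrics["latency"])
            elif metrics["errors"] == 0:
                avg_latencies[name] = 0  # untried: give it a first request
            else:
                avg_latencies[name] = float("inf")  # only failures so far

        return min(avg_latencies, key=avg_latencies.get)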

    def _weighted(self, weights: Optional[Dict[str, float]] = None) -> str:
        """Weighted selection based on capacity or cost."""

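        if not weights:
            # Default weights (example: prefer cheaper providers)
            weights = {
                "gcp": 0.5,  # Cheapest, 50% of traffic
                "azure": 0.3,
                "aws": 0.2
            }

        # Keep only providers that are actually configured; an unknown
        # name would otherwise raise KeyError in chat().
        weights = {n: w for n, w in weights.items() if n in self.providers}
        if not weights:
            return random.choice(list(self.providers.keys()))

        return random.choices(
            list(weights.keys()),
            weights=list(weights.values())
        )[0]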

    def get_metrics(self) -> Dict:
        """Get performance metrics."""

        summary = {}
        for provider, metrics in self.metrics.items():
            avg_latency = (
                sum(metrics["latency"]) / len(metrics["latency"])
                if metrics["latency"] else 0
            )

            error_rate = (
                metrics["errors"] / metrics["requests"]
                if metrics["requests"] > 0 else 0
            )

            summary[provider] = {
                "requests": metrics["requests"],
                "errors": metrics["errors"],
                "error_rate": error_rate,
                "avg_latency": avg_latency
            }

        return summary

# Usage
lb = LoadBalancedLLM({
    "aws": AWSBedrockProvider(),
    "azure": AzureOpenAIProvider(endpoint="...", api_key="..."),
    "gcp": VertexAIProvider(project_id="...")
})

# Send requests
for _ in range(100):
    response = lb.chat(
        [{"role": "user", "content": "Hello"}],
        strategy="least-latency"
    )

# Check metrics
print(lb.get_metrics())
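The load balancer above keeps every latency sample in an unbounded list, so memory grows with traffic and old measurements weigh as much as recent ones. A bounded window is one possible alternative. The sketch below assumes a hypothetical LatencyWindow helper built on collections.deque; the window size of 100 is an arbitrary default.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class LatencyWindow:
    """Keep only the most recent latency samples per provider."""

    def __init__(self, max_samples: int = 100):
        self._max = max_samples
        self._samples: Dict[str, Deque[float]] = {}

    def record(self, provider: str, latency: float) -> None:
        # deque(maxlen=...) silently drops the oldest sample when full
        window = self._samples.setdefault(provider, deque(maxlen=self._max))
        window.append(latency)

    def average(self, provider: str) -> Optional[float]:
        window = self._samples.get(provider)
        if not window:
            return None
        return sum(window) / len(window)

    def fastest(self, providers: List[str]) -> str:
        """Prefer untried providers, then the lowest recent average."""
        for name in providers:
            if self.average(name) is None:
                return name
        return min(providers, key=self.average)


window = LatencyWindow(max_samples=2)
for sample in (1.0, 3.0, 5.0):
    window.record("aws", sample)

# Only the last two samples (3.0 and 5.0) remain in the window
print(window.average("aws"))
print(window.fastest(["aws", "gcp"]))
```

Because the window forgets old samples, a provider that was slow an hour ago gets a fair chance again once its recent requests are fast.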

Cloud-Native Patterns

Serverless Inference

AWS Lambda with Bedrock:

# lambda_handler.py
import json
import boto3
import os

bedrock = boto3.client('bedrock-runtime')

def lambda_handler(event, context):
    """
    Serverless LLM inference.

    Benefits:
    - No idle costs
    - Auto-scaling
    - Simple deployment

    Limitations:
    - 15 min max runtime
    - Cold starts (~1-2s)
    - No GPU access
    """

    body = json.loads(event['body'])
    prompt = body['prompt']

    response = bedrock.invoke_model(
        modelId="anthropic.claude-sonnet-4-6-20250929-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}]
        })
    )

    result = json.loads(response['body'].read())

    return {
        'statusCode': 200,
        'body': json.dumps({
            'response': result['content'][0]['text']
        })
    }
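The handler above assumes a well-formed JSON body with a prompt field. Behind an API Gateway proxy integration the body can be missing or base64-encoded, so a small validation step that returns a 400 response is often worthwhile. This is a sketch under those assumptions; parse_prompt and _error are hypothetical helper names.

```python
import base64
import json
from typing import Any, Dict, Optional, Tuple


def _error(status: int, message: str) -> Dict[str, Any]:
    """Build a proxy-style error response."""
    return {"statusCode": status, "body": json.dumps({"error": message})}


def parse_prompt(
    event: Dict[str, Any],
) -> Tuple[Optional[str], Optional[Dict[str, Any]]]:
    """Return (prompt, None) on success or (None, error_response) on bad input."""
    raw = event.get("body")
    if raw is None:
        return None, _error(400, "missing request body")

    if event.get("isBase64Encoded"):
        try:
            raw = base64.b64decode(raw).decode("utf-8")
        except ValueError:
            # Covers both invalid base64 and invalid UTF-8
            return None, _error(400, "body is not valid base64-encoded UTF-8")

    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        return None, _error(400, "body is not valid JSON")

    prompt = body.get("prompt") if isinstance(body, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return None, _error(400, "'prompt' must be a non-empty string")

    return prompt, None


ok_prompt, ok_error = parse_prompt({"body": json.dumps({"prompt": "Hello"})})
print(ok_prompt, ok_error)
```

Inside lambda_handler, call parse_prompt(event) first and return the error response immediately when it is not None, so malformed requests never reach Bedrock.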

# Deploy with SAM or Serverless Framework
# serverless.yml
"""
service: llm-api

provider:
  name: aws
  runtime: python3.11
  region: us-east-1
  iam:
    role:
      statements:
        - Effect: Allow
          Action:
            - bedrock:InvokeModel
          Resource: '*'

functions:
  chat:
    handler: lambda_handler.lambda_handler
    timeout: 60
    events:
      - http:
          path: chat
          method: post
"""

GCP Cloud Functions with Vertex AI:

# main.py
import functions_framework
from vertexai.generative_models import GenerativeModel
import vertexai
import os

# Initialize outside handler for warm starts
vertexai.init(project=os.getenv("GCP_PROJECT"), location="us-central1")
model = GenerativeModel("gemini-3.1-pro")

@functions_framework.http
def chat(request):
    """
    Cloud Function for Vertex AI.

    Warm start optimization: Initialize model globally.
    """

    request_json = request.get_json()
    prompt = request_json['prompt']

    response = model.generate_content(prompt)

    return {'response': response.text}

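# Deploy
# Note: --allow-unauthenticated makes the endpoint public. Drop the flag
# and require authenticated callers for anything beyond a quick test.
# gcloud functions deploy chat \
#   --runtime python311 \
#   --trigger-http \
#   --allow-unauthenticated \
#   --region us-central1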

Azure Functions with OpenAI:

# function_app.py
import azure.functions as func
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import os

app = func.FunctionApp()

# Warm start: Initialize outside handler
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential,
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_ad_token_provider=token_provider,
    api_version="2024-02-15-preview"
)

@app.function_name(name="chat")
@app.route(route="chat", methods=["POST"])
def chat(req: func.HttpRequest) -> func.HttpResponse:
    """Azure Function with managed identity."""

    req_body = req.get_json()
    prompt = req_body['prompt']

    response = client.chat.completions.create(
        model=os.getenv("DEPLOYMENT_NAME"),
        messages=[{"role": "user", "content": prompt}]
    )

    return func.HttpResponse(
        response.choices[0].message.content,
        mimetype="text/plain"
    )

# Deploy
# func azure functionapp publish <app-name>

Container-Based Deployment

AWS ECS with Bedrock:

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# main.py
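from fastapi import FastAPI
from pydantic import BaseModel
import boto3
import json
import os

app = FastAPI()

bedrock = boto3.client(
    'bedrock-runtime',
    region_name=os.getenv('AWS_REGION', 'us-east-1')
)

class ChatRequest(BaseModel):
    """JSON request body: {"prompt": "..."}"""
    prompt: str

# Plain `def` rather than `async def`: boto3 calls block, so FastAPI runs
# this handler in its thread pool instead of stalling the event loop.
@app.post("/chat")
def chat(request: ChatRequest):
    """
    FastAPI app running on ECS.

    Benefits over Lambda:
    - No runtime limits
    - Keep connections warm
    - More memory/CPU options
    """
    prompt = request.prompt
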
    response = bedrock.invoke_model(
        modelId="anthropic.claude-sonnet-4-6-20250929-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}]
        })
    )

    result = json.loads(response['body'].read())

    return {"response": result['content'][0]['text']}

# Health check for ECS
@app.get("/health")
async def health():
    return {"status": "healthy"}

# ecs-task-definition.json
{
  "family": "llm-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "llm-api",
      "image": "<account-id>.dkr.ecr.us-east-1.amazonaws.com/llm-api:latest",
      "portMappings": [
        {
          "containerPort": 8000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "AWS_REGION",
          "value": "us-east-1"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/llm-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

GKE with Vertex AI:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      serviceAccountName: vertex-ai-sa  # Kubernetes service account, bound to a GCP service account via Workload Identity
      containers:
      - name: llm-api
        image: gcr.io/project-id/llm-api:latest
        ports:
        - containerPort: 8000
        env:
        - name: GCP_PROJECT
          value: "my-project"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: llm-api
spec:
  selector:
    app: llm-api
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

GPU Instances and Spot Pricing

Self-Hosted with Spot Instances:

# spot_instance_manager.py
import boto3
from typing import Dict

class SpotInstanceManager:
    """
    Manage EC2 spot instances for self-hosted LLM inference.

    Use case: Host your own LLM (Llama, Mistral) on GPU spot instances.
    Savings: often 60-70% compared to on-demand; spot prices fluctuate.
    """

    def __init__(self):
        self.ec2 = boto3.client('ec2')

    def request_spot_instance(
        self,
        instance_type: str = "g5.xlarge",  # 1x A10G GPU
        max_price: str = "0.50",  # USD per hour
        ami_id: str = "ami-xxxxx"  # Custom AMI with vLLM
    ) -> str:
        """
        Request spot instance for LLM hosting.

        Instance types:
        - g5.xlarge: 1x A10G (24GB) - ~$0.40/hr spot vs $1.01 on-demand
        - g5.2xlarge: 1x A10G (24GB), more CPU/RAM
        - g5.12xlarge: 4x A10G (96GB) - for larger models
        - p4d.24xlarge: 8x A100 (320GB) - for huge models
        """

        user_data = """#!/bin/bash
# Install vLLM and start server
pip install vllm
python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-4-Scout-17B \
    --port 8000 \
    --gpu-memory-utilization 0.9
"""

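        # request_spot_instances expects base64-encoded user data
        # (unlike run_instances, boto3 does not encode it for you).
        import base64
        encoded_user_data = base64.b64encode(
            user_data.encode("utf-8")
        ).decode("ascii")

        response = self.ec2.request_spot_instances(
            SpotPrice=max_price,
            InstanceCount=1,
            LaunchSpecification={
                'ImageId': ami_id,
                'InstanceType': instance_type,
                'KeyName': 'my-key',
                'SecurityGroupIds': ['sg-xxxxx'],
                'UserData': encoded_user_data,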
                'IamInstanceProfile': {
                    'Name': 'EC2-vLLM-Role'
                },
                'BlockDeviceMappings': [
                    {
                        'DeviceName': '/dev/sda1',
                        'Ebs': {
                            'VolumeSize': 100,
                            'VolumeType': 'gp3'
                        }
                    }
                ]
            }
        )

        request_id = response['SpotInstanceRequests'][0]['SpotInstanceRequestId']

        print(f"Spot request: {request_id}")

        return request_id

    def handle_spot_interruption(self):
        """
        Handle spot instance interruption.

        AWS gives 2-minute warning before terminating spot instance.
        Use this to gracefully drain connections.
        """

        # Run this in background on instance
        script = """
import requests
import time
import signal
import sys

def graceful_shutdown(signum, frame):
    print("Spot interruption detected, draining...")
    # Stop accepting new requests
    # Finish existing requests
    # Save state if needed
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)

while True:
    try:
        # Check for interruption notice
        r = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            timeout=1
        )
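        if r.status_code == 200:
            graceful_shutdown(None, None)
    except requests.RequestException:
        # Catch only request errors: a bare except would also swallow the
        # SystemExit raised by graceful_shutdown and keep the loop running.
        pass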

    time.sleep(5)
"""

        return script

# Usage
manager = SpotInstanceManager()

# Request spot instance
request_id = manager.request_spot_instance(
    instance_type="g5.2xlarge",
    max_price="0.80"
)

# Cost comparison (illustrative; spot prices vary by region and over time):
# On-demand: $1.21/hr * 720 hrs/month = $871/month
# Spot: $0.40/hr * 720 hrs/month = $288/month
# Savings: $583/month (67%)
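The arithmetic in the comparison above is easy to reproduce for other instance types or prices. The helper below is a small sketch; monthly_cost and spot_savings are hypothetical names, the 720-hour month follows the figures above, and the hourly rates are the same illustrative numbers, not current quotes.

```python
from typing import Dict


def monthly_cost(hourly_rate: float, hours: float = 720) -> float:
    """Cost of running one instance for `hours` at a flat hourly rate."""
    return hourly_rate * hours


def spot_savings(
    on_demand_hourly: float,
    spot_hourly: float,
    hours: float = 720,
) -> Dict[str, float]:
    """Compare on-demand and spot cost over the same number of hours."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    on_demand = monthly_cost(on_demand_hourly, hours)
    spot = monthly_cost(spot_hourly, hours)
    saved = on_demand - spot
    return {
        "on_demand": on_demand,
        "spot": spot,
        "saved": saved,
        "percent": 100 * saved / on_demand,
    }


comparison = spot_savings(on_demand_hourly=1.21, spot_hourly=0.40)
print({key: round(value, 2) for key, value in comparison.items()})
```

The calculation ignores interruptions: time lost to restarts and model reloads after a spot reclaim reduces the effective saving, so treat the percentage as an upper bound.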

Auto-Scaling Patterns

AWS Bedrock Provisioned Throughput (manual scaling):

Bedrock Provisioned Throughput is not currently a registered Application Auto Scaling target: AWS supports no ServiceNamespace='bedrock' / ScalableDimension='bedrock:model:ProvisionedThroughput' combination. Capacity is purchased in Model Units (MUs), either without a commitment or on one-month or six-month commitments, so most teams scale Bedrock manually based on CloudWatch metrics, or rely on on-demand inference for elasticity and reserve Provisioned Throughput only for steady baseline load.

# bedrock_capacity.py
import boto3

bedrock = boto3.client('bedrock')

# Create provisioned throughput (Model Units, not auto-scaled)
response = bedrock.create_provisioned_model_throughput(
    modelUnits=2,  # baseline capacity
    provisionedModelName='claude-sonnet-prod',
    modelId='anthropic.claude-sonnet-4-6-20250929-v1:0',
    commitmentDuration='OneMonth'  # or 'SixMonths', omit for no commitment
)
provisioned_arn = response['provisionedModelArn']

# To "scale", create a new provisioned model with more MUs and migrate
# traffic, or fall back to on-demand inference during spikes. Watch
# CloudWatch metrics like InvocationThrottles to decide when to expand.

For workloads that need true elastic scaling, prefer on-demand inference (throttled per account quota) or host the model on SageMaker endpoints, which do support Application Auto Scaling target tracking via the sagemaker service namespace.
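Since scaling here is a manual decision, it helps to write the decision rule down. The sketch below flags when throttling has stayed above a threshold for several consecutive monitoring windows. Everything in it is an assumption for illustration: the function names, the 2% threshold, the three-window rule, and the treatment of successful invocations and throttles as two separate counts that you have already pulled from your metrics.

```python
from typing import Sequence, Tuple


def throttle_ratio(invocations: int, throttles: int) -> float:
    """Share of attempted requests that were throttled in one window.

    Assumes the two counts are disjoint (successful vs. throttled).
    """
    total = invocations + throttles
    return throttles / total if total else 0.0


def should_expand(
    windows: Sequence[Tuple[int, int]],
    threshold: float = 0.02,
    sustained: int = 3,
) -> bool:
    """True when the last `sustained` windows all exceed the threshold.

    Each window is an (invocations, throttles) pair, oldest first.
    Requiring several windows in a row avoids buying committed
    capacity because of a single short spike.
    """
    if len(windows) < sustained:
        return False
    recent = windows[-sustained:]
    return all(throttle_ratio(i, t) > threshold for i, t in recent)


history = [(1000, 0), (900, 50), (900, 60), (800, 90)]
print(should_expand(history))
```

A rule like this belongs in an alert rather than an automated action, because expanding Provisioned Throughput may involve a purchase commitment.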

Kubernetes HPA for LLM Workloads:

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 20
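  metrics:
  # CPU is a weak signal for pods that mostly wait on upstream LLM calls;
  # treat it as a safety net alongside the request-rate metric below.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Pods metrics need a custom metrics adapter (e.g., Prometheus Adapter)
  # that exposes http_requests_per_second through the custom metrics API.
  - type: Pods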
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
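To reason about how this manifest behaves, it helps to know the core HPA calculation: for each metric, desiredReplicas = ceil(currentReplicas × currentValue / targetValue), and with several metrics the largest proposal wins. The sketch below reproduces that calculation with the default 10% tolerance. It deliberately ignores the `behavior` policies and stabilization windows, which further limit how fast the replica count may change.

```python
import math
from typing import Iterable


def desired_replicas(current_replicas: int,
                     metric_ratios: Iterable[float],
                     min_replicas: int = 2,
                     max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation.

    metric_ratios holds current/target for each configured metric,
    e.g. 85% CPU against a 70% target gives 85 / 70.
    """
    proposals = []
    for ratio in metric_ratios:
        if abs(ratio - 1.0) <= tolerance:
            # Within tolerance: this metric proposes no change
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # The largest proposal wins, then min/max bounds are applied
    return max(min_replicas, min(max_replicas, max(proposals)))


# 4 pods, CPU at 85% (target 70), 150 req/s per pod (target 100)
scaled = desired_replicas(4, [85 / 70, 150 / 100])
print(scaled)
```

In this example the request-rate metric drives the decision, which is typical for LLM API pods whose CPU usage stays low while they wait on upstream responses.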

Security & Compliance

VPC and Private Endpoints

AWS Private VPC for Bedrock:

# Use Bedrock from private subnet without internet access

# 1. Create VPC endpoint for Bedrock (via AWS Console or CloudFormation)
# Type: Interface
# Service: com.amazonaws.us-east-1.bedrock-runtime

# 2. Use from private subnet
import boto3

bedrock = boto3.client(
    'bedrock-runtime',
    region_name='us-east-1',
    # Format: https://<vpce-id>.bedrock-runtime.<region>.vpce.amazonaws.com
    # If Private DNS is enabled on the endpoint, you can omit endpoint_url
    # and use the standard service DNS automatically.
    endpoint_url='https://vpce-xxxxxxxxxxxxxxxxx.bedrock-runtime.us-east-1.vpce.amazonaws.com'
)

# Traffic never leaves AWS network
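Step 1 can also be scripted instead of done in the console. The following sketch builds the arguments for the EC2 `create_vpc_endpoint` call; the VPC, subnet, and security group IDs are placeholders you would replace with your own, and enabling Private DNS is an assumption that lets clients keep the default service hostname.

```python
from typing import Dict, List


def bedrock_endpoint_request(vpc_id: str,
                             subnet_ids: List[str],
                             security_group_ids: List[str],
                             region: str = "us-east-1") -> Dict:
    """Build kwargs for ec2.create_vpc_endpoint (interface endpoint)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.bedrock-runtime",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # With Private DNS on, boto3 clients need no endpoint_url override
        "PrivateDnsEnabled": True,
    }


def create_endpoint(request: Dict, region: str = "us-east-1") -> str:
    """Create the endpoint and return its ID (requires AWS credentials)."""
    import boto3  # imported lazily so the builder runs without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_vpc_endpoint(**request)
    return response["VpcEndpoint"]["VpcEndpointId"]


# Placeholder IDs for illustration only
request = bedrock_endpoint_request(
    "vpc-0example", ["subnet-0example"], ["sg-0example"]
)
print(request["ServiceName"])
```

The security group attached to the endpoint should allow inbound HTTPS (port 443) from the subnets that call Bedrock.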

Azure Private Link:

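# Azure OpenAI with private endpoint

# 1. Create private endpoint (Azure Portal or ARM)
# 2. Configure DNS (privatelink.openai.azure.com private DNS zone linked to the VNet)
# 3. Use from VNet

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    # Keep the standard resource hostname. Inside the VNet, DNS resolves it
    # to the private endpoint IP through the privatelink zone, so do not
    # call the privatelink hostname directly.
    azure_endpoint="https://my-resource.openai.azure.com/",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview"
)

# Traffic stays within Azure backbone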

IAM and Service Accounts

AWS IAM Best Practices:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockInvokeOnly",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-6-20250929-v1:0"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    },
    {
      "Sid": "DenyDataLogging",
      "Effect": "Deny",
      "Action": [
        "bedrock:PutModelInvocationLoggingConfiguration"
      ],
      "Resource": "*"
    }
  ]
}

GCP Service Account:

# Create service account with minimal permissions

import os

from google.cloud import iam_admin_v1

client = iam_admin_v1.IAMClient()

# Create service account
service_account = client.create_service_account(
    request={
        "name": "projects/my-project",
        "account_id": "vertex-ai-sa",
        "service_account": {
            "display_name": "Vertex AI Service Account"
        }
    }
)

# Grant only necessary roles
# - roles/aiplatform.user (use Vertex AI)
# - roles/storage.objectViewer (read from GCS)
# - NO roles/owner or roles/editor

# Use from application. Prefer an attached service account or Workload
# Identity where available; a downloaded key file is a long-lived secret.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/sa-key.json"

Data Residency

AWS Regions:

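# EU data residency
import boto3

bedrock_eu = boto3.client('bedrock-runtime', region_name='eu-west-1')

# Direct model invocations are processed in eu-west-1. Cross-region
# inference profiles can route to other regions, so check a profile's
# region list before relying on it for residency.
# Region pinning supports GDPR compliance but does not establish it alone.

# Available Bedrock regions (model availability varies by region):
# - us-east-1, us-west-2 (most models)
# - eu-west-1, eu-central-1 (EU data residency)
# - ap-southeast-1, ap-northeast-1 (Asia Pacific)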

Azure Geographies:

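# Azure OpenAI data residency
import os
from openai import AzureOpenAI

# Europe geography
client_eu = AzureOpenAI(
    azure_endpoint="https://my-resource-westeurope.openai.azure.com/",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview"
)

# With a regional deployment, data is processed in European data centers.
# Global deployment types may process requests in other geographies.
# Residency supports GDPR compliance but does not establish it alone.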
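Region pinning only helps if nothing in the request path can override it. One way to enforce that is a small guard at the routing layer that refuses to pick a region outside the caller's jurisdiction. This is a minimal sketch: the jurisdiction keys and the `ALLOWED_REGIONS` mapping are hypothetical and would come from your own compliance requirements, with region names taken from the list above.

```python
from typing import Dict, List, Set

# Hypothetical policy table: which regions may process each jurisdiction's data
ALLOWED_REGIONS: Dict[str, Set[str]] = {
    "EU": {"eu-west-1", "eu-central-1"},
    "US": {"us-east-1", "us-west-2"},
}


class ResidencyError(ValueError):
    """Raised when no permitted region is available for a request."""


def pick_region(jurisdiction: str, preferred: List[str]) -> str:
    """Return the first preferred region permitted for this jurisdiction."""
    allowed = ALLOWED_REGIONS.get(jurisdiction)
    if allowed is None:
        raise ResidencyError(f"No residency policy for {jurisdiction!r}")
    for region in preferred:
        if region in allowed:
            return region
    # Fail closed: never fall through to a region outside the policy
    raise ResidencyError(
        f"None of {preferred} is permitted for {jurisdiction!r}"
    )


region = pick_region("EU", ["us-east-1", "eu-central-1"])
print(region)
```

The guard fails closed, so a failover list that contains only non-EU regions produces an error for EU traffic rather than a silent cross-border call.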

Compliance Certifications

AWS Bedrock:

SOC 2 Type II
HIPAA eligible (BAA available)
ISO 27001
GDPR compliant
PCI DSS (for payment card data)

Azure OpenAI:

SOC 2 Type II
HIPAA/HITECH (BAA available)
ISO 27001, 27018
GDPR compliant
FedRAMP (US government)

GCP Vertex AI:

SOC 2 Type II
HIPAA compliant
ISO 27001
GDPR compliant
FedRAMP (in progress)

Enabling Compliance Features:

# AWS HIPAA example
# 1. Sign BAA with AWS
# 2. Use HIPAA-eligible services only
# 3. Enable encryption at rest and in transit
# 4. Enable audit logging

import boto3

# Enable CloudTrail for audit logs
cloudtrail = boto3.client('cloudtrail')

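cloudtrail.create_trail(
    Name='bedrock-audit-trail',
    S3BucketName='my-audit-bucket',
    IsMultiRegionTrail=True,
    EnableLogFileValidation=True
)

# create_trail only defines the trail; logging begins with a separate call
cloudtrail.start_logging(Name='bedrock-audit-trail')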

# Enable encryption for S3 buckets (for Knowledge Bases)
s3 = boto3.client('s3')

s3.put_bucket_encryption(
    Bucket='my-kb-bucket',
    ServerSideEncryptionConfiguration={
        'Rules': [
            {
                'ApplyServerSideEncryptionByDefault': {
                    'SSEAlgorithm': 'AES256'
                }
            }
        ]
    }
)

Comparison Table

Feature	AWS Bedrock	Azure OpenAI	GCP Vertex AI
Best Models	Claude Sonnet 4.6, Llama 4	GPT-5.5, GPT-5.4	Gemini 3.1 Pro, Gemini 3.5 Flash
Pricing (flagship tier, input/output)	Claude Sonnet 4.6: $3/$15 per 1M tokens	GPT-5.4: $2.50/$15 per 1M tokens	Gemini 3.1 Pro: $2/$12 per 1M tokens
Enterprise Auth	IAM roles	Microsoft Entra ID (formerly Azure AD), Managed Identity	GCP Service Accounts
Private Network	VPC Endpoints	Private Link	VPC Service Controls
Fine-tuning	Limited (Titan only)	Yes (GPT-5-mini, GPT-5)	Yes (PaLM, Gemini)
Multimodal	Claude 4.x (images)	GPT-5 Vision, DALL-E	Gemini 3.1 Pro, Imagen
RAG Support	Knowledge Bases	Azure AI Search	Vertex AI Search
Agents	Bedrock Agents	Azure AI Studio	Vertex AI Agents
Regions	6+ regions	10+ regions	20+ regions
Compliance	SOC2, HIPAA, GDPR	SOC2, HIPAA, FedRAMP	SOC2, HIPAA, GDPR
Best For	Claude users, AWS ecosystem	Enterprise, Microsoft stack	Google ecosystem, multimodal

Practical Exercises

Multi-Cloud RAG: Build a RAG system using Azure AI Search (formerly Azure Cognitive Search) for indexing, AWS Bedrock for embeddings, and GCP Vertex AI for generation. Implement fallback if any provider fails using LiteLLM’s routing capabilities.
Cost Optimizer: Create a cost optimization service that tracks usage across all three providers, routes requests to the cheapest provider for each task type, switches to provisioned throughput when usage crosses threshold, and provides daily cost reports.
Serverless Agent: Deploy an agent on AWS Lambda that uses Bedrock for reasoning, calls tools (weather API, database), stores conversation history in DynamoDB, and auto-scales based on traffic. Target: sub-3-second p95 latency, 1000 req/s capacity, <$100/month at low traffic.
HIPAA-Compliant Chat: Build a HIPAA-compliant medical chatbot on Azure with private endpoints, Azure OpenAI with BAA, encryption at rest/transit, audit logging, and PHI storage in encrypted Cosmos DB. Create a compliance checklist document.
Disaster Recovery Drill: Set up a multi-cloud failover system and conduct a disaster recovery drill. Simulate primary provider failure, measure failover time, document data loss (if any), and test failback procedures.

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] What is LiteLLM and when would you use it? What are its limitations?

Answer

LiteLLM provides a unified interface for calling 100+ LLM providers with the same OpenAI-compatible API. Use cases: (1) Switching between providers without code changes. (2) Building fallback chains (try Claude, fall back to GPT-4). (3) A/B testing different models. (4) Abstracting provider differences for teams.

Limitations: (1) Adds a dependency and potential point of failure. (2) May lag behind provider feature releases. (3) Some provider-specific features aren’t supported. (4) Debugging is harder—errors may be obscured. (5) Performance overhead (minimal but exists).

When to use: Multi-provider deployments, rapid prototyping, teams evaluating models. When to avoid: Single-provider production deployments (unnecessary abstraction), applications needing cutting-edge provider features.

Q2. [IC2] What’s the difference between a VPC endpoint and a private endpoint? When do you need each?

Answer

VPC endpoint (AWS): Connects your VPC to AWS services without traffic leaving Amazon’s network. Types: Gateway endpoints (S3, DynamoDB—free) and Interface endpoints (most services—costs per hour + data transfer). Use for: Keeping Bedrock traffic within AWS.

Private endpoint (Azure): Similar concept—connects your VNet to Azure services privately. Uses Azure Private Link. Use for: Keeping Azure OpenAI traffic within Azure.

You need these when: (1) Compliance requires no internet egress. (2) Security policy prohibits public API calls. (3) You need to reduce attack surface. (4) Data classification requires private networking.

Both add cost and complexity. Only implement if you have a concrete security or compliance requirement.

Q3. [Senior] Explain the tradeoffs between serverless (Lambda) and container-based (EKS) deployments for LLM applications.

Answer

Serverless (Lambda): Pros: Minimal ops, scales to zero automatically, pay-per-invocation. Cons: Cold starts (1-5s for LLM clients), 15-minute timeout, limited memory (10GB), connections survive only while an execution environment stays warm, harder to maintain state.

Containers (EKS): Pros: No cold starts, unlimited execution time, persistent connections to LLM APIs, more control over scaling. Cons: Operational overhead, minimum cost even at zero traffic, need to manage cluster.

Decision factors: (1) Traffic pattern: Bursty with long idle periods → serverless. Steady traffic → containers. (2) Latency requirements: p99 <500ms → containers (cold starts kill serverless). (3) Execution time: >15 minutes → containers. (4) Team expertise: No Kubernetes experience → serverless. (5) Cost optimization: Very low volume → serverless; high volume → containers.

Hybrid pattern: Use Lambda for simple, short-lived tasks (embeddings, classification). Use containers for complex agents and long conversations.

Q4. [Senior] A company wants to use Claude on Bedrock but keep all data in Europe for GDPR. What are their options?

Answer

Options: (1) Bedrock in EU regions: Check if Claude is available in Bedrock’s European regions (Frankfurt is eu-central-1, Ireland is eu-west-1). Availability varies by model. If available, this is simplest. (2) Self-hosted Claude: Not available; Anthropic doesn’t offer Claude for self-hosting. (3) API direct to Anthropic: Check Anthropic’s DPA and current data residency commitments to see whether EU processing is offered. May meet GDPR with appropriate contract. (4) Alternative model: Use a model available in EU regions (Titan, open-source on SageMaker, or Vertex AI in europe-west regions). (5) Data anonymization: Remove PII before sending to non-EU Claude, ensuring no personal data crosses borders.

GDPR considerations: It’s about data protection, not just location. With appropriate DPA, EU-to-US transfers may be legal under new EU-US Data Privacy Framework. Consult legal counsel for specific requirements.

Q5. [Staff] Design a disaster recovery architecture for a critical AI application. Requirements: RPO <1 hour, RTO <15 minutes, must survive complete cloud region failure.

Answer

Architecture: (1) Active-passive multi-region: Primary in us-east-1, standby in us-west-2. Async replication of state (conversation history, user data) with <1 hour lag. (2) Cross-region database: Use Aurora Global Database or Cosmos DB with multi-region writes. (3) Health monitoring: External health checks every 30 seconds. Automated failover when primary fails 3 consecutive checks. (4) DNS-based failover: Route 53 health checks with failover routing. TTL of 60 seconds. (5) LLM provider diversity: Primary uses Bedrock us-east-1, failover uses Azure OpenAI or direct Anthropic API. (6) State synchronization: Conversation history replicated to standby. On failover, users may see slight regression but can continue. (7) Failback procedure: Once primary recovers, sync any changes from standby, verify consistency, switch traffic back with gradual rollout.

RTO breakdown: Detection (1-2 min) + DNS propagation (1-2 min) + connection warmup (1-2 min) = ~5 minutes typical. RPO: Depends on replication lag—target <15 minutes with continuous sync.

Test quarterly: Conduct actual failover tests to validate assumptions. Document runbook.
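The RTO breakdown above is simple addition, which makes it easy to check against the 15-minute requirement and to see which term dominates. The sketch below encodes the answer's numbers (30-second checks, 3 consecutive failures, 60-second TTL); the 120-second warmup is an assumption taken from the upper end of the 1-2 minute estimate, and the calculation is approximate because it ignores alerting and human decision time.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    check_interval_s: int = 30   # external health check period
    failures_to_trip: int = 3    # consecutive failures before failover
    dns_ttl_s: int = 60          # clients may cache the old record this long
    warmup_s: int = 120          # assumed time for the standby to serve at full speed

    def detection_s(self) -> int:
        return self.check_interval_s * self.failures_to_trip

    def worst_case_rto_s(self) -> int:
        return self.detection_s() + self.dns_ttl_s + self.warmup_s

    def meets(self, rto_target_s: int) -> bool:
        return self.worst_case_rto_s() <= rto_target_s


budget = FailoverBudget()
print(budget.detection_s(), budget.worst_case_rto_s(), budget.meets(15 * 60))

# A cold standby that needs 40 minutes to warm caches breaks the target
cold_standby = FailoverBudget(warmup_s=40 * 60)
print(cold_standby.worst_case_rto_s(), cold_standby.meets(15 * 60))
```

Detection and DNS are bounded by configuration, so warmup is the term to measure in drills; it is the one most likely to be underestimated.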

Spot the Problem

Problem 1. [IC2] A developer’s multi-cloud fallback isn’t working:

def call_llm(prompt: str) -> str:
    try:
        return call_bedrock(prompt)
    except Exception:
        return call_azure_openai(prompt)

What’s wrong with this fallback implementation?

Answer

Issues: (1) Catches all exceptions—includes programming errors, not just provider issues. A bug in call_bedrock triggers fallback instead of failing fast. (2) No retry before fallback—single transient failure triggers expensive cross-cloud call. (3) No logging—you won’t know fallback is happening or why. (4) No timeout—if Bedrock hangs, you wait forever before fallback. (5) No circuit breaker—if Bedrock is down, every request tries it first before falling back.

Better implementation:

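# retry_with_backoff, ProviderUnavailable, RateLimitError, and logger are
# assumed to be defined elsewhere in the application.
def call_llm(prompt: str) -> str:
    try:
        # Retry transient failures first; call_bedrock should enforce its own timeout
        return retry_with_backoff(call_bedrock, prompt, max_retries=2)
    except (ProviderUnavailable, RateLimitError, TimeoutError) as e:
        logger.warning(f"Bedrock failed: {e}, falling back to Azure")
        return call_azure_openai(prompt)
    # Other exceptions (e.g., programming errors) propagate to the caller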

Even better: Use circuit breaker pattern to skip Bedrock entirely when it’s known to be down.
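A circuit breaker for this case can be quite small. The sketch below is one possible shape, not a production implementation: `CircuitBreaker`, `ProviderError`, and `call_with_breaker` are hypothetical names, the thresholds are arbitrary, and the provider calls are passed in as plain callables so the logic can be tested without any cloud SDK.

```python
import time
from typing import Callable, Optional


class ProviderError(Exception):
    """Hypothetical error type for provider outages, throttling, and timeouts."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Closed: always allow. Open: allow a trial call after the cooldown."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open after a failed trial) and restart the cooldown
            self._opened_at = self._clock()


def call_with_breaker(prompt: str,
                      primary: Callable[[str], str],
                      fallback: Callable[[str], str],
                      breaker: CircuitBreaker) -> str:
    if breaker.allow():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except ProviderError:
            breaker.record_failure()
    # Breaker open or primary failed: go straight to the fallback
    return fallback(prompt)
```

While the breaker is open, requests skip the primary entirely, which removes the added latency of a doomed call from every request during an outage. After the cooldown a single trial call decides whether to close the breaker again.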

Problem 2. [Senior] A serverless LLM application has high cold start latency:

# AWS Lambda handler
import boto3
from anthropic import Anthropic

def handler(event, context):
    client = Anthropic()  # Initialized on every request
    response = client.messages.create(...)
    return response

Cold starts are taking 3-5 seconds. How would you optimize this?

Answer

Issues: (1) Client initialization on every request: the Anthropic client and its connection pool are rebuilt on each invocation, so even warm invocations pay the setup cost. (2) No connection reuse between invocations.

Optimizations: (1) Initialize client outside handler:

from anthropic import Anthropic

client = Anthropic()  # Created once per execution environment, reused across warm invocations

def handler(event, context):
    response = client.messages.create(...)
    return response

This removes the per-request setup cost on warm invocations; the remaining items target the cold start itself. (2) Use provisioned concurrency: Pre-warm Lambda instances to eliminate cold starts for critical paths. Costs more but guarantees latency. (3) Reduce package size: Smaller deployment package = faster cold starts. Remove unused dependencies. (4) Use Lambda SnapStart: Supported for Java and, since late 2024, for Python 3.12+ and .NET 8. It restores a snapshot of the already-initialized environment instead of initializing from scratch. (5) Keep warm: Schedule periodic invocations to reduce cold starts (hacky but works).

Architecture change: For latency-critical paths, consider ECS/Fargate with persistent containers instead of Lambda.

Problem 3. [Staff] A multi-cloud deployment has inconsistent behavior:

# Same prompt, different results
bedrock_response = call_bedrock("Summarize this document")
azure_response = call_azure("Summarize this document")

# bedrock_response: 150 words, formal tone
# azure_response: 300 words, casual tone

Users are confused when failover happens because responses feel different. How would you address this?

Answer

Root causes: (1) Different models have different default behaviors. (2) System prompts may differ between providers. (3) Temperature and parameter defaults vary. (4) Token limits may differ.

Solutions: (1) Standardize prompts: Include explicit instructions for length, tone, format in the prompt itself, not just system settings. “Summarize in exactly 150 words using formal academic tone.” (2) Normalize parameters: Set explicit temperature, max_tokens, etc., for all providers. Don’t rely on defaults. (3) System prompt parity: Use identical system prompts across providers. (4) Output post-processing: If format still differs, add a normalization layer (truncate to length, adjust formatting). (5) A/B testing: Evaluate both providers on the same test set. Choose the one closer to your target, make the other match it. (6) Provider-specific tuning: Sometimes you need different prompts per provider to get equivalent outputs.

Architectural consideration: If consistency is critical, avoid multi-model fallback. Use the same model across regions (e.g., Claude on Bedrock us-east-1 → Claude on Bedrock us-west-2, not → Azure GPT-4).
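Solutions (2) and (3) are easier to enforce when the shared settings live in one place and each provider's request is derived from them. The sketch below shows one way to do that. `GenerationSpec` is a hypothetical helper, the default values are arbitrary, and the two payload shapes follow the Anthropic-on-Bedrock request body and the OpenAI chat completions arguments used elsewhere in this chapter; some newer models may expect different parameter names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GenerationSpec:
    """Single source of truth for settings that must match across providers."""
    system: str
    temperature: float = 0.2
    max_tokens: int = 400

    def bedrock_anthropic_body(self, messages: List[Dict[str, str]]) -> Dict:
        # Anthropic models on Bedrock take the system prompt as a top-level field
        return {
            "anthropic_version": "bedrock-2023-05-31",
            "system": self.system,
            "messages": messages,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
        }

    def openai_chat_kwargs(self, messages: List[Dict[str, str]]) -> Dict:
        # OpenAI-style chat APIs take the system prompt as the first message
        return {
            "messages": [{"role": "system", "content": self.system}, *messages],
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
        }


spec = GenerationSpec(
    system="Summarize in about 150 words using a formal tone."
)
user_turn = [{"role": "user", "content": "Summarize this document"}]
bedrock_body = spec.bedrock_anthropic_body(user_turn)
azure_kwargs = spec.openai_chat_kwargs(user_turn)
print(bedrock_body["temperature"] == azure_kwargs["temperature"])
```

Identical settings narrow the gap between providers but do not close it, so a shared evaluation set is still needed to confirm the outputs are close enough for failover.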

Design Exercises

Exercise 1. [Senior] Design a multi-cloud LLM deployment for a global e-commerce company. Requirements: Serve customers in North America, Europe, and Asia-Pacific with <500ms p99 latency for simple queries. Use the best available model in each region. Handle 10K requests/minute peak. Ensure GDPR compliance for EU users.

Guidance

Architecture: (1) Regional deployments: NA (us-east-1): Bedrock with Claude. EU (eu-west-1): Bedrock EU if Claude available, otherwise Vertex AI in europe-west with Gemini. APAC (ap-northeast-1): Bedrock Tokyo if available, otherwise consider Azure Japan or Singapore. (2) Traffic routing: Cloudflare or Route 53 latency-based routing. Each region is independent: no cross-region LLM calls. (3) Data residency: EU user data stays in EU region. Implement at the routing layer so EU requests never route to non-EU backends. (4) Capacity: 10K req/min ≈ 167 req/sec peak. Provisioned throughput in each region for guaranteed latency. (5) Consistency: Standardized prompts and parameters across providers. Test outputs for consistency. (6) Fallback: Within-region fallback (Bedrock → direct Anthropic API), not cross-region. (7) Monitoring: Per-region latency and error dashboards. Alert on regional degradation.

Cost estimate: 3 regions × provisioned capacity + global traffic routing ≈ $50-100K/month depending on model and volume.
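The capacity figure in the guidance comes from a unit conversion, and Little's law (in-flight requests = arrival rate × time in system) turns it into a concurrency estimate for sizing connection pools and provisioned throughput. The sketch below uses the exercise's 10K requests/minute peak; the 0.5-second latency is an assumption taken from the p99 target, so it gives an upper-bound style estimate for simple queries.

```python
import math


def peak_rps(requests_per_minute: float) -> float:
    """Convert a per-minute request rate to requests per second."""
    return requests_per_minute / 60


def in_flight(rps: float, latency_s: float) -> float:
    """Little's law: average concurrent requests = rate * time in system."""
    return rps * latency_s


rps = peak_rps(10_000)
concurrent = in_flight(rps, latency_s=0.5)  # assumed latency per request

print(f"peak rate: {rps:.1f} req/s")
print(f"in-flight requests to plan for: {math.ceil(concurrent)}")
```

This is the global total; each region only needs capacity for its own share of traffic plus headroom for within-region fallback.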

Exercise 2. [Staff] Your organization is moving from a single-cloud (AWS) to a multi-cloud architecture. Currently running 20 AI applications on Bedrock. The CTO wants “cloud portability” but the VP of Engineering is concerned about operational complexity. Write a proposal that addresses both concerns. Include: phased approach, success metrics, rollback plan, and resource requirements.

Guidance

Proposal structure:

Phase 1 (Months 1-3): Abstraction without migration. Build internal API that wraps Bedrock. All 20 apps migrate to internal API. Low risk: still single cloud, but foundation for portability.

Phase 2 (Months 4-6): Secondary provider for DR only. Add Azure OpenAI as cold standby. Test failover quarterly. One team owns the integration.

Phase 3 (Months 7-12): Selective multi-cloud. Identify 2-3 apps that would benefit from multi-cloud (cost arbitrage, specific model needs). Migrate those only. Majority stays single-cloud.

Success metrics: (1) Abstraction adoption: 100% of apps using internal API. (2) Failover capability: Successfully tested quarterly. (3) Cost optimization: 10-15% reduction for apps using cost-based routing.

Rollback plan: Each phase is reversible. Internal API can route 100% to Bedrock. Secondary provider can be decommissioned without affecting apps.

Resource requirements: 2 engineers for Phase 1, 1 engineer ongoing for Phases 2-3. Platform team owns the abstraction layer.

Key message to CTO: Portability is achieved through abstraction, not by running on multiple clouds. To VP: We’re not adding complexity to all 20 apps—we’re building one platform layer that makes multi-cloud optional.

Connections to Other Chapters

This chapter’s advanced cloud patterns connect to several other topics:

Chapter 12 (Cloud AI Providers): Provider-specific details for AWS Bedrock, Azure OpenAI, and Vertex AI.
Chapter 9 (LLM Deployment & Infrastructure): Foundational deployment concepts for self-hosted deployments.
Chapter 16 (Security & Adversarial Robustness): Deeper dive into AI security patterns.
Chapter 31 (Reliability Engineering): SLOs, graceful degradation, and incident management.
Chapter 32 (Cost Engineering): Detailed cost optimization strategies building on the provider comparisons here.

Summary

Multi-cloud AI deployments offer resilience and flexibility but come with significant operational complexity. For most teams, starting with a single cloud provider and building proper abstraction layers is the pragmatic path—you can add multi-cloud capabilities later when specific needs arise.

Key Takeaways

Managed services win for most teams: Unless you’re at massive scale (100M+ tokens/day), managed services (Bedrock, Azure OpenAI, Vertex AI) are cheaper and faster to deploy than self-hosting.
Choose provider based on model quality, not just price: Claude Sonnet 4.6 (Bedrock) is better for analysis, GPT-5 (Azure) for general tasks, Gemini (Vertex) for multimodal. Pick the best model for each task.
Multi-cloud adds complexity: Only go multi-cloud if you have a specific need (DR, model access, compliance). Single cloud is simpler.
Security is non-negotiable: Use managed identities, private endpoints, encryption at rest/transit. Never put API keys in code.
Monitor and optimize costs: Set up CloudWatch/Azure Monitor/Cloud Monitoring alerts. Use cheaper models where possible. Implement caching.
Serverless for variable workloads: Lambda/Cloud Functions/Azure Functions suit unpredictable traffic. Containers (EKS/GKE/AKS) suit steady traffic.
Compliance is easier with managed services: AWS, Azure, and GCP all hold SOC 2 attestations and offer HIPAA-eligible services and GDPR data processing terms, but under the shared responsibility model you still have to configure and operate your workload compliantly. Self-hosting means you handle everything.

Interview-Ready: Understand the tradeoffs between providers, when to use provisioned throughput vs pay-as-you-go, and how to implement multi-cloud failover. Know the pricing models and breakeven points.

--- title: "Chapter 13: Multi-Cloud & Cloud-Native Patterns" keywords: [multi-cloud, Kubernetes, serverless, hybrid deployment, cloud abstraction, failover, disaster recovery] difficulty: advanced prerequisites: [ch09, ch12] estimated_time: "3-4 hours" --- ## Introduction Chapter 12 covered the fundamentals of individual cloud AI providers. This chapter goes deeper into advanced deployment patterns: multi-cloud architectures, serverless and container-based deployments, security and compliance, and disaster recovery. **When to Read This Chapter:** - You're deploying across multiple cloud providers - You need serverless or container-based LLM deployments - You have strict security or compliance requirements - You're designing disaster recovery for AI systems **What You'll Learn:** 1. **Multi-Cloud Strategies** - When and how to use multiple providers 2. **Cloud-Native Patterns** - Serverless, containers, and auto-scaling 3. **Security & Compliance** - VPCs, private endpoints, IAM, and certifications --- ## Multi-Cloud Strategies ### When to Use Multiple Providers **Reasons for Multi-Cloud:** 1. **Best-of-breed models**: Claude on AWS, GPT-5 on Azure, Gemini on GCP 2. **Disaster recovery**: Fallback if one provider has outage 3. **Cost arbitrage**: Use cheapest provider per task 4. **Regional presence**: Some providers better in certain regions 5. **Compliance**: Different data residency requirements **Anti-patterns (When NOT to multi-cloud):** 1. **Premature optimization**: Start with one, add others when needed 2. **Complexity overhead**: Multi-cloud adds operational burden 3. **Vendor negotiation leverage**: Often a myth for smaller customers ::: {.callout-warning} ## Common Mistake: Multi-Cloud for "Flexibility" Before You Need It **What people do:** Build abstraction layers across AWS, Azure, and GCP from day one "in case we need to switch providers later." Team spends months building and maintaining compatibility across three platforms. 
**Why it fails:** Multi-cloud adds operational complexity that compounds over time: three IAM systems, three monitoring stacks, three incident response procedures, three billing dashboards. The "flexibility" costs more than it saves. Most teams never actually switch providers—when they do, the abstraction layer is usually incomplete anyway. **Fix:** Start with one cloud, the one your team knows best. Build a clean internal API that *could* support multiple providers, but only implement the one you're using. Add a second provider only when you have a concrete, immediate need (required model, compliance mandate, disaster recovery requirement). Never maintain three production implementations "just in case." ::: ### Abstraction Layers and Portability **LiteLLM for Multi-Cloud:** ```python # requirements.txt litellm>=1.30.0 # multi_cloud_client.py import litellm from typing import Optional, List, Dict import os class MultiCloudLLM: """ Unified interface for AWS, Azure, GCP. Uses LiteLLM for provider abstraction. """ def __init__(self): # Set API keys for all providers os.environ["AZURE_API_KEY"] = "..." os.environ["AZURE_API_BASE"] = "https://..." os.environ["AWS_ACCESS_KEY_ID"] = "..." os.environ["AWS_SECRET_ACCESS_KEY"] = "..." os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/creds.json" # Define model routing self.model_routing = { "coding": "azure/gpt-5", # GPT-5 best for code "analysis": "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0", # Claude best for analysis "simple": "vertex_ai/gemini-3.1-pro", # Gemini cheapest "vision": "vertex_ai/gemini-3.5-flash" } def chat( self, messages: List[Dict], task_type: str = "analysis", fallback: bool = True ) -> Dict: """ Send chat request with automatic provider routing. 
Args: messages: Chat messages task_type: Type of task (determines model) fallback: Try alternative provider if primary fails Returns: Response dict with content and metadata """ model = self.model_routing.get(task_type, "azure/gpt-5") try: response = litellm.completion( model=model, messages=messages ) return { "content": response.choices[0].message.content, "provider": model.split('/')[0], "model": model, "usage": response.usage._asdict() } except Exception as e: if fallback: # Try alternative provider fallback_model = self._get_fallback_model(model) print(f"Primary failed ({e}), trying {fallback_model}") response = litellm.completion( model=fallback_model, messages=messages ) return { "content": response.choices[0].message.content, "provider": fallback_model.split('/')[0], "model": fallback_model, "usage": response.usage._asdict(), "fallback": True } else: raise def _get_fallback_model(self, primary_model: str) -> str: """Get fallback model for a primary model.""" fallbacks = { "azure/gpt-5": "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0", "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0": "vertex_ai/gemini-3.1-pro", "vertex_ai/gemini-3.1-pro": "azure/gpt-3.5-turbo" } return fallbacks.get(primary_model, "azure/gpt-3.5-turbo") def cost_optimized_routing( self, messages: List[Dict], max_cost: float = 0.01 ) -> Dict: """ Route to cheapest model that meets quality threshold. Strategy: Try cheap model first, upgrade if needed. 
""" # Try cheapest first models_by_cost = [ "vertex_ai/gemini-3.1-pro", # Cheapest "azure/gpt-3.5-turbo", "bedrock/anthropic.claude-sonnet-4-6-20250929-v1:0", "azure/gpt-5" # Most expensive ] for model in models_by_cost: response = litellm.completion( model=model, messages=messages ) # Check if response meets quality bar if self._is_good_quality(response.choices[0].message.content): return { "content": response.choices[0].message.content, "model": model } # If all fail, use best model return self.chat(messages, task_type="analysis") def _is_good_quality(self, response: str) -> bool: """Heuristic quality check.""" # Simple heuristics - in production, use eval model return len(response) > 50 and not response.startswith("I don't") # Usage client = MultiCloudLLM() # Automatic routing by task response = client.chat( messages=[{"role": "user", "content": "Write a Python function to reverse a string"}], task_type="coding" # Routes to GPT-5 on Azure ) print(f"Response from {response['provider']}: {response['content']}") # Cost-optimized routing response = client.cost_optimized_routing( messages=[{"role": "user", "content": "What is 2+2?"}] ) ``` ::: {.callout-warning} ## Common Mistake: Assuming API Compatibility Across Providers **What people do:** Build a multi-cloud abstraction layer assuming all providers work the same way. "It's just an LLM API—swap the endpoint and credentials, right?" **Why it fails:** Subtle differences cause production issues. Token counting varies (OpenAI vs Anthropic tokenizers produce different counts for the same text). Rate limiting works differently. Error codes and retry semantics differ. Streaming response formats aren't identical. The "same" model on different providers can have slightly different outputs. **Fix:** Treat each provider as distinct until proven equivalent. Test edge cases on each provider. Build provider-specific error handling. Don't assume token counts transfer—recalculate per provider. 
Run integration tests against each provider separately before deploying multi-cloud. ::: ::: {.callout-tip} ## Staff Engineer Perspective "We ran our first real failover drill after eight months of confidently telling leadership we were multi-region. The primary went 'down' at 10am. Failover kicked in cleanly—then p95 latency spiked from 800ms to 14 seconds and stayed there for 40 minutes. Reason: the standby region had cold model caches, cold vector index pages, and no warm DB connection pool. The runbook said 'traffic shifts in 90 seconds.' Reality was 'recovery completes in 40 minutes.' Now we run a 5 percent shadow load to every standby region 24/7. It costs us about $8K/month and saves the postmortem we don't want to write." — *SRE Lead at a fintech infrastructure company* ::: **Custom Abstraction Layer:** ```python # llm_abstraction.py from abc import ABC, abstractmethod from typing import List, Dict, Optional import boto3 from openai import AzureOpenAI import vertexai from vertexai.generative_models import GenerativeModel class LLMProvider(ABC): """Abstract base class for LLM providers.""" @abstractmethod def chat( self, messages: List[Dict[str, str]], temperature: float = 0.7, max_tokens: int = 1000 ) -> Dict: """Chat completion.""" pass @abstractmethod def embed(self, texts: List[str]) -> List[List[float]]: """Get embeddings.""" pass class AWSBedrockProvider(LLMProvider): """AWS Bedrock implementation.""" def __init__(self, region: str = "us-east-1"): self.client = boto3.client('bedrock-runtime', region_name=region) def chat(self, messages, temperature=0.7, max_tokens=1000): import json response = self.client.invoke_model( modelId="anthropic.claude-sonnet-4-6-20250929-v1:0", body=json.dumps({ "anthropic_version": "bedrock-2023-05-31", "max_tokens": max_tokens, "messages": messages, "temperature": temperature }) ) result = json.loads(response['body'].read()) return { "content": result['content'][0]['text'], "usage": result['usage'] } def embed(self, texts): 
import json response = self.client.invoke_model( modelId="amazon.titan-embed-text-v1", body=json.dumps({"inputText": texts[0]}) ) result = json.loads(response['body'].read()) return [result['embedding']] class AzureOpenAIProvider(LLMProvider): """Azure OpenAI implementation.""" def __init__(self, endpoint: str, api_key: str): self.client = AzureOpenAI( azure_endpoint=endpoint, api_key=api_key, api_version="2024-02-15-preview" ) def chat(self, messages, temperature=0.7, max_tokens=1000): response = self.client.chat.completions.create( model="gpt-5", # deployment name messages=messages, temperature=temperature, max_tokens=max_tokens ) return { "content": response.choices[0].message.content, "usage": response.usage.model_dump() } def embed(self, texts): response = self.client.embeddings.create( model="text-embedding-3-small", input=texts ) return [item.embedding for item in response.data] class VertexAIProvider(LLMProvider): """Vertex AI implementation.""" def __init__(self, project_id: str, location: str = "us-central1"): vertexai.init(project=project_id, location=location) self.model = GenerativeModel("gemini-3.1-pro") def chat(self, messages, temperature=0.7, max_tokens=1000): # Convert to Gemini format prompt = "\n".join([f"{m['role']}: {m['content']}" for m in messages]) response = self.model.generate_content( prompt, generation_config={ "temperature": temperature, "max_output_tokens": max_tokens } ) return { "content": response.text, "usage": {} # Vertex doesn't return usage in same format } def embed(self, texts): from vertexai.language_models import TextEmbeddingModel model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003") embeddings = model.get_embeddings(texts) return [emb.values for emb in embeddings] class MultiCloudRouter: """Route requests across providers.""" def __init__(self): self.providers = { "aws": AWSBedrockProvider(), "azure": AzureOpenAIProvider( endpoint=os.getenv("AZURE_ENDPOINT"), api_key=os.getenv("AZURE_KEY") ), "gcp": 
                VertexAIProvider(project_id=os.getenv("GCP_PROJECT"))
        }
        self.default_provider = "aws"

    def chat(
        self,
        messages: List[Dict],
        provider: Optional[str] = None,
        **kwargs
    ) -> Dict:
        """Chat with optional provider override."""
        provider_name = provider or self.default_provider
        llm = self.providers[provider_name]

        try:
            result = llm.chat(messages, **kwargs)
            result["provider"] = provider_name
            return result
        except Exception as e:
            # Fallback to next provider
            print(f"{provider_name} failed: {e}")
            for name, fallback_llm in self.providers.items():
                if name != provider_name:
                    try:
                        result = fallback_llm.chat(messages, **kwargs)
                        result["provider"] = name
                        result["fallback"] = True
                        return result
                    except Exception:
                        continue
            raise RuntimeError("All providers failed") from e


# Usage (guarded so importing this module does not trigger API calls)
if __name__ == "__main__":
    router = MultiCloudRouter()

    # Use default provider (AWS)
    response = router.chat([
        {"role": "user", "content": "Explain Kubernetes"}
    ])

    # Override to Azure
    response = router.chat(
        messages=[{"role": "user", "content": "Write a haiku about clouds"}],
        provider="azure"
    )

    print(f"Response from {response['provider']}: {response['content']}")
```

### Disaster Recovery Patterns

**Active-Passive DR:**

```python
# dr_pattern.py
import time
from typing import Dict, List, Optional

# Assumes the provider classes above are saved as llm_abstraction.py
from llm_abstraction import LLMProvider, AWSBedrockProvider, AzureOpenAIProvider


class ActivePassiveLLM:
    """
    Active-passive DR pattern.

    Primary provider handles all traffic.
    Automatic failover to secondary on failure.
    """

    def __init__(
        self,
        primary_provider: LLMProvider,
        secondary_provider: LLMProvider,
        health_check_interval: int = 60
    ):
        self.primary = primary_provider
        self.secondary = secondary_provider
        self.health_check_interval = health_check_interval
        self.primary_healthy = True
        self.last_health_check = time.time()

    def chat(self, messages: List[Dict], **kwargs) -> Dict:
        """Chat with automatic failover."""
        # Health check
        if time.time() - self.last_health_check > self.health_check_interval:
            self._health_check()

        # Try primary
        if self.primary_healthy:
            try:
                result = self.primary.chat(messages, **kwargs)
                result["provider"] = "primary"
                return result
            except Exception as e:
                print(f"Primary failed: {e}, failing over to secondary")
                self.primary_healthy = False

        # Use secondary
        result = self.secondary.chat(messages, **kwargs)
        result["provider"] = "secondary"
        result["failover"] = True
        return result

    def _health_check(self):
        """Check if primary is healthy."""
        try:
            # Simple ping
            self.primary.chat(
                [{"role": "user", "content": "ping"}],
                max_tokens=5
            )
            self.primary_healthy = True
        except Exception:
            self.primary_healthy = False
        self.last_health_check = time.time()


# Usage
dr_llm = ActivePassiveLLM(
    primary_provider=AWSBedrockProvider(),
    secondary_provider=AzureOpenAIProvider(endpoint="...", api_key="...")
)

response = dr_llm.chat([{"role": "user", "content": "Hello"}])
```

**Active-Active with Load Balancing:**

```python
# load_balancer.py
import random
import time
from collections import defaultdict
from typing import Dict, List, Optional

# Assumes the provider classes above are saved as llm_abstraction.py
from llm_abstraction import (
    LLMProvider,
    AWSBedrockProvider,
    AzureOpenAIProvider,
    VertexAIProvider,
)


class LoadBalancedLLM:
    """
    Active-active load balancing.

    Distribute traffic across multiple providers.
    """

    def __init__(self, providers: Dict[str, LLMProvider]):
        self.providers = providers
        # Track metrics per provider
        self.metrics = defaultdict(lambda: {
            "requests": 0,
            "errors": 0,
            "latency": []
        })

    def chat(
        self,
        messages: List[Dict],
        strategy: str = "round-robin",
        **kwargs
    ) -> Dict:
        """
        Chat with load balancing.

        Strategies:
        - round-robin: Distribute evenly
        - random: Random selection
        - least-latency: Route to fastest provider
        - weighted: Based on capacity/cost
        """
        if strategy == "round-robin":
            provider_name = self._round_robin()
        elif strategy == "random":
            provider_name = random.choice(list(self.providers.keys()))
        elif strategy == "least-latency":
            provider_name = self._least_latency()
        else:
            provider_name = self._weighted()

        provider = self.providers[provider_name]

        # Execute with metrics
        start = time.time()
        try:
            result = provider.chat(messages, **kwargs)
            latency = time.time() - start

            # Track metrics
            self.metrics[provider_name]["requests"] += 1
            self.metrics[provider_name]["latency"].append(latency)

            result["provider"] = provider_name
            result["latency"] = latency
            return result
        except Exception:
            self.metrics[provider_name]["errors"] += 1

            # Retry with different provider
            providers_to_try = [p for p in self.providers.keys() if p != provider_name]
            for retry_provider in providers_to_try:
                try:
                    result = self.providers[retry_provider].chat(messages, **kwargs)
                    result["provider"] = retry_provider
                    result["retry"] = True
                    return result
                except Exception:
                    continue
            raise  # Re-raise the original error if every provider failed

    def _round_robin(self) -> str:
        """Simple round-robin."""
        total_requests = sum(m["requests"] for m in self.metrics.values())
        provider_idx = total_requests % len(self.providers)
        return list(self.providers.keys())[provider_idx]

    def _least_latency(self) -> str:
        """Route to provider with lowest average latency."""
        avg_latencies = {}
        # Consider every configured provider, not only those with metrics:
        # untried providers score 0 (tried first), while providers that have
        # only ever failed score infinity (tried last).
        for name in self.providers:
            metrics = self.metrics[name]
            if metrics["latency"]:
                avg_latencies[name] = sum(metrics["latency"]) / len(metrics["latency"])
            elif metrics["errors"]:
                avg_latencies[name] = float("inf")
            else:
                avg_latencies[name] = 0

        if not avg_latencies:
            return random.choice(list(self.providers.keys()))

        return min(avg_latencies, key=avg_latencies.get)

    def _weighted(self, weights: Optional[Dict[str, float]] = None) -> str:
        """Weighted selection based on capacity or cost."""
        if not weights:
            # Default weights (example: prefer cheaper providers)
            weights = {
                "gcp": 0.5,  # Cheapest, 50% of traffic
                "azure": 0.3,
                "aws": 0.2
            }

        return random.choices(
            list(weights.keys()),
            weights=list(weights.values())
        )[0]

    def get_metrics(self) -> Dict:
        """Get performance metrics."""
        summary = {}
        for provider, metrics in self.metrics.items():
            avg_latency = (
                sum(metrics["latency"]) / len(metrics["latency"])
                if metrics["latency"] else 0
            )
            # "requests" counts successes only, so the error rate is
            # computed over all attempts (successes + errors).
            attempts = metrics["requests"] + metrics["errors"]
            error_rate = metrics["errors"] / attempts if attempts > 0 else 0

            summary[provider] = {
                "requests": metrics["requests"],
                "errors": metrics["errors"],
                "error_rate": error_rate,
                "avg_latency": avg_latency
            }
        return summary


# Usage
lb = LoadBalancedLLM({
    "aws": AWSBedrockProvider(),
    "azure": AzureOpenAIProvider(endpoint="...", api_key="..."),
    "gcp": VertexAIProvider(project_id="...")
})

# Send requests
for _ in range(100):
    response = lb.chat(
        [{"role": "user", "content": "Hello"}],
        strategy="least-latency"
    )

# Check metrics
print(lb.get_metrics())
```

---

## Cloud-Native Patterns

### Serverless Inference

**AWS Lambda with Bedrock:**

```python
# lambda_handler.py
import json
import boto3
import os

bedrock = boto3.client('bedrock-runtime')


def lambda_handler(event, context):
    """
    Serverless LLM inference.

    Benefits:
    - No idle costs
    - Auto-scaling
    - Simple deployment

    Limitations:
    - 15 min max runtime
    - Cold starts (~1-2s)
    - No GPU access
    """
    body = json.loads(event['body'])
    prompt = body['prompt']

    response = bedrock.invoke_model(
        modelId="anthropic.claude-sonnet-4-6-20250929-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}]
        })
    )

    result = json.loads(response['body'].read())

    return {
        'statusCode': 200,
        'body': json.dumps({
            'response': result['content'][0]['text']
        })
    }


# Deploy with SAM or Serverless Framework
# serverless.yml
"""
service: llm-api

provider:
  name: aws
  runtime: python3.11
  region: us-east-1
  iam:
    role:
      statements:
        - Effect: Allow
          Action:
            - bedrock:InvokeModel
          Resource: '*'

functions:
  chat:
    handler: lambda_handler.lambda_handler
    timeout: 60
    events:
      - http:
          path: chat
          method: post
"""
```

**GCP Cloud Functions with Vertex AI:**

```python
# main.py
import functions_framework
from vertexai.generative_models import GenerativeModel
import vertexai
import os

# Initialize outside handler for warm starts
vertexai.init(project=os.getenv("GCP_PROJECT"), location="us-central1")
model = GenerativeModel("gemini-3.1-pro")


@functions_framework.http
def chat(request):
    """
    Cloud Function for Vertex AI.

    Warm start optimization: Initialize model globally.
    """
    request_json = request.get_json()
    prompt = request_json['prompt']

    response = model.generate_content(prompt)

    return {'response': response.text}


# Deploy
# Note: --allow-unauthenticated makes the endpoint public; use it for demos only.
# gcloud functions deploy chat \
#   --runtime python311 \
#   --trigger-http \
#   --allow-unauthenticated \
#   --region us-central1
```

**Azure Functions with OpenAI:**

```python
# function_app.py
import azure.functions as func
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import os

app = func.FunctionApp()

# Warm start: Initialize outside handler
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential,
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_ad_token_provider=token_provider,
    api_version="2024-02-15-preview"
)


@app.function_name(name="chat")
@app.route(route="chat", methods=["POST"])
def chat(req: func.HttpRequest) -> func.HttpResponse:
    """Azure Function with managed identity."""
    req_body = req.get_json()
    prompt = req_body['prompt']

    response = client.chat.completions.create(
        model=os.getenv("DEPLOYMENT_NAME"),
        messages=[{"role": "user", "content": prompt}]
    )

    return func.HttpResponse(
        response.choices[0].message.content,
        mimetype="text/plain"
    )


# Deploy
# func azure functionapp publish <app-name>
```

### Container-Based Deployment

**AWS ECS with Bedrock:**

```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

```python
# main.py
from fastapi import FastAPI
import boto3
import json
import os

app = FastAPI()
bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')


# Plain `def` (not `async def`): FastAPI runs it in a threadpool, so the
# blocking boto3 call below does not stall the event loop.
@app.post("/chat")
def chat(prompt: str):
    """
    FastAPI app running on ECS.

    Benefits over Lambda:
    - No runtime limits
    - Keep connections warm
    - More memory/CPU options
    """
    response = bedrock.invoke_model(
        modelId="anthropic.claude-sonnet-4-6-20250929-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}]
        })
    )

    result = json.loads(response['body'].read())
    return {"response": result['content'][0]['text']}


# Health check for ECS
@app.get("/health")
async def health():
    return {"status": "healthy"}
```

Task definition (`ecs-task-definition.json`):

```json
{
  "family": "llm-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "llm-api",
      "image": "<account-id>.dkr.ecr.us-east-1.amazonaws.com/llm-api:latest",
      "portMappings": [
        {
          "containerPort": 8000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "AWS_REGION",
          "value": "us-east-1"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/llm-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
```

**GKE with Vertex AI:**

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      # Kubernetes service account, mapped to a GCP service account
      # (for example via Workload Identity)
      serviceAccountName: vertex-ai-sa
      containers:
        - name: llm-api
          image: gcr.io/project-id/llm-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: GCP_PROJECT
              value: "my-project"
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: llm-api
spec:
  selector:
    app: llm-api
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
```

### GPU Instances and Spot Pricing

**Self-Hosted with Spot Instances:**

```python
# spot_instance_manager.py
import boto3
from typing import Dict


class SpotInstanceManager:
    """
    Manage EC2
    spot instances for self-hosted LLM inference.

    Use case: Host your own LLM (Llama, Mistral) on GPU spot instances.
    Savings: roughly 60-70% compared to on-demand (varies by region and time).
    """

    def __init__(self):
        self.ec2 = boto3.client('ec2')

    def request_spot_instance(
        self,
        instance_type: str = "g5.xlarge",  # 1x A10G GPU
        max_price: str = "0.50",  # USD per hour
        ami_id: str = "ami-xxxxx"  # Custom AMI with vLLM
    ) -> str:
        """
        Request spot instance for LLM hosting.

        Instance types:
        - g5.xlarge: 1x A10G (24GB) - ~$0.40/hr spot vs $1.01 on-demand
        - g5.2xlarge: 1x A10G (24GB), more CPU/RAM
        - g5.12xlarge: 4x A10G (96GB) - for larger models
        - p4d.24xlarge: 8x A100 (320GB) - for huge models
        """
        import base64

        user_data = """#!/bin/bash
# Install vLLM and start server
pip install vllm
python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-4-Scout-17B \
    --port 8000 \
    --gpu-memory-utilization 0.9
"""

        response = self.ec2.request_spot_instances(
            SpotPrice=max_price,
            InstanceCount=1,
            LaunchSpecification={
                'ImageId': ami_id,
                'InstanceType': instance_type,
                'KeyName': 'my-key',
                'SecurityGroupIds': ['sg-xxxxx'],
                # request_spot_instances expects base64-encoded user data
                'UserData': base64.b64encode(user_data.encode()).decode(),
                'IamInstanceProfile': {
                    'Name': 'EC2-vLLM-Role'
                },
                'BlockDeviceMappings': [
                    {
                        'DeviceName': '/dev/sda1',
                        'Ebs': {
                            'VolumeSize': 100,
                            'VolumeType': 'gp3'
                        }
                    }
                ]
            }
        )

        request_id = response['SpotInstanceRequests'][0]['SpotInstanceRequestId']
        print(f"Spot request: {request_id}")

        return request_id

    def handle_spot_interruption(self):
        """
        Handle spot instance interruption.

        AWS gives 2-minute warning before terminating spot instance.
        Use this to gracefully drain connections.
        """
        # Run this in background on instance.
        # Assumes IMDSv1 is reachable; with IMDSv2 enforced, fetch a session
        # token first and send it in the X-aws-ec2-metadata-token header.
        script = """
import requests
import time
import signal
import sys

def graceful_shutdown(signum, frame):
    print("Spot interruption detected, draining...")
    # Stop accepting new requests
    # Finish existing requests
    # Save state if needed
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)

while True:
    try:
        # Check for interruption notice (404 until a notice is issued)
        r = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            timeout=1
        )
        if r.status_code == 200:
            graceful_shutdown(None, None)
    except requests.RequestException:
        # Catch only network errors; a bare except would also swallow
        # the SystemExit raised by graceful_shutdown.
        pass

    time.sleep(5)
"""
        return script


# Usage
manager = SpotInstanceManager()

# Request spot instance
request_id = manager.request_spot_instance(
    instance_type="g5.2xlarge",
    max_price="0.80"
)

# Cost comparison (illustrative prices; spot prices vary by region and over time):
# On-demand: $1.21/hr * 720 hrs/month = $871/month
# Spot: $0.40/hr * 720 hrs/month = $288/month
# Savings: $583/month (67%)
```

### Auto-Scaling Patterns

**AWS Bedrock Provisioned Throughput (manual scaling):**

Bedrock Provisioned Throughput is *not* currently a registered Application Auto Scaling target: AWS supports no `ServiceNamespace='bedrock'` / `ScalableDimension='bedrock:model:ProvisionedThroughput'` combination. Capacity is purchased in Model Units (MUs), typically on one-month or six-month commitments, so most teams scale Bedrock manually based on CloudWatch metrics, or rely on on-demand inference for elasticity and reserve provisioned Model Units only for steady baseline load.
```python
# bedrock_capacity.py
import boto3

bedrock = boto3.client('bedrock')

# Create provisioned throughput (Model Units, not auto-scaled)
response = bedrock.create_provisioned_model_throughput(
    modelUnits=2,  # baseline capacity
    provisionedModelName='claude-sonnet-prod',
    modelId='anthropic.claude-sonnet-4-6-20250929-v1:0',
    commitmentDuration='OneMonth'  # or 'SixMonths', omit for no commitment
)
provisioned_arn = response['provisionedModelArn']

# To "scale", create a new provisioned model with more MUs and migrate
# traffic, or fall back to on-demand inference during spikes. Watch
# CloudWatch metrics like InvocationThrottles to decide when to expand.
```

For workloads that need true elastic scaling, prefer on-demand inference (throttled per account quota), or host an open-weight model on SageMaker endpoints, which *do* support Application Auto Scaling target tracking via the `sagemaker` service namespace.

**Kubernetes HPA for LLM Workloads:**

```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Pods metrics need a custom metrics adapter (e.g. Prometheus Adapter)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
```

---

## Security & Compliance

### VPC and Private Endpoints

**AWS Private VPC for Bedrock:**

```python
# Use Bedrock from private subnet without internet access

# 1. Create VPC endpoint for Bedrock (via AWS Console or CloudFormation)
#    Type: Interface
#    Service: com.amazonaws.us-east-1.bedrock-runtime

# 2. Use from private subnet
import boto3

bedrock = boto3.client(
    'bedrock-runtime',
    region_name='us-east-1',
    # Format: https://<vpce-id>.bedrock-runtime.<region>.vpce.amazonaws.com
    # If Private DNS is enabled on the endpoint, you can omit endpoint_url
    # and use the standard service DNS automatically.
    endpoint_url='https://vpce-xxxxxxxxxxxxxxxxx.bedrock-runtime.us-east-1.vpce.amazonaws.com'
)

# Traffic never leaves AWS network
```

**Azure Private Link:**

```python
# Azure OpenAI with private endpoint

# 1. Create private endpoint (Azure Portal or ARM)
# 2. Configure DNS (private zone privatelink.openai.azure.com linked to the VNet)
# 3. Use from VNet
from openai import AzureOpenAI

client = AzureOpenAI(
    # Keep the standard resource hostname: inside the VNet it resolves to the
    # private endpoint's IP. Calling the privatelink hostname directly fails
    # TLS validation.
    azure_endpoint="https://my-resource.openai.azure.com/",
    api_key="...",
    api_version="2024-02-15-preview"
)

# Traffic stays within Azure backbone
```

### IAM and Service Accounts

**AWS IAM Best Practices:**

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockInvokeOnly",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-6-20250929-v1:0"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    },
    {
      "Sid": "DenyDataLogging",
      "Effect": "Deny",
      "Action": [
        "bedrock:PutModelInvocationLoggingConfiguration"
      ],
      "Resource": "*"
    }
  ]
}
```

**GCP Service Account:**

```python
# Create service account with minimal permissions
import os

from google.cloud import iam_admin_v1

client = iam_admin_v1.IAMClient()

# Create service account
service_account = client.create_service_account(
    request={
        "name": "projects/my-project",
        "account_id": "vertex-ai-sa",
        "service_account": {
            "display_name": "Vertex AI Service Account"
        }
    }
)

# Grant only necessary roles
# - roles/aiplatform.user (use Vertex AI)
# - roles/storage.objectViewer (read from GCS)
# - NO roles/owner or roles/editor

# Use from application (in production, prefer an attached service account
# or Workload Identity over a downloaded key file)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/sa-key.json"
```

### Data Residency
**AWS Regions:**

```python
# EU data residency
import boto3

bedrock_eu = boto3.client('bedrock-runtime', region_name='eu-west-1')

# Requests are processed in the selected EU region
# (verify any cross-region inference settings before relying on this)
# Supports GDPR data-residency requirements

# Available Bedrock regions:
# - us-east-1, us-west-2 (most models)
# - eu-west-1, eu-central-1 (EU data residency)
# - ap-southeast-1, ap-northeast-1 (Asia Pacific)
```

**Azure Geographies:**

```python
# Azure OpenAI data residency
from openai import AzureOpenAI

# Europe geography
client_eu = AzureOpenAI(
    azure_endpoint="https://my-resource-westeurope.openai.azure.com/",
    api_key="...",
    api_version="2024-02-15-preview"
)

# Data stays in European data centers
# GDPR compliant
```

### Compliance Certifications

**AWS Bedrock:**

- SOC 2 Type II
- HIPAA eligible (BAA available)
- ISO 27001
- GDPR compliant
- PCI DSS (for payment card data)

**Azure OpenAI:**

- SOC 2 Type II
- HIPAA/HITECH (BAA available)
- ISO 27001, 27018
- GDPR compliant
- FedRAMP (US government)

**GCP Vertex AI:**

- SOC 2 Type II
- HIPAA compliant
- ISO 27001
- GDPR compliant
- FedRAMP (in progress)

**Enabling Compliance Features:**

```python
# AWS HIPAA example

# 1. Sign BAA with AWS
# 2. Use HIPAA-eligible services only
# 3. Enable encryption at rest and in transit
# 4. Enable audit logging
import boto3

# Enable CloudTrail for audit logs
cloudtrail = boto3.client('cloudtrail')
cloudtrail.create_trail(
    Name='bedrock-audit-trail',
    S3BucketName='my-audit-bucket',
    IsMultiRegionTrail=True,
    EnableLogFileValidation=True
)

# Enable encryption for S3 buckets (for Knowledge Bases)
s3 = boto3.client('s3')
s3.put_bucket_encryption(
    Bucket='my-kb-bucket',
    ServerSideEncryptionConfiguration={
        'Rules': [
            {
                'ApplyServerSideEncryptionByDefault': {
                    'SSEAlgorithm': 'AES256'
                }
            }
        ]
    }
)
```

---

## Comparison Table

| Feature | AWS Bedrock | Azure OpenAI | GCP Vertex AI |
|---------|-------------|--------------|---------------|
| **Best Models** | Claude Sonnet 4.6, Llama 4 | GPT-5.5, GPT-5.4 | Gemini 3.1 Pro, Gemini 3.5 Flash |
| **Pricing (flagship model, input/output)** | Claude Sonnet 4.6: $3/$15 per 1M tokens | GPT-5.4: $2.50/$15 per 1M tokens | Gemini 3.1 Pro: $2/$12 per 1M tokens |
| **Enterprise Auth** | IAM roles | Microsoft Entra ID (formerly Azure AD), Managed Identity | GCP Service Accounts |
| **Private Network** | VPC Endpoints | Private Link | VPC Service Controls |
| **Fine-tuning** | Limited (select models) | Yes (GPT-5-mini, GPT-5) | Yes (PaLM, Gemini) |
| **Multimodal** | Claude 4.x (images) | GPT-5 Vision, DALL-E | Gemini 3.1 Pro, Imagen |
| **RAG Support** | Knowledge Bases | Azure AI Search | Vertex AI Search |
| **Agents** | Bedrock Agents | Azure AI Studio | Vertex AI Agents |
| **Regions** | 6+ regions | 10+ regions | 20+ regions |
| **Compliance** | SOC2, HIPAA, GDPR | SOC2, HIPAA, FedRAMP | SOC2, HIPAA, GDPR |
| **Best For** | Claude users, AWS ecosystem | Enterprise, Microsoft stack | Google ecosystem, multimodal |

---

## Practical Exercises

1. **Multi-Cloud RAG**: Build a RAG system using Azure AI Search (formerly Azure Cognitive Search) for indexing, AWS Bedrock for embeddings, and GCP Vertex AI for generation. Implement fallback if any provider fails using LiteLLM's routing capabilities.

2. **Cost Optimizer**: Create a cost optimization service that tracks usage across all three providers, routes requests to the cheapest provider for each task type, switches to provisioned throughput when usage crosses a threshold, and provides daily cost reports.

3. **Serverless Agent**: Deploy an agent on AWS Lambda that uses Bedrock for reasoning, calls tools (weather API, database), stores conversation history in DynamoDB, and auto-scales based on traffic. Target: sub-3-second p95 latency, 1000 req/s capacity, <$100/month at low traffic.

4. **HIPAA-Compliant Chat**: Build a HIPAA-compliant medical chatbot on Azure with private endpoints, Azure OpenAI with BAA, encryption at rest/transit, audit logging, and PHI storage in encrypted Cosmos DB. Create a compliance checklist document.

5. **Disaster Recovery Drill**: Set up a multi-cloud failover system and conduct a disaster recovery drill. Simulate primary provider failure, measure failover time, document data loss (if any), and test failback procedures.

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [IC2]** What is LiteLLM and when would you use it? What are its limitations?

<details>
<summary>Answer</summary>

LiteLLM provides a unified interface for calling 100+ LLM providers with the same OpenAI-compatible API.

Use cases: (1) Switching between providers without code changes. (2) Building fallback chains (try Claude, fall back to GPT-4). (3) A/B testing different models. (4) Abstracting provider differences for teams.

Limitations: (1) Adds a dependency and potential point of failure. (2) May lag behind provider feature releases. (3) Some provider-specific features aren't supported. (4) Debugging is harder because errors may be obscured. (5) Performance overhead (minimal but exists).

When to use: Multi-provider deployments, rapid prototyping, teams evaluating models. When to avoid: Single-provider production deployments (unnecessary abstraction), applications needing cutting-edge provider features.
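
To make the fallback-chain use case concrete, here is a minimal, library-agnostic sketch of what such a routing layer does conceptually. It deliberately does not use LiteLLM's actual API: `Backend`, `complete_with_fallback`, `AllBackendsFailed`, and the two stand-in backends are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: a backend takes OpenAI-style messages and returns text.
Backend = Callable[[List[Dict[str, str]]], str]


class AllBackendsFailed(RuntimeError):
    """Raised when every backend in the chain raised an error."""


def complete_with_fallback(
    messages: List[Dict[str, str]],
    chain: Sequence[Tuple[str, Backend]],
) -> Dict[str, str]:
    """Try each (name, backend) pair in order and return the first success."""
    errors: List[str] = []
    for name, backend in chain:
        try:
            return {"model": name, "content": backend(messages)}
        except Exception as exc:  # a real router would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Stand-in backends so the sketch runs without any cloud credentials.
def flaky_primary(messages: List[Dict[str, str]]) -> str:
    raise TimeoutError("simulated provider timeout")


def echo_secondary(messages: List[Dict[str, str]]) -> str:
    return f"echo: {messages[-1]['content']}"


result = complete_with_fallback(
    [{"role": "user", "content": "Explain Kubernetes"}],
    chain=[("primary", flaky_primary), ("secondary", echo_secondary)],
)
print(result)
```

The caller sees one call signature regardless of which backend answered, which is the property that makes provider switching a configuration change rather than a code change. The trade-off listed under limitations is visible here too: the original `TimeoutError` is only surfaced if every backend fails.
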
</details>

**Q2. [IC2]** What's the difference between a VPC endpoint and a private endpoint? When do you need each?

<details>
<summary>Answer</summary>

VPC endpoint (AWS): Connects your VPC to AWS services without traffic leaving Amazon's network. Types: Gateway endpoints (S3 and DynamoDB; free) and Interface endpoints (most services; billed per hour plus data transfer). Use for: Keeping Bedrock traffic within AWS.

Private endpoint (Azure): Similar concept: connects your VNet to Azure services privately. Uses Azure Private Link. Use for: Keeping Azure OpenAI traffic within Azure.

You need these when: (1) Compliance requires no internet egress. (2) Security policy prohibits public API calls. (3) You need to reduce attack surface. (4) Data classification requires private networking.

Both add cost and complexity. Only implement if you have a concrete security or compliance requirement.

</details>

**Q3. [Senior]** Explain the tradeoffs between serverless (Lambda) and container-based (EKS) deployments for LLM applications.

<details>
<summary>Answer</summary>

Serverless (Lambda): Pros: Zero ops, auto-scales to zero, pay-per-invocation. Cons: Cold starts (1-5s for LLM clients), 15-minute timeout, limited memory (10GB), no guaranteed persistent connections, harder to maintain state.

Containers (EKS): Pros: No cold starts, unlimited execution time, persistent connections to LLM APIs, more control over scaling. Cons: Operational overhead, minimum cost even at zero traffic, need to manage cluster.

Decision factors: (1) Traffic pattern: Bursty with long idle periods → serverless. Steady traffic → containers. (2) Latency requirements: p99 <500ms → containers (cold starts kill serverless). (3) Execution time: >15 minutes → containers. (4) Team expertise: No Kubernetes experience → serverless. (5) Cost optimization: Very low volume → serverless; high volume → containers.

Hybrid pattern: Use Lambda for simple, short-lived tasks (embeddings, classification).
Use containers for complex agents and long conversations.

</details>

**Q4. [Senior]** A company wants to use Claude on Bedrock but keep all data in Europe for GDPR. What are their options?

<details>
<summary>Answer</summary>

Options: (1) Bedrock in EU regions: Check if Claude is available in Bedrock's European regions (Frankfurt is eu-central-1, Ireland is eu-west-1). Availability varies by model. If available, this is simplest. (2) Self-hosted Claude: Not available; Anthropic doesn't offer Claude for self-hosting. (3) API direct to Anthropic: Anthropic has EU data processing options. Check their DPA and data residency commitments. May meet GDPR with appropriate contract. (4) Alternative model: Use a model available in EU regions (Titan, open-source on SageMaker, or Vertex AI in europe-west regions). (5) Data anonymization: Remove PII before sending to non-EU Claude, ensuring no personal data crosses borders.

GDPR considerations: It's about data protection, not just location. With an appropriate DPA, EU-to-US transfers may be legal under the EU-US Data Privacy Framework. Consult legal counsel for specific requirements.

</details>

**Q5. [Staff]** Design a disaster recovery architecture for a critical AI application. Requirements: RPO <1 hour, RTO <15 minutes, must survive complete cloud region failure.

<details>
<summary>Answer</summary>

Architecture: (1) Active-passive multi-region: Primary in us-east-1, standby in us-west-2. Async replication of state (conversation history, user data) with <1 hour lag. (2) Cross-region database: Use Aurora Global Database or Cosmos DB with multi-region writes. (3) Health monitoring: External health checks every 30 seconds. Automated failover when primary fails 3 consecutive checks. (4) DNS-based failover: Route 53 health checks with failover routing. TTL of 60 seconds. (5) LLM provider diversity: Primary uses Bedrock us-east-1, failover uses Azure OpenAI or direct Anthropic API. (6) State synchronization: Conversation history replicated to standby.
On failover, users may see slight regression but can continue. (7) Failback procedure: Once primary recovers, sync any changes from standby, verify consistency, switch traffic back with gradual rollout.

RTO breakdown: Detection (1-2 min) + DNS propagation (1-2 min) + connection warmup (1-2 min) = ~5 minutes typical. RPO: Depends on replication lag; target <15 minutes with continuous sync.

Test quarterly: Conduct actual failover tests to validate assumptions. Document runbook.

</details>

### Spot the Problem

**Problem 1. [IC2]** A developer's multi-cloud fallback isn't working:

```python
def call_llm(prompt: str) -> str:
    try:
        return call_bedrock(prompt)
    except Exception:
        return call_azure_openai(prompt)
```

What's wrong with this fallback implementation?

<details>
<summary>Answer</summary>

Issues: (1) Catches all exceptions, including programming errors, not just provider issues. A bug in call_bedrock triggers fallback instead of failing fast. (2) No retry before fallback: a single transient failure triggers an expensive cross-cloud call. (3) No logging: you won't know fallback is happening or why. (4) No timeout: if Bedrock hangs, you wait forever before fallback. (5) No circuit breaker: if Bedrock is down, every request tries it first before falling back.

Better implementation:

```python
# retry_with_backoff, ProviderUnavailable, RateLimitError, and logger
# are application-defined helpers (not shown here).
def call_llm(prompt: str) -> str:
    try:
        return retry_with_backoff(call_bedrock, prompt, max_retries=2)
    except (ProviderUnavailable, RateLimitError, TimeoutError) as e:
        logger.warning(f"Bedrock failed: {e}, falling back to Azure")
        return call_azure_openai(prompt)
    # Let other exceptions propagate
```

Even better: Use circuit breaker pattern to skip Bedrock entirely when it's known to be down.

</details>

**Problem 2. [Senior]** A serverless LLM application has high cold start latency:

```python
# AWS Lambda handler
import boto3
from anthropic import Anthropic

def handler(event, context):
    client = Anthropic()  # Initialized on every request
    response = client.messages.create(...)
    return response
```

Cold starts are taking 3-5 seconds. How would you optimize this?

<details>
<summary>Answer</summary>

Issues: (1) Client initialization on every request: the Anthropic client and its connection pool are rebuilt on each invocation, so warm invocations pay the setup cost too. (2) No connection reuse between invocations.

Optimizations: (1) Initialize client outside handler:

```python
client = Anthropic()  # Reused across warm invocations

def handler(event, context):
    response = client.messages.create(...)
```

(2) Use provisioned concurrency: Pre-warm Lambda instances to eliminate cold starts for critical paths. Costs more but guarantees latency. (3) Reduce package size: Smaller deployment package = faster cold starts. Remove unused dependencies. (4) Use Lambda SnapStart: Originally Java-only, it has since been extended to Python 3.12+ and .NET 8 runtimes; check whether your runtime and region support it. (5) Keep warm: Schedule periodic invocations to prevent cold starts (hacky but works).

Architecture change: For latency-critical paths, consider ECS/Fargate with persistent containers instead of Lambda.

</details>

**Problem 3. [Staff]** A multi-cloud deployment has inconsistent behavior:

```python
# Same prompt, different results
bedrock_response = call_bedrock("Summarize this document")
azure_response = call_azure("Summarize this document")

# bedrock_response: 150 words, formal tone
# azure_response: 300 words, casual tone
```

Users are confused when failover happens because responses feel different. How would you address this?

<details>
<summary>Answer</summary>

Root causes: (1) Different models have different default behaviors. (2) System prompts may differ between providers. (3) Temperature and parameter defaults vary. (4) Token limits may differ.

Solutions: (1) Standardize prompts: Include explicit instructions for length, tone, format in the prompt itself, not just system settings. "Summarize in exactly 150 words using formal academic tone." (2) Normalize parameters: Set explicit temperature, max_tokens, etc., for all providers.
Don't rely on defaults. (3) System prompt parity: Use identical system prompts across providers. (4) Output post-processing: If format still differs, add a normalization layer (truncate to length, adjust formatting). (5) A/B testing: Evaluate both providers on the same test set. Choose the one closer to your target, make the other match it. (6) Provider-specific tuning: Sometimes you need different prompts per provider to get equivalent outputs.

Architectural consideration: If consistency is critical, avoid multi-model fallback. Use the same model across regions (e.g., Claude on Bedrock us-east-1 → Claude on Bedrock us-west-2, not → Azure GPT-4).

</details>

### Design Exercises

**Exercise 1. [Senior]** Design a multi-cloud LLM deployment for a global e-commerce company. Requirements: Serve customers in North America, Europe, and Asia-Pacific with <500ms p99 latency for simple queries. Use the best available model in each region. Handle 10K requests/minute peak. Ensure GDPR compliance for EU users.

<details>
<summary>Guidance</summary>

Architecture: (1) Regional deployments: NA (us-east-1): Bedrock with Claude. EU (eu-west-1): Bedrock EU if Claude available, otherwise Vertex AI in europe-west with Gemini. APAC (ap-northeast-1): Bedrock Tokyo if available, otherwise consider Azure Japan or Singapore. (2) Traffic routing: Cloudflare or Route 53 latency-based routing. Each region is independent, with no cross-region LLM calls. (3) Data residency: EU user data stays in EU region. Implement at the routing layer so that EU requests never route to non-EU backends. (4) Capacity: 10K req/min ≈ 167 req/sec peak. Provisioned throughput in each region for guaranteed latency. (5) Consistency: Standardized prompts and parameters across providers. Test outputs for consistency. (6) Fallback: Within-region fallback (Bedrock → direct Anthropic API), not cross-region. (7) Monitoring: Per-region latency and error dashboards. Alert on regional degradation.
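
The capacity figure in item (4) can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes Little's law (in-flight requests = arrival rate × time in system) and uses the 500 ms latency target as a conservative stand-in for average latency. The function name `capacity_plan` and the 30% headroom are illustrative assumptions, not values from the exercise, and the calculation treats the global peak as a single pool rather than splitting it by region.

```python
import math
from typing import Dict


def capacity_plan(
    requests_per_minute: float,
    latency_s: float,
    headroom: float = 0.3,
) -> Dict[str, float]:
    """Estimate request rate and concurrent in-flight requests."""
    rps = requests_per_minute / 60
    in_flight = rps * latency_s  # Little's law: L = lambda * W
    return {
        "rps": rps,
        "in_flight": in_flight,
        "in_flight_with_headroom": math.ceil(in_flight * (1 + headroom)),
    }


plan = capacity_plan(requests_per_minute=10_000, latency_s=0.5)
print(
    f"{plan['rps']:.1f} req/s, ~{plan['in_flight']:.0f} in flight, "
    f"plan for {plan['in_flight_with_headroom']} concurrent requests"
)
# prints: 166.7 req/s, ~83 in flight, plan for 109 concurrent requests
```

Concurrency, not just requests per second, is what provider quotas and provisioned capacity have to absorb, so longer generations (higher latency) raise the requirement even at the same request rate.
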
Cost estimate: 3 regions × provisioned capacity + global traffic routing ≈ $50-100K/month depending on model and volume.

</details>

**Exercise 2. [Staff]** Your organization is moving from a single-cloud (AWS) to a multi-cloud architecture. It currently runs 20 AI applications on Bedrock. The CTO wants "cloud portability" but the VP of Engineering is concerned about operational complexity. Write a proposal that addresses both concerns. Include: phased approach, success metrics, rollback plan, and resource requirements.

<details>
<summary>Guidance</summary>

Proposal structure:

Phase 1 (Months 1-3): Abstraction without migration. Build an internal API that wraps Bedrock. All 20 apps migrate to the internal API. Low risk: still single cloud, but it lays the foundation for portability.

Phase 2 (Months 4-6): Secondary provider for DR only. Add Azure OpenAI as cold standby. Test failover quarterly. One team owns the integration.

Phase 3 (Months 7-12): Selective multi-cloud. Identify 2-3 apps that would benefit from multi-cloud (cost arbitrage, specific model needs). Migrate those only. The majority stays single-cloud.

Success metrics:

(1) Abstraction adoption: 100% of apps using the internal API.

(2) Failover capability: Successfully tested quarterly.

(3) Cost optimization: 10-15% reduction for apps using cost-based routing.

Rollback plan: Each phase is reversible. The internal API can route 100% to Bedrock. The secondary provider can be decommissioned without affecting apps.

Resource requirements: 2 engineers for Phase 1, 1 engineer ongoing for Phases 2-3. The platform team owns the abstraction layer.

Key message to CTO: Portability is achieved through abstraction, not by running on multiple clouds.

To VP: We're not adding complexity to all 20 apps. We're building one platform layer that makes multi-cloud optional.
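The internal API from Phases 1 and 2 can be sketched as a thin gateway that applications call instead of a provider SDK. This is an illustrative sketch only: `LLMGateway` and the stub providers are hypothetical names, and the stubs stand in for real Bedrock and Azure OpenAI calls so the example runs without credentials.

```python
from typing import Callable, Optional

# A provider is any callable that takes a prompt and returns text.
Provider = Callable[[str], str]


class LLMGateway:
    """Internal API that apps depend on (hypothetical name)."""

    def __init__(self, primary: Provider, standby: Optional[Provider] = None) -> None:
        self.primary = primary
        self.standby = standby  # None means single-cloud (Phase 1, or after rollback)

    def complete(self, prompt: str) -> str:
        try:
            return self.primary(prompt)
        except Exception:
            # Broad catch keeps the sketch short; real code would catch only
            # the provider's transient errors such as timeouts or throttling.
            if self.standby is None:
                raise
            return self.standby(prompt)


# Stubs standing in for real provider calls.
def stub_bedrock(prompt: str) -> str:
    return f"[bedrock] {prompt}"


def stub_standby(prompt: str) -> str:
    return f"[standby] {prompt}"


def stub_outage(prompt: str) -> str:
    raise RuntimeError("simulated provider outage")


if __name__ == "__main__":
    phase1 = LLMGateway(primary=stub_bedrock)
    print(phase1.complete("ping"))  # prints "[bedrock] ping"

    phase2 = LLMGateway(primary=stub_outage, standby=stub_standby)
    print(phase2.complete("ping"))  # prints "[standby] ping"
```

The rollback plan falls out of the constructor: dropping the `standby` argument returns every app to 100% Bedrock without touching application code.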
</details>

---

## Connections to Other Chapters

This chapter's advanced cloud patterns connect to several other topics:

- **Chapter 12 (Cloud AI Providers)**: Provider-specific details for AWS Bedrock, Azure OpenAI, and Vertex AI.
- **Chapter 9 (LLM Deployment & Infrastructure)**: Foundational deployment concepts for self-hosted deployments.
- **Chapter 16 (Security & Adversarial Robustness)**: Deeper dive into AI security patterns.
- **Chapter 31 (Reliability Engineering)**: SLOs, graceful degradation, and incident management.
- **Chapter 32 (Cost Engineering)**: Detailed cost optimization strategies building on the provider comparisons here.

---

## Summary

Multi-cloud AI deployments offer resilience and flexibility but come with significant operational complexity. For most teams, starting with a single cloud provider and building proper abstraction layers is the pragmatic path; you can add multi-cloud capabilities later when specific needs arise.

### Key Takeaways

1. **Managed services win for most teams**: Unless you're at massive scale (100M+ tokens/day), managed services (Bedrock, Azure OpenAI, Vertex AI) are cheaper and faster to deploy than self-hosting.
2. **Choose provider based on model quality, not just price**: Claude Sonnet 4.6 (Bedrock) is better for analysis, GPT-5 (Azure) for general tasks, Gemini (Vertex) for multimodal. Pick the best model for each task.
3. **Multi-cloud adds complexity**: Only go multi-cloud if you have a specific need (DR, model access, compliance). Single cloud is simpler.
4. **Security is non-negotiable**: Use managed identities, private endpoints, and encryption at rest and in transit. Never put API keys in code.
5. **Monitor and optimize costs**: Set up CloudWatch/Azure Monitor/Cloud Monitoring alerts. Use cheaper models where possible. Implement caching.
6. **Serverless for variable workloads**: Lambda/Cloud Functions/Azure Functions suit unpredictable traffic; containers (ECS/GKE/AKS) suit steady traffic.
7. **Compliance is easier with managed services**: AWS, Azure, and GCP hold SOC 2 attestations and offer HIPAA-eligible services and GDPR data-processing terms, though you still own your side of the shared responsibility model. Self-hosting means you handle everything.

**Interview-Ready**: Understand the tradeoffs between providers, when to use provisioned throughput vs pay-as-you-go, and how to implement multi-cloud failover. Know the pricing models and breakeven points.