Appendix B: Tool & Framework Reference
This appendix provides a reference guide to the major tools and frameworks used in AI engineering as of 2026. The landscape evolves rapidly—use this as a starting point and verify current versions and alternatives.
Last Updated: May 2026
Review Frequency: This appendix should be reviewed quarterly, as the AI tooling landscape changes significantly every 3-6 months.
Emerging Tools and Trends (2025-2026)
Before diving into established tools, here are the notable emerging developments:
Agentic Frameworks
MCP (Model Context Protocol): An open protocol, introduced by Anthropic, for standardized tool integration with LLMs. Growing ecosystem of MCP-compatible tools and servers.
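Under the hood, MCP clients and servers exchange JSON-RPC 2.0 messages, and a client invokes a tool with a `tools/call` request. The sketch below builds such a request and dispatches it to a local handler. It is a minimal standard-library illustration, not the official MCP SDK. The `read_file` handler and its in-memory file store are hypothetical, and for simplicity every failure is reported inside the result (the MCP spec separates protocol errors from tool execution errors).

```python
import json

# Hypothetical in-memory "filesystem" backing the example tool.
FILES = {"notes.txt": "hello from MCP"}


def read_file(path: str) -> str:
    """Hypothetical tool implementation: return a file's contents."""
    return FILES[path]


TOOLS = {"read_file": read_file}


def make_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def handle(raw: str) -> dict:
    """Dispatch a tools/call request and wrap the output as text content."""
    request = json.loads(raw)
    params = request["params"]
    try:
        text = TOOLS[params["name"]](**params["arguments"])
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    except (KeyError, TypeError) as exc:
        # Simplification: unknown tools, bad arguments, and missing files
        # are all returned as tool errors the model can read.
        result = {"content": [{"type": "text", "text": f"error: {exc!r}"}], "isError": True}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


response = handle(make_tool_call(1, "read_file", {"path": "notes.txt"}))
print(response["result"]["content"][0]["text"])
```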
MCP is becoming a widely adopted standard for tool integration. A server advertises each tool with a name, a description, and a JSON Schema (`inputSchema`) for its arguments. An illustrative `read_file` tool from a filesystem server:
```json
{
  "name": "read_file",
  "description": "Read contents of a file",
  "inputSchema": {
    "type": "object",
    "properties": {"path": {"type": "string"}},
    "required": ["path"]
  }
}
```
Claude Code, Cursor, Windsurf: AI-powered coding assistants that go beyond autocomplete to full agentic coding. These tools can:
- Edit multiple files simultaneously
- Run terminal commands
- Understand full repository context
- Execute multi-step coding tasks
CrewAI, AutoGen, LangGraph: Multi-agent orchestration frameworks for complex workflows.
Inference Optimization
SGLang: Alternative to vLLM with focus on structured generation and complex prompting patterns. Worth evaluating for applications with heavy structured output needs.
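Medusa, EAGLE: Speculative decoding methods that draft several future tokens with lightweight extra heads and verify them in a single forward pass of the base model. Reported speedups are typically 2-3x, with little or no quality loss (exact verification leaves greedy outputs unchanged).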
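The draft-and-verify idea behind speculative decoding can be sketched with greedy decoding. This is the generic two-model variant (Medusa and EAGLE draft with heads attached to the target model instead of a separate draft model), and `target_model` and `draft_model` are hypothetical toy stand-ins for real LLMs. The sketch checks positions one by one for clarity; the real speedup comes from scoring all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy draft-and-verify; returns the same tokens as greedy_decode(target, ...)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks each drafted position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # the target's choice is always kept
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            accepted.append(target(tokens + accepted))  # all k matched: one bonus token
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]


# Hypothetical toy models: the target counts up mod 10; the draft agrees except after a 5.
def target_model(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10


print(speculative_decode(target_model, draft_model, [0], max_new_tokens=12))
```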
FP8 Quantization: Native support in H100/H200 GPUs enabling faster inference with minimal quality loss compared to FP16.
Development Tools
LiteLLM: Unified API across 100+ LLM providers. Increasingly standard for multi-provider applications.
```python
from litellm import completion

# Same interface for any provider
response = completion(
    model="gpt-5",  # or "claude-opus-4-8", "anthropic/claude-sonnet-4-6", etc.
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)  # OpenAI-style response object
```
promptfoo: Prompt testing and evaluation framework gaining adoption for systematic prompt engineering.
Weights & Biases Prompts: Prompt versioning and experimentation tracking.
Evaluation and Safety
Hugging Face Lighteval: Standardized evaluation harness for LLM benchmarks.
Inspect AI: The UK AI Security Institute’s open-source framework for LLM and AI safety evaluations.
Patronus AI, Galileo: Commercial LLM evaluation and observability platforms.
Langfuse: Open-source LLM observability and evaluation platform, also offered as a hosted service.
LLM Inference Engines
vLLM
Purpose: High-throughput LLM inference engine
Key Features:
- PagedAttention for efficient memory management
- Continuous batching for high throughput
- Support for most popular open-source LLMs
- OpenAI-compatible API
When to Use: Production LLM serving requiring high throughput
Installation:
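```bash
pip install vllm
```
Example:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E")
outputs = llm.generate(["Hello, how are you?"], SamplingParams(temperature=0.7))
for output in outputs:
    print(output.outputs[0].text)  # first completion for each prompt
```
Resources: https://docs.vllm.ai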
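The memory idea behind PagedAttention can be illustrated with the block-table bookkeeping alone. This is a conceptual pure-Python sketch that tracks block indices, not tensors; vLLM's real implementation (GPU kernels, prefix sharing with copy-on-write, preemption) is far more involved, and the `BlockAllocator` class and its methods are hypothetical names.

```python
class BlockAllocator:
    """Toy paged KV-cache manager in the spirit of PagedAttention.

    KV-cache memory is split into fixed-size physical blocks. Each sequence
    keeps a block table mapping its logical blocks to physical ones, so memory
    is allocated on demand instead of reserved for the maximum length.
    """

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve KV-cache space for one more token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # current block is full, or no block yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a scheduler would preempt here")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def slot(self, seq_id: str, position: int) -> tuple[int, int]:
        """Translate a token position into (physical block, offset within block)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8, block_size=4)
for _ in range(6):
    alloc.append_token("a")  # 6 tokens -> 2 blocks (the second is half full)
for _ in range(3):
    alloc.append_token("b")  # 3 tokens -> 1 block
print(alloc.block_tables, len(alloc.free_blocks))
```

Only the last block of each sequence can be partially empty, which is why waste stays small compared with reserving a maximum-length buffer per request.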
Text Generation Inference (TGI)
Purpose: Hugging Face’s production inference server
Key Features:
- Optimized for Hugging Face models
- Flash Attention support
- Quantization support (AWQ, GPTQ, bitsandbytes)
- Prometheus metrics built-in
When to Use: Hugging Face model deployment, especially in containers
Docker Launch:
```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-4-Scout-17B-16E
```
Resources: https://huggingface.co/docs/text-generation-inference
Ollama
Purpose: Local LLM inference for development
Key Features:
- Simple installation and model management
- Runs on consumer hardware
- Good for development and testing
- REST API compatible with OpenAI format
When to Use: Local development, prototyping, privacy-sensitive applications
Installation:
```bash
# install.sh targets Linux; macOS and Windows use the installers on ollama.com
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama4
```
Resources: https://ollama.com
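Because Ollama, vLLM, and TGI all expose an OpenAI-compatible chat completions endpoint, one client can target any of them by changing the base URL. A standard-library sketch, assuming the common default ports (11434 for Ollama, 8000 for `vllm serve`), which may differ in your setup; the placeholder API key matters only for servers configured to check one, and `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "not-needed") -> urllib.request.Request:
    """Build an OpenAI-style POST /v1/chat/completions request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


# Ollama's default local address; for vLLM you would use http://localhost:8000.
req = build_chat_request("http://localhost:11434", "llama4", "Hello!")

# Sending it requires a running server:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```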
TensorRT-LLM
Purpose: NVIDIA’s optimized LLM inference
Key Features:
- Maximum performance on NVIDIA GPUs
- INT8/FP8 quantization
- Multi-GPU support
- Inflight batching
When to Use: Production serving requiring maximum GPU efficiency
Resources: https://github.com/NVIDIA/TensorRT-LLM
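In-flight batching is TensorRT-LLM's name for continuous (iteration-level) batching, which vLLM and TGI also use: finished requests leave the batch after every decode step and queued requests take their slots immediately. A simulation sketch; the `Request` type and the known per-request output length are hypothetical simplifications (real engines do not know output lengths in advance).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    tokens_left: int  # decode steps still needed (known here only for simulation)


def continuous_batching(requests: list[Request], max_batch: int) -> list[list[str]]:
    """Return the batch composition at each decode step."""
    waiting = deque(requests)
    running: list[Request] = []
    schedule = []
    while waiting or running:
        # Refill free slots from the queue before every step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        schedule.append([r.rid for r in running])
        # One decode step: each running request emits one token.
        for r in running:
            r.tokens_left -= 1
        running = [r for r in running if r.tokens_left > 0]
    return schedule


steps = continuous_batching([Request("a", 1), Request("b", 3), Request("c", 2)], max_batch=2)
for i, batch in enumerate(steps):
    print(i, batch)
```

Request "c" joins as soon as "a" finishes, so all three complete in 3 steps; static batching would run `[a, b]` for 3 steps and then `[c]` for 2 more.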
ML Frameworks
PyTorch
Purpose: Primary deep learning framework
Key Features:
- Dynamic computation graphs
- Extensive ecosystem
- Strong GPU support
- Industry standard for research and production
When to Use: Almost all deep learning development
Installation:
```bash
pip install torch torchvision torchaudio
```
Resources: https://pytorch.org
Hugging Face Transformers
Purpose: Pre-trained model library and utilities
Key Features:
- Thousands of pre-trained models
- Unified API for different architectures
- Training utilities (Trainer) and PEFT integration
- Integration with Hugging Face Hub
When to Use: Working with transformer models
Installation:
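```bash
pip install transformers
```
Example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
Resources: https://huggingface.co/docs/transformers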
LangChain
Purpose: Framework for LLM application development
Key Features:
- Chain composition for complex workflows
- Built-in prompts and output parsers
- Tool and agent support
- Memory and context management
When to Use: Building LLM applications with complex workflows
Installation:
```bash
pip install langchain langchain-openai
```
Example:
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-5")
prompt = ChatPromptTemplate.from_template("Summarize: {text}")
chain = prompt | llm  # the prompt's output is piped into the model
result = chain.invoke({"text": "Long document..."})
print(result.content)  # the model's reply text
```
Resources: https://python.langchain.com
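The `prompt | llm` syntax relies on Python operator overloading (`__or__`). A toy sketch of the composition idea, not LangChain's implementation: `Step` is a hypothetical minimal class, and `fake_llm` stands in for a model call.

```python
from typing import Any, Callable


class Step:
    """Minimal pipeable step: wraps a function and supports `a | b`."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x: Any) -> Any:
        return self.fn(x)


# Hypothetical stand-ins for a prompt template and an LLM.
prompt = Step(lambda d: f"Summarize: {d['text']}")
fake_llm = Step(lambda s: s.upper())

chain = prompt | fake_llm
print(chain.invoke({"text": "long document"}))
```

Because `|` returns another `Step`, chains compose further (`prompt | fake_llm | parser`) without special cases.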
LlamaIndex
Purpose: Data framework for LLM applications
Key Features:
- Document loading and indexing
- Various retrieval strategies
- Query engines and chat engines
- Strong RAG support
When to Use: RAG systems and document-based applications
Installation:
```bash
pip install llama-index
```
Resources: https://docs.llamaindex.ai
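Before indexing, RAG frameworks split documents into overlapping chunks so that retrieved passages fit the context window without cutting ideas at chunk boundaries. LlamaIndex ships its own token- and sentence-aware splitters; the word-based `chunk_text` below is a hypothetical, simplified stand-in that shows the chunk-size/overlap trade-off.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=2):
    print(chunk)
```

Larger overlap improves the odds that a relevant sentence appears whole in some chunk, at the cost of more chunks to embed and store.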
Vector Databases
Pinecone
Purpose: Managed vector database
Key Features:
- Fully managed service
- Low-latency queries at scale
- Metadata filtering
- Hybrid search support
When to Use: Production vector search without infrastructure management
Installation:
```bash
pip install pinecone
```
Resources: https://docs.pinecone.io
Weaviate
Purpose: Open-source vector database
Key Features:
- Self-hosted or cloud
- GraphQL API
- Built-in vectorization modules
- Hybrid search
When to Use: Self-hosted vector search with flexibility
Resources: https://weaviate.io/developers/weaviate
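Hybrid search merges keyword and vector result lists. Reciprocal rank fusion (RRF) is one common, tuning-light way to combine them; Weaviate exposes its own fusion settings, so treat this as a generic sketch rather than Weaviate's algorithm. The constant k = 60 is a commonly used default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["d3", "d1", "d2"]  # e.g. from BM25
vector_hits = ["d1", "d4", "d3"]   # e.g. from embedding search
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

RRF uses only ranks, so it needs no calibration between BM25 scores and cosine similarities, which live on different scales.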
Milvus
Purpose: Open-source vector database for scale
Key Features:
- Designed for billion-scale
- Multiple index types
- Kubernetes-native
- GPU acceleration
When to Use: Large-scale vector search requiring fine control
Resources: https://milvus.io
Qdrant
Purpose: High-performance vector database
Key Features:
- Rust-based for performance
- Rich filtering
- Quantization support
- Simple deployment
When to Use: Performance-critical vector search
Installation:
```bash
docker run -p 6333:6333 qdrant/qdrant
```
Resources: https://qdrant.tech/documentation
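Scalar quantization stores one byte per dimension instead of four (float32), at the cost of a small reconstruction error. A simplified per-vector min/max sketch; production systems such as Qdrant typically derive bounds across the collection and can rescore with the original vectors, so this is an illustration of the idea, not Qdrant's exact scheme.

```python
def quantize_uint8(vec: list[float]) -> tuple[list[int], float, float]:
    """Map each value to an integer code 0-255 using the vector's own min/max range."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid division by zero for constant vectors
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize(codes: list[int], lo: float, scale: float) -> list[float]:
    return [lo + c * scale for c in codes]


vec = [-0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize_uint8(vec)
approx = dequantize(codes, lo, scale)
print(codes, max(abs(a - b) for a, b in zip(vec, approx)))
```

Each value is off by at most half a quantization step (`scale / 2`).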
Chroma
Purpose: Lightweight embedding database
Key Features:
- Simple API
- Embedded or client-server mode
- Good for prototyping
- Python-native
When to Use: Prototyping, small-scale applications
Installation:
```bash
pip install chromadb
```
Resources: https://docs.trychroma.com
Feature Stores
Feast
Purpose: Open-source feature store
Key Features:
- Cloud-agnostic
- Online and offline stores
- Point-in-time joins
- Python SDK
When to Use: Feature management with flexibility
Installation:
```bash
pip install feast
```
Resources: https://feast.dev
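A point-in-time join attaches to each training example the latest feature value recorded at or before that example's timestamp, so no future information leaks into training. A minimal sketch of the lookup; the `feature_log` data and helper names are hypothetical, and Feast adds details this ignores (TTLs, many entities and feature views, offline store execution).

```python
import bisect

# Hypothetical feature history per entity: (timestamp, value) pairs sorted by time.
feature_log = {
    "user_1": [(10, 0.2), (20, 0.5), (30, 0.9)],
}


def point_in_time_lookup(entity: str, event_ts: int, default=None):
    """Return the latest feature value with timestamp <= event_ts."""
    history = feature_log.get(entity, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect.bisect_right(timestamps, event_ts) - 1
    return history[idx][1] if idx >= 0 else default


labels = [("user_1", 5), ("user_1", 20), ("user_1", 25)]  # (entity, event timestamp)
print([point_in_time_lookup(e, ts) for e, ts in labels])
```

A naive join on "latest value" would give every row 0.9, leaking data from time 30 into examples observed earlier.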
Tecton
Purpose: Enterprise feature platform
Key Features:
- Real-time feature serving
- Streaming feature computation
- Enterprise governance
- Managed service
When to Use: Enterprise ML with real-time features
Resources: https://www.tecton.ai
Experiment Tracking
MLflow
Purpose: ML lifecycle management
Key Features:
- Experiment tracking
- Model registry
- Model serving
- Open source
When to Use: End-to-end ML lifecycle management
Installation:
```bash
pip install mlflow
```
Example:
```python
import mlflow

# The context manager ends the run even if logging raises an exception
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.95)
```
Resources: https://mlflow.org
Weights & Biases (W&B)
Purpose: Experiment tracking and visualization
Key Features:
- Interactive dashboards and visualizations
- Team collaboration
- Hyperparameter sweeps
- Model registry
When to Use: Team-based ML development
Installation:
```bash
pip install wandb
```
Resources: https://docs.wandb.ai
Data Processing
Apache Spark
Purpose: Distributed data processing
Key Features:
- Large-scale data processing
- SQL, streaming, ML libraries
- Widely deployed
- Databricks integration
When to Use: Large-scale batch processing
Resources: https://spark.apache.org
Databricks
Purpose: Unified analytics platform
Key Features:
- Managed Spark
- Unity Catalog
- MLflow integration
- Feature store
When to Use: Enterprise data and ML platform
Resources: https://docs.databricks.com
Ray
Purpose: Distributed computing framework
Key Features:
- Simple distributed Python
- ML-focused libraries (Ray Train, Ray Serve)
- Scales from laptop to cluster
- Good for hyperparameter tuning
When to Use: Distributed ML training and serving
Installation:
```bash
pip install "ray[default]"  # quotes stop shells such as zsh from expanding the brackets
```
Resources: https://docs.ray.io
Orchestration
Airflow
Purpose: Workflow orchestration
Key Features:
- DAG-based workflows
- Extensive operator library
- Monitoring and alerting
- Industry standard
When to Use: Batch ML pipeline orchestration
Installation:
```bash
pip install apache-airflow
```
Resources: https://airflow.apache.org
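DAG-based orchestrators derive run order from task dependencies: a task becomes runnable once all of its upstream tasks have finished, and independent tasks can run in parallel. The standard library's `graphlib` shows the idea (and rejects cycles with `CycleError`). The task names are hypothetical, and this is not Airflow's API.

```python
from graphlib import TopologicalSorter

# Hypothetical ML pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "validate": {"extract"},
    "profile": {"extract"},
    "featurize": {"validate"},
    "train": {"featurize"},
    "report": {"profile", "train"},
}

sorter = TopologicalSorter(pipeline)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose upstream tasks have all finished
    waves.append(ready)                 # an orchestrator could run each wave in parallel
    sorter.done(*ready)                 # here every task "succeeds" instantly
print(waves)
```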
Prefect
Purpose: Modern workflow orchestration
Key Features:
- Python-native
- Less boilerplate than Airflow (flows are decorated Python functions)
- Cloud or self-hosted
- Good debugging
When to Use: Python-centric teams wanting simpler orchestration
Installation:
```bash
pip install prefect
```
Resources: https://docs.prefect.io
Kubeflow
Purpose: ML on Kubernetes
Key Features:
- Kubernetes-native ML
- Pipelines, notebooks, training
- Model serving (KServe)
- Multi-tenancy
When to Use: Kubernetes-based ML platforms
Resources: https://www.kubeflow.org
Monitoring & Observability
Prometheus + Grafana
Purpose: Metrics collection and visualization
Key Features:
- Industry standard metrics
- Rich visualization
- Alerting
- Large ecosystem
When to Use: Infrastructure and application monitoring
Resources: https://prometheus.io, https://grafana.com
Datadog
Purpose: Cloud monitoring platform
Key Features:
- Unified monitoring
- ML monitoring capabilities
- Log management
- APM
When to Use: Enterprise observability
Resources: https://docs.datadoghq.com
Arize AI
Purpose: ML observability platform
Key Features:
- Model monitoring
- Drift detection
- Explainability
- Embedding analysis
When to Use: Production ML monitoring
Resources: https://docs.arize.com
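Drift detection often starts with a distribution comparison such as the Population Stability Index (PSI) between a reference sample (for example, training data) and recent production data. A common rule of thumb treats PSI above about 0.25 as a significant shift, though thresholds vary by team. This is a generic sketch, not Arize's method; the equal-width binning over the reference range is an assumption.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]      # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # all mass moved into [0.5, 1)
print(psi(reference, reference), psi(reference, shifted))
```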
Evaluation
RAGAS
Purpose: RAG evaluation framework
Key Features:
- RAG-specific metrics
- Faithfulness, relevancy, context
- LLM-based evaluation
- Easy integration
When to Use: Evaluating RAG systems
Installation:
```bash
pip install ragas
```
Resources: https://docs.ragas.io
DeepEval
Purpose: LLM evaluation framework
Key Features:
- Multiple evaluation metrics
- Test-driven development for LLMs
- CI/CD integration
- Benchmarking
When to Use: Systematic LLM evaluation
Installation:
```bash
pip install deepeval
```
Resources: https://docs.confident-ai.com
API Providers
OpenAI
Models: GPT-5.5, GPT-5.4, DALL-E, Whisper
Strengths: Leading capability, extensive ecosystem, reliable API
Pricing: Pay-per-token, varies by model
Resources: https://platform.openai.com
Anthropic
Models: Claude Opus 4.8, Sonnet 4.6, Haiku 4.5
Strengths: Strong reasoning, long context, safety focus
Pricing: Pay-per-token
Resources: https://docs.anthropic.com
Google (Vertex AI)
Models: Gemini family, specialized models (Imagen, Veo, embedding models)
Strengths: Multimodal, GCP integration, enterprise features
Pricing: Pay-per-token and character
Resources: https://cloud.google.com/vertex-ai
Amazon Bedrock
Models: Claude, Llama, Titan, Stable Diffusion
Strengths: AWS integration, multiple model providers
Pricing: Pay-per-token/image
Resources: https://docs.aws.amazon.com/bedrock
Cohere
Models: Command, Embed, Rerank
Strengths: Enterprise focus, strong embeddings
Pricing: Pay-per-token/embedding
Resources: https://docs.cohere.com
Development Tools
LangSmith
Purpose: LLM application development platform
Key Features:
- Tracing and debugging
- Testing and evaluation
- Monitoring
- Prompt management
When to Use: Tracing and evaluating LangChain and LangGraph applications (also usable with other frameworks through its SDK)
Resources: https://docs.smith.langchain.com
OpenLLMetry
Purpose: Open-source LLM observability
Key Features:
- OpenTelemetry-based
- Multiple framework integrations
- Vendor-agnostic
- Trace visualization
When to Use: LLM observability with OpenTelemetry
Installation:
```bash
pip install traceloop-sdk
```
Resources: https://github.com/traceloop/openllmetry
Fine-Tuning & Training
PEFT (Parameter-Efficient Fine-Tuning)
Purpose: Efficient fine-tuning of large models
Key Features:
- LoRA, QLoRA, adapters
- Minimal additional parameters
- Works with Transformers
- Memory efficient
When to Use: Fine-tuning large models with limited resources
Installation:
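```bash
pip install peft
```
Example:
```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```
Resources: https://huggingface.co/docs/peft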
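The arithmetic behind LoRA explains its efficiency: a frozen weight W (d_out × d_in) gets a trainable low-rank update B·A with B of shape d_out × r and A of shape r × d_in, scaled by alpha / r (so `r=8, lora_alpha=32` above gives a scale of 4). A pure-Python sketch of the parameter count and the merged weight W + (alpha / r)·B·A; the helper names are hypothetical and matrices are plain nested lists.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one weight: B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


def lora_merged_weight(W, A, B, alpha: float, r: int):
    """W' = W + (alpha / r) * B @ A; once merged, inference has no extra cost."""
    delta = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# One 4096 x 4096 projection: full fine-tuning updates 16,777,216 weights; LoRA r=8 trains 65,536.
print(4096 * 4096, lora_param_count(4096, 4096, r=8))

# Tiny numeric check with r=1: B is 2x1, A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [0.0]]
print(lora_merged_weight(W, A, B, alpha=2.0, r=1))
```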
Axolotl
Purpose: Streamlined LLM fine-tuning
Key Features:
- YAML-based configuration
- Multiple training techniques
- Dataset preprocessing
- Good defaults
When to Use: Quick LLM fine-tuning without deep expertise
Resources: https://github.com/axolotl-ai-cloud/axolotl
DeepSpeed
Purpose: Deep learning optimization library
Key Features:
- ZeRO memory optimization
- Mixed precision training
- Model parallelism
- Microsoft-backed
When to Use: Training large models across multiple GPUs
Installation:
```bash
pip install deepspeed
```
Resources: https://www.deepspeed.ai
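ZeRO's memory savings follow from simple accounting. With mixed-precision Adam, each parameter costs 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights, momentum, and variance); stages 1, 2, and 3 shard the optimizer states, then gradients, then parameters across N GPUs. The sketch counts model states only (activations and buffers are ignored); for 7.5B parameters on 64 GPUs it reproduces the ZeRO paper's example of 120 GB per GPU at baseline, falling to roughly 31.4, 16.6, and 1.9 GB.

```python
def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU memory (GB) for model states under mixed-precision Adam."""
    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params  # bytes
    if stage == 0:    # plain data parallelism: everything replicated
        total = p + g + o
    elif stage == 1:  # shard optimizer states
        total = p + g + o / num_gpus
    elif stage == 2:  # also shard gradients
        total = p + (g + o) / num_gpus
    elif stage == 3:  # also shard parameters
        total = (p + g + o) / num_gpus
    else:
        raise ValueError("stage must be 0-3")
    return total / 1e9


for stage in range(4):
    print(f"ZeRO stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.2f} GB per GPU")
```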
FSDP (Fully Sharded Data Parallel)
Purpose: PyTorch distributed training
Key Features:
- Built into PyTorch
- Memory-efficient training
- Simpler than DeepSpeed for some cases
- Native PyTorch integration
When to Use: Distributed training with PyTorch
Resources: https://pytorch.org/docs/stable/fsdp.html
Data Quality & Validation
Great Expectations
Purpose: Data quality validation
Key Features:
- Expectation-based testing
- Data documentation
- CI/CD integration
- Rich ecosystem
When to Use: Data quality in pipelines
Installation:
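```bash
pip install great_expectations
```
Example:
```python
import great_expectations as gx
import pandas as pd

# Great Expectations 1.x API (0.x used context.get_validator with a batch request)
df = pd.DataFrame({"user_id": [1, 2, None]})
context = gx.get_context()
asset = context.data_sources.add_pandas(name="pandas").add_dataframe_asset(name="users")
batch = asset.add_batch_definition_whole_dataframe(name="all_rows").get_batch(batch_parameters={"dataframe": df})
result = batch.validate(gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id"))
print(result.success)  # False when any user_id is null
```
Resources: https://docs.greatexpectations.io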
Pandera
Purpose: Statistical data validation
Key Features:
- DataFrame validation
- Statistical checks
- Schema inference
- Pandas/Polars support
When to Use: DataFrame validation in Python
Installation:
```bash
pip install pandera
```
Resources: https://pandera.readthedocs.io
DVC (Data Version Control)
Purpose: Version control for ML data and models
Key Features:
- Git-like data versioning
- Pipeline reproducibility
- Cloud storage backends
- Experiment tracking
When to Use: Versioning data and models alongside code
Installation:
```bash
pip install dvc
```
Resources: https://dvc.org
Testing & Quality
pytest
Purpose: Python testing framework
Key Features:
- Simple and powerful
- Fixtures and parameterization
- Plugin ecosystem
- Industry standard
When to Use: All Python testing
Installation:
```bash
pip install pytest
```
Resources: https://docs.pytest.org
hypothesis
Purpose: Property-based testing
Key Features:
- Automatic test case generation
- Edge case discovery
- Shrinking failing examples
- Useful for data processing
When to Use: Testing data processing and transformations
Installation:
```bash
pip install hypothesis
```
Resources: https://hypothesis.readthedocs.io
Locust
Purpose: Load testing
Key Features:
- Python-based test scripts
- Distributed testing
- Real-time metrics
- Web UI
When to Use: Load testing inference endpoints
Installation:
```bash
pip install locust
```
Resources: https://locust.io
Security
Guardrails AI
Purpose: LLM output validation
Key Features:
- Structured output enforcement
- Content filtering
- Type validation
- Retry mechanisms
When to Use: Ensuring LLM outputs meet requirements
Installation:
```bash
pip install guardrails-ai
```
Resources: https://docs.guardrailsai.com
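The validate-then-retry loop that output-validation libraries automate looks roughly like this: parse the model's reply, check it against a schema, and re-prompt with the error message on failure. This is not the Guardrails API; `generate_validated` and the `call_llm` function you pass in are hypothetical, and the "schema" is just required keys and Python types.

```python
import json
from typing import Callable


def generate_validated(call_llm: Callable[[str], str], prompt: str,
                       required: dict[str, type], max_retries: int = 2) -> dict:
    """Request JSON, validate keys and types, and re-prompt with the error on failure."""
    attempt_prompt = prompt
    last_error = ""
    for _ in range(max_retries + 1):
        raw = call_llm(attempt_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            for key, typ in required.items():
                if not isinstance(data.get(key), typ):
                    raise ValueError(f"field {key!r} must be {typ.__name__}")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = str(exc)
            attempt_prompt = (f"{prompt}\nYour previous answer was invalid "
                              f"({last_error}). Return only valid JSON.")
    raise RuntimeError(f"no valid output after {max_retries + 1} attempts: {last_error}")


# Hypothetical model stub: the first reply has a wrong type, the second is valid.
replies = iter(['{"name": "Ada", "age": "unknown"}', '{"name": "Ada", "age": 36}'])
result = generate_validated(lambda p: next(replies), "Describe the user as JSON.",
                            {"name": str, "age": int})
print(result)
```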
NeMo Guardrails
Purpose: Conversational AI safety
Key Features:
- Programmable guardrails
- Topical rails
- Dialogue management
- NVIDIA-backed
When to Use: Building safe conversational AI
Resources: https://github.com/NVIDIA/NeMo-Guardrails
Embeddings & Retrieval
Sentence Transformers
Purpose: Sentence embeddings library
Key Features:
- Pre-trained embedding models
- Training utilities
- Multiple loss functions
- Simple API
When to Use: Text embeddings for search and similarity
Installation:
```bash
pip install sentence-transformers
```
Example:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello world", "How are you"])
```
Resources: https://www.sbert.net
FAISS
Purpose: Vector similarity search
Key Features:
- Billion-scale search
- GPU support
- Multiple index types
- Facebook/Meta research
When to Use: Efficient vector search in Python
Installation:
```bash
pip install faiss-cpu  # or faiss-gpu
```
Resources: https://faiss.ai
BM25 (rank_bm25)
Purpose: Sparse retrieval
Key Features:
- Classic keyword search
- No embeddings needed
- Fast and interpretable
- Good baseline
When to Use: Hybrid search, keyword matching
Installation:
```bash
pip install rank_bm25
```
Resources: https://github.com/dorianbrown/rank_bm25
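BM25 scores a document by summing, over query terms, an IDF weight times a saturating term-frequency factor normalized by document length. A from-scratch sketch using the Lucene-style IDF (with +1 inside the log so it stays positive); `rank_bm25`'s `BM25Okapi` handles IDF slightly differently, so scores will not match it exactly. The `k1 = 1.5` and `b = 0.75` values are common defaults.

```python
import math
from collections import Counter


class BM25:
    """Okapi BM25 over pre-tokenized documents."""

    def __init__(self, docs: list[list[str]], k1: float = 1.5, b: float = 0.75):
        self.docs, self.k1, self.b = docs, k1, b
        self.avgdl = sum(len(d) for d in docs) / len(docs)
        self.tf = [Counter(d) for d in docs]
        df = Counter(term for d in docs for term in set(d))  # document frequency
        n = len(docs)
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}

    def score(self, query: list[str], i: int) -> float:
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in query:
            if term not in tf:
                continue
            f = tf[term]
            # Longer-than-average documents need more matches for the same score.
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query: list[str]) -> list[int]:
        return sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)


corpus = [doc.lower().split() for doc in [
    "vector databases store embeddings",
    "bm25 is a keyword ranking function",
    "hybrid search combines bm25 and vector search",
]]
bm25 = BM25(corpus)
print(bm25.rank("bm25 keyword".split()))
```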
Prompt Engineering
Guidance
Purpose: Structured LLM generation
Key Features:
- Constrained generation
- Template language
- Control flow
- Output parsing
When to Use: Complex structured outputs
Installation:
```bash
pip install guidance
```
Resources: https://github.com/guidance-ai/guidance
DSPy
Purpose: Programming with foundation models
Key Features:
- Declarative programming
- Automatic prompt optimization
- Module composition
- Stanford research
When to Use: Systematic prompt engineering
Installation:
```bash
pip install dspy
```
Resources: https://dspy.ai
Container & Deployment
Docker
Purpose: Containerization
Key Features:
- Reproducible environments
- Industry standard
- Rich ecosystem
- Essential for deployment
When to Use: Packaging and deploying all applications
Resources: https://docs.docker.com
Kubernetes
Purpose: Container orchestration
Key Features:
- Automatic scaling
- Self-healing
- Service discovery
- Industry standard for production
When to Use: Production deployment at scale
Resources: https://kubernetes.io
KServe
Purpose: ML model serving on Kubernetes
Key Features:
- Serverless inference
- Autoscaling
- A/B testing
- Multiple frameworks
When to Use: Model serving on Kubernetes
Resources: https://kserve.github.io
BentoML
Purpose: ML model serving framework
Key Features:
- Easy model packaging
- Multiple frameworks
- Adaptive batching
- Cloud deployment
When to Use: Packaging models for deployment
Installation:
```bash
pip install bentoml
```
Resources: https://docs.bentoml.org
Cloud Platforms
AWS SageMaker
Purpose: AWS ML platform
Key Features:
- Managed training and inference
- Jupyter notebooks
- Model registry
- MLOps features
When to Use: ML on AWS
Resources: https://docs.aws.amazon.com/sagemaker
Google Cloud Vertex AI
Purpose: GCP ML platform
Key Features:
- Managed ML pipelines
- Model serving
- AutoML
- Gemini integration
When to Use: ML on GCP
Resources: https://cloud.google.com/vertex-ai/docs
Azure Machine Learning
Purpose: Azure ML platform
Key Features:
- End-to-end ML lifecycle
- MLOps automation
- Responsible AI tools
- Azure OpenAI integration
When to Use: ML on Azure
Resources: https://learn.microsoft.com/azure/machine-learning
Quick Selection Guide
| Use Case | Recommended Tools |
|---|---|
| LLM Inference (High Volume) | vLLM, TGI, TensorRT-LLM |
| LLM Inference (Local Dev) | Ollama |
| RAG Application | LlamaIndex + Pinecone/Qdrant |
| Complex LLM Workflows | LangChain |
| Experiment Tracking | MLflow, W&B |
| Feature Store | Feast (open source), Tecton (enterprise) |
| Vector Database (Managed) | Pinecone |
| Vector Database (Self-Hosted) | Milvus, Qdrant |
| Pipeline Orchestration | Airflow, Prefect |
| ML on Kubernetes | Kubeflow |
| LLM Monitoring | Arize AI, LangSmith |
| API Provider (General) | OpenAI, Anthropic |
| API Provider (Enterprise/AWS) | Amazon Bedrock |
| API Provider (Enterprise/GCP) | Vertex AI |
Version Compatibility Notes
The AI tool landscape evolves rapidly. When this book was written (early 2026):
- PyTorch 2.x (2.12) is standard, with torch.compile providing significant speedups
- Transformers 5.x (5.9 at the time of writing; v5 is a major release over 4.x) supports most modern models
- vLLM and TGI are the dominant open-source inference engines
- Vector databases have largely converged on similar feature sets
Always check documentation for current versions and breaking changes.
Additional Tools by Category
Model Optimization
| Tool | Purpose | When to Use |
|---|---|---|
| ONNX Runtime | Cross-platform inference | Deploying models across different hardware |
| TensorRT | NVIDIA GPU optimization | Maximum performance on NVIDIA hardware |
| OpenVINO | Intel optimization | Inference on Intel CPUs/GPUs |
| CoreML | Apple device deployment | iOS/macOS applications |
| TFLite | Mobile deployment | Android/embedded devices |
| llama.cpp | CPU-optimized LLM inference | Running LLMs on CPU |
| ExLlamaV2 | Quantized LLM inference | Memory-efficient GPU inference |
Data Processing
| Tool | Purpose | When to Use |
|---|---|---|
| Polars | Fast DataFrame operations | Large-scale data processing |
| Dask | Parallel computing | Out-of-core and distributed pandas |
| DuckDB | Embedded analytics | SQL queries on local data |
| Delta Lake | ACID data lake storage | Reliable data lake operations |
| Apache Arrow | Columnar memory format | High-performance data interchange |
Annotation & Labeling
| Tool | Purpose | When to Use |
|---|---|---|
| Label Studio | Open-source labeling | Self-hosted data annotation |
| Prodigy | Efficient annotation | Active learning-based labeling |
| Scale AI | Enterprise labeling | Large-scale human labeling |
| Amazon SageMaker Ground Truth | AWS labeling | Labeling with AWS integration |
| Snorkel | Programmatic labeling | Weak supervision approaches |
Distributed Training
| Tool | Purpose | When to Use |
|---|---|---|
| Horovod | Distributed training | Multi-GPU/multi-node training |
| PyTorch Lightning | Training framework | Organized PyTorch training |
| MosaicML Composer | Efficient training | Optimized training recipes |
| ColossalAI | Large model training | Training very large models |
AutoML & Hyperparameter Tuning
| Tool | Purpose | When to Use |
|---|---|---|
| Optuna | Hyperparameter optimization | Automatic hyperparameter tuning |
| Ray Tune | Distributed HPO | Scalable hyperparameter search |
| Auto-sklearn | AutoML | Automated model selection |
| H2O AutoML | Enterprise AutoML | Business-focused AutoML |
Explainability
| Tool | Purpose | When to Use |
|---|---|---|
| SHAP | Feature importance | Understanding model decisions |
| LIME | Local explanations | Explaining individual predictions |
| Captum | PyTorch interpretability | Deep learning explainability |
| InterpretML | Unified interpretability | Glass-box and black-box models |
Natural Language Processing
| Tool | Purpose | When to Use |
|---|---|---|
| spaCy | Industrial NLP | Production NLP pipelines |
| NLTK | Educational NLP | Learning and prototyping |
| Stanza | Research NLP | Stanford NLP models |
| Flair | State-of-the-art NLP | Sequence labeling |
Computer Vision
| Tool | Purpose | When to Use |
|---|---|---|
| OpenCV | Computer vision basics | Image processing operations |
| Ultralytics | Object detection | YOLO models |
| Detectron2 | Object detection | Facebook research models |
| timm | Vision models | Pre-trained image models |
Audio & Speech
| Tool | Purpose | When to Use |
|---|---|---|
| Whisper | Speech-to-text | Transcription |
| Coqui TTS | Text-to-speech | Voice synthesis |
| Librosa | Audio processing | Audio feature extraction |
| PyAnnote | Speaker diarization | Who spoke when |
Tool Selection Framework
When selecting tools for a new project, consider these factors:
1. Scale Requirements
Small Scale (< 1M requests/month):
- Ollama for local LLM
- Chroma for vectors
- SQLite/DuckDB for data
- Simple deployment (single server)
Medium Scale (1M-100M requests/month):
- vLLM/TGI for inference
- Qdrant/Weaviate for vectors
- PostgreSQL/Redis for data
- Kubernetes deployment
Large Scale (> 100M requests/month):
- TensorRT-LLM for inference
- Pinecone/Milvus for vectors
- Specialized data infrastructure
- Multi-region deployment
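The scale tiers above are easier to reason about as request rates. Converting monthly volume to an average rate, assuming a 30-day month, shows that even 100M requests/month averages only about 39 requests per second; real traffic is bursty, and the 5x peak factor below is a hypothetical placeholder to replace with measurements.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month
PEAK_FACTOR = 5  # hypothetical peak-to-average ratio; measure your own traffic


def average_rps(requests_per_month: float) -> float:
    """Average requests per second implied by a monthly request volume."""
    return requests_per_month / SECONDS_PER_MONTH


for monthly in (1_000_000, 100_000_000):
    avg = average_rps(monthly)
    print(f"{monthly:,}/month -> {avg:.2f} req/s average, ~{avg * PEAK_FACTOR:.1f} req/s at peak")
```

For LLM serving, tokens per request matter as much as request count, so size capacity in tokens per second as well.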
2. Team Expertise
Data Science Background:
- Jupyter notebooks
- MLflow for experiments
- Scikit-learn patterns
- Pandas-centric workflow
Software Engineering Background:
- Containerized workflows
- API-first design
- Proper testing
- CI/CD integration
Mixed Teams:
- Unified platforms (Databricks, SageMaker)
- Clear interfaces between components
- Documentation emphasis
3. Cloud Strategy
AWS-Native:
- SageMaker for ML
- Bedrock for LLMs
- Lambda for serverless
- S3 for storage
GCP-Native:
- Vertex AI for ML
- Gemini for LLMs
- Cloud Functions
- BigQuery for data
Multi-Cloud / Agnostic:
- Open-source tools (MLflow, Kubeflow)
- Portable formats (ONNX)
- Terraform for infrastructure
4. Compliance Requirements
Regulated Industries:
- Self-hosted models
- Private deployments
- Audit logging
- Data residency controls
Standard Business:
- Managed services acceptable
- API providers viable
- Standard security practices
Migration Paths
Moving from Prototyping to Production
| Prototype | Production Alternative |
|---|---|
| OpenAI API | vLLM + open model |
| Chroma | Qdrant, Pinecone |
| Jupyter notebooks | Python packages |
| Local files | Feature store |
| Manual deployment | CI/CD + Kubernetes |
Scaling Existing Systems
| Current State | Scaled State |
|---|---|
| Single GPU | Multi-GPU with DeepSpeed |
| Single node | Kubernetes cluster |
| Monolithic | Microservices |
| Synchronous | Async + queues |
| Single region | Multi-region |
Cost Considerations
Free/Open Source Options
- Inference: vLLM, TGI, Ollama
- Vectors: Qdrant, Milvus, Chroma
- Experiment Tracking: MLflow
- Orchestration: Airflow
- Feature Store: Feast
Managed Service Trade-offs
| Aspect | Self-Hosted | Managed |
|---|---|---|
| Initial cost | Higher (setup) | Lower |
| Ongoing cost | Lower at scale | Higher at scale |
| Operational burden | Higher | Lower |
| Customization | Full control | Limited |
| Scaling | Manual | Automatic |
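The "lower at scale" and "higher at scale" rows reflect a break-even point: self-hosting adds a fixed monthly cost but lowers the marginal cost per token. A sketch of that calculation; every number in the example is a hypothetical input, not a real price.

```python
def breakeven_tokens_per_month(fixed_monthly_cost: float,
                               api_price_per_m: float,
                               self_hosted_price_per_m: float) -> float:
    """Monthly volume (millions of tokens) at which self-hosting matches API spend.

    API cost:          api_price_per_m * volume
    Self-hosted cost:  fixed_monthly_cost + self_hosted_price_per_m * volume
    """
    margin = api_price_per_m - self_hosted_price_per_m
    if margin <= 0:
        return float("inf")  # self-hosting never catches up on marginal cost
    return fixed_monthly_cost / margin


# Hypothetical inputs: $6,000/month for GPUs and operations time,
# $2.00 per million tokens via an API, $0.50 per million tokens self-hosted.
print(breakeven_tokens_per_month(6000, 2.00, 0.50), "million tokens/month")
```

Below the break-even volume the managed option is cheaper; the fixed cost should include engineering time, which often dominates.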
Getting Started Recommendations
For Individual Developers
If you’re learning or building personal projects:
- Start with: Ollama (local LLMs), Chroma (vectors), LangChain (orchestration)
- Add as needed: Hugging Face for models, MLflow for experiments
- Deploy with: Docker, any cloud provider
For Startups
Building an MVP with limited resources:
- Start with: OpenAI/Anthropic APIs, Pinecone (managed vectors), Vercel/Railway (deployment)
- Graduate to: vLLM + open models when API costs become significant
- Focus on: Product-market fit over infrastructure optimization
For Enterprise Teams
Building production systems at scale:
- Infrastructure: Kubernetes, Terraform, GitOps
- ML Platform: Databricks/SageMaker/Vertex AI or build with MLflow + Kubeflow
- Monitoring: Arize AI or Evidently + Prometheus/Grafana
- Security: Air-gapped options, self-hosted models for sensitive data
For Research Teams
Running experiments and publishing papers:
- Training: PyTorch + W&B + DeepSpeed
- Compute: Cloud spot instances, academic GPU clusters
- Reproducibility: DVC, Docker, experiment configs as code
Tool Ecosystem Compatibility
Many tools work better together. Here are common compatible stacks:
The Hugging Face Stack
- Transformers (models)
- Datasets (data loading)
- PEFT (fine-tuning)
- TGI (serving)
- Hub (model hosting)
The LangChain Stack
- LangChain (orchestration)
- LangSmith (debugging)
- LangServe (deployment)
- LangGraph (agents)
The AWS Stack
- SageMaker (training/serving)
- Bedrock (LLM APIs)
- S3 (storage)
- Lambda (serverless)
- Step Functions (orchestration)
The Open Source Stack
- PyTorch (training)
- vLLM (serving)
- MLflow (experiments)
- Feast (features)
- Airflow (orchestration)
- Prometheus + Grafana (monitoring)
Version Pinning Recommendations
AI tools evolve rapidly. For production systems:
```text
# Example requirements.txt with pinning strategy

# Constrain to a major version
torch>=2.12,<3.0
transformers>=5.9,<6.0  # transformers v5 is a major version

# Pin exact versions for critical inference
vllm==0.19.0

# Allow patch updates for utilities
langchain>=0.3,<0.4
```
Best Practices:
- Pin exact versions in production (for example, with a lock file generated from ranges like those above)
- Test upgrades in staging environments
- Monitor release notes for breaking changes
- Keep a changelog of version updates
- Use virtual environments or containers for isolation
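A small CI check can enforce exact pins in production requirement files. A simplified sketch: `unpinned` is a hypothetical helper that handles plain `name==version` lines, extras, comments, and option lines such as `-r base.txt`, but not every PEP 508 form (URLs and environment markers are flagged).

```python
import re


def unpinned(requirements_text: str) -> list[str]:
    """List requirement names not pinned to a single exact version with '=='."""
    flagged = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        name = re.split(r"[<>=!~\s\[;@]", line, maxsplit=1)[0]
        spec = line[len(name):]
        # Accept only an optional [extras] followed by '==<version>'.
        if not re.fullmatch(r"(\[[^\]]*\])?\s*==\s*[\w.+!]+", spec):
            flagged.append(name)
    return flagged


sample = """
torch>=2.12,<3.0
transformers>=5.9,<6.0  # transformers v5 is a major version
vllm==0.19.0
ray[default]==2.9.0
-r base.txt
"""
print(unpinned(sample))
```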
Deprecated and Legacy Tools
The following tools were popular but are now less recommended:
| Tool | Status | Replacement |
|---|---|---|
| TensorFlow 1.x | Deprecated | PyTorch or TF 2.x |
| Keras 2.x (standalone, multi-backend) | Superseded | Keras 3, tf.keras, or PyTorch |
| GPT-2 | Outdated | Modern LLMs |
| BERT-base (original) | Outdated | Sentence Transformers, E5 |
| Apache Zeppelin | Less active | Jupyter |
| Dask-ML | Limited development | Ray, Spark |
| Haystack 1.x | Major changes | Haystack 2.x, LlamaIndex, LangChain |
When encountering tutorials using deprecated tools, look for updated alternatives or modern equivalents.
Resources for Staying Current
Official Documentation
Always prefer official documentation over tutorials, as it reflects current versions.
Release Tracking
- GitHub release pages for major projects
- PyPI for Python package updates
- Hugging Face blog for ecosystem news
Community Sources
- r/MachineLearning and r/LocalLLaMA for discussions
- Discord servers for major projects (LangChain, vLLM, etc.)
- Twitter/X ML community for breaking developments
This Book’s Companion
Check the book’s repository for updated tool recommendations and errata. The AI landscape in 2027 will differ from what is documented here in 2026.