Chapter 10: Orchestration & Agent Frameworks

Keywords

LangChain, LlamaIndex, LangGraph, CrewAI, DSPy, framework comparison, orchestration, pipelines

Introduction: Navigating the Fragmented Ecosystem

The AI engineering landscape in 2026 is experiencing rapid growth and fragmentation. Unlike traditional software domains where a few frameworks dominate (React for frontends, Django/FastAPI for Python backends), the LLM tooling ecosystem is still finding its footing. New frameworks emerge monthly, each claiming to solve orchestration, agents, or observability “the right way.”

This fragmentation creates a real problem for engineers: tool choice matters. Pick the wrong framework early, and you might face a costly rewrite. Choose a heavyweight solution for a simple problem, and you’ll fight abstraction layers forever. Skip frameworks entirely when you need them, and you’ll rebuild common patterns poorly.

This chapter focuses on two foundational categories: orchestration frameworks (for chaining LLM calls, managing context, and retrieval) and agent frameworks (for building autonomous systems). We’ll examine each major option, compare real implementations, and build a decision framework based on project requirements.

For observability tools, structured output libraries, and guardrails, see Chapter 11.

The Stakes of Tool Choice

Consider these real scenarios:

Scenario 1: The Over-Engineered Chatbot A startup uses LangChain for a simple RAG chatbot. They spend 60% of their time fighting framework quirks, debugging opaque abstractions, and waiting for dependency updates. A “just Python” approach with direct API calls would have shipped faster and been easier to maintain.

Scenario 2: The DIY Nightmare A team builds a complex multi-agent system from scratch, avoiding frameworks. Six months in, they’ve recreated LangGraph’s state management poorly, with subtle bugs in retry logic, state persistence, and error handling. They should have used a framework.

Scenario 3: The Lock-In Trap A company builds heavily on LangChain’s proprietary features. When they need to migrate to a different model provider or framework, they discover their code is tightly coupled to LangChain abstractions. A rewrite takes months.

The goal of this chapter is to help you avoid these scenarios.

Common Mistake: Choosing a Framework Before Understanding the Problem

What people do: Start a project by asking “Should I use LangChain or LlamaIndex?” before understanding their actual requirements. They pick based on tutorial familiarity or popularity.

Why it fails: Frameworks impose mental models and constraints. If you don’t know your requirements, you can’t evaluate whether a framework’s constraints help or hurt. Teams often discover six months in that they chose the wrong abstraction level—too high (fighting the framework) or too low (recreating framework features).

Fix: Start by building a minimal prototype with direct API calls. Let pain points emerge naturally. If you find yourself writing the same retry logic repeatedly, a framework might help. If you need one specific feature (like vector search), evaluate specialized tools. Only adopt a comprehensive framework when you’ve experienced the complexity it addresses.

What You’ll Learn

  1. Tool Evaluation Framework - A timeless approach to assessing any AI engineering tool
  2. Orchestration Frameworks - When to use LangChain vs LlamaIndex vs Haystack vs raw Python
  3. Agent Frameworks - Comparing LangGraph, CrewAI, AutoGen, and Pydantic AI
  4. Migration Strategies - Moving between tools without rewrites

Let’s dive in.


Tool Evaluation Framework (Evergreen)

Before examining specific tools, let’s establish an evaluation framework that applies to any AI engineering tool, now or in the future. This framework will remain useful long after the specific recommendations in this chapter become outdated.

The Five Dimensions of Tool Evaluation

1. Abstraction Level

Every tool makes tradeoffs between ease-of-use and control. Evaluate where a tool sits on this spectrum:

High Abstraction                      Low Abstraction
Fewer lines of code                   More lines of code
Faster to start                       Slower to start
Harder to customize                   Easier to customize
Opaque when debugging                 Transparent when debugging
Magic that helps… until it doesn’t    No magic, no surprises

Questions to ask: Can I see what the tool does under the hood? When I hit an edge case, can I work around it? Will I understand the code six months from now?

2. Lock-in Risk

Consider how much of your codebase becomes coupled to the tool:

  • API coupling: Does the tool define interfaces throughout your code, or only at boundaries?
  • Data format coupling: Are your documents, embeddings, or configurations stored in proprietary formats?
  • Conceptual coupling: Does using this tool require adopting its mental model everywhere?

Questions to ask: If this tool disappears tomorrow, how much would I need to rewrite? Can I use standard interfaces (OpenAI SDK, standard Python) for most of my code?

3. Production Readiness

Evaluate operational concerns that don’t show up in tutorials:

  • Error handling: What happens when external services fail? Does the tool retry appropriately?
  • Observability: Can you see what’s happening? Latency per stage? Token usage? Error rates?
  • Performance: What overhead does the framework add? Is it acceptable for your latency budget?
  • Stability: How often do breaking changes occur? Is there a deprecation policy?

Questions to ask: What will wake me up at 2 AM? What will I need to debug in production?
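
One way to answer the "latency per stage" question before adopting any tool is to wrap each stage in a small timer. The sketch below is a minimal, framework-free illustration; `timed_stage`, `summarize`, and the in-process `stage_latencies_ms` store are hypothetical names, and a real system would export these numbers to its own metrics backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-process store: stage name -> list of latencies in milliseconds.
stage_latencies_ms = defaultdict(list)

@contextmanager
def timed_stage(name):
    """Record wall-clock latency for one pipeline stage, even if the stage raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_latencies_ms[name].append((time.perf_counter() - start) * 1000.0)

def summarize(latencies):
    """Return {stage: (call_count, mean_ms, max_ms)}."""
    return {
        name: (len(values), sum(values) / len(values), max(values))
        for name, values in latencies.items()
    }

# Usage: wrap each stage, then inspect the summary.
with timed_stage("retrieve"):
    sum(range(1000))          # stand-in for a vector search
with timed_stage("generate"):
    sum(range(2000))          # stand-in for an LLM call

for stage, (count, mean_ms, max_ms) in summarize(stage_latencies_ms).items():
    print(f"{stage}: n={count} mean={mean_ms:.3f}ms max={max_ms:.3f}ms")
```

Running the same stages once with direct SDK calls and once through a candidate framework gives a rough per-stage overhead figure to compare against your latency budget.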

4. Ecosystem and Support

Consider the community and resources around the tool:

  • Documentation quality: Are there examples for your use case? Is the documentation accurate and current?
  • Community activity: Are issues addressed? Are there Stack Overflow answers?
  • Integration breadth: Does it work with your existing stack (monitoring, auth, deployment)?
  • Commercial support: For enterprise needs, is paid support available?

Questions to ask: When I get stuck, where will I find help? Who else uses this in production?

5. Team Fit

The best tool is the one your team can use effectively:

  • Learning curve: How long until team members are productive?
  • Existing knowledge: Does it build on skills your team already has?
  • Hiring implications: Will this tool make hiring easier or harder?
  • Maintenance burden: Who will maintain this? Do they want to?

Questions to ask: Can junior engineers contribute to this code? Will this choice affect our ability to hire?

The Evaluation Process

When evaluating a new tool, follow this process:

  1. Define requirements first: What problem are you solving? What constraints matter? Write these down before looking at tools.

  2. Build a proof-of-concept: Spend 2-4 hours building something real. Don’t just follow the tutorial—try something specific to your use case.

  3. Hit the edge cases deliberately: Try the thing that will probably break. Add error handling. Customize the behavior. This reveals true complexity.

  4. Check production stories: Search for “[tool name] production” or “[tool name] postmortem”. What have others learned?

  5. Evaluate the exit path: Before committing, understand how you would migrate away if needed.
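
If it helps to make the comparison explicit, the five dimensions can be turned into a simple weighted score. This is only a sketch: the weights, the 1-5 scale, and the `tool_a` / `tool_b` scores below are invented placeholders, not measurements of any real framework. The weights should come from the requirements you wrote down in step 1.

```python
# Illustrative weights (sum to 1.0); adjust to your own requirements.
WEIGHTS = {
    "abstraction_fit": 0.25,
    "lock_in_risk": 0.20,          # scored so that higher means LESS lock-in
    "production_readiness": 0.25,
    "ecosystem": 0.15,
    "team_fit": 0.15,
}

def weighted_score(scores, weights=WEIGHTS):
    """Combine per-dimension scores (1-5) into a single weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

# Hypothetical scores gathered from a proof-of-concept with each candidate.
candidates = {
    "tool_a": {"abstraction_fit": 4, "lock_in_risk": 2, "production_readiness": 3,
               "ecosystem": 5, "team_fit": 4},
    "tool_b": {"abstraction_fit": 3, "lock_in_risk": 5, "production_readiness": 5,
               "ecosystem": 3, "team_fit": 3},
}

ranked = sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)
for name in ranked:
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```

The number itself matters less than the discussion it forces: a score is only as good as the proof-of-concept evidence behind each entry.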

When to Use No Framework

Sometimes the best tool is no tool. Consider “just Python” when:

  • Your use case is straightforward (simple RAG, basic chat)
  • You need maximum control over behavior
  • You’re in a latency-critical path
  • Your team prefers explicit code over framework abstractions
  • You’re building something novel that frameworks don’t support well

The LLM provider SDKs (OpenAI, Anthropic, Google) are well-designed. For many applications, direct SDK usage with your own helper functions is cleaner than framework abstractions.


Tool Recommendations: As of January 2026

The following sections contain specific tool comparisons and recommendations. These reflect the landscape as of January 2026. While the evaluation framework above is evergreen, specific tools, versions, and recommendations will evolve. Check Appendix B for the latest updates.


1. Orchestration Frameworks: The Foundation Layer

Staff Engineer Perspective

“We picked LangChain in early 2024 because every blog post used it. Six months later, we’d hit three breaking changes, our prompts were buried under four abstraction layers, and our p95 latency had 80ms of pure framework overhead. Migrating to thin wrappers around the raw OpenAI and Anthropic SDKs took a team of three about five weeks and removed roughly 4,000 lines of glue code. Throughput went up 30 percent. My rule now: if you can sketch the LLM call on a whiteboard in 60 seconds, you don’t need a framework for it.”

Principal Engineer at a B2B SaaS company

Orchestration frameworks handle the plumbing: chaining LLM calls, managing context, retrieval, and combining components. This is the most crowded category.

The Contenders

A note on GitHub stars

The star counts below are a popularity proxy, not a quality or production-readiness signal. Stars track mindshare and tutorial volume; they say nothing about debuggability, API stability, or how the framework behaves under load. What actually determines fit is the architectural commitment a framework imposes (its abstraction model, its dependency surface) and the failure modes that commitment creates in production. Read the numbers as “how many people have heard of it,” then judge on the architecture.

LangChain: The Dominant Giant

Market Position: Most popular framework with ~85k GitHub stars (a popularity proxy, not a quality signal—see the note above). Philosophy: Batteries-included with heavy abstractions.

LangChain’s dominance stems from its massive ecosystem—integrations with over 100 LLM providers, vector stores, and external tools. For teams evaluating their first orchestration framework, this breadth is compelling. The extensive documentation, active community, and rich feature set (memory management, callbacks, streaming, async support) mean you’ll find examples for almost any use case.

However, this power comes with significant tradeoffs. LangChain’s abstractions often obscure what’s actually happening, making debugging difficult. When something breaks—and it will—you may find yourself diving through layers of classes to understand a simple API call. The framework’s rapid development pace leads to frequent breaking changes between versions, creating maintenance burden for production systems. Teams regularly report spending more time fighting the framework than building features.

The abstraction layers also introduce performance overhead. For latency-sensitive applications, the extra function calls and object instantiation add measurable delay. Perhaps more concerning, LangChain makes it easy to write code you don’t fully understand. Copy-pasting a RetrievalQA.from_chain_type(chain_type="stuff") call works until you need to customize the behavior—then you discover “stuff” is an opaque implementation detail hiding a specific prompting strategy.
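
To make the "stuff" idea concrete: conceptually it concatenates every retrieved chunk into one prompt and sends a single LLM call. The sketch below is a plain-Python approximation of that strategy, not LangChain's actual implementation; the function name `build_stuff_prompt`, the prompt wording, and the character-based `max_chars` guard are all assumptions made for illustration.

```python
def build_stuff_prompt(question, docs, max_chars=12_000):
    """Concatenate retrieved chunks into one prompt (the "stuff" idea).

    max_chars is a crude stand-in for a context-window limit; a real
    implementation would count tokens rather than characters.
    """
    context_parts = []
    used = 0
    for doc in docs:
        if used + len(doc) > max_chars:
            break  # drop the remaining chunks rather than cut one in half
        context_parts.append(doc)
        used += len(doc)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "What is the refund policy?",
    ["Refunds are issued within 30 days.", "Shipping takes 5 business days."],
)
print(prompt)
```

Seen this way, the customization questions become obvious: what happens when the chunks do not fit, in what order they appear, and what instructions surround them. Those are exactly the decisions the one-word parameter hides.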

When to choose LangChain: Rapid prototyping where time-to-demo outweighs long-term maintainability. Teams already invested in the ecosystem. Projects requiring many integrations simultaneously. Avoid for production systems where you need transparency and control, or where API stability matters.

LlamaIndex: The RAG Specialist

Market Position: One of the most popular frameworks with ~50k stars, RAG-focused. Philosophy: Data structures optimized for LLM applications.

LlamaIndex takes a fundamentally different approach than LangChain—instead of being a general-purpose orchestration framework, it focuses specifically on data retrieval and indexing. This specialization shows in its architecture: concepts like indexes, retrievers, and query engines map directly to RAG pipeline components, making the mental model clearer for retrieval-heavy applications.

The framework’s focus on data structures brings performance benefits. LlamaIndex handles large document collections more efficiently than LangChain, with better chunking strategies, sophisticated index types (tree, list, keyword), and optimized retrieval paths. The knowledge graph support is particularly strong—if your application needs to reason over structured relationships, LlamaIndex provides first-class primitives.

The abstractions are also cleaner than LangChain’s for RAG use cases. When you create a VectorStoreIndex and call as_query_engine(), the data flow is more intuitive. The response_mode="compact" parameter actually describes what it does, unlike LangChain’s chain_type="stuff".

The tradeoff is scope. LlamaIndex is primarily a RAG framework, and using it for non-retrieval applications often feels awkward. If your project involves agentic workflows, tool use, or general orchestration without document retrieval, you’ll find yourself reaching for other tools. The ecosystem is also smaller—fewer integrations, fewer examples for edge cases, and documentation that can be sparse for advanced configurations.

When to choose LlamaIndex: Document Q&A systems, knowledge base search, any application where retrieval quality is the primary concern. Particularly strong for structured data and knowledge graphs. Avoid for agent-heavy applications or general orchestration without retrieval.

Haystack: The Enterprise Choice

Market Position: ~25k stars, strong enterprise adoption. Philosophy: Modular pipelines designed for production.

Haystack approaches LLM orchestration as a pipeline problem—you build explicit graphs of components that transform data step by step. This isn’t the most beginner-friendly model, but it pays dividends in production. When you define a pipeline with separate TextEmbedder, Retriever, PromptBuilder, and Generator components, you can see exactly what happens at each stage.

The framework’s production focus shows in details that matter for enterprise deployments: built-in error handling and retry logic, monitoring hooks that integrate with standard observability stacks, strong typing that catches errors before runtime, and enterprise features like authentication and audit logging. If you’re in a regulated industry that requires demonstrating exactly how a decision was made, Haystack’s explicit pipeline structure creates natural audit trails.

The component architecture also promotes reusability. A custom embedder you build for one pipeline can drop into another. Teams with multiple LLM applications sharing components find this modularity valuable.

The tradeoffs are ecosystem size and development velocity. Haystack’s smaller community means fewer integrations, fewer Stack Overflow answers, and less collective knowledge to draw from. The framework also moves more slowly than LangChain—deliberate stability rather than rapid feature churn. For some teams this is a feature (fewer breaking changes); for others it’s a limitation (waiting for new model integrations). The pipeline concepts also require a learning investment that may not pay off for simple applications.

When to choose Haystack: Production systems in enterprise environments, especially regulated industries. Teams that prioritize stability and explicit data flow over rapid iteration. Applications requiring audit trails. Avoid for rapid prototyping or when you need cutting-edge features immediately.

“Just Python”: The Minimalist Approach

Market Position: No framework—direct API calls. Philosophy: Complete control with minimal dependencies.

The “just Python” approach means calling LLM APIs directly without an orchestration framework. This isn’t laziness—it’s a deliberate choice that makes sense for many applications. When you call openai.chat.completions.create() directly, there’s no mystery about what happens. The code you write is the code that runs. Debugging is straightforward: you see the exact request and response.

This transparency eliminates the performance overhead of abstraction layers. No object instantiation chains, no callback systems, no framework initialization. For latency-sensitive applications, these savings add up. You also avoid framework lock-in entirely—your code depends only on provider SDKs, which are more stable than framework abstractions.

However, you’ll rebuild patterns that frameworks provide out of the box. Retry logic with exponential backoff, streaming response handling, rate limit management, error classification—all need implementation. For a simple chatbot, this might be 50 extra lines. For a complex multi-step pipeline, it can become substantial maintenance burden.
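
As a sense of scale for one of these pieces, client-side rate limit management can be a small token bucket. The sketch below is a minimal single-threaded version; `TokenBucket` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be tested without waiting.

```python
import time

class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`.

    Not thread-safe: a multi-threaded caller would need a lock around acquire().
    """

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)   # start full so the first burst is not delayed
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self):
        """Block until one token is available, then consume it."""
        self._refill()
        while self.tokens < 1:
            # Sleep just long enough for the missing fraction of a token to accrue.
            self.sleep((1 - self.tokens) / self.rate)
            self._refill()
        self.tokens -= 1

# Usage: call acquire() before each request to stay under a provider limit.
bucket = TokenBucket(rate=5.0, capacity=5)
bucket.acquire()   # returns immediately while the bucket has tokens
```

Each additional concern (retries, streaming, error classification) is a similarly sized piece, which is how the "50 extra lines" estimate grows in a multi-step pipeline.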

Provider switching is also harder without abstractions. Migrating from OpenAI to Anthropic means changing every API call, while frameworks like LangChain can theoretically swap providers with configuration changes (though this rarely works perfectly in practice). The code in this chapter’s “Just Python” examples is longer and more explicit—you trade conciseness for control.

When to choose “Just Python”: Simple applications with one or two LLM calls. When you need maximum transparency and control. Applications where minimizing dependencies is important (embedded systems, security-sensitive environments, serverless functions with cold start concerns). Small teams comfortable maintaining custom code. Avoid for complex multi-step workflows where you’d end up recreating framework features poorly.

Code Comparison: Simple RAG Implementation

Let’s implement the same RAG query in each approach to see the differences.

Task: Query a vector store of documents and get an answer with sources.

LangChain Implementation

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Load and process documents
loader = TextLoader("docs.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-5", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query (.invoke() replaces the deprecated direct-call style qa_chain({...}))
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(f"Answer: {result['result']}")
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")

Analysis:

  • High-level abstractions hide complexity
  • RetrievalQA.from_chain_type - what does “stuff” mean? (It stuffs all docs into context)
  • Easy to get started, hard to debug
  • Lots of magic: automatic prompt construction, source handling
  • Hard to customize the prompt or retrieval strategy without diving into docs

LlamaIndex Implementation

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

# Configure global settings
Settings.llm = OpenAI(model="gpt-5", temperature=0)
Settings.embed_model = OpenAIEmbedding()

# Load documents
documents = SimpleDirectoryReader("docs/").load_data()

# Create index (combines embedding + storage)
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact"
)

# Query
response = query_engine.query("What is the refund policy?")
print(f"Answer: {response.response}")
print(f"Sources: {[node.metadata for node in response.source_nodes]}")

Analysis:

  • Cleaner API surface for RAG use cases
  • Global settings pattern can be convenient or confusing
  • response_mode="compact" is clearer than LangChain’s “stuff”
  • Better default handling of sources and metadata
  • Still abstracts away details but easier to understand the flow

Haystack Implementation

from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Create document store
document_store = InMemoryDocumentStore()

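# Load and embed documents
# load_texts is not a Haystack function; this minimal stand-in assumes
# docs.txt holds one document per blank-line-separated paragraph.
def load_texts(path):
    with open(path) as f:
        return [p.strip() for p in f.read().split("\n\n") if p.strip()]

docs = [Document(content=text) for text in load_texts("docs.txt")]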
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

# Build pipeline
pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=3))
pipeline.add_component("prompt_builder", PromptBuilder(template="""
Answer the question based on the context.
Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
Question: {{ question }}
Answer:
"""))
pipeline.add_component("llm", OpenAIGenerator(model="gpt-5"))

# Connect components
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "llm.prompt")

# Query
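result = pipeline.run(
    {
        "text_embedder": {"text": "What is the refund policy?"},
        "prompt_builder": {"question": "What is the refund policy?"},
    },
    # By default run() returns only outputs that no other component consumed,
    # so the retriever's documents must be requested explicitly.
    include_outputs_from={"retriever"},
)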

print(f"Answer: {result['llm']['replies'][0]}")
print(f"Sources: {result['retriever']['documents']}")

Analysis:

  • Explicit pipeline construction shows exactly what happens
  • More verbose but completely transparent
  • Easy to swap components (e.g., different embedder)
  • Prompt is visible and customizable
  • Good for understanding data flow, but more boilerplate

“Just Python” Implementation

import openai
from typing import List, Optional
import numpy as np
from dataclasses import dataclass

@dataclass
class Document:
    content: str
    metadata: dict
    embedding: Optional[List[float]] = None  # filled in after embedding

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
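    """Simple text chunking with overlap."""
    if overlap >= chunk_size:
        # Otherwise start would never advance and the loop would not terminate
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # final chunk emitted; stepping back by overlap would only repeat its tail
        start = end - overlap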
    return chunks

def embed_texts(texts: List[str]) -> List[List[float]]:
    """Get embeddings using OpenAI API."""
    response = client.embeddings.create(  # client is created in the Setup block below
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    a_arr = np.array(a)
    b_arr = np.array(b)
    return np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr))

def retrieve_documents(query: str, documents: List[Document], k: int = 3) -> List[Document]:
    """Retrieve top-k most similar documents."""
    # Embed query
    query_embedding = embed_texts([query])[0]

    # Calculate similarities
    similarities = []
    for doc in documents:
        sim = cosine_similarity(query_embedding, doc.embedding)
        similarities.append((sim, doc))

    # Sort and return top-k
    similarities.sort(reverse=True, key=lambda x: x[0])
    return [doc for _, doc in similarities[:k]]

def generate_answer(query: str, context_docs: List[Document]) -> str:
    """Generate answer using retrieved context."""
    context = "\n\n".join([doc.content for doc in context_docs])

    response = client.chat.completions.create(  # client is created in the Setup block below
        model="gpt-5",
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer questions based on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"}
        ]
    )

    return response.choices[0].message.content

# Setup
client = openai.OpenAI()
with open("docs.txt") as f:
    text = f.read()

# Process documents
chunks = chunk_text(text)
embeddings = embed_texts(chunks)
documents = [
    Document(content=chunk, metadata={"source": "docs.txt"}, embedding=emb)
    for chunk, emb in zip(chunks, embeddings)
]

# Query
query = "What is the refund policy?"
retrieved = retrieve_documents(query, documents, k=3)
answer = generate_answer(query, retrieved)

print(f"Answer: {answer}")
print(f"Sources: {[doc.metadata for doc in retrieved]}")

Analysis:

  • Complete transparency: you see every step
  • Easy to debug: no magic, just functions
  • Full control over chunking, retrieval, prompting
  • More code, but all of it is yours
  • Need to handle errors, retries, rate limits yourself
  • Switching LLM providers requires rewriting API calls

This example is illustrative, not production-ready

The linear NumPy cosine scan over every document is O(n) per query and only acceptable for a few thousand chunks—above that the retrieval step must become an approximate-nearest-neighbor index (FAISS, hnswlib, or a vector DB) to stay sub-100ms. The embed_texts call also needs retry-with-backoff (embedding endpoints rate-limit and transiently fail); the code above will crash the whole pipeline on the first 429. Treat this as the mental model for the data flow, not a template to ship.
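
The retry-with-backoff wrapper mentioned above could look roughly like the following. This is a sketch under stated assumptions: `call_with_retries` and `TransientError` are hypothetical names, the delay schedule (0.5s doubling up to 8s, with jitter) is an arbitrary starting point, and `sleep` is injectable only to keep the function testable.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s, 5xx)."""

def call_with_retries(fn, *, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(TransientError,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait with exponential backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries out so many clients do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))

# Usage with a stand-in function that fails twice before succeeding.
state = {"calls": 0}

def flaky_embed():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("simulated 429")
    return [[0.1, 0.2, 0.3]]

vectors = call_with_retries(flaky_embed, sleep=lambda seconds: None)
print(state["calls"], vectors)
```

With the OpenAI SDK, `retry_on` would be the SDK's own exception classes (for example `openai.RateLimitError` and `openai.APITimeoutError`) and `fn` would be `lambda: embed_texts(chunks)`. The SDK client also accepts a `max_retries` setting, which may be enough on its own for simple cases. Errors not listed in `retry_on` propagate immediately, which is usually what you want for malformed requests.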

Decision Matrix: Orchestration Frameworks

Criteria                    LangChain   LlamaIndex   Haystack   Just Python
Learning Curve              Medium      Medium       Steep      Gentle
Time to First Demo          Fast        Fast         Medium     Slow
Long-term Maintainability   Low         Medium       High       High
Debuggability               Low         Medium       High       Highest
Performance Overhead        High        Medium       Low        None
RAG Quality                 Good        Excellent    Good       Depends
Non-RAG Use Cases           Excellent   Poor         Good       Excellent
Provider Lock-in            Medium      Low          Low        None
Community Support           Excellent   Good         Fair       N/A
Enterprise Features         Medium      Low          High       DIY

When to Use What

Choose LangChain when:

  • You need rapid prototyping with many integrations
  • Your team is experienced with the framework
  • You’re building demos or POCs, not production systems
  • You need bleeding-edge features and can tolerate breaking changes

Choose LlamaIndex when:

  • RAG is your primary use case
  • You need sophisticated retrieval strategies
  • You’re building document Q&A or knowledge bases
  • You want better RAG performance than LangChain

Choose Haystack when:

  • Building production systems in enterprise environments
  • You need stability and good error handling
  • Your team values explicit over implicit
  • You’re in a regulated industry

Choose Just Python when:

  • Your use case is simple (1-2 LLM calls)
  • You need maximum control and transparency
  • Minimizing dependencies is important
  • Your team can maintain custom code

Migration Paths

The key to avoiding lock-in is to isolate framework code in adapter layers.

Good Architecture (Framework-Agnostic):

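# domain/models.py - Pure business logic
from dataclasses import dataclass, field
from typing import List

@dataclass
class Document:
    content: str
    metadata: dict = field(default_factory=dict)

@dataclass
class SearchResult:
    answer: str
    sources: List[str]
    confidence: float

# domain/interfaces.py - Abstract interfaces
from typing import List, Protocol

class VectorSearcher(Protocol):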
    def search(self, query: str, k: int) -> List[Document]: ...

class LLMClient(Protocol):
    def generate(self, prompt: str) -> str: ...

# application/rag_service.py - Framework-agnostic logic
class RAGService:
    def __init__(self, searcher: VectorSearcher, llm: LLMClient):
        self.searcher = searcher
        self.llm = llm

    def answer_question(self, query: str) -> SearchResult:
        docs = self.searcher.search(query, k=3)
        context = "\n".join([d.content for d in docs])
        prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
        answer = self.llm.generate(prompt)
        return SearchResult(
            answer=answer,
            sources=[d.metadata["source"] for d in docs],
            confidence=self._calculate_confidence(docs)  # application-specific scorer, not shown here
        )

# infrastructure/langchain_adapters.py - Framework-specific adapters
class LangChainSearcher(VectorSearcher):
    def __init__(self, vectorstore):
        self.vectorstore = vectorstore

    def search(self, query: str, k: int) -> List[Document]:
        # Build the retriever per call so the caller's k is actually honored
        retriever = self.vectorstore.as_retriever(search_kwargs={"k": k})
        results = retriever.invoke(query)  # replaces the deprecated get_relevant_documents()
        return [Document(content=doc.page_content, metadata=doc.metadata)
                for doc in results]

class LangChainLLM(LLMClient):
    def __init__(self, llm):
        self.llm = llm

    def generate(self, prompt: str) -> str:
        return self.llm.invoke(prompt).content

# infrastructure/llamaindex_adapters.py - Alternative implementation
class LlamaIndexSearcher(VectorSearcher):
    def __init__(self, index):
        self.index = index

    def search(self, query: str, k: int) -> List[Document]:
        retriever = self.index.as_retriever(similarity_top_k=k)
        results = retriever.retrieve(query)
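        return [Document(content=node.text, metadata=node.metadata)
                for node in results]

class LlamaIndexLLM(LLMClient):
    def __init__(self, llm):
        self.llm = llm

    def generate(self, prompt: str) -> str:
        return self.llm.complete(prompt).text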

With this architecture, migrating from LangChain to LlamaIndex is just swapping adapters:

# Before (LangChain)
searcher = LangChainSearcher(langchain_vectorstore)
llm = LangChainLLM(ChatOpenAI())
service = RAGService(searcher, llm)

# After (LlamaIndex)
searcher = LlamaIndexSearcher(llamaindex_index)
llm = LlamaIndexLLM(OpenAI())
service = RAGService(searcher, llm)

Key Principle: Keep framework code at the edges, business logic in the center.
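
A side benefit of this layout is that the core logic can be exercised with no framework and no network at all. The sketch below repeats the two interfaces in compact form and plugs in in-memory test doubles; `KeywordSearcher`, `CannedLLM`, and the free function `answer_question` are hypothetical stand-ins written for this illustration, not part of any library.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple

@dataclass
class Document:
    content: str
    metadata: dict = field(default_factory=dict)

class VectorSearcher(Protocol):
    def search(self, query: str, k: int) -> List[Document]: ...

class LLMClient(Protocol):
    def generate(self, prompt: str) -> str: ...

class KeywordSearcher:
    """Test double: ranks documents by how many query words they share."""
    def __init__(self, docs: List[Document]):
        self.docs = docs

    def search(self, query: str, k: int) -> List[Document]:
        words = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: len(words & set(d.content.lower().split())),
            reverse=True,
        )
        return ranked[:k]

class CannedLLM:
    """Test double: records every prompt and returns a fixed reply."""
    def __init__(self, reply: str):
        self.reply = reply
        self.prompts: List[str] = []

    def generate(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply

def answer_question(query: str, searcher: VectorSearcher, llm: LLMClient,
                    k: int = 3) -> Tuple[str, List[str]]:
    """Same flow as RAGService.answer_question, minus the confidence score."""
    docs = searcher.search(query, k=k)
    context = "\n".join(d.content for d in docs)
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    return llm.generate(prompt), [d.metadata.get("source", "unknown") for d in docs]

docs = [
    Document("Refunds are issued within 30 days", {"source": "policy.md"}),
    Document("Shipping takes 5 days", {"source": "shipping.md"}),
]
llm = CannedLLM("Within 30 days.")
answer, sources = answer_question("refunds within how many days", KeywordSearcher(docs), llm, k=1)
print(answer, sources)
```

Because the doubles satisfy the same protocols as the LangChain and LlamaIndex adapters, a test suite written against them keeps passing across a framework migration.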

Common Mistake: Tight Coupling to Framework Abstractions

What people do: Build entire applications using framework-specific types throughout the codebase. Domain models inherit from LangChain’s Document class, business logic passes LlamaIndex Response objects around directly, and API endpoints return framework objects.

Why it fails: When (not if) you need to migrate frameworks or the framework introduces breaking changes, you must rewrite large portions of your application. Every layer of your code depends on types that may change arbitrarily with the next version.

Fix: Define your own domain types (SearchResult, Document, ChatMessage) in plain Python dataclasses or Pydantic models. Use adapter classes at the boundaries to convert between framework types and your domain types. Your business logic should never import from framework packages—only your infrastructure layer should know which framework you’re using.

Staff Engineer Perspective

“One agent kept returning empty tool outputs in production maybe one in 200 times. I lost a full day stepping through six layers of framework wrappers before I realized the framework was silently swallowing a TimeoutError from one tool and retrying with a stripped message. The fix was a 12-line change—but only after I monkey-patched the framework to log raw exceptions. Now whenever I evaluate a framework, the first thing I ask is: when something blows up at 3am, how many stack frames are between me and the actual HTTP call? More than three, and I keep shopping.”

Senior Staff Engineer at a developer tools company


2. Agent Frameworks: Building Autonomous Systems

Agent frameworks help build systems that can plan, use tools, and iterate toward goals. This is a newer, more experimental category.

The Contenders

LangGraph: State Machines for Agents

Market Position: Part of LangChain ecosystem, growing rapidly. Philosophy: Explicit state graphs for maximum control.

LangGraph treats agent orchestration as a state machine problem. You define nodes (functions that process state), edges (transitions between nodes), and conditional routing. This explicitness is the core value proposition: when an agent makes a decision, you can trace exactly which node executed and why the transition occurred.
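
The node, edge, and conditional-routing model can be illustrated without the library. The sketch below is a plain-Python toy, not LangGraph's API: `MiniGraph`, `END`, and the draft/review nodes are hypothetical names chosen for this example. Each node is a function from state to state, and each node's router picks the next node from the updated state.

```python
from typing import Callable, Dict, List, Tuple

END = "__end__"  # sentinel node name meaning "stop"

class MiniGraph:
    """Toy state machine: nodes transform state, routers choose the next node."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[[dict], dict]] = {}
        self.routers: Dict[str, Callable[[dict], str]] = {}

    def add_node(self, name: str, fn: Callable[[dict], dict],
                 router: Callable[[dict], str]) -> None:
        self.nodes[name] = fn
        self.routers[name] = router

    def run(self, start: str, state: dict, max_steps: int = 20) -> Tuple[dict, List[str]]:
        trace: List[str] = []          # which nodes executed, in order
        current = start
        while current != END:
            if len(trace) >= max_steps:
                raise RuntimeError("max_steps exceeded; possible routing loop")
            state = self.nodes[current](state)
            trace.append(current)
            current = self.routers[current](state)   # conditional edge
        return state, trace

def draft(state: dict) -> dict:
    return {**state, "attempts": state.get("attempts", 0) + 1}

def review(state: dict) -> dict:
    # Stand-in for an LLM or human reviewer: approve on the second attempt.
    return {**state, "approved": state["attempts"] >= 2}

graph = MiniGraph()
graph.add_node("draft", draft, lambda s: "review")
graph.add_node("review", review, lambda s: END if s["approved"] else "draft")

final_state, trace = graph.run("draft", {})
print(trace)  # ['draft', 'review', 'draft', 'review']
```

The `trace` list is the point: every decision the agent made is a recorded transition, which is what makes this style auditable.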

The state graph model excels at complex, branching workflows. Human-in-the-loop approval steps become natural—just add a node that waits for external input. Long-running agents benefit from built-in persistence and checkpointing; if a process crashes, you can resume from the last checkpoint. The visual graph representation also aids debugging—teams often export graphs to understand agent behavior.
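
The checkpoint-and-resume idea can be sketched in a few lines as well. This is a simplified illustration rather than LangGraph's persistence layer: `run_with_checkpoints` is a hypothetical helper that assumes state is JSON-serializable and that steps are identified by unique names.

```python
import json
import os

def run_with_checkpoints(steps, state, path):
    """Run (name, fn) steps in order, saving state after each one.

    If a checkpoint file exists at `path`, resume from it and skip the
    steps that already completed.
    """
    done = []
    if os.path.exists(path):
        with open(path) as f:
            saved = json.load(f)
        state, done = saved["state"], saved["done"]

    for name, fn in steps:
        if name in done:
            continue  # finished in an earlier run
        state = fn(state)
        done.append(name)
        # Write to a temp file and rename, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"state": state, "done": done}, f)
        os.replace(tmp_path, path)
    return state
```

A crashed run leaves the last good checkpoint on disk; calling the function again with the same `path` picks up at the first unfinished step. The hard parts a framework adds on top are concurrency, schema changes between versions, and storage backends other than a local file.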

The tradeoffs are learning curve and verbosity. Graph-based programming requires shifting mental models. Simple agents that would be a few lines in other frameworks become verbose state definitions in LangGraph. The LangChain ecosystem dependency means inheriting that ecosystem’s complexity and API stability issues.

When to choose LangGraph: Complex multi-step workflows with conditional branching. Applications requiring human approval gates. Long-running agents needing checkpointing and persistence. When you need to visualize and audit agent decision paths. Avoid for simple agents or when you want to minimize LangChain dependency.

CrewAI: Role-Based Multi-Agent

Market Position: Popular for multi-agent systems, with roughly 50k GitHub stars.

Philosophy: Agents as team members with defined roles.

CrewAI models agent collaboration as team dynamics—you define agents with roles, backstories, and goals, then assign them tasks. A “Senior Researcher” searches for information, a “Technical Writer” synthesizes it into documents, a “Critic” reviews for accuracy. This metaphor is intuitive: if you can describe your workflow as “a team of experts collaborating,” CrewAI maps naturally.

The framework provides good defaults for common multi-agent patterns. Delegation is built in: once it is enabled on an agent (allow_delegation=True), that agent can assign subtasks to others. Built-in memory, when turned on for the crew, lets agents remember context from earlier in the run. For content generation workflows (blog posts, reports, documentation), CrewAI often produces results quickly with minimal configuration.

The tradeoffs are control and flexibility. The role-based abstraction is opinionated—if your workflow doesn’t fit the “team of agents” model, you’ll fight the framework. Fine-grained control over agent interactions is harder than in LangGraph. The ecosystem is also smaller, and single-agent use cases often feel awkward in CrewAI’s multi-agent paradigm.

When to choose CrewAI: Multi-agent systems where agents have clear roles. Content generation workflows with research, writing, and review phases. When the “team of specialists” metaphor matches your problem. Avoid for single-agent applications or when you need precise control over agent interactions.

AutoGen: Conversational Agents

Market Position: Microsoft-backed, research-focused.

Philosophy: Agents as conversational partners.

AutoGen approaches multi-agent systems through the lens of conversation—agents communicate by sending messages to each other, similar to how people collaborate via chat. This model comes from Microsoft Research, and the framework reflects that heritage: well-grounded in academic literature, with sophisticated patterns for multi-agent dialogue.

The framework’s standout feature is code execution in sandboxed environments. Agents can write code, execute it, observe results, and iterate—making AutoGen particularly strong for code generation and data analysis tasks. The Microsoft ecosystem integration (Azure OpenAI, Azure services) is also mature, and active development means new capabilities ship regularly.
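The write, execute, observe, iterate loop can be sketched in plain Python. This is not AutoGen's API: `fake_model` is a hypothetical stand-in for an LLM call, and the child interpreter used here is not a sandbox (a real deployment would isolate execution, for example in a container).

```python
import subprocess
import sys


def fake_model(task: str, last_error: str) -> str:
    """Hypothetical stand-in for an LLM: fixes its code after seeing an error."""
    if not last_error:
        return "print(sum([1, 2, 3]) / 0)"  # first attempt contains a bug
    return "print(sum([1, 2, 3]))"


def execute(code: str) -> tuple[bool, str]:
    """Run code in a child interpreter. NOTE: this is not a sandbox."""
    proc = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=10
    )
    ok = proc.returncode == 0
    return ok, (proc.stdout if ok else proc.stderr)


def solve(task: str, max_attempts: int = 3) -> tuple[str, int]:
    """Generate code, run it, and feed failures back until it succeeds."""
    error = ""
    for attempt in range(1, max_attempts + 1):
        code = fake_model(task, error)
        ok, output = execute(code)
        if ok:
            return output.strip(), attempt
        error = output  # the traceback becomes input to the next attempt
    raise RuntimeError("no working code within the attempt budget")


answer, attempts = solve("sum the numbers 1 to 3")
print(answer, attempts)
```

The attempt budget matters as much as the loop itself: without it, a model that keeps producing broken code runs indefinitely.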

The tradeoffs are weight and complexity. AutoGen is heavier than alternatives—more dependencies, more setup, more configuration. The research orientation means documentation sometimes reads like academic papers rather than practical guides. Production deployment requires more engineering than simpler frameworks.

When to choose AutoGen: Research and experimentation with multi-agent systems. Code generation workflows where agents need to execute and iterate on code. Microsoft Azure environments where ecosystem integration matters. When you need sophisticated conversation patterns between agents. Avoid for simple production deployments or when minimizing dependencies is important.

Pydantic AI: Type-Safe, Lightweight

Market Position: Newest entrant, gaining traction.

Philosophy: Type safety with minimal abstractions.

Pydantic AI takes the opposite approach from heavyweight frameworks—it provides just enough structure for building agents while leveraging Pydantic’s type system for safety and validation. If you’ve used Pydantic for data validation (and most Python projects do), the mental model is immediately familiar.

The lightweight approach brings concrete benefits. Fewer dependencies mean faster installation and smaller deployment artifacts. The Pythonic API reads naturally—no framework-specific DSLs or configuration languages. When something breaks, debugging is straightforward because there’s less framework code between you and the actual agent logic.

Type safety is the distinguishing feature. Agent inputs and outputs are validated at runtime, catching errors that would silently propagate in dynamically-typed frameworks. Your IDE provides better autocomplete and error detection. For teams with strong typing practices, Pydantic AI fits naturally into existing codebases.

The tradeoffs are ecosystem maturity and feature completeness. As the newest entrant, Pydantic AI has fewer examples, less community knowledge, and fewer integrations. Complex multi-agent systems require more custom code than in purpose-built frameworks like CrewAI or LangGraph.

When to choose Pydantic AI: Applications where type safety is important. Simple to medium complexity agents. Teams that value explicit, debuggable code over framework magic. When minimizing dependencies matters. Avoid for complex multi-agent orchestration or when you need extensive pre-built integrations.

Common Mistake: Designing Multi-Agent Systems When a Single Agent Suffices

What people do: Architect elaborate multi-agent systems with a “Researcher,” “Analyst,” “Writer,” and “Reviewer” for tasks that a single well-prompted LLM call could handle.

Why it fails: Each agent hop adds latency, cost, and failure points. More critically, information is lost at each handoff—the “Writer” doesn’t have full context from the “Researcher’s” reasoning process. Complex agent topologies also make debugging exponentially harder. When output quality is poor, you can’t tell which agent failed.

Fix: Start with a single agent with a good system prompt. Only split into multiple agents when you have concrete evidence that: (1) the task genuinely requires different skills that conflict (e.g., exploration vs. summarization), (2) you need parallel execution for performance, or (3) you need human checkpoints between distinct phases. Most content generation tasks work better with one thoughtful prompt than with an agent committee.
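A quick back-of-the-envelope calculation shows why hops are expensive. The figures below (95% per-call success, 2 seconds and $0.01 per call) are made-up assumptions for illustration, and the model assumes calls run sequentially and fail independently.

```python
def pipeline_estimate(hops: int, success_per_hop: float,
                      latency_s: float, cost_usd: float) -> dict:
    """Estimate end-to-end figures for a sequential chain of agent calls."""
    return {
        # Every hop must succeed, so per-hop success rates multiply.
        "success_rate": round(success_per_hop ** hops, 4),
        # Sequential calls: latency and cost add up linearly.
        "latency_s": round(latency_s * hops, 2),
        "cost_usd": round(cost_usd * hops, 4),
    }


single = pipeline_estimate(hops=1, success_per_hop=0.95, latency_s=2.0, cost_usd=0.01)
committee = pipeline_estimate(hops=4, success_per_hop=0.95, latency_s=2.0, cost_usd=0.01)

print("single agent:", single)
print("four agents: ", committee)
```

Under these assumptions the four-agent chain succeeds end to end about 81% of the time versus 95% for a single call, at four times the latency and cost, before accounting for context lost at each handoff.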

Code Comparison: Research Agent

Let’s build an agent that researches a topic by searching the web and synthesizing results.

Task: Research a topic using web search, analyze results, and write a summary.

LangGraph Implementation

from typing import TypedDict
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults
from langgraph.graph import StateGraph, END

# Define state. There is no reducer on `messages`: each node below returns the
# full updated list, so an additive reducer (operator.add) would duplicate entries.
class AgentState(TypedDict):
    messages: list[str]
    topic: str
    search_results: list
    summary: str
    iteration: int

# Define tools
search_tool = TavilySearchResults(max_results=3)

# Define nodes
def search_node(state: AgentState) -> AgentState:
    """Execute web search."""
    results = search_tool.invoke({"query": state["topic"]})
    return {
        **state,
        "search_results": results,
        "messages": state["messages"] + [f"Found {len(results)} results"],
        "iteration": state["iteration"] + 1
    }

def analyze_node(state: AgentState) -> AgentState:
    """Analyze search results and decide next action."""
    llm = ChatOpenAI(model="gpt-5")

    context = "\n".join([r["content"] for r in state["search_results"]])
    prompt = f"""
    Topic: {state["topic"]}
    Search Results:
    {context}

    Analyze these results. Should we:
    1. Search for more specific information (respond with "search: <query>")
    2. Write the final summary (respond with "summarize")

    Decision:
    """

    response = llm.invoke(prompt)
    decision = response.content.strip()

    if decision.startswith("search:"):
        new_query = decision.split(":", 1)[1].strip()
        return {
            **state,
            "topic": new_query,
            "messages": state["messages"] + [f"Refining search: {new_query}"]
        }
    else:
        return {
            **state,
            "messages": state["messages"] + ["Ready to summarize"]
        }

def summarize_node(state: AgentState) -> AgentState:
    """Write final summary."""
    llm = ChatOpenAI(model="gpt-5")

    all_context = "\n\n".join([
        r["content"] for r in state["search_results"]
    ])

    prompt = f"""
    Write a comprehensive summary about: {state["topic"]}

    Based on these sources:
    {all_context}

    Summary:
    """

    summary = llm.invoke(prompt).content

    return {
        **state,
        "summary": summary,
        "messages": state["messages"] + ["Summary complete"]
    }

def should_continue(state: AgentState) -> str:
    """Decide whether to continue or end."""
    if state["iteration"] >= 3:
        return "summarize"
    if "Ready to summarize" in state["messages"]:
        return "summarize"
    return "search"

# Build graph
workflow = StateGraph(AgentState)

workflow.add_node("search", search_node)
workflow.add_node("analyze", analyze_node)
workflow.add_node("summarize", summarize_node)

workflow.set_entry_point("search")
workflow.add_edge("search", "analyze")
workflow.add_conditional_edges(
    "analyze",
    should_continue,
    {
        "search": "search",
        "summarize": "summarize"
    }
)
workflow.add_edge("summarize", END)

# Compile
app = workflow.compile()

# Run
result = app.invoke({
    "topic": "Latest developments in quantum computing",
    "messages": [],
    "search_results": [],
    "summary": "",
    "iteration": 0
})

print(result["summary"])

Analysis:

  • Explicit state graph makes flow clear
  • Full control over state transitions
  • Easy to add human-in-the-loop at any node
  • Verbose but transparent
  • Graph can be visualized for debugging

CrewAI Implementation

from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool

# Define tools
search_tool = SerperDevTool()

# Define agents with roles
researcher = Agent(
    role="Senior Researcher",
    goal="Find comprehensive information about {topic}",
    backstory="""You are an expert researcher with a keen eye for
    identifying credible sources and extracting key insights.""",
    tools=[search_tool],
    verbose=True
)

analyst = Agent(
    role="Research Analyst",
    goal="Analyze research findings and identify key themes",
    backstory="""You excel at synthesizing complex information
    and identifying patterns across multiple sources.""",
    verbose=True
)

writer = Agent(
    role="Technical Writer",
    goal="Create clear, comprehensive summaries",
    backstory="""You are skilled at translating complex technical
    information into accessible summaries.""",
    verbose=True
)

# Define tasks
research_task = Task(
    description="""Research the latest developments in {topic}.
    Find at least 3 credible sources with recent information.
    Extract key facts, trends, and insights.""",
    agent=researcher,
    expected_output="A list of key findings from multiple sources"
)

analysis_task = Task(
    description="""Analyze the research findings.
    Identify the most important themes and developments.
    Determine what information is most relevant.""",
    agent=analyst,
    expected_output="An analysis highlighting key themes",
    context=[research_task]
)

writing_task = Task(
    description="""Write a comprehensive summary about {topic}.
    Include the most important developments and their implications.
    Make it accessible but technically accurate.""",
    agent=writer,
    expected_output="A well-written summary (300-400 words)",
    context=[research_task, analysis_task]
)

# Create crew
crew = Crew(
    agents=[researcher, analyst, writer],
    tasks=[research_task, analysis_task, writing_task],
    process=Process.sequential,
    verbose=True
)

# Run
result = crew.kickoff(inputs={"topic": "Latest developments in quantum computing"})
print(result)

Analysis:

  • Very intuitive role-based abstraction
  • Less code for multi-agent workflows
  • Task outputs pass to later tasks automatically via context; delegation between agents is opt-in
  • Less control over exact execution flow
  • Great for team-like collaboration patterns

AutoGen Implementation

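import os

# AutoGen 0.2-style API (the `autogen` import from the pyautogen package)
from autogen import AssistantAgent, UserProxyAgent

# Configure LLM. The key is read from the environment rather than hardcoded;
# OPENAI_API_KEY is an assumed variable name.
config_list = [{"model": "gpt-5", "api_key": os.environ["OPENAI_API_KEY"]}]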

llm_config = {
    "config_list": config_list,
    "temperature": 0,
}

# Create researcher agent
researcher = AssistantAgent(
    name="Researcher",
    system_message="""You are a research assistant. Your job is to:
    1. Search for information on given topics
    2. Analyze the credibility of sources
    3. Extract key insights
    Use web search to find information.""",
    llm_config=llm_config,
)

# Create writer agent
writer = AssistantAgent(
    name="Writer",
    system_message="""You are a technical writer. Your job is to:
    1. Take research findings
    2. Synthesize them into a clear summary
    3. Ensure technical accuracy
    Don't search for new information, just write based on what you receive.""",
    llm_config=llm_config,
)

# Create user proxy (can execute code, use tools)
user_proxy = UserProxyAgent(
    name="UserProxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False,
    },
    llm_config=llm_config,
    system_message="""You coordinate the research process.
    You can execute code to search the web and gather information."""
)

# Define group chat
from autogen import GroupChat, GroupChatManager

groupchat = GroupChat(
    agents=[user_proxy, researcher, writer],
    messages=[],
    max_round=10
)

manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)

# Start conversation
user_proxy.initiate_chat(
    manager,
    message="""Research the latest developments in quantum computing.
    Researcher: Find and analyze information.
    Writer: Create a comprehensive summary based on the research."""
)

Analysis:

  • Conversational approach feels natural
  • Good for complex multi-agent discussions
  • Code execution capabilities are powerful
  • More complex setup than CrewAI
  • Less explicit control over conversation flow

Pydantic AI Implementation

import os

import httpx
from pydantic import BaseModel, Field
from pydantic_ai import Agent

# Define structured outputs
class SearchResult(BaseModel):
    title: str
    snippet: str
    url: str

class ResearchSummary(BaseModel):
    summary: str = Field(description="Comprehensive summary of research")
    key_points: list[str] = Field(description="3-5 key points")
    sources: list[str] = Field(description="URLs of sources used")

# Create the agent with a typed output.
# Note: recent Pydantic AI releases use `output_type` and `result.output`;
# older releases used `result_type` and `result.data`.
agent = Agent(
    "openai:gpt-5",
    output_type=ResearchSummary,
    system_prompt="""You are a research assistant.
    When researching a topic:
    1. Use search_web to find information
    2. Analyze the results for credibility
    3. Synthesize findings into a clear summary
    """
)

# Register the tool as a plain typed function; its signature and docstring
# become the tool schema the model sees.
@agent.tool_plain
async def search_web(query: str) -> list[SearchResult]:
    """Search the web and return results."""
    # Tavily's search endpoint expects a POST with a JSON body. Check the
    # current API reference for the exact authentication scheme;
    # TAVILY_API_KEY is an assumed environment variable name.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.tavily.com/search",
            headers={"Authorization": f"Bearer {os.environ['TAVILY_API_KEY']}"},
            json={"query": query, "max_results": 3},
        )
        response.raise_for_status()
        data = response.json()
        return [
            SearchResult(
                title=r["title"],
                snippet=r["content"],
                url=r["url"]
            )
            for r in data["results"][:3]
        ]

# Run with full type safety
async def research_topic(topic: str) -> ResearchSummary:
    result = await agent.run(
        f"Research the latest developments in {topic}",
    )
    return result.output  # Validated ResearchSummary

# Execute
import asyncio
summary = asyncio.run(research_topic("quantum computing"))

print(summary.summary)
print("\nKey Points:")
for point in summary.key_points:
    print(f"- {point}")
print("\nSources:")
for source in summary.sources:
    print(f"- {source}")

Analysis:

  • Excellent type safety throughout
  • Clean, Pythonic API
  • Structured outputs with Pydantic validation
  • Simpler than other frameworks for basic agents
  • Less suited for complex multi-agent scenarios

Decision Matrix: Agent Frameworks

Criteria              LangGraph    CrewAI       AutoGen      Pydantic AI
Learning Curve        Steep        Gentle       Steep        Gentle
Control               Maximum      Medium       Medium       High
Multi-Agent Support   Good         Excellent    Excellent    Basic
Type Safety           Low          Low          Low          Excellent
Debuggability         Excellent    Medium       Medium       High
State Management      Excellent    Automatic    Automatic    Manual
Human-in-Loop         Built-in     Possible     Possible     Manual
Visualization         Yes          Limited      No           No
Production Ready      Good         Fair         Fair         Good

When to Use What

Choose LangGraph when:

  • You need complex, branching workflows
  • State management is critical
  • You want to visualize agent behavior
  • Human approval steps are required
  • You’re building production systems that need debugging

Choose CrewAI when:

  • Building multi-agent systems with clear roles
  • You want rapid development
  • Your use case maps to team collaboration
  • You need simpler abstractions than LangGraph

Choose AutoGen when:

  • You need sophisticated multi-agent conversations
  • Code execution is important
  • You’re in the Microsoft ecosystem
  • You’re doing research or experimentation

Choose Pydantic AI when:

  • Type safety is a priority
  • Building simple to medium complexity agents
  • You want minimal dependencies
  • Your team values explicit code

Integration Pattern: Hybrid Approach

You can combine frameworks to leverage strengths:

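from pydantic import BaseModel
from langgraph.graph import StateGraph, START, END
from crewai import Agent, Task, Crew

# Use Pydantic for type-safe state
class ResearchState(BaseModel):
    topic: str
    findings: list[str] = []
    summary: str = ""

# Use CrewAI for sub-tasks (multi-agent research)
def research_with_crew(state: ResearchState) -> dict:
    researcher = Agent(
        role="Researcher",
        goal=f"Research {state.topic}",
        backstory="You are a careful researcher who cites sources.",
    )
    task = Task(
        description=f"Research {state.topic}",
        expected_output="A bullet list of key findings",
        agent=researcher,
    )
    crew = Crew(agents=[researcher], tasks=[task])

    result = crew.kickoff()
    # kickoff() returns a CrewOutput; store its text and return only the
    # keys this node updates.
    return {"findings": state.findings + [result.raw]}

def summarize_findings(state: ResearchState) -> dict:
    # Placeholder so the example is complete; a real node would call an LLM.
    return {"summary": "\n".join(state.findings)}

# Use LangGraph for overall workflow orchestration
def build_workflow():
    workflow = StateGraph(ResearchState)
    workflow.add_node("research", research_with_crew)
    workflow.add_node("summarize", summarize_findings)
    # Minimal linear wiring so the graph compiles
    workflow.add_edge(START, "research")
    workflow.add_edge("research", "summarize")
    workflow.add_edge("summarize", END)
    # ... rest of graph
    return workflow.compile()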

This gives you:

  • LangGraph’s state management and visualization
  • CrewAI’s multi-agent capabilities
  • Pydantic’s type safety

The hidden cost of stacking frameworks

Combining three frameworks means inheriting three independent release cadences and three breaking-change surfaces—a CrewAI minor bump or a LangChain (LangGraph’s dependency) API shift can break you on a schedule you don’t control, and the union of their version constraints can become unsatisfiable. You also debug across three different abstraction models: when a run misbehaves, the failure could live in LangGraph’s state transitions, CrewAI’s agent delegation, or the Pydantic validation boundary, and the stack traces rarely make clear which. Only reach for a hybrid like this when no single framework can express the workflow; if one framework can carry it, the reduced dependency and cognitive surface is almost always worth more than the marginal capability.


Summary

This chapter covered the foundational tools for building LLM applications: orchestration frameworks and agent frameworks. The key takeaways:

Orchestration Frameworks

  • Start simple (just Python) and add frameworks when complexity demands it
  • LangChain for rapid prototyping, LlamaIndex for RAG, Haystack for enterprise
  • Always isolate framework code in adapter layers

Agent Frameworks

  • LangGraph for complex workflows with explicit state
  • CrewAI for multi-agent simplicity with role-based abstractions
  • Pydantic AI for type safety with minimal abstractions
  • AutoGen for research and code execution workflows
  • Choose based on the control vs simplicity tradeoff

Framework Selection Decision Flowchart

Figure: LLM Framework Selection Decision Flowchart

Key Principles

  1. No framework is “best” - only best for your context
  2. Isolate framework dependencies - keep business logic framework-agnostic
  3. Plan for migration - tools evolve quickly, build with change in mind
  4. Measure, don’t assume - prototype before committing

Connections to Other Chapters

This chapter’s framework comparisons connect to implementation details throughout the book:

  • Chapter 7 (RAG Systems): LlamaIndex and LangChain RAG patterns in practice.

  • Chapter 8 (Agentic Systems): Deep dive into agent architectures that these frameworks implement.

  • Chapter 11 (Observability, Structured Output & Guardrails): Tools that complement the frameworks covered here.

  • Chapter 12 (Cloud AI Deployment): How to deploy applications built with these frameworks.

  • Tool & Framework Reference (Appendix B): Detailed version information and installation guides.


Practical Exercises

  1. Framework Comparison: Implement the same simple RAG application in three different ways: LangChain, LlamaIndex, and just Python with direct API calls. Compare lines of code, time to implement, performance (latency), and ease of debugging. Deliverable: A comparison document with code samples and metrics.

  2. Agent Framework Selection: Given this scenario, choose an agent framework and justify: A legal tech company building an AI research assistant that searches legal databases, needs human approval before summarizing findings, must track reasoning for audit, and handles 500+ requests/day. Recommend a specific framework and explain your reasoning.

  3. Migration Architecture: Design an adapter layer architecture for a LangChain-based RAG application that would allow migration to LlamaIndex or raw Python. Identify all LangChain dependencies, create adapter interfaces, implement one adapter with direct OpenAI calls, and write a test suite that works with both implementations.

  4. Framework Profiling: Take an existing LangChain or LlamaIndex application and profile it. Identify where time is spent (network calls, framework overhead, serialization). What percentage of latency comes from the framework itself vs the underlying API calls? Document at least three optimization opportunities.

  5. State Machine Agent: Using LangGraph, implement a multi-step agent with explicit state management: a code review agent that analyzes code, identifies issues, proposes fixes, and asks for human confirmation before applying changes. Implement proper error handling and state persistence so the agent can resume after failures.


Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] A team is choosing between LangChain and raw Python for a chatbot. What specific questions would you ask to help them decide?

Answer

Key questions: (1) How many LLM providers do you need to support? (LangChain helps with multi-provider abstraction). (2) Do you need complex chains or just simple request-response? (Raw Python is simpler for basic cases). (3) What’s your team’s Python experience? (Experienced teams may prefer raw Python’s transparency). (4) Do you need observability and tracing? (LangChain has LangSmith integration). (5) How important is debugging transparency? (Raw Python is more debuggable). (6) Is this a prototype or production system? (Prototypes may benefit from faster iteration with frameworks). (7) What’s your latency budget? (Frameworks add overhead).

Q2. [IC2] Explain the difference between an “orchestration framework” and an “agent framework.” When would you use each?

Answer

Orchestration frameworks (LangChain, LlamaIndex, Haystack) focus on chaining LLM calls, managing context, and retrieval pipelines. They’re for deterministic workflows: input → processing steps → output. Agent frameworks (LangGraph, CrewAI, AutoGen) focus on autonomous decision-making where the LLM decides which tools to use and when. They’re for dynamic workflows where the path depends on intermediate results. Use orchestration for: RAG pipelines, structured data extraction, multi-step processing with known steps. Use agent frameworks for: open-ended problem solving, tasks requiring tool selection, workflows needing human-in-the-loop at dynamic points.

Q3. [Senior] A startup’s LangChain application is experiencing 3-second latency. Walk through how you would diagnose whether the framework is the bottleneck or the underlying API calls.

Answer

Diagnostic steps: (1) Add timing around each LangChain component (chain.invoke, retriever.invoke, llm.invoke). (2) Compare raw API call latency vs LangChain-wrapped call latency. (3) Profile serialization overhead: LangChain’s Pydantic models can add latency. (4) Check for redundant API calls, since some abstractions make multiple calls when one would suffice. (5) Enable LangSmith tracing to see the full call graph. (6) Test the same logic with raw Python/API calls to establish a baseline. Typical findings: Network latency usually dominates (1-2s for LLM calls), but framework overhead can add 100-500ms from serialization, abstraction layers, and callback processing. If overhead is >10% of total latency, consider optimization or simpler abstractions.
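The raw-versus-wrapped comparison can be sketched with a small timing harness. `raw_call` and `framework_call` below are hypothetical stand-ins that simulate latency with `time.sleep`; in practice they would be a direct API request and the equivalent framework invocation.

```python
import time
from statistics import median
from typing import Callable


def measure(fn: Callable[[], object], runs: int = 5) -> float:
    """Median wall-clock seconds for fn over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def overhead_share(baseline_s: float, wrapped_s: float) -> float:
    """Fraction of the wrapped latency not explained by the baseline call."""
    return max(wrapped_s - baseline_s, 0.0) / wrapped_s


# Hypothetical stand-ins; replace with a direct API call and the framework call.
def raw_call() -> None:
    time.sleep(0.05)


def framework_call() -> None:
    time.sleep(0.01)  # simulated serialization and callback work
    raw_call()


baseline = measure(raw_call)
wrapped = measure(framework_call)
share = overhead_share(baseline, wrapped)
print(f"baseline={baseline:.3f}s wrapped={wrapped:.3f}s overhead={share:.0%}")
```

Using the median rather than the mean keeps a single slow network round trip from skewing the comparison.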

Q4. [Senior] Compare LangGraph and CrewAI for building a multi-agent customer support system. What are the key architectural differences and when would you choose each?

Answer

LangGraph: Graph-based workflow with explicit state management. Agents are nodes, edges define transitions based on state. Full control over execution flow, checkpointing, and human-in-the-loop. Steeper learning curve but maximum flexibility. Choose when: You need precise control over agent interactions, complex branching logic, or guaranteed state persistence.

CrewAI: Role-based abstraction where agents have personas and goals. Communication is implicit through shared context. Simpler mental model, faster to prototype. Choose when: You want quick multi-agent prototypes, agents have clear roles (researcher, writer, editor), and you don’t need fine-grained control over agent sequencing.

Key difference: LangGraph is “explicit coordination,” CrewAI is “emergent coordination.” LangGraph for production systems needing auditability; CrewAI for rapid prototyping.

Q5. [Staff] Your organization has 50 LLM applications across 10 teams, each using different frameworks (LangChain, LlamaIndex, raw Python, CrewAI). How would you approach standardization without disrupting existing projects?

Answer

Phased approach: (1) Audit current state: catalog frameworks, versions, and patterns across all teams. Identify common patterns and pain points. (2) Define interface standards, not framework mandates. Create organization-wide interfaces for LLM calls, embeddings, and vector stores that any framework can implement. (3) Build shared adapters that wrap each framework to conform to your interfaces. Teams can continue using their framework but get interoperability. (4) Create a “recommended stack” for new projects based on audit findings, but don’t force migration. (5) Invest in shared observability: regardless of framework, all applications should report metrics to the same platform. (6) Let natural attrition work: as projects are rewritten or retired, they adopt the recommended stack. (7) Only mandate migration for security or maintainability crises. Key insight: Forced standardization causes more problems than framework diversity. Focus on interoperability.

Spot the Problem

Problem 1. [IC2] A developer is frustrated that their LangChain code is hard to debug:

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
result = chain.invoke("What is the return policy?")
# Result is wrong, but where did it fail?

What’s the problem and how would you fix it?

Answer

Problem: The chain is opaque—you can’t see intermediate results. When output is wrong, you don’t know if retrieval failed, the prompt was wrong, or the LLM hallucinated.

Fix: Add intermediate inspection points. Options: (1) Use RunnableParallel with logging: add a branch that logs each stage’s output. (2) Enable LangSmith tracing for full visibility. (3) Break the chain into explicit steps for debugging:

docs = retriever.invoke("What is the return policy?")
print(f"Retrieved: {docs}")  # Check retrieval
formatted_prompt = prompt.format(context=docs, question="...")
print(f"Prompt: {formatted_prompt}")  # Check prompt
result = llm.invoke(formatted_prompt)

(4) Add assertions between stages to fail fast on unexpected results.

General principle: Composable chains are elegant but sacrifice debuggability. In production, add explicit logging at each stage.
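The explicit steps and the between-stage assertions can be combined into one function. The sketch below uses hypothetical stand-ins (`retrieve`, `build_prompt`, `generate`) in place of a real retriever, prompt template, and LLM, so it runs without any framework installed.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rag")


def retrieve(question: str) -> list[str]:
    """Hypothetical retriever stand-in."""
    return ["Returns are accepted within 30 days of purchase."]


def build_prompt(question: str, docs: list[str]) -> str:
    """Assemble the prompt from retrieved context and the question."""
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


def generate(prompt: str) -> str:
    """Hypothetical LLM stand-in."""
    return "You can return items within 30 days."


def answer(question: str) -> str:
    docs = retrieve(question)
    log.info("retrieved %d docs", len(docs))
    assert docs, "retrieval returned nothing; stop before calling the LLM"

    prompt = build_prompt(question, docs)
    log.info("prompt length: %d chars", len(prompt))
    assert question in prompt, "question missing from the prompt"

    result = generate(prompt)
    log.info("answer: %s", result)
    assert result.strip(), "LLM returned an empty answer"
    return result


reply = answer("What is the return policy?")
print(reply)
```

When the final answer is wrong, the log shows which stage first produced something unexpected. Note that `assert` statements are skipped under `python -O`, so production code would raise explicit exceptions instead.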

Problem 2. [Senior] A team is using CrewAI for a research agent. The agent works but produces inconsistent results:

researcher = Agent(
    role="Research Analyst",
    goal="Find comprehensive information on the topic",
    backstory="You are an expert researcher with access to the web.",
    tools=[search_tool, scrape_tool]
)
crew = Crew(agents=[researcher], tasks=[research_task])
result = crew.kickoff()

What architectural issues might cause inconsistency?

Answer

Several issues: (1) No explicit process defined. CrewAI defaults to sequential, but with a single agent the execution flow depends on the LLM’s interpretation of “comprehensive.” (2) The goal is vague (“comprehensive information”), so the agent decides when it’s done, leading to variable depth. (3) No memory or state: each kickoff starts fresh, losing learnings. (4) Tool results aren’t validated: web search and scraping can return errors, empty results, or irrelevant content. (5) No output schema: “result” could be structured differently each run.

Fixes: (1) Add expected_output specification to the task with format requirements. (2) Define explicit completion criteria (minimum sources, required sections). (3) Add manager_llm for consistent oversight. (4) Implement tool result validation. (5) Use a callback to log intermediate steps and identify where inconsistency originates. (6) Consider if single-agent CrewAI is overkill—a simple LangChain agent might be more predictable.
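Explicit completion criteria can be enforced in code after the agent returns, independent of the framework. This sketch is illustrative only: the `Report` type, the section names, and the three-source minimum are assumptions, not CrewAI features.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    text: str
    sources: list[str] = field(default_factory=list)


def validate_report(
    report: Report,
    min_sources: int = 3,
    required_sections: tuple[str, ...] = ("Summary", "Key Findings"),
) -> list[str]:
    """Return a list of problems; an empty list means the report passes."""
    problems = []
    # Count distinct, non-empty sources so duplicates do not satisfy the minimum.
    distinct = {s.strip().lower() for s in report.sources if s.strip()}
    if len(distinct) < min_sources:
        problems.append(f"need {min_sources} distinct sources, got {len(distinct)}")
    for section in required_sections:
        if section.lower() not in report.text.lower():
            problems.append(f"missing section: {section}")
    return problems


good = Report(
    text="Summary\nQubit counts rose.\n\nKey Findings\n- Error rates fell.",
    sources=["https://a.example", "https://b.example", "https://c.example"],
)
bad = Report(text="Summary: qubits improved.", sources=["https://a.example", "https://a.example"])

print(validate_report(good))
print(validate_report(bad))
```

A failed validation can trigger a retry with the problem list appended to the task description, which gives the agent a concrete target instead of "comprehensive."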

Problem 3. [Staff] An organization adopted LangChain enterprise-wide. Now they’re discovering issues:

# Every team has code like this
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4")  # Hardcoded
vectorstore = FAISS.load_local("index")  # Hardcoded path
chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

What systemic problems does this create and how would you address them?

Answer

Problems: (1) Tight coupling to specific providers—every team hardcodes OpenAI, making provider switching expensive. (2) No abstraction layer—LangChain internals leak into business logic. (3) Configuration in code—model names, paths, parameters should be externalized. (4) No dependency injection—testing requires mocking LangChain internals. (5) Version fragmentation—teams on different LangChain versions with incompatible APIs. (6) No shared components—each team builds their own retriever configuration.

Solutions: (1) Create organization-wide interfaces: LLMProvider, VectorStore, RAGPipeline that wrap LangChain. (2) Build a configuration service for model selection, allowing runtime model switching. (3) Create a shared library with blessed configurations and adapters. (4) Implement factory patterns for dependency injection. (5) Pin LangChain versions organization-wide with a regular upgrade cycle. (6) Consider if LangChain is still needed—if you’re wrapping everything anyway, you might simplify with direct API calls behind your interfaces.
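Solutions (1), (2), and (4) might look roughly like the sketch below. It is one possible shape under stated assumptions, not a prescribed design: LLMProvider is the interface name used above, while FakeLLM, register_provider, make_llm, and the LLM_PROVIDER and LLM_MODEL environment variables are hypothetical names introduced for illustration.

```python
import os
from typing import Callable, Mapping, Optional, Protocol


class LLMProvider(Protocol):
    """Organization-owned contract; application code depends only on this."""

    def complete(self, prompt: str) -> str: ...


class FakeLLM:
    """Test double: no network calls and no framework imports."""

    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"


# Registry of provider factories. A real adapter (for example, one wrapping
# LangChain's ChatOpenAI) would be registered here by the shared library.
_PROVIDERS: dict[str, Callable[[str], LLMProvider]] = {"fake": FakeLLM}


def register_provider(name: str, factory: Callable[[str], LLMProvider]) -> None:
    _PROVIDERS[name] = factory


def make_llm(env: Optional[Mapping[str, str]] = None) -> LLMProvider:
    """Build a provider from external configuration instead of hardcoded values."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "fake")  # hypothetical variable name
    model = env.get("LLM_MODEL", "test-model")  # hypothetical variable name
    if provider not in _PROVIDERS:
        raise ValueError(f"unknown LLM provider: {provider!r}")
    return _PROVIDERS[provider](model)


def answer_question(question: str, llm: LLMProvider) -> str:
    """Business logic receives its dependency; it never imports a framework."""
    return llm.complete(f"Answer concisely: {question}")
```

Because answer_question takes the provider as an argument, a unit test can pass FakeLLM directly and never mock framework internals. Switching providers becomes a configuration change plus one registered adapter.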

Key lesson: Frameworks should be implementation details, not architecture foundations.

Design Exercises

Exercise 1. [Senior] Design a framework evaluation process for an AI team that needs to choose between LangGraph, CrewAI, and raw Python for building a document processing pipeline. The pipeline needs to: classify documents, extract entities, validate against business rules, and route to appropriate handlers. How would you structure a proof-of-concept to make this decision?

Guidance

PoC structure: (1) Define evaluation criteria upfront: development speed, debugging experience, error handling, testability, team familiarity, latency overhead, extensibility. (2) Timeboxed implementation (2 days each): Build the same pipeline in all three approaches. Use realistic but small test data. (3) Consistent test harness: Same inputs, same expected outputs, same error scenarios across all implementations. (4) Metrics to capture: Lines of code, time to implement, p50/p99 latency, ease of adding new document type, ease of testing, error handling quality. (5) Failure mode testing: What happens when LLM fails? When validation fails? When handler is unavailable? (6) Team input: Have multiple team members review each implementation for maintainability.

Decision framework: Score each option 1-5 on each criterion. Weight criteria by importance. Present to team with recommendations and tradeoffs.
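The scoring step of the decision framework can be made explicit with a small weighted-mean helper. This is a sketch: the criteria names, the weights, and the scores are placeholder values for illustration only, not measurements of any framework, and the options are deliberately left anonymous.

```python
def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted mean of 1-5 scores; weights need not sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("scores must be between 1 and 5")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * weights[c] for c in weights) / total_weight


def rank_options(
    all_scores: dict[str, dict[str, int]], weights: dict[str, float]
) -> list[tuple[str, float]]:
    """Return (option, weighted score) pairs, best first."""
    ranked = [(name, weighted_score(s, weights)) for name, s in all_scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers for illustration only; a real PoC fills these in.
WEIGHTS = {"debugging": 3.0, "testability": 2.0, "dev_speed": 1.0}
SCORES = {
    "option_a": {"debugging": 4, "testability": 4, "dev_speed": 3},
    "option_b": {"debugging": 2, "testability": 3, "dev_speed": 5},
    "option_c": {"debugging": 5, "testability": 4, "dev_speed": 2},
}

if __name__ == "__main__":
    for name, score in rank_options(SCORES, WEIGHTS):
        print(f"{name}: {score:.2f}")
```

Agreeing on the weights before anyone sees the scores keeps the result from being tuned toward a favorite option. The ranking is an input to the team discussion, not a substitute for it.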

Exercise 2. [Staff] You’re architecting the AI platform for a large enterprise with 200+ developers building LLM applications. Design an internal framework or platform that provides the benefits of frameworks (observability, common patterns, shared components) without the lock-in risks. Consider: how do you support multiple underlying frameworks while providing a consistent developer experience?

Guidance

Platform architecture: (1) Interface layer: Define your own interfaces for core concepts (LLM, Embeddings, VectorStore, Chain, Agent). These are your stable contracts. (2) Adapter layer: Implement adapters for each supported framework (LangChain, LlamaIndex, raw Python). Adapters translate between your interfaces and framework-specific code. (3) Configuration service: Centralized management of models, credentials, and parameters. Applications request “a capable LLM,” not “gpt-4-turbo.” (4) Observability backbone: Unified tracing regardless of framework. All adapters emit standard spans and metrics. (5) Template library: Pre-built patterns (RAG, agents, extraction) that use your interfaces. Teams extend templates instead of starting from scratch. (6) Testing harness: Mock implementations of all interfaces for testing. (7) Migration tooling: Scripts to audit existing code and help teams migrate when the framework behind an adapter changes.
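To make items (2) through (4) concrete, here is a minimal sketch of capability-based resolution with uniform tracing. Every name in it (LLM, Span, TracedLLM, EchoAdapter, get_llm, and the two lookup tables) is hypothetical; a production platform would back the tables with the configuration service and send spans to a real tracing system.

```python
import time
from dataclasses import dataclass
from typing import Callable, Protocol


class LLM(Protocol):
    """Stable platform contract that every adapter implements."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class Span:
    """Standard span shape emitted for every call, whatever the adapter."""
    name: str
    adapter: str
    duration_s: float
    ok: bool


class TracedLLM:
    """Wraps any adapter so each call reports one Span to the sink."""

    def __init__(self, inner: LLM, adapter_name: str, sink: Callable[[Span], None]) -> None:
        self._inner = inner
        self._adapter_name = adapter_name
        self._sink = sink

    def complete(self, prompt: str) -> str:
        start = time.perf_counter()
        ok = False
        try:
            result = self._inner.complete(prompt)
            ok = True
            return result
        finally:
            # Runs on success and on failure, so errors are traced too.
            elapsed = time.perf_counter() - start
            self._sink(Span("llm.complete", self._adapter_name, elapsed, ok))


class EchoAdapter:
    """Stand-in for a framework-backed adapter (LangChain, LlamaIndex, ...)."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


# Hypothetical tables; in practice the configuration service owns these.
CAPABILITY_TO_ADAPTER = {"capable": "echo"}
ADAPTERS: dict[str, Callable[[], LLM]] = {"echo": EchoAdapter}


def get_llm(capability: str, sink: Callable[[Span], None]) -> LLM:
    """Resolve a capability tier to a traced adapter instance."""
    adapter_name = CAPABILITY_TO_ADAPTER[capability]
    return TracedLLM(ADAPTERS[adapter_name](), adapter_name, sink)
```

Application code would call get_llm("capable", sink) and never name a model or framework. Swapping the adapter behind a capability then changes one table entry, and the spans keep the same shape.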

Organizational considerations: Developer portal with documentation, examples, and best practices. Office hours for platform questions. Feedback loop for evolving interfaces based on real needs. Versioning strategy for interfaces with deprecation periods.

Further Reading

Essential

  • LangChain Documentation — Comprehensive guide to chains, agents, and integrations. Start with the conceptual guides.
  • LlamaIndex Documentation — Focus on the data connectors and query engine sections for RAG patterns.
  • LangGraph Documentation — Graph-based agent workflows with state management. Study the examples for complex agent patterns.

Deep Dives

  • Pydantic AI Documentation — Type-safe LLM interactions with minimal abstractions.
  • CrewAI Documentation — Multi-agent orchestration with role-based abstractions.
  • Instructor Library — Structured output extraction patterns that work across frameworks.

Historical Context

  • Building LLM Applications for Production (Chip Huyen, 2023) — Practical advice on framework evaluation and migration strategies.
  • Why I Quit LangChain (Various authors) — Honest assessments of framework pain points that inform better tool selection.