Chapter 26: Technical Decision Making

Keywords

decision making, ADRs, trade-offs, build vs buy, technical strategy, reversibility, consensus building

Introduction

In January 2021, a fast-growing AI startup faced a critical decision. Their recommendation engine, built on a custom embedding pipeline, was straining under 10x traffic growth. The CTO proposed migrating to a managed vector database service that promised seamless scaling. The principal engineer advocated for building their own distributed vector store, arguing that the unique access patterns justified custom infrastructure. The product leader pushed for a hybrid approach: use the managed service now, plan migration later.

Three weeks of meetings produced no consensus. Meanwhile, latency crept up, customers complained, and a key enterprise deal stalled. Finally, the CEO mandated a decision: managed service, shipping in two weeks.

Eighteen months later, that managed service had a 47-hour outage affecting the startup’s largest customer. The principal engineer’s concerns about vendor dependency had proven prescient. But here’s the thing: the decision was still correct. In the intervening months, the company had closed $40M in revenue that wouldn’t have materialized if they’d spent six months building custom infrastructure. The outage cost them one customer; the delay would have cost them the company.

This story illustrates why technical decision-making is so hard. The “right” answer depends on context that changes over time. Good decisions can have bad outcomes; bad decisions can luck into success. What separates great technical leaders isn’t an ability to predict the future, but rather a disciplined approach to making decisions with incomplete information, documenting the reasoning, and learning from outcomes.

The Compounding Nature of Technical Decisions

Technical decisions compound in ways that non-technical decisions often don’t. Choose the wrong database, and every feature built on top inherits that constraint. Pick an incompatible serialization format, and years later teams still write adapters. Select a model architecture optimized for today’s hardware, and watch it struggle on tomorrow’s.

This compounding works in both directions. Good architectural decisions create leverage: they make future decisions easier, enable capabilities you didn’t anticipate, and reduce the cognitive load on your team. Amazon’s early decision to standardize on service-oriented architecture enabled it to expose internal services as external APIs, which eventually became AWS. That wasn’t the plan, but the decision created options.

The technical debt metaphor, introduced by Cunningham and later developed by Kruchten, Nord, and Ozkaya, describes how poor decisions accumulate interest. Industry surveys (Besker et al., 2018, covering 15 large software organizations) found that engineers spend roughly 25% of their time on technical-debt-related work as systems age. The same research community has also shown that deliberate, documented debt, taken on strategically and managed properly, can accelerate time-to-market. What kills teams is undocumented, accidental debt.

The difference between debt that destroys teams and debt that enables them lies almost entirely in how decisions were made and documented. Undocumented decisions become mysteries; mysteries become folklore; folklore becomes legacy code that no one dares touch.

Why This Chapter Matters

If you’re reading this book, you likely make technical decisions that affect systems serving millions of users, handling sensitive data, or processing millions of dollars in transactions. The frameworks in this chapter will help you:

Structure decision-making processes so that important choices get appropriate rigor
Document decisions so that future teams (including future you) understand the context
Analyze tradeoffs systematically rather than relying on intuition alone
Navigate build vs. buy in a landscape where AI capabilities change monthly
Recognize reversible vs. irreversible decisions and calibrate your process accordingly
Build organizational decision-making capability beyond your individual contributions

What You’ll Learn

How Architecture Decision Records (ADRs) capture the “why” behind technical choices
When and how to use RFC processes for collaborative decisions
Structured frameworks for tradeoff analysis that go beyond gut feeling
Economic models for build vs. buy decisions, especially in AI systems
The science of decision-making under uncertainty
How to build intuition for fast vs. slow decisions

Prerequisites

Experience operating production systems (Parts I & II)
Familiarity with system design principles (Chapter 22)
Understanding of team dynamics and organizational context (Chapter 20)

The Science of Decision Making

Dual-Process Theory and Technical Choices

Daniel Kahneman’s research on dual-process cognition provides a foundation for understanding how engineers make decisions. System 1 thinking is fast, intuitive, and effortless: the kind of thinking that lets an experienced engineer glance at code and sense something is wrong. System 2 thinking is slow, deliberate, and effortful: the kind required to prove that the code is actually buggy.

Both systems have roles in technical decision-making, but they’re suited to different types of problems.

System 1 excels when:

You have deep experience with similar situations
Feedback loops are tight (you quickly learn if you were right)
The environment is stable and predictable
Stakes of individual decisions are low

System 2 is necessary when:

The situation is novel or complex
Feedback is delayed or ambiguous
Multiple stakeholders with different priorities are involved
The decision is hard to reverse or high-stakes

The problem: System 1 doesn’t know when it’s out of its depth. An engineer with ten years of web development experience may have strong intuitions about database choices, but those intuitions can mislead when applied to ML feature stores with fundamentally different access patterns. The confidence feels the same even when the calibration is off.

Gary Klein’s research on naturalistic decision-making found that experts in domains with tight feedback (firefighters, chess players, ICU nurses) developed remarkably accurate intuitions. But as Kahneman and Klein later argued in their joint work on the conditions for intuitive expertise, in domains with delayed or noisy feedback, like financial forecasting or long-term technology choices, expert intuition is far less reliable and often performs barely better than chance.

Implication for AI engineering: Many AI decisions have delayed, noisy feedback. Was your choice of embedding model correct? You won’t know for months, and even then, you can’t easily run the counterfactual. This is precisely the environment where System 2 rigor matters most.

The Paradox of Choice in Technical Systems

Barry Schwartz documented how having more options often leads to worse decisions and less satisfaction. This manifests acutely in modern AI engineering. Want a vector database? Choose from Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector, and a dozen others. Need an inference server? Consider vLLM, TGI, Triton, Ray Serve, custom solutions. Every layer of the stack presents dozens of viable options.

This abundance creates several dysfunctions:

Analysis paralysis: Teams spend weeks evaluating options that would all work adequately. The cost of the evaluation exceeds the cost of a suboptimal choice.

Maximizing vs. satisficing: Maximizers try to find the best option; satisficers find one that meets their criteria. Research suggests satisficers decide faster and experience less regret, while maximizers sometimes achieve marginally better objective results but tend to be less satisfied with them.

The paradox of expertise: More knowledge about a domain means awareness of more options, more edge cases, more ways things could go wrong. Senior engineers can become slower decision-makers than juniors because they see more complexity.

Practical response: Define decision criteria before evaluating options. Set time limits for evaluation. Accept that “good enough” often is. Document what “good enough” means so others don’t revisit the decision unnecessarily.

Common Mistake: Evaluating Decisions Solely by Outcomes

What people do: The migration succeeded, so it was a good decision. The new architecture failed, so it was a bad decision. Promote people who made decisions that worked out; sideline those whose bets didn’t pay off.

Why it fails: This confuses luck with skill. Good decisions can have bad outcomes due to factors outside your control. Bad decisions can succeed by luck. Over time, this creates a risk-averse culture where people avoid any decision that might turn out badly, even when the expected value is positive.

Fix: Evaluate decisions by process quality: Were the right people involved? Were alternatives genuinely considered? Was reasoning sound given available information? Were uncertainties acknowledged? Document the reasoning so you can learn from decisions regardless of outcome.

Outcome vs. Process Quality

Annie Duke, in “Thinking in Bets,” distinguishes between decision quality and outcome quality. A good decision can have a bad outcome (you made the right call given available information, but got unlucky). A bad decision can have a good outcome (you made a reckless call, but got lucky).

This distinction matters enormously for organizational learning. If you judge decisions solely by outcomes:

Teams become risk-averse, avoiding decisions that might turn out badly even when the expected value is positive
Survivorship bias distorts learning (you only hear about successful bets, not failed ones)
Post-hoc rationalization replaces honest assessment

Instead, evaluate decisions by process quality:

Were the right people involved? Did you consult those with relevant expertise and those affected by the decision?
Were alternatives genuinely considered? Or did you anchor on one option and rationalize it?
Was the reasoning sound? Given what was known at the time, did the logic hold?
Were uncertainties acknowledged? Did you identify what you didn’t know?
Was it documented? Can others learn from this decision?

A well-documented decision that turned out poorly is valuable. It reveals what you didn’t know or couldn’t predict, enabling better decisions in the future. An undocumented decision that turned out well teaches nothing.
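The gap between decision quality and outcome quality can be made concrete with expected value. The sketch below uses made-up probabilities and payoffs (illustrative assumptions, not figures from any real decision) to show a sound bet that still fails one time in five and a reckless bet that still succeeds one time in five.

```python
# Illustrative numbers only: the probabilities and payoffs are assumptions.

def expected_value(outcomes):
    """Probability-weighted payoff for a list of (probability, payoff) pairs."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)


def chance_of_loss(outcomes):
    """Probability that the realized payoff is negative."""
    return sum(p for p, payoff in outcomes if payoff < 0)


# Payoffs in thousands of dollars.
careful_bet = [(0.8, 1000), (0.2, -500)]   # sound process, usually pays off
reckless_bet = [(0.2, 1000), (0.8, -500)]  # poor process, occasionally lucky

for name, bet in [("careful", careful_bet), ("reckless", reckless_bet)]:
    print(f"{name}: EV = {expected_value(bet):+.0f}K, "
          f"chance of a bad outcome = {chance_of_loss(bet):.0%}")
```

Judged by a single outcome, the two bets are indistinguishable 20% of the time in each direction; judged by expected value, they are not close.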

Cognitive Biases in Technical Decisions

Several cognitive biases particularly afflict technical decision-making:

Availability bias: Overweighting information that comes to mind easily. “We had a bad experience with Vendor X three years ago” looms larger than systematic analysis, even if Vendor X has improved dramatically.

Anchoring: The first number or option mentioned disproportionately influences the final decision. If someone suggests “maybe 3 months to build this,” subsequent estimates cluster around 3 months regardless of actual complexity.

Common Mistake: Anchoring on the First Option

What people do: Someone suggests “let’s use Redis for this cache” in the first design meeting. All subsequent discussion compares alternatives to Redis, evaluating their disadvantages relative to Redis rather than to the actual requirements.

Why it fails: The first option becomes the anchor. You evaluate “reasons not to use Redis” rather than “what’s the best solution for our requirements.” Subtle flaws in the anchor option go unexamined because it’s never held to the same scrutiny as alternatives.

Fix: Before discussing solutions, write down your requirements and decision criteria. Deliberately consider multiple options before committing to detailed evaluation of any single one. Assign someone to argue for an alternative to ensure the anchor doesn’t dominate.

Confirmation bias: Seeking information that supports your preferred option. Engineers often run benchmarks until they get results that confirm their intuition, then stop.

Not-invented-here syndrome: Preferring internal solutions over external ones. This bias is particularly strong among skilled engineers who correctly believe they could build something, but incorrectly conclude they should.

Sunk cost fallacy: Continuing with a failing approach because of past investment. “We’ve already spent six months on this migration” becomes a reason to continue rather than a write-off.

Herding: Following what prestigious companies do. “Google uses this” becomes justification, even when your context differs fundamentally from Google’s.

Mitigation strategies:

Assign a “devil’s advocate” to argue against the favored option
Require writing down your prediction before gathering data
Explicitly consider “What would have to be true for the other option to be better?”
Bring in outside perspectives: people who lack your team’s history with the decision are less likely to share its biases

Architecture Decision Records

The Case for Documenting Decisions

Every codebase accumulates decisions. Most remain invisible: implicit in the structure of the code, the choice of libraries, the shape of data models. When the original decision-makers leave, understanding leaves with them.

Consider a concrete scenario. A new engineer asks: “Why do we use Redis for the session cache instead of our standard DynamoDB?” Without documentation, the investigation begins:

Slack search turns up a thread from 2022, but key context is missing
The engineer who made the choice left last year
Someone thinks it was about latency, but isn’t sure
The team debates whether to migrate, unsure if original constraints still apply

Two weeks pass. Eventually, someone finds an old design doc mentioning “99th percentile latency requirements for the recommendation engine require sub-2ms cache reads.” DynamoDB’s typical latency is 5-10ms. Mystery solved, but at significant cost.

With an ADR, this investigation takes five minutes. The document explains the constraint, the analysis, and critically, when to revisit the decision (“if we move to a region where DynamoDB DAX is available” or “if our latency budget increases”).

Michael Nygard, who introduced ADRs, describes them as “capturing the context in which a decision was made.” The context almost always matters more than the decision itself. The same decision might be brilliant in one context and disastrous in another.

ADR Structure and Best Practices

An effective ADR answers five questions:

What is the context? What situation are we in? What constraints apply?
What is the decision? What are we choosing to do?
Why this decision? What reasoning leads to this choice?
What alternatives were considered? What else could we have done?
What are the consequences? What follows from this decision?
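For teams that keep ADRs alongside code, the five questions can be treated as required fields. The following is a minimal sketch, assuming a hypothetical `ADR` class whose field names simply mirror the questions above; it is not a standard format or an existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class ADR:
    """Hypothetical in-memory ADR; fields mirror the five questions."""
    number: int
    title: str
    status: str
    context: str = ""
    decision: str = ""
    rationale: str = ""
    alternatives: list[str] = field(default_factory=list)
    consequences: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty."""
        sections = {
            "context": self.context,
            "decision": self.decision,
            "rationale": self.rationale,
            "alternatives": self.alternatives,
            "consequences": self.consequences,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Plain-text rendering with one heading per section."""
        lines = [f"ADR-{self.number:03d}: {self.title}", "", f"Status: {self.status}", ""]
        for heading, body in [("Context", self.context),
                              ("Decision", self.decision),
                              ("Rationale", self.rationale)]:
            lines += [heading, "", body, ""]
        for heading, items in [("Alternatives Considered", self.alternatives),
                               ("Consequences", self.consequences)]:
            lines += [heading, "", *items, ""]
        return "\n".join(lines)


draft = ADR(
    number=17,
    title="Embedding Model Selection for Document Search",
    status="Proposed",
    context="Enterprise document search with a 200ms p95 latency target.",
    decision="Use an open-source embedding model on existing infrastructure.",
    rationale="Best quality among options that fit the cost budget.",
)
print(draft.missing_sections())
```

A check like `missing_sections()` could run in code review so that an ADR without alternatives or consequences is caught before it is accepted.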

Here’s a real ADR from an AI startup’s production system:

ADR-017: Embedding Model Selection for Document Search

Status: Accepted

Date: 2025-03-15

Context

We’re building a document search feature for our enterprise product. Users will search across thousands of internal documents (contracts, policies, meeting notes) using natural language queries.

Requirements:

Query latency under 200ms for 95th percentile
Support for 500,000 documents initially, scaling to 5M
Must handle domain-specific terminology (legal, financial)
Budget constraint: inference costs under $10K/month at full scale

Current infrastructure: AWS-based, using SageMaker for existing ML workloads.

Decision

We will use BGE-large-en-v1.5 (1024 dimensions) deployed on SageMaker Inference endpoints with auto-scaling.

Rationale

We evaluated five embedding models against our requirements:

Model	Dimensions	Our Benchmark NDCG@10	Latency (p95)	Monthly Cost at Scale
BGE-large-en	1024	0.72	45ms	$4,200
E5-large-v2	1024	0.71	48ms	$4,200
text-embedding-3-large	3072	0.74	85ms	$18,000
GTE-large	1024	0.70	43ms	$4,200
Cohere embed-v3	1024	0.73	120ms	$12,000

BGE-large-en slightly outperformed the other open-source options on our internal benchmark dataset (200 queries with relevance judgments from the legal team). While OpenAI’s model scored 0.02 NDCG@10 higher (0.74 vs. 0.72), its roughly 4x cost ($18,000 vs. $4,200 per month) exceeds our $10K budget, and the quality gain doesn’t justify the added cost and latency for this use case.

Deploying on SageMaker aligns with our existing infrastructure and avoids introducing new operational complexity.

Alternatives Considered

OpenAI text-embedding-3-large: Highest benchmark quality, but roughly 4x the cost, and the external API dependency introduces latency variability and compliance considerations for enterprise customers.

Fine-tuned BGE: Could improve domain-specific performance, but adds 2-3 weeks of work and ongoing maintenance burden. We’ll revisit if quality metrics don’t meet targets after launch.

Hybrid approach (sparse + dense): Would improve exact-match cases, but adds pipeline complexity. Our benchmark showed only 3% of queries benefit significantly from sparse matching.

Consequences

Positive: Cost-effective, predictable latency, no external API dependencies
Positive: Team already familiar with SageMaker deployment patterns
Negative: May need fine-tuning if domain-specific performance is insufficient
Negative: Locked to model versions we deploy (no automatic improvements)

Revisit When

Quality metrics (measured monthly) fall below 0.65 NDCG@10
A significantly better open-source model emerges (>10% improvement)
We expand to non-English markets (multilingual models needed)
OpenAI or similar significantly reduces pricing (below 2x our current cost)

Notice several qualities of this ADR:

Specific context: Not “we need embeddings” but specific requirements with numbers.

Actual data: Benchmarks run on their data, not marketing claims.

Honest tradeoffs: Acknowledges what they’re giving up.

Clear triggers for revisiting: Future engineers know when to reconsider.
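Because the revisit triggers in this ADR are numeric, they can be checked mechanically. The sketch below is one possible encoding; the `Snapshot` type and its field names are hypothetical, and reading “>10% improvement” as relative to the deployed model’s score is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical monthly inputs; field names are illustrative assumptions."""
    ndcg_at_10: float                 # measured quality of the deployed model
    best_open_source_ndcg: float      # best alternative on the same benchmark
    cheapest_api_monthly_cost: float  # quoted cost of a hosted alternative
    current_monthly_cost: float       # what the deployed model costs today
    serves_non_english: bool


def revisit_reasons(s: Snapshot) -> list[str]:
    """Return the ADR-017 triggers that fire for this snapshot."""
    reasons = []
    if s.ndcg_at_10 < 0.65:
        reasons.append("quality below 0.65 NDCG@10")
    # Assumption: ">10% improvement" is relative to the deployed model.
    if s.best_open_source_ndcg > s.ndcg_at_10 * 1.10:
        reasons.append("open-source model more than 10% better")
    if s.serves_non_english:
        reasons.append("non-English markets need a multilingual model")
    if s.cheapest_api_monthly_cost < 2 * s.current_monthly_cost:
        reasons.append("hosted API priced below 2x current cost")
    return reasons


at_decision_time = Snapshot(0.72, 0.74, 18_000, 4_200, False)
print(revisit_reasons(at_decision_time))  # no trigger fires with the ADR's own numbers
```

Running a check like this on the monthly quality report turns “Revisit When” from a note into an alert.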

When to Write an ADR

Not every decision needs formal documentation. ADRs have a cost: time to write, time to maintain, clutter if overused. Use them when decisions are:

Significant: Affects multiple teams, systems, or substantial resources. Choosing a logging library probably doesn’t need an ADR; choosing an observability platform does.

Non-obvious: Reasonable engineers might disagree. If the right choice is clear, document it lightly in code comments or team wiki.

Hard to reverse: Changing would be expensive or disruptive. Database choice, API contracts, data formats merit ADRs.

Frequently questioned: If people keep asking “why do we do X?”, write an ADR to answer them once.

A useful heuristic: if you’re having the third meeting about a decision, it needs an ADR.

What doesn’t need an ADR:

Implementation details that don’t affect interfaces
Choices that are easy to change later
Decisions mandated by company standards
Temporary workarounds with clear timelines

ADR Anti-Patterns

The novel: Multi-page ADRs that nobody reads. If your ADR exceeds two pages, you’re writing a design doc, not a decision record.

The vague: “We chose X because it’s better.” Better how? For what use case? Compared to what?

The unanimous: ADRs that present no alternatives or pretend there was no disagreement. Real decisions have tradeoffs; acknowledge them.

The orphan: ADRs written and forgotten. Review them periodically. Supersede them when contexts change.

The retcon: ADRs written after the fact to justify decisions already made for other reasons. These are useless for learning.

RFC Processes

When ADRs Aren’t Enough

ADRs document decisions made by an individual or small group. Some decisions require broader input: they affect multiple teams, require expertise distributed across the organization, or need buy-in from stakeholders who’ll bear the consequences.

Request for Comments (RFC) processes structure this broader collaboration. The term comes from the RFC document series that has defined Internet protocols since the ARPANET era and is today most closely associated with the Internet Engineering Task Force. The format has been adopted widely in tech companies and open-source projects: Rust’s RFC process has shaped the language, Kubernetes uses a similar process (Kubernetes Enhancement Proposals) for major features, and many companies have internal RFC systems.

The key distinction:

ADR: “Here is the decision we made and why”

RFC: “Here is a proposal; please help us improve it and decide together”

Use RFCs when:

Multiple teams are affected and should have input
The decision requires expertise you don’t have
You need to build consensus before implementation
The scope is large enough that implementation is expensive to redo

Anatomy of an Effective RFC

A good RFC enables asynchronous collaboration. Readers should be able to understand, evaluate, and comment without requiring meetings.

Title and metadata: Clear title, author, status, key dates.

Summary: One paragraph covering the problem and proposed solution. Busy readers should be able to decide if they need to read more.

Motivation: Why is this change needed? What problem does it solve? What happens if we do nothing? This section should create urgency and alignment on the problem before discussing solutions.

Detailed design: How will this work? Be specific enough that readers can evaluate feasibility, but not so detailed that implementation is constrained unnecessarily.

Alternatives considered: What other approaches were evaluated? Why were they rejected? This demonstrates rigor and preempts “have you considered X?” comments.

Drawbacks: What are the risks and downsides? Acknowledging these builds trust and surfaces concerns that might otherwise emerge in discussion.

Adoption strategy: How will this be rolled out? What migration is required? What’s the timeline?

Unresolved questions: What do you still need input on? This focuses reviewer attention.

Real RFC Example: Model Serving Migration

Here’s a condensed version of an actual RFC from an AI platform team:

RFC: Migrate Model Serving from TGI to vLLM

Author: Platform Team

Status: Under Review

Created: 2025-09-10

Summary

This RFC proposes migrating our LLM serving infrastructure from Text Generation Inference (TGI) to vLLM. Based on our benchmarks, the migration would increase throughput by 2.5x through improved batching and cut serving costs by more than half at current traffic (roughly $1.2M of the current $2.1M monthly spend). The migration would occur over three months, with a two-week parallel-running period for validation.

Motivation

Our current TGI deployment serves 15M requests per day across 120 A100 GPUs. Costs are $2.1M per month. Traffic is projected to triple over the next year, which would require significant additional infrastructure.

Our benchmarks show vLLM achieving 2.5x higher throughput for our workload (mixed batch sizes, 512-2048 token outputs). This would reduce our GPU fleet to approximately 50 GPUs at current traffic, saving $1.2M monthly.

Beyond cost, vLLM’s PagedAttention lets us serve longer contexts without the out-of-memory (OOM) errors that currently force us to reject requests.

Detailed Design

Phase 1 (Weeks 1-4): Deploy vLLM alongside TGI. Route 1% of traffic to vLLM. Validate quality parity using our existing eval suite.

Phase 2 (Weeks 5-8): Gradually increase vLLM traffic to 50%. Monitor latency percentiles, error rates, and output quality scores. Address any issues.

Phase 3 (Weeks 9-12): Complete migration. Maintain TGI deployment as cold standby for two weeks. Decommission after validation.

API contracts remain unchanged. Client teams require no code changes.

Alternatives Considered

Optimize TGI configuration: We spent two weeks tuning TGI parameters. Best case improvement was 15%, insufficient for our growth projections.

Custom serving solution: Higher ceiling for optimization but 6+ month build time and ongoing maintenance burden. vLLM is well-maintained with active community.

Managed service (e.g., Anyscale Endpoints): Would reduce operational burden but increase per-request costs at our scale. Makes sense below 1M requests/day.

Drawbacks

Migration risk: Any serving change risks production incidents
Team learning curve: Platform team needs to develop vLLM expertise
Feature parity gaps: vLLM lacks some TGI features (specific quantization methods we don’t currently use)
Vendor dependency shift: Moving from a Hugging Face project to a community-maintained project

Unresolved Questions

Should we maintain an internal fork for custom optimizations, or contribute upstream?
How long should we maintain TGI as fallback?
Are there features we use in TGI that need porting?

Managing the RFC Process

An RFC without a process is just a document. Effective RFC processes include:

Clear ownership: Someone drives the RFC through the process, incorporating feedback, scheduling discussions, and pushing toward resolution.

Defined timelines: Open-ended review leads to decision paralysis. Set expectations: “Comments due by Friday; decision meeting Monday.”

Structured feedback: Distinguish between blocking concerns, suggestions, and questions. Not all feedback is equal.

Decision authority: Who actually decides? The RFC process gathers input, but someone must have authority to make the call when consensus can’t be reached.

Outcomes: RFCs should resolve to accepted, rejected, or withdrawn. “Abandoned” RFCs signal process failure.
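The process elements above (an owner, a decider, deadlines, and a small set of terminal outcomes) map naturally onto a tracking record. This is a minimal sketch with hypothetical names and dates, not a description of any particular RFC tool.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    UNDER_REVIEW = "under review"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    WITHDRAWN = "withdrawn"


@dataclass
class RFC:
    title: str
    owner: str          # drives the RFC through the process
    decider: str        # makes the call if consensus is not reached
    comments_due: date
    decision_due: date
    status: Status = Status.UNDER_REVIEW


def overdue_rfcs(rfcs: list[RFC], today: date) -> list[RFC]:
    """RFCs still under review after their decision date."""
    return [r for r in rfcs
            if r.status is Status.UNDER_REVIEW and today > r.decision_due]


# Hypothetical entries; the people and dates are placeholders.
tracker = [
    RFC("Migrate Model Serving from TGI to vLLM", "platform-lead", "vp-eng",
        comments_due=date(2025, 9, 24), decision_due=date(2025, 9, 29)),
    RFC("Adopt shared feature store", "ml-infra-lead", "cto",
        comments_due=date(2025, 8, 1), decision_due=date(2025, 8, 4),
        status=Status.ACCEPTED),
]

for rfc in overdue_rfcs(tracker, today=date(2025, 10, 6)):
    print(f"Escalate to {rfc.decider}: {rfc.title}")
```

Surfacing overdue RFCs to the named decider is one way to keep a proposal from quietly becoming abandoned.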

Synchronous vs. Asynchronous Decisions

Not every decision needs a meeting, but some decisions need real-time discussion.

Favor asynchronous (written RFC/ADR) when:

The decision benefits from careful thought
Stakeholders need time to research or consult others
Multiple time zones make meetings difficult
You need a written record

Favor synchronous (meeting) when:

Significant disagreement needs real-time resolution
Quick back-and-forth would clarify misunderstandings
The decision is time-sensitive
Building relationships and trust matters (controversial changes)

Hybrid approach: Write the RFC asynchronously, meet to discuss key concerns, finalize asynchronously. This captures benefits of both: thoughtful preparation, efficient discussion, documented outcome.

Tradeoff Analysis Methods

Beyond Pros and Cons Lists

Simple pros/cons lists fail because they treat all factors as equal. “Faster development time” and “runs on our existing infrastructure” might both be “pros,” but they’re not equally important.

Structured tradeoff analysis makes weights explicit, forcing clarity about what matters.

Weighted Decision Matrices

A weighted decision matrix assigns importance weights to criteria, then scores options against each criterion. The weighted sum produces a ranking.

Process:

Define criteria: What factors matter for this decision? Be specific.
Assign weights: Distribute 100 points across criteria. This forces prioritization. If everything is equally important, you haven’t thought hard enough.
Score options: Rate each option 1-10 on each criterion. Use evidence where possible.
Calculate weighted scores: Multiply scores by weights, sum for each option.
Sensitivity analysis: How much do weights need to change for the ranking to flip?

Example: Choosing a Feature Store

Criterion	Weight	Feast	Tecton	Databricks FS
Query latency	25	7	9	6
Integration with existing stack	20	8	6	9
Operational complexity	20	6	8	7
Cost (3-year TCO)	15	9	5	6
Feature richness	10	6	9	8
Team expertise	10	5	4	8
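Weighted Score		7.00	7.10	7.20

The scores are nearly identical, revealing that this is a close call: Databricks FS leads Tecton by only 0.10 points and Feast by 0.20. Sensitivity analysis shows how fragile that ranking is. If the “Query latency” weight rises by 5 points (from 25 to 30, with the other weights scaled down proportionally so the total stays at 100), Tecton takes the lead. If the “Integration with existing stack” weight rises by 5 points instead, Databricks FS widens its lead.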

This analysis doesn’t make the decision for you. It clarifies what you’re trading off and surfaces the values embedded in your choice.
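The matrix and its sensitivity analysis are easy to script, which makes it cheap to test how robust a ranking is. The sketch below recomputes the weighted scores from the weights and per-criterion scores in the feature store table; the function names are illustrative, and rescaling the remaining weights proportionally after a shift is one assumption among several reasonable ones.

```python
WEIGHTS = {"latency": 25, "integration": 20, "operations": 20,
           "cost": 15, "features": 10, "expertise": 10}

SCORES = {
    "Feast": {"latency": 7, "integration": 8, "operations": 6,
              "cost": 9, "features": 6, "expertise": 5},
    "Tecton": {"latency": 9, "integration": 6, "operations": 8,
               "cost": 5, "features": 9, "expertise": 4},
    "Databricks FS": {"latency": 6, "integration": 9, "operations": 7,
                      "cost": 6, "features": 8, "expertise": 8},
}


def weighted_scores(weights, scores):
    """Weighted average score (on the 1-10 scale) for each option."""
    total = sum(weights.values())
    return {option: sum(weights[c] * s[c] for c in weights) / total
            for option, s in scores.items()}


def shift_weight(weights, criterion, delta):
    """Add delta to one criterion and rescale the rest so the total is unchanged."""
    total = sum(weights.values())
    rest = total - weights[criterion]
    new_value = weights[criterion] + delta
    scale = (total - new_value) / rest
    return {c: (new_value if c == criterion else w * scale)
            for c, w in weights.items()}


def winner(result):
    """Option with the highest weighted score."""
    return max(result, key=result.get)


base = weighted_scores(WEIGHTS, SCORES)
print({option: round(score, 2) for option, score in base.items()})

for criterion in WEIGHTS:
    shifted = weighted_scores(shift_weight(WEIGHTS, criterion, 5), SCORES)
    print(f"+5 on {criterion}: {winner(shifted)} leads")
```

How the displaced weight is redistributed matters: taking all 5 points from a single criterion instead of rescaling proportionally can produce a different leader or a tie, which is itself a sign of how close the options are.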

Qualitative Factors That Matrices Miss

Numbers don’t capture everything. After quantitative analysis, consider:

Team factors: Do you have people who know this technology? Will the choice help with hiring (exciting tech) or hurt (nobody wants to work on it)?

Strategic alignment: Does this choice support or conflict with company direction? A tactically optimal choice might be strategically wrong.

Relationship dynamics: Do you have good relationships with the vendor? Does the open-source project have responsive maintainers?

Gut check: After all the analysis, does the answer feel right? Intuition is data too, just poorly calibrated. A strong negative gut reaction warrants investigation even if the numbers look good.

Cost of Delay Analysis

Many decisions are framed as “Option A vs. Option B” when the real question is “Decide now vs. decide later.”

Cost of Delay (CoD) analysis quantifies what you lose by waiting:

Revenue not earned while deciding
Customers lost to competitors who shipped first
Engineers blocked waiting for direction
Technical debt accumulating while you deliberate

If delaying a decision by one week costs $50K in blocked engineering time, spending two weeks evaluating options that differ by $10K in annual costs is irrational.

Conversely, if a decision is irreversible and gets more expensive to change over time, rushing to save a week of delay may cost you years of suboptimal architecture.

The framework: estimate the cost of delay, estimate the value of additional information from waiting, compare.
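The comparison can be written down in a few lines. This sketch plugs in the $50K-per-week and $10K-per-year figures from the example above; the three-year horizon is an assumption, and treating the full cost gap between options as an upper bound on the value of more information is a deliberate simplification.

```python
def cost_of_waiting(cost_of_delay_per_week: float, weeks: float) -> float:
    """What the organization loses while the decision stays open."""
    return cost_of_delay_per_week * weeks


def max_value_of_information(annual_cost_difference: float,
                             horizon_years: float) -> float:
    """Upper bound on the benefit of choosing better: the whole gap
    between the options over the planning horizon (an assumption)."""
    return annual_cost_difference * horizon_years


def should_keep_evaluating(cost_of_delay_per_week: float, weeks: float,
                           annual_cost_difference: float,
                           horizon_years: float) -> bool:
    """True only if the best-case gain from more analysis exceeds its cost."""
    return (max_value_of_information(annual_cost_difference, horizon_years)
            > cost_of_waiting(cost_of_delay_per_week, weeks))


# $50K/week of blocked engineering time, two more weeks of evaluation,
# options that differ by $10K/year, judged over an assumed 3-year horizon.
print(cost_of_waiting(50_000, 2))                    # 100000
print(max_value_of_information(10_000, 3))           # 30000
print(should_keep_evaluating(50_000, 2, 10_000, 3))  # False
```

Even with the generous upper bound, two more weeks of evaluation costs more than three times what a perfect choice could save.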

Build vs. Buy Decisions

The Fundamental Economics

Build vs. buy is ultimately an economic question, but the economics are subtle.

Building has:

High upfront cost (engineering time)
Ongoing maintenance cost (bugs, updates, on-call)
Opportunity cost (engineers not working on core product)
Full control and customization
No external dependencies

Buying has:

Lower upfront cost (integration time)
Ongoing subscription/usage cost
Vendor risk (price increases, deprecation, outages)
Limited customization
Faster time to market

The crossover analysis: at what scale does building become cheaper?

The Build vs. Buy Curve

The total cost of build versus buy options follows predictable patterns:

In early stages, buying is cheaper. There’s a crossover point where building becomes cheaper due to lower marginal costs. But the crossover point is often further out than optimistic engineers estimate, because they undercount:

Time to reach feature parity with the bought solution
Ongoing maintenance and security updates
On-call burden and incident response
Documentation and onboarding
Opportunity cost of what else those engineers could build

A common pattern: “We can build this in 3 months” becomes 9 months. “One engineer can maintain it” becomes a team. Meanwhile, the bought solution improved, and the company’s needs evolved.
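A small cumulative-cost model shows how sensitive the crossover point is to these underestimates. All dollar figures below are hypothetical (in thousands), and the model ignores the revenue lost while the built solution is unavailable, which would push the crossover out further still.

```python
def cumulative_build_cost(month, build_months, monthly_team_cost,
                          monthly_maintenance):
    """Development cost until launch, then maintenance cost afterwards."""
    dev_months = min(month, build_months)
    maintenance_months = max(0, month - build_months)
    return dev_months * monthly_team_cost + maintenance_months * monthly_maintenance


def cumulative_buy_cost(month, integration_cost, monthly_subscription):
    """One-time integration plus a flat subscription."""
    return integration_cost + month * monthly_subscription


def crossover_month(build_months, monthly_team_cost, monthly_maintenance,
                    integration_cost, monthly_subscription, horizon_months=120):
    """First month in which building has cost no more than buying, else None."""
    for month in range(1, horizon_months + 1):
        build = cumulative_build_cost(month, build_months, monthly_team_cost,
                                      monthly_maintenance)
        buy = cumulative_buy_cost(month, integration_cost, monthly_subscription)
        if build <= buy:
            return month
    return None


# Hypothetical vendor: $20K integration, $30K/month subscription.
# Hypothetical build team: $60K/month while building.
optimistic = crossover_month(build_months=3, monthly_team_cost=60,
                             monthly_maintenance=10,
                             integration_cost=20, monthly_subscription=30)
realistic = crossover_month(build_months=9, monthly_team_cost=60,
                            monthly_maintenance=25,
                            integration_cost=20, monthly_subscription=30)
print(optimistic, realistic)  # 7 59
```

With these placeholder inputs, tripling the build time and raising maintenance from $10K to $25K per month moves the crossover from month 7 to month 59, close to five years out.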

AI-Specific Build vs. Buy Considerations

AI systems add unique dimensions to build vs. buy analysis:

Model quality trajectory: If you build, your model’s quality is fixed until you invest in improving it. If you buy API access, the vendor typically keeps improving the model, often at no extra cost to you. A self-built model that’s competitive today might be outdated in six months.

Inference economics: API pricing (per token) creates linear cost scaling. Self-hosting has high fixed costs but lower marginal costs. As a rough order of magnitude, the crossover for a frontier-class workload sits somewhere in the low millions of tokens per day, but treat that figure as illustrative, not a constant. It depends entirely on the assumptions baked in: model size (a 70B model is far cheaper to self-host than a frontier-scale model), GPU type and price, and, critically, sustained utilization (the break-even roughly doubles if your GPUs sit at 25% utilization instead of 50%). Note also that “tokens per day” and “requests per month” are different units: Chapter 32 derives the same decision on a requests/month basis and shows the break-even swings ~5x with output-length distribution. Use that model with your own measured numbers rather than memorizing a single crossover figure. Below the crossover, APIs are cheaper; above it, self-hosting wins, provided you can match capability.

Capability uncertainty: We often don’t know what AI can do until we try. Building custom solutions requires upfront commitment. APIs allow rapid experimentation before committing.

Data privacy: Building keeps data internal. APIs may send data to third parties, creating compliance and competitive risks.

Customization requirements: Fine-tuning requires building (or using vendors who support it). If your use case needs customization, pure API approaches may be insufficient.
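The inference economics point above lends itself to a back-of-the-envelope model. The sketch below is one simplified formulation, not the Chapter 32 model: it assumes a volume-independent fixed cost (engineering and on-call), a GPU fleet sized to traffic, and a flat API price. Every input value is a placeholder chosen for round arithmetic, not a benchmark or a real price.

```python
def breakeven_tokens_per_day(fixed_cost_per_day, gpu_cost_per_hour,
                             peak_tokens_per_gpu_second, utilization,
                             api_price_per_million_tokens):
    """Daily token volume above which self-hosting is cheaper than the API,
    or None if self-hosting never wins under these assumptions."""
    sustained_tokens_per_hour = peak_tokens_per_gpu_second * utilization * 3600
    self_host_price_per_million = (gpu_cost_per_hour / sustained_tokens_per_hour
                                   * 1_000_000)
    saving_per_million = api_price_per_million_tokens - self_host_price_per_million
    if saving_per_million <= 0:
        return None  # GPUs cost more per token than the API does
    return fixed_cost_per_day / saving_per_million * 1_000_000


# Placeholder inputs: $800/day of fixed overhead, $3.60 per GPU-hour,
# 1,000 tokens/s per GPU at full load, API at $6 per million tokens.
for utilization in (0.50, 0.25):
    volume = breakeven_tokens_per_day(
        fixed_cost_per_day=800, gpu_cost_per_hour=3.60,
        peak_tokens_per_gpu_second=1000, utilization=utilization,
        api_price_per_million_tokens=6.0)
    print(f"utilization {utilization:.0%}: break-even at {volume:,.0f} tokens/day")
```

With these particular placeholders, halving utilization from 50% to 25% doubles the break-even volume, but that ratio is a property of the chosen inputs; with a higher API price the same drop in utilization moves the break-even much less. The useful output is the shape of the dependency, not the absolute numbers.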

Build vs. Buy Decision Framework

When evaluating build vs. buy, ask these questions in order:

1. Is this core to our competitive advantage?

If yes, lean toward building. Your competitive moat shouldn’t depend on a vendor’s roadmap. If no, lean toward buying: you shouldn’t build commodity capabilities.

Test: Would customers choose us over competitors because of this capability? If not, it’s probably not core.

2. Do good solutions exist?

If the market offers solutions that meet your needs, building requires strong justification. If nothing adequate exists, you may have no choice but to build.

Test: Could we achieve 80% of our requirements with an existing solution? If yes, the 20% gap rarely justifies building from scratch.

3. What’s our timeline?

Building takes months to years. Buying takes weeks to months. If you need something in weeks, building isn’t an option regardless of long-term economics.

Test: What’s the cost of shipping three months later? If it’s significant, buying gets more attractive.

4. Do we have the expertise?

Building requires deep expertise. If you don’t have it, you’ll hire or develop it, both of which take time and carry risk.

Test: Have we built something similar before? Do we have people who’ve done this at a previous company?

5. What’s the total cost of ownership?

Include all costs: development, maintenance, on-call, opportunity cost, risk of failure. Compare to all costs of buying: subscription, integration, vendor management, switching costs.

Test: Be honest about maintenance. Most teams underestimate it by 2-3x.
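
The five questions can be captured as a small checklist so that the answers are written down rather than argued from memory. The following is one possible encoding, not a prescribed algorithm: the class and function names are hypothetical, and the way the answers are combined (question 3 as a hard constraint, the rest as a simple tally) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BuildVsBuyInputs:
    core_to_advantage: bool         # Q1: would customers choose us because of this?
    adequate_solution_exists: bool  # Q2: does something cover ~80% of requirements?
    needed_within_weeks: bool       # Q3: is the deadline weeks rather than months?
    have_expertise: bool            # Q4: have we built something similar before?
    build_tco: float                # Q5: total cost of ownership, same horizon for both
    buy_tco: float


def lean(inputs: BuildVsBuyInputs) -> Tuple[str, List[str]]:
    """Walk the questions in order and return (verdict, reasons)."""
    # Q3 is treated as a hard constraint: a deadline in weeks rules out building.
    if inputs.needed_within_weeks:
        return "buy", ["Q3: needed within weeks, so building is not an option"]

    signals = []  # (direction, reason) pairs in question order
    signals.append(("build", "Q1: core to competitive advantage")
                   if inputs.core_to_advantage
                   else ("buy", "Q1: commodity capability"))
    signals.append(("buy", "Q2: an existing solution covers most requirements")
                   if inputs.adequate_solution_exists
                   else ("build", "Q2: nothing adequate on the market"))
    if not inputs.have_expertise:
        signals.append(("buy", "Q4: expertise would have to be hired or developed"))
    signals.append(("build", "Q5: building has the lower total cost of ownership")
                   if inputs.build_tco < inputs.buy_tco
                   else ("buy", "Q5: buying has the lower or equal total cost of ownership"))

    build_votes = sum(1 for direction, _ in signals if direction == "build")
    buy_votes = len(signals) - build_votes
    if build_votes == buy_votes:
        verdict = "undecided"
    else:
        verdict = "build" if build_votes > buy_votes else "buy"
    return verdict, [reason for _, reason in signals]


verdict, reasons = lean(BuildVsBuyInputs(
    core_to_advantage=False, adequate_solution_exists=True,
    needed_within_weeks=False, have_expertise=True,
    build_tco=500_000, buy_tco=300_000))
print(verdict, reasons)
```

An "undecided" result is a prompt to look harder at the individual answers, since a tally cannot weigh how strongly each one applies.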

Hidden Costs on Both Sides

Hidden build costs:

Edge cases that simple prototypes don’t reveal
Security hardening and compliance
Documentation and knowledge transfer
On-call burden and incident response
Opportunity cost of not shipping other features

Hidden buy costs:

Integration complexity (APIs change, data formats differ)
Vendor lock-in (what does switching cost?)
Compliance work (security reviews, data processing agreements)
Support interactions (tickets, waiting for fixes)
Price increases (especially after you’re dependent)

The Hybrid Path

Often the best answer is neither pure build nor pure buy:

Buy, then build: Start with a bought solution to validate the use case and learn the domain. Build when you’ve proven value and understand requirements deeply.

Build the wrapper, buy the capability: Use APIs for core functionality but build the orchestration, monitoring, and integration layer. This preserves optionality while reducing build scope.

Buy the commodity, build the differentiation: Use standard components for common functionality. Build only the parts that create competitive advantage.

Reversibility and Decision Speed

One-Way vs. Two-Way Doors

Jeff Bezos popularized the distinction between one-way and two-way door decisions:

One-way doors (Type 1): Irreversible or very costly to reverse. Once through, you can’t easily go back. These merit careful analysis, broad input, and deliberate process.

Two-way doors (Type 2): Easily reversible. If you don’t like where you end up, you can return and try again. These should be made quickly by individuals or small teams.

The failure mode for most organizations is treating two-way doors as one-way doors, applying heavyweight process to decisions that could easily be undone. This slows everything down and burns goodwill.

Examples in AI engineering:

One-way doors:

Public API contracts (breaking changes harm customers)
Major vendor commitments (multi-year contracts)
Data format choices (migration is expensive)
Production model architecture (retraining takes months)

Two-way doors:

Internal service implementation details
Prompt engineering approaches
Development tool choices
Experiment configurations

Preserving Optionality

Given uncertainty about the future, options have value. A decision that preserves future options is worth more than one that forecloses them, all else being equal.

Abstraction layers: Wrapping external services in internal interfaces preserves the option to switch providers. The cost is complexity; the benefit is flexibility.

Incremental commitments: Signing a month-to-month contract preserves options vs. a three-year commitment. The cost may be a higher monthly rate; the benefit is flexibility.

Keeping data: Storing raw data preserves the option to reprocess it differently later. The cost is storage; the benefit is not being locked into current assumptions.

Modularity: Building systems from separable components preserves the option to replace individual pieces. The cost is interface overhead; the benefit is evolutionary capability.
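
As an illustration of the abstraction-layer idea, here is a minimal sketch in which application code depends on an internal interface rather than on any one provider. The names (EmbeddingProvider, VendorEmbeddings, SelfHostedEmbeddings, index_documents) are hypothetical, and both adapters return dummy vectors so the example runs offline; a real adapter would call the vendor SDK or an internal inference service instead.

```python
from typing import List, Protocol


class EmbeddingProvider(Protocol):
    """Internal interface: the only contract application code may rely on."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        ...


class VendorEmbeddings:
    """Adapter for a hypothetical vendor API (the one place that would import its SDK)."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Stub: a real adapter would make the vendor call here.
        return [[float(len(text)), 0.0] for text in texts]


class SelfHostedEmbeddings:
    """Adapter for a hypothetical self-hosted model behind the same interface."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Stub: a real adapter would call the internal inference service here.
        return [[0.0, float(len(text))] for text in texts]


def index_documents(provider: EmbeddingProvider, docs: List[str]) -> int:
    """Application code: knows the interface, never the vendor."""
    vectors = provider.embed(docs)
    return len(vectors)


docs = ["refund policy", "shipping times"]
print(index_documents(VendorEmbeddings(), docs))
print(index_documents(SelfHostedEmbeddings(), docs))
```

Switching providers then means writing one new adapter. The cost named above still applies: the interface has to be maintained, and it tends to expose only what all providers have in common.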

But optionality isn’t free. Over-indexing on flexibility leads to:

Excessive abstraction that nobody uses
Analysis paralysis (“we might need X someday”)
Slower development from unnecessary indirection

The skill is knowing which options are valuable. Options are most valuable when:

The future is highly uncertain
The cost of the option is low
The cost of being wrong without the option is high

Decision Velocity as Competitive Advantage

Organizations that make decisions faster have an advantage. They learn faster (more decisions = more feedback), adapt faster, and ship faster.

Andy Grove observed that “only the paranoid survive,” but paranoia that paralyzes is worse than overconfidence that ships. The goal is calibrated decision-making: appropriate speed for the stakes.

Practices that increase decision velocity:

Default to two-way door assumptions unless proven otherwise
Establish clear decision rights (who decides what)
Set time limits for decisions (no open-ended deliberation)
Prefer “disagree and commit” over consensus
Create safe environments for reversing decisions

Warning signs of slow decision-making:

Multiple meetings without decisions
Decisions requiring executive approval that shouldn’t
Fear of being wrong overwhelming the desire to make progress
Perfect being the enemy of good

Case Studies: Decisions That Aged Well and Poorly

Case Study 1: The Embedding Model Bet (Aged Well)

Situation: In early 2023, a search startup needed to choose embedding models for their semantic search product. OpenAI’s ada-002 was dominant. The team debated building on OpenAI vs. using open-source alternatives.

Decision: They chose to build on sentence-transformers with self-hosted inference, accepting lower initial quality for control and economics.

Rationale documented in ADR:

OpenAI pricing created unit economics problems at scale
Dependency on a single vendor for core product capability was risky
Open-source quality was “good enough” and improving rapidly
Self-hosting enabled customization and fine-tuning

Outcome: By late 2023, open-source embedding models (BGE, E5) matched or exceeded ada-002 quality. The company’s early investment in self-hosting infrastructure became a competitive advantage. Competitors locked into OpenAI faced difficult migrations when they needed customization.

Why it worked: The team correctly predicted that open-source models would improve quickly enough that the remaining quality gap would no longer justify premium pricing. They accepted short-term pain for long-term positioning.

Case Study 2: The Microservices Migration (Aged Poorly)

Situation: A mid-sized AI company decided to break their monolithic ML platform into microservices, reasoning that industry best practice was microservices architecture.

Decision: 18-month migration to decompose the monolith into 15 services.

Rationale (insufficiently documented):

“Microservices are the standard for modern systems”
“This will improve developer velocity”
“We’ll be able to scale services independently”

Outcome: Two years later, the migration was still incomplete. Developer velocity had decreased due to distributed system complexity. Debugging ML pipelines became a nightmare. The team had neither monolith simplicity nor microservices benefits.

What went wrong:

Copied industry pattern without understanding context (Netflix has thousands of engineers; they had 50)
Didn’t define success metrics upfront (how would they know if velocity improved?)
Underestimated operational complexity of distributed ML systems
No revisit criteria to recognize the migration was failing

Lesson: The ADR should have included: “We believe microservices will improve velocity, measured by deployment frequency and lead time. If these metrics don’t improve by month 9, we’ll reassess.”

Case Study 3: The Vendor Lock-in (Mixed Outcome)

Situation: An enterprise AI team needed to choose between AWS SageMaker (deep lock-in, deep integration) and cloud-agnostic tooling (more work, more flexibility).

Decision: Chose SageMaker despite lock-in concerns, reasoning that AWS was their strategic cloud provider and integration benefits were significant.

Rationale in ADR:

Team already expert in AWS; learning curve would be minimal
Feature velocity from AWS integration outweighed flexibility concerns
Multi-cloud was not a business requirement
Cost analysis showed SageMaker competitive with alternatives

Outcome: Two years later, the company was acquired by a firm using GCP. Migration out of SageMaker consumed six months of the ML platform team’s time.

But was it the wrong decision? In those two years, the SageMaker choice enabled faster development that contributed to the company’s value and acquisition. The migration cost was real but manageable.

Lesson: The decision was defensible given available information. What was missing: “If our cloud strategy changes (acquisition, cost pressure to go multi-cloud), migration cost is estimated at X months. We accept this risk because [reasons].”

Building Organizational Decision-Making Capability

Creating a Decision-Making Culture

Individual decision frameworks matter, but organizational culture determines whether they’re used. Key elements of a healthy decision culture:

Psychological safety for decisions: People must feel safe making calls that might turn out wrong. If bad outcomes are punished regardless of decision quality, people will avoid decisions.

Blameless postmortems: When decisions lead to incidents, focus on process improvement, not blame. “What did we learn?” not “Whose fault was it?”

Celebrating good process: Recognize and promote people who make well-reasoned decisions, even when outcomes are mixed. This signals that the organization values decision quality.

Transparency about uncertainty: Leaders who say “I don’t know, here’s how we’ll find out” create permission for others to acknowledge uncertainty.

Decision Rights and Escalation

Ambiguity about who decides causes delay and conflict. Make decision rights explicit:

DACI framework:

Driver: Owns driving the decision to resolution
Approver: Has final authority (ideally one person)
Contributors: Provide input and expertise
Informed: Notified of the decision

The most common failure: too many approvers. If five people must agree, you get lowest-common-denominator decisions and slow cycles. Prefer single approvers with clear accountability.
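
One lightweight way to keep decision rights explicit is to record the DACI assignment as data and check it for the failure mode just described. This is a sketch under assumptions: the DaciAssignment class, its field names, and the example people are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DaciAssignment:
    decision: str
    driver: str
    approvers: List[str]
    contributors: List[str] = field(default_factory=list)
    informed: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of decision-rights problems; empty means the roles are clear."""
        issues = []
        if not self.driver:
            issues.append("no driver: nobody owns getting to a resolution")
        if not self.approvers:
            issues.append("no approver: nobody has final authority")
        elif len(self.approvers) > 1:
            issues.append(f"{len(self.approvers)} approvers: prefer a single approver")
        return issues


clear = DaciAssignment(
    decision="Choose a vector store",
    driver="staff engineer",
    approvers=["CTO"],
    contributors=["search team", "platform team"],
    informed=["support"],
)
crowded = DaciAssignment(
    decision="Choose a vector store",
    driver="staff engineer",
    approvers=["CTO", "VP Eng", "VP Product", "principal engineer", "head of data"],
)
print(clear.problems())
print(crowded.problems())
```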

Escalation paths: When should decisions escalate? When shouldn’t they? Define this explicitly. “Architectural decisions affecting multiple teams escalate to Staff+ review. Implementation decisions within a team don’t.”

Learning from Decisions

Most organizations fail to learn from past decisions. They make similar mistakes repeatedly, reinvent solutions to solved problems, and can’t explain why they do things the way they do.

Decision reviews: Schedule reviews of major decisions 6-12 months after implementation. Compare expected outcomes to actual outcomes. Identify what you didn’t anticipate.

Decision archives: Maintain searchable ADRs and RFCs. When facing new decisions, search for similar past decisions.

Pattern extraction: Look across decisions for recurring themes. Are you consistently underestimating build costs? Over-relying on certain vendors? Pattern recognition enables systematic improvement.

Practical Exercises

Exercise 1: Write an ADR

Choose a technical decision from your recent work. Write a complete ADR including:

Context with specific constraints and requirements
The decision with clear rationale
At least three alternatives considered with rejection reasons
Both positive and negative consequences
Specific criteria for when to revisit

Review it with someone who wasn’t involved in the original decision. Can they understand the reasoning?

Exercise 2: Decision Speed Calibration

For one week, track every technical decision you make or participate in:

What was the decision?
How long did it take?
Was it reversible or irreversible?
Was the time spent proportionate to the stakes?

At the end of the week, categorize decisions. Were you spending too much time on reversible decisions? Too little on irreversible ones?

Exercise 3: Build vs. Buy Analysis

For a capability your team needs:

Estimate honest build cost (include 2x buffer for unknowns)
Identify three buy options with pricing
Calculate 3-year TCO for each option
Apply the framework: core to advantage? Solutions exist? Timeline? Expertise?
Document hidden costs for each path
Make and defend a recommendation

Exercise 4: Postmortem a Decision

Find a past decision that turned out differently than expected:

What was the original decision and rationale?
What information was available then vs. now?
Was it a good decision with a bad outcome, or a bad decision?
What would have improved the process (not outcome)?
What would you do differently?

Self-Assessment Checkpoint

Conceptual Questions

Q1. [Senior] What makes a decision “two-way door” vs “one-way door”? Give examples of each in AI engineering.

Answer

Two-way door (reversible): Easy to undo if wrong. Examples: Choosing a Python library, prompt template changes, feature flag experiments, model hyperparameters during development. One-way door (irreversible): Hard/expensive to reverse. Examples: Database schema for production data, API contracts exposed to customers, choosing to build vs. buy a core capability, multi-year cloud commitments, publicly committing to AI features. The distinction matters because: Two-way doors should be decided quickly—cost of reversal is low, cost of delay is high. One-way doors merit careful analysis—reversing is expensive, so invest upfront. Most decisions are more reversible than they seem. Ask: “What would it cost to reverse this in 6 months?”

Q2. [Senior] What should an Architecture Decision Record (ADR) contain? Why is each section important?

Answer

Essential sections: (1) Title and date—identification. (2) Status—proposed/accepted/deprecated/superseded. Shows lifecycle. (3) Context—what’s the situation? Constraints, requirements, forces. Without context, decision makes no sense later. (4) Decision—what was decided? Clear, concise statement. (5) Rationale—why this decision? The reasoning, not just the choice. (6) Alternatives considered—what else was evaluated and why rejected? Shows due diligence, helps future readers. (7) Consequences—both positive and negative. Honest assessment. (8) Revisit conditions—when should this be reconsidered? Prevents outdated decisions from persisting. Why important: Future teams (including future you) need to understand not just what was decided, but why. Without ADRs, you relitigate decisions endlessly or blindly follow outdated choices.
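
The section list above maps naturally onto a small record type, which makes it easy to check a draft for gaps before review. The sketch below is one way to do that; the ADR class, its field names, and the example values (including the date) are illustrative placeholders, not a standard schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ADR:
    title: str
    date: str
    status: str       # proposed / accepted / deprecated / superseded
    context: str
    decision: str
    rationale: str
    alternatives: List[str] = field(default_factory=list)
    consequences: List[str] = field(default_factory=list)
    revisit_conditions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of sections left empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


thin_adr = ADR(
    title="Use Redis for caching",
    date="2024-01-15",  # placeholder date
    status="accepted",
    context="",
    decision="We will use Redis for caching.",
    rationale="Redis is good for caching.",
)
print(thin_adr.missing_sections())
# ['context', 'alternatives', 'consequences', 'revisit_conditions']
```

A check like this only catches empty sections. It cannot tell whether a rationale is circular or whether the alternatives were seriously considered, so it complements review by a person rather than replacing it.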

Q3. [Staff] How do you approach a build vs. buy decision for a critical capability? What hidden costs should you consider?

Answer

Framework: (1) Is this core to competitive advantage? If yes, lean build. (2) Do mature solutions exist? If good options exist, lean buy. (3) Timeline urgency? Buy is usually faster. (4) Do you have expertise? Building what you don’t understand is expensive. Hidden costs of building: (1) Opportunity cost—engineers not building product features. (2) Maintenance burden—you own it forever. (3) Unknown unknowns—projects always take longer than estimated. Apply 2-3x multiplier. (4) Oncall, documentation, training. Hidden costs of buying: (1) Vendor lock-in—switching costs grow over time. (2) Customization limitations—may not fit exactly. (3) Integration work—often underestimated. (4) Ongoing licensing—costs scale with usage. Analysis: Calculate 3-year TCO for both, including hidden costs. Consider optionality—can you start with buy and migrate to build later?

Q4. [Staff] How do you drive a technical decision to closure when stakeholders have conflicting opinions?

Answer

Process: (1) Clarify the decision—often disagreements are about different interpretations of the question. (2) Identify shared constraints—what does everyone agree on? (3) Make values explicit—are people optimizing for different things (speed vs. quality, cost vs. flexibility)? Surface these. (4) Use structured tradeoff analysis—weighted criteria, sensitivity analysis. Numbers reduce emotion. (5) Disagree and commit protocol—set a deadline. If no consensus, decision owner decides. Dissenters commit to supporting the decision. (6) Time-box—don’t let decisions stall indefinitely. Set clear deadlines. Key: The goal isn’t consensus; it’s a decision that people can support. Document dissenting views in the ADR—validates concerns and provides context for future reconsideration.
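
Step (4) can be sketched in a few lines: score each option against weighted criteria, then perturb one weight at a time to see whether the winner changes. The criteria, weights, scores, and option names below are invented for illustration, and doubling a weight is just one simple choice of perturbation.

```python
from typing import Dict, Tuple

Weights = Dict[str, float]
Scores = Dict[str, Dict[str, float]]  # option -> criterion -> score (1 to 5)


def weighted_scores(weights: Weights, scores: Scores) -> Dict[str, float]:
    """Weighted average score per option."""
    total_weight = sum(weights.values())
    return {
        option: sum(weights[c] * option_scores[c] for c in weights) / total_weight
        for option, option_scores in scores.items()
    }


def winner(weights: Weights, scores: Scores) -> str:
    totals = weighted_scores(weights, scores)
    return max(totals, key=totals.get)


def sensitivity(weights: Weights, scores: Scores,
                factor: float = 2.0) -> Tuple[str, Dict[str, str]]:
    """Scale each weight by `factor` in turn; report criteria that flip the winner."""
    baseline = winner(weights, scores)
    flips = {}
    for criterion in weights:
        adjusted = dict(weights)
        adjusted[criterion] = weights[criterion] * factor
        new_winner = winner(adjusted, scores)
        if new_winner != baseline:
            flips[criterion] = new_winner
    return baseline, flips


# Hypothetical inputs agreed on by the stakeholders before scoring.
weights = {"speed": 3, "cost": 2, "flexibility": 1}
scores = {
    "managed": {"speed": 5, "cost": 3, "flexibility": 2},
    "custom": {"speed": 2, "cost": 5, "flexibility": 5},
}

baseline, flips = sensitivity(weights, scores)
print(baseline, flips)
```

Here the managed option wins on the agreed weights, but doubling the weight of either cost or flexibility flips the result. That tells the group the real disagreement is about how much those two criteria matter, which is a more productive conversation than arguing about the options directly.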

Q5. [Staff] How do you learn from past technical decisions? What makes a decision retrospective effective?

Answer

Learning mechanisms: (1) Scheduled reviews—revisit major decisions 6-12 months post-implementation. Compare expected vs. actual outcomes. (2) Decision archives—searchable ADRs and RFCs. Before new decisions, search for similar past decisions. (3) Pattern extraction—across decisions, what themes emerge? Consistently underestimating build costs? Over-relying on certain approaches? Effective retrospective: (1) Separate decision quality from outcome quality—good decisions can have bad outcomes (and vice versa). (2) Focus on what was knowable at decision time—hindsight bias is the enemy. (3) Identify information that would have changed the decision—and how to get it earlier next time. (4) Extract actionable improvements—not just “be more careful.” Specific process changes. (5) Share learnings—organizational learning requires dissemination.

Spot the Problem

Problem 1. [Senior] ADR excerpt:

Decision: We will use Redis for caching.

Rationale: Redis is good for caching.

Answer

Problems: (1) No context—what are the requirements? What constraints exist? (2) Circular rationale—“Redis is good for caching” doesn’t explain why Redis over alternatives. (3) No alternatives—was Memcached considered? Local caching? CDN? Why Redis specifically? (4) No consequences—what does this enable? What risks does it create? Better: “Context: We need sub-10ms read latency for user preferences, supporting 10K req/s. Decision: Redis, because it meets latency requirements, we have operational experience, and it supports data structures we need for ranked recommendations. Alternatives rejected: Memcached (no sorted sets), local cache (cluster consistency), CDN (can’t cache personalized data). Consequences: (+) Fast, flexible; (-) Additional infrastructure, memory cost.”

Problem 2. [Staff] Decision-making behavior:

"This is an important decision, so let's get everyone's input.
We'll schedule weekly meetings until we reach consensus."

Answer

Problems: (1) No deadline: meetings will continue indefinitely and the decision stalls. (2) Consensus requirement: consensus is often impossible, and one blocker can hold everyone hostage. (3) Everyone’s input isn’t equally valuable: expertise matters. A junior engineer’s opinion on architecture shouldn’t carry the same weight as a principal engineer’s. (4) Meeting-heavy process: input could be gathered asynchronously through an RFC. Better: (1) Set a decision deadline upfront. (2) Use DACI: name who drives the decision (Driver), who decides (Approver), who provides input (Contributors), and who is notified (Informed). (3) Use an RFC for async input with a response deadline. (4) “Disagree and commit” if there is no consensus by the deadline. (5) Keep effort proportionate: is this decision worth weeks of meetings?

Problem 3. [Staff] Build vs. buy analysis:

"Building our own vector database will take 2 engineers 3 months.
Pinecone costs $5K/month. Build is obviously cheaper long-term."

Answer

Flawed analysis: (1) Underestimated build cost: 2 engineers × 3 months is likely 2-3x optimistic. Call it 12 engineer-months realistically. (2) No maintenance cost: who maintains it afterward? On-call? Bug fixes? Upgrades? Add 0.5-1 engineer ongoing. (3) Opportunity cost: those 2 engineers could build product features. What’s the revenue impact? (4) Hidden build costs: documentation, testing, training, deployment infrastructure. (5) Pinecone provides more than storage: operational reliability, scaling, support. (6) Time to market: a delay of 3+ months has a business cost. Proper analysis: Build (12 eng-months @ $15K/month = $180K) + ongoing (0.5 FTE = $90K/year) + opportunity cost vs. Buy ($60K/year) + integration effort. Plus: Is a vector database your competitive advantage?
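
The arithmetic in this answer can be laid out over a three-year horizon. The sketch below uses the figures given above ($15K per engineer-month, 12 engineer-months to build, 0.5 FTE of maintenance, $5K/month subscription). The 2 engineer-months of integration work on the buy side is an assumed placeholder, maintenance is charged for all three years as a simplification, and opportunity cost is left out because the text gives no figure for it.

```python
ENGINEER_COST_PER_MONTH = 15_000  # USD, from the analysis above


def build_tco(build_eng_months: float, maintenance_fte: float, years: int) -> float:
    """Upfront build cost plus ongoing maintenance over the horizon."""
    upfront = build_eng_months * ENGINEER_COST_PER_MONTH
    maintenance = maintenance_fte * 12 * ENGINEER_COST_PER_MONTH * years
    return upfront + maintenance


def buy_tco(subscription_per_month: float, integration_eng_months: float, years: int) -> float:
    """Subscription over the horizon plus one-time integration effort."""
    subscription = subscription_per_month * 12 * years
    integration = integration_eng_months * ENGINEER_COST_PER_MONTH
    return subscription + integration


YEARS = 3
naive_build = build_tco(build_eng_months=2 * 3, maintenance_fte=0, years=YEARS)
realistic_build = build_tco(build_eng_months=12, maintenance_fte=0.5, years=YEARS)
naive_buy = buy_tco(subscription_per_month=5_000, integration_eng_months=0, years=YEARS)
realistic_buy = buy_tco(subscription_per_month=5_000, integration_eng_months=2, years=YEARS)

print(f"naive:     build ${naive_build:,.0f} vs buy ${naive_buy:,.0f}")
print(f"realistic: build ${realistic_build:,.0f} vs buy ${realistic_buy:,.0f}")
```

The naive comparison ($90,000 to build vs. $180,000 to buy) supports the original claim. Once realistic effort and maintenance are included, the build path comes to $450,000 against $210,000 for buying, before any opportunity cost is counted.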

Design Exercises

Exercise 1. [Staff] Write an ADR for this decision: Your team needs to choose between using a third-party LLM API (OpenAI/Anthropic) vs. self-hosting an open-source model. Consider: cost at scale, quality requirements, latency constraints, data privacy, and vendor dependency.

Guidance

ADR structure: Context: What’s your use case? Volume, quality bar, latency needs, data sensitivity. Constraints (budget, timeline, expertise). Decision: Choose one with clear rationale linked to requirements. For example: “API for initial launch, migrate to self-hosted at 1M req/month.” Alternatives: (1) API only—pros (fast, quality), cons (cost, dependency, data leaves network). (2) Self-host only—pros (cost at scale, data control), cons (expertise needed, maintenance, slower quality improvements). (3) Hybrid—pros (flexibility), cons (complexity). Consequences: Include both cost projections (3-year TCO) and operational implications. Revisit criteria: “Reconsider when monthly volume exceeds X or if API costs exceed Y.”

Exercise 2. [Staff] You’re a Staff engineer asked to drive a decision on migrating from a monolith to microservices. Multiple teams have strong opinions. The VP wants a decision in 2 weeks. Design your process for driving this decision to closure.

Guidance

Process: Week 1: (1) Day 1-2: Write RFC framing the decision, criteria for evaluation, and proposal. Distribute for async feedback. (2) Day 3-5: Gather written input. Meet 1:1 with key stakeholders to understand concerns. (3) End of week 1: Synthesize feedback, update RFC with alternatives section addressing concerns. Week 2: (4) Day 1-2: Schedule 90-minute decision meeting with essential stakeholders only (not everyone). Pre-read required. (5) Meeting: Don’t rehash—assume everyone read RFC. Focus on remaining disagreements. Use structured discussion (each alternative gets 10 min). (6) Day 3: Make decision (you or designated decision-maker). Document dissenting views. (7) Day 4-5: Finalize ADR, communicate decision broadly, define immediate next steps. Key: Async for input, sync for deliberation, clear ownership for decision. Don’t let desire for consensus block progress.

Key Takeaways

Document decisions with ADRs - Architecture Decision Records capture context, rationale, alternatives, and consequences. Future teams and future you will thank you for the documentation.
Use structured processes for collaboration - RFCs enable broad input while maintaining momentum. Clear ownership and timelines prevent decisions from stalling indefinitely.
Analyze tradeoffs systematically - Use weighted criteria, not just intuition. Make your values explicit. Conduct sensitivity analysis to understand what factors matter most.
Build costs are almost always underestimated - Include hidden costs: maintenance, documentation, training, deployment infrastructure. Multiply build estimates by 2-3x for realistic planning.
Match decision speed to stakes - Two-way doors (reversible decisions) should be fast. One-way doors merit deliberation. The biggest mistake is treating everything as one-way.

Summary

Technical decision-making is a skill, not a talent. It can be developed through deliberate practice and appropriate frameworks.

Document decisions with ADRs that capture context, rationale, alternatives, and consequences. Future teams will thank you. Future you will thank you.

Use structured processes for collaborative decisions. RFCs enable broad input while maintaining momentum. Clear ownership and timelines prevent decisions from stalling.

Analyze tradeoffs systematically with weighted criteria, not just intuition. Make your values explicit. Conduct sensitivity analysis to understand what matters most.

Navigate build vs. buy with clear economic thinking. Include hidden costs. Consider optionality. Remember that most organizations underestimate build costs.

Match decision speed to stakes. Two-way doors should be fast. One-way doors merit deliberation. The biggest mistake is treating everything as one-way.

Build organizational capability by creating psychological safety, clarifying decision rights, and learning from outcomes.

The goal isn’t to make perfect decisions. Perfect decisions require perfect information, which doesn’t exist. The goal is to make decisions that are good enough, fast enough, and documented well enough that you and your organization can learn and improve.

Connections to Other Chapters

Chapter 23 (Technical Communication) covers presenting and socializing decisions
Chapter 25 (System Design at Scale) provides context for architecture decisions
Chapter 26 (Cross-Team Technical Leadership) addresses building consensus across organizations
Chapter 32 (Cost Engineering) informs economic analysis of decisions

--- title: "Chapter 26: Technical Decision Making" keywords: [decision making, ADRs, trade-offs, build vs buy, technical strategy, reversibility, consensus building] difficulty: advanced prerequisites: [ch22, ch23] estimated_time: "3-4 hours" --- ## Introduction In January 2021, a fast-growing AI startup faced a critical decision. Their recommendation engine, built on a custom embedding pipeline, was straining under 10x traffic growth. The CTO proposed migrating to a managed vector database service that promised seamless scaling. The principal engineer advocated for building their own distributed vector store, arguing that the unique access patterns justified custom infrastructure. The product leader pushed for a hybrid approach: use the managed service now, plan migration later. Three weeks of meetings produced no consensus. Meanwhile, latency crept up, customers complained, and a key enterprise deal stalled. Finally, the CEO mandated a decision: managed service, shipping in two weeks. Eighteen months later, that managed service had a 47-hour outage affecting the startup's largest customer. The principal engineer's concerns about vendor dependency had proven prescient. But here's the thing: the decision was still correct. In the intervening months, the company had closed $40M in revenue that wouldn't have materialized if they'd spent six months building custom infrastructure. The outage cost them one customer; the delay would have cost them the company. This story illustrates why technical decision-making is so hard. The "right" answer depends on context that changes over time. Good decisions can have bad outcomes; bad decisions can luck into success. What separates great technical leaders isn't an ability to predict the future, but rather a disciplined approach to making decisions with incomplete information, documenting the reasoning, and learning from outcomes. 
### The Compounding Nature of Technical Decisions Technical decisions compound in ways that non-technical decisions often don't. Choose the wrong database, and every feature built on top inherits that constraint. Pick an incompatible serialization format, and years later teams still write adapters. Select a model architecture optimized for today's hardware, and watch it struggle on tomorrow's. This compounding works in both directions. Good architectural decisions create leverage: they make future decisions easier, enable capabilities you didn't anticipate, and reduce the cognitive load on your team. The early AWS decision to standardize on service-oriented architecture enabled them to expose internal services as external APIs, eventually becoming AWS itself. That wasn't the plan, but the decision created options. Research on technical debt by Cunningham and later by Kruchten, Nord, and Ozkaya demonstrates how poor decisions accumulate interest. Industry surveys (Besker et al., 2018, covering 15 large software organizations) found that engineers spend roughly 25% of their time on technical-debt-related work as systems age. But the same research community has shown that deliberate, documented debt taken strategically can accelerate time-to-market when managed properly—what kills teams is undocumented, accidental debt. The difference between debt that destroys teams and debt that enables them lies almost entirely in how decisions were made and documented. Undocumented decisions become mysteries; mysteries become folklore; folklore becomes legacy code that no one dares touch. ### Why This Chapter Matters If you're reading this book, you likely make technical decisions that affect systems serving millions of users, handling sensitive data, or processing millions of dollars in transactions. The frameworks in this chapter will help you: 1. **Structure decision-making processes** so that important choices get appropriate rigor 2. 
**Document decisions** so that future teams (including future you) understand the context 3. **Analyze tradeoffs systematically** rather than relying on intuition alone 4. **Navigate build vs. buy** in a landscape where AI capabilities change monthly 5. **Recognize reversible vs. irreversible decisions** and calibrate your process accordingly 6. **Build organizational decision-making capability** beyond your individual contributions ### What You'll Learn - How Architecture Decision Records (ADRs) capture the "why" behind technical choices - When and how to use RFC processes for collaborative decisions - Structured frameworks for tradeoff analysis that go beyond gut feeling - Economic models for build vs. buy decisions, especially in AI systems - The science of decision-making under uncertainty - How to build intuition for fast vs. slow decisions ### Prerequisites - Experience operating production systems (Parts I & II) - Familiarity with system design principles (Chapter 22) - Understanding of team dynamics and organizational context (Chapter 20) --- ## The Science of Decision Making ### Dual-Process Theory and Technical Choices Daniel Kahneman's research on dual-process cognition provides a foundation for understanding how engineers make decisions. System 1 thinking is fast, intuitive, and effortless, the kind of thinking that lets an experienced engineer glance at code and sense something is wrong. System 2 thinking is slow, deliberate, and effortful, the kind required to prove that the code is actually buggy. Both systems have roles in technical decision-making, but they're suited to different types of problems. 
**System 1 excels when**: - You have deep experience with similar situations - Feedback loops are tight (you quickly learn if you were right) - The environment is stable and predictable - Stakes of individual decisions are low **System 2 is necessary when**: - The situation is novel or complex - Feedback is delayed or ambiguous - Multiple stakeholders with different priorities are involved - The decision is hard to reverse or high-stakes The problem: System 1 doesn't know when it's out of its depth. An engineer with ten years of web development experience may have strong intuitions about database choices, but those intuitions can mislead when applied to ML feature stores with fundamentally different access patterns. The confidence feels the same even when the calibration is off. Gary Klein's research on naturalistic decision-making found that experts in domains with tight feedback (firefighters, chess players, ICU nurses) developed remarkably accurate intuitions. But in domains with delayed or noisy feedback, like financial forecasting or long-term technology choices, expert intuition performed barely better than chance. **Implication for AI engineering**: Many AI decisions have delayed, noisy feedback. Was your choice of embedding model correct? You won't know for months, and even then, you can't easily run the counterfactual. This is precisely the environment where System 2 rigor matters most. ### The Paradox of Choice in Technical Systems Barry Schwartz documented how having more options often leads to worse decisions and less satisfaction. This manifests acutely in modern AI engineering. Want a vector database? Choose from Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector, and a dozen others. Need an inference server? Consider vLLM, TGI, Triton, Ray Serve, custom solutions. Every layer of the stack presents dozens of viable options. 
This abundance creates several dysfunctions:

**Analysis paralysis**: Teams spend weeks evaluating options that would all work adequately. The evaluation cost exceeds the cost of a suboptimal choice.

**Maximizing vs. satisficing**: Maximizers try to find the best option; satisficers find one that meets their criteria. Schwartz and colleagues found that satisficers decide faster and report more satisfaction and less regret, even when maximizers achieve objectively better results.

**The paradox of expertise**: More knowledge about a domain means awareness of more options, more edge cases, more ways things could go wrong. Senior engineers can become slower decision-makers than juniors because they see more complexity.

**Practical response**: Define decision criteria before evaluating options. Set time limits for evaluation. Accept that "good enough" often is. Document what "good enough" means so others don't revisit the decision unnecessarily.

::: {.callout-warning}
## Common Mistake: Evaluating Decisions Solely by Outcomes

**What people do:** The migration succeeded, so it was a good decision. The new architecture failed, so it was a bad decision. Promote people who made decisions that worked out; sideline those whose bets didn't pay off.

**Why it fails:** This confuses luck with skill. Good decisions can have bad outcomes due to factors outside your control. Bad decisions can succeed by luck. Over time, this creates a risk-averse culture where people avoid any decision that might turn out badly, even when expected value is positive.

**Fix:** Evaluate decisions by process quality: Were the right people involved? Were alternatives genuinely considered? Was reasoning sound given available information? Were uncertainties acknowledged? Document the reasoning so you can learn from decisions regardless of outcome.
:::

### Outcome vs. Process Quality

Annie Duke, in "Thinking in Bets," distinguishes between decision quality and outcome quality.
A good decision can have a bad outcome (you made the right call given available information, but got unlucky). A bad decision can have a good outcome (you made a reckless call, but got lucky).

This distinction matters enormously for organizational learning. If you judge decisions solely by outcomes:

- Teams become risk-averse, avoiding decisions that might turn out badly even when the expected value is positive
- Survivorship bias distorts learning (you only hear about successful bets, not failed ones)
- Post-hoc rationalization replaces honest assessment

Instead, evaluate decisions by process quality:

1. **Were the right people involved?** Did you consult those with relevant expertise and those affected by the decision?
2. **Were alternatives genuinely considered?** Or did you anchor on one option and rationalize it?
3. **Was the reasoning sound?** Given what was known at the time, did the logic hold?
4. **Were uncertainties acknowledged?** Did you identify what you didn't know?
5. **Was it documented?** Can others learn from this decision?

A well-documented decision that turned out poorly is valuable. It reveals what you didn't know or couldn't predict, enabling better decisions in the future. An undocumented decision that turned out well teaches nothing.

### Cognitive Biases in Technical Decisions

Several cognitive biases particularly afflict technical decision-making:

**Availability bias**: Overweighting information that comes to mind easily. "We had a bad experience with Vendor X three years ago" looms larger than systematic analysis, even if Vendor X has improved dramatically.

**Anchoring**: The first number or option mentioned disproportionately influences the final decision. If someone suggests "maybe 3 months to build this," subsequent estimates cluster around 3 months regardless of actual complexity.

::: {.callout-warning}
## Common Mistake: Anchoring on the First Option

**What people do:** Someone suggests "let's use Redis for this cache" in the first design meeting. All subsequent discussion compares alternatives to Redis, evaluating their disadvantages relative to Redis rather than to the actual requirements.

**Why it fails:** The first option becomes the anchor. You evaluate "reasons not to use Redis" rather than "what's the best solution for our requirements." Subtle flaws in the anchor option go unexamined because it's never held to the same scrutiny as alternatives.

**Fix:** Before discussing solutions, write down your requirements and decision criteria. Deliberately consider multiple options before committing to detailed evaluation of any single one. Assign someone to argue for an alternative to ensure the anchor doesn't dominate.
:::

**Confirmation bias**: Seeking information that supports your preferred option. Engineers often run benchmarks until they get results that confirm their intuition, then stop.

**Not-invented-here syndrome**: Preferring internal solutions over external ones. This bias is particularly strong among skilled engineers who correctly believe they could build something, but incorrectly conclude they should.

**Sunk cost fallacy**: Continuing with a failing approach because of past investment. "We've already spent six months on this migration" becomes a reason to continue rather than a write-off.

**Herding**: Following what prestigious companies do. "Google uses this" becomes justification, even when your context differs fundamentally from Google's.

**Mitigation strategies**:

- Assign a "devil's advocate" to argue against the favored option
- Require writing down your prediction before gathering data
- Explicitly consider "What would have to be true for the other option to be better?"
- Bring in outside perspectives who lack context (and therefore lack your team's biases)

---

## Architecture Decision Records

### The Case for Documenting Decisions

Every codebase accumulates decisions. Most remain invisible: implicit in the structure of the code, the choice of libraries, the shape of data models. When the original decision-makers leave, understanding leaves with them.

Consider a concrete scenario. A new engineer asks: "Why do we use Redis for the session cache instead of our standard DynamoDB?"

Without documentation, the investigation begins:

- Slack search turns up a thread from 2022, but key context is missing
- The engineer who made the choice left last year
- Someone thinks it was about latency, but isn't sure
- The team debates whether to migrate, unsure if original constraints still apply

Two weeks pass. Eventually, someone finds an old design doc mentioning "99th percentile latency requirements for the recommendation engine require sub-2ms cache reads." DynamoDB's typical latency is 5-10ms. Mystery solved, but at significant cost.

With an ADR, this investigation takes five minutes. The document explains the constraint, the analysis, and critically, when to revisit the decision ("if we move to a region where DynamoDB DAX is available" or "if our latency budget increases").

Michael Nygard, who introduced ADRs, describes them as "capturing the context in which a decision was made." The context almost always matters more than the decision itself. The same decision might be brilliant in one context and disastrous in another.

### ADR Structure and Best Practices

An effective ADR answers five questions:

1. **What is the context?** What situation are we in? What constraints apply?
2. **What is the decision?** What are we choosing to do?
3. **Why this decision?** What reasoning leads to this choice?
4. **What alternatives were considered?** What else could we have done?
5. **What are the consequences?** What follows from this decision?

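One way to keep those five questions from being skipped is to treat them as required fields. The following is a minimal sketch, not a standard ADR schema: the `DecisionRecord` class, its field names, and the rendered layout are all illustrative assumptions, and the sample values are borrowed from the session-cache scenario above.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical container for the five ADR questions (not a standard schema)."""

    number: int
    title: str
    context: str = ""      # 1. What situation are we in? What constraints apply?
    decision: str = ""     # 2. What are we choosing to do?
    rationale: str = ""    # 3. What reasoning leads to this choice?
    alternatives: list[str] = field(default_factory=list)  # 4. What else could we do?
    consequences: list[str] = field(default_factory=list)  # 5. What follows from it?
    status: str = "Proposed"

    def missing_sections(self) -> list[str]:
        """Names of the five sections that are still empty."""
        sections = {
            "context": self.context,
            "decision": self.decision,
            "rationale": self.rationale,
            "alternatives": self.alternatives,
            "consequences": self.consequences,
        }
        return [name for name, value in sections.items() if not value]

    def to_markdown(self) -> str:
        """Render the record as markdown; refuse to render an incomplete one."""
        missing = self.missing_sections()
        if missing:
            raise ValueError(f"ADR-{self.number:03d} is missing: {', '.join(missing)}")
        parts = [
            f"**ADR-{self.number:03d}: {self.title}**",
            f"**Status**: {self.status}",
            f"**Context**\n\n{self.context}",
            f"**Decision**\n\n{self.decision}",
            f"**Rationale**\n\n{self.rationale}",
            "**Alternatives Considered**\n\n"
            + "\n".join(f"- {a}" for a in self.alternatives),
            "**Consequences**\n\n"
            + "\n".join(f"- {c}" for c in self.consequences),
        ]
        return "\n\n".join(parts)


# Illustrative draft: two of the five questions answered so far.
draft = DecisionRecord(
    number=1,
    title="Session Cache Store",
    context="The recommendation engine requires sub-2ms p99 cache reads.",
    decision="Use Redis for the session cache instead of DynamoDB.",
)
print(draft.missing_sections())  # ['rationale', 'alternatives', 'consequences']
```

Refusing to render an incomplete record is a deliberate choice in this sketch: it makes "no alternatives listed" a visible failure rather than a silent omission.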
Here's a real ADR from an AI startup's production system:

---

**ADR-017: Embedding Model Selection for Document Search**

**Status**: Accepted

**Date**: 2025-03-15

**Context**

We're building a document search feature for our enterprise product. Users will search across thousands of internal documents (contracts, policies, meeting notes) using natural language queries.

Requirements:

- Query latency under 200ms for 95th percentile
- Support for 500,000 documents initially, scaling to 5M
- Must handle domain-specific terminology (legal, financial)
- Budget constraint: inference costs under $10K/month at full scale

Current infrastructure: AWS-based, using SageMaker for existing ML workloads.

**Decision**

We will use BGE-large-en-v1.5 (1024 dimensions) deployed on SageMaker Inference endpoints with auto-scaling.

**Rationale**

We evaluated five embedding models against our requirements:

| Model | Dimensions | Our Benchmark NDCG@10 | Latency (p95) | Monthly Cost at Scale |
|-------|------------|----------------------|---------------|----------------------|
| BGE-large-en | 1024 | 0.72 | 45ms | $4,200 |
| E5-large-v2 | 1024 | 0.71 | 48ms | $4,200 |
| text-embedding-3-large | 3072 | 0.74 | 85ms | $18,000 |
| GTE-large | 1024 | 0.70 | 43ms | $4,200 |
| Cohere embed-v3 | 1024 | 0.73 | 120ms | $12,000 |

BGE-large-en slightly outperformed other open-source options on our internal benchmark dataset (200 queries with relevance judgments from legal team). While OpenAI's model scored 2 points higher, the 4x cost difference and increased latency don't justify the quality gain for this use case.

Deploying on SageMaker aligns with our existing infrastructure and avoids introducing new operational complexity.

**Alternatives Considered**

*OpenAI text-embedding-3-large*: Superior quality, but 4x cost and external API dependency introduces latency variability and compliance considerations for enterprise customers.

*Fine-tuned BGE*: Could improve domain-specific performance, but adds 2-3 weeks of work and ongoing maintenance burden. We'll revisit if quality metrics don't meet targets after launch.

*Hybrid approach (sparse + dense)*: Would improve exact-match cases, but adds pipeline complexity. Our benchmark showed only 3% of queries benefit significantly from sparse matching.

**Consequences**

- Positive: Cost-effective, predictable latency, no external API dependencies
- Positive: Team already familiar with SageMaker deployment patterns
- Negative: May need fine-tuning if domain-specific performance is insufficient
- Negative: Locked to model versions we deploy (no automatic improvements)

**Revisit When**

- Quality metrics (measured monthly) fall below 0.65 NDCG@10
- A significantly better open-source model emerges (>10% improvement)
- We expand to non-English markets (multilingual models needed)
- OpenAI or similar significantly reduces pricing (below 2x our current cost)

---

Notice several qualities of this ADR:

**Specific context**: Not "we need embeddings" but specific requirements with numbers.

**Actual data**: Benchmarks run on their data, not marketing claims.

**Honest tradeoffs**: Acknowledges what they're giving up.

**Clear triggers for revisiting**: Future engineers know when to reconsider.

### When to Write an ADR

Not every decision needs formal documentation. ADRs have a cost: time to write, time to maintain, clutter if overused. Use them when decisions are:

**Significant**: Affects multiple teams, systems, or substantial resources. Choosing a logging library probably doesn't need an ADR; choosing an observability platform does.

**Non-obvious**: Reasonable engineers might disagree. If the right choice is clear, document it lightly in code comments or team wiki.

**Hard to reverse**: Changing would be expensive or disruptive. Database choice, API contracts, data formats merit ADRs.

**Frequently questioned**: If people keep asking "why do we do X?", write an ADR to answer them once.

A useful heuristic: if you're having the third meeting about a decision, it needs an ADR.

**What doesn't need an ADR**:

- Implementation details that don't affect interfaces
- Choices that are easy to change later
- Decisions mandated by company standards
- Temporary workarounds with clear timelines

### ADR Anti-Patterns

**The novel**: Multi-page ADRs that nobody reads. If your ADR exceeds two pages, you're writing a design doc, not a decision record.

**The vague**: "We chose X because it's better." Better how? For what use case? Compared to what?

**The unanimous**: ADRs that present no alternatives or pretend there was no disagreement. Real decisions have tradeoffs; acknowledge them.

**The orphan**: ADRs written and forgotten. Review them periodically. Supersede them when contexts change.

**The retcon**: ADRs written after the fact to justify decisions already made for other reasons. These are useless for learning.

---

## RFC Processes

### When ADRs Aren't Enough

ADRs document decisions made by an individual or small group. Some decisions require broader input: they affect multiple teams, require expertise distributed across the organization, or need buy-in from stakeholders who'll bear the consequences.

Request for Comments (RFC) processes structure this broader collaboration. The term comes from the Internet standards world: the RFC series, now closely associated with the Internet Engineering Task Force, defined the Internet's core protocols. The format has been adopted widely in tech companies and open-source projects: Rust's RFC process has shaped the language, Kubernetes uses a similar mechanism (Kubernetes Enhancement Proposals) for major features, and many companies have internal RFC systems.

The key distinction:

**ADR**: "Here is the decision we made and why"

**RFC**: "Here is a proposal; please help us improve it and decide together"

Use RFCs when:

- Multiple teams are affected and should have input
- The decision requires expertise you don't have
- You need to build consensus before implementation
- The scope is large enough that implementation is expensive to redo

### Anatomy of an Effective RFC

A good RFC enables asynchronous collaboration. Readers should be able to understand, evaluate, and comment without requiring meetings.

**Title and metadata**: Clear title, author, status, key dates.

**Summary**: One paragraph covering the problem and proposed solution. Busy readers should be able to decide if they need to read more.

**Motivation**: Why is this change needed? What problem does it solve? What happens if we do nothing? This section should create urgency and alignment on the problem before discussing solutions.

**Detailed design**: How will this work? Be specific enough that readers can evaluate feasibility, but not so detailed that implementation is constrained unnecessarily.

**Alternatives considered**: What other approaches were evaluated? Why were they rejected? This demonstrates rigor and preempts "have you considered X?" comments.

**Drawbacks**: What are the risks and downsides? Acknowledging these builds trust and surfaces concerns that might otherwise emerge in discussion.

**Adoption strategy**: How will this be rolled out? What migration is required? What's the timeline?

**Unresolved questions**: What do you still need input on? This focuses reviewer attention.

### Real RFC Example: Model Serving Migration

Here's a condensed version of an actual RFC from an AI platform team:

---

**RFC: Migrate Model Serving from TGI to vLLM**

**Author**: Platform Team

**Status**: Under Review

**Created**: 2025-09-10

**Summary**

This RFC proposes migrating our LLM serving infrastructure from Text Generation Inference (TGI) to vLLM.
The migration would cut serving costs by more than half (about $1.2M of our $2.1M monthly spend) through improved batching and increase throughput by 2.5x based on our benchmarks. The migration would occur over three months, with a two-week parallel-running period for validation.

**Motivation**

Our current TGI deployment serves 15M requests per day across 120 A100 GPUs. Costs are $2.1M per month. Traffic is projected to triple over the next year, which would require significant additional infrastructure.

Our benchmarks show vLLM achieving 2.5x higher throughput for our workload (mixed batch sizes, 512-2048 token outputs). This would reduce our GPU fleet to approximately 50 GPUs at current traffic, saving $1.2M monthly.

Beyond cost, vLLM's PagedAttention enables us to serve longer contexts without OOM issues that currently require request rejection.

**Detailed Design**

*Phase 1 (Weeks 1-4)*: Deploy vLLM alongside TGI. Route 1% of traffic to vLLM. Validate quality parity using our existing eval suite.

*Phase 2 (Weeks 5-8)*: Gradually increase vLLM traffic to 50%. Monitor latency percentiles, error rates, and output quality scores. Address any issues.

*Phase 3 (Weeks 9-12)*: Complete migration. Maintain TGI deployment as cold standby for two weeks. Decommission after validation.

API contracts remain unchanged. Client teams require no code changes.

**Alternatives Considered**

*Optimize TGI configuration*: We spent two weeks tuning TGI parameters. Best case improvement was 15%, insufficient for our growth projections.

*Custom serving solution*: Higher ceiling for optimization but 6+ month build time and ongoing maintenance burden. vLLM is well-maintained with an active community.

*Managed service (e.g., Anyscale Endpoints)*: Would reduce operational burden but increase per-request costs at our scale. Makes sense below 1M requests/day.

**Drawbacks**

- Migration risk: Any serving change risks production incidents
- Team learning curve: Platform team needs to develop vLLM expertise
- Feature parity gaps: vLLM lacks some TGI features (specific quantization methods we don't currently use)
- Vendor dependency shift: Moving from a Hugging Face-maintained project to a community-maintained one

**Unresolved Questions**

1. Should we maintain an internal fork for custom optimizations, or contribute upstream?
2. How long should we maintain TGI as fallback?
3. Are there features we use in TGI that need porting?

---

### Managing the RFC Process

An RFC without a process is just a document. Effective RFC processes include:

**Clear ownership**: Someone drives the RFC through the process, incorporating feedback, scheduling discussions, and pushing toward resolution.

**Defined timelines**: Open-ended review leads to decision paralysis. Set expectations: "Comments due by Friday; decision meeting Monday."

**Structured feedback**: Distinguish between blocking concerns, suggestions, and questions. Not all feedback is equal.

**Decision authority**: Who actually decides? The RFC process gathers input, but someone must have authority to make the call when consensus can't be reached.

**Outcomes**: RFCs should resolve to accepted, rejected, or withdrawn. "Abandoned" RFCs signal process failure.

### Synchronous vs. Asynchronous Decisions

Not every decision needs a meeting, but some decisions need real-time discussion.

**Favor asynchronous (written RFC/ADR) when**:

- The decision benefits from careful thought
- Stakeholders need time to research or consult others
- Multiple time zones make meetings difficult
- You need a written record

**Favor synchronous (meeting) when**:

- Significant disagreement needs real-time resolution
- Quick back-and-forth would clarify misunderstandings
- The decision is time-sensitive
- Building relationship and trust matters (controversial changes)

**Hybrid approach**: Write the RFC asynchronously, meet to discuss key concerns, finalize asynchronously. This captures benefits of both: thoughtful preparation, efficient discussion, documented outcome.

---

## Tradeoff Analysis Methods

### Beyond Pros and Cons Lists

Simple pros/cons lists fail because they treat all factors as equal. "Faster development time" and "runs on our existing infrastructure" might both be "pros," but they're not equally important.

Structured tradeoff analysis makes weights explicit, forcing clarity about what matters.

### Weighted Decision Matrices

A weighted decision matrix assigns importance weights to criteria, then scores options against each criterion. The weighted sum produces a ranking.

**Process**:

1. **Define criteria**: What factors matter for this decision? Be specific.
2. **Assign weights**: Distribute 100 points across criteria. This forces prioritization. If everything is equally important, you haven't thought hard enough.
3. **Score options**: Rate each option 1-10 on each criterion. Use evidence where possible.
4. **Calculate weighted scores**: Multiply scores by weights, sum for each option.
5. **Sensitivity analysis**: How much do weights need to change for the ranking to flip?

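Steps 4 and 5 are mechanical enough to script. The sketch below is one possible implementation, not a prescribed tool: the function names, the 5-point shift used for the sensitivity check, and the two options with their scores are all made up for illustration.

```python
def weighted_scores(weights, scores):
    """Return {option: weighted average} on the same 1-10 scale as the scores.

    weights: {criterion: points}; the points must sum to 100.
    scores:  {option: {criterion: score}}, where higher is always better.
    """
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights must sum to 100, got {total}")
    return {
        option: sum(weights[c] * by_criterion[c] for c in weights) / total
        for option, by_criterion in scores.items()
    }


def ranking_flips(weights, scores, step=5):
    """Move `step` points between each ordered pair of criteria and list the
    shifts that change the top-ranked option."""

    def winner(w):
        result = weighted_scores(w, scores)
        return max(result, key=result.get)

    baseline = winner(weights)
    flips = []
    for src in weights:
        for dst in weights:
            if src == dst or weights[src] < step:
                continue
            shifted = dict(weights)
            shifted[src] -= step
            shifted[dst] += step
            new_winner = winner(shifted)
            if new_winner != baseline:
                flips.append((src, dst, new_winner))
    return baseline, flips


# Hypothetical inputs for illustration only.
weights = {"latency": 40, "cost": 35, "ops_simplicity": 25}
scores = {
    "managed": {"latency": 6, "cost": 5, "ops_simplicity": 9},
    "self_hosted": {"latency": 8, "cost": 6, "ops_simplicity": 5},
}

print(weighted_scores(weights, scores))
baseline, flips = ranking_flips(weights, scores)
print(baseline, flips)
```

With these invented numbers, `self_hosted` leads 6.55 to 6.40, and the only shifts that flip the ranking are the two that move weight toward `ops_simplicity`. A short flip list like that tells you which weight deserves the argument.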
**Example: Choosing a Feature Store**

| Criterion | Weight | Feast | Tecton | Databricks FS |
|-----------|--------|-------|--------|---------------|
| Query latency | 25 | 7 | 9 | 6 |
| Integration with existing stack | 20 | 8 | 6 | 9 |
| Operational complexity | 20 | 6 | 8 | 7 |
| Cost (3-year TCO) | 15 | 9 | 5 | 6 |
| Feature richness | 10 | 6 | 9 | 8 |
| Team expertise | 10 | 5 | 4 | 8 |
| **Weighted Score** | | **7.00** | **7.10** | **7.20** |

The scores are within 0.2 points of each other, revealing that this is a close call. Sensitivity analysis shows how fragile the ranking is: Databricks leads narrowly on these weights, but moving 5 points of weight from "Integration" to "Latency" puts Tecton ahead (7.25 vs. 7.05 for Databricks), while moving 5 points the other way extends Databricks' lead (7.35).

This analysis doesn't make the decision for you. It clarifies what you're trading off and surfaces the values embedded in your choice.

### Qualitative Factors That Matrices Miss

Numbers don't capture everything. After quantitative analysis, consider:

**Team factors**: Do you have people who know this technology? Will the choice help with hiring (exciting tech) or hurt (nobody wants to work on it)?

**Strategic alignment**: Does this choice support or conflict with company direction? A tactically optimal choice might be strategically wrong.

**Relationship dynamics**: Do you have good relationships with the vendor? Does the open-source project have responsive maintainers?

**Gut check**: After all the analysis, does the answer feel right? Intuition is data too, just poorly calibrated. A strong negative gut reaction warrants investigation even if the numbers look good.

### Cost of Delay Analysis

Many decisions are framed as "Option A vs. Option B" when the real question is "Decide now vs. decide later."
Cost of Delay (CoD) analysis quantifies what you lose by waiting:

- Revenue not earned while deciding
- Customers lost to competitors who shipped first
- Engineers blocked waiting for direction
- Technical debt accumulating while you deliberate

If delaying a decision by one week costs $50K in blocked engineering time, spending two weeks evaluating options that differ by $10K in annual costs is irrational.

Conversely, if a decision is irreversible and gets more expensive to change over time, rushing to save a week of delay may cost you years of suboptimal architecture.

The framework: estimate the cost of delay, estimate the value of additional information from waiting, compare.

---

## Build vs. Buy Decisions

### The Fundamental Economics

Build vs. buy is ultimately an economic question, but the economics are subtle.

**Building** has:

- High upfront cost (engineering time)
- Ongoing maintenance cost (bugs, updates, on-call)
- Opportunity cost (engineers not working on core product)
- Full control and customization
- No external dependencies

**Buying** has:

- Lower upfront cost (integration time)
- Ongoing subscription/usage cost
- Vendor risk (price increases, deprecation, outages)
- Limited customization
- Faster time to market

The crossover analysis: at what scale does building become cheaper?

### The Build vs. Buy Curve

The total cost of build versus buy options follows predictable patterns:

![Build vs Buy Cost Curves](../assets/diagrams/rendered/ch23_build_vs_buy_curve.svg)

In early stages, buying is cheaper. There's a crossover point where building becomes cheaper due to lower marginal costs.
But the crossover point is often further out than optimistic engineers estimate, because they undercount:

- Time to reach feature parity with the bought solution
- Ongoing maintenance and security updates
- On-call burden and incident response
- Documentation and onboarding
- Opportunity cost of what else those engineers could build

A common pattern: "We can build this in 3 months" becomes 9 months. "One engineer can maintain it" becomes a team. Meanwhile, the bought solution improved, and the company's needs evolved.

### AI-Specific Build vs. Buy Considerations

AI systems add unique dimensions to build vs. buy analysis:

**Model quality trajectory**: If you build, your model's quality is fixed until you improve it. If you buy API access, the vendor continuously improves (usually for free). A model that's competitive today might be outdated in six months.

**Inference economics**: API pricing (per token) creates linear cost scaling. Self-hosting has high fixed costs but lower marginal costs. As a rough order of magnitude, the crossover for a frontier-class workload sits somewhere in the low millions of tokens per day, but treat that figure as illustrative, not a constant. It depends entirely on the assumptions baked in: model size (a 70B model is far cheaper to self-host than a frontier-scale model), GPU type and price, and, critically, sustained utilization (the break-even roughly doubles if your GPUs sit at 25% instead of 50%). Note also that "tokens per day" and "requests per month" are different units: Chapter 32 derives the same decision on a *requests/month* basis and shows the break-even swings ~5x with output-length distribution. Use that model with your own measured numbers rather than memorizing a single crossover figure. Below the crossover, APIs are cheaper; above it, self-hosting wins, provided you can match capability.

**Capability uncertainty**: We often don't know what AI can do until we try. Building custom solutions requires upfront commitment.
APIs allow rapid experimentation before committing.

**Data privacy**: Building keeps data internal. APIs may send data to third parties, creating compliance and competitive risks.

**Customization requirements**: Fine-tuning requires building (or using vendors who support it). If your use case needs customization, pure API approaches may be insufficient.

### Build vs. Buy Decision Framework

When evaluating build vs. buy, ask these questions in order:

**1. Is this core to our competitive advantage?**

If yes, lean toward building. Your competitive moat shouldn't depend on a vendor's roadmap. If no, lean toward buying; you shouldn't build commodity capabilities.

Test: Would customers choose us over competitors because of this capability? If not, it's probably not core.

**2. Do good solutions exist?**

If the market offers solutions that meet your needs, building requires strong justification. If nothing adequate exists, you may have no choice but to build.

Test: Could we achieve 80% of our requirements with an existing solution? If yes, the 20% gap rarely justifies building from scratch.

**3. What's our timeline?**

Building takes months to years. Buying takes weeks to months. If you need something in weeks, build isn't an option regardless of long-term economics.

Test: What's the cost of shipping three months later? If it's significant, buying gets more attractive.

**4. Do we have the expertise?**

Building requires deep expertise. If you don't have it, you'll hire or develop it, both of which take time and carry risk.

Test: Have we built something similar before? Do we have people who've done this at a previous company?

**5. What's the total cost of ownership?**

Include all costs: development, maintenance, on-call, opportunity cost, risk of failure. Compare to all costs of buying: subscription, integration, vendor management, switching costs.

Test: Be honest about maintenance. Most teams underestimate it by 2-3x.

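The total-cost comparison in question 5 can be run as simple arithmetic before anyone opens a spreadsheet. The following is a deliberately crude sketch under stated assumptions: costs are linear in time, there is no discounting or usage growth, and every dollar figure is hypothetical. The function and parameter names are illustrative, not taken from any library.

```python
def cumulative_costs(months, build_upfront, build_monthly, buy_upfront, buy_monthly):
    """Cumulative spend at the end of each month (1-indexed) for both paths."""
    build = [build_upfront + build_monthly * m for m in range(1, months + 1)]
    buy = [buy_upfront + buy_monthly * m for m in range(1, months + 1)]
    return build, buy


def break_even_month(months, **costs):
    """First month in which building is cumulatively cheaper than buying,
    or None if that never happens within the horizon."""
    build, buy = cumulative_costs(months, **costs)
    for month, (build_total, buy_total) in enumerate(zip(build, buy), start=1):
        if build_total < buy_total:
            return month
    return None


# Hypothetical figures: build costs more upfront but less per month.
optimistic = dict(
    build_upfront=600_000,  # initial engineering effort
    build_monthly=20_000,   # maintenance and on-call, as first estimated
    buy_upfront=50_000,     # integration work
    buy_monthly=45_000,     # subscription and usage fees
)
print(break_even_month(60, **optimistic))  # 23

# Same inputs with maintenance underestimated by 2x.
realistic = {**optimistic, "build_monthly": 40_000}
print(break_even_month(60, **realistic))  # None: no break-even within 5 years
```

In this toy case, doubling the maintenance estimate moves the break-even from under two years to beyond the five-year horizon, which is why the maintenance line deserves the most scrutiny.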
### Hidden Costs on Both Sides

**Hidden build costs**:

- Edge cases that simple prototypes don't reveal
- Security hardening and compliance
- Documentation and knowledge transfer
- On-call burden and incident response
- Opportunity cost of not shipping other features

**Hidden buy costs**:

- Integration complexity (APIs change, data formats differ)
- Vendor lock-in (what does switching cost?)
- Compliance work (security reviews, data processing agreements)
- Support interactions (tickets, waiting for fixes)
- Price increases (especially after you're dependent)

### The Hybrid Path

Often the best answer is neither pure build nor pure buy:

**Buy, then build**: Start with a bought solution to validate the use case and learn the domain. Build when you've proven value and understand requirements deeply.

**Build the wrapper, buy the capability**: Use APIs for core functionality but build the orchestration, monitoring, and integration layer. This preserves optionality while reducing build scope.

**Buy the commodity, build the differentiation**: Use standard components for common functionality. Build only the parts that create competitive advantage.

---

## Reversibility and Decision Speed

### One-Way vs. Two-Way Doors

Jeff Bezos popularized the distinction between one-way and two-way door decisions:

**One-way doors** (Type 1): Irreversible or very costly to reverse. Once through, you can't easily go back. These merit careful analysis, broad input, and deliberate process.

**Two-way doors** (Type 2): Easily reversible. If you don't like where you end up, you can return and try again. These should be made quickly by individuals or small teams.

The failure mode for most organizations is treating two-way doors as one-way doors, applying heavyweight process to decisions that could easily be undone. This slows everything down and burns goodwill.

Examples in AI engineering:

**One-way doors**:

- Public API contracts (breaking changes harm customers)
- Major vendor commitments (multi-year contracts)
- Data format choices (migration is expensive)
- Production model architecture (retraining takes months)

**Two-way doors**:

- Internal service implementation details
- Prompt engineering approaches
- Development tool choices
- Experiment configurations

### Preserving Optionality

Given uncertainty about the future, options have value. A decision that preserves future options is worth more than one that forecloses them, all else being equal.

**Abstraction layers**: Wrapping external services in internal interfaces preserves the option to switch providers. The cost is complexity; the benefit is flexibility.

**Incremental commitments**: Signing a month-to-month contract preserves options vs. a three-year commitment. The cost may be a higher monthly rate; the benefit is flexibility.

**Keeping data**: Storing raw data preserves the option to reprocess it differently later. The cost is storage; the benefit is not being locked into current assumptions.

**Modularity**: Building systems from separable components preserves the option to replace individual pieces. The cost is interface overhead; the benefit is evolutionary capability.

But optionality isn't free. Over-indexing on flexibility leads to:

- Excessive abstraction that nobody uses
- Analysis paralysis ("we might need X someday")
- Slower development from unnecessary indirection

The skill is knowing which options are valuable. Options are most valuable when:

- The future is highly uncertain
- The cost of the option is low
- The cost of being wrong without the option is high

### Decision Velocity as Competitive Advantage

Organizations that make decisions faster have an advantage. They learn faster (more decisions = more feedback), adapt faster, and ship faster.

Andy Grove observed that "only the paranoid survive," but paranoia that paralyzes is worse than overconfidence that ships. The goal is calibrated decision-making: appropriate speed for the stakes.

**Practices that increase decision velocity**:

- Default to two-way door assumptions unless proven otherwise
- Establish clear decision rights (who decides what)
- Set time limits for decisions (no open-ended deliberation)
- Prefer "disagree and commit" over consensus
- Create safe environments for reversing decisions

**Warning signs of slow decision-making**:

- Multiple meetings without decisions
- Decisions requiring executive approval that shouldn't need it
- Fear of being wrong overwhelming desire to make progress
- Perfect being the enemy of good

---

## Case Studies: Decisions That Aged Well and Poorly

### Case Study 1: The Embedding Model Bet (Aged Well)

**Situation**: In early 2023, a search startup needed to choose embedding models for their semantic search product. OpenAI's ada-002 was dominant. The team debated building on OpenAI vs. using open-source alternatives.

**Decision**: They chose to build on sentence-transformers with self-hosted inference, accepting lower initial quality for control and economics.

**Rationale documented in ADR**:

- OpenAI pricing created unit economics problems at scale
- Dependency on a single vendor for core product capability was risky
- Open-source quality was "good enough" and improving rapidly
- Self-hosting enabled customization and fine-tuning

**Outcome**: By late 2023, open-source embedding models (BGE, E5) matched or exceeded ada-002 quality. The company's early investment in self-hosting infrastructure became a competitive advantage. Competitors locked into OpenAI faced difficult migrations when they needed customization.

**Why it worked**: The team correctly predicted that open-source models would improve quickly enough that the remaining quality gap would not justify premium pricing. They accepted short-term pain for long-term positioning.

### Case Study 2: The Microservices Migration (Aged Poorly)

**Situation**: A mid-sized AI company decided to break their monolithic ML platform into microservices, reasoning that industry best practice was microservices architecture.

**Decision**: 18-month migration to decompose the monolith into 15 services.

**Rationale (insufficiently documented)**:

- "Microservices are the standard for modern systems"
- "This will improve developer velocity"
- "We'll be able to scale services independently"

**Outcome**: Two years later, the migration was still incomplete. Developer velocity had decreased due to distributed system complexity. Debugging ML pipelines became a nightmare. The team had neither monolith simplicity nor microservices benefits.

**What went wrong**:

- Copied industry pattern without understanding context (Netflix has thousands of engineers; they had 50)
- Didn't define success metrics upfront (how would they know if velocity improved?)
- Underestimated operational complexity of distributed ML systems
- No revisit criteria to recognize the migration was failing

**Lesson**: The ADR should have included: "We believe microservices will improve velocity, measured by deployment frequency and lead time. If these metrics don't improve by month 9, we'll reassess."

### Case Study 3: The Vendor Lock-in (Mixed Outcome)

**Situation**: An enterprise AI team needed to choose between AWS SageMaker (deep lock-in, deep integration) and cloud-agnostic tooling (more work, more flexibility).

**Decision**: Chose SageMaker despite lock-in concerns, reasoning that AWS was their strategic cloud provider and integration benefits were significant.

**Rationale in ADR**:

- Team already expert in AWS; learning curve would be minimal
- Feature velocity from AWS integration outweighed flexibility concerns
- Multi-cloud was not a business requirement
- Cost analysis showed SageMaker competitive with alternatives

**Outcome**: Two years later, the company was acquired by a firm using GCP.
Migration out of SageMaker consumed six months of the ML platform team's time.

**But was it the wrong decision?** In those two years, the SageMaker choice enabled faster development that contributed to the company's value and acquisition. The migration cost was real but manageable.

**Lesson**: The decision was defensible given available information. What was missing: "If our cloud strategy changes (acquisition, cost pressure to go multi-cloud), migration cost is estimated at X months. We accept this risk because [reasons]."

---

## Building Organizational Decision-Making Capability

### Creating a Decision-Making Culture

Individual decision frameworks matter, but organizational culture determines whether they're used. Key elements of a healthy decision culture:

**Psychological safety for decisions**: People must feel safe making calls that might turn out wrong. If bad outcomes are punished regardless of decision quality, people will avoid decisions.

**Blameless postmortems**: When decisions lead to incidents, focus on process improvement, not blame. "What did we learn?" not "Whose fault was it?"

**Celebrating good process**: Recognize and promote people who make well-reasoned decisions, even when outcomes are mixed. This signals that the organization values decision quality.

**Transparency about uncertainty**: Leaders who say "I don't know, here's how we'll find out" create permission for others to acknowledge uncertainty.

### Decision Rights and Escalation

Ambiguity about who decides causes delay and conflict. Make decision rights explicit:

**DACI framework**:

- **Driver**: Owns driving the decision to resolution
- **Approver**: Has final authority (ideally one person)
- **Contributors**: Provide input and expertise
- **Informed**: Notified of the decision

The most common failure: too many approvers. If five people must agree, you get lowest-common-denominator decisions and slow cycles. Prefer single approvers with clear accountability.
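
The DACI roles above can also be recorded as data alongside each decision, which makes the "too many approvers" failure easy to flag automatically. The following is a minimal sketch, not a standard tool: the `DecisionRights` class, the `check_decision_rights` function, and the role names in the example are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRights:
    """DACI roles for one decision (hypothetical structure for illustration)."""
    decision: str
    driver: str
    approvers: list[str]
    contributors: list[str] = field(default_factory=list)
    informed: list[str] = field(default_factory=list)


def check_decision_rights(rights: DecisionRights) -> list[str]:
    """Return warnings about the role assignment; an empty list means none found."""
    warnings = []
    if not rights.driver:
        warnings.append("No driver: nobody owns getting this to resolution.")
    if not rights.approvers:
        warnings.append("No approver: nobody has final authority.")
    elif len(rights.approvers) > 1:
        warnings.append(
            f"{len(rights.approvers)} approvers: prefer a single approver "
            "with clear accountability."
        )
    return warnings


# Example with made-up roles: three approvers triggers the warning.
example = DecisionRights(
    decision="Adopt a managed vector database",
    driver="staff-engineer",
    approvers=["cto", "vp-eng", "principal-engineer"],
    contributors=["search-team", "platform-team"],
    informed=["support", "sales-engineering"],
)

for warning in check_decision_rights(example):
    print(warning)
```

A check like this could run when an ADR or RFC is opened. The sketch only warns and never blocks, since some decisions may legitimately need more than one approver.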

**Escalation paths**: When should decisions escalate? When shouldn't they? Define this explicitly. "Architectural decisions affecting multiple teams escalate to Staff+ review. Implementation decisions within a team don't."

### Learning from Decisions

Most organizations fail to learn from past decisions. They make similar mistakes repeatedly, reinvent solutions to solved problems, and can't explain why they do things the way they do.

**Decision reviews**: Schedule reviews of major decisions 6-12 months after implementation. Compare expected outcomes to actual outcomes. Identify what you didn't anticipate.

**Decision archives**: Maintain searchable ADRs and RFCs. When facing new decisions, search for similar past decisions.

**Pattern extraction**: Look across decisions for recurring themes. Are you consistently underestimating build costs? Over-relying on certain vendors? Pattern recognition enables systematic improvement.

---

## Practical Exercises

### Exercise 1: Write an ADR

Choose a technical decision from your recent work. Write a complete ADR including:

1. Context with specific constraints and requirements
2. The decision with clear rationale
3. At least three alternatives considered with rejection reasons
4. Both positive and negative consequences
5. Specific criteria for when to revisit

Review it with someone who wasn't involved in the original decision. Can they understand the reasoning?

### Exercise 2: Decision Speed Calibration

For one week, track every technical decision you make or participate in:

1. What was the decision?
2. How long did it take?
3. Was it reversible or irreversible?
4. Was the time spent proportionate to the stakes?

At the end of the week, categorize decisions. Were you spending too much time on reversible decisions? Too little on irreversible ones?

### Exercise 3: Build vs. Buy Analysis

For a capability your team needs:

1. Estimate honest build cost (include 2x buffer for unknowns)
2. Identify three buy options with pricing
3. Calculate 3-year TCO for each option
4. Apply the framework: core to advantage? Solutions exist? Timeline? Expertise?
5. Document hidden costs for each path
6. Make and defend a recommendation

### Exercise 4: Postmortem a Decision

Find a past decision that turned out differently than expected:

1. What was the original decision and rationale?
2. What information was available then vs. now?
3. Was it a good decision with a bad outcome, or a bad decision?
4. What would have improved the process (not outcome)?
5. What would you do differently?

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [Senior]** What makes a decision "two-way door" vs "one-way door"? Give examples of each in AI engineering.

<details>
<summary>Answer</summary>

Two-way door (reversible): Easy to undo if wrong. Examples: Choosing a Python library, prompt template changes, feature flag experiments, model hyperparameters during development.

One-way door (irreversible): Hard/expensive to reverse. Examples: Database schema for production data, API contracts exposed to customers, choosing to build vs. buy a core capability, multi-year cloud commitments, publicly committing to AI features.

The distinction matters because two-way doors should be decided quickly (cost of reversal is low, cost of delay is high), while one-way doors merit careful analysis (reversing is expensive, so invest upfront).

Most decisions are more reversible than they seem. Ask: "What would it cost to reverse this in 6 months?"

</details>

**Q2. [Senior]** What should an Architecture Decision Record (ADR) contain? Why is each section important?

<details>
<summary>Answer</summary>

Essential sections: (1) Title and date: identification. (2) Status: proposed/accepted/deprecated/superseded. Shows lifecycle. (3) Context: what's the situation? Constraints, requirements, forces. Without context, the decision makes no sense later. (4) Decision: what was decided? Clear, concise statement. (5) Rationale: why this decision?
The reasoning, not just the choice. (6) Alternatives considered: what else was evaluated and why rejected? Shows due diligence, helps future readers. (7) Consequences: both positive and negative. Honest assessment. (8) Revisit conditions: when should this be reconsidered? Prevents outdated decisions from persisting.

Why important: Future teams (including future you) need to understand not just what was decided, but why. Without ADRs, you relitigate decisions endlessly or blindly follow outdated choices.

</details>

**Q3. [Staff]** How do you approach a build vs. buy decision for a critical capability? What hidden costs should you consider?

<details>
<summary>Answer</summary>

Framework: (1) Is this core to competitive advantage? If yes, lean build. (2) Do mature solutions exist? If good options exist, lean buy. (3) Timeline urgency? Buy is usually faster. (4) Do you have expertise? Building what you don't understand is expensive.

Hidden costs of building: (1) Opportunity cost: engineers not building product features. (2) Maintenance burden: you own it forever. (3) Unknown unknowns: projects almost always take longer than estimated. Apply a 2-3x multiplier. (4) Oncall, documentation, training.

Hidden costs of buying: (1) Vendor lock-in: switching costs grow over time. (2) Customization limitations: may not fit exactly. (3) Integration work: often underestimated. (4) Ongoing licensing: costs scale with usage.

Analysis: Calculate 3-year TCO for both, including hidden costs. Consider optionality: can you start with buy and migrate to build later?

</details>

**Q4. [Staff]** How do you drive a technical decision to closure when stakeholders have conflicting opinions?

<details>
<summary>Answer</summary>

Process: (1) Clarify the decision: often disagreements are about different interpretations of the question. (2) Identify shared constraints: what does everyone agree on? (3) Make values explicit: are people optimizing for different things (speed vs. quality, cost vs. flexibility)? Surface these.
(4) Use structured tradeoff analysis: weighted criteria, sensitivity analysis. Numbers reduce emotion. (5) Disagree and commit protocol: set a deadline. If no consensus, decision owner decides. Dissenters commit to supporting the decision. (6) Time-box: don't let decisions stall indefinitely. Set clear deadlines.

Key: The goal isn't consensus; it's a decision that people can support. Document dissenting views in the ADR, which validates concerns and provides context for future reconsideration.

</details>

**Q5. [Staff]** How do you learn from past technical decisions? What makes a decision retrospective effective?

<details>
<summary>Answer</summary>

Learning mechanisms: (1) Scheduled reviews: revisit major decisions 6-12 months post-implementation. Compare expected vs. actual outcomes. (2) Decision archives: searchable ADRs and RFCs. Before new decisions, search for similar past decisions. (3) Pattern extraction: across decisions, what themes emerge? Consistently underestimating build costs? Over-relying on certain approaches?

Effective retrospective: (1) Separate decision quality from outcome quality: good decisions can have bad outcomes (and vice versa). (2) Focus on what was knowable at decision time: hindsight bias is the enemy. (3) Identify information that would have changed the decision, and how to get it earlier next time. (4) Extract actionable improvements, not just "be more careful." Specific process changes. (5) Share learnings: organizational learning requires dissemination.

</details>

### Spot the Problem

**Problem 1. [Senior]** ADR excerpt:

```
Decision: We will use Redis for caching.
Rationale: Redis is good for caching.
```

<details>
<summary>Answer</summary>

Problems: (1) No context: what are the requirements? What constraints exist? (2) Circular rationale: "Redis is good for caching" doesn't explain why Redis over alternatives. (3) No alternatives: was Memcached considered? Local caching? CDN? Why Redis specifically? (4) No consequences: what does this enable?
What risks does it create?

Better: "Context: We need sub-10ms read latency for user preferences, supporting 10K req/s. Decision: Redis, because it meets latency requirements, we have operational experience, and it supports data structures we need for ranked recommendations. Alternatives rejected: Memcached (no sorted sets), local cache (cluster consistency), CDN (can't cache personalized data). Consequences: (+) Fast, flexible; (-) Additional infrastructure, memory cost."

</details>

**Problem 2. [Staff]** Decision-making behavior:

```
"This is an important decision, so let's get everyone's input. We'll schedule weekly meetings until we reach consensus."
```

<details>
<summary>Answer</summary>

Problems: (1) No deadline: meetings will continue indefinitely. Decisions stall. (2) Consensus requirement: consensus is often impossible. One blocker can hold everyone hostage. (3) Everyone's input isn't equally valuable: expertise matters. A junior engineer's opinion on architecture shouldn't carry the same weight as a principal engineer's. (4) Meeting-heavy process: could be async (RFC).

Better: (1) Set decision deadline upfront. (2) Use RACI: who decides (Accountable), who provides input (Consulted), who's informed. (3) RFC for async input with response deadline. (4) "Disagree and commit" if no consensus by deadline. (5) Proportionate effort: is this decision worth weeks of meetings?

</details>

**Problem 3. [Staff]** Build vs. buy analysis:

```
"Building our own vector database will take 2 engineers 3 months. Pinecone costs $5K/month. Build is obviously cheaper long-term."
```

<details>
<summary>Answer</summary>

Flawed analysis: (1) Underestimated build cost: 2 engineers × 3 months is 6 engineer-months, which is likely optimistic by 2-3x. Call it 12 engineer-months realistically. (2) No maintenance cost: who maintains it after? Oncall? Bug fixes? Upgrades? Add 0.5-1 engineer ongoing. (3) Opportunity cost: those 2 engineers could build product features. What's the revenue impact?
(4) Hidden build costs: documentation, testing, training, deployment infrastructure. (5) Pinecone provides more than storage: operational reliability, scaling, support. (6) Time to market: 3+ months delay has business cost.

Proper analysis: Build (12 eng-months @ $15K/month = $180K) + ongoing (0.5 FTE = $90K/year) + opportunity cost vs. Buy ($60K/year) + integration effort. Plus: Is a vector database your competitive advantage?

</details>

### Design Exercises

**Exercise 1. [Staff]** Write an ADR for this decision: Your team needs to choose between using a third-party LLM API (OpenAI/Anthropic) vs. self-hosting an open-source model. Consider: cost at scale, quality requirements, latency constraints, data privacy, and vendor dependency.

<details>
<summary>Guidance</summary>

ADR structure:

Context: What's your use case? Volume, quality bar, latency needs, data sensitivity. Constraints (budget, timeline, expertise).

Decision: Choose one with clear rationale linked to requirements. For example: "API for initial launch, migrate to self-hosted at 1M req/month."

Alternatives: (1) API only: pros (fast, quality), cons (cost, dependency, data leaves network). (2) Self-host only: pros (cost at scale, data control), cons (expertise needed, maintenance, slower quality improvements). (3) Hybrid: pros (flexibility), cons (complexity).

Consequences: Include both cost projections (3-year TCO) and operational implications.

Revisit criteria: "Reconsider when monthly volume exceeds X or if API costs exceed Y."

</details>

**Exercise 2. [Staff]** You're a Staff engineer asked to drive a decision on migrating from a monolith to microservices. Multiple teams have strong opinions. The VP wants a decision in 2 weeks. Design your process for driving this decision to closure.

<details>
<summary>Guidance</summary>

Process:

Week 1: (1) Day 1-2: Write RFC framing the decision, criteria for evaluation, and proposal. Distribute for async feedback. (2) Day 3-5: Gather written input.
Meet 1:1 with key stakeholders to understand concerns. (3) End of week 1: Synthesize feedback, update RFC with alternatives section addressing concerns.

Week 2: (4) Day 1-2: Schedule 90-minute decision meeting with essential stakeholders only (not everyone). Pre-read required. (5) Meeting: Don't rehash; assume everyone read the RFC. Focus on remaining disagreements. Use structured discussion (each alternative gets 10 min). (6) Day 3: Make decision (you or designated decision-maker). Document dissenting views. (7) Day 4-5: Finalize ADR, communicate decision broadly, define immediate next steps.

Key: Async for input, sync for deliberation, clear ownership for decision. Don't let desire for consensus block progress.

</details>

---

## Key Takeaways

1. **Document decisions with ADRs** - Architecture Decision Records capture context, rationale, alternatives, and consequences. Future teams and future you will thank you for the documentation.
2. **Use structured processes for collaboration** - RFCs enable broad input while maintaining momentum. Clear ownership and timelines prevent decisions from stalling indefinitely.
3. **Analyze tradeoffs systematically** - Use weighted criteria, not just intuition. Make your values explicit. Conduct sensitivity analysis to understand what factors matter most.
4. **Build costs are almost always underestimated** - Include hidden costs: maintenance, documentation, training, deployment infrastructure. Multiply build estimates by 2-3x for realistic planning.
5. **Match decision speed to stakes** - Two-way doors (reversible decisions) should be fast. One-way doors merit deliberation. The biggest mistake is treating everything as one-way.

---

## Summary

Technical decision-making is a skill, not a talent. It can be developed through deliberate practice and appropriate frameworks.

**Document decisions** with ADRs that capture context, rationale, alternatives, and consequences. Future teams will thank you. Future you will thank you.

**Use structured processes** for collaborative decisions. RFCs enable broad input while maintaining momentum. Clear ownership and timelines prevent decisions from stalling.

**Analyze tradeoffs systematically** with weighted criteria, not just intuition. Make your values explicit. Conduct sensitivity analysis to understand what matters most.

**Navigate build vs. buy** with clear economic thinking. Include hidden costs. Consider optionality. Remember that most organizations underestimate build costs.

**Match decision speed to stakes**. Two-way doors should be fast. One-way doors merit deliberation. The biggest mistake is treating everything as one-way.

**Build organizational capability** by creating psychological safety, clarifying decision rights, and learning from outcomes.

The goal isn't to make perfect decisions. Perfect decisions require perfect information, which doesn't exist. The goal is to make decisions that are good enough, fast enough, and documented well enough that you and your organization can learn and improve.

### Connections to Other Chapters

- **Chapter 23 (Technical Communication)** covers presenting and socializing decisions
- **Chapter 25 (System Design at Scale)** provides context for architecture decisions
- **Chapter 26 (Cross-Team Technical Leadership)** addresses building consensus across organizations
- **Chapter 32 (Cost Engineering)** informs economic analysis of decisions

---

## Further Reading

### Essential

- **Kahneman, "Thinking, Fast and Slow"** - Dual-process theory underlying decision science.
- **Klein, "Sources of Power"** - How experts actually decide in the field.
- **Tetlock, "Superforecasting"** - Calibrating confidence and improving prediction accuracy.

### Deep Dives

- **Duke, "Thinking in Bets"** - Separating decision quality from outcome quality.
- **Forsgren et al., "Accelerate"** - Decision-making practices that correlate with performance.
- **Nygard, "Documenting Architecture Decisions"** - Original ADR blog post. Short and practical.