Chapter 22: Project Ownership & Delivery

Keywords

project management, delivery, stakeholder management, risk mitigation, timelines, scope management

Introduction

In March 2020, a well-funded AI startup was racing to launch its flagship product: an automated customer service agent powered by state-of-the-art language models. The engineering team had built impressive technology. Their retrieval system achieved 94% recall on internal benchmarks. Their response generation scored highly on fluency metrics. Individual components worked beautifully in isolation.

But the product never launched.

The team had optimized for technical excellence without owning the outcome. Nobody ensured the retrieval system integrated properly with the generation pipeline. Nobody noticed that response latency exceeded user tolerance by 3x. Nobody coordinated with the trust and safety team, who flagged critical risks two weeks before launch. Nobody owned the project; everyone just owned their piece of it.

This pattern repeats across the industry. A 2019 VentureBeat report found that 87% of machine learning projects never make it to production. McKinsey’s 2022 analysis reported that only 22% of companies successfully scale AI beyond pilot projects. The technology often works. The projects fail anyway.

The difference between AI projects that succeed and those that fail is rarely the algorithm. It is ownership: someone taking responsibility not just for their component, but for the outcome. Someone who notices the gaps, coordinates across boundaries, and drives relentlessly toward delivery.

This chapter explores what ownership means for AI projects specifically. AI introduces unique challenges: uncertainty in model performance, difficulty estimating development time, stakeholder expectations shaped by science fiction rather than engineering reality. Understanding these challenges is the first step toward navigating them.

Why AI Projects Are Different

Traditional software projects have challenging but ultimately bounded uncertainty. When you build a REST API, you can reasonably estimate how long it will take. The requirements are specific, the path to implementation is clear, and success is verifiable: does the endpoint return the correct response?

AI projects shatter these assumptions.

Performance is probabilistic, not deterministic. A traditional system either works or it does not. An AI system works 85% of the time, or 92% on some inputs and 71% on others. Defining “good enough” requires judgment, and stakeholders often have unrealistic expectations.

Development timelines are highly uncertain. Improving model accuracy from 85% to 90% might take a week or might be impossible. You often cannot know until you try. Traditional estimation techniques, calibrated on deterministic engineering, systematically underestimate AI work.

Requirements evolve through experimentation. In traditional development, you can often specify requirements upfront. In AI, you discover what is possible through iteration. The specification emerges from the work rather than preceding it.

Failure modes are subtle and non-obvious. A traditional bug is usually reproducible. An AI failure might occur only on certain input distributions, or might correlate with factors you haven’t measured. Debugging requires statistical thinking, not just tracing code paths.

Stakeholder expectations are miscalibrated. Decades of science fiction have primed non-technical stakeholders to expect human-level AI. Managing these expectations requires constant recalibration.

These differences don’t make AI projects unmanageable. But they demand adapted approaches to scoping, estimation, execution, and delivery. The project management techniques developed for traditional software must be modified for the probabilistic, iterative, and uncertain nature of AI development.

The Research on Project Failure

Understanding why projects fail helps us avoid the same mistakes.

The Standish Group’s CHAOS reports have tracked software project outcomes for decades. Their 2020 report found that only 31% of projects succeed (delivered on time, on budget, with required features), 19% fail outright, and 50% are “challenged”: late, over budget, or missing capabilities. The primary factors distinguishing success from failure are not technical: they are executive support, user involvement, clear requirements, and realistic expectations.

For AI specifically, the failure modes are even more pronounced. A 2022 study by Gartner found that the average enterprise AI project takes 25% longer than estimated and costs 30% more than budgeted. A 2021 analysis by MIT Sloan found that organizations with successful AI deployments differed from unsuccessful ones primarily in process maturity and cross-functional collaboration, not technical sophistication.

Frederick Brooks, in his seminal 1975 work “The Mythical Man-Month,” identified a fundamental truth: “The bearing of a child takes nine months, no matter how many women are assigned.” Some work is not parallelizable. Some timelines are not compressible. Attempting to compress them creates dysfunction, not speed.

Brooks also observed that software projects fall into what he called the “tar pit”—the more you struggle, the deeper you sink. Without disciplined project management, complexity compounds, timelines slip, and morale collapses. This is especially true for AI, where the temptation to “just try one more approach” can extend projects indefinitely.

What You’ll Learn

  • How uncertainty fundamentally changes AI project management
  • Decision frameworks for scoping, estimation, and scope management
  • Patterns for driving execution through ambiguity
  • Cross-functional collaboration specific to AI projects
  • How to deliver results and measure impact
  • Case studies of AI projects that succeeded and failed, and why

Prerequisites

  • Experience shipping production software
  • Familiarity with machine learning development workflows
  • Understanding of AI system components (Part II of this book)
  • Basic project management concepts

Understanding AI Project Uncertainty

The Cone of Uncertainty

In traditional software engineering, the “Cone of Uncertainty” describes how project estimates narrow as work progresses. Barry Boehm introduced the concept in 1981 (as the “Funnel Curve”), and Steve McConnell later popularized it under its current name. At project inception, estimates might be off by 4x in either direction. By detailed design, the range narrows to roughly 1.25x or tighter. By coding, estimates are fairly accurate.

For AI projects, the cone is wider and closes more slowly.

At inception, you often cannot estimate meaningfully. “Build a model that classifies customer complaints” might take two weeks if the data is clean and the classes are separable, or six months if the data is noisy and the task is ambiguous. You cannot know which until you investigate.

After data exploration, some uncertainty resolves. You understand the data quality, label distribution, and apparent difficulty. But model performance remains unknown.

After initial modeling, you have baseline results. Now you can estimate improvement trajectories—but these remain probabilistic. Getting from 80% to 85% accuracy might be straightforward; getting from 85% to 90% might be impossible.

After iteration, the cone finally narrows. You have empirical evidence of what works and how quickly you can improve. Only now can you estimate reliably.

The implication: front-load exploration. Invest in spikes, proofs of concept, and rapid experimentation before committing to detailed schedules. Treat early estimates as hypotheses to be refined, not promises to be kept.
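To make the narrowing concrete, the sketch below turns a point estimate into a range using a per-stage multiplier. The 4x and 1.25x factors are the ones quoted above for traditional software; the factors for the AI stages are purely illustrative assumptions, chosen only to show a cone that starts wider and closes more slowly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EstimateRange:
    low: float
    high: float


def estimate_range(point_estimate_weeks: float, factor: float) -> EstimateRange:
    """Return the range [estimate / factor, estimate * factor]."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return EstimateRange(point_estimate_weeks / factor, point_estimate_weeks * factor)


# Traditional-software factors quoted in the text.
TRADITIONAL = {"inception": 4.0, "design complete": 1.25}

# Hypothetical factors for an AI project (assumptions, not measurements).
AI_ILLUSTRATIVE = {
    "inception": 6.0,
    "after data exploration": 4.0,
    "after initial modeling": 2.5,
    "after iteration": 1.5,
}

for stage, factor in AI_ILLUSTRATIVE.items():
    r = estimate_range(8, factor)
    print(f"{stage}: {r.low:.1f} to {r.high:.1f} weeks")
```

Reporting the range rather than the point estimate keeps early numbers framed as hypotheses.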

Sources of Uncertainty in AI Projects

Understanding where uncertainty comes from helps you manage it.

Data uncertainty is often the largest source. Will the training data be available when promised? Will it be clean enough? Will the distribution match production? Will labels be accurate? Each of these can derail a project.

A real example: In 2018, a major healthcare AI company built a diagnostic model on data from one hospital system. The model performed excellently in development. When deployed to other hospital systems, performance dropped by 30%. The data distribution had shifted—different patient populations, different imaging equipment, different documentation practices. The uncertainty was not in the model but in the data’s transferability.

Model uncertainty concerns whether the task is learnable at the target performance level. Some tasks that seem similar have vastly different difficulties. Classifying product categories might achieve 98% accuracy; classifying customer intent might plateau at 75%. You often cannot know the ceiling until you approach it.

Integration uncertainty arises when connecting AI components to larger systems. The model works in isolation, but can it meet latency requirements? Can it handle the production data format? Does it integrate with existing pipelines? These questions often surface late, when they are expensive to address.

Stakeholder uncertainty reflects the gap between expectations and reality. Non-technical stakeholders may expect AI to perform tasks that are technically infeasible, or may underestimate the data and labeling requirements. These expectation gaps create project risk even when the technical work succeeds.

Managing Uncertainty: The Spike Pattern

A “spike” is a time-boxed investigation to resolve uncertainty. The term comes from Extreme Programming and refers to experiments that cut through ambiguity.

Effective spikes have specific characteristics:

Time-boxed: Spikes have fixed duration (typically 2-5 days), not fixed scope. The goal is to learn as much as possible within the time limit, then make decisions based on what you learned.

Question-focused: Each spike addresses a specific question. “Can we achieve 90% accuracy on this task?” “What latency can we expect with this architecture?” “Is the data sufficient for training?” Vague spikes yield vague answers.

Decision-oriented: A spike should end with a decision. “Based on our findings, we recommend proceeding with approach X because…” Spikes that conclude with “we need more investigation” have often failed.

Documented: Record what you learned, including negative results. This prevents repeating investigations and builds organizational knowledge.

Example spike structure:

Spike: Feasibility of Real-Time Sentiment Classification
Duration: 3 days
Question: Can we classify customer message sentiment with >90% accuracy
         and <100ms latency using existing infrastructure?

Approach:

- Day 1: Prepare evaluation dataset (200 labeled examples from production)
- Day 2: Test three model sizes (small/medium/large) on accuracy and latency
- Day 3: Document results, prepare recommendation

Success Criteria:

- If accuracy >90% and latency <100ms: Proceed to full implementation
- If accuracy >90% but latency >100ms: Investigate optimization path
- If accuracy <90%: Reassess requirements with stakeholders

Deliverables:

- Accuracy/latency measurements across model configurations
- Recommendation memo with supporting data
- Go/no-go decision
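The success criteria above amount to a small decision rule, which can be sketched in a few lines. The function name and the measurements are hypothetical. The criteria do not say what happens at exactly 90% accuracy or exactly 100ms, so this sketch assumes a boundary value counts as missing the target.

```python
def spike_decision(accuracy: float, latency_ms: float,
                   min_accuracy: float = 0.90,
                   max_latency_ms: float = 100.0) -> str:
    """Map spike measurements to the next step named in the success criteria."""
    # Assumption: hitting a threshold exactly is treated as not meeting it.
    if accuracy <= min_accuracy:
        return "Reassess requirements with stakeholders"
    if latency_ms >= max_latency_ms:
        return "Investigate optimization path"
    return "Proceed to full implementation"


# Hypothetical measurements for the three model sizes tested on day 2:
# (accuracy, latency in milliseconds).
measurements = {
    "small": (0.87, 35.0),
    "medium": (0.92, 80.0),
    "large": (0.95, 240.0),
}

for size, (accuracy, latency_ms) in measurements.items():
    print(f"{size}: {spike_decision(accuracy, latency_ms)}")
```

Writing the rule down before the spike starts makes it harder to reinterpret the criteria after the results are in.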

Case Study: The Failed Recommendation System

In 2019, a large e-commerce company embarked on a project to replace their heuristic-based product recommendation system with a deep learning approach. Internal benchmarks showed the new system improved click-through rates by 40% on historical data.

The project failed spectacularly. Here’s what went wrong.

No uncertainty acknowledgment. The team treated the 40% improvement as guaranteed rather than as a hypothesis to be tested. When stakeholders asked about risks, they minimized them. “The model works. We just need to deploy it.”

Late integration testing. The team built the model in isolation for three months before attempting integration. When they finally connected to the production data pipeline, they discovered format incompatibilities that took two months to resolve. The model’s data expectations had drifted from production reality.

Stakeholder surprise. The team promised a launch date based on optimistic estimates. When the integration problems emerged, they repeatedly pushed the date back by “just two weeks.” After the third delay, stakeholder trust collapsed. The project was eventually cancelled—not because the technology failed, but because the organization lost faith in the team’s ability to deliver.

No scope reduction option. When problems emerged, the team had no fallback plan. They had committed to full deployment or nothing. A more resilient approach would have defined intermediate milestones: first deploy to 1% of traffic, then 10%, then full. This would have provided earlier feedback and lower-risk checkpoints.

The technical failure was significant (the model also performed worse in production due to distribution shift), but the project management failures were fatal. With proper uncertainty management, the team would have discovered the problems earlier, when they were easier and cheaper to address.


AI Project Lifecycle Framework

Traditional software projects follow well-established lifecycles: waterfall, agile sprints, or hybrid approaches. AI projects need a modified lifecycle that accounts for their unique characteristics: uncertain feasibility, data dependencies, and iterative model development.

The Five Phases of AI Projects

[Figure: AI Project Lifecycle]

Phase 1: Discovery (1-2 weeks)

Discovery answers: “Should we pursue this?” Before investing in building, validate the problem is worth solving and roughly solvable.

Activities:

  • Define the business problem in measurable terms
  • Identify stakeholders and their success criteria
  • Assess data availability (what exists, what’s accessible, what’s missing)
  • Research existing solutions (build vs. buy vs. adapt)
  • Estimate rough order of magnitude for effort and value

Gate 1 Decision: Proceed to feasibility, pivot to different approach, or stop.

Phase 2: Feasibility (2-4 weeks)

Feasibility answers: “Can we build this?” This is where spikes happen. The goal is to reduce uncertainty enough to commit resources.

Activities:

  • Conduct data quality audit (completeness, accuracy, bias, freshness)
  • Build baseline model to establish performance floor
  • Run technical spikes on key uncertainties
  • Validate integration path with dependent systems
  • Refine estimates based on empirical evidence

Gate 2 Decision: Commit to full build, extend feasibility investigation, pivot approach, or stop.

Phase 3: Build (4-12 weeks)

Build is where the main development happens. Unlike traditional software, AI build phases should include explicit iteration cycles.

Activities:

  • Develop production-quality model with proper evaluation
  • Build data pipelines and feature engineering
  • Implement serving infrastructure
  • Create monitoring and alerting
  • Write documentation and runbooks
  • Conduct security and compliance review

Structure build as 2-week iterations, each ending with:

  • Demo of current capability
  • Updated metrics vs. targets
  • Refined estimate for remaining work
  • Decision: continue, adjust scope, or escalate

Gate 3 Decision: Ready for production deployment or needs more iteration.

Phase 4: Deploy (2-4 weeks)

Deploy is not a single event but a graduated process. Never big-bang an AI system.

Activities:

  • Integration testing in staging environment
  • Shadow mode deployment (runs but doesn’t affect users)
  • Canary deployment (small percentage of traffic)
  • Gradual rollout with monitoring at each stage
  • A/B testing where appropriate
  • Rollback validation

Gate 4 Decision: Proceed to full release, extend limited release, or roll back.
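One common way to implement a canary and gradual rollout is deterministic bucketing: hash each user into a stable bucket and route the user to the new system only if the bucket falls under the current rollout fraction. The sketch below is a minimal illustration of that idea, not a prescribed mechanism. The function name, the salt value, and the stage fractions are assumptions.

```python
import hashlib

# Hypothetical rollout stages: 1% canary, 10%, then full traffic.
ROLLOUT_STAGES = [0.01, 0.10, 1.00]

BUCKETS = 10_000  # resolution of 0.01%


def in_rollout(user_id: str, fraction: float, salt: str = "new-model-v1") -> bool:
    """Return True if this user should be served by the new system."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return bucket < round(fraction * BUCKETS)


for fraction in ROLLOUT_STAGES:
    served = sum(in_rollout(f"user-{i}", fraction) for i in range(10_000))
    print(f"fraction={fraction:.2f}: {served} of 10000 users on the new system")
```

Because a user's bucket never changes, anyone included at 1% stays included at 10%, which keeps the experience consistent across stages. Rolling back is a matter of lowering the fraction.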

Phase 5: Operate (Ongoing)

AI systems require more operational attention than traditional software. Models drift, data changes, and the world evolves.

Activities:

  • Monitor model performance metrics continuously
  • Track data drift and distribution changes
  • Respond to incidents and edge cases
  • Plan and execute model updates
  • Gather user feedback and feature requests
  • Evaluate when to retrain, rebuild, or deprecate

Gate 5 Decision: Continue operating, initiate major update project, or sunset the system.
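Tracking distribution changes needs some numeric measure of how far production data has moved from training data. One widely used option is the Population Stability Index (PSI), sketched below for a single categorical feature. The feature values, the alert threshold of 0.2, and the smoothing constant are assumptions for illustration; a suitable threshold depends on the feature and should be tuned against real incidents.

```python
import math
from collections import Counter


def psi(expected: list, actual: list, epsilon: float = 1e-6) -> float:
    """Population Stability Index between two samples of a categorical feature."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in sorted(set(e_counts) | set(a_counts)):
        # Floor each share at epsilon so a category missing from one sample
        # does not cause a division by zero or log(0).
        e = max(e_counts[category] / len(expected), epsilon)
        a = max(a_counts[category] / len(actual), epsilon)
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical example: the mix of imaging equipment shifts after deployment.
training = ["scanner_a"] * 80 + ["scanner_b"] * 20
production = ["scanner_a"] * 50 + ["scanner_b"] * 50

DRIFT_ALERT_THRESHOLD = 0.2  # assumed value, to be tuned
score = psi(training, production)
print(f"PSI = {score:.3f}, alert = {score > DRIFT_ALERT_THRESHOLD}")
```

PSI is zero when the two distributions match and grows as they diverge. It says nothing about whether model quality has actually dropped, so it complements rather than replaces performance monitoring.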

Phase Gates in Practice

Phase gates are decision points, not bureaucratic checkpoints. Each gate should answer specific questions:

| Gate | Key Questions | Decision Options |
|------|---------------|------------------|
| Gate 1 | Is the problem worth solving? Is it roughly feasible? | Proceed / Pivot / Stop |
| Gate 2 | Can we achieve target performance? Do we understand the path? | Commit / Extend / Pivot / Stop |
| Gate 3 | Does the system meet requirements? Is it production-ready? | Deploy / Iterate / Descope |
| Gate 4 | Is production performance acceptable? Are users successful? | Full release / Limited release / Rollback |
| Gate 5 | Is the system delivering value? Is it sustainable? | Continue / Update / Sunset |

Gates should be lightweight—a 30-minute meeting with key stakeholders, not a day-long review. The purpose is to force explicit decisions rather than letting projects drift.

Adapting the Lifecycle

This framework is a starting point, not a prescription. Adapt it to your context:

For smaller projects: Compress or combine phases. A 2-week project might have a 2-day discovery/feasibility phase and 1.5 weeks of build/deploy.

For research-heavy projects: Extend feasibility. Some projects need months of experimentation before committing to production development.

For incremental improvements: Start at Phase 3. If you’re improving an existing system with well-understood characteristics, skip directly to build.

For high-stakes projects: Add rigor to gates. Healthcare, finance, and safety-critical applications may need formal sign-offs and compliance reviews at each gate.

The key principle: match investment to uncertainty. Front-load investigation when uncertainty is high. Accelerate when the path is clear.


Risk Management for AI Projects

Uncertainty is inherent in AI projects, but it can be managed systematically. Risk management transforms vague worry into concrete action.

AI-Specific Risk Categories

AI projects face standard software risks plus unique categories:

Data Risks

  • Availability: Required data doesn’t exist or can’t be accessed
  • Quality: Data is noisy, incomplete, or incorrectly labeled
  • Bias: Training data doesn’t represent production distribution
  • Drift: Production data distribution changes over time
  • Privacy: Data usage violates regulations or user expectations

Model Risks

  • Feasibility: Target performance is not achievable
  • Overfitting: Model performs well on test data but fails in production
  • Brittleness: Model fails on edge cases or adversarial inputs
  • Interpretability: Cannot explain model decisions when required
  • Latency: Model is too slow for production requirements

Integration Risks

  • Compatibility: Model doesn’t integrate with existing systems
  • Scale: System can’t handle production load
  • Dependencies: Upstream systems change unexpectedly
  • Deployment: Can’t deploy model to required environments

Organizational Risks

  • Staffing: Key people leave or are reassigned
  • Stakeholder: Requirements change or stakeholders disengage
  • Timeline: External deadlines can’t be met
  • Budget: Costs exceed allocated resources
  • Competing priorities: Resources diverted to other projects

Risk Assessment Matrix

Assess each risk on two dimensions: likelihood (how probable) and impact (how severe if it occurs).

| | Low Impact | Medium Impact | High Impact |
|---|------------|---------------|-------------|
| High Likelihood | MEDIUM - Monitor | HIGH - Mitigate | CRITICAL - Mitigate |
| Medium Likelihood | LOW - Accept | MEDIUM - Monitor | HIGH - Mitigate |
| Low Likelihood | LOW - Accept | LOW - Accept | MEDIUM - Monitor |

Critical risks require immediate mitigation. Do not proceed until addressed.

High risks need active mitigation plans with owners and deadlines.

Medium risks should be monitored regularly with contingency plans ready.

Low risks can be accepted—document them but don’t invest heavily in mitigation.

Risk Register Template

Maintain a living document tracking identified risks:

| ID | Risk Description | L × I | Owner | Mitigation / Contingency |
|----|------------------|-------|-------|--------------------------|
| R1 | Training data has significant quality issues | H × H = CRIT | Alice | Run data audit in week 1. Budget 2 wks for cleaning. Contingency: use smaller verified subset. |
| R2 | Model latency exceeds 100ms requirement | M × H = HIGH | Bob | Spike in week 2 to test model sizes. Contingency: use distilled model. |
| R3 | Data engineering pipeline delayed | M × M = MED | Carol | Weekly sync with DE team. Contingency: use batch export as interim. |
| R4 | Key team member leaves project | L × M = LOW | David | Document decisions and knowledge. Cross-train. |

Review the risk register weekly. Update likelihood and impact as you learn more. Add new risks as they emerge. Close risks that are resolved or no longer relevant.
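The assessment matrix and the register can also be kept as data, so that a risk's rating is derived from its likelihood and impact rather than typed in by hand. The sketch below copies the nine matrix cells and the four example risks from this section; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# (likelihood, impact) -> rating, copied cell by cell from the matrix.
RATING = {
    (Level.HIGH, Level.LOW): "MEDIUM",
    (Level.HIGH, Level.MEDIUM): "HIGH",
    (Level.HIGH, Level.HIGH): "CRITICAL",
    (Level.MEDIUM, Level.LOW): "LOW",
    (Level.MEDIUM, Level.MEDIUM): "MEDIUM",
    (Level.MEDIUM, Level.HIGH): "HIGH",
    (Level.LOW, Level.LOW): "LOW",
    (Level.LOW, Level.MEDIUM): "LOW",
    (Level.LOW, Level.HIGH): "MEDIUM",
}
ACTION = {"CRITICAL": "Mitigate", "HIGH": "Mitigate", "MEDIUM": "Monitor", "LOW": "Accept"}
SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]


@dataclass
class Risk:
    risk_id: str
    description: str
    likelihood: Level
    impact: Level
    owner: str

    @property
    def rating(self) -> str:
        return RATING[(self.likelihood, self.impact)]

    @property
    def action(self) -> str:
        return ACTION[self.rating]


register = [
    Risk("R4", "Key team member leaves project", Level.LOW, Level.MEDIUM, "David"),
    Risk("R2", "Model latency exceeds 100ms requirement", Level.MEDIUM, Level.HIGH, "Bob"),
    Risk("R1", "Training data has significant quality issues", Level.HIGH, Level.HIGH, "Alice"),
    Risk("R3", "Data engineering pipeline delayed", Level.MEDIUM, Level.MEDIUM, "Carol"),
]

# Print the register with the most severe risks first for the weekly review.
for risk in sorted(register, key=lambda r: SEVERITY_ORDER.index(r.rating)):
    print(f"{risk.risk_id} [{risk.rating}] {risk.action}: {risk.description} ({risk.owner})")
```

When a weekly review changes a likelihood or impact, the rating and the default action update with it.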

Mitigation Strategies

Different risk types call for different mitigation approaches:

Avoidance: Eliminate the risk by changing approach. If a data source is unreliable, use a different data source.

Reduction: Decrease likelihood or impact. If model latency is a risk, run early spikes to identify fast architectures.

Transfer: Shift risk to another party. Use a managed service instead of building infrastructure. Buy labeled data instead of labeling in-house.

Acceptance: Acknowledge the risk and prepare to handle it if it occurs. Document the contingency plan but don’t invest in prevention.

Contingency planning: Define what you’ll do if the risk materializes. “If data quality issues exceed 2 weeks to fix, we will descope to the verified subset and deliver a more limited model.”

Early Warning Indicators

Don’t wait for risks to materialize. Define leading indicators that signal emerging problems:

| Risk Category | Early Warning Signs |
|---------------|---------------------|
| Data quality | Sample audits reveal issues; annotator disagreement is high |
| Model feasibility | Baseline model performance is far below target |
| Integration | API contracts keep changing; staging environment unavailable |
| Timeline | Sprint velocity declining; blockers not resolving |
| Stakeholder | Stakeholders missing meetings; requirements keep changing |
| Team | Increased frustration in retros; people asking about other projects |

When warning signs appear, investigate immediately. Early intervention is far cheaper than late crisis management.
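Some of these indicators can be checked mechanically. As one example, the sketch below flags a declining sprint velocity. The rule used here (each of the last three sprints delivering less than the one before) and the sample numbers are assumptions; a team would pick a rule that matches how noisy its own velocity is.

```python
def velocity_declining(velocities: list[float], window: int = 3) -> bool:
    """True if each of the last `window` sprints delivered less than the one before."""
    if len(velocities) < window + 1:
        return False  # not enough history to judge a trend
    recent = velocities[-(window + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical story points completed per sprint, oldest first.
history = [31, 30, 28, 25, 21]
if velocity_declining(history):
    print("Warning: velocity has dropped three sprints in a row; investigate blockers")
```

A check like this only prompts the conversation. The cause still has to come from talking to the team.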


Resource and Budget Management

AI projects have unique cost structures that require active management. Compute costs, data labeling, and specialized talent can quickly exceed budgets.

AI Project Cost Categories

Compute costs often dominate AI project budgets:

  • Training runs (especially for large models or hyperparameter search)
  • Experimentation infrastructure (GPUs for development)
  • Inference serving (production deployment)
  • Data processing (ETL, feature engineering)

Data costs are frequently underestimated:

  • Data acquisition (purchasing datasets, API access)
  • Data labeling (internal or vendor annotation)
  • Data storage (especially for large media files)
  • Data infrastructure (warehousing, pipelines)

People costs include:

  • Core team (engineers, data scientists, ML engineers)
  • Supporting roles (product, design, legal)
  • Contractors and vendors
  • Training and upskilling

Infrastructure costs include:

  • Cloud services (storage, networking, monitoring)
  • Tooling (experiment tracking, model registry, MLOps platforms)
  • Third-party services (APIs, managed ML services)

Budget Planning

Create a budget that reflects AI project realities:

Estimate compute empirically. Don’t guess—run small experiments and extrapolate. “A single training run takes 4 GPU-hours. We expect 50 runs during development. At $2/GPU-hour, that’s $400 for training compute.”

Budget for iteration. AI development is iterative. The first approach often doesn’t work. Budget for 2-3x the compute you’d need if everything worked the first time.

Include data labeling. Labeling costs $0.10-$10+ per example depending on complexity. 10,000 examples at $1 each is $10,000—a significant line item.

Plan for production costs. Inference costs accrue continuously. A model serving 1M requests/day at $0.001/request costs $30,000/month.

Reserve contingency. Given uncertainty, reserve 20-30% of budget for unknowns.
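The arithmetic in these guidelines is simple enough to keep in a small script, which makes it easy to rerun as estimates change. The sketch below reuses the figures from the examples above. The function names are illustrative, and it assumes the contingency reserve is expressed as a share of the total budget, so the total is the planned spend divided by one minus the reserve fraction.

```python
def training_compute_cost(gpu_hours_per_run: float, runs: int,
                          price_per_gpu_hour: float,
                          iteration_multiplier: float = 1.0) -> float:
    """Training compute, optionally scaled up to allow for failed approaches."""
    return gpu_hours_per_run * runs * price_per_gpu_hour * iteration_multiplier


def labeling_cost(examples: int, price_per_example: float) -> float:
    return examples * price_per_example


def monthly_inference_cost(requests_per_day: float, price_per_request: float,
                           days_per_month: int = 30) -> float:
    return requests_per_day * price_per_request * days_per_month


def total_with_contingency(planned_spend: float, reserve_fraction: float) -> float:
    """Total budget such that `reserve_fraction` of it is held back for unknowns."""
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return planned_spend / (1 - reserve_fraction)


compute = training_compute_cost(4, 50, 2.0, iteration_multiplier=2.5)
labels = labeling_cost(10_000, 1.0)
serving = monthly_inference_cost(1_000_000, 0.001)
development_budget = total_with_contingency(compute + labels, reserve_fraction=0.25)

print(f"Training compute (with iteration): ${compute:,.0f}")
print(f"Labeling: ${labels:,.0f}")
print(f"Inference per month: ${serving:,.0f}")
print(f"Development budget incl. 25% reserve: ${development_budget:,.0f}")
```

Inference is kept separate from the development budget because it recurs every month rather than ending with the project.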

Cost Monitoring and Control

Track costs continuously, not just at project end:

Set up cost alerts. Cloud providers offer budget alerts. Set thresholds at 50%, 75%, and 90% of budget.

Review weekly. Include cost status in weekly reviews. “We’ve spent 30% of compute budget with 40% of experiments complete, so we’re on track.”

Attribute costs to activities. Tag cloud resources by project and activity. This enables analysis: “Hyperparameter search consumed 60% of compute budget but only improved accuracy by 1%.”

Identify cost outliers. Look for unexpected costs: zombie resources, inefficient queries, over-provisioned infrastructure.
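Cloud providers supply their own alerting, but the underlying checks are easy to express and useful in a weekly review script. The sketch below shows two of them: which alert thresholds have been crossed, and whether spend is running ahead of progress. The function names, the 5-point tolerance, and the assumption that spend should track progress roughly linearly are all illustrative.

```python
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)


def crossed_thresholds(spent: float, budget: float,
                       thresholds: tuple = ALERT_THRESHOLDS) -> list:
    """Return every alert threshold that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    share = spent / budget
    return [t for t in thresholds if share >= t]


def spend_status(spent_fraction: float, progress_fraction: float,
                 tolerance: float = 0.05) -> str:
    """Compare the share of budget spent with the share of planned work done.

    Assumes spend should track progress roughly linearly, which will not
    hold for projects with front-loaded costs.
    """
    if spent_fraction - progress_fraction > tolerance:
        return "over pace"
    return "on track"


print(crossed_thresholds(spent=8_000, budget=10_000))
print(spend_status(spent_fraction=0.55, progress_fraction=0.35))
print(spend_status(spent_fraction=0.50, progress_fraction=0.50))
```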

Cost-Performance Tradeoffs

AI engineering constantly trades cost against performance:

Model size vs. cost. Larger models are more capable but more expensive to train and serve. A model that’s 5% more accurate but 10x more expensive to run may not be worthwhile.

Labeling quantity vs. quality. More labeled data usually helps, but labeling is expensive. Determine the point of diminishing returns.

Experimentation breadth vs. depth. You can run many cheap experiments or few expensive ones. Balance exploration and exploitation.

Build vs. buy. Building in-house provides control but costs engineering time. Managed services and APIs cost money but save time.

Framework for cost-performance decisions:

  1. Define the performance requirement (not “as good as possible” but a specific target)
  2. Identify the cheapest approach that meets the requirement
  3. Calculate cost of incremental improvement
  4. Stop when marginal cost exceeds marginal value
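The four steps above can be sketched as a small selection routine. Everything numeric here is hypothetical: the three model options, their costs and accuracies, and above all the dollar value assigned to one percentage point of accuracy, which in practice is the hardest input to estimate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Option:
    name: str
    monthly_cost: float
    accuracy: float  # fraction between 0 and 1


def choose_option(options: list, required_accuracy: float,
                  value_per_point: float) -> Optional[Option]:
    """Pick the cheapest option that meets the requirement, then upgrade only
    when the extra value of an option exceeds its extra cost."""
    viable = sorted(
        (o for o in options if o.accuracy >= required_accuracy),
        key=lambda o: o.monthly_cost,
    )
    if not viable:
        return None  # nothing meets the requirement; revisit step 1
    chosen = viable[0]
    for candidate in viable[1:]:
        extra_cost = candidate.monthly_cost - chosen.monthly_cost
        extra_points = (candidate.accuracy - chosen.accuracy) * 100
        if extra_points * value_per_point > extra_cost:
            chosen = candidate
    return chosen


# Hypothetical options and value estimate.
options = [
    Option("small", monthly_cost=1_000, accuracy=0.90),
    Option("medium", monthly_cost=3_000, accuracy=0.92),
    Option("large", monthly_cost=10_000, accuracy=0.945),
]
best = choose_option(options, required_accuracy=0.90, value_per_point=1_500)
print(best.name if best else "no option meets the requirement")
```

With these numbers the routine moves from the small to the medium model (two extra points are worth more than the extra cost) but declines the large one. Changing `value_per_point` is a quick way to see how sensitive the decision is to that assumption.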

AI Project Anti-Patterns

Beyond general project management failures, AI projects have characteristic ways of going wrong. Recognizing these patterns helps you avoid them.

Anti-Pattern 1: Benchmark Chasing

The pattern: The team optimizes relentlessly for benchmark metrics while losing sight of the actual problem.

How it happens: Model accuracy becomes the team’s identity. Every percentage point feels like victory. The leaderboard becomes the goal.

The damage: The model achieves 95% accuracy on the benchmark but fails in production because the benchmark doesn’t represent real conditions. Or the model is unusably slow because latency wasn’t measured. Or it works perfectly on common cases but catastrophically fails on edge cases that matter most to users.

Prevention: Define success metrics that reflect actual user outcomes, not just model performance. Include latency, robustness, fairness, and user satisfaction. Regularly test on production-representative data, not just benchmarks.

Anti-Pattern 2: Demo-Driven Development

The pattern: Development prioritizes impressive demos over production reliability.

How it happens: Stakeholders love demos. A compelling demo generates excitement and secures continued funding. So the team builds for demos: cherry-picked examples, controlled conditions, impressive outputs.

The damage: The system demos beautifully but can’t handle the variability of real-world inputs. When it reaches production, users encounter failures the demo never showed. Trust collapses.

Prevention: Insist that demos use randomly sampled inputs, not cherry-picked ones. Show failure cases alongside successes. Demo the monitoring dashboard, not just the happy path. Build stakeholder expectations for how AI really behaves.

Anti-Pattern 3: Evaluation Theater

The pattern: Evaluation happens but doesn’t actually validate readiness for production.

How it happens: The team needs to show that testing was done. They create evaluation datasets, run metrics, and produce reports. But the evaluation doesn’t probe the things that matter: edge cases, distribution shift, adversarial inputs, failure modes.

The damage: Evaluation passes. Production fails. The team is surprised because “we tested it.”

Prevention: Design evaluation specifically to find failures. Use adversarial test sets. Test on data from different sources than training. Conduct red-team exercises. Make evaluation uncomfortable—if it’s easy to pass, it’s not testing hard enough.
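One inexpensive way to make evaluation harder to pass is to report accuracy per data source or segment instead of a single aggregate, since an overall number can hide a slice that fails badly. The sketch below assumes evaluation records arrive as (slice, label, prediction) tuples; the slice names and records are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records) -> dict:
    """Compute accuracy separately for each slice.

    `records` is an iterable of (slice_name, label, prediction) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, label, prediction in records:
        total[slice_name] += 1
        correct[slice_name] += int(label == prediction)
    return {name: correct[name] / total[name] for name in total}


def weak_slices(per_slice: dict, target: float) -> list:
    """Names of slices whose accuracy falls below the target, sorted."""
    return sorted(name for name, acc in per_slice.items() if acc < target)


# Hypothetical evaluation records from two input channels.
records = [
    ("web", "pos", "pos"), ("web", "neg", "neg"),
    ("web", "pos", "pos"), ("web", "neg", "pos"),
    ("phone", "pos", "neg"), ("phone", "neg", "neg"),
]
per_slice = accuracy_by_slice(records)
print(per_slice)
print("Below target:", weak_slices(per_slice, target=0.7))
```

Small slices give noisy estimates, so a real report would also show the number of examples behind each figure.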

Anti-Pattern 4: The Never-Ending Research Project

The pattern: The project perpetually stays in research mode, always pursuing “one more improvement.”

How it happens: There’s always a new technique to try, a new paper to implement, a potential improvement on the horizon. Shipping feels premature when you could make it better.

The damage: The project never delivers value. Eventually, stakeholders lose patience, and the project is cancelled—not because the technology failed but because it was never deployed.

Prevention: Set explicit ship dates and defend them. Define “good enough” before starting. Use timeboxes: “We have two weeks to improve accuracy. Whatever we have at the end ships.” Celebrate shipping, not just improvements.

Anti-Pattern 5: Infrastructure Without Application

The pattern: The team builds sophisticated ML infrastructure before validating the application.

How it happens: Infrastructure feels like progress. It’s concrete, demonstrable, and reusable. Building a feature store, experiment tracking system, or model serving platform feels productive.

The damage: Months pass building infrastructure. When the team finally builds the application, it turns out the problem was harder than expected, or requirements changed, or the infrastructure doesn’t quite fit. All that infrastructure work is wasted.

Prevention: Start with the simplest possible infrastructure. Build the application first with manual processes. Automate only when you’ve validated the approach and understand requirements. Infrastructure should serve applications, not precede them.

Anti-Pattern 6: The Ownership Vacuum

The pattern: Everyone owns their component; nobody owns the outcome.

How it happens: Specialization creates boundaries. The data engineer owns the pipeline. The ML engineer owns the model. The platform engineer owns serving. Each component works. But nobody ensures they work together.

The damage: Integration fails late. Performance gaps fall into cracks between teams. Problems are everybody’s fault and nobody’s responsibility. The project stalls.

Prevention: Explicitly assign end-to-end ownership. One person (or team) is responsible for the outcome, not just a component. This owner has authority to coordinate across boundaries and accountability for the result.

Anti-Pattern 7: Premature Scaling

The pattern: The team builds for massive scale before validating the solution works at any scale.

How it happens: Scaling is a fun technical challenge. It also feels responsible—building for the future. So the team designs for millions of users when they have hundreds.

The damage: Scaling adds complexity that slows development. It also bakes in assumptions that may be wrong. When the solution needs to change (which it will), the scaling infrastructure becomes a constraint.

Prevention: Build for current scale with a clear path to grow. “We can handle 10x current load with this design. When we need 100x, we’ll revisit.” Don’t solve scaling problems you don’t have yet.


Scoping AI Projects

The Scoping Challenge

Scoping is defining what you will build, what you won’t build, and how you will know you’ve succeeded. It sounds straightforward but is where many AI projects go wrong.

The fundamental tension: stakeholders want certainty about outcomes. AI projects inherently resist certainty. Effective scoping bridges this gap by defining concrete goals while acknowledging realistic uncertainty.

Problem Definition: The Most Important Step

Before discussing solutions, ensure the problem is well-defined. Poorly defined problems lead to well-executed solutions to the wrong thing.

A good problem definition includes:

The business or user need being addressed. Not “we need a sentiment analysis model” but “customers abandon support chats when sentiment becomes negative. We need to detect this and intervene.” The former defines a solution; the latter defines a problem.

The context and constraints. What data exists? What latency is acceptable? What false positive rate is tolerable? What are the regulatory constraints? Without understanding constraints, you might build something technically excellent but practically unusable.

Success criteria that are measurable. “Improve customer satisfaction” is unmeasurable. “Reduce chat abandonment rate from 15% to 10%” is measurable. Without measurable criteria, you cannot know when you’ve succeeded.

The stakeholders and their interests. Who cares about this project and what do they care about? Different stakeholders have different success criteria. Engineering might care about model accuracy; product might care about user engagement; finance might care about cost reduction. A successful project identifies which of these interests will decide whether it is judged a success, and addresses those explicitly.

Setting Realistic Targets

A common failure mode: setting targets without understanding whether they are achievable.

“We need 99% accuracy” sounds reasonable until you discover that the best published research on similar tasks achieves 92%, or that your labeled data has 5% inherent disagreement between human annotators. Targets should be grounded in what is technically feasible.

Research the state of the art. What performance do published papers achieve on comparable tasks? What performance do industry benchmarks report? If your target exceeds the state of the art, you are implicitly betting that your team will advance the field. This happens, but it should be a conscious choice, not an accident.

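Understand your data quality. If human annotators agree with each other only 90% of the time, measured accuracy much above 90% is hard to trust: the model may be fitting annotator quirks or label noise rather than the task itself. Label quality sets a practical ceiling on what you can measure and promise.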

Consider the baseline. What does the simplest reasonable approach achieve? If a rule-based system or simple model achieves 80%, aiming for 85% with a complex model might not be worthwhile.

Set tiered targets. “Our minimum viable accuracy is 85%. Our goal is 90%. Stretch goal is 95%.” This acknowledges uncertainty while providing concrete milestones.
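
The data-quality check above can be run before any model is trained. The following is a minimal sketch, assuming two annotators have labeled the same items; the function names and the sample labels are hypothetical and only illustrate the calculation. It reports raw agreement and Cohen's kappa, which discounts the agreement two annotators would reach by chance.

```python
from collections import Counter

def raw_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance (Cohen's kappa) for two annotators."""
    n = len(labels_a)
    observed = raw_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same ten tickets.
annotator_1 = ["urgent", "urgent", "normal", "normal", "normal",
               "urgent", "normal", "normal", "urgent", "normal"]
annotator_2 = ["urgent", "normal", "normal", "normal", "normal",
               "urgent", "normal", "normal", "urgent", "normal"]

agreement = raw_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

If raw agreement sits close to a proposed accuracy target, that is a signal to revisit either the target or the labeling guidelines before committing to it.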

In-Scope and Out-of-Scope

Explicit boundaries prevent scope creep and mismatched expectations.

In scope should be specific and limited. “Build a binary classifier for customer complaint urgency, trained on historical ticket data, deployed as a microservice with <200ms p99 latency.”

Out of scope should be explicit. “Multi-class priority classification, handling languages other than English, integration with the ticketing UI, training data labeling.” Listing what you’re NOT doing is as important as listing what you are.

Assumptions should be documented. “We assume the training data is representative of production traffic. We assume the labeling team can provide 5,000 labeled examples by March 1. We assume the platform team can allocate a GPU-enabled deployment environment.” When assumptions prove false, you have documentation of what changed.
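
One way to keep these boundaries from going stale is to store them as structured data rather than free text. The sketch below is an illustration only: the class names, fields, and dates (including the year) are assumptions, not a prescribed format. It records scope items and assumptions, and lists assumptions that should have been confirmed by a given date but were not.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Assumption:
    statement: str
    verify_by: date          # date by which the assumption should be confirmed
    verified: bool = False

@dataclass
class ScopeDocument:
    in_scope: list[str]
    out_of_scope: list[str]
    assumptions: list[Assumption] = field(default_factory=list)

    def overdue_assumptions(self, today: date) -> list[Assumption]:
        """Assumptions that should have been confirmed by now but were not."""
        return [a for a in self.assumptions
                if not a.verified and a.verify_by < today]

# Example values adapted from the text; the year is illustrative.
scope = ScopeDocument(
    in_scope=["binary classifier for complaint urgency",
              "microservice with <200ms p99 latency"],
    out_of_scope=["multi-class priority classification",
                  "languages other than English"],
    assumptions=[
        Assumption("labeling team delivers 5,000 labeled examples",
                   date(2025, 3, 1)),
        Assumption("training data is representative of production traffic",
                   date(2025, 3, 15), verified=True),
        Assumption("platform team allocates a GPU-enabled environment",
                   date(2025, 4, 1)),
    ],
)

overdue = scope.overdue_assumptions(today=date(2025, 3, 10))
for item in overdue:
    print("Unverified and overdue:", item.statement)
```

Reviewing the overdue list in a weekly status meeting turns "when assumptions prove false" into something the team notices on a schedule instead of by accident.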

The MVP Mindset

Complex AI projects benefit from defining a Minimum Viable Product—the simplest version that delivers value.

An MVP for an AI feature should:

Deliver core value. It might not have every feature, but the features it has should genuinely help users.

Enable learning. You should learn from user interaction with the MVP. What works? What’s missing? What did you get wrong?

Be deployable quickly. If the MVP takes as long as the full product, it’s not an MVP.

Provide a foundation for iteration. The MVP architecture should support incremental improvement, not require rewriting.

For AI specifically, the MVP often involves simpler models than the eventual target. A keyword-based classifier might be the MVP for a deep learning classification system. A small, fast model might be the MVP for a large, accurate one. The goal is to get something in front of users quickly, learn from it, and iterate.

Decision Framework: How to Scope

When scoping an AI project, work through these questions:

1. What problem are we solving, and for whom? Define the user or business need in concrete terms. Avoid solution-oriented framing.

2. How will we know we’ve succeeded? Define measurable success criteria tied to the actual problem, not just model metrics.

3. What is technically feasible? Research the state of the art. Understand your data. Ground targets in reality.

4. What is the smallest thing that would be valuable? Define the MVP. Resist the temptation to solve everything at once.

5. What are we explicitly not doing? Document out-of-scope items and assumptions.

6. What are the major uncertainties? Plan spikes to resolve critical unknowns before committing to full execution.


Estimation for AI Projects

Why AI Estimation Is Hard

Estimation is notoriously difficult in software. It is even harder for AI.

Traditional software estimates draw on well-understood patterns. “Building a REST endpoint takes about two days. Integrating with this external API takes about a week.” Experience calibrates intuition.

AI estimation lacks these patterns because:

Performance improvement is nonlinear. The first 80% of accuracy might come in a week. The next 10% might take a month. The final 5% might be impossible. You cannot know where you are on this curve until you’re there.

“Easy” and “hard” tasks look similar. Two classification problems might look identical from outside: same data format, same number of classes, similar dataset size. One might be solved in days; another might resist solution for months. Difficulty is not visible until you attempt the task.

Novelty is common. Many AI projects involve some innovation—a new combination of techniques, a new application domain, a new data modality. Estimation for novel work is intrinsically uncertain.

Debugging is unpredictable. When a traditional program fails, you can usually trace the failure to its cause. When a model underperforms, the cause might be data quality, architecture choice, training hyperparameters, evaluation methodology, or something else entirely. Diagnosis takes unpredictable time.

Evidence-Based Estimation

Given this difficulty, how should you estimate?

Reference class forecasting draws on outcomes of similar projects. “The last three similar projects took 2 months, 3 months, and 5 months. This one will probably take 2-5 months.” This works when you have comparable reference cases.

Cone of Uncertainty thinking acknowledges that estimates narrow as work progresses. Early estimates should be ranges, not points. “This will take 2-6 weeks” is more honest than “this will take 3 weeks” when uncertainty is high.

Phased estimation estimates only what you can see clearly. Estimate the first phase in detail. Estimate later phases roughly. Re-estimate when you complete each phase and see the next one clearly.

Bottom-up with buffers breaks work into tasks, estimates each task, sums them, and adds a buffer for unknowns. The buffer size should reflect uncertainty—maybe 20% for well-understood work, 50% for novel work.
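
As a rough illustration of bottom-up estimation with buffers, the sketch below sums task estimates and applies the 20% and 50% buffer levels mentioned above. Applying the buffer per task rather than to the total is an assumption made here for simplicity, and the task list is hypothetical.

```python
def bottom_up_estimate(tasks):
    """Sum task estimates, adding a per-task buffer based on novelty.

    tasks: list of (name, estimate_days, is_novel) tuples.
    Returns (base_days, buffered_days).
    """
    buffer_well_understood = 0.20  # buffer levels quoted in the text
    buffer_novel = 0.50
    base = 0.0
    buffered = 0.0
    for _name, days, is_novel in tasks:
        rate = buffer_novel if is_novel else buffer_well_understood
        base += days
        buffered += days * (1 + rate)
    return base, buffered

# Hypothetical task breakdown, in working days.
tasks = [
    ("data audit", 5, False),
    ("baseline model", 5, False),
    ("new augmentation approach", 10, True),
    ("service integration", 10, False),
]

base_days, buffered_days = bottom_up_estimate(tasks)
print(f"base: {base_days:.0f} days, with buffers: {buffered_days:.0f} days")
```

Keeping the base and buffered totals separate makes it visible how much of the estimate is known work and how much is allowance for unknowns.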

The Planning Fallacy

In 1979, psychologists Daniel Kahneman and Amos Tversky identified the “planning fallacy”: people systematically underestimate time, costs, and risks of future actions while overestimating benefits. This applies strongly to software and especially to AI.

The planning fallacy persists because:

We focus on the specific case, not the reference class. You think about why THIS project is different, not about how similar projects typically go.

We imagine the happy path. Estimates assume things will go smoothly. They rarely do.

We are motivated to be optimistic. Optimistic estimates get project approval. Realistic estimates might not.

We anchor on early assumptions. Initial estimates become anchors that subsequent estimates adjust from insufficiently.

To counteract the planning fallacy:

Track your historical accuracy. Compare estimates to actuals over time. If your projects consistently take 50% longer than you estimated, adjust future estimates accordingly.

Estimate independently, then aggregate. Have multiple people estimate independently. Average or discuss the differences. This surfaces different perspectives.

Add explicit buffers. If you estimate 4 weeks and history says your projects usually run 50% over, budget 6 weeks.

Distinguish estimates from commitments. An estimate is your best guess. A commitment is what you’re willing to promise. Commitments should include margin for uncertainty.
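
The first and third countermeasures combine naturally: measure how far past estimates were off, then scale new estimates by that factor. The sketch below assumes you have kept (estimated, actual) pairs for past projects; the history values are hypothetical and chosen to match the 4-week example above.

```python
def overrun_factor(history):
    """Average ratio of actual to estimated duration across past projects.

    history: list of (estimated, actual) pairs in the same time unit.
    """
    if not history:
        raise ValueError("need at least one past project")
    ratios = [actual / estimated for estimated, actual in history]
    return sum(ratios) / len(ratios)

def calibrated_estimate(raw_estimate, history):
    """Scale a raw estimate by the historical overrun factor."""
    return raw_estimate * overrun_factor(history)

# Hypothetical record: (estimated weeks, actual weeks).
history = [(4, 6), (2, 3), (8, 12)]

factor = overrun_factor(history)
budget = calibrated_estimate(4, history)
print(f"overrun factor: {factor:.2f}, budget for a 4-week estimate: {budget:.1f} weeks")
```

A simple mean is only one choice; with a longer history, a median or a high percentile of the ratios may be a better basis for commitments.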

Communicating Uncertainty

Stakeholders often prefer point estimates: “It will take 6 weeks.” But this false precision hides real uncertainty and sets up disappointment.

Better approaches:

Range estimates with explicit confidence. “We’re 90% confident this will take 4-8 weeks.” This conveys both expectation and uncertainty.

Scenario-based estimates. “If the data is clean and the approach works, 4 weeks. If we need to iterate on the approach, 6-8 weeks. If we discover fundamental problems with the data, 3 months or possibly infeasible.” This helps stakeholders understand the contingencies.

Milestone-based estimates. “We’ll have initial results in 2 weeks. Based on those results, we can estimate the remainder.” This acknowledges that estimates improve with information.

The key is to be honest about what you know and don’t know. Stakeholders may push for certainty, but false certainty leads to disappointment and damaged trust. Honest uncertainty, delivered well, builds credibility.


Driving Execution

The Owner’s Mindset

Owning execution means more than completing your tasks. It means ensuring the project succeeds.

What does this look like in practice?

You notice what’s falling through cracks. Nobody asked you to check whether the data pipeline is ready. But the project needs it, so you check. And when it’s not ready, you follow up until it is.

You surface problems early. When something is going wrong, you don’t hide it or hope it resolves. You raise it to the people who need to know, with a proposed path forward.

You remove blockers. When you’re stuck, you don’t just wait. You find workarounds, escalate appropriately, or reframe the problem. You take action.

You close loops. When you hand off work or make a request, you follow up. You confirm the handoff was successful or the request was fulfilled. Nothing disappears into the void.

You communicate proactively. Stakeholders know the project status because you keep them informed, not because they chase you.

The opposite of ownership is abdication: “That’s not my job.” “Nobody told me.” “I’m blocked.” These phrases might be true, but they describe the situation without solving it. Owners solve.

Daily Execution Habits

Effective execution isn’t about heroic effort. It’s about consistent habits.

Start each day with clarity. Know your most important task and why it matters. If you don’t know, figuring that out IS your most important task.

Identify blockers early. What could stop your progress today? What can you do about it before it stops you?

Communicate status. Keep people who need to know in the loop. Brief updates prevent surprises.

Make visible progress. Even if a task takes days, find ways to show progress. Break large tasks into visible milestones. This builds trust and catches problems early.

End each day with capture. Record what happened, what you learned, what’s next. This preserves context across days and provides material for status updates.

Handling Blockers

Blockers are the enemy of progress. Effective owners develop strategies for handling them.

Prevention is best. Identify dependencies early. Initiate requests with lead time. Build relationships with people you’ll need help from.

Workarounds create options. When blocked on path A, is there a path B? Maybe it’s less elegant, but it keeps progress going.

Escalation is a tool. When you genuinely cannot unblock yourself, escalate promptly to someone who can. This isn’t failure; it’s effective use of organizational resources.

Time-boxing prevents stuckness. “I’ll try to resolve this myself for 2 hours. If I can’t, I’ll escalate.” This provides structure for situations that might otherwise stretch indefinitely.

The critical point: staying blocked is not neutral. Every day a blocker persists is a day of lost progress. The cost of attempting to resolve the blocker is almost always less than the cost of remaining stuck.

Managing Dependencies

AI projects depend on other teams: data engineering for data, platform teams for infrastructure, product teams for requirements, legal for compliance. Managing these dependencies is critical.

Map dependencies early. At project start, identify what you’ll need from other teams and when you’ll need it. This sounds obvious but is often skipped in the rush to start building.

Build relationships proactively. Know who your counterparts are on other teams. Have conversations before you need something. This makes requests easier when the time comes.

Provide lead time. Other teams have their own priorities. Requests made at the last minute are less likely to be fulfilled. Ask early.

Track explicitly. Maintain a list of dependencies with status, owner, and deadline. Review weekly. Follow up on anything at risk.

Have fallbacks. When possible, identify alternatives if a dependency falls through. “If data engineering can’t provide the feature by March 1, we can use this simpler feature set as a fallback.”
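
The explicit tracking described above does not need special tooling. The following sketch assumes a small in-memory list; the `Dependency` fields, the status strings, the 14-day warning window, and the sample entries are all illustrative choices. It flags open dependencies that are due soon or already late, which is the list to review each week.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Dependency:
    description: str
    owner: str
    needed_by: date
    status: str              # "not_started", "in_progress", or "done"
    fallback: str = ""       # alternative if the dependency falls through

def at_risk(dependencies, today, warning_days=14):
    """Return open dependencies due within warning_days, or already late."""
    cutoff = today + timedelta(days=warning_days)
    return [d for d in dependencies
            if d.status != "done" and d.needed_by <= cutoff]

# Hypothetical register; dates are illustrative.
today = date(2025, 2, 20)
deps = [
    Dependency("feature set from data engineering", "data eng lead",
               date(2025, 3, 1), "in_progress",
               fallback="simpler feature set"),
    Dependency("GPU deployment environment", "platform lead",
               date(2025, 4, 15), "not_started"),
    Dependency("legal review of data usage", "legal counsel",
               date(2025, 2, 10), "done"),
]

flagged = at_risk(deps, today)
for d in flagged:
    note = f" (fallback: {d.fallback})" if d.fallback else " (no fallback)"
    print(f"Follow up with {d.owner}: {d.description}{note}")
```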

Case Study: The Successful Launch

In 2021, a fintech company launched an AI-powered fraud detection system that reduced fraud losses by 60% while decreasing false positives by 25%. Here’s what they did right.

They front-loaded uncertainty. Before committing to a timeline, they ran three 1-week spikes: one on data quality, one on baseline model performance, one on latency and throughput. These spikes revealed that their data had quality issues that would take 2 weeks to resolve. They incorporated this into the timeline before promising dates to stakeholders.

They defined milestones with decision points. The project had four phases: data preparation, model development, integration, and rollout. Each phase ended with a decision: proceed as planned, adjust, or stop. At the end of model development, they discovered performance was 5% below target. Rather than pushing forward and hoping, they paused for a 1-week spike on model improvement. This spike identified a data augmentation approach that closed the gap.

They managed stakeholders actively. The project lead met weekly with stakeholders, presenting current status, upcoming milestones, and top risks. When the 2-week delay for data quality emerged, stakeholders were not surprised—they had been watching the project’s progress and understood the reasoning.

They deployed incrementally. Rather than a big-bang launch, they deployed first to 1% of traffic, then 10%, then 50%, then 100%. Each stage included monitoring and a rollback plan. At the 10% stage, they discovered a latency issue that only manifested under production load. They fixed it before it affected more traffic.

They measured impact. Before launch, they defined success metrics: fraud loss reduction, false positive rate, and system latency. They tracked these metrics carefully, demonstrating clear value to the organization.

The project succeeded not because the team was more brilliant than others, but because they managed uncertainty explicitly, communicated proactively, and delivered incrementally.
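
The incremental deployment in this case study follows a pattern that can be sketched in a few lines. This is not the company's actual tooling; `staged_rollout`, `is_healthy`, and `set_traffic` are hypothetical names standing in for whatever traffic-splitting and monitoring systems a team already has. The simulated check fails at the 10% stage, mirroring the latency issue described above.

```python
def staged_rollout(stages, is_healthy, set_traffic):
    """Raise traffic share stage by stage; roll back on the first failed check.

    stages: increasing traffic percentages, e.g. [1, 10, 50, 100]
    is_healthy: callable(percent) -> bool, the monitoring check for that stage
    set_traffic: callable(percent) that applies the traffic split
    Returns the percentage left in place when the function exits.
    """
    previous = 0
    for percent in stages:
        set_traffic(percent)
        if not is_healthy(percent):
            set_traffic(previous)  # roll back to the last known-good stage
            return previous
        previous = percent
    return previous

# Simulation: a problem that only shows up at 10% of traffic or more.
applied = []
result = staged_rollout(
    stages=[1, 10, 50, 100],
    is_healthy=lambda percent: percent < 10,
    set_traffic=applied.append,
)
print("traffic settings applied:", applied)
print("rollout stopped at:", result, "percent")
```

In practice each health check would watch metrics over a soak period rather than return immediately, and the rollout would resume from the failed stage once the issue is fixed.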


Scope Management

Scope Creep and How to Prevent It

Scope creep is the gradual expansion of project scope beyond initial boundaries. It is a common cause of missed deadlines and failed projects.

Scope creep happens because:

Requirements are discovered incrementally. As you build, you discover needs that weren’t initially apparent. “Now that we have fraud detection, we also need an appeals workflow.”

Stakeholders have expanding desires. Initial success creates appetite for more. “This is great! Can it also detect money laundering?”

Engineers have perfectionist tendencies. “I could make this 5% better if I just spent another week…” Quality improvements are valuable but also unbounded.

Boundaries were never clear. If scope was never explicitly defined, every new request looks like it belongs, and there is no baseline against which anyone can notice the growth.

Prevention requires explicit boundaries and active defense:

Define scope clearly at project start. Document what you ARE and ARE NOT doing. Review this with stakeholders.

Maintain a backlog for future scope. When new requests arise, don’t reject them—acknowledge them and add them to a future backlog. “That’s a great idea. Let’s add it to V2.”

Evaluate new requests against criteria. Before adding scope, ask: Does this serve the core problem? Is it necessary for MVP? What is the cost of delay? Most requests fail these tests.

Make scope changes visible. If scope must change, make it explicit: “We’re adding X, which will push the timeline by Y.” This creates awareness and forces conscious tradeoff decisions.

When to Cut Scope

Sometimes you must reduce scope to deliver on time. This is a feature, not a failure—it’s responsible project management.

The key is cutting wisely. Not all scope is equally valuable.

Prioritize by value and cost. Features that deliver high value at low cost should be protected. Features that deliver low value at high cost should be cut first.

Identify the core. What is the minimum set of features that delivers the essential value? Protect this core. Everything else is negotiable.

Consider dependency chains. Some features enable others. Cutting a foundational feature might force cutting everything that depends on it. Understand these dependencies.

Consult stakeholders. Scope decisions affect stakeholders. Include them in discussions. They may have preferences you don’t know about.

A decision framework for scope cutting:

  1. List all remaining features with estimated effort and value (even rough estimates help)
  2. Sort by value-to-effort ratio
  3. Draw a line where the cumulative effort of the sorted features reaches the available time
  4. Review with stakeholders for any strong objections
  5. Cut below the line, defer to next version
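
Steps 1 through 3 and step 5 of this framework are mechanical enough to sketch in code. The example below assumes rough numeric value scores and effort estimates in days; the feature names and numbers are hypothetical. It does not model dependency chains or the stakeholder review in step 4, which still require judgment.

```python
def cut_scope(features, available_days):
    """Split features into (keep, defer) by value-to-effort ratio.

    features: list of (name, value, effort_days) tuples with effort_days > 0.
    Features are taken in descending ratio order until the next one would
    exceed available_days; everything from that point on is deferred.
    """
    ranked = sorted(features, key=lambda f: f[1] / f[2], reverse=True)
    keep, defer = [], []
    used = 0
    line_drawn = False
    for name, _value, effort in ranked:
        if not line_drawn and used + effort <= available_days:
            keep.append(name)
            used += effort
        else:
            line_drawn = True
            defer.append(name)
    return keep, defer

# Hypothetical remaining features: (name, value score, effort in days).
features = [
    ("urgency classifier", 10, 10),
    ("confidence display", 4, 2),
    ("multi-language support", 5, 15),
    ("admin dashboard", 3, 6),
]

keep, defer = cut_scope(features, available_days=20)
print("keep:", keep)
print("defer to next version:", defer)
```

The output is a starting point for the stakeholder conversation, not a decision: a low-ratio feature may still be non-negotiable for reasons the scores do not capture.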

Managing Stakeholder Expectations

AI projects are particularly prone to mismatched expectations. Stakeholders imagine HAL 9000; you’re building a classifier.

Set expectations early. In project kickoff, explicitly discuss what the system will and won’t do. Be specific: “This will classify complaints into 5 categories. It will not understand complaint nuances or respond to customers.”

Demonstrate incrementally. Show work in progress. This prevents large surprises at the end and calibrates expectations along the way.

Frame limitations honestly. AI has limitations. Models make mistakes. Performance varies across conditions. Help stakeholders understand this is normal, not failure.

Use scenarios. Abstract capabilities are hard to assess. Specific scenarios are easier: “Given this input, the system will output that. Given this other input, it will fail in this way.”

Document and reference. When you explain limitations, document them. When stakeholders later have unrealistic expectations, you can reference the documentation.


Cross-Functional Collaboration

The Cast of Characters

AI projects typically involve many functions:

Product Managers define requirements and prioritize features. They care about user outcomes, business metrics, and timelines. They need from you: feasibility assessments, effort estimates, tradeoff analysis, and status updates.

Data Scientists/ML Engineers (if separate from your role) focus on model development. They care about model performance, experimentation infrastructure, and interesting technical challenges. They need from you: clear problem definitions, access to relevant data, and reasonable timelines.

Data Engineers build and maintain data pipelines. They care about data quality, system reliability, and manageable complexity. They need from you: clear data requirements, reasonable lead times, and stable specifications.

Platform/Infrastructure Teams provide compute, deployment, and monitoring capabilities. They care about system stability, standardization, and scalability. They need from you: resource projections, architectural alignment, and reasonable requests.

Legal and Compliance assess regulatory risk. They care about avoiding liability and meeting regulatory requirements. They need from you: clear explanations of what the system does, data usage details, and risk assessments.

Executives make resource allocation decisions. They care about business impact, risk, and return on investment. They need from you: clear value propositions, honest risk assessments, and progress visibility.

Effective Cross-Functional Communication

Different audiences need different communication.

For product managers: Lead with user and business impact. Explain technical concepts in terms of user-visible behavior. Be clear about tradeoffs: “We can have high accuracy or low latency, but not both. Which matters more for users?”

For executives: Lead with business impact and risk. Minimize technical details. Be concise: “This project will reduce fraud losses by 30%, with delivery in Q2. Key risk is data quality; we’re addressing it with a 2-week spike.”

For technical partners: Match their technical level. Be specific about what you need: “We need a daily batch export of the customer_events table in Parquet format, partitioned by date, with a 2-hour SLA.”

For legal/compliance: Explain what the system does in concrete terms. Be thorough about data usage: “We process customer complaints to classify urgency. We store the complaint text and classification. We don’t use data from customers who opted out.”

Building Relationships

Cross-functional work is fundamentally relationship work.

Invest before you need. Meet your counterparts before you need something from them. Understand their priorities. Help them when you can. This builds goodwill that pays off later.

Communicate regularly. Regular check-ins prevent surprises. A weekly 15-minute sync with a key partner is more effective than occasional long meetings when there’s a problem.

Be reliable. Follow through on commitments. If you can’t, communicate early. Reliability builds trust; broken commitments destroy it.

Seek to understand. Before pushing your perspective, understand theirs. What are they trying to achieve? What constraints are they under? Solutions that address everyone’s concerns are more likely to succeed.

Handling Disagreements

Disagreements are inevitable and healthy. Handle them constructively.

Separate positions from interests. The product manager demands feature X; you think feature Y is better. But what does the PM actually need? Maybe they need to hit a metric that either feature would serve. Understanding interests opens solution space.

Focus on data when possible. Opinions conflict; data is neutral. “Let’s run an A/B test” often resolves disagreements better than debate.

Acknowledge legitimate concerns. Your counterpart’s concerns are legitimate, even if you disagree with their preferred solution. Acknowledge this: “I understand you’re worried about timeline. Let me explain why I think this approach actually delivers faster.”

Know when to escalate. Some disagreements cannot be resolved between the parties. Escalate constructively: present both perspectives fairly, explain what you’ve tried, and ask for help resolving it.

Disagree and commit. Once a decision is made, commit to it fully even if you disagreed. Undermining decisions creates dysfunction. If you genuinely believe the decision is catastrophic, escalate that—but most disagreements are not catastrophic.


Delivering Results

Staff Engineer Perspective

“My promotion to Staff came from one project, and it wasn’t the technically hardest thing I did that year. It was a personalization launch where I owned everything from the model spec to the rollout plan to the customer support training deck. Three engineers built more impressive components inside it. They’re still Senior. The difference: when the launch slipped a week because legal flagged a data retention issue, I noticed, escalated, and rewrote the migration plan over a weekend. The feature shippers ship features; the project owner notices the things nobody assigned them. If you want the next level, find the unowned seam in your current project and pick it up before someone asks.”

Staff Engineer at a consumer marketplace

Definition of Done

Shipping code is not the same as delivering value. A feature is done when:

It works in production. Not on your machine, not in staging—in production, with real traffic.

It’s monitored. You know when it breaks. Alerts fire if something goes wrong. Dashboards show key metrics.

It’s documented. Others can understand, maintain, and modify it. Runbooks exist for operational scenarios.

It’s validated. Users can accomplish what the feature intended. The success metrics are being measured.

It’s supported. The on-call team knows about it and can handle incidents. There’s a path for bug reports and feature requests.

Until all of these are true, you’re not done.

Measuring Impact

Shipping without measuring is like throwing a dart blindfolded. You might hit the target, but you won’t know.

Define metrics before launch. What will success look like? Be specific: “Success means a 15% reduction in average handle time for support tickets.”

Measure baselines. You can’t demonstrate improvement without knowing where you started. Measure the metric before launch.

Monitor continuously. Don’t check once and move on. Track metrics over time. Initial results may not predict long-term performance.

Distinguish signal from noise. Metrics fluctuate. Short-term changes may be noise. Use statistical methods to determine if changes are significant.

Tie to business outcomes. Model accuracy is a means, not an end. What business outcome does improved accuracy drive? Measure that.
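
To make "distinguish signal from noise" concrete, the sketch below applies a two-proportion z-test to a before-and-after rate, using the 15% to 10% abandonment example from earlier in the chapter. The sample sizes are hypothetical, the function name is illustrative, and the test relies on a normal approximation, so treat it as a first check rather than a full experimental analysis.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). Uses the pooled-proportion standard error and
    the normal approximation, so it assumes reasonably large samples.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data; test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Same rates (15% before, 10% after), two hypothetical sample sizes.
z_large, p_large = two_proportion_z_test(150, 1000, 100, 1000)
z_small, p_small = two_proportion_z_test(15, 100, 10, 100)

print(f"1000 chats per period: z={z_large:.2f}, p={p_large:.4f}")
print(f" 100 chats per period: z={z_small:.2f}, p={p_small:.4f}")
```

The same five-point drop is convincing with a thousand chats per period and inconclusive with a hundred, which is why the sample size belongs in any impact claim.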

Post-Launch Responsibilities

Launch is the beginning of a new phase, not the end of the project.

Monitor actively. In the hours and days after launch, watch closely. Be ready to respond quickly if something goes wrong.

Gather feedback. Talk to users. Review support tickets. Analyze usage patterns. Reality often differs from expectations.

Iterate. Use what you learn to improve. Plan follow-up work based on real-world performance.

Stabilize. Address operational issues. Fix bugs. Improve monitoring. Leave the system in a maintainable state.

Document. Record what you built, how it works, and what you learned. Future maintainers (including future you) will thank you.

Retrospectives

Every significant project deserves reflection: what went well, what didn’t, and what to do differently next time.

Run promptly. Hold the retrospective within 1-2 weeks of project completion, while memories are fresh.

Include all key participants. Engineers, PM, design, and other stakeholders who were involved. Multiple perspectives enrich the discussion.

Focus on learning, not blame. The goal is improvement, not assigning fault. “What should we do differently?” not “Whose fault was it?”

Be specific. “Communication was bad” is not actionable. “We needed a weekly sync between engineering and data science to catch integration issues earlier” is actionable.

Follow through. Capture action items with owners and deadlines. Follow up on them. Retrospectives without follow-through are wasted time.


Practical Exercises

Exercise 1: Project Scoping

Take this vague requirement and create a full project scope:

“We need to use AI to help our customer support team handle tickets more efficiently.”

Deliverables:

  1. Problem statement with measurable success criteria
  2. In-scope and out-of-scope lists
  3. Key assumptions
  4. Major uncertainties with spike plans
  5. Tiered targets (minimum, goal, stretch)
  6. MVP definition

Self-Assessment Questions:

  - Does your problem statement describe a user/business need, not a solution?
  - Are your success criteria measurable with specific numbers?
  - Would a stakeholder reading your scope document know exactly what you ARE and ARE NOT building?
  - Have you identified at least 3 significant uncertainties?
  - Does your MVP deliver value on its own, or is it just a partial implementation?
  - Could another engineer take your scope document and start working without asking clarifying questions?

Quality Indicators:

  - Strong: Success criteria tied to business outcomes (e.g., “reduce average resolution time from 24 to 12 hours”)
  - Weak: Success criteria tied only to model metrics (e.g., “achieve 90% classification accuracy”)
  - Strong: Out-of-scope list is specific (e.g., “non-English tickets, refund processing, phone support”)
  - Weak: Out-of-scope is vague (e.g., “other features”)

Exercise 2: Estimation Practice

For the project you scoped in Exercise 1:

  1. Break the work into tasks
  2. Estimate each task using three-point estimation (optimistic, likely, pessimistic)
  3. Calculate the expected duration and 95% confidence interval
  4. Write a stakeholder communication that conveys the estimate with appropriate uncertainty

Self-Assessment Questions:

  - Did you include tasks for data exploration, integration, and monitoring, not just model development?
  - Is your optimistic estimate achievable if everything goes perfectly?
  - Is your pessimistic estimate realistic for a worst-case scenario (not just “it’s impossible”)?
  - Does your stakeholder communication explain WHY there is uncertainty, not just state the range?
  - Would your estimate change significantly if you asked a colleague to estimate the same work?

Quality Indicators:

  - Strong: Task breakdown includes non-obvious work (data audit, stakeholder alignment, documentation)
  - Weak: Task breakdown only includes the “happy path” technical work
  - Strong: Stakeholder communication offers scenarios (e.g., “if data is clean… if we need additional labeling…”)
  - Weak: Stakeholder communication just gives a range without context
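
For step 3 of this exercise, one common convention is the PERT approximation, which weights the likely estimate four times as heavily as the extremes. The sketch below is one possible approach, not the only valid one: it assumes tasks are independent and that the total is roughly normally distributed, and the function name and task values are hypothetical.

```python
import math

def pert_summary(tasks, z=1.96):
    """Combine three-point task estimates using the PERT approximation.

    tasks: list of (optimistic, likely, pessimistic) durations.
    Returns (expected_total, low, high), where low/high form an approximate
    95% interval assuming independent tasks and a roughly normal total.
    """
    mean_total = 0.0
    variance_total = 0.0
    for optimistic, likely, pessimistic in tasks:
        mean_total += (optimistic + 4 * likely + pessimistic) / 6
        variance_total += ((pessimistic - optimistic) / 6) ** 2
    spread = z * math.sqrt(variance_total)
    return mean_total, mean_total - spread, mean_total + spread

# Hypothetical tasks: (optimistic, likely, pessimistic) in days.
example_tasks = [(2, 4, 12), (1, 2, 3), (3, 5, 13)]

expected, low, high = pert_summary(example_tasks)
print(f"expected: {expected:.1f} days, ~95% interval: {low:.1f} to {high:.1f} days")
```

The independence assumption tends to understate uncertainty in AI work, where one data problem can delay several tasks at once, so a wider range than the formula gives is often the more honest answer.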

Exercise 3: Stakeholder Communication

For the same project, prepare three versions of the project summary:

  1. A 1-paragraph executive summary for a VP
  2. A 1-page scope document for a Product Manager
  3. A technical design overview for the Data Engineering team

Self-Assessment Questions:

  - Does the executive summary lead with business impact and fit in 3-4 sentences?
  - Does the PM document focus on user outcomes and include clear success metrics?
  - Does the technical document include specific data requirements, formats, and SLAs?
  - Did you match the level of jargon to each audience?
  - Would each stakeholder get what they need to make decisions from your document?

Quality Indicators:

  - Strong (Executive): “This project will reduce support costs by 20% ($2M annually) by automating ticket routing.”
  - Weak (Executive): “This project uses a transformer-based classifier to categorize tickets.”
  - Strong (Technical): “We need daily batch export of support_tickets table in Parquet format, ~100GB/day, available by 6am UTC.”
  - Weak (Technical): “We need access to the tickets data.”

Exercise 4: Failure Analysis

Research a well-documented AI or software project failure (e.g., Microsoft Tay, Amazon’s hiring AI, healthcare.gov, or a case from the AI Incident Database). Analyze:

  1. What went wrong?
  2. What project management failures contributed?
  3. What should have been done differently?
  4. What would you do in a similar situation?

Self-Assessment Questions:

  - Did you distinguish between technical failures and project management failures?
  - Did you identify the earliest point where the failure could have been detected or prevented?
  - Are your recommendations specific and actionable (not just “better testing”)?
  - Did you consider organizational and cultural factors, not just process?

Quality Indicators:

  - Strong: Identifies specific decision points where intervention was possible
  - Weak: Lists problems without tracing them to root causes
  - Strong: Recommendations reference specific practices from this chapter
  - Weak: Generic recommendations like “communicate better”

Exercise 5: Risk Assessment

You’re starting a new AI project: building a recommendation system for an e-commerce platform. The system should increase average order value by 10%.

  1. Identify at least 8 risks across all four categories (data, model, integration, organizational)
  2. Assess each risk using the likelihood × impact matrix
  3. Create a risk register with mitigation strategies for high/critical risks
  4. Define early warning indicators for your top 3 risks

Self-Assessment Questions:

  • Did you identify risks in ALL four categories, not just technical ones?
  • Are your likelihood and impact assessments justified, not just gut feelings?
  • Does each high/critical risk have a specific owner and concrete mitigation action?
  • Are your early warning indicators leading indicators (not just “the risk happened”)?
  • Did you identify any risks that might kill the project entirely?

Quality Indicators:

  • Strong: Risk register includes contingency plans (“if mitigation fails, we will…”)
  • Weak: Risk register only has mitigation, no backup
  • Strong: Early warning indicators are measurable (e.g., “annotator agreement drops below 80%”)
  • Weak: Early warning indicators are vague (e.g., “data quality seems bad”)
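If it helps to keep the register honest, the likelihood × impact scoring in this exercise can be sketched in a few lines of Python. This is only an illustration: the 1-5 scales, the band cutoffs, the field names, and the sample risks are all assumptions made for the example, not values prescribed by this chapter.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    category: str            # "data", "model", "integration", or "organizational"
    likelihood: int          # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int              # assumed scale: 1 (negligible) to 5 (project-ending)
    owner: str = ""
    mitigation: str = ""
    early_warning: str = ""

    @property
    def score(self) -> int:
        return self.likelihood * self.impact

    @property
    def level(self) -> str:
        # Hypothetical cutoffs; adjust to match your own matrix.
        if self.score >= 20:
            return "critical"
        if self.score >= 12:
            return "high"
        if self.score >= 6:
            return "medium"
        return "low"


def needs_mitigation(register: list[Risk]) -> list[Risk]:
    """Return the high/critical risks, highest score first."""
    flagged = [r for r in register if r.level in ("high", "critical")]
    return sorted(flagged, key=lambda r: r.score, reverse=True)


def missing_fields(risk: Risk) -> list[str]:
    """List the register fields this risk still lacks."""
    return [f for f in ("owner", "mitigation", "early_warning") if not getattr(risk, f)]


# Made-up entries for the recommendation-system scenario.
register = [
    Risk("Sparse purchase history for new users", "data", 4, 3,
         owner="data lead",
         mitigation="fall back to popularity ranking",
         early_warning="share of sessions with fewer than 3 events exceeds 40%"),
    Risk("Offline gains do not transfer to order value", "model", 3, 5),
    Risk("Latency budget exceeded at checkout", "integration", 2, 4),
    Risk("Merchandising team overrides rankings", "organizational", 2, 2),
]

for risk in needs_mitigation(register):
    print(risk.level, risk.score, risk.name, "missing:", missing_fields(risk))
```

Running the gap check before a review makes it obvious which high/critical risks still have no owner, mitigation, or measurable early warning indicator.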

Exercise 6: Project Lifecycle Planning

Design the project lifecycle for a 3-month AI project: building a document classification system for a legal firm.

  1. Define activities for each of the five phases
  2. Specify gate criteria for each phase transition
  3. Create a timeline showing phase durations and overlaps
  4. Identify which anti-patterns this project might be susceptible to and how you’ll prevent them

Self-Assessment Questions:

  • Does your Discovery phase answer “should we do this?” before committing resources?
  • Does your Feasibility phase include concrete spikes with specific questions?
  • Do your gate criteria include both technical and non-technical factors?
  • Is your timeline realistic given that AI development is iterative?
  • Did you build in decision points where you could stop or pivot?

Quality Indicators:

  • Strong: Gate criteria are specific (e.g., “baseline model achieves >80% accuracy on held-out set”)
  • Weak: Gate criteria are subjective (e.g., “model looks good”)
  • Strong: Anti-pattern prevention includes specific mechanisms (e.g., “ship date is fixed; scope is variable”)
  • Weak: Anti-pattern prevention is just awareness (e.g., “we’ll avoid scope creep”)


Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] What makes AI project estimation harder than traditional software estimation? How should you adapt your approach?

Answer

AI-specific challenges: (1) Unknown feasibility: Traditional software asks “how long to build X?” AI asks “can X even work?” (2) Iterative discovery: You don’t know final approach until experiments complete. (3) Data dependencies: Data quality issues emerge during development. (4) Non-deterministic outputs: Success criteria are harder to define and verify. (5) Performance plateau: Improvement slows unpredictably.

Adaptation: (1) Estimate in ranges with explicit uncertainty. (2) Front-load spikes to resolve unknowns before committing. (3) Use reference class forecasting—how long did similar projects take? (4) Define go/no-go gates with concrete criteria. (5) Treat estimates as hypotheses to test, not commitments. (6) Track estimation accuracy and calibrate over time.
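As a rough illustration of points (1) and (3), a reference class can be turned into a range by reading percentiles off the durations of comparable past projects. The sketch below uses a simple nearest-rank percentile; the historical durations, the function names, and the choice of the 50th and 80th percentiles are assumptions for the example, not a method this chapter prescribes.

```python
from __future__ import annotations

import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("need at least one reference project")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def estimate_range(past_weeks: list[float]) -> dict[str, float]:
    """Summarize a reference class as a range instead of a single number."""
    return {
        "likely": percentile(past_weeks, 50),   # about half of similar projects finished by here
        "commit": percentile(past_weeks, 80),   # a more cautious figure for external commitments
        "worst_seen": max(past_weeks),
    }


# Hypothetical durations (in weeks) of ten comparable past projects.
past_weeks = [6, 8, 9, 10, 12, 14, 15, 18, 22, 30]
print(estimate_range(past_weeks))
```

The gap between the "likely" and "commit" figures is itself useful to communicate: a wide gap signals that a spike is worth running before anyone commits to a date.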

Q2. [IC2] What is the difference between a spike and a prototype? When would you use each?

Answer

Spike: Time-boxed investigation to answer a specific question. Output is knowledge, not code. “Can we achieve 90% accuracy with our data?” Results inform go/no-go decision. Throwaway—not intended for production.

Prototype: Early version of the actual system to test viability. Output is working code and user feedback. “Does this UX work for users?” Code may evolve into production (or be thrown away). May be shown to stakeholders.

When to use spikes: (1) Technical feasibility unknown. (2) Before committing to an approach. (3) When the question is clear but answer is uncertain.

When to use prototypes: (1) Testing user experience or product-market fit. (2) When stakeholders need to see something tangible. (3) When iteration with users is needed to refine requirements.

Common mistake: Building “prototypes” that are really spikes—code that answers a question but isn’t suitable for production, yet becomes production anyway.

Q3. [Senior] Explain the “ownership mindset” for AI projects. What behaviors distinguish someone who owns outcomes vs someone who just completes tasks?

Answer

Task completion: “I did my part. The model training job ran. Not my problem if deployment failed.”

Outcome ownership: “The project needs to deliver value to users. What’s blocking that? How can I help?”

Distinguishing behaviors:

Proactive problem-finding: Owners notice gaps before being told. “I realized we don’t have a plan for model updates.”

Scope awareness: Owners understand the full picture, not just their slice. “My model works, but integration with the product is struggling.”

Blocker removal: Owners unblock themselves or escalate effectively. They don’t wait passively.

Stakeholder communication: Owners communicate status proactively and surface bad news early.

Quality of outcome, not effort: Owners ask “did this work?” not “did I work hard?”

Ownership over credit: Owners focus on project success, not personal attribution.

Developing ownership: Take on projects with clear end-to-end responsibility. Ask “what does success look like?” before “what’s my task?”

Q4. [Senior] How do you handle scope creep in AI projects? What if the new scope is genuinely valuable?

Answer

Preventing scope creep: (1) Clear scope document with explicit boundaries. (2) “Out of scope” section in project definition. (3) Agreement upfront: new features go to backlog. (4) Tiered targets: what’s minimum, what’s stretch? (5) Regular scope reviews with stakeholders.

When new requests arrive: (1) Acknowledge the value: “That’s a great idea for V2.” (2) Clarify impact: “Adding X would push launch from March to May.” (3) Present tradeoffs: “We could add X if we drop Y.” (4) Get explicit decision: Don’t just absorb scope; make it a conscious choice.

If new scope is genuinely valuable: (1) Is it more valuable than original scope? Consider pivoting. (2) Is it additive? Defer to next version. (3) Does it change success criteria? Revisit project goals. (4) Does timeline allow it? Scope is variable; date may not be.

Key principle: Scope creep often kills projects not through any single addition but through cumulative drift. Make every scope change a deliberate decision, not gradual expansion.

Q5. [Staff] How do you balance the need to ship quickly with the uncertainty inherent in AI projects? What’s your framework for deciding when to ship vs when to iterate?

Answer

The tension: AI needs iteration to improve, but you also need to ship to learn from real users.

Framework:

  1. Define minimum quality bar: What’s the threshold below which shipping causes harm (to users, reputation, trust)? Don’t ship below this.

  2. Define value threshold: What’s the level at which the product is useful to users? Ship at this point, even if not perfect.

  3. Iterate in production: Once above minimum bar and at value threshold, ship and iterate. Real feedback > more development.

  4. Time-box development: “We ship on [date] with whatever we have that’s above minimum bar.” Prevents indefinite improvement chasing.

  5. Define rollback criteria: What would trigger pulling the feature back?
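The first four steps of the framework can be written down as a small decision rule, which forces the thresholds to be stated explicitly before launch pressure builds. This is a sketch under stated assumptions: the single `quality` number, the `ShipCriteria` fields, and the example thresholds are hypothetical, and real projects usually track several metrics at once.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ShipCriteria:
    minimum_bar: float       # below this, shipping would cause harm
    value_threshold: float   # at or above this, the product is useful to users
    ship_date: date          # time-box for development


def ship_decision(quality: float, today: date, criteria: ShipCriteria) -> str:
    """Apply the minimum bar, value threshold, and time-box in that order."""
    if quality < criteria.minimum_bar:
        return "do not ship"                      # never ship below the minimum bar
    if quality >= criteria.value_threshold:
        return "ship and iterate in production"   # useful enough; learn from real users
    if today >= criteria.ship_date:
        return "ship at time-box"                 # above the bar and out of time
    return "keep iterating"


# Hypothetical thresholds for illustration.
criteria = ShipCriteria(minimum_bar=0.80, value_threshold=0.90, ship_date=date(2025, 3, 1))
print(ship_decision(0.85, date(2025, 2, 1), criteria))
print(ship_decision(0.85, date(2025, 3, 1), criteria))
```

Rollback criteria (step 5) would be checked after launch against production metrics rather than in this pre-launch rule.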

Practical tactics: (1) Feature flags for gradual rollout. (2) A/B testing with hold-out group. (3) Explicit success metrics before launch. (4) Postmortem trigger criteria.

Red flags you’re iterating too long: (1) Diminishing improvements per iteration. (2) No user feedback in the loop. (3) Scope expanding with each iteration. (4) Original problem still not solved.

Spot the Problem

Problem 1. [IC2] A project plan:

“Phase 1: Data collection (2 weeks)
Phase 2: Model training (3 weeks)
Phase 3: Evaluation (1 week)
Phase 4: Deployment (2 weeks)
Total: 8 weeks”

What’s wrong with this plan?

Answer

Problems: (1) Waterfall structure: AI projects are iterative. You don’t collect all data, then train once, then evaluate once. You iterate. (2) No overlap: Data issues emerge during training; evaluation reveals need for more data. Phases should overlap. (3) No contingency: What if training takes 5 weeks? What if evaluation fails? (4) Missing phases: Where’s discovery? Integration? User testing? (5) No decision points: When do you decide if the project is feasible? (6) No risk buffer: AI projects routinely surprise.

Better structure: (1) Week 1-2: Discovery + initial data assessment + spike on model feasibility. GO/NO-GO decision. (2) Week 3-6: Iterative development (data → train → evaluate → repeat). (3) Week 6-7: Integration + user testing. (4) Week 8: Buffer + deployment. With explicit gates: “Continue only if accuracy >80% by week 4.”

Problem 2. [Senior] A project status update:

“Status: Yellow. We discovered the training data has quality issues. The data team is working on it. We’re waiting for cleaned data before we can continue. Expected delay: TBD.”

What should the project owner do differently?

Answer

Problems with current approach: (1) Passive waiting: Owner is not driving resolution. (2) “TBD” delay: Unquantified uncertainty is a planning failure. (3) Single-threaded: All work blocked on one dependency. (4) No escalation: “Yellow” status without action plan or ask.

Better approach:

Immediate: (1) Understand the data issue specifically. What’s wrong? How much data is affected? (2) Quantify the delay: Best/worst case. What determines which? (3) Explore workarounds: Can we proceed with a subset? Can we use synthetic data? Can we change the approach?

Communication: “Status: Yellow. Data quality issue affects 30% of training data. Root cause: [specific]. Data team estimates 1-2 weeks to fix. In parallel, we’re training on clean subset to validate approach. Risk: If data issue is deeper, we may need 3+ weeks. Mitigation: [Plan B]. Ask: Need data team prioritization—can you help?”

Ownership: Don’t wait passively. Drive toward resolution. Always have a Plan B.

Problem 3. [Staff] An AI project shipped 6 months ago. The team has moved on. A support ticket arrives:

“The model recommendations have gotten worse over the past month. Users are complaining.”

Investigation reveals no one is monitoring model quality in production. What went wrong and how would you prevent this?

Answer

What went wrong: (1) Project ended at deployment, not at sustained value delivery. (2) No production monitoring plan. (3) No ownership handoff or ongoing responsibility. (4) Data/model drift not anticipated. (5) Success metrics not continuously monitored.

Prevention:

Before deployment: (1) Define production success metrics and monitoring. (2) Set up automated alerts for quality degradation. (3) Plan for model updates (frequency, triggers, process). (4) Document baseline metrics for comparison.

At deployment: (1) Verify monitoring is working, not just deployed. (2) Establish ownership for production health. (3) Define escalation path for quality issues. (4) Schedule regular health check reviews.

After deployment: (1) Monitor continuously, not on user complaint. (2) Regular model evaluation against fresh data. (3) Planned refresh/retrain cycles. (4) Quarterly review of model performance.

Ownership lesson: A project isn’t complete when code ships. It’s complete when value is sustained. Building without a maintenance plan is setting up for future failure.
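A minimal version of "document baseline metrics" plus "automated alerts for quality degradation" might look like the sketch below. The metric (a click-through rate on recommendations), the 5% tolerance, and the function name are assumptions chosen for illustration; a production setup would pull these values from whatever monitoring system the team already uses.

```python
from __future__ import annotations

from statistics import mean


def check_against_baseline(
    baseline: float,
    recent_values: list[float],
    max_relative_drop: float = 0.05,
) -> dict:
    """Compare the recent average of a quality metric with its documented baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if not recent_values:
        raise ValueError("no recent measurements: monitoring itself may be broken")
    current = mean(recent_values)
    relative_drop = (baseline - current) / baseline
    return {
        "current": current,
        "relative_drop": relative_drop,
        "alert": relative_drop > max_relative_drop,
    }


# Hypothetical numbers: baseline click-through rate recorded at launch,
# followed by three recent weekly measurements.
report = check_against_baseline(baseline=0.20, recent_values=[0.19, 0.18, 0.17])
if report["alert"]:
    print(f"Quality dropped {report['relative_drop']:.0%} below baseline; escalate to the owner.")
```

Note that an empty measurement list raises an error rather than passing silently, which reflects the point about verifying that monitoring is working, not just deployed.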

Self-Assessment: Project Ownership Readiness

Rate yourself on each dimension (1 = novice, 5 = expert):

Scoping and Planning

  • [ ] I can translate vague requirements into specific, measurable project scope
  • [ ] I identify uncertainties upfront and plan spikes to resolve them
  • [ ] I create realistic estimates with appropriate uncertainty ranges
  • [ ] I define MVP that delivers value while managing risk

Execution and Delivery

  • [ ] I notice gaps and problems proactively, not just when assigned
  • [ ] I remove blockers rather than waiting for them to resolve
  • [ ] I communicate status proactively to stakeholders
  • [ ] I deliver incrementally and measure impact

Risk Management

  • [ ] I identify risks across technical and organizational dimensions
  • [ ] I create mitigation plans for high-impact risks
  • [ ] I monitor for early warning signs
  • [ ] I escalate appropriately when risks materialize

Cross-Functional Collaboration

  • [ ] I communicate effectively with different audiences (exec, PM, engineering)
  • [ ] I build relationships before I need them
  • [ ] I handle disagreements constructively
  • [ ] I understand and address partner concerns

Growth Areas: Based on your self-assessment, identify 2-3 specific areas to develop. For each:

  1. What specific situation would let you practice this skill?
  2. What would success look like?
  3. How will you know you’ve improved?


Summary

AI projects fail more often than they succeed—not because the technology doesn’t work, but because the projects are not managed for the unique challenges AI presents.

Uncertainty is intrinsic. AI projects have higher and longer-lasting uncertainty than traditional software. Accept this reality and plan accordingly: use spikes to resolve unknowns, estimate in ranges, and build decision points into your timeline.

Ownership means outcomes. Don’t just complete your tasks; ensure the project succeeds. Notice gaps, surface problems, remove blockers, and communicate proactively. The project’s success is your responsibility.

Scoping prevents failure. Define clear boundaries, measurable success criteria, and explicit assumptions. Acknowledge what you’re NOT doing. Resist scope creep by deferring new requests to future versions.

Estimation requires humility. AI estimation is hard. Ground estimates in reference classes and empirical spikes. Communicate uncertainty honestly. Track your accuracy and improve over time.

Cross-functional collaboration is essential. Understand what your partners care about. Communicate in their language. Build relationships before you need them. Handle disagreements constructively.

Delivery is measured by impact. Shipping code is necessary but not sufficient. Define success metrics. Measure baselines. Monitor continuously. Iterate based on what you learn.

The engineers who drive successful AI projects are not necessarily the ones with the deepest technical expertise. They are the ones who own outcomes, navigate uncertainty skillfully, and relentlessly drive toward delivering real value.

Connections to Other Chapters

  • Chapter 15 (MLOps & Evaluation) covers the operational infrastructure that supports project delivery
  • Chapter 21 (Deepening Technical Expertise) provides the foundation for making sound technical decisions during project execution
  • Chapter 23 (Technical Communication) covers communication skills essential for stakeholder management and cross-functional work
  • Chapter 24 (Mentorship Foundations) addresses how to help others develop ownership capabilities
  • Chapter 25 (System Design at Scale) explores architectural decisions that affect project scope and delivery
  • Chapter 23 (Technical Decision Making) provides decision frameworks for complex technical choices during projects

Further Reading

Essential

  • Brooks (1975), “The Mythical Man-Month” - Classic on estimation, complexity, and team dynamics.
  • Sculley et al. (2015), “Hidden Technical Debt in Machine Learning Systems” - Why AI projects require different management.
  • Kahneman (2011), “Thinking, Fast and Slow” - Cognitive biases including the planning fallacy.

Deep Dives

  • Amershi et al. (2019), “Software Engineering for Machine Learning: A Case Study” - Microsoft’s analysis of ML project practices.
  • McConnell (2006), “Software Estimation” - Foundations for estimation that can be adapted to AI projects.

Practical Resources

  • AI Incident Database (incidentdatabase.ai) - Documented AI failures and failure modes.