Chapter 28: Research-to-Production

Keywords

research, paper evaluation, reproducibility, productionization, benchmarking, ablation studies

Introduction

In late 2022, a team at a mid-sized fintech company read a paper that promised a 40% latency reduction for their transformer-based fraud detection system. The technique, a novel attention mechanism, had impressive benchmark numbers on academic datasets. The team spent three months implementing it from scratch, carefully matching every detail in the paper. When they finally deployed to production, the results were devastating: latency increased by 60%, memory usage doubled, and the model’s accuracy dropped on their real-world data. The paper wasn’t wrong—it just solved a different problem than theirs.

This scenario plays out constantly across the industry. The gap between research and production isn’t a bug; it’s a fundamental feature of how the two worlds operate. Researchers optimize for novelty, clean experimental conditions, and benchmark performance. Engineers optimize for reliability, efficiency, and real-world robustness. A paper can be scientifically valid and technically impressive while being completely wrong for your production system.

Yet the opposite mistake is equally costly: ignoring research entirely. Flash Attention went from paper to industry standard in about a year. Teams that adopted it early saw 2-4x throughput improvements. Those who waited “for it to mature” lost ground to competitors. The same pattern repeated with LoRA fine-tuning, PagedAttention, and speculative decoding.

The skill you need isn’t reading papers or building systems—it’s bridging the two. This chapter teaches you to read research with a production lens, evaluate ideas for real-world viability, prototype efficiently, and know when something is ready for prime time versus when it’s not worth the risk.

The Core Tension: What Researchers Optimize For vs. What Engineers Need

To bridge research and production, you must first understand why the gap exists. It’s not incompetence on either side—it’s fundamentally different optimization targets.

What researchers optimize for:

Novelty: Academic incentives reward new ideas over incremental improvements. A 2% accuracy gain with a completely new architecture gets published; a 5% gain from careful engineering of existing approaches often doesn’t.
Clean experimental conditions: Papers typically evaluate on standard benchmarks with well-curated data, fixed computational budgets, and controlled settings. This enables fair comparisons but diverges from production reality.
Benchmark performance: Researchers target metrics that the community has agreed upon. But BLEU scores, perplexity, and accuracy on MMLU may not correlate with what matters for your application.
Publication speed: The race to publish means papers often skip extensive ablations, robustness testing, or scalability analysis. “Future work” sections are filled with unaddressed concerns.

What production engineers need:

Reliability: A technique that works 99.9% of the time beats one that’s 5% better on average but fails catastrophically 1% of the time.
Efficiency at scale: Research prototypes are typically evaluated offline on fixed test sets. Production systems handle millions of requests with strict latency SLOs.
Integration simplicity: The best technique in isolation may be worse than a simpler one that fits your existing architecture.
Maintainability: Code that one PhD student understands isn’t the same as code your team can debug at 3 AM during an outage.
Robustness to distribution shift: Academic benchmarks are static. Production data drifts, contains adversarial inputs, and surprises you constantly.

This tension isn’t resolvable—it’s fundamental. Your job is to navigate it skillfully.

A Mental Model for Research Translation

Think of research papers as raw ore from a mine. The ore contains valuable material, but it’s mixed with rock, requires processing, and isn’t immediately usable. Some ore is rich and easy to refine; some is mostly waste. The skill is knowing which veins to pursue and how to extract value efficiently.

The translation process has three phases:

Prospecting: Scanning the landscape to identify promising claims. Most papers aren’t relevant to you—the skill is filtering efficiently.
Assaying: Deep evaluation of promising candidates. Does the ore actually contain what it claims? Will it refine into something useful?
Refining: Transforming research into production-ready implementations. This is where most efforts fail—not from lack of skill, but from underestimating the work required.

What You’ll Learn

How to read ML papers efficiently and evaluate claims critically
Why papers often don’t reproduce, and what to do about it
Decision frameworks for evaluating production viability
Prototyping methodology that validates ideas with minimal investment
When to adopt, when to wait, and when to abandon research directions
Case studies of techniques that translated well—and ones that didn’t

Prerequisites

Strong ML fundamentals (Part II)
Production systems experience (Part III)
Familiarity with your organization’s technical context and constraints

Reading Research Papers

Why Papers Are Hard to Read (And Why You Must Read Them Anyway)

You can’t rely on blog posts and Twitter threads alone. By the time research is popularized, it’s either already implemented in frameworks or has been superseded. The distillation process also loses crucial nuance—the limitations section that explains when the technique fails, the ablation that shows which components actually matter, the hyperparameter sensitivity that determines whether you can reproduce results.

Reading papers gives you early access to techniques 6-12 months before they’re mainstream. More importantly, it gives you the deep understanding needed to predict whether something will work in your context.

But papers are notoriously difficult to read. Academic writing optimizes for different goals than technical documentation: signaling expertise to reviewers, establishing priority in a research area, and compressing complex ideas into page limits. The resulting prose is dense, assumes shared context, and often obscures as much as it reveals.

The Three-Pass Reading Method

Efficient paper reading is about knowing when to stop. Most papers deserve only minutes of your time; a few deserve hours. The three-pass method allocates attention appropriately.

First Pass (5-10 minutes): Survey the territory

Read: Title, abstract, introduction (especially the last paragraph), section headings, figures/tables, conclusion.

Answer these questions:

What is the paper claiming?
What problem does it claim to solve?
Is this relevant to my work?
Is the claim plausible given what I know?

The vast majority of papers fail this filter. Either they solve a different problem than yours, or the claims seem implausible, or the approach is incompatible with your constraints. A paper on improving transformer training is irrelevant if you’re using pre-trained models exclusively.

When to stop: If the paper isn’t relevant or the claims seem obviously flawed, stop here. You’ve spent five minutes and learned enough.

Second Pass (30-60 minutes): Understand the approach

Read: Full paper, but skim derivations and proofs. Focus on figures, tables, and the methods section.

Answer these questions:

What is the main technique? Can you explain it in one paragraph?
What are the key results? What baselines are compared?
What are the stated limitations? What’s conspicuously missing?
Could you implement this at a high level?

During this pass, you’re building a mental model of what the paper does without getting lost in details. Mark sections you’ll need to return to if you proceed.

When to stop: If the technique is less novel than claimed, results aren’t compelling compared to simpler approaches, or limitations make it unsuitable for your use case.

Third Pass (2-4 hours): Deep understanding

Read: Everything, including appendices and supplementary material. Re-derive key equations. Map their code (if available) to the paper.

Answer these questions:

Could you reimplement this from scratch?
What are the hidden assumptions and hyperparameters?
What would break this approach?
What did the ablations actually show?

This pass is expensive. You should only reach it for papers you’re seriously considering implementing.

Critical Reading: What Papers Don’t Tell You

Papers are advocacy documents. Authors want their work published, cited, and recognized. This doesn’t make them dishonest—but it does mean they present their work in the best possible light. Critical reading identifies what’s missing or framed favorably.

Red Flag 1: Weak or missing baselines

Papers should compare against the best available alternatives, properly tuned. Watch for:

Baselines that are outdated (comparing to methods from 2+ years ago in a fast-moving field)
Baselines with default hyperparameters when the proposed method is carefully tuned
Missing obvious baselines (e.g., a retrieval paper that doesn’t compare to BM25)

Case study: A 2023 paper on efficient attention claimed 3x speedup over “standard attention.” The baseline was the naive PyTorch implementation, not Flash Attention. Against Flash Attention, the speedup was 1.2x—barely significant and with worse accuracy.

Red Flag 2: Cherry-picked examples

Qualitative examples in papers are chosen to make the method look good. If you only see the best cases, you have no idea what the failure modes look like.

What to do: Look for papers that show failure cases, or try the method on your own data immediately.

Red Flag 3: New metrics that favor the approach

When standard metrics don’t show improvement, authors sometimes introduce new metrics where their method wins. These aren’t always invalid (sometimes new metrics are genuinely better), but the move deserves extra scrutiny.

What to check: How does the method perform on standard metrics? If it loses on standard metrics, is the argument for the new metric convincing?

Red Flag 4: Toy datasets only

Results on MNIST, CIFAR-10, or small synthetic datasets often don’t transfer to production-scale problems. Techniques can work beautifully at small scale and fail completely when scaled.

Case study: Many neural architecture search papers showed impressive results on CIFAR-10 but produced architectures that didn’t transfer to ImageNet or were impractical for real-world deployment.

Red Flag 5: Missing ablations

A complex method with multiple components but no ablation study is suspicious. Which parts actually matter? Often, one simple component provides most of the benefit.

What it tells you: The authors either didn’t do ablations (concerning) or did them and found disappointing results (also concerning).

Red Flag 6: Reproducibility concerns

Warning signs include:

No code released, with no commitment to release
Vague hyperparameter specifications (“we tuned learning rate on validation set”)
Unusual experimental setup that would be hard to replicate
Results that require specific random seeds to reproduce

The context: Edward Raff’s 2019 study attempting to independently reproduce 255 ML papers (without consulting any released code) found that only about 63.5% could be reproduced within reasonable effort. Papers that did release code fared better in follow-up community reproductions, but many still failed to match claimed results.

Red Flag 7: Compute advantage disguised as algorithmic improvement

Some “improvements” come from using more compute, not better methods. If a paper doesn’t report compute budgets or if the proposed method uses significantly more resources, the comparison isn’t fair.

What to check: Results normalized for compute. Same number of parameters, same training budget, same inference cost.

Building a Paper Reading Practice

Sustainable paper reading requires structure. Without it, you’ll either read nothing (overwhelmed by the firehose) or read everything (and remember nothing).

Reading rhythm:

15 minutes daily: Scan arXiv titles and abstracts in your area. Mark anything promising.
1-2 hours weekly: Second-pass read 3-5 marked papers. Most won’t survive.
Monthly: Deep-read 1-2 papers that survived second-pass across the month.

Tracking sources:

Primary sources:

arXiv cs.LG, cs.CL, cs.CV for your relevant subfields
Papers with Code (also shows implementations and benchmark rankings)
Top conferences: NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR

Curated sources:

Company research blogs (Anthropic, OpenAI, Google, Meta) for production-oriented work
Research distillation newsletters and podcasts
Senior researchers you respect on Twitter/Bluesky

Note-taking that compounds:

For any paper that survives first-pass, create a structured note:

Title: [Paper title]
Date read: [Date]
Relevance: [1-5]

One-sentence summary:
[What does this paper do in one sentence?]

Key insight:
[The core idea that makes this work]

Results:
[Main quantitative results vs. baselines]

Limitations:
[What the paper admits, and what you infer]

Production viability:
[Quick assessment: likely viable / needs validation / probably not]

If I were to implement:
[What would be the main challenges?]

This structured approach ensures you can find and use paper notes months later.

Why Papers Don’t Reproduce (And What to Do About It)

Staff Engineer Perspective

“VP forwarded a paper with a $20K Slack message attached: ‘why aren’t we doing this?’ Numbers were spectacular—40 percent quality lift on a benchmark adjacent to ours. I spent three weeks reproducing. Got 4 percent. Authors had used a custom data filter that removed roughly 60 percent of hard examples before evaluation, which they mentioned in a footnote on page 11. The lift was real on their filtered set and basically vanished on ours. I now do a ‘gut check reproduction’ on every hyped paper before we even discuss adopting it: one engineer, one week, just enough to see if the headline survives contact with our data. Most don’t. The ones that do are worth the next month of investment.”

— Applied Research Lead at an AI infrastructure company

The Reproducibility Crisis in ML

In 2019, Edward Raff reported on his attempts to independently reproduce 255 ML papers. He found that:

Only about 63.5% of papers could be reproduced within reasonable effort
The attempts deliberately avoided the authors’ released code, so this figure reflects what can be rebuilt from the paper text alone
Even “successful” reproductions often showed significantly different results than claimed

This isn’t about fraud (which is rare). It’s about the many small decisions that papers don’t document:

Hidden hyperparameters: Learning rate schedules, warmup steps, gradient clipping thresholds, weight initialization schemes. Papers mention “we use Adam” but don’t specify betas, epsilon, or whether they use AMSGrad.

Preprocessing differences: Tokenization choices, data cleaning steps, train/test split procedures. These often matter more than the claimed contribution.

Implicit assumptions: The code assumes a specific library version, hardware configuration, or data format that isn’t documented.

Random seed sensitivity: Some results only reproduce with specific random seeds. The paper shows the best seed; your seed gives worse results.

Selection effects: Authors run many experiments and publish the configuration that worked best. You inherit that final configuration but not the search that produced it, so you cannot tell how sensitive the result is to small changes.
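
One inexpensive guard against seed sensitivity and selection effects is to rerun both the baseline and the candidate across several seeds and compare the mean gain to the seed-to-seed spread. The sketch below is illustrative only: the function names, the threshold `k`, and the accuracy numbers are assumptions made up for this example, not values from any paper.

```python
import statistics


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def gain_exceeds_noise(baseline_scores, candidate_scores, k=2.0):
    """True if the mean gain is larger than k times the larger seed-to-seed std."""
    b_mean, b_std = summarize_seeds(baseline_scores)
    c_mean, c_std = summarize_seeds(candidate_scores)
    return (c_mean - b_mean) > k * max(b_std, c_std)


# Hypothetical accuracies from five seeds each (invented for illustration).
baseline = [0.812, 0.805, 0.819, 0.808, 0.815]
candidate = [0.821, 0.809, 0.826, 0.811, 0.818]

print(summarize_seeds(baseline), summarize_seeds(candidate))
print("gain exceeds seed noise:", gain_exceeds_noise(baseline, candidate))
```

With these made-up numbers the average gain is smaller than the seed-to-seed spread, so a single lucky seed could account for the whole difference. A proper significance test is more rigorous, but this check is often enough to decide whether a headline number deserves more of your time.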

What Researchers Actually Do (That Papers Don’t Mention)

Having collaborated with research teams and published papers, I can share what typically happens behind the scenes:

Extensive hyperparameter search: What the paper calls “tuned on validation set” often means weeks of GPU time exploring hyperparameter space. The final configuration is optimal for those benchmarks—not necessarily for yours.

Debugging that shapes the method: During development, bugs often get fixed in ways that subtly change the method. The final paper describes the end state, not the journey. Some design choices exist because alternatives were buggy, not because they were theoretically motivated.

Late-stage additions: Reviewers request comparisons, ablations, or variations. These get added hastily before the deadline. Code quality for these components is often lower.

Computational resources: A paper from a well-funded lab might represent millions of dollars in compute. They tried variations you never will. The published method is the survivor of a much larger exploration.

Common Mistake: Implementing Papers from Scratch Without Reference Code

What people do: Read the paper, understand the algorithm, implement it fresh in their preferred framework. After all, the reference code is messy and in a different framework.

Why it fails: Papers under-specify critical details. The reference code contains weeks of debugging that the paper doesn’t mention—learning rate warmup steps, normalization epsilon values, weight initialization schemes, gradient clipping thresholds. Reimplementing from scratch means rediscovering all these implicit decisions through trial and error.

Fix: Always start from official reference code, even if messy. Port it carefully, validating intermediate values match. Only clean up after you’ve reproduced results. The “ugly” code often contains essential implicit knowledge.
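
“Validating intermediate values match” can be done mechanically. The following sketch assumes you have dumped activations at a few named checkpoints from both the reference code and your port, flattened into lists of floats and stored in execution order. The checkpoint names, tolerances, and values are hypothetical.

```python
import math


def first_mismatch(reference, ported, rel_tol=1e-5, abs_tol=1e-7):
    """Return the name of the first checkpoint whose values differ, or None.

    `reference` and `ported` map checkpoint names to flat lists of floats.
    Checkpoints are visited in the insertion order of `reference`.
    """
    for name, ref_values in reference.items():
        new_values = ported.get(name)
        if new_values is None or len(new_values) != len(ref_values):
            return name
        for r, n in zip(ref_values, new_values):
            if not math.isclose(r, n, rel_tol=rel_tol, abs_tol=abs_tol):
                return name
    return None


# Hypothetical dumps from the same input batch (values invented for illustration).
reference_dump = {"embed": [0.1, 0.2], "layernorm": [1.0, -1.0], "logits": [2.5, 0.3]}
ported_dump = {"embed": [0.1, 0.2], "layernorm": [1.0001, -1.0], "logits": [2.6, 0.3]}

print("first divergence at:", first_mismatch(reference_dump, ported_dump))
```

Reporting the earliest divergence matters: a mismatch in the final logits is usually a symptom, and the first checkpoint that differs points at the actual cause (here, the hypothetical `layernorm` step).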

Practical Strategies for Reproduction

Given these challenges, here’s how to maximize your chances of successful reproduction:

Strategy 1: Start from reference implementations

If official code exists, use it—even if it’s messy. The authors’ code contains implicit knowledge that papers don’t capture. Rewriting from scratch means remaking all the decisions they made through trial and error.

If no official code exists, look for community reproductions and verify them yourself. Papers with Code links papers to implementations and benchmark results, but a linked repository is not proof that the claimed numbers reproduce.

Common Mistake: Adapting Research Before Reproducing It

What people do: Read a paper, immediately start adapting it for their use case—different data, different model size, different framework. When results are disappointing, they don’t know if the technique doesn’t work or if they introduced bugs.

Why it fails: You have two unknowns: whether you implemented correctly AND whether it works for your use case. When it fails (which is common), you can’t diagnose which is the problem.

Fix: First reproduce the paper’s results on their exact setup—same data, same hyperparameters, same evaluation. Only after you can match their numbers should you start adapting. This isolates variables and makes debugging tractable.

Strategy 2: Match the setup exactly before modifying

Before adapting a technique for your use case, verify you can reproduce the original results. Use the same:

Dataset (exact preprocessing)
Hyperparameters (all of them)
Evaluation protocol (same splits, same metrics)
Hardware (if feasible)

Only after reproduction succeeds should you start modifying for your context.
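
A simple way to enforce “same everything” is to write the paper’s setup and yours as flat dictionaries and diff them before the first run. This is a minimal sketch; the setting names and values are invented for illustration and do not come from any specific paper.

```python
def diff_setup(paper, ours):
    """Return {setting: (paper_value, our_value)} for settings that differ or are missing."""
    unset = "<unset>"
    diffs = {}
    for key in sorted(set(paper) | set(ours)):
        paper_value = paper.get(key, unset)
        our_value = ours.get(key, unset)
        if paper_value != our_value:
            diffs[key] = (paper_value, our_value)
    return diffs


# Hypothetical setups (invented for illustration).
paper_setup = {"optimizer": "adam", "lr": 3e-4, "warmup_steps": 2000, "tokenizer": "bpe-32k"}
our_setup = {"optimizer": "adam", "lr": 3e-4, "warmup_steps": 0, "grad_clip": 1.0}

for setting, (theirs, mine) in diff_setup(paper_setup, our_setup).items():
    print(f"{setting}: paper={theirs!r} ours={mine!r}")
```

An empty diff does not prove the setups match, because undocumented settings never appear in either dictionary, but a non-empty diff is a cheap early warning.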

Strategy 3: Email the authors

This sounds old-fashioned, but it works. Authors often respond to polite, specific questions. They have knowledge that didn’t make it into the paper:

“We’re trying to reproduce your results on [benchmark]. We’re using [hyperparameters] and getting [result] instead of [claimed result]. Are there any implementation details we might be missing?”

Common responses reveal missing preprocessing steps, hyperparameter sensitivities, or known issues.

Strategy 4: Budget for failure

Assume your first reproduction attempt will fail. Plan accordingly. If you estimate 1 week to get an implementation running, budget 2-3 weeks total including debugging. The extra time typically goes to:

Debugging why results don’t match
Trying hyperparameter variations
Discovering implicit dependencies

Evaluating Production Viability

The Production Readiness Assessment

Not all reproducible research is production-ready. The gap from “works in a notebook” to “runs in production” can be larger than the initial research.

Here’s a framework for evaluating whether research can transition to production:

Dimension 1: Computational efficiency

Questions to ask:

What are the inference latency and throughput characteristics?
How does memory usage scale with input size?
Does it require specialized hardware you don’t have?
Are there efficient implementations, or only research code?

Red flags: Quadratic scaling with sequence length (for long-context applications), custom CUDA kernels with no PyTorch fallback, requirements for hardware you can’t provision.

Green flags: Already optimized implementations exist (in PyTorch, TensorRT, etc.), scales linearly or better, runs on your standard hardware.
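
To answer the latency question with your own numbers rather than the paper’s, a small timing harness is usually enough at the prototype stage. The sketch below uses only the standard library; `toy_model` is a hypothetical stand-in for your inference call, and the nearest-rank percentile is one of several common definitions.

```python
import math
import time


def measure_latency_ms(fn, inputs, warmup=5):
    """Call fn on each input and return per-call wall-clock latencies in milliseconds."""
    for x in inputs[:warmup]:  # warm-up calls are not recorded
        fn(x)
    samples = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def toy_model(n):
    # Hypothetical stand-in for a model call; replace with your own inference function.
    return sum(i * i for i in range(n))


latencies = measure_latency_ms(toy_model, [2000] * 200)
print(f"p50={percentile(latencies, 50):.3f} ms  p99={percentile(latencies, 99):.3f} ms")
```

Measure on inputs that resemble production traffic, and report tail percentiles rather than the mean. For GPU workloads you would also need to synchronize the device before reading the clock, which this sketch does not do.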

Case study: The Linformer paper proposed O(n) attention as a replacement for O(n^2) standard attention. But the constant factors meant that for sequences under 4096 tokens, standard attention (with Flash Attention) was actually faster. The asymptotic improvement didn’t matter for most production use cases.
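
The role of constant factors in the case study above can be made concrete with a toy cost model. Assume the linear method costs `a * n` and the quadratic method costs `b * n * n` for sequence length `n`; the quadratic method is then cheaper whenever `n < a / b`. The constants below are hypothetical and chosen only so that the crossover lands at 4096; they are not measurements of any real kernel.

```python
def linear_cost(n, a):
    """Cost of an O(n) method with constant factor a."""
    return a * n


def quadratic_cost(n, b):
    """Cost of an O(n^2) method with constant factor b."""
    return b * n * n


def crossover_length(a, b):
    """Sequence length at which a * n equals b * n^2 (the two costs are equal)."""
    return a / b


# Hypothetical constants (not measured): large overhead for the linear method,
# small per-pair cost for the heavily optimized quadratic kernel.
A, B = 2048.0, 0.5

print("crossover at n =", crossover_length(A, B))
for n in (1024, 4096, 8192):
    winner = "quadratic" if quadratic_cost(n, B) < linear_cost(n, A) else "linear (or tie)"
    print(n, winner)
```

In practice you would estimate the two constants by benchmarking both implementations at a few sequence lengths on your own hardware, then check where your production length distribution sits relative to the crossover.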

Dimension 2: Integration complexity

Questions to ask:

Does this require changing model architecture, or is it a drop-in replacement?
What dependencies does it introduce?
Can it be A/B tested incrementally, or is it all-or-nothing?
Does it require retraining, or can it be applied to existing models?

Red flags: Requires retraining your foundation model, introduces unstable/unmaintained dependencies, can’t be gradually rolled out.

Green flags: Works with pre-trained models, clean API boundaries, can be enabled via configuration.

Case study: LoRA fine-tuning succeeded in production partly because it didn’t require modifying base model weights. You could add adapters without touching the foundation, enabling easy A/B testing and rollback.
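
The rollback property follows directly from the LoRA formulation, in which the adapted layer computes W x + (alpha / r) B A x with the base weight W frozen. Here is a minimal pure-Python sketch; the tiny matrices are invented for illustration, and a real implementation would use a tensor library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is never modified."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # B is (d_out x r), A is (r x d_in)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy shapes: d_in = 3, d_out = 2, rank r = 1 (values invented for illustration).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B_init = [[0.0], [0.0]]      # B starts at zero, so the adapter is initially a no-op
B_trained = [[1.0], [2.0]]   # hypothetical values after fine-tuning
x = [1.0, 2.0, 3.0]

print("base only:   ", matvec(W, x))
print("adapter init:", lora_forward(W, A, B_init, x, alpha=2.0, r=1))
print("adapter on:  ", lora_forward(W, A, B_trained, x, alpha=2.0, r=1))
```

Because the base weights are untouched, switching the adapter off (or routing only a slice of traffic through it) restores the original model exactly, which is what makes incremental A/B testing straightforward.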

Dimension 3: Robustness

Questions to ask:

How sensitive is it to hyperparameters?
What happens on inputs outside the training distribution?
Are failure modes catastrophic or graceful?
Has it been tested on messy, real-world data?

Red flags: Very narrow hyperparameter range works, fails silently on edge cases, tested only on curated benchmarks.

Green flags: Stable across reasonable hyperparameter ranges, documented behavior on adversarial/unusual inputs, tested at scale on real data.

Case study: Many quantization schemes work perfectly on benchmark data but produce catastrophically wrong outputs on rare inputs that don’t match training statistics. Production systems discovered this through user complaints, not benchmarking.
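
A toy version of this failure mode is easy to reproduce. The sketch below assumes a simple symmetric int8 scheme whose scale is fixed from calibration data; it is one illustrative scheme, not a description of any particular library, and all values are invented.

```python
def calibrate_scale(calibration_values):
    """Pick a scale so the largest calibration magnitude maps to 127."""
    return max(abs(v) for v in calibration_values) / 127.0


def quantize_int8(values, scale):
    """Round to the nearest step and clip to the symmetric int8 range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


def max_abs_error(values, scale):
    """Largest round-trip error after quantizing and dequantizing."""
    restored = dequantize(quantize_int8(values, scale), scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Hypothetical calibration data and inputs (invented for illustration).
calibration = [-1.8, -0.4, 0.1, 0.9, 2.0]
scale = calibrate_scale(calibration)

typical_inputs = [-1.5, 0.3, 1.2]   # inside the calibrated range
rare_inputs = [20.0]                # far outside anything seen in calibration

print("typical error:", max_abs_error(typical_inputs, scale))
print("rare error:   ", max_abs_error(rare_inputs, scale))
```

Inputs inside the calibrated range lose at most half a quantization step, while the out-of-range value is silently clipped to the calibration maximum. Nothing raises an error, which is why this kind of problem tends to surface through users rather than benchmarks.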

Dimension 4: Maintainability

Questions to ask:

Can multiple team members understand and modify this?
Is the implementation well-tested and documented?
Is the approach still actively developed, or is it abandoned?
If it breaks, can you debug it?

Red flags: Only one researcher understands it, no tests, inactive repository, black-box components.

Green flags: Multiple independent implementations, community support, clear debugging procedures.

Case Studies: Research That Translated Well

Flash Attention: From Paper to Industry Standard in One Year

Why it succeeded:

Clear, measurable benefit: 2-4x throughput improvement with the same quality, because it computes exact attention rather than an approximation. No tradeoffs to evaluate.
Drop-in replacement: Same API as standard attention. No architectural changes required.
Strong implementation: Authors released optimized CUDA kernels from the start.
Wide applicability: Benefits virtually all transformer workloads.

Adoption pattern: Major labs adopted within months. Framework integration followed. By 12 months post-paper, it was the default.

LoRA Fine-tuning: Democratizing Customization

Why it succeeded:

Solved a real pain point: Fine-tuning large models was prohibitively expensive.
Clean abstraction: Adds small adapter matrices, doesn’t touch base weights.
Easy integration: PEFT library made it trivial to add to existing code.
Flexibility: Works with models you can’t afford to fully fine-tune.

Adoption pattern: Initially used by resource-constrained practitioners. Quality validation led to broader adoption. Now the default for LLM customization.

Speculative Decoding: Emerging Production Adoption

Why it’s succeeding:

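No quality compromise: The accept/reject rule provably preserves the target model’s output distribution, so sampled outputs are statistically indistinguishable from standard decoding.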
Significant speedup: 2-3x latency reduction is common.
Works with existing models: Uses small draft model with existing target model.

Current state: Framework integration is ongoing. Requires good draft model selection. Production adoption accelerating as tooling matures.
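
The quality guarantee comes from the accept/reject rule used in speculative sampling: a token drafted from distribution q is accepted with probability min(1, p/q) under the target distribution p, and on rejection a replacement is drawn from the normalized positive part of p - q. The sketch below covers a single token position over a three-token vocabulary; the probabilities are invented, and real systems verify several drafted tokens per target-model call.

```python
import random


def speculative_step(p, q, rng):
    """Draft one token from q, then accept or resample so the result follows p."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft
    # Rejected: resample from the residual distribution max(0, p - q), normalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_step."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p and q are identical, so nothing is ever rejected
        return accepted
    reject_mass = 1.0 - sum(accepted)
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


# Hypothetical target (p) and draft (q) distributions over three tokens.
target_p = [0.5, 0.3, 0.2]
draft_q = [0.2, 0.5, 0.3]

print("exact output distribution:", output_distribution(target_p, draft_q))
rng = random.Random(0)
print("sampled tokens:", [speculative_step(target_p, draft_q, rng) for _ in range(10)])
```

The exact calculation recovers the target distribution regardless of how poor the draft is. Draft quality affects only how often tokens are accepted, and therefore the speedup, which is why draft model selection matters in practice.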

Case Studies: Research That Didn’t Translate

Many Neural Architecture Search Results

Why they failed:

Compute advantage: Results came from massive search budgets, not better architectures.
Benchmark overfitting: Found architectures optimized for CIFAR-10 that didn’t transfer.
Impractical costs: Reproducing the search was infeasible for most organizations.
Better alternatives emerged: Transformers made many CV architecture questions moot.

Lesson: Be skeptical of results that require massive search—the winning configuration is selected from thousands of alternatives.

Various “Efficient Attention” Mechanisms

Why many failed:

Constant factors matter: Asymptotic improvements didn’t help at practical sequence lengths.
Quality tradeoffs: Approximations that seemed harmless on benchmarks caused issues in production.
Flash Attention changed the baseline: What seemed efficient in 2021 wasn’t efficient after Flash Attention.

Lesson: Efficiency claims need context. “More efficient than X” means nothing if a better alternative to X appears.

Mixture of Experts (The Long Road)

Why it took years:

Training instability: Early MoE implementations were notoriously unstable to train.
Routing challenges: Load balancing across experts was tricky to get right.
Infrastructure requirements: Needed sophisticated distributed systems for expert parallelism.
Gradually solved: Switch Transformer (2021) simplified training; infrastructure matured.

Current state: Now powers frontier models. The research was valid all along; production engineering took years to catch up.

Lesson: Some research is ahead of its time. Infrastructure and tooling maturity often gate production readiness.

Decision-Making: When to Adopt, Wait, or Skip

The Adoption Timing Framework

Different organizations should adopt research at different stages. Your position depends on:

Technical capacity: Do you have engineers who can debug novel implementations? Can you afford failures?

Competitive dynamics: How much do you gain from being first? How much do you lose from waiting?

Risk tolerance: What happens if adoption fails? Is this critical infrastructure or an experiment?

Here’s a framework for deciding:

ADOPT Early (Be First)

When:

Technique provides significant competitive advantage
You have strong ML engineering capacity to handle issues
Risk of failure is acceptable (not critical path)
You can contribute improvements back (builds expertise)

Example: A company with strong ML infrastructure adopting Flash Attention 2 months after release to gain throughput advantage.

Tradeoffs: Higher risk of hitting bugs, no community solutions yet, more engineering investment.

TRIAL (Fast Follower)

When:

Technique is clearly valuable (not just hyped)
Initial bugs are being shaken out by early adopters
Framework integration is emerging
You want proven value with reduced risk

Example: Adopting Flash Attention 6 months post-release, once it’s in major frameworks and bugs are documented.

Tradeoffs: Less competitive advantage, but much lower risk.

ASSESS (Wait and See)

When:

Technique is promising but unproven at scale
Your capacity to evaluate is limited
Alternatives may emerge
Cost of being wrong is high

Example: Watching speculative decoding mature before investing in implementation.

Tradeoffs: May miss opportunities, but avoids wasted effort on techniques that don’t pan out.

SKIP (Don’t Invest)

When:

Technique is superseded by something better
Complexity exceeds benefit for your use case
Incompatible with your architecture
The problem it solves isn’t your problem

Example: Skipping many “efficient attention” variants because Flash Attention solved the problem more completely.

The Kill Criteria

Knowing when to stop is as important as knowing when to start. Define kill criteria before beginning any research adoption project:

Hard kills (stop immediately):

Fundamental incompatibility with your architecture discovered
Results significantly worse than baseline in realistic testing
Required changes are orders of magnitude larger than expected
Security or compliance issues emerge

Soft kills (evaluate whether to continue):

Results are only marginally better than baseline
Implementation is taking much longer than planned
Key team member is reassigned or leaves
A clearly better alternative emerges

The sunk cost trap: The most common failure mode is continuing projects that should be killed because of invested effort. Define time-boxes upfront. “If we haven’t achieved X in Y weeks, we stop.”

Case study: A team spent 4 months implementing a custom attention mechanism for their domain. At month 3, it was clear the improvement was marginal. But they continued because “we’re almost there.” The final result: 2% improvement at 10x engineering cost. The same effort on other improvements would have yielded far better returns.

Portfolio Thinking for Research Investment

Don’t bet everything on one research direction. Think in portfolios:

Core (60-70% of research effort): Incremental improvements to existing systems. High probability of success, moderate impact.

Adjacency (20-30% of research effort): Promising techniques that extend your current capabilities. Medium probability, high potential impact.

Exploratory (5-10% of research effort): Speculative bets on novel research. Low probability, potentially transformative impact.

This portfolio approach ensures you’re not vulnerable to any single research direction failing while maintaining exposure to breakthrough possibilities.

Prototyping: Validating Ideas with Minimum Investment

The Prototype Mindset

The goal of prototyping is learning, not building. A prototype answers: “Should we invest more in this?” not “How do we ship this?”

This distinction matters because it changes what you build:

Production code: Handles all edge cases, is well-tested, integrates cleanly, is maintainable.

Prototype code: Handles the critical path, validates the hypothesis, produces enough evidence to decide.

Every hour spent on prototype polish is an hour not spent on the decision you’re trying to make.

Designing Effective Prototypes

Step 1: State your hypothesis clearly

Bad: “Let’s try technique X.”
Good: “We hypothesize that technique X will reduce latency by >20% on our production workload without degrading accuracy by more than 1%.”

The hypothesis determines what you measure and what counts as success.

Step 2: Define success criteria before starting

Quantitative thresholds:

Latency: Must be <Y ms at P99
Accuracy: Must be >Z% on our evaluation set
Memory: Must fit in existing GPU allocation

Qualitative requirements:

Must work with our existing tokenizer
Must not require retraining the base model

Step 3: Define kill criteria equally clearly

If any of these are true, stop:

Accuracy drops >5% on our evaluation set
Latency increases (not just fails to improve)
Implementation reveals fundamental incompatibility
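
Writing the success and kill criteria as code before the prototype starts makes them harder to renegotiate later. The sketch below is one possible encoding; the class, field names, and thresholds are hypothetical, and it assumes “accuracy drops >5%” means five percentage points with accuracy stored as a fraction.

```python
from dataclasses import dataclass


@dataclass
class PrototypeResult:
    p99_latency_ms: float
    accuracy: float                 # fraction in [0, 1]
    baseline_p99_latency_ms: float
    baseline_accuracy: float        # fraction in [0, 1]


def decide(result, max_p99_ms, min_accuracy, max_accuracy_drop=0.05):
    """Return "kill", "continue", or "inconclusive" for a finished prototype run."""
    # Kill criteria are checked first so a fast-but-broken prototype cannot pass.
    if result.baseline_accuracy - result.accuracy > max_accuracy_drop:
        return "kill"
    if result.p99_latency_ms > result.baseline_p99_latency_ms:
        return "kill"
    # Success criteria, fixed before the prototype started.
    if result.p99_latency_ms < max_p99_ms and result.accuracy > min_accuracy:
        return "continue"
    return "inconclusive"


# Hypothetical measurements (invented for illustration).
run = PrototypeResult(p99_latency_ms=80.0, accuracy=0.93,
                      baseline_p99_latency_ms=100.0, baseline_accuracy=0.94)
print(decide(run, max_p99_ms=90.0, min_accuracy=0.92))
```

Criteria that cannot be measured automatically, such as a fundamental incompatibility discovered during implementation, still need a human call; the point of the function is that the numeric thresholds are agreed on before anyone sees results.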

Step 4: Time-box ruthlessly

“We will spend at most 5 days on this prototype. At day 5, we decide: continue to validation, pivot, or abandon.”

The time-box forces scoping. If you can’t validate the core idea in the time-box, either the idea is too complex or you need to decompose it.

What to Include and Exclude in Prototypes

Include (Scope In):

Core algorithm implementation
Evaluation on representative data sample
Key metric measurement
Comparison to current baseline

Exclude (Scope Out):

Production-ready code quality
Comprehensive error handling
Full test coverage
Documentation
Performance optimization
Edge case handling

The items you exclude can all be added later if the prototype succeeds. Adding them during prototyping wastes effort if the idea doesn’t work.

Implementation Strategies

You have several options for implementing research in a prototype:

Option 1: Use official reference implementation

When: Official code is available and quality is reasonable.

Advantages: Fastest path; known to reproduce paper results; authors’ implicit knowledge baked in.

Disadvantages: May not fit your infrastructure; code quality varies; may have hidden dependencies.

How: Clone the repo, get their evaluation working, then adapt their code to your data.

Option 2: Adapt existing similar implementation

When: A similar technique exists in your codebase or in a framework.

Advantages: Integrates with existing code; familiar patterns; easier to debug.

Disadvantages: May miss subtle but important differences; risk of cargo-culting.

How: Identify the closest existing implementation, understand the key differences, modify carefully.

Option 3: Implement from scratch

When: No suitable reference exists, or you need deep understanding.

Advantages: Deep understanding; tailored to your needs; no external dependencies.

Disadvantages: Time-consuming; risk of bugs; need to validate correctness independently.

How: Start from the paper’s pseudocode; implement in stages; validate each component.

Option 4: Use framework integration

When: The technique is already in a major framework (PyTorch, HuggingFace, etc.).

Advantages: Production-ready; well-tested; maintained by others.

Disadvantages: May lag latest research; limited customization; framework dependency.

How: pip install and configure.

The right choice depends on your situation. Framework integration is obviously best when available. Reference implementation is usually best otherwise. From-scratch implementation is only worth it when you need the deep understanding—typically for core techniques you’ll build on.

Debugging Research Implementations

When your implementation doesn’t match paper results (which happens often), here’s a systematic debugging approach:

Step 1: Verify you can reproduce on their data

Before trying your data, reproduce their exact setup. Same dataset, same preprocessing, same hyperparameters. If you can’t match their results on their setup, you have an implementation bug.

Step 2: Check tensor shapes throughout

Many bugs manifest as shape errors or—worse—silent shape broadcasting. Print shapes at each step. Compare against what the paper implies.

Step 3: Compare against reference implementation numerically

If reference code exists, test intermediate values:

import torch

# x is one fixed input tensor; both functions are placeholders for the two implementations.
their_output = their_implementation(x)
your_output = your_implementation(x)
assert torch.allclose(their_output, your_output, atol=1e-5)

Do this at each layer/step to isolate where differences emerge.
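One way to automate the layer-by-layer comparison is to walk both implementations stage by stage and report the first stage whose outputs disagree. The sketch below uses plain Python lists instead of tensors so it runs without any framework; `first_divergence` and the toy stages are hypothetical names for illustration.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]
Stage = Callable[[Vector], Vector]


def first_divergence(ref_stages: Sequence[Stage],
                     new_stages: Sequence[Stage],
                     x: Vector,
                     atol: float = 1e-5) -> Optional[int]:
    """Return the index of the first stage whose outputs differ, or None."""
    for i, (ref, new) in enumerate(zip(ref_stages, new_stages)):
        ref_out = ref(x)
        new_out = new(x)
        same = len(ref_out) == len(new_out) and all(
            math.isclose(a, b, rel_tol=0.0, abs_tol=atol)
            for a, b in zip(ref_out, new_out)
        )
        if not same:
            return i
        # Feed the reference output forward so later stages are compared on
        # identical inputs and an early error does not contaminate them.
        x = ref_out
    return None


# Toy stages standing in for layers of the two implementations.
def scale(v):
    return [2.0 * a for a in v]


def shift_ref(v):
    return [a + 1.0 for a in v]


def shift_bug(v):
    return [a + 1.001 for a in v]   # deliberately wrong


print(first_divergence([scale, shift_ref], [scale, shift_bug], [1.0, 2.0]))  # 1
```

Feeding the reference output into the next stage is a design choice: it isolates each stage, at the cost of not showing how errors compound end to end.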

Step 4: Check for normalization issues

Many papers under-specify normalization. Are you using layer norm where they use batch norm? Is it pre-norm or post-norm? What’s the epsilon value?

Step 5: Verify numerical precision

Some techniques are sensitive to float16 vs float32. Mixed precision training can cause issues. Try full float32 if you’re seeing instabilities.
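A small illustration of why precision matters, using only the standard library. The `struct` format `e` stores a value as IEEE 754 half precision, so rounding through it mimics float16 accumulation. Here Python's native float (float64) stands in for the full-precision path, and the helper names are hypothetical.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(n: int, half: bool) -> float:
    """Add 1.0 to a running total n times, optionally rounding to float16."""
    total = 0.0
    for _ in range(n):
        total += 1.0
        if half:
            total = to_float16(total)
    return total


print(accumulate(3000, half=False))  # 3000.0
print(accumulate(3000, half=True))   # 2048.0
```

Above 2048, float16 can only represent every second integer, so adding 1.0 rounds back to 2048 and the total silently stops growing. Long reductions such as loss sums or attention normalizers can hit the same effect.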

Step 6: When all else fails, contact authors

Often, there’s a detail that didn’t make it into the paper. A polite email asking about reproduction often gets helpful responses.

Rigorous Benchmarking

Why Rigorous Benchmarking Matters

Poor benchmarking leads to poor decisions. If you don’t benchmark correctly, you might:

Adopt a technique that’s actually worse than your baseline
Reject a technique that would have helped
Misattribute improvements to the wrong component
Make decisions that don’t generalize beyond your test set

Rigorous benchmarking takes more effort but produces trustworthy decisions.

Principles of Fair Comparison

Principle 1: Test on representative data

Your benchmark data should match your production data distribution. If you’re evaluating a retrieval technique, use queries that look like real user queries, not synthetic test sets.

Common mistakes:

Using clean academic benchmarks when production data is messy
Testing on easy examples but deploying on hard ones
Not including adversarial or edge cases

Common Mistake: Comparing Against Weak Baselines

What people do: Implement a new technique from a paper, compare it against a quickly-implemented baseline with default hyperparameters. The new technique wins. Ship it.

Why it fails: Papers often compare against weak baselines (outdated methods, untuned hyperparameters, naive implementations). Your “baseline” might be 20% worse than a properly tuned version of the same approach. Flash Attention made many “efficient attention” papers obsolete overnight.

Fix: Invest equal effort in your baseline: use the best available implementation, tune hyperparameters, use the same compute budget. If the new technique doesn’t beat a strong baseline, the improvement may not be real.

Principle 2: Compare apples to apples

For a comparison to be meaningful, control for:

Compute budget (same training time, or same FLOPs)
Data (same training set, same preprocessing)
Hyperparameter tuning effort (if you tune one method, tune the other equally)
Hardware (same GPU, same batch sizes)

If your new technique has 10x the parameters, comparing accuracy isn’t fair. Normalize for what matters.

Principle 3: Control for random variation

A single run tells you very little. Results vary with:

Random weight initialization
Batch ordering
Stochastic components (dropout, etc.)

Run experiments multiple times. Report means and standard deviations. Use statistical significance tests before claiming improvements.

How many runs? At least 3; preferably 5-10 for important decisions. If results are very noisy, you need more runs or larger sample sizes.
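Reporting means and standard deviations takes only a few lines. The sketch below is a minimal version under the assumption that each run yields one scalar score; `summarize` and the sample scores are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and standard error across seeds."""
    n = len(scores)
    if n < 3:
        raise ValueError("need at least 3 runs to say anything about variance")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)   # sample std (n - 1 denominator)
    return {"n": n, "mean": mean, "std": std, "sem": std / n ** 0.5}


# Hypothetical accuracies (%) from five seeds of the same configuration.
summary = summarize([84.0, 85.0, 86.0, 85.0, 85.0])
print(summary["mean"], round(summary["std"], 3))  # 85.0 0.707
```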

Principle 4: Measure what matters

Academic metrics may not correlate with production value. Quality is important, but so is:

Latency (P50, P99—not just average)
Throughput (requests per second)
Memory usage (peak and sustained)
Cost (compute expense per query)

A technique that’s 2% better on accuracy but 3x slower may not be worth it.
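The reason to report P50 and P99 rather than the average shows up in a small example. The sketch below uses the nearest-rank definition of a percentile (one of several common definitions); the `percentile` helper and the latency values are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 slow outliers.
latencies = [10.0] * 98 + [500.0] * 2

print(sum(latencies) / len(latencies))  # 19.8
print(percentile(latencies, 50))        # 10.0
print(percentile(latencies, 99))        # 500.0
```

The average (19.8 ms) describes no real request: most take 10 ms and the tail takes 500 ms. P99 exposes the tail that users and timeouts actually hit.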

Principle 5: Test at scale

Techniques that work at small scale often fail at production scale:

Memory usage can grow unexpectedly
Latency distributions change under load
Batch sizes affect convergence

Test at your expected production scale before concluding the benchmark.

Statistical Significance: When Is an Improvement Real?

If method A achieves 85.2% accuracy and method B achieves 85.8% accuracy, is B better? It depends on variance.

For a rigorous comparison:

Run each method multiple times (different random seeds)
Collect the distribution of results
Use a statistical test (paired t-test if runs are matched, Mann-Whitney U if not)
Only claim improvement if p < 0.05
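For matched runs, a paired test can be sketched with the standard library alone. Note that the code below is an exact sign-flip permutation test, swapped in for the paired t-test because it needs no statistics package; the per-seed accuracies are hypothetical values chosen so the means are 85.2% and 85.8%.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided p-value for the mean paired difference b - a."""
    diffs = [y - x for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have
    # either sign, so enumerate every sign assignment (2**n of them).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:   # small tolerance for float ties
            hits += 1
    return hits / total


# Hypothetical accuracies (%) for the same five seeds under each method.
method_a = [84.1, 85.0, 86.2, 84.7, 86.0]
method_b = [85.3, 84.6, 87.5, 85.0, 86.6]

print(paired_sign_flip_pvalue(method_a, method_b))  # 0.1875
```

With five pairs, the smallest two-sided p-value this test can produce is 2/32, about 0.06, so it can never cross 0.05 at that sample size. That is one concrete reason to prefer more than five matched runs for important decisions.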

Example analysis:

Method A: 5 runs, mean 85.2%, std 0.8%
Method B: 5 runs, mean 85.8%, std 1.1%

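Suppose the runs are matched by seed and a paired t-test on the per-seed differences gives p = 0.23. (The means and standard deviations alone do not determine this value; a paired test needs the individual differences.) This is NOT statistically significant. The observed difference could easily be random variation. You should NOT adopt method B based on this evidence.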
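The summary statistics alone are not enough to rerun a paired test, which needs the per-seed differences. As a rough cross-check that ignores pairing, Welch's unpaired t statistic can be computed directly from the means, standard deviations, and run counts. This is a sketch, and `welch_t` is a hypothetical helper name.

```python
import math
from typing import Tuple


def welch_t(mean_a: float, std_a: float, n_a: int,
            mean_b: float, std_b: float, n_b: int) -> Tuple[float, float]:
    """Welch's t statistic and approximate degrees of freedom (unpaired)."""
    var_a = std_a ** 2 / n_a
    var_b = std_b ** 2 / n_b
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


t, df = welch_t(85.2, 0.8, 5, 85.8, 1.1, 5)
print(round(t, 2), round(df, 1))  # 0.99 7.3
```

A t statistic near 1 with about 7 degrees of freedom is well below the roughly 2.3 to 2.4 needed for two-sided p < 0.05, which agrees with the conclusion that the difference is not significant.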

Practical significance vs. statistical significance: Even if an improvement is statistically significant, is it practically meaningful? A 0.1% accuracy improvement that’s significant with p=0.01 is still not worth implementing if the engineering cost is high.

Ablation Studies: Understanding What Matters

Complex techniques have multiple components. Ablation studies remove components one at a time to understand their contribution.

Why this matters for production:

You might not need the full complexity
Some components may be essential; others may be cargo-culted
Simpler is better if it achieves similar results

Example ablation for a new retrieval technique:

Full method: 85% recall
Without query expansion: 83% recall (expansion adds 2 points)
Without reranking: 78% recall (reranking adds 7 points)
Without both: 75% recall

Conclusion: Reranking is essential; query expansion is helpful but not critical. If you can only afford one, choose reranking.
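The four numbers in this ablation also show how much each component contributes with and without the other. The sketch below reuses the recall values from the example; the dictionary layout and `marginal_gain` are hypothetical.

```python
# Recall (%) for each configuration, keyed by the components left enabled.
recall = {
    frozenset({"expansion", "reranking"}): 85.0,  # full method
    frozenset({"reranking"}): 83.0,               # without query expansion
    frozenset({"expansion"}): 78.0,               # without reranking
    frozenset(): 75.0,                            # without both
}


def marginal_gain(component: str, others: frozenset) -> float:
    """Recall gained by adding `component` on top of `others`."""
    return recall[others | {component}] - recall[others]


print(marginal_gain("reranking", frozenset({"expansion"})))  # 7.0
print(marginal_gain("reranking", frozenset()))               # 8.0
print(marginal_gain("expansion", frozenset({"reranking"})))  # 2.0
print(marginal_gain("expansion", frozenset()))               # 3.0
```

Each component helps slightly more on its own (8 and 3 points) than on top of the other (7 and 2 points), so the two overlap a little. The ranking is the same either way, which supports choosing reranking first.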

Cost-Benefit Analysis

The final benchmarking step is cost-benefit analysis: Is the improvement worth the investment?

Quantify costs:

Implementation effort (engineer-weeks)
Infrastructure changes (new dependencies, hardware)
Ongoing maintenance (additional complexity)
Compute costs (training, inference)

Quantify benefits:

Quality improvement (in metrics that matter)
Efficiency gains (latency, throughput, cost reduction)
Strategic value (capabilities you couldn’t offer before)

Calculate return:

Payback period: How long until benefits exceed costs?
ROI: After 12 months, what’s the return on investment?

If payback is >12 months, you need strong strategic justification. If payback is <3 months, adoption is likely worthwhile.
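Payback period and ROI are simple arithmetic once costs and benefits are expressed in the same unit. The sketch below assumes a one-time upfront cost and constant monthly figures; the function names and dollar amounts are hypothetical.

```python
def payback_months(upfront_cost: float, monthly_benefit: float,
                   monthly_running_cost: float = 0.0) -> float:
    """Months until cumulative net benefit covers the upfront cost."""
    net = monthly_benefit - monthly_running_cost
    if net <= 0:
        return float("inf")   # never pays back
    return upfront_cost / net


def roi(upfront_cost: float, monthly_benefit: float,
        monthly_running_cost: float = 0.0, months: int = 12) -> float:
    """Net gain over the horizon divided by total cost over the horizon."""
    total_cost = upfront_cost + monthly_running_cost * months
    total_benefit = monthly_benefit * months
    return (total_benefit - total_cost) / total_cost


# Hypothetical: $60k to implement, $15k/month saved, $3k/month extra upkeep.
print(payback_months(60_000, 15_000, 3_000))  # 5.0
print(roi(60_000, 15_000, 3_000))             # 0.875
```

A 5-month payback sits between the two thresholds above, so this hypothetical case would need a judgment call rather than an automatic yes.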

Managing the Research-to-Production Pipeline

The Three-Stage Pipeline

Research adoption isn’t a single decision—it’s a process with distinct stages, each with its own goals and exit criteria.

Stage 1: Discovery (1-2 weeks)

Goal: Answer “Is this worth investigating further?”

Activities:

Deep paper reading (third pass)
Quick prototype implementation
Initial evaluation on representative data
Risk assessment

Exit criteria:

Core idea is validated (it works at all)
Initial metrics meet minimum threshold
No critical blockers identified

Gate question: “Should we invest more time in this?”

If NO: Document learnings, move on.
If YES: Proceed to Validation stage.

Stage 2: Validation (2-4 weeks)

Goal: Answer “Is this ready for production?”

Activities:

Rigorous benchmarking against production baseline
Testing at production scale
Edge case exploration
Full production viability assessment

Exit criteria:

Performance meets or exceeds baseline on production metrics
Works at production scale without issues
Integration path is clear and feasible

Gate question: “Should we build this for production?”

If NO: Document learnings, archive for later.
If YES: Proceed to Integration stage.

Stage 3: Integration (4-8 weeks)

Goal: Answer “Does this improve production?”

Activities:

Production-quality implementation
Full testing and validation
Gradual rollout (feature flags, percentage ramps)
Monitoring setup

Exit criteria:

Passes all integration tests
Metrics are positive in production A/B test
Team can maintain and debug it

Gate question: “Should we keep this in production?”

If NO: Roll back, document learnings.
If YES: Complete rollout, document for team.

The Importance of Gates

Gates are decision points that force explicit choices. Without gates, projects drift. A project that should have been killed in Discovery continues to Validation because “we’re already invested.” A project that passes Validation gets stuck in Integration limbo.

At each gate, someone with authority must answer the gate question with YES, NO, or NOT YET (with conditions).

Gates require:

Clear criteria defined before starting
Honest assessment of whether criteria are met
Authority to kill projects that don’t pass

Knowledge Transfer and Documentation

When a research technique reaches production, knowledge transfer becomes critical. The person who implemented it may move on. Others will need to maintain, debug, and extend it.

Essential artifacts:

Implementation documentation:

What this implements and why we adopted it
Where it lives in the codebase
How to configure it
Common issues and debugging approaches

Design decision record:

What alternatives were considered
Why this approach was chosen
What tradeoffs were made
When this decision should be revisited

Runbook:

How to verify it’s working correctly
Known failure modes and remediation
Monitoring and alerting guidance
Rollback procedure

If these artifacts don’t exist, the knowledge lives only in individual heads and walks out the door when they leave.

Team Research Processes

Individual skills matter, but research-to-production at scale requires team processes. Here’s how to run research adoption as a team, not just as individuals working in parallel.

The Research Radar

A research radar tracks promising techniques across your team’s focus areas. It’s a shared document or system that answers: “What research should we be aware of, and what’s our current assessment?”

Structure:

Technique	Status	Owner	Next Action
Flash Attn 3	ADOPT	-	Upgrade in Q2
Spec Decoding	TRIAL	Alice	Validation by 3/15
MoE Routing	ASSESS	Bob	Paper review by 3/1
Technique X	SKIP	-	Re-evaluate in 6mo

Categories:

ADOPT: Proven techniques to integrate into standard practice
TRIAL: Promising techniques in active validation
ASSESS: Techniques worth monitoring but not actively testing
HOLD: Previously interesting techniques now deprioritized
SKIP: Evaluated and rejected (with documented reasoning)

Maintenance:

Review monthly: Update status, add new techniques, remove obsolete entries
Assign owners: Each TRIAL or ASSESS item needs someone responsible for tracking
Document decisions: When something moves categories, record why
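If the radar lives in code or a structured file rather than a free-form document, the maintenance rules can be checked automatically. The sketch below models the table above; the class names, the `missing_owners` check, and the extra unowned entry are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    ADOPT = "ADOPT"
    TRIAL = "TRIAL"
    ASSESS = "ASSESS"
    HOLD = "HOLD"
    SKIP = "SKIP"


@dataclass
class RadarEntry:
    technique: str
    status: Status
    owner: Optional[str]
    next_action: str


def missing_owners(radar: List[RadarEntry]) -> List[str]:
    """Techniques in TRIAL or ASSESS that nobody is responsible for."""
    needs_owner = {Status.TRIAL, Status.ASSESS}
    return [e.technique for e in radar if e.status in needs_owner and not e.owner]


radar = [
    RadarEntry("Flash Attn 3", Status.ADOPT, None, "Upgrade in Q2"),
    RadarEntry("Spec Decoding", Status.TRIAL, "Alice", "Validation by 3/15"),
    RadarEntry("MoE Routing", Status.ASSESS, "Bob", "Paper review by 3/1"),
    RadarEntry("Technique X", Status.SKIP, None, "Re-evaluate in 6mo"),
    RadarEntry("Technique Y", Status.ASSESS, None, "Find an owner"),  # made up
]

print(missing_owners(radar))  # ['Technique Y']
```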

Research Review Meetings

Regular research reviews keep the team aligned and prevent both over-investment and under-investment in new techniques.

Format (75-90 minutes, monthly):

Radar updates (15 min): Quick review of status changes since last meeting
Deep dives (30-45 min): 1-2 techniques get detailed presentation
New nominations (15 min): Team members propose additions to radar
Decisions (15 min): Explicit decisions on what to ADOPT, TRIAL, SKIP

Deep dive structure:

The presenter covers:

What the technique does (5 min)
Our evaluation results (10 min)
Production viability assessment (5 min)
Recommendation with rationale (5 min)
Team discussion and decision (10 min)

Ground rules:

Decisions are explicit: Every deep dive ends with a decision
Disagreement is healthy: Different perspectives improve decisions
Time-box discussions: Use parking lot for tangents
Document everything: Meeting notes capture decisions and reasoning

Coordinating Research Efforts

Without coordination, teams duplicate effort or leave gaps. Here’s how to coordinate:

Divide by expertise, not by paper:

Don’t assign papers to people. Assign domains to people. “Alice owns retrieval research; Bob owns inference optimization.” This builds expertise and avoids redundant reading.

Share learnings, not just results:

When someone completes a prototype or evaluation, they should share:

What they learned (not just metrics)
Surprising findings
Implicit knowledge they discovered
Recommendations for others considering similar work

Maintain institutional memory:

Research notes are in a shared location, not individual notebooks
Evaluation results are stored with methodology, not just numbers
Failed experiments are documented (so others don’t repeat them)

Handling Disagreements

Teams will disagree on research adoption decisions. Handle this constructively:

When someone wants to adopt but others are skeptical:

Run a time-boxed validation. If the advocate is right, you’ll have evidence. If they’re wrong, you’ll have learned cheaply. Don’t argue hypothetically—test.

When the team is split:

Identify what information would resolve the disagreement. Often teams disagree because they’re uncertain, not because they have different values. Get that information.

When someone feels strongly:

Strong opinions are valuable information. Understand why. Are they seeing something others miss? Is their context different? Take strong opinions seriously.

When you must decide without consensus:

Document the disagreement. Note who advocated for what. Make a decision and move on. Revisit when new information emerges.

Research Failure Post-Mortems

When research adoption fails—techniques that didn’t work, projects that were killed, implementations that were rolled back—you have a learning opportunity. Failure post-mortems extract that learning.

When to Run a Post-Mortem

Run a post-mortem when:

A technique that passed initial evaluation fails in production
A research project is killed after significant investment
Adoption takes much longer or costs much more than expected
Production implementation reveals issues not caught in evaluation

Don’t run post-mortems for:

Early-stage prototypes that get killed (expected and healthy)
Techniques that were obviously not a fit (no surprise, no learning)

Post-Mortem Structure

1. Timeline reconstruction (15 min)

What actually happened? Build a factual timeline:

When did we first encounter this research?
What was our initial assessment?
What milestones did we hit?
When did problems emerge?
What was the final outcome?

2. What went wrong (20 min)

Identify specific failures. Categories to consider:

Paper evaluation failures:

Did we miss red flags in the paper?
Did we misunderstand the technique?
Did we overestimate the claimed benefits?

Prototype failures:

Did the prototype not accurately predict production behavior?
Did we test on unrepresentative data?
Did we miss important failure modes?

Integration failures:

Were there unexpected dependencies?
Did performance not scale?
Did edge cases cause problems?

Process failures:

Did we ignore warning signs?
Did we continue past appropriate kill points?
Did we fail to get appropriate input?

3. Root cause analysis (15 min)

For each failure, ask “why?” until you reach actionable root causes:

Surface: “The technique was slower in production than in benchmarks.”
Why?: “Our production data has longer sequences than benchmark data.”
Why?: “We tested on benchmark data, not sampled production data.”
Root cause: “Our evaluation protocol didn’t include production-representative data.”

4. Lessons and actions (10 min)

What changes would have caught this earlier? Be specific:

Update evaluation checklist to include X
Add Y to prototype requirements
Change review process to include Z

Assign owners and deadlines for action items.

Common Failure Patterns

Post-mortems often reveal recurring patterns:

Pattern: Benchmark-production gap
Symptom: Technique works on benchmarks, fails on production data.
Prevention: Always include production-sampled data in evaluation.

Pattern: Scale surprise
Symptom: Technique works at small scale, fails at production scale.
Prevention: Test at production scale before passing validation gate.

Pattern: Integration underestimation
Symptom: Core technique works, but integration takes 3x longer than expected.
Prevention: Include integration spike in Discovery phase.

Pattern: Sunk cost continuation
Symptom: Project continued past point where it should have been killed.
Prevention: Define kill criteria upfront; enforce time-boxes.

Pattern: Missing expertise
Symptom: Team misunderstood technique due to unfamiliarity with the domain.
Prevention: Include domain expert in evaluation, or consult externally.

Communicating Research Decisions

Research adoption decisions affect stakeholders who don’t read papers: product managers, executives, partner teams. Effective communication builds trust and gets buy-in.

Framing for Different Audiences

For executives (1-2 paragraphs):

Focus on: Business impact, risk/reward, timeline, cost.

Template: “We evaluated [technique] for [use case]. Our assessment: [ADOPT/TRIAL/SKIP]. The expected benefit is [X% improvement in metric Y], with implementation cost of [Z weeks]. Key risk is [risk]. We recommend [action].”

Example: “We evaluated speculative decoding for our inference platform. Our assessment: TRIAL. The expected benefit is 40-50% latency reduction for chat completions, with implementation cost of 4 engineering weeks. Key risk is that benefit varies by model and input distribution—some workloads may see less improvement. We recommend a 4-week trial on our highest-volume endpoint.”

For product managers (1 page):

Focus on: User impact, tradeoffs, timeline, what changes.

Include:

What this technique does (in plain language)
Expected user-visible benefits
Any user-visible downsides or changes
Timeline and dependencies
What you need from them (decisions, prioritization)

For engineering partners (technical detail):

Focus on: How it works, integration requirements, technical risks.

Include:

How the technique works
What changes in their systems
Integration requirements (APIs, data formats, etc.)
Performance characteristics
Known issues and mitigations

Handling “Why Not Just…?” Questions

Non-technical stakeholders often ask reasonable questions that reveal unfamiliarity with research realities:

“Why don’t we just use the technique from [big company’s blog post]?”

Response: Explain what evaluation revealed. “We evaluated that technique. On our data, it provided 3% improvement rather than the 15% they claimed. The difference is likely due to [data distribution / scale / use case]. We found [alternative] more promising for our specific situation.”

“Can’t we just implement what’s in the paper?”

Response: Explain the paper-to-production gap. “Papers describe algorithms but not production systems. Implementation typically takes 3-5x as long as reading and understanding the paper. We also need to validate that results transfer to our context, and they often don’t.”

“Why is this taking so long? It’s just one technique.”

Response: Explain the validation process. “We could implement quickly but might ship something that doesn’t actually help—or worse, causes problems. Our process validates that techniques work on our data at our scale before production. This takes [X weeks] but prevents [past failure example].”

“Our competitor announced they’re using [technique]. Why aren’t we?”

Response: Separate announcement from reality. “Announcements often precede production deployment by months. We’re tracking this technique in our ASSESS category. Current assessment is [X]. We’ll accelerate evaluation if we see evidence of production deployment and competitive impact.”

Building Research Credibility

Credibility with stakeholders comes from a track record of good decisions:

Celebrate successful adoptions: When a technique you recommended pays off, share the results. “Speculative decoding reduced P50 latency by 45%, saving $X/month in infrastructure costs.”

Acknowledge failures honestly: When something doesn’t work, own it. “We invested 3 weeks in [technique] and determined it’s not viable for our use case. Here’s what we learned that will help us evaluate similar techniques better.”

Make predictions, track accuracy: “We predicted this technique would provide 20-30% improvement. Actual result: 25%. Our evaluation methodology is calibrated.” Over time, this builds trust in your assessments.

Avoid hype: If you consistently recommend things that don’t pan out, credibility erodes. Conservative, accurate predictions beat optimistic, inaccurate ones.
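Tracking prediction accuracy, as suggested above, can be as simple as logging each predicted range next to the measured result and computing how often the result landed inside the range. The sketch below is one minimal way to do that; the `coverage` helper and the logged numbers are hypothetical.

```python
from typing import List, Tuple

# Each record: (predicted_low, predicted_high, actual), all in percent.
Prediction = Tuple[float, float, float]


def coverage(records: List[Prediction]) -> float:
    """Fraction of predictions whose actual result fell inside the range."""
    if not records:
        raise ValueError("no predictions recorded")
    hits = sum(1 for low, high, actual in records if low <= actual <= high)
    return hits / len(records)


# Hypothetical log of past improvement predictions and measured results.
log = [
    (20.0, 30.0, 25.0),   # inside the predicted range
    (40.0, 50.0, 45.0),   # inside
    (10.0, 15.0, 3.0),    # overestimated
    (5.0, 10.0, 8.0),     # inside
]

print(coverage(log))  # 0.75
```

Coverage that stays high with reasonably narrow ranges is evidence that the evaluation methodology is calibrated. Consistently wide ranges or frequent misses both suggest the process needs work.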

Developing Taste for Research

What “Taste” Means in This Context

Beyond frameworks and processes, experienced practitioners develop intuitions—a “taste” for research that helps them quickly filter promising from unpromising ideas. This isn’t mystical; it’s pattern recognition developed over years of reading papers and seeing how they play out.

Here are some patterns that indicate promising research:

Signs of Promising Research

Clear, simple insight: The best research often has a simple core idea. If you can’t explain the key insight in one sentence, the technique may be overcomplicated.

Examples:

Flash Attention: “Compute attention tile-by-tile to avoid materializing the full attention matrix.”
LoRA: “Fine-tune low-rank adapters instead of full weights.”
RLHF: “Train a reward model from human preferences, then optimize against it.”

Strong theoretical grounding: Research that explains why something works—not just that it works—tends to be more robust. You can predict when it will fail.

Large, simple improvements: Be skeptical of techniques that show many small improvements across many benchmarks. Prefer techniques that show large improvements on core metrics.

Robustness to hyperparameters: If results require very specific hyperparameter settings, they may not reproduce in your context. Results stable across a range of settings are more trustworthy.

Multiple independent validations: If multiple groups reproduce similar results independently, confidence increases substantially.

Signs of Concerning Research

Complexity without proportional benefit: A method with 10 novel components that achieves 2% improvement is concerning. What are all those components doing?

Narrow wins: A technique that only helps on one specific dataset or setting may not generalize.

Required compute vastly different from paper: If reproducing results requires 100x the compute reported in the paper, something is wrong.

Extraordinary claims: Papers claiming to “solve” a longstanding problem deserve extra scrutiny. If it seems too good to be true, check the baselines and evaluation methodology carefully.

Authors with no implementation track record: Novel techniques from authors who’ve never shipped production systems should be viewed differently than techniques from industry labs or researchers with deployment experience.

Building Your Intuition

Taste develops through experience. Deliberately cultivate it:

Track predictions: When you read papers, explicitly predict: “Will this technique be widely adopted? In what timeframe?” Track your predictions. Review them after 1-2 years. Where were you right? Where were you wrong? What should you have noticed?

Study adoption patterns: For techniques that succeeded, trace the adoption path. What made them successful? For techniques that failed, why? What were the warning signs?

Engage with practitioners: The research world and the production world have different information. Conferences give you research context; production experience gives you real-world grounding. You need both.

Maintain healthy skepticism: The default for any paper should be “probably not relevant to me.” Move to interest only when evidence accumulates. This sounds cynical, but it’s just realistic given base rates.

Key Takeaways

Most papers aren’t relevant to you - The skill is efficient filtering. Use multi-pass reading: first pass for relevance, second for understanding, third for critical evaluation.
Reproducibility doesn’t equal production viability - Evaluate computational efficiency, integration complexity, robustness to real-world conditions, and long-term maintainability separately.
Prototype to learn, not to build - Define hypotheses and kill criteria before starting. Time-box ruthlessly. A failed prototype that answers your question quickly is a success.
Benchmark with statistical rigor - Control for variance, compare fairly against proper baselines, test at production scale, and require statistical significance before making decisions.
Develop taste through deliberate practice - Track your predictions about which techniques will be adopted. Study patterns of successful vs failed adoptions. Healthy skepticism is realistic, not cynical.

Summary

The gap between research and production isn’t a bug—it’s a fundamental feature of how the two worlds operate. Researchers optimize for novelty, clean experimental conditions, and benchmark performance. Engineers optimize for reliability, efficiency, and robustness. Bridging this gap requires skills in both domains and a systematic approach to translation.

Reading papers effectively means using a multi-pass method, reading critically, and maintaining a sustainable practice. Most papers aren’t relevant to you—the skill is filtering efficiently.

Evaluating production viability requires assessing computational efficiency, integration complexity, robustness, and maintainability. A reproducible paper can still be completely wrong for your production system.

Prototyping efficiently means focusing on learning, not building. Define hypotheses and kill criteria before starting. Time-box ruthlessly.

Benchmarking rigorously ensures you make decisions based on real evidence. Control for variance, compare fairly, test at scale, and require statistical significance.

Managing the pipeline with clear stages and gates prevents projects from drifting. Know when to proceed and when to kill.

Developing taste comes from experience. Track your predictions, study adoption patterns, and maintain healthy skepticism.

The practitioners who create the most value are those who can quickly identify which research matters for their context, validate it efficiently, and translate it into production impact—while avoiding the many techniques that sound promising but don’t pan out.

Connections to Other Chapters

Chapter 9 (LLM Deployment & Infrastructure) provides context for evaluating production viability of new techniques
Chapter 15 (MLOps & Evaluation) covers evaluation infrastructure that supports rigorous benchmarking
Chapter 21 (Deepening Technical Expertise) discusses paper reading practices and building mental models
Chapter 23 (Technical Communication) extends the stakeholder communication skills needed for research decisions
Chapter 23 (Technical Decision Making) provides broader decision frameworks applicable to research adoption
Chapter 27 (Performance Engineering) helps evaluate computational efficiency claims in papers

Practical Exercises

Exercise 1: Paper Evaluation

Select a paper published in the last 6 months in your area of interest.

Complete all three reading passes, taking notes as specified.
Identify at least three red flags or concerns.
Identify at least two strengths.
Rate production viability on each of the four dimensions.
Make a recommendation: Adopt, Trial, Assess, or Skip. Justify it.

Self-Assessment Questions:

Did you complete all three passes, or did you jump to conclusions after the first?
Are your red flags specific (citing sections/tables) or vague (“seems weak”)?
Does your viability assessment address YOUR context, not viability in general?
Is your recommendation justified by the evidence you gathered?
Could you defend this evaluation to a skeptical colleague?

Quality Indicators:

Strong: Red flags cite specific sections (“Table 3 shows only 1% improvement over BM25 baseline”)
Weak: Red flags are vague (“results don’t seem that impressive”)
Strong: Viability assessment considers your specific constraints (“our P99 latency requirement is 100ms”)
Weak: Viability assessment is generic (“seems efficient enough”)

Exercise 2: Reproduction Attempt

For a paper you’ve evaluated favorably:

Find or create a reproduction plan.
Attempt to reproduce core results.
Document what matched and what didn’t.
Identify what knowledge was implicit (not in the paper).
Estimate how long production implementation would take.

Self-Assessment Questions:

Did you verify reproduction on the paper’s exact setup before modifying for your context?
Did you document implicit knowledge as you discovered it (not reconstructed later)?
Is your production estimate based on this reproduction experience or just a guess?
Did you try contacting authors for missing details?

Quality Indicators:

Strong: Documentation includes specific implicit knowledge (“layer norm epsilon must be 1e-5, not the default 1e-6”)
Weak: Documentation is vague (“some hyperparameters needed tuning”)
Strong: Production estimate includes breakdown (implementation, integration, testing, buffer)
Weak: Production estimate is a single number without justification

Exercise 3: Comparative Benchmark

Compare a new technique against your current production baseline:

Design a fair benchmark protocol.
Define at least three metrics to measure.
Run at least three trials per method.
Calculate statistical significance.
Perform cost-benefit analysis.
Make a final recommendation with quantified justification.

Self-Assessment Questions:
- Does your benchmark use production-representative data?
- Did you control for compute budget, hyperparameter tuning effort, and hardware?
- Did you run enough trials to establish statistical significance?
- Does your cost-benefit analysis include all relevant costs (implementation, maintenance, compute)?
- Would your recommendation change if variance were higher?

Quality Indicators:
- Strong: Statistical analysis includes confidence intervals and p-values
- Weak: Statistical analysis is just “average of 3 runs”
- Strong: Cost-benefit includes ongoing costs (maintenance, compute) not just implementation
- Weak: Cost-benefit only considers implementation time
- Strong: Recommendation acknowledges uncertainty (“Given high variance, we should run more trials before committing”)
- Weak: Recommendation ignores statistical uncertainty
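For the statistical step in Exercise 3, a minimal sketch using only the Python standard library might look like the following. It runs an exact permutation test on the difference in means and a percentile bootstrap interval for that difference. The scores and the names `permutation_p_value` and `bootstrap_diff_ci` are hypothetical, chosen only for illustration; substitute your own per-trial metrics.

```python
import itertools
import random
import statistics


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference in means."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    hits = 0
    splits = 0
    # Enumerate every way of assigning n_a of the pooled scores to group "a".
    for idx in itertools.combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
        splits += 1
    return hits / splits


def bootstrap_diff_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resample_b) - statistics.mean(resample_a))
    diffs.sort()
    low = diffs[int(round(alpha / 2 * n_resamples))]
    high = diffs[int(round((1 - alpha / 2) * n_resamples)) - 1]
    return low, high


# Hypothetical accuracy from three trials per method (not real results).
baseline = [0.812, 0.808, 0.815]
candidate = [0.826, 0.831, 0.822]

p_value = permutation_p_value(baseline, candidate)
ci_low, ci_high = bootstrap_diff_ci(baseline, candidate)
print(f"p = {p_value:.3f}, 95% CI for improvement = [{ci_low:.4f}, {ci_high:.4f}]")
```

With three trials per method there are only 20 ways to split the six scores, so the smallest two-sided p-value this test can return is 2/20 = 0.10, even when every candidate run beats every baseline run. That is one concrete reason to treat three trials as a floor rather than a target, and a bootstrap interval built from three points per method should be read as a rough indication only.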

Exercise 4: Historical Analysis

Select a technique that was adopted in your organization or in the industry broadly:

Find the original paper(s).
Trace the adoption timeline.
What enabled adoption? What slowed it?
What issues emerged in production?
With hindsight, what would you have looked for to predict these outcomes?

Self-Assessment Questions:
- Did you trace adoption through primary sources (papers, blog posts) or rely on secondhand accounts?
- Did you identify both technical and non-technical factors in adoption?
- Are your “with hindsight” observations actually predictable from the original paper, or are they only obvious in retrospect?

Quality Indicators:
- Strong: Timeline includes specific dates and milestones
- Weak: Timeline is vague (“took about a year”)
- Strong: Analysis distinguishes what was predictable from what was genuinely surprising
- Weak: Analysis claims everything was obvious in hindsight

Exercise 5: Prediction Tracking

For each of five current papers you find interesting:

Predict: Will this be widely adopted in 2 years? Why or why not?
Record your predictions with reasoning.
Set a calendar reminder to review in 2 years.
When you review, analyze where your predictions were right and wrong.

Self-Assessment Questions:
- Are your predictions specific enough to be clearly right or wrong?
- Did you record reasoning at prediction time (not reconstructed later)?
- Are you predicting actual adoption or just continued research interest?
- Did you consider what “widely adopted” means (by whom? in what contexts?)?

Quality Indicators:
- Strong: Predictions define success criteria (“I predict >50% of production LLM systems will use this by 2027”)
- Weak: Predictions are vague (“this will probably be important”)
- Strong: Reasoning identifies specific factors that would lead to adoption or failure
- Weak: Reasoning is “this seems good”
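One way to keep the prediction log from Exercise 5 honest is to store each prediction as a record that cannot be scored until an outcome is filled in. The sketch below is illustrative only: the `Prediction` fields, the helper names, and the sample entries are assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Prediction:
    paper: str
    predicted_adopted: bool        # the call made today
    criterion: str                 # what "widely adopted" means for this paper
    reasoning: str                 # written at prediction time, never edited later
    made_on: date
    review_on: date
    actually_adopted: Optional[bool] = None  # filled in at review time


def due_for_review(predictions, today):
    """Unresolved predictions whose review date has arrived."""
    return [p for p in predictions
            if p.actually_adopted is None and p.review_on <= today]


def hit_rate(predictions):
    """Fraction of resolved predictions that were correct, or None if none resolved."""
    resolved = [p for p in predictions if p.actually_adopted is not None]
    if not resolved:
        return None
    correct = sum(p.predicted_adopted == p.actually_adopted for p in resolved)
    return correct / len(resolved)


# Hypothetical entries; paper names and outcomes are placeholders.
log = [
    Prediction("Paper A", True, ">50% of our peer teams use it", "drop-in, code released",
               date(2025, 1, 15), date(2027, 1, 15)),
    Prediction("Paper B", False, "ships in a major framework", "needs full retraining",
               date(2023, 1, 15), date(2025, 1, 15), actually_adopted=False),
    Prediction("Paper C", True, "ships in a major framework", "large claimed speedup",
               date(2023, 1, 15), date(2025, 1, 15), actually_adopted=False),
]

pending = due_for_review(log, today=date(2027, 2, 1))
rate = hit_rate(log)
print([p.paper for p in pending], rate)
```

Recording `criterion` separately from `reasoning` forces the prediction to be checkable: the review step compares the outcome against the criterion, then reads the reasoning to see which factors were misjudged.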

Exercise 6: Research Radar Creation

Create a research radar for your team or area:

Identify 10-15 techniques relevant to your work.
Categorize each as ADOPT, TRIAL, ASSESS, HOLD, or SKIP.
For each TRIAL or ASSESS item, define next actions and owners.
Document reasoning for at least 3 SKIP decisions.
Present to colleagues and refine based on feedback.

Self-Assessment Questions:
- Do your categories reflect current state, not aspiration?
- Are SKIP decisions documented well enough that someone could understand why without asking you?
- Did feedback from colleagues change any categorizations?
- Is the radar maintainable (not too many items, clear update process)?

Quality Indicators:
- Strong: SKIP decisions include specific reasoning (“superseded by X” or “doesn’t address our constraint Y”)
- Weak: SKIP decisions are just “not relevant”
- Strong: TRIAL items have specific next actions with deadlines
- Weak: TRIAL items are “investigate further”
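The radar in Exercise 6 can live in a spreadsheet, but a small data structure makes the quality checks mechanical. The following is a sketch under assumptions: the `RadarItem` fields and the `radar_gaps` helper are hypothetical, the ring placements are placeholders rather than recommendations, and the check is stricter than the exercise in that it flags every SKIP item without reasoning.

```python
from dataclasses import dataclass
from enum import Enum


class Ring(Enum):
    ADOPT = "adopt"
    TRIAL = "trial"
    ASSESS = "assess"
    HOLD = "hold"
    SKIP = "skip"


@dataclass
class RadarItem:
    technique: str
    ring: Ring
    reasoning: str = ""     # why the item sits in this ring
    next_action: str = ""   # expected for TRIAL and ASSESS items
    owner: str = ""         # expected for TRIAL and ASSESS items


def radar_gaps(items):
    """Return human-readable gaps that make the radar hard to act on."""
    gaps = []
    for item in items:
        if item.ring in (Ring.TRIAL, Ring.ASSESS):
            if not item.next_action.strip():
                gaps.append(f"{item.technique}: {item.ring.name} item has no next action")
            if not item.owner.strip():
                gaps.append(f"{item.technique}: {item.ring.name} item has no owner")
        if item.ring is Ring.SKIP and not item.reasoning.strip():
            gaps.append(f"{item.technique}: SKIP decision has no documented reasoning")
    return gaps


# Placeholder entries for illustration only.
radar = [
    RadarItem("LoRA fine-tuning", Ring.ADOPT, reasoning="adapters leave base weights untouched"),
    RadarItem("Speculative decoding", Ring.TRIAL,
              next_action="shadow-mode latency test", owner="alice"),
    RadarItem("Technique X", Ring.ASSESS, next_action="second-pass read"),
    RadarItem("Technique Y", Ring.SKIP),
]

gaps = radar_gaps(radar)
for gap in gaps:
    print(gap)
```

Running the check before presenting the radar to colleagues surfaces the two weak patterns named in the quality indicators: TRIAL or ASSESS items nobody owns, and SKIP decisions that only the author can explain.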

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] What is the three-pass method for reading papers? Why is it more effective than reading start-to-finish?

Answer

Three-pass method:

Pass 1 (5-10 min): Title, abstract, intro, conclusion, skim figures. Decide: read more or skip?

Pass 2 (30-60 min): Main ideas, skip math/implementation details. Summarize key contributions, identify claims.

Pass 3 (2-4 hours): Reproduce reasoning, understand details, identify implicit assumptions.

Why better than linear reading: (1) Efficient filtering: Most papers aren’t worth Pass 3. (2) Prevents sunk cost: Linear reading commits time before knowing value. (3) Builds context first: Understanding the big picture makes details easier. (4) Identifies questions: You know what to look for in deeper passes. (5) Matches reading goals: Sometimes Pass 1 awareness is all you need.

Q2. [IC2] What are the four dimensions of production viability for a research technique? Why might a technique score high on one dimension but low on others?

Answer

Four dimensions:

Quality improvement: Does it actually work better than current methods?
Efficiency: Can we run it at acceptable latency/cost?
Robustness: Does it work on real-world data, not just benchmarks?
Integration: How hard is it to integrate with existing systems?

Why dimensions are independent: (1) Quality vs efficiency: Bigger models are often better but slower/more expensive. (2) Quality vs robustness: Benchmark-tuned methods may overfit to specific conditions. (3) Efficiency vs integration: Highly optimized code may be hard to maintain. (4) All vs integration: Brilliant research may require complete system redesign.

Example: A new retrieval method might show 15% quality improvement (high) but require a complete index rebuild (low integration), GPU inference (low efficiency), and only work on English text (low robustness).
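Because the dimensions move independently, an averaged score can hide the one dimension that blocks adoption. A minimal sketch of that point, with a hypothetical `ViabilityScore` record and invented 1-5 scores loosely based on the retrieval example above:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ViabilityScore:
    """Scores from 1 (poor) to 5 (strong) on the four dimensions."""
    quality: int
    efficiency: int
    robustness: int
    integration: int

    def mean(self):
        scores = list(asdict(self).values())
        return sum(scores) / len(scores)

    def weakest(self):
        """Return (dimension, score) for the lowest-scoring dimension."""
        return min(asdict(self).items(), key=lambda kv: kv[1])


# Invented scores: strong quality gain, GPU inference, English only,
# complete index rebuild.
retrieval = ViabilityScore(quality=5, efficiency=2, robustness=2, integration=1)

print(retrieval.mean())     # 2.5
print(retrieval.weakest())  # ('integration', 1)
```

Reporting the weakest dimension next to the mean keeps the discussion on the constraint that would actually stop a rollout, instead of on a middling average.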

Q3. [Senior] What are kill criteria for research prototypes and why are they important? How do you set good kill criteria?

Answer

Kill criteria: Pre-defined conditions under which you stop pursuing a research direction. Set before starting, enforced during execution.

Why important: (1) Prevents sunk cost fallacy: “We’ve invested so much, let’s try one more thing.” (2) Enables parallel exploration: Can try multiple approaches with bounded investment. (3) Forces clarity: Defining failure makes success criteria clearer. (4) Saves resources: Failing fast leaves time for better ideas.

Setting good kill criteria:

Define before starting (not during, when bias is high).
Make them measurable: “If accuracy <85% after 2 weeks, stop.”
Include time bounds: Unbounded exploration never ends.
Consider multiple metrics: Accuracy, latency, maintainability.
Get external commitment: Tell someone your criteria.

Warning signs of bad criteria: (1) “We’ll know it when we see it.” (2) Criteria adjusted when approaching failure. (3) No time bound. (4) Only success defined, not failure.
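Writing the criteria down as data, before the prototype starts, makes it harder to adjust them quietly later. The sketch below assumes a hypothetical `KillCriterion` record and `decide` helper; the thresholds, metric names, and dates are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: criteria are fixed once written down
class KillCriterion:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def is_met(self, value):
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def decide(criteria, measurements, deadline, today):
    """Return (decision, failed_metrics). Nothing is judged before the deadline."""
    if today < deadline:
        return "continue", []
    failed = []
    for criterion in criteria:
        value = measurements.get(criterion.metric)
        # A metric that was never measured counts as a failure.
        if value is None or not criterion.is_met(value):
            failed.append(criterion.metric)
    return ("stop" if failed else "proceed"), failed


# Placeholder criteria: accuracy of at least 85% and P99 latency of at most
# 100 ms, judged two weeks after a hypothetical start date.
criteria = (
    KillCriterion("accuracy", 0.85),
    KillCriterion("p99_latency_ms", 100.0, higher_is_better=False),
)
deadline = date(2025, 3, 14)
measured = {"accuracy": 0.72, "p99_latency_ms": 80.0}

early = decide(criteria, measured, deadline, today=date(2025, 3, 7))
final = decide(criteria, measured, deadline, today=deadline)
print(early, final)
```

The time bound and the failure condition are both explicit here, which addresses two of the warning signs above: an open-ended exploration and criteria that define only success.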

Q4. [Senior] What red flags in a paper should make you skeptical about production viability?

Answer

Red flags:

Evaluation concerns:
- Comparisons against weak/outdated baselines
- Evaluation only on one dataset, especially if curated
- Missing statistical significance tests
- Metrics that don’t match production needs (accuracy vs latency)
- Held-out test set that’s too similar to training

Reproducibility concerns:
- Missing implementation details
- No code release
- Results only from proprietary data
- Hyperparameters not specified
- Results that seem too good

Practical concerns:
- Computation requirements buried/missing
- Inference cost not discussed
- Scaling behavior not shown
- Only works on specific data types
- Requires components that are impractical

Implicit assumptions:
- Clean data assumed
- Single language/domain
- Specific hardware requirements
- Expertise required to tune
- Failure modes not discussed

No single red flag disqualifies a paper, but several together warrant skepticism.

Q5. [Staff] How do you decide whether to adopt a new technique early (for competitive advantage) versus waiting for it to mature (for lower risk)? What factors influence this decision?

Answer

Factors favoring early adoption:
- Technique provides significant competitive advantage
- Cost of being wrong is low (reversible decision)
- You have expertise to manage immaturity
- Early mover advantage is substantial
- Technique is simple enough to understand deeply

Factors favoring waiting:
- Cost of failure is high (customer-facing, safety-critical)
- Technique is complex, easy to misimplement
- Ecosystem support is critical (libraries, tooling)
- Industry standards likely to emerge
- Your team lacks expertise to debug issues

Decision framework:
1. How reversible is adoption? (Easy = adopt early)
2. How large is the expected improvement? (Large = adopt early)
3. How mature is implementation? (Code/libraries = adopt earlier)
4. How critical is the use case? (Critical = wait)
5. Can you run in shadow mode first? (Yes = adopt earlier)

Strategy: Often a “trial” approach—adopt in limited scope, measure carefully, expand if successful. This limits downside while capturing upside.

Spot the Problem

Problem 1. [IC2] An engineer’s paper evaluation:

“This paper shows 10% improvement on MMLU benchmark. That’s significant. We should implement it.”

What’s missing from this evaluation?

Answer

Missing considerations:

Quality: (1) What’s the baseline they compare against? (Weak baselines inflate improvements.) (2) Is 10% on MMLU meaningful for YOUR use case? (3) Do they show results on multiple benchmarks or just the one where they shine?

Efficiency: (4) What’s the compute cost? (5) How does latency compare to current approach? (6) Does it require additional infrastructure?

Robustness: (7) Does it work on real data, not just benchmarks? (8) What languages/domains was it tested on? (9) What are the failure modes?

Integration: (10) How hard is implementation? (11) Is there code available? (12) Does it require changes to our architecture?

Better evaluation: “This paper shows a 10% improvement on MMLU. Before recommending, I need to: (1) reproduce on our benchmark, (2) measure latency/cost, (3) test on production-like data. I’ll spend 3 days on a spike.”

Problem 2. [Senior] A prototype evaluation report:

“We spent 4 weeks implementing the new retrieval method. It works on our test set but only achieves 72% accuracy instead of the paper’s 89%. We’re debugging why our results differ. We think with another 2-3 weeks we can close the gap.”

What concerns should you raise?

Answer

Concerns:

Sunk cost: 4 weeks invested, asking for 2-3 more. Is this justified by potential value, or is it escalation of commitment?
No kill criteria mentioned: When do we stop? What if another 2-3 weeks also fails?
Gap is very large: 72% vs 89% suggests a fundamental issue, not just tuning. Has the root cause been identified?
“We think” language: 2-3 weeks is a guess, not an estimate. What specifically needs to happen?
No alternative considered: Is this technique the only option? Should we pursue an alternative in parallel?

Questions to ask: (1) What specifically is causing the gap? (2) What evidence suggests more time will help? (3) What’s the kill criterion for this approach? (4) Is the paper’s 89% reproducible by anyone? (5) What’s the opportunity cost of another 2-3 weeks?

Recommendation: Define explicit kill criterion (e.g., 85% by end of next week). If not met, stop and consider alternatives.

Problem 3. [Staff] A team’s approach to research adoption:

“We have a researcher who reads papers full-time and recommends what to implement. Developers then build what the researcher recommends. This separates concerns—researchers research, engineers engineer.”

What’s problematic about this structure?

Answer

Problems:

Translation gap: Researchers may not understand production constraints. What looks good in a paper may be impractical.
Lost context: Engineers who implement without understanding the reasoning make poor tradeoff decisions.
Feedback loop broken: Engineers see what doesn’t work in practice; this needs to inform paper evaluation.
Ownership split: Who owns the outcome—researcher for recommending or engineer for implementing?
Incentive misalignment: Researcher is incentivized to find novel things; engineers may prefer stable, proven approaches.
Speed bottleneck: All decisions flow through one person’s reading capacity.

Better structures:

Option 1: Engineers read papers relevant to their work, with researcher as consultant for deep expertise.

Option 2: Researcher-engineer pairs work together on evaluation and implementation.

Option 3: Researcher does initial filter, engineers do final assessment and implementation together.

Key principle: Production viability assessment requires both research judgment and engineering judgment. Don’t separate them.

Self-Assessment: Research Translation Capabilities

Rate yourself on each capability (1 = novice, 5 = expert):

Paper Reading and Evaluation
- [ ] I can efficiently filter papers using the three-pass method
- [ ] I identify red flags and implicit assumptions in papers
- [ ] I assess production viability across all four dimensions
- [ ] I maintain a sustainable paper reading practice

Reproduction and Prototyping
- [ ] I can reproduce paper results (or diagnose why not)
- [ ] I design prototypes that answer specific hypotheses
- [ ] I define and enforce kill criteria
- [ ] I budget appropriately for reproduction challenges

Benchmarking and Evaluation
- [ ] I design fair comparative benchmarks
- [ ] I use statistical methods to establish significance
- [ ] I perform cost-benefit analysis before recommending adoption
- [ ] I test at production scale before passing validation

Judgment and Taste
- [ ] I track my predictions and learn from errors
- [ ] I can explain research decisions to non-technical stakeholders
- [ ] I know when to adopt early vs. wait for maturity
- [ ] I maintain appropriate skepticism without dismissing promising work

Growth Areas: Based on your self-assessment, identify 2-3 specific areas to develop. For each:
1. What specific practice would build this skill?
2. How will you measure improvement?
3. What timeline will you set?

Keshav (2007), “How to Read a Paper” - Classic three-pass approach to academic paper reading.
Sculley et al. (2015), “Hidden Technical Debt in ML Systems” - Essential context for production ML challenges.
Hooker (2021), “The Hardware Lottery” - Why hardware determines which ideas succeed.

Deep Dives

Paleyes et al. (2022), “Challenges in Deploying ML” - Survey of deployment challenges across industry.
Pineau et al. (2021), “Improving Reproducibility in ML Research” - Systematic analysis of reproducibility issues.

Case Studies

Dao et al. (2022), “FlashAttention” - Study as an example of research that translated well to production.
Kwon et al. (2023), “PagedAttention/vLLM” - Systems research with clear production applicability.

---
title: "Chapter 28: Research-to-Production"
keywords: [research, paper evaluation, reproducibility, productionization, benchmarking, ablation studies]
difficulty: advanced
prerequisites: [ch21, ch15]
estimated_time: "3-4 hours"
---
### The Core Tension: What Researchers Optimize For vs. What Engineers Need

To bridge research and production, you must first understand why the gap exists. It's not incompetence on either side; it's fundamentally different optimization targets.

**What researchers optimize for:**

1. **Novelty**: Academic incentives reward new ideas over incremental improvements. A 2% accuracy gain with a completely new architecture gets published; a 5% gain from careful engineering of existing approaches often doesn't.
2. **Clean experimental conditions**: Papers typically evaluate on standard benchmarks with well-curated data, fixed computational budgets, and controlled settings. This enables fair comparisons but diverges from production reality.
3. **Benchmark performance**: Researchers target metrics that the community has agreed upon. But BLEU scores, perplexity, and accuracy on MMLU may not correlate with what matters for your application.
4. **Publication speed**: The race to publish means papers often skip extensive ablations, robustness testing, or scalability analysis. "Future work" sections are filled with unaddressed concerns.

**What production engineers need:**

1. **Reliability**: A technique that works 99.9% of the time beats one that's 5% better on average but fails catastrophically 1% of the time.
2. **Efficiency at scale**: Research prototypes run on a few examples. Production systems handle millions of requests with strict latency SLOs.
3. **Integration simplicity**: The best technique in isolation may be worse than a simpler one that fits your existing architecture.
4. **Maintainability**: Code that one PhD student understands isn't the same as code your team can debug at 3 AM during an outage.
5. **Robustness to distribution shift**: Academic benchmarks are static. Production data drifts, contains adversarial inputs, and surprises you constantly.

This tension isn't resolvable; it's fundamental. Your job is to navigate it skillfully.

### A Mental Model for Research Translation

Think of research papers as raw ore from a mine. The ore contains valuable material, but it's mixed with rock, requires processing, and isn't immediately usable. Some ore is rich and easy to refine; some is mostly waste. The skill is knowing which veins to pursue and how to extract value efficiently.

The translation process has three phases:

1. **Prospecting**: Scanning the landscape to identify promising claims. Most papers aren't relevant to you; the skill is filtering efficiently.
2. **Assaying**: Deep evaluation of promising candidates. Does the ore actually contain what it claims? Will it refine into something useful?
3. **Refining**: Transforming research into production-ready implementations. This is where most efforts fail, not from lack of skill, but from underestimating the work required.

### What You'll Learn

- How to read ML papers efficiently and evaluate claims critically
- Why papers often don't reproduce, and what to do about it
- Decision frameworks for evaluating production viability
- Prototyping methodology that validates ideas with minimal investment
- When to adopt, when to wait, and when to abandon research directions
- Case studies of techniques that translated well, and ones that didn't

### Prerequisites

- Strong ML fundamentals (Part II)
- Production systems experience (Part III)
- Familiarity with your organization's technical context and constraints

---

## Reading Research Papers

### Why Papers Are Hard to Read (And Why You Must Read Them Anyway)

You can't rely on blog posts and Twitter threads alone. By the time research is popularized, it's either already implemented in frameworks or has been superseded. The distillation process also loses crucial nuance: the limitations section that explains when the technique fails, the ablation that shows which components actually matter, the hyperparameter sensitivity that determines whether you can reproduce results.

Reading papers gives you early access to techniques 6-12 months before they're mainstream. More importantly, it gives you the deep understanding needed to predict whether something will work in your context.

But papers are notoriously difficult to read. Academic writing optimizes for different goals than technical documentation: signaling expertise to reviewers, establishing priority in a research area, and compressing complex ideas into page limits. The resulting prose is dense, assumes shared context, and often obscures as much as it reveals.

### The Three-Pass Reading Method

Efficient paper reading is about knowing when to stop. Most papers deserve only minutes of your time; a few deserve hours. The three-pass method allocates attention appropriately.

**First Pass (5-10 minutes)**: Survey the territory

Read: Title, abstract, introduction (especially the last paragraph), section headings, figures/tables, conclusion.

Answer these questions:

- What is the paper claiming?
- What problem does it claim to solve?
- Is this relevant to my work?
- Is the claim plausible given what I know?

The vast majority of papers fail this filter. Either they solve a different problem than yours, or the claims seem implausible, or the approach is incompatible with your constraints. A paper on improving transformer training is irrelevant if you're using pre-trained models exclusively.

**When to stop**: If the paper isn't relevant or the claims seem obviously flawed, stop here. You've spent five minutes and learned enough.

**Second Pass (30-60 minutes)**: Understand the approach

Read: Full paper, but skim derivations and proofs. Focus on figures, tables, and the methods section.

Answer these questions:

- What is the main technique? Can you explain it in one paragraph?
- What are the key results? What baselines are compared?
- What are the stated limitations? What's conspicuously missing?
- Could you implement this at a high level?

During this pass, you're building a mental model of what the paper does without getting lost in details. Mark sections you'll need to return to if you proceed.

**When to stop**: If the technique is less novel than claimed, results aren't compelling compared to simpler approaches, or limitations make it unsuitable for your use case.

**Third Pass (2-4 hours)**: Deep understanding

Read: Everything, including appendices and supplementary material. Re-derive key equations. Map their code (if available) to the paper.

Answer these questions:

- Could you reimplement this from scratch?
- What are the hidden assumptions and hyperparameters?
- What would break this approach?
- What did the ablations actually show?

This pass is expensive. You should only reach it for papers you're seriously considering implementing.

### Critical Reading: What Papers Don't Tell You

Papers are advocacy documents. Authors want their work published, cited, and recognized. This doesn't make them dishonest, but it does mean they present their work in the best possible light. Critical reading identifies what's missing or framed favorably.

**Red Flag 1: Weak or missing baselines**

Papers should compare against the best available alternatives, properly tuned. Watch for:

- Baselines that are outdated (comparing to methods from 2+ years ago in a fast-moving field)
- Baselines with default hyperparameters when the proposed method is carefully tuned
- Missing obvious baselines (e.g., a retrieval paper that doesn't compare to BM25)

*Case study*: A 2023 paper on efficient attention claimed 3x speedup over "standard attention." The baseline was the naive PyTorch implementation, not Flash Attention. Against Flash Attention, the speedup was 1.2x, barely significant and with worse accuracy.

**Red Flag 2: Cherry-picked examples**

Qualitative examples in papers are chosen to make the method look good. If you only see the best cases, you have no idea what the failure modes look like.

*What to do*: Look for papers that show failure cases, or try the method on your own data immediately.

**Red Flag 3: New metrics that favor the approach**

When standard metrics don't show improvement, authors sometimes introduce new metrics where their method wins. These aren't always invalid (sometimes new metrics are genuinely better), but it's a yellow flag.

*What to check*: How does the method perform on standard metrics? If it loses on standard metrics, is the argument for the new metric convincing?

**Red Flag 4: Toy datasets only**

Results on MNIST, CIFAR-10, or small synthetic datasets often don't transfer to production-scale problems. Techniques can work beautifully at small scale and fail completely when scaled.

*Case study*: Many neural architecture search papers showed impressive results on CIFAR-10 but produced architectures that didn't transfer to ImageNet or were impractical for real-world deployment.

**Red Flag 5: Missing ablations**

A complex method with multiple components but no ablation study is suspicious. Which parts actually matter? Often, one simple component provides most of the benefit.

*What it tells you*: The authors either didn't do ablations (concerning) or did them and found disappointing results (also concerning).

**Red Flag 6: Reproducibility concerns**

Warning signs include:

- No code released, with no commitment to release
- Vague hyperparameter specifications ("we tuned learning rate on validation set")
- Unusual experimental setup that would be hard to replicate
- Results that require specific random seeds to reproduce

*The context*: Edward Raff's 2019 study attempting to independently reproduce 255 ML papers (without consulting any released code) found that only about 63.5% could be reproduced within reasonable effort. Papers that did release code fared better in follow-up community reproductions, but many still failed to match claimed results.

**Red Flag 7: Compute advantage disguised as algorithmic improvement**

Some "improvements" come from using more compute, not better methods. If a paper doesn't report compute budgets or if the proposed method uses significantly more resources, the comparison isn't fair.

*What to check*: Results normalized for compute. Same number of parameters, same training budget, same inference cost.

### Building a Paper Reading Practice

Sustainable paper reading requires structure. Without it, you'll either read nothing (overwhelmed by the firehose) or read everything (and remember nothing).

**Weekly rhythm**:

- 15 minutes daily: Scan arXiv titles and abstracts in your area. Mark anything promising.
- 1-2 hours weekly: Second-pass read 3-5 marked papers. Most won't survive.
- Monthly: Deep-read 1-2 papers that survived second-pass across the month.

**Tracking sources**:

*Primary sources*:

- arXiv cs.LG, cs.CL, cs.CV for your relevant subfields
- Papers with Code (also shows implementations and benchmark rankings)
- Top conferences: NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR

*Curated sources*:

- Company research blogs (Anthropic, OpenAI, Google, Meta) for production-oriented work
- Research distillation newsletters and podcasts
- Senior researchers you respect on Twitter/Bluesky

**Note-taking that compounds**:

For any paper that survives first-pass, create a structured note:

```
Title: [Paper title]
Date read: [Date]
Relevance: [1-5]
One-sentence summary: [What does this paper do in one sentence?]
Key insight: [The core idea that makes this work]
Results: [Main quantitative results vs. baselines]
Limitations: [What the paper admits, and what you infer]
Production viability: [Quick assessment: likely viable / needs validation / probably not]
If I were to implement: [What would be the main challenges?]
```

This structured approach ensures you can find and use paper notes months later.

---

## Why Papers Don't Reproduce (And What to Do About It)

::: {.callout-tip}
## Staff Engineer Perspective

"VP forwarded a paper with a $20K Slack message attached: 'why aren't we doing this?' Numbers were spectacular: 40 percent quality lift on a benchmark adjacent to ours. I spent three weeks reproducing. Got 4 percent. Authors had used a custom data filter that removed roughly 60 percent of hard examples before evaluation, which they mentioned in a footnote on page 11. The lift was real on their filtered set and basically vanished on ours. I now do a 'gut check reproduction' on every hyped paper before we even discuss adopting it: one engineer, one week, just enough to see if the headline survives contact with our data. Most don't. The ones that do are worth the next month of investment."

-- *Applied Research Lead at an AI infrastructure company*
:::

### The Reproducibility Crisis in ML

In a 2019 study, Edward Raff reported his attempts to independently reproduce 255 ML papers. He found that:

- Only about 63.5% of papers could be reproduced within reasonable effort
- The attempts worked from the paper text alone, without consulting released code, so the figure reflects how completely papers specify their methods
- Even "successful" reproductions often showed significantly different results than claimed

This isn't about fraud (which is rare). It's about the many small decisions that papers don't document:

**Hidden hyperparameters**: Learning rate schedules, warmup steps, gradient clipping thresholds, weight initialization schemes. Papers mention "we use Adam" but don't specify betas, epsilon, or whether they use AMSGrad.

**Preprocessing differences**: Tokenization choices, data cleaning steps, train/test split procedures. These often matter more than the claimed contribution.

**Implicit assumptions**: The code assumes a specific library version, hardware configuration, or data format that isn't documented.

**Random seed sensitivity**: Some results only reproduce with specific random seeds. The paper shows the best seed; your seed gives worse results.

**Selection effects**: Authors run many experiments. They publish the configuration that worked best. Your configuration starts from their optimal point, but you don't know the search that found it.

### What Researchers Actually Do (That Papers Don't Mention)

Having collaborated with research teams and published papers, I can share what typically happens behind the scenes:

**Extensive hyperparameter search**: What the paper calls "tuned on validation set" often means weeks of GPU time exploring hyperparameter space. The final configuration is optimal for those benchmarks, not necessarily for yours.

**Debugging that shapes the method**: During development, bugs often get fixed in ways that subtly change the method. The final paper describes the end state, not the journey. Some design choices exist because alternatives were buggy, not because they were theoretically motivated.

**Late-stage additions**: Reviewers request comparisons, ablations, or variations. These get added hastily before the deadline. Code quality for these components is often lower.

**Computational resources**: A paper from a well-funded lab might represent millions of dollars in compute. They tried variations you never will. The published method is the survivor of a much larger exploration.

::: {.callout-warning}
## Common Mistake: Implementing Papers from Scratch Without Reference Code

**What people do:** Read the paper, understand the algorithm, implement it fresh in their preferred framework. After all, the reference code is messy and in a different framework.

**Why it fails:** Papers under-specify critical details. The reference code contains weeks of debugging that the paper doesn't mention: learning rate warmup steps, normalization epsilon values, weight initialization schemes, gradient clipping thresholds. Reimplementing from scratch means rediscovering all these implicit decisions through trial and error.

**Fix:** Always start from official reference code, even if messy.
Port it carefully, validating intermediate values match. Only clean up after you've reproduced results. The "ugly" code often contains essential implicit knowledge.
:::

### Practical Strategies for Reproduction

Given these challenges, here's how to maximize your chances of successful reproduction:

**Strategy 1: Start from reference implementations**

If official code exists, use it, even if it's messy. The authors' code contains implicit knowledge that papers don't capture. Rewriting from scratch means remaking all the decisions they made through trial and error.

If no official code exists, look for community reproductions with verification. Papers with Code tracks implementations and whether they reproduce claimed results.

::: {.callout-warning}
## Common Mistake: Adapting Research Before Reproducing It

**What people do:** Read a paper, immediately start adapting it for their use case: different data, different model size, different framework. When results are disappointing, they don't know if the technique doesn't work or if they introduced bugs.

**Why it fails:** You have two unknowns: whether you implemented correctly AND whether it works for your use case. When it fails (which is common), you can't diagnose which is the problem.

**Fix:** First reproduce the paper's results on their exact setup: same data, same hyperparameters, same evaluation. Only after you can match their numbers should you start adapting. This isolates variables and makes debugging tractable.
:::

**Strategy 2: Match the setup exactly before modifying**

Before adapting a technique for your use case, verify you can reproduce the original results. Use the same:

- Dataset (exact preprocessing)
- Hyperparameters (all of them)
- Evaluation protocol (same splits, same metrics)
- Hardware (if feasible)

Only after reproduction succeeds should you start modifying for your context.

**Strategy 3: Email the authors**

This sounds old-fashioned, but it works.
Authors often respond to polite, specific questions. They have knowledge that didn't make it into the paper:

"We're trying to reproduce your results on [benchmark]. We're using [hyperparameters] and getting [result] instead of [claimed result]. Are there any implementation details we might be missing?"

Common responses reveal missing preprocessing steps, hyperparameter sensitivities, or known issues.

**Strategy 4: Budget for failure**

Assume your first reproduction attempt will fail. Plan accordingly. If a paper takes 1 week to implement from scratch, budget 2-3 weeks total including debugging. The extra time typically goes to:

- Debugging why results don't match
- Trying hyperparameter variations
- Discovering implicit dependencies

---

## Evaluating Production Viability

### The Production Readiness Assessment

Not all reproducible research is production-ready. The gap from "works in a notebook" to "runs in production" can be larger than the initial research.

Here's a framework for evaluating whether research can transition to production:

**Dimension 1: Computational efficiency**

Questions to ask:

- What are the inference latency and throughput characteristics?
- How does memory usage scale with input size?
- Does it require specialized hardware you don't have?
- Are there efficient implementations, or only research code?

*Red flags*: Quadratic scaling with sequence length (for long-context applications), custom CUDA kernels with no PyTorch fallback, requirements for hardware you can't provision.

*Green flags*: Already optimized implementations exist (in PyTorch, TensorRT, etc.), scales linearly or better, runs on your standard hardware.

*Case study*: The Linformer paper proposed O(n) attention as a replacement for O(n^2) standard attention. But the constant factors meant that for sequences under 4096 tokens, standard attention (with Flash Attention) was actually faster. The asymptotic improvement didn't matter for most production use cases.
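
The constant-factor point in the case study can be made concrete with a toy cost model. The sketch below assumes costs of the form `c_lin * n` and `c_quad * n * n`; the function names and the constants are hypothetical, chosen only so the crossover lands at 4096 tokens, and are not measurements of Linformer or Flash Attention.

```python
def linear_cost(n: int, c_lin: float) -> float:
    """Toy cost of an O(n) method with a large constant factor."""
    return c_lin * n


def quadratic_cost(n: int, c_quad: float) -> float:
    """Toy cost of an O(n^2) method with a small constant factor."""
    return c_quad * n * n


def crossover_length(c_lin: float, c_quad: float) -> float:
    """Sequence length where the two toy costs are equal: c_lin*n == c_quad*n*n."""
    return c_lin / c_quad


# Hypothetical constants (illustrative only, not measured on any real system).
C_LIN = 4096.0
C_QUAD = 1.0

if __name__ == "__main__":
    print(f"crossover at n = {crossover_length(C_LIN, C_QUAD):.0f}")
    for n in (512, 2048, 4096, 16384):
        quad_wins = quadratic_cost(n, C_QUAD) < linear_cost(n, C_LIN)
        winner = "quadratic" if quad_wins else "linear"
        print(f"n={n:>6}: {winner} method is cheaper (or tied)")
```

In practice you would replace the toy functions with latencies measured at the sequence lengths your production traffic actually contains, since the real constants depend on hardware and kernel quality.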

**Dimension 2: Integration complexity**

Questions to ask:

- Does this require changing model architecture, or is it a drop-in replacement?
- What dependencies does it introduce?
- Can it be A/B tested incrementally, or is it all-or-nothing?
- Does it require retraining, or can it be applied to existing models?

*Red flags*: Requires retraining your foundation model, introduces unstable/unmaintained dependencies, can't be gradually rolled out.

*Green flags*: Works with pre-trained models, clean API boundaries, can be enabled via configuration.

*Case study*: LoRA fine-tuning succeeded in production partly because it didn't require modifying base model weights. You could add adapters without touching the foundation, enabling easy A/B testing and rollback.

**Dimension 3: Robustness**

Questions to ask:

- How sensitive is it to hyperparameters?
- What happens on inputs outside the training distribution?
- Are failure modes catastrophic or graceful?
- Has it been tested on messy, real-world data?

*Red flags*: Very narrow hyperparameter range works, fails silently on edge cases, tested only on curated benchmarks.

*Green flags*: Stable across reasonable hyperparameter ranges, documented behavior on adversarial/unusual inputs, tested at scale on real data.

*Case study*: Many quantization schemes work perfectly on benchmark data but produce catastrophically wrong outputs on rare inputs that don't match training statistics. Production systems discovered this through user complaints, not benchmarking.

**Dimension 4: Maintainability**

Questions to ask:

- Can multiple team members understand and modify this?
- Is the implementation well-tested and documented?
- Is the approach still actively developed, or is it abandoned?
- If it breaks, can you debug it?

*Red flags*: Only one researcher understands it, no tests, inactive repository, black-box components.

*Green flags*: Multiple independent implementations, community support, clear debugging procedures.
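
One way to keep the four-dimension assessment honest is to record the flags explicitly rather than leaving them in someone's head. The sketch below is a hypothetical structure, not a standard tool: the `Assessment` class, the dimension names, and the rule that any red flag blocks a technique are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Short names for the four dimensions above (hypothetical identifiers).
DIMENSIONS = ("efficiency", "integration", "robustness", "maintainability")


@dataclass
class Assessment:
    """Red and green flags recorded per dimension for one technique."""

    technique: str
    red: dict[str, list[str]] = field(default_factory=dict)
    green: dict[str, list[str]] = field(default_factory=dict)

    def add(self, dimension: str, flag: str, is_red: bool) -> None:
        """Record one flag under a known dimension."""
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        target = self.red if is_red else self.green
        target.setdefault(dimension, []).append(flag)

    def unassessed(self) -> list[str]:
        """Dimensions with no flags at all, i.e. questions nobody has answered."""
        return [d for d in DIMENSIONS if d not in self.red and d not in self.green]

    def verdict(self) -> str:
        # Assumed rule: an incomplete review is reported first; any red flag blocks.
        if self.unassessed():
            return "incomplete"
        return "blocked" if self.red else "candidate"


if __name__ == "__main__":
    a = Assessment("technique-x")
    a.add("integration", "works with pre-trained models", is_red=False)
    a.add("robustness", "tested only on curated benchmarks", is_red=True)
    print(a.verdict(), "- still unassessed:", a.unassessed())
```

The "any red flag blocks" rule is deliberately strict. A team might instead weight dimensions differently, but writing the rule down forces that choice to be explicit.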

### Case Studies: Research That Translated Well

**Flash Attention: From Paper to Industry Standard in One Year**

Why it succeeded:

- **Clear, measurable benefit**: 2-4x throughput improvement with same quality. No tradeoffs to evaluate.
- **Drop-in replacement**: Same API as standard attention. No architectural changes required.
- **Strong implementation**: Authors released optimized CUDA kernels from the start.
- **Wide applicability**: Benefits virtually all transformer workloads.

Adoption pattern: Major labs adopted within months. Framework integration followed. By 12 months post-paper, it was the default.

**LoRA Fine-tuning: Democratizing Customization**

Why it succeeded:

- **Solved a real pain point**: Fine-tuning large models was prohibitively expensive.
- **Clean abstraction**: Adds small adapter matrices, doesn't touch base weights.
- **Easy integration**: PEFT library made it trivial to add to existing code.
- **Flexibility**: Works with models you can't afford to fully fine-tune.

Adoption pattern: Initially used by resource-constrained practitioners. Quality validation led to broader adoption. Now the default for LLM customization.

**Speculative Decoding: Emerging Production Adoption**

Why it's succeeding:

- **No quality compromise**: Provably preserves the target model's output distribution (and, under greedy decoding, its exact outputs).
- **Significant speedup**: 2-3x latency reduction is common.
- **Works with existing models**: Pairs a small draft model with your existing target model.

Current state: Framework integration is ongoing. Requires good draft model selection. Production adoption accelerating as tooling matures.

### Case Studies: Research That Didn't Translate

**Many Neural Architecture Search Results**

Why they failed:

- **Compute advantage**: Results came from massive search budgets, not better architectures.
- **Benchmark overfitting**: Found architectures optimized for CIFAR-10 that didn't transfer.
- **Impractical costs**: Reproducing the search was infeasible for most organizations.
- **Better alternatives emerged**: Transformers made many CV architecture questions moot.

Lesson: Be skeptical of results that require massive search—the winning configuration is selected from thousands of alternatives.

**Various "Efficient Attention" Mechanisms**

Why many failed:

- **Constant factors matter**: Asymptotic improvements didn't help at practical sequence lengths.
- **Quality tradeoffs**: Approximations that seemed harmless on benchmarks caused issues in production.
- **Flash Attention changed the baseline**: What seemed efficient in 2021 wasn't efficient after Flash Attention.

Lesson: Efficiency claims need context. "More efficient than X" means nothing if a better alternative to X appears.

**Mixture of Experts (The Long Road)**

Why it took years:

- **Training instability**: Early MoE implementations were notoriously unstable to train.
- **Routing challenges**: Load balancing across experts was tricky to get right.
- **Infrastructure requirements**: Needed sophisticated distributed systems for expert parallelism.
- **Gradually solved**: Switch Transformer (2021) simplified training; infrastructure matured.

Current state: Now powers frontier models. The research was valid all along; production engineering took years to catch up.

Lesson: Some research is ahead of its time. Infrastructure and tooling maturity often gate production readiness.

---

## Decision-Making: When to Adopt, Wait, or Skip

### The Adoption Timing Framework

Different organizations should adopt research at different stages. Your position depends on:

**Technical capacity**: Do you have engineers who can debug novel implementations? Can you afford failures?

**Competitive dynamics**: How much do you gain from being first? How much do you lose from waiting?

**Risk tolerance**: What happens if adoption fails? Is this critical infrastructure or an experiment?

Here's a framework for deciding:

**ADOPT Early (Be First)**

When:

- Technique provides significant competitive advantage
- You have strong ML engineering capacity to handle issues
- Risk of failure is acceptable (not critical path)
- You can contribute improvements back (builds expertise)

Example: A company with strong ML infrastructure adopting Flash Attention 2 months after release to gain throughput advantage.

Tradeoffs: Higher risk of hitting bugs, no community solutions yet, more engineering investment.

**TRIAL (Fast Follower)**

When:

- Technique is clearly valuable (not just hyped)
- Initial bugs are being shaken out by early adopters
- Framework integration is emerging
- You want proven value with reduced risk

Example: Adopting Flash Attention 6 months post-release, once it's in major frameworks and bugs are documented.

Tradeoffs: Less competitive advantage, but much lower risk.

**ASSESS (Wait and See)**

When:

- Technique is promising but unproven at scale
- Your capacity to evaluate is limited
- Alternatives may emerge
- Cost of being wrong is high

Example: Watching speculative decoding mature before investing in implementation.

Tradeoffs: May miss opportunities, but avoids wasted effort on techniques that don't pan out.

**SKIP (Don't Invest)**

When:

- Technique is superseded by something better
- Complexity exceeds benefit for your use case
- Incompatible with your architecture
- The problem it solves isn't your problem

Example: Skipping many "efficient attention" variants because Flash Attention solved the problem more completely.

### The Kill Criteria

Knowing when to stop is as important as knowing when to start.
Define kill criteria before beginning any research adoption project:

**Hard kills (stop immediately)**:

- Fundamental incompatibility with your architecture discovered
- Results significantly worse than baseline in realistic testing
- Required changes are orders of magnitude larger than expected
- Security or compliance issues emerge

**Soft kills (evaluate whether to continue)**:

- Results are only marginally better than baseline
- Implementation is taking much longer than planned
- Key team member is reassigned or leaves
- A clearly better alternative emerges

**The sunk cost trap**: The most common failure mode is continuing projects that should be killed because of invested effort. Define time-boxes upfront. "If we haven't achieved X in Y weeks, we stop."

*Case study*: A team spent 4 months implementing a custom attention mechanism for their domain. At month 3, it was clear the improvement was marginal. But they continued because "we're almost there." The final result: 2% improvement at 10x engineering cost. The same effort on other improvements would have yielded far better returns.

### Portfolio Thinking for Research Investment

Don't bet everything on one research direction. Think in portfolios:

**Core (60-70% of research effort)**: Incremental improvements to existing systems. High probability of success, moderate impact.

**Adjacency (20-30% of research effort)**: Promising techniques that extend your current capabilities. Medium probability, high potential impact.

**Exploratory (5-10% of research effort)**: Speculative bets on novel research. Low probability, potentially transformative impact.

This portfolio approach ensures you're not vulnerable to any single research direction failing while maintaining exposure to breakthrough possibilities.

---

## Prototyping: Validating Ideas with Minimum Investment

### The Prototype Mindset

The goal of prototyping is learning, not building. A prototype answers: "Should we invest more in this?" not "How do we ship this?"
This distinction matters because it changes what you build:

**Production code**: Handles all edge cases, is well-tested, integrates cleanly, is maintainable.

**Prototype code**: Handles the critical path, validates the hypothesis, produces enough evidence to decide.

Every hour spent on prototype polish is an hour not spent on the decision you're trying to make.

### Designing Effective Prototypes

**Step 1: State your hypothesis clearly**

Bad: "Let's try technique X."

Good: "We hypothesize that technique X will reduce latency by >20% on our production workload without degrading accuracy by more than 1%."

The hypothesis determines what you measure and what counts as success.

**Step 2: Define success criteria before starting**

Quantitative thresholds:

- Latency: Must be <Y ms at P99
- Accuracy: Must be >Z% on our evaluation set
- Memory: Must fit in existing GPU allocation

Qualitative requirements:

- Must work with our existing tokenizer
- Must not require retraining the base model

**Step 3: Define kill criteria equally clearly**

If any of these are true, stop:

- Accuracy drops >5% on our evaluation set
- Latency increases (not just fails to improve)
- Implementation reveals fundamental incompatibility

**Step 4: Time-box ruthlessly**

"We will spend at most 5 days on this prototype. At day 5, we decide: continue to validation, pivot, or abandon."

The time-box forces scoping. If you can't validate the core idea in the time-box, either the idea is too complex or you need to decompose it.

### What to Include and Exclude in Prototypes

**Include (Scope In)**:

- Core algorithm implementation
- Evaluation on representative data sample
- Key metric measurement
- Comparison to current baseline

**Exclude (Scope Out)**:

- Production-ready code quality
- Comprehensive error handling
- Full test coverage
- Documentation
- Performance optimization
- Edge case handling

The items you exclude can all be added later if the prototype succeeds.
Adding them during prototyping wastes effort if the idea doesn't work.

### Implementation Strategies

You have several options for implementing research in a prototype:

**Option 1: Use official reference implementation**

When: Official code is available and quality is reasonable.

Advantages: Fastest path; known to reproduce paper results; authors' implicit knowledge baked in.

Disadvantages: May not fit your infrastructure; code quality varies; may have hidden dependencies.

How: Clone the repo, get their evaluation working, then adapt their code to your data.

**Option 2: Adapt existing similar implementation**

When: A similar technique exists in your codebase or in a framework.

Advantages: Integrates with existing code; familiar patterns; easier to debug.

Disadvantages: May miss subtle but important differences; risk of cargo-culting.

How: Identify the closest existing implementation, understand the key differences, modify carefully.

**Option 3: Implement from scratch**

When: No suitable reference exists, or you need deep understanding.

Advantages: Deep understanding; tailored to your needs; no external dependencies.

Disadvantages: Time-consuming; risk of bugs; need to validate correctness independently.

How: Start from the paper's pseudocode; implement in stages; validate each component.

**Option 4: Use framework integration**

When: The technique is already in a major framework (PyTorch, HuggingFace, etc.).

Advantages: Production-ready; well-tested; maintained by others.

Disadvantages: May lag latest research; limited customization; framework dependency.

How: `pip install` and configure.

The right choice depends on your situation. Framework integration is obviously best when available. Reference implementation is usually best otherwise. From-scratch implementation is only worth it when you need the deep understanding—typically for core techniques you'll build on.

### Debugging Research Implementations

When your implementation doesn't match paper results (which happens often), here's a systematic debugging approach:

**Step 1: Verify you can reproduce on their data**

Before trying your data, reproduce their exact setup. Same dataset, same preprocessing, same hyperparameters. If you can't match their results on their setup, you have an implementation bug.

**Step 2: Check tensor shapes throughout**

Many bugs manifest as shape errors or—worse—silent shape broadcasting. Print shapes at each step. Compare against what the paper implies.

**Step 3: Compare against reference implementation numerically**

If reference code exists, test intermediate values:

```python
import torch

# Run the same inputs through both implementations (placeholder callables).
their_output = their_implementation(inputs)
your_output = your_implementation(inputs)
assert torch.allclose(their_output, your_output, atol=1e-5)
```

Do this at each layer/step to isolate where differences emerge.

**Step 4: Check for normalization issues**

Many papers under-specify normalization. Are you using layer norm where they use batch norm? Is it pre-norm or post-norm? What's the epsilon value?

**Step 5: Verify numerical precision**

Some techniques are sensitive to float16 vs float32. Mixed precision training can cause issues. Try full float32 if you're seeing instabilities.

**Step 6: When all else fails, contact authors**

Often, there's a detail that didn't make it into the paper. A polite email asking about reproduction often gets helpful responses.

---

## Rigorous Benchmarking

### Why Rigorous Benchmarking Matters

Poor benchmarking leads to poor decisions. If you don't benchmark correctly, you might:

- Adopt a technique that's actually worse than your baseline
- Reject a technique that would have helped
- Misattribute improvements to the wrong component
- Make decisions that don't generalize beyond your test set

Rigorous benchmarking takes more effort but produces trustworthy decisions.

### Principles of Fair Comparison

**Principle 1: Test on representative data**

Your benchmark data should match your production data distribution. If you're evaluating a retrieval technique, use queries that look like real user queries, not synthetic test sets.

Common mistakes:

- Using clean academic benchmarks when production data is messy
- Testing on easy examples but deploying on hard ones
- Not including adversarial or edge cases

::: {.callout-warning}
## Common Mistake: Comparing Against Weak Baselines

**What people do:** Implement a new technique from a paper, compare it against a quickly-implemented baseline with default hyperparameters. The new technique wins. Ship it.

**Why it fails:** Papers often compare against weak baselines (outdated methods, untuned hyperparameters, naive implementations). Your "baseline" might be 20% worse than a properly tuned version of the same approach. Flash Attention made many "efficient attention" papers obsolete overnight.

**Fix:** Invest equal effort in your baseline: use the best available implementation, tune hyperparameters, use the same compute budget. If the new technique doesn't beat a strong baseline, the improvement may not be real.
:::

**Principle 2: Compare apples to apples**

For a comparison to be meaningful, control for:

- Compute budget (same training time, or same FLOPs)
- Data (same training set, same preprocessing)
- Hyperparameter tuning effort (if you tune one method, tune the other equally)
- Hardware (same GPU, same batch sizes)

If your new technique has 10x the parameters, comparing accuracy isn't fair. Normalize for what matters.

**Principle 3: Control for random variation**

A single run tells you very little. Results vary with:

- Random weight initialization
- Batch ordering
- Stochastic components (dropout, etc.)

Run experiments multiple times. Report means and standard deviations. Use statistical significance tests before claiming improvements.

How many runs?
At least 3; preferably 5-10 for important decisions. If results are very noisy, you need more runs or larger sample sizes.

**Principle 4: Measure what matters**

Academic metrics may not correlate with production value. Quality is important, but so are:

- Latency (P50, P99—not just average)
- Throughput (requests per second)
- Memory usage (peak and sustained)
- Cost (compute expense per query)

A technique that's 2% better on accuracy but 3x slower may not be worth it.

**Principle 5: Test at scale**

Techniques that work at small scale often fail at production scale:

- Memory usage can grow unexpectedly
- Latency distributions change under load
- Batch sizes affect convergence

Test at your expected production scale before concluding the benchmark.

### Statistical Significance: When Is an Improvement Real?

If method A achieves 85.2% accuracy and method B achieves 85.8% accuracy, is B better? It depends on variance.

For a rigorous comparison:

1. Run each method multiple times (different random seeds)
2. Collect the distribution of results
3. Use a statistical test (paired t-test if runs are matched, Mann-Whitney U if not)
4. Only claim improvement if p < 0.05

Example analysis:

- Method A: 5 runs, mean 85.2%, std 0.8%
- Method B: 5 runs, mean 85.8%, std 1.1%

A paired test needs the per-seed results, which summary statistics don't provide, but an unpaired Welch's t-test on these numbers gives p ≈ 0.36. This is NOT statistically significant. The observed difference could easily be random variation. You should NOT adopt method B based on this evidence.

**Practical significance vs. statistical significance**: Even if an improvement is statistically significant, is it practically meaningful? A 0.1% accuracy improvement that's significant with p=0.01 is still not worth implementing if the engineering cost is high.

### Ablation Studies: Understanding What Matters

Complex techniques have multiple components. Ablation studies remove components one at a time to understand their contribution.
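
The leave-one-out procedure just described can be sketched as a small harness. Everything here is hypothetical: the component names, the `evaluate` callable, and the toy recall table (illustrative numbers only) stand in for whatever your own pipeline and evaluation set provide.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# An evaluation function maps the set of enabled components to a metric value.
Evaluate = Callable[[FrozenSet[str]], float]


def leave_one_out(components: Iterable[str], evaluate: Evaluate) -> Dict[str, float]:
    """Return the metric drop caused by removing each component on its own."""
    full = frozenset(components)
    baseline = evaluate(full)
    return {name: baseline - evaluate(full - {name}) for name in sorted(full)}


# Toy stand-in for a real evaluation run: recall (in %) by enabled components.
# Illustrative numbers only.
TOY_RECALL = {
    frozenset({"query_expansion", "reranking"}): 85,
    frozenset({"reranking"}): 83,
    frozenset({"query_expansion"}): 78,
    frozenset(): 75,
}


def toy_evaluate(enabled: FrozenSet[str]) -> float:
    """Look up the toy recall for a given set of enabled components."""
    return TOY_RECALL[enabled]


if __name__ == "__main__":
    drops = leave_one_out(["query_expansion", "reranking"], toy_evaluate)
    # Print the components from most to least important.
    for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
        print(f"removing {name}: recall drops by {drop} points")
```

In a real study each `evaluate` call is a full evaluation run, ideally repeated across seeds, so the drops should be compared against run-to-run variance before drawing conclusions.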

Why this matters for production:

- You might not need the full complexity
- Some components may be essential; others may be cargo-culted
- Simpler is better if it achieves similar results

Example ablation for a new retrieval technique:

- Full method: 85% recall
- Without query expansion: 83% recall (expansion adds 2 points)
- Without reranking: 78% recall (reranking adds 7 points)
- Without both: 75% recall

Conclusion: Reranking is essential; query expansion is helpful but not critical. If you can only afford one, choose reranking.

### Cost-Benefit Analysis

The final benchmarking step is cost-benefit analysis: Is the improvement worth the investment?

Quantify costs:

- Implementation effort (engineer-weeks)
- Infrastructure changes (new dependencies, hardware)
- Ongoing maintenance (additional complexity)
- Compute costs (training, inference)

Quantify benefits:

- Quality improvement (in metrics that matter)
- Efficiency gains (latency, throughput, cost reduction)
- Strategic value (capabilities you couldn't offer before)

Calculate return:

- Payback period: How long until benefits exceed costs?
- ROI: After 12 months, what's the return on investment?

If payback is >12 months, you need strong strategic justification. If payback is <3 months, adoption is likely worthwhile.

---

## Managing the Research-to-Production Pipeline

### The Three-Stage Pipeline

Research adoption isn't a single decision—it's a process with distinct stages, each with its own goals and exit criteria.

**Stage 1: Discovery (1-2 weeks)**

Goal: Answer "Is this worth investigating further?"

Activities:

- Deep paper reading (third pass)
- Quick prototype implementation
- Initial evaluation on representative data
- Risk assessment

Exit criteria:

- Core idea is validated (it works at all)
- Initial metrics meet minimum threshold
- No critical blockers identified

Gate question: "Should we invest more time in this?"

If NO: Document learnings, move on.

If YES: Proceed to Validation stage.

**Stage 2: Validation (2-4 weeks)**

Goal: Answer "Is this ready for production?"

Activities:

- Rigorous benchmarking against production baseline
- Testing at production scale
- Edge case exploration
- Full production viability assessment

Exit criteria:

- Performance meets or exceeds baseline on production metrics
- Works at production scale without issues
- Integration path is clear and feasible

Gate question: "Should we build this for production?"

If NO: Document learnings, archive for later.

If YES: Proceed to Integration stage.

**Stage 3: Integration (4-8 weeks)**

Goal: Answer "Does this improve production?"

Activities:

- Production-quality implementation
- Full testing and validation
- Gradual rollout (feature flags, percentage ramps)
- Monitoring setup

Exit criteria:

- Passes all integration tests
- Metrics are positive in production A/B test
- Team can maintain and debug it

Gate question: "Should we keep this in production?"

If NO: Roll back, document learnings.

If YES: Complete rollout, document for team.

### The Importance of Gates

Gates are decision points that force explicit choices. Without gates, projects drift. A project that should have been killed in Discovery continues to Validation because "we're already invested." A project that passes Validation gets stuck in Integration limbo.

At each gate, someone with authority must answer the gate question with YES, NO, or NOT YET (with conditions).

Gates require:

- Clear criteria defined before starting
- Honest assessment of whether criteria are met
- Authority to kill projects that don't pass

### Knowledge Transfer and Documentation

When a research technique reaches production, knowledge transfer becomes critical. The person who implemented it may move on. Others will need to maintain, debug, and extend it.

Essential artifacts:

**Implementation documentation**:

- What this implements and why we adopted it
- Where it lives in the codebase
- How to configure it
- Common issues and debugging approaches

**Design decision record**:

- What alternatives were considered
- Why this approach was chosen
- What tradeoffs were made
- When this decision should be revisited

**Runbook**:

- How to verify it's working correctly
- Known failure modes and remediation
- Monitoring and alerting guidance
- Rollback procedure

If these artifacts don't exist, the knowledge lives only in individual heads and walks out the door when those people leave.

---

## Team Research Processes

Individual skills matter, but research-to-production at scale requires team processes. Here's how to run research adoption as a team, not just as individuals working in parallel.

### The Research Radar

A research radar tracks promising techniques across your team's focus areas. It's a shared document or system that answers: "What research should we be aware of, and what's our current assessment?"

**Structure:**

| Technique | Status | Owner | Next Action |
|-----------|--------|-------|-------------|
| Flash Attn 3 | ADOPT | - | Upgrade in Q2 |
| Spec Decoding | TRIAL | Alice | Validation by 3/15 |
| MoE Routing | ASSESS | Bob | Paper review by 3/1 |
| Technique X | SKIP | - | Re-evaluate in 6mo |

**Categories:**

- **ADOPT**: Proven techniques to integrate into standard practice
- **TRIAL**: Promising techniques in active validation
- **ASSESS**: Techniques worth monitoring but not actively testing
- **HOLD**: Previously interesting techniques now deprioritized
- **SKIP**: Evaluated and rejected (with documented reasoning)

**Maintenance:**

- Review monthly: Update status, add new techniques, remove obsolete entries
- Assign owners: Each TRIAL or ASSESS item needs someone responsible for tracking
- Document decisions: When something moves categories, record why

### Research Review Meetings

Regular research reviews keep the team aligned and prevent both over-investment and under-investment in new techniques.

**Format (60-90 minutes, monthly):**

1. **Radar updates (15 min)**: Quick review of status changes since last meeting
2. **Deep dives (30-45 min)**: 1-2 techniques get detailed presentation
3. **New nominations (15 min)**: Team members propose additions to radar
4. **Decisions (15 min)**: Explicit decisions on what to ADOPT, TRIAL, SKIP

**Deep dive structure:**

The presenter covers:

- What the technique does (5 min)
- Our evaluation results (10 min)
- Production viability assessment (5 min)
- Recommendation with rationale (5 min)
- Team discussion and decision (10 min)

**Ground rules:**

- Decisions are explicit: Every deep dive ends with a decision
- Disagreement is healthy: Different perspectives improve decisions
- Time-box discussions: Use parking lot for tangents
- Document everything: Meeting notes capture decisions and reasoning

### Coordinating Research Efforts

Without coordination, teams duplicate effort or leave gaps.
Here's how to coordinate:

**Divide by expertise, not by paper:**

Don't assign papers to people. Assign domains to people. "Alice owns retrieval research; Bob owns inference optimization." This builds expertise and avoids redundant reading.

**Share learnings, not just results:**

When someone completes a prototype or evaluation, they should share:

- What they learned (not just metrics)
- Surprising findings
- Implicit knowledge they discovered
- Recommendations for others considering similar work

**Maintain institutional memory:**

- Research notes are in a shared location, not individual notebooks
- Evaluation results are stored with methodology, not just numbers
- Failed experiments are documented (so others don't repeat them)

### Handling Disagreements

Teams will disagree on research adoption decisions. Handle this constructively:

**When someone wants to adopt but others are skeptical:**

Run a time-boxed validation. If the advocate is right, you'll have evidence. If they're wrong, you'll have learned cheaply. Don't argue hypothetically—test.

**When the team is split:**

Identify what information would resolve the disagreement. Often teams disagree because they're uncertain, not because they have different values. Get that information.

**When someone feels strongly:**

Strong opinions are valuable information. Understand why. Are they seeing something others miss? Is their context different? Take strong opinions seriously.

**When you must decide without consensus:**

Document the disagreement. Note who advocated for what. Make a decision and move on. Revisit when new information emerges.

---

## Research Failure Post-Mortems

When research adoption fails—techniques that didn't work, projects that were killed, implementations that were rolled back—you have a learning opportunity. Failure post-mortems extract that learning.

### When to Run a Post-Mortem

Run a post-mortem when:

- A technique that passed initial evaluation fails in production
- A research project is killed after significant investment
- Adoption takes much longer or costs much more than expected
- Production implementation reveals issues not caught in evaluation

Don't run post-mortems for:

- Early-stage prototypes that get killed (expected and healthy)
- Techniques that were obviously not a fit (no surprise, no learning)

### Post-Mortem Structure

**1. Timeline reconstruction (15 min)**

What actually happened? Build a factual timeline:

- When did we first encounter this research?
- What was our initial assessment?
- What milestones did we hit?
- When did problems emerge?
- What was the final outcome?

**2. What went wrong (20 min)**

Identify specific failures. Categories to consider:

*Paper evaluation failures:*

- Did we miss red flags in the paper?
- Did we misunderstand the technique?
- Did we overestimate the claimed benefits?

*Prototype failures:*

- Did the prototype not accurately predict production behavior?
- Did we test on unrepresentative data?
- Did we miss important failure modes?

*Integration failures:*

- Were there unexpected dependencies?
- Did performance not scale?
- Did edge cases cause problems?

*Process failures:*

- Did we ignore warning signs?
- Did we continue past appropriate kill points?
- Did we fail to get appropriate input?

**3. Root cause analysis (15 min)**

For each failure, ask "why?" until you reach actionable root causes:

*Surface*: "The technique was slower in production than in benchmarks."

*Why?*: "Our production data has longer sequences than benchmark data."

*Why?*: "We tested on benchmark data, not sampled production data."

*Root cause*: "Our evaluation protocol didn't include production-representative data."

**4. Lessons and actions (10 min)**

What changes would have caught this earlier?
Be specific:

- Update evaluation checklist to include X
- Add Y to prototype requirements
- Change review process to include Z

Assign owners and deadlines for action items.

### Common Failure Patterns

Post-mortems often reveal recurring patterns:

**Pattern: Benchmark-production gap**

*Symptom*: Technique works on benchmarks, fails on production data.

*Prevention*: Always include production-sampled data in evaluation.

**Pattern: Scale surprise**

*Symptom*: Technique works at small scale, fails at production scale.

*Prevention*: Test at production scale before passing validation gate.

**Pattern: Integration underestimation**

*Symptom*: Core technique works, but integration takes 3x longer than expected.

*Prevention*: Include integration spike in Discovery phase.

**Pattern: Sunk cost continuation**

*Symptom*: Project continued past point where it should have been killed.

*Prevention*: Define kill criteria upfront; enforce time-boxes.

**Pattern: Missing expertise**

*Symptom*: Team misunderstood technique due to unfamiliarity with the domain.

*Prevention*: Include domain expert in evaluation, or consult externally.

---

## Communicating Research Decisions

Research adoption decisions affect stakeholders who don't read papers: product managers, executives, partner teams. Effective communication builds trust and gets buy-in.

### Framing for Different Audiences

**For executives (1-2 paragraphs):**

Focus on: Business impact, risk/reward, timeline, cost.

Template: "We evaluated [technique] for [use case]. Our assessment: [ADOPT/TRIAL/SKIP]. The expected benefit is [X% improvement in metric Y], with implementation cost of [Z weeks]. Key risk is [risk]. We recommend [action]."

Example: "We evaluated speculative decoding for our inference platform. Our assessment: TRIAL. The expected benefit is 40-50% latency reduction for chat completions, with implementation cost of 4 engineering weeks.
Key risk is that benefit varies by model and input distribution; some workloads may see less improvement. We recommend a 4-week trial on our highest-volume endpoint."

**For product managers (1 page):**

Focus on: User impact, tradeoffs, timeline, what changes.

Include:

- What this technique does (in plain language)
- Expected user-visible benefits
- Any user-visible downsides or changes
- Timeline and dependencies
- What you need from them (decisions, prioritization)

**For engineering partners (technical detail):**

Focus on: How it works, integration requirements, technical risks.

Include:

- How the technique works
- What changes in their systems
- Integration requirements (APIs, data formats, etc.)
- Performance characteristics
- Known issues and mitigations

### Handling "Why Not Just...?" Questions

Non-technical stakeholders often ask reasonable questions that reveal unfamiliarity with research realities:

**"Why don't we just use the technique from [big company's blog post]?"**

Response: Explain what evaluation revealed. "We evaluated that technique. On our data, it provided 3% improvement rather than the 15% they claimed. The difference is likely due to [data distribution / scale / use case]. We found [alternative] more promising for our specific situation."

**"Can't we just implement what's in the paper?"**

Response: Explain the paper-to-production gap. "Papers describe algorithms but not production systems. Implementation typically takes 3-5x as long as reading and understanding the paper. We also need to validate that results transfer to our context, and they often don't."

**"Why is this taking so long? It's just one technique."**

Response: Explain the validation process. "We could implement quickly but might ship something that doesn't actually help or, worse, causes problems. Our process validates that techniques work on our data at our scale before production. This takes [X weeks] but prevents [past failure example]."

**"Our competitor announced they're using [technique]. Why aren't we?"**

Response: Separate announcement from reality. "Announcements often precede production deployment by months. We're tracking this technique in our ASSESS category. Current assessment is [X]. We'll accelerate evaluation if we see evidence of production deployment and competitive impact."

### Building Research Credibility

Credibility with stakeholders comes from a track record of good decisions:

**Celebrate successful adoptions**: When a technique you recommended pays off, share the results. "Speculative decoding reduced P50 latency by 45%, saving $X/month in infrastructure costs."

**Acknowledge failures honestly**: When something doesn't work, own it. "We invested 3 weeks in [technique] and determined it's not viable for our use case. Here's what we learned that will help us evaluate similar techniques better."

**Make predictions, track accuracy**: "We predicted this technique would provide 20-30% improvement. Actual result: 25%. Our evaluation methodology is calibrated." Over time, this builds trust in your assessments.

**Avoid hype**: If you consistently recommend things that don't pan out, credibility erodes. Conservative, accurate predictions beat optimistic, inaccurate ones.

---

## Developing Taste for Research

### What "Taste" Means in This Context

Beyond frameworks and processes, experienced practitioners develop intuitions, a "taste" for research that helps them quickly filter promising from unpromising ideas. This isn't mystical; it's pattern recognition developed over years of reading papers and seeing how they play out.

Here are some patterns that indicate promising research:

### Signs of Promising Research

**Clear, simple insight**: The best research often has a simple core idea. If you can't explain the key insight in one sentence, the technique may be overcomplicated.

Examples:

- Flash Attention: "Compute attention tile-by-tile to avoid materializing the full attention matrix."
- LoRA: "Fine-tune low-rank adapters instead of full weights."
- RLHF: "Train a reward model from human preferences, then optimize against it."

**Strong theoretical grounding**: Research that explains *why* something works, not just that it works, tends to be more robust. You can predict when it will fail.

**Large, simple improvements**: Be skeptical of techniques that show many small improvements across many benchmarks. Prefer techniques that show large improvements on core metrics.

**Robustness to hyperparameters**: If results require very specific hyperparameter settings, they may not reproduce in your context. Results stable across a range of settings are more trustworthy.

**Multiple independent validations**: If multiple groups reproduce similar results independently, confidence increases substantially.

### Signs of Concerning Research

**Complexity without proportional benefit**: A method with 10 novel components that achieves 2% improvement is concerning. What are all those components doing?

**Narrow wins**: A technique that only helps on one specific dataset or setting may not generalize.

**Required compute vastly different from paper**: If reproducing results requires 100x the compute reported in the paper, something is wrong.

**Extraordinary claims**: Papers claiming to "solve" a longstanding problem deserve extra scrutiny. If it seems too good to be true, check the baselines and evaluation methodology carefully.

**Authors with no implementation track record**: Novel techniques from authors who've never shipped production systems should be viewed differently than techniques from industry labs or researchers with deployment experience.

### Building Your Intuition

Taste develops through experience. Deliberately cultivate it:

**Track predictions**: When you read papers, explicitly predict: "Will this technique be widely adopted? In what timeframe?" Track your predictions. Review them after 1-2 years. Where were you right? Where were you wrong?
What should you have noticed?

**Study adoption patterns**: For techniques that succeeded, trace the adoption path. What made them successful? For techniques that failed, why? What were the warning signs?

**Engage with practitioners**: The research world and the production world have different information. Conferences give you research context; production experience gives you real-world grounding. You need both.

**Maintain healthy skepticism**: The default for any paper should be "probably not relevant to me." Move to interest only when evidence accumulates. This sounds cynical, but it's just realistic given base rates.

---

## Key Takeaways

1. **Most papers aren't relevant to you** - The skill is efficient filtering. Use multi-pass reading: first pass for relevance, second for understanding, third for critical evaluation.
2. **Reproducibility doesn't equal production viability** - Evaluate computational efficiency, integration complexity, robustness to real-world conditions, and long-term maintainability separately.
3. **Prototype to learn, not to build** - Define hypotheses and kill criteria before starting. Time-box ruthlessly. A failed prototype that answers your question quickly is a success.
4. **Benchmark with statistical rigor** - Control for variance, compare fairly against proper baselines, test at production scale, and require statistical significance before making decisions.
5. **Develop taste through deliberate practice** - Track your predictions about which techniques will be adopted. Study patterns of successful vs failed adoptions. Healthy skepticism is realistic, not cynical.

---

## Summary

The gap between research and production isn't a bug; it's a fundamental feature of how the two worlds operate. Researchers optimize for novelty, clean experimental conditions, and benchmark performance. Engineers optimize for reliability, efficiency, and robustness. Bridging this gap requires skills in both domains and a systematic approach to translation.

**Reading papers effectively** means using a multi-pass method, reading critically, and maintaining a sustainable practice. Most papers aren't relevant to you; the skill is filtering efficiently.

**Evaluating production viability** requires assessing computational efficiency, integration complexity, robustness, and maintainability. A reproducible paper can still be completely wrong for your production system.

**Prototyping efficiently** means focusing on learning, not building. Define hypotheses and kill criteria before starting. Time-box ruthlessly.

**Benchmarking rigorously** ensures you make decisions based on real evidence. Control for variance, compare fairly, test at scale, and require statistical significance.

**Managing the pipeline** with clear stages and gates prevents projects from drifting. Know when to proceed and when to kill.

**Developing taste** comes from experience. Track your predictions, study adoption patterns, and maintain healthy skepticism.

The practitioners who create the most value are those who can quickly identify which research matters for their context, validate it efficiently, and translate it into production impact, while avoiding the many techniques that sound promising but don't pan out.

### Connections to Other Chapters

- **Chapter 9 (LLM Deployment & Infrastructure)** provides context for evaluating production viability of new techniques
- **Chapter 15 (MLOps & Evaluation)** covers evaluation infrastructure that supports rigorous benchmarking
- **Chapter 21 (Deepening Technical Expertise)** discusses paper reading practices and building mental models
- **Chapter 23 (Technical Communication)** extends the stakeholder communication skills needed for research decisions
- **Chapter 23 (Technical Decision Making)** provides broader decision frameworks applicable to research adoption
- **Chapter 27 (Performance Engineering)** helps evaluate computational efficiency claims in papers

---

## Practical Exercises

### Exercise 1: Paper Evaluation

Select a paper published in the last 6 months in your area of interest.

1. Complete all three reading passes, taking notes as specified.
2. Identify at least three red flags or concerns.
3. Identify at least two strengths.
4. Rate production viability on each of the four dimensions.
5. Make a recommendation: Adopt, Trial, Assess, or Skip. Justify it.

**Self-Assessment Questions:**

- Did you complete all three passes, or did you jump to conclusions after the first?
- Are your red flags specific (citing sections/tables) or vague ("seems weak")?
- Does your viability assessment address YOUR context, not viability in general?
- Is your recommendation justified by the evidence you gathered?
- Could you defend this evaluation to a skeptical colleague?

**Quality Indicators:**

- *Strong*: Red flags cite specific sections ("Table 3 shows only 1% improvement over BM25 baseline")
- *Weak*: Red flags are vague ("results don't seem that impressive")
- *Strong*: Viability assessment considers your specific constraints ("our P99 latency requirement is 100ms")
- *Weak*: Viability assessment is generic ("seems efficient enough")

### Exercise 2: Reproduction Attempt

For a paper you've evaluated favorably:

1. Find or create a reproduction plan.
2. Attempt to reproduce core results.
3. Document what matched and what didn't.
4. Identify what knowledge was implicit (not in the paper).
5. Estimate how long production implementation would take.

**Self-Assessment Questions:**

- Did you verify reproduction on the paper's exact setup before modifying for your context?
- Did you document implicit knowledge as you discovered it (not reconstructed later)?
- Is your production estimate based on this reproduction experience or just a guess?
- Did you try contacting authors for missing details?

**Quality Indicators:**

- *Strong*: Documentation includes specific implicit knowledge ("layer norm epsilon must be 1e-5, not the default 1e-6")
- *Weak*: Documentation is vague ("some hyperparameters needed tuning")
- *Strong*: Production estimate includes breakdown (implementation, integration, testing, buffer)
- *Weak*: Production estimate is a single number without justification

### Exercise 3: Comparative Benchmark

Compare a new technique against your current production baseline:

1. Design a fair benchmark protocol.
2. Define at least three metrics to measure.
3. Run at least three trials per method.
4. Calculate statistical significance.
5. Perform cost-benefit analysis.
6. Make a final recommendation with quantified justification.

**Self-Assessment Questions:**

- Does your benchmark use production-representative data?
- Did you control for compute budget, hyperparameter tuning effort, and hardware?
- Did you run enough trials to establish statistical significance?
- Does your cost-benefit analysis include all relevant costs (implementation, maintenance, compute)?
- Would your recommendation change if variance were higher?

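
For step 4 of this exercise, one possible approach is sketched below using only the Python standard library. It is a minimal illustration, not a prescribed method: the function names `exact_permutation_p_value` and `bootstrap_ci_mean_diff` are hypothetical, the latency numbers are invented for the example, and a permutation test plus a percentile bootstrap is just one reasonable choice among several.

```python
from itertools import combinations
from statistics import fmean
import random


def exact_permutation_p_value(a, b):
    """Two-sided exact permutation test on the difference of means.

    Enumerates every way to split the pooled measurements into groups of
    the original sizes and counts splits at least as extreme as observed.
    """
    pooled = list(a) + list(b)
    observed = abs(fmean(a) - fmean(b))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [x for i, x in enumerate(pooled) if i not in chosen]
        total += 1
        # Small tolerance so float rounding cannot exclude the observed split.
        if abs(fmean(group_a) - fmean(group_b)) >= observed - 1e-9:
            extreme += 1
    return extreme / total


def bootstrap_ci_mean_diff(a, b, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = sorted(
        fmean(rng.choices(a, k=len(a))) - fmean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical P50 latencies in milliseconds, one value per benchmark trial.
baseline_3 = [120.0, 118.0, 123.0]
candidate_3 = [101.0, 99.0, 104.0]
baseline_6 = [120.0, 118.0, 123.0, 121.0, 119.0, 122.0]
candidate_6 = [101.0, 99.0, 104.0, 100.0, 102.0, 103.0]

if __name__ == "__main__":
    print("p-value, 3 trials each:", exact_permutation_p_value(baseline_3, candidate_3))
    print("p-value, 6 trials each:", exact_permutation_p_value(baseline_6, candidate_6))
    print("95% CI for latency saved (ms):", bootstrap_ci_mean_diff(baseline_6, candidate_6))
```

A useful side effect of the exact test is that it makes the trial-count question concrete. With three trials per method there are only 20 ways to split six measurements, so even perfectly separated results cannot produce a two-sided p-value below 2/20 = 0.10. With six trials per method there are 924 splits, and the same clean separation gives 2/924, roughly 0.002. The bootstrap interval on so few samples is rough and should be read as an indication, not a precise bound.
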
**Quality Indicators:**

- *Strong*: Statistical analysis includes confidence intervals and p-values
- *Weak*: Statistical analysis is just "average of 3 runs"
- *Strong*: Cost-benefit includes ongoing costs (maintenance, compute) not just implementation
- *Weak*: Cost-benefit only considers implementation time
- *Strong*: Recommendation acknowledges uncertainty ("Given high variance, we should run more trials before committing")
- *Weak*: Recommendation ignores statistical uncertainty

### Exercise 4: Historical Analysis

Select a technique that was adopted in your organization or in the industry broadly:

1. Find the original paper(s).
2. Trace the adoption timeline.
3. What enabled adoption? What slowed it?
4. What issues emerged in production?
5. With hindsight, what would you have looked for to predict these outcomes?

**Self-Assessment Questions:**

- Did you trace adoption through primary sources (papers, blog posts) or rely on secondhand accounts?
- Did you identify both technical and non-technical factors in adoption?
- Are your "with hindsight" observations actually predictable from the original paper, or are they only obvious in retrospect?

**Quality Indicators:**

- *Strong*: Timeline includes specific dates and milestones
- *Weak*: Timeline is vague ("took about a year")
- *Strong*: Analysis distinguishes what was predictable from what was genuinely surprising
- *Weak*: Analysis claims everything was obvious in hindsight

### Exercise 5: Prediction Tracking

For each of five current papers you find interesting:

1. Predict: Will this be widely adopted in 2 years? Why or why not?
2. Record your predictions with reasoning.
3. Set a calendar reminder to review in 2 years.
4. When you review, analyze where your predictions were right and wrong.

**Self-Assessment Questions:**

- Are your predictions specific enough to be clearly right or wrong?
- Did you record reasoning at prediction time (not reconstructed later)?
- Are you predicting actual adoption or just continued research interest?
- Did you consider what "widely adopted" means (by whom? in what contexts?)?

**Quality Indicators:**

- *Strong*: Predictions define success criteria ("I predict >50% of production LLM systems will use this by 2027")
- *Weak*: Predictions are vague ("this will probably be important")
- *Strong*: Reasoning identifies specific factors that would lead to adoption or failure
- *Weak*: Reasoning is "this seems good"

### Exercise 6: Research Radar Creation

Create a research radar for your team or area:

1. Identify 10-15 techniques relevant to your work.
2. Categorize each as ADOPT, TRIAL, ASSESS, HOLD, or SKIP.
3. For each TRIAL or ASSESS item, define next actions and owners.
4. Document reasoning for at least 3 SKIP decisions.
5. Present to colleagues and refine based on feedback.

**Self-Assessment Questions:**

- Do your categories reflect current state, not aspiration?
- Are SKIP decisions documented well enough that someone could understand why without asking you?
- Did feedback from colleagues change any categorizations?
- Is the radar maintainable (not too many items, clear update process)?

**Quality Indicators:**

- *Strong*: SKIP decisions include specific reasoning ("superseded by X" or "doesn't address our constraint Y")
- *Weak*: SKIP decisions are just "not relevant"
- *Strong*: TRIAL items have specific next actions with deadlines
- *Weak*: TRIAL items are "investigate further"

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [IC2]** What is the three-pass method for reading papers? Why is it more effective than reading start-to-finish?

<details>
<summary>Answer</summary>

Three-pass method:

- Pass 1 (5-10 min): Title, abstract, intro, conclusion, skim figures. Decide: read more or skip?
- Pass 2 (30-60 min): Main ideas, skip math/implementation details. Summarize key contributions, identify claims.
- Pass 3 (2-4 hours): Reproduce reasoning, understand details, identify implicit assumptions.

Why better than linear reading: (1) Efficient filtering: Most papers aren't worth Pass 3. (2) Prevents sunk cost: Linear reading commits time before knowing value. (3) Builds context first: Understanding the big picture makes details easier. (4) Identifies questions: You know what to look for in deeper passes. (5) Matches reading goals: Sometimes Pass 1 awareness is all you need.

</details>

**Q2. [IC2]** What are the four dimensions of production viability for a research technique? Why might a technique score high on one dimension but low on others?

<details>
<summary>Answer</summary>

Four dimensions:

1. Quality improvement: Does it actually work better than current methods?
2. Efficiency: Can we run it at acceptable latency/cost?
3. Robustness: Does it work on real-world data, not just benchmarks?
4. Integration: How hard is it to integrate with existing systems?

Why dimensions are independent: (1) Quality vs efficiency: Bigger models are often better but slower/more expensive. (2) Quality vs robustness: Benchmark-tuned methods may overfit to specific conditions. (3) Efficiency vs integration: Highly optimized code may be hard to maintain. (4) All vs integration: Brilliant research may require complete system redesign.

Example: A new retrieval method might show 15% quality improvement (high) but require a complete index rebuild (low integration), GPU inference (low efficiency), and only work on English text (low robustness).

</details>

**Q3. [Senior]** What are kill criteria for research prototypes and why are they important? How do you set good kill criteria?

<details>
<summary>Answer</summary>

Kill criteria: Pre-defined conditions under which you stop pursuing a research direction. Set before starting, enforced during execution.

Why important: (1) Prevents sunk cost fallacy: "We've invested so much, let's try one more thing."
(2) Enables parallel exploration: Can try multiple approaches with bounded investment. (3) Forces clarity: Defining failure makes success criteria clearer. (4) Saves resources: Failing fast leaves time for better ideas.

Setting good kill criteria:

1. Define before starting (not during, when bias is high).
2. Make them measurable: "If accuracy <85% after 2 weeks, stop."
3. Include time bounds: Unbounded exploration never ends.
4. Consider multiple metrics: Accuracy, latency, maintainability.
5. Get external commitment: Tell someone your criteria.

Warning signs of bad criteria: (1) "We'll know it when we see it." (2) Criteria adjusted when approaching failure. (3) No time bound. (4) Only success defined, not failure.

</details>

**Q4. [Senior]** What red flags in a paper should make you skeptical about production viability?

<details>
<summary>Answer</summary>

Red flags:

Evaluation concerns:

- Comparisons against weak/outdated baselines
- Evaluation only on one dataset, especially if curated
- Missing statistical significance tests
- Metrics that don't match production needs (accuracy vs latency)
- Held-out test set that's too similar to training

Reproducibility concerns:

- Missing implementation details
- No code release
- Results only from proprietary data
- Hyperparameters not specified
- Results that seem too good

Practical concerns:

- Computation requirements buried/missing
- Inference cost not discussed
- Scaling behavior not shown
- Only works on specific data types
- Requires components that are impractical

Implicit assumptions:

- Clean data assumed
- Single language/domain
- Specific hardware requirements
- Expertise required to tune
- Failure modes not discussed

No single red flag disqualifies a paper, but multiple red flags warrant skepticism.

</details>

**Q5. [Staff]** How do you decide whether to adopt a new technique early (for competitive advantage) versus waiting for it to mature (for lower risk)? What factors influence this decision?

<details>
<summary>Answer</summary>

Factors favoring early adoption:

- Technique provides significant competitive advantage
- Cost of being wrong is low (reversible decision)
- You have expertise to manage immaturity
- Early mover advantage is substantial
- Technique is simple enough to understand deeply

Factors favoring waiting:

- Cost of failure is high (customer-facing, safety-critical)
- Technique is complex, easy to misimplement
- Ecosystem support is critical (libraries, tooling)
- Industry standards likely to emerge
- Your team lacks expertise to debug issues

Decision framework:

1. How reversible is adoption? (Easy = adopt early)
2. How large is the expected improvement? (Large = adopt early)
3. How mature is implementation? (Code/libraries = adopt earlier)
4. How critical is the use case? (Critical = wait)
5. Can you run in shadow mode first? (Yes = adopt earlier)

Strategy: Often a "trial" approach works best: adopt in limited scope, measure carefully, expand if successful. This limits downside while capturing upside.

</details>

### Spot the Problem

**Problem 1. [IC2]** An engineer's paper evaluation:

"This paper shows 10% improvement on MMLU benchmark. That's significant. We should implement it."

What's missing from this evaluation?

<details>
<summary>Answer</summary>

Missing considerations:

Quality: (1) What's the baseline they compare against? (Weak baselines inflate improvements.) (2) Is 10% on MMLU meaningful for YOUR use case? (3) Do they show results on multiple benchmarks or just the one where they shine?

Efficiency: (4) What's the compute cost? (5) How does latency compare to current approach? (6) Does it require additional infrastructure?

Robustness: (7) Does it work on real data, not just benchmarks? (8) What languages/domains was it tested on? (9) What are the failure modes?

Integration: (10) How hard is implementation? (11) Is there code available? (12) Does it require changes to our architecture?

Better evaluation: "This paper shows 10% on MMLU.
Before recommending, I need to: (1) reproduce on our benchmark, (2) measure latency/cost, (3) test on production-like data. I'll spend 3 days on a spike."

</details>

**Problem 2. [Senior]** A prototype evaluation report:

"We spent 4 weeks implementing the new retrieval method. It works on our test set but only achieves 72% accuracy instead of the paper's 89%. We're debugging why our results differ. We think with another 2-3 weeks we can close the gap."

What concerns should you raise?

<details>
<summary>Answer</summary>

Concerns:

1. Sunk cost: 4 weeks invested, asking for 2-3 more. Is this justified by potential value, or escalation of commitment?
2. No kill criteria mentioned: When do we stop? What if 2-3 more weeks also fails?
3. Gap is very large: 72% vs 89% suggests a fundamental issue, not just tuning. Has the root cause been identified?
4. "We think" language: 2-3 weeks is a guess, not an estimate. What specifically needs to happen?
5. No alternative considered: Is this technique the only option? Should we pursue a parallel path?

Questions to ask: (1) What specifically is causing the gap? (2) What evidence suggests more time will help? (3) What's the kill criterion for this approach? (4) Is the paper's 89% reproducible by anyone? (5) What's the opportunity cost of another 2-3 weeks?

Recommendation: Define explicit kill criterion (e.g., 85% by end of next week). If not met, stop and consider alternatives.

</details>

**Problem 3. [Staff]** A team's approach to research adoption:

"We have a researcher who reads papers full-time and recommends what to implement. Developers then build what the researcher recommends. This separates concerns—researchers research, engineers engineer."

What's problematic about this structure?

<details>
<summary>Answer</summary>

Problems:

1. Translation gap: Researchers may not understand production constraints. What looks good in a paper may be impractical.
2. Lost context: Engineers implementing without understanding why make poor tradeoff decisions.
3. Feedback loop broken: Engineers see what doesn't work in practice; this needs to inform paper evaluation.
4. Ownership split: Who owns the outcome: the researcher for recommending or the engineer for implementing?
5. Incentive misalignment: Researcher is incentivized to find novel things; engineers may prefer stable, proven approaches.
6. Speed bottleneck: All decisions flow through one person's reading capacity.

Better structures:

- Option 1: Engineers read papers relevant to their work, with researcher as consultant for deep expertise.
- Option 2: Researcher-engineer pairs work together on evaluation and implementation.
- Option 3: Researcher does initial filter, engineers do final assessment and implementation together.

Key principle: Production viability assessment requires both research judgment and engineering judgment. Don't separate them.

</details>

---

## Self-Assessment: Research Translation Capabilities

Rate yourself on each capability (1 = novice, 5 = expert):

**Paper Reading and Evaluation**

- [ ] I can efficiently filter papers using the three-pass method
- [ ] I identify red flags and implicit assumptions in papers
- [ ] I assess production viability across all four dimensions
- [ ] I maintain a sustainable paper reading practice

**Reproduction and Prototyping**

- [ ] I can reproduce paper results (or diagnose why not)
- [ ] I design prototypes that answer specific hypotheses
- [ ] I define and enforce kill criteria
- [ ] I budget appropriately for reproduction challenges

**Benchmarking and Evaluation**

- [ ] I design fair comparative benchmarks
- [ ] I use statistical methods to establish significance
- [ ] I perform cost-benefit analysis before recommending adoption
- [ ] I test at production scale before passing validation

**Judgment and Taste**

- [ ] I track my predictions and learn from errors
- [ ] I can explain research decisions to non-technical stakeholders
- [ ] I know when to adopt early vs.
wait for maturity
- [ ] I maintain appropriate skepticism without dismissing promising work

**Growth Areas:**

Based on your self-assessment, identify 2-3 specific areas to develop. For each:

1. What specific practice would build this skill?
2. How will you measure improvement?
3. What timeline will you set?

---

## Further Reading

### Essential

- **Keshav (2007), "How to Read a Paper"** - Classic three-pass approach to academic paper reading.
- **Sculley et al. (2015), "Hidden Technical Debt in Machine Learning Systems"** - Essential context for production ML challenges.
- **Hooker (2021), "The Hardware Lottery"** - Why hardware determines which ideas succeed.

### Deep Dives

- **Paleyes et al. (2022), "Challenges in Deploying Machine Learning: A Survey of Case Studies"** - Survey of deployment challenges across industry.
- **Pineau et al. (2021), "Improving Reproducibility in Machine Learning Research"** - Systematic analysis of reproducibility issues.

### Case Studies

- **Dao et al. (2022), "FlashAttention"** - Study as example of research that translated well to production.
- **Kwon et al. (2023), "PagedAttention/vLLM"** - Systems research with clear production applicability.