Chapter 30: Data Architecture for AI

Keywords

data architecture, feature stores, data pipelines, data quality, versioning, point-in-time correctness, lakehouse

Introduction

In early 2021, a team at a major e-commerce company deployed what they believed was their best fraud detection model yet. Offline metrics were exceptional: 94% precision, 91% recall, a clear improvement over the previous system. The model sailed through A/B testing and rolled out to production traffic. Within two weeks, fraud losses had increased by 40%.

The postmortem revealed a subtle but devastating problem. During training, the model used a feature called “average transaction amount over the past 7 days.” The training pipeline computed this by looking at all transactions within a 7-day window ending at the transaction time. But the serving pipeline, built by a different team using a different codebase, computed the same feature using a 7-day window ending at midnight the previous day. A seemingly minor difference, but it meant the model in production was seeing systematically stale data. Fraudulent transactions that happened in the current day weren’t reflected in the risk score until the next day.

This is training-serving skew, the silent killer of ML systems. The model worked perfectly in the training environment where features were computed correctly. It failed silently in production where features were computed differently. No monitoring caught it because the model was returning valid predictions. Only the downstream business metric revealed the problem.

This chapter is about building data architectures that prevent such failures. Data architecture for AI isn’t just about storing and moving data. It’s about ensuring that the data your model sees during training matches what it sees during serving, that features are computed consistently, that historical data doesn’t leak into predictions, and that you can diagnose and fix problems when they inevitably occur.

The Fundamental Challenge: Two Worlds That Must Match

Machine learning creates a unique architectural challenge that doesn’t exist in traditional software systems. You build your model in one environment (training) but deploy it in another (serving). These environments have fundamentally different requirements:

Training environment:

Batch processing over historical data
Time-travel capability: “What features were available at this timestamp?”
High throughput, latency tolerance (hours of processing is acceptable)
Reproducibility: running the same pipeline should produce the same results

Serving environment:

Real-time processing for live requests
Current features: “What features are available right now?”
Low latency, millisecond response requirements
Reliability: the system must work even when components fail

The data that bridges these two worlds must be computed identically in both. This sounds simple but is extraordinarily difficult in practice. Different teams, different codebases, different programming languages, different infrastructure, different time pressures. The surface area for subtle bugs is immense.

What Goes Wrong Without Proper Data Architecture

The fraud detection story is one manifestation of a broader pattern. Here’s what happens when organizations treat data architecture as an afterthought:

Silent model degradation: Models make predictions using different features than they were trained on. Performance degrades, but there’s no clear signal pointing to data as the cause. Teams waste weeks debugging model architecture when the problem is data.

Irreproducible experiments: Data scientists can’t recreate past results because training data has changed. A model that worked three months ago can’t be retrained because the features it used have been modified.

Feature chaos: Teams reinvent the same features repeatedly. One team computes “user activity score” one way; another team computes it differently. Neither can reuse the other’s work. Technical debt accumulates.

Debugging paralysis: When something goes wrong, no one can trace back from a prediction to understand what data produced it. Root cause analysis becomes archaeological excavation.

Compliance nightmares: Regulations require explaining what data influenced a decision. Without lineage, this is impossible to answer.

A Mental Model for ML Data Architecture

Think of ML data architecture as building a system where the past and present versions of your data must coexist and remain synchronized.

Imagine a library where books are constantly being revised. The training environment is like a historian who needs to read exactly what the books said on specific dates in the past. The serving environment is like a reader who needs to read the current edition immediately. Your data architecture must support both use cases while ensuring that when the historian and reader look at the same page, they see consistent information.

The key abstractions that enable this are:

Feature stores: A shared repository where features are defined once and served consistently to both training and serving
Point-in-time correctness: The ability to retrieve what features looked like at any historical timestamp
Data versioning: Treating datasets like code, with versions, diffs, and rollback capability
Lineage tracking: Recording how data flows and transforms, enabling debugging and compliance

What You’ll Learn

Why training-serving skew is so dangerous and the mechanisms that cause it
How feature stores provide a single source of truth for feature computation
Point-in-time joins and preventing data leakage in training
Data versioning strategies and when to use them
Building observability into data pipelines
Decision frameworks for when to invest in each capability

Prerequisites

Understanding of ML model training and serving (Part II)
Distributed systems fundamentals (Chapter 25)
Production systems experience (Part III)

The Theory of Training-Serving Skew

Understanding Skew at a Fundamental Level

Training-serving skew occurs when the distribution of features seen during training differs from the distribution seen during serving. This sounds abstract, so let’s make it concrete.

A model learns a function: given these input features, predict this output. During training, the model observes pairs of (features, labels) and learns the mapping between them. The learned function is only valid for the distribution of features it was trained on.

Mathematically, if we train on features drawn from distribution $P_{train}(X)$ but serve on features drawn from distribution $P_{serve}(X)$, the model’s performance generally degrades as the divergence between these distributions grows, although the relationship is not strictly proportional. This is the covariate shift problem from statistical learning theory.

But training-serving skew is worse than natural covariate shift. Natural covariate shift happens when the world changes, like user behavior evolving over time. Training-serving skew happens when the same underlying reality produces different feature values depending on which system computes them. It’s a self-inflicted wound.

The Taxonomy of Skew

Skew manifests in several distinct ways:

Feature computation skew: The most common form. The same feature is computed using different logic in training versus serving. Like the fraud detection example: same conceptual feature, different window calculations.

This happens when:

Training uses SQL queries on a data warehouse; serving uses Python code hitting a cache
Training uses batch processing frameworks (Spark); serving uses stream processing (Flink)
Different teams implement “the same” feature independently
Preprocessing libraries have version differences
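To make feature computation skew concrete, the sketch below recomputes the fraud example's "average transaction amount over the past 7 days" in two ways. The transaction data, function names, and the reading of "midnight the previous day" as the most recent midnight are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical transaction log for one user: (timestamp, amount).
transactions = [
    (datetime(2021, 3, 1, 9, 0), 20.0),
    (datetime(2021, 3, 5, 14, 0), 40.0),
    (datetime(2021, 3, 8, 8, 30), 900.0),  # same-day burst
    (datetime(2021, 3, 8, 8, 45), 950.0),
]

def avg_amount(txns, window_start, window_end):
    """Mean amount of transactions with window_start <= ts < window_end."""
    amounts = [amt for ts, amt in txns if window_start <= ts < window_end]
    return sum(amounts) / len(amounts) if amounts else 0.0

def training_feature(txns, event_time):
    # Training logic: 7-day window ending at the transaction time.
    return avg_amount(txns, event_time - timedelta(days=7), event_time)

def serving_feature(txns, event_time):
    # Serving logic: 7-day window ending at the most recent midnight,
    # so nothing from the current day is included.
    midnight = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return avg_amount(txns, midnight - timedelta(days=7), midnight)

event_time = datetime(2021, 3, 8, 9, 0)
train_value = training_feature(transactions, event_time)
serve_value = serving_feature(transactions, event_time)
print(train_value, serve_value)
```

Both functions claim to compute the same feature and both return a valid number, yet the serving value ignores the same-day burst that the training value captures.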

Data freshness skew: Features are stale at serving time compared to training. If the model trained on features computed from data up to T-0 (the prediction time), but serving uses features that lag behind by hours or days, the model sees systematically older data in production.

Data distribution skew: The production data distribution has drifted from training. This is natural covariate shift, but it interacts with the infrastructure. For example, if serving only uses cached features and the cache has limited coverage, low-traffic entities get stale features while high-traffic entities get fresh ones. This creates a biased feature distribution.

Missing data skew: Missing values are handled differently. Training might drop rows with missing features; serving might impute zeros. The model learns on clean data but predicts on imputed data.

Preprocessing skew: Tokenization, normalization, encoding done differently. A text model trained with one tokenizer but served with another will see completely different input representations.

The Insidiousness of Skew

What makes skew particularly dangerous is that it’s invisible to standard monitoring. The system appears healthy:

Model endpoint is returning predictions (no errors)
Latency is within SLA
Input features pass schema validation
Output predictions are valid probabilities

All green dashboards. Meanwhile, the model is making decisions based on data that doesn’t match what it learned from. It’s like a pilot flying through fog using instruments that are subtly miscalibrated. Everything looks normal until you hit the mountain.

Common Mistake: Assuming Green Metrics Mean No Skew

What people do: Monitor model endpoint latency, error rates, and prediction confidence. All look healthy. Conclude the model is working correctly.

Why it fails: Standard API monitoring catches system failures, not data problems. A model can return valid predictions (within range, formatted correctly, low latency) while making decisions based on systematically wrong features. If your serving pipeline computes a feature differently than training, every prediction is affected but no error is raised.

Fix: Monitor feature distributions, not just model outputs. Compare serving feature statistics against training baselines using PSI or KS tests. Alert on distribution drift even when predictions look valid. The first sign of training-serving skew should come from your data monitoring, not customer complaints.

Quantifying Skew: Mathematical Foundations

To detect skew, we need to measure the divergence between training and serving feature distributions. Several metrics are commonly used:

Population Stability Index (PSI): Measures how much a distribution has shifted relative to a baseline.

\[PSI = \sum_{i=1}^{n} (P_i - Q_i) \cdot \ln\frac{P_i}{Q_i}\]

Where $P_i$ and $Q_i$ are the proportions in bin $i$ for the serving and training distributions respectively.

Interpretation:

PSI < 0.1: No significant shift
0.1 ≤ PSI < 0.25: Moderate shift, monitor closely
PSI ≥ 0.25: Significant shift, investigate
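A minimal sketch of the PSI formula over pre-binned counts follows. The function name, the `eps` floor for empty bins, and the example counts are assumptions; production code would also need a binning strategy, which is not shown here.

```python
import math

def psi(training_counts, serving_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts.

    training_counts: per-bin counts from the training baseline (Q).
    serving_counts:  per-bin counts from serving traffic (P).
    eps floors each proportion so an empty bin does not cause log(0).
    """
    if len(training_counts) != len(serving_counts):
        raise ValueError("both inputs must use the same bins")
    q_total = sum(training_counts)
    p_total = sum(serving_counts)
    total = 0.0
    for q_n, p_n in zip(training_counts, serving_counts):
        q = max(q_n / q_total, eps)
        p = max(p_n / p_total, eps)
        total += (p - q) * math.log(p / q)
    return total

# Hypothetical bin counts for one feature.
baseline = [25, 25, 25, 25]
shifted = [10, 20, 30, 40]

psi_same = psi(baseline, baseline)
psi_shifted = psi(baseline, shifted)
print(psi_same, psi_shifted)
```

With these counts the shifted distribution lands in the "moderate shift" band. The `eps` floor is one common workaround for empty bins, and the value chosen affects the result when bins are sparse.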

Jensen-Shannon Divergence (JSD): A symmetric measure of divergence between two distributions.

\[JSD(P || Q) = \frac{1}{2}KL(P || M) + \frac{1}{2}KL(Q || M)\]

Where $M = \frac{1}{2}(P + Q)$ and KL is the Kullback-Leibler divergence.

JSD is bounded between 0 and 1 (using log base 2), making it easier to set universal thresholds.
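The definition above translates directly into code. This sketch assumes both inputs are already normalized probability vectors over the same bins; the function names are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits. Terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    # The mixture m is nonzero wherever p or q is nonzero,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
print(identical, disjoint)
```

Identical distributions give 0 and distributions with no overlap give 1, which is what makes a single threshold usable across features.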

Kolmogorov-Smirnov statistic: For continuous features, the maximum difference between cumulative distribution functions:

\[D = \sup_x |F_{train}(x) - F_{serve}(x)|\]

The p-value from a KS test indicates whether the difference is statistically significant.
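For intuition, the two-sample KS statistic can be computed by walking both sorted samples and tracking the gap between their empirical CDFs. This is a sketch of the statistic only; it does not compute a p-value, and the function name is an assumption.

```python
def ks_statistic(train, serve):
    """Maximum gap between the empirical CDFs of two samples."""
    if not train or not serve:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(train), sorted(serve)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Once one sample is exhausted its CDF is 1 and the gap can only shrink.
    return d

d_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
d_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
print(d_same, d_overlap, d_disjoint)
```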

Practical detection: Monitor these metrics for every feature in production. Set alerts when they exceed thresholds. But remember: statistical significance doesn’t imply practical significance. A feature might have shifted but still produce correct predictions. Correlate skew detection with business metrics.

Per-feature PSI/JSD/KS miss multivariate and concept drift

PSI, JSD, and KS as described here are univariate—computed one feature at a time. They are blind to two failure modes that routinely break models in production. First, multivariate drift: every feature’s marginal distribution can be unchanged (PSI ≈ 0 on each) while the joint distribution has moved—correlations between features shift, or new feature combinations appear, neither visible to per-feature checks. Second, concept drift: the input distribution P(X) can be perfectly stable while the input→output relationship P(Y|X) moves, so the model is now wrong on inputs that look statistically identical to training data. Neither is detectable from feature distributions alone. Catching them requires methods that look at features jointly (multivariate distance, a domain classifier trained to tell training from serving data) and, for concept drift, monitoring actual outcomes—prediction-vs-label error over time or outcome correlation—rather than inputs.
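A toy illustration of the multivariate blind spot: in the made-up data below, each feature has exactly the same marginal values in training and serving, so any per-feature check reports no shift, while the relationship between the two features has inverted.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature pairs; rows are aligned by index.
train_x, train_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]
serve_x, serve_y = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]

# Each marginal is unchanged between training and serving.
marginals_match = (sorted(train_x) == sorted(serve_x)
                   and sorted(train_y) == sorted(serve_y))

train_corr = pearson(train_x, train_y)
serve_corr = pearson(serve_x, serve_y)
print(marginals_match, train_corr, serve_corr)
```

Tracking pairwise correlations is only one cheap joint check. It would not catch every kind of joint shift, which is why the domain-classifier approach mentioned above is often preferred.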

Feature Stores: The Single Source of Truth

Staff Engineer Perspective

“Feature store v1 was a wrapper around Redis I wrote in a week because we needed something. Two years later we had 600 features, three teams using it, and zero point-in-time correctness. Every model retrain silently leaked future data through aggregates that were ‘as of now’ rather than ‘as of training row timestamp.’ Offline AUC was 0.91, online was 0.74, and we couldn’t explain why for months. The rebuild took 7 months and cost us roughly $1.4M including the team time. What I wish I’d built on day one: explicit event timestamps on every feature row, even if I had to fake them. The schema decisions you defer in week one are the migrations you pay for in year three.”

— Principal ML Engineer at an ad-tech company

Why Feature Stores Emerged

Feature stores didn’t exist as a recognized category until Uber publicly described its Michelangelo platform in a 2017 engineering blog post. Before that, ML teams built bespoke data pipelines for each model. Feature computation logic was scattered across training notebooks, batch jobs, and serving microservices. Maintaining consistency was manual and error-prone.

Uber’s insight was that features are first-class citizens in ML systems. They should be:

Defined once and reused across models
Computed consistently for training and serving
Discoverable so teams don’t reinvent them
Governed with ownership and access controls

This shifted the mental model from “data pipelines that happen to produce features” to “features as products with their own lifecycle.”

The Dual Store Architecture

A feature store has two primary storage systems optimized for different access patterns:

Offline store: Optimized for batch reads over historical data. Used for training data generation. Typically built on columnar formats (Parquet, Delta Lake, Iceberg) stored in object storage (S3, GCS) or data warehouses (BigQuery, Snowflake, Redshift).

Access pattern: “Give me the value of features X, Y, Z for all entities E at their respective event timestamps T.”

Online store: Optimized for low-latency point lookups. Used for serving. Typically built on key-value stores (Redis, DynamoDB, Cassandra, Bigtable).

Access pattern: “Give me the current value of features X, Y, Z for entity E, right now, in under 10 milliseconds.”

The critical architectural property is that both stores are populated from the same feature computation logic. Whether the computation is batch (Spark), streaming (Flink), or on-demand, the transformation logic is defined once and applied consistently.
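The "defined once, applied to both stores" property can be sketched with in-memory stand-ins. Everything here is hypothetical: `spend_features`, the list used as an offline store, and the dict used as an online store are placeholders for real systems, and the sketch assumes events arrive in timestamp order.

```python
def spend_features(raw_event):
    """The single definition of the transformation (illustrative logic)."""
    return {
        "amount_usd": raw_event["amount_cents"] / 100.0,
        "is_large": raw_event["amount_cents"] >= 50_000,
    }

offline_store = []  # append-only history: one row per (entity, timestamp)
online_store = {}   # latest values per entity, for point lookups

def ingest(raw_event):
    # Both stores receive the output of the same function, so the
    # feature logic cannot diverge between training and serving.
    features = spend_features(raw_event)
    offline_store.append({
        "entity_id": raw_event["entity_id"],
        "event_timestamp": raw_event["event_timestamp"],
        **features,
    })
    online_store[raw_event["entity_id"]] = features

ingest({"entity_id": "u1", "event_timestamp": 100, "amount_cents": 1250})
ingest({"entity_id": "u1", "event_timestamp": 200, "amount_cents": 60000})
print(len(offline_store), online_store["u1"])
```

The offline store keeps every historical row for training, while the online store keeps only the newest value per entity for serving.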

Point-in-Time Correctness: The Key Abstraction

The most important capability of a feature store’s offline component is point-in-time correct feature retrieval. This prevents a category of bugs called data leakage.

Data leakage occurs when information from the future influences predictions about the past. If you’re predicting whether a user will churn next month, you can’t use features computed from data that includes next month. This sounds obvious, but it’s easy to violate accidentally.

Consider: you want to train a model to predict purchase probability at the moment a user views a product. Your training data has user-product pairs with timestamps. You need features like “user’s purchase count in the last 30 days.”

A naive implementation: join the entity table (user-product-timestamp) with a pre-aggregated features table (user-features). If the features table has one row per user with their latest 30-day purchase count, you’ve just leaked future data. Users who appear multiple times in your training data will have features computed from their entire history, including purchases that happened after the training example.

A point-in-time join fixes this. For each row in your entity table, it retrieves the most recent feature values that were known at or before that row’s timestamp, and never values recorded after it. The feature store maintains feature timestamps and performs this temporal join automatically.

# Point-in-time join pseudocode
# offline_store and its get_feature_value method are illustrative, not a real API.
def get_training_data(entities_df, feature_refs, offline_store):
    """
    entities_df is a pandas DataFrame with columns:
    entity_id, event_timestamp, label

    For each row, we need features that were known at event_timestamp,
    not features computed after event_timestamp.
    """
    results = []

    columns = ['entity_id', 'event_timestamp', 'label']
    # itertuples yields one plain tuple per row; iterrows would yield
    # (index, row) pairs and break the three-way unpacking.
    rows = entities_df[columns].itertuples(index=False, name=None)
    for entity_id, event_timestamp, label in rows:
        # For each feature, find the most recent value
        # where feature_timestamp <= event_timestamp
        feature_values = {}
        for feature_ref in feature_refs:
            value = offline_store.get_feature_value(
                feature_ref=feature_ref,
                entity_id=entity_id,
                as_of_timestamp=event_timestamp  # Critical: temporal cutoff
            )
            feature_values[feature_ref] = value

        results.append({
            'entity_id': entity_id,
            'event_timestamp': event_timestamp,
            'label': label,
            **feature_values
        })

    return results

This is computationally expensive. For each training example, you’re potentially scanning feature history. Feature stores optimize this through:

Pre-computed snapshots at regular intervals
Efficient storage formats with timestamp indexes
Approximate joins (accepting small temporal fuzziness when exact joins are too expensive)
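One way to picture a timestamp index is a per-entity sorted list of timestamps searched with binary search. The `FeatureHistory` class below is a hypothetical in-memory sketch, not how any particular feature store implements it. Timestamps are plain integers for brevity.

```python
from bisect import bisect_right

class FeatureHistory:
    """Per-entity history of one feature, kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = {}  # entity_id -> sorted list of timestamps
        self._values = {}      # entity_id -> values aligned with timestamps

    def write(self, entity_id, timestamp, value):
        tss = self._timestamps.setdefault(entity_id, [])
        vals = self._values.setdefault(entity_id, [])
        idx = bisect_right(tss, timestamp)  # keeps lists sorted on late writes
        tss.insert(idx, timestamp)
        vals.insert(idx, value)

    def as_of(self, entity_id, timestamp):
        """Most recent value with feature_timestamp <= timestamp, else None."""
        tss = self._timestamps.get(entity_id, [])
        idx = bisect_right(tss, timestamp)
        return self._values[entity_id][idx - 1] if idx else None

history = FeatureHistory()
history.write("u1", 100, 3)
history.write("u1", 200, 5)
print(history.as_of("u1", 150), history.as_of("u1", 200), history.as_of("u1", 50))
```

Each lookup is logarithmic in the length of the entity's history instead of a full scan, which is the same idea that sorted or partitioned storage layouts exploit at larger scale.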

Online Feature Serving: The Latency Budget

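Online feature serving must happen within the latency budget of the prediction request, typically 10-100 milliseconds total. Everything in the request path shares that budget: if the end-to-end budget is 80ms and model inference takes 30ms, feature retrieval and all remaining overhead must fit in the other 50ms.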

This constrains what’s possible:

Features must be pre-computed and cached
Complex aggregations must be done in advance, not at request time
The feature store must be highly available and partitioned appropriately

Online serving patterns:

Lookup-only: Features are pre-computed by batch or streaming jobs. Online serving is pure key-value lookup. Simplest pattern, lowest latency.

Lookup + on-demand: Some features are pre-computed, others are computed at request time. On-demand features are typically transformations of request data (like embedding a query) rather than aggregations.

Materialized views: For features that are aggregations over entity activity (count of clicks in last hour), maintain materialized views that are incrementally updated by streaming jobs.
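The materialized-view pattern can be sketched as a sliding-window counter that a streaming job updates per event, so that the read path does no aggregation over raw history. The class name and integer-second timestamps are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import deque

class SlidingWindowCount:
    """Count of events in the window (now - window_seconds, now]."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # event timestamps, oldest first

    def record(self, timestamp):
        # Called by the streaming job for each incoming event.
        self._events.append(timestamp)

    def value(self, now):
        # Evict events that have fallen out of the window, then count.
        cutoff = now - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)

clicks_last_hour = SlidingWindowCount(window_seconds=3600)
for ts in (0, 1800, 3500):
    clicks_last_hour.record(ts)

count_at_3600 = clicks_last_hour.value(now=3600)
count_at_5400 = clicks_last_hour.value(now=5400)
print(count_at_3600, count_at_5400)
```

Whichever boundary convention the online counter uses, the offline computation of the same feature has to use the identical one, or the two stores will disagree at the window edges.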

The Feature Store Landscape in 2025-2026

The feature store market has matured from early open-source projects to a full ecosystem:

Feast: The leading open-source feature store. Cloud-agnostic, extensible, active community. Best for teams that want control and can manage infrastructure. Requires external compute for feature transformations.

Tecton: Enterprise-grade managed service from ex-Uber engineers who built Michelangelo. Best-in-class streaming feature computation. Higher cost, best for organizations with complex real-time feature requirements.

Databricks Feature Store: Tightly integrated with Unity Catalog and Delta Lake. Best for organizations already invested in Databricks. Excellent for Spark-based feature engineering.

Cloud provider offerings (SageMaker Feature Store, Vertex AI Feature Store): Deep integration with respective cloud ML platforms. Best when you’re already committed to that cloud’s ML stack.

When to build versus buy: Build your own only if you have unusual requirements and dedicated platform team capacity. For most organizations, adopting an existing solution saves years of engineering effort. The build-vs-buy decision should favor “buy” more strongly here than in most infrastructure decisions because the edge cases in feature stores are numerous and subtle.

Common Mistake: Building a Custom Feature Store

What people do: Underestimate feature store complexity. Start with “we just need a key-value store with timestamps.” Build a v1 in a month. Spend the next two years discovering edge cases: point-in-time joins are slow, online-offline consistency drifts, feature backfills corrupt data, streaming features lag, schema changes break consumers.

Why it fails: Feature stores look simple but encode years of hard-won lessons from companies like Uber, Airbnb, and Netflix. Point-in-time correctness alone has dozens of subtle failure modes. Your team will rediscover each one painfully.

Fix: Unless you have 5+ dedicated platform engineers and requirements that genuinely don’t fit existing solutions, adopt Feast, Tecton, or a cloud provider’s feature store. The “we need full control” argument rarely survives the first year of maintaining a custom implementation.

Preventing Data Leakage

The Anatomy of Leakage

Data leakage is using information during training that would not be available at prediction time. It makes models appear much better in offline evaluation than they actually perform in production.

Why leakage is common in ML:

Machine learning workflows are exploratory. Data scientists iterate rapidly, trying features, checking metrics, adjusting. In this rapid iteration, it’s easy to accidentally include information that wouldn’t be available in production.

Traditional software doesn’t have this problem in the same way. If you write code that references a variable that doesn’t exist, you get an error. But if you train a model on a feature that includes future information, you get better metrics. The feedback is positive, not negative. Only deployment reveals the problem.

Types of leakage:

Target leakage: Features that encode the label or are caused by the label. If you’re predicting whether someone will default on a loan, using “collections_initiated = True” as a feature leaks the label directly. This seems obvious, but indirect leakage is subtle. “Customer service calls in last 30 days” might spike after default, not before.

Temporal leakage: Using features computed from data after the prediction timestamp. The point-in-time join issue discussed earlier.

Train-test leakage: Information from test examples influencing training. Normalization using statistics computed on the full dataset (including test) is a form of this.

Proxy leakage: A feature that wouldn’t be available in production but correlates strongly with one that would. If you use “time of data export” as a feature during development, it might correlate with the label due to when different labels were recorded.
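Train-test leakage through normalization is easy to show with a few numbers. In this sketch the data and helper names are made up; the point is only that scaler statistics fitted on train plus test differ from statistics fitted on train alone.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return the (mean, standard deviation) used for standardization."""
    return mean(values), pstdev(values)

def transform(values, scaler):
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

train = [10.0, 12.0, 11.0, 13.0]
test = [50.0, 55.0]

# Leaky: statistics include the test rows the model must never see.
leaky_scaler = fit_scaler(train + test)

# Clean: statistics come from training rows only, then get reused on test.
clean_scaler = fit_scaler(train)
test_scaled = transform(test, clean_scaler)

print(leaky_scaler[0], clean_scaler[0])
```

The clean version mirrors production, where the scaler is frozen at training time and applied unchanged to data it has never seen.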

Designing for Leak-Free Pipelines

The key principle: temporal partitioning is your first line of defense.

Every dataset should have an explicit timestamp column representing when that information became known. Features should be joined based on these timestamps. Labels should be associated with the timestamp when the outcome was observed.

Pattern: Explicit feature timestamps

Feature table:

- entity_id
- feature_value
- feature_timestamp (when this value became known)
- feature_name

Training labels:

- entity_id
- label_value
- label_timestamp (when this outcome was observed)
- prediction_timestamp (when the prediction would be made)

Point-in-time join:
For each (entity_id, prediction_timestamp) in labels:
  Find features where feature_timestamp <= prediction_timestamp
  Use most recent such feature value

Pattern: Prediction time as a first-class concept

When generating training data, explicitly model the prediction time, the moment at which the model would make its prediction in production. This is often different from both the feature timestamp and the label timestamp.

Example: Predicting next-week purchase probability

- Features are computed from data up to Sunday midnight (feature_timestamp)
- Prediction is made Monday morning (prediction_timestamp)
- Label is whether a purchase happened by the following Sunday (label_timestamp)

The prediction_timestamp is Monday morning. Any feature with feature_timestamp > prediction_timestamp is leakage.

Pattern: Feature computation is strictly causal

When defining feature transformations, think causally. The feature computation should only use data that would have been generated before the prediction time.

Window aggregations must be defined as: “events in window [prediction_time - N days, prediction_time)”. The window ends at prediction time, not after.
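The half-open window can be written down directly. The function name and sample data below are illustrative; the important detail is that the lower bound is inclusive and the upper bound, the prediction time itself, is exclusive.

```python
from datetime import datetime, timedelta

def purchases_in_window(purchase_times, prediction_time, days):
    """Count purchases in [prediction_time - days, prediction_time)."""
    start = prediction_time - timedelta(days=days)
    return sum(1 for ts in purchase_times if start <= ts < prediction_time)

prediction_time = datetime(2024, 6, 30, 12, 0)
purchase_times = [
    datetime(2024, 5, 31, 12, 0),   # exactly at the window start: included
    datetime(2024, 6, 15, 9, 0),    # inside the window: included
    datetime(2024, 6, 30, 12, 0),   # exactly at prediction time: excluded
    datetime(2024, 7, 2, 8, 0),     # after prediction time: excluded
]

count_30d = purchases_in_window(purchase_times, prediction_time, days=30)
print(count_30d)
```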

Detecting Leakage

Even with careful design, leakage can slip through. Detection approaches:

Suspiciously good metrics: If your model achieves much better metrics than you expected, be suspicious. Check that your test set is truly temporally separated from training.

Feature importance analysis: If a single feature dominates importance, investigate whether it might encode the label directly or indirectly.

Adversarial validation: Train a model to distinguish training from serving data. If this is easy (high AUC), your training data differs systematically from serving, suggesting a pipeline inconsistency.

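Time-based validation: Train on older data, test on newer data. If performance under this temporal split is much worse than under random cross-validation, the random split was probably letting future information leak into training, and temporal leakage may be present.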

Manual spot-checking: For a sample of predictions, manually trace the feature values back to source data. Verify timestamps make sense. This is labor-intensive but catches issues that automated checks miss.
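A strict temporal split is short enough to sketch in full. The row layout (dicts with an `event_timestamp` key) and the cutoff value are assumptions for illustration.

```python
def temporal_split(rows, cutoff):
    """Rows strictly before cutoff go to train; the rest go to test."""
    train = [r for r in rows if r["event_timestamp"] < cutoff]
    test = [r for r in rows if r["event_timestamp"] >= cutoff]
    return train, test

# Hypothetical labeled examples with integer timestamps.
rows = [
    {"event_timestamp": 10, "label": 0},
    {"event_timestamp": 40, "label": 1},
    {"event_timestamp": 25, "label": 0},
    {"event_timestamp": 55, "label": 1},
]

train_rows, test_rows = temporal_split(rows, cutoff=30)

# Sanity check: nothing in train is as new as anything in test.
latest_train = max(r["event_timestamp"] for r in train_rows)
earliest_test = min(r["event_timestamp"] for r in test_rows)
print(latest_train, earliest_test)
```

Comparing metrics from this split against a random split on the same rows gives a quick signal: a large gap suggests the random split was benefiting from information that would not exist at prediction time.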

Common Mistake: Ignoring “Suspiciously Good” Metrics

What people do: A new model achieves 95% AUC where the previous model got 80%. Celebrate. Ship to production. Watch it perform at 70% on real traffic.

Why it fails: Dramatic metric improvements usually indicate data leakage, not breakthroughs. The model learned to exploit information that won’t be available at prediction time—future data leaking into features, label information encoded in features, or test data overlapping with training.

Fix: Treat unusually good metrics as a red flag, not a success. Ask: “What information could explain this performance?” Check feature importance—if one feature dominates, investigate it. Use strict temporal validation (train on past, test on future). If metrics are dramatically better than domain experts expected, the model probably cheated.

Data Versioning and Lineage

Why Version Data?

Code versioning with Git is universal in software engineering. Data versioning is less mature but equally important for ML.

Reproducibility: Six months from now, can you recreate the exact dataset used to train a model? If the data has changed since then and you have no versioned snapshot, the answer is no.

Debugging: Model performance degraded. Was it a code change or a data change? Without data versions, you can’t easily diff.

Rollback: Bad data got into training. Can you revert to a known-good state?

Compliance: Regulations and audit frameworks (GDPR, SOC 2, industry-specific rules) may require explaining what data influenced a decision. Without lineage, this is impossible.

Experimentation: You want to try a new feature. But if you also updated the underlying data, you’ve conflated two changes. Data versioning enables controlled experiments.

Data Versioning Approaches

Snapshot-based versioning: Periodically capture full copies of datasets. Simple but storage-expensive. Like taking full backups without incrementals.

Delta-based versioning: Store changes from a base version. More storage-efficient but requires computing current state from base + deltas.

Storage-layer versioning: Modern data lake formats (Delta Lake, Apache Iceberg, Apache Hudi) build versioning into the storage format. They enable time-travel queries against a past snapshot (for example, “SELECT * FROM table TIMESTAMP AS OF '2025-01-01'” in Delta Lake; the exact syntax varies by engine) without external versioning systems.

Reference-based versioning: Track pointers to data locations rather than copying data. Used when data is immutable (like log files). DVC (Data Version Control) works this way, storing data in object storage and tracking references in Git.
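The reference-based idea can be sketched with a content hash acting as the pointer. This is loosely in the spirit of pointer-file tools, but it is not DVC's actual file format or API; the dict standing in for object storage and the function names are assumptions.

```python
import hashlib
import json

object_storage = {}  # stand-in for S3/GCS: content hash -> bytes

def add_version(name, data):
    """Store data by content hash and return a small pointer document."""
    digest = hashlib.sha256(data).hexdigest()
    object_storage.setdefault(digest, data)  # identical content stored once
    # The pointer is tiny text, so it can be committed alongside code.
    return json.dumps(
        {"dataset": name, "sha256": digest, "size": len(data)},
        sort_keys=True,
    )

def checkout(pointer_json):
    """Resolve a pointer back to the exact bytes it refers to."""
    pointer = json.loads(pointer_json)
    data = object_storage[pointer["sha256"]]
    if hashlib.sha256(data).hexdigest() != pointer["sha256"]:
        raise ValueError("stored object does not match its pointer")
    return data

v1 = add_version("training_set", b"user,label\n1,0\n")
v2 = add_version("training_set", b"user,label\n1,0\n2,1\n")
print(v1 != v2, checkout(v1))
```

Because the pointer is derived from the content, retraining from an old pointer either returns the identical bytes or fails loudly.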

Lineage: Understanding Data Flow

Lineage answers: where did this data come from, and what else depends on it?

Upstream lineage: For a given dataset or feature, what raw data sources contribute to it? What transformations were applied?

Downstream lineage: For a given raw data source, what features, models, and predictions depend on it?

Lineage enables:

Impact analysis: “If this table’s schema changes, what breaks?”
Root cause analysis: “This model is producing bad predictions; what upstream data might have caused this?”
Compliance: “Explain what data influenced this decision for user X.”
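Upstream and downstream lineage are both graph traversals over the same edges. The sketch below uses a hypothetical `LineageGraph` class and made-up dataset names; real lineage systems store far richer metadata per edge.

```python
from collections import defaultdict, deque

class LineageGraph:
    """Directed graph where an edge source -> target means target reads source."""

    def __init__(self):
        self._children = defaultdict(set)
        self._parents = defaultdict(set)

    def add_edge(self, source, target):
        self._children[source].add(target)
        self._parents[target].add(source)

    @staticmethod
    def _reachable(start, edges):
        # Breadth-first walk collecting every node reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def downstream(self, node):
        """Impact analysis: everything that depends on node."""
        return self._reachable(node, self._children)

    def upstream(self, node):
        """Root cause analysis: everything node depends on."""
        return self._reachable(node, self._parents)

graph = LineageGraph()
graph.add_edge("raw.transactions", "features.avg_amount_7d")
graph.add_edge("raw.users", "features.account_age")
graph.add_edge("features.avg_amount_7d", "model.fraud_v3")
graph.add_edge("features.account_age", "model.fraud_v3")

impacted = graph.downstream("raw.transactions")
sources = graph.upstream("model.fraud_v3")
print(sorted(impacted), sorted(sources))
```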

Implementing Lineage

Lineage can be captured at different granularities:

Job-level lineage: Track which jobs read from and write to which tables. Coarse but easy to implement. Many orchestration tools (Airflow, Dagster, Prefect) can provide this natively or through integrations such as OpenLineage.

Column-level lineage: Track transformations at the column level. “Output column X depends on input columns A and B.” More useful for impact analysis but harder to capture.

Row-level lineage: Track which output rows came from which input rows. Most granular but expensive. Sometimes required for compliance in high-stakes domains.

Automatic vs. manual capture: Automatic lineage parsing (analyzing SQL queries, inspecting dataframes) captures lineage without developer effort but may miss complex logic. Manual annotation is more accurate but requires discipline.

Data Quality for ML

Why ML Systems Need Special Data Quality Attention

Traditional data quality focuses on: Are values correct? Are they complete? Do they conform to schema?

ML data quality adds: Is the distribution stable? Are training and serving consistent? Is there leakage?

A table can be perfectly clean by traditional standards (no nulls, all values valid) yet be useless for ML if the distribution has drifted or if it includes future information.

The Five Pillars of ML Data Observability

Freshness: Is data arriving on time? For features with freshness SLAs (user’s recent activity must be updated hourly), monitor update timestamps. Alert when data is stale.

Volume: Is the amount of data expected? A drop in row count might indicate an upstream failure. A spike might indicate duplicate data. Track counts over time, alert on anomalies.

Schema: Has the structure changed unexpectedly? Added columns might be benign. Removed columns or type changes can break downstream processing. Validate schema on write.

Distribution: Have statistical properties changed? Monitor mean, standard deviation, cardinality, null rate for each column. Use PSI or KS tests to detect shifts.

Lineage: Can we trace data provenance? When something goes wrong, lineage enables root cause analysis.

Building a Data Quality Monitoring System

A production data quality system has several components:

Expectation definition: For each dataset, define what “healthy” looks like. This can be:

Explicit thresholds (null rate < 1%, row count > 10000)
Learned baselines (within 2 standard deviations of historical mean)
Schema expectations (these columns exist with these types)

Check execution: Run quality checks at appropriate points:

Before data enters pipelines (quality gates)
After transformations (validation)
Periodically on production data (monitoring)

Alerting and response: When checks fail, who gets notified? What’s the severity? Some failures should block the pipeline (critical). Others should log a warning but proceed (informational).
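The pieces above (expectations, checks, severity) fit together roughly as follows. The `Expectation` class, the thresholds, and the sample batch are illustrative assumptions rather than the API of any specific data quality tool.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable

@dataclass
class Expectation:
    name: str
    check: Callable[[list], bool]  # returns True when the data looks healthy
    severity: str                  # "critical" blocks, "warning" only logs

def null_rate(rows, column):
    if not rows:
        return 1.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def within_baseline(value, history, n_sigma=2.0):
    # Learned baseline: within n_sigma standard deviations of the mean.
    return abs(value - mean(history)) <= n_sigma * stdev(history)

def run_checks(rows, expectations):
    failures = [(e.name, e.severity) for e in expectations if not e.check(rows)]
    should_block = any(severity == "critical" for _, severity in failures)
    return failures, should_block

historical_row_counts = [4, 5, 4, 5, 4]  # hypothetical past batch sizes
expectations = [
    Expectation("amount_null_rate",
                lambda rows: null_rate(rows, "amount") < 0.01, "critical"),
    Expectation("row_count_baseline",
                lambda rows: within_baseline(len(rows), historical_row_counts),
                "warning"),
    Expectation("schema",
                lambda rows: all({"user_id", "amount"}.issubset(r) for r in rows),
                "critical"),
]

batch = [
    {"user_id": "u1", "amount": 12.5},
    {"user_id": "u2", "amount": None},
    {"user_id": "u3", "amount": 40.0},
    {"user_id": "u4", "amount": 7.0},
]
failures, should_block = run_checks(batch, expectations)
print(failures, should_block)
```

Here the null-rate expectation fails with critical severity, so the batch would be held back, while the volume and schema checks pass.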

Feedback loops: Quality issues discovered in production should feed back into stronger checks. If you discover leakage, add a check to detect it in the future.

Real-Time Data Quality Challenges

Batch data quality checking is straightforward: process the data, run checks, alert if failed. Real-time data quality is harder because latency matters.

You can’t run expensive statistical tests on every event. Instead:

Run cheap checks synchronously (type validation, null checks, range checks)
Run expensive checks asynchronously on sampled data
Maintain streaming approximations (approximate percentiles, HyperLogLog for cardinality, count-min sketches for frequency counts)
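The split between cheap synchronous checks and sampled asynchronous checks might look like the sketch below. The event fields (`event_id`, `amount`), the range limit, and the in-memory queue are assumptions for illustration; a real system would hand sampled events to a separate worker or topic. Sampling uses a hash of the event ID so the decision is deterministic and reproducible.

```python
import zlib
from collections import deque

def cheap_checks(event):
    """Synchronous checks only: null, type, and range on one field."""
    amount = event.get("amount")
    if amount is None:
        return ["amount is null"]
    if not isinstance(amount, (int, float)):
        return ["amount has wrong type"]
    if not 0 <= amount <= 1_000_000:
        return ["amount out of range"]
    return []

def is_sampled(event_id, one_in=100):
    """Deterministically select roughly 1 in `one_in` events by hashing the ID."""
    return zlib.crc32(event_id.encode("utf-8")) % one_in == 0

# Stand-in for an asynchronous channel feeding the expensive statistical checks.
async_queue = deque()

def on_event(event):
    errors = cheap_checks(event)
    if not errors and is_sampled(event["event_id"]):
        async_queue.append(event)
    return errors
```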

For ML serving specifically:

Validate feature values at retrieval time (are they fresh enough? do types match?)
Monitor prediction distributions for anomalies (many zero-probability predictions?)
Track feature coverage (are features missing for some entities?)
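A retrieval-time validator covering freshness, type, and coverage can be quite small. The sketch below assumes the online store returns each feature as a `(value, updated_at)` pair and that a per-feature spec gives the expected type and maximum age; both layouts and the feature names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def validate_features(features, spec, now):
    """Return {feature_name: problem} for features that fail validation.

    features: name -> (value, updated_at)
    spec:     name -> (expected_type, max_age)
    """
    problems = {}
    for name, (expected_type, max_age) in spec.items():
        if name not in features:
            problems[name] = "missing"      # coverage gap for this entity
            continue
        value, updated_at = features[name]
        if not isinstance(value, expected_type):
            problems[name] = "wrong type"
        elif now - updated_at > max_age:
            problems[name] = "stale"        # older than the freshness SLA
    return problems

now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
spec = {
    "trips_7d": (int, timedelta(hours=1)),
    "avg_fare": (float, timedelta(days=1)),
    "rating": (float, timedelta(days=7)),
}
features = {
    "trips_7d": (4, now - timedelta(hours=3)),
    "avg_fare": ("12.5", now),
}
print(validate_features(features, spec, now))
```

Counting the `missing` results per feature over time gives the coverage metric; what to do on failure (fall back to a default, skip the prediction) is a product decision this sketch leaves open.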

Case Study: How Uber Solved Feature Management at Scale

The Problem

By 2016, Uber had hundreds of ML models in production: pricing, ETAs, fraud detection, driver matching. Each model had its own feature pipelines. Data scientists were spending more time on data plumbing than modeling. Feature computation was duplicated across teams. Training-serving skew was endemic.

The Solution: Michelangelo

Uber built Michelangelo, an end-to-end ML platform. The feature store component, described in their 2017 paper, established many patterns that are now industry standard.

Key architectural decisions:

Feature groups: Features are organized into groups with shared entity types and computation schedules. A “user_trip_features” group contains all features computed per user based on trip data.

Dual store with consistent computation: Batch jobs compute features for offline storage (Hive). The same logic, running as streaming jobs, updates online storage (Cassandra). A schema registry ensures consistency.

Time-partitioned storage: Offline features are stored with temporal partitions, enabling efficient point-in-time joins. “Get user features as of this timestamp” is a supported query pattern.

Automatic feature logging: At serving time, features used for predictions are logged with timestamps. This enables offline analysis of serving data and debugging of production issues.
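The "get user features as of this timestamp" query can be illustrated with a small as-of lookup over a sorted history. This is a sketch of the general pattern, not Michelangelo's implementation; the function name and the in-memory list are assumptions, and a feature stamped exactly at the query time is treated as visible.

```python
from bisect import bisect_right
from datetime import datetime

def as_of(history, ts):
    """Return the latest value with timestamp <= ts, or None if none exists.

    history: list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, ts)
    return history[i - 1][1] if i else None

history = [
    (datetime(2024, 3, 1), 12),
    (datetime(2024, 3, 8), 15),
]
print(as_of(history, datetime(2024, 3, 5)))  # value written on March 1
```

Because the lookup never reads past `ts`, a training row built this way cannot see feature values written after the event it describes.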

Results and Lessons

After Michelangelo:

Time to deploy models decreased from weeks to days
Feature reuse increased dramatically (most new models used >50% existing features)
Training-serving skew incidents dropped by 90%

Lessons applicable to other organizations:

Standardization enables velocity. Constraining how features are defined and computed might seem limiting, but it enables automation and reuse.
The offline-online consistency problem is central. Most of Michelangelo’s complexity exists to solve this one problem.
Feature management is an organizational problem, not just technical. Success required cultural change: data scientists agreeing to use shared features rather than building their own.

Case Study: The 2021 Zillow Offers Shutdown

What Happened

In November 2021, Zillow announced it was shutting down Zillow Offers, its home-buying business, and laying off roughly 25% of its workforce (about 2,000 employees). The company took a $304 million write-down on inventory homes purchased at prices that exceeded their resale value, with total losses from the wind-down exceeding $500 million.

A significant factor: their home price prediction models performed much worse in production than in backtesting.

The Data Architecture Angle

The dominant public explanation for the shutdown was business and market-timing: Zillow scaled home purchases aggressively into a volatile, fast-moving 2021 market and couldn’t operationally buy, renovate, and resell fast enough as prices turned. Data-architecture issues—feedback loops and distribution shift—contributed to and amplified the mispricing, but they were not the primary cause. With that framing, the data angle is still instructive:

Distribution shift at scale: The models were trained on historical home sales. When Zillow entered markets as a major buyer, their presence changed the market. Homes Zillow bid on were no longer representative of the training distribution.

Feature freshness issues: Housing market conditions were changing rapidly in 2020-2021 (pandemic, low interest rates, migration patterns). Features based on historical trends became misleading.

Feedback loops: Zillow’s own pricing influenced comparable sales, which influenced features, which influenced pricing. This feedback loop wasn’t adequately modeled.

Lessons for ML Data Architecture

Monitor for distributional shifts continuously, especially when your system can influence its own inputs.
Features based on historical aggregations can become misleading rapidly when the underlying process changes.
Backtesting is necessary but not sufficient. Backtests assume the world doesn’t change in response to your model, which is often false at scale.
Business metrics are the ultimate ground truth. If offline metrics say you’re doing great but the business is losing money, trust the business metrics.

Decision Framework: When to Invest in What

Do You Need a Feature Store?

You probably need a feature store if:

You have multiple models sharing features
Training-serving skew has caused production issues
Teams are reinventing the same features
Feature computation takes significant data science time

You might not need a feature store if:

You have a single model or very few models
Features are simple and unlikely to drift
You have a small team and low ML maturity
The cost and complexity aren’t justified by the problems you’re experiencing

Start small: A feature store can be as simple as a shared table with feature values and timestamps, plus conventions for how to join it. You don’t need enterprise platforms on day one.

Batch vs. Streaming Features

Use batch features when:

Features are aggregations over long windows (30-day averages)
Latency of hours is acceptable
Computation is expensive and benefits from batch optimization
Source data is batch-oriented (daily data warehouse updates)

Use streaming features when:

Features must reflect events within seconds or minutes
Real-time context significantly improves predictions
Source data is event-driven

Hybrid pattern: Most production systems use both. Long-window aggregations are batch-computed daily. Short-window features (clicks in last hour) are stream-updated. The feature store abstracts this from consumers.

How Much Versioning and Lineage?

Minimal investment (suitable for early-stage teams):

Version datasets by date in naming convention (user_features_2024_01_15)
Manual documentation of data sources
Git versioning for code including data generation scripts

Moderate investment (suitable for growing teams):

Storage-layer versioning (Delta Lake time travel)
Automated job-level lineage from orchestration tool
Data quality dashboards with alerts

High investment (suitable for mature organizations):

Full data versioning with DVC or similar
Column-level automated lineage
Data contracts between teams
Comprehensive data quality as code

The right level depends on your scale, regulatory environment, and the cost of data issues in your domain.
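Even at the minimal level, recording a content fingerprint alongside each training run makes it possible to tell later whether the data changed. The sketch below hashes rows with the standard library; the manifest fields and function names are assumptions, and the hash is sensitive to row order, so rows should be sorted consistently before fingerprinting.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """SHA-256 over a canonical JSON encoding of each row, in the given order."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the encoding independent of dict key order.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def training_manifest(dataset_name, rows, code_version):
    """Record what a training run consumed so it can be checked later."""
    return {
        "dataset": dataset_name,
        "row_count": len(rows),
        "sha256": dataset_fingerprint(rows),
        "code_version": code_version,
    }

rows = [{"user_id": "u1", "purchases_30d": 5}, {"user_id": "u2", "purchases_30d": 3}]
manifest = training_manifest("user_features_2024_01_15", rows, code_version="abc123")
```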

Building Intuition: How to Spot Data Architecture Problems

Red Flags to Watch For

“The model works in the notebook but not in production”: Almost always a training-serving consistency issue. Check feature computation, preprocessing, and data freshness.

“We can’t reproduce the model from three months ago”: Data versioning gap. Either the training data has changed or you haven’t recorded which version was used.

“No one knows what this feature means”: Feature governance gap. Features have accumulated without ownership or documentation.

“It takes two weeks to add a new feature to the model”: Infrastructure gap. Feature pipelines are too manual or too coupled.

“We see the model was wrong but can’t figure out why”: Observability gap. You need prediction logging, feature logging, and lineage to debug.

Questions to Ask During Design Reviews

When reviewing ML system designs, probe the data architecture:

“Where does each feature come from? Is that computation the same in training and serving?”
“If I deploy this model today, what’s the oldest data it might see as a feature? Is that acceptable?”
“If I need to recreate this exact training run six months from now, can I? What would I need to store?”
“When this model makes a bad prediction, how do I trace back to understand what data produced it?”
“If the upstream data changes schema, how does this system handle it?”

If the answers are vague, there’s work to do on data architecture.

Common Pitfalls and How to Avoid Them

Pitfall 1: Feature Store Without Governance

The pattern: A team builds a feature store and gets initial adoption. Features proliferate. After a year, there are 2000 features with unclear ownership. Nobody knows which are still used. Deprecation is impossible because impact is unknown.

The fix: Treat features as products. Require documentation and ownership at creation time. Track consumers automatically. Set expiration dates on experimental features. Conduct regular feature audits.

Pitfall 2: Optimizing for Offline Metrics Only

The pattern: Data scientists develop features and models optimizing for offline evaluation metrics. No attention to whether these features can be computed consistently at serving time or whether they meet latency requirements.

The fix: Include serving requirements in the definition of done. A feature isn’t ready until it works in both training and serving. Build staging environments that mirror production latency constraints.

Pitfall 3: Schema Changes Without Migration

The pattern: An upstream team changes a table schema. Downstream ML pipelines fail silently or produce incorrect features.

The fix: Data contracts. Producers and consumers agree on schemas and compatibility rules. Changes require compatibility checks. Breaking changes require migration periods and consumer notification.

Pitfall 4: No Feedback from Production to Development

The pattern: Models are developed, trained, and deployed. Problems are discovered in production weeks later. By then, no one remembers the details of training data decisions.

The fix: Close the loop. Log predictions and features at serving time. Compare serving distributions to training distributions continuously. Make this data available to data scientists as part of their development workflow.

Pitfall 5: Treating Data Pipelines as One-Time Scripts

The pattern: A data scientist writes a script to generate training data. It works once and is never maintained. When the model needs retraining, the script is broken due to environment or data changes.

The fix: Treat data pipelines as production code. Version them. Test them. Run them regularly even when not needed, to catch breakages early. Use orchestration tools that alert on failure.

Team and Organizational Considerations

Data architecture isn’t just a technical challenge—it’s an organizational one. The best technical design fails without the right team structure and governance.

Building a Data Platform Team

Who owns data architecture?

In early-stage ML organizations, data architecture is often nobody’s job. Data scientists build ad-hoc pipelines. ML engineers handle serving. Data engineers manage warehouses. No one owns the end-to-end consistency that matters most.

As organizations mature, a dedicated data platform team emerges. This team owns:

Feature store infrastructure and operations
Data quality monitoring and alerting
Lineage and cataloging systems
Training-serving consistency tooling
Data pipeline frameworks and best practices

Team composition:

A typical data platform team includes:

Data engineers for pipeline development and infrastructure
Backend engineers for online serving systems
ML engineers who understand model training requirements
Product manager to prioritize across competing demands

The key skill is translation: understanding both data science needs and infrastructure constraints, then building platforms that serve both.

Relationship to data science teams:

The data platform team provides infrastructure; data science teams use it. This creates a customer-supplier relationship. Success requires:

Clear service level objectives (What latency? What freshness? What availability?)
Feedback loops (What’s working? What’s painful?)
Shared ownership of quality (Platform provides tools; users follow standards)

Data Governance for ML

Governance ensures that data is used appropriately, consistently, and in compliance with policies.

Feature ownership:

Every feature should have an owner—a person or team accountable for:

Correctness of computation logic
Documentation and discoverability
Deprecation when no longer needed
Responding to quality issues

Without ownership, features become orphans. Orphan features accumulate, their logic becomes unclear, and removing them becomes impossible.

Feature lifecycle management:

Features have lifecycles:

1. Proposed: Someone wants this feature; it’s being evaluated
2. Experimental: Feature exists but isn’t production-validated
3. Production: Feature is stable, documented, and approved for production use
4. Deprecated: Feature is being phased out; no new uses allowed
5. Removed: Feature is deleted from the system

Enforce these states. Block experimental features from production models. Require migration plans before deprecation.
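Enforcement can be as simple as a state machine consulted by the feature registry. In the sketch below the set of allowed transitions is an assumption (for example, that proposals can be dropped directly), as are the function names; the two rules from the text, blocking non-production features from production models and requiring a migration plan before deprecation, are encoded explicitly.

```python
from enum import Enum

class Stage(Enum):
    PROPOSED = 1
    EXPERIMENTAL = 2
    PRODUCTION = 3
    DEPRECATED = 4
    REMOVED = 5

# Assumed transition graph; adjust to your own governance rules.
ALLOWED = {
    Stage.PROPOSED: {Stage.EXPERIMENTAL, Stage.REMOVED},
    Stage.EXPERIMENTAL: {Stage.PRODUCTION, Stage.REMOVED},
    Stage.PRODUCTION: {Stage.DEPRECATED},
    Stage.DEPRECATED: {Stage.REMOVED},
    Stage.REMOVED: set(),
}

def transition(current, target, migration_plan=None):
    """Return the new stage, or raise ValueError if the move is not permitted."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.name} -> {target.name} is not allowed")
    if target is Stage.DEPRECATED and not migration_plan:
        raise ValueError("deprecation requires a migration plan")
    return target

def usable_in_production(stage, existing_consumer=False):
    """Production features are always usable; deprecated ones only by existing consumers."""
    if stage is Stage.PRODUCTION:
        return True
    return stage is Stage.DEPRECATED and existing_consumer
```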

Access control:

Not all features should be available to all users:

PII features may require special approval
Features derived from sensitive data need access controls
Experimental features shouldn’t be used in production

Implement access control at the feature store level. Audit who uses what.

Data contracts:

Data contracts formalize agreements between data producers and consumers:

Schema: What columns exist, what types, what constraints
Quality: What freshness, completeness, accuracy guarantees
Compatibility: How will breaking changes be handled

Contracts prevent surprise breakages. When a producer wants to change something, the contract specifies who to notify and what migration period to allow.
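The schema part of a contract lends itself to an automated compatibility check. The sketch below treats removed columns and type changes as breaking and added columns as compatible; representing a schema as a dictionary of column name to type name is an assumption for illustration.

```python
def breaking_changes(contract_schema, proposed_schema):
    """List changes that would break consumers relying on contract_schema."""
    problems = []
    for column, col_type in contract_schema.items():
        if column not in proposed_schema:
            problems.append(f"removed column: {column}")
        elif proposed_schema[column] != col_type:
            problems.append(
                f"type change: {column} {col_type} -> {proposed_schema[column]}"
            )
    # Columns present only in proposed_schema are additive and not flagged.
    return problems

contract = {"user_id": "string", "amount": "double", "created_at": "timestamp"}
proposed = {"user_id": "string", "amount": "string", "channel": "string"}
print(breaking_changes(contract, proposed))
```

Run in the producer's CI, a non-empty result would trigger the notification and migration period the contract specifies rather than silently shipping the change.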

Organizational Change Management

Adopting good data architecture requires behavior change. People who’ve built ad-hoc pipelines must adopt new tools and standards. This is hard.

Start with pain:

Don’t lead with “you should use the feature store because it’s better.” Lead with “remember when model X failed in production because of training-serving skew? The feature store would have prevented that.”

Connect new tools to problems people have actually experienced.

Make the right thing easy:

If using the feature store requires more work than building a one-off pipeline, people won’t use it. Invest in developer experience:

Good documentation and examples
Templates and generators for common patterns
Fast feedback when something goes wrong
Integration with existing workflows

Celebrate success:

When someone uses the feature store and it catches a problem, share that story. When feature reuse saves weeks of work, highlight it. Positive reinforcement accelerates adoption.

Don’t mandate without support:

Mandating feature store use before it’s ready creates resentment. Ensure the platform is good enough before requiring adoption. Early adopters should be enthusiastic volunteers, not reluctant conscripts.

Migration Strategies

Most organizations don’t start with good data architecture. They evolve toward it from ad-hoc beginnings. Here’s how to manage that migration.

Assessment: Where Are You Starting?

Before planning migration, understand your current state:

Maturity assessment questions:

Feature consistency: Is the same feature computed identically in training and serving? How do you know?
Reproducibility: Can you recreate the training data from 6 months ago?
Discoverability: How do teams find existing features? (If “ask around” or “search Slack,” that’s low maturity.)
Quality monitoring: How would you know if a feature started producing bad values?
Leakage prevention: How do you ensure point-in-time correctness?

Score each area 1-5. This reveals where to focus.

Incremental Migration Path

Don’t try to migrate everything at once. A staged approach reduces risk and builds confidence.

Stage 1: Shared definitions (1-3 months)

Start with the organizational layer, not the infrastructure:

Create a feature catalog (can be as simple as a shared document)
Define naming conventions
Establish ownership for existing critical features
Document computation logic for features used by multiple teams

This costs little and provides immediate value. It also reveals which features are actually shared vs. duplicated.

Stage 2: Centralized storage (2-4 months)

Move feature data to shared storage:

Choose or build an offline store (often your existing data warehouse)
Migrate high-value features to shared tables
Implement point-in-time join capability
Add basic quality monitoring

Training now uses centralized features. Serving may still use separate logic.

Stage 3: Consistent computation (3-6 months)

Unify computation between training and serving:

Refactor serving to use the same feature definitions
Implement online store for low-latency access
Add computation pipelines (batch or streaming)
Validate training-serving consistency

This is the hardest stage. It requires coordinating across teams and often rewriting serving logic.
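Validating consistency can start with a direct comparison: for a sample of entities, fetch the same features from the offline and online stores and measure how often they disagree. The sketch below assumes numeric features and dictionary-shaped store results; the function name and tolerance are illustrative.

```python
import math

def consistency_report(offline, online, rel_tol=1e-6):
    """Compare numeric feature values per entity across the two stores.

    offline, online: entity_id -> {feature_name: value}
    Returns (mismatch_rate, list of (entity_id, feature, offline_value, online_value)).
    """
    compared = 0
    mismatches = []
    for entity_id, offline_features in offline.items():
        online_features = online.get(entity_id, {})
        for name, offline_value in offline_features.items():
            compared += 1
            online_value = online_features.get(name)
            # A feature missing online counts as a mismatch.
            if online_value is None or not math.isclose(
                offline_value, online_value, rel_tol=rel_tol
            ):
                mismatches.append((entity_id, name, offline_value, online_value))
    rate = len(mismatches) / compared if compared else 0.0
    return rate, mismatches

offline = {"u1": {"avg_amount_7d": 42.0}, "u2": {"avg_amount_7d": 10.0}}
online = {"u1": {"avg_amount_7d": 42.0}, "u2": {"avg_amount_7d": 12.5}}
rate, mismatches = consistency_report(offline, online)
```

For a fair comparison, the offline values should be read as of the same timestamps at which the online values were served; otherwise normal feature updates show up as false mismatches.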

Stage 4: Full platform (ongoing)

Add advanced capabilities:

Streaming features for real-time use cases
Column-level lineage
Automated quality enforcement
Self-service feature creation

These capabilities differentiate mature platforms but aren’t necessary to start.

Migration Anti-Patterns

Big bang rewrite:

“We’ll rebuild everything on the new platform, then switch over.” This fails because:

The new platform doesn’t have feature parity
Users have no investment in something they haven’t used
Timeline stretches, stakeholders lose patience

Instead: Migrate incrementally, one feature group at a time.

Platform without users:

“We’ll build the perfect feature store, then get adoption.” This fails because:

You don’t know what users need until they use it
Platforms without users don’t get maintained
Sunk cost doesn’t create adoption

Instead: Start with one or two teams as design partners. Build for their needs first.

Deprecate before migrate:

“We’re shutting down the old system on date X.” This fails because:

Users scramble to migrate under pressure
Corners get cut; bugs appear
Emergency exceptions undermine the deadline

Instead: Make the new system compelling enough that users want to migrate. Deprecation follows voluntary adoption.

Cost Engineering for Data Architecture

Data architecture has real costs: storage, compute, and operational overhead. Good architecture manages these costs without sacrificing quality.

Understanding Cost Drivers

Storage costs:

Feature stores maintain historical data for point-in-time correctness. This accumulates:

Offline store: Grows with history depth × number of entities × number of features
Online store: Grows with number of entities × number of features (current values only)
Feature logs: Grows with request volume × features per request

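Estimate storage needs before building. A feature store for 100M users with 100 features and 1 year of daily snapshots holds roughly 3.65 trillion values, or about 29 TB at 8 bytes per value before compression and replication. Hourly snapshots multiply that by 24, to roughly 700 TB, which is how feature stores approach petabyte scale.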
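A back-of-the-envelope calculation shows how sensitive the offline store estimate is to snapshot frequency. The sketch assumes 8 bytes per value and ignores keys, timestamps, compression, and replication, all of which shift the real number.

```python
def offline_store_bytes(entities, features, snapshots, bytes_per_value=8):
    """Raw size: history depth x number of entities x number of features."""
    return entities * features * snapshots * bytes_per_value

TB = 10**12
daily = offline_store_bytes(100_000_000, 100, snapshots=365)
hourly = offline_store_bytes(100_000_000, 100, snapshots=365 * 24)

print(daily / TB)   # 29.2
print(hourly / TB)  # 700.8
```

Whether the total lands at terabyte or petabyte scale therefore depends mostly on how often values are snapshotted and on how much per-row overhead the storage format adds.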

Compute costs:

Feature computation can be expensive:

Batch jobs: Cost scales with data volume and computation complexity
Streaming jobs: Cost scales with event rate and processing complexity
On-demand computation: Cost scales with request rate

Real-time features are often 10-100x more expensive per feature value than batch features.

Operational costs:

Running a feature store requires:

On-call support for availability
Incident response for quality issues
Capacity planning and scaling
Upgrades and maintenance

These costs scale with platform complexity and reliability requirements.

Cost Optimization Strategies

Tiered storage:

Not all historical data needs to be equally accessible:

Hot tier: Recent data (weeks) for frequent training; fast storage
Warm tier: Older data (months) for occasional retraining; cheaper storage
Cold tier: Archive data (years) for compliance; cheapest storage

Move data between tiers based on access patterns.

Feature materialization strategy:

Not every feature needs to be pre-computed:

High-value, high-use features: Materialize and serve from feature store
Medium-value features: Compute on-demand during training; cache for serving
Low-value features: Compute on-demand everywhere; don’t store

Analyze feature usage to categorize appropriately.

Aggregation granularity:

Window aggregations (count of events in last N days) can be expensive to compute with fine granularity. Trade precision for efficiency:

Daily granularity for 30-day windows (30 buckets to sum)
Hourly granularity only when hour-level freshness matters
Approximate algorithms (HyperLogLog, t-digest) for expensive aggregations
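With daily granularity, a 30-day window reduces to summing at most 30 pre-aggregated buckets instead of rescanning raw events. The sketch below assumes a dictionary of per-day counts and a window that includes the `as_of` day; both are illustrative choices.

```python
from datetime import date, timedelta

def window_count(daily_counts, as_of, days=30):
    """Sum daily buckets for the `days` days ending at as_of (inclusive).

    daily_counts: date -> event count for that day; missing days count as 0.
    """
    return sum(
        daily_counts.get(as_of - timedelta(days=i), 0) for i in range(days)
    )

daily_counts = {
    date(2024, 1, 15): 3,
    date(2024, 1, 1): 2,
    date(2023, 12, 16): 7,  # 30 days before Jan 15, just outside a 30-day window
}
print(window_count(daily_counts, date(2024, 1, 15)))
```

The cost of this choice is precision: the window boundary snaps to whole days, so the result can differ from an exact rolling 720-hour count.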

Retention policies:

Define how long to keep data:

Online store: Keep only current values (minimal retention)
Offline store: Keep enough history for retraining needs
Feature logs: Keep enough for debugging, then archive or delete

Aggressive retention reduces storage costs significantly.

Cost Monitoring and Accountability

Attribution:

Track costs by:

Team: Which teams’ features cost the most?
Feature: Which features are expensive relative to their value?
Model: Which models drive the most feature computation?

Attribution enables accountability and optimization.

Chargeback or showback:

Some organizations charge teams for their data infrastructure usage. Others show costs without charging. Either approach creates awareness and incentivizes efficiency.

Cost-benefit analysis for features:

Before adding a new feature, estimate:

Computation cost (how much to compute daily/hourly)
Storage cost (how much data over time)
Expected value (how much does this improve predictions)

If a feature costs $10K/month to compute but only improves model value by $5K/month, it’s not worth it.

Key Takeaways

Training-serving skew is the central challenge - Features computed differently in training versus serving cause silent model degradation. Prevention requires single sources of truth and shared computation logic.
Feature stores solve coordination problems - Centralizing feature definitions, computation, and serving enables consistency, reuse, and governance. The dual offline/online architecture serves different needs.
Point-in-time correctness prevents data leakage - Training data must respect temporal boundaries. Features should reflect only what was known at prediction time. This requires explicit timestamps and careful join logic.
Data versioning enables reproducibility - Without versioned data, past experiments can’t be recreated. Modern storage formats make versioning practical. Treat it as non-negotiable infrastructure.
Treat data pipelines as production code - Version them, test them, monitor them. Data quality issues are ML bugs. Run pipelines regularly even when not needed to catch breakages early.

Summary

Data architecture is the foundation upon which reliable ML systems are built. The patterns and principles in this chapter emerged from hard-won lessons at organizations running ML at scale:

Training-serving skew is the central challenge. Features computed differently in training versus serving cause silent model degradation. Prevention requires architectural discipline: single sources of truth, shared computation logic, continuous monitoring.

Feature stores solve coordination problems. By centralizing feature definitions, computation, and serving, feature stores enable consistency, reuse, and governance. The dual offline/online architecture serves the different needs of training and serving.

Point-in-time correctness prevents data leakage. Training data generation must respect temporal boundaries. Features should reflect only what was known at the prediction time, not after. This requires explicit timestamps and careful join logic.

Data versioning enables reproducibility. Without versioned data, past experiments can’t be recreated, and debugging is archaeology. Modern storage formats and tools make versioning practical.

Observability is non-negotiable. Monitor freshness, volume, schema, and distribution continuously. Detect skew before it causes business impact. Build lineage to enable root cause analysis.

Start with your actual problems. Not every organization needs enterprise-grade infrastructure on day one. Match your investment to your scale, risk tolerance, and the actual problems you’re experiencing.

The technical patterns in this chapter are well-understood. The harder challenge is organizational: building the culture and processes that make good data architecture the default rather than an afterthought.

Connections to Other Chapters

Chapter 7 (RAG Systems) applies data architecture principles to retrieval systems
Chapter 9 (LLM Deployment & Infrastructure) covers the serving systems that consume features
Chapter 15 (MLOps & Evaluation) discusses evaluation pipelines that depend on data quality
Chapter 25 (System Design at Scale) provides broader context for distributed data systems
Chapter 28 (Research-to-Production) covers validating new data techniques before production adoption
Chapter 31 (Reliability Engineering) addresses operational concerns for data pipelines
Chapter 32 (Cost Engineering) extends the cost optimization strategies introduced here

Practical Exercises

Exercise 1: Point-in-Time Join Implementation

Given the following data:

purchase_events = [
    {"user_id": "u1", "item_id": "i1", "timestamp": "2024-01-15 10:00", "amount": 50.0},
    {"user_id": "u1", "item_id": "i2", "timestamp": "2024-01-20 14:30", "amount": 30.0},
    {"user_id": "u2", "item_id": "i1", "timestamp": "2024-01-18 09:00", "amount": 50.0},
]

user_features = [
    {"user_id": "u1", "timestamp": "2024-01-10", "purchases_30d": 5, "avg_amount": 45.0},
    {"user_id": "u1", "timestamp": "2024-01-17", "purchases_30d": 6, "avg_amount": 46.0},
    {"user_id": "u2", "timestamp": "2024-01-15", "purchases_30d": 3, "avg_amount": 60.0},
]

Implement a point-in-time join that retrieves features for each purchase event.
Add a TTL parameter to treat old features as missing.
Write validation code that checks for leakage (feature_timestamp > event_timestamp).
How would you optimize this for a billion purchase events?

Self-Assessment Questions:

Does your join handle the case where no feature exists before the event timestamp?
Does your TTL implementation correctly handle edge cases (exactly at TTL, no features at all)?
Would your optimization strategy work with real distributed systems constraints?

Quality Indicators:

Strong: Implementation handles all edge cases (no features, multiple features at same timestamp, TTL boundary)
Weak: Implementation only handles happy path
Strong: Optimization discusses specific technologies (Spark broadcast joins, partitioning strategies) with tradeoffs
Weak: Optimization is vague (“use a database”)

Exercise 2: Skew Detection Dashboard

Design and implement a monitoring system for training-serving skew:

Define the data structures for storing training statistics.
Implement PSI and KS test calculations.
Create alerting logic with appropriate thresholds.
Simulate a skew scenario and verify detection.

Self-Assessment Questions:

Does your PSI implementation handle zero-probability bins correctly (avoids log(0))?
Are your thresholds based on research/experience or arbitrary?
Did your simulation actually trigger alerts, validating the detection logic?

Quality Indicators:

Strong: Data structures support both numeric and categorical features with appropriate handling
Weak: Only handles one feature type
Strong: Alert thresholds are justified (references PSI literature or calibrated on test data)
Weak: Thresholds are arbitrary round numbers
Strong: Simulation includes realistic scenarios (gradual drift, sudden shift, seasonal patterns)
Weak: Simulation is trivial (just doubles all values)

Exercise 3: Feature Store Design

Design a feature store for a ride-sharing application:

Identify entities (driver, rider, trip, location).
Define 10 features across these entities with appropriate freshness requirements.
Specify which features are batch vs. streaming.
Design the online store schema for sub-10ms lookups.
Document how you’d ensure training-serving consistency.

Self-Assessment Questions:

Do your freshness requirements align with how the features will be used?
Does your online schema support all access patterns needed for serving?
Is your training-serving consistency approach practical, not just theoretically correct?

Quality Indicators:

Strong: Features include different entity types and relationships (driver-rider interaction features)
Weak: Features are all single-entity aggregations
Strong: Freshness requirements are justified by use case (“surge pricing needs demand features within 5 minutes”)
Weak: All features have same freshness requirement
Strong: Consistency approach includes specific mechanisms (shared code, automated testing, monitoring)
Weak: Consistency approach is “use the same logic”

Exercise 4: Data Leakage Audit

Take an existing ML model in your organization:

List all features used by the model.
For each feature, trace back to its computation logic.
Identify the timestamps associated with each feature value.
Check whether point-in-time correctness is enforced in training data generation.
Propose fixes for any leakage risks discovered.

Self-Assessment Questions:

Did you trace ALL features, including those computed in preprocessing?
Did you verify timestamps empirically (spot-checking actual data) or trust documentation?
Are your proposed fixes actionable with specific implementation steps?

Quality Indicators:

Strong: Audit includes implicit features (preprocessing transformations, derived columns)
Weak: Audit only covers explicit feature columns
Strong: Timestamp verification includes actual data samples showing correct/incorrect behavior
Weak: Timestamp verification only checks code comments
Strong: Fixes include migration plan and validation approach
Weak: Fixes are just “add timestamp column”

Exercise 5: Migration Planning

You’ve inherited a system where each of 5 ML models has its own ad-hoc feature pipelines. Design a migration to centralized data architecture.

Assess the current state (list all features, identify duplicates, find quality issues).
Design the target architecture.
Create a phased migration plan with specific milestones.
Identify risks and mitigation strategies.
Define success metrics for the migration.

Self-Assessment Questions:

Does your assessment distinguish duplicates from legitimately different features?
Is your phased plan realistic given team capacity and organizational constraints?
Do your success metrics measure actual improvement, not just migration completion?

Quality Indicators:

Strong: Migration plan includes parallel running period to validate consistency
Weak: Migration is cut-over without validation
Strong: Risks include organizational factors (adoption, training, change management)
Weak: Risks are only technical
Strong: Success metrics include business outcomes (reduced incidents, faster development)
Weak: Success metrics are only technical (feature count migrated)

Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] Your model’s offline metrics look great but production performance is worse. You suspect training-serving skew. How do you diagnose which features are causing it?

Answer

Systematic diagnosis: (1) Log feature values at serving time alongside predictions. (2) For each feature, compute distribution statistics (mean, std, percentiles) from serving logs. (3) Compare to training data distributions using PSI or KS tests. (4) Rank features by skew magnitude. (5) For top skewed features, trace the computation path—is the same code used in training and serving? Are data sources identical? (6) Common culprits: timestamp handling, missing value imputation, aggregation windows, stale features in online store.
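Steps (2) through (4) can be sketched with a two-sample Kolmogorov-Smirnov statistic computed in plain Python and used to rank features. The function names and the dictionary layout of the samples are assumptions; in practice a statistics library would also supply p-values.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Maximum gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # The largest gap between two step functions occurs at a sample point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def rank_by_skew(training, serving):
    """Return (feature, ks_statistic) pairs, most skewed first.

    training, serving: feature_name -> list of numeric values.
    """
    scores = {
        name: ks_statistic(training[name], serving[name])
        for name in training
        if name in serving
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

training = {"amount": [1, 2, 3, 4, 5], "age": [20, 30, 40, 50]}
serving = {"amount": [11, 12, 13, 14, 15], "age": [20, 30, 40, 50]}
print(rank_by_skew(training, serving))
```

The top of the ranking tells you where to start tracing computation paths; it does not by itself say whether the cause is a code difference or a genuine change in the population.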

Q2. [IC2] You’re joining user features to click events for training. Why can’t you just use the latest feature values?

Answer

Using the latest values creates data leakage: you would be using information from the future to predict past events. If a user’s “purchases_30d” feature is 10 today but was 5 when the click happened, using 10 trains the model on data it won’t have at inference time, which artificially inflates offline metrics. The model may learn patterns like “users with high purchases_30d click more” that don’t hold in production, because at serving time the feature reflects only purchases made before the click, not the ones that followed it.
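The difference between the two lookups can be shown with a small sketch. The feature history, dates, and helper names (`latest_value`, `value_as_of`) are hypothetical and chosen to mirror the 5-versus-10 example above.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history for one user: (feature_timestamp, purchases_30d),
# sorted by timestamp.
history = [
    (datetime(2024, 3, 1), 2),
    (datetime(2024, 3, 10), 5),
    (datetime(2024, 4, 2), 10),
]

def latest_value(history):
    """Naive lookup: whatever the feature is today. Leaks future information."""
    return history[-1][1]

def value_as_of(history, event_time):
    """Most recent value with feature_timestamp <= event_time, or None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx > 0 else None

click_time = datetime(2024, 3, 15)
leaky = latest_value(history)                # includes purchases made after the click
correct = value_as_of(history, click_time)   # what the model would have seen at click time
```

A point-in-time join applies the `value_as_of` rule to every training row and every feature, which is why it needs a timestamp on each feature value.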

Q3. [Senior] Your feature store serves p99 latency of 15ms but you need sub-5ms for real-time bidding. What’s your approach?

Answer

Options by effort: (1) Caching: Pre-warm an in-memory cache at the edge with features for users likely to appear in bid requests. Trade freshness for latency. (2) Batch precomputation: Precompute user-ad feature combinations for high-traffic pairs. (3) Simpler features: Replace complex features with simpler approximations that are faster to serve. (4) Architecture: Move to in-process feature serving using embedded RocksDB or similar. (5) Data locality: Co-locate feature serving with bidding logic. (6) Feature reduction: Use feature importance to identify which features actually matter; serve only those. Often the answer is a combination, with different strategies for different feature tiers.

Q4. [Senior] The data science team wants a new feature that requires a 90-day window of user activity. Your batch pipelines run daily. How do you compute this feature efficiently?

Answer

Incremental computation: (1) Don’t recompute the full 90-day window daily. (2) Use sliding window patterns: maintain the aggregate state, then on each day add new data and subtract day-91 data. (3) Implementation: store intermediate state (sum, count, etc.) with timestamps. On update: new_sum = old_sum + today_value - day_91_value. (4) Considerations: Need to handle late-arriving data (may need to adjust old aggregates). Store enough state to recompute if corruption occurs. (5) For complex aggregates (percentiles, distinct counts), use approximate algorithms (t-digest, HyperLogLog) that support mergeable state.
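A minimal sketch of the add-and-subtract update for a rolling sum follows. The class name `RollingWindowSum` is illustrative, and the sketch assumes one total per day arriving in order; it does not handle late-arriving data or the periodic full recomputation mentioned above.

```python
from collections import deque

class RollingWindowSum:
    """Rolling sum over the most recent `window_days` daily totals."""

    def __init__(self, window_days=90):
        self.window_days = window_days
        self.daily = deque()  # per-day totals, kept so they can be subtracted later
        self.total = 0.0

    def add_day(self, day_total):
        """Add today's total and evict the day that just left the window."""
        self.daily.append(day_total)
        self.total += day_total
        if len(self.daily) > self.window_days:
            self.total -= self.daily.popleft()  # the "day 91" value
        return self.total

# Usage with a 3-day window to keep the numbers small.
window = RollingWindowSum(window_days=3)
totals = [window.add_day(v) for v in (1, 2, 3, 4, 5)]
```

Each daily update touches two values instead of rescanning the whole window, which is the point of keeping the per-day state.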

Q5. [Staff] You’re designing data architecture for a new ML platform. The company has 50 data scientists who currently copy data to local notebooks. How do you convince leadership to invest in proper infrastructure?

Answer

Build the business case: (1) Quantify the cost of current state: How many hours/week do data scientists spend on data wrangling? What’s the cost of delayed model deployments? Document actual incidents from training-serving skew. (2) Estimate ROI: Feature reuse (features built once, used by 10 teams), faster experimentation (days to hours), reduced incidents (fewer rollbacks). (3) Propose incremental investment: Don’t ask for $2M feature store on day one. Start with shared feature definitions, then add materialization, then online serving. Each stage shows value before the next investment. (4) Find champions: Identify one or two high-visibility projects that will be the initial adopters and success stories.

Spot the Problem

Problem 1. [IC2] A data engineer’s feature computation:

def compute_user_features(user_id: str) -> dict:
    purchases = db.query(f"SELECT * FROM purchases WHERE user_id = '{user_id}'")
    return {
        "total_purchases": len(purchases),
        "avg_amount": sum(p.amount for p in purchases) / len(purchases),
        "days_since_last": (datetime.now() - max(p.timestamp for p in purchases)).days
    }

What’s wrong?

Answer

Multiple issues: (1) SQL injection vulnerability from string formatting (use parameterized queries). (2) No time boundary: For training, this computes features as of NOW, not as of the training example’s timestamp. Creates leakage. (3) Division by zero: If no purchases, len(purchases) is 0. (4) Empty list handling: max() on empty list raises exception. (5) No caching or batching: Called per-user, inefficient for bulk computation. (6) Non-deterministic: datetime.now() means feature values change on every call, making reproduction impossible.
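One possible corrected version is sketched below. It assumes a SQLite connection, a `purchases` table with `user_id`, `amount`, and an ISO-8601 text column named `ts`, and an explicit `as_of` argument; none of these come from the original snippet. It addresses issues (1) through (4) and (6); bulk computation (5) would need a batched query instead of a per-user call.

```python
import sqlite3
from datetime import datetime

def compute_user_features(conn, user_id: str, as_of: datetime) -> dict:
    """Features for user_id using only purchases strictly before as_of."""
    rows = conn.execute(
        # Placeholders keep user input out of the SQL text.
        "SELECT amount, ts FROM purchases WHERE user_id = ? AND ts < ?",
        (user_id, as_of.isoformat()),
    ).fetchall()
    if not rows:
        # Explicit defaults instead of ZeroDivisionError or max() on an empty list.
        return {"total_purchases": 0, "avg_amount": None, "days_since_last": None}
    last = max(datetime.fromisoformat(ts) for _, ts in rows)
    return {
        "total_purchases": len(rows),
        "avg_amount": sum(amount for amount, _ in rows) / len(rows),
        # Measured from as_of, not datetime.now(), so reruns give the same value.
        "days_since_last": (as_of - last).days,
    }

# Hypothetical data for a quick check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id TEXT, amount REAL, ts TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("u1", 20.0, "2024-01-05T00:00:00"),
        ("u1", 40.0, "2024-01-20T00:00:00"),
        ("u1", 90.0, "2024-02-10T00:00:00"),  # after as_of, must be ignored
    ],
)
features = compute_user_features(conn, "u1", datetime(2024, 2, 1))
```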

Hermann & Del Balso (2017), “Michelangelo” - Uber’s paper that introduced feature stores to the ML community.
Kleppmann (2017), “Designing Data-Intensive Applications” - Essential background on data systems.
Sculley et al. (2015), “Hidden Technical Debt in ML Systems” - Data-related technical debt awareness.

Deep Dives

Polyzotis et al. (2018), “Data Management Challenges in Production ML” - Academic treatment of ML data management.
Schelter et al. (2018), “Automating Large-Scale Data Quality Verification” - Paper behind Amazon’s Deequ framework.
Huyen (2021), “Feature Stores for ML” - Survey comparing feature store approaches.

Practical Resources

Feast documentation (feast.dev) - Open-source feature store implementation reference.

--- title: "Chapter 30: Data Architecture for AI" keywords: [data architecture, feature stores, data pipelines, data quality, versioning, point-in-time correctness, lakehouse] difficulty: advanced prerequisites: [ch07, ch25] estimated_time: "3-4 hours" --- ## Introduction In early 2021, a team at a major e-commerce company deployed what they believed was their best fraud detection model yet. Offline metrics were exceptional: 94% precision, 91% recall, a clear improvement over the previous system. The model sailed through A/B testing and rolled out to production traffic. Within two weeks, fraud losses had increased by 40%. The postmortem revealed a subtle but devastating problem. During training, the model used a feature called "average transaction amount over the past 7 days." The training pipeline computed this by looking at all transactions within a 7-day window ending at the transaction time. But the serving pipeline, built by a different team using a different codebase, computed the same feature using a 7-day window ending at midnight the previous day. A seemingly minor difference, but it meant the model in production was seeing systematically stale data. Fraudulent transactions that happened in the current day weren't reflected in the risk score until the next day. This is training-serving skew, the silent killer of ML systems. The model worked perfectly in the training environment where features were computed correctly. It failed silently in production where features were computed differently. No monitoring caught it because the model was returning valid predictions. Only the downstream business metric revealed the problem. This chapter is about building data architectures that prevent such failures. Data architecture for AI isn't just about storing and moving data. 
It's about ensuring that the data your model sees during training matches what it sees during serving, that features are computed consistently, that historical data doesn't leak into predictions, and that you can diagnose and fix problems when they inevitably occur. ### The Fundamental Challenge: Two Worlds That Must Match Machine learning creates a unique architectural challenge that doesn't exist in traditional software systems. You build your model in one environment (training) but deploy it in another (serving). These environments have fundamentally different requirements: **Training environment**: - Batch processing over historical data - Time-travel capability: "What features were available at this timestamp?" - High throughput, latency tolerance (hours of processing is acceptable) - Reproducibility: running the same pipeline should produce the same results **Serving environment**: - Real-time processing for live requests - Current features: "What features are available right now?" - Low latency, millisecond response requirements - Reliability: the system must work even when components fail The data that bridges these two worlds must be computed identically in both. This sounds simple but is extraordinarily difficult in practice. Different teams, different codebases, different programming languages, different infrastructure, different time pressures. The surface area for subtle bugs is immense. ### What Goes Wrong Without Proper Data Architecture The fraud detection story is one manifestation of a broader pattern. Here's what happens when organizations treat data architecture as an afterthought: **Silent model degradation**: Models make predictions using different features than they were trained on. Performance degrades, but there's no clear signal pointing to data as the cause. Teams waste weeks debugging model architecture when the problem is data. **Irreproducible experiments**: Data scientists can't recreate past results because training data has changed. 
A model that worked three months ago can't be retrained because the features it used have been modified. **Feature chaos**: Teams reinvent the same features repeatedly. One team computes "user activity score" one way; another team computes it differently. Neither can reuse the other's work. Technical debt accumulates. **Debugging paralysis**: When something goes wrong, no one can trace back from a prediction to understand what data produced it. Root cause analysis becomes archaeological excavation. **Compliance nightmares**: Regulations require explaining what data influenced a decision. Without lineage, this is impossible to answer. ### A Mental Model for ML Data Architecture Think of ML data architecture as building a system where the past and present versions of your data must coexist and remain synchronized. Imagine a library where books are constantly being revised. The training environment is like a historian who needs to read exactly what the books said on specific dates in the past. The serving environment is like a reader who needs to read the current edition immediately. Your data architecture must support both use cases while ensuring that when the historian and reader look at the same page, they see consistent information. 
The key abstractions that enable this are: - **Feature stores**: A shared repository where features are defined once and served consistently to both training and serving - **Point-in-time correctness**: The ability to retrieve what features looked like at any historical timestamp - **Data versioning**: Treating datasets like code, with versions, diffs, and rollback capability - **Lineage tracking**: Recording how data flows and transforms, enabling debugging and compliance ### What You'll Learn - Why training-serving skew is so dangerous and the mechanisms that cause it - How feature stores provide a single source of truth for feature computation - Point-in-time joins and preventing data leakage in training - Data versioning strategies and when to use them - Building observability into data pipelines - Decision frameworks for when to invest in each capability ### Prerequisites - Understanding of ML model training and serving (Part II) - Distributed systems fundamentals (Chapter 25) - Production systems experience (Part III) --- ## The Theory of Training-Serving Skew ### Understanding Skew at a Fundamental Level Training-serving skew occurs when the distribution of features seen during training differs from the distribution seen during serving. This sounds abstract, so let's make it concrete. A model learns a function: given these input features, predict this output. During training, the model observes pairs of (features, labels) and learns the mapping between them. The learned function is only valid for the distribution of features it was trained on. Mathematically, if we train on features drawn from distribution $P_{train}(X)$ but serve on features drawn from distribution $P_{serve}(X)$, the model's performance degrades proportionally to the divergence between these distributions. This is the covariate shift problem from statistical learning theory. But training-serving skew is worse than natural covariate shift. 
Natural covariate shift happens when the world changes, like user behavior evolving over time. Training-serving skew happens when the same underlying reality produces different feature values depending on which system computes them. It's a self-inflicted wound. ### The Taxonomy of Skew Skew manifests in several distinct ways: **Feature computation skew**: The most common form. The same feature is computed using different logic in training versus serving. Like the fraud detection example: same conceptual feature, different window calculations. This happens when: - Training uses SQL queries on a data warehouse; serving uses Python code hitting a cache - Training uses batch processing frameworks (Spark); serving uses stream processing (Flink) - Different teams implement "the same" feature independently - Preprocessing libraries have version differences **Data freshness skew**: Features are stale at serving time compared to training. If the model trained on features computed from data up to T-0 (the prediction time), but serving uses features that lag behind by hours or days, the model sees systematically older data in production. **Data distribution skew**: The production data distribution has drifted from training. This is natural covariate shift, but it interacts with the infrastructure. For example, if serving only uses cached features and the cache has limited coverage, low-traffic entities get stale features while high-traffic entities get fresh ones. This creates a biased feature distribution. **Missing data skew**: Missing values are handled differently. Training might drop rows with missing features; serving might impute zeros. The model learns on clean data but predicts on imputed data. **Preprocessing skew**: Tokenization, normalization, encoding done differently. A text model trained with one tokenizer but served with another will see completely different input representations. 
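
Feature computation skew is easiest to see side by side. The sketch below reimplements the 7-day average from the fraud example in two ways that differ only in where the window ends. The transactions, dates, and function names are hypothetical, and the serving variant assumes "midnight" means the start of the current day.

```python
from datetime import datetime, timedelta

def avg_amount(transactions, window_end, days=7):
    """Mean amount of transactions in [window_end - days, window_end)."""
    start = window_end - timedelta(days=days)
    amounts = [amt for ts, amt in transactions if start <= ts < window_end]
    return sum(amounts) / len(amounts) if amounts else 0.0

def training_feature(transactions, event_time):
    # Training pipeline: window ends at the transaction time.
    return avg_amount(transactions, event_time)

def serving_feature(transactions, event_time):
    # Serving pipeline: window ends at the most recent midnight.
    midnight = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return avg_amount(transactions, midnight)

# Hypothetical account: small purchases all week, then a burst today.
txns = [(datetime(2024, 5, d, 12), 20.0) for d in range(1, 7)]
txns += [(datetime(2024, 5, 7, h), 500.0) for h in (9, 10, 11)]
now = datetime(2024, 5, 7, 15)

train_value = training_feature(txns, now)  # sees today's burst
serve_value = serving_feature(txns, now)   # does not
```

Both functions return a valid float for the same account at the same moment, so schema checks pass, yet the model receives very different inputs depending on which pipeline computed the feature.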
### The Insidiousness of Skew What makes skew particularly dangerous is that it's invisible to standard monitoring. The system appears healthy: - Model endpoint is returning predictions (no errors) - Latency is within SLA - Input features pass schema validation - Output predictions are valid probabilities All green dashboards. Meanwhile, the model is making decisions based on data that doesn't match what it learned from. It's like a pilot flying through fog using instruments that are subtly miscalibrated. Everything looks normal until you hit the mountain. ::: {.callout-warning} ## Common Mistake: Assuming Green Metrics Mean No Skew **What people do:** Monitor model endpoint latency, error rates, and prediction confidence. All look healthy. Conclude the model is working correctly. **Why it fails:** Standard API monitoring catches system failures, not data problems. A model can return valid predictions (within range, formatted correctly, low latency) while making decisions based on systematically wrong features. If your serving pipeline computes a feature differently than training, every prediction is affected but no error is raised. **Fix:** Monitor feature distributions, not just model outputs. Compare serving feature statistics against training baselines using PSI or KS tests. Alert on distribution drift even when predictions look valid. The first sign of training-serving skew should come from your data monitoring, not customer complaints. ::: ### Quantifying Skew: Mathematical Foundations To detect skew, we need to measure the divergence between training and serving feature distributions. Several metrics are commonly used: **Population Stability Index (PSI)**: Measures how much a distribution has shifted relative to a baseline. $$PSI = \sum_{i=1}^{n} (P_i - Q_i) \cdot \ln\frac{P_i}{Q_i}$$ Where $P_i$ and $Q_i$ are the proportions in bin $i$ for the serving and training distributions respectively. 
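
As a small worked example with made-up proportions, suppose a feature is split into two bins, with serving proportions $P = (0.6, 0.4)$ and training proportions $Q = (0.5, 0.5)$:

$$PSI = (0.6 - 0.5)\ln\frac{0.6}{0.5} + (0.4 - 0.5)\ln\frac{0.4}{0.5} \approx (0.1)(0.1823) + (-0.1)(-0.2231) \approx 0.0405$$

Every term is non-negative because $(P_i - Q_i)$ and $\ln\frac{P_i}{Q_i}$ always share the same sign, so shifts in different bins add up rather than cancel.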
Interpretation: - PSI < 0.1: No significant shift - 0.1 < PSI < 0.25: Moderate shift, monitor closely - PSI > 0.25: Significant shift, investigate **Jensen-Shannon Divergence (JSD)**: A symmetric measure of divergence between two distributions. $$JSD(P || Q) = \frac{1}{2}KL(P || M) + \frac{1}{2}KL(Q || M)$$ Where $M = \frac{1}{2}(P + Q)$ and KL is the Kullback-Leibler divergence. JSD is bounded between 0 and 1 (using log base 2), making it easier to set universal thresholds. **Kolmogorov-Smirnov statistic**: For continuous features, the maximum difference between cumulative distribution functions: $$D = \sup_x |F_{train}(x) - F_{serve}(x)|$$ The p-value from a KS test indicates whether the difference is statistically significant. **Practical detection**: Monitor these metrics for every feature in production. Set alerts when they exceed thresholds. But remember: statistical significance doesn't imply practical significance. A feature might have shifted but still produce correct predictions. Correlate skew detection with business metrics. ::: {.callout-warning} ## Per-feature PSI/JSD/KS miss multivariate and concept drift PSI, JSD, and KS as described here are **univariate**—computed one feature at a time. They are blind to two failure modes that routinely break models in production. First, **multivariate drift**: every feature's marginal distribution can be unchanged (PSI ≈ 0 on each) while the *joint* distribution has moved—correlations between features shift, or new feature combinations appear, neither visible to per-feature checks. Second, **concept drift**: the input distribution P(X) can be perfectly stable while the input→output relationship P(Y|X) moves, so the model is now wrong on inputs that look statistically identical to training data. Neither is detectable from feature distributions alone. 
Catching them requires methods that look at features jointly (multivariate distance, a domain classifier trained to tell training from serving data) and, for concept drift, monitoring actual outcomes—prediction-vs-label error over time or outcome correlation—rather than inputs. ::: --- ## Feature Stores: The Single Source of Truth ::: {.callout-tip} ## Staff Engineer Perspective "Feature store v1 was a wrapper around Redis I wrote in a week because we needed something. Two years later we had 600 features, three teams using it, and zero point-in-time correctness. Every model retrain silently leaked future data through aggregates that were 'as of now' rather than 'as of training row timestamp.' Offline AUC was 0.91, online was 0.74, and we couldn't explain why for months. The rebuild took 7 months and cost us roughly $1.4M including the team time. What I wish I'd built on day one: explicit event timestamps on every feature row, even if I had to fake them. The schema decisions you defer in week one are the migrations you pay for in year three." — *Principal ML Engineer at an ad-tech company* ::: ### Why Feature Stores Emerged Feature stores didn't exist as a category until Uber published their paper on Michelangelo in 2017. Before that, ML teams built bespoke data pipelines for each model. Feature computation logic was scattered across training notebooks, batch jobs, and serving microservices. Maintaining consistency was manual and error-prone. Uber's insight was that features are first-class citizens in ML systems. They should be: - Defined once and reused across models - Computed consistently for training and serving - Discoverable so teams don't reinvent them - Governed with ownership and access controls This shifted the mental model from "data pipelines that happen to produce features" to "features as products with their own lifecycle." 
### The Dual Store Architecture A feature store has two primary storage systems optimized for different access patterns: **Offline store**: Optimized for batch reads over historical data. Used for training data generation. Typically built on columnar formats (Parquet, Delta Lake, Iceberg) stored in object storage (S3, GCS) or data warehouses (BigQuery, Snowflake, Redshift). Access pattern: "Give me the value of features X, Y, Z for all entities E at their respective event timestamps T." **Online store**: Optimized for low-latency point lookups. Used for serving. Typically built on key-value stores (Redis, DynamoDB, Cassandra, Bigtable). Access pattern: "Give me the current value of features X, Y, Z for entity E, right now, in under 10 milliseconds." The critical architectural property is that both stores are populated from the same feature computation logic. Whether the computation is batch (Spark), streaming (Flink), or on-demand, the transformation logic is defined once and applied consistently. ![Feature Store Architecture](../assets/diagrams/rendered/ch24_feature_store.svg) ### Point-in-Time Correctness: The Key Abstraction The most important capability of a feature store's offline component is point-in-time correct feature retrieval. This prevents a category of bugs called data leakage. **Data leakage** occurs when information from the future influences predictions about the past. If you're predicting whether a user will churn next month, you can't use features computed from data that includes next month. This sounds obvious, but it's easy to violate accidentally. Consider: you want to train a model to predict purchase probability at the moment a user views a product. Your training data has user-product pairs with timestamps. You need features like "user's purchase count in the last 30 days." A naive implementation: join the entity table (user-product-timestamp) with a pre-aggregated features table (user-features). 
If the features table has one row per user with their latest 30-day purchase count, you've just leaked future data. Users who appear multiple times in your training data will have features computed from their entire history, including purchases that happened after the training example.

A point-in-time join fixes this. For each row in your entity table, retrieve the most recent feature values that were known at or before that row's timestamp, never values computed afterward. The feature store maintains feature timestamps and performs this temporal join automatically.

```python
# Point-in-time join sketch. `offline_store` is a hypothetical client that
# exposes get_feature_value(feature_ref, entity_id, as_of_timestamp).
def get_training_data(entities_df, feature_refs, offline_store):
    """
    entities_df has columns: entity_id, event_timestamp, label

    For each row, we need features that were known at event_timestamp,
    not features computed after event_timestamp.
    """
    results = []
    # itertuples yields one namedtuple per row; iterrows would yield
    # (index, row) pairs and break a three-way unpacking.
    for row in entities_df.itertuples(index=False):
        # For each feature, find the most recent value
        # where feature_timestamp <= event_timestamp
        feature_values = {}
        for feature_ref in feature_refs:
            feature_values[feature_ref] = offline_store.get_feature_value(
                feature_ref=feature_ref,
                entity_id=row.entity_id,
                as_of_timestamp=row.event_timestamp,  # Critical: temporal cutoff
            )
        results.append({
            'entity_id': row.entity_id,
            'event_timestamp': row.event_timestamp,
            'label': row.label,
            **feature_values,
        })
    return results
```

This is computationally expensive. For each training example, you're potentially scanning feature history. Feature stores optimize this through:

- Pre-computed snapshots at regular intervals
- Efficient storage formats with timestamp indexes
- Approximate joins (accepting small temporal fuzziness when exact joins are too expensive)

### Online Feature Serving: The Latency Budget

Online feature serving must happen within the latency budget of the prediction request, typically 10-100 milliseconds total. For example, with a 50-80ms end-to-end budget and 30ms of model inference, feature retrieval gets at most 20-50ms.
This constrains what's possible: - Features must be pre-computed and cached - Complex aggregations must be done in advance, not at request time - The feature store must be highly available and partitioned appropriately **Online serving patterns**: *Lookup-only*: Features are pre-computed by batch or streaming jobs. Online serving is pure key-value lookup. Simplest pattern, lowest latency. *Lookup + on-demand*: Some features are pre-computed, others are computed at request time. On-demand features are typically transformations of request data (like embedding a query) rather than aggregations. *Materialized views*: For features that are aggregations over entity activity (count of clicks in last hour), maintain materialized views that are incrementally updated by streaming jobs. ### The Feature Store Landscape in 2025-2026 The feature store market has matured from early open-source projects to a full ecosystem: **Feast**: The leading open-source feature store. Cloud-agnostic, extensible, active community. Best for teams that want control and can manage infrastructure. Requires external compute for feature transformations. **Tecton**: Enterprise-grade managed service from ex-Uber engineers who built Michelangelo. Best-in-class streaming feature computation. Higher cost, best for organizations with complex real-time feature requirements. **Databricks Feature Store**: Tightly integrated with Unity Catalog and Delta Lake. Best for organizations already invested in Databricks. Excellent for Spark-based feature engineering. **Cloud provider offerings** (SageMaker Feature Store, Vertex AI Feature Store): Deep integration with respective cloud ML platforms. Best when you're already committed to that cloud's ML stack. **When to build versus buy**: Build your own only if you have unusual requirements and dedicated platform team capacity. For most organizations, adopting an existing solution saves years of engineering effort. 
The build-vs-buy decision should favor "buy" more strongly here than in most infrastructure decisions because the edge cases in feature stores are numerous and subtle. ::: {.callout-warning} ## Common Mistake: Building a Custom Feature Store **What people do:** Underestimate feature store complexity. Start with "we just need a key-value store with timestamps." Build a v1 in a month. Spend the next two years discovering edge cases: point-in-time joins are slow, online-offline consistency drifts, feature backfills corrupt data, streaming features lag, schema changes break consumers. **Why it fails:** Feature stores look simple but encode years of hard-won lessons from companies like Uber, Airbnb, and Netflix. Point-in-time correctness alone has dozens of subtle failure modes. Your team will rediscover each one painfully. **Fix:** Unless you have 5+ dedicated platform engineers and requirements that genuinely don't fit existing solutions, adopt Feast, Tecton, or a cloud provider's feature store. The "we need full control" argument rarely survives the first year of maintaining a custom implementation. ::: --- ## Preventing Data Leakage ### The Anatomy of Leakage Data leakage is using information during training that would not be available at prediction time. It makes models appear much better in offline evaluation than they actually perform in production. **Why leakage is common in ML**: Machine learning workflows are exploratory. Data scientists iterate rapidly, trying features, checking metrics, adjusting. In this rapid iteration, it's easy to accidentally include information that wouldn't be available in production. Traditional software doesn't have this problem in the same way. If you write code that references a variable that doesn't exist, you get an error. But if you train a model on a feature that includes future information, you get better metrics. The feedback is positive, not negative. Only deployment reveals the problem. 
**Types of leakage**:

*Target leakage*: Features that encode the label or are caused by the label. If you're predicting whether someone will default on a loan, using "collections_initiated = True" as a feature leaks the label directly. This seems obvious, but indirect leakage is subtle. "Customer service calls in last 30 days" might spike after default, not before.

*Temporal leakage*: Using features computed from data after the prediction timestamp. The point-in-time join issue discussed earlier.

*Train-test leakage*: Information from test examples influencing training. Normalization using statistics computed on the full dataset (including test) is a form of this.

*Proxy leakage*: A feature that wouldn't be available in production but correlates strongly with one that would. If you use "time of data export" as a feature during development, it might correlate with the label due to when different labels were recorded.

### Designing for Leak-Free Pipelines

The key principle: **temporal partitioning is your first line of defense**.

Every dataset should have an explicit timestamp column representing when that information became known. Features should be joined based on these timestamps. Labels should be associated with the timestamp when the outcome was observed.

**Pattern: Explicit feature timestamps**

```
Feature table:
  - entity_id
  - feature_value
  - feature_timestamp (when this value became known)
  - feature_name

Training labels:
  - entity_id
  - label_value
  - label_timestamp (when this outcome was observed)
  - prediction_timestamp (when the prediction would be made)

Point-in-time join:
  For each (entity_id, prediction_timestamp) in labels:
    Find features where feature_timestamp <= prediction_timestamp
    Use most recent such feature value
```

**Pattern: Prediction time as a first-class concept**

When generating training data, explicitly model the prediction time, the moment at which the model would make its prediction in production.
This is often different from both the feature timestamp and the label timestamp.

Example: Predicting next-week purchase probability

- Features are computed from data up to Sunday midnight (feature_timestamp)
- Prediction is made Monday morning (prediction_timestamp)
- Label is whether a purchase happened by the following Sunday (label_timestamp)

The prediction_timestamp is Monday morning. Any feature with a feature_timestamp later than that prediction_timestamp is leakage.

**Pattern: Feature computation is strictly causal**

When defining feature transformations, think causally. The feature computation should only use data that would have been generated before the prediction time.

Window aggregations must be defined as: "events in window [prediction_time - N days, prediction_time)". The window ends at prediction time, not after.

### Detecting Leakage

Even with careful design, leakage can slip through. Detection approaches:

**Suspiciously good metrics**: If your model achieves much better metrics than you expected, be suspicious. Check that your test set is truly temporally separated from training.

**Feature importance analysis**: If a single feature dominates importance, investigate whether it might encode the label directly or indirectly.

**Adversarial validation**: Train a model to distinguish training from serving data. If this is easy (high AUC), your training data differs systematically from serving, suggesting a pipeline inconsistency.

**Time-based validation**: Train on older data, test on newer data. If performance is much worse than under random cross-validation, temporal leakage may be present.

**Manual spot-checking**: For a sample of predictions, manually trace the feature values back to source data. Verify timestamps make sense. This is labor-intensive but catches issues that automated checks miss.

::: {.callout-warning}
## Common Mistake: Ignoring "Suspiciously Good" Metrics

**What people do:** A new model achieves 95% AUC where the previous model got 80%. Celebrate. Ship to production.
Watch it perform at 70% on real traffic. **Why it fails:** Dramatic metric improvements usually indicate data leakage, not breakthroughs. The model learned to exploit information that won't be available at prediction time—future data leaking into features, label information encoded in features, or test data overlapping with training. **Fix:** Treat unusually good metrics as a red flag, not a success. Ask: "What information could explain this performance?" Check feature importance—if one feature dominates, investigate it. Use strict temporal validation (train on past, test on future). If metrics are dramatically better than domain experts expected, the model probably cheated. ::: --- ## Data Versioning and Lineage ### Why Version Data? Code versioning with Git is universal in software engineering. Data versioning is less mature but equally important for ML. **Reproducibility**: Six months from now, can you recreate the exact dataset used to train a model? If the data has changed since then and you have no versioned snapshot, the answer is no. **Debugging**: Model performance degraded. Was it a code change or a data change? Without data versions, you can't easily diff. **Rollback**: Bad data got into training. Can you revert to a known-good state? **Compliance**: Regulations (GDPR, SOC 2, industry-specific) may require explaining what data influenced a decision. Without lineage, this is impossible. **Experimentation**: You want to try a new feature. But if you also updated the underlying data, you've conflated two changes. Data versioning enables controlled experiments. ### Data Versioning Approaches **Snapshot-based versioning**: Periodically capture full copies of datasets. Simple but storage-expensive. Like taking full backups without incrementals. **Delta-based versioning**: Store changes from a base version. More storage-efficient but requires computing current state from base + deltas. 

**Storage-layer versioning**: Modern data lake formats (Delta Lake, Apache Iceberg, Apache Hudi) build versioning into the storage format. They enable time-travel queries ("SELECT * FROM table AS OF timestamp") without external versioning systems.

**Reference-based versioning**: Track pointers to data locations rather than copying data. Used when data is immutable (like log files). DVC (Data Version Control) works this way, storing data in object storage and tracking references in Git.

### Lineage: Understanding Data Flow

Lineage answers: where did this data come from, and what else depends on it?

**Upstream lineage**: For a given dataset or feature, what raw data sources contribute to it? What transformations were applied?

**Downstream lineage**: For a given raw data source, what features, models, and predictions depend on it?

Lineage enables:

- Impact analysis: "If this table's schema changes, what breaks?"
- Root cause analysis: "This model is producing bad predictions; what upstream data might have caused this?"
- Compliance: "Explain what data influenced this decision for user X."

### Implementing Lineage

Lineage can be captured at different granularities:

**Job-level lineage**: Track which jobs read from and write to which tables. Coarse but easy to implement. Most orchestration tools (Airflow, Dagster, Prefect) provide this out of the box.

**Column-level lineage**: Track transformations at the column level. "Output column X depends on input columns A and B." More useful for impact analysis but harder to capture.

**Row-level lineage**: Track which output rows came from which input rows. Most granular but expensive. Sometimes required for compliance in high-stakes domains.

**Automatic vs. manual capture**: Automatic lineage parsing (analyzing SQL queries, inspecting dataframes) captures lineage without developer effort but may miss complex logic. Manual annotation is more accurate but requires discipline.
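
To make job-level lineage concrete, here is a minimal sketch of an in-memory lineage graph that answers the upstream and downstream questions above. The `LineageGraph` class, the job names, and the dataset names are all hypothetical and exist only for illustration; a real system would more likely read this information from its orchestrator's metadata than record it by hand.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Job-level lineage: which datasets feed which other datasets."""

    def __init__(self):
        self._downstream = defaultdict(set)  # dataset -> datasets derived from it
        self._upstream = defaultdict(set)    # dataset -> datasets it is derived from
        self.producers = {}                  # dataset -> job that writes it

    def record_job(self, job, inputs, outputs):
        """Register one job: every output is assumed to depend on every input."""
        for out in outputs:
            self.producers[out] = job
            for src in inputs:
                self._downstream[src].add(out)
                self._upstream[out].add(src)

    @staticmethod
    def _reachable(start, edges):
        """Breadth-first walk returning everything reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            for nxt in edges.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def downstream(self, dataset):
        """Impact analysis: what could break if this dataset changes?"""
        return self._reachable(dataset, self._downstream)

    def upstream(self, dataset):
        """Root cause analysis: what data contributed to this dataset?"""
        return self._reachable(dataset, self._upstream)


# Hypothetical two-job pipeline: raw tables -> feature table -> model.
graph = LineageGraph()
graph.record_job("build_user_trip_features", ["raw.trips", "raw.users"], ["features.user_trip"])
graph.record_job("train_eta_model", ["features.user_trip"], ["models.eta"])

print(sorted(graph.downstream("raw.trips")))
print(sorted(graph.upstream("models.eta")))
```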

---

## Data Quality for ML

### Why ML Systems Need Special Data Quality Attention

Traditional data quality focuses on: Are values correct? Are they complete? Do they conform to schema?

ML data quality adds: Is the distribution stable? Are training and serving consistent? Is there leakage?

A table can be perfectly clean by traditional standards (no nulls, all values valid) yet be useless for ML if the distribution has drifted or if it includes future information.

### The Five Pillars of ML Data Observability

**Freshness**: Is data arriving on time? For features with freshness SLAs (user's recent activity must be updated hourly), monitor update timestamps. Alert when data is stale.

**Volume**: Is the amount of data what you expect? A drop in row count might indicate an upstream failure. A spike might indicate duplicate data. Track counts over time, alert on anomalies.

**Schema**: Has the structure changed unexpectedly? Added columns might be benign. Removed columns or type changes can break downstream processing. Validate schema on write.

**Distribution**: Have statistical properties changed? Monitor mean, standard deviation, cardinality, and null rate for each column. Use the population stability index (PSI) or Kolmogorov-Smirnov (KS) tests to detect shifts.

**Lineage**: Can we trace data provenance? When something goes wrong, lineage enables root cause analysis.

### Building a Data Quality Monitoring System

A production data quality system has several components:

**Expectation definition**: For each dataset, define what "healthy" looks like. This can be:

- Explicit thresholds (null rate < 1%, row count > 10000)
- Learned baselines (within 2 standard deviations of historical mean)
- Schema expectations (these columns exist with these types)

**Check execution**: Run quality checks at appropriate points:

- Before data enters pipelines (quality gates)
- After transformations (validation)
- Periodically on production data (monitoring)

**Alerting and response**: When checks fail, who gets notified? What's the severity? Some failures should block the pipeline (critical). Others should log a warning but proceed (informational).

**Feedback loops**: Quality issues discovered in production should feed back into stronger checks. If you discover leakage, add a check to detect it in the future.

### Real-Time Data Quality Challenges

Batch data quality checking is straightforward: process the data, run checks, alert if failed. Real-time data quality is harder because latency matters.

You can't run expensive statistical tests on every event. Instead:

- Run cheap checks synchronously (type validation, null checks, range checks)
- Run expensive checks asynchronously on sampled data
- Maintain streaming approximations (t-digest for approximate percentiles, HyperLogLog for cardinality, count-min sketches for frequencies)

For ML serving specifically:

- Validate feature values at retrieval time (are they fresh enough? do types match?)
- Monitor prediction distributions for anomalies (many zero-probability predictions?)
- Track feature coverage (are features missing for some entities?)

---

## Case Study: How Uber Solved Feature Management at Scale

### The Problem

By 2016, Uber had hundreds of ML models in production: pricing, ETAs, fraud detection, driver matching. Each model had its own feature pipelines. Data scientists were spending more time on data plumbing than modeling. Feature computation was duplicated across teams. Training-serving skew was endemic.

### The Solution: Michelangelo

Uber built Michelangelo, an end-to-end ML platform. The feature store component, described in Uber's 2017 engineering blog post, established many patterns that are now industry standard.

Key architectural decisions:

**Feature groups**: Features are organized into groups with shared entity types and computation schedules. A "user_trip_features" group contains all features computed per user based on trip data.

**Dual store with consistent computation**: Batch jobs compute features for offline storage (Hive). The same logic, running as streaming jobs, updates online storage (Cassandra). A schema registry ensures consistency.

**Time-partitioned storage**: Offline features are stored with temporal partitions, enabling efficient point-in-time joins. "Get user features as of this timestamp" is a supported query pattern.

**Automatic feature logging**: At serving time, features used for predictions are logged with timestamps. This enables offline analysis of serving data and debugging of production issues.

### Results and Lessons

After Michelangelo:

- Time to deploy models decreased from weeks to days
- Feature reuse increased dramatically (most new models used >50% existing features)
- Training-serving skew incidents dropped by 90%

Lessons applicable to other organizations:

1. **Standardization enables velocity**. Constraining how features are defined and computed might seem limiting, but it enables automation and reuse.
2. **The offline-online consistency problem is central**. Most of Michelangelo's complexity exists to solve this one problem.
3. **Feature management is an organizational problem, not just technical**. Success required cultural change: data scientists agreeing to use shared features rather than building their own.

---

## Case Study: The 2021 Zillow Offers Shutdown

### What Happened

In November 2021, Zillow announced it was shutting down Zillow Offers, its home-buying business, and laying off roughly 25% of its workforce (about 2,000 employees). The company took a $304 million write-down on inventory homes purchased at prices that exceeded their resale value, with total losses from the wind-down exceeding $500 million.

A significant factor: their home price prediction models performed much worse in production than in backtesting.

### The Data Architecture Angle

The dominant public explanation for the shutdown was business and market-timing: Zillow scaled home purchases aggressively into a volatile, fast-moving 2021 market and couldn't operationally buy, renovate, and resell fast enough as prices turned. Data-architecture issues (feedback loops and distribution shift) **contributed** to and amplified the mispricing, but they were not the primary cause. With that framing, the data angle is still instructive:

**Distribution shift at scale**: The models were trained on historical home sales. When Zillow entered markets as a major buyer, their presence changed the market. Homes Zillow bid on were no longer representative of the training distribution.

**Feature freshness issues**: Housing market conditions were changing rapidly in 2020-2021 (pandemic, low interest rates, migration patterns). Features based on historical trends became misleading.

**Feedback loops**: Zillow's own pricing influenced comparable sales, which influenced features, which influenced pricing. This feedback loop wasn't adequately modeled.

### Lessons for ML Data Architecture

1. **Monitor for distributional shifts continuously**, especially when your system can influence its own inputs.
2. **Features based on historical aggregations can become misleading rapidly** when the underlying process changes.
3. **Backtesting is necessary but not sufficient**. Backtests assume the world doesn't change in response to your model, which is often false at scale.
4. **Business metrics are the ultimate ground truth**. If offline metrics say you're doing great but the business is losing money, trust the business metrics.

---

## Decision Framework: When to Invest in What

### Do You Need a Feature Store?

**You probably need a feature store if**:

- You have multiple models sharing features
- Training-serving skew has caused production issues
- Teams are reinventing the same features
- Feature computation takes significant data science time

**You might not need a feature store if**:

- You have a single model or very few models
- Features are simple and unlikely to drift
- You have a small team and low ML maturity
- The cost and complexity aren't justified by the problems you're experiencing

**Start small**: A feature store can be as simple as a shared table with feature values and timestamps, plus conventions for how to join it. You don't need enterprise platforms on day one.

### Batch vs. Streaming Features

**Use batch features when**:

- Features are aggregations over long windows (30-day averages)
- Latency of hours is acceptable
- Computation is expensive and benefits from batch optimization
- Source data is batch-oriented (daily data warehouse updates)

**Use streaming features when**:

- Features must reflect events within seconds or minutes
- Real-time context significantly improves predictions
- Source data is event-driven

**Hybrid pattern**: Most production systems use both. Long-window aggregations are batch-computed daily. Short-window features (clicks in last hour) are stream-updated. The feature store abstracts this from consumers.

### How Much Versioning and Lineage?

**Minimal investment** (suitable for early-stage teams):

- Version datasets by date in naming convention (user_features_2024_01_15)
- Manual documentation of data sources
- Git versioning for code including data generation scripts

**Moderate investment** (suitable for growing teams):

- Storage-layer versioning (Delta Lake time travel)
- Automated job-level lineage from orchestration tool
- Data quality dashboards with alerts

**High investment** (suitable for mature organizations):

- Full data versioning with DVC or similar
- Column-level automated lineage
- Data contracts between teams
- Comprehensive data quality as code

The right level depends on your scale, regulatory environment, and the cost of data issues in your domain.

---

## Building Intuition: How to Spot Data Architecture Problems

### Red Flags to Watch For

**"The model works in the notebook but not in production"**: Almost always a training-serving consistency issue. Check feature computation, preprocessing, and data freshness.

**"We can't reproduce the model from three months ago"**: Data versioning gap. Either the training data has changed or you haven't recorded which version was used.

**"No one knows what this feature means"**: Feature governance gap. Features have accumulated without ownership or documentation.

**"It takes two weeks to add a new feature to the model"**: Infrastructure gap. Feature pipelines are too manual or too coupled.

**"We see the model was wrong but can't figure out why"**: Observability gap. You need prediction logging, feature logging, and lineage to debug.

### Questions to Ask During Design Reviews

When reviewing ML system designs, probe the data architecture:

1. "Where does each feature come from? Is that computation the same in training and serving?"
2. "If I deploy this model today, what's the oldest data it might see as a feature? Is that acceptable?"
3. "If I need to recreate this exact training run six months from now, can I? What would I need to store?"
4. "When this model makes a bad prediction, how do I trace back to understand what data produced it?"
5. "If the upstream data changes schema, how does this system handle it?"

If the answers are vague, there's work to do on data architecture.

---

## Common Pitfalls and How to Avoid Them

### Pitfall 1: Feature Store Without Governance

**The pattern**: A team builds a feature store and gets initial adoption. Features proliferate. After a year, there are 2000 features with unclear ownership. Nobody knows which are still used. Deprecation is impossible because impact is unknown.

**The fix**: Treat features as products. Require documentation and ownership at creation time. Track consumers automatically. Set expiration dates on experimental features. Conduct regular feature audits.

### Pitfall 2: Optimizing for Offline Metrics Only

**The pattern**: Data scientists develop features and models optimizing for offline evaluation metrics. No attention to whether these features can be computed consistently at serving time or whether they meet latency requirements.

**The fix**: Include serving requirements in the definition of done. A feature isn't ready until it works in both training and serving. Build staging environments that mirror production latency constraints.

### Pitfall 3: Schema Changes Without Migration

**The pattern**: An upstream team changes a table schema. Downstream ML pipelines fail silently or produce incorrect features.

**The fix**: Data contracts. Producers and consumers agree on schemas and compatibility rules. Changes require compatibility checks. Breaking changes require migration periods and consumer notification.

### Pitfall 4: No Feedback from Production to Development

**The pattern**: Models are developed, trained, and deployed. Problems are discovered in production weeks later. By then, no one remembers the details of training data decisions.

**The fix**: Close the loop. Log predictions and features at serving time.
Compare serving distributions to training distributions continuously. Make this data available to data scientists as part of their development workflow.

### Pitfall 5: Treating Data Pipelines as One-Time Scripts

**The pattern**: A data scientist writes a script to generate training data. It works once and is never maintained. When the model needs retraining, the script is broken due to environment or data changes.

**The fix**: Treat data pipelines as production code. Version them. Test them. Run them regularly even when not needed, to catch breakages early. Use orchestration tools that alert on failure.

---

## Team and Organizational Considerations

Data architecture isn't just a technical challenge; it's an organizational one. The best technical design fails without the right team structure and governance.

### Building a Data Platform Team

**Who owns data architecture?**

In early-stage ML organizations, data architecture is often nobody's job. Data scientists build ad-hoc pipelines. ML engineers handle serving. Data engineers manage warehouses. No one owns the end-to-end consistency that matters most.

As organizations mature, a dedicated data platform team emerges. This team owns:

- Feature store infrastructure and operations
- Data quality monitoring and alerting
- Lineage and cataloging systems
- Training-serving consistency tooling
- Data pipeline frameworks and best practices

**Team composition:**

A typical data platform team includes:

- **Data engineers** for pipeline development and infrastructure
- **Backend engineers** for online serving systems
- **ML engineers** who understand model training requirements
- **Product manager** to prioritize across competing demands

The key skill is translation: understanding both data science needs and infrastructure constraints, then building platforms that serve both.

**Relationship to data science teams:**

The data platform team provides infrastructure; data science teams use it.
This creates a customer-supplier relationship. Success requires:

- Clear service level objectives (What latency? What freshness? What availability?)
- Feedback loops (What's working? What's painful?)
- Shared ownership of quality (Platform provides tools; users follow standards)

### Data Governance for ML

Governance ensures that data is used appropriately, consistently, and in compliance with policies.

**Feature ownership:**

Every feature should have an owner, a person or team accountable for:

- Correctness of computation logic
- Documentation and discoverability
- Deprecation when no longer needed
- Responding to quality issues

Without ownership, features become orphans. Orphan features accumulate, their logic becomes unclear, and removing them becomes impossible.

**Feature lifecycle management:**

Features have lifecycles:

1. **Proposed**: Someone wants this feature; it's being evaluated
2. **Experimental**: Feature exists but isn't production-validated
3. **Production**: Feature is stable, documented, and approved for production use
4. **Deprecated**: Feature is being phased out; no new uses allowed
5. **Removed**: Feature is deleted from the system

Enforce these states. Block experimental features from production models. Require migration plans before deprecation.

**Access control:**

Not all features should be available to all users:

- PII features may require special approval
- Features derived from sensitive data need access controls
- Experimental features shouldn't be used in production

Implement access control at the feature store level. Audit who uses what.

**Data contracts:**

Data contracts formalize agreements between data producers and consumers:

- Schema: What columns exist, what types, what constraints
- Quality: What freshness, completeness, accuracy guarantees
- Compatibility: How will breaking changes be handled

Contracts prevent surprise breakages.
When a producer wants to change something, the contract specifies who to notify and what migration period to allow.

### Organizational Change Management

Adopting good data architecture requires behavior change. People who've built ad-hoc pipelines must adopt new tools and standards. This is hard.

**Start with pain:**

Don't lead with "you should use the feature store because it's better." Lead with "remember when model X failed in production because of training-serving skew? The feature store would have prevented that." Connect new tools to problems people have actually experienced.

**Make the right thing easy:**

If using the feature store requires more work than building a one-off pipeline, people won't use it. Invest in developer experience:

- Good documentation and examples
- Templates and generators for common patterns
- Fast feedback when something goes wrong
- Integration with existing workflows

**Celebrate success:**

When someone uses the feature store and it catches a problem, share that story. When feature reuse saves weeks of work, highlight it. Positive reinforcement accelerates adoption.

**Don't mandate without support:**

Mandating feature store use before it's ready creates resentment. Ensure the platform is good enough before requiring adoption. Early adopters should be enthusiastic volunteers, not reluctant conscripts.

---

## Migration Strategies

Most organizations don't start with good data architecture. They evolve toward it from ad-hoc beginnings. Here's how to manage that migration.

### Assessment: Where Are You Starting?

Before planning migration, understand your current state:

**Maturity assessment questions:**

1. *Feature consistency*: Is the same feature computed identically in training and serving? How do you know?
2. *Reproducibility*: Can you recreate the training data from 6 months ago?
3. *Discoverability*: How do teams find existing features? (If "ask around" or "search Slack," that's low maturity.)
4. *Quality monitoring*: How would you know if a feature started producing bad values?
5. *Leakage prevention*: How do you ensure point-in-time correctness?

Score each area 1-5. This reveals where to focus.

### Incremental Migration Path

Don't try to migrate everything at once. A staged approach reduces risk and builds confidence.

**Stage 1: Shared definitions (1-3 months)**

Start with the organizational layer, not the infrastructure:

- Create a feature catalog (can be as simple as a shared document)
- Define naming conventions
- Establish ownership for existing critical features
- Document computation logic for features used by multiple teams

This costs little and provides immediate value. It also reveals which features are actually shared vs. duplicated.

**Stage 2: Centralized storage (2-4 months)**

Move feature data to shared storage:

- Choose or build an offline store (often your existing data warehouse)
- Migrate high-value features to shared tables
- Implement point-in-time join capability
- Add basic quality monitoring

Training now uses centralized features. Serving may still use separate logic.

**Stage 3: Consistent computation (3-6 months)**

Unify computation between training and serving:

- Refactor serving to use the same feature definitions
- Implement online store for low-latency access
- Add computation pipelines (batch or streaming)
- Validate training-serving consistency

This is the hardest stage. It requires coordinating across teams and often rewriting serving logic.

**Stage 4: Full platform (ongoing)**

Add advanced capabilities:

- Streaming features for real-time use cases
- Column-level lineage
- Automated quality enforcement
- Self-service feature creation

These capabilities differentiate mature platforms but aren't necessary to start.

### Migration Anti-Patterns

**Big bang rewrite:**

"We'll rebuild everything on the new platform, then switch over."
This fails because:

- The new platform doesn't have feature parity
- Users have no investment in something they haven't used
- Timeline stretches, stakeholders lose patience

Instead: Migrate incrementally, one feature group at a time.

**Platform without users:**

"We'll build the perfect feature store, then get adoption." This fails because:

- You don't know what users need until they use it
- Platforms without users don't get maintained
- Sunk cost doesn't create adoption

Instead: Start with one or two teams as design partners. Build for their needs first.

**Deprecate before migrate:**

"We're shutting down the old system on date X." This fails because:

- Users scramble to migrate under pressure
- Corners get cut; bugs appear
- Emergency exceptions undermine the deadline

Instead: Make the new system compelling enough that users want to migrate. Deprecation follows voluntary adoption.

---

## Cost Engineering for Data Architecture

Data architecture has real costs: storage, compute, and operational overhead. Good architecture manages these costs without sacrificing quality.

### Understanding Cost Drivers

**Storage costs:**

Feature stores maintain historical data for point-in-time correctness. This accumulates:

- Offline store: Grows with history depth × number of entities × number of features
- Online store: Grows with number of entities × number of features (current values only)
- Feature logs: Grows with request volume × features per request

Estimate storage needs before building. Assuming roughly 8 bytes per value, a feature store for 100M users with 100 features and 1 year of history reaches tens of terabytes with daily snapshots, and approaches petabyte scale with hourly snapshots, wider values, or replication.

**Compute costs:**

Feature computation can be expensive:

- Batch jobs: Cost scales with data volume and computation complexity
- Streaming jobs: Cost scales with event rate and processing complexity
- On-demand computation: Cost scales with request rate

Real-time features are often 10-100x more expensive per feature value than batch features.
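
The storage growth formulas above can be turned into a back-of-envelope estimate. The sketch below is illustrative only: the function names are hypothetical, and it assumes 8 bytes per feature value with no compression, keys, timestamps, indexes, or replication, all of which change the real number. Its main purpose is to show how strongly the result depends on snapshot frequency.

```python
def offline_store_bytes(entities, features, snapshots, bytes_per_value=8):
    """History depth (number of snapshots) x entities x features."""
    return entities * features * snapshots * bytes_per_value


def online_store_bytes(entities, features, bytes_per_value=8):
    """Current values only: entities x features."""
    return entities * features * bytes_per_value


def feature_log_bytes(requests_per_second, features_per_request, seconds, bytes_per_value=8):
    """Request volume x features per request, accumulated over a time span."""
    return requests_per_second * features_per_request * seconds * bytes_per_value


def human(num_bytes):
    """Format a byte count using decimal units (1 TB = 10**12 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.1f} PB"


USERS, FEATURES = 100_000_000, 100

daily = offline_store_bytes(USERS, FEATURES, snapshots=365)        # one snapshot per day
hourly = offline_store_bytes(USERS, FEATURES, snapshots=365 * 24)  # one snapshot per hour
online = online_store_bytes(USERS, FEATURES)

print("offline, daily snapshots: ", human(daily))
print("offline, hourly snapshots:", human(hourly))
print("online, current values:   ", human(online))
```

Under these assumptions the same 100M-user, 100-feature store differs by a factor of 24 between daily and hourly history, which is why snapshot granularity and retention are usually the first levers to examine.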

**Operational costs:**

Running a feature store requires:

- On-call support for availability
- Incident response for quality issues
- Capacity planning and scaling
- Upgrades and maintenance

These costs scale with platform complexity and reliability requirements.

### Cost Optimization Strategies

**Tiered storage:**

Not all historical data needs to be equally accessible:

- Hot tier: Recent data (weeks) for frequent training; fast storage
- Warm tier: Older data (months) for occasional retraining; cheaper storage
- Cold tier: Archive data (years) for compliance; cheapest storage

Move data between tiers based on access patterns.

**Feature materialization strategy:**

Not every feature needs to be pre-computed:

- **High-value, high-use features**: Materialize and serve from feature store
- **Medium-value features**: Compute on-demand during training; cache for serving
- **Low-value features**: Compute on-demand everywhere; don't store

Analyze feature usage to categorize appropriately.

**Aggregation granularity:**

Window aggregations (count of events in last N days) can be expensive to compute with fine granularity. Trade precision for efficiency:

- Daily granularity for 30-day windows (30 buckets to sum)
- Hourly granularity only when hour-level freshness matters
- Approximate algorithms (HyperLogLog, t-digest) for expensive aggregations

**Retention policies:**

Define how long to keep data:

- Online store: Keep only current values (minimal retention)
- Offline store: Keep enough history for retraining needs
- Feature logs: Keep enough for debugging, then archive or delete

Shorter retention windows reduce storage costs significantly.

### Cost Monitoring and Accountability

**Attribution:**

Track costs by:

- Team: Which teams' features cost the most?
- Feature: Which features are expensive relative to their value?
- Model: Which models drive the most feature computation?

Attribution enables accountability and optimization.
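
As a rough illustration of attribution, the sketch below rolls up tagged cost records along any one of the three dimensions listed above. The record layout, the `attribute_costs` helper, and all names and dollar figures are hypothetical; in practice the records would come from billing exports or job metadata tagged with team, feature, and model.

```python
from collections import defaultdict

# Hypothetical monthly cost records, one per (team, feature, model) usage.
cost_records = [
    {"team": "fraud", "feature": "txn_amount_7d_avg", "model": "fraud_v2", "usd": 1200.0},
    {"team": "fraud", "feature": "txn_count_1h", "model": "fraud_v2", "usd": 4300.0},
    {"team": "growth", "feature": "txn_amount_7d_avg", "model": "churn_v1", "usd": 300.0},
    {"team": "growth", "feature": "sessions_30d", "model": "churn_v1", "usd": 650.0},
]


def attribute_costs(records, dimension):
    """Sum cost per value of `dimension`, most expensive first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for dimension in ("team", "feature", "model"):
    print(dimension, attribute_costs(cost_records, dimension))
```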

**Chargeback or showback:**

Some organizations charge teams for their data infrastructure usage. Others show costs without charging. Either approach creates awareness and incentivizes efficiency.

**Cost-benefit analysis for features:**

Before adding a new feature, estimate:

- Computation cost (how much to compute daily/hourly)
- Storage cost (how much data over time)
- Expected value (how much does this improve predictions)

If a feature costs $10K/month to compute but only improves model value by $5K/month, it's not worth it.

---

## Key Takeaways

1. **Training-serving skew is the central challenge** - Features computed differently in training versus serving cause silent model degradation. Prevention requires single sources of truth and shared computation logic.
2. **Feature stores solve coordination problems** - Centralizing feature definitions, computation, and serving enables consistency, reuse, and governance. The dual offline/online architecture serves different needs.
3. **Point-in-time correctness prevents data leakage** - Training data must respect temporal boundaries. Features should reflect only what was known at prediction time. This requires explicit timestamps and careful join logic.
4. **Data versioning enables reproducibility** - Without versioned data, past experiments can't be recreated. Modern storage formats make versioning practical. Treat it as non-negotiable infrastructure.
5. **Treat data pipelines as production code** - Version them, test them, monitor them. Data quality issues are ML bugs. Run pipelines regularly even when not needed to catch breakages early.

---

## Summary

Data architecture is the foundation upon which reliable ML systems are built. The patterns and principles in this chapter emerged from hard-won lessons at organizations running ML at scale:

**Training-serving skew is the central challenge**. Features computed differently in training versus serving cause silent model degradation.
Prevention requires architectural discipline: single sources of truth, shared computation logic, continuous monitoring.

**Feature stores solve coordination problems**. By centralizing feature definitions, computation, and serving, feature stores enable consistency, reuse, and governance. The dual offline/online architecture serves the different needs of training and serving.

**Point-in-time correctness prevents data leakage**. Training data generation must respect temporal boundaries. Features should reflect only what was known at the prediction time, not after. This requires explicit timestamps and careful join logic.

**Data versioning enables reproducibility**. Without versioned data, past experiments can't be recreated, and debugging is archaeology. Modern storage formats and tools make versioning practical.

**Observability is non-negotiable**. Monitor freshness, volume, schema, and distribution continuously. Detect skew before it causes business impact. Build lineage to enable root cause analysis.

**Start with your actual problems**. Not every organization needs enterprise-grade infrastructure on day one. Match your investment to your scale, risk tolerance, and the actual problems you're experiencing.

The technical patterns in this chapter are well-understood. The harder challenge is organizational: building the culture and processes that make good data architecture the default rather than an afterthought.

### Connections to Other Chapters

- **Chapter 7 (RAG Systems)** applies data architecture principles to retrieval systems
- **Chapter 9 (LLM Deployment & Infrastructure)** covers the serving systems that consume features
- **Chapter 15 (MLOps & Evaluation)** discusses evaluation pipelines that depend on data quality
- **Chapter 25 (System Design at Scale)** provides broader context for distributed data systems
- **Chapter 28 (Research-to-Production)** covers validating new data techniques before production adoption
- **Chapter 31 (Reliability Engineering)** addresses operational concerns for data pipelines
- **Chapter 32 (Cost Engineering)** extends the cost optimization strategies introduced here

---

## Practical Exercises

### Exercise 1: Point-in-Time Join Implementation

Given the following data:

```python
purchase_events = [
    {"user_id": "u1", "item_id": "i1", "timestamp": "2024-01-15 10:00", "amount": 50.0},
    {"user_id": "u1", "item_id": "i2", "timestamp": "2024-01-20 14:30", "amount": 30.0},
    {"user_id": "u2", "item_id": "i1", "timestamp": "2024-01-18 09:00", "amount": 50.0},
]

user_features = [
    {"user_id": "u1", "timestamp": "2024-01-10", "purchases_30d": 5, "avg_amount": 45.0},
    {"user_id": "u1", "timestamp": "2024-01-17", "purchases_30d": 6, "avg_amount": 46.0},
    {"user_id": "u2", "timestamp": "2024-01-15", "purchases_30d": 3, "avg_amount": 60.0},
]
```

1. Implement a point-in-time join that retrieves features for each purchase event.
2. Add a TTL parameter to treat old features as missing.
3. Write validation code that checks for leakage (feature_timestamp > event_timestamp).
4. How would you optimize this for a billion purchase events?

**Self-Assessment Questions:**

- Does your join handle the case where no feature exists before the event timestamp?
- Does your TTL implementation correctly handle edge cases (exactly at TTL, no features at all)?
- Would your optimization strategy work with real distributed systems constraints?

**Quality Indicators:**

- *Strong*: Implementation handles all edge cases (no features, multiple features at same timestamp, TTL boundary)
- *Weak*: Implementation only handles happy path
- *Strong*: Optimization discusses specific technologies (Spark broadcast joins, partitioning strategies) with tradeoffs
- *Weak*: Optimization is vague ("use a database")

### Exercise 2: Skew Detection Dashboard

Design and implement a monitoring system for training-serving skew:

1. Define the data structures for storing training statistics.
2. Implement PSI and KS test calculations.
3. Create alerting logic with appropriate thresholds.
4. Simulate a skew scenario and verify detection.

**Self-Assessment Questions:**

- Does your PSI implementation handle zero-probability bins correctly (avoids log(0))?
- Are your thresholds based on research/experience or arbitrary?
- Did your simulation actually trigger alerts, validating the detection logic?

**Quality Indicators:**

- *Strong*: Data structures support both numeric and categorical features with appropriate handling
- *Weak*: Only handles one feature type
- *Strong*: Alert thresholds are justified (references PSI literature or calibrated on test data)
- *Weak*: Thresholds are arbitrary round numbers
- *Strong*: Simulation includes realistic scenarios (gradual drift, sudden shift, seasonal patterns)
- *Weak*: Simulation is trivial (just doubles all values)

### Exercise 3: Feature Store Design

Design a feature store for a ride-sharing application:

1. Identify entities (driver, rider, trip, location).
2. Define 10 features across these entities with appropriate freshness requirements.
3. Specify which features are batch vs. streaming.
4. Design the online store schema for sub-10ms lookups.
5. Document how you'd ensure training-serving consistency.

**Self-Assessment Questions:**

- Do your freshness requirements align with how the features will be used?
- Does your online schema support all access patterns needed for serving?
- Is your training-serving consistency approach practical, not just theoretically correct?

**Quality Indicators:**

- *Strong*: Features include different entity types and relationships (driver-rider interaction features)
- *Weak*: Features are all single-entity aggregations
- *Strong*: Freshness requirements are justified by use case ("surge pricing needs demand features within 5 minutes")
- *Weak*: All features have same freshness requirement
- *Strong*: Consistency approach includes specific mechanisms (shared code, automated testing, monitoring)
- *Weak*: Consistency approach is "use the same logic"

### Exercise 4: Data Leakage Audit

Take an existing ML model in your organization:

1. List all features used by the model.
2. For each feature, trace back to its computation logic.
3. Identify the timestamps associated with each feature value.
4. Check whether point-in-time correctness is enforced in training data generation.
5. Propose fixes for any leakage risks discovered.

**Self-Assessment Questions:**

- Did you trace ALL features, including those computed in preprocessing?
- Did you verify timestamps empirically (spot-checking actual data) or trust documentation?
- Are your proposed fixes actionable with specific implementation steps?

**Quality Indicators:**

- *Strong*: Audit includes implicit features (preprocessing transformations, derived columns)
- *Weak*: Audit only covers explicit feature columns
- *Strong*: Timestamp verification includes actual data samples showing correct/incorrect behavior
- *Weak*: Timestamp verification only checks code comments
- *Strong*: Fixes include migration plan and validation approach
- *Weak*: Fixes are just "add timestamp column"

### Exercise 5: Migration Planning

You've inherited a system where each of 5 ML models has its own ad-hoc feature pipelines. Design a migration to centralized data architecture.

1. Assess the current state (list all features, identify duplicates, find quality issues).
2. Design the target architecture.
3. Create a phased migration plan with specific milestones.
4. Identify risks and mitigation strategies.
5. Define success metrics for the migration.

**Self-Assessment Questions:**

- Does your assessment distinguish duplicates from legitimately different features?
- Is your phased plan realistic given team capacity and organizational constraints?
- Do your success metrics measure actual improvement, not just migration completion?

**Quality Indicators:**

- *Strong*: Migration plan includes parallel running period to validate consistency
- *Weak*: Migration is cut-over without validation
- *Strong*: Risks include organizational factors (adoption, training, change management)
- *Weak*: Risks are only technical
- *Strong*: Success metrics include business outcomes (reduced incidents, faster development)
- *Weak*: Success metrics are only technical (feature count migrated)

---

## Self-Assessment Checkpoint

### Conceptual Questions

**Q1. [IC2]** Your model's offline metrics look great but production performance is worse. You suspect training-serving skew. How do you diagnose which features are causing it?

<details>
<summary>Answer</summary>

Systematic diagnosis: (1) Log feature values at serving time alongside predictions. (2) For each feature, compute distribution statistics (mean, std, percentiles) from serving logs. (3) Compare to training data distributions using PSI or KS tests. (4) Rank features by skew magnitude. (5) For top skewed features, trace the computation path: is the same code used in training and serving? Are data sources identical? (6) Common culprits: timestamp handling, missing value imputation, aggregation windows, stale features in online store.

</details>

**Q2. [IC2]** You're joining user features to click events for training. Why can't you just use the latest feature values?

<details>
<summary>Answer</summary>

Using latest values creates data leakage: you'd be using information from the future to predict past events.
If a user's "purchases_30d" feature is 10 today, but was 5 when the click happened, using 10 means your model learns from data it won't have at inference time. This artificially inflates offline metrics. The model may learn patterns like "users with high purchases_30d click more" that don't hold in production, because at serving time the feature is computed before the purchases that inflated it in the training data have happened.

</details>

**Q3. [Senior]** Your feature store serves p99 latency of 15ms but you need sub-5ms for real-time bidding. What's your approach?

<details>
<summary>Answer</summary>

Options by effort: (1) **Caching**: Pre-warm an in-memory cache at the edge with features for users likely to appear in bid requests. This trades freshness for latency. (2) **Batch precomputation**: Precompute user-ad feature combinations for high-traffic pairs. (3) **Simpler features**: Replace complex features with simpler approximations that are faster to serve. (4) **Architecture**: Move to in-process feature serving using embedded RocksDB or similar. (5) **Data locality**: Co-locate feature serving with bidding logic. (6) **Feature reduction**: Use feature importance to identify which features actually matter; serve only those. Often the answer is a combination with different strategies for different feature tiers.

</details>

**Q4. [Senior]** The data science team wants a new feature that requires a 90-day window of user activity. Your batch pipelines run daily. How do you compute this feature efficiently?

<details>
<summary>Answer</summary>

Incremental computation: (1) Don't recompute the full 90-day window daily. (2) Use sliding window patterns: maintain the aggregate state, then on each day add the new day's data and subtract the data for the day that just left the window (day 91). (3) Implementation: store intermediate state as per-day partial aggregates (sum, count, etc.) with their dates, so the expiring day can be subtracted. On update: `new_sum = old_sum + today_value - day_91_value`. (4) Considerations: Need to handle late-arriving data (may need to adjust old aggregates). Store enough state to recompute if corruption occurs.
(5) For complex aggregates (percentiles, distinct counts), use approximate algorithms (t-digest, HyperLogLog) that support mergeable state. These sketches can be merged but not subtracted, so keep one sketch per day and merge the 90 daily sketches rather than applying the add-and-subtract update.

</details>

**Q5. [Staff]** You're designing data architecture for a new ML platform. The company has 50 data scientists who currently copy data to local notebooks. How do you convince leadership to invest in proper infrastructure?

<details>
<summary>Answer</summary>

Build the business case: (1) **Quantify the cost of current state**: How many hours/week do data scientists spend on data wrangling? What's the cost of delayed model deployments? Document actual incidents from training-serving skew. (2) **Estimate ROI**: Feature reuse (features built once, used by 10 teams), faster experimentation (days to hours), reduced incidents (fewer rollbacks). (3) **Propose incremental investment**: Don't ask for a $2M feature store on day one. Start with shared feature definitions, then add materialization, then online serving. Each stage shows value before the next investment. (4) **Find champions**: Identify one or two high-visibility projects that will be the initial adopters and success stories.

</details>

### Spot the Problem

**Problem 1. [IC2]** A data engineer's feature computation:

```python
def compute_user_features(user_id: str) -> dict:
    purchases = db.query(f"SELECT * FROM purchases WHERE user_id = '{user_id}'")
    return {
        "total_purchases": len(purchases),
        "avg_amount": sum(p.amount for p in purchases) / len(purchases),
        "days_since_last": (datetime.now() - max(p.timestamp for p in purchases)).days
    }
```

What's wrong?

<details>
<summary>Answer</summary>

Multiple issues: (1) **SQL injection vulnerability** from string formatting (use parameterized queries). (2) **No time boundary**: For training, this computes features as of NOW, not as of the training example's timestamp. Creates leakage. (3) **Division by zero**: If no purchases, `len(purchases)` is 0. (4) **Empty list handling**: `max()` on empty list raises exception.
(5) **No caching or batching**: Called per-user, inefficient for bulk computation. (6) **Non-deterministic**: `datetime.now()` means feature values change on every call, making reproduction impossible.

</details>

---

## Further Reading

### Essential

- **Hermann & Del Balso (2017), "Michelangelo"** - Uber's engineering blog post that introduced feature stores to the ML community.
- **Kleppmann (2017), "Designing Data-Intensive Applications"** - Essential background on data systems.
- **Sculley et al. (2015), "Hidden Technical Debt in ML Systems"** - Data-related technical debt awareness.

### Deep Dives

- **Polyzotis et al. (2018), "Data Management Challenges in Production ML"** - Academic treatment of ML data management.
- **Schelter et al. (2018), "Automating Large-Scale Data Quality Verification"** - Paper behind Amazon's Deequ framework.
- **Huyen (2021), "Feature Stores for ML"** - Survey comparing feature store approaches.

### Practical Resources

- **Feast documentation** (feast.dev) - Open-source feature store implementation reference.