Data Leakage

Data leakage is the silent model-killer. It occurs when information from outside the training set is inadvertently used to create the model, causing it to appear far better than it actually is. Models with leakage can achieve near-perfect validation accuracy, pass all tests, and then fail completely in production.

Data leakage is a leading cause of irreproducible ML results

Leakage can be subtle enough that experienced practitioners miss it. It has been implicated in retracted machine learning papers and in failed production deployments.

Types of Data Leakage

1. Target Leakage

Using features that are consequences of the target rather than causes of it, so their values would not be known when the prediction is made.

Example: Predicting whether a patient will be prescribed antibiotic X.

Feature	Leakage?	Reason
Age, blood pressure	✅ No	Exist before diagnosis
`took_antibiotic_x` flag	❌ YES	Caused by the prescription
`pharmacy_visit_date`	❌ YES	Happens after prescription
`doctor_recommendation_score`	❌ YES	Part of the decision

The model learns "if took_antibiotic_x = True, then prescription = True" — a circular, useless rule.

Prevention: For each feature, ask: "Does this value exist at prediction time?" If the answer is "sometimes" or "it depends", treat it with suspicion.
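
The prevention question can be turned into a simple audit. The sketch below assumes a hand-maintained mapping (the hypothetical `FEATURE_AVAILABILITY` dict) that records whether each column is known before the prediction is made. The feature names follow the table above; the mapping, the `drop_leaky_features` helper, and the toy data are illustrative and not part of any library.

```python
import pandas as pd

# Hypothetical, hand-maintained audit: is each column known BEFORE the
# prescription decision is made?
FEATURE_AVAILABILITY = {
    "age": True,
    "blood_pressure": True,
    "took_antibiotic_x": False,            # caused by the prescription
    "pharmacy_visit_date": False,          # happens after the prescription
    "doctor_recommendation_score": False,  # part of the decision itself
}

def drop_leaky_features(df: pd.DataFrame, availability: dict) -> pd.DataFrame:
    """Keep only columns explicitly marked as available at prediction time.

    Columns missing from the audit are dropped too: unknown is treated as leaky.
    """
    safe = [col for col in df.columns if availability.get(col, False)]
    return df[safe]

# Toy data, made up for illustration.
patients = pd.DataFrame({
    "age": [34, 61],
    "blood_pressure": [120, 145],
    "took_antibiotic_x": [True, False],
    "pharmacy_visit_date": ["2024-01-03", None],
    "unreviewed_column": [1, 2],
})
safe_patients = drop_leaky_features(patients, FEATURE_AVAILABILITY)
print(list(safe_patients.columns))
```

Treating unaudited columns as leaky is a deliberately conservative choice here: a new column has to be reviewed before the model is allowed to see it.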

2. Train-Test Contamination

Allowing information from the test set to influence the training process.

The most common form: fitting a preprocessor (scaler, imputer, encoder) on the full dataset before splitting.

Wrong code:

# ❌ LEAKAGE — scaler sees test data
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X_all)          # uses ALL data
X_train, X_test = train_test_split(X_all_scaled)

Correct code:

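from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ✅ CORRECT: split first, so the scaler only sees training data
X_train, X_test = train_test_split(X_all)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)             # fit + transform on train only
X_test  = scaler.transform(X_test)                  # transform only, reusing train statistics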

Using Pipelines (recommended):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)     # scaler.fit is only called on X_train
pipe.score(X_test, y_test)     # scaler.transform is called on X_test

Scikit-learn Pipelines are the idiomatic way to prevent train-test contamination.
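
The same protection carries over to cross-validation, where manual preprocessing is easy to get wrong. The following is a minimal sketch on synthetic data (generated with `make_classification` purely so the example runs). Passing the whole pipeline to `cross_val_score` means the scaler is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# For each of the 5 folds, cross_val_score clones the pipeline and fits it
# (scaler included) on that fold's training portion only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling `X` once up front and then cross-validating only the model would reintroduce the contamination described above, because every fold's held-out rows would have contributed to the scaler's statistics.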

3. Temporal Leakage

In time series problems, training on or building features from information that only becomes available after the moment being predicted.

         Past                    Future
── [x₁, x₂, x₃] ──predict──► [x₄] ──────────►
                               ↑
                    x₄ must NOT appear in the inputs used to predict it!

Wrong: Random split on a time series dataset. A sample at t=100 can land in the training set while samples at t=95–99 land in the test set, so the model is evaluated on points that precede data it was trained on.

Correct: Always use a temporal split — all training data comes strictly before the validation/test period.

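from sklearn.model_selection import train_test_split

# ❌ Wrong for time series: a random split mixes past and future rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ✅ Correct for time series (assumes X and y are sorted from oldest to newest)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]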
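
When more than one validation fold is needed, the same rule can be applied repeatedly. The sketch below uses scikit-learn's `TimeSeriesSplit` on toy data and assumes the rows are already sorted from oldest to newest. It is one way to get temporal cross-validation, not the only one.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 10 rows, assumed to be sorted from oldest to newest.
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    # In every fold, all training indices come before all test indices.
    print("train:", train_idx, "test:", test_idx)
```

Each successive fold trains on a longer prefix of the series and validates on the block that immediately follows it, so no fold is ever scored on data older than what it was trained on.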

4. Feature Engineering Leakage

Computing features that aggregate information from the full dataset — including test rows.

# ❌ LEAKAGE: group statistics computed on full dataset
df['user_avg_spend'] = df.groupby('user_id')['spend'].transform('mean')

# ✅ CORRECT: compute on training set, merge into test
train_means = X_train.groupby('user_id')['spend'].mean().rename('user_avg_spend')
X_train = X_train.merge(train_means, on='user_id', how='left')
X_test  = X_test.merge(train_means, on='user_id', how='left')   # uses train stats; users unseen in training get NaN
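
A left merge leaves missing values for test users that never appear in the training set. The following sketch, on made-up toy data, shows one possible way to handle that: map the training-set means onto both frames and fall back to the global training mean. The fallback choice is an assumption for illustration, not a rule.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
train = pd.DataFrame({"user_id": [1, 1, 2], "spend": [10.0, 30.0, 5.0]})
test = pd.DataFrame({"user_id": [1, 3], "spend": [99.0, 42.0]})

# Per-user and global statistics come from the training rows only.
train_means = train.groupby("user_id")["spend"].mean()
global_mean = train["spend"].mean()

# map() keeps each frame's index and row order; users unseen in training
# get NaN, which is replaced by the global training mean.
train["user_avg_spend"] = train["user_id"].map(train_means)
test["user_avg_spend"] = test["user_id"].map(train_means).fillna(global_mean)
print(test)
```

Note that the test rows' own `spend` values never enter the feature: user 1 gets the training average, and the unseen user 3 gets the global training average.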

Leakage Detection Checklist

**🔍 Suspiciously high performance?** If your model achieves >95% accuracy on a hard problem, suspect leakage before celebrating.

**Checklist:**
- [ ] Does each feature exist at **prediction time** in production?
- [ ] Is the **scaler/imputer fit only on train data**?
- [ ] For time series: is the split **strictly temporal**?
- [ ] Do any features **correlate perfectly** (>0.99) with the target?
- [ ] Are there any **future timestamps** in "past" features?
- [ ] Did you compute **group aggregates** across the full dataset?
- [ ] Is performance **too good** on validation but poor when deployed?
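
One item on the checklist, the near-perfect correlation check, is easy to automate. The sketch below assumes numeric features and a numeric target. The helper name `flag_suspicious_features` and the synthetic data are hypothetical, and Pearson correlation will only catch linear relationships.

```python
import numpy as np
import pandas as pd

def flag_suspicious_features(X: pd.DataFrame, y: pd.Series, threshold: float = 0.99) -> list:
    """Return numeric columns whose absolute Pearson correlation with y exceeds threshold."""
    corr = X.select_dtypes(include="number").corrwith(y).abs()
    return corr[corr > threshold].index.tolist()

# Synthetic data: one pure-noise feature and one that is a function of the target.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=200), name="target")
X = pd.DataFrame({
    "noise": rng.normal(size=200),
    "leaky_copy": y * 2.0 + 1.0,   # deterministic function of the target
})
print(flag_suspicious_features(X, y))
```

A flagged column is a prompt for investigation rather than proof of leakage, and a clean result does not rule leakage out, since leaky features can relate to the target nonlinearly or only in combination.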

A Famous Real Example: the Heritage Health Prize

The 2011 Heritage Health Prize (a $3M competition) reportedly had multiple top-performing teams disqualified for leakage. One team is said to have achieved AUC=0.98 on validation, far above the human baseline, by accidentally using a feature derived from the target variable. When the error was discovered and the feature removed, performance dropped to AUC=0.76.

The lesson: extraordinary results require extraordinary scrutiny of data.