Data Leakage
Data leakage is the silent model-killer. It occurs when information from outside the training set is inadvertently used to create the model, causing it to appear far better than it actually is. Models with leakage can achieve near-perfect validation accuracy, pass all tests, and then fail completely in production.
Data leakage is a leading cause of irreproducible ML results
Leakage can be subtle enough that experienced practitioners miss it. It is responsible for a significant portion of retracted machine learning papers and failed production deployments.
Types of Data Leakage
1. Target Leakage
Using features that are consequences of the target rather than causes of it.
Example: Predicting whether a patient will be prescribed antibiotic X.
| Feature | Leakage? | Reason |
|---|---|---|
| Age, blood pressure | No | Exist before diagnosis |
| took_antibiotic_x flag | Yes | Caused by the prescription |
| pharmacy_visit_date | Yes | Happens after prescription |
| doctor_recommendation_score | Yes | Part of the decision |
The model learns "if took_antibiotic_x = True, then prescription = True", a circular, useless rule.
Prevention: For each feature, ask: "Does this value exist at prediction time?" If the answer is "sometimes" or "it depends", treat it with suspicion.
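The prediction-time question can be turned into a mechanical filter when each feature's availability is written down. The following is a minimal sketch, assuming a hand-maintained mapping from feature name to the event after which its value is known. The `FEATURE_AVAILABLE_AFTER` mapping, the event names in `EVENT_ORDER`, and the `usable_features` helper are all hypothetical and not part of any library.

```python
# Hypothetical metadata: the event at which each feature's value becomes known.
FEATURE_AVAILABLE_AFTER = {
    "age": "intake",
    "blood_pressure": "intake",
    "doctor_recommendation_score": "decision",
    "took_antibiotic_x": "prescription",
    "pharmacy_visit_date": "prescription",
}

# Events in the order they occur for one patient (an assumption for this example).
EVENT_ORDER = ["intake", "decision", "prescription"]


def usable_features(prediction_event: str) -> list[str]:
    """Return features whose values exist strictly before `prediction_event`."""
    cutoff = EVENT_ORDER.index(prediction_event)
    return [
        name
        for name, event in FEATURE_AVAILABLE_AFTER.items()
        if EVENT_ORDER.index(event) < cutoff
    ]


# The model predicts the prescription decision, so only intake features survive.
safe = usable_features("decision")
print(safe)
```

Keeping this metadata next to the feature definitions makes the "sometimes" and "it depends" cases visible, because someone has to decide which event each feature belongs to.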
2. Train-Test Contamination
Allowing information from the test set to influence the training process.
The most common form: fitting a preprocessor (scaler, imputer, encoder) on the full dataset before splitting.
Wrong code:
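from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# LEAKAGE: scaler sees test data
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X_all)  # mean and std computed from ALL rows
X_train, X_test = train_test_split(X_all_scaled)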
Correct code:
# CORRECT: scaler only sees training data
X_train, X_test = train_test_split(X_all)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # fit + transform on train only
X_test = scaler.transform(X_test) # transform only on test
Using Pipelines (recommended):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
pipe.fit(X_train, y_train) # scaler.fit is only called on X_train
pipe.score(X_test, y_test) # scaler.transform is called on X_test
Scikit-learn Pipelines are the idiomatic way to prevent train-test contamination.
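The same protection carries over to cross-validation, where every fold needs its own freshly fitted preprocessor. The sketch below assumes synthetic data from `make_classification` as a stand-in for a real dataset. Passing the whole pipeline to `cross_val_score` means the scaler is refit inside each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# cross_val_score clones and refits the full pipeline on each training fold,
# so the scaler never sees the rows held out for that fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling `X` once up front and then cross-validating only the model would reintroduce the contamination described above, one fold at a time.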
3. Temporal Leakage
In time series problems, using future information to predict the past.
Past                                   Future
[x₁, x₂, x₃] --predict--> [x₄] ----------->
Must NOT see any later value when training to predict an earlier one!
Wrong: A random split on a time series dataset. A sample at time t=100 can end up in the training set while samples at t=95 to 99 land in the test set, so the model is evaluated on points that precede data it was trained on.
Correct: Always use a temporal split, in which all training data comes strictly before the validation/test period.
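from sklearn.model_selection import train_test_split
# WRONG for time series: a random split mixes past and future rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# CORRECT for time series: chronological split (assumes rows are sorted by time)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]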
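For cross-validation on ordered data, scikit-learn's `TimeSeriesSplit` applies the same rule to every fold. The sketch below assumes a toy series of ten observations already sorted by time, and it prints the indices so the ordering can be inspected.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 10 observations, already sorted by time (an assumption).
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    # In each fold, every training index precedes every test index.
    print("train:", train_idx, "test:", test_idx)
```

Each successive fold extends the training window forward, so no fold trains on observations that come after its test period.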
4. Feature Engineering Leakage
Computing features that aggregate information from the full dataset β including test rows.
# LEAKAGE: group statistics computed on the full dataset, test rows included
df['user_avg_spend'] = df.groupby('user_id')['spend'].transform('mean')
# CORRECT: compute on the training set only, then merge into both splits
train_means = X_train.groupby('user_id')['spend'].mean().rename('user_avg_spend')
X_train = X_train.merge(train_means, on='user_id', how='left')
X_test = X_test.merge(train_means, on='user_id', how='left')  # uses train stats; users unseen in training get NaN
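A left merge against training-only statistics leaves missing values for any user who never appears in the training split. The sketch below runs the same pattern on toy data and fills those gaps with the training-wide mean. The toy rows and the global-mean fallback are assumptions made for illustration, and other fallbacks may suit a given problem better.

```python
import pandas as pd

# Toy data (hypothetical): user "c" appears only in the test split.
X_train = pd.DataFrame({"user_id": ["a", "a", "b"], "spend": [10.0, 20.0, 30.0]})
X_test = pd.DataFrame({"user_id": ["a", "c"], "spend": [100.0, 50.0]})

# Statistics come from training rows only.
train_means = (
    X_train.groupby("user_id")["spend"].mean().rename("user_avg_spend").reset_index()
)
global_mean = X_train["spend"].mean()

X_train = X_train.merge(train_means, on="user_id", how="left")
X_test = X_test.merge(train_means, on="user_id", how="left")

# Users unseen in training have no group mean; fall back to the training-wide mean.
X_test["user_avg_spend"] = X_test["user_avg_spend"].fillna(global_mean)
print(X_test)
```

The test-set spend values (100.0 and 50.0) never enter `train_means` or `global_mean`, so the feature carries no information from the test rows.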
Leakage Detection Checklist
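One possible automated check is a single-feature screen: a feature that predicts the target almost perfectly by itself deserves a manual review. This is a rough sketch, assuming a binary target, numeric features, and an arbitrary 0.95 AUC threshold. The `suspicious_features` helper and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def suspicious_features(X: pd.DataFrame, y: pd.Series, threshold: float = 0.95) -> list[str]:
    """Flag numeric features whose single-feature AUC is at or above `threshold`."""
    flagged = []
    for col in X.columns:
        auc = roc_auc_score(y, X[col])
        # A feature that ranks the target in reverse is just as suspicious.
        if max(auc, 1.0 - auc) >= threshold:
            flagged.append(col)
    return flagged


# Toy data (hypothetical): `took_antibiotic_x` copies the target, `age` is noise.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=200))
X = pd.DataFrame({
    "age": rng.normal(50, 10, size=200),
    "took_antibiotic_x": y.to_numpy(),
})

flagged = suspicious_features(X, y)
print(flagged)
```

A flagged feature is not proof of leakage, and a clean result does not rule it out. The screen only points at columns worth tracing back to their source.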
A Famous Real Example: the Heritage Health Prize
The 2011 Heritage Health Prize ($3M competition) had multiple top-performing teams disqualified for leakage. One team achieved AUC=0.98 on validation, far above human baseline, by accidentally using a feature derived from the target variable. When the error was discovered and the feature removed, performance dropped to AUC=0.76.
The lesson: extraordinary results require extraordinary scrutiny of data.