Train/Val/Test Split

Splitting data correctly is not a technicality — it is the experimental design of machine learning. A wrong split gives you a broken experiment: results that look good in development but fail in production.

The Three-Way Split

Each set has a distinct, non-overlapping role:

Set	Used for	How many times?
Train	Fitting model weights	Many iterations
Validation	Tuning hyperparameters, early stopping, model selection	Many times
Test	Reporting final performance	Exactly once
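One common way to obtain the three sets is to call scikit-learn's train_test_split twice. The sketch below is illustrative only: the synthetic data, the 60/20/20 ratio, and the variable names are assumptions, not part of the original text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (assumption): 1,000 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Step 1: carve off the test set (20% of all data) and set it aside.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remainder into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Splitting off the test set first makes it harder to touch by accident during development.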

The Golden Rule

If you look at test set performance and then change anything in your model, the test set is no longer a valid measure of generalization. You have been tuning on it implicitly.
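A workflow that respects this rule might look like the following sketch. The dataset (make_classification), the model (LogisticRegression), and the candidate values of C are all assumptions chosen for illustration. The point is only the order of operations: every decision is made on the validation set, and the test set is scored once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data and a 60/20/20 split (illustrative assumptions).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

# Model selection: only the validation set is consulted here.
candidate_Cs = [0.01, 0.1, 1.0, 10.0]
best_C, best_val_acc = None, -1.0
for C in candidate_Cs:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Final report: the test set is evaluated exactly once, after all choices are fixed.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"best C={best_C}, val acc={best_val_acc:.3f}, test acc={test_acc:.3f}")
```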

Cross-Validation

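When data is scarce, a single train/val split wastes data and yields a noisy estimate. K-fold cross-validation rotates the validation role: the data is divided into K folds, and each fold serves as the validation set once while the other K-1 folds are used for training: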

K=5 folds:
Fold 1: [VAL][TRN][TRN][TRN][TRN]
Fold 2: [TRN][VAL][TRN][TRN][TRN]
Fold 3: [TRN][TRN][VAL][TRN][TRN]
Fold 4: [TRN][TRN][TRN][VAL][TRN]
Fold 5: [TRN][TRN][TRN][TRN][VAL]

The final metric is reported as mean ± standard deviation across folds, which is more reliable than a score from a single split.

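from sklearn.model_selection import cross_val_score, KFold

# Assumes `model` is an unfitted scikit-learn estimator and X, y hold
# the features and labels (with the test set already held out).
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='f1_macro')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")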

Note: Even with cross-validation, keep a held-out test set that is never used during cross-validation.
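As a rough sketch of that note, the test set can be split off first so that cross-validation only ever sees the remaining development data. The synthetic dataset, the LogisticRegression model, and the names X_dev and y_dev are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (assumption), with 20% held out before any cross-validation.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validate on the development portion only.
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=kf, scoring="f1_macro")
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Once the model is final, refit on all development data and score the test set once.
model.fit(X_dev, y_dev)
test_f1 = f1_score(y_test, model.predict(X_test), average="macro")
print(f"Test F1: {test_f1:.3f}")
```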

Stratified Split

For classification, a purely random split can produce noticeably different class distributions across subsets, especially when classes are imbalanced. A stratified split preserves the overall class proportions in each fold or subset.

from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified k-fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

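Use stratify=y by default for classification tasks. The exceptions are time-ordered data (see the temporal split below) and classes with fewer than two samples, for which scikit-learn cannot stratify.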
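To see what stratification does, the sketch below compares the positive-class rate in a stratified and an unstratified test set. The 95/5 class imbalance and the array names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (assumption): 950 negatives, 50 positives.
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

# Same split size and seed, with and without stratification.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
_, _, _, y_test_rand = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("positive rate overall:      ", y.mean())
print("positive rate, stratified:  ", y_test_strat.mean())
print("positive rate, unstratified:", y_test_rand.mean())
```

The stratified test set reproduces the overall 5% positive rate, while the unstratified rate depends on the random seed and can drift away from it.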

Temporal Split for Time Series

Random splits are incorrect for time series — they allow future information to leak into training. Always split by time:

# For a DataFrame `df` with a DatetimeIndex
split_date = '2024-01-01'
train = df[df.index < split_date]
test  = df[df.index >= split_date]

For cross-validation of time series, use TimeSeriesSplit:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
# Each fold: train on past, validate on immediate future
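Printing the indices produced by TimeSeriesSplit makes the "train on past, validate on future" pattern visible. This is a small sketch on a placeholder array of 12 time-ordered observations, with 3 splits instead of 5 to keep the output short. Both numbers are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data (assumption): 12 observations, already sorted by time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Every training index precedes every validation index.
    print(f"Fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

The training window grows from fold to fold, and no validation index ever precedes a training index.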

How Much Data for Each Set?

Dataset size (samples)	Recommended split (train/val/test %)	Rationale
Small (< 10k)	60/20/20 or use CV	More validation/test data for reliable estimates
Medium (10k–1M)	70/15/15 or 80/10/10	More training data improves model
Large (> 1M)	98/1/1	1% of 1M = 10k samples, plenty for evaluation
Very large (> 100M)	99/0.5/0.5	Even 0.5% gives 500k evaluation samples

For large datasets, what matters is the absolute number of test samples, not the percentage: the test set only needs to be large enough to give statistically reliable estimates.
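One rough way to judge whether a test set is large enough is the normal-approximation margin of error for an accuracy estimate, z * sqrt(p(1 - p) / n). The sketch below assumes independent test samples and an illustrative true accuracy of 0.9. The helper name accuracy_margin is hypothetical, not a library function.

```python
import math


def accuracy_margin(n_test: int, accuracy: float = 0.9, z: float = 1.96) -> float:
    """Approximate 95% half-width for an accuracy measured on n_test samples.

    Uses the normal approximation to the binomial; assumes i.i.d. test samples.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# Margin of error for a few test-set sizes (accuracy of 0.9 is an assumption).
for n in (1_000, 10_000, 500_000):
    print(f"n_test={n:>7,}: ±{accuracy_margin(n):.4f}")
```

Under these assumptions, 10k test samples already pin accuracy down to roughly ±0.6 percentage points, which is why a 1% test share can be enough at large scale.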