Train / Validation / Test Split
Splitting data correctly is not a technicality — it is the experimental design of machine learning. A wrong split gives you a broken experiment: results that look good in development but fail in production.
The Three-Way Split
Each set has a distinct, non-overlapping role:
| Set | Used for | How many times? |
|---|---|---|
| Train | Fitting model weights | Many iterations |
| Validation | Tuning hyperparameters, early stopping, model selection | Many times |
| Test | Reporting final performance | Exactly once |
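The three roles in the table can be sketched as two successive calls to `train_test_split`. This is a minimal illustration, not a prescribed recipe: the synthetic data, the 60/20/20 proportions, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only: 1000 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Step 1: carve off the test set (20%) and leave it untouched until the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remainder into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Splitting off the test set first means that every later decision (hyperparameters, early stopping, model choice) is made without access to it.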
The Golden Rule
If you look at test set performance and then change anything in your model, the test set is no longer a valid measure of generalization. You have been tuning on it implicitly.
Cross-Validation
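When data is scarce, a single train/val split wastes data, and the validation score depends heavily on which samples happen to land in the validation set. K-fold cross-validation rotates the validation role: each sample is used for validation exactly once and for training in the other K-1 folds.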
```
K=5 folds:
Fold 1: [VAL][TRN][TRN][TRN][TRN]
Fold 2: [TRN][VAL][TRN][TRN][TRN]
Fold 3: [TRN][TRN][VAL][TRN][TRN]
Fold 4: [TRN][TRN][TRN][VAL][TRN]
Fold 5: [TRN][TRN][TRN][TRN][VAL]
```
Report the final metric as mean ± standard deviation across folds. This is more reliable than a single split because it averages out the luck of any one partition.
```python
from sklearn.model_selection import cross_val_score, KFold

# `model` is any scikit-learn estimator; `X` and `y` are the features and labels.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='f1_macro')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
```
Note: Even with cross-validation, keep a held-out test set that is never used during cross-validation.
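One way to combine the two ideas is to set the test set aside first and run cross-validation only on what remains. The sketch below assumes a synthetic dataset from `make_classification` and a `LogisticRegression` model, both chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out the test set before any cross-validation happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validate on the development portion only.
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=kf, scoring="f1_macro")
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Once all modelling decisions are final: refit on the development data
# and evaluate on the test set a single time.
model.fit(X_dev, y_dev)
test_f1 = f1_score(y_test, model.predict(X_test), average="macro")
print(f"Test F1: {test_f1:.3f}")
```

The cross-validation scores guide model selection; the single test score is the number to report.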
Stratified Split
For classification, a random split may produce very different class distributions across folds, especially when classes are imbalanced. A stratified split preserves the class proportions in each fold.
```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified k-fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```
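Use stratify=y by default for classification tasks. The exceptions are cases where another constraint takes priority, such as time-ordered data (split by time instead), or where a class has too few samples to appear in every split.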
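The effect of stratification is easiest to see on an imbalanced label vector. The sketch below uses a made-up dataset with 10% positives (an assumption for illustration) and compares the positive rate in the test set with and without `stratify=y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 900 negatives, 100 positives (10% positive).
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Plain random split: the positive rate in the test set can drift from 10%.
_, _, _, y_test_random = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stratified split: the test set keeps the overall class proportions.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("positive rate, random split:    ", y_test_random.mean())
print("positive rate, stratified split:", y_test_strat.mean())
```

With a minority class this small, a few samples more or fewer in the test set can noticeably shift metrics such as recall, which is why the stratified variant is the safer default.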
Temporal Split for Time Series
Random splits are incorrect for time series — they allow future information to leak into training. Always split by time:
```python
# For a DataFrame `df` with a sorted DatetimeIndex
split_date = '2024-01-01'
train = df[df.index < split_date]   # everything before the cutoff
test = df[df.index >= split_date]   # the cutoff date onward
```
For cross-validation of time series, use TimeSeriesSplit:
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
# Each fold: train on past, validate on immediate future
```
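To see what "train on past, validate on immediate future" means in practice, the folds can be printed for a small toy series. The 12-element array below is an assumption chosen only so the indices are easy to read.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (toy data; row order is the time order).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Every training index precedes every validation index.
    print(f"Fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

The training window grows with each fold while the validation window moves forward, so no fold is ever validated on data older than what it was trained on.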
How Much Data for Each Set?
| Dataset size (samples) | Recommended split (train/val/test, %) | Rationale |
|---|---|---|
| Small (< 10k) | 60/20/20 or use CV | More validation/test data for reliable estimates |
| Medium (10k–1M) | 70/15/15 or 80/10/10 | More training data improves model |
| Large (1M–100M) | 98/1/1 | 1% of 1M is already 10k samples, plenty for evaluation |
| Very large (> 100M) | 99/0.5/0.5 | Even 0.5% gives at least 500k evaluation samples |
For large datasets, what matters is the absolute size of the test set, not its percentage: it needs enough samples to give statistically reliable estimates.
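A rough way to judge whether a test set is "large enough" is to look at the uncertainty of the reported metric. The sketch below uses the normal approximation to the binomial for accuracy, assuming independent test samples; the helper name `accuracy_ci_half_width` and the 0.90 accuracy are hypothetical, chosen for illustration.

```python
import math

def accuracy_ci_half_width(accuracy: float, n_test: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for accuracy.

    Uses the normal approximation to the binomial and assumes the
    test samples are independent.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Hypothetical measured accuracy of 0.90 at several test-set sizes.
for n in (1_000, 10_000, 500_000):
    half_width = accuracy_ci_half_width(0.90, n)
    print(f"n_test={n:>7,}: 0.90 ± {half_width:.4f}")
```

Because the interval shrinks with the square root of the test-set size, growing the test set from 10k to 500k samples narrows the estimate by only about a factor of seven, which is why very large datasets can afford to give the test set a tiny percentage.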