Overview
Data in Machine Learning
All machine learning is fundamentally a data problem. No matter how sophisticated the model architecture, it cannot compensate for data that is poorly collected, incorrectly labeled, or improperly split. Understanding data — its types, distributions, quality, and pitfalls — is the first and most critical step in any ML project.
"Garbage in, garbage out." — Classic ML maxim
This section is organized into six focused topics:
The Data Pipeline
Before data reaches any model, it passes through a series of transformations. Understanding the full pipeline helps you avoid bugs and leakage:
flowchart LR
A["Raw Data\ncollection"] --> B["Exploratory\nData Analysis"]
B --> C["Data Cleaning\n(quality issues)"]
C --> D["Split\ntrain / val / test"]
D --> E["Feature Engineering\n& Preprocessing"]
E --> F["Model\nTraining"]
F --> G["Evaluation\non test set"]
style D fill:#1f3244,stroke:#58a6ff
style G fill:#1f3d1f,stroke:#3fb950
Critical Rule
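Always split BEFORE preprocessing. Computing statistics (mean, std, min, max) on the full dataset before splitting is data leakage: information about the validation and test sets flows into training and makes evaluation scores look better than they are. Fit all transformers on the training data only, then apply them unchanged to the validation and test sets.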
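The rule above can be illustrated with a minimal sketch that uses only the Python standard library. The helper names (`train_test_split`, `fit_standardizer`, `transform`) and the toy single-feature dataset are assumptions made for illustration, not part of any particular library; the point is only the order of operations: split first, fit statistics on the training portion, then reuse those statistics everywhere else.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero variance
    return mean, std


def transform(values, mean, std):
    """Apply previously fitted statistics; nothing is refit here."""
    return [(v - mean) / std for v in values]


# Hypothetical single-feature dataset, for illustration only.
values = [float(v) for v in range(1, 21)]

# 1. Split first.
train, test = train_test_split(values, test_fraction=0.25, seed=42)

# 2. Fit the statistics on the training split only.
mean, std = fit_standardizer(train)

# 3. Apply the same statistics to both splits.
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
```

With scikit-learn the same pattern is typically written as `scaler.fit(X_train)` followed by `scaler.transform(X_train)` and `scaler.transform(X_test)`. The scaled test values will generally not have exactly zero mean or unit variance, which is expected: the test set is being measured against statistics it had no part in producing.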
Key Data Repositories
| Source | Domain | Format |
|---|---|---|
| UCI ML Repository | General ML | CSV, ARFF |
| Kaggle Datasets | All domains | CSV, JSON |
| Hugging Face Datasets | NLP, Vision | Arrow, Parquet |
| OpenML | Benchmarks | ARFF |
| TensorFlow Datasets | Vision, NLP, Audio | TFRecord |
| Papers With Code | Research benchmarks | Various |
| Google Dataset Search | Web-wide | Various |