
Overview

Data in Machine Learning

All machine learning is fundamentally a data problem. No matter how sophisticated the model architecture, it cannot compensate for data that is poorly collected, incorrectly labeled, or improperly split. Understanding data — its types, distributions, quality, and pitfalls — is the first and most critical step in any ML project.

"Garbage in, garbage out." — Classic ML maxim

This section is organized into six focused topics.


The Data Pipeline

Before data reaches a model, it passes through a series of transformations. Understanding the full pipeline, and the order of its stages, helps you avoid bugs and leakage:

flowchart LR
    A["Raw Data\ncollection"] --> B["Exploratory\nData Analysis"]
    B --> C["Data Cleaning\n(quality issues)"]
    C --> D["Split\ntrain / val / test"]
    D --> E["Feature Engineering\n& Preprocessing"]
    E --> F["Model\nTraining"]
    F --> G["Evaluation\non test set"]

    style D fill:#1f3244,stroke:#58a6ff
    style G fill:#1f3d1f,stroke:#3fb950
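
The cleaning and splitting stages of the flowchart can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a prescribed implementation: the helper names `clean` and `split`, the 70/15/15 proportions, and the toy `raw` records are all assumptions made for the example.

```python
import random


def clean(rows):
    """Drop records that contain a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r)]


def split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a copy of the rows, then cut it into train / val / test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical toy data: (feature, label) pairs, two of them with a missing feature.
raw = [(float(i), i % 2) for i in range(100)] + [(None, 0), (None, 1)]

cleaned = clean(raw)                # Data Cleaning
train, val, test = split(cleaned)   # Split, before any feature engineering

print(len(train), len(val), len(test))
```

Feature engineering and preprocessing would then be fitted on `train` alone, with `test` held back until the final evaluation.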

Critical Rule

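Always split BEFORE preprocessing. Computing statistics (mean, std, min, max) on the full dataset and then splitting is data leakage, because the validation and test rows influence the transformation the model is trained on. Fit all transformers on training data only, then apply them unchanged to the validation and test sets.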
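
As a concrete illustration of the rule, the sketch below standardizes a single feature with statistics learned from the training rows only. It uses the standard library rather than any particular ML framework; the function names `fit_standardizer` and `standardize`, the synthetic data, and the 80/20 split are assumptions made for the example.

```python
import random
import statistics


def fit_standardizer(values):
    """Learn the mean and standard deviation from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]


# Hypothetical one-feature dataset.
rng = random.Random(42)
data = [rng.gauss(50.0, 10.0) for _ in range(200)]

# 1. Split first.
rng.shuffle(data)
train, test = data[:160], data[160:]

# 2. Fit the transformer on training data only.
mean, std = fit_standardizer(train)

# 3. Apply the same statistics to every split.
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Leaky variant, shown only for contrast: these statistics have seen the test rows.
leaky_mean, leaky_std = fit_standardizer(data)
print(f"train-only mean: {mean:.3f}, full-data mean: {leaky_mean:.3f}")
```

The scaled training data has zero mean and unit variance by construction, while the test data generally does not, which is expected. Libraries such as scikit-learn express the same pattern by calling `fit` on the training set and `transform` on the other splits.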


Key Data Repositories

| Source | Domain | Format |
|---|---|---|
| UCI ML Repository | General ML | CSV, ARFF |
| Kaggle Datasets | All domains | CSV, JSON |
| Hugging Face Datasets | NLP, Vision | Arrow, Parquet |
| OpenML | Benchmarks | ARFF |
| TensorFlow Datasets | Vision, NLP, Audio | TFRecord |
| Papers With Code | Research benchmarks | Various |
| Google Dataset Search | Web-wide | Various |