Data Quality

Real-world data is messy. Before training any model, you must understand the quality of your data, identify problems, and decide how to address them. Poor data quality is often harder to fix than a poor model choice.

"Data scientists spend 60–80% of their time cleaning and preparing data." — Common industry estimate

Missing Values

Missing data (NaN, NULL, None) is one of the most common data quality issues. But not all missing data is the same: the mechanism of missingness determines the right strategy.

Mechanism	Definition	Example	Strategy
MCAR (Missing Completely At Random)	Missingness is unrelated to any value, observed or missing	Sensor failure at random times	Deletion is unbiased but discards data; simple imputation is usually acceptable
MAR (Missing At Random)	Missingness depends only on other observed variables	Younger people report income less often (age is observed)	Impute using the other features
MNAR (Missing Not At Random)	Missingness depends on the missing value itself	High-income people are less likely to report income	Very hard; usually needs domain knowledge
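
To make the three mechanisms concrete, the following sketch simulates each one on synthetic data and compares the mean of the observed incomes with the true mean. The population, the column names `age` and `income`, and every probability are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population: income rises with age (all numbers are made up).
age = rng.integers(20, 70, size=n)
income = 20_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every row has the same 30% chance of losing its income value.
mcar_mask = rng.random(n) < 0.30

# MAR: missingness depends only on age, which is always observed.
mar_mask = rng.random(n) < np.where(df["age"] < 35, 0.60, 0.10)

# MNAR: missingness depends on the income value itself.
high_income = df["income"] > df["income"].quantile(0.75)
mnar_mask = rng.random(n) < np.where(high_income, 0.60, 0.10)

true_mean = df["income"].mean()
observed_means = {
    name: df.loc[~mask, "income"].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}

print(f"True mean income: {true_mean:,.0f}")
for name, mean in observed_means.items():
    print(f"{name}: observed mean {mean:,.0f} (bias {mean - true_mean:+,.0f})")
```

In this simulation the MCAR estimate stays close to the true mean, while the MAR and MNAR estimates drift in opposite directions. The MAR bias can be corrected by conditioning on age, because age is observed; the MNAR bias cannot be corrected from the observed data alone.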

Detecting Missing Values

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# df is assumed to be an existing pandas DataFrame

# Count missing values per column
print(df.isnull().sum())
print(df.isnull().mean() * 100)  # as percentage

# Heatmap of missing patterns (rows are samples, columns are features)
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.show()
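
Counts and heatmaps show where values are missing, but not why. One rough check, sketched here on toy data with assumed column names `age_group` and `income`, is to compare the missing rate of one column across the groups of another observed column.

```python
import numpy as np
import pandas as pd

# Toy data: income is missing more often for the "18-29" group.
df = pd.DataFrame({
    "age_group": ["18-29"] * 4 + ["30-49"] * 4,
    "income": [np.nan, np.nan, 31_000, np.nan, 52_000, 48_000, np.nan, 61_000],
})

# Share of missing income values within each age group.
missing_rate = df["income"].isnull().groupby(df["age_group"]).mean()
print(missing_rate)
```

A large gap between groups suggests the data is not MCAR. This check cannot separate MAR from MNAR, since that would require knowing the missing values themselves.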

Imputation Strategies

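from sklearn.impute import SimpleImputer, KNNImputer

# Option 1: mean imputation
mean_imp = SimpleImputer(strategy='mean')
X_train_mean = mean_imp.fit_transform(X_train)
X_test_mean  = mean_imp.transform(X_test)  # use train statistics!

# Option 2: KNN imputation (often more accurate, slower; distance-based, so scale features first)
# Fit it on the original X_train, not on the mean-imputed copy.
knn_imp = KNNImputer(n_neighbors=5)
X_train_knn = knn_imp.fit_transform(X_train)
X_test_knn  = knn_imp.transform(X_test)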
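
The same "fit on train only" rule applies inside cross-validation. A minimal sketch, assuming synthetic data and a logistic regression model chosen only for illustration, wraps the imputer in a scikit-learn pipeline so that it is refit on the training part of every fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 200 rows, 3 features, binary target.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out about 20% of the first feature's values.
X[rng.random(200) < 0.20, 0] = np.nan

# The imputer is refit on the training part of every fold, so no statistics
# leak from the held-out part. add_indicator=True appends a 0/1 column that
# marks which values were originally missing.
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(),
)
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

The missing-value indicator lets the model use the fact that a value was absent, which can help when missingness itself carries signal.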

Outliers

Outliers are data points that deviate significantly from the rest. They can be:

Genuine: extreme but valid observations (e.g., a billionaire in an income dataset)
Errors: measurement mistakes, data entry errors

Detection Methods

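import numpy as np

# Both methods apply to numeric columns only
num = df.select_dtypes(include='number')

# Z-score method
z_scores = np.abs((num - num.mean()) / num.std())
z_outliers = (z_scores > 3).any(axis=1)

# IQR method (more robust: quartiles are barely affected by extreme values)
Q1, Q3 = num.quantile(0.25), num.quantile(0.75)
IQR = Q3 - Q1
iqr_outliers = ((num < Q1 - 1.5*IQR) | (num > Q3 + 1.5*IQR)).any(axis=1)

print(f"Z-score outliers: {z_outliers.sum()} / {len(df)} ({z_outliers.mean()*100:.1f}%)")
print(f"IQR outliers: {iqr_outliers.sum()} / {len(df)} ({iqr_outliers.mean()*100:.1f}%)")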
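
The robustness difference between the two methods is easiest to see on a small sample. The eight readings below are invented for illustration; the thresholds (3 and 1.5) are the same ones used above.

```python
import pandas as pd

# Eight illustrative readings; 100 is the obvious outlier.
s = pd.Series([10, 11, 12, 11, 10, 12, 11, 100])

# Z-score: the outlier inflates the mean and standard deviation it is judged by.
z = (s - s.mean()).abs() / s.std()
z_flags = z > 3

# IQR: the quartiles barely move when one extreme value is present.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_flags = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print(f"Largest z-score: {z.max():.2f}")
print(f"Flagged by z-score: {s[z_flags].tolist()}")
print(f"Flagged by IQR:     {s[iqr_flags].tolist()}")
```

Here the z-score rule flags nothing while the IQR rule flags 100. With n points, no value can lie more than (n - 1) / sqrt(n) sample standard deviations from the mean, which is about 2.47 for n = 8, so a threshold of 3 can never trigger on a sample this small.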

Handling Strategies

Strategy	When to use
Remove	Clear measurement errors that affect a small percentage of rows
Cap/Winsorize	Extreme values should stay in the data but with limited influence; clip them to a chosen percentile
Transform (log, sqrt)	Right-skewed data with many high outliers
Keep	Genuine extreme values relevant to the task
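
As a sketch of the capping and transformation strategies, the example below uses a synthetic log-normal feature. The 1st and 99th percentile bounds and the 800/200 split are assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Right-skewed synthetic feature (log-normal), split into train and test.
values = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1_000))
train, test = values.iloc[:800], values.iloc[800:]

# Cap/winsorize: learn the clipping bounds on the training data only,
# then apply the same bounds to both splits.
lower, upper = train.quantile(0.01), train.quantile(0.99)
train_capped = train.clip(lower=lower, upper=upper)
test_capped = test.clip(lower=lower, upper=upper)

# Transform: log1p compresses the long right tail (requires values >= 0).
train_log = np.log1p(train)

print(f"Skew raw: {train.skew():.2f}, capped: {train_capped.skew():.2f}, "
      f"log: {train_log.skew():.2f}")
print(f"Max raw: {train.max():.1f}, capped: {train_capped.max():.1f}")
```

Learning the bounds on the training split mirrors the imputation rule: statistics derived from the test data should not influence preprocessing.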

Duplicates and Noise

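Duplicates (identical or near-identical rows) overweight the repeated examples and can cause overfitting. Copies that land in both the training and test splits also leak information and inflate evaluation scores: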

# Check duplicates
print(f"Duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

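# Near-duplicates (fuzzy): compute a pairwise similarity matrix and flag pairs above a threshold
from sklearn.metrics.pairwise import cosine_similarity
# Comparing all pairs costs O(n^2) time and memory, which is expensive for large datasets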
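
One possible way to carry out the all-pairs comparison on numeric features is sketched below. The toy data, the column names, and the 0.99 threshold are assumptions. Note that cosine similarity compares direction only, so after standardization two rows that deviate from the average in the same proportions but by different amounts would also score highly; a distance-based measure may suit some datasets better.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Illustrative data: row 3 is row 0 with tiny perturbations.
df = pd.DataFrame({
    "height_cm": [170.0, 182.0, 158.0, 170.1, 175.0],
    "weight_kg": [65.0, 90.0, 52.0, 65.1, 70.0],
    "age": [30, 45, 23, 30, 60],
})

# Standardize first so that no single feature dominates the comparison.
X = StandardScaler().fit_transform(df)
sim = cosine_similarity(X)  # n x n matrix: O(n^2) time and memory

# Keep the upper triangle (k=1) so each pair appears once and
# self-matches on the diagonal are excluded.
threshold = 0.99  # assumed cutoff; tune per dataset
pairs = np.argwhere(np.triu(sim, k=1) > threshold)

for i, j in pairs:
    print(f"Rows {i} and {j} look like near-duplicates (similarity {sim[i, j]:.4f})")
```

For large datasets, approximate methods such as hashing or blocking on a key column are usually needed in place of the full pairwise matrix.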

Noise — random errors in feature values or labels — is harder to detect. Strategies:

Label smoothing (soft targets instead of hard 0/1)
Data augmentation
Ensemble methods (averaging over noise)
Confident learning (identify likely mislabeled samples)
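
Of the strategies above, label smoothing is the simplest to show in code. The sketch below uses the common formulation in which each one-hot target is mixed with a uniform distribution over the classes; the helper name `smooth_labels` and the value epsilon = 0.1 are illustrative choices.

```python
import numpy as np

def smooth_labels(y, n_classes, epsilon=0.1):
    """Return soft targets: (1 - epsilon) * one_hot + epsilon / n_classes."""
    one_hot = np.eye(n_classes)[np.asarray(y)]
    return one_hot * (1.0 - epsilon) + epsilon / n_classes

# Three samples with integer class labels 0, 2, and 1.
y = np.array([0, 2, 1])
soft = smooth_labels(y, n_classes=3, epsilon=0.1)
print(soft)
```

Each row still sums to 1 and the labeled class keeps the largest probability, but the model is no longer pushed toward full certainty on labels that may be wrong.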

Data Quality Audit Checklist

def data_quality_report(df, target_col=None):
    """Print a quick data quality audit for a pandas DataFrame."""
    print("=== Data Quality Report ===")
    print(f"Shape: {df.shape}")
    missing = df.isnull().sum()  # compute once, then show only affected columns
    print("\nMissing values:")
    print(missing[missing > 0])
    print(f"\nDuplicates: {df.duplicated().sum()}")
    print(f"\nData types:\n{df.dtypes}")
    if target_col is not None:
        print(f"\nClass distribution:\n{df[target_col].value_counts(normalize=True)}")
    print("\nNumerical summary:")
    print(df.describe())
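
A printed report is convenient for reading, but a tabular summary is easier to filter or assert on. The helper below, with the hypothetical name `column_quality_summary`, is one possible companion to the report; the toy DataFrame exists only to show the output.

```python
import numpy as np
import pandas as pd

def column_quality_summary(df):
    """Return one row per column: dtype, percent missing, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isnull().mean() * 100,
        "n_unique": df.nunique(),
    })

# Toy data, for illustration only.
toy = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "city": ["Oslo", "Lima", "Lima", None, "Lima"],
    "label": [0, 1, 1, 0, 1],
})

summary = column_quality_summary(toy)
print(summary)
print(f"Duplicate rows: {toy.duplicated().sum()}")

# Example follow-up: list columns whose missing share exceeds an assumed 10% limit.
print(summary.index[summary["missing_pct"] > 10].tolist())
```

Returning a DataFrame makes it straightforward to apply thresholds like the one in the last line, or to track the same metrics across dataset versions.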