Data Quality

Real-world data is messy. Before training any model, you must understand the quality of your data, identify problems, and decide how to address them. Poor data quality is often harder to fix than a poor model choice.

"Data scientists spend 60–80% of their time cleaning and preparing data." β€” Common industry estimate


Missing Values

Missing data (NaN, NULL, None) is one of the most common data quality issues. But not all missing data is the same: the mechanism of missingness determines the right strategy.

Mechanism | Definition | Example | Strategy
MCAR (Missing Completely At Random) | Missingness is unrelated to any observed or unobserved values | Sensor failure at random times | Deletion is unbiased but reduces sample size; simple imputation is usually acceptable
MAR (Missing At Random) | Missingness depends only on other observed variables | Income is reported less often by younger people (age is observed) | Impute using the other features
MNAR (Missing Not At Random) | Missingness depends on the missing value itself | High-income people are less likely to report income | Very hard; may need domain knowledge
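
One informal way to probe for MAR-style missingness is to compare the missing rate of a column across groups of another, fully observed column. The sketch below uses synthetic data; the column names, age bands, and probabilities are assumptions made up for illustration. A large gap between groups suggests the data is not MCAR, but a check like this cannot rule out MNAR, because the unobserved values are by definition unavailable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Synthetic data (hypothetical): income goes unreported more often for younger people.
age = rng.integers(18, 80, size=n)
income = rng.normal(50_000, 15_000, size=n)
p_missing = np.where(age < 30, 0.40, 0.05)  # missingness depends on age only (MAR)
income[rng.random(n) < p_missing] = np.nan

df_demo = pd.DataFrame({"age": age, "income": income})

# Compare the missing rate of income across age bands.
age_band = pd.cut(df_demo["age"], bins=[17, 29, 49, 79],
                  labels=["18-29", "30-49", "50-79"])
missing_rate = df_demo["income"].isnull().groupby(age_band, observed=True).mean()
print(missing_rate)
```

Here the youngest band should show a missing rate close to 40% and the others close to 5%, which mirrors how the data was generated.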

Detecting Missing Values

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be an existing pandas DataFrame

# Count missing values per column
print(df.isnull().sum())
print(df.isnull().mean() * 100)  # as percentage

# Heatmap of missing patterns (one row per sample, one column per feature)
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.show()

Imputation Strategies

from sklearn.impute import SimpleImputer, KNNImputer

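# Option 1: mean imputation
mean_imp = SimpleImputer(strategy='mean')
X_train_mean = mean_imp.fit_transform(X_train)
X_test_mean  = mean_imp.transform(X_test)  # use train statistics!

# Option 2: KNN imputation (often more accurate, but slower)
# KNN is distance-based, so features should be on comparable scales.
knn_imp = KNNImputer(n_neighbors=5)
X_train_knn = knn_imp.fit_transform(X_train)
X_test_knn  = knn_imp.transform(X_test)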
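
The "use train statistics" rule also applies inside cross-validation. A minimal sketch, assuming a scikit-learn workflow: placing the imputer in a pipeline means it is refit on the training portion of every fold. The synthetic data, the choice of median imputation, and the logistic regression model are illustrative assumptions, not a recommendation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 rows, 4 features, about 10% of entries missing.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.10] = np.nan

# add_indicator=True appends binary "was missing" columns, which lets the
# model use missingness itself as a signal.
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(),
)

# The imputer is fit only on the training part of each fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Fitting the imputer once on the full dataset before splitting would let statistics from the held-out folds leak into training.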

Outliers

Outliers are data points that deviate significantly from the rest. They can be:

  • Genuine: extreme but valid observations (e.g., a billionaire in an income dataset)
  • Errors: measurement mistakes, data entry errors

Detection Methods

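import numpy as np

# Work on numeric columns only; mean, std, and quantile fail on text columns
num = df.select_dtypes(include="number")

# Z-score method (mean and std are themselves pulled by outliers)
z_scores = np.abs((num - num.mean()) / num.std())
outliers_z = (z_scores > 3).any(axis=1)

# IQR method (more robust: quartiles are barely affected by extreme values)
Q1, Q3 = num.quantile(0.25), num.quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = ((num < Q1 - 1.5*IQR) | (num > Q3 + 1.5*IQR)).any(axis=1)

print(f"Outliers (IQR): {outliers_iqr.sum()} / {len(df)} ({outliers_iqr.mean()*100:.1f}%)")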

Handling Strategies

Strategy | When to use
Remove | Clear measurement errors affecting a small percentage of rows
Cap/Winsorize | The row should be kept, but the extreme value would dominate; clip it to a chosen percentile
Transform (log, sqrt) | Right-skewed data with many high outliers
Keep | Genuine extreme values that are relevant to the task
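
As a rough sketch of the capping and transform strategies, the snippet below clips a synthetic right-skewed series at its 1st and 99th percentiles and, separately, applies a log transform. The lognormal data and the percentile choices are assumptions for illustration; in a train/test setting the clipping bounds would be computed on the training set only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed data (hypothetical "income" values).
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1_000))

# Cap / winsorize: keep every row, but clip values to the 1st-99th percentile range.
lower, upper = income.quantile([0.01, 0.99])
income_capped = income.clip(lower=lower, upper=upper)

# Log transform: log1p computes log(1 + x), which also handles zeros.
income_log = np.log1p(income)

print(f"skew raw:    {income.skew():.2f}")
print(f"skew capped: {income_capped.skew():.2f}")
print(f"skew log:    {income_log.skew():.2f}")
```

Capping limits the influence of the tails while leaving the bulk of the distribution untouched; the log transform reshapes the whole distribution and typically brings the skew close to zero for data like this.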

Duplicates and Noise

Duplicates (identical or near-identical rows) inflate training counts and can cause overfitting:

# Check duplicates
print(f"Duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Near-duplicates (fuzzy): compare rows pairwise, e.g. with cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
# Comparing all pairs is O(n^2), so this is expensive for large datasets
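
A minimal sketch of fuzzy duplicate detection on text, assuming rows can be represented as TF-IDF vectors of character n-grams. The example sentences and the 0.8 threshold are arbitrary illustrative choices and would need tuning on real data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumped over the lazy dog",  # near-duplicate of row 0
    "Data quality matters more than model choice",
    "Outliers can be genuine or errors",
]

# Character n-grams tolerate small edits such as typos or changed word endings.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(texts)
sim = cosine_similarity(vectors)  # dense (n, n) similarity matrix

# Keep only the upper triangle so each pair is reported once and
# rows are not compared with themselves.
threshold = 0.8  # hypothetical cutoff
rows, cols = np.where(np.triu(sim, k=1) >= threshold)
near_dupes = list(zip(rows.tolist(), cols.tolist()))
print(near_dupes)
```

The full similarity matrix grows quadratically with the number of rows, so for large datasets an approximate method (for example hashing-based candidate generation) is usually needed before any pairwise comparison.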

Noise (random errors in feature values or labels) is harder to detect. Strategies:

  • Label smoothing (soft targets instead of hard 0/1)
  • Data augmentation
  • Ensemble methods (averaging over noise)
  • Confident learning (identify likely mislabeled samples)
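
To make the first strategy concrete, here is a small sketch of label smoothing under one common formulation: each one-hot target is scaled by (1 - eps) and eps is spread uniformly over all classes. The helper name `smooth_labels` and the value eps = 0.1 are illustrative assumptions.

```python
import numpy as np

def smooth_labels(y, n_classes, eps=0.1):
    """Return soft targets: one-hot scaled by (1 - eps), plus eps / n_classes on every class."""
    one_hot = np.eye(n_classes)[y]
    return one_hot * (1.0 - eps) + eps / n_classes

y = np.array([0, 2, 1])
soft = smooth_labels(y, n_classes=3, eps=0.1)
print(soft.round(3))
```

Each row still sums to 1 and the labeled class keeps the largest probability, but the target no longer demands full confidence, which softens the penalty when a label happens to be wrong.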

Data Quality Audit Checklist

def data_quality_report(df, target_col=None):
    """Print a quick overview of common data quality issues in df."""
    print("=== Data Quality Report ===")
    print(f"Shape: {df.shape}")
    missing = df.isnull().sum()
    print("\nMissing values:")
    print(missing[missing > 0])  # only columns that have missing values
    print(f"\nDuplicates: {df.duplicated().sum()}")
    print(f"\nData types:\n{df.dtypes}")
    if target_col is not None:
        print(f"\nClass distribution:\n{df[target_col].value_counts(normalize=True)}")
    print("\nNumerical summary:")
    print(df.describe())