Data Quality
Real-world data is messy. Before training any model, you must understand the quality of your data, identify problems, and decide how to address them. Poor data quality is often harder to fix than a poor model choice.
"Data scientists spend 60-80% of their time cleaning and preparing data." (common industry estimate)
Missing Values
Missing data (NaN, NULL, None) is the most common data quality issue. But not all missing data is the same: the mechanism of missingness determines the right strategy.
| Mechanism | Definition | Example | Strategy |
|---|---|---|---|
| MCAR (Missing Completely At Random) | Missingness is independent of both the observed and the unobserved values | Sensor failure at random times | Deletion is unbiased but loses rows; simple imputation is usually acceptable |
| MAR (Missing At Random) | Missingness depends only on other observed variables | Younger people report income less often (age is observed) | Impute using the other features |
| MNAR (Missing Not At Random) | Missingness depends on the missing value itself | High-income people are less likely to report income | Very hard; may need domain knowledge |
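To make the three mechanisms concrete, the sketch below simulates each one on synthetic data and compares the mean of the values that remain observed with the true mean. Everything here is an illustrative assumption rather than real data: the income formula, the age cutoff of 35, and the missingness probabilities are all made up for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: income rises with age, plus noise
age = rng.integers(18, 70, size=n)
income = 20_000 + 800 * (age - 18) + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# Probability that each income value goes missing under each mechanism
p_missing = {
    # MCAR: same probability for every row
    "MCAR": np.full(n, 0.3),
    # MAR: depends on an observed variable (younger people skip the question)
    "MAR": np.where(age < 35, 0.6, 0.1),
    # MNAR: depends on the value that goes missing (high earners skip it)
    "MNAR": np.where(income > np.quantile(income, 0.75), 0.7, 0.1),
}

true_mean = df["income"].mean()
observed_mean = {}
for name, p in p_missing.items():
    is_missing = rng.random(n) < p
    observed_mean[name] = df.loc[~is_missing, "income"].mean()
    print(f"{name}: observed mean {observed_mean[name]:,.0f} vs true mean {true_mean:,.0f}")
```

Under these assumptions the MCAR mean stays close to the truth, the MAR mean is too high (low-income younger rows are dropped, but age is available to correct for it), and the MNAR mean is too low with nothing in the observed data to reveal why.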
Detecting Missing Values
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# `df` is assumed to be an existing pandas DataFrame
# Count missing values per column
print(df.isnull().sum())
print(df.isnull().mean() * 100)  # as a percentage of rows
# Heatmap of missing patterns (one cell per row/column pair)
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.show()
Imputation Strategies
from sklearn.impute import SimpleImputer, KNNImputer
# Option 1: mean imputation
imp = SimpleImputer(strategy='mean')
X_train_imp = imp.fit_transform(X_train)
X_test_imp = imp.transform(X_test)  # reuse the statistics learned on the training set
# Option 2: KNN imputation (often more accurate, but slower)
imp = KNNImputer(n_neighbors=5)
X_train_imp = imp.fit_transform(X_train)
X_test_imp = imp.transform(X_test)
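The same "fit on train, apply to test" rule also has to hold inside cross-validation. One way to get that, sketched below, is to put the imputer in a scikit-learn pipeline so it is refit on each training fold. The synthetic data, the 10% missing rate, and the choice of `LogisticRegression` are assumptions for illustration; `add_indicator=True` is an optional extra that appends a binary "was missing" column per affected feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data: the label depends on the first two features
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Remove roughly 10% of the entries completely at random
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is fitted only on the training part of each fold
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    LogisticRegression(),
)
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Imputing the full dataset once before splitting would leak test-fold statistics into training, which the pipeline avoids.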
Outliers
Outliers are data points that deviate significantly from the rest. They can be:
- Genuine: extreme but valid observations (e.g., a billionaire in an income dataset)
- Errors: measurement mistakes, data entry errors
Detection Methods
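import numpy as np

# Both methods apply to numeric columns only
num = df.select_dtypes(include='number')
# Z-score method: flag rows with any value more than 3 standard deviations from the mean
z_scores = np.abs((num - num.mean()) / num.std())
outliers = (z_scores > 3).any(axis=1)
# IQR method (more robust: quartiles are barely affected by extreme values)
Q1, Q3 = num.quantile(0.25), num.quantile(0.75)
IQR = Q3 - Q1
outliers = ((num < Q1 - 1.5*IQR) | (num > Q3 + 1.5*IQR)).any(axis=1)
print(f"Outliers: {outliers.sum()} / {len(df)} ({outliers.mean()*100:.1f}%)")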
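A small example shows why the IQR method is described as more robust. In the toy series below (values invented for illustration), one extreme point inflates the mean and standard deviation so much that its own z-score stays under 3, while the quartiles are unaffected and the IQR rule flags it.

```python
import pandas as pd

# Toy sample: nine ordinary values and one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 500], dtype=float)

# Z-score method: the outlier inflates the standard deviation it is measured against
z_scores = ((s - s.mean()) / s.std()).abs()
z_flags = z_scores > 3

# IQR method: the fences depend only on the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_flags = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print(f"Largest z-score: {z_scores.max():.2f}")
print(f"Flagged by z-score: {z_flags.sum()}, flagged by IQR: {iqr_flags.sum()}")
```

With the sample standard deviation, the largest possible z-score in a sample of n points is (n - 1) / sqrt(n), which is about 2.85 for n = 10, so a threshold of 3 cannot flag anything in a sample this small.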
Handling Strategies
| Strategy | When to use |
|---|---|
| Remove | Clear measurement errors that affect a small percentage of rows |
| Cap/Winsorize | The row should stay, but its extreme value would dominate; clip it to a chosen percentile |
| Transform (log, sqrt) | Right-skewed data with many high outliers |
| Keep | Genuine extreme values relevant to the task |
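The capping and transform rows of the table can be sketched as follows. The log-normal sample stands in for a right-skewed feature such as income, and the 1st/99th percentile cutoffs are an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed feature (hypothetical income values)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

# Cap/Winsorize: clip values outside the 1st and 99th percentiles
lower, upper = income.quantile(0.01), income.quantile(0.99)
capped = income.clip(lower=lower, upper=upper)

# Transform: log1p compresses the long right tail without dropping rows
logged = np.log1p(income)

print(f"Max before capping: {income.max():,.0f}, after: {capped.max():,.0f}")
print(f"Skewness before log: {income.skew():.2f}, after: {logged.skew():.2f}")
```

In a train/test setting, the percentile bounds would be computed on the training set and reused on the test set, for the same reason imputation statistics are.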
Duplicates and Noise
Duplicates (identical or near-identical rows) inflate training counts and can cause overfitting:
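# Check exact duplicates
print(f"Duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()
# Near-duplicates (fuzzy): cosine similarity between all pairs of numeric rows.
# This is O(n^2) in time and memory, so it is expensive for large datasets.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
num = df.select_dtypes(include='number').fillna(0)
sim = cosine_similarity(num)
# Positional index pairs (i, j), i < j, above an illustrative threshold of 0.99
near_dupes = np.argwhere(np.triu(sim, k=1) > 0.99)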
Noise (random errors in feature values or labels) is harder to detect. Strategies:
- Label smoothing (soft targets instead of hard 0/1)
- Data augmentation
- Ensemble methods (averaging over noise)
- Confident learning (identify likely mislabeled samples)
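As an illustration of the first strategy, label smoothing in its common formulation replaces each one-hot target with a mixture of that target and a uniform distribution over classes. The helper name `smooth_labels` and the value `epsilon=0.1` are assumptions chosen for this sketch.

```python
import numpy as np

def smooth_labels(y, n_classes, epsilon=0.1):
    """Return soft targets: (1 - epsilon) * one-hot + epsilon / n_classes."""
    one_hot = np.eye(n_classes)[y]
    return one_hot * (1.0 - epsilon) + epsilon / n_classes

# Three samples with hard labels 0, 2 and 1
y = np.array([0, 2, 1])
soft = smooth_labels(y, n_classes=3, epsilon=0.1)
print(soft.round(3))
```

Each row still sums to 1 and the labelled class keeps the largest probability, but a mislabeled sample now pulls the model less strongly toward the wrong class.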
Data Quality Audit Checklist
def data_quality_report(df, target_col=None):
    """Print a quick summary of common data quality issues in `df`."""
    print("=== Data Quality Report ===")
    print(f"Shape: {df.shape}")
    missing = df.isnull().sum()
    print("\nMissing values:")
    print(missing[missing > 0])
    print(f"\nDuplicates: {df.duplicated().sum()}")
    print(f"\nData types:\n{df.dtypes}")
    if target_col is not None:
        print(f"\nClass distribution:\n{df[target_col].value_counts(normalize=True)}")
    print("\nNumerical summary:")
    print(df.describe())
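When the audit needs to feed automated checks rather than be read by eye, a variant that returns a table can be easier to work with. The sketch below is one possible shape for that; `column_quality_summary` and the small `demo` frame are hypothetical names and data invented for this example.

```python
import numpy as np
import pandas as pd

def column_quality_summary(df):
    """Hypothetical helper: one row per column with dtype, missing share, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isnull().mean() * 100,
        "n_unique": df.nunique(),
    })

# Tiny made-up dataset with one missing value per feature and one duplicate row
demo = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 25],
    "city": ["Oslo", "Rome", "Rome", None, "Oslo"],
    "label": [0, 1, 1, 0, 0],
})

summary = column_quality_summary(demo)
n_duplicates = int(demo.duplicated().sum())
print(summary)
print(f"Duplicate rows: {n_duplicates}")
```

Because the result is a DataFrame, thresholds such as "no column may have more than 30% missing values" can be checked with an ordinary assertion in a data validation step.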