Distributions
Data Distributions and Visualization
Before building any model, you must understand your data visually. The distribution of features tells you which preprocessing is needed, what problems to expect, and whether the data can support a given task.
Why Distribution Matters
The same algorithm applied to the same task can fail or succeed depending on the data distribution. A linear classifier works perfectly on linearly separable data but cannot learn XOR. Normalization is critical for gradient-based learning but irrelevant for decision trees.
The Salmon vs Seabass Problem
A classic introductory dataset: classify fish on a conveyor belt as "salmon" or "seabass" based on two sensors — size (cm) and brightness (0–10).
One-dimensional view: each feature individually. Note that neither size alone nor brightness alone perfectly separates the species.
Two-dimensional view: combining both features allows a linear decision boundary to separate most samples.
Lesson
More features = richer feature space = more separation potential. But adding irrelevant features can hurt. Feature selection matters.
The Iris Dataset
UCI Machine Learning Repository: introduced by Ronald A. Fisher in 1936, this 150-sample dataset of three Iris species is a cornerstone ML benchmark.
| Feature | Unit | Range |
|---|---|---|
| Sepal length | cm | 4.3–7.9 |
| Sepal width | cm | 2.0–4.4 |
| Petal length | cm | 1.0–6.9 |
| Petal width | cm | 0.1–2.5 |
import pandas as pd
from sklearn.datasets import load_iris
# Carregar o conjunto de dados Iris
iris = load_iris()
# Transforma em DataFrame
df = pd.DataFrame(
data=iris.data,
columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]
# Imprime os dados
print(df) Pairplot of the Iris dataset. Note: petal length vs. petal width clearly separates all three species. Sepal length vs. sepal width shows overlap — not all feature pairs are equally discriminative.
Common Distribution Shapes
Four common 2D data distributions. The decision boundary a model needs to learn depends entirely on how the data is distributed.
| Distribution | Characteristics | Suitable models |
|---|---|---|
| Linear | Classes separated by a hyperplane | Logistic Regression, Linear SVM, Perceptron |
| Circular / radial | Non-linear concentric structure | RBF SVM, Neural Networks, KNN |
| Clusters | Groups in multiple locations | GMM, Neural Networks, CNN |
| Spiral / complex | Highly non-linear | Deep Neural Networks, SVM w/ kernel |
Interactive: Explore a Distribution
Adjust the parameters below to see how different Gaussian distributions look and overlap.
Key Visualization Techniques
| Technique | Best for | Library |
|---|---|---|
| Scatter plot | 2D feature relationships | matplotlib, seaborn |
| Pairplot | All feature pairs at once | seaborn.pairplot |
| Histogram | Single feature distribution | matplotlib.hist |
| Box plot | Distribution + outliers | seaborn.boxplot |
| Heatmap (correlation) | Feature correlations | seaborn.heatmap |
| t-SNE / UMAP | High-dimensional data in 2D | sklearn, umap-learn |
| Violin plot | Distribution per class | seaborn.violinplot |
import seaborn as sns
import matplotlib.pyplot as plt
# Correlation heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
# Distribution per class
sns.violinplot(data=df, x='class', y='feature_name')
plt.show()
# t-SNE for high-dimensional data
from sklearn.manifold import TSNE
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_scaled)
plt.scatter(X_2d[:,0], X_2d[:,1], c=y, cmap='tab10')
plt.title('t-SNE visualization')
plt.show()
-
Fisher, R. A. (1936). Iris. UCI Machine Learning Repository. ↩
-
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern Classification, 2nd Edition. Wiley. ↩
