Distributions

Data Distributions and Visualization

Before building any model, you must understand your data visually. The distribution of features tells you which preprocessing is needed, what problems to expect, and whether the data can support a given task.

Why Distribution Matters

The same algorithm applied to the same task can fail or succeed depending on the data distribution. A linear classifier works perfectly on linearly separable data but cannot learn XOR. Normalization is critical for gradient-based learning but irrelevant for decision trees.

The Salmon vs Seabass Problem

A classic introductory dataset: classify fish on a conveyor belt as "salmon" or "seabass" based on two sensors — size (cm) and brightness (0–10).

\[ \mathbf{x} = \begin{bmatrix} x_1 \text{ (size)} \\ x_2 \text{ (brightness)} \end{bmatrix} \longrightarrow f(\mathbf{x}) \in \{\text{salmon}, \text{seabass}\} \]

One-dimensional view: each feature individually. Note that neither size alone nor brightness alone perfectly separates the species.

Two-dimensional view: combining both features allows a linear decision boundary to separate most samples.

Lesson

More features = richer feature space = more separation potential. But adding irrelevant features can hurt. Feature selection matters.

The Iris Dataset

UCI Machine Learning Repository: introduced by Ronald A. Fisher in 1936, this 150-sample dataset of three Iris species is a cornerstone ML benchmark.

Feature	Unit	Range
Sepal length	cm	4.3–7.9
Sepal width	cm	2.0–4.4
Petal length	cm	1.0–6.9
Petal width	cm	0.1–2.5

Editor (session: default) Run

import pandas as pd
from sklearn.datasets import load_iris

# Carregar o conjunto de dados Iris
iris = load_iris()

# Transforma em DataFrame
df = pd.DataFrame(
    data=iris.data,
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]

# Imprime os dados
print(df)

Output Clear

Pairplot of the Iris dataset. Note: petal length vs. petal width clearly separates all three species. Sepal length vs. sepal width shows overlap — not all feature pairs are equally discriminative.

Common Distribution Shapes

Four common 2D data distributions. The decision boundary a model needs to learn depends entirely on how the data is distributed.

Distribution	Characteristics	Suitable models
Linear	Classes separated by a hyperplane	Logistic Regression, Linear SVM, Perceptron
Circular / radial	Non-linear concentric structure	RBF SVM, Neural Networks, KNN
Clusters	Groups in multiple locations	GMM, Neural Networks, CNN
Spiral / complex	Highly non-linear	Deep Neural Networks, SVM w/ kernel

Interactive: Explore a Distribution

Adjust the parameters below to see how different Gaussian distributions look and overlap.

Class A (blue)

Mean X:

Mean Y:

Std:

Class B (orange)

Mean X:

Mean Y:

Std:

Key Visualization Techniques

Technique	Best for	Library
Scatter plot	2D feature relationships	`matplotlib`, `seaborn`
Pairplot	All feature pairs at once	`seaborn.pairplot`
Histogram	Single feature distribution	`matplotlib.hist`
Box plot	Distribution + outliers	`seaborn.boxplot`
Heatmap (correlation)	Feature correlations	`seaborn.heatmap`
t-SNE / UMAP	High-dimensional data in 2D	`sklearn`, `umap-learn`
Violin plot	Distribution per class	`seaborn.violinplot`

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

# Distribution per class
sns.violinplot(data=df, x='class', y='feature_name')
plt.show()

# t-SNE for high-dimensional data
from sklearn.manifold import TSNE
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_scaled)
plt.scatter(X_2d[:,0], X_2d[:,1], c=y, cmap='tab10')
plt.title('t-SNE visualization')
plt.show()

Fisher, R. A. (1936). Iris. UCI Machine Learning Repository. ↩
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern Classification, 2nd Edition. Wiley. ↩