Data Distributions and Visualization

Before building any model, you must understand your data visually. The distribution of features tells you which preprocessing is needed, what problems to expect, and whether the data can support a given task.


Why Distribution Matters

The same algorithm applied to the same task can fail or succeed depending on the data distribution. A linear classifier works perfectly on linearly separable data but cannot learn XOR. Normalization is critical for gradient-based learning but irrelevant for decision trees.
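The XOR claim can be checked with a few lines of scikit-learn. This is a minimal sketch on synthetic data: the sample size, the random seed, and the choice of a decision tree as the non-linear comparison model are assumptions made for illustration, not part of the original material.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data (an assumption for this sketch):
# the label is 1 when the two coordinates have opposite signs.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] * X[:, 1] < 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear model can only draw one straight boundary through the plane.
linear = LogisticRegression().fit(X_train, y_train)
# A decision tree can carve the plane into axis-aligned regions.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

linear_acc = linear.score(X_test, y_test)
tree_acc = tree.score(X_test, y_test)
print(f"logistic regression test accuracy: {linear_acc:.2f}")
print(f"decision tree test accuracy:       {tree_acc:.2f}")
```

The linear model should land near chance level, because no single line separates the four quadrants, while the tree recovers most of the structure.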


The Salmon vs Seabass Problem

A classic introductory example from Duda, Hart, and Stork's Pattern Classification: classify fish on a conveyor belt as "salmon" or "seabass" using two sensor readings, size (cm) and brightness (0–10).

\[ \mathbf{x} = \begin{bmatrix} x_1 \text{ (size)} \\ x_2 \text{ (brightness)} \end{bmatrix} \longrightarrow f(\mathbf{x}) \in \{\text{salmon}, \text{seabass}\} \]

One-dimensional view: each feature individually. Note that neither size alone nor brightness alone perfectly separates the species.


Two-dimensional view: combining both features allows a linear decision boundary to separate most samples.

Lesson

More features give a richer feature space and more potential for separation. Irrelevant features, however, add noise without adding signal and can hurt performance, so feature selection matters.
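The effect of adding useful and useless features can be sketched on simulated fish. Everything numeric here is hypothetical: the class means, the spreads, the 30 noise features, and the choice of a k-nearest-neighbors classifier were picked only to make the point visible, and they do not come from real sensor data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000  # samples per species

# Hypothetical sensor readings, invented for this sketch.
size = np.concatenate([rng.normal(60, 8, n), rng.normal(75, 8, n)])  # cm
brightness = np.concatenate([rng.normal(3.0, 1.2, n), rng.normal(6.0, 1.2, n)])
brightness = np.clip(brightness, 0, 10)  # keep readings on the 0-10 scale
y = np.repeat([0, 1], n)  # 0 = salmon, 1 = seabass

# 30 extra features that carry no information about the species.
noise = rng.normal(size=(2 * n, 30))

feature_sets = {
    "size only": size[:, None],
    "brightness only": brightness[:, None],
    "size + brightness": np.column_stack([size, brightness]),
    "size + brightness + noise": np.column_stack([size, brightness, noise]),
}

# Same model for every feature set, so only the features differ.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, X in feature_sets.items()
}
for name, acc in scores.items():
    print(f"{name:28s} {acc:.3f}")
```

Under these assumptions, the two informative features together should beat either one alone, and the noise features should pull the accuracy back down because they dominate the distance computation.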


The Iris Dataset

Introduced by Ronald A. Fisher in 1936 and distributed through the UCI Machine Learning Repository, this 150-sample dataset covers three Iris species (50 samples each) and remains a cornerstone ML benchmark.

Iris flower parts

| Feature | Unit | Range |
|---|---|---|
| Sepal length | cm | 4.3–7.9 |
| Sepal width | cm | 2.0–4.4 |
| Petal length | cm | 1.0–6.9 |
| Petal width | cm | 0.1–2.5 |

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Convert to a DataFrame
df = pd.DataFrame(
    data=iris.data,
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]

# Print the data
print(df)
```

Pairplot of the Iris dataset. Petal length vs. petal width separates setosa cleanly and leaves only a small overlap between versicolor and virginica. Sepal length vs. sepal width overlaps much more, so not all feature pairs are equally discriminative.
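What the pairplot shows visually can also be put into numbers. The sketch below scores every pair of Iris features with a simple classifier. Using linear discriminant analysis and 5-fold cross-validation is an assumption of this example; any reasonable classifier would serve for the comparison.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target
names = ["sepal_l", "sepal_w", "petal_l", "petal_w"]

# Cross-validated accuracy when the classifier sees only two features.
pair_scores = {}
for i, j in combinations(range(4), 2):
    acc = cross_val_score(LinearDiscriminantAnalysis(), X[:, [i, j]], y, cv=5).mean()
    pair_scores[(names[i], names[j])] = acc

# Print the pairs from most to least discriminative.
for (a, b), acc in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
    print(f"{a} vs {b}: {acc:.3f}")
```

The petal pair should score clearly higher than the sepal pair, which matches what the scatter panels suggest.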


Common Distribution Shapes


Four common 2D data distributions. The decision boundary a model needs to learn depends entirely on how the data is distributed.

| Distribution | Characteristics | Suitable models |
|---|---|---|
| Linear | Classes separated by a hyperplane | Logistic Regression, Linear SVM, Perceptron |
| Circular / radial | Non-linear concentric structure | RBF SVM, Neural Networks, KNN |
| Clusters | Groups in multiple locations | GMM, Neural Networks, CNN |
| Spiral / complex | Highly non-linear | Deep Neural Networks, SVM w/ kernel |
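The circular row of the table can be illustrated with scikit-learn's `make_circles` generator. This is a sketch, and the noise level, the ring ratio (`factor`), and the sample size are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: the inner ring is one class, the outer ring the other.
X, y = make_circles(n_samples=500, noise=0.08, factor=0.4, random_state=0)

models = {
    "logistic regression": LogisticRegression(),
    "RBF SVM": SVC(kernel="rbf"),
}

# 5-fold cross-validated accuracy for a linear and a non-linear model.
shape_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, acc in shape_scores.items():
    print(f"{name:20s} {acc:.3f}")
```

A straight-line boundary cannot enclose the inner ring, so the linear model should stay near chance, while the RBF kernel handles the radial structure.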

Interactive: Explore a Distribution

Interactive widget: two 2D Gaussian classes, Class A (blue) and Class B (orange), each with an adjustable mean (X, Y) and standard deviation, showing how the distributions look and overlap.
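The same exploration can be approximated offline with NumPy. In this sketch the function names `sample_two_classes` and `nearest_mean_error` are hypothetical helpers, and the fraction of points lying closer to the other class's mean is used as a rough overlap measure. That measure is a simplifying assumption, reasonable when both classes have similar spread.

```python
import numpy as np

def sample_two_classes(mean_a, std_a, mean_b, std_b, n=500, seed=0):
    """Draw n points per class from two isotropic 2D Gaussians."""
    rng = np.random.default_rng(seed)
    a = rng.normal(loc=mean_a, scale=std_a, size=(n, 2))
    b = rng.normal(loc=mean_b, scale=std_b, size=(n, 2))
    X = np.vstack([a, b])
    y = np.repeat([0, 1], n)  # 0 = class A, 1 = class B
    return X, y

def nearest_mean_error(X, y):
    """Fraction of points that lie closer to the other class's sample mean."""
    means = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return float((dists.argmin(axis=1) != y).mean())

# Same spread, different distance between the class means.
far_X, far_y = sample_two_classes(mean_a=(0, 0), std_a=1.0, mean_b=(5, 5), std_b=1.0)
near_X, near_y = sample_two_classes(mean_a=(0, 0), std_a=1.0, mean_b=(1, 1), std_b=1.0)

far_err = nearest_mean_error(far_X, far_y)
near_err = nearest_mean_error(near_X, near_y)
print(f"well separated: {far_err:.3f}")
print(f"overlapping:    {near_err:.3f}")
```

Passing the returned `X[:, 0]`, `X[:, 1]`, and `y` to `plt.scatter` reproduces the visual side of the exploration. Moving the means closer together or increasing either standard deviation raises the overlap.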

Key Visualization Techniques

| Technique | Best for | Library |
|---|---|---|
| Scatter plot | 2D feature relationships | matplotlib, seaborn |
| Pairplot | All feature pairs at once | seaborn.pairplot |
| Histogram | Single feature distribution | matplotlib.pyplot.hist |
| Box plot | Distribution + outliers | seaborn.boxplot |
| Heatmap (correlation) | Feature correlations | seaborn.heatmap |
| t-SNE / UMAP | High-dimensional data in 2D | sklearn.manifold.TSNE, umap-learn |
| Violin plot | Distribution per class | seaborn.violinplot |
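```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumes df is the Iris DataFrame built earlier (four numeric columns plus 'class')

# Correlation heatmap (numeric columns only; 'class' holds strings)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

# Distribution of one feature per class (any numeric column works for y)
sns.violinplot(data=df, x='class', y='petal_l')
plt.show()

# t-SNE for high-dimensional data: standardize features, encode labels as integers
X_scaled = StandardScaler().fit_transform(df.drop(columns='class'))
y = df['class'].astype('category').cat.codes
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10')
plt.title('t-SNE visualization')
plt.show()
```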
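The box plot entry in the table pairs distributions with outliers. By default, matplotlib and seaborn draw whiskers at 1.5 times the interquartile range and mark points beyond them as outliers. The sketch below applies that rule directly to the Iris features; the helper name `iqr_outliers` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(
    iris.data, columns=["sepal_l", "sepal_w", "petal_l", "petal_w"]
)

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Count the points a default box plot would draw beyond the whiskers.
for col in features.columns:
    flagged = iqr_outliers(features[col])
    print(f"{col}: {len(flagged)} points outside the whiskers")
```

Comparing these counts with `sns.boxplot(data=features)` is a quick sanity check that the plot is being read correctly.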

  1. Fisher, R. A. (1936). Iris. UCI Machine Learning Repository.

  2. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern Classification, 2nd Edition. Wiley.