Feature Types

The type of each feature in your dataset fundamentally shapes which preprocessing, architectures, and loss functions you can use. Misidentifying a feature type is one of the most common sources of bugs in ML pipelines.

Primary Feature Types

graph TD
    D[Data] --> S[Structured]
    D --> U[Unstructured]

    S --> N[Numerical]
    S --> C[Categorical]

    N --> N1["Continuous<br/><small>e.g. height, temperature, price</small>"]
    N --> N2["Discrete<br/><small>e.g. count, age in years</small>"]

    C --> C1["Nominal<br/><small>e.g. color, city, country</small>"]
    C --> C2["Ordinal<br/><small>e.g. small / medium / large</small>"]
    C --> C3["Binary<br/><small>e.g. yes / no, true / false</small>"]

    U --> U1["Text<br/><small>e.g. reviews, documents</small>"]
    U --> U2["Image<br/><small>e.g. photos, scans</small>"]
    U --> U3["Audio<br/><small>e.g. speech, music</small>"]
    U --> U4["Graph<br/><small>e.g. social networks</small>"]

    classDef root    fill:#e2e8f0,stroke:#718096,color:#2d3748
    classDef struct  fill:#ebf4ff,stroke:#4299e1,color:#2b6cb0
    classDef unstruct fill:#fef5e7,stroke:#ed8936,color:#7b341e
    classDef num     fill:#f0fff4,stroke:#48bb78,color:#276749
    classDef cat     fill:#fffff0,stroke:#d69e2e,color:#744210
    classDef numLeaf fill:#f0fff4,stroke:#9ae6b4,color:#276749
    classDef catLeaf fill:#fffff0,stroke:#faf089,color:#744210
    classDef uLeaf   fill:#fef5e7,stroke:#fbd38d,color:#7b341e

    class D root
    class S struct
    class U unstruct
    class N num
    class C cat
    class N1,N2 numLeaf
    class C1,C2,C3 catLeaf
    class U1,U2,U3,U4 uLeaf

Numerical Features

Numerical features represent quantities that can be measured on a continuous or discrete scale.

Sub-type	Description	Examples	Encoding
Continuous	Any real value within a range	Height (1.73m), Temperature (23.4°C), Price ($12.50)	Use directly after normalizing
Discrete	Countable integer values	Age (years), Number of rooms, Visit count	Treat as continuous or ordinal

Why it matters: Neural networks train most reliably when their inputs share a comparable, bounded scale. When features differ by orders of magnitude (e.g., age ∈ [0, 100] and income ∈ [0, 1,000,000]), the larger-valued feature dominates the gradient updates. Normalize or standardize numerical features before training.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Z-score: (x - mean) / std  → mean 0, std 1
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)   # fit on training data only
X_test_std  = std_scaler.transform(X_test)        # reuse the training mean and std

# Min-Max: (x - min) / (max - min) → [0, 1] on the training data
minmax_scaler = MinMaxScaler()
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm  = minmax_scaler.transform(X_test)      # test values may fall outside [0, 1]
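To make the two formulas concrete, here is a minimal pure-Python sketch of the arithmetic behind them. The helper names (`fit_zscore`, `apply_zscore`, `fit_minmax`, `apply_minmax`) and the toy income values are illustrative assumptions, not part of scikit-learn. The statistics are computed from the training values and reused unchanged on the test values.

```python
from statistics import fmean, pstdev

def fit_zscore(train):
    """Return (mean, std) computed from training values only."""
    # Population standard deviation (divide by n), matching StandardScaler.
    return fmean(train), pstdev(train)

def apply_zscore(values, mean, std):
    return [(v - mean) / std for v in values]

def fit_minmax(train):
    """Return (min, max) computed from training values only."""
    return min(train), max(train)

def apply_minmax(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy data: incomes in dollars.
income_train = [20_000.0, 40_000.0, 60_000.0, 80_000.0]
income_test = [50_000.0, 100_000.0]

mean, std = fit_zscore(income_train)
z_train = apply_zscore(income_train, mean, std)
z_test = apply_zscore(income_test, mean, std)      # uses the training mean and std

lo, hi = fit_minmax(income_train)
mm_train = apply_minmax(income_train, lo, hi)
mm_test = apply_minmax(income_test, lo, hi)        # 100,000 maps above 1.0
```

Note that a test value larger than the training maximum lands outside [0, 1] after min-max scaling. That is expected behavior, not a bug, but downstream code should not assume the bound holds on unseen data.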

Categorical Features

Categorical features represent membership in discrete groups. The key distinction is ordinal vs. nominal.

Sub-type	Description	Example feature	Raw-integer encoding note
Nominal	No natural order	Color	Model assumes blue > red
Ordinal	Has natural order	Size	Encoding should respect order
Binary	Two values	Spam	Encode as 0/1

Encoding strategies

One-Hot Encoding (nominal)
Ordinal Encoding
Target / Mean Encoding (high cardinality)
Embedding (neural networks)

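import pandas as pd

# One-hot encoding: one indicator column per category
df = pd.get_dummies(df, columns=['color'], dtype=int)   # dtype=int gives 0/1 instead of booleans
# A row with color == 'red' becomes: color_red=1, color_blue=0, color_green=0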

Best for nominal features with low cardinality (< ~20 categories).
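One practical detail: `pd.get_dummies` builds its columns from whatever categories appear in the frame it is given, so encoding train and test data separately can produce different column sets. The sketch below shows the underlying idea in plain Python with a category list fixed from the training data. The helper names and the choice to map unseen categories to all zeros are assumptions of this sketch.

```python
def fit_categories(train_values):
    """Collect the distinct categories seen in training, in sorted order."""
    return sorted(set(train_values))

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the category's position.

    Categories not seen in training map to all zeros (a choice made
    for this sketch, not the only reasonable one).
    """
    return [1 if value == c else 0 for c in categories]

train_colors = ["red", "blue", "green", "red"]
categories = fit_categories(train_colors)        # ['blue', 'green', 'red']

encoded_red = one_hot("red", categories)         # [0, 0, 1]
encoded_purple = one_hot("purple", categories)   # unseen in training: [0, 0, 0]
```

Because the category list is fixed once, every encoded row has the same length and column order regardless of which split it comes from.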

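from sklearn.preprocessing import OrdinalEncoder

# Ordinal encoding: pass the categories in their natural order
enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X['size_enc'] = enc.fit_transform(X[['size']]).ravel()   # flatten the 2-D output to one column
# small=0, medium=1, large=2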
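The explicit category list matters because categories inferred automatically are typically ordered by sorting the labels, which for strings means alphabetically. The following plain-Python sketch, using illustrative variable names, shows how alphabetical order scrambles the size ranking while an explicit list preserves it.

```python
sizes = ["small", "medium", "large"]

# Alphabetical order: what you get by simply sorting the labels.
alphabetical = {label: i for i, label in enumerate(sorted(sizes))}
# {'large': 0, 'medium': 1, 'small': 2}  -> order is reversed

# Explicit order that matches the real-world meaning.
explicit = {label: i for i, label in enumerate(["small", "medium", "large"])}
# {'small': 0, 'medium': 1, 'large': 2}

column = ["medium", "large", "small", "large"]
encoded = [explicit[v] for v in column]   # [1, 2, 0, 2]
```

With the alphabetical mapping, "large" would be encoded as the smallest value, so a model would learn the size relationship backwards.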

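# Target / mean encoding: replace each category with its mean target value
means = train.groupby('city')['price'].mean()    # computed on the training set only
# Cities absent from the training set fall back to the overall training mean
X['city_enc'] = X['city'].map(means).fillna(train['price'].mean())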

⚠️ Compute the means on the training set only. Using validation or test targets leaks label information into the features.
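The same idea can be sketched without pandas to show exactly which rows contribute to the statistics. In this hypothetical example, the function names, the city names, and the prices are all made up for illustration, and the fallback to the global training mean for unseen categories is one common choice rather than the only one.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Return (per-category mean, global mean) from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def encode(categories, means, global_mean):
    """Look up each category; unseen ones get the global training mean."""
    return [means.get(cat, global_mean) for cat in categories]

# Hypothetical training rows.
train_city = ["Paris", "Paris", "Oslo", "Oslo", "Rome"]
train_price = [300.0, 500.0, 200.0, 300.0, 100.0]

# "Lima" never appears in training.
test_city = ["Oslo", "Lima"]

means, global_mean = fit_target_means(train_city, train_price)
test_enc = encode(test_city, means, global_mean)   # [250.0, 280.0]
```

The test targets are never passed to `fit_target_means`, which is what keeps the encoding free of leakage.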

import torch.nn as nn
# 50 cities → 8-dimensional learned embedding
city_emb = nn.Embedding(num_embeddings=50, embedding_dim=8)

Best for high-cardinality categoricals in deep learning.
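Conceptually, an embedding layer is a lookup table: each category index selects one row of a weight matrix, which is equivalent to multiplying a one-hot vector by that matrix. The pure-Python sketch below illustrates this equivalence. The city names, the 4-dimensional size, the random weights, and the reserved `<unk>` slot are assumptions for illustration; in a real model the table rows are learned during training.

```python
import random

random.seed(0)

cities = ["Paris", "Oslo", "Rome"]          # hypothetical vocabulary
index = {"<unk>": 0}                        # slot 0 reserved for unseen cities
index.update({c: i + 1 for i, c in enumerate(cities)})

dim = 4
# One row per index. Random here; learned by gradient descent in a real model.
table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(len(index))]

def embed(city):
    """Embedding lookup: pick the row for this city's index."""
    return table[index.get(city, 0)]

def one_hot_times_table(city):
    """The same vector, computed as a one-hot row vector times the table."""
    i = index.get(city, 0)
    one_hot = [1.0 if j == i else 0.0 for j in range(len(table))]
    return [
        sum(one_hot[r] * table[r][d] for r in range(len(table)))
        for d in range(dim)
    ]

oslo_vec = embed("Oslo")
same_vec = one_hot_times_table("Oslo")      # equal to oslo_vec
unknown_vec = embed("Lima")                 # falls back to the <unk> row
```

The lookup avoids ever materializing the one-hot vector, which is why embeddings stay cheap even when the number of categories is large.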

Unstructured Feature Types

Type	Shape	Typical representation	Common model
Text	Variable sequence	Token IDs (BPE)	Transformer
Image	H × W × C	Pixel values [0,255]	CNN, ViT
Audio	T × F	Spectrogram or waveform	Conv1D, Transformer
Graph	N nodes, E edges	Adjacency matrix + features	GNN
Time series	T × F	Ordered sequence	LSTM, Transformer, TCN
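As an illustration of the text row, the sketch below turns variable-length strings into fixed-length token ID sequences. It uses a simple whitespace, word-level vocabulary as a stand-in; it is not BPE, and the function names, the special tokens `<pad>` and `<unk>`, and the example sentences are assumptions of this sketch.

```python
from collections import Counter

def build_vocab(texts, max_size):
    """Map the most frequent training words to IDs; 0 = <pad>, 1 = <unk>."""
    counts = Counter(word for t in texts for word in t.lower().split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(max_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, length):
    """Convert text to exactly `length` IDs: truncate, then pad with <pad>."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]
    ids = ids[:length]
    return ids + [vocab["<pad>"]] * (length - len(ids))

train_texts = ["great phone great price", "bad battery"]
vocab = build_vocab(train_texts, max_size=10)

# "camera" was never seen in training, so it maps to <unk>.
ids = encode("great camera", vocab, length=4)
```

Subword schemes such as BPE reduce the number of `<unk>` tokens by splitting rare words into known pieces, but the final model input has the same form: a sequence of integer IDs.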

Feature Type → Modeling Implications

Feature Type	Raw form	Neural network input	Pitfall
Continuous	Float	Standardize to mean 0, std 1	Large values dominate
Discrete	Int	Treat as continuous OR embed	Arbitrary ordering if misidentified
Nominal	String	One-hot or embedding	Model assumes order if label-encoded
Ordinal	String	Integer mapping	Distances between levels may be unequal
Binary	Bool	0/1	Class imbalance if rare
Text	String	Tokenize → token IDs	Vocabulary size, OOV tokens
Image	Array	Pixel / 255 → [0,1]	Channel order (RGB vs BGR)
Time series	Array	Windowed segments	Look-ahead leakage
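The look-ahead pitfall in the last row is easiest to see in code. The sketch below builds windowed segments from a toy series and splits by time before windowing, so that no training target comes from the test period. The function name `make_windows`, the window length, and the split point are illustrative assumptions.

```python
def make_windows(series, window, horizon=1):
    """Return (inputs, targets).

    Each input holds `window` consecutive past values; each target is the
    value `horizon` steps after the window ends.
    """
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets

series = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

# Split chronologically first, then window each part.
split = 7       # first 7 points are the training period
window = 3

train_X, train_y = make_windows(series[:split], window)

# Test windows may use the last `window` training values as context,
# but every test target lies in the test period.
test_X, test_y = make_windows(series[split - window:], window)
```

Shuffling the windows before splitting would place overlapping segments on both sides of the split, letting the model see values from the period it is later evaluated on.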