Feature Types

The type of each feature in your dataset determines which preprocessing steps, architectures, and loss functions are appropriate. Misidentifying a feature type is a common source of bugs in ML pipelines.


Primary Feature Types

graph TD
    D[Data] --> S[Structured]
    D --> U[Unstructured]

    S --> N[Numerical]
    S --> C[Categorical]

    N --> N1["Continuous<br/><small>e.g. height, temperature, price</small>"]
    N --> N2["Discrete<br/><small>e.g. count, age in years</small>"]

    C --> C1["Nominal<br/><small>e.g. color, city, country</small>"]
    C --> C2["Ordinal<br/><small>e.g. small / medium / large</small>"]
    C --> C3["Binary<br/><small>e.g. yes / no, true / false</small>"]

    U --> U1["Text<br/><small>e.g. reviews, documents</small>"]
    U --> U2["Image<br/><small>e.g. photos, scans</small>"]
    U --> U3["Audio<br/><small>e.g. speech, music</small>"]
    U --> U4["Graph<br/><small>e.g. social networks</small>"]

    classDef root    fill:#e2e8f0,stroke:#718096,color:#2d3748
    classDef struct  fill:#ebf4ff,stroke:#4299e1,color:#2b6cb0
    classDef unstruct fill:#fef5e7,stroke:#ed8936,color:#7b341e
    classDef num     fill:#f0fff4,stroke:#48bb78,color:#276749
    classDef cat     fill:#fffff0,stroke:#d69e2e,color:#744210
    classDef numLeaf fill:#f0fff4,stroke:#9ae6b4,color:#276749
    classDef catLeaf fill:#fffff0,stroke:#faf089,color:#744210
    classDef uLeaf   fill:#fef5e7,stroke:#fbd38d,color:#7b341e

    class D root
    class S struct
    class U unstruct
    class N num
    class C cat
    class N1,N2 numLeaf
    class C1,C2,C3 catLeaf
    class U1,U2,U3,U4 uLeaf

Numerical Features

Numerical features represent quantities that can be measured on a continuous or discrete scale.

| Sub-type | Description | Examples | Encoding |
| --- | --- | --- | --- |
| Continuous | Any real value within a range | Height (1.73 m), Temperature (23.4 °C), Price ($12.50) | Use as-is, normalize |
| Discrete | Countable integer values | Age (years), Number of rooms, Visit count | Treat as continuous or ordinal |

Why it matters: Neural networks and other gradient-based models train best when their inputs share a comparable scale. When scales differ widely (e.g., age ∈ [0, 100] and income ∈ [0, 1,000,000]), the large-valued features dominate the gradient updates. Normalize or standardize numerical features before feeding them to such models.

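from sklearn.preprocessing import StandardScaler, MinMaxScaler

# X_train and X_test are assumed to be 2-D arrays of numerical features

# Z-score: (x - mean) / std → mean 0, std 1 on the training data
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)  # fit on training data only
X_test_std  = std_scaler.transform(X_test)       # use train stats!

# Min-Max: (x - min) / (max - min) → [0, 1] on the training data
minmax_scaler = MinMaxScaler()
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm  = minmax_scaler.transform(X_test)     # test values may fall outside [0, 1]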
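
To make the "use train stats" comment concrete, here is a minimal, dependency-free sketch of z-score standardization. The helper names `fit_standardizer` and `standardize` and the income values are hypothetical, chosen only for illustration. It uses the population standard deviation, which is what `StandardScaler` computes by default.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Compute the mean and population std from training values only."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against a constant feature, where the std would be zero.
    return mu, (sigma if sigma > 0 else 1.0)

def standardize(values, mu, sigma):
    """Apply (x - mean) / std using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

# Illustrative data (hypothetical)
train_income = [20_000.0, 40_000.0, 60_000.0]
test_income = [40_000.0, 100_000.0]

mu, sigma = fit_standardizer(train_income)           # statistics from train only
train_scaled = standardize(train_income, mu, sigma)
test_scaled = standardize(test_income, mu, sigma)    # reuse the train statistics
```

The test values are scaled with the training mean and standard deviation, so a test value far outside the training range (here 100,000) simply lands several standard deviations from zero rather than shifting the statistics.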

Categorical Features

Categorical features represent membership in discrete groups. The key distinction is whether the categories have a natural order (ordinal) or not (nominal).

| Sub-type | Description | Example | Note on integer encoding |
| --- | --- | --- | --- |
| Nominal | No natural order | Color | Model assumes blue > red |
| Ordinal | Has natural order | Size | Encoding should respect order |
| Binary | Two values | Spam | Encode as 0/1 |
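
The nominal row can be illustrated with a small sketch comparing distances under integer codes and under one-hot vectors. The color list and the `euclidean` helper are assumptions made for this example; it is not a recommendation of any particular library.

```python
colors = ["red", "green", "blue"]

# Integer codes: arbitrary, but they imply red < green < blue.
label = {c: i for i, c in enumerate(colors)}

# One-hot vectors: one indicator position per category.
one_hot = {
    c: [1 if j == i else 0 for j in range(len(colors))]
    for i, c in enumerate(colors)
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# With integer codes, blue looks twice as far from red as green does.
label_red_green = abs(label["red"] - label["green"])
label_red_blue = abs(label["red"] - label["blue"])

# With one-hot vectors, every pair of categories is equally far apart.
onehot_red_green = euclidean(one_hot["red"], one_hot["green"])
onehot_red_blue = euclidean(one_hot["red"], one_hot["blue"])
```

A model that relies on distances or linear weights would read the integer codes as a real ordering, which is the pitfall the table describes.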

Encoding strategies

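One-hot encoding

import pandas as pd
# dtype=int yields 0/1 columns instead of booleans
df = pd.get_dummies(df, columns=['color'], dtype=int)
# A row with color == 'red' becomes: color_red=1, color_blue=0, color_green=0
Best for nominal features with low cardinality (fewer than roughly 20 categories).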

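Ordinal encoding

from sklearn.preprocessing import OrdinalEncoder
# List the categories explicitly so the integer codes follow the real order
enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
# fit_transform returns a 2-D array; flatten it before assigning to a column
X['size_enc'] = enc.fit_transform(X[['size']]).ravel()
# small=0, medium=1, large=2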

Target encoding

# Replace each category with the mean target value observed for it
means = train.groupby('city')['price'].mean()
X['city_enc'] = X['city'].map(means)  # cities absent from train map to NaN
⚠️ Compute the means on the training set only to avoid target leakage.
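
As a rough sketch of how the train-only rule can be enforced, the fitting and applying steps can be kept separate. The function names, the city and price values, and the choice to fall back to the global training mean for unseen categories are all assumptions for illustration, not part of the original recipe.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn per-category target means from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    table = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return table, global_mean

def apply_target_encoding(categories, table, global_mean):
    """Encode categories; unseen ones fall back to the global training mean."""
    return [table.get(c, global_mean) for c in categories]

# Illustrative data (hypothetical)
train_city = ["Paris", "Paris", "Lyon", "Nice"]
train_price = [300.0, 500.0, 200.0, 400.0]
test_city = ["Paris", "Oslo"]  # "Oslo" never appears in training

table, global_mean = fit_target_encoding(train_city, train_price)
test_encoded = apply_target_encoding(test_city, table, global_mean)
```

Because the test targets never enter `fit_target_encoding`, the encoded test column cannot leak label information.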

Embedding

import torch.nn as nn
# 50 cities → 8-dimensional learned embedding
# Inputs must be integer indices in the range [0, 50)
city_emb = nn.Embedding(num_embeddings=50, embedding_dim=8)
Best for high-cardinality categoricals in deep learning.
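
An embedding layer looks up rows by integer index, so string categories need an index mapping first. The sketch below shows one way to build it; `build_vocab`, `encode`, and the convention of reserving index 0 for categories unseen in training are assumptions of this example.

```python
UNKNOWN_INDEX = 0  # reserved slot for categories not seen in training

def build_vocab(train_values):
    """Assign each distinct training category an index starting at 1."""
    vocab = {}
    for value in train_values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab

def encode(values, vocab):
    """Map categories to indices, sending unknown ones to UNKNOWN_INDEX."""
    return [vocab.get(v, UNKNOWN_INDEX) for v in values]

# Illustrative data (hypothetical)
train_cities = ["Paris", "Lyon", "Paris", "Nice"]
vocab = build_vocab(train_cities)

# Table size for the embedding layer: known categories plus the unknown slot.
num_embeddings = len(vocab) + 1

indices = encode(["Lyon", "Oslo", "Paris"], vocab)
```

The resulting `num_embeddings` would be the value passed to `nn.Embedding`, and `indices` (converted to an integer tensor) would be its input.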


Unstructured Feature Types

| Type | Shape | Typical representation | Common model |
| --- | --- | --- | --- |
| Text | Variable-length sequence | Token IDs (e.g., BPE) | Transformer |
| Image | H × W × C | Pixel values [0, 255] | CNN, ViT |
| Audio | T × F | Spectrogram or waveform | Conv1D, Transformer |
| Graph | N nodes, E edges | Adjacency matrix + features | GNN |
| Time Series | T × F | Ordered sequence | LSTM, Transformer, TCN |
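
For the image row, a minimal sketch of typical preprocessing on an H × W × C array is shown here with plain nested lists. The function name `bgr_to_rgb_unit` and the tiny 1 × 2 image are hypothetical. The example assumes the image arrives in BGR channel order, as it does from some loaders such as OpenCV, while the model expects RGB.

```python
def bgr_to_rgb_unit(image):
    """Reverse the channel order (BGR -> RGB) and scale pixels from [0, 255] to [0, 1]."""
    return [
        [[channel / 255.0 for channel in reversed(pixel)] for pixel in row]
        for row in image
    ]

# A 1 x 2 image with 3 channels per pixel, stored in BGR order (hypothetical values)
image_bgr = [[[255, 0, 0], [0, 128, 255]]]

image_rgb = bgr_to_rgb_unit(image_bgr)
# The first pixel is pure blue: B=255 in BGR becomes the last channel in RGB.
```

In practice the same two steps are usually done on a NumPy array or tensor, but the logic is the same.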

Feature Type → Modeling Implications

| Feature Type | Raw form | Neural network input | Pitfall |
| --- | --- | --- | --- |
| Continuous | Float | Normalize to ~N(0,1) | Large values dominate |
| Discrete | Int | Treat as continuous OR embed | Arbitrary ordering if misidentified |
| Nominal | String | One-hot or embedding | Model assumes order if label-encoded |
| Ordinal | String | Integer mapping | Distances between levels may be unequal |
| Binary | Bool | 0/1 | Class imbalance if rare |
| Text | String | Tokenize → token IDs | Vocabulary size, OOV tokens |
| Image | Array | Pixel / 255 → [0,1] | Channel order (RGB vs BGR) |
| Time series | Array | Windowed segments | Look-ahead leakage |
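
The time-series pitfall can be sketched as follows: cut the series chronologically first, then build windows on each side, so that no training target comes from the test period. The `make_windows` helper, the window length, and the toy series are assumptions for illustration.

```python
def make_windows(series, window, horizon=1):
    """Build (inputs, targets) pairs.

    Each input is `window` consecutive values; its target is the value
    `horizon` steps after the window ends.
    """
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets

# Toy series (hypothetical), ordered in time
series = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
window = 3
split = 7  # chronological cut: indices before 7 are the training period

# Training windows see only the training period.
train_X, train_y = make_windows(series[:split], window)

# Test windows may use the last `window` training values as context,
# but every test target lies at or after the split point.
test_X, test_y = make_windows(series[split - window:], window)
```

Shuffling windows before splitting would mix future values into training, which is the look-ahead leakage the table warns about.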