Feature Types
The type of each feature in your dataset fundamentally shapes which preprocessing, architectures, and loss functions you can use. Misidentifying a feature type is one of the most common sources of bugs in ML pipelines.
Primary Feature Types
```mermaid
graph TD
    D[Data] --> S[Structured]
    D --> U[Unstructured]
    S --> N[Numerical]
    S --> C[Categorical]
    N --> N1["Continuous<br/><small>e.g. height, temperature, price</small>"]
    N --> N2["Discrete<br/><small>e.g. count, age in years</small>"]
    C --> C1["Nominal<br/><small>e.g. color, city, country</small>"]
    C --> C2["Ordinal<br/><small>e.g. small / medium / large</small>"]
    C --> C3["Binary<br/><small>e.g. yes / no, true / false</small>"]
    U --> U1["Text<br/><small>e.g. reviews, documents</small>"]
    U --> U2["Image<br/><small>e.g. photos, scans</small>"]
    U --> U3["Audio<br/><small>e.g. speech, music</small>"]
    U --> U4["Graph<br/><small>e.g. social networks</small>"]
    classDef root fill:#e2e8f0,stroke:#718096,color:#2d3748
    classDef struct fill:#ebf4ff,stroke:#4299e1,color:#2b6cb0
    classDef unstruct fill:#fef5e7,stroke:#ed8936,color:#7b341e
    classDef num fill:#f0fff4,stroke:#48bb78,color:#276749
    classDef cat fill:#fffff0,stroke:#d69e2e,color:#744210
    classDef numLeaf fill:#f0fff4,stroke:#9ae6b4,color:#276749
    classDef catLeaf fill:#fffff0,stroke:#faf089,color:#744210
    classDef uLeaf fill:#fef5e7,stroke:#fbd38d,color:#7b341e
    class D root
    class S struct
    class U unstruct
    class N num
    class C cat
    class N1,N2 numLeaf
    class C1,C2,C3 catLeaf
    class U1,U2,U3,U4 uLeaf
```
Numerical Features
Numerical features represent quantities that can be measured on a continuous or discrete scale.
| Sub-type | Description | Examples | Encoding |
|---|---|---|---|
| Continuous | Infinite values in a range | Height (1.73 m), Temperature (23.4 °C), Price ($12.50) | Use as-is, normalize |
| Discrete | Countable integer values | Age (years), Number of rooms, Visit count | Treat as continuous or ordinal |
Why it matters: Most neural networks expect numerical inputs in a bounded range. Inputs with very different scales (e.g., age ∈ [0, 100] and income ∈ [0, 1,000,000]) cause some features to dominate gradient updates. Always normalize or standardize numerical features.
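```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy data for illustration only (columns: age, income)
X_train = np.array([[25.0, 40_000.0], [40.0, 85_000.0], [58.0, 120_000.0]])
X_test = np.array([[33.0, 60_000.0]])

# Z-score: (x - mean) / std → mean 0, std 1 on the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use train stats!

# Min-Max: (x - min) / (max - min) → [0, 1] on the training data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use train stats!
```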
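To make the "dominates gradient updates" claim concrete, here is a minimal sketch for a single linear unit with squared-error loss, where the gradient for each weight is proportional to that feature's value. The sample values, zero-initialized weights, target, and the helper name `squared_error_gradients` are assumptions chosen for illustration, not part of any library.

```python
def squared_error_gradients(features, weights, target):
    """Gradient of (prediction - target)**2 w.r.t. each weight of a linear unit."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    # d/dw_j (w.x - y)^2 = 2 * (w.x - y) * x_j
    return [2.0 * error * x for x in features]


# Hypothetical sample: age = 50 years, income = 500,000
raw = [50.0, 500_000.0]
weights = [0.0, 0.0]
target = 1.0

grad_age, grad_income = squared_error_gradients(raw, weights, target)
ratio_raw = grad_income / grad_age  # how much larger the income gradient is

# Same sample after min-max scaling with the ranges quoted in the text:
# age in [0, 100], income in [0, 1,000,000]
scaled = [50.0 / 100.0, 500_000.0 / 1_000_000.0]
grad_age_s, grad_income_s = squared_error_gradients(scaled, weights, target)

print(ratio_raw)                   # gradient ratio on raw inputs
print(grad_age_s, grad_income_s)   # gradients after scaling
```

On the raw inputs the income weight receives a gradient several orders of magnitude larger than the age weight, purely because of units. After scaling, both features contribute on a comparable footing.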
Categorical Features
Categorical features represent membership in discrete groups. The key distinction is ordinal vs. nominal.
| Sub-type | Description | Example | Problem with raw integers |
|---|---|---|---|
| Nominal | No natural order | Color: | Model assumes blue > red |
| Ordinal | Has natural order | Size: | Encoding should respect order |
| Binary | Two values | Spam: | Encode as 0/1 |
Encoding strategies
Best for nominal features with low cardinality (< ~20 categories).
⚠️ Must be computed on the training set only to avoid leakage.
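As a rough illustration of how the three categorical sub-types might be encoded, the sketch below uses plain Python with no dependencies. The helper names (`fit_one_hot`, `one_hot`, `ordinal_encode`) and the sample values are hypothetical. In practice, scikit-learn's `OneHotEncoder` and `OrdinalEncoder` cover the same ground. The category vocabulary is learned from the training split only, in line with the leakage warning above.

```python
def fit_one_hot(train_values):
    """Learn the category vocabulary from the training split only."""
    return sorted(set(train_values))


def one_hot(value, categories):
    """One-hot vector; a category unseen during fitting becomes all zeros."""
    return [1 if value == category else 0 for category in categories]


def ordinal_encode(value, ordered_levels):
    """Map a level to its rank in an explicitly stated order."""
    return ordered_levels.index(value)


# Nominal: no order, so each category gets its own indicator column
train_colors = ["red", "blue", "green", "blue"]  # hypothetical training data
color_categories = fit_one_hot(train_colors)
encoded_red = one_hot("red", color_categories)
encoded_unseen = one_hot("purple", color_categories)  # not in the training set

# Ordinal: the order is supplied by hand, never inferred from the data
size_levels = ["small", "medium", "large"]
encoded_large = ordinal_encode("large", size_levels)

# Binary: a single 0/1 column is enough
binary_map = {"no": 0, "yes": 1}
encoded_yes = binary_map["yes"]
```

One-hot encoding avoids implying that one color is "greater" than another, while the explicit `size_levels` list keeps the integer mapping aligned with the real ordering. Mapping unseen categories to an all-zero vector is one possible design choice. Raising an error is another.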
Unstructured Feature Types
| Type | Shape | Typical representation | Common model |
|---|---|---|---|
| Text | Variable sequence | Token IDs (BPE) | Transformer |
| Image | H × W × C | Pixel values [0,255] | CNN, ViT |
| Audio | T × F | Spectrogram or waveform | Conv1D, Transformer |
| Graph | N nodes, E edges | Adjacency matrix + features | GNN |
| Time Series | T × F | Ordered sequence | LSTM, Transformer, TCN |
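To illustrate the "variable sequence to token IDs" row, here is a deliberately simplified sketch. It swaps in a whitespace tokenizer with a fixed vocabulary in place of BPE, which real systems would typically obtain from a trained subword tokenizer. The function names and the `<unk>` token are assumptions made for this example.

```python
def build_vocab(train_texts, unk_token="<unk>"):
    """Assign an integer ID to every token seen in the training texts."""
    vocab = {unk_token: 0}  # ID 0 is reserved for out-of-vocabulary tokens
    for text in train_texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, unk_token="<unk>"):
    """Convert text to token IDs; unknown tokens map to the <unk> ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in text.lower().split()]


train_texts = ["great product", "terrible product"]  # hypothetical reviews
vocab = build_vocab(train_texts)
ids = encode("great terrible gadget", vocab)  # "gadget" was never seen
```

Word-level vocabularies like this one grow with the corpus and send every unseen word to the same ID. Subword schemes such as BPE reduce that problem by splitting rare words into known pieces.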
Feature Type → Modeling Implications
| Feature Type | Raw form | Neural network input | Pitfall |
|---|---|---|---|
| Continuous | Float | Normalize to ~N(0,1) | Large values dominate |
| Discrete | Int | Treat as continuous OR embed | Arbitrary ordering if misidentified |
| Nominal | String | One-hot or embedding | Model assumes order if label-encoded |
| Ordinal | String | Integer mapping | Distances between levels may be unequal |
| Binary | Bool | 0/1 | Class imbalance if rare |
| Text | String | Tokenize → token IDs | Vocabulary size, OOV tokens |
| Image | Array | Pixel / 255 → [0,1] | Channel order (RGB vs BGR) |
| Time series | Array | Windowed segments | Look-ahead leakage |
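The look-ahead leakage pitfall for time series can be sketched as follows: split chronologically first, then build windows within each split so that every target lies strictly after its input window and no training sample touches the test period. The function names, the 80/20 split, and the toy series are assumptions for illustration.

```python
def chronological_split(series, train_fraction=0.8):
    """Split by time order; never shuffle before splitting a time series."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def make_windows(series, window, horizon=1):
    """Build (inputs, target) pairs where the target follows its input window."""
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        pairs.append((inputs, target))
    return pairs


series = list(range(10))  # toy series: value equals its time index
train_series, test_series = chronological_split(series, train_fraction=0.8)

# Windows are built inside each split, so no training pair sees test data
train_pairs = make_windows(train_series, window=3)
test_pairs = make_windows(test_series, window=1)
```

Building windows over the full series and then splitting at random would place overlapping, later observations in the training set, which is the leakage the table warns about.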