
3. Preprocessing

Data preprocessing is a critical phase in the development of neural network models, ensuring that raw data is transformed into a suitable format for effective training and inference. This text explores both basic and advanced preprocessing techniques, drawing from established methodologies in machine learning and deep learning. Basic techniques focus on cleaning and normalizing data to handle inconsistencies and scale issues, while advanced methods address complex challenges such as data scarcity, imbalance, and high dimensionality. The discussion highlights their relevance to neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, with emphasis on improving model convergence, generalization, and performance.

Neural networks, as powerful function approximators, are highly sensitive to the quality and format of input data. Poorly prepared data can lead to slow convergence, overfitting, or suboptimal accuracy. Preprocessing mitigates these issues by addressing noise, inconsistencies, and structural mismatches in datasets. It encompasses a series of steps that transform raw data into a form that aligns with the assumptions and requirements of neural architectures. For instance, in supervised learning tasks, preprocessing ensures features are scaled appropriately to prevent gradient issues during backpropagation. This text delineates basic techniques, which are foundational and widely applicable, and advanced techniques, which are more specialized and often domain-specific, such as for image, text, or time-series data.

Typical Preprocessing Tasks

| Task | Description |
|---|---|
| Text Cleaning | Remove unwanted characters, stop words, and perform stemming/lemmatization. |
| Normalization | Standardize text formats, such as date and currency formats. |
| Tokenization | Split text into words or subwords for easier analysis. |
| Feature Extraction | Convert text into numerical features using techniques like TF-IDF or word embeddings. |
| Data Augmentation | Generate synthetic data to increase dataset size and diversity. |

A typical dataset for machine learning tasks might include columns of different data types, such as numerical, categorical, and text, e.g.:

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 675 | 0 | 2 | Watson, Mr. Ennis Hastings | male | nan | 0 | 0 | 239856 | 0 | nan | S |
| 383 | 0 | 3 | Tikkanen, Mr. Juho | male | 32 | 0 | 0 | STON/O 2. 3101293 | 7.925 | nan | S |
| 657 | 0 | 3 | Radeff, Mr. Alexander | male | nan | 0 | 0 | 349223 | 7.8958 | nan | S |
| 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | nan | 8 | 2 | CA. 2343 | 69.55 | nan | S |
| 688 | 0 | 3 | Dakic, Mr. Branko | male | 19 | 0 | 0 | 349228 | 10.1708 | nan | S |
| 504 | 0 | 3 | Laitinen, Miss. Kristina Sofia | female | 37 | 0 | 0 | 4135 | 9.5875 | nan | S |
| 135 | 0 | 2 | Sobey, Mr. Samuel James Hayden | male | 25 | 0 | 0 | C.A. 29178 | 13 | nan | S |
| 867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27 | 1 | 0 | SC/PARIS 2149 | 13.8583 | nan | C |
| 155 | 0 | 3 | Olsen, Mr. Ole Martin | male | nan | 0 | 0 | Fa 265302 | 7.3125 | nan | S |
| 588 | 1 | 1 | Frolicher-Stehli, Mr. Maxmillian | male | 60 | 1 | 1 | 13567 | 79.2 | B41 | C |

Sample rows from the Titanic dataset

Data Cleaning

Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values in the dataset. Missing values, common in real-world data, can be handled by imputation methods such as mean, median, or mode substitution, or by removing affected rows/columns if the loss is minimal. For example, in pandas this can be implemented as df.fillna(df.mean(numeric_only=True)) for mean imputation of the numeric columns. Outliers, which may skew neural network training, are detected with statistical methods like z-scores or interquartile ranges and can be winsorized or removed. Noise reduction, such as smoothing time-series data with moving averages, is also essential, particularly for RNNs where temporal dependencies are critical. Inconsistent data, like varying formats in text (e.g., dates), requires standardization to ensure uniformity. Overall, data cleaning enhances data quality, reducing the risk of misleading patterns during neural network optimization.

| Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|
| 2 | male | 31 | 1 | 1 | 37.0042 | C |
| 3 | male | 31 | 0 | 0 | 8.05 | S |
| 3 | male | 31 | 0 | 0 | 7.775 | S |
| 3 | female | 30.5 | 0 | 0 | 7.75 | Q |
| 2 | female | 48 | 1 | 2 | 65 | S |
| 3 | female | 1 | 0 | 2 | 15.7417 | C |
| 3 | male | 31 | 0 | 0 | 7.7375 | Q |
| 1 | male | 31 | 0 | 0 | 42.4 | S |
| 1 | male | 48 | 1 | 0 | 52 | S |
| 3 | male | 25 | 0 | 0 | 7.05 | S |
import pandas as pd

# Preprocess the data
def preprocess(df):
    # Fill missing values (assignment avoids pandas' deprecated chained inplace fillna)
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())

    # Select features
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    return df[features]

# Load the Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
df = df.sample(n=10)

# Preprocessing
df = preprocess(df)

# Display the sampled, preprocessed rows
print(df.to_markdown(index=False))
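
The cleaning step above covers imputation; for the outlier handling mentioned earlier, here is a minimal sketch of IQR-based winsorization (the 1.5 × IQR rule from the text), applied to the Fare column of the preprocessed frame. The helper winsorize_iqr is a hypothetical name introduced for illustration:

import pandas as pd

def winsorize_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    # Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Cap extreme fares instead of dropping the rows
df['Fare'] = winsorize_iqr(df['Fare'])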

Encoding Categorical Variables

Categorical data, non-numeric by nature, must be converted for neural network input. One-hot encoding creates binary vectors for each category, e.g., transforming colors ['red', 'blue', 'green'] into [[1,0,0], [0,1,0], [0,0,1]]. This avoids ordinal assumptions but increases dimensionality, which can be mitigated by embedding layers in neural networks for high-cardinality features. Label encoding assigns integers (e.g., 0 for "red", 1 for "blue"), suitable for ordinal categories but risky for nominal ones due to implied ordering. For text data in NLP tasks with transformers, tokenization and subword encoding (e.g., WordPiece) are basic steps to map words to integer IDs.

Pclass Sex Age SibSp Parch Fare Embarked
3 1 33 1 1 20.525 2
2 1 28 0 0 10.5 2
3 1 21 0 0 8.05 2
1 1 18 1 0 108.9 0
3 0 21 0 0 7.75 1
3 1 28 0 0 7.8958 2
1 0 49 1 0 76.7292 0
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Preprocess the data
def preprocess(df):
    # Fill missing values (assignment avoids pandas' deprecated chained inplace fillna)
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())

    # Encode categorical variables as integer codes (the encoder is fit per column)
    df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
    df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])

    # Select features
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    return df[features]

# Load the Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
df = df.sample(n=10)

# Preprocessing
df = preprocess(df)

# Display 7 of the 10 sampled, preprocessed rows
print(df.sample(n=7).to_markdown(index=False))
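
The snippet above uses label encoding, which implies an ordering. For nominal features such as Sex and Embarked, one-hot encoding avoids that assumption; a minimal sketch using pandas' get_dummies, applied to the raw columns before any integer encoding:

import pandas as pd

# Load the raw dataset again (same URL as above)
df_raw = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')

# Each category becomes its own binary indicator column
df_onehot = pd.get_dummies(df_raw[['Pclass', 'Sex', 'Embarked']], columns=['Sex', 'Embarked'])
print(df_onehot.head().to_markdown(index=False))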

Normalization and Standardization

Normalization scales features to a bounded range, typically \([0, 1]\), using min-max scaling:

\[ x' = \displaystyle \frac{x - \min(x)}{\max(x) - \min(x)} \]

This is crucial for neural networks employing sigmoid or tanh activations, as it prevents saturation.

Standardization, or z-score normalization, transforms data to have a mean of \(0\) and standard deviation of \(1\):

\[ x' = \frac{x - \mu}{\sigma}, \]

where \(\mu\) is the mean and \(\sigma\) the standard deviation. It is preferred for networks with ReLU activations or when data distributions are Gaussian-like, aiding faster gradient descent convergence. In practice, libraries like scikit-learn provide MinMaxScaler and StandardScaler for these operations. These techniques are especially vital in multilayer perceptrons (MLPs) and CNNs, where feature scales can dominate loss landscapes.

Below is an example of how to apply normalization and standardization using pandas, based on Apple (AAPL) price history retrieved with yfinance:

Processed (normalized and standardized) features:

| Date | Volume | N-Volume | Z-Volume | Change | N-Change | Z-Change |
|---|---|---|---|---|---|---|
| 2025-08-12 00:00:00-04:00 | 5.56262e+07 | 0.767907 | 0.646968 | 0.0108724 | 0.529309 | 0.781748 |
| 2025-08-13 00:00:00-04:00 | 6.98785e+07 | 1 | 1.63453 | 0.0160244 | 0.618406 | 1.17505 |
| 2025-08-14 00:00:00-04:00 | 5.19163e+07 | 0.707493 | 0.389903 | -0.00235719 | 0.30052 | -0.228212 |
| 2025-08-15 00:00:00-04:00 | 5.60387e+07 | 0.774624 | 0.67555 | -0.00511213 | 0.252877 | -0.438527 |
| 2025-08-18 00:00:00-04:00 | 3.74762e+07 | 0.472341 | -0.610675 | -0.00302257 | 0.289013 | -0.279008 |
| 2025-08-19 00:00:00-04:00 | 3.94026e+07 | 0.503712 | -0.477192 | -0.00142926 | 0.316567 | -0.157373 |
| 2025-08-20 00:00:00-04:00 | 4.22639e+07 | 0.550307 | -0.278928 | -0.0197346 | 0 | -1.55481 |
| 2025-08-21 00:00:00-04:00 | 3.06212e+07 | 0.360711 | -1.08567 | -0.00491129 | 0.25635 | -0.423194 |
| 2025-08-22 00:00:00-04:00 | 4.24778e+07 | 0.553791 | -0.264106 | 0.0127168 | 0.561205 | 0.922545 |
| 2025-08-25 00:00:00-04:00 | 3.09831e+07 | 0.366604 | -1.06059 | -0.00263431 | 0.295727 | -0.249368 |

Raw AAPL price history:

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits | Change |
|---|---|---|---|---|---|---|---|---|
| 2025-08-11 00:00:00-04:00 | 227.92 | 229.56 | 224.76 | 227.18 | 6.18061e+07 | 0.26 | 0 | nan |
| 2025-08-12 00:00:00-04:00 | 228.01 | 230.8 | 227.07 | 229.65 | 5.56262e+07 | 0 | 0 | 0.0108724 |
| 2025-08-13 00:00:00-04:00 | 231.07 | 235 | 230.43 | 233.33 | 6.98785e+07 | 0 | 0 | 0.0160244 |
| 2025-08-14 00:00:00-04:00 | 234.06 | 235.12 | 230.85 | 232.78 | 5.19163e+07 | 0 | 0 | -0.00235719 |
| 2025-08-15 00:00:00-04:00 | 234 | 234.28 | 229.34 | 231.59 | 5.60387e+07 | 0 | 0 | -0.00511213 |
| 2025-08-18 00:00:00-04:00 | 231.7 | 233.12 | 230.11 | 230.89 | 3.74762e+07 | 0 | 0 | -0.00302257 |
| 2025-08-19 00:00:00-04:00 | 231.28 | 232.87 | 229.35 | 230.56 | 3.94026e+07 | 0 | 0 | -0.00142926 |
| 2025-08-20 00:00:00-04:00 | 229.98 | 230.47 | 225.77 | 226.01 | 4.22639e+07 | 0 | 0 | -0.0197346 |
| 2025-08-21 00:00:00-04:00 | 226.27 | 226.52 | 223.78 | 224.9 | 3.06212e+07 | 0 | 0 | -0.00491129 |
| 2025-08-22 00:00:00-04:00 | 226.17 | 229.09 | 225.41 | 227.76 | 4.24778e+07 | 0 | 0 | 0.0127168 |
import pandas as pd
import yfinance as yf

# Download one month of AAPL price history
dat = yf.Ticker("AAPL")
df = dat.history(period='1mo')

# Daily percentage change of the closing price
df['Change'] = df['Close'].pct_change()

# Z-score standardization and min-max normalization (vectorized column arithmetic)
df['Z-Volume'] = (df['Volume'] - df['Volume'].mean()) / df['Volume'].std()
df['N-Volume'] = (df['Volume'] - df['Volume'].min()) / (df['Volume'].max() - df['Volume'].min())
df['Z-Change'] = (df['Change'] - df['Change'].mean()) / df['Change'].std()
df['N-Change'] = (df['Change'] - df['Change'].min()) / (df['Change'].max() - df['Change'].min())

df = df[['Volume', 'N-Volume', 'Z-Volume', 'Change', 'N-Change', 'Z-Change']].dropna()
print(df.head(10).to_markdown())
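
The same transformations can be done with scikit-learn's MinMaxScaler and StandardScaler, mentioned above; a minimal sketch, reusing the df from the previous snippet. Note that StandardScaler uses the population standard deviation (ddof = 0), so its z-scores differ slightly from the pandas std() version (ddof = 1):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

cols = ['Volume', 'Change']
df[['N-Volume', 'N-Change']] = MinMaxScaler().fit_transform(df[cols])    # scale to [0, 1]
df[['Z-Volume', 'Z-Change']] = StandardScaler().fit_transform(df[cols])  # mean 0, std 1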

Feature Scaling

Feature scaling overlaps with normalization but specifically addresses disparate scales across features. Beyond min-max and z-score, logarithmic scaling (\( x' = \log(x + 1) \)) handles skewed distributions, common in financial data for neural forecasting models. Scaling ensures equal contribution of features during weight updates in stochastic gradient descent (SGD).
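
A minimal sketch of logarithmic scaling with NumPy's log1p, which computes \( \log(x + 1) \) and therefore handles zeros safely (the fare values are illustrative):

import numpy as np

fares = np.array([0.0, 7.25, 71.28, 512.33])  # skewed values, e.g., ticket fares
log_fares = np.log1p(fares)                   # log(x + 1) compresses the long right tail
print(log_fares)                              # approximately [0.00, 2.11, 4.28, 6.24]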

Data Augmentation

Data augmentation artificially expands datasets to combat overfitting, particularly in CNNs for image classification. Basic operations include flipping, rotation (e.g., by 90° or random angles), and cropping, while advanced methods involve adding noise (Gaussian or salt-and-pepper) or color jittering. For text data in RNNs or transformers, techniques like synonym replacement, random insertion/deletion, or back-translation (translating to another language and back) generate variations while preserving semantics. In time-series for LSTMs, window slicing or synthetic minority over-sampling technique (SMOTE)1 variants create augmented sequences. Generative models like GANs (Generative Adversarial Networks) represent cutting-edge augmentation, producing realistic synthetic samples. These methods improve generalization by exposing models to diverse inputs.
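
A minimal NumPy sketch of two basic image augmentations named above, random horizontal flipping and additive Gaussian noise (the image shape and noise level are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def augment(image, noise_std=0.05):
    # Random horizontal flip
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Additive Gaussian noise, clipped back to the valid pixel range
    noisy = image + rng.normal(0.0, noise_std, image.shape)
    return np.clip(noisy, 0.0, 1.0)

batch = rng.random((4, 28, 28))  # toy batch of grayscale images in [0, 1]
augmented = np.stack([augment(img) for img in batch])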

Handling Imbalanced Data

Imbalanced datasets, where classes are unevenly represented, bias neural networks toward majority classes. Advanced resampling includes oversampling minorities (e.g., SMOTE, which interpolates new instances) or undersampling majorities. Class weighting assigns higher penalties to minority misclassifications in the loss function, e.g., weighted cross-entropy. Ensemble methods, like balanced random forests integrated with neural embeddings, or focal loss in object detection CNNs, further address this. For sequential data, temporal resampling ensures balanced windows.
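
A minimal sketch of class weighting with scikit-learn's compute_class_weight; the resulting weights can then be supplied to a weighted cross-entropy loss (for example, the weight argument of PyTorch's CrossEntropyLoss):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # imbalanced labels: 90% class 0, 10% class 1

# 'balanced' gives n_samples / (n_classes * class_count) per class
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))  # {0: ~0.56, 1: 5.0}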

Feature Engineering and Selection

Feature engineering crafts new features from existing ones, such as polynomial terms or interactions (e.g., \( x_1 \times x_2 \)) to capture non-linearities before neural input. Selection techniques like mutual information or recursive feature elimination reduce irrelevant features, alleviating the curse of dimensionality in high-dimensional data for autoencoders or dense networks. Embedded methods, like L1 regularization in neural training, perform selection during optimization.
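
A minimal sketch combining both ideas with scikit-learn: PolynomialFeatures generates interaction terms such as \( x_1 \times x_2 \), and mutual_info_classif scores each engineered feature against the label (the toy data is illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # the label depends on an interaction

# Engineer interaction features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

# Score each feature; the x0*x1 interaction should stand out
scores = mutual_info_classif(X_poly, y, random_state=0)
for name, score in zip(poly.get_feature_names_out(), scores):
    print(f'{name}: {score:.3f}')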

Dimensionality Reduction

Techniques like Principal Component Analysis (PCA) project data onto lower-dimensional spaces while preserving variance:

\[ X' = X \cdot W \]

where \(W\) are principal components. Autoencoders, a neural-based approach, learn compressed representations through encoder-decoder architectures. t-SNE or UMAP are used for visualization but less for preprocessing due to non-linearity. These are vital for CNNs on high-resolution images or transformers on long sequences to reduce computational load.

PCA is widely used for dimensionality reduction2, while t-SNE3 and UMAP4 are popular for visualizing high-dimensional data in 2D or 3D spaces.

Basically, PCA identifies orthogonal axes (principal components) capturing maximum variance, enabling efficient data representation. Autoencoders, trained to reconstruct inputs, learn compact latent spaces, useful for denoising or anomaly detection.
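
As a neural counterpart to the PCA example below, here is a minimal autoencoder sketch in PyTorch (PyTorch and the layer sizes are assumptions for illustration), compressing 4-dimensional inputs into a 2-dimensional latent space by minimizing reconstruction error:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=4, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 8), nn.ReLU(), nn.Linear(8, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 8), nn.ReLU(), nn.Linear(8, in_dim))

    def forward(self, x):
        z = self.encoder(x)      # compressed (latent) representation
        return self.decoder(z)   # reconstruction of the input

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(64, 4)  # toy batch
for _ in range(100):    # train to reconstruct the inputs
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()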

PCA Steps2

1. Standardize the data:

\[ X' = \frac{X - \mu}{\sigma} \]

2. Compute the covariance matrix:

\[ C = \frac{1}{n} X'^{\top} X' \]

(NumPy's np.cov, used in the example below, applies the sample estimator \( \frac{1}{n-1} \) by default.)

3. Calculate eigenvalues and eigenvectors:

eig_vals, eig_vecs = np.linalg.eig(C)

4. Sort eigenvectors by eigenvalues in descending order.

5. Select the top \(k\) eigenvectors to form a new feature space:

\[ Y = X' W \]

where \(W\) is the matrix of selected eigenvectors.

An example of PCA applied to the Iris dataset:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Loading Iris dataset
iris = load_iris()

# Transform in dataframe
df = pd.DataFrame(
    data=iris.data,
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]

X = df.iloc[:,0:4].values
y = df.iloc[:,4].values

# Standardizing
X_std = StandardScaler().fit_transform(X)

# Covariance matrix of the standardized features
cov_mat = np.cov(X_std.T)

# Calculate eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs: print(i[0])

# Cumulative explained variance of each eigenvalue
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

n_eigen = [1, 2, 3, 4]

# Plot individual and cumulative explained variance
plt.figure(figsize=(6, 4))
plt.bar(n_eigen, var_exp, alpha=0.5, align='center',
    label='individual explained variance')
plt.step(n_eigen, cum_var_exp, where='mid',
    label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()

# Keep only the eigenvectors of the two largest eigenvalues
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
                      eig_pairs[1][1].reshape(4,1)))

print('*' * 10)
print('Reduced to 2-D')
print('Matrix W:\n', matrix_w)

# Calculate the new Y for all samples
Y = X_std.dot(matrix_w)

# Plot the data on the first two principal components
plt.figure(figsize=(6, 4))
for lab, col in zip(('setosa', 'versicolor', 'virginica'), ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()

# Print the figure as SVG for embedding in the HTML page
buffer = StringIO()
plt.savefig(buffer, format="svg", transparent=True)
print(buffer.getvalue())
plt.close()

Now, the same example using scikit-learn is shown below:

import matplotlib.pyplot as plt
import pandas as pd
from io import StringIO
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Loading Iris dataset
iris = load_iris()

# Transform in dataframe
df = pd.DataFrame(
    data=iris.data,
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]

X = df.iloc[:,0:4].values
y = df.iloc[:,4].values

# Standardizing
X_std = StandardScaler().fit_transform(X)

sklearn_pca = PCA(n_components=2)
Y = sklearn_pca.fit_transform(X_std)

# Plot the data on the first two principal components
plt.figure(figsize=(6, 4))
for lab, col in zip(('setosa', 'versicolor', 'virginica'), ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()

# Print the figure as SVG for embedding in the HTML page
buffer = StringIO()
plt.savefig(buffer, format="svg", transparent=True)
print(buffer.getvalue())
plt.close()

Eigenfaces, a PCA variant, is used in face recognition tasks to reduce image dimensions while retaining essential features5. In NLP, techniques like Latent Semantic Analysis (LSA) apply SVD (Singular Value Decomposition) to reduce term-document matrices, enhancing transformer efficiency.
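
A minimal LSA sketch with scikit-learn, reducing a TF-IDF term-document matrix with TruncatedSVD (the tiny corpus is illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'stock prices rose sharply today',
    'markets and prices fell today',
]

X = TfidfVectorizer().fit_transform(corpus)  # sparse term-document matrix
lsa = TruncatedSVD(n_components=2)           # SVD-based reduction to 2 latent topics
X_reduced = lsa.fit_transform(X)
print(X_reduced.shape)                       # (4, 2)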

Domain-Specific Advanced Techniques

For time-series in RNNs, techniques include Fast Fourier Transform (FFT) for frequency domain conversion or segmentation into fixed windows. In text preprocessing for sentiment analysis, advanced steps encompass negation handling (e.g., marking "not good" as "not_pos"), intensification (e.g., "very good" as "strong_pos"), and POS tagging to retain sentiment-bearing words. For images in CNNs, advanced signal processing like wavelet transforms or conversion to spectrograms enhances fault diagnosis applications.
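
A minimal NumPy sketch of the FFT step, converting a noisy 5 Hz sine wave to the frequency domain (the sampling rate and signal are illustrative):

import numpy as np

fs = 100                                     # sampling rate in Hz
t = np.arange(0, 2, 1 / fs)                  # 2 seconds of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)

spectrum = np.fft.rfft(signal)               # frequency-domain representation
freqs = np.fft.rfftfreq(t.size, d=1 / fs)    # frequency of each bin in Hz
peak = freqs[np.argmax(np.abs(spectrum))]
print(f'dominant frequency: {peak:.1f} Hz')  # ~5.0 Hz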


Appendix

Normalization vs. Standardization in Neural Network Preprocessing

In data preprocessing for neural networks, both normalization and standardization are feature scaling techniques used to handle features with different scales, improve model convergence, stabilize gradients during training, and prevent features with larger ranges from dominating the learning process. These methods are particularly important for optimization algorithms like gradient descent, which neural networks rely on.

  • Normalization (Min-Max Scaling): This scales the data to a fixed range, typically [0, 1] or [-1, 1], using the formula: \( \displaystyle x' = \frac{x - \min(x)}{\max(x) - \min(x)} \). It preserves the original data distribution but is sensitive to outliers, as extreme values can compress the rest of the data into a narrow interval.

  • Standardization (Z-Score Scaling): This transforms the data to have a mean of 0 and a standard deviation of 1, using the formula: \( \displaystyle z = \frac{x - \mu}{\sigma} \), where \(\mu\) is the mean and \(\sigma\) is the standard deviation. It centers the data and is more robust to outliers, but it does not bound the values to a specific range.

There is no universal "better" method; the choice depends on the data characteristics, the neural network architecture, and empirical testing. If the dataset is small, it is often worth experimenting with both to see which yields better performance. The guidelines below indicate when each is appropriate, with specific situations and cases, especially in the context of neural networks.

When to Use Normalization

Normalization is preferred when the data needs to be bounded within a specific range, the distribution is unknown or non-Gaussian, and there are no significant outliers. It helps avoid numeric overflow in neural networks, speeds up learning, and works well with activation functions that expect inputs in a constrained range. Key situations include:

  • Bounded Data or Activation Functions Sensitive to Range: Use normalization for neural networks with sigmoid or tanh activations, as these functions perform better with inputs scaled to [0, 1] or [-1, 1] to prevent saturation (where gradients become near zero). For example, in image classification tasks with convolutional neural networks (CNNs), pixel values (typically 0-255) are often normalized to [0, 1] to ensure consistent scaling and faster convergence.

  • Features with Known Min/Max Bounds and No Outliers: When the data has clear minimum and maximum values (e.g., sensor readings bounded between fixed limits), normalization prevents larger-scale features from dominating. A case is processing demographic data like age (e.g., 0-100) in a feedforward neural network for prediction tasks, where scaling to [0, 1] maintains proportionality without assuming a normal distribution.

  • General Speed Improvements in Training: In scenarios where neural networks handle features like age and weight, normalization to [0, 1] can accelerate training and testing by keeping inputs small and consistent, reducing the risk of overflow.

When to Use Standardization

Standardization is suitable when the data approximates a Gaussian (normal) distribution, outliers are present, or the model benefits from centered data with unit variance. It helps prevent gradient saturation in neural networks, improves numerical stability, and is often the default choice for many algorithms. Specific cases include:

  • Data with Outliers or Unknown Distribution: Standardization is more robust to outliers, as it doesn't compress values into a fixed range like normalization does. For instance, in financial datasets for stock price prediction using recurrent neural networks (RNNs), where extreme values (e.g., market crashes) are common, standardization preserves the relative importance of outliers without skewing the scale.

  • Gaussian-Like Data or Convergence-Focused Models: When features follow a bell-curve distribution (verifiable by plotting), standardization aligns with assumptions in techniques like batch normalization in deep neural networks. An example is sensor data analysis in IoT applications with neural networks, where standardization ensures faster gradient descent convergence by centering the data.

  • Standard Practice for Neural Networks: As recommended in foundational work like Yann LeCun's efficient backpropagation paper, scaling to mean 0 and variance 1 is a go-to method to avoid saturating hidden units and handle numerical issues in training. This is common in large-scale datasets for tasks like natural language processing with transformers.

In practice, for neural networks, standardization is often preferred as a starting point due to its robustness, but normalization shines in bounded, outlier-free scenarios. Always apply scaling after splitting data into train/test sets to avoid data leakage, and use libraries like scikit-learn's MinMaxScaler or StandardScaler for implementation.
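
A minimal sketch of that last point, fitting the scaler on the training split only so the test set never influences the scaling statistics (the toy data is illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 3) * [1, 10, 100]  # features on very different scales
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit statistics on training data only
X_test = scaler.transform(X_test)        # reuse the same statistics on test data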