3. Preprocessing

Data preprocessing is a critical phase in the development of neural network models, ensuring that raw data is transformed into a suitable format for effective training and inference. This text explores both basic and advanced preprocessing techniques, drawing from established methodologies in machine learning and deep learning. Basic techniques focus on cleaning and normalizing data to handle inconsistencies and scale issues, while advanced methods address complex challenges such as data scarcity, imbalance, and high dimensionality. The discussion highlights their relevance to neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, with emphasis on improving model convergence, generalization, and performance.

Neural networks, as powerful function approximators, are highly sensitive to the quality and format of input data. Poorly prepared data can lead to slow convergence, overfitting, or suboptimal accuracy. Preprocessing mitigates these issues by addressing noise, inconsistencies, and structural mismatches in datasets. It encompasses a series of steps that transform raw data into a form that aligns with the assumptions and requirements of neural architectures. For instance, in supervised learning tasks, preprocessing ensures features are scaled appropriately to prevent gradient issues during backpropagation. This text delineates basic techniques, which are foundational and widely applicable, and advanced techniques, which are more specialized and often domain-specific, such as for image, text, or time-series data.

Typical Preprocessing Tasks

| Task | Description |
| --- | --- |
| Text Cleaning | Remove unwanted characters, stop words, and perform stemming/lemmatization. |
| Normalization | Standardize text formats, such as date and currency formats. |
| Tokenization | Split text into words or subwords for easier analysis. |
| Feature Extraction | Convert text into numerical features using techniques like TF-IDF or word embeddings. |
| Data Augmentation | Generate synthetic data to increase dataset size and diversity. |
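
As a rough illustration of the text cleaning and tokenization tasks listed above, the sketch below lowercases a sentence, strips punctuation, splits on whitespace, and drops stop words. The stop-word list and the function name `clean_and_tokenize` are hypothetical and chosen only for this example; real pipelines usually rely on a dedicated NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}

def clean_and_tokenize(text):
    """Lowercase, remove punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    # Replace every character that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = clean_and_tokenize("The model is trained, and the loss of the network drops!")
print(tokens)
```

The resulting tokens would still need to be mapped to numbers (for example with TF-IDF or an embedding layer) before they can be fed to a network.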

A typical dataset for machine learning tasks might include columns of different data types, such as numerical, categorical, and text, e.g.:

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 844 | 0 | 3 | Lemberopolous, Mr. Peter L | male | 34.5 | 0 | 0 | 2683 | 6.4375 | nan | C |
| 103 | 0 | 1 | White, Mr. Richard Frasar | male | 21 | 0 | 1 | 35281 | 77.2875 | D26 | S |
| 160 | 0 | 3 | Sage, Master. Thomas Henry | male | nan | 8 | 2 | CA. 2343 | 69.55 | nan | S |
| 190 | 0 | 3 | Turcin, Mr. Stjepan | male | 36 | 0 | 0 | 349247 | 7.8958 | nan | S |
| 185 | 1 | 3 | Kink-Heilmann, Miss. Luise Gretchen | female | 4 | 0 | 2 | 315153 | 22.025 | nan | S |
| 588 | 1 | 1 | Frolicher-Stehli, Mr. Maxmillian | male | 60 | 1 | 1 | 13567 | 79.2 | B41 | C |
| 660 | 0 | 1 | Newell, Mr. Arthur Webster | male | 58 | 0 | 2 | 35273 | 113.275 | D48 | C |
| 674 | 1 | 2 | Wilhelms, Mr. Charles | male | 31 | 0 | 0 | 244270 | 13 | nan | S |
| 526 | 0 | 3 | Farrell, Mr. James | male | 40.5 | 0 | 0 | 367232 | 7.75 | nan | Q |
| 703 | 0 | 3 | Barbara, Miss. Saiide | female | 18 | 0 | 1 | 2691 | 14.4542 | nan | C |

Sample rows from the Titanic dataset

Data Cleaning

Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values in the dataset. Missing values, common in real-world data, can be handled by imputation (mean, median, or mode substitution) or by removing the affected rows or columns when the resulting loss of data is small. For example, in pandas, mean imputation of numeric columns can be written as df.fillna(df.mean(numeric_only=True)). Outliers, which may skew neural network training, are detected using statistical methods such as z-scores or interquartile ranges and can be winsorized or removed. Noise reduction, such as smoothing time-series data with moving averages, is also important, particularly for RNNs, where temporal dependencies are critical. Inconsistent data, such as text fields with varying date formats, requires standardization to ensure uniformity. Overall, data cleaning improves data quality and reduces the risk that the network fits misleading patterns during optimization.

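| Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| --- | --- | --- | --- | --- | --- | --- |
| 3 | male | 19 | 0 | 0 | 8.05 | S |
| 1 | male | 36 | 0 | 0 | 40.125 | C |
| 3 | male | 27 | 1 | 0 | 24.15 | Q |
| 2 | female | 4 | 2 | 1 | 39 | S |
| 1 | male | 27 | 0 | 0 | 0 | S |
| 1 | male | 42 | 0 | 0 | 26.2875 | S |
| 2 | female | 33 | 0 | 2 | 26 | S |
| 1 | male | 27 | 1 | 0 | 53.1 | S |
| 2 | male | 25 | 0 | 0 | 13 | S |
| 3 | male | 27 | 0 | 0 | 8.05 | S |
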
import pandas as pd

# Preprocess the data
def preprocess(df):
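    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Fill missing values by assigning back to the column; chained
    # fillna(..., inplace=True) is deprecated in recent pandas versions
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())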

    # Select features
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    return df[features]

# Load the Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
df = df.sample(n=10)

# Preprocessing
df = preprocess(df)

# Display the preprocessed sample as a Markdown table (requires the tabulate package)
print(df.to_markdown(index=False))
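
Beyond the missing-value imputation shown above, the outlier handling and smoothing steps described in this section can be sketched as follows. The series values are made up for illustration (they are not taken from the Titanic file), and clipping to the IQR fences is only one possible winsorizing-style choice.

```python
import pandas as pd

# Hypothetical toy series with one obvious outlier
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 7.75, 512.33, 10.5, 9.0])

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (fare < lower) | (fare > upper)

# Clip to the IQR fences instead of dropping rows (winsorizing-style step)
fare_clipped = fare.clip(lower=lower, upper=upper)

# Z-scores as an alternative detection rule (sample standard deviation)
z_scores = (fare - fare.mean()) / fare.std()

# Smooth with a centered 3-point moving average
fare_smooth = fare_clipped.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({"fare": fare, "outlier": is_outlier,
                    "clipped": fare_clipped, "smooth": fare_smooth}))
```

Clipping keeps every row, which matters for small datasets; removing the flagged rows is the simpler option when data is plentiful.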

Encoding Categorical Variables

Categorical data, non-numeric by nature, must be converted for neural network input. One-hot encoding creates binary vectors for each category, e.g., transforming colors ['red', 'blue', 'green'] into [[1,0,0], [0,1,0], [0,0,1]]. This avoids ordinal assumptions but increases dimensionality, which can be mitigated by embedding layers in neural networks for high-cardinality features. Label encoding assigns integers (e.g., 0 for "red", 1 for "blue"), suitable for ordinal categories but risky for nominal ones due to implied ordering. For text data in NLP tasks with transformers, tokenization and subword encoding (e.g., WordPiece) are basic steps to map words to integer IDs.

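| Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
| --- | --- | --- | --- | --- | --- | --- |
| 3 | 1 | 16 | 4 | 1 | 39.6875 | 1 |
| 3 | 1 | 34 | 1 | 1 | 14.4 | 1 |
| 2 | 1 | 16 | 0 | 0 | 26 | 1 |
| 3 | 1 | 34 | 1 | 2 | 23.45 | 1 |
| 1 | 0 | 34 | 1 | 0 | 89.1042 | 0 |
| 3 | 0 | 37 | 0 | 0 | 9.5875 | 1 |
| 1 | 0 | 39 | 1 | 1 | 110.883 | 0 |
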
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Preprocess the data
def preprocess(df):
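    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Fill missing values by assigning back to the column; chained
    # fillna(..., inplace=True) is deprecated in recent pandas versions
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())

    # Convert categorical variables to integer codes.
    # fit_transform refits the encoder for each column; note that the codes
    # imply an ordering, which nominal columns such as Embarked do not have.
    label_encoder = LabelEncoder()
    df['Sex'] = label_encoder.fit_transform(df['Sex'])
    df['Embarked'] = label_encoder.fit_transform(df['Embarked'])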

    # Select features
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    return df[features]

# Load the Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
df = df.sample(n=10)

# Preprocessing
df = preprocess(df)

# Display 7 randomly chosen rows as a Markdown table
print(df.sample(n=7).to_markdown(index=False))
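
The example above uses label encoding. For comparison, the one-hot alternative described in this section could look like the sketch below, which assumes a small hypothetical frame that mirrors two Titanic columns rather than the downloaded file.

```python
import pandas as pd

# Hypothetical mini-frame mirroring two categorical Titanic columns
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# One binary column per category; dtype=int yields 0/1 instead of booleans
one_hot = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)
print(one_hot)
```

Each row has exactly one 1 among the Embarked_* columns, so no ordering between ports is implied. The cost is one extra column per category, which is why embeddings are usually preferred for high-cardinality features.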

Normalization and Standardization

Normalization scales features to a bounded range, typically \([0, 1]\), using min-max scaling:

\[ x' = \displaystyle \frac{x - \min(x)}{\max(x) - \min(x)} \]

This matters for networks with sigmoid or tanh activations, because large-magnitude inputs push these units into their flat, saturated regions, where gradients are close to zero.

Standardization, or z-score normalization, transforms data to have a mean of \(0\) and standard deviation of \(1\):

\[ x' = \frac{x - \mu}{\sigma}, \]

where \(\mu\) is the mean and \(\sigma\) the standard deviation. It is often preferred for networks with ReLU activations or when data distributions are roughly Gaussian, and it typically speeds up gradient descent convergence. In practice, libraries like scikit-learn provide MinMaxScaler and StandardScaler for these operations. These techniques are especially important in multilayer perceptrons (MLPs) and CNNs, where features with large scales can otherwise dominate the gradients and distort the loss landscape.

Below is an example of how to apply normalization and standardization using pandas, based on Apple (AAPL, NASDAQ) stock prices retrieved with the yfinance library:

| Date | Volume | N-Volume | Z-Volume | Change | N-Change | Z-Change |
| --- | --- | --- | --- | --- | --- | --- |
| 2025-07-22 00:00:00-04:00 | 4.64041e+07 | 0.116891 | -0.637333 | 0.00903618 | 0.44842 | 0.271479 |
| 2025-07-23 00:00:00-04:00 | 4.69893e+07 | 0.124553 | -0.612765 | -0.00116609 | 0.314022 | -0.290172 |
| 2025-07-24 00:00:00-04:00 | 4.60226e+07 | 0.111896 | -0.65335 | -0.00182115 | 0.305392 | -0.326235 |
| 2025-07-25 00:00:00-04:00 | 4.02688e+07 | 0.036563 | -0.894912 | 0.00056142 | 0.336779 | -0.19507 |
| 2025-07-28 00:00:00-04:00 | 3.7858e+07 | 0.00499883 | -0.996124 | 0.000794875 | 0.339854 | -0.182218 |
| 2025-07-29 00:00:00-04:00 | 5.14117e+07 | 0.182455 | -0.427099 | -0.0129877 | 0.158291 | -0.940969 |
| 2025-07-30 00:00:00-04:00 | 4.55125e+07 | 0.105218 | -0.674765 | -0.0105079 | 0.190958 | -0.804453 |
| 2025-07-31 00:00:00-04:00 | 8.06984e+07 | 0.5659 | 0.802445 | -0.00707962 | 0.23612 | -0.615722 |
| 2025-08-01 00:00:00-04:00 | 1.04434e+08 | 0.876672 | 1.79896 | -0.0250036 | 0 | -1.60247 |
| 2025-08-04 00:00:00-04:00 | 7.51093e+07 | 0.492723 | 0.567798 | 0.00479297 | 0.392523 | 0.0378834 |

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits | Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2025-07-21 00:00:00-04:00 | 211.86 | 215.535 | 211.39 | 212.239 | 5.13774e+07 | 0 | 0 | nan |
| 2025-07-22 00:00:00-04:00 | 212.898 | 214.706 | 211.989 | 214.157 | 4.64041e+07 | 0 | 0 | 0.00903618 |
| 2025-07-23 00:00:00-04:00 | 214.756 | 214.906 | 212.169 | 213.907 | 4.69893e+07 | 0 | 0 | -0.00116609 |
| 2025-07-24 00:00:00-04:00 | 213.658 | 215.445 | 213.288 | 213.518 | 4.60226e+07 | 0 | 0 | -0.00182115 |
| 2025-07-25 00:00:00-04:00 | 214.457 | 214.996 | 213.158 | 213.638 | 4.02688e+07 | 0 | 0 | 0.00056142 |
| 2025-07-28 00:00:00-04:00 | 213.787 | 214.606 | 212.818 | 213.807 | 3.7858e+07 | 0 | 0 | 0.000794875 |
| 2025-07-29 00:00:00-04:00 | 213.937 | 214.566 | 210.581 | 211.031 | 5.14117e+07 | 0 | 0 | -0.0129877 |
| 2025-07-30 00:00:00-04:00 | 211.66 | 212.149 | 207.485 | 208.813 | 4.55125e+07 | 0 | 0 | -0.0105079 |
| 2025-07-31 00:00:00-04:00 | 208.254 | 209.602 | 206.925 | 207.335 | 8.06984e+07 | 0 | 0 | -0.00707962 |
| 2025-08-01 00:00:00-04:00 | 210.631 | 213.338 | 201.272 | 202.151 | 1.04434e+08 | 0 | 0 | -0.0250036 |

import pandas as pd
import yfinance as yf

dat = yf.Ticker("AAPL")
df = dat.history(period='1mo')

# Daily relative change of the closing price (the first row is NaN)
df['Change'] = df['Close'].pct_change()

# Vectorized column arithmetic: the statistics are computed once per column
# instead of once per row inside apply()
volume, change = df['Volume'], df['Change']

# Z-score standardization (pandas .std() is the sample standard deviation)
df['Z-Volume'] = (volume - volume.mean()) / volume.std()
df['Z-Change'] = (change - change.mean()) / change.std()

# Min-max normalization to [0, 1]
df['N-Volume'] = (volume - volume.min()) / (volume.max() - volume.min())
df['N-Change'] = (change - change.min()) / (change.max() - change.min())

df = df[['Volume', 'N-Volume', 'Z-Volume', 'Change', 'N-Change', 'Z-Change']].dropna()
print(df.head(10).to_markdown())
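
The same two transformations are available in scikit-learn through the MinMaxScaler and StandardScaler classes mentioned earlier. The sketch below uses a small made-up feature matrix and fits the scalers on the training split only, a common practice to avoid leaking test-set statistics; the array values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: rows are samples, columns are features on different scales
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

# Learn min/max and mean/std from the training data only
minmax = MinMaxScaler().fit(X_train)
standard = StandardScaler().fit(X_train)

# Apply the learned parameters to both splits
X_train_mm, X_test_mm = minmax.transform(X_train), minmax.transform(X_test)
X_train_std, X_test_std = standard.transform(X_train), standard.transform(X_test)

print(X_test_mm)
print(X_test_std)
```

Note that StandardScaler divides by the population standard deviation, whereas the pandas .std() method defaults to the sample standard deviation, so the two approaches give slightly different z-scores on small samples.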

Feature Scaling

Feature scaling overlaps with normalization but specifically addresses disparate scales across features. Beyond min-max and z-score, logarithmic scaling (\( x' = \log(x + 1) \)) handles skewed distributions, common in financial data for neural forecasting models. Scaling helps features contribute comparably to weight updates in stochastic gradient descent (SGD), rather than letting large-valued features dominate.
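
A minimal sketch of the logarithmic scaling formula above, using NumPy's log1p (natural logarithm of 1 + x) on made-up, right-skewed values:

```python
import numpy as np

# Hypothetical right-skewed values spanning several orders of magnitude
values = np.array([0.0, 9.0, 99.0, 9999.0])

# x' = log(x + 1); log1p stays accurate for values close to zero
log_scaled = np.log1p(values)

# expm1 inverts the transform, e.g. to map predictions back to the original scale
restored = np.expm1(log_scaled)

print(log_scaled)
```

The transform compresses the range from four orders of magnitude to roughly a factor of ten, and zero maps to zero, which is the reason for adding 1 before taking the logarithm.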

Data Augmentation

Data augmentation artificially expands datasets to combat overfitting, particularly in CNNs for image classification. Basic operations include flipping, rotation (e.g., by 90° or random angles), and cropping, while advanced methods involve adding noise (Gaussian or salt-and-pepper) or color jittering. For text data in RNNs or transformers, techniques like synonym replacement, random insertion/deletion, or back-translation (translating to another language and back) generate variations while preserving semantics. In time-series for LSTMs, window slicing or variants of the synthetic minority over-sampling technique (SMOTE) [8] create augmented sequences. Generative models like GANs (Generative Adversarial Networks) represent cutting-edge augmentation, producing realistic synthetic samples. These methods improve generalization by exposing models to diverse inputs.
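
The basic image operations mentioned above can be sketched with plain NumPy. The image here is random noise standing in for a real picture, and the crop size and noise level are arbitrary assumptions; deep learning frameworks ship their own augmentation utilities that would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical 32x32 RGB image with values in [0, 1] (height, width, channels)
image = rng.random((32, 32, 3))

# Horizontal flip: reverse the width axis
flipped = image[:, ::-1, :]

# Rotate by 90 degrees in the height-width plane
rotated = np.rot90(image, k=1, axes=(0, 1))

# Random 24x24 crop: offsets are drawn from 0..8 so the crop stays inside the image
top, left = rng.integers(0, 9, size=2)
cropped = image[top:top + 24, left:left + 24, :]

# Additive Gaussian noise, clipped back to the valid pixel range
noisy = np.clip(image + rng.normal(0.0, 0.05, size=image.shape), 0.0, 1.0)

print(flipped.shape, rotated.shape, cropped.shape, noisy.shape)
```

In training, such transforms are typically applied on the fly with fresh random parameters each epoch, so the network rarely sees exactly the same input twice.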

Handling Imbalanced Data

Imbalanced datasets, where classes are unevenly represented, bias neural networks toward majority classes. Resampling approaches include oversampling minority classes (e.g., SMOTE, which interpolates new instances between existing minority samples) or undersampling majority classes. Class weighting assigns higher penalties to minority misclassifications in the loss function, e.g., weighted cross-entropy. Focal loss, introduced for object detection CNNs, down-weights easy examples so that training concentrates on hard ones, and ensemble methods such as balanced random forests integrated with neural embeddings further address the problem. For sequential data, temporal resampling ensures balanced windows.
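
To make the class-weighting idea concrete, the sketch below computes inverse-frequency weights for a made-up 90/10 label vector and uses them in a weighted cross-entropy written in NumPy. The weighting heuristic (n_samples / (n_classes * class_count)) is one common choice among several, and the helper name `weighted_cross_entropy` is hypothetical.

```python
import numpy as np

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)

# Inverse-frequency weights: n_samples / (n_classes * class_count)
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean of -w[y_i] * log p_i[y_i]; assumes labels are integers 0..K-1."""
    eps = 1e-12  # guards against log(0)
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-class_weights[labels] * np.log(picked + eps)))

# A classifier that always predicts 50/50
uniform_probs = np.full((len(y), 2), 0.5)
loss = weighted_cross_entropy(uniform_probs, y, weights)

print(weights, loss)
```

With these weights, each minority sample counts nine times as much as a majority sample, so both classes contribute equally to the total loss.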

Feature Engineering and Selection

Feature engineering crafts new features from existing ones, such as polynomial terms or interactions (e.g., \( x_1 \times x_2 \)) to capture non-linearities before neural input. Selection techniques like mutual information or recursive feature elimination reduce irrelevant features, alleviating the curse of dimensionality in high-dimensional data for autoencoders or dense networks. Embedded methods, like L1 regularization in neural training, perform selection during optimization.
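
As an illustrative sketch of both ideas, the code below builds the interaction term \( x_1 \times x_2 \) with scikit-learn's PolynomialFeatures and then ranks the features by mutual information with the target. The synthetic data, in which the label depends only on the sign of the product, is an assumption made so that the benefit of the engineered feature is visible.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical data: the label depends on the sign of x1 * x2 only
X = rng.normal(size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Feature engineering: append the interaction term -> columns x1, x2, x1*x2
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_eng = poly.fit_transform(X)

# Feature selection: estimate mutual information between each column and y
scores = mutual_info_classif(X_eng, y, random_state=0)

# Keep the two highest-scoring columns
keep = np.argsort(scores)[::-1][:2]
X_selected = X_eng[:, keep]

print(scores)
```

Neither raw feature is informative on its own here, while the interaction column carries nearly all of the information about the label, which is the kind of structure manual feature engineering tries to expose.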

Dimensionality Reduction

Techniques like Principal Component Analysis (PCA) project data onto lower-dimensional spaces while preserving variance:

\[ X' = X \cdot W \]

where the columns of \(W\) are the principal components. Autoencoders, a neural-based approach, learn compressed representations through encoder-decoder architectures. t-SNE and UMAP are used mainly for visualization and less often for preprocessing; standard t-SNE in particular does not learn a mapping that can be applied to new samples. Dimensionality reduction is useful for CNNs on high-resolution images or transformers on long sequences, where it reduces computational load.

PCA is widely used for dimensionality reduction [5], while t-SNE [6] and UMAP [7] are popular for visualizing high-dimensional data in 2D or 3D spaces.

PCA identifies orthogonal axes (principal components) that capture maximum variance, enabling efficient data representation. Autoencoders, trained to reconstruct their inputs, learn compact latent spaces that are also useful for denoising or anomaly detection.

PCA Steps [5]

1. Standardize the data:

\[ X' = \frac{X - \mu}{\sigma} \]

2. Compute the covariance matrix:

\[ C = \frac{1}{n} X'^{\top} X' \]

3. Calculate the eigenvalues \(\lambda_i\) and eigenvectors \(v_i\) of \(C\) (e.g., with np.linalg.eig):

\[ C \, v_i = \lambda_i \, v_i \]

4. Sort eigenvectors by eigenvalues in descending order.

5. Select the top \(k\) eigenvectors to form a new feature space:

\[ Y = X' W \]

where \(W\) is the matrix of selected eigenvectors.

An example of PCA applied to the Iris dataset:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Loading Iris dataset
iris = load_iris()

# Convert to a DataFrame
df = pd.DataFrame(
    data=iris.data,
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]

X = df.iloc[:,0:4].values
y = df.iloc[:,4].values

# Standardizing
X_std = StandardScaler().fit_transform(X)

# Covariance
cov_mat = np.cov(X_std.T)

# Calculate eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs: print(i[0])

# Explained variance (in percent) of each eigenvalue and its cumulative sum
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

n_eigen = [1, 2, 3, 4]

# Plot the individual and cumulative explained variance per component
plt.figure(figsize=(6, 4))
plt.bar(n_eigen, var_exp, alpha=0.5, align='center',
    label='individual explained variance')
plt.step(n_eigen, cum_var_exp, where='mid',
    label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()

# Keep only the eigenvectors of the two largest eigenvalues
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
                      eig_pairs[1][1].reshape(4,1)))

print('*' * 10)
print('Reduced to 2-D')
print('Matrix W:\n', matrix_w)

# Calculate the new Y for all samples
Y = X_std.dot(matrix_w)

# Plot the data on the first two principal components
plt.figure(figsize=(6, 4))
for lab, col in zip(('setosa', 'versicolor', 'virginica'), ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()

# Emit the figure as SVG so it can be embedded in the HTML page
buffer = StringIO()
plt.savefig(buffer, format="svg", transparent=True)
print(buffer.getvalue())

Now, the same example using scikit-learn is shown below:

import matplotlib.pyplot as plt
import pandas as pd
from io import StringIO
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as pca
from sklearn.preprocessing import StandardScaler

# Loading Iris dataset
iris = load_iris()

# Convert to a DataFrame
df = pd.DataFrame(
    data=iris.data,
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]

X = df.iloc[:,0:4].values
y = df.iloc[:,4].values

# Standardizing
X_std = StandardScaler().fit_transform(X)

sklearn_pca = pca(n_components=2)
Y = sklearn_pca.fit_transform(X_std)

# Plot the data on the first two principal components
plt.figure(figsize=(6, 4))
for lab, col in zip(('setosa', 'versicolor', 'virginica'), ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()

# Emit the figure as SVG so it can be embedded in the HTML page
buffer = StringIO()
plt.savefig(buffer, format="svg", transparent=True)
print(buffer.getvalue())

Eigenfaces, an application of PCA to face images, is used in face recognition tasks to reduce image dimensions while retaining essential features [4]. In NLP, Latent Semantic Analysis (LSA) applies Singular Value Decomposition (SVD) to term-document matrices, producing compact dense representations that can reduce the input size for downstream models.
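
A minimal LSA sketch, assuming scikit-learn's TfidfVectorizer and TruncatedSVD and a handful of made-up sentences. Note that scikit-learn works with a document-term matrix (documents as rows), which is the transpose of the term-document layout mentioned above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "neural networks learn features from data",
    "deep neural networks need large data sets",
    "the ship sank in the north atlantic",
    "passengers boarded the ship in southampton",
]

# Sparse TF-IDF matrix: one row per document, one column per vocabulary term
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Truncated SVD projects each document onto 2 latent dimensions
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)

print(X.shape, "->", X_lsa.shape)
```

TruncatedSVD operates directly on the sparse matrix without centering it, which is why it is commonly used for text instead of standard PCA.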

Domain-Specific Advanced Techniques

For time series in RNNs, techniques include the Fast Fourier Transform (FFT) for conversion to the frequency domain and segmentation into fixed windows. In text preprocessing for sentiment analysis, advanced steps include negation handling (e.g., marking "not good" as "not_pos"), intensification (e.g., marking "very good" as "strong_pos"), and part-of-speech (POS) tagging to retain sentiment-bearing words. For CNNs, signal-processing steps such as wavelet transforms or conversion of raw signals to spectrograms produce image-like inputs, which is useful in fault diagnosis applications.
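
The windowing and FFT steps for time series can be sketched with NumPy as follows. The 5 Hz sine wave, the 100 Hz sampling rate, and the window and step sizes are all assumptions chosen to keep the example small.

```python
import numpy as np

fs = 100                      # hypothetical sampling rate in Hz
t = np.arange(200) / fs       # 2 seconds of samples
signal = np.sin(2 * np.pi * 5 * t)   # 5 Hz sine wave

# Segment into fixed windows of 100 samples with 50% overlap
window, step = 100, 50
starts = range(0, len(signal) - window + 1, step)
windows = np.stack([signal[s:s + window] for s in starts])

# Magnitude spectrum of each window (real-input FFT)
spectrum = np.abs(np.fft.rfft(windows, axis=1))
freqs = np.fft.rfftfreq(window, d=1 / fs)

# Frequency bin with the largest magnitude in each window
dominant = freqs[spectrum.argmax(axis=1)]

print(windows.shape, spectrum.shape, dominant)
```

Either the raw windows or their spectra can then serve as the per-step inputs of a recurrent model; stacking the spectra of successive windows is also the basic construction behind a spectrogram.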