3. Preprocessing
Data preprocessing is a critical phase in the development of neural network models, ensuring that raw data is transformed into a suitable format for effective training and inference. This text explores both basic and advanced preprocessing techniques, drawing from established methodologies in machine learning and deep learning. Basic techniques focus on cleaning and normalizing data to handle inconsistencies and scale issues, while advanced methods address complex challenges such as data scarcity, imbalance, and high dimensionality. The discussion highlights their relevance to neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, with emphasis on improving model convergence, generalization, and performance.
Neural networks, as powerful function approximators, are highly sensitive to the quality and format of input data. Poorly prepared data can lead to slow convergence, overfitting, or suboptimal accuracy. Preprocessing mitigates these issues by addressing noise, inconsistencies, and structural mismatches in datasets. It encompasses a series of steps that transform raw data into a form that aligns with the assumptions and requirements of neural architectures. For instance, in supervised learning tasks, preprocessing ensures features are scaled appropriately to prevent gradient issues during backpropagation. This text delineates basic techniques, which are foundational and widely applicable, and advanced techniques, which are more specialized and often domain-specific, such as for image, text, or time-series data.
Typical Preprocessing Tasks
Task | Description |
---|---|
Text Cleaning | Remove unwanted characters, stop words, and perform stemming/lemmatization. |
Normalization | Standardize text formats, such as date and currency formats. |
Tokenization | Split text into words or subwords for easier analysis. |
Feature Extraction | Convert text into numerical features using techniques like TF-IDF or word embeddings. |
Data Augmentation | Generate synthetic data to increase dataset size and diversity. |
A typical dataset for machine learning tasks might include columns of different data types, such as numerical, categorical, and text, e.g.:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
844 | 0 | 3 | Lemberopolous, Mr. Peter L | male | 34.5 | 0 | 0 | 2683 | 6.4375 | nan | C |
103 | 0 | 1 | White, Mr. Richard Frasar | male | 21 | 0 | 1 | 35281 | 77.2875 | D26 | S |
160 | 0 | 3 | Sage, Master. Thomas Henry | male | nan | 8 | 2 | CA. 2343 | 69.55 | nan | S |
190 | 0 | 3 | Turcin, Mr. Stjepan | male | 36 | 0 | 0 | 349247 | 7.8958 | nan | S |
185 | 1 | 3 | Kink-Heilmann, Miss. Luise Gretchen | female | 4 | 0 | 2 | 315153 | 22.025 | nan | S |
588 | 1 | 1 | Frolicher-Stehli, Mr. Maxmillian | male | 60 | 1 | 1 | 13567 | 79.2 | B41 | C |
660 | 0 | 1 | Newell, Mr. Arthur Webster | male | 58 | 0 | 2 | 35273 | 113.275 | D48 | C |
674 | 1 | 2 | Wilhelms, Mr. Charles | male | 31 | 0 | 0 | 244270 | 13 | nan | S |
526 | 0 | 3 | Farrell, Mr. James | male | 40.5 | 0 | 0 | 367232 | 7.75 | nan | Q |
703 | 0 | 3 | Barbara, Miss. Saiide | female | 18 | 0 | 1 | 2691 | 14.4542 | nan | C |
Sample rows from the Titanic dataset
Data Cleaning
Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values in the dataset. Missing values, common in real-world data, can be handled by imputation methods such as mean, median, or mode substitution, or by removing affected rows/columns if the loss is minimal. For example, in pandas, this can be implemented as df.fillna(df.mean(numeric_only=True))
for mean imputation. Outliers, which may skew neural network training, are detected using statistical methods like z-scores or interquartile ranges and can be winsorized or removed. Noise reduction, such as smoothing time-series data with moving averages, is also essential, particularly for RNNs where temporal dependencies are critical. Inconsistent data, like varying formats in text (e.g., dates), requires standardization to ensure uniformity. Overall, data cleaning enhances data quality, reducing the risk of misleading patterns during neural network optimization.
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|
3 | male | 19 | 0 | 0 | 8.05 | S |
1 | male | 36 | 0 | 0 | 40.125 | C |
3 | male | 27 | 1 | 0 | 24.15 | Q |
2 | female | 4 | 2 | 1 | 39 | S |
1 | male | 27 | 0 | 0 | 0 | S |
1 | male | 42 | 0 | 0 | 26.2875 | S |
2 | female | 33 | 0 | 2 | 26 | S |
1 | male | 27 | 1 | 0 | 53.1 | S |
2 | male | 25 | 0 | 0 | 13 | S |
3 | male | 27 | 0 | 0 | 8.05 | S |
import pandas as pd
# Preprocess the data
def preprocess(df):
    # Fill missing values (median for numeric columns, mode for categorical)
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    # Select features
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    return df[features]
# Load the Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
df = df.sample(n=10)
# Preprocessing
df = preprocess(df)
# Display the first few rows of the dataset
print(df.to_markdown(index=False))
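The cleaning step above only imputes missing values. As a complementary sketch, the snippet below drops outliers with the interquartile-range rule mentioned earlier; the 1.5 multiplier is the conventional default, and applying it to the Fare column is only an illustrative choice.
import pandas as pd
# Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]
def remove_iqr_outliers(df, column, k=1.5):
    q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
print(len(df), 'rows before,', len(remove_iqr_outliers(df, 'Fare')), 'rows after IQR filtering')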
Encoding Categorical Variables
Categorical data, non-numeric by nature, must be converted for neural network input. One-hot encoding creates binary vectors for each category, e.g., transforming colors ['red', 'blue', 'green']
into [[1,0,0], [0,1,0], [0,0,1]]
. This avoids ordinal assumptions but increases dimensionality, which can be mitigated by embedding layers in neural networks for high-cardinality features. Label encoding assigns integers (e.g., 0 for "red", 1 for "blue"), suitable for ordinal categories but risky for nominal ones due to implied ordering. For text data in NLP tasks with transformers, tokenization and subword encoding (e.g., WordPiece) are basic steps to map words to integer IDs.
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|
3 | 1 | 16 | 4 | 1 | 39.6875 | 1 |
3 | 1 | 34 | 1 | 1 | 14.4 | 1 |
2 | 1 | 16 | 0 | 0 | 26 | 1 |
3 | 1 | 34 | 1 | 2 | 23.45 | 1 |
1 | 0 | 34 | 1 | 0 | 89.1042 | 0 |
3 | 0 | 37 | 0 | 0 | 9.5875 | 1 |
1 | 0 | 39 | 1 | 1 | 110.883 | 0 |
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Preprocess the data
def preprocess(df):
    # Fill missing values (median for numeric columns, mode for categorical)
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    # Convert categorical variables to integer codes
    label_encoder = LabelEncoder()
    df['Sex'] = label_encoder.fit_transform(df['Sex'])
    df['Embarked'] = label_encoder.fit_transform(df['Embarked'])
    # Select features
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    return df[features]
# Load the Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
df = df.sample(n=10)
# Preprocessing
df = preprocess(df)
# Display a random sample of rows from the preprocessed dataset
print(df.sample(n=7).to_markdown(index=False))
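For nominal features, the one-hot alternative described above can be sketched with pandas' get_dummies, shown here only as an illustrative counterpart to the LabelEncoder example:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
# Expand each category into its own 0/1 column, avoiding any implied ordering
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], dtype=int)
print(df.filter(like='Sex_').join(df.filter(like='Embarked_')).head().to_markdown(index=False))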
Normalization and Standardization
Normalization scales features to a bounded range, typically \([0, 1]\), using min-max scaling:

\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]

This is crucial for neural networks employing sigmoid or tanh activations, as it prevents saturation.
Standardization, or z-score normalization, transforms data to have a mean of \(0\) and standard deviation of \(1\):

\[ x' = \frac{x - \mu}{\sigma} \]

where \(\mu\) is the mean and \(\sigma\) the standard deviation. It is preferred for networks with ReLU activations or when data distributions are Gaussian-like, aiding faster gradient descent convergence. In practice, libraries like scikit-learn provide MinMaxScaler
and StandardScaler
for these operations. These techniques are especially vital in multilayer perceptrons (MLPs) and CNNs, where feature scales can dominate loss landscapes.
Below is an example of how to apply normalization and standardization using pandas, based on Apple (AAPL) stock prices retrieved from Yahoo Finance via yfinance:
Date | Volume | N-Volume | Z-Volume | Change | N-Change | Z-Change |
---|---|---|---|---|---|---|
2025-07-22 00:00:00-04:00 | 4.64041e+07 | 0.116891 | -0.637333 | 0.00903618 | 0.44842 | 0.271479 |
2025-07-23 00:00:00-04:00 | 4.69893e+07 | 0.124553 | -0.612765 | -0.00116609 | 0.314022 | -0.290172 |
2025-07-24 00:00:00-04:00 | 4.60226e+07 | 0.111896 | -0.65335 | -0.00182115 | 0.305392 | -0.326235 |
2025-07-25 00:00:00-04:00 | 4.02688e+07 | 0.036563 | -0.894912 | 0.00056142 | 0.336779 | -0.19507 |
2025-07-28 00:00:00-04:00 | 3.7858e+07 | 0.00499883 | -0.996124 | 0.000794875 | 0.339854 | -0.182218 |
2025-07-29 00:00:00-04:00 | 5.14117e+07 | 0.182455 | -0.427099 | -0.0129877 | 0.158291 | -0.940969 |
2025-07-30 00:00:00-04:00 | 4.55125e+07 | 0.105218 | -0.674765 | -0.0105079 | 0.190958 | -0.804453 |
2025-07-31 00:00:00-04:00 | 8.06984e+07 | 0.5659 | 0.802445 | -0.00707962 | 0.23612 | -0.615722 |
2025-08-01 00:00:00-04:00 | 1.04434e+08 | 0.876672 | 1.79896 | -0.0250036 | 0 | -1.60247 |
2025-08-04 00:00:00-04:00 | 7.51093e+07 | 0.492723 | 0.567798 | 0.00479297 | 0.392523 | 0.0378834 |
Date | Open | High | Low | Close | Volume | Dividends | Stock Splits | Change |
---|---|---|---|---|---|---|---|---|
2025-07-21 00:00:00-04:00 | 211.86 | 215.535 | 211.39 | 212.239 | 5.13774e+07 | 0 | 0 | nan |
2025-07-22 00:00:00-04:00 | 212.898 | 214.706 | 211.989 | 214.157 | 4.64041e+07 | 0 | 0 | 0.00903618 |
2025-07-23 00:00:00-04:00 | 214.756 | 214.906 | 212.169 | 213.907 | 4.69893e+07 | 0 | 0 | -0.00116609 |
2025-07-24 00:00:00-04:00 | 213.658 | 215.445 | 213.288 | 213.518 | 4.60226e+07 | 0 | 0 | -0.00182115 |
2025-07-25 00:00:00-04:00 | 214.457 | 214.996 | 213.158 | 213.638 | 4.02688e+07 | 0 | 0 | 0.00056142 |
2025-07-28 00:00:00-04:00 | 213.787 | 214.606 | 212.818 | 213.807 | 3.7858e+07 | 0 | 0 | 0.000794875 |
2025-07-29 00:00:00-04:00 | 213.937 | 214.566 | 210.581 | 211.031 | 5.14117e+07 | 0 | 0 | -0.0129877 |
2025-07-30 00:00:00-04:00 | 211.66 | 212.149 | 207.485 | 208.813 | 4.55125e+07 | 0 | 0 | -0.0105079 |
2025-07-31 00:00:00-04:00 | 208.254 | 209.602 | 206.925 | 207.335 | 8.06984e+07 | 0 | 0 | -0.00707962 |
2025-08-01 00:00:00-04:00 | 210.631 | 213.338 | 201.272 | 202.151 | 1.04434e+08 | 0 | 0 | -0.0250036 |
import pandas as pd
import yfinance as yf
dat = yf.Ticker("AAPL")
df = dat.history(period='1mo')
df['Change'] = df['Close'].pct_change()
# Z-score standardization and min-max normalization of Volume and Change
df['Z-Volume'] = (df['Volume'] - df['Volume'].mean()) / df['Volume'].std()
df['N-Volume'] = (df['Volume'] - df['Volume'].min()) / (df['Volume'].max() - df['Volume'].min())
df['Z-Change'] = (df['Change'] - df['Change'].mean()) / df['Change'].std()
df['N-Change'] = (df['Change'] - df['Change'].min()) / (df['Change'].max() - df['Change'].min())
df = df[['Volume', 'N-Volume', 'Z-Volume', 'Change', 'N-Change', 'Z-Change']].dropna()
print(df.head(10).to_markdown())
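The same rescaling can be done with the scikit-learn scalers mentioned earlier. A minimal sketch, assuming the DataFrame from the snippet above is still in memory:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
cols = ['Volume', 'Change']
# Min-max normalization to [0, 1] and z-score standardization, column by column
df[['N-Volume', 'N-Change']] = MinMaxScaler().fit_transform(df[cols])
# Note: StandardScaler uses the population std (ddof=0), so values differ slightly from pandas' .std()
df[['Z-Volume', 'Z-Change']] = StandardScaler().fit_transform(df[cols])
print(df.head(10).to_markdown())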
Feature Scaling
Feature scaling overlaps with normalization but specifically addresses disparate scales across features. Beyond min-max and z-score, logarithmic scaling (\( x' = \log(x + 1) \)) handles skewed distributions, common in financial data for neural forecasting models. Scaling ensures equal contribution of features during weight updates in stochastic gradient descent (SGD).
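A minimal sketch of logarithmic scaling with NumPy; using the heavily right-skewed Titanic Fare column here is only an illustrative choice:
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/hsandmann/ml/refs/heads/main/data/kaggle/titanic-dataset.csv')
# log1p computes log(x + 1), compressing the long right tail of Fare
df['LogFare'] = np.log1p(df['Fare'])
# Compare skewness before and after the transform
print(df[['Fare', 'LogFare']].skew())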
Data Augmentation
Data augmentation artificially expands datasets to combat overfitting, particularly in CNNs for image classification. Basic operations include flipping, rotation (e.g., by 90° or random angles), and cropping, while advanced methods involve adding noise (Gaussian or salt-and-pepper) or color jittering. For text data in RNNs or transformers, techniques like synonym replacement, random insertion/deletion, or back-translation (translating to another language and back) generate variations while preserving semantics. In time-series for LSTMs, window slicing or synthetic minority over-sampling technique (SMOTE)8 variants create augmented sequences. Generative models like GANs (Generative Adversarial Networks) represent cutting-edge augmentation, producing realistic synthetic samples. These methods improve generalization by exposing models to diverse inputs.
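As an illustration of the basic image operations above, the sketch below composes a few torchvision transforms; the specific parameter values are arbitrary choices and torchvision is assumed to be installed:
from torchvision import transforms
# Random flip, small rotation, and color jitter applied on the fly during training
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# Typically passed as the `transform` argument of a dataset,
# e.g. torchvision.datasets.CIFAR10(root='data', train=True, transform=augment)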
Handling Imbalanced Data
Imbalanced datasets, where classes are unevenly represented, bias neural networks toward majority classes. Advanced resampling includes oversampling minorities (e.g., SMOTE, which interpolates new instances) or undersampling majorities. Class weighting assigns higher penalties to minority misclassifications in the loss function, e.g., weighted cross-entropy. Ensemble methods, like balanced random forests integrated with neural embeddings, or focal loss in object detection CNNs, further address this. For sequential data, temporal resampling ensures balanced windows.
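A minimal sketch of the two remedies above, class weighting and SMOTE, on a synthetic imbalanced dataset; SMOTE assumes the separate imbalanced-learn package is installed:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
# Toy imbalanced dataset: roughly 90% class 0 and 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
# Class weights inversely proportional to class frequencies, usable in a weighted loss
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print('class weights:', dict(zip(np.unique(y), weights)))
# SMOTE interpolates synthetic minority samples until the classes are balanced
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print('before:', np.bincount(y), 'after:', np.bincount(y_res))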
Feature Engineering and Selection
Feature engineering crafts new features from existing ones, such as polynomial terms or interactions (e.g., \( x_1 \times x_2 \)) to capture non-linearities before neural input. Selection techniques like mutual information or recursive feature elimination reduce irrelevant features, alleviating the curse of dimensionality in high-dimensional data for autoencoders or dense networks. Embedded methods, like L1 regularization in neural training, perform selection during optimization.
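A minimal sketch of both steps on the Iris data: polynomial interaction features followed by mutual-information-based selection (keeping k=4 features is an arbitrary choice):
from sklearn.datasets import load_iris
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, mutual_info_classif
X, y = load_iris(return_X_y=True)
# Engineer pairwise interaction terms such as x1 * x2
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)
# Keep the features with the highest mutual information with the target
X_sel = SelectKBest(mutual_info_classif, k=4).fit_transform(X_poly, y)
print(X.shape, '->', X_poly.shape, '->', X_sel.shape)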
Dimensionality Reduction
Techniques like Principal Component Analysis (PCA) project data onto lower-dimensional spaces while preserving variance:

\[ Y = X W \]

where the columns of \(W\) are the principal components. Autoencoders, a neural-based approach, learn compressed representations through encoder-decoder architectures. t-SNE or UMAP are used for visualization but less often for preprocessing, since their non-linear embeddings do not transform new data directly. These techniques are vital for CNNs on high-resolution images or transformers on long sequences to reduce computational load.
PCA is widely used for dimensionality reduction5, while t-SNE6 and UMAP7 are popular for visualizing high-dimensional data in 2D or 3D spaces.
Basically, PCA identifies orthogonal axes (principal components) capturing maximum variance, enabling efficient data representation. Autoencoders, trained to reconstruct inputs, learn compact latent spaces, useful for denoising or anomaly detection.
PCA Steps5
1. Standardize the data: \( X_{\text{std}} = \frac{X - \mu}{\sigma} \)
2. Compute the covariance matrix: \( \Sigma = \frac{1}{n-1} X_{\text{std}}^T X_{\text{std}} \)
3. Calculate eigenvalues and eigenvectors: \( \Sigma v = \lambda v \)
4. Sort eigenvectors by eigenvalues in descending order.
5. Select the top \(k\) eigenvectors to form a new feature space: \( Y = X_{\text{std}} W \), where \(W\) is the matrix of selected eigenvectors.
An example of PCA applied to the Iris dataset:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Loading Iris dataset
iris = load_iris()
# Convert to a DataFrame
df = pd.DataFrame(
    data=iris.data,
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]
X = df.iloc[:,0:4].values
y = df.iloc[:,4].values
# Standardizing
X_std = StandardScaler().fit_transform(X)
# Covariance
cov_mat = np.cov(X_std.T)
# Calculate eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs: print(i[0])
# Explained variance ratio and its cumulative sum for each eigenvalue
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
n_eigen = [1, 2, 3, 4]
# Plot individual and cumulative explained variance per principal component
plt.figure(figsize=(6, 4))
plt.bar(n_eigen, var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(n_eigen, cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
# Keep only the two eigenvectors with the largest eigenvalues
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
                      eig_pairs[1][1].reshape(4,1)))
print('*' * 10)
print('Reduced to 2-D')
print('Matrix W:\n', matrix_w)
# Project all samples onto the new 2-D feature space
Y = X_std.dot(matrix_w)
# Plot the data along the first two principal components
plt.figure(figsize=(6, 4))
for lab, col in zip(('setosa', 'versicolor', 'virginica'), ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()
# Print the figure as SVG so it can be embedded in the HTML page
buffer = StringIO()
plt.savefig(buffer, format="svg", transparent=True)
print(buffer.getvalue())
Now, the same example using scikit-learn is shown below:
import matplotlib.pyplot as plt
import pandas as pd
from io import StringIO
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as pca
from sklearn.preprocessing import StandardScaler
# Loading Iris dataset
iris = load_iris()
# Convert to a DataFrame
df = pd.DataFrame(
    data=iris.data,
    columns=['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
)
df['class'] = iris.target_names[iris.target]
X = df.iloc[:,0:4].values
y = df.iloc[:,4].values
# Standardizing
X_std = StandardScaler().fit_transform(X)
sklearn_pca = pca(n_components=2)
Y = sklearn_pca.fit_transform(X_std)
# Plot the data along the first two principal components
plt.figure(figsize=(6, 4))
for lab, col in zip(('setosa', 'versicolor', 'virginica'), ('blue', 'red', 'green')):
    plt.scatter(Y[y==lab, 0],
                Y[y==lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()
# Print the figure as SVG so it can be embedded in the HTML page
buffer = StringIO()
plt.savefig(buffer, format="svg", transparent=True)
print(buffer.getvalue())
Eigenfaces, a PCA variant, is used in face recognition tasks to reduce image dimensions while retaining essential features4. In NLP, techniques like Latent Semantic Analysis (LSA) apply SVD (Singular Value Decomposition) to reduce term-document matrices, enhancing transformer efficiency.
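A minimal sketch of LSA as mentioned above: a TF-IDF term-document matrix reduced with truncated SVD; the toy corpus and the choice of 2 components are illustrative only:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'stock prices fell sharply today',
    'investors sold shares as prices fell',
]
# TF-IDF weighted term-document matrix
X = TfidfVectorizer().fit_transform(corpus)
# LSA: truncated SVD projects each document onto 2 latent topics
lsa = TruncatedSVD(n_components=2, random_state=42)
print(lsa.fit_transform(X).round(2))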
Domain-Specific Advanced Techniques
For time-series in RNNs, techniques include the Fast Fourier Transform (FFT) for frequency-domain conversion or segmentation into fixed windows. In text preprocessing for sentiment analysis, advanced steps encompass negation handling (e.g., marking "not good" as "not_pos"), intensification (e.g., "very good" as "strong_pos"), and POS tagging to retain sentiment-bearing words. For image and signal data in CNNs, advanced preprocessing such as wavelet transforms or conversion of signals to spectrograms is common, for example in fault-diagnosis applications.
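A minimal NumPy sketch of the FFT conversion mentioned above, turning a noisy sinusoid into frequency-domain features; the signal parameters are arbitrary:
import numpy as np
# Synthetic signal: a 5 Hz sinusoid plus noise, sampled at 100 Hz for 1 second
fs = 100
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)
# Magnitude spectrum from the real FFT, with the matching frequency bins
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
print('dominant frequency: %.1f Hz' % freqs[spectrum.argmax()])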