14. VAE
Autoencoders
Autoencoders (AEs) are neural networks designed to learn efficient codings of input data by:
- compressing the input into a lower-dimensional representation, and then
- reconstructing it back to its original form.
Autoencoders consist of two main components:
- an encoder, which compresses input data into a lower-dimensional representation known as the latent space or code. This latent space, often called an embedding, aims to retain as much information as possible, allowing the decoder to reconstruct the data with high precision. If we denote our input data as \( x \) and the encoder as \( E \), then the output latent space representation, \( s \), would be \( s = E(x) \).
- a decoder, which reconstructs the original input data by accepting the latent space representation \( s \). If we denote the decoder function as \( D \) and the output of the decoder as \( o \), then we can represent the decoder as \( o = D(s) \).
Both the encoder and the decoder are typically composed of one or more layers, which can be fully connected, convolutional, or recurrent, depending on the nature of the input data and the autoencoder's architecture.1 The entire autoencoder process can be summarized as:

\[ o = D(E(x)) \]
An illustration of the architecture of autoencoders. Source: 1.
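As a concrete illustration, here is a minimal sketch of a fully connected autoencoder in PyTorch. The layer sizes, the sigmoid output, and the flattened 784-dimensional input (e.g., MNIST-like images) are illustrative assumptions, not taken from the referenced sources.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder: o = D(E(x))."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder E: input -> latent representation s
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder D: latent representation s -> reconstruction o
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid(),  # assumes inputs scaled to [0, 1]
        )

    def forward(self, x):
        s = self.encoder(x)   # s = E(x)
        o = self.decoder(s)   # o = D(s)
        return o

# Usage: reconstruct a random batch and measure the reconstruction error.
model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.functional.mse_loss(model(x), x)
```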
Types of Autoencoders
There are several types of autoencoders, each with its own particularities:
Vanilla Autoencoders
Vanilla autoencoders use fully connected layers for both the encoder and the decoder. They compress the input information and are typically applied to simple data.
Both the encoder and the decoder are fully connected networks. The encoder maps the input data to the latent space (the compressed, encoded data), and the decoder maps the latent-space data back to the output (the reconstructed data). Source: 1.
Latent Space is a compressed representation of the input data. The dimensionality of the latent space is typically much smaller than that of the input data, which forces the autoencoder to learn a compact representation that captures the most important features of the data.
Convolutional Autoencoders
In convolutional autoencoders, the encoder and the decoder are neural networks built from convolutional layers, which makes this approach well suited to image data.
In convolutional autoencoders, the encoder and decoder are based on Convolutional Neural Networks (CNNs). This architecture is particularly effective for image data, as it can capture spatial hierarchies and patterns. Source: 2.
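Below is a minimal sketch of a convolutional autoencoder, assuming single-channel 28×28 images; the channel counts and layer choices are illustrative, not taken from the referenced sources.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Minimal convolutional autoencoder for 1x28x28 images."""
    def __init__(self):
        super().__init__()
        # Encoder: downsample with strided convolutions.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28 -> 14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14 -> 7
            nn.ReLU(),
        )
        # Decoder: upsample back with transposed convolutions.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 7 -> 14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 14 -> 28
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Usage: images in, reconstructed images of the same shape out.
x = torch.rand(8, 1, 28, 28)
x_hat = ConvAutoencoder()(x)
```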
Variational Autoencoders
Variational Autoencoders (VAEs) are generative models that learn to encode data into a lower-dimensional latent space and then decode it back to the original space. VAEs can generate new samples from the learned latent distribution, making them ideal for image generation and style transfer tasks.
A VAE maps input data \( \mathbf{x} \) into a latent space \( \mathbf{z} \) and then reconstructs it back in the original space as \( \mathbf{\hat{x}} \) (the output). The encoder learns to capture the underlying structure of the data, while the decoder generates new samples from the latent space. Source: 2.
VAEs were introduced in 2013 by Kingma and Welling in Auto-Encoding Variational Bayes.
Figure: Comparison between a standard Autoencoder and a Variational Autoencoder (VAE). In a standard Autoencoder, the encoder maps input data \( \mathbf{x} \) to a fixed latent representation \( \mathbf{z} \), which is then used by the decoder to reconstruct the input as \( \mathbf{\hat{x}} \). In contrast, a VAE encodes the input data into a distribution over the latent space, typically modeled as a Gaussian distribution with mean \( \mu \) and standard deviation \( \sigma \). During training, the VAE samples from this distribution to obtain \( \mathbf{z} \), which is then used by the decoder to reconstruct the input. This probabilistic approach allows VAEs to generate new samples by sampling from the latent space, making them powerful generative models. Dataset: Fashion-MNIST. Source: 3.
Key Features of VAEs
VAEs have the ability to learn smooth and continuous latent spaces, which allows for meaningful interpolation between data points. This is particularly useful in applications such as image generation, where one can generate new images by sampling from the latent space. In addition, the probabilistic nature of VAEs helps regularize the latent space, preventing overfitting while introducing a controlled amount of randomness, so that the model generalizes well to unseen data.
Aspects of VAEs include:
- Regularization and Continuity: The latent space in VAEs is regularized to follow a prior distribution (usually a standard normal distribution). This encourages the model to learn a continuous and smooth latent space, allowing for meaningful interpolation between data points.
- Simplicity in Sampling: VAEs can generate new samples by simply sampling from the latent space distribution (the Gaussian distribution is mathematically tractable), making them efficient for generative tasks.
- Reparameterization Trick: To enable backpropagation through the stochastic sampling process, VAEs employ the reparameterization trick. This involves expressing the sampled latent variable \( \mathbf{z} \) as a deterministic function of the input data \( \mathbf{x} \) and a random noise variable \( \mathbf{\epsilon} \), allowing gradients to flow through the network during training.
- Balanced Latent Space: The KL divergence term in the VAE loss function encourages the learned latent space to be similar to the prior distribution, promoting a well-structured and balanced latent space.
Training VAEs
VAEs are trained with a loss function that combines a reconstruction term (how well the decoder reconstructs the input) with a Kullback-Leibler (KL) divergence term, which measures the difference between the learned latent distribution and the prior distribution.
Suppose we want to infer the latent variables \( z \) that explain the input data \( x \), i.e., the posterior distribution \( p(z|x) \). By Bayes' theorem:

\[ p(z|x) = \frac{p(x|z)\, p(z)}{p(x)} \]

The problem is that the evidence \( p(x) \) is intractable:

\[ p(x) = \int p(x|z)\, p(z)\, dz \]

This integral is generally intractable. Hence, we approximate the true posterior with a variational distribution \( q(z|x) \), which is easier to compute, and we minimize the KL divergence between \( q(z|x) \) and \( p(z|x) \):

\[ \min_{q} \ \text{KL}\big(q(z|x)\,\|\,p(z|x)\big) \]

Simplifying, this minimization problem is equivalent to maximizing the evidence lower bound (ELBO):

\[ \text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}\big(q(z|x)\,\|\,p(z)\big) \]
where:
- The first term, \( \mathbb{E}_{q(z|x)}[\log p(x|z)] \), is the expected log-likelihood of the data given the latent variable, which encourages the model to reconstruct the input data accurately.
- The second term, \( \text{KL}(q(z|x) || p(z)) \), is the KL divergence between the approximate posterior and the prior distribution, which regularizes the latent space.
Thus, the loss function for training a VAE can be expressed as the negative ELBO:

\[ \mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \text{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) \]
Figure: Basic architecture of a Variational Autoencoder (VAE). The encoder maps input data \( \mathbf{x} \) to a latent representation \( \mathbf{z} \), and the decoder reconstructs \( \mathbf{x'} \) from \( \mathbf{z} \). Source: Wikipedia
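In code, this loss is commonly implemented as a summed reconstruction error plus the closed-form KL term between a diagonal Gaussian posterior and a standard normal prior. The sketch below assumes a Gaussian decoder, so the reconstruction term reduces to a summed squared error; the function name vae_loss is ours.

```python
import torch

def vae_loss(recon_x, x, mu, logvar):
    """Negative ELBO: summed reconstruction error + KL divergence.

    Assumes a Gaussian decoder (MSE reconstruction term) and a
    standard normal prior p(z) = N(0, I).
    """
    # Reconstruction term: how well the decoder rebuilds the input.
    recon_loss = torch.sum((x - recon_x) ** 2)
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```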
Reparameterization Trick
The reparameterization trick is a key innovation that allows for efficient backpropagation through the stochastic layers of a VAE. Instead of sampling \( z \) directly from \( q(z|x) \), we express \( z \) as a deterministic function of \( x \) and some noise \( \epsilon \) drawn from a simple distribution (e.g., a standard Gaussian):

\[ z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \]

where \( \mu \) and \( \sigma \) are the mean and standard deviation outputs of the encoder. This transformation allows us to backpropagate through the network while still maintaining the stochastic nature of the latent variable.
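A sketch of this sampling step as it is commonly written in PyTorch, assuming the encoder outputs the mean and the log variance (the function name reparameterize is illustrative):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients can flow back
    through mu and logvar to the encoder parameters.
    """
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log(sigma^2))
    eps = torch.randn_like(std)     # noise from a standard normal
    return mu + std * eps
```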
Numerical Simulation
VAE - Variational Autoencoder
A Variational Autoencoder (VAE) encodes input data into a probabilistic latent space (defined by mean μ and log-variance log(σ²)) and decodes it back to reconstruct the input. The latent space is sampled using the reparameterization trick for differentiability. The loss combines reconstruction error (MSE) and KL divergence to regularize the latent distribution toward a standard normal.
For this numerical example, we've scaled up to:
- Input dimension: 4 (e.g., a vector like [1.0, 2.0, 3.0, 4.0])
- Latent dimension: 2
- Output dimension: 4 (reconstruction of the input)
- Hidden layer size: 8 (for both encoder and decoder, to add capacity)
The model uses PyTorch with random initialization (seeded at 42 for reproducibility). All calculations are shown step-by-step, including matrix multiplications where relevant. Weights and biases are explicitly listed below.
Model Architecture
- Encoder:
  - Linear (fc1): 4 inputs → 8 hidden units, followed by ReLU.
  - Linear to μ (fc_mu): 8 → 2.
  - Linear to logvar (fc_logvar): 8 → 2.
- Latent: Sample z from N(μ, σ²) using the reparameterization trick.
- Decoder:
  - Linear (fc_dec1): 2 latent → 8 hidden units, followed by ReLU.
  - Linear to output (fc_dec2): 8 → 4 (no final activation, assuming a Gaussian output for simplicity).
- Loss: Summed MSE for reconstruction + KL divergence (without β annealing).
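A minimal PyTorch sketch of the model described above; the class name TinyVAE and the helper loss_fn are ours, while the layer names match the fc1, fc_mu, fc_logvar, fc_dec1, and fc_dec2 used throughout this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # reproducible random initialization

class TinyVAE(nn.Module):
    """VAE with input dim 4, hidden dim 8, latent dim 2."""
    def __init__(self, input_dim=4, hidden_dim=8, latent_dim=2):
        super().__init__()
        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.fc_dec1 = nn.Linear(latent_dim, hidden_dim)
        self.fc_dec2 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z):
        h = torch.relu(self.fc_dec1(z))
        return self.fc_dec2(h)  # no final activation (Gaussian output)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def loss_fn(recon_x, x, mu, logvar):
    # Summed MSE reconstruction + KL divergence, as described above.
    mse = torch.sum((recon_x - x) ** 2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kl
```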
Weights and Biases
All parameters are initialized randomly (via torch.manual_seed(42)). Here they are:
Encoder
- fc1.weight (encoder input to hidden, shape [8, 4]):
- fc1.bias (shape [8]):
- fc_mu.weight (hidden to μ, shape [2, 8]):
- fc_mu.bias (shape [2]):
- fc_logvar.weight (hidden to logvar, shape [2, 8]):
- fc_logvar.bias (shape [2]):
Decoder
- fc_dec1.weight (latent to decoder hidden, shape [8, 2]):
- fc_dec1.bias (shape [8]):
- fc_dec2.weight (decoder hidden to output, shape [4, 8]):
- fc_dec2.bias (shape [4]):
Forward Pass
- Input: \( x = [1.0, 2.0, 3.0, 4.0] \) (batch size 1, dim 4).
- Encoding to Hidden Layer:
  - Compute pre-ReLU: fc1(x) = fc1.weight @ x^T + fc1.bias.
  - This is a matrix multiplication: each row of fc1.weight dotted with x, plus the bias.
  - Result (pre-ReLU):
  - After ReLU (non-negative): (note: the last two entries are zeroed by ReLU).
- Compute Mean (μ) in Latent Space:
  - μ = fc_mu.weight @ hidden^T + fc_mu.bias.
  - Result:
  - This is the mean of the 2D latent Gaussian.
- Compute Log-Variance (logvar) in Latent Space:
  - logvar = fc_logvar.weight @ hidden^T + fc_logvar.bias.
  - Result:
  - Variance σ² = exp(logvar):
- Latent Space: Sampling z (Reparameterization Trick):
  - std (σ) = exp(0.5 * logvar):
  - ε ~ N(0, 1) (seeded random):
  - \( z = \mu + \text{std} \cdot \epsilon \)
  - Result:
- Decoding to Reconstructed Output:
  - Decoder hidden: ReLU( fc_dec1.weight @ z^T + fc_dec1.bias ).
  - Pre-ReLU:
  - After ReLU:
  - recon_x = fc_dec2.weight @ decoder_hidden^T + fc_dec2.bias.
  - Result:
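This forward pass can be reproduced, under the assumptions of the TinyVAE sketch defined earlier, roughly as follows; the exact numbers depend on the seed and on the order in which random numbers are drawn.

```python
# Run the forward pass for the example input (batch size 1, dim 4).
model = TinyVAE()
x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
recon_x, mu, logvar = model(x)
loss = loss_fn(recon_x, x, mu, logvar)
print(recon_x, mu, logvar, loss)
```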
Loss Calculation
- Reconstruction Loss (MSE): sum over dimensions of \( (x - \hat{x})^2 \approx 31.100958927489703 \)
- KL Divergence:
  \[ \text{KL} = -0.5 \sum \left(1 + \text{logvar} - \mu^2 - \exp(\text{logvar})\right) \approx 0.40952290104490313 \]
- Total Loss:
  \[ \text{Loss} = \text{MSE} + \text{KL} \approx 31.510481828534605 \]
Backward Pass
The backward pass computes gradients via autograd (chain rule from loss back through the network). This enables training by updating weights (e.g., via SGD). Gradients are zero-initialized before .backward().
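Using the TinyVAE sketch from earlier, this step would look roughly like the following; the gradient values below depend on the same seeded initialization.

```python
# Compute gradients for every parameter of the sketch model above.
model.zero_grad()      # clear any stale gradients
loss.backward()        # autograd applies the chain rule from the loss

# Inspect a few gradients, e.g. dLoss/d(fc1.weight):
print(model.fc1.weight.grad)      # shape [8, 4]
print(model.fc_dec2.bias.grad)    # shape [4]
```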
After loss.backward(), key gradients \( \displaystyle \frac{\partial \text{Loss}}{\partial \text{param}} \) are:
Decoder
- fc_dec2.weight.grad (shape [4, 8]): \( \partial \text{Loss} / \partial\, \text{fc\_dec2.weight} \)
- fc_dec2.bias.grad (shape [4]):
- fc_dec1.weight.grad (shape [8, 2]):
  [[ 0.0000, -0.0000],
   [ 2.7321, -2.6684],
   [ 0.0000, -0.0000],
   [ 2.6066, -2.5459],
   [-1.7850,  1.7434],
   [ 2.1179, -2.0685],
   [ 0.0000, -0.0000],
   [ 0.0000, -0.0000]]
  Primarily from the MSE term, backpropagated through the decoder.
- fc_dec1.bias.grad (shape [8]):
Encoder
- fc_mu.weight.grad (shape [2, 8]): includes ∂KL/∂μ ≈ μ (pulling toward 0) plus the flow from the MSE term via z.
- fc_mu.bias.grad (shape [2]):
- fc_logvar.weight.grad (shape [2, 8]): from ∂KL/∂logvar ≈ 0.5·(exp(logvar) − 1) plus the MSE flow.
- fc_logvar.bias.grad (shape [2]):
- fc1.weight.grad (shape [8, 4]):
  [[ 0.2735,  0.5470,  0.8204,  1.0939],
   [-0.2724, -0.5448, -0.8172, -1.0896],
   [ 0.5339,  1.0679,  1.6018,  2.1358],
   [ 1.2016,  2.4032,  3.6049,  4.8065],
   [ 0.7601,  1.5201,  2.2802,  3.0403],
   [-0.5289, -1.0578, -1.5868, -2.1157],
   [ 0.0000,  0.0000,  0.0000,  0.0000],
   [ 0.0000,  0.0000,  0.0000,  0.0000]]
  These flow from both the MSE term (via the reconstruction) and the KL term (via μ/logvar). The last two rows are zero because ReLU zeroed those hidden units.
- fc1.bias.grad (shape [8]):
These gradients would update parameters in training (e.g., param -= lr * grad). Note zeros where ReLU gates flow. This example uses a single pass; real training iterates over datasets. If you change the seed, input, or dimensions, values will differ, but the process remains identical.
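For completeness, an illustrative single training step with torch.optim.SGD, reusing the TinyVAE sketch and loss_fn from above (real training would loop this over a dataset):

```python
import torch

# One SGD step: torch.optim.SGD performs param <- param - lr * grad.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer.zero_grad()
recon_x, mu, logvar = model(x)
loss = loss_fn(recon_x, x, mu, logvar)
loss.backward()
optimizer.step()
```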
Additional
Relation between Log Variance and Standard Deviation
- In VAEs, the encoder outputs the mean \( \mu \) and log variance \( \log(\sigma^2) \) of the latent space distribution.
- The standard deviation \( \sigma \) can be derived from the log variance using the relationship \( \sigma = \exp\!\left(\tfrac{1}{2}\log(\sigma^2)\right) \).
- This transformation ensures numerical stability and positivity of the variance during training.
1. Definitions
For a random variable \( x \) that follows a normal distribution:

\[ x \sim \mathcal{N}(\mu, \sigma^2) \]
where:
- \( \mu \): mean
- \( \sigma^2 \): variance
- \( \sigma \): standard deviation
2. Log variance
Often, instead of directly predicting or storing the variance \( \sigma^2 \) or the standard deviation \( \sigma \), models work with the log variance:

\[ \text{log\_var} = \log(\sigma^2) \]
3. Relationship between log variance and std
From the above definition:

\[ \sigma^2 = e^{\text{log\_var}} \]

Taking the square root to get the standard deviation:

\[ \sigma = \sqrt{e^{\text{log\_var}}} \]

So:

\[ \sigma = e^{\frac{1}{2}\text{log\_var}} \]

and conversely,

\[ \text{log\_var} = 2\log(\sigma) \]
4. Why use log variance?
It’s common in neural nets because:
- It ensures the variance is always positive (since \( e^x > 0 \)).
- It’s numerically more stable when optimizing.
- It allows unconstrained outputs from the network (no need to force positivity).
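These identities can be checked numerically; the snippet below uses arbitrary example values.

```python
import torch

logvar = torch.tensor([-0.5, 0.0, 1.2])   # arbitrary log-variances
var = torch.exp(logvar)                    # sigma^2 = exp(log_var)
std = torch.exp(0.5 * logvar)              # sigma = exp(0.5 * log_var)

# Consistency checks: std**2 == var and 2*log(std) == log_var.
assert torch.allclose(std ** 2, var)
assert torch.allclose(2 * torch.log(std), logvar)
```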
Summary
| Quantity | Expression | In terms of log_var |
|---|---|---|
| Variance | \( \sigma^2 \) | \( e^{\text{log\_var}} \) |
| Std. deviation | \( \sigma \) | \( e^{\frac{1}{2}\text{log\_var}} \) |
| Log variance | \( \text{log\_var} \) | \( 2 \log(\sigma) \) |
1. Sharma, A. "Introduction to Autoencoders," PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, and R. Raha, eds., 2023. ↩↩↩
2. Bandyopadhyay, H. "What is an autoencoder and how does it work? Learn about most common types of autoencoders and their applications in machine learning." ↩↩
3. Sharma, A. "A Deep Dive into Variational Autoencoders with PyTorch," PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, and R. Raha, eds., 2023. ↩







