14. VAE
Autoencoders
Autoencoders (AEs) are neural networks designed to learn efficient codings of input data by:
- compressing the input into a lower-dimensional representation, and then
- reconstructing it back to its original form.
Autoencoders consist of two main components:
- an encoder, which compresses input data into a lower-dimensional representation known as the latent space or code. This latent space, often called an embedding, aims to retain as much information as possible, allowing the decoder to reconstruct the data with high precision. If we denote our input data as \( x \) and the encoder as \( E \), then the output latent space representation, \( s \), would be \( s = E(x) \).
- a decoder, which reconstructs the original input data by accepting the latent space representation \( s \). If we denote the decoder function as \( D \) and the output of the decoder as \( o \), then we can represent the decoder as \( o = D(s) \).
Both the encoder and the decoder are typically composed of one or more layers, which can be fully connected, convolutional, or recurrent, depending on the nature of the input data and the autoencoder's architecture.1 The entire autoencoder process can be summarized as:

\[ o = D(E(x)) \]
An illustration of the architecture of autoencoders. Source: 1.
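As a concrete illustration, here is a minimal sketch of a fully connected autoencoder in PyTorch. The layer sizes, the sigmoid output, and the flattened 784-dimensional input (e.g., MNIST-like images) are illustrative assumptions, not taken from the referenced sources.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder: o = D(E(x))."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder E: input -> latent representation s
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder D: latent representation s -> reconstruction o
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid(),  # assumes inputs scaled to [0, 1]
        )

    def forward(self, x):
        s = self.encoder(x)   # s = E(x)
        o = self.decoder(s)   # o = D(s)
        return o

# Usage: reconstruct a random batch and measure the reconstruction error.
model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.functional.mse_loss(model(x), x)
```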
Types of Autoencoders
There are several types of autoencoders, each with its own particularities:
Vanilla Autoencoders
Vanilla autoencoders use fully connected layers for both the encoder and the decoder. They compress the input information and are typically applied to simple data.
Both the encoder and the decoder are fully connected networks. The encoder maps the input data to the latent space (the compressed, encoded data), and the decoder maps the latent-space data back to the output (the reconstructed data). Source: 1.
Latent Space is a compressed representation of the input data. The dimensionality of the latent space is typically much smaller than that of the input data, which forces the autoencoder to learn a compact representation that captures the most important features of the data.
Convolutional Autoencoders
In convolutional autoencoders, the encoder and the decoder are neural networks built from convolutional layers, which makes this approach well suited to image data.
In convolutional autoencoders, the encoder and decoder are based on Convolutional Neural Networks (CNNs). This architecture is particularly effective for image data, as it can capture spatial hierarchies and patterns. Source: 2.
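Below is a minimal sketch of a convolutional autoencoder, assuming single-channel 28×28 images; the channel counts and layer choices are illustrative, not taken from the referenced sources.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Minimal convolutional autoencoder for 1x28x28 images."""
    def __init__(self):
        super().__init__()
        # Encoder: downsample with strided convolutions.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28 -> 14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14 -> 7
            nn.ReLU(),
        )
        # Decoder: upsample back with transposed convolutions.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 7 -> 14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # 14 -> 28
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Usage: images in, reconstructed images of the same shape out.
x = torch.rand(8, 1, 28, 28)
x_hat = ConvAutoencoder()(x)
```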
Variational Autoencoders
Variational Autoencoders (VAEs) are generative models that learn to encode data into a lower-dimensional latent space and then decode it back to the original space. VAEs can generate new samples from the learned latent distribution, making them ideal for image generation and style transfer tasks.
A VAE maps input data \( \mathbf{x} \) into a latent space \( \mathbf{z} \) and then reconstructs it back in the original space as \( \mathbf{\hat{x}} \) (the output). The encoder learns to capture the underlying structure of the data, while the decoder generates new samples from the latent space. Source: 2.
VAEs were introduced in 2013 by Kingma and Welling in Auto-Encoding Variational Bayes.
Figure: Comparison between a standard Autoencoder and a Variational Autoencoder (VAE). In a standard Autoencoder, the encoder maps input data \( \mathbf{x} \) to a fixed latent representation \( \mathbf{z} \), which is then used by the decoder to reconstruct the input as \( \mathbf{\hat{x}} \). In contrast, a VAE encodes the input data into a distribution over the latent space, typically modeled as a Gaussian distribution with mean \( \mu \) and standard deviation \( \sigma \). During training, the VAE samples from this distribution to obtain \( \mathbf{z} \), which is then used by the decoder to reconstruct the input. This probabilistic approach allows VAEs to generate new samples by sampling from the latent space, making them powerful generative models. Dataset: Fashion-MNIST. Source: 3.
Key Features of VAEs
VAEs have the ability to learn smooth and continuous latent spaces, which allows for meaningful interpolation between data points. This is particularly useful in applications such as image generation, where one can generate new images by sampling from the latent space. In addition, the probabilistic nature of VAEs helps regularize the latent space, preventing overfitting while introducing a controlled amount of randomness, so that the model generalizes well to unseen data.
Aspects of VAEs include:
- Regularization and Continuity: The latent space in VAEs is regularized to follow a prior distribution (usually a standard normal distribution). This encourages the model to learn a continuous and smooth latent space, allowing for meaningful interpolation between data points.
- Simplicity in Sampling: VAEs can generate new samples by simply sampling from the latent space distribution (the Gaussian distribution is mathematically tractable), making them efficient for generative tasks.
- Reparameterization Trick: To enable backpropagation through the stochastic sampling process, VAEs employ the reparameterization trick. This involves expressing the sampled latent variable \( \mathbf{z} \) as a deterministic function of the input data \( \mathbf{x} \) and a random noise variable \( \mathbf{\epsilon} \), allowing gradients to flow through the network during training.
- Balanced Latent Space: The KL divergence term in the VAE loss function encourages the learned latent space to be similar to the prior distribution, promoting a well-structured and balanced latent space.
Training VAEs
VAEs are trained with a loss function that combines a reconstruction term (how well the decoder reconstructs the input) with a Kullback-Leibler (KL) divergence term, which measures the difference between the learned latent distribution and the prior distribution.
Suppose we want to infer the latent variables \( z \) that explain the input data \( x \), i.e., the posterior distribution \( p(z|x) \). By Bayes' theorem:

\[ p(z|x) = \frac{p(x|z)\, p(z)}{p(x)} \]

The problem is that the evidence \( p(x) \) is intractable:

\[ p(x) = \int p(x|z)\, p(z)\, dz \]

This integral is generally intractable. Hence, we approximate the true posterior with a variational distribution \( q(z|x) \), which is easier to compute, and we minimize the KL divergence between \( q(z|x) \) and \( p(z|x) \):

\[ \min_{q} \ \text{KL}\big(q(z|x)\,\|\,p(z|x)\big) \]

Simplifying, this minimization problem is equivalent to maximizing the evidence lower bound (ELBO):

\[ \text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}\big(q(z|x)\,\|\,p(z)\big) \]
where:
- The first term, \( \mathbb{E}_{q(z|x)}[\log p(x|z)] \), is the expected log-likelihood of the data given the latent variable, which encourages the model to reconstruct the input data accurately.
- The second term, \( \text{KL}(q(z|x) || p(z)) \), is the KL divergence between the approximate posterior and the prior distribution, which regularizes the latent space.
Thus, the loss function for training a VAE can be expressed as the negative ELBO:

\[ \mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \text{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) \]
Figure: Basic architecture of a Variational Autoencoder (VAE). The encoder maps input data \( \mathbf{x} \) to a latent representation \( \mathbf{z} \), and the decoder reconstructs \( \mathbf{x'} \) from \( \mathbf{z} \). Source: Wikipedia
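In code, this loss is commonly implemented as a summed reconstruction error plus the closed-form KL term between a diagonal Gaussian posterior and a standard normal prior. The sketch below assumes a Gaussian decoder, so the reconstruction term reduces to a summed squared error; the function name vae_loss is ours.

```python
import torch

def vae_loss(recon_x, x, mu, logvar):
    """Negative ELBO: summed reconstruction error + KL divergence.

    Assumes a Gaussian decoder (MSE reconstruction term) and a
    standard normal prior p(z) = N(0, I).
    """
    # Reconstruction term: how well the decoder rebuilds the input.
    recon_loss = torch.sum((x - recon_x) ** 2)
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```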
Reparameterization Trick
The reparameterization trick is a key innovation that allows for efficient backpropagation through the stochastic layers of a VAE. Instead of sampling \( z \) directly from \( q(z|x) \), we express \( z \) as a deterministic function of \( x \) and some noise \( \epsilon \) drawn from a simple distribution (e.g., a standard Gaussian):

\[ z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \]

where \( \mu \) and \( \sigma \) are the mean and standard deviation outputs of the encoder. This transformation allows us to backpropagate through the network while still maintaining the stochastic nature of the latent variable.
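A sketch of this sampling step as it is commonly written in PyTorch, assuming the encoder outputs the mean and the log variance (the function name reparameterize is illustrative):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients can flow back
    through mu and logvar to the encoder parameters.
    """
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log(sigma^2))
    eps = torch.randn_like(std)     # noise from a standard normal
    return mu + std * eps
```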
Numerical Simulation
VAE - Variational Autoencoder
A Variational Autoencoder (VAE) encodes input data into a probabilistic latent space (defined by mean μ and log-variance log(σ²)) and decodes it back to reconstruct the input. The latent space is sampled using the reparameterization trick for differentiability. The loss combines reconstruction error (MSE) and KL divergence to regularize the latent distribution toward a standard normal.
For this numerical example, we've scaled up to:
- Input dimension: 4 (e.g., a vector like [1.0, 2.0, 3.0, 4.0])
- Latent dimension: 2
- Output dimension: 4 (reconstruction of the input)
- Hidden layer size: 8 (for both encoder and decoder, to add capacity)
The model uses PyTorch with random initialization (seeded at 42 for reproducibility). All calculations are shown step-by-step, including matrix multiplications where relevant. Weights and biases are explicitly listed below.
Model Architecture
- Encoder:
  - Linear (fc1): 4 inputs → 8 hidden units, followed by ReLU.
  - Linear to μ (fc_mu): 8 → 2.
  - Linear to logvar (fc_logvar): 8 → 2.
- Latent: Sample z from N(μ, σ²) using the reparameterization trick.
- Decoder:
  - Linear (fc_dec1): 2 latent → 8 hidden units, followed by ReLU.
  - Linear to output (fc_dec2): 8 → 4 (no final activation, assuming a Gaussian output for simplicity).
- Loss: Summed MSE for reconstruction + KL divergence (without β annealing).
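A minimal PyTorch sketch of the model described above; the class name TinyVAE and the helper loss_fn are ours, while the layer names match the fc1, fc_mu, fc_logvar, fc_dec1, and fc_dec2 used throughout this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # reproducible random initialization

class TinyVAE(nn.Module):
    """VAE with input dim 4, hidden dim 8, latent dim 2."""
    def __init__(self, input_dim=4, hidden_dim=8, latent_dim=2):
        super().__init__()
        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.fc_dec1 = nn.Linear(latent_dim, hidden_dim)
        self.fc_dec2 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z):
        h = torch.relu(self.fc_dec1(z))
        return self.fc_dec2(h)  # no final activation (Gaussian output)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def loss_fn(recon_x, x, mu, logvar):
    # Summed MSE reconstruction + KL divergence, as described above.
    mse = torch.sum((recon_x - x) ** 2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kl
```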
Weights and Biases
All parameters are initialized randomly (via torch.manual_seed(42)). Here they are:
Encoder
- fc1.weight (encoder input to hidden, shape [8, 4]):
- fc1.bias (shape [8]):
- fc_mu.weight (hidden to μ, shape [2, 8]):
- fc_mu.bias (shape [2]):
- fc_logvar.weight (hidden to logvar, shape [2, 8]):
- fc_logvar.bias (shape [2]):
Decoder
- fc_dec1.weight (latent to decoder hidden, shape [8, 2]):
- fc_dec1.bias (shape [8]):
- fc_dec2.weight (decoder hidden to output, shape [4, 8]):
- fc_dec2.bias (shape [4]):
Forward Pass
- Input: \( x = [1.0, 2.0, 3.0, 4.0] \) (batch size 1, dim 4).
- Encoding to Hidden Layer:
  - Compute pre-ReLU: fc1(x) = fc1.weight @ x^T + fc1.bias.
  - This is a matrix multiplication: each row of fc1.weight dotted with x, plus the bias.
  - Result (pre-ReLU):
  - After ReLU (non-negative): (note: the last two entries are zeroed by ReLU).
- Compute Mean (μ) in Latent Space:
  - μ = fc_mu.weight @ hidden^T + fc_mu.bias.
  - Result:
  - This is the mean of the 2D latent Gaussian.
- Compute Log-Variance (logvar) in Latent Space:
  - logvar = fc_logvar.weight @ hidden^T + fc_logvar.bias.
  - Result:
  - Variance σ² = exp(logvar):
- Latent Space: Sampling z (Reparameterization Trick):
  - std (σ) = exp(0.5 * logvar):
  - ε ~ N(0, 1) (seeded random):
  - \( z = \mu + \text{std} \cdot \epsilon \)
  - Result:
- Decoding to Reconstructed Output:
  - Decoder hidden: ReLU( fc_dec1.weight @ z^T + fc_dec1.bias ).
  - Pre-ReLU:
  - After ReLU:
  - recon_x = fc_dec2.weight @ decoder_hidden^T + fc_dec2.bias.
  - Result:
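This forward pass can be reproduced, under the assumptions of the TinyVAE sketch defined earlier, roughly as follows; the exact numbers depend on the seed and on the order in which random numbers are drawn.

```python
# Run the forward pass for the example input (batch size 1, dim 4).
model = TinyVAE()
x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
recon_x, mu, logvar = model(x)
loss = loss_fn(recon_x, x, mu, logvar)
print(recon_x, mu, logvar, loss)
```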
Loss Calculation
- Reconstruction Loss (MSE): sum over dimensions of \( (x - \hat{x})^2 \approx 31.100958927489703 \)
- KL Divergence:
  \[ \text{KL} = -0.5 \sum \left(1 + \text{logvar} - \mu^2 - \exp(\text{logvar})\right) \approx 0.40952290104490313 \]
- Total Loss:
  \[ \text{Loss} = \text{MSE} + \text{KL} \approx 31.510481828534605 \]
Backward Pass
The backward pass computes gradients via autograd (chain rule from loss back through the network). This enables training by updating weights (e.g., via SGD). Gradients are zero-initialized before .backward().
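Using the TinyVAE sketch from earlier, this step would look roughly like the following; the gradient values below depend on the same seeded initialization.

```python
# Compute gradients for every parameter of the sketch model above.
model.zero_grad()      # clear any stale gradients
loss.backward()        # autograd applies the chain rule from the loss

# Inspect a few gradients, e.g. dLoss/d(fc1.weight):
print(model.fc1.weight.grad)      # shape [8, 4]
print(model.fc_dec2.bias.grad)    # shape [4]
```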
After loss.backward(), key gradients \( \displaystyle \frac{\partial \text{Loss}}{\partial \text{param}} \) are:
Decoder
- fc_dec2.weight.grad (shape [4, 8]): \( \partial \text{Loss} / \partial\, \text{fc\_dec2.weight} \)
- fc_dec2.bias.grad (shape [4]):
- fc_dec1.weight.grad (shape [8, 2]):
  [[ 0.0000, -0.0000],
   [ 2.7321, -2.6684],
   [ 0.0000, -0.0000],
   [ 2.6066, -2.5459],
   [-1.7850,  1.7434],
   [ 2.1179, -2.0685],
   [ 0.0000, -0.0000],
   [ 0.0000, -0.0000]]
  Primarily from the MSE term, backpropagated through the decoder.
- fc_dec1.bias.grad (shape [8]):
Encoder
- fc_mu.weight.grad (shape [2, 8]): includes ∂KL/∂μ ≈ μ (pulling toward 0) plus the flow from the MSE term via z.
- fc_mu.bias.grad (shape [2]):
- fc_logvar.weight.grad (shape [2, 8]): from ∂KL/∂logvar ≈ 0.5·(exp(logvar) − 1) plus the MSE flow.
- fc_logvar.bias.grad (shape [2]):
- fc1.weight.grad (shape [8, 4]):
  [[ 0.2735,  0.5470,  0.8204,  1.0939],
   [-0.2724, -0.5448, -0.8172, -1.0896],
   [ 0.5339,  1.0679,  1.6018,  2.1358],
   [ 1.2016,  2.4032,  3.6049,  4.8065],
   [ 0.7601,  1.5201,  2.2802,  3.0403],
   [-0.5289, -1.0578, -1.5868, -2.1157],
   [ 0.0000,  0.0000,  0.0000,  0.0000],
   [ 0.0000,  0.0000,  0.0000,  0.0000]]
  These flow from both the MSE term (via the reconstruction) and the KL term (via μ/logvar). The last two rows are zero because ReLU zeroed those hidden units.
- fc1.bias.grad (shape [8]):
These gradients would update parameters in training (e.g., param -= lr * grad). Note zeros where ReLU gates flow. This example uses a single pass; real training iterates over datasets. If you change the seed, input, or dimensions, values will differ, but the process remains identical.
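For completeness, an illustrative single training step with torch.optim.SGD, reusing the TinyVAE sketch and loss_fn from above (real training would loop this over a dataset):

```python
import torch

# One SGD step: torch.optim.SGD performs param <- param - lr * grad.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer.zero_grad()
recon_x, mu, logvar = model(x)
loss = loss_fn(recon_x, x, mu, logvar)
loss.backward()
optimizer.step()
```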
Additional
Relation between Log Variance and Standard Deviation
- In VAEs, the encoder outputs the mean \( \mu \) and log variance \( \log(\sigma^2) \) of the latent space distribution.
- The standard deviation \( \sigma \) can be derived from the log variance using the relationship \( \sigma = \exp\!\left(\tfrac{1}{2}\log(\sigma^2)\right) \).
- This transformation ensures numerical stability and positivity of the variance during training.
1. Definitions
For a random variable \( x \) that follows a normal distribution:

\[ x \sim \mathcal{N}(\mu, \sigma^2) \]
where:
- \( \mu \): mean
- \( \sigma^2 \): variance
- \( \sigma \): standard deviation
2. Log variance
Often, instead of directly predicting or storing the variance \( \sigma^2 \) or the standard deviation \( \sigma \), models work with the log variance:

\[ \text{log\_var} = \log(\sigma^2) \]
3. Relationship between log variance and std
From the above definition:

\[ \sigma^2 = e^{\text{log\_var}} \]

Taking the square root to get the standard deviation:

\[ \sigma = \sqrt{e^{\text{log\_var}}} \]

So:

\[ \sigma = e^{\frac{1}{2}\text{log\_var}} \]

and conversely,

\[ \text{log\_var} = 2\log(\sigma) \]
4. Why use log variance?
It’s common in neural nets because:
- It ensures the variance is always positive (since \( e^x > 0 \)).
- It’s numerically more stable when optimizing.
- It allows unconstrained outputs from the network (no need to force positivity).
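These identities can be checked numerically; the snippet below uses arbitrary example values.

```python
import torch

logvar = torch.tensor([-0.5, 0.0, 1.2])   # arbitrary log-variances
var = torch.exp(logvar)                    # sigma^2 = exp(log_var)
std = torch.exp(0.5 * logvar)              # sigma = exp(0.5 * log_var)

# Consistency checks: std**2 == var and 2*log(std) == log_var.
assert torch.allclose(std ** 2, var)
assert torch.allclose(2 * torch.log(std), logvar)
```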
Summary
| Quantity | Expression | In terms of log_var |
|---|---|---|
| Variance | \( \sigma^2 \) | \( e^{\text{log\_var}} \) |
| Std. deviation | \( \sigma \) | \( e^{\frac{1}{2}\text{log\_var}} \) |
| Log variance | \( \text{log\_var} \) | \( 2 \log(\sigma) \) |
1. Sharma, A. "Introduction to Autoencoders," PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, and R. Raha, eds., 2023. ↩↩↩
2. Bandyopadhyay, H. "What is an autoencoder and how does it work? Learn about most common types of autoencoders and their applications in machine learning." ↩↩
3. Sharma, A. "A Deep Dive into Variational Autoencoders with PyTorch," PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, and R. Raha, eds., 2023. ↩







