An autoencoder learns a latent representation that compresses the data and allows the original input to be reconstructed from that compressed code

Its probabilistic extension is called a Variational Autoencoder (VAE)

The encoder doesn’t output one fixed code but a distribution over possible latent codes

The decoder describes how data is generated from a latent variable

Main Idea

A VAE is a probabilistic generative model

Instead of mapping an input x to one deterministic latent vector z, it learns:

  • An encoder distribution: q_φ(z | x)
  • A decoder distribution: p_θ(x | z)
  • A prior over latent variables: p(z), usually N(0, I)

So the model assumes data is generated by:

  1. Sample a latent vector z ~ p(z)
  2. Sample an observation x ~ p_θ(x | z)

The encoder approximates the posterior distribution p_θ(z | x), which is usually intractable to compute exactly.
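The two-step generative process above can be sketched in NumPy. The linear decoder here is a hypothetical stand-in for a trained network, and the Gaussian observation noise is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 4

# Hypothetical decoder: a fixed linear map standing in for a trained network.
W = rng.normal(size=(data_dim, latent_dim))
b = np.zeros(data_dim)

def decode_mean(z):
    """Mean of p_theta(x | z) for a Gaussian decoder (illustration only)."""
    return W @ z + b

# 1. Sample a latent vector z ~ p(z) = N(0, I)
z = rng.standard_normal(latent_dim)

# 2. Sample an observation x ~ p_theta(x | z) = N(decode_mean(z), I)
x = decode_mean(z) + rng.standard_normal(data_dim)
```

In a real VAE the decoder is a neural network and its output parameterizes the observation distribution (Gaussian for continuous data, Bernoulli for binary data).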

Latent Distribution

For each input x, the encoder outputs the parameters of a diagonal Gaussian distribution:

q_φ(z | x) = N(z; μ_φ(x), diag(σ²_φ(x)))

where:

  • μ_φ(x) is the mean vector
  • σ²_φ(x) is the variance vector
  • φ are the encoder parameters

This means each example is represented by a region in latent space, not a single point.
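A minimal sketch of such an encoder, with hypothetical random weights in place of a trained network (real encoders typically predict the log-variance so the variance stays positive):

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, latent_dim = 4, 2

# Hypothetical encoder weights (illustration only; a real encoder is trained).
W_mu = rng.normal(size=(latent_dim, input_dim))
W_logvar = rng.normal(size=(latent_dim, input_dim))

def encode(x):
    """Return (mu, log_var) of the diagonal Gaussian q_phi(z | x)."""
    mu = W_mu @ x
    log_var = W_logvar @ x  # predicting log-variance keeps sigma^2 > 0
    return mu, log_var

mu, log_var = encode(rng.standard_normal(input_dim))
sigma = np.exp(0.5 * log_var)  # standard deviation, always positive
```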

Objective Function

We want to maximize the log-likelihood of the data:

log p_θ(x) = log ∫ p_θ(x | z) p(z) dz

But this is difficult because the integral over z is typically intractable, so VAEs instead optimize the Evidence Lower Bound (ELBO):

log p_θ(x) ≥ ELBO(x)

and:

ELBO(x) = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL( q_φ(z | x) ‖ p(z) )

Meaning of the Two Terms

1. Reconstruction Term

E_{q_φ(z|x)}[ log p_θ(x | z) ]

This encourages the decoder to reconstruct the input accurately from the latent sample.

If reconstruction is poor, this term becomes smaller.
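For binary data with a Bernoulli decoder, this term is the negative binary cross-entropy. A small sketch (the example inputs and predictions are made up for illustration):

```python
import numpy as np

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p_theta(x | z) for a Bernoulli decoder: the negative of
    binary cross-entropy, summed over features."""
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([1.0, 0.0, 1.0])
good = bernoulli_log_likelihood(x, np.array([0.9, 0.1, 0.9]))
bad = bernoulli_log_likelihood(x, np.array([0.5, 0.5, 0.5]))
# A worse reconstruction gives a smaller (more negative) term.
```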

2. KL Divergence Term

KL( q_φ(z | x) ‖ p(z) )

This pushes the encoder’s latent distribution to stay close to the prior, usually N(0, I).

This regularizes the latent space so nearby points decode to similar outputs and random samples from the prior can generate realistic data.

Loss Used in Practice

Since training usually minimizes a loss, people often write:

L(x) = −ELBO(x)

or equivalently:

L(x) = −E_{q_φ(z|x)}[ log p_θ(x | z) ] + KL( q_φ(z | x) ‖ p(z) )
     = reconstruction loss + KL loss

For a diagonal Gaussian posterior and standard normal prior, the KL term has a closed form:

KL = ½ Σ_j ( μ_j² + σ_j² − log σ_j² − 1 )

This closed form makes training efficient.
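The closed-form KL is a one-liner in NumPy. A quick sanity check: when the posterior equals the prior (μ = 0, σ² = 1), the divergence is exactly zero:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), closed form:
    0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

# Posterior equal to the prior -> zero divergence:
kl_zero = kl_diag_gaussian(np.zeros(2), np.zeros(2))           # -> 0.0
# Shifting the mean away from zero is penalized:
kl_off = kl_diag_gaussian(np.array([1.0, -1.0]), np.zeros(2))  # -> 1.0
```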

Reparameterization Trick

We need to sample z ~ q_φ(z | x), but direct sampling breaks standard backpropagation.

So we rewrite the sample as:

z = μ + σ ⊙ ε,  where ε ~ N(0, I)

where ⊙ is element-wise multiplication.

Now the randomness is isolated in ε, while μ and σ remain differentiable functions of the encoder output.

This is the reparameterization trick.
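A sketch of the trick, using made-up encoder outputs; averaging many samples confirms z has mean μ and standard deviation σ:

```python
import numpy as np

rng = np.random.default_rng(42)

mu = np.array([0.5, -1.0])      # hypothetical encoder mean output
log_var = np.array([0.0, 0.2])  # hypothetical encoder log-variance output
sigma = np.exp(0.5 * log_var)

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
# Gradients flow through mu and sigma; eps carries all the randomness.
eps = rng.standard_normal((100_000, 2))
z = mu + sigma * eps

# Sample statistics approach mu and sigma.
sample_mean = z.mean(axis=0)
sample_std = z.std(axis=0)
```

In an autodiff framework the same expression lets gradients of the loss reach μ and σ, which is exactly what direct sampling from q_φ(z | x) would block.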

Why VAE Works Better Than a Plain Autoencoder for Generation

A plain Autoencoder may learn a latent space with no smooth global structure.

A VAE forces the latent encodings to match a simple prior distribution, which gives:

  • Smoother latent space
  • Better interpolation between latent points
  • Ability to sample new data by drawing z ~ N(0, I) and decoding it

Intuition

You can think of a VAE as learning:

  • Where an input should live in latent space, through μ_φ(x)
  • How uncertain the model is about that location, through σ_φ(x)
  • How to reconstruct the input from sampled latent variables

The KL term prevents inputs from being mapped to completely separate, disconnected regions of the latent space.

Common Tradeoff

There is a tradeoff between:

  • Good reconstruction
  • Well-structured latent space

If the KL term is too strong, reconstructions may become blurry.

If the KL term is too weak, the model behaves more like a regular autoencoder and loses some generative quality.
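This tradeoff is often controlled explicitly by weighting the KL term, as in the β-VAE variant. A minimal sketch (the loss values are made-up numbers for illustration):

```python
def vae_loss(recon_log_lik, kl, beta=1.0):
    """Negative ELBO with a KL weight: beta = 1 recovers the standard VAE,
    beta > 1 emphasizes latent structure, beta < 1 emphasizes reconstruction."""
    return -recon_log_lik + beta * kl

# Same two terms, different emphasis:
loss_standard = vae_loss(-50.0, 4.0)             # -> 54.0
loss_strong_kl = vae_loss(-50.0, 4.0, beta=4.0)  # -> 66.0
```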

Compact Summary

A VAE learns a distribution in latent space instead of a single code.

It trains by maximizing the ELBO:

ELBO(x) = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL( q_φ(z | x) ‖ p(z) )

So it both:

  • Reconstructs data well
  • Keeps the latent space regular enough for sampling and generation