An autoencoder learns a latent representation that compresses the data and allows the input to be reconstructed from that compressed code.
The probabilistic version of this idea is the Variational Autoencoder (VAE).
The encoder doesn’t output one fixed code but a distribution over possible latent codes.
The decoder describes how data could be generated from a latent variable.
Main Idea
A VAE is a probabilistic generative model
Instead of mapping an input $x$ to one deterministic latent vector $z$, it learns:
- An encoder distribution: $q_\phi(z \mid x)$
- A decoder distribution: $p_\theta(x \mid z)$
- A prior over latent variables: $p(z)$, usually $\mathcal{N}(0, I)$
So the model assumes data is generated by:
- Sample a latent vector $z \sim p(z)$
- Sample an observation $x \sim p_\theta(x \mid z)$
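The two-step generative process above can be sketched with a toy linear decoder. The decoder weights `W_dec` and the noise scale are hypothetical stand-ins for a learned network:

```python
import numpy as np

rng = np.random.default_rng(1)
z_dim, x_dim = 2, 4
W_dec = rng.normal(size=(z_dim, x_dim))   # hypothetical decoder weights

# Step 1: sample a latent vector from the prior p(z) = N(0, I).
z = rng.standard_normal(z_dim)

# Step 2: sample an observation from p(x | z); here, Gaussian noise
# around a linear decoding of z.
x = z @ W_dec + 0.1 * rng.standard_normal(x_dim)
print(x.shape)  # (4,)
```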
The encoder $q_\phi(z \mid x)$ approximates the true posterior $p_\theta(z \mid x)$, which is usually intractable to compute exactly.
Latent Distribution
For each input $x$, the encoder outputs the parameters of a Gaussian distribution:
$$q_\phi(z \mid x) = \mathcal{N}\!\left(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\right)$$
where:
- $\mu_\phi(x)$ is the mean vector
- $\sigma_\phi^2(x)$ is the variance vector
- $\phi$ are the encoder parameters
This means each example is represented by a region in latent space, not a single point.
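A minimal sketch of such an encoder, using a single linear layer per output head (the sizes and weight names are assumptions for illustration; a real encoder would be a neural network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-dim input, 2-dim latent space.
x_dim, z_dim = 4, 2

# One weight matrix per output head of the "encoder".
W_mu = rng.normal(size=(x_dim, z_dim))
W_logvar = rng.normal(size=(x_dim, z_dim))

def encode(x):
    """Map an input to the parameters of a diagonal Gaussian q(z|x)."""
    mu = x @ W_mu            # mean vector
    log_var = x @ W_logvar   # log-variance, so the variance exp(log_var) > 0
    return mu, log_var

x = rng.normal(size=x_dim)
mu, log_var = encode(x)
print(mu.shape, log_var.shape)  # (2,) (2,)
```

Predicting the log-variance rather than the variance is a common choice: it keeps the variance positive without needing an explicit constraint.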
Objective Function
We want to maximize the log-likelihood of the data:
$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$$
But this integral is typically intractable, so VAEs instead optimize the Evidence Lower Bound (ELBO):
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$
and:
$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$$
Meaning of the Two Terms
1. Reconstruction Term: $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$
This encourages the decoder to reconstruct the input accurately from the latent sample.
If reconstruction is poor, this term becomes smaller.
2. KL Divergence Term: $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p(z))$
This pushes the encoder’s latent distribution to stay close to the prior, usually $\mathcal{N}(0, I)$.
This regularizes the latent space so nearby points decode to similar outputs and random samples from the prior can generate realistic data.
Loss Used in Practice
Since training usually minimizes a loss, people often write:
$$\mathcal{L}_{\text{VAE}} = -\text{ELBO}$$
or equivalently:
$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$
For a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form:
$$D_{\mathrm{KL}} = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
This closed form makes training efficient.
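The closed-form KL for a diagonal Gaussian posterior is a one-liner. A sketch, assuming the encoder outputs `mu` and `log_var` per latent dimension:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

# Sanity check: when q equals the prior (mu = 0, sigma = 1), the KL is zero.
print(kl_diag_gaussian(np.zeros(3), np.zeros(3)))  # 0.0

# Moving the mean away from zero makes the KL strictly positive.
print(kl_diag_gaussian(np.ones(3), np.zeros(3)))   # 1.5
```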
Reparameterization Trick
We need to sample from $q_\phi(z \mid x)$, but direct sampling breaks standard backpropagation.
So we rewrite the sample as:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
where $\odot$ is element-wise multiplication.
Now the randomness is isolated in $\epsilon$, while $\mu$ and $\sigma$ remain differentiable functions of the encoder output.
This is the reparameterization trick.
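A minimal sketch of the trick, assuming the encoder parameterizes the variance as `log_var`:

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, log_var):
    """Draw z ~ N(mu, diag(sigma^2)) with the randomness factored out."""
    sigma = np.exp(0.5 * log_var)        # standard deviation
    eps = rng.standard_normal(mu.shape)  # all randomness lives here
    return mu + sigma * eps              # deterministic in mu and sigma

mu = np.array([0.0, 1.0])
log_var = np.array([0.0, 0.0])           # sigma = 1 in each dimension
z = reparameterize(mu, log_var)
print(z.shape)  # (2,)
```

In an autodiff framework the same expression lets gradients flow through `mu` and `sigma` to the encoder, while `eps` is treated as a constant input.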
Why VAE Works Better Than a Plain Autoencoder for Generation
A plain Autoencoder may learn a latent space with no smooth global structure.
A VAE forces the latent encodings to match a simple prior distribution, which gives:
- Smoother latent space
- Better interpolation between latent points
- Ability to sample new data by drawing $z \sim \mathcal{N}(0, I)$ and decoding it
Intuition
You can think of a VAE as learning:
- Where an input should live in latent space (through $\mu$)
- How uncertain the model is (through $\sigma$)
- How to reconstruct the input from sampled latent variables
The KL term prevents every input from being mapped to completely separate disconnected regions.
Common Tradeoff
There is a tradeoff between:
- Good reconstruction
- Well-structured latent space
If the KL term is too strong, reconstructions may become blurry.
If the KL term is too weak, the model behaves more like a regular autoencoder and loses some generative quality.
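This tradeoff is often exposed as an explicit weight $\beta$ on the KL term (as in the β-VAE variant); $\beta = 1$ recovers the plain VAE loss. A sketch, assuming a Gaussian decoder so the reconstruction term reduces to squared error:

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Negative ELBO with weight beta on the KL term (beta=1: plain VAE)."""
    recon = np.sum((x - x_hat) ** 2)  # Gaussian decoder -> squared error
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)
    return recon + beta * kl
```

Raising `beta` trades reconstruction fidelity for a latent distribution closer to the prior; lowering it does the reverse.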
Compact Summary
VAE learns a distribution in latent space instead of a single code.
It trains by maximizing the ELBO:
$$\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$
So it both:
- Reconstructs data well
- Keeps the latent space regular enough for sampling and generation