An autoencoder learns a latent representation that compresses the data and allows the original input to be reconstructed from that compressed code

Its probabilistic extension is called a Variational Autoencoder (VAE)

The encoder doesn’t output one fixed code but a distribution over possible latent codes

The decoder describes how data is generated from a latent variable

Main Idea

A VAE is a probabilistic generative model

Instead of mapping an input x to one deterministic latent vector z, it learns:

  • An encoder distribution: q_φ(z | x)
  • A decoder distribution: p_θ(x | z)
  • A prior over latent variables: p(z), usually N(0, I)

So the model assumes data is generated by:

  1. Sample a latent vector z ~ p(z)
  2. Sample an observation x ~ p_θ(x | z)

The encoder approximates the posterior distribution p_θ(z | x), which is usually intractable to compute exactly.
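The two-step generative process above can be sketched in NumPy. The linear decoder here is a hypothetical stand-in for a trained network, and the Gaussian observation noise is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 4

# Hypothetical decoder: a fixed linear map standing in for a trained network.
W = rng.normal(size=(data_dim, latent_dim))
b = np.zeros(data_dim)

def decode_mean(z):
    """Mean of p_theta(x | z) for a Gaussian decoder (illustration only)."""
    return W @ z + b

# 1. Sample a latent vector z ~ p(z) = N(0, I)
z = rng.standard_normal(latent_dim)

# 2. Sample an observation x ~ p_theta(x | z) = N(decode_mean(z), I)
x = decode_mean(z) + rng.standard_normal(data_dim)
```

In a real VAE the decoder is a neural network and its output parameterizes the observation distribution (Gaussian for continuous data, Bernoulli for binary data).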

Latent Distribution

For each input x, the encoder outputs the parameters of a diagonal Gaussian distribution:

q_φ(z | x) = N(z; μ_φ(x), diag(σ²_φ(x)))

where:

  • μ_φ(x) is the mean vector
  • σ²_φ(x) is the variance vector
  • φ are the encoder parameters

This means each example is represented by a region in latent space, not a single point.
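A minimal sketch of such an encoder, with hypothetical random weights in place of a trained network (real encoders typically predict the log-variance so the variance stays positive):

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, latent_dim = 4, 2

# Hypothetical encoder weights (illustration only; a real encoder is trained).
W_mu = rng.normal(size=(latent_dim, input_dim))
W_logvar = rng.normal(size=(latent_dim, input_dim))

def encode(x):
    """Return (mu, log_var) of the diagonal Gaussian q_phi(z | x)."""
    mu = W_mu @ x
    log_var = W_logvar @ x  # predicting log-variance keeps sigma^2 > 0
    return mu, log_var

mu, log_var = encode(rng.standard_normal(input_dim))
sigma = np.exp(0.5 * log_var)  # standard deviation, always positive
```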

Objective Function

We want to maximize the log-likelihood of the data:

log p_θ(x) = log ∫ p_θ(x | z) p(z) dz

But this is difficult because the integral over z is typically intractable, so VAEs instead optimize the Evidence Lower Bound (ELBO):

log p_θ(x) ≥ ELBO(x)

and:

ELBO(x) = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL( q_φ(z | x) ‖ p(z) )

Meaning of the Two Terms

1. Reconstruction Term

E_{q_φ(z|x)}[ log p_θ(x | z) ]

This encourages the decoder to reconstruct the input accurately from the latent sample.

If reconstruction is poor, this term becomes smaller.
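For binary data with a Bernoulli decoder, this term is the negative binary cross-entropy. A small sketch (the example inputs and predictions are made up for illustration):

```python
import numpy as np

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p_theta(x | z) for a Bernoulli decoder: the negative of
    binary cross-entropy, summed over features."""
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([1.0, 0.0, 1.0])
good = bernoulli_log_likelihood(x, np.array([0.9, 0.1, 0.9]))
bad = bernoulli_log_likelihood(x, np.array([0.5, 0.5, 0.5]))
# A worse reconstruction gives a smaller (more negative) term.
```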

2. KL Divergence Term

KL( q_φ(z | x) ‖ p(z) )

This pushes the encoder’s latent distribution to stay close to the prior, usually N(0, I).

This regularizes the latent space so nearby points decode to similar outputs and random samples from the prior can generate realistic data.

Loss Used in Practice

Since training usually minimizes a loss, people often write:

L(x) = −ELBO(x)

or equivalently:

L(x) = −E_{q_φ(z|x)}[ log p_θ(x | z) ] + KL( q_φ(z | x) ‖ p(z) )
     = reconstruction loss + KL loss

For a diagonal Gaussian posterior and standard normal prior, the KL term has a closed form:

KL = ½ Σ_j ( μ_j² + σ_j² − log σ_j² − 1 )

This closed form makes training efficient.
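The closed-form KL is a one-liner in NumPy. A quick sanity check: when the posterior equals the prior (μ = 0, σ² = 1), the divergence is exactly zero:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), closed form:
    0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

# Posterior equal to the prior -> zero divergence:
kl_zero = kl_diag_gaussian(np.zeros(2), np.zeros(2))           # -> 0.0
# Shifting the mean away from zero is penalized:
kl_off = kl_diag_gaussian(np.array([1.0, -1.0]), np.zeros(2))  # -> 1.0
```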

Reparameterization Trick

We need to sample z ~ q_φ(z | x), but direct sampling breaks standard backpropagation.

So we rewrite the sample as:

z = μ + σ ⊙ ε,  where ε ~ N(0, I)

where ⊙ is element-wise multiplication.

Now the randomness is isolated in ε, while μ and σ remain differentiable functions of the encoder output.

This is the reparameterization trick.
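A sketch of the trick, using made-up encoder outputs; averaging many samples confirms z has mean μ and standard deviation σ:

```python
import numpy as np

rng = np.random.default_rng(42)

mu = np.array([0.5, -1.0])      # hypothetical encoder mean output
log_var = np.array([0.0, 0.2])  # hypothetical encoder log-variance output
sigma = np.exp(0.5 * log_var)

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
# Gradients flow through mu and sigma; eps carries all the randomness.
eps = rng.standard_normal((100_000, 2))
z = mu + sigma * eps

# Sample statistics approach mu and sigma.
sample_mean = z.mean(axis=0)
sample_std = z.std(axis=0)
```

In an autodiff framework the same expression lets gradients of the loss reach μ and σ, which is exactly what direct sampling from q_φ(z | x) would block.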

Why VAE Works Better Than a Plain Autoencoder for Generation

A plain Autoencoder may learn a latent space with no smooth global structure.

A VAE forces the latent encodings to match a simple prior distribution, which gives:

  • Smoother latent space
  • Better interpolation between latent points
  • Ability to sample new data by drawing z ~ N(0, I) and decoding it

Intuition

You can think of a VAE as learning:

  • Where an input should live in latent space, through μ_φ(x)
  • How uncertain the model is about that location, through σ_φ(x)
  • How to reconstruct the input from sampled latent variables

The KL term prevents inputs from being mapped to completely separate, disconnected regions of the latent space.

Common Tradeoff

There is a tradeoff between:

  • Good reconstruction
  • Well-structured latent space

If the KL term is too strong, reconstructions may become blurry.

If the KL term is too weak, the model behaves more like a regular autoencoder and loses some generative quality.
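This tradeoff is often controlled explicitly by weighting the KL term, as in the β-VAE variant. A minimal sketch (the loss values are made-up numbers for illustration):

```python
def vae_loss(recon_log_lik, kl, beta=1.0):
    """Negative ELBO with a KL weight: beta = 1 recovers the standard VAE,
    beta > 1 emphasizes latent structure, beta < 1 emphasizes reconstruction."""
    return -recon_log_lik + beta * kl

# Same two terms, different emphasis:
loss_standard = vae_loss(-50.0, 4.0)             # -> 54.0
loss_strong_kl = vae_loss(-50.0, 4.0, beta=4.0)  # -> 66.0
```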

Compact Summary

A VAE learns a distribution in latent space instead of a single code.

It trains by maximizing the ELBO:

ELBO(x) = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL( q_φ(z | x) ‖ p(z) )

So it both:

  • Reconstructs data well
  • Keeps the latent space regular enough for sampling and generation