Layer Normalization normalizes an activation vector across its features, using only the mean and variance of that vector.

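For an activation vector $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, compute the mean and variance over its features:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

Then each component is normalized:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

With learned scale and shift parameters:

$$y_i = \gamma_i \hat{x}_i + \beta_i$$

  • $x$ = activation vector
  • $d$ = hidden dimension
  • $\mu$ = mean of the vector
  • $\sigma^2$ = variance of the vector
  • $\epsilon$ = small constant for numerical stability
  • $\gamma$ = learned scale parameter
  • $\beta$ = learned shift/bias parameter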
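
The computation can be sketched in a few lines of plain Python. This is a minimal illustration for a single vector, not a library implementation: the function name `layer_norm`, the default `eps=1e-5`, and the sample values are assumptions chosen for the example, and the scale and shift are set by hand rather than learned.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply Layer Normalization to one activation vector (a list of floats)."""
    d = len(x)                                   # hidden dimension
    mu = sum(x) / d                              # mean of the vector
    var = sum((xi - mu) ** 2 for xi in x) / d    # variance of the vector (divides by d)
    inv_std = 1.0 / math.sqrt(var + eps)         # eps keeps the division stable
    x_hat = [(xi - mu) * inv_std for xi in x]    # normalized components
    # Per-feature scale and shift (hand-set here, learned in a real network).
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

# With scale 1 and shift 0, the output has mean ~0 and variance ~1.
mean_y = sum(y) / len(y)
var_y = sum((yi - mean_y) ** 2 for yi in y) / len(y)
```

The variance comes out slightly below 1 because of the small constant added under the square root.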

Intuition

Layer Norm asks: “For this activation vector, how far is each value from the vector’s average?”

It recenters the vector around zero and rescales it to unit variance, so its spread stays stable.

This helps prevent activations from becoming too large or too small during training.
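
A small numeric sketch makes this concrete. The helper name `normalize`, the value of `eps`, and the two sample vectors are assumptions made for illustration. The vectors differ in scale by a factor of 100, yet they normalize to almost the same result, approximately [-1.22, 0.0, 1.22].

```python
import math

def normalize(x, eps=1e-5):
    """Recenter a vector to zero mean and rescale it to roughly unit variance."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

small = normalize([1.0, 2.0, 3.0])
large = normalize([100.0, 200.0, 300.0])

# Largest difference between the two normalized vectors (tiny, caused only by eps).
max_gap = max(abs(a - b) for a, b in zip(small, large))
```

Only the relative position of each value within its own vector survives normalization; the overall magnitude does not.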

Why Gamma and Beta?

After normalization, the vector has zero mean and unit variance. That can be too restrictive.

$\gamma$ and $\beta$ let the network learn the best scale and shift after normalization:

  • $\gamma$ controls how much to stretch or shrink each feature.
  • $\beta$ controls how much to shift each feature.

So Layer Norm stabilizes the activations, but still lets the model recover useful scales and offsets.
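
The effect of the scale and shift can be seen on a vector that is already normalized. In this sketch the helper name `scale_and_shift` and the parameter values (scale 2.0, shift 0.5 for every feature) are hypothetical and hand-picked; in a real network they would be learned.

```python
import math

def scale_and_shift(x_hat, gamma, beta):
    """Apply a per-feature scale (gamma) and shift (beta) to a normalized vector."""
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]

# A vector that already has zero mean and unit variance.
x_hat = [-1.0, 1.0, -1.0, 1.0]

# Scale 1 and shift 0 leave the normalized vector unchanged.
unchanged = scale_and_shift(x_hat, gamma=[1.0] * 4, beta=[0.0] * 4)

# The same scale and shift on every feature moves the mean to 0.5
# and stretches the standard deviation to 2.0.
y = scale_and_shift(x_hat, gamma=[2.0] * 4, beta=[0.5] * 4)
mean_y = sum(y) / len(y)
std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / len(y))
```

Because the scale and shift are applied per feature, different features can end up with different scales and offsets even though they were normalized together.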