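Layer Normalization normalizes an activation vector across its features. The statistics come from that one vector alone, independent of the other examples in the batch.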
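For an activation vector $x = (x_1, \dots, x_H) \in \mathbb{R}^H$, the mean and variance are:
$$\mu = \frac{1}{H} \sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2$$
Then each component is normalized:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
With learned scale and shift parameters:
$$y_i = \gamma_i \hat{x}_i + \beta_i$$
$x$ = activation vector
$H$ = hidden dimension
$\mu$ = mean of the vector
$\sigma^2$ = variance of the vector
$\epsilon$ = small constant for numerical stability
$\gamma$ = learned scale parameter
$\beta$ = learned shift/bias parameter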
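The normalization steps above can be sketched in a few lines of plain Python. This is a minimal illustration for a single vector, not a reference implementation: the function name `layer_norm`, the default `eps=1e-5`, and the sample input are assumptions chosen for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize one activation vector x (a list of floats)."""
    h = len(x)                                   # hidden dimension
    mean = sum(x) / h                            # mean of the vector
    var = sum((v - mean) ** 2 for v in x) / h    # population variance (divide by h)
    inv_std = 1.0 / math.sqrt(var + eps)         # eps keeps the division stable
    # Normalize each component, then apply the per-feature scale and shift.
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Assumed sample input; gamma = 1 and beta = 0 leave the normalized values unchanged.
x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(y)  # roughly [-1.34, -0.45, 0.45, 1.34]
```

With the scale set to one and the shift set to zero, the output has a mean of about zero and a variance just under one (slightly under because of `eps`).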
Intuition
Layer Norm asks: “For this activation vector, how far is each value from the vector’s average?”
It recenters the vector around zero and rescales it to unit variance, so the spread stays stable regardless of the input's original scale.
This helps prevent activations from becoming too large or too small during training.
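To make the intuition concrete, the sketch below normalizes two vectors that differ only by a shift and a rescale. The helper name `normalize`, the `eps` value, and the numbers are assumptions picked for illustration. Both inputs should come out nearly identical, because only each value's distance from the vector's average, measured in units of the vector's spread, survives.

```python
import math

def normalize(x, eps=1e-5):
    """Recenter x around zero and rescale it to (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

base = [1.0, 2.0, 3.0, 4.0]
# Same pattern, but ten times wider and shifted far from zero.
stretched = [10.0 * v + 500.0 for v in base]

a = normalize(base)
b = normalize(stretched)
print(a)  # roughly [-1.34, -0.45, 0.45, 1.34]
print(b)  # nearly the same values as a
```

The two results differ only by the tiny effect of `eps`, which matters less as the variance grows.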
Why Gamma and Beta?
After normalization, every vector has zero mean and roughly unit variance. That can be too restrictive, because later layers may work better with features at a different scale or offset.
$\gamma$ controls how much to stretch or shrink each feature. $\beta$ controls how much to shift each feature.
So Layer Norm stabilizes the activations, but still lets the model recover useful scales and offsets.
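One way to see that the scale and shift give back the lost flexibility: if the scale equals the vector's standard deviation and the shift equals its mean, the original vector is recovered. The sketch below checks this for one assumed input. In a real network these parameters are learned by gradient descent and shared across all inputs rather than computed from a single vector, so this is a demonstration of expressiveness only. The function name and values are hypothetical.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize x, then apply per-feature scale (gamma) and shift (beta)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

eps = 1e-5
x = [2.0, 4.0, 6.0, 8.0]
mean = sum(x) / len(x)
var = sum((v - mean) ** 2 for v in x) / len(x)
std = math.sqrt(var + eps)

# gamma = 1, beta = 0: plain normalization (zero mean, roughly unit variance).
plain = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4, eps=eps)

# gamma = std, beta = mean: the scale and shift undo the normalization.
restored = layer_norm(x, gamma=[std] * 4, beta=[mean] * 4, eps=eps)
print(restored)  # matches x up to floating-point rounding
```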