RMS Norm is a normalization method commonly used in Transformer models.
It rescales an activation vector by its root mean square value.
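$$
\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma
$$
where $x$ = activation vector, $d$ = hidden dimension, $\gamma$ = learned scale parameter, $\epsilon$ = small constant for numerical stability.
The denominator is the root mean square:
$$
\text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}
$$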
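The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `rms_norm`, the default `eps=1e-6`, and the placement of the constant inside the square root are assumptions made for the example, and a real model would apply the same arithmetic to tensors rather than lists.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Apply RMS Norm to one activation vector given as a list of floats."""
    d = len(x)  # hidden dimension
    # Root mean square of the components; eps is added inside the square
    # root (an assumption of this sketch) to avoid division by zero.
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    # Divide every component by the RMS, then apply the learned scale.
    return [g * v / rms for v, g in zip(x, gamma)]

x = [3.0, 4.0]
gamma = [1.0, 1.0]  # identity scale, chosen only for illustration
y = rms_norm(x, gamma)
print(y)  # roughly [0.85, 1.13]
```

With the identity scale, the root mean square of the output is close to 1, which is the sense in which the vector has been normalized.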
Intuition
RMS Norm controls the scale of the activation vector without subtracting its mean.
Instead of asking “how far is each value from the mean?”, it asks “how large is this vector on average?”
This makes the operation simpler than Layer Normalization because it only normalizes by magnitude.
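A small numeric check makes the "magnitude only" idea concrete. The sketch below assumes unit scale and omits the stability constant so the numbers stay exact; the helper names `rms` and `mean` are illustrative. The mean is never removed, it is simply divided by the same factor as every other component, while the root mean square of the result becomes 1.

```python
import math

def rms(x):
    """Root mean square of a list of floats."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mean(x):
    """Arithmetic mean of a list of floats."""
    return sum(x) / len(x)

x = [2.0, 4.0, 6.0, 8.0]       # mean 5.0, clearly not centered
y = [v / rms(x) for v in x]    # RMS Norm with unit scale, no epsilon

print(mean(x), rms(x))  # input mean and input magnitude
print(mean(y), rms(y))  # mean is rescaled, not zeroed; RMS is now 1
```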
Connection to Layer Norm
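Layer Normalization usually does:
$$
\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta
$$
where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the hidden dimension and $\beta$ is a learned bias.
RMS Norm removes the mean-centering step and usually removes the learned bias term:
$$
\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma
$$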
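To see the relationship numerically, the sketch below puts both operations side by side. The function names, the `eps` default, and the test vectors are assumptions for illustration. When the input already has zero mean, the variance equals the mean of squares, so Layer Norm (with zero bias) and RMS Norm give the same result. Once a constant offset is added, Layer Norm removes it and RMS Norm does not.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Layer Norm over one vector: center, divide by std, scale, shift."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    std = math.sqrt(var + eps)
    return [g * (v - mu) / std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-6):
    """RMS Norm over one vector: divide by root mean square, then scale."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

ones, zeros = [1.0] * 4, [0.0] * 4

centered = [-3.0, -1.0, 1.0, 3.0]        # mean is exactly 0
shifted = [v + 10.0 for v in centered]   # same shape, mean 10

ln_c, rn_c = layer_norm(centered, ones, zeros), rms_norm(centered, ones)
ln_s, rn_s = layer_norm(shifted, ones, zeros), rms_norm(shifted, ones)

print(ln_c, rn_c)  # the two agree on zero-mean input
print(ln_s, rn_s)  # Layer Norm ignores the offset, RMS Norm does not
```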
Why Use RMS Norm?
- Cheaper than Layer Norm because it skips computing and subtracting the mean.
- Helps keep activation magnitudes stable during training.
- Preserves the direction of the activation vector while normalizing its size (before the learned scale is applied).
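The stability and direction claims in the list above can be checked with a short sketch. It assumes unit scale so that only the normalization itself is visible; `rms_norm` and `cosine` are illustrative helpers, not library functions. The cosine similarity between input and output is 1 up to rounding, and multiplying the input by 1000 leaves the output essentially unchanged.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMS Norm with unit scale (gamma = 1), to isolate the normalization."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

x = [0.5, -2.0, 1.5, 3.0]
big = [1000.0 * v for v in x]  # same direction, much larger magnitude

y = rms_norm(x)
y_big = rms_norm(big)

print(cosine(x, y))                                # direction is unchanged
print(max(abs(a - b) for a, b in zip(y, y_big)))   # tiny: only eps differs
```

The small remaining difference between `y` and `y_big` comes from the stability constant, which matters slightly more for the smaller input.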