Cosine decay is a learning rate schedule where the learning rate starts high and gradually decreases following a cosine curve
Instead of dropping the learning rate suddenly, it decays smoothly over time
Common formula:

lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(π * t / T))

- lr(t): learning rate at step t
- lr_max: starting / maximum learning rate
- lr_min: minimum learning rate
- t: current training step
- T: total number of decay steps
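The formula above can be sketched in a few lines of Python (the function name `cosine_decay` and the default values are illustrative, not from any particular library):

```python
import math

def cosine_decay(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Cosine-decayed learning rate at a given step.

    Implements lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)).
    """
    # Clamp so steps past the end of the schedule stay at lr_min.
    t = min(step, total_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * t / total_steps))
    return lr_min + (lr_max - lr_min) * cosine

# Decay from 0.1 down to 0.0 over 1000 steps
print(cosine_decay(0, 1000))     # 0.1 (maximum at the start)
print(cosine_decay(500, 1000))   # ≈ 0.05 (halfway point)
print(cosine_decay(1000, 1000))  # ≈ 0.0 (minimum at the end)
```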
Intuition
At the start of training, bigger updates help the model learn quickly
Later on, smaller updates help it settle into a better minimum without bouncing around too much
Cosine decay does this in a smooth way
Why use it
- Starts aggressive, ends gentle
- Usually works better than keeping one fixed learning rate the whole time
- Smoother than manually dropping the learning rate at a few milestones
Shape
At t = 0: cos(0) = 1, so the learning rate equals lr_max
At t = T: cos(π) = -1, so the learning rate equals lr_min
So the learning rate slowly falls from max to min along a cosine-shaped curve
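A quick check of the two endpoints, plugging illustrative values (lr_max = 0.1, lr_min = 0.001, T = 100) into the formula:

```python
import math

# Endpoint check for lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
# The values lr_max = 0.1, lr_min = 0.001, T = 100 are illustrative.
lr_max, lr_min, T = 0.1, 0.001, 100

def lr(t):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

print(lr(0))  # ≈ 0.1   (cos(0) = 1 gives the maximum)
print(lr(T))  # ≈ 0.001 (cos(pi) = -1 gives the minimum)
```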
Common Use
Used in deep learning training runs, especially when training for many steps and you want the optimizer to make large moves early and fine adjustments later
Sometimes combined with warmup, where the learning rate first ramps up and then follows cosine decay
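The warmup-then-decay combination can be sketched like this (the function name `warmup_cosine` and the step counts are illustrative assumptions, not a specific library's API):

```python
import math

def warmup_cosine(step, warmup_steps, total_steps, lr_max=0.1, lr_min=0.0):
    """Linear warmup from 0 to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to the maximum learning rate.
        return lr_max * step / warmup_steps
    # Decay phase: cosine decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Ramp up over the first 100 steps, then decay until step 1000
print(warmup_cosine(50, 100, 1000))    # ≈ 0.05 (halfway through warmup)
print(warmup_cosine(100, 100, 1000))   # 0.1 (warmup finished, decay begins)
print(warmup_cosine(1000, 100, 1000))  # ≈ 0.0 (end of schedule)
```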