Cosine decay is a learning rate schedule where the learning rate starts high and gradually decreases along a cosine curve

Instead of dropping the learning rate suddenly, it decays smoothly over time

Common formula:

  lr(t) = lr_min + (1/2) · (lr_max − lr_min) · (1 + cos(π · t / T))

  • lr(t) - Learning rate at step t
  • lr_max - Starting / maximum learning rate
  • lr_min - Minimum learning rate
  • t - Current training step
  • T - Total number of decay steps
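As a minimal sketch of the formula above (the function and variable names here are my own, not from any particular library):

```python
import math

def cosine_decay_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-decayed learning rate at a given training step."""
    # Clamp progress so the schedule stays at lr_min once decay finishes
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# The rate falls smoothly from lr_max toward lr_min
for step in (0, 250, 500, 750, 1000):
    print(step, round(cosine_decay_lr(step, total_steps=1000, lr_max=0.1), 4))
```

Note that the schedule passes exactly through the halfway point (lr_max + lr_min) / 2 at step T/2, since cos(π/2) = 0.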

Intuition

At the start of training, bigger updates help the model learn quickly

Later on, smaller updates help it settle into a better minimum without bouncing around too much

Cosine decay does this in a smooth way

Why use it

  • Starts aggressive, ends gentle
  • Usually works better than keeping one fixed learning rate the whole time
  • Smoother than manually dropping the learning rate at a few milestones

Shape

At t = 0: cos(0) = 1, so lr = lr_max

At t = T: cos(π) = −1, so lr = lr_min

So the learning rate slowly falls from max to min along a cosine-shaped curve
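These two endpoint values can be checked numerically (a quick standalone sketch; the values chosen for lr_max, lr_min, and T are arbitrary examples):

```python
import math

lr_max, lr_min, T = 0.1, 0.0, 1000
lr = lambda t: lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

print(lr(0))  # equals lr_max at the start
print(lr(T))  # equals lr_min at the end
```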

Common Use

Used in deep learning training runs, especially long ones where you want the optimizer to make large moves early and fine adjustments later

Sometimes combined with warmup, where the learning rate first ramps up and then follows cosine decay
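A sketch of that combination (names and phase lengths are assumptions for illustration): a linear ramp from 0 to lr_max over the warmup phase, then cosine decay over the remaining steps.

```python
import math

def warmup_cosine_lr(step, warmup_steps, total_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to lr_max
        return lr_max * step / warmup_steps
    # Decay phase: cosine curve over the steps after warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

The two phases meet at step = warmup_steps, where both evaluate to lr_max, so the combined schedule is continuous.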