Cosine decay is a learning rate schedule where the learning rate starts high and gradually decreases following a cosine curve
Instead of dropping the learning rate suddenly, it decays smoothly over time
Common formula:

lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(π * t / T))

- lr(t): learning rate at step t
- lr_max: starting / maximum learning rate
- lr_min: minimum learning rate
- t: current training step
- T: total number of decay steps
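The formula above can be sketched in a few lines of Python (the function name `cosine_decay` and the default values are illustrative, not from any particular library):

```python
import math

def cosine_decay(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Cosine-decayed learning rate at a given step.

    Implements lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)).
    """
    # Clamp so steps past the end of the schedule stay at lr_min.
    t = min(step, total_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * t / total_steps))
    return lr_min + (lr_max - lr_min) * cosine

# Decay from 0.1 down to 0.0 over 1000 steps
print(cosine_decay(0, 1000))     # 0.1 (maximum at the start)
print(cosine_decay(500, 1000))   # ≈ 0.05 (halfway point)
print(cosine_decay(1000, 1000))  # ≈ 0.0 (minimum at the end)
```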
Intuition
At the start of training, bigger updates help the model learn quickly
Later on, smaller updates help it settle into a better minimum without bouncing around too much
Cosine decay does this in a smooth way
Why use it
- Starts aggressive, ends gentle
- Usually works better than keeping one fixed learning rate the whole time
- Smoother than manually dropping the learning rate at a few milestones
Shape
At t = 0: cos(0) = 1, so the learning rate equals lr_max
At t = T: cos(π) = -1, so the learning rate equals lr_min
So the learning rate slowly falls from max to min along a cosine-shaped curve
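A quick check of the two endpoints, plugging illustrative values (lr_max = 0.1, lr_min = 0.001, T = 100) into the formula:

```python
import math

# Endpoint check for lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
# The values lr_max = 0.1, lr_min = 0.001, T = 100 are illustrative.
lr_max, lr_min, T = 0.1, 0.001, 100

def lr(t):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

print(lr(0))  # ≈ 0.1   (cos(0) = 1 gives the maximum)
print(lr(T))  # ≈ 0.001 (cos(pi) = -1 gives the minimum)
```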
Common Use
Used in deep learning training runs, especially when training for many steps and you want the optimizer to make large moves early and fine adjustments later
Sometimes combined with warmup, where the learning rate first ramps up and then follows cosine decay
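The warmup-then-decay combination can be sketched like this (the function name `warmup_cosine` and the step counts are illustrative assumptions, not a specific library's API):

```python
import math

def warmup_cosine(step, warmup_steps, total_steps, lr_max=0.1, lr_min=0.0):
    """Linear warmup from 0 to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to the maximum learning rate.
        return lr_max * step / warmup_steps
    # Decay phase: cosine decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Ramp up over the first 100 steps, then decay until step 1000
print(warmup_cosine(50, 100, 1000))    # ≈ 0.05 (halfway through warmup)
print(warmup_cosine(100, 100, 1000))   # 0.1 (warmup finished, decay begins)
print(warmup_cosine(1000, 100, 1000))  # ≈ 0.0 (end of schedule)
```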