Learning Rate

The learning rate is a hyperparameter that controls how large each parameter update is during training

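In Gradient Descent, the model updates its parameters by moving in the direction that reduces the loss:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

- $\theta_t$: Model parameters at step $t$
- $\eta$: Learning rate
- $\nabla_\theta L(\theta_t)$: Gradient of the loss with respect to the parameters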

The gradient sets the direction of the update (the step moves against it), while the learning rate scales how far the step goes
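
As a minimal sketch of the update described above, the snippet below applies a few gradient descent steps to a toy loss. The loss `L(theta) = sum(p**2)` and the names `gradient_descent_step` and `loss_gradient` are assumptions chosen for illustration, not part of any library.

```python
def gradient_descent_step(theta, grad, learning_rate):
    """One update: theta_next = theta - learning_rate * grad."""
    return [p - learning_rate * g for p, g in zip(theta, grad)]


def loss_gradient(theta):
    # Assumed toy loss L(theta) = sum(p**2); its gradient is 2 * p per parameter.
    return [2.0 * p for p in theta]


theta = [1.0, -2.0]
for step in range(3):
    theta = gradient_descent_step(theta, loss_gradient(theta), learning_rate=0.1)
    print(step, theta)
```

With this toy loss and a learning rate of 0.1, every step shrinks each parameter by a factor of 0.8, so the parameters move steadily toward the minimum at zero.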

Intuition

If the learning rate is too small, training is stable but slow

The model makes tiny updates and may take a long time to reach a good solution

If the learning rate is too large, training can become unstable

The updates may overshoot good minima, oscillate around them, or make the loss explode

Good learning rates are large enough to make meaningful progress, but small enough that optimization does not bounce around uncontrollably
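
The three regimes can be seen on a one-dimensional toy problem. This is only a sketch: the loss `L(x) = x**2`, the helper name `final_loss`, and the three learning rates are assumptions picked to make the effect visible.

```python
def final_loss(learning_rate, steps=20, x0=1.0):
    """Run gradient descent on the toy loss L(x) = x**2 and return the final loss."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x  # gradient of x**2 is 2x
    return x * x


# Too small, reasonable, and too large for this particular loss.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final loss = {final_loss(lr):.4g}")
```

On this loss each step multiplies `x` by `(1 - 2 * lr)`, so the smallest rate barely moves, the middle one converges quickly, and any rate above 1.0 makes the loss grow. That threshold is specific to this toy loss and does not carry over to real models.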

Too Low

Signs the learning rate may be too low:

- Loss decreases very slowly
- Training looks stable but barely improves
- Model underfits even after many steps
- Training takes much longer than expected

Too High

Signs the learning rate may be too high:

- Loss spikes or becomes NaN
- Loss jumps up and down instead of trending downward
- Model performance is very sensitive to small changes in settings, such as the random seed or the learning rate itself
- Training diverges early
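
Some of the signs in the two lists above can be checked roughly from a recorded loss history. The helper below is a hypothetical heuristic, not a standard tool: the name `diagnose_loss`, its labels, and both thresholds are assumptions that would need tuning for a real run.

```python
import math


def diagnose_loss(losses, slow_threshold=0.01, bounce_threshold=0.4):
    """Rough label for a loss history; thresholds are illustrative assumptions."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    # NaN/inf, or ending higher than the start, is treated as divergence.
    if any(not math.isfinite(v) for v in losses) or losses[-1] > losses[0]:
        return "diverged"
    # Fraction of steps where the loss went up instead of down.
    increases = sum(1 for a, b in zip(losses, losses[1:]) if b > a)
    if increases / (len(losses) - 1) > bounce_threshold:
        return "oscillating"
    # Relative improvement from the first to the last recorded value.
    improvement = (losses[0] - losses[-1]) / max(abs(losses[0]), 1e-12)
    if improvement < slow_threshold:
        return "slow"
    return "ok"


print(diagnose_loss([1.0, 0.999, 0.998, 0.997]))       # barely improving
print(diagnose_loss([1.0, 1.2, 0.8, 1.1, 0.7, 0.9]))   # bouncing around
print(diagnose_loss([1.0, 0.9, float("nan")]))         # loss became NaN
```

A "slow" label hints that the learning rate may be too low, while "oscillating" or "diverged" hints that it may be too high. Other causes, such as data issues or a bug in the model, can produce the same patterns.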

Learning Rate Schedules

A fixed learning rate uses the same value for the whole training run

In practice, deep learning training often changes the learning rate over time using a schedule

Common patterns:

- Warm-up steps: Start with a small learning rate and gradually increase it
- Cosine decay: Gradually lower the learning rate with a smooth cosine curve
- Step decay: Drop the learning rate at specific milestones
- Linear decay: Decrease the learning rate steadily over training

The common idea is:

- A larger learning rate early helps the model learn quickly
- A smaller learning rate later helps the model settle into a better solution
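
The schedule patterns listed above can each be written as a small function of the training step. The versions below are one plausible formulation, not the definition used by any particular framework. The function names, the linear shape of the warm-up, and the default values of `gamma`, `min_lr` and `end_lr` are all assumptions.

```python
import math


def warmup_cosine(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def step_decay(step, base_lr, milestones, gamma=0.1):
    """Multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed


def linear_decay(step, total_steps, base_lr, end_lr=0.0):
    """Move from base_lr to end_lr in a straight line over total_steps."""
    progress = min(step, total_steps) / total_steps
    return base_lr + (end_lr - base_lr) * progress


for step in (0, 9, 55, 100):
    print(step, warmup_cosine(step, total_steps=100, peak_lr=1e-3, warmup_steps=10))
```

In a training loop, the chosen function would be called once per step and its return value used as the learning rate for that update.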

Relation to Batch Size

Learning rate interacts with batch size

Larger batches usually produce less noisy gradient estimates, so they can sometimes use larger learning rates

Smaller batches produce noisier updates, so they may need smaller learning rates or more careful scheduling

This is why training recipes often tune learning rate and batch size together
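
The link between batch size and gradient noise can be illustrated with a small simulation. This is a sketch under strong assumptions: per-example gradients are modelled as a true gradient of 1.0 plus independent unit Gaussian noise, and `minibatch_gradient_spread` is a hypothetical helper name.

```python
import random
import statistics


def minibatch_gradient_spread(batch_size, trials=2000, seed=0):
    """Std. dev. of mini-batch gradient estimates on an assumed toy problem."""
    rng = random.Random(seed)
    # Each estimate averages batch_size noisy per-example gradients.
    estimates = [
        statistics.fmean(rng.gauss(1.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)


for batch_size in (8, 32, 128):
    print(batch_size, round(minibatch_gradient_spread(batch_size), 3))
```

For independent samples the spread shrinks roughly as `1 / sqrt(batch_size)`, so a larger batch gives a steadier gradient estimate. How much larger the learning rate can then be is an empirical question that depends on the model and optimizer.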

Ayush Garg
