The learning rate is a hyperparameter that controls how large each parameter update is during training

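In gradient descent, the model updates its parameters by moving in the direction that reduces the loss:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

  • $\theta_t$ - Model parameters at step $t$
  • $\eta$ - Learning rate
  • $\nabla_\theta L(\theta_t)$ - Gradient of the loss with respect to the parameters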

The gradient decides the direction of the update, while the learning rate scales how far the parameters move in that direction
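
The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimizer: the function names and the toy loss L(theta) = sum(theta_i ** 2) are assumptions chosen only to make the example self-contained.

```python
def gradient_descent_step(params, grads, learning_rate):
    """Return updated parameters: theta <- theta - learning_rate * gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def quadratic_grad(params):
    """Gradient of the toy loss L(theta) = sum(theta_i ** 2), i.e. 2 * theta_i."""
    return [2.0 * p for p in params]


params = [1.0, -2.0]
grads = quadratic_grad(params)
# One step: every parameter moves against its gradient, scaled by the learning rate
new_params = gradient_descent_step(params, grads, learning_rate=0.1)
print(new_params)
```

Each parameter moves toward zero, the minimum of the toy loss, by an amount proportional to both the gradient and the learning rate.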

Intuition

If the learning rate is too small, training is stable but slow

The model makes tiny updates and may take a long time to reach a good solution

If the learning rate is too large, training can become unstable

Updates may overshoot good minima, oscillate around them, or make the loss explode

Good learning rates are large enough to make meaningful progress, but small enough that optimization does not bounce around uncontrollably
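
A toy experiment can make this intuition concrete. The sketch below assumes a one-dimensional loss L(theta) = theta ** 2 and three hand-picked learning rates; real models behave less cleanly, and the thresholds here are specific to this toy problem.

```python
def run_gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise L(theta) = theta ** 2 and return the loss after each step."""
    theta = start
    losses = []
    for _ in range(steps):
        grad = 2.0 * theta                      # dL/dtheta
        theta = theta - learning_rate * grad    # gradient descent update
        losses.append(theta ** 2)
    return losses


# Too small, reasonable, and too large for this particular toy loss
for lr in (0.001, 0.1, 1.1):
    losses = run_gradient_descent(lr)
    print(f"lr={lr}: final loss {losses[-1]:.4g}")
```

On this toy loss, the smallest rate barely reduces the loss in 20 steps, the middle rate converges quickly, and the largest rate flips the sign of theta on every step while its magnitude grows, so the loss increases instead of decreasing.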

Too Low

Signs the learning rate may be too low:

  • Loss decreases very slowly
  • Training looks stable but barely improves
  • Model underfits even after many steps
  • Training takes much longer than expected

Too High

Signs the learning rate may be too high:

  • Loss spikes or becomes NaN
  • Loss jumps up and down instead of trending downward
  • Model performance is very sensitive to small changes in hyperparameters or initialization
  • Training diverges early
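
Some of these signs can be checked automatically from a logged loss history. The helper below is a rough heuristic sketch: the name `looks_too_high` and the `spike_factor` threshold are hypothetical, and a flagged run may have other causes (bad data, numerical issues) besides the learning rate.

```python
import math


def looks_too_high(losses, spike_factor=2.0):
    """Flag a loss history that contains NaN/inf or a sudden upward spike.

    spike_factor is an illustrative threshold: a step counts as a spike
    when the loss grows by more than this factor over the previous step.
    """
    if any(not math.isfinite(x) for x in losses):
        return True
    return any(
        curr > spike_factor * prev
        for prev, curr in zip(losses, losses[1:])
        if prev > 0
    )


print(looks_too_high([1.0, 0.8, 0.6]))            # steadily decreasing
print(looks_too_high([1.0, 0.9, float("nan")]))   # loss became NaN
print(looks_too_high([1.0, 3.5, 0.7]))            # loss spiked
```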

Learning Rate Schedules

A fixed learning rate uses the same value for the whole training run

In practice, deep learning training often changes the learning rate over time using a schedule

Common patterns:

  • Warm-up - Start with a small learning rate and gradually increase it
  • Cosine decay - Gradually lower the learning rate along a smooth cosine curve
  • Step decay - Drop the learning rate at specific milestones
  • Linear decay - Decrease the learning rate steadily over training

The common idea is:

  • Larger learning rate early helps the model learn quickly
  • Smaller learning rate later helps the model settle into a better solution
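
The schedules listed above can each be written as a function from the training step to a learning rate. The sketch below is one possible formulation, not the definition used by any particular library: the function names, the linear shape of the warm-up, and defaults such as `gamma=0.1` are assumptions for illustration.

```python
import math


def warmup_cosine(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def step_decay(step, base_lr, milestones, gamma=0.1):
    """Multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed


def linear_decay(step, total_steps, base_lr, end_lr=0.0):
    """Interpolate linearly from base_lr at step 0 to end_lr at total_steps."""
    progress = min(step / total_steps, 1.0)
    return base_lr + (end_lr - base_lr) * progress


for step in (0, 9, 50, 100):
    print(
        step,
        warmup_cosine(step, total_steps=100, peak_lr=1.0, warmup_steps=10),
        step_decay(step, base_lr=0.1, milestones=[30, 60]),
        linear_decay(step, total_steps=100, base_lr=1.0),
    )
```

In a training loop, the chosen function would be evaluated at every step and its result used as the learning rate for that update.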

Relation to Batch Size

Learning rate interacts with batch size

Larger batches usually produce less noisy gradient estimates, so they can sometimes use larger learning rates

Smaller batches produce noisier updates, so they may need smaller learning rates or more careful scheduling

This is why training recipes often tune learning rate and batch size together
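
The effect of batch size on gradient noise can be illustrated with a small simulation. The sketch below assumes each per-example gradient equals a true gradient of 1.0 plus unit Gaussian noise, which is a simplification; real gradient noise is neither Gaussian nor independent of the parameters.

```python
import random
import statistics


def minibatch_gradient_std(batch_size, num_batches=1000, seed=0):
    """Estimate the spread of mini-batch gradient estimates.

    Each per-example gradient is modelled as true_grad + unit Gaussian noise
    (an illustrative assumption); a mini-batch gradient is their mean.
    """
    rng = random.Random(seed)
    true_grad = 1.0
    estimates = []
    for _ in range(num_batches):
        batch = [true_grad + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.stdev(estimates)


for batch_size in (1, 16, 256):
    print(batch_size, round(minibatch_gradient_std(batch_size), 3))
```

Under this assumption the spread shrinks roughly with the square root of the batch size, which is one way to see why larger batches can sometimes tolerate larger learning rates.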