Learning Rate is a hyperparameter that controls how large each parameter update is during training
In Gradient Descent, the model updates its parameters by moving in the direction that reduces the loss:
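$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$
- $\theta_t$ - Model parameters at step $t$
- $\eta$ - Learning rate
- $\nabla_\theta L(\theta_t)$ - Gradient of the loss with respect to the parameters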
The learning rate scales the size of the step, while the gradient sets the direction (the update moves against the gradient, toward lower loss)
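As a minimal sketch of a single update, the snippet below applies one gradient descent step to a toy sum-of-squares loss. The loss, the helper names (`gradient_descent_step`, `loss`, `loss_grad`) and the starting values are assumptions chosen for illustration only.

```python
def gradient_descent_step(theta, grad, lr):
    """Return updated parameters: theta - lr * grad, element by element."""
    return [p - lr * g for p, g in zip(theta, grad)]


def loss(theta):
    # Toy loss (an assumption for illustration): sum of squares, minimised at 0.
    return sum(p * p for p in theta)


def loss_grad(theta):
    # Analytic gradient of the toy loss: d/dp (p^2) = 2p.
    return [2.0 * p for p in theta]


theta = [1.0, -2.0]
new_theta = gradient_descent_step(theta, loss_grad(theta), lr=0.1)

# Each parameter moves against its gradient, so the loss goes down.
print("before:", theta, "loss:", loss(theta))
print("after: ", new_theta, "loss:", loss(new_theta))
```

Doubling `lr` here would double the distance each parameter moves, while the direction of the move stays the same.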
Intuition
If the learning rate is too small, training is stable but slow
The model makes tiny updates and may take a long time to reach a good solution
If the learning rate is too large, training can become unstable
The model may jump over good minima, oscillate around them, or cause the loss to explode
Good learning rates are large enough to make meaningful progress, but small enough that optimization does not bounce around uncontrollably
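The three regimes can be seen on a one-dimensional toy problem. The sketch below minimises f(x) = x² with three learning rates; the function, the step count and the specific rates are assumptions picked to make the behaviour obvious. For this particular loss each update multiplies x by (1 - 2·lr), so any rate above 1 makes |x| grow on every step.

```python
def run_gradient_descent(lr, steps=20, x0=1.0):
    """Minimise f(x) = x**2 from x0 and return the loss after each step."""
    x = x0
    losses = []
    for _ in range(steps):
        grad = 2.0 * x      # f'(x) = 2x
        x = x - lr * grad   # gradient descent update
        losses.append(x * x)
    return losses


# Too small, reasonable, and too large for this toy problem.
for lr in (0.001, 0.3, 1.1):
    final_loss = run_gradient_descent(lr)[-1]
    print(f"lr={lr}: loss after 20 steps = {final_loss:.3g}")
```

The smallest rate barely moves the loss, the middle one drives it close to zero, and the largest one overshoots the minimum on every step so the loss grows.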
Too Low
Signs the learning rate may be too low:
- Loss decreases very slowly
- Training looks stable but barely improves
- Model underfits even after many steps
- Training takes much longer than expected
Too High
Signs the learning rate may be too high:
- Loss spikes or becomes NaN
- Loss jumps up and down instead of trending downward
- Model performance is very sensitive to small changes
- Training diverges early
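Some of these signs can be checked automatically from a recorded loss history. The helper below is a rough, hypothetical heuristic (the name `diagnose_loss`, the `spike_factor` threshold and the labels are all assumptions, not a standard API), and it expects a non-empty list of losses.

```python
import math


def diagnose_loss(losses, spike_factor=2.0):
    """Return a rough label for a non-empty loss history (hypothetical heuristic)."""
    # NaN or infinite loss: training has diverged.
    if any(math.isnan(v) or math.isinf(v) for v in losses):
        return "diverged"
    # Loss rising well above its starting value counts as a spike.
    if max(losses) > spike_factor * losses[0]:
        return "spiking"
    # Count steps where the loss went up instead of down.
    increases = sum(1 for prev, cur in zip(losses, losses[1:]) if cur > prev)
    if len(losses) > 1 and increases > (len(losses) - 1) / 2:
        return "oscillating"
    return "ok"


print(diagnose_loss([1.0, 0.8, float("nan")]))
print(diagnose_loss([1.0, 1.1, 0.9, 1.05, 0.95, 1.0]))
print(diagnose_loss([1.0, 0.8, 0.6, 0.5]))
```

A label other than "ok" is only a hint that the learning rate may be too high; noisy data or a bug can produce the same pattern.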
Learning Rate Schedules
A fixed learning rate uses the same value for the whole training run
In practice, deep learning training often changes the learning rate over time using a schedule
Common patterns:
- Warm-up steps - Start with a small learning rate and gradually increase it
- Cosine Decay - Gradually lower the learning rate with a smooth cosine curve
- Step decay - Drop the learning rate at specific milestones
- Linear decay - Decrease the learning rate steadily over training
The common idea is:
- Larger learning rate early helps the model learn quickly
- Smaller learning rate later helps the model settle into a better solution
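The schedules listed above can each be written as a small function of the step number. The sketch below is one plausible formulation in plain Python; the function names, the default `gamma=0.1` and the choice to combine warm-up with cosine decay are assumptions, and deep learning frameworks ship their own scheduler classes with differing conventions.

```python
import math


def warmup_cosine(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def step_decay(step, base_lr, milestones, gamma=0.1):
    """Multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed


def linear_decay(step, total_steps, base_lr, end_lr=0.0):
    """Interpolate linearly from base_lr at step 0 to end_lr at total_steps."""
    frac = min(step, total_steps) / total_steps
    return base_lr + (end_lr - base_lr) * frac


# Print a few points from each schedule over a 100-step run.
for s in (0, 9, 50, 100):
    print(
        s,
        round(warmup_cosine(s, 100, 1.0, 10), 4),
        round(step_decay(s, 0.1, [30, 60]), 4),
        round(linear_decay(s, 100, 1.0), 4),
    )
```

All three end with a lower rate than they use early in training, which matches the idea of learning quickly first and settling later.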
Relation to Batch Size
Learning rate interacts with batch size
Larger batches usually produce less noisy gradient estimates, so they can sometimes use larger learning rates
Smaller batches produce noisier updates, so they may need smaller learning rates or more careful scheduling
This is why training recipes often tune learning rate and batch size together
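The claim that larger batches give less noisy gradients can be checked numerically. The sketch below fits y ≈ w·x on synthetic data and measures how much the mini-batch gradient varies across random batches; the dataset, the noise level, the fixed seed and the function names are assumptions made for the demonstration.

```python
import random
import statistics


def minibatch_gradient(w, xs, ys, batch_size, rng):
    """Gradient of mean squared error for y ~ w*x, estimated on a random batch."""
    idx = rng.sample(range(len(xs)), batch_size)
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size


def gradient_noise(batch_size, trials=500, seed=0):
    """Standard deviation of the mini-batch gradient across repeated draws."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(1024)]
    ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]  # synthetic data, true w = 3
    grads = [minibatch_gradient(0.0, xs, ys, batch_size, rng) for _ in range(trials)]
    return statistics.stdev(grads)


# The spread of the gradient estimate shrinks as the batch grows.
for bs in (4, 32, 256):
    print(f"batch_size={bs}: gradient std = {gradient_noise(bs):.3f}")
```

Less spread in the gradient means each step is a more reliable estimate of the true descent direction, which is one reason a larger learning rate can sometimes be tolerated with larger batches.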