Similar to Gradient Descent
Gradient Descent aims to minimize the loss function
Stochastic
- Unlike traditional gradient descent, which computes the gradient over the entire dataset, SGD uses a randomly selected subset of the data (often a single example or small mini-batch) to estimate the gradient at each step, making it computationally faster, especially for large datasets
The update rule in SGD is represented as:
$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t; x_i, y_i)$

- $\theta$ : Model parameters
- $\eta$ : Learning rate
- $\nabla L$ : Gradient of the loss function, computed on the randomly sampled example (or mini-batch) $(x_i, y_i)$
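As a minimal sketch of this update rule (assumptions not from the notes: a 1-D linear model with squared-error loss, and illustrative values for the learning rate and iteration count), one SGD step over a single randomly sampled example might look like this:

```python
import numpy as np

def sgd_step(theta, X, y, lr=0.01):
    """One SGD update: estimate the gradient from a single random example."""
    i = np.random.randint(len(X))       # randomly sample one data point
    x_i, y_i = X[i], y[i]
    pred = x_i @ theta                  # linear model prediction
    grad = 2 * (pred - y_i) * x_i       # gradient of the squared-error loss w.r.t. theta
    return theta - lr * grad            # theta <- theta - eta * gradient

# Toy usage: recover the slope of y ~ 2x from noisy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

theta = np.zeros(1)
for _ in range(2000):
    theta = sgd_step(theta, X, y, lr=0.05)
print(theta)  # approaches [2.]
```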
Momentum
- Helps the optimizer escape local minima and avoid oscillations in regions with steep gradients. Adds a fraction of the previous update to the current update, smoothing the optimization path:

$v_{t+1} = \gamma v_t + \eta \, \nabla L(\theta_t)$
$\theta_{t+1} = \theta_t - v_{t+1}$

- $\gamma$ : Momentum coefficient (fraction of the previous update carried forward)
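A minimal sketch of the momentum update; the function name, the momentum coefficient 0.9, and the toy quadratic objective are illustrative assumptions, not part of the notes:

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: the velocity keeps a decaying
    memory of past gradients, smoothing the optimization path."""
    velocity = beta * velocity + lr * grad   # fraction of the previous update + current gradient step
    theta = theta - velocity                 # move along the smoothed direction
    return theta, velocity

# Toy usage: noisy gradients of 0.5 * (theta - 1)^2, whose minimum is at theta = 1
rng = np.random.default_rng(0)
theta, velocity = 5.0, 0.0
for _ in range(200):
    grad = (theta - 1.0) + rng.normal(scale=0.2)   # noisy gradient estimate
    theta, velocity = momentum_step(theta, velocity, grad, lr=0.05)
print(theta)  # hovers near 1.0
```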
Learning Rate
- A hyperparameter that controls the size of the steps taken in the direction of the negative gradient. It affects how quickly or slowly the model's parameters update (as the name implies)
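As a rough illustration (the gradient value and learning rates below are made-up numbers), the learning rate simply scales the step taken along the negative gradient, so a value that is too small makes progress slow while one that is too large can overshoot:

```python
grad = 4.0                          # example gradient at the current parameter value
for lr in (0.001, 0.1, 1.0):
    step = -lr * grad               # parameter change for this learning rate
    print(f"lr={lr}: step={step}")
# lr=0.001 gives a tiny step (slow learning); lr=1.0 gives a large step (risk of overshooting)
```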