Stochastic Gradient Descent (SGD) is similar to Gradient Descent.

Gradient Descent aims to minimize the loss function by iteratively moving the model parameters in the direction of the negative gradient.

Stochastic - Unlike traditional gradient descent, which computes the gradient over the entire dataset, SGD estimates the gradient from a randomly selected subset of the data at each step, making it computationally faster, especially for larger datasets.
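A minimal sketch of this difference, assuming a simple linear model with a (1/2) mean-squared-error loss; the function names, data shapes, and batch size are illustrative, not from the original notes:

```python
import numpy as np

def full_batch_gradient(X, y, theta):
    # Traditional gradient descent: gradient of the (1/2) MSE loss over the ENTIRE dataset
    return X.T @ (X @ theta - y) / len(y)

def minibatch_gradient(X, y, theta, batch_size=32, rng=np.random.default_rng()):
    # SGD: estimate the same gradient from a randomly selected subset of the data
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / batch_size
```

The mini-batch estimate is noisier than the full-batch gradient, but each step costs only `batch_size` rows instead of the whole dataset, which is where the speedup comes from.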

The update rule in SGD is represented as:

θ ← θ − η · ∇J(θ)

  • θ - Model Parameters

  • η - Learning Rate

  • ∇J(θ) - Gradient of the loss function

  • Momentum - Helps the optimizer escape local minima and avoid oscillations in regions with steep gradients. Adds a fraction of the previous update to the current update, smoothing the path (both the plain update and the momentum variant are sketched in code after this list)

  • Learning Rate - Hyperparameter which controls the size of the steps taken in the direction of the negative gradient. Affects how quickly or slowly the model's parameters update (as the name implies)
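A minimal sketch of the update rule and the momentum variant, assuming the common heavy-ball formulation v ← momentum·v + ∇J(θ), θ ← θ − η·v; the function names, default hyperparameter values, and the `grad_fn` helper in the usage comment are illustrative assumptions:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # Plain SGD update: theta <- theta - lr * grad
    return theta - lr * grad

def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    # Momentum: keep a fraction of the previous update (velocity) and add the new gradient,
    # which smooths the path and helps escape local minima / damp oscillations
    velocity = momentum * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# Usage sketch for one training iteration (theta, X_batch, y_batch, grad_fn are assumed):
# velocity = np.zeros_like(theta)
# grad = grad_fn(theta, X_batch, y_batch)
# theta, velocity = sgd_momentum_step(theta, grad, velocity, lr=0.1, momentum=0.9)
```

Here `lr` plays the role of η (the learning rate) and `grad` plays the role of ∇J(θ) from the update rule above.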