Stochastic Gradient Descent (SGD) is similar to Gradient Descent.

Gradient Descent aims to minimize the loss function by iteratively moving the model parameters in the direction of the negative gradient.

Stochastic - Unlike traditional gradient descent, which computes the gradient over the entire dataset, SGD estimates the gradient from a randomly selected subset of the data at each step, making it computationally faster, especially for larger datasets.
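A minimal sketch of this difference, assuming a simple linear model with a (1/2) mean-squared-error loss; the function names, data shapes, and batch size are illustrative, not from the original notes:

```python
import numpy as np

def full_batch_gradient(X, y, theta):
    # Traditional gradient descent: gradient of the (1/2) MSE loss over the ENTIRE dataset
    return X.T @ (X @ theta - y) / len(y)

def minibatch_gradient(X, y, theta, batch_size=32, rng=np.random.default_rng()):
    # SGD: estimate the same gradient from a randomly selected subset of the data
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / batch_size
```

The mini-batch estimate is noisier than the full-batch gradient, but each step costs only `batch_size` rows instead of the whole dataset, which is where the speedup comes from.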

The update rule in SGD is represented as:

θ ← θ − η · ∇J(θ)

  • θ - Model Parameters

  • η - Learning Rate

  • ∇J(θ) - Gradient of the loss function

  • Momentum - Helps the optimizer escape local minima and avoid oscillations in regions with steep gradients. Adds a fraction of the previous update to the current update, smoothing the path (both the plain update and the momentum variant are sketched in code after this list)

  • Learning Rate - Hyperparameter which controls the size of the steps taken in the direction of the negative gradient. Affects how quickly or slowly the model's parameters update (as the name implies)
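A minimal sketch of the update rule and the momentum variant, assuming the common heavy-ball formulation v ← momentum·v + ∇J(θ), θ ← θ − η·v; the function names, default hyperparameter values, and the `grad_fn` helper in the usage comment are illustrative assumptions:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # Plain SGD update: theta <- theta - lr * grad
    return theta - lr * grad

def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    # Momentum: keep a fraction of the previous update (velocity) and add the new gradient,
    # which smooths the path and helps escape local minima / damp oscillations
    velocity = momentum * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# Usage sketch for one training iteration (theta, X_batch, y_batch, grad_fn are assumed):
# velocity = np.zeros_like(theta)
# grad = grad_fn(theta, X_batch, y_batch)
# theta, velocity = sgd_momentum_step(theta, grad, velocity, lr=0.1, momentum=0.9)
```

Here `lr` plays the role of η (the learning rate) and `grad` plays the role of ∇J(θ) from the update rule above.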