Gradient accumulation lets you simulate a larger batch size than GPU memory would normally allow.

Normally during training the loop is:

  1. Run a forward pass on a batch
  2. Calculate the loss
  3. Run backpropagation to compute gradients
  4. Update the model weights
  5. Clear the gradients

With gradient accumulation, you do steps 1-3 multiple times before doing the optimizer update. Instead of clearing the gradients after every mini-batch, you let them add up across several mini-batches. After enough mini-batches have been processed, you take one optimizer step and then zero the gradients.

So if the GPU can only fit a batch size of 8, but you want the effect of batch size 32, you can use:

  batch_size = 8
  accumulation_steps = 4

where 8 is the per-step batch size and 4 is the number of accumulation steps.

The effective batch size is:

  effective batch size = per-step batch size × accumulation steps = 8 × 4 = 32
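
The same arithmetic can be turned around to pick the number of accumulation steps for a target batch size. The helper below is only a sketch: the name `accumulation_steps_for` is made up for illustration, and it assumes you round up when the target is not a multiple of the per-step batch size.

```python
def accumulation_steps_for(target_batch_size: int, per_step_batch_size: int) -> int:
    """Smallest number of accumulation steps that reaches the target batch size."""
    if target_batch_size <= 0 or per_step_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    # Integer ceiling division: round up when the target is not an exact multiple.
    return (target_batch_size + per_step_batch_size - 1) // per_step_batch_size


steps = accumulation_steps_for(32, 8)
effective_batch_size = 8 * steps
print(steps, effective_batch_size)  # 4 32
```

When the target is not an exact multiple (say 100 with a per-step batch of 8), rounding up gives 13 steps and an effective batch size of 104 rather than 100.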

This is useful because the part of memory usage that grows with batch size is mostly the activations saved during the forward pass for backpropagation. With gradient accumulation, each forward/backward pass still uses the smaller batch that fits in memory, but the optimizer update sees gradients from a larger amount of data.

The tradeoff is that each optimizer step takes longer, because it now requires multiple forward and backward passes. The result also does not match a true large batch in every detail, especially when layers depend on batch statistics (batch normalization, for example, still computes its statistics over each small mini-batch), when random operations such as dropout are sampled differently, or when the loss is not scaled correctly.
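
The batch-statistics mismatch can be seen with plain numbers. The sketch below uses made-up activation values and a simplified "subtract the batch mean" step standing in for a normalization layer; it is an illustration of the effect, not how any particular library implements normalization.

```python
from statistics import mean

# Eight made-up activation values, processed either as one batch of 8
# or as two accumulated mini-batches of 4.
activations = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
mini_batches = [activations[:4], activations[4:]]

full_batch_mean = mean(activations)
mini_batch_means = [mean(b) for b in mini_batches]

# Center each value using the mean of the batch it was processed in.
centered_full = [x - full_batch_mean for x in activations]
centered_accumulated = [x - mean(b) for b in mini_batches for x in b]

print(full_batch_mean)                            # 13.75
print(mini_batch_means)                           # [2.5, 25.0]
print(centered_full[0], centered_accumulated[0])  # -12.75 -1.5
```

The same input value ends up with a different normalized value depending on which batch it was grouped with, so the gradients that get accumulated are not the ones a true batch of 8 would have produced.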

Important detail: when the loss is averaged over each mini-batch (the usual default), it needs to be divided by the number of accumulation steps before calling backward. Otherwise the accumulated gradients are larger, by a factor of the number of accumulation steps, than those of the equivalent large batch.
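
The effect of the scaling can be checked by hand on a tiny problem. The sketch below assumes a one-parameter model `w * x`, a mean-squared-error loss, made-up data, and equally sized mini-batches; the gradient is written out analytically so no deep learning library is needed.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

per_step_batch_size = 2
accumulation_steps = 4  # 2 * 4 = 8 samples per optimizer update
chunks = [
    (xs[i:i + per_step_batch_size], ys[i:i + per_step_batch_size])
    for i in range(0, len(xs), per_step_batch_size)
]

# Gradient of the single large batch.
full_grad = mse_grad(w, xs, ys)
# Accumulated gradient with each mini-batch loss divided by accumulation_steps.
scaled = sum(mse_grad(w, cx, cy) / accumulation_steps for cx, cy in chunks)
# Accumulated gradient without the division.
unscaled = sum(mse_grad(w, cx, cy) for cx, cy in chunks)

print(full_grad, scaled, unscaled)      # -76.5 -76.5 -306.0
print(math.isclose(scaled, full_grad))  # True
```

Without the division, the accumulated gradient is 4 times the large-batch gradient, which acts like an unintended 4x increase in learning rate for plain SGD.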

Example training loop:

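# PyTorch-style sketch: model, optimizer, dataloader, compute_loss, and
# accumulation_steps are assumed to be defined elsewhere.
optimizer.zero_grad()

for step, batch in enumerate(dataloader):
    outputs = model(batch)
    loss = compute_loss(outputs, batch)
    # Scale so the accumulated gradient matches the average over the effective batch.
    loss = loss / accumulation_steps
    loss.backward()  # adds this mini-batch's gradients to the stored gradients

    # Update after every accumulation_steps mini-batches, and once more at the
    # end so a leftover partial group is not dropped (requires len(dataloader)).
    if (step + 1) % accumulation_steps == 0 or (step + 1) == len(dataloader):
        optimizer.step()
        optimizer.zero_grad()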
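
To see when the update fires, and what happens when the number of batches is not a multiple of `accumulation_steps`, the sketch below simulates only the step counter, with no model or optimizer involved. The function name `update_points` is made up for illustration, and the end-of-loop check is one common way to flush a leftover partial group rather than the only option.

```python
def update_points(num_batches: int, accumulation_steps: int) -> list[tuple[int, int]]:
    """Return (step_index, group_size) for every optimizer update in one pass."""
    points = []
    group_size = 0
    for step in range(num_batches):
        group_size += 1  # one more mini-batch has been accumulated
        is_full = (step + 1) % accumulation_steps == 0
        is_last = (step + 1) == num_batches
        if is_full or is_last:
            points.append((step, group_size))
            group_size = 0  # corresponds to optimizer.zero_grad()
    return points


print(update_points(10, 4))  # [(3, 4), (7, 4), (9, 2)]
```

With 10 batches and 4 accumulation steps, the last update covers only 2 mini-batches. If its losses are still divided by 4, that update is under-weighted; whether this matters depends on the setup.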

In practice, gradient accumulation is common when training large neural networks because it lets you use a reasonable effective batch size without needing every sample in memory at once. It is especially useful for large language models, diffusion models, and any setup where sequence length or image size makes batches expensive.