Ayush Garg

Fully Sharded Data Parallel (FSDP2)

Jun 21, 2026, 1 min read

Docs Link: https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html

In DDP, each rank holds a full model replica and processes its own batch of data. After the backward pass, it uses all-reduce to synchronize gradients across ranks, so every replica applies the same update.
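To make the gradient sync concrete, here is a minimal pure-Python sketch of an averaging all-reduce. It is a simulation only and does not use `torch.distributed`. The helper name `all_reduce_mean` and the example gradient values are made up for illustration.

```python
def all_reduce_mean(per_rank_grads):
    """Simulate an averaging all-reduce over per-rank gradient vectors.

    per_rank_grads: list with one gradient list per rank (illustrative data,
    not real DDP buffers). Returns one averaged copy per rank.
    """
    world_size = len(per_rank_grads)
    num_elems = len(per_rank_grads[0])
    mean = [
        sum(grads[i] for grads in per_rank_grads) / world_size
        for i in range(num_elems)
    ]
    # After the collective, every rank holds an identical copy of the result.
    return [list(mean) for _ in range(world_size)]


# Two ranks, each with gradients computed from its own batch.
local_grads = [
    [1.0, 2.0],  # rank 0
    [3.0, 4.0],  # rank 1
]
synced = all_reduce_mean(local_grads)
print(synced)  # [[2.0, 3.0], [2.0, 3.0]]
```

Because every rank ends up with the same averaged gradient, the replicas stay identical after the optimizer step.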

Compared with DDP, FSDP reduces the GPU memory footprint by sharding the model parameters, gradients, and optimizer states across ranks, which makes it possible to train models that don’t fit on a single GPU.
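A rough back-of-the-envelope estimate shows where the savings come from. The sketch below assumes fp32 parameters and gradients plus an Adam-style optimizer with two fp32 states per parameter, so 16 bytes per parameter in total. It ignores activations, communication buffers, and the parameters that are temporarily unsharded during compute. The 7B parameter count and 8 GPUs are illustrative numbers, not taken from the tutorial.

```python
# Assumed byte costs per parameter (fp32 everywhere, Adam-style optimizer).
PARAM_BYTES = 4      # the weight itself
GRAD_BYTES = 4       # its gradient
OPTIM_BYTES = 8      # two optimizer states (e.g. first and second moments)
BYTES_PER_PARAM = PARAM_BYTES + GRAD_BYTES + OPTIM_BYTES


def ddp_bytes_per_gpu(num_params):
    """DDP: every rank stores the full parameters, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM


def fsdp_bytes_per_gpu(num_params, world_size):
    """FSDP: each rank stores roughly 1/world_size of all three (steady state)."""
    return num_params * BYTES_PER_PARAM / world_size


num_params = 7_000_000_000  # hypothetical 7B-parameter model
world_size = 8              # hypothetical number of GPUs

ddp_gb = ddp_bytes_per_gpu(num_params) / 1e9
fsdp_gb = fsdp_bytes_per_gpu(num_params, world_size) / 1e9
print(f"DDP:  {ddp_gb:.1f} GB per GPU")   # DDP:  112.0 GB per GPU
print(f"FSDP: {fsdp_gb:.1f} GB per GPU")  # FSDP: 14.0 GB per GPU
```

Under these assumptions the replicated state alone exceeds the memory of a typical single GPU, while the sharded state fits comfortably. Real usage will be higher than the FSDP figure because of activations and transient all-gathered parameters.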
