Docs Link: https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html

In DDP, each rank owns a full model replica and processes its own batch of data, then uses all-reduce to sync gradients across ranks.
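The DDP step can be sketched without any distributed runtime. The snippet below is a plain-Python simulation, not PyTorch's implementation: the function names (`local_gradient`, `all_reduce_mean`, `ddp_step`), the one-parameter model, and the data are all made up for illustration. It assumes the all-reduce averages gradients, so every replica applies the same update and the replicas stay identical.

```python
# Simulated DDP step in plain Python (no torch.distributed involved).
# All names and values here are illustrative assumptions.

def local_gradient(w, batch):
    """Gradient of the mean squared error for the 1-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for all-reduce: every rank receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

def ddp_step(replicas, batches, lr=0.1):
    # Each rank computes a gradient on its own batch with its own replica.
    grads = [local_gradient(w, b) for w, b in zip(replicas, batches)]
    # Gradients are synchronized so that every rank holds the same value.
    synced = all_reduce_mean(grads)
    # Each rank applies the identical update to its replica.
    return [w - lr * g for w, g in zip(replicas, synced)]

replicas = [1.0, 1.0]                      # two ranks, same initial weight
batches = [[(1.0, 2.0), (2.0, 4.0)],       # rank 0 data
           [(3.0, 6.0), (4.0, 8.0)]]       # rank 1 data
new_replicas = ddp_step(replicas, batches)
```

The local gradients differ between ranks because the batches differ, but after the simulated all-reduce both replicas end up with the same weight.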

Compared with DDP, FSDP reduces the GPU memory footprint by sharding the model parameters, gradients, and optimizer states across ranks, making it possible to train models that don’t fit on a single GPU.
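A rough back-of-the-envelope estimate shows where the saving comes from. The sketch below rests on assumptions that are not in the tutorial: fp32 parameters and gradients plus two fp32 Adam states, i.e. 16 bytes per parameter, with everything sharded evenly. It ignores activations, buffers, and the temporarily gathered parameters FSDP needs during computation, so real numbers will be higher. The helper name `state_bytes_per_gpu` is hypothetical.

```python
# Rough per-GPU memory for parameters, gradients, and optimizer states.
# Assumption: 4 B param + 4 B grad + 8 B Adam states = 16 B per parameter.

BYTES_PER_PARAM = 16

def state_bytes_per_gpu(num_params, world_size, sharded):
    """Return the estimated bytes of model state held by one GPU."""
    total = num_params * BYTES_PER_PARAM
    if not sharded:
        # DDP: every rank keeps a full copy of all state.
        return total
    # FSDP-style sharding: each rank keeps about 1/world_size of the state
    # (ceiling division to account for an uneven split).
    return -(-total // world_size)

num_params = 1_000_000_000  # hypothetical 1B-parameter model
ddp_bytes = state_bytes_per_gpu(num_params, world_size=8, sharded=False)
fsdp_bytes = state_bytes_per_gpu(num_params, world_size=8, sharded=True)
```

Under these assumptions, the replicated state is about 16 GB per GPU, while sharding across 8 GPUs brings it down to about 2 GB per GPU.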