Docs Link: https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html
In DDP, each rank owns a model replica and processes its own batch of data, then uses all-reduce to sync gradients across ranks.
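A toy sketch of the gradient sync step, simulated in a single process with plain Python lists. There is no real communication here: `all_reduce_mean` and the gradient values are hypothetical, chosen only to show that every rank ends up with the same averaged gradient.

```python
# Toy single-process simulation of DDP gradient sync.
# Assumption: averaging is used as the reduction; values are made up.

def all_reduce_mean(per_rank_grads):
    """Average gradients element-wise across ranks.

    Returns one copy of the averaged gradient per rank, mimicking the fact
    that after all-reduce every rank holds the same result.
    """
    world_size = len(per_rank_grads)
    averaged = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    return [list(averaged) for _ in range(world_size)]


# Hypothetical local gradients from 4 ranks, each computed on a different batch.
local_grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]

synced = all_reduce_mean(local_grads)
print(synced[0])  # same list on every rank
```

Because every rank applies the same averaged gradient to the same starting weights, the replicas stay identical after each optimizer step.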
Compared with DDP, FSDP reduces the GPU memory footprint by sharding the model parameters, gradients, and optimizer states across ranks, making it possible to train models that do not fit on a single GPU.
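A rough back-of-the-envelope sketch of the saving. The numbers are assumptions, not from the tutorial: fp32 parameters (4 bytes), fp32 gradients (4 bytes), and an Adam-style optimizer with two fp32 states per parameter (8 bytes), so 16 bytes per parameter. Activations and temporary buffers are ignored, and `model_state_bytes_per_gpu` is a hypothetical helper.

```python
# Rough per-GPU memory estimate for model state only (params + grads + optimizer).
# Assumption: 16 bytes per parameter (fp32 params, fp32 grads, two fp32 Adam states).

def model_state_bytes_per_gpu(num_params, world_size, sharded, bytes_per_param=16):
    """Return bytes of model state held by one GPU.

    sharded=False models DDP (full replica on every rank).
    sharded=True models full sharding (each rank holds 1/world_size of the state).
    """
    total = num_params * bytes_per_param
    return total / world_size if sharded else total


num_params = 1_000_000_000  # hypothetical 1B-parameter model
world_size = 8              # hypothetical number of GPUs

ddp_bytes = model_state_bytes_per_gpu(num_params, world_size, sharded=False)
fsdp_bytes = model_state_bytes_per_gpu(num_params, world_size, sharded=True)

print(f"DDP:  {ddp_bytes / 1e9:.1f} GB per GPU")
print(f"FSDP: {fsdp_bytes / 1e9:.1f} GB per GPU")
```

Under these assumptions the sharded state is 1/8 of the replicated state. Actual peak usage is higher, since FSDP temporarily gathers full parameters for the layers it is currently computing.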