PyTorch Distributed Overview

Docs Link: https://docs.pytorch.org/tutorials/beginner/dist_overview.html

The torch.distributed library comprises a collection of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.

DistributedDataParallel (DDP) is a PyTorch module that implements data parallelism: it replicates your model across multiple processes, which can span multiple machines, making it well suited to large-scale deep learning applications.

DDP uses the communication primitives in the torch.distributed package to synchronize gradients and buffers across processes. Each process holds its own copy of the model, but the copies work together to train the model as if it were on a single machine.
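To make the "as if it were on a single machine" point concrete, here is a small pure-Python simulation of gradient averaging. It does not use the torch API: the one-parameter model `y = w * x`, the helper names (`local_gradient`, `all_reduce_mean`, `data_parallel_step`), and the toy data are all hypothetical, chosen only for illustration. It assumes equal-sized shards, in which case the average of the per-replica gradients equals the full-batch gradient.

```python
def local_gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(weights, shards, lr):
    """Each replica computes a local gradient, then all apply the averaged one."""
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[:2], data[2:]]   # two replicas with equal-sized shards
replicas = [0.0, 0.0]           # identical starting weights on every replica
single = 0.0                    # reference: one process training on the full batch

for _ in range(20):
    replicas = data_parallel_step(replicas, shards, lr=0.01)
    single -= 0.01 * local_gradient(single, data)
```

After the loop, both replicas hold the same weight, and it matches the single-process reference up to floating-point rounding. That is the property the gradient synchronization is meant to preserve.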

The DDP constructor broadcasts the model state from the rank 0 process to all other processes, so you don't have to worry about DDP processes starting from different initial parameter values.
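A minimal sketch of where the constructor fits into a training step is shown below. Several things here are assumptions made so the example runs on one CPU: a single-process group (`world_size=1`), the `gloo` backend, a file-based rendezvous, and the hypothetical function name `train_one_step`. With only one process the broadcast has nothing to copy; a real job would start several processes, typically with `torchrun`.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_step() -> float:
    """Wrap a toy model in DDP, run one optimizer step, and return the loss."""
    # File-based rendezvous avoids port conflicts in this single-process sketch
    # (an assumption; multi-process jobs usually use torchrun's environment).
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=0,
        world_size=1,
    )
    try:
        model = nn.Linear(4, 2)
        # The constructor is where rank 0's parameters and buffers are
        # broadcast to the other ranks (trivial here with one process).
        ddp_model = DDP(model)

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
        inputs, targets = torch.randn(8, 4), torch.randn(8, 2)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(inputs), targets)
        loss.backward()   # gradients are synchronized across ranks here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()
```

Calling `train_one_step()` runs the whole cycle once and tears the process group down again, so it can be called repeatedly in a notebook or script.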

Ayush Garg
