pytorch/torch/distributed
Teng Li 5b7951057d Distributed Data Parallel Module Implementation (#8584)
Summary:
This is an initial implementation of the Distributed Data Parallel module for the c10d GLOO and NCCL backends.

Performance testing has been done to verify that both single-GPU-per-process and multi-GPU-per-process configurations are able to overlap communication with backward computation.

The idea is that DDP buckets the parameters and all-reduces the buckets in reverse order. Since all c10d ops are asynchronous, no dedicated communication thread is needed; the all-reduce kernels are simply queued as soon as a bucket's gradients are ready, following a deterministic reduction order.
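
For illustration, below is a minimal sketch of the bucketing and reverse-order async all-reduce pattern described above, written against the public torch.distributed API. It is not the actual c10d DDP implementation: the real module queues each bucket's all-reduce from autograd hooks as soon as that bucket is ready during the backward pass, while this sketch runs after gradients exist. The bucket size, helper names, and gradient copy-back scheme are assumptions for the example.

```python
# Illustrative sketch only; not the c10d DDP implementation.
import torch
import torch.distributed as dist


def build_buckets(parameters, bucket_size_bytes=25 * 1024 * 1024):
    """Group parameters into buckets, preserving the model's parameter order.

    bucket_size_bytes is a hypothetical knob for this sketch.
    """
    buckets, current, current_bytes = [], [], 0
    for p in parameters:
        current.append(p)
        current_bytes += p.numel() * p.element_size()
        if current_bytes >= bucket_size_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
    if current:
        buckets.append(current)
    # Gradients become ready roughly in reverse parameter order during the
    # backward pass, so the buckets are reduced in reverse order as well.
    return list(reversed(buckets))


def allreduce_buckets(buckets, world_size):
    """Queue one async all-reduce per bucket, then wait and average."""
    handles = []
    for bucket in buckets:
        flat = torch.cat([p.grad.view(-1) for p in bucket])
        # async_op=True returns a work handle immediately, so the kernels
        # for later buckets can be queued without blocking.
        handle = dist.all_reduce(flat, op=dist.ReduceOp.SUM, async_op=True)
        handles.append((handle, flat, bucket))
    for handle, flat, bucket in handles:
        handle.wait()
        flat.div_(world_size)
        # Scatter the averaged gradients back into the parameters.
        offset = 0
        for p in bucket:
            n = p.grad.numel()
            p.grad.copy_(flat[offset:offset + n].view_as(p.grad))
            offset += n
```
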

Tested with 8 nodes / 64 GPUs on ResNet-50; the required accuracy was reached within 90 epochs.
Closes https://github.com/pytorch/pytorch/pull/8584

Reviewed By: goldsborough

Differential Revision: D8678696

Pulled By: teng-li

fbshipit-source-id: 440341b804befc6762e92acece2759ba47157cea
2018-06-28 17:25:40 -07:00
c10d Distributed Data Parallel Module Implementation (#8584) 2018-06-28 17:25:40 -07:00
__init__.py
launch.py Use customized python interpreter (#7520) 2018-05-12 13:06:39 -04:00
remote_types.py