Mirror of https://github.com/saymrwulf/pytorch.git, synced 2026-05-14 20:57:59 +00:00
Summary: Currently, `torch.nn.parallel.DistributedDataParallel(model, ...)` does not deduplicate parameters shared across `model`'s child Modules before passing the param list to Reducer. This can cause Reducer to register more than one hook on the shared param(s), with unpredictable results. We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight, iirc, not 100% sure). Running with `gradient_as_bucket_view=False` produced different numerics from running with `gradient_as_bucket_view=True`, which is presumably one potential consequence of multiple DDP hooks on a given param; pinning down the exact mechanism would take further digging. This PR changes DDP to deduplicate shared params (a small diff) and adds tests (right now just `test_ddp_weight_sharing`, with more to come). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared-param issue is real) and passes with the deduplication diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929
Reviewed By: zou3519
Differential Revision: D26625807
Pulled By: zhaojuanmao
fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
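The fix described above amounts to keeping only the first occurrence of each Parameter when building the list handed to Reducer, comparing by object identity (a tied weight is literally the same object in two submodules). A minimal torch-free sketch of that deduplication, with a hypothetical `Param` stand-in for `torch.nn.Parameter`:

```python
class Param:
    """Stand-in for a torch.nn.Parameter (hypothetical, for illustration)."""
    def __init__(self, name):
        self.name = name

def dedup(params):
    # Keep the first occurrence of each Parameter, comparing by object
    # identity: shared (tied) params are the same object, so id() catches
    # them even though they appear under different submodules.
    seen, out = set(), []
    for p in params:
        if id(p) not in seen:
            seen.add(id(p))
            out.append(p)
    return out

embed_weight = Param("embedding.weight")
# Weight tying: the embedding weight also shows up as the output head's
# weight, so a naive per-submodule walk collects it twice.
param_list = [embed_weight, Param("encoder.bias"), embed_weight]
deduped = dedup(param_list)
print(len(param_list), len(deduped))  # 3 2
```

Without this step, Reducer would register a gradient hook on `embed_weight` twice, which is the behavior the PR's `test_ddp_weight_sharing` exercises.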
Files and directories at this commit:

- algorithms/ddp_comm_hooks
- nn/jit
- optim
- pipeline/sync
- rpc
- test_c10d.py
- test_c10d_spawn.py
- test_data_parallel.py
- test_distributed_fork.py
- test_distributed_spawn.py
- test_jit_c10d.py
- test_nccl.py