Mirror of https://github.com/saymrwulf/pytorch.git, synced 2026-05-14 20:57:59 +00:00
Summary: Currently, `torch.nn.parallel.DistributedDataParallel(model, ...)` does not deduplicate parameters shared across `model`'s child Modules before passing the param list to Reducer. This can cause Reducer to register more than one hook on the shared param(s), with unpredictable results. We ran into this in mlperf BERT, which has at least one param shared across submodules (an embedding weight, iirc, not 100% sure). Running with `gradient_as_bucket_view=False` produced different numerics from running with `gradient_as_bucket_view=True`, which is presumably one potential consequence of multiple DDP hooks on a given param; pinning down the exact mechanism would take further digging. This PR changes DDP to deduplicate shared params (a small diff) and adds tests (right now just `test_ddp_weight_sharing`, with more to come). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared-param issue is real) and passes with the deduplication diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929
Reviewed By: zou3519
Differential Revision: D26625807
Pulled By: zhaojuanmao
fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
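The fix described above amounts to keeping only the first occurrence of each Parameter when building the list handed to Reducer, comparing by object identity (a tied weight is literally the same object in two submodules). A minimal torch-free sketch of that deduplication, with a hypothetical `Param` stand-in for `torch.nn.Parameter`:

```python
class Param:
    """Stand-in for a torch.nn.Parameter (hypothetical, for illustration)."""
    def __init__(self, name):
        self.name = name

def dedup(params):
    # Keep the first occurrence of each Parameter, comparing by object
    # identity: shared (tied) params are the same object, so id() catches
    # them even though they appear under different submodules.
    seen, out = set(), []
    for p in params:
        if id(p) not in seen:
            seen.add(id(p))
            out.append(p)
    return out

embed_weight = Param("embedding.weight")
# Weight tying: the embedding weight also shows up as the output head's
# weight, so a naive per-submodule walk collects it twice.
param_list = [embed_weight, Param("encoder.bias"), embed_weight]
deduped = dedup(param_list)
print(len(param_list), len(deduped))  # 3 2
```

Without this step, Reducer would register a gradient hook on `embed_weight` twice, which is the behavior the PR's `test_ddp_weight_sharing` exercises.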
Files and directories at this commit:

- algorithms/ddp_comm_hooks
- nn/jit
- optim
- pipeline/sync
- rpc
- test_c10d.py
- test_c10d_spawn.py
- test_data_parallel.py
- test_distributed_fork.py
- test_distributed_spawn.py
- test_jit_c10d.py
- test_nccl.py