pytorch/test/distributed
Shen Li ba1da47e8f Add OnCompletion Hook to ProcessGroup (#106988)
This allows infra/trainers to collect detailed stats about communication
efficiency without knowing anything about the model or the distributed
training paradigm in use. This is helpful because infra/trainer
packages usually prefer to be as model/algorithm-agnostic as possible,
so we cannot assume that infra/trainers have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988
Approved by: https://github.com/kumpera, https://github.com/H-Huang
ghstack dependencies: #107140, #107141, #107160
2023-08-15 04:32:23 +00:00
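The hook pattern the commit describes can be sketched in plain Python. This is a conceptual illustration only, not the real ProcessGroupNCCL API: the `Work`, `ProcessGroup`, `register_on_completion_hook`, and `run_collective` names below are all hypothetical stand-ins, chosen to show how an infra layer could record per-collective timings without knowing which collectives the model issues.

```python
# Conceptual sketch of an "on completion" hook, NOT the real c10d API.
# All class/method names here are illustrative assumptions.
import time
from typing import Callable, List, Tuple


class Work:
    """Stands in for a c10d work object tracking one collective."""

    def __init__(self, op_name: str) -> None:
        self.op_name = op_name
        self._start = time.monotonic()

    def finish(self) -> float:
        # Return the elapsed wall-clock time for this collective.
        return time.monotonic() - self._start


class ProcessGroup:
    """Toy process group that fires registered hooks on work completion."""

    def __init__(self) -> None:
        self._hooks: List[Callable[[str, float], None]] = []

    def register_on_completion_hook(self, hook: Callable[[str, float], None]) -> None:
        # The infra/trainer layer registers its stats collector once,
        # without needing to know which collectives will run.
        self._hooks.append(hook)

    def run_collective(self, op_name: str) -> None:
        work = Work(op_name)
        # ... actual communication would happen here ...
        elapsed = work.finish()
        # Fire every registered hook on completion of this work object.
        for hook in self._hooks:
            hook(op_name, elapsed)


# Usage: the trainer records (op, duration) pairs for every collective.
stats: List[Tuple[str, float]] = []
pg = ProcessGroup()
pg.register_on_completion_hook(lambda op, dt: stats.append((op, dt)))
pg.run_collective("allreduce")
pg.run_collective("broadcast")
print([op for op, _ in stats])
```

The design point mirrors the commit: because the hook fires on every work completion event inside the process group itself, the stats collector stays fully decoupled from the model code that issues the collectives.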
_composable [FSDP][9/N] Introduce CustomPolicy (#104986) 2023-08-03 12:46:36 +00:00
_shard
_spmd [device_mesh][BE] remove allgather from DM (#105614) 2023-07-27 01:33:05 +00:00
_tensor [TP][DTensor Perf] Some perf improvement to reduce DTensor CPU overhead (#106524) 2023-08-14 20:03:19 +00:00
_tools
algorithms
bin
checkpoint [DCP] Modify tensor saving logic in DCP (#106415) 2023-08-09 00:16:10 +00:00
elastic [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) 2023-07-19 14:27:11 +00:00
fsdp [PT-D][FSDP] Handle corner case of load with multi-backend PG (#107172) 2023-08-14 23:24:44 +00:00
launcher [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) 2023-07-19 14:27:11 +00:00
nn/jit
optim Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743) 2023-08-08 15:27:34 +00:00
pipeline/sync [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) 2023-07-19 14:27:11 +00:00
rpc
tensor/parallel Clean up unsed MHA code to avoid confusion (#105956) 2023-07-27 17:10:17 +00:00
argparse_util_test.py
test_c10d_common.py [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) 2023-07-19 14:27:11 +00:00
test_c10d_gloo.py [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) 2023-07-19 14:27:11 +00:00
test_c10d_logger.py
test_c10d_nccl.py Add OnCompletion Hook to ProcessGroup (#106988) 2023-08-15 04:32:23 +00:00
test_c10d_object_collectives.py [c10d] Remove test for init barrier (#103223) 2023-06-08 16:56:40 +00:00
test_c10d_pypg.py
test_c10d_spawn.py [BE] f-stringify torch/ and scripts (#105538) 2023-07-21 19:35:24 +00:00
test_c10d_spawn_gloo.py
test_c10d_spawn_nccl.py
test_c10d_spawn_ucc.py
test_c10d_ucc.py [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) 2023-07-19 14:27:11 +00:00
test_collective_utils.py Initial commit of collective_utils (#101037) 2023-06-27 02:15:16 +00:00
test_data_parallel.py Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743) 2023-08-08 15:27:34 +00:00
test_distributed_spawn.py Back out "Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)" (#103938) 2023-06-22 21:55:58 +00:00
test_dynamo_distributed.py Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743) 2023-08-08 15:27:34 +00:00
test_fake_pg.py
test_functional_api.py [device_mesh][BE] reduce_scatter fallback to funcol and remove from DM (#105642) 2023-07-27 01:33:05 +00:00
test_inductor_collectives.py [ROCm] enable additional inductor/dynamo UTs (#104624) 2023-07-11 20:44:02 +00:00
test_launcher.py
test_multi_threaded_pg.py [C10D] Improve MTPG autograd test. Fixes #105106 (#105356) 2023-07-20 13:51:21 +00:00
test_nccl.py
test_pg_wrapper.py
test_store.py [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) 2023-07-19 14:27:11 +00:00