pytorch/test/distributed
Ke Wen 762a05b3b3 [DCP] Remove all-gather of state dict keys (#145998)
The original `_all_gather_keys` call was a safety check, but it can be costly at scale and it blocks the CPU.

Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have the same set of keys across ranks; otherwise the API may hang.

In addition, we move the check to a utility function: `utils.assert_same_keys`. Users who are unsure whether their state dict keys are consistent can optionally call this API to check.
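
As a rough illustration of what such a key-consistency check involves, here is a minimal sketch in plain Python that simulates the comparison without a process group. The helper `find_key_mismatches` and the sample key names are hypothetical and are not part of `torch.distributed.checkpoint`; the actual `utils.assert_same_keys` signature is not shown in this commit message, so it is not reproduced here.

```python
# Illustrative only: simulates a cross-rank key comparison in plain Python.
# `find_key_mismatches` is a hypothetical helper, not a PyTorch API.


def find_key_mismatches(keys_per_rank):
    """Return {rank: (missing, extra)} for ranks whose keys differ from rank 0."""
    reference = set(keys_per_rank[0])
    mismatches = {}
    for rank, keys in enumerate(keys_per_rank):
        current = set(keys)
        if current != reference:
            missing = sorted(reference - current)  # keys rank 0 has, this rank lacks
            extra = sorted(current - reference)    # keys only this rank has
            mismatches[rank] = (missing, extra)
    return mismatches


# In a real job, each rank would contribute list(state_dict.keys()) through a
# collective such as torch.distributed.all_gather_object; here the gathered
# result is hard-coded (assumed sample data).
gathered = [
    ["model.weight", "model.bias"],
    ["model.weight", "model.bias"],
    ["model.weight", "optim.step"],
]
mismatches = find_key_mismatches(gathered)
```

Because gathering keys is itself a blocking collective, a check like this is best run once while debugging rather than on every `load` call, which matches the motivation for making it optional.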

Resolves #145965 (as a workaround).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
2025-02-04 03:16:13 +00:00
_composable pickler for GraphModule (#141659) 2025-01-31 05:34:28 +00:00
_shard PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
_tools PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
algorithms Enable ruff F841 on distributed tests (#146131) 2025-02-01 03:06:16 +00:00
bin
checkpoint [DCP] Remove all-gather of state dict keys (#145998) 2025-02-04 03:16:13 +00:00
elastic Enable ruff F841 on distributed tests (#146131) 2025-02-01 03:06:16 +00:00
flight_recorder
fsdp Enable ruff F841 on distributed tests (#146131) 2025-02-01 03:06:16 +00:00
launcher PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
nn/jit PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
optim PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
pipelining PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
rpc
tensor [2/N] Enable ruff F841 on distributed tests (#146132) 2025-02-02 03:44:48 +00:00
argparse_util_test.py
test_backends.py
test_c10d_common.py [BE][Ez]: FURB148 - remove useless enumerate calls (#145619) 2025-01-24 23:37:15 +00:00
test_c10d_functional_native.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_gloo.py
test_c10d_logger.py
test_c10d_nccl.py [c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789) (#144794) 2025-01-31 22:39:56 +00:00
test_c10d_object_collectives.py
test_c10d_ops_nccl.py
test_c10d_pypg.py
test_c10d_spawn.py
test_c10d_spawn_gloo.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn_nccl.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn_ucc.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_ucc.py XFAIL test_save_load_checkpoint (#144927) 2025-01-16 07:31:56 +00:00
test_collective_utils.py
test_composability.py composability test cleanup (#145011) 2025-01-18 04:37:12 +00:00
test_compute_comm_reordering.py
test_control_collectives.py
test_data_parallel.py
test_device_mesh.py
test_distributed_spawn.py
test_dynamo_distributed.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_fake_pg.py [BE][Ez]: FURB148 - remove useless enumerate calls (#145619) 2025-01-24 23:37:15 +00:00
test_functional_api.py
test_inductor_collectives.py
test_launcher.py
test_multi_threaded_pg.py
test_nccl.py
test_pg_wrapper.py
test_store.py
test_symmetric_memory.py [SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not called (#144886) 2025-01-28 01:43:37 +00:00