pytorch/torch/distributed
Ke Wen 762a05b3b3 [DCP] Remove all-gather of state dict keys (#145998)
The original `_all_gather_keys` call served as a safety check, but it can become costly at scale and it blocks the CPU.

Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have the same set of keys on every rank; otherwise the API may hang.

In addition, we move the check to a utility function, `utils.assert_same_keys`. Users who are uncertain whether their state dict keys are uniform across ranks can optionally call this API to check.

Resolves #145965 (as a workaround).
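The removed safety check amounted to all-gathering each rank's `state_dict` keys and verifying every rank saw the same set. A minimal local sketch of that comparison step, under the assumption that the per-rank key sets have already been gathered (the helper name here is hypothetical; the actual utility is `torch.distributed.checkpoint.utils.assert_same_keys`):

```python
def assert_same_keys_local(gathered_key_sets):
    """Hypothetical sketch: given the per-rank key sets produced by an
    all-gather of state_dict keys, raise if any rank disagrees with rank 0.
    """
    reference = gathered_key_sets[0]
    for rank, keys in enumerate(gathered_key_sets[1:], start=1):
        if keys != reference:
            # Report exactly which keys diverge, to aid debugging a hang.
            missing = reference - keys
            extra = keys - reference
            raise AssertionError(
                f"Rank {rank} state_dict keys differ from rank 0: "
                f"missing={sorted(missing)}, extra={sorted(extra)}"
            )
```

In the real API the gather itself (e.g. via `torch.distributed.all_gather_object`) is the expensive, CPU-blocking part, which is why it is now opt-in rather than performed on every `load` call.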

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
2025-02-04 03:16:13 +00:00
_composable PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_shard PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_sharded_tensor
_sharding_spec
_symmetric_memory [async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846) 2025-01-30 18:26:34 +00:00
_tensor
_tools Turn on mypy for _dynamo/variables/builtin.py (#145552) 2025-01-30 22:21:32 +00:00
algorithms PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
autograd
benchmarks
checkpoint [DCP] Remove all-gather of state dict keys (#145998) 2025-02-04 03:16:13 +00:00
elastic Make torchelastic etcd rendezvous publicly importable (#145396) 2025-01-23 23:56:45 +00:00
examples
fsdp [MTIA][FSDP2] Enable MTIA device in FSDP2 library code (#145842) 2025-01-31 21:21:00 +00:00
launcher PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
nn PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
optim PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
pipelining [NFC] Fix some minor typos. (#145599) 2025-01-24 18:58:59 +00:00
rpc PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
tensor DeepSpeed github repo move sync (#146320) 2025-02-03 23:20:49 +00:00
__init__.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_checkpointable.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_composable_state.py
_functional_collectives.py [2/N] Enable ruff F841 on distributed tests (#146132) 2025-02-02 03:44:48 +00:00
_functional_collectives_impl.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
_state_dict_utils.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
argparse_util.py
c10d_logger.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
collective_utils.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
constants.py
CONTRIBUTING.md
device_mesh.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
distributed_c10d.py [NFC] Fix some minor typos. (#145599) 2025-01-24 18:58:59 +00:00
launch.py
logging_handlers.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
remote_device.py
rendezvous.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00
run.py Improve torchrun documentation (#144354) 2025-01-24 20:40:05 +00:00
utils.py PEP585 update - torch/distributed (#145164) 2025-01-21 04:23:29 +00:00