pytorch/torch/distributed/checkpoint
Ke Wen 762a05b3b3 [DCP] Remove all-gather of state dict keys (#145998)
The original `_all_gather_keys` call was a safety check, but it can become costly at scale, and it blocks the CPU.

Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API must have the same set of keys across all ranks; otherwise, the API may hang.

In addition, we move the check into a utility function, `utils.assert_same_keys`. Users uncertain whether their state dict keys match across ranks can optionally call this API to check, as in the sketch below.
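
A minimal usage sketch (assumptions: `assert_same_keys` takes the state dict as its only required argument and acts as a collective; the exact signature is in `torch/distributed/checkpoint/utils.py` and may differ):

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import utils

# Assumes torch.distributed is already initialized on every rank.
model = torch.nn.Linear(4, 4)  # placeholder module

# Every rank must build a state dict with the same set of keys.
state_dict = {"model": model.state_dict()}

# Optional safety check (assumed signature: takes the state dict; it is a
# collective, so all ranks must call it together). It errors out if the
# key sets differ across ranks.
utils.assert_same_keys(state_dict)

# Per the updated docs, load() may hang if key sets diverge across ranks.
dcp.load(state_dict, checkpoint_id="/tmp/checkpoint")
```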

Resolves #145965 (as a workaround).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
2025-02-04 03:16:13 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| examples | | |
| __init__.py | | |
| _checkpointer.py | | |
| _dedup_save_plans.py | | |
| _dedup_tensors.py | | |
| _extension.py | PEP585: Missed conversions (#145342) | 2025-01-29 05:24:36 +00:00 |
| _fsspec_filesystem.py | [OSS] Add kwargs to fsspec reader/writer (#145845) | 2025-01-30 21:00:58 +00:00 |
| _nested_dict.py | | |
| _sharded_tensor_utils.py | | |
| _storage_utils.py | | |
| _traverse.py | | |
| _version.py | | |
| api.py | | |
| default_planner.py | | |
| filesystem.py | | |
| format_utils.py | | |
| logger.py | | |
| logging_handlers.py | | |
| metadata.py | | |
| optimizer.py | | |
| planner.py | | |
| planner_helpers.py | | |
| resharding.py | | |
| staging.py | | |
| state_dict.py | | |
| state_dict_loader.py | [DCP] Remove all-gather of state dict keys (#145998) | 2025-02-04 03:16:13 +00:00 |
| state_dict_saver.py | [OSS] Add no dist as an argument to DCP top level apis (#145754) | 2025-01-29 20:33:37 +00:00 |
| stateful.py | | |
| storage.py | | |
| utils.py | [DCP] Remove all-gather of state dict keys (#145998) | 2025-02-04 03:16:13 +00:00 |