pytorch/torch/distributed/checkpoint
Ke Wen 762a05b3b3 [DCP] Remove all-gather of state dict keys (#145998)
The original `_all_gather_keys` call was a safety check, but it could become costly at scale, and it blocks the CPU.

Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have the same set of keys on all ranks; otherwise the API may hang.

In addition, we move the check to a utility function: `utils.assert_same_keys`. Users uncertain about state dict uniformity can optionally call this API to check.
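A minimal sketch of the check's logic, assuming (as this commit describes) that the real `assert_same_keys` lives in `torch.distributed.checkpoint.utils` and compares key sets gathered across ranks; the local helper below only mimics that comparison on pre-gathered key lists, without any collective communication:

```python
def keys_match(gathered_key_sets):
    """Return True if every rank reported an identical set of state dict keys.

    `gathered_key_sets` stands in for the result of an all-gather of
    each rank's state_dict keys (a list of key lists, one per rank).
    """
    first = set(gathered_key_sets[0])
    return all(set(keys) == first for keys in gathered_key_sets[1:])


# Ranks agree on {"weight", "bias"} -> uniform, safe to call load()
assert keys_match([["weight", "bias"], ["bias", "weight"]])

# One rank is missing "bias" -> load() could hang without the check
assert not keys_match([["weight", "bias"], ["weight"]])
```

In real code, a user who is unsure whether all ranks hold the same keys would call `torch.distributed.checkpoint.utils.assert_same_keys(state_dict)` (per this commit) before invoking the `load` API, accepting the one-time collective cost in exchange for a clear failure instead of a hang.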

Resolves #145965 (as a workaround).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
2025-02-04 03:16:13 +00:00
examples
__init__.py
_checkpointer.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_dedup_save_plans.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_dedup_tensors.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_extension.py PEP585: Missed conversions (#145342) 2025-01-29 05:24:36 +00:00
_fsspec_filesystem.py [OSS] Add kwargs to fsspec reader/writer (#145845) 2025-01-30 21:00:58 +00:00
_nested_dict.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_sharded_tensor_utils.py
_storage_utils.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_traverse.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
_version.py
api.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
default_planner.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
filesystem.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
format_utils.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
logger.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
logging_handlers.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
metadata.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
optimizer.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
planner.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
planner_helpers.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
resharding.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
staging.py Fix staging for CPU tensors in OSS DCP async_save (#145408) 2025-01-23 12:49:26 -08:00
state_dict.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
state_dict_loader.py [DCP] Remove all-gather of state dict keys (#145998) 2025-02-04 03:16:13 +00:00
state_dict_saver.py [OSS] Add no dist as an argument to DCP top level apis (#145754) 2025-01-29 20:33:37 +00:00
stateful.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
storage.py PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163) 2025-01-19 20:55:59 +00:00
utils.py [DCP] Remove all-gather of state dict keys (#145998) 2025-02-04 03:16:13 +00:00