pytorch/torch/distributed
Tristan Rice 23af9dde4d distributed/serialization: add experimental streaming torch.save/load methods (#146555)
Summary:

This is intended for use with torchft when we need to do a streaming state dict transfer. It is strictly superior to the prior streaming method in torchft because it supports all tensor subclasses, such as DTensor.

This supports 100% of the inputs to torch.save/load, but it is neither wire compatible nor intended to provide any backwards compatibility.
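
To illustrate what a streaming transfer involves, here is a minimal stdlib-only sketch, not the code added in this commit. It assumes a simple framing (a length-prefixed JSON header followed by raw payload bytes) and uses plain bytes in place of tensors; the names sketch_streaming_save, sketch_streaming_load and _read_exact are hypothetical. The point is that the reader never seeks, so the same code works over a socket or pipe.

```python
import io
import json
import struct


def sketch_streaming_save(state, stream):
    """Write a dict of name -> bytes to a stream without seeking (hypothetical helper)."""
    # Header lists each entry's name and payload size so the reader knows
    # how many bytes to pull for each entry.
    header = json.dumps([[name, len(buf)] for name, buf in state.items()]).encode()
    stream.write(struct.pack("<Q", len(header)))  # 8-byte little-endian length prefix
    stream.write(header)
    for buf in state.values():
        stream.write(buf)  # payloads follow in header order


def _read_exact(stream, n):
    """Read exactly n bytes; streams such as sockets may return short reads."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended before the expected number of bytes")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def sketch_streaming_load(stream):
    """Inverse of sketch_streaming_save; reads strictly front to back."""
    (header_len,) = struct.unpack("<Q", _read_exact(stream, 8))
    entries = json.loads(_read_exact(stream, header_len))
    return {name: _read_exact(stream, size) for name, size in entries}
```

A round trip through io.BytesIO (save, seek(0), load) returns the original dict; a truncated stream raises EOFError instead of returning partial data.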

Security-wise, this fully supports weights_only, which defaults to True. It does use pickle for some metadata, but that metadata is also loaded with weights_only.
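
For context on why restricting pickle matters, the following is a rough stdlib-only sketch of allowlist-based unpickling, the general idea behind a weights_only-style load. It is not PyTorch's implementation; the allowlist contents and the names RestrictedUnpickler and restricted_loads are assumptions for illustration.

```python
import io
import pickle

# Hypothetical allowlist: only these (module, name) globals may be resolved.
_ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global; anything
        # not explicitly allowed is rejected before it can be imported.
        if (module, name) in _ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but only allowlisted globals can be resolved."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers such as dicts and lists load normally because they need no global lookup, while a payload that references an arbitrary callable fails with pickle.UnpicklingError.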

Adapted from:

https://github.com/pytorch/torchft/pull/101

https://github.com/pytorch/torchft/pull/54

Test Plan:

pytest test/distributed/test_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555
Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki

Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>
2025-02-07 18:08:11 +00:00
_composable
_shard [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
_sharded_tensor
_sharding_spec
_symmetric_memory [async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846) 2025-01-30 18:26:34 +00:00
_tensor
_tools Turn on mypy for _dynamo/variables/builtin.py (#145552) 2025-01-30 22:21:32 +00:00
algorithms
autograd
benchmarks
checkpoint [DCP] Remove all-gather of state dict keys (#145998) 2025-02-04 03:16:13 +00:00
elastic [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
examples
fsdp update _unsafe_set_version_counter to accept lists of tensors (#137921) 2025-02-04 04:51:11 +00:00
launcher
nn
optim [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
pipelining Remove stage_index_to_group_rank from schedule (#146217) 2025-02-05 21:26:45 +00:00
rpc [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
tensor [2/N][cp][example] flex attention in context parallel (backward pass) (#146397) 2025-02-06 19:50:02 +00:00
__init__.py
_checkpointable.py
_composable_state.py
_functional_collectives.py [2/N] Enable ruff F841 on distributed tests (#146132) 2025-02-02 03:44:48 +00:00
_functional_collectives_impl.py
_serialization.py distributed/serialization: add experimental streaming torch.save/load methods (#146555) 2025-02-07 18:08:11 +00:00
_state_dict_utils.py
argparse_util.py
c10d_logger.py
collective_utils.py
constants.py
CONTRIBUTING.md
device_mesh.py
distributed_c10d.py [BE]: Enable ruff SLOT checks (#146276) 2025-02-04 19:18:23 +00:00
launch.py
logging_handlers.py
remote_device.py
rendezvous.py
run.py Improve torchrun documentation (#144354) 2025-01-24 20:40:05 +00:00
utils.py