pytorch/test/distributed
Tristan Rice 23af9dde4d distributed/serialization: add experimental streaming torch.save/load methods (#146555)
Summary:

This is intended for use with torchft when we need to do a streaming state dict transfer. This is strictly superior to the prior streaming method in torchft as this supports all tensor subclasses such as DTensor.

This supports 100% of the inputs to torch.save/load, but it is neither wire compatible nor intended to have any backwards compatibility.

Security-wise, this fully supports weights_only, which defaults to True. Pickle is still used for some metadata, but that metadata is loaded with weights_only as well.

Adapted from:

https://github.com/pytorch/torchft/pull/101

https://github.com/pytorch/torchft/pull/54

Test Plan:

pytest test/distributed/test_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555
Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki

Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>
2025-02-07 18:08:11 +00:00
_composable Refactoring Distributed test cases to be device agnostic [1/n] (#145222) 2025-02-05 18:47:09 +00:00
_shard PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
_tools PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
algorithms Enable ruff F841 on distributed tests (#146131) 2025-02-01 03:06:16 +00:00
bin
checkpoint [DCP] Remove all-gather of state dict keys (#145998) 2025-02-04 03:16:13 +00:00
elastic Enable ruff F841 on distributed tests (#146131) 2025-02-01 03:06:16 +00:00
flight_recorder Revert "Use absolute path path.resolve() -> path.absolute() (#129409)" 2025-01-04 14:17:20 +00:00
fsdp [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
launcher PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
nn/jit PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
optim PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
pipelining Remove stage_index_to_group_rank from schedule (#146217) 2025-02-05 21:26:45 +00:00
rpc
tensor [dynamo][fullgraph] Do not skip frame with fullgraph=True (#146527) 2025-02-06 18:56:07 +00:00
argparse_util_test.py
test_backends.py API to retrieve default distributed backend from device (#140536) 2024-11-22 11:01:53 +00:00
test_c10d_common.py [BE][Ez]: FURB148 - remove useless enumerate calls (#145619) 2025-01-24 23:37:15 +00:00
test_c10d_functional_native.py PyWork: preserve Python reference counting when used in functional collectives (#146376) 2025-02-07 18:07:53 +00:00
test_c10d_gloo.py [ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673) 2025-01-09 05:18:57 +00:00
test_c10d_logger.py [c10d] Switch all timer logging in c10d to wait_counter (#141154) 2024-11-21 01:10:11 +00:00
test_c10d_nccl.py [c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789) (#144794) 2025-01-31 22:39:56 +00:00
test_c10d_object_collectives.py
test_c10d_ops_nccl.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_c10d_pypg.py DistributedDataParallel: add init_sync option to control collectives during initialization (#142824) 2024-12-11 20:28:38 +00:00
test_c10d_spawn.py
test_c10d_spawn_gloo.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn_nccl.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn_ucc.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_ucc.py XFAIL test_save_load_checkpoint (#144927) 2025-01-16 07:31:56 +00:00
test_collective_utils.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_composability.py composability test cleanup (#145011) 2025-01-18 04:37:12 +00:00
test_compute_comm_reordering.py
test_control_collectives.py
test_data_parallel.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_device_mesh.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_distributed_spawn.py
test_dynamo_distributed.py dynamo: fsdp throw unimplemented vs attribute error (#146188) 2025-02-04 21:45:55 +00:00
test_fake_pg.py [BE][Ez]: FURB148 - remove useless enumerate calls (#145619) 2025-01-24 23:37:15 +00:00
test_functional_api.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_inductor_collectives.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_launcher.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_multi_threaded_pg.py
test_nccl.py
test_pg_wrapper.py
test_serialization.py distributed/serialization: add experimental streaming torch.save/load methods (#146555) 2025-02-07 18:08:11 +00:00
test_store.py [BE][CI] bump ruff to 0.8.4 (#143753) 2024-12-24 12:24:10 +00:00
test_symmetric_memory.py [SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not called (#144886) 2025-01-28 01:43:37 +00:00