pytorch/torch
Tristan Rice 23af9dde4d distributed/serialization: add experimental streaming torch.save/load methods (#146555)
Summary:

This is intended for use with torchft when a state dict must be transferred as a stream. It is strictly superior to torchft's prior streaming method because it supports all tensor subclasses, such as DTensor.

This supports 100% of the inputs to torch.save/load, but it is neither wire compatible nor intended to provide any backwards compatibility.
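As a rough illustration of the general idea only (not the torchft or PyTorch implementation), the sketch below streams a dict of raw byte buffers, hypothetical stand-ins for tensor storages, through a file-like object that never needs to seek: one length-prefixed header frame with names and sizes, then one frame per buffer. The framing, the names `stream_save`/`stream_load`, and the use of JSON for the header are assumptions made for this example; the commit itself uses weights_only pickle for metadata.

```python
import io
import json
import struct
from typing import BinaryIO, Dict

# Hypothetical framing: every frame is an 8-byte little-endian length
# followed by that many payload bytes.
_LEN = struct.Struct("<Q")


def _write_frame(f: BinaryIO, payload: bytes) -> None:
    f.write(_LEN.pack(len(payload)))
    f.write(payload)


def _read_exact(f: BinaryIO, n: int) -> bytes:
    # Sockets and pipes may return short reads, so keep reading until n bytes arrive.
    chunks = []
    while n > 0:
        chunk = f.read(n)
        if not chunk:
            raise EOFError("stream ended in the middle of a frame")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def _read_frame(f: BinaryIO) -> bytes:
    (length,) = _LEN.unpack(_read_exact(f, _LEN.size))
    return _read_exact(f, length)


def stream_save(state: Dict[str, bytes], f: BinaryIO) -> None:
    """Write a header frame (names and sizes), then one frame per buffer."""
    header = [[name, len(buf)] for name, buf in state.items()]
    _write_frame(f, json.dumps(header).encode("utf-8"))
    for buf in state.values():
        _write_frame(f, buf)  # each buffer is written as its own frame


def stream_load(f: BinaryIO) -> Dict[str, bytes]:
    """Read the header, then exactly one frame per header entry, in order."""
    header = json.loads(_read_frame(f).decode("utf-8"))
    state: Dict[str, bytes] = {}
    for name, size in header:
        buf = _read_frame(f)
        if len(buf) != size:
            raise ValueError(f"{name!r}: expected {size} bytes, got {len(buf)}")
        state[name] = buf
    return state


if __name__ == "__main__":
    stream = io.BytesIO()
    stream_save({"weight": b"\x00\x01\x02\x03", "bias": b"\x04"}, stream)
    stream.seek(0)
    print(stream_load(stream))
```

Because the reader consumes frames strictly in order and never seeks, the same two functions could run over the objects returned by a socket's `makefile("wb")` and `makefile("rb")`, provided the writer is flushed after saving.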

Security-wise, this fully supports weights_only, which defaults to True. Some metadata is still serialized with pickle, but that metadata is also loaded with weights_only.
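The restricted loading mentioned here can be illustrated with the standard library. PyTorch's own restricted loader lives in `_weights_only_unpickler.py` in this directory and admits a curated allowlist of types; the minimal analogue below instead assumes an empty allowlist. It overrides `pickle.Unpickler.find_class`, the hook the Python `pickle` documentation describes for restricting globals, so plain containers and scalars still load while anything that must resolve a global is rejected. `MetadataUnpickler` and `load_metadata` are hypothetical names.

```python
import io
import pickle
from typing import Any


class MetadataUnpickler(pickle.Unpickler):
    """Unpickler with an empty allowlist: no global may be resolved.

    dict, list, tuple, str, bytes, int, float, bool and None are encoded with
    dedicated pickle opcodes and never reach find_class, so they still load.
    """

    def find_class(self, module: str, name: str) -> Any:
        # Every GLOBAL / STACK_GLOBAL opcode ends up here; refuse all of them.
        raise pickle.UnpicklingError(f"forbidden global: {module}.{name}")


def load_metadata(data: bytes) -> Any:
    return MetadataUnpickler(io.BytesIO(data)).load()


if __name__ == "__main__":
    meta = {"names": ["weight", "bias"], "sizes": (4, 1)}
    print(load_metadata(pickle.dumps(meta)))
    try:
        # complex has no dedicated opcode, so pickle references builtins.complex.
        load_metadata(pickle.dumps(complex(1, 2)))
    except pickle.UnpicklingError as exc:
        print("rejected:", exc)
```

A real allowlist would return the permitted class from `find_class` for specific `(module, name)` pairs and raise for everything else.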

Adapted from:

https://github.com/pytorch/torchft/pull/101

https://github.com/pytorch/torchft/pull/54

Test Plan:

pytest test/distributed/test_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555
Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki

Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>
2025-02-07 18:08:11 +00:00
_awaits
_C [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441) 2025-02-06 19:04:50 +00:00
_C_flatbuffer
_custom_op
_decomp
_dispatch
_dynamo Fix get_top() to return the base level event of the stack, not the most recently started event (#146649) 2025-02-07 18:04:50 +00:00
_export [export] make stack_trace optional in insert_custom_op_guards (#146438) 2025-02-06 01:48:26 +00:00
_functorch Add torch.func.debug_unwrap (#146528) 2025-02-06 18:48:09 +00:00
_higher_order_ops [auto_functionalized] Support Tensor(a!)[]? (#145400) 2025-02-05 14:52:39 +00:00
_inductor [inductor/profiler] add kernel kwargs instrumentation (#145573) 2025-02-07 17:44:30 +00:00
_lazy
_library [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
_logging use DTRACE_ENV_VAR as the trace logs directory of set (#146412) 2025-02-04 20:54:28 +00:00
_numpy
_prims
_prims_common [dynamo] Disable compiling on elementwise_type_promotion_wrapper (#146219) 2025-02-03 18:02:48 +00:00
_refs Re-add stft option to align window for center = false (#146379) 2025-02-06 14:07:13 +00:00
_strobelight
_subclasses Fix aten.to when input is a tensor constant (#146220) 2025-02-01 11:07:33 +00:00
_vendor
accelerator
amp
ao [BE]: Enable ruff SLOT checks (#146276) 2025-02-04 19:18:23 +00:00
autograd update _unsafe_set_version_counter to accept lists of tensors (#137921) 2025-02-04 04:51:11 +00:00
backends [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441) 2025-02-06 19:04:50 +00:00
compiler
contrib
cpu [CPUInductor] Fix SVE256 detection (#146207) 2025-02-01 18:51:34 +00:00
csrc PyWork: preserve Python reference counting when used in functional collectives (#146376) 2025-02-07 18:07:53 +00:00
cuda
distributed distributed/serialization: add experimental streaming torch.save/load methods (#146555) 2025-02-07 18:08:11 +00:00
distributions
export [export][dynamic shapes] log provenance for locals & symbols for non-strict (#143378) 2025-02-07 05:46:05 +00:00
fft
func Add torch.func.debug_unwrap (#146528) 2025-02-06 18:48:09 +00:00
futures
fx [export][dynamic shapes] log provenance for locals & symbols for non-strict (#143378) 2025-02-07 05:46:05 +00:00
jit
legacy
lib
linalg
masked
monitor add WaitCounter type interface and get rid of type errors (#146175) 2025-02-01 23:24:52 +00:00
mps
mtia
multiprocessing
nested Small improvements to NJT matrix multiplies (#146405) 2025-02-06 04:51:12 +00:00
nn Fix torch.nn.functional.one_hot param num_classes optional description (#146470) 2025-02-06 07:48:05 +00:00
onnx [ONNX] Create deprecation warning on dynamo_export (#146425) 2025-02-07 04:20:46 +00:00
optim [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
package [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
profiler execution trace export supports gzip format (#146179) 2025-02-01 01:25:25 +00:00
quantization
signal
sparse
special
testing Small improvements to NJT matrix multiplies (#146405) 2025-02-06 04:51:12 +00:00
utils Fixed a typo in dataset.py (#146600) 2025-02-07 05:09:51 +00:00
xpu
__config__.py
__future__.py
__init__.py Torch device backend autoload fix (#145611) 2025-01-31 19:27:42 +00:00
_appdirs.py
_classes.py
_compile.py
_custom_ops.py
_deploy.py
_environment.py
_guards.py
_jit_internal.py
_linalg_utils.py
_lobpcg.py
_lowrank.py
_meta_registrations.py nonzero_static with symint size (#146006) 2025-01-30 23:42:42 +00:00
_namedtensor_internals.py
_ops.py [Dynamo][Trace PyDispatcher] Remove disable from HigherOrderOperator.__call__ (#146270) 2025-02-03 21:47:54 +00:00
_python_dispatcher.py
_size_docs.py
_sources.py
_storage_docs.py
_streambase.py
_tensor.py Re-add stft option to align window for center = false (#146379) 2025-02-06 14:07:13 +00:00
_tensor_docs.py Re-add stft option to align window for center = false (#146379) 2025-02-06 14:07:13 +00:00
_tensor_str.py [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408) 2025-02-04 19:07:04 +00:00
_thread_safe_fork.py
_torch_docs.py Add overloads to diagonal docs (#144214) 2025-01-31 15:53:59 +00:00
_utils.py [BE]: Enable ruff SLOT checks (#146276) 2025-02-04 19:18:23 +00:00
_utils_internal.py
_VF.py
_vmap_internals.py
_weights_only_unpickler.py
abi-check.cpp
CMakeLists.txt
custom_class.h
custom_class_detail.h
extension.h
functional.py Re-add stft option to align window for center = false (#146379) 2025-02-06 14:07:13 +00:00
hub.py
library.h Remove trivial dispatch_key_allowlist_check function (#146169) 2025-01-31 19:59:40 +00:00
library.py [opcheck] Improve error reporting; allow atol/rtol overrides (#146488) 2025-02-05 21:25:06 +00:00
overrides.py Re-add stft option to align window for center = false (#146379) 2025-02-06 14:07:13 +00:00
py.typed
quasirandom.py
random.py
README.txt
return_types.py
script.h
serialization.py Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880) 2025-01-31 17:09:20 +00:00
storage.py
torch_version.py [BE]: Enable ruff SLOT checks (#146276) 2025-02-04 19:18:23 +00:00
types.py
version.py.tpl

Note [TH abstraction violation]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TH/THC provide some hpp headers, which are proper C++ headers rather than
C headers.  These headers serve double duty as *internal implementation
detail* headers, whose contents should largely not be used by external
clients.

Ideally, we would not install these headers at all; instead, you should
use public functions (in headers like `THTensor.h`, NOT `THTensor.hpp`)
to manipulate these structs.  However, there are a few places
in torch/csrc where we violate this abstraction.  They are marked with
a pointer to this note.  Each of those sites will have to be refactored
when we refactor the guts of THTensor and related structures.