Mikayla Gawarecki 001e355a56 Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, which changed the storage order to be non-lexicographical and added a `.format_version` entry to the zipfile; `calculate_storage_offsets` will only work on checkpoints that contain `.format_version`.

When this option is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th) is calculated instead of determined via `miniz` APIs.

The existing APIs issue multiple random reads (first the end-of-central-directory record, then the zipfile header for the record) to determine the offset at which a record's storage starts. This can greatly degrade `torch.load(mmap=True)` performance when the file is not on a local filesystem.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)
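The random-read pattern can be sketched with Python's `zipfile` module on a small stand-in archive (not a real torch checkpoint): locating a record's payload requires consulting the central directory and then seeking to the record's 30-byte local file header.

```python
import io
import os
import struct
import zipfile

# Build a small in-memory archive mimicking the checkpoint layout
# (stand-in only; a real checkpoint is written by torch.save).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    for i in range(3):
        zf.writestr(f"archive/data/{i}", os.urandom(100))

def payload_offset(zf, name):
    info = zf.getinfo(name)                     # read from the central directory
    zf.fp.seek(info.header_offset)              # random seek to the local header
    header = zf.fp.read(30)                     # fixed-size local file header
    n, m = struct.unpack("<HH", header[26:30])  # filename / extra-field lengths
    return info.header_offset + 30 + n + m

with zipfile.ZipFile(buf, "r") as zf:
    offsets = [payload_offset(zf, f"archive/data/{i}") for i in range(3)]
```

Every call to `payload_offset` costs two non-sequential reads; over many storages, and over a network-backed file, those seeks add up.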

## How does this work

The checkpoint format is as follows:

```
archive_name/
|_ data.pkl
|_ .format_version
|_ byteorder
|_ data/
  |_ 0
  |_ 1
  |_ 2
  |_ ...
|_
```

Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them.

For each storage, our `persistent_load` logic saves the following metadata to the pickle file: `dtype`, `numel`, `key`, and `location`, where `numel` is the number of bytes in the storage.

Note that we always use the `miniz` writer in zip64 mode per [here](7796e308d0/caffe2/serialize/inline_container.cc (L701)). A zipfile record written by `miniz` looks like this:

```
 ---------------- ----------------- ------------------ ---------------- --------- --------------------------------
| 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer |
 ---------------- ----------------- ------------------ ---------------- --------- --------------------------------
```

- The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c?fbclid=IwZXh0bgNhZW0CMTEAAR2O8Vysd--UoSCxW70gabXIS1dbz733oHwuUQ5_Ff1hY2WU6PL2i6CSH4A_aem_J9oaU2HpDeWtJKOU9EnVqw#L3290)
- filename will be `"{archive_name}/{filepath}"`
- `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524))
- `m` is determined by [`getPadding`](7796e308d0/caffe2/serialize/inline_container.cc (L254)), which accounts for the filename and `zip64_extra_data` when choosing `m` such that the start of `storage` is aligned to 64 bytes. The `m` padding bytes always start with `F B padding_size` as the first 4 bytes
- The local dir footer size is determined by [this snippet](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)): if the buffer size is 0, the footer is skipped; if `zip64_extra_data` was created, it is 24 bytes, otherwise 16 bytes.
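Under those rules, and assuming the common case of no `zip64_extra_data` (storage sizes and header offsets below `0xFFFFFFFF`), one record's footprint can be sketched as below. The padding model here (a 4-byte `F B padding_size` prefix, then zeros up to the next 64-byte boundary) is a simplification of `getPadding`, not the exact C++ code.

```python
MZ_ZIP_LOCAL_DIR_HEADER_SIZE = 30  # miniz local file header size
ALIGNMENT = 64                     # storage payloads are 64-byte aligned
ZIP64_LIMIT = 0xFFFFFFFF

def record_span(header_offset, filename, storage_size):
    """Return (storage_start, next_header_offset) for one record.

    Simplified model: assumes no zip64_extra_data and approximates
    getPadding() as a 4-byte prefix plus enough zeros to 64-byte-align
    the storage start.
    """
    assert header_offset < ZIP64_LIMIT and storage_size < ZIP64_LIMIT
    start = header_offset + MZ_ZIP_LOCAL_DIR_HEADER_SIZE + len(filename) + 4
    start += -start % ALIGNMENT              # pad up to the next 64-byte boundary
    footer = 0 if storage_size == 0 else 16  # would be 24 with zip64_extra_data
    return start, start + storage_size + footer
```

For example, `record_span(0, "archive/data/0", 1024)` places the storage at offset 64: the header (30), filename (14), and padding prefix (4) end at 48, which rounds up to the 64-byte boundary.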

When `torch.utils.serialization.config.load.calculate_storage_offsets` is set, we do the following:
- We track where the "cursor" is in the file with `current_offset`; after each `persistent_load` call, it sits at the offset where the header of the next record starts
- For the 0th storage, `data/0`, we use the regular `get_record_offset` to determine the start of the storage
- For any other storage (the unpickler encounters storages in order: 0, 1, 2, 3, ...), we use `get_record_offset_no_read`, which reuses the `getPadding` logic to calculate the offset of the storage
- Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case
- After each storage, if the storage size is nonzero, we account for the local dir footer based on the logic described above
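Putting the steps above together, the cursor-tracking scheme can be sketched as follows. Names are illustrative, not the actual `torch.serialization` internals, and the same simplifications apply: no `zip64_extra_data`, and the padding is modeled as 4 prefix bytes plus zeros up to the next 64-byte boundary.

```python
MZ_ZIP_LOCAL_DIR_HEADER_SIZE = 30  # miniz local file header size
ALIGNMENT = 64                     # storage payloads are 64-byte aligned

def calculate_storage_offsets(first_offset, records):
    """Return {key: storage_offset} for records in pickling order.

    `records` is an ordered list of (key, filename, nbytes) as the
    unpickler encounters them; `first_offset` is where the 0th storage's
    bytes start, obtained via the regular get_record_offset.
    """
    offsets = {}
    cursor = None  # offset where the next record's header starts
    for i, (key, filename, nbytes) in enumerate(records):
        if i == 0:
            start = first_offset  # read authoritatively via miniz
        else:
            start = cursor + MZ_ZIP_LOCAL_DIR_HEADER_SIZE + len(filename) + 4
            start += -start % ALIGNMENT        # align storage start to 64 bytes
        offsets[key] = start  # cached in case load_tensor asks for this key again
        footer = 0 if nbytes == 0 else 16      # 24 if zip64_extra_data were written
        cursor = start + nbytes + footer       # next record's header starts here
    return offsets
```

Only the 0th storage costs a real read; every later offset is pure arithmetic over the metadata already present in the pickle.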

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag, `TORCH_SERIALIZATION_DEBUG`, that runs this offset-calculation logic and verifies it against `getRecordOffset` for each storage (when `mmap=False`)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.
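The gating pattern can be sketched as below. This is a hypothetical stand-in, not the torch internals: `calc_offset` and `read_offset` are illustrative callables representing the cheap calculation and the authoritative miniz-backed read.

```python
import os

# Debug cross-checks run only when the CI flag is set, so ordinary
# loads pay nothing for the verification.
DEBUG = os.environ.get("TORCH_SERIALIZATION_DEBUG", "") not in ("", "0")

def verified_offset(record, calc_offset, read_offset):
    """Use the cheap calculated offset; cross-check it when debugging."""
    off = calc_offset(record)
    if DEBUG:
        authoritative = read_offset(record)
        if off != authoritative:
            raise RuntimeError(
                f"offset mismatch for {record}: {off} != {authoritative}"
            )
    return off
```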

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-31 17:09:20 +00:00