pytorch/test/distributed/checkpoint
mori360 a7ba562ec8 [state dict] Change _load_model_state_dict to enable cpu_offload, accept 2 device type and optimize memory (#142845)
For the distributed state dict API [migration](https://github.com/pytorch/torchtune/pull/2138), this PR makes the following changes:
1. `load_from_full_model_state_dict` in TorchTune calls `set_model_state_dict` with an option that controls cpu_offload. Add cpu_offload handling to `_load_model_state_dict` so the state is moved to CPU when the option is True.
2. Relax the device check: lora_finetune might have 2 device types, so accept that as valid.
3. Some changes to optimize the memory performance:
3.1 Use `.detach().clone()` instead of using a view directly.
3.2 If local_state is not on the meta device, copy `full_tensor[slices]` into `ret.to_local()`.
4. Add related unit tests.
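
As an illustration of item 3.1, here is a minimal sketch of how a slice view differs from `.detach().clone()`. It uses plain CPU tensors rather than the DTensor code path touched by this PR, and the names `full`, `shard_view`, and `shard_copy` are hypothetical, chosen only for this example. A view shares storage with the full tensor, so the full tensor's memory stays alive as long as the view does. A detached clone owns only the sliced elements.

```python
import torch

# Stand-in for a full (unsharded) parameter; hypothetical example data.
full = torch.arange(8, dtype=torch.float32)

# A slice is a view: it shares the underlying storage with `full`.
shard_view = full[2:4]

# detach() drops autograd history, clone() allocates independent storage
# that holds only the sliced elements.
shard_copy = full[2:4].detach().clone()

# Mutate the source tensor in place.
full[2] = 100.0

# The view observes the mutation, the clone does not.
view_sees_change = shard_view[0].item() == 100.0
copy_sees_change = shard_copy[0].item() == 100.0

# The view starts at the same memory address as element 2 of `full`.
shares_storage = shard_view.data_ptr() == full[2:].data_ptr()

print(view_sees_change, copy_sees_change, shares_storage)
```

Because the clone is independent, the full tensor can be freed once each shard has been copied out, which is presumably where the memory saving in item 3 comes from.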

Memory performance calling from TorchTune with llama2/7B_full:
1. cpu_offload = True
<img width="555" alt="Screenshot 2024-12-18 at 1 36 47 PM" src="https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b" />

2. cpu_offload = False
<img width="555" alt="Screenshot 2024-12-18 at 1 36 52 PM" src="https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142845
Approved by: https://github.com/fegin
2024-12-19 05:06:41 +00:00
..
e2e Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
fsdp [FSDP2] Move to public torch.distributed.fsdp (#141868) 2024-12-07 01:24:28 +00:00
test_checkpoint.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_compatibility.py
test_dedup_tensors.py
test_dtensor_checkpoint.py
test_dtensor_resharding.py
test_file_system_checkpoint.py
test_file_system_checkpoint_cpu.py
test_format_utils.py
test_fsdp_model_state.py
test_fsdp_optim_state.py
test_fsdp_tp_checkpoint_conversion.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_fsspec.py
test_hsdp_checkpoint.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_nested_dict.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_planner.py [checkpointing][oss] Throw an error when loading a different size than saved tensor (#141571) 2024-12-11 15:35:48 +00:00
test_save_load_api.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_state_dict.py [state dict] Change _load_model_state_dict to enable cpu_offload, accept 2 device type and optimize memory (#142845) 2024-12-19 05:06:41 +00:00
test_state_dict_utils.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_tp_checkpoint.py
test_traverse.py
test_utils.py