mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
For the distributed state dict API [migration](https://github.com/pytorch/torchtune/pull/2138), make the following changes here:

1. `load_from_full_model_state_dict` in TorchTune calls `set_model_state_dict` with options controlling whether to use cpu_offload. Add cpu_offload to `_load_model_state_dict` to process tensors on CPU when the config is True.
2. Relax the device check: lora_finetune may involve 2 device types, so accept that as valid.
3. Optimize memory performance:
   1. Use `.detach().clone()` instead of returning a view directly.
   2. If the local tensor is not on the meta device, copy `full_tensor[slices]` into `ret.to_local()`.
4. Add related unit tests.

Memory performance calling from TorchTune with llama2/7B_full:

1. cpu_offload = True
<img width="555" alt="Screenshot 2024-12-18 at 1 36 47 PM" src="https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b" />
2. cpu_offload = False
<img width="555" alt="Screenshot 2024-12-18 at 1 36 52 PM" src="https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142845
Approved by: https://github.com/fegin
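The memory optimizations in item 3 can be sketched with plain tensors instead of DTensors (variable names are illustrative, not the actual implementation): an indexing view pins the whole gathered tensor's storage, while `.detach().clone()` materializes only the local shard, and an existing (non-meta) local tensor can be filled with an in-place `copy_` rather than a fresh allocation.

```python
import torch

# Full (unsharded) tensor as produced when gathering a sharded
# parameter, plus the slice this rank owns (illustrative values).
full_tensor = torch.arange(16.0).reshape(4, 4)
slices = (slice(0, 2), slice(None))

# A plain indexing view shares storage with full_tensor, keeping the
# entire gathered tensor alive as long as the shard is referenced.
shard_view = full_tensor[slices]
assert shard_view.data_ptr() == full_tensor.data_ptr()

# .detach().clone() copies only the shard into its own storage, so
# full_tensor can be freed once every rank has taken its slice.
shard_owned = full_tensor[slices].detach().clone()

# If the local tensor already exists (i.e. it is not on the meta
# device), copy the slice into it in place instead of allocating:
local = torch.empty(2, 4)
local.copy_(full_tensor[slices])
```

In the actual change the in-place target is `ret.to_local()` on a DTensor; `local` here stands in for that local shard under the stated assumptions.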