onnxruntime/orttraining/orttraining/python/training
zhijiang 8fadc6c913
Zhijxu/cleanup cached tensors when oom (#19306)
In PyTorch, when an OOM happens during the backward pass, the user can
decrease the batch size and rerun without restarting the process.

In ORT, however, the intermediate tensors are kept even after an OOM, so
rerunning with a smaller batch size still fails.
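
The rerun-with-a-smaller-batch pattern described above can be sketched roughly as follows. This is only an illustration, not code from this PR: `run_with_batch_backoff`, `train_step`, and `fake_train_step` are hypothetical names, and `MemoryError` stands in for the real OOM exception. With PyTorch on CUDA one would presumably pass `torch.cuda.OutOfMemoryError` as `oom_error` and might also call `torch.cuda.empty_cache()` between attempts. The retry can only succeed if the failed step's tensors are actually released, which is the behavior this PR is about.

```python
import gc


def run_with_batch_backoff(train_step, batch_size, oom_error=MemoryError, min_batch_size=1):
    """Call train_step(batch_size), halving batch_size after each OOM error.

    train_step is a hypothetical callable that runs one forward/backward step.
    Returns (batch_size_that_worked, result_of_train_step).
    """
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_step(batch_size)
        except oom_error:
            pass  # fall through to the cleanup below
        # Outside the except block the exception (and the frames it references)
        # is no longer held, so objects from the failed step can be collected.
        gc.collect()
        batch_size //= 2
    raise RuntimeError("out of memory even at the minimum batch size")


# Hypothetical step that only "fits in memory" when batch_size <= 8.
def fake_train_step(batch_size):
    if batch_size > 8:
        raise MemoryError(f"simulated OOM at batch_size={batch_size}")
    return f"ok@{batch_size}"


used_batch_size, result = run_with_batch_backoff(fake_train_step, 32)
print(used_batch_size, result)
```

In this sketch the simulated step fails at batch sizes 32 and 16 and succeeds at 8, all within the same process.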


This is the torch run: after the OOM failure, torch releases its
tensors before the next step.

![image](https://github.com/microsoft/onnxruntime/assets/43435212/92b8a2e3-454b-448a-a223-17cb91d463c2)

This is the ORT run: ORT does not release its tensors after the OOM
failure.

![image](https://github.com/microsoft/onnxruntime/assets/43435212/bb6a3882-8e14-4f37-8079-e7f70fc2546b)

This is ORT with this PR: the memory is released. **The 4GB of memory is not
owned by ORT and will be released by torch at the end**.

![image](https://github.com/microsoft/onnxruntime/assets/43435212/7f39d711-4e36-47d5-aecf-3805433a6d01)
2024-02-21 10:41:42 +08:00
amp [Better Engineering] Bump ruff to 0.0.278 and fix new lint errors (#16789) 2023-07-21 12:53:41 -07:00
api Introduce a Nominal Checkpoint for On-Device Training (#19232) 2024-01-30 22:11:25 -08:00
experimental Manage ORTModule configurations consistently (#16396) 2023-06-27 19:19:36 +08:00
onnxblock Introduce a Nominal Checkpoint for On-Device Training (#19232) 2024-01-30 22:11:25 -08:00
optim FP16 optimizer automatically detect DeepSpeed compatibility (#18084) 2023-10-25 15:11:02 +08:00
ort_triton Bump ruff linter to 0.2.1 (#19471) 2024-02-08 16:08:27 -08:00
ortmodule Zhijxu/cleanup cached tensors when oom (#19306) 2024-02-21 10:41:42 +08:00
utils ORTModule memory improvement (#18924) 2024-01-16 08:57:37 +08:00
__init__.py Removed all the deprecated python training code and related tests and utils (#18333) 2023-11-17 18:19:21 -08:00
_utils.py Removed all the deprecated python training code and related tests and utils (#18333) 2023-11-17 18:19:21 -08:00
artifacts.py Introduce a Nominal Checkpoint for On-Device Training (#19232) 2024-01-30 22:11:25 -08:00