onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-25 02:50:42 +00:00

pengwa 735a32fee1 Introduce memory observer for ORTModule (#16213)		2023-06-15 15:45:36 +08:00

### Introduce memory observer for ORTModule

To analyze memory usage for ORTModule training, we need to collect the per-iteration memory footprint at different stages (pre-forward, post-forward, pre-backward, and post-backward). Currently we only collect the data using torch.cuda APIs. As a next step, we could collect the detailed stashed activation list and its percentage within the ORT backend, which is beyond the scope of this PR.

Sample output:

```
[0/8] step 0 memory (MiB) | phase: pre_forward | allocated: 1866 | max allocated: 1866 | cached: 1874 | max cached: 1874 | inactive: 8 | max inactive: 8
[0/8] step 0 memory (MiB) | phase: post_forward | allocated: 23277 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 193 | max inactive: 405
[0/8] step 0 memory (MiB) | phase: pre_backward | allocated: 23277 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 193 | max inactive: 405
[0/8] step 0 memory (MiB) | phase: post_backward | allocated: 2932 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 6158 | max inactive: 6158
0%|█ | 1/200 [00:26<1:26:18, 26.02s/it]
[0/8] step 1 memory (MiB) | phase: pre_forward | allocated: 2356 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 2454 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: post_forward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: pre_backward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: post_backward | allocated: 3422 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 5284 | max inactive: 6165
1%|██ | 2/200 [00:26<36:47, 11.15s/it]
[0/8] step 2 memory (MiB) | phase: pre_forward | allocated: 2356 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2454 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: post_forward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: pre_backward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: post_backward | allocated: 3422 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 5284 | max inactive: 6165
```
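For readers who want to reproduce a log line of this shape outside ORTModule, the following is a minimal sketch and not the code from the PR. The helper names `collect_cuda_stats` and `format_memory_line` are hypothetical, and mapping "cached" to `torch.cuda.memory_reserved` and "inactive" to the `inactive_split_bytes.all.*` entries of `torch.cuda.memory_stats` is an assumption based on the field names in the sample.

```python
from typing import Dict

MIB = 1024 * 1024

# Field order mirrors the sample log in the commit message.
FIELDS = (
    "allocated",
    "max allocated",
    "cached",
    "max cached",
    "inactive",
    "max inactive",
)


def collect_cuda_stats() -> Dict[str, int]:
    """Read allocator counters (in bytes) for the current CUDA device.

    Assumption: this mapping of log fields to torch.cuda calls is a guess,
    not taken from the ORTModule source. Requires PyTorch with CUDA.
    """
    import torch  # imported lazily so the formatter works without PyTorch

    stats = torch.cuda.memory_stats()
    return {
        "allocated": torch.cuda.memory_allocated(),
        "max allocated": torch.cuda.max_memory_allocated(),
        "cached": torch.cuda.memory_reserved(),
        "max cached": torch.cuda.max_memory_reserved(),
        "inactive": stats.get("inactive_split_bytes.all.current", 0),
        "max inactive": stats.get("inactive_split_bytes.all.peak", 0),
    }


def format_memory_line(
    rank: int, world_size: int, step: int, phase: str, stats_bytes: Dict[str, int]
) -> str:
    """Render one log line, converting byte counts to whole MiB."""
    fields = " | ".join(f"{name}: {stats_bytes[name] // MIB}" for name in FIELDS)
    return (
        f"[{rank}/{world_size}] step {step} memory (MiB) "
        f"| phase: {phase} | {fields}"
    )
```

In a training loop, one would call `format_memory_line(rank, world_size, step, "pre_forward", collect_cuda_stats())` before the forward pass and repeat with the other three phase names around the forward and backward calls.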
..
amp	Adopt linrtunner as the linting tool - take 2 (#15085)	2023-03-24 15:29:03 -07:00
api	Python documentation for onnxruntime-training (#15765)	2023-05-02 16:58:16 -07:00
experimental	Adopt linrtunner as the linting tool - take 2 (#15085)	2023-03-24 15:29:03 -07:00
onnxblock	Python documentation for onnxruntime-training (#15765)	2023-05-02 16:58:16 -07:00
optim	support latest deepspeed version for optim (#15682)	2023-04-25 20:12:23 -07:00
ortmodule	Introduce memory observer for ORTModule (#16213)	2023-06-15 15:45:36 +08:00
torchdynamo	Detect fake tensor mode if it has already been created. (#16220)	2023-06-02 23:17:49 -07:00
utils	Introduce memory observer for ORTModule (#16213)	2023-06-15 15:45:36 +08:00
__init__.py	Refining the offline tooling for training artifact generation (#15212)	2023-03-30 18:05:51 -07:00
_checkpoint_storage.py	Enable pylint and numpy rules (#15218)	2023-03-27 20:37:53 -07:00
_utils.py	Introduce float 8 types (#14731)	2023-05-30 13:25:58 -07:00
artifacts.py	Python documentation for onnxruntime-training (#15765)	2023-05-02 16:58:16 -07:00
checkpoint.py	Bump ruff in CI (#15533)	2023-04-17 10:11:44 -07:00
model_desc_validation.py	Adopt linrtunner as the linting tool - take 2 (#15085)	2023-03-24 15:29:03 -07:00
orttrainer.py	Adopt linrtunner as the linting tool - take 2 (#15085)	2023-03-24 15:29:03 -07:00
orttrainer_options.py	Adopt linrtunner as the linting tool - take 2 (#15085)	2023-03-24 15:29:03 -07:00
postprocess.py	Adopt linrtunner as the linting tool - take 2 (#15085)	2023-03-24 15:29:03 -07:00