onnxruntime/orttraining/orttraining/python/training
pengwa 735a32fee1
Introduce memory observer for ORTModule (#16213)
### Introduce memory observer for ORTModule

To analyze memory usage for ORTModule training, we need collect
per-iteration memory footprint in different stages (pre-forward,
post-forward, pre-backward, and post-backward).

Currently we only collect the data using torch.cuda APIs. The next step
is, we could collect the detailed stashed activation list and its
percentage within ORT backend, which is beyond this PR.

Sample as below: 
```
0/8] step 0 memory (MiB) | phase: pre_forward | allocated: 1866 | max allocated: 1866 | cached: 1874 | max cached: 1874 | inactive: 8 | max inactive: 8
[0/8] step 0 memory (MiB) | phase: post_forward | allocated: 23277 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 193 | max inactive: 405
[0/8] step 0 memory (MiB) | phase: pre_backward | allocated: 23277 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 193 | max inactive: 405
[0/8] step 0 memory (MiB) | phase: post_backward | allocated: 2932 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 6158 | max inactive: 6158
  0%|█                                                                                                                                                                                                            | 1/200 [00:26<1:26:18, 26.02s/it]
[0/8] step 1 memory (MiB) | phase: pre_forward | allocated: 2356 | max allocated: 26215 | cached: 26406 | max cached: 26406 | inactive: 2454 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: post_forward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: pre_backward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 1 memory (MiB) | phase: post_backward | allocated: 3422 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 5284 | max inactive: 6165
  1%|██                                                                                                                                                                                                             | 2/200 [00:26<36:47, 11.15s/it]
[0/8] step 2 memory (MiB) | phase: pre_forward | allocated: 2356 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2454 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: post_forward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: pre_backward | allocated: 23767 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 2639 | max inactive: 6165
[0/8] step 2 memory (MiB) | phase: post_backward | allocated: 3422 | max allocated: 26705 | cached: 29342 | max cached: 29342 | inactive: 5284 | max inactive: 6165
```
2023-06-15 15:45:36 +08:00
..
amp Adopt linrtunner as the linting tool - take 2 (#15085) 2023-03-24 15:29:03 -07:00
api Python documentation for onnxruntime-training (#15765) 2023-05-02 16:58:16 -07:00
experimental Adopt linrtunner as the linting tool - take 2 (#15085) 2023-03-24 15:29:03 -07:00
onnxblock Python documentation for onnxruntime-training (#15765) 2023-05-02 16:58:16 -07:00
optim support latest deepspeed version for optim (#15682) 2023-04-25 20:12:23 -07:00
ortmodule Introduce memory observer for ORTModule (#16213) 2023-06-15 15:45:36 +08:00
torchdynamo Detect fake tensor mode if it has already been created. (#16220) 2023-06-02 23:17:49 -07:00
utils Introduce memory observer for ORTModule (#16213) 2023-06-15 15:45:36 +08:00
__init__.py Refining the offline tooling for training artifact generation (#15212) 2023-03-30 18:05:51 -07:00
_checkpoint_storage.py Enable pylint and numpy rules (#15218) 2023-03-27 20:37:53 -07:00
_utils.py Introduce float 8 types (#14731) 2023-05-30 13:25:58 -07:00
artifacts.py Python documentation for onnxruntime-training (#15765) 2023-05-02 16:58:16 -07:00
checkpoint.py Bump ruff in CI (#15533) 2023-04-17 10:11:44 -07:00
model_desc_validation.py Adopt linrtunner as the linting tool - take 2 (#15085) 2023-03-24 15:29:03 -07:00
orttrainer.py Adopt linrtunner as the linting tool - take 2 (#15085) 2023-03-24 15:29:03 -07:00
orttrainer_options.py Adopt linrtunner as the linting tool - take 2 (#15085) 2023-03-24 15:29:03 -07:00
postprocess.py Adopt linrtunner as the linting tool - take 2 (#15085) 2023-03-24 15:29:03 -07:00