onnxruntime/orttraining
pengwa 40bcc0441b
Enhance StatisticsSubscriber (#16098)
### Enhance StatisticsSubscriber

There are few improvements for `StatisticsSubscriber`:

- Reduce peak memory impact for tensors (having many many many elements,
consuming too much GPU memory, causing original recipe run failed with
OOM), by split the statistics into two phases (split into buckets, and
merge result across buckets).
- Allow dump intermediate tensors. Originally only nn.Module forward()'s
return value are dumped, there are requirements we want to inspect some
specific intermediate tensor in the forward() function, now we support
it.
- Add documents for collecting dumps on multiple ranks

Docs link on this branch for better view:
https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool_v2/docs/ORTModule_Convergence_Notes.md

---------

Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
2023-06-12 18:32:08 +08:00
..
orttraining Enhance StatisticsSubscriber (#16098) 2023-06-12 18:32:08 +08:00
pytorch_frontend_examples Enable pylint and numpy rules (#15218) 2023-03-27 20:37:53 -07:00
tools [ROCm] reduce batch size to fix CI error (#15714) 2023-05-16 13:10:02 +08:00