mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-05-15 20:50:42 +00:00
# ORTModule Training Convergence Investigation
## 1. Discovering
Convergence issues can be identified by:
- Large discrepancies in core training metrics, including training loss, evaluation loss, and model-specific AUC metrics.
- Runtime failures (for example, the loss scaler reaching its minimum and triggering an exception).

Before investigating further, we should clarify a few things (if possible):
- If we change the seed for the baseline run, is the metric difference also large?
  (Make sure the discrepancy is not introduced by randomness.)
- At which step do we first see an obvious divergence?
- Does the issue still reproduce once randomness is removed?
  - Set the same seeds.
  - Set the dropout ratio to 0.
  - Set compute to be deterministic and torch-comparable (TODO(pengwa): need a flag for this).
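
The first two randomness-removal steps above could be sketched as follows. This is a minimal sketch, not part of ORTModule: `seed_everything` and `disable_dropout` are hypothetical helper names, NumPy and PyTorch are imported only if installed, and only module-based dropout layers are handled (functional dropout calls are not).

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the random number generators a training script commonly uses."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch
    try:
        import torch

        torch.manual_seed(seed)           # default PyTorch generator
        torch.cuda.manual_seed_all(seed)  # safe to call without CUDA
    except ImportError:
        pass  # PyTorch is optional in this sketch


def disable_dropout(model) -> int:
    """Set p=0 on dropout modules of a torch.nn.Module; return how many changed."""
    import torch.nn as nn

    dropout_types = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)
    changed = 0
    for module in model.modules():
        if isinstance(module, dropout_types):
            module.p = 0.0
            changed += 1
    return changed
```

Calling `seed_everything` with the same value at the start of both the baseline and the ORTModule run, and `disable_dropout(model)` before training starts, covers the seed and dropout items. On the PyTorch side, `torch.use_deterministic_algorithms(True)` may help with the third item; the ORT side still depends on the flag mentioned in the TODO.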
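
The seed-noise and first-divergence questions above can be combined into one check: use several baseline runs with different seeds to define a per-step band, then look for the first step where the run under investigation leaves that band. The sketch below assumes the per-step metric (for example training loss) has already been collected into plain Python lists; `first_divergent_step`, `margin`, and the numbers are illustrative only.

```python
def first_divergent_step(baseline_runs, candidate, margin=0.0):
    """Return the first step where `candidate` leaves the baseline band.

    baseline_runs: list of per-step metric lists, one per seed.
    candidate: per-step metric list of the run under investigation.
    margin: extra slack added on both sides of the baseline range.
    Returns None if the candidate stays inside the band at every step.
    """
    for step, values in enumerate(zip(candidate, *baseline_runs)):
        value, baseline_values = values[0], values[1:]
        low = min(baseline_values) - margin
        high = max(baseline_values) + margin
        # A NaN value fails this comparison and is reported as divergent.
        if not (low <= value <= high):
            return step
    return None


# Made-up loss curves, purely for illustration.
baseline_seed_0 = [2.30, 1.90, 1.60, 1.40]
baseline_seed_1 = [2.31, 1.88, 1.63, 1.38]
ort_run = [2.30, 1.89, 1.95, 2.40]

print(first_divergent_step([baseline_seed_0, baseline_seed_1], ort_run, margin=0.05))  # 2
```

If the band between seeds is already wider than the gap to the ORTModule run, the discrepancy is likely within random noise.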
## 2. Collect Activation Statistics
Add the following code to the baseline (PyTorch) run:
```diff
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```
Run the training script up to the step that triggered the divergence. A folder named `pt_out` is created in the current working directory. It contains one folder per step, each holding summaries for every activation tensor.

Then add a few lines of code to the ORTModule run:
```diff
  from onnxruntime.training.ortmodule import ORTModule
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
  model = ORTModule(model)
+ SubscriberManager.subscribe(model, [StatisticsSubscriber("ort_out", override_output_dir=True)])
```
> `StatisticsSubscriber` can be initialized either before or after wrapping the model with `ORTModule`.

Run the training script up to the step that triggered the divergence. Similarly, a folder named `ort_out` is created in the current working directory.

Run the following command to generate per-step summaries:
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```
Manually diff the generated per-step summaries to find where the first big difference occurs.
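
A plain textual diff tends to flag every line, because floating-point statistics from the two runs rarely match digit for digit. A tolerance-aware comparison can be sketched as below. It assumes the two summaries for a given step are line-aligned text whose lines contain numeric statistics; the exact file layout written by `merge_activation_summary` is not assumed, and `first_big_diff` and its tolerances are hypothetical.

```python
import math
import re
from itertools import zip_longest

# Matches integers, decimals, and scientific notation such as 1.5e-3.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def first_big_diff(pt_lines, ort_lines, rel_tol=1e-3, abs_tol=1e-6):
    """Return (line_number, pt_line, ort_line) for the first line whose numbers
    differ beyond the tolerances, or None if all lines agree."""
    for lineno, (pt, ort) in enumerate(zip_longest(pt_lines, ort_lines), start=1):
        if pt is None or ort is None:
            return lineno, pt, ort  # one summary is shorter than the other
        pt_nums = [float(x) for x in _NUMBER.findall(pt)]
        ort_nums = [float(x) for x in _NUMBER.findall(ort)]
        if len(pt_nums) != len(ort_nums) or any(
            not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
            for a, b in zip(pt_nums, ort_nums)
        ):
            return lineno, pt, ort
    return None
```

The function takes lists of lines, so the summaries can be read with `Path(...).read_text().splitlines()` from whichever files the merge step produced. Loosen `rel_tol` for mixed-precision runs, where small differences are expected.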