# ORTModule Training Convergence Investigation

## 1. Discovering

Convergence issues can be identified by:

- Large discrepancies in core training metrics, including training loss, evaluation loss, and model-specific AUC metrics.
- Runtime failures (for example, when the loss scaler reaches its minimum, triggering an exception).

Before investigating further, we should clarify a few things (if possible):

- If we change the seed for the baseline run, is the metric diff still large? (Make sure the discrepancy is not introduced by randomness.)
- At which steps do we first see obvious divergence?
- Is the issue still reproducible once randomness is removed?
  - Set the same seeds.
  - Set the dropout ratio to 0.
  - Make compute deterministic and torch-comparable (TODO(pengwa): need a flag for this).

## 2. Collect Activation Statistics

### Add a few lines of code, then run the script to collect statistics:
**Baseline:**

```diff
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ sub_m = SubscriberManager()
+ sub_m.subscribe(model, [StatisticsSubscriber(output_dir="pt_out", override_output_dir=True)])
```

- Run the training script up to the steps that trigger the divergence.
- A folder named `pt_out` is created in the current working directory.
- For each step, there is a folder containing summaries for every activation tensor.

**ORTModule:**

```diff
  model = ORTModule(model)
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ sub_m = SubscriberManager()
+ sub_m.subscribe(model, [StatisticsSubscriber(output_dir="ort_out", override_output_dir=True)])
```

- Run the training script up to the steps that trigger the divergence.
- Similarly, a folder named `ort_out` is created in the current working directory.
- `StatisticsSubscriber` can be subscribed before OR after wrapping the model with `ORTModule`.
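The randomness-removal checklist from section 1 (same seeds, dropout disabled, deterministic compute) can be sketched as below. This is an illustrative helper, not part of the ONNX Runtime API; the names `remove_randomness` and `zero_dropout` are made up for this sketch:

```python
import os
import random

import numpy as np
import torch


def remove_randomness(seed: int = 42) -> None:
    """Fix every common seed source so the PyTorch baseline and the
    ORTModule run draw identical random numbers."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels; CUDA matmuls additionally need this
    # cuBLAS workspace setting to be reproducible.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)


def zero_dropout(model: torch.nn.Module) -> None:
    """Set the dropout ratio to 0 while keeping the model in training
    mode, so other training-mode behavior (e.g. batch norm) is unchanged."""
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.p = 0.0
```

Call both helpers in the baseline and the ORTModule script before the first training step, so the two runs start from identical randomness.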
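Once both runs have written their statistics, the two output folders can be walked in parallel to find the first activation summaries that differ. A minimal sketch, assuming both runs produced the same relative file layout under `pt_out` and `ort_out`; the helper name `compare_stat_dirs` is illustrative and not an ONNX Runtime API:

```python
import filecmp
import os


def compare_stat_dirs(baseline_dir: str, ort_dir: str):
    """Return (relative_path, reason) pairs for summary files that are
    missing on the ORT side or whose contents differ byte-for-byte."""
    diffs = []
    for root, _dirs, files in os.walk(baseline_dir):
        rel = os.path.relpath(root, baseline_dir)
        other = os.path.join(ort_dir, rel)
        for name in sorted(files):
            a = os.path.join(root, name)
            b = os.path.join(other, name)
            if not os.path.exists(b):
                diffs.append((os.path.join(rel, name), "missing in ORT run"))
            elif not filecmp.cmp(a, b, shallow=False):
                diffs.append((os.path.join(rel, name), "content differs"))
    return diffs


if __name__ == "__main__":
    for path, reason in compare_stat_dirs("pt_out", "ort_out"):
        print(path, "->", reason)
```

Since the per-step folders are ordered, the earliest step with a differing tensor summary points at where the two runs start to diverge.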