# ORTModule Training Convergence Investigation

## 1. Discovering

Convergence issues can be identified by:

- Large discrepancies in core training metrics, including training loss, evaluation loss, and model-specific AUC metrics.
- Runtime failures (for example, the loss scaler reaching its minimum and triggering an exception).

Before investigating further, clarify a few things (if possible):

- If the seed is changed for the baseline run, is the metric difference large? (Make sure the discrepancy is not caused by randomness.)
- At which step does the first obvious divergence appear?
- Does the issue still reproduce once randomness is removed?
  - Set the same seeds.
  - Set the dropout ratio to 0.
  - Set compute to be deterministic and torch-comparable (TODO(pengwa): need a flag for this).

## 2. Collect Activation Statistics

For the baseline (PyTorch) run, add the following code:

```diff
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```

Run the training script up to the step that triggered the divergence. A folder named `pt_out` is created in the current working directory. For each step, it contains a folder with summaries for every activation tensor.

For the ORTModule run, add a few lines of code:

```diff
  from onnxruntime.training.ortmodule import ORTModule
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
  model = ORTModule(model)
+ SubscriberManager.subscribe(model, [StatisticsSubscriber("ort_out", override_output_dir=True)])
```

> `StatisticsSubscriber` can be initialized before OR after wrapping the model with ORTModule.

Run the training script up to the step that triggered the divergence. Similarly, a folder named `ort_out` is created in the current working directory.

Run the following command to generate the per-step summary:

```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```

Manually diff the generated per-step summaries to find where the first large difference occurs.
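To narrow down where to start the manual inspection, the `pt_out` and `ort_out` folders can be scanned with a small script. The following is a minimal sketch using only the Python standard library; it is not part of onnxruntime. It assumes that each run folder contains one sub-folder per step, that the step folders and the files inside them have matching names in both runs, and that the summary files are plain text containing numeric statistics. The helper name `first_divergence` and the tolerance values are hypothetical; adjust them to the actual output layout.

```python
import math
import re
from pathlib import Path

# Matches integers, decimals, and scientific notation, e.g. -1.5e-03.
# Tokens such as "nan" or "inf" are not matched and are therefore ignored.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def _numbers(path):
    """Return every numeric token found in a text file, in order."""
    return [float(tok) for tok in _NUMBER.findall(path.read_text())]


def _step_key(path):
    """Sort step folders by the first integer in their name (step_10 after step_2)."""
    match = re.search(r"\d+", path.name)
    return (int(match.group()) if match else -1, path.name)


def first_divergence(pt_dir, ort_dir, rel_tol=1e-3, abs_tol=1e-6):
    """Return (step_folder, file_name) of the first mismatching summary, or None.

    Assumed layout (hypothetical): <run_dir>/<step_folder>/<summary_file>.
    """
    pt_root, ort_root = Path(pt_dir), Path(ort_dir)
    steps = sorted((p for p in pt_root.iterdir() if p.is_dir()), key=_step_key)
    for step in steps:
        for pt_file in sorted(p for p in step.iterdir() if p.is_file()):
            ort_file = ort_root / step.name / pt_file.name
            if not ort_file.is_file():
                continue  # No counterpart in the ORT run; skip rather than guess.
            a, b = _numbers(pt_file), _numbers(ort_file)
            # A different count of numbers, or any pair outside tolerance, is a mismatch.
            if len(a) != len(b) or any(
                not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
                for x, y in zip(a, b)
            ):
                return step.name, pt_file.name
    return None
```

Calling `first_divergence("pt_out", "ort_out")` returns the first step folder and file whose numbers differ beyond the tolerance, which is a reasonable place to begin reading the merged summary. Files are visited in name order within a step, which may not match execution order, so treat the result as a hint rather than proof of which activation diverged first.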