onnxruntime/orttraining/orttraining/python/training/utils
pengwa 3e954da3e6
Fix and enable few ORTModule Unit Tests (#19847)
### Fix and enable few ORTModule Unit Tests

Fix 'test_bert_inputs_with_dynamic_shape' and
'test_bert_result_with_layerwise_recompute', which generated NaN loss in the
ORT run.

The root cause is that the logic generating the attention mask test data was
not correct: only 0 or 1 is allowed in the mask, but the generated data
contained many other values. (We did not hit this with older versions of
transformers, for example v4.4.2 or 4.16.2, because they do not contain
the commit
d3cb28886a,
which increases the mask scaling to a larger number, causing an overflow to
inf.)
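
To make the failure mode concrete, here is a minimal standard-library sketch, not the actual test code. It assumes the additive mask is computed as `(1 - mask) * dtype_min`, with `dtype_min` being either a small constant such as -10000 (older behavior) or the most negative finite float32 value (newer behavior). The helper names `make_attention_mask`, `additive_mask_value`, and `overflows_float32` are hypothetical and exist only for illustration.

```python
import random

FLOAT32_MAX = 3.4028234663852886e38  # largest finite float32 value


def make_attention_mask(batch_size, seq_len, seed=0):
    """Hypothetical helper: build a random mask whose entries are only 0 or 1."""
    rng = random.Random(seed)
    # random.randint is inclusive on both ends, so values are exactly 0 or 1.
    return [[rng.randint(0, 1) for _ in range(seq_len)] for _ in range(batch_size)]


def additive_mask_value(mask_entry, dtype_min=-FLOAT32_MAX):
    """Assumed scaling: turn a mask entry into an additive attention bias."""
    return (1.0 - mask_entry) * dtype_min


def overflows_float32(value):
    """True if the value cannot be represented as a finite float32."""
    return abs(value) > FLOAT32_MAX


mask = make_attention_mask(batch_size=2, seq_len=8)

# Valid entries (0 or 1) stay within the float32 range under either scaling.
valid_ok = all(
    not overflows_float32(additive_mask_value(v)) for row in mask for v in row
)

# An invalid entry such as 3 gives (1 - 3) * dtype_min = -2 * dtype_min.
bad_with_large_scale = overflows_float32(additive_mask_value(3))
bad_with_small_scale = overflows_float32(additive_mask_value(3, dtype_min=-10000.0))
```

Under these assumptions, an out-of-range mask entry is harmless with the small constant but exceeds the float32 range with the larger scaling, which is consistent with the inf and NaN loss described above.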

Another improvement made during the investigation with the convergence tools:
do not dump activations during the model export phase. Otherwise, the
dumped data might contain results from the PyTorch run, which is confusing
when comparing against stock PyTorch run results.
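
The idea of skipping dumps while the model is being exported can be sketched as follows. This is a hypothetical, framework-free illustration and not the ORTModule hook implementation: the `ActivationDumper` class, its `export_phase` context manager, and `on_activation` are invented names, and a real hook would write tensors to disk rather than append to a list.

```python
from contextlib import contextmanager


class ActivationDumper:
    """Hypothetical stand-in for a hook that records activations."""

    def __init__(self):
        self.exporting = False  # set to True only while the model is exported
        self.dumped = []        # (name, value) pairs recorded outside export

    @contextmanager
    def export_phase(self):
        """Mark the enclosed block as the export phase."""
        self.exporting = True
        try:
            yield
        finally:
            # Always reset the flag, even if export raises.
            self.exporting = False

    def on_activation(self, name, value):
        """Record the activation unless an export is in progress."""
        if not self.exporting:
            self.dumped.append((name, value))
        return value  # pass the activation through unchanged


dumper = ActivationDumper()

with dumper.export_phase():
    dumper.on_activation("layer0", 1.0)  # skipped: produced during export

dumper.on_activation("layer0", 2.0)      # recorded: produced by the real run
```

With this guard, the dumped data contains only values from the run being compared, not values produced while tracing the model for export.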


2024-03-12 10:49:19 +08:00
..
data
hooks Fix and enable few ORTModule Unit Tests (#19847) 2024-03-12 10:49:19 +08:00
__init__.py Improve memory matrix for ORTModule (#19620) 2024-02-28 15:57:05 +08:00
ptable.py Allow layer-wise recompute (#18566) 2023-12-12 08:44:05 +08:00
torch_io_helper.py Improve perf for stage3 training (#18099) 2023-12-15 13:32:19 +08:00
torch_profile_utils.py Improve memory matrix for ORTModule (#19620) 2024-02-28 15:57:05 +08:00
torch_type_map.py