onnxruntime/orttraining/tools/ci_test
Suffian Khan 4daa14bc74
Fixes to rel-1.9.0 to compile and pass for AMD ROCm (#9144)
* Revert "Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)"

This reverts commit 47888392ab.

* Add BatchNorm kernel for ROCm (#9014)

* Add BatchNorm kernel for ROCm, update BN test

* correct epsilon_ setting; limit min epsilon

* Upgrade ROCm CI pipeline for ROCm 4.3.1 and permit run inside container (#9070)

* try to run inside 4.3.1 container

* no \ in container run command

* remove networking options

* try with adding video render groups

* add job to build docker image

* try without 1st stage

* change alpha, beta to float

* try adding service connection

* retain huggingface directory

* static video and render gid

* use runtime expression for variables

* install torch-ort

* pin sacrebleu==1.5.1

* update curves for rocm 4.3.1

* try again

* disable determinism; only check the tail of the loss curve, with a much larger threshold of 0.05

* disable RoBERTa due to high run variability on ROCm 4.3.1

* put reduction unit tests back in

* Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)

* make work for both rocm 4.2 and rocm 4.3.1

* fix rocm 4.3.1 docker image reference

* fix CUDA_VERSION to ROCM_VERSION

* fix ReduceConsts conflict def

* add ifdef to miopen_common.h as well

* trailing ws

Co-authored-by: wangye <wangye@microsoft.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
2021-09-21 18:07:07 -07:00
results Fixes to rel-1.9.0 to compile and pass for AMD ROCm (#9144) 2021-09-21 18:07:07 -07:00
compare_huggingface.py Fixes to rel-1.9.0 to compile and pass for AMD ROCm (#9144) 2021-09-21 18:07:07 -07:00
compare_results.py Introduce training changes. 2020-03-11 14:39:03 -07:00
download_azure_blob_archive.py Cache build docker images in container registry. (#5811) 2020-11-17 17:02:24 -08:00
run_batch_size_test.py Add BERT-L perf regression test on MI100 and re-enable batch size test (#7240) 2021-04-05 15:51:52 -07:00
run_bert_perf_test.py Add BERT-L perf regression test on MI100 and re-enable batch size test (#7240) 2021-04-05 15:51:52 -07:00
run_convergence_test.py Add nightly pipeline for MI100 to run convergence and batch size test similar to V100. (#6611) 2021-02-12 13:22:06 -08:00
run_gpt2_perf_test.py Opset12 upgrade for existing models used by perf/e2e pipelines (#4238) 2020-06-15 14:26:53 -07:00
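
The commit message above mentions checking only the tail of the loss curve with a threshold of 0.05. As a rough illustration of that idea, and not the code in compare_huggingface.py, the following sketch compares the last few points of two loss curves. The function name `tail_within_threshold`, the `tail_length` parameter, and the choice of an absolute (rather than relative) difference are all assumptions made for this example.

```python
from typing import Sequence


def tail_within_threshold(
    actual: Sequence[float],
    expected: Sequence[float],
    tail_length: int = 10,
    threshold: float = 0.05,
) -> bool:
    """Return True if the last `tail_length` points of both curves
    differ by at most `threshold` (absolute difference, an assumption)."""
    if tail_length <= 0:
        raise ValueError("tail_length must be positive")
    if len(actual) != len(expected):
        raise ValueError("loss curves must have the same length")
    # Early steps are ignored on purpose: only the end of training is compared.
    tail_actual = actual[-tail_length:]
    tail_expected = expected[-tail_length:]
    return all(abs(a - e) <= threshold for a, e in zip(tail_actual, tail_expected))
```

Comparing only the tail tolerates run-to-run noise early in training, which matters once determinism is disabled, while still catching a run that converges to a different final loss.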