4 commits

Author SHA1 Message Date
PeixuanZuo
4b2b588895
[ROCm] Fix azcopy issue on ROCm ci pipeline (#13365)
### Description

Use a SAS token to fix the error `failed to perform copy command due to error:
no SAS token or OAuth token is present and the resource is not public`.

Generate a SAS token for the target data, add it to Key Vault, and use it as
a pipeline variable.



Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2022-10-20 12:08:57 +08:00
Suffian Khan
9f14af9809
Add BERT-L perf regression test on MI100 and re-enable batch size test (#7240)
* restore bs test and add perf test

* update perf number and fix path to results
2021-04-05 15:51:52 -07:00
Suffian Khan
f27835c4de
Disable batch size test for AMD CI pipeline after agent upgrade to ROCm 4.1 (#7153)
* disable batch size test for rocm 4.1 until resolved

* Update orttraining-pai-ci-pipeline.yml

Forgot to modify both pipelines
2021-03-26 22:32:39 -05:00
Suffian Khan
e6de0eb813
Add nightly pipeline for MI100 to run convergence and batch size test similar to V100. (#6611)
* Partial updating of ROCM reduction code.

* Update reduction_all.cu

* Add reduce template parameters.

* miopen common

* Reuse CUDA's reduction_functions.cc

* Reduction ops.

* Update remaining reduction ops to use MIOpen. The double datatype is not supported, so disable those typed kernels.

* Disable a couple more unsupported tests.

* Code formatting.

* Delete ROCM-specific reduction code that is identical to CUDA reduction code.

* Fix scratch buffer early free.

* Fix merge conflict.

* first attempt nightly amd ci pipeline

* try fix bad yaml file

* try again with corrected model directory

* add convergence test as well

* update reference loss for amd mi100

* include mi100 test results csv

* update the mi100 convergence test reference values

* update batch sizes for mi100 32g

* fix gpu sku for run_convergence_test.py

* undo unrelated changes to master

* pr comments

* pr comment

Co-authored-by: Jesse Benson <jesseb@microsoft.com>
2021-02-12 13:22:06 -08:00