onnxruntime/orttraining/tools/ci_test/results/bert_base.convergence.baseline.mi100.csv at aa60a8368f52b52d0d190847682aa2eb503ddeaa

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-28 03:20:58 +00:00

Add nightly pipeline for MI100 to run convergence and batch size test similar to V100. (#6611)

* Partial updating of ROCM reduction code.

* Update reduction_all.cu

* Add reduce template parameters.

* miopen common

* Reuse CUDA's reduction_functions.cc

* Reduction ops.

* Update remaining reduction ops to use MIOpen. double datatype is not supported, so disable those typed kernels.

* Disable a couple more unsupported tests.

* Code formatting.

* Delete ROCM-specific reduction code that is identical to CUDA reduction code.

* Fix scratch buffer early free.

* Fix merge conflict.

* first attempt nightly amd ci pipeline

* try fix bad yaml file

* try again with corrected model directory

* add convergence test as well

* update reference loss for amd mi100

* include mi100 test results csv

* update the mi100 convergence test reference values

* update batch sizes for mi100 32g

* fix gpu sku for run_convergence_test.py

* undo unrelated changes to master

* pr comments

* pr comment

Co-authored-by: Jesse Benson <jesseb@microsoft.com>

2021-02-12 13:22:06 -08:00

307 B


step,total_loss,mlm_loss,nsp_loss
0,11.217,10.5178,0.699256
5,9.67644,7.52047,2.15598
10,8.31964,7.54136,0.778281
15,8.22823,7.54625,0.681978
20,8.17299,7.49675,0.676236
25,8.2415,7.5356,0.705902
30,8.0874,7.39312,0.694279
35,7.99095,7.25612,0.734829
40,7.92988,7.25608,0.673804
45,7.94762,7.27291,0.674713
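
As an illustration of how a baseline like this might be consumed, the following is a minimal sketch that parses the values listed above (written as comma-separated text) and compares a run's losses against them step by step. It is not the logic of `run_convergence_test.py`, which this page does not show. The helper names `load_baseline` and `find_mismatches` and the 1% relative tolerance are assumptions made for the example.

```python
import csv
import io
import math

# Baseline values copied from the file shown above.
BASELINE_CSV = """\
step,total_loss,mlm_loss,nsp_loss
0,11.217,10.5178,0.699256
5,9.67644,7.52047,2.15598
10,8.31964,7.54136,0.778281
15,8.22823,7.54625,0.681978
20,8.17299,7.49675,0.676236
25,8.2415,7.5356,0.705902
30,8.0874,7.39312,0.694279
35,7.99095,7.25612,0.734829
40,7.92988,7.25608,0.673804
45,7.94762,7.27291,0.674713
"""


def load_baseline(text):
    """Parse CSV text into {step: {column_name: value}} (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        int(row["step"]): {k: float(v) for k, v in row.items() if k != "step"}
        for row in reader
    }


def find_mismatches(baseline, observed, rel_tol=1e-2):
    """Return (step, column, expected, actual) for each value outside rel_tol.

    The tolerance is an assumed value for illustration only.
    """
    mismatches = []
    for step, expected in baseline.items():
        actual = observed.get(step)
        if actual is None:
            mismatches.append((step, "missing", None, None))
            continue
        for name, want in expected.items():
            got = actual[name]
            if not math.isclose(got, want, rel_tol=rel_tol):
                mismatches.append((step, name, want, got))
    return mismatches


baseline = load_baseline(BASELINE_CSV)

# Sanity check: in every row, total_loss equals mlm_loss + nsp_loss
# up to the rounding of the printed values.
for step, row in baseline.items():
    assert math.isclose(
        row["total_loss"], row["mlm_loss"] + row["nsp_loss"], abs_tol=1e-3
    ), step
```

Calling `find_mismatches(baseline, observed)` with `observed` in the same `{step: {column: value}}` shape returns an empty list when every logged loss is within tolerance of the baseline.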
