onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-02 23:39:58 +00:00

Author	SHA1	Message	Date
kailums	1b38c05544	change ci docker image to rocm6.1 (#21296) There is a bug for kernel running on rocm6.0, so change ci docker image to rocm6.1 For the torch installed in the docker image, change to rocm repo when it is not 6.0 version.	2024-07-18 14:50:01 +08:00
PeixuanZuo	7a454acd61	[ROCm] Update CI/Packaging pipeline to ROCm6.0 (#18985) Update CI/Packaging pipeline to ROCm6.0	2024-01-03 17:25:15 +08:00
PeixuanZuo	2ef6ee674c	[ROCm] Update ROCm and MIGraphX CI to ROCm5.7 (#17834) - Update ROCm and MIGraphX CI to ROCm5.7 - Simplify test exclude file. Some tests will output `registered execution providers ROCMExecutionProvider were unable to run the model.` if they cannot run. - Add `enable_training` build argument for MIGraphX pipeline.	2023-10-09 10:29:11 +08:00
PeixuanZuo	12837ba5c7	[ROCm] Update CI based on ubuntu 22.04 (#17076 ) - Update ROCm version to ROCm5.6 - Update CI based on ubuntu 22.04	2023-08-10 09:51:29 -07:00
PeixuanZuo	cb4bf4f5c8	[ROCm] Move ROCm build step on CPU only machine (#16596 ) - Move ROCm build step on CPU only machine - Add the performance data of the huggingface bert-large model on the MI200 - At the beginning of the test step, check the agent's GPU usage and kill the threads occupying the GPU, which may be left over from previous tasks that exited abnormally. - Use different docker images during the build and test steps. The difference is the `uid` and `user` when build docker image and create docker container.	2023-07-10 11:55:10 +08:00
PeixuanZuo	af6cb2af87	[ROCm] update ROCm/MIGraphX CI to ROCm5.5 (#15905) update ROCm/MIGraphX CI to ROCm5.5. TODO: two PRs to fix failures in orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py - test_gradient_correctness_minmax/test_gradient_correctness_argmax_unfold/test_gradient_correctness_argmax_diagonal (https://github.com/microsoft/onnxruntime/pull/15903) - test_ortmodule_attribute_name_collision_warning (https://github.com/microsoft/onnxruntime/pull/15884)	2023-05-15 10:28:15 +08:00
PeixuanZuo	56bccac35d	[ROCm] update bert-L convergence reference file to fix CI (#15200) The change to layernorm led to a change in the bert-L convergence result.	2023-03-24 21:43:44 +08:00
PeixuanZuo	ab2dd8dfaf	[ROCm] Update ROCm and MigraphX CI to ROCm5.4 (#14011 ) Update ROCm and MigraphX CI to ROCm5.4 Run ortmodule_test with ROCm5.4 and all passed(https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=824742&view=logs&j=8292f886-7946-5da9-7977-04484c342eda&t=5de68eaa-cbdc-5be5-13d0-bb946f4ddb2d). Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 10:01:05 +08:00
PeixuanZuo	80a046b36f	[ROCm] update amd CI huggingface model performance number (#13961 ) Fix CI test failure. Test distilbert-base model performance number on gcramdrr1-mi100-08x and update.	2022-12-14 16:30:25 +08:00
PeixuanZuo	6895918b1c	[ROCm] Revert CI pipeline to ROCm5.2.3 (#13297) Unit tests with ROCm5.3 are slower than with ROCm5.2.3. Revert to ROCm5.2.3. We will update to ROCm5.3 when the issue is resolved by AMD.	2022-10-12 10:47:33 -07:00
PeixuanZuo	4d25b9c8f0	[ROCm] Update ROCm and MIGraphX CI pipeline to ROCm5.3 (#13257) 1. Update ROCm pipeline and MIGraphX pipeline to ROCm5.3 ROCm pipeline run ortmodule test one time and disable it : https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=777794&view=logs&j=48b14a85-ff1a-5ca4-53fa-8ea420d27feb&t=9c199f35-fc50-565d-6c65-5162c9bb1b04 2. Add `workspace: clean: all `.	2022-10-11 13:47:22 +08:00
PeixuanZuo	adbc0757ad	[UPDATE] update ROCm ci pipeline to ROCm5.2.3 (#12799) * [Update] update to rocm5.2.3 * [Fix] cmake version * [Fix] disable ortmodule tests * [revert] revert performance number	2022-09-01 10:32:24 +08:00
PeixuanZuo	7b53b223b8	[UPDATE] update AMD CI pipeline to Rocm5.2 with torch1.11 (#12162 ) * [UPDATE] update ci to rocm5.2 + torch1.11 * [Revert] disable ort module test * [DELETE] delete Rocm5.1.1 ci test result * [UPDATE] update the comments	2022-07-14 16:38:16 +08:00
PeixuanZuo	a67994316a	Update rocm ci to ROCm5.1.1 + torch1.10.0 * [UPDATE] update amd ci pipeline 2 rocm5.1.1 * [FIX] json format error * [ERROR] disable unit tests * [FIX] ucx error * [FIX] cmake version * [FIX] units test	2022-05-20 11:07:21 +08:00
PeixuanZuo	55af7a96a7	update the amd ci pipeline (#10723 ) * [TEST] test to get amd pipeline information * [FIX] lower the threshold * [UPDATE] add retry task * [UPDATE] add retry task * [ERROR] error to occur retry * [FIX] error * [UPDATE] update retryCountOnTaskFailure to 1 time * [UPDATE] add showmeminfo	2022-03-07 18:39:42 +08:00
ytaous	d3f859fe30	Dropout Vectorized Kernel (#9157) * vectorized kernel * fix build * re-calibrate expected loss * fix build * re-calibrate convergence results * more re-calibrate on loss * divide kernels * address comments * more calibration * calibration * per comments * enable sync Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2021-09-27 17:19:12 -07:00
Suffian Khan	e758870b18	Upgrade ROCm CI pipeline for ROCm 4.3.1 and permit run inside container (#9070) * try to run inside 4.3.1 container * no \ in container run command * remove networking options * try with adding video render groups * add job to build docker image * try without 1st stage * change alpha, beta to float * try adding service connection * retain huggingface directory * static video and render gid * use runtime expression for variables * install torch-ort * pin sacrebleu==1.5.1 * update curves for rocm 4.3.1 * try again * disable determinism and only check tail of loss curve and with a much larger threshold of 0.05 * disable RoBERTa due to high run variability on ROCm 4.3.1 * put reduction unit tests back in	2021-09-15 12:32:02 -07:00
Suffian Khan	00b0a9c127	Add hugging-face models loss curve and performance guards to ROCm CI pipeline. (#8915 ) * test running hf bert-large * try again * try again * include other models * correct names * disable deberta-v2-xxlarge * avoid torch.distributed * add compare json loss and perf for bert-large to test * fix sed expression * remove pytest * add more models * move unit tests u * display samples/sec	2021-09-01 09:03:10 -07:00
Jesse Benson	29c68888af	Update BERT convergence baseline.	2021-05-25 17:11:46 -07:00
Suffian Khan	e6de0eb813	Add nightly pipeline for MI100 to run convergence and batch size test similar to V100. (#6611 ) * Partial updating of ROCM reduction code. * Update reduction_all.cu * Add reduce template parameters. * miopen common * Reuse CUDA's reduction_functions.cc * Reduction ops. * Update remaining reduction ops to use MIOpen. double datatype is not supported, so disable those typed kernels. * Disable a couple more unsupported tests. * Code formatting. * Delete ROCM-specific reduction code that is identical to CUDA reduction code. * Fix scratch buffer early free. * Fix merge conflict. * first attempt nightly amd ci pipeline * try fix bad yaml file * try again with corrected model directory * add convergence test as well * update reference loss for amd mi100 * include mi100 test results csv * update the mi100 convergence test reference values * update batch sizes for mi100 32g * fix gpu sku for run_convergence_test.py * undo unrelated changes to master * pr comments * pr comment Co-authored-by: Jesse Benson <jesseb@microsoft.com>	2021-02-12 13:22:06 -08:00
Vincent Wang	7fb194d03d	Update convergence baseline for ci_test. (#4465 ) Co-authored-by: Vincent Wang <weicwang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-07-09 15:29:36 +08:00
ytaous	5d28efd434	opset12 code cleanup (#4242 ) * opset12 code cleanup * opset12 code cleanup Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-06-15 19:45:35 -07:00
ytaous	e0334f177c	Opset12 upgrade for existing models used by perf/e2e pipelines (#4238 ) * opset12 support * opset12 support * on comments Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-06-15 14:26:53 -07:00
edgchen1	ba74914c5a	Remove evaluation output from training e2e test baseline data. (#4092 )	2020-06-01 15:06:21 -07:00
Jesse Benson	3a7539e071	Update bert-base convergence values	2020-03-13 23:03:34 -07:00
Edward Chen	e542cfd0e0	Introduce training changes.	2020-03-11 14:39:03 -07:00

26 commits