onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-18 21:21:17 +00:00

Author	SHA1	Message	Date
PeixuanZuo	80a046b36f	[ROCm] update amd CI huggingface model performance number (#13961 ) Fix CI test failure. Test distilbert-base model performance number on gcramdrr1-mi100-08x and update.	2022-12-14 16:30:25 +08:00
PeixuanZuo	4b2b588895	[ROCm] Fix azcopy issue on ROCm ci pipeline (#13365 ) ### Description Use SAS Token to fix error `failed to perform copy command due to error: no SAS token or OAuth token is present and the resource is not public`. Generate SAS Token of target data, add it into Key vault, and use it as Pipeline Variable. Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-10-20 12:08:57 +08:00
PeixuanZuo	6895918b1c	[ROCm] Revert CI pipeline to ROCm5.2.3 (#13297 ) ### Description Unit test with ROCm5.3 slower than ROCm5.2.3. Revert to ROCm5.2.3. We will update to ROCm5.3 when the issue is resolved by AMD.	2022-10-12 10:47:33 -07:00
PeixuanZuo	4d25b9c8f0	[ROCm] Update ROCm and MIGraphX CI pipeline to ROCm5.3 (#13257 ) ### Description 1. Update ROCm pipeline and MIGraphX pipeline to ROCm5.3. ROCm pipeline run ortmodule test one time and disable it: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=777794&view=logs&j=48b14a85-ff1a-5ca4-53fa-8ea420d27feb&t=9c199f35-fc50-565d-6c65-5162c9bb1b04 2. Add `workspace: clean: all`.	2022-10-11 13:47:22 +08:00
PeixuanZuo	adbc0757ad	[UPDATE] update ROCm ci pipeline to ROCm5.2.3 (#12799 ) * [Update] update to rocm5.2.3 * [Fix] cmake version * [Fix] disable ortmodule tests * [revert] revert performance number	2022-09-01 10:32:24 +08:00
PeixuanZuo	7b53b223b8	[UPDATE] update AMD CI pipeline to Rocm5.2 with torch1.11 (#12162 ) * [UPDATE] update ci to rocm5.2 + torch1.11 * [Revert] disable ort module test * [DELETE] delete Rocm5.1.1 ci test result * [UPDATE] update the comments	2022-07-14 16:38:16 +08:00
PeixuanZuo	a67994316a	Update rocm ci to ROCm5.1.1 + torch1.10.0 * [UPDATE] update amd ci pipeline 2 rocm5.1.1 * [FIX] json format error * [ERROR] disable unit tests * [FIX] ucx error * [FIX] cmake version * [FIX] units test	2022-05-20 11:07:21 +08:00
Justin Chu	fdce4fa6af	Format all python files under onnxruntime with black and isort (#11324 ) Description: Format all python files under onnxruntime with black and isort. After checking in, we can use .git-blame-ignore-revs to ignore the formatting PR in git blame. #11315, #11316	2022-04-26 09:35:16 -07:00
PeixuanZuo	55af7a96a7	update the amd ci pipeline (#10723 ) * [TEST] test to get amd pipeline information * [FIX] lower the threshold * [UPDATE] add retry task * [UPDATE] add retry task * [ERROR] error to occur retry * [FIX] error * [UPDATE] update retryCountOnTaskFailure to 1 time * [UPDATE] add showmeminfo	2022-03-07 18:39:42 +08:00
Baiju Meswani	f9b6eef05f	orttraining packaging pipeline for rocm 5.0.1 (#10725 )	2022-03-02 12:32:14 -08:00
Suffian Khan	6f580f07de	Switch AMD CI pipeline to use environment image from onnxruntimecibuildenvironment (#9206 ) * shift docker image reference for amd ci pipeline * fix service endpoint * reduce perf tolerance	2021-09-28 13:06:16 -07:00
ytaous	d3f859fe30	Dropout Vectorized Kernel (#9157 ) * vectorized kernel * fix build * re-calibrate expected loss * fix build * re-calibrate convergence results * more re-calibrate on loss * divide kernels * address comments * more calibration * calibration * per comments * enable sync Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2021-09-27 17:19:12 -07:00
Suffian Khan	e758870b18	Upgrade ROCm CI pipeline for ROCm 4.3.1 and permit run inside container (#9070 ) * try to run inside 4.3.1 container * no \ in container run command * remove networking options * try with adding video render groups * add job to build docker image * try without 1st stage * change alpha, beta to float * try adding service connection * retain huggingface directory * static video and render gid * use runtime expression for variables * install torch-ort * pin sacrebleu==1.5.1 * update curves for rocm 4.3.1 * try again * disable determinism and only check tail of loss curve and with a much larger threshold of 0.05 * disable RoBERTa due to high run variability on ROCm 4.3.1 * put reduction unit tests back in	2021-09-15 12:32:02 -07:00
Suffian Khan	00b0a9c127	Add hugging-face models loss curve and performance guards to ROCm CI pipeline. (#8915 ) * test running hf bert-large * try again * try again * include other models * correct names * disable deberta-v2-xxlarge * avoid torch.distributed * add compare json loss and perf for bert-large to test * fix sed expression * remove pytest * add more models * move unit tests u * display samples/sec	2021-09-01 09:03:10 -07:00
Jesse Benson	29c68888af	Update BERT convergence baseline.	2021-05-25 17:11:46 -07:00
Suffian Khan	9f14af9809	Add BERT-L perf regression test on MI100 and re-enable batch size test (#7240 ) * restore bs test and add perf test * update perf number and fix path to results	2021-04-05 15:51:52 -07:00
Suffian Khan	e6de0eb813	Add nightly pipeline for MI100 to run convergence and batch size test similar to V100. (#6611 ) * Partial updating of ROCM reduction code. * Update reduction_all.cu * Add reduce template parameters. * miopen common * Reuse CUDA's reduction_functions.cc * Reduction ops. * Update remaining reduction ops to use MIOpen. double datatype is not supported, so disable those typed kernels. * Disable a couple more unsupported tests. * Code formatting. * Delete ROCM-specific reduction code that is identical to CUDA reduction code. * Fix scratch buffer early free. * Fix merge conflict. * first attempt nightly amd ci pipeline * try fix bad yaml file * try again with corrected model directory * add convergence test as well * update reference loss for amd mi100 * include mi100 test results csv * update the mi100 convergence test reference values * update batch sizes for mi100 32g * fix gpu sku for run_convergence_test.py * undo unrelated changes to master * pr comments * pr comment Co-authored-by: Jesse Benson <jesseb@microsoft.com>	2021-02-12 13:22:06 -08:00
Edward Chen	71e7c2b423	Cache build docker images in container registry. (#5811 ) This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry. Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date and building images every time wastes build agent resources. With this change, a given build image can be looked up in a cache container registry and if present, pulled, and otherwise, built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository. The cache container registry will need to be cleaned up periodically. This is not automated yet.	2020-11-17 17:02:24 -08:00
liqunfu	1416d12f0b	Liqun/merge e2e pipelines (#5702 ) * Create an Azure Pipeline to merge cpp and python e2e pipelines into one. Still keep cpp e2e pipeline until this new pipeline is stable. Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-11 09:42:08 -08:00
M. Zeeshan Siddiqui	f2168cef29	Misc. cleanup. (#5659 ) Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-02 07:05:28 -08:00
M. Zeeshan Siddiqui	9af0d48524	Memory planner and pattern generation enhancements. (#4443 ) * static allocation. * changes. * contiguous dynamic allocation. * contiguous dynamic allocation. * fix bugs. * fix bug. * build errors. * PR feedback. * PR feedback. * Update Graph builder for nccl_allreduce, mps. * misc. * fix windows build break. * changes. * fine-grained memory-time scheduling. * merge. * fix misc stuff. * fix windows build. * fix windows build. * fix merge bug. * merge conflicts. * revert onnx-tensorrt submodule commit. * fix submodule commit. * misc. * merge conflicts. * Revert "merge conflicts." This reverts commit `319a071a6e`. * merge conflict. * merge conflict. * merge conflicts. * fixes. * PR feedback. * build break. * build break. * Add asserts. * Add asserts. * asserts. * asserts. * asserts. * asserts. * asserts. * fixes. * fixes. Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net> Co-authored-by: root <root@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-11-01 23:05:46 -08:00
liqunfu	92662659ba	Liqun/remove number matching (#5606 ) replace number matching with relaxed comparison in frontend tests Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-27 21:27:37 -07:00
Sherlock	60dbd8a1e5	Update maximum batch size for UT; Include recompute modes (#5444 ) * Update MaxBatchSize and include recompute mode * Minor fix for frontend test Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-10-12 14:50:43 -07:00
Sherlock	37445d1198	Update Bert Perf Script (#5339 ) Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-09-30 14:30:20 -07:00
Vincent Wang	7fb194d03d	Update convergence baseline for ci_test. (#4465 ) Co-authored-by: Vincent Wang <weicwang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2020-07-09 15:29:36 +08:00
ytaous	4380b8ba68	adjust bs size (#4375 ) Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-06-30 10:29:48 -07:00
edgchen1	63bf587623	Use azcopy to download test data (#4221 ) Use azcopy from download_e2e_test_data.py, add helper function for downloading azcopy. Update download_test_data.py to use helper function.	2020-06-16 10:14:34 -07:00
ytaous	5d28efd434	opset12 code cleanup (#4242 ) * opset12 code cleanup * opset12 code cleanup Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-06-15 19:45:35 -07:00
ytaous	e0334f177c	Opset12 upgrade for existing models used by perf/e2e pipelines (#4238 ) * opset12 support * opset12 support * on comments Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-06-15 14:26:53 -07:00
edgchen1	ba74914c5a	Remove evaluation output from training e2e test baseline data. (#4092 )	2020-06-01 15:06:21 -07:00
ytaous	72d508b7a0	New perf metric - e2e throughput (#4085 ) * new metric * on comments * tab to spaces Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-06-01 12:11:34 -07:00
edgchen1	38d76cc904	Clean up training E2E test (#4078 ) Update training E2E build to not go through CTest and call test scripts directly.	2020-05-29 09:20:47 -07:00
ytaous	fb4efafc8e	GPT-2 training perf scripts (#3974 ) * gpt2 training perf * gpt2 training perf * debug * debug * debug * fix bug * minor * on comments * dynamic sql * fix build * minor * linked hash * on comments * minor * mem * minor Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-05-19 10:21:40 -07:00
ytaous	93eb9bcfde	Add yaml/perf scripts for new perf test pipeline (#3909 ) * yaml/perf scripts for new pipeline * yaml/perf scripts for new pipeline * remove unused imports * testing some comments change * testing some comments change * testing jdbc * testing jdbc * testing jdbc * exclude pwd from jdbc properties * exclude pwd from jdbc properties * namedtuple * on comments Co-authored-by: Ethan Tao <ettao@microsoft.com>	2020-05-13 14:15:17 -07:00
pengwa	2c7c45076b	MaxBatchSize E2E Test (#3454 ) * max batch size e2e test * update test data snapshot	2020-04-15 09:50:44 +08:00
Edward Chen	95707d22a5	Disable gradient clipping for E2E test.	2020-04-06 23:07:28 +00:00
Jesse Benson	3a7539e071	Update bert-base convergence values	2020-03-13 23:03:34 -07:00
Edward Chen	e542cfd0e0	Introduce training changes.	2020-03-11 14:39:03 -07:00

38 commits