Commit graph

57 commits

Author SHA1 Message Date
PeixuanZuo
12837ba5c7
[ROCm] Update CI based on ubuntu 22.04 (#17076)
- Update ROCm version to ROCm5.6
- Update CI based on ubuntu 22.04
2023-08-10 09:51:29 -07:00
Changming Sun
73ddba964f
Update the MacOS/Linux build scripts that build/install protobuf from source (#16906)
### Description
1. As a follow-up to #16761, this PR allows building ORT on iOS/Android
without the need to explicitly specify a protoc path. #16761 was for
WASM; this one is for iOS/Android.
2. Update the MacOS/Linux build scripts that build/install protobuf from
source, making them more flexible. Add support for
RedHatEnterprise(ubi), which will be needed for upgrading the base image
from centos:7 to ubi:8.
3. Update tools/ci_build/github/pai/rocm-ci-pipeline-env.Dockerfile:
the Dockerfile's base image has protobuf preinstalled in /usr/local, so
we should uninstall it to avoid conflicts.
2023-07-31 10:51:48 -07:00
PeixuanZuo
8ede2f139e
[ROCm] Optimize ROCm CI pipeline 2 (#16691)
- Set `KERNEL_EXPLORER_TEST_USE_CUPY=1` to replace numpy with cupy in
the kernel explorer test.

With `KERNEL_EXPLORER_TEST_USE_CUPY=0`, the CPU utilization is as shown below:

![image](https://github.com/microsoft/onnxruntime/assets/94887879/91724b78-0b4e-4cbd-ad88-83cad9976472)

With `KERNEL_EXPLORER_TEST_USE_CUPY=1`, the CPU utilization is as shown below:

![image](https://github.com/microsoft/onnxruntime/assets/94887879/58239911-667c-4d5f-bb78-deca60d0266f)


- Use `Bash@3`.
- Update shell script.
2023-07-24 13:57:48 +08:00
PeixuanZuo
ebc311365b
[ROCm] Optimize ROCm CI to reduce time (#16620)
This PR mainly optimizes the ROCm CI tests to reduce time and CPU utilization.

- use a smaller batch size in the strided_batched_gemm/batched_gemm tests
- disable the CPU training test
- fix occasional failures of test_e2e_padding_elimination on ROCm.
2023-07-13 10:58:03 +08:00
PeixuanZuo
596dbe277e
[ROCm] add upgrade to fix security issue (#16668) 2023-07-12 17:57:18 +08:00
PeixuanZuo
2fd5e1cc39
[ROCm] fix shell bug (#16641)
Under `set -ex`, the script exits when `grep` finds no matching line, because `grep` then returns a non-zero exit status.
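
The failure mode described here can be reproduced and guarded against with a few lines of shell. The sketch below is a hypothetical illustration, not the change made in this PR: `safe_grep` is an assumed helper name that treats `grep`'s exit status 1 (no match) as success while still propagating real errors (status 2 or higher). Only `-e` is set, since `-x` merely adds command tracing.

```shell
#!/bin/sh
set -e

# Hypothetical helper (not from the PR): run grep, but do not let a
# "no match" result (exit status 1) abort a script running under `set -e`.
safe_grep() {
  rc=0
  # The left-hand side of `||` is exempt from `set -e`, so the script
  # survives long enough to record grep's exit status.
  grep "$@" || rc=$?
  if [ "$rc" -gt 1 ]; then
    return "$rc"   # a real grep error, e.g. an unreadable file
  fi
  return 0
}

# No line matches "error", yet the script keeps running.
printf 'build ok\n' | safe_grep "error"
echo "still running"
```

With a plain `grep "error"` in place of `safe_grep "error"`, the script would stop before reaching the final `echo`. A shorter alternative is `grep ... || true`, at the cost of also hiding genuine `grep` errors.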
2023-07-10 17:31:27 +08:00
PeixuanZuo
cb4bf4f5c8
[ROCm] Move ROCm build step to a CPU-only machine (#16596)
- Move the ROCm build step to a CPU-only machine
- Add the performance data of the huggingface bert-large model on the
MI200
- At the beginning of the test step, check the agent's GPU usage and
kill the threads occupying the GPU, which may be left over from previous
tasks that exited abnormally.
- Use different docker images during the build and test steps. The
difference is the `uid` and `user` used when building the docker image
and creating the docker container.
2023-07-10 11:55:10 +08:00
PeixuanZuo
7e211f0e03
[ROCm] Move mount data step into docker container (#16471)
Some CI jobs may be interrupted unexpectedly and never execute the umount
data step. The data left mounted on the host device then causes
`device or resource busy` errors and makes subsequent CI jobs fail.

Move the mount data step into the docker container so that the host
machine is not left occupied when CI jobs exit incorrectly.
2023-06-26 10:25:06 +08:00
PeixuanZuo
470d6c1cce
[ROCm] Delete unused file to fix Component Governance Alert (#16407)
Delete unused file to fix Component Governance Alert
2023-06-19 11:28:32 -07:00
PeixuanZuo
a95f8ae53c
[ROCm] Update ROCm/MIGraphX CI pipeline (#16215)
MIGraphX CI

- Change docker container user name to `onnxruntimedev`

ROCm CI

- Build the docker image in every job instead of using a prebuilt image.
- Every job creates a container with only one GPU, using the command
`docker run -it --device=/dev/kfd --device=/dev/dri/renderDxxx`
- Remove tests that are unstable or use outdated interfaces.
- Enable training ortmodule test.
2023-06-05 10:28:10 +08:00
PeixuanZuo
af6cb2af87
[ROCm] update ROCm/MIGraphX CI to ROCm5.5 (#15905)
Update ROCm/MIGraphX CI to ROCm5.5.

TODO:
Two PRs to fix failures in
orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py:
- test_gradient_correctness_minmax/test_gradient_correctness_argmax_unfold/test_gradient_correctness_argmax_diagonal
(https://github.com/microsoft/onnxruntime/pull/15903)
- test_ortmodule_attribute_name_collision_warning
(https://github.com/microsoft/onnxruntime/pull/15884)
2023-05-15 10:28:15 +08:00
Baiju Meswani
e464588a0e
Avoid generating training documentation during packaging (#15795) 2023-05-03 19:09:07 -07:00
Changming Sun
d53324d4a7
Update cmake version in a few places (#15775)
### Description
They were missed in #15707 because they are not in the common places for Dockerfiles.

Though this commit updated tools/ci_build/github/pai/rocm-ci-pipeline-env.Dockerfile, it won't automatically take effect. The image needs to be manually generated and pushed to a place, and before doing that our CMakeLists.txt also needs to be tweaked a little bit.
2023-05-02 22:56:28 -07:00
zhijiang
29c74d3c43
softmax perf improvement pr1 - add more softmax related test (#15176)
1. add fp16 test
2. add a test for shapes that are not a power of two.
2023-04-11 17:02:40 +08:00
cloudhan
3b6d551c35
Enable ccache for HIP objects (#14465)
This enables the HIP compiler to be launched with `ccache` when building with `--use_cache`
2023-01-28 22:34:24 +08:00
PeixuanZuo
d3a09cf77f
[ROCm] use pytest-xdist for fast pytest (#14261)
### Description

Use pytest-xdist to distribute tests across multiple CPUs to speed up
test execution.
Use pytest-rerunfailures to rerun failed tests in case pytest-xdist
crashes.
`pytest -n 16` can reduce pytest time from 80 minutes to 20 minutes.


### Motivation and Context
Currently, the kernel explorer pytest run in ROCm CI takes nearly 1 hour 20 minutes. It
will take even longer when we add more tunableOps in the future.
2023-01-13 16:57:50 +08:00
PeixuanZuo
ab2dd8dfaf
[ROCm] Update ROCm and MigraphX CI to ROCm5.4 (#14011)
Update ROCm and MigraphX CI to ROCm5.4
Ran ortmodule_test with ROCm5.4 and all tests
passed (https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=824742&view=logs&j=8292f886-7946-5da9-7977-04484c342eda&t=5de68eaa-cbdc-5be5-13d0-bb946f4ddb2d)

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2022-12-22 10:01:05 +08:00
Changming Sun
04900f96c1
Improve dependency management (#13523)
## Description
1. Convert some git submodules to cmake external projects
2. Update nsync from
[1.23.0](https://github.com/google/nsync/releases/tag/1.23.0) to
[1.25.0](https://github.com/google/nsync/releases/tag/1.25.0)
3. Update re2 from 2021-06-01 to 2022-06-01
4. Update wil from an old commit to 1.0.220914.1 tag
5. Update gtest to a newer commit so that it can optionally leverage
absl/re2 for parsing command line flags.

The following git submodules are deleted:

1. FP16
2. safeint
3. XNNPACK
4. cxxopts
5. dlpack
6. flatbuffers
7. googlebenchmark
8. json
9. mimalloc
10. mp11
11. pthreadpool

More will come.

## Motivation and Context
There are 3 ways of integrating 3rd party C/C++ libraries into ONNX
Runtime:
1. Install them to a system location, then use cmake's find_package
module to locate them.
2. Use git submodules.
3. Use cmake's external projects (externalproject_add).

When this project first started, we considered both option 2
and option 3. We preferred option 2 because:

1. It's easier to handle authentication. At first this project was not
open source, and it had some other non-public dependencies. With git
submodules, ADO handles authentication smoothly. Otherwise we would
need to manually pass tokens around and be very careful not to expose
them in build logs.
2. At that time, cmake fetched dependencies after "cmake" finished
generating vcprojects/makefiles, so it was very difficult to make cflags
consistent. Since cmake 3.11, there is a new command, FetchContent, which
fetches dependencies while it generates vcprojects/makefiles, just before
add_subdirectory, so the parent project's variables/settings can be
easily passed to the child projects.

As the project went on, we had some new concerns:
1. As we started to have more and more EPs and build configs, the number
of submodules grew quickly. For most developers, most ORT submodules are
not relevant. They shouldn't need to download all of them.
2. It is impossible to let two different build configs use two different
versions of the same dependency. For example, right now we have protobuf
3.18.3 in the submodules, so every EP must use the same version.
Whenever we need to upgrade protobuf, we need to coordinate
across the whole team and many external developers. I can't manage it
anymore.
3. Some projects want to manage the dependencies in a different way,
either because of their preference or because of compliance
requirements. For example, some Microsoft teams want to use vcpkg, but
we don't want to force every user of onnxruntime to use vcpkg.
4. Some users want to dynamically link to protobuf, but our build script
only does static linking.
5. It is hard to handle security vulnerabilities. For example, whenever
protobuf has a security patch, we have a lot of things to do. But if we
allowed people to build ORT with a different version of protobuf without
changing ORT's source code, customers who build ORT from source would
be able to act on such things more quickly. They would not need to
wait for ORT to have a patch release.
6. Every time we do a release, GitHub also publishes a source
zip file and a source tarball for us. But they are not usable,
because they lack the submodules.
 
### New features

After this change, users will be able to:
1. Build the dependencies in the way they want, then install them
somewhere (for example, /usr or a temp folder).
2. Or download the dependencies from their official websites by using
cmake commands.
3. Similar to the above, but use private mirrors to mitigate supply
chain risks.
4. Use different versions of the dependencies, as long as our source
code is compatible with them. For example, you can't use
protobuf 3.20.x, as it needs code changes in ONNX Runtime.
5. Only download the things the current build needs.
6. Avoid building external dependencies again and again in every build.

### Breaking change
The onnxruntime_PREFER_SYSTEM_LIB build option has been removed; from now on
you can think of it as ON by default. If you don't like the new behavior, you can set FETCHCONTENT_TRY_FIND_PACKAGE_MODE to NEVER.
Besides, for those who relied on the onnxruntime_PREFER_SYSTEM_LIB build
option, please be aware that this PR changes find_package calls from
Module mode to Config mode. For example, in the past if you had
installed protobuf via apt-get from Ubuntu 20.04's official repo,
find_package could find it and use it. After this PR, it can't. This
is because the protobuf version provided by Ubuntu 20.04 is too old to
support "config mode". This can be resolved by getting a newer version
of protobuf from somewhere.
2022-12-01 09:51:59 -08:00
Edward Chen
9e65f3bfdb
Replace deprecated Python dependency sklearn with scikit-learn. (#13585) 2022-11-08 09:08:29 -08:00
PeixuanZuo
6895918b1c
[ROCm] Revert CI pipeline to ROCm5.2.3 (#13297)
### Description
Unit tests run slower with ROCm5.3 than with ROCm5.2.3, so revert to ROCm5.2.3.
We will update to ROCm5.3 once the issue is resolved by AMD.

2022-10-12 10:47:33 -07:00
PeixuanZuo
4d25b9c8f0
[ROCm] Update ROCm and MIGraphX CI pipeline to ROCm5.3 (#13257)
### Description
1. Update the ROCm pipeline and the MIGraphX pipeline to ROCm5.3.
The ROCm pipeline ran the ortmodule test once, after which the test was disabled:
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=777794&view=logs&j=48b14a85-ff1a-5ca4-53fa-8ea420d27feb&t=9c199f35-fc50-565d-6c65-5162c9bb1b04
2. Add `workspace: clean: all`.


2022-10-11 13:47:22 +08:00
cloudhan
72076b1eb2
Update ROCm CI to use HIP LANGUAGE (#13214)
Update the ROCm CI before relanding tunable GEMM (#12853). This PR also updates
composable kernel to use CMake's HIP language support so that we can
mix C/C++ compilers with the HIP compiler instead of being locked to hip-clang
2022-10-05 16:15:16 +08:00
PeixuanZuo
5e4ebbd9d9
[ROCm] add MIGraphX ci pipeline (#11569)
**Description**: Add a MIGraphX CI pipeline covering the build and unit tests.
This PR is based on #11492

Pipeline :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=765711&view=results
2022-09-28 10:59:30 +08:00
PeixuanZuo
adbc0757ad
[UPDATE] update ROCm ci pipeline to ROCm5.2.3 (#12799)
* [Update] update to rocm5.2.3

* [Fix] cmake version

* [Fix] disable ortmodule tests

* [revert] revert performance number
2022-09-01 10:32:24 +08:00
Vincent Wang
e85e31ee80
Update ORTModule Default Opset Version to 15 (#12419)
* update ortmodule opset to 15

* update torch version

* fix ut

* fix ut

* rollback

* rollback for orttrainer
2022-08-05 16:55:04 +08:00
mindest
add631410a
[ROCm] Re-enable ReduceL1, L2 and related tests (#12209)
Re-enable ReduceL1,L2 and related tests
2022-07-20 13:13:02 +08:00
PeixuanZuo
7b53b223b8
[UPDATE] update AMD CI pipeline to Rocm5.2 with torch1.11 (#12162)
* [UPDATE] update ci to rocm5.2 + torch1.11

* [Revert] disable ort module test

* [DELETE] delete Rocm5.1.1 ci test result

* [UPDATE] update the comments
2022-07-14 16:38:16 +08:00
Changming Sun
d5e34acb82
Remove git and python packages from the docker images used by Zip-Nuget-Java-Nodejs Packaging Pipeline (#11651) 2022-06-03 20:00:54 -07:00
PeixuanZuo
a67994316a
Update rocm ci to ROCm5.1.1 + torch1.10.0
* [UPDATE] update amd ci pipeline to rocm5.1.1

* [FIX] json format error

* [ERROR] disable unit tests

* [FIX] ucx error

* [FIX] cmake version

* [FIX] unit tests
2022-05-20 11:07:21 +08:00
PeixuanZuo
55af7a96a7
update the amd ci pipeline (#10723)
* [TEST] test to get amd pipeline information

* [FIX] lower the threshold

* [UPDATE] add retry task

* [UPDATE] add retry task

* [ERROR] error to occur retry

* [FIX] error

* [UPDATE] update retryCountOnTaskFailure to 1 time

* [UPDATE] add showmeminfo
2022-03-07 18:39:42 +08:00
Xavier Dupré
42c176b60c
Update default opset to 14 in ORTModule (#9743)
* update to torch 1.10
* update torchvision version
* update torchtext version
* remove deprecated option enable_onnx_checker
* add unit test to test gradient of GatherElements
* add ORTMODULE_ONNX_OPSET_VERSION in a docker file
2021-12-09 12:45:35 +01:00
Jeff Daily
3e879aab6b
work around ucx in rocm ci Dockerfile (#9360) 2021-10-14 09:49:31 -07:00
Suffian Khan
e758870b18
Upgrade ROCm CI pipeline for ROCm 4.3.1 and permit run inside container (#9070)
* try to run inside 4.3.1 container

* no \ in container run command

* remove networking options

* try with adding video render groups

* add job to build docker image

* try without 1st stage

* change alpha, beta to float

* try adding service connection

* retain huggingface directory

* static video and render gid

* use runtime expression for variables

* install torch-ort

* pin sacrebleu==1.5.1

* update curves for rocm 4.3.1

* try again

* disable determinism and only check tail of loss curve and with a much larger threshold of 0.05

* disable RoBERTa due to high run variability on ROCm 4.3.1

* put reduction unit tests back in
2021-09-15 12:32:02 -07:00
Changming Sun
ae6fdd3333
Bring code coverage dashboard back (#8394) 2021-08-16 20:54:39 -07:00
raviskolli
f641c0f4e8 Update requirements.txt
Updated requests version to address component governance failure
2021-07-22 14:18:21 -07:00
Suffian Khan
35ca3c99d1
Fix ROCm wheels pipeline after changes to manylinux scripts (#8026)
* update

* try fix rocm pipeline

* avoid already installed error

* ignore python3.10 since build fails

* fix

* try setting user

* try again

* try again

* try again

* fix script

* disable inference docs generation

* try print device id

* fix name qual

* try again

* try again

* try again

* provider_options

* add device verify

* try again

* try again

* try again

* print video/render gid

* try again

* run as root

* try again with uid, gid

* cleanup

* run as root

* temp fix

* add /bin/bash

Co-authored-by: Changming Sun <chasun@microsoft.com>
2021-06-10 21:01:28 -07:00
Jesse Benson
f977644324 ROCM support int reductions 2021-05-17 16:42:06 -07:00
Jesse Benson
be79575c6a Use built-in reduce_sum() for simple reduction cases, specifically reduce all to a scalar. 2021-04-14 08:55:35 -07:00
Weixing Zhang
75c0192e4f
enable more unit tests for ROCM EP (#7307) 2021-04-09 15:15:13 -07:00
Weixing Zhang
c22963c23d
Polish Lamb Kernel (#7299) 2021-04-09 09:55:57 -07:00
Weixing Zhang
8ad5007f8f
Polish Adam kernel (#7294)
* Polish Adam kernel
2021-04-09 01:11:09 -07:00
Jesse Benson
4543459984 MIOpen supports MIOPEN_REDUCE_TENSOR_AVG now. 2021-04-01 16:00:34 -07:00
Weixing Zhang
40fa40f3ce
Enable more unit tests for ROCM EP (#6776)
* enable more ops and unit tests for ROCM EP
2021-02-24 15:20:50 -08:00
Xavier Dupré
d3a2c8c1c7
Support double for operators ReduceMax, ReduceMin (#6265)
* Support double for operators ReduceMax, ReduceMin

* add unit test to pai-excluded-tests.txt

Co-authored-by: xavier dupré <xavier.dupre@gmail.com>
2021-02-08 19:14:26 -08:00
Jesse Benson
d18aa45b46 Enable more ROCM ops that are sharing CUDA code. Some are needed for Turing NLG models. 2021-02-06 14:40:34 -08:00
Jesse Benson
21a47ec8d9 Disable a couple more unsupported tests. 2021-02-04 15:00:05 -08:00
Jesse Benson
0b147702af Update remaining reduction ops to use MIOpen. double datatype is not supported, so disable those typed kernels. 2021-02-04 15:00:05 -08:00
Jesse Benson
a28ddb85b6 Reduction ops. 2021-02-04 15:00:05 -08:00
ashbhandare
85434273ff
Fix CUDA Reduction kernel for ArgMax/ArgMin for when reduction dim=1 (#6490)
* Fix for when reduction dim=1

* Disable test for AMD GPUs

* Specify Async
2021-02-02 09:50:16 -08:00
Suffian Khan
76bc0e479c
Enable dense sequence optimized version of Pytorch exported BERT-L on AMD GPU (#6504)
* Permit dense seq optimization on BERT-L pytorch export by enabling ReduceSumTraining, Equal, and NonZero on AMD

* enable Equal tests

* enable fast_matrix_reduction test case
2021-01-29 13:12:34 -08:00