Commit graph

42 commits

Author SHA1 Message Date
PeixuanZuo
d3a09cf77f
[ROCm] use pytest-xdist for fast pytest (#14261)
### Description

Use pytest-xdist to distribute tests across multiple CPUs to speed up
test execution.
Use pytest-rerunfailures to rerun failed test in case of pytest-xdist
crash.
`pytest -n 16` can reduce pytest time from 80 minutes to 20 minutes.


### Motivation and Context
Now kernel explorer pytest of ROCm CI takes nearly 1 hour 20 minutes. It
will take longer time when we add more tunableOp in the future.
2023-01-13 16:57:50 +08:00
PeixuanZuo
ab2dd8dfaf
[ROCm] Update ROCm and MigraphX CI to ROCm5.4 (#14011)
Update ROCm and MigraphX CI to ROCm5.4
Run ortmodule_test with ROCm5.4 and all
passed(https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=824742&view=logs&j=8292f886-7946-5da9-7977-04484c342eda&t=5de68eaa-cbdc-5be5-13d0-bb946f4ddb2d).

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2022-12-22 10:01:05 +08:00
Changming Sun
04900f96c1
Improve dependency management (#13523)
## Description
1. Convert some git submodules to cmake external projects
2. Update nsync from
[1.23.0](https://github.com/google/nsync/releases/tag/1.23.0) to
[1.25.0](https://github.com/google/nsync/releases/tag/1.25.0)
3. Update re2 from 2021-06-01 to 2022-06-01
4. Update wil from an old commit to 1.0.220914.1 tag
5. Update gtest to a newer commit so that it can optionally leverage
absl/re2 for parsing command line flags.

The following git submodules are deleted:

1. FP16
2. safeint
3. XNNPACK
4. cxxopts
5. dlpack
7. flatbuffers
8. googlebenchmark
9. json
10. mimalloc
11. mp11
12. pthreadpool

More will come.

## Motivation and Context
There are 3 ways of integrating 3rd party C/C++ libraries into ONNX
Runtime:
1. Install them to a system location, then use cmake's find_package
module to locate them.
2.  Use git submodules 
6.  Use cmake's external projects(externalproject_add). 

At first when this project was just started, we considered both option 2
and option 3. We preferred option 2 because:

1. It's easier to handle authentication. At first this project was not
open source, and it had some other non-public dependencies. If we use
git submodule, ADO will handle authentication smoothly. Otherwise we
need to manually pass tokens around and be very careful on not exposing
them in build logs.
2. At that time, cmake fetched dependencies after "cmake" finished
generating vcprojects/makefiles. So it was very difficult to make cflags
consistent. Since cmake 3.11, it has a new command: FetchContent, which
fetches dependencies when it generates vcprojects/makefiles just before
add_subdirectories, so the parent project's variables/settings can be
easily passed to the child projects.

And when the project went on,  we had some new concerns:
1. As we started to have more and more EPs and build configs, the number
of submodules grew quickly. For more developers, most ORT submodules are
not relevant to them. They shouldn't need to download all of them.
2. It is impossible to let two different build configs use two different
versions of the same dependency. For example, right now we have protobuf
3.18.3 in the submodules. Then every EP must use the same version.
Whenever we have a need to upgrade protobuf, we need to coordinate
across the whole team and many external developers. I can't manage it
anymore.
3. Some projects want to manage the dependencies in a different way,
either because of their preference or because of compliance
requirements. For example, some Microsoft teams want to use vcpkg, but
we don't want to force every user of onnxruntime using vcpkg.
7. Someone wants to dynamically link to protobuf, but our build script
only does static link.
8. Hard to handle security vulnerabilities. For example, whenever
protobuf has a security patch, we have a lot of things to do. But if we
allowed people to build ORT with a different version of protobuf without
changing ORT"s source code, the customer who build ORT from source will
be able to act on such things in a quicker way. They will not need to
wait ORT having a patch release.
9. Every time we do a release, github will also publish a source file
zip file and a source file tarball for us. But they are not usable,
because they miss submodules.
 
### New features

After this change, users will be able to:
1. Build the dependencies in the way they want, then install them to
somewhere(for example, /usr or a temp folder).
2. Or download the dependencies by using cmake commands from these
dependencies official website
3. Similar to the above, but use your private mirrors to migrate supply
chain risks.
4. Use different versions of the dependencies, as long as our source
code is compatible with them. For example, you may use you can't use
protobuf 3.20.x as they need code changes in ONNX Runtime.
6.  Only download the things the current build needs.
10. Avoid building external dependencies again and again in every build.

### Breaking change
The onnxruntime_PREFER_SYSTEM_LIB build option is removed you could think from now 
it is default ON. If you don't like the new behavior, you can set FETCHCONTENT_TRY_FIND_PACKAGE_MODE to NEVER.
Besides, for who relied on the onnxruntime_PREFER_SYSTEM_LIB build
option, please be aware that this PR will change find_package calls from
Module mode to Config mode. For example, in the past if you have
installed protobuf from apt-get from ubuntu 20.04's official repo,
find_package can find it and use it. But after this PR, it won't. This
is because that protobuf version provided by Ubuntu 20.04 is too old to
support the "config mode". It can be resolved by getting a newer version
of protobuf from somewhere.
2022-12-01 09:51:59 -08:00
Edward Chen
9e65f3bfdb
Replace deprecated Python dependency sklearn with scikit-learn. (#13585) 2022-11-08 09:08:29 -08:00
PeixuanZuo
6895918b1c
[ROCm] Revert CI pipeline to ROCm5.2.3 (#13297)
### Description
<!-- Describe your changes. -->

Unit test with ROCm5.3 slower than ROCm5.2.3. Revert to ROCm5.2.3.
We will update to ROCm5.3 when the issue resloved by AMD.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2022-10-12 10:47:33 -07:00
PeixuanZuo
4d25b9c8f0
[ROCm] Update ROCm and MIGraphX CI pipeline to ROCm5.3 (#13257)
### Description
<!-- Describe your changes. -->

1. Update ROCm pipeline and MIGraphX pipeline to ROCm5.3
ROCm pipeline run ortmodule test one time and disable it :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=777794&view=logs&j=48b14a85-ff1a-5ca4-53fa-8ea420d27feb&t=9c199f35-fc50-565d-6c65-5162c9bb1b04
2. Add `workspace: clean: all `.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2022-10-11 13:47:22 +08:00
cloudhan
72076b1eb2
Update ROCm CI to use HIP LANGUAGE (#13214)
Update for ROCm CI before reland tunable GEMM #12853. This PR also update
composable kernel to use CMakes's HIP language support so that we can
mix C/C++ compiler with HIP compiler instead of locking to hip-clang
2022-10-05 16:15:16 +08:00
PeixuanZuo
5e4ebbd9d9
[ROCm] add MIGraphX ci pipeline (#11569)
**Description**: Describe your changes.
Add migraphx ci pipeline, test build and unit tests.
This PR is based on #11492 

Pipeline :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=765711&view=results
2022-09-28 10:59:30 +08:00
PeixuanZuo
adbc0757ad
[UPDATE] update ROCm ci pipeline to ROCm5.2.3 (#12799)
* [Update] update to rocm5.2.3

* [Fix] cmake version

* [Fix] disbale ortmodule tests

* [revert] revert performance number
2022-09-01 10:32:24 +08:00
Vincent Wang
e85e31ee80
Update ORTModule Default Opset Version to 15 (#12419)
* update ortmodule opset to 15

* update torch version

* fix ut

* fix ut

* rollback

* rollback for orttrainer
2022-08-05 16:55:04 +08:00
mindest
add631410a
[ROCm] Re-enable ReduceL1, L2 and related tests (#12209)
Re-enable ReduceL1,L2 and related tests
2022-07-20 13:13:02 +08:00
PeixuanZuo
7b53b223b8
[UPDATE] update AMD CI pipeline to Rocm5.2 with torch1.11 (#12162)
* [UPDATE] update ci to rocm5.2 + torch1.11

* [Revert] disable ort module test

* [DELETE] delete Rocm5.1.1 ci test result

* [UPDATE] update the comments
2022-07-14 16:38:16 +08:00
Changming Sun
d5e34acb82
Remove git and python packages from the docker images used by Zip-Nuget-Java-Nodejs Packaging Pipeline (#11651) 2022-06-03 20:00:54 -07:00
PeixuanZuo
a67994316a
Update rocm ci to ROCm5.1.1 + torch1.10.0
* [UPDATE] update amd ci pipeline 2 rocm5.1.1

* [FIX] json format error

* [ERROR] disable unit tests

* [FIX] ucx error

* [FIX] cmake version

* [FIX] units test
2022-05-20 11:07:21 +08:00
PeixuanZuo
55af7a96a7
update the amd ci pipeline (#10723)
* [TEST] test to get amd pipeline information

* [FIX] lower the threshold

* [UPDATE] add retry task

* [UPDATE] add retry task

* [ERROR] error to occur retry

* [FIX] error

* [UPDATE] update retryCountOnTaskFailure to 1 time

* [UPDATE] add showmeminfo
2022-03-07 18:39:42 +08:00
Xavier Dupré
42c176b60c
Update default opset to 14 in ORTModule (#9743)
* update to torch 1.10
* update torchvision version
* update torchtext version
* remove deprecated option enable_onnx_checker
* add unit test to test gradient of GatherElements
* add ORTMODULE_ONNX_OPSET_VERSION in a docker file
2021-12-09 12:45:35 +01:00
Jeff Daily
3e879aab6b
work around ucx in rocm ci Dockerfile (#9360) 2021-10-14 09:49:31 -07:00
Suffian Khan
e758870b18
Upgrade ROCm CI pipeline for ROCm 4.3.1 and permit run inside container (#9070)
* try to run inside 4.3.1 container

* no \ in container run command

* remove networking options

* try with adding video render groups

* add job to build docker image

* try without 1st stage

* change alpha, beta to float

* try adding service connection

* retain huggingface directory

* static video and render gid

* use runtime expression for variables

* install torch-ort

* pin sacrebleu==1.5.1

* update curves for rocm 4.3.1

* try again

* disable determinism and only check tail of loss curve and with a much larger threshold of 0.05

* disable RoBERTa due to high run variablity on ROCm 4.3.1

* put reduction unit tests back in
2021-09-15 12:32:02 -07:00
Changming Sun
ae6fdd3333
Bring code coverage dashboard back (#8394) 2021-08-16 20:54:39 -07:00
raviskolli
f641c0f4e8 Update requirements.txt
Updated requests version to address component governance failure
2021-07-22 14:18:21 -07:00
Suffian Khan
35ca3c99d1
Fix ROCm wheels pipeline after changes to manylinux scripts (#8026)
* update

* try fix rocm pipeline

* avoid already isntalled error

* ignore python3.10 since build fails

* fix

* try setting user

* try again

* try again

* try again

* fix script

* disable inference docs generation

* try print device id

* fix name qual

* try again

* try again

* try again

* provider_options

* add device verify

* rty again

* try again

* try aggain

* print video/render gid

* try again

* run as root

* try again with uid, gid

* cleanup

* run as root

* temp fix

* add /bin/bash

Co-authored-by: Changming Sun <chasun@microsoft.com>
2021-06-10 21:01:28 -07:00
Jesse Benson
f977644324 ROCM support int reductions 2021-05-17 16:42:06 -07:00
Jesse Benson
be79575c6a Use built-in reduce_sum() for simple reduction cases, specifically reduce all to a scalar. 2021-04-14 08:55:35 -07:00
Weixing Zhang
75c0192e4f
enable more unit tests for ROCM EP (#7307) 2021-04-09 15:15:13 -07:00
Weixing Zhang
c22963c23d
Polish Lamb Kernel (#7299) 2021-04-09 09:55:57 -07:00
Weixing Zhang
8ad5007f8f
Polish Adam kernel (#7294)
* Polish Adam kernel
2021-04-09 01:11:09 -07:00
Jesse Benson
4543459984 MIOpen supports MIOPEN_REDUCE_TENSOR_AVG now. 2021-04-01 16:00:34 -07:00
Weixing Zhang
40fa40f3ce
Enable more unit tests for ROCM EP (#6776)
* enable more ops and unit tests for ROCM EP
2021-02-24 15:20:50 -08:00
Xavier Dupré
d3a2c8c1c7
Support double for operators ReduceMax, ReduceMin (#6265)
* Support double for operators ReduceMax, ReduceMin

* add unit test to pai-excluded-tests.txt

Co-authored-by: xavier dupré <xavier.dupre@gmail.com>
2021-02-08 19:14:26 -08:00
Jesse Benson
d18aa45b46 Enable more ROCM ops that are sharing CUDA code. Some are needed for Turing NLG models. 2021-02-06 14:40:34 -08:00
Jesse Benson
21a47ec8d9 Disable a couple more unsupported tests. 2021-02-04 15:00:05 -08:00
Jesse Benson
0b147702af Update remaining reduction ops to use MIOpen. double datatype is not supported, so disable those typed kernels. 2021-02-04 15:00:05 -08:00
Jesse Benson
a28ddb85b6 Reduction ops. 2021-02-04 15:00:05 -08:00
ashbhandare
85434273ff
Fix CUDA Reduction kernel for ArgMax/ArgMix for when reduction dim=1 (#6490)
* Fix for when reduction dim=1

* Disable test for AMD GPUs

* Specify Async
2021-02-02 09:50:16 -08:00
Suffian Khan
76bc0e479c
Enable dense sequence optimized version of Pytorch exported BERT-L on AMD GPU (#6504)
* Permit dense seq optimization on BERT-L pytorch export by enabling ReduceSumTraining, Equal, and NonZero on AMD

* enable Equal tests

* enable fast_matrix_reduction test case
2021-01-29 13:12:34 -08:00
pengwa
453431f7bb
Add max_norm for gradient clipping. (#6289)
* add max_norm as user option for gradient clipping

* add adam and lamb test cases for clip norm

* add frontend tests
2021-01-21 01:01:11 +08:00
Xavier Dupré
cd14c1af29
Support double for operator ArgMin (#6222)
* Support double for operator ArgMin
* add test specifically for double
* add new test on pai-excluded-tests.txt
2020-12-31 11:25:46 +01:00
Xavier Dupré
84addcd2cf
Support double for operator ReduceMean, ReduceLogSumExp (#6217)
* Support double for operators ReduceMean, ReduceLogSumExp
2020-12-31 11:24:54 +01:00
Vincent Wang
7ddeafdfcc
Add ReduceL2Grad and ClipGrad (#5970)
* ReduceL2Grad and ClipGrad.

* fix win build and amd ci pipeline

* resolve comments.

Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-12-10 11:03:26 +08:00
Jesse Benson
98ea7372d3 Re-enable Lamb unit tests for AMD 2020-12-03 13:06:34 -08:00
Suffian Khan
9b8189dd0a
Rework AMD CI pipeline to use pool AMD-GPU and disable more tests in order to enable it. (#5885)
Move AMD test pipeline to use self-hosted pool AMD-GPU. For time being, remove failing/flaky unit tests for AMD pipeline.
2020-11-24 09:38:14 -08:00
Weixing Zhang
aec4cb489e
ROCm EP for AMD GPU (#5480)
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/

ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …

Current status:
BERT-L and GPT2 training can be ran on AMD GPU with data parallel.

Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……

The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big(~180 files), it was suggested to split the PR into two parts, one is rocm-kernels, the other is non rocm kernels.  

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2020-10-29 17:13:04 -07:00