onnxruntime/tools/ci_build/github/azure-pipelines
Tianlei Wu a46e49b439
Unblock migraphx and linux GPU training ci pipelines (#21662)
### Description
* Fix migraphx build error caused by
https://github.com/microsoft/onnxruntime/pull/21598:
Add a conditional compile on code block that depends on ROCm >= 6.2.
Note that the pipeline uses ROCm 6.0.

Unblock orttraining-linux-gpu-ci-pipeline and
orttraining-ortmodule-distributed and orttraining-amd-gpu-ci-pipeline
pipelines:
* Disable a model test in linux GPU training ci pipelines caused by
https://github.com/microsoft/onnxruntime/pull/19470:
Sometime, cudnn frontend throws exception that cudnn graph does not
support a Conv node of keras_lotus_resnet3D model on V100 GPU.
Note that same test does not throw exception in other GPU pipelines. The
failure might be related to cudnn 8.9 and V100 GPU used in the pipeline
(Amper GPUs and cuDNN 9.x do not have the issue).
The actual fix requires fallback logic, which will take time to
implement, so we temporarily disable the test in training pipelines.
* Force install torch for cuda 11.8. (The docker has torch 2.4.0 for
cuda 12.1 to build torch extension, which it is not compatible cuda
11.8). Note that this is temporary walkround. More elegant fix is to
make sure right torch version in docker build step, that might need
update install_python_deps.sh and corresponding requirements.txt.
* Skip test_gradient_correctness_conv1d since it causes segment fault.
Root cause need more investigation (maybe due to cudnn frontend as
well).
* Skip test_aten_attention since it causes assert failure. Root cause
need more investigation (maybe due to torch version).
* Skip orttraining_ortmodule_distributed_tests.py since it has error
that compiler for torch extension does not support c++17. One possible
fix it to set the following compile argument inside setup.py of
extension fused_adam: extra_compile_args['cxx'] = ['-std=c++17'].
However, due to the urgency of unblocking the pipelines, just disable
the test for now.
* skip test_softmax_bf16_large. For some reason,
torch.cuda.is_bf16_supported() returns True in V100 with torch 2.3.1, so
the test was run in CI, but V100 does not support bf16 natively.
* Fix typo of deterministic

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-08 19:44:15 -07:00
..
nodejs/templates Adding Job names to jobs without a name (#20961) 2024-06-06 19:09:21 -07:00
nuget/templates [TensorRT EP] support TensorRT 10.2-GA (#21395) 2024-07-18 12:11:52 -07:00
stages Set CUDA12 as default in GPU packages (#21438) 2024-07-25 10:17:16 -07:00
templates Unblock migraphx and linux GPU training ci pipelines (#21662) 2024-08-08 19:44:15 -07:00
triggers
android-arm64-v8a-QNN-crosscompile-ci-pipeline.yml [QNN EP] Update QNN SDK to 2.25 (#21623) 2024-08-06 09:08:48 -07:00
android-x86_64-crosscompile-ci-pipeline.yml Fix Android CI Pipeline code coverage failure (#21504) 2024-07-26 07:36:23 +10:00
bigmodels-ci-pipeline.yml Fix docker image layer caching to avoid redundant docker building and transient connection exceptions. (#21612) 2024-08-06 21:37:09 +08:00
binary-size-checks-pipeline.yml Clean up some mobile package related files and their usages. (#21606) 2024-08-05 16:38:20 -07:00
build-perf-test-binaries-pipeline.yml Upgrade Ubuntu machine pool from 20.04 to 22.04 (#19117) 2024-01-16 17:25:18 -08:00
c-api-noopenmp-packaging-pipelines.yml [QNN EP] Update QNN SDK to 2.25 (#21623) 2024-08-06 09:08:48 -07:00
c-api-training-packaging-pipelines.yml Move on-device training packages publish step (#21539) 2024-07-29 09:59:46 -07:00
clean-build-docker-image-cache-pipeline.yml Upgrade Ubuntu machine pool from 20.04 to 22.04 (#19117) 2024-01-16 17:25:18 -08:00
cuda-packaging-pipeline.yml [TensorRT EP] support TensorRT 10.2-GA (#21395) 2024-07-18 12:11:52 -07:00
linux-ci-pipeline.yml Update training packaging pipeline's docker files (#20853) 2024-05-30 23:48:42 -07:00
linux-cpu-aten-pipeline.yml Update Aten pipeline's docker file to use UBI8 (#20856) 2024-05-30 07:38:15 -07:00
linux-cpu-eager-pipeline.yml Update Aten pipeline's docker file to use UBI8 (#20856) 2024-05-30 07:38:15 -07:00
linux-cpu-minimal-build-ci-pipeline.yml Update training packaging pipeline's docker files (#20853) 2024-05-30 23:48:42 -07:00
linux-dnnl-ci-pipeline.yml Update training packaging pipeline's docker files (#20853) 2024-05-30 23:48:42 -07:00
linux-gpu-ci-pipeline.yml Set CUDA12 as default in GPU packages (#21438) 2024-07-25 10:17:16 -07:00
linux-gpu-tensorrt-ci-pipeline.yml Set CUDA12 as default in GPU packages (#21438) 2024-07-25 10:17:16 -07:00
linux-gpu-tensorrt-daily-perf-pipeline.yml Set CUDA12 as default in GPU packages (#21438) 2024-07-25 10:17:16 -07:00
linux-migraphx-ci-pipeline.yml change ci docker image to rocm6.1 (#21296) 2024-07-18 14:50:01 +08:00
linux-openvino-ci-pipeline.yml Update OpenVino CI Ubuntu to 22.04 (#21127) 2024-07-09 09:56:44 -07:00
linux-qnn-ci-pipeline.yml [QNN EP] Update QNN SDK to 2.25 (#21623) 2024-08-06 09:08:48 -07:00
mac-ci-pipeline.yml Delete pyop (#21094) 2024-06-19 16:21:33 -07:00
mac-coreml-ci-pipeline.yml Switch a portion of CI/packaging jobs to MacOS12 (#19908) 2024-03-19 14:54:58 -07:00
mac-ios-ci-pipeline.yml Upgrade min ios version to 13.0 (#20773) 2024-06-04 10:15:20 -07:00
mac-ios-packaging-pipeline.yml Upgrade min ios version to 13.0 (#20773) 2024-06-04 10:15:20 -07:00
mac-react-native-ci-pipeline.yml Address React Native pipeline component detection timeout (#20871) 2024-05-30 16:37:03 -07:00
npm-packaging-pipeline.yml Increase NPM ComponentDetection.Timeout: 1200 (#20681) 2024-05-15 13:41:59 -07:00
nuget-cuda-publishing-pipeline.yml Set CUDA12 as default in GPU packages (#21438) 2024-07-25 10:17:16 -07:00
orttraining-linux-ci-pipeline.yml Remove manylinux build scripts from python packaging pipeline (#20786) 2024-05-24 08:18:22 -07:00
orttraining-linux-gpu-ci-pipeline.yml
orttraining-linux-gpu-ortmodule-distributed-test-ci-pipeline.yml Unblock migraphx and linux GPU training ci pipelines (#21662) 2024-08-08 19:44:15 -07:00
orttraining-linux-nightly-ortmodule-test-pipeline.yml ORTModule memory improvement (#18924) 2024-01-16 08:57:37 +08:00
orttraining-mac-ci-pipeline.yml
orttraining-pai-ci-pipeline.yml Replace inline pip install with pip install from requirements*.txt (#21106) 2024-07-22 12:39:10 -07:00
orttraining-py-packaging-pipeline-cpu.yml disables qnn in ort training cpu pipeline (#21510) 2024-07-26 17:23:35 +08:00
orttraining-py-packaging-pipeline-cuda.yml Update training packaging pipeline's docker files (#20853) 2024-05-30 23:48:42 -07:00
orttraining-py-packaging-pipeline-cuda12.yml Update training packaging pipeline's docker files (#20853) 2024-05-30 23:48:42 -07:00
orttraining-py-packaging-pipeline-rocm.yml [ROCm] Update ck to use ck_tile (#21030) 2024-06-19 14:06:10 +08:00
post-merge-jobs.yml [TensorRT EP] support TensorRT 10.2-GA (#21395) 2024-07-18 12:11:52 -07:00
publish-nuget.yml Move on-device training packages publish step (#21539) 2024-07-29 09:59:46 -07:00
py-cuda-package-test-pipeline.yml Adding new pipeline for python cuda testing (#18718) 2023-12-18 18:13:03 -08:00
py-cuda-packaging-pipeline.yml Remove manylinux build scripts from python packaging pipeline (#20786) 2024-05-24 08:18:22 -07:00
py-cuda-publishing-pipeline.yml Set CUDA12 as default in GPU packages (#21438) 2024-07-25 10:17:16 -07:00
py-package-build-pipeline.yml OpenVINO EP Rel 1.18 Changes (#20337) 2024-04-19 00:31:38 -07:00
py-package-test-pipeline.yml [TensorRT EP] support TensorRT 10.2-GA (#21395) 2024-07-18 12:11:52 -07:00
py-packaging-pipeline.yml [QNN EP] Update QNN SDK to 2.25 (#21623) 2024-08-06 09:08:48 -07:00
qnn-ep-nuget-packaging-pipeline.yml [QNN EP] Update QNN SDK to 2.25 (#21623) 2024-08-06 09:08:48 -07:00
rocm-nuget-packaging-pipeline.yml Make ROCm packaging stages to a single workflow (#21235) 2024-07-04 11:07:04 +08:00
web-ci-pipeline.yml Fix typos according to reviewdog report. (#21335) 2024-07-22 13:37:32 -07:00
win-ci-fuzz-testing.yml Uppdate nuget to Use Nuget 6.10.x (#21209) 2024-06-28 19:49:54 -07:00
win-ci-pipeline.yml add vitisai ep build stage to Windows CPU Pipeline (#21361) 2024-07-15 19:34:08 -07:00
win-gpu-cuda-ci-pipeline.yml Separating all GPU stages into different Pipelines (#21521) 2024-07-26 14:54:45 -07:00
win-gpu-dml-ci-pipeline.yml Separating all GPU stages into different Pipelines (#21521) 2024-07-26 14:54:45 -07:00
win-gpu-doc-gen-ci-pipeline.yml Separating all GPU stages into different Pipelines (#21521) 2024-07-26 14:54:45 -07:00
win-gpu-reduce-op-ci-pipeline.yml Move jobs in onnxruntime-Win2022-GPU-T4 machine pool to onnxruntime-Win2022-GPU-A10 (#21023) 2024-06-12 22:04:40 -07:00
win-gpu-tensorrt-ci-pipeline.yml Set CUDA12 as default in GPU packages (#21438) 2024-07-25 10:17:16 -07:00
win-gpu-training-ci-pipeline.yml Separating all GPU stages into different Pipelines (#21521) 2024-07-26 14:54:45 -07:00
win-qnn-arm64-ci-pipeline.yml [QNN EP] Update QNN SDK to 2.25 (#21623) 2024-08-06 09:08:48 -07:00
win-qnn-ci-pipeline.yml [QNN EP] Update QNN SDK to 2.25 (#21623) 2024-08-06 09:08:48 -07:00