both arm64ec and x64 packages are needed.
x64 is needed for offline context binary generation
and arm64ec is needed for interop with python packages that don't have
prebuilt arm64 packages and only have x64.
### Description
Removing `docker_base_image` parameter and variables. From the Cuda
Packaging pipeline.
### Motivation and Context
Since the docker image is hard coded in the
`onnxruntime/tools/ci_build/github/linux/docker/inference/x86_64/default/cuda12/Dockerfile`
and
`onnxruntime/tools/ci_build/github/linux/docker/inference/x86_64/default/cuda11/Dockerfile`
This parameter and variable is no longer needed.
### Description
Do not allow clearing Android logs if the emulator is not running
### Motivation and Context
Previously the Clearing Android logs step stuck until the pipeline
timeout. If one of the previous steps failed.
### Description
- TensorRT 10.2.0.19 -> 10.3.0.26
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
We couldn't get enough A100 agent time to finish the jobs since today.
The PR makes the A100 job only runs in main branch to unblock other PRs
if it's not recovered in a short time.
### Description
<!-- Describe your changes. -->
The xcframework now uses symlinks to have the correct structure
according to Apple requirements. Symlinks are not supported by nuget on
Windows.
In order to work around that we can store a zip of the xcframeworks in
the nuget package.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix nuget packaging build break
### Description
* Fix migraphx build error caused by
https://github.com/microsoft/onnxruntime/pull/21598:
Add a conditional compile on code block that depends on ROCm >= 6.2.
Note that the pipeline uses ROCm 6.0.
Unblock orttraining-linux-gpu-ci-pipeline and
orttraining-ortmodule-distributed and orttraining-amd-gpu-ci-pipeline
pipelines:
* Disable a model test in linux GPU training ci pipelines caused by
https://github.com/microsoft/onnxruntime/pull/19470:
Sometime, cudnn frontend throws exception that cudnn graph does not
support a Conv node of keras_lotus_resnet3D model on V100 GPU.
Note that same test does not throw exception in other GPU pipelines. The
failure might be related to cudnn 8.9 and V100 GPU used in the pipeline
(Amper GPUs and cuDNN 9.x do not have the issue).
The actual fix requires fallback logic, which will take time to
implement, so we temporarily disable the test in training pipelines.
* Force install torch for cuda 11.8. (The docker has torch 2.4.0 for
cuda 12.1 to build torch extension, which it is not compatible cuda
11.8). Note that this is temporary walkround. More elegant fix is to
make sure right torch version in docker build step, that might need
update install_python_deps.sh and corresponding requirements.txt.
* Skip test_gradient_correctness_conv1d since it causes segment fault.
Root cause need more investigation (maybe due to cudnn frontend as
well).
* Skip test_aten_attention since it causes assert failure. Root cause
need more investigation (maybe due to torch version).
* Skip orttraining_ortmodule_distributed_tests.py since it has error
that compiler for torch extension does not support c++17. One possible
fix it to set the following compile argument inside setup.py of
extension fused_adam: extra_compile_args['cxx'] = ['-std=c++17'].
However, due to the urgency of unblocking the pipelines, just disable
the test for now.
* skip test_softmax_bf16_large. For some reason,
torch.cuda.is_bf16_supported() returns True in V100 with torch 2.3.1, so
the test was run in CI, but V100 does not support bf16 natively.
* Fix typo of deterministic
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Update pipelines to use QNN SDK 2.25 by default
- Update ifdef condition to apply workaround for QNN LayerNorm
validation bug to QNN SDK 2.25 (as well as 2.24)
### Motivation and Context
Use the latest QNN SDK
### Description
Improve docker commands to make docker image layer caching works.
It can make docker building faster and more stable.
So far, A100 pool's system disk is too small to use docker cache.
We won't use pipeline cache for docker image and remove some legacy
code.
### Motivation and Context
There are often an exception of
```
64.58 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
286.4 curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
```
Because Onnxruntime pipeline have been sending too many requests to
download Nodejs in docker building.
Which is the major reason of pipeline failing now
In fact, docker image layer caching never works.
We can always see the scrips are still running
```
#9 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
#9 0.234 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
#9 0.235 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
#9 0.235 /tmp/scripts/install_centos.sh: line 1: !/bin/bash: No such file or directory
#9 0.235 ++ '[' '!' -f /etc/yum.repos.d/microsoft-prod.repo ']'
#9 0.236 +++ tr -dc 0-9.
#9 0.236 +++ cut -d . -f1
#9 0.238 ++ os_major_version=8
....
#9 60.41 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
#9 60.59 + return 0
...
```
This PR is improving the docker command to make image layer caching
work.
Thus, CI won't send so many redundant request of downloading NodeJS.
```
#9 [2/5] ADD scripts /tmp/scripts
#9 CACHED
#10 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
#10 CACHED
#11 [4/5] RUN adduser --uid 1000 onnxruntimedev
#11 CACHED
#12 [5/5] WORKDIR /home/onnxruntimedev
#12 CACHED
```
###Reference
https://docs.docker.com/build/drivers/
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
<!-- Describe your changes. -->
Add ability to test packaging without rebuilding every time.
Add ability to comment out some platforms/architectures without the
scripts to assemble the c/obj-c packages breaking.
Update a couple of commands to preserve symlinks.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make debugging packaging issues faster.
Creates correct package for mac-catalyst and doesn't require setting
symlinks via bash script.
### Description
Added CUDNN Frontend and used it for NHWC convolutions, and optionally
fuse activation.
#### Backward compatible
- For model existed with FusedConv, model can still run.
- If ORT is built with cuDNN 8, cuDNN frontend will not be built into
binary. Old kernels (using cudnn backend APIs) are used.
#### Major Changes
- For cuDNN 9, we will enable cudnn frontend to fuse convolution and
bias when a provider option `fuse_conv_bias=1`.
- Remove the fusion of FusedConv from graph transformer for CUDA
provider, so there will not be FusedConv be added to graph for CUDA EP
in the future.
- Update cmake files regarding to cudnn settings. The search order of
CUDNN installation in build are like the following:
* environment variable `CUDNN_PATH`
* `onnxruntime_CUDNN_HOME` cmake extra defines. If a build starts from
build.py/build.sh, user can pass it through `--cudnn_home` parameter, or
by environment variable `CUDNN_HOME` if `--cudnn_home` not used.
* cudnn python package installation directory like
python3.xx/site-packages/nvidia/cudnn
* CUDA installation path
#### Potential Issues
- If ORT is built with cuDNN 8, FusedConv fusion is no longer done
automatically, so some model might have performance regression. If user
still wants FusedConv operator for performance reason, they can still
have multiple ways to walkaround: like use older version of onnxruntime;
or use older version of ORT to save optimized onnx, then run with latest
version of ORT. We believe that majority users have moved to cudnn 9
when 1.20 release (since the default in ORT and PyTorch is cudnn 9 for 3
months when 1.20 release), so the impact is small.
- cuDNN graph uses TF32 by default, and user cannot disable TF32 through
the use_tf32 cuda provider option. If user encounters accuracy issue
(like in testing), user has to set environment variable
`NVIDIA_TF32_OVERRIDE=0` to disable TF32. Need update the document of
use_tf32 later.
#### Follow ups
This is one of PRs that target to enable NHWC convolution in CUDA EP by
default if device supports it. There are other changes will follow up to
make it possible.
(1) Enable `prefer_nhwc` by default for device with sm >= 70.
(2) Change `fuse_conv_bias=1` by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).
### Motivation and Context
The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100-d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
### Description
<!-- Describe your changes. -->
Update TRT OSS Parser to [latest 10.2-GA
branch](f161f95883)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Since the onedevice training cpu packaging has been a separated
pipeline, it's nuget package publishing step must be moved as well.
### Motivation and Context
Fixes the exception in Nuget Publishing Packaging Pipeline caused by
#21485
### Description
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml
### Motivation and Context
This CI pipeline has been divided into 4 different pipeline.
### Description
<!-- Describe your changes. -->
Set version and other info in the Microsoft.ML.OnnxRuntime C# dll by
setting GenerateAssemblyInfo to true and passing in ORT version in the
CI.
Minor re-org of the order of properties so related things are grouped a
little better.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#21475
### Description
<!-- Describe your changes. -->
`enable_windows_arm64_qnn` and `enable_windows_x64_qnn` are true by
default but unnecessary for training. This change explicitly sets these
parameters to false for training pipeline.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ORT 1.19 Release Preparation
### Description
<!-- Describe your changes. -->
Current failure is due to a version mismatch.
Use llvm-cov from the Android NDK instead of the system gcov so that the
version is correct.
Also comment out publishing to the Azure dashboard to simplify the
setup. The CI prints out the stats for review by developers.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CI pipeline
### Description
Right now our "Zip-Nuget-Java-Nodejs Packaging Pipeline" is too big.
This OnDevice training part is independent of the others, so it can be
split out. Then our NPM Packaging pipeline will not depends on this
training stuff.
### Motivation and Context
Similar to #21235
Also, this PR fixed a problem that: "NuGet_Test_Linux_Training_CPU" job
downloads artifacts from "onnxruntime-linux-x64" for getting customop
shared libs, but the job forget to declare it depends on the
"Linux_C_API_Packaging_CPU_x64" which produces the artifact. Such
problems can be hard to find when a pipeline goes big.
### Description
* Swap cuda version 11.8/12.2 in GPU CIs
* Set CUDA12 as default version in yamls of publishing nuget/python/java
GPU packages
* Suppress warnings as errors of flash_api.cc during ort win-build
### Description
- Update pipelines to use QNN SDK 2.24 by default
- Update QNN_Nuget_Windows pipeline to build csharp solution without
mobile projects (fixes errors).
- Implement workaround for QNN 2.24 validation bug for LayerNorm ops
without an explicit bias input.
- Enable Relu unit test, which now passes due to the fact Relu is no
longer fused into QuantizeLinear for QNN EP.
- Fix bug where a negative quantization axis is not properly normalized
for per-channel int4 conv.
### Motivation and Context
Update QNN SDk.
### Description
Before this change, copy_strip_binary.sh manually copies each file from
onnx runtime's build folder to an artifact folder. It can be hard when
dealing with symbolic link for shared libraries.
This PR will change the packaging pipelines to run "make install" first,
before packaging shared libs .
### Motivation and Context
Recently because of feature request #21281 , we changed
libonnxruntime.so's SONAME. Now every package that contains this shared
library must also contains libonnxruntime.so.1. Therefore we need to
change the packaging scripts to include this file. Instead of manually
construct the symlink layout, using `make install` is much easier and
will make things more consistent because it is a standard way of making
packages.
**Breaking change:**
After this change, our **inference** tarballs that are published to our
Github release pages will be not contain ORT **training** headers.
1. Update google benchmark from 1.8.3 to 1.8.5
2. Update google test from commit in main branch to tag 1.15.0
3. Update pybind11 from 2.12.0 to 2.13.1
4. Update pytorch cpuinfo to include the support for Arm Neoverse V2,
Cortex X4, A720 and A520.
5. Update re2 from 2024-05-01 to 2024-07-02
6. Update cmake to 3.30.1
7. Update Linux docker images
8. Fix a warning in test/perftest/ort_test_session.cc:826:37: error:
implicit conversion loses integer precision: 'streamoff' (aka 'long
long') to 'const std::streamsize' (aka 'const long')
[-Werror,-Wshorten-64-to-32]
### Description
Replace inline pip install with pip install from requirements*.txt
### Motivation and Context
so that CG can recognize
### Dependency
- [x] https://github.com/microsoft/onnxruntime/pull/21085
### Description
<!-- Describe your changes. -->
* promote trt version to 10.2.0.19
* EP_Perf CI: clean config of legacy TRT<8.6, promote test env to
trt10.2-cu118/cu125
* skip two tests as Float8/BF16 are supported by TRT>10.0 but TRT CIs
are not hardware-compatible on these:
```
1: [ FAILED ] 2 tests, listed below:
1: [ FAILED ] IsInfTest.test_isinf_bfloat16
1: [ FAILED ] IsInfTest.test_Float8E4M3FN
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
There is a bug for kernel running on rocm6.0, so change ci docker image
to rocm6.1
For the torch installed in the docker image, change to rocm repo when it
is not 6.0 version.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We need to prevent VitisAI EP build breaks, add a stage in Windows CPU
CI Pipeline to build Vitis AI EP on Windows.
There are no external dependencies for builds. Tests have to be disabled
though as the EP has external SW/HW dependencies.
This will at least allow us to prevent build breaks which has happened
on multiple occasions recently.
tested
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1432346&view=results
and it seems to run fine.
### Description
Combining android build and test step into one job
### Motivation and Context
Reduce runtime by removing additional machine allocation, and artifact
uploading and downloading.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
the exception was caused by
3dd6fcc089
Why I add skip_macos_test
because there's new an exception in
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1425579&view=logs&j=c90c5af3-67d5-5936-5a62-71c93ebfca65&t=01038f35-8e78-5801-1aa1-d9647bb65858
```
2024-07-05T14:41:09.3864740Z mkdir -p /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest/Contents/Frameworks
2024-07-05T14:41:09.3933430Z mkdir: /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest: Operation not permitted
2024-07-05T14:41:09.3996760Z /var/folders/0f/b0mzpg5d31z074x3z5lzkdxc0000gn/T/tmp97ycvwq5/apple_package_test/Pods/Target Support Files/Pods-macos_package_testUITests/Pods-macos_package_testUITests-frameworks.sh: line 7: realpath: command not found
2024-07-05T14:41:09.4003170Z :18: error: Unexpected failure
2024-07-05T14:41:11.1323470Z error: Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest (in target 'macos_package_testUITests' from project 'apple_package_test')
2024-07-05T14:41:11.1325620Z
2024-07-05T14:41:11.8731110Z
2024-07-05T14:41:11.8733040Z Test session results, code coverage, and logs:
2024-07-05T14:41:11.8734820Z /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Logs/Test/Test-macos_package_test-2024.07.05_14-40-38-+0000.xcresult
2024-07-05T14:41:11.8735530Z
2024-07-05T14:41:11.8906210Z Testing failed:
2024-07-05T14:41:11.8911060Z Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest
2024-07-05T14:41:11.8912570Z Unexpected failure
2024-07-05T14:41:11.8913690Z Testing cancelled because the build failed.
2024-07-05T14:41:11.8914380Z
2024-07-05T14:41:11.8914970Z ** TEST FAILED **
2024-07-05T14:41:11.8915480Z
2024-07-05T14:41:11.8915780Z
2024-07-05T14:41:11.8916750Z The following build commands failed:
2024-07-05T14:41:11.8919280Z PhaseScriptExecution [CP]\ Embed\ Pods\ Frameworks /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Intermediates.noindex/apple_package_test.build/Debug/macos_package_testUITests.build/Script-059136A7770CA5376C30F2FD.sh (in target 'macos_package_testUITests' from project 'apple_package_test')
2024-07-05T14:41:11.8922180Z (1 failure)
```
And I find macos test is skipped in
9ef28f092f/tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml (L119-L127)
as well.
Maybe it is an known issue.
### Description
1. remove QNN stages from the big packaging pipeline
2. Add publish nightly package in the current [QNN Nuget
pipeline](https://dev.azure.com/aiinfra/Lotus/_builddefinitionId=1234])
### Motivation and Context
Reduce the complexity of the big Nuget packaging pipelines.
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
Our macOS pipeline are failing because of a build error in absl.
However, the bug fix we need is not available in the latest ABSL
release.
Here is the issue: https://github.com/abseil/abseil-cpp/pull/1536
And here is the fix:
779a3565ac
GTests uses ABSL. But this ABSL target also depends on GTest. So, it is
a circular dependency. We should be able to avoid that by avoid building
tests for ABSL. However, the version we are using has a problem with
that: it has cmake target that still depends on GTest even when testing
is disabled.
It's strange that we suddenly hit this problem and it only happens on macOS.
### Description
Make current ROCm packaging stages to a single workflow.
Reduce the possibility of all nightly packages can't be generated by one
failed stage
### Motivation and Context
Our plan is to reduce the complexity of the current zip-nuget pipeline
to improve the stability and performance of nightly packages generation.
ROCm packaging stages has no dependencies with other packaging jobs and
it's the most time-consuming route.
After this change, the most used CPU/CUDA/Mobile packaging workflow
duration can be reduced roughly from 3h20m to 2h30m.