### Description
Prebuilt QNN Android package.
### Future Work
1. Setting up CI with BrowserStack: onnxruntime_tests and Android test
2. ESRP Release to Maven
### Description
Resolves #21976.
ABSL generally does not have forward/backward compatibility. Our code is
only compatible with one fixed LTS version. So it's important to fix the
version number there when using find_package to detect an installed
version.
### Description
This pipeline runs after "Python-CUDA-Packaging-Pipeline", which runs on a CPU
machine and skips all tests.
This testing pipeline runs those tests.
Fix the QNN nuget package issue
### Description
Inside the package, folder name \runtimes\win-arm64\ was changed to \runtimes\win-ARM64\, which breaks lib copy settings in Microsoft.ML.OnnxRuntime.QNN.props.
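As a rough illustration of how a casing change like this could be caught before publishing, the sketch below scans a list of package entry paths and reports any whose `runtimes` subfolder matches the expected name only case-insensitively. The function name `find_miscased_runtime_dirs` and the sample entries are hypothetical; they are not part of the ORT packaging scripts.

```python
from pathlib import PureWindowsPath


def find_miscased_runtime_dirs(entries, expected="win-arm64"):
    """Return entries whose runtimes/<rid> folder equals `expected` only when case is ignored."""
    bad = []
    for entry in entries:
        # PureWindowsPath accepts both "\\" and "/" separators.
        parts = PureWindowsPath(entry).parts
        for i, part in enumerate(parts[:-1]):
            if part.lower() == "runtimes":
                rid = parts[i + 1]
                if rid.lower() == expected and rid != expected:
                    bad.append(entry)
                break
    return bad


# Hypothetical package listing used only for illustration.
sample_entries = [
    "runtimes\\win-ARM64\\native\\onnxruntime.dll",
    "runtimes/win-arm64/native/QnnHtp.dll",
]
print(find_miscased_runtime_dirs(sample_entries))
```

A check of this kind only flags the mismatch; the actual fix in this PR is on the packaging side.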
### Motivation and Context
Fix issue: https://github.com/microsoft/onnxruntime/issues/21692
### Description
Update the commit from 59600894a2c1c18290944b83e989bfe618975230 to
1887322ed36d522409a6b805d4e7942cf76a8e40
### Motivation and Context
The new commit includes Python 3.13.
AB#50959
### Description
This change introduces the WebGPU EP into ONNX Runtime.
To keep the PR as simple as possible, it excludes the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update
This PR now contains only 43 file changes (while the working branch
contains 130+), which should make it easier to review.
Separate PRs will follow for each item listed above.
Current working branch: #21904
### Description
With the TensorRT 10.4 update, the name of the TensorRT Windows package changed.
### Description
- removed installing AppCenter + pipeline step that runs AppCenter
Espresso tests
- added script for running AppCenter tests
### Motivation and Context
App Center is being deprecated within the next year, and upcoming
Android work depends on working E2E testing.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
- Add Java API for appending QNN EP
- Update Java unit test setup
- Fix issues with setting system properties for tests
- Unify Windows/non-Windows setup to simplify
### Description
NS is no longer developed, and ORT doesn't use it for int4 inference
either. Remove it to clean up the code.
### Description
Fix syntax so usability checker works as expected.
### Description
If the `UseA100` variable is 1, the A100 job runs in PR checks.
Fixes
[AB#50333](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50333)
### Motivation and Context
We would like more big models that need an A100 to be tested in PR
checks, but Azure sometimes decommissions A100 agents without notification,
which blocks merging PRs.
This PR improves on the current workaround, which makes those jobs run
only on the main branch.
Once we find that the A100 agents have all been decommissioned by Azure, we can change
the UseA100 variable to 0 to disable the A100 jobs in PR checks.
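The gating described above can be sketched as a small predicate. This is a minimal illustration only: the function name `should_run_a100_job` and the `refs/heads/main` branch string are assumptions, and the real condition lives in the pipeline YAML rather than in Python.

```python
def should_run_a100_job(source_branch: str, use_a100: str) -> bool:
    """Decide whether the A100 job runs for a given build.

    Assumed rule, based on the description: the job always runs on the
    main branch, and runs in PR checks only when UseA100 is "1".
    """
    is_main = source_branch == "refs/heads/main"
    return is_main or use_a100 == "1"


# PR build with the variable switched off: the A100 job is skipped.
print(should_run_a100_job("refs/pull/123/merge", "0"))
```

Keeping the switch in a single pipeline variable means the A100 jobs can be turned off for PRs without another code change.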
### Description
Support Float16 for CoreML MLProgram EP.
Operations:
"Add", "Mul", "Sub", "Div", "Pow", "Sqrt", "Reciprocal",
"Sigmoid", "Tanh", "Relu", "LeakyRelu", "Concat", "GridSample",
"GlobalAveragePool",
"Clip", "DepthToSpace", "Resize", "Slice", "Conv",
"ConvTranspose", "GlobalMaxPool", "Gemm", "MatMul",
"AveragePool", "MaxPool", "Reshape", "Split", "Transpose"
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Jar maven signing:
- GnuPG
- sha256.
Jar packages artifacts:
- onnxruntime-android-full-aar
- onnxruntime-java
- onnxruntime-java-gpu
### Motivation and Context
Previously, the jars were signed manually.
Goal: automate the signing.
### Description
TensorRT 10.4 is GA now, update to 10.4
### Description
Fix regression caused by #17361
### Description
Update XNNPack to latest version (Sep 4)
- Some op outputs have changed; channel and stride parameters have moved into
the reshape function.
e.g.
96962a602d
- The input parameters of XNNPACK's resize-related functions have changed substantially
- KleidiAI is added as a dependency on ARM64
- The latest XNNPACK includes two static libs, microkernels-prod and
xnnpack.
Without microkernels-prod, the build fails with "Undefined symbols" errors.
- Add ORT_TARGET_PROCESSOR to get the real processor target in CMake
### Description
See https://github.com/microsoft/onnxruntime-extensions/pull/476
and https://github.com/actions/runner-images/issues/7671
### Current issue
- [ ] With the default Xcode 15.2 that comes with macOS-13, we need to
update the version of the Boost container header boost/container_hash/hash.hpp
for the build to pass.
- [x] With Xcode 14.2, the build passed but the `Run React Native Detox
Android e2e Test` step failed.
Possibly a flaky test: https://github.com/microsoft/onnxruntime/pull/21969
- [x] With Xcode 14.3.1, we encountered the following issue in `Build React
Native Detox iOS e2e Tests`:
```
ld: file not found: /Applications/Xcode_14.3.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/arc/libarclite_iphonesimulator.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
Appending the following code to the end of both ios/Podfile files fixed the
issue:
```
post_install do |installer|
  installer.generated_projects.each do |project|
    project.targets.each do |target|
      target.build_configurations.each do |config|
        config.build_settings['IPHONEOS_DEPLOYMENT_TARGET'] = '13.0'
      end
    end
  end
end
```
- [x] https://github.com/facebook/react-native/issues/32483
Applied the following changes to ios/Podfile:
```
pre_install do |installer|
  # Custom pre-install script or commands
  puts "Running pre-install script..."
  # Recommended fix for https://github.com/facebook/react-native/issues/32483
  # from https://github.com/facebook/react-native/issues/32483#issuecomment-966784501
  system("sed -i '' 's/typedef uint8_t clockid_t;//' \"${SRCROOT}/Pods/RCT-Folly/folly/portability/Time.h\"")
end
```
- [ ] Detox environment setup exceeded the timeout of 120000ms during the
iOS e2e test
### dependent
- [x] https://github.com/microsoft/onnxruntime/pull/21159
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
### Description
### Motivation and Context
The parameter is incorrect.
It may have had no negative impact so far only by chance.
d8e64bb529/cmake/CMakeLists.txt (L1712-L1717)
### Description
Fix default value 10.2->10.3 in
linux-gpu-tensorrt-daily-perf-pipeline.yml
### Motivation and Context
This is more flexible than hardcoding the provisioning profile name or UUID. The name shouldn't usually change but it is not guaranteed to remain constant.
### Description
Fix typo: ai:onnx -> ai.onnx
### Motivation and Context
Typo.
### Description
The DML CIs build native and C# as well as sign DLLs in the same CI.
Some parts of that require .net 8 and some .net 6.
Update to use .net 8 in general, and revert to .net 6 for the signing.
### Motivation and Context
Fix packaging pipeline.
### Description
Update various test projects to .net8 from EOL frameworks.
Replace the Xamarin based Android and iOS test projects with a MAUI
based project that uses .net 8.
Add new CoreML flags to C# bindings
### Motivation and Context
Remove usage of EOL frameworks.
### Description
Rename ios_packaging.requirements.txt to ios_packaging/requirements.txt
### Motivation and Context
With this change, the packages within ios_packaging/requirements.txt can be
scanned by the CG task.
- Remove redundant `OnnxruntimeModuleExampleE2ETest CheckOutputComponentExists` test
- Attempt to close any Application Not Responding (ANR) dialog prior to running Android test
- Add `--take-screenshots failing` option to detox test commands to save screenshots on failure
Use split API calls (read model, then compile) in lieu of the unified compile-model call
in the export compile flow to reduce memory use. The model
proto and serialized string are freed, and the OV IR model is read later, to free up memory
for the rest of the pipeline.
Optimization during the EpCtxt flow:
All Graph-related operations require every Node attribute to be
set when working with model instances internally. In the
existing implementation, these attributes are copied when a
Graph is constructed dynamically at runtime.
This change uses the attributes in place, without creating a copy, to
avoid memory allocation and copying when calling these Graph-related
functions.
Includes bug fixes related to the OpenVINO version and the epctxt
file path.
Moves the compiler version to C++20 to benefit from r-value memory
optimizations.
### Motivation and Context
This change is required because memory consumption during the compilation
flow is too high.
---------
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
### Description
Validate file signatures after signing by ESRP.
### Motivation and Context
- Add validation after the ESRP process.
- Make sure the files matching the targeted pattern/suffix are signed successfully
by ESRP.
- If a signature is not Valid, the following stages fail.
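The validation steps above might look roughly like the following sketch. It assumes the signature status of each file has already been collected into a dictionary; the function name `files_failing_validation`, the status strings other than `Valid`, and the sample file names are hypothetical and do not reflect the actual pipeline script.

```python
from fnmatch import fnmatchcase


def files_failing_validation(statuses, patterns):
    """Return files that match a targeted pattern but whose signature status is not "Valid".

    `statuses` maps file name -> signature status string.
    `patterns` is a list of glob patterns such as "*.dll".
    """
    return sorted(
        name
        for name, status in statuses.items()
        if any(fnmatchcase(name, pattern) for pattern in patterns)
        and status != "Valid"
    )


# Hypothetical statuses collected after the signing step.
collected = {
    "onnxruntime.dll": "Valid",
    "onnxruntime_providers_shared.dll": "NotSigned",
    "README.txt": "NotSigned",  # not targeted, so it is ignored
}
failing = files_failing_validation(collected, ["*.dll"])
print(failing)
```

A pipeline step built on this idea would exit with a non-zero code when the returned list is non-empty, so that later stages do not run.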
### Description
After editing set-trigger-rules.py, we must run it.
### Motivation and Context
The script evidently wasn't run, because some file names are incorrect.
### Description
* Add new ROCm CI pipeline (`Linux ROCm CI Pipeline`) focusing on
inference.
* Resolve test errors; disable flaky tests.
based on test PR #21614.
### Description
Since the stage needs to download drop-extra, it should declare the
dependency.
### Motivation and Context
Both arm64ec and x64 packages are needed.
x64 is needed for offline context binary generation
and arm64ec is needed for interop with python packages that don't have
prebuilt arm64 packages and only have x64.
### Description
Remove the `docker_base_image` parameter and variables from the CUDA
Packaging pipeline.
### Motivation and Context
Since the docker image is hard-coded in
`onnxruntime/tools/ci_build/github/linux/docker/inference/x86_64/default/cuda12/Dockerfile`
and
`onnxruntime/tools/ci_build/github/linux/docker/inference/x86_64/default/cuda11/Dockerfile`,
this parameter and these variables are no longer needed.
### Description
Do not allow clearing Android logs if the emulator is not running
### Motivation and Context
Previously, if one of the earlier steps failed, the "Clearing Android logs"
step hung until the pipeline timed out.
### Description
- TensorRT 10.2.0.19 -> 10.3.0.26
### Description
Pins pytorch-lightning package to version 2.3.3 since version >=2.4.0
requires torch > 2.1.0 which is not compatible with cu118.
### Motivation and Context
ORT 1.19 Release Preparation
### Description
### Motivation and Context
As of today, we cannot get enough A100 agent time to finish the jobs.
This PR makes the A100 job run only on the main branch, to unblock other PRs
in case capacity does not recover soon.
### Description
The xcframework now uses symlinks to have the correct structure
according to Apple requirements. Symlinks are not supported by nuget on
Windows.
In order to work around that we can store a zip of the xcframeworks in
the nuget package.
### Motivation and Context
Fix nuget packaging build break
### Description
* Fix migraphx build error caused by
https://github.com/microsoft/onnxruntime/pull/21598:
Add a conditional compile on code block that depends on ROCm >= 6.2.
Note that the pipeline uses ROCm 6.0.
Unblock orttraining-linux-gpu-ci-pipeline and
orttraining-ortmodule-distributed and orttraining-amd-gpu-ci-pipeline
pipelines:
* Disable a model test in linux GPU training ci pipelines caused by
https://github.com/microsoft/onnxruntime/pull/19470:
Sometimes the cudnn frontend throws an exception that the cudnn graph does not
support a Conv node of the keras_lotus_resnet3D model on V100 GPU.
Note that the same test does not throw an exception in other GPU pipelines. The
failure might be related to cudnn 8.9 and the V100 GPU used in the pipeline
(Ampere GPUs and cuDNN 9.x do not have the issue).
The actual fix requires fallback logic, which will take time to
implement, so we temporarily disable the test in training pipelines.
* Force install torch for CUDA 11.8. (The docker image has torch 2.4.0 for
CUDA 12.1 to build the torch extension, which is not compatible with CUDA
11.8.) Note that this is a temporary workaround. A more elegant fix is to
ensure the right torch version in the docker build step, which might require
updating install_python_deps.sh and the corresponding requirements.txt.
* Skip test_gradient_correctness_conv1d since it causes a segmentation fault.
The root cause needs more investigation (maybe due to the cudnn frontend as
well).
* Skip test_aten_attention since it causes an assertion failure. The root cause
needs more investigation (maybe due to the torch version).
* Skip orttraining_ortmodule_distributed_tests.py since it fails with an error
that the compiler for the torch extension does not support C++17. One possible
fix is to set the following compile argument inside setup.py of the
fused_adam extension: extra_compile_args['cxx'] = ['-std=c++17'].
However, due to the urgency of unblocking the pipelines, just disable
the test for now.
* Skip test_softmax_bf16_large. For some reason,
torch.cuda.is_bf16_supported() returns True on V100 with torch 2.3.1, so
the test was run in CI, but V100 does not support bf16 natively.
* Fix typo of deterministic
### Description
Update to match #21627 and make the info for Split consistent.
As a Split that doesn't split anything is a no-op it doesn't seem
meaningful to call that limitation out in the docs.
### Description
- Update pipelines to use QNN SDK 2.25 by default
- Update ifdef condition to apply workaround for QNN LayerNorm
validation bug to QNN SDK 2.25 (as well as 2.24)
### Motivation and Context
Use the latest QNN SDK
### Description
Improve the docker commands so that docker image layer caching works.
This makes docker builds faster and more stable.
For now, the A100 pool's system disk is too small to use the docker cache.
We no longer use the pipeline cache for docker images, and some legacy
code is removed.
### Motivation and Context
The following exception occurs often:
```
64.58 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
286.4 curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
```
This happens because the ONNX Runtime pipelines have been sending too many requests to
download Node.js during docker builds,
which is now the major cause of pipeline failures.
In fact, docker image layer caching never worked.
We can always see that the scripts are still running:
```
#9 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
#9 0.234 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
#9 0.235 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
#9 0.235 /tmp/scripts/install_centos.sh: line 1: !/bin/bash: No such file or directory
#9 0.235 ++ '[' '!' -f /etc/yum.repos.d/microsoft-prod.repo ']'
#9 0.236 +++ tr -dc 0-9.
#9 0.236 +++ cut -d . -f1
#9 0.238 ++ os_major_version=8
....
#9 60.41 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
#9 60.59 + return 0
...
```
This PR improves the docker command so that image layer caching
works.
Thus, CI won't send so many redundant requests to download Node.js.
```
#9 [2/5] ADD scripts /tmp/scripts
#9 CACHED
#10 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
#10 CACHED
#11 [4/5] RUN adduser --uid 1000 onnxruntimedev
#11 CACHED
#12 [5/5] WORKDIR /home/onnxruntimedev
#12 CACHED
```
### Reference
https://docs.docker.com/build/drivers/
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
Add ability to test packaging without rebuilding every time.
Add ability to comment out some platforms/architectures without the
scripts to assemble the c/obj-c packages breaking.
Update a couple of commands to preserve symlinks.
### Motivation and Context
Make debugging packaging issues faster.
Creates correct package for mac-catalyst and doesn't require setting
symlinks via bash script.
### Description
Added CUDNN Frontend and used it for NHWC convolutions, and optionally
fuse activation.
#### Backward compatible
- Existing models that contain FusedConv can still run.
- If ORT is built with cuDNN 8, cuDNN frontend will not be built into
binary. Old kernels (using cudnn backend APIs) are used.
#### Major Changes
- For cuDNN 9, we enable the cudnn frontend to fuse convolution and
bias when the provider option `fuse_conv_bias=1` is set.
- Remove the fusion of FusedConv from the graph transformer for the CUDA
provider, so FusedConv will no longer be added to the graph for the CUDA EP
in the future.
- Update the cmake files with regard to cudnn settings. The search order for the
CUDNN installation during the build is as follows:
* environment variable `CUDNN_PATH`
* `onnxruntime_CUDNN_HOME` cmake extra defines. If a build starts from
build.py/build.sh, user can pass it through `--cudnn_home` parameter, or
by environment variable `CUDNN_HOME` if `--cudnn_home` not used.
* cudnn python package installation directory like
python3.xx/site-packages/nvidia/cudnn
* CUDA installation path
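The search order above can be sketched as a simple first-match lookup. This is only an illustration in Python of the order described; the real logic is in the cmake files, and the function names, the `site_packages` argument, and the example paths are all hypothetical.

```python
import posixpath


def cudnn_candidates(env, cudnn_home_arg=None, site_packages=None, cuda_home=None):
    """List candidate cuDNN locations in the documented priority order."""
    # onnxruntime_CUDNN_HOME comes from --cudnn_home, else from the CUDNN_HOME env var.
    cmake_cudnn_home = cudnn_home_arg or env.get("CUDNN_HOME")
    candidates = [
        env.get("CUDNN_PATH"),  # 1. CUDNN_PATH environment variable
        cmake_cudnn_home,       # 2. onnxruntime_CUDNN_HOME cmake define
        # 3. cudnn python package installation directory
        posixpath.join(site_packages, "nvidia", "cudnn") if site_packages else None,
        cuda_home,              # 4. CUDA installation path
    ]
    return [c for c in candidates if c]


def first_existing(candidates, exists):
    """Return the first candidate for which `exists` is true, or None."""
    return next((c for c in candidates if exists(c)), None)


order = cudnn_candidates(
    {"CUDNN_HOME": "/opt/cudnn"},
    site_packages="/venv/lib/python3.10/site-packages",
    cuda_home="/usr/local/cuda",
)
print(order)
```

In practice `exists` would be something like `os.path.isdir`; it is passed in here so the lookup can be exercised without touching the file system.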
#### Potential Issues
- If ORT is built with cuDNN 8, FusedConv fusion is no longer done
automatically, so some models might have a performance regression. Users who
still want the FusedConv operator for performance reasons have
several workarounds, such as using an older version of onnxruntime,
or using an older version of ORT to save an optimized ONNX model and then running it with the latest
version of ORT. We believe that the majority of users will have moved to cuDNN 9
by the 1.20 release (since cuDNN 9 will have been the default in ORT and PyTorch for 3
months by then), so the impact is small.
- cuDNN graph uses TF32 by default, and users cannot disable TF32 through
the use_tf32 CUDA provider option. A user who encounters accuracy issues
(such as in testing) has to set the environment variable
`NVIDIA_TF32_OVERRIDE=0` to disable TF32. The documentation of
use_tf32 needs to be updated later.
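For test scripts, one way to apply the `NVIDIA_TF32_OVERRIDE=0` setting is to build the environment for a child process explicitly. The helper name `env_with_tf32_disabled` below is hypothetical, and the sketch assumes the variable has to be in place before the CUDA libraries are initialized in that process.

```python
import os


def env_with_tf32_disabled(base_env=None):
    """Return a copy of the environment with NVIDIA_TF32_OVERRIDE set to "0"."""
    env = dict(os.environ if base_env is None else base_env)
    env["NVIDIA_TF32_OVERRIDE"] = "0"
    return env


# The returned mapping can be passed as `env=` to subprocess.run when
# launching a test or benchmark process.
child_env = env_with_tf32_disabled({"PATH": "/usr/bin"})
print(child_env["NVIDIA_TF32_OVERRIDE"])
```

Copying the environment rather than mutating `os.environ` keeps the override scoped to the launched process.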
#### Follow ups
This is one of the PRs that aim to enable NHWC convolution in the CUDA EP by
default if the device supports it. Other changes will follow to
make that possible.
(1) Enable `prefer_nhwc` by default for device with sm >= 70.
(2) Change `fuse_conv_bias=1` by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).
### Motivation and Context
The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100 -d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
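To put the two latency figures quoted above in relative terms, the short calculation below derives the speedup and the latency reduction from them. Only the two millisecond values come from the description; the variable names are illustrative.

```python
before_ms = 49.1144  # per-inference latency before the change
after_ms = 42.4643   # per-inference latency after the change

speedup = before_ms / after_ms
reduction_pct = (before_ms - after_ms) / before_ms * 100

print(f"speedup: {speedup:.3f}x, latency reduction: {reduction_pct:.2f}%")
```

This works out to a speedup of roughly 1.16x, or about a 13.5% reduction in latency per inference.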
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.1 to 2.2.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/pytorch/pytorch/releases">torch's
releases</a>.</em></p>
<blockquote>
<h2>PyTorch 2.2: FlashAttention-v2, AOTInductor</h2>
<h1>PyTorch 2.2 Release Notes</h1>
<ul>
<li>Highlights</li>
<li>Backwards Incompatible Changes</li>
<li>Deprecations</li>
<li>New Features</li>
<li>Improvements</li>
<li>Bug fixes</li>
<li>Performance</li>
<li>Documentation</li>
</ul>
<h1>Highlights</h1>
<p>We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2
offers ~2x performance improvements to
<code>scaled_dot_product_attention</code> via FlashAttention-v2
integration, as well as AOTInductor, a new ahead-of-time compilation and
deployment tool built for non-python server-side deployments.</p>
<p>This release also includes improved torch.compile support for
Optimizers, a number of new inductor optimizations, and a new logging
mechanism called TORCH_LOGS.</p>
<p><strong>Please note that we are <a
href="https://redirect.github.com/pytorch/pytorch/issues/114602">deprecating
macOS x86 support</a>, and PyTorch 2.2.x will be the last version that
supports macOS x64.</strong></p>
<p>Along with 2.2, we are also releasing a series of updates to the
PyTorch domain libraries. More details can be found in the library
updates blog.</p>
<p>This release is composed of 3,628 commits and 521 contributors since
PyTorch 2.1. We want to sincerely thank our dedicated community for your
contributions. As always, we encourage you to try these out and report
any issues as we improve 2.2. More information about how to get started
with the PyTorch 2-series can be found at our <a
href="https://pytorch.org/get-started/pytorch-2.0/">Getting Started</a>
page.</p>
<p>Summary:</p>
<ul>
<li><code>scaled_dot_product_attention</code> (SDPA) now supports
FlashAttention-2, yielding around 2x speedups compared to previous
versions.</li>
<li>PyTorch 2.2 introduces a new ahead-of-time extension of
TorchInductor called AOTInductor, designed to compile and deploy PyTorch
programs for non-python server-side.</li>
<li><code>torch.distributed</code> supports a new abstraction for
initializing and representing ProcessGroups called device_mesh.</li>
<li>PyTorch 2.2 ships a standardized, configurable logging mechanism
called TORCH_LOGS.</li>
<li>A number of torch.compile improvements are included in PyTorch 2.2,
including improved support for compiling Optimizers and improved
TorchInductor fusion and layout optimizations.</li>
<li>Please note that we are deprecating macOS x86 support, and PyTorch
2.2.x will be the last version that supports macOS x64.</li>
<li><code>torch.ao.quantization</code> now offers a prototype
<code>torch.export</code> based flow</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="8ac9b20d4b"><code>8ac9b20</code></a>
Run docker release build on final tag (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117131">#117131</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/117182">#117182</a>)</li>
<li><a
href="2490352430"><code>2490352</code></a>
Fix cuInit test on Windows (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117095">#117095</a>)</li>
<li><a
href="3a44bb713f"><code>3a44bb7</code></a>
[CI] Test that cuInit is not called during import (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117043">#117043</a>)</li>
<li><a
href="1c8ba3847d"><code>1c8ba38</code></a>
[CI] Use jemalloc for CUDA builds (<a
href="https://redirect.github.com/pytorch/pytorch/issues/116900">#116900</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116988">#116988</a>)</li>
<li><a
href="96d2ddbafe"><code>96d2ddb</code></a>
Store user model to simplify
ONNXProgram.{adapt_torch_*,<strong>call</strong>} APIs (<a
href="https://redirect.github.com/pytorch/pytorch/issues/1152">#1152</a>...</li>
<li><a
href="738b4a560a"><code>738b4a5</code></a>
Update ONNX's IO Adapter to support FakeTensor with ExportedProgram (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114407">#114407</a>)...</li>
<li><a
href="4cf10bf4dc"><code>4cf10bf</code></a>
[Cherry-pick] [Quant] [PT2] Enable batchnorm in
_move_exported_model_to_eval ...</li>
<li><a
href="7e97e4b4b6"><code>7e97e4b</code></a>
[AARCH64] Fall back to GEMM if mkldnn_matmul fails (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115936">#115936</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116666">#116666</a>)</li>
<li><a
href="1a3e3c7cff"><code>1a3e3c7</code></a>
[CUDA] baddmm should fall back to addmm for batch=1 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114992">#114992</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116518">#116518</a>)</li>
<li><a
href="ab7505f78c"><code>ab7505f</code></a>
Fix broken PyYAML 6.0 on MacOS x86 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115956">#115956</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116551">#116551</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/pytorch/pytorch/compare/v1.13.1...v2.2.0">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Add support for Split Op
### Motivation and Context
Address operator gaps in high priority model.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Update TRT OSS Parser to [latest 10.2-GA
branch](f161f95883)
### Description
Since on-device training CPU packaging has been moved to a separate
pipeline, its NuGet package publishing step must be moved as well.
### Motivation and Context
Fixes the exception in Nuget Publishing Packaging Pipeline caused by
#21485
### Description
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml
### Motivation and Context
This CI pipeline has been divided into 4 different pipelines.
The change in #21005 works for directly building wheels with `build.py`,
but ort-nightly-directml wheels, as well as the 1.18.1 release of the
onnxruntime-directml python wheel, still do not work with conda since
they're built from the `py-win-gpu.yml` pipeline, which uses
`install_third_party_deps.ps1` to set compile flags.
### Description
Set version and other info in the Microsoft.ML.OnnxRuntime C# dll by
setting GenerateAssemblyInfo to true and passing in ORT version in the
CI.
Minor re-org of the order of properties so related things are grouped a
little better.
### Motivation and Context
#21475
### Description
`enable_windows_arm64_qnn` and `enable_windows_x64_qnn` are true by
default but unnecessary for training. This change explicitly sets these
parameters to false for the training pipeline.
### Motivation and Context
ORT 1.19 Release Preparation
### Description
Add these changes to one PR to simplify checkin
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)
Other changes
- updated partitioning utils to support dropping constant initializers
from a ComputeCapability's inputs.
- noticed that the list of inputs to the coreml model was unexpectedly
long due to this
- we copy constant initializers to a CoreML model so don't need the
originals, and if they remain as inputs ORT can't free them as they
appear to be in use.
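
The input-pruning idea above can be sketched as follows. This is a minimal illustration in Python, not the actual partitioning utility code (which is C++); the function name `drop_constant_initializer_inputs` and the example tensor names are hypothetical.

```python
def drop_constant_initializer_inputs(capability_inputs, constant_initializers):
    """Return the inputs of a partition with constant initializers removed.

    Hypothetical helper: `capability_inputs` is an ordered list of input names
    and `constant_initializers` is a collection of names known to be constant.
    The EP is assumed to have copied those constants into its own model.
    """
    constants = set(constant_initializers)
    # Preserve the original input order; only drop names known to be constant.
    return [name for name in capability_inputs if name not in constants]


# Hypothetical example: a Conv partition with one real input and two weights.
inputs = ["X", "conv.weight", "conv.bias"]
remaining = drop_constant_initializer_inputs(inputs, {"conv.weight", "conv.bias"})
print(remaining)
```

With the constants removed from the input list, nothing in the partition still references the original initializers, which is what allows them to be freed.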
### Motivation and Context
### Description
Current failure is due to a version mismatch.
Use llvm-cov from the Android NDK instead of the system gcov so that the
version is correct.
Also comment out publishing to the Azure dashboard to simplify the
setup. The CI prints out the stats for review by developers.
### Motivation and Context
Fix CI pipeline
### Description
Right now our "Zip-Nuget-Java-Nodejs Packaging Pipeline" is too big.
The on-device training part is independent of the others, so it can be
split out. Then our NPM Packaging pipeline will no longer depend on the
training code.
### Motivation and Context
Similar to #21235
This PR also fixes the following problem: the "NuGet_Test_Linux_Training_CPU"
job downloads artifacts from "onnxruntime-linux-x64" to get the custom op
shared libs, but it did not declare a dependency on
"Linux_C_API_Packaging_CPU_x64", the job that produces the artifact. Such
problems can be hard to find once a pipeline grows large.
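
A missing dependency of this kind can be detected mechanically. The following is a minimal sketch, assuming a simplified in-memory model of a pipeline; the `jobs` dictionary layout and the `find_undeclared_dependencies` helper are hypothetical and are not part of the Azure Pipelines YAML schema or the ORT tooling.

```python
def find_undeclared_dependencies(jobs):
    """Find jobs that download an artifact without depending on its producer.

    `jobs` maps job name -> {"depends_on": [...], "produces": [...],
    "downloads": [...]}. This layout is a hypothetical simplification.
    Only direct dependencies are checked; transitive ones are ignored.
    Returns a list of (job, artifact, producer) tuples.
    """
    # Map each artifact name to the job that publishes it.
    producer_of = {}
    for name, job in jobs.items():
        for artifact in job.get("produces", []):
            producer_of[artifact] = name

    problems = []
    for name, job in jobs.items():
        declared = set(job.get("depends_on", []))
        for artifact in job.get("downloads", []):
            producer = producer_of.get(artifact)
            if producer is not None and producer != name and producer not in declared:
                problems.append((name, artifact, producer))
    return problems


# The situation described above, before the fix.
jobs = {
    "Linux_C_API_Packaging_CPU_x64": {"produces": ["onnxruntime-linux-x64"]},
    "NuGet_Test_Linux_Training_CPU": {
        "depends_on": [],
        "downloads": ["onnxruntime-linux-x64"],
    },
}
problems = find_undeclared_dependencies(jobs)
print(problems)
```

Adding the producer job to `depends_on` makes the returned list empty.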
### Description
* Swap CUDA versions 11.8/12.2 in GPU CIs
* Set CUDA 12 as the default version in the YAML files that publish
NuGet/Python/Java GPU packages
* Stop treating warnings in flash_api.cc as errors during the ORT Windows build
### Description
- Update pipelines to use QNN SDK 2.24 by default
- Update QNN_Nuget_Windows pipeline to build csharp solution without
mobile projects (fixes errors).
- Implement workaround for QNN 2.24 validation bug for LayerNorm ops
without an explicit bias input.
- Enable the Relu unit test, which now passes because Relu is no
longer fused into QuantizeLinear for QNN EP.
- Fix bug where a negative quantization axis is not properly normalized
for per-channel int4 conv.
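
For the last item, normalizing a negative axis means mapping it to the equivalent non-negative index before it is used. A minimal sketch in Python, assuming the usual ONNX convention that a valid axis lies in `[-rank, rank - 1]`; the `normalize_axis` helper is illustrative and is not the QNN EP code:

```python
def normalize_axis(axis, rank):
    """Map an axis in [-rank, rank - 1] to its non-negative equivalent."""
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    # A negative axis counts dimensions from the back.
    return axis + rank if axis < 0 else axis


# For a rank-4 conv weight, axis -4 and axis 0 name the same dimension.
weight_rank = 4
normalized = normalize_axis(-4, weight_rank)
print(normalized)
```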
### Motivation and Context
Update QNN SDK.
### Description
Before this change, copy_strip_binary.sh manually copied each file from
ONNX Runtime's build folder to an artifact folder, which is hard to get
right when shared libraries involve symbolic links.
This PR changes the packaging pipelines to run "make install" first,
before packaging the shared libs.
### Motivation and Context
Recently, because of feature request #21281, we changed
libonnxruntime.so's SONAME. Now every package that contains this shared
library must also contain libonnxruntime.so.1, so the packaging scripts
need to include this file. Instead of manually constructing the symlink
layout, using `make install` is much easier and makes things more
consistent because it is a standard way of making packages.
**Breaking change:**
After this change, the **inference** tarballs published to our
GitHub release pages will no longer contain ORT **training** headers.
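
For reference, the symlink layout mentioned in the motivation can be sketched as below. This assumes the conventional Linux layout for a versioned shared library (unversioned name -> SONAME -> fully versioned file). The full version `1.0.0` is a placeholder and `create_symlink_layout` is a hypothetical helper, not part of the packaging scripts.

```python
import os
import tempfile


def create_symlink_layout(lib_dir, base="libonnxruntime.so", soversion="1",
                          full_version="1.0.0"):
    """Create base -> base.<soversion> -> base.<full_version> in lib_dir.

    The full version is a placeholder; only the SONAME suffix ".1" comes
    from the description above.
    """
    real_name = f"{base}.{full_version}"
    soname = f"{base}.{soversion}"
    # Stand-in for the real shared library file.
    with open(os.path.join(lib_dir, real_name), "wb") as f:
        f.write(b"")
    # Relative links, so the layout survives being moved or archived.
    os.symlink(real_name, os.path.join(lib_dir, soname))
    os.symlink(soname, os.path.join(lib_dir, base))


with tempfile.TemporaryDirectory() as lib_dir:
    create_symlink_layout(lib_dir)
    listing = sorted(os.listdir(lib_dir))
    resolved = os.path.basename(
        os.path.realpath(os.path.join(lib_dir, "libonnxruntime.so")))
    print(listing, resolved)
```

A package that ships only `libonnxruntime.so` without the `.1` link would fail to load at runtime for binaries that recorded the new SONAME, which is why the whole chain has to be packaged.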
### Description
Add ML Program ConvTranspose
- some limitations to simplify the implementation for now
- some limitations due to flaky CoreML output
Added support for non-contiguous MLMultiArray output as we see that with
some unit tests when the CPU-only flag is not set (e.g. innermost dim
has min size of 16 but test output only has 8 values).
- support only one non-contiguous dim to keep it simple
- manually tested as we don't have a setup that can test objective-c
code
- test code is in model.mm and can be enabled via ifdef if we need to
validate any future changes
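
The non-contiguous case above (rows padded to 16 elements while only 8 are valid) amounts to a strided copy. A minimal sketch in Python using flat lists; the actual handling lives in Objective-C++ in model.mm, and `copy_strided_rows` is a hypothetical name used only for illustration:

```python
def copy_strided_rows(src, num_rows, row_size, row_stride):
    """Pack rows from a padded flat buffer into a contiguous list.

    Each row holds `row_size` valid values and rows start `row_stride`
    elements apart, so exactly one dimension is non-contiguous.
    """
    if row_stride < row_size:
        raise ValueError("row_stride must be at least row_size")
    dst = []
    for r in range(num_rows):
        start = r * row_stride
        # Copy only the valid prefix of the row and skip the padding.
        dst.extend(src[start:start + row_size])
    return dst


# Two rows of 8 valid values, each padded out to 16 elements with -1.
padded = []
for r in range(2):
    padded.extend(range(r * 8, r * 8 + 8))
    padded.extend([-1] * 8)

packed = copy_strided_rows(padded, num_rows=2, row_size=8, row_stride=16)
print(packed)
```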
### Motivation and Context
Address operator gaps in high priority model.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Add GridSample ML Program support
One combination of inputs produces diffs between the PyTorch-generated unit
test data and the CoreML output. It is disabled until needed, as the
investigation may take a while.
### Motivation and Context
High priority models.
1. Update google benchmark from 1.8.3 to 1.8.5
2. Update google test from commit in main branch to tag 1.15.0
3. Update pybind11 from 2.12.0 to 2.13.1
4. Update pytorch cpuinfo to include the support for Arm Neoverse V2,
Cortex X4, A720 and A520.
5. Update re2 from 2024-05-01 to 2024-07-02
6. Update cmake to 3.30.1
7. Update Linux docker images
8. Fix a warning in test/perftest/ort_test_session.cc:826:37: error:
implicit conversion loses integer precision: 'streamoff' (aka 'long
long') to 'const std::streamsize' (aka 'const long')
[-Werror,-Wshorten-64-to-32]
### Description
Add support for Slice
### Motivation and Context
High priority models.
### Description
Replace inline pip install with pip install from requirements*.txt
### Motivation and Context
So that CG can recognize the installed dependencies.
### Dependency
- [x] https://github.com/microsoft/onnxruntime/pull/21085
### Description
Add CoreML ML Program Resize
- refactor existing logic to try and simplify and share between
NeuralNetwork and MLProgram checks
- add handling for some new attributes
- antialias and axes - should have been done when setting the CoreML EP
max opset to 21
### Motivation and Context
Support priority models