Commit graph

11428 commits

Prathik Rao
ffceed9d44
ORT 1.19.2 Release: Cherry Pick Round 1 (#21861)
Approved cherry picks for ORT 1.19.2 release.

---------

Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: aciddelgado <139922440+aciddelgado@users.noreply.github.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
2024-08-30 15:02:31 -07:00
Prathik Rao
d6514636ea
ORT 1.19.1 Release: Cherry Pick Round 1 (#21796)
Approved cherry picks for ORT 1.19.1 release.

---------

Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
2024-08-20 21:21:44 -07:00
Prathik Rao
26250ae74d
ORT 1.19.0 Release: Cherry-Pick Round 2 (#21726)
### Description

PRs marked for cherry-pick & bug fixes.

### Motivation and Context

ORT 1.19.0 Release Preparation

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: liqun Fu <liqfu@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
2024-08-14 13:45:35 -07:00
Prathik Rao
ccf6a28c3c
ORT 1.19.0 Release: Cherry-Pick Round 1 (#21619)
### Description

PRs marked for cherry-pick.

### Motivation and Context

ORT 1.19.0 Release Preparation

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>
Co-authored-by: liqun Fu <liqfu@microsoft.com>
Co-authored-by: Jing Fang <126209182+fajin-corp@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: jingyanwangms <47403504+jingyanwangms@users.noreply.github.com>
Co-authored-by: Yi Zhang <your@email.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
2024-08-12 16:54:25 -07:00
Prathik Rao
ee2fe87e2d
ORT 1.19.0 Release: Cherry-Pick Round 0 (#21609)
### Description

Critical changes required for an external developer (GeekBench)
 

### Motivation and Context

ORT 1.19.0 Release Preparation

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
2024-08-03 22:04:57 -07:00
Yi-Hong Lyu
530a2d7b41
Enable FP16 Clip and Handle Bias in FP16 Depthwise Conv (#21493)
- Improved accuracy for face-detection, image-classification, and
object-detection in the GeekBench ML benchmark on ARM64.
- Fixed issue https://github.com/microsoft/onnxruntime/issues/18992
2024-07-30 03:49:14 -07:00
Changming Sun
82036b0497
Remove references to the outdated CUDA EP factory method (#21549)
The function "OrtSessionOptionsAppendExecutionProvider_CUDA" is
deprecated.
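For reference, a minimal sketch of the non-deprecated way to enable the CUDA EP from Python, assuming a CUDA build of onnxruntime (the model path is hypothetical):

```python
# Hedged sketch: enable the CUDA EP via the providers argument instead of the
# deprecated OrtSessionOptionsAppendExecutionProvider_CUDA factory function.
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",  # hypothetical model path
    providers=[("CUDAExecutionProvider", {"device_id": 0})],
)
print(sess.get_providers())
```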
2024-07-29 21:59:16 -07:00
vraspar
07d3be5b0e
CoreML: Add ML Program Split Op (#21456)
### Description

Add support for Split Op


### Motivation and Context
Address operator gaps in a high-priority model.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-07-30 14:04:47 +10:00
Yifan Li
5d78b9a17b
[TensorRT EP] Update TRT OSS Parser to 10.2 (#21552)
### Description
Update TRT OSS Parser to [latest 10.2-GA
branch](f161f95883)


2024-07-29 17:27:38 -07:00
mcollinswisc
8417c325ec
Keep QDQ nodes w/ nonpositive scale around MaxPool (#21182)
### Description
This change adds a check for whether the scale in the QuantizeLinear (or
DequantizeLinear) is a positive scalar, and a new selector to disallow
removing the QDQ around MaxPool if it is not.

### Motivation and Context
Currently, the DropQDQNodesRules optimization removes QuantizeLinear and
DequantizeLinear nodes from DequantizeLinear ∘ MaxPool ∘ QuantizeLinear.
However, if the x_scale/y_scale values are non-positive, the
(de-)quantization changes the ordering of the elements in the input
value, so this optimization changes the results; the sketch below
illustrates this.


https://github.com/microsoft/onnxruntime/issues/21176
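A minimal numpy sketch (illustrative only, not the ORT selector code) of why a negative scale breaks the optimization: quantizing with a negative scale reverses element ordering, so a max over the quantized integers picks a different element than a max over the real values.

```python
import numpy as np

def quantize(x, scale, zero_point):
    # QuantizeLinear: q = saturate(round(x / scale) + zero_point)
    return np.clip(np.rint(x / scale) + zero_point, -128, 127).astype(np.int8)

x = np.array([1.0, 2.0], dtype=np.float32)
scale, zero_point = -0.1, 0

q = quantize(x, scale, zero_point)   # [-10, -20]: ordering is reversed
print(q.max() * scale)               # max over quantized data -> -10 -> 1.0
print(x.max())                       # true max is 2.0, so results differ
```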

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-07-30 09:06:51 +10:00
Sophie Schoenmeyer
d98581495f
Update labeling bot (#21548)
The current labeling bot over-applies many of the labels (e.g., ep:CUDA and
platform:windows) and is missing labels for some of the APIs and EPs.

We are working on migrating this workflow to GitHub policies, but would like
to use this fix in the meantime to avoid causing any issues with ORT 1.19.

2024-07-29 16:06:03 -07:00
Adam Reeve
7543dd040b
Propagate NaNs in the CPU min and max operators (#21492)
### Description

Propagates NaN values in the min and max operators so that min or max
with a NaN in either input always produces NaN.

### Motivation and Context

Fixes #21455
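A quick numpy illustration of the intended semantics (numpy's minimum/maximum already propagate NaN this way; this is not the ORT kernel code):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0], dtype=np.float32)
b = np.array([2.0, 2.0, np.nan], dtype=np.float32)

# A NaN in either input now always produces NaN in the output.
print(np.minimum(a, b))  # [ 1. nan nan]
print(np.maximum(a, b))  # [ 2. nan nan]
```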
2024-07-30 08:50:13 +10:00
Preetha Veeramalai
c39f1c4fd8
ORT- OVEP 1.19 PR-follow up (#21546)
### Description
Follow-up PR for bug fixes on 1.19


### Motivation and Context

- Handles 1.19 Dockerfile fixes.
- Sets the default file naming of the epctx ONNX model to use the _ctx.onnx
suffix.
- Creates epctx model directories if they don't exist.

---------

Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
2024-07-29 14:12:36 -07:00
Yulong Wang
b03c9496aa
[js/web] allow load WebAssembly binary from buffer (#21534)
### Description

This PR adds a new option `ort.env.wasm.wasmBinary`, which allows users to
set a buffer containing preloaded .wasm file content.

This PR should resolve the problem from the latest discussion in #20876.
2024-07-29 13:39:38 -07:00
Xu Xing
0d7cf301a1
[js/webgpu] Add activation Tanh (#21540)
Bug: https://github.com/microsoft/onnxruntime/issues/21467

2024-07-29 11:05:34 -07:00
Jian Chen
79537d0523
Remove tools/ci_build/github/android/run_nnapi_code_coverage.sh (#21371)
### Description
Remove tools/ci_build/github/android/run_nnapi_code_coverage.sh

### Motivation and Context
This file is no longer needed
2024-07-29 10:00:52 -07:00
Jian Chen
bc3713206d
Update QNN pipeline pool (#21482)
### Description
Update QNN pipeline pool 



### Motivation and Context
Ensure all our pipelines use the latest NDK version.
2024-07-29 10:00:21 -07:00
Yi Zhang
05cef469e8
Move on-device training packages publish step (#21539)
### Description
Since the on-device training CPU packaging is now a separate pipeline, its
NuGet package publishing step must be moved as well.

### Motivation and Context
Fixes the exception in the NuGet packaging publishing pipeline caused by
#21485
2024-07-29 09:59:46 -07:00
mingyueliuh
d8888136e3
Add support tensor element type for register custom op shape infer function (#21387)
### Description
Functionality extension for the SetOutputShape method in custom op shape inference.


### Motivation and Context
- **SetOutputShape** interface enhancement: the shape inference function needs to set both the tensor type and the shape. Add a parameter **type** to allow users to specify the tensor type, with **ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT** as the default value to ensure compatibility.

Co-authored-by: mingyue <mingyue@amd.com>
2024-07-29 09:45:52 -07:00
Wanming Lin
94eb70d983
[WebNN EP] Add labels for all WebNN operators (#21516)
In order to provide more diagnosable error messages for developers.

Spec change: https://github.com/webmachinelearning/webnn/pull/742
2024-07-29 08:50:14 -07:00
Xu Xing
5bc12bf209
[js/webgpu] Add activation for conv3d naive (#21466)
2024-07-29 08:47:41 -07:00
Yulong Wang
dbff0cd098
[js/node] enable float16 support for Node.js binding (#20581)
### Description
Enable float16 support for the Node.js binding.

Data of a float16 tensor uses `Uint16Array`.
2024-07-28 13:03:17 -07:00
liqun Fu
a4d3a1ce0c
pick changes from https://github.com/onnx/onnx/pull/6195 to fix heap-buffer-overflow in onnx::convPoolShapeInference (#21507)
### Description
onnx 1.16.2 is not available before the ORT 1.19.0 code freeze, so pick up
the needed change as a patch.
2024-07-27 15:58:36 -07:00
Jian Chen
7e23212de9
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml (#21529)
### Description
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml


### Motivation and Context
This CI pipeline has been divided into 4 different pipelines.
2024-07-27 15:58:12 -07:00
Ranjit Ranjan
82b2955268
[AIX]test failure fix using gtest-1.15.0 for AIX (#21497)
### Description
The local CI setup for AIX reported test failures after the gtest 1.15.0
upgrade.

### Motivation and Context
The following test failures were observed after the gtest upgrade.

The following tests FAILED:
	  1 - onnxruntime_test_all (ILLEGAL)
	  7 - onnxruntime_logging_apis_test (Subprocess aborted)

To fix this, I am enabling pthread support in gtest; it was disabled with
the previous version of gtest for some reason.
With pthread support enabled, the above tests pass with gtest 1.15.0.
2024-07-27 11:17:22 -07:00
jingyanwangms
48fb8a7e56
Security fuzz address sanitizer fix Bug #2 and #3 (#21528)
### Description
Security fuzz testing with AddressSanitizer found several bugs.
2024-07-27 11:10:52 -07:00
dependabot[bot]
1ce160883f
Bump Sixlabors.ImageSharp from 2.1.8 to 2.1.9 in /csharp/sample/Microsoft.ML.OnnxRuntime.ResNet50v2Sample (#21444)
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.8 to 2.1.9.
Release notes for v2.1.9, sourced from Sixlabors.ImageSharp's releases
(https://github.com/SixLabors/ImageSharp/releases):

- [2.1] Fix overflow in MemoryAllocator.Create(options) by @antonfirsov in SixLabors/ImageSharp#2732
- Backport GIF LZW fix to 2.1 by @antonfirsov in SixLabors/ImageSharp#2756
- Backport 2759 to 2.1.x by @antonfirsov in SixLabors/ImageSharp#2770

Full changelog: https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-26 22:31:16 -07:00
maggie1059
10b4a3b90b
Fix conda failure for onnxruntime-directml (#21526)
The change in #21005 works for directly building wheels with `build.py`,
but ort-nightly-directml wheels, as well as the 1.18.1 release of the
onnxruntime-directml python wheel, still do not work with conda since
they're built from the `py-win-gpu.yml` pipeline, which uses
`install_third_party_deps.ps1` to set compile flags.
2024-07-26 22:26:38 -07:00
Yueqing Zhang
d01fc75ef1
[VitisAI] support vaip create ep context nodes & bug fix (#21506)
### Description
1. We decided to move the context node creation back to our own repo because it is more flexible to modify.
2. We found a bug related to the context node: it would change the inference order, so we fixed it in this PR as well.


### Motivation and Context
This is crucial for the Microsoft release next month.

---------

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2024-07-26 22:15:57 -07:00
zz002
690d745cbf
[VitisAI] 1. KernelDef supports StartVersion and EndVersion (#21519)
### Description

[VitisAI] 1. KernelDef supports StartVersion and EndVersion
2. CapabilityOps checks domain


Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-07-26 20:28:55 -07:00
Scott McKay
5af423c7c0
Set version and other info in the C# dll (#21517)
### Description
Set version and other info in the Microsoft.ML.OnnxRuntime C# dll by
setting GenerateAssemblyInfo to true and passing in the ORT version in the
CI.

Minor re-org of the order of properties so related things are grouped a
little better.

### Motivation and Context
#21475
2024-07-27 13:22:57 +10:00
Tianlei Wu
64819f6f8c
Update benchmark_mha.py to compare with PyTorch SDPA (#21449)
### Description
* Update benchmark_mha.py to compare with PyTorch SDPA api.
* Write results to csv file.
* Use sdpa_kernel cuda provider option instead of environment variables
for better control.
* Add arguments (`--use_gpu`, `--causal`, etc.) to allow testing different
scenarios.
* Update benchmark_mha.sh to add cpu benchmarks

For the Q,K,V format, torch uses BNSH while ort uses BSNH, so the comparison
is not apples-to-apples. However, a large latency difference could still be
a warning sign. A rough sketch of the PyTorch side follows.
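The sketch below assumes PyTorch 2.x with a CUDA device; tensor shapes are BNSH, and the `sdp_kernel` context manager shown here is one way to pin a specific kernel (matching the `torch:flash` rows in the tables):

```python
import torch
import torch.nn.functional as F

batch, num_heads, seq_len, head_size = 4, 32, 2048, 128  # one row of the table
q = torch.rand(batch, num_heads, seq_len, head_size,
               device="cuda", dtype=torch.float16)
k, v = torch.rand_like(q), torch.rand_like(q)

# Restrict SDPA to the flash-attention backend (torch:flash in the table).
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
```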

#### Example GPU results

Example results on A100-SXM4-80GB with settings (use_gpu=TRUE,
enable_cuda_graph=FALSE, causal=FALSE, past_sequence_length=0,
intra_op_num_threads=0) in Azure Linux. ORT: build from source with CUDA
12.5; PyTorch 2.3.1 for cuda 12.1.

format | batch_size | sequence_length | num_heads | head_size | latency (s) | tflops | kernel
-- | -- | -- | -- | -- | -- | -- | --
Q,KV | 4 | 2048 | 32 | 128 | 0.0015 | 179.5 | ort:flash
Q,KV | 4 | 2048 | 32 | 128 | 0.0015 | 179.0 | ort:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0016 | 170.0 | ort:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0016 | 169.5 | ort:flash
QKV | 4 | 2048 | 32 | 128 | 0.0016 | 168.5 | ort:default
QKV | 4 | 2048 | 32 | 128 | 0.0016 | 167.4 | ort:flash
Q,K,V | 4 | 2048 | 32 | 128 | 0.0017 | 159.4 | torch:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0018 | 155.0 | torch:flash
Q,KV | 4 | 2048 | 32 | 128 | 0.0030 | 92.7 | ort:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0030 | 90.9 | ort:efficient
QKV | 4 | 2048 | 32 | 128 | 0.0031 | 89.9 | ort:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0031 | 89.0 | torch:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0054 | 51.3 | torch:math
Q,KV | 4 | 4096 | 32 | 128 | 0.0058 | 191.0 | ort:default
Q,KV | 4 | 4096 | 32 | 128 | 0.0058 | 190.6 | ort:flash
Q,K,V | 4 | 4096 | 32 | 128 | 0.0059 | 187.8 | ort:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0059 | 186.7 | ort:flash
QKV | 4 | 4096 | 32 | 128 | 0.0059 | 185.9 | ort:flash
QKV | 4 | 4096 | 32 | 128 | 0.0059 | 185.8 | ort:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0067 | 163.4 | torch:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0070 | 157.2 | torch:flash
Q,KV | 4 | 4096 | 32 | 128 | 0.0113 | 97.6 | ort:efficient
Q,K,V | 4 | 4096 | 32 | 128 | 0.0114 | 96.4 | ort:efficient
QKV | 4 | 4096 | 32 | 128 | 0.0114 | 96.2 | ort:efficient
Q,K,V | 4 | 4096 | 32 | 128 | 0.0127 | 86.3 | torch:efficient
Q,KV | 8 | 2048 | 32 | 128 | 0.0031 | 177.8 | ort:flash
Q,KV | 8 | 2048 | 32 | 128 | 0.0031 | 177.7 | ort:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0032 | 170.8 | ort:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0032 | 170.3 | ort:flash
QKV | 8 | 2048 | 32 | 128 | 0.0032 | 169.2 | ort:default
QKV | 8 | 2048 | 32 | 128 | 0.0033 | 169.0 | ort:flash
Q,K,V | 8 | 2048 | 32 | 128 | 0.0034 | 161.9 | torch:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0036 | 152.9 | torch:flash
Q,KV | 8 | 2048 | 32 | 128 | 0.0059 | 93.5 | ort:efficient
Q,K,V | 8 | 2048 | 32 | 128 | 0.0060 | 91.3 | ort:efficient
QKV | 8 | 2048 | 32 | 128 | 0.0060 | 91.0 | ort:efficient
Q,K,V | 8 | 2048 | 32 | 128 | 0.0064 | 86.0 | torch:efficient
Q,KV | 8 | 4096 | 32 | 128 | 0.0115 | 190.8 | ort:flash
Q,KV | 8 | 4096 | 32 | 128 | 0.0115 | 190.7 | ort:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0118 | 187.1 | ort:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0118 | 187.0 | ort:flash
QKV | 8 | 4096 | 32 | 128 | 0.0118 | 185.6 | ort:default
QKV | 8 | 4096 | 32 | 128 | 0.0118 | 185.6 | ort:flash
Q,K,V | 8 | 4096 | 32 | 128 | 0.0139 | 158.7 | torch:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0139 | 158.3 | torch:flash
Q,KV | 8 | 4096 | 32 | 128 | 0.0225 | 97.7 | ort:efficient
Q,K,V | 8 | 4096 | 32 | 128 | 0.0227 | 96.8 | ort:efficient
QKV | 8 | 4096 | 32 | 128 | 0.0228 | 96.3 | ort:efficient
Q,K,V | 8 | 4096 | 32 | 128 | 0.0260 | 84.5 | torch:efficient

#### Example CPU results

Dell XPS 8960 with i9-13900 CPU (use_gpu=FALSE, causal=FALSE,
past_sequence_length=0) in Windows. ORT: build from source with CUDA
12.5; PyTorch 2.3.1 for cuda 12.1.

format | causal | batch_size | seq_len | num_heads | head_size | threads | latency (s) | kernel
-- | -- | -- | -- | -- | -- | -- | -- | --
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 8 | 0.0005 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 0 | 0.0009 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 0 | 0.0009 | ort:math
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 4 | 0.0009 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 2 | 0.0014 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 1 | 0.0025 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 2 | 0.0045 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 24 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 8 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 4 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 1 | 0.0047 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 0 | 0.0019 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 8 | 0.0019 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 0 | 0.0022 | ort:math
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 4 | 0.0030 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 2 | 0.0047 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 1 | 0.0086 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 2 | 0.0161 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 4 | 0.0162 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 8 | 0.0162 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 24 | 0.0165 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 1 | 0.0166 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 8 | 0.0077 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 0 | 0.0091 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 0 | 0.0099 | ort:math
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 4 | 0.0103 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 2 | 0.0177 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 1 | 0.0328 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 2 | 0.0624 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 4 | 0.0624 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 8 | 0.0625 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 24 | 0.0626 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 1 | 0.0640 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 8 | 0.0286 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 0 | 0.0317 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 4 | 0.0367 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 0 | 0.0391 | ort:math
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 2 | 0.0656 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 1 | 0.1235 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 24 | 0.2482 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 2 | 0.2483 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 4 | 0.2483 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 8 | 0.2486 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 1 | 0.2538 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 0 | 0.1038 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 8 | 0.1050 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 0 | 0.1368 | ort:math
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 4 | 0.1535 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 2 | 0.2461 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 1 | 0.4724 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 8 | 0.9835 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 4 | 0.9841 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 24 | 0.9841 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 2 | 0.9873 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 1 | 0.9985 | torch:default


### Motivation and Context
To compare with PyTorch SDPA on CPU and CUDA latency.
2024-07-26 18:45:14 -07:00
Hector Li
fb61e14153
Add QNN EP option context_node_name_prefix to set EPContext node name prefix (#21236)
### Description
Add QNN EP option context_node_name_prefix to set EPContext node name prefix

### Motivation and Context
To work around the QNN context PD memory limit, users need to split the model into pieces and generate a QNN context model for each piece separately. The EPContext nodes generated in separate graphs can end up with the same node name, which causes issues when gluing those EPContext nodes together into a single model.
To avoid this, users can set context_node_name_prefix for each split piece to make the node names unique; a sketch follows.
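A hedged sketch of how this might look from Python when generating a context model for one split piece; the provider-option plumbing and the `ep.context_enable` flag are assumptions, and only `context_node_name_prefix` comes from this PR:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Assumption: session config flag that turns on EPContext model generation.
so.add_session_config_entry("ep.context_enable", "1")

sess = ort.InferenceSession(
    "model_part1.onnx",  # hypothetical split piece of the large model
    sess_options=so,
    providers=[("QNNExecutionProvider", {
        "backend_path": "QnnHtp.dll",          # assumed backend library
        "context_node_name_prefix": "part1_",  # unique prefix per split piece
    })],
)
```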
2024-07-26 16:56:44 -07:00
Jian Chen
7db7c4e5c8
Separating all GPU stages into different Pipelines (#21521)
### Description
Separating all GPU stages into different Pipelines
2024-07-26 14:54:45 -07:00
Justin Chu
bbbaef3fa6
Update text formatting in generate_cgmanifest.py (#21489)
In the only place I fixed manually, I forgot a format string.
2024-07-26 08:46:54 -07:00
Prathik Rao
278f0f5cd2
disables qnn in ort training cpu pipeline (#21510)
### Description

`enable_windows_arm64_qnn` and `enable_windows_x64_qnn` are true by
default but unnecessary for training. This change explicitly sets these
parameters to false for the training pipeline.

### Motivation and Context

ORT 1.19 Release Preparation
2024-07-26 17:23:35 +08:00
Wanming Lin
b6b29309a5
[WebNN EP] Update argMax/argMin to adapt to latest spec (#21452)
The WebNN spec recently changed the definition of argMax/argMin:
- Removed the selectLastIndex option, letting backends decide whether to
select the last index.
- Moved the axes option to an axis input.
2024-07-25 17:07:01 -07:00
aamajumder
166809425e
[DML EP] Register ReduceMin-20 (#20477)
### Description
This PR registers the ReduceMin-20 operator to the DML EP.


2024-07-25 17:06:30 -07:00
Scott McKay
e5302b23c4
Fix SkipLayerNormFusion incorrectly setting modified every time it runs (#21502)
### Description
The current behavior forces all L2 optimizers to loop until they hit the max
number of iterations.

Only update `modified` if the graph was actually modified (see the sketch
below).

### Motivation and Context
Fix unnecessary loops of L2 optimizers during model loading.
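A generic sketch (not ORT's actual C++ transformer code) of why the stale flag matters: the optimizer loop only stops early when no pass reports a change, so a pass that always reports modified=True forces the loop to run until max_num_iterations.

```python
def apply_until_fixed_point(passes, graph, max_num_iterations):
    # Each pass must return True only when it actually changed the graph;
    # otherwise this loop never exits early.
    for _ in range(max_num_iterations):
        modified = False
        for apply_pass in passes:
            modified |= apply_pass(graph)
        if not modified:
            break
    return graph
```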
2024-07-26 10:00:28 +10:00
Justin Chu
c464ab3aca
Allow cpplint to always be green (#21491)
Allow cpplint to always be green since it is optional. Also changed the
workflow name to reflect that.
2024-07-25 15:57:30 -07:00
Scott McKay
b0e1f7f798
CoreML: Aggregated changes to add all required ops for priority model (#21472)
### Description
Add these changes to one PR to simplify checkin
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)


Other changes
- Updated partitioning utils to support dropping constant initializers
from a ComputeCapability's inputs.
- Noticed that the list of inputs to the CoreML model was unexpectedly
long because of this.
- We copy constant initializers into the CoreML model, so we don't need
the originals; if they remain as inputs, ORT can't free them as they
appear to be in use.

2024-07-26 08:29:33 +10:00
Scott McKay
3cdf4b917b
Fix Android CI Pipeline code coverage failure (#21504)
### Description
Current failure is due to a version mismatch.

Use llvm-cov from the Android NDK instead of the system gcov so that the
version is correct.

Also comment out publishing to the Azure dashboard to simplify the
setup. The CI prints out the stats for review by developers.

### Motivation and Context
Fix CI pipeline
2024-07-26 07:36:23 +10:00
Hector Li
c23517859e
Qnn batchnorm support input with rank 2 (#21469)
### Description
QNN BatchNorm supports input with rank 2.
Update the quantization script to quantize the BatchNorm bias using int32
(see the sketch below).
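A small numpy sketch of the usual ONNX convention for int32 bias quantization; the assumption that the script follows the standard bias_scale = input_scale * weight_scale rule with zero_point 0 is mine:

```python
import numpy as np

input_scale, weight_scale = 0.02, 0.005
bias = np.array([0.15, -0.3], dtype=np.float32)

# int32 leaves plenty of headroom for the accumulated products, unlike int8.
bias_scale = input_scale * weight_scale            # 1e-4
bias_q = np.rint(bias / bias_scale).astype(np.int32)
print(bias_q)  # [ 1500 -3000]
```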

---------

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-07-25 11:44:10 -07:00
Changming Sun
4167b68abf
Split ondevice training cpu packaging pipeline to a separated pipeline (#21485)
### Description
Right now our "Zip-Nuget-Java-Nodejs Packaging Pipeline" is too big.
The on-device training part is independent of the others, so it can be
split out. Then our NPM packaging pipeline will not depend on this
training stuff.

### Motivation and Context
Similar to #21235 

Also, this PR fixed a problem: the "NuGet_Test_Linux_Training_CPU" job
downloads artifacts from "onnxruntime-linux-x64" to get the custom-op
shared libs, but the job forgot to declare that it depends on
"Linux_C_API_Packaging_CPU_x64", which produces that artifact. Such
problems can be hard to find when a pipeline grows big.
2024-07-25 10:58:34 -07:00
Yifan Li
ebcb7075eb
Set CUDA12 as default in GPU packages (#21438)
### Description
* Swap CUDA versions 11.8/12.2 in the GPU CIs
* Set CUDA 12 as the default version in the YAMLs for publishing
NuGet/Python/Java GPU packages
* Suppress warnings-as-errors for flash_api.cc during the ORT Windows build
2024-07-25 10:17:16 -07:00
Sophie Schoenmeyer
f3a6e58ae3
Update 05-performance.yml issue template to auto apply label (#21486)
Updating Performance issue template so "performance" label is
automatically applied

2024-07-25 09:52:37 -07:00
Yueqing Zhang
6787cf18a5
[VitisAI] use binary mode for context ep (#21474)
### Description
We found the text format could cause errors.


### Motivation and Context
Because the OS could change the string, we decided to save it as a
binary file.
2024-07-25 07:18:55 -07:00
Preetha Veeramalai
ca47f0fdd3
OVEP - PR 1.19 (#21443)
### Description
Add OVEP features for 1.19.

The PR includes:
- Added support for EpCtx with ORT session options for optimized
performance.
- Bug fixes.
- Support for OV 2024.3.

---------

Co-authored-by: ubuntu <ubuntu@ubuntu-mtlp-118727.iind.intel.com>
Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Maheshkar <ankit.maheshkar@intel.com>
2024-07-24 23:45:31 -07:00
Justin Chu
ae3ec2e9ac
Ignore ruff rule N813 (#21477)
Allow importing camelcase names in lowercase
2024-07-24 17:48:22 -07:00
pengwa
08001d18ac
Fix security issue #22016 #22017 #22018 (#21333)
2024-07-25 08:25:22 +08:00