Commit graph

11807 commits

Author SHA1 Message Date
Jiajia Qin
8159723ba7
[js/webgpu] Optimize matmulnbits (#22360)
### Description
<!-- Describe your changes. -->
This PR further optimizes matmulnbits specially for iGPUs. The phi3 demo
becomes ~12 tokens/second from ~8 tokens on iGPUs.

Some todos:
1. Make the optimization more general, Remove the blockSize = 32
limitation.
2. Tune the parameter, such as workgroupSize, components size (currently
only support components = 1), to see the performance change.
2024-10-14 15:49:29 -07:00
dependabot[bot]
2bc3754494
Bump cookie and socket.io in /js/web (#22408)
Bumps [cookie](https://github.com/jshttp/cookie) and
[socket.io](https://github.com/socketio/socket.io). These dependencies
needed to be updated together.
Updates `cookie` from 0.4.2 to 0.7.2
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/jshttp/cookie/releases">cookie's
releases</a>.</em></p>
<blockquote>
<h2>v0.7.2</h2>
<p><strong>Fixed</strong></p>
<ul>
<li>Fix object assignment of <code>hasOwnProperty</code> (<a
href="https://redirect.github.com/jshttp/cookie/issues/177">#177</a>)
bc38ffd</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.7.1...v0.7.2">https://github.com/jshttp/cookie/compare/v0.7.1...v0.7.2</a></p>
<h2>0.7.1</h2>
<p><strong>Fixed</strong></p>
<ul>
<li>Allow leading dot for domain (<a
href="https://redirect.github.com/jshttp/cookie/issues/174">#174</a>)
<ul>
<li>Although not permitted in the spec, some users expect this to work
and user agents ignore the leading dot according to spec</li>
</ul>
</li>
<li>Add fast path for <code>serialize</code> without options, use
<code>obj.hasOwnProperty</code> when parsing (<a
href="https://redirect.github.com/jshttp/cookie/issues/172">#172</a>)</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.7.0...v0.7.1">https://github.com/jshttp/cookie/compare/v0.7.0...v0.7.1</a></p>
<h2>0.7.0</h2>
<ul>
<li>perf: parse cookies ~10% faster (<a
href="https://redirect.github.com/jshttp/cookie/issues/144">#144</a> by
<a href="https://github.com/kurtextrem"><code>@​kurtextrem</code></a>
and <a
href="https://redirect.github.com/jshttp/cookie/issues/170">#170</a>)</li>
<li>fix: narrow the validation of cookies to match RFC6265 (<a
href="https://redirect.github.com/jshttp/cookie/issues/167">#167</a> by
<a href="https://github.com/bewinsnw"><code>@​bewinsnw</code></a>)</li>
<li>fix: add <code>main</code> to <code>package.json</code> for rspack
(<a href="https://redirect.github.com/jshttp/cookie/issues/166">#166</a>
by <a
href="https://github.com/proudparrot2"><code>@​proudparrot2</code></a>)</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.0">https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.0</a></p>
<h2>0.6.0</h2>
<ul>
<li>Add <code>partitioned</code> option</li>
</ul>
<h2>0.5.0</h2>
<ul>
<li>Add <code>priority</code> option</li>
<li>Fix <code>expires</code> option to reject invalid dates</li>
<li>pref: improve default decode speed</li>
<li>pref: remove slow string split in parse</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="d19eaa1a2b"><code>d19eaa1</code></a>
0.7.2</li>
<li><a
href="bc38ffd0ea"><code>bc38ffd</code></a>
Fix object assignment of <code>hasOwnProperty</code> (<a
href="https://redirect.github.com/jshttp/cookie/issues/177">#177</a>)</li>
<li><a
href="cf4658f492"><code>cf4658f</code></a>
0.7.1</li>
<li><a
href="6a8b8f5a49"><code>6a8b8f5</code></a>
Allow leading dot for domain (<a
href="https://redirect.github.com/jshttp/cookie/issues/174">#174</a>)</li>
<li><a
href="58015c0b93"><code>58015c0</code></a>
Remove more code and perf wins (<a
href="https://redirect.github.com/jshttp/cookie/issues/172">#172</a>)</li>
<li><a
href="ab057d6c06"><code>ab057d6</code></a>
0.7.0</li>
<li><a
href="5f02ca8768"><code>5f02ca8</code></a>
Migrate history to GitHub releases</li>
<li><a
href="a5d591ce84"><code>a5d591c</code></a>
Migrate history to GitHub releases</li>
<li><a
href="51968f94b5"><code>51968f9</code></a>
Skip isNaN</li>
<li><a
href="9e7ca51ade"><code>9e7ca51</code></a>
perf(parse): cache length, return early (<a
href="https://redirect.github.com/jshttp/cookie/issues/144">#144</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/jshttp/cookie/compare/v0.4.2...v0.7.2">compare
view</a></li>
</ul>
</details>
<details>
<summary>Maintainer changes</summary>
<p>This version was pushed to npm by <a
href="https://www.npmjs.com/~blakeembrey">blakeembrey</a>, a new
releaser for cookie since your current version.</p>
</details>
<br />

Updates `socket.io` from 4.7.5 to 4.8.0
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/socketio/socket.io/releases">socket.io's
releases</a>.</em></p>
<blockquote>
<h2>socket.io-client@4.8.0</h2>
<h3>Features</h3>
<h4>Custom transport implementations</h4>
<p>The <code>transports</code> option now accepts an array of transport
implementations:</p>
<pre lang="js"><code>import { io } from &quot;socket.io-client&quot;;
import { XHR, WebSocket } from &quot;engine.io-client&quot;;
<p>const socket = io({
transports: [XHR, WebSocket]
});
</code></pre></p>
<p>Here is the list of provided implementations:</p>
<table>
<thead>
<tr>
<th>Transport</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Fetch</code></td>
<td>HTTP long-polling based on the built-in <code>fetch()</code>
method.</td>
</tr>
<tr>
<td><code>NodeXHR</code></td>
<td>HTTP long-polling based on the <code>XMLHttpRequest</code> object
provided by the <code>xmlhttprequest-ssl</code> package.</td>
</tr>
<tr>
<td><code>XHR</code></td>
<td>HTTP long-polling based on the built-in <code>XMLHttpRequest</code>
object.</td>
</tr>
<tr>
<td><code>NodeWebSocket</code></td>
<td>WebSocket transport based on the <code>WebSocket</code> object
provided by the <code>ws</code> package.</td>
</tr>
<tr>
<td><code>WebSocket</code></td>
<td>WebSocket transport based on the built-in <code>WebSocket</code>
object.</td>
</tr>
<tr>
<td><code>WebTransport</code></td>
<td>WebTransport transport based on the built-in
<code>WebTransport</code> object.</td>
</tr>
</tbody>
</table>
<p>Usage:</p>
<table>
<thead>
<tr>
<th>Transport</th>
<th>browser</th>
<th>Node.js</th>
<th>Deno</th>
<th>Bun</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Fetch</code></td>
<td></td>
<td> (1)</td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>NodeXHR</code></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>XHR</code></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>NodeWebSocket</code></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>WebSocket</code></td>
<td></td>
<td> (2)</td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>WebTransport</code></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<p>(1) since <a
href="https://nodejs.org/api/globals.html#fetch">v18.0.0</a>
(2) since <a
href="https://nodejs.org/api/globals.html#websocket">v21.0.0</a></p>
<p>Added in <a
href="f4d898ee96">f4d898e</a>
and <a
href="b11763beec">b11763b</a>.</p>
<h4>Test each low-level transports</h4>
<p>When setting the <code>tryAllTransports</code> option to
<code>true</code>, if the first transport (usually, HTTP long-polling)
fails, then the other transports will be tested too:</p>
<pre lang="js"><code>import { io } from &quot;socket.io-client&quot;;
&lt;/tr&gt;&lt;/table&gt; 
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="d0fc720420"><code>d0fc720</code></a>
chore(release): socket.io@4.8.0</li>
<li><a
href="4a0555c671"><code>4a0555c</code></a>
chore(release): socket.io-client@4.8.0</li>
<li><a
href="2b60df18a8"><code>2b60df1</code></a>
chore(release): engine.io@6.6.1</li>
<li><a
href="d4cb375856"><code>d4cb375</code></a>
ci: ignore tests when publishing to npm</li>
<li><a
href="c251ae7ba7"><code>c251ae7</code></a>
chore(release): engine.io-client@6.6.1</li>
<li><a
href="8a2f5a3da0"><code>8a2f5a3</code></a>
fix(eio-client): move 'offline' event listener at the top</li>
<li><a
href="b04fa64365"><code>b04fa64</code></a>
fix(sio): allow to join a room in a middleware (uws)</li>
<li><a
href="7085f0e3e4"><code>7085f0e</code></a>
refactor(sio-client): mangle private attributes</li>
<li><a
href="4f66708210"><code>4f66708</code></a>
chore(sio-client): use babel loose mode when transpiling classes</li>
<li><a
href="1a95db2145"><code>1a95db2</code></a>
chore(sio-client): add a script to compute the bundle size</li>
<li>Additional commits viewable in <a
href="https://github.com/socketio/socket.io/compare/socket.io@4.7.5...socket.io@4.8.0">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.

[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-14 15:47:01 -07:00
Jiajia Qin
0409c639f7
[js/webgpu] Optimize MultiHeadAttention|Transpose (#22420)
### Description
<!-- Describe your changes. -->
With this optimization, 96 MultiHeadAttention|Transpose ops in phi3
disappear. Phi3 becomes 113 tokens from 107 tokens on my dGPUs.

The optimization mainly skips the transpose op if one of the transposed
dims is 1. Reshape is enough.
2024-10-14 15:43:14 -07:00
Tianlei Wu
de93f40240
[CUDA] Lean Attention (#22352)
### Description
Add [Lean Attention](https://arxiv.org/abs/2405.10480) and the
integration with MultiHeadAttention operator for LLM in GPU.

LeanAttention speeds up self-attention for the token-generation phase
(decode-phase) of decoder-only transformer models, especially on long
context lengths.

- [x] Initial implementation of Lean Attention (by Srikant Bharadwaj)
- [x] Integration with MultiHeadAttention operator
- [x] Add parity tests
- [x] Add benchmark

#### Implementation Details

(1) Lean Attention is enabled in build for Linux, and disabled for
Windows
(2) Lean Attention is disabled by default. Need enable it through cuda
provider option sdpa_kernel, or use environment variable
`ORT_ENABLE_LEAN_ATTENTION=1`
(3) It only works for token-generation (sequence_length==1,
past_sequence_length > 0).
(4) Like flash attention, it only works in Ampere or newer GPU.

We can revisit #1 and #2 after comparing with
DecoderMaskedMultiHeadAttention and XQA kernels.

#### Benchmark

```
cd onnxruntime/test/python/transformers 
/bin/bash benchmark_mha.sh lean
```

Example outputs in H100:

Note that past and present does not share buffer for MHA for now, so we
can see low tflops. The relative ratio will change after buffer sharing
is enabled. But we expect that the order (kernel A is faster than B)
will remain the same after buffer sharing is enabled.

Note that common settings `sequence_length=1;
causal=True;attn_bias=None;cuda_graph=False` are not shown in the below
table.

batch_size | past_sequence_length | num_heads | head_size |
average_latency | tflops | kernel
-- | -- | -- | -- | -- | -- | --
1 | 512 | 16 | 64 | 0.000059 | 0.0178 | ort:flash
1 | 512 | 16 | 64 | 0.000068 | 0.0155 | ort:efficient
1 | 512 | 16 | 64 | 0.000065 | 0.0161 | ort:math
1 | 512 | 16 | 64 | 0.000060 | 0.0176 | ort:lean
1 | 512 | 32 | 128 | 0.000062 | 0.0674 | ort:flash
1 | 512 | 32 | 128 | 0.000064 | 0.0661 | ort:efficient
1 | 512 | 32 | 128 | 0.000067 | 0.0625 | ort:math
1 | 512 | 32 | 128 | 0.000062 | 0.0678 | ort:lean
1 | 1024 | 16 | 64 | 0.000061 | 0.0345 | ort:flash
1 | 1024 | 16 | 64 | 0.000086 | 0.0244 | ort:efficient
1 | 1024 | 16 | 64 | 0.000065 | 0.0322 | ort:math
1 | 1024 | 16 | 64 | 0.000063 | 0.0332 | ort:lean
1 | 1024 | 32 | 128 | 0.000075 | 0.1125 | ort:flash
1 | 1024 | 32 | 128 | 0.000088 | 0.0951 | ort:efficient
1 | 1024 | 32 | 128 | 0.000079 | 0.1068 | ort:math
1 | 1024 | 32 | 128 | 0.000072 | 0.1171 | ort:lean
1 | 2048 | 16 | 64 | 0.000069 | 0.0606 | ort:flash
1 | 2048 | 16 | 64 | 0.000125 | 0.0336 | ort:efficient
1 | 2048 | 16 | 64 | 0.000064 | 0.0655 | ort:lean
1 | 2048 | 32 | 128 | 0.000098 | 0.1720 | ort:flash
1 | 2048 | 32 | 128 | 0.000132 | 0.1270 | ort:efficient
1 | 2048 | 32 | 128 | 0.000092 | 0.1828 | ort:lean
1 | 4096 | 16 | 64 | 0.000076 | 0.1097 | ort:flash
1 | 4096 | 16 | 64 | 0.000207 | 0.0406 | ort:efficient
1 | 4096 | 16 | 64 | 0.000069 | 0.1209 | ort:lean
1 | 4096 | 32 | 128 | 0.000140 | 0.2394 | ort:flash
1 | 4096 | 32 | 128 | 0.000213 | 0.1575 | ort:efficient
1 | 4096 | 32 | 128 | 0.000139 | 0.2419 | ort:lean
1 | 8192 | 16 | 64 | 0.000104 | 0.1609 | ort:flash
1 | 8192 | 16 | 64 | 0.000392 | 0.0428 | ort:efficient
1 | 8192 | 16 | 64 | 0.000093 | 0.1809 | ort:lean
1 | 8192 | 32 | 128 | 0.000212 | 0.3160 | ort:flash
1 | 8192 | 32 | 128 | 0.000360 | 0.1866 | ort:efficient
1 | 8192 | 32 | 128 | 0.000212 | 0.3162 | ort:lean
1 | 16384 | 16 | 64 | 0.000139 | 0.2410 | ort:flash
1 | 16384 | 16 | 64 | 0.000731 | 0.0459 | ort:efficient
1 | 16384 | 16 | 64 | 0.000136 | 0.2465 | ort:lean
1 | 16384 | 32 | 128 | 0.000361 | 0.3722 | ort:flash
1 | 16384 | 32 | 128 | 0.000667 | 0.2014 | ort:efficient
1 | 16384 | 32 | 128 | 0.000357 | 0.3765 | ort:lean
1 | 32768 | 16 | 64 | 0.000210 | 0.3194 | ort:flash
1 | 32768 | 16 | 64 | 0.001428 | 0.0470 | ort:efficient
1 | 32768 | 16 | 64 | 0.000209 | 0.3211 | ort:lean
1 | 32768 | 32 | 128 | 0.000659 | 0.4074 | ort:flash
1 | 32768 | 32 | 128 | 0.001270 | 0.2114 | ort:efficient
1 | 32768 | 32 | 128 | 0.000651 | 0.4123 | ort:lean
1 | 65536 | 16 | 64 | 0.000355 | 0.3785 | ort:flash
1 | 65536 | 16 | 64 | 0.002736 | 0.0491 | ort:efficient
1 | 65536 | 16 | 64 | 0.000349 | 0.3845 | ort:lean
1 | 65536 | 32 | 128 | 0.001251 | 0.4290 | ort:flash
1 | 65536 | 32 | 128 | 0.002480 | 0.2165 | ort:efficient
1 | 65536 | 32 | 128 | 0.001239 | 0.4333 | ort:lean
4 | 512 | 16 | 64 | 0.000063 | 0.0665 | ort:flash
4 | 512 | 16 | 64 | 0.000069 | 0.0607 | ort:efficient
4 | 512 | 16 | 64 | 0.000066 | 0.0634 | ort:math
4 | 512 | 16 | 64 | 0.000062 | 0.0674 | ort:lean
4 | 512 | 32 | 128 | 0.000100 | 0.1677 | ort:flash
4 | 512 | 32 | 128 | 0.000099 | 0.1703 | ort:efficient
4 | 512 | 32 | 128 | 0.000108 | 0.1557 | ort:math
4 | 512 | 32 | 128 | 0.000092 | 0.1818 | ort:lean
4 | 1024 | 16 | 64 | 0.000077 | 0.1094 | ort:flash
4 | 1024 | 16 | 64 | 0.000099 | 0.0850 | ort:efficient
4 | 1024 | 16 | 64 | 0.000081 | 0.1038 | ort:math
4 | 1024 | 16 | 64 | 0.000072 | 0.1161 | ort:lean
4 | 1024 | 32 | 128 | 0.000143 | 0.2343 | ort:flash
4 | 1024 | 32 | 128 | 0.000137 | 0.2447 | ort:efficient
4 | 1024 | 32 | 128 | 0.000150 | 0.2245 | ort:math
4 | 1024 | 32 | 128 | 0.000135 | 0.2496 | ort:lean
4 | 2048 | 16 | 64 | 0.000096 | 0.1757 | ort:flash
4 | 2048 | 16 | 64 | 0.000156 | 0.1078 | ort:efficient
4 | 2048 | 16 | 64 | 0.000089 | 0.1892 | ort:lean
4 | 2048 | 32 | 128 | 0.000223 | 0.3010 | ort:flash
4 | 2048 | 32 | 128 | 0.000217 | 0.3101 | ort:efficient
4 | 2048 | 32 | 128 | 0.000209 | 0.3209 | ort:lean
4 | 4096 | 16 | 64 | 0.000137 | 0.2448 | ort:flash
4 | 4096 | 16 | 64 | 0.000256 | 0.1312 | ort:efficient
4 | 4096 | 16 | 64 | 0.000133 | 0.2530 | ort:lean
4 | 4096 | 32 | 128 | 0.000389 | 0.3450 | ort:flash
4 | 4096 | 32 | 128 | 0.000376 | 0.3574 | ort:efficient
4 | 4096 | 32 | 128 | 0.000354 | 0.3794 | ort:lean
4 | 8192 | 16 | 64 | 0.000210 | 0.3198 | ort:flash
4 | 8192 | 16 | 64 | 0.000453 | 0.1480 | ort:efficient
4 | 8192 | 16 | 64 | 0.000206 | 0.3260 | ort:lean
4 | 8192 | 32 | 128 | 0.000725 | 0.3705 | ort:flash
4 | 8192 | 32 | 128 | 0.000693 | 0.3874 | ort:efficient
4 | 8192 | 32 | 128 | 0.000653 | 0.4114 | ort:lean
4 | 16384 | 16 | 64 | 0.000355 | 0.3782 | ort:flash
4 | 16384 | 16 | 64 | 0.000849 | 0.1581 | ort:efficient
4 | 16384 | 16 | 64 | 0.000346 | 0.3874 | ort:lean
4 | 16384 | 32 | 128 | 0.001395 | 0.3848 | ort:flash
4 | 16384 | 32 | 128 | 0.001337 | 0.4017 | ort:efficient
4 | 16384 | 32 | 128 | 0.001252 | 0.4288 | ort:lean
4 | 32768 | 16 | 64 | 0.000647 | 0.4146 | ort:flash
4 | 32768 | 16 | 64 | 0.001649 | 0.1628 | ort:efficient
4 | 32768 | 16 | 64 | 0.000639 | 0.4204 | ort:lean
4 | 32768 | 32 | 128 | 0.002721 | 0.3947 | ort:flash
4 | 32768 | 32 | 128 | 0.002601 | 0.4128 | ort:efficient
4 | 32768 | 32 | 128 | 0.002434 | 0.4411 | ort:lean
4 | 65536 | 16 | 64 | 0.001231 | 0.4361 | ort:flash
4 | 65536 | 16 | 64 | 0.003238 | 0.1658 | ort:efficient
4 | 65536 | 16 | 64 | 0.001217 | 0.4412 | ort:lean
4 | 65536 | 32 | 128 | 0.005357 | 0.4009 | ort:flash
4 | 65536 | 32 | 128 | 0.005118 | 0.4196 | ort:efficient
4 | 65536 | 32 | 128 | 0.004781 | 0.4492 | ort:lean
16 | 512 | 16 | 64 | 0.000098 | 0.1724 | ort:flash
16 | 512 | 16 | 64 | 0.000104 | 0.1616 | ort:efficient
16 | 512 | 16 | 64 | 0.000118 | 0.1420 | ort:math
16 | 512 | 16 | 64 | 0.000087 | 0.1926 | ort:lean
16 | 512 | 32 | 128 | 0.000220 | 0.3062 | ort:flash
16 | 512 | 32 | 128 | 0.000208 | 0.3237 | ort:efficient
16 | 512 | 32 | 128 | 0.000237 | 0.2838 | ort:math
16 | 512 | 32 | 128 | 0.000209 | 0.3216 | ort:lean
16 | 1024 | 16 | 64 | 0.000136 | 0.2465 | ort:flash
16 | 1024 | 16 | 64 | 0.000150 | 0.2235 | ort:efficient
16 | 1024 | 16 | 64 | 0.000148 | 0.2266 | ort:math
16 | 1024 | 16 | 64 | 0.000129 | 0.2611 | ort:lean
16 | 1024 | 32 | 128 | 0.000367 | 0.3663 | ort:flash
16 | 1024 | 32 | 128 | 0.000351 | 0.3829 | ort:efficient
16 | 1024 | 32 | 128 | 0.000400 | 0.3357 | ort:math
16 | 1024 | 32 | 128 | 0.000349 | 0.3853 | ort:lean
16 | 2048 | 16 | 64 | 0.000209 | 0.3206 | ort:flash
16 | 2048 | 16 | 64 | 0.000243 | 0.2762 | ort:efficient
16 | 2048 | 16 | 64 | 0.000201 | 0.3338 | ort:lean
16 | 2048 | 32 | 128 | 0.000671 | 0.4002 | ort:flash
16 | 2048 | 32 | 128 | 0.000645 | 0.4163 | ort:efficient
16 | 2048 | 32 | 128 | 0.000642 | 0.4185 | ort:lean
16 | 4096 | 16 | 64 | 0.000360 | 0.3732 | ort:flash
16 | 4096 | 16 | 64 | 0.000425 | 0.3162 | ort:efficient
16 | 4096 | 16 | 64 | 0.000341 | 0.3933 | ort:lean
16 | 4096 | 32 | 128 | 0.001292 | 0.4156 | ort:flash
16 | 4096 | 32 | 128 | 0.001251 | 0.4291 | ort:efficient
16 | 4096 | 32 | 128 | 0.001241 | 0.4327 | ort:lean
16 | 8192 | 16 | 64 | 0.000666 | 0.4030 | ort:flash
16 | 8192 | 16 | 64 | 0.000804 | 0.3339 | ort:efficient
16 | 8192 | 16 | 64 | 0.000627 | 0.4283 | ort:lean
16 | 8192 | 32 | 128 | 0.002541 | 0.4226 | ort:flash
16 | 8192 | 32 | 128 | 0.002454 | 0.4376 | ort:efficient
16 | 8192 | 32 | 128 | 0.002438 | 0.4405 | ort:lean
16 | 16384 | 16 | 64 | 0.001292 | 0.4156 | ort:flash
16 | 16384 | 16 | 64 | 0.001571 | 0.3417 | ort:efficient
16 | 16384 | 16 | 64 | 0.001217 | 0.4411 | ort:lean
16 | 16384 | 32 | 128 | 0.005042 | 0.4260 | ort:flash
16 | 16384 | 32 | 128 | 0.004859 | 0.4420 | ort:efficient
16 | 16384 | 32 | 128 | 0.004827 | 0.4449 | ort:lean
16 | 32768 | 16 | 64 | 0.002537 | 0.4233 | ort:flash
16 | 32768 | 16 | 64 | 0.003103 | 0.3461 | ort:efficient
16 | 32768 | 16 | 64 | 0.002385 | 0.4501 | ort:lean
16 | 32768 | 32 | 128 | 0.009961 | 0.4312 | ort:flash
16 | 32768 | 32 | 128 | 0.009605 | 0.4472 | ort:efficient
16 | 32768 | 32 | 128 | 0.009524 | 0.4510 | ort:lean
16 | 65536 | 16 | 64 | 0.005019 | 0.4279 | ort:flash
16 | 65536 | 16 | 64 | 0.006133 | 0.3502 | ort:efficient
16 | 65536 | 16 | 64 | 0.004703 | 0.4566 | ort:lean
16 | 65536 | 32 | 128 | 0.019746 | 0.4350 | ort:flash
16 | 65536 | 32 | 128 | 0.019027 | 0.4515 | ort:efficient
16 | 65536 | 32 | 128 | 0.018864 | 0.4554 | ort:lean

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-10-14 14:49:37 -07:00
Dmitri Smirnov
87e8a5dfa8
Implement DML copy for Lora Adapters (#22396)
### Description
Request and create DML EP and its data transfer.
Use to copy on device.

The PR includes changes to fix issues in DML provider.

### Motivation and Context
This enables Lora users to run it with DML which is important for GenAI.

Co-authored-by: @PatriceVignola

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
2024-10-14 12:26:50 -07:00
Vishnudas Thaniel S
35adba21c7
Ovep develop lnl 1.2 (#22424)
### Description
Support OV2024.4
Refactor tensor initialization check for external weights
Support loading OV Config
OVEP: Tensor Caching fix, Fix accuracy issues
Refactor device memory implementation to make it more generic

### Motivation and Context
The changes are required to fix accuracy issues, support loading of OV
config, support OV2024.4

---------

Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
2024-10-14 12:10:01 -07:00
Justin Chu
9b1b4e54bb
Move suggest fixes to a separate CI workflow (#22415)
Move suggest fixes to a separate CI workflow so that it is triggered
only on PRs and does not fail the main branch.
2024-10-14 10:26:37 -07:00
Edward Chen
04404ea482
Fix Xcode 16 iOS build issues (#22379)
- Work around Xcode 16 iOS test build issue: `error: Multiple commands produce '.../PlugIns'`.
- Fix link error in iOS static framework test.
- Update build.py to check for the right kind of build before running iOS tests on the simulator.
- Update Xcode 16 build images to 'macos-15' because that's the only image that will have Xcode 16 soon. See https://github.com/actions/runner-images/issues/10703.
2024-10-14 09:24:38 -07:00
Yi Zhang
caa67439b5
Add more F16 kernels of XNNPack (#22381)
### Description
1. Add Gemm, MatMul, Softmax, AveragePool and  Resize F16 kernels

This PR has included all changes in #22378


[AB#51066](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/51066)

[AB#51026](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/51026)

2. Matrix B must be const and martrix A and B dim_size shoule NOT bigger
than 2 in XNNPack, so I added 2 tests in matmul_test.cc to make sure
it's really tested. (that is, compute() must be called.)
### Motivation and Context
2024-10-14 17:41:59 +08:00
Yi Zhang
72cc72cc21
New rocm nuget publish pipeline (#22418)
### Description
Add a new pipeline to publish ROCM package to ADO



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

### Test Link
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1615
2024-10-13 08:30:06 +08:00
mindest
1fa219d7d5
DecoderMaskedMultiHeadAttention CPU kernel. (#22292)
### Description
DecoderMaskedMultiHeadAttention CPU kernel.
2024-10-12 13:43:00 -07:00
George Wu
332173509d
fixups for doxygen. add c++ wrapper for setEpDynamicOptions (#22416)
follow up to https://github.com/microsoft/onnxruntime/pull/22282

replaces https://github.com/microsoft/onnxruntime/pull/22388
2024-10-11 21:59:33 -07:00
kunal-vaishnavi
18e81f8785
Fix Whisper export for FP16 CUDA (#22410)
### Description
This PR fixes a bug when the ONNX checker is called while exporting
Whisper for FP16 CUDA with optional flags.

### Motivation and Context
Sometimes, the ONNX checker raises an error depending on the optional
flags passed. By wrapping the ONNX checker in a try-except, the
conversion can continue even if the checker fails.
2024-10-11 17:37:36 -07:00
Ted Themistokleous
572e43c5d7
[MIGraphX EP/ ROCm EP] add gfx1200, gfx1201 to CMAKE_HIP_ARCHITECTURES (#22348)
### Description
Add additonal gfx targets for AMD GPU support


### Motivation and Context
Required to integrate mainline onnxruntime support for AMD GPUs

---------

Co-authored-by: Stefan Sokolovic <stsokolo@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-10-11 17:31:36 -07:00
Edward Chen
d7367653ab
Remove clean_docker_image_cache.py and clean-build-docker-image-cache-pipeline.yml. (#22409)
Clean up old script and build definition.
2024-10-11 14:25:13 -07:00
anujj
23d48ea647
Add TensorRT-Model-Optimizer INT4 AWQ support in onnxruntime tools (#22390)
[TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
have a implementation for INT4 AWQ. Adding the support in onnxruntime
tools to quantized the models with TensorRT-Model-Optimizer
2024-10-11 13:31:54 -07:00
Kyle
cdebf37105
Add Digital Signature to DLLs in Maven Build (#22401)
### Description
* Add digital signature to dll files in jar files.
* Jar file names: onnxruntime-{version}.jar,
onnxruntime_gpu-{version}.jar

### Motivation and Context
#19204
2024-10-11 12:14:03 -07:00
mindest
d1627d2c7f
[ROCm] Register op kernel for Sqrt BFloat16 (#22404)
### Description
ROCm CI fails since adding test for BFloat16, Sqrt op (introduced in
#22068).
2024-10-11 11:02:40 -07:00
Justin Chu
64007ffb79
Create suggestions to autofix files (#22115) 2024-10-11 10:52:19 -07:00
mindest
3c80aa9fee
Add CPU kernels for DynamicTimeWarping and UnfoldTensor. (#22033)
### Description
Add CPU kernels for DynamicTimeWarping and UnfoldTensor.
2024-10-11 09:44:18 -07:00
Dmitri Smirnov
f1f3d94e2d
Accomodate BE platforms. Make sure we always write flatbuffers LE (#22375)
### Description
<!-- Describe your changes. -->
flatbuffers always write data in LE and it is automatically traslated
to/from BE as needed,
but only if we use proper accessors. This would work for shape.
However, we store parameters as bytes, so we need to swap bytes as
needed for BE.

### Motivation and Context
Address https://github.com/microsoft/onnxruntime/issues/22364
2024-10-11 09:14:44 -07:00
sheetalarkadam
c06ecd415c
RC releases to Maven for Android (#22391)
### Description
Aallows alpha, beta and rc version releases to Maven for Android
artifacts.



### Motivation and Context
Helpful to release rc versions or test artifacts to Maven for testing.
For example, a new QNN android package is being released and it will be
nice to test the RC version for dependencies before release

## Future Work
Allow RC version for all Maven artifacts.
2024-10-11 08:58:02 -07:00
Changming Sun
6ada97c84c
Fix a build issue when statically link to MSVC Runtime (#22393)
Yesterday I updated ABSL to a newer version which added a new cmake
option: ABSL_MSVC_STATIC_RUNTIME . I wasn't aware of it. This PR fixes
it.
2024-10-10 20:09:13 -07:00
Indy Zhu
b4fb32d80d
Pick changes from onnx/onnx#6010 to support EinSum shape inference (#22376)
### Description
<!-- Describe your changes. -->
Pick up onnx/onnx#6010 to support EinSum shape inference


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change allows EinSum operator's output shape to be inferenced so
that it can run on accelerators.
2024-10-10 13:24:08 -07:00
sheetalarkadam
dd2ea8469e
Add qnn android package (#22296)
### Description
Pre built QNN Android package


### Future Work
1. Setting up CI with Browserstack- onnxruntime_tests and Android test
2. ESRP Release to Maven
2024-10-10 10:37:22 -07:00
Kevin Chen
709368ea14
Remove deprecation warnings for TensorRT 10.5 builds (#22374)
### Description
In TensorRT 10.5, the APIs `platformHasFastFp16` and
`platformHasFastInt8` have been deprecated.

Ignore these deprecation warnings.

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
2024-10-10 08:49:10 -07:00
Tianlei Wu
2584d02c5e
Fix SAM2 benchmark script on cuda graph (#22377)
### Description
Update segment anything 2 benchmark script:
(1) Fix cuda graph in benchmark. Make sure --use_cuda_graph takes effect
and random_inputs() generates according to the dtype of the model.
(2) Add a parameter to enable profiling.
(3) Use latest cuda 12.6.2 and cudnn 9.5.
(4) Update README.md.

### Motivation and Context
Previous, --use_cuda_graph does not take effect. This fixes the
benchmark.
2024-10-10 08:46:44 -07:00
Luis E. P.
1bc546af61
Add SetEpDynamicOptions and remove workload_type from run/session options (#22282)
### Description
Add SetEpDynamicOptions and Remove workload_type from run/session
options.

### Motivation and Context
Added SetEpDynamicOptions as a dynamic way of changing EP settings even
in the middle of a Run
Using workload_type run/session options to set Efficient/Default mode
for workloads does not cover all the scenarios and can lead to priority
inversions. Working on a new API to support setting Efficient/Default
mode for workloads.

---------

Co-authored-by: Luis E. Pena <luispena@microsoft.com>
2024-10-09 22:54:22 -07:00
Changming Sun
2bef89c171
Upgrade absl to the latest released version (#22365)
### Description
Resolve #21976 .  
 
ABSL generally does not have forward/backward compatibility. Our code is
only compatible with one fixed LTS version. So it's important to fix the
version number there when using find_package to detect an installed
version.
2024-10-09 20:21:40 -07:00
Changming Sun
dcf1e0c3b0
Re-enable CUDA 12 python package test pipeline (#22370)
### Description
It runs after "Python-CUDA-Packaging-Pipeline" that runs on a CPU
machine that skipped all tests.
This testing pipeline is for doing the tests.
2024-10-09 20:21:27 -07:00
Yi Zhang
25b1c38e87
Add conv fp16 kernel in xnnpack EP (#22301)
### Description
Add FP16 kernels of Conv and ConvTranspose

[AB#50186](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50186)



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------
2024-10-10 08:48:09 +08:00
Tianlei Wu
2b0ea6c834
Fix security warning: use sha256 instead of md5 (#22373)
### Description

Fix S360 warning: Use of unapproved hash algorithm or API MD5.
2024-10-09 17:24:46 -07:00
Tianlei Wu
5cc2529379
Fix EmbedLayerNormalization fusion related to LayerNormalization optional input (#22371)
This fixes a bug found by libfuzzer: 

LayerNormalization third input (beta) is optional. The following code
has potential out of bound access if the input is not available:
```
NodeArg* beta = layer_norm_node.MutableInputDefs()[2];
```

This adds a check to ensure the third input exists before fusion.

[AB#49036](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/49036)
2024-10-09 17:24:21 -07:00
Tianlei Wu
071b607807
[CUDA] Add CUDA_VERSION and CUDNN_VERSION etc. arguments to Dockerfile.cuda (#22351)
### Description

* Add a few arguments CUDA_VERSION, CUDNN_VERSION, OS, GIT_COMMIT,
GIT_BRANCH and ONNXRUNTIME_VERSION to the Dockerfile.cuda to allow for
more flexibility in the build process.
* Update README.md to include the new arguments and their usage.
* Output labels to image so that it is easy to inspect the image. 

Available CUDA versions for ubuntu 24.04 can be found
[here](https://hub.docker.com/r/nvidia/cuda/tags), and available CUDNN
versions can be found
[here](https://pypi.org/project/nvidia-cudnn-cu12/#history). Example
command line to build docker image:
```
  docker build -t onnxruntime-cuda --build-arg CUDA_VERSION=12.6.1 \
                                   --build-arg CUDNN_VERSION=9.5.0.50 \
                                   --build-arg GIT_BRANCH=$(git rev-parse --abbrev-ref HEAD) \
                                   --build-arg GIT_COMMIT=$(git rev-parse HEAD) \
                                   --build-arg ONNXRUNTIME_VERSION=$(cat ../VERSION_NUMBER) \
                                   -f Dockerfile.cuda ..
```

Example labels from `docker inspect onnxruntime-cuda`:
```
            "Labels": {
                "CUDA_VERSION": "12.6.1",
                "CUDNN_VERSION": "9.5.0.50",
                "maintainer": "Changming Sun <chasun@microsoft.com>",
                "onnxruntime_git_branch": "main",
                "onnxruntime_git_commit": "bc84958dcef5c6017ae58085f55b669efd74f4a5",
                "onnxruntime_version": "1.20.0",
                "org.opencontainers.image.ref.name": "ubuntu",
                "org.opencontainers.image.version": "24.04"
            }
```

### Motivation and Context
https://github.com/microsoft/onnxruntime/pull/22339 has hard-coded the
cuda and cudnn versions. User might want to choose specified cuda and
cudnn version during building docker image.
2024-10-09 12:06:33 -07:00
Hector Li
3b00024b55
Fix the QNN nuget package issue (#22358)
Fix the QNN nuget package issue

### Description
Inside the package, folder name \runtimes\win-arm64\ was changed to \runtimes\win-ARM64\, which breaks lib copy settings in Microsoft.ML.OnnxRuntime.QNN.props.


### Motivation and Context
Fix issue: https://github.com/microsoft/onnxruntime/issues/21692
2024-10-09 08:41:23 -07:00
Changming Sun
9ee963110e
Update manylinux version (#22355)
### Description
Update the commit from 59600894a2c1c18290944b83e989bfe618975230 to
1887322ed36d522409a6b805d4e7942cf76a8e40


### Motivation and Context
The new one has python 3.13.

AB#50959
2024-10-08 23:11:11 -07:00
Pranav Sharma
c415991c16
Revert "ThreadPool: Spend less time busy waiting. (#21545)" (#22350)
This reverts commit 4e15b229a0.

Reason: We are seeing an increase in the number of deadlocks after this
PR. We have a release coming up next week and do not have enough time to
investigate the root cause, hence reverting this PR temporarily.
Moreover, this is causing an increase int he binary size.

### Description
We are seeing an [increase in the number of
deadlocks](https://github.com/microsoft/onnxruntime/pull/22315#issuecomment-2394821893)
after this PR. We have a release coming up next week and do not have
enough time to investigate the root cause, hence reverting this PR
temporarily.

### Motivation and Context
See above.
2024-10-08 17:50:26 -07:00
Yulong Wang
c5d28cac4d
Initial WebGPU EP checkin (#22318)
### Description

This change introduces the WebGPU EP into ONNX Runtime.

To make the PR as simple as possible, this PR excluded the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update

This PR now contains only 43 file changes (while the working branch
contains 130+) and hopefully this makes it easier to review.

There is going to be separated PRs for each mentioned above.

Current working branch: #21904
2024-10-08 16:10:46 -07:00
Edward Chen
bc84958dce
Remove duplicate fp32_bits definition. (#22340)
Remove duplicate fp32_bits definition in onnxruntime/core/mlas/lib/cast.cpp
2024-10-08 11:57:00 -07:00
Tianlei Wu
8595e56d8e
[CUDA] Update Docker to use Ubuntu 24.04, cuda 12.6, cudnn 9.4 and python 3.12 (#22339)
### Description
Serve as example to build and run onnxruntime-gpu with latest software
stack.

To build docker image:
```
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime/dockerfiles
docker build -t onnxruntime-cuda -f Dockerfile.cuda ..
```

To launch the docker image built from previous step (and mount the code
directory to run a unit test below):
```
cd ..
docker run --rm -it --gpus all -v $PWD:/code onnxruntime-cuda /bin/bash
```

Then run the following in docker image to verify that the cuda provider
is good:
```
python /code/onnxruntime/test/python/onnxruntime_test_python_cudagraph.py
```

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22335
2024-10-08 09:54:46 -07:00
Changming Sun
d98340968e
Stop publishing python 3.8/3.9 packages (#22343)
### Description
1. Stop publishing python 3.8/3.9 packages, to align with numpy. 
2. Add a trigger for CUDA12's python test pipeline.
2024-10-08 09:50:05 -07:00
aciddelgado
cc0193cd42
Fix Memory Issue GQA CPU Rotary (#22290)
### Description
In GQA there was a memory issue which was best described by @edgchen1
[here](https://github.com/microsoft/onnxruntime/issues/22252#issuecomment-2384559255)

> here's the problematic code:
> 
>
d9de054eb5/onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc (L149-L157)
> 
> annotated:
> 
> ```c++
>     if (packed_qkv) {
>       // Q is an OrtValue declared in the enclosing scope.
>       OrtValue RotaryQKV;
> Tensor::InitOrtValue(element_type, TensorShape({batch_size, num_heads_
+ 2 * kv_num_heads_, sequence_length, head_size}), allocator,
RotaryQKV);
>       // Save pointer to Q's data in q_input.
>       q_input = Q.Get<Tensor>().Data<T>();
>       k_input = q_input + num_heads_ * sequence_length * head_size;
>       q_rotary = RotaryQKV.GetMutable<Tensor>()->MutableData<T>();
>       k_rotary = q_rotary + num_heads_ * sequence_length * head_size;
> // Overwrite Q with RotaryQKV (OrtValues contain shared_ptr to
contained value).
>       // Now, q_input is pointing to freed memory.
>       Q = RotaryQKV;
>     }
> ```
> 
> later on, when we use `q_input`, there is a read access violation.
> 
>
d9de054eb5/onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc (L170-L172)
> 
> this problem showed up when CPU allocator sharing between sessions was
enabled. in that case, the CPU allocator's arena was disabled. I suspect
that the default usage of the arena hid this issue.
> 
> though I debugged into the first branch, this appears to be a problem
in both branches:
> 
>
d9de054eb5/onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc (L149-L168)





### Motivation and Context
Fixes a crucial bug. The issue was found here
https://github.com/microsoft/onnxruntime/issues/22252
2024-10-08 09:22:05 -07:00
stsokolo
efb8703a25
[MIGraphX EP Support]Add rocm to transformers/benchmark.py script (#22299)
### Description
Add ROCm EP option to benchmark.py script when using int8 quantization.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Without this change benchmarks with int8 quantization cannot be run with
ROCm execution provider.
2024-10-07 21:57:18 -07:00
Xavier Dupré
407c1ab2e2
Fix conversion of TensorData, TensorsData to json (#22166)
### Description
Fix write_calibration_table to support TensorData, TensorsData
2024-10-06 19:13:03 -07:00
Scott McKay
280c013d67
Fix warning when building on Windows (#22327)
### Description
<!-- Describe your changes. -->
Specify type to fix warning


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-10-06 15:51:58 +10:00
jingyanwangms
5036e63d58
Increanse TensorRT tolerance from default 1e-5 to 1e-3 after TRT 10.4 (#22321)
### Description
Increanse TensorRT tolerance from default 1e-5 to 1e-3 after TRT 10.4

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-10-05 18:06:06 +10:00
mingmingtasd
004bd36f3d
[WebNN EP] Support Tile operator (#22148)
PTAL, thanks! @Honry , @fdwr thanks!
2024-10-05 00:56:55 -07:00
chenduan-amd
98a75900ef
[VitisAI]Add interface to get configurations from Sessionoption (#22096)
### Description
Add interface to get config_options from onnxruntime.



### Motivation and Context
to support config session_options after EP Append, So need get
configurations on ep end.
2024-10-05 00:12:44 -07:00
Wanming Lin
39c8b3759f
[JS/WebGPU] Fixed bugs in inputs validation of Resize (#21955)
- 'scales' and 'sizes' may be empty tensor, make sure it's 1D tensor and
non-empty
- Make sure 'scales' and 'sizes' if present its length is non-zero
2024-10-04 18:29:53 -07:00
Tianlei Wu
b5ef85555a
Support onnx data types (bfloat16, float8) in python I/O binding APIs (#22306)
### Description
(1) Support onnx data types in python APIs:
* IOBinding.bind_input
* IOBinding.bind_output
* ortvalue_from_shape_and_type

(2) Add unit tests, which serves an example of running BFloat16 or
Float8 models in Python.

Other minor changes:
(3) replace deprecated NP_TYPE_TO_TENSOR_TYPE by helper API.
(4) Rename ortvalue_from_numpy_with_onnxtype to
ortvalue_from_numpy_with_onnx_type.

The integer of onnx element type can be found in
(https://onnx.ai/onnx/api/mapping.html). Note that FLOAT4E2M1 is not
supported yet.

### Motivation and Context

Current python API does not support Bfloat16 and float8 (FLOAT8E4M3FN,
FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ) types, and other new data
types like INT4, UInt4 etc.

This removes the limitation.

https://github.com/microsoft/onnxruntime/issues/13001
https://github.com/microsoft/onnxruntime/issues/20481
https://github.com/microsoft/onnxruntime/issues/20578
2024-10-04 17:29:15 -07:00