onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-12 17:57:38 +00:00

Author	SHA1	Message	Date
Edward Chen	7964d3aef6	Specify iOS simulator runtime version (#22474 ) - Allow specification of iOS simulator runtime version to use. - Pick simulator runtime version (iphonesimulator 16.4) that is supported by the Xcode version (14.3.1) that we use. - Disable CoreML EP's DepthToSpace op support for CoreML version less than 7, with DCR mode, and FP16 input. It doesn't produce the correct output in this case. - Some cleanup of iOS test infrastructure.	2024-10-18 09:26:06 -07:00
Enrico Galli	1e5bda88f0	[WebNN EP] Cache MLTensors between runs (#22278 ) ### Description This change enables caching `MLTensor`s between inference runs. This is done by keeping a reference to `MLTensor`s alive after they have been released. `MLTensor`s are only destroyed once the session goes out of scope. ### Motivation and Context Creating and destroying `MLTensor`s on every run has a non-trivial performance penalty. This penalty materializes when using `ort.Tensors`[location=cpu] for inputs/outputs or when using the CPU EP as a fallback EP for unsupported operators. The former can be mitigated by developers using `ort.Tensors`[location=ml-tensor]. The latter cannot be mitigated by developers.	2024-10-18 08:07:00 -07:00
Yulong Wang	b4cb937440	fix LayerNorm f16 CPU implementation (#22479 ) ### Description The recent PR #22223 introduced 2 bugs in the implementation of CPU LayerNorm f16: - possible access to nullptr for bias: `const TensorShape& bias_shape = bias->Shape();` will crash when `bias` does not exist. (Surprisingly, this does not seem to be covered by any test case.) - fix: guard with a pointer check - a race condition inside ComputeJob: `ComputeJob()` is dispatched to the threadpool and internally tries to modify `LayerNormImpl::scale_fp32_` and `LayerNormImpl::bias_fp32_`, which are `std::unique_ptr`s and are not thread-safe. - fix: move the modification of `LayerNormImpl::scale_fp32_` and `LayerNormImpl::bias_fp32_` out of `ComputeJob()` and into `LayerNormImpl::ComputeWithoutContext()`. It may still have a race condition because `ConcurrentRunSupported` is set to `true` for the CPU EP, so an OrtMutex was added. This should fix the recent flaky tests as well.	2024-10-17 18:49:38 -07:00
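The race-condition fix described in the commit above follows a common pattern: do the one-time conversion of shared state on the calling thread, under a lock, and let the parallel jobs only read it. The sketch below illustrates that pattern in Python. It is not ORT code: `LayerNormSketch` and its members are hypothetical names, and the per-row work is a plain scaling rather than a real layer normalization.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LayerNormSketch:
    """Hypothetical stand-in for the pattern described above, not ORT code."""

    def __init__(self, scale_fp16):
        self._scale_fp16 = scale_fp16
        self._scale_fp32 = None        # lazily converted, shared by all jobs
        self._lock = threading.Lock()  # plays the role of the added mutex
        self.conversions = 0           # counts how often the conversion ran

    def _ensure_converted(self):
        # Runs on the calling thread, before any job is dispatched.
        # The lock covers concurrent callers of compute().
        with self._lock:
            if self._scale_fp32 is None:
                self._scale_fp32 = [float(v) for v in self._scale_fp16]
                self.conversions += 1

    def _compute_job(self, row):
        # Jobs only read the shared buffer; they never assign to it.
        return [x * s for x, s in zip(row, self._scale_fp32)]

    def compute(self, rows):
        self._ensure_converted()
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(self._compute_job, rows))
```

The design point is that the write to the shared buffer happens exactly once and outside the worker function, so the workers need no synchronization of their own.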
Akshay Sonawane	e5c2e50849	bumps up version in main from 1.20 -> 1.21 (#22482 ) Bump up version in main from 1.20.0 to 1.21.0 since the release branch has been cut.	2024-10-17 12:32:35 -07:00
Yulong Wang	55c584954c	fix supports_device() in python interface (#22473 ) ### Description `get_device()` returns a string of hyphen-connected device names, such as "GPU-DML". This is a problem when CUDA is disabled but OpenVino GPU is enabled in the build: in that case `get_device()` returns "CPU-OPENVINO_GPU", so `supports_device("CUDA")` returns `True`. Splitting the value of `get_device()` by "-" and checking whether the input is in the list is not an option, because some code in the code base appears to store the value of `get_device()` and use it to call `supports_device()`. That implementation would cause `supports_device("GPU-DML")` to return `False` for a build with `get_device() == "GPU-DML"` because `"GPU-DML" in ["GPU", "DML"]` is `False`. This change also helps to avoid further problems when "WebGPU" is introduced.	2024-10-17 12:10:25 -07:00
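The two pitfalls named in the commit above can be reproduced with plain string operations. The following sketch is illustrative only: the function names are hypothetical, and `combined_check` is one possible way to satisfy both constraints, not necessarily what the commit implements. How the original function mapped "CUDA" onto the device string is not shown in the commit message, so the sketch queries the token "GPU" directly.

```python
def naive_substring_check(device_string, wanted):
    # Pitfall 1: substring matching sees "GPU" inside "OPENVINO_GPU".
    return wanted in device_string


def split_only_check(device_string, wanted):
    # Pitfall 2: rejects callers that pass the whole stored string back in,
    # because "GPU-DML" is not an element of ["GPU", "DML"].
    return wanted in device_string.split("-")


def combined_check(device_string, wanted):
    # Illustrative compromise (an assumption, not the actual fix):
    # accept the full string or any exact hyphen-separated token.
    return wanted == device_string or wanted in device_string.split("-")
```

With `device_string = "CPU-OPENVINO_GPU"`, the naive check reports "GPU" as supported while the token-based checks do not; with `device_string = "GPU-DML"`, only `combined_check` accepts the full string "GPU-DML".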
Yulong Wang	1247d69c28	Add onnxtestdata cache for win-web-multi-browsers pipeline (#22477 ) ### Description Apply onnxtestdata cache to win-web-multi-browsers pipeline Same change that applied to win-web-ci #16659	2024-10-17 12:03:29 -07:00
Edward Chen	d649cac9af	Consolidate CPU allocator arena creation checks into a helper function. (#22460 )	2024-10-17 09:08:44 -07:00
Wanming Lin	52b77762bd	[WebNN EP] Remove the numThreads option (#22464 ) Chromium has removed this option via https://chromium-review.googlesource.com/c/chromium/src/+/5905656.	2024-10-17 07:45:39 -07:00
Hector Li	ac98bcae37	Update QNN default version to 2.27 in CI pipeline (#22471 ) ### Description Update QNN default version to 2.27 in CI pipeline	2024-10-16 22:05:47 -07:00
Adrian Lizarraga	84d48b6ad6	[QNN EP] Add provider option to offload graph I/O quantization/dequantization to the CPU EP (#22436 ) ### Description Adds QNN provider option `offload_graph_io_quantization` to offload graph input quantization and graph output dequantization to the CPU EP. Option is disabled by default to maintain current behavior. ### Motivation and Context Offloading the handling of I/O quantization to the CPU EP significantly improves inference latency for many models.	2024-10-16 15:00:53 -07:00
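As a rough illustration of how the new option might be passed from Python, the helper below builds a `providers` argument for `onnxruntime.InferenceSession`. The option name `offload_graph_io_quantization` comes from the commit above; the string values "1"/"0", the helper name `qnn_providers`, and the example backend path are assumptions to check against the QNN EP documentation.

```python
def qnn_providers(backend_path, offload_io_quantization=True):
    """Build a `providers` list for onnxruntime.InferenceSession (sketch)."""
    options = {
        # Path to the QNN backend library, e.g. "QnnHtp.dll" (assumed example).
        "backend_path": backend_path,
        # Option added by this commit; the "1"/"0" encoding is an assumption.
        "offload_graph_io_quantization": "1" if offload_io_quantization else "0",
    }
    # The CPU EP is listed explicitly for clarity, since it is the EP that
    # receives the offloaded input quantization / output dequantization.
    return [("QNNExecutionProvider", options), "CPUExecutionProvider"]


# Hypothetical usage (requires onnxruntime built with the QNN EP and a model file):
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=qnn_providers("QnnHtp.dll"))
```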
Yulong Wang	b7050c8390	remove unused _fence_ events for profiler (#22403 ) ### Description The current code to log profiler event "_fence_before" and "_fence_after" seems to be useless. The measured duration of the 2 events are 0. Removed them.	2024-10-16 13:38:32 -07:00
Yulong Wang	c3a94c6c5f	Fix Memcpy transformer when dealing multiple EPs (#22413 ) ### Description Fix Memcpy transformer when dealing multiple EPs. --------- Co-authored-by: Scott McKay <Scott.McKay@microsoft.com> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2024-10-16 13:38:22 -07:00
Patrice Vignola	f610605a48	[DML EP] Support partial rotary embedding (#22417 ) ### Description This adds support for partial RotaryEmbedding to DML. Essentially, partial RotaryEmbedding consists of doing the rotary embedding calculation on a subregion of the input tensor as if its head size was `rotary_embedding_dim`, while leaving the second part of the tensor (i.e. `head_size - rotary_embedding_dim`) alone. To achieve this, we follow these steps: 1. Split the tensor into 2 parts 2. Run the rotary embedding algorithm on the first part, just like we were doing before on the entire tensor 3. Join the 2 parts back together Since we're leaving the middle part intact, the RotaryEmbedding fusion will still be done within DML. Also, the concat at the end is essentially free because DML optimizes it out and directly allocates the result of RotaryEmbedding at the right place. The only overhead here is the splitting of the tensor at the beginning, which we should eventually make part of the RotaryEmbedding fusion within DML. ### Motivation and Context This fix allows us to correctly run models that have a `partial_rotary_factor` setting in huggingface, including Nvidia's Nemotron: https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct	2024-10-16 13:28:44 -07:00
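The split / rotate / join steps from the commit above can be sketched on a single head vector in pure Python. This is a minimal illustration, not the DML kernel: the function names are hypothetical, and the rotation uses the non-interleaved (half-split) pairing with base 10000, which is an assumption since the commit does not say which convention the operator uses.

```python
import math


def rotary(x, position, base=10000.0):
    """Rotate pairs (x[i], x[i + half]) by a position-dependent angle."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        theta = position * base ** (-2.0 * i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out


def partial_rotary(x, position, rotary_dim):
    """Apply rotary embedding to the first `rotary_dim` entries only."""
    head, tail = x[:rotary_dim], x[rotary_dim:]  # 1. split
    rotated = rotary(head, position)             # 2. rotate as if head size were rotary_dim
    return rotated + tail                        # 3. join; the tail is untouched
```

Because each pair is rotated by a pure rotation, the norm of the first `rotary_dim` entries is preserved, and the remaining `head_size - rotary_dim` entries pass through unchanged.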
Patrice Vignola	a164228c10	[DML EP] Add QDQ fusions for DML and disable QDQ + Resample fusion (#22458 )	2024-10-16 12:40:39 -07:00
Changming Sun	f9e623e4d1	Update CMake to 3.31.0rc1 (#22433 ) To include a bug fix: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9890 Discussion: https://discourse.cmake.org/t/cmake-incorrectly-links-to-nvrtc-builtins/12723/4 This bug fix should be included in our upcoming release, because right now our GPU package depends on "libnvrtc-builtins.so.12.2" which has a hardcoded CUDA version: 12.2. The minor CUDA version should not be there.	2024-10-16 11:50:13 -07:00
Caroline Zhu	691de83892	Enable BrowserStack tests (#22457 ) ### Description BrowserStack account issues have been resolved -- this PR enables E2E browserstack tests in the pipeline again	2024-10-16 11:10:12 -07:00
PeixuanZuo	bf604428aa	[ROCm] Update ROCm Nuget pipeline to ROCm 6.2 (#22461 ) 1. Update ROCm Nuget pipeline build version to ROCm 6.2 2. Update AMD-GPU Agent Pool base docker image for ROCm Nuget pipeline test stage. search `AMD GPU pipeline Nuget` page in onenote to see how to update it. passed pipeline: https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=580846&view=results	2024-10-16 10:36:49 -07:00
Yi Zhang	2b8fc5529b	Enable RunMatMulTest all test cases support FP16 (#22440 ) ### Motivation and Context Increase FP16 test coverage for all related EPs	2024-10-16 09:57:05 +08:00
Jian Chen	af00a20f8a	Change ORT nightly python packages' name (#22450 ) ### Description Our nightly CPU python package is named "ort-nightly" instead of "onnxruntime" for historical reasons (Tensorflow was like that). We would now prefer to make them the same. This change applies to all nightly python packages, including CPU, GPU (CUDA), and maybe others.	2024-10-15 18:44:59 -07:00
Justin Beavers	a5e85a950c	Fix training artifacts for 2GB+ models and `MSELoss` (#22414 )	2024-10-15 16:47:16 -07:00
Caroline Zhu	6407d81b35	Disable BrowserStack testing stage (#22438 ) ### Description We are seeing this [packaging pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=940&_a=summary) fail because we are running into BrowserStack account issues. Disabling this step until issues are resolved	2024-10-15 13:27:05 -07:00
Ted Themistokleous	4c47bca8fe	[MIGraphX EP] Add additional operators (#22446 ) * Add in missing operators for llama run * Add simplified layer norm ops ### Description Adding additional supported operators into MIGraphX EP that are supported in MIGraphX ### Motivation and Context Allows for more models to be run through MIGraphX EP	2024-10-15 12:21:22 -07:00
Yi Zhang	c5a0fb182a	Fix big models exception caused by timm upgrade (#22442 ) ### Description Today, the stable diffusion stage failed because of an upgrade in timm, which controlnet_aux depends on. Its latest version limits the timm version to less than 0.6.7, so upgrading controlnet_aux can solve it. controlnet_aux also uses opencv-python-headless, so pin opencv-python-headless to 4.8.0.74 too.	2024-10-15 21:13:52 +08:00
wejoncy	20a45dd67b	[CoreML ML Program] support acclerators selector (#22383 ) ### Description For now, CoreML only supports running mlmodels on CPU/ALL. However, CPU_GPU is sometimes much faster. This PR adds an option to select different hardware to boost performance. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-10-15 11:50:11 +08:00
Jeff Daily	8c21680ffc	[ROCm] prefer hip interfaces over roc during hipify (#22394 ) ### Description Change the hipify step to remove the -roc option to hipify-perl. This will prefer hipblas over rocblas. rocblas can still be called directly such as in TunableOp. ### Motivation and Context hip interfaces are preferred over roc for porting from cuda to hip. Calling roc interfaces is meant for ROCm-specific enhancements or extensions.	2024-10-14 20:34:03 -07:00
anujj	ec7aa63b3a	nvidia awq only use QuantFormat.QDQ quant format (#22429 ) nvidia awq only use QuantFormat.QDQ quant format	2024-10-14 20:32:59 -07:00
Yi Zhang	6e5e320088	Refactor one test function in MatMul_test (#22432 )	2024-10-15 11:16:02 +08:00
amarin16	7d17c466ec	Add microbenchmark for layer normalization and improve latency (#22223 ) - Added a microbenchmark for the `LayerNormalization` MLFloat16 support added in https://github.com/microsoft/onnxruntime/pull/22063. - Updated the `LayerNormalization` MLFloat16 implementation to improve the latency. ``` ---------------------------------------------------------------------------------------------- Original MLFloat16 support Time CPU Iterations ---------------------------------------------------------------------------------------------- BM_LayerNormalization<MLFloat16, float>/1/real_time 15599 us 15625 us 47 BM_LayerNormalization<MLFloat16, float>/1/real_time 14714 us 14824 us 39 BM_LayerNormalization<MLFloat16, float>/1/real_time 14634 us 14688 us 50 ---------------------------------------------------------------------------------------------- Updated MLFloat16 support Time CPU Iterations ---------------------------------------------------------------------------------------------- BM_LayerNormalization<MLFloat16, float>/1/real_time 7276 us 7254 us 84 BM_LayerNormalization<MLFloat16, float>/1/real_time 6820 us 6720 us 93 BM_LayerNormalization<MLFloat16, float>/1/real_time 6840 us 6882 us 84 ```	2024-10-14 18:47:27 -07:00
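To put the benchmark listing above into a single number, the wall-clock times from the Time column can be averaged per variant and compared. The snippet below only redoes that arithmetic on the six values quoted in the commit; it does not rerun the benchmark.

```python
from statistics import mean

# "Time" column (microseconds) copied from the benchmark output above.
original_us = [15599, 14714, 14634]  # original MLFloat16 support
updated_us = [7276, 6820, 6840]      # updated MLFloat16 support

speedup = mean(original_us) / mean(updated_us)
print(f"mean latency: {mean(original_us):.0f} us -> {mean(updated_us):.0f} us")
print(f"mean speedup: {speedup:.2f}x")
```

On these three runs per variant the updated implementation is a little over twice as fast; with so few samples this is an indication rather than a precise measurement.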
Changming Sun	4af593a722	Add python 3.13 support (#22380 ) 1. Add python 3.13 to our python packaging pipelines 2. Because numpy 2.0.0 doesn't support free-threaded python, this PR also upgrades numpy to the latest version 3. Delete some unused files.	2024-10-14 18:07:54 -07:00
Jiajia Qin	8159723ba7	[js/webgpu] Optimize matmulnbits (#22360 ) ### Description This PR further optimizes matmulnbits, especially for iGPUs. The phi3 demo improves from ~8 tokens/second to ~12 tokens/second on iGPUs. Some todos: 1. Make the optimization more general and remove the blockSize = 32 limitation. 2. Tune parameters such as workgroupSize and components size (currently only components = 1 is supported) to see the performance change.	2024-10-14 15:49:29 -07:00
dependabot[bot]	2bc3754494	Bump cookie and socket.io in /js/web (#22408 ) Bumps [cookie](https://github.com/jshttp/cookie) and [socket.io](https://github.com/socketio/socket.io). These dependencies needed to be updated together. Updates `cookie` from 0.4.2 to 0.7.2. Updates `socket.io` from 4.7.5 to 4.8.0. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-10-14 15:47:01 -07:00
Jiajia Qin	0409c639f7	[js/webgpu] Optimize MultiHeadAttention|Transpose (#22420 ) ### Description With this optimization, 96 MultiHeadAttention|Transpose ops in phi3 disappear. Phi3 improves from 107 tokens to 113 tokens on my dGPUs. The optimization mainly skips the transpose op if one of the transposed dims is 1; a reshape is enough in that case.	2024-10-14 15:43:14 -07:00
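The reason a reshape suffices can be checked directly: if a permutation keeps every dimension of size greater than 1 in its original relative order, the row-major element order does not change. The sketch below states that condition and verifies it by enumerating flat indices. It is an illustration of the idea, not the WebGPU implementation; both function names are hypothetical, and the condition is a generalization of the "one of the transposed dims is 1" case named in the commit.

```python
from itertools import product


def is_reshape_only(shape, perm):
    """True when `perm` keeps all dims of size > 1 in their original order."""
    kept = [axis for axis in perm if shape[axis] != 1]
    return kept == sorted(kept)


def transposed_flat_order(shape, perm):
    """Flat source indices visited when reading the transposed tensor row-major."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    out_shape = [shape[axis] for axis in perm]
    order = []
    for idx in product(*(range(n) for n in out_shape)):
        order.append(sum(idx[k] * strides[perm[k]] for k in range(len(perm))))
    return order
```

For `shape = [2, 1, 3]` and `perm = [1, 0, 2]`, the visited order is `0, 1, 2, 3, 4, 5`, so the data can be reinterpreted with the new shape without moving anything. For `shape = [2, 3]` and `perm = [1, 0]`, the order is `0, 3, 1, 4, 2, 5`, and a real transpose is needed.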
Tianlei Wu	de93f40240	[CUDA] Lean Attention (#22352 ) ### Description Add [Lean Attention](https://arxiv.org/abs/2405.10480) and the integration with the MultiHeadAttention operator for LLMs on GPU. LeanAttention speeds up self-attention for the token-generation phase (decode phase) of decoder-only transformer models, especially on long context lengths. - [x] Initial implementation of Lean Attention (by Srikant Bharadwaj) - [x] Integration with MultiHeadAttention operator - [x] Add parity tests - [x] Add benchmark #### Implementation Details (1) Lean Attention is enabled in the build for Linux and disabled for Windows. (2) Lean Attention is disabled by default. Enable it through the CUDA provider option sdpa_kernel, or use the environment variable `ORT_ENABLE_LEAN_ATTENTION=1`. (3) It only works for token generation (sequence_length==1, past_sequence_length > 0). (4) Like flash attention, it only works on Ampere or newer GPUs. We can revisit (1) and (2) after comparing with the DecoderMaskedMultiHeadAttention and XQA kernels. #### Benchmark ``` cd onnxruntime/test/python/transformers /bin/bash benchmark_mha.sh lean ``` Example outputs on H100: Note that past and present do not share a buffer for MHA for now, so we see low tflops. The relative ratio will change after buffer sharing is enabled, but we expect that the order (kernel A is faster than B) will remain the same. Note that the common settings `sequence_length=1; causal=True;attn_bias=None;cuda_graph=False` are not shown in the table below. 
batch_size \| past_sequence_length \| num_heads \| head_size \| average_latency \| tflops \| kernel -- \| -- \| -- \| -- \| -- \| -- \| -- 1 \| 512 \| 16 \| 64 \| 0.000059 \| 0.0178 \| ort:flash 1 \| 512 \| 16 \| 64 \| 0.000068 \| 0.0155 \| ort:efficient 1 \| 512 \| 16 \| 64 \| 0.000065 \| 0.0161 \| ort:math 1 \| 512 \| 16 \| 64 \| 0.000060 \| 0.0176 \| ort:lean 1 \| 512 \| 32 \| 128 \| 0.000062 \| 0.0674 \| ort:flash 1 \| 512 \| 32 \| 128 \| 0.000064 \| 0.0661 \| ort:efficient 1 \| 512 \| 32 \| 128 \| 0.000067 \| 0.0625 \| ort:math 1 \| 512 \| 32 \| 128 \| 0.000062 \| 0.0678 \| ort:lean 1 \| 1024 \| 16 \| 64 \| 0.000061 \| 0.0345 \| ort:flash 1 \| 1024 \| 16 \| 64 \| 0.000086 \| 0.0244 \| ort:efficient 1 \| 1024 \| 16 \| 64 \| 0.000065 \| 0.0322 \| ort:math 1 \| 1024 \| 16 \| 64 \| 0.000063 \| 0.0332 \| ort:lean 1 \| 1024 \| 32 \| 128 \| 0.000075 \| 0.1125 \| ort:flash 1 \| 1024 \| 32 \| 128 \| 0.000088 \| 0.0951 \| ort:efficient 1 \| 1024 \| 32 \| 128 \| 0.000079 \| 0.1068 \| ort:math 1 \| 1024 \| 32 \| 128 \| 0.000072 \| 0.1171 \| ort:lean 1 \| 2048 \| 16 \| 64 \| 0.000069 \| 0.0606 \| ort:flash 1 \| 2048 \| 16 \| 64 \| 0.000125 \| 0.0336 \| ort:efficient 1 \| 2048 \| 16 \| 64 \| 0.000064 \| 0.0655 \| ort:lean 1 \| 2048 \| 32 \| 128 \| 0.000098 \| 0.1720 \| ort:flash 1 \| 2048 \| 32 \| 128 \| 0.000132 \| 0.1270 \| ort:efficient 1 \| 2048 \| 32 \| 128 \| 0.000092 \| 0.1828 \| ort:lean 1 \| 4096 \| 16 \| 64 \| 0.000076 \| 0.1097 \| ort:flash 1 \| 4096 \| 16 \| 64 \| 0.000207 \| 0.0406 \| ort:efficient 1 \| 4096 \| 16 \| 64 \| 0.000069 \| 0.1209 \| ort:lean 1 \| 4096 \| 32 \| 128 \| 0.000140 \| 0.2394 \| ort:flash 1 \| 4096 \| 32 \| 128 \| 0.000213 \| 0.1575 \| ort:efficient 1 \| 4096 \| 32 \| 128 \| 0.000139 \| 0.2419 \| ort:lean 1 \| 8192 \| 16 \| 64 \| 0.000104 \| 0.1609 \| ort:flash 1 \| 8192 \| 16 \| 64 \| 0.000392 \| 0.0428 \| ort:efficient 1 \| 8192 \| 16 \| 64 \| 0.000093 \| 0.1809 \| ort:lean 1 \| 8192 \| 32 \| 128 \| 0.000212 \| 0.3160 \| ort:flash 1 \| 
8192 \| 32 \| 128 \| 0.000360 \| 0.1866 \| ort:efficient 1 \| 8192 \| 32 \| 128 \| 0.000212 \| 0.3162 \| ort:lean 1 \| 16384 \| 16 \| 64 \| 0.000139 \| 0.2410 \| ort:flash 1 \| 16384 \| 16 \| 64 \| 0.000731 \| 0.0459 \| ort:efficient 1 \| 16384 \| 16 \| 64 \| 0.000136 \| 0.2465 \| ort:lean 1 \| 16384 \| 32 \| 128 \| 0.000361 \| 0.3722 \| ort:flash 1 \| 16384 \| 32 \| 128 \| 0.000667 \| 0.2014 \| ort:efficient 1 \| 16384 \| 32 \| 128 \| 0.000357 \| 0.3765 \| ort:lean 1 \| 32768 \| 16 \| 64 \| 0.000210 \| 0.3194 \| ort:flash 1 \| 32768 \| 16 \| 64 \| 0.001428 \| 0.0470 \| ort:efficient 1 \| 32768 \| 16 \| 64 \| 0.000209 \| 0.3211 \| ort:lean 1 \| 32768 \| 32 \| 128 \| 0.000659 \| 0.4074 \| ort:flash 1 \| 32768 \| 32 \| 128 \| 0.001270 \| 0.2114 \| ort:efficient 1 \| 32768 \| 32 \| 128 \| 0.000651 \| 0.4123 \| ort:lean 1 \| 65536 \| 16 \| 64 \| 0.000355 \| 0.3785 \| ort:flash 1 \| 65536 \| 16 \| 64 \| 0.002736 \| 0.0491 \| ort:efficient 1 \| 65536 \| 16 \| 64 \| 0.000349 \| 0.3845 \| ort:lean 1 \| 65536 \| 32 \| 128 \| 0.001251 \| 0.4290 \| ort:flash 1 \| 65536 \| 32 \| 128 \| 0.002480 \| 0.2165 \| ort:efficient 1 \| 65536 \| 32 \| 128 \| 0.001239 \| 0.4333 \| ort:lean 4 \| 512 \| 16 \| 64 \| 0.000063 \| 0.0665 \| ort:flash 4 \| 512 \| 16 \| 64 \| 0.000069 \| 0.0607 \| ort:efficient 4 \| 512 \| 16 \| 64 \| 0.000066 \| 0.0634 \| ort:math 4 \| 512 \| 16 \| 64 \| 0.000062 \| 0.0674 \| ort:lean 4 \| 512 \| 32 \| 128 \| 0.000100 \| 0.1677 \| ort:flash 4 \| 512 \| 32 \| 128 \| 0.000099 \| 0.1703 \| ort:efficient 4 \| 512 \| 32 \| 128 \| 0.000108 \| 0.1557 \| ort:math 4 \| 512 \| 32 \| 128 \| 0.000092 \| 0.1818 \| ort:lean 4 \| 1024 \| 16 \| 64 \| 0.000077 \| 0.1094 \| ort:flash 4 \| 1024 \| 16 \| 64 \| 0.000099 \| 0.0850 \| ort:efficient 4 \| 1024 \| 16 \| 64 \| 0.000081 \| 0.1038 \| ort:math 4 \| 1024 \| 16 \| 64 \| 0.000072 \| 0.1161 \| ort:lean 4 \| 1024 \| 32 \| 128 \| 0.000143 \| 0.2343 \| ort:flash 4 \| 1024 \| 32 \| 128 \| 0.000137 \| 0.2447 \| ort:efficient 4 \| 
1024 \| 32 \| 128 \| 0.000150 \| 0.2245 \| ort:math 4 \| 1024 \| 32 \| 128 \| 0.000135 \| 0.2496 \| ort:lean 4 \| 2048 \| 16 \| 64 \| 0.000096 \| 0.1757 \| ort:flash 4 \| 2048 \| 16 \| 64 \| 0.000156 \| 0.1078 \| ort:efficient 4 \| 2048 \| 16 \| 64 \| 0.000089 \| 0.1892 \| ort:lean 4 \| 2048 \| 32 \| 128 \| 0.000223 \| 0.3010 \| ort:flash 4 \| 2048 \| 32 \| 128 \| 0.000217 \| 0.3101 \| ort:efficient 4 \| 2048 \| 32 \| 128 \| 0.000209 \| 0.3209 \| ort:lean 4 \| 4096 \| 16 \| 64 \| 0.000137 \| 0.2448 \| ort:flash 4 \| 4096 \| 16 \| 64 \| 0.000256 \| 0.1312 \| ort:efficient 4 \| 4096 \| 16 \| 64 \| 0.000133 \| 0.2530 \| ort:lean 4 \| 4096 \| 32 \| 128 \| 0.000389 \| 0.3450 \| ort:flash 4 \| 4096 \| 32 \| 128 \| 0.000376 \| 0.3574 \| ort:efficient 4 \| 4096 \| 32 \| 128 \| 0.000354 \| 0.3794 \| ort:lean 4 \| 8192 \| 16 \| 64 \| 0.000210 \| 0.3198 \| ort:flash 4 \| 8192 \| 16 \| 64 \| 0.000453 \| 0.1480 \| ort:efficient 4 \| 8192 \| 16 \| 64 \| 0.000206 \| 0.3260 \| ort:lean 4 \| 8192 \| 32 \| 128 \| 0.000725 \| 0.3705 \| ort:flash 4 \| 8192 \| 32 \| 128 \| 0.000693 \| 0.3874 \| ort:efficient 4 \| 8192 \| 32 \| 128 \| 0.000653 \| 0.4114 \| ort:lean 4 \| 16384 \| 16 \| 64 \| 0.000355 \| 0.3782 \| ort:flash 4 \| 16384 \| 16 \| 64 \| 0.000849 \| 0.1581 \| ort:efficient 4 \| 16384 \| 16 \| 64 \| 0.000346 \| 0.3874 \| ort:lean 4 \| 16384 \| 32 \| 128 \| 0.001395 \| 0.3848 \| ort:flash 4 \| 16384 \| 32 \| 128 \| 0.001337 \| 0.4017 \| ort:efficient 4 \| 16384 \| 32 \| 128 \| 0.001252 \| 0.4288 \| ort:lean 4 \| 32768 \| 16 \| 64 \| 0.000647 \| 0.4146 \| ort:flash 4 \| 32768 \| 16 \| 64 \| 0.001649 \| 0.1628 \| ort:efficient 4 \| 32768 \| 16 \| 64 \| 0.000639 \| 0.4204 \| ort:lean 4 \| 32768 \| 32 \| 128 \| 0.002721 \| 0.3947 \| ort:flash 4 \| 32768 \| 32 \| 128 \| 0.002601 \| 0.4128 \| ort:efficient 4 \| 32768 \| 32 \| 128 \| 0.002434 \| 0.4411 \| ort:lean 4 \| 65536 \| 16 \| 64 \| 0.001231 \| 0.4361 \| ort:flash 4 \| 65536 \| 16 \| 64 \| 0.003238 \| 0.1658 \| ort:efficient 4 
\| 65536 \| 16 \| 64 \| 0.001217 \| 0.4412 \| ort:lean 4 \| 65536 \| 32 \| 128 \| 0.005357 \| 0.4009 \| ort:flash 4 \| 65536 \| 32 \| 128 \| 0.005118 \| 0.4196 \| ort:efficient 4 \| 65536 \| 32 \| 128 \| 0.004781 \| 0.4492 \| ort:lean 16 \| 512 \| 16 \| 64 \| 0.000098 \| 0.1724 \| ort:flash 16 \| 512 \| 16 \| 64 \| 0.000104 \| 0.1616 \| ort:efficient 16 \| 512 \| 16 \| 64 \| 0.000118 \| 0.1420 \| ort:math 16 \| 512 \| 16 \| 64 \| 0.000087 \| 0.1926 \| ort:lean 16 \| 512 \| 32 \| 128 \| 0.000220 \| 0.3062 \| ort:flash 16 \| 512 \| 32 \| 128 \| 0.000208 \| 0.3237 \| ort:efficient 16 \| 512 \| 32 \| 128 \| 0.000237 \| 0.2838 \| ort:math 16 \| 512 \| 32 \| 128 \| 0.000209 \| 0.3216 \| ort:lean 16 \| 1024 \| 16 \| 64 \| 0.000136 \| 0.2465 \| ort:flash 16 \| 1024 \| 16 \| 64 \| 0.000150 \| 0.2235 \| ort:efficient 16 \| 1024 \| 16 \| 64 \| 0.000148 \| 0.2266 \| ort:math 16 \| 1024 \| 16 \| 64 \| 0.000129 \| 0.2611 \| ort:lean 16 \| 1024 \| 32 \| 128 \| 0.000367 \| 0.3663 \| ort:flash 16 \| 1024 \| 32 \| 128 \| 0.000351 \| 0.3829 \| ort:efficient 16 \| 1024 \| 32 \| 128 \| 0.000400 \| 0.3357 \| ort:math 16 \| 1024 \| 32 \| 128 \| 0.000349 \| 0.3853 \| ort:lean 16 \| 2048 \| 16 \| 64 \| 0.000209 \| 0.3206 \| ort:flash 16 \| 2048 \| 16 \| 64 \| 0.000243 \| 0.2762 \| ort:efficient 16 \| 2048 \| 16 \| 64 \| 0.000201 \| 0.3338 \| ort:lean 16 \| 2048 \| 32 \| 128 \| 0.000671 \| 0.4002 \| ort:flash 16 \| 2048 \| 32 \| 128 \| 0.000645 \| 0.4163 \| ort:efficient 16 \| 2048 \| 32 \| 128 \| 0.000642 \| 0.4185 \| ort:lean 16 \| 4096 \| 16 \| 64 \| 0.000360 \| 0.3732 \| ort:flash 16 \| 4096 \| 16 \| 64 \| 0.000425 \| 0.3162 \| ort:efficient 16 \| 4096 \| 16 \| 64 \| 0.000341 \| 0.3933 \| ort:lean 16 \| 4096 \| 32 \| 128 \| 0.001292 \| 0.4156 \| ort:flash 16 \| 4096 \| 32 \| 128 \| 0.001251 \| 0.4291 \| ort:efficient 16 \| 4096 \| 32 \| 128 \| 0.001241 \| 0.4327 \| ort:lean 16 \| 8192 \| 16 \| 64 \| 0.000666 \| 0.4030 \| ort:flash 16 \| 8192 \| 16 \| 64 \| 0.000804 \| 0.3339 \| 
ort:efficient 16 \| 8192 \| 16 \| 64 \| 0.000627 \| 0.4283 \| ort:lean 16 \| 8192 \| 32 \| 128 \| 0.002541 \| 0.4226 \| ort:flash 16 \| 8192 \| 32 \| 128 \| 0.002454 \| 0.4376 \| ort:efficient 16 \| 8192 \| 32 \| 128 \| 0.002438 \| 0.4405 \| ort:lean 16 \| 16384 \| 16 \| 64 \| 0.001292 \| 0.4156 \| ort:flash 16 \| 16384 \| 16 \| 64 \| 0.001571 \| 0.3417 \| ort:efficient 16 \| 16384 \| 16 \| 64 \| 0.001217 \| 0.4411 \| ort:lean 16 \| 16384 \| 32 \| 128 \| 0.005042 \| 0.4260 \| ort:flash 16 \| 16384 \| 32 \| 128 \| 0.004859 \| 0.4420 \| ort:efficient 16 \| 16384 \| 32 \| 128 \| 0.004827 \| 0.4449 \| ort:lean 16 \| 32768 \| 16 \| 64 \| 0.002537 \| 0.4233 \| ort:flash 16 \| 32768 \| 16 \| 64 \| 0.003103 \| 0.3461 \| ort:efficient 16 \| 32768 \| 16 \| 64 \| 0.002385 \| 0.4501 \| ort:lean 16 \| 32768 \| 32 \| 128 \| 0.009961 \| 0.4312 \| ort:flash 16 \| 32768 \| 32 \| 128 \| 0.009605 \| 0.4472 \| ort:efficient 16 \| 32768 \| 32 \| 128 \| 0.009524 \| 0.4510 \| ort:lean 16 \| 65536 \| 16 \| 64 \| 0.005019 \| 0.4279 \| ort:flash 16 \| 65536 \| 16 \| 64 \| 0.006133 \| 0.3502 \| ort:efficient 16 \| 65536 \| 16 \| 64 \| 0.004703 \| 0.4566 \| ort:lean 16 \| 65536 \| 32 \| 128 \| 0.019746 \| 0.4350 \| ort:flash 16 \| 65536 \| 32 \| 128 \| 0.019027 \| 0.4515 \| ort:efficient 16 \| 65536 \| 32 \| 128 \| 0.018864 \| 0.4554 \| ort:lean ### Motivation and Context	2024-10-14 14:49:37 -07:00
Dmitri Smirnov	87e8a5dfa8	Implement DML copy for Lora Adapters (#22396 ) ### Description Request and create DML EP and its data transfer. Use to copy on device. The PR includes changes to fix issues in DML provider. ### Motivation and Context This enables Lora users to run it with DML which is important for GenAI. Co-authored-by: @PatriceVignola --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2024-10-14 12:26:50 -07:00
Vishnudas Thaniel S	35adba21c7	Ovep develop lnl 1.2 (#22424 ) ### Description Support OV2024.4; refactor tensor initialization check for external weights; support loading OV Config; OVEP: tensor caching fix, fix accuracy issues; refactor device memory implementation to make it more generic. ### Motivation and Context The changes are required to fix accuracy issues, support loading of OV config, and support OV2024.4. --------- Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: n1harika <niharika.sathish@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>	2024-10-14 12:10:01 -07:00
Justin Chu	9b1b4e54bb	Move suggest fixes to a separate CI workflow (#22415 ) Move suggest fixes to a separate CI workflow so that it is triggered only on PRs and does not fail the main branch.	2024-10-14 10:26:37 -07:00
Edward Chen	04404ea482	Fix Xcode 16 iOS build issues (#22379 ) - Work around Xcode 16 iOS test build issue: `error: Multiple commands produce '.../PlugIns'`. - Fix link error in iOS static framework test. - Update build.py to check for the right kind of build before running iOS tests on the simulator. - Update Xcode 16 build images to 'macos-15' because that's the only image that will have Xcode 16 soon. See https://github.com/actions/runner-images/issues/10703.	2024-10-14 09:24:38 -07:00
Yi Zhang	caa67439b5	Add more F16 kernels of XNNPack (#22381 ) ### Description 1. Add Gemm, MatMul, Softmax, AveragePool and Resize F16 kernels. This PR includes all changes in #22378 [AB#51066](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/51066) [AB#51026](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/51026) 2. Matrix B must be const, and the dim_size of matrices A and B should NOT be bigger than 2 in XNNPack, so I added 2 tests in matmul_test.cc to make sure it's really tested (that is, compute() must be called). ### Motivation and Context	2024-10-14 17:41:59 +08:00
Yi Zhang	72cc72cc21	New rocm nuget publish pipeline (#22418 ) ### Description Add a new pipeline to publish ROCM package to ADO ### Motivation and Context ### Test Link https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1615	2024-10-13 08:30:06 +08:00
mindest	1fa219d7d5	DecoderMaskedMultiHeadAttention CPU kernel. (#22292 ) ### Description DecoderMaskedMultiHeadAttention CPU kernel.	2024-10-12 13:43:00 -07:00
George Wu	332173509d	fixups for doxygen. add c++ wrapper for setEpDynamicOptions (#22416 ) follow up to https://github.com/microsoft/onnxruntime/pull/22282 replaces https://github.com/microsoft/onnxruntime/pull/22388	2024-10-11 21:59:33 -07:00
kunal-vaishnavi	18e81f8785	Fix Whisper export for FP16 CUDA (#22410 ) ### Description This PR fixes a bug when the ONNX checker is called while exporting Whisper for FP16 CUDA with optional flags. ### Motivation and Context Sometimes, the ONNX checker raises an error depending on the optional flags passed. By wrapping the ONNX checker in a try-except, the conversion can continue even if the checker fails.	2024-10-11 17:37:36 -07:00
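The try-except approach described in the Whisper export fix above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `run_checker_safely`, `passing_checker`, and `failing_checker` are hypothetical names introduced here, and in a real export the callable passed in would be the ONNX checker itself.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def run_checker_safely(checker: Callable[[Any], None], model: Any) -> bool:
    """Run a model checker without letting its failure abort the export.

    Returns True if the checker passed and False if it raised.
    (Hypothetical helper; not part of onnxruntime.)
    """
    try:
        checker(model)
    except Exception as exc:  # broad on purpose: any checker failure is non-fatal
        logger.warning("Model check failed, continuing export: %s", exc)
        return False
    return True


# Stand-in checkers for demonstration only (hypothetical).
def passing_checker(model: Any) -> None:
    """Pretend the model is valid."""


def failing_checker(model: Any) -> None:
    """Pretend the checker rejects the model."""
    raise ValueError("simulated checker error")
```

Returning a boolean rather than swallowing the error silently lets the caller decide whether to log, retry, or surface the failure later, while the conversion itself continues.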
Ted Themistokleous	572e43c5d7	[MIGraphX EP/ ROCm EP] add gfx1200, gfx1201 to CMAKE_HIP_ARCHITECTURES (#22348 ) ### Description Add additional gfx targets for AMD GPU support ### Motivation and Context Required to integrate mainline onnxruntime support for AMD GPUs --------- Co-authored-by: Stefan Sokolovic <stsokolo@amd.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2024-10-11 17:31:36 -07:00
Edward Chen	d7367653ab	Remove clean_docker_image_cache.py and clean-build-docker-image-cache-pipeline.yml. (#22409 ) Clean up old script and build definition.	2024-10-11 14:25:13 -07:00
anujj	23d48ea647	Add TensorRT-Model-Optimizer INT4 AWQ support in onnxruntime tools (#22390 ) [TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) has an implementation of INT4 AWQ. This adds support in the onnxruntime tools for quantizing models with TensorRT-Model-Optimizer.	2024-10-11 13:31:54 -07:00
Kyle	cdebf37105	Add Digital Signature to DLLs in Maven Build (#22401 ) ### Description * Add digital signature to dll files in jar files. * Jar file names: onnxruntime-{version}.jar, onnxruntime_gpu-{version}.jar ### Motivation and Context #19204	2024-10-11 12:14:03 -07:00
mindest	d1627d2c7f	[ROCm] Register op kernel for Sqrt BFloat16 (#22404 ) ### Description ROCm CI has been failing since a test for the BFloat16 Sqrt op was added (introduced in #22068).	2024-10-11 11:02:40 -07:00
Justin Chu	64007ffb79	Create suggestions to autofix files (#22115 )	2024-10-11 10:52:19 -07:00
mindest	3c80aa9fee	Add CPU kernels for DynamicTimeWarping and UnfoldTensor. (#22033 ) ### Description Add CPU kernels for DynamicTimeWarping and UnfoldTensor.	2024-10-11 09:44:18 -07:00
Dmitri Smirnov	f1f3d94e2d	Accommodate BE platforms. Make sure we always write flatbuffers LE (#22375 ) ### Description flatbuffers always writes data in LE, and it is automatically translated to/from BE as needed, but only if we use the proper accessors. This works for the shape. However, we store parameters as bytes, so we need to swap bytes as needed for BE. ### Motivation and Context Address https://github.com/microsoft/onnxruntime/issues/22364	2024-10-11 09:14:44 -07:00
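The byte swapping mentioned in the flatbuffers commit above can be illustrated with a small sketch. It assumes the parameters are a flat buffer of fixed-size elements in host byte order; `to_little_endian` and its arguments are hypothetical names for illustration and do not come from the onnxruntime source.

```python
import sys


def to_little_endian(raw: bytes, element_size: int,
                     byteorder: str = sys.byteorder) -> bytes:
    """Return the elements of `raw` in little-endian byte order.

    `raw` holds fixed-size elements in the byte order named by `byteorder`
    ("little" or "big"). On a little-endian host, or for single-byte
    elements, the buffer is returned unchanged.
    (Hypothetical helper; not part of onnxruntime.)
    """
    if element_size <= 0 or len(raw) % element_size != 0:
        raise ValueError("buffer length must be a multiple of element_size")
    if byteorder == "little" or element_size == 1:
        return raw
    out = bytearray(len(raw))
    # Reverse the bytes of each element independently; element order is kept.
    for i in range(0, len(raw), element_size):
        out[i:i + element_size] = raw[i:i + element_size][::-1]
    return bytes(out)
```

Because reversing bytes within each element is its own inverse, the same per-element swap can be applied when reading little-endian data back on a big-endian host.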

11836 commits