### Description
This PR is a follow-up to
https://github.com/microsoft/onnxruntime/pull/23488 and partially
improves upon https://github.com/microsoft/onnxruntime/issues/23403. It
does the following:
- Prevents unnecessary recompilation of cached shaders for the 'nearest' resize
operation.
- Fixes precision (off-by-one) errors with the asymmetric coordinate
transform. When running the Kokoro TTS model, the values of
`/decoder/decoder/generator/f0_upsamp/Resize_output_0` differ at the end
bounds because of precision issues when dividing 21600 by 72 (should be 300,
but seemingly results in 299.999, which causes issues when flooring).
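The flooring hazard can be illustrated in isolation. This is a minimal sketch, not the shader fix from this PR: `sourceIndexFloat` and `sourceIndexInt` are hypothetical helpers, and the imprecise quotient (one float32 ULP below 300) is simulated rather than produced by a real GPU division.

```js
// Hypothetical helpers for illustration; not the actual WGSL shader code.

// Asymmetric transform: x_original = x_resized / scale, floored for 'nearest'.
// `quotient` stands in for the result of an imprecise float32 division.
function sourceIndexFloat(quotient) {
  return Math.floor(quotient);
}

// Assumed alternative: with integer operands, integer arithmetic avoids rounding.
function sourceIndexInt(outputIndex, scale) {
  return (outputIndex - (outputIndex % scale)) / scale;
}

// Simulated imprecise result: the largest float32 below 300 (300 - 2^-15).
const imprecise = Math.fround(300 - 2 ** -15);

console.log(sourceIndexFloat(imprecise)); // 299 (off by one)
console.log(sourceIndexInt(21600, 72));   // 300
```

A single ULP of error is enough to move the floored index by one, which is consistent with the small number of badly wrong values reported for this node.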
### Motivation and Context
I did a deep dive over the weekend to try to fix Kokoro TTS on WebGPU and
found that the above node had a large difference. Thinking this was a
major issue, I spent some time fixing it. It turns out that it only happens for
a small number of values, leading to a high maximum error, while most values
are correct (as seen here).
BEFORE:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 78.6640682220459 | rtol: 24.13991587587724 | avgDiff: 0.009967932171121087 | medianDiff: 0.000030517578125
```
AFTER:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 0.0011138916015625 | rtol: 0.0020059924232260704 | avgDiff: 0.00008570214675873825 | medianDiff: 0.000030517578125
```
So, although it has a very small impact on the final output (waveform),
this bug could appear with other models in a more severe way.
BEFORE:
```
[waveform] atol: 0.04784199967980385 | rtol: 1366.0462001093495 | avgDiff: 0.0009544936942737713 | medianDiff: 0.00015346752479672432
```
AFTER:
```
[waveform] atol: 0.04775865003466606 | rtol: 1354.7002460360852 | avgDiff: 0.000954830244055033 | medianDiff: 0.00015274062752723694
```
### Description
Convert the `output_padding` attribute from 1D to 2D for ConvTranspose.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/23403
### Description
After some investigation and debugging, I decided to follow the recommended
workaround suggested in https://github.com/vitejs/vite/issues/8427.
### Motivation and Context
There is a known issue with Vite 5.x when using a WebAssembly package.
Detailed information is in https://github.com/vitejs/vite/issues/8427.
There were previous attempts to fix this problem (#23487). I tried
various ways to make it work out of the box for Vite users, but none
of them worked: some "fixes" did fix the usage with Vite but broke other
use cases or bundlers, and some introduced other issues. Eventually I figured
out that there is no good way to fix this inside ONNX Runtime.
Considering that the root cause is inside Vite and that it may be fixed in Vite
v6, I think the best way for now is to follow the recommended workaround.
### Description
Allow importing the `.mjs` and `.wasm` files.
When using Vite, this enables web apps to consume ORT-web with a simpler
setup:
```js
import * as ort from 'onnxruntime-web';
import wasmFileUrl from 'onnxruntime-web/.wasm?url';
ort.env.wasm.wasmPaths = { wasm: wasmFileUrl };
```
BUG #23273
This PR makes the following optimizations:
1. When there is only one output channel, 1) calculate the offset before the
input channel loop to reduce the number of indices-to-offset calculations, and
2) split `inputChannelsPerGroup` into `inputChannelsPerGroupInt` and
`inputChannelsRemainder` parts so that we can always access 4 values at a time
for `inputChannelsPerGroupInt`.
2. Use a precise initial value to avoid useless loop iterations. Thanks to
@jiangzhaoming for the suggestion.
With this PR, ConvTranspose time drops from 8.4s to 3.7s on Intel Meteor Lake.
On NV RTX 2000 Ada, it drops from 2.7s to 1.6s.
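The channel split in the first optimization can be sketched on the CPU side. This is an illustration only: the function name `dotOverChannels` is hypothetical, and it assumes `inputChannelsPerGroupInt` is the largest multiple of 4 not exceeding `inputChannelsPerGroup`, which the description does not state explicitly.

```js
// Hypothetical CPU-side illustration of the channel split; not the WGSL shader.
function dotOverChannels(x, w) {
  const inputChannelsPerGroup = x.length;
  const inputChannelsRemainder = inputChannelsPerGroup % 4;
  const inputChannelsPerGroupInt = inputChannelsPerGroup - inputChannelsRemainder;

  let sum = 0;
  // Main loop: always consumes four channels per iteration (a vec4 load in a shader).
  for (let c = 0; c < inputChannelsPerGroupInt; c += 4) {
    sum += x[c] * w[c] + x[c + 1] * w[c + 1] + x[c + 2] * w[c + 2] + x[c + 3] * w[c + 3];
  }
  // Tail loop: handles the 0 to 3 leftover channels one at a time.
  for (let c = inputChannelsPerGroupInt; c < inputChannelsPerGroup; c++) {
    sum += x[c] * w[c];
  }
  return sum;
}

console.log(dotOverChannels([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1])); // 21
```

Keeping the remainder out of the main loop means the hot path never needs a per-element bounds check.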
The WebNN CPU device type may now target different backends, such as
CoreML. Legacy special workarounds for the TFLite backend should be
removed and allowed to fail as is, as these are implementation issues.
Additionally, the WebNN EP should adhere to the WebNN API conformance.
We assume all the WebNN ops should be supported, so remove the WebNN op
support status for different device types in webnn-operators.md as well.
WebNN doesn't provide a dedicated op for RotaryEmbedding. Instead, we
implement it by using a combination of WebNN ops. The decomposed graph
is referenced from DML EP at:
onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/DmlOperatorRotaryEmbedding.cpp
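For orientation, the math that such a decomposition has to reproduce can be written as a small reference function. This sketch assumes the non-interleaved ("rotate half") layout; the description does not say which layouts the WebNN EP handles, and `rotaryEmbedding` is a hypothetical name rather than code from the EP.

```js
// Minimal reference for the non-interleaved ("rotate half") form. This is an
// assumption: the description does not specify the layout being decomposed.
function rotaryEmbedding(x, cos, sin) {
  const half = x.length / 2;
  const out = new Array(x.length);
  for (let i = 0; i < half; i++) {
    const a = x[i];
    const b = x[i + half];
    // Rotate each (a, b) pair by the angle whose cosine and sine are cos[i], sin[i].
    out[i] = a * cos[i] - b * sin[i];
    out[i + half] = b * cos[i] + a * sin[i];
  }
  return out;
}

// A 90-degree rotation (cos = 0, sin = 1) maps (a, b) to (-b, a).
console.log(rotaryEmbedding([1, 2, 3, 4], [0, 0], [1, 1])); // [ -3, -4, 1, 2 ]
```

In graph form, each line of the loop body corresponds to element-wise multiply, subtract, or add ops applied to the two halves of the input, followed by a concatenation.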
### Description
This PR contains a part of the changes in #23318.
The reason for creating this PR is that the work to support building the WebGPU
EP in WASM depends on #23318, which cannot be merged since it is blocked
by upstream (https://github.com/llvm/llvm-project/issues/122166). This
PR contains the changes that can be safely merged separately and unblocks
the development of building the WebGPU EP in WASM.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
BUG #23273
With this change, the ConvTranspose time in that bug drops from ~90s to ~7s
on my Meteor Lake.
This PR does the following:
1. Use the stride as the loop increment.
In the bug, the stride is 1024, so this greatly reduces the number of loop
iterations.
2. Support components for A to reduce the number of memory accesses.
3. When there is only one output channel, B can use the same components as A to
further reduce the number of memory accesses.
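The effect of stepping by the stride can be sketched with a 1-D ConvTranspose for a single output element. Both functions are hypothetical illustrations, not the WGSL shader; they assume no dilation and a single channel.

```js
// Hypothetical 1-D ConvTranspose for one output element; not the WGSL shader.
// Output o receives x[i] * w[k] whenever i * stride + k - pad === o.

// Naive: visit every kernel position and discard the ones that do not line up.
function outputNaive(x, w, o, stride, pad) {
  let acc = 0;
  for (let k = 0; k < w.length; k++) {
    const t = o + pad - k;
    if (t < 0 || t % stride !== 0) continue;
    const i = t / stride;
    if (i < x.length) acc += x[i] * w[k];
  }
  return acc;
}

// Strided: start at the first kernel position that lines up, then step by stride.
function outputStrided(x, w, o, stride, pad) {
  let acc = 0;
  for (let k = (o + pad) % stride; k < w.length && k <= o + pad; k += stride) {
    const i = (o + pad - k) / stride;
    if (i < x.length) acc += x[i] * w[k];
  }
  return acc;
}

console.log(outputNaive([1, 2, 3], [1, 10, 100, 1000], 2, 2, 0));   // 102
console.log(outputStrided([1, 2, 3], [1, 10, 100, 1000], 2, 2, 0)); // 102
```

With a stride of 1024, the naive loop rejects roughly 1023 out of every 1024 kernel positions, while the strided loop only visits positions that contribute.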
### Description
This PR tries to fix #22615 (see the detailed description in the issue).
A perfect solution would be too difficult to make, because there are a
huge number of combinations of usage scenarios, including combinations
of development framework, bundler, dev/prod mode, and so on.
This PR uses the following approach:
- Introduce a new type of end-to-end test: the export test. Tests of this type
  are complete web apps that use popular web development frameworks;
  the tests use puppeteer to run the apps and check that they
  run without errors.
  - Added one Next.js-based web app and one Vite-based web app.
  - Each test performs the following steps:
    - `npm install` for packages built locally
    - `npm run dev` to start the dev server, then use puppeteer to launch the
      browser to test
    - `npm run build && npm run start` to test the prod build, then use
      puppeteer to launch the browser to test
- Make changes to ort-web, including:
  - special handling of Webpack's behavior of rewriting `import.meta.url`
    to a `file://` string
  - revise build definitions
  - fix the wasm URL for the proxy, if used in a bundled build
### Description
Added a fatal error message for the unsupported GroupQueryAttention
`do_rotary` attribute.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22987
Helps users understand that this attribute is not supported.
### Description
The Web CI pipeline uses three different Windows machine pools:
1. onnxruntime-Win2022-webgpu-A10
2. onnxruntime-Win2022-VS2022-webgpu-A10
3. onnxruntime-Win-CPU-2022-web
This PR merges them together to reduce ongoing maintenance cost.
### Description
### Motivation and Context
The algorithm of `SkipSimplifiedLayerNormalization` is quite similar to
that of `SimplifiedLayerNormalization`; the only difference is that
`SkipSimplifiedLayerNormalization` provides an additional output
holding the sum of the input, skip, and bias (if it exists).
Also fixes a bug in `SimplifiedLayerNormalization` by adding the bias if it
exists.
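The relationship between the two ops can be sketched as a CPU reference over a single row. This is a sketch under assumptions: the function name, the `epsilon` default, and the flat-array layout are illustrative and not taken from the WebNN EP.

```js
// Minimal CPU reference over a single row; the name and `epsilon` default are assumptions.
function skipSimplifiedLayerNorm(input, skip, gamma, bias, epsilon = 1e-5) {
  // Extra output: element-wise sum of input, skip, and bias (when provided).
  const sum = input.map((v, i) => v + skip[i] + (bias ? bias[i] : 0));
  // SimplifiedLayerNormalization (RMS norm): no mean subtraction.
  const meanSquare = sum.reduce((acc, v) => acc + v * v, 0) / sum.length;
  const invRms = 1 / Math.sqrt(meanSquare + epsilon);
  const output = sum.map((v, i) => v * invRms * gamma[i]);
  return { output, sum };
}

const result = skipSimplifiedLayerNorm([1, 2], [2, 1], [1, 1], [0, 1], 0);
console.log(result.sum); // [ 3, 4 ]
```

The normalization step is identical to the non-skip op; only the pre-addition and the second output differ.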
### Description
These test cases started to fail for unknown reasons.
To unblock the CI, I disabled them temporarily to buy time to
investigate the root cause.
### Description
Fix a bug caused by potential out-of-bound reads of `W` in the
Conv2DMatMul shader.
### Motivation and Context
Fixes #22983
### Description
This PR replaces #21671. It offers a new way of accessing
the following:
- `ort.env.webgpu.adapter`:
  - **deprecating**. There is no point in getting its value. Once
    `GPUDevice.adapterInfo` is supported, there is no point in setting it
    either.
- `ort.env.webgpu.device`:
  - set: assign a `GPUDevice` if the user created one. Use at the user's own
    risk.
  - get: returns a `Promise<GPUDevice>`. If no device exists, a new one is
    created; if one exists, it is returned.
- `ort.env.webgpu.powerPreference`:
  - **deprecating**. Users are encouraged to set `ort.env.webgpu.device` if
    necessary.
- `ort.env.webgpu.forceFallbackAdapter`:
  - **deprecating**. Users are encouraged to set `ort.env.webgpu.device` if
    necessary.
### Description
This change uses a TypeScript trick to infer global types in
onnxruntime-common. Thanks to TypeScript's strong type system, we
are able to refer to types that may not be available in the context.
This keeps onnxruntime-common free of dependencies like
"@webgpu/types" while still being able to use the types in the
declarations. See the comments on `TryGetGlobalType` in `type-helper.ts`.
### Description
### Motivation and Context
WebNN doesn't provide a dedicated op for LRN, so use a combination of WebNN ops
to emulate it in the WebNN EP:
pow -> transpose -> pad -> averagePool -> transpose -> mul -> add -> pow -> div
@Honry @fdwr PTAL, thanks!
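The op chain for LRN can be traced on a 1-D example across the channel axis. This is a hypothetical sketch, not the EP code: the transposes are omitted because there is only one axis, the padding is assumed to be zero-filled, and the average is assumed to count the padded zeros.

```js
// Hypothetical 1-D emulation across the channel axis for a single spatial position.
function lrnEmulated(x, size, alpha, beta, bias) {
  const squared = x.map((v) => v * v); // pow
  const lead = Math.floor((size - 1) / 2);
  const trail = Math.ceil((size - 1) / 2);
  // pad: zero-fill both ends so every channel has a full window
  const padded = [...new Array(lead).fill(0), ...squared, ...new Array(trail).fill(0)];
  return x.map((v, c) => {
    let windowSum = 0;
    for (let j = 0; j < size; j++) windowSum += padded[c + j];
    const avg = windowSum / size;                  // averagePool
    return v / Math.pow(avg * alpha + bias, beta); // mul, add, pow, div
  });
}

console.log(lrnEmulated([1, 2, 3], 3, 3, 1, 1));
```

Because the average already divides by `size`, multiplying by `alpha` yields the `alpha / size * sum` term of the LRN definition without a separate division.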
`Module.jsepRegisterMLConstant` will be shortened by Closure Compiler in the
official release, which causes an undefined error.
Fix it by using `Module['jsepRegisterMLConstant']`.
### Description
### Motivation and Context
This PR makes MatMul shaders depend only on the input ranks and on the shapes
provided in uniforms, not on the inputs' broadcasting pattern. This
fixes the issue where shader code differed across broadcasting patterns
but shared an identical cache key, resulting in
wrong cache hits.
This CL makes the WebGPU backend support subgroup features, allowing
subgroup optimizations in the future.
### Description
With this CL, the WebGPU backend creates devices with the subgroups and
subgroups-f16 features (both are under origin trial in Chrome) or the
chromium-experimental-subgroups feature enabled whenever available.
### Motivation and Context
This CL allows WebGPU operator shaders to use subgroup
optimizations in the future, which might bring significant speedups.
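The feature selection described here can be sketched as a small helper that takes the adapter's feature set. `pickSubgroupFeatures` is a hypothetical name, and the precedence between the standard and experimental feature names is an assumption, not something the description specifies.

```js
// Hypothetical helper; the precedence between the standard and experimental
// feature names is an assumption, not taken from the PR.
function pickSubgroupFeatures(adapterFeatures) {
  const required = [];
  if (adapterFeatures.has('subgroups')) {
    required.push('subgroups');
    if (adapterFeatures.has('subgroups-f16')) required.push('subgroups-f16');
  } else if (adapterFeatures.has('chromium-experimental-subgroups')) {
    required.push('chromium-experimental-subgroups');
  }
  return required;
}

// In a browser, the result would feed adapter.requestDevice({ requiredFeatures: ... }),
// with adapter.features as the input set.
console.log(pickSubgroupFeatures(new Set(['subgroups', 'subgroups-f16'])));
// [ 'subgroups', 'subgroups-f16' ]
```

Requesting only features the adapter reports avoids a rejected `requestDevice` call on browsers without subgroup support.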
This PR fixes a bug that occurs when searching for a compatible `MLTensor`
in the cache. We were not checking the number of dimensions in the
shape, which meant that a cached buffer of shape `[1]` could match
`[1, 1, 256, 256]`.
This PR also adds better handling when attempting to force an `MLTensor`
to a different shape.
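A shape comparison that includes the rank check might look like the following. `shapesMatch` is a hypothetical helper for illustration; the actual cache lookup in the WebNN EP may be structured differently.

```js
// Hypothetical helper; the actual cache lookup in the WebNN EP may differ.
function shapesMatch(cachedShape, requestedShape) {
  // Compare the rank first, then every dimension.
  return (
    cachedShape.length === requestedShape.length &&
    cachedShape.every((dim, i) => dim === requestedShape[i])
  );
}

console.log(shapesMatch([1], [1, 1, 256, 256]));              // false
console.log(shapesMatch([1, 1, 256, 256], [1, 1, 256, 256])); // true
```

Without the length comparison, `every` over the shorter shape only inspects its own dimensions, which is how `[1]` could be accepted for `[1, 1, 256, 256]`.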
In the current implementation, all the staging buffers for weight uploading
are destroyed after the first batch of kernel execution. This requires a lot
of memory because none of the staging buffers can be reused. It also hurts
the startup time (weight uploading only happens during session creation),
because weight uploading is delayed until very late.
This PR submits the queue and destroys staging buffers very aggressively,
so that the related GPU memory can be reused as much as
possible, though the actual behavior depends on the WebGPU and driver
implementation. The aggressive queue submission also moves GPU
operations much earlier, which helps the startup time.
Some buffer uploading benchmarks were written to compare multiple
solutions with regard to memory and time consumption. The benchmarks can
be found at
https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html,
and detailed test data can be found at
https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit.
I also tested phi3.5 on 2 machines: first inference time improved from
5141ms to 3579ms and from 4327ms to 2947ms, respectively.
#22031
For reduce-related ops, we should increase workgroupSize to improve
parallelism if only one workgroup is dispatched.
The total ReduceMean time drops from 77.79 ms to 8.98 ms on my iGPUs.
BUG #22031
The total Gemm time in the demucs model drops from over 1000 ms to 181.14 ms
on my iGPUs.
### Description
### Motivation and Context
WebNN doesn't provide a dedicated op for SimplifiedLayerNormalization, so use
a combination of WebNN ops to emulate it in the WebNN EP:
X --> Pow --> ReduceMean --> Add --> Sqrt --> Div --> Mul
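The chain can be followed step by step on a plain array. In this sketch each helper stands in for one op; the helper names, the single-row layout, and the `epsilon` default are assumptions rather than the EP's actual graph-building code.

```js
// Each helper stands in for one op of the chain; names and the `epsilon` default are assumptions.
const pow = (a, p) => a.map((v) => Math.pow(v, p));
const reduceMean = (a) => a.reduce((s, v) => s + v, 0) / a.length;
const div = (a, s) => a.map((v) => v / s);
const mul = (a, b) => a.map((v, i) => v * b[i]);

function simplifiedLayerNorm(x, scale, epsilon = 1e-5) {
  const meanSquare = reduceMean(pow(x, 2));    // Pow -> ReduceMean
  const rms = Math.sqrt(meanSquare + epsilon); // Add -> Sqrt
  return mul(div(x, rms), scale);              // Div -> Mul
}

console.log(simplifiedLayerNorm([3, 4], [1, 2], 0));
```

`Add` contributes the epsilon term before the square root, which keeps the division well defined for all-zero inputs when epsilon is nonzero.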