### Description
This PR is a follow-up to
https://github.com/microsoft/onnxruntime/pull/23488 and partially
improves upon https://github.com/microsoft/onnxruntime/issues/23403. It
does the following:
- Prevents unnecessary recompilation of cached shaders for the 'nearest'
resize operation.
- Fixes precision (off-by-one) errors with the asymmetric coordinate
transform. When running the Kokoro TTS model, the values of
`/decoder/decoder/generator/f0_upsamp/Resize_output_0` differ at the
end bounds because of precision issues when dividing 21600 by 72 (the
result should be 300, but it seemingly comes out as 299.999, which
causes issues when flooring).
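
The flooring problem described above can be reproduced on the CPU. The sketch below is only an illustration: `floorWithTolerance` and its `tolerance` parameter are hypothetical names, and snapping to the nearest integer is one possible guard, not necessarily the one this PR uses.

```typescript
// Hypothetical helper (not the ORT implementation): floor a coordinate, but
// snap to the nearest integer first when the value is within `tolerance` of it.
function floorWithTolerance(value: number, tolerance: number = 1e-4): number {
  const nearest = Math.round(value);
  return Math.abs(value - nearest) < tolerance ? nearest : Math.floor(value);
}

// A quotient that should be exactly 300 but lands just below it.
const almost300 = 299.99997;
const naiveIndex = Math.floor(almost300); // 299: one element too early
const guardedIndex = floorWithTolerance(almost300); // 300: the intended index
```

Values that are genuinely between two integers, such as 299.5, are still floored as usual.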
### Motivation and Context
I did a deep dive over the weekend to try to fix Kokoro TTS on WebGPU
and found that the above node had a large difference. Thinking this was
a major issue, I spent some time fixing it. It turns out that it only
happens for a small number of values, leading to a high maximum error,
but most values are correct (as seen here).
BEFORE:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 78.6640682220459 | rtol: 24.13991587587724 | avgDiff: 0.009967932171121087 | medianDiff: 0.000030517578125
```
AFTER:
```
[/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 0.0011138916015625 | rtol: 0.0020059924232260704 | avgDiff: 0.00008570214675873825 | medianDiff: 0.000030517578125
```
So, although it has a very small impact on the final output (waveform),
this bug could appear with other models in a more severe way.
BEFORE:
```
[waveform] atol: 0.04784199967980385 | rtol: 1366.0462001093495 | avgDiff: 0.0009544936942737713 | medianDiff: 0.00015346752479672432
```
AFTER:
```
[waveform] atol: 0.04775865003466606 | rtol: 1354.7002460360852 | avgDiff: 0.000954830244055033 | medianDiff: 0.00015274062752723694
```
### Description
Convert output_padding attribute from 1D to 2D convtranspose
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/23403
BUG #23273
This PR makes the following optimizations:
1. When the number of output channels is one: 1) calculate the offset
before the input-channel loop to reduce indices-to-offsets calculations,
and 2) split `inputChannelsPerGroup` into `inputChannelsPerGroupInt` and
`inputChannelsRemainder` parts so that we can always access 4 values at
a time for `inputChannelsPerGroupInt`.
2. Use a precise initial value to reduce useless loop iterations. Thanks
to @jiangzhaoming for the suggestion.
With this PR, ConvTranspose time drops from 8.4s to 3.7s on Intel Meteor
Lake. On NV RTX 2000 Ada, it drops from 2.7s to 1.6s.
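
The channel split in item 1 can be sketched on the CPU as follows. This is an illustration only: it assumes `inputChannelsPerGroupInt` is the largest channel count divisible by 4 and `inputChannelsRemainder` is what is left over. `splitChannels` and `dotOverChannels` are hypothetical helpers, not the shader code.

```typescript
// Split a channel count into a part divisible by `components` and a remainder.
function splitChannels(inputChannelsPerGroup: number, components: number = 4) {
  const inputChannelsRemainder = inputChannelsPerGroup % components;
  const inputChannelsPerGroupInt = inputChannelsPerGroup - inputChannelsRemainder;
  return { inputChannelsPerGroupInt, inputChannelsRemainder };
}

// Accumulate x[c] * w[c] over all channels: four at a time, then the leftovers.
function dotOverChannels(x: number[], w: number[]): number {
  const { inputChannelsPerGroupInt, inputChannelsRemainder } = splitChannels(x.length);
  let sum = 0;
  // Main loop: always touches exactly 4 values, like a vec4 load in a shader.
  for (let c = 0; c < inputChannelsPerGroupInt; c += 4) {
    sum += x[c] * w[c] + x[c + 1] * w[c + 1] + x[c + 2] * w[c + 2] + x[c + 3] * w[c + 3];
  }
  // Tail loop: at most 3 scalar accesses.
  const end = inputChannelsPerGroupInt + inputChannelsRemainder;
  for (let c = inputChannelsPerGroupInt; c < end; c++) {
    sum += x[c] * w[c];
  }
  return sum;
}
```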
### Description
BUG #23273
With this change, the ConvTranspose time in that bug drops from ~90s to
~7s on my Meteor Lake.
This PR does the following:
1. Use the stride as the loop increment. In the bug, the stride is
1024, so this greatly reduces the number of loop iterations.
2. Support components for A to reduce the number of memory accesses.
3. When the number of output channels is 1, B can use the same
components as A to further reduce the number of memory accesses.
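
To illustrate item 1, the sketch below compares a loop that visits every kernel tap with one that starts at the first contributing tap and steps by the stride. It is a simplified 1-D model with dilation 1; all function and parameter names are illustrative and do not come from the shader.

```typescript
// Which kernel taps k contribute to output position o of a 1-D transposed
// convolution? Tap k contributes when (o + pad - k) is a non-negative multiple
// of `stride` and the resulting input index is inside the input.
function contributingTapsNaive(o: number, pad: number, stride: number, kernel: number, inLen: number): number[] {
  const taps: number[] = [];
  for (let k = 0; k < kernel; k++) {
    const d = o + pad - k;
    if (d < 0 || d % stride !== 0) continue; // most iterations are wasted here
    if (d / stride < inLen) taps.push(k);
  }
  return taps;
}

// Same result, but start at the first valid tap and advance by `stride`.
function contributingTapsStrided(o: number, pad: number, stride: number, kernel: number, inLen: number): number[] {
  const taps: number[] = [];
  for (let k = (o + pad) % stride; k < kernel; k += stride) {
    const d = o + pad - k;
    if (d < 0) break; // d only decreases as k grows
    if (d / stride < inLen) taps.push(k);
  }
  return taps;
}
```

With a stride of 1024 and a kernel of 2048 taps, the strided loop runs at most twice per output position instead of 2048 times.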
### Description
This PR tries to fix #22615 (see the detailed description in the issue).
A perfect solution would be too difficult to build, because there is a
huge number of combinations of usage scenarios, including combinations
of development framework, bundler, dev/prod mode, and so on.
This PR uses the following approach:
- Introduce a new type of end-to-end test: the export test. Tests of
this type are complete web apps that use popular web development
frameworks; the tests use puppeteer to run the apps and check that they
run without error.
  - Added one Next.js-based web app and one Vite-based web app.
- In the test, perform the following steps:
  - `npm install` for packages built locally
  - `npm run dev` to start the dev server, and use puppeteer to launch
  the browser to test
  - `npm run build && npm run start` to test the prod build, and use
  puppeteer to launch the browser to test
- Make changes to ort-web, including:
  - special handling of Webpack's behavior of rewriting
  `import.meta.url` to a `file://` string
  - revise build definitions
  - fix the wasm URL for the proxy, if used in a bundled build
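
The `import.meta.url` handling could be guarded roughly as follows. This is only a sketch: `resolveScriptUrl` and its `fallbackUrl` parameter are hypothetical and not part of ort-web. The assumption is that a bundler-rewritten `file://` value cannot be used to locate assets in the browser, so another source for the script location has to be used instead.

```typescript
// Hypothetical helper: decide which URL to use as the base for locating assets.
// `importMetaUrl` is whatever `import.meta.url` evaluated to after bundling.
function resolveScriptUrl(importMetaUrl: string | undefined, fallbackUrl: string): string {
  // A `file://` URL in a browser bundle points at the build machine's disk,
  // so treat it the same as "unknown".
  if (importMetaUrl === undefined || importMetaUrl.startsWith('file:')) {
    return fallbackUrl;
  }
  return importMetaUrl;
}
```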
### Description
Added a fatal error message for the unsupported GroupQueryAttention
`do_rotary` attribute.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22987
Helps users understand that this attribute is not supported.
### Description
Fix a bug caused by potential out-of-bound reads of `W` in the
Conv2DMatMul shader.
### Motivation and Context
Fixes #22983
### Description
This change uses a TypeScript trick to infer global types in
onnxruntime-common. Thanks to TypeScript's strong type system, we are
able to refer to types that may not be available in the current context.
This keeps onnxruntime-common from having to include dependencies like
"@webgpu/types" while still being able to use those types in its
declarations. See the comments on `TryGetGlobalType` in `type-helper.ts`.
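
A helper of this kind can be written with a conditional type over `typeof globalThis`. The definition below is a sketch of the technique and may differ from the one in `type-helper.ts`; `HypotheticalMissingGlobal` is a made-up name used only to show the fallback branch.

```typescript
// If a global named `Name` with a `prototype` is declared in the current
// compilation, resolve to its instance type; otherwise resolve to `Fallback`.
type TryGetGlobalType<Name extends string, Fallback = unknown> =
  typeof globalThis extends Record<Name, { prototype: infer T }> ? T : Fallback;

// `Date` is always declared, so this resolves to the real `Date` type.
type ResolvedDate = TryGetGlobalType<'Date'>;
const epoch: ResolvedDate = new Date(0);

// No such global is declared, so this resolves to the fallback type.
type ResolvedMissing = TryGetGlobalType<'HypotheticalMissingGlobal', string>;
const missing: ResolvedMissing = 'fallback';
```

The same pattern lets a declaration mention a type such as `GPUDevice` when the consumer has its typings installed, and degrade to the fallback when it does not.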
### Description
### Motivation and Context
`Module.jsepRegisterMLConstant` is shortened by the Closure Compiler in
official release builds, which causes an undefined error.
Fix it by using `Module['jsepRegisterMLConstant']`.
### Description
### Motivation and Context
This PR makes MatMul shaders independent of the inputs' broadcasting
pattern; they now depend only on the input ranks and on the shapes
provided in uniforms. This fixes an issue where the shader code differed
between broadcasting patterns but had an identical cache key, which
resulted in wrong cache hits.
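
The idea of resolving broadcasting from runtime shapes rather than baking it into the generated code can be sketched like this. `broadcastBatchOffset` is a hypothetical CPU-side illustration, not the shader helper; it assumes the output batch rank is at least the input batch rank, and the shape array stands in for values a shader would read from uniforms.

```typescript
// Map the batch indices of an output element to the flat batch offset of a
// (possibly broadcast) input. Dimensions of size 1 always map to index 0, and
// missing leading dimensions are treated as broadcast.
function broadcastBatchOffset(outputBatchIndices: number[], inputBatchShape: number[]): number {
  const rankDiff = outputBatchIndices.length - inputBatchShape.length;
  let offset = 0;
  for (let i = 0; i < inputBatchShape.length; i++) {
    const dim = inputBatchShape[i];
    const index = dim === 1 ? 0 : outputBatchIndices[i + rankDiff];
    offset = offset * dim + index;
  }
  return offset;
}
```

Because the same function body handles every broadcasting pattern of a given rank, one compiled program (and one cache key) is enough for all of them.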
This CL makes the WebGPU backend support subgroup features, which allows
subgroup optimizations to be used in the future.
### Description
With this CL, the WebGPU backend will create devices with the subgroups
and subgroups-f16 features (both are under origin trial in Chrome) or
the chromium-experimental-subgroups feature enabled whenever they are
available.
### Motivation and Context
This CL allows WebGPU operator shaders to use subgroup optimizations in
the future, which might yield significant speedups.
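
Feature selection of this kind might look like the sketch below. `selectSubgroupFeatures` is a hypothetical helper, and the precedence (prefer the standard feature names, otherwise fall back to the experimental one) is an assumption rather than something stated in this CL. The input is any set of feature names, for example what an adapter reports as supported.

```typescript
// Pick the subgroup-related features to request, given what is available.
function selectSubgroupFeatures(available: ReadonlySet<string>): string[] {
  const required: string[] = [];
  if (available.has('subgroups')) {
    required.push('subgroups');
    if (available.has('subgroups-f16')) {
      required.push('subgroups-f16');
    }
  } else if (available.has('chromium-experimental-subgroups')) {
    required.push('chromium-experimental-subgroups');
  }
  return required;
}
```

The returned names would then be added to the `requiredFeatures` list passed when requesting the device.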
This PR fixes a bug that occurs when searching for a compatible
`MLTensor` in the cache. The check did not compare the number of
dimensions in the shape, which meant that a cached buffer of shape `[1]`
could match a request for `[1, 1, 256, 256]`.
This PR also adds better handling when attempting to force an `MLTensor`
to a different shape.
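
A minimal sketch of the difference, assuming the faulty comparison only walked the cached shape's dimensions (the exact original code is not shown here, and both function names are hypothetical):

```typescript
// Faulty variant: only checks as many dimensions as the cached shape has.
function prefixMatches(cached: readonly number[], requested: readonly number[]): boolean {
  return cached.every((dim, i) => dim === requested[i]);
}

// Corrected variant: the ranks must match before comparing dimensions.
function shapesEqual(a: readonly number[], b: readonly number[]): boolean {
  return a.length === b.length && a.every((dim, i) => dim === b[i]);
}

const wrongHit = prefixMatches([1], [1, 1, 256, 256]); // true: the bug
const correct = shapesEqual([1], [1, 1, 256, 256]); // false
```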
In the current implementation, all the staging buffers for weight
uploading are destroyed only after the first batch of kernel execution.
This requires a lot of memory, as none of the staging buffers can be
reused. It also hurts the startup time (weight uploading only happens
during session creation), as weight uploading is delayed until very
late.
This PR submits the queue and destroys staging buffers very
aggressively, so that the related GPU memory can be reused as much as
possible, though the actual behavior depends on the WebGPU and driver
implementation. The aggressive queue submission also moves GPU
operations much earlier, which helps the startup time.
Several buffer-uploading benchmarks were written to compare multiple
solutions with regard to memory and time consumption. The benchmarks can
be found at
https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html,
while detailed test data can be found at
https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit.
I also tested phi3.5 on 2 machines: the first inference time improved
from 5141ms to 3579ms and from 4327ms to 2947ms, respectively.
#22031
For reduce-related ops, we should increase workgroupSize to improve
parallelism if only one workgroup is dispatched.
The total ReduceMean time drops from 77.79 ms to 8.98 ms on my iGPUs.
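
Why a larger workgroup helps when a single workgroup does all the work can be seen in a CPU simulation. The sketch below is illustrative only: `simulatedWorkgroupSum` is a hypothetical function, it assumes `workgroupSize` is a power of two, and it models each "thread" as one loop iteration rather than running anything on a GPU.

```typescript
// Simulate one workgroup summing `data`: each thread accumulates a strided
// slice, then the partial sums are combined with a tree reduction.
function simulatedWorkgroupSum(data: number[], workgroupSize: number): { sum: number; iterationsPerThread: number } {
  const partial: number[] = new Array<number>(workgroupSize).fill(0);
  for (let t = 0; t < workgroupSize; t++) {
    for (let i = t; i < data.length; i += workgroupSize) {
      partial[t] += data[i];
    }
  }
  // Tree reduction over the per-thread partial sums (shared memory on a GPU).
  for (let stride = workgroupSize >> 1; stride > 0; stride >>= 1) {
    for (let t = 0; t < stride; t++) {
      partial[t] += partial[t + stride];
    }
  }
  return { sum: partial[0], iterationsPerThread: Math.ceil(data.length / workgroupSize) };
}
```

For 1000 elements, a workgroup of 32 threads needs 32 serial iterations per thread, while a workgroup of 256 threads needs only 4.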
BUG #22031
The total Gemm time in the demucs model drops from over 1000 ms to
181.14 ms on my iGPUs.
### Description
BUG #22031
In the demucs model, there are lots of MatMul ops with shapes like the
following:
`input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32,
output[0]: [3448,1,1536] | float32`
For this kind of shape, the batch size is large but M = 1. Our current
algorithm partitions tiles based on [M, N], which is inefficient for
such shapes. This PR reshapes the inputs to improve the matmul
performance.
Before: [3448,1,512] x [512,1536] = [3448,1,1536]
After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536], then the output
can be reshaped to [3448, 1, 1536]
The overall MatMul time in the demucs model drops from 4418.17 ms to
1778.45 ms on my iGPUs.
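
The shape bookkeeping for this reshape can be sketched as follows. `foldBatchIntoM` is a hypothetical helper, and the condition it checks (a rank-3 A with M = 1 and a rank-2 B) is an assumption based on the example shapes above, not the exact condition used in the PR.

```typescript
// If A is [batch, 1, K] and B is [K, N], treat the batch as the M dimension:
// A becomes [1, batch, K], the product is [1, batch, N], and the final output
// is reshaped back to [batch, 1, N]. Returns undefined when not applicable.
function foldBatchIntoM(
  aShape: number[],
  bShape: number[],
): { a: number[]; output: number[]; restored: number[] } | undefined {
  if (aShape.length !== 3 || bShape.length !== 2) return undefined;
  const [batch, m, k] = aShape;
  const [kB, n] = bShape;
  if (m !== 1 || k !== kB) return undefined;
  return { a: [1, batch, k], output: [1, batch, n], restored: [batch, 1, n] };
}
```

No data movement is needed for either reshape, because the element order in memory is the same for `[batch, 1, K]` and `[1, batch, K]`.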
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
This change adds a cache of `MLContext`s, keyed by their options, to the
`WebNNBackend`. This makes it so that multiple `InferenceSession`s
created with the same options will share the same context.
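
A cache keyed by options can be sketched as below. `ContextCache` is a hypothetical class, not the `WebNNBackend` code: it uses a generic stand-in for the context type, builds the key from the sorted option names (an assumption about how equality of options is decided), and creates contexts synchronously for simplicity even though real context creation is asynchronous.

```typescript
type SimpleOptions = Record<string, string | number | boolean>;

class ContextCache<TContext> {
  private readonly cache = new Map<string, TContext>();

  // Build a key that does not depend on property insertion order.
  private static keyOf(options: SimpleOptions): string {
    const entries = Object.keys(options)
      .sort()
      .map((name) => [name, options[name]]);
    return JSON.stringify(entries);
  }

  // Return the cached context for these options, creating it on first use.
  getOrCreate(options: SimpleOptions, create: () => TContext): TContext {
    const key = ContextCache.keyOf(options);
    let context = this.cache.get(key);
    if (context === undefined) {
      context = create();
      this.cache.set(key, context);
    }
    return context;
  }
}
```

Two sessions that pass equal options then receive the same context object, so tensors created by one can be consumed by the other.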
### Motivation and Context
Since `MLTensor`s are tied to `MLContext`s, developers can't easily
share tensors between `InferenceSession`s (outside of manually creating
an `MLContext` and specifying the `context` option). This leads to
strange behaviors such as:
```js
// Both sessions use the WebNN EP, but each one gets its own MLContext.
const sessionA = await ort.InferenceSession.create(urlA, {
  executionProviders: ["webnn"],
  preferredOutputLocation: "ml-buffer",
});
const sessionB = await ort.InferenceSession.create(urlB, {
  executionProviders: ["webnn"],
});
const temp = await sessionA.run({/* arguments */});
// ERROR: Failed to execute 'dispatch' on 'MLContext': Invalid inputs: The context of MLGraph doesn't match the context of the MLTensor with name "input".
const result = await sessionB.run({"input": temp["output"]});
```
We encountered this behavior when updating the transformers.js version
in the developer preview demos. microsoft/webnn-developer-preview#46
BUG #22031
Optimizes the following two cases:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid the transpose when it is not necessary.
The overall time of the demucs model drops from 154.60 ms to 106.36 ms
on my dGPUs with this PR and PR #22577.
### Description
This PR introduces support for registering external data inside WebNN
EP.
### Motivation and Context
- The WebNN EP needs to register initializers at the graph compilation
stage. For initializers from external data, it can't leverage the
general external data loader framework, because the WebNN EP's graph
compilation is executed before the external data loader is called.
- Exposes `utils::GetExternalDataInfo`, which is useful for the WebNN EP
to read the external tensor's information.
- Defines a new `registerMLConstant` in JSEP to create WebNN constants
from external data in the WebNN backend, taking the tensor's info as
parameters as well as `Module.MountedFiles`, which holds all preloaded
external files.
### Description
This change enables caching `MLTensor`s between inference runs. This is
done by keeping a reference to `MLTensor`s alive after they have been
released. `MLTensor`s are only destroyed once the session goes out of
scope.
### Motivation and Context
Creating and destroying `MLTensor`s on every run has a non-trivial
performance penalty. This penalty materializes when using
`ort.Tensors`[location=cpu] for inputs/outputs or when using the CPU EP
as a fallback EP for unsupported operators. The former can be mitigated
by developers using `ort.Tensors`[location=ml-tensor]. The latter cannot
be mitigated by developers.
### Description
This PR further optimizes matmulnbits specifically for iGPUs. The phi3
demo goes from ~8 tokens/second to ~12 tokens/second on iGPUs.
Some TODOs:
1. Make the optimization more general; remove the blockSize = 32
limitation.
2. Tune parameters such as workgroupSize and the components size
(currently only components = 1 is supported) to see how performance
changes.
### Description
With this optimization, 96 MultiHeadAttention|Transpose ops in phi3
disappear. Phi3 goes from 107 tokens to 113 tokens on my dGPUs.
The optimization mainly skips the transpose op if one of the transposed
dims is 1, since a reshape is enough in that case.
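
One way to test whether a transpose can be replaced by a reshape is to check that the axes with size greater than 1 keep their relative order. The sketch below uses that general criterion, which covers the "one of the transposed dims is 1" case; `isTransposeReshape` is a hypothetical name and the exact condition used in the PR may be narrower.

```typescript
// A transpose moves no data if, ignoring size-1 axes, the remaining axes
// appear in `perm` in increasing order.
function isTransposeReshape(perm: number[], shape: number[]): boolean {
  let lastAxis = -1;
  for (const axis of perm) {
    if (shape[axis] === 1) continue; // size-1 axes can move freely
    if (axis < lastAxis) return false;
    lastAxis = axis;
  }
  return true;
}
```

For example, swapping the first two axes of a `[2, 1, 3]` tensor only changes the shape to `[1, 2, 3]`, while doing the same to a `[2, 3, 4]` tensor really reorders elements.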
In the current implementation, the softmax axis has to be the last one,
which is an obvious limitation. This PR removes this limitation and will
fix issues #20710 and #22176.
### Description
Enables using the MLTensor to pass data between models.
### Motivation and Context
Using MLTensor instead of ArrayBuffers reduces the number of copies
between the CPU and devices as well as the renderer and GPU process in
Chromium.
### Description
InstanceNormalization computes `y = scale * (x - mean) /
sqrt(variance + epsilon) + B`, where mean and variance are computed per
instance per channel. Calculating the mean and variance per channel is a
reduction, which is NCHW-layout friendly because adjacent threads can
access contiguous data in GPU memory.
This PR optimizes both NHWC and NCHW InstanceNormalization. To calculate
the mean and variance efficiently, we need to make sure the input is
NCHW instead of NHWC, and then use shared memory to do the reduction and
obtain `channel_scale` and `channel_shift`.
With this PR, `channel_scale` and `channel_shift` are computed the same
way for NHWC and NCHW InstanceNormalization, and the overall performance
of the two is now very close.
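
The role of the two per-channel values can be shown with a small CPU sketch. It assumes the natural folding of the formula above, `channel_scale = scale / sqrt(variance + epsilon)` and `channel_shift = B - mean * channel_scale`, so that each element needs only one multiply and one add. The function name is hypothetical and the code is not taken from the shader.

```typescript
// Normalize the values of one channel of one instance.
function instanceNormChannel(x: number[], scale: number, bias: number, epsilon: number): number[] {
  // Reduction step: mean and (population) variance over the channel.
  const mean = x.reduce((sum, v) => sum + v, 0) / x.length;
  const variance = x.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / x.length;
  // Fold the statistics into two per-channel constants.
  const channelScale = scale / Math.sqrt(variance + epsilon);
  const channelShift = bias - mean * channelScale;
  // Element-wise step: y = x * channel_scale + channel_shift.
  return x.map((v) => v * channelScale + channelShift);
}
```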
The data below comes from SD Turbo profiling results.

Before (InstanceNormalization overall time: 140.84 ms)

Kernel | Time (ms)
-- | --
InstanceNormalization\|InstanceNormComputeMean | 129.70
InstanceNormalization\|InstanceNormalizationNHWC | 10.55
InstanceNormalization\|InstanceNormComputeChannelScaleShift | 0.59

After (InstanceNormalization overall time: 59.44 ms)

Kernel | Time (ms)
-- | --
InstanceNormalization\|InstanceNormComputeChannelScaleShift | 28.57
InstanceNormalization\|TransposeShared | 20.19
InstanceNormalization\|InstanceNormalizationNHWC | 10.68

Perf test data (100000 iterations):
Array: 12.599999997764826ms
String: 1.6000000014901161ms
Perf test case:
```
// Builds the function body by pushing pieces into an array and joining them.
const permFunctionBodyArray = (rank: number, input: string): string => {
  const reverseFunc: string[] = [];
  reverseFunc.push(`fn perm(i: int) -> int {
    var a: int;`);
  for (let i = 0; i < rank; ++i) {
    reverseFunc.push(input);
  }
  reverseFunc.push('return a;}');
  return reverseFunc.join('\n');
};
// Builds the same kind of body with plain string concatenation.
const permFunctionBodyString = (rank: number, input: string): string => {
  let reverseFunc = `fn perm(i: int) -> int {
    var a: int;`;
  for (let i = 0; i < rank; ++i) {
    reverseFunc += input;
  }
  reverseFunc += 'return a;}';
  return reverseFunc;
};
const count = 100000;
let start: number;
let end: number;
// Time the array-based variant.
console.time('array');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyArray(3, 'input');
}
end = performance.now();
console.timeEnd('array');
console.log('Array: ' + (end - start));
// Time the string-based variant.
console.time('string');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyString(3, 'input');
}
end = performance.now();
console.timeEnd('string');
console.log('String: ' + (end - start));
```
### Description
### Motivation and Context
This fixes issue #22031 so that the demucs model can run.
For conv-transpose, `outputPadding.length` can be 1 while `spatialRank`
is 2. The fix is to append enough 0s to `outputPadding`. For conv, the
issue is similar: `kernelShape.length` can sometimes be 1 while
`inputs[1].dims.length` is 4. The fix is likewise to append enough 0s to
`kernelShape`.
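
The padding step described above amounts to the following sketch. `appendZeros` is a hypothetical helper name; the real code may perform the same operation inline.

```typescript
// Return a copy of `values` extended with trailing zeros up to `targetLength`.
// Arrays that are already long enough are returned unchanged (as a copy).
function appendZeros(values: readonly number[], targetLength: number): number[] {
  const padded = values.slice();
  while (padded.length < targetLength) {
    padded.push(0);
  }
  return padded;
}

// A 1-D outputPadding used with a spatial rank of 2.
const outputPadding = appendZeros([1], 2); // [1, 0]
```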