Bug:https://github.com/microsoft/onnxruntime/issues/21467
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This var has been initialized to 0 in tint, so no need extra loop to do
it again:
```
float tint_symbol_52[1][4] = (float[1][4])0;
{
for(int tint_symbol_53 = 0; (tint_symbol_53 < 1); tint_symbol_53 = (tint_symbol_53 + 1)) {
{
for(int tint_symbol_54 = 0; (tint_symbol_54 < 4); tint_symbol_54 = (tint_symbol_54 + 1)) {
tint_symbol_52[min(uint(tint_symbol_53), 0u)][min(uint(tint_symbol_54), 3u)] = 0.0f;
}
}
}
}
```
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Remove explicitly concatinating pastKey with Key and pastValue with
Value.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Distribute writing-to-output work over all threads in MatMulNBits.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Perform computation in fp32 and convert finally to fp16.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The Key and Value inputs could be 4-dims
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fixed pastkey, key and pastvalue, value concatenation condition and
fixed index error. Added new test cases.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Enabled more usecases
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Improve performance using shared memory
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR supports
[DepthToSpace](https://onnx.ai/onnx/operators/onnx__DepthToSpace.html#depthtospace)
operator in webgpu backend.
### Test
We followed the steps described on [this
page](https://gist.github.com/fs-eire/a55b2c7e10a6864b9602c279b8b75dce)
to build, tested with the following commands, and confirmed that it
passed the Model and Op tests that already existed. (Probably, these
test cases were prepared in the past for WebGL backend)
```
~/onnxruntime/js/web>
% npm test -- suite0 -b=webgpu --wasm-number-threads=1 --debug
```
##### NOTE
I want to tell you that the main branch version failed 5 tests for the
resize_upsample_sizes_nearest operator.
Since I didn't touch this issue, those test cases still fail in my
branch as well.
Should I post an issue for this?
### Motivation and Context
Though the DepthToSpace operator plays a crucial role in
super-resolution domains, it was not supported in webgpu backend.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR makes a change in WebGPU backend to validate program uniforms.
It compares the uniform data that comes from the result of
`getRunData()` callback from the program info, with the `ShaderHelper`'s
maintained list of uniform variables.
Fixes a few bugs that found by this check as well.
### Description
Avoid using vec4 Matmul implementation for ConvTranspose with channel-last
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Vectorize met 2 failed cases in a CI bot with NVIDIA GPU, but we
couldn't repro with all the GPUs at hand, including NVIDIA GPUs. This PR
introduces GPUAdapterInfo and enables this opt on non-NVIDIA GPUs to
make the bots happy.
No obivous perf gain can be seen if we enable vectorize on NVIDIA.
However, it shows big perf improvement on Intel. On my Gen12 Intel GPU,
mobilenetv2-12 perf was improved from 11.14ms to 7.1ms.
### Description
For Concat operation, the zero-size input tensor shape need to be
preserved and, unlike non-zero tensors, the dims are not constrained to
match other input tensors' dims.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
1. Fix Where operator to handle Boolean input less than 4 bytes.
2. Fix JSEP test harness to use tensor names consistently.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is used in sam-h-decoder-f16.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This is required to make shape uniforms really work.
### Motivation and Context
The bug was unveiled in a model with multiple Split nodes. The later
nodes would try to reuse a previous pipeline cache, while the old shapes
were hardcoded as constants in cache.
### Description
Add MatMulNBits to support MatMul using 4-bit quantized weights
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR 1) adds LeakyRelu activation for fusedConv; 2) makes `vec4<f16>`
value work with `float32` uniforms attributes.
For example:
`clamp(value, vec4<f16>(uniforms.clip_min),
vec4<f16>(uniforms.clip_max)` will throw compilation errors since
`uniforms.clip_min` and `uniforms.clip_min` are `f32` not `f16`. So we
need to change it to `clamp(value, vec4<f16>(f16(uniforms.clip_min)),
vec4<f16>(f16(uniforms.clip_max))`
And above problem was introduced when we make activation attributes as
uniforms instead of constant.
BTW, after adding LeakyRelu, `realesrgan-t256` model can pass.
### Description
This PR expands the graph capture capability to JS EP, which is similar
to #16081. But for JS EP, we don't use the CUDA Graph, instead, we
records all gpu commands and replay them, which removes most of the cpu
overhead to avoid the the situation that gpu waiting for cpu.
mobilenetv2-12 becomes 3.7ms from 6ms on NV 3090 and becomes 3.38ms from
4.58ms on Intel A770.
All limitations are similar with CUDA EP:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
2. Usage of graph capture is limited to models where-in all ops in the
model can be partitioned to the JS EP or CPU EP and no memory copy
between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IObinding is required.
The usage is like below:
Method 1: specify outputs buffers explicitly.
```
const sessionOptions = {
executionProviders: [
{
name: "webgpu",
},
],
enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);
// prepare the inputBuffer/outputBuffer
... ...
const feeds = {
'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};
const fetches = {
'output': ort.Tensor.fromGpuBuffer(outputBuffer, { dataType: 'float32', dims: [1, 1000] })
};
let results = await session.run(feeds, fetches); // The first run will begin to capture the graph.
// update inputBuffer content
... ...
results = = await session.run(feeds, fetches); // The 2ed run and after will directly call replay to execute the graph.
... ...
session.release();
```
Method 2: Don't specify outputs buffers explicitly. Internally, when
graph capture is enabled, it will set all outputs location to
'gpu-buffer'.
```
const sessionOptions = {
executionProviders: [
{
name: "webgpu",
},
],
enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);
// prepare the inputBuffer
... ...
const feeds = {
'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};
let results = await session.run(feeds); // The first run will begin to capture the graph.
// update inputBuffer content
... ...
results = = await session.run(feeds); // The 2ed run and after will directly call replay to execute the graph.
... ...
session.release();
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->