### Description
The Key and Value inputs can be 4-dimensional.
### Description
Fixed the concatenation condition for past key/key and past value/value and
fixed an indexing error. Added new test cases.
### Description
Enabled more use cases.
### Description
Fix the test runner with optional inputs/outputs.
This change fixes the OP test runner (.jsonc format tests) with optional
input(s) and/or output(s).
This fix reveals a problem with handling optional outputs:
> Take SkipSimplifiedLayerNorm as an example:
>
> If, in the ONNX model, the node's outputs are [ 'output_0', '' ]
instead of [ 'output_0' ], the current implementation will fail. The
difference is that in the first case context.outputCount == 2, so the
TypeScript implementation will try to create a tensor for output[1]. It
will eventually call into the C++ function (OpKernelContext::Output), and
output.DataRaw() will be nullptr. The WebGPU backend will fail because it
cannot deal with a TensorView whose data == 0.
>
This problem may need to be fixed or worked around in a separate PR; this
PR does not fix it. Failing test cases are modified to work -
please note this PR does not break those test cases, as they never worked.
### Description
Improve performance using shared memory
### Description
Avoid using the vec4 MatMul implementation for ConvTranspose with channels-last format.
### Description
For the Concat operation, zero-size input tensor shapes need to be
preserved, and, unlike non-zero tensors, their dims are not constrained to
match the other input tensors' dims.
### Description
Fixes the build break introduced by #19614.
Currently the WebGL backend does not support zero-sized tensors. This change
splits the test data into 2 parts and only enables the zero-sized tensor tests
for WebGPU.
### Description
This PR allows zero-sized outputs.
To keep the implementation simple, it does not support partially
zero-sized outputs: either all outputs are zero-sized, or an
error is reported.
Added 2 tests:
- op test of `Add` with inputs T[2,0] and T[2,1], and
- test_split_zero_size_splits
### Description
<!-- Describe your changes. -->
1. Fix the Where operator to handle Boolean inputs smaller than 4 bytes.
2. Fix the JSEP test harness to use tensor names consistently.
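A minimal sketch of the byte-extraction idea behind fix 1 (buffer and helper names are illustrative, not the kernel's actual code): bool tensors store one byte per element, but WebGPU storage buffers are addressed as 32-bit words, so element `i` lives in byte `i % 4` of word `i / 4`.
```ts
// Hypothetical WGSL snippet builder: read boolean element i out of a
// u32-typed storage buffer by shifting out the byte it lives in.
const readBool = (buf: string, i: string): string =>
  `bool((${buf}[(${i}) / 4u] >> (((${i}) % 4u) * 8u)) & 0xFFu)`;

// e.g. readBool('condData', 'global_idx') ->
//   bool((condData[(global_idx) / 4u] >> (((global_idx) % 4u) * 8u)) & 0xFFu)
```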
### Description
Add MatMulNBits to support MatMul with 4-bit quantized weights.
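For illustration, a minimal TypeScript sketch of block-wise 4-bit dequantization as commonly defined for such weights (the names, the nibble packing order, and the default zero point of 8 are assumptions here, not the kernel's actual code):
```ts
// Two 4-bit values are packed per byte; each block of `blockSize` values
// along K shares one scale. Dequantization: w = (q - zeroPoint) * scale.
function dequant4Bit(packed: Uint8Array, scales: Float32Array, blockSize: number): Float32Array {
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < out.length; i++) {
    const byte = packed[i >> 1];
    const q = (i & 1) === 0 ? (byte & 0x0f) : (byte >> 4); // low nibble first (assumed)
    out[i] = (q - 8) * scales[Math.floor(i / blockSize)];  // zero point assumed 8
  }
  return out;
}
```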
### Description
This PR 1) adds LeakyRelu activation for fusedConv; 2) makes `vec4<f16>`
values work with `float32` uniform attributes.
For example, `clamp(value, vec4<f16>(uniforms.clip_min),
vec4<f16>(uniforms.clip_max))` throws compilation errors since
`uniforms.clip_min` and `uniforms.clip_max` are `f32`, not `f16`. So we
need to change it to `clamp(value, vec4<f16>(f16(uniforms.clip_min)),
vec4<f16>(f16(uniforms.clip_max)))`.
The problem above was introduced when we made the activation attributes
uniforms instead of constants.
BTW, after adding LeakyRelu, the `realesrgan-t256` model passes.
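A minimal sketch of what the added LeakyRelu branch could emit, with the same f32-to-f16 cast applied to the alpha attribute (hypothetical snippet, not the PR's exact code):
```ts
// leakyRelu(x) = x for x >= 0, alpha * x otherwise; select() picks per component.
const leakyReluSnippet =
  `value = select(vec4<f16>(f16(uniforms.alpha)) * value, value, value >= vec4<f16>(0.0));`;
```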
### Description
```math
\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=
\left\{
\begin{array}{cc}
-\frac{1-e^{-2\cdot(-x)}}{1+e^{-2\cdot(-x)}}, & x<0 \\
0, & x=0 \\
\frac{1-e^{-2x}}{1+e^{-2x}}, & x>0
\end{array}
\right.
```
### Motivation and Context
On some platforms,
$$\tanh(1000)=\frac{e^{1000}-e^{-1000}}{e^{1000}+e^{-1000}}$$ would
produce NaN instead of 0.999... or 1, because $e^{1000}$ overflows to
$\infty$ and $\frac{\infty}{\infty}$ evaluates to NaN.
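A minimal TypeScript sketch of the overflow-safe form above (illustrative only; the actual change is in the WGSL shader):
```ts
// e^{-2|x|} is always in (0, 1], so nothing overflows.
function stableTanh(x: number): number {
  if (x === 0) return 0;
  const e = Math.exp(-2 * Math.abs(x));
  const t = (1 - e) / (1 + e);
  return x > 0 ? t : -t;
}

console.log(stableTanh(1000)); // 1, instead of NaN from the naive form
```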
### Description
This PR provides a vectorized algorithm for NHWC GroupedConv to improve
performance.
The aggregate time of GroupedConv in mobilenetv2-12 goes from ~4 ms to
~1 ms on an Intel Alder Lake machine, about a 20% improvement for the whole
model.
### Description
Also update the op test suite.
### Motivation and Context
Previously the *total* size in the `Expand - last dim is not divisible
by 4` case was a multiple of 4 even though the *last dimension* was not
(e.g. a [2, 6] shape has 12 elements in total but a last dimension of 6),
so the bug had never been caught.
### Description
Add trilinear interpolation to Resize and make the `activation_params` attribute optional for FusedConv.
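For reference, a minimal TypeScript sketch of trilinear interpolation (the `fetch` accessor and coordinate names are assumptions for illustration; the actual implementation lives in the Resize shader):
```ts
// Interpolate along W, then H, then D: seven lerps over the 8 surrounding samples.
const lerp = (a: number, b: number, t: number) => a + (b - a) * t;

function trilinear(
    fetch: (d: number, h: number, w: number) => number,
    d: number, h: number, w: number,               // integer base coordinates
    fd: number, fh: number, fw: number): number {  // fractional parts in [0, 1)
  const c00 = lerp(fetch(d, h, w), fetch(d, h, w + 1), fw);
  const c01 = lerp(fetch(d, h + 1, w), fetch(d, h + 1, w + 1), fw);
  const c10 = lerp(fetch(d + 1, h, w), fetch(d + 1, h, w + 1), fw);
  const c11 = lerp(fetch(d + 1, h + 1, w), fetch(d + 1, h + 1, w + 1), fw);
  return lerp(lerp(c00, c01, fh), lerp(c10, c11, fh), fd);
}
```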
### Description
Add uniforms to Einsum.
### Motivation and Context
Improve performance.
### Description
Currently, every conv2dMatmul with inChannels = 3 and outChannels % 4 = 0
reports compilation errors. Models which include this kind of shape, like
mobilenetv2-12 and resnet50, are impacted.
The error was introduced by #18452
(https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200),
which accidentally passes `components` to `x`. But `x`'s components is
`innerElementSize`, not `components`. And when `innerElementSize` is 3,
we should use `1` in the current design.
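A minimal sketch of the selection rule described above (variable names follow the description; this is illustrative, not the exact diff):
```ts
// x is packed per innerElementSize, not per the output `components`;
// an innerElementSize of 3 has no vec3 layout here, so fall back to scalars.
function getXComponents(innerElementSize: number): number {
  return innerElementSize === 3 ? 1 : innerElementSize;
}
```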
### Description
This PR adds `BatchNormalization` with `float` support.
Some TODOs:
1. Inputs that don't all have the same data type, for example x/y in
float16 but bias/scale in float32 or double.
2. Training mode support.
We see many models using `BatchNormalization` ops. However, since it was
missing in JSEP, all of them ran on CPU, which resulted in very poor
performance. With this PR's support, the densenet-9 model goes from
250.69 ms to 20.29 ms.
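For reference, inference-mode `BatchNormalization` computes, per channel (using the ONNX input names):
```math
y = \frac{x - \mathit{input\_mean}}{\sqrt{\mathit{input\_var} + \epsilon}} \cdot \mathit{scale} + B
```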
### Description
Added uniforms to the Expand operator kernel.
### Motivation and Context
Improve performance
### Description
This is a narrow implementation of Attention/MultiHeadAttention, as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask
But it works well for StableDiffusion and can be extended later. It
reduces VRAM usage as it combines many ops into a few.
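For reference, the core computation the fused kernel implements is standard scaled dot-product attention:
```math
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
```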
I've updated the demo here: https://islamov.ai/stable-diffusion-webgpu/.
It takes ~13 s for 1 image with 20 steps on an RTX 3090 Ti and about 25 s
on an M1 Pro.
VRAM usage is about 8 GB if you don't use img2img.
Going to focus on SDXL now.
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Support uniforms in Slice op
### Motivation and Context
Improve performance.
### Description
Added FusedConv and FusedConvTranspose
### Motivation and Context
Improve performance
### Description
Currently, the uniform support has bugs when the dims rank is larger than 4.
See https://github.com/microsoft/onnxruntime/issues/17860, item 1.
So this PR only enables shape uniforms for Transpose when the shape rank
is <= 4. Otherwise, the compilation errors below are thrown:
```
1 error(s) generated while compiling the shader:
:3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead.
struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
^^^^^^^^^^^^^
:3:7 note: see layout of struct:
/* align(4) size(84) */ struct Uniforms {
/* offset( 0) align(4) size( 4) */ output_size : u32;
/* offset( 4) align(4) size(20) */ a_shape : array<u32, 5>;
/* offset(24) align(4) size(20) */ a_strides : array<u32, 5>;
/* offset(44) align(4) size(20) */ output_shape : array<u32, 5>;
/* offset(64) align(4) size(20) */ output_strides : array<u32, 5>;
/* */ };
struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
^^^^^^
:4:42 note: 'Uniforms' used in address space 'uniform' here
@group(0) @binding(2) var<uniform> uniforms: Uniforms;
^^^^^^^^
```
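For reference, the workaround the compiler message suggests (not applied by this PR, which simply disables shape uniforms for rank > 4) would be to pack each shape/strides array into 16-byte-aligned `vec4<u32>` elements, along these lines (hypothetical sketch):
```ts
// Declare array<vec4<u32>, ceil(rank / 4)> instead of array<u32, rank> so
// each uniform array element meets the 16-byte stride requirement.
const packedDecl = (name: string, rank: number): string =>
  `${name}: array<vec4<u32>, ${Math.ceil(rank / 4)}>`;

// Element i is then read back in WGSL as: uniforms.a_shape[i / 4u][i % 4u]
```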
Supported types: float, int32_t, uint32_t, bool.
The where_broadcast.jsonc case is not enabled due to
https://github.com/microsoft/onnxruntime/issues/17405.
### Description
Two contrib kernels that are supposed to speed up StableDiffusion according
to this doc:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md
However, there is no noticeable effect on speed or memory consumption. So
I guess the only way to make it faster is to implement
MultiHeadAttention, but I'm not capable of doing that right now. So I'll
focus on existing PRs and on finding the JSEP kernel that produces
incorrect results. It should be one of the old ones (I suspect Conv or
ConvTranspose), as SD has not been generating images correctly on WebGPU
since I started working on it. I hoped someone else would fix that by
the time I finished with kernels/optimizations 😅
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Add a ConvTranspose implementation using MatMul to improve performance.
### Description
In the previous implementation, one thread ran two loops over H * W
elements to calculate the `mean` and `squaredNorm` values, and also wrote
H * W output elements. As a result it is very slow when H * W is large,
and H * W usually is large in a model. For example, in the `candy-8`
model, the [H, W] shapes for the `InstanceNormalization` op are
[224,224], [112,112], and [56,56]. On my ADL (Alder Lake) machine,
`[1,224,224,32]` takes 17 ms. See below:
```
[profiling] kernel "23848328|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,224,224,32] | float32, execution time: 17007914 ns
```
This PR uses workgroup memory to optimize the original algorithm.
The advantage is that the 64 (workgroupSize) threads in one workgroup can
compute the `mean` and `squaredNorm` values in parallel. Meanwhile, each
thread only writes `H * W / workgroupSize` outputs, which greatly reduces
the per-thread overhead. With this optimization, `[1,224,224,32]` drops to
3 ms, and the main overhead is the two extra `transpose` ops; the
`createInstanceNormProgramInfo` program itself only needs `0.64` ms. See
below:
```
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] | float32, output[0]: [1,32,224,224] | float32, execution time: 1543792 ns
program-manager.ts:115
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,32,224,224] | float32, execution time: 642652 ns
program-manager.ts:115
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, output[0]: [1,224,224,32] | float32, execution time: 991608 ns
```
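A minimal sketch of the workgroup-reduction idea described above (binding and uniform names such as `x`, `y`, `scaleVal`, `biasVal`, and `uniforms.hw` are illustrative, not the PR's actual shader):
```ts
const workgroupSize = 64;
const reductionSnippet = /* wgsl */ `
  var<workgroup> sumShared: array<f32, ${workgroupSize}>;
  var<workgroup> sqShared: array<f32, ${workgroupSize}>;

  @compute @workgroup_size(${workgroupSize})
  fn main(@builtin(local_invocation_id) lid: vec3<u32>,
          @builtin(workgroup_id) wid: vec3<u32>) {
    let base = wid.x * uniforms.hw;  // one workgroup per (batch, channel) slice
    var s = 0.0;
    var sq = 0.0;
    // each thread accumulates a strided chunk of the H*W elements
    for (var i = lid.x; i < uniforms.hw; i += ${workgroupSize}u) {
      let v = x[base + i];
      s += v;
      sq += v * v;
    }
    sumShared[lid.x] = s;
    sqShared[lid.x] = sq;
    workgroupBarrier();
    // tree reduction in workgroup (shared) memory
    for (var stride = ${workgroupSize / 2}u; stride > 0u; stride = stride / 2u) {
      if (lid.x < stride) {
        sumShared[lid.x] += sumShared[lid.x + stride];
        sqShared[lid.x] += sqShared[lid.x + stride];
      }
      workgroupBarrier();
    }
    let mean = sumShared[0] / f32(uniforms.hw);
    let invStd = inverseSqrt(sqShared[0] / f32(uniforms.hw) - mean * mean + uniforms.epsilon);
    // each thread normalizes its strided chunk
    for (var i = lid.x; i < uniforms.hw; i += ${workgroupSize}u) {
      y[base + i] = (x[base + i] - mean) * invStd * scaleVal + biasVal;
    }
  }`;
```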
This PR currently only applies the new algorithm to the NCHW format. For
the NHWC format, one way is to transpose the input so that it can use the
new algorithm, but the disadvantage is that 2 extra transposes are added.
@dakenf also gives another way to optimize NHWC. Details are
[here](d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts).
I checked @dakenf's method. The perf is similar to transpose +
optimized NCHW, but on different GPUs one is a little better than the
other or vice versa. So I prefer that this PR only does the NCHW part;
@dakenf can submit his NHWC optimization.
### Description
Added Einsum operator support to JSEP.
### Description
Include support for neg.int32.
### Description
Commit fffefb1c22 (#16969) optimized
MatMul and also fixed broadcasting, so #17191 is no longer needed.
However, the newly added operator test file from that PR by @dakenf is
helpful, so pick it up and add it to enhance the tests.