Commit graph

93 commits

Author SHA1 Message Date
Satya Kumar Jandhyala
fd8ee4894d
[JS/WebGPU] GroupQueryAttention rewrite (#20946)
### Description
Implement JSEP GroupQueryAttention



### Motivation and Context
Required to enable certain LLM models to run using WebGPU.
2024-10-23 10:14:09 -07:00
Yang Gu
c75f4a09b7
[js/webgpu] Remove the limitation on axis in softmax (#22231)
In current implementation, axis in softmax has to be the last, which is
an obvious limitation. This PR removes this limitation and will fix
issues #20710 and #22176.
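
As a rough illustration of what softmax over a non-last axis means, here is a CPU reference sketch. It is not the WebGPU shader from this PR; the function name `softmaxAlongAxis` and the row-major flat layout are assumptions made for the example.

```typescript
// CPU reference sketch (hypothetical helper, not the shader in this PR):
// softmax along an arbitrary axis of a row-major tensor.
function softmaxAlongAxis(data: Float32Array, dims: number[], axis: number): Float32Array {
  const rank = dims.length;
  const a = axis < 0 ? axis + rank : axis;
  if (a < 0 || a >= rank) {
    throw new RangeError(`axis ${axis} is out of range for rank ${rank}`);
  }
  // View the tensor as [outer, n, inner], where n is the softmax axis.
  const outer = dims.slice(0, a).reduce((p, d) => p * d, 1);
  const n = dims[a];
  const inner = dims.slice(a + 1).reduce((p, d) => p * d, 1);
  const out = new Float32Array(data.length);
  for (let o = 0; o < outer; o++) {
    for (let i = 0; i < inner; i++) {
      const base = o * n * inner + i;
      // Subtract the maximum first so that exp() cannot overflow.
      let max = -Infinity;
      for (let k = 0; k < n; k++) max = Math.max(max, data[base + k * inner]);
      let sum = 0;
      for (let k = 0; k < n; k++) {
        const e = Math.exp(data[base + k * inner] - max);
        out[base + k * inner] = e;
        sum += e;
      }
      for (let k = 0; k < n; k++) out[base + k * inner] /= sum;
    }
  }
  return out;
}
```

When `axis` is the last dimension, `inner` is 1 and the loop reduces to the usual row-wise softmax.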
2024-09-30 18:27:11 -07:00
Jiajia Qin
3580e01348
[js/webgpu] Optimize grouped conv (#21892)
### Description
#21618

This PR optimizes grouped conv by 1) making memory access on the GPU
more sequential and 2) reusing input data to reduce the number of global
memory accesses.

The `Conv|GroupedConv` op in
[Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) drops
from 1058 ms to 92 ms on iGPUs with 32 EUs.

For the whole model on my iGPUs with 32 EUs:
- wav2vec2 drops from 1942 ms to 982 ms.
- squeezebert-uncased drops from 431.77 ms to 71.86 ms.


2024-09-04 17:16:35 -07:00
Jiajia Qin
a80bfed5b4
[js/webgpu] Optimize transpose (#21964)
### Description
Fix bugs in the previous implementation and route more cases to the
optimized path.

The following cases take the optimized path:
1. 2d inputs or squeezed 2d inputs
2. channels last or channels first transpose. For example, channel last
transpose: [1, 256, 512, 512] -> [1, 512, 512, 256]
For this case, the transpose becomes [256, 512x512] -> [512x512, 256]
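
The channels-last case above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the check is restricted to batch size 1 with perm `[0, 2, 3, 1]` as in the example, and the 2D transpose runs on the CPU rather than in a shader.

```typescript
// Hypothetical helper: recognise a channels-last transpose (perm [0, 2, 3, 1])
// with batch size 1 and describe it as a 2D transpose [C, H*W] -> [H*W, C].
function collapseChannelsLastTranspose(dims: number[], perm: number[]): [number, number] | undefined {
  const matches = dims.length === 4 && dims[0] === 1 && perm.join(",") === "0,2,3,1";
  return matches ? [dims[1], dims[2] * dims[3]] : undefined;
}

// Plain row-major 2D transpose: the input is rows x cols, the output is cols x rows.
function transpose2d(input: Float32Array, rows: number, cols: number): Float32Array {
  const out = new Float32Array(input.length);
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      out[c * rows + r] = input[r * cols + c];
    }
  }
  return out;
}
```

For `[1, 256, 512, 512]` the first helper returns `[256, 262144]`, so a single 2D kernel can serve the 4D case.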

### Motivation and Context
For the SD Turbo demo, the total transpose time drops from 122.09 ms to
39.98 ms, and the corresponding percentage drops from 11.05% to 3.89% in
this demo.

This PR also helps #21618: the total transpose time in that demo drops
from 70.25 ms to 17.32 ms on my iGPUs.
2024-09-04 12:04:04 -07:00
xhcao
3bfb5e4f62
[js/webgpu] support float16 for Clip (#21584)
2024-08-28 13:19:20 -07:00
Satya Kumar Jandhyala
af18824f43
[JS/WebGPU] Add GatherBlockQuantized op support (#21734)
### Description
Add GatherBlockQuantized operator to JSEP.



### Motivation and Context
Gemma model requires this.
2024-08-26 14:46:04 -07:00
Xu Xing
d9c57ac7db
[js/webgpu] Enable pad f16 uniform (#21691)
---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-08-26 07:58:48 -07:00
Jiajia Qin
27a6890529
[js/webgpu] Optimize conv1d by conv2d (#19388)
### Description

Optimize conv1d to go to the conv2d path to utilize the conv2d's
optimization path.
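
One way to picture the conv1d-to-conv2d mapping is as a pure shape rewrite. The sketch below assumes an NCHW-style layout and inserts a unit height dimension; the names `Conv1dAttributes` and `conv1dToConv2d` are hypothetical and the PR's actual choice of inserted dimension may differ.

```typescript
// Attributes of a 1D convolution; pads is [begin, end] as in ONNX.
interface Conv1dAttributes {
  strides: number[];
  dilations: number[];
  pads: number[];
}

// Hypothetical sketch: express a conv1d as a conv2d with a height of 1.
// Input [N, C, L] becomes [N, C, 1, L]; weight [M, C/group, K] becomes [M, C/group, 1, K].
function conv1dToConv2d(inputDims: number[], weightDims: number[], attrs: Conv1dAttributes) {
  return {
    inputDims: [inputDims[0], inputDims[1], 1, inputDims[2]],
    weightDims: [weightDims[0], weightDims[1], 1, weightDims[2]],
    strides: [1, attrs.strides[0]],
    dilations: [1, attrs.dilations[0]],
    // ONNX 2D pads are ordered [h_begin, w_begin, h_end, w_end].
    pads: [0, attrs.pads[0], 0, attrs.pads[1]],
  };
}
```

The output of the 2D convolution then has shape `[N, M, 1, L_out]`, which can be reshaped back to `[N, M, L_out]` without moving data.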

The whisper-tiny-encoder model drops from 532.28 ms to 158.66 ms. Conv
now goes to Conv2DMatMul (8 ms) instead of GroupedConv (382 ms).

Old profiling result:
Kernel | Time (ms) | Percentage (%)
-- | -- | --
Conv\|GroupedConv | 382.99 | 71.95
MatMul | 126.16 | 23.70
Softmax | 7.01 | 1.32
Transpose | 4.59 | 0.86
Add | 4.39 | 0.82
Mul | 2.36 | 0.44
Div | 1.44 | 0.27
ReduceMean\|ReduceMeanShared | 1.25 | 0.23
Erf | 0.85 | 0.16
Sub | 0.72 | 0.14
Pow | 0.46 | 0.09
Sqrt | 0.07 | 0.01
Sum | 532.28 |  

New profiling result with this PR:

Kernel | Time (ms) | Percentage (%)
-- | -- | --
MatMul | 127.07 | 80.09
Conv\|Conv2DMatMul | 8.00 | 5.04
Softmax | 6.95 | 4.38
Transpose | 4.65 | 2.93
Add | 4.26 | 2.68
Mul | 2.56 | 1.61
Div | 1.51 | 0.95
ReduceMean\|ReduceMeanShared | 1.31 | 0.83
Erf | 0.85 | 0.54
Sub | 0.79 | 0.50
Pow | 0.46 | 0.29
Conv\|Transpose | 0.26 | 0.17
Sqrt | 0.00 | 0.00
Sum | 158.66 |  

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-08-22 22:56:07 -07:00
Satya Kumar Jandhyala
1fb2e71ddc
[JS/WebGPU] Avoid producing presentKey/presentValue outputs if pastKey/pastValue … (#21782)
Avoid producing presentKey/presentValue outputs if pastKey/pastValue
don't exist.

2024-08-19 18:02:19 -07:00
Yang Gu
49fc168eed
[js/webgpu] Handle negative axis in op Split (#21771)
This is to fix issue #21703, where the axis is a negative value in the
model. According to the spec
(https://onnx.ai/onnx/operators/onnx__Split.html), negative axis means
counting dimensions from the back.
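
A minimal sketch of the axis normalization the spec describes, assuming a hypothetical helper name `normalizeAxis` (the PR's own helper may be named and placed differently):

```typescript
// Hypothetical helper: map an ONNX axis in [-rank, rank - 1] to [0, rank - 1].
// A negative axis counts dimensions from the back, so -1 is the last dimension.
function normalizeAxis(axis: number, rank: number): number {
  if (!Number.isInteger(axis) || axis < -rank || axis >= rank) {
    throw new RangeError(`axis ${axis} is out of range for a tensor of rank ${rank}`);
  }
  return axis < 0 ? axis + rank : axis;
}
```

For a rank-4 input, `normalizeAxis(-1, 4)` gives 3 and `normalizeAxis(-4, 4)` gives 0.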
2024-08-17 16:41:23 -07:00
Tianlei Wu
d79e3c5791
Extend Attention Bias Broadcast Support (#21710)
### Description
Previously, MultiHeadAttention supports relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supports [1, N, S, T]. This will extend the support to allow [1, N, S,
T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs.
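
To make the four accepted shapes concrete, the following sketch computes where element `(b, n, s, t)` lives in a bias tensor whose first two dimensions may be 1. It is an illustration only; `attentionBiasOffset` is a hypothetical name and this is not the CUDA or CPU kernel code.

```typescript
// Hypothetical helper: flat row-major offset into an attention bias of shape
// [B or 1, N or 1, S, T]. A dimension of size 1 is broadcast, so its index is 0.
function attentionBiasOffset(
  biasDims: [number, number, number, number],
  b: number,
  n: number,
  s: number,
  t: number,
): number {
  const [biasBatch, biasHeads, seqLen, totalSeqLen] = biasDims;
  const bi = biasBatch === 1 ? 0 : b;
  const ni = biasHeads === 1 ? 0 : n;
  return ((bi * biasHeads + ni) * seqLen + s) * totalSeqLen + t;
}
```

With a `[1, 1, S, T]` bias every batch and head reads the same `S x T` slice, while a `[B, 1, S, T]` bias shares one slice across all heads of a batch.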

- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script

Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
2024-08-16 15:40:04 -07:00
Yulong Wang
ef2ccc477b
[js/web] Add support for int4/uint4 tensor (#21720)
### Description
Add support for int4/uint4 tensor.
2024-08-15 21:32:10 -07:00
Xu Xing
7172aff1cf
[js/webgpu] Fix max pool shape end with 0 (#21698)
Bug: https://github.com/microsoft/onnxruntime/issues/21386

2024-08-13 20:59:24 -07:00
Satya Kumar Jandhyala
51b2044120
[JS/WebGPU] Add Dequantizelinear operator (#21642)
### Description
Added DequantizeLinear operator for JSEP.



2024-08-09 14:44:19 -07:00
Yulong Wang
5e66fcc703
[js/web] allow op test to use f16 type for inputs/outputs (#21664)
### Description
allow op test to use f16 type for inputs/outputs.

This PR introduces "@petamoriken/float16" as Float16Array polyfill but
restricts it to be only used for test runner.
2024-08-08 09:56:37 -07:00
Xu Xing
0d7cf301a1
[js/webgpu] Add activation Tanh (#21540)
Bug:https://github.com/microsoft/onnxruntime/issues/21467

2024-07-29 11:05:34 -07:00
Xu Xing
5bc12bf209
[js/webgpu] Add activation for conv3d naive (#21466)
2024-07-29 08:47:41 -07:00
Xu Xing
c3076721f3
[js/webgpu] Support conv3d naive (#20706)
2024-06-19 10:13:50 -07:00
Guenther Schmuelling
c749bd997a
webgpu quickgelu (#20939)
2024-06-06 08:21:33 -07:00
Satya Kumar Jandhyala
bab5037eab
Eliminate explicit Concat operations in Attention (#20556)
### Description
Remove the explicit concatenation of pastKey with Key and pastValue with
Value.



2024-05-24 09:07:57 -07:00
Xu Xing
f1fef19b6e
[js/webgpu] Support shared memory for transpose 2d (#19267)
For a 1024x1024 input: 18.7 ms without shared memory, 13.2 ms with shared memory.
2024-05-22 08:15:44 -07:00
Xu Xing
8c59cd4fce
[js/webgpu] Support GroupQueryAttention (#20237)
TODOs:
1. Handle H * params.kvNumHeads greater than work group size limit.
2. Support BNSH kv cache.
2024-05-13 09:43:37 -07:00
Satya Kumar Jandhyala
21b3cbc3af
[WIP][JS/WebGPU] Inputs Key and Value could be 4-dims. (#20470)
### Description
The Key and Value inputs could be 4-dims


2024-04-25 13:33:46 -07:00
Satya Kumar Jandhyala
ae78cdb5d7
[JS/WebGPU] MultiheadAttention bugfix (#20447)
### Description
Fixed pastkey, key and pastvalue, value concatenation condition and
fixed index error. Added new test cases.



2024-04-24 08:43:14 -07:00
Satya Kumar Jandhyala
d42ac7f0c6
[JS/WebGPU] Multihead attention improvements (#20286)
### Description
Enabled more use cases



2024-04-23 12:39:49 -07:00
Yulong Wang
4385602386
[js/web] fix test runner with optional input/output (#20399)
### Description
fix test runner with optional input/output.

This change fixes the OP test runner (.jsonc format test) with optional
input(s) and/or output(s).

This fix reveals a problem in handling optional outputs:

> Take SkipSimplifiedLayerNorm as example: 
>
> if in the ONNX model, the node's outputs are: [ 'output_0', '' ]
instead of [ 'output_0' ], the current implementation will fail. The
difference is, in the first case, context.outputCount == 2, and then the
typescript implementation will try to create a tensor for output[1]. It
will eventually call to C++ function (OpKernelContext::Output), and the
output.DataRaw() will be nullptr. WebGPU backend will fail because it
cannot deal with a TensorView with data == 0.
>

This problem may need to be fixed or worked around in a separate PR; this
PR does not fix it. The failing test cases are modified to pass. Note
that this PR does not break those test cases, as they never worked.
2024-04-22 12:53:10 -07:00
Satya Kumar Jandhyala
b33216be4c
[JS/WebGPU] Improve MatMulNBits perf (#19974)
### Description
Improve performance using shared memory


2024-04-12 11:03:05 -07:00
Yulong Wang
50bd4571ac
[js/web] support SimplifiedLayerNorm and SkipSimplifiedLayerNorm (#20277)
### Description
Support operator `SimplifiedLayerNorm` and `SkipSimplifiedLayerNorm` for
WebGPU backend.
2024-04-11 14:08:50 -07:00
Jiajie Hu
23d3afd4fe
[js/webgpu] Implement com.microsoft.RotaryEmbedding (#20209)
### Description

https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftrotaryembedding

### Motivation and Context
As per customer request, this helps Phi-2 and Gemma.
2024-04-08 09:11:26 -07:00
Satya Kumar Jandhyala
5b64d7c32b
[JS/WebGPU] Use non-matmul implementation for ConvTranspose in channel-first case. (#20022)
### Description
Avoid using vec4 Matmul implementation for ConvTranspose with channel-last



2024-03-23 11:19:14 -07:00
Xu Xing
4c6a6a37f7
[js/webgpu] Fix NAN caused by un-initialized buffer in instance-norm (#19387)
Without this fix, the added test case produces NaN because of the uninitialized buffer.
2024-03-18 22:59:32 -07:00
Satya Kumar Jandhyala
ed250b88c3
[JS/WebGPU] Optimize MatMulNBits (#19852)
### Description
Use vec<2> or vec<4> operands in MatMulNBits


### Motivation and Context
Improve performance
2024-03-13 10:33:14 -07:00
Satya Kumar Jandhyala
24b72d2613
[JS/WebGPU] Preserve zero size input tensor dims. (#19737)
### Description
For the Concat operation, the shape of a zero-size input tensor needs to
be preserved and, unlike non-zero tensors, its dims are not constrained to
match the other input tensors' dims.



2024-03-07 19:07:49 -08:00
Yulong Wang
0edb035808
[js/web] fix suite test list for zero sized tensor (#19638)
### Description

Fixes build break brought by #19614

Currently the WebGL backend does not support zero-sized tensors. This
change splits the test data into 2 parts and only enables zero-sized
tensor tests for WebGPU.
2024-02-24 10:09:07 -08:00
Yulong Wang
aec2389ad0
[js/webgpu] allows a ProgramInfo's RunData to use zero sized output (#19614)
### Description
This PR allows zero-sized output.

To keep the implementation simple, it does not support partially
zero-sized outputs: either all outputs are zero-sized, or an error is
reported.
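
The all-or-nothing rule can be sketched as a small check run before dispatch. The function name `allOutputsZeroSized` and the error message are assumptions for illustration, not the code in this PR.

```typescript
// Hypothetical check: returns true when every output is zero-sized (so the
// GPU dispatch can be skipped), false when none is, and throws on a mix.
function allOutputsZeroSized(outputDims: number[][]): boolean {
  // A tensor has zero elements exactly when at least one of its dims is 0.
  const zeroCount = outputDims.filter((dims) => dims.some((d) => d === 0)).length;
  if (zeroCount === 0) {
    return false;
  }
  if (zeroCount !== outputDims.length) {
    throw new Error("partially zero-sized outputs are not supported");
  }
  return true;
}
```

For the `Add` test mentioned in this PR, inputs `T[2,0]` and `T[2,1]` broadcast to a single `[2, 0]` output, so the check returns `true`.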

added 2 tests:
 - op test of `Add` with input T[2,0] T[2,1], and
 - test_split_zero_size_splits
2024-02-23 12:52:47 -08:00
satyajandhyala
ae3d73c981
[JS/WebGPU] Fix Split and Where to handle corner cases. (#19613)
### Description
1. Fix Where operator to handle Boolean input less than 4 bytes.
2. Fix JSEP test harness to use tensor names consistently.


2024-02-23 00:21:15 -08:00
satyajandhyala
dfeda9019c
[JS/WebGPU] Add MatMulNBits (#19446)
### Description
Add MatMulNBits to support MatMul using 4-bit quantized weights
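
As a rough picture of what 4-bit quantized weights involve, the sketch below unpacks two 4-bit values per byte and applies a per-block scale. The packing order (low nibble first), the default zero point of 8, and the name `dequantize4Bit` are assumptions for this example and should be checked against the operator documentation.

```typescript
// Hypothetical sketch: dequantize 4-bit weights packed two per byte.
// Assumes the low nibble holds the first value and one scale per block.
function dequantize4Bit(
  packed: Uint8Array,
  scales: Float32Array,
  blockSize: number,
  zeroPoint = 8,
): Float32Array {
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < packed.length; i++) {
    const low = packed[i] & 0x0f;
    const high = packed[i] >> 4;
    // Each group of blockSize consecutive values shares one scale.
    out[2 * i] = (low - zeroPoint) * scales[Math.floor((2 * i) / blockSize)];
    out[2 * i + 1] = (high - zeroPoint) * scales[Math.floor((2 * i + 1) / blockSize)];
  }
  return out;
}
```

A kernel would normally fuse this step into the matrix multiplication instead of materializing the dequantized weights.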



2024-02-17 09:19:17 -08:00
Yulong Wang
5ff27ef02a
[js/webgpu] support customop FastGelu (#19392)
### Description
Support WebGPU custom operator FastGelu.
2024-02-06 09:07:31 -08:00
Jiajia Qin
ccbe264a39
[js/webgpu] Add LeakyRelu activation for fusedConv (#19369)
### Description
This PR 1) adds LeakyRelu activation for fusedConv; 2) makes `vec4<f16>`
value work with `float32` uniforms attributes.

For example:
`clamp(value, vec4<f16>(uniforms.clip_min),
vec4<f16>(uniforms.clip_max))` will throw compilation errors since
`uniforms.clip_min` and `uniforms.clip_max` are `f32`, not `f16`. So we
need to change it to `clamp(value, vec4<f16>(f16(uniforms.clip_min)),
vec4<f16>(f16(uniforms.clip_max)))`.

This problem was introduced when the activation attributes were changed
from constants to uniforms.
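
The cast described above could be generated along these lines. This is a sketch only: `clipExpression` is a hypothetical helper, and it assumes the uniforms are always declared as `f32` and are named `clip_min` and `clip_max` as in the example.

```typescript
// Hypothetical helper: build the WGSL clamp expression for a value of the
// given type, casting the f32 uniforms to the element type when it is f16.
function clipExpression(valueType: string, elementType: "f32" | "f16"): string {
  const cast = (uniformName: string): string =>
    elementType === "f32"
      ? `${valueType}(uniforms.${uniformName})`
      : `${valueType}(${elementType}(uniforms.${uniformName}))`;
  return `clamp(value, ${cast("clip_min")}, ${cast("clip_max")})`;
}
```

Calling `clipExpression("vec4<f16>", "f16")` yields the corrected expression quoted above, while the `f32` case keeps the shorter form.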

BTW, after adding LeakyRelu, `realesrgan-t256` model can pass.
2024-02-02 09:06:38 -08:00
Jiajia Qin
90883a366a
[js/webgpu] Add hardSigmoid activation for fusedConv (#19233)
### Description
Add hardSigmoid activation for fusedConv. It will be used by
mobilenetv3-small-100 model.
2024-01-30 16:28:53 -08:00
Jiajie Hu
5b06505073
[js/webgpu] Fix Tanh explosion (#19201)
### Description
```math
\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=
\left\{
\begin{array}{cc}
-\frac{1-e^{-2\cdot(-x)}}{1+e^{-2\cdot(-x)}}, & x<0 \\
0, & x=0 \\
\frac{1-e^{-2x}}{1+e^{-2x}}, & x>0
\end{array}
\right.
```

### Motivation and Context
On some platforms,
$$\tanh(1000)=\frac{e^{1000}-e^{-1000}}{e^{1000}+e^{-1000}}$$ would
produce NaN instead of 0.999... or 1 (imagine $e^{1000}=\infty$ and
$\frac{\infty}{\infty}$ explodes).
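
The piecewise form in the description can be checked with a small CPU sketch. The function names are hypothetical, and JavaScript doubles overflow much later than the `f32` values used in a shader, so 1000 is used here as in the example above.

```typescript
// Naive form: both exponentials are evaluated directly, so a large |x|
// turns the quotient into Infinity / Infinity, which is NaN.
function naiveTanh(x: number): number {
  const ePos = Math.exp(x);
  const eNeg = Math.exp(-x);
  return (ePos - eNeg) / (ePos + eNeg);
}

// Stable form from the description: only exp(-2|x|) is evaluated, which
// lies in (0, 1] and therefore never overflows.
function stableTanh(x: number): number {
  if (x === 0) {
    return 0;
  }
  const e = Math.exp(-2 * Math.abs(x));
  return (Math.sign(x) * (1 - e)) / (1 + e);
}
```

Here `naiveTanh(1000)` evaluates to `NaN`, whereas `stableTanh(1000)` evaluates to 1.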
2024-01-25 08:25:35 -08:00
Xu Xing
61610ff986
[js/webgpu] Add FusedConv clip test case (#18900)
Bug: https://github.com/microsoft/onnxruntime/issues/18899
2024-01-23 08:25:05 -08:00
Jiajia Qin
fd6bab4250
[js/webgpu] Provide a vectorized algorithm for GroupedConv (#18884)
### Description
This PR provides a vectorized algorithm for NHWC GroupedConv to improve
performance.

The aggregate time of GroupedConv in mobilenetv2-12 drops from ~4 ms to
~1 ms on an Intel Alder Lake machine, about a 20% improvement for the
whole model.
2024-01-10 16:12:43 -08:00
Xu Xing
76dfe5347c
[js/webgpu] Support uniforms for instance-norm (#18929)
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
2024-01-09 14:56:00 -08:00
Jiajie Hu
447a3a7c70
[js/webgpu] Fix Expand/Gather when input type is bool (#18999)
### Description
Also update the op test suite.

### Motivation and Context
Previously the *total* size in case `Expand - last dim is not divisible
by 4` was a multiple of 4, even though the *last dimension* was not, so
the bug has never been caught.
2024-01-05 08:16:15 -08:00
satyajandhyala
780fc3611b
[JS/Web] Sajandhy/webgpu resize scales rank check (#18954)
2023-12-29 09:23:27 -08:00
satyajandhyala
3bbe4fe2ff
[JS/WebGPU] Add trilinear interpolation to Resize; activation_params attribute is optional for FusedConv also. (#18842)
### Description
Add trilinear interpolation to Resize and make the activation_params attribute optional for FusedConv.



2023-12-27 16:21:29 -08:00
Jiajia Qin
b4be9e1bbb
[js/webgpu] Fix shader compilation errors in cumsum (#18779)
### Description
This PR fixes the following shader compilation error:
```
Tint WGSL reader failure: :39:31 error: no matching overload for operator + (f32, i32)

5 candidate operators:
  operator + (T, T) -> T  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (vecN<T>, T) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (T, vecN<T>) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (vecN<T>, vecN<T>) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (matNxM<T>, matNxM<T>) -> matNxM<T>  where: T is abstract-float, f32 or f16

                    sum = sum + get_inputByIndices(inputIndices);
                              ^


 - While validating [ShaderModuleDescriptor "CumSum"]
 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "CumSum"]).
```
2023-12-11 18:11:38 -08:00
Xu Xing
f949e0580b
[js/webgpu] Support uniforms for pool (#18656)
2023-12-05 07:54:30 -08:00
satyajandhyala
10c547516d
[JS/Web] Added CumSum operator to JSEP (#18637)
### Description
Added CumSum operator



### Motivation and Context
Reduce CPU <->GPU data movement.
2023-12-05 07:51:53 -08:00