Commit graph

55 commits

Author SHA1 Message Date
Xu Xing
c19617a24a
[js/webgpu] Add GatherND (#22847)
2024-12-04 09:57:32 -08:00
Xu Xing
ff57ac4f3d
[js/webgpu] Add scatterND (#22755)
2024-11-13 09:13:00 -08:00
xhcao
b5ee4ac760
[js/webgpu] support GridSample operator (#22652)
2024-11-08 11:02:36 -08:00
Prathik Rao
5cc7fb4a74
[JSEP] Upgrade to ONNX Opset 21 (#22595)
### JSEP Ops that need updating

- [x] Cast
- [x] ReduceMax
- [x] ReduceMin
- [x] Squeeze
- [x] Unsqueeze
- [x] Transpose
- [x] AveragePool
- [x] Flatten
- [x] Pad
- [x] If
2024-10-29 17:44:38 -07:00
Jiajia Qin
252222034f
[js/webgpu] Support Reshape/Shape 21+ on jsep (#21871)
### Description
#21618

With this PR, the cross-device copying (`MemcpyToHost`) can be removed
entirely for the `wav2vec2` model, and the overall time drops from
604 ms to 48 ms.

2024-08-27 09:02:39 -07:00
Satya Kumar Jandhyala
af18824f43
[JS/WebGPU] Add GatherBlockQuantized op support (#21734)
### Description
Add GatherBlockQuantized operator to JSEP.



### Motivation and Context
Gemma model requires this.
2024-08-26 14:46:04 -07:00
Guenther Schmuelling
d82f15d0e3
add Gelu opset-20 to webgpu (#21725)
https://github.com/microsoft/onnxruntime/issues/21618
2024-08-14 09:45:05 -07:00
Satya Kumar Jandhyala
51b2044120
[JS/WebGPU] Add Dequantizelinear operator (#21642)
### Description
Added DequantizeLinear operator for JSEP.
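As a reference for the operator's semantics (not the JSEP shader itself), DequantizeLinear maps quantized integers back to floats as `y = (x - zero_point) * scale`; a minimal scalar sketch with a hypothetical helper name:

```typescript
// DequantizeLinear semantics, applied elementwise:
//   y = (x - zero_point) * scale
// Minimal scalar sketch; the real kernel runs as a WGSL shader on the GPU.
function dequantizeLinear(x: number[], scale: number, zeroPoint: number): number[] {
  return x.map((q) => (q - zeroPoint) * scale);
}
```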



2024-08-09 14:44:19 -07:00
Guenther Schmuelling
9eb1c2a7a3
support for layernorm in webgpu pre opset-17 (#21121)
Handled the same way the CPU EP does.
2024-06-27 10:20:48 -07:00
Guenther Schmuelling
c749bd997a
webgpu quickgelu (#20939) 2024-06-06 08:21:33 -07:00
Xu Xing
8c59cd4fce
[js/webgpu] Support GroupQueryAttention (#20237)
TODOs:
1. Handle H * params.kvNumHeads greater than the workgroup size limit.
2. Support BNSH KV cache.
2024-05-13 09:43:37 -07:00
Yulong Wang
50bd4571ac
[js/web] support SimplifiedLayerNorm and SkipSimplifiedLayerNorm (#20277)
### Description
Support operator `SimplifiedLayerNorm` and `SkipSimplifiedLayerNorm` for
WebGPU backend.
2024-04-11 14:08:50 -07:00
MasayoshiTsutsui
6a9d8a9030
[js/webgpu] implement DepthToSpace operator in webgpu (#19948)
### Description
This PR supports
[DepthToSpace](https://onnx.ai/onnx/operators/onnx__DepthToSpace.html#depthtospace)
operator in webgpu backend.
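To illustrate the index mapping, here is a plain-TypeScript sketch of the default DCR mode (names are illustrative; the actual kernel is a WGSL shader):

```typescript
// DepthToSpace (DCR mode): [N, C, H, W] -> [N, C/(b*b), H*b, W*b].
// Output pixel (c, h*b+i, w*b+j) comes from input channel (i*b+j)*newC + c.
function depthToSpaceDCR(x: number[], n: number, c: number, h: number, w: number, b: number): number[] {
  const oc = c / (b * b), oh = h * b, ow = w * b;
  const out = new Array<number>(x.length);
  for (let ni = 0; ni < n; ni++)
    for (let ci = 0; ci < oc; ci++)
      for (let hi = 0; hi < oh; hi++)
        for (let wi = 0; wi < ow; wi++) {
          const i = hi % b, j = wi % b;       // position inside the b*b block
          const srcC = (i * b + j) * oc + ci; // source channel
          const src = ((ni * c + srcC) * h + ((hi / b) | 0)) * w + ((wi / b) | 0);
          out[((ni * oc + ci) * oh + hi) * ow + wi] = x[src];
        }
  return out;
}
```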


### Test
We followed the steps described on [this
page](https://gist.github.com/fs-eire/a55b2c7e10a6864b9602c279b8b75dce)
to build, tested with the following commands, and confirmed that the
pre-existing Model and Op tests pass. (These test cases were probably
prepared in the past for the WebGL backend.)
```
~/onnxruntime/js/web>
% npm test -- suite0 -b=webgpu --wasm-number-threads=1 --debug   
```
##### NOTE
The main branch already fails 5 tests for the
resize_upsample_sizes_nearest operator.
Since I didn't touch that code, those test cases still fail in my
branch as well.
Should I file an issue for this?


### Motivation and Context
Though the DepthToSpace operator plays a crucial role in
super-resolution domains, it was not supported in webgpu backend.
2024-04-10 12:13:46 -07:00
Jiajie Hu
23d3afd4fe
[js/webgpu] Implement com.microsoft.RotaryEmbedding (#20209)
### Description

https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftrotaryembedding

### Motivation and Context
As per customer request, this helps Phi-2 and Gemma.
2024-04-08 09:11:26 -07:00
satyajandhyala
dfeda9019c
[JS/WebGPU] Add MatMulNBits (#19446)
### Description
Add MatMulNBits to support MatMul using 4-bit quantized weights
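A sketch of the weight decoding step only, assuming low-nibble-first packing and the default zero point of 8 (names are illustrative; the real kernel fuses dequantization into the MatMul shader):

```typescript
// Unpack 4-bit quantized weights (two per byte) and dequantize one block:
//   w = (q - zero_point) * scale
// Assumes low-nibble-first packing and the default zero point of 8.
function dequantize4bitBlock(packed: Uint8Array, scale: number, zeroPoint = 8): number[] {
  const out: number[] = [];
  for (const byte of packed) {
    out.push(((byte & 0x0f) - zeroPoint) * scale); // low nibble first
    out.push(((byte >> 4) - zeroPoint) * scale);   // then high nibble
  }
  return out;
}
```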



2024-02-17 09:19:17 -08:00
Yulong Wang
5ff27ef02a
[js/webgpu] support customop FastGelu (#19392)
### Description
Support WebGPU custom operator FastGelu.
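For reference, FastGelu is commonly the tanh approximation of GELU; a scalar sketch (the JSEP kernel implements the equivalent in WGSL):

```typescript
// FastGelu: tanh approximation of GELU,
//   y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
function fastGelu(x: number): number {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));
}
```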
2024-02-06 09:07:31 -08:00
Jiajia Qin
2e0a388c36
[js/webgpu] Add HardSigmoid support (#19215)
### Description
This op is required by mobilenetv3-small-100. With this PR, that
model's inference time on ADL drops from over 100 ms to under 10 ms.
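HardSigmoid itself is a cheap piecewise-linear approximation of the sigmoid; a scalar sketch using the ONNX defaults alpha = 0.2, beta = 0.5 (hypothetical helper, not the JSEP shader):

```typescript
// HardSigmoid: y = max(0, min(1, alpha * x + beta)).
// ONNX defaults: alpha = 0.2, beta = 0.5.
function hardSigmoid(x: number, alpha = 0.2, beta = 0.5): number {
  return Math.max(0, Math.min(1, alpha * x + beta));
}
```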
2024-01-22 15:53:26 -08:00
satyajandhyala
10c547516d
[JS/Web] Added CumSum operator to JSEP (#18637)
### Description
Added CumSum operator
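CumSum computes a running sum along one axis, with `exclusive` and `reverse` attributes; a 1-D sketch (hypothetical helper, not the JSEP shader):

```typescript
// CumSum over a 1-D array. `exclusive` excludes the current element from
// its own sum; `reverse` accumulates from the end of the axis.
function cumsum(x: number[], exclusive = false, reverse = false): number[] {
  const out = new Array<number>(x.length);
  let acc = 0;
  const idx = [...x.keys()];
  if (reverse) idx.reverse();
  for (const i of idx) {
    out[i] = exclusive ? acc : acc + x[i];
    acc += x[i];
  }
  return out;
}
```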



### Motivation and Context
Reduce CPU <->GPU data movement.
2023-12-05 07:51:53 -08:00
Jiajia Qin
64dacc2892
[js/webgpu] Add BatchNormalization Op (#18468)
### Description
This PR adds `BatchNormalization` with `float` support.

Some TODOs:
1. Support inputs with mixed data types; for example, x/y may be float16
while bias/scale is float32 or double.
2. Support training mode.

Many models use `BatchNormalization` ops. However, due to the missing
JSEP support, all of them run on the CPU, which results in very poor
performance. With this PR, the densenet-9 model's time drops from
250.69 ms to 20.29 ms.
2023-11-22 15:58:06 -08:00
Arthur Islamov
fac3e33da5
[js/web] JSEP Attention & MultiHeadAttention (#17742)
### Description
This is a narrow implementation of Attention/MultiHeadAttention as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask

But it works well for StableDiffusion and can be extended later. It
reduces VRAM usage as it combines many ops into a few.
I've updated the demo here: https://islamov.ai/stable-diffusion-webgpu/
It takes ~13 s for one image with 20 steps on an RTX 3090 Ti and about
25 s on an M1 Pro. VRAM usage is about 8 GB if you don't use img2img.

Going to focus on SDXL now

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-11-17 12:23:52 -08:00
Scott McKay
4f2096be38
Update XNNPACK to latest version (#18038)
### Description
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well

Most code updates in the XNNPACK EP adjust to the new XNNPACK API:
- 'setup' is split into 'reshape' and 'setup'
- some ops use a workspace buffer
  - copied workspace allocation from XNNPACK unit test code
- some suffixes changed

Added a wrapper for XNNPACK caches to the base XNNPACK EP kernel:
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
isn't currently usable via the public API
- we could use the internal types if we think it's required for
performance reasons; non-trivial though, as we'd need to propagate ifdef
values from the XNNPACK build up to the ORT build
- using XNNPACK internals would also mean we could not support using a
pre-built XNNPACK package
  - not an issue currently
  
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed

Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario

### Motivation and Context
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.

NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
2023-11-03 09:04:28 -07:00
satyajandhyala
a2e9ba72d5
[JS/Web]Added FusedConv. (#17766)
### Description
Added FusedConv and FusedConvTranspose



### Motivation and Context
Improve performance
2023-11-01 15:34:51 -07:00
Xu Xing
992f3e4609
[js/webgpu] Support where (#17544)
Supported types: float, int32_t, uint32_t, bool.
The where_broadcast.jsonc case is not enabled due to
https://github.com/microsoft/onnxruntime/issues/17405.

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-03 14:28:21 -07:00
Arthur Islamov
d0519a7603
[js/web] BiasSplitGelu and BiasAdd kernels (#17161)
### Description
Two contrib kernels that are supposed to speed up StableDiffusion
according to this doc:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

However, there is no noticeable effect on speed or memory consumption,
so I guess the only way to make it faster is to implement
MultiHeadAttention, which I'm not capable of doing right now. I'll
focus on existing PRs and on finding the JSEP kernel that produces
incorrect results. It should be one of the old ones (I suspect Conv or
ConvTranspose), as SD has not been generating images correctly on
WebGPU since I started working on it. I hoped someone else would fix
that by the time I finished with kernels/optimizations 😅

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-03 12:20:20 -07:00
xhcao
0d60604638
[JS/WebGPU] support Range operator (#17233)
The patch also introduces a method that copies
data from GPU to CPU synchronously.

2023-09-30 02:05:32 -07:00
Hariharan Seshadri
460f17fbb8
[JS/WebGPU] Support If on WebGPU (#17478) 2023-09-19 12:20:18 -07:00
xhcao
198d468849
[WebGPU/JS] Added Pad operator support (#16928)
2023-09-14 13:14:11 -07:00
satyajandhyala
bf6d6961cc
[JS/Web] Added Einsum operator support. (#17401)
### Description
Added Einsum operator support to JSEP.



2023-09-11 15:57:15 -07:00
xhcao
9017ea131b
[js/webgpu] support GreaterOrEqual and LessOrEqual operators (#17310)
2023-09-07 17:41:16 -07:00
Hariharan Seshadri
cbd97515cd
[JS/WebGPU] Support GatherElements kernel (#17243)
### Description
As title


### Motivation and Context
Improve WebGPU kernel coverage
2023-08-28 09:55:25 -07:00
Yulong Wang
bb1871332f
[js/webgpu] add kernel Not and Equal (#17306)
### Description
This PR adds kernel implementations for the operators "Not" and
"Equal", and removes the download cache in the GPU data manager.

**Why remove the download cache**
The following test case failed. ("Or" is on CPU; "Greater" and "Equal"
are on JSEP.)

![image](https://github.com/microsoft/onnxruntime/assets/7679871/8d9798ad-2703-4fb9-907e-ff716c67d0b2)

After debugging, I found that both "Equal" and "Greater" were using the
same output GPU Data ID. This is because when ORT executes the graph, it
first runs "Equal", allowing its shader to write into GPU Data ID 2;
then a GPU-to-CPU copy for it is issued (because "Or" currently runs on
the CPU EP); at this point, ORT thinks GPU Data ID 2 is free to use, so
it reuses it as the output for "Greater". This means there is no
allocation for the output of the "Greater" kernel, and both kernels
write to GPU Data ID 2.

For the GPU data manager, there will be two downloads from the same GPU
buffer. Previously I thought this was a waste of resources, so I cached
the data. But now it shows that we need to perform both downloads
because the GPU data has already changed. The download data cache
should therefore be removed.


2023-08-27 19:50:17 -07:00
xhcao
5e8d94cec8
[js/webgpu] support Greater and Less operators (#17296)
2023-08-25 12:11:25 -07:00
Arthur Islamov
5842144d98
[js/web] JSEP Gemm for opset 13 (#16936)
### Description
Added JSEP Gemm registration for opset 13. It was falling back to the
CPU provider, since the CPU EP registers Gemm for opset 13.

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
2023-08-22 18:13:20 -07:00
Yulong Wang
6fc3fd9ece
[js/webgpu] support Cast operator (#16489)
### Description
support `Cast` operator for webgpu backend.

The Cast operator for the webgpu backend currently only supports f32,
u32, i32 and bool.
2023-08-18 23:51:03 -07:00
Hariharan Seshadri
a476dbf430
[JS/WebGPU] Support Tile operator (#17123)
### Description
As title

### Motivation and Context
Improve WebGPU op coverage
2023-08-18 10:07:21 -07:00
satyajandhyala
7d1a5635a0
[JS/Web] Added SkipLayerNormalization operator. (#17102)
### Description
Add SkipLayerNormalization operator to JSEP.



2023-08-18 09:59:03 -07:00
xhcao
24e0bd37b4
[JS/WebGPU] Support Log operator (#17045)
2023-08-14 18:04:12 -07:00
Guenther Schmuelling
9204cd7392
[js/webgpu] Add C++ registration for operator Tanh in JSEP (#17124)
add webgpu/tanh

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-08-12 11:43:39 -07:00
satyajandhyala
e8a9d4f04d
[JS/Web] Fix Resize kMSInternalNHWCDomain (#17023)
### Description
Fix some failing Resize tests.




---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-08-10 09:14:43 -07:00
Arthur Islamov
c3f04251c7
[js/web] JSEP LayerNormalization and InstanceNormalization kernels (#16830)
### Description
Added two kernels for Layer and Instance norm

Also added a maximum limit for `maxBufferSize` when requesting the GPU
device: by default it's limited to 256 MB, and allocating a 600 MB
buffer fails while running fp32 StableDiffusion weights.


### Motivation and Context
These two are used in StableDiffusion and many other networks
2023-08-08 09:09:37 -07:00
Arthur Islamov
ea55700e1c
[js/web] JSEP Gather OP (#16855)
### Description
Added a Gather op that works with both i32 and i64 indices, assuming
that index values fall within the i32 limit. The assumption is safe
because it's not possible to allocate more than a 2 GB buffer for
inputs.

It treats all data from the input tensor as u32, copying one element
for 32-bit types or two elements for i64, u64 and double.
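The i64-as-u32-pairs trick above can be sketched in plain TypeScript (illustrative only; the kernel does the equivalent in WGSL, and this assumes little-endian typed arrays and index values that fit in i32):

```typescript
// Read i64 indices through a u32 view, keeping only the low 32 bits of
// each little-endian pair -- safe while the index values fit in i32.
const idx64 = new BigInt64Array([0n, 2n, 1n]);
const u32 = new Uint32Array(idx64.buffer);
const indices = Array.from({ length: idx64.length }, (_, i) => u32[2 * i] | 0);

// Gather along axis 0 of a 1-D data tensor using the decoded indices.
const data = [10, 20, 30];
const gathered = indices.map((i) => data[i]);
```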

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
2023-08-03 14:09:37 -07:00
Guenther Schmuelling
0df2e14038
js/webgpu: argmax,argmin,softmax support (#16882)
argmax and argmin are similar to reduce. Eventually we need to add
optimized flavors of the shader.

softmax is optimized but only works on the last axis for now, which
should be the common use case.

TODO: enable more unit tests for argmax/argmin.
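A last-axis softmax can be sketched as follows (plain TypeScript with max subtraction for numerical stability; the optimized JSEP version is a WGSL shader, and the helper name is hypothetical):

```typescript
// Softmax over the last axis of a flattened tensor, row by row.
// Subtracting each row's max before exponentiating keeps exp() from overflowing.
function softmaxLastAxis(x: number[], lastDim: number): number[] {
  const out = new Array<number>(x.length);
  for (let base = 0; base < x.length; base += lastDim) {
    const row = x.slice(base, base + lastDim);
    const max = Math.max(...row);
    const exps = row.map((v) => Math.exp(v - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    exps.forEach((e, i) => (out[base + i] = e / sum));
  }
  return out;
}
```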
2023-08-02 18:16:19 -07:00
satyajandhyala
d399648869
[JS/Web] Added Resize kMSInternalNHWCDomain domain registration. (#16946)
### Description
Added Resize NHWC domain kernel registration.



2023-08-02 14:16:21 -07:00
satyajandhyala
77b2b618b2
[JS/WebGPU] Add Resize operator (#16680)
### Description
Implemented Resize operator support in JSEP



2023-07-31 09:35:06 -07:00
satyajandhyala
dd24d52737
[JS/Web] Added Gelu contrib operator support to JSEP (#16909)
### Description
Added Gelu operator to JSEP


2023-07-31 09:18:58 -07:00
satyajandhyala
e67547b978
[JS/WebGPU] Added Flatten operator support. (#16860)
### Description
Added Flatten operator support to JSEP.



2023-07-27 12:50:45 -07:00
satyajandhyala
03ce0a5693
[Web/JS] Added Slice operator in JSEP. (#16811)
### Description
Added Slice operator support to JSEP.



2023-07-25 14:19:20 -07:00
satyajandhyala
d41bbac7b9
[Web/JS] Added Expand operator support. (#16577)
### Description
Added Expand operator support.



2023-07-11 09:38:16 -07:00
satyajandhyala
00e8f2a2a9
[Web/JS] Add ConvTranspose support (#16433)
### Description
Add ConvTranspose support for WebGPU


2023-07-08 11:10:50 -07:00
satyajandhyala
e55a20ece8
[Web/JS] Added Split operator support. (#16567)
### Description
Added WebGPU/JSEP Split operator support.



2023-07-07 12:16:10 -07:00