Commit graph

136 commits

Author SHA1 Message Date
Xu Xing
76dfe5347c
[js/webgpu] Support uniforms for instance-norm (#18929)
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
2024-01-09 14:56:00 -08:00
zesongw
ad6dd0a597
[WebNN] Enable npm unit tests (#18486)
### Description
- Support more test cases for WebNN EP in suite-test-list.jsonc
- Add DISABLE_WEBNN flag in build.ts in preparation for the WebNN EP release
- Add test option: '--webnn-device-type' in test-runner-args-cli.ts to
support running WebNN 'gpu' deviceType
- Use Chrome Stable as default browser for WebNN testing to unblock the
CI limitation.
2024-01-09 10:10:57 -08:00
Jiajie Hu
447a3a7c70
[js/webgpu] Fix Expand/Gather when input type is bool (#18999)
### Description
Also update the op test suite.

### Motivation and Context
Previously the *total* size in case `Expand - last dim is not divisible
by 4` was a multiple of 4, even though the *last dimension* was not, so
the bug has never been caught.
2024-01-05 08:16:15 -08:00
satyajandhyala
780fc3611b
[JS/Web] Sajandhy/webgpu resize scales rank check (#18954)
2023-12-29 09:23:27 -08:00
satyajandhyala
3bbe4fe2ff
[JS/WebGPU] Add trilinear interpolation to Resize; activation_params attribute is optional for FusedConv also. (#18842)
### Description
Add trilinear interpolation to Resize and make the activation_params attribute optional for FusedConv.



2023-12-27 16:21:29 -08:00
Yulong Wang
9a61388f0a
[js/web] revise backend registration (#18715)
### Description
This PR revises the backend registration.

The following describes the expected behavior after this change:
(**bolded are changed behavior**)

- (ort.min.js - built without webgpu support)
    - loading: do not register 'webgpu' backend
    - creating session without EP list: use default EP list ['webnn', 'cpu', 'wasm']
    - creating session with ['webgpu'] as EP list: should fail with backend not available
- (ort.webgpu.min.js - built with webgpu support)
    - loading: **always register 'webgpu' backend**
      (previous behavior: only register 'webgpu' backend when `navigator.gpu` is available)
    - creating session without EP list: use default EP list ['webgpu', 'webnn', 'cpu', 'wasm']
        - when WebGPU is available (win): use WebGPU backend
        - when WebGPU is unavailable (android): **should fail backend init,** and try to use next backend in the list, 'webnn'
          (previous behavior: does not fail backend init, but fail in JSEP init, which was too late to switch to next backend)
    - creating session with ['webgpu'] as EP list
        - when WebGPU is available (win): use WebGPU backend
        - when WebGPU is unavailable (android): **should fail backend init, and because no more EP listed, fail.**
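
The fallback order described above can be pictured with a small standalone sketch. It does not call onnxruntime-web: `resolveBackend`, the `backends` registry and its `init()` functions are hypothetical names, used only to illustrate "try each EP in order and move on as soon as backend init fails".

```js
// Hypothetical sketch of ordered backend fallback; NOT the onnxruntime-web implementation.
// `backends` maps an EP name to an object whose async `init()` throws when the backend is unavailable.
async function resolveBackend(epList, backends) {
  const errors = [];
  for (const name of epList) {
    const backend = backends[name];
    if (!backend) {
      // e.g. 'webgpu' requested from a build that never registered it
      errors.push(`${name}: backend not registered`);
      continue;
    }
    try {
      // Failing here (rather than later) leaves time to try the next EP.
      await backend.init();
      return name;
    } catch (e) {
      errors.push(`${name}: ${e.message}`);
    }
  }
  throw new Error(`no available backend found. ${errors.join('; ')}`);
}
```

With an EP list of `['webgpu', 'wasm']` and a failing `webgpu` init, the sketch resolves to `'wasm'`; with `['webgpu']` alone it rejects, matching the last bullet above.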


related PRs: #18190 #18144
2023-12-20 14:45:55 -08:00
Jiajia Qin
b4be9e1bbb
[js/webgpu] Fix shader compilation errors in cumsum (#18779)
### Description
This PR fixes below shader compilation errors:
```
Tint WGSL reader failure: :39:31 error: no matching overload for operator + (f32, i32)

5 candidate operators:
  operator + (T, T) -> T  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (vecN<T>, T) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (T, vecN<T>) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (vecN<T>, vecN<T>) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (matNxM<T>, matNxM<T>) -> matNxM<T>  where: T is abstract-float, f32 or f16

                    sum = sum + get_inputByIndices(inputIndices);
                              ^


 - While validating [ShaderModuleDescriptor "CumSum"]
 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "CumSum"]).
```
2023-12-11 18:11:38 -08:00
Yulong Wang
efbef5f611
[js/webgpu] allow to specify callback for profiling data (#18732)
### Description

**This PR is a replacement of #17820.**

allow to specify callback for profiling data

*Previous*:
```js
ort.env.webgpu.profilingMode = 'default';  // enable profiling

// profiling data will output to console.
```

*Now*:
```js
ort.env.webgpu.profiling = {
  mode: 'default',  // enable profiling
  ondata: (data) => {
    // .. process the profiling data
  }
};

// "ondata" is called once for each kernel. Profiling data is only output to the console if "ondata" is not specified.
```
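
As a rough illustration of what an `ondata` callback might do, the sketch below accumulates elapsed time per kernel type. The field names `kernelType`, `startTime` and `endTime` on the `data` object are assumptions made for this example, and `createProfilingCollector` is a hypothetical helper, not part of onnxruntime-web.

```js
// Sketch: aggregate per-kernel time from profiling callbacks.
// Assumption: each `data` object carries `kernelType`, `startTime` and `endTime`.
function createProfilingCollector() {
  const totals = new Map(); // kernelType -> accumulated (endTime - startTime)
  return {
    totals,
    // Uses the closure, not `this`, so it can be passed around as a bare function.
    ondata(data) {
      const elapsed = data.endTime - data.startTime;
      totals.set(data.kernelType, (totals.get(data.kernelType) ?? 0) + elapsed);
    },
  };
}

// Usage (requires onnxruntime-web loaded as `ort`):
// const collector = createProfilingCollector();
// ort.env.webgpu.profiling = { mode: 'default', ondata: collector.ondata };
```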
2023-12-07 14:10:28 -08:00
Xu Xing
f949e0580b
[js/webgpu] Support uniforms for pool (#18656) 2023-12-05 07:54:30 -08:00
satyajandhyala
10c547516d
[JS/Web] Added CumSum operator to JSEP (#18637)
### Description
Added CumSum operator



### Motivation and Context
Reduce CPU <-> GPU data movement.
2023-12-05 07:51:53 -08:00
Jiajia Qin
6781b6cf3d
[js/webgpu] add bool type for Expand/Gather (#18615)
### Description
In the [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model,
`Expand` is used with the bool type and currently runs on the CPU EP.




| Kernel | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] | CPUExecutionProvider |

After this change, it will run on jsep.

| Kernel | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] | JsExecutionProvider |
2023-11-30 15:47:08 -08:00
satyajandhyala
7335760424
[JS/Web] Add uniforms to Einsum (#18531)
### Description
Add uniforms to Einsum



### Motivation and Context
Improve performance.
2023-11-29 15:30:33 -08:00
Jiajia Qin
fc8631e2f1
[js/web] Fix conv2dMatmul errors due to #18452 (#18562)
### Description
Currently, all conv2dMatmul cases with inChannels = 3 and outChannels % 4 = 0
report compilation errors. Models that include this kind of shape, such as
mobilenetv2-12 and resnet50, are affected.

The error was introduced by #18452
https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200,
which accidentally passes `components` to `x`. But the components of `x`
should be `innerElementSize`, not `components`. And when `innerElementSize`
is 3, we should use `1` in the current design.
2023-11-27 21:21:47 -08:00
Jiajia Qin
64dacc2892
[js/webgpu] Add BatchNormalization Op (#18468)
### Description
This PR adds `BatchNormalization` with `float` support.

Some Todos:
1. support inputs that do not all have the same data type. For example,
x/y is float16, but bias/scale is float32 or double.
2. training mode support.

We see many models using `BatchNormalization` ops. However, because the
op was missing in jsep, all of them ran on cpu, which resulted in very poor
performance. With this PR, the densenet-9 model goes from 250.69 ms to
20.29 ms.
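
For readers unfamiliar with the op, inference-mode `BatchNormalization` computes `y = scale * (x - mean) / sqrt(variance + epsilon) + bias` per channel. The following is a plain CPU reference sketch of that formula for an NCHW tensor, useful for checking a GPU kernel's output. It is not the WebGPU kernel from this PR, and `batchNormNCHW` is a hypothetical name.

```js
// CPU reference sketch of inference-mode BatchNormalization (NCHW layout).
// x: Float32Array of length N*C*H*W; scale, bias, mean, variance: arrays of length C.
function batchNormNCHW(x, dims, scale, bias, mean, variance, epsilon = 1e-5) {
  const [n, c, h, w] = dims;
  const spatial = h * w;
  const y = new Float32Array(x.length);
  for (let b = 0; b < n; b++) {
    for (let ch = 0; ch < c; ch++) {
      // The normalization factor is constant for a whole channel.
      const invStd = 1 / Math.sqrt(variance[ch] + epsilon);
      const offset = (b * c + ch) * spatial;
      for (let i = 0; i < spatial; i++) {
        y[offset + i] = (x[offset + i] - mean[ch]) * invStd * scale[ch] + bias[ch];
      }
    }
  }
  return y;
}
```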
2023-11-22 15:58:06 -08:00
satyajandhyala
841f7ed3e0
[[JS/Web]Added uniform to Expand op. (#18558)
### Description
Added Uniforms to Expand operator kernel


### Motivation and Context
Improve performance
2023-11-22 14:14:24 -08:00
Arthur Islamov
fac3e33da5
[js/web] JSEP Attention & MultiHeadAttention (#17742)
### Description
This is a narrow implementation of Attention/MultiHeadAttention as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask

But it works well for StableDiffusion and can be extended later. It
reduces VRAM usage as it combines many ops into few
I've updated demo here https://islamov.ai/stable-diffusion-webgpu/ it
takes ~13sec for 1 image with 20 steps on RTX3090Ti and about 25s on M1
Pro
VRAM usage is about 8gb if you don't use img2img

Going to focus on SDXL now

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-11-17 12:23:52 -08:00
satyajandhyala
b291b20fa0
[JS/Web]Added uniforms support to Slice op. (#18422)
### Description
Support uniforms in Slice op



### Motivation and Context
Improve performance
2023-11-16 09:44:13 -08:00
Yulong Wang
586f06f5a1
[js/web] set noUnusedParameters to true and fix a few bugs (#18404)
### Description
- set tsconfig "noUnusedParameters" to `true` and fix a few bugs discovered by typescript.
  how unused parameters are fixed:
    - for most code (webgl), add underscore as prefix, which is the standard ignore pattern for typescript check.
    - remove unused parameter from function and modify corresponding function calls (jsep)
    - fix a bug in ArgMinMax: these 2 operators do not have more than one input, so the `createArgMinMaxAttributesFromInputs()` is removed.
- add proxy main.ts into typescript check and fix a bug in parameter passing
    - fixed `run()` function call and add typecheck fix (hack)
2023-11-15 09:16:29 -08:00
Xu Xing
829d802337
[js/webgpu] Support uniform for softmax (#18345) 2023-11-09 11:19:23 -08:00
Scott McKay
4f2096be38
Update XNNPACK to latest version (#18038)
### Description
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well

Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
-  some ops use a workspace buffer
   -  copied workspace allocation from XNNPACK unit test code
- some suffixes changed 

Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
isn't currently usable via the public API
- we could use the internal types if we think it's required for
performance reasons. non-trivial though as we'd need to propagate ifdef
values from the XNNPACK build up to the ORT build.
- using XNNPACK internals would also mean we would not be able to
support using a pre-built XNNPACK package
    - not an issue currently
  
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed

Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario

### Motivation and Context
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.

NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
2023-11-03 09:04:28 -07:00
satyajandhyala
a2e9ba72d5
[JS/Web]Added FusedConv. (#17766)
### Description
Added FusedConv and FusedConvTranspose



### Motivation and Context
Improve performance
2023-11-01 15:34:51 -07:00
Jiajia Qin
8a12b2cea6
[js/webgpu] Fix the transpose error when dims > 4D (#18027)
### Description
Currently, the uniform support has bugs when dims rank is larger than 4.
See https://github.com/microsoft/onnxruntime/issues/17860 item 1.
So this PR only enables shapes uniforms when shape rank is <= 4 for
transpose. Otherwise, below compilation errors are thrown:
```
1 error(s) generated while compiling the shader:
:3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead.
      struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
                                                 ^^^^^^^^^^^^^

:3:7 note: see layout of struct:
/*            align(4) size(84) */ struct Uniforms {
/* offset( 0) align(4) size( 4) */   output_size : u32;
/* offset( 4) align(4) size(20) */   a_shape : array<u32, 5>;
/* offset(24) align(4) size(20) */   a_strides : array<u32, 5>;
/* offset(44) align(4) size(20) */   output_shape : array<u32, 5>;
/* offset(64) align(4) size(20) */   output_strides : array<u32, 5>;
/*                              */ };
      struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
      ^^^^^^

:4:42 note: 'Uniforms' used in address space 'uniform' here
      @group(0) @binding(2) var<uniform> uniforms: Uniforms;
                                         ^^^^^^^^
```
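
The error above comes from the WGSL rule that array elements in the uniform address space must have a stride that is a multiple of 16 bytes. One possible workaround, shown here only as a sketch and NOT what this PR does (the PR simply limits shape uniforms to rank <= 4), is to pad the shape into whole `vec4<u32>` elements on the JavaScript side. `packDimsAsVec4` is a hypothetical helper and assumes a rank of at least 1.

```js
// Sketch: pad a shape so it can be declared as array<vec4<u32>, N> in a uniform struct.
// Hypothetical helper; assumes dims.length >= 1.
function packDimsAsVec4(dims) {
  const vec4Count = Math.ceil(dims.length / 4);
  // Zero-padded; each vec4<u32> occupies 16 bytes, which satisfies the stride rule.
  const packed = new Uint32Array(vec4Count * 4);
  packed.set(dims);
  return { packed, wgslType: `array<vec4<u32>, ${vec4Count}>` };
}
```

In the shader, element `i` of such an array would then be read as `shape[i / 4][i % 4]`.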
2023-10-23 11:02:19 -07:00
Yulong Wang
6ea493571e
[js/web] use esbuild to accelerate bundle build (#17745)
### Description

Use esbuild to accelerate bundle build.

This change uses esbuild to replace webpack for onnxruntime-web. Bundle
build time reduced from ~20sec to ~0.6sec on my windows dev box.

A few changes applied:
- import nodejs modules using "node:" prefix
- remove enum declaration inside namespace (EncoderUsage)
- use "fs/promises" to replace the old promisify from "util"
- separate ort-web and test-runner. Previously they were bundled together, now they are built into 2 files.
- optimize karma runner launch time
    - remove unnecessary sourcemap preprocessor. sourcemaps are handled inside esbuild
    - remove unnecessary proxies (because ort-web and test-runner are separated now, the paths are correctly inferred)
    - remove file watcher for test data
- optimize special handling as esbuild plugins:
    - polyfill dummy imports for node.js modules when targeting browser.
    - load as content string for ort-wasm-*.worker.js
    - load as content string for ./proxy-worker/main.ts
    - a source patch to ort-wasm*-threaded*.js (see details in comments in code)
- updated debug configurations for sourcemap mapping to ensure out-of-box good dev experience
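
To give an idea of what "polyfill dummy imports for node.js modules" can look like as an esbuild plugin, here is a minimal sketch. It is not the plugin from this PR: the plugin name, the namespace and the entry point in the usage comment are made up for illustration. It only relies on esbuild's `onResolve` / `onLoad` plugin hooks.

```js
// Sketch of an esbuild plugin that replaces "node:*" imports with an empty module
// when bundling for the browser. Names are illustrative, not taken from the PR.
const nodeBuiltinStubPlugin = {
  name: 'node-builtin-stub',
  setup(build) {
    // Redirect imports such as "node:fs" into a private namespace.
    build.onResolve({ filter: /^node:/ }, (args) => ({
      path: args.path,
      namespace: 'node-builtin-stub',
    }));
    // Serve an empty CommonJS module for everything in that namespace.
    build.onLoad({ filter: /.*/, namespace: 'node-builtin-stub' }, () => ({
      contents: 'module.exports = {};',
      loader: 'js',
    }));
  },
};

// Usage (assumes esbuild is installed; the entry point is hypothetical):
// require('esbuild').build({
//   entryPoints: ['lib/index.ts'], bundle: true, platform: 'browser',
//   outfile: 'dist/bundle.js', plugins: [nodeBuiltinStubPlugin],
// });
```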
2023-10-06 13:37:37 -07:00
Jiajia Qin
db3901ab97
[js/webgpu] Enable the NCHW ConvMatMul path (#17717)
1) Enable pointwise NCHW conv2d by MatMul.
2) Enable non-pointwise NCHW conv2d by convMatMul.
3) Fix bug when `sameSize` is true

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-05 00:26:01 -07:00
Xu Xing
992f3e4609
[js/webgpu] Support where (#17544)
Supported types: float, int32_t, uint32_t, bool.
Case where_broadcast.jsonc is not enabled due to
https://github.com/microsoft/onnxruntime/issues/17405.


---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-03 14:28:21 -07:00
Arthur Islamov
d0519a7603
[js/web] BiasSplitGelu and BiasAdd kernels (#17161)
### Description
Two contrib kernels that are supposed to speed up StableDiffusion according
to this doc
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

However, there is no noticeable effect on speed or memory consumption. So
I guess the only way to make it faster is to implement
MultiHeadAttention, but I'm not capable of doing that right now. So I'll
focus on existing PRs and on finding the JSEP kernel that produces
incorrect results. It should be one of the old ones (I suspect Conv or
ConvTranspose), as SD has not been generating images correctly on webgpu
since I started working on it. I hoped someone else would fix that by
the time I finished with kernels/optimizations 😅

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-03 12:20:20 -07:00
xhcao
0d60604638
[JS/WebGPU] support Range operator (#17233)
The patch also introduces the method which copies
data from GPU to CPU synchronously.

2023-09-30 02:05:32 -07:00
Yulong Wang
561aca97cf
[js/webgpu] support IO binding (#17480)
<del>
**This PR is based on a few prerequisites PRs. They are listed as
below:**
- #17465
- #17469
- #17470
- #17472
- #17473
- #17484

Please review the current change by only looking at commit
e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later.


</del>

### Description

This PR introduces WebGPU IO binding. This new feature allows
onnxruntime-web users to use tensors created from GPU as model
input/output so that a model inferencing can be done without unnecessary
data copy between CPU and GPU for model input/output.

### Examples

An E2E demo/example is being worked on.

The following is a simple demo with code snippets.

Let's first look at how this is done today:
```js
// STEP.1 - create an inference session:
const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] });

// STEP.2 - create model input: (supposing myImageCpuData is a Float32Array)
const feeds = {
  'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3])
};

// STEP.3 - run model
const myResults = await mySession.run(feeds);

// STEP.4 - get output data
const myData = myResults['output_image:0'].data; // Float32Array

```

#### for inputs (GPU tensor):

Now, with IO binding, you can create a tensor from a GPU buffer, and
feed it to the model:
```js
// new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data)
const feeds = {
  'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] })
};
```

#### for outputs (pre-allocated GPU tensor)

you can also do that for output, **if you know the output shape**:
```js
// new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object)
const fetches = {
  'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] })
};

// new STEP.3 - run model with pre-allocated output (fetches)
const myResults = await mySession.run(feeds, fetches);
```

#### for outputs (specify location)

if you do not know the output shape, you can specify the output location
when creating the session:

```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
    executionProviders: ['webgpu'],
    preferredOutputLocation: "gpu-buffer"
});
```

if the model has multiple outputs, you can specify them separately:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
    executionProviders: ['webgpu'],
    preferredOutputLocation: {
         "output_image:0": "gpu-buffer"
    }
});
```

now you don't need to prepare the `fetches` object and onnxruntime-web
will prepare output data at the specified location.

#### read data

when you get the output tensor, you can:
```js
// get the gpu buffer object:
const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer

// get the CPU data asynchronously
const cpuData = await myOutputTensor.getData();

// alternatively, get the CPU data asynchronously and release the underlying GPU resources
const cpuDataReleased = await myOutputTensor.getData(true);

// dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called.
myOutputTensor.dispose();
```

#### resource management

JavaScript has GC so you don't need to worry about managing JavaScript
objects. But there are 2 types of resources that are not managed by GC:
- GPU buffers that are used in tensors
- Underlying ORT native resources

To simplify, most of the unmanaged resources are handled inside ORT web.
But there are a few resources that users need to manage:
- All external GPU resources, including GPU buffers inside all tensors
created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. Users
should manage those GPU buffers themselves.
- When a session is created with `preferredOutputLocation` ==
"gpu-buffer" specified in session options, and the corresponding output
is not pre-allocated, users need to call the output tensor's `dispose()`
or `getData(true)` to manually release the underlying GPU buffers.
- ORT internal errors (including providing a pre-allocated output tensor
with wrong type/dims) will invalidate the whole wasm memory and are not
recoverable. An exception is thrown in this situation.
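
The second rule can be wrapped in a small helper. This is a sketch under the assumption that output tensors expose `location`, `data` and `getData(releaseData)` as described in this PR and in #16452; `downloadAndRelease` is a hypothetical name. It is meant for outputs that ORT allocated because of `preferredOutputLocation`; buffers you passed in through `Tensor.fromGpuBuffer()` remain yours to manage.

```js
// Sketch: copy every output to CPU and release GPU buffers that ORT allocated for outputs.
// Assumes tensors expose `location`, `data` and `getData(releaseData)`.
async function downloadAndRelease(results) {
  const cpuResults = {};
  for (const [name, tensor] of Object.entries(results)) {
    cpuResults[name] = tensor.location === 'gpu-buffer'
      ? await tensor.getData(true) // download, then release the underlying GPU buffer
      : tensor.data; // already on CPU
  }
  return cpuResults;
}

// Usage:
// const results = await mySession.run(feeds);
// const cpuResults = await downloadAndRelease(results);
```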
2023-09-29 11:24:42 -07:00
satyajandhyala
b4fbc25b1f
[JS/Web] Add ConvTranspose implementation using MatMul (#17573)
### Description
Add ConvTranspose implementation using MatMul to increase perf.


2023-09-29 11:00:44 -07:00
Hariharan Seshadri
460f17fbb8
[JS/WebGPU] Support If on WebGPU (#17478) 2023-09-19 12:20:18 -07:00
Jiajia Qin
41d2ff622c
[js/webgpu] Optimize InstanceNormalization (#17491)
### Description
In the previous implementation, there are two loops that iterate over
H * W elements to calculate the `mean` and `squaredNorm` values in one
thread, and the same thread also outputs H * W elements. As a result, it is
very slow when H * W is large, and H * W is usually large in a model. For
example, in the `candy-8` model, the shapes of [H, W] are [224,224],
[112,112], [56,56] for the `InstanceNormalization` op. And on my ADL,
`[1,224,224,32]` consumes 17 ms. See below:
```
[profiling] kernel "23848328|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,224,224,32] | float32, execution time: 17007914 ns
```

In this PR, workgroup memory is used to optimize the original algorithm.
The advantage is that the 64 (workgroupSize) threads in one workgroup can
calculate the `mean` and `squaredNorm` values in parallel. Meanwhile, each
thread only outputs `H * W / workgroupSize` elements, which greatly reduces
the overhead per thread. With this optimization, `[1,224,224,32]` takes
3 ms and the main overhead is the two extra `transpose` ops. The
`createInstanceNormProgramInfo` only needs `0.64` ms. See below:
```
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] | float32, output[0]: [1,32,224,224] | float32, execution time: 1543792 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,32,224,224] | float32, execution time: 642652 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, output[0]: [1,224,224,32] | float32, execution time: 991608 ns
```
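
The idea of splitting one long per-thread loop across a workgroup can be simulated on the CPU. The sketch below is not the shader from this PR. It is a generic two-phase workgroup reduction (strided partial sums, then a tree reduction in shared memory) written in JavaScript for illustration, and it assumes `workgroupSize` is a power of two.

```js
// CPU simulation of a workgroup reduction computing a mean.
// Assumes workgroupSize is a power of two.
function workgroupMean(values, workgroupSize = 64) {
  const shared = new Float64Array(workgroupSize); // stands in for var<workgroup> memory
  // Phase 1: "thread" t accumulates elements t, t + workgroupSize, t + 2 * workgroupSize, ...
  for (let t = 0; t < workgroupSize; t++) {
    let sum = 0;
    for (let i = t; i < values.length; i += workgroupSize) sum += values[i];
    shared[t] = sum;
  }
  // Phase 2: tree reduction; on a GPU each step would be separated by a workgroup barrier.
  for (let stride = workgroupSize >> 1; stride > 0; stride >>= 1) {
    for (let t = 0; t < stride; t++) shared[t] += shared[t + stride];
  }
  return shared[0] / values.length;
}
```

Each simulated thread touches only about `values.length / workgroupSize` elements, which is the source of the speed-up described above.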
This PR currently only applies the new algorithm to NCHW format. For
NHWC format, one way is to transpose the input so that it can use the
new algorithm. But the disadvantage is that 2 extra transposes are added.
@dakenf also gives another way to optimize NHWC. For details see
[here](d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts).
I checked @dakenf's method. The perf is similar to transpose +
optimized NCHW, but which one is slightly better depends on the GPU.
So I prefer that this PR only does the NCHW part.
@dakenf can submit his optimization for NHWC.
2023-09-14 17:03:18 -07:00
xhcao
198d468849
[WebGPU/JS] Added Pad operator support (#16928)
2023-09-14 13:14:11 -07:00
xhcao
ec94b07f0a
[JS/WebGPU] support Concat.int32 operator (#17003)
2023-09-13 00:05:00 -07:00
Yulong Wang
f923eec28b
[js/web] release session after use in npm test (#17470)
### Description
release session after use in npm test.

This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
#17465
#17469
#17470 (this one)
2023-09-12 16:59:13 -07:00
satyajandhyala
bf6d6961cc
[JS/Web] Added Einsum operator support. (#17401)
### Description
Added Einsum operator support to JSEP.



2023-09-11 15:57:15 -07:00
Yulong Wang
89da5a0108
[js/webgpu] exclude WebGPU reduce_log_sum_exp_* float64 test cases (#17472)
### Description

as explained in the comments, tests "test_reduce_log_sum_exp_*" on
opset17/opset18 are excluded because they use float64.

They are passing now because they fallback to CPU. WebGPU does not
support f64.


This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
https://github.com/microsoft/onnxruntime/pull/17465
https://github.com/microsoft/onnxruntime/pull/17469
https://github.com/microsoft/onnxruntime/pull/17470
https://github.com/microsoft/onnxruntime/pull/17472 (this one)
2023-09-08 17:03:04 -07:00
Jian Chen
8914fe687b
[js/webgpu] Include Support for neg.int32 (#17374)
### Description
Include Support for neg.int32



2023-09-06 12:00:16 -07:00
Yulong Wang
75710f0006
[js/webgpu] add matmul broadcast tests (#17335)
### Description

Commit fffefb1c22 (#16969) optimized
matmul and also fixes broadcasting. So #17191 is no longer needed.
However, the newly added operator test file from the PR by @dakenf is
helpful so pick and add it to enhance the tests.
2023-09-05 20:41:46 -07:00
xhcao
026672e947
[js/webgpu] Support slice int32 (#16968)
Co-authored-by: Xing Xu <xing.xu@intel.com>
2023-09-05 18:05:47 -07:00
Jian Chen
e60493525f
[js/webgpu] Adding support for abs with int32 type (#17359)
2023-08-31 08:13:54 -07:00
Yulong Wang
e5ca3f3dcb
[js/api] introducing IO binding for tensor (#16452)

### Description
This PR adds a few properties, methods and factories to the Tensor type to
support the IO-binding feature. This allows users to create tensors from
GPU/CPU bound data without forcing a data transfer between CPU and
GPU.

This change is a way to resolve #15312

### Change Summary
1. Add properties to `Tensor` type:
a. `location`: indicates where the data is located. Valid values are
`cpu`, `cpu-pinned`, `texture`, `gpu-buffer`.
b. `texture`: sits alongside `data`; a readonly property of `WebGLTexture`
type, available only when `location === 'texture'`
c. `gpuBuffer`: sits alongside `data`; a readonly property of `GPUBuffer`
type, available only when `location === 'gpu-buffer'`

2. Add methods to `Tensor` type (usually dealing with inference
outputs):
- async function `getData()` allows user to download data from GPU to
CPU manually.
- function `dispose()` allows user to release GPU resources manually.

3. Add factories for creating `Tensor` instances:
    a. `fromTexture()` to create a tensor bound to a WebGL texture
    b. `fromGpuBuffer()` to create a tensor bound to a WebGPU buffer
    c. `fromPinnedBuffer()` to create a tensor using a CPU pinned buffer

### Examples:

create tensors from texture and pass to inference session as inputs
```js
// when creating the session, specify that we prefer 'image_output:0' to be stored on GPU as texture
const session = await InferenceSession.create('./my_model.onnx', {
  executionProviders: [ 'webgl' ],
  preferredOutputLocation: { 'image_output:0': 'texture' }
});

// ...

const myImageTexture = getTexture(); // user's function to get a texture
const myFeeds = { input0: Tensor.fromTexture(myImageTexture, { width: 224, height: 224 }) }; // shape [1, 224, 224, 4], RGBA format.
const results = await session.run(myFeeds);
const myOutputTexture = results['image_output:0'].texture;
```
2023-08-29 12:58:26 -07:00
Jiajia Qin
fffefb1c22
[js/webgpu] Optimize matmul (#16969)
### Description
Changes in this PR:
1) use the optimized version `makeMatMulPacked[Vec4]Source` to support
matmul.
2) enable the conv2dByMatMul path.
3) support broadcast
4) use IndicesHelper.

MatMul with M = 512, K = 512, N = 512 becomes 2ms from 15ms when
enabling profilingMode on my ADL.
2023-08-29 12:40:57 -07:00
Hariharan Seshadri
cbd97515cd
[JS/WebGPU] Support GatherElements kernel (#17243)
### Description
As title


### Motivation and Context
Improve WebGPU kernel coverage
2023-08-28 09:55:25 -07:00
Yulong Wang
bb1871332f
[js/webgpu] add kernel Not and Equal (#17306)
### Description
This PR adds kernel implementation for operator "Not" and "Equal". Also
removed download cache in gpu data manager.

**Why removing download cache**
The following test case failed. ("Or" is on CPU, "Greater" and "Equal"
are on JSEP)

![image](https://github.com/microsoft/onnxruntime/assets/7679871/8d9798ad-2703-4fb9-907e-ff716c67d0b2)
after debugging, I found that both "Equal" and "Greater" are using the
same output GPU Data ID. This is because when ORT executes the graph, it
first runs "Equal", allowing its shader to write into GPU Data ID 2; then
a Gpu2Cpu copy for it is issued (because currently "Or" is on CPU EP);
at this point, ORT thinks GPU Data ID=2 is free to use; so it reuses it
as output for "Greater". This means there is no allocation for the output
of the "Greater" kernel, and both kernels write to GPU Data ID=2.

For gpu data manager, there will be 2 downloads from the same GPU
buffer. Previously I thought this was a waste of resources so I cached the
data. But now it shows that we need to perform 2 downloads because the
GPU data is already different. The download data cache should be
removed.
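
The failure mode can be reproduced with a toy model. The class below is a simulation written for this explanation, not the real gpu data manager. `FakeGpuDataManager` and `runEqualThenGreater` are hypothetical names, and strings stand in for buffer contents.

```js
// Toy simulation: a download cache keyed by GPU data ID returns stale data
// once the same ID is reused for another kernel's output.
class FakeGpuDataManager {
  constructor(useCache) {
    this.useCache = useCache;
    this.buffers = new Map(); // id -> current GPU contents
    this.downloadCache = new Map(); // id -> previously downloaded contents
  }
  write(id, data) {
    this.buffers.set(id, data);
  }
  download(id) {
    if (this.useCache && this.downloadCache.has(id)) return this.downloadCache.get(id);
    const data = this.buffers.get(id);
    if (this.useCache) this.downloadCache.set(id, data);
    return data;
  }
}

function runEqualThenGreater(manager) {
  manager.write(2, 'Equal output');
  const equal = manager.download(2); // Gpu2Cpu copy for the CPU "Or" node
  manager.write(2, 'Greater output'); // ID 2 is reused as the output of "Greater"
  const greater = manager.download(2);
  return { equal, greater };
}
```

With the cache enabled, the second download returns the "Equal" data again; without it, each download reflects the current buffer contents.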


2023-08-27 19:50:17 -07:00
xhcao
5e8d94cec8
[js/webgpu] support Greater and Less operators (#17296)
2023-08-25 12:11:25 -07:00
satyajandhyala
da180b20fa
[JS/Web] Fix ConvTranspose shader code compilation errors. (#17232)
### Description
Fix JSEP ConvTranspose shader code errors.



2023-08-25 06:25:54 -07:00
Yulong Wang
6fc3fd9ece
[js/webgpu] support Cast operator (#16489)
### Description
support `Cast` operator for webgpu backend.

Cast operator for webgpu backend currently only supports f32, u32, i32
and bool.
2023-08-18 23:51:03 -07:00
xhcao
dd3b2cefd6
[js/webgpu] Support int32 type for binary (#16901)
### Description
Enable typed binary and support int32 type for binary.

Co-authored-by: Xing Xu <xing.xu@intel.com>

---------

Co-authored-by: Xing Xu <xing.xu@intel.com>
2023-08-18 12:19:01 -07:00
Hariharan Seshadri
a476dbf430
[JS/WebGPU] Support Tile operator (#17123)
### Description
As title

### Motivation and Context
Improve WebGPU op coverage
2023-08-18 10:07:21 -07:00
satyajandhyala
7d1a5635a0
[JS/Web] Added SkipLayerNormalization operator. (#17102)
### Description
Add SkipLayerNormalization operator to JSEP.



2023-08-18 09:59:03 -07:00