Commit graph

384 commits

Author SHA1 Message Date
Yulong Wang
4538d31a8b
[js/webgpu] expose a few properties in WebGPU API (#19857)
### Description
This change exposes a few properties in `ort.env.webgpu` to resolve
feature requirement mentioned in properties in
https://github.com/microsoft/onnxruntime/pull/14579#discussion_r1519612619.

- Add `powerPreference` and `forceFallbackAdapter` in `ort.env.webgpu`,
to allow users to set the value of the properties before the first
inference session is created.
- Add readonly property `adapter` in `ort.env.webgpu` to allow users to
get the adapter instance. Now users can access `ort.env.webgpu.device`
and `ort.env.webgpu.adapter`.

@xenova @beaufortfrancois
2024-03-12 19:50:51 -07:00
Satya Kumar Jandhyala
24b72d2613
[JS/WebGPU] Preserve zero size input tensor dims. (#19737)
### Description
For Concat operation, the zero-size input tensor shape need to be
preserved and, unlike non-zero tensors, the dims are not constrained to
match other input tensors' dims.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-07 19:07:49 -08:00
Yulong Wang
f06164ef8b
[js/web] transfer input buffer back to caller thread (#19677)
### Description

When using proxy worker, input buffers should be transferred back to the
caller thread after `run()` call is done.

Fixes #19488
2024-03-01 14:50:06 -08:00
Yulong Wang
3cb81cdde2
[js/common] move 'env.wasm.trace' to 'env.trace' (#19617)
### Description

Try to move 'env.wasm.trace' to 'env.trace' to make it less confusing,
because it also works in webgpu. Marked 'env.wasm.trace' as deprecated.
2024-02-27 11:07:15 -08:00
Guenther Schmuelling
bb43a0f133
[js/webgpu] minor fixes to make tinyllama work (#19564) 2024-02-23 15:45:30 -08:00
Yulong Wang
aec2389ad0
[js/webgpu] allows a ProgramInfo's RunData to use zero sized output (#19614)
### Description
This PR allows zero-sized output.

To make the implementation simple, it does not support partial
zero-sized tensor. Which means, either all outputs are zero-sized, or an
error will be reported.

added 2 tests:
 - op test of `Add` with input T[2,0] T[2,1], and
 - test_split_zero_size_splits
2024-02-23 12:52:47 -08:00
satyajandhyala
ae3d73c981
[JS/WebGPU] Fix Split and Where to handle corner cases. (#19613)
### Description
<!-- Describe your changes. -->
1. Fix Where operator to handle Boolean input less than 4 bytes.
2. Fix JSEP test harness to use tensor names consistently.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-23 00:21:15 -08:00
Xu Xing
fe82fccf1a
[js/webgpu] Fix Conv2DTransposeMatMul f16 compilation failure (#19596)
This is used in sam-h-decoder-f16.

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-22 13:09:28 -08:00
Yulong Wang
58f4921686
[js] changes to allow Float16Array if any polyfill is available (#19305)
### Description

This change adds only necessary code to enable ort-web works with any
Float16Array polyfill. Unlike #19302, in this PR, ort-web does not
include any specific polyfill; instead, it's user's choice for how to
use a polyfill.

ORT-web uses Float16Array if it's available; otherwise, fallback to use
Uint16Array.

```js
// case 1: user does not use polyfill:
import * as ort from 'onnxruntime-web';

const myF16Data = new Uint16Array(...);  // need to use Uint16Array
const myF16tensor = new ort.Tensor('float16', myF16Data, dims);
```

```js
// case 2: user use polyfill:
import * as ort from 'onnxruntime-web';
import {
  Float16Array, isFloat16Array, isTypedArray,
  getFloat16, setFloat16,
  f16round,
} from "@petamoriken/float16";
globalThis.Float16Array = Float16Array;  // ort-web will pick the global Float16Array

const myF16Data = new Float16Array(...);  // Use the polyfilled Float16Array type
const myF16tensor = new ort.Tensor('float16', myF16Data, dims);
```
2024-02-21 00:31:06 -08:00
Yulong Wang
3fe2c137ee
[js] small fix to workaround formatter (#19400)
### Description
Rename shader variable names to snake_case naming and also to avoid
formatter behaving inconsistently in win/linux.
2024-02-20 17:23:01 -08:00
Jiajie Hu
1b48054e1b
[js/webgpu] Create Split indices helpers by rank, not by shape (#19554)
### Description
This is required to make shape uniforms really work.

### Motivation and Context
The bug was unveiled in a model with multiple Split nodes. The later
nodes would try to reuse a previous pipeline cache, while the old shapes
were hardcoded as constants in cache.
2024-02-20 09:24:34 -08:00
satyajandhyala
dfeda9019c
[JS/WebGPU] Add MatMulNBits (#19446)
### Description
Add MatMulNBits to support MatMul using 4-bit quantized weights



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-17 09:19:17 -08:00
Yulong Wang
06269a3952
[js/webgpu] allow uint8 tensors for webgpu (#19545)
### Description
allow uint8 tensors for webgpu
2024-02-16 18:28:27 -08:00
Yulong Wang
5ff27ef02a
[js/webgpu] support customop FastGelu (#19392)
### Description
Support WebGPU custom operator FastGelu.
2024-02-06 09:07:31 -08:00
Jiajia Qin
ccbe264a39
[js/webgpu] Add LeakyRelu activation for fusedConv (#19369)
### Description
This PR 1) adds LeakyRelu activation for fusedConv; 2) makes `vec4<f16>`
value work with `float32` uniforms attributes.

For example:
`clamp(value, vec4<f16>(uniforms.clip_min),
vec4<f16>(uniforms.clip_max)` will throw compilation errors since
`uniforms.clip_min` and `uniforms.clip_min` are `f32` not `f16`. So we
need to change it to `clamp(value, vec4<f16>(f16(uniforms.clip_min)),
vec4<f16>(f16(uniforms.clip_max))`

And above problem was introduced when we make activation attributes as
uniforms instead of constant.

BTW, after adding LeakyRelu, `realesrgan-t256` model can pass.
2024-02-02 09:06:38 -08:00
Jiajia Qin
efc17e79de
[js/webgpu] Fix the undefined push error (#19366)
### Description
This PR fixes below errors when enable webgpu profiling: 
```
TypeError: Cannot read properties of undefined (reading 'push')
```
2024-02-02 02:04:06 -08:00
Xu Xing
3a2ab1963a
[js/webgpu] Refactor createTensorShapeVariables (#18883) 2024-02-01 17:59:00 -08:00
Yulong Wang
dd1f6ccc45
[js/webgpu] resolve codescan alert (#19343)
### Description
resolve codescan alert:
https://github.com/microsoft/onnxruntime/security/code-scanning/17687
2024-01-30 21:06:21 -08:00
Xu Xing
d73131cf0f
[js/webgpu] Use DataType as uniform cpu type (#19281)
This saves turning data type to string by tensorDataTypeEnumToString.
2024-01-30 21:05:08 -08:00
Jiajia Qin
85cef0af8c
[js/webgpu] Support capture and replay for jsep (#18989)
### Description
This PR expands the graph capture capability to JS EP, which is similar
to #16081. But for JS EP, we don't use the CUDA Graph, instead, we
records all gpu commands and replay them, which removes most of the cpu
overhead to avoid the the situation that gpu waiting for cpu.

mobilenetv2-12 becomes 3.7ms from 6ms on NV 3090 and becomes 3.38ms from
4.58ms on Intel A770.

All limitations are similar with CUDA EP:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
2. Usage of graph capture is limited to models where-in all ops in the
model can be partitioned to the JS EP or CPU EP and no memory copy
between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IObinding is required.

The usage is like below:
Method 1: specify outputs buffers explicitly.
```
    const sessionOptions = {
        executionProviders: [
          {
            name: "webgpu",
          },
        ],
        enableGraphCapture: true,
      };
    const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);
   
    // prepare the inputBuffer/outputBuffer
    ... ...

   const feeds = {
       'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
   };

   const fetches = {
       'output': ort.Tensor.fromGpuBuffer(outputBuffer, { dataType: 'float32', dims: [1, 1000] })
   };

   let results = await session.run(feeds, fetches);  // The first run will begin to capture the graph.

   // update inputBuffer content
  ... ...
   results = = await session.run(feeds, fetches);  // The 2ed run and after will directly call replay to execute the graph.

  ... ...
   session.release();
```
Method 2: Don't specify outputs buffers explicitly. Internally, when
graph capture is enabled, it will set all outputs location to
'gpu-buffer'.
```
    const sessionOptions = {
        executionProviders: [
          {
            name: "webgpu",
          },
        ],
        enableGraphCapture: true,
      };
    const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);

    // prepare the inputBuffer
    ... ...

   const feeds = {
       'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
   };

   let results = await session.run(feeds);  // The first run will begin to capture the graph.
   
   // update inputBuffer content
  ... ...
   results = = await session.run(feeds);  // The 2ed run and after will directly call replay to execute the graph.

  ... ...
   session.release();
2024-01-30 18:28:03 -08:00
Jiajia Qin
90883a366a
[js/webgpu] Add hardSigmoid activation for fusedConv (#19233)
### Description
Add hardSigmoid activation for fusedConv. It will be used by
mobilenetv3-small-100 model.
2024-01-30 16:28:53 -08:00
Xu Xing
624b4e2063
[js/webgpu] Remove enableShapesUniforms (#19279) 2024-01-29 17:49:06 -08:00
Guenther Schmuelling
9e69606360
fix f16 for attention, enable slice and flatten for more types (#19262) 2024-01-29 10:13:46 -08:00
Xu Xing
a3f0e2422b
[js/webgpu] Support f16 uniform (#19098)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-25 16:58:22 -08:00
Xu Xing
656ca66186
[js/webgpu] Support uniforms for conv, conv transpose, conv grouped (#18753) 2024-01-25 15:37:05 -08:00
Jiajie Hu
5b06505073
[js/webgpu] Fix Tanh explosion (#19201)
### Description
```math
\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=
\left\{
\begin{array}{cc}
-\frac{1-e^{-2\cdot(-x)}}{1+e^{-2\cdot(-x)}}, & x<0 \\
0, & x=0 \\
\frac{1-e^{-2x}}{1+e^{-2x}}, & x>0
\end{array}
\right.
```

### Motivation and Context
On some platforms,
$$\tanh(1000)=\frac{e^{1000}-e^{-1000}}{e^{1000}+e^{-1000}}$$ would
produce NaN instead of 0.999... or 1 (imagine $e^{1000}=\infty$ and
$\frac{\infty}{\infty}$ explodes).
2024-01-25 08:25:35 -08:00
Wanming Lin
7252c6e747
[WebNN EP] Support WebNN async API with Asyncify (#19145) 2024-01-24 15:37:35 -08:00
Yang Gu
591f90c0b9
[js/webgpu] Fix issue of timestamp query (#19258)
When we enable webgpu profiling mode between session.create and
session.run, current implementation has a problem to create querySet
(and also queryResolveBuffer) if we share the commandEncoder with inputs
upload. This PR fixes this by moving the querySet creation to the place
we set queryType.
2024-01-24 14:49:37 -08:00
satyajandhyala
a33b5bd1fa
[JS/WebGPU] Added Uniforms to SkipLayerNorm. (#18788)
### Description
Added Uniforms to SkipLayerNorm



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-01-25 01:12:21 +05:30
Jiajia Qin
d226e40856
[js/webgpu] set query type in onRunStart (#19202)
### Description
<!-- Describe your changes. -->
`env.webgpu.profiling` is a global flag. It may change before each
session.run. So the best place is to update it in `onRunStart` event.
After this, we can directly check `this.queryType`'s value. Without this
pr, we need to make sure that `getCommandEncoder()` is called before
checking `this.queryType`. Otherwise, it may happen that
`pendingKernels`'s length is not equal to `pendingDispatchNumber`'s
length. See the two ugly workarounds
[1)](e630dbf528 (diff-006fc84d3997f96a29b8033bd2075d6a0a9509211bd5812a6b934fc74fedfd9dR267-R268))
and
[2)](e630dbf528 (diff-618fe297fbe7a1da586380163b8fd2627311ccc217640a3c5cdc9c17a33472c1R73-R80))
if we don't introduce `onRunStart`. Or we need to call `setQueryType` in
each kernel run.
2024-01-22 16:08:55 -08:00
Jiajia Qin
2e0a388c36
[js/webgpu] Add HardSigmoid support (#19215)
### Description
This op is required in mobilenetv3-small-100. With this PR,
mobilenetv3-small-100 model becomes less than 10 ms from over 100 ms on
ADL.
2024-01-22 15:53:26 -08:00
Yulong Wang
f87e69801f
[js/web] show warning when numThreads is set but threads is not supported (#19179)
### Description
show warning when numThreads is set but threads is not supported.
Resolves #19148, #18933

for web: when crossOriginIsolated is false.
for node: always disable.
2024-01-17 15:04:22 -08:00
Yulong Wang
146ebaf91e
[js/web] allow proxy to load model with 1GB <= size < 2GB (#19178)
### Description

allow proxy to load model with 1GB <= size < 2GB

resolves #19157.
2024-01-17 15:03:43 -08:00
Rachel Guo
bd9d8fb2a5
[ORT 1.17.0 release] Bump up version to 1.18.0 (#19170)
### Description
<!-- Describe your changes. -->

Bump up version to 1.18.0 since the release branch has been cut.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2024-01-17 11:18:32 -08:00
Guenther Schmuelling
9dee543bed
fix gemm beta for fp16 (#19153)
per onnx spec beta is always fp32 so we need to cast it
2024-01-15 18:40:38 -08:00
Yulong Wang
f917dde717
[web] remove xnnpack from web backends (#19116)
### Description
XNNPACK is already disabled in web assembly build. This change removes
the xnnpack backend registration in JS.
2024-01-13 23:04:02 -08:00
Yang Gu
e803f8eb0f
[js/webgpu] Refactor timestamp-query and introduce timestamp-query-inside-passes (#18894)
We submit kernels in a batch (a fixed number 16 is used except for the
last batch) for better performance. However, timestamp query support is
at pass level so we disable the batch execution in profiling mode in
previous implementation. Actually we can have multiple passes in a batch
so that we don't have to disable batch execution, which is the first
enhancement of this PR.
Furthermore, WebGPU has an extension to support timestamp query inside
passes, which isn't supported by all the platforms (e.g., Windows
supports it, while macOS doesn't). This is expected to have lower cost
compared with multiple passes solution. So this PR also introduce this
support when available.
This PR also refactors some implementation related to kernelInfo, and
try to unify the related kernel names.
2024-01-13 00:23:17 -08:00
Yulong Wang
07cfc56538
[js] enable external data loading for ort-web (#19087)
### Description
enable external data loading for ort-web.

### Why
The ORT external data design is highly depending on the file system,
especially synchronous file I/O APIs. Those are not available in web
platforms. We need to have extra code to make external data working on
web.

### How
Considering there is no file system in web, an implementation for web to
support external data is to use pre-loaded data. Assume model file
a.onnx includes initializers that linked to ./b.bin, we require users to
pass a full data file list when creating the session. The user code will
be look like:
```js
const mySess = await ort.InferenceSession.create('./path/model/a.onnx', {
  // session options
  externalData: [
    {
      // relative or absolute path/URL of the file,
      // or a pre-loaded Uint8Array containing the data of the external data file
      data: './path/data/b.bin', 

      // the relative path of the external data. Should match initializers' "location" value defined in the model file
      path: './b.bin'
    },
    // { } if multiple external data file
  ]
});
```

Currently, this feature only works with JSEP build enabled.
2024-01-12 19:24:24 -08:00
Guenther Schmuelling
a756017e9f
[js/webgpu] more fixes for access above 2GB (#19065)
when jsep calls javascript with an index to HEAP8 or HEAP32 the index is
negative when the heap is above 2GB, even if we pass it as uint32_t it
remains negative. So in javascript use >>> 0 to make it unsigned.
2024-01-12 17:47:37 -08:00
Guenther Schmuelling
4a5f13b681
fix resize for fp16 (#19110)
resize for fp16 has 2 issues: scales are always f32 and roi can be f32
or f16.
scales:
this is fixed.

roi
this is fixed for the case where roi is not passed as optional input
with f16. To fix this it requires a much larger change and I did not
want to risk this short before a release. For all practical purpose
passing roi as input with f16 should be rare and we can fix it in the
near future.
2024-01-12 13:44:28 -08:00
Jiajie Hu
acba63c36a
[js/webgpu] Change A/sqrt(B) to A*inverseSqrt(B) in normalization ops (#19101)
### Description
Change `A / sqrt(B)` to `A * inverseSqrt(B)` in BatchNormalization,
InstanceNormalization, LayerNormalization and SkipLayerNormalization.

### Motivation and Context
For the same reason as the existence of the `inverseSqrt` built-in in
WebGPU spec.
2024-01-12 00:08:16 -08:00
Guenther Schmuelling
d0bac8216d
[js/webgpu] fix bcast in where (#19009) 2024-01-11 12:13:24 -08:00
Jiajia Qin
a89db01fce
[js/webgpu] disable GroupedConvVectorize path (#19090)
Disable createGroupedConvVectorizeProgramInfo path due to bots failures
on below two cases:
[webgpu]Conv - conv - vectorize group - B
[webgpu]Conv - conv - vectorize group - D
2024-01-11 08:13:14 -08:00
Jiajia Qin
fd6bab4250
[js/webgpu] Provide a vectorized algorithm for GroupedConv (#18884)
### Description
This PR provides a vectorized algorithm for NHWC GroupedConv to improve
performance.

The aggregate time of GroupedConv in mobilenetv2-12 becomes ~1ms from
~4ms on Intel Alder Lake machine. About 20% improvement for the whole
model.
2024-01-10 16:12:43 -08:00
Xu Xing
ed0f26d3d4
[js/webgpu] Revert parse norm attributes (#19074)
This resolves the below build errors:
```
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS2724: '"./ops/instance-norm"' has no exported member named 'parseInstanceNormAttributes'. Did you mean 'InstanceNormAttributes'?

19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS6133: 'parseInstanceNormAttributes' is declared but its value is never read.

19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS2305: Module '"./ops/layer-norm"' has no exported member 'parseLayerNormAttributes'.

20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
                      ~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS6133: 'parseLayerNormAttributes' is declared but its value is never read.

20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
```
2024-01-09 20:58:50 -08:00
Xu Xing
76dfe5347c
[js/webgpu] Support uniforms for instance-norm (#18929)
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
2024-01-09 14:56:00 -08:00
zesongw
ad6dd0a597
[WebNN] Enable npm unit tests (#18486)
### Description
- Support more test cases for WebNN EP in suite-test-list.jsonc
- Add DISABLE_WEBNN flag in build.ts as preparing for WebNN EP release
- Add test option: '--webnn-device-type' in test-runner-args-cli.ts to
support running WebNN 'gpu' deviceType
- Use Chrome Stable as default browser for WebNN testing to unblock the
CI limitation.
2024-01-09 10:10:57 -08:00
Xu Xing
557ac74c05
[js/webgpu] Support gemm uniforms (#19056)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-09 09:57:06 -08:00
Xu Xing
42ba2aed54
[js/webgpu] Support pad uniforms (#19057)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-09 09:34:56 -08:00
Xu Xing
eb92681bfb
[js/webgpu] Support range uniforms (#19055) 2024-01-09 09:33:57 -08:00
Xu Xing
dee6a5b371
[js/webgpu] Support uniforms for attention and multihead attention (#18903) 2024-01-09 07:46:30 -08:00
Xu Xing
8f024b7394
[js/webgpu] Support uniforms for layer-norm (#18755) 2024-01-08 18:16:25 -08:00
Jiajie Hu
447a3a7c70
[js/webgpu] Fix Expand/Gather when input type is bool (#18999)
### Description
Also update the op test suite.

### Motivation and Context
Previously the *total* size in case `Expand - last dim is not divisible
by 4` was a multiple of 4, even though the *last dimension* was not, so
the bug has never been caught.
2024-01-05 08:16:15 -08:00
Yulong Wang
b18abaaa2c
[js/web] wait for threadpool initialization (#18952)
### Description

a replacement of #18683. try to resolve #18689.

By specifying "-s PTHREAD_POOL_SIZE" flag in emscripten, it forces the
threadpool to initialize before the webassembly instance is available.
2024-01-04 08:06:55 -08:00
xhcao
867b9d8f04
[js/webgpu] Fix f16 errors for ConvTranspose2D (#18986)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-04 08:06:01 -08:00
Jiajie Hu
3b8b9147fa
[js/webgpu] Mitigate floating point accuracy issue in Resize (#18956)
### Description
The patch fixes a floating point accuracy issue in Resize by preferring
integer indices and integer arithmetic where possible.

### Motivation and Context
Model test `test_resize_upsample_sizes_nearest_floor_align_corners` was
observed to be failing on certain platforms. The root cause is the
inaccurate floating point evaluation of 21 / 7 (2.999... vs 3), which
results in the wrong input element to be indexed (floor(2.999...) vs
floor(3)).
2024-01-03 14:15:26 -08:00
Yang Gu
c5f3952b68
[js/webgpu] Introduce trace support (#18928)
This is to leverage console.timeStamp to add a single marker to
browsers' (only Chromium and Firefox support it) performance tool. With
this support, we can dump both CPU and GPU timestamps, and use
post-processing tool to clearly understand the calibrated timeline. A
demo tool can be found at https://github.com/webatintel/ort-test, and
more detailed info can be found at

https://docs.google.com/document/d/1TuVxjE8jnELBXdhI4QGFgMnUqQn6Q53QA9y4a_dH688/edit.
2024-01-03 10:13:17 -08:00
satyajandhyala
780fc3611b
[JS/Web] Sajandhy/webgpu resize scales rank check (#18954)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-29 09:23:27 -08:00
Jiajia Qin
44584c3ebe
[js/webgpu] only declare shape and strides in shader when necessary (#18940)
### Description
Previously, shape and strides were added unconditionally even they are
not used. This PR fixes this issue and only adds shape and strides when
they are required.

It's useful when some shapes are not used as uniform if the program
depends on type instead of rank.
2023-12-28 15:43:08 -08:00
Jiajia Qin
c613cc58a9
[js/webgpu] Fix shader compilation errors in Resize (#18947)
### Description
An extra right parenthesis was added by accidentally, which results some
resize cases fail. This PR fixes it.
2023-12-28 13:15:26 -08:00
satyajandhyala
3bbe4fe2ff
[JS/WebGPU] Add trilinear interpolation to Resize; activation_params attribute is optional for FusedConv also. (#18842)
### Description
Add trilinear interpolation to Resize and changed activation_params attribute as optional for FuseConv.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-27 16:21:29 -08:00
Guenther Schmuelling
31d4a21c4b
[js/webgpu] fix heap access > 2GB (#18914) 2023-12-27 15:22:05 -08:00
Xu Xing
0bc71b0c9b
[js/webgpu] Refactor attributes of pool (#18728) 2023-12-26 17:23:52 -08:00
Yulong Wang
9a61388f0a
[js/web] revise backend registration (#18715)
### Description
This PR revises the backend registration.

The following describes the expected behavior after this change:
(**bolded are changed behavior**)

- (ort.min.js - built without webgpu support)
    - loading: do not register 'webgpu' backend
- creating session without EP list: use default EP list ['webnn', 'cpu',
'wasm']
- creating session with ['webgpu'] as EP list: should fail with backend
not available
- (ort.webgpu.min.js - built with webgpu support)
    - loading: **always register 'webgpu' backend**
( previous behavior: only register 'webgpu' backend when `navigator.gpu`
is available)
- creating session without EP list: use default EP list ['webgpu',
'webnn', 'cpu', 'wasm']
        - when WebGPU is available (win): use WebGPU backend
- when WebGPU is unavailable (android): **should fail backend init,**
and try to use next backend in the list, 'webnn'
(previous behavior: does not fail backend init, but fail in JSEP init,
which was too late to switch to next backend)
    - creating session with ['webgpu'] as EP list
        - when WebGPU is available (win): use WebGPU backend
- when WebGPU is unavailable (android): **should fail backend init, and
because no more EP listed, fail.


related PRs: #18190 #18144
2023-12-20 14:45:55 -08:00
satyajandhyala
98510fb8fb
[JS/WebGPU] fix an error in Clip (#18799)
### Description
<!-- Describe your changes. -->
Check whether the min/max inputs are provided and use default values if not provided.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-19 13:51:01 -08:00
Jiajia Qin
8f7b89bd5b
[js/webgpu] Optimize NCHW layout for InstanceNormalization (#18123)
### Description
The changes in this PR includes:
1) Fix f16 errors in InstanceNormalization with NCHW format.
2) Use vec to further optimize the original algorithm.
3) (Removed) Don't do layout conversion for InstanceNormalization for
JSEP since InstanceNormalization itself is suitable for NCHW layout and
has better performance in our current implementation.

Tested on sd-vae-decoder-f16.onnx, it becomes 285 ms from 314 ms. The
aggregate gpu profiling data can be found as below (Note the data is
based change 3).):
Before:
<html>
<body>
<!--StartFragment--><span><span class="ui-provider ef bbg bbh bbi bbj
bbk bbl bbm bbn bbo bbp bbq bbr bbs bbt bbu bbv bbw bbx bby bbz bca bcb
bcc bcd bce bcf bcg bch bci bcj bck bcl bcm bcn" dir="ltr">

Kernel | Time (Ms) | Percentage (%)
-- | -- | --
Conv | 201.55 | 69.56
InstanceNormalization | 42.49 | 14.67
Transpose | 28.95 | 9.99
Mul | 5.69 | 1.96
Add | 3.82 | 1.32
MatMul | 3.27 | 1.13
Sigmoid | 2.24 | 0.77
Resize | 1.16 | 0.40
Softmax | 0.34 | 0.12
Cast | 0.24 | 0.08
Sum | 289.75

<br class="Apple-interchange-newline"><!--EndFragment-->
</body>
</html>
After:
<html>
<body>
<!--StartFragment--><span><span class="ui-provider ef bbg bbh bbi bbj
bbk bbl bbm bbn bbo bbp bbq bbr bbs bbt bbu bbv bbw bbx bby bbz bca bcb
bcc bcd bce bcf bcg bch bci bcj bck bcl bcm bcn" dir="ltr">

Kernel | Time (Ms) | Percentage (%)
-- | -- | --
Conv | 205.44 | 79.43
InstanceNormalization | 18.24 | 7.05
Transpose | 17.64 | 6.82
Mul | 5.69 | 2.20
Add | 3.81 | 1.47
MatMul | 3.56 | 1.38
Sigmoid | 2.24 | 0.86
Resize | 1.19 | 0.46
Softmax | 0.59 | 0.23
Cast | 0.24 | 0.09
Sum | 258.65 |  

</span></span><!--EndFragment-->
</body>
</html>

From above table, we can see that two ops time are greatly reduced. One
is InstanceNormalization and the other is Transpose. The reason that the
transpose time is reduced is because each InstanceNormalization is
surrounded with two reshape ops in sd-vae-decoder-f16.onnx. Due to JSEP
is prefer NHWC and InstanceNormalization is layout sensitive op, so two
extra transpose ops are inserted dynamically when executing this model.
After this change, those inserted transpose ops are not needed anymore.
So the overall transpose time is reduced.
2023-12-15 11:26:15 -08:00
Jiajia Qin
4bbed4c71a
[js/webgpu] Fix f16 errors in unary (#18839)
### Description
This PR fixes below errors:
```
no matching overload for operator > (vec4<f16>, vec4<f32>)
2023-12-15 11:25:12 -08:00
Yang Gu
81ad1e6ac3
[js/webgpu] Fix typo of outputShapes in profiling message (#18837) 2023-12-15 08:57:48 -08:00
Jiajia Qin
b30e721dc8
[js/webgpu] Provide a naive vectorized matmul algorithm (#18758)
### Description
This PR provided a vectorized matmul algorithm. In most situations, we
still go to the workgroup memory optimized matmul. But for some
situations, like N and K are very small, using workgroup optimized
matmul can't fully utilize the underlying hardware due to the 32x32 tile
size. So for very small N/K, we switch to the naive vectorized matmul
algorithm to improve the hardware execution unit usage.

With this PR, matmul with input0: [1, 36864, 3], input1: [1, 3, 3],
input2: [3] becomes less than 1 ms from 4.34 ms on Intel Gen9 GPUs.
2023-12-13 09:03:23 -08:00
satyajandhyala
0ca84549ab
[JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727)
### Description
<!-- Describe your changes. -->
Added uniforms to Reduce op


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve perforamnce.
2023-12-12 11:12:23 -08:00
satyajandhyala
d673e39ad8
[JS/WebGPU] Added uniforms to Tile and Where Ops (#18768)
### Description
<!-- Describe your changes. -->
Added uniforms to Tile and Where Ops


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance.
2023-12-11 20:58:52 -08:00
Jiajia Qin
b4be9e1bbb
[js/webgpu] Fix shader compilation errors in cumsum (#18779)
### Description
This PR fixes below shader compilation errors:
```
Tint WGSL reader failure: :39:31 error: no matching overload for operator + (f32, i32)

5 candidate operators:
  operator + (T, T) -> T  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (vecN<T>, T) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (T, vecN<T>) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (vecN<T>, vecN<T>) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (matNxM<T>, matNxM<T>) -> matNxM<T>  where: T is abstract-float, f32 or f16

                    sum = sum + get_inputByIndices(inputIndices);
                              ^


 - While validating [ShaderModuleDescriptor "CumSum"]
 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "CumSum"]).
2023-12-11 18:11:38 -08:00
Caroline Zhu
eb03032925
[js/web/training] lazyResetGrad implementation (#18711)
### Description
* implemented lazyResetGrad function

### Motivation and Context
* we are in the process of adding language bindings to enable training
on web
* lazyresetgrad ensures that the gradients are calculated correctly
after the first runTrainStep call

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-12-11 17:36:54 -08:00
Yulong Wang
efbef5f611
[js/webgpu] allow to specify callback for profiling data (#18732)
### Description

**This PR is a replacement of #17820.**

allow to specify callback for profiling data

*Previous*:
```js
ort.env.webgpu.profilingMode = 'default';  // enable profiling

// profiling data will output to console.
```

*Now*:
```js
ort.env.webgpu.profiling = {
  mode: 'default';  // enable profiling
  ondata: (data) => {
    // .. process the profiling data
  }
};

//for each kernel, "ondata" will be called once. only output to console if ondata is not specified.
```
2023-12-07 14:10:28 -08:00
Guenther Schmuelling
9aa7284351
fix lint error (#18708) 2023-12-05 10:37:03 -08:00
satyajandhyala
70816001cc
[JS/Web] AddedUniforms in GatherElements. (#18670)
### Description
Use Uniforms in GatherElements and clean-up



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
2023-12-05 09:19:53 -08:00
Xu Xing
f949e0580b
[js/webgpu] Support uniforms for pool (#18656) 2023-12-05 07:54:30 -08:00
satyajandhyala
10c547516d
[JS/Web] Added CumSum operator to JSEP (#18637)
### Description
Added CumSum operator



### Motivation and Context
Reduce CPU <->GPU data movement.
2023-12-05 07:51:53 -08:00
Caroline Zhu
c02a386145
[js/web/training] Implemented runEvalStep & runOptimizerStep (#18259)
### Description
* implemented runEvalStep and runOptimizerStep
* added hasEvalModel and hasOptimizerModel boolean fields in
TrainingSession representation
* added evalInputNames and evalOutputNames fields to
TrainingSessionHandler & TrainingSession
* removed the inputNamesEncoded and outputNamesEncoded fields from
TrainingSessionHandler -- since none of the training methods require the
input names and output names as parameters, there's no need to store
them.

### Motivation and Context
* part of the work for implementing web bindings for training
* previous PR: #18250

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-12-04 13:37:14 -08:00
Jiajia Qin
5353adcde3
[js/webgpu] Use the naive convTranspose when in/out channels are both 1 (#18658)
### Description
With this change, convTranspose with input0 [1, 18, 32, 1], input1 [1,
1, 16, 16] becomes 0.59ms from 6.64ms.
2023-12-04 13:18:37 -08:00
Jiajia Qin
92ee664f64
[js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661)
### Description
Currently, for non-uniform variables, we still use `array<u32, N>` type
instead of array<vec4<u32>, N1>`. So we can't always treat all variables
with rank > 4 as uniforms to index.

This PR fixes below errors:
```
error(s) generated while compiling the shader:
:5:44 error: index 4 out of bounds [0..1]
             return uniforms.input_strides[4] * (outputIndices[4] % uniforms.input_shape[4])+uniforms.input_strides[3] * (outputIndices[3] % uniforms.input_shape[3])+uniforms.input_strides[2] * (outputIndices[2] % uniforms.input_shape[2])+uniforms.input_strides[1] * (outputIndices[1] % uniforms.input_shape[1])+uniforms.input_strides[0] * (outputIndices[0] % uniforms.input_shape[0]);
                                           ^
FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - float32
FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - shape < input.size()
2023-12-01 15:35:35 -08:00
Xu Xing
73d9b03509
[js/webgpu] Add multidimensional(>4) uniform support (#18546)
This change removes the check of enableShapesUniforms. When all uses of
this are removed, enableShapesUniforms can be removed too.
2023-11-30 17:10:33 -08:00
Jiajia Qin
6781b6cf3d
[js/webgpu] add bool type for Expand/Gather (#18615)
### Description
In [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model,
it uses expand with bool type running on cpu ep.




| Kernel    | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" :
[{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" :
"657","output_type_shape" : [{"bool":[1,1,625,625]}] |
CPUExecutionProvider |

After this change, it will run on jsep.
| Kernel    | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" :
[{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" :
"657","output_type_shape" : [{"bool":[1,1,625,625]}] |
JsExecutionProvider |
2023-11-30 15:47:08 -08:00
Jiajia Qin
b1e749e3be
[js/webgpu] Add program name into webgpuProfiling info (#18640)
### Description
Currently, we only print the kernelName, which is hard to distinguish
which shader we actually used. For example, GroupedConv/Conv2DMatMul
both belong to Conv kernel. It's not intuitive for profiling.
2023-11-30 12:57:29 -08:00
Yang Gu
227dcb3a88
[js/webgpu] Log the key and program info for artifact (#18365)
With uniform support, ideally we may just keep one artifact for each
program to save the compilation time. This PR just logs the related
info, including key and program name, so that we may understand better
the situation.
2023-11-29 18:01:12 -08:00
satyajandhyala
7335760424
[JS/Web] Add uniforms to Einsum (#18531)
### Description
Add uinforms to Einsum



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance.
2023-11-29 15:30:33 -08:00
Yulong Wang
50e6235af1
[js/web] allow ShaderHelper to use internal (non-I/O) variables (#18525)
### Description
This PR includes a change that inspired from #18452 to resolve a
requirement: a shader may depend on an instance of `IndicesHelper` to
generate WGSL code snippet, but the IndicesHelper instance is not
necessarily an input/output of the program. So the existing
`declareVariables()` function does not work with this scenario.

In order to support this requirement, I added this "use" function to
`interface ShaderHelper`, which takes a helper-like object as parameter.
The hidden implementation `ShaderHelperImpl` class will iterate the
helpers and call `impl()` for each.

@axinging @qjia7
2023-11-28 15:15:59 -08:00
Jiajia Qin
fc8631e2f1
[js/web] Fix conv2dMatmul errors due to #18452 (#18562)
### Description
Currently, all conv2dMatmul with inChannels = 3 and outChannels % 4 = 0
will report compilation errors. Models, which include this kind of shape
will be impacted, like mobilenetv2-12, resnet50 .

The errors is introduced by #18452
https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200,
which accidentally pass `components` to `x`. But `x`'s components is
`innerElementSize` not `components `. And when `innerElementSize` is 3,
we should use `1` in current design.
2023-11-27 21:21:47 -08:00
Caroline Zhu
dd355e39a0
[js/web/training] Added parameters methods (#18250)
### Description
* Implemented: `getParametersSize`, `getContiguousParameters`
(equivalent to copyParametersToBuffer), and `loadParametersBuffer`
(equivalent to copyParametersFromBuffer)
* as part of these changes, getParametersSize was added to the
TrainingSession interface so that users know what size buffer to create
for loadParametersBuffer
* The parameters methods in the interface were modified to take in a
Float32Array instead


### Motivation and Context
* part of the work for implementing web bindings for training
* enables federated learning in the web
* previous  PR: #18006

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-11-27 10:30:13 -08:00
Jiajia Qin
64dacc2892
[js/webgpu] Add BatchNormalization Op (#18468)
### Description
This PR adds `BatchNormalization` with `float` support.

Some Todos:
1. all inputs don't have same data type. For example, x/y is float16,
but bias/scale is float32 or double.
2. training mode support.

We see many models are using `BatchNormalization` ops. However, due to
the missing in jsep, all of them run on cpu, which result very poor
performance. With this PR's support, densenet-9 model becomes 20.29 ms
from 250.69 ms.
2023-11-22 15:58:06 -08:00
Xu Xing
fa106942a7
[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452)
This change refactored matmul/conv related programs to support shape
uniforms. Currently only matmul shape uniforms are fully enabled.
TODOs: add input dependencies for conv related programs, turn clipMax
and clipMin to uniforms.
2023-11-22 14:42:55 -08:00
satyajandhyala
841f7ed3e0
[[JS/Web]Added uniform to Expand op. (#18558)
### Description
<!-- Describe your changes. -->
Added Uniforms to Expand operator kernel


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
2023-11-22 14:14:24 -08:00
Arthur Islamov
1c555c5fc1
[JS/Web] Resize & BiasSplitGelu fp16 support (#18536)
### Description
Resize and BiasSplitGelu fp16 support on WebGPU
2023-11-22 12:12:07 -08:00
Yulong Wang
c7fd930330
[js/web] unify resolve rules for "Clip" (#18527)
### Description
It was a mistake to use 2 different names for Clip operator in
op-resolve-rules.ts for different opset. An optimized implementation can
handle both cases (opset < 11 and opset >=11). Remove "ClipV10" as an
entry from the table.
2023-11-20 23:18:06 -08:00
Jiajia Qin
abdf8b7c3f
[js/webgpu] Optimize broadcast binary. (#18185)
### Description
Currently, the binary algorithms are divided into the vectorize one
(efficient) and non-vectorize one (less efficient). Below situations
will go to the vectorize one:
1) A or B's shape length is 1.
2) The shared dimensions length of A and B are divisible by 4.
3) A and B have same shape.

This PR adds another situation as below to go to the vectorize
algorithm.
4. A or B's last dimension is divisible by 4.

With this change, the aggerate time of Add in sam-b-encoder becomes
309.65 ms from 409.12 ms on Intel ADL.
2023-11-20 16:52:17 -08:00
Yulong Wang
247ce21859
[js] optimize eslint config (#18460)
### Description
optimize eslint config to:
- set parserOptions.project to `true` to allow @typescript-eslint/parser
to find the nearest tsconfig.json file to that source file. This helps
to avoid parsing extra files, may helps with:
- reduce the possibility of seeing OOM or stackoverflow with "npm run
lint"
   - faster processing
- enforce rule "no-underscore-dangle" with a list of exceptions.
2023-11-20 12:00:56 -08:00
Arthur Islamov
fac3e33da5
[js/web] JSEP Attention & MultiHeadAttention (#17742)
### Description
This is a narrow implementation of Attention/MultiHeadAttention as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask

But it works well for StableDiffusion and can be extended later. It
reduces VRAM usage as it combines many ops into few
I've updated demo here https://islamov.ai/stable-diffusion-webgpu/ it
takes ~13sec for 1 image with 20 steps on RTX3090Ti and about 25s on M1
Pro
VRAM usage is about 8gb if you don't use img2img

Going to focus on SDXL now

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-11-17 12:23:52 -08:00
satyajandhyala
b291b20fa0
[JS/Web]Added uniforms support to Slice op. (#18422)
### Description
Support uniforms in Slice op



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve ferformance
2023-11-16 09:44:13 -08:00
Yulong Wang
586f06f5a1
[js/web] set noUnusedParameters to true and fix a few bugs (#18404)
### Description
- set tsconfig "noUnusedParameters" to `true` and fix a few bugs
discovered by typescript.
   how unused parameter is fixed:
- for most code (webgl), add underscore as prefix, which is the standard
ignore pattern for typescript check.
- remove unused parameter from function and modify corresponding
function calls (jsep)
- fix a bug in ArgMinMax: this 2 operators do not have more than one
input(s) so the `createArgMinMaxAttributesFromInputs()` is removed.
- add proxy main.ts into typescript check and fix a bug in parameter
passing
   - fixed `run()` function call and add typecheck fix (hack)
2023-11-15 09:16:29 -08:00
Xu Xing
949ac4b7ce
[js/webgpu] Support uniforms for gather (#18312) 2023-11-13 11:24:34 -08:00
Wanming Lin
73ed34ac4b
[WebNN EP] Support numThreads option for WebNN CPU device (#18054) 2023-11-12 16:45:10 -08:00
Xu Xing
0c8c0014f6
[js/webgpu] Use builtin num_workgroups to fix shader key conflict (#18387)
This fixes conformance failure of tinyyolov2-8 and potential shader key
conflict issues.
2023-11-10 17:37:45 -08:00
Yulong Wang
6b0c97b43f
[js/web] fix typescript type check (#18343)
### Description

This PR fixes the TypeScript type check.

Previously, when I use esbuild to replace webpack (#17745), typescript
typecheck was disabled. This causes a few TypeScript type error checked
in into the code base. This PR fixes the followings:

- Use "Node16" as default "module" value in tsconfig.json, because in
TypeScript v5, `(module == "ES2015" && moduleResolution == "Node16")` is
an invalid combination.
- Set `noUnusedParameters` to true as default. in web override it to
false because multiple code need to be updated ( a following-up PR will
do this )
- set correct project file for 'web/lib/**/*.ts' for ESLint (otherwise
WebGPU types are not populated correctly)
- fix type error in file js/web/lib/wasm/jsep/webgpu/program-manager.ts
- upgrade "@webgpu/types" to latest to fix type error in file
js/web/lib/wasm/jsep/backend-webgpu.ts
- add package script "prebuild" for web to run tsc type check
- add type check in CI yml file
2023-11-10 16:03:38 -08:00
Xu Xing
8dba6efd61
[js/webgpu] Add uniforms support to concat op (#18238) 2023-11-10 13:46:03 -08:00
Jiajia Qin
28c23aed04
[js/webgpu] Fix conv2d with activation (#18388)
### Description
Fix #18297

With PR #17766, conv2d activation in mobilenetv2-12 will not be empty.
However, activation is not supported yet in
[biasActivationSnippet](https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/ops/3rd-party/activation_util.ts#L48C14-L48C36).
This PR makes all places unify to use
[getActivationSnippet](https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/ops/fuse-utils.ts#L13)
to fix this issue.
2023-11-10 12:54:35 -08:00
Xu Xing
dd1bb760eb
[js/webgpu] Fix scalar uniform (#18318) 2023-11-10 10:12:22 -08:00
Xu Xing
829d802337
[js/webgpu] Support uniform for softmax (#18345) 2023-11-09 11:19:23 -08:00
Guenther Schmuelling
25fbc2b0ab
fix fused relu activation (#18303) 2023-11-09 08:18:21 -08:00
Jiajia Qin
606356d0b1
[js/webgpu] Simplify the Resize shader when noScale is true (#18321)
### Description
For Resize, when `noScale` is true, the shader can become very simple,
which is not related with `attributes.mode` anymore. So we should remove
those parts of shader code for simplification.

This PR can also fix #18311 since the `noScale` are all true in that
model.

However, #18311 also exposes that the Resize implementation for `linear`
mode has bug. It seems that the currently implementation always treat
the input as either 2d or 4d tensor, however, the actual input is 3d
tensor, that's why the shader compilation is failed. We may need to fix
it in a separate PR.
2023-11-07 12:54:20 -08:00
satyajandhyala
a16d528399
[JS/Web] Added Uniforms support to binary ops. (#18260)
### Description
Added Uniform support to binary ops



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
To improve performance
2023-11-07 08:41:52 -08:00
satyajandhyala
e207060ac9
[JS/Web] Added Unifroms support to unary ops. (#18223)
### Description
Added uniforms support to unary ops.


### Motivation and Context
Improve performance
2023-11-03 09:30:54 -07:00
xhcao
8d48d3e9cc
[js/web] optimize reduce related operators (#17957)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-02 12:51:48 -07:00
Caroline Zhu
e3b043ba17
[js/web/training] runTrainStep implementation (#18006)
### Description
* based on design document & following InferenceSession's run
implementation, implemented TrainingSession.runTrainStep

### Motivation and Context
* Adding web bindings for training

#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allowed for training package to be built + added training
backend to web package
* #17891 implementation for createTrainingSession on the TypeScript side
**[SHOULD BE MERGED IN BEFORE THIS PR]**

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-11-02 08:32:50 -07:00
satyajandhyala
a2e9ba72d5
[JS/Web]Added FusedConv. (#17766)
### Description
Added FusedConv and FusedConvTranspose



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
2023-11-01 15:34:51 -07:00
Jiajia Qin
785e2b1eae
[js/webgpu] Optimize softmax by vector (#18153)
### Description
This PR enables `softmax` outputs max supported components instead of
scalar for each thread.

Softmax with input[0]: [12,4096,4096] becomes 47.86 ms from 55.11 ms
2023-10-30 16:05:35 -07:00
Yulong Wang
9bba990871
[js/web] fix a few package consuming problems (#18109)
### Description
This PR tries to fix a part of the NPM package consuming problems for
onnxruntime-web (ES module) as described in #10913:

- reduce the package size to fit the 150MB restriction in jsdelivr, by
removing dev build targets for uncommon exports
- add default export to support `import ort from 'onnxruntime-web';`
(currently only support `import * as ort from 'onnxruntime-web';`
2023-10-30 08:11:43 -07:00
Yang Gu
52f4968359
[js/webgpu] Change timestamp-query-in-passes to timestamp-query (#18108)
Timestamp-query has a broader support than timestamp-query-in-passes on
all the platforms, including macOS.
Note that to enable timestamp-query, you still need to add switch
"--enable-dawn-features=allow_unsafe_apis" to Chrome. By default, the
lowest 16 bits are masked with 0 (at a granularity about 0.1ms) for
privacy. To get the highest precision, you need to add another switch
"--enable-webgpu-developer-features".
2023-10-26 16:33:03 -07:00
Caroline Zhu
64de71c5e2
[js/web/training] Add CreateTrainingSession (#17891)
### Description
* Adds TrainingSession.create() functionality following the web bindings
for training design doc
* Added 2 new training APIs to wasm/api.h:
   * OrtTrainingGetInputOutputName
   * OrtTrainingGetInputOutputCount
* Moved isOrtEnvInitialized boolean to the wasm-core-impl and added a
method that references it

### Motivation and Context
* Adding web bindings for training

#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allows for training package to be built + adds training backend
to web package **[MUST BE MERGED IN BEFORE THIS ONE]**

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-10-26 09:22:10 -07:00
satyajandhyala
f3cfe08c42
[JS/Web] Enabled 1d spacial input to GlobalAveragePool (#17973)
### Description
Enable one-dim special  input to GlobalAveragePoll input



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Currently only 2D input is supported.
2023-10-23 16:02:50 -07:00
Jiajia Qin
8a12b2cea6
[js/webgpu] Fix the transpose error when dims > 4D (#18027)
### Description
<!-- Describe your changes. -->
Currently, the uniform support has bugs when dims rank is larger than 4.
See https://github.com/microsoft/onnxruntime/issues/17860 item 1.
So this PR only enables shapes uniforms when shape rank is <= 4 for
transpose. Otherwise, below compilation errors are thrown:
```
1 error(s) generated while compiling the shader:
:3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead.
      struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
                                                 ^^^^^^^^^^^^^

:3:7 note: see layout of struct:
/*            align(4) size(84) */ struct Uniforms {
/* offset( 0) align(4) size( 4) */   output_size : u32;
/* offset( 4) align(4) size(20) */   a_shape : array<u32, 5>;
/* offset(24) align(4) size(20) */   a_strides : array<u32, 5>;
/* offset(44) align(4) size(20) */   output_shape : array<u32, 5>;
/* offset(64) align(4) size(20) */   output_strides : array<u32, 5>;
/*                              */ };
      struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
      ^^^^^^

:4:42 note: 'Uniforms' used in address space 'uniform' here
      @group(0) @binding(2) var<uniform> uniforms: Uniforms;
                                         ^^^^^^^^
```
2023-10-23 11:02:19 -07:00
Arthur Islamov
22947109f2
[js/web] FP16 LayerNorm, InstanceNorm, SkipLayerNorm (#17630)
### Description
This PR includes fixes for Norm operations to support FP16 and also some
optimizations to use vec2/vec4 if possible
2023-10-18 10:47:41 -07:00
Caroline Zhu
c373a808a2
Add "glue" between training WASM artifacts and training web (#17474)
### Description
* follows the packaging approach according to the design document
    * adds `ENABLE_TRAINING` boolean flag to `BUILD_DEFS`
    * modifies `package.json` to include training submodule
* modifies build script to handle, validate, and minimize training WASM
artifacts
* adds the binding for the new backend with training enabled & the new
training artifacts
    * adds training backend
    * edits `index.ts` to use training backend depending on `BUILD_DEFS`
    * edits `wasm-factory.ts` to use the training artifacts if necessary

### Motivation and Context
* we are in the process of adding web bindings to enable training. 
* Adding the "glue" to allow onnxruntime-web to use the training WASM
artifacts is required for this work.
* Since BUILD_DEFS is defined and used at build time, I thought that it
made sense to bundle the changes to building in the same PR.
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 must be merged in before this one

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-12 11:16:56 -07:00
Yulong Wang
d532645bed
[js/webgpu] revise uniform support (#17871)
### Description
<!-- Describe your changes. -->

work for items (2) and (3) in #17860
2023-10-11 16:41:46 -07:00
Yulong Wang
5228332c9f
[js] upgrade JS shared dev dependencies (#17831)
### Description
upgrade JS shared dev dependencies.

- webpack: removed
- eslint: upgrade to latest.
   - eslint config upgraded to compatible with latest version
- typescript upgrade to v5
   - update module "CommonJS" to "Node16" in tsconfig
- update deprecated config "importsNotUsedAsValues" to
"verbatimModuleSyntax"
- remove webpack bundles in onnxruntime-common
2023-10-10 17:44:39 -07:00
Yulong Wang
d9b9c5a537
[js/webgpu] support using uniform buffer (#17803)
### Description
support using uniform buffer.

This PR allows to use uniform buffer in shader program, so that some
runtime information (eg. input/output shape) is no longer need to be
hardcoded into shader code.

There are 2 commits in this PR:
-
[667f31c](667f31c83d):
framework changes to support uniform buffer, as well as updates in
program manager, gpu data manager and indices helper.
-
[09e1d2a](09e1d2ad1d):
an example change for operator `Transpose` to use input's rank-only
instead of dims as shader key. With this change, model mobilenetv2-12
shader compile times dropped from 71 to 52.
2023-10-10 00:31:12 -07:00
Yulong Wang
6ea493571e
[js/web] use esbuild to accelerate bundle build (#17745)
### Description

Use esbuild to accelerate bundle build.

This change uses esbuild to replace webpack for onnxruntime-web. Bundle
build time reduced from ~20sec to ~0.6sec on my windows dev box.

A few changes applied:
- import nodejs modules using "node:" prefix
- remove enum declaration inside namespace (EncoderUsage)
- use "fs/promise" to replace the old promisify from "util"
- separate ort-web and test-runner. Previously they are bundled
together, now they are built into 2 files.
- optimize karma runner launch time
- remove unnecessary sourcemap preprocessor. sourcemaps are handled
inside esbuild
- remove unnecessary proxies (because ort-web and test-runner are
separated now, the path are correctly inferred)
    - remove file watcher for test data
- optimize special handling as esbuild plugins:
- polyfill dummy imports for node.js modules when targetting browser.
    - load as content string for ort-wasm-*.worker.js
    - load as content string for ./proxy-worker/main.ts
- a source patch to ort-wasm*-threaded*.js (see details in comments in
code)
- updated debug configurations for sourcemap mapping to ensure
out-of-box good dev experience
2023-10-06 13:37:37 -07:00
Jiajia Qin
db3901ab97
[js/webgpu] Enable the NCHW ConvMatMul path (#17717)
1) Enable pointwise NCHW conv2d by MatMul.
2) Enable non-pointwise NCHW conv2d by convMatMul.
3) Fix bug when `sameSize` is true

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-05 00:26:01 -07:00
Xu Xing
992f3e4609
[js/webgpu] Support where (#17544)
Supported type: float. int32_t, uint32_t, bool.
Case where_broadcast.jsonc is not enabled due to
https://github.com/microsoft/onnxruntime/issues/17405.

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-03 14:28:21 -07:00
Guenther Schmuelling
f8a8452a6b
[js/webgpu] fix pad operator (#17775)
fix pad operator
2023-10-03 13:39:50 -07:00
Arthur Islamov
d0519a7603
[js/web] BiasSplitGelu and BiasAdd kernels (#17161)
### Description
Two contrib kernels that supposed to speed-up StableDiffusion according
to this doc
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

However, there is no noticable effect in speed or memory consumption. So
i guess the only way to make it faster is to implement
MultiHeadAttention but i'm not capable of doing that right now. So i'll
focus on existing PRs and finding the JSEP kernel that produces
incorrect results. It should be one of the old ones (i suspect Conv or
ConvTranspose), as SD was not generating images correctly on webgpu
since i started working on it. I hoped someone else would fix that by
the time i finish with kernels/optimizations 😅

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-03 12:20:20 -07:00
Yulong Wang
451c02543a
[js/webgpu] allow specify preferredLayout (#17756)
### Description
Allow WebGPU backend to specify `preferredLayout`. Default is NHWC.

```js
const options = {executionProviders: [{name:'webgpu', preferredLayout: 'NCHW'}]};
sess1 = await ort.InferenceSession.create('./mobilenetv2-12.onnx', options);
```

### Motivation and Context
- implement @qjia7's requirement for an easier way to do performance
comparison between NCHW vs NHWC.
- It's possible that NCHW does better on some models and NHWC on others.
So offer user the capability to switch.
2023-10-02 21:25:12 -07:00
xhcao
0d60604638
[JS/WebGPU] support Range operator (#17233)
The patch also introduces the method which copies
data from GPU to CPU synchronously.

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-30 02:05:32 -07:00
Arthur Islamov
a941dd583e
[js/web] FP16 Conv, ConvTranspose and MatMul (#17514)
### Description
Another three ops for fp16

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-09-30 00:00:23 -07:00
Caroline Zhu
6a5f469d44
Add training interfaces to js/common (#17333)
### Description
Following the design document:
* Added CreateTrainingSessionHandler to the Backend interface
* All existing Backend implementations throw an error for the new method
createTrainingSessionHandler
* Created TrainingSession namespace, interface, and
TrainingSessionFactory interface
* Created TrainingSessionImpl class implementation 

As methods are implemented, the TrainingSession interface will be added
to or modified.

### Motivation and Context
Adding the public-facing interfaces to the onnxruntime-common package is
one of the first steps to support ORT training for web bindings.

---------

Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com>
2023-09-29 19:05:10 -07:00
Yulong Wang
561aca97cf
[js/webgpu] support IO binding (#17480)
<del>
**This PR is based on a few prerequisites PRs. They are listed as
below:**
- #17465
- #17469
- #17470
- #17472
- #17473
- #17484

Please review the current change by only looking at commit
e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later.


</del>

### Description

This PR introduces WebGPU IO binding. This new feature allows
onnxruntime-web users to use tensors created from GPU as model
input/output so that a model inferencing can be done without unnecessary
data copy between CPU and GPU for model input/output.

### Examples

An E2E demo/example is being worked on.

Following is some simple demo with code snippet.

Let's first check today how we do:
```js
// STEP.1 - create an inference session:
const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] });

// STEP.2 - create model input: (supposing myImageCpuData is a Float32Array)
const feeds = {
  'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3])
};

// STEP.3 - run model
const myResults = await mySession.run(feeds);

// STEP.4 - get output data
const myData = myResults['output_image:0'].data; // Float32Array

```

#### for inputs (GPU tensor):

Now, with IO binding, you can create a tensor from a GPU buffer, and
feed it to the model:
```js
// new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data)
const feeds = {
  'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] })
};
```

### for outputs (pre-allocated GPU tensor)

you can also do that for output, **if you know the output shape**:
```js
// new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object)
const fetches = {
  'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] })
};

// new STEP.3 - run model with pre-allocated output (fetches)
const myResults = await mySession.run(feeds, fetches);
```

### for outputs (specify location)

if you do not know the output shape, you can specify the output location
when creating the session:

```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
    executionProviders: ['webgpu'],
    preferredOutputLocation: "gpu-buffer"
});
```

if the model has multiple outputs, you can specify them seperately:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
    executionProviders: ['webgpu'],
    preferredOutputLocation: {
         "output_image:0": "gpu-buffer"
    }
});
```

now you don't need to prepare the `fetches` object and onnxruntime-web
will prepare output data on the location that specified.

#### read data

when you get the output tensor, you can:
```js
// get the gpu buffer object:
const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer

// get the CPU data asynchronizely
const cpuData = await myOutputTensor.getData();

// get the CPU data asynchronizely and release the underlying GPU resources
const cpuData = await myOutputTensor.getData(true);

// dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called.
myOutputTensor.dispose();
```

#### resource management

JavaScript has GC so you don't need to worry about managing JavaScript
objects. But there are 2 types of resources that are not managed by GC:
- GPU buffer that used in tensors
- Underlying ORT native resources

To simplify, most of the unmanaged resources and handled inside ORT web.
But there are a few resources that need users to manage:
- All external GPU resources, including GPU buffers inside all tensors
created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User
should manage those GPU buffers themselves.
- When a session is created with `preferredOutputLocation` ==
"gpu-buffer" specified in session options, and the corresponding output
is not pre-allocated, user need to call the output tensor's `dispose()`
or `getData(true)` to manually release the underlying GPU buffers.
- ORT internal errors (including providing a pre-allocated output tensor
with wrong type/dims) will invalidate the whole wasm memory and is not
recoverable. An exception is thrown in this situation.
2023-09-29 11:24:42 -07:00
satyajandhyala
b4fbc25b1f
[JS/Web] Add ConvTranspose implementation using MatMul (#17573)
### Description
Add ConvTranspose implementation using MatMul to increase perf.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-29 11:00:44 -07:00
Jiajia Qin
891fba3b9c
[js/webgpu] Optimize Gather op (#17625)
### Description
This PR optimizes the gather op, which is improved ~6ms in segment
anything model in ADL.
The problem in original algorithm is that it includes a for loop to
calculate a block size of data. However, the block size may be very
large, like `65536`. In GPU shader, we should try to avoid large loop in
shader and try to use more threads to do it parallelly.

Before:
```
[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 6886207 ns
```
After:
```
[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 11719 ns
2023-09-21 21:00:36 -07:00
Jiajia Qin
cd3fb377ea
[js/webgpu] Allow binary ops with scalar to use the vectorize path (#17589)
### Description
1. For binary ops, the components is always 4. So the dispatchGroup
should be : `{x: Math.ceil(outputSize / 64 /* workgroup size */ / 4 /*
component size */)}` instead of `{x: Math.ceil(outputSize / 64 /*
workgroup size */ / (vectorize ? 4 : 1) /* vec size */)}`.

2. If any of a or b only has one element, we still can use the vectorize
path since the same value will be broadcasted.
2023-09-21 20:55:08 -07:00
Arthur Islamov
498b60d8a4
[js/web] fp16 Pool & Reduce (#17512)
### Description
Two more ops to support fp16
2023-09-21 14:52:13 -07:00
Vincent Wang
e6301eee6a
Bump Up Version to 1.17.0 (#17587)
Bump up version to 1.17.0 as the 1.16.0 release branch had been branched
out.
2023-09-20 11:02:58 +08:00
Arthur Islamov
0f406ca1d3
[js/web] FP16 binary and unary ops (#17515)
### Description
Binary and unary ops with fp16 support
2023-09-18 15:43:32 -07:00
Yulong Wang
9aafbe3feb
[js/web] revise TensorView (#17473)
### Description

This change:
- removes the unused `Tensor` types declared in
/js/web/lib/wasm/jsep/tensor.ts
- removes duplicated util functions in  /js/web/lib/wasm/jsep/tensor.ts
- renames /js/web/lib/wasm/jsep/**tensor.ts** to
/js/web/lib/wasm/jsep/**tensor-view.ts** and update corresponding
references. It was kind of confusing that we have multiple `Tensor`
types defined in different places also we have multiple `tensor.ts`
source files.

This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.

list of prerequisites PRs:
https://github.com/microsoft/onnxruntime/pull/17465
https://github.com/microsoft/onnxruntime/pull/17469
https://github.com/microsoft/onnxruntime/pull/17470
https://github.com/microsoft/onnxruntime/pull/17472
https://github.com/microsoft/onnxruntime/pull/17473 (this one)
2023-09-14 21:14:44 -07:00
Jiajia Qin
41d2ff622c
[js/webgpu] Optimize InstanceNormalization (#17491)
### Description
<!-- Describe your changes. -->
In previous implementation, there are two loops to iterate H * W
elements to calculate the `mean` and `squaredNorm` value in one thread,
meanwhile it outputs H * W elements in one thread. That results it's
very very slow when H * W is a large value. And usually, H * W does be a
large value in a model. For example, in the `candy-8` model, the shapes
of [H, W] are [224,224], [112,112], [56,56] for `InstanceNormalization`
op. And in my ADL, `[1,224,224,32]` consumes 17 ms. See below:
```
[profiling] kernel "23848328|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,224,224,32] | float32, execution time: 17007914 ns
```

In this PR, it uses workgroup memory to optimize the original algorithm.
The advantage is that it can parallelly utilize the 64 (workgroupSize)
threads in one workgroup to calculate `mean` and `squaredNorm` value.
Meanwhile, it only outputs `H * W / workgroupSize` outputs for one
thread, which greatly reduces the overhead for one thread. With this
optimization, `[1,224,224,32]` becomes 3 ms and the main overhead is the
extra two `transpose`. The `createInstanceNormProgramInfo` only needs
`0.64` ms. See below:
```
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] | float32, output[0]: [1,32,224,224] | float32, execution time: 1543792 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,32,224,224] | float32, execution time: 642652 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, output[0]: [1,224,224,32] | float32, execution time: 991608 ns
```
This PR currently only applies the new algorithm to NCHW format. For
NHWC format, one way is to transpose the input so that it can use the
new algorithm. But the disadvantage is that 2 extra transpose are added.
@dakenf also gives another way to optimize NHWC. Details see
[here](d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts).
I checked @dakenf's method. The perf is similar with transpose +
optimized NCHW. But on different GPUs, one is a little better than
another or vice versa. So I prefer this PR only does the NCHW part.
@dakenf can submit his optimization on NHWC.
2023-09-14 17:03:18 -07:00
xhcao
198d468849
[WebGPU/JS] Added Pad operator support (#16928)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-14 13:14:11 -07:00
Arthur Islamov
03b56f7a73
[js/webgpu] FP16 extension registration (#17493)
### Description
First small change to support FP16

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-09-13 13:11:17 -07:00
Yulong Wang
a2e75114cc
[js/web] add sessionOptions.freeDimensionOverrides (#17488)
### Description
Allows to specify fixed size for dynamic input of a model. resolves
#16707

Pending test
2023-09-13 09:17:34 -07:00
Yulong Wang
41584b2827
[js/web] ensure ORT initialization to run only once (#17529)
### Description
ensure ORT initialization to run only once
2023-09-12 23:52:08 -07:00
Arthur Islamov
65249f42e4
[js/web] FP16 Gemm, Softmax & Transpose (#17494)
### Description
First three OPs to support fp16. Will add more once this gets merged
since others depend on changes in js_data_types
2023-09-11 21:09:37 -07:00
satyajandhyala
bf6d6961cc
[JS/Web] Added Einsum operator support. (#17401)
### Description
Added Einsum operator support to JSEP.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-11 15:57:15 -07:00
xhcao
9017ea131b
[js/webgpu] support GreaterOrEqual and LessOrEqual operators (#17310)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-07 17:41:16 -07:00