Commit graph

211 commits

Author SHA1 Message Date
Yulong Wang
efbef5f611
[js/webgpu] allow to specify callback for profiling data (#18732)
### Description

**This PR is a replacement of #17820.**

allow to specify callback for profiling data

*Previous*:
```js
ort.env.webgpu.profilingMode = 'default';  // enable profiling

// profiling data will output to console.
```

*Now*:
```js
ort.env.webgpu.profiling = {
  mode: 'default',  // enable profiling
  ondata: (data) => {
    // .. process the profiling data
  }
};

// For each kernel, "ondata" is called once. Data is output to the console only if "ondata" is not specified.
```
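
A minimal sketch of what an `ondata` handler might do, for illustration only. It assumes each profiling record carries `kernelName`, `startTime`, and `endTime` fields; those field names and the helper `createProfilingCollector` are assumptions made for this example and are not confirmed by the description above.

```javascript
// Hypothetical helper: sums elapsed time per kernel from profiling records.
// ASSUMPTION: each record has kernelName, startTime and endTime fields; the
// real shape of the data passed to "ondata" is not shown in this description.
function createProfilingCollector() {
  const totals = new Map();  // kernelName -> accumulated (endTime - startTime)
  return {
    ondata(data) {
      const elapsed = data.endTime - data.startTime;
      totals.set(data.kernelName, (totals.get(data.kernelName) ?? 0) + elapsed);
    },
    totals,
  };
}

// Usage (not executed here because it needs onnxruntime-web in a browser):
// const collector = createProfilingCollector();
// ort.env.webgpu.profiling = { mode: 'default', ondata: collector.ondata };
```

Keeping the aggregation in a closure means the handler can be passed around without binding `this`.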
2023-12-07 14:10:28 -08:00
Guenther Schmuelling
9aa7284351
fix lint error (#18708) 2023-12-05 10:37:03 -08:00
satyajandhyala
70816001cc
[JS/Web] Added Uniforms in GatherElements. (#18670)
### Description
Use Uniforms in GatherElements and clean-up



### Motivation and Context
Improve performance
2023-12-05 09:19:53 -08:00
Xu Xing
f949e0580b
[js/webgpu] Support uniforms for pool (#18656) 2023-12-05 07:54:30 -08:00
satyajandhyala
10c547516d
[JS/Web] Added CumSum operator to JSEP (#18637)
### Description
Added CumSum operator



### Motivation and Context
Reduce CPU <-> GPU data movement.
2023-12-05 07:51:53 -08:00
Caroline Zhu
c02a386145
[js/web/training] Implemented runEvalStep & runOptimizerStep (#18259)
### Description
* implemented runEvalStep and runOptimizerStep
* added hasEvalModel and hasOptimizerModel boolean fields in
TrainingSession representation
* added evalInputNames and evalOutputNames fields to
TrainingSessionHandler & TrainingSession
* removed the inputNamesEncoded and outputNamesEncoded fields from
TrainingSessionHandler -- since none of the training methods require the
input names and output names as parameters, there's no need to store
them.

### Motivation and Context
* part of the work for implementing web bindings for training
* previous PR: #18250

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-12-04 13:37:14 -08:00
Jiajia Qin
5353adcde3
[js/webgpu] Use the naive convTranspose when in/out channels are both 1 (#18658)
### Description
With this change, convTranspose with input0 [1, 18, 32, 1] and input1 [1,
1, 16, 16] drops from 6.64 ms to 0.59 ms.
2023-12-04 13:18:37 -08:00
Jiajia Qin
92ee664f64
[js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661)
### Description
Currently, for non-uniform variables, we still use the `array<u32, N>` type
instead of `array<vec4<u32>, N1>`. So we can't always treat all variables
with rank > 4 as uniforms when indexing.

This PR fixes the errors below:
```
error(s) generated while compiling the shader:
:5:44 error: index 4 out of bounds [0..1]
             return uniforms.input_strides[4] * (outputIndices[4] % uniforms.input_shape[4])+uniforms.input_strides[3] * (outputIndices[3] % uniforms.input_shape[3])+uniforms.input_strides[2] * (outputIndices[2] % uniforms.input_shape[2])+uniforms.input_strides[1] * (outputIndices[1] % uniforms.input_shape[1])+uniforms.input_strides[0] * (outputIndices[0] % uniforms.input_shape[0]);
                                           ^
FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - float32
FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - shape < input.size()
```
2023-12-01 15:35:35 -08:00
Xu Xing
73d9b03509
[js/webgpu] Add multidimensional(>4) uniform support (#18546)
This change removes the check of enableShapesUniforms. When all uses of
this are removed, enableShapesUniforms can be removed too.
2023-11-30 17:10:33 -08:00
Jiajia Qin
6781b6cf3d
[js/webgpu] add bool type for Expand/Gather (#18615)
### Description
In the [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model,
Expand is used with bool type, which runs on the cpu ep.




| Kernel | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] | CPUExecutionProvider |

After this change, it will run on jsep.

| Kernel | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] | JsExecutionProvider |
2023-11-30 15:47:08 -08:00
Jiajia Qin
b1e749e3be
[js/webgpu] Add program name into webgpuProfiling info (#18640)
### Description
Currently, we only print the kernelName, which makes it hard to tell
which shader was actually used. For example, GroupedConv and Conv2DMatMul
both belong to the Conv kernel, which is not intuitive for profiling.
2023-11-30 12:57:29 -08:00
Yang Gu
227dcb3a88
[js/webgpu] Log the key and program info for artifact (#18365)
With uniform support, ideally we may just keep one artifact for each
program to save compilation time. This PR just logs the related
info, including key and program name, so that we can better understand
the situation.
2023-11-29 18:01:12 -08:00
satyajandhyala
7335760424
[JS/Web] Add uniforms to Einsum (#18531)
### Description
Add uniforms to Einsum



### Motivation and Context
Improve performance.
2023-11-29 15:30:33 -08:00
Yulong Wang
50e6235af1
[js/web] allow ShaderHelper to use internal (non-I/O) variables (#18525)
### Description
This PR includes a change inspired by #18452 to resolve a
requirement: a shader may depend on an instance of `IndicesHelper` to
generate a WGSL code snippet, but the `IndicesHelper` instance is not
necessarily an input/output of the program. So the existing
`declareVariables()` function does not work for this scenario.

In order to support this requirement, I added this "use" function to
`interface ShaderHelper`, which takes a helper-like object as parameter.
The hidden implementation `ShaderHelperImpl` class will iterate the
helpers and call `impl()` for each.

@axinging @qjia7
2023-11-28 15:15:59 -08:00
Jiajia Qin
fc8631e2f1
[js/web] Fix conv2dMatmul errors due to #18452 (#18562)
### Description
Currently, all conv2dMatmul with inChannels = 3 and outChannels % 4 = 0
will report compilation errors. Models that include this kind of shape,
like mobilenetv2-12 and resnet50, will be impacted.

The error was introduced by #18452
https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200,
which accidentally passes `components` to `x`. But `x`'s components is
`innerElementSize`, not `components`. And when `innerElementSize` is 3,
we should use `1` in the current design.
2023-11-27 21:21:47 -08:00
Caroline Zhu
dd355e39a0
[js/web/training] Added parameters methods (#18250)
### Description
* Implemented: `getParametersSize`, `getContiguousParameters`
(equivalent to copyParametersToBuffer), and `loadParametersBuffer`
(equivalent to copyParametersFromBuffer)
* as part of these changes, getParametersSize was added to the
TrainingSession interface so that users know what size buffer to create
for loadParametersBuffer
* The parameters methods in the interface were modified to take in a
Float32Array instead


### Motivation and Context
* part of the work for implementing web bindings for training
* enables federated learning in the web
* previous  PR: #18006

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-11-27 10:30:13 -08:00
Jiajia Qin
64dacc2892
[js/webgpu] Add BatchNormalization Op (#18468)
### Description
This PR adds `BatchNormalization` with `float` support.

Some TODOs:
1. inputs that do not all have the same data type. For example, x/y is
float16, but bias/scale is float32 or double.
2. training mode support.

We see that many models use `BatchNormalization` ops. However, because
the op was missing in jsep, all of them ran on cpu, which resulted in very
poor performance. With this PR, the densenet-9 model drops from
250.69 ms to 20.29 ms.
2023-11-22 15:58:06 -08:00
Xu Xing
fa106942a7
[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452)
This change refactored matmul/conv related programs to support shape
uniforms. Currently only matmul shape uniforms are fully enabled.
TODOs: add input dependencies for conv related programs, turn clipMax
and clipMin to uniforms.
2023-11-22 14:42:55 -08:00
satyajandhyala
841f7ed3e0
[JS/Web] Added uniform to Expand op. (#18558)
### Description
Added Uniforms to Expand operator kernel


### Motivation and Context
Improve performance
2023-11-22 14:14:24 -08:00
Arthur Islamov
1c555c5fc1
[JS/Web] Resize & BiasSplitGelu fp16 support (#18536)
### Description
Resize and BiasSplitGelu fp16 support on WebGPU
2023-11-22 12:12:07 -08:00
Yulong Wang
c7fd930330
[js/web] unify resolve rules for "Clip" (#18527)
### Description
It was a mistake to use 2 different names for the Clip operator in
op-resolve-rules.ts for different opsets. An optimized implementation can
handle both cases (opset < 11 and opset >= 11). Remove "ClipV10" as an
entry from the table.
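
To illustrate why one implementation can cover both cases: in ONNX, Clip takes `min`/`max` as attributes before opset 11 and as optional inputs from opset 11 on. The sketch below resolves the range from either source. The `node` object shape and the helper names are hypothetical, invented for this example, and are not the types used in onnxruntime-web.

```javascript
// Hypothetical sketch: one resolver for both Clip flavours.
// ASSUMPTION: `node` is a plain object invented for this example, with
// `attributes` ({min, max}) and `inputs` (inputs[1] / inputs[2] hold scalar
// min / max values when present). Unbounded sides default to +/-Infinity here.
function resolveClipRange(node) {
  const inputs = node.inputs ?? [];
  const min = inputs[1] ?? node.attributes?.min ?? -Infinity;
  const max = inputs[2] ?? node.attributes?.max ?? Infinity;
  return { min, max };
}

// Applies the resolved range element-wise.
function clip(values, { min, max }) {
  return values.map((v) => Math.min(Math.max(v, min), max));
}
```

Inputs take precedence over attributes in this sketch, so a node that supplies neither simply passes values through.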
2023-11-20 23:18:06 -08:00
Jiajia Qin
abdf8b7c3f
[js/webgpu] Optimize broadcast binary. (#18185)
### Description
Currently, the binary algorithms are divided into a vectorized one
(efficient) and a non-vectorized one (less efficient). The situations
below go to the vectorized one:
1) A or B's shape length is 1.
2) The shared dimensions length of A and B is divisible by 4.
3) A and B have the same shape.

This PR adds another situation that goes to the vectorized
algorithm:
4) A or B's last dimension is divisible by 4.

With this change, the aggregate time of Add in sam-b-encoder drops from
409.12 ms to 309.65 ms on Intel ADL.
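
The four situations can be sketched as a single predicate. This is only an illustration under stated assumptions: "shape length is 1" is read as "the tensor holds exactly one element", and "shared dimensions" as the run of equal trailing dimensions. The function names are hypothetical and do not come from the PR.

```javascript
// Hypothetical predicate mirroring the four situations in the description.
// ASSUMPTIONS: "shape length is 1" means one element in total; "shared
// dimensions" means the product of equal trailing dimensions of A and B.
const elementCount = (dims) => dims.reduce((n, d) => n * d, 1);

function sharedTrailingSize(a, b) {
  let shared = 1;
  for (let i = 1; i <= Math.min(a.length, b.length); i++) {
    const da = a[a.length - i];
    const db = b[b.length - i];
    if (da !== db) break;
    shared *= da;
  }
  return shared;
}

function canVectorize(a, b) {
  const sameShape = a.length === b.length && a.every((d, i) => d === b[i]);
  return (
    elementCount(a) === 1 || elementCount(b) === 1 ||          // situation 1
    sharedTrailingSize(a, b) % 4 === 0 ||                      // situation 2
    sameShape ||                                               // situation 3
    a[a.length - 1] % 4 === 0 || b[b.length - 1] % 4 === 0     // situation 4
  );
}
```

For example, `canVectorize([5, 4], [5, 1])` is true only because of situation 4, while `canVectorize([2, 3], [2, 1])` matches none of the four.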
2023-11-20 16:52:17 -08:00
Yulong Wang
247ce21859
[js] optimize eslint config (#18460)
### Description
optimize eslint config to:
- set parserOptions.project to `true` to allow @typescript-eslint/parser
to find the nearest tsconfig.json file to that source file. This helps
to avoid parsing extra files, and may help with:
   - reducing the possibility of seeing OOM or stack overflow with "npm run
lint"
   - faster processing
- enforce rule "no-underscore-dangle" with a list of exceptions.
2023-11-20 12:00:56 -08:00
Arthur Islamov
fac3e33da5
[js/web] JSEP Attention & MultiHeadAttention (#17742)
### Description
This is a narrow implementation of Attention/MultiHeadAttention as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask

But it works well for StableDiffusion and can be extended later. It
reduces VRAM usage because it combines many ops into a few.
I've updated the demo here: https://islamov.ai/stable-diffusion-webgpu/
It takes ~13 s for 1 image with 20 steps on an RTX 3090 Ti and about 25 s
on an M1 Pro.
VRAM usage is about 8 GB if you don't use img2img.

Going to focus on SDXL now

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-11-17 12:23:52 -08:00
satyajandhyala
b291b20fa0
[JS/Web]Added uniforms support to Slice op. (#18422)
### Description
Support uniforms in Slice op



### Motivation and Context
Improve performance
2023-11-16 09:44:13 -08:00
Yulong Wang
586f06f5a1
[js/web] set noUnusedParameters to true and fix a few bugs (#18404)
### Description
- set tsconfig "noUnusedParameters" to `true` and fix a few bugs
discovered by TypeScript.
   How unused parameters are fixed:
   - for most code (webgl), add an underscore as prefix, which is the
standard ignore pattern for the TypeScript check.
   - remove the unused parameter from the function and modify the
corresponding function calls (jsep)
- fix a bug in ArgMinMax: these 2 operators do not have more than one
input, so `createArgMinMaxAttributesFromInputs()` is removed.
- add proxy main.ts into the TypeScript check and fix a bug in parameter
passing
   - fixed the `run()` function call and added a typecheck fix (hack)
2023-11-15 09:16:29 -08:00
Xu Xing
949ac4b7ce
[js/webgpu] Support uniforms for gather (#18312) 2023-11-13 11:24:34 -08:00
Wanming Lin
73ed34ac4b
[WebNN EP] Support numThreads option for WebNN CPU device (#18054) 2023-11-12 16:45:10 -08:00
Xu Xing
0c8c0014f6
[js/webgpu] Use builtin num_workgroups to fix shader key conflict (#18387)
This fixes conformance failure of tinyyolov2-8 and potential shader key
conflict issues.
2023-11-10 17:37:45 -08:00
Yulong Wang
6b0c97b43f
[js/web] fix typescript type check (#18343)
### Description

This PR fixes the TypeScript type check.

Previously, when I used esbuild to replace webpack (#17745), the TypeScript
type check was disabled. This caused a few TypeScript type errors to be
checked into the code base. This PR fixes the following:

- Use "Node16" as default "module" value in tsconfig.json, because in
TypeScript v5, `(module == "ES2015" && moduleResolution == "Node16")` is
an invalid combination.
- Set `noUnusedParameters` to true by default. In web, override it to
false because multiple pieces of code need to be updated (a follow-up PR
will do this)
- set correct project file for 'web/lib/**/*.ts' for ESLint (otherwise
WebGPU types are not populated correctly)
- fix type error in file js/web/lib/wasm/jsep/webgpu/program-manager.ts
- upgrade "@webgpu/types" to latest to fix type error in file
js/web/lib/wasm/jsep/backend-webgpu.ts
- add package script "prebuild" for web to run tsc type check
- add type check in CI yml file
2023-11-10 16:03:38 -08:00
Xu Xing
8dba6efd61
[js/webgpu] Add uniforms support to concat op (#18238) 2023-11-10 13:46:03 -08:00
Jiajia Qin
28c23aed04
[js/webgpu] Fix conv2d with activation (#18388)
### Description
Fix #18297

With PR #17766, conv2d activation in mobilenetv2-12 will not be empty.
However, activation is not supported yet in
[biasActivationSnippet](https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/ops/3rd-party/activation_util.ts#L48C14-L48C36).
This PR makes all call sites use
[getActivationSnippet](https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/ops/fuse-utils.ts#L13)
to fix this issue.
2023-11-10 12:54:35 -08:00
Xu Xing
dd1bb760eb
[js/webgpu] Fix scalar uniform (#18318) 2023-11-10 10:12:22 -08:00
Xu Xing
829d802337
[js/webgpu] Support uniform for softmax (#18345) 2023-11-09 11:19:23 -08:00
Guenther Schmuelling
25fbc2b0ab
fix fused relu activation (#18303) 2023-11-09 08:18:21 -08:00
Jiajia Qin
606356d0b1
[js/webgpu] Simplify the Resize shader when noScale is true (#18321)
### Description
For Resize, when `noScale` is true, the shader can become very simple
and no longer depends on `attributes.mode`. So we should remove
those parts of the shader code for simplification.

This PR also fixes #18311, since `noScale` is always true in that
model.

However, #18311 also exposes a bug in the Resize implementation for
`linear` mode. It seems that the current implementation always treats
the input as either a 2d or 4d tensor; however, the actual input is a 3d
tensor, which is why the shader compilation fails. We may need to fix
it in a separate PR.
2023-11-07 12:54:20 -08:00
satyajandhyala
a16d528399
[JS/Web] Added Uniforms support to binary ops. (#18260)
### Description
Added Uniform support to binary ops



### Motivation and Context
To improve performance
2023-11-07 08:41:52 -08:00
satyajandhyala
e207060ac9
[JS/Web] Added Uniforms support to unary ops. (#18223)
### Description
Added uniforms support to unary ops.


### Motivation and Context
Improve performance
2023-11-03 09:30:54 -07:00
xhcao
8d48d3e9cc
[js/web] optimize reduce related operators (#17957)
### Description



### Motivation and Context
2023-11-02 12:51:48 -07:00
Caroline Zhu
e3b043ba17
[js/web/training] runTrainStep implementation (#18006)
### Description
* based on design document & following InferenceSession's run
implementation, implemented TrainingSession.runTrainStep

### Motivation and Context
* Adding web bindings for training

#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allowed for training package to be built + added training
backend to web package
* #17891 implementation for createTrainingSession on the TypeScript side
**[SHOULD BE MERGED IN BEFORE THIS PR]**

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-11-02 08:32:50 -07:00
satyajandhyala
a2e9ba72d5
[JS/Web]Added FusedConv. (#17766)
### Description
Added FusedConv and FusedConvTranspose



### Motivation and Context
Improve performance
2023-11-01 15:34:51 -07:00
Jiajia Qin
785e2b1eae
[js/webgpu] Optimize softmax by vector (#18153)
### Description
This PR enables `softmax` to output the max supported components, instead
of a scalar, for each thread.

Softmax with input[0]: [12,4096,4096] drops from 55.11 ms to 47.86 ms
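
As a rough illustration of "max supported components", the sketch below picks the widest vector width (4, 2, or 1) that evenly divides the innermost dimension and derives the matching WGSL value type. Both helper names are hypothetical, and reading "max supported components" this way is an assumption, not something the description spells out.

```javascript
// Hypothetical helper: widest vector width (4, 2 or 1) that evenly divides
// the innermost dimension, so each thread can handle a vec4, vec2 or scalar.
function maxComponents(innerSize) {
  if (innerSize % 4 === 0) return 4;
  if (innerSize % 2 === 0) return 2;
  return 1;
}

// WGSL value type for a component count, e.g. 4 -> "vec4<f32>".
function wgslValueType(components, scalar = 'f32') {
  return components === 1 ? scalar : `vec${components}<${scalar}>`;
}
```

For the [12,4096,4096] input above, the innermost dimension 4096 is divisible by 4, so this sketch would choose `vec4<f32>`.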
2023-10-30 16:05:35 -07:00
Yulong Wang
9bba990871
[js/web] fix a few package consuming problems (#18109)
### Description
This PR tries to fix a part of the NPM package consuming problems for
onnxruntime-web (ES module) as described in #10913:

- reduce the package size to fit the 150MB restriction in jsdelivr, by
removing dev build targets for uncommon exports
- add default export to support `import ort from 'onnxruntime-web';`
(currently only `import * as ort from 'onnxruntime-web';` is supported)
2023-10-30 08:11:43 -07:00
Yang Gu
52f4968359
[js/webgpu] Change timestamp-query-in-passes to timestamp-query (#18108)
Timestamp-query has broader support than timestamp-query-in-passes on
all the platforms, including macOS.
Note that to enable timestamp-query, you still need to add switch
"--enable-dawn-features=allow_unsafe_apis" to Chrome. By default, the
lowest 16 bits are masked with 0 (at a granularity about 0.1ms) for
privacy. To get the highest precision, you need to add another switch
"--enable-webgpu-developer-features".
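
A minimal sketch of opting in to the feature through the standard WebGPU API (`adapter.features.has` and `requiredFeatures` on `requestDevice`). The wrapper names are hypothetical. The second helper illustrates the masking described above by clearing the lowest 16 bits of a nanosecond timestamp, taking the description's wording literally.

```javascript
// Sketch: request a GPUDevice with 'timestamp-query' only when the adapter
// advertises it. `adapter` is a GPUAdapter, for example the result of
// navigator.gpu.requestAdapter() in a browser.
async function requestProfilingDevice(adapter) {
  const requiredFeatures = [];
  if (adapter.features.has('timestamp-query')) {
    requiredFeatures.push('timestamp-query');
  }
  const device = await adapter.requestDevice({ requiredFeatures });
  return { device, timestampsEnabled: requiredFeatures.length > 0 };
}

// Illustration only: clearing the lowest 16 bits of a nanosecond timestamp
// (a BigInt) quantizes it to steps of 65536 ns.
function maskLow16Bits(timestampNs) {
  return timestampNs & ~0xffffn;
}
```

Falling back to an empty feature list keeps device creation working on adapters without timestamp support; profiling is then simply unavailable.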
2023-10-26 16:33:03 -07:00
Caroline Zhu
64de71c5e2
[js/web/training] Add CreateTrainingSession (#17891)
### Description
* Adds TrainingSession.create() functionality following the web bindings
for training design doc
* Added 2 new training APIs to wasm/api.h:
   * OrtTrainingGetInputOutputName
   * OrtTrainingGetInputOutputCount
* Moved isOrtEnvInitialized boolean to the wasm-core-impl and added a
method that references it

### Motivation and Context
* Adding web bindings for training

#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allows for training package to be built + adds training backend
to web package **[MUST BE MERGED IN BEFORE THIS ONE]**

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-10-26 09:22:10 -07:00
satyajandhyala
f3cfe08c42
[JS/Web] Enabled 1d spatial input to GlobalAveragePool (#17973)
### Description
Enable one-dimensional spatial input to GlobalAveragePool



### Motivation and Context
Currently only 2D input is supported.
2023-10-23 16:02:50 -07:00
Jiajia Qin
8a12b2cea6
[js/webgpu] Fix the transpose error when dims > 4D (#18027)
### Description
Currently, the uniform support has bugs when the dims rank is larger than 4.
See https://github.com/microsoft/onnxruntime/issues/17860 item 1.
So this PR only enables shape uniforms for transpose when the shape rank
is <= 4. Otherwise, the compilation errors below are thrown:
```
1 error(s) generated while compiling the shader:
:3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead.
      struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
                                                 ^^^^^^^^^^^^^

:3:7 note: see layout of struct:
/*            align(4) size(84) */ struct Uniforms {
/* offset( 0) align(4) size( 4) */   output_size : u32;
/* offset( 4) align(4) size(20) */   a_shape : array<u32, 5>;
/* offset(24) align(4) size(20) */   a_strides : array<u32, 5>;
/* offset(44) align(4) size(20) */   output_shape : array<u32, 5>;
/* offset(64) align(4) size(20) */   output_strides : array<u32, 5>;
/*                              */ };
      struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
      ^^^^^^

:4:42 note: 'Uniforms' used in address space 'uniform' here
      @group(0) @binding(2) var<uniform> uniforms: Uniforms;
                                         ^^^^^^^^
```
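
The error arises because arrays in WGSL's uniform address space need a 16-byte element stride, which `array<u32, 5>` does not have. One way around it is to pack the values into `vec4<u32>` elements, the `array<vec4<u32>, N1>` form mentioned in #18661 above. The sketch below shows that packing on the JavaScript side; both helpers are hypothetical illustrations, not the project's code.

```javascript
// Hypothetical helpers illustrating vec4 packing for uniform arrays.
// Packs a shape of any rank into a Uint32Array whose length is a multiple of
// 4, so it can back a WGSL `array<vec4<u32>, N>` (16-byte element stride).
function packDimsForUniform(dims) {
  const vec4Count = Math.ceil(dims.length / 4);
  const packed = new Uint32Array(vec4Count * 4);  // unused slots stay 0
  packed.set(dims);
  return { packed, vec4Count };
}

// WGSL expression that reads logical element `i` from the packed array.
function packedIndexExpr(name, i) {
  return `${name}[${Math.floor(i / 4)}][${i % 4}]`;
}
```

With a rank-5 shape, element 4 lives in the second `vec4`, so `packedIndexExpr('uniforms.a_shape', 4)` yields `uniforms.a_shape[1][0]`.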
2023-10-23 11:02:19 -07:00
Arthur Islamov
22947109f2
[js/web] FP16 LayerNorm, InstanceNorm, SkipLayerNorm (#17630)
### Description
This PR includes fixes for Norm operations to support FP16 and also some
optimizations to use vec2/vec4 if possible
2023-10-18 10:47:41 -07:00
Caroline Zhu
c373a808a2
Add "glue" between training WASM artifacts and training web (#17474)
### Description
* follows the packaging approach according to the design document
    * adds `ENABLE_TRAINING` boolean flag to `BUILD_DEFS`
    * modifies `package.json` to include training submodule
* modifies build script to handle, validate, and minimize training WASM
artifacts
* adds the binding for the new backend with training enabled & the new
training artifacts
    * adds training backend
    * edits `index.ts` to use training backend depending on `BUILD_DEFS`
    * edits `wasm-factory.ts` to use the training artifacts if necessary

### Motivation and Context
* we are in the process of adding web bindings to enable training. 
* Adding the "glue" to allow onnxruntime-web to use the training WASM
artifacts is required for this work.
* Since BUILD_DEFS is defined and used at build time, I thought that it
made sense to bundle the changes to building in the same PR.
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 must be merged in before this one

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-12 11:16:56 -07:00
Yulong Wang
d532645bed
[js/webgpu] revise uniform support (#17871)
### Description

work for items (2) and (3) in #17860
2023-10-11 16:41:46 -07:00