Commit graph

255 commits

Author SHA1 Message Date
Yulong Wang
45ff957973
1.17.3 cherry-picks for ORT Web changes (#19926)
### Description
This PR is a preview of cherry-picks for ort-web to `rel-1.17.3` based
on `rel-1.17.2`.

<details>

<summary>Changes of ort-web to cherry-pick</summary>

The following commits are from main branch.

`o` stands for pick, and `x` stands for skip.
```
o   2e0a388c36 [js/webgpu] Add HardSigmoid support (#19215)
o   d226e40856 [js/webgpu] set query type in onRunStart (#19202)
o   61610ff986 [js/webgpu] Add FusedConv clip test case (#18900)
o   a33b5bd1fa [JS/WebGPU] Added Uniforms to SkipLayerNorm. (#18788)
o   591f90c0b9 [js/webgpu] Fix issue of timestamp query (#19258)
o   7252c6e747 [WebNN EP] Support WebNN async API with Asyncify (#19145)
o   5b06505073 [js/webgpu] Fix Tanh explosion (#19201)
o   656ca66186 [js/webgpu] Support uniforms for conv, conv transpose, conv grouped (#18753)
o   a3f0e2422b [js/webgpu] Support f16 uniform (#19098)
o   9e69606360 fix f16 for attention, enable slice and flatten for more types (#19262)
o   624b4e2063 [js/webgpu] Remove enableShapesUniforms (#19279)
o   90883a366a [js/webgpu] Add hardSigmoid activation for fusedConv (#19233)
o   85cef0af8c [js/webgpu] Support capture and replay for jsep (#18989)
o   d73131cf0f [js/webgpu] Use DataType as uniform cpu type (#19281)
o   dd1f6ccc45 [js/webgpu] resolve codescan alert (#19343)
o   3a2ab1963a [js/webgpu] Refactor createTensorShapeVariables (#18883)
o   efc17e79de [js/webgpu] Fix the undefined push error (#19366)
 x  50806a7dd5 [js/web] support external data in npm test (#19377)
o   ccbe264a39 [js/webgpu] Add LeakyRelu activation for fusedConv (#19369)
o   5ff27ef02a [js/webgpu] support customop FastGelu (#19392)
 x  03be65e064 [js/web] fix types exports in package.json (#19458)
o   06269a3952 [js/webgpu] allow uint8 tensors for webgpu (#19545)
o   dfeda9019c [JS/WebGPU] Add MatMulNBits (#19446)
o   1b48054e1b [js/webgpu] Create Split indices helpers by rank, not by shape (#19554)
o   3fe2c137ee [js] small fix to workaround formatter (#19400)
 x  70567a4b3a [js/web] use ApiTensor insteadof onnxjs Tensor in TensorResultValidator (#19358)
o   6e04e36e3f [js/common] upgrade tsc in common from 4.9.5 to 5.2.2 (#19317)
o   58f4921686 [js] changes to allow Float16Array if any polyfill is available (#19305)
o   57d6819212 [js/web] Fix fused-conv is not included in npm test (#19581)
o   ebd220b073 Misspelling in README.md (#19433)
o   38c3432393 Bump ip from 1.1.8 to 1.1.9 in /js/react_native (#19582)
o   fe82fccf1a [js/webgpu] Fix Conv2DTransposeMatMul f16 compilation failure (#19596)
o   76a2a487a1 Bump ip from 1.1.8 to 1.1.9 in /js/react_native/e2e (#19583)
o   29b1106033 [node] Switch to setImmediate to avoid starving the Node.js event loop (#19610)
o   ae3d73c981 [JS/WebGPU] Fix Split and Where to handle corner cases. (#19613)
o   aec2389ad0 [js/webgpu] allows a ProgramInfo's RunData to use zero sized output (#19614)
o   bb43a0f133 [js/webgpu] minor fixes to make tinyllama work (#19564)
o   0edb035808 [js/web] fix suite test list for zero sized tensor (#19638)
o   3cb81cdde2 [js/common] move 'env.wasm.trace' to 'env.trace' (#19617)
o   e30618d055 [js/webgpu] use Headless for webgpu test by default (#19702)
o   f06164ef8b [js/web] transfer input buffer back to caller thread (#19677)
 x  a788514027 [js/web] dump debug logs for karma for diagnose purpose (#19785)
o   24b72d2613 [JS/WebGPU] Preserve zero size input tensor dims. (#19737)
o   4538d31a8b [js/webgpu] expose a few properties in WebGPU API (#19857)
o   53de2d8cb0 [js/webgpu] Enable GroupedConvVectorize path (#19791)
o   ed250b88c3 [JS/WebGPU] Optimize MatMulNBits (#19852)
 x  e771a763c3 [js/test] align web test runner flags with ort.env (#19790)
o   79e50aeef3 [js/web] rewrite backend resolve to allow multiple EPs (#19735)
o   acb0df2280 Fix #19931 broken Get Started link of "ONNX Runtime JavaScript API" page (#19932)
o   b29849a287 [js/common] fix typedoc warnings (#19933)
o   afdab62f53 Bump follow-redirects from 1.15.4 to 1.15.6 in /js/web (#19949)
o   28ad6c3955 Bump follow-redirects from 1.15.4 to 1.15.6 in /js/node (#19951)
o   7e0d424934 accumulate in fp32 for Reduce* (#19868)
o   4c6a6a37f7 [js/webgpu] Fix NAN caused by un-initialized buffer in instance-norm (#19387)
o   01c7aaf6aa [js/webgpu] allow setting env.webgpu.adapter (#19940)
o   c45cff60cf [js/webgpu] fix maxpool / fp16 (#19981)
```

</details>

<details>
<summary>Cherry-pick commandlines</summary>

```sh
git cherry-pick 2e0a388c36
git cherry-pick d226e40856
git cherry-pick 61610ff986
git cherry-pick a33b5bd1fa
git cherry-pick 591f90c0b9
git cherry-pick 7252c6e747
git cherry-pick 5b06505073
git cherry-pick 656ca66186
git cherry-pick a3f0e2422b
git cherry-pick 9e69606360
git cherry-pick 624b4e2063
git cherry-pick 90883a366a
git cherry-pick 85cef0af8c  #<<<<< Note: conflicts
git cherry-pick d73131cf0f
git cherry-pick dd1f6ccc45
git cherry-pick 3a2ab1963a
git cherry-pick efc17e79de
git cherry-pick ccbe264a39
git cherry-pick 5ff27ef02a
git cherry-pick 06269a3952
git cherry-pick dfeda9019c
git cherry-pick 1b48054e1b
git cherry-pick 3fe2c137ee
git cherry-pick 6e04e36e3f
git cherry-pick 58f4921686
git cherry-pick 57d6819212
git cherry-pick ebd220b073
git cherry-pick 38c3432393
git cherry-pick fe82fccf1a
git cherry-pick 76a2a487a1
git cherry-pick 29b1106033
git cherry-pick ae3d73c981
git cherry-pick aec2389ad0
git cherry-pick bb43a0f133
git cherry-pick 0edb035808
git cherry-pick 3cb81cdde2
git cherry-pick e30618d055
git cherry-pick f06164ef8b
git cherry-pick 24b72d2613
git cherry-pick 4538d31a8b
git cherry-pick 53de2d8cb0
git cherry-pick ed250b88c3
git cherry-pick 79e50aeef3
git cherry-pick acb0df2280
git cherry-pick b29849a287
git cherry-pick afdab62f53
git cherry-pick 28ad6c3955
git cherry-pick 7e0d424934
git cherry-pick 4c6a6a37f7
git cherry-pick 01c7aaf6aa
git cherry-pick c45cff60cf
```
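The command list above follows mechanically from the `o`/`x` pick list. A minimal sketch of that mapping, assuming each list line has the form `<o|x> <abbreviated sha> <title>`; `toCherryPickCommands` is a hypothetical helper for illustration, not part of the repository tooling:

```js
// Hypothetical helper: turn an `o`/`x` pick list into `git cherry-pick` commands.
// Assumes each relevant line looks like "<o|x> <abbreviated sha> <commit title>".
function toCherryPickCommands(pickList) {
  const linePattern = /^\s*([ox])\s+([0-9a-f]{7,40})\s+(.*)$/;
  const commands = [];
  for (const line of pickList.split('\n')) {
    const match = linePattern.exec(line);
    if (match === null) {
      continue;  // skip blank or unrecognized lines
    }
    const [, marker, sha] = match;
    if (marker === 'o') {
      commands.push(`git cherry-pick ${sha}`);  // `x` lines are skipped
    }
  }
  return commands;
}

// Three lines copied from the pick list in this PR description.
const sample = [
  'o   efc17e79de [js/webgpu] Fix the undefined push error (#19366)',
  ' x  50806a7dd5 [js/web] support external data in npm test (#19377)',
  'o   ccbe264a39 [js/webgpu] Add LeakyRelu activation for fusedConv (#19369)',
].join('\n');

console.log(toCherryPickCommands(sample).join('\n'));
```

Commits that conflict (such as 85cef0af8c) still need manual resolution; the sketch only generates the commands.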
</details>

<details>
<summary>Cherry-pick conflicts</summary>

- 85cef0af8c #18989
This change enables the graph capture feature for JSEP, and it was made
after the ROCM EP enabled graph capture. However, the ROCM EP graph
capture feature is not cherry-picked into rel-1.17.2.
</details>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jiajia Qin <jiajia.qin@intel.com>
Co-authored-by: Xu Xing <xing.xu@intel.com>
Co-authored-by: satyajandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: Yang Gu <yang.gu@intel.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Jiajie Hu <jiajie.hu@intel.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Matttttt <18152455+martholomew@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Segev Finer <segev208@gmail.com>
Co-authored-by: Belem Zhang <belem.zhang@intel.com>
2024-03-29 13:13:39 -07:00
Rachel Guo
046d06ff26
Cherry-pick for 1.17.3 (#20013)
### Description

Web PRs are not included yet.


---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Your Name <your@email.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: enximi <70036307+enximi@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Markus Tavenrath <mtavenrath@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: aciddelgado <139922440+aciddelgado@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
2024-03-29 13:10:13 -07:00
Rachel Guo
6bc6adc658
Update version number to 1.17.2 (#19701)
### Description

As the title says. Follow-up PR for source code release 1.17.2.



---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Changming Sun <chasun@microsoft.com>
2024-03-01 13:51:00 -08:00
Rachel Guo
8f5c79cb63
Update 1.17.1 patch release version (#19622)
### Description

Need to update patch release version.



---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2024-02-23 16:10:36 -08:00
Rachel Guo
3fd94a8cc7
[ORT 1.17.0 Release] Cherry pick 1st round (#19243)
### Description

[ORT 1.17.0 Release] Cherry pick 1st round

PR authors please take a look, and let me know if there are any
questions about the changes or approve accordingly.


---------

Co-authored-by: wejoncy <wejoncy@163.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: snadampal <87143774+snadampal@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Heflin Stephen Raj <heflinstephen03@gmail.com>
Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
2024-01-26 20:11:48 -08:00
Guenther Schmuelling
9dee543bed
fix gemm beta for fp16 (#19153)
Per the ONNX spec, beta is always fp32, so we need to cast it.
2024-01-15 18:40:38 -08:00
Yulong Wang
f917dde717
[web] remove xnnpack from web backends (#19116)
### Description
XNNPACK is already disabled in the WebAssembly build. This change removes
the xnnpack backend registration in JS.
2024-01-13 23:04:02 -08:00
Yang Gu
e803f8eb0f
[js/webgpu] Refactor timestamp-query and introduce timestamp-query-inside-passes (#18894)
We submit kernels in batches (a fixed size of 16, except for the last
batch) for better performance. However, timestamp query support is at
the pass level, so the previous implementation disabled batch execution
in profiling mode. In fact, a batch can contain multiple passes, so batch
execution does not have to be disabled; this is the first enhancement of
this PR.
Furthermore, WebGPU has an extension to support timestamp queries inside
passes, which is not supported on all platforms (e.g., Windows supports
it, while macOS doesn't). This is expected to cost less than the
multiple-pass solution, so this PR also introduces this support when
available.
This PR also refactors some of the implementation related to kernelInfo
and tries to unify the related kernel names.
2024-01-13 00:23:17 -08:00
Yulong Wang
07cfc56538
[js] enable external data loading for ort-web (#19087)
### Description
enable external data loading for ort-web.

### Why
The ORT external data design depends heavily on the file system,
especially synchronous file I/O APIs. Those are not available on web
platforms, so extra code is needed to make external data work on the
web.

### How
Since there is no file system on the web, the web implementation
supports external data through pre-loaded data. Assuming model file
a.onnx includes initializers that link to ./b.bin, users are required to
pass a full list of data files when creating the session. The user code
will look like:
```js
const mySess = await ort.InferenceSession.create('./path/model/a.onnx', {
  // session options
  externalData: [
    {
      // relative or absolute path/URL of the file,
      // or a pre-loaded Uint8Array containing the data of the external data file
      data: './path/data/b.bin', 

      // the relative path of the external data. Should match initializers' "location" value defined in the model file
      path: './b.bin'
    },
    // { } if multiple external data file
  ]
});
```

Currently, this feature only works with JSEP build enabled.
2024-01-12 19:24:24 -08:00
Guenther Schmuelling
a756017e9f
[js/webgpu] more fixes for access above 2GB (#19065)
When JSEP calls JavaScript with an index into HEAP8 or HEAP32, the index
is negative when the heap is above 2GB; even if we pass it as uint32_t,
it remains negative. So in JavaScript, use `>>> 0` to make it unsigned.
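
As a small illustration of the `>>> 0` idiom (a sketch only; `toUnsignedIndex` and the sample address are made up here and are not taken from the JSEP code):

```js
// Hypothetical helper: reinterpret a signed 32-bit value as an unsigned index.
function toUnsignedIndex(index) {
  return index >>> 0;  // zero-fill right shift by 0 yields the uint32 interpretation
}

// An address above 2 GB as it appears after being truncated to a signed int32.
const address = 0x90000000;      // 2415919104, above 2 GB
const asSigned = address | 0;    // int32 view of the same bits: negative

console.log(asSigned);                   // -1879048192
console.log(toUnsignedIndex(asSigned));  // 2415919104
```

Values below 2 GB pass through unchanged, so the conversion is safe to apply unconditionally.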
2024-01-12 17:47:37 -08:00
Guenther Schmuelling
4a5f13b681
fix resize for fp16 (#19110)
Resize for fp16 has 2 issues: scales are always f32, and roi can be f32
or f16.

scales:
This is fixed.

roi:
This is fixed for the case where roi is not passed as an optional input
with f16. Fixing the remaining case requires a much larger change, and I
did not want to risk it so shortly before a release. For all practical
purposes, passing roi as an input with f16 should be rare, and we can fix
it in the near future.
2024-01-12 13:44:28 -08:00
Jiajie Hu
acba63c36a
[js/webgpu] Change A/sqrt(B) to A*inverseSqrt(B) in normalization ops (#19101)
### Description
Change `A / sqrt(B)` to `A * inverseSqrt(B)` in BatchNormalization,
InstanceNormalization, LayerNormalization and SkipLayerNormalization.

### Motivation and Context
For the same reason as the existence of the `inverseSqrt` built-in in
WebGPU spec.
2024-01-12 00:08:16 -08:00
Guenther Schmuelling
d0bac8216d
[js/webgpu] fix bcast in where (#19009) 2024-01-11 12:13:24 -08:00
Jiajia Qin
a89db01fce
[js/webgpu] disable GroupedConvVectorize path (#19090)
Disable the createGroupedConvVectorizeProgramInfo path due to bot
failures in the two cases below:
[webgpu]Conv - conv - vectorize group - B
[webgpu]Conv - conv - vectorize group - D
2024-01-11 08:13:14 -08:00
Jiajia Qin
fd6bab4250
[js/webgpu] Provide a vectorized algorithm for GroupedConv (#18884)
### Description
This PR provides a vectorized algorithm for NHWC GroupedConv to improve
performance.

The aggregate time of GroupedConv in mobilenetv2-12 drops from ~4 ms to
~1 ms on an Intel Alder Lake machine, about a 20% improvement for the
whole model.
2024-01-10 16:12:43 -08:00
Xu Xing
ed0f26d3d4
[js/webgpu] Revert parse norm attributes (#19074)
This resolves the build errors below:
```
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS2724: '"./ops/instance-norm"' has no exported member named 'parseInstanceNormAttributes'. Did you mean 'InstanceNormAttributes'?

19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS6133: 'parseInstanceNormAttributes' is declared but its value is never read.

19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS2305: Module '"./ops/layer-norm"' has no exported member 'parseLayerNormAttributes'.

20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
                      ~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS6133: 'parseLayerNormAttributes' is declared but its value is never read.

20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
```
2024-01-09 20:58:50 -08:00
Xu Xing
76dfe5347c
[js/webgpu] Support uniforms for instance-norm (#18929)
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
2024-01-09 14:56:00 -08:00
zesongw
ad6dd0a597
[WebNN] Enable npm unit tests (#18486)
### Description
- Support more test cases for WebNN EP in suite-test-list.jsonc
- Add DISABLE_WEBNN flag in build.ts as preparing for WebNN EP release
- Add test option: '--webnn-device-type' in test-runner-args-cli.ts to
support running WebNN 'gpu' deviceType
- Use Chrome Stable as default browser for WebNN testing to unblock the
CI limitation.
2024-01-09 10:10:57 -08:00
Xu Xing
557ac74c05
[js/webgpu] Support gemm uniforms (#19056)
2024-01-09 09:57:06 -08:00
Xu Xing
42ba2aed54
[js/webgpu] Support pad uniforms (#19057)
2024-01-09 09:34:56 -08:00
Xu Xing
eb92681bfb
[js/webgpu] Support range uniforms (#19055) 2024-01-09 09:33:57 -08:00
Xu Xing
dee6a5b371
[js/webgpu] Support uniforms for attention and multihead attention (#18903) 2024-01-09 07:46:30 -08:00
Xu Xing
8f024b7394
[js/webgpu] Support uniforms for layer-norm (#18755) 2024-01-08 18:16:25 -08:00
Jiajie Hu
447a3a7c70
[js/webgpu] Fix Expand/Gather when input type is bool (#18999)
### Description
Also update the op test suite.

### Motivation and Context
Previously the *total* size in case `Expand - last dim is not divisible
by 4` was a multiple of 4, even though the *last dimension* was not, so
the bug has never been caught.
2024-01-05 08:16:15 -08:00
Yulong Wang
b18abaaa2c
[js/web] wait for threadpool initialization (#18952)
### Description

A replacement for #18683; tries to resolve #18689.

Specifying the "-s PTHREAD_POOL_SIZE" flag in emscripten forces the
thread pool to initialize before the WebAssembly instance is available.
2024-01-04 08:06:55 -08:00
xhcao
867b9d8f04
[js/webgpu] Fix f16 errors for ConvTranspose2D (#18986)
2024-01-04 08:06:01 -08:00
Jiajie Hu
3b8b9147fa
[js/webgpu] Mitigate floating point accuracy issue in Resize (#18956)
### Description
The patch fixes a floating point accuracy issue in Resize by preferring
integer indices and integer arithmetic where possible.

### Motivation and Context
Model test `test_resize_upsample_sizes_nearest_floor_align_corners` was
observed to be failing on certain platforms. The root cause is the
inaccurate floating point evaluation of 21 / 7 (2.999... vs 3), which
results in the wrong input element being indexed (floor(2.999...) vs
floor(3)).
2024-01-03 14:15:26 -08:00
Yang Gu
c5f3952b68
[js/webgpu] Introduce trace support (#18928)
This leverages console.timeStamp to add a single marker to the browser's
performance tool (only Chromium and Firefox support it). With this
support, we can dump both CPU and GPU timestamps and use a
post-processing tool to clearly understand the calibrated timeline. A
demo tool can be found at https://github.com/webatintel/ort-test, and
more detailed info can be found at
https://docs.google.com/document/d/1TuVxjE8jnELBXdhI4QGFgMnUqQn6Q53QA9y4a_dH688/edit.
2024-01-03 10:13:17 -08:00
satyajandhyala
780fc3611b
[JS/Web] Sajandhy/webgpu resize scales rank check (#18954)
2023-12-29 09:23:27 -08:00
Jiajia Qin
44584c3ebe
[js/webgpu] only declare shape and strides in shader when necessary (#18940)
### Description
Previously, shape and strides were added unconditionally, even when they
were not used. This PR fixes this issue and only adds shape and strides
when they are required.

This is useful when some shapes are not used as uniforms if the program
depends on type instead of rank.
2023-12-28 15:43:08 -08:00
Jiajia Qin
c613cc58a9
[js/webgpu] Fix shader compilation errors in Resize (#18947)
### Description
An extra right parenthesis was added accidentally, which caused some
Resize cases to fail. This PR fixes it.
2023-12-28 13:15:26 -08:00
satyajandhyala
3bbe4fe2ff
[JS/WebGPU] Add trilinear interpolation to Resize; activation_params attribute is optional for FusedConv also. (#18842)
### Description
Add trilinear interpolation to Resize and make the activation_params attribute optional for FusedConv.



2023-12-27 16:21:29 -08:00
Guenther Schmuelling
31d4a21c4b
[js/webgpu] fix heap access > 2GB (#18914) 2023-12-27 15:22:05 -08:00
Xu Xing
0bc71b0c9b
[js/webgpu] Refactor attributes of pool (#18728) 2023-12-26 17:23:52 -08:00
Yulong Wang
9a61388f0a
[js/web] revise backend registration (#18715)
### Description
This PR revises the backend registration.

The following describes the expected behavior after this change:
(**bolded are changed behavior**)

- (ort.min.js - built without webgpu support)
    - loading: do not register 'webgpu' backend
    - creating session without EP list: use default EP list ['webnn', 'cpu', 'wasm']
    - creating session with ['webgpu'] as EP list: should fail with backend not available
- (ort.webgpu.min.js - built with webgpu support)
    - loading: **always register 'webgpu' backend**
      (previous behavior: only register 'webgpu' backend when `navigator.gpu` is available)
    - creating session without EP list: use default EP list ['webgpu', 'webnn', 'cpu', 'wasm']
        - when WebGPU is available (win): use WebGPU backend
        - when WebGPU is unavailable (android): **should fail backend init,** and try to use next backend in the list, 'webnn'
          (previous behavior: does not fail backend init, but fail in JSEP init, which was too late to switch to next backend)
    - creating session with ['webgpu'] as EP list
        - when WebGPU is available (win): use WebGPU backend
        - when WebGPU is unavailable (android): **should fail backend init,** and because no more EP listed, fail.


related PRs: #18190 #18144
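
The fallback behavior described above can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: `resolveBackend`, the `backends` map, and the `init` method are hypothetical stand-ins, not the actual ort-web implementation (where backend initialization is asynchronous).

```js
// Simplified sketch of EP fallback: try each listed backend in order and use the
// first one whose initialization succeeds. All names here are hypothetical.
function resolveBackend(epList, backends) {
  const errors = [];
  for (const name of epList) {
    const backend = backends[name];
    if (backend === undefined) {
      errors.push(`${name}: backend not registered`);
      continue;
    }
    try {
      backend.init();  // throws when the backend cannot be used on this platform
      return name;     // the first backend that initializes wins
    } catch (e) {
      errors.push(`${name}: ${e.message}`);
    }
  }
  // No EP left in the list: creating the session fails.
  throw new Error(`no available backend found: ${errors.join('; ')}`);
}

// Simulates a platform where WebGPU is unavailable but wasm works.
const registeredBackends = {
  webgpu: {init: () => { throw new Error('WebGPU is not available'); }},
  wasm: {init: () => {}},
};

console.log(resolveBackend(['webgpu', 'wasm'], registeredBackends));  // falls back to wasm
```

Failing during backend init, rather than later, is what leaves room to move on to the next entry in the list.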
2023-12-20 14:45:55 -08:00
satyajandhyala
98510fb8fb
[JS/WebGPU] fix an error in Clip (#18799)
### Description
Check whether the min/max inputs are provided and use default values if not provided.


2023-12-19 13:51:01 -08:00
Jiajia Qin
8f7b89bd5b
[js/webgpu] Optimize NCHW layout for InstanceNormalization (#18123)
### Description
The changes in this PR include:
1) Fix f16 errors in InstanceNormalization with NCHW format.
2) Use vec to further optimize the original algorithm.
3) (Removed) Don't do layout conversion for InstanceNormalization for
JSEP since InstanceNormalization itself is suitable for NCHW layout and
has better performance in our current implementation.

Tested on sd-vae-decoder-f16.onnx, the time drops from 314 ms to 285 ms.
The aggregate GPU profiling data is shown below (note that the data is
based on change 3):
Before:

Kernel | Time (ms) | Percentage (%)
-- | -- | --
Conv | 201.55 | 69.56
InstanceNormalization | 42.49 | 14.67
Transpose | 28.95 | 9.99
Mul | 5.69 | 1.96
Add | 3.82 | 1.32
MatMul | 3.27 | 1.13
Sigmoid | 2.24 | 0.77
Resize | 1.16 | 0.40
Softmax | 0.34 | 0.12
Cast | 0.24 | 0.08
Sum | 289.75 |

After:

Kernel | Time (ms) | Percentage (%)
-- | -- | --
Conv | 205.44 | 79.43
InstanceNormalization | 18.24 | 7.05
Transpose | 17.64 | 6.82
Mul | 5.69 | 2.20
Add | 3.81 | 1.47
MatMul | 3.56 | 1.38
Sigmoid | 2.24 | 0.86
Resize | 1.19 | 0.46
Softmax | 0.59 | 0.23
Cast | 0.24 | 0.09
Sum | 258.65 |

From the tables above, we can see that the time of two ops is greatly
reduced: InstanceNormalization and Transpose. The transpose time is
reduced because each InstanceNormalization is surrounded by two reshape
ops in sd-vae-decoder-f16.onnx. Because JSEP prefers NHWC and
InstanceNormalization is a layout-sensitive op, two extra transpose ops
are inserted dynamically when executing this model. After this change,
those inserted transpose ops are no longer needed, so the overall
transpose time is reduced.
2023-12-15 11:26:15 -08:00
Jiajia Qin
4bbed4c71a
[js/webgpu] Fix f16 errors in unary (#18839)
### Description
This PR fixes the error below:
```
no matching overload for operator > (vec4<f16>, vec4<f32>)
```
2023-12-15 11:25:12 -08:00
Yang Gu
81ad1e6ac3
[js/webgpu] Fix typo of outputShapes in profiling message (#18837) 2023-12-15 08:57:48 -08:00
Jiajia Qin
b30e721dc8
[js/webgpu] Provide a naive vectorized matmul algorithm (#18758)
### Description
This PR provides a vectorized matmul algorithm. In most situations, we
still use the workgroup-memory-optimized matmul. But in some situations,
such as when N and K are very small, the workgroup-optimized matmul
can't fully utilize the underlying hardware due to the 32x32 tile size.
So for very small N/K, we switch to the naive vectorized matmul
algorithm to improve the hardware execution unit usage.

With this PR, matmul with input0: [1, 36864, 3], input1: [1, 3, 3],
input2: [3] drops from 4.34 ms to less than 1 ms on Intel Gen9 GPUs.
2023-12-13 09:03:23 -08:00
satyajandhyala
0ca84549ab
[JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727)
### Description
Added uniforms to Reduce op


### Motivation and Context
Improve performance.
2023-12-12 11:12:23 -08:00
satyajandhyala
d673e39ad8
[JS/WebGPU] Added uniforms to Tile and Where Ops (#18768)
### Description
Added uniforms to Tile and Where Ops


### Motivation and Context
Improve performance.
2023-12-11 20:58:52 -08:00
Jiajia Qin
b4be9e1bbb
[js/webgpu] Fix shader compilation errors in cumsum (#18779)
### Description
This PR fixes the shader compilation errors below:
```
Tint WGSL reader failure: :39:31 error: no matching overload for operator + (f32, i32)

5 candidate operators:
  operator + (T, T) -> T  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (vecN<T>, T) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (T, vecN<T>) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (vecN<T>, vecN<T>) -> vecN<T>  where: T is abstract-float, abstract-int, f32, i32, u32 or f16
  operator + (matNxM<T>, matNxM<T>) -> matNxM<T>  where: T is abstract-float, f32 or f16

                    sum = sum + get_inputByIndices(inputIndices);
                              ^


 - While validating [ShaderModuleDescriptor "CumSum"]
 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "CumSum"]).
```
2023-12-11 18:11:38 -08:00
Caroline Zhu
eb03032925
[js/web/training] lazyResetGrad implementation (#18711)
### Description
* implemented lazyResetGrad function

### Motivation and Context
* we are in the process of adding language bindings to enable training
on web
* `lazyResetGrad` ensures that the gradients are calculated correctly
after the first runTrainStep call

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-12-11 17:36:54 -08:00
Yulong Wang
efbef5f611
[js/webgpu] allow to specify callback for profiling data (#18732)
### Description

**This PR is a replacement of #17820.**

Allows specifying a callback for profiling data.

*Previous*:
```js
ort.env.webgpu.profilingMode = 'default';  // enable profiling

// profiling data will output to console.
```

*Now*:
```js
ort.env.webgpu.profiling = {
  mode: 'default',  // enable profiling
  ondata: (data) => {
    // .. process the profiling data
  }
};

// For each kernel, "ondata" will be called once. Output goes to the console only if ondata is not specified.
```
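
The dispatch rule described in the comment above (call `ondata` once per kernel, otherwise log to the console) can be sketched like this. `reportKernelProfile`, the injected `log` parameter, and the fields of the sample records are all hypothetical and chosen only for illustration; the sketch also assumes that any mode other than `'default'` means profiling is off.

```js
// Hypothetical dispatcher mirroring the described behavior: when profiling is
// enabled, call `ondata` once per kernel if provided, else fall back to logging.
function reportKernelProfile(profiling, data, log = console.log) {
  if (profiling.mode !== 'default') {
    return;  // assumption: any other mode means profiling is disabled
  }
  if (typeof profiling.ondata === 'function') {
    profiling.ondata(data);
  } else {
    log(data);
  }
}

const collected = [];
const profilingConfig = {
  mode: 'default',
  ondata: (data) => collected.push(data),
};

// The record fields below are made up for illustration.
reportKernelProfile(profilingConfig, {kernelName: 'Conv', durationNs: 1200});
reportKernelProfile(profilingConfig, {kernelName: 'Add', durationNs: 300});

console.log(collected.length);
```

Collecting records in a callback makes it possible to aggregate per-kernel timings instead of parsing console output.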
2023-12-07 14:10:28 -08:00
Guenther Schmuelling
9aa7284351
fix lint error (#18708) 2023-12-05 10:37:03 -08:00
satyajandhyala
70816001cc
[JS/Web] AddedUniforms in GatherElements. (#18670)
### Description
Use Uniforms in GatherElements and clean-up



### Motivation and Context
Improve performance
2023-12-05 09:19:53 -08:00
Xu Xing
f949e0580b
[js/webgpu] Support uniforms for pool (#18656) 2023-12-05 07:54:30 -08:00
satyajandhyala
10c547516d
[JS/Web] Added CumSum operator to JSEP (#18637)
### Description
Added CumSum operator



### Motivation and Context
Reduce CPU <-> GPU data movement.
2023-12-05 07:51:53 -08:00
Caroline Zhu
c02a386145
[js/web/training] Implemented runEvalStep & runOptimizerStep (#18259)
### Description
* implemented runEvalStep and runOptimizerStep
* added hasEvalModel and hasOptimizerModel boolean fields in
TrainingSession representation
* added evalInputNames and evalOutputNames fields to
TrainingSessionHandler & TrainingSession
* removed the inputNamesEncoded and outputNamesEncoded fields from
TrainingSessionHandler -- since none of the training methods require the
input names and output names as parameters, there's no need to store
them.

### Motivation and Context
* part of the work for implementing web bindings for training
* previous PR: #18250

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-12-04 13:37:14 -08:00