Commit graph

349 commits

Author SHA1 Message Date
Jiajia Qin
3580e01348
[js/webgpu] Optimize grouped conv (#21892)
### Description
<!-- Describe your changes. -->
#21618

This PR optimizes grouped conv by 1) more sequential memory access in
gpu 2) reusing input's data to reduce global memory access times.

See `Conv|GroupedConv` op in
[Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) becomes
92 ms from 1058 ms on iGPUs with 32 EU.

For the whole model on my iGPUs with 32 EU,
wav2vec2 model becomes 982ms from 1942 ms.
squeezebert-uncased model becomes 71.86ms from 431.77ms.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-09-04 17:16:35 -07:00
Jiajia Qin
a80bfed5b4
[js/webgpu] Optimize transpose (#21964)
### Description
<!-- Describe your changes. -->
Fix bugs in previous implementation and add more situations to go the
optimized path.

Below situations will go to the optimized path.
1. 2d inputs or squeezed 2d inputs
2. channels last or channels first transpose. For example, channel last
transpose: [1, 256, 512, 512] -> [1, 512, 512, 256]
For this case, the transpose becomes [256, 512x512] -> [512x512, 256]

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
For SD Turbo demo, the total transpose time becomes 39.98ms from
122.09ms. And the correspnding percents becomes 3.89% from 11.05% in
this demo.

This PR will also help #21618, the total transpose time in that demo
becomes 17.32 ms from 70.25 ms on my iGPUs.
2024-09-04 12:04:04 -07:00
Guenther Schmuelling
4fece0430f
remove duplicate function definition (#21903) 2024-08-28 16:18:56 -07:00
xhcao
3bfb5e4f62
[js/webgpu] support float16 for Clip (#21584)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-28 13:19:20 -07:00
Satya Kumar Jandhyala
af18824f43
[JS/WebGPU] Add GatherBlockQuantized op support (#21734)
### Description
Add GatherBlockQuantized operator to JSEP.



### Motivation and Context
Gemma model requires this.
2024-08-26 14:46:04 -07:00
Xu Xing
d9c57ac7db
[js/webgpu] Enable pad f16 uniform (#21691)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-08-26 07:58:48 -07:00
Jiajia Qin
87165b92e9
[js/webgpu] optimize MatmulNBits (#21747)
### Description
<!-- Describe your changes. -->
See 2x speedup for phi3 on the integrated intel gpu with this
optimization.

The optimization is mainly to store input A's data into local variable
instead of loading them from global memory each time when calculate them
with B data.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-23 16:36:00 -07:00
Jiajia Qin
27a6890529
[js/webgpu] Optimize conv1d by conv2d (#19388)
### Description
<!-- Describe your changes. -->

Optimize conv1d to go to the conv2d path to utilize the conv2d's
optimization path.

See whisper-tiny-encoder model becomes 158.66 ms from 532.28 ms. Conv
goes to Conv2DMatMul(8 ms) instead of GroupedConv(382 ms).

Old profiling result:
Kernel | Time (ms) | Percentage (%)
-- | -- | --
Conv\|GroupedConv | 382.99 | 71.95
MatMul | 126.16 | 23.70
Softmax | 7.01 | 1.32
Transpose | 4.59 | 0.86
Add | 4.39 | 0.82
Mul | 2.36 | 0.44
Div | 1.44 | 0.27
ReduceMean\|ReduceMeanShared | 1.25 | 0.23
Erf | 0.85 | 0.16
Sub | 0.72 | 0.14
Pow | 0.46 | 0.09
Sqrt | 0.07 | 0.01
Sum | 532.28 |  

New profiling result with this PR:

Kernel | Time (ms) | Percentage (%)
-- | -- | --
MatMul | 127.07 | 80.09
Conv\|Conv2DMatMul | 8.00 | 5.04
Softmax | 6.95 | 4.38
Transpose | 4.65 | 2.93
Add | 4.26 | 2.68
Mul | 2.56 | 1.61
Div | 1.51 | 0.95
ReduceMean\|ReduceMeanShared | 1.31 | 0.83
Erf | 0.85 | 0.54
Sub | 0.79 | 0.50
Pow | 0.46 | 0.29
Conv\|Transpose | 0.26 | 0.17
Sqrt | 0.00 | 0.00
Sum | 158.66 |  

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-08-22 22:56:07 -07:00
Satya Kumar Jandhyala
1fb2e71ddc
[JS/WebGPU] Avoid producing presentKey/presentValue outputs if pastKey/pastValue … (#21782)
Avoid producing presentKey/presentValue outputs if pastKey/pastValue
don't exists.

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-19 18:02:19 -07:00
xhcao
417aa00406
[js/webgpu] fix conv1d error (#21585)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-18 15:45:13 -07:00
Jiajia Qin
c4ade796d6
[js/webgpu] Fix attention shader recompilation issue (#21770)
### Description
<!-- Describe your changes. -->

This PR fixes the `AttentionProbsSoftmax` recompilation issue when
executing the phi3 model. With this fix, it will further improve the
phi3 performance.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-17 17:15:15 -07:00
Yang Gu
49fc168eed
[js/webgpu] Handle negative axis in op Split (#21771)
This is to fix issue #21703, where the axis is a negative value in the
model. According to the spec
(https://onnx.ai/onnx/operators/onnx__Split.html), negative axis means
counting dimensions from the back.
2024-08-17 16:41:23 -07:00
Tianlei Wu
d79e3c5791
Extend Attention Bias Broadcast Support (#21710)
### Description
Previously, MultiHeadAttention supports relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supports [1, N, S, T]. This will extend the support to allow [1, N, S,
T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs.

- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script

Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
2024-08-16 15:40:04 -07:00
Yulong Wang
ef2ccc477b
[js/web] Add support for int4/uint4 tensor (#21720)
### Description
Add support for int4/uint4 tensor.
2024-08-15 21:32:10 -07:00
Yulong Wang
abdc31de40
[js] change default formatter for JavaScript/TypeScript from clang-format to Prettier (#21728)
### Description

See
454996d496
for manual changes (excluded auto-generated formatting changes)

### Why

Because the toolsets for old clang-format is out-of-date. This reduces
the development efficiency.

- The NPM package `clang-format` is already in maintenance mode. not
updated since 2 years ago.
- The VSCode extension for clang-format is not maintained for a while,
and a recent Node.js security update made it not working at all in
Windows.

No one in community seems interested in fixing those.

Choose Prettier as it is the most popular TS/JS formatter.

### How to merge

It's easy to break the build:
- Be careful of any new commits on main not included in this PR.
- Be careful that after this PR is merged, other PRs that already passed
CI can merge.

So, make sure there is no new commits before merging this one, and
invalidate js PRs that already passed CI, force them to merge to latest.
2024-08-14 16:51:22 -07:00
xhcao
9c6ee89fa7
[js/webgpu] fix two errors of attention operator (#21687)
Fix two issues:
(1) scale shall be fp32 instead of f16
(2) Softmax program does not handle the normalized dispatch group values, so if the sequence length is over 65535, the result is not correct for this program.
2024-08-13 09:42:34 -07:00
Satya Kumar Jandhyala
51b2044120
[JS/WebGPU] Add Dequantizelinear operator (#21642)
### Description
Added DequantizeLinear operator for JSEP.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-09 14:44:19 -07:00
Prathik Rao
134f47743e
bumps up version in main from 1.19 -> 1.20 (#21588)
Bump up version in main from 1.19.0 to 1.20.0 since the release branch
has been cut.
2024-08-05 15:46:04 -07:00
Yulong Wang
b03c9496aa
[js/web] allow load WebAssembly binary from buffer (#21534)
### Description

This PR adds a new option `ort.env.wasm.wasmBinary`, which allows user
to set to a buffer containing preload .wasm file content.

This PR should resolve the problem from latest discussion in #20876.
2024-07-29 13:39:38 -07:00
Xu Xing
0d7cf301a1
[js/webgpu] Add activation Tanh (#21540)
Bug:https://github.com/microsoft/onnxruntime/issues/21467

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-29 11:05:34 -07:00
Xu Xing
5bc12bf209
[js/webgpu] Add activation for conv3d naive (#21466)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-29 08:47:41 -07:00
mindest
5b9369e93c
Fix typos according to reviewdog report. (#21335)
### Description
Fix typos based on reviewdog report but with some
exceptions/corrections.
2024-07-22 13:37:32 -07:00
Xu Xing
92a8407b39
[js/webgpu] Remove unnecessary initialization of var (#21312)
This var has been initialized to 0 in tint, so no need extra loop to do
it again:
```
  float tint_symbol_52[1][4] = (float[1][4])0;
  {
    for(int tint_symbol_53 = 0; (tint_symbol_53 < 1); tint_symbol_53 = (tint_symbol_53 + 1)) {
      {
        for(int tint_symbol_54 = 0; (tint_symbol_54 < 4); tint_symbol_54 = (tint_symbol_54 + 1)) {
          tint_symbol_52[min(uint(tint_symbol_53), 0u)][min(uint(tint_symbol_54), 3u)] = 0.0f;
        }
      }
    }
  }
```
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-12 12:34:34 -07:00
pengwa
88336ffa92
Fix typos - 1st Wave (#21278)
### Description

There are so many typos reported by the review dog, [Optional Lint]
actions (example:
https://github.com/microsoft/onnxruntime/actions/runs/9864564489/job/27239732367),
this PR is to fix some of them.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-07-11 13:35:08 +08:00
Enrico Galli
4c3c809bdb
[js/webnn] Enable user-supplied MLContext (#20600)
### Description
This PR enables the API added in #20816 as well as moving context
creation to JS.

### Motivation and Context
In order to enable I/O Binding with the upcoming
[MLBuffer](https://github.com/webmachinelearning/webnn/issues/542) API
in the WebNN specification, we need to share the same `MLContext` across
multiple sessions. This is because `MLBuffer`s are restricted to the
`MLContext` where they were created. This PR enables developers to use
the same `MLContext` across multiple sessions.
2024-07-08 10:19:39 -07:00
Xu Xing
c3076721f3
[js/webgpu] Support conv3d naive (#20706)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-06-19 10:13:50 -07:00
Yulong Wang
631a2c16be
[js/web] skip default locateFile() when dynamic import is disabled (#21073)
### Description
skip default `locateFile()` when dynamic import is disabled. This allows
the file to work with bundlers to load WebAssembly file correctly if
`env.wasm.wasmPaths` is not set.
2024-06-18 12:21:45 -07:00
Yang Gu
1473d66a00
[js/webgpu] Prefer adapter.info to adapter.requestAdapterInfo (#21065)
WebGPU is deprecating async adapter.requestAdapterInfo, and replacing it
with sync adapter.info.
Spec change: https://github.com/gpuweb/gpuweb/pull/4662
2024-06-18 12:02:38 -07:00
Guenther Schmuelling
c749bd997a
webgpu quickgelu (#20939) 2024-06-06 08:21:33 -07:00
Yulong Wang
ab9f153746
[js/web] allow build target for non dynamic import (#20898)
### Description
<!-- Describe your changes. -->

This PR allows to build ORT web to `ort{.all|.webgpu}.bundle.min.mjs`,
which does not have any dynamic import. This makes it possible to use
ort web via static import in service worker.

Fixes #20876
2024-06-03 12:33:37 -07:00
Yulong Wang
35697d2421
[js/webnn] update API of session options for WebNN (#20816)
### Description

This PR is an API-only change to address the requirements being
discussed in #20729.

There are multiple ways that users may create an ORT session by
specifying the session options differently.

All the code snippet below will use the variable `webnnOptions` as this:
```js
const myWebnnSession = await ort.InferenceSession.create('./model.onnx', {
   executionProviders: [
     webnnOptions
   ]
});
```

### The old way (backward-compatibility)

```js
// all-default, name only
const webnnOptions_0 = 'webnn';

// all-default, properties omitted
const webnnOptions_1 = { name: 'webnn' };

// partial
const webnnOptions_2 = {
  name: 'webnn',
  deviceType: 'cpu'
};

// full
const webnnOptions_3 = {
  name: 'webnn',
  deviceType: 'gpu',
  numThreads: 1,
  powerPreference: 'high-performance'
};
```

### The new way (specify with MLContext)

```js
// options to create MLcontext
const options = {
  deviceType: 'gpu',
  powerPreference: 'high-performance'
};

const myMlContext = await navigator.ml.createContext(options);

// options for session options
const webnnOptions = {
  name: 'webnn',
  context: myMlContext,
  ...options
};
```

This should throw (because no deviceType is specified):
```js
const myMlContext = await navigator.ml.createContext({ ... });
const webnnOptions = {
  name: 'webnn',
  context: myMlContext
};
```

### Interop with WebGPU
```js
// get WebGPU device
const adaptor = await navigator.gpu.requestAdapter({ ... });
const device = await adaptor.requestDevice({ ... });

// set WebGPU adaptor and device
ort.env.webgpu.adaptor = adaptor;
ort.env.webgpu.device = device;

const myMlContext = await navigator.ml.createContext(device);
const webnnOptions = {
  name: 'webnn',
  context: myMlContext,
  gpuDevice: device
};
```

This should throw (because cannot specify both gpu device and MLContext
option at the same time):
```js
const webnnOptions = {
  name: 'webnn',
  context: myMlContext,
  gpuDevice: device,
  deviceType: 'gpu'
};
```
2024-05-31 03:25:14 -07:00
Xu Xing
25ac65375c
[js/webgpu] Fix mha name (#20860)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-30 00:01:06 -07:00
Guenther Schmuelling
33a68d221f
add missing file for pr20791 (#20811)
this file should have been in pr20791 to allow fp16 in the tile
implementation
2024-05-24 09:59:13 -07:00
Satya Kumar Jandhyala
bab5037eab
Eliminate explicit Concat operations in Attention (#20556)
### Description
Remove explicitly concatinating pastKey with Key and pastValue with
Value.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-24 09:07:57 -07:00
Xu Xing
f1fef19b6e
[js/webgpu] Support shared memory for transpose 2d (#19267)
For 1024x1024, without shared memoey, 18.7ms. With shared memory 13.2ms.
2024-05-22 08:15:44 -07:00
Yulong Wang
036fcd93d4
[js/web] optimize module export and deployment (#20165)
### Description

This PR make numbers of optimizations to onnxruntime-web's module export
and deployment.

See each section below for more details.

#### Preview

>
[onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21)

> ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~

> ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~

> ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~

<details>
<summary><h4>Breaking changes</h4></summary>

There is no code change required, but there are a few differences
regarding **code import**, **flags**, **bundler config** and
**deployment steps**.

#### Importing:

Import table is changed. See following for details.

<details>
<summary><h5>Current import table:</h5></summary>

| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm |
Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
  | `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
| `ort.training` | `onnxruntime-web/training` |  |  | ✔️ |
✔️<sup>\[1]</sup> | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
  | `ort.wasm-core` | `onnxruntime-web/wasm-core` |  |  | ✔️ |  |  |
| `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ✔️<sup>\[2]</sup>
|  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

* [1] didn't test. may not actually work.
* [2] not working. this is a mistake in build config.

</details>

<details>
<summary><h5>Proposed update:</h5></summary>

| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm |
Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
| `ort.all` |
~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ |
✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
  | `ort.training` | `onnxruntime-web/training` |  |  | ✔️ | ✔️ | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
| ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ | ~~~~ | ~~~~
| ~~✔️~~ | ~~~~ | ~~~~ |
  | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ~~✔️~~  |  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

</details>

#### Flags:

The following flags are deprecated:
- `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in
build.

The following flags changed their type:
- `env.wasm.wasmPaths`: When using this flag as a string ( for the URL
prefix ), nothing is changed. When using this flag as an object ( for
per-file path override ), the type changed:
  ```diff
  -  export interface Old_WasmFilePaths{
  -    'ort-wasm.wasm'?: string;
  -    'ort-wasm-threaded.wasm'?: string;
  -    'ort-wasm-simd.wasm'?: string;
  -    'ort-training-wasm-simd.wasm'?: string;
  -    'ort-wasm-simd-threaded.wasm'?: string;
  -  };
  +  export interface New_WasmFilePaths {
  +    /**
  +     * Specify the override path for the main .wasm file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .wasm file is:
  +     * - `ort-wasm-simd-threaded.wasm` for default build
+ * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and
WebNN)
  +     * - `ort-training-wasm-simd-threaded.wasm` for training build
  +     */
  +    wasm?: URL|string;
  +    /**
  +     * Specify the override path for the main .mjs file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .mjs file is:
  +     * - `ort-wasm-simd-threaded.mjs` for default build
+ * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and
WebNN)
  +     * - `ort-training-wasm-simd-threaded.mjs` for training build
  +     */
  +    mjs?: URL|string;
  +  }
  ```

#### Bundler compatibility:

Config changes are need for bundlers. See usage example in
/js/web/test/e2e/ for Webpack, parcel and rollup.

#### Deployment:

- if consuming from a CDN, there is no breaking change.
- if consuming from a local server, need to copy all `ort-*.wasm` and
`ort-*.mjs` files (totally 6 files) in the dist folder. (previously only
need to copy `ort-*.wasm` files.)

</details>
<details>
<summary><h4>Problems</h4></summary>

There are a few problems with the current module export and deployment:

- Script URL cannot be correctly inferred when imported as ESM.
- Workers are forcefully encoded using Blob URL, which makes
onnxruntime-web not working in CSP environment and Node.js, when using
proxy or multi-threading feature.
- Generated JS code (by Emscripten) is encoded using
`function.toString()`, which is unstable and error-prone.
- When running with a different Emscripten build, always need the build
step. Making it difficult to swap artifacts in deveopment/debug.
</details>
<details>
<summary><h4>Goals</h4></summary>

- Full ESM support
- Support variances of ways to import. Including:
- import from HTML's `<script>` tag (IIFE format, exporting to global
variable `ort`)
    ```html
<script
src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script>
    ```
  - import from source code inside `<script type="module">` tag (ESM)
    ```html
    <script type="module">
import * as ort from
"https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs";

      // using 'ort'
    </script>
    ```
- import in a CommonJS project (CJS format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js
    const ort = require('onnxruntime-web');
    ```
- import in an ESM project (ESM format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js (or main.mjs)
    import * as ort from 'onnxruntime-web';
    ```
- Support popular bundlers when importing onnxruntime-web into a CJS/ESM
project.
  - webpack (esm requires extra post-process step)
  - rollup
  - parcel (esm requires extra post-process step)
  - More bundlers **TBD**
- Multi-threading support for Node.js

NOTE: keeping single JavaScript file (the all-in-one bundle) is no
longer a goal. This is because technically there is a conflict with the
other requirements.
</details>

<details>
<summary><h4>Important Design Decisions</h4></summary>

- Drop support of single JavaScript output.
- The current onnxruntime-web distribution uses a single JavaScript file
to include all code. While there are a few benefits, it also creates
problems as mentioned above. Since ESM is being used more and more
widely, and browsers are making more restricted security checks and
requirement, the old Blob based solution is going to be replaced.
- To achieve the requirement, specifically, the CSP environment support,
we have to offer a non Blob based solution. Therefore, we have to
distribute multiple files and drop the single file solution.

- Do not run parser/postprocess on Emscripten generated JavaScript.
- Emscripten is evolving quickly so we should only depends on what's in
its documentation instead of a certain implementation details. (for
example, currently we patch on its code to deal with a special variable
`_scriptDir`)
  - Keep the generated files as-is also helps to:
    - reduce the size of ort.min.js
- make it easier to replace build artifacts when in development/debug

- Drop support for non-SIMD and non-MultiThread. This helps to reduce
the number of artifacts in distribution.
  - (fixed-sized) SIMD is supported in any mainstream JS environment.
- Multi-thread as WebAssembly feature is supported in any mainstream JS
environment. In some environment the feature is guarded with cross
origin policy, but it can still work if not trying to create any worker.

- Use ESM output for Emscripten generated JavaScript.
- There are 2 ways to dynamically import classic (umd) modules and
neither of them are recommended:
- dynamically creating a <script> tag. This changes the HTML structure
and have quite a lot of compatibility issue
- use `fetch()` and `eval()`. However `eval` is strongly suggested to be
avoid because there is a great perf hit.
- importing ESM is super easy - just use the `import()` call.
Considering ESM is widely supported in modern browsers and Node.js this
is the better option.

- Add Blob based solution as a fallback for cross-origin workers.
- There are still wide use case of importing onnxruntime-web from CDN.
In this usage, make it able create worker by using `fetch()`+`Blob` to
create a same-origin Blob URL.

</details>

<details>
<summary><h4>Distribution File Manifest</h4></summary>

The distribution folder contains the following files:

- WebAssembly artifacts. These files are the result of compiling the
ONNX Runtime C++ code to WebAssembly by Emscripten.

  | File Name | Build Flags |
  |------|-----|
| ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm |
`--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-training-wasm-simd-threaded.mjs <br/>
ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/>
`--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm
| `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep`
<br/> `--use_webnn` |

- onnxruntime-web JavaScript artifacts. These files are generated by
ESBuild as the entry point for onnxruntime-web.

  There are multiple build targets for different use cases:
  | Target Name | Path for "import" or "require" | Description |
  |------|-----|-----|
  | `ort` | `onnxruntime-web` | The default target. |
  | `ort.all` | `onnxruntime-web/all` | The target including webgl. |
  | `ort.node` | `onnxruntime-web` | The default target for Node.js. |
| `ort.training` | `onnxruntime-web/training` | The target including
training APIs |
| `ort.wasm` | `onnxruntime-web/wasm` | The target including only
WebAssembly (CPU) EP |
| `ort.webgl` | `onnxruntime-web/webgl` | The target including only
WebGL EP |


  For each target, there are multiple files generated:
  | File Name | Description |
  |------|-----|
| [target].js | The entry point for the target. IIFE and CommonJS
format. |
  | [target].mjs | The entry point for the target. ESM format. |
| [target].min.js <br/> [target].min.js.map | The entry point for the
target. Minimized with sourcemap. IIFE and CommonJS format. |
| [target].min.mjs <br/> [target].min.mjs.map | The entry point for the
target. Minimized with sourcemap. ESM format. |
| [target].proxy.mjs | (if appliable) The proxy ESM module for the
target. |
| [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if
appliable) The proxy ESM module for the target. Minimized with
sourcemap. |

</details>

<details>
<summary><h4>Dynamic Import Explained</h4></summary>

- Local Served | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort-wasm-simd-threaded.mjs]
                    |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                    |
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
                                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Local Served | Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort.proxy.min.mjs]
                    |
                    + new Worker()--> [ort.proxy.min.mjs (worker)]
                                        |
+ import()--> [ort-wasm-simd-threaded.mjs]
                                                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                        |
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Cross Origin | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort-wasm-simd-threaded.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort-wasm-simd-threaded)]
                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                        |
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
                                            |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```

- Cross Origin | Proxy
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort.proxy.min.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort.proxy)]
                        |
+ new Worker()--> [blob:... (ort.proxy) (worker)]
                                            |
+ fetch('ort-wasm-simd-threaded.mjs')
                                                |
+ URL.createObjectURL(res.blob())
                                                |
+ import()--> [blob:... (ort-wasm-simd-threaded)]
                                                                |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                                |
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
</details>
2024-05-20 09:51:16 -07:00
Xu Xing
8c59cd4fce
[js/webgpu] Support GroupQueryAttention (#20237)
TODOs:
1. Handle H * params.kvNumHeads greater than work group size limit.
2. Support BNSH kv cache.
2024-05-13 09:43:37 -07:00
Guenther Schmuelling
55a6986d38
optimize skiplayernorm (#20551)
SkipSimplifiedLayerNormalization used in phi3 comes down from 222usec to
14usec
2024-05-08 08:40:03 -07:00
Yi-Hong Lyu
b2481e3602
Bump up version in main from 1.18.0 to 1.19.0 (#20489)
Bump up version in main from 1.18.0 to 1.19.0 since the release branch
has been cut.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-04-29 20:21:41 -07:00
Satya Kumar Jandhyala
99b0e19f11
[JS/WebGPU] MatMulNBits remove unnecessary condition (#20396)
Distribute writing-to-output work over all threads in MatMulNBits.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-29 14:27:21 -07:00
Satya Kumar Jandhyala
736cbb3925
[JS/WebGU] Support fp16 in Attention by performing the computation in fp32. (#20486)
### Description
Perform computation in fp32 and convert finally to fp16.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-27 08:30:26 -07:00
Satya Kumar Jandhyala
21b3cbc3af
[WIP][JS/WebGPU] Inputs Key and Value could be 4-dims. (#20470)
### Description
The Key and Value inputs could be 4-dims


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-25 13:33:46 -07:00
Satya Kumar Jandhyala
ae78cdb5d7
[JS/WebGPU] MultiheadAttention bugfix (#20447)
### Description
Fixed pastkey, key and pastvalue, value concatenation condition and
fixed index error. Added new test cases.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-24 08:43:14 -07:00
Guenther Schmuelling
33d5ea39b3
[js/webgpu] fixes for fp16 attention (#20440) 2024-04-24 08:01:28 -07:00
Satya Kumar Jandhyala
d42ac7f0c6
[JS/WebGPU] Multihead attention improvements (#20286)
### Description
Enabled more usecases



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-23 12:39:49 -07:00
Guenther Schmuelling
b8e6684313
more conservitive gpu-buffer cache algo (#20312)
tuned based on 80 models to keep performance impact minimal
2024-04-23 09:07:04 -07:00
Guenther Schmuelling
497a627a69
fix fp16 for skiplayernorm (#20381) 2024-04-19 12:12:02 -07:00
Guenther Schmuelling
a8a77ddfdc
fix csum and enable ut (#20355) 2024-04-17 15:01:06 -07:00
Satya Kumar Jandhyala
b33216be4c
[JS/WebGPU] Improve MatMulNBits perf (#19974)
### Description
<!-- Describe your changes. -->
Improve performance using shared memory


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-12 11:03:05 -07:00
Yulong Wang
50bd4571ac
[js/web] support SimplifiedLayerNorm and SkipSimplifiedLayerNorm (#20277)
### Description
Support operator `SimplifiedLayerNorm` and `SkipSimplifiedLayerNorm` for
WebGPU backend.
2024-04-11 14:08:50 -07:00