### Description
This PR allows zero-sized output.
To make the implementation simple, it does not support partial
zero-sized tensor. Which means, either all outputs are zero-sized, or an
error will be reported.
added 2 tests:
- op test of `Add` with input T[2,0] T[2,1], and
- test_split_zero_size_splits
### Description
<!-- Describe your changes. -->
1. Fix Where operator to handle Boolean input less than 4 bytes.
2. Fix JSEP test harness to use tensor names consistently.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Switch to setImmediate to avoid starving the Node.js event loop
There should really be a true async version though, running
computationally intensive things on the event loop will stop everything
else from happening while it is running, e.g. a web server from
answering requests.
This can be done by wrapping `RunAsync` behind a
[`napi::Promise`](https://github.com/nodejs/node-addon-api/blob/main/doc/promises.md)
to run on the onnxruntime thread pool or [`AsyncWorker`](
https://github.com/nodejs/node-addon-api/blob/main/doc/async_worker.md)
for the Node.js/libuv thread pool.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Without this, if you run inference in a tight loop, without anything
else in between that is async/deferred, `process.nextTick` will lead to
starving the event loop and not letting anything else run,
`setImmediate` at least lets the event loop spin between calls to `run`.
See
https://dev.to/ynmanware/setimmediate-settimeout-and-process-nexttick-3mfd
Contributed on behalf of [Swimm](https://swimm.io/)
Bumps [ip](https://github.com/indutny/node-ip) from 1.1.8 to 1.1.9.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1ecbf2fd8c"><code>1ecbf2f</code></a>
1.1.9</li>
<li><a
href="6a3ada9b47"><code>6a3ada9</code></a>
lib: fixed CVE-2023-42282 and added unit test</li>
<li>See full diff in <a
href="https://github.com/indutny/node-ip/compare/v1.1.8...v1.1.9">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This is used in sam-h-decoder-f16.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Bumps [ip](https://github.com/indutny/node-ip) from 1.1.8 to 1.1.9.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1ecbf2fd8c"><code>1ecbf2f</code></a>
1.1.9</li>
<li><a
href="6a3ada9b47"><code>6a3ada9</code></a>
lib: fixed CVE-2023-42282 and added unit test</li>
<li>See full diff in <a
href="https://github.com/indutny/node-ip/compare/v1.1.8...v1.1.9">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
BUG: https://github.com/microsoft/onnxruntime/issues/18855
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change adds only necessary code to enable ort-web works with any
Float16Array polyfill. Unlike #19302, in this PR, ort-web does not
include any specific polyfill; instead, it's user's choice for how to
use a polyfill.
ORT-web uses Float16Array if it's available; otherwise, fallback to use
Uint16Array.
```js
// case 1: user does not use polyfill:
import * as ort from 'onnxruntime-web';
const myF16Data = new Uint16Array(...); // need to use Uint16Array
const myF16tensor = new ort.Tensor('float16', myF16Data, dims);
```
```js
// case 2: user use polyfill:
import * as ort from 'onnxruntime-web';
import {
Float16Array, isFloat16Array, isTypedArray,
getFloat16, setFloat16,
f16round,
} from "@petamoriken/float16";
globalThis.Float16Array = Float16Array; // ort-web will pick the global Float16Array
const myF16Data = new Float16Array(...); // Use the polyfilled Float16Array type
const myF16tensor = new ort.Tensor('float16', myF16Data, dims);
```
### Description
This is required to make shape uniforms really work.
### Motivation and Context
The bug was unveiled in a model with multiple Split nodes. The later
nodes would try to reuse a previous pipeline cache, while the old shapes
were hardcoded as constants in cache.
### Description
Add MatMulNBits to support MatMul using 4-bit quantized weights
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Since TypeScript v4.7, types need to specify inside "exports" field when
it is available. This PR appends types just before each "default" (which
is required by spec to be the last item).
Fixes#19403.
### Description
This PR 1) adds LeakyRelu activation for fusedConv; 2) makes `vec4<f16>`
value work with `float32` uniforms attributes.
For example:
`clamp(value, vec4<f16>(uniforms.clip_min),
vec4<f16>(uniforms.clip_max)` will throw compilation errors since
`uniforms.clip_min` and `uniforms.clip_min` are `f32` not `f16`. So we
need to change it to `clamp(value, vec4<f16>(f16(uniforms.clip_min)),
vec4<f16>(f16(uniforms.clip_max))`
And above problem was introduced when we make activation attributes as
uniforms instead of constant.
BTW, after adding LeakyRelu, `realesrgan-t256` model can pass.
### Description
support external data in npm test.
This allows test runner to detect whether an external data is available
in the test folder, and if it is, load it as external data
automatically.
this feature does not parse every model to figure out whether the model
has external data. the following comments in code explained how to
determine whether should parse the model file.
```js
// for performance consideration, we do not parse every model. when we think it's likely to have external
// data, we will parse it. We think it's "likely" when one of the following conditions is met:
// 1. any file in the same folder has the similar file name as the model file
// (e.g., model file is "model_abc.onnx", and there is a file "model_abc.pb" or "model_abc.onnx.data")
// 2. the file size is larger than 1GB
```
### Description
This PR expands the graph capture capability to JS EP, which is similar
to #16081. But for JS EP, we don't use the CUDA Graph, instead, we
records all gpu commands and replay them, which removes most of the cpu
overhead to avoid the the situation that gpu waiting for cpu.
mobilenetv2-12 becomes 3.7ms from 6ms on NV 3090 and becomes 3.38ms from
4.58ms on Intel A770.
All limitations are similar with CUDA EP:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
2. Usage of graph capture is limited to models where-in all ops in the
model can be partitioned to the JS EP or CPU EP and no memory copy
between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IObinding is required.
The usage is like below:
Method 1: specify outputs buffers explicitly.
```
const sessionOptions = {
executionProviders: [
{
name: "webgpu",
},
],
enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);
// prepare the inputBuffer/outputBuffer
... ...
const feeds = {
'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};
const fetches = {
'output': ort.Tensor.fromGpuBuffer(outputBuffer, { dataType: 'float32', dims: [1, 1000] })
};
let results = await session.run(feeds, fetches); // The first run will begin to capture the graph.
// update inputBuffer content
... ...
results = = await session.run(feeds, fetches); // The 2ed run and after will directly call replay to execute the graph.
... ...
session.release();
```
Method 2: Don't specify outputs buffers explicitly. Internally, when
graph capture is enabled, it will set all outputs location to
'gpu-buffer'.
```
const sessionOptions = {
executionProviders: [
{
name: "webgpu",
},
],
enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);
// prepare the inputBuffer
... ...
const feeds = {
'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};
let results = await session.run(feeds); // The first run will begin to capture the graph.
// update inputBuffer content
... ...
results = = await session.run(feeds); // The 2ed run and after will directly call replay to execute the graph.
... ...
session.release();
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
```math
\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=
\left\{
\begin{array}{cc}
-\frac{1-e^{-2\cdot(-x)}}{1+e^{-2\cdot(-x)}}, & x<0 \\
0, & x=0 \\
\frac{1-e^{-2x}}{1+e^{-2x}}, & x>0
\end{array}
\right.
```
### Motivation and Context
On some platforms,
$$\tanh(1000)=\frac{e^{1000}-e^{-1000}}{e^{1000}+e^{-1000}}$$ would
produce NaN instead of 0.999... or 1 (imagine $e^{1000}=\infty$ and
$\frac{\infty}{\infty}$ explodes).
When we enable webgpu profiling mode between session.create and
session.run, current implementation has a problem to create querySet
(and also queryResolveBuffer) if we share the commandEncoder with inputs
upload. This PR fixes this by moving the querySet creation to the place
we set queryType.
### Description
Added Uniforms to SkipLayerNorm
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
`env.webgpu.profiling` is a global flag. It may change before each
session.run. So the best place is to update it in `onRunStart` event.
After this, we can directly check `this.queryType`'s value. Without this
pr, we need to make sure that `getCommandEncoder()` is called before
checking `this.queryType`. Otherwise, it may happen that
`pendingKernels`'s length is not equal to `pendingDispatchNumber`'s
length. See the two ugly workarounds
[1)](e630dbf528 (diff-006fc84d3997f96a29b8033bd2075d6a0a9509211bd5812a6b934fc74fedfd9dR267-R268))
and
[2)](e630dbf528 (diff-618fe297fbe7a1da586380163b8fd2627311ccc217640a3c5cdc9c17a33472c1R73-R80))
if we don't introduce `onRunStart`. Or we need to call `setQueryType` in
each kernel run.
### Description
This op is required in mobilenetv3-small-100. With this PR,
mobilenetv3-small-100 model becomes less than 10 ms from over 100 ms on
ADL.
### Description
upgrade packages version.
```
# npm audit report
electron 23.0.0-alpha.1 - 23.3.13
Severity: moderate
ASAR Integrity bypass via filetype confusion in electron - https://github.com/advisories/GHSA-7m48-wc93-9g85
fix available via `npm audit fix --force`
Will install electron@28.1.4, which is a breaking change
node_modules/electron
get-func-name <2.0.1
Severity: high
Chaijs/get-func-name vulnerable to ReDoS - https://github.com/advisories/GHSA-4q6p-r6v2-jvc5
fix available via `npm audit fix`
node_modules/get-func-name
semver <=5.7.1 || 6.0.0 - 6.3.0 || 7.0.0 - 7.5.1
Severity: moderate
semver vulnerable to Regular Expression Denial of Service - https://github.com/advisories/GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - https://github.com/advisories/GHSA-c2qf-rxjj-qqgw
semver vulnerable to Regular Expression Denial of Service - https://github.com/advisories/GHSA-c2qf-rxjj-qqgw
fix available via `npm audit fix`
node_modules/cross-spawn/node_modules/semver
node_modules/global-agent/node_modules/semver
node_modules/semver
```
### Description
show warning when numThreads is set but threads is not supported.
Resolves#19148, #18933
for web: when crossOriginIsolated is false.
for node: always disable.
### Description
<!-- Describe your changes. -->
Bump up version to 1.18.0 since the release branch has been cut.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
We submit kernels in a batch (a fixed number 16 is used except for the
last batch) for better performance. However, timestamp query support is
at pass level so we disable the batch execution in profiling mode in
previous implementation. Actually we can have multiple passes in a batch
so that we don't have to disable batch execution, which is the first
enhancement of this PR.
Furthermore, WebGPU has an extension to support timestamp query inside
passes, which isn't supported by all the platforms (e.g., Windows
supports it, while macOS doesn't). This is expected to have lower cost
compared with multiple passes solution. So this PR also introduce this
support when available.
This PR also refactors some implementation related to kernelInfo, and
try to unify the related kernel names.
### Description
enable external data loading for ort-web.
### Why
The ORT external data design is highly depending on the file system,
especially synchronous file I/O APIs. Those are not available in web
platforms. We need to have extra code to make external data working on
web.
### How
Considering there is no file system in web, an implementation for web to
support external data is to use pre-loaded data. Assume model file
a.onnx includes initializers that linked to ./b.bin, we require users to
pass a full data file list when creating the session. The user code will
be look like:
```js
const mySess = await ort.InferenceSession.create('./path/model/a.onnx', {
// session options
externalData: [
{
// relative or absolute path/URL of the file,
// or a pre-loaded Uint8Array containing the data of the external data file
data: './path/data/b.bin',
// the relative path of the external data. Should match initializers' "location" value defined in the model file
path: './b.bin'
},
// { } if multiple external data file
]
});
```
Currently, this feature only works with JSEP build enabled.
when jsep calls javascript with an index to HEAP8 or HEAP32 the index is
negative when the heap is above 2GB, even if we pass it as uint32_t it
remains negative. So in javascript use >>> 0 to make it unsigned.
### Description
when DOM API is not avaiable, using OffscreenCanvas
### Motivation and Context
In some environment like service worker or web worker, the DOM API is
not avaiable, we can use OffscreenCanvas API to replace
`document.createElement('canvas')`.
Most of the APIs of OffscreenCanvas and HTMLCanvasElement are the same,
except that `toDataUrl` is missing.
It fix this issues #19032
resize for fp16 has 2 issues: scales are always f32 and roi can be f32
or f16.
scales:
this is fixed.
roi
this is fixed for the case where roi is not passed as optional input
with f16. To fix this it requires a much larger change and I did not
want to risk this short before a release. For all practical purpose
passing roi as input with f16 should be rare and we can fix it in the
near future.
Update WebNN test list in suite-test-list.jsonc so all test cases are
passed behind WebNN CPU backend on Chrome Stable (Although some cases
may fall back to CPU EP).
Enable int64 support for WebNN in unit tests.
### Description
Change `A / sqrt(B)` to `A * inverseSqrt(B)` in BatchNormalization,
InstanceNormalization, LayerNormalization and SkipLayerNormalization.
### Motivation and Context
For the same reason as the existence of the `inverseSqrt` built-in in
WebGPU spec.
Bumps
[follow-redirects](https://github.com/follow-redirects/follow-redirects)
from 1.15.2 to 1.15.4.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="65858205e5"><code>6585820</code></a>
Release version 1.15.4 of the npm package.</li>
<li><a
href="7a6567e16d"><code>7a6567e</code></a>
Disallow bracketed hostnames.</li>
<li><a
href="05629af696"><code>05629af</code></a>
Prefer native URL instead of deprecated url.parse.</li>
<li><a
href="1cba8e85fa"><code>1cba8e8</code></a>
Prefer native URL instead of legacy url.resolve.</li>
<li><a
href="72bc2a4229"><code>72bc2a4</code></a>
Simplify _processResponse error handling.</li>
<li><a
href="3d42aecdca"><code>3d42aec</code></a>
Add bracket tests.</li>
<li><a
href="bcbb096b32"><code>bcbb096</code></a>
Do not directly set Error properties.</li>
<li><a
href="192dbe7ce6"><code>192dbe7</code></a>
Release version 1.15.3 of the npm package.</li>
<li><a
href="bd8c81e4f3"><code>bd8c81e</code></a>
Fix resource leak on destroy.</li>
<li><a
href="9c728c314b"><code>9c728c3</code></a>
Split linting and testing.</li>
<li>Additional commits viewable in <a
href="https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Disable createGroupedConvVectorizeProgramInfo path due to bots failures
on below two cases:
[webgpu]Conv - conv - vectorize group - B
[webgpu]Conv - conv - vectorize group - D
Bumps
[follow-redirects](https://github.com/follow-redirects/follow-redirects)
from 1.15.2 to 1.15.4.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="65858205e5"><code>6585820</code></a>
Release version 1.15.4 of the npm package.</li>
<li><a
href="7a6567e16d"><code>7a6567e</code></a>
Disallow bracketed hostnames.</li>
<li><a
href="05629af696"><code>05629af</code></a>
Prefer native URL instead of deprecated url.parse.</li>
<li><a
href="1cba8e85fa"><code>1cba8e8</code></a>
Prefer native URL instead of legacy url.resolve.</li>
<li><a
href="72bc2a4229"><code>72bc2a4</code></a>
Simplify _processResponse error handling.</li>
<li><a
href="3d42aecdca"><code>3d42aec</code></a>
Add bracket tests.</li>
<li><a
href="bcbb096b32"><code>bcbb096</code></a>
Do not directly set Error properties.</li>
<li><a
href="192dbe7ce6"><code>192dbe7</code></a>
Release version 1.15.3 of the npm package.</li>
<li><a
href="bd8c81e4f3"><code>bd8c81e</code></a>
Fix resource leak on destroy.</li>
<li><a
href="9c728c314b"><code>9c728c3</code></a>
Split linting and testing.</li>
<li>Additional commits viewable in <a
href="https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
This PR provides a vectorized algorithm for NHWC GroupedConv to improve
performance.
The aggregate time of GroupedConv in mobilenetv2-12 becomes ~1ms from
~4ms on Intel Alder Lake machine. About 20% improvement for the whole
model.
This resolves the below build errors:
```
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS2724: '"./ops/instance-norm"' has no exported member named 'parseInstanceNormAttributes'. Did you mean 'InstanceNormAttributes'?
19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS6133: 'parseInstanceNormAttributes' is declared but its value is never read.
19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS2305: Module '"./ops/layer-norm"' has no exported member 'parseLayerNormAttributes'.
20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS6133: 'parseLayerNormAttributes' is declared but its value is never read.
20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
```
### Description
- Support more test cases for WebNN EP in suite-test-list.jsonc
- Add DISABLE_WEBNN flag in build.ts as preparing for WebNN EP release
- Add test option: '--webnn-device-type' in test-runner-args-cli.ts to
support running WebNN 'gpu' deviceType
- Use Chrome Stable as default browser for WebNN testing to unblock the
CI limitation.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Also update the op test suite.
### Motivation and Context
Previously the *total* size in case `Expand - last dim is not divisible
by 4` was a multiple of 4, even though the *last dimension* was not, so
the bug has never been caught.
### Description
a replacement of #18683. try to resolve#18689.
By specifying "-s PTHREAD_POOL_SIZE" flag in emscripten, it forces the
threadpool to initialize before the webassembly instance is available.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The patch fixes a floating point accuracy issue in Resize by preferring
integer indices and integer arithmetic where possible.
### Motivation and Context
Model test `test_resize_upsample_sizes_nearest_floor_align_corners` was
observed to be failing on certain platforms. The root cause is the
inaccurate floating point evaluation of 21 / 7 (2.999... vs 3), which
results in the wrong input element to be indexed (floor(2.999...) vs
floor(3)).
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Previously, shape and strides were added unconditionally even they are
not used. This PR fixes this issue and only adds shape and strides when
they are required.
It's useful when some shapes are not used as uniform if the program
depends on type instead of rank.
### Description
Add trilinear interpolation to Resize and changed activation_params attribute as optional for FuseConv.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR revises the backend registration.
The following describes the expected behavior after this change:
(**bolded are changed behavior**)
- (ort.min.js - built without webgpu support)
- loading: do not register 'webgpu' backend
- creating session without EP list: use default EP list ['webnn', 'cpu',
'wasm']
- creating session with ['webgpu'] as EP list: should fail with backend
not available
- (ort.webgpu.min.js - built with webgpu support)
- loading: **always register 'webgpu' backend**
( previous behavior: only register 'webgpu' backend when `navigator.gpu`
is available)
- creating session without EP list: use default EP list ['webgpu',
'webnn', 'cpu', 'wasm']
- when WebGPU is available (win): use WebGPU backend
- when WebGPU is unavailable (android): **should fail backend init,**
and try to use next backend in the list, 'webnn'
(previous behavior: does not fail backend init, but fail in JSEP init,
which was too late to switch to next backend)
- creating session with ['webgpu'] as EP list
- when WebGPU is available (win): use WebGPU backend
- when WebGPU is unavailable (android): **should fail backend init, and
because no more EP listed, fail.
related PRs: #18190#18144
### Description
<!-- Describe your changes. -->
Check whether the min/max inputs are provided and use default values if not provided.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The casing of Podfile is incorrect in the plugin. This causes issues
when building iOS on case-sensitive systems such as Linux.
### Motivation and Context
because cannot build ios on case sensitive systems
### Description
The changes in this PR includes:
1) Fix f16 errors in InstanceNormalization with NCHW format.
2) Use vec to further optimize the original algorithm.
3) (Removed) Don't do layout conversion for InstanceNormalization for
JSEP since InstanceNormalization itself is suitable for NCHW layout and
has better performance in our current implementation.
Tested on sd-vae-decoder-f16.onnx, it becomes 285 ms from 314 ms. The
aggregate gpu profiling data can be found as below (Note the data is
based change 3).):
Before:
<html>
<body>
<!--StartFragment--><span><span class="ui-provider ef bbg bbh bbi bbj
bbk bbl bbm bbn bbo bbp bbq bbr bbs bbt bbu bbv bbw bbx bby bbz bca bcb
bcc bcd bce bcf bcg bch bci bcj bck bcl bcm bcn" dir="ltr">
Kernel | Time (Ms) | Percentage (%)
-- | -- | --
Conv | 201.55 | 69.56
InstanceNormalization | 42.49 | 14.67
Transpose | 28.95 | 9.99
Mul | 5.69 | 1.96
Add | 3.82 | 1.32
MatMul | 3.27 | 1.13
Sigmoid | 2.24 | 0.77
Resize | 1.16 | 0.40
Softmax | 0.34 | 0.12
Cast | 0.24 | 0.08
Sum | 289.75
<br class="Apple-interchange-newline"><!--EndFragment-->
</body>
</html>
After:
<html>
<body>
<!--StartFragment--><span><span class="ui-provider ef bbg bbh bbi bbj
bbk bbl bbm bbn bbo bbp bbq bbr bbs bbt bbu bbv bbw bbx bby bbz bca bcb
bcc bcd bce bcf bcg bch bci bcj bck bcl bcm bcn" dir="ltr">
Kernel | Time (Ms) | Percentage (%)
-- | -- | --
Conv | 205.44 | 79.43
InstanceNormalization | 18.24 | 7.05
Transpose | 17.64 | 6.82
Mul | 5.69 | 2.20
Add | 3.81 | 1.47
MatMul | 3.56 | 1.38
Sigmoid | 2.24 | 0.86
Resize | 1.19 | 0.46
Softmax | 0.59 | 0.23
Cast | 0.24 | 0.09
Sum | 258.65 |
</span></span><!--EndFragment-->
</body>
</html>
From above table, we can see that two ops time are greatly reduced. One
is InstanceNormalization and the other is Transpose. The reason that the
transpose time is reduced is because each InstanceNormalization is
surrounded with two reshape ops in sd-vae-decoder-f16.onnx. Due to JSEP
is prefer NHWC and InstanceNormalization is layout sensitive op, so two
extra transpose ops are inserted dynamically when executing this model.
After this change, those inserted transpose ops are not needed anymore.
So the overall transpose time is reduced.
### Description
This PR provided a vectorized matmul algorithm. In most situations, we
still go to the workgroup memory optimized matmul. But for some
situations, like N and K are very small, using workgroup optimized
matmul can't fully utilize the underlying hardware due to the 32x32 tile
size. So for very small N/K, we switch to the naive vectorized matmul
algorithm to improve the hardware execution unit usage.
With this PR, matmul with input0: [1, 36864, 3], input1: [1, 3, 3],
input2: [3] becomes less than 1 ms from 4.34 ms on Intel Gen9 GPUs.
### Description
<!-- Describe your changes. -->
Added uniforms to Reduce op
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve perforamnce.
### Description
<!-- Describe your changes. -->
Added uniforms to Tile and Where Ops
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance.
### Description
* implemented lazyResetGrad function
### Motivation and Context
* we are in the process of adding language bindings to enable training
on web
* lazyresetgrad ensures that the gradients are calculated correctly
after the first runTrainStep call
---------
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
**This PR is a replacement of #17820.**
allow to specify callback for profiling data
*Previous*:
```js
ort.env.webgpu.profilingMode = 'default'; // enable profiling
// profiling data will output to console.
```
*Now*:
```js
ort.env.webgpu.profiling = {
mode: 'default'; // enable profiling
ondata: (data) => {
// .. process the profiling data
}
};
//for each kernel, "ondata" will be called once. only output to console if ondata is not specified.
```
### Description
Use Uniforms in GatherElements and clean-up
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
### Description
* implemented runEvalStep and runOptimizerStep
* added hasEvalModel and hasOptimizerModel boolean fields in
TrainingSession representation
* added evalInputNames and evalOutputNames fields to
TrainingSessionHandler & TrainingSession
* removed the inputNamesEncoded and outputNamesEncoded fields from
TrainingSessionHandler -- since none of the training methods require the
input names and output names as parameters, there's no need to store
them.
### Motivation and Context
* part of the work for implementing web bindings for training
* previous PR: #18250
---------
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
Currently, we only print the kernelName, which is hard to distinguish
which shader we actually used. For example, GroupedConv/Conv2DMatMul
both belong to Conv kernel. It's not intuitive for profiling.
### Description
ESLint will went into error sometimes.
The root cause is because some large generated JavaScript file in the
tsconfig's include path will cause TypeScript parser fail in a line of
`string.match()` with a regex on a huge string (~8MB), causing the
following error:
```
RangeError: Maximum call stack size exceeded
```
The solution is to remove the large files from the tsconfig's include
path. Previously I excluded the `web/dist/` folder and this PR excludes
`web/test/ort.test[.min].js`.
With uniform support, ideally we may just keep one artifact for each
program to save the compilation time. This PR just logs the related
info, including key and program name, so that we may understand better
the situation.
### Description
Add uinforms to Einsum
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance.
### Description
This PR includes a change that inspired from #18452 to resolve a
requirement: a shader may depend on an instance of `IndicesHelper` to
generate WGSL code snippet, but the IndicesHelper instance is not
necessarily an input/output of the program. So the existing
`declareVariables()` function does not work with this scenario.
In order to support this requirement, I added this "use" function to
`interface ShaderHelper`, which takes a helper-like object as parameter.
The hidden implementation `ShaderHelperImpl` class will iterate the
helpers and call `impl()` for each.
@axinging @qjia7
### Description
<!-- Describe your changes. -->
As title.
1. Add macos build as an optionally enabled arch for pod and changes to
exsiting build_ios_framework/assemble_c_pod scripts.
2. Enable macos build arch in ios packaging pipeline (currently for
variants other than Mobile) and check the output artifacts are correct.
3. Write MacOS Test Target scheme in the test app and integrate into ios
packaging CI testing pipeline.
Currently the changes only apply to onnxruntime-c pod. as the original
request was from ORT SPM which consumes the onnxruntime-c pod only as
the binary target. TODO: could look into adding macos platform to objc
pod as well.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable macos platform support in cocoapods. and also potentially produce
binary target for enabling macos platform in SPM as well.
Replace https://github.com/microsoft/onnxruntime/pull/18334
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Currently, all conv2dMatmul with inChannels = 3 and outChannels % 4 = 0
will report compilation errors. Models, which include this kind of shape
will be impacted, like mobilenetv2-12, resnet50 .
The errors is introduced by #18452https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200,
which accidentally pass `components` to `x`. But `x`'s components is
`innerElementSize` not `components `. And when `innerElementSize` is 3,
we should use `1` in current design.
### Description
* Implemented: `getParametersSize`, `getContiguousParameters`
(equivalent to copyParametersToBuffer), and `loadParametersBuffer`
(equivalent to copyParametersFromBuffer)
* as part of these changes, getParametersSize was added to the
TrainingSession interface so that users know what size buffer to create
for loadParametersBuffer
* The parameters methods in the interface were modified to take in a
Float32Array instead
### Motivation and Context
* part of the work for implementing web bindings for training
* enables federated learning in the web
* previous PR: #18006
---------
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
This PR adds `BatchNormalization` with `float` support.
Some Todos:
1. all inputs don't have same data type. For example, x/y is float16,
but bias/scale is float32 or double.
2. training mode support.
We see many models are using `BatchNormalization` ops. However, due to
the missing in jsep, all of them run on cpu, which result very poor
performance. With this PR's support, densenet-9 model becomes 20.29 ms
from 250.69 ms.
This change refactored matmul/conv related programs to support shape
uniforms. Currently only matmul shape uniforms are fully enabled.
TODOs: add input dependencies for conv related programs, turn clipMax
and clipMin to uniforms.
### Description
<!-- Describe your changes. -->
Added Uniforms to Expand operator kernel
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
### Description
It was a mistake to use 2 different names for Clip operator in
op-resolve-rules.ts for different opset. An optimized implementation can
handle both cases (opset < 11 and opset >=11). Remove "ClipV10" as an
entry from the table.
### Description
Currently, the binary algorithms are divided into the vectorize one
(efficient) and non-vectorize one (less efficient). Below situations
will go to the vectorize one:
1) A or B's shape length is 1.
2) The shared dimensions length of A and B are divisible by 4.
3) A and B have same shape.
This PR adds another situation as below to go to the vectorize
algorithm.
4. A or B's last dimension is divisible by 4.
With this change, the aggerate time of Add in sam-b-encoder becomes
309.65 ms from 409.12 ms on Intel ADL.
### Description
optimize eslint config to:
- set parserOptions.project to `true` to allow @typescript-eslint/parser
to find the nearest tsconfig.json file to that source file. This helps
to avoid parsing extra files, may helps with:
- reduce the possibility of seeing OOM or stackoverflow with "npm run
lint"
- faster processing
- enforce rule "no-underscore-dangle" with a list of exceptions.
### Description
[js] update a few packages
- update semver
- update reference of onnx_proto to local folder in order to upgrade
protobufjs@7.2.4
Resolve AB#18513
### Description
This is a narrow implementation of Attention/MultiHeadAttention as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask
But it works well for StableDiffusion and can be extended later. It
reduces VRAM usage as it combines many ops into few
I've updated demo here https://islamov.ai/stable-diffusion-webgpu/ it
takes ~13sec for 1 image with 20 steps on RTX3090Ti and about 25s on M1
Pro
VRAM usage is about 8gb if you don't use img2img
Going to focus on SDXL now
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Support uniforms in Slice op
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve ferformance
### Description
- set tsconfig "noUnusedParameters" to `true` and fix a few bugs
discovered by typescript.
how unused parameter is fixed:
- for most code (webgl), add underscore as prefix, which is the standard
ignore pattern for typescript check.
- remove unused parameter from function and modify corresponding
function calls (jsep)
- fix a bug in ArgMinMax: this 2 operators do not have more than one
input(s) so the `createArgMinMaxAttributesFromInputs()` is removed.
- add proxy main.ts into typescript check and fix a bug in parameter
passing
- fixed `run()` function call and add typecheck fix (hack)
### Description
This PR fixes the TypeScript type check.
Previously, when I use esbuild to replace webpack (#17745), typescript
typecheck was disabled. This causes a few TypeScript type error checked
in into the code base. This PR fixes the followings:
- Use "Node16" as default "module" value in tsconfig.json, because in
TypeScript v5, `(module == "ES2015" && moduleResolution == "Node16")` is
an invalid combination.
- Set `noUnusedParameters` to true as default. in web override it to
false because multiple code need to be updated ( a following-up PR will
do this )
- set correct project file for 'web/lib/**/*.ts' for ESLint (otherwise
WebGPU types are not populated correctly)
- fix type error in file js/web/lib/wasm/jsep/webgpu/program-manager.ts
- upgrade "@webgpu/types" to latest to fix type error in file
js/web/lib/wasm/jsep/backend-webgpu.ts
- add package script "prebuild" for web to run tsc type check
- add type check in CI yml file
### Description
For Resize, when `noScale` is true, the shader can become very simple,
which is not related with `attributes.mode` anymore. So we should remove
those parts of shader code for simplification.
This PR can also fix#18311 since the `noScale` are all true in that
model.
However, #18311 also exposes that the Resize implementation for `linear`
mode has bug. It seems that the currently implementation always treat
the input as either 2d or 4d tensor, however, the actual input is 3d
tensor, that's why the shader compilation is failed. We may need to fix
it in a separate PR.
### Description
Added Uniform support to binary ops
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
To improve performance
### Description
<!-- Describe your changes. -->
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well
Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
- some ops use a workspace buffer
- copied workspace allocation from XNNPACK unit test code
- some suffixes changed
Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
isn't currently usable via the public API
- we could use the internal types if we think it's required for
performance reasons. non-trivial though as we'd need to propagate ifdef
values from the XNNPACK build up to the ORT build.
- using XNNPACK internals would also mean we would not be able to
support using a pre-build XNNPACK package
- not an issue currently
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed
Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.
NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
* based on design document & following InferenceSession's run
implementation, implemented TrainingSession.runTrainStep
### Motivation and Context
* Adding web bindings for training
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allowed for training package to be built + added training
backend to web package
* #17891 implementation for createTrainingSession on the TypeScript side
**[SHOULD BE MERGED IN BEFORE THIS PR]**
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
Added FusedConv and FusedConvTranspose
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
### Description
This PR enables `softmax` outputs max supported components instead of
scalar for each thread.
Softmax with input[0]: [12,4096,4096] becomes 47.86 ms from 55.11 ms
### Description
This PR tries to fix a part of the NPM package consuming problems for
onnxruntime-web (ES module) as described in #10913:
- reduce the package size to fit the 150MB restriction in jsdelivr, by
removing dev build targets for uncommon exports
- add default export to support `import ort from 'onnxruntime-web';`
(currently only support `import * as ort from 'onnxruntime-web';`
Timestamp-query has a broader support than timestamp-query-in-passes on
all the platforms, including macOS.
Note that to enable timestamp-query, you still need to add switch
"--enable-dawn-features=allow_unsafe_apis" to Chrome. By default, the
lowest 16 bits are masked with 0 (at a granularity about 0.1ms) for
privacy. To get the highest precision, you need to add another switch
"--enable-webgpu-developer-features".
### Description
* Adds TrainingSession.create() functionality following the web bindings
for training design doc
* Added 2 new training APIs to wasm/api.h:
* OrtTrainingGetInputOutputName
* OrtTrainingGetInputOutputCount
* Moved isOrtEnvInitialized boolean to the wasm-core-impl and added a
method that references it
### Motivation and Context
* Adding web bindings for training
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allows for training package to be built + adds training backend
to web package **[MUST BE MERGED IN BEFORE THIS ONE]**
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
Enable one-dim special input to GlobalAveragePoll input
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Currently only 2D input is supported.
### Description
<!-- Describe your changes. -->
Currently, the uniform support has bugs when dims rank is larger than 4.
See https://github.com/microsoft/onnxruntime/issues/17860 item 1.
So this PR only enables shapes uniforms when shape rank is <= 4 for
transpose. Otherwise, below compilation errors are thrown:
```
1 error(s) generated while compiling the shader:
:3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead.
struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
^^^^^^^^^^^^^
:3:7 note: see layout of struct:
/* align(4) size(84) */ struct Uniforms {
/* offset( 0) align(4) size( 4) */ output_size : u32;
/* offset( 4) align(4) size(20) */ a_shape : array<u32, 5>;
/* offset(24) align(4) size(20) */ a_strides : array<u32, 5>;
/* offset(44) align(4) size(20) */ output_shape : array<u32, 5>;
/* offset(64) align(4) size(20) */ output_strides : array<u32, 5>;
/* */ };
struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
^^^^^^
:4:42 note: 'Uniforms' used in address space 'uniform' here
@group(0) @binding(2) var<uniform> uniforms: Uniforms;
^^^^^^^^
```
### Description
* follows the packaging approach according to the design document
* adds `ENABLE_TRAINING` boolean flag to `BUILD_DEFS`
* modifies `package.json` to include training submodule
* modifies build script to handle, validate, and minimize training WASM
artifacts
* adds the binding for the new backend with training enabled & the new
training artifacts
* adds training backend
* edits `index.ts` to use training backend depending on `BUILD_DEFS`
* edits `wasm-factory.ts` to use the training artifacts if necessary
### Motivation and Context
* we are in the process of adding web bindings to enable training.
* Adding the "glue" to allow onnxruntime-web to use the training WASM
artifacts is required for this work.
* Since BUILD_DEFS is defined and used at build time, I thought that it
made sense to bundle the changes to building in the same PR.
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 must be merged in before this one
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
upgrade JS shared dev dependencies.
- webpack: removed
- eslint: upgrade to latest.
- eslint config upgraded to compatible with latest version
- typescript upgrade to v5
- update module "CommonJS" to "Node16" in tsconfig
- update deprecated config "importsNotUsedAsValues" to
"verbatimModuleSyntax"
- remove webpack bundles in onnxruntime-common
### Description
flags `--enable_wasm_api_exception_catching --disable_rtti` are used in
release build, so fix the build_jsep.bat script to make it more
consistent with CI.
### Description
support using uniform buffer.
This PR allows to use uniform buffer in shader program, so that some
runtime information (eg. input/output shape) is no longer need to be
hardcoded into shader code.
There are 2 commits in this PR:
-
[667f31c](667f31c83d):
framework changes to support uniform buffer, as well as updates in
program manager, gpu data manager and indices helper.
-
[09e1d2a](09e1d2ad1d):
an example change for operator `Transpose` to use input's rank-only
instead of dims as shader key. With this change, model mobilenetv2-12
shader compile times dropped from 71 to 52.
Two major modifications of this PR:
1. Refactor OrtTensorRTProviderOptions initialization and make it easy
to add new field.
2. Make Python API capable of using TensorRT plugins by adding new
Python binding api `register_tensorrt_plugins_as_custom_ops`. (It needs
to register ep's custom op domain before model load. For C++ API, it's
slightly different, when calling
SessionOptionsAppendExecutionProvider_TensorRT_XX, it appends cutom op
domain to session option. Later ORT can register custom op domain from
session option before model loading)
### Description
Use esbuild to accelerate bundle build.
This change uses esbuild to replace webpack for onnxruntime-web. Bundle
build time reduced from ~20sec to ~0.6sec on my windows dev box.
A few changes applied:
- import nodejs modules using "node:" prefix
- remove enum declaration inside namespace (EncoderUsage)
- use "fs/promise" to replace the old promisify from "util"
- separate ort-web and test-runner. Previously they are bundled
together, now they are built into 2 files.
- optimize karma runner launch time
- remove unnecessary sourcemap preprocessor. sourcemaps are handled
inside esbuild
- remove unnecessary proxies (because ort-web and test-runner are
separated now, the path are correctly inferred)
- remove file watcher for test data
- optimize special handling as esbuild plugins:
- polyfill dummy imports for node.js modules when targetting browser.
- load as content string for ort-wasm-*.worker.js
- load as content string for ./proxy-worker/main.ts
- a source patch to ort-wasm*-threaded*.js (see details in comments in
code)
- updated debug configurations for sourcemap mapping to ensure
out-of-box good dev experience
Supported type: float. int32_t, uint32_t, bool.
Case where_broadcast.jsonc is not enabled due to
https://github.com/microsoft/onnxruntime/issues/17405.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Two contrib kernels that supposed to speed-up StableDiffusion according
to this doc
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md
However, there is no noticable effect in speed or memory consumption. So
i guess the only way to make it faster is to implement
MultiHeadAttention but i'm not capable of doing that right now. So i'll
focus on existing PRs and finding the JSEP kernel that produces
incorrect results. It should be one of the old ones (i suspect Conv or
ConvTranspose), as SD was not generating images correctly on webgpu
since i started working on it. I hoped someone else would fix that by
the time i finish with kernels/optimizations 😅
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Allow WebGPU backend to specify `preferredLayout`. Default is NHWC.
```js
const options = {executionProviders: [{name:'webgpu', preferredLayout: 'NCHW'}]};
sess1 = await ort.InferenceSession.create('./mobilenetv2-12.onnx', options);
```
### Motivation and Context
- implement @qjia7's requirement for an easier way to do performance
comparison between NCHW vs NHWC.
- It's possible that NCHW does better on some models and NHWC on others.
So offer user the capability to switch.
### Description
<!-- Describe your changes. -->
Update E2E test to also check InferenceSession.create with bytes.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add tests to validate #17739
The patch also introduces the method which copies
data from GPU to CPU synchronously.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Another three ops for fp16
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Following the design document:
* Added CreateTrainingSessionHandler to the Backend interface
* All existing Backend implementations throw an error for the new method
createTrainingSessionHandler
* Created TrainingSession namespace, interface, and
TrainingSessionFactory interface
* Created TrainingSessionImpl class implementation
As methods are implemented, the TrainingSession interface will be added
to or modified.
### Motivation and Context
Adding the public-facing interfaces to the onnxruntime-common package is
one of the first steps to support ORT training for web bindings.
---------
Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com>
### Description
<!-- Describe your changes. -->
Use `.buffer` of Uint8Array to get ArrayBuffer.
TODO: Add E2E React Native test case to cover JS level testing to avoid
future breakage.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17732
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
<del>
**This PR is based on a few prerequisites PRs. They are listed as
below:**
- #17465
- #17469
- #17470
- #17472
- #17473
- #17484
Please review the current change by only looking at commit
e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later.
</del>
### Description
This PR introduces WebGPU IO binding. This new feature allows
onnxruntime-web users to use tensors created from GPU as model
input/output so that a model inferencing can be done without unnecessary
data copy between CPU and GPU for model input/output.
### Examples
An E2E demo/example is being worked on.
Following is some simple demo with code snippet.
Let's first check today how we do:
```js
// STEP.1 - create an inference session:
const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] });
// STEP.2 - create model input: (supposing myImageCpuData is a Float32Array)
const feeds = {
'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3])
};
// STEP.3 - run model
const myResults = await mySession.run(feeds);
// STEP.4 - get output data
const myData = myResults['output_image:0'].data; // Float32Array
```
#### for inputs (GPU tensor):
Now, with IO binding, you can create a tensor from a GPU buffer, and
feed it to the model:
```js
// new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data)
const feeds = {
'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] })
};
```
### for outputs (pre-allocated GPU tensor)
you can also do that for output, **if you know the output shape**:
```js
// new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object)
const fetches = {
'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] })
};
// new STEP.3 - run model with pre-allocated output (fetches)
const myResults = await mySession.run(feeds, fetches);
```
### for outputs (specify location)
if you do not know the output shape, you can specify the output location
when creating the session:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
executionProviders: ['webgpu'],
preferredOutputLocation: "gpu-buffer"
});
```
if the model has multiple outputs, you can specify them seperately:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
executionProviders: ['webgpu'],
preferredOutputLocation: {
"output_image:0": "gpu-buffer"
}
});
```
now you don't need to prepare the `fetches` object and onnxruntime-web
will prepare output data on the location that specified.
#### read data
when you get the output tensor, you can:
```js
// get the gpu buffer object:
const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer
// get the CPU data asynchronizely
const cpuData = await myOutputTensor.getData();
// get the CPU data asynchronizely and release the underlying GPU resources
const cpuData = await myOutputTensor.getData(true);
// dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called.
myOutputTensor.dispose();
```
#### resource management
JavaScript has GC so you don't need to worry about managing JavaScript
objects. But there are 2 types of resources that are not managed by GC:
- GPU buffer that used in tensors
- Underlying ORT native resources
To simplify, most of the unmanaged resources and handled inside ORT web.
But there are a few resources that need users to manage:
- All external GPU resources, including GPU buffers inside all tensors
created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User
should manage those GPU buffers themselves.
- When a session is created with `preferredOutputLocation` ==
"gpu-buffer" specified in session options, and the corresponding output
is not pre-allocated, user need to call the output tensor's `dispose()`
or `getData(true)` to manually release the underlying GPU buffers.
- ORT internal errors (including providing a pre-allocated output tensor
with wrong type/dims) will invalidate the whole wasm memory and is not
recoverable. An exception is thrown in this situation.
### Description
Add ConvTranspose implementation using MatMul to increase perf.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
update Chromium browser launch command line flags
Canary already using dxc so no need to specify
'--enable-dawn-features=use_dxc' for canary.
### Description
ort-web build step - webpack consumes the amount of memory on the edge
of Node.js(V8)'s default max-old-space-size, so increase the default
memory size to 5GB to avoid this issue.
### Description
This PR optimizes the gather op, which is improved ~6ms in segment
anything model in ADL.
The problem in original algorithm is that it includes a for loop to
calculate a block size of data. However, the block size may be very
large, like `65536`. In GPU shader, we should try to avoid large loop in
shader and try to use more threads to do it parallelly.
Before:
```
[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 6886207 ns
```
After:
```
[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 11719 ns
### Description
1. For binary ops, the components is always 4. So the dispatchGroup
should be : `{x: Math.ceil(outputSize / 64 /* workgroup size */ / 4 /*
component size */)}` instead of `{x: Math.ceil(outputSize / 64 /*
workgroup size */ / (vectorize ? 4 : 1) /* vec size */)}`.
2. If any of a or b only has one element, we still can use the vectorize
path since the same value will be broadcasted.
### Description
Update test to explicitly fail for webnn without proxy.
I am doing this change because if I test webnn with other backend
together, it silently enables proxy. I want to make test runner behave
with less implicit flag reset. If proxy is not enabled, webnn test
should fail.
@Honry please let me know if other places (eg. CI scripts) should change
also.
### Description
<!-- Describe your changes. -->
In previous implementation, there are two loops to iterate H * W
elements to calculate the `mean` and `squaredNorm` value in one thread,
meanwhile it outputs H * W elements in one thread. That results it's
very very slow when H * W is a large value. And usually, H * W does be a
large value in a model. For example, in the `candy-8` model, the shapes
of [H, W] are [224,224], [112,112], [56,56] for `InstanceNormalization`
op. And in my ADL, `[1,224,224,32]` consumes 17 ms. See below:
```
[profiling] kernel "23848328|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,224,224,32] | float32, execution time: 17007914 ns
```
In this PR, it uses workgroup memory to optimize the original algorithm.
The advantage is that it can parallelly utilize the 64 (workgroupSize)
threads in one workgroup to calculate `mean` and `squaredNorm` value.
Meanwhile, it only outputs `H * W / workgroupSize` outputs for one
thread, which greatly reduces the overhead for one thread. With this
optimization, `[1,224,224,32]` becomes 3 ms and the main overhead is the
extra two `transpose`. The `createInstanceNormProgramInfo` only needs
`0.64` ms. See below:
```
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] | float32, output[0]: [1,32,224,224] | float32, execution time: 1543792 ns
program-manager.ts:115
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,32,224,224] | float32, execution time: 642652 ns
program-manager.ts:115
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, output[0]: [1,224,224,32] | float32, execution time: 991608 ns
```
This PR currently only applies the new algorithm to NCHW format. For
NHWC format, one way is to transpose the input so that it can use the
new algorithm. But the disadvantage is that 2 extra transpose are added.
@dakenf also gives another way to optimize NHWC. Details see
[here](d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts).
I checked @dakenf's method. The perf is similar with transpose +
optimized NCHW. But on different GPUs, one is a little better than
another or vice versa. So I prefer this PR only does the NCHW part.
@dakenf can submit his optimization on NHWC.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
For some use case need to create boolean tensor.
I've tested on [this
project](https://github.com/hans00/react-native-transformers-example)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add handle `ONNX_TENSOR_ELEMENT_DATA_TYPE_BOOL`
And it required #15556 (It seems not include in latest release
(v1.15.1))
### Description
update prepack script to use exact version.
the prepack script for onnxruntime-node, onnxruntime-web and
onnxruntime-react-native is used to update their referencing version of
dependency "onnxruntime-common".
Previously "~" (tilde symbol) is used. This may cause NPM choose an
older version (if the old version matches the version requirement and
was previously installed already so hit the cache). see also
https://semver.npmjs.com/. [This
build](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1134671&view=results)
is caused by this issue.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
release session after use in npm test.
This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.
list of prerequisites PRs:
#17465#17469#17470 (this one)
### Description
Added Einsum operator support to JSEP.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
[Successful pipeline
run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results)
Added flag to build the training artifacts & updated the
pull-wasm-artifacts script to pull the training artifacts as well.
Bundled into this PR are minor formatting fixes + naming fixes.
### Motivation and Context
[This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended
the WASM API wrapper to build training WASM artifacts as well.
The ORT training WASM artifacts are required to support ORT training web
bindings.
### Description
This PR contains a few changes in /js/common/ to support a coming PR for
a full implementation of webgpu IO binding.
- allows pass-through if value is already a Tensor instance in return
value of `handler.run()` called by `InferenceSession.run()`
(inference-session-impl.ts). Specifically, onnxruntime-node and
onnxruntime-react-native uses native bindings to generate a Tensor-like
object so we need to create a real Tensor instance here; for
onnxruntime-web the return value is already a Tensor instance.
- adds new types for GPU buffer supported types: `'float32'|'int32'` ->
`'float32'|'float16'|'int32'|'int64'|'uint32'|'bool'`
- exposes types `GpuBufferDataTypes` together with `CpuPinnedDataTypes`
and `TextureDataTypes` as exported
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Include Support for neg.int32
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Commit fffefb1c22 (#16969) optimized
matmul and also fixes broadcasting. So #17191 is no longer needed.
However, the newly added operator test file from the PR by @dakenf is
helpful so pick and add it to enhance the tests.
### Description
clean up JSDoc for onnxruntime-common:
- replace "@internal" to "@ignore" as JSDoc do not use "@internal".
Using "@ignore" will let the content not show on the generated doc.
### Description
<!-- Describe your changes. -->
For the conv2dByMatMul path, the simulated matmul output shape is the
reshape of the original conv2d. So we should pass this information to
`createMatmulProgramInfo` so that it can process it correctly.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
[//]: # (## Work In Progress. Feedbacks are welcome!)
### Description
This PR adds a few properties, methods and factories to Tensor type to
support IO-binding feature. This will allow user to create tensor from
GPU/CPU bound data without a force transferring of data between CPU and
GPU.
This change is a way to resolve#15312
### Change Summary
1. Add properties to `Tensor` type:
a. `location`: indicating where the data is sitting. valid values are
`cpu`, `cpu-pinned`, `texture`, `gpu-buffer`.
b. `texture`: sit side to `data`, a readonly property of `WebGLTexture`
type. available only when `location === 'texture'`
c. `gpuBuffer`: sit side to `data`, a readonly property of `GPUBuffer`
type. available only when `location === 'gpu-buffer'`
2. Add methods to `Tensor` type (usually dealing with inference
outputs):
- async function `getData()` allows user to download data from GPU to
CPU manually.
- function `dispose()` allows user to release GPU resources manually.
3. Add factories for creating `Tensor` instances:
a. `fromTexture()` to create a WebGL texture bound tensor data
b. `fromGpuBuffer()` to create a WebGPUBuffer bound tensor data
c. `fromPinnedBuffer()` to create a tensor using a CPU pinned buffer
### Examples:
create tensors from texture and pass to inference session as inputs
```js
// when create session, specify we prefer 'image_output:0' to be stored on GPU as texture
const session = await InferenceSession.create('./my_model.onnx', {
executionProviders: [ 'webgl' ],
preferredOutputLocation: { 'image_output:0': 'texture' }
});
...
const myImageTexture = getTexture(); // user's function to get a texture
const myFeeds = { input0: Tensor.fromTexture(myImageTexture, { width: 224, height: 224 }) }; // shape [1, 224, 224, 4], RGBA format.
const results = await session.run(myFeeds);
const myOutputTexture = results['image_output:0'].texture;
```
### Description
Changes in this PR:
1) use the optimized version `makeMatMulPacked[Vec4]Source` to support
matmul.
2) enable the conv2dByMatMul path.
3) support broadcast
4) use IndicesHelper.
MatMul with M = 512, K = 512, N = 512 becomes 2ms from 15ms when
enabling profilingMode on my ADL.
### Description
* Created `wasm/training_api` source and header files & modified
WebAssembly CMake to include training flags
* The `wasm/training_api` files use an `OrtTrainingManager` handle which
is a struct of an OrtCheckpointState and an OrtTrainingSession, rather
than creating a CheckpointState handle & a separate TrainingSession
handle.
* This is so that the TypeScript side only has to manage one handle that
will be passed between TrainingSession & CheckpointState
representations, rather than the TypeScript side managing separate
CheckpointStateHandle and TrainingSessionHandle.
### Motivation and Context
WASM API needs to be updated with ORT training API function calls so
that ORT training web bindings can be added for on-device training.
---------
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
This PR adds kernel implementation for operator "Not" and "Equal". Also
removed download cache in gpu data manager.
**Why removing download cache**
The following test case failed. ("Or" is on CPU, "Greater" and "Equal"
are on JSEP)

after debugging, I found that both "Equal" and "Greater" are using the
same output GPU Data ID. This is because when ORT executes the graph, it
first run "Equal", allowing its shader to write into GPU Data ID 2; then
a Gpu2Cpu copy for it is issued (because currently "Or" is on CPU EP);
at this point, ORT thinks GPU Data ID=2 is free to use; so it reuse it
as output for "Greater". This means there is no allocation for output of
"Greater" kernel, and both kernel writes to GPU Data ID=2.
For gpu data manager, there will be 2 downloads from the same GPU
buffer. Previously I think this is a waste of resource so I cached the
data. But now it shoes that we need to perform 2 downloads because the
GPU data is already different. The download data cache should be
removed.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
With the label, it's more easier to identify which op causes the error.
Without the label, the error message is like below:
```
Tint WGSL reader failure: :12:5 error: return statement type must match its function return type, returned 'vec4<f32>', expected 'f32'
return W[i2o_W(indices)];
^^^^^^
- While validating [ShaderModuleDescriptor]
- While calling [Device].CreateShaderModule([ShaderModuleDescriptor]).
```
With the label, the error message is like below:
```
Tint WGSL reader failure: :12:5 error: return statement type must match its function return type, returned 'vec4<f32>', expected 'f32'
return W[i2o_W(indices)];
^^^^^^
- While validating [ShaderModuleDescriptor "ConvTranspose2D"]
- While calling [Device].CreateShaderModule([ShaderModuleDescriptor "ConvTranspose2D"]).
```
### Motivation and Context
This change is mainly for debugging. With this change, we can easily
know that `ConvTranspose2D`'s shader has problem from above message.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR contains changes to support error pop and kernel name.
- Add a function `JsepGetNodeName` to allow reading kernel name from JS
to C++
- When in debug mode ( `env.debug = true;` ) or in profiling mode (
`env.webgpu.profilingMode = 'default';` ), kernel name will be read from
ORT; otherwise use the kernel pointer ( a number ) as kernel name to
save calls from JS to C++.
- When in debug mode, WebGPU validation errors will be recorded and if
any error occurs, `inferenceSession.run()` will fail (Promise get
rejected). Behavior when not in debug mode is not changed. This is
because recording errors are not zero-overhead, and GPU validation
errors should occur consistently in and not in debug mode.
- Add `jsepOnRunStart()` and `jsepOnRunEnd()` hook to:
- allow implementation of the features mentioned above.
- pass session ID to backend.
### Description
Fix JSEP ConvTranspose shader code errors.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Release OrtEnv before main function returns. Before this change, OrtEnv
is deleted when C/C++ runtime destructs all global variables in ONNX
Runtime's core framework.
The callstack is like this:
```
* frame #0: 0x00007fffee39f5a6 libonnxruntime.so.1.16.0`onnxruntime::Environment::~Environment(this=0x00007fffee39fbf2) at environment.h:20:7
frame #1: 0x00007fffee39f614 libonnxruntime.so.1.16.0`std::default_delete<onnxruntime::Environment>::operator()(this=0x00007ffff4c30e50, __ptr=0x0000000005404b00) const at unique_ptr.h:85:2
frame #2: 0x00007fffee39edca libonnxruntime.so.1.16.0`std::unique_ptr<onnxruntime::Environment, std::default_delete<onnxruntime::Environment>>::~unique_ptr(this=0x5404b00) at unique_ptr.h:361:17
frame #3: 0x00007fffee39e2ab libonnxruntime.so.1.16.0`OrtEnv::~OrtEnv(this=0x00007ffff4c30e50) at ort_env.cc:43:1
frame #4: 0x00007fffee39fa96 libonnxruntime.so.1.16.0`std::default_delete<OrtEnv>::operator()(this=0x00007fffefff8f78, __ptr=0x00007ffff4c30e50) const at unique_ptr.h:85:2
frame #5: 0x00007fffee39f394 libonnxruntime.so.1.16.0`std::unique_ptr<OrtEnv, std::default_delete<OrtEnv>>::~unique_ptr(this=0x7ffff4c30e50) at unique_ptr.h:361:17
frame #6: 0x00007ffff78574b5 libc.so.6`__run_exit_handlers + 261
frame #7: 0x00007ffff7857630 libc.so.6`exit + 32
frame #8: 0x00007ffff783feb7 libc.so.6`__libc_start_call_main + 135
frame #9: 0x00007ffff783ff60 libc.so.6`__libc_start_main@@GLIBC_2.34 + 128
frame #10: 0x0000000000abbdee node`_start + 46
```
After this change, OrtEnv will be deleted before the main function
returns and nodejs is still alive.
### Description
Added JSEP Gemm registration for opset 13. It was falling back to CPU
provider as CPU has it for 13
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
### Description
Enable typed binary and support int32 type for binary.
Co-authored-by: Xing Xu <xing.xu@intel.com>
---------
Co-authored-by: Xing Xu <xing.xu@intel.com>
### Description
Add SkipLayerNormalization operator to JSEP.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fix a typo. LayerNormalization takes 2 or 3 inputs. The third input,
bias, is optional.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
1. allows passing session options to operator test (eg. graph
optimization level)
2. add a short flag '-x' for '--wasm-number-threads' as it is frequently
used.
### Description
Since WebGPU supports only float32 and int32, having Gather, Reshape,
Shape, Squeeze and Unsqueeze ops with other data types create additional
MemCpy ops and slow down the overall execution as all other OPs with
other tensor types will be done on CPU.
Before this patch SD Unet had these numbers:
Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1141
Node(s) placed on [JsExecutionProvider]. Number of nodes: 4025
memcpy tokens: 2001
After patch:
Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1735
Node(s) placed on [JsExecutionProvider]. Number of nodes: 2243
memcpu tokens: 813
It also gives more than 5X performance benefit. From 12sec for one Unet
step to 2.2sec on RTX 3090 Ti, so we are almost getting to native
performance.
UPD: with latest changes from main branch and multi-threading it went
down to 1.6sec. Will try re-exporting my model to onnx with maximum
optimizations, like using MultiHeadAttention to decrease node count.
Maybe after implementing that it can go in less than 1 sec
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
test case 'test_batchnorm_epsilon_training_mode' on webgpu is failing.
the issue need time to investigate so comment this off and re-enable it
when the root cause is fixed.
### Description
This PR introduces the new incides helper.
IndicesHelper is a helper class for generating WGSL code for
manipulating indices and data for a shader's input or output.
This class is designed to offer a unified way to generate WGSL code for
manipulating indices and data for a shader's input or output. The
following is a list of terminologies used in this class:
- `offset`: a uint32 value representing the offset of an element in the
data buffer.
- `indices`: an abstraction of a multi-dimensional array's indices
representing the data's index on each dimension.
- `value`: a value of a data element.
Users are expected to create an instance of this class for each shader's
input or output, and use the instance to generate WGSL code for
manipulating indices and data. The following 2 exported functions are
for users to call to create an instance of an indices helper:
- `inputVariable()`: create an indices helper instance for an input.
- `outputVariable()`: create an indices helper instance for an output.
An indices helper instance contains helper functions for the following
operations:
- access readonly basic information, including: `name`(the name of the
input or output), `usage`(whether it's an input or an output) and
`shape`(the passed in shape).
- `type`: access readonly type information, including: `indices`(the
type of indices), `value`(the type of value at runtime), `storage`(the
type of value at storage) and `tensor`(the tensor type as represented in
TensorView).
- generate WGSL code for getting indices from offset. Use
`offsetToIndices()` for WGSL code snippet to calculate incides from
offset, and use `indicesToOffset()` for WGSL code snippet to calculate
offset from indices.
- to manipulate an instance of indices, use `setIndices()` and
`getIndices()` to set and get the indices on an indices variable.
- to manipulate data, use `set()`/`get()` to access data at the given
indices from parameter list, use `setByIndices()`/`getByIndices()` to
access data at the given indices from an indices variable, and use
`setByOffset()`/`getByOffset()` to access data at the given offset.
- `impl`: get WGSL code of function implementation for the util
functions mentioned above.
This change applies the usage of new IndicesHelper through the code, but
not necessary for all code.
### Description
Fix some Resize failing tests.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
onnxjs contains a `Resize` op input check which is outdated since opset
9. Currently `Resize` supports up to 4 inputs. This PR looses the input
check.
### Motivation and Context
Fixes#15636
### Description
enable webgpu in browser unit test.
The CI pipeline uses Edge v113+ which enables WebGPU.
===
**UPDATE on 08/07/2023:**
- add flags to Edge browser launch commandline so that Edge on CI agents
can initialize WebGPU correctly.
- ONLY enable webgpu on web release build. Other pipelines are using
flag `-b=wasm,webgl,xnnpack` to specify the other 3 backends explicitly.
- disable "Resize" related test failures. Once they are fixed the tests
can be re-enabled.
---------
Co-authored-by: Satya Jandhyala <satya.k.jandhyala@gmail.com>
### Description
Added two kernels for Layer and Instance norm
Also added maximum limits for `maxBufferSize` when requesting GPU device
as by default it's limited to 256mb and it fails allocating 600mb buffer
while running fp32 StableDiffusion weights.
### Motivation and Context
These two are used in StableDiffusion and many other networks
### Description
<!-- Describe your changes. -->
This PR makes sure that only storage buffers are reused. Previously, the
query buffer might also get from the freeBuffers list if there is a
matching size in it. But they are different usage, which results errors.
Fixed ArgMin and ArgMax and refactored using functionality from Reduce
operator code.
### Description
Removed code/functionality duplication and fixed some issue.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Make CacheHint mechanism, which is designed to avoid running the same
test multiple times saving the result mapped against a key, working by
adding input dims.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
update op test schema.
This changes fixes several problems for operator tests for web:
- `opsets` -> `opset`: an operator uses exactly one opset instead of
multiple
- `condition` -> `platformCondition`: make it less confusing
- `inputShapeDefinitions`: allows to test ORT behaviors when it get
no/partial/full shape info.
Added a JSON schema file and also an example file
### Description
Added Gather op that works with both i32 and i64 indices, assuming that
values fall into i32 limit. The assumption is safe because it's not
possible to allocate more than 2gb buffer for inputs.
It treats all data from input tensor as u32, copying 1 or 2 elements for
i64, u64 and double.
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
### Description
If Expand inputs has rank < 2, `inputIndicesHelper` and
`outputIndicesHelper` create indices as u32 instead if array<u32> and
`calculateInputIndex` throws an error
### Motivation and Context
I've encountered this error while making StableDiffusion work with JSEP
### Description
<!-- Describe your changes. -->
As title.
And manually validated it in the
https://github.com/fs-eire/ort-rn-hello-world test app with the
dev/updated version of onnxruntime-react-native package:
https://www.npmjs.com/package/onnxruntime-react-native/v/1.16.0-dev.20230712-a396a15fa6
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Resolve security warning issues. cc @skottmckay thanks author for the
changes.
Co-authored-by: Scott McKay <skottmckay@gmail.com>
argmax and argmin are similar to reduce. Eventually we need to add
optimized flavors of the shader.
softmax is optimized but only works on the last axis for now which
should be the common use case.
todo: enable more ut for argmax/argmin