onnxruntime/js/web
Jiajia Qin 41d2ff622c
[js/webgpu] Optimize InstanceNormalization (#17491)
### Description
<!-- Describe your changes. -->
In previous implementation, there are two loops to iterate H * W
elements to calculate the `mean` and `squaredNorm` value in one thread,
meanwhile it outputs H * W elements in one thread. That results it's
very very slow when H * W is a large value. And usually, H * W does be a
large value in a model. For example, in the `candy-8` model, the shapes
of [H, W] are [224,224], [112,112], [56,56] for `InstanceNormalization`
op. And in my ADL, `[1,224,224,32]` consumes 17 ms. See below:
```
[profiling] kernel "23848328|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,224,224,32] | float32, execution time: 17007914 ns
```

In this PR, it uses workgroup memory to optimize the original algorithm.
The advantage is that it can parallelly utilize the 64 (workgroupSize)
threads in one workgroup to calculate `mean` and `squaredNorm` value.
Meanwhile, it only outputs `H * W / workgroupSize` outputs for one
thread, which greatly reduces the overhead for one thread. With this
optimization, `[1,224,224,32]` becomes 3 ms and the main overhead is the
extra two `transpose`. The `createInstanceNormProgramInfo` only needs
`0.64` ms. See below:
```
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] | float32, output[0]: [1,32,224,224] | float32, execution time: 1543792 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,32,224,224] | float32, execution time: 642652 ns
program-manager.ts:115 
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, output[0]: [1,224,224,32] | float32, execution time: 991608 ns
```
This PR currently only applies the new algorithm to NCHW format. For
NHWC format, one way is to transpose the input so that it can use the
new algorithm. But the disadvantage is that 2 extra transpose are added.
@dakenf also gives another way to optimize NHWC. Details see
[here](d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts).
I checked @dakenf's method. The perf is similar with transpose +
optimized NCHW. But on different GPUs, one is a little better than
another or vice versa. So I prefer this PR only does the NCHW part.
@dakenf can submit his optimization on NHWC.
2023-09-14 17:03:18 -07:00
..
docs [WebGPU/JS] Added Pad operator support (#16928) 2023-09-14 13:14:11 -07:00
lib [js/webgpu] Optimize InstanceNormalization (#17491) 2023-09-14 17:03:18 -07:00
script [js/web] add a test flag to customize chromium flags (#17545) 2023-09-14 10:05:31 -07:00
test [js/webgpu] Optimize InstanceNormalization (#17491) 2023-09-14 17:03:18 -07:00
.gitignore [js/web] add target ort.webgpu.min.js (#15780) 2023-05-04 10:05:39 -07:00
.npmignore [js/web] add target ort.webgpu.min.js (#15780) 2023-05-04 10:05:39 -07:00
karma.conf.js [js/web] add a test flag to customize chromium flags (#17545) 2023-09-14 10:05:31 -07:00
package-lock.json Bump electron from 23.1.2 to 23.3.13 in /js/web (#17436) 2023-09-07 17:39:49 -07:00
package.json [js/common] add unit tests for onnxruntime-common (#16812) 2023-07-25 14:37:41 -07:00
README.md [js] enable formatter for more file types (#16888) 2023-07-28 15:46:58 -07:00
tsconfig.json [js/common] add unit tests for onnxruntime-common (#16812) 2023-07-25 14:37:41 -07:00
types.d.ts [js/web] add target ort.webgpu.min.js (#15780) 2023-05-04 10:05:39 -07:00
webpack.config.js [js] enable formatter for more file types (#16888) 2023-07-28 15:46:58 -07:00

ONNX Runtime Web

ONNX Runtime Web is a Javascript library for running ONNX models on browsers and on Node.js.

ONNX Runtime Web has adopted WebAssembly and WebGL technologies for providing an optimized ONNX model inference runtime for both CPUs and GPUs.

Why ONNX models

The Open Neural Network Exchange (ONNX) is an open standard for representing machine learning models. The biggest advantage of ONNX is that it allows interoperability across different open source AI frameworks, which itself offers more flexibility for AI frameworks adoption.

Why ONNX Runtime Web

With ONNX Runtime Web, web developers can score models directly on browsers with various benefits including reducing server-client communication and protecting user privacy, as well as offering install-free and cross-platform in-browser ML experience.

ONNX Runtime Web can run on both CPU and GPU. On CPU side, WebAssembly is adopted to execute the model at near-native speed. ONNX Runtime Web complies the native ONNX Runtime CPU engine into WebAssembly backend by using Emscripten, so it supports most functionalities native ONNX Runtime offers, including full ONNX operator coverage, multi-threading, ONNX Runtime Quantization as well as ONNX Runtime Mobile. For performance acceleration with GPUs, ONNX Runtime Web leverages WebGL, a popular standard for accessing GPU capabilities. We are keeping improving op coverage and optimizing performance in WebGL backend.

See Compatibility and Operators Supported for a list of platforms and operators ONNX Runtime Web currently supports.

Usage

Refer to ONNX Runtime JavaScript examples for samples and tutorials.

Documents

Developement

Refer to the following links for development information:

Compatibility

OS/Browser Chrome Edge Safari Electron Node.js
Windows 10 wasm, webgl wasm, webgl - wasm, webgl wasm
macOS wasm, webgl wasm, webgl wasm, webgl wasm, webgl wasm
Ubuntu LTS 18.04 wasm, webgl wasm, webgl - wasm, webgl wasm
iOS wasm, webgl wasm, webgl wasm, webgl - -
Android wasm, webgl wasm, webgl - - -

Operators

WebAssembly backend

ONNX Runtime Web currently support all operators in ai.onnx and ai.onnx.ml.

WebGL backend

ONNX Runtime Web currently supports a subset of operators in ai.onnx operator set. See webgl-operators.md for a complete, detailed list of which ONNX operators are supported by WebGL backend.

WebGPU backend

WebGPU backend is still an experimental feature. See webgpu-operators.md for a detailed list of which ONNX operators are supported by WebGPU backend.

License

License information can be found here.