onnxruntime/js/web/lib/wasm/session-handler-inference.ts

128 lines
4.3 KiB
TypeScript
Raw Normal View History

// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.
import {InferenceSession, InferenceSessionHandler, SessionHandler, Tensor, TRACE_FUNC_BEGIN, TRACE_FUNC_END} from 'onnxruntime-common';
import {SerializableInternalBuffer, TensorMetadata} from './proxy-messages';
import {copyFromExternalBuffer, createSession, endProfiling, releaseSession, run} from './proxy-wrapper';
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
import {isGpuBufferSupportedType} from './wasm-common';
[js/web] optimize module export and deployment (#20165) ### Description This PR make numbers of optimizations to onnxruntime-web's module export and deployment. See each section below for more details. #### Preview > [onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21) > ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~ > ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~ > ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~ <details> <summary><h4>Breaking changes</h4></summary> There is no code change required, but there are a few differences regarding **code import**, **flags**, **bundler config** and **deployment steps**. #### Importing: Import table is changed. See following for details. <details> <summary><h5>Current import table:</h5></summary> | Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training | |------|-----|-----|-----|-----|-----|-----| | `ort` (default) | `onnxruntime-web` | ✔️ | ❌ | ✔️ | ✔️ | ❌ | | `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | | `ort.node` | `onnxruntime-web` | ❌ | ❌ | ✔️ | ❌ | ❌ | | `ort.training` | `onnxruntime-web/training` | ❌ | ❌ | ✔️ | ✔️<sup>\[1]</sup> | ✔️ | | `ort.wasm` | `onnxruntime-web/wasm` | ❌ | ❌ | ✔️ | ✔️ | ❌ | | `ort.wasm-core` | `onnxruntime-web/wasm-core` | ❌ | ❌ | ✔️ | ❌ | ❌ | | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ | ❌ | ❌ | ✔️<sup>\[2]</sup> | ❌ | | `ort.webgpu` | `onnxruntime-web/webgpu` | ❌ | ✔️ | ✔️ | ✔️ | ❌ | * [1] didn't test. may not actually work. * [2] not working. this is a mistake in build config. </details> <details> <summary><h5>Proposed update:</h5></summary> | Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training | |------|-----|-----|-----|-----|-----|-----| | `ort` (default) | `onnxruntime-web` | ✔️ | ❌ | ✔️ | ✔️ | ❌ | | `ort.all` | ~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | | `ort.node` | `onnxruntime-web` | ❌ | ❌ | ✔️ | ❌ | ❌ | | `ort.training` | `onnxruntime-web/training` | ❌ | ❌ | ✔️ | ✔️ | ✔️ | | `ort.wasm` | `onnxruntime-web/wasm` | ❌ | ❌ | ✔️ | ✔️ | ❌ | | ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ | ~~❌~~ | ~~❌~~ | ~~✔️~~ | ~~❌~~ | ~~❌~~ | | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ | ❌ | ❌ | ~~✔️~~ ❌ | ❌ | | `ort.webgpu` | `onnxruntime-web/webgpu` | ❌ | ✔️ | ✔️ | ✔️ | ❌ | </details> #### Flags: The following flags are deprecated: - `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in build. The following flags changed their type: - `env.wasm.wasmPaths`: When using this flag as a string ( for the URL prefix ), nothing is changed. When using this flag as an object ( for per-file path override ), the type changed: ```diff - export interface Old_WasmFilePaths{ - 'ort-wasm.wasm'?: string; - 'ort-wasm-threaded.wasm'?: string; - 'ort-wasm-simd.wasm'?: string; - 'ort-training-wasm-simd.wasm'?: string; - 'ort-wasm-simd-threaded.wasm'?: string; - }; + export interface New_WasmFilePaths { + /** + * Specify the override path for the main .wasm file. + * + * This path should be an absolute path. + * + * If not modified, the filename of the .wasm file is: + * - `ort-wasm-simd-threaded.wasm` for default build + * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and WebNN) + * - `ort-training-wasm-simd-threaded.wasm` for training build + */ + wasm?: URL|string; + /** + * Specify the override path for the main .mjs file. + * + * This path should be an absolute path. + * + * If not modified, the filename of the .mjs file is: + * - `ort-wasm-simd-threaded.mjs` for default build + * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and WebNN) + * - `ort-training-wasm-simd-threaded.mjs` for training build + */ + mjs?: URL|string; + } ``` #### Bundler compatibility: Config changes are need for bundlers. See usage example in /js/web/test/e2e/ for Webpack, parcel and rollup. #### Deployment: - if consuming from a CDN, there is no breaking change. - if consuming from a local server, need to copy all `ort-*.wasm` and `ort-*.mjs` files (totally 6 files) in the dist folder. (previously only need to copy `ort-*.wasm` files.) </details> <details> <summary><h4>Problems</h4></summary> There are a few problems with the current module export and deployment: - Script URL cannot be correctly inferred when imported as ESM. - Workers are forcefully encoded using Blob URL, which makes onnxruntime-web not working in CSP environment and Node.js, when using proxy or multi-threading feature. - Generated JS code (by Emscripten) is encoded using `function.toString()`, which is unstable and error-prone. - When running with a different Emscripten build, always need the build step. Making it difficult to swap artifacts in deveopment/debug. </details> <details> <summary><h4>Goals</h4></summary> - Full ESM support - Support variances of ways to import. Including: - import from HTML's `<script>` tag (IIFE format, exporting to global variable `ort`) ```html <script src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script> ``` - import from source code inside `<script type="module">` tag (ESM) ```html <script type="module"> import * as ort from "https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs"; // using 'ort' </script> ``` - import in a CommonJS project (CJS format, resolve from package.json "exports" field) ```js // myProject/main.js const ort = require('onnxruntime-web'); ``` - import in an ESM project (ESM format, resolve from package.json "exports" field) ```js // myProject/main.js (or main.mjs) import * as ort from 'onnxruntime-web'; ``` - Support popular bundlers when importing onnxruntime-web into a CJS/ESM project. - webpack (esm requires extra post-process step) - rollup - parcel (esm requires extra post-process step) - More bundlers **TBD** - Multi-threading support for Node.js NOTE: keeping single JavaScript file (the all-in-one bundle) is no longer a goal. This is because technically there is a conflict with the other requirements. </details> <details> <summary><h4>Important Design Decisions</h4></summary> - Drop support of single JavaScript output. - The current onnxruntime-web distribution uses a single JavaScript file to include all code. While there are a few benefits, it also creates problems as mentioned above. Since ESM is being used more and more widely, and browsers are making more restricted security checks and requirement, the old Blob based solution is going to be replaced. - To achieve the requirement, specifically, the CSP environment support, we have to offer a non Blob based solution. Therefore, we have to distribute multiple files and drop the single file solution. - Do not run parser/postprocess on Emscripten generated JavaScript. - Emscripten is evolving quickly so we should only depends on what's in its documentation instead of a certain implementation details. (for example, currently we patch on its code to deal with a special variable `_scriptDir`) - Keep the generated files as-is also helps to: - reduce the size of ort.min.js - make it easier to replace build artifacts when in development/debug - Drop support for non-SIMD and non-MultiThread. This helps to reduce the number of artifacts in distribution. - (fixed-sized) SIMD is supported in any mainstream JS environment. - Multi-thread as WebAssembly feature is supported in any mainstream JS environment. In some environment the feature is guarded with cross origin policy, but it can still work if not trying to create any worker. - Use ESM output for Emscripten generated JavaScript. - There are 2 ways to dynamically import classic (umd) modules and neither of them are recommended: - dynamically creating a <script> tag. This changes the HTML structure and have quite a lot of compatibility issue - use `fetch()` and `eval()`. However `eval` is strongly suggested to be avoid because there is a great perf hit. - importing ESM is super easy - just use the `import()` call. Considering ESM is widely supported in modern browsers and Node.js this is the better option. - Add Blob based solution as a fallback for cross-origin workers. - There are still wide use case of importing onnxruntime-web from CDN. In this usage, make it able create worker by using `fetch()`+`Blob` to create a same-origin Blob URL. </details> <details> <summary><h4>Distribution File Manifest</h4></summary> The distribution folder contains the following files: - WebAssembly artifacts. These files are the result of compiling the ONNX Runtime C++ code to WebAssembly by Emscripten. | File Name | Build Flags | |------|-----| | ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` | | ort-training-wasm-simd-threaded.mjs <br/> ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/> `--enable_wasm_simd` <br/> `--enable_wasm_threads` | | ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep` <br/> `--use_webnn` | - onnxruntime-web JavaScript artifacts. These files are generated by ESBuild as the entry point for onnxruntime-web. There are multiple build targets for different use cases: | Target Name | Path for "import" or "require" | Description | |------|-----|-----| | `ort` | `onnxruntime-web` | The default target. | | `ort.all` | `onnxruntime-web/all` | The target including webgl. | | `ort.node` | `onnxruntime-web` | The default target for Node.js. | | `ort.training` | `onnxruntime-web/training` | The target including training APIs | | `ort.wasm` | `onnxruntime-web/wasm` | The target including only WebAssembly (CPU) EP | | `ort.webgl` | `onnxruntime-web/webgl` | The target including only WebGL EP | For each target, there are multiple files generated: | File Name | Description | |------|-----| | [target].js | The entry point for the target. IIFE and CommonJS format. | | [target].mjs | The entry point for the target. ESM format. | | [target].min.js <br/> [target].min.js.map | The entry point for the target. Minimized with sourcemap. IIFE and CommonJS format. | | [target].min.mjs <br/> [target].min.mjs.map | The entry point for the target. Minimized with sourcemap. ESM format. | | [target].proxy.mjs | (if appliable) The proxy ESM module for the target. | | [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if appliable) The proxy ESM module for the target. Minimized with sourcemap. | </details> <details> <summary><h4>Dynamic Import Explained</h4></summary> - Local Served | No Proxy: ``` [Bundle or ort.min.js] | + import()--> [ort-wasm-simd-threaded.mjs] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] | + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Local Served | Proxy: ``` [Bundle or ort.min.js] | + import()--> [ort.proxy.min.mjs] | + new Worker()--> [ort.proxy.min.mjs (worker)] | + import()--> [ort-wasm-simd-threaded.mjs] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] | + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Cross Origin | No Proxy: ``` [Bundle or ort.min.js] | + fetch('ort-wasm-simd-threaded.mjs') | + URL.createObjectURL(res.blob()) | + import()--> [blob:... (ort-wasm-simd-threaded)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] | + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Cross Origin | Proxy ``` [Bundle or ort.min.js] | + fetch('ort.proxy.min.mjs') | + URL.createObjectURL(res.blob()) | + import()--> [blob:... (ort.proxy)] | + new Worker()--> [blob:... (ort.proxy) (worker)] | + fetch('ort-wasm-simd-threaded.mjs') | + URL.createObjectURL(res.blob()) | + import()--> [blob:... (ort-wasm-simd-threaded)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] | + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` </details>
2024-05-20 16:51:16 +00:00
import {isNode} from './wasm-utils-env';
import {loadFile} from './wasm-utils-load-file';
export const encodeTensorMetadata = (tensor: Tensor, getName: () => string): TensorMetadata => {
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
switch (tensor.location) {
case 'cpu':
return [tensor.type, tensor.dims, tensor.data, 'cpu'];
case 'gpu-buffer':
return [tensor.type, tensor.dims, {gpuBuffer: tensor.gpuBuffer}, 'gpu-buffer'];
default:
throw new Error(`invalid data location: ${tensor.location} for ${getName()}`);
}
};
export const decodeTensorMetadata = (tensor: TensorMetadata): Tensor => {
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
switch (tensor[3]) {
case 'cpu':
return new Tensor(tensor[0], tensor[2], tensor[1]);
case 'gpu-buffer': {
const dataType = tensor[0];
if (!isGpuBufferSupportedType(dataType)) {
throw new Error(`not supported data type: ${dataType} for deserializing GPU tensor`);
}
const {gpuBuffer, download, dispose} = tensor[2];
return Tensor.fromGpuBuffer(gpuBuffer, {dataType, dims: tensor[1], download, dispose});
}
default:
throw new Error(`invalid data location: ${tensor[3]}`);
}
};
export class OnnxruntimeWebAssemblySessionHandler implements InferenceSessionHandler {
private sessionId: number;
inputNames: string[];
outputNames: string[];
async fetchModelAndCopyToWasmMemory(path: string): Promise<SerializableInternalBuffer> {
// fetch model from url and move to wasm heap.
return copyFromExternalBuffer(await loadFile(path));
}
async loadModel(pathOrBuffer: string|Uint8Array, options?: InferenceSession.SessionOptions): Promise<void> {
TRACE_FUNC_BEGIN();
let model: Parameters<typeof createSession>[0];
if (typeof pathOrBuffer === 'string') {
[js/web] optimize module export and deployment (#20165) ### Description This PR make numbers of optimizations to onnxruntime-web's module export and deployment. See each section below for more details. #### Preview > [onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21) > ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~ > ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~ > ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~ <details> <summary><h4>Breaking changes</h4></summary> There is no code change required, but there are a few differences regarding **code import**, **flags**, **bundler config** and **deployment steps**. #### Importing: Import table is changed. See following for details. <details> <summary><h5>Current import table:</h5></summary> | Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training | |------|-----|-----|-----|-----|-----|-----| | `ort` (default) | `onnxruntime-web` | ✔️ | ❌ | ✔️ | ✔️ | ❌ | | `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | | `ort.node` | `onnxruntime-web` | ❌ | ❌ | ✔️ | ❌ | ❌ | | `ort.training` | `onnxruntime-web/training` | ❌ | ❌ | ✔️ | ✔️<sup>\[1]</sup> | ✔️ | | `ort.wasm` | `onnxruntime-web/wasm` | ❌ | ❌ | ✔️ | ✔️ | ❌ | | `ort.wasm-core` | `onnxruntime-web/wasm-core` | ❌ | ❌ | ✔️ | ❌ | ❌ | | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ | ❌ | ❌ | ✔️<sup>\[2]</sup> | ❌ | | `ort.webgpu` | `onnxruntime-web/webgpu` | ❌ | ✔️ | ✔️ | ✔️ | ❌ | * [1] didn't test. may not actually work. * [2] not working. this is a mistake in build config. </details> <details> <summary><h5>Proposed update:</h5></summary> | Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training | |------|-----|-----|-----|-----|-----|-----| | `ort` (default) | `onnxruntime-web` | ✔️ | ❌ | ✔️ | ✔️ | ❌ | | `ort.all` | ~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | | `ort.node` | `onnxruntime-web` | ❌ | ❌ | ✔️ | ❌ | ❌ | | `ort.training` | `onnxruntime-web/training` | ❌ | ❌ | ✔️ | ✔️ | ✔️ | | `ort.wasm` | `onnxruntime-web/wasm` | ❌ | ❌ | ✔️ | ✔️ | ❌ | | ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ | ~~❌~~ | ~~❌~~ | ~~✔️~~ | ~~❌~~ | ~~❌~~ | | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ | ❌ | ❌ | ~~✔️~~ ❌ | ❌ | | `ort.webgpu` | `onnxruntime-web/webgpu` | ❌ | ✔️ | ✔️ | ✔️ | ❌ | </details> #### Flags: The following flags are deprecated: - `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in build. The following flags changed their type: - `env.wasm.wasmPaths`: When using this flag as a string ( for the URL prefix ), nothing is changed. When using this flag as an object ( for per-file path override ), the type changed: ```diff - export interface Old_WasmFilePaths{ - 'ort-wasm.wasm'?: string; - 'ort-wasm-threaded.wasm'?: string; - 'ort-wasm-simd.wasm'?: string; - 'ort-training-wasm-simd.wasm'?: string; - 'ort-wasm-simd-threaded.wasm'?: string; - }; + export interface New_WasmFilePaths { + /** + * Specify the override path for the main .wasm file. + * + * This path should be an absolute path. + * + * If not modified, the filename of the .wasm file is: + * - `ort-wasm-simd-threaded.wasm` for default build + * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and WebNN) + * - `ort-training-wasm-simd-threaded.wasm` for training build + */ + wasm?: URL|string; + /** + * Specify the override path for the main .mjs file. + * + * This path should be an absolute path. + * + * If not modified, the filename of the .mjs file is: + * - `ort-wasm-simd-threaded.mjs` for default build + * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and WebNN) + * - `ort-training-wasm-simd-threaded.mjs` for training build + */ + mjs?: URL|string; + } ``` #### Bundler compatibility: Config changes are need for bundlers. See usage example in /js/web/test/e2e/ for Webpack, parcel and rollup. #### Deployment: - if consuming from a CDN, there is no breaking change. - if consuming from a local server, need to copy all `ort-*.wasm` and `ort-*.mjs` files (totally 6 files) in the dist folder. (previously only need to copy `ort-*.wasm` files.) </details> <details> <summary><h4>Problems</h4></summary> There are a few problems with the current module export and deployment: - Script URL cannot be correctly inferred when imported as ESM. - Workers are forcefully encoded using Blob URL, which makes onnxruntime-web not working in CSP environment and Node.js, when using proxy or multi-threading feature. - Generated JS code (by Emscripten) is encoded using `function.toString()`, which is unstable and error-prone. - When running with a different Emscripten build, always need the build step. Making it difficult to swap artifacts in deveopment/debug. </details> <details> <summary><h4>Goals</h4></summary> - Full ESM support - Support variances of ways to import. Including: - import from HTML's `<script>` tag (IIFE format, exporting to global variable `ort`) ```html <script src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script> ``` - import from source code inside `<script type="module">` tag (ESM) ```html <script type="module"> import * as ort from "https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs"; // using 'ort' </script> ``` - import in a CommonJS project (CJS format, resolve from package.json "exports" field) ```js // myProject/main.js const ort = require('onnxruntime-web'); ``` - import in an ESM project (ESM format, resolve from package.json "exports" field) ```js // myProject/main.js (or main.mjs) import * as ort from 'onnxruntime-web'; ``` - Support popular bundlers when importing onnxruntime-web into a CJS/ESM project. - webpack (esm requires extra post-process step) - rollup - parcel (esm requires extra post-process step) - More bundlers **TBD** - Multi-threading support for Node.js NOTE: keeping single JavaScript file (the all-in-one bundle) is no longer a goal. This is because technically there is a conflict with the other requirements. </details> <details> <summary><h4>Important Design Decisions</h4></summary> - Drop support of single JavaScript output. - The current onnxruntime-web distribution uses a single JavaScript file to include all code. While there are a few benefits, it also creates problems as mentioned above. Since ESM is being used more and more widely, and browsers are making more restricted security checks and requirement, the old Blob based solution is going to be replaced. - To achieve the requirement, specifically, the CSP environment support, we have to offer a non Blob based solution. Therefore, we have to distribute multiple files and drop the single file solution. - Do not run parser/postprocess on Emscripten generated JavaScript. - Emscripten is evolving quickly so we should only depends on what's in its documentation instead of a certain implementation details. (for example, currently we patch on its code to deal with a special variable `_scriptDir`) - Keep the generated files as-is also helps to: - reduce the size of ort.min.js - make it easier to replace build artifacts when in development/debug - Drop support for non-SIMD and non-MultiThread. This helps to reduce the number of artifacts in distribution. - (fixed-sized) SIMD is supported in any mainstream JS environment. - Multi-thread as WebAssembly feature is supported in any mainstream JS environment. In some environment the feature is guarded with cross origin policy, but it can still work if not trying to create any worker. - Use ESM output for Emscripten generated JavaScript. - There are 2 ways to dynamically import classic (umd) modules and neither of them are recommended: - dynamically creating a <script> tag. This changes the HTML structure and have quite a lot of compatibility issue - use `fetch()` and `eval()`. However `eval` is strongly suggested to be avoid because there is a great perf hit. - importing ESM is super easy - just use the `import()` call. Considering ESM is widely supported in modern browsers and Node.js this is the better option. - Add Blob based solution as a fallback for cross-origin workers. - There are still wide use case of importing onnxruntime-web from CDN. In this usage, make it able create worker by using `fetch()`+`Blob` to create a same-origin Blob URL. </details> <details> <summary><h4>Distribution File Manifest</h4></summary> The distribution folder contains the following files: - WebAssembly artifacts. These files are the result of compiling the ONNX Runtime C++ code to WebAssembly by Emscripten. | File Name | Build Flags | |------|-----| | ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` | | ort-training-wasm-simd-threaded.mjs <br/> ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/> `--enable_wasm_simd` <br/> `--enable_wasm_threads` | | ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep` <br/> `--use_webnn` | - onnxruntime-web JavaScript artifacts. These files are generated by ESBuild as the entry point for onnxruntime-web. There are multiple build targets for different use cases: | Target Name | Path for "import" or "require" | Description | |------|-----|-----| | `ort` | `onnxruntime-web` | The default target. | | `ort.all` | `onnxruntime-web/all` | The target including webgl. | | `ort.node` | `onnxruntime-web` | The default target for Node.js. | | `ort.training` | `onnxruntime-web/training` | The target including training APIs | | `ort.wasm` | `onnxruntime-web/wasm` | The target including only WebAssembly (CPU) EP | | `ort.webgl` | `onnxruntime-web/webgl` | The target including only WebGL EP | For each target, there are multiple files generated: | File Name | Description | |------|-----| | [target].js | The entry point for the target. IIFE and CommonJS format. | | [target].mjs | The entry point for the target. ESM format. | | [target].min.js <br/> [target].min.js.map | The entry point for the target. Minimized with sourcemap. IIFE and CommonJS format. | | [target].min.mjs <br/> [target].min.mjs.map | The entry point for the target. Minimized with sourcemap. ESM format. | | [target].proxy.mjs | (if appliable) The proxy ESM module for the target. | | [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if appliable) The proxy ESM module for the target. Minimized with sourcemap. | </details> <details> <summary><h4>Dynamic Import Explained</h4></summary> - Local Served | No Proxy: ``` [Bundle or ort.min.js] | + import()--> [ort-wasm-simd-threaded.mjs] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] | + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Local Served | Proxy: ``` [Bundle or ort.min.js] | + import()--> [ort.proxy.min.mjs] | + new Worker()--> [ort.proxy.min.mjs (worker)] | + import()--> [ort-wasm-simd-threaded.mjs] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] | + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Cross Origin | No Proxy: ``` [Bundle or ort.min.js] | + fetch('ort-wasm-simd-threaded.mjs') | + URL.createObjectURL(res.blob()) | + import()--> [blob:... (ort-wasm-simd-threaded)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] | + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Cross Origin | Proxy ``` [Bundle or ort.min.js] | + fetch('ort.proxy.min.mjs') | + URL.createObjectURL(res.blob()) | + import()--> [blob:... (ort.proxy)] | + new Worker()--> [blob:... (ort.proxy) (worker)] | + fetch('ort-wasm-simd-threaded.mjs') | + URL.createObjectURL(res.blob()) | + import()--> [blob:... (ort-wasm-simd-threaded)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] | + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)] | + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` </details>
2024-05-20 16:51:16 +00:00
if (isNode) {
// node
model = await loadFile(pathOrBuffer);
} else {
// browser
// fetch model and copy to wasm heap.
model = await this.fetchModelAndCopyToWasmMemory(pathOrBuffer);
}
} else {
model = pathOrBuffer;
}
[this.sessionId, this.inputNames, this.outputNames] = await createSession(model, options);
TRACE_FUNC_END();
}
async dispose(): Promise<void> {
return releaseSession(this.sessionId);
}
async run(feeds: SessionHandler.FeedsType, fetches: SessionHandler.FetchesType, options: InferenceSession.RunOptions):
Promise<SessionHandler.ReturnType> {
TRACE_FUNC_BEGIN();
const inputArray: Tensor[] = [];
const inputIndices: number[] = [];
Object.entries(feeds).forEach(kvp => {
const name = kvp[0];
const tensor = kvp[1];
const index = this.inputNames.indexOf(name);
if (index === -1) {
throw new Error(`invalid input '${name}'`);
}
inputArray.push(tensor);
inputIndices.push(index);
});
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
const outputArray: Array<Tensor|null> = [];
const outputIndices: number[] = [];
Object.entries(fetches).forEach(kvp => {
const name = kvp[0];
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
const tensor = kvp[1];
const index = this.outputNames.indexOf(name);
if (index === -1) {
throw new Error(`invalid output '${name}'`);
}
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
outputArray.push(tensor);
outputIndices.push(index);
});
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
const inputs =
inputArray.map((t, i) => encodeTensorMetadata(t, () => `input "${this.inputNames[inputIndices[i]]}"`));
const outputs = outputArray.map(
(t, i) => t ? encodeTensorMetadata(t, () => `output "${this.outputNames[outputIndices[i]]}"`) : null);
const results = await run(this.sessionId, inputIndices, inputs, outputIndices, outputs, options);
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
const resultMap: SessionHandler.ReturnType = {};
for (let i = 0; i < results.length; i++) {
resultMap[this.outputNames[outputIndices[i]]] = outputArray[i] ?? decodeTensorMetadata(results[i]);
}
TRACE_FUNC_END();
[js/webgpu] support IO binding (#17480) <del> **This PR is based on a few prerequisites PRs. They are listed as below:** - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, **if you know the output shape**: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them seperately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location that specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronizely const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronizely and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffer that used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources and handled inside ORT web. But there are a few resources that need users to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. User should manage those GPU buffers themselves. - When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, user need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and is not recoverable. An exception is thrown in this situation.
2023-09-29 18:24:42 +00:00
return resultMap;
}
startProfiling(): void {
// TODO: implement profiling
}
endProfiling(): void {
void endProfiling(this.sessionId);
}
}