// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

import minimist from 'minimist';
import npmlog from 'npmlog';

import { Env, InferenceSession } from 'onnxruntime-common';

import { Logger } from '../lib/onnxjs/instrument';
import { Test } from '../test/test-types';

/* eslint-disable max-len */
const HELP_MESSAGE = `
test-runner-cli

Run ONNX Runtime Web tests, models, and benchmarks in different environments.

Usage:
 test-runner-cli <mode> ... [options]

Modes:
 suite0                      Run all unittests, all operator tests and node model tests that are described in the suite test list
 suite1                      Run all operator tests and node model tests that are described in the suite test list
 model                       Run a single model test
 unittest                    Run all unittests
 op                          Run a single operator test

Options:

*** General Options ***

 -h, --help                  Print this message.
 -d, --debug                 Specify to run test runner in debug mode. Debug mode does the following:
                               - outputs verbose log for the test runner
                               - sets the environment debug flag (env.debug = true)
                               - opens the Chromium debug port at 9333 and keeps karma from exiting after tests complete.
 -b=<...>, --backend=<...>   Specify one or more backend(s) to run the test upon.
                               Backends can be one or more of the following, separated by comma:
                                 webgl
                                 webgpu
                                 wasm
                                 webnn
 -e=<...>, --env=<...>       Specify the environment to run the test. Should be one of the following:
                               chrome     (default)
                               edge       (Windows only)
                               firefox
                               electron
                               safari     (macOS only)
                               node
                               bs         (for BrowserStack tests)
 -p, --profile               Enable profiler.
                               The profiler will generate extra logs that include timing information for events.
 -t, --trace                 Enable trace.
 -P[=<...>], --perf[=<...>]  Generate performance numbers. Cannot be used with flag --debug.
                               This flag can be used with a number as value, specifying the total count of test cases to run. The test cases may be used multiple times. Default value is 10.
 -c, --file-cache            Enable file cache.

*** Session Options ***

 -u=<...>, --optimized-model-file-path=<...>        Specify a file path to dump the optimized model to.
 -o=<...>, --graph-optimization-level=<...>         Specify graph optimization level.
                                                      Default is 'all'. Valid values are 'disabled', 'basic', 'extended', 'all'.
 -i=<...>, --io-binding=<...>                       Specify the IO binding testing type. Should be one of the following:
                                                      none            (default)
                                                      gpu-tensor      use pre-allocated GPU tensors for inputs and outputs
                                                      gpu-location    use pre-allocated GPU tensors for inputs and set preferredOutputLocation to 'gpu-buffer'

*** Logging Options ***

 --log-verbose               Set log level to verbose
 --log-info                  Set log level to info
 --log-warning               Set log level to warning
 --log-error                 Set log level to error
                               The 4 flags above specify the logging configuration.

*** Backend Options ***

 --wasm.<...>=<...>          Set global environment flags for each backend.
 --webgl.<...>=<...>         These flags can be used multiple times to set multiple flags. For example:
 --webgpu.<...>=<...>          --webgpu.profiling.mode=default --wasm.numThreads=1 --wasm.simd=false
 --webnn.<...>=<...>

 --webnn-device-type         Set the WebNN device type (cpu/gpu/npu)

 -x, --wasm-number-threads   Set the WebAssembly number of threads
                               ("--wasm-number-threads" is deprecated. use "--wasm.numThreads" or "-x" instead)
 --wasm-init-timeout         Set the timeout for WebAssembly backend initialization, in milliseconds
                               (deprecated. use "--wasm.initTimeout" instead)
 --wasm-enable-simd          Set whether to enable SIMD
                               (deprecated. use "--wasm.simd" instead)
 --wasm-enable-proxy         Set whether to enable proxy worker
                               (deprecated. use "--wasm.proxy" instead)
 --webgl-context-id          Set the WebGL context ID (webgl/webgl2)
                               (deprecated. use "--webgl.contextId" instead)
 --webgl-matmul-max-batch-size   Set the WebGL matmulMaxBatchSize
                               (deprecated. use "--webgl.matmulMaxBatchSize" instead)
 --webgl-texture-cache-mode  Set the WebGL texture cache mode (initializerOnly/full)
                               (deprecated. use "--webgl.textureCacheMode" instead)
 --webgl-texture-pack-mode   Set the WebGL texture pack mode (true/false)
                               (deprecated. use "--webgl.pack" instead)
 --webgpu-profiling-mode     Set the WebGPU profiling mode (off/default)
                               (deprecated. use "--webgpu.profiling.mode" instead)

*** Browser Options ***

 --no-sandbox                This flag will be passed to Chrome.
                               Sometimes Chrome needs this flag to work together with Karma.
 --user-data-dir=<...>       This flag will be passed to browsers to specify the user data directory.
 --chromium-flags=<...>      This flag will be passed to Chrome and Edge browsers. Can be used multiple times.

Examples:

 Run all suite0 tests:
 > test-runner-cli suite0

 Run a single model test (test_relu) on the WebAssembly backend:
 > test-runner-cli model test_relu --backend=wasm

 Debug unittest:
 > test-runner-cli unittest --debug

 Debug operator matmul, highlighting verbose logs from BaseGlContext and WebGLBackend:
 > test-runner-cli op matmul --backend=webgl --debug --log-verbose=BaseGlContext,WebGLBackend

 Profile an ONNX model on the WebGL backend:
 > test-runner-cli model <model_folder> --profile --backend=webgl

 Run perf testing of an ONNX model on the WebGL backend:
 > test-runner-cli model <model_folder> -b=webgl -P
`;
/* eslint-enable max-len */

export declare namespace TestRunnerCliArgs {
  type Mode = 'suite0' | 'suite1' | 'model' | 'unittest' | 'op';
  type Backend = 'cpu' | 'webgl' | 'webgpu' | 'wasm' | 'onnxruntime' | 'webnn';
  type Environment = 'chrome' | 'chromecanary' | 'edge' | 'firefox' | 'electron' | 'safari' | 'node' | 'bs';
  type BundleMode = 'dev' | 'perf';
  type IOBindingMode = 'none' | 'gpu-tensor' | 'gpu-location';
}
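
// These unions mirror the CLI surface described in HELP_MESSAGE above: the positional <mode> argument maps to
// `Mode`, `-b`/`--backend` values to `Backend`, `-e`/`--env` to `Environment`, and `-i`/`--io-binding` to
// `IOBindingMode`.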

export interface TestRunnerCliArgs {
  debug: boolean;
  mode: TestRunnerCliArgs.Mode;
  /**
   * The parameter used in mode 'model' or 'op', specifying the search string for the model or op test
   */
  param?: string;
  backends: [TestRunnerCliArgs.Backend];
  env: TestRunnerCliArgs.Environment;

  /**
   * Bundle Mode
   *
   * This field affects the behavior of Karma and the build script.
   *
   * Mode "perf":
   *   - use "dist/ort.all.min.js" as main file
   *   - use "test/ort.test.min.js" as test file
   * Mode "dev":
   *   - use "dist/ort.all.js" as main file
   *   - use "test/ort.test.js" as test file
   */
  bundleMode: TestRunnerCliArgs.BundleMode;
  ioBindingMode: TestRunnerCliArgs.IOBindingMode;

  logConfig: Test.Config['log'];

  /**
   * Whether to enable InferenceSession's profiler
   */
  profile: boolean;

  /**
   * Whether to enable file cache
   */
  fileCache: boolean;

  /**
   * Specify the number of times that test cases should run
   */
  times?: number;

  /**
   * Specify a file path to dump the optimized model to
   */
  optimizedModelFilePath?: string;

  /**
   * Specify graph optimization level
   */
  graphOptimizationLevel: 'disabled' | 'basic' | 'extended' | 'all';

  cpuOptions?: InferenceSession.CpuExecutionProviderOption;
  cudaOptions?: InferenceSession.CudaExecutionProviderOption;
  wasmOptions?: InferenceSession.WebAssemblyExecutionProviderOption;
  webglOptions?: InferenceSession.WebGLExecutionProviderOption;
  webnnOptions?: InferenceSession.WebNNExecutionProviderOption;
  globalEnvFlags?: Test.Options['globalEnvFlags'];
  noSandbox?: boolean;
  userDataDir?: string;
  chromiumFlags: string[];
}
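
// For reference, a parsed result might look like the following (illustrative values only, not produced by any
// fixed test configuration):
//
//   const exampleArgs: TestRunnerCliArgs = {
//     debug: false,
//     mode: 'model',
//     param: 'test_relu',
//     backends: ['wasm'],
//     env: 'chrome',
//     bundleMode: 'dev',
//     ioBindingMode: 'none',
//     logConfig: [],
//     profile: false,
//     fileCache: false,
//     graphOptimizationLevel: 'all',
//     chromiumFlags: [],
//   };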

function parseBooleanArg(arg: unknown, defaultValue: boolean): boolean;
function parseBooleanArg(arg: unknown): boolean | undefined;
function parseBooleanArg(arg: unknown, defaultValue?: boolean): boolean | undefined {
  if (typeof arg === 'undefined') {
    return defaultValue;
  }

  if (typeof arg === 'boolean') {
    return arg;
  }

  if (typeof arg === 'number') {
    return arg !== 0;
  }

  if (typeof arg === 'string') {
    if (arg.toLowerCase() === 'true') {
      return true;
    }
    if (arg.toLowerCase() === 'false') {
      return false;
    }
  }

  throw new TypeError(`invalid boolean arg: ${arg}`);
}
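
// A few illustrative calls (not exhaustive): parseBooleanArg('True') and parseBooleanArg(1) return true,
// parseBooleanArg('false') and parseBooleanArg(0) return false, parseBooleanArg(undefined, true) falls back to
// the given default, and any other string such as 'yes' throws a TypeError.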

function parseLogLevel<T>(arg: T) {
  let v: string[] | boolean;
  if (typeof arg === 'string') {
    v = arg.split(',');
  } else if (Array.isArray(arg)) {
    v = [];
    for (const e of arg) {
      v.push(...e.split(','));
    }
  } else {
    v = arg ? true : false;
  }
  return v;
}
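
// Illustrative behavior: parseLogLevel('BaseGlContext,WebGLBackend') returns ['BaseGlContext', 'WebGLBackend'],
// a repeated flag (parsed by minimist as an array of values) is flattened the same way, and a bare boolean flag
// such as `--log-verbose` simply yields true.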

function parseLogConfig(args: minimist.ParsedArgs) {
  const config: Array<{ category: string; config: Logger.Config }> = [];
  const verbose = parseLogLevel(args['log-verbose']);
  const info = parseLogLevel(args['log-info']);
  const warning = parseLogLevel(args['log-warning']);
  const error = parseLogLevel(args['log-error']);

  if (typeof error === 'boolean' && error) {
    config.push({ category: '*', config: { minimalSeverity: 'error' } });
  } else if (typeof warning === 'boolean' && warning) {
    config.push({ category: '*', config: { minimalSeverity: 'warning' } });
  } else if (typeof info === 'boolean' && info) {
    config.push({ category: '*', config: { minimalSeverity: 'info' } });
  } else if (typeof verbose === 'boolean' && verbose) {
    config.push({ category: '*', config: { minimalSeverity: 'verbose' } });
  }

  if (Array.isArray(error)) {
    config.push(...error.map((i) => ({ category: i, config: { minimalSeverity: 'error' as Logger.Severity } })));
  }
  if (Array.isArray(warning)) {
    config.push(...warning.map((i) => ({ category: i, config: { minimalSeverity: 'warning' as Logger.Severity } })));
  }
  if (Array.isArray(info)) {
    config.push(...info.map((i) => ({ category: i, config: { minimalSeverity: 'info' as Logger.Severity } })));
  }
  if (Array.isArray(verbose)) {
    config.push(...verbose.map((i) => ({ category: i, config: { minimalSeverity: 'verbose' as Logger.Severity } })));
  }

  return config;
}
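
// Illustrative example: `--log-verbose=BaseGlContext,WebGLBackend` produces
// [{ category: 'BaseGlContext', config: { minimalSeverity: 'verbose' } },
//  { category: 'WebGLBackend', config: { minimalSeverity: 'verbose' } }],
// while a bare `--log-error` produces a single wildcard entry { category: '*', config: { minimalSeverity: 'error' } }.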

function parseCpuOptions(_args: minimist.ParsedArgs): InferenceSession.CpuExecutionProviderOption {
  return { name: 'cpu' };
}

function parseWasmOptions(_args: minimist.ParsedArgs): InferenceSession.WebAssemblyExecutionProviderOption {
  return { name: 'wasm' };
}

function parseWasmFlags(args: minimist.ParsedArgs): Env.WebAssemblyFlags {
  const wasm = args.wasm || {};
  const numThreads = (wasm.numThreads = wasm.numThreads ?? args.x ?? args['wasm-number-threads']);
  if (typeof numThreads !== 'undefined' && typeof numThreads !== 'number') {
    throw new Error('Flag "wasm.numThreads"/"x"/"wasm-number-threads" must be a number value');
  }
  const initTimeout = (wasm.initTimeout = wasm.initTimeout ?? args['wasm-init-timeout']);
  if (typeof initTimeout !== 'undefined' && typeof initTimeout !== 'number') {
    throw new Error('Flag "wasm.initTimeout"/"wasm-init-timeout" must be a number value');
  }
  const simd = (wasm.simd = parseBooleanArg(wasm.simd ?? args['wasm-enable-simd']));
  if (typeof simd !== 'undefined' && typeof simd !== 'boolean') {
    throw new Error('Flag "wasm.simd"/"wasm-enable-simd" must be a boolean value');
  }
  const proxy = (wasm.proxy = parseBooleanArg(wasm.proxy ?? args['wasm-enable-proxy']));
  if (typeof proxy !== 'undefined' && typeof proxy !== 'boolean') {
    throw new Error('Flag "wasm.proxy"/"wasm-enable-proxy" must be a boolean value');
  }
  return wasm;
}
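
// Illustrative mapping (assuming default minimist parsing): `--wasm.numThreads=1 --wasm-enable-simd=false` makes
// parseWasmFlags return { numThreads: 1, simd: false, initTimeout: undefined, proxy: undefined }; the new-style
// "--wasm.*" flags take precedence over the deprecated long-form flags when both are given.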

function parseWebglOptions(_args: minimist.ParsedArgs): InferenceSession.WebGLExecutionProviderOption {
  return { name: 'webgl' };
}

function parseWebglFlags(args: minimist.ParsedArgs): Partial<Env.WebGLFlags> {
  const webgl = args.webgl || {};
  const contextId = (webgl.contextId = webgl.contextId ?? args['webgl-context-id']);
  if (contextId !== undefined && contextId !== 'webgl' && contextId !== 'webgl2') {
    throw new Error('Flag "webgl.contextId"/"webgl-context-id" is invalid');
  }
  const matmulMaxBatchSize = (webgl.matmulMaxBatchSize =
    webgl.matmulMaxBatchSize ?? args['webgl-matmul-max-batch-size']);
  if (matmulMaxBatchSize !== undefined && typeof matmulMaxBatchSize !== 'number') {
    throw new Error('Flag "webgl.matmulMaxBatchSize"/"webgl-matmul-max-batch-size" must be a number value');
  }
  const textureCacheMode = (webgl.textureCacheMode = webgl.textureCacheMode ?? args['webgl-texture-cache-mode']);
  if (textureCacheMode !== undefined && textureCacheMode !== 'initializerOnly' && textureCacheMode !== 'full') {
    throw new Error('Flag "webgl.textureCacheMode"/"webgl-texture-cache-mode" is invalid');
  }
  const pack = (webgl.pack = parseBooleanArg(webgl.pack ?? args['webgl-texture-pack-mode']));
  if (pack !== undefined && typeof pack !== 'boolean') {
    throw new Error('Flag "webgl.pack"/"webgl-texture-pack-mode" is invalid');
  }
  const async = (webgl.async = parseBooleanArg(webgl.async ?? args['webgl-async']));
  if (async !== undefined && typeof async !== 'boolean') {
    throw new Error('Flag "webgl.async"/"webgl-async" is invalid');
  }
  return webgl;
}
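
// Illustrative mapping: `--webgl.contextId=webgl2 --webgl-texture-pack-mode=true` makes parseWebglFlags return a
// flags object containing { contextId: 'webgl2', pack: true }; unrecognized values such as
// `--webgl-context-id=webgl3` are rejected with an error.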

function parseWebgpuFlags(args: minimist.ParsedArgs): Partial<Env.WebGpuFlags> {
  const webgpu = args.webgpu || {};
  const profilingMode = ((webgpu.profiling = webgpu.profiling ?? {}).mode =
    webgpu?.profiling?.mode ?? webgpu.profilingMode ?? args['webgpu-profiling-mode']);
if (profilingMode !== undefined && profilingMode !== 'off' && profilingMode !== 'default') {
|
|
|
|
|
throw new Error('Flag "webgpu-profiling-mode" is invalid');
|
|
|
|
|
}
|
2024-08-14 23:51:22 +00:00
|
|
|
const validateInputContent = (webgpu.validateInputContent = parseBooleanArg(
|
|
|
|
|
webgpu.validateInputContent ?? args['webgpu-validate-input-content'],
|
|
|
|
|
));
|
2023-09-30 09:05:32 +00:00
|
|
|
if (validateInputContent !== undefined && typeof validateInputContent !== 'boolean') {
|
|
|
|
|
throw new Error('Flag "webgpu-validate-input-content" is invalid');
|
|
|
|
|
}
|
2024-03-13 19:00:36 +00:00
|
|
|
return webgpu;
|
2023-04-24 22:21:18 +00:00
|
|
|
}
|
|
|
|
|
|
2024-01-09 18:10:57 +00:00
|
|
|
function parseWebNNOptions(args: minimist.ParsedArgs): InferenceSession.WebNNExecutionProviderOption {
|
|
|
|
|
const deviceType = args['webnn-device-type'];
|
2024-04-16 01:43:46 +00:00
|
|
|
if (deviceType !== undefined && !['cpu', 'gpu', 'npu'].includes(deviceType)) {
|
2024-01-09 18:10:57 +00:00
|
|
|
throw new Error('Flag "webnn-device-type" is invalid');
|
|
|
|
|
}
|
2024-08-14 23:51:22 +00:00
|
|
|
return { name: 'webnn', deviceType };
|
2024-01-09 18:10:57 +00:00
|
|
|
}
|
|
|
|
|
|
2024-03-13 19:00:36 +00:00
|
|
|
function parseGlobalEnvFlags(args: minimist.ParsedArgs) {
|
2023-04-24 22:21:18 +00:00
|
|
|
const wasm = parseWasmFlags(args);
|
|
|
|
|
const webgl = parseWebglFlags(args);
|
|
|
|
|
const webgpu = parseWebgpuFlags(args);
|
2024-08-14 23:51:22 +00:00
|
|
|
return { webgl, wasm, webgpu };
|
2021-05-17 21:16:59 +00:00
|
|
|
}
|
|
|
|
|
|
2021-04-27 07:04:25 +00:00
|
|
|
export function parseTestRunnerCliArgs(cmdlineArgs: string[]): TestRunnerCliArgs {
|
|
|
|
|
const args = minimist(cmdlineArgs);
|
|
|
|
|
|
|
|
|
|
if (args.help || args.h) {
|
|
|
|
|
console.log(HELP_MESSAGE);
|
|
|
|
|
process.exit();
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Option: -d, --debug
|
|
|
|
|
const debug = parseBooleanArg(args.debug || args.d, false);
|
|
|
|
|
if (debug) {
|
|
|
|
|
npmlog.level = 'verbose';
|
|
|
|
|
}
|
|
|
|
|
npmlog.verbose('TestRunnerCli.Init', 'Parsing commandline arguments...');
|
|
|
|
|
|
|
|
|
|
const mode = args._.length === 0 ? 'suite0' : args._[0];
|
|
|
|
|
|
|
|
|
|
// Option: -e=<...>, --env=<...>
|
|
|
|
|
const envArg = args.env || args.e;
|
2024-08-14 23:51:22 +00:00
|
|
|
const env = typeof envArg !== 'string' ? 'chrome' : envArg;
|
2024-08-16 02:27:54 +00:00
|
|
|
if (['chrome', 'chromecanary', 'edge', 'firefox', 'electron', 'safari', 'node', 'bs'].indexOf(env) === -1) {
|
2021-04-27 07:04:25 +00:00
|
|
|
throw new Error(`not supported env ${env}`);
|
|
|
|
|
}
|
2021-05-15 01:15:38 +00:00
|
|
|
|
|
|
|
|
// Option: -b=<...>, --backend=<...>
|
2024-01-14 07:04:02 +00:00
|
|
|
const browserBackends = ['webgl', 'webgpu', 'wasm', 'webnn'];
|
2023-04-24 22:21:18 +00:00
|
|
|
|
2023-05-12 22:47:59 +00:00
|
|
|
// TODO: remove this when Chrome supports WebNN.
|
|
|
|
|
// we need this for now because Chrome does not support webnn yet,
|
2023-04-24 22:21:18 +00:00
|
|
|
// and ChromeCanary is not in CI.
|
2023-05-12 22:47:59 +00:00
|
|
|
|
2024-01-14 07:04:02 +00:00
|
|
|
const defaultBrowserBackends = ['webgl', 'webgpu', 'wasm' /*, 'webnn'*/];
|
2021-05-15 01:15:38 +00:00
|
|
|
const nodejsBackends = ['cpu', 'wasm'];
|
|
|
|
|
const backendArgs = args.backend || args.b;
|
2024-08-14 23:51:22 +00:00
|
|
|
const backend =
|
|
|
|
|
typeof backendArgs !== 'string'
|
|
|
|
|
? env === 'node'
|
|
|
|
|
? nodejsBackends
|
|
|
|
|
: defaultBrowserBackends
|
|
|
|
|
: backendArgs.split(',');
|
2021-05-15 01:15:38 +00:00
|
|
|
for (const b of backend) {
|
|
|
|
|
if ((env !== 'node' && browserBackends.indexOf(b) === -1) || (env === 'node' && nodejsBackends.indexOf(b) === -1)) {
|
|
|
|
|
throw new Error(`backend ${b} is not supported in env ${env}`);
|
|
|
|
|
}
|
2021-04-27 07:04:25 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Options:
|
|
|
|
|
// --log-verbose=<...>
|
|
|
|
|
// --log-info=<...>
|
|
|
|
|
// --log-warning=<...>
|
|
|
|
|
// --log-error=<...>
|
|
|
|
|
const logConfig = parseLogConfig(args);
|
2024-03-13 19:00:36 +00:00
|
|
|
let logLevel = logConfig[0]?.config.minimalSeverity;
|
|
|
|
|
|
2021-04-27 07:04:25 +00:00
|
|
|
// Option: -p, --profile
|
2024-08-14 23:51:22 +00:00
|
|
|
const profile = args.profile || args.p ? true : false;
|
2021-04-27 07:04:25 +00:00
|
|
|
if (profile) {
|
2024-08-14 23:51:22 +00:00
|
|
|
logConfig.push({ category: 'Profiler.session', config: { minimalSeverity: 'verbose' } });
|
|
|
|
|
logConfig.push({ category: 'Profiler.node', config: { minimalSeverity: 'verbose' } });
|
|
|
|
|
logConfig.push({ category: 'Profiler.op', config: { minimalSeverity: 'verbose' } });
|
|
|
|
|
logConfig.push({ category: 'Profiler.backend', config: { minimalSeverity: 'verbose' } });
|
2024-03-13 19:00:36 +00:00
|
|
|
logLevel = 'verbose';
|
2021-04-27 07:04:25 +00:00
|
|
|
}
|
|
|
|
|
|
2024-03-13 19:00:36 +00:00
|
|
|
// Option: -t, --trace
|
|
|
|
|
const trace = parseBooleanArg(args.trace || args.t, false);
|
|
|
|
|
|
|
|
|
|
// Options:
|
|
|
|
|
// --wasm.<...>=<...>
|
|
|
|
|
// --webgl.<...>=<...>
|
|
|
|
|
// --webgpu.<...>=<...>
|
2024-08-14 23:51:22 +00:00
|
|
|
const globalEnvFlags = { ...parseGlobalEnvFlags(args), debug, trace, logLevel };
|
2024-03-13 19:00:36 +00:00
|
|
|
|
2021-04-27 07:04:25 +00:00
|
|
|
// Option: -P[=<...>], --perf[=<...>]
|
2024-08-14 23:51:22 +00:00
|
|
|
const perfArg = args.perf || args.P;
|
2021-04-27 07:04:25 +00:00
|
|
|
const perf = perfArg ? true : false;
|
2024-08-14 23:51:22 +00:00
|
|
|
const times = typeof perfArg === 'number' ? perfArg : 10;
|
2021-04-27 07:04:25 +00:00
|
|
|
if (debug && perf) {
|
|
|
|
|
throw new Error('Flag "perf" cannot be used together with flag "debug".');
|
|
|
|
|
}
|
2024-08-14 23:51:22 +00:00
|
|
|
if (perf && mode !== 'model') {
|
2021-04-27 07:04:25 +00:00
|
|
|
throw new Error('Flag "perf" can only be used in mode "model".');
|
|
|
|
|
}
|
|
|
|
|
if (perf) {
|
2024-08-14 23:51:22 +00:00
|
|
|
logConfig.push({ category: 'TestRunner.Perf', config: { minimalSeverity: 'verbose' } });
|
2021-04-27 07:04:25 +00:00
|
|
|
}
|
|
|
|
|
|
[js/webgpu] support IO binding (#17480)
<del>
**This PR is based on a few prerequisite PRs. They are listed
below:**
- #17465
- #17469
- #17470
- #17472
- #17473
- #17484
Please review the current change by only looking at commit
e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later.
</del>
### Description
This PR introduces WebGPU IO binding. This new feature allows
onnxruntime-web users to use tensors created on the GPU as model
input/output, so that model inferencing can be done without unnecessary
data copies between CPU and GPU for the model input/output.
### Examples
An E2E demo/example is being worked on.
Below is a simple demo with code snippets.
Let's first look at how it is done today:
```js
// STEP.1 - create an inference session:
const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] });
// STEP.2 - create model input: (supposing myImageCpuData is a Float32Array)
const feeds = {
'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3])
};
// STEP.3 - run model
const myResults = await mySession.run(feeds);
// STEP.4 - get output data
const myData = myResults['output_image:0'].data; // Float32Array
```
#### for inputs (GPU tensor):
Now, with IO binding, you can create a tensor from a GPU buffer, and
feed it to the model:
```js
// new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data)
const feeds = {
'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] })
};
```
#### for outputs (pre-allocated GPU tensor)
You can also do this for outputs, **if you know the output shape**:
```js
// new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object)
const fetches = {
'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] })
};
// new STEP.3 - run model with pre-allocated output (fetches)
const myResults = await mySession.run(feeds, fetches);
```
#### for outputs (specify location)
If you do not know the output shape, you can specify the output location
when creating the session:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
executionProviders: ['webgpu'],
preferredOutputLocation: "gpu-buffer"
});
```
If the model has multiple outputs, you can specify them separately:
```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
executionProviders: ['webgpu'],
preferredOutputLocation: {
"output_image:0": "gpu-buffer"
}
});
```
Now you don't need to prepare the `fetches` object, and onnxruntime-web
will put the output data at the specified location.
#### read data
When you get the output tensor, you can:
```js
// get the gpu buffer object:
const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer
// get the CPU data asynchronously
const cpuData = await myOutputTensor.getData();
// get the CPU data asynchronously and release the underlying GPU resources
const cpuData = await myOutputTensor.getData(true);
// dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called.
myOutputTensor.dispose();
```
#### resource management
JavaScript has GC, so you don't need to worry about managing JavaScript
objects. But there are 2 types of resources that are not managed by GC:
- GPU buffers used in tensors
- underlying ORT native resources
To simplify, most of the unmanaged resources are handled inside ORT Web.
But there are a few resources that users need to manage:
- All external GPU resources, including the GPU buffers inside all tensors
created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. Users
should manage those GPU buffers themselves.
- When a session is created with `preferredOutputLocation` ==
"gpu-buffer" specified in session options, and the corresponding output
is not pre-allocated, users need to call the output tensor's `dispose()`
or `getData(true)` to manually release the underlying GPU buffers (see
the sketch after this list).
- ORT internal errors (including providing a pre-allocated output tensor
with wrong type/dims) invalidate the whole wasm memory and are not
recoverable. An exception is thrown in this situation.
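To tie these rules together, here is a hypothetical end-to-end usage combining the snippets from this description; the model path, the input/output names and the shapes are illustrative only.
```ts
// Hypothetical end-to-end example: GPU input, GPU-resident output, manual release.
import * as ort from 'onnxruntime-web';

async function runWithGpuIo(inputGpuBuffer: GPUBuffer): Promise<Float32Array> {
  const session = await ort.InferenceSession.create('./my_model.onnx', {
    executionProviders: ['webgpu'],
    preferredOutputLocation: 'gpu-buffer',
  });

  // inputGpuBuffer stays owned by the caller and must be released by the caller.
  const feeds = {
    'input_image:0': ort.Tensor.fromGpuBuffer(inputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }),
  };
  const results = await session.run(feeds);

  const output = results['output_image:0'];
  // Download the data and release the underlying GPU buffer in a single call.
  return (await output.getData(true)) as Float32Array;
}
```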
2023-09-29 18:24:42 +00:00
|
|
|
// Option: -i=<...>, --io-binding=<...>
|
|
|
|
|
const ioBindingArg = args['io-binding'] || args.i;
|
2024-08-14 23:51:22 +00:00
|
|
|
const ioBindingMode = typeof ioBindingArg !== 'string' ? 'none' : ioBindingArg;
|
2023-09-29 18:24:42 +00:00
|
|
|
if (['none', 'gpu-tensor', 'gpu-location'].indexOf(ioBindingMode) === -1) {
|
|
|
|
|
throw new Error(`not supported io binding mode ${ioBindingMode}`);
|
|
|
|
|
}
|
|
|
|
|
|
2023-02-24 23:50:15 +00:00
|
|
|
// Option: -u, --optimized-model-file-path
|
|
|
|
|
const optimizedModelFilePath = args['optimized-model-file-path'] || args.u || undefined;
|
|
|
|
|
if (typeof optimizedModelFilePath !== 'undefined' && typeof optimizedModelFilePath !== 'string') {
|
|
|
|
|
throw new Error('Flag "optimized-model-file-path" need to be either empty or a valid file path.');
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Option: -o, --graph-optimization-level
|
|
|
|
|
const graphOptimizationLevel = args['graph-optimization-level'] || args.o || 'all';
|
2024-08-14 23:51:22 +00:00
|
|
|
if (
|
|
|
|
|
typeof graphOptimizationLevel !== 'string' ||
|
|
|
|
|
['disabled', 'basic', 'extended', 'all'].indexOf(graphOptimizationLevel) === -1
|
|
|
|
|
) {
|
2023-02-24 23:50:15 +00:00
|
|
|
throw new Error(`graph optimization level is invalid: ${graphOptimizationLevel}`);
|
|
|
|
|
}
|
|
|
|
|
|
2021-04-27 07:04:25 +00:00
|
|
|
// Option: -c, --file-cache
|
|
|
|
|
const fileCache = parseBooleanArg(args['file-cache'] || args.c, false);
|
|
|
|
|
|
|
|
|
|
const cpuOptions = parseCpuOptions(args);
|
|
|
|
|
const wasmOptions = parseWasmOptions(args);
|
2021-05-17 21:16:59 +00:00
|
|
|
|
2021-04-27 07:04:25 +00:00
|
|
|
const webglOptions = parseWebglOptions(args);
|
2024-01-09 18:10:57 +00:00
|
|
|
const webnnOptions = parseWebNNOptions(args);
|
2021-04-27 07:04:25 +00:00
|
|
|
|
|
|
|
|
// Option: --no-sandbox
|
|
|
|
|
const noSandbox = !!args['no-sandbox'];
|
|
|
|
|
|
2024-03-26 20:16:59 +00:00
|
|
|
// Option: --user-data-dir
|
|
|
|
|
const userDataDir = args['user-data-dir'];
|
|
|
|
|
|
2023-09-14 17:05:31 +00:00
|
|
|
// parse chromium flags
|
|
|
|
|
let chromiumFlags = args['chromium-flags'];
|
|
|
|
|
if (!chromiumFlags) {
|
|
|
|
|
chromiumFlags = [];
|
|
|
|
|
} else if (typeof chromiumFlags === 'string') {
|
|
|
|
|
chromiumFlags = [chromiumFlags];
|
|
|
|
|
} else if (!Array.isArray(chromiumFlags)) {
|
|
|
|
|
throw new Error(`Invalid command line arg: --chromium-flags: ${chromiumFlags}`);
|
|
|
|
|
}
|
|
|
|
|
|
2021-04-27 07:04:25 +00:00
|
|
|
npmlog.verbose('TestRunnerCli.Init', ` Mode: ${mode}`);
|
|
|
|
|
npmlog.verbose('TestRunnerCli.Init', ` Env: ${env}`);
|
|
|
|
|
npmlog.verbose('TestRunnerCli.Init', ` Debug: ${debug}`);
|
|
|
|
|
npmlog.verbose('TestRunnerCli.Init', ` Backend: ${backend}`);
|
2023-09-29 18:24:42 +00:00
|
|
|
npmlog.verbose('TestRunnerCli.Init', ` IO Binding Mode: ${ioBindingMode}`);
|
2021-04-27 07:04:25 +00:00
|
|
|
npmlog.verbose('TestRunnerCli.Init', 'Parsing commandline arguments... DONE');
|
|
|
|
|
|
|
|
|
|
return {
|
|
|
|
|
debug,
|
|
|
|
|
mode: mode as TestRunnerCliArgs['mode'],
|
|
|
|
|
param: args._.length > 1 ? args._[1] : undefined,
|
|
|
|
|
backends: backend as TestRunnerCliArgs['backends'],
|
|
|
|
|
bundleMode: perf ? 'perf' : 'dev',
|
|
|
|
|
env: env as TestRunnerCliArgs['env'],
|
|
|
|
|
logConfig,
|
|
|
|
|
profile,
|
|
|
|
|
times: perf ? times : undefined,
|
2023-09-29 18:24:42 +00:00
|
|
|
ioBindingMode: ioBindingMode as TestRunnerCliArgs['ioBindingMode'],
|
2023-02-24 23:50:15 +00:00
|
|
|
optimizedModelFilePath,
|
|
|
|
|
graphOptimizationLevel: graphOptimizationLevel as TestRunnerCliArgs['graphOptimizationLevel'],
|
2021-04-27 07:04:25 +00:00
|
|
|
fileCache,
|
|
|
|
|
cpuOptions,
|
|
|
|
|
webglOptions,
|
2024-01-09 18:10:57 +00:00
|
|
|
webnnOptions,
|
2021-04-27 07:04:25 +00:00
|
|
|
wasmOptions,
|
2021-05-17 21:16:59 +00:00
|
|
|
globalEnvFlags,
|
2023-09-14 17:05:31 +00:00
|
|
|
noSandbox,
|
2024-03-26 20:16:59 +00:00
|
|
|
userDataDir,
|
2024-08-14 23:51:22 +00:00
|
|
|
chromiumFlags,
|
2021-04-27 07:04:25 +00:00
|
|
|
};
|
|
|
|
|
}
|