onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-03 23:49:44 +00:00

Author	SHA1	Message	Date
Xu Xing	dee6a5b371	[js/webgpu] Support uniforms for attention and multihead attention (#18903 )	2024-01-09 07:46:30 -08:00
Jiajia Qin	44584c3ebe	[js/webgpu] only declare shape and strides in shader when necessary (#18940 ) ### Description Previously, shape and strides were added unconditionally even they are not used. This PR fixes this issue and only adds shape and strides when they are required. It's useful when some shapes are not used as uniform if the program depends on type instead of rank.	2023-12-28 15:43:08 -08:00
satyajandhyala	98510fb8fb	[JS/WebGPU] fix an error in Clip (#18799 ) ### Description <!-- Describe your changes. --> Check whether the min/max inputs are provided and use default values if not provided. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-19 13:51:01 -08:00
Jiajia Qin	92ee664f64	[js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661 ) ### Description Currently, for non-uniform variables, we still use `array<u32, N>` type instead of array<vec4<u32>, N1>`. So we can't always treat all variables with rank > 4 as uniforms to index. This PR fixes below errors: ``` error(s) generated while compiling the shader: :5:44 error: index 4 out of bounds [0..1] return uniforms.input_strides[4] * (outputIndices[4] % uniforms.input_shape[4])+uniforms.input_strides[3] * (outputIndices[3] % uniforms.input_shape[3])+uniforms.input_strides[2] * (outputIndices[2] % uniforms.input_shape[2])+uniforms.input_strides[1] * (outputIndices[1] % uniforms.input_shape[1])+uniforms.input_strides[0] * (outputIndices[0] % uniforms.input_shape[0]); ^ FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - float32 FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - shape < input.size()	2023-12-01 15:35:35 -08:00
Xu Xing	73d9b03509	[js/webgpu] Add multidimensional(>4) uniform support (#18546 ) This change removes the check of enableShapesUniforms. When all uses of this are removed, enableShapesUniforms can be removed too.	2023-11-30 17:10:33 -08:00
Yulong Wang	50e6235af1	[js/web] allow ShaderHelper to use internal (non-I/O) variables (#18525 ) ### Description This PR includes a change that inspired from #18452 to resolve a requirement: a shader may depend on an instance of `IndicesHelper` to generate WGSL code snippet, but the IndicesHelper instance is not necessarily an input/output of the program. So the existing `declareVariables()` function does not work with this scenario. In order to support this requirement, I added this "use" function to `interface ShaderHelper`, which takes a helper-like object as parameter. The hidden implementation `ShaderHelperImpl` class will iterate the helpers and call `impl()` for each. @axinging @qjia7	2023-11-28 15:15:59 -08:00
Xu Xing	fa106942a7	[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452 ) This change refactored matmul/conv related programs to support shape uniforms. Currently only matmul shape uniforms are fully enabled. TODOs: add input dependencies for conv related programs, turn clipMax and clipMin to uniforms.	2023-11-22 14:42:55 -08:00
satyajandhyala	b291b20fa0	[JS/Web]Added uniforms support to Slice op. (#18422 ) ### Description Support uniforms in Slice op ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve ferformance	2023-11-16 09:44:13 -08:00
Xu Xing	0c8c0014f6	[js/webgpu] Use builtin num_workgroups to fix shader key conflict (#18387 ) This fixes conformance failure of tinyyolov2-8 and potential shader key conflict issues.	2023-11-10 17:37:45 -08:00
Xu Xing	dd1bb760eb	[js/webgpu] Fix scalar uniform (#18318 )	2023-11-10 10:12:22 -08:00
satyajandhyala	a16d528399	[JS/Web] Added Uniforms support to binary ops. (#18260 ) ### Description Added Uniform support to binary ops ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> To improve performance	2023-11-07 08:41:52 -08:00
Jiajia Qin	8a12b2cea6	[js/webgpu] Fix the transpose error when dims > 4D (#18027 ) ### Description <!-- Describe your changes. --> Currently, the uniform support has bugs when dims rank is larger than 4. See https://github.com/microsoft/onnxruntime/issues/17860 item 1. So this PR only enables shapes uniforms when shape rank is <= 4 for transpose. Otherwise, below compilation errors are thrown: ``` 1 error(s) generated while compiling the shader: :3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead. struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> }; ^^^^^^^^^^^^^ :3:7 note: see layout of struct: /* align(4) size(84) / struct Uniforms { / offset( 0) align(4) size( 4) / output_size : u32; / offset( 4) align(4) size(20) / a_shape : array<u32, 5>; / offset(24) align(4) size(20) / a_strides : array<u32, 5>; / offset(44) align(4) size(20) / output_shape : array<u32, 5>; / offset(64) align(4) size(20) / output_strides : array<u32, 5>; / */ }; struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> }; ^^^^^^ :4:42 note: 'Uniforms' used in address space 'uniform' here @group(0) @binding(2) var<uniform> uniforms: Uniforms; ^^^^^^^^ ```	2023-10-23 11:02:19 -07:00
Arthur Islamov	22947109f2	[js/web] FP16 LayerNorm, InstanceNorm, SkipLayerNorm (#17630 ) ### Description This PR includes fixes for Norm operations to support FP16 and also some optimizations to use vec2/vec4 if possible	2023-10-18 10:47:41 -07:00
Yulong Wang	d532645bed	[js/webgpu] revise uniform support (#17871 ) ### Description <!-- Describe your changes. --> work for items (2) and (3) in #17860	2023-10-11 16:41:46 -07:00
Yulong Wang	d9b9c5a537	[js/webgpu] support using uniform buffer (#17803 ) ### Description support using uniform buffer. This PR allows to use uniform buffer in shader program, so that some runtime information (eg. input/output shape) is no longer need to be hardcoded into shader code. There are 2 commits in this PR: - [667f31c](`667f31c83d`): framework changes to support uniform buffer, as well as updates in program manager, gpu data manager and indices helper. - [09e1d2a](`09e1d2ad1d`): an example change for operator `Transpose` to use input's rank-only instead of dims as shader key. With this change, model mobilenetv2-12 shader compile times dropped from 71 to 52.	2023-10-10 00:31:12 -07:00
Xu Xing	992f3e4609	[js/webgpu] Support where (#17544 ) Supported type: float. int32_t, uint32_t, bool. Case where_broadcast.jsonc is not enabled due to https://github.com/microsoft/onnxruntime/issues/17405. ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-10-03 14:28:21 -07:00
Jiajia Qin	891fba3b9c	[js/webgpu] Optimize Gather op (#17625 ) ### Description This PR optimizes the gather op, which is improved ~6ms in segment anything model in ADL. The problem in original algorithm is that it includes a for loop to calculate a block size of data. However, the block size may be very large, like `65536`. In GPU shader, we should try to avoid large loop in shader and try to use more threads to do it parallelly. Before: ``` [profiling] kernel "41771992\|[Gather] 41771992" input[0]: [4,65536] \| float32, input[1]: [1] \| int64, output[0]: [1,65536] \| float32, execution time: 6886207 ns ``` After: ``` [profiling] kernel "41771992\|[Gather] 41771992" input[0]: [4,65536] \| float32, input[1]: [1] \| int64, output[0]: [1,65536] \| float32, execution time: 11719 ns	2023-09-21 21:00:36 -07:00
Jiajia Qin	41d2ff622c	[js/webgpu] Optimize InstanceNormalization (#17491 ) ### Description <!-- Describe your changes. --> In previous implementation, there are two loops to iterate H * W elements to calculate the `mean` and `squaredNorm` value in one thread, meanwhile it outputs H * W elements in one thread. That results it's very very slow when H * W is a large value. And usually, H * W does be a large value in a model. For example, in the `candy-8` model, the shapes of [H, W] are [224,224], [112,112], [56,56] for `InstanceNormalization` op. And in my ADL, `[1,224,224,32]` consumes 17 ms. See below: ``` [profiling] kernel "23848328\|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] \| float32, input[1]: [32] \| float32, input[2]: [32] \| float32, output[0]: [1,224,224,32] \| float32, execution time: 17007914 ns ``` In this PR, it uses workgroup memory to optimize the original algorithm. The advantage is that it can parallelly utilize the 64 (workgroupSize) threads in one workgroup to calculate `mean` and `squaredNorm` value. Meanwhile, it only outputs `H * W / workgroupSize` outputs for one thread, which greatly reduces the overhead for one thread. With this optimization, `[1,224,224,32]` becomes 3 ms and the main overhead is the extra two `transpose`. The `createInstanceNormProgramInfo` only needs `0.64` ms. See below: ``` [profiling] kernel "23003600\|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] \| float32, output[0]: [1,32,224,224] \| float32, execution time: 1543792 ns program-manager.ts:115 [profiling] kernel "23003600\|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] \| float32, input[1]: [32] \| float32, input[2]: [32] \| float32, output[0]: [1,32,224,224] \| float32, execution time: 642652 ns program-manager.ts:115 [profiling] kernel "23003600\|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] \| float32, output[0]: [1,224,224,32] \| float32, execution time: 991608 ns ``` This PR currently only applies the new algorithm to NCHW format. For NHWC format, one way is to transpose the input so that it can use the new algorithm. But the disadvantage is that 2 extra transpose are added. @dakenf also gives another way to optimize NHWC. Details see [here](`d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts`). I checked @dakenf's method. The perf is similar with transpose + optimized NCHW. But on different GPUs, one is a little better than another or vice versa. So I prefer this PR only does the NCHW part. @dakenf can submit his optimization on NHWC.	2023-09-14 17:03:18 -07:00
Arthur Islamov	03b56f7a73	[js/webgpu] FP16 extension registration (#17493 ) ### Description First small change to support FP16 --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-09-13 13:11:17 -07:00
Jiajia Qin	fffefb1c22	[js/webgpu] Optimize matmul (#16969 ) ### Description Changes in this PR: 1) use the optimized version `makeMatMulPacked[Vec4]Source` to support matmul. 2) enable the conv2dByMatMul path. 3) support broadcast 4) use IndicesHelper. MatMul with M = 512, K = 512, N = 512 becomes 2ms from 15ms when enabling profilingMode on my ADL.	2023-08-29 12:40:57 -07:00
Yulong Wang	bb1871332f	[js/webgpu] add kernel Not and Equal (#17306 ) ### Description This PR adds kernel implementation for operator "Not" and "Equal". Also removed download cache in gpu data manager. Why removing download cache The following test case failed. ("Or" is on CPU, "Greater" and "Equal" are on JSEP) ![image](https://github.com/microsoft/onnxruntime/assets/7679871/8d9798ad-2703-4fb9-907e-ff716c67d0b2) after debugging, I found that both "Equal" and "Greater" are using the same output GPU Data ID. This is because when ORT executes the graph, it first run "Equal", allowing its shader to write into GPU Data ID 2; then a Gpu2Cpu copy for it is issued (because currently "Or" is on CPU EP); at this point, ORT thinks GPU Data ID=2 is free to use; so it reuse it as output for "Greater". This means there is no allocation for output of "Greater" kernel, and both kernel writes to GPU Data ID=2. For gpu data manager, there will be 2 downloads from the same GPU buffer. Previously I think this is a waste of resource so I cached the data. But now it shoes that we need to perform 2 downloads because the GPU data is already different. The download data cache should be removed. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-08-27 19:50:17 -07:00
Yulong Wang	8b18d48c7c	[js/webgpu] make IndicesHelper implementation implicit (#17193 ) ### Description This change makes it no longer required to call indicesHelper.impl() in shader code.	2023-08-23 14:41:35 -07:00
Yulong Wang	14a8315f10	[js/web] [webgpu] new incides helper (#16957 ) ### Description This PR introduces the new incides helper. IndicesHelper is a helper class for generating WGSL code for manipulating indices and data for a shader's input or output. This class is designed to offer a unified way to generate WGSL code for manipulating indices and data for a shader's input or output. The following is a list of terminologies used in this class: - `offset`: a uint32 value representing the offset of an element in the data buffer. - `indices`: an abstraction of a multi-dimensional array's indices representing the data's index on each dimension. - `value`: a value of a data element. Users are expected to create an instance of this class for each shader's input or output, and use the instance to generate WGSL code for manipulating indices and data. The following 2 exported functions are for users to call to create an instance of an indices helper: - `inputVariable()`: create an indices helper instance for an input. - `outputVariable()`: create an indices helper instance for an output. An indices helper instance contains helper functions for the following operations: - access readonly basic information, including: `name`(the name of the input or output), `usage`(whether it's an input or an output) and `shape`(the passed in shape). - `type`: access readonly type information, including: `indices`(the type of indices), `value`(the type of value at runtime), `storage`(the type of value at storage) and `tensor`(the tensor type as represented in TensorView). - generate WGSL code for getting indices from offset. Use `offsetToIndices()` for WGSL code snippet to calculate incides from offset, and use `indicesToOffset()` for WGSL code snippet to calculate offset from indices. - to manipulate an instance of indices, use `setIndices()` and `getIndices()` to set and get the indices on an indices variable. - to manipulate data, use `set()`/`get()` to access data at the given indices from parameter list, use `setByIndices()`/`getByIndices()` to access data at the given indices from an indices variable, and use `setByOffset()`/`getByOffset()` to access data at the given offset. - `impl`: get WGSL code of function implementation for the util functions mentioned above. This change applies the usage of new IndicesHelper through the code, but not necessary for all code.	2023-08-11 11:36:59 -07:00
Yulong Wang	14cc02c65c	[js/web] WebGPU backend via JSEP (#14579 ) ### Description This change introduced the following new components into ONNX Runtime Web: - JavaScript Execution Provider (JSEP) - Asynchronized inferencing execution powered by Emscripten's Asyncify - WebGPU backend implemented in TypeScript - initial implementation of kernels: - elementwise operators (22) - binary operators (5) - tensor: Shape, Reshape, Transpose, Gemm - nn: Conv, {Global}Maxpool, {Global}AveragePool Code need to be polished. still working on it. ## Q&A What is JSEP? > JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime execution provider that specifically works on Web environment (browsers). JSEP allows JavaScript code to kick in from various places when ONNX Runtime inferences a model. Why JSEP? > JSEP is a hybrid mode EP that contains both C/C++ and TypeScript/JavaScript implementation. There are 2 strong reasons why we introduces JSEP: > 1. the C/C++ part helps JSEP to leverage ONNX Runtime's capabilities as much as possible including graph transformer, optimizers and also the capabilities to fallback to CPU EP. TypeScript/JavaScript helps JSEP to develop and debug much easier in the browser for the kernel implementation. > 2. the requirement of asynchronized execution from JavaScript API (eg. `buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a synchronized context (see "async problem" section below). This is done by using Emscripten's Asyncify. What is WebGPU? > WebGPU is the new GPU API that available in browser. It's one of the only 2 APIs that currently available to access the GPU from browser (the other is WebGL). > WebGPU is designed with more advanced and stronger features comparing to WebGL and is potentially solution that offer the best GPU performance for model inferencing that currently available. What is the async problem and why we have the problem? > The "async problem" is a problem that you cannot call an async function in a synchronous context. Think about the following C++ code: > ```c > // C-style declarations (API) > typedef void (ON_COMPLETE)(PVOID state, DATA data); > void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete); > > // implementation > DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) { > // how to implement? > } > ``` > The answer is, it's impossible to implement this function. Usually we try to find a sync version API, or launch a thread to call the async function and sync-wait on the main thread. Unfortunately, in browser environment, neither is possible. > > WebGPU does not offer any synchronized API for data downloading (GPU to CPU). This is the only operation that MUST be async. As `OrtRun()` will eventually call into DataTransfer for copy data from GPU to CPU, and `OrtRun()` is a synchronized function, this cannot be done in normal way. What is Emscripten? How is the Asyncify feature resolved the problem? > Emscripten is the C/C++ compiler for WebAssembly. It's what we use to compile ORT and generates the WebAssembly artifacts which runs on browsers. > > Asyncify is a [compiler feature](https://emscripten.org/docs/porting/asyncify.html) that allows calling async functions from a synchronized context. In short, it generates code to unwind and rewind call stack to emulate async execution. With this feature, we are able to call the async function inside `OrtRun()` call. ## Design Overview Inter-op JSEP is doing pretty much same thing to just another EP. It exposes an interface for inter-op with JavaScript, which is defined in onnxruntime/wasm/js_internal_api.js: ```js // init JSEP Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) { Module.jsepBackend = backend; Module.jsepAlloc = alloc; Module.jsepFree = free; Module.jsepCopy = copy; Module.jsepCopyAsync = copyAsync; Module.jsepCreateKernel = createKernel; Module.jsepReleaseKernel = releaseKernel; Module.jsepRun = run; }; ``` This simple JavaScript snippet defines all language barrier level functions that requires by JSEP to achieve implementing kernels and data transfers using JavaScript inside ONNX Runtime: - `jsepBackend`: assign the singleton object to webassembly module - `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc() and Free() - `jsepCopy`: synchronized copy ( GPU to GPU, CPU to GPU) - `jsepCopyAsync`: asynchronized copy ( GPU to CPU) - `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object that maintained in JS to match lifecycle of Kernel in ORT - `jsepRun`: OpKernel::Compute() should call into this The abstraction above allows to tie as little as possible connections and dependencies between C/C++ and TypeScript/JavaScript. Resource Management Lifecycle of tensor data and kernels are managed by ORT(C/C++) but the implementation are left to JavaScript. JavaScript code are responsible to implement the callbacks correctly. For WebGPU, the GPU data is managed by JavaScript using a singleton map (tensot_data_id => GPUBuffer). GPU pipeline is managed as singleton. Shaders are managed using a singletonmap (shader_key => gpu_program), while shader_key is generated by cache_key (OP specific, including attributes) and input shapes. about data transfer `js::DataTransfer::CopyTensor` implemented to call either synchronized or asynchronized copy callback, depending on the destination is GPU or not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function to be called in the synchronized context. run kernel in JS Kernel class constructor calls once `jsepCreateKernel()` with an optional per-kernel specific serialization to pass attributes into JavaScript. `Compute()` are implemented in a way that a metadata serialization is performed in a base class and JavaScript code can access the data using the Emscripten specific builtin macro `EM_ASM_`. disabled features* memory pattern is force disabled, because the WebGPU data is not presented by a general memory model (a buffer can be represented by offset + size). concurrent run support is disabled. WebGPU is stateful and it also has async function call. To support concurrent run will significantly increase the complexity and we don't get any real benefit from it. prefer channels last JSEP prefers channels last and returns `DataLayout::NHWC` in method `GetPreferredLayout()`. This will let the graph transformers to preprocess the graph into a channels last form so that a more optimized WebGPU shader can be used. Testing code It's impossible to test JSEP directly because JSEP itself does not contain any kernel implementation. However, it has the kernel registration which need to work together with the corresponding JavaScript code. There are unit tests that run onnx models from JavaScript API. --------- Co-authored-by: Scott McKay <skottmckay@gmail.com>	2023-04-24 15:21:18 -07:00

24 commits