onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-18 21:21:17 +00:00

Author	SHA1	Message	Date
Xu Xing	76dfe5347c	[js/webgpu] Support uniforms for instance-norm (#18929 ) Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>	2024-01-09 14:56:00 -08:00
Xu Xing	557ac74c05	[js/webgpu] Support gemm uniforms (#19056 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-01-09 09:57:06 -08:00
Xu Xing	42ba2aed54	[js/webgpu] Support pad uniforms (#19057 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-01-09 09:34:56 -08:00
Xu Xing	eb92681bfb	[js/webgpu] Support range uniforms (#19055 )	2024-01-09 09:33:57 -08:00
Xu Xing	dee6a5b371	[js/webgpu] Support uniforms for attention and multihead attention (#18903 )	2024-01-09 07:46:30 -08:00
Xu Xing	8f024b7394	[js/webgpu] Support uniforms for layer-norm (#18755 )	2024-01-08 18:16:25 -08:00
Jiajie Hu	447a3a7c70	[js/webgpu] Fix Expand/Gather when input type is bool (#18999 ) ### Description Also update the op test suite. ### Motivation and Context Previously the total size in case `Expand - last dim is not divisible by 4` was a multiple of 4, even though the last dimension was not, so the bug has never been caught.	2024-01-05 08:16:15 -08:00
xhcao	867b9d8f04	[js/webgpu] Fix f16 errors for ConvTranspose2D (#18986 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-01-04 08:06:01 -08:00
Jiajie Hu	3b8b9147fa	[js/webgpu] Mitigate floating point accuracy issue in Resize (#18956 ) ### Description The patch fixes a floating point accuracy issue in Resize by preferring integer indices and integer arithmetic where possible. ### Motivation and Context Model test `test_resize_upsample_sizes_nearest_floor_align_corners` was observed to be failing on certain platforms. The root cause is the inaccurate floating point evaluation of 21 / 7 (2.999... vs 3), which results in the wrong input element to be indexed (floor(2.999...) vs floor(3)).	2024-01-03 14:15:26 -08:00
satyajandhyala	780fc3611b	[JS/Web] Sajandhy/webgpu resize scales rank check (#18954 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-29 09:23:27 -08:00
Jiajia Qin	44584c3ebe	[js/webgpu] only declare shape and strides in shader when necessary (#18940 ) ### Description Previously, shape and strides were added unconditionally even they are not used. This PR fixes this issue and only adds shape and strides when they are required. It's useful when some shapes are not used as uniform if the program depends on type instead of rank.	2023-12-28 15:43:08 -08:00
Jiajia Qin	c613cc58a9	[js/webgpu] Fix shader compilation errors in Resize (#18947 ) ### Description An extra right parenthesis was added by accidentally, which results some resize cases fail. This PR fixes it.	2023-12-28 13:15:26 -08:00
satyajandhyala	3bbe4fe2ff	[JS/WebGPU] Add trilinear interpolation to Resize; activation_params attribute is optional for FusedConv also. (#18842 ) ### Description Add trilinear interpolation to Resize and changed activation_params attribute as optional for FuseConv. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-27 16:21:29 -08:00
Xu Xing	0bc71b0c9b	[js/webgpu] Refactor attributes of pool (#18728 )	2023-12-26 17:23:52 -08:00
satyajandhyala	98510fb8fb	[JS/WebGPU] fix an error in Clip (#18799 ) ### Description <!-- Describe your changes. --> Check whether the min/max inputs are provided and use default values if not provided. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-19 13:51:01 -08:00
Jiajia Qin	8f7b89bd5b	[js/webgpu] Optimize NCHW layout for InstanceNormalization (#18123 ) ### Description The changes in this PR includes: 1) Fix f16 errors in InstanceNormalization with NCHW format. 2) Use vec to further optimize the original algorithm. 3) (Removed) Don't do layout conversion for InstanceNormalization for JSEP since InstanceNormalization itself is suitable for NCHW layout and has better performance in our current implementation. Tested on sd-vae-decoder-f16.onnx, it becomes 285 ms from 314 ms. The aggregate gpu profiling data can be found as below (Note the data is based change 3).): Before: <html> <body> <!--StartFragment--><span><span class="ui-provider ef bbg bbh bbi bbj bbk bbl bbm bbn bbo bbp bbq bbr bbs bbt bbu bbv bbw bbx bby bbz bca bcb bcc bcd bce bcf bcg bch bci bcj bck bcl bcm bcn" dir="ltr"> Kernel \| Time (Ms) \| Percentage (%) -- \| -- \| -- Conv \| 201.55 \| 69.56 InstanceNormalization \| 42.49 \| 14.67 Transpose \| 28.95 \| 9.99 Mul \| 5.69 \| 1.96 Add \| 3.82 \| 1.32 MatMul \| 3.27 \| 1.13 Sigmoid \| 2.24 \| 0.77 Resize \| 1.16 \| 0.40 Softmax \| 0.34 \| 0.12 Cast \| 0.24 \| 0.08 Sum \| 289.75 <br class="Apple-interchange-newline"><!--EndFragment--> </body> </html> After: <html> <body> <!--StartFragment--><span><span class="ui-provider ef bbg bbh bbi bbj bbk bbl bbm bbn bbo bbp bbq bbr bbs bbt bbu bbv bbw bbx bby bbz bca bcb bcc bcd bce bcf bcg bch bci bcj bck bcl bcm bcn" dir="ltr"> Kernel \| Time (Ms) \| Percentage (%) -- \| -- \| -- Conv \| 205.44 \| 79.43 InstanceNormalization \| 18.24 \| 7.05 Transpose \| 17.64 \| 6.82 Mul \| 5.69 \| 2.20 Add \| 3.81 \| 1.47 MatMul \| 3.56 \| 1.38 Sigmoid \| 2.24 \| 0.86 Resize \| 1.19 \| 0.46 Softmax \| 0.59 \| 0.23 Cast \| 0.24 \| 0.09 Sum \| 258.65 \| </span></span><!--EndFragment--> </body> </html> From above table, we can see that two ops time are greatly reduced. One is InstanceNormalization and the other is Transpose. The reason that the transpose time is reduced is because each InstanceNormalization is surrounded with two reshape ops in sd-vae-decoder-f16.onnx. Due to JSEP is prefer NHWC and InstanceNormalization is layout sensitive op, so two extra transpose ops are inserted dynamically when executing this model. After this change, those inserted transpose ops are not needed anymore. So the overall transpose time is reduced.	2023-12-15 11:26:15 -08:00
Jiajia Qin	4bbed4c71a	[js/webgpu] Fix f16 errors in unary (#18839 ) ### Description This PR fixes below errors: ``` no matching overload for operator > (vec4<f16>, vec4<f32>)	2023-12-15 11:25:12 -08:00
Jiajia Qin	b30e721dc8	[js/webgpu] Provide a naive vectorized matmul algorithm (#18758 ) ### Description This PR provided a vectorized matmul algorithm. In most situations, we still go to the workgroup memory optimized matmul. But for some situations, like N and K are very small, using workgroup optimized matmul can't fully utilize the underlying hardware due to the 32x32 tile size. So for very small N/K, we switch to the naive vectorized matmul algorithm to improve the hardware execution unit usage. With this PR, matmul with input0: [1, 36864, 3], input1: [1, 3, 3], input2: [3] becomes less than 1 ms from 4.34 ms on Intel Gen9 GPUs.	2023-12-13 09:03:23 -08:00
satyajandhyala	0ca84549ab	[JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727 ) ### Description <!-- Describe your changes. --> Added uniforms to Reduce op ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve perforamnce.	2023-12-12 11:12:23 -08:00
satyajandhyala	d673e39ad8	[JS/WebGPU] Added uniforms to Tile and Where Ops (#18768 ) ### Description <!-- Describe your changes. --> Added uniforms to Tile and Where Ops ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve performance.	2023-12-11 20:58:52 -08:00
Jiajia Qin	b4be9e1bbb	[js/webgpu] Fix shader compilation errors in cumsum (#18779 ) ### Description This PR fixes below shader compilation errors: ``` Tint WGSL reader failure: :39:31 error: no matching overload for operator + (f32, i32) 5 candidate operators: operator + (T, T) -> T where: T is abstract-float, abstract-int, f32, i32, u32 or f16 operator + (vecN<T>, T) -> vecN<T> where: T is abstract-float, abstract-int, f32, i32, u32 or f16 operator + (T, vecN<T>) -> vecN<T> where: T is abstract-float, abstract-int, f32, i32, u32 or f16 operator + (vecN<T>, vecN<T>) -> vecN<T> where: T is abstract-float, abstract-int, f32, i32, u32 or f16 operator + (matNxM<T>, matNxM<T>) -> matNxM<T> where: T is abstract-float, f32 or f16 sum = sum + get_inputByIndices(inputIndices); ^ - While validating [ShaderModuleDescriptor "CumSum"] - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "CumSum"]).	2023-12-11 18:11:38 -08:00
Guenther Schmuelling	9aa7284351	fix lint error (#18708 )	2023-12-05 10:37:03 -08:00
satyajandhyala	70816001cc	[JS/Web] AddedUniforms in GatherElements. (#18670 ) ### Description Use Uniforms in GatherElements and clean-up ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve performance	2023-12-05 09:19:53 -08:00
Xu Xing	f949e0580b	[js/webgpu] Support uniforms for pool (#18656 )	2023-12-05 07:54:30 -08:00
satyajandhyala	10c547516d	[JS/Web] Added CumSum operator to JSEP (#18637 ) ### Description Added CumSum operator ### Motivation and Context Reduce CPU <->GPU data movement.	2023-12-05 07:51:53 -08:00
Jiajia Qin	5353adcde3	[js/webgpu] Use the naive convTranspose when in/out channels are both 1 (#18658 ) ### Description With this change, convTranspose with input0 [1, 18, 32, 1], input1 [1, 1, 16, 16] becomes 0.59ms from 6.64ms.	2023-12-04 13:18:37 -08:00
Jiajia Qin	92ee664f64	[js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661 ) ### Description Currently, for non-uniform variables, we still use `array<u32, N>` type instead of array<vec4<u32>, N1>`. So we can't always treat all variables with rank > 4 as uniforms to index. This PR fixes below errors: ``` error(s) generated while compiling the shader: :5:44 error: index 4 out of bounds [0..1] return uniforms.input_strides[4] * (outputIndices[4] % uniforms.input_shape[4])+uniforms.input_strides[3] * (outputIndices[3] % uniforms.input_shape[3])+uniforms.input_strides[2] * (outputIndices[2] % uniforms.input_shape[2])+uniforms.input_strides[1] * (outputIndices[1] % uniforms.input_shape[1])+uniforms.input_strides[0] * (outputIndices[0] % uniforms.input_shape[0]); ^ FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - float32 FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - shape < input.size()	2023-12-01 15:35:35 -08:00
Xu Xing	73d9b03509	[js/webgpu] Add multidimensional(>4) uniform support (#18546 ) This change removes the check of enableShapesUniforms. When all uses of this are removed, enableShapesUniforms can be removed too.	2023-11-30 17:10:33 -08:00
Jiajia Qin	6781b6cf3d	[js/webgpu] add bool type for Expand/Gather (#18615 ) ### Description In [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model, it uses expand with bool type running on cpu ep. \| Kernel \| Shape \| Provider \| \| -------- \| ------- \| ------- \| \| Expand \| "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] \| CPUExecutionProvider \| After this change, it will run on jsep. \| Kernel \| Shape \| Provider \| \| -------- \| ------- \| ------- \| \| Expand \| "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] \| JsExecutionProvider \|	2023-11-30 15:47:08 -08:00
satyajandhyala	7335760424	[JS/Web] Add uniforms to Einsum (#18531 ) ### Description Add uinforms to Einsum ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve performance.	2023-11-29 15:30:33 -08:00
Yulong Wang	50e6235af1	[js/web] allow ShaderHelper to use internal (non-I/O) variables (#18525 ) ### Description This PR includes a change that inspired from #18452 to resolve a requirement: a shader may depend on an instance of `IndicesHelper` to generate WGSL code snippet, but the IndicesHelper instance is not necessarily an input/output of the program. So the existing `declareVariables()` function does not work with this scenario. In order to support this requirement, I added this "use" function to `interface ShaderHelper`, which takes a helper-like object as parameter. The hidden implementation `ShaderHelperImpl` class will iterate the helpers and call `impl()` for each. @axinging @qjia7	2023-11-28 15:15:59 -08:00
Jiajia Qin	fc8631e2f1	[js/web] Fix conv2dMatmul errors due to #18452 (#18562 ) ### Description Currently, all conv2dMatmul with inChannels = 3 and outChannels % 4 = 0 will report compilation errors. Models, which include this kind of shape will be impacted, like mobilenetv2-12, resnet50 . The errors is introduced by #18452 https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200, which accidentally pass `components` to `x`. But `x`'s components is `innerElementSize` not `components `. And when `innerElementSize` is 3, we should use `1` in current design.	2023-11-27 21:21:47 -08:00
Jiajia Qin	64dacc2892	[js/webgpu] Add BatchNormalization Op (#18468 ) ### Description This PR adds `BatchNormalization` with `float` support. Some Todos: 1. all inputs don't have same data type. For example, x/y is float16, but bias/scale is float32 or double. 2. training mode support. We see many models are using `BatchNormalization` ops. However, due to the missing in jsep, all of them run on cpu, which result very poor performance. With this PR's support, densenet-9 model becomes 20.29 ms from 250.69 ms.	2023-11-22 15:58:06 -08:00
Xu Xing	fa106942a7	[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452 ) This change refactored matmul/conv related programs to support shape uniforms. Currently only matmul shape uniforms are fully enabled. TODOs: add input dependencies for conv related programs, turn clipMax and clipMin to uniforms.	2023-11-22 14:42:55 -08:00
satyajandhyala	841f7ed3e0	[[JS/Web]Added uniform to Expand op. (#18558 ) ### Description <!-- Describe your changes. --> Added Uniforms to Expand operator kernel ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve performance	2023-11-22 14:14:24 -08:00
Arthur Islamov	1c555c5fc1	[JS/Web] Resize & BiasSplitGelu fp16 support (#18536 ) ### Description Resize and BiasSplitGelu fp16 support on WebGPU	2023-11-22 12:12:07 -08:00
Yulong Wang	c7fd930330	[js/web] unify resolve rules for "Clip" (#18527 ) ### Description It was a mistake to use 2 different names for Clip operator in op-resolve-rules.ts for different opset. An optimized implementation can handle both cases (opset < 11 and opset >=11). Remove "ClipV10" as an entry from the table.	2023-11-20 23:18:06 -08:00
Jiajia Qin	abdf8b7c3f	[js/webgpu] Optimize broadcast binary. (#18185 ) ### Description Currently, the binary algorithms are divided into the vectorize one (efficient) and non-vectorize one (less efficient). Below situations will go to the vectorize one: 1) A or B's shape length is 1. 2) The shared dimensions length of A and B are divisible by 4. 3) A and B have same shape. This PR adds another situation as below to go to the vectorize algorithm. 4. A or B's last dimension is divisible by 4. With this change, the aggerate time of Add in sam-b-encoder becomes 309.65 ms from 409.12 ms on Intel ADL.	2023-11-20 16:52:17 -08:00
Arthur Islamov	fac3e33da5	[js/web] JSEP Attention & MultiHeadAttention (#17742 ) ### Description This is a narrow implementation of Attention/MultiHeadAttention as it does not support: a. inputs 5-7 for MHA b. packed QKV/KV c. past/present d. attention mask But it works well for StableDiffusion and can be extended later. It reduces VRAM usage as it combines many ops into few I've updated demo here https://islamov.ai/stable-diffusion-webgpu/ it takes ~13sec for 1 image with 20 steps on RTX3090Ti and about 25s on M1 Pro VRAM usage is about 8gb if you don't use img2img Going to focus on SDXL now --------- Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-11-17 12:23:52 -08:00
satyajandhyala	b291b20fa0	[JS/Web]Added uniforms support to Slice op. (#18422 ) ### Description Support uniforms in Slice op ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve ferformance	2023-11-16 09:44:13 -08:00
Yulong Wang	586f06f5a1	[js/web] set noUnusedParameters to true and fix a few bugs (#18404 ) ### Description - set tsconfig "noUnusedParameters" to `true` and fix a few bugs discovered by typescript. how unused parameter is fixed: - for most code (webgl), add underscore as prefix, which is the standard ignore pattern for typescript check. - remove unused parameter from function and modify corresponding function calls (jsep) - fix a bug in ArgMinMax: this 2 operators do not have more than one input(s) so the `createArgMinMaxAttributesFromInputs()` is removed. - add proxy main.ts into typescript check and fix a bug in parameter passing - fixed `run()` function call and add typecheck fix (hack)	2023-11-15 09:16:29 -08:00
Xu Xing	949ac4b7ce	[js/webgpu] Support uniforms for gather (#18312 )	2023-11-13 11:24:34 -08:00
Xu Xing	0c8c0014f6	[js/webgpu] Use builtin num_workgroups to fix shader key conflict (#18387 ) This fixes conformance failure of tinyyolov2-8 and potential shader key conflict issues.	2023-11-10 17:37:45 -08:00
Xu Xing	8dba6efd61	[js/webgpu] Add uniforms support to concat op (#18238 )	2023-11-10 13:46:03 -08:00
Jiajia Qin	28c23aed04	[js/webgpu] Fix conv2d with activation (#18388 ) ### Description Fix #18297 With PR #17766, conv2d activation in mobilenetv2-12 will not be empty. However, activation is not supported yet in [biasActivationSnippet](https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/ops/3rd-party/activation_util.ts#L48C14-L48C36). This PR makes all places unify to use [getActivationSnippet](https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/ops/fuse-utils.ts#L13) to fix this issue.	2023-11-10 12:54:35 -08:00
Xu Xing	dd1bb760eb	[js/webgpu] Fix scalar uniform (#18318 )	2023-11-10 10:12:22 -08:00
Xu Xing	829d802337	[js/webgpu] Support uniform for softmax (#18345 )	2023-11-09 11:19:23 -08:00
Guenther Schmuelling	25fbc2b0ab	fix fused relu activation (#18303 )	2023-11-09 08:18:21 -08:00
Jiajia Qin	606356d0b1	[js/webgpu] Simplify the Resize shader when noScale is true (#18321 ) ### Description For Resize, when `noScale` is true, the shader can become very simple, which is not related with `attributes.mode` anymore. So we should remove those parts of shader code for simplification. This PR can also fix #18311 since the `noScale` are all true in that model. However, #18311 also exposes that the Resize implementation for `linear` mode has bug. It seems that the currently implementation always treat the input as either 2d or 4d tensor, however, the actual input is 3d tensor, that's why the shader compilation is failed. We may need to fix it in a separate PR.	2023-11-07 12:54:20 -08:00
satyajandhyala	a16d528399	[JS/Web] Added Uniforms support to binary ops. (#18260 ) ### Description Added Uniform support to binary ops ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> To improve performance	2023-11-07 08:41:52 -08:00

1 2 3

118 commits