onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-02 03:55:34 +00:00

Author	SHA1	Message	Date
Satya Kumar Jandhyala	544bdd6073	Fix ConvTranspose for certain attribute combinations (#23488) ### Description Convert output_padding attribute from 1D to 2D convtranspose ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/23403	2025-02-05 12:22:47 -08:00
Jiajia Qin	25f427466e	[js/webgpu] Optimize ConvTranspose (Continue) (#23429) BUG #23273 This PR makes the following optimizations: 1. When the number of output channels is one, 1) calculate the offset before the input-channel loop to reduce indices-to-offsets calculations, 2) split `inputChannelsPerGroup` into `inputChannelsPerGroupInt` and `inputChannelsRemainder` parts so that we can always access 4 values at a time for `inputChannelsPerGroupInt`. 2. Use a precise initial value to avoid useless loop iterations. Thanks to @jiangzhaoming for the suggestions on this. With this PR, ConvTranspose time drops from 8.4s to 3.7s on Intel Meteor Lake. On NV RTX 2000 Ada, it drops from 2.7s to 1.6s.	2025-01-22 08:59:17 -08:00
Jiajia Qin	7be006c466	[js/webgpu] Optimize convtranspose (#23302) ### Description BUG #23273 With this change, the convTranspose time in that bug drops from ~90s to ~7s on my Meteor Lake. This PR does the following: 1. Use the stride as the loop increment. In the bug, the stride is 1024, which greatly reduces the number of loop iterations. 2. Support components for A to reduce the number of memory accesses. 3. When the number of output channels is 1, the components of B can be the same as those of A to further reduce the number of memory accesses.	2025-01-09 11:24:42 -08:00
Changming Sun	5d692b0136	Merge web machine pools (#23243) ### Description The Web CI pipeline uses three different Windows machine pools: 1. onnxruntime-Win2022-webgpu-A10 2. onnxruntime-Win2022-VS2022-webgpu-A10 3. onnxruntime-Win-CPU-2022-web This PR merges them to reduce ongoing maintenance cost.	2025-01-03 13:53:17 -08:00
Yulong Wang	ae6dcc839e	Revert "[js/webgpu] disable failed tests temporarily (#23127)" (#23130) ### Description This reverts commit `9115682d69`.	2024-12-18 18:07:50 -08:00
Yulong Wang	9115682d69	[js/webgpu] disable failed tests temporarily (#23127) ### Description Those test cases started to fail for unknown reasons. To unblock the CI, I disabled those tests temporarily to buy time to investigate the root cause.	2024-12-16 15:35:47 -08:00
Xu Xing	c19617a24a	[js/webgpu] Add GatherND (#22847)	2024-12-04 09:57:32 -08:00
Jiajia Qin	e597eaed4a	[js/webgpu] Optimize transpose as reshape when suitable (#22870) BUG #22031	2024-11-18 12:52:48 -08:00
Xu Xing	ff57ac4f3d	[js/webgpu] Add scatterND (#22755)	2024-11-13 09:13:00 -08:00
Jiajia Qin	7e0dd9d433	[js/webgpu] Optimize Expand (#22752) Use components = 4 if possible. llama3.2-1B improves from 18 tokens/s to 20 tokens/s on my iGPUs.	2024-11-12 12:37:19 -08:00
jzm-intel	d9b91682f1	WebGPU JSEP: Make shader code not depend on input broadcasting patterns (#22536) This PR makes MatMul shaders independent of the inputs' broadcasting pattern; they depend only on the input ranks, with the shapes provided in uniforms. This fixes an issue where the shader code differed between broadcasting patterns but had an identical cache key, resulting in wrong cache hits.	2024-11-08 11:00:51 -08:00
Jiajia Qin	8fbbf2fd4f	[js/webgpu] Optimize MatMul with M = 1 (#22577) ### Description BUG #22031 In the demucs model, there are lots of MatMul ops with shapes like the following: `input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32, output[0]: [3448,1,1536] | float32` For this kind of shape, the batch size is large but M = 1. Our current algorithm partitions tiles based on [M, N], which is not efficient for such shapes. This PR reshapes the inputs to improve the matmul performance. Before: [3448,1,512] x [512,1536] = [3448,1,1536] After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536], then the output can be reshaped to [3448, 1, 1536] The overall MatMul time in the demucs model drops from 4418.17 ms to 1778.45 ms on my iGPUs. --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2024-11-01 08:04:42 -07:00
Satya Kumar Jandhyala	05fbb43b34	[JSEP/WebGPU] Fix data causing output mismatch resulting in CI build failures occasionally (#22596) ### Description A test case was failing intermittently. ### Motivation and Context Prevent unnecessary CI build failures that require manually rerunning tests.	2024-10-26 01:37:12 -07:00
Satya Kumar Jandhyala	fd8ee4894d	[JS/WebGPU] GroupQueryAttention rewrite (#20946) ### Description Implement JSEP GroupQueryAttention ### Motivation and Context Required to enable certain LLM models to run using WebGPU.	2024-10-23 10:14:09 -07:00
Yang Gu	c75f4a09b7	[js/webgpu] Remove the limitation on axis in softmax (#22231) In the current implementation, the softmax axis has to be the last one, which is an obvious limitation. This PR removes this limitation and fixes issues #20710 and #22176.	2024-09-30 18:27:11 -07:00
Jiajia Qin	3580e01348	[js/webgpu] Optimize grouped conv (#21892) ### Description #21618 This PR optimizes grouped conv by 1) making memory access on the GPU more sequential and 2) reusing the input's data to reduce the number of global memory accesses. The `Conv|GroupedConv` op in [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) drops from 1058 ms to 92 ms on iGPUs with 32 EU. For the whole model on my iGPUs with 32 EU, the wav2vec2 model drops from 1942 ms to 982 ms, and the squeezebert-uncased model drops from 431.77 ms to 71.86 ms.	2024-09-04 17:16:35 -07:00
Jiajia Qin	a80bfed5b4	[js/webgpu] Optimize transpose (#21964) ### Description Fix bugs in the previous implementation and route more cases to the optimized path. The following cases take the optimized path: 1. 2d inputs or squeezed 2d inputs 2. channels-last or channels-first transpose. For example, for the channels-last transpose [1, 256, 512, 512] -> [1, 512, 512, 256], the transpose becomes [256, 512x512] -> [512x512, 256] ### Motivation and Context For the SD Turbo demo, the total transpose time drops from 122.09 ms to 39.98 ms, and the corresponding percentage drops from 11.05% to 3.89%. This PR also helps #21618: the total transpose time in that demo drops from 70.25 ms to 17.32 ms on my iGPUs.	2024-09-04 12:04:04 -07:00
xhcao	3bfb5e4f62	[js/webgpu] support float16 for Clip (#21584)	2024-08-28 13:19:20 -07:00
Satya Kumar Jandhyala	af18824f43	[JS/WebGPU] Add GatherBlockQuantized op support (#21734) ### Description Add GatherBlockQuantized operator to JSEP. ### Motivation and Context Gemma model requires this.	2024-08-26 14:46:04 -07:00
Xu Xing	d9c57ac7db	[js/webgpu] Enable pad f16 uniform (#21691) --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2024-08-26 07:58:48 -07:00
Jiajia Qin	27a6890529	[js/webgpu] Optimize conv1d by conv2d (#19388) ### Description Optimize conv1d to go to the conv2d path to utilize conv2d's optimizations. The whisper-tiny-encoder model drops from 532.28 ms to 158.66 ms. Conv goes to Conv2DMatMul (8 ms) instead of GroupedConv (382 ms). Old profiling result (kernel: time in ms, percentage): Conv|GroupedConv: 382.99, 71.95%; MatMul: 126.16, 23.70%; Softmax: 7.01, 1.32%; Transpose: 4.59, 0.86%; Add: 4.39, 0.82%; Mul: 2.36, 0.44%; Div: 1.44, 0.27%; ReduceMean|ReduceMeanShared: 1.25, 0.23%; Erf: 0.85, 0.16%; Sub: 0.72, 0.14%; Pow: 0.46, 0.09%; Sqrt: 0.07, 0.01%; Sum: 532.28. New profiling result with this PR (kernel: time in ms, percentage): MatMul: 127.07, 80.09%; Conv|Conv2DMatMul: 8.00, 5.04%; Softmax: 6.95, 4.38%; Transpose: 4.65, 2.93%; Add: 4.26, 2.68%; Mul: 2.56, 1.61%; Div: 1.51, 0.95%; ReduceMean|ReduceMeanShared: 1.31, 0.83%; Erf: 0.85, 0.54%; Sub: 0.79, 0.50%; Pow: 0.46, 0.29%; Conv|Transpose: 0.26, 0.17%; Sqrt: 0.00, 0.00%; Sum: 158.66. --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2024-08-22 22:56:07 -07:00
Satya Kumar Jandhyala	1fb2e71ddc	[JS/WebGPU] Avoid producing presentKey/presentValue outputs if pastKey/pastValue … (#21782) Avoid producing presentKey/presentValue outputs if pastKey/pastValue don't exist.	2024-08-19 18:02:19 -07:00
Yang Gu	49fc168eed	[js/webgpu] Handle negative axis in op Split (#21771) This fixes issue #21703, where the axis is a negative value in the model. According to the spec (https://onnx.ai/onnx/operators/onnx__Split.html), a negative axis means counting dimensions from the back.	2024-08-17 16:41:23 -07:00
Tianlei Wu	d79e3c5791	Extend Attention Bias Broadcast Support (#21710) ### Description Previously, MultiHeadAttention supported relative position bias of shape [1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention supported [1, N, S, T]. This extends the support to allow [1, N, S, T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs. - [x] Rename the input of "relative position bias" to "attention bias" because it can also be used for other types of bias, like ALiBi (Attention with Linear Biases) or attention mask. - [x] Update unfused kernel to support broadcasting 2nd dimension of attention bias. - [x] Update efficient attention to support broadcasting 2nd dimension of attention bias. - [x] Update operators (MultiHeadAttention, DecoderMaskedMultiHeadAttention, Attention, PackedAttention, PackedMultiHeadAttention) to support broadcast attention bias on CUDA and CPU EPs. - [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that those EPs do not support broadcasting attention_bias for now). - [x] Add attention bias tests for MultiHeadAttention. - [x] Update operator documents - [x] Update benchmark script Other changes: * Fix some checks in multihead-attention.ts * Add helper functions to dump tensors given dimensions.	2024-08-16 15:40:04 -07:00
Yulong Wang	ef2ccc477b	[js/web] Add support for int4/uint4 tensor (#21720) ### Description Add support for int4/uint4 tensor.	2024-08-15 21:32:10 -07:00
Xu Xing	7172aff1cf	[js/webgpu] Fix max pool shape end with 0 (#21698) Bug: https://github.com/microsoft/onnxruntime/issues/21386	2024-08-13 20:59:24 -07:00
Satya Kumar Jandhyala	51b2044120	[JS/WebGPU] Add Dequantizelinear operator (#21642) ### Description Added DequantizeLinear operator for JSEP.	2024-08-09 14:44:19 -07:00
Yulong Wang	5e66fcc703	[js/web] allow op test to use f16 type for inputs/outputs (#21664) ### Description Allow op test to use f16 type for inputs/outputs. This PR introduces "@petamoriken/float16" as a Float16Array polyfill but restricts it to the test runner.	2024-08-08 09:56:37 -07:00
Xu Xing	0d7cf301a1	[js/webgpu] Add activation Tanh (#21540) Bug: https://github.com/microsoft/onnxruntime/issues/21467	2024-07-29 11:05:34 -07:00
Xu Xing	5bc12bf209	[js/webgpu] Add activation for conv3d naive (#21466)	2024-07-29 08:47:41 -07:00
Xu Xing	c3076721f3	[js/webgpu] Support conv3d naive (#20706)	2024-06-19 10:13:50 -07:00
Guenther Schmuelling	c749bd997a	webgpu quickgelu (#20939)	2024-06-06 08:21:33 -07:00
Satya Kumar Jandhyala	bab5037eab	Eliminate explicit Concat operations in Attention (#20556) ### Description Remove the explicit concatenation of pastKey with Key and pastValue with Value.	2024-05-24 09:07:57 -07:00
Xu Xing	f1fef19b6e	[js/webgpu] Support shared memory for transpose 2d (#19267) For 1024x1024: 18.7 ms without shared memory, 13.2 ms with shared memory.	2024-05-22 08:15:44 -07:00
Xu Xing	8c59cd4fce	[js/webgpu] Support GroupQueryAttention (#20237) TODOs: 1. Handle H * params.kvNumHeads greater than work group size limit. 2. Support BNSH kv cache.	2024-05-13 09:43:37 -07:00
Satya Kumar Jandhyala	21b3cbc3af	[WIP][JS/WebGPU] Inputs Key and Value could be 4-dims. (#20470) ### Description The Key and Value inputs could be 4-dims	2024-04-25 13:33:46 -07:00
Satya Kumar Jandhyala	ae78cdb5d7	[JS/WebGPU] MultiheadAttention bugfix (#20447) ### Description Fixed the concatenation condition for pastKey/key and pastValue/value, and fixed an index error. Added new test cases.	2024-04-24 08:43:14 -07:00
Satya Kumar Jandhyala	d42ac7f0c6	[JS/WebGPU] Multihead attention improvements (#20286) ### Description Enabled more use cases	2024-04-23 12:39:49 -07:00
Yulong Wang	4385602386	[js/web] fix test runner with optional input/output (#20399) ### Description Fix the test runner with optional input/output. This change fixes the OP test runner (.jsonc format test) with optional input(s) and/or output(s). This fix reveals a problem in dealing with optional outputs. Take SkipSimplifiedLayerNorm as an example: if in the ONNX model the node's outputs are [ 'output_0', '' ] instead of [ 'output_0' ], the current implementation will fail. The difference is that in the first case context.outputCount == 2, so the TypeScript implementation will try to create a tensor for output[1]. It will eventually call the C++ function (OpKernelContext::Output), and output.DataRaw() will be nullptr. The WebGPU backend will fail because it cannot deal with a TensorView with data == 0. This problem may need to be fixed or worked around in a separate PR; this PR does not fix it. Failed test cases are modified to work. Note that this PR does not break those test cases, as they never worked.	2024-04-22 12:53:10 -07:00
Satya Kumar Jandhyala	b33216be4c	[JS/WebGPU] Improve MatMulNBits perf (#19974) ### Description Improve performance using shared memory	2024-04-12 11:03:05 -07:00
Yulong Wang	50bd4571ac	[js/web] support SimplifiedLayerNorm and SkipSimplifiedLayerNorm (#20277) ### Description Support operator `SimplifiedLayerNorm` and `SkipSimplifiedLayerNorm` for WebGPU backend.	2024-04-11 14:08:50 -07:00
Jiajie Hu	23d3afd4fe	[js/webgpu] Implement com.microsoft.RotaryEmbedding (#20209) ### Description https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftrotaryembedding ### Motivation and Context As per customer request, this helps Phi-2 and Gemma.	2024-04-08 09:11:26 -07:00
Satya Kumar Jandhyala	5b64d7c32b	[JS/WebGPU] Use non-matmul implementation for ConvTranspose in channel-first case. (#20022) ### Description Avoid using vec4 Matmul implementation for ConvTranspose with channel-last	2024-03-23 11:19:14 -07:00
Xu Xing	4c6a6a37f7	[js/webgpu] Fix NAN caused by un-initialized buffer in instance-norm (#19387) The added case will be NAN because of the un-initialized buffer.	2024-03-18 22:59:32 -07:00
Satya Kumar Jandhyala	ed250b88c3	[JS/WebGPU] Optimize MatMulNBits (#19852) ### Description Use vec<2> or vec<4> operands in MatMulNBits ### Motivation and Context Improve performance	2024-03-13 10:33:14 -07:00
Satya Kumar Jandhyala	24b72d2613	[JS/WebGPU] Preserve zero size input tensor dims. (#19737) ### Description For the Concat operation, the shape of a zero-size input tensor needs to be preserved and, unlike for non-zero tensors, its dims are not constrained to match the other input tensors' dims.	2024-03-07 19:07:49 -08:00
Yulong Wang	0edb035808	[js/web] fix suite test list for zero sized tensor (#19638) ### Description Fixes the build break introduced by #19614. Currently the WebGL backend does not support zero-sized tensors. This change splits the test data into 2 parts and only enables zero-sized tensor tests for WebGPU.	2024-02-24 10:09:07 -08:00
Yulong Wang	aec2389ad0	[js/webgpu] allows a ProgramInfo's RunData to use zero sized output (#19614) ### Description This PR allows zero-sized output. To keep the implementation simple, it does not support partially zero-sized outputs: either all outputs are zero-sized, or an error is reported. Added 2 tests: - op test of `Add` with input T[2,0] T[2,1], and - test_split_zero_size_splits	2024-02-23 12:52:47 -08:00
satyajandhyala	ae3d73c981	[JS/WebGPU] Fix Split and Where to handle corner cases. (#19613) ### Description 1. Fix Where operator to handle Boolean input less than 4 bytes. 2. Fix JSEP test harness to use tensor names consistently.	2024-02-23 00:21:15 -08:00
satyajandhyala	dfeda9019c	[JS/WebGPU] Add MatMulNBits (#19446) ### Description Add MatMulNBits to support MatMul using 4-bit quantized weights	2024-02-17 09:19:17 -08:00

106 commits