onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-03 03:58:54 +00:00

Author	SHA1	Message	Date
Joshua Lochner	d981b153d3	[webgpu/js] Optimize resize webgpu op & fix precision issues (#23591 ) ### Description This PR is a follow-up to https://github.com/microsoft/onnxruntime/pull/23488 and partially improves upon https://github.com/microsoft/onnxruntime/issues/23403. It does the following: - Prevents unnecessary cache shader recompilation for the 'nearest' resize operation. - Fixes precision (off-by-one) errors with the asymmetric coordinate transform. When running the Kokoro TTS model, the values of `/decoder/decoder/generator/f0_upsamp/Resize_output_0` differ at the end bounds due to precision issues when dividing 21600 by 72 (should be 300, but seemingly results in 299.999, which causes issues when flooring). ### Motivation and Context I did a deep dive over the weekend to try to fix Kokoro TTS on WebGPU and found that the above node had a large difference. Thinking this was a major issue, I spent some time fixing it. Turns out, it only happens for a small number of values, leading to a high maximum error, but most values are correct (as seen here). BEFORE: ``` [/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 78.6640682220459 | rtol: 24.13991587587724 | avgDiff: 0.009967932171121087 | medianDiff: 0.000030517578125 ``` AFTER: ``` [/decoder/decoder/generator/f0_upsamp/Resize_output_0] atol: 0.0011138916015625 | rtol: 0.0020059924232260704 | avgDiff: 0.00008570214675873825 | medianDiff: 0.000030517578125 ``` So, although it has a very small impact on the final output (waveform), this bug could appear with other models in a more severe way.
BEFORE: ``` [waveform] atol: 0.04784199967980385 | rtol: 1366.0462001093495 | avgDiff: 0.0009544936942737713 | medianDiff: 0.00015346752479672432 ``` AFTER: ``` [waveform] atol: 0.04775865003466606 | rtol: 1354.7002460360852 | avgDiff: 0.000954830244055033 | medianDiff: 0.00015274062752723694 ```	2025-02-06 10:26:25 -08:00
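The flooring problem described in the commit above can be illustrated outside the shader. The following is a minimal sketch, not the code from the PR: `snappedFloor` and its `epsilon` default are hypothetical, and the constant `299.99997` merely stands in for the imprecise GPU quotient of 21600 / 72 that the commit reports.

```typescript
// Hypothetical helper (not from the PR): floor a value, but first snap it to
// the nearest integer when it lies within `epsilon` of that integer.
function snappedFloor(x: number, epsilon: number = 1e-4): number {
  const nearest = Math.round(x);
  return Math.abs(x - nearest) < epsilon ? nearest : Math.floor(x);
}

// Stand-in for the slightly-too-small quotient described in the commit.
const imprecise = 299.99997;

// A plain floor lands one index short of the intended 300.
const naiveIndex = Math.floor(imprecise);

// Snapping first recovers the intended index.
const snappedIndex = snappedFloor(imprecise);
```

Values that are genuinely between integers, such as 299.5, still floor normally, so the tolerance only affects results that are already within rounding error of an integer.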
Satya Kumar Jandhyala	544bdd6073	Fix ConvTranspose for certain attribute combinations (#23488 ) ### Description Convert output_padding attribute from 1D to 2D convtranspose ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/23403	2025-02-05 12:22:47 -08:00
Jiajia Qin	25f427466e	[js/webgpu] Optimize ConvTranspose (Continue) (#23429 ) BUG #23273 This PR makes the following optimizations: 1. When the number of output channels is one, 1) calculate the offset before the inchannel loop to reduce indices-to-offsets calculations, 2) split `inputChannelsPerGroup` into `inputChannelsPerGroupInt` and `inputChannelsRemainder` parts so that we can always access 4 data for `inputChannelsPerGroupInt`. 2. Use a precise initial value to reduce useless loop iterations. Thanks to @jiangzhaoming for the suggestions on this. With this PR, ConvTranspose becomes 3.7s from 8.4s on Intel Meteor Lake. On NV RTX 2000 Ada, it becomes 1.6s from 2.7s.	2025-01-22 08:59:17 -08:00
Jiajia Qin	7be006c466	[js/webgpu] Optimize convtranspose (#23302 ) ### Description BUG #23273 With this change, I see the convTranspose time in that bug become ~7s from ~90s on my Meteor Lake. This PR does the following: 1. Use the stride as the loop increment. In the bug, the stride is 1024, which greatly reduces the number of loop iterations. 2. Support components for A to reduce the number of memory accesses. 3. When the number of output channels is 1, the B components can be the same as A's to further reduce the number of memory accesses.	2025-01-09 11:24:42 -08:00
Yulong Wang	0627a6cb93	[js/web] fix package export for bundlers (#23257 ) ### Description This PR tries to fix #22615. (see detailed description in the issue) A perfect solution would be too difficult to make, because there are a huge number of combinations of usage scenarios, including combinations of development framework, bundler, dev/prod mode, and so on. This PR is using the following approach: - Introduce a new type of end-to-end test: export test. Tests of this type are complete web apps that use popular web development frameworks, and the tests use puppeteer to run the apps and check whether the apps can run without error. - Added one nextjs-based web app and one vite-based web app. - In the test, perform the following test steps: - `npm install` for packages built locally - `npm run dev` to start dev server and use puppeteer to launch the browser to test - `npm run build && npm run start` to test prod build and use puppeteer to launch the browser to test - Make changes to ort-web, including: - special handling of Webpack's behavior of rewriting `import.meta.url` to a `file://` string - revise build definitions - fix wasm URL for proxy, if used in a bundled build	2025-01-09 11:01:00 -08:00
Satya Kumar Jandhyala	d0c7438f5a	[JSEP/WebGPU] Add a fatal error message for unsupported GQA do_rotary attribute. (#23287 ) ### Description Added a fatal error message for the unsupported GroupQueryAttention do_rotary attribute. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/22987 Helps users understand that this attribute is not supported.	2025-01-09 08:52:17 -08:00
xhcao	a3833a5e79	[js/webgpu] validate transpose perm if specified (#23197 )	2025-01-01 15:58:54 -08:00
Enrico Galli	54edb43e77	[WebNN] Fixes MLTensor caching across different contexts (#23100 ) We weren't checking that MLTensors were from the same context before reusing them. Found while debugging microsoft/webnn-developer-preview#69	2024-12-17 12:51:16 -08:00
Yulong Wang	01539ee7ab	[js/webgpu] fix Conv2DMatMul shader's out-of-bound read (#23085 ) ### Description <!-- Describe your changes. --> Fix a bug caused by potential out-of-bound reads of `W` in the Conv2DMatMul shader. ### Motivation and Context Fixes #22983	2024-12-12 11:33:53 -08:00
Yulong Wang	1c79a4c9dd	[js/common] use TS type inference to eliminate `unknown` (#23012 ) ### Description This change uses a TypeScript trick to infer global types in onnxruntime-common. Thanks to the strong type system of TypeScript, we are able to refer to types that may not be available in the context. This keeps onnxruntime-common free of dependencies like "@webgpu/types" while still allowing it to use those types in its declarations. See the comments on `TryGetGlobalType` in `type-helper.ts`.	2024-12-04 19:01:26 -08:00
Xu Xing	c19617a24a	[js/webgpu] Add GatherND (#22847 )	2024-12-04 09:57:32 -08:00
Yulong Wang	06526af346	[js/webgpu] fix a bug in transpose shader (#22997 ) ### Description Fix a bug in transpose shader, when input/output rank is 1. ### Motivation and Context Fixes #22994	2024-12-03 20:21:08 -08:00
Wanming Lin	fe749a88a5	[WebNN EP] Fixed bug in usage of Array.reduce() (#22944 ) In JS, calling reduce on an empty array with no initial value throws an error. Fix it by checking the array length first.	2024-11-26 19:03:44 -08:00
Wanming Lin	8a06f13301	[WebNN] Remove wasm.currentContext check (#22886 ) If a WebNN session throws early, this check for `wasm.currentContext` will break all the following WebNN sessions; this often happens in npm tests.	2024-11-19 12:22:02 -06:00
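The `Array.reduce()` pitfall named in the commit above is easy to reproduce. This is a minimal sketch rather than the WebNN EP code: `productOfDims` is a hypothetical helper, and returning 1 for an empty array is an assumption made for the illustration.

```typescript
// Hypothetical helper (not from the commit): multiply all dimensions,
// checking the length first so reduce() is never called on an empty array
// without an initial value.
function productOfDims(dims: readonly number[]): number {
  if (dims.length === 0) {
    return 1; // assumption: an empty shape is treated as a single element
  }
  return dims.reduce((a, b) => a * b);
}

// The unguarded call throws a TypeError at runtime.
let unguardedThrew = false;
try {
  const empty: number[] = [];
  empty.reduce((a, b) => a * b);
} catch (e) {
  unguardedThrew = e instanceof TypeError;
}
```

Passing an initial value, as in `dims.reduce((a, b) => a * b, 1)`, avoids the exception as well; the length check mirrors the approach the commit message describes.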
Jiajia Qin	e597eaed4a	[js/webgpu] Optimize transpose as reshape when suitable (#22870 ) BUG #22031	2024-11-18 12:52:48 -08:00
Wanming Lin	82681205e4	[WebNN] Fix MLTensorUsage is undefined issue (#22831 ) `MLTensorUsage` has been removed from Chromium: https://chromium-review.googlesource.com/c/chromium/src/+/6015318, but we still need to make it compatible with old Chrome versions, so just make it `undefined` for latest Chrome version.	2024-11-13 20:22:22 -08:00
Xu Xing	ff57ac4f3d	[js/webgpu] Add scatterND (#22755 )	2024-11-13 09:13:00 -08:00
Jiajia Qin	7e0dd9d433	[js/webgpu] Optimize Expand (#22752 ) Use components = 4 if possible. llama3.2-1B becomes 20 tokens/s from 18 tokens/s on my iGPUs.	2024-11-12 12:37:19 -08:00
Jiajia Qin	05c8dc9d1c	[js/webgpu] Optimize ConvTranspose (#22774 ) BUG #22031 The overall time of ConvTranspose in Demucs model becomes 517.41 ms from 1415.65 ms on my iGPUs.	2024-11-12 12:37:07 -08:00
Wanming Lin	cdc8db9984	[WebNN] Fixed WebNN Module undefined issue (#22795 ) `Module.jsepRegisterMLConstant` will be shortened by Closure Compiler in the official release, which would cause an undefined error. Fix it by using `Module['jsepRegisterMLConstant']`.	2024-11-11 21:31:24 -08:00
xhcao	b5ee4ac760	[js/webgpu] support GridSample operator (#22652 )	2024-11-08 11:02:36 -08:00
jzm-intel	d9b91682f1	WebGPU JSEP: Make shader code not depend on input broadcasting patterns (#22536 ) This PR makes MatMul shaders depend only on the input ranks and the shapes provided in uniforms, not on the inputs' broadcasting pattern. This fixes the issue that shader code currently differs between broadcasting patterns but has an identical cache key, which results in wrong cache hits.	2024-11-08 11:00:51 -08:00
jzm-intel	6a295eb75b	[JS/WebGPU] Creating devices with subgroup features enabled if possible (#21833 ) This CL makes the WebGPU backend support subgroup features and thus allows using subgroup optimizations in the future. ### Description With this CL, WebGPU backends will create devices with the subgroups and subgroups-f16 features (both are under origin trial in Chrome) or the chromium-experimental-subgroups feature enabled whenever available. ### Motivation and Context This CL would allow WebGPU operator shaders to use subgroup optimizations in the future, which might yield a significant speedup.	2024-11-07 02:13:40 -08:00
Enrico Galli	1cb5ceedf3	[WebNN EP] Fix issues with MLTensor caching (#22701 ) This PR fixes a bug that occurs when searching for compatible `MLTensor` in the cache. We were missing checking the number of dimensions in the shape. This would mean that a cached buffer of shape `[1]` could match for `[1, 1, 256, 256]`. This PR also adds better handling when attempting to force an `MLTensor` to a different shape.	2024-11-06 09:17:11 -08:00
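The shape-matching bug described in the commit above can be sketched as follows. Both functions are hypothetical: the source only states that the number of dimensions was not checked, so `prefixOnlyMatch` is one plausible way such a comparison could go wrong, not the actual WebNN EP code.

```typescript
// Hypothetical buggy comparison: only the cached dimensions are visited, so a
// short cached shape matches any requested shape that starts the same way.
function prefixOnlyMatch(cached: readonly number[], requested: readonly number[]): boolean {
  return cached.every((dim, i) => dim === requested[i]);
}

// Comparison that also requires the same number of dimensions.
function sameShape(a: readonly number[], b: readonly number[]): boolean {
  return a.length === b.length && a.every((dim, i) => dim === b[i]);
}

// The example from the commit message: a cached [1] against [1, 1, 256, 256].
const cachedShape = [1];
const requestedShape = [1, 1, 256, 256];
const buggyResult = prefixOnlyMatch(cachedShape, requestedShape);
const fixedResult = sameShape(cachedShape, requestedShape);
```

With the rank check in place, the one-element buffer is no longer reused for a tensor that needs 65536 elements.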
Yang Gu	811231e418	[js/webgpu] Destroy staging buffers aggressively during weights uploading (#22726 ) In the current implementation, all the staging buffers for weights uploading are destroyed after the first batch of kernel execution. This requires a lot of memory, as none of the staging buffers can be reused. It also hurts the startup time (weights uploading only happens in session creation), as weights uploading is delayed to a very late time. This PR uses a very aggressive way to submit the queue and destroy staging buffers, so that the related GPU memory can be reused as much as possible, though the real situation depends on the WebGPU and driver implementation. The aggressive queue submission also moves GPU operations to a very early time, which helps the startup time. Some buffer uploading benchmarks were written to compare multiple solutions with regard to memory and time consumption. Benchmarks can be found at https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html, while detailed test data can be found at https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit. I also tested phi3.5 on 2 machines; first inference time improved from 5141ms to 3579ms and from 4327ms to 2947ms respectively.	2024-11-06 08:55:15 -08:00
Jiajia Qin	d5b2730ff8	[js/webgpu] Increase workgroupSize if only one workgroup is dispatched (#22709 ) #22031 For reduce-related ops, we should increase workgroupSize to improve parallelism if only one workgroup is dispatched. The total ReduceMean time becomes 8.98 ms from 77.79 ms on my iGPUs.	2024-11-05 13:13:52 -08:00
Jiajia Qin	64d8e25b4c	[js/webgpu] Optimize Gemm (#22706 ) BUG #22031 The total Gemm time in demucs model becomes 181.14 ms from over 1000 ms on my iGPUs.	2024-11-04 15:05:21 -08:00
Jiajia Qin	8fbbf2fd4f	[js/webgpu] Optimize MatMul with M = 1 (#22577 ) ### Description BUG #22031 In the demucs model, there are lots of MatMul ops with shapes like below: `input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32, output[0]: [3448,1,1536] | float32` We can see that for this kind of shape, the batch size is a big value, but M = 1. Our current algorithm partitions tiles based on [M, N], which is not efficient for this kind of shape. This PR reshapes the inputs to improve the matmul performance. Before: [3448,1,512] x [512,1536] = [3448,1,1536] After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536], then the output can be reshaped to [3448, 1, 1536] The overall MatMul time in demucs model becomes 1778.45 ms from 4418.17 ms on my iGPUs. --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2024-11-01 08:04:42 -07:00
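The reshape described in the commit above amounts to a small piece of shape arithmetic. The sketch below is an illustration under assumptions, not the PR's implementation: `foldBatchIntoM` and the `FoldedMatMul` interface are hypothetical names, and the function only handles the 3D-by-2D case shown in the commit message.

```typescript
// Hypothetical result type for the illustration.
interface FoldedMatMul {
  aDims: number[];              // A reshaped so the batch becomes M
  outputDims: number[];         // shape the kernel computes
  restoredOutputDims: number[]; // shape the caller expects
}

// When A is [batch, 1, K] and B is [K, N], move the batch into the M
// dimension so that tiles are partitioned over a large M instead of M = 1.
// Returns undefined when the pattern does not apply.
function foldBatchIntoM(aDims: readonly number[], bDims: readonly number[]): FoldedMatMul | undefined {
  if (aDims.length !== 3 || bDims.length !== 2) {
    return undefined;
  }
  const [batch, m, k] = aDims;
  const [kOfB, n] = bDims;
  if (m !== 1 || k !== kOfB) {
    return undefined;
  }
  return {
    aDims: [1, batch, k],
    outputDims: [1, batch, n],
    restoredOutputDims: [batch, 1, n],
  };
}

// The demucs shapes quoted in the commit message.
const folded = foldBatchIntoM([3448, 1, 512], [512, 1536]);
```

Because the element order in memory is unchanged, both the input and output reshapes are metadata-only; no data is moved.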
Wanming Lin	eb66bfa7b4	[WebNN] Convert MLOperand methods into readonly attributes (#22653 ) Adapt to spec change at https://github.com/webmachinelearning/webnn/pull/774	2024-10-30 17:54:49 -07:00
Enrico Galli	df236c7894	[WebNN EP] Add cache for `MLContext`s in the `WebNNBackend` (#22510 ) ### Description This change adds a cache of `MLContext`s keyed by their options to the `WebNNBackend`. This makes it so that multiple `InferenceSession`s created with the same options will share the same context. ### Motivation and Context Since `MLTensor`s are tied to `MLContext`s, developers can't easily share tensors between `InferenceSession`s (outside of manually creating an `MLContext` and specifying the `context` option). This leads to strange behaviors such as, ```js const sessionA = await ort.InferenceSession.create(urlA, { executionProviders: ["webnn"], preferredOutputLocation: "ml-buffer", }); const sessionB = await ort.InferenceSession.create(urlB, { executionProviders: ["webnn"], }); const temp = await sessionA.run({/* arguments */}); const result = await sessionB.run({"input":temp["output"]}); // ERROR: Failed to execute 'dispatch' on 'MLContext': Invalid inputs: The context of MLGraph doesn't match the context of the MLTensor with name "input". ``` We encountered this behavior when updating the transformers.js version in the developer preview demos. microsoft/webnn-developer-preview#46	2024-10-30 10:26:33 -07:00
Jiajia Qin	04e696d8e0	[js/webgpu] Optimize InstanceNorm in some shapes (#22637 ) BUG #22031 Optimize below two situations: 1. Increase workgroupSize if only one workgroup is dispatched. 2. Avoid transpose if not necessary. The overall time of demucs model becomes 106.36 ms from 154.60 ms on my dGPUs with this PR and PR #22577	2024-10-29 17:10:14 -07:00
Wanming Lin	008c9090b4	[WebNN] Support int4 and uint4 data types (#22575 )	2024-10-25 17:44:46 -07:00
Satya Kumar Jandhyala	4ed5bec2e7	[JS/WebGPU] Support WASM64 (#21836 ) ### Description Support wasm64 ### Motivation and Context Overcome memory limitations --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2024-10-24 20:21:51 -07:00
Prathik Rao	742594c8f0	Clears GPU Cache when there are no more active sessions (#22490 ) Fixes https://github.com/microsoft/onnxruntime/issues/21574	2024-10-23 22:22:57 -07:00
Satya Kumar Jandhyala	fd8ee4894d	[JS/WebGPU] GroupQueryAttention rewrite (#20946 ) ### Description Implement JSEP GroupQueryAttention ### Motivation and Context Required to enable certain LLM models to run using WebGPU.	2024-10-23 10:14:09 -07:00
Wanming Lin	33e2f6ad8d	[WebNN EP] Support external data (#22263 ) ### Description This PR introduces support for registering external data inside WebNN EP. ### Motivation and Context - The WebNN EP needs to register the initializers at the graph compilation stage. For initializers from external data, it can't leverage the general external data loader framework, because the graph compilation of WebNN EP is executed before the external data loader is called. - Exposes `utils::GetExternalDataInfo`, which is useful for WebNN EP to read the external tensor's information. - Defines a new `registerMLConstant` in JSEP to create WebNN constants from external data in the WebNN backend, with the tensor's info as parameters, as well as `Module.MountedFiles`, which holds all preloaded external files.	2024-10-23 08:18:16 -07:00
Wanming Lin	e6e94e6252	[WebNN EP] Use boolean flags instead of MLTensorUsage (#22497 ) Fixed #22495 We will keep MLTensorUsage until it is removed from Chromium. --------- Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>	2024-10-22 17:20:36 -07:00
Enrico Galli	1e5bda88f0	[WebNN EP] Cache MLTensors between runs (#22278 ) ### Description This change enables caching `MLTensor`s between inference runs. This is done by keeping a reference to `MLTensor`s alive after they have been released. `MLTensor`s are only destroyed once the session goes out of scope. ### Motivation and Context Creating and destroying `MLTensor`s on every run has a non-trivial performance penalty. This performance penalty materializes when using `ort.Tensors`[location=cpu] for inputs/outputs or when using the CPU EP as a fallback EP for unsupported operators. The former could be mitigated by developers using `ort.Tensors`[location=ml-tensor]. The latter cannot be mitigated by developers.	2024-10-18 08:07:00 -07:00
Wanming Lin	52b77762bd	[WebNN EP] Remove the numThreads option (#22464 ) Chromium has removed this option via https://chromium-review.googlesource.com/c/chromium/src/+/5905656.	2024-10-17 07:45:39 -07:00
Jiajia Qin	8159723ba7	[js/webgpu] Optimize matmulnbits (#22360 ) ### Description This PR further optimizes matmulnbits, especially for iGPUs. The phi3 demo becomes ~12 tokens/second from ~8 tokens/second on iGPUs. Some todos: 1. Make the optimization more general; remove the blockSize = 32 limitation. 2. Tune the parameters, such as workgroupSize and components size (currently only components = 1 is supported), to see the performance change.	2024-10-14 15:49:29 -07:00
Jiajia Qin	0409c639f7	[js/webgpu] Optimize MultiHeadAttention|Transpose (#22420 ) ### Description With this optimization, 96 MultiHeadAttention|Transpose ops in phi3 disappear. Phi3 becomes 113 tokens from 107 tokens on my dGPUs. The optimization mainly skips the transpose op if one of the transposed dims is 1. Reshape is enough.	2024-10-14 15:43:14 -07:00
Wanming Lin	39c8b3759f	[JS/WebGPU] Fixed bugs in inputs validation of Resize (#21955 ) - 'scales' and 'sizes' may be empty tensors; make sure each is a non-empty 1D tensor - Make sure that 'scales' and 'sizes', if present, have non-zero length	2024-10-04 18:29:53 -07:00
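The "reshape is enough" observation in the commit above can be checked with a small predicate. This is a sketch under assumptions, not the PR's code: `transposeIsReshape` is a hypothetical name, and it uses the general condition that a transpose leaves the memory layout unchanged whenever the axes with size greater than 1 keep their relative order.

```typescript
// Hypothetical predicate: returns true when transposing `shape` by `perm`
// does not move any data, i.e. the axes whose size is not 1 appear in `perm`
// in their original relative order. Size-1 axes can be moved freely.
function transposeIsReshape(shape: readonly number[], perm: readonly number[]): boolean {
  let lastAxis = -1;
  for (const axis of perm) {
    if (shape[axis] === 1) {
      continue; // a size-1 axis contributes nothing to the element order
    }
    if (axis < lastAxis) {
      return false; // two non-trivial axes were swapped: data must move
    }
    lastAxis = axis;
  }
  return true;
}

// Swapping axes 1 and 2 when axis 1 has size 1 is only a reshape.
const skippable = transposeIsReshape([1, 1, 32, 96], [0, 2, 1, 3]);

// The same permutation with a non-trivial axis 1 needs a real transpose.
const needsTranspose = !transposeIsReshape([1, 8, 32, 96], [0, 2, 1, 3]);
```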
Yang Gu	c75f4a09b7	[js/webgpu] Remove the limitation on axis in softmax (#22231 ) In current implementation, axis in softmax has to be the last, which is an obvious limitation. This PR removes this limitation and will fix issues #20710 and #22176.	2024-09-30 18:27:11 -07:00
Yulong Wang	1bda91fc57	[js/webgpu] fix external buffer registration (#22254 ) ### Description Fixes the problem of running into failure when GPU inputs shuffled between iterations.	2024-09-28 10:36:40 -07:00
Enrico Galli	52a8c1cae8	[WebNN EP] Enable IO Bindings with MLTensor (#21301 ) ### Description Enables using the MLTensor to pass data between models. ### Motivation and Context Using MLTensor instead of ArrayBuffers reduces the number of copies between the CPU and devices as well as the renderer and GPU process in Chromium.	2024-09-27 17:24:21 -07:00
Jiajia Qin	80e9df826e	[js/webgpu] Optimize InstanceNormalization (#21995 ) ### Description For InstanceNormalization, we have `y = scale * (x - mean) / sqrt(variance + epsilon) + B`, where mean and variance are computed per instance per channel. Calculating mean and variance per channel is a reduce operation, which is NCHW-layout friendly since it lets adjacent threads access contiguous data in GPU memory. This PR optimizes both NHWC and NCHW InstanceNormalization. To efficiently calculate the mean and variance, we need to make sure the input is NCHW instead of NHWC. Then we use shared memory to do the reduce operation to get `channel_scale` and `channel_shift`. With this PR, getting `channel_scale` and `channel_shift` is the same for NHWC and NCHW InstanceNormalization, and the overall performance becomes very close now. The data below comes from SD Turbo profiling results. Before (InstanceNormalization overall time: 140.84 ms): InstanceNormalization|InstanceNormComputeMean: 129.70 ms; InstanceNormalization|InstanceNormalizationNHWC: 10.55 ms; InstanceNormalization|InstanceNormComputeChannelScaleShift: 0.59 ms. After (InstanceNormalization overall time: 59.44 ms): InstanceNormalization|InstanceNormComputeChannelScaleShift: 28.57 ms; InstanceNormalization|TransposeShared: 20.19 ms; InstanceNormalization|InstanceNormalizationNHWC: 10.68 ms.	2024-09-23 11:32:09 -07:00
Xu Xing	afd642a194	[js/webgpu] Replace array with string in transpose perm (#21930 ) Perf test data(100000 times) Array: 12.599999997764826ms String: 1.6000000014901161ms Perf test case: ``` const permFunctionBodyArray = (rank: number, input: string): string => { const reverseFunc = []; reverseFunc.push(`fn perm(i: int) -> int { var a: int};`); for (let i = 0; i < rank; ++i) { reverseFunc.push(input); } reverseFunc.push('return a;}'); return reverseFunc.join('\n'); }; const permFunctionBodyString = (rank: number, input: string): string => { let reverseFunc= `fn perm(i: int}) -> int { var a: int;`; for (let i = 0; i < rank; ++i) { reverseFunc+=input; } reverseFunc+='return a;}'; return reverseFunc;//.join('\n'); }; const count = 100000; let start, end console.time('array'); start = performance.now(); for(let i =0 ; i < count; i ++) { permFunctionBodyArray(3, 'input'); } end = performance.now(); console.timeEnd('array'); console.log("Array: "+ (end-start)); console.time('string'); start = performance.now(); for(let i =0 ; i < count; i ++) { permFunctionBodyString(3, 'input'); } end = performance.now(); console.log("String: " +(end-start)); console.timeEnd('string'); ```	2024-09-16 23:17:46 -07:00
Yang Gu	2db6b734f5	[js/webgpu] Fix issue to run model demucs (#22074 ) This is to fix issue #22031 to run model demucs. For conv-transpose, outputPadding.length could be 1, while spatialRank is 2. The fix is to append enough 0s to outputPadding. For conv, the issue is similar. kernelShape.length sometimes could be 1, while inputs[1].dims.length is 4. The fix is also to append enough 0s to kernelShape.	2024-09-16 23:17:10 -07:00
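The zero-padding fix described in the commit above is small enough to sketch directly. `padWithZeros` is a hypothetical helper name, not the function used in the PR; it only shows the "append enough 0s" step for an attribute that is shorter than the spatial rank.

```typescript
// Hypothetical helper: return a copy of `values` with zeros appended until it
// has `targetLength` entries. Arrays that are already long enough are copied
// unchanged.
function padWithZeros(values: readonly number[], targetLength: number): number[] {
  const padded = values.slice();
  while (padded.length < targetLength) {
    padded.push(0);
  }
  return padded;
}

// The conv-transpose case from the commit: outputPadding has one entry while
// the spatial rank is 2.
const outputPadding = padWithZeros([1], 2);
```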
Yulong Wang	291a5352b2	[js/web] remove training release (#22103 ) ### Description Remove training from onnxruntime-web Following up of #22082	2024-09-16 10:56:22 -07:00
Jiajia Qin	3580e01348	[js/webgpu] Optimize grouped conv (#21892 ) ### Description #21618 This PR optimizes grouped conv by 1) more sequential memory access on the GPU and 2) reusing the input's data to reduce the number of global memory accesses. The `Conv|GroupedConv` op in [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) becomes 92 ms from 1058 ms on iGPUs with 32 EU. For the whole model on my iGPUs with 32 EU, the wav2vec2 model becomes 982ms from 1942 ms. The squeezebert-uncased model becomes 71.86ms from 431.77ms.	2024-09-04 17:16:35 -07:00

349 commits