onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-07 00:13:17 +00:00

Author	SHA1	Message	Date
Arthur Islamov	d0519a7603	[js/web] BiasSplitGelu and BiasAdd kernels (#17161 ) ### Description Two contrib kernels that are supposed to speed up StableDiffusion according to this doc https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md However, there is no noticeable effect on speed or memory consumption. So I guess the only way to make it faster is to implement MultiHeadAttention, but I'm not capable of doing that right now. So I'll focus on existing PRs and on finding the JSEP kernel that produces incorrect results. It should be one of the old ones (I suspect Conv or ConvTranspose), as SD has not been generating images correctly on webgpu since I started working on it. I hoped someone else would fix that by the time I finished with kernels/optimizations 😅 --------- Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-10-03 12:20:20 -07:00
Yulong Wang	451c02543a	[js/webgpu] allow specify preferredLayout (#17756 ) ### Description Allow WebGPU backend to specify `preferredLayout`. Default is NHWC. ```js const options = {executionProviders: [{name:'webgpu', preferredLayout: 'NCHW'}]}; sess1 = await ort.InferenceSession.create('./mobilenetv2-12.onnx', options); ``` ### Motivation and Context - implement @qjia7's requirement for an easier way to do performance comparison between NCHW vs NHWC. - It's possible that NCHW does better on some models and NHWC on others. So offer user the capability to switch.	2023-10-02 21:25:12 -07:00
xhcao	0d60604638	[JS/WebGPU] support Range operator (#17233 ) The patch also introduces the method which copies data from GPU to CPU synchronously.	2023-09-30 02:05:32 -07:00
Arthur Islamov	a941dd583e	[js/web] FP16 Conv, ConvTranspose and MatMul (#17514 ) ### Description Another three ops for fp16 --------- Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-09-30 00:00:23 -07:00
Caroline Zhu	6a5f469d44	Add training interfaces to js/common (#17333 ) ### Description Following the design document: * Added CreateTrainingSessionHandler to the Backend interface * All existing Backend implementations throw an error for the new method createTrainingSessionHandler * Created TrainingSession namespace, interface, and TrainingSessionFactory interface * Created TrainingSessionImpl class implementation As methods are implemented, the TrainingSession interface will be added to or modified. ### Motivation and Context Adding the public-facing interfaces to the onnxruntime-common package is one of the first steps to support ORT training for web bindings. --------- Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com>	2023-09-29 19:05:10 -07:00
Yulong Wang	561aca97cf	[js/webgpu] support IO binding (#17480 ) <del> This PR is based on a few prerequisites PRs. They are listed as below: - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del> ### Description This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output so that a model inferencing can be done without unnecessary data copy between CPU and GPU for model input/output. ### Examples An E2E demo/example is being worked on. Following is some simple demo with code snippet. Let's first check today how we do: ```js // STEP.1 - create an inference session: const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] }); // STEP.2 - create model input: (supposing myImageCpuData is a Float32Array) const feeds = { 'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3]) }; // STEP.3 - run model const myResults = await mySession.run(feeds); // STEP.4 - get output data const myData = myResults['output_image:0'].data; // Float32Array ``` #### for inputs (GPU tensor): Now, with IO binding, you can create a tensor from a GPU buffer, and feed it to the model: ```js // new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data) const feeds = { 'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] }) }; ``` ### for outputs (pre-allocated GPU tensor) you can also do that for output, if you know the output shape: ```js // new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object) const fetches = { 'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] }) }; // new STEP.3 - run model with 
pre-allocated output (fetches) const myResults = await mySession.run(feeds, fetches); ``` ### for outputs (specify location) if you do not know the output shape, you can specify the output location when creating the session: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }); ``` if the model has multiple outputs, you can specify them separately: ```js // new STEP.1 - create an inference session with an option "preferredOutputLocation": const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: { "output_image:0": "gpu-buffer" } }); ``` now you don't need to prepare the `fetches` object and onnxruntime-web will prepare output data on the location specified. #### read data when you get the output tensor, you can: ```js // get the gpu buffer object: const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer // get the CPU data asynchronously const cpuData = await myOutputTensor.getData(); // get the CPU data asynchronously and release the underlying GPU resources const cpuData = await myOutputTensor.getData(true); // dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called. myOutputTensor.dispose(); ``` #### resource management JavaScript has GC so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC: - GPU buffers that are used in tensors - Underlying ORT native resources To simplify, most of the unmanaged resources are handled inside ORT web. But there are a few resources that users need to manage: - All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. Users should manage those GPU buffers themselves. 
- When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, the user needs to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers. - ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and are not recoverable. An exception is thrown in this situation.	2023-09-29 11:24:42 -07:00
satyajandhyala	b4fbc25b1f	[JS/Web] Add ConvTranspose implementation using MatMul (#17573 ) ### Description Add ConvTranspose implementation using MatMul to increase perf.	2023-09-29 11:00:44 -07:00
Jiajia Qin	891fba3b9c	[js/webgpu] Optimize Gather op (#17625 ) ### Description This PR optimizes the gather op, which saves ~6 ms in the segment anything model on ADL. The problem in the original algorithm is that it includes a for loop to calculate a block size of data. However, the block size may be very large, like `65536`. In a GPU shader, we should avoid large loops and instead use more threads to do the work in parallel. Before: ``` [profiling] kernel "41771992\|[Gather] 41771992" input[0]: [4,65536] \| float32, input[1]: [1] \| int64, output[0]: [1,65536] \| float32, execution time: 6886207 ns ``` After: ``` [profiling] kernel "41771992\|[Gather] 41771992" input[0]: [4,65536] \| float32, input[1]: [1] \| int64, output[0]: [1,65536] \| float32, execution time: 11719 ns ```	2023-09-21 21:00:36 -07:00
Jiajia Qin	cd3fb377ea	[js/webgpu] Allow binary ops with scalar to use the vectorize path (#17589 ) ### Description 1. For binary ops, the number of components is always 4. So the dispatchGroup should be `{x: Math.ceil(outputSize / 64 /* workgroup size */ / 4 /* component size */)}` instead of `{x: Math.ceil(outputSize / 64 /* workgroup size */ / (vectorize ? 4 : 1) /* vec size */)}`. 2. If either a or b has only one element, we can still use the vectorize path since the same value will be broadcast.	2023-09-21 20:55:08 -07:00
Arthur Islamov	498b60d8a4	[js/web] fp16 Pool & Reduce (#17512 ) ### Description Two more ops to support fp16	2023-09-21 14:52:13 -07:00
Vincent Wang	e6301eee6a	Bump Up Version to 1.17.0 (#17587 ) Bump up version to 1.17.0 as the 1.16.0 release branch had been branched out.	2023-09-20 11:02:58 +08:00
Arthur Islamov	0f406ca1d3	[js/web] FP16 binary and unary ops (#17515 ) ### Description Binary and unary ops with fp16 support	2023-09-18 15:43:32 -07:00
Yulong Wang	9aafbe3feb	[js/web] revise TensorView (#17473 ) ### Description This change: - removes the unused `Tensor` types declared in /js/web/lib/wasm/jsep/tensor.ts - removes duplicated util functions in /js/web/lib/wasm/jsep/tensor.ts - renames /js/web/lib/wasm/jsep/tensor.ts to /js/web/lib/wasm/jsep/tensor-view.ts and update corresponding references. It was kind of confusing that we have multiple `Tensor` types defined in different places also we have multiple `tensor.ts` source files. This is one of the prerequisites for supporting IO binding for WebGPU buffer in onnxruntime-web. list of prerequisites PRs: https://github.com/microsoft/onnxruntime/pull/17465 https://github.com/microsoft/onnxruntime/pull/17469 https://github.com/microsoft/onnxruntime/pull/17470 https://github.com/microsoft/onnxruntime/pull/17472 https://github.com/microsoft/onnxruntime/pull/17473 (this one)	2023-09-14 21:14:44 -07:00
Jiajia Qin	41d2ff622c	[js/webgpu] Optimize InstanceNormalization (#17491 ) ### Description In the previous implementation, one thread runs two loops over H * W elements to calculate the `mean` and `squaredNorm` values, and the same thread also outputs H * W elements. As a result, it is very slow when H * W is large, and H * W usually is large in a model. For example, in the `candy-8` model, the shapes of [H, W] are [224,224], [112,112], [56,56] for the `InstanceNormalization` op. On my ADL, `[1,224,224,32]` takes 17 ms. See below: ``` [profiling] kernel "23848328\|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] \| float32, input[1]: [32] \| float32, input[2]: [32] \| float32, output[0]: [1,224,224,32] \| float32, execution time: 17007914 ns ``` This PR uses workgroup memory to optimize the original algorithm. The advantage is that the 64 (workgroupSize) threads in one workgroup can calculate the `mean` and `squaredNorm` values in parallel. Meanwhile, each thread writes only `H * W / workgroupSize` outputs, which greatly reduces the per-thread overhead. With this optimization, `[1,224,224,32]` takes 3 ms, and the main overhead is the two extra `transpose` ops. The `createInstanceNormProgramInfo` only needs `0.64` ms. 
See below: ``` [profiling] kernel "23003600\|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] \| float32, output[0]: [1,32,224,224] \| float32, execution time: 1543792 ns program-manager.ts:115 [profiling] kernel "23003600\|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] \| float32, input[1]: [32] \| float32, input[2]: [32] \| float32, output[0]: [1,32,224,224] \| float32, execution time: 642652 ns program-manager.ts:115 [profiling] kernel "23003600\|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] \| float32, output[0]: [1,224,224,32] \| float32, execution time: 991608 ns ``` This PR currently applies the new algorithm only to the NCHW format. For the NHWC format, one option is to transpose the input so that it can use the new algorithm, but the disadvantage is that 2 extra transposes are added. @dakenf also suggests another way to optimize NHWC. For details, see [here](`d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts`). I checked @dakenf's method. The perf is similar to transpose + optimized NCHW, but depending on the GPU, one is slightly better than the other. So I prefer that this PR only does the NCHW part. @dakenf can submit his optimization for NHWC.	2023-09-14 17:03:18 -07:00
xhcao	198d468849	[WebGPU/JS] Added Pad operator support (#16928 )	2023-09-14 13:14:11 -07:00
Arthur Islamov	03b56f7a73	[js/webgpu] FP16 extension registration (#17493 ) ### Description First small change to support FP16 --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-09-13 13:11:17 -07:00
Yulong Wang	a2e75114cc	[js/web] add sessionOptions.freeDimensionOverrides (#17488 ) ### Description Allows specifying a fixed size for a dynamic input dimension of a model. Resolves #16707. Pending test	2023-09-13 09:17:34 -07:00
Yulong Wang	41584b2827	[js/web] ensure ORT initialization to run only once (#17529 ) ### Description ensure ORT initialization to run only once	2023-09-12 23:52:08 -07:00
Arthur Islamov	65249f42e4	[js/web] FP16 Gemm, Softmax & Transpose (#17494 ) ### Description First three OPs to support fp16. Will add more once this gets merged since others depend on changes in js_data_types	2023-09-11 21:09:37 -07:00
satyajandhyala	bf6d6961cc	[JS/Web] Added Einsum operator support. (#17401 ) ### Description Added Einsum operator support to JSEP.	2023-09-11 15:57:15 -07:00
xhcao	9017ea131b	[js/webgpu] support GreaterOrEqual and LessOrEqual operators (#17310 )	2023-09-07 17:41:16 -07:00
Jiajia Qin	5e747071be	[js/webgpu] Fix bug in conv2dByMatMul path (#17369 ) ### Description <!-- Describe your changes. --> For the conv2dByMatMul path, the simulated matmul output shape is the reshape of the original conv2d. So we should pass this information to `createMatmulProgramInfo` so that it can process it correctly.	2023-09-02 00:16:28 -07:00
Jiajia Qin	352b745deb	[js/webgpu] Add input/output shapes information to profiling (#17342 ) ### Description This PR is to enhance the profiling information. With the PR, the profiling result is like below: ``` [profiling] kernel "[Split] 51288384" input[0]: 1,256,64,64, output[0]: 1,256,64,64, execution time: 37135 ns program-manager.ts:114 [profiling] kernel "[Concat] 52361040" input[0]: 1,256,64,64, output[0]: 1,256,64,64, execution time: 50833 ns program-manager.ts:114 [profiling] kernel "[Transpose] 52375264" input[0]: 1,256,64,64, output[0]: 1,64,64,256, execution time: 99791 ns program-manager.ts:114 [profiling] kernel "[Sub] 51098472" input[0]: , input[1]: 1, output[0]: 1, execution time: 7448 ns program-manager.ts:114 [profiling] kernel "[Mul] 51344440" input[0]: 1, input[1]: 1,256,1,1, output[0]: 1,256,1,1, execution time: 8334 ns ``` Without this PR, the profiling result is like below: ``` [profiling] kernel "52097928\|[Split] 52097928" execution time: 37760 ns program-manager.ts:105 [profiling] kernel "41898328\|[Concat] 41898328" execution time: 51666 ns program-manager.ts:105 [profiling] kernel "41915648\|[Transpose] 41915648" execution time: 95416 ns program-manager.ts:105 [profiling] kernel "49757856\|[Sub] 49757856" execution time: 7969 ns program-manager.ts:105 [profiling] kernel "51680504\|[Mul] 51680504" execution time: 8906 ns ``` With the new information, we can easily know what kind of shape ops have poor performance. Also it can help us to check whether too small shape ops run on gpu.	2023-08-31 08:12:28 -07:00
Yulong Wang	e5ca3f3dcb	[js/api] introducing IO binding for tensor (#16452 ) ### Description This PR adds a few properties, methods and factories to the Tensor type to support the IO-binding feature. This allows users to create tensors from GPU/CPU bound data without forcing a transfer of data between CPU and GPU. This change is a way to resolve #15312 ### Change Summary 1. Add properties to `Tensor` type: a. `location`: indicates where the data is located. Valid values are `cpu`, `cpu-pinned`, `texture`, `gpu-buffer`. b. `texture`: sits alongside `data`, a readonly property of `WebGLTexture` type. Available only when `location === 'texture'` c. `gpuBuffer`: sits alongside `data`, a readonly property of `GPUBuffer` type. Available only when `location === 'gpu-buffer'` 2. Add methods to `Tensor` type (usually dealing with inference outputs): - async function `getData()` allows the user to download data from GPU to CPU manually. - function `dispose()` allows the user to release GPU resources manually. 3. Add factories for creating `Tensor` instances: a. `fromTexture()` to create a tensor bound to a WebGL texture b. `fromGpuBuffer()` to create a tensor bound to a WebGPU buffer c. `fromPinnedBuffer()` to create a tensor using a CPU pinned buffer ### Examples: create tensors from texture and pass to inference session as inputs ```js // when create session, specify we prefer 'image_output:0' to be stored on GPU as texture const session = await InferenceSession.create('./my_model.onnx', { executionProviders: [ 'webgl' ], preferredOutputLocation: { 'image_output:0': 'texture' } }); ... const myImageTexture = getTexture(); // user's function to get a texture const myFeeds = { input0: Tensor.fromTexture(myImageTexture, { width: 224, height: 224 }) }; // shape [1, 224, 224, 4], RGBA format. const results = await session.run(myFeeds); const myOutputTexture = results['image_output:0'].texture; ```	2023-08-29 12:58:26 -07:00
Jiajia Qin	fffefb1c22	[js/webgpu] Optimize matmul (#16969 ) ### Description Changes in this PR: 1) use the optimized version `makeMatMulPacked[Vec4]Source` to support matmul. 2) enable the conv2dByMatMul path. 3) support broadcast 4) use IndicesHelper. MatMul with M = 512, K = 512, N = 512 becomes 2ms from 15ms when enabling profilingMode on my ADL.	2023-08-29 12:40:57 -07:00
Caroline	228db24317	Add training API functions to WASM API (#16521 ) ### Description * Created `wasm/training_api` source and header files & modified WebAssembly CMake to include training flags * The `wasm/training_api` files use an `OrtTrainingManager` handle which is a struct of an OrtCheckpointState and an OrtTrainingSession, rather than creating a CheckpointState handle & a separate TrainingSession handle. * This is so that the TypeScript side only has to manage one handle that will be passed between TrainingSession & CheckpointState representations, rather than the TypeScript side managing separate CheckpointStateHandle and TrainingSessionHandle. ### Motivation and Context WASM API needs to be updated with ORT training API function calls so that ORT training web bindings can be added for on-device training. --------- Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: carzh <carolinezhu@microsoft.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-08-28 11:05:02 -07:00
Hariharan Seshadri	cbd97515cd	[JS/WebGPU] Support GatherElements kernel (#17243 ) ### Description As title ### Motivation and Context Improve WebGPU kernel coverage	2023-08-28 09:55:25 -07:00
Yulong Wang	bb1871332f	[js/webgpu] add kernel Not and Equal (#17306 ) ### Description This PR adds kernel implementations for the operators "Not" and "Equal". It also removes the download cache in the gpu data manager. Why remove the download cache: the following test case failed. ("Or" is on CPU, "Greater" and "Equal" are on JSEP) ![image](https://github.com/microsoft/onnxruntime/assets/7679871/8d9798ad-2703-4fb9-907e-ff716c67d0b2) After debugging, I found that both "Equal" and "Greater" are using the same output GPU Data ID. This is because when ORT executes the graph, it first runs "Equal", allowing its shader to write into GPU Data ID 2; then a Gpu2Cpu copy for it is issued (because currently "Or" is on CPU EP); at this point, ORT thinks GPU Data ID=2 is free to use, so it reuses it as the output for "Greater". This means there is no allocation for the output of the "Greater" kernel, and both kernels write to GPU Data ID=2. For the gpu data manager, there will be 2 downloads from the same GPU buffer. Previously I thought this was a waste of resources, so I cached the data. But now it shows that we need to perform 2 downloads because the GPU data is already different. The download data cache should be removed.	2023-08-27 19:50:17 -07:00
Yulong Wang	ddcd46174e	[js/webgpu] fix jsepOnRunEnd (#17300 ) ### Description fix jsepOnRunEnd: jsepOnRunEnd() needs to run after runPromise is resolved.	2023-08-26 00:30:28 -07:00
Jiajia Qin	873ef8b8f0	[js/webgpu] add label for some webgpu APIs (#17291 ) ### Description With the label, it is easier to identify which op causes the error. Without the label, the error message is like below: ``` Tint WGSL reader failure: :12:5 error: return statement type must match its function return type, returned 'vec4<f32>', expected 'f32' return W[i2o_W(indices)]; ^^^^^^ - While validating [ShaderModuleDescriptor] - While calling [Device].CreateShaderModule([ShaderModuleDescriptor]). ``` With the label, the error message is like below: ``` Tint WGSL reader failure: :12:5 error: return statement type must match its function return type, returned 'vec4<f32>', expected 'f32' return W[i2o_W(indices)]; ^^^^^^ - While validating [ShaderModuleDescriptor "ConvTranspose2D"] - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "ConvTranspose2D"]). ``` ### Motivation and Context This change is mainly for debugging. With this change, we can easily tell from the message above that the `ConvTranspose2D` shader has a problem.	2023-08-25 12:12:56 -07:00
xhcao	5e8d94cec8	[js/webgpu] support Greater and Less operators (#17296 )	2023-08-25 12:11:25 -07:00
Yulong Wang	79c4ed9a45	[js/webgpu] support error pop and kernel name (#17260 ) ### Description This PR contains changes to support error pop and kernel name. - Add a function `JsepGetNodeName` to allow JS to read the kernel name from C++ - When in debug mode ( `env.debug = true;` ) or in profiling mode ( `env.webgpu.profilingMode = 'default';` ), the kernel name will be read from ORT; otherwise the kernel pointer ( a number ) is used as the kernel name to save calls from JS to C++. - When in debug mode, WebGPU validation errors will be recorded and if any error occurs, `inferenceSession.run()` will fail (the Promise gets rejected). Behavior when not in debug mode is not changed. This is because recording errors is not zero-overhead, and GPU validation errors should occur consistently whether or not debug mode is on. - Add `jsepOnRunStart()` and `jsepOnRunEnd()` hooks to: - allow implementation of the features mentioned above. - pass session ID to backend.	2023-08-25 08:08:15 -07:00
satyajandhyala	da180b20fa	[JS/Web] Fix ConvTranspose shader code compilation errors. (#17232 ) ### Description Fix JSEP ConvTranspose shader code errors.	2023-08-25 06:25:54 -07:00
Yulong Wang	fb51faea64	[js/webgpu] fix 2 build breaks introduced in merge (#17273 ) ### Description fix 2 build breaks introduced in merge. Fixes web build	2023-08-23 18:09:50 -07:00
Yulong Wang	8b18d48c7c	[js/webgpu] make IndicesHelper implementation implicit (#17193 ) ### Description This change makes it no longer required to call indicesHelper.impl() in shader code.	2023-08-23 14:41:35 -07:00
Guenther Schmuelling	d3d3dde844	fix webgpu split (#17258 ) fix webgpu split for the case of split_sizes coming from input[1]	2023-08-22 16:49:22 -07:00
Yulong Wang	6fc3fd9ece	[js/webgpu] support Cast operator (#16489 ) ### Description support `Cast` operator for webgpu backend. Cast operator for webgpu backend currently only supports f32, u32, i32 and bool.	2023-08-18 23:51:03 -07:00
xhcao	dd3b2cefd6	[js/webgpu] Support int32 type for binary (#16901 ) ### Description Enable typed binary and support int32 type for binary. Co-authored-by: Xing Xu <xing.xu@intel.com> --------- Co-authored-by: Xing Xu <xing.xu@intel.com>	2023-08-18 12:19:01 -07:00
Hariharan Seshadri	a476dbf430	[JS/WebGPU] Support Tile operator (#17123 ) ### Description As title ### Motivation and Context Improve WebGPU op coverage	2023-08-18 10:07:21 -07:00
satyajandhyala	7d1a5635a0	[JS/Web] Added SkipLayerNormalization operator. (#17102 ) ### Description Add SkipLayerNormalization operator to JSEP.	2023-08-18 09:59:03 -07:00
Hariharan Seshadri	66df11769c	[JS/WebGPU] Expand operator fixes (#17137 )	2023-08-16 11:24:26 -07:00
satyajandhyala	89b682e3f3	[JS/Web] The bias input is optional, not required, for LayerNormalization operator (#17143 ) ### Description Fix a typo. LayerNormalization takes 2 or 3 inputs. The third input, bias, is optional.	2023-08-16 10:41:20 -07:00
Yulong Wang	133af1385c	[js/webgpu] update shader cache key to include input tensor datatype (#17176 ) ### Description update shader cache key to include input tensor datatype. and make the key a little bit easier to read	2023-08-16 09:14:19 -07:00
Guenther Schmuelling	8289e8b6ef	[js/webgpu] fix a few shader errors (#17171 ) Fix for segment anything decoder, reduceMax with rank1 and concat.	2023-08-15 21:14:20 -07:00
Arthur Islamov	ccf14e891e	[js/web] JSEP node assignment optimization (#17128 ) ### Description Since WebGPU supports only float32 and int32, having Gather, Reshape, Shape, Squeeze and Unsqueeze ops with other data types creates additional MemCpy ops and slows down the overall execution, as all other ops with other tensor types will be done on CPU. Before this patch, SD Unet had these numbers: Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1141 Node(s) placed on [JsExecutionProvider]. Number of nodes: 4025 memcpy tokens: 2001 After patch: Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1735 Node(s) placed on [JsExecutionProvider]. Number of nodes: 2243 memcpy tokens: 813 It also gives more than a 5X performance benefit: from 12 sec for one Unet step to 2.2 sec on RTX 3090 Ti, so we are almost getting to native performance. UPD: with the latest changes from the main branch and multi-threading, it went down to 1.6 sec. I will try re-exporting my model to onnx with maximum optimizations, like using MultiHeadAttention to decrease the node count. Maybe after implementing that it can run in less than 1 sec	2023-08-15 18:58:05 -07:00
xhcao	24e0bd37b4	[JS/WebGPU] Support Log operator (#17045 )	2023-08-14 18:04:12 -07:00
Yulong Wang	14a8315f10	[js/web] [webgpu] new incides helper (#16957 ) ### Description This PR introduces the new indices helper. IndicesHelper is a helper class for generating WGSL code for manipulating indices and data for a shader's input or output. This class is designed to offer a unified way to generate that WGSL code. The following is a list of terminologies used in this class: - `offset`: a uint32 value representing the offset of an element in the data buffer. - `indices`: an abstraction of a multi-dimensional array's indices representing the data's index on each dimension. - `value`: a value of a data element. Users are expected to create an instance of this class for each shader's input or output, and use the instance to generate WGSL code for manipulating indices and data. The following 2 exported functions are for users to call to create an instance of an indices helper: - `inputVariable()`: create an indices helper instance for an input. - `outputVariable()`: create an indices helper instance for an output. An indices helper instance contains helper functions for the following operations: - access readonly basic information, including: `name`(the name of the input or output), `usage`(whether it's an input or an output) and `shape`(the passed in shape). - `type`: access readonly type information, including: `indices`(the type of indices), `value`(the type of value at runtime), `storage`(the type of value at storage) and `tensor`(the tensor type as represented in TensorView). - generate WGSL code for converting between offset and indices. Use `offsetToIndices()` for a WGSL code snippet to calculate indices from offset, and use `indicesToOffset()` for a WGSL code snippet to calculate offset from indices. - to manipulate an instance of indices, use `setIndices()` and `getIndices()` to set and get the indices on an indices variable. 
- to manipulate data, use `set()`/`get()` to access data at the given indices from parameter list, use `setByIndices()`/`getByIndices()` to access data at the given indices from an indices variable, and use `setByOffset()`/`getByOffset()` to access data at the given offset. - `impl`: get WGSL code of function implementation for the util functions mentioned above. This change applies the usage of new IndicesHelper through the code, but not necessary for all code.	2023-08-11 11:36:59 -07:00
Zimon Tai	a3e02e8e2a	Fix Resize op input check (#16594 ) ### Description onnxjs contains a `Resize` op input check which has been outdated since opset 9. Currently `Resize` supports up to 4 inputs. This PR loosens the input check. ### Motivation and Context Fixes #15636	2023-08-09 15:42:30 -07:00
Arthur Islamov	c3f04251c7	[js/web] JSEP LayerNormalization and InstanceNormalizations kernels (#16830 ) ### Description Added two kernels for Layer and Instance norm Also added maximum limits for `maxBufferSize` when requesting GPU device as by default it's limited to 256mb and it fails allocating 600mb buffer while running fp32 StableDiffusion weights. ### Motivation and Context These two are used in StableDiffusion and many other networks	2023-08-08 09:09:37 -07:00
Jiajia Qin	9ea0a3129b	[js/webgpu] Make sure only storage buffers are reused (#16893 ) ### Description This PR makes sure that only storage buffers are reused. Previously, the query buffer might also be taken from the freeBuffers list if there was a matching size in it. But the two kinds of buffer have different usages, which results in errors.	2023-08-04 13:40:52 -07:00
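
Note on commit cd3fb377ea above: its dispatch-size formula can be sketched as a small standalone function. This is a minimal illustration only, assuming the values quoted in that commit message (workgroup size 64, 4 components per thread); the function and constant names are hypothetical and are not taken from the onnxruntime-web source.

```typescript
// Hypothetical helper illustrating the formula from commit cd3fb377ea.
// Names below are illustrative assumptions, not onnxruntime-web identifiers.
const WORKGROUP_SIZE = 64; // threads per workgroup, as quoted in the commit message
const COMPONENTS = 4;      // each thread handles one vec4 (4 elements)

// Number of workgroups to dispatch along x for a binary op whose output
// holds `outputSize` scalar elements.
function binaryOpDispatchGroupX(outputSize: number): number {
  return Math.ceil(outputSize / WORKGROUP_SIZE / COMPONENTS);
}

// One workgroup covers 64 * 4 = 256 elements, so 65536 elements need 256 workgroups.
console.log(binaryOpDispatchGroupX(65536)); // 256
```

Rounding up means a partially filled final workgroup is still dispatched; for example, 257 elements need two workgroups.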

155 commits