onnxruntime/js/web/lib/wasm/jsep/webgpu/ops
Jiajia Qin 8f7b89bd5b
[js/webgpu] Optimize NCHW layout for InstanceNormalization (#18123)
### Description
The changes in this PR include:
1) Fix f16 errors in InstanceNormalization with NCHW format.
2) Use vec to further optimize the original algorithm.
3) (Removed) Don't do layout conversion for InstanceNormalization for
JSEP since InstanceNormalization itself is suitable for NCHW layout and
has better performance in our current implementation.
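
For reference, the computation being optimized can be sketched on the CPU as follows. This is a minimal illustration, not the WebGPU shader from this PR: the function name `instanceNormNCHW`, its parameters, and the default `epsilon` are assumptions introduced here. It shows that in NCHW each (n, c) slice is a contiguous run of H*W elements, which is one plausible reason the op suits this layout and lends itself to vectorized loads.

```typescript
// Reference instance normalization for a tensor stored in NCHW order.
// Hypothetical helper for illustration only; it is not the shader in instance-norm.ts.
function instanceNormNCHW(
  x: Float32Array,
  dims: [number, number, number, number], // [N, C, H, W]
  scale: Float32Array, // one value per channel, length C
  bias: Float32Array, // one value per channel, length C
  epsilon = 1e-5, // assumed default, chosen for this sketch
): Float32Array {
  const [n, c, h, w] = dims;
  const spatial = h * w;
  const y = new Float32Array(x.length);

  // Each (batch, channel) pair owns one contiguous block of `spatial` elements.
  for (let i = 0; i < n * c; i++) {
    const offset = i * spatial;
    const channel = i % c;

    // Mean over the H*W block.
    let sum = 0;
    for (let j = 0; j < spatial; j++) {
      sum += x[offset + j];
    }
    const mean = sum / spatial;

    // Biased variance over the same block.
    let squaredSum = 0;
    for (let j = 0; j < spatial; j++) {
      const d = x[offset + j] - mean;
      squaredSum += d * d;
    }
    const invStd = 1 / Math.sqrt(squaredSum / spatial + epsilon);

    // Normalize, then apply the per-channel scale and bias.
    for (let j = 0; j < spatial; j++) {
      y[offset + j] = (x[offset + j] - mean) * invStd * scale[channel] + bias[channel];
    }
  }
  return y;
}
```

In NHWC the same H*W elements of a channel would be strided by C, so the three inner loops would no longer walk contiguous memory.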

Tested on sd-vae-decoder-f16.onnx, the time drops from 314 ms to 285 ms. The
aggregate GPU profiling data is shown below (note that the data is
based on change 3):
Before:

Kernel | Time (ms) | Percentage (%)
-- | -- | --
Conv | 201.55 | 69.56
InstanceNormalization | 42.49 | 14.67
Transpose | 28.95 | 9.99
Mul | 5.69 | 1.96
Add | 3.82 | 1.32
MatMul | 3.27 | 1.13
Sigmoid | 2.24 | 0.77
Resize | 1.16 | 0.40
Softmax | 0.34 | 0.12
Cast | 0.24 | 0.08
Sum | 289.75 |

After:

Kernel | Time (ms) | Percentage (%)
-- | -- | --
Conv | 205.44 | 79.43
InstanceNormalization | 18.24 | 7.05
Transpose | 17.64 | 6.82
Mul | 5.69 | 2.20
Add | 3.81 | 1.47
MatMul | 3.56 | 1.38
Sigmoid | 2.24 | 0.86
Resize | 1.19 | 0.46
Softmax | 0.59 | 0.23
Cast | 0.24 | 0.09
Sum | 258.65 |

From the table above, we can see that the time of two ops is greatly
reduced: InstanceNormalization and Transpose. The Transpose time drops
because each InstanceNormalization is surrounded by two reshape ops in
sd-vae-decoder-f16.onnx. Because JSEP prefers NHWC and
InstanceNormalization is a layout-sensitive op, two extra transpose ops
are inserted dynamically when executing this model. After this change,
those inserted transpose ops are no longer needed, so the overall
Transpose time is reduced.
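
To make the cost of those inserted transposes concrete, here is a minimal CPU sketch of an NCHW to NHWC conversion and its inverse. The names `nchwToNhwc` and `nhwcToNchw` are hypothetical and this is not JSEP's transpose kernel; the point is only that each conversion touches every element of the tensor once, which is the work saved when the transposes are not inserted.

```typescript
type Dims4 = [number, number, number, number]; // [N, C, H, W]

// Copy an NCHW tensor into NHWC order (permutation [0, 2, 3, 1]).
// Hypothetical helper for illustration only.
function nchwToNhwc(x: Float32Array, dims: Dims4): Float32Array {
  const [n, c, h, w] = dims;
  const y = new Float32Array(x.length);
  for (let ni = 0; ni < n; ni++) {
    for (let ci = 0; ci < c; ci++) {
      for (let hi = 0; hi < h; hi++) {
        for (let wi = 0; wi < w; wi++) {
          const src = ((ni * c + ci) * h + hi) * w + wi; // NCHW offset
          const dst = ((ni * h + hi) * w + wi) * c + ci; // NHWC offset
          y[dst] = x[src];
        }
      }
    }
  }
  return y;
}

// Copy an NHWC tensor back into NCHW order (permutation [0, 3, 1, 2]).
// `dims` is still given as [N, C, H, W].
function nhwcToNchw(x: Float32Array, dims: Dims4): Float32Array {
  const [n, c, h, w] = dims;
  const y = new Float32Array(x.length);
  for (let ni = 0; ni < n; ni++) {
    for (let hi = 0; hi < h; hi++) {
      for (let wi = 0; wi < w; wi++) {
        for (let ci = 0; ci < c; ci++) {
          const src = ((ni * h + hi) * w + wi) * c + ci; // NHWC offset
          const dst = ((ni * c + ci) * h + hi) * w + wi; // NCHW offset
          y[dst] = x[src];
        }
      }
    }
  }
  return y;
}
```

A round trip such as `nhwcToNchw(nchwToNhwc(x, dims), dims)` returns the original data, so a pair of transposes around an op changes nothing except the time spent copying.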
2023-12-15 11:26:15 -08:00
..
3rd-party [js/webgpu] Provide a naive vectorized matmul algorithm (#18758) 2023-12-13 09:03:23 -08:00
argminmax.ts [JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727) 2023-12-12 11:12:23 -08:00
attention.ts [js/web] JSEP Attention & MultiHeadAttention (#17742) 2023-11-17 12:23:52 -08:00
batch-norm.ts [js/webgpu] Add BatchNormalization Op (#18468) 2023-11-22 15:58:06 -08:00
bias-add.ts [js/webgpu] revise uniform support (#17871) 2023-10-11 16:41:46 -07:00
bias-split-gelu.ts [JS/Web] Resize & BiasSplitGelu fp16 support (#18536) 2023-11-22 12:12:07 -08:00
binary-op.ts [js/webgpu] Optimize broadcast binary. (#18185) 2023-11-20 16:52:17 -08:00
common.ts [js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661) 2023-12-01 15:35:35 -08:00
concat.ts [js/webgpu] Add uniforms support to concat op (#18238) 2023-11-10 13:46:03 -08:00
conv-grouped.ts [js/webgpu] Fix conv2d with activation (#18388) 2023-11-10 12:54:35 -08:00
conv-transpose.ts [js/webgpu] Use the naive convTranspose when in/out channels are both 1 (#18658) 2023-12-04 13:18:37 -08:00
conv.ts [js/webgpu] Provide a naive vectorized matmul algorithm (#18758) 2023-12-13 09:03:23 -08:00
cumsum.ts [JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727) 2023-12-12 11:12:23 -08:00
einsum.ts [JS/Web] Add uniforms to Einsum (#18531) 2023-11-29 15:30:33 -08:00
expand.ts [js/webgpu] add bool type for Expand/Gather (#18615) 2023-11-30 15:47:08 -08:00
fuse-utils.ts [js/webgpu] Fix conv2d with activation (#18388) 2023-11-10 12:54:35 -08:00
gather-elements.ts [JS/Web] AddedUniforms in GatherElements. (#18670) 2023-12-05 09:19:53 -08:00
gather.ts [js/webgpu] add bool type for Expand/Gather (#18615) 2023-11-30 15:47:08 -08:00
gemm.ts [js/webgpu] revise uniform support (#17871) 2023-10-11 16:41:46 -07:00
instance-norm.ts [js/webgpu] Optimize NCHW layout for InstanceNormalization (#18123) 2023-12-15 11:26:15 -08:00
layer-norm.ts [js/web] FP16 LayerNorm, InstanceNorm, SkipLayerNorm (#17630) 2023-10-18 10:47:41 -07:00
matmul.ts [js/webgpu] Provide a naive vectorized matmul algorithm (#18758) 2023-12-13 09:03:23 -08:00
multi-head-attentiion.ts [js/web] JSEP Attention & MultiHeadAttention (#17742) 2023-11-17 12:23:52 -08:00
pad.ts [js/web] set noUnusedParameters to true and fix a few bugs (#18404) 2023-11-15 09:16:29 -08:00
pool.ts fix lint error (#18708) 2023-12-05 10:37:03 -08:00
range.ts [js/webgpu] revise uniform support (#17871) 2023-10-11 16:41:46 -07:00
reduce-shared.ts [js/web] set noUnusedParameters to true and fix a few bugs (#18404) 2023-11-15 09:16:29 -08:00
reduce.ts [JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727) 2023-12-12 11:12:23 -08:00
resize.ts [JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727) 2023-12-12 11:12:23 -08:00
skip-layer-norm.ts [js/web] FP16 LayerNorm, InstanceNorm, SkipLayerNorm (#17630) 2023-10-18 10:47:41 -07:00
slice.ts [JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727) 2023-12-12 11:12:23 -08:00
softmax.ts [js/webgpu] Support uniform for softmax (#18345) 2023-11-09 11:19:23 -08:00
split.ts [JS/Web] Added uniforms to Reduce, Resize and Split Ops. (#18727) 2023-12-12 11:12:23 -08:00
tile.ts [JS/WebGPU] Added uniforms to Tile and Where Ops (#18768) 2023-12-11 20:58:52 -08:00
transpose.ts [js/webgpu] Fix the transpose error when dims > 4D (#18027) 2023-10-23 11:02:19 -07:00
unary-op.ts [js/webgpu] Fix f16 errors in unary (#18839) 2023-12-15 11:25:12 -08:00
where.ts [JS/WebGPU] Added uniforms to Tile and Where Ops (#18768) 2023-12-11 20:58:52 -08:00