Mirror of https://github.com/saymrwulf/onnxruntime.git, synced 2026-06-25 02:50:42 +00:00.
BUG #23273

This PR makes the following optimizations:

1. When the number of output channels is one:
   1. Compute the offset before the input-channel loop, which reduces the number of index-to-offset calculations.
   2. Split `inputChannelsPerGroup` into `inputChannelsPerGroupInt` and `inputChannelsRemainder` parts, so that the `inputChannelsPerGroupInt` part can always be accessed 4 elements at a time.
2. Use a precise initial value for the loop to skip useless iterations. Thanks to @jiangzhaoming for the suggestions on this.

With this PR, ConvTranspose takes 3.7 s instead of 8.4 s on Intel Meteor Lake, and 1.6 s instead of 2.7 s on NVIDIA RTX 2000 Ada.

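The PR message describes the shader changes only in prose. As a rough illustration, here is a hypothetical CPU-side TypeScript sketch, not the actual WebGPU shader, of a 1-D ConvTranspose with a single output channel. It assumes channels-last input laid out as `[inLen][inChannels]`, weights laid out as `[kernel][inChannels]`, no dilation and no groups, and it reads "precise initial value" as starting the kernel loop at the first tap whose position is divisible by the stride. The names `ConvT1DParams`, `convTranspose1DNaive`, and `convTranspose1DOptimized` are made up for this example.

```typescript
// Hypothetical parameters for a 1-D ConvTranspose with one output channel.
// Assumed layouts: input [inLen][inChannels], weight [kernel][inChannels].
interface ConvT1DParams {
  inLen: number;
  inChannels: number;
  kernel: number;
  stride: number;
  pad: number; // leading padding, assumed >= 0
  outLen: number;
}

// Naive loop: visits every kernel tap, rejects taps that do not land on an
// input element, and recomputes offsets for every channel.
function convTranspose1DNaive(input: Float32Array, weight: Float32Array, p: ConvT1DParams): Float32Array {
  const out = new Float32Array(p.outLen);
  for (let x = 0; x < p.outLen; x++) {
    let sum = 0;
    for (let k = 0; k < p.kernel; k++) {
      const num = x + p.pad - k;
      if (num < 0 || num % p.stride !== 0) continue; // useless iteration
      const xi = num / p.stride;
      if (xi >= p.inLen) continue;
      for (let ic = 0; ic < p.inChannels; ic++) {
        sum += input[xi * p.inChannels + ic] * weight[k * p.inChannels + ic];
      }
    }
    out[x] = sum;
  }
  return out;
}

// Sketch of the three ideas: precise loop start, hoisted offsets,
// and a 4-wide channel loop followed by a scalar remainder loop.
function convTranspose1DOptimized(input: Float32Array, weight: Float32Array, p: ConvT1DParams): Float32Array {
  const out = new Float32Array(p.outLen);
  const c = p.inChannels;
  const cInt = Math.floor(c / 4) * 4; // channels covered by full groups of 4
  for (let x = 0; x < p.outLen; x++) {
    let sum = 0;
    // Only taps with (x + pad - k) divisible by stride contribute, so start
    // at the first such k and step by stride instead of testing every k.
    for (let k = (x + p.pad) % p.stride; k < p.kernel; k += p.stride) {
      const num = x + p.pad - k;
      if (num < 0) break; // num only decreases as k grows
      const xi = num / p.stride;
      if (xi >= p.inLen) continue;
      // Offsets computed once per tap, outside the channel loop.
      const inBase = xi * c;
      const wBase = k * c;
      let ic = 0;
      for (; ic < cInt; ic += 4) {
        sum += input[inBase + ic] * weight[wBase + ic]
          + input[inBase + ic + 1] * weight[wBase + ic + 1]
          + input[inBase + ic + 2] * weight[wBase + ic + 2]
          + input[inBase + ic + 3] * weight[wBase + ic + 3];
      }
      for (; ic < c; ic++) {
        sum += input[inBase + ic] * weight[wBase + ic]; // remainder channels
      }
    }
    out[x] = sum;
  }
  return out;
}
```

Both functions compute the same result; the second one never enters the iterations the first one rejects, and the 4-wide channel loop is where a shader would use `vec4` loads. With non-integer data the two outputs can differ in the last bits, because the summation order changes.
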

| Name | | |
|---|---|---|
| 3rd-party | ||
| argminmax.ts | ||
| attention.ts | ||
| batch-norm.ts | ||
| bias-add.ts | ||
| bias-split-gelu.ts | ||
| binary-op.ts | ||
| common.ts | ||
| concat.ts | ||
| conv-grouped.ts | ||
| conv-transpose.ts | ||
| conv.ts | ||
| cumsum.ts | ||
| depth-to-space.ts | ||
| einsum.ts | ||
| expand.ts | ||
| fast-gelu.ts | ||
| fuse-utils.ts | ||
| gather-block-quantized.ts | ||
| gather-elements.ts | ||
| gather-nd.ts | ||
| gather.ts | ||
| gemm.ts | ||
| grid-sample.ts | ||
| group-query-attention.ts | ||
| instance-norm.ts | ||
| layer-norm.ts | ||
| matmul-shaders.ts | ||
| matmul.ts | ||
| matmulnbits.ts | ||
| multihead-attention.ts | ||
| pad.ts | ||
| pool.ts | ||
| quantize-linear.ts | ||
| range.ts | ||
| reduce-shared.ts | ||
| reduce.ts | ||
| resize.ts | ||
| rotary-embedding.ts | ||
| scatter-nd.ts | ||
| skip-layer-norm.ts | ||
| slice.ts | ||
| softmax.ts | ||
| split.ts | ||
| tile.ts | ||
| transpose.ts | ||
| unary-op.ts | ||
| where.ts | ||