mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-05-28 22:56:32 +00:00
### Description

This PR provides a vectorized matmul algorithm. In most cases we still use the workgroup-memory-optimized matmul. However, when N and K are very small, the workgroup-optimized matmul cannot fully utilize the underlying hardware because of its 32x32 tile size. For very small N/K, we therefore switch to the naive vectorized matmul algorithm to improve execution unit utilization. With this PR, matmul with input0: [1, 36864, 3], input1: [1, 3, 3], input2: [3] drops from 4.34 ms to under 1 ms on Intel Gen9 GPUs.

| Name |
|---|
| binding |
| jsep |
| proxy-worker |
| proxy-messages.ts |
| proxy-wrapper.ts |
| run-options.ts |
| session-handler-inference.ts |
| session-handler-training.ts |
| session-options.ts |
| wasm-common.ts |
| wasm-core-impl.ts |
| wasm-factory.ts |
| wasm-training-core-impl.ts |
| wasm-utils.ts |
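
The kernel selection described in the commit message above can be illustrated with a small CPU-side sketch. This is not the code from the PR: `SMALL_DIM_THRESHOLD`, `chooseMatMulKind`, `naiveMatMul`, and `tileUtilization` are hypothetical names, and the threshold value is an assumption, since the description only says "very small N/K". The sketch shows the dispatch decision, a reference matmul with an optional bias (matching the three-input example), and a rough estimate of how much of a 32-wide tile does useful work when N is small.

```typescript
// Hypothetical cutoff; the commit message does not state the actual value.
const SMALL_DIM_THRESHOLD = 8;

type MatMulKind = 'naive-vectorized' | 'workgroup-tiled';

// Pick the naive path only when both N and K are very small (assumed rule).
function chooseMatMulKind(n: number, k: number): MatMulKind {
  return n <= SMALL_DIM_THRESHOLD && k <= SMALL_DIM_THRESHOLD
    ? 'naive-vectorized'
    : 'workgroup-tiled';
}

// CPU reference for out[M, N] = a[M, K] * b[K, N] + bias[N], row-major.
function naiveMatMul(
  a: Float32Array,
  b: Float32Array,
  bias: Float32Array | undefined,
  m: number,
  k: number,
  n: number,
): Float32Array {
  const out = new Float32Array(m * n);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let acc = bias ? bias[col] : 0;
      for (let i = 0; i < k; i++) {
        acc += a[row * k + i] * b[i * n + col];
      }
      out[row * n + col] = acc;
    }
  }
  return out;
}

// Fraction of tile columns that map to real output columns along N.
function tileUtilization(n: number, tile = 32): number {
  return n / (Math.ceil(n / tile) * tile);
}

// Shapes from the commit message: M = 36864, K = 3, N = 3.
console.log(chooseMatMulKind(3, 3));
console.log(tileUtilization(3));
```

Under these assumptions, N = 3 fills only 3 of the 32 columns in a tile, which is one way to read the under-utilization the description mentions.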