onnxruntime/onnxruntime/test
Jing Fang 13348c572a
[ARM CPU] hgemm optimized for gqa (#23107)
### Description
Add fp16 kernels for GQA matmul on ARM CPU.
The kernels are MLAS HGEMM kernels that compute C = alpha * A x B' + beta * C.
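
To make the formula concrete, here is a minimal pure-Python reference for C = alpha * A x B' + beta * C. It is only a readability sketch, not the MLAS kernel: the function name `gemm_bt_reference` is hypothetical, B is assumed to be stored as N x K so that B' is K x N, and Python floats are used, so fp16 rounding is not modelled.

```python
def gemm_bt_reference(alpha, a, b, beta, c):
    """Return alpha * A x B' + beta * C as a new list of rows.

    Assumed layout: a is M x K, b is N x K (so B' is K x N), c is M x N.
    """
    m, k = len(a), len(a[0])
    n = len(b)
    assert all(len(row) == k for row in b), "B must be N x K"
    assert len(c) == m and all(len(row) == n for row in c), "C must be M x N"

    out = []
    for i in range(m):
        out_row = []
        for j in range(n):
            # Row i of A dotted with row j of B is entry (i, j) of A x B'.
            dot = sum(a[i][p] * b[j][p] for p in range(k))
            out_row.append(alpha * dot + beta * c[i][j])
        out.append(out_row)
    return out


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[2, 4], [6, 8]]
    print(gemm_bt_reference(2, A, B, 0.5, C))  # [[35.0, 48.0], [81.0, 110.0]]
```

Because each output entry is a dot product of two rows, both operands are read contiguously along K.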


### Motivation and Context
Add fp16 support for GQA, speed up the operator and reduce memory usage.

__Token Generation__
| Shape (M/N/K) | HGEMM Runtime (ns) | SGEMM Runtime (ns) | Speed-up (%) |
|---------------------------------|--------------------|--------------------|--------------|
| M:1/N:4096/K:4096 | 251551 | 1775905 | 85.84 |
| M:1/N:11008/K:4096 | 892507 | 4649145 | 80.80 |
| M:1/N:4096/K:11008 | 866860 | 3240015 | 73.25 |
| M:1/N:11008/K:11008 | 2631615 | 8783877 | 70.04 |

__Prompting__
| Shape (M/N/K) | HGEMM Runtime (ns) | SGEMM Runtime (ns) | Speed-up (%) |
|---------------------------------|--------------------|--------------------|--------------|
| M:1024/N:4096/K:4096 | 90508701 | 111283029 | 18.67 |
| M:2048/N:4096/K:4096 | 181307522 | 240211107 | 24.52 |
| M:1024/N:11008/K:4096 | 241120234 | 307707933 | 21.64 |
| M:2048/N:11008/K:4096 | 481091232 | 648921367 | 25.86 |
| M:1024/N:4096/K:11008 | 241736343 | 310129880 | 22.05 |
| M:2048/N:4096/K:11008 | 480456703 | 644814999 | 25.49 |
| M:1024/N:11008/K:11008 | 642121440 | 847925766 | 24.27 |
| M:2048/N:11008/K:11008 | 1276097154 | 1731314509 | 26.29 |
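
The "Speed-up (%)" column in both tables appears to be the runtime reduction relative to SGEMM, that is (SGEMM - HGEMM) / SGEMM * 100. The commit message does not state the formula, so this is an inference; it does reproduce the listed values. The helper names below are hypothetical.

```python
def speedup_percent(hgemm_ns, sgemm_ns):
    """Runtime reduction of HGEMM relative to SGEMM, in percent."""
    return (sgemm_ns - hgemm_ns) / sgemm_ns * 100.0


def speedup_ratio(hgemm_ns, sgemm_ns):
    """How many times faster HGEMM is than SGEMM."""
    return sgemm_ns / hgemm_ns


if __name__ == "__main__":
    # Token generation, M:1/N:4096/K:4096 (table lists 85.84).
    print(round(speedup_percent(251551, 1775905), 2))
    # Prompting, M:1024/N:4096/K:4096 (table lists 18.67).
    print(round(speedup_percent(90508701, 111283029), 2))
```

Read as a ratio, the first token-generation row is roughly a 7x speed-up, while the prompting rows are closer to 1.2x to 1.4x.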
2025-01-24 15:25:24 -08:00
..
common Revert DML pipeline changes (#23135) 2024-12-18 10:42:10 -08:00
contrib_ops Enable comprehension simplification in ruff rules (#23414) 2025-01-17 08:43:06 -08:00
cuda_host
custom_op_registration
debug_node_inputs_outputs
flatbuffers
framework Implement pre-packed blobs serialization on disk and their memory mapping on load (#23069) 2024-12-20 10:49:08 -08:00
fuzzing Use onnx_protobuf.h to suppress some GCC warnings (#23453) 2025-01-21 20:25:12 -08:00
global_thread_pools
ir
logging_apis
lora Revert DML pipeline changes (#23135) 2024-12-18 10:42:10 -08:00
mlas [ARM CPU] hgemm optimized for gqa (#23107) 2025-01-24 15:25:24 -08:00
onnx Align AvgPool ceil_mode on last value to torch (#16752) 2025-01-23 17:35:11 -08:00
opaque_api
optimizer Add QNN EP HTP shared memory allocator (#23136) 2025-01-14 11:09:50 -08:00
perftest [MigraphX EP] [ROCm EP] Upstream ROCm changes for bugfixes and features (#23249) 2025-01-15 12:57:04 -08:00
platform Update MACOSX_DEPLOYMENT_TARGET (#23308) 2025-01-10 14:25:32 -08:00
proto
providers Align AvgPool ceil_mode on last value to torch (#16752) 2025-01-23 17:35:11 -08:00
python Enable comprehension simplification in ruff rules (#23414) 2025-01-17 08:43:06 -08:00
qnn_ctx_gen [QNN EP] Make QNN EP a shared library (#23120) 2025-01-22 12:11:00 -08:00
quantization
shared_lib [QNN EP] Fix segfault when unregistering HTP shared memory handles (#23402) 2025-01-16 16:20:33 -08:00
testdata Enable comprehension simplification in ruff rules (#23414) 2025-01-17 08:43:06 -08:00
unittest_main
util Revert DML pipeline changes (#23135) 2024-12-18 10:42:10 -08:00
wasm [WebGPU] allow build WebGPU EP for WebAssembly (#23364) 2025-01-16 10:52:17 -08:00
webgpu fix webgpu delay load test (#23157) 2024-12-20 13:37:12 -08:00
win_getopt
xctest
run_benchmark.py
run_benchmark.readme.md