onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-25 02:50:42 +00:00

Jing Fang 13348c572a [ARM CPU] hgemm optimized for gqa (#23107)	2025-01-24 15:25:24 -08:00

### Description

Add fp16 kernels for GQA matmul on ARM CPU. The kernels are mlas hgemm for C = alpha * A x B' + beta * C

### Motivation and Context

Add fp16 support for GQA, speed up the operator and reduce memory usage.

__Token Generation__

| | HGEMM Runtime (ns) | SGEMM Runtime (ns) | Speed-up (%) |
|---------------------------------|--------------------|--------------------|--------------|
| M:1/N:4096/K:4096 | 251551 | 1775905 | 85.84 |
| M:1/N:11008/K:4096 | 892507 | 4649145 | 80.80 |
| M:1/N:4096/K:11008 | 866860 | 3240015 | 73.25 |
| M:1/N:11008/K:11008 | 2631615 | 8783877 | 70.04 |

__Prompting__

| | HGEMM Runtime (ns) | SGEMM Runtime (ns) | Speed-up (%) |
|---------------------------------|--------------------|--------------------|--------------|
| M:1024/N:4096/K:4096 | 90508701 | 111283029 | 18.67 |
| M:2048/N:4096/K:4096 | 181307522 | 240211107 | 24.52 |
| M:1024/N:11008/K:4096 | 241120234 | 307707933 | 21.64 |
| M:2048/N:11008/K:4096 | 481091232 | 648921367 | 25.86 |
| M:1024/N:4096/K:11008 | 241736343 | 310129880 | 22.05 |
| M:2048/N:4096/K:11008 | 480456703 | 644814999 | 25.49 |
| M:1024/N:11008/K:11008 | 642121440 | 847925766 | 24.27 |
| M:2048/N:11008/K:11008 | 1276097154 | 1731314509 | 26.29 |
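
As a rough illustration of the operation named in the commit message, the sketch below computes C = alpha * A x B' + beta * C in plain Python and recomputes one "Speed-up (%)" entry. The function names `gemm_bt` and `speedup_percent` are hypothetical and are not part of mlas. The speed-up formula, (SGEMM - HGEMM) / SGEMM, is an assumption inferred from the reported numbers rather than something the commit message states.

```python
def gemm_bt(alpha, a, b, beta, c):
    """Reference C = alpha * A x B' + beta * C on nested lists.

    a is M x K, b is N x K (so B' is K x N), c is M x N.
    Illustrative only: no fp16 arithmetic, no blocking, no SIMD.
    """
    m, k, n = len(a), len(a[0]), len(b)
    return [
        [
            alpha * sum(a[i][p] * b[j][p] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


def speedup_percent(hgemm_ns, sgemm_ns):
    """Assumed definition of the Speed-up (%) column: runtime saved vs SGEMM."""
    return (sgemm_ns - hgemm_ns) / sgemm_ns * 100.0


if __name__ == "__main__":
    # Tiny M=1, K=2, N=2 case, shaped like the token-generation rows (M = 1).
    print(gemm_bt(2.0, [[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]], 0.5, [[1.0, 1.0]]))
    # First token-generation row: M:1/N:4096/K:4096.
    print(round(speedup_percent(251551, 1775905), 2))
```

Under this assumed definition, the recomputed value for the first token-generation row matches the reported 85.84.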
..
common	Revert DML pipeline changes (#23135)	2024-12-18 10:42:10 -08:00
contrib_ops	Enable comprehension simplification in ruff rules (#23414)	2025-01-17 08:43:06 -08:00
cuda_host
custom_op_registration
debug_node_inputs_outputs
flatbuffers
framework	Implement pre-packed blobs serialization on disk and their memory mapping on load (#23069)	2024-12-20 10:49:08 -08:00
fuzzing	Use onnx_protobuf.h to suppress some GCC warnings (#23453)	2025-01-21 20:25:12 -08:00
global_thread_pools
ir
logging_apis
lora	Revert DML pipeline changes (#23135)	2024-12-18 10:42:10 -08:00
mlas	[ARM CPU] hgemm optimized for gqa (#23107)	2025-01-24 15:25:24 -08:00
onnx	Align AvgPool ceil_mode on last value to torch (#16752)	2025-01-23 17:35:11 -08:00
opaque_api
optimizer	Add QNN EP HTP shared memory allocator (#23136)	2025-01-14 11:09:50 -08:00
perftest	[MigraphX EP] [ROCm EP] Upstream ROCm changes for bugfixes and features (#23249)	2025-01-15 12:57:04 -08:00
platform	Update MACOSX_DEPLOYMENT_TARGET (#23308)	2025-01-10 14:25:32 -08:00
proto
providers	Align AvgPool ceil_mode on last value to torch (#16752)	2025-01-23 17:35:11 -08:00
python	Enable comprehension simplification in ruff rules (#23414)	2025-01-17 08:43:06 -08:00
qnn_ctx_gen	[QNN EP] Make QNN EP a shared library (#23120)	2025-01-22 12:11:00 -08:00
quantization
shared_lib	[QNN EP] Fix segfault when unregistering HTP shared memory handles (#23402)	2025-01-16 16:20:33 -08:00
testdata	Enable comprehension simplification in ruff rules (#23414)	2025-01-17 08:43:06 -08:00
unittest_main
util	Revert DML pipeline changes (#23135)	2024-12-18 10:42:10 -08:00
wasm	[WebGPU] allow build WebGPU EP for WebAssembly (#23364)	2025-01-16 10:52:17 -08:00
webgpu	Fix webgpu delay load test (#23157)	2024-12-20 13:37:12 -08:00
win_getopt
xctest
run_benchmark.py
run_benchmark.readme.md