onnxruntime/cmake
snadampal 780ee186d7
[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160)
### Description
<!-- Describe your changes. -->
This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This
covers
(i) symmetric quantization (zero point is Zero)
(ii) asymmetric quantization (zero point is non zero)
(iii) per channel as well as per tensor quantization
(iv) Signed weights (U8S8 Gemm)
(v) Unsigned weights (U8U8 Gemm) and 
(vi) Signed activations and weights (S8S8 Gemm) scenarios

I've enabled the ummla/smmla kernels based on cpuinfo check for `I8MM`
support
MMLA QGEMM kernels are enabled for all the devices that support I8MM
instructions.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to improve INT8 quantized MatMul performance on aarch64
platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed up to
1.33x performance improvement compared to the optimized UDOT qgemm
kernel performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
I have also run the unit tests, and made sure all are passing

```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync 

```
2023-10-24 07:49:04 +10:00
..
external Update ONNX to 1.15.0rc1 (#17914) 2023-10-20 15:08:25 -07:00
patches ONNX 1.15 integration (#17125) 2023-09-26 14:44:48 -07:00
tensorboard
adjust_global_compile_flags.cmake Update cmake to 3.27 and upgrade Linux CUDA docker files from CentOS7 to UBI8 (#16856) 2023-09-05 18:12:10 -07:00
CMakeLists.txt Make CUDA a NHWC EP (#17200) 2023-10-16 10:16:37 -07:00
CMakeSettings.json
codeconv.runsettings
deps.txt Update ONNX to 1.15.0rc1 (#17914) 2023-10-20 15:08:25 -07:00
deps_update_and_upload.py [Linter] Bump ruff and remove pylint (#17797) 2023-10-05 21:07:33 -07:00
EnableVisualStudioCodeAnalysis.props
gdk_toolchain.cmake
Info.plist.in
libonnxruntime.pc.cmake.in
nuget_helpers.cmake
onnxruntime.cmake Add noexcep_operators to onnxruntime internal libraries (#17850) 2023-10-09 16:29:41 -07:00
onnxruntime_codegen_tvm.cmake
onnxruntime_common.cmake Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470) 2023-09-08 13:35:04 -07:00
onnxruntime_compile_triton_kernel.cmake [ROCm] Add ROCm Triton TunableOp for GroupNorm (#16196) 2023-07-11 13:55:30 +08:00
onnxruntime_config.h.in Enabling c++ 20 in MacOS build (#16187) 2023-09-26 11:27:02 -07:00
onnxruntime_csharp.cmake
onnxruntime_flatbuffers.cmake
onnxruntime_framework.cmake [C#, CPP] Introduce Float16/BFloat16 support and tests for C#, C++ (#16506) 2023-07-14 10:46:52 -07:00
onnxruntime_framework.natvis [C#, CPP] Introduce Float16/BFloat16 support and tests for C#, C++ (#16506) 2023-07-14 10:46:52 -07:00
onnxruntime_fuzz_test.cmake
onnxruntime_graph.cmake Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470) 2023-09-08 13:35:04 -07:00
onnxruntime_ios.toolchain.cmake
onnxruntime_java.cmake
onnxruntime_java_unittests.cmake
onnxruntime_kernel_explorer.cmake [ROCm] TunableOp: Update rocBLAS get_solutions API (since ROCm5.6) (#16657) 2023-07-13 11:20:26 +08:00
onnxruntime_language_interop_ops.cmake
onnxruntime_mlas.cmake [aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160) 2023-10-24 07:49:04 +10:00
onnxruntime_nodejs.cmake Added DML and CUDA provider support in onnxruntime-node (#16050) 2023-08-25 16:57:06 -07:00
onnxruntime_objectivec.cmake Objective C Training API: TrainingSession (#16374) 2023-06-28 09:13:56 -07:00
onnxruntime_opschema_lib.cmake
onnxruntime_optimizer.cmake Support inplace update for PythonOp/Grad (#17687) 2023-10-10 21:36:45 -07:00
onnxruntime_providers.cmake Add API for NPU Device Selection in the DML EP (#17612) 2023-10-11 14:53:00 -07:00
onnxruntime_providers_acl.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_armnn.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_azure.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_cann.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_coreml.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_cpu.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_cuda.cmake distributed slice (#17761) 2023-10-12 14:28:00 -07:00
onnxruntime_providers_dml.cmake Add API for NPU Device Selection in the DML EP (#17612) 2023-10-11 14:53:00 -07:00
onnxruntime_providers_dnnl.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_js.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_migraphx.cmake CUDA EP vs ROCM EP hipify audit (#17776) 2023-10-13 10:13:53 +08:00
onnxruntime_providers_nnapi.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_openvino.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_qnn.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_rknpu.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_rocm.cmake CUDA EP vs ROCM EP hipify audit (#17776) 2023-10-13 10:13:53 +08:00
onnxruntime_providers_tensorrt.cmake [TensorRT EP] Fix cmake install (#17923) 2023-10-16 09:16:24 -07:00
onnxruntime_providers_tvm.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_vitisai.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_webnn.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_providers_xnnpack.cmake Split onnxruntime_providers.cmake to multiple (#17853) 2023-10-09 20:33:44 -07:00
onnxruntime_pyop.cmake
onnxruntime_python.cmake Add LLaMA scripts (#17020) 2023-08-22 18:05:11 -07:00
onnxruntime_rocm_hipify.cmake Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972) 2023-10-17 09:23:52 -07:00
onnxruntime_session.cmake added support for cmake "find_package" (#8919) 2023-06-19 22:20:31 -07:00
onnxruntime_snpe_provider.cmake
onnxruntime_training.cmake Triton Codegen for ORTModule (#15831) 2023-07-13 18:17:58 +08:00
onnxruntime_unittests.cmake Make CUDA a NHWC EP (#17200) 2023-10-16 10:16:37 -07:00
onnxruntime_util.cmake
onnxruntime_webassembly.cmake Add training WASM generation to Web CI pipeline (#17319) 2023-09-08 15:49:47 -07:00
precompiled_header.cmake
Sdl.ruleset Add a Github workflow for Prefast (#15763) 2023-05-03 11:42:51 -07:00
set_winapi_family_desktop.h
target_delayload.cmake
uwp_stubs.h
wcos_rules_override.cmake
winml.cmake Rework WIL dependency retrieval/usage (#17130) 2023-08-15 09:11:46 -07:00
winml_cppwinrt.cmake
winml_sdk_helpers.cmake
winml_unittests.cmake Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470) 2023-09-08 13:35:04 -07:00