- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.
- Connect MatMulNBits contrib op to MLAS function.
### Description
<!-- Describe your changes. -->
The mmla kernels require additional ISA flags
and are currently supported only on Linux
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
more context is in https://github.com/microsoft/onnxruntime/pull/15270
cc: @skottmckay , @chenfucn , @snnn
### Description
<!-- Describe your changes. -->
This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This
covers
(i) symmetric quantization (zero point is Zero)
(ii) asymmetric quantization (zero point is non zero)
(iii) per channel as well as per tensor quantization
(iv) Signed weights (U8S8 Gemm)
(v) Unsigned weights (U8U8 Gemm) and
(vi) Signed activations and weights (S8S8 Gemm) scenarios
I've enabled the ummla/smmla kernels based on cpuinfo check for `I8MM`
support
MMLA QGEMM kernels are enabled for all the devices that support I8MM
instructions.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to improve INT8 quantized MatMul performance on aarch64
platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed up to
1.33x performance improvement compared to the optimized UDOT qgemm
kernel performance.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
I have also run the unit tests, and made sure all are passing
```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync
```
### Description
1. Replacing AMX intrinsics with machine code macros in QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
3. Fixing the link time optimization (LTO) issue introduced with asm
.include of an assembly file.
I have moved the AMX instruction macro definitions from
QgemmU8S8KernelAmxCommon.S to the amx_common.h to fix the LTO issue.
Note that I am also pushing the macros defined in
QgemmU8S8KernelAmxCommon.S for future reference.
A special thanks to @laxmansole who helped in the development of the
instruction macro definitions for AMX intrinsics and fixing the LTO
issue.
### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
GCC version to use the feature.These changes should allow the usage of
the AMX feature with just the CPU ID check.
### Description
<!-- Describe your changes. -->
Split out the more basic changes from #15552 for easier review.
Re-organize to clarify the structure
- Separate out generic base functionality from ORT specific components
- pass in handlers for internal ORT ops to Optimize
- Split out layout transformation from transpose optimization
- Separate out level 1 transpose optimizer
- Cleanup some naming to try and clarify things like an optimizer vs.
general optimization code
Most of the changes are from this movement of code.
Two implementation changes:
- the extended handlers are queried first in GetHandler
- allows the extended handlers to override the default behaviour for an
ONNX operator
- simplify the Optimize function to remove OptimizerMode.
- `can_modify_node` is used instead of `mode` and
`ignore_assigned_nodes` and a long description of the current usage is
added. I don't _think_ that changes the current behavior and hopefully
clarifies what happens and when, and makes the base transpose optimizer
implementation more generic.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Create a cleaner separation to support adding EP specific logic next to
cleanly handle where an EP has additional layout sensitive behaviour
required (e.g. it's Resize implementation only handles one layout).
### Description
1. Replacing AMX intrinsics with machine code macro instructions in
QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
GCC version to use the feature.These changes should allow the usage of
the AMX feature with just the CPU ID check.
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more nature.
For example,
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
A recent commit added an assembler check if the ASM dialect was ATT
This unfortunately broke the AMX build for systems that don't have the
ASM-ATT dialect.
This change assumes if the CMAKE_ASM-ATT_COMPILER_ID is not found and
the CMAKE_ASM_COMPILER_ID is "GNU" based on all the other already passed
checks AMX is supported by the compiler and assembler.
### Description
### Motivation and Context
On my build system the recent change to add the ASM-ATT version check
disabled AMX code from the build.
---------
Signed-off-by: George Nash <george.nash@intel.com>
### Description
Fix onnxruntime_mlas build failure with cmake 3.26. Updated CMAKE
generator expression to make sure certain complier flags only apply for
C/CXX compiler.
### Motivation and Context
CMake changed the behavior of ASM_MASM in version 3.26. See
https://gitlab.kitware.com/cmake/cmake/-/issues/24639.
This also fixed the issue of #15101
### Description
Removing fp16 support from apple build
### Motivation and Context
FP16 support on ARM64 only available after armv8.2a, thus the clang
compiler needs a compilation flag `-march=armv8.2-a+fp16`.
Unfortunately, our current universal build does not support hardware
specific compilation flags on cpp source files, as it would cause
trouble when compiling against more than one hardware target. Until we
figure out how to remove this limitation, had to disable fp16 support
for Apple systems.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
AMX isn't supportted until assembler 2.40 even though the GCC frontend
supports it.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Convolution for fp16 datatype. Use NHWC for computation. For NCHW input,
it rearranges the input tensor to NHWC format before computing the
result.
Support two optional fusion:
1. Activation
2. Add (not yet implemented)
### Motivation and Context
Accelerating fp16 inference
### Description
<!-- Describe your changes. -->
This fix macos packaging build on universal2 arch.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel,
and ARM64 NEON kernel.
### Motivation and Context
First step in creating native support of fp16 model inferencing on ARM64
and AMD64 platforms.
---------
Co-authored-by: Chen Fu <fuchen@microsoft.com>
ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used as little cores in big.Little configurations. A55 has a narrower memory load/store hardware, where a 128b load instruction would block the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be duo issued with many other instructions.
This change adds a Symmetric Quant indirect Conv kernel for a55 micro-architecture, where we replace
ldr q4,[x1],
with
ldr d4,[x1],
ldr x11,[x1],
ins v4.d[1],x11
so that we can try to hide the memory load cycles behind computing cycles in the kernel.
With this new kernel, cartoongan model shows significant perf improvement on Pixel5a little cores (2 threads running on two little cores):
new kernel: 2188.59 ms
old kernel: 2360.61 ms
ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used as little cores in big.Little configurations. A55 has a narrower memory load/store hardware, where a 128b load instruction would block the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be duo issued with many other instructions.
This change adds a Symmetric QGEMM kernel for a55 micro-architecture, where we replace
ldr q4,[x1],#16
with
ldr d4,[x1],#8
ldr x11,[x1],#8
ins v4.d[1],x11
so that we can try to hide the memory load cycles behind computing cycles in the kernel.
Co-authored-by: Chen Fu <fuchen@microsoft.com>
This code is valid only when -mcpu is set to utilize POWER9 technology
or above. A compatible code for POWER8 was created as well, but it
was not tuned for performance.
* POWER10: QGEMM optimization
This patch makes use of POWER10 MMA feature for QGEMM function.
This optimization includes signed and unsigned cases.Tested and
there are no new failures with gcc11 and clang-14.
* Changes as per review comments
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv.
Perf number with single thread:
Nokia G10 (baseline / new) in ms Pixel 4 (baseline/new) in ms
mobilenet_edgetpu 220 / 213 18.5 / 17.6
cartoongan 8537 / 8521 967 / 928
Co-authored-by: Chen Fu <fuchen@microsoft.com>
Adding code for symmetric quantized matrix multiplication. Used in quantized convolution, achieving significant perf gain.
TODO, use Symmetric Quantized GEMM in other operators!
TODO address activation buffer overread in custom allocators and tensors supplied by users.
DOT kernel perf test:
Pixel 5a:
Cartoongan 513.539 ms 471.786 ms
Efficient 57.5169 ms 56.4174 ms
Edgetpu 14.6673 ms 13.5959 ms
NEON kernel perf test
Pixel 3a
Cartoongan 1423.53 ms 1069.92 ms
Efficient 114.086 ms 107.968 ms
Edgetpu 39.2632 ms 36.9839 ms
Co-authored-by: Chen Fu <fuchen@microsoft.com>
Adding a symmetric quantized convolution kernel for ARM64
Note:
Indirect conv performs worse for shallow convs (input channels are small). This is much more so for low end pre-dot CPUs, where only 128 or deeper conv is faster with indirect conv. With DOT-CPUs, 32 deep conv is already faster
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* POWER10: Add optimized dgemm kernel
This patch makes use of POWER10 matrix multiply assist feature and
adds new DGEMM kernel.
* Indentation update
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* Arm64 Depthwise Convolution 3x3.
* Add 5x5 intrinsic dwqconv for arm64
* rebase to master, remove no-need logic after arm64 convsym enabled.
* Some more adjustment on the instrunction pipeling.
* Add specific test cases.
* Fix test dimension too small.
* Fix build warning as error on some CI.
* better format, etc.
Adding ARM64 depthwise convolution kernel for symmetric quantization
Motivation and Context
Two improvements against current kernel code :
1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication.
2. Unrolled loop with manual software pipelining
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* POWER: Add Dgemm kernel for POWER processor
This patch adds new dgemm kernel specific to POWER processor.
* POWER: Restrict new functions to VSX in header
* Remove warning check in header
* POWER: Dgemm Adjust indentation
Fixing indentation based on review comments.
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
Add kernels for QLinearConv with symmetric quantized filter, e.g., filter type is int8 and zero point of filter is 0. This PR includes kernels for avx2, avxvnni, avx512 and avx 512 vnni. Will adds kernels for ARM64 in following PR.
Kernels uses direct input buffer directly for pointwise, and in-direct buffer for depthwise and non-group conv.
The advantages of those new kernels are:
no need to compute the sum of each pixel output image, and sum/offset of filter can be combined with bias.
with in-direct buffer, im2col returns an array of buffer pointers instead of memcpy'ing the original data. This saves memcpy time and reduces the size of the intermediate buffer needed to hold the im2col transform. In the future, will compute im2col ahead of time for input with fixed input size.
Using signed int, qgemm kernel avoids extending uint8 to int16 while computing matrix multiplication, achieving higher performance. We also find that by using only lower 64b of vector registers to load A and B matrix, we can get further performance improvements. We also experimented with using ldp to load two 64b in one shot, vs using two ldr to load one 64b at a time, in both Big and little cores, there is no noticeable differences.
Submitting the LDP version. At this point we don't need to choose kernel based on micro-architecture.
Inference time of resnet50, thread count 2
Big Core on Pixel 3a
Current master: 292.947 ms
First iteration S8S8: 188.239 ms
LDP load two 64b reg: 178.715 ms
LDR load one 64b reg: 179.536 ms
Little Core
Master: 546.317 ms
S8S8: 513.332 ms
LDP: 489.19 ms
LDR: 497.865 ms
Raspberry Pi 3B+
Master: 660.08 ms
S8S8: 608.577 ms
LDP: 603.675 ms
LDR 602.075 ms