Commit graph

113 commits

Author SHA1 Message Date
luoyu-intel
5f00bc9931
Integrate high-performance x64 gemm library to MLAS (#17669)
### Description
Improve MLAS to support high-performance x64 INT4 kernels



### Motivation and Context
1. improve LLM inference performance on Intel CPUs.
2. support more 4bit quantization types: nf4, fp4
3. support dynamic block size: block size aligned with kernel's tiling
size(e.g. 4 for VNNI kernel), per channel on N dimension
4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni,
amx_bf16, amx_int8, avx512_fp16
5. support MatMulNBits' data format

### Tasks
- [x] support block_size: 32, 128, -1(per channel)
- [x] get weight pack size without memory allocation
- [x] use ort's thread pool for parallelism
- [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8

### Benchmark
Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613
| 47401 | 12970
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
6347792 | 6317562 | 109
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
11814014 | 11757847 | 59
Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
50222 | 50031 | 13759
Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
2038222 | 2028743 | 341
Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
3792832 | 3774485 | 191
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
58717 | 58501 | 11467
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
1360846 | 1354598 | 543
Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
2564232 | 2551365 | 266
Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929
| 57694 | 12047
Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5495330 | 5465810 | 126
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10676240 | 10617817 | 66
Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
68305 | 68047 | 10026
Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5504862 | 5476215 | 126
Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
11758623 | 11697337 | 66
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
67713 | 67451 | 10298
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5508325 | 5480237 | 126
Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10738528 | 10681656 | 64
Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
60708 | 60486 | 11321
Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5523784 | 5495736 | 126
Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10829633 | 10772161 | 67


Reference:

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 |
13364
Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 |
6230335 | 110
Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 |
11632339 | 59

Win11+12900K 8 cores:
Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976
| 211295 | 2884
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time |
60960590 | 60937500 | 10
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time |
1.18E+08 | 1.19E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time |
470377 | 453059 | 1414
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time |
1.54E+08 | 1.53E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time |
3.18E+08 | 3.13E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time |
569072 | 559398 | 1229
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time |
1.54E+08 | 1.52E+08 | 4
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time |
3.22E+08 | 3.28E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time |
1486055 | 1473325 | 403
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time |
4.14E+08 | 4.14E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time |
8.88E+08 | 8.59E+08 | 1

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Co-authored-by: Mengni Wang <mengni.wang@intel.com>
2023-12-19 09:36:31 -08:00
junchao-loongson
4abec9749e
[mlas] add loongarch lsx and lasx optimize code (#17937)
### Description
Hello we(@lixing-star) are the developers of loongson team.

We add 128 (lsx), 256 (lasx) vector optimization code for the loongarch
architecture


[100% tests passed, 0 tests failed out of
7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true)

### Development Environments1
```
CPU: 
    Loongson-3C5000L
uname -a:  
    Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux

```
### LonngArch Documents
- [LoongArch Reference Manual - Volume 1: Basic Architecture: This
manual describes the basic part of the LoongArch
architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html)
- [LoongArch ELF psABI: This manual describes the LoongArch ELF
psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
-
[more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)
2023-12-07 11:15:59 -08:00
Edward Chen
0a4d76d98b
MLAS AArch64 quantized int4 Gemm kernel (#18031)
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.

- Connect MatMulNBits contrib op to MLAS function.
2023-11-15 09:31:54 -08:00
snadampal
d88d52eead
[aarch64] Remove mmla kernel support from apple (#18082)
### Description
<!-- Describe your changes. -->
The mmla kernels require additional ISA flags
and are currently supported only on Linux


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
more context is in https://github.com/microsoft/onnxruntime/pull/15270

cc: @skottmckay , @chenfucn , @snnn
2023-10-25 11:34:57 -07:00
snadampal
780ee186d7
[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160)
### Description
<!-- Describe your changes. -->
This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This
covers
(i) symmetric quantization (zero point is Zero)
(ii) asymmetric quantization (zero point is non zero)
(iii) per channel as well as per tensor quantization
(iv) Signed weights (U8S8 Gemm)
(v) Unsigned weights (U8U8 Gemm) and 
(vi) Signed activations and weights (S8S8 Gemm) scenarios

I've enabled the ummla/smmla kernels based on cpuinfo check for `I8MM`
support
MMLA QGEMM kernels are enabled for all the devices that support I8MM
instructions.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to improve INT8 quantized MatMul performance on aarch64
platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed up to
1.33x performance improvement compared to the optimized UDOT qgemm
kernel performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
I have also run the unit tests, and made sure all are passing

```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync 

```
2023-10-24 07:49:04 +10:00
MistEO
870b0bc305
Fix typo of cmake (#17715)
This caused a cmake configuration error.
2023-09-27 11:48:46 -07:00
Chen Fu
3c10f027de
4b quantization for weights of LLMs (#16833)
### Description
Blockwise 4b quantization for LLMs. 
1. Introduce 4b block-wise quantization for linear layer weights.
2. Implements matrix multiplication kernel for fp32 x int4
3. Implements special operator MatMulFpQ4
4. Implements quantization tool, that convert MatMul operator to
MatMulFpQ4, when the right hand side is 2D const tensor.


### Motivation and Context
Compress and accelerate LLMs

|Benchmark | Time(ns)|
|-------------|----------|
|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8| 218054|
|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8| 35830155|
|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8| 73479790|
|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8| 270152|
|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8| 35826721|
|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8| 73021200|
|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8| 213832|
|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8| 36749874|
|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8| 72618120|


|Benchmark | Time(ns)|
|-------------|----------|
|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8|   522610|
|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8| 39237689|
|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8| 75983467|

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-08-07 12:23:55 -07:00
Yi Zhang
2e214d6e27
Workaround to upgrade VS2022 for Windows ARM build (#16826)
### Description



### Motivation and Context
It should be reverted when VS2022 is upgraded to 17.7 or above.

### Vefication

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=331401&view=logs&j=7517abfd-115a-5c61-78a0-7ba3c9e3a88d
2023-07-25 08:35:52 +08:00
Dipanjan Sengupta
a461608409
Amx flag removal (#16527)
### Description
1. Replacing AMX intrinsics with machine code macros in QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
3. Fixing the link time optimization (LTO) issue introduced with asm
.include of an assembly file.

I have moved the AMX instruction macro definitions from
QgemmU8S8KernelAmxCommon.S to the amx_common.h to fix the LTO issue.
Note that I am also pushing the macros defined in
QgemmU8S8KernelAmxCommon.S for future reference.

A special thanks to @laxmansole who helped in the development of the
instruction macro definitions for AMX intrinsics and fixing the LTO
issue.

### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
GCC version to use the feature.These changes should allow the usage of
the AMX feature with just the CPU ID check.
2023-07-13 11:19:49 -07:00
Scott McKay
697dd12f6e
Re-organize the transpose optimization and layout transformation files. (#16246)
### Description
<!-- Describe your changes. -->
Split out the more basic changes from #15552 for easier review.

Re-organize to clarify the structure
- Separate out generic base functionality from ORT specific components
  - pass in handlers for internal ORT ops to Optimize
- Split out layout transformation from transpose optimization
- Separate out level 1 transpose optimizer
- Cleanup some naming to try and clarify things like an optimizer vs.
general optimization code

Most of the changes are from this movement of code.

Two implementation changes:
- the extended handlers are queried first in GetHandler
- allows the extended handlers to override the default behaviour for an
ONNX operator
- simplify the Optimize function to remove OptimizerMode. 
- `can_modify_node` is used instead of `mode` and
`ignore_assigned_nodes` and a long description of the current usage is
added. I don't _think_ that changes the current behavior and hopefully
clarifies what happens and when, and makes the base transpose optimizer
implementation more generic.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Create a cleaner separation to support adding EP specific logic next to
cleanly handle where an EP has additional layout sensitive behaviour
required (e.g. it's Resize implementation only handles one layout).
2023-07-07 08:24:47 +10:00
Chen Fu
5c125b4366
Cfu revertamx (#16455)
### Description

This is to revert two PRs that aim at reducing AMX toolchain
requirements. Unfortunately we still have some pipeline issues.

https://github.com/microsoft/onnxruntime/pull/16390
https://github.com/microsoft/onnxruntime/pull/16086

### Motivation and Context

Looks like gcc link time optimization does not work very well with
inline assembly in the above PRs.
2023-06-23 09:20:23 -07:00
Dipanjan Sengupta
35fa6af428
Fix for the build break in AMX feature on Mac OS. (#16390)
### Description
Fixing the build break issue in Apple pipeline due to AMX flag removal.
2023-06-16 21:00:41 -07:00
Dipanjan Sengupta
681a0d084d
Removing AMX build flag (#16086)
### Description
1. Replacing AMX intrinsics with machine code macro instructions in
QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.



### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
GCC version to use the feature.These changes should allow the usage of
the AMX feature with just the CPU ID check.
2023-06-15 11:22:59 -07:00
Changming Sun
0204594f90
Cleanup WASM cmake code (#15996)
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more nature.
For example,

```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
2023-05-20 18:07:39 -07:00
George Nash
f2889b41c1
[AMX] Update assembler check (#15501)
A recent commit added an assembler check if the ASM dialect was ATT

This unfortunately broke the AMX build for systems that don't have the
ASM-ATT dialect.

This change assumes if the CMAKE_ASM-ATT_COMPILER_ID is not found and
the CMAKE_ASM_COMPILER_ID is "GNU" based on all the other already passed
checks AMX is supported by the compiler and assembler.

### Description




### Motivation and Context
On my build system the recent change to add the ASM-ATT version check
disabled AMX code from the build.

---------

Signed-off-by: George Nash <george.nash@intel.com>
2023-04-19 14:16:26 -07:00
Yateng Hong
9bb4e4bef4
Fix masm flags (#15417)
### Description
Fix onnxruntime_mlas build failure with cmake 3.26. Updated CMAKE
generator expression to make sure certain complier flags only apply for
C/CXX compiler.

### Motivation and Context
CMake changed the behavior of ASM_MASM in version 3.26. See
https://gitlab.kitware.com/cmake/cmake/-/issues/24639.

This also fixed the issue of #15101
2023-04-07 10:20:03 -07:00
Chen Fu
605c2f4b89
Remove fp16 support from apple (#15270)
### Description

Removing fp16 support from apple build


### Motivation and Context
FP16 support on ARM64 only available after armv8.2a, thus the clang
compiler needs a compilation flag `-march=armv8.2-a+fp16`.
Unfortunately, our current universal build does not support hardware
specific compilation flags on cpp source files, as it would cause
trouble when compiling against more than one hardware target. Until we
figure out how to remove this limitation, had to disable fp16 support
for Apple systems.
2023-03-30 16:44:26 -07:00
Chen Fu
41ddcd30a1
Fp16 NHWC Max and Average Pooling (#15181)
### Description
Max and average pooling operators for fp16, NHWC 


### Motivation and Context
Continue on the steps for fp16 inference support
2023-03-28 08:22:55 -07:00
Jian Chen
527e006124
Update mlas (#15228)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-27 14:18:48 -07:00
JiCheng
126e7bf15f
[AMX] add assembler check (#15055)
### Description
<!-- Describe your changes. -->

AMX isn't supportted until assembler 2.40 even though the GCC frontend
supports it.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-22 07:57:22 +08:00
Chen Fu
34175f0b7c
FP16 conv (#15062)
### Description

Convolution for fp16 datatype. Use NHWC for computation. For NCHW input,
it rearranges the input tensor to NHWC format before computing the
result.

Support two optional fusion:

1. Activation
2. Add (not yet implemented)

### Motivation and Context

Accelerating fp16 inference
2023-03-21 10:32:43 -07:00
Jian Chen
6891ab5bac
fix_macos (#15018)
### Description
<!-- Describe your changes. -->
This fix macos packaging build on universal2 arch. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-14 21:54:44 -07:00
Chen Fu
acc2ac627f
Fp16 Activations (#14722)
### Description

NEON fp16 SIMD implementation of Activation functions


### Motivation and Context
Step 2 of fp16 SIMD support.

---------

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-28 17:20:40 -08:00
Chen Fu
733ca85b73
Cfu fp16 (#14538)
### Description
FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel,
and ARM64 NEON kernel.


### Motivation and Context
First step in creating native support of fp16 model inferencing on ARM64
and AMD64 platforms.

---------

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-15 12:51:53 -08:00
Chen Fu
90142899bd
Supporting Intel AMX instructions in quantized GEMM (#14042)
### Description
Using Intel AMX int8 instructions to accelerate quantized GEMM


### Motivation and Context
AMX instructions accelerate quantized GEMM significantly:

Prepacked B perf numbers (latency in ns)

GEMM Config | AVX512Vnni | AMX
-- | --: | --:
M:384/N:1024/K:1024/Batch:1/Threads:4 | 1057511 | 285393
M:384/N:1024/K:3072/Batch:1/Threads:4 | 2643929 | 700397
M:384/N:1024/K:4096/Batch:1/Threads:4 | 3784750 | 890701
M:384/N:4096/K:1024/Batch:1/Threads:4 | 2378139 | 887251
M:384/N:1024/K:1024/Batch:1/Threads:16 | 307137 | 138481
M:384/N:1024/K:3072/Batch:1/Threads:16 | 855730 | 295027
M:384/N:1024/K:4096/Batch:1/Threads:16 | 1126878 | 317395
M:384/N:4096/K:1024/Batch:1/Threads:16 | 781963 | 237014
M:1536/N:1024/K:1024/Batch:1/Threads:16 | 538864 | 181459
M:1536/N:1024/K:3072/Batch:1/Threads:16 | 1681002 | 561600
M:1536/N:1024/K:4096/Batch:1/Threads:16 | 2158127 | 717470
M:1536/N:4096/K:1024/Batch:1/Threads:16 | 2428622 | 896140
M:3072/N:1024/K:1024/Batch:1/Threads:16 | 1058029 | 357031
M:3072/N:1024/K:3072/Batch:1/Threads:16 | 3138504 | 1095857
M:3072/N:1024/K:4096/Batch:1/Threads:16 | 4155640 | 1386183
M:3072/N:4096/K:1024/Batch:1/Threads:16 | 4679030 | 1778624

Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-01-10 12:16:27 -08:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Jack·Boos·Yu
ea004e953f
[cmake] Export multi targets in static build (#11063)
* [cmake] Export multi targets in static build

* Install more components in static build, format some code

* Fix code pos
2022-04-03 22:37:18 -07:00
Chen Fu
dc72159105
Symmetric Quant indirect Conv kernel for ARMv8 A55 chip (#10862)
ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used as little cores in big.Little configurations. A55 has a narrower memory load/store hardware, where a 128b load instruction would block the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be duo issued with many other instructions.

This change adds a Symmetric Quant indirect Conv kernel for a55 micro-architecture, where we replace

ldr q4,[x1],

with

ldr d4,[x1],
ldr x11,[x1],
ins v4.d[1],x11

so that we can try to hide the memory load cycles behind computing cycles in the kernel.

With this new kernel, cartoongan model shows significant perf improvement on Pixel5a little cores (2 threads running on two little cores):

new kernel: 2188.59 ms
old kernel: 2360.61 ms
2022-03-25 17:10:47 -07:00
Chen Fu
50a6f095cd
Symmetric QGEMM kernel for ARMv8 A55 chip (#10754)
ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used as little cores in big.Little configurations. A55 has a narrower memory load/store hardware, where a 128b load instruction would block the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be duo issued with many other instructions.

This change adds a Symmetric QGEMM kernel for a55 micro-architecture, where we replace

ldr q4,[x1],#16

with

ldr d4,[x1],#8
ldr x11,[x1],#8
ins v4.d[1],x11

so that we can try to hide the memory load cycles behind computing cycles in the kernel.

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-03-07 08:41:13 -08:00
Maxiwell
43ff27c7c8
ppc64le: optimizing the MlasQuantizeLinear() with VSX (#10644)
This code is valid only when -mcpu is set to utilize POWER9 technology
or above. A compatible code for POWER8 was created as well, but it
was not tuned for performance.
2022-03-04 14:54:56 -08:00
RajalakshmiSR
5d8c5409ab
POWER10: QGEMM optimization (#10642)
* POWER10: QGEMM optimization

This patch makes use of POWER10 MMA feature for QGEMM function.
This optimization includes signed and unsigned cases.Tested and
there are no new failures with gcc11 and clang-14.

* Changes as per review comments

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2022-03-02 08:36:26 -08:00
Chen Fu
c4f1dfcfaa
Cfu s8s8 (#10413)
Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv.

Perf number with single thread:

Nokia G10 (baseline / new) in ms	Pixel 4 (baseline/new) in ms
mobilenet_edgetpu	220 / 213	18.5 / 17.6
cartoongan	8537 / 8521	967 / 928

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-01-28 09:26:52 -08:00
Chen Fu
2afce4830c
Symmetric QGEMM (#10289)
Adding code for symmetric quantized matrix multiplication. Used in quantized convolution, achieving significant perf gain.

TODO, use Symmetric Quantized GEMM in other operators!

TODO address activation buffer overread in custom allocators and tensors supplied by users.

DOT kernel perf test:

Pixel 5a:

Cartoongan	513.539 ms	471.786 ms
Efficient	57.5169 ms	56.4174 ms
Edgetpu	14.6673 ms	13.5959 ms
NEON kernel perf test

Pixel 3a

Cartoongan	1423.53 ms	1069.92 ms
Efficient	114.086 ms	107.968 ms
Edgetpu	39.2632 ms	36.9839 ms


Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-01-24 10:49:04 -08:00
Yufeng Li
7208fcbe1c
use wasmscalar as default kernel (#9988)
* use wasmscalar as default kernel
2022-01-03 10:55:08 -08:00
Changming Sun
4e9e01cb3c
Fix SDL warnings in CPU EP (#9975) 2021-12-19 20:54:29 -08:00
Chen Fu
cd0af7ad44
Symmetric quantized convolution kernel ARM64 (#9772)
Adding a symmetric quantized convolution kernel for ARM64

Note:
Indirect conv performs worse for shallow convs (input channels are small). This is much more so for low end pre-dot CPUs, where only 128 or deeper conv is faster with indirect conv. With DOT-CPUs, 32 deep conv is already faster

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2021-12-13 21:14:45 -08:00
Yi-Hong Lyu
f60a287a64
Add __x86.get_pc_thunk.bx to avoid dependency (#9955) 2021-12-08 04:50:41 -08:00
Yufeng Li
e613019174
add s8s8 support for quantized conv and gemm (#9902)
* add s8s8 support for quantized conv and gemm
2021-12-03 14:55:18 -08:00
RajalakshmiSR
8564fc1933
POWER10: Add optimized dgemm kernel (#9652)
* POWER10: Add optimized dgemm kernel

This patch makes use of POWER10 matrix multiply assist feature and
adds new DGEMM kernel.

* Indentation update

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2021-11-22 20:28:21 -08:00
Zhang Lei
8ef6aff734
Zhalei/dwqconv3x3 5x5 arm64 (#9714)
* Arm64 Depthwise Convolution 3x3.

* Add 5x5 intrinsic dwqconv for arm64

* rebase to master, remove no-need logic after arm64 convsym enabled.

* Some more adjustment on the instrunction pipeling.

* Add specific test cases.

* Fix test dimension too small.

* Fix build warning as error on some CI.

* better format, etc.
2021-11-18 13:57:16 -08:00
Chen Fu
1c84621020
Adding ARM64 depthwise convolution kernel for symmetric quantization (#9655)
Adding ARM64 depthwise convolution kernel for symmetric quantization

Motivation and Context
Two improvements against current kernel code :

1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication.
2. Unrolled loop with manual software pipelining

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2021-11-15 12:18:43 -08:00
RajalakshmiSR
c54ad0dd0b
POWER: Add Dgemm kernel for POWER processor (#9459)
* POWER: Add Dgemm kernel for POWER processor

This patch adds new dgemm kernel specific to POWER processor.

* POWER: Restrict new functions to VSX in header

* Remove warning check in header

* POWER: Dgemm Adjust indentation

Fixing indentation based on review comments.

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2021-10-26 20:27:24 -07:00
Yufeng Li
da3dd398c5
Kernels for QLinearConv with symmetrically quantized filter (#9323)
Add kernels for QLinearConv with symmetric quantized filter, e.g., filter type is int8 and zero point of filter is 0. This PR includes kernels for avx2, avxvnni, avx512 and avx 512 vnni. Will adds kernels for ARM64 in following PR.

Kernels uses direct input buffer directly for pointwise, and in-direct buffer for depthwise and non-group conv.

The advantages of those new kernels are:

no need to compute the sum of each pixel output image, and sum/offset of filter can be combined with bias.
with in-direct buffer, im2col returns an array of buffer pointers instead of memcpy'ing the original data. This saves memcpy time and reduces the size of the intermediate buffer needed to hold the im2col transform. In the future, will compute im2col ahead of time for input with fixed input size.
2021-10-18 19:40:18 -07:00
Yulong Wang
8c57d51928
support WebAssembly SIMD for qgemm (#9191)
* support WebAssembly SIMD for qgemm

* remove '--experimental-wasm-bulk-memory' for test
2021-09-30 12:40:56 -07:00
Tracy Sharpe
4828d2ebb1
MLAS: port aarch64 sgemv kernel to Windows ARM64 (#9071) 2021-09-15 18:40:40 -07:00
Rajalakshmi Srinivasaraghavan
e83cc534d4 Fix cmake POWER10 detection
Recent commit 60c98a8 changed variable mlas_common_srcs which affects
POWER10 detection.
2021-09-12 11:56:55 -07:00
Changming Sun
60c98a86b7
CMake file changes for macOS universal2 support (#8953) 2021-09-04 13:30:33 -07:00
Chen Fu
00b345eb7b
ARM Neon S8S8 kernel for QGemm (#8695)
Using signed int, qgemm kernel avoids extending uint8 to int16 while computing matrix multiplication, achieving higher performance. We also find that by using only lower 64b of vector registers to load A and B matrix, we can get further performance improvements. We also experimented with using ldp to load two 64b in one shot, vs using two ldr to load one 64b at a time, in both Big and little cores, there is no noticeable differences.

Submitting the LDP version. At this point we don't need to choose kernel based on micro-architecture.

Inference time of resnet50, thread count 2

Big Core on Pixel 3a
Current master: 292.947 ms
First iteration S8S8: 188.239 ms
LDP load two 64b reg: 178.715 ms
LDR load one 64b reg: 179.536 ms

Little Core
Master: 546.317 ms
S8S8: 513.332 ms
LDP: 489.19 ms
LDR: 497.865 ms

Raspberry Pi 3B+
Master: 660.08 ms
S8S8: 608.577 ms
LDP: 603.675 ms
LDR 602.075 ms
2021-08-18 09:58:47 -07:00
Tracy Sharpe
539d1d44c1
Optimize ARM64EC build (#8515)
Add sgemm and qgemm optimized kernels for ARM64EC configuration.
2021-07-27 23:46:39 -07:00
Tracy Sharpe
b2b9de939f
cleanup onnxruntime_mlas.cmake of old gcc workarounds (#8469) 2021-07-22 22:01:05 -07:00