Commit graph

131 commits

Author SHA1 Message Date
Jing Fang
7fa69461fd
[ARM] MatMulNBits FP16 support - kernels only (#22806)
### Description
A break down PR of https://github.com/microsoft/onnxruntime/pull/22651
Add fp16 kernels.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-12 14:28:47 -08:00
Ranjit Ranjan
193671295e
[AIX] Fix for AIX build break (#22745)
### Description
With recent changes, below build error is found under AIX. 

```
ld: 0706-012 The -p flag is not recognized.
ld: 0706-012 The -a flag is not recognized.
ld: 0706-012 The -t flag is not recognized.
ld: 0706-012 The -h flag is not recognized.
ld: 0706-012 The -= flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -O flag is not recognized.
ld: 0706-027 The -R IGIN flag is ignored.

collect2: error: ld returned 255 exit status
```

### Motivation and Context
AIX linker doesn't support -rpath option , so blocking this option under
AIX.
2024-11-07 13:22:22 -08:00
Changming Sun
88676e62b9
Remove nsync (#20413)
### Description
1. Remove the onnxruntime::OrtMutex class and replace it with
~absl::Mutex~ std::mutex.
2. After this change, most source files will not include <Windows.h>
indirectly.


### Motivation and Context
To reduce the number of deps we have, and address some Github issues
that are related to build ONNX Runtime from source.
In PR #3000 , I added a custom implementation of std::mutex . It was
mainly because at that time std::mutex's default constructor was not
trivial on Windows. If you had such a mutex as a global var, it could
not be initialized at compile time. Then VC++ team fixed this issue.
Therefore we don't need this custom implementation anymore.

This PR also removes nsync. I ran several models tests on Linux. I
didn't see any perf difference.
This PR also reverts PR #21005 , which is no longer needed since conda
has updated its msvc runtime DLL.

This PR unblocks #22173 and resolves #22092 . We have a lot of open
issues with nsync. This PR can resolve all of them.
2024-10-21 15:32:14 -07:00
Jing Fang
1942e40e05
[ARM64] MatMulNBits: use neon instrinsics to convert between fp16 and fp32 (#22195)
### Description
For fp16 Atype, the fallback operation is convert the data to fp32 and
calculate.
Added neon intrinsics version to speed up the conversion.

Store address alignment and loop unrolling have insignificant impact on
latency so they are omitted.

|Benchmark | Time | CPU |

|--------------|---------------------------------------------|--------------------|
|M_ConvertF16ToF32/baseline/real_time | 1076961 ns | 1083398 ns |
|M_ConvertF16ToF32/aligned:0/real_time | 46785 ns | 46516 ns |
|M_ConvertF16ToF32/aligned:1/real_time | 46631 ns | 46391 ns |
|M_ConvertF16ToF32_unroll2/aligned:0/real_time | 44074 ns | 44392 ns |
|M_ConvertF16ToF32_unroll2/aligned:1/real_time | 44726 ns | 45226 ns |
|M_ConvertF32ToF16/baseline/real_time | 520109 ns | 527329 ns |
|M_ConvertF32ToF16/aligned:0/real_time | 73610 ns | 74015 ns |
|M_ConvertF32ToF16/aligned:1/real_time | 71557 ns | 71525 ns |
|M_ConvertF32ToF16_unroll2/aligned:0/real_time | 64227 ns | 63374 ns |
|M_ConvertF32ToF16_unroll2/aligned:1/real_time | 67428 ns | 67989 ns |



### Motivation and Context
speed up fallback implementation of Fp16 MatMulNBits
2024-09-26 13:55:40 -07:00
liqun Fu
a89bddd5c2
Matmul_nbits kernel for mlas sqnbits to support Fp16 inputs (#21807) 2024-09-13 14:55:08 -07:00
wangshuai09
d539c27de8
Fix version check for using -mavxvnni (#21616)
### Description
<!-- Describe your changes. -->
Change the `CMAKE_CXX_COMPILER_VERSION` greater than `11` for using
'-mavxvnni'.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->


`CMakeFiles/onnxruntime_mlas.dir/root/Git.d/onnxruntime/onnxruntime/core/mlas/lib/x86_64/QgemmU8S8KernelAvx2.S.o
cc: error: unrecognized command-line option ‘-mavxvnni’; did you mean
‘-mavx512vnni’?` using `gcc (GCC) 10.3.1`.

`-mavxnni` is supported since [GCC 11
Release](https://gcc.gnu.org/gcc-11/changes.html), this PR change the
version check.
2024-09-12 11:42:17 -07:00
Erick Muñoz
7489bfee53
Enable AVX NE CONVERT for FP16 to FP32 cast (#21183)
### Description
Implementation of a new cast assembly kernel that uses AVX_NE_CONVERT
instructions to accelerate casting from FP16 to FP32. Added CPUID checks
to determine support of the ISA.

### Motivation and Context
Currently FP16 models executed on systems that lack complete FP16
operator support use single precision on every node to run the model,
this means the original FP16 weights have to be casted to FP32 in order
to run the model properly, this change aims to accelerate the casting by
using upconvert instructions and therefore improve performance.
2024-09-09 21:19:31 -07:00
liqun Fu
b87e8edb98
Mlas int4 int8 with avx2/512 (#20687)
### Description
model: phi-3-mini-4k-instruct
avx2 symmetric
blklen|updated prompt tps | baseline prompt tps | prompt tps
change%|updated token gen tps | baseline token gen tps | token gen
change%
-|-|-|-|-|-|-
16 |49.5|70.0|-29.2%|9.6|10.8|-34.2%
32 |76.8|52.4|9.7%|15.2|14.6|4.1%
64 |78.2|71.4|9.5%|16.6|16.3|1.8%
128 |72.9|70.6|3.2%|17.1|16.8|1.7%
256 |83.7|63.6|31.6%|18.1|17.4|4%

avx2 asymmetric
blklen|updated prompt tps | baseline prompt tps | prompt tps
change%|updated token gen tps | baseline token gen tps | token gen
change%
-|-|-|-|-|-|-
16 |50.7|61.5|-17.5%|9.6|9.2|4.3%
32 |77.4|52.4|47.7%|14.6|13.9|5.0%
64 |78.7|63.0|24.9%|16.2|15.9|1.8%
128 |80.0|61.9|29.2%|17.2|16.9|1.7%
256 |81.5|63.3|28.7%|17.9|17.3|3.4%

avx2vnni symmetric
blklen|updated prompt tps | baseline prompt tps | prompt tps
change%|updated token gen tps | baseline token gen tps | token gen
change%
-|-|-|-|-|-|-
16 |82.9|117.0|-29.0%|15.9|19.3|-17.6%
32 |133.0|100.4|32.4%|26.1|24.5|6.5%
64 |166.9|118.8|40.4%|28.3|27.1|4.4%
128 |165.9|119.6|38.7%|29.3|28.5|2.8%
256 |165.2|119.6|38.1%|30.2|29.0|4.1%

avx2vnni asymmetric
blklen|updated prompt tps | baseline prompt tps | prompt tps
change%|updated token gen tps | baseline token gen tps | token gen
change%
-|-|-|-|-|-|-
16 |80.2|118.9|-32.5%|15.1|16.7|-9.5%
32 |130.7|99.7|31.0%|25.0|23.8|5.0%
64 |168.7|124.9|35.0%|27.3|26.8|1.8%
128 |169.6|123.8|36.9%|29.2|27.9|4.6%
256 |175.0|125.7|39.0%|30.0|29.7|1.0%

avx512 symmetric
blklen|updated prompt tps | baseline prompt tps | prompt tps
change%|updated token gen tps | baseline token gen tps | token gen
change%
-|-|-|-|-|-|-
16 |135.2|156.5|-13.6|25.5|23.8|7.1
32 |150.0|159.5|-5.9|34.9|29.6|17.9
64 |167.5|157.5|6.3|39.7|34.4|15.4
128 |177.8|158.0|12.5|40.3|35.4|13.8
256 |182.6|157.3|16.0|41.7|37.7|10.6

avx512 asymmetric
blklen|updated prompt tps | baseline prompt tps | prompt tps
change%|updated token gen tps | baseline token gen tps | token gen
change%
-|-|-|-|-|-|-
16 |136.1|151.4|-10.1%|26.1|19.9|31.1%
32 |150.0|157.8|-4.9%|34.3|29.3|17.0%
64 |165.7|156.6|5.8%|38.7|30.7|26.0%
128 |180.4|156.6|15.1%|40.2|34.7|15.8%
256 |181.3|158.0|14.7%|41.6|36.6|13.6%

avx512vnni symmetric
blklen|updated prompt tps | baseline prompt tps | prompt tps
change%|updated token gen tps | baseline token gen tps | token gen
change%
-|-|-|-|-|-|-
16 |143.4|155.4|-7.7%|25.6|23.3|9.8%
32 |159.2|157.0|1.4%|34.1|29.8|14.4%
64 |182.0|159.5|14.1%|38.4|34.8|10.3%
128 |221.2|160.8|37.5%|41.0|36.4|12.6%
256 |250.5|162.4|54.2%|41.6|37.7|10.3%

avx512vnni asymmetric
blklen|updated prompt tps | baseline prompt tps | prompt tps
change%|updated token gen tps | baseline token gen tps | token gen
change%
-|-|-|-|-|-|-
16 |142.5|152.3|-6.4%|26.3|19.7|33.5%
32 |158.2|155.0|2.0%|34.3|29.2|17.4%
64 |184.1|156.6|17.5%|38.3|30.9|23.9%
128 |215.8|156.1|17.5%|41.3|35.0|17.9%
256 |249.2|155.9|59.8%|41.1|36.3|13.2%


4bit gemm implementation with avx using tile.

1.
tile size is 2blk by 4. in case of size less then tile, it reduce to
1blk by 4, 2blk by 1 and lastly 1blk by 1.
with internal kernel, weight and activation are loaded based on SIMD
register width and blk length:
avx2 256bit register, 64 weights and activation are loaded.
   blklen16: 4 blks are computed by the internal kernel
   blklen32: 2 blks are computed by the internal kernel
   blklen64: 1 blk are computed by the internal kernel
   blklen128: 1 blks are computed 2 times by the internal kernel
   blklen16: 1 blks are computed 4 times by the internal kernel

avx512 512bit register, 128 weights and activation are loaded.
   blklen16: 8 blks are computed by the internal kernel
   blklen32: 4 blks are computed by the internal kernel
   blklen64: 2 blk are computed by the internal kernel
   blklen128: 1 blks are computed by the internal kernel
   blklen16: 1 blks are computed 2 times by the internal kernel

2.
blksum is precomputed during prepacking. 
computation is reformed:
Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b)
  Sum_blk is over one blk
  Sum1 is over all blks for one output
  Sum2 is over all blks for one output
Sum is computed with sgemm with the current implementation. Further
improvement is possible.

 

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>
2024-08-02 10:20:22 -07:00
Ranjit Ranjan
6c7562b097
Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. (#21133)
### Description
Enablement of onnxruntime for AIX and fixing issues related to
big-endian platform.

### Motivation and Context
changes in this PR contains:
1. Enablement code for building onnxruntime on AIX operating system.
2. while testing the build on AIX, we found issues related to big endian
platform . More details about few of those issues can be found in [Big
endian issue: Graph Transformation Attention Fusion tests are failing
#12921](https://github.com/microsoft/onnxruntime/issues/12921)

Below are list of files and the description about the change.
1.	cmake/CMakeLists.txt
[BUILDING on AIX issue] check for "IBMClang" is added for handling
-Wno-unused-parameter
2.	cmake/external/onnxruntime_external_deps.cmake
[BUILDING on AIX issue]Enabling gtest_disable_pthreads for AIX
3.	cmake/onnxruntime.cmake
[BUILDING on AIX issue]
o Blocking codes for AIX which generates generated_source.c and further
requires some symbol files.
o	Putting NO AIX check for non-supported linker flags like --Xlinker
o	iconv linking
4.	cmake/onnxruntime_framework.cmake
[BUILDING on AIX issue]Putting NO AIX check for -Wl,-rpath='$ORIGIN'
5.	cmake/onnxruntime_mlas.cmake
[BUILDING on AIX issue]POWER10 releated macro/function definition .
6.	cmake/onnxruntime_providers_cpu.cmake
[BUILDING on AIX issue]Putting NO AIX check for non-supported linker
flags like --Xlinker
7.	cmake/onnxruntime_unittests.cmake
[BUILDING on AIX issue]
o	Putting NO AIX check for non-supported linker flags like --Xlinker
o Adding required libraries for AIX linker under applicatiion like
onnxruntime_shared_lib_test ,onnxruntime_logging_apis etc
8.	cmake/patches/flatbuffers/flatbuffers.patch
[BUILDING on AIX issue] Handling of TypeCode in
include/flatbuffers/flatbuffers.h under AIX + clang
9.	onnxruntime/contrib_ops/cpu/murmur_hash3.cc
[Big endian issue] Byte-Conversion handlling in compute() and getblock()
routines
10.	onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc
[Big endian issue] Handling of test failures . Byte swapping for
quant_value.
11.	onnxruntime/core/framework/tensorprotoutils.cc
[Big endian issue]
Implementation of SetRawDataInTensorProto , ConvertRawDataInTensorProto
.
o SetRawDataInTensorProto : Wrapper for set_raw_data(). Calling
ConvertRawDataInTensorProto() in big-endian system
o ConvertRawDataInTensorProto : function used mainly on big-endian
system for byte-swapping of tensor raw_data
12.	onnxruntime/core/framework/tensorprotoutils.h
[Big endian issue]
Declaration of SetRawDataInTensorProto,  ConvertRawDataInTensorProto
13.	onnxruntime/core/graph/graph.cc
[Big endian issue]
 o	Call ConvertRawDataInTensorProto for SPARSE_TENSOR type
 o	Call ConvertRawDataInTensorProto for SaveToOrtFormat
14.	onnxruntime/core/mlas/lib/platform.cpp
[BUILDING on AIX issue] POWER10 released enablement for AIX
15.	onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp
[BUILDING on AIX issue]Handling of __vector under AIX+clang
16.	onnxruntime/core/mlas/lib/qgemm.h
[BUILDING on AIX issue] Adding _AIX flag
17.	onnxruntime/core/mlas/lib/qlmul.cpp
[BUILDING on AIX issue] Handling of __vector under AIX+clang
18.  onnxruntime/core/optimizer/attention_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
19.  onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
20.  onnxruntime/core/optimizer/constant_folding.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
21.  onnxruntime/core/optimizer/embed_layer_norm_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
22.  onnxruntime/core/optimizer/nchwc_transformer.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
23.  onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
24.  onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
25.  onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
26.
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
27.  onnxruntime/core/optimizer/reshape_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
28.  onnxruntime/core/optimizer/stft_decomposition.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
29.
onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
30.	onnxruntime/core/platform/path_lib.h
[BUILDING on AIX issue] Moving to normal function call, instead of
template
31.	onnxruntime/core/platform/posix/env.cc
[BUILDING on AIX issue]Blocking syscall.h in AIX
32.	onnxruntime/core/session/inference_session.cc
[Big endian issue] Removing ORT_RETURN_IF_NOT, FLATBUFFERS_LITTLEENDIAN
33.	onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc
[Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer
and ExternalWriteReadWithLoadInitializers
34.	onnxruntime/test/framework/sparse_kernels_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
35.	onnxruntime/test/framework/tensorutils_test.cc
[Big endian issue] Helper method ConvertEndianessForVector and call this
from required place.
36.	onnxruntime/test/framework/test_tensor_loader.cc
o.  [BUILDING on AIX issue] Handling of getcwd for AIX
o.  [Big endian issue]  Bytes Swapping in run_external_data_test
37.	onnxruntime/test/onnx/main.cc
[Big endian issue] including <thread> for AIX
38.	onnxruntime/test/onnx/tensorprotoutils.cc
[Big endian issue]  Bytes swapping in UnpackTensorWithRawData
39.	onnxruntime/test/optimizer/graph_transform_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
40.	onnxruntime/test/optimizer/graph_transform_test_builder.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
41.	onnxruntime/test/optimizer/graph_transform_test_builder.h
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
42.	onnxruntime/test/optimizer/initializer_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
43.	onnxruntime/test/optimizer/nchwc_optimizer_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
44.	onnxruntime/test/providers/base_tester.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
45.	onnxruntime/test/providers/cpu/generator/random_test.cc
[BUILDING on AIX issue]  Adding AIX check in MultinomialGoodCase

---------

Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>
2024-07-17 12:37:06 -07:00
Qingnan Duan
80b56feb41
Implement FlashAttention for CPU (#20805)
### Description
Implement [FlashAttention](https://arxiv.org/pdf/2205.14135) and
[FlashAttention-2](https://arxiv.org/pdf/2307.08691) for
MultiHeadAttention on CPU.


### Motivation and Context
Accelerate the execution of MultiHeadAttention.

Current performance: 10ms vs 16ms (com.microsoft.MultiHeadAttention) on
my Linux machine and 10ms vs 38ms (com.microsoft.MultiHeadAttention) on
my Windows machine. May need further optimizations.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Qingnan Duan <qiduan@microsoft.com>
2024-07-11 14:19:59 -07:00
Edward Chen
20cd3394fc
[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation (#21193)
Update AArch64 SQNBitGemm CompInt8 kernels to process matrix in tiles. E.g., computing the output in 2x2 tiles allows us to compute four elements of the output with one read of two rows of A and two columns of B.

Also moved some code around as it was getting big for a single file.
2024-07-10 15:39:26 -07:00
Edward Chen
a39f8862fd
SQNBitGemm - move workspace size calculation functions to hardware-specific implementations (#20757)
The workspace usage may be hardware-specific. Moving away from a common workspace size calculation allows more flexibility in the hardware-specific implementations.
2024-05-22 15:12:17 -07:00
liqun Fu
cc26b2dac2
Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels (#20163)
### Description

```
Avx2:
Int8

NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16: 	90.96			25.15			-72%					7.65				11.71			53%
Blklen32:	90.73			48.55			-46%					7.86				14.28			81%
Blklen64:	89.49			68.84			-23%					8.30				15.78			90%
Blklen128:	87.38			78.37			-10%					7.90				16.05			103%
Blklen256:	89.45			82.36			-7%					8.30				16.56			99%

Fp32		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	91.36			105.18		15%				7.57			9.52		25%
Blklen32:	89.30			105.99			18%					7.65				9.68			26%
Blklen64:	89.53			101.41			13%					7.97				9.84			23%
Blklen128:	85.23			99.71			16%					7.86				10.39			32%
Blklen256:	88.46			97.94			10%					8.32				10.23			22%

Avx512vnni:
Int8		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	132.18			21.56			-83%					10.34				11.48			11%
Blklen32:	168.28			43.69			-74%					11.85				14.73			24%
Blklen64:	201.81			60.29			-70%					12.36				15.47			25%
Blklen128:	194.92			57.04			-71%					13.03				14.67			12%
Blklen256:	218.76			70.20			-68%					13.33				16.31			22%

Fp32		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	102.81			92.74			-9%					8.41				9.18			9%
Blklen32:	109.49			97.08			-11%					8.83				11.51			30%
Blklen64:	104.13			101.57			-2%					9.32				12.00			28%
Blklen128:	108.45			103.69			-4%					9.58				12.45			29%
Blklen256:	109.43			106.43			-2%					9.19				12.2			32%

```

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
2024-04-25 21:30:50 -07:00
Yi-Hong Lyu
6b6a62fb40
Add vectorized AVX512F kernel for ReduceMaximumF32Kernel (#20268)
### Description
<!-- Describe your changes. -->

This commit introduces a new vectorized AVX512F kernel,
MlasReduceMaximumF32KernelAvx512F, which efficiently computes the
maximum value of the supplied buffer. Additionally, microbenchmarks have
been added for MlasComputeSoftmax (inplace),
MlasReduceMaximumF32KernelAvx, MlasComputeSumExpF32KernelAvx512F, and
MlasComputeSoftmaxOutputF32KernelAvx.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

The goal of this commit is to enhance the performance of
ReduceMaximumF32Kernel on CPUs with AVX512F instruction support.
  | AVX |   |   | AVX512 |   |   |  
-- | -- | -- | -- | -- | -- | -- | --
name | iterations | real_time | cpu_time | iterations | real_time |
cpu_time | time_unit
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:3/real_time | 271277304 |
2.58095 | 2.58091 | 263338132 | 2.65661 | 2.65661 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:3/real_time | 271220477 |
2.58095 | 2.58095 | 263509929 | 2.65652 | 2.65649 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:3/real_time | 271240587 |
2.58064 | 2.58064 | 263479542 | 2.65671 | 2.65665 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:3/real_time | 271227745 |
2.58083 | 2.58079 | 263402506 | 2.65657 | 2.65657 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:3/real_time | 271255069 |
2.58073 | 2.58071 | 263463858 | 2.65682 | 2.65682 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:3/real_time | 271257174 |
2.58058 | 2.58052 | 263460120 | 2.65682 | 2.65682 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:4/real_time | 174395051 |
4.01401 | 4.01401 | 197330481 | 3.5465 | 3.54636 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:4/real_time | 174645502 |
3.99691 | 3.99691 | 197474831 | 3.54298 | 3.54278 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:4/real_time | 174523308 |
4.01391 | 4.01386 | 197389981 | 3.54518 | 3.54506 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:4/real_time | 174779200 |
3.99874 | 3.99874 | 197519075 | 3.54227 | 3.54209 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:4/real_time | 174642874 |
4.00645 | 4.00641 | 197642101 | 3.54195 | 3.54188 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:4/real_time | 174546754 |
4.0061 | 4.00608 | 197621033 | 3.54296 | 3.54281 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:5/real_time | 162752651 |
4.30119 | 4.30114 | 215552503 | 3.24767 | 3.24752 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:5/real_time | 162717463 |
4.30123 | 4.30116 | 215541082 | 3.24711 | 3.24695 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:5/real_time | 162718819 |
4.3016 | 4.30153 | 215589239 | 3.24725 | 3.24708 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:5/real_time | 162719596 |
4.30151 | 4.30145 | 215563846 | 3.24956 | 3.24949 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:5/real_time | 162753333 |
4.30125 | 4.30125 | 215537315 | 3.24924 | 3.24908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:5/real_time | 162752258 |
4.3014 | 4.30141 | 215526482 | 3.24744 | 3.24735 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:7/real_time | 143579660 |
4.87526 | 4.87516 | 100000000 | 5.25767 | 5.25752 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:7/real_time | 143585097 |
4.87476 | 4.87467 | 100000000 | 5.41583 | 5.41567 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:7/real_time | 143571011 |
4.87506 | 4.87503 | 182359467 | 3.83773 | 3.83764 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:7/real_time | 143587142 |
4.87487 | 4.8748 | 182397261 | 3.83807 | 3.8379 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:7/real_time | 143578465 |
4.87525 | 4.87521 | 182428602 | 3.83777 | 3.83768 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:7/real_time | 143588555 |
4.87491 | 4.87488 | 125280452 | 5.59791 | 5.59766 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:9/real_time | 284851058 |
2.43476 | 2.43476 | 156879863 | 4.42895 | 4.42884 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:9/real_time | 270700898 |
2.59031 | 2.59024 | 157953114 | 4.42995 | 4.42968 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:9/real_time | 282871172 |
2.45385 | 2.45385 | 157801156 | 4.42817 | 4.42804 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:9/real_time | 285307738 |
2.47009 | 2.47005 | 158058507 | 4.4279 | 4.42786 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:9/real_time | 285709536 |
2.45481 | 2.45476 | 158070961 | 4.42809 | 4.42799 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:9/real_time | 285449733 |
2.47495 | 2.47491 | 158069718 | 4.45026 | 4.45017 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:11/real_time | 189213618 |
3.79684 | 3.79676 | 139459497 | 5.01882 | 5.01871 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:11/real_time | 185600468 |
3.76394 | 3.76376 | 139444892 | 5.01922 | 5.01905 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:11/real_time | 184968668 |
3.80636 | 3.80636 | 139470834 | 5.01948 | 5.01936 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:11/real_time | 183867226 |
3.80432 | 3.80427 | 139481986 | 5.01975 | 5.01944 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:11/real_time | 184301650 |
3.81634 | 3.81634 | 139452846 | 5.01983 | 5.01972 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:11/real_time | 186215795 |
3.82659 | 3.82654 | 139497736 | 5.02119 | 5.02113 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:13/real_time | 135622415 |
5.16256 | 5.16252 | 124661337 | 5.61227 | 5.61194 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:13/real_time | 135618907 |
5.15967 | 5.1596 | 124805224 | 5.6088 | 5.60854 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:13/real_time | 135612192 |
5.15506 | 5.15501 | 124803221 | 5.60901 | 5.60869 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:13/real_time | 135906082 |
5.15818 | 5.15818 | 124776601 | 5.60898 | 5.60886 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:13/real_time | 135369523 |
5.15709 | 5.15682 | 124790370 | 5.60927 | 5.60902 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:13/real_time | 135596827 |
5.1603 | 5.1603 | 124792145 | 5.61637 | 5.61614 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:15/real_time | 110947137 |
5.96511 | 5.96495 | 112861522 | 6.20035 | 6.20014 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:15/real_time | 118004792 |
6.22645 | 6.22628 | 112909900 | 6.20073 | 6.20073 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:15/real_time | 112630319 |
6.25564 | 6.25552 | 112874563 | 6.19932 | 6.19924 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:15/real_time | 117403034 |
6.17263 | 6.17258 | 112927318 | 6.19866 | 6.19842 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:15/real_time | 108921863 |
6.48624 | 6.48612 | 112927746 | 6.20057 | 6.20026 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:15/real_time | 110358148 |
6.66805 | 6.66789 | 112907312 | 6.19938 | 6.19908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:16/real_time | 203419574 |
3.4415 | 3.44137 | 237134525 | 2.95649 | 2.95638 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:16/real_time | 203414035 |
3.4411 | 3.44099 | 237129564 | 2.95178 | 2.95171 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:16/real_time | 203404068 |
3.44157 | 3.44151 | 236981704 | 2.9518 | 2.95167 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:16/real_time | 203391471 |
3.44146 | 3.44137 | 237108807 | 2.95203 | 2.95196 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:16/real_time | 203393801 |
3.44131 | 3.44127 | 237126460 | 2.95278 | 2.95272 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:16/real_time | 203407476 |
3.44181 | 3.44162 | 237154444 | 2.95293 | 2.9528 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:500/real_time | 37551439 |
18.6407 | 18.6407 | 39222534 | 17.858 | 17.8571 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:500/real_time | 37544097 |
18.6404 | 18.6401 | 39174151 | 17.8539 | 17.8536 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:500/real_time | 37549837 |
18.6391 | 18.6391 | 39233956 | 17.8507 | 17.8505 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:500/real_time | 45996345 |
15.2157 | 15.2153 | 39285929 | 17.848 | 17.8474 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:500/real_time | 46012429 |
15.2184 | 15.2179 | 65664865 | 10.7366 | 10.7364 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:500/real_time | 45912375 |
15.2349 | 15.2346 | 65205908 | 10.8498 | 10.8492 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:2000/real_time | 9493955 |
73.7232 | 73.7203 | 10188090 | 68.7931 | 68.7908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:2000/real_time | 9495562 |
73.7173 | 73.7173 | 10180895 | 68.7533 | 68.7511 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:2000/real_time | 9487371 |
73.7852 | 73.7831 | 10164473 | 68.7279 | 68.725 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:2000/real_time | 10816047 |
64.7322 | 64.7287 | 10168481 | 68.8109 | 68.8096 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:2000/real_time | 10808802 |
64.7232 | 64.721 | 19478320 | 36.1471 | 36.1461 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:2000/real_time | 10818192 |
64.7304 | 64.728 | 19419672 | 35.9635 | 35.9635 | ns
2024-04-16 13:52:43 -07:00
Rachel Guo
6b305f95e0
Support xcframework for mac catalyst builds. (#19534)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

MAUI on macOS uses mac-catalyst which requires a different native
binary.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-03-20 10:55:19 -07:00
snadampal
77da2ef278
[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031)
### Description
This PR adds SbgemmKernel for aarch64. This includes Sbegmm kernel to
implement matrix multiplication with bfloat16 SIMD instructions (bfmmla)
and MatMul operator changes to invoke the Sbgemm kernel. To enable
Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"

The PR also adds new test cases for mlas and ort.

### Motivation and Context

This is to improve MatMul performance on aarch64 platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed 1.2x
-1.76x performance improvement compared to sgemm (fp32) kernel
performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
And the unit test precision results are matching to sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync `
2024-01-22 14:43:06 -08:00
luoyu-intel
459c750b03
Update x64 template kernel library for 'sqnbitgemm' (#19016)
### Description
<!-- Describe your changes. -->
1. Make JBLAS codes an external module of ORT.
2. Move q4 gemm code to contrib_ops.
3. Update template kernel library to v0.1 release.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We found that the current LLM model performance is far below our
expectations. Here is some performance data collected on Mistral-7B
model with Xeon-8480:
8 threads | prompt length=32 past_len=32 | prompt length=1   past_len=32
-- | -- | --
ORT-main | 1220ms | 263ms
Neural-speed | 564ms | 87ms
ORT-this PR|597ms|120ms

Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code,
there is a 33ms(87ms vs. 120ms) latency gap between the two frameworks.
Through some statistics analysis, the summary latency of `MatMulNBits`
is 86.7ms
The summary latency of all int4 GEMMs in `Neural-speed` is 84.8ms. So
other OPs introduce an extra 30ms latency.

The performance of MatMulNBits in this PR meets our expectations.

### Remain Issues
1. For hybrid CPUs, like core 12900K, the ONNXRuntime thread pool uses
TaskGranularityFactor to scale its number of threads. This is not
expected in our code design. It may slow down the hybrid CPU performance
by 30~40%.
2. Prepack uses a single thread which is very slow to init a session.
3. MatMulNBits with zero points will fall through to COMP_FP32 even
accuracy_level=4. Our COMP_INT8 IGemmCore with zero points process is
not optimized for now. It will be updated in the future. So, for an int4
model with zero points, whether the accuracy_level is 0 or 4 will be no
difference.
2024-01-18 13:16:34 -08:00
Edward Chen
150c4cb8fe
[MLAS AArch64] SQNBitGemm CompInt8 kernel (#18953)
Implement ARM NEON SQNBitGemm kernel that first block quantizes A to int8 and then does int8 multiplication.
2024-01-12 17:58:08 -08:00
luoyu-intel
5f00bc9931
Integrate high-performance x64 gemm library to MLAS (#17669)
### Description
Improve MLAS to support high-performance x64 INT4 kernels



### Motivation and Context
1. improve LLM inference performance on Intel CPUs.
2. support more 4bit quantization types: nf4, fp4
3. support dynamic block size: block size aligned with kernel's tiling
size(e.g. 4 for VNNI kernel), per channel on N dimension
4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni,
amx_bf16, amx_int8, avx512_fp16
5. support MatMulNBits' data format

### Tasks
- [x] support block_size: 32, 128, -1(per channel)
- [x] get weight pack size without memory allocation
- [x] use ort's thread pool for parallelism
- [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8

### Benchmark
Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613
| 47401 | 12970
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
6347792 | 6317562 | 109
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
11814014 | 11757847 | 59
Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
50222 | 50031 | 13759
Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
2038222 | 2028743 | 341
Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
3792832 | 3774485 | 191
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
58717 | 58501 | 11467
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
1360846 | 1354598 | 543
Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
2564232 | 2551365 | 266
Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929
| 57694 | 12047
Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5495330 | 5465810 | 126
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10676240 | 10617817 | 66
Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
68305 | 68047 | 10026
Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5504862 | 5476215 | 126
Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
11758623 | 11697337 | 66
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
67713 | 67451 | 10298
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5508325 | 5480237 | 126
Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10738528 | 10681656 | 64
Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
60708 | 60486 | 11321
Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5523784 | 5495736 | 126
Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10829633 | 10772161 | 67


Reference:

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 |
13364
Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 |
6230335 | 110
Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 |
11632339 | 59

Win11+12900K 8 cores:
Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976
| 211295 | 2884
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time |
60960590 | 60937500 | 10
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time |
1.18E+08 | 1.19E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time |
470377 | 453059 | 1414
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time |
1.54E+08 | 1.53E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time |
3.18E+08 | 3.13E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time |
569072 | 559398 | 1229
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time |
1.54E+08 | 1.52E+08 | 4
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time |
3.22E+08 | 3.28E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time |
1486055 | 1473325 | 403
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time |
4.14E+08 | 4.14E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time |
8.88E+08 | 8.59E+08 | 1

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Co-authored-by: Mengni Wang <mengni.wang@intel.com>
2023-12-19 09:36:31 -08:00
junchao-loongson
4abec9749e
[mlas] add loongarch lsx and lasx optimize code (#17937)
### Description
Hello we(@lixing-star) are the developers of loongson team.

We add 128 (lsx), 256 (lasx) vector optimization code for the loongarch
architecture


[100% tests passed, 0 tests failed out of
7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true)

### Development Environments1
```
CPU: 
    Loongson-3C5000L
uname -a:  
    Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux

```
### LonngArch Documents
- [LoongArch Reference Manual - Volume 1: Basic Architecture: This
manual describes the basic part of the LoongArch
architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html)
- [LoongArch ELF psABI: This manual describes the LoongArch ELF
psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
-
[more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)
2023-12-07 11:15:59 -08:00
Edward Chen
0a4d76d98b
MLAS AArch64 quantized int4 Gemm kernel (#18031)
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.

- Connect MatMulNBits contrib op to MLAS function.
2023-11-15 09:31:54 -08:00
snadampal
d88d52eead
[aarch64] Remove mmla kernel support from apple (#18082)
### Description
<!-- Describe your changes. -->
The mmla kernels require additional ISA flags
and are currently supported only on Linux


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
more context is in https://github.com/microsoft/onnxruntime/pull/15270

cc: @skottmckay , @chenfucn , @snnn
2023-10-25 11:34:57 -07:00
snadampal
780ee186d7
[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160)
### Description
<!-- Describe your changes. -->
This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This
covers
(i) symmetric quantization (zero point is Zero)
(ii) asymmetric quantization (zero point is non zero)
(iii) per channel as well as per tensor quantization
(iv) Signed weights (U8S8 Gemm)
(v) Unsigned weights (U8U8 Gemm) and 
(vi) Signed activations and weights (S8S8 Gemm) scenarios

I've enabled the ummla/smmla kernels based on cpuinfo check for `I8MM`
support
MMLA QGEMM kernels are enabled for all the devices that support I8MM
instructions.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to improve INT8 quantized MatMul performance on aarch64
platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed up to
1.33x performance improvement compared to the optimized UDOT qgemm
kernel performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
I have also run the unit tests, and made sure all are passing

```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync 

```
2023-10-24 07:49:04 +10:00
MistEO
870b0bc305
Fix typo of cmake (#17715)
This caused a cmake configuration error.
2023-09-27 11:48:46 -07:00
Chen Fu
3c10f027de
4b quantization for weights of LLMs (#16833)
### Description
Blockwise 4b quantization for LLMs. 
1. Introduce 4b block-wise quantization for linear layer weights.
2. Implements matrix multiplication kernel for fp32 x int4
3. Implements special operator MatMulFpQ4
4. Implements quantization tool, that convert MatMul operator to
MatMulFpQ4, when the right hand side is 2D const tensor.


### Motivation and Context
Compress and accelerate LLMs

|Benchmark | Time(ns)|
|-------------|----------|
|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8| 218054|
|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8| 35830155|
|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8| 73479790|
|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8| 270152|
|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8| 35826721|
|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8| 73021200|
|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8| 213832|
|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8| 36749874|
|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8| 72618120|


|Benchmark | Time(ns)|
|-------------|----------|
|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8|   522610|
|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8| 39237689|
|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8| 75983467|

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-08-07 12:23:55 -07:00
Yi Zhang
2e214d6e27
Workaround to upgrade VS2022 for Windows ARM build (#16826)
### Description



### Motivation and Context
It should be reverted when VS2022 is upgraded to 17.7 or above.

### Vefication

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=331401&view=logs&j=7517abfd-115a-5c61-78a0-7ba3c9e3a88d
2023-07-25 08:35:52 +08:00
Dipanjan Sengupta
a461608409
Amx flag removal (#16527)
### Description
1. Replacing AMX intrinsics with machine code macros in QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
3. Fixing the link time optimization (LTO) issue introduced with asm
.include of an assembly file.

I have moved the AMX instruction macro definitions from
QgemmU8S8KernelAmxCommon.S to the amx_common.h to fix the LTO issue.
Note that I am also pushing the macros defined in
QgemmU8S8KernelAmxCommon.S for future reference.

A special thanks to @laxmansole who helped in the development of the
instruction macro definitions for AMX intrinsics and fixing the LTO
issue.

### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
GCC version to use the feature.These changes should allow the usage of
the AMX feature with just the CPU ID check.
2023-07-13 11:19:49 -07:00
Scott McKay
697dd12f6e
Re-organize the transpose optimization and layout transformation files. (#16246)
### Description
<!-- Describe your changes. -->
Split out the more basic changes from #15552 for easier review.

Re-organize to clarify the structure
- Separate out generic base functionality from ORT specific components
  - pass in handlers for internal ORT ops to Optimize
- Split out layout transformation from transpose optimization
- Separate out level 1 transpose optimizer
- Cleanup some naming to try and clarify things like an optimizer vs.
general optimization code

Most of the changes are from this movement of code.

Two implementation changes:
- the extended handlers are queried first in GetHandler
- allows the extended handlers to override the default behaviour for an
ONNX operator
- simplify the Optimize function to remove OptimizerMode. 
- `can_modify_node` is used instead of `mode` and
`ignore_assigned_nodes` and a long description of the current usage is
added. I don't _think_ that changes the current behavior and hopefully
clarifies what happens and when, and makes the base transpose optimizer
implementation more generic.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Create a cleaner separation to support adding EP specific logic next to
cleanly handle where an EP has additional layout sensitive behaviour
required (e.g. it's Resize implementation only handles one layout).
2023-07-07 08:24:47 +10:00
Chen Fu
5c125b4366
Cfu revertamx (#16455)
### Description

This is to revert two PRs that aim at reducing AMX toolchain
requirements. Unfortunately we still have some pipeline issues.

https://github.com/microsoft/onnxruntime/pull/16390
https://github.com/microsoft/onnxruntime/pull/16086

### Motivation and Context

Looks like gcc link time optimization does not work very well with
inline assembly in the above PRs.
2023-06-23 09:20:23 -07:00
Dipanjan Sengupta
35fa6af428
Fix for the build break in AMX feature on Mac OS. (#16390)
### Description
Fixing the build break issue in Apple pipeline due to AMX flag removal.
2023-06-16 21:00:41 -07:00
Dipanjan Sengupta
681a0d084d
Removing AMX build flag (#16086)
### Description
1. Replacing AMX intrinsics with machine code macro instructions in
QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.



### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
GCC version to use the feature.These changes should allow the usage of
the AMX feature with just the CPU ID check.
2023-06-15 11:22:59 -07:00
Changming Sun
0204594f90
Cleanup WASM cmake code (#15996)
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more nature.
For example,

```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
2023-05-20 18:07:39 -07:00
George Nash
f2889b41c1
[AMX] Update assembler check (#15501)
A recent commit added an assembler check if the ASM dialect was ATT

This unfortunately broke the AMX build for systems that don't have the
ASM-ATT dialect.

This change assumes if the CMAKE_ASM-ATT_COMPILER_ID is not found and
the CMAKE_ASM_COMPILER_ID is "GNU" based on all the other already passed
checks AMX is supported by the compiler and assembler.

### Description




### Motivation and Context
On my build system the recent change to add the ASM-ATT version check
disabled AMX code from the build.

---------

Signed-off-by: George Nash <george.nash@intel.com>
2023-04-19 14:16:26 -07:00
Yateng Hong
9bb4e4bef4
Fix masm flags (#15417)
### Description
Fix onnxruntime_mlas build failure with cmake 3.26. Updated CMAKE
generator expression to make sure certain complier flags only apply for
C/CXX compiler.

### Motivation and Context
CMake changed the behavior of ASM_MASM in version 3.26. See
https://gitlab.kitware.com/cmake/cmake/-/issues/24639.

This also fixed the issue of #15101
2023-04-07 10:20:03 -07:00
Chen Fu
605c2f4b89
Remove fp16 support from apple (#15270)
### Description

Removing fp16 support from apple build


### Motivation and Context
FP16 support on ARM64 only available after armv8.2a, thus the clang
compiler needs a compilation flag `-march=armv8.2-a+fp16`.
Unfortunately, our current universal build does not support hardware
specific compilation flags on cpp source files, as it would cause
trouble when compiling against more than one hardware target. Until we
figure out how to remove this limitation, had to disable fp16 support
for Apple systems.
2023-03-30 16:44:26 -07:00
Chen Fu
41ddcd30a1
Fp16 NHWC Max and Average Pooling (#15181)
### Description
Max and average pooling operators for fp16, NHWC 


### Motivation and Context
Continue on the steps for fp16 inference support
2023-03-28 08:22:55 -07:00
Jian Chen
527e006124
Update mlas (#15228)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-27 14:18:48 -07:00
JiCheng
126e7bf15f
[AMX] add assembler check (#15055)
### Description
<!-- Describe your changes. -->

AMX isn't supportted until assembler 2.40 even though the GCC frontend
supports it.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-22 07:57:22 +08:00
Chen Fu
34175f0b7c
FP16 conv (#15062)
### Description

Convolution for fp16 datatype. Use NHWC for computation. For NCHW input,
it rearranges the input tensor to NHWC format before computing the
result.

Support two optional fusion:

1. Activation
2. Add (not yet implemented)

### Motivation and Context

Accelerating fp16 inference
2023-03-21 10:32:43 -07:00
Jian Chen
6891ab5bac
fix_macos (#15018)
### Description
<!-- Describe your changes. -->
This fix macos packaging build on universal2 arch. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-14 21:54:44 -07:00
Chen Fu
acc2ac627f
Fp16 Activations (#14722)
### Description

NEON fp16 SIMD implementation of Activation functions


### Motivation and Context
Step 2 of fp16 SIMD support.

---------

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-28 17:20:40 -08:00
Chen Fu
733ca85b73
Cfu fp16 (#14538)
### Description
FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel,
and ARM64 NEON kernel.


### Motivation and Context
First step in creating native support of fp16 model inferencing on ARM64
and AMD64 platforms.

---------

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-15 12:51:53 -08:00
Chen Fu
90142899bd
Supporting Intel AMX instructions in quantized GEMM (#14042)
### Description
Using Intel AMX int8 instructions to accelerate quantized GEMM


### Motivation and Context
AMX instructions accelerate quantized GEMM significantly:

Prepacked B perf numbers (latency in ns)

GEMM Config | AVX512Vnni | AMX
-- | --: | --:
M:384/N:1024/K:1024/Batch:1/Threads:4 | 1057511 | 285393
M:384/N:1024/K:3072/Batch:1/Threads:4 | 2643929 | 700397
M:384/N:1024/K:4096/Batch:1/Threads:4 | 3784750 | 890701
M:384/N:4096/K:1024/Batch:1/Threads:4 | 2378139 | 887251
M:384/N:1024/K:1024/Batch:1/Threads:16 | 307137 | 138481
M:384/N:1024/K:3072/Batch:1/Threads:16 | 855730 | 295027
M:384/N:1024/K:4096/Batch:1/Threads:16 | 1126878 | 317395
M:384/N:4096/K:1024/Batch:1/Threads:16 | 781963 | 237014
M:1536/N:1024/K:1024/Batch:1/Threads:16 | 538864 | 181459
M:1536/N:1024/K:3072/Batch:1/Threads:16 | 1681002 | 561600
M:1536/N:1024/K:4096/Batch:1/Threads:16 | 2158127 | 717470
M:1536/N:4096/K:1024/Batch:1/Threads:16 | 2428622 | 896140
M:3072/N:1024/K:1024/Batch:1/Threads:16 | 1058029 | 357031
M:3072/N:1024/K:3072/Batch:1/Threads:16 | 3138504 | 1095857
M:3072/N:1024/K:4096/Batch:1/Threads:16 | 4155640 | 1386183
M:3072/N:4096/K:1024/Batch:1/Threads:16 | 4679030 | 1778624

Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-01-10 12:16:27 -08:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Jack·Boos·Yu
ea004e953f
[cmake] Export multi targets in static build (#11063)
* [cmake] Export multi targets in static build

* Install more components in static build, format some code

* Fix code pos
2022-04-03 22:37:18 -07:00
Chen Fu
dc72159105
Symmetric Quant indirect Conv kernel for ARMv8 A55 chip (#10862)
ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used as little cores in big.Little configurations. A55 has a narrower memory load/store hardware, where a 128b load instruction would block the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be duo issued with many other instructions.

This change adds a Symmetric Quant indirect Conv kernel for a55 micro-architecture, where we replace

ldr q4,[x1],

with

ldr d4,[x1],
ldr x11,[x1],
ins v4.d[1],x11

so that we can try to hide the memory load cycles behind computing cycles in the kernel.

With this new kernel, cartoongan model shows significant perf improvement on Pixel5a little cores (2 threads running on two little cores):

new kernel: 2188.59 ms
old kernel: 2360.61 ms
2022-03-25 17:10:47 -07:00
Chen Fu
50a6f095cd
Symmetric QGEMM kernel for ARMv8 A55 chip (#10754)
ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used as little cores in big.Little configurations. A55 has a narrower memory load/store hardware, where a 128b load instruction would block the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be duo issued with many other instructions.

This change adds a Symmetric QGEMM kernel for a55 micro-architecture, where we replace

ldr q4,[x1],#16

with

ldr d4,[x1],#8
ldr x11,[x1],#8
ins v4.d[1],x11

so that we can try to hide the memory load cycles behind computing cycles in the kernel.

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-03-07 08:41:13 -08:00
Maxiwell
43ff27c7c8
ppc64le: optimizing the MlasQuantizeLinear() with VSX (#10644)
This code is valid only when -mcpu is set to utilize POWER9 technology
or above. A compatible code for POWER8 was created as well, but it
was not tuned for performance.
2022-03-04 14:54:56 -08:00
RajalakshmiSR
5d8c5409ab
POWER10: QGEMM optimization (#10642)
* POWER10: QGEMM optimization

This patch makes use of POWER10 MMA feature for QGEMM function.
This optimization includes signed and unsigned cases.Tested and
there are no new failures with gcc11 and clang-14.

* Changes as per review comments

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2022-03-02 08:36:26 -08:00
Chen Fu
c4f1dfcfaa
Cfu s8s8 (#10413)
Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv.

Perf number with single thread:

Nokia G10 (baseline / new) in ms	Pixel 4 (baseline/new) in ms
mobilenet_edgetpu	220 / 213	18.5 / 17.6
cartoongan	8537 / 8521	967 / 928

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-01-28 09:26:52 -08:00