47 commits

Author SHA1 Message Date
RajalakshmiSR
5d8c5409ab
POWER10: QGEMM optimization (#10642)
* POWER10: QGEMM optimization

This patch makes use of POWER10 MMA feature for QGEMM function.
This optimization includes signed and unsigned cases. Tested with
gcc11 and clang-14; there are no new failures.

* Changes as per review comments

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2022-03-02 08:36:26 -08:00
Chen Fu
12c44bfc4e
fix bug: getting current cpu core type (#10630)
The previously merged pull request has a bug:

#10521

It aimed to detect the micro-architecture of the current CPU core and select the best-suited kernel. Unfortunately, it assumes that a thread can never migrate from one core to another.

This change tries to fix that problem. It introduces about 2-5% performance degradation on symmetric quantized matmul.
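
The failure mode described above can be illustrated with a small simulation. This is only a sketch and not the MLAS code: the `Thread` class, the core names and the kernel names are hypothetical, and re-querying the core type on every call is just one plausible way to avoid the stale result.

```python
# Hypothetical simulation of the bug; none of these names come from MLAS.
KERNEL_FOR_CORE = {"big": "kernel_big", "little": "kernel_little"}


class Thread:
    def __init__(self, core):
        self.core = core            # core the thread currently runs on
        self.cached_kernel = None   # per-thread cache used by the buggy variant


def select_kernel_cached(thread):
    """Buggy variant: detect once, then assume the thread never migrates."""
    if thread.cached_kernel is None:
        thread.cached_kernel = KERNEL_FOR_CORE[thread.core]
    return thread.cached_kernel


def select_kernel_fresh(thread):
    """Safer variant: look up the current core type on every call."""
    return KERNEL_FOR_CORE[thread.core]


t = Thread("big")
select_kernel_cached(t)             # caches the kernel for the "big" core
t.core = "little"                   # the scheduler migrates the thread
stale = select_kernel_cached(t)     # still the kernel chosen for "big"
fresh = select_kernel_fresh(t)      # matches the core the thread is on now
```

Looking up the core type on each call costs something every time, which is consistent in spirit with the small slowdown mentioned above.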

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-02-25 08:56:14 -08:00
Chen Fu
c4f1dfcfaa
Cfu s8s8 (#10413)
Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv.

Perf number with single thread:

Model	Nokia G10 (baseline / new), ms	Pixel 4 (baseline / new), ms
mobilenet_edgetpu	220 / 213	18.5 / 17.6
cartoongan	8537 / 8521	967 / 928

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-01-28 09:26:52 -08:00
Chen Fu
2afce4830c
Symmetric QGEMM (#10289)
Adding code for symmetric quantized matrix multiplication. Used in quantized convolution, achieving significant perf gain.

TODO: use Symmetric Quantized GEMM in other operators.

TODO: address activation buffer overread in custom allocators and tensors supplied by users.

DOT kernel perf test:

Pixel 5a:

Cartoongan	513.539 ms	471.786 ms
Efficient	57.5169 ms	56.4174 ms
Edgetpu	14.6673 ms	13.5959 ms

NEON kernel perf test:

Pixel 3a:

Cartoongan	1423.53 ms	1069.92 ms
Efficient	114.086 ms	107.968 ms
Edgetpu	39.2632 ms	36.9839 ms


Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-01-24 10:49:04 -08:00
Yufeng Li
d2b1424968
fix bugs in cpuid_info (#10334)
* fix several bugs in cpuid_info
2022-01-20 16:30:18 -08:00
Chen Fu
cd0af7ad44
Symmetric quantized convolution kernel ARM64 (#9772)
Adding a symmetric quantized convolution kernel for ARM64

Note:
Indirect conv performs worse for shallow convs (those with few input channels). The effect is much stronger on low-end CPUs without dot-product instructions, where indirect conv is faster only at a depth of 128 or more. On CPUs with DOT instructions, indirect conv is already faster at a depth of 32.

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2021-12-13 21:14:45 -08:00
Yufeng Li
e613019174
add s8s8 support for quantized conv and gemm (#9902)
* add s8s8 support for quantized conv and gemm
2021-12-03 14:55:18 -08:00
RajalakshmiSR
8564fc1933
POWER10: Add optimized dgemm kernel (#9652)
* POWER10: Add optimized dgemm kernel

This patch makes use of POWER10 matrix multiply assist feature and
adds new DGEMM kernel.

* Indentation update

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2021-11-22 20:28:21 -08:00
Chen Fu
1c84621020
Adding ARM64 depthwise convolution kernel for symmetric quantization (#9655)
Adding ARM64 depthwise convolution kernel for symmetric quantization

Motivation and Context
Two improvements over the current kernel code:

1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication.
2. Unrolled loop with manual software pipelining

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2021-11-15 12:18:43 -08:00
RajalakshmiSR
c54ad0dd0b
POWER: Add Dgemm kernel for POWER processor (#9459)
* POWER: Add Dgemm kernel for POWER processor

This patch adds new dgemm kernel specific to POWER processor.

* POWER: Restrict new functions to VSX in header

* Remove warning check in header

* POWER: Dgemm Adjust indentation

Fixing indentation based on review comments.

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2021-10-26 20:27:24 -07:00
Yufeng Li
da3dd398c5
Kernels for QLinearConv with symmetrically quantized filter (#9323)
Add kernels for QLinearConv with a symmetrically quantized filter, i.e., the filter type is int8 and the zero point of the filter is 0. This PR includes kernels for AVX2, AVXVNNI, AVX512 and AVX512VNNI. Kernels for ARM64 will be added in a following PR.

The kernels use the input buffer directly for pointwise conv, and an indirect buffer for depthwise and non-group conv.
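
As a rough illustration of what an indirect buffer is, the sketch below builds, for each output pixel, a list of references to input pixels rather than copying them, with one shared zero pixel standing in for padding. It is written for this listing and is not the MLAS implementation; the function names, the stride-1 convolution and the nested-list image layout are all assumptions.

```python
def build_indirection(image, kh, kw, pad, zero_pixel):
    """Return, per output pixel, kh*kw references to input pixels.

    image is an H x W nested list whose elements are per-pixel channel lists.
    No pixel data is copied: entries are references into `image`, or the
    shared `zero_pixel` where the kernel window falls into the padding.
    Assumes stride 1 (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    out_h = h + 2 * pad - kh + 1
    out_w = w + 2 * pad - kw + 1
    table = []
    for oy in range(out_h):
        for ox in range(out_w):
            taps = []
            for ky in range(kh):
                for kx in range(kw):
                    y, x = oy + ky - pad, ox + kx - pad
                    inside = 0 <= y < h and 0 <= x < w
                    taps.append(image[y][x] if inside else zero_pixel)
            table.append(taps)
    return table


def conv_output(taps, weights):
    """Dot product over all taps and channels for one output pixel."""
    return sum(v * wgt
               for pixel, wpix in zip(taps, weights)
               for v, wgt in zip(pixel, wpix))


image = [[[1], [2]], [[3], [4]]]     # 2 x 2 image, 1 channel
zero_pixel = [0]
table = build_indirection(image, 3, 3, 1, zero_pixel)
ones = [[1]] * 9                     # 3 x 3 filter of ones
top_left = conv_output(table[0], ones)
```

The table holds only references, so its size depends on the number of output pixels and kernel taps, not on the channel count.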

The advantages of these new kernels are:

* No need to compute the per-pixel sum for the output image; the sum/offset of the filter can be combined with the bias.
* With an indirect buffer, im2col returns an array of buffer pointers instead of memcpy'ing the original data. This saves memcpy time and reduces the size of the intermediate buffer needed to hold the im2col transform. In the future, im2col will be computed ahead of time for inputs with a fixed size.
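
The first advantage follows from simple algebra. With a filter zero point of 0, sum((a - za) * b) equals sum(a * b) - za * sum(b), and the second term depends only on the filter, so it can be computed once and merged into the bias. The sketch below checks this on small numbers; the function names and sample values are made up for illustration and do not come from the PR.

```python
def dot_reference(a, b, a_zero, bias):
    """Straightforward form: subtract the activation zero point per element."""
    return bias + sum((x - a_zero) * w for x, w in zip(a, b))


def fold_bias(b, a_zero, bias):
    """Precompute the filter-only term once and merge it into the bias."""
    return bias - a_zero * sum(b)


def dot_folded(a, b, folded_bias):
    """Inner loop on raw values; no per-pixel input sum is needed."""
    return folded_bias + sum(x * w for x, w in zip(a, b))


a = [130, 7, 255, 0]        # uint8 activations (illustrative values)
b = [-3, 5, 127, -128]      # int8 filter with zero point 0
a_zero, bias = 128, 10

expected = dot_reference(a, b, a_zero, bias)
actual = dot_folded(a, b, fold_bias(b, a_zero, bias))
```

If the filter had a nonzero zero point, an extra term proportional to the sum of the inputs under each kernel window would remain, which is the per-pixel sum the symmetric kernels avoid.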
2021-10-18 19:40:18 -07:00
Tracy Sharpe
b2b9de939f
cleanup onnxruntime_mlas.cmake of old gcc workarounds (#8469) 2021-07-22 22:01:05 -07:00
Rajalakshmi Srinivasaraghavan
894fc82858 POWER10: Additional check in cmake
When compiling with newer gcc and older glibc, there is a chance
for new POWER10 macros to be not available in hwcap.h. This patch
checks whether hwcap macros are available before using that in
platform.cpp.
2021-07-20 13:04:18 -07:00
RajalakshmiSR
32ceaf4532
POWER10: Optimized SGEMM in MLAS (#8121)
* POWER10: Optimized SGEMM in MLAS

This patch introduces new optimized version of SGEMM in MLAS
using power10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of new POWER10 compute instructions
for matrix multiplication operation.

* Adjust tabs in cmake

Changing tabs to spaces as per review comment.

* Adjust tabs in new sgemm file

Changing tabs to spaces in SgemmKernelPOWER10.cpp.

* Reusing functions using common header

Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
2021-06-28 14:41:08 -07:00
Tracy Sharpe
cbdd59dae9
MLAS: enable SSE 4.1 path for x86 build (#8127) 2021-06-23 09:38:58 -07:00
Tracy Sharpe
2b0bbfd1a8
MLAS: add SSE 4.1 u8s8 kernel (#7490) 2021-04-29 11:12:32 -07:00
Tracy Sharpe
a01334ba56
MLAS: activate udot kernel on Windows ARM64 (#7169) 2021-03-29 17:56:48 -07:00
Tracy Sharpe
90642e7eac
MLAS: more code cleanup (#7036)
Change int32_t->ptrdiff_t when interacting with the threadpool.
Migrate more code from MlasMaskMoveAvx->MlasMaskMoveTableAvx.
Update more code to use FUNCTION_ENTRY macro.
2021-03-17 09:22:55 -07:00
Tracy Sharpe
5480f8dd1d
MLAS: misc cleanup (#7013)
Miscellaneous changes to synchronize the style used over time:

Remove unneeded PFN types in favor of FN*.
Switch more functions over to using the common FUNCTION_ENTRY macro.
Switch logistic/tanh kernels over to the style used in TransKernelFma3.asm.
2021-03-15 18:24:18 -07:00
Tracy Sharpe
a8b897f710
MLAS: quantized GEMM update (#6916)
Various updates to the int8_t GEMMs:

1) Add ARM64 udot kernel to take advantage of dot product instructions available in newer cores. Some models run 4x faster than the stock implementation we used before.
2) Refactor the x64 kernels to share common code for AVX2(u8u8/u8s8/avxvnni) vs AVX512(u8u8/u8s8/avx512vnni) to reduce binary size.
3) Extend kernels to support per-column zero points for matrix B. This is not currently wired to an operator.
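
For readers unfamiliar with the dot product instructions mentioned in item 1: a UDOT-style instruction accumulates, into each 32-bit lane, the dot product of four unsigned bytes from each source. The Python emulation below shows that per-lane behaviour only; it is an illustrative sketch, not the MLAS kernel, and the helper names are hypothetical.

```python
def udot_lane(acc, a4, b4):
    """Emulate one 32-bit lane: acc += dot(a4, b4) over four unsigned bytes."""
    assert len(a4) == len(b4) == 4
    assert all(0 <= v <= 255 for v in a4 + b4)
    return (acc + sum(x * y for x, y in zip(a4, b4))) & 0xFFFFFFFF


def dot_u8(a, b):
    """Dot product of two uint8 vectors, four elements per step.

    Assumes len(a) == len(b) and a length that is a multiple of 4.
    """
    acc = 0
    for k in range(0, len(a), 4):
        acc = udot_lane(acc, a[k:k + 4], b[k:k + 4])
    return acc


total = dot_u8([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
```

Handling four multiply-accumulates per lane in one instruction is the reason such kernels can outperform code that first widens the bytes to 16 bits.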
2021-03-10 09:54:43 -08:00
Ramakrishnan Sivakumar
a5bef6886b
Threading support for Hybrid core architecture (#6728) 2021-02-17 15:35:07 -08:00
Tracy Sharpe
9a6e71574a
MLAS: improve quantized depthwise convolution (#6513) 2021-02-01 21:22:27 -08:00
Yufeng Li
7264a067a9
Implement QuantizeLinear with avx512 (#6260)
* Implement QuantizeLinear with AVX512
2021-02-01 14:33:44 -08:00
Ramakrishnan Sivakumar
5bcb5f5a3d
MLAS: Add support for AVXVNNI (#5592)
Adds Gemm kernels with AVXVNNI support for Int8 acceleration
2020-10-26 16:27:48 -07:00
Tracy Sharpe
3ef449816c
MLAS: support prepacking APIs for quantized GEMM (#4433)
Add support for prepacking matrix B for use in the quantized GEMMs.
2020-07-06 15:20:10 -07:00
Zhang Lei
94c98aa0a7
QLinearAdd for arm/sse2/avx2 using intrinsics, enable binary broadcasting parallel (#4216)
* Support quantization linear binary element wise math ops, implement QLinearAdd.
Support tests for quantization linear binary element wise math ops, implement test for QLinearAdd.
Add QLinearAdd with SSE2 intrinsic implementation, AVX2 assembly implementation, NEON intrinsic support.
QLinearAdd supports VectorOnVector, VectorOnScalar, ScalarOnVector.
Generalized QLinearBinaryOp parallel related with broadcasting.

* Modify according to PR feedback. Mainly:
    * template helper to generalize the qladd logic on v2v, s2v, v2s
    * remove GetKernel related.
    * change mixed legacy MM/SSE code in the AVX code
    * formatter, typos, conventions, etc.

* Utilize MlasSubtractInt32x4 in MlasDequantizeLinearVector().

* Some format fix.

* More natural parallel parameter type.

* Fix build break for x86.

* Comment goes to 80 before wrap.

* Many macro-related changes in the assembly.
Use vminps rather than vpminsd to handle NaN.
Tested on Windows.

* Using CLang Format to format the file.

* Fix arm32 build error.

* Remove some duplicate in different #if defined

* working add.u8.vector to vector

* Fix runtime bus error on real arm32 linux.

* fix typo in store last one lane.

* arm32 qlinearadd handle scalar.

* Move qladd to separate c++ file

* Add neon64 qladd.

* refactor some, enhance two instructions on arm64 only instructions

* Fix typo for arm64

* use strict op in pure c++ (min/max on float value)

* sse2 new version.

* merge arm/sse2/avx2

* pass arm/sse/avx2 linux test

* remove non-used assembly file.

* Remove unused data definition and trailing spaces.

* Fix broadcasting parallel issue.

* Enhance broadcasting scenarios. Allow test result differences due to
rounding on half.

* Add Mlas or MLAS_ prefix for namespace safety.

* Handle alignment issue for arm32 for GCC/MSVC. remove some unused
signed/unsigned int ops.

* Specify /arch:AVX2 for qladd_avx2.cpp

* Fix type during copy/paste when unrolling. Better one GreatEqual
condition. Better formatting by splitting two statements that were on a single line.

* Arm neon alignment parameter is bits rather than bytes, change it.

* Move qladd_avx2.cpp to intrinsics/avx2/ folder

* Formatting using mlas style.

* Double check mlas style for these files.

* change indent 2 to 4 for qladd_avx2.cpp

* Fix windows x86 build error due to sse2 no _mm_cvtsi128_si64

* To re-trigger all as old failed pipeline updated.

Co-authored-by: Lei Zhang <phill.zhang@gmail.com>
2020-07-01 11:54:44 -07:00
Yufeng Li
867ba846f7
Implement MinMax with SIMD (#4285)
* Implement MinMax with SIMD
2020-06-23 20:07:53 -07:00
Tracy Sharpe
0d8abc1a99
MLAS: qgemm refactoring (#4030)
Treat U8U8 as U8S8 for VNNI for performance and optimize SSE2 kernel.
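
The commit message does not say how U8U8 is mapped onto U8S8. One standard way, shown here purely as an assumption, is to flip the top bit of each unsigned B value (which equals subtracting 128) and to shift B's zero point by the same 128, which leaves every product unchanged. The function names are hypothetical.

```python
def u8_to_s8(v):
    """Flip the top bit of an unsigned byte and read it as signed: v - 128."""
    flipped = v ^ 0x80
    return flipped - 256 if flipped >= 128 else flipped


def dot_u8u8(a, b, b_zero):
    """Reference: unsigned A times (unsigned B minus its zero point)."""
    return sum(x * (y - b_zero) for x, y in zip(a, b))


def dot_via_u8s8(a, b, b_zero):
    """Same result using signed B values and a zero point shifted by 128."""
    b_signed = [u8_to_s8(y) for y in b]
    return sum(x * (y - (b_zero - 128)) for x, y in zip(a, b_signed))


a = [0, 17, 255]
b = [0, 128, 255]
same = dot_u8u8(a, b, 100) == dot_via_u8s8(a, b, 100)
```

The identity holds because (b - 128) - (zb - 128) equals b - zb for every element.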
2020-05-26 17:27:32 -07:00
Tracy Sharpe
cb554fbc2d
MLAS: Add MlasComputeSoftmax/MlasComputeExp (#3846)
* add MlasComputeSoftmax

* fix onnxruntime_mlas_test DLLs

* remove unneeded header

* remove unneeded header

* call MlasComputeExp

* call MlasComputeSoftmax

* call MlasComputeSoftmax

* finish off

* fix static analysis warning
2020-05-07 14:02:01 -07:00
Tracy Sharpe
88c20eaef1
MLAS: rename AVX512BW->AVX512Core (#3216)
Cleanup change: remap functions and files with Avx512BW to Avx512Core.
2020-03-13 22:45:51 -07:00
Dmitri Smirnov
af9dbb70f2
Introduce a separate check and conditional for AVX512BW build (#2083)
Separate checks for AVX512f and AVX512BW
  Make AVX512BW cmake instructions nested within AVX512F support.
2019-10-10 16:14:00 -07:00
Tracy Sharpe
57e0099425
MLAS: Implement U8S8 GEMV kernels (#2069)
This implements an optimization for U8S8 MlasGemm when M=1, aka GEMV.
2019-10-09 11:54:16 -07:00
Dmitri Smirnov
cae571c713 Add a test for AVX512 compilation before compiling 512 asm (#2055) 2019-10-08 21:18:04 -07:00
Tracy Sharpe
4c995d3251 MLAS: add DGEMM support (#1953)
* rename existing kernels

* add dgemm support

* rename existing kernels

* add dgemm support

* synchronize with amd64

* dgemm

* remove test code

* remove more test code

* fix file extension
2019-09-30 10:04:59 -07:00
Tracy Sharpe
28a62f7728
MLAS: add U8S8 MatMul operation (#1895)
Implement the second round of changes for quantization inside MLAS. This adds a MatMul operation for U8xS8=S32 for x86/x64 processors.
2019-09-24 18:15:11 -07:00
Tracy Sharpe
071a0c2522
MLAS: MlasSgemm refactoring (#1749)
Refactor the SGEMM kernels to resynchronize the code between Windows/Linux and remove unneeded binary bloat from a different zero/add mode kernel. Another goal is to get to a cleaner state for then doing a DGEMM kernel.
2019-09-06 22:26:28 -07:00
Tracy Sharpe
bc72c2dba7
MLAS: add U8U8 MatMul operation (#1644)
Implement the first round of changes for quantization inside MLAS. This adds a MatMul operation for U8xU8=S32 for x86/x64 processors.
2019-08-18 18:15:48 -07:00
Changming Sun
6b89c7ad04
Let mlas use session thread pool (#1609)
1. Let mlas use session thread pool
2. Remove onnxruntime_USE_MLAS cmake option
3. Remove the win32 thread pool code inside mlas

mlas will:

1. use the ort thread pool if one is passed in
2. use openmp if the threadpool parameter is nullptr and openmp is enabled
3. run single threaded if the threadpool parameter is nullptr and openmp is disabled.
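
The fallback order above can be sketched as a small dispatch function. This is a Python stand-in written for illustration, not the MLAS code: `choose_backend` and `run` are hypothetical names, `None` plays the role of nullptr, and because OpenMP has no direct Python equivalent the "openmp" path simply runs inline here.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_backend(threadpool, openmp_enabled):
    """Mirror the three-way fallback: pool, then OpenMP, then single thread."""
    if threadpool is not None:
        return "threadpool"
    if openmp_enabled:
        return "openmp"
    return "single"


def run(task, count, threadpool=None, openmp_enabled=False):
    """Run task(0..count-1) on the selected backend and return the results."""
    backend = choose_backend(threadpool, openmp_enabled)
    if backend == "threadpool":
        return backend, list(threadpool.map(task, range(count)))
    # Stand-in: both the "openmp" and "single" paths run inline in this sketch.
    return backend, [task(i) for i in range(count)]


with ThreadPoolExecutor(max_workers=2) as pool:
    pooled = run(lambda i: i * i, 4, threadpool=pool)
inline = run(lambda i: i * i, 4)
```

Both calls produce the same results; only the backend label differs.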
2019-08-16 13:21:15 -07:00
Tracy Sharpe
719e58d831
Use MLAS to retrieve the CPU preferred tensor buffer alignment (#1377)
Add MlasGetPreferredBufferAlignment() for use by CPUAllocator::Alloc to get the byte alignment for CPU tensors. Using MLAS allows the value to be based on the platform the binary is running on instead of a constant value fixed at compile time.
2019-07-12 22:22:46 -07:00
Tracy Sharpe
3ebad81abc
MLAS: NCHWc low-level changes (#1283)
Implementation of the MLAS changes for NCHWc convolution/pooling support. These changes adopt the blocking format used by MKL-DNN and other convolution libraries for better performance.
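
For readers new to the blocked layout: NCHWc splits the channel dimension into fixed-size blocks and moves the block to the innermost position, so a tensor indexed [N][C][H][W] becomes [N][C/c][H][W][c]. The sketch below shows the reordering on nested lists. The block size of 2 and the zero padding of a partial last block are assumptions chosen for illustration; the commit message does not state which block size MLAS uses.

```python
def nchw_to_nchwc(x, block):
    """Reorder [N][C][H][W] nested lists into [N][ceil(C/block)][H][W][block].

    Channels beyond C in the last block are filled with zeros (an assumption
    of this sketch).
    """
    n, c = len(x), len(x[0])
    h, w = len(x[0][0]), len(x[0][0][0])
    blocks = (c + block - 1) // block
    out = []
    for ni in range(n):
        out.append([[[[x[ni][cb * block + ci][hi][wi]
                       if cb * block + ci < c else 0
                       for ci in range(block)]
                      for wi in range(w)]
                     for hi in range(h)]
                    for cb in range(blocks)])
    return out


# N=1, C=3, H=1, W=2
x = [[[[1, 2]], [[3, 4]], [[5, 6]]]]
packed = nchw_to_nchwc(x, 2)
```

After packing, the values of one channel block at a given pixel are adjacent in memory order, which suits vector loads across channels.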
2019-06-25 16:57:30 -07:00
Zhang Lei
468de7c8af
Zhalei/erff (#846)
Implement error function in mlas with avx2 optimization.
2019-05-06 14:05:04 -07:00
Yufeng Li
0d2181cf85
Remove parallelfor for certain ops (#908)
Parallelfor makes maxpool, gather and reduce ops slower. This PR:

* removes parallelfor for those ops
* adds the windows thread pool back for sgemm.
2019-04-25 19:38:59 -07:00
Tracy Sharpe
cb69c65756
Update MLAS to be able to build standalone again (#874)
Change MLAS to be able to build standalone without onnxruntime header dependencies. This is enabled when building with MLAS_NO_ONNXRUNTIME_THREADPOOL defined.
mlas.h had been changed to include the ThreadPool header, but this header now just has a forward reference for the class. The header was also doing a "using onnxruntime::concurrency"; that has been removed and the external mlas.h users fixed up as needed.
As before, if ThreadPool==nullptr, then MLAS uses OpenMP or falls back to a single threaded implementation. The build option to use the Win32 system thread pool has been removed as onnxruntime can't hit that path and I don't use that option for standalone tests anymore.
2019-04-21 14:04:15 -07:00
Randy
f048fc5fb0 cross compile x86 linux (#562)
* cross compile x86 linux

* fix comments

* install multilib for ubuntu cross compile

* remove trailing slash

* fix -fPIC relocations for x86 target too

* add asm make flag

* fix x86 compile err

* test x86 with zlib and png

* Disable zlib from x86

* install x86 python header

* remove cross-compiling changes

* test 32bit ubuntu

* add x86 ubuntu docker file

* add x86 as arch parameter for docker build

* config pipeline

* avoid dotnet install

* install cmake

* skip dep install

* use latest ubuntu

* install latest cmake

* install x86 deps

* configure cmake

* install ninja

* correct ninja dir

* apt get re2c

* install onnx

* set processor x86

* disable warning

* skip test

* disable test

* disable test

* find lib

* fix typo

* restore test

* disable backend model test

* disable test

* fix test err

* stop installing onnx

* disable onnx test on x86

* restore yml

* merge with master yml

* cancel needless config setting

* enable x86 flag

* restore all onnx tests

* fix yml typo

* install onnx

* add back x86 flag

* disable cases

* disable case

* disable cases

* add macro to disable cases

* fix typo

* print platform

* remove condition
2019-03-12 09:47:45 -07:00
Tracy Sharpe
47551da994
Optimize Tanh/Sigmoid activations (#162)
* optimized tanh/sigmoid

* fix /W4 warnings from alternate build environment

* use MLAS for tanh/sigmoid

* fix my broken C++ templates

* add x86_64 files
2018-12-13 22:53:40 -08:00
Tracy Sharpe
3c7c1068e7
refactor threading (#110) 2018-12-06 09:20:32 -08:00
Pranav Sharma
89618e8f1e Initial bootstrap commit. 2018-11-19 16:48:22 -08:00