onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-12 00:59:23 +00:00

Author	SHA1	Message	Date
RajalakshmiSR	5d8c5409ab	POWER10: QGEMM optimization (#10642) * POWER10: QGEMM optimization This patch makes use of the POWER10 MMA feature for the QGEMM function. This optimization includes signed and unsigned cases. Tested, and there are no new failures with gcc11 and clang-14. * Changes as per review comments Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2022-03-02 08:36:26 -08:00
Chen Fu	12c44bfc4e	fix bug: getting current cpu core type (#10630) The previously merged pull request #10521 has a bug. It aimed to detect the current CPU core micro-architecture and select the best-suited kernel. Unfortunately, it assumes that a thread can never migrate from one core to another. This change tries to fix that problem. It introduces about 2-5% performance degradation on symmetric quantized matmul. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-02-25 08:56:14 -08:00
Chen Fu	c4f1dfcfaa	Cfu s8s8 (#10413) Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv. Perf numbers with a single thread, in ms (baseline / new): mobilenet_edgetpu: Nokia G10 220 / 213, Pixel 4 18.5 / 17.6; cartoongan: Nokia G10 8537 / 8521, Pixel 4 967 / 928. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-01-28 09:26:52 -08:00
Chen Fu	2afce4830c	Symmetric QGEMM (#10289) Adding code for symmetric quantized matrix multiplication. Used in quantized convolution, achieving significant perf gain. TODO: use symmetric quantized GEMM in other operators. TODO: address activation buffer overread in custom allocators and tensors supplied by users. DOT kernel perf test, Pixel 5a: Cartoongan 513.539 ms, 471.786 ms; Efficient 57.5169 ms, 56.4174 ms; Edgetpu 14.6673 ms, 13.5959 ms. NEON kernel perf test, Pixel 3a: Cartoongan 1423.53 ms, 1069.92 ms; Efficient 114.086 ms, 107.968 ms; Edgetpu 39.2632 ms, 36.9839 ms. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-01-24 10:49:04 -08:00
Yufeng Li	d2b1424968	fix bugs in cpuid_info (#10334) * fix several bugs in cpuid_info	2022-01-20 16:30:18 -08:00
Chen Fu	cd0af7ad44	Symmetric quantized convolution kernel ARM64 (#9772) Adding a symmetric quantized convolution kernel for ARM64. Note: Indirect conv performs worse for shallow convs (input channels are small). This is much more so for low-end pre-dot CPUs, where only 128 or deeper conv is faster with indirect conv. With DOT CPUs, 32 deep conv is already faster. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2021-12-13 21:14:45 -08:00
Yufeng Li	e613019174	add s8s8 support for quantized conv and gemm (#9902) * add s8s8 support for quantized conv and gemm	2021-12-03 14:55:18 -08:00
RajalakshmiSR	8564fc1933	POWER10: Add optimized dgemm kernel (#9652) * POWER10: Add optimized dgemm kernel This patch makes use of POWER10 matrix multiply assist feature and adds new DGEMM kernel. * Indentation update Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-11-22 20:28:21 -08:00
Chen Fu	1c84621020	Adding ARM64 depthwise convolution kernel for symmetric quantization (#9655) Adding ARM64 depthwise convolution kernel for symmetric quantization. Motivation and Context: Two improvements over the current kernel code: 1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication. 2. Unrolled loop with manual software pipelining. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2021-11-15 12:18:43 -08:00
RajalakshmiSR	c54ad0dd0b	POWER: Add Dgemm kernel for POWER processor (#9459) * POWER: Add Dgemm kernel for POWER processor This patch adds new dgemm kernel specific to POWER processor. * POWER: Restrict new functions to VSX in header * Remove warning check in header * POWER: Dgemm Adjust indentation Fixing indentation based on review comments. Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-10-26 20:27:24 -07:00
Yufeng Li	da3dd398c5	Kernels for QLinearConv with symmetrically quantized filter (#9323) Add kernels for QLinearConv with a symmetrically quantized filter, i.e., the filter type is int8 and the zero point of the filter is 0. This PR includes kernels for avx2, avxvnni, avx512 and avx512vnni. Kernels for ARM64 will be added in a following PR. The kernels use the input buffer directly for pointwise conv, and an indirect buffer for depthwise and non-group conv. The advantages of these new kernels are: (1) there is no need to compute the sum for each pixel of the output image, and the sum/offset of the filter can be combined with the bias; (2) with the indirect buffer, im2col returns an array of buffer pointers instead of memcpy'ing the original data. This saves memcpy time and reduces the size of the intermediate buffer needed to hold the im2col transform. In the future, im2col will be computed ahead of time for inputs with a fixed size.	2021-10-18 19:40:18 -07:00
Tracy Sharpe	b2b9de939f	cleanup onnxruntime_mlas.cmake of old gcc workarounds (#8469)	2021-07-22 22:01:05 -07:00
Rajalakshmi Srinivasaraghavan	894fc82858	POWER10: Additional check in cmake When compiling with a newer gcc and an older glibc, the new POWER10 macros may not be available in hwcap.h. This patch checks whether the hwcap macros are available before using them in platform.cpp.	2021-07-20 13:04:18 -07:00
RajalakshmiSR	32ceaf4532	POWER10: Optimized SGEMM in MLAS (#8121) * POWER10: Optimized SGEMM in MLAS This patch introduces new optimized version of SGEMM in MLAS using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. * Adjust tabs in cmake Changing tabs to spaces as per review comment. * Adjust tabs in new sgemm file Changing tabs to spaces in SgemmKernelPOWER10.cpp. * Reusing functions using common header Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-06-28 14:41:08 -07:00
Tracy Sharpe	cbdd59dae9	MLAS: enable SSE 4.1 path for x86 build (#8127)	2021-06-23 09:38:58 -07:00
Tracy Sharpe	2b0bbfd1a8	MLAS: add SSE 4.1 u8s8 kernel (#7490)	2021-04-29 11:12:32 -07:00
Tracy Sharpe	a01334ba56	MLAS: activate udot kernel on Windows ARM64 (#7169)	2021-03-29 17:56:48 -07:00
Tracy Sharpe	90642e7eac	MLAS: more code cleanup (#7036) Change int32_t->ptrdiff_t when interacting with the threadpool. Migrate more code from MlasMaskMoveAvx->MlasMaskMoveTableAvx. Update more code to use FUNCTION_ENTRY macro.	2021-03-17 09:22:55 -07:00
Tracy Sharpe	5480f8dd1d	MLAS: misc cleanup (#7013) Miscellaneous changes to synchronize the style used over time: Remove unneeded PFN types in favor of FN*. Switch more functions over to using the common FUNCTION_ENTRY macro. Switch logistic/tanh kernels over to the style used in TransKernelFma3.asm.	2021-03-15 18:24:18 -07:00
Tracy Sharpe	a8b897f710	MLAS: quantized GEMM update (#6916) Various updates to the int8_t GEMMs: 1) Add ARM64 udot kernel to take advantage of dot product instructions available in newer cores. Some models run 4x faster than the stock implementation we used before. 2) Refactor the x64 kernels to share common code for AVX2(u8u8/u8s8/avxvnni) vs AVX512(u8u8/u8s8/avx512vnni) to reduce binary size. 3) Extend kernels to support per-column zero points for matrix B. This is not currently wired to an operator.	2021-03-10 09:54:43 -08:00
Ramakrishnan Sivakumar	a5bef6886b	Threading support for Hybrid core architecture (#6728)	2021-02-17 15:35:07 -08:00
Tracy Sharpe	9a6e71574a	MLAS: improve quantized depthwise convolution (#6513)	2021-02-01 21:22:27 -08:00
Yufeng Li	7264a067a9	Implement QuantizeLinear with avx512 (#6260) * Implement QuantizeLinear with AVX512	2021-02-01 14:33:44 -08:00
Ramakrishnan Sivakumar	5bcb5f5a3d	MLAS: Add support for AVXVNNI (#5592) Adds Gemm kernels with AVXVNNI support for Int8 acceleration	2020-10-26 16:27:48 -07:00
Tracy Sharpe	3ef449816c	MLAS: support prepacking APIs for quantized GEMM (#4433) Add support for prepacking matrix B for use in the quantized GEMMs.	2020-07-06 15:20:10 -07:00
Zhang Lei	94c98aa0a7	qlinearadd for arm/sse2/avx2 using intrinsic, enable binary broadcasting parallel (#4216) * Support quantization linear binary element wise math ops, implement QLinearAdd. Support tests for quantization linear binary element wise math ops, implement test for QLinearAdd. Add QLinearAdd with SSE2 intrinsic implementation, Avx2 assembly implementation, Neon intrinsic support. QLinearAdd support VectorOnVector, VectorOnScalar, ScalarOnVector. Generalized QlinearBinaryOp parallel related with broadcasting. * Modify according to PR feedback. Mainly: * template helper to generalize the qladd logic on v2v, s2v, v2s * remove GetKernel related. * change mixed legacy MM/SSE code in the AVX code * formatter, typos, conventions, etc. * Utilize MlasSubtractInt32x4 in MlasDequantizeLinearVector(). * Some format fix. * More natural parallel parameter type. * Fix build break for x86. * Comment goes to 80 before wrap. * Many changes to the assembly, macro related. Using vminps rather than vpminsd to handle NaN. tested on windows. * Using CLang Format to format the file. * Fix arm32 build error. * Remove some duplicate in different #if defined * working add.u8.vector to vector * Fix runtime bus error on real arm32 linux. * fix typo in store last one lane. * arm32 qlinearadd handle scalar. * Move qladd to separate c++ file * Add neon64 qladd. * refactor some, enhance two instructions on arm64 only instructions * Fix typo for arm64 * use strict op in pure c++ (min/max on float value) * sse2 new version. * merge arm/sse2/avx2 * pass arm/sse/avx2 linux test * remove non-used assembly file. * Remove unused data definition and trailing spaces. * Fix broadcasting parallel issue. * Enhance broadcasting scenarios. Allow testing result diff due to round on half. * Add Mlas or MLAS_ prefix for namespace safety. * Handle alignment issue for arm32 for GCC/MSVC. remove some unused signed/unsigned int ops. * Specify /arch:AVX2 for qladd_avx2.cpp * Fix type during copy/paste when unrolling. Better one GreatEqual condition. Better formatting by splitting two statements on a single line. * Arm neon alignment parameter is bits rather than bytes, change it. * Move qladd_avx2.cpp to intrinsics/avx2/ folder * Formatting using mlas style. * Double check mlas style for these files. * change indent 2 to 4 for qladd_avx2.cpp * Fix windows x86 build error due to sse2 no _mm_cvtsi128_si64 * To re-trigger all as old failed pipeline updated. Co-authored-by: Lei Zhang <phill.zhang@gmail.com>	2020-07-01 11:54:44 -07:00
Yufeng Li	867ba846f7	Implement MinMax with SIMD (#4285) * Implement MinMax with SIMD	2020-06-23 20:07:53 -07:00
Tracy Sharpe	0d8abc1a99	MLAS: qgemm refactoring (#4030) Treat U8U8 as U8S8 for VNNI for performance and optimize SSE2 kernel.	2020-05-26 17:27:32 -07:00
Tracy Sharpe	cb554fbc2d	MLAS: Add MlasComputeSoftmax/MlasComputeExp (#3846) * add MlasComputeSoftmax * fix onnxruntime_mlas_test DLLs * remove unneeded header * remove unneeded header * call MlasComputeExp * call MlasComputeSoftmax * call MlasComputeSoftmax * finish off * fix static analysis warning	2020-05-07 14:02:01 -07:00
Tracy Sharpe	88c20eaef1	MLAS: rename AVX512BW->AVX512Core (#3216) Cleanup change: remap functions and files with Avx512BW to Avx512Core.	2020-03-13 22:45:51 -07:00
Dmitri Smirnov	af9dbb70f2	Introduce a separate check and conditional for AVX512BW build (#2083) Separate checks for AVX512F and AVX512BW. Make AVX512BW cmake instructions nested within AVX512F support.	2019-10-10 16:14:00 -07:00
Tracy Sharpe	57e0099425	MLAS: Implement U8S8 GEMV kernels (#2069) This implements an optimization for U8S8 MlasGemm when M=1, aka GEMV.	2019-10-09 11:54:16 -07:00
Dmitri Smirnov	cae571c713	Add a test for AVX512 compilation before compiling 512 asm (#2055)	2019-10-08 21:18:04 -07:00
Tracy Sharpe	4c995d3251	MLAS: add DGEMM support (#1953) * rename existing kernels * add dgemm support * rename existing kernels * add dgemm support * synchronize with amd64 * dgemm * remove test code * remove more test code * fix file extension	2019-09-30 10:04:59 -07:00
Tracy Sharpe	28a62f7728	MLAS: add U8S8 MatMul operation (#1895) Implement the second round of changes for quantization inside MLAS. This adds a MatMul operation for U8xS8=S32 for x86/x64 processors.	2019-09-24 18:15:11 -07:00
Tracy Sharpe	071a0c2522	MLAS: MlasSgemm refactoring (#1749) Refactor the SGEMM kernels to resynchronize the code between Windows/Linux and remove unneeded binary bloat from a different zero/add mode kernel. Another goal is to get to a cleaner state for then doing a DGEMM kernel.	2019-09-06 22:26:28 -07:00
Tracy Sharpe	bc72c2dba7	MLAS: add U8U8 MatMul operation (#1644) Implement the first round of changes for quantization inside MLAS. This adds a MatMul operation for U8xU8=S32 for x86/x64 processors.	2019-08-18 18:15:48 -07:00
Changming Sun	6b89c7ad04	Let mlas use session thread pool (#1609) 1. Let mlas use the session thread pool. 2. Remove the onnxruntime_USE_MLAS cmake option. 3. Remove the win32 thread pool code inside mlas. mlas will: 1. use the ort thread pool if it gets passed in; 2. use openmp if the threadpool parameter is nullptr; 3. run single threaded if the threadpool parameter is nullptr and openmp is disabled.	2019-08-16 13:21:15 -07:00
Tracy Sharpe	719e58d831	Use MLAS to retrieve the CPU preferred tensor buffer alignment (#1377) Add MlasGetPreferredBufferAlignment() for use by CPUAllocator::Alloc to get the byte alignment for CPU tensors. Using MLAS allows the value to be based on the platform the binary is running on instead of a constant value fixed at compile time.	2019-07-12 22:22:46 -07:00
Tracy Sharpe	3ebad81abc	MLAS: NCHWc low-level changes (#1283) Implementation of the MLAS changes for NCHWc convolution/pooling support. These changes adopt the blocking format used by MKL-DNN and other convolution libraries for better performance.	2019-06-25 16:57:30 -07:00
Zhang Lei	468de7c8af	Zhalei/erff (#846) Implement error function in mlas with avx2 optimization.	2019-05-06 14:05:04 -07:00
Yufeng Li	0d2181cf85	Remove parallelfor for certain ops (#908) Parallelfor makes maxpool, gather and reduce ops slower. This PR removes parallelfor for those ops and adds the Windows thread pool back for sgemm.	2019-04-25 19:38:59 -07:00
Tracy Sharpe	cb69c65756	Update MLAS to be able to build standalone again (#874) Change MLAS to be able to build standalone without onnxruntime header dependencies. This is enabled when building with MLAS_NO_ONNXRUNTIME_THREADPOOL defined. mlas.h had been changed to include the ThreadPool header, but this header now just has a forward reference for the class. The header was also doing a "using onnxruntime::concurrency"; that has been removed and the external mlas.h users fixed up as needed. As before, if ThreadPool==nullptr, then MLAS uses OpenMP or falls back to a single threaded implementation. The build option to use the Win32 system thread pool has been removed as onnxruntime can't hit that path and I don't use that option for standalone tests anymore.	2019-04-21 14:04:15 -07:00
Randy	f048fc5fb0	cross compile x86 linux (#562) * cross compile x86 linux * fix comments * install multilib for ubuntu cross compile * remove trailing slash * fix -fPIC relocations for x86 target too * add asm make flag * fix x86 compile err * test x86 with zlib and png * Disable zlib from x86 * install x86 python header * remove cross-compiling changes * test 32bit ubuntu * add x86 ubuntu docker file * add x86 as arch parameter for docker build * config pipeline * avoid dotnet install * install cmake * skip dep install * use latest ubuntu * install latest cmake * install x86 deps * configure cmake * install ninja * correct ninja dir * apt get re2c * install onnx * set processor x86 * disable warning * skip test * disable test * disable test * find lib * fix typo * restore test * disable backend model test * disable test * fix test err * stop installing onnx * disable onnx test on x86 * restore yml * merge with master yml * cancel needless config setting * enable x86 flag * restore all onnx tests * fix yml typo * install onnx * add back x86 flag * disable cases * disable case * disable cases * add macro to disable cases * fix typo * print platform * remove condition	2019-03-12 09:47:45 -07:00
Tracy Sharpe	47551da994	Optimize Tanh/Sigmoid activations (#162) * optimized tanh/sigmoid * fix /W4 warnings from alternate build environment * use MLAS for tanh/sigmoid * fix my broken C++ templates * add x86_64 files	2018-12-13 22:53:40 -08:00
Tracy Sharpe	3c7c1068e7	refactor threading (#110)	2018-12-06 09:20:32 -08:00
Pranav Sharma	89618e8f1e	Initial bootstrap commit.	2018-11-19 16:48:22 -08:00

47 commits
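
Note on commit da3dd398c5 (#9323): its message states that when the filter zero point is 0, the per-pixel sum is no longer needed and the filter sum/offset can be combined with the bias. The following is a minimal Python sketch of that arithmetic only. It is not the MLAS implementation, and the function names (`dot_reference`, `fold_bias`, `dot_symmetric`) are hypothetical, chosen for illustration.

```python
def dot_reference(x, w, zx, zw, bias):
    """Quantized dot product with explicit zero-point subtraction on both operands."""
    return sum((xi - zx) * (wi - zw) for xi, wi in zip(x, w)) + bias


def fold_bias(w, zx, bias):
    """Fold the activation zero-point term into the bias.

    Assumes the filter zero point is 0, so that
    sum((x - zx) * w) == sum(x * w) - zx * sum(w).
    The term zx * sum(w) depends only on the filter and can be precomputed.
    """
    return bias - zx * sum(w)


def dot_symmetric(x, w, folded_bias):
    """Inner loop for a symmetric filter: raw products plus the folded bias."""
    return sum(xi * wi for xi, wi in zip(x, w)) + folded_bias


# Illustrative values only (hypothetical, not taken from the commit).
x = [12, 200, 7, 130]    # uint8 activations
w = [-3, 5, 127, -128]   # int8 filter with zero point 0
zx, bias = 128, 10

folded = fold_bias(w, zx, bias)
assert dot_symmetric(x, w, folded) == dot_reference(x, w, zx, 0, bias)
```

With a nonzero filter zero point, the expansion gains a term proportional to the sum of the activations, which would have to be recomputed for every output pixel. The sketch suggests this is the cost the symmetric case avoids.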