onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-26 03:00:54 +00:00

Author	SHA1	Message	Date
Changming Sun	0204594f90	Cleanup WASM cmake code (#15996) ### Description Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if (CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code look more natural. For example, `if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)` becomes `if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")`.	2023-05-20 18:07:39 -07:00
George Nash	f2889b41c1	[AMX] Update assembler check (#15501) ### Description A recent commit added an assembler check for whether the ASM dialect was ATT. This unfortunately broke the AMX build for systems that don't have the ASM-ATT dialect. This change assumes that, if CMAKE_ASM-ATT_COMPILER_ID is not found and CMAKE_ASM_COMPILER_ID is "GNU", AMX is supported by the compiler and assembler, based on all the other checks that have already passed. ### Motivation and Context On my build system the recent change to add the ASM-ATT version check disabled AMX code from the build. --------- Signed-off-by: George Nash <george.nash@intel.com>	2023-04-19 14:16:26 -07:00
Yateng Hong	9bb4e4bef4	Fix masm flags (#15417) ### Description Fix onnxruntime_mlas build failure with cmake 3.26. Updated the CMake generator expression to make sure certain compiler flags apply only to the C/CXX compilers. ### Motivation and Context CMake changed the behavior of ASM_MASM in version 3.26. See https://gitlab.kitware.com/cmake/cmake/-/issues/24639. This also fixes issue #15101.	2023-04-07 10:20:03 -07:00
Chen Fu	605c2f4b89	Remove fp16 support from apple (#15270) ### Description Removing fp16 support from the Apple build. ### Motivation and Context FP16 support on ARM64 is only available from armv8.2-a onward, so the clang compiler needs the compilation flag `-march=armv8.2-a+fp16`. Unfortunately, our current universal build does not support hardware-specific compilation flags on cpp source files, as that would cause trouble when compiling against more than one hardware target. Until we figure out how to remove this limitation, we have to disable fp16 support for Apple systems.	2023-03-30 16:44:26 -07:00
Chen Fu	41ddcd30a1	Fp16 NHWC Max and Average Pooling (#15181 ) ### Description Max and average pooling operators for fp16, NHWC ### Motivation and Context Continue on the steps for fp16 inference support	2023-03-28 08:22:55 -07:00
Jian Chen	527e006124	Update mlas (#15228)	2023-03-27 14:18:48 -07:00
JiCheng	126e7bf15f	[AMX] add assembler check (#15055) ### Description AMX isn't supported until assembler 2.40, even though the GCC frontend supports it.	2023-03-22 07:57:22 +08:00
Chen Fu	34175f0b7c	FP16 conv (#15062) ### Description Convolution for the fp16 datatype. Uses NHWC for computation. For NCHW input, it rearranges the input tensor to NHWC format before computing the result. Supports two optional fusions: 1. Activation 2. Add (not yet implemented) ### Motivation and Context Accelerating fp16 inference	2023-03-21 10:32:43 -07:00
Jian Chen	6891ab5bac	fix_macos (#15018) ### Description This fixes the macOS packaging build on the universal2 arch.	2023-03-14 21:54:44 -07:00
Chen Fu	acc2ac627f	Fp16 Activations (#14722 ) ### Description NEON fp16 SIMD implementation of Activation functions ### Motivation and Context Step 2 of fp16 SIMD support. --------- Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-28 17:20:40 -08:00
Chen Fu	733ca85b73	Cfu fp16 (#14538 ) ### Description FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel, and ARM64 NEON kernel. ### Motivation and Context First step in creating native support of fp16 model inferencing on ARM64 and AMD64 platforms. --------- Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-15 12:51:53 -08:00
Chen Fu	90142899bd	Supporting Intel AMX instructions in quantized GEMM (#14042) ### Description Using Intel AMX int8 instructions to accelerate quantized GEMM ### Motivation and Context AMX instructions accelerate quantized GEMM significantly. Prepacked B perf numbers (latency in ns):

| GEMM Config | AVX512Vnni | AMX |
| -- | --: | --: |
| M:384/N:1024/K:1024/Batch:1/Threads:4 | 1057511 | 285393 |
| M:384/N:1024/K:3072/Batch:1/Threads:4 | 2643929 | 700397 |
| M:384/N:1024/K:4096/Batch:1/Threads:4 | 3784750 | 890701 |
| M:384/N:4096/K:1024/Batch:1/Threads:4 | 2378139 | 887251 |
| M:384/N:1024/K:1024/Batch:1/Threads:16 | 307137 | 138481 |
| M:384/N:1024/K:3072/Batch:1/Threads:16 | 855730 | 295027 |
| M:384/N:1024/K:4096/Batch:1/Threads:16 | 1126878 | 317395 |
| M:384/N:4096/K:1024/Batch:1/Threads:16 | 781963 | 237014 |
| M:1536/N:1024/K:1024/Batch:1/Threads:16 | 538864 | 181459 |
| M:1536/N:1024/K:3072/Batch:1/Threads:16 | 1681002 | 561600 |
| M:1536/N:1024/K:4096/Batch:1/Threads:16 | 2158127 | 717470 |
| M:1536/N:4096/K:1024/Batch:1/Threads:16 | 2428622 | 896140 |
| M:3072/N:1024/K:1024/Batch:1/Threads:16 | 1058029 | 357031 |
| M:3072/N:1024/K:3072/Batch:1/Threads:16 | 3138504 | 1095857 |
| M:3072/N:1024/K:4096/Batch:1/Threads:16 | 4155640 | 1386183 |
| M:3072/N:4096/K:1024/Batch:1/Threads:16 | 4679030 | 1778624 |

Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com> Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-01-10 12:16:27 -08:00
Edward Chen	2ecd1d6622	Switch GSL to MS GSL 4.0.0 (#13416 )	2022-10-29 04:15:20 -07:00
Jack·Boos·Yu	ea004e953f	[cmake] Export multi targets in static build (#11063 ) * [cmake] Export multi targets in static build * Install more components in static build, format some code * Fix code pos	2022-04-03 22:37:18 -07:00
Chen Fu	dc72159105	Symmetric Quant indirect Conv kernel for ARMv8 A55 chip (#10862) The ARM A55 micro-architecture (with dot product instructions), similar to the A53, is widely used for the little cores in big.LITTLE configurations. The A55 has narrower memory load/store hardware, where a 128b load instruction blocks the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be dual-issued with many other instructions. This change adds a Symmetric Quant indirect Conv kernel for the A55 micro-architecture, where we replace ldr q4,[x1], with ldr d4,[x1], ldr x11,[x1], ins v4.d[1],x11 so that we can try to hide the memory load cycles behind computing cycles in the kernel. With this new kernel, the cartoongan model shows a significant perf improvement on Pixel5a little cores (2 threads running on two little cores): new kernel: 2188.59 ms; old kernel: 2360.61 ms	2022-03-25 17:10:47 -07:00
Chen Fu	50a6f095cd	Symmetric QGEMM kernel for ARMv8 A55 chip (#10754) The ARM A55 micro-architecture (with dot product instructions), similar to the A53, is widely used for the little cores in big.LITTLE configurations. The A55 has narrower memory load/store hardware, where a 128b load instruction blocks the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be dual-issued with many other instructions. This change adds a Symmetric QGEMM kernel for the A55 micro-architecture, where we replace ldr q4,[x1],#16 with ldr d4,[x1],#8 ldr x11,[x1],#8 ins v4.d[1],x11 so that we can try to hide the memory load cycles behind computing cycles in the kernel. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-03-07 08:41:13 -08:00
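
The instruction substitution described in the two A55 commits above can be modeled in a few lines. This is a minimal sketch, not MLAS code: the helper names `load_q` and `load_split` are hypothetical, memory is assumed to be little-endian, and the model only checks that both sequences leave the same 128-bit value in the register and advance the pointer by the same amount. It says nothing about pipeline timing, which is the actual motivation for the change.

```python
# Illustrative model only: these helpers are hypothetical, not part of MLAS.

def load_q(memory: bytes, addr: int):
    """Model of `ldr q4,[x1],#16`: one 128-bit load with post-increment."""
    value = int.from_bytes(memory[addr:addr + 16], "little")
    return value, addr + 16


def load_split(memory: bytes, addr: int):
    """Model of the three-instruction replacement sequence."""
    low = int.from_bytes(memory[addr:addr + 8], "little")    # ldr d4,[x1],#8
    addr += 8
    high = int.from_bytes(memory[addr:addr + 8], "little")   # ldr x11,[x1],#8
    addr += 8
    value = low | (high << 64)                                # ins v4.d[1],x11
    return value, addr


memory = bytes(range(32))
# Both sequences produce the same register value and the same final pointer.
assert load_q(memory, 0) == load_split(memory, 0)
assert load_q(memory, 16) == load_split(memory, 16)
```

Splitting the load keeps the result identical, so the kernel is free to schedule the two 64b loads between compute instructions.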
Maxiwell	43ff27c7c8	ppc64le: optimizing the MlasQuantizeLinear() with VSX (#10644 ) This code is valid only when -mcpu is set to utilize POWER9 technology or above. A compatible code for POWER8 was created as well, but it was not tuned for performance.	2022-03-04 14:54:56 -08:00
RajalakshmiSR	5d8c5409ab	POWER10: QGEMM optimization (#10642) * POWER10: QGEMM optimization This patch makes use of the POWER10 MMA feature for the QGEMM function. This optimization includes signed and unsigned cases. Tested, and there are no new failures with gcc11 and clang-14. * Changes as per review comments Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2022-03-02 08:36:26 -08:00
Chen Fu	c4f1dfcfaa	Cfu s8s8 (#10413) Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv. Perf numbers with a single thread, baseline / new in ms: mobilenet_edgetpu: Nokia G10 220 / 213, Pixel 4 18.5 / 17.6; cartoongan: Nokia G10 8537 / 8521, Pixel 4 967 / 928. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-01-28 09:26:52 -08:00
Chen Fu	2afce4830c	Symmetric QGEMM (#10289 ) Adding code for symmetric quantized matrix multiplication. Used in quantized convolution, achieving significant perf gain. TODO, use Symmetric Quantized GEMM in other operators! TODO address activation buffer overread in custom allocators and tensors supplied by users. DOT kernel perf test: Pixel 5a: Cartoongan 513.539 ms 471.786 ms Efficient 57.5169 ms 56.4174 ms Edgetpu 14.6673 ms 13.5959 ms NEON kernel perf test Pixel 3a Cartoongan 1423.53 ms 1069.92 ms Efficient 114.086 ms 107.968 ms Edgetpu 39.2632 ms 36.9839 ms Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-01-24 10:49:04 -08:00
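
Why a symmetric B matrix helps can be shown with plain integer arithmetic. The sketch below assumes "symmetric" means the zero point of B is 0, as the QLinearConv commit (#9323) further down this log states for the filter. The function names and the `bias` parameter are hypothetical, not the MLAS interface. With a zero B offset, the only remaining correction term depends on B alone, so it can be computed once and folded into the bias.

```python
# Hypothetical helpers for illustration; not the MLAS API.

def qgemm_reference(a, b, a_zero, b_zero):
    """General quantized product: sum over k of (a - a_zero) * (b - b_zero)."""
    rows, depth, cols = len(a), len(b), len(b[0])
    return [[sum((a[m][k] - a_zero) * (b[k][n] - b_zero) for k in range(depth))
             for n in range(cols)]
            for m in range(rows)]


def qgemm_symmetric(a, b, a_zero, bias):
    """Same product for b_zero == 0, with -a_zero * colsum(B) folded into bias."""
    rows, depth, cols = len(a), len(b), len(b[0])
    # Depends only on B and a_zero, so it can be precomputed once per weight matrix.
    folded_bias = [bias[n] - a_zero * sum(b[k][n] for k in range(depth))
                   for n in range(cols)]
    # The inner loop is a plain integer dot product with no per-element offsets.
    return [[folded_bias[n] + sum(a[m][k] * b[k][n] for k in range(depth))
             for n in range(cols)]
            for m in range(rows)]


a = [[130, 125, 128], [0, 255, 7]]   # uint8 activations
b = [[1, -2], [3, 4], [-5, 6]]       # int8 weights with zero point 0
bias = [10, -20]
reference = qgemm_reference(a, b, a_zero=128, b_zero=0)
expected = [[reference[m][n] + bias[n] for n in range(2)] for m in range(2)]
assert qgemm_symmetric(a, b, a_zero=128, bias=bias) == expected
```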
Yufeng Li	7208fcbe1c	use wasmscalar as default kernel (#9988 ) * use wasmscalar as default kernel	2022-01-03 10:55:08 -08:00
Changming Sun	4e9e01cb3c	Fix SDL warnings in CPU EP (#9975 )	2021-12-19 20:54:29 -08:00
Chen Fu	cd0af7ad44	Symmetric quantized convolution kernel ARM64 (#9772 ) Adding a symmetric quantized convolution kernel for ARM64 Note: Indirect conv performs worse for shallow convs (input channels are small). This is much more so for low end pre-dot CPUs, where only 128 or deeper conv is faster with indirect conv. With DOT-CPUs, 32 deep conv is already faster Co-authored-by: Chen Fu <fuchen@microsoft.com>	2021-12-13 21:14:45 -08:00
Yi-Hong Lyu	f60a287a64	Add __x86.get_pc_thunk.bx to avoid dependency (#9955 )	2021-12-08 04:50:41 -08:00
Yufeng Li	e613019174	add s8s8 support for quantized conv and gemm (#9902 ) * add s8s8 support for quantized conv and gemm	2021-12-03 14:55:18 -08:00
RajalakshmiSR	8564fc1933	POWER10: Add optimized dgemm kernel (#9652 ) * POWER10: Add optimized dgemm kernel This patch makes use of POWER10 matrix multiply assist feature and adds new DGEMM kernel. * Indentation update Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-11-22 20:28:21 -08:00
Zhang Lei	8ef6aff734	Zhalei/dwqconv3x3 5x5 arm64 (#9714) * Arm64 Depthwise Convolution 3x3. * Add 5x5 intrinsic dwqconv for arm64 * rebase to master, remove no-need logic after arm64 convsym enabled. * Some more adjustment on the instruction pipelining. * Add specific test cases. * Fix test dimension too small. * Fix build warning as error on some CI. * better format, etc.	2021-11-18 13:57:16 -08:00
Chen Fu	1c84621020	Adding ARM64 depthwise convolution kernel for symmetric quantization (#9655 ) Adding ARM64 depthwise convolution kernel for symmetric quantization Motivation and Context Two improvements against current kernel code : 1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication. 2. Unrolled loop with manual software pipelining Co-authored-by: Chen Fu <fuchen@microsoft.com>	2021-11-15 12:18:43 -08:00
RajalakshmiSR	c54ad0dd0b	POWER: Add Dgemm kernel for POWER processor (#9459 ) * POWER: Add Dgemm kernel for POWER processor This patch adds new dgemm kernel specific to POWER processor. * POWER: Restrict new functions to VSX in header * Remove warning check in header * POWER: Dgemm Adjust indentation Fixing indentation based on review comments. Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-10-26 20:27:24 -07:00
Yufeng Li	da3dd398c5	Kernels for QLinearConv with symmetrically quantized filter (#9323) Add kernels for QLinearConv with a symmetrically quantized filter, e.g., the filter type is int8 and the zero point of the filter is 0. This PR includes kernels for avx2, avxvnni, avx512 and avx512vnni. Kernels for ARM64 will be added in a following PR. The kernels use the input buffer directly for pointwise conv, and an indirect buffer for depthwise and non-group conv. The advantages of these new kernels are: no need to compute the sum of each pixel output image, and the sum/offset of the filter can be combined with the bias. With the indirect buffer, im2col returns an array of buffer pointers instead of memcpy'ing the original data. This saves memcpy time and reduces the size of the intermediate buffer needed to hold the im2col transform. In the future, im2col will be computed ahead of time for inputs with a fixed input size.	2021-10-18 19:40:18 -07:00
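
The indirect-buffer idea from the commit above can be sketched as follows. This is an illustrative model under stated assumptions, not the MLAS implementation: the image is NHWC with stride 1, Python object references stand in for buffer pointers, and `gather_patches`, `convolve` and `pad_pixel` are hypothetical names. For quantized data the shared padding pixel would presumably hold the input zero point; plain zeros are used here.

```python
# Hypothetical sketch: list references play the role of buffer pointers.

def gather_patches(image, kernel, pad, pad_pixel, copy):
    """Collect the pixels under each kernel window of an NHWC image (stride 1)."""
    height, width = len(image), len(image[0])
    patches = []
    for oy in range(height + 2 * pad - kernel + 1):
        for ox in range(width + 2 * pad - kernel + 1):
            patch = []
            for ky in range(kernel):
                for kx in range(kernel):
                    y, x = oy + ky - pad, ox + kx - pad
                    inside = 0 <= y < height and 0 <= x < width
                    pixel = image[y][x] if inside else pad_pixel
                    # copy=True mimics classic im2col (memcpy of the channel data);
                    # copy=False stores only a reference to the existing pixel.
                    patch.append(list(pixel) if copy else pixel)
            patches.append(patch)
    return patches


def convolve(patches, weights):
    """One output channel: weights[i][c] pairs with channel c of patch pixel i."""
    return [sum(p * w
                for pixel, taps in zip(patch, weights)
                for p, w in zip(pixel, taps))
            for patch in patches]


image = [[[y * 10 + x, y - x] for x in range(3)] for y in range(3)]  # 3x3, 2 channels
pad_pixel = [0, 0]                                                   # shared padding pixel
weights = [[i + 1, -i] for i in range(9)]                            # 3x3 kernel, 2 channels

copied = gather_patches(image, 3, 1, pad_pixel, copy=True)
indirect = gather_patches(image, 3, 1, pad_pixel, copy=False)
assert convolve(copied, weights) == convolve(indirect, weights)
assert indirect[0][4] is image[0][0]   # window center points at the original pixel
assert indirect[0][0] is pad_pixel     # padded taps all share one buffer
```

Both variants feed the same values to the convolution; the indirect one only stores one reference per kernel tap rather than a copy of every channel.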
Yulong Wang	8c57d51928	support WebAssembly SIMD for qgemm (#9191 ) * support WebAssembly SIMD for qgemm * remove '--experimental-wasm-bulk-memory' for test	2021-09-30 12:40:56 -07:00
Tracy Sharpe	4828d2ebb1	MLAS: port aarch64 sgemv kernel to Windows ARM64 (#9071 )	2021-09-15 18:40:40 -07:00
Rajalakshmi Srinivasaraghavan	e83cc534d4	Fix cmake POWER10 detection Recent commit `60c98a8` changed variable mlas_common_srcs which affects POWER10 detection.	2021-09-12 11:56:55 -07:00
Changming Sun	60c98a86b7	CMake file changes for macOS universal2 support (#8953 )	2021-09-04 13:30:33 -07:00
Chen Fu	00b345eb7b	ARM Neon S8S8 kernel for QGemm (#8695) Using signed int, the qgemm kernel avoids extending uint8 to int16 while computing matrix multiplication, achieving higher performance. We also find that by using only the lower 64b of vector registers to load the A and B matrices, we can get further performance improvements. We also experimented with using ldp to load two 64b in one shot, vs using two ldr to load one 64b at a time; on both big and little cores, there is no noticeable difference. Submitting the LDP version. At this point we don't need to choose a kernel based on micro-architecture. Inference time of resnet50, thread count 2. Big Core on Pixel 3a: Current master: 292.947 ms; First iteration S8S8: 188.239 ms; LDP load two 64b reg: 178.715 ms; LDR load one 64b reg: 179.536 ms. Little Core: Master: 546.317 ms; S8S8: 513.332 ms; LDP: 489.19 ms; LDR: 497.865 ms. Raspberry Pi 3B+: Master: 660.08 ms; S8S8: 608.577 ms; LDP: 603.675 ms; LDR: 602.075 ms	2021-08-18 09:58:47 -07:00
Tracy Sharpe	539d1d44c1	Optimize ARM64EC build (#8515 ) Add sgemm and qgemm optimized kernels for ARM64EC configuration.	2021-07-27 23:46:39 -07:00
Tracy Sharpe	b2b9de939f	cleanup onnxruntime_mlas.cmake of old gcc workarounds (#8469 )	2021-07-22 22:01:05 -07:00
Rajalakshmi Srinivasaraghavan	894fc82858	POWER10: Additional check in cmake When compiling with a newer gcc and an older glibc, the new POWER10 macros may not be available in hwcap.h. This patch checks whether the hwcap macros are available before using them in platform.cpp.	2021-07-20 13:04:18 -07:00
Yufeng Li	5bf862eef9	Fix build break on windows arm64 (#8361 )	2021-07-12 22:35:21 -07:00
Yufeng Li	f6956e0259	Refactor qgemm file (#8322) This PR purely extracts each kernel to a standalone file. No functionality change. Specifically, it: leaves the MlasGemm function and thread handling in qgemm.cc; puts the dispatcher functions and the template functions (interfaces) that are required to implement a kernel into qgemm.h; puts each kernel implementation in a separate file, which implements/specializes the template functions MlasGemmU8X8FixupZeroPointB, MlasGemmU8X8CopyPackA, MlasGemmU8X8CopyPackB and MlasGemmU8X8Kernel; determines the files to be compiled in the cmake file.	2021-07-12 10:13:20 -07:00
RajalakshmiSR	32ceaf4532	POWER10: Optimized SGEMM in MLAS (#8121 ) * POWER10: Optimized SGEMM in MLAS This patch introduces new optimized version of SGEMM in MLAS using power10 Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1. This patch makes use of new POWER10 compute instructions for matrix multiplication operation. * Adjust tabs in cmake Changing tabs to spaces as per review comment. * Adjust tabs in new sgemm file Changing tabs to spaces in SgemmKernelPOWER10.cpp. * Reusing functions using common header Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-06-28 14:41:08 -07:00
Changming Sun	c716b56f26	Update C++ Standard from 14 to 17 (#8041) Switched the code to C++17. To build ONNX Runtime on old distros like CentOS 7, you need to install a newer GCC from additional repos. If you build onnxruntime with the newer GCC, typically the resulting binary can't be distributed to other places because it depends on the new GCC's runtime libraries, something that the stock OS doesn't have. But on RHEL/CentOS, it can be better. We use Red Hat devtoolset 8/9/10 with CentOS 7 to build our code. The new library features (like std::filesystem) that do not exist in the old C++ runtime will be statically linked into the applications, with some restrictions: 1. GCC has a dual ABI, but we can only use the old one. It means std::string is still copy-on-write and std::list::size() is still O(n). Also, if you build onnxruntime on CentOS 7 and link it with some binaries that were built on CentOS 8 or Ubuntu with the new ABI and export C++ symbols directly (instead of using a C API), then it won't work. 2. We still can't use std::optional. It is a limitation coming from macOS. We will solve it once we get macOS 11 build machines. It won't be too long. 3. Please avoid using C++17 in CUDA files (.cu), and also in the .h files that they include (like core/framework/float16.h). This is because CUDA 10.2 doesn't support C++17. You are welcome to use the new features in any *.cc files.	2021-06-25 14:08:01 -07:00
Gao, Chun	4dd724ef1a	Enable WebAssembly SIMD build (#7839 ) Add a build switch "--enable_wasm_simd" to enable WebAssembly SIMD build	2021-05-28 16:29:58 -07:00
Taewoo Kim	d1c531058a	Add elseif statement for arm64e	2021-05-18 14:58:58 -07:00
Changming Sun	7b003967b1	Add static code analyzer to Windows CPU/GPU CI builds and fix the warnings (#7489 )	2021-04-29 11:54:57 -07:00
RajalakshmiSR	3c7c728989	cmake: Add regex pattern for POWER architecture (#7494 ) This patch helps to set architecture as power, when processor check output matches ppc64le*. Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-04-28 22:23:14 -07:00
Yulong Wang	405ca49012	build ONNXRuntime into WebAssembly (#6478 ) * Simplified version of WebAssembly support to keep most of existing data structures and add cmake using Ninja and emcmake * Clean up CMakeLists.txt and add an example to create and compute a kernel * Load a model from bytes and remove graph building steps * Add all cpu and contrib ops with mlas library * WebAssembly build with Onnxruntime C/CXX API * Use protobuf cmakefile directory instead of adding every necessary source file * Fix invalid output at example * add missing files * Change an example to use Teams model and support ort mobile format * add API for javascript * fix input releasing in _ort_run() * update API * Let onnxruntime cmake build WebAssembly with option '--wasm' * allow one-step building for wasm * Make build script working on Linux and MacOS * Fix broken build from Windows command * Enable unit test on building WebAssembly * Resolve comments * update build flags * wasm conv improvement from: 1) GemmV; 2) Depthwise direct convolution 3x3; 3) Direct convolution 3x3 * Cleaned mlas unittest. * use glob * update comments * Update baseline due to loss scale fix (#6948) * fix stream sync issue (#6954) * Enable type reduction in EyeLike, Mod, random.cc CPU kernels. (#6960) * Update EyeLike CPU kernel. * Update Mod CPU kernel. * Update Multinomial CPU kernel. * Slight improvement to Pad CPU kernel binary size. * Update RandomNormal[Like], RandomUniform[Like] CPU kernels. * Fix warning from setting multiple MSVC warning level options. (#6917) Fix warning from setting multiple MSVC warning level options. Replace an existing /Wn flag instead of always appending a new one. * MLAS: quantized GEMM update (#6916) Various updates to the int8_t GEMMs: 1) Add ARM64 udot kernel to take advantage of dot product instructions available in newer cores. Some models run 4x faster than the stock implementation we used before. 
2) Refactor the x64 kernels to share common code for AVX2(u8u8/u8s8/avxvnni) vs AVX512(u8u8/u8s8/avx512vnni) to reduce binary size. 3) Extend kernels to support per-column zero points for matrix B. This is not currently wired to an operator. * Implement QLinearAveragePool with unit tests. (#6896) Implement QLinearAveragePool with unit tests. * Attention fusion detect num_heads and hidden_size automatically (#6920) * fixed type to experimental session constructor (#6950) * fixed type to experimental session constructor Co-authored-by: David Medine <david.medine@brainproducts.com> * Update onnxruntime_perf_test.exe to accept free dimension overrides (#6962) Co-authored-by: Ori Levari <orlevari@microsoft.com> * Fix possible fd leak in NNAPI (#6966) * Release buffers for prepacked tensors (#6820) Unsolved problems: 1. One test failure was caused by a bug in Cudnn rnn kernels, when they can allocate a buffer and partially initialize it, the garbage data near tail of the buffer caused problem in some of the hardware. To attack this problem in a broader sense, should we add code in our allocators, and during a memory fuzzing test, fill an allocated buffer with garbage before returning to the caller? 2. Prepacking is used more widely than we know. For instance, Cudnn rnn kernels also cache their weights. They mix several weight tensors together into a single buffer, and never touch the original weight tensor anymore. This is the same idea with pre-pack, but they didn't override the virtual function, and they never tried to release those weight tensors, leading to memory waste. It also seems to me that there are some other kernels have similar behavior. Wonder how much memory we can save if we try to cleanup those too. 3. Turning off memory pattern planning does increase memory fragmentation, leading to out of memory error in some training test cases. 
Perhaps we can revisit the idea of pushing kernels-creation stage earlier, and then during initializer deserialization, we only avoid tracing those that will be prepacked. * Enable type reduction for Range, ReverseSequence, ScatterND, Split, and Unique CPU kernels. (#6963) * add CI * fix test in ci * fix flags for nsync in wasm build * add copyright banner * fix wasm source glob * add missing exports * resolve comments * Perf gain by make packb wide to 4 from 16 on GEMM for WASM. Remove no need direct conv in previous perf tuning. * fix buildbreak introduced from latest master merge * fix buildbreak in mlasi.h * resolve all comments except MLAS * rewrite packb related 3 functions for WASM_SCALAR separately rather than using #ifdef in each. and other changes according to PR feedback in mlas. * More complete scalar path in sgemm from Tracy. * Fix edge case handling in depthwise conv2d kernel 3x3. where: ) support input W==1 and H==1 ) recalc in accurate pad_right and pad_bottom ) support hidden pad_right == 2 or pad_bottom == 2 when W == 1 or H==1 and no pad left/top Add more test coverage for conv depthwise from Tracy. Fix one typo according to PR. 
* resolve comments * replace typedef by using * do not use throw in OrtRun() * output error message Co-authored-by: Sunghoon <35605090+hanbitmyths@users.noreply.github.com> Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com> Co-authored-by: Wei-Sheng Chin <wschin@outlook.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Tracy Sharpe <42477615+tracysh@users.noreply.github.com> Co-authored-by: David Medine <david.eric.medine@gmail.com> Co-authored-by: David Medine <david.medine@brainproducts.com> Co-authored-by: Ori Levari <ori.levari@microsoft.com> Co-authored-by: Ori Levari <orlevari@microsoft.com> Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com> Co-authored-by: Chen Fu <chenfucs@gmail.com>	2021-04-06 16:18:10 -07:00
Ben Niu	d1acdd4f4b	Support building ARM64EC onnxruntime.dll (#6999 )	2021-03-29 15:35:30 -07:00
Tracy Sharpe	a8b897f710	MLAS: quantized GEMM update (#6916 ) Various updates to the int8_t GEMMs: 1) Add ARM64 udot kernel to take advantage of dot product instructions available in newer cores. Some models run 4x faster than the stock implementation we used before. 2) Refactor the x64 kernels to share common code for AVX2(u8u8/u8s8/avxvnni) vs AVX512(u8u8/u8s8/avx512vnni) to reduce binary size. 3) Extend kernels to support per-column zero points for matrix B. This is not currently wired to an operator.	2021-03-10 09:54:43 -08:00
Tracy Sharpe	bc27652188	MLAS: workaround LLVM x86 assembler (#6922 ) Implement an alternate workaround for the LLVM x86 problem described in PR #5088. That change made the x86 assembly files build with the GNU assembler by using -fno-integrated-as	2021-03-08 14:18:49 -08:00

100 commits