onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-05 04:17:53 +00:00


Chen Fu 040c2f4517 x86/64 U8S8 Gemm Precision Fix (#12088)	2022-07-13 10:12:25 -07:00

Add a graph optimization that converts u8s8 matrix multiplication to u8u8 when needed.

On x86/64 platforms, CPUs with SSE4.1, AVX2, or AVX512 deliver better performance when computing u8s8 matrix multiplications. Unfortunately, the higher performance comes with value overflow problems, as described in: https://www.intel.com/content/www/us/en/develop/documentation/onednn-developer-guide-and-reference/top/advanced-topics/nuances-of-int8-computations.html

This change adds a session option "session.x64quantprecision" (off by default). For operators that call u8s8 matrix multiplications, e.g. QAttention, the multiplication is converted to u8u8 when all of the following conditions are satisfied:

1. The current CPU supports SSE4.1, AVX2, or AVX512 but has no VNNI support.
2. The session option "session.x64quantprecision" is on.
3. The constant weight tensor contains values outside the [-64, 63] range.

Note that when the weight tensor is not constant, QDQS8ToU8Transformer should already convert it to u8.
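The [-64, 63] bound in condition 3 can be illustrated with a small simulation. This is a minimal sketch, not ONNX Runtime's kernel code: it assumes the overflow comes from a pre-VNNI multiply-add step that sums two adjacent u8*s8 products into a saturating signed 16-bit lane, as the linked Intel page describes. The helper names (`pairwise_madd_saturating`, `weights_may_overflow`) are hypothetical and exist only for this example.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(value):
    """Clamp an exact integer result to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def pairwise_madd_saturating(u8_values, s8_values):
    """Multiply u8 by s8 element-wise, then add adjacent pairs with int16 saturation.

    Assumption: this models the pre-VNNI multiply-add step, not ORT's actual code.
    """
    assert len(u8_values) == len(s8_values) and len(u8_values) % 2 == 0
    out = []
    for i in range(0, len(u8_values), 2):
        exact = u8_values[i] * s8_values[i] + u8_values[i + 1] * s8_values[i + 1]
        out.append(saturate_int16(exact))
    return out


def weights_may_overflow(s8_weights, low=-64, high=63):
    """Hypothetical check mirroring condition 3: any weight outside [low, high]."""
    return any(w < low or w > high for w in s8_weights)


activations = [255, 255]  # worst-case u8 inputs
safe_weights = [63, 63]   # inside [-64, 63]
risky_weights = [127, 127]  # outside [-64, 63]

# Exact sum is 2 * 255 * 63 = 32130, which fits in int16.
safe_result = pairwise_madd_saturating(activations, safe_weights)
# Exact sum is 2 * 255 * 127 = 64770, which is clamped to 32767.
risky_result = pairwise_madd_saturating(activations, risky_weights)
```

With weights confined to [-64, 63], the worst-case pair sum stays between -32640 and 32130, so the 16-bit intermediate cannot saturate; outside that range the result may be silently clamped, which is the precision loss the u8u8 conversion avoids.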
common	Consolidate several types into onnxruntime::ArgType. (#11430)	2022-05-09 14:44:28 -07:00
eager	support register external ep lib information (#8897)	2021-08-31 20:51:22 -07:00
framework	Retry Rework execution frame to reduce memory allocations (#11897)	2022-06-20 10:29:43 -07:00
graph	Generalize native op creation (#11539)	2022-06-27 21:12:15 -07:00
optimizer	Remove ORT_ENABLE_RUNTIME_OPTIMIZATION_IN_MINIMAL_BUILD. (#10778)	2022-03-08 16:18:49 -08:00
platform	Allow saving on CPU usage for infrequent inference requests by reducing thread spinning (#11841)	2022-06-23 10:04:37 -07:00
providers	Share execution context memory between TensorRT subgraphs (#11859)	2022-06-16 22:42:40 -07:00
session	x86/64 U8S8 Gemm Precision Fix (#12088)	2022-07-13 10:12:25 -07:00