onnxruntime/include/onnxruntime/core
Chen Fu 040c2f4517
x86/64 U8S8 Gemm Precision Fix (#12088)
Add a graph optimization that converts u8s8 matrix multiplications to u8u8 when needed

On x86/64 platforms, CPUs with SSE4.1, AVX2, or AVX512 deliver better performance when computing u8s8 matrix multiplications. Unfortunately, the higher performance comes with value overflow problems, as described in:
https://www.intel.com/content/www/us/en/develop/documentation/onednn-developer-guide-and-reference/top/advanced-topics/nuances-of-int8-computations.html
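
A minimal sketch of the overflow, assuming it is the 16-bit saturating intermediate described in the linked note: pre-VNNI kernels multiply pairs of u8 and s8 values and add each pair into a signed 16-bit lane with saturation. The function names below are hypothetical and model only a single lane, not the actual kernel.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def pair_dot_exact(a_pair, b_pair):
    """Exact sum of two u8 x s8 products."""
    return a_pair[0] * b_pair[0] + a_pair[1] * b_pair[1]


def pair_dot_saturating(a_pair, b_pair):
    """Same sum, but clamped as a saturating 16-bit lane would clamp it."""
    return saturate_int16(pair_dot_exact(a_pair, b_pair))


# Worst-case activations (u8 = 255) against worst-case weights (s8 = 127):
exact = pair_dot_exact((255, 255), (127, 127))           # 64770
clipped = pair_dot_saturating((255, 255), (127, 127))    # clamped to 32767

# With weights restricted to [-64, 63] the pair sum always fits in int16:
safe_hi = pair_dot_saturating((255, 255), (63, 63))      # 32130
safe_lo = pair_dot_saturating((255, 255), (-64, -64))    # -32640
```

Under this model, 255 * 63 * 2 = 32130 and 255 * (-64) * 2 = -32640 both stay inside the int16 range, which is consistent with the [-64, 63] weight range used in this change.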

In this change we add a session option "session.x64quantprecision" (default off). For operators that call u8s8 matrix multiplications, e.g. QAttention, we convert them to u8u8 when all of the following conditions are satisfied:

1. The current CPU supports SSE4.1, AVX2, or AVX512 but has no VNNI support.
2. Session option "session.x64quantprecision" is on.
3. The constant weight tensor contains values outside the [-64, 63] range.

Note that when the weight tensor is not constant, QDQS8ToU8Transformer should already convert it to u8.
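
The three conditions above can be sketched as a single predicate. This is an illustrative model only: the function and parameter names are hypothetical and do not correspond to the optimizer's actual code, and CPU feature detection is assumed to happen elsewhere.

```python
def needs_u8u8_conversion(cpu_is_sse41_avx2_or_avx512, cpu_has_vnni,
                          precision_option_on, constant_weights):
    """Return True when a u8s8 matmul with constant weights should become u8u8.

    constant_weights: iterable of signed 8-bit weight values (hypothetical input).
    """
    # Condition 1: affected instruction set, and no VNNI support.
    cpu_affected = cpu_is_sse41_avx2_or_avx512 and not cpu_has_vnni
    # Condition 3: at least one weight lies outside [-64, 63].
    weights_out_of_range = any(w < -64 or w > 63 for w in constant_weights)
    # Condition 2 is the session option; all three must hold.
    return cpu_affected and precision_option_on and weights_out_of_range


# Option on, AVX2 without VNNI, one weight at 100: convert.
convert = needs_u8u8_conversion(True, False, True, [-3, 100, 12])
# Same setup, but every weight is within [-64, 63]: keep u8s8.
keep_in_range = needs_u8u8_conversion(True, False, True, [-64, 0, 63])
# VNNI is available: keep u8s8 regardless of the weights.
keep_vnni = needs_u8u8_conversion(True, True, True, [-128, 127])
# Option left at its default (off): keep u8s8.
keep_default = needs_u8u8_conversion(True, False, False, [-128, 127])
```

Because the option defaults to off, existing sessions keep the faster u8s8 path unless the caller opts in.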
2022-07-13 10:12:25 -07:00
..
common Consolidate several types into onnxruntime::ArgType. (#11430) 2022-05-09 14:44:28 -07:00
eager support register external ep lib information (#8897) 2021-08-31 20:51:22 -07:00
framework Retry Rework execution frame to reduce memory allocations (#11897) 2022-06-20 10:29:43 -07:00
graph Generalize native op creation (#11539) 2022-06-27 21:12:15 -07:00
optimizer Remove ORT_ENABLE_RUNTIME_OPTIMIZATION_IN_MINIMAL_BUILD. (#10778) 2022-03-08 16:18:49 -08:00
platform Allow saving on CPU usage for infrequent inference requests by reducing thread spinning (#11841) 2022-06-23 10:04:37 -07:00
providers Share execution context memory between TensorRT subgraphs (#11859) 2022-06-16 22:42:40 -07:00
session x86/64 U8S8 Gemm Precision Fix (#12088) 2022-07-13 10:12:25 -07:00