onnxruntime/include/onnxruntime/core/session
Chen Fu 040c2f4517
x86/64 U8S8 Gemm Precision Fix (#12088)
Add a graph optimization that converts u8s8 matrix multiplication to u8u8 if needed

On x86/64 platforms, CPUs with SSE4.1, AVX2 or AVX512 provide better performance when computing u8s8 matrix multiplications. Unfortunately, the higher performance comes with value overflow problems, as described in:
https://www.intel.com/content/www/us/en/develop/documentation/onednn-developer-guide-and-reference/top/advanced-topics/nuances-of-int8-computations.html
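
To make the overflow concrete, here is a minimal sketch that models one 16-bit lane of the u8 x s8 multiply-add used on these CPUs: two adjacent u8*s8 products are summed into a signed 16-bit value with saturation. The helper names (saturate_int16, maddubs_pair) are hypothetical and exist only for this illustration; this is not ONNX Runtime code.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def maddubs_pair(a0, a1, b0, b1):
    """Model one 16-bit output lane of a u8 x s8 multiply-add.

    a0, a1 are unsigned bytes (0..255), b0, b1 are signed bytes (-128..127).
    Returns (saturated_result, exact_result) so the two can be compared.
    """
    exact = a0 * b0 + a1 * b1
    return saturate_int16(exact), exact


# Worst case with full-range s8 weights: the exact sum does not fit in int16.
print(maddubs_pair(255, 255, 127, 127))

# With weights limited to [-64, 63] the sum always fits, even for a0 = a1 = 255.
print(maddubs_pair(255, 255, 63, 63))
print(maddubs_pair(255, 255, -64, -64))
```

Under this model, the extreme products for weights in [-64, 63] are 2 * 255 * 63 = 32130 and 2 * 255 * (-64) = -32640, both inside the int16 range, so no saturation can occur for such weights.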

This change adds a session option "session.x64quantprecision" (default off). Operators that call u8s8 matrix multiplications, e.g. QAttention, are converted to u8u8 when all of the following conditions are satisfied:

1. The current CPU supports SSE4.1, AVX2 or AVX512 but has no VNNI support.
2. Session option "session.x64quantprecision" is on.
3. The constant weight tensor contains values outside the [-64, 63] range.

Note that when the weight tensor is not constant, QDQS8ToU8Transformer should already have converted it to u8.
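
The three conditions above can be sketched as a single predicate. This is an illustration only: the function and parameter names are hypothetical, and the s8-to-u8 mapping shown (adding 128 to both the weights and the zero point, which keeps every weight-minus-zero-point difference unchanged) is an assumption about how such a conversion is typically done, not something this commit message states.

```python
def weights_out_of_safe_range(weights, lo=-64, hi=63):
    """True if any s8 weight falls outside [lo, hi]."""
    return any(w < lo or w > hi for w in weights)


def should_convert_to_u8u8(cpu_is_sse41_avx2_or_avx512, cpu_has_vnni,
                           option_enabled, constant_weights):
    """Combine the three conditions listed in the commit message."""
    affected_cpu = cpu_is_sse41_avx2_or_avx512 and not cpu_has_vnni
    return (affected_cpu
            and option_enabled
            and weights_out_of_safe_range(constant_weights))


def s8_to_u8(weights, zero_point):
    """Assumed conversion: shift weights and zero point by 128.

    (w + 128) - (zp + 128) == w - zp, so the dequantized values are unchanged.
    """
    return [w + 128 for w in weights], zero_point + 128


# Weight 100 lies outside [-64, 63], so conversion is requested.
print(should_convert_to_u8u8(True, False, True, [10, -3, 100]))
# All weights are inside the safe range, so u8s8 is kept.
print(should_convert_to_u8u8(True, False, True, [63, -64, 0]))
print(s8_to_u8([-128, 0, 127], 0))
```

From the Python API, the option would presumably be turned on with SessionOptions.add_session_config_entry("session.x64quantprecision", "1"); the value "1" for "on" is an assumption here and should be checked against onnxruntime_session_options_config_keys.h.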
2022-07-13 10:12:25 -07:00
environment.h Revert "Call pluggable EP's shutdown function in Environment::~Environment() (#11120)" (#11393) 2022-05-02 14:38:31 -07:00
experimental_onnxruntime_cxx_api.h
experimental_onnxruntime_cxx_inline.h Deprecate APIs returning raw ptrs and provide replacements (#11922) 2022-06-24 09:50:04 -07:00
onnxruntime_c_api.h Generalize native op creation (#11539) 2022-06-27 21:12:15 -07:00
onnxruntime_cxx_api.h Generalize native op creation (#11539) 2022-06-27 21:12:15 -07:00
onnxruntime_cxx_inline.h Generalize native op creation (#11539) 2022-06-27 21:12:15 -07:00
onnxruntime_run_options_config_keys.h Add ability for memory arenas to "shrink" periodically (#7284) 2021-05-08 07:53:21 -07:00
onnxruntime_session_options_config_keys.h x86/64 U8S8 Gemm Precision Fix (#12088) 2022-07-13 10:12:25 -07:00
snippets.dox [C API Docs] Add docs for run options tag/log level accessors/modifiers. (#9045) 2021-09-14 08:53:35 -07:00