The LLVM compiler complains about std::hash<const char*> and suggests std::hash<const void*>. However, the intention is to hash the name string rather than the pointer, so use std::hash<std::string> to be explicit.
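A minimal sketch of the difference, for illustration only. `HashName` is a hypothetical helper name, not a function from the code base; it shows hashing the characters of a name via std::hash<std::string> instead of hashing the pointer value.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: hash the characters of a C string, not its address.
// std::hash<const char*> would hash the pointer value, so two equal names
// stored at different addresses could produce different hashes.
inline std::size_t HashName(const char* name) {
  assert(name != nullptr);
  // A temporary std::string is built from the C string and its contents are hashed.
  return std::hash<std::string>{}(name);
}
```

With this helper, two separate buffers that both hold "Conv" hash to the same value, which is the behavior wanted for name lookups.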
* Add ability to use ORT format model flatbuffer directly for initializers by leveraging the TensorProto external data infrastructure.
Requires user to provide ORT format model bytes when creating the session, and set both `session.use_ort_model_bytes_directly` and `session.use_ort_model_bytes_for_initializers` to 1 in SessionOptions config entries (AddSessionConfigEntry in C API).
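The requirement that both entries are set can be sketched as below. This is only an illustration: `ConfigEntries`, `IsEnabled` and `CanUseModelBytesForInitializers` are hypothetical stand-ins and not ORT types or functions. Only the two key names come from the description above; in real code the entries are added through the SessionOptions config entry API.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the SessionOptions config entries (string key/value pairs).
using ConfigEntries = std::unordered_map<std::string, std::string>;

// True if the entry exists and is set to "1".
inline bool IsEnabled(const ConfigEntries& entries, const std::string& key) {
  auto it = entries.find(key);
  return it != entries.end() && it->second == "1";
}

// Hypothetical check: initializers may refer to the ORT format model bytes
// directly only when both entries are set to "1".
inline bool CanUseModelBytesForInitializers(const ConfigEntries& entries) {
  return IsEnabled(entries, "session.use_ort_model_bytes_directly") &&
         IsEnabled(entries, "session.use_ort_model_bytes_for_initializers");
}
```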
Add a graph optimization that converts u8s8 matrix multiplication to u8u8 if needed
On x86/x64 platforms, SSE4.1, AVX2 and AVX512 CPUs provide better performance when computing u8s8 matrix multiplications. Unfortunately, the higher performance comes with value overflow problems, as described in:
https://www.intel.com/content/www/us/en/develop/documentation/onednn-developer-guide-and-reference/top/advanced-topics/nuances-of-int8-computations.html
In this change we added a session option "session.x64quantprecision" (default off). For operators that call u8s8 matrix multiplications, e.g. QAttention, we convert them to u8u8 when all of the following conditions are satisfied:
1. Current CPU is SSE4.1, AVX2 or AVX512 with no VNNI support
2. Session option "session.x64quantprecision" is on.
3. Constant weight tensor contains values outside the [-64, 63] range
Note that when the weight tensor is not constant, QDQS8ToU8Transformer should already have converted it to u8.
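The three conditions above can be sketched roughly as follows. All names here (`CpuInfo`, `ShouldConvertToU8U8`, `ConvertS8ToU8`) are hypothetical and only illustrate the decision; they are not the transformer's actual code. The s8 to u8 conversion shown (shifting values by 128, with the zero point shifted by the same amount) is an assumption about how such a conversion is usually done.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical summary of the CPU features relevant to the decision.
struct CpuInfo {
  bool has_sse41_avx2_or_avx512;  // one of the instruction sets listed above
  bool has_vnni;                  // condition 1 requires no VNNI support
};

// True if any weight falls outside [-64, 63]. Inside that range the sum of two
// u8*s8 products fits in int16: 2 * 255 * 63 = 32130 and 2 * 255 * (-64) = -32640.
inline bool HasValueOutsideSafeRange(const std::vector<std::int8_t>& weights) {
  for (std::int8_t w : weights) {
    if (w < -64 || w > 63) return true;
  }
  return false;
}

// All three conditions must hold before the u8s8 operation is converted to u8u8.
inline bool ShouldConvertToU8U8(const CpuInfo& cpu, bool option_on,
                                const std::vector<std::int8_t>& constant_weights) {
  return cpu.has_sse41_avx2_or_avx512 && !cpu.has_vnni && option_on &&
         HasValueOutsideSafeRange(constant_weights);
}

// Assumed conversion: shift every value by +128 so it fits in u8.
// The weight zero point would have to be shifted by 128 as well.
inline std::vector<std::uint8_t> ConvertS8ToU8(const std::vector<std::int8_t>& weights) {
  std::vector<std::uint8_t> out;
  out.reserve(weights.size());
  for (std::int8_t w : weights) {
    out.push_back(static_cast<std::uint8_t>(static_cast<int>(w) + 128));
  }
  return out;
}
```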
* create op from ep
* read input count from context
* create holder to host nodes
* fix typo
* cast type before comparison
* throw error on API fail
* silence warning from minimal build
* switch to unique_ptr with deleter to host nodes
* fix typo
* fix build err for minimal
* fix build err for minimal
* add UT for conv
* enable test on CUDA
* add comment
* fix typo
* use gsl::span and string view for Node constructor
* Added two APIs - CopyKernelInfo and ReleaseKernelInfo
* pass gsl::span by value
* switch to span<NodeArg* const> to allow for reference to const containers
* fix typo
* fix reduced build err
* fix reduced build err
* refactoring node construction logic
* rename exceptions
* add input and output count as arguments for op creation
* refactor static member
* use ORT_CATCH instead of catch
* cancel try catch
* add static value name map
* format input definition and set err code
* fix comments
* fix typo
* Revert "Revert "Refactor ExecutionFrame and SessionState to reduce memory all… (#11888)"
This reverts commit d2cbae3a04.
* Revert prepacked_weights to avoid indirect inclusion in CUDA and TRT code that breaks the build.
Minor wording update to warning message to clarify that the function style Compile API is deprecated now and will be removed soon.
Also updated some code comments.
* Rework the EP factory creation setup so we're not cut-and-pasting function declarations in multiple places.
Convert append EP for SNPE to be generic, and also use for XNNPACK.
Add XNNPACK to C# API
* Don't need stub for MIGraphX as it's using provider bridge.
* Remove old 'create' functions that aren't applicable now that the EPs are built as separate libraries.
* Only use EPs that require the layout transform if the opset is supported by the layout transformer.
* Update wasm registration of xnnpack.
* C API version 0.001
* fix linker issues
* fixes for save checkpoint api
* plus fixes based on tests
* plus test_runner and other changes
* Plus cosmetic updates
* remove unnecessary headers
* plus some updates
* plus more changes
Co-authored-by: Ashwini Khade <askhade@microsoft.com@orttrainingdev10.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
* Reserve the first core for the main thread
Currently in "auto affinity" mode the worker threads are affinized to cores 0..(N-1), leaving the very last core for the main thread. This patch reserves core #0 for the main thread, and affinizes the worker threads to cores 1..N.
* Avoid unneeded spin_pause in thread pool's worker threads
Remove unneeded PAUSE instruction (0.1-0.2 usec latency) after a worker thread finds a task to execute.
* MLAS/x86: optimize QLinearConv on hybrid CPUs
Existing 4x task granularity for task partitioning on hybrid CPUs is not sufficient to compensate for the difference in VNNI instruction throughput between performance and efficiency cores. This patch...
* Increases granularity for QLinearConv by 2x, to have 2x more tasks with 2x smaller output count
* Limits QLinearConv task count from above, to avoid output count per task getting smaller than kernel's capability
* Removes hardcoded task count for QLinearConv as it limited scaling on 16+ core CPUs
* Addressing comments
* combining x86 ARM branches in qlinearconv threaded job partition
* revert first core assignment
Co-authored-by: Saurabh <saurabh.tangri@intel.com>
Co-authored-by: Chen Fu <fuchen@microsoft.com>
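For illustration, the "reserve the first core" scheme described above (workers on cores 1..N instead of 0..(N-1)) can be sketched as a plain index computation. `WorkerCoreAssignment` is a hypothetical function, not the thread pool's code, and note that a later entry in the same list reverts the first core assignment.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns the core index each of the N worker threads is pinned to.
//   reserve_first_core == false: workers use cores 0..(N-1), the last core is left for the main thread.
//   reserve_first_core == true : workers use cores 1..N, core 0 is left for the main thread.
inline std::vector<int> WorkerCoreAssignment(int num_workers, bool reserve_first_core) {
  assert(num_workers >= 0);
  std::vector<int> cores;
  const int first_core = reserve_first_core ? 1 : 0;
  for (int i = 0; i < num_workers; ++i) {
    cores.push_back(first_core + i);
  }
  return cores;
}
```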
* aten op for inference
* fix build error
* move some code to training only
* remove domain from operator name
* move aten_op_executor ext out from ortmodule
* add pipeline
* add exec mode
* fix script
* fix ut script
* fix test pipeline
* failure test
* rollback
* bugfix
* resolve comments
* enable aten for python build only
* fix win build
* use target_compile_definitions
* support io binding
* turn off aten by default
* fix ut
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Co-authored-by: zhijxu <zhijxu@microsoft.com>
* Rework allocator sharing to work for multiple devices.
* Update SessionState to not use allocator name in matching for consistency with IExecutionProvider. The name doesn't have any clear meaning (e.g. we use the same name for the per-thread allocator in the CUDA EP as the shared allocator there and in the TRT EP).
* NOTE: this means we will have one allocator per OrtMemType+OrtDevice.
* Reverse order when doing allocator setup in SessionState. This will result in the CPU and CUDA EPs allocators being preferred (they are the most configurable), and also means the per-thread CUDA allocator for default GPU memory will be used even when TRT is enabled.
* NOTE: Combined with the change to remove the allocator name from the key this will mean that if CUDA and TRT or ROCM and MIGraphX are both enabled the CUDA/ROCM per-thread allocator will be used to allocate GPU memory.
* Use InsertAllocator instead of TryInsertAllocator. Each EP should be registered once, and we should only enter RegisterAllocator once, so the 'try' should not be required and would indicate an unexpected setup was involved. i.e. better to fail and figure out if we need to support that setup.
* Add some clarifying comments around how replace allocator works.
* Add unit testing for setup where EP has local allocator that may get out of sync with values in the IExecutionProvider base class.
* Fix invalid check of whether data is on CPU to use device info instead of allocator name.
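A rough sketch of the "one allocator per OrtMemType+OrtDevice" idea with a strict insert that fails on duplicates. `Device`, `Allocator` and `AllocatorRegistry` are simplified, hypothetical stand-ins and not the real ORT classes. The point is only that the allocator name is not part of the key and that a second registration for the same key is treated as an error.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical, simplified stand-ins for the device and allocator descriptions.
struct Device {
  int type;  // e.g. CPU or GPU
  int id;    // device index
};

struct Allocator {
  std::string name;  // informational only, not used for matching
  int mem_type;
  Device device;
};

class AllocatorRegistry {
 public:
  // Strict insert: a second allocator for the same mem type + device is an
  // unexpected setup, so fail loudly instead of silently ignoring it.
  void Insert(std::shared_ptr<Allocator> alloc) {
    assert(alloc != nullptr);
    const Key key{alloc->mem_type, alloc->device.type, alloc->device.id};
    if (allocators_.count(key) != 0) {
      throw std::logic_error("allocator already registered for this mem type and device");
    }
    allocators_.emplace(key, std::move(alloc));
  }

  // Lookup ignores the allocator name; returns nullptr if nothing is registered.
  std::shared_ptr<Allocator> Find(int mem_type, const Device& device) const {
    auto it = allocators_.find(Key{mem_type, device.type, device.id});
    if (it == allocators_.end()) return nullptr;
    return it->second;
  }

 private:
  using Key = std::tuple<int, int, int>;  // (mem type, device type, device id)
  std::map<Key, std::shared_ptr<Allocator>> allocators_;
};
```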
This reverts commit 1f2c926 because it makes our packaging pipeline crash.
Error message:
[ RUN ] QLinearConvTest.Conv3D_S8S8_Depthwise
Test #1: onnxruntime_test_all ...................Subprocess killed***Exception: 838.24 sec
We haven't successfully reproduced the bug on real ARM64 hardware. So far we have only seen it show up with qemu. Investigation is ongoing.
* Initiate Ort SNPE EP
* fix snpe ep windows build which is caused by the utility method (ToUTF8String) name change on master
* correct the source path for libonnxruntime.so while building for android package
* add AdditionalDependencies for arm64
* On MS-Windows, the patchfile must be a text file, i.e. CR-LF must be used as line endings. A file with LF may give the error: "Assertion failed, hunk, file patch.c, line 343," unless the option '--binary' is given.
* fix build failure if snpe is not enabled
* update doc for contrib op
* separate out snpe ep settings to onnxruntime_snpe_provider.cmake
* renaming according to review comments
* update according to review comments
* Implement XNNPACK support via an EP.
* Layout transform uses the GraphPartitioner infrastructure.
* Node fusion is supported.
* Conv and MaxPool implementations were ported from Changming's PR.
* Added optional mutex in InferenceSession::Run as we only want to allow sequential calls if xnnpack is enabled
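The optional mutex can be pictured with the following sketch. `Session` here is a hypothetical toy class, not InferenceSession, and `RunImpl` is a placeholder for the real work. The mutex is only created when sequential calls are required, and `Run` only locks when it exists.

```cpp
#include <memory>
#include <mutex>

// Hypothetical toy session: Run() calls are serialized only when requested.
class Session {
 public:
  explicit Session(bool require_sequential_run) {
    if (require_sequential_run) {
      run_mutex_ = std::make_unique<std::mutex>();
    }
  }

  int Run(int input) {
    // Default-constructed unique_lock owns nothing; it only locks if a mutex exists.
    std::unique_lock<std::mutex> lock;
    if (run_mutex_) {
      lock = std::unique_lock<std::mutex>(*run_mutex_);
    }
    return RunImpl(input);
  }

  bool SerializesRuns() const { return run_mutex_ != nullptr; }

 private:
  // Placeholder for the actual execution.
  int RunImpl(int input) { return input + 1; }

  std::unique_ptr<std::mutex> run_mutex_;  // null when concurrent Run() is allowed
};
```

Sessions that do not need the restriction skip the locking entirely, apart from one null check per call.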
* use the lightweight compile api as default; use dnnl ep for testing
* apply to tensorrt ep
* fix the missing files
* fix build
* fix the copy issue on linux
* migrate migraphx and openvino ep
* fix openvino build break
* fix linux build
* fix unused parameter
* fix coreml build
* use graph view's filtered initializers
* fix openvino break
* fix tvm compile api
* fix tvm / rknpu / vitisai ep build
* add IsInitializedTensor in graph_viewer; fix nuphar build
* use serializer directly as tvm ep is still static lib
* fix the type mismatch
* fix the type mismatch
* fix merge conflict
* add a comment
* fix minimal build
* fix the DML EP's legacy approach
* save type/shape in dnnl IR
* fix linux break
* fix tvm failure
* dnnl ep: move initializer referenced out of dnnl subgraph
* Revert "add IsInitializedTensor in graph_viewer; fix nuphar build"
This reverts commit 1cc3c7f08c16fee4fe3309a67209eb769d479587.
* add IsInitializedTensor to graph viewer
* add the legacy code for nuphar build to temporarily make nuphar build work
* ignore internal test for nuphar
* remove the out of date tests
* keep the legacy API in EP for a while
* turn serializer into a static function
* update comments
* fix tvm build
* Update include/onnxruntime/core/framework/execution_provider.h
Co-authored-by: Pranav Sharma <prs@microsoft.com>
* Update include/onnxruntime/core/framework/execution_provider.h
Co-authored-by: Pranav Sharma <prs@microsoft.com>
* Update onnxruntime/core/framework/execution_provider.cc
Co-authored-by: Pranav Sharma <prs@microsoft.com>
* update comments; add warning message for legacy compile call
* add a flag to control out of scope arg in serialization
* fix trt build; improve the test
* resolve merge errors
* fix a typo
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
* draft kernel creation
* setup eager context
* call into kernel in eager mode
* redefine test case
* refactor eager context
* add comment
* remove header
* rename argument
* redefine API definition with types
* list outputs as argument
* switch to int to represent length
* fix compile err
* create attribute API
* add test case for topk
* remove bool from c api
* add gru test case
* remove var
* fix compile warnings
* rename status
* fix compile err
* exclude sparse tensor
* fix comments
* fix comments
* fix build err
* rename file and move location
* format code
* move file to session folder
* fix comments
Co-authored-by: Randy <Randy@randysmac.attlocal.net>
This reverts commit 4983d6e5d6. We can't destroy OrtEnv through python's atexit function, because at that time there might be many other ORT python objects alive.
* initial fix
* refactor the function handle
* update the implementation
* fix linux build break
* fix training build
* fix minimal build
* fix gradient checker
* deprecate the local function members in graph. host it in model
* fix changming's comments
* fix comments about inlined containers
* fix a missed inlined container
* fix training build
* avoid const for std string_view
Co-authored-by: Cheng Tang <chenta@microsoft.com>