# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.
# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
LLVM compiler complains the std::hash<const char*> and suggests std::hash<const void*>. But the intention is to hash the name string instead of the pointer. So use std::hash<std::string> to be explicit.
* Revert "Revert "Refactor ExecutionFrame and SessionState to reduce memory all… (#11888)"
This reverts commit d2cbae3a04.
* Revert prepacked_weights to avoid indirect inclusion in CUDA and TRT code that breaks the build.
Minor wording update to warning message to clarify that the function style Compile API is deprecated now and will be removed soon.
Also updated some code comments.
* aten op for inference
* fix build error
* more some code to training only
* remove domain from operator name
* move aten_op_executor ext out from ortmodule
* add pipeline
* add exec mode
* fix script
* fix ut script
* fix test pipeline
* failure test
* rollback
* bugfix
* resolve comments
* enable aten for python build only
* fix win build
* use target_compile_definitions
* support io binding
* turn off aten by default
* fix ut
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Co-authored-by: zhijxu <zhijxu@microsoft.com>
* Rework allocator sharing to work for multiple devices.
* Update SessionState to not use allocator name in matching for consistency with IExecutionProvider. The name doesn't have any clear meaning (e.g. we use the same name for the per-thread allocator in the CUDA EP as the shared allocate there and in the TRT EP).
* NOTE: this means we will have one allocator per OrtMemType+OrtDevice.
* Reverse order when doing allocator setup in SessionState. This will result in the CPU and CUDA EPs allocators being preferred (they are the most configurable), and also means the per-thread CUDA allocator for default GPU memory will be used even when TRT is enabled.
* NOTE: Combined with the change to remove the allocator name from the key this will mean that if CUDA and TRT or ROCM and MIGraphX are both enabled the CUDA/ROCM per-thread allocator will be used to allocate GPU memory.
* Use InsertAllocator instead of TryInsertAllocator. Each EP should be registered once, and we should only enter RegisterAllocator once, so the 'try' should not be required and would indicate an unexpected setup was involved. i.e. better to fail and figure out if we need to support that setup.
* Add some clarifying comments around how replace allocator works.
* Add unit testing for setup where EP has local allocator that may get out of sync with values in the IExecutionProvider base class.
* Fix invalid check of whether data is on CPU to use device info instead of allocator name.
* Initiate Ort SNPE EP
* fix snpe ep windows build which is caused by the utility method (ToUTF8String) name change on master
* correct the source path for libonnxruntime.so while building for andorid package
* add AdditionalDependencies for amr64
* On MS-Windows, the patchfile must be a text file, i.e. CR-LF must be used as line endings. A file with LF may give the error: "Assertion failed, hunk, file patch.c, line 343," unless the option '--binary' is given.
* fix build failure if snpe is not enabled
* update doc for contrib op
* separate out snpe ep settings to onnxruntime_snpe_provider.cmake
* renaming according review comments
* update according review comments
* Implement XNNPACK support via an EP.
* Layout transform uses the GraphPartitioner infrastructure.
* Node fusion is supported.
* Conv and MaxPool implementations were ported from Changming's PR.
* Added optional mutex in InferenceSession::Run as we only want to allow sequential calls if xnnpack is enabled
* use the lightweight compile api as default; use dnnl ep for testing
* apply to tensorrt ep
* fix the missing files
* fix build
* fix the copy issue on linux
* migrate migraphx and openvino ep
* fix openvino build break
* fix linux build
* fix unused parameter
* fix coreml build
* use graph view's filtered initializers
* fix openvino break
* fix tvm compile api
* fix tvm / rknpu / vitisai ep build
* add IsInitializedTensor in graph_viewer; fix nuphar build
* use serializer directly as tvm ep is still static lib
* fix the type mismatch
* fix the type mismatch
* fix merge conflict
* add a comment
* fix minimal build
* fix the DML EP's legacy approach
* save type/shape in dnnl IR
* fix linux break
* fix tvm failure
* dnnl ep: move initializer referenced out of dnnl subgraph
* Revert "add IsInitializedTensor in graph_viewer; fix nuphar build"
This reverts commit 1cc3c7f08c16fee4fe3309a67209eb769d479587.
* add IsInitializedTensor to graph viewer
* add the legacy code for nuphar build to temporarily make nuphar build work
* ignore internal test for nuphar
* remove the out of date tests
* keep the legacy API in EP for a while
* turn serializer into a static function
* update comments
* fix tvm build
* Update include/onnxruntime/core/framework/execution_provider.h
Co-authored-by: Pranav Sharma <prs@microsoft.com>
* Update include/onnxruntime/core/framework/execution_provider.h
Co-authored-by: Pranav Sharma <prs@microsoft.com>
* Update onnxruntime/core/framework/execution_provider.cc
Co-authored-by: Pranav Sharma <prs@microsoft.com>
* updatee comments; add warning message for legacy compil call
* add a flag to control out of scope arg in serialization
* fix trt build; improve the test
* resolve merege errors
* fix a typo
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
* draft kernel creation
* setup eager context
* call into kernel in eager mode
* redefine test case
* refact eager context
* add comment
* remove header
* rename argument
* redefine API definition with types
* list outputs as argument
* switch to int to represent length
* fix compile err
* create attribute API
* add test case for topk
* remove bool from c api
* add gru test case
* remove var
* fix compile warnings
* rename status
* fix compile err
* exclude sparse tensor
* fix comments
* fix comments
* fix build err
* rename file and move location
* format code
* move file to session folder
* fix comments
Co-authored-by: Randy <Randy@randysmac.attlocal.net>
Rework initializer.cc to eliminate code duplication and add type enforcement.
Address review comments. Add literal operators for MLFloat16 abd BFloat16 and tests.
Fix CUDA 10.2 compile error due to inlined_containers.h inclusion
into a common CUDA header.
Use NumberOfNodes() to reserve space in a hash table
Prefer separate call to reserve() rather than passing in the
hash table constructor. They have somewhat different meaning.
Hide Inlined Hash set and maps guts behind template forward declarations.
Currently CUDA 10.2 compiler can not compile abseil but provider interfaces
use those types in their signatures. InlinedVector seems to be fine.
Introduce core/common/inlined_containers_fwd.h header
Work on minimizing memory management calls by
reducing number of allocations and copies.
Replace std::unordered_set to InlinedHashSet
and add usage of InlinedVector.
Employ std::move() to minimize copying and memory allocations.
Remove copying of the const shared data into each of the
PropagateCast transformer instances.
Move inlined_containers.h header to include/common
Adjust AsSpan imlementation for C++ < 17
* Add layout transformer for NNAPI
* plus merge fixes
* plus some more merge fixes
* test fixes
* comments + cleanup
* plus updates
* post merge changes
* enable layout transformer in extended minimal build
* plus more comments
* more tests + fix CI
* plus updates per review
* more updates per review
* fix file name
* fix qdq tests
* plus more updates
* plus updates
* typo fix
* fix qdq selection in 2nd optimization pass
* fix typo
* fix a test
* update dependency structure for layout transformer
* plus updates
* more updates
* plus change
* more updates to fix linker error in minimal build
* remove unnecessary headers
Add abseil cgmanifest declaration. Update coding standards for InlinedContainers
Adjust coding guidelines. Add default N calculation for InlinedVector<T, N> for general use.
Rename T from InlinedShapeVectorT. Fix Eager build
Add LLVM Copyright with modified derived code notice.
Add abseil and inlined containers typedefs
Introduce TensorShapeVector for shape building.
Use gsl::span<const T> to make interfaces accept different types of vector like args.
Introduce InineShapeVectorT for shape capacity typed instantiations
Refactor cuda slice along with provider shared interfaces
Refactor Concat, Conv, Pad
Build with Conv Einsum and ConvTranspose refactored.
Remove TesnorShape::GetDimsAsVector()
Refactor SliceIterator and SliceIteratorBase
Refactor broadcast
Refactor Pads for twice as long
Remove memory planner intermediate shapes vector
Refactor orttraining
Fix passing TenshroShapeVector to tests
Remove abseil copy and submodule, use FetchContent_Declare/Fetch
Path with separate command
Make RocmAsyncBuffer accept anything convertible to span. Adjust Linux GPU pipeline.
* Added checks for Hetero/Multi
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Remote Context Plugin
* changes for IO Buffer plugin
* erronous couts added
* erronous entry rectified
* Set the Openvino OP Buffer also as output
* Enable AUTO plugin in OpenVINO EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Remote Context Plugin
* changes for IO Buffer plugin
* erronous couts added
* erronous entry rectified
* Added checks for Hetero/Multi
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Set the Openvino OP Buffer also as output
* Enable AUTO plugin in OpenVINO EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Please commit error message and rectification of param.context
* Alignment fixed
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Changed the string to OpenVINO_GPU
* hanged OpenVINO to to OpenVINO_CPU
* Onnxruntime updated API for memory location
* Removing Duplicate LOG Error
* Tensor.h removed DeviceType function. Updated comment
* API Comments updated
* Removing changes to Provider Indo
* Erronous commit
* Removing Extra logs
* Merge CMAKE
* Not copy from a local location
* Duplicate Entry
* Remove extra line
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>