* refactor the EP code.
* remove unnecessary lock.
* fix the comment to claim KernelRegistryManager is not thread safe.
* clarify that the APIs for adding custom ops in InferenceSession are not thread safe.
More C++ API improvements and cleanup
Add templates to tensor creation
Add run method that allows preallocated outputs
Simplify CreateTensor<T> to multiply by sizeof(T)
Convert io_types code
Optimize away vector copies in Session::Run
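A minimal sketch of the "multiply by sizeof(T)" idea, assuming a byte-oriented helper underneath. `TensorView`, `CreateTensorBytes`, and `ExampleByteSize` are hypothetical names for illustration only and are not the ONNX Runtime C++ API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tensor handle; not the ONNX Runtime type.
struct TensorView {
  void* data;
  std::size_t byte_size;
  std::vector<int64_t> shape;
};

// Untyped helper: the caller passes the buffer length in bytes.
inline TensorView CreateTensorBytes(void* data, std::size_t byte_size,
                                    const std::vector<int64_t>& shape) {
  return TensorView{data, byte_size, shape};
}

// Typed helper: the caller passes an element count and the template
// multiplies by sizeof(T), so call sites no longer compute byte sizes.
template <typename T>
TensorView CreateTensor(T* data, std::size_t element_count,
                        const std::vector<int64_t>& shape) {
  return CreateTensorBytes(data, element_count * sizeof(T), shape);
}

// Usage: wrap a buffer of 6 floats as a 2x3 tensor and report its byte size.
inline std::size_t ExampleByteSize() {
  static float values[6] = {0, 1, 2, 3, 4, 5};
  TensorView t = CreateTensor<float>(values, 6, {2, 3});
  return t.byte_size;
}
```

The point of the template is only that the element type carries the size, so the byte arithmetic lives in one place.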
* Change function signature
* Convert compute to use custom op style APIs
* Remove dead CustomOp function
* Use CustomOp API in TensorRT EP
* Switch to new API in ngraph
* More C++ API improvements and conversions
* Mark more constructors as explicit
* Fix CSharp function name changes
* Change more test cases to use C++ API
* allow users to set graph inputs and outputs fully.
* update
* update the comments of the APIs
* update
* remove commented-out code.
* fix test failures.
* fix comments.
* add more checks that throw a not-supported exception for now.
Memory pattern doesn't work with the parallel executor by design. Enabling memory pattern for the parallel executor logs a warning and degrades performance.
Add back the option to enable/disable memory pattern.
Introduce a quick pre-filtering of rules based on the node op types they are targeting.
The goal is to avoid evaluating all rules for all nodes. Instead, for each node, we will only be evaluating the rules associated with its op type.
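The pre-filtering described above can be sketched as an index from op type to rules. `Node`, `RewriteRule`, and `RuleIndex` here are simplified, hypothetical types, and treating an empty target list as "applies to every op type" is an assumption of this sketch:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct Node {
  std::string op_type;
};

struct RewriteRule {
  std::string name;
  // Op types this rule targets. Assumption: empty means "any op type".
  std::vector<std::string> target_op_types;
};

class RuleIndex {
 public:
  // Index the rule under each op type it targets. The index does not own the rule.
  void Register(const RewriteRule* rule) {
    if (rule->target_op_types.empty()) {
      any_op_rules_.push_back(rule);
      return;
    }
    for (const auto& op : rule->target_op_types) {
      by_op_type_[op].push_back(rule);
    }
  }

  // Return only the rules worth evaluating for this node:
  // those registered for its op type plus the "any op type" rules.
  std::vector<const RewriteRule*> CandidatesFor(const Node& node) const {
    std::vector<const RewriteRule*> result;
    auto it = by_op_type_.find(node.op_type);
    if (it != by_op_type_.end()) result = it->second;
    result.insert(result.end(), any_op_rules_.begin(), any_op_rules_.end());
    return result;
  }

 private:
  std::unordered_map<std::string, std::vector<const RewriteRule*>> by_op_type_;
  std::vector<const RewriteRule*> any_op_rules_;
};
```

With this layout a node whose op type has no registered rules costs one hash lookup instead of one condition check per rule.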
* constant node should not be put into graph inputs any more.
* simplify graph input/output set logic.
* refactor comments.
* remove adding initializers as graph inputs when creating graph from scratch.
Some changes that reduce the size of the release onnxruntime.dll by 170KB:
Change the ONNX_OPERATOR_KERNEL macros to not create a unique virtual class per kernel create lambda, but instead use a generic class with the raw function address supplied at BuildCreateKernelInfo time.
Changed the execution providers to use a table-driven approach to calling the BuildCreateKernelInfo functions instead of a massive function with construct/call/delete sequences.
The CreateFunc in data_types.h didn't need to be a std::function, eliminating more lambda virtual classes.
N.B. To accommodate MSVC 14.11 toolchain (used for CUDA builds), the operator+() syntax cannot be used to retrieve the raw function address. The older toolchain can't resolve between cdecl/vectorcall and gives up. An explicit cast is needed to help the compiler along.
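A rough illustration of the three points above (raw function address, table-driven registration, explicit cast instead of operator+()). All types and names below are simplified stand-ins written for this note, not the actual macros or classes in the code base:

```cpp
#include <cassert>
#include <cstring>
#include <memory>

struct OpKernel {
  virtual ~OpKernel() = default;
  virtual int Id() const = 0;
};
struct AddKernel : OpKernel { int Id() const override { return 1; } };
struct MulKernel : OpKernel { int Id() const override { return 2; } };

// Plain function pointer instead of std::function: no type-erased
// wrapper object is generated per lambda.
using KernelCreateFn = OpKernel* (*)();

struct KernelCreateInfo {
  const char* op_name;
  KernelCreateFn create;
};

template <typename K>
KernelCreateInfo BuildCreateKernelInfo(const char* op_name) {
  // Explicit cast of a captureless lambda to the function pointer type.
  // Unary + would also yield a pointer on most compilers, but the cast
  // names the exact target type and leaves nothing for the compiler to pick.
  return KernelCreateInfo{
      op_name,
      static_cast<KernelCreateFn>([]() -> OpKernel* { return new K(); })};
}

inline KernelCreateInfo BuildAdd() { return BuildCreateKernelInfo<AddKernel>("Add"); }
inline KernelCreateInfo BuildMul() { return BuildCreateKernelInfo<MulKernel>("Mul"); }

// Table-driven registration: one array entry per kernel, walked in a loop.
using BuildInfoFn = KernelCreateInfo (*)();
static const BuildInfoFn kBuildTable[] = {&BuildAdd, &BuildMul};

// Look up the create function for an op name; nullptr if not registered.
inline KernelCreateFn FindCreateFn(const char* op_name) {
  for (BuildInfoFn build : kBuildTable) {
    KernelCreateInfo info = build();
    if (std::strcmp(info.op_name, op_name) == 0) return info.create;
  }
  return nullptr;
}
```

Adding a kernel then means adding one table entry rather than another construct/call/delete sequence.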
* Adding a custom op interface to the C API to remove shared library dependency.
* Remove old custom op test
* Rework how custom ops handle inputs/outputs to enable custom op output shape calculation in the compute method
* Add a nicer C++ API for custom ops and switch the tests to use it.
Generalize node removal method in graph_utils. This is a higher-level method that keeps the graph consistent so that no Resolve is needed after the removal of a node.
The new method supports the removal of nodes with a single input (be it an incoming node or an initializer) and a single output (but allowing multiple output edges of that output). It also takes into account the case that one of the output edges is fed to a subgraph.
Also updated the rewrite rules to use this new, less restrictive method, and improved the rules' conditions. Introduced a GraphEdge struct to simplify various methods in graph_utils.
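For illustration, a toy version of the single-input removal: the node's producer is reconnected to every consumer, so the edge list stays consistent without a separate resolve step. `ToyGraph` and this `GraphEdge` are hypothetical, index-based simplifications; the sketch ignores initializers and subgraphs, which the real method handles:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical edge record: producer node index -> consumer node index.
struct GraphEdge {
  std::size_t src;
  std::size_t dst;
  bool operator==(const GraphEdge& o) const { return src == o.src && dst == o.dst; }
};

struct ToyGraph {
  std::vector<GraphEdge> edges;

  // Remove `node` if it has exactly one incoming edge, rewiring its
  // producer to each of its consumers. Returns false and leaves the
  // graph untouched if the node does not have exactly one input.
  bool RemoveSingleInputNode(std::size_t node) {
    std::vector<GraphEdge> incoming, outgoing, rest;
    for (const auto& e : edges) {
      if (e.dst == node) {
        incoming.push_back(e);
      } else if (e.src == node) {
        outgoing.push_back(e);
      } else {
        rest.push_back(e);
      }
    }
    if (incoming.size() != 1) return false;
    // One input, possibly many output edges: connect producer to all consumers.
    for (const auto& e : outgoing) {
      rest.push_back(GraphEdge{incoming[0].src, e.dst});
    }
    edges = std::move(rest);
    return true;
  }
};
```

For edges 0->1, 1->2, 1->3, removing node 1 leaves 0->2 and 0->3.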
* fix graph transformers and refactor tests
* fix merge master
* Set default optimization level to Level1
* fix build warnings for Linux
* try root cause tensorrt test failures
* try root cause tensorrt test failure
* Test level2 transformers with all CI builds
* remove ConvActivation fusion transformer
* change default level back to level1
* remove providers from apply api
* more changes
* Convert unsqueeze elimination to rewrite rule
* Simplify the way we register predefined transformers and rules in the inference session (all details are now moved to the graph transformer utils)
* Some reorganization and renaming of methods in graph_utils
* Updates in graph transformers test
* Update in edge removal to not perform unnecessary check of node args that led to race conditions when updating the graph
* Improve documentation for rewrite rules
* Remove top-down rule-based transformer (given we currently have only one type of rule-based transformer)
* Adding a custom op interface to the C API to remove shared library dependency.
* Fixup const issues
* Renaming to make things a little simpler
* Add a comment
* Test protobuf-lite
* Optimize protobuf usage for LITE_RUNTIME to reduce the binary size of
onnxruntime.dll. More details can be found here https://developers.google.com/protocol-buffers/docs/proto.
The reduction is significant. For commit id: 4873b452151bafe49da332aaeab639ef0318fc1ca28d728, the size
dropped by ~700K, from 4873728 to 4172800 bytes.
* Add LITE_RUNTIME flag in in.proto files
* Fix merge conflict.
* Address PR comments
* Forgot to add 2 files + fix linux and gpu build errors.
* Fix build errors + test failures
* Fix cuda tests
* Fix tensor rt build
* Use full protobuf for trt
* Address PR comments
* Print tensor shape proto as text string for easier debugging
* Check usage of node output as implicit input in any subgraphs.
* Add logic to check/update subgraphs when removing a node.
Fix some issues with Graph
- Include local outer scope variables when validating. Required if calling Resolve on a subgraph
- Include outer scope variables in the value info so the type information is captured. This is also required to Resolve a subgraph, and it allows a type mismatch to be detected (previously we threw the type information away).
- Fix GraphNodes iterator so it can be used with std::find_if. Needed to be assignable so the end_ value can't be const.
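A small sketch of the iterator point, assuming a filtering iterator that skips empty slots. `SkipNullIterator` and `FirstGreaterThan` are hypothetical names, not the GraphNodes code. A `const` data member deletes copy assignment, and standard algorithms are allowed to assign iterators, so `end_` is kept non-const:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <type_traits>
#include <vector>

// Hypothetical iterator over a slot list that skips null entries.
class SkipNullIterator {
 public:
  using iterator_category = std::forward_iterator_tag;
  using value_type = int;
  using difference_type = std::ptrdiff_t;
  using pointer = const int*;
  using reference = const int&;
  using Slot = std::vector<const int*>::const_iterator;

  SkipNullIterator() = default;
  SkipNullIterator(Slot current, Slot end) : current_(current), end_(end) {
    SkipNulls();
  }

  reference operator*() const { return **current_; }
  pointer operator->() const { return *current_; }
  SkipNullIterator& operator++() { ++current_; SkipNulls(); return *this; }
  SkipNullIterator operator++(int) { auto tmp = *this; ++*this; return tmp; }
  bool operator==(const SkipNullIterator& o) const { return current_ == o.current_; }
  bool operator!=(const SkipNullIterator& o) const { return !(*this == o); }

 private:
  void SkipNulls() {
    while (current_ != end_ && *current_ == nullptr) ++current_;
  }
  Slot current_;
  Slot end_;  // not `const Slot`: a const member would delete copy assignment
};

static_assert(std::is_copy_assignable<SkipNullIterator>::value,
              "iterator must be assignable to work with standard algorithms");

// Returns the first non-null value greater than threshold, or -1 if none
// (the -1 sentinel is only a convenience for this sketch).
inline int FirstGreaterThan(const std::vector<const int*>& slots, int threshold) {
  SkipNullIterator first(slots.begin(), slots.end());
  SkipNullIterator last(slots.end(), slots.end());
  auto it = std::find_if(first, last, [threshold](int v) { return v > threshold; });
  return it == last ? -1 : *it;
}
```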
* graph transformers update
* some updates
* plus changes
* more updates
* fixes per review comments
* enable tests
* adding more tests
* more changes
* update api in inference session
* changes per review
* Linux CI fix
* fix linux CI failure
* fix Mac CI failure
* more updates
* add more documentation and add level param to register transformer
* Add AllocKind::kShare to allow copying the MLValue for a pre-existing value to a graph output when an Identity node is involved. Ideally we can make this handling for an Identity node more general purpose, however the current logic to free an MLValue during execution doesn't take into account a re-use point also needing a free. Due to that, limit the scope and start with a somewhat ugly hardcoded approach.
Migrate some changes from PR497
The existing Loop unit tests exercise the new code. Also manually stepped through the problematic model to verify the unnecessary copy was avoided.
* Fix build error
* Fix missing switch case in debug output of allocation plan
* Limit optimization to Loop