* Add ability to specify just the device when using IOBinding for an output. This enables keeping an output on a different device GPU when it has a dynamic size that is not known ahead of graph execution.
* Move allocators to SessionState so they're decoupled from ExecutionProviders
- when looking up an allocator it's based on OrtMemoryInfo not the EP so SessionState is a more natural place for that infromation to be stored
- add device based lookup
- simplifies logic for copying feeds/fetches across devices
Cleanup SessionState and SessionStateInitializer
- provide more things to SessionState at construction time so we don't construct and instance and immediately after call a bunch of setters
- simplify SessionStateInitializer
- reduced down to FinalizeSessionState method
* Outputs from model execution should always be returned in a newly allocated buffer or an pre-allocated buffer provided in fetches. When an initializer is providing a graph output (e.g. constant folding may result in this) we were returning an OrtValue that pointed to the initializer and not a separately allocated buffer with a copy.
This was wrong as:
- value wasn't returned in a pre-allocated fetch so whilst the value returned was correct, it was returned in the wrong place
- user could alter the data in the initializer via the returned value
* Add unit test with and without pre-allocated fetch.
* Add some extra info around why we're handling this special case.
1. Fix static analysis warnings found by VC++
2. Add a new pipeline for static analysis
3. Merge all the windows CI build into one single yaml file.(Easier to queue them all).
4. Make DNNL build faster by disabling building the tests and examples.
5. Enable custom op unitest.
1. Copy tensorflow's thread pool class to ORT, so that we can get a better implementation of thread pool based parallelfor
2. Copy Eigen's thread pool class to ORT
3. Support thread affinity
4. Remove RNN kernel’s private thread pool
5. Modify pool kernels to use the thread pool when openmp is disabled.
* Add support for sessions to share a global threadpool.
* Fix build issues
* Add tests, fix build issues.
* Added some documentation
* Fix centos issue when threadpools become nullptr due to 1 core.
* Fix mac and x86 build issues
* Address some PR comments
* Disabled test for android, added few more tests and addressed more PR comments.
* const_cast
1. Enable warning "4503" # Decorated name length exceeded.
2. Enable warning "4146" # unary minus operator applied to unsigned type.
3. Enable float64 support for the Softmax operator
4. Enable compliance checks for Windows x86 32bits build
5. Use TryBatchParallelFor to replace some fallback code in mlas pooling.cc
6. Fix Android CI pipeline.
WinML would like to update the googletest submodule. They want some newer features (namely GTEST_SKIP to skip tests programmatically and be able to skip entire fixtures easily) and would need to update the submodule version.
However, because the new version of code hit a bug in gcc, even though the bug is already fixed in the latest gcc but we're using gcc 4.8.x and it won't get patched for the bug, so we have to do a compromise, change our code a little bit to make it work.
The gcc bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51213
* Introduce execution mode for clarity and extensibility; Change Python APIs accordingly; Replace DisableSequentialExecution API with EnableParallelExecution for clarity.
* Fix cuda build
* Modify the test slightly
* Make C and C# APIs consistent with Python.
Remove gsl subodule and replace with a local copy of gsl-lite
Refactor for onnxruntime::make_unique
gsl::span size and index are now size_t
Remove lambda auto argument type detection.
Remove constexpr from fail_fast in gsl due to Linux not being happy.
Comment out std::stream support due to MacOS std lib broken.
Move make_unique into include/core/common so it is accessible for server builds.
Relax requirements for onnxruntime/test/providers/cpu/ml/write_scores_test.cc
due to x86 build.
Add ONNXRUNTIME_ROOT to Server Lib includes so gsl is recognized
Description: Refine threading control options and move inter op thread pool to session state.
Added thread_utils.h/cc to centralize the decision around the thread pool size under various conditions.
Motivation and Context
Currently the thread pool size of the parallel executor is hardcoded to 32 for some reason. This PR makes the options to configure the thread pool sizes clearer.
1.Let mlas use session thread pool
2.Remove onnxruntime_USE_MLAS cmake option
3. Remove the win32 thread pool code inside mlas
mlas will:
1.use ort thread pool if it get passed in
2.use openmp if the threadpool parameter is nullptr
3.run single threaded if the threadpool parameter is nullptr and openmp is disabled.
* Add AllocKind::kShare to allow copying the MLValue for a pre-existing value to a graph output when an Identity node is involved. Ideally we can make this handling for an Identity node more general purpose, however the current logic to free an MLValue during execution doesn't take into account a re-use point also needing a free. Due to that, limit the scope and start with a somewhat ugly hardcoded approach.
Migrate some changes from PR497
The existing Loop unit tests exercise the new code. Also manually stepped through the problematic model to verify the unnecessary copy was avoided.
* Fix build error
* Fix missing switch case in debug output of allocation plan
* Limit optimization to Loop
1. Support the new external data extension in ONNX 1.4 onnx/onnx#678
2. Enable onnxruntime_perf_test in Mac Build
3. move path_lib.h from onnx_test_runner source dir to onnxruntime_framework
4. Enable memory planner for string tensors
5. Make memory planner always enabled, to simplify model loading logic
6. Delete some duplicated code between onnxruntime_perf_test and onnx_test_runner
7. Delete win_getopt_mb lib.
8. Remove the dependency on Pathcch lib, which is only available on Windows 8 and newer.
* Break dependency on SessionState for ExecutionFrame and OpKernelContext so optimizers can execute a node with a minimal setup.
- Create IExecutionFrame
- split out core logic and interface from extended logic used in full Graph execution (that uses allocation plan and memory pattern planner)
- Update NodeIndexInfo to allow contruction from a subset of nodes
- split out logic from GraphNodes into a re-usable template so it can be used with a vector of const Node* as well as a vector of unique_ptr<Node>
- Remove SessionState from OpKernelContext
- Misc cleanups
- move AllocPlanPerValue out of SequentialExecutionPlan as it's used in a more generic manner that isn't specific to a sequential execution plan
NOTE: I manually tested the new paths, especially NodeIndexInfo. There will shortly be optimizers added that use the new infrastucture so they'll get test coverage as part of those changes.
* Fix linux build issue.
Handle graph with no nodes in NodeIndexInfo.
* Various optimizations to reduce the setup and execution cost.
Cache information about the feeds and fetches, and any device copies required to execute the graph so we minimize checking for later calls to ExecuteGraph using the same input/output.
- enable use of caching in Loop and Scan
- make use of caching optional for InferenceSession::Run
- handle calls to Run with different feeds and fetches to support scenarios where there may be a truncated sequence in some calls
Take the feed names and MLValue instances as vectors so the order is deterministic.
Add unit tests
Update onnxruntime_perf_test to enable caching.
* Couple of tweaks.
Fix shared library unit test failure.
Attempt to workaround MacOS build failure due to VC++ bug around including reaching scope values in a lambda automatically.
* Rework order of init in Run so we get nice error messages about invalid feed/output names.
* Refine logic around copying MLValue using execution provider so common code can be used. Simplify the logic due to this change.
Split the paths for executing with/without cached info so we can be more const correct with how FeedsFetchesManager is passed in. This makes it clearer when a shared instance can be used due to it being const.
Cache the FeedsFetchesManager instances in the control flow nodes. They can be re-used across calls to Compute.
* Removed unused local variable to fix some builds.
* Fix build issue by cleaning up some more unused params.
* Check names when using cache entry from SessionState. Add unit test.
* Separate out the NodeArg index information from ExecutionFrame so it is only calculated once.
* Skip copy to/from device if only CPU execution provider is registered.
Cleanups.
* Address PR comments.
Clean up a few areas.
* Fix Linux build error
* Add the ability to use a custom allocator for fetches.
Allows control flow nodes to forward the allocation to the control flow op and avoid an unnecessary copy when the subgraph output has a symbolic dimension.
Update Scan and If to use custom allocators when applicable.
* Remove unnecessary forward declaration
* Fix Mac build warnings
Applies to all public headers and macros, plus many internal ones. There are still some internal things with OnnxRuntime in the name, but this fixes all public functions & macros.
* Checkpoint.
* Add doco to graph.h and graph_base.h.
Change NodeConstIterator to return a reference to clearly advertise no nullptr's are going to be returned as it's only iterating valid Nodes.
Fix some code analysis warnings.
* Make a couple of APIs return a reference instead of a pointer as they never return nullptr.
* More doco and some minor naming cleanups.
* Cleanups
Couple more consistency changes.
* Fix CUDA test file
* Fix invalid line.