* Extend C++ API for Map/Sequence Type Info (#3517)
Expose type information for sequences and maps through the
C++ API.
- Add functions
- `TypeInfo::GetSequenceTypeInfo`
- `SequenceTypeInfo::GetSequenceElementType`
- `TypeInfo::GetMapTypeInfo`
- `MapTypeInfo::GetMapValueType`
- `MapTypeInfo::GetMapKeyType`
- Add structs
- `SequenceTypeInfo`
- `MapTypeInfo`
Co-authored-by: Dudeldu <mustermann.informatik@gmail.com>
Co-authored-by: Jonas-Heinrich <Jonas@JonasHeinrich.com>
* Extend tests to cover new type info functionality for sequences and maps
- two new test cases in test_nontensor_types for maps and sequences
Co-authored-by: Jonas-Heinrich <Jonas@JonasHeinrich.com>
* Next round of changes.
Remove inclusion of ONNX schema header
Exclude custom registry related things
Move IsConstantInitializer from graph_utils to Graph as it's needed in a minimal build and graph_utils is excluded.
* Add support for sharing allocators
* Incremental update
* Address some PR comments, add unit tests, add documentation.
* Address PR comments, add tests and some documentation.
* Fix build and test issues
* Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this.
Also fixed the orttraining segfault. It occurred because, in a training session,
the CPU execution provider is not yet available when the transformers are applied. A new one
is now created instead.
* cancel night build on pyop
* add rewriter to rewrite cpu provider
* skip BuildKernelCreateInfo<void>
* refactor variable name and comment
* include ops from csv file
* process multiple eps
* add default function to cuda provider
* rename function and add license header
* fix import
* add doc
* fix typo
* deal with empty kernel entry in cuda
* rename the rewriter file
* add comment into provider file
* add comment and rename function
* log warnings
* refactor extracting logic
* add entry for script to run solo
* add better example
* avoid onnx importing
* fix flake8 alerts
* minor fixes to better comments and doc
* add entries for all domains
* add void entry into contrib providers
* format cuda_contrib_kernels.cc
* format cpu_contrib_kernels.cc
* add all providers
* add default entry to all providers
* include op_kernel header
* cancelling change in providers beyond cpu/cuda
* rename file and switch file format to domain;opset;op1,op2...
* update doc
* restore non-regular ending grammar in cuda_contrib_kernels.cc
* add ort_root as input argument of script
* enable test in ci
* update doc
* update doc
* revert change on linux gnu ci
* switch to set to host ops
* simplify trimming logic
* add domain map to track current model
* allow ort_root to take relative path
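For the `domain;opset;op1,op2...` file format mentioned above, a minimal parsing sketch might look like the following. The function name `parse_required_ops`, the handling of blank lines and `#` comments, and the sample domain names are assumptions for illustration, not taken from the actual script.

```python
from collections import defaultdict


def parse_required_ops(lines):
    """Parse 'domain;opset;op1,op2,...' lines into {domain: {opset: set(ops)}}.

    Hypothetical helper: the real script may differ in naming and edge cases.
    """
    required = defaultdict(lambda: defaultdict(set))
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        domain, opset, ops = line.split(";", 2)
        # A set keeps each operator once even if it is listed on several lines.
        required[domain][int(opset)].update(
            op.strip() for op in ops.split(",") if op.strip()
        )
    return required


if __name__ == "__main__":
    sample = ["ai.onnx;12;Add,Mul", "ai.onnx;12;Mul,Relu", "com.microsoft;1;Gelu"]
    for domain, opsets in parse_required_ops(sample).items():
        for opset, ops in opsets.items():
            print(domain, opset, sorted(ops))
```

Keeping the operators in a set per domain and opset makes repeated entries harmless, which fits the "switch to set to host ops" change.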
* Initial set of changes to start disabling code in the minimal build. Breaking changes into multiple PRs so they're more easily reviewed. Focus on InferenceSession, Model and Graph here. SessionState will be next.
Needs to be integrated with de/serialization code before being testable so changes are all off by default.
Changes are limited to
- #ifdef'ing out code
- moving some things around so there are fewer #ifdef statements
- moving definition of some one-line methods into the header so we don't need to #ifdef out in a .cc as well
- exclude some things in the cmake setup
* Update session state and a few other places.
The core code builds if ORT_MINIMAL_BUILD is specified.
* Add Node::SinceVersion() so that the value is known when loading a graph from the ORT format (OpSchema is not available).
* Fix build warning from returning 'const int'
* adding generic configurations for session options
* fix a build break on linux
* fix training ci build break
* fix training ci build break
* addressed CR comments
* fix training ci build break
* move config_key from enum to string
* add c# api
* add python api
* fix build break
* move prepacking from 2 new api entries to session options configs
* fix training ci build break
* add python test, update some comments, move const key definition to avoid build break
* addressed comments
* move definitions of keys to common.h
* move api to version 5
* remove accidental change in build.py
* remove pragma to avoid build break
* addressed CR comments
* fix the python build break, and move location of config keys definition
* small typo changes
* Eliminate redundant subexpressions
Apply local value numbering to merge graph nodes that will always
evaluate to the same value.
* Rename cpp->cc
* Handle optional arguments
* Add test models
* Add more tests with optional arguments
* Fix processing of subgraphs
Also, be resilient to possible mixture of optional and variadic
parameters
* Fix random operators
* Address PR comments
* Minor changes and a test
* Move CSE before constant folding
* Random* operators are always non-deterministic
Even when seed is provided.
* Fix a CSE test
* Reuse the list of non-deterministic operators with constant folding pass
* Address PR comments
* Fix formatting
* Address PR comment
* Minor cleanup / comments
* Fix build failure in Linux
* Reuse existing optimizer/utils file.
Also, check for graph outputs when removing a node.
* Add a test
* Fix compiler warnings
* Fix build in older compilers
* More compatibility with old STL versions
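The local value numbering idea behind the redundant-subexpression elimination above can be sketched roughly as follows. This is a simplified illustration, not the optimizer's code: nodes are plain `(output, op_type, inputs)` tuples, and attributes, subgraphs, optional or variadic inputs, and graph outputs are all ignored. The operator names in `NON_DETERMINISTIC_OPS` are an illustrative list.

```python
# Illustrative list: operators whose results differ between evaluations
# and therefore must never be merged.
NON_DETERMINISTIC_OPS = {
    "RandomUniform", "RandomNormal",
    "RandomUniformLike", "RandomNormalLike", "Multinomial",
}


def eliminate_common_subexpressions(nodes):
    """Merge nodes that always evaluate to the same value.

    nodes: list of (output_name, op_type, input_names) in topological order.
    Returns (kept_nodes, replaced) where replaced maps a removed output
    to the surviving output that computes the same value.
    """
    value_of = {}   # (op_type, inputs) -> output that already computes it
    replaced = {}   # removed output -> surviving output
    kept = []
    for output, op_type, inputs in nodes:
        # Rewrite inputs through earlier merges so equal expressions share a key.
        inputs = tuple(replaced.get(name, name) for name in inputs)
        key = (op_type, inputs)
        deterministic = op_type not in NON_DETERMINISTIC_OPS
        if deterministic and key in value_of:
            replaced[output] = value_of[key]
            continue
        if deterministic:
            value_of[key] = output
        kept.append((output, op_type, inputs))
    return kept, replaced


if __name__ == "__main__":
    graph = [
        ("a", "Add", ("x", "y")),
        ("b", "Add", ("x", "y")),   # duplicate of a
        ("c", "Mul", ("a", "z")),
        ("d", "Mul", ("b", "z")),   # duplicate of c once b is rewritten to a
    ]
    print(eliminate_common_subexpressions(graph))
```

Because inputs are rewritten before the key is built, duplicates that only become visible after an earlier merge are also caught in the same pass.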
When the thread pool is configured to spin, the main thread now also spins at the barrier at the end of a parallel section, in addition to workers spinning while they wait for work.
Barrier.h takes an additional boolean that selects spin or block, and the value passed in is based on the thread pool configuration.
An additional test case for barriers is included, although it did not identify any problems.
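A rough sketch of a barrier with a spin/block switch, written in Python for brevity. The class and method names and the `spin` flag are assumptions and do not mirror Barrier.h. Under CPython's GIL the spinning path only illustrates the control flow, not the latency benefit.

```python
import threading


class Barrier:
    """Counts down from `count`; wait() either spins or blocks (hypothetical API)."""

    def __init__(self, count, spin):
        self.remaining = count
        self._spin = spin
        self._lock = threading.Lock()
        self._done = threading.Event()
        if count == 0:
            self._done.set()

    def notify(self):
        # Called once by each participant when its share of the work is finished.
        with self._lock:
            self.remaining -= 1
            finished = self.remaining == 0
        if finished:
            self._done.set()

    def wait(self):
        if self._spin:
            # Busy-wait: avoids a blocking wake-up path at the cost of CPU time.
            while not self._done.is_set():
                pass
        else:
            # Block until the last participant signals completion.
            self._done.wait()


if __name__ == "__main__":
    barrier = Barrier(4, spin=True)
    threads = [threading.Thread(target=barrier.notify) for _ in range(4)]
    for t in threads:
        t.start()
    barrier.wait()
    for t in threads:
        t.join()
```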
* Gelu Activation Recompute Draft
* Prototype for localized recompute
* Introduce localized_recompute rewriter
* Command line args for enabling recompute
* Add logger to Gradient Graph Builder
* use const when possible
Update TransposeMatMul to support scaling of the matrix product by a constant scalar value (analogous to the GEMM alpha parameter). Rename TransposeMatMul to TransposeScaleMatMul.
Fuse MatMul with surrounding Mul/Div with constant scalar into TransposeScaleMatMul.
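The fusion relies on a constant scalar Mul or Div around a MatMul being the same as scaling the matrix product by `alpha`. A small pure-Python sketch of that equivalence follows. The parameter names `alpha`, `trans_a` and `trans_b` are assumptions for illustration and are not confirmed attribute names of the operator.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose_scale_matmul(a, b, alpha=1.0, trans_a=False, trans_b=False):
    """Compute alpha * op(a) @ op(b), where op optionally transposes its input."""
    if trans_a:
        a = transpose(a)
    if trans_b:
        b = transpose(b)
    return [[alpha * v for v in row] for row in matmul(a, b)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    # Unfused: MatMul followed by Div with the constant 4.
    unfused = [[v / 4 for v in row] for row in matmul(a, b)]
    # Fused: Div by c becomes alpha = 1 / c.
    fused = transpose_scale_matmul(a, b, alpha=0.25)
    print(unfused == fused)
```

A Mul by a constant `c` maps to `alpha = c`, and a Div by `c` maps to `alpha = 1 / c`.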
While investigating an unrelated issue, I noticed that the thread pool may drop tasks when a burst of 1024+ tasks is submitted by a thread from inside the pool. Today, in general, we execute work synchronously in this case. However, there is a bug where work submitted by a thread already inside the pool will be discarded instead of executed. Currently the only scenario where I can see this occurring is when the parallel executor is used with a model in which such a large number of nodes become eligible to run all at once. This PR fixes the underlying issue and adds a test case for burst-submission of work.
* Add ability to retrieve inferred shapes when executing a kernel.
This lets Recv know its output shapes without doing any actual
communication. If the output shapes cannot be inferred, Recv still
needs to communicate with Send to get them.
* Avoid communicating shape information when it can be inferred statically
* Replace unordered_map with thread-safe wrapper.
We don't want race conditions and undefined behavior
when using the parallel executor.
* Remove cout
* Add missing file
* Address comments
* Check dim_value. -1 means missing
* lock properly
* Address comments (remove thread-safe map)
* Remove poc header
* Replace Stream with DeferredReleaseCPUPtr
* Add python API for specifying CUDA device id
* Provide a session-based Python API for specifying the
device id
* Including the header file pybind11/stl.h automatically enables
conversion between C++ containers and Python data structures such as
list and dict.
https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#
Refactor the code to take advantage of this.
* Make struct CudaDeviceOptions the default CUDA device options
* Implement sess.set_providers(list_of_providers, list_of_provider_option_dicts)
But still stay consistent with existing sess.set_providers(list_of_provider)
* Add cuda provider option default setting
* Add support for setting the CUDA options cuda_mem_limit and arena_extend_strategy.
Also resolved the merge conflict in session.py
* Use python ctypes to call cuda library to help python unittest
* Refine the code with reviewer's suggestions
* Add the capability of getting execution provider's configuration
- Once we introduced the capability to set execution provider's
configuration, it makes sense to add capability of getting ep's configuration.
* Modify the code with reviewer's suggestions.
* Use stoull() or stoul() depending on whether the architecture is 32-bit or 64-bit.
* Rewrite the testcases for testing setting CUDA device id
Note: We need to make sure every ORT process runs on one CUDA device
at a time.
* Make sure the old session object is destroyed by the Python gc before a new
session object is created
* Move testcases to original onnxruntime_test_python.py
* Fix bugs to pass CI build
* Make it pass CI build (cont.)
* Make it pass CI build (cont.)
* support bert partition with shared initializer
* address feedback
* address feedback
* address feedback
* add more test
* remove bert-tiny model
* address feedback
* address function comment
* move CreateNodeArg to graph_utils
* rename function name
* rename function name
* fix windows build
* fix windows type conversion warning
* add function comment
Create N-1 threads in a thread pool when configured with intra-op parallelism of N. This ensures we have N active threads, given that the main thread also runs work. To avoid ambiguity on the value returned, rename ThreadPool::NumThreads method to ThreadPool::DegreeOfParallelism, and make corresponding updates in MLAS and operators.
For the special case where all variadic inputs of a kernel are the same shape (i.e. no broadcasting is required) and there are few enough of them, we perform the entire computation in a single kernel. The general implementation (which was previously used for this special case) handles broadcasting by repeatedly invoking a binary kernel on successive inputs.
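The difference between the two paths can be sketched roughly as follows, using an elementwise sum over 1-D lists as a stand-in for the variadic operator. The function names, the `max_fused_inputs` threshold and the length-1 broadcasting rule are assumptions for illustration only.

```python
from functools import reduce


def binary_add(x, y):
    """Binary kernel with minimal broadcasting: a length-1 input is expanded."""
    n = max(len(x), len(y))
    bx = x * n if len(x) == 1 else x
    by = y * n if len(y) == 1 else y
    return [a + b for a, b in zip(bx, by)]


def variadic_sum_general(inputs):
    # General path: invoke the binary kernel on successive inputs (N-1 passes).
    return reduce(binary_add, inputs)


def variadic_sum_same_shape(inputs):
    # Special case: no broadcasting needed, so one pass reads every input
    # and produces each output element directly.
    return [sum(column) for column in zip(*inputs)]


def variadic_sum(inputs, max_fused_inputs=8):
    # max_fused_inputs is a hypothetical limit for "few enough" inputs.
    same_shape = all(len(x) == len(inputs[0]) for x in inputs)
    if same_shape and len(inputs) <= max_fused_inputs:
        return variadic_sum_same_shape(inputs)
    return variadic_sum_general(inputs)


if __name__ == "__main__":
    print(variadic_sum([[1, 2], [3, 4], [5, 6]]))   # single-pass path
    print(variadic_sum([[1, 2], [10], [5, 6]]))     # general broadcasting path
```

The single-pass path avoids writing and re-reading the intermediate results that each binary invocation would otherwise produce.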
* add modern standards to function arguments
* code cleanup
* fix code formatting
* add element access convenience function
* change template type name to match rest of code
* remove new At() convenience function
* add better documentation message
* Update function body initialization
* minor fix
* changes per review comments
* minor fix
* format fix
* add function initialization in mixed precision transformer
* more updates
* more fixes
* Move allocators to SessionState so they're decoupled from ExecutionProviders
- when looking up an allocator it's based on OrtMemoryInfo not the EP so SessionState is a more natural place for that information to be stored
- add device based lookup
- simplifies logic for copying feeds/fetches across devices
Cleanup SessionState and SessionStateInitializer
- provide more things to SessionState at construction time so we don't construct an instance and immediately call a bunch of setters
- simplify SessionStateInitializer
- reduced down to FinalizeSessionState method
As a zero-cost wrapper around the C API, the current state of the C++ API is still pretty low-level and requires programmers to use C-style standards to interact with ONNX.
- Move thread hint vectors from thread-local struct
- Add static_assert that the per-thread state in the thread pool is trivially-destructible
- Rename "thread_data" to "worker_data" (only allocated for workers in the pool, not threads calling into the pool)
Updates the thread pool implementation to make work distribution over the Eigen thread pool more closely resemble techniques used in OpenMP. In particular:
(1) A thread entering a parallel loop works on the iterations itself, rather than requiring a thread switch to/from a thread in the pool, if called from outside the thread pool.
(2) To support this, work items pushed to the thread pool run a loop to claim iterations from a shared counter via atomic-fetch-and-add, as opposed to having work items themselves represent individual batches of iterations. This means that any thread working on the loop can execute any batch of iterations, including having the main thread run through all of the batches itself if the loop turns out to be short-running.
(3) As with OpenMP active scheduling, the worker loop spins waiting for work prior to blocking. This avoids OS blocking / wake-up paths in workloads with series of short-running parallel sections.
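Points (1) and (2) can be sketched roughly as follows. This is a simplified illustration and not the Eigen thread pool code: threads are created per loop instead of living in a pool, a lock stands in for atomic fetch-and-add, and the spin-wait in (3) is not modeled. All names are hypothetical.

```python
import threading


def parallel_for(total, block_size, fn, num_workers):
    """Run fn(start, end) over [0, total) in batches claimed from a shared counter."""
    next_start = 0
    lock = threading.Lock()  # stands in for an atomic fetch-and-add

    def claim():
        nonlocal next_start
        with lock:
            start = next_start
            next_start += block_size
        return start

    def run_loop():
        # Any thread can execute any batch: keep claiming until none remain.
        while True:
            start = claim()
            if start >= total:
                return
            fn(start, min(start + block_size, total))

    workers = [threading.Thread(target=run_loop) for _ in range(num_workers)]
    for w in workers:
        w.start()
    run_loop()  # the calling thread works on iterations itself
    for w in workers:
        w.join()


if __name__ == "__main__":
    squares = [0] * 10

    def fill(start, end):
        for i in range(start, end):
            squares[i] = i * i

    parallel_for(len(squares), block_size=3, fn=fill, num_workers=2)
    print(squares)
```

If the loop is short-running, the calling thread may claim every batch itself before any worker starts, which is the behavior item (2) describes.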
* Added GetAvailableProviders to C API
* Fix API version and Windows build error
* Changed function name
* Changed ORT_API_VERSION to 4
* Moved all_providers array to constants.h
* Move check for providers to constants.h
* Changed name of array to avoid warning
* Address review comment
* Added unit test
- Update IAllocator setup to move the OrtMemoryInfo to the base class instead of requiring derived classes to have that as a member and override a virtual method to return it.
- Cleanup CreateAllocator setup to take an argument as to whether to wrap the device allocator in an arena allocator. The choice to do that isn't a property of the underlying device allocator.
- Minor cleanups in the various EPs to adjust to the change to IAllocator and CreateAllocator, and to use the create_arena flag consistently when available.
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
the symbolic shapes can be resolved.
* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
This is done by removing all but one gradient tensor from the last
RecordEvent in the backward pass.
* Address a comment
* Fix Windows build