* Allow users to specify a compute stream per session
Create the CUDA compute stream explicitly rather than using the legacy default stream or the per-thread default stream.
remove some redundant cudaStreamSynchronize calls
fix gpt2 model test failures
don't use default stream in nccl either.
add stream synchronization in OnRunEnd()
use cub::DeviceScan::InclusiveSum, which can be called with a stream specified.
fix topK failure due to latest rebase
fix tensorrt
support user specified stream
add user_stream support in tensorrt EP
use the same stream for both the TensorRT and CUDA EPs.
fix ScatterND
specify stream for adasum and p2p kernels.
fix loop
fix CApiTest.custom_op_handler
fix CApiTest.varied_input_custom_op_handler
change for cudaMemcpyFromSymbol
improve provider options for user specified compute stream
* add changes for ROCM EP
* fix GatherGrad UT for ROCM EP
* clean code and fix NonMaxSuppression
* use default stream for ROCM now
* fix CApiTest.custom_op_handler:OrtFormatCustomOpTests.ConvertOnnxModelToOrt
* fix tensorrt ut: CApiTest.io_binding_cuda
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* Add ability to generate configuration that includes required types for individual operators, to allow build size reduction based on that.
- Add python bindings for ORT format models
- Add script to update bindings and help info
- Add parsing of ORT format models
- Add ability to enable type reduction to config generation
- Update build.py to only allow operator/type reduction via config
- simpler to require config to be generated first
- can't mix a type aware (ORT format model only) and non-type aware config as that may result in insufficient types being enabled
- Add script to create reduced build config
- Update CIs
* Share allocator between CUDA EP & TRT EP.
limitation:
1. Does not cover the per-thread allocator created by the CUDA EP; still need to figure out a way to remove it
2. Need more identifiers to be able to share the CPU allocator across all EPs
Update Python API to allow more flexibility for setting providers and provider options.
The providers argument (InferenceSession/TrainingSession constructors, InferenceSession.set_providers()) now also accepts a tuple of (name, options dict).
Fix get_available_providers() API (and the corresponding function in the C API) to return the providers in default priority order. Now it can be used as a starting point for the providers argument and maintain the default priority order.
Convert some usages of the deprecated global configuration functions to use EP-specific options instead.
Update some EP-specific option parsing to fail on unknown options.
Other clean up.
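
As a rough illustration of the shapes the providers argument now accepts, the sketch below splits a mixed list of names and (name, options dict) tuples into parallel lists. `normalize_providers` is a hypothetical helper written for this note, not part of the onnxruntime API, and the provider names and the `device_id` option are sample values only.

```python
def normalize_providers(providers, available):
    """Hypothetical helper: split a mixed providers list into names and options.

    Each entry is either a provider name (str) or a (name, options dict) tuple.
    Names that are not in `available` are rejected.
    """
    names, options = [], []
    for entry in providers:
        if isinstance(entry, str):
            name, opts = entry, {}
        elif (
            isinstance(entry, tuple)
            and len(entry) == 2
            and isinstance(entry[0], str)
            and isinstance(entry[1], dict)
        ):
            name, opts = entry
        else:
            raise ValueError(f"unsupported providers entry: {entry!r}")
        if name not in available:
            raise ValueError(f"unknown provider: {name}")
        names.append(name)
        options.append(dict(opts))
    return names, options


# Sample values only; a real list would come from get_available_providers(),
# which returns the providers in default priority order.
AVAILABLE = ["CUDAExecutionProvider", "CPUExecutionProvider"]

names, options = normalize_providers(
    [("CUDAExecutionProvider", {"device_id": 0}), "CPUExecutionProvider"],
    AVAILABLE,
)
```

Keeping names and options in parallel lists preserves the caller's priority order while leaving providers without options with an empty dict.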
* Fix issue: https://github.com/microsoft/onnxruntime/issues/6094
Root cause: we didn't expose the OrtMemoryInfo for TRT, which causes problems when a user wants to use IOBinding with TensorRT.
Short-term fix: add the OrtMemoryInfo for TRT. Long term, the allocators for CUDA and TRT should be unified.
* allow custom op taking varied types
* refactor test case
* add test model
* refactor test case
* enable copy elision
* update test case
* fix issue in ToString function
* add case for cpu custom op on gpu
* format doc
* restrict GPU custom op on Linux GPU CI only
* separate cu file into an independent project
* fix typo
* include cuda_add lib
* move lib def
* add file header
Co-authored-by: RandySheriffH <rashuai@microsoft.com>
Add tag types for Ort::Float16_t and Ort::BFloat16_t structs
that contain uint16_t values for float16 and bfloat16.
These will serve as type dispatching types for C++ API.
They are of uint16_t size and arrays of these types can be used
to create Tensors of the corresponding types.
Make documentation Doxygen compliant.
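
To make the "uint16_t values" point concrete, here is a minimal sketch of how a float maps to the 16-bit patterns such types would carry. The function names are hypothetical, and the bfloat16 conversion simply truncates the float32 encoding (no rounding), which is a simplification chosen for this sketch rather than what the library necessarily does.

```python
import struct


def float_to_float16_bits(value):
    """Return the IEEE 754 binary16 bit pattern of `value` as an int (0..0xFFFF)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def float_to_bfloat16_bits(value):
    """Return a bfloat16 bit pattern by keeping the top 16 bits of the float32 encoding.

    Truncation (no rounding) is an assumption made to keep the sketch short.
    """
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


half_bits = [float_to_float16_bits(v) for v in (1.0, -2.0)]
bf16_bits = [float_to_bfloat16_bits(v) for v in (1.0, -2.0)]
```

An array of such 16-bit integers has the same size and layout as an array of the tag types, which is why it can back a tensor of the corresponding element type.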
* add case for cpu custom op on gpu
* format doc
* restrict GPU custom op on Linux GPU CI only
* separate cu file into an independent project
* fix typo
Co-authored-by: RandySheriffH <rashuai@microsoft.com>
* Allow sharing of initializers between sessions.
* Allow sharing of initializers between sessions (2).
* Add test for C#
* Add test for C#; address PR comments
* Address PR comments
Moved AddInitializer logic to internal session options
Added tests for owned buffer
Clarified documentation
Fix bug where memory info and not device was getting compared
* Fix test
* Fix training build
* Add ver 5 end marker and ver 6 starter, add scenario and usage examples.
* Refactor TensorAt
locations* must be const and int64_t since our dims are int64_t
Remove unnecessary copy of locations.
Remove unnecessary casting and C-casting. Simplify implementation.
Add a check for string type.
Make the C++ API return T& to fully expose the C API in C++, and take const std::vector& rather than
a vector by value, as it covers more ground and eliminates a redundant copy.
Eliminate inner loop, compute strides first.
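
The "compute strides first" idea can be sketched as follows. This is an illustrative Python version assuming a contiguous row-major layout; `row_major_strides` and `flat_offset` are hypothetical names, not the functions changed in this commit.

```python
def row_major_strides(dims):
    """Return element strides for a contiguous row-major tensor of shape `dims`."""
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    return strides


def flat_offset(dims, location):
    """Map a multi-dimensional `location` to a flat element offset.

    Strides are computed once up front, so the offset is a single pass
    over the indices with no nested loop.
    """
    if len(location) != len(dims):
        raise ValueError("location rank does not match tensor rank")
    for index, dim in zip(location, dims):
        if not 0 <= index < dim:
            raise IndexError(f"index {index} out of range for dimension of size {dim}")
    strides = row_major_strides(dims)
    return sum(index * stride for index, stride in zip(location, strides))


offset = flat_offset([2, 3, 4], [1, 2, 3])
```

For shape [2, 3, 4] the strides are [12, 4, 1], so location [1, 2, 3] maps to the last element of the buffer.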
* add GetStartTime() for profiler
* add function in inference_session
* remove qualified name
* add the api in cxx_api.h
* rename starttime to StartTimeNs, expose profiling object
* rename GetProfilingStartTime
* move Ortapis to the right place
* move to the end
* add const for session
* const the right place
* use const auto instead of const auto* for session
* remove const for auto getstarttime
* remove const for auto getstarttime
add unit tests
* nit: update test name and add comments
* Rename DeviceAllocatorRegistrationInfo to a more generic name; Remove OrtMemType; Simplify CreateAllocator interface.
* - fix builds
- fixed mixed aggregation + constructor calls (which were coded before this PR)
- changed default value of max_mem in API header
- added some validation of values for arena_extend_strategy
* fix tensorrt and cuda tests
* cancel night build on pyop
* setup ci pipeline for build of reduced ops
* add back c# test
* remove debugging print
* add testing model
* add more arg in pipeline script
* disable pipeline trigger temporarily
* fix yaml format
* fix yaml format
* fix pipeline error
* get rid of c# test
* add ops for test cases
* add Conv from domain com.microsoft.nchwc
* remove --reduce_ops
* fix typo
* remove --build_java
* add test case for excluded op
* update doc with --skip_test
* formatting code, renaming files and simplify yaml
* remove debug build from yaml
* remove surplus ops from included_ops.txt
* add MinSizeRel build to yaml
* rename test cases and models
* exclude ir test from minimum build
* restrict ir test to be only applied to reduced ops build
* Add support for sharing allocators
* Incremental update
* Address some PR comments, add unit tests, add documentation.
* Address PR comments, add tests and some documentation.
* Fix build and test issues
* Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this.
Also fixed the orttraining segfault. The segfault was because in the case of training session,
the CPU exec prov is not available at the time the transformers are applied. Changed it to create
a new one.
* Added GetAvailableProviders to C API
* Fix API version and Windows build error
* Changed function name
* Changed ORT_API_VERSION to 4
* Moved all_providers array to constants.h
* Move check for providers to constants.h
* Changed name of array to avoid warning
* Address review comment
* Added unit test
1. Fix static analysis warnings found by VC++
2. Add a new pipeline for static analysis
3. Merge all the Windows CI builds into a single yaml file (easier to queue them all).
4. Make DNNL build faster by disabling building the tests and examples.
5. Enable custom op unit test.