* Share allocator between CUDA EP & TRT EP.
limitation:
1. Does not cover the per-thread allocator created by CUDA EP, still need to figure out the way to remove it
2. Need to have more identifiers to make it able to share CPU allocator across all EPs
Rename MakeString to MakeStringWithClassicLocale, MakeStringLite to MakeString, *ParseString to *ParseStringWithClassicLocale.
Add missing pass-through versions of MakeStringWithClassicLocale for string types.
Update Python API to allow more flexibility for setting providers and provider options.
The providers argument (InferenceSession/TrainingSession constructors, InferenceSession.set_providers()) now also accepts a tuple of (name, options dict).
Fix get_available_providers() API (and the corresponding function in the C API) to return the providers in default priority order. Now it can be used as a starting point for the providers argument and maintain the default priority order.
Convert some usages of the deprecated global configuration functions to use EP-specific options instead.
Update some EP-specific option parsing to fail on unknown options.
Other clean up.
* Fix issue: https://github.com/microsoft/onnxruntime/issues/6094
Root cause: we didn't expose the OrtMemoryInfo for TRT, so it will cause issue if user want use IObinding for Tensorrt.
Short term fix, add the OrtMemoryInfo for TRT. Long term should unify the allocator for CUDA and TRT
* Create a helper for generating unique ids that can be used by an EP that creates compiled nodes and needs ids to be deterministic for a model when used in multiple sessions.
Added to IExecutionProvider as this can potentially be used by all compiling EPs and is more robust than a simplistic counter (although EP implementer is free to choose either approach).
* Restructure the helper so it can be called across the EP bridge.
Add ability to call id generation helper from EP bridge
- convert DNNL EP to use helper to validate
Address issue where a new Model may be loaded into the same address as a previous one.
- hash the bytes in the Graph instance (1728 bytes currently) to use as the key to the full hash for the model
Add lock around id generation to ensure no issues if multiple sessions partitions graphs at exactly the same time.
- Extremely unlikely but would be hard to debug and the locking cost is not an issue as it's only incurred during graph partitioning and not execution.
* Remove Provider_IExecutionProvider and make the internal IExecutionProvider usable by shared providers
* Change Provider_IExecutionProviderFactory to be the core version.
* Introduce VariadicAlias, remove hardcoded alias limits
* Include optional-lite in winml build
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* allow custom op taking varied types
* refactor test case
* add test model
* refactor test case
* enable copy elision
* update test case
* fix issue in ToString function
* Exclude some training specific code around the allocation planning and initializer handling from the minimal build.
Simplify the code around tracking start/end usage of a value.
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/
ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …
Current status:
BERT-L and GPT2 training can be ran on AMD GPU with data parallel.
Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……
The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big(~180 files), it was suggested to split the PR into two parts, one is rocm-kernels, the other is non rocm kernels.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
* Introduce OpKernelInfo GetAttrAsSpan() for floats and ints attribute proto arrays
and GetAttrsStringRefs() to return a vector of string references.
These new APIs allow kernels not copy attribute arrays especially if they are large
and save on memory.
but refer directly to data that is in AttributeProto.
Modify TfIdfVectorizer to take advantage of the new API.
Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
Introduce sparse_initializers support.
Convert them to dense on model load and prune graph_proto_
so they don't consume space. Convert back to sparse on ORT Format model save.
Implement serializing sparse initializers to OrtFormat.
Fix Model::ToProto() to return original sparse initializers
Set a flag that graph_sync is needed when loading a simple ORT Format model.
otherwise nothing is resolved.
Add ORT Format history to README.md
ifdef MINIMAL build for DenseToSparseTensorInitializer
Allow duplicate initializers to support existing models.
Issue a warning instead of aborting.
* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.
Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
* Change shared providers so that they are shutdown before shared library unload
* Move UnloadSharedProviders declaration into a shared header to avoid bugs.
* - Link with libatomic if needed
- Install pip differently so it doesn't clash with the system pip which may involve a wrapper script
- Remove ability to specify offset when Tensor allocates the data. The data prior to offset isn't accessible by anything.
- Fix use of offset in TensorOpTest to work on armv7 where it must be aligned to the type it points to.
- Fix ActivationOpNoInfTest.Softsign to allow for armv7 behavior
- Fix ReductionOpTest.ReduceMean_*keepdims to allow for armv7 floating point inaccuracy
* Address PR comments
* Allow sharing of initializers between sessions.
* Allow sharing of initializers between sessions (2).
* Add test for C#
* Add test for C#; address PR comments
* Address PR comments
Moved AddInitializer logic to internal session options
Added tests for owned buffer
Clarified documentation
Fix bug where memory info and not device was getting compared
* Fix test
* Fix training build
* Add ver 5 end marker and ver 6 starter, add scenario and usage examples.
* Remove SparseTensor support from minimal build.
Currently the only valid usage of a SparseTensor is as an attribute of a Constant node. That would have been lifted to a dense tensor initializer when loading the onnx model, so would not exist when saving the ORT format model. Due to that there can be no SparseTensors in an ORT format model.
Co-authored-by: gwang <wanggy@outlook.com>
* opset13 cuda kernels for BERT.
* add opset13 SoftmaxCrossEntropyLoss.
* opset13 size.
* fix argmax/min for ut.
* fix ut failure for argmax/min.
* OrtMemTypeCPUInput
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* Add minimal build option to build.py
Group some of the build settings so binary size reduction options are all together
Make some cmake variable naming more consistent
Replace usage of std::hash with murmurhash3 for kernel. std::hash is implementation dependent so can't be used.
Add initial doco and ONNX to ORT model conversion script
Misc cleanups of minimal build breaks.
* Next round of changes.
Remove inclusion of ONNX schema header
Exclude custom registry related things
Move IsConstantInitializer from graph_utils to Graph as it's needed in a minimal build and graph_utils is excluded.
* Add support for sharing allocators
* Incremental update
* Address some PR comments, add unit tests, add documentation.
* Address PR comments, add tests and some documentation.
* Fix build and test issues
* Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this.
Also fixed the orttraining segfault. The segfault was because in the case of training session,
the CPU exec prov is not available at the time the transformers are applied. Changed it to create
a new one.
* cancel night build on pyop
* add rewriter to rewrite cpu provider
* skip BuildKernelCreateInfo<void>
* refactor variable name and comment
* include ops from csv file
* process multiple eps
* add default function to cuda provider
* rename function and add license header
* fix import
* add doc
* fix typo
* deal with empty kernel entry in cuda
* rename the rewriter file
* add comment into provider file
* add comment and rename function
* log warnings
* refactor extracting logic
* add entry for script to run solo
* add better example
* avoid onnx importing
* fix flake8 alerts
* minor fixes to better comments and doc
* add entries for all domains
* add void entry into contrib providers
* format cuda_contrib_kernels.cc
* format cpu_contrib_kernels.cc
* add all providers
* add default entry to all providers
* include op_kernel header
* cancelling change in providers beyond cpu/cuda
* rename file and switch file format to domain;opset;op1,op2...
* update doc
* restore non-regular ending grammar in cuda_contrib_kernels.cc
* add ort_root as input argument of script
* enable test in ci
* update doc
* update doc
* revert change on linux gnu ci
* switch to set to host ops
* simplify trimming logic
* add domain map to track current model
* allow ort_root to take relative path
* Add ability to retrieve inferred shapes when executing a kernel.
This ability helps Recv to know its output shapes without doing
actual cummunication. Of course, if the output shapes cannot be
inferred, Recv still needs to do communication to get shapes from
Send.
* Avoid communicating shape information when it can be inferred statically
* Replace unordered_map with thread-safe wrapper.
We don't want to have racing condition and undefined behavior
when using parallel executor.y
* Remove cout
* Add missing file
* Address comments
* Check dim_value. -1 means missing
* lock properly
* Address comments (remove thread-safe map)
* Remove poc header
* Replace Stream with DeferredReleaseCPUPtr
* Add python API for specifying CUDA device id
* Modification for providing session based python api for specifying
device id
* When include header file pybind11/stl.h, conversion between c++
containers and Python list, vector and dict data structure are
automatically enabled.
https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#
Therefore, refactor the code for better leverage this advantage.
* Make struct CudaDeviceOptions as default cuda device options
* Implement sess.set_providers(list_of_providers, list_of_provider_option_dicts)
But still stay consistent with existing sess.set_providers(list_of_provider)
* Add cuda provider option default setting
* Add support for setting cuda cuda_mem_limit and arena_extend_strategy.
Also resolved the merge conflict on session.py
* Use python ctypes to call cuda library to help python unittest
* Refine the code with reviewer's suggestions
* Add the capability of getting execution provider's configuration
- Once we introduced the capability to set execution provider's
configuration, it makes sense to add capability of getting ep's configuration.
* Modify the code with reviewer's suggestions.
* Using stoull() and stoul() depends on 32/64-bits architecture.
* Rewrite the testcases for testing setting CUDA device id
Note: We need to make sure every ORT process be run on one CUDA device
at a time.
* Make sure old session object is destroyed by python gc before new
session object is being created
* Move testcases to original onnxruntime_test_python.py
* Fix bugs to pass CI build
* Make it pass CI build (cont.)
* Make it pass CI build (cont.)