* Expose load tensor proto from protobuf file function
* Add comment
* Remove use of fstream and use parsefromzerocopystream
* Close file descriptor after finish parsing it
* Close input stream too
* Set Close on delete only, no need to close file descriptor
* Revert "Set Close on delete only, no need to close file descriptor"
This reverts commit 5ba6e3c31b.
* Revert "Close input stream too"
This reverts commit 4564776733.
* Revert "Close file descriptor after finish parsing it"
This reverts commit 846e550c4f.
* Revert "Remove use of fstream and use parsefromzerocopystream"
This reverts commit 25a3117183.
* Add python API for specifying CUDA device id
* Modification for providing session based python api for specifying
device id
* Including the header file pybind11/stl.h automatically enables
conversion between C++ containers and Python list, set and dict data
structures.
https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#
Therefore, refactor the code to take better advantage of this.
* Make struct CudaDeviceOptions as default cuda device options
* Implement sess.set_providers(list_of_providers, list_of_provider_option_dicts)
But still stay consistent with existing sess.set_providers(list_of_provider)
* Add cuda provider option default setting
* Add support for setting the CUDA options cuda_mem_limit and arena_extend_strategy.
Also resolved the merge conflict on session.py
* Use python ctypes to call cuda library to help python unittest
* Refine the code with reviewer's suggestions
* Add the capability of getting execution provider's configuration
- Once we introduced the capability to set execution provider's
configuration, it makes sense to add capability of getting ep's configuration.
* Modify the code with reviewer's suggestions.
* Use stoull() or stoul() depending on whether the architecture is 32-bit or 64-bit.
* Rewrite the testcases for testing setting CUDA device id
Note: We need to make sure every ORT process runs on one CUDA device
at a time.
* Make sure old session object is destroyed by python gc before new
session object is being created
* Move testcases to original onnxruntime_test_python.py
* Fix bugs to pass CI build
* Make it pass CI build (cont.)
* Make it pass CI build (cont.)
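As an illustration of how the two call styles of sess.set_providers described above can coexist, here is a minimal sketch in plain Python. It is not the ONNX Runtime implementation: `normalize_providers` is a hypothetical helper, and the option key `device_id` and the option values are assumptions chosen for the example.

```python
def normalize_providers(providers, provider_options=None):
    """Pair each provider name with an options dict.

    Hypothetical helper for illustration only; not part of onnxruntime.
    """
    if provider_options is None:
        # Legacy call style: names only, so every provider gets default options.
        provider_options = [{} for _ in providers]
    if len(providers) != len(provider_options):
        raise ValueError("providers and provider_options must have the same length")
    pairs = []
    for name, options in zip(providers, provider_options):
        if not isinstance(name, str):
            raise TypeError("provider names must be strings")
        if not isinstance(options, dict):
            raise TypeError("provider options must be dicts")
        # Copy the dict so later changes by the caller do not leak in.
        pairs.append((name, dict(options)))
    return pairs


# Existing style: a list of provider names only.
legacy = normalize_providers(["CPUExecutionProvider"])

# New style: a parallel list of option dicts (keys and values are assumed).
configured = normalize_providers(
    ["CUDAExecutionProvider", "CPUExecutionProvider"],
    [{"device_id": "1"}, {}],
)
```

Keeping the options list optional is what lets callers that pass only provider names continue to work unchanged.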
* Adding CPU implementation of BroadcastGradientArgs op
* Modify to take shape as input instead of tensor
* Cleanup
* Correct schema
* Corrected kernel, added tests, addressed review comments.
* Added exception and test for invalid broadcast, addressed review comments.
* Fix mac build error.
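For readers unfamiliar with BroadcastGradientArgs, the following is a rough pure-Python sketch of the idea: given the shapes of two broadcast inputs, find the axes along which each input's gradient has to be summed. The function name, the return format and the error type are assumptions for illustration; the actual kernel's output layout may differ.

```python
def broadcast_gradient_args(a_shape, b_shape):
    """Return (a_axes, b_axes): axes of the broadcast output to reduce over
    when computing gradients for inputs of shape a_shape and b_shape.

    Illustrative sketch only; not the ONNX Runtime kernel.
    """
    ndim = max(len(a_shape), len(b_shape))
    a_pad = ndim - len(a_shape)
    b_pad = ndim - len(b_shape)
    # Right-align the shapes by prepending 1s, as numpy-style broadcasting does.
    a = [1] * a_pad + list(a_shape)
    b = [1] * b_pad + list(b_shape)

    a_axes, b_axes = [], []
    for axis, (da, db) in enumerate(zip(a, b)):
        if da != db and da != 1 and db != 1:
            raise ValueError(
                "shapes %r and %r cannot be broadcast" % (tuple(a_shape), tuple(b_shape))
            )
        # Reduce over axes that were prepended, or where this input was
        # stretched from size 1 to match the other input.
        if axis < a_pad or (da == 1 and db != 1):
            a_axes.append(axis)
        if axis < b_pad or (db == 1 and da != 1):
            b_axes.append(axis)
    return a_axes, b_axes


a_axes, b_axes = broadcast_gradient_args((2, 3, 4), (3, 1))
```

In this example the second input is broadcast along the leading axis and the last axis, so only its gradient needs reducing.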
* Initial change, to add ReduceSumTraining cpu op
* cpu support
* cuda support + more UTs
* on comments + UT
* no op support for {} axes with new attr - noop_with_empty_axes
* on comments
* fix build
* on comments
Co-authored-by: aishwarya bhandare <aibhanda@microsoft.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com>
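The effect of the noop_with_empty_axes attribute mentioned above can be sketched as an axis-resolution step. This assumes the usual ONNX ReduceSum convention that an empty axes list otherwise means "reduce over every axis"; `resolve_reduce_axes` is a hypothetical name, not a function in the code base.

```python
def resolve_reduce_axes(rank, axes, noop_with_empty_axes=False):
    """Return the sorted list of axes a reduction should actually sum over.

    Illustrative sketch only; an empty result means the op is a no-op.
    """
    if not axes:
        # Empty axes: either leave the input untouched or reduce everything.
        return [] if noop_with_empty_axes else list(range(rank))
    resolved = sorted({axis + rank if axis < 0 else axis for axis in axes})
    if resolved[0] < 0 or resolved[-1] >= rank:
        raise ValueError("axis out of range for rank %d" % rank)
    return resolved


noop_axes = resolve_reduce_axes(3, [], noop_with_empty_axes=True)
all_axes = resolve_reduce_axes(3, [], noop_with_empty_axes=False)
```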
* Revert "Temporarily remove dnnl from Linux CI build to unblock the whole team (#4266)"
Previously it failed because it used too much memory.
Now we only run dnnl EP with opset12 models in unit tests, to reduce peak memory usage.
* Deprecate TrainableDropout.
* Add Dropout(12) back into Megatron transformer.
* Remove TrainableDropout from front-end test models.
* Update baseline for front-end tests after converting test models to opset-12.
* Update baseline for front-end tests after converting test models to opset-12.
* Revise pipeline schedule to consider communication ops
* Add test
* Fix warning
* inline some short functions
* Fix warnings
* Rename a class
* Add comment for test
* op renamed to task
* Fix NVTX wrapper's bug
* concat
* add path_utils
* address feedback
* use string in test
* convert wstring to string on Windows
* address feedback
* address feedback
* fix comment
* Replace loss function in BERT_LOSS with SoftmaxCrossEntropyLoss.
* Update BERT loss function with correct logit shapes for softmax cross entropy loss.
* fix test and PR comments.
* build engine in runtime for dynamic shape subgraphs
* Update TensorRT-ExecutionProvider.md
* Update TensorRT-ExecutionProvider.md
* fix build issue
* Add more instructions on how to use engine caching
* add precision to trt node name
* Update tensorrt_execution_provider.cc
* Update tensorrt_execution_provider.cc
* Split ComputePadAndOutputShape into ComputePad and ComputeOutputShape
* update NNAPI conv output shape compute to use shared ComputeOutputShape
* move from using ptr to using reference for ComputePadAndOutputShape
* nnapi conv support auto_pad
* add logging operator support by target devices
* update InferOutputShape/ComputePadAndOutputShape/ComputePad to use force_symmetric_auto_padding as param instead of template
* make log op support for target devices optional
* add auto_pad support to pool operators
* ignore GetTargetDevices if using all devices
* fix some typos in padding calculation
* fix a bug of compute padding difference between conv and pool ops
* addressed CR comments, removed NNAPI device logging and move nnapi ep autopad handling into a shared function
* change helper functions to static
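To make the ComputePad / ComputeOutputShape split above concrete, here is a one-dimensional sketch in Python following the ONNX auto_pad rules (NOTSET, VALID, SAME_UPPER, SAME_LOWER). The signatures are hypothetical and do not mirror the C++ helpers. The treatment of force_symmetric_auto_padding, which rounds the total padding up to an even number so both sides match, is an assumption about its intent.

```python
def compute_pad(in_size, stride, kernel, dilation, auto_pad,
                pad_head=0, pad_tail=0, force_symmetric_auto_padding=False):
    """Return (pad_head, pad_tail) for one spatial dimension.

    Illustrative sketch only; not the ONNX Runtime helper.
    """
    if auto_pad == "NOTSET":
        # Explicit pads supplied by the caller are used unchanged.
        return pad_head, pad_tail
    if auto_pad == "VALID":
        return 0, 0
    if auto_pad not in ("SAME_UPPER", "SAME_LOWER"):
        raise ValueError("unsupported auto_pad: %r" % (auto_pad,))

    effective_kernel = (kernel - 1) * dilation + 1
    target_size = (in_size + stride - 1) // stride  # ceil(in_size / stride)
    pad_needed = max((target_size - 1) * stride + effective_kernel - in_size, 0)
    if force_symmetric_auto_padding:
        pad_needed += pad_needed % 2  # assumption: round up to an even total
    small = pad_needed // 2
    large = pad_needed - small
    # SAME_UPPER puts the extra element at the end, SAME_LOWER at the start.
    return (small, large) if auto_pad == "SAME_UPPER" else (large, small)


def compute_output_shape(in_size, stride, kernel, dilation, pad_head, pad_tail):
    """Return the output size for one spatial dimension given final pads."""
    effective_kernel = (kernel - 1) * dilation + 1
    return (in_size + pad_head + pad_tail - effective_kernel) // stride + 1


head, tail = compute_pad(4, stride=2, kernel=3, dilation=1, auto_pad="SAME_UPPER")
out_size = compute_output_shape(4, 2, 3, 1, head, tail)
```

Keeping the two steps separate lets conv and pool operators share the output-shape step even where their padding rules differ.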
* support bert partition with shared initializer
* address feedback
* address feedback
* address feedback
* add more test
* remove bert-tiny model
* address feedback
* address function comment
* move CreateNodeArg to graph_utils
* rename function name
* rename function name
* fix windows build
* fix windows type conversion warning
* add function comment
Create N-1 threads in a thread pool when configured with intra-op parallelism of N. This ensures we have N active threads, given that the main thread also runs work. To avoid ambiguity on the value returned, rename ThreadPool::NumThreads method to ThreadPool::DegreeOfParallelism, and make corresponding updates in MLAS and operators.
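The relationship between worker count and degree of parallelism can be sketched in Python as follows. `SketchThreadPool` is a hypothetical class for illustration and is not the ONNX Runtime ThreadPool. For brevity it starts its N-1 workers on each call rather than keeping them alive, which a real pool would not do.

```python
import threading


class SketchThreadPool:
    """Toy pool: N-1 worker threads plus the calling thread give N-way parallelism."""

    def __init__(self, intra_op_parallelism):
        if intra_op_parallelism < 1:
            raise ValueError("intra_op_parallelism must be at least 1")
        self.num_worker_threads = intra_op_parallelism - 1

    def degree_of_parallelism(self):
        # Workers plus the main (calling) thread, which also runs work.
        return self.num_worker_threads + 1

    def parallel_for(self, total, fn):
        """Call fn(i) for every i in range(total), split into contiguous chunks."""
        d = self.degree_of_parallelism()
        bounds = [(total * i // d, total * (i + 1) // d) for i in range(d)]

        def run(lo, hi):
            for i in range(lo, hi):
                fn(i)

        workers = [threading.Thread(target=run, args=b) for b in bounds[1:]]
        for w in workers:
            w.start()
        run(*bounds[0])  # the calling thread takes the first chunk itself
        for w in workers:
            w.join()


pool = SketchThreadPool(4)
seen = []
threads_used = set()
lock = threading.Lock()


def record(i):
    with lock:
        seen.append(i)
        threads_used.add(threading.get_ident())


pool.parallel_for(10, record)
```

Code that sizes per-thread buffers would use `degree_of_parallelism()` rather than the worker count, which is the ambiguity the rename is meant to remove.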
* Split ComputePadAndOutputShape into ComputePad and ComputeOutputShape
* update NNAPI conv output shape compute to use shared ComputeOutputShape
* move from using ptr to using reference for ComputePadAndOutputShape
* Enable onnxruntime_test_all for NNAPI EP
* switch to use ninja for Android CI
* make android emulator boot faster in android ci
* simplify adb push
* more style change
* more tweaking on android ci
* build.py style update