* Adding CPU implementation of BroadcastGradientArgs op
* Modify to take shape as input instead of tensor
* Cleanup
* Correct schema
* Corrected kernel, added tests, addressed review comments.
* Added exception, test for invalid broadcast, addressed review comments.
* Fix mac build error.
* Initial change, to add ReduceSumTraining cpu op
* cpu support
* cuda support + more UTs
* on comments + UT
* no op support for {} axes with new attr - noop_with_empty_axes
* on comments
* fix build
* on comments
Co-authored-by: aishwarya bhandare <aibhanda@microsoft.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com>
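
For reference, the shape-only computation behind BroadcastGradientArgs can be sketched as below. This is a minimal illustration, not the kernel added in these commits: it assumes semantics analogous to TensorFlow's op of the same name (given two input shapes, return the axes of the broadcast result over which each input's gradient must be summed), and the function name `broadcast_gradient_args` is hypothetical.

```python
def broadcast_gradient_args(a_shape, b_shape):
    """Return (a_axes, b_axes): axes of the broadcast result to reduce-sum
    so that each gradient regains the shape of its input.

    Hypothetical helper for illustration; takes shapes, not tensors.
    """
    rank = max(len(a_shape), len(b_shape))
    pad_a = rank - len(a_shape)  # leading axes that a does not have
    pad_b = rank - len(b_shape)  # leading axes that b does not have
    a = [1] * pad_a + list(a_shape)
    b = [1] * pad_b + list(b_shape)

    a_axes, b_axes = [], []
    for axis, (da, db) in enumerate(zip(a, b)):
        # Dimensions are compatible only if they match or one of them is 1.
        if da != db and da != 1 and db != 1:
            raise ValueError(
                f"shapes {tuple(a_shape)} and {tuple(b_shape)} are not broadcastable"
            )
        # An input was broadcast along this axis if the axis was padded in,
        # or if its size-1 dimension was stretched to match the other input.
        if axis < pad_a or (da == 1 and db != 1):
            a_axes.append(axis)
        if axis < pad_b or (db == 1 and da != 1):
            b_axes.append(axis)
    return a_axes, b_axes


print(broadcast_gradient_args((2, 3, 4), (3, 1)))  # ([], [0, 2])
```

Incompatible shapes such as `(2, 3)` and `(4, 3)` raise `ValueError`, mirroring the "exception for invalid broadcast" item above.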
* Revert "Temporarily remove dnnl from Linux CI build to unblock the whole team (#4266)"
Previously it failed because it used too much memory.
Now we only run dnnl EP with opset12 models in unit tests, to reduce peak memory usage.
* Deprecate TrainableDropout.
* Add Dropout(12) back into Megatron transformer.
* Remove TrainableDropout from front-end test models.
* Update baseline for front-end tests after converting test models to opset-12.
* Update baseline for front-end tests after converting test models to opset-12.
* Revise pipeline schedule to consider communication ops
* Add test
* Fix warning
* inline some short functions
* Fix warnings
* Rename a class
* Add comment for test
* op renamed to task
* Fix NVTX wrapper's bug
* concat
* add path_utils
* address feedback
* use string in test
* convert wstring to string in windows
* address feedback
* address feedback
* fix comment
* Replace loss function in BERT_LOSS with SoftmaxCrossEntropyLoss.
* Update BERT loss function with correct logit shapes for softmax cross entropy loss.
* fix test and PR comments.
* build engine in runtime for dynamic shape subgraphs
* Update TensorRT-ExecutionProvider.md
* Update TensorRT-ExecutionProvider.md
* fix build issue
* Add more instructions on how to use engine caching
* add precision to trt node name
* Update tensorrt_execution_provider.cc
* Update tensorrt_execution_provider.cc
* Split ComputePadAndOutputShape into ComputePad and ComputeOutputShape
* update NNAPI conv output shape computation to use shared ComputeOutputShape
* switch from pointer to reference for ComputePadAndOutputShape
* nnapi conv support auto_pad
* add logging of operator support by target devices
* update InferOutputShape/ComputePadAndOutputShape/ComputePad to use force_symmetric_auto_padding as param instead of template
* make log op support for target devices optional
* add auto_pad support to pool operators
* ignore GetTargetDevices if using all devices
* fix some typos in padding calculation
* fix a bug in the padding computation difference between conv and pool ops
* addressed CR comments, removed NNAPI device logging and move nnapi ep autopad handling into a shared function
* change helper functions to static
* support bert partition with shared initializer
* address feedback
* address feedback
* address feedback
* add more tests
* remove bert-tiny model
* address feedback
* address function comment
* move CreateNodeArg to graph_utils
* rename function name
* rename function name
* fix windows build
* fix windows type conversion warning
* add function comment
Create N-1 threads in a thread pool when configured with intra-op parallelism of N. This ensures we have N active threads, given that the main thread also runs work. To avoid ambiguity on the value returned, rename ThreadPool::NumThreads method to ThreadPool::DegreeOfParallelism, and make corresponding updates in MLAS and operators.
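
The thread-count convention described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ThreadPool implementation: the class `SketchThreadPool` and its methods are hypothetical, and workers are spawned per call here instead of being kept alive in a persistent pool.

```python
import threading


class SketchThreadPool:
    """Hypothetical pool: parallelism of N = N-1 worker threads + the caller."""

    def __init__(self, degree_of_parallelism):
        if degree_of_parallelism < 1:
            raise ValueError("degree of parallelism must be at least 1")
        self._dop = degree_of_parallelism

    def degree_of_parallelism(self):
        # Number of threads that actively run work, including the caller.
        return self._dop

    def num_worker_threads(self):
        # Threads the pool itself creates: one fewer than the parallelism.
        return self._dop - 1

    def parallel_for(self, total, fn):
        """Call fn(i) for every i in range(total), split into contiguous chunks."""
        chunk = -(-total // self._dop)  # ceiling division

        def run(start):
            for i in range(start, min(start + chunk, total)):
                fn(i)

        # Spawn N-1 workers for chunks 1..N-1.
        workers = [
            threading.Thread(target=run, args=(k * chunk,))
            for k in range(1, self._dop)
        ]
        for w in workers:
            w.start()
        run(0)  # the calling thread processes the first chunk itself
        for w in workers:
            w.join()


pool = SketchThreadPool(4)
squares = [0] * 10
pool.parallel_for(10, lambda i: squares.__setitem__(i, i * i))
print(pool.degree_of_parallelism(), pool.num_worker_threads())  # 4 3
```

Callers that partition work should size their partitions by the degree of parallelism, not by the number of worker threads, since the calling thread takes a share too.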
* Split ComputePadAndOutputShape into ComputePad and ComputeOutputShape
* update NNAPI conv output shape computation to use shared ComputeOutputShape
* switch from pointer to reference for ComputePadAndOutputShape
* Enable onnxruntime_test_all for NNAPI EP
* switch to use ninja for Android CI
* make android emulator boot faster in android ci
* simplify adb push
* more style change
* more tweaking on android ci
* build.py style update
- make size_ and data_ data members private
- rename GetCapacity() to Capacity() to be consistent (e.g., with Size())
- add static_assert for trivially copyable T because it is copied with memcpy