* Revert "Temporarily remove dnnl from Linux CI build to unblock the whole team (#4266)"
Previously it failed because it used too much memory.
Now we only run dnnl EP with opset12 models in unit tests, to reduce peak memory usage.
* Deprecate TrainableDropout.
* Add Dropout(12) back into Megatron transformer.
* Remove TrainableDropout from front-end test models.
* Update baseline for front-end tests after converting test models to opset-12.
* Revise pipeline schedule to consider communication ops
* Add test
* Fix warning
* inline some short functions
* Fix warnings
* Rename a class
* Add comment for test
* op renamed to task
* Fix NVTX wrapper's bug
* concat
* add path_utils
* address feedback
* use string in test
* convert wstring to string on Windows
* address feedback
* address feedback
* fix comment
* Replace loss function in BERT_LOSS with SoftmaxCrossEntropyLoss.
* Update BERT loss function with correct logit shapes for softmax cross entropy loss.
* fix test and PR comments.
* build engine in runtime for dynamic shape subgraphs
* Update TensorRT-ExecutionProvider.md
* Update TensorRT-ExecutionProvider.md
* fix build issue
* Add more instructions on how to use engine caching
* add precision to trt node name
* Update tensorrt_execution_provider.cc
* Update tensorrt_execution_provider.cc
* Split ComputePadAndOutputShape into ComputePad and ComputeOutputShape
* update NNAPI conv output shape computation to use the shared ComputeOutputShape
* switch ComputePadAndOutputShape from pointer to reference parameters
* nnapi conv support auto_pad
* add logging of operator support by target devices
* update InferOutputShape/ComputePadAndOutputShape/ComputePad to use force_symmetric_auto_padding as param instead of template
* make log op support for target devices optional
* add auto_pad support to pool operators
* ignore GetTargetDevices if using all devices
* fix some typos in padding calculation
* fix a bug caused by the difference in padding computation between conv and pool ops
* addressed CR comments, removed NNAPI device logging and move nnapi ep autopad handling into a shared function
* change helper functions to static
* support bert partition with shared initializer
* address feedback
* address feedback
* address feedback
* add more tests
* remove bert-tiny model
* address feedback
* address function comment
* move CreateNodeArg to graph_utils
* rename function name
* rename function name
* fix windows build
* fix windows type conversion warning
* add function comment
Create N-1 threads in a thread pool when configured with intra-op parallelism of N. This ensures we have N active threads, given that the main thread also runs work. To avoid ambiguity on the value returned, rename ThreadPool::NumThreads method to ThreadPool::DegreeOfParallelism, and make corresponding updates in MLAS and operators.
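The N-1 convention can be sketched as follows. This is a minimal illustration, not the onnxruntime ThreadPool: the class name `TinyPool` and its `ParallelFor` signature are hypothetical, and helper threads are created per call rather than kept alive, purely to keep the sketch short. Only the `DegreeOfParallelism` name comes from the change described above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical illustration; not the onnxruntime ThreadPool implementation.
class TinyPool {
 public:
  // degree_of_parallelism == N means N threads do work in total:
  // N-1 helper threads plus the calling (main) thread.
  explicit TinyPool(int degree_of_parallelism)
      : num_helpers_(degree_of_parallelism - 1) {
    assert(degree_of_parallelism >= 1);
  }

  // Number of threads that can run work at once, including the caller.
  // This is deliberately not "number of threads owned by the pool".
  int DegreeOfParallelism() const { return num_helpers_ + 1; }

  // Splits [0, total) into at most N contiguous blocks. Helpers take the
  // leading blocks; the caller runs the last block instead of idling.
  void ParallelFor(std::size_t total,
                   const std::function<void(std::size_t, std::size_t)>& fn) const {
    const std::size_t n = static_cast<std::size_t>(DegreeOfParallelism());
    const std::size_t block = (total + n - 1) / n;  // ceil(total / n)
    std::vector<std::thread> helpers;
    std::size_t begin = 0;
    for (int i = 0; i < num_helpers_ && begin + block < total; ++i) {
      helpers.emplace_back(fn, begin, begin + block);
      begin += block;
    }
    if (begin < total) fn(begin, total);  // the caller's share of the work
    for (auto& t : helpers) t.join();
  }

 private:
  int num_helpers_;
};
```

With `TinyPool pool(4)`, three helper threads are spawned and `pool.DegreeOfParallelism()` returns 4, which is the number callers should use when deciding how finely to partition work.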
* Split ComputePadAndOutputShape into ComputePad and ComputeOutputShape
* update NNAPI conv output shape computation to use the shared ComputeOutputShape
* switch ComputePadAndOutputShape from pointer to reference parameters
* Enable onnxruntime_test_all for NNAPI EP
* switch to ninja for Android CI
* make Android emulator boot faster in Android CI
* simplify adb push
* more style change
* more tweaking on android ci
* build.py style update
- make size_ and data_ data members private
- rename GetCapacity() to Capacity() to be consistent (e.g., with Size())
- add static_assert for trivially copyable T because it is copied with memcpy
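A rough sketch of the pattern these three changes describe. The class name `RawBuffer`, the growth policy, and the `PushBack`/`Data` methods are assumptions made for illustration; only the private `size_`/`data_` members, the `Size()`/`Capacity()` naming, and the `static_assert` guarding the `memcpy` come from the notes above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical growable buffer; name and layout are illustrative only.
template <typename T>
class RawBuffer {
  // memcpy is only well-defined for trivially copyable types, so reject
  // anything else at compile time instead of corrupting objects at runtime.
  static_assert(std::is_trivially_copyable<T>::value,
                "RawBuffer copies elements with memcpy; T must be trivially copyable");

 public:
  std::size_t Size() const { return size_; }
  std::size_t Capacity() const { return capacity_; }  // named to match Size()
  const T* Data() const { return data_.get(); }

  const T& operator[](std::size_t i) const {
    assert(i < size_);
    return data_[i];
  }

  void PushBack(const T& value) {
    if (size_ == capacity_) Grow();
    data_[size_++] = value;
  }

 private:
  void Grow() {
    const std::size_t new_capacity = capacity_ == 0 ? 4 : capacity_ * 2;
    // Sketch assumption: T is also default constructible.
    std::unique_ptr<T[]> bigger(new T[new_capacity]);
    if (size_ > 0) std::memcpy(bigger.get(), data_.get(), size_ * sizeof(T));
    data_ = std::move(bigger);
    capacity_ = new_capacity;
  }

  // Private so callers cannot break the size_ <= capacity_ invariant.
  std::unique_ptr<T[]> data_;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```

Instantiating `RawBuffer<std::string>` would fail to compile with the message above, which is the point of the assertion.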
Extracting some common code related to "AutoPadType" from the cpu execution provider into "common.h".
Motivation and Context
* Sharing code with authors of other execution providers that need the same functionality.
* I didn't modify the code in shared_library or dnnl EP to avoid changing their dependency structure, so there is still a redundant copy of the AutoPadType code in there.
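For readers unfamiliar with what the shared auto-pad code has to compute, here is a minimal one-axis sketch following the ONNX `auto_pad` semantics for Conv and pooling operators. The names `Pad1D`, `ComputePad1D`, and `ComputeOutputDim` and their signatures are hypothetical and are not the helpers in "common.h"; the enum values mirror the ONNX attribute values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum class AutoPadType { NOTSET, VALID, SAME_UPPER, SAME_LOWER };

struct Pad1D {
  int64_t head;  // padding before the first element
  int64_t tail;  // padding after the last element
};

// Hypothetical helper: padding along a single spatial axis.
inline Pad1D ComputePad1D(int64_t in_dim, int64_t stride, int64_t kernel,
                          int64_t dilation, AutoPadType pad_type,
                          Pad1D explicit_pad) {
  assert(in_dim > 0 && stride > 0 && kernel > 0 && dilation > 0);
  if (pad_type == AutoPadType::NOTSET) return explicit_pad;  // use given pads
  if (pad_type == AutoPadType::VALID) return Pad1D{0, 0};    // no padding

  // SAME_*: pick padding so that out_dim == ceil(in_dim / stride).
  const int64_t dkernel = (kernel - 1) * dilation + 1;  // dilated kernel extent
  const int64_t out_dim = (in_dim + stride - 1) / stride;
  const int64_t total =
      std::max<int64_t>(0, (out_dim - 1) * stride + dkernel - in_dim);
  const int64_t small = total / 2;
  const int64_t large = total - small;
  // When total is odd, SAME_UPPER puts the extra element at the end and
  // SAME_LOWER puts it at the beginning.
  return pad_type == AutoPadType::SAME_UPPER ? Pad1D{small, large}
                                             : Pad1D{large, small};
}

// Hypothetical helper: output size along one axis once the pads are known.
inline int64_t ComputeOutputDim(int64_t in_dim, int64_t stride, int64_t kernel,
                                int64_t dilation, Pad1D pad) {
  const int64_t dkernel = (kernel - 1) * dilation + 1;
  return (in_dim + pad.head + pad.tail - dkernel) / stride + 1;
}
```

Keeping the pad computation separate from the output-shape computation lets an execution provider reuse either piece alone, for example when it only needs the output size for pads it already has.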
Handle model with multiple embed nodes:
* update embed layer norm fusion in onnxruntime
* Fix temp model path in optimizer
* Add unit test for model with multiple embed nodes.
* Add unit test for gpt2 fusion with past state and mask
* Add unit test for change input to int32
For the special case where all variadic inputs of a kernel are the same shape (i.e. no broadcasting is required) and there are few enough of them, we perform the entire computation in a single kernel. The general implementation (which was previously used for this special case) handles broadcasting by repeatedly invoking a binary kernel on successive inputs.
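The difference between the two paths can be sketched on the CPU for a variadic Sum. This is an illustration of the idea only, not the actual kernel: the function names, the `kMaxFusedInputs` limit, and the flat `std::vector<float>` tensor are assumptions, and broadcasting is left out of the general path for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tensor = std::vector<float>;  // flat data; shapes assumed identical

// General path (simplified): one binary pass per additional input, so the
// output is read and written N-1 times.
inline Tensor SumPairwise(const std::vector<Tensor>& inputs) {
  assert(!inputs.empty());
  Tensor out = inputs[0];
  for (std::size_t k = 1; k < inputs.size(); ++k) {
    assert(inputs[k].size() == out.size());
    for (std::size_t i = 0; i < out.size(); ++i) out[i] += inputs[k][i];
  }
  return out;
}

// Hypothetical cap on how many inputs the single-pass path accepts.
constexpr std::size_t kMaxFusedInputs = 8;

// Special case: all inputs have the same shape and there are few of them,
// so each output element is produced once in a single pass.
inline Tensor SumSinglePass(const std::vector<Tensor>& inputs) {
  assert(!inputs.empty() && inputs.size() <= kMaxFusedInputs);
  const std::size_t n = inputs[0].size();
  for (const Tensor& t : inputs) assert(t.size() == n);
  Tensor out(n);
  for (std::size_t i = 0; i < n; ++i) {
    float acc = inputs[0][i];
    for (std::size_t k = 1; k < inputs.size(); ++k) acc += inputs[k][i];
    out[i] = acc;
  }
  return out;
}
```

Both functions accumulate in the same order, so they return identical results; the single-pass version simply avoids the intermediate round trips through the output buffer, which on a GPU would also mean one kernel launch instead of N-1.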