This change makes several small optimizations in various places. It consists of part of PR #1240 (with the problematic part removed) and some other trivial fixes.
1. Reduce unnecessary copies when constructing vectors or objects that contain a vector as a member. Use std::move where applicable.
2. Use std::vector<std::reference_wrapper<const TensorShape>> instead of std::vector<TensorShape> when the shapes are only needed as constant references.
3. Calculate the key BEFORE (instead of AFTER) acquiring the lock in SessionState::GetMemoryPatternGroup.
Other trivial fixes (the code should be straightforward and self-explanatory).
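The three items above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: Shape, ShapeHolder, CollectShapes and PatternCache are hypothetical stand-ins for the real TensorShape and SessionState types.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shape type; not the real TensorShape.
struct Shape {
  std::vector<int64_t> dims;
};

using ShapeRefs = std::vector<std::reference_wrapper<const Shape>>;

// Item 1: take the vector by value and move it into the member, so an
// rvalue argument is moved rather than copied.
class ShapeHolder {
 public:
  explicit ShapeHolder(std::vector<int64_t> dims) : dims_(std::move(dims)) {}
  const std::vector<int64_t>& dims() const { return dims_; }

 private:
  std::vector<int64_t> dims_;
};

// Item 2: collect read-only references instead of copying each Shape.
// The referenced shapes must outlive the returned vector.
inline ShapeRefs CollectShapes(const std::vector<Shape>& inputs) {
  ShapeRefs refs;
  refs.reserve(inputs.size());
  for (const Shape& s : inputs) refs.push_back(std::cref(s));
  return refs;
}

// Item 3: build the lookup key before taking the lock, so the critical
// section only covers the map access.
class PatternCache {
 public:
  void Put(const ShapeRefs& shapes, int pattern) {
    const std::string key = MakeKey(shapes);  // no lock held here
    std::lock_guard<std::mutex> lock(mutex_);
    cache_[key] = pattern;
  }

  bool Find(const ShapeRefs& shapes, int* out) const {
    assert(out != nullptr);
    const std::string key = MakeKey(shapes);  // no lock held here
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key);
    if (it == cache_.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  static std::string MakeKey(const ShapeRefs& shapes) {
    std::string key;
    for (const Shape& s : shapes) {  // reference_wrapper converts to const Shape&
      for (int64_t d : s.dims) {
        key += std::to_string(d);
        key += ',';
      }
      key += ';';
    }
    return key;
  }

  mutable std::mutex mutex_;
  std::unordered_map<std::string, int> cache_;
};
```

Computing the key first keeps string building, which may allocate, out of the region where other threads are blocked.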
Change the logic used to find the cuBLAS DLL.
Motivation and Context
The naming pattern of the cuBLAS DLL changed as of CUDA 10.1: the file name no longer includes the minor version.
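A rough sketch of name selection under this rule is shown below. CublasDllName is a hypothetical helper, not the function changed in this PR, and the file names are assumptions based on the usual Windows naming (for example cublas64_100.dll for CUDA 10.0 and cublas64_10.dll for CUDA 10.1).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build the cuBLAS DLL file name for a CUDA version.
// Assumption: versions before 10.1 append the minor version to the major
// version, while 10.1 and later use the major version only.
inline std::string CublasDllName(int cuda_major, int cuda_minor) {
  assert(cuda_major > 0 && cuda_minor >= 0);
  const bool name_has_minor =
      cuda_major < 10 || (cuda_major == 10 && cuda_minor == 0);
  std::string name = "cublas64_" + std::to_string(cuda_major);
  if (name_has_minor) name += std::to_string(cuda_minor);
  return name + ".dll";
}
```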
This change implements Conv+Clip activation fusion for FusedConv and NCHWc convolutions. The Clip operation runs in the thread context that is producing the convolution output.
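The idea of fusing the activation into the producer can be sketched with a toy 1-D convolution. This is only an illustration under simplifying assumptions (single channel, no padding or stride); Conv1DClip is a hypothetical name and is unrelated to the actual FusedConv or NCHWc kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy 1-D "valid" convolution (cross-correlation form) with the clip
// applied as each output element is produced, instead of in a second
// pass over the output buffer.
inline std::vector<float> Conv1DClip(const std::vector<float>& x,
                                     const std::vector<float>& w,
                                     float lo, float hi) {
  assert(!w.empty() && x.size() >= w.size() && lo <= hi);
  std::vector<float> y(x.size() - w.size() + 1);
  for (std::size_t i = 0; i < y.size(); ++i) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < w.size(); ++k) acc += x[i + k] * w[k];
    // Fused activation: clamp while the value is still in a register.
    y[i] = std::min(std::max(acc, lo), hi);
  }
  return y;
}
```

Because the clamp happens where the output is written, the result does not have to be read back from memory by a separate Clip node.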
* Minor bug fixes for accelerators
* Added dimensionality checks for each graph input for GPU
* Disabled some tests for MYRIAD and GPU
* This change is required for running some of the models on OpenVINO instead of falling back to the default CPU EP
Signed-off-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* PR Feedback
* Fix missing bracket
* Use INFO instead of WARNING for an unused graph input.
* Drop severity of unused initializer as well
* Update to output a warning level message if removing an initializer that is never used, and an info level message if removing an initializer that optimization has made redundant.
* Now that we check for a constant initializer in an ancestor graph we also need to be able to retrieve and replace that initializer.
Add helpers to do so.
Update optimizers to use the new helpers.
Fix bug in UnsqueezeElimination where it wasn't checking if the initializer it was replacing was constant.
Add MlasGetPreferredBufferAlignment() for use by CPUAllocator::Alloc to get the byte alignment for CPU tensors. Using MLAS allows the value to be based on the platform the binary is running on instead of a constant value fixed at compile time.
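A minimal sketch of an allocator that asks for the alignment at run time is shown below. PreferredAlignment is a hypothetical stand-in for MlasGetPreferredBufferAlignment() and simply returns 64 here; AlignedAlloc and AlignedFree are also illustrative names, not the CPUAllocator implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>
#endif

// Hypothetical stand-in for a runtime query of the preferred alignment.
// A real implementation would inspect the CPU; this sketch assumes 64 bytes.
inline std::size_t PreferredAlignment() { return 64; }

// Allocate `size` bytes aligned to the value reported at run time.
inline void* AlignedAlloc(std::size_t size) {
  const std::size_t alignment = PreferredAlignment();
  // posix_memalign requires a power of two that is a multiple of sizeof(void*).
  assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);
  if (size == 0) return nullptr;
#if defined(_WIN32)
  return _aligned_malloc(size, alignment);
#else
  void* p = nullptr;
  return posix_memalign(&p, alignment, size) == 0 ? p : nullptr;
#endif
}

// Release memory obtained from AlignedAlloc.
inline void AlignedFree(void* p) {
#if defined(_WIN32)
  _aligned_free(p);
#else
  free(p);
#endif
}
```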
* Add arm64 nocontribops pipeline
* minor fix
* Added new template for arm build -- disable all tests
* fix build command
* add arm64 flag for msbuild
* add arm leg as upstream dependency
* update platform to arm64 for msbuild
* remove test task from arm build
* remove ESRP signing of C# dlls in arm build
* Updated to work for both --arm and --arm64
* Make the cross compiling cmake flags symmetric
* Add dynamic check for /Wno-error flag, instead of extra build option
* remove extra full-stop
This extends build.py to run git submodule sync --recursive before running git submodule update --init --recursive. This makes sure submodule URLs are up-to-date.
This change integrates the NCHWc support recently added to MLAS into ONNX Runtime. When "-o 3" optimizations are used, the runtime performs an NCHWc layout optimization pass that converts standard ONNX operators such as Conv/MaxPool to the com.microsoft.nchwc domain, with weights and biases reordered for speed.
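To illustrate what a blocked NCHWc layout looks like, the sketch below reorders an NCHW tensor so that groups of `block` channels become the innermost dimension. ReorderNchwToNchwc is a hypothetical helper; the block size passed in and the zero padding of the last channel block are assumptions for illustration, not the MLAS reorder routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder [N][C][H][W] into [N][ceil(C/block)][H][W][block].
// Channels past C in the last block are left as zero padding.
inline std::vector<float> ReorderNchwToNchwc(const std::vector<float>& src,
                                             std::size_t n, std::size_t c,
                                             std::size_t h, std::size_t w,
                                             std::size_t block) {
  assert(block > 0 && src.size() == n * c * h * w);
  const std::size_t c_blocks = (c + block - 1) / block;
  std::vector<float> dst(n * c_blocks * h * w * block, 0.0f);
  for (std::size_t ni = 0; ni < n; ++ni) {
    for (std::size_t ci = 0; ci < c; ++ci) {
      for (std::size_t hi = 0; hi < h; ++hi) {
        for (std::size_t wi = 0; wi < w; ++wi) {
          const std::size_t src_idx = ((ni * c + ci) * h + hi) * w + wi;
          const std::size_t dst_idx =
              ((((ni * c_blocks + ci / block) * h + hi) * w + wi) * block) +
              ci % block;
          dst[dst_idx] = src[src_idx];
        }
      }
    }
  }
  return dst;
}
```

With this layout, the values of one channel block at a given pixel are contiguous, which suits vectorized kernels that process several channels at once.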
Log a warning if the fallback is caused by a functional limitation.
Log an info-level message if the fallback is by design, e.g. nodes between Shape (CPU output) -> CUDA nodes ... -> Reshape (CPU input).
More cleanup of the math files. Instead of using templates to instantiate a full GEMM for the types added for MatMul (integers and double), use a simpler MatMul function that doesn't do any transposing and assumes alpha=1 and beta=0.
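A simplified MatMul of the kind described might look like the sketch below: row-major inputs, no transpose flags, and the output is overwritten (equivalent to alpha=1 and beta=0). SimpleMatMul is a hypothetical name, not the function added in this change.

```cpp
#include <cassert>
#include <cstddef>

// C[m x n] = A[m x k] * B[k x n], all row-major and contiguous.
// No transposition; C is overwritten rather than accumulated into.
template <typename T>
void SimpleMatMul(const T* a, const T* b, T* c,
                  std::size_t m, std::size_t k, std::size_t n) {
  assert(a != nullptr && b != nullptr && c != nullptr);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      T sum = T(0);
      for (std::size_t p = 0; p < k; ++p) {
        sum += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = sum;
    }
  }
}
```

Dropping the transpose and scaling parameters avoids instantiating a full GEMM template for every element type that only needs plain matrix multiplication.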
Fix the random unit test failure for RNN/GRU cases that have padded sequences, e.g. max_seq = 2, batch_size = 2, sequence_lengths = {2, 1}. For output positions beyond the end of the shorter sequence (length 1), the value should be initialized to 0.
Root cause:
The cuDNN library does not guarantee the output values beyond the end of a shorter sequence.
Fix:
Initialize the output Y data to all zeros before calling the cuDNN library.
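The effect of the fix can be shown with a host-side analogy. FakeRnnForward below is a hypothetical stand-in for a library call that writes only the valid time steps and leaves padded positions untouched; it is not the cuDNN API, and the [seq, batch, hidden] output layout is an assumption for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a library call that writes only valid time steps.
// Positions with t >= seq_lens[b] keep whatever the buffer held before.
inline void FakeRnnForward(const std::vector<std::size_t>& seq_lens,
                           std::size_t max_seq, std::size_t hidden, float* y) {
  const std::size_t batch = seq_lens.size();
  for (std::size_t t = 0; t < max_seq; ++t) {
    for (std::size_t b = 0; b < batch; ++b) {
      if (t >= seq_lens[b]) continue;  // padded step: nothing is written
      for (std::size_t h = 0; h < hidden; ++h) {
        y[(t * batch + b) * hidden + h] = 1.0f;
      }
    }
  }
}

// Zero the whole output first so padded positions are deterministic.
inline void RunRnn(const std::vector<std::size_t>& seq_lens,
                   std::size_t max_seq, std::size_t hidden, float* y) {
  assert(y != nullptr);
  std::fill_n(y, max_seq * seq_lens.size() * hidden, 0.0f);
  FakeRnnForward(seq_lens, max_seq, hidden, y);
}
```

Without the fill, the padded position for the shorter sequence would contain whatever was previously in the buffer, which would explain a failure that appears only intermittently.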