* POWER10: Add optimized dgemm kernel
This patch makes use of POWER10 matrix multiply assist feature and
adds new DGEMM kernel.
* Indentation update
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* Arm64 Depthwise Convolution 3x3.
* Add 5x5 intrinsic dwqconv for arm64
* rebase to master, remove logic no longer needed after arm64 convsym was enabled.
* Some more adjustment on the instruction pipelining.
* Add specific test cases.
* Fix test dimension too small.
* Fix build warning as error on some CI.
* better format, etc.
* Added checks for Hetero/Multi
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Remote Context Plugin
* changes for IO Buffer plugin
* erroneous couts added
* erroneous entry rectified
* Set the Openvino OP Buffer also as output
* Enable AUTO plugin in OpenVINO EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Please commit error message and rectification of param.context
* Alignment fixed
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Changed the string to OpenVINO_GPU
* Changed OpenVINO to OpenVINO_CPU
* Onnxruntime updated API for memory location
* Removing Duplicate LOG Error
* Tensor.h removed DeviceType function. Updated comment
* API Comments updated
* Removing changes to Provider Info
* Erroneous commit
* Removing Extra logs
* Merge CMAKE
* Do not copy from a local location
* Duplicate Entry
* Remove extra line
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>
Adding ARM64 depthwise convolution kernel for symmetric quantization
Motivation and Context
Two improvements over the current kernel code:
1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication.
2. Unrolled loop with manual software pipelining
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* Only serialize runtime optimization records container if non-empty.
* Remove runtime optimizations from onnxruntime/core/flatbuffers/schema/README.md as it's not completely implemented yet.
* Disable partial runtime optimization implementation by default.
ORT format model runtime optimization implementation is in progress.
This change adds a build.py option to disable the partial runtime optimization implementation, adds CI builds to test it, and disables runtime optimizations in mobile package builds.
* explicitly link with libtorch instead of using the cmake var, to avoid introducing an mkl dependency
* use find_lib to get libtorch lib name
* temp fix
* add missing libraries
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* libonnxruntime_providers_rocm.so and libonnxruntime_providers_shared.so are not included in python package.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Add Xamarin support to the ORT nuget packages.
- Update C# code to support Xamarin builds for iOS and Android
- refactor some things to split out common code
- include iOS and Android ORT native shared library in native nuget package
* POWER: Add Dgemm kernel for POWER processor
This patch adds new dgemm kernel specific to POWER processor.
* POWER: Restrict new functions to VSX in header
* Remove warning check in header
* POWER: Dgemm Adjust indentation
Fixing indentation based on review comments.
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* optimize python overhead of _post_amp_backward
* overwrite apex amp's zero_grad for faster implementation
* move unscale_fp16_grads_into_fp32_grads into C++ impl
* improve the efficiency further, reducing 3.5ms to 1.7ms for unilm.
* unilm 1.7ms to 338us: 1) optimize the python list <==> std::vector copy, 2) launch the kernels as soon as num_elem reaches the threshold. This helps reduce the CUDA idle time.
* refine the logic a bit after validating
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
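The "launch the kernels as soon as num_elem reaches the threshold" item above can be pictured with a small sketch. This is not the C++ implementation from the change; the function name `batches_by_element_count` and the idea of yielding index groups are assumptions made purely for illustration. Each yielded group stands in for one kernel launch.

```python
from typing import Iterable, Iterator, List


def batches_by_element_count(sizes: Iterable[int], threshold: int) -> Iterator[List[int]]:
    """Group tensor indices; emit a group once its element count reaches `threshold`.

    Hypothetical illustration: `sizes` holds the number of elements of each
    gradient tensor, and each yielded list stands in for one kernel launch.
    """
    batch: List[int] = []
    pending = 0
    for index, num_elem in enumerate(sizes):
        batch.append(index)
        pending += num_elem
        if pending >= threshold:
            yield batch  # launch now instead of waiting for all tensors
            batch, pending = [], 0
    if batch:
        yield batch  # flush the remainder at the end


if __name__ == "__main__":
    print(list(batches_by_element_count([10, 20, 100, 5, 5], threshold=30)))
```

Emitting work as soon as enough elements have accumulated lets the device start early while the host is still walking the remaining tensors, which is one plausible reading of how the idle time is reduced.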
Add kernels for QLinearConv with a symmetric quantized filter, e.g., the filter type is int8 and the zero point of the filter is 0. This PR includes kernels for avx2, avxvnni, avx512 and avx512vnni. Kernels for ARM64 will be added in a following PR.
The kernels use the input buffer directly for pointwise conv, and an indirect buffer for depthwise and non-group conv.
The advantages of these new kernels are:
- no need to compute the sum of the inputs for each output pixel, and the sum/offset of the filter can be combined with the bias.
- with the indirect buffer, im2col returns an array of buffer pointers instead of memcpy'ing the original data. This saves memcpy time and reduces the size of the intermediate buffer needed to hold the im2col transform. In the future, im2col will be computed ahead of time for inputs with a fixed size.
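The first advantage follows from simple algebra. With input zero point zx and filter zero point 0, sum((x - zx) * w) = sum(x * w) - zx * sum(w); the second term depends only on the filter, so it can be folded into the bias once per output channel. With a non-zero filter zero point zw there would be an extra zw * sum(x) term that needs a per-pixel input sum. The sketch below checks the identity on plain Python integers. It is not the MLAS kernel code, and all function names are hypothetical.

```python
from typing import List


def conv_dot_with_zero_point(x: List[int], w: List[int], x_zero: int, bias: int) -> int:
    """Reference: subtract the input zero point from every element first."""
    return bias + sum((xi - x_zero) * wi for xi, wi in zip(x, w))


def fold_zero_point_into_bias(w: List[int], x_zero: int, bias: int) -> int:
    """Precompute bias - x_zero * sum(w); depends only on the filter."""
    return bias - x_zero * sum(w)


def conv_dot_folded(x: List[int], w: List[int], folded_bias: int) -> int:
    """Inner loop on raw quantized inputs; no per-pixel input sum is needed."""
    return folded_bias + sum(xi * wi for xi, wi in zip(x, w))


if __name__ == "__main__":
    x = [130, 7, 255, 0]      # uint8 activations (illustrative values)
    w = [-3, 5, 127, -128]    # int8 filter with zero point 0
    folded = fold_zero_point_into_bias(w, x_zero=128, bias=11)
    assert conv_dot_folded(x, w, folded) == conv_dot_with_zero_point(x, w, 128, 11)
```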
* re-hipify all rocm EP sources
* fix all other files affected by re-hipify
* add cuda_provider_factory.h to amd_hipify.py
* do not use cudnn_conv_algo_search in ROCm EP, missing reduce min registration
* Fix ReduceConsts template specialization introduced in #9101.
Fixes the error when building for ROCm 4.3.1:
error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0)
* fix flake8 error in amd_hipify.py
* speed up hipify with concurrent.futures
* flake8 fix in amd_hipify.py
* removing warnings which are causing errors from torch and changing flags for Windows
* adding MKL library resolution and comments
* cleaning up the code
* fixing onnxruntime_python file for windows build
* fix the include order to avoid the python_d.lib issue on win debug build
* changes for warnings, typos and other comments
* merge conflict
* adding fix for mkl library error
* Revert "adding fix for mkl library error"
This reverts commit 73b87c73c2.
* fix for dll path for windows
* typo for dll path
Co-authored-by: Cheng Tang <chenta@microsoft.com>
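For the earlier "speed up hipify with concurrent.futures" entry, a minimal sketch of the general pattern is shown here. It does not reproduce amd_hipify.py: `hipify_text` is a hypothetical stand-in for the real per-file translation, and the in-memory dictionary replaces actual file I/O.

```python
import concurrent.futures
from typing import Dict


def hipify_text(source: str) -> str:
    """Hypothetical stand-in for the real per-file CUDA-to-HIP translation."""
    return source.replace("cuda", "hip").replace("CUDA", "HIP")


def hipify_all(sources: Dict[str, str], max_workers: int = 4) -> Dict[str, str]:
    """Translate every file concurrently and return {path: translated text}."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results line up with the keys.
        translated = pool.map(hipify_text, sources.values())
        return dict(zip(sources.keys(), translated))


if __name__ == "__main__":
    print(hipify_all({"a.cu": "cudaMalloc", "b.h": "CUDA_CHECK"}))
```

A thread pool suits work dominated by file I/O or by waiting on subprocesses; for CPU-bound pure-Python translation, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface.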
* model caching changes for 2021.4
Signed-off-by: Your Name <you@example.com>
* changed the ov version check
* Minor changes added
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added support for external data format
Starting from OpenVINO version 2021.4, OpenVINO-EP
supports ONNX models with weights saved in an external
file location.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Introduced Hetero/Multi options for perf_test
Enables use of the HETERO/MULTI device feature of
OpenVINO-EP from the onnxruntime_perf_test tool.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* cleaned up CMake code for older OV version support
OV 2020.3 is no longer supported by OpenVINO-EP.
This check is not required now.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Add option to disable graph partitioning
Added an option to disable graph partitioning
at build time for OpenVINO-EP.
With this option, when the model is not fully
supported on OpenVINO-EP, the whole model falls
back to the default CPU EP (MLAS).
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
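The behaviour described for the disable-graph-partitioning option can be summarised with a toy sketch. This is not the OpenVINO-EP code: `assign_providers` and the per-node model are assumptions for illustration, and real partitioning groups nodes into subgraphs rather than deciding node by node.

```python
from typing import Dict, List, Set


def assign_providers(nodes: List[str], supported: Set[str], partitioning: bool) -> Dict[str, str]:
    """Map each node name to the provider that would run it (toy model)."""
    if partitioning:
        # Partitioning enabled: supported nodes go to OpenVINO, the rest to CPU.
        return {n: ("OpenVINO" if n in supported else "CPU") for n in nodes}
    # Partitioning disabled: all-or-nothing. Any unsupported node sends the
    # whole model to the default CPU provider.
    target = "OpenVINO" if all(n in supported for n in nodes) else "CPU"
    return {n: target for n in nodes}


if __name__ == "__main__":
    print(assign_providers(["Conv", "Loop"], {"Conv"}, partitioning=False))
```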
* Changed the flag for disabling graph partitioning
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixes the flake8 check error
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added changes for disable graph partition option
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixed flake8 indentation error
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: Your Name <you@example.com>