* squashed commit for standalone tvm execution provider
* critical fix for correct python build with stvm ep
* get tuning log file from ep options. It has priority over AUTOTVM_TUNING_LOG
* updates and fixes
* update parsing of stvm provider options
* add support of external data for onnx model
* add conditional dump of subgraphs
* remove unused code
* get input tensor shapes through provider options. get output shapes for fixed input ones by TVM API
* support AUTO_TVM tuning log file inside ORT. Selector for Ansor and Auto_TVM is provider option (tuning_type)
* add fp16
* add functionality of conversion of model layout to NHWC if need. Necessary parameter was added to STVM provider options
* fix license text in header. fix log format
* small fixes
* fix issues from flake8
* remove model proto construction from GetCapability
* reserve memory for vector of DLTensors
* add simple tutorial for STVM EP
* STVM docs
* jroesch/tvm -> apache/tvm
* remove dead code, unneccessary logs and comments
* fix in readme
* improve tutorial notebook
* tvm update
* update STVM_EP.md
* fix default value
* update STVM_EP.md
* some TODOs for the future development
* shorten long lines
* add hyperlink to STVM_EP.md
* fix Linux CI error
* fix error in csharp test
Co-authored-by: Jared Roesch <jroesch@octoml.ai>
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
* update base image from 11.4.0 to 11.4.2
* update Linux TRT GPU pipeline to TRT 8.2
* update onnx-tensorrt to 8.2-GA
* disable failing TensorRT 8.2 tests.
* update pad test.
* fix
* update win trt ci pipeline to trt 8.2
* test run with cuda 11.4 and cudnn 8.2
* increase timeout
* revert
* revert
* update packaging pipelines to use trt 8.2
* fix typo
* update trt gpu perf pipeline to trt 8.2
* increase timeout
* delete deprecated ci-perf-pipeline.yml
* bump timeout
* adjust timeout packaging
Adding a symmetric quantized convolution kernel for ARM64
Note:
Indirect conv performs worse for shallow convs (input channels are small). This is much more so for low end pre-dot CPUs, where only 128 or deeper conv is faster with indirect conv. With DOT-CPUs, 32 deep conv is already faster
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* Add QAttention to DNNL EP
Add QAttention to DNNL EP (limited support and disable for gpu)
update ONEDNN version to 2.4.4
bug fix in getcapability
add memory debug print
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Address Code Review + MatMulInteger Fix
clean up code and add comments
fix matmulinteger and add fusion rule to enable initialized vector weight zero
points of 0s
update DNNL_TAG to v2.5
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Linux Compile Fix + rollback ONEDNN to 2.4.4
Signed-off-by: Zhaoyang Wang <zhaoyang.wang@intel.com>
* Fix QAttention Debug build
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Fix QAttention build if USE_DNNL not specified
Signed-off-by: George Nash <george.nash@intel.com>
Co-authored-by: Wang <zhaoyang.wang@intel.com>
Co-authored-by: MTC <63478620+jeyblu@users.noreply.github.com>
* fix error C4996
* remove wd4996 and fix error C4966
* fix typo
* remove wd4996 for onnx-tensorrt
* remove more /wd for onnx-tensorrt
* gix bug for strncpy_s of (Buffer is too small && 0)
* fix code to remove warning 4244
* fix code to remove warning 4267
* remove /wd4267 /wd4244
* fix bug
* change int to size_t
* using size_t instead of int
* use float instead of double
* Use size_t instead of int
* use size_t instead of int
* use size_t instead of int. Also fix typo
* Changes to ensure openvino build go through in Windows
* Modified Hetero plugin Logic
*Modified Hetero Feature logic. In Hetero,
if the operator to be marked true in getcapability(),
it should be supported by either of the devices
specified with HETERO in the device_type.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* OV updated to 2021.4.2 version
* OV updated to 2021.4.2 version
* Updated OV to 2021.4.2 version, mono download link and dotnet version
* Copying Managed nugets in openvino c# docker file
*Copying Managed nuget to nugets artifacts
directory
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: saharfraza <sfatima.3001@gmail.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: Aravind Gunda <aravindx.gunda@intel.com>
* POWER10: Add optimized dgemm kernel
This patch makes use of POWER10 matrix multiply assist feature and
adds new DGEMM kernel.
* Indentation update
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* Arm64 Depthwise Convolution 3x3.
* Add 5x5 intrinsic dwqconv for arm64
* rebase to master, remove no-need logic after arm64 convsym enabled.
* Some more adjustment on the instrunction pipeling.
* Add specific test cases.
* Fix test dimension too small.
* Fix build warning as error on some CI.
* better format, etc.
* Added checks for Hetero/Multi
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Remote Context Plugin
* changes for IO Buffer plugin
* erronous couts added
* erronous entry rectified
* Set the Openvino OP Buffer also as output
* Enable AUTO plugin in OpenVINO EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Remote Context Plugin
* changes for IO Buffer plugin
* erronous couts added
* erronous entry rectified
* Added checks for Hetero/Multi
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Set the Openvino OP Buffer also as output
* Enable AUTO plugin in OpenVINO EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Please commit error message and rectification of param.context
* Alignment fixed
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Changed the string to OpenVINO_GPU
* hanged OpenVINO to to OpenVINO_CPU
* Onnxruntime updated API for memory location
* Removing Duplicate LOG Error
* Tensor.h removed DeviceType function. Updated comment
* API Comments updated
* Removing changes to Provider Indo
* Erronous commit
* Removing Extra logs
* Merge CMAKE
* Not copy from a local location
* Duplicate Entry
* Remove extra line
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>
Adding ARM64 depthwise convolution kernel for symmetric quantization
Motivation and Context
Two improvements against current kernel code :
1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication.
2. Unrolled loop with manual software pipelining
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* Only serialize runtime optimization records container if non-empty.
* Remove runtime optimizations from onnxruntime/core/flatbuffers/schema/README.md as it's not completely implemented yet.
* Disable partial runtime optimization implementation by default.
ORT format model runtime optimization implementation is in progress.
This change adds a build.py option to disable the partial runtime optimization implementation, adds CI builds to test it, and disables runtime optimizations in mobile package builds.
* explicit link with libtorch instead of use cmake var to avoid introduce mkl dependency
* use find_lib to get libtorch lib name
* temp fix
* add missing libraries
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* libonnxruntime_providers_rocm.so and libonnxruntime_providers_shared.so are not included in python package.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Add Xamarin support to the ORT nuget packages.
- Update C# code to support Xamarin builds for iOS and Android
- refactor some things to split out common code
- include iOS and Android ORT native shared library in native nuget package
* POWER: Add Dgemm kernel for POWER processor
This patch adds new dgemm kernel specific to POWER processor.
* POWER: Restrict new functions to VSX in header
* Remove warning check in header
* POWER: Dgemm Adjust indentation
Fixing indentation based on review comments.
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* optimize python overhead of _post_amp_backward
* overwrite apex amp's zero_grad for faster implementation
* move unscale_fp16_grads_into_fp32_grads into C++ impl
* improve the efficiency furthur, reducing 3.5ms to 1.7ms for unilm.
* unilm 1.7ms to 338us: 1). optimize python list <==> std::vector copy, 2). launch the kernels as long as num_elem reach thresh hold. This help reduce the CUDA idel time.
* refine the logic a bit after validating
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>