* Arm64 Depthwise Convolution 3x3.
* Add 5x5 intrinsic dwqconv for arm64
* rebase to master, remove logic no longer needed after arm64 convsym was enabled.
* Some more adjustment on the instruction pipelining.
* Add specific test cases.
* Fix test dimension too small.
* Fix build warning as error on some CI.
* better format, etc.
* add use_tensorrt build option
* Add use_tensorrt to running tests
* add use_tensorrt for Windows
* make trt ep to skip backend test
* Fix bug
* Add/Modify description
* modify for debug
* switch pool to test
* modify to debug
* modify to debug
* add verbosity
* refine the code
* refine the code
* refine the code
* fix flake8 warning
* refine the code
* add pre_load check for trt as well as add cupti lib to cuda dependencies
* modify script to make trt build path the same as cuda
* show error message when user wants to run TensorRT but TensorRT is not installed in the env
* fix bug
* fix bug
* add trt lib for manylinux
* include cuda_dependencies for trt
* rewrite the condition to throw exception
* make code more compact
* Add intermediate header between the ORT code and pybind11 to work around an issue with VS2022 debug builds by making sure corecrt.h is included first.
This avoids the _STL_ASSERT macro being defined in an incompatible way for a debug build by pybind including the python headers with _DEBUG temporarily undefined.
See #9735 for details.
* register custom symbolic for einsum
* bugfix for case needs permute at the end
* refactor
* refactor equation parser
* support new case, use ReduceProd
* optimize perf and graph
* remove some Gather node
* add more ut, fix gemm trans fusion
* Update required operators for prebuilt package to add opsets 14 and 15.
Add helper script to check if the prebuilt package will support the model and if not why not.
* Add support for multiple opsets being specified on a single line in the required operators config. This makes it easier to update the pre-built package config.
It's also required for validation tools to work as they only have a single opset from the model and not per-operator opsets. If we only list the incremental ops we could merge in the ops from the previous opset, but that wouldn't give a way to drop an operator from being supported.
Left the info on which ops changed though so we have a better feel for the cost of supporting each opset.
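A minimal sketch of how a config line carrying several opsets might be parsed. The `<domain>;<opsets>;<ops>` layout and both function names are assumptions made for illustration; they are not taken from this change.

```python
from collections import defaultdict


def parse_required_ops_line(line):
    """Split one config line into (domain, opsets, operators).

    Assumed layout: <domain>;<opset>[,<opset>...];<op>[,<op>...]
    """
    domain, opsets, ops = (part.strip() for part in line.split(";", 2))
    opset_list = [int(o) for o in opsets.split(",") if o.strip()]
    op_list = [op.strip() for op in ops.split(",") if op.strip()]
    return domain, opset_list, op_list


def parse_required_ops(text):
    """Map (domain, opset) -> set of operators, skipping blanks and # comments."""
    required = defaultdict(set)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        domain, opsets, ops = parse_required_ops_line(line)
        for opset in opsets:
            # Every opset listed on the line gets the full operator list.
            required[(domain, opset)].update(ops)
    return dict(required)


if __name__ == "__main__":
    sample = "# hypothetical config\nai.onnx;14,15;Add,Relu\n"
    for key, ops in sorted(parse_required_ops(sample).items()):
        print(key, sorted(ops))
```

Listing the full operator set per opset, rather than only the increments, keeps it possible to drop an operator from a later opset, which matches the reasoning given above.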
* Added checks for Hetero/Multi
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Remote Context Plugin
* changes for IO Buffer plugin
* erroneous couts added
* erroneous entry rectified
* Set the Openvino OP Buffer also as output
* Enable AUTO plugin in OpenVINO EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Please commit error message and rectification of param.context
* Alignment fixed
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Changed the string to OpenVINO_GPU
* Changed OpenVINO to OpenVINO_CPU
* Onnxruntime updated API for memory location
* Removing Duplicate LOG Error
* Tensor.h removed DeviceType function. Updated comment
* API Comments updated
* Removing changes to Provider Info
* Erroneous commit
* Removing Extra logs
* Merge CMAKE
* Do not copy from a local location
* Duplicate Entry
* Remove extra line
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>
Adding ARM64 depthwise convolution kernel for symmetric quantization
Motivation and Context
Two improvements over the current kernel code:
1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication.
2. Unrolled loop with manual software pipelining
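For reference, the arithmetic a 3x3 depthwise convolution performs on signed int8 data can be sketched in NumPy as below. This is not the ARM64 kernel. It assumes zero points of 0 for both input and filter, stride 1, no padding, and an HWC layout, none of which are stated in this change. The NumPy code widens to int32 up front for clarity, whereas the kernel described above uses signed int8 instructions to avoid a separate 8-bit to 16-bit extension step.

```python
import numpy as np


def depthwise_conv3x3_s8(x, w):
    """Reference 3x3 depthwise convolution on signed int8 data.

    x: (H, W, C) int8 input, w: (3, 3, C) int8 filter.
    Returns an int32 accumulator of shape (H - 2, W - 2, C).
    Assumes zero points of 0, stride 1, and no padding.
    """
    h, wd, c = x.shape
    out_h, out_w = h - 2, wd - 2
    acc = np.zeros((out_h, out_w, c), dtype=np.int32)
    for ky in range(3):
        for kx in range(3):
            # Each channel is multiplied only by its own filter tap.
            patch = x[ky:ky + out_h, kx:kx + out_w, :].astype(np.int32)
            acc += patch * w[ky, kx, :].astype(np.int32)
    return acc


if __name__ == "__main__":
    x = np.ones((4, 4, 2), dtype=np.int8)
    w = np.ones((3, 3, 2), dtype=np.int8)
    print(depthwise_conv3x3_s8(x, w).shape)
```

A reference like this is mainly useful for checking an optimized kernel's accumulators on small random inputs.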
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* Only serialize runtime optimization records container if non-empty.
* Remove runtime optimizations from onnxruntime/core/flatbuffers/schema/README.md as it's not completely implemented yet.
* Disable partial runtime optimization implementation by default.
* schema change
* cc changes
* remove temp debug code
* Adding fbs namespace to session_state_flatbuffers_utils.h
* Add fbs namespace to all ort format utils
When the pattern Sum(Gemm(A, B), C) exists, we can convert it to
Gemm(A, B, C), assuming that the output of the original Gemm is
not used elsewhere and that this change does not break broadcasting.
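The equivalence behind this fusion, and the broadcasting condition, can be illustrated with a small NumPy sketch. The helper names `gemm` and `can_fuse` are hypothetical and do not correspond to the ORT transformer code. The sketch assumes alpha = beta = 1, no transposes, and an original Gemm without a C input.

```python
import numpy as np


def gemm(a, b, c=None):
    """Simplified Gemm: A @ B (+ C), with alpha = beta = 1 and no transposes."""
    y = a @ b
    if c is not None:
        # Gemm only lets C broadcast *to* the (M, N) output shape.
        y = y + np.broadcast_to(c, y.shape)
    return y


def can_fuse(gemm_out_shape, c_shape):
    """True if adding C leaves the output shape equal to the Gemm output shape."""
    try:
        fused = np.broadcast_shapes(tuple(gemm_out_shape), tuple(c_shape))
    except ValueError:
        return False
    return fused == tuple(gemm_out_shape)


if __name__ == "__main__":
    a = np.arange(6, dtype=np.float32).reshape(2, 3)
    b = np.arange(12, dtype=np.float32).reshape(3, 4)
    c = np.arange(4, dtype=np.float32)
    # Sum(Gemm(A, B), C) and Gemm(A, B, C) agree when C broadcasts to (M, N).
    print(np.allclose(gemm(a, b) + c, gemm(a, b, c)))
    # A C that would enlarge the output shape cannot be folded into Gemm.
    print(can_fuse((2, 4), (5, 2, 4)))
```

Sum allows broadcasting in both directions, so a C with extra leading dimensions would change the output shape; that is the case the shape check rejects.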
ORT format model runtime optimization implementation is in progress.
This change adds a build.py option to disable the partial runtime optimization implementation, adds CI builds to test it, and disables runtime optimizations in mobile package builds.
* Construct valid graphs for ONNX checker for IR version < 4.
Previously the constructed graph was not guaranteed to have its
initializers be a subset of its inputs, which is required for IR
version < 4. This resulted in spurious failures.
Fixes #9663
implement dynamicquantizelinear in DNNL EP
add debug log in EP for operator coverage
block gpu elementwise op with 5 dims or more
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Fix #9671 by running the level 1 rewrite rules first and allowing the transpose optimizer to run multiple times to ensure it completes in level 1.
Removed unnecessary call to GenerateRuleBasedGraphTransformer as there are no level 2 rewrite rules.
* remove default python ep registration. raise an exception if providers are not explicitly set while providers are available
* temporarily disable exception
* fix python tests
* explicitly set CUDAProvider for python iobinding tests
* explicitly set providers param for InferenceSession()
* onnxrt
* raise ValueError if providers are not explicitly set when creating InferenceSession
* add required providers param
* explicitly set providers
* typo
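A minimal sketch of creating a session with providers set explicitly. `pick_providers` and `make_session` are hypothetical helpers and the preference order is an assumption; `onnxruntime.get_available_providers()` and the `providers` argument of `InferenceSession` are the existing Python API.

```python
DEFAULT_PREFERENCE = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available, preferred=DEFAULT_PREFERENCE):
    """Return the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise ValueError(f"none of {list(preferred)} found in {list(available)}")
    return chosen


def make_session(model_path):
    """Create an InferenceSession with an explicit providers list."""
    # Imported lazily so pick_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


if __name__ == "__main__":
    print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

Passing the list explicitly makes the execution provider a visible choice in the calling code instead of depending on which providers the build happens to include.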
* Add options to 1. enable qdq for a node's output 2. force qdq to appear as a pair
* modify description
* modify description
* Revert the logic of variable
* Revert the logic of variable
* Code refactor based on review's suggestions
* Update init
* Code refactor to allow specifying nodes to exclude from output quantization
* rename variable
* Fix bug
* code refactor
* remove the exposure of APIs
* fix bug
* fix bug
* fix bug
* fix bug
* expose one API
Co-authored-by: Ubuntu <onnxruntime@ort-trt-ep-linux-t4.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>
Co-authored-by: Chi Lo <Chi.Lo@gmail.com>