* re-hipify all rocm EP sources
* fix all other files affected by re-hipify
* add cuda_provider_factory.h to amd_hipify.py
* do not use cudnn_conv_algo_search in ROCm EP; missing ReduceMin registration
* Fix ReduceConsts template specialization introduced in #9101.
Fixes the error when building for ROCm 4.3.1:
error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0)
* fix flake8 error in amd_hipify.py
* speed up hipify with concurrent.futures
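Since each source file is hipified independently, the conversions parallelize trivially. A sketch of the concurrent.futures pattern — `hipify_file` and the file list below are illustrative stand-ins, not the real amd_hipify.py API, which shells out to the hipify tool per file:

```python
import concurrent.futures

def hipify_file(src_path: str) -> str:
    # Stand-in for the real per-file conversion; here we just map a
    # CUDA name to its HIP/ROCm counterpart.
    return src_path.replace("cuda", "rocm")

def hipify_all(sources):
    # Files are independent, so convert them concurrently; map()
    # preserves the input order of results.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        return list(executor.map(hipify_file, sources))

converted = hipify_all(["cuda_allocator.cc", "cuda_common.h"])
print(converted)  # -> ['rocm_allocator.cc', 'rocm_common.h']
```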
* flake8 fix in amd_hipify.py
* removing warnings which are causing errors from torch and changing flags for Windows
* adding MKL library resolution and comments
* cleaning up the code
* fixing onnxruntime_python file for windows build
* fix the include order to avoid the python_d.lib issue on win debug build
* changes for warnings, typos and other comments
* merge conflict
* adding fix for mkl library error
* Revert "adding fix for mkl library error"
This reverts commit 73b87c73c2.
* fix for dll path for windows
* typo for dll path
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* model caching changes for 2021.4
Signed-off-by: Your Name <you@example.com>
* changed the ov version check
* Minor changes added
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added support for external data format
Starting from the OpenVINO 2021.4 release, OpenVINO-EP
supports ONNX models with weights saved in an external
file location.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Introduced Hetero/Multi options for perf_test
Enabled use of the HETERO/MULTI device feature of
OpenVINO-EP from the onnxruntime_perf_test tool.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* cleaned up CMake code for older OV version support
OV 2020.3 is no longer supported by OpenVINO-EP,
so this check is no longer required.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Add option to disable graph partitioning
Added an option to disable graph partitioning
at build time for OpenVINO-EP.
With this option, when a model is not fully
supported on OpenVINO-EP, the model falls back
entirely to the default CPU EP (MLAS).
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Changed the flag for disabling graph partitioning
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixes the flake8 check error
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added changes for disable graph partition option
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixed flake8 indentation error
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: Your Name <you@example.com>
* implement cuda provider
* define profiler common
* call start after register
* add memcpy event
* add cuda correlation
* format code
* add cupti to test path
* switch to CUpti_ActivityKernel3
* reset cupti path
* fix test case
* fix trt pipeline
* add namespace
* format code
* exclude training from testing
* remove mutex
* make work for both rocm 4.2 and rocm 4.3.1
* fix rocm 4.3.1 docker image reference
* fix CUDA_VERSION to ROCM_VERSION
* fix ReduceConsts conflict def
* add ifdef to miopen_common.h as well
* trailing ws
* Include pytorch_export_contrib_ops in inference builds
Rename / move it from tools/python/register_custom_ops_pytorch_exporter
to onnxruntime/python/tools/pytorch_export_contrib_ops.
Rationale for inclusion in inference builds:
This code is potentially useful for anyone using ORT, not just training.
Rationale for new name:
"Contrib op" is the nomenclature used within ORT to refer to the set of
ops that are not in the standard op set but are included by default with
ORT. This is more specific than "custom op", which is what the PyTorch
exporter uses to refer to any non-standard op.
Step 1 of addressing #8818. After this is merged I will update the docs.
* Enable test_pytorch_export_contrib_ops.py in CI
Fixes AB#1342330
* Use PROTOBUF_LIB instead of protobuf::libprotobuf
* Moved setdlopenflags to _pybind_state.py
* Copy the generated _pybind_state.py to required location for Windows.
* Add command to skip tests
* Remove support for OV_2021.3_LTS and ov_2021.1
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Removed request_id parameter from all references
The request_id parameter was used with the OV 2020.3
release. Starting from the OV 2020.4 release, the
input_name parameter is used instead to get the
KernelContext_GetInput.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Enabling CI Logs in the branch
* CI Commits to enable logs
* Enable CI Print
* Added ImageScaler op to the supported ops list
Fixes the test_tiny_yolo_V2 opset 8 model so it is
supported fully on OV-EP. This model is an older
variation of the tiny_yolo_v2 model that contains the
ImageScaler op.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added ops to fully support yolov3 model
-Added changes to support the yolov3 opset 10 model
fully on CPU_FP32.
-This also increases the operator coverage for GPU
hardware, thereby enabling the yolov3 model on GPU
with fewer subgraphs.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Enabling tiny_yolov3 model fully on CPU
-> Enabled the tiny_yolov3 model fully on CPU.
-> Also reduces the number of subgraphs needed
to infer this model on GPU.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Adding GatherND op support for CPU and GPU
-> This enables the yolov3_pytorch model to work
with fewer subgraphs on CPU and GPU devices.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixes Albert model for ISV customer
The ConvTranspose op was getting rejected
due to a condition check; fixed it.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Disabling these 4 C++ tests for OpenVINO-EP
These unit tests fail under special conditions
for the ConvTranspose op with the output_shape
attribute, so disabling them for now.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Docker file changes for 2021.4-v3.1
* Removing duplicate code
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* ReduceMax with no dimension supported
* Fixes failing protobuf issue for docker
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Excluding openvino-ep type for convtranspose test
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Disabled 2 Failing convtranspose tests with TensorRT EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
Co-authored-by: Aravind Gunda <aravindx.gunda@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
* Expose symbols in onnx and protobuf namespaces in python when building with --enable_external_custom_op_schemas
* Add external onnx and protobuf files to wheel
* Added an example to demonstrate external custom ops use-case
* Added a Linux build pipeline to test external custom ops
* Enable selecting custom ops in onnxruntime-extensions.
* Move cmake_helper.py.
* Remove over-indented spaces.
* Add doc.
* Remove onnxruntime-extensions from git submodules; the user should pass the path of onnxruntime-extensions to the build.
* Modify doc.
* Remove argument --enable_onnxruntime_extensions and use --onnxruntime_extensions_path.
* Fix build error.
* Fix build error.
* Use onnxruntime_extensions_path.
* support both submodule and external source folders
* refinement
* Update cgmanifest.json
* Support building onnxruntime-extensions from either git submodule or pre-pulled path.
* Update doc.
* more standard name
* update docs
* add the copyright header
Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
Co-authored-by: Wenbing Li <wenbingl@outlook.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
* separate the training python module; share the execution provider instance
* fix build break
* fix cuda test crash; reorg the python module code base
* set correct env
* use provider customized hash func
* fix build break
* fix rocm break
* use const ref in argument
* rename the file
* move hash func to training module
* Revert "Cleanup C# bindings to add EP (#8810)"
This reverts commit b21ea00020.
* Add back in a minimal set of changes.
Provide stubs in for a limited set of things
- things called from C# using a static lib of ORT built for mac/ios
- things in OrtApis that are not included in the build by default
- things in OrtApis that are excluded in a minimal build
* Clean up order of EPs in test
* Fix unused function in ROCM build
Add cmake parameter and #ifdefs to allow for disabling sparse tensor support. This comes with a significant binary size cost so we want to be able to exclude it in a minimal build.
Fix C# add EP bindings.
Add stubs to ORT so that if EP is not included in the build we return a graceful error message.
Move declaration of stubs into the C API and out of the EP so they're in one place and are easier to use (no extra header required in the C/C++ world, and consistent with the CUDA EP setup).
Fix inconsistency in ROCM EP.
Cleanup a few other things.
* Remove APIs unavailable in Store in #8349, #8178, #8065
* Add UWP stubs of C runtime functions
* Remove UWP incompatible tests from UWP build
* Remove incompatible tests from Store
* Use UWP stubs in store only
* Skip partition check outside of Windows
* Remove unused WRL include
* Workaround Windows header not including what it uses
* Fix precompiled header name clash
* Workaround SDK bugs
* DXCore workaround in Win7
* Fix warning
* Fix more warnings
* Bump WinML to target Windows 8
* Fix more warnings
* Remove unnecessary workarounds
* Remove Desktop only APIs from DML adapter
* adding support for tracing to sqldb instead of files
* use compiled statements
* script to pull tensors from db
* link sqlite3
* remove node info redundant with onnx graph
* addressing PR comments
* address PR comments and include program counter
* third party notice
* use find_package
* add to cgmanifests.json
* address thread safety and add pid suffix
* build fix
* python script to select on devicetype
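A hedged sketch of what such a device-type selection script might look like; the table name `tensors` and its columns (`name`, `device_type`, `value`) are illustrative assumptions, not the actual trace database schema:

```python
import sqlite3

# Assumed illustrative schema -- the real trace db layout may differ.
SCHEMA = "CREATE TABLE tensors (name TEXT, device_type TEXT, value BLOB)"

def select_by_device(conn, device_type):
    # Compiled statement with a `?` placeholder: sqlite3 prepares the
    # statement once and binds the parameter safely.
    cur = conn.execute(
        "SELECT name, value FROM tensors WHERE device_type = ?",
        (device_type,),
    )
    return cur.fetchall()

# Tiny in-memory example standing in for the dumped trace file.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO tensors VALUES (?, ?, ?)",
    [("input_0", "CPU", b"\x00"), ("conv_0", "GPU", b"\x01")],
)
rows = select_by_device(conn, "GPU")
print(rows)  # -> [('conv_0', b'\x01')]
```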
* remove unpopulated and redundant Shape and Type fields
* comment
* comment
* PR comments
* add graph execution counter to session state
* move increment to inference session
* std::endl to \n
* ifdef on graph execution counter
* add ifdef to inference session
* move DEBUG_NODE_INPUTS_OUTPUTS to CMakeLists.txt
* fix build - python.h not found
* disable --build_shared_lib for ortmodule tests
* fix
* fix the build flag
* disable --build_shared_lib for training path (not only for ortmodule)
* fix missing test model files
* disable test CApiTest.test_custom_op_library when ENABLE_TRAINING_TORCH_INTEROP is ON
* enable custom_op_library build
* fix build
* fix
* merge master and fix build failure
* build onnx_test_runner when onnxruntime_ENABLE_TRAINING_TORCH_INTEROP is ON
* resolve comments
* use --enable_training_torch_interop to replace "onnxruntime_ENABLE_TRAINING_TORCH_INTEROP=ON"
Using signed int, the qgemm kernel avoids extending uint8 to int16 while computing the matrix multiplication, achieving higher performance. We also find that by using only the lower 64b of the vector registers to load the A and B matrices, we get a further performance improvement. We also experimented with using ldp to load two 64b values in one shot versus two ldr instructions to load one 64b value at a time; on both big and little cores there is no noticeable difference.
Submitting the LDP version. At this point we don't need to choose the kernel based on micro-architecture.
Inference time of resnet50, thread count 2:

Big core (Pixel 3a):
- Current master: 292.947 ms
- First-iteration S8S8: 188.239 ms
- LDP (two 64b regs per load): 178.715 ms
- LDR (one 64b reg per load): 179.536 ms

Little core:
- Master: 546.317 ms
- S8S8: 513.332 ms
- LDP: 489.19 ms
- LDR: 497.865 ms

Raspberry Pi 3B+:
- Master: 660.08 ms
- S8S8: 608.577 ms
- LDP: 603.675 ms
- LDR: 602.075 ms