* Add command to skip tests
* Remove support for OV_2021.3_LTS and ov_2021.1
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Removed request_id parameter from all references
request_id parameter was being used with ov_2020.3
release. Starting from 2020.4 OV release, input_name
paramater is being used instead to get the
KernelContext_GetInput.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Enabling CI Logs in the branch
* CI Commits to enable logs
* Enable CI Print
* Added Imagescaler op to the supported op's list
Fixes test_tiny_yolo_V2 opset 8 model to support
fully on OV-EP. This model is the older variation
of tiny_yolo_v2 model which has Imagescaler op.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Added ops to fully support yolov3 model
-Added changes to support yolov3 opset 10 model
fully on CPU_FP32.
-This also increases the operator coverage for GPU
hardware. There by enabling yolov3 model on GPU
with fewer subgraphs.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Enabling tiny_yolov3 model fully on CPU
->Enabled tiny_yolov3 model fully on CPU.
-> Also reduces the number of subgraphs
to infer this model on GPU
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Adding GatherND op support for CPU and GPU
->This enables yolov3_pytorch model to work
with fewer subgraphs on CPU and GPU Devices.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixes Albert model for ISV customer
ConvTranspose op was getting rejected
due to a condition. Fixed it.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Disabling this 4 cpp tests for openvino-ep
These unit tests are failing with special conditions
for conv_transpose op with output_shape attribute.
so disabling them for now.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Docker file changes for 2021.4-v3.1
* Remvoing duplicate code
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* ReduceMax No dimension supported
* Fixes failing protobuf issue for docker
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Excluding openvinoep type for convtranpose test
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Disabled 2 Failing convtranspose tests with TensorRT EP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
Co-authored-by: Aravind Gunda <aravindx.gunda@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel/com>
* Expose symbols in onnx and protobuf namespaces in python when building with --enable_external_custom_op_schemas
* Add external onnx and protobuf files to wheel
* Added an example to demonstrate external custom ops use-case
* Added a Linux build pipeline to test external custom ops
* Enable selecting custom ops in onnxruntime-extensions.
* Move cmake_helper.py.
* Remove over-indented spaces.
* Add doc.
* Remove onnxruntime-extensions from git submodules, and user should pass path of onnxruntime-extensions for build.
* Modify doc.
* Remove argument --enable_onnxruntime_extensions and use --onnxruntime_extensions_path.
* Fix build error.
* Fix build error.
* Use onnxruntime_extensions_path.
* support both submodule and external source folders
* refinement
* Update cgmanifest.json
* Support building onnxruntime-extensions from either git submodule or pre-pulled path.
* Update doc.
* more standard name
* update docs
* add the copyright header
Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
Co-authored-by: Wenbing Li <wenbingl@outlook.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
* seperate the training python module; share the execution proivder instance
* fix build break
* fix cuda test crash; reorg the python module code base
* se correct env
* use provider customized hash func
* fixbuild break
* fix rocm break
* use const ref in argument
* rename the file
* move hash func to trainiing module
* Revert "Cleanup C# bindings to add EP (#8810)"
This reverts commit b21ea00020.
* Add back in a minimal set of changes.
Provide stubs in for a limited set of things
- things called from C# using a static lib of ORT built for mac/ios
- things in OrtApis that are not included in the build by default
- things in OrtApis that are excluded in a minimal build
* Cleanup order or EPs in test
* Fix unused function in ROCM build
Add cmake parameter and #ifdefs to allow for disabling sparse tensor support. This comes with a significant binary size cost so we want to be able to exclude it in a minimal build.
Fix C# add EP bindings.
Add stubs to ORT so that if EP is not included in the build we return a graceful error message.
Move declaration of stubs into C API and out for EP so they're in one place and are easier to use (no extra header required in the C/C++ world and consistent with the CUDA EP setup).
Fix inconsistency in ROCM EP.
Cleanup a few other things.
* Remove APIs unavailable in Store in #8349, #8178, #8065
* Add UWP stubs of C runtime functions
* Remove UWP incompatible tests from UWP build
* Remove incompatible tests from Store
* Use UWP stubs in store only
* Skip partition check outside of Windows
* Remove unused WRL include
* Workaround Windows header not including what it uses
* Fix precompiled header name clash
* Workaround SDK bugs
* DXCore workaround in Win7
* Fix warning
* Fix more warnings
* Bump WinML to target Windows 8
* Fix more warnings
* Remove unnecessary workarounds
* Remove Desktop only APIs from DML adapter
* adding support for tracing to sqldb instead of files
* use compiled statements
* script to pull tensors from db
* link sqlite3
* remove node info redundant with onnx graph
* addressing PR comments
* address PR comments and include program counter
* third party notice
* use find_pacakge
* add to cgmanifests.json
* address thread safety and add pid suffix
* build fi
* python script to select on devicetype
* remove unpopulated and redundant Shape and Type fields
* comment
* comment
* PR comments
* add graph execution counter to session state
* move increment to inference session
* std::endl to \n
* ifdef on graph execution counter
* add ifdef to inference session
* move DEBUG_NODE_INPUTS_OUTPUTS to CMakeLists.txt
* fix build - python.h not found
* disable --build_shared_lib for ortmodule tests
* fix
* fix the build flag
* disable --build_shared_lib for training path (not only for ortmodule)
* fix missing test model files
* disable test CApiTest.test_custom_op_library when ENABLE_TRAINING_TORCH_INTEROP is ON
* enable custom_op_library build
* fix build
* fix
* merge master and fix build failure
* build onnx_test_runner when onnxruntime_ENABLE_TRAINING_TORCH_INTEROP is ON
* resolve comments
* use --enable_training_torch_interop to replace "onnxruntime_ENABLE_TRAINING_TORCH_INTEROP=ON"
Using signed int, qgemm kernel avoids extending uint8 to int16 while computing matrix multiplication, achieving higher performance. We also find that by using only lower 64b of vector registers to load A and B matrix, we can get further performance improvements. We also experimented with using ldp to load two 64b in one shot, vs using two ldr to load one 64b at a time, in both Big and little cores, there is no noticeable differences.
Submitting the LDP version. At this point we don't need to choose kernel based on micro-architecture.
Inference time of resnet50, thread count 2
Big Core on Pixel 3a
Current master: 292.947 ms
First iteration S8S8: 188.239 ms
LDP load two 64b reg: 178.715 ms
LDR load one 64b reg: 179.536 ms
Little Core
Master: 546.317 ms
S8S8: 513.332 ms
LDP: 489.19 ms
LDR: 497.865 ms
Raspberry Pi 3B+
Master: 660.08 ms
S8S8: 608.577 ms
LDP: 603.675 ms
LDR 602.075 ms
* dnnl ep rework
rework DnnlTensor,DnnlNode,DnnlSubgraph to support arbitrary graph topology and tensor data types
rework GetCapability to claim nodes in graph greedily from node topological ordering and delay creation of DnnlSubgraph until Compile
rework compile to have DnnlSubgraphPrimitive as the object to handle primitive creation and execution
instead of thread local primitive pool which duplicates intermediate memory allocated by the EP across threads
DnnlSubgraphPrimitive provides helpers to handle many common functions for each dnnl primitive builder and become the centralized place to store input, output, intermediate memories, initializer memories and etc
it provides functions to obtain input memories with automatic reordering/reshaping and moving between engines
it provides interfaces to add primitive, set output memory for single node and etc
add CONCURRENT_EXEC compile flag for dnnl library as without it, convolution primitive cannot be created and executed on different threads
enable unit tests to run on dnnl ep as well if built with dnnl ep
add dnnl ep support for Matmulinteger
* Add Relu to the DNNL refactor
Signed-off-by: George Nash <george.nash@intel.com>
* Add Convolution op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add Pooling ops to the DNNL rework
This adds the following ops:
- AveragePool
- GlobalAveragePool
- GlobalMaxPool
- MaxPool
Note: Pooling with dilation is not yet supported.
Note: GlobalLpPool, LpPool, MaxRoiPool, and MaxUnpool are not supported yet.
Signed-off-by: George Nash <george.nash@intel.com>
* Add Sum op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add ConvGrad op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add MaxPoolGrad and AveragePoolGrad ops to DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Added lrn operator to the refactored code
Signed-off by chethan.palangoutu.keshava@intel.com
* Added ReduceMean DNNL op to the refactor code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added Softmax DNNL op for the refactored code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added BatchNorm DNNL op inference-only for refactored code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added Binary Ops to DNNL rework
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Added ReluGrad to DNNL Rework
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Update OneDNN tag to v2.3
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Added support for memory upto dim size 12
this is to fix the CI test cases that contain binary ops of input dim
size > 5
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Prevent claiming support for float16 and bfloat16 when only float is suppoted
By using The string.find used was causing the code to claiming support
for float16 and bfloat16 when we only supported float. We now explicitly
check the code for the data type or the data type with a 7 letter prefix
basically prefixed with "tensor("
Signed-off-by: George Nash <george.nash@intel.com>
* Disable uint8 mul and div, improve type conversion
Disable mul_uint8 and div_uint8 test cases as they use modulo for
overflow handling while onednn uses saturation
improve ype conversion using enum instead of string comparsion as well
as adding more types
Signed-off-by: Wang <zhaoyang.wang@intel.com>
Co-authored-by: Wang <zhaoyang.wang@intel.com>
Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Remove APIs unavailable in Store in #8349, #8178, #8065
* Add UWP stubs of C runtime functions
* Remove UWP incompatible tests from UWP build
* Remove incompatible tests from Store
* Use UWP stubs in store only
* Skip partition check outside of Windows
* Remove unused WRL include
* Workaround Windows header not including what it uses
* Fix precompiled header name clash
* Workaround SDK bugs
* DXCore workaround in Win7
* Fix warning
* Fix more warnings
* Bump WinML to target Windows 8
* Fix more warnings
* Remove unnecessary workarounds
* integrate eager mode source codde; build with cmake and integrate the python test
* Adding the python path for importing libraries in the Eager mode
* fix clang break;check if training and python enabled
* handling the linking of torch libraries across multiple platforms
* merge and fix the naming
* add build instruction
Co-authored-by: Abhishek Jindal <abjindal@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: ajindal1 <abjindal@microsoft.com>
* updates for picking pnnx commit
* add tests filter to c# tests
* plus test fixes
* fix versioning for contrib ops
* fix tests
* test filter for optional ops
* more versioning related updates
* fix test
* fix layernorm spec
* more updates
* update docs
* add more test filters
* more filters
* update binary size threshold
* update docs
* plus more fixes
* updates per review
* update to release commit
* add filters for optional type tests
* plus updates
* update onnx-tensorrt parser to master
* disable unsupported tests
* add cuda sm 75 for T4
* update tensorrt pipeline
* update trt pipelines
* update trt pipelines
* Update linux-gpu-tensorrt-ci-pipeline.yml
* update trt cid pipeline
* Update linux-gpu-tensorrt-ci-pipeline.yml
* Update Tensorrt Windows build pool and TensorRT/CUDA/CuDNN version
* update to cuda11.4 in trt ci pipeline
* update base image to cuda11.4
* update packaging pipeline to cuda11.4
* clean up
* remove cuda11.1 and cuda11.3 docker file
* disable unsupported tensorrt tests at runtime
* Update linux-multi-gpu-tensorrt-ci-pipeline.yml
1. Update SDLNativeRules from v2 to v3. The new one allows us setting excluded paths.
2. Update TSAUpload from v1 to v2. And add a config file ".gdn/.gdntsa" for it.
3. Fix some parentheses warnings
4. Update cmake to the latest.
5. Remove "--x86" build option from pipeline yaml files. Now we can auto-detect cpu architecture from python. So we don't need to ask user to specify it.
* enable shared lib test on linux
* fix build break
* add onnx dependency
* add rpath
* skip the test for linux training
* set ONNX_ML definition
* install training python dependency
* update
* fix format; add eigen include folder
* fix format
* skip amd build
* enable shared provider on training
* fix comments in pr
Co-authored-by: Ubuntu <chenta@chenta-orttraining-cpu.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Adds a StridedCopy function that implements a copy from strided tensor to another.
This parallelizes the Concat operator, and can also be used in the future to parallelize many other data movement operators (e.g. Transpose, Split, etc.).
This operation is also required for the proposed data layout extensions to ORT.
* atenop for inference
* assert if dtype mismatch
* atenop config in frontend
* fix orttrainer test
* gradient def not only for ATenOp
* bugfix
* fix gradient input shape and type issue
* fix after merge master
SparseTensor support
Implement Builder pattern
Fix support for 1-D and 2-D COO indices
Implement and test CSR support.
Handle shape inference for SparseTensors
Implement conversion for COO, CSR and tests.
Address the case where constant sparse initializer is the output.
Implement test infra for SparseTensors
Implement SparseDenseMatMul for Csr and COO and tested it.
Add hash for SparseToDenseMatMul
Finish shared provider refactor
Refactor GetOrCreate to Create
Working on py interface
Expose OrtDevice and use it in allocate_numpy
Adjust Sparse interfaces, add support for string SparseTensor. Add tests.
Add and test to_cuda()
Add accessors to format specific indices
Test values and indices views, read-only flag, after GC access
Add sparse related methods to OrtValue
Re-work SparseTensor wrapper, add OrtValue methods
Rework numpy_array_to_cuda/to_cpu
Add run_with_ort_values
Add models and test sparse_mat_mul with run_with_ort_values
Refactor sparse tensor to use a single buffer
Ifdef x86 Eigen CSR sparse matmul implementation
Exclude broken test, check for string type when copying cross device
Split pybind schema, regenerate docs, add exclusion
Conditionally exclude schema module
Update docs fix cuda build
Add test to a filter and renerate JS docs
Add conversion and test string support for sparse tensors
Exclude conversion utils from minimal build
Add CUDA Memcpy and adjust provider interfaces
When compiling with newer gcc and older glibc, there is a chance
for new POWER10 macros to be not available in hwcap.h. This patch
checks whether hwcap macros are available before using that in
platform.cpp.