* Add arm64 nocontribops pipeline
* minor fix
* Added new template for arm build -- disable all tests
* fix build command
* add arm64 flag for msbuild
* add arm leg as upstream dependency
* update platform to arm64 for msbuild
* remove test task from arm build
* remove ESRP signing of C# dlls in arm build
* Updated to work for both --arm and --arm64
* Make the cross compiling cmake flags symmetric
* Add dynamic check for /Wno-error flag, instead of extra build option
* remove extra full-stop
This extends build.py to run git submodule sync --recursive before running git submodule update --init --recursive. This makes sure submodule URLs are up-to-date.
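A minimal sketch of the ordering described above, not the actual build.py code. The function names `submodule_commands` and `update_submodules` and the injectable `run` parameter are hypothetical; only the two git commands come from the description.

```python
import subprocess


def submodule_commands():
    """Return the git commands in the order described: sync first, then update."""
    return [
        ["git", "submodule", "sync", "--recursive"],
        ["git", "submodule", "update", "--init", "--recursive"],
    ]


def update_submodules(source_dir, run=subprocess.check_call):
    """Run sync and then update inside source_dir.

    `run` is injectable (hypothetical design choice) so the command sequence
    can be checked without touching a real repository.
    """
    for cmd in submodule_commands():
        run(cmd, cwd=source_dir)
```

Running sync first copies any changed URLs from .gitmodules into the local git config, so the update that follows fetches from the current remote.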
This change integrates the NCHWc support recently added to MLAS into ONNX Runtime. With "-o 3" optimizations, the runtime runs an NCHWc layout optimization pass that converts standard ONNX operators such as Conv/MaxPool to the com.microsoft.nchwc domain, with weights and biases reordered for speed.
Log a warning if the fallback is caused by a functional limitation.
Log an informational message if the fallback is by design, e.g. nodes between Shape (CPU output) -> CUDA nodes .. -> Reshape (CPU input).
More cleanup of the math files. Instead of using templates to instantiate a full GEMM for the types added for MatMul (integers and double), use a simpler MatMul function that doesn't do any transposing and assumes alpha=1 and beta=0.
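As a rough illustration of the simpler function described above (a Python sketch, not the C++ code in this change): the name `matmul`, the flat row-major buffers and the argument order are assumptions made for the example.

```python
def matmul(a, b, m, k, n):
    """Compute C = A * B for row-major flat lists.

    No transposes are applied, alpha is implicitly 1 and beta is implicitly 0,
    so the previous contents of C are never read.
    """
    if len(a) != m * k or len(b) != k * n:
        raise ValueError("buffer sizes do not match M, K, N")
    c = [0] * (m * n)
    for i in range(m):
        for p in range(k):
            a_ip = a[i * k + p]
            for j in range(n):
                c[i * n + j] += a_ip * b[p * n + j]
    return c
```

Dropping the transpose, alpha and beta parameters is what avoids instantiating a full GEMM for each extra element type.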
Fix the random UT failure for RNN/GRU cases that have padded sequences, e.g. max_seq = 2, batch_size = 2, sequence_lengths = {2, 1}. For the output beyond the shorter sequence {1}, we should initialize the value to 0.
Root cause:
The cuDNN library doesn't guarantee the values beyond the shorter sequence.
Fix:
Initialize the output Y data to all 0 before calling the cuDNN library.
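The effect of the fix can be sketched in plain Python. This is an illustration only: `run_rnn_padded` and `step_fn` are hypothetical names, and the inner loop is a stand-in for a library that writes only the valid time steps of each sequence.

```python
def run_rnn_padded(max_seq, batch_size, hidden_size, sequence_lengths, step_fn):
    """Return Y with shape [max_seq][batch_size][hidden_size].

    Y is zero-initialized up front, so positions beyond each sequence's
    length stay 0 even though the stand-in "library" loop never writes them.
    """
    y = [[[0.0] * hidden_size for _ in range(batch_size)] for _ in range(max_seq)]
    for t in range(max_seq):
        for b in range(batch_size):
            if t < sequence_lengths[b]:
                # Only valid (unpadded) positions are written.
                y[t][b] = list(step_fn(t, b))
    return y
```

With max_seq = 2, batch_size = 2 and sequence_lengths = [2, 1], the entry for time step 1 of the second batch item is never written and therefore remains all zeros.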
* replace log sinks
* limit headers to include dir
* first changes to do dynamic linking
* wip for using cxx api
* remove weird dangling dependency
* building with tests failing
* finish updating converters
* fix const
* initial introduction of typedef
* change logging to use spdlog
* get tests passing
* clang format
* map logging levels better
* clean up unused imports
* trent cr comments
* clang-format
* code review comments
* changing buffer use to reserve
* Dynamically link
* revert tvm
* update binary uploading
* catch exceptions by const-ref
* Revert "revert tvm"
This reverts commit 387676dd1018134d15eb71fa126f7caf94380800.
* fix typo
* update versioning of lib
Description:
This change adds the common part of the TVM-based codegen library. It includes the following parts:
* Microsoft TVM Inventory (MTI): a set of TVM ops for neural networks, similar to TOPI
* Compiler pass for traversing the ONNX graph and generating TVM ops
* Compiler pass for traversing the generated graph and specifying the TVM schedule
* Compiler pass for handling weight layout
* Utils for debugging
Motivation and Context:
TVM is an open deep learning compiler stack for CPUs, GPUs and specialized accelerators. To leverage it in ONNX Runtime, we built an execution provider named Nuphar. Currently, Nuphar gets good performance on CPUs with AVX2 on quantized LSTM models.
This codegen library was part of the Nuphar execution provider. It has been split out so it can be shared with other execution providers, as we'd like to reuse TVM on more devices.
Description:
Disallow overriding an initializer via a graph input if the IR version is < 4. This enforces an implicit assumption that initializers should be treated as constant, and allows constant folding to be done on a model with an older IR version.
Separate constant and overridable initializers so that it's clear which ones constant folding can utilize.
Update Graph to not add all initializers to the graph inputs when the graph is manually created (i.e. not loaded from a GraphProto) and the IR version is >= 4.
Motivation and Context:
In order to do constant folding we need to know which initializers can be treated as constant and which are overridable. All initializers were required to have a matching graph input prior to IR version 4, technically making all of them overridable. The intention however was for them to be treated as constants, and this change enforces that intent.
The benefit of doing so is that constant folding will work for models with IR version < 4. The cost is that if someone is actually overriding an initializer they will need to update the IR version of their model to version 4 in order to keep doing so. The belief is that this is a very small subset of usage (e.g. models involving feeding in a truncated sequence) and the cost to update that small subset is warranted by the benefit of constant folding being able to be enabled on all older models without them needing an IR version update.
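The rule described above can be sketched as a small classification function. This is an illustration under stated assumptions, not the Graph implementation: `split_initializers` and its parameters are hypothetical names, and names are plain strings.

```python
def split_initializers(initializer_names, graph_input_names, ir_version):
    """Partition initializers into (constant, overridable) lists.

    Assumed rule, taken from the description: for IR version < 4 every
    initializer is treated as constant; for IR version >= 4 an initializer
    is overridable only if a graph input with the same name exists.
    """
    graph_inputs = set(graph_input_names)
    constant, overridable = [], []
    for name in initializer_names:
        if ir_version >= 4 and name in graph_inputs:
            overridable.append(name)
        else:
            constant.append(name)
    return constant, overridable
```

Constant folding would then only consider the first list, which for models with IR version < 4 is every initializer.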
* Improve CUDA kernel performance for Concat. Implement the kernel code instead of using cudaMemcpy in a loop.
* Update the index lookup part for Concat & Split
* init
* Update DNNLibrary
* Update DNNLibrary, set compiler flags, it compiles now
* Add more missing flags, add test
* Update DNNLibrary
* Update Compile method, fix allocator and some other bugs
* Update DNNLibrary
* Implement CopyTensor
* Do not delete state explicitly since it is managed by unique_ptr
* Add the missing files when SingleUnitTestProject is ON
* misc changes
* Fix wrong name in provider factory
* Add my own test
* Update the code that adds nodes to the graph, and add the missing initializer to the graph
* Fix the bug where rebuilding the graph produces extra output
* Update DNNLibrary
* Transpose nchw (ONNX) -> nhwc (NNAPI)
* Add license
* Add GetSupportedNodes method (implement it later)
* Rename onnxruntime_nnapi_test->onnxruntime_nnapi_squeezenet_test
* Update squeezenet_test.cpp after rebase master
* Remove squeezenet_test.cpp since it is almost same with the c++ sample
* Update DNNLibrary for GetSupportedNodes
* Update GetSupportedNodes
* Revert "Remove squeezenet_test.cpp since it is almost same with the c++ sample"
This reverts commit a97575fd9ff49e50ba1dc8d8154790d8cd86c48d.
* Update DNNLibrary
* Fix multiple outputs bug
* Remove GetKernelRegistry
* Revert "Revert "Remove squeezenet_test.cpp since it is almost same with the c++ sample""
This reverts commit 2a0670e9cbf10ea654111ce39e198a4be0ddd838.
* Set default memory type of NNAPI EP
* Add CPUOutput allocator
* Update DNNLibrary for multiple outputs
* Fix bug of nhwc->nchw
* Remove GetExecutionHandle()
* Update cuda for python wheels
* Update cuda for python wheels
* Update cuda for python wheels
* Update azure-pipelines-py-packaging.yml
* Update to cuda 10
* Only test win gpu
* Update cuda for python wheels
* Use manylinux2010 image to build linux python wheels
Allows the built wheels to be truly compliant with a manylinux policy.
* Add CUDA expand operator
* Reset counter variables when striding
* Reset counter variables when striding
* use fast_divmod and other PR comments
* Fix merge variable rename
* Fix indentation per PR comment
* Remove maxpool_argmax
* Reduce number of type templates for Expand operator
* removed all types
* Commit updated cuda_execution_provider.cc
* Check for non-existent initializers while fusing conv and add.
* Fix other places where initializer can be null
* Add check if initializer is an input
* update the models to comply with the new ONNX spec.
In the new ONNX spec, initializers should not be in the inputs.
* Fix previous temporary code
* Add negative test
* Revert changes to conv_bn_fusion and conv_mul_fusion
* making helper IsNodeArgConstant a little more general; updating remaining Conv*Fusion rules
* minor comment
* AllNodeInputsAreConstant to use new function
Implementation of the MLAS changes for NCHWc convolution/pooling support. These changes adopt the blocking format used by MKL-DNN and other convolution libraries for better performance.
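For readers unfamiliar with the blocked layout, here is a rough Python sketch of an NCHW to NCHWc reorder. It is not the MLAS code: the function name, the default block size of 8 and the zero padding of a partial channel block are assumptions made for the example.

```python
def reorder_nchw_to_nchwc(data, n, c, h, w, block=8):
    """Reorder a flat NCHW buffer into flat [N, ceil(C/block), H, W, block].

    Channels are grouped into blocks of `block` so that the channels of one
    block are contiguous for each spatial position. Channels missing from
    the last block are left as zeros.
    """
    c_blocks = (c + block - 1) // block
    out = [0.0] * (n * c_blocks * h * w * block)
    for ni in range(n):
        for ci in range(c):
            cb, cr = divmod(ci, block)  # block index, offset inside the block
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = (((ni * c_blocks + cb) * h + hi) * w + wi) * block + cr
                    out[dst] = data[src]
    return out
```

Keeping a block of channels contiguous per pixel lets a vectorized kernel load one block with a single wide load, which is the usual motivation for this format.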