* support non-tensor types
* support non-tensor types.
* support non-tensor types.
* fix compilation issues
* fix compilation issues
* Build without mkldnn for release packages. We'll default to MLAS.
* Modify roialign to conform with the new onnx spec and take it out from contrib ops.
Memory pattern doesn't work for the parallel executor by design. Enabling memory pattern for the parallel executor logs a warning and hurts performance.
Add back the option to enable/disable memory pattern.
* move files
* move files
* Remove NonMaxSuppression from Contrib op, move it to Onnx domain, opset 10
* move NMS out of namespace contrib
* update data type in UT
* update to latest onnx
* white list the node test for Mod which is not implemented yet
* Fix warning in tensor_type_and_shape.cc
tensor_type_and_shape.cc:139:18: error: ‘out’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
* fix warnings
Change MLAS to be able to build standalone without onnxruntime header dependencies. This is enabled when building with MLAS_NO_ONNXRUNTIME_THREADPOOL defined.
mlas.h had been changed to include the ThreadPool header, but this header now just has a forward reference for the class. The header was also doing a "using onnxruntime::concurrency"; that has been removed and the external mlas.h users fixed up as needed.
As before, if ThreadPool==nullptr, then MLAS uses OpenMP or falls back to a single threaded implementation. The build option to use the Win32 system thread pool has been removed as onnxruntime can't hit that path and I don't use that option for standalone tests anymore.
* Disable tests for certain models (Cherry pick from 0.3.1)
* Disable more tests
* More tests
* even more tests
* Fix gpu builds
* Disable L2 transformers
* Env variable to disable contrib ops for csharp tests
Introduce a quick pre-filtering of rules based on the node op types they are targeting.
The goal is to avoid evaluating all rules for all nodes. Instead, for each node, we will only be evaluating the rules associated with its op type.
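The pre-filtering idea above can be sketched roughly as follows. This is a minimal illustration, not the onnxruntime implementation: `Node`, `Rule`, `RuleIndex`, and the rule names in the usage note are hypothetical stand-ins, and treating a rule with no target op types as "applies to every node" is an assumption.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real graph and rule types.
struct Node {
  std::string op_type;
};

struct Rule {
  std::string name;
  // Op types this rule targets. Assumption: empty means "any op type".
  std::vector<std::string> target_op_types;
};

class RuleIndex {
 public:
  // Bucket the rule under each op type it targets.
  void Register(const Rule* rule) {
    if (rule->target_op_types.empty()) {
      any_op_rules_.push_back(rule);
      return;
    }
    for (const auto& op : rule->target_op_types) {
      by_op_type_[op].push_back(rule);
    }
  }

  // Return only the rules worth evaluating for this node,
  // instead of every registered rule.
  std::vector<const Rule*> RulesFor(const Node& node) const {
    std::vector<const Rule*> result = any_op_rules_;
    auto it = by_op_type_.find(node.op_type);
    if (it != by_op_type_.end()) {
      result.insert(result.end(), it->second.begin(), it->second.end());
    }
    return result;
  }

 private:
  std::unordered_map<std::string, std::vector<const Rule*>> by_op_type_;
  std::vector<const Rule*> any_op_rules_;
};
```

With one rule targeting "Conv", one targeting "Identity", and one with no targets, a "Conv" node would be checked against two rules and a "Relu" node against one.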
* enable android build
* Add 'log' to onnxruntime_EXTERNAL_LIBRARIES
* Remove cmake about header_files_test.cc
* Add Android CI pipeline
* Remove some ms-specific(?) ci
* Fix bash error
* Add execute flag for install_deps_android.sh
* Add install_ubuntu_for_android.sh
* Remove python in deps for android
* Add comment for BUILD_ARCH
* Set BUILD_SERVICE to cpu
* Set BUILD_OS in run_build.sh
* Fix -o bug in run_build.sh
* Android -> android
* Correct the android ndk location
* Checkout submodules in my own azure pipelines
* Revert "Remove some ms-specific(?) ci"
This reverts commit 302463213480487d8944c3127a3b311c591d55c0.
* Revert "Checkout submodules in my own azure pipelines"
This reverts commit 1acfb6755f933e532b8312ca35bb4900a833903f.
* Add docker image clean script
* Change the command so it does not generate a warning if no such image is present
* Update linux-gpu-ci-pipeline.yml
* Update linux-ci-pipeline.yml
* Update azure-pipelines-py-packaging.yml
* Fix issues in GRU GPU implementation. cudnnGetRNNWorkspaceSize could fail because some descriptors were defined as local variables and had already been destroyed.
* Fix the issue for ReduceSum. cudnnReduceTensor for ReduceSum has an issue when the input and output have the same size; in that case we just need to copy the data.
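The ReduceSum shortcut can be illustrated on the CPU as below. This is only a sketch of the control flow: the actual fix lives in the CUDA kernel around cudnnReduceTensor, while `ReduceSumLastAxis` and its row-major 2-D layout are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a row-major [rows x cols] matrix over its last axis.
// Hypothetical CPU stand-in for the GPU reduction path.
std::vector<float> ReduceSumLastAxis(const std::vector<float>& in,
                                     std::size_t rows, std::size_t cols) {
  assert(in.size() == rows * cols);
  std::vector<float> out(rows, 0.0f);

  // Input and output have the same element count, so there is
  // nothing to reduce: copy the data and skip the reduction call.
  if (in.size() == out.size()) {
    std::copy(in.begin(), in.end(), out.begin());
    return out;
  }

  // General case: accumulate each row.
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      out[r] += in[r * cols + c];
    }
  }
  return out;
}
```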
* constant node should not be put into graph inputs any more.
* simplify graph input/output set logic.
* refactor comments.
* remove adding initializers as graph inputs when creating graph from scratch.
* define new test load function
* remove bak file
* add stat operator
* add arguments
* fix comments
* try enable fp16_tiny_yolov2 on linux
* fix compile err
* try enable fp16_tiny_yolov2
* Adding the kernel for Resize op.
* Fixing a bug in nearest neighbour.
* remove gpu resize kernel.
will add it in another pr.
* fix a bug.
* Accommodating PR comments.
* Update license - came up during IP scan
* Cache CUDNN convolution benchmark results in cuda::Conv kernels
Previously, the best convolution algorithm was determined by running
cudnnFindConvolutionForwardAlgorithmEx and cudnnFindConvolutionBackwardDataAlgorithmEx
on every shape change.
This is very detrimental for variable input shapes, such as variable batch
sizes.
This change adds a map to cache previously determined benchmark results.
The caching results in significant speedups for variable input shapes.
* Use LRU to limit cached benchmark results
* Only cache benchmark results for a fixed weight shape
In case the weight shape changes, all cached results are discarded.
* Use padded shape as key for cached benchmarks
* Add constant for max number of cached benchmark results
* Use unordered_map to store cached benchmark results
* Only store the parameters that are actually needed
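The caching scheme described in these commits (padded shape as key, unordered_map storage, LRU eviction, discard on weight shape change) might look roughly like the sketch below. The names `BenchmarkCache`, `AlgoChoice`, and `ShapeHash` are hypothetical, `AlgoChoice` is a placeholder for whatever the cuDNN Find* calls report, and the capacity is chosen by the caller rather than taken from the real constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using Shape = std::vector<int64_t>;

// Simple combining hash so a shape can be an unordered_map key.
struct ShapeHash {
  std::size_t operator()(const Shape& s) const {
    std::size_t h = 0;
    for (int64_t d : s) h = h * 31 + std::hash<int64_t>()(d);
    return h;
  }
};

// Placeholder for a benchmark result; only what a kernel would need.
struct AlgoChoice {
  int algo;
  std::size_t workspace_bytes;
};

class BenchmarkCache {
 public:
  explicit BenchmarkCache(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Cached results are valid for one weight shape only.
  void SetWeightShape(const Shape& w) {
    if (w != weight_shape_) {
      weight_shape_ = w;
      order_.clear();
      index_.clear();
    }
  }

  // Returns nullptr on a miss; a hit becomes the most recently used entry.
  const AlgoChoice* Find(const Shape& padded_input) {
    auto it = index_.find(padded_input);
    if (it == index_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second);
    return &it->second->second;
  }

  void Insert(const Shape& padded_input, AlgoChoice choice) {
    auto it = index_.find(padded_input);
    if (it != index_.end()) {
      it->second->second = choice;
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    if (order_.size() == capacity_) {
      // Evict the least recently used entry.
      index_.erase(order_.back().first);
      order_.pop_back();
    }
    order_.emplace_front(padded_input, choice);
    index_[padded_input] = order_.begin();
  }

  std::size_t size() const { return order_.size(); }

 private:
  using Entry = std::pair<Shape, AlgoChoice>;
  std::size_t capacity_;
  Shape weight_shape_;
  std::list<Entry> order_;  // front = most recently used
  std::unordered_map<Shape, std::list<Entry>::iterator, ShapeHash> index_;
};
```

A kernel would call `Find` with the padded input shape first and only run the expensive benchmark (then `Insert`) on a miss, which is where the speedup for variable batch sizes comes from.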
Some changes that reduce the size of the release onnxruntime.dll by 170KB:
Change the ONNX_OPERATOR_KERNEL macros to not create a unique virtual class per kernel create lambda, but instead use a generic class with the raw function address supplied at BuildCreateKernelInfo time.
Changed the execution providers to use a table driven approach to calling the BuildCreateKernelInfo functions instead of a massive function with construct/call/delete sequences.
The CreateFunc in data_types.h didn't need to be a std::function, eliminating more lambda virtual classes.
N.B. To accommodate MSVC 14.11 toolchain (used for CUDA builds), the operator+() syntax cannot be used to retrieve the raw function address. The older toolchain can't resolve between cdecl/vectorcall and gives up. An explicit cast is needed to help the compiler along.
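A rough sketch of the two ideas above: storing a raw function pointer instead of a std::function, and walking a table of builder functions. All names here (`OpKernel`, `KernelCreateInfo`, `BuildAdd`, `BuildMul`, `RegisterKernels`) are simplified, hypothetical stand-ins rather than the real onnxruntime macros or types.

```cpp
#include <memory>
#include <string>
#include <vector>

struct OpKernel {
  virtual ~OpKernel() = default;
  virtual const char* Name() const = 0;
};

struct AddKernel : OpKernel {
  const char* Name() const override { return "Add"; }
};

struct MulKernel : OpKernel {
  const char* Name() const override { return "Mul"; }
};

// A plain function pointer: no per-kernel std::function or closure class.
using KernelCreateFn = std::unique_ptr<OpKernel> (*)();

struct KernelCreateInfo {
  std::string op_name;
  KernelCreateFn create;
};

// Each builder hands over the raw address of its create function.
// The explicit static_cast on the captureless lambda mirrors the note
// about helping older toolchains pick the conversion; unary + is avoided.
KernelCreateInfo BuildAdd() {
  return {"Add", static_cast<KernelCreateFn>(
                     []() -> std::unique_ptr<OpKernel> {
                       return std::unique_ptr<OpKernel>(new AddKernel());
                     })};
}

KernelCreateInfo BuildMul() {
  return {"Mul", static_cast<KernelCreateFn>(
                     []() -> std::unique_ptr<OpKernel> {
                       return std::unique_ptr<OpKernel>(new MulKernel());
                     })};
}

using BuildKernelCreateInfoFn = KernelCreateInfo (*)();

// Table driven registration: one loop over a static table instead of
// a long function repeating construct/call/delete for every kernel.
std::vector<KernelCreateInfo> RegisterKernels() {
  static const BuildKernelCreateInfoFn kTable[] = {&BuildAdd, &BuildMul};
  std::vector<KernelCreateInfo> registry;
  for (BuildKernelCreateInfoFn build : kTable) {
    registry.push_back(build());
  }
  return registry;
}
```

Adding a kernel then means adding one table entry, and the compiler emits one small loop rather than code per registration, which is the kind of change that would shrink the binary.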
* Exclude unreferenced global data and op doc strings in the opschema object. The first causes a decrease in the binary size by at least 85k. The latter reduces resident memory size.
* Update onnx to incorporate my PR that fixes SetDoc compiler warnings