* Cache cuDNN convolution benchmark results in cuda::Conv kernels
Previously, the best convolution algorithm was determined by running
cudnnFindConvolutionForwardAlgorithmEx and cudnnFindConvolutionBackwardDataAlgorithmEx
on every shape change.
This is costly for variable input shapes, such as variable batch
sizes, because every new shape triggers another benchmark run.
This change adds a map to cache previously determined benchmark results.
The caching results in significant speedups for variable input shapes.
* Use LRU to limit cached benchmark results
* Only cache benchmark results for a fixed weight shape
In case the weight shape changes, all cached results are discarded.
* Use padded shape as key for cached benchmarks
* Add constant for max number of cached benchmark results
* Use unordered_map to store cached benchmark results
* Only store the parameters that are actually needed
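The caching scheme described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `AlgoPerf`, `ShapeHash`, `BenchmarkCache`, and the value of `kMaxCachedBenchmarks` are hypothetical names and numbers chosen for the example. It combines an unordered_map with a recency list to get LRU eviction, keyed by the (padded) input shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a stored benchmark result.
struct AlgoPerf {
  int algo;
  std::size_t workspace_bytes;
};

using Shape = std::vector<int64_t>;

// Simple hash over the dimensions so a shape can be an unordered_map key.
struct ShapeHash {
  std::size_t operator()(const Shape& s) const {
    std::size_t h = 0;
    for (int64_t d : s) h = h * 31 + std::hash<int64_t>()(d);
    return h;
  }
};

// Hypothetical cap; the commit adds a constant but its value is not given here.
constexpr std::size_t kMaxCachedBenchmarks = 3;

class BenchmarkCache {
 public:
  // Returns the cached entry and marks it most recently used, or nullptr on a miss.
  const AlgoPerf* Find(const Shape& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second);
    return &it->second->second;
  }

  // Inserts or updates an entry, evicting the least recently used one when full.
  void Insert(const Shape& key, const AlgoPerf& perf) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = perf;
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    if (order_.size() >= kMaxCachedBenchmarks) {
      index_.erase(order_.back().first);
      order_.pop_back();
    }
    order_.emplace_front(key, perf);
    index_[key] = order_.begin();
  }

  // For a weight shape change: all cached results are discarded.
  void Clear() {
    index_.clear();
    order_.clear();
  }

  std::size_t Size() const { return order_.size(); }

 private:
  using Entry = std::pair<Shape, AlgoPerf>;
  std::list<Entry> order_;  // front = most recently used
  std::unordered_map<Shape, std::list<Entry>::iterator, ShapeHash> index_;
};
```

A kernel would call `Find` with the padded input shape before benchmarking, call `Insert` after a miss, and call `Clear` when the weight shape changes.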
Some changes that reduce the size of the release onnxruntime.dll by 170KB:
Change the ONNX_OPERATOR_KERNEL macros to not create a unique virtual class per kernel create lambda, but instead use a generic class with the raw function address supplied at BuildCreateKernelInfo time.
Change the execution providers to use a table-driven approach to calling the BuildCreateKernelInfo functions instead of a massive function with construct/call/delete sequences.
The CreateFunc in data_types.h didn't need to be a std::function; dropping std::function there eliminates more lambda virtual classes.
N.B. To accommodate the MSVC 14.11 toolchain (used for CUDA builds), the operator+() syntax cannot be used to retrieve the raw function address. The older toolchain can't resolve between cdecl/vectorcall and gives up. An explicit cast is needed to help the compiler along.
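The table-driven registration and the explicit cast might look roughly like this. It is a simplified sketch under stated assumptions: `OpKernel`, `Conv`, `Relu`, `KernelCreateInfo`, `kKernelTable`, and `RegisterKernels` are hypothetical stand-ins rather than the real onnxruntime types, and `BuildCreateKernelInfo` only borrows the name used above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified kernel types for illustration only.
struct OpKernel {
  virtual ~OpKernel() = default;
  virtual std::string Name() const = 0;
};
struct Conv : OpKernel {
  std::string Name() const override { return "Conv"; }
};
struct Relu : OpKernel {
  std::string Name() const override { return "Relu"; }
};

// Raw function pointer instead of std::function: no wrapper class is
// generated per lambda.
using KernelCreateFn = OpKernel* (*)();

struct KernelCreateInfo {
  const char* op_name;
  KernelCreateFn create;
};

template <typename T>
KernelCreateInfo BuildCreateKernelInfo();

// The explicit static_cast converts the captureless lambda to a plain
// function pointer without relying on the unary operator+ trick.
template <>
KernelCreateInfo BuildCreateKernelInfo<Conv>() {
  return {"Conv", static_cast<KernelCreateFn>([]() -> OpKernel* { return new Conv(); })};
}

template <>
KernelCreateInfo BuildCreateKernelInfo<Relu>() {
  return {"Relu", static_cast<KernelCreateFn>([]() -> OpKernel* { return new Relu(); })};
}

using BuildFn = KernelCreateInfo (*)();

// One table entry per kernel replaces a long function body that builds
// each registration inline.
static const BuildFn kKernelTable[] = {
    BuildCreateKernelInfo<Conv>,
    BuildCreateKernelInfo<Relu>,
};

std::vector<KernelCreateInfo> RegisterKernels() {
  std::vector<KernelCreateInfo> registry;
  for (BuildFn fn : kKernelTable) registry.push_back(fn());
  return registry;
}
```

Adding a kernel then means adding one line to the table, and the registration loop stays the same size regardless of how many kernels exist.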
* Exclude unreferenced global data and op doc strings in the opschema object. The former reduces the binary size by at least 85k. The latter reduces resident memory size.
* Update onnx to incorporate my PR that fixes SetDoc compiler warnings
* Ensure Linux binaries are built with debug info. Extract debug info out of the main binaries. Strip the main binaries.
* add binutils
* add uname
* add binutils
* remove linux portion
* Fix #612: TfidfVectorizer handles empty matrices as an input
* Add more unit tests, better consistency of error messages
* Update tfidfvectorizer.cc
* better comment
* fix comments
* add unit test failure for an empty input {0, 1}
* Adding a custom op interface to the C API to remove shared library dependency.
* Remove old custom op test
* Rework how custom ops handle inputs/outputs to enable custom op output shape calculation in the compute method
* Add a nicer C++ API for custom ops and switch the tests to use it.
Currently, when using OrtEnableProfiling to enable profiling using the C API,
the profile output file is created but is always empty.
The reason is that InferenceSession::EndProfiling() needs to be called to
write the profiling data to the output file.
However, there's currently no way to call this function via the C API.
This adds a call to EndProfiling() to the destructor of the session if
profiling is enabled in the session options.
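The idea can be illustrated with a small sketch. The `Session` and `SessionOptions` types below are hypothetical stand-ins, not the real InferenceSession, and a vector plays the role of the profile output file.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; not the real onnxruntime classes.
struct SessionOptions {
  bool enable_profiling = false;
};

class Session {
 public:
  // `sink` stands in for the profile output file.
  Session(const SessionOptions& options, std::vector<std::string>* sink)
      : options_(options), sink_(sink) {
    assert(sink_ != nullptr);
  }

  // Flush profiling data on destruction so callers that cannot reach
  // EndProfiling() still get a non-empty profile.
  ~Session() {
    if (options_.enable_profiling) EndProfiling();
  }

  // Records an event in memory while profiling is active.
  void Run(const std::string& node_name) {
    if (options_.enable_profiling && !ended_) pending_.push_back(node_name);
  }

  // Writes buffered events to the sink; safe to call more than once.
  void EndProfiling() {
    if (ended_) return;
    sink_->insert(sink_->end(), pending_.begin(), pending_.end());
    pending_.clear();
    ended_ = true;
  }

 private:
  SessionOptions options_;
  std::vector<std::string>* sink_;
  std::vector<std::string> pending_;
  bool ended_ = false;
};
```

Making `EndProfiling` idempotent in this sketch means an explicit call followed by the destructor does not write the data twice.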
Generalize node removal method in graph_utils. This is a higher-level method that keeps the graph consistent so that no Resolve is needed after the removal of a node.
The new method supports the removal of nodes with a single input (be it an incoming node or an initializer) and a single output (but allowing multiple output edges of that output). It also handles the case where one of the output edges feeds a subgraph.
Also updated the rewrite rules to use this new, less restrictive method, and improved the rules' conditions. Introduced a GraphEdge struct to simplify various methods in graph_utils.
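As a rough illustration of the rewiring involved, here is a minimal sketch on a toy edge list. It covers only the case where the single input comes from another node; initializer inputs and subgraph consumers are left out. The `Graph` type, the fields of `GraphEdge`, and `RemoveSingleInputNode` are hypothetical and much simpler than the real graph_utils code.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical edge record; the real GraphEdge struct likely carries more
// information than two node indices.
struct GraphEdge {
  int src;
  int dst;
};

struct Graph {
  std::vector<GraphEdge> edges;
};

// Removes `node` if it has exactly one incoming edge, reconnecting its
// producer directly to every consumer so the graph stays consistent.
// Returns false and leaves the graph untouched otherwise.
bool RemoveSingleInputNode(Graph& g, int node) {
  int producer = -1;
  int in_count = 0;
  for (const GraphEdge& e : g.edges) {
    if (e.dst == node) {
      producer = e.src;
      ++in_count;
    }
  }
  if (in_count != 1) return false;

  std::vector<GraphEdge> rewired;
  for (const GraphEdge& e : g.edges) {
    if (e.dst == node) continue;  // drop the incoming edge
    if (e.src == node) {
      rewired.push_back({producer, e.dst});  // bypass the removed node
    } else {
      rewired.push_back(e);
    }
  }
  g.edges = std::move(rewired);
  return true;
}
```

For a graph with edges 0->1, 1->2, and 1->3, removing node 1 leaves 0->2 and 0->3.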