* matmul add fusion
* add shape check on Gemm input C
* work around the issue with RemoveNode
* update the version support
* If MatMul has shape [K] * [K, N], update it to [1, K] * [K, N], so that it can work for Gemm
* Fuse Gemm+Activation into FusedGemm
* test
* revert the change which fused MatMul with shape [K] * [K, N] into Gemm as shape [1, K] * [K, N]; this may cause a runtime failure, as we can't change the input data shape.
* revert the change which changed the MatMul shape from [K] * [K, N] to [1, K] * [K, N]. It enables fusing MatMul + Add into Gemm, but the data is not aware of the change, so the data shape is still [K] * [K, N], which causes a runtime issue.
* 1. Fix build issue for CUDA
  2. Update Gemm so that we can fuse MatMul [K] * [K, N] + Add [1, N] into Gemm with shape [1, K] * [K, N] + [1, N]
* Fix build issue
* Fuse the activation node even if it connects to the output
* resolve the merge conflicts
* Add test model for Gemm+Activation fusion
* refactor kernel registry to make it a little bit more readable.
* update
* update cudaexecutionprovider
* fix build break
* fix comments
* fix build break
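
A minimal sketch of why the MatMul + Add fusion above is numerically safe for the [K] * [K, N] case: treating the 1-D input as a [1, K] matrix and the bias as [1, N] gives the same values from Gemm as MatMul followed by Add. The `Matrix` type and both function names are hypothetical helpers written for this illustration, not code from the fusion pass, and Gemm is simplified to alpha = beta = 1 with no transposes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix in a flat vector; a hypothetical helper type for this sketch.
struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;
  float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Simplified Gemm (alpha = beta = 1, no transposes): Y = A * B + C.
// C has shape [1, N] or [M, N]; a [1, N] C is broadcast across rows.
Matrix Gemm(const Matrix& a, const Matrix& b, const Matrix& c) {
  assert(a.cols == b.rows);
  assert(c.cols == b.cols && (c.rows == 1 || c.rows == a.rows));
  Matrix y{a.rows, b.cols, std::vector<float>(a.rows * b.cols, 0.0f)};
  for (std::size_t m = 0; m < a.rows; ++m) {
    for (std::size_t n = 0; n < b.cols; ++n) {
      float acc = c.at(c.rows == 1 ? 0 : m, n);
      for (std::size_t k = 0; k < a.cols; ++k) acc += a.at(m, k) * b.at(k, n);
      y.data[m * b.cols + n] = acc;
    }
  }
  return y;
}

// MatMul of a 1-D vector [K] with [K, N], followed by Add of N bias values.
std::vector<float> MatMulAdd1D(const std::vector<float>& x, const Matrix& b,
                               const std::vector<float>& bias) {
  assert(x.size() == b.rows && bias.size() == b.cols);
  std::vector<float> y(b.cols, 0.0f);
  for (std::size_t n = 0; n < b.cols; ++n) {
    for (std::size_t k = 0; k < b.rows; ++k) y[n] += x[k] * b.at(k, n);
    y[n] += bias[n];
  }
  return y;
}
```

The sketch only covers the arithmetic. The reverted commits above were about a separate problem: the input tensor still arrives with shape [K] at runtime, so the fused kernel itself has to accept that shape.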
Root cause:
cudaStreamWaitEvent is used after copying data from GPU memory to CPU memory, but the following node has CPU code that depends on that data. cudaEventSynchronize should be used instead.
Fix:
Add code in the executor to check the input memory type first. If the node wants CPU memory, pass the CPUExecutionProvider type to BeforeUsingAsInput, which then uses cudaEventSynchronize to wait for the write event.
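
The decision described in the fix can be sketched as below. This is an illustration only: `MemType`, `WaitKind`, and `ChooseWait` are hypothetical names, not the executor's real types, and the sketch models the choice of wait rather than calling the CUDA runtime.

```cpp
// Hypothetical names for this sketch; they are not the executor's real types.
enum class MemType { kDevice, kCpu };
enum class WaitKind { kStreamWait, kHostBlockingWait };

// A consumer that reads its input from CPU memory runs host code, so the host
// thread itself must block until the device-to-host copy has finished
// (what cudaEventSynchronize does). A consumer that reads device memory only
// needs its stream ordered after the write event (what cudaStreamWaitEvent
// does), which returns immediately on the host.
WaitKind ChooseWait(MemType input_mem_type) {
  return input_mem_type == MemType::kCpu ? WaitKind::kHostBlockingWait
                                         : WaitKind::kStreamWait;
}
```

The point of the distinction is that cudaStreamWaitEvent only orders later work on a stream. It gives no guarantee to host code that reads the copied buffer right away.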
* Revert to ignoring optional subgraph inputs due to abandoning PR 216. Restores previous behaviour that changed a couple of days ago with the Scan v9 checkin.
* Update to allow either all inputs, or just required inputs to be provided for the subgraph.
* Update IterateSequence to prefer all inputs over required inputs.
* switch to nonblocking threadpool in inference session and session state
* switch to eigen threadpool - first draft
* refine
* refine
* add a switch to easily revert back to windows thread pool
* switch thread pool in test runner and turn on leak checker
* remove unnecessary files
* fix build error
* more build fixes
* catch exceptions in parallel executor
* fix mac build error
* fix mac build error
* more build fixes
* more mac build fixes
* fix cv issue
* change macro to include cuda compiler for disabled compiler warning
* try switching the macro to win32 only
* test #error
* move #disable warning to the top
* Update onnxruntime_framework.cmake
* move eigen include to public scope
* turn off eigenthreadpool by default and add todo comment
* update
* cmake change
* rename
* update
* update
* add cmake
* fix build warnings.
* fix comments
* update cmake to avoid running gemmlowp tests
* update cmake
* update
* fix build break
* update
* fix comments
* fix test failure
* add one more test case with padding.
* fix conv implementation of mkldnn and cuda to use updated computekernelshape function.
* fix linux ci build break
* Check the pads attribute on Conv, and automatically fall back to CPU if the padding is not symmetric
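
A minimal sketch of the symmetric-padding check, assuming the ONNX layout for the Conv `pads` attribute (all begin values first, then all end values). `IsSymmetricPadding` is a hypothetical name chosen for this illustration and may differ from what the provider code uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ONNX Conv stores pads as [x1_begin, x2_begin, ..., x1_end, x2_end, ...].
// Padding is symmetric when begin and end match on every spatial axis.
bool IsSymmetricPadding(const std::vector<int64_t>& pads) {
  if (pads.size() % 2 != 0) return false;  // malformed attribute
  const std::size_t rank = pads.size() / 2;
  for (std::size_t i = 0; i < rank; ++i) {
    if (pads[i] != pads[i + rank]) return false;
  }
  return true;
}
```

A caller would keep the node on the accelerated provider only when this returns true, and otherwise leave it to the CPU implementation.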
* Insert copy nodes after all graph transformers. It causes some issues if the cast transformer runs before the memory copy transformer.
* Fix for non-wide characters in strings for Linux - for C#-native interop
* update some unit tests
* added unicode and utf-8 encoding explicitly for file names
* mkldnn:Conv weight optimization
* weight optimization: review changes
* lock_guard and mutex for thread safety
* mutex added to provider
* lock so that ReOrder is done only once
* removed #ifndef mkldnn_hpp
* keep re-ordered mem buffer in scope
* applied clang format
* review updates: map to unordered map
* conv_mutex to mutex_
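
The pattern these weight-optimization commits describe (reorder once under a lock, keep the reordered buffer alive, look it up in an unordered map) can be sketched roughly as follows. `ReorderedWeightCache`, `GetOrCreate`, and the string key are hypothetical, and a plain `std::vector<float>` stands in for the reordered MKL-DNN memory.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical cache for this sketch; names are not taken from the provider code.
class ReorderedWeightCache {
 public:
  // Returns the cached buffer for `key`, running `reorder` only on the first call.
  template <typename ReorderFn>
  std::shared_ptr<std::vector<float>> GetOrCreate(const std::string& key,
                                                  ReorderFn reorder) {
    std::lock_guard<std::mutex> lock(mutex_);  // serializes concurrent callers
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;
    // shared_ptr keeps the reordered buffer in scope for as long as it is used.
    auto buffer = std::make_shared<std::vector<float>>(reorder());
    cache_.emplace(key, buffer);
    return buffer;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<std::vector<float>>> cache_;
};
```

Holding the lock across the reorder keeps the sketch simple and guarantees a single reorder per key, at the cost of blocking other callers while the first reorder runs.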
* implement dynamic slice cuda
* add template parameter
* add declaration
* init base class
* exclude case from cuda
* use cuda mapped type
* separate function implementation
* add copy logic
* refactor
* add type check
* use InputMemoryType
* merge functions
* Make OrtAllocator not be reference counted
* Make the allocator interface more type safe
* Fix build break
* Build break fix
* Build break fix
* Mistake in previous build fix.
* Fix review comments + build break
* Missed the export symbols
* C specific error, need 'struct' keyword in one case.
* Function calling OrtReleaseObject instead of OrtReleaseEnv