Introduce sparse_initializers support.
Convert them to dense on model load and prune graph_proto_
so they don't consume space. Convert back to sparse on ORT Format model save.
Implement serializing sparse initializers to OrtFormat.
Fix Model::ToProto() to return original sparse initializers
Set a flag that graph_sync is needed when loading a simple ORT Format model;
otherwise nothing is resolved.
Add ORT Format history to README.md
ifdef MINIMAL build for DenseToSparseTensorInitializer
Allow duplicate initializers to support existing models.
Issue a warning instead of aborting.
* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.
Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
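
The sparse/dense round trip described in the commit above can be illustrated with a minimal sketch. It assumes a COO layout with linearized (flat) indices and float values; `SparseInitializer`, `SparseToDense`, and `DenseToSparse` are hypothetical names for illustration and are not the ONNX Runtime types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a sparse initializer in COO form with
// linearized (flat) indices. Not an ONNX Runtime type.
struct SparseInitializer {
  std::vector<int64_t> dims;     // dense shape
  std::vector<int64_t> indices;  // flat index of each stored value
  std::vector<float> values;     // stored values, same length as indices
};

// Number of elements in the dense tensor described by dims.
inline int64_t DenseSize(const std::vector<int64_t>& dims) {
  int64_t n = 1;
  for (int64_t d : dims) n *= d;
  return n;
}

// Expand to a dense buffer: zero-fill, then write each value at its index.
inline std::vector<float> SparseToDense(const SparseInitializer& s) {
  assert(s.indices.size() == s.values.size());
  const int64_t size = DenseSize(s.dims);
  std::vector<float> dense(static_cast<size_t>(size), 0.0f);
  for (size_t i = 0; i < s.values.size(); ++i) {
    assert(s.indices[i] >= 0 && s.indices[i] < size);
    dense[static_cast<size_t>(s.indices[i])] = s.values[i];
  }
  return dense;
}

// Compress a dense buffer back to COO form by keeping only non-zero entries.
inline SparseInitializer DenseToSparse(const std::vector<float>& dense,
                                       const std::vector<int64_t>& dims) {
  assert(static_cast<int64_t>(dense.size()) == DenseSize(dims));
  SparseInitializer s;
  s.dims = dims;
  for (size_t i = 0; i < dense.size(); ++i) {
    if (dense[i] != 0.0f) {
      s.indices.push_back(static_cast<int64_t>(i));
      s.values.push_back(dense[i]);
    }
  }
  return s;
}
```

In this sketch the dense copy is what would be used at run time, and the sparse form is only rebuilt when saving.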
Prepacking in subgraphs is not currently supported. More and more models contain subgraphs with MatMul, MatMulInteger, and other ops, and prepacking can speed up those models significantly.
The existing implementation of the GatherGrad CUDA kernel exposes little parallelism for certain inputs, which can lead to poor performance.
The computation essentially involves multiple summations. The values are gathered from the input and the sums are scattered to the output.
Previously, each sum was computed by a single thread. If one summation covers a large number of values, it can significantly impact the overall kernel execution time.
The updated version adds an alternate implementation that splits the sums into partial sums, which are accumulated later. This allows for more parallelism. A significant downside is that the alternate implementation requires CPU and GPU synchronization, because intermediate GPU results are required by the CPU computation. The original implementation outperformed the alternate for certain inputs (e.g., where the maximum number of values in a sum was not large), so the updated version chooses between them based on the input. The input analysis has some overhead.
The implementation was adapted from PyTorch (b186831c08/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu).
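
The two strategies described above can be sketched on the CPU as follows. This is a simplified illustration, not the CUDA kernel: it assumes one scalar gradient per gathered index, and the names (`BuildSegments`, `SumDirect`, `SumPartial`, `GatherGradSketch`) and both constants are hypothetical. Each loop iteration marked "independent" stands for work that a separate GPU thread could do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

constexpr size_t kMaxDirectSegment = 4;  // hypothetical selection threshold
constexpr size_t kPartialSize = 2;       // hypothetical values per partial sum

struct Segments {
  std::vector<size_t> order;   // positions in grad, sorted by target row
  std::vector<size_t> starts;  // start of each segment in order, plus end sentinel
};

// Group the gradient positions by the output row they contribute to.
inline Segments BuildSegments(const std::vector<int64_t>& indices) {
  Segments seg;
  seg.order.resize(indices.size());
  std::iota(seg.order.begin(), seg.order.end(), size_t{0});
  std::stable_sort(seg.order.begin(), seg.order.end(),
                   [&indices](size_t a, size_t b) { return indices[a] < indices[b]; });
  for (size_t i = 0; i < seg.order.size(); ++i) {
    if (i == 0 || indices[seg.order[i]] != indices[seg.order[i - 1]]) seg.starts.push_back(i);
  }
  seg.starts.push_back(seg.order.size());
  return seg;
}

// Original strategy: one worker computes each whole sum.
inline std::vector<float> SumDirect(const std::vector<int64_t>& indices,
                                    const std::vector<float>& grad,
                                    const Segments& seg, size_t num_rows) {
  std::vector<float> out(num_rows, 0.0f);
  for (size_t s = 0; s + 1 < seg.starts.size(); ++s) {  // independent per segment
    float sum = 0.0f;
    for (size_t i = seg.starts[s]; i < seg.starts[s + 1]; ++i) sum += grad[seg.order[i]];
    const int64_t row = indices[seg.order[seg.starts[s]]];
    assert(row >= 0 && static_cast<size_t>(row) < num_rows);
    out[static_cast<size_t>(row)] = sum;
  }
  return out;
}

// Alternate strategy: split each sum into bounded partial sums, then accumulate.
inline std::vector<float> SumPartial(const std::vector<int64_t>& indices,
                                     const std::vector<float>& grad,
                                     const Segments& seg, size_t num_rows) {
  struct Part { size_t begin; size_t end; int64_t row; };
  std::vector<Part> parts;
  for (size_t s = 0; s + 1 < seg.starts.size(); ++s) {
    const int64_t row = indices[seg.order[seg.starts[s]]];
    assert(row >= 0 && static_cast<size_t>(row) < num_rows);
    for (size_t b = seg.starts[s]; b < seg.starts[s + 1]; b += kPartialSize) {
      parts.push_back({b, std::min(b + kPartialSize, seg.starts[s + 1]), row});
    }
  }
  std::vector<float> partial(parts.size(), 0.0f);
  for (size_t p = 0; p < parts.size(); ++p) {  // independent per partial sum
    for (size_t i = parts[p].begin; i < parts[p].end; ++i) partial[p] += grad[seg.order[i]];
  }
  std::vector<float> out(num_rows, 0.0f);
  for (size_t p = 0; p < parts.size(); ++p) out[static_cast<size_t>(parts[p].row)] += partial[p];
  return out;
}

// Analyze the input, then pick a strategy based on the longest sum.
inline std::vector<float> GatherGradSketch(const std::vector<int64_t>& indices,
                                           const std::vector<float>& grad, size_t num_rows) {
  assert(indices.size() == grad.size());
  const Segments seg = BuildSegments(indices);
  size_t max_len = 0;
  for (size_t s = 0; s + 1 < seg.starts.size(); ++s) {
    max_len = std::max(max_len, seg.starts[s + 1] - seg.starts[s]);
  }
  return max_len > kMaxDirectSegment ? SumPartial(indices, grad, seg, num_rows)
                                     : SumDirect(indices, grad, seg, num_rows);
}
```

Bounding the size of each partial sum keeps the work per worker roughly uniform even when one row receives most of the contributions; the cost is the extra analysis pass and the final accumulation step.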
* Add test for CommonSubexpressionElimination in training.
* Enable CommonSubexpressionElimination in training.
* Add CommonSubexpressionEliminationApplyOnce for training.
* Add flatten support for NNAPI, correct some typos in NNAPI code files
* Address review comments
* Update CanSkipReshape
* Add test to verify NNAPI is actually running for a supported model
* Add reshape/flatten tests for NNAPI
* Add one extra verbose log for skipping reshape
* Fix Android CI failure
* Correct test file name to fix Android CI failure
* Build ACL and ArmNN with custom library path
* Define import to tensor as a separate function for maintenance and readability
* Enabled optimized depthwise convolution for ACL v20.02
* Check operation status for ACL and ArmNN Execution Providers
* Enabled fused operation for convolution-activation
Co-authored-by: Andrei-Alexandru <andrei-alexandru.avram@nxp.com>
* TopKGrad CPU kernel
* use Scatter for GatherElementsGrad and TopKGrad.
* rollback convgrad change.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
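
The idea of expressing both gradients as a scatter can be sketched as below. This is a minimal CPU illustration over the last axis of a 2-D tensor, not the actual kernel; `ScatterAddLastAxis` is a hypothetical name. The sketch accumulates on repeated indices, which GatherElements can produce; whether the real kernels accumulate the same way is not stated in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scatter `updates` (rows x k) into a zero-filled tensor (rows x cols) along
// the last axis. For TopKGrad, `updates` is the gradient of the top-k values
// and `indices` are the positions TopK selected in the forward pass.
inline std::vector<float> ScatterAddLastAxis(const std::vector<float>& updates,
                                             const std::vector<int64_t>& indices,
                                             size_t rows, size_t k, size_t cols) {
  assert(updates.size() == rows * k);
  assert(indices.size() == rows * k);
  std::vector<float> out(rows * cols, 0.0f);
  for (size_t r = 0; r < rows; ++r) {
    for (size_t j = 0; j < k; ++j) {
      const int64_t c = indices[r * k + j];
      assert(c >= 0 && static_cast<size_t>(c) < cols);
      // Accumulate so that repeated indices sum their contributions.
      out[r * cols + static_cast<size_t>(c)] += updates[r * k + j];
    }
  }
  return out;
}
```

Positions that were not selected in the forward pass keep a zero gradient, which matches the fact that they did not contribute to the output.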
* Change shared providers so that they are shut down before the shared library is unloaded
* Move UnloadSharedProviders declaration into a shared header to avoid bugs.
* updating examples with current api calls
* Fixing capitalization in api calls, adding RKNPU update
* Correcting nuphar and rknpu ep api calls
* Include creating session in readme
* Model test start with float
* Clean up code and add environment variable detection
* Move into namespace
* PR comments
* Fix linker errors in latest merge to master and also fix warning
* add skipping model test mechanism
* Return std::string instead of writing to buffer
* Address case where env variable is longer than MAX_PATH
* use const static string for test reason
* Disable x86 tests and don't build if ort memory checker is enabled
* Add comment
* Add additional failing x86 tests and ifdef for checking for x86 build
* PR comments
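
The "return std::string instead of writing to buffer" change above can be illustrated with a small sketch. It uses the standard `std::getenv`; the helper name `GetEnvVarOrEmpty` is hypothetical and is not the function from this change. Returning a `std::string` removes the fixed-size buffer, so a long value is not truncated.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Return the value of an environment variable, or an empty string when it is
// not set. The result owns its storage, so there is no length limit imposed
// by a caller-provided buffer.
inline std::string GetEnvVarOrEmpty(const std::string& name) {
  assert(!name.empty());
  const char* value = std::getenv(name.c_str());
  return value != nullptr ? std::string(value) : std::string();
}
```

A caller that needs to tell "unset" apart from "set to an empty value" would need a different return type; this sketch deliberately treats both the same.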