The existing GatherGrad CUDA kernel exposes little parallelism for certain inputs, which can lead to poor performance.
The computation essentially involves multiple summations. The values are gathered from the input and the sums are scattered to the output.
Previously, each sum was computed by a single thread. A single summation over a large number of values can therefore dominate the overall kernel execution time.
The updated version adds an alternate implementation that splits each sum into partial sums, which are accumulated in a later step. This allows for more parallelism. A significant downside is that the alternate implementation requires CPU-GPU synchronization, because intermediate GPU results are needed by the CPU-side computation. The original implementation outperformed the alternate one for certain inputs (e.g., where the maximum number of values in a sum was not large), so the updated version chooses between them based on the input. This input analysis adds some overhead.
The implementation was adapted from PyTorch (b186831c08/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu).
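The partial-sum idea described above can be illustrated with a small sequential C++ sketch. It is not the CUDA kernel from this change: it assumes scalar gradient rows and gradient entries already sorted by target index, and the names `Segment`, `SplitIntoPartialSegments`, `GatherGradPartialSums`, and `max_len` are hypothetical. Each segment in stage 1 is independent of the others, which is the property a GPU implementation could exploit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: a contiguous slice [begin, end) of the sorted gradient
// entries that all contribute to the same output row.
struct Segment {
  int64_t output_row;
  size_t begin;
  size_t end;
};

// Split index-sorted entries into segments of at most max_len entries each,
// never mixing two different target indices in one segment.
std::vector<Segment> SplitIntoPartialSegments(
    const std::vector<int64_t>& sorted_indices, size_t max_len) {
  assert(max_len > 0);
  std::vector<Segment> segments;
  size_t begin = 0;
  while (begin < sorted_indices.size()) {
    size_t end = begin + 1;
    while (end < sorted_indices.size() && end - begin < max_len &&
           sorted_indices[end] == sorted_indices[begin]) {
      ++end;
    }
    segments.push_back({sorted_indices[begin], begin, end});
    begin = end;
  }
  return segments;
}

// Stage 1 computes one partial sum per segment (independent work items).
// Stage 2 accumulates the partial sums into the output rows.
std::vector<float> GatherGradPartialSums(
    const std::vector<int64_t>& sorted_indices,
    const std::vector<float>& sorted_grads,
    size_t num_output_rows, size_t max_len) {
  assert(sorted_indices.size() == sorted_grads.size());
  const std::vector<Segment> segments =
      SplitIntoPartialSegments(sorted_indices, max_len);

  // Stage 1: partial sums.
  std::vector<float> partial(segments.size(), 0.0f);
  for (size_t s = 0; s < segments.size(); ++s) {
    for (size_t i = segments[s].begin; i < segments[s].end; ++i) {
      partial[s] += sorted_grads[i];
    }
  }

  // Stage 2: scatter-accumulate the partial sums into the output.
  std::vector<float> output(num_output_rows, 0.0f);
  for (size_t s = 0; s < segments.size(); ++s) {
    output[static_cast<size_t>(segments[s].output_row)] += partial[s];
  }
  return output;
}
```

With a small `max_len`, a sum over many values becomes several short segments instead of one long serial loop; the cost is the extra segmentation pass, which corresponds to the input-analysis overhead mentioned above.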
* Add test for CommonSubexpressionElimination in training.
* Enable CommonSubexpressionElimination in training.
* Add CommonSubexpressionEliminationApplyOnce for training.
* Add flatten support for NNAPI, correct some typos in NNAPI code files
* Address review comments
* Update CanSkipReshape
* Add test to verify NNAPI is actually running for a supported model
* Add reshape/flatten test for NNAPI
* Add one extra verbose log for skipping reshape
* Fix Android CI failure
* Correct test file name to fix Android CI failure
* Build ACL and ArmNN with custom library path
* Define import to tensor as a separate function for maintenance and readability
* Enabled optimized depthwise convolution for ACL v20.02
* Check operation status for ACL and ArmNN Execution Providers
* Enabled fused operation for convolution-activation
Co-authored-by: Andrei-Alexandru <andrei-alexandru.avram@nxp.com>
* TopKGrad CPU kernel
* use Scatter for GatherElementsGrad and TopKGrad.
* rollback convgrad change.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* Change shared providers so that they are shutdown before shared library unload
* Move UnloadSharedProviders declaration into a shared header to avoid bugs.
* updating examples with current api calls
* Fixing capitalization in api calls, adding RKNPU update
* Correcting nuphar and rknpu ep api calls
* Include creating session in readme
* Model test start with float
* Clean up code and add environment variable detection
* Move into namespace
* PR comments
* Fix linker errors in latest merge to master and also fix warning
* add skipping model test mechanism
* Return std::string instead of writing to buffer
* Address case where env variable is larger than max_path
* use const static string for test reason
* Disable x86 tests and don't build if ort memory checker is enabled
* Add comment
* Add additional failing x86 tests and ifdef for checking for x86 build
* PR comments
* Add type inference for BroadcastGradientArgs
This change enables the ONNX shape and type inference to work on a function body containing a BroadcastGradientArgs op. Without this change, the dummy inference function is used, and no types are inferred for the output here:
531e6dd459/onnx/shape_inference/implementation.cc (L467-L469)
* Handle optional outputs.
* Created shared version of InferenceSession wrapper class and update relevant tests to use it.
Include domain in the ops counting helper so it's more general and we don't need to duplicate it in the nchwc tests. Update tests to include domain in key being checked.
* Fix some training tests
* Fix prefixing of contrib op names in test
* Add session option and global thread pool option to set denormal as zero.
* Revert unnecessary changes.
* Add cpuinfo submodule
* Add more comments
* Remove cpuinfo submodule dependency and check only SSE3 support for FTZ and DAZ, inspired by TensorFlow
* Preserve API order in C api
* Clean up and utilize SSE3 detection logic from existing cpuid_info.h
* Keep the same order with header file
* Fix build issue with Linux pipeline, which has old g++ compiler
* Fix broken build on Linux and remove a duplicated unit test
* Remove reformatting at eigen thread pool
* Remove flatbuffers which is not intentionally added
* Revert "Remove flatbuffers which is not intentionally added"
This reverts commit 9f509a9aaaa3c7832d88854c82fd26b234770b7f.
* Remove flatbuffers which is not intentionally added
* Resolve comments
- Put details on APIs
- Add a log for ftz/daz initialization
- Add clang
- Fix typo
* Remove unnecessary header include
* Resolve comments
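
For the denormal-as-zero option above, the following is a minimal sketch of how the FTZ (flush-to-zero) and DAZ (denormals-are-zero) bits can be toggled in the x86 MXCSR register. It is not the code from this change: the helper names `SetDenormalAsZero` and `DenormalProductIsZero` are hypothetical, the sketch only targets x86-64, and it omits the SSE3 CPU feature check that the commits above mention.

```cpp
#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr
#define SKETCH_HAS_MXCSR 1
#else
#define SKETCH_HAS_MXCSR 0
#endif

// Hypothetical helper: turn FTZ and DAZ on or off for the calling thread.
// Returns false on platforms where this sketch does nothing.
bool SetDenormalAsZero(bool on) {
#if SKETCH_HAS_MXCSR
  constexpr unsigned int kFlushToZero = 0x8000;      // MXCSR bit 15 (FTZ)
  constexpr unsigned int kDenormalsAreZero = 0x0040; // MXCSR bit 6 (DAZ)
  unsigned int csr = _mm_getcsr();
  if (on) {
    csr |= (kFlushToZero | kDenormalsAreZero);
  } else {
    csr &= ~(kFlushToZero | kDenormalsAreZero);
  }
  _mm_setcsr(csr);
  return true;
#else
  (void)on;
  return false;
#endif
}

// Hypothetical probe: multiply a denormal float and report whether the
// result is exactly zero. volatile keeps the compiler from folding the
// multiplication at compile time.
bool DenormalProductIsZero() {
  volatile float tiny = 1e-38f;  // below the smallest normal float (~1.175e-38)
  volatile float half = 0.5f;
  volatile float product = tiny * half;
  return product == 0.0f;
}
```

MXCSR is per-thread state, so a setting applied on one thread does not carry over to worker threads; that is presumably why the change exposes both a session option and a global thread pool option.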