Now that we are using the legacy default stream, which is shared among all inference threads,
there is no need for a per-thread allocator.
In the past, a race could happen when two threads ran concurrently on the GPU:
thread1: allocA->copyA->computeA->freeA
thread2: allocB->copyB->computeB->freeB
Note that freeA/B only means the buffer is available for reallocation on the CPU side, while the corresponding
operation on the GPU may not have finished yet. Thread1 and thread2 can end up using the same buffer when their
alloc/free pairs are not interleaved (note that alloc/free is thread-safe).
If the GPU commands run in separate per-thread default streams, there is a chance that copyA/computeA
are interleaved with copyB/computeB, even when the CPU execution order is not interleaved. This
would cause incorrect results if computeB uses copyA's results.
By using one legacy default stream, the CPU execution order matches the GPU execution order, so
if A and B use the same buffer from alloc, the corresponding copy/compute won't be interleaved. If
the copy/compute is indeed interleaved, then allocA and allocB would return different buffers, so
there is no race either.
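
The ordering argument above can be illustrated with a toy model. This is only a sketch, not ORT or CUDA code: the `run` helper and the schedules are hypothetical, and the "GPU" is modelled as a list of ops applied in order to one buffer that both threads received from the allocator.

```python
def run(schedule):
    """Apply GPU ops in the given order to one buffer shared by A and B."""
    buffer = None
    results = {}
    for op, name in schedule:
        if op == "copy":
            buffer = name            # upload this thread's input into the buffer
        else:                        # "compute"
            results[name] = buffer   # compute reads whatever is in the buffer now
    return results


# One legacy default stream: the GPU order equals the CPU issue order.
single_stream = run([("copy", "A"), ("compute", "A"),
                     ("copy", "B"), ("compute", "B")])

# Per-thread default streams: each stream keeps its own order, but the two
# streams may interleave on the GPU even though the CPU calls did not.
per_thread = run([("copy", "A"), ("copy", "B"),
                  ("compute", "A"), ("compute", "B")])

print(single_stream)  # each compute sees its own input
print(per_thread)     # computeA sees B's data: the race
```

In the interleaved schedule, computeA reads the data written by copyB, which is the incorrect result the shared stream avoids.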
* Add test for FastGelu + GeluRecompute.
* Fix GeluRecompute for 2 inputs case.
* Fix test for BiasGelu + GeluRecompute.
* Copy all inputs to Gelu, not just 2.
* Move GeluRecompute test to training-specific file.
Fixed static IS_ANDROID detection
The final static IS_ANDROID field was causing the error "Unsupport arch:aarch64", so IS_ANDROID was removed and replaced with isAndroid().
Description: This PR makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses _mm_pause intrinsics in spin loops, which helps avoid consuming pipeline resources while waiting. (2) It reorganizes the spin-then-steal loop for work distribution so that it starts out spinning as intended, rather than starting out by trying to steal. (3) It updates the ThreadPool class's API to consistently use static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases.
Motivation and Context
The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used they will run work sequentially.
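
The spin-then-steal order described in (2) can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the ThreadPool code: `next_task`, `spin_count`, and the queue layout are hypothetical, and the pause instruction is only marked by a comment because Python has no equivalent.

```python
from collections import deque


def next_task(own, others, spin_count=3):
    """Return (task, source): spin on the worker's own queue first, then steal."""
    # Phase 1: spin on our own queue for a bounded number of iterations.
    for _ in range(spin_count):
        if own:
            return own.popleft(), "own"
        # A native implementation would issue a pause instruction here.
    # Phase 2: only after spinning, try to steal from another worker.
    for victim in others:
        if victim:
            return victim.pop(), "stolen"
    return None, "idle"


print(next_task(deque(["a"]), [deque(["b"])]))
print(next_task(deque(), [deque(["b"])]))
```

The point of the ordering is that a worker with local work never touches another worker's queue.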
replace number matching with relaxed comparison in frontend tests
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
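
As an illustration of relaxed comparison versus exact number matching, here is a minimal sketch. The helper name `assert_close` and the tolerances are assumptions; the tolerances actually used in the frontend tests are not stated in this message.

```python
import math


def assert_close(actual, expected, rel_tol=1e-5, abs_tol=1e-8):
    """Fail only if the values differ by more than the given tolerances."""
    assert math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol), (
        f"{actual} is not close to {expected}"
    )


exact_match = (0.1 + 0.2 == 0.3)   # exact matching breaks on rounding error
assert_close(0.1 + 0.2, 0.3)       # relaxed comparison passes
```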
* Enabled multi-threading for OpenVINO EP
-> Enabled support for concurrent_session_runs
* Run UEP using concurrent_session_runs > 1
* Enabled support for ORT_PARALLEL ExecutionMode
-> Added documentation for enabling multi-threading
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor fixes added
* Configure the value of nireq at runtime
* Documentation typos rectified and details
added for multi-threaded inference
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Some checks added for this fix
* Added checks to reject an invalid nireq value
and fall back to the default value of 8
* Added new config options for enable_vpu_fast_compile,
which were changed w.r.t. the OpenVINO_2021.1 release
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
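
The nireq check described above can be sketched roughly as follows. The function name `resolve_nireq` and the definition of "invalid" (not an integer, or not positive) are assumptions for illustration; only the default value of 8 comes from the message.

```python
DEFAULT_NIREQ = 8  # default value named in the commit message


def resolve_nireq(raw, default=DEFAULT_NIREQ):
    """Return a usable nireq, falling back to the default for invalid input."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default                 # not a number at all
    # Assumption: non-positive request counts are treated as invalid.
    return value if value > 0 else default


print(resolve_nireq("4"), resolve_nireq("abc"), resolve_nireq(0))
```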
Introduce sparse_initializers support.
Convert them to dense on model load and prune graph_proto_
so they don't consume space. Convert back to sparse on ORT Format model save.
Implement serializing sparse initializers to OrtFormat.
Fix Model::ToProto() to return original sparse initializers
Set a flag that graph_sync is needed when loading a simple ORT Format model;
otherwise nothing is resolved.
Add ORT Format history to README.md
ifdef MINIMAL build for DenseToSparseTensorInitializer
Allow duplicate initializers to support existing models.
Issue a warning instead of aborting.
* Revert "Remove SparseTensor support from minimal build. (#5114)"
This reverts commit 59ee8ffb17.
Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
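
The sparse-to-dense conversion on load and the dense-to-sparse conversion on save can be pictured with a small round trip. This is a sketch on flat Python lists with hypothetical helper names; it does not reflect the ORT tensor types or the actual layout of sparse initializers.

```python
def sparse_to_dense(values, indices, size):
    """Expand (values, flat indices) into a dense list of the given size."""
    dense = [0] * size
    for value, index in zip(values, indices):
        dense[index] = value
    return dense


def dense_to_sparse(dense):
    """Keep only the non-zero entries and remember their flat indices."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return values, indices


dense = sparse_to_dense([3, 5], [1, 4], 6)   # what a load would produce
round_trip = dense_to_sparse(dense)          # what a save would write back
```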
Prepacking in subgraphs is not currently supported. We see more and more models with subgraphs that contain MatMul, MatMulInteger, and other ops. Prepacking can speed up those models significantly.
The existing implementation of the GatherGrad CUDA kernel does not parallelize work well for certain inputs, which can lead to poor performance.
The computation essentially involves multiple summations. The values are gathered from the input and the sums are scattered to the output.
Previously, each sum was computed by a single thread. If any single summation covers a large number of values, it can significantly increase the overall kernel execution time.
The updated version has an alternate implementation which splits the sums into partial sums which get accumulated together later. This allows for more parallelism. A significant downside is that the alternate implementation requires CPU and GPU synchronization because intermediate GPU results are required by the CPU computation. The original implementation outperformed the alternate for certain inputs (e.g., where the maximum number of values in a sum was not large), so the updated version chooses between them based on the input. The input analysis has some overhead.
The implementation was adapted from PyTorch (b186831c08/aten/src/ATen/native/cuda/EmbeddingBackwardKernel.cu).
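
The two strategies and the choice between them can be sketched on the CPU as follows. This is an illustration only, not the CUDA kernel: the function names, the `chunk` size, and the `threshold` are hypothetical, and the per-chunk sums that would run in parallel on the GPU are computed sequentially here.

```python
from collections import defaultdict


def gather_grad_direct(indices, grads, num_rows):
    """Original idea: each output row's sum is accumulated in one pass."""
    out = [0.0] * num_rows
    for idx, g in zip(indices, grads):
        out[idx] += g
    return out


def gather_grad_partial(indices, grads, num_rows, chunk=4):
    """Alternate idea: split each sum into partial sums, then accumulate."""
    segments = defaultdict(list)
    for idx, g in zip(indices, grads):
        segments[idx].append(g)
    # Phase 1: partial sums over chunks (independent, so parallelizable).
    partials = [(idx, sum(vals[start:start + chunk]))
                for idx, vals in segments.items()
                for start in range(0, len(vals), chunk)]
    # Phase 2: accumulate the partial sums into the output.
    out = [0.0] * num_rows
    for idx, p in partials:
        out[idx] += p
    return out


def gather_grad(indices, grads, num_rows, threshold=4):
    """Pick a strategy from the largest number of values in any one sum."""
    counts = defaultdict(int)
    for idx in indices:
        counts[idx] += 1
    if max(counts.values(), default=0) > threshold:
        return gather_grad_partial(indices, grads, num_rows)
    return gather_grad_direct(indices, grads, num_rows)


indices = [0] * 10 + [2, 2, 1]
grads = [1.0] * 10 + [2.0, 3.0, 4.0]
print(gather_grad(indices, grads, 3))
```

The input analysis in `gather_grad` corresponds to the overhead mentioned above: it has to inspect the indices before any summation starts.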
* Add test for CommonSubexpressionElimination in training.
* Enable CommonSubexpressionElimination in training.
* Add CommonSubexpressionEliminationApplyOnce for training.
* Add flatten support for NNAPI, correct some typos in NNAPI code files
* Address review comments
* Update CanSkipReshape
* Add test to verify that NNAPI is actually running for a supported model
* Add reshape/flatten test for NNAPI
* Add one extra verbose log for skipping reshape
* Fix Android CI failure
* Correct test file name to fix Android CI failure
* Build ACL and ArmNN with custom library path
* Define import to tensor as a separate function for maintenance and readability
* Enabled optimized depthwise convolution for ACL v20.02
* Check operation status for ACL and ArmNN Execution Providers
* Enabled fused operation for convolution-activation
Co-authored-by: Andrei-Alexandru <andrei-alexandru.avram@nxp.com>
* TopKGrad CPU kernel
* use Scatter for GatherElementsGrad and TopKGrad.
* rollback convgrad change.
Co-authored-by: Vincent Wang <weicwang@microsoft.com>