* Add options for nnapi ep
* Add nnapi flags test
* add comments
* Add flag comments
* Make the flags bitset const
* Fix build break
* Add stub changes to java and c# api
* Fix java related build break
* Fix java build break
* Switch to bit flags instead of bitset
* Split change
* ReduceSum and Split change
* Other op changes, Grad builder, tests, registering required opset 13 ops
* Rebase fixes
* Fix tests, add some more
* Review changes, rebase
* Fix windows build
Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Implement Hetero in UEP
* Added security checks to accept only valid Hetero combinations
as device type
* Integrating Hetero features
* Get the statistics Report in Debug Mode
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Pass the right device type for vadm_backend
Added a simple fix to pick the right device type
when using vadm_backend with Hetero as well.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixed batching logic for 2020.4 and above
* Fixed flake8 PEP8 errors
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor Fixes Added
  * Added security checks for the device_type passed
    in for Hetero builds at run time
  * Code cleanup
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor changes Added
  * Fixed batch_size bug in vadm_backend
  * Code cleanup
  * Documentation updated for Hetero
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* Some fixes to symbolic shape inference
1. Topological sort before iteration in graph
2. Fix a case in slice: start=100000, end=-100000, step=-1, dim=2
3. Fix Nuphar Gemm test's random seed
4. Slice opset 1 axes is optional
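The slice case in item 2 is easier to follow with the arithmetic written out. The following is a minimal sketch, not the code from this change: `sliced_dim` is a hypothetical helper, and the clamping rules are an assumption based on the ONNX Slice operator's documented behaviour (for a negative step, start is clamped to [0, dim-1] and end to [-1, dim-1]). Python's built-in slicing serves as a cross-check.

```python
import math


def sliced_dim(start, end, step, dim):
    """Return the output length of slicing one axis of size `dim`.

    Hypothetical helper; clamping is assumed to follow ONNX Slice semantics.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    # Negative indices count from the end of the axis.
    if start < 0:
        start += dim
    if end < 0:
        end += dim
    if step > 0:
        start = min(max(start, 0), dim)
        end = min(max(end, 0), dim)
    else:
        # When walking backwards, -1 is a valid exclusive end ("before index 0").
        start = min(max(start, 0), dim - 1)
        end = min(max(end, -1), dim - 1)
    return max(0, math.ceil((end - start) / step))


# The case from the list above: out-of-range bounds with a negative step.
length = sliced_dim(100000, -100000, -1, 2)
# Cross-check against Python's own slicing of a 2-element axis.
reference = len(range(2)[100000:-100000:-1])
```

Here both `length` and `reference` cover the whole axis in reverse, since the oversized start clamps to the last index and the very negative end clamps to just before index 0.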
* add case for cpu custom op on gpu
* format doc
* restrict GPU custom op on Linux GPU CI only
* separate cu file into an independent project
* fix typo
Co-authored-by: RandySheriffH <rashuai@microsoft.com>
The ROCm EP is designed and implemented on top of ROCm, the AMD GPU software stack. Details about ROCm: https://rocmdocs.amd.com/en/latest/
The ROCm EP is built on the following components:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: MIOpen
5. Collective Communication library: RCCL
6. cub: hipCUB
7. …
Current status:
BERT-L and GPT2 training can be run on AMD GPUs with data parallelism.
Next:
1. Make more GPU code shareable between the ROCm EP and the CUDA EP, since the HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……
The ROCm kernels have been removed from this commit and will be submitted in a separate PR. Since the original PR was too big (~180 files), it was suggested to split it into two parts: one for the ROCm kernels, the other for everything that is not a ROCm kernel.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
* Introduce OpKernelInfo GetAttrAsSpan() for float and int attribute proto arrays
and GetAttrsStringRefs() to return a vector of string references.
These new APIs let kernels avoid copying attribute arrays, especially large ones,
and instead refer directly to the data held in AttributeProto,
which saves memory.
Modify TfIdfVectorizer to take advantage of the new API.
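The difference between copying an attribute array and referring to it in place can be illustrated with a small analogy. This sketch does not use the OpKernelInfo API; it swaps in Python's `memoryview` as a stand-in for a span, and the `array` object is a hypothetical placeholder for the float data stored in an attribute.

```python
from array import array

# Hypothetical stand-in for the float array stored inside an attribute.
stored = array("f", [0.5, 1.5, 2.5])

# A view refers to the stored data directly; no elements are copied.
view = memoryview(stored)

# A list materializes a second copy of every element.
copied = list(stored)

# Changing the stored data is visible through the view but not the copy,
# which shows that the view shares the original memory.
stored[0] = 9.0
view_first = view[0]
copied_first = copied[0]
```

The view costs a fixed, small amount of memory regardless of array size, which is why the saving grows with large attributes.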
Signed-off-by: Dmitri Smirnov <dmitrism@microsoft.com>
Now that we are using legacy default stream, which is shared among all inference threads,
there is no need to have per-thread allocator.
In the past, a race could happen when two threads ran concurrently on the GPU:
thread1: allocA->copyA->computeA->freeA
thread2: allocB->copyB->computeB->freeB
Note that freeA/B only means the buffer is available for reallocation on the CPU side, while the
corresponding GPU operation may not have finished yet. thread1 and thread2 can be handed the same
buffer when their alloc/free pairs are not interleaved (note that alloc/free is thread-safe).
If the GPU commands run in separate per-thread default streams, there's a chance that copyA/computeA
are interleaved with copyB/computeB, even when the order of CPU execution is not interleaved. This
would cause incorrect results if computeB uses copyA's results.
By using one legacy default stream, the CPU execution order matches the GPU execution order, so
if A and B use the same buffer from alloc, the corresponding copy/compute won't be interleaved. If
the copy/compute is indeed interleaved, then allocA and allocB would have returned different
buffers, so there is no race either.
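The ordering argument above can be checked with a toy model. This is a sketch under simplifying assumptions, not GPU code: `run` is a hypothetical function that replays copy/compute commands against a single shared buffer in whatever order it is given, and "compute" just doubles the buffer contents.

```python
def run(commands):
    """Replay (op, tag, value) commands in order against one shared buffer."""
    buffer = None
    results = {}
    for op, tag, value in commands:
        if op == "copy":
            buffer = value              # upload this thread's input into the buffer
        elif op == "compute":
            results[tag] = buffer * 2   # read whatever the buffer holds right now
    return results


thread_a = [("copy", "A", 1), ("compute", "A", None)]
thread_b = [("copy", "B", 10), ("compute", "B", None)]

# One shared stream: GPU order equals CPU order (all of A, then all of B).
single_stream = run(thread_a + thread_b)

# Per-thread streams: the GPU may interleave the two command sequences,
# here copyB, copyA, computeA, computeB.
per_thread_streams = run([thread_b[0], thread_a[0], thread_a[1], thread_b[1]])
```

In the interleaved run, computeB reads the data written by copyA, which is the incorrect result described above; the single-stream run keeps each compute paired with its own copy.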
* Add test for FastGelu + GeluRecompute.
* Fix GeluRecompute for 2 inputs case.
* Fix test for BiasGelu + GeluRecompute.
* Copy all inputs to Gelu, not just 2.
* Move GeluRecompute test to training-specific file.
Fixed static IS_ANDROID detection
The final static IS_ANDROID field was causing the error "Unsupport arch:aarch64", so IS_ANDROID was removed and replaced with isAndroid().
Description: This PR makes three changes to the ThreadPool class to clean up issues identified during performance analysis and optimization. (1) It uses _mm_pause intrinsics in spin loops, which helps avoid consuming pipeline resources while waiting. (2) It re-organizes the spin-then-steal loop for work distribution so that it starts out spinning, as intended, rather than starting out trying to steal. (3) It updates the ThreadPool class's API to consistently use static methods for public functions. The PR includes minor doc updates and corresponding changes to test cases.
Motivation and Context
The change helps ensure consistency in behavior between the OpenMP and Eigen-based implementations. Unlike the instance methods, the static methods abstract over the different ways in which threading can be implemented; they will map onto the OpenMP or Eigen-based implementations when threading is used. When threading is not used they will run work sequentially.
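The idea of static entry points that hide whether threading is in use can be sketched as follows. This is an illustration only, not the ThreadPool API: `PoolFacade`, `parallel_for`, and `squares` are hypothetical names, and Python's `ThreadPoolExecutor` stands in for the underlying threading implementation.

```python
from concurrent.futures import ThreadPoolExecutor


class PoolFacade:
    """Hypothetical stand-in for a thread pool class with static entry points."""

    @staticmethod
    def parallel_for(pool, total, fn):
        """Call fn(i) for i in range(total), in parallel if a pool is given."""
        if pool is None:
            # Threading is not in use: run the work sequentially.
            for i in range(total):
                fn(i)
            return
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(fn, range(total)))


def squares(pool, n):
    out = [0] * n

    def work(i):
        out[i] = i * i  # each index is written by exactly one call

    PoolFacade.parallel_for(pool, n, work)
    return out


sequential = squares(None, 8)
with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = squares(executor, 8)
```

Callers use the same call in both cases; only the static method decides how the work is actually executed.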
replace number matching with relaxed comparison in frontend tests
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>