49 commits

Author SHA1 Message Date
Tianlei Wu
36c3271546
BeamSearch op cuda (#10556)
Add BeamSearch cuda implementation with support of fp16 GPT-2 subgraph
2022-02-25 13:08:55 -08:00
zhangyaobit
fd16085cea
Zhanyao/attention (#10545)
* Enable Attention op for ROCM EP.

As a note, potential hipify improvements: (1) handle math
constants (attention_softmax.h), (2) correctly generate transpose
options for the GEMM helpers, consider counterpart/dummy API for
CublasMathModeSetter (attention_impl.cu, attention_impl.cu). After
these improvements, we don't need to manually keep copies of the
above mentioned files any more.

* Clean up debugging code.
2022-02-17 09:02:45 -08:00
Jeff Daily
e7efcc93fe
[ROCm] update hipify-perl location (#10102)
* [ROCm] update hipify-perl location

Depending on the ROCm version installed, hipify-perl might not always
live in the hard-coded path of /opt/rocm/bin. Use python 3.3's
shutil.which to locate the script.

* provide alternative locations for hipify-perl if not in PATH

* implement hipify-perl search as a function

This avoids running the logic during module import since all builds
import the amd_hipify module.

* fix flake8 errors
2022-01-06 17:21:02 -08:00
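
The lookup described in #10102 can be sketched in Python as follows. This is a minimal illustration, not the code from amd_hipify.py: the function name `find_hipify_perl`, the `FALLBACK_DIRS` list, and the injectable `which` parameter are assumptions made for the example. Only `shutil.which` and the `/opt/rocm/bin` path come from the commit message.

```python
import os
import shutil

# Hypothetical fallback directories; the real list in amd_hipify.py may differ.
FALLBACK_DIRS = ["/opt/rocm/bin", "/opt/rocm/hip/bin"]


def find_hipify_perl(fallback_dirs=FALLBACK_DIRS, which=shutil.which):
    """Return the path to hipify-perl, or None if it cannot be found.

    The search lives in a function so that nothing runs at import time.
    """
    # First choice: whatever is on PATH.
    found = which("hipify-perl")
    if found:
        return found
    # Otherwise try each fallback directory in order.
    for directory in fallback_dirs:
        candidate = os.path.join(directory, "hipify-perl")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

A caller would invoke `find_hipify_perl()` only when a ROCm build actually needs the script, and treat `None` as a configuration error.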
Ye Wang
6856619b18
Decoder Attention CUDA Op (#9792)
* add kernel interface

* register kernel

* add self/cross qkv projection without cache

* add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H)

* refactor ConcatPastToPresent

* DecoderQkvToContext interface

* q,k,v buffer and cache as output

* qk, pv and transctx

* fix compiler error on linux machine

* key_padding_mask

* add test_parity file (not yet runnable)

* add partial unittest

* made partial attributes to inputs

* --gen_doc

* change kernel interface, add more tests

* more parity tests

* fix test

* fix typo

* transpose optimizer has bug. remove it temporarily

* add input shape checks

* add type/shape inference

* fix cache shape check

* fix rocm build failure

* fix rocm build error

* review comments

* review comments
2021-11-19 19:25:36 -08:00
pengwa
6e09fc5152
Implement block wise softmax for reduction dimension > 1024 cases. (#9696)
* implement block wise softmax for reduction dimension > 1024 cases.

* fix builds

* fix

* fix amd build

* fix amd build

* fix win-gpu build

* add tests

* remove cudnn path/add python tests
2021-11-14 11:47:58 +08:00
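
The commit message for #9696 does not describe the kernel itself. As a generic illustration only, the arithmetic of a max-subtracted softmax reduced over fixed-size chunks of a long row can be sketched in Python. The function name `blockwise_softmax` and the `block_size` default are assumptions, and a real CUDA kernel would have its threads cooperate on these reductions rather than loop.

```python
import math


def blockwise_softmax(row, block_size=1024):
    """Softmax of a non-empty list of floats, reduced chunk by chunk."""
    # Split the row into chunks of at most block_size elements.
    blocks = [row[i:i + block_size] for i in range(0, len(row), block_size)]
    # Reduction 1: per-chunk maxima, then the maximum over chunks.
    row_max = max(max(block) for block in blocks)
    # Reduction 2: per-chunk sums of exp(x - max), then the sum over chunks.
    total = sum(sum(math.exp(x - row_max) for x in block) for block in blocks)
    # Normalise every element with the shared max and sum.
    return [math.exp(x - row_max) / total for x in row]
```

Subtracting the row maximum before exponentiating keeps `math.exp` from overflowing on large inputs.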
Jeff Daily
ca7116ca3e
CUDA EP's ResizeImpl now uses functors, hipify for ROCm EP (#9466)
Support for device function pointers is not yet available for ROCm.
Instead, the device function pointers were converted to device functors.
Case statements, lambdas, and macros are used for dispatch; as a result,
all combinations of kernels are compiled with inlined functors. The
basis of this approach can be found in PyTorch.

Lastly, hipify and register Resize and Upsample for ROCm EP.
2021-10-21 15:02:41 -07:00
Jeff Daily
66ceb6926d
rehipify ROCm EP files under orttraining (#9443)
* rehipify rocm ep files under orttraining committed to source control

* fix flake8 error
2021-10-21 13:36:21 -07:00
Jeff Daily
89a22fb641
Add TopK to ROCm EP (#9391)
* Add TopK to ROCm EP

* flake8 fix
2021-10-20 10:39:44 -07:00
Jeff Daily
f8acc6d0e8
Add NonMaxSuppression and RoiAlign to ROCm EP (#9394)
2021-10-20 10:38:45 -07:00
Jeff Daily
c33391329a
Add QuantizeLinear and DequantizeLinear to ROCm EP (#9401)
2021-10-20 10:37:58 -07:00
Jeff Daily
52c53e396d
hipify tensor/gather_nd_impl.cu (#9392)
2021-10-19 14:15:49 -07:00
Jeff Daily
a2ba923ac7
hipify fast_divmod.h (#9400)
2021-10-19 12:34:46 -07:00
Jeff Daily
a8e2e8d76a
hipify tensor/transpose.cc and tensor/transpose.h (#9397)
2021-10-19 12:27:36 -07:00
Jeff Daily
c8789d3047
[ROCm] static re-hipify of CUDA EP to ROCm EP, now a shared provider (#8877)
* re-hipify all rocm EP sources

* fix all other files affected by re-hipify

* add cuda_provider_factory.h to amd_hipify.py

* do not use cudnn_conv_algo_search in ROCm EP, missing reduce min registration

* Fix ReduceConsts template specialization introduced in #9101.

Fixes the error when building for ROCm 4.3.1:

error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0)

* fix flake8 error in amd_hipify.py

* speed up hipify with concurrent.futures

* flake8 fix in amd_hipify.py
2021-10-14 15:15:51 -07:00
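
The "speed up hipify with concurrent.futures" step in #8877 can be illustrated with a small Python sketch. It assumes the per-file conversions are independent of each other. `hipify_text` is a hypothetical stand-in for the real conversion (which the commits above tie to hipify-perl), and `hipify_all` and `max_workers=4` are likewise invented for the example.

```python
import concurrent.futures


def hipify_text(text):
    # Hypothetical stand-in for the real per-file conversion.
    return text.replace("cuda", "hip")


def hipify_all(sources, max_workers=4):
    """Convert a {file name: text} mapping concurrently and return the results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per file and remember which future belongs to which file.
        futures = {pool.submit(hipify_text, text): name
                   for name, text in sources.items()}
        # Collect results as they finish; result() re-raises any worker exception.
        return {futures[done]: done.result()
                for done in concurrent.futures.as_completed(futures)}
```

Threads are a reasonable choice when each task mostly waits on an external process. For CPU-bound pure-Python work, `ProcessPoolExecutor` would be the usual alternative.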
Suffian Khan
47888392ab
Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)
* make work for both rocm 4.2 and rocm 4.3.1

* fix rocm 4.3.1 docker image reference

* fix CUDA_VERSION to ROCM_VERSION

* fix ReduceConsts conflict def

* add ifdef to miopen_common.h as well

* trailing ws
2021-09-19 23:36:03 -07:00
mindest
a71dab691d
Implement BatchNormInternal for cuda (#8172)
* correct batchnorm replacement output order;

remove bn replacement in grad graph builder

* update op defs and kernel class

* implement batch norm internal and grad.

* change saved_var into saved_inv_std

* cuda test case: bn internal

* remove redundant include

* fix comment; add support and UT for 1d input.

* exclude batch_norm_internal in amd_hipify

* run BNInternal UT for CUDA only

* fix CI error

* fix comment errors

* fix error

* add comment for inconsistency with cudnnBN doc

* additional comments for cudnnBN inconsistency
2021-07-28 16:04:49 +08:00
Hariharan Seshadri
5369821ad6
Support SpaceDepth ops in the CUDA and ROCM EPs (#7960)
2021-07-09 01:00:22 -07:00
Weixing Zhang
ca9b3f18e9
Explicitly pass cuda stream to thrust function rather than use cuda default stream implicitly (#7414)
* Pass cuda stream to thrust function to not use default stream.

In the commit 299ace0, ORT has been changed to not use cuda default stream.

* update amd_hipify.py

* remove unnecessary stream sync

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-25 01:18:56 -07:00
Weixing Zhang
8ad5007f8f
Polish Adam kernel (#7294)
* Polish Adam kernel
2021-04-09 01:11:09 -07:00
Tianlei Wu
274e2fea0c
change half gemm to use compute_32f as default (#7253)
change half gemm to use compute_32f as default; add env variable for configuration
2021-04-08 20:54:37 -07:00
Sherlock
4bc17ca04e
CUDA ConvGrad Kernel (#7227)
* ConvGrad CUDA impl

* Set up the test case for Deberta Conv1D

* Add fp16 test

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-06 22:09:06 -07:00
Weixing Zhang
2d352056cf
Support SkipLayerNorm for ROCm EP (#7210)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-02 09:03:30 -07:00
Weixing Zhang
a3f17c8b0d
update lamb and GatherGrad kernel for ROCm EP (#7184)
With ROCm4.1, the CUDA implementation of Lamb and GatherGrad can be
utilized for ROCm EP.
2021-04-02 09:02:49 -07:00
Weixing Zhang
40fa40f3ce
Enable more unit tests for ROCM EP (#6776)
* enable more ops and unit tests for ROCM EP
2021-02-24 15:20:50 -08:00
Tianlei Wu
3bda7f4d36
Fix longformer parity and perf regression (#6760)
* add fast kernel back, update benchmark and conversion scripts
2021-02-19 21:47:36 -08:00
Suffian Khan
105883f4b8
remove longformer_global_impl.cu from hipify (#6716)
2021-02-16 22:26:18 -08:00
Jesse Benson
d18aa45b46
Enable more ROCM ops that are sharing CUDA code. Some are needed for Turing NLG models.
2021-02-06 14:40:34 -08:00
Jesse Benson
d914e29fe1
Reuse reduction_functions.cu
2021-02-04 15:00:05 -08:00
Jesse Benson
86ac11af1a
Delete ROCM-specific reduction code that is identical to CUDA reduction code.
2021-02-04 15:00:05 -08:00
Jesse Benson
196132925e
Reuse CUDA's reduction_functions.cc
2021-02-04 15:00:05 -08:00
Suffian Khan
76bc0e479c
Enable dense sequence optimized version of Pytorch exported BERT-L on AMD GPU (#6504)
* Permit dense seq optimization on BERT-L pytorch export by enabling ReduceSumTraining, Equal, and NonZero on AMD

* enable Equal tests

* enable fast_matrix_reduction test case
2021-01-29 13:12:34 -08:00
RandySheriffH
a19c48f5cb
Fuse cuda conv with activation (#6351)
* optimize cuda conv by fused activation

* remove needless print out

* exclude test from cpu

* handle status error from cudnn 8.x

* add reference to base class

* add hipify
2021-01-29 10:58:10 -08:00
Wei-Sheng Chin
8ce252caa9
Pipeline Parallel Experimental Python API (#5815)
2021-01-15 12:07:28 +08:00
Jesse Benson
fa851bff66
Add workaround to remove ROCm-specific binary-elementwise files.
2021-01-11 10:00:18 -08:00
Suffian Khan
46e0e4e69f
Tune BiasGeluGradDx kernel in approximation mode to avoid tanh(...) on Rocm (#6239)
* bias gelu grad use exp(...) instead

* update cuda to rocm

* missing semicolon

* comment

* remove dockerfile

* missing factor of two
2021-01-02 08:54:16 -08:00
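
The commit message for #6239 does not spell out the rewritten expression. One standard identity that lets a kernel evaluate tanh through a single exp call is tanh(x) = 2 / (1 + exp(-2x)) - 1. The Python check below illustrates that identity only and is not taken from the kernel; `tanh_via_exp` is a name invented for the example.

```python
import math


def tanh_via_exp(x):
    """Evaluate tanh(x) with one exp call: 2 / (1 + exp(-2x)) - 1.

    Note: math.exp raises OverflowError for very negative x (roughly x < -354),
    so this sketch is only meant for moderate inputs.
    """
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


# Compare against the library tanh on a few moderate inputs.
for value in (-5.0, -1.0, 0.0, 0.5, 3.0):
    assert abs(tanh_via_exp(value) - math.tanh(value)) < 1e-12
```

Both factors of two matter: dropping either one gives a different function rather than a rescaled tanh.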
Jesse Benson
7ccdfed1a6
Remove most ROCm-specific element-wise code and reuse CUDA element-wise code.
2020-12-27 10:30:29 -08:00
Weixing Zhang
53307a5f2e
improve perf for softmax (#6128)
* improve perf for both gathergrad and softmax

* revert the change in gathergrad and will be done in another PR.

* address comments from code review.
2020-12-21 14:15:54 -08:00
Tixxx
32c67c2944
Deprecating Horovod and refactored Adasum computations (#5468)
deprecated horovod submodule
refactored adasum logic to be ort-native
added tests for native kernel and e2e tests
2020-12-17 16:21:33 -08:00
Edward Chen
64709b1335
Deprecate Python global configuration functions [Part 1] (#5923)
Enable options to be set via execution provider (EP)-specific options and log deprecation warning from current global configuration functions.
2020-12-15 11:32:43 -08:00
Jesse Benson
a8d549e181
Minor changes to AMD element-wise kernels to converge with CUDA element-wise kernels.
2020-12-15 08:46:36 -08:00
Edward Chen
9810b9e02b
Reduce amount of compiled CUDA device code (#6118)
Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight.

Make corresponding changes for ROCM execution provider code.

Other minor cleanup.
2020-12-14 15:27:40 -08:00
Jesse Benson
cc47cfcb31
Update AMD transpose to match CUDA transpose.
2020-12-09 11:00:18 -08:00
Jesse Benson
14f6eb14b1
Use __launch_bounds__ workaround, rather than limiting threads to 256 on AMD.
2020-12-03 13:06:34 -08:00
Jesse Benson
245d43615d
Fix AMD multi-tensor implementation.
2020-12-03 13:06:34 -08:00
Jesse Benson
bd96f60888
Use CUDA's IsAllFinite kernel for ROCm
2020-11-30 09:24:22 -08:00
Tianlei Wu
31a6be3d67
Add Longformer Attention Cuda Op (#5932)
Limitation: Global tokens must be at the beginning of sequence.
2020-11-25 13:52:10 -08:00
Suffian Khan
4d603e83d7
Remove attention_past.cu and attention_transpose.cu from hipify to fix AMD build (#5921)
* remove attention_transpose.cu and attention_past.cu from hipify

* remove print line

* remove trailing ws for flake test

* fix ws one more time
2020-11-24 20:49:06 -05:00
Weixing Zhang
bb1af718b5
fix build failures due to recent change (858040fa) in CUDA EP (#5736)
Part of the code for the reduction kernels was changed in 858040fa,
which causes failures in the ROCm build because the ROCm EP shares some
code with the CUDA EP. This PR is a quick fix that stops sharing two
files for now, to unblock enabling CI on the ROCm EP. Another PR that
adopts 858040fa for the ROCm EP will follow later.
2020-11-09 08:41:30 -08:00
Weixing Zhang
fff85a6a35
Add GPU kernels for ROCm EP (#5655)
* Add kernels for AMD GPU.

This PR is mostly about GPU kernels for the ROCm EP. Because the GPU programming languages (CUDA and HIP) and the math library calls are similar, one principle in the ROCm EP design is to share CUDA kernels with ROCm as much as possible. To that end, the script amd_hipify.py converts CUDA kernels to ROCm HIP kernels automatically during the compilation phase. For reasons such as perf issues and syntax differences, some converted kernels still need manual intervention; these kernels are checked into the repo for now. To avoid manual intervention, the plan is to refactor the CUDA kernels so that they are as portable as possible between the CUDA EP and the ROCm EP.

Please refer to "HIP Porting Guide" for details.

* Like Lamb, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2: the current AMD GPU compiler has a perf issue with kernel parameters that are structures passed by value.

* Use hipMemsetAsync and add checks on HIP calls.

* move the generated files to build folder.

Co-authored-by: Jesse Benson <jesseb@microsoft.com>
2020-11-06 16:11:06 -08:00
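
The workflow described in #5655 (convert shared CUDA sources automatically, skip the files that are maintained by hand, and write the generated files under the build folder) can be sketched as a planning step in Python. This is an assumption-laden illustration, not the logic of amd_hipify.py: `plan_hipify`, the "cuda" to "rocm" path rewrite, and the example file names are all hypothetical.

```python
import posixpath


def plan_hipify(cuda_files, excluded, build_dir):
    """Map each CUDA source that should be converted to its generated output path.

    cuda_files: relative source paths; excluded: paths kept as hand-written
    ROCm copies; build_dir: root folder for generated files.
    """
    plan = {}
    for path in cuda_files:
        if path in excluded:
            # Hand-maintained file: leave it out of the automatic conversion.
            continue
        # Hypothetical naming rule: mirror the path with "cuda" replaced by "rocm".
        plan[path] = posixpath.join(build_dir, path.replace("cuda", "rocm"))
    return plan
```

For example, `plan_hipify(["cuda/a.cu", "cuda/b.cu"], {"cuda/b.cu"}, "build")` maps only `cuda/a.cu`, to `build/rocm/a.cu`. Several later commits in this history add files to such an exclusion list or take them off it.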