Commit graph

57 commits

Author SHA1 Message Date
Ye Wang
89ac61f4d4
support gpt2 model with greedy search (#12068)
* greedy search gpt2 cpu checkin

* add cuda support

* add test

* provider

* update

* fix some bugs

* refactor impl class

* refactor test

* remove unused func

* refactor parameters class

* simplify padding

* fix lint warnings

* python format

* Revert "python format"

This reverts commit f25fe1017fa33d960b2418ebbb5dba6a4bd043cf.

* python format

* fix pipelines

* fix pipeline

* move bufferallocater to generate_impl_base

* review comments(alignment, filename/namespace change)

* rebase2

* python reformat

* reformat

* fix rocm build

* review comment

* review comments

* review comments

* fix a bug

* rebase test files

* python format

* format import order

* review comments

* fix build
2022-07-22 15:45:16 -07:00
Xinya Zhang
03dfcb0e87
[ROCm] Enable int8 for MatMulInteger Op (#11776) 2022-07-21 11:20:48 -07:00
mindest
add631410a
[ROCm] Re-enable ReduceL1, L2 and related tests (#12209)
Re-enable ReduceL1,L2 and related tests
2022-07-20 13:13:02 +08:00
zhangyaobit
a9b9c7f69f
Add autotuning support to FastGelu (#12093)
* Add autotuning for FastGelu (Draft).

* Clean up.

* delete unused header file

* Fix lint errors.

* Add missing template parameter.

* Improvements.

* Fix type.

* Fix namespace issue.
2022-07-06 23:17:48 -07:00
Hubert Lu
dbcf54aa41
Add hipified SkipLayerNorm code for ROCmEP (#12107)
* First attempt for half2 vectorized memory access in SkipLayerNorm

* Add some functions for debugging

* Clean up the code

* Clean up the code

* Generalize the vectorized kernels with aligned_vector and remove cudaDeviceProp

* Add a unit test for a larger input size

* Fix some Lint C++ warnings

* Use ILP = 4 for the vectorized kernels

* Rewrite the vectorized kernel and templatize ComputeSkipLayerNorm

* Use conditional operator for input_v

* Refactor LaunchSkipLayerNormKernel and replace the original SkipLayerNormKernelSmall with the vectorized kernel

* Clean some comments and rename the layernorm function

* Use ComputeSkipLayerNorm to replace LaunchSkipLayerNormKernel

* Resolve a Lint C++ warning

* Fix SkipLayerNormBatch1_Float16_vec output data

* Add hipified code of bert SkipLayerNorm for ROCmEP

* Resolve some Lint C++ warnings

* Resolve some Lint C++ warnings

* Resolve some Lint C++ warnings

* Resolve Python formatting issue
2022-07-06 22:13:11 -07:00
Hubert Lu
f4ba199bad
Optimize FastGelu with float2 and float4 vectorized kernels on ROCm (#11491)
* Using vectorized loads (float2) for fp16 to improve performance

* Fix a few warnings from cpplint

* Fix a few warnings from cpplint

* Use __float2half2_rn and fix some cpplint warnings

* Move some computaions to LaunchFastGeluKernel

* Fix some Lint C++ warning

* Using vectorized loads (float4) for fp16 to improve performance

* Switch   whether to optimize FastGelu with float4 vectorization

* Switch to float4 memory access based on input_length in FastGelu

* Comment how to set the threshold of float2 and float4 vectorized kernels

* Add FastGelu fp16 unit tests for bias_length = 2 and 8

* Make vectorized kernels generic with aligned_vector

* Unify the vectorized kernels with/without bias

* Refactor the code to suppress cpplint warnings

* Solve formatting issues

* Remove cudaDeviceProp from FastGeluKernel and LaunchFastGeluKernel

* Move fast_gelu_impl.h to rocm/bert

* Fix some Lint C++ warnings and code alignment
2022-06-24 12:46:17 -07:00
Justin Chu
fdce4fa6af
Format all python files under onnxruntime with black and isort (#11324)
Description: Format all python files under onnxruntime with black and isort.

After checking in, we can use .git-blame-ignore-revs to ignore the formatting PR in git blame.

#11315, #11316
2022-04-26 09:35:16 -07:00
Weixing Zhang
0aaf3a676a
Update reduce norm1/norm2 and layernorm kernels with ROCm 4.3.1 (#9399)
* update layernorm to reflect the fix in ROCm 4.3.1

* fix UT

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-04-07 22:54:12 -07:00
Tianlei Wu
36c3271546
BeamSearch op cuda (#10556)
Add BeamSearch cuda implementation with support of fp16 GPT-2 subgraph
2022-02-25 13:08:55 -08:00
zhangyaobit
fd16085cea
Zhanyao/attention (#10545)
* Enable Attention op for ROCM EP.

As a note, potential hipify improvements: (1) handle math
contants (attention_softmax.h), (2) correctly generate transpose
options for the GEMM helpers, consider counterpart/dummy API for
CublasMathModeSetter (attention_impl.cu, attention_impl.cu). After
these improvements, we don't need to manually keep copies of the
above mentioned files any more.

* Clean up debugging code.
2022-02-17 09:02:45 -08:00
Jeff Daily
e7efcc93fe
[ROCm] update hipify-perl location (#10102)
* [ROCm] update hipify-perl location

Depending on the ROCm version installed, hipify-perl might not always
live in the hard-coded path of /opt/rocm/bin. Use python 3.3's
shutil.which to locate the script.

* provide alternative locations for hipify-perl if not in PATH

* implement hipify-perl search as a function

This avoids running the logic during module import since all builds
import the amd_hipify module.

* fix flake8 errors
2022-01-06 17:21:02 -08:00
Ye Wang
6856619b18
Decoder Attention CUDA Op (#9792)
* add kernel interface

* register kernel

* add self/cross qkv projection without cache

* add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H)

* refactor ConcatPastToPresent

* DecoderQkvToContext interface

* q,k,v buffer and cache as output

* qk, pv and transctx

* fix compiler error on linux machine

* key_padding_mask

* add test_parity file. However not runnable

* add partial unittest

* made partial attributes to inputs

* --gen_doc

* change kernel interface, add more tests

* morre parity tests

* fix test

* fix typo

* transpose optimizer has bug. remove it temporarily

* add input shape checks

* add type/shape inference

* fix cache shape check

* fix rocm build failure

* fix rocm build error

* review comments

* review comments
2021-11-19 19:25:36 -08:00
pengwa
6e09fc5152
Implement block wise softmax for reduction dimention > 1024 cases. (#9696)
* implement block wise softmax for reduction dimention > 1024 cases.

* fix builds

* fix

* fix amd build

* fix amd build

* fix win-gpu build

* add tests

* remove cudnn path/add python tests
2021-11-14 11:47:58 +08:00
Jeff Daily
ca7116ca3e
CUDA EP's ResizeImpl now uses functors, hipify for ROCm EP (#9466)
Support for device function pointers is not yet available for ROCm.
Instead, the device function pointers were converted to device functors.
Case statements, lambdas, and macros are used for dispatch; as a result,
all combinations of kernels are compiled with inlined functors. The
basis of this approach can be found in PyTorch.

Lastly, hipify and register Resize and Upsample for ROCm EP.
2021-10-21 15:02:41 -07:00
Jeff Daily
66ceb6926d
rehipify ROCm EP files under orttraining (#9443)
* rehipify rocm ep files under orttraining committed to source control

* fix flake8 error
2021-10-21 13:36:21 -07:00
Jeff Daily
89a22fb641
Add TopK to ROCm EP (#9391)
* Add TopK to ROCm EP

* flake8 fix
2021-10-20 10:39:44 -07:00
Jeff Daily
f8acc6d0e8
Add NonMaxSuppression and RoiAlign to ROCm EP (#9394) 2021-10-20 10:38:45 -07:00
Jeff Daily
c33391329a
Add QuantizeLinear and DequantizeLinear to ROCm EP (#9401) 2021-10-20 10:37:58 -07:00
Jeff Daily
52c53e396d
hipify tensor/gather_nd_impl.cu (#9392) 2021-10-19 14:15:49 -07:00
Jeff Daily
a2ba923ac7
hipify fast_divmod.h (#9400) 2021-10-19 12:34:46 -07:00
Jeff Daily
a8e2e8d76a
hipify tensor/transpose.cc and tensor/transpose.h (#9397) 2021-10-19 12:27:36 -07:00
Jeff Daily
c8789d3047
[ROCm] static re-hipify of CUDA EP to ROCm EP, now a shared provider (#8877)
* re-hipify all rocm EP sources

* fix all other files affected by re-hipify

* add cuda_provider_factory.h to amd_hipify.py

* do not use cudnn_conv_algo_search in ROCm EP, missing reduce min registration

* Fix ReduceConsts template specialization introduced in #9101.

Fixes the error when building for ROCm 4.3.1:

error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0)

* fix flake8 error in amd_hipify.py

* speed up hipify with concurrent.futures

* flake8 fix in amd_hipify.py
2021-10-14 15:15:51 -07:00
Suffian Khan
47888392ab
Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)
* make work for both rocm 4.2 and rocm 4.3.1

* fix rocm 4.3.1 docker image reference

* fix CUDA_VERSION to ROCM_VERSION

* fix ReduceConsts conflict def

* add ifdef to miopen_common.h as well

* trailing ws
2021-09-19 23:36:03 -07:00
mindest
a71dab691d
Implement BatchNormInternal for cuda (#8172)
* correct batchnorm replacement output order;

remove bn replacement in grad graph builder

* update op defs and kernel class

* implement batch norm internal and grad.

* change saved_var into saved_inv_std

* cuda test case: bn internal

* remove redundant include

* fix comment; add support and UT for 1d input.

* exclude batch_norm_internal in amd_hipify

* run BNInternal UT for CUDA only

* fix CI error

* fix comment errors

* fix error

* add comment for inconsistency with cudnnBN doc

* additional comments for cudnnBN inconsistency
2021-07-28 16:04:49 +08:00
Hariharan Seshadri
5369821ad6
Support SpaceDepth ops in the CUDA and ROCM EPs (#7960) 2021-07-09 01:00:22 -07:00
Weixing Zhang
ca9b3f18e9
Explicitly pass cuda stream to thrust function rather than use cuda default stream implicitly (#7414)
* Pass cuda stream to thrust function to not use default stream.

In the commit 299ace0, ORT has been changed to not use cuda default stream.

* update amd_hipify.py

* remove un-necessary stream sync

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-25 01:18:56 -07:00
Weixing Zhang
8ad5007f8f
Polish Adam kernel (#7294)
* Polish Adam kernel
2021-04-09 01:11:09 -07:00
Tianlei Wu
274e2fea0c
change half gemm to use compute_32f as default (#7253)
change half gemm to use compute_32f as default; add env variable for configuration
2021-04-08 20:54:37 -07:00
Sherlock
4bc17ca04e
CUDA ConvGrad Kernel (#7227)
* ConvGrad CUDA impl

* Set up the test case for Deberta Conv1D

* Add fp16 test

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-04-06 22:09:06 -07:00
Weixing Zhang
2d352056cf
Support SkipLayerNorm for ROCm EP (#7210)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-04-02 09:03:30 -07:00
Weixing Zhang
a3f17c8b0d
update lamb and GatherGrad kernel for ROCm EP (#7184)
With ROCm4.1, the CUDA implementation of Lamb and GatherGrad can be
utilized for ROCm EP.
2021-04-02 09:02:49 -07:00
Weixing Zhang
40fa40f3ce
Enable more unit tests for ROCM EP (#6776)
* enable more ops and unit tests for ROCM EP
2021-02-24 15:20:50 -08:00
Tianlei Wu
3bda7f4d36
Fix longformer parity and perf regression (#6760)
* add fast kernel back, update benchmark and conversion scripts
2021-02-19 21:47:36 -08:00
Suffian Khan
105883f4b8
remove longformer_global_impl.cu from hipify (#6716) 2021-02-16 22:26:18 -08:00
Jesse Benson
d18aa45b46 Enable more ROCM ops that are sharing CUDA code. Some are needed for Turing NLG models. 2021-02-06 14:40:34 -08:00
Jesse Benson
d914e29fe1 Reuse reduction_functions.cu 2021-02-04 15:00:05 -08:00
Jesse Benson
86ac11af1a Delete ROCM-specific reduction code that is identical to CUDA reduction code. 2021-02-04 15:00:05 -08:00
Jesse Benson
196132925e Reuse CUDA's reduction_functions.cc 2021-02-04 15:00:05 -08:00
Suffian Khan
76bc0e479c
Enable dense sequence optimized version of Pytorch exported BERT-L on AMD GPU (#6504)
* Permit dense seq optimization on BERT-L pytorch export by enabling ReduceSumTraining, Equal, and NonZero on AMD

* enable Equal tests

* enable fast_matrix_reduction test case
2021-01-29 13:12:34 -08:00
RandySheriffH
a19c48f5cb
Fuse cuda conv with activation (#6351)
* optimize cuda conv by fused activation

* remove needless print out

* exclude test from cpu

* handle status error from cudnn 8.x

* add reference to base class

* add hipify
2021-01-29 10:58:10 -08:00
Wei-Sheng Chin
8ce252caa9
Pipeline Parallel Experimental Python API (#5815) 2021-01-15 12:07:28 +08:00
Jesse Benson
fa851bff66 Add workaround to remove ROCm-specific binary-elementwise files. 2021-01-11 10:00:18 -08:00
Suffian Khan
46e0e4e69f
Tune BiasGeluGradDx kernel in approximation mode to avoid tanh(...) on Rocm (#6239)
* bias gelu grad use exp(...) instead

* update cuda to rocm

* missing semicolon

* comment

* remove dockerfile

* missing factor of two
2021-01-02 08:54:16 -08:00
Jesse Benson
7ccdfed1a6 Remove most ROCm-specific element-wise code and reuse CUDA element-wise code. 2020-12-27 10:30:29 -08:00
Weixing Zhang
53307a5f2e
improve perf for softmax (#6128)
* improve perf for both gathergrad and softmax

* revert the change in gathergrad and will be done in another PR.

* address comments from code review.
2020-12-21 14:15:54 -08:00
Tixxx
32c67c2944
Deprecating Horovod and refactored Adasum computations (#5468)
deprecated horovod submodule
refactored adasum logic to be ort-native
added tests for native kernel and e2e tests
2020-12-17 16:21:33 -08:00
Edward Chen
64709b1335
Deprecate Python global configuration functions [Part 1] (#5923)
Enable options to be set via execution provider (EP)-specific options and log deprecation warning from current global configuration functions.
2020-12-15 11:32:43 -08:00
Jesse Benson
a8d549e181 Minor changes to AMD element-wise kernels to converge with CUDA element-wise kernels. 2020-12-15 08:46:36 -08:00
Edward Chen
9810b9e02b
Reduce amount of compiled CUDA device code (#6118)
Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight.

Make corresponding changes for ROCM execution provider code.

Other minor cleanup.
2020-12-14 15:27:40 -08:00
Jesse Benson
cc47cfcb31 Update AMD transpose to match CUDA transpose. 2020-12-09 11:00:18 -08:00