onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-26 22:35:43 +00:00

Author	SHA1	Message	Date
Ye Wang	89ac61f4d4	support gpt2 model with greedy search (#12068 ) * greedy search gpt2 cpu checkin * add cuda support * add test * provider * update * fix some bugs * refactor impl class * refactor test * remove unused func * refactor parameters class * simplify padding * fix lint warnings * python format * Revert "python format" This reverts commit f25fe1017fa33d960b2418ebbb5dba6a4bd043cf. * python format * fix pipelines * fix pipeline * move bufferallocater to generate_impl_base * review comments(alignment, filename/namespace change) * rebase2 * python reformat * reformat * fix rocm build * review comment * review comments * review comments * fix a bug * rebase test files * python format * format import order * review comments * fix build	2022-07-22 15:45:16 -07:00
Xinya Zhang	03dfcb0e87	[ROCm] Enable int8 for MatMulInteger Op (#11776 )	2022-07-21 11:20:48 -07:00
mindest	add631410a	[ROCm] Re-enable ReduceL1, L2 and related tests (#12209 ) Re-enable ReduceL1,L2 and related tests	2022-07-20 13:13:02 +08:00
zhangyaobit	a9b9c7f69f	Add autotuning support to FastGelu (#12093 ) * Add autotuning for FastGelu (Draft). * Clean up. * delete unused header file * Fix lint errors. * Add missing template parameter. * Improvements. * Fix type. * Fix namespace issue.	2022-07-06 23:17:48 -07:00
Hubert Lu	dbcf54aa41	Add hipified SkipLayerNorm code for ROCmEP (#12107 ) * First attempt for half2 vectorized memory access in SkipLayerNorm * Add some functions for debugging * Clean up the code * Clean up the code * Generalize the vectorized kernels with aligned_vector and remove cudaDeviceProp * Add a unit test for a larger input size * Fix some Lint C++ warnings * Use ILP = 4 for the vectorized kernels * Rewrite the vectorized kernel and templatize ComputeSkipLayerNorm * Use conditional operator for input_v * Refactor LaunchSkipLayerNormKernel and replace the original SkipLayerNormKernelSmall with the vectorized kernel * Clean some comments and rename the layernorm function * Use ComputeSkipLayerNorm to replace LaunchSkipLayerNormKernel * Resolve a Lint C++ warning * Fix SkipLayerNormBatch1_Float16_vec output data * Add hipified code of bert SkipLayerNorm for ROCmEP * Resolve some Lint C++ warnings * Resolve some Lint C++ warnings * Resolve some Lint C++ warnings * Resolve Python formatting issue	2022-07-06 22:13:11 -07:00
Hubert Lu	f4ba199bad	Optimize FastGelu with float2 and float4 vectorized kernels on ROCm (#11491 ) * Using vectorized loads (float2) for fp16 to improve performance * Fix a few warnings from cpplint * Fix a few warnings from cpplint * Use __float2half2_rn and fix some cpplint warnings * Move some computations to LaunchFastGeluKernel * Fix some Lint C++ warning * Using vectorized loads (float4) for fp16 to improve performance * Switch whether to optimize FastGelu with float4 vectorization * Switch to float4 memory access based on input_length in FastGelu * Comment how to set the threshold of float2 and float4 vectorized kernels * Add FastGelu fp16 unit tests for bias_length = 2 and 8 * Make vectorized kernels generic with aligned_vector * Unify the vectorized kernels with/without bias * Refactor the code to suppress cpplint warnings * Solve formatting issues * Remove cudaDeviceProp from FastGeluKernel and LaunchFastGeluKernel * Move fast_gelu_impl.h to rocm/bert * Fix some Lint C++ warnings and code alignment	2022-06-24 12:46:17 -07:00
Justin Chu	fdce4fa6af	Format all python files under onnxruntime with black and isort (#11324 ) Description: Format all python files under onnxruntime with black and isort. After checking in, we can use .git-blame-ignore-revs to ignore the formatting PR in git blame. #11315, #11316	2022-04-26 09:35:16 -07:00
Weixing Zhang	0aaf3a676a	Update reduce norm1/norm2 and layernorm kernels with ROCm 4.3.1 (#9399 ) * update layernorm to reflect the fix in ROCm 4.3.1 * fix UT Co-authored-by: Weixing Zhang <wezhan@microsoft.com> Co-authored-by: Ethan Tao <ettao@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-04-07 22:54:12 -07:00
Tianlei Wu	36c3271546	BeamSearch op cuda (#10556 ) Add BeamSearch cuda implementation with support of fp16 GPT-2 subgraph	2022-02-25 13:08:55 -08:00
zhangyaobit	fd16085cea	Zhanyao/attention (#10545 ) * Enable Attention op for ROCM EP. As a note, potential hipify improvements: (1) handle math constants (attention_softmax.h), (2) correctly generate transpose options for the GEMM helpers, consider counterpart/dummy API for CublasMathModeSetter (attention_impl.cu, attention_impl.cu). After these improvements, we don't need to manually keep copies of the above mentioned files any more. * Clean up debugging code.	2022-02-17 09:02:45 -08:00
Jeff Daily	e7efcc93fe	[ROCm] update hipify-perl location (#10102 ) * [ROCm] update hipify-perl location Depending on the ROCm version installed, hipify-perl might not always live in the hard-coded path of /opt/rocm/bin. Use python 3.3's shutil.which to locate the script. * provide alternative locations for hipify-perl if not in PATH * implement hipify-perl search as a function This avoids running the logic during module import since all builds import the amd_hipify module. * fix flake8 errors	2022-01-06 17:21:02 -08:00
Ye Wang	6856619b18	Decoder Attention CUDA Op (#9792 ) * add kernel interface * register kernel * add self/cross qkv projection without cache * add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H) * refactor ConcatPastToPresent * DecoderQkvToContext interface * q,k,v buffer and cache as output * qk, pv and transctx * fix compiler error on linux machine * key_padding_mask * add test_parity file. However not runnable * add partial unittest * made partial attributes to inputs * --gen_doc * change kernel interface, add more tests * more parity tests * fix test * fix typo * transpose optimizer has bug. remove it temporarily * add input shape checks * add type/shape inference * fix cache shape check * fix rocm build failure * fix rocm build error * review comments * review comments	2021-11-19 19:25:36 -08:00
pengwa	6e09fc5152	Implement block wise softmax for reduction dimension > 1024 cases. (#9696 ) * implement block wise softmax for reduction dimension > 1024 cases. * fix builds * fix * fix amd build * fix amd build * fix win-gpu build * add tests * remove cudnn path/add python tests	2021-11-14 11:47:58 +08:00
Jeff Daily	ca7116ca3e	CUDA EP's ResizeImpl now uses functors, hipify for ROCm EP (#9466 ) Support for device function pointers is not yet available for ROCm. Instead, the device function pointers were converted to device functors. Case statements, lambdas, and macros are used for dispatch; as a result, all combinations of kernels are compiled with inlined functors. The basis of this approach can be found in PyTorch. Lastly, hipify and register Resize and Upsample for ROCm EP.	2021-10-21 15:02:41 -07:00
Jeff Daily	66ceb6926d	rehipify ROCm EP files under orttraining (#9443 ) * rehipify rocm ep files under orttraining committed to source control * fix flake8 error	2021-10-21 13:36:21 -07:00
Jeff Daily	89a22fb641	Add TopK to ROCm EP (#9391 ) * Add TopK to ROCm EP * flake8 fix	2021-10-20 10:39:44 -07:00
Jeff Daily	f8acc6d0e8	Add NonMaxSuppression and RoiAlign to ROCm EP (#9394 )	2021-10-20 10:38:45 -07:00
Jeff Daily	c33391329a	Add QuantizeLinear and DequantizeLinear to ROCm EP (#9401 )	2021-10-20 10:37:58 -07:00
Jeff Daily	52c53e396d	hipify tensor/gather_nd_impl.cu (#9392 )	2021-10-19 14:15:49 -07:00
Jeff Daily	a2ba923ac7	hipify fast_divmod.h (#9400 )	2021-10-19 12:34:46 -07:00
Jeff Daily	a8e2e8d76a	hipify tensor/transpose.cc and tensor/transpose.h (#9397 )	2021-10-19 12:27:36 -07:00
Jeff Daily	c8789d3047	[ROCm] static re-hipify of CUDA EP to ROCm EP, now a shared provider (#8877 ) * re-hipify all rocm EP sources * fix all other files affected by re-hipify * add cuda_provider_factory.h to amd_hipify.py * do not use cudnn_conv_algo_search in ROCm EP, missing reduce min registration * Fix ReduceConsts template specialization introduced in #9101. Fixes the error when building for ROCm 4.3.1: error: too many template headers for onnxruntime::rocm::ReduceConsts<__half>::One (should be 0) * fix flake8 error in amd_hipify.py * speed up hipify with concurrent.futures * flake8 fix in amd_hipify.py	2021-10-14 15:15:51 -07:00
Suffian Khan	47888392ab	Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101 ) * make work for both rocm 4.2 and rocm 4.3.1 * fix rocm 4.3.1 docker image reference * fix CUDA_VERSION to ROCM_VERSION * fix ReduceConsts conflict def * add ifdef to miopen_common.h as well * trailing ws	2021-09-19 23:36:03 -07:00
mindest	a71dab691d	Implement BatchNormInternal for cuda (#8172 ) * correct batchnorm replacement output order; remove bn replacement in grad graph builder * update op defs and kernel class * implement batch norm internal and grad. * change saved_var into saved_inv_std * cuda test case: bn internal * remove redundant include * fix comment; add support and UT for 1d input. * exclude batch_norm_internal in amd_hipify * run BNInternal UT for CUDA only * fix CI error * fix comment errors * fix error * add comment for inconsistency with cudnnBN doc * additional comments for cudnnBN inconsistency	2021-07-28 16:04:49 +08:00
Hariharan Seshadri	5369821ad6	Support SpaceDepth ops in the CUDA and ROCM EPs (#7960 )	2021-07-09 01:00:22 -07:00
Weixing Zhang	ca9b3f18e9	Explicitly pass cuda stream to thrust function rather than use cuda default stream implicitly (#7414 ) * Pass cuda stream to thrust function to not use default stream. In the commit `299ace0`, ORT has been changed to not use cuda default stream. * update amd_hipify.py * remove un-necessary stream sync Co-authored-by: Weixing Zhang <wezhan@microsoft.com>	2021-04-25 01:18:56 -07:00
Weixing Zhang	8ad5007f8f	Polish Adam kernel (#7294 ) * Polish Adam kernel	2021-04-09 01:11:09 -07:00
Tianlei Wu	274e2fea0c	change half gemm to use compute_32f as default (#7253 ) change half gemm to use compute_32f as default; add env variable for configuration	2021-04-08 20:54:37 -07:00
Sherlock	4bc17ca04e	CUDA ConvGrad Kernel (#7227 ) * ConvGrad CUDA impl * Set up the test case for Deberta Conv1D * Add fp16 test Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2021-04-06 22:09:06 -07:00
Weixing Zhang	2d352056cf	Support SkipLayerNorm for ROCm EP (#7210 ) Co-authored-by: Weixing Zhang <wezhan@microsoft.com>	2021-04-02 09:03:30 -07:00
Weixing Zhang	a3f17c8b0d	update lamb and GatherGrad kernel for ROCm EP (#7184 ) With ROCm4.1, the CUDA implementation of Lamb and GatherGrad can be utilized for ROCm EP.	2021-04-02 09:02:49 -07:00
Weixing Zhang	40fa40f3ce	Enable more unit tests for ROCM EP (#6776 ) * enable more ops and unit tests for ROCM EP	2021-02-24 15:20:50 -08:00
Tianlei Wu	3bda7f4d36	Fix longformer parity and perf regression (#6760 ) * add fast kernel back, update benchmark and conversion scripts	2021-02-19 21:47:36 -08:00
Suffian Khan	105883f4b8	remove longformer_global_impl.cu from hipify (#6716 )	2021-02-16 22:26:18 -08:00
Jesse Benson	d18aa45b46	Enable more ROCM ops that are sharing CUDA code. Some are needed for Turing NLG models.	2021-02-06 14:40:34 -08:00
Jesse Benson	d914e29fe1	Reuse reduction_functions.cu	2021-02-04 15:00:05 -08:00
Jesse Benson	86ac11af1a	Delete ROCM-specific reduction code that is identical to CUDA reduction code.	2021-02-04 15:00:05 -08:00
Jesse Benson	196132925e	Reuse CUDA's reduction_functions.cc	2021-02-04 15:00:05 -08:00
Suffian Khan	76bc0e479c	Enable dense sequence optimized version of Pytorch exported BERT-L on AMD GPU (#6504 ) * Permit dense seq optimization on BERT-L pytorch export by enabling ReduceSumTraining, Equal, and NonZero on AMD * enable Equal tests * enable fast_matrix_reduction test case	2021-01-29 13:12:34 -08:00
RandySheriffH	a19c48f5cb	Fuse cuda conv with activation (#6351 ) * optimize cuda conv by fused activation * remove needless print out * exclude test from cpu * handle status error from cudnn 8.x * add reference to base class * add hipify	2021-01-29 10:58:10 -08:00
Wei-Sheng Chin	8ce252caa9	Pipeline Parallel Experimental Python API (#5815 )	2021-01-15 12:07:28 +08:00
Jesse Benson	fa851bff66	Add workaround to remove ROCm-specific binary-elementwise files.	2021-01-11 10:00:18 -08:00
Suffian Khan	46e0e4e69f	Tune BiasGeluGradDx kernel in approximation mode to avoid tanh(...) on Rocm (#6239 ) * bias gelu grad use exp(...) instead * update cuda to rocm * missing semicolon * comment * remove dockerfile * missing factor of two	2021-01-02 08:54:16 -08:00
Jesse Benson	7ccdfed1a6	Remove most ROCm-specific element-wise code and reuse CUDA element-wise code.	2020-12-27 10:30:29 -08:00
Weixing Zhang	53307a5f2e	improve perf for softmax (#6128 ) * improve perf for both gathergrad and softmax * revert the change in gathergrad and will be done in another PR. * address comments from code review.	2020-12-21 14:15:54 -08:00
Tixxx	32c67c2944	Deprecating Horovod and refactored Adasum computations (#5468 ) deprecated horovod submodule refactored adasum logic to be ort-native added tests for native kernel and e2e tests	2020-12-17 16:21:33 -08:00
Edward Chen	64709b1335	Deprecate Python global configuration functions [Part 1] (#5923 ) Enable options to be set via execution provider (EP)-specific options and log deprecation warning from current global configuration functions.	2020-12-15 11:32:43 -08:00
Jesse Benson	a8d549e181	Minor changes to AMD element-wise kernels to converge with CUDA element-wise kernels.	2020-12-15 08:46:36 -08:00
Edward Chen	9810b9e02b	Reduce amount of compiled CUDA device code (#6118 ) Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight. Make corresponding changes for ROCM execution provider code. Other minor cleanup.	2020-12-14 15:27:40 -08:00
Jesse Benson	cc47cfcb31	Update AMD transpose to match CUDA transpose.	2020-12-09 11:00:18 -08:00
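
Commit e7efcc93fe in the log above describes locating hipify-perl with `shutil.which` instead of a hard-coded `/opt/rocm/bin` path, trying alternative locations when the script is not on PATH, and wrapping the search in a function so it does not run at module import. The following is a minimal sketch of that pattern, not the code from amd_hipify.py: the function name `find_hipify_perl` and the `fallback_dirs` parameter are hypothetical, and the only fallback directory taken from the commit message is `/opt/rocm/bin`.

```python
import os
import shutil


def find_hipify_perl(fallback_dirs=("/opt/rocm/bin",)):
    """Return a path to hipify-perl, or None if it cannot be found.

    Hypothetical helper: the name and the fallback_dirs parameter are
    illustrative assumptions, not the actual amd_hipify.py interface.
    """
    # Prefer whatever copy of the script is reachable through PATH.
    found = shutil.which("hipify-perl")
    if found:
        return found

    # Otherwise probe the fallback directories in order.
    for directory in fallback_dirs:
        candidate = os.path.join(directory, "hipify-perl")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate

    return None
```

Because the lookup lives inside a function, importing the module costs nothing on builds that never call it; a caller that needs the tool would invoke `find_hipify_perl()` and handle a `None` result itself.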


57 commits