onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-23 22:13:38 +00:00

Author	SHA1	Message	Date
Weixing Zhang	40fa40f3ce	Enable more unit tests for ROCM EP (#6776) * enable more ops and unit tests for ROCM EP	2021-02-24 15:20:50 -08:00
Tianlei Wu	3bda7f4d36	Fix longformer parity and perf regression (#6760) * add fast kernel back, update benchmark and conversion scripts	2021-02-19 21:47:36 -08:00
Suffian Khan	105883f4b8	remove longformer_global_impl.cu from hipify (#6716)	2021-02-16 22:26:18 -08:00
Jesse Benson	d18aa45b46	Enable more ROCM ops that are sharing CUDA code. Some are needed for Turing NLG models.	2021-02-06 14:40:34 -08:00
Jesse Benson	d914e29fe1	Reuse reduction_functions.cu	2021-02-04 15:00:05 -08:00
Jesse Benson	86ac11af1a	Delete ROCM-specific reduction code that is identical to CUDA reduction code.	2021-02-04 15:00:05 -08:00
Jesse Benson	196132925e	Reuse CUDA's reduction_functions.cc	2021-02-04 15:00:05 -08:00
Suffian Khan	76bc0e479c	Enable dense sequence optimized version of Pytorch exported BERT-L on AMD GPU (#6504) * Permit dense seq optimization on BERT-L pytorch export by enabling ReduceSumTraining, Equal, and NonZero on AMD * enable Equal tests * enable fast_matrix_reduction test case	2021-01-29 13:12:34 -08:00
RandySheriffH	a19c48f5cb	Fuse cuda conv with activation (#6351) * optimize cuda conv by fused activation * remove needless print out * exclude test from cpu * handle status error from cudnn 8.x * add reference to base class * add hipify	2021-01-29 10:58:10 -08:00
Wei-Sheng Chin	8ce252caa9	Pipeline Parallel Experimental Python API (#5815)	2021-01-15 12:07:28 +08:00
Jesse Benson	fa851bff66	Add workaround to remove ROCm-specific binary-elementwise files.	2021-01-11 10:00:18 -08:00
Suffian Khan	46e0e4e69f	Tune BiasGeluGradDx kernel in approximation mode to avoid tanh(...) on Rocm (#6239) * bias gelu grad use exp(...) instead * update cuda to rocm * missing semicolon * comment * remove dockerfile * missing factor of two	2021-01-02 08:54:16 -08:00
Jesse Benson	7ccdfed1a6	Remove most ROCm-specific element-wise code and reuse CUDA element-wise code.	2020-12-27 10:30:29 -08:00
Weixing Zhang	53307a5f2e	improve perf for softmax (#6128) * improve perf for both gathergrad and softmax * revert the change in gathergrad and will be done in another PR. * address comments from code review.	2020-12-21 14:15:54 -08:00
Tixxx	32c67c2944	Deprecating Horovod and refactored Adasum computations (#5468) deprecated horovod submodule refactored adasum logic to be ort-native added tests for native kernel and e2e tests	2020-12-17 16:21:33 -08:00
Edward Chen	64709b1335	Deprecate Python global configuration functions [Part 1] (#5923) Enable options to be set via execution provider (EP)-specific options and log deprecation warning from current global configuration functions.	2020-12-15 11:32:43 -08:00
Jesse Benson	a8d549e181	Minor changes to AMD element-wise kernels to converge with CUDA element-wise kernels.	2020-12-15 08:46:36 -08:00
Edward Chen	9810b9e02b	Reduce amount of compiled CUDA device code (#6118) Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight. Make corresponding changes for ROCM execution provider code. Other minor cleanup.	2020-12-14 15:27:40 -08:00
Jesse Benson	cc47cfcb31	Update AMD transpose to match CUDA transpose.	2020-12-09 11:00:18 -08:00
Jesse Benson	14f6eb14b1	Use __launch_bounds__ workaround, rather than limiting threads to 256 on AMD.	2020-12-03 13:06:34 -08:00
Jesse Benson	245d43615d	Fix AMD multi-tensor implementation.	2020-12-03 13:06:34 -08:00
Jesse Benson	bd96f60888	Use CUDA's IsAllFinite kernel for ROCm	2020-11-30 09:24:22 -08:00
Tianlei Wu	31a6be3d67	Add Longformer Attention Cuda Op (#5932) Limitation: Global tokens must be at the beginning of sequence.	2020-11-25 13:52:10 -08:00
Suffian Khan	4d603e83d7	Remove attention_past.cu and attention_transpose.cu from hipify to fix AMD build (#5921) * remove attention_transpose.cu and attention_past.cu from hipify * remove print line * remove trailing ws for flake test * fix ws onre mor etime	2020-11-24 20:49:06 -05:00
Weixing Zhang	bb1af718b5	fix build failures due to recent change(`858040fa`) in CUDA EP (#5736) Some part of code for reduction kernels has been changed in `858040fa`, which cause failures in rocm build since ROCm EP shares some code with CUDA EP. This PR is to quick fix this failure by not sharing two files for now to unblock CI enabling on ROCm EP. Another PR for leveraging `858040fa` for ROCm EP will be done later.	2020-11-09 08:41:30 -08:00
Weixing Zhang	fff85a6a35	Add GPU kernels for ROCm EP (#5655 ) * Add kernels for AMD GPU. This PR is mostly about GPU kernels for ROCm EP. Due to similar GPU programming language (CUDA and HIP and similar math library calls, one principle in ROCM EP design is to share CUDA kernels as much as possible for ROCm. Thus, the script amd_hipify.py has been created for converting CUDA kernels to ROCm HIP kernels automatically during compilation phase. But, for some reasons such as perf issue, syntax difference..., some converted kernels need some manual intervention. These kernels will be checked in the repo physically for now. In order to avoid manual intervention, the plan is to refactor CUDA kernels to make them portable between CUDA EP and ROCm EP as much as possible. Please refer to "HIP Porting Guide" for details. * like lamb, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2, current AMD GPU compiler has perf issue for kernel parameter which is a structure with "pass by value". * Use hipMemsetAsync and add checks on HIP calls. * move the generated files to build folder. Co-authored-by: Jesse Benson <jesseb@microsoft.com>	2020-11-06 16:11:06 -08:00

26 commits