onnxruntime/tools
Hubert Lu f4ba199bad
Optimize FastGelu with float2 and float4 vectorized kernels on ROCm (#11491)
* Using vectorized loads (float2) for fp16 to improve performance

* Fix a few warnings from cpplint

* Fix a few warnings from cpplint

* Use __float2half2_rn and fix some cpplint warnings

* Move some computations to LaunchFastGeluKernel

* Fix some Lint C++ warnings

* Using vectorized loads (float4) for fp16 to improve performance

* Switch whether to optimize FastGelu with float4 vectorization

* Switch to float4 memory access based on input_length in FastGelu

* Comment how to set the threshold of float2 and float4 vectorized kernels

* Add FastGelu fp16 unit tests for bias_length = 2 and 8

* Make vectorized kernels generic with aligned_vector

* Unify the vectorized kernels with/without bias

* Refactor the code to suppress cpplint warnings

* Solve formatting issues

* Remove cudaDeviceProp from FastGeluKernel and LaunchFastGeluKernel

* Move fast_gelu_impl.h to rocm/bert

* Fix some Lint C++ warnings and code alignment
2022-06-24 12:46:17 -07:00
android_custom_build Format all python files under onnxruntime with black and isort (#11324) 2022-04-26 09:35:16 -07:00
ci_build Optimize FastGelu with float2 and float4 vectorized kernels on ROCm (#11491) 2022-06-24 12:46:17 -07:00
doc Format all python files under onnxruntime with black and isort (#11324) 2022-04-26 09:35:16 -07:00
natvis Refactor transformers and other code to reduce memory allocation calls (#10523) 2022-02-24 16:17:14 -08:00
nuget Update DML 1.9 Nuget package to fix WindowsAI nuget pipeline build issue (#11934) 2022-06-21 15:55:51 -07:00
perf_view fix json format (#11046) 2022-03-30 16:15:33 -07:00
python Set black's target version (#11370) 2022-04-27 14:52:19 -07:00