mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-05-18 21:21:17 +00:00
* Using vectorized loads (float2) for fp16 to improve performance * Fix a few warnings from cpplint * Fix a few warnings from cpplint * Use __float2half2_rn and fix some cpplint warnings * Move some computaions to LaunchFastGeluKernel * Fix some Lint C++ warning * Using vectorized loads (float4) for fp16 to improve performance * Switch whether to optimize FastGelu with float4 vectorization * Switch to float4 memory access based on input_length in FastGelu * Comment how to set the threshold of float2 and float4 vectorized kernels * Add FastGelu fp16 unit tests for bias_length = 2 and 8 * Make vectorized kernels generic with aligned_vector * Unify the vectorized kernels with/without bias * Refactor the code to suppress cpplint warnings * Solve formatting issues * Remove cudaDeviceProp from FastGeluKernel and LaunchFastGeluKernel * Move fast_gelu_impl.h to rocm/bert * Fix some Lint C++ warnings and code alignment |
||
|---|---|---|
| .. | ||
| android_custom_build | ||
| ci_build | ||
| doc | ||
| natvis | ||
| nuget | ||
| perf_view | ||
| python | ||