onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-23 19:32:23 +00:00

Vincent Wang 8b0669bf63 QuickGelu Fusion (#12417)		2022-10-28 18:12:07 +08:00

Some models use QuickGelu(x) = x * sigmoid(1.702x), which takes 3 ops in the forward pass and 5 ops in the backward pass. This PR fuses them into a single op named QuickGelu and its gradient op QuickGeluGrad.

For CUDA, tested on a V100 using an input tensor with shape [64,128,2048] and float16 type:

Before, FW takes 335us, BW takes 614us.
![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png)

After, FW takes 115us, BW takes 139us, roughly 2.9x and 4.4x faster.
![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png)

For the CPU kernel, using the same shape and float type:

Before, FW takes about 10ms and BW about 49ms (sums of the per-op timings below).
Forward:
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Backward:
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]

After, FW takes about 4ms and BW about 5ms, roughly 2.6x and 9.6x faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
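To make the fused computation concrete, here is a minimal scalar sketch of what a single QuickGelu op and its gradient compute, derived only from the formula QuickGelu(x) = x * sigmoid(1.702x) in the commit message. The function names, the `alpha` parameter name, and the `(dy, x)` argument order are assumptions for illustration; this is not the ONNX Runtime kernel code.

```python
import math

ALPHA = 1.702  # scaling constant from the commit message


def sigmoid(z):
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def quick_gelu(x, alpha=ALPHA):
    # Forward: one fused expression instead of Mul -> Sigmoid -> Mul.
    return x * sigmoid(alpha * x)


def quick_gelu_grad(dy, x, alpha=ALPHA):
    # Backward: d/dx [x * s(alpha*x)] = s + alpha * x * s * (1 - s),
    # scaled by the incoming gradient dy.
    s = sigmoid(alpha * x)
    return dy * (s + alpha * x * s * (1.0 - s))


if __name__ == "__main__":
    for x in (-2.0, 0.0, 0.5, 3.0):
        print(x, quick_gelu(x), quick_gelu_grad(1.0, x))
```

Computing the sigmoid once and reusing it in the backward expression is what lets the five unfused backward ops collapse into a single pass over the data.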
..
orttraining	QuickGelu Fusion (#12417)	2022-10-28 18:12:07 +08:00
pytorch_frontend_examples	Set black's target version (#11370)	2022-04-27 14:52:19 -07:00
tools	[ROCm] Fix azcopy issue on ROCm ci pipeline (#13365)	2022-10-20 12:08:57 +08:00