onnxruntime/onnxruntime
Vincent Wang 6c63c1c9ee
Multiple Gather to Split Fusion (#13095)
For the code below, found in some transformer models:
```
fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads, 3, self.head_dim)
return fused_qkv[..., 0, :], fused_qkv[..., 1, :], fused_qkv[..., 2, :]
```

The exported graph will contain 3 Gather nodes, and ORT's GatherGrad
CUDA implementation is currently slow. This pattern can be fused into
a single Split, so that fewer kernels are launched for the computation.
Split/Concat (Concat for the grad) also performs better than
Gather/GatherGrad.
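
The equivalence behind the fusion can be sketched with NumPy standing in for the ONNX ops. This is only an illustration, not ORT's implementation: the function names (`gather_qkv`, `split_qkv`, `gather_grad`, `concat_grad`) and the tensor sizes are hypothetical. It shows that three index selections on the second-to-last axis give the same outputs as one split followed by a squeeze, and that the gradient of the split form is a single concatenation (written here as a stack) instead of three scatter-style accumulations.

```python
import numpy as np


def gather_qkv(fused):
    # Three separate index selections; each one exports as a Gather node.
    return fused[..., 0, :], fused[..., 1, :], fused[..., 2, :]


def split_qkv(fused):
    # One split along the axis of size 3, then drop that axis from each piece.
    parts = np.split(fused, 3, axis=-2)
    return tuple(np.squeeze(p, axis=-2) for p in parts)


def gather_grad(shape, index, d_out):
    # Gradient of one selection: scatter d_out into a zero tensor of the input shape.
    grad = np.zeros(shape, dtype=d_out.dtype)
    grad[..., index, :] += d_out
    return grad


def concat_grad(d_q, d_k, d_v):
    # Gradient of the split form: put the three gradients back side by side.
    return np.stack([d_q, d_k, d_v], axis=-2)


# Hypothetical sizes, chosen only for the illustration.
batch_size, seq_length, num_heads, head_dim = 2, 4, 3, 5
rng = np.random.default_rng(0)
fused_qkv = rng.standard_normal((batch_size, seq_length, num_heads * 3 * head_dim))
fused_qkv = fused_qkv.reshape(batch_size, seq_length, num_heads, 3, head_dim)

q1, k1, v1 = gather_qkv(fused_qkv)
q2, k2, v2 = split_qkv(fused_qkv)

# Upstream gradients for the three outputs.
d_q, d_k, d_v = (rng.standard_normal(q1.shape) for _ in range(3))
grad_from_gathers = sum(
    gather_grad(fused_qkv.shape, i, d) for i, d in enumerate((d_q, d_k, d_v))
)
grad_from_concat = concat_grad(d_q, d_k, d_v)
```

Both forward paths produce identical tensors, and both gradient paths produce the same tensor of the fused input's shape; the fused form needs one pass instead of three.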

In a real example, one GatherGrad takes 15ms and there are 3 in each
layer of the graph; after the fusion, one Concat takes only 35us.
The total time of a step improves from 1.5s to 0.4s.
2022-09-29 11:09:57 +08:00
..
contrib_ops [ROCm] add SkipLayerNorm vectorize Regular case (#12821) 2022-09-27 12:52:10 -07:00
core Multiple Gather to Split Fusion (#13095) 2022-09-29 11:09:57 +08:00
gsl
python Allow fastgelu/skiplayernorm profile by pass args from commandline (#13025) 2022-09-28 15:48:59 -07:00
test Multiple Gather to Split Fusion (#13095) 2022-09-29 11:09:57 +08:00
tool/etw
wasm [js/web][Fix] - updating the C API to catch non-tensor data (#12811) 2022-09-21 13:59:17 -07:00
__init__.py Bump ort version number (#11948) 2022-07-22 12:55:53 -07:00
ReformatSource.ps1
ReformatSourcePython.bat Add python docstring linting in vscode settings (#11316) 2022-04-23 06:23:04 -07:00
VSCodeCoverage.runsettings