onnxruntime/onnxruntime
Yufeng Li 90d1f537cb
optimize SLN with large dimension (#18138)
### Description
Optimize SkipLayerNorm for large dimensions (>= 2048) by handling 8
elements per thread. This avoids writing the sum of input, skip, and
bias to main memory and then reloading it. It reduces the latency for
dimension 4096 with a small batch size from ~18us to ~3.8us on A100.
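
As a rough illustration of the idea, here is a minimal CPU-side C++ sketch. It is not the CUDA kernel from this change. Each simulated "thread" loads 8 elements, keeps `input + skip + bias` in a small local array (standing in for registers), and reuses those values for the normalization step instead of storing the sum and reading it back. The names `SkipLayerNormRow`, `kElemsPerThread`, and `regs` are hypothetical, and the single-pass mean/variance formula is an assumption made for brevity.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Elements handled per simulated thread (8, as stated in the description).
constexpr std::size_t kElemsPerThread = 8;

// Hypothetical helper: skip layer norm over a single row.
// out[i] = gamma[i] * (x[i] - mean) / sqrt(var + epsilon) + beta[i],
// where x[i] = input[i] + skip[i] + bias[i].
std::vector<float> SkipLayerNormRow(const std::vector<float>& input,
                                    const std::vector<float>& skip,
                                    const std::vector<float>& bias,
                                    const std::vector<float>& gamma,
                                    const std::vector<float>& beta,
                                    float epsilon) {
  const std::size_t dim = input.size();
  // Assumption of this sketch: the row length is a multiple of 8.
  assert(dim % kElemsPerThread == 0);
  const std::size_t num_threads = dim / kElemsPerThread;

  // One small array per simulated thread, standing in for registers.
  std::vector<std::array<float, kElemsPerThread>> regs(num_threads);

  // Pass 1: load once, keep the sum locally, accumulate statistics.
  float sum = 0.0f;
  float sum_sq = 0.0f;
  for (std::size_t t = 0; t < num_threads; ++t) {
    for (std::size_t i = 0; i < kElemsPerThread; ++i) {
      const std::size_t idx = t * kElemsPerThread + i;
      const float v = input[idx] + skip[idx] + bias[idx];
      regs[t][i] = v;  // kept locally, not written back to a sum buffer
      sum += v;
      sum_sq += v * v;
    }
  }

  const float mean = sum / static_cast<float>(dim);
  const float var = sum_sq / static_cast<float>(dim) - mean * mean;
  const float rstd = 1.0f / std::sqrt(var + epsilon);

  // Pass 2: normalize from the locally held values, with no reload of the sum.
  std::vector<float> out(dim);
  for (std::size_t t = 0; t < num_threads; ++t) {
    for (std::size_t i = 0; i < kElemsPerThread; ++i) {
      const std::size_t idx = t * kElemsPerThread + i;
      out[idx] = (regs[t][i] - mean) * rstd * gamma[idx] + beta[idx];
    }
  }
  return out;
}
```

On a GPU the statistics would come from a block-level reduction across threads rather than a serial loop; the sketch only shows why holding 8 values per thread removes the intermediate store and reload.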

### Motivation and Context
2023-10-30 14:12:17 -07:00
..
contrib_ops optimize SLN with large dimension (#18138) 2023-10-30 14:12:17 -07:00
core [DML EP] Handle non-raw data in dynamic graph compilation (#18160) 2023-10-30 13:48:34 -07:00
python Enable global TRT timing cache (#17865) 2023-10-27 09:23:19 -07:00
test Augment blockwise quantization (#18101) 2023-10-30 09:14:37 -07:00
tool/etw
wasm [js/web/training] Add CreateTrainingSession (#17891) 2023-10-26 09:22:10 -07:00
__init__.py Python API to check whether collective ops are available or not (#17730) 2023-09-29 14:11:05 -07:00
ReformatSource.ps1
ReformatSourcePython.bat
VSCodeCoverage.runsettings