mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-05-31 23:27:43 +00:00
(1) Support the packed QKV format in MultiHeadAttention. This format can avoid the add-bias transpose when the TRT fused kernel is used. (2) Add a cache for the cumulated sequence length computation. For SD, it only needs to be computed once since the sequence length is fixed. (3) Do not allocate a QKV workspace for packed KV or QKV, to save memory. (4) Add unit tests for the packed KV and packed QKV formats in MultiHeadAttention. (5) Mark some fusion options for SD only. Performance tests show a slight improvement on T4: average latency is reduced by 0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5 models, and memory usage drops from 5.1GB to 4.8GB. |
||
|---|---|---|
| .. | ||
| c_cxx | ||
| execution_providers/images | ||
| images | ||
| python | ||
| ABI_Dev_Notes.md | ||
| Android_testing.md | ||
| C_API_Guidelines.md | ||
| cmake_guideline.md | ||
| Coding_Conventions_and_Standards.md | ||
| ContribOperators.md | ||
| FAQ.md | ||
| How_To_Update_ONNX_Dev_Notes.md | ||
| Memory_Optimizer.md | ||
| Model_Test.md | ||
| NotesOnThreading.md | ||
| ONNX_Runtime_Server_Usage.md | ||
| onnxruntime_dependencies.dot | ||
| onnxruntime_dependencies.png | ||
| onnxruntime_extensions.md | ||
| OperatorKernels.md | ||
| ORT_Format_Update_in_1.13.md | ||
| ORTMobilePackageOperatorTypeSupport.md | ||
| ORTModule_Training_Guidelines.md | ||
| PR_Guidelines.md | ||
| Privacy.md | ||
| Python_Dev_Notes.md | ||
| Reduced_Operator_Kernel_build.md | ||
| ReleaseManagement.md | ||
| Roadmap.md | ||
| Server.md | ||
| TVM_EP.md | ||
| Versioning.md | ||
| WinML_principles.md | ||
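
Item (2) of the commit message above describes caching the cumulated sequence length computation, since for SD the sequence length is fixed and the result can be reused. The following is only a minimal Python sketch of that idea, not the actual CUDA implementation in ONNX Runtime. The function name `cumulated_sequence_lengths` and the use of `functools.lru_cache` are assumptions made for illustration.

```python
from functools import lru_cache
from itertools import accumulate


@lru_cache(maxsize=None)
def cumulated_sequence_lengths(batch_size: int, sequence_length: int) -> tuple:
    """Return the offsets (0, S, 2S, ..., B*S) for B sequences of fixed length S.

    Hypothetical helper: the result depends only on (batch_size,
    sequence_length), so it is computed once per distinct pair and then
    served from the cache on later calls.
    """
    return tuple(accumulate([sequence_length] * batch_size, initial=0))


# With a fixed sequence length, every step after the first hits the cache.
first = cumulated_sequence_lengths(2, 4)
second = cumulated_sequence_lengths(2, 4)
```

In this sketch the cache key is the pair of batch size and sequence length, so a workload that changes either value would trigger a fresh computation.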