onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-14 20:48:00 +00:00

History

kunal-vaishnavi 901c2bc384 Whisper Model Optimization (#15473 ) ### Description This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper). Some of the added optimizations include: - Pruning of duplicate/unnecessary inputs and outputs - Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source) - Attention fusions - For Whisper's encoder and decoder - Modified symbolic shape inference for present output when no past input exists (for decoder) - Multi-head attention fusions - For Whisper's decoder and decoder with past - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion - Attention kernel changes - CPU: - Different Q and KV sequence lengths - Parallel memset for large sequence lengths - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add - Separate present key-value output into present key and present value (for multi-head attention spec) - CUDA: - Use memory efficient attention compute kernel with present state (for decoder) - Multi-head attention kernel changes - CPU: - Introduction of multi-head attention CPU kernel (previously did not exist) - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past) - Different Q, K, V input shapes - Pass past key and past value directly as key and value - CUDA: - Use memory efficient attention compute kernel with past and/or present state (for decoder with past) ### Usage To use the optimizations, run the ORT transformer optimizer script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention ``` Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum): ``` from transformers.onnx.utils import get_preprocessor from optimum.onnxruntime import ORTModelForSpeechSeq2Seq from optimum.pipelines import pipeline as ort_pipeline import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/ directory = './whisper_opt' # Where the optimized ONNX models are located model_name = 'openai/whisper-tiny' device = 'cpu' # Get pipeline processor = get_preprocessor(model_name) model = ORTModelForSpeechSeq2Seq.from_pretrained( directory, use_io_binding=(device == 'cuda'), provider='CPUExecutionProvider', ).to(device) pipe = ort_pipeline( "automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, device=(-1 if device == 'cpu' else 0), ) # Load audio file and run pipeline audio = whisper.load_audio('tests/jfk.flac') audio = whisper.pad_or_trim(audio) outputs = pipe([audio]) print(outputs) ``` Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes: - https://github.com/huggingface/optimum/pull/872 - https://github.com/huggingface/optimum/pull/920 ### Motivation and Context This PR helps the following issues: - https://github.com/microsoft/onnxruntime/issues/15100 - https://github.com/microsoft/onnxruntime/issues/15235 - https://github.com/huggingface/optimum/issues/869 (work in progress) This PR can be used with the other currently merged Whisper PRs: - https://github.com/microsoft/onnxruntime/pull/15247 - https://github.com/microsoft/onnxruntime/pull/15339 - https://github.com/microsoft/onnxruntime/pull/15362 - https://github.com/microsoft/onnxruntime/pull/15365 - https://github.com/microsoft/onnxruntime/pull/15427 This PR uses changes from the following merged PRs: - https://github.com/microsoft/onnxruntime/pull/14198 - https://github.com/microsoft/onnxruntime/pull/14146 - https://github.com/microsoft/onnxruntime/pull/14201 - https://github.com/microsoft/onnxruntime/pull/14928 (this introduced the new multi-head attention spec)		2023-04-18 17:13:54 -07:00
..
c_cxx	Upgrade doxygen to fix C API docs build issue (#13950 )	2023-02-03 09:43:29 -08:00
execution_providers/images
images
python	Bump ruff in CI (#15533 )	2023-04-17 10:11:44 -07:00
ABI_Dev_Notes.md	skip windows GPU check if changes only in doc (#13248 )	2022-10-11 13:51:44 +08:00
Android_testing.md
C_API_Guidelines.md	Replace 'master' branch ref to 'main' in the code (#12547 )	2022-08-22 10:48:12 -07:00
cmake_guideline.md	fix some typo in docs (#13212 )	2022-10-07 15:58:18 -07:00
Coding_Conventions_and_Standards.md	Adopt linrtunner as the linting tool - take 2 (#15085 )	2023-03-24 15:29:03 -07:00
ContribOperators.md	[DML EP] Add MatMul + SoftMax fusion (#15240 )	2023-04-11 08:31:04 -07:00
FAQ.md	Fix typo enviroment => environment (#13195 )	2022-10-03 17:02:26 -07:00
How_To_Update_ONNX_Dev_Notes.md	Remove exclusions for ONNX model tests that now pass. (#14337 )	2023-01-24 08:04:27 +10:00
Memory_Optimizer.md	Add guidelines for ORTModule (#13553 )	2022-11-04 19:42:10 +08:00
Model_Test.md
NotesOnThreading.md	Replace 'master' branch ref to 'main' in the code (#12547 )	2022-08-22 10:48:12 -07:00
ONNX_Runtime_Server_Usage.md
onnxruntime_dependencies.dot
onnxruntime_dependencies.png
onnxruntime_extensions.md	Fix broken and outdated links in documentation (#14092 )	2023-02-23 10:48:04 -08:00
OperatorKernels.md	Whisper Model Optimization (#15473 )	2023-04-18 17:13:54 -07:00
ORT_Format_Update_in_1.13.md	Update ORT format v5 change docs to cover limited backwards compatibility in 1.14. (#14413 )	2023-01-25 08:23:12 -08:00
ORTMobilePackageOperatorTypeSupport.md	Replace 'master' branch ref to 'main' in the code (#12547 )	2022-08-22 10:48:12 -07:00
ORTModule_Convergence_Notes.md	log level control + fix typos (#15302 )	2023-04-04 20:19:13 +08:00
ORTModule_Training_Guidelines.md	Optimize SCE loss compute (#15401 )	2023-04-13 13:02:12 +08:00
PR_Guidelines.md
Privacy.md	[C# and Python APIs] Expose knobs to enable/disable platform telemetry collection (#5481 )	2020-10-21 10:32:13 -07:00
Python_Dev_Notes.md
Reduced_Operator_Kernel_build.md	replace 'master' branch ref to 'main' for onnx repo (#12678 )	2022-08-30 13:41:42 -07:00
ReleaseManagement.md
Roadmap.md	Replace 'master' branch ref to 'main' in the code (#12547 )	2022-08-22 10:48:12 -07:00
Server.md
TVM_EP.md	Update python 3.11 and remove 3.7 for Linux (#15214 )	2023-03-27 14:46:30 -07:00
Versioning.md	replace 'master' branch ref to 'main' for onnx repo (#12678 )	2022-08-30 13:41:42 -07:00
WinML_principles.md	Replace 'master' branch ref to 'main' in the code (#12547 )	2022-08-22 10:48:12 -07:00