# Whisper Model Optimization (#15473)
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs when exporting with an older official Optimum version, without them when exporting with Optimum from source)
- Attention fusions
   - For Whisper's encoder and decoder
   - Modified symbolic shape inference for the present output when no past input exists (for the decoder)
- Multi-head attention fusions
   - For Whisper's decoder and decoder with past
   - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion (see the sketch after this list)
- Attention kernel changes
   - CPU:
      - Different Q and KV sequence lengths
      - Parallel memset for large sequence lengths
      - Convert the broadcast add after the MatMul of Q and K (add_qk) to an element-wise add
      - Separate the present key-value output into present key and present value (for the multi-head attention spec)
   - CUDA:
      - Use the memory-efficient attention compute kernel with present state (for the decoder)
- Multi-head attention kernel changes
   - CPU:
      - Introduction of a multi-head attention CPU kernel (one did not previously exist)
      - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for the decoder with past)
      - Different Q, K, V input shapes
      - Pass past key and past value directly as key and value
   - CUDA:
      - Use the memory-efficient attention compute kernel with past and/or present state (for the decoder with past)
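To make the packed MatMul item concrete, here is a minimal NumPy sketch (not the actual kernel code; all names and sizes are illustrative) showing that the three separate Q/K/V projections are equivalent to a single MatMul over concatenated weights:

```
import numpy as np

# Illustrative sizes (whisper-tiny uses hidden size 384)
batch, seq_len, hidden = 2, 16, 384
x = np.random.rand(batch, seq_len, hidden)

# Unfused graph: three separate projection weights and three MatMuls
w_q = np.random.rand(hidden, hidden)
w_k = np.random.rand(hidden, hidden)
w_v = np.random.rand(hidden, hidden)
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Fused graph: one MatMul over the packed weight, then a split
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)  # shape (hidden, 3 * hidden)
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=-1)

for a, b in [(q, q2), (k, k2), (v, v2)]:
    assert np.allclose(a, b)
```

Packing the weights this way replaces three smaller GEMMs with one larger one, which generally makes better use of the hardware.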

### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
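The same optimization can also be driven from Python. Below is a minimal sketch, assuming a whisper-tiny model (6 attention heads, hidden size 384), hypothetical filenames, and that the `use_multi_head_attention` fusion option mirrors the CLI flag above:

```
from onnxruntime.transformers import optimizer
from onnxruntime.transformers.fusion_options import FusionOptions

# whisper-tiny sizes; larger Whisper variants use different values
num_heads, hidden_size = 6, 384

options = FusionOptions('bart')
options.use_multi_head_attention = True  # assumed to mirror --use_multi_head_attention

opt_model = optimizer.optimize_model(
    'whisper_decoder.onnx',  # hypothetical input filename
    model_type='bart',
    num_heads=num_heads,
    hidden_size=hidden_size,
    optimization_options=options,
)
opt_model.save_model_to_file('whisper_decoder_opt.onnx', use_external_data_format=True)
```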

Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider=('CUDAExecutionProvider' if device == 'cuda' else 'CPUExecutionProvider'),
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```
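As a quick sanity check that an optimized model still loads and runs, an `onnxruntime.InferenceSession` can be created directly. The input name and shapes below follow Optimum's usual Whisper encoder export and are assumptions:

```
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('./whisper_opt/encoder_model.onnx',
                               providers=['CPUExecutionProvider'])

# Whisper encoders consume 80-channel log-mel features over 3000 frames
features = np.random.rand(1, 80, 3000).astype(np.float32)
outputs = session.run(None, {'input_features': features})
print(outputs[0].shape)  # (1, 1500, 384) for whisper-tiny
```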

Note: To use these changes with Optimum, it is recommended to install Optimum from source so that the following changes are included:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920

### Motivation and Context
This PR helps address the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)

This PR can be used alongside the following already-merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427

This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this PR introduced the new multi-head attention spec)