onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-19 02:03:52 +00:00

Author	SHA1	Message	Date
Tianlei Wu	77b45c6503	Add Stable Diffusion Benchmark on A100-PCIE-80GB (#16702 ) (1) Fix a bug in https://github.com/microsoft/onnxruntime/pull/16560: the fp16 flag must be set for UNet. (2) Remove wget from requirements since it is no longer needed. (3) Add benchmark numbers for A100-PCIE-80GB. Note that the CUDA EP has an issue running with batch size 4, so that number is not added.	2023-07-14 10:37:00 -07:00
mindest	810512c658	[ROCm] TunableOp: add hipBLASLt tuning logic (#16338 ) ### Description - Add hipBLASLt tuning logic in place of default hipBLASLt implementation; - add kernel explorer for hipBLASLt. related operators: Gemm, StridedBatchedGemm, and GemmFastGelu. Temporarily mark algos that require extra workspace as unsupported. Will add workspace support in later PR, which will change Gemm Params def and affect multiple files.	2023-07-14 08:20:58 +08:00
Vincent Wang	c07a3b869c	Triton Codegen for ORTModule (#15831 ) Fuse connected elementwise and reduce Ops to TritonOp and codegen triton code to run the kernel. This PR is co-edited by @wejoncy and @er3x3	2023-07-13 18:17:58 +08:00
mindest	b7fd5af48b	[ROCm] TunableOp: Update rocBLAS get_solutions API (since ROCm5.6) (#16657 ) ### Description - Update existing rocBLAS get_solutions API using `*_get_solutions_by_type` (supported from ROCm5.6); remove the original nested TunableOp logic. - Update kernel_explorer.	2023-07-13 11:20:26 +08:00
PeixuanZuo	ebc311365b	[ROCm] Optimize ROCm CI to reduce time (#16620 ) This PR mainly optimizes the ROCm CI tests to reduce run time and CPU utilization. - use a smaller batch size in the strided_batched_gemm/batched_gemm tests - disable the CPU training test - fix occasional test_e2e_padding_elimination failures on ROCm.	2023-07-13 10:58:03 +08:00
Ye Wang	dd7d721f3c	support rotary embeddings in decoder masked self-attention (#16556 ) ### Description This PR adds support for rotary embeddings in decoder masked self-attention. --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-07-12 13:48:48 -07:00
Tianlei Wu	2de5807703	Attention fusion for UNet onnx model export from PyTorch 2.* (#16629 ) ### Description Tested with stable diffusion unet models exported by pytorch nightly. Example to run: ``` cd onnxruntime/python/tools/transformers/ python optimizer.py --input unet.onnx --output unet_fp16.onnx --model_type unet --float16 --opt_level 0 ```	2023-07-11 14:35:48 -07:00
Justin Chu	ad994565ae	Fix type annotation for InferenceSession (#16632 ) The Sequence should have been annotated to take a Union type; otherwise the annotation would be invalid.	2023-07-11 06:34:22 -07:00
mindest	347c963d5c	[ROCm] Add ROCm Triton TunableOp for GroupNorm (#16196 ) ### Description - Refactor existing Triton TunableOp-related code (based on work in #15862) - Add GroupNorm Triton implementation	2023-07-11 13:55:30 +08:00
Tianlei Wu	b8f6235f11	Update stable diffusion benchmark for TensorRT EP (#16560 ) ### Description Add Stable Diffusion Text2Image pipelines of TensorRT EP and CUDA EP. They can automatically export and optimize ONNX model, and create ONNXRuntime session to use TensorRT EP or CUDA execution provider. Add support for benchmarking TensorRT. Add support of cuda graph. The feature is only supported in nightly package right now. Engine/Provider to test \| command line ---- \| --- CUDA EP \| `python benchmark.py -v 1.5` CUDA EP with cuda graph \| `python benchmark.py -v 1.5 --enable_cuda_graph` TensorRT EP \| `python benchmark.py -v 1.5 -r tensorrt` TensorRT EP with cuda graph \| `python benchmark.py -v 1.5 -r tensorrt --enable_cuda_graph` TensorRT \| `python benchmark.py -v 1.5 -e tensorrt` Add benchmark numbers of T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch 1.13.1+cu11.7, TensorRT 8.6.1, onnxruntime-gpu 1.15.1 (or ort-nightly-gpu 1.16 for cuda graph). TODO: add benchmark numbers of A100-80GB	2023-07-10 09:51:03 -07:00
PeixuanZuo	3b729e5d2f	[ROCm] use cupy for GPU-accelerated computing (#16611 ) Kernel explorer has many tests that need numpy to verify the results of GPU kernels, which makes CPU utilization very high. This PR uses `cupy` in place of `numpy` so that the verification is computed on the GPU, reducing CPU utilization. Set `KERNEL_EXPLORER_TEST_USE_CUPY=1` to enable cupy.	2023-07-10 17:17:39 +08:00
PeixuanZuo	2f56815344	[Whisper] Fix provider in export process (#16545 ) The type of the parameter `provider` is string.	2023-07-10 11:55:36 +08:00
cloudhan	01c5d05712	Avoid repeated GemmSoftmaxGemmPermuteTunableOp<HipT> ctor invocation (#16518 ) The `GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct. Avoiding the repeated ctor invocation substantially improves the launch time and gives better performance during decoding. This gets a <7% e2e time reduction for whisper large.	2023-07-08 00:25:07 +08:00
Edward Chen	6be7b03e53	Enable `-Wshorten-64-to-32` warning if available. (#16524 ) - Fix some warnings from Xcode build (`-Wshorten-64-to-32`). - Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet. - Some clean up in build.py including setting CMake generator more consistently.	2023-07-07 08:11:44 -07:00
petermcaughan	47f136e2d3	Speed Up Whisper Export (#16504 ) ### Description Add a greedy option to the initializer deduplication process in the Whisper export. Currently, to detect shared initializers, ORT compares every initializer against every other initializer (n^2). In the comparison operator, if the two initializers have different data types (e.g. raw_data and int_64), both initializers are converted to a numpy array and the cast result is compared. This cast happens in every such comparison, so its cost is multiplied across the n^2 comparisons and dominates the runtime of finding shared initializers. This cast operation is the bottleneck of the current Whisper export script. The conversion to a numpy array is useful for detecting equal initializer values across nodes of different data types (e.g. recognizing that a bias value of 0.0 is the same as a slice index of 0) but isn't triggered when comparing initializers of the same data type (e.g. weight value of 0.6 == weight value of 0.6). The latter case is where the majority of the utility is for Whisper, so by eliminating the path that compares numpy arrays for initializers we save a lot of time at minimal cost. In other words, this PR adds an option to remove the ability to detect shared initializers of different types (e.g. Slice Index and MatMul Constant) while retaining the ability to deduplicate weights. ### Motivation and Context - Current time to export Whisper-large is prohibitive. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-06-30 12:22:30 -07:00
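The same-type-only deduplication described in the commit above can be illustrated with a minimal sketch. It is not the code from the PR: the `Initializer` tuple is a hypothetical stand-in for an ONNX initializer, and the sketch swaps in hash-based bucketing instead of the pairwise comparison the commit describes. What it has in common with the greedy option is that two initializers only match when their dtype, shape and bytes are identical, so no cross-type cast is performed.

```python
from typing import Dict, List, NamedTuple, Tuple


class Initializer(NamedTuple):
    """Hypothetical stand-in for an ONNX initializer (not the real TensorProto)."""
    name: str
    dtype: str
    shape: Tuple[int, ...]
    raw: bytes


def find_shared_greedy(inits: List[Initializer]) -> Dict[str, str]:
    """Map each duplicate initializer name to the first identical initializer.

    Two initializers are equal only if dtype, shape and raw bytes all match,
    so values are never cast to a common type before comparison.
    """
    first_seen: Dict[Tuple[str, Tuple[int, ...], bytes], str] = {}
    replacements: Dict[str, str] = {}
    for init in inits:
        key = (init.dtype, init.shape, init.raw)
        if key in first_seen:
            replacements[init.name] = first_seen[key]
        else:
            first_seen[key] = init.name
    return replacements


example = [
    Initializer("weight_a", "float32", (1,), b"\x9a\x99\x19\x3f"),
    Initializer("weight_b", "float32", (1,), b"\x9a\x99\x19\x3f"),  # same bytes as weight_a
    Initializer("bias_zero", "float32", (1,), b"\x00\x00\x00\x00"),
    Initializer("slice_zero", "int64", (1,), b"\x00" * 8),  # equal value, other dtype
]
shared = find_shared_greedy(example)
```

In this example `weight_b` is mapped to `weight_a`, while `bias_zero` and `slice_zero` stay separate even though both hold a zero. That is the trade-off the commit accepts.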
Mike Guo	aeaa1d650f	make optimized_model_path be in temp folder instead of source model folder for transformer optimization (#16531 ) ### The optimize_model will generate a temporary model in current model folder. Most of time, it is fine. However, the scenario will break when the function run against input model mount from AzureML. In that case, the mounted folder is read-only. We have to copy the model to another temp folder to call optimize_model to workaround this issue. Otherwise, the optimize_model will fail when creating the optimized model in the read-only folder. However, the model copy is painful, especially when model is huge. This PR just expose the optimized_model_path at optimize_model level so that the caller could decide where to save the temp model.	2023-06-30 20:19:51 +08:00
cao lei	0c5f492493	remove AllocatorMgr class (#16509 ) ### Description Remove AllocatorManager class ### Motivation and Context After the refactor PR #15833 is in, AllocatorManager class is not referenced anymore.	2023-06-28 15:43:19 -07:00
Tianlei Wu	9407c3270c	GPT-2 attention fusion for transformers >= 4.27 (#16461 ) ### Description Before transformers 4.27, the causal mask uses uint8 data type, so there is extra Cast node to convert it to bool. This adds a pattern that without Cast node to support attention fusion for GPT-2 models exported with transformers >= 4.27. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/16453	2023-06-23 15:38:35 -07:00
Hector Li	a8c313dec4	[QNN EP] Python script to modify Onnx model to make it aligned with converted QNN model (#16423 ) Python script to modify Onnx model to make it aligned with converted QNN model ### Description Onnxruntime QNN EP can support context binary file generated by QNN tool chain. However QNN generated context binary file uses channel last and 8 bits or 16 bits for input and output. This script get the QNN model input & output information from QNN converted model_net.json file, and insert Cast, Transpose nodes to Onnx model if required.	2023-06-23 11:00:51 -07:00
Yufeng Li	89f8f20a61	fix protobuf copyfrom 2G limit (#16422 ) ### Description protobuf CopyFrom doesn't work for models > 2GB in version 4.23. This PR removes the copy for Calibrator. Currently Calibrator copies the ModelProto to avoid changing it. The reason is that quantize_static passes a ModelProto to Calibrator to calibrate quantization parameters, and then uses it for quantization. If the calibrator changes the ModelProto, quantization won't work. This PR changes quantize_static to pass a model path to Calibrator instead of a ModelProto, and makes Calibrator take only a model path as input, which is how it is used in most cases. This PR also removes the optimization from quantization. The user needs to call pre-process to optimize the model.	2023-06-21 20:45:11 -07:00
kunal-vaishnavi	4b69226fca	Fix input typo in Whisper export with beam search (#16439 ) ### Description This PR fixes a typo with assigning the `repetition_penalty` input in the Whisper export with beam search model. It is a follow-up to the [export stabilization PR](https://github.com/microsoft/onnxruntime/pull/16297). ### Motivation and Context The `repetition_penalty` input should be set to `repetition_penalty` instead of `input_features`.	2023-06-21 18:59:11 -07:00
Chi Lo	4e3cff60fd	CUDA graph support for TRT EP (#16081 ) CUDA EP already supports [CUDA graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs), also we observed some models can benefit from using CUDA graph with `trtexec`. Therefore, this PR enables the CUDA graph support for TRT EP. The implementation is based on https://github.com/microsoft/onnxruntime/pull/9978 with the same [constraints](https://github.com/microsoft/onnxruntime/pull/9978) as below: - Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported. - Usage of CUDA Graphs is limited to models where-in all the model ops (graph nodes) can be partitioned to the TRT EP. - The input/output types of models need to be tensors. - Shapes of inputs/outputs cannot change across inference calls. - IObinding is required.	2023-06-21 09:36:45 -07:00
Yufeng Li	d190db7fcd	Update default external_data_location for pre-process of quantization (#16399 ) external_data_location should be a string/file_name to indicate the file name of external data instead of a directory	2023-06-20 09:37:17 -07:00
cao lei	dd72192cf4	ExecutionProvider API refactor - move allocator from EP level to SessionState level and indexed by OrtDevice (#15833 ) ### Description This PR refactors the ExecutionProvider API for memory management by moving allocators from the EP level to the SessionState level, indexed by OrtDevice. ### Motivation and Context With this change, the EP level no longer carries the burden of maintaining allocators, which is friendlier for EP developers. --------- Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-06-19 17:44:45 -07:00
kunal-vaishnavi	3f7f90aed0	Stabilize Whisper export with beam search (#16297 ) ### Description This PR stabilizes the Whisper export with beam search by adding the following: - Remove unused ONNX models and extra folders generated during the export process - Specify the Whisper with beam search model's IR version for E2E integration - Parity check for Whisper with beam search model between PyTorch and ORT - Remove previously exported Whisper with beam search model before saving newly exported model ### Motivation and Context - Removing the unused ONNX models and extra folders frees up disk space after exporting and makes it easier to copy and move the output folder to other environments. - Specifying the IR version fixes an issue with generating the ONNX E2E model - Adding a parity check helps detect runtime issues during the export process - Removing the previously exported Whisper with beam search model prevents the data file size from doubling when the newly exported model is saved with the same filename	2023-06-16 18:56:52 -07:00
PeixuanZuo	bcdb81c563	[Whisper] add a fusion option to split input bias from MHA/DMHA (#16049 ) The Whisper model contains five different types of attention, and the q, k, v bias was fused into the Attention/MHA/DMHA op. encoderdecoderinit subgraph - Attention: encoder attention - Attention: decoder self attention + present k, v - MultiHeadAttention: decoder cross attention + present k and v. q and v have bias. decoder subgraph - DecoderMultiHeadAttention: decoder cross attention + past k, v. q has bias - DecoderMultiHeadAttention: decoder self attention + past/present k, v. q, k, v have bias. For the ROCm EP, MHA/DMHA doesn't support additional bias. This PR adds a fusion option `disable_multi_head_attention_bias` to split the q, k, v bias from MHA/DMHA.	2023-06-16 10:29:48 +08:00
kunal-vaishnavi	79e0230002	Add vocab masks to Whisper export with beam search (#16180 ) ### Description This PR adds flags for exporting Whisper with vocab masks for logits processing. This PR also sets `input_features` back to FP32 precision for the user and casts `input_features` to FP16 precision when needed. ### Motivation and Context This helps enable specific logits processing for the exported Whisper model.	2023-06-08 12:36:35 -07:00
FFFrog	d185bf444d	[CANN] Add IOBinding Support For CANN EP (#15802 ) ### Description Add IOBinding Support For CANN EP ### Motivation and Context Now, Users can use IOBinding feature to speed up the inference on CANN.	2023-06-01 03:13:38 -07:00
Xavier Dupré	e726151b5c	Introduce float 8 types (#14731 ) ### Description The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, FloatE5M2FNUZ as described in PR https://github.com/onnx/onnx/pull/4805. It uses the CUDA API to cast float/half to float8 if CUDA>=11.8, and a custom implementation if CUDA<11.8. * It implements Cast, QuantizeLinear, DequantizeLinear for all types on CPU, and only for types FloatE4M3FN, FloatE5M2 on CUDA. * It extends the supported types for control flow operators and Shape, Reshape, Identity, If, Loop, Scan. * It implements Equal(19). * Cast, QuantizeLinear, DequantizeLinear operators now support a parameter `saturate`, only valid for float 8 types. It is true by default. In that case, any value out of range is converted into the maximum float 8 value. If false, it is infinite. * QuantizeLinear, DequantizeLinear now support multiple scales on CUDA (and ROCm by extension), scale = 1D tensor with one scale per channel ### Motivation and Context Supports latest onnx version. Fixes [AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395) --------- Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2023-05-30 13:25:58 -07:00
mindest	90e8c8daaf	profile_explorer: add op-kernel correlation info (#15946 ) ### Description * Add aggregated op-kernel correlation information in profiler explorer when running inference session. * Add filtering feature so that we can focus on model runs of interest (excluding warmup steps, etc.)	2023-05-30 23:25:43 +08:00
cloudhan	2cf0ae7d01	[ROCm] Add AttentionMode to make attention logic streamline (#15978 ) Refactor for future kv cache change.	2023-05-26 12:06:36 +08:00
Yuhong Guo	04a8f50674	New configuration to limit the arena extension (#15983 ) Add a configuration `max_power_of_two_extend_bytes` to limit the arena extension size. ### Motivation and Context In our real scenario, we observe that if the model is big enough, the BfcArena extends uncontrollably. As shown by the following figures, if a model uses more than 16GB of memory, the BfcArena requests 32GB of memory in total under the `kNextPowerOfTwo` strategy. With the new strategy, the extension is limited. The default maximum extension size is 1GB. #### Without the new configuration After loading the model, ORT uses 32G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/42b93c66-b957-4f20-a13b-d34cb390afff) #### With the new configuration After loading the model, ORT uses 23G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/5abffeff-9ca3-4187-a262-37fd2764fe1b) Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-05-25 02:19:07 -07:00
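To make the effect of capping the extension size concrete, here is a deliberately simplified model of a doubling arena. It is an assumption-laden sketch, not the BfcArena implementation: the function `total_reserved`, its parameters, and the rule "each extension is twice the previous one, optionally capped" are hypothetical simplifications, and the sketch does not try to reproduce the 32G and 23G figures reported in the commit.

```python
GIB = 1024 ** 3


def total_reserved(demand_bytes, initial_chunk, max_extend_bytes=None):
    """Bytes reserved by a toy arena that grows until it covers demand_bytes.

    Each extension is twice the size of the previous one. If max_extend_bytes
    is given, no single extension may exceed it (the idea behind the cap).
    """
    reserved = 0
    chunk = initial_chunk
    while reserved < demand_bytes:
        reserved += chunk
        chunk *= 2
        if max_extend_bytes is not None:
            chunk = min(chunk, max_extend_bytes)
    return reserved


# A 17 GiB demand, starting from a 1 GiB chunk (illustrative numbers only).
uncapped = total_reserved(17 * GIB, GIB)
capped = total_reserved(17 * GIB, GIB, max_extend_bytes=GIB)
```

In this toy model the uncapped arena overshoots the demand by a large margin because the last extension alone is 16 GiB, while the capped arena stops at the demand. The real allocator has more moving parts (free-chunk reuse, rounding, fragmentation), which this sketch ignores.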
Adrian Lizarraga	efc84a43e8	[QNN EP] Add session option to disable fallback to default CPU EP (#16016 ) ### Description Adds the session config option `disable_cpu_ep_fallback` to allow the user to prevent the CPU EP from handling nodes not supported by other execution providers. ```C++ // Graph nodes that are not supported by the execution providers (EPs) explicitly added to the session are // assigned (i.e., "fallback") to the CPU EP by default. // // This option allows the user to disable the fallback of unsupported graph nodes to the CPU EP. // If this option is set to "1", session creation will fail if the execution providers other than the CPU EP cannot // fully support all of the nodes in the graph. // // It is invalid to set this option and explicitly add the CPU EP to the session. In this case, session creation // will also fail with an error. // // Option values: // - "0": CPU EP fallback is not disabled. [DEFAULT] // - "1": CPU EP fallback is disabled. static const char* const kOrtSessionOptionsDisableCPUEPFallback = "session.disable_cpu_ep_fallback"; ``` #### Example use ```C++ #include "core/session/onnxruntime_cxx_api.h" #include "core/session/onnxruntime_session_options_config_keys.h" int main(int argc, char** argv) { Ort::SessionOptions so; so.AddConfigEntry(kOrtSessionOptionsDisableCPUEPFallback, "1"); // Disable fallback to the CPU EP. onnxruntime::ProviderOptions options; #if defined(_WIN32) options["backend_path"] = "QnnCpu.dll"; #else options["backend_path"] = "libQnnCpu.so"; #endif so.AppendExecutionProvider("QNN", options); const ORTCHAR_T* ort_model_path = ORT_MODEL_FOLDER "qnn_ep_partial_support.onnx"; Ort::Session session(*ort_env, ort_model_path, so); // Throws exception if nodes fallback to CPU // ... ``` ### Motivation and Context Makes it easier for application developers to ensure that the entire model runs on specific EPs. This is critical for Qualcomm/scenarios. 
If the compute cannot be offloaded to the NPU, running on CPU is not acceptable. (could be the difference between 90 second inference and 6 seconds inference) --------- Co-authored-by: Pranav Sharma <prs@microsoft.com>	2023-05-23 17:56:32 -07:00
PeixuanZuo	2fddc65c8c	[ROCm] add hipblaslt into GemmFastGelu TunableOp (#15945 ) add hipblaslt into GemmFastGelu TunableOp.	2023-05-23 11:07:09 +08:00
Zhang Lei	0f8e66d905	optimization for whisper model with decoder masked multihead attention (#15827 ) * graph tools update * cuda kernel update * operator spec update and implementation update * greedy search bug fix for a wrong assumption about cross/self attention input length * avoid using the "" name in value info when loading a graph, which has historically appeared in many models	2023-05-18 15:38:31 -07:00
Yufeng Li	0fed00c04d	fix topo sort in quantization tool (#16003 ) ### Description Should not set up dependent node list for empty('') input	2023-05-18 13:43:52 -07:00
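The idea in the commit above (an empty `''` input name marks an omitted optional input and must not create a dependency) can be sketched with a small topological sort. This is an illustrative sketch, not the quantization tool's code: the `(name, inputs, outputs)` tuple representation and the function `topo_sort` are hypothetical.

```python
from collections import deque


def topo_sort(nodes):
    """Topologically sort nodes given as (name, inputs, outputs) tuples.

    Empty input names ('') are skipped, so an omitted optional input never
    adds an edge to the dependency graph.
    """
    producer = {}
    for name, _, outputs in nodes:
        for out in outputs:
            if out:
                producer[out] = name

    deps = {name: set() for name, _, _ in nodes}
    consumers = {name: [] for name, _, _ in nodes}
    for name, inputs, _ in nodes:
        for inp in inputs:
            if not inp:  # '' means the optional input is not provided
                continue
            src = producer.get(inp)  # None for graph inputs and initializers
            if src is not None and src != name and src not in deps[name]:
                deps[name].add(src)
                consumers[src].append(name)

    ready = deque(name for name, _, _ in nodes if not deps[name])
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for consumer in consumers[current]:
            deps[consumer].discard(current)
            if not deps[consumer]:
                ready.append(consumer)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


# "clip" omits its optional min input by passing ''.
graph = [
    ("clip", ["x", "", "max_val"], ["y"]),
    ("const", [], ["max_val"]),
    ("relu", ["y"], ["z"]),
]
order = topo_sort(graph)
```

Without the `if not inp` check, the empty name could be looked up like any other tensor name, which is the kind of spurious dependency the fix avoids.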
stevenlix	270c09a37f	Add timestamp logits processor for whisper (#15853 ) Enable timestamp estimation and logits processing for Whisper model.	2023-05-16 21:40:00 -07:00
kailums	f62f722c70	integrate triton into ort (#15862 ) ### Description In some scenarios, kernels written in Triton are more performant than CK or other handwritten kernels, so we implement a framework that lets onnxruntime use these Triton kernels. This PR integrates Triton into ORT so that ORT can use kernels that are written and compiled with Triton. The change has two main parts: 1. a build part that compiles the Triton kernels and combines them into libonnxruntime_providers_rocm.so 2. a loader and launcher in C++ for loading and launching the Triton kernels. #### Build To compile the Triton kernels, this PR adds a script `tools/ci_build/compile_triton.py`. The script dynamically loads all kernel files, compiles them, and generates `triton_kernel_infos.a` and `triton_kernel_infos.h`. `triton_kernel_infos.a` contains all compiled kernel instructions; this file is combined into libonnxruntime_providers_rocm.so using the --whole-archive flag. `triton_kernel_infos.h` defines a const array that contains the metadata for each compiled kernel. This metadata is used for load and launch, so the header file is included by 'triton_kernel.cu', which defines the load and launch functions. A build flag is added in build.py and CMakeList.txt; when building the rocm provider, it calls the triton_kernel build command and generates all necessary files. #### C++ Load and Launch On the C++ side, the load and launch functions are implemented in triton_kernel.cu and triton_kernel.h. These two files are located in `providers/cuda`, and when compiling for rocm they are hipified, so this part supports both cuda and rocm. Currently, however, Triton kernels are only called in rocm. We also implement a softmax Triton op as an example. Because many kernels are generated for the different input shapes of softmax, we use TunableOp to select the best one.	2023-05-17 09:35:28 +08:00
Akash	1079df6aaa	Update StableDiffusion path after cloning repo (#15948 ) ### Description Correct path to SD files in README ### Motivation and Context Small typo in path	2023-05-16 08:39:27 -07:00
Prathik Rao	a0ccb95f3c	add option to load pretrained weights for T5 model (#15951 ) ### Description Adds option to pass in pretrained weights file during T5 inference onnx export. Mimics the changes made to whisper: https://github.com/microsoft/onnxruntime/pull/15759 ### Motivation and Context Required for ONNX Runtime demo being presented at BUILD.	2023-05-15 22:52:35 -07:00
kunal-vaishnavi	5b663d6797	Whisper Multitask and Multilingual (#15936 ) ### Description This PR enables Whisper's multitask format and allows a user to use Whisper for multiple tasks (e.g. transcription, translation) and for multilingual purposes (e.g. English, Spanish). This PR also removes `attention_mask` as a required input for Whisper with beam search. ### Usage Here is an example of how you can use Whisper for English transcription. ``` import numpy as np import onnxruntime as ort from datasets import load_dataset from transformers import AutoConfig, AutoProcessor model = "openai/whisper-tiny" config = AutoConfig.from_pretrained(model) processor = AutoProcessor.from_pretrained(model) forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe") # forced_decoder_ids is of the format [(1, 50259), (2, 50359), (3, 50363)] and needs to be # of the format [50258, 50259, 50359, 50363] where 50258 is the start token id forced_decoder_ids = [config.decoder_start_token_id] + list(map(lambda token: token[1], forced_decoder_ids)) ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") input_features = processor(ds[0]["audio"]["array"], return_tensors="np").input_features inputs = { "input_features": np.float32(input_features), "max_length": np.array([26], dtype=np.int32), "min_length": np.array([1], dtype=np.int32), "num_beams": np.array([2], dtype=np.int32), "num_return_sequences": np.array([1], dtype=np.int32), "length_penalty": np.array([1.0], dtype=np.float32), "repetition_penalty": np.array([1.0], dtype=np.float32), "decoder_input_ids": np.array([forced_decoder_ids], dtype=np.int32), } sess = ort.InferenceSession("whisper-tiny_beamsearch.onnx", providers=["CPUExecutionProvider"]) outputs = sess.run(None, inputs) # Print tokens and decoded output print(outputs[0][0][0]) print(processor.decode(outputs[0][0][0])) ``` If you don't want to provide specific decoder input ids or you want Whisper to predict 
the output language and task, you can set `forced_decoder_ids = [config.decoder_start_token_id]` instead. ### Motivation and Context As seen in the figure below from the [OpenAI Whisper paper](https://cdn.openai.com/papers/whisper.pdf), Whisper can be used for multiple tasks and languages. ![Screenshot 2023-05-12 165215](https://github.com/microsoft/onnxruntime/assets/115581922/49335e39-a79c-4f78-92e9-89b034405f65)	2023-05-15 14:36:33 -07:00
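The reshaping of `forced_decoder_ids` described in the commit above can be isolated into a small pure-Python helper. The function name `to_decoder_input_ids` is hypothetical; the token ids are the ones quoted in the commit message.

```python
def to_decoder_input_ids(start_token_id, forced_decoder_ids):
    """Turn [(position, token_id), ...] into [start_token_id, token_id, ...]."""
    return [start_token_id] + [token_id for _, token_id in forced_decoder_ids]


# Values quoted in the commit message: 50258 is the start token id.
prompt_ids = to_decoder_input_ids(50258, [(1, 50259), (2, 50359), (3, 50363)])

# Passing no forced ids leaves only the start token, which corresponds to
# letting Whisper predict the language and task itself.
auto_ids = to_decoder_input_ids(50258, [])
```

The resulting list is what the commit's example wraps in `np.array([...], dtype=np.int32)` for the `decoder_input_ids` input.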
Ye Wang	3418ca28a8	pack qkv in t5 decoder (#15801 ) ### Description V100, b_4_s_128, max_output_len=64, beam=4 before: t5_small: 101.28ms t5_base: 200.07ms after: t5_small: 87.65ms t5_base: 174.44ms --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-05-15 13:45:39 -07:00
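For readers who want the latencies above as relative improvements, the arithmetic is a one-liner. The helper `relative_reduction` is a hypothetical name; the inputs are the before/after numbers from the commit message.

```python
def relative_reduction(before_ms, after_ms):
    """Fraction of latency removed: (before - after) / before."""
    return (before_ms - after_ms) / before_ms


# Latencies reported in the commit (V100, b_4_s_128, max_output_len=64, beam=4).
t5_small = relative_reduction(101.28, 87.65)
t5_base = relative_reduction(200.07, 174.44)

print(f"t5_small: {t5_small:.1%}, t5_base: {t5_base:.1%}")
```

Both models land at a latency reduction of roughly 13 percent on this configuration.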
Chester Liu	984dd02df3	Update optimize_pipeline.py to use __name__ detection (#15866 ) ### Description Use `__name__` detection in `optimize_pipeline.py`. ### Motivation and Context It prevents unwanted execution of `main` when importing the file.	2023-05-12 20:43:29 -07:00
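The `__name__` detection mentioned above is the standard Python entry-point guard. A minimal generic sketch follows; the body of `main` is a placeholder and is not taken from `optimize_pipeline.py`.

```python
import sys


def main(argv=None):
    """Placeholder entry point; returns a process exit code."""
    args = sys.argv[1:] if argv is None else argv
    print(f"running with {len(args)} argument(s)")
    return 0


# The guard is true only when the file is executed as a script.
# When the file is imported, __name__ is the module name and main() is skipped.
if __name__ == "__main__":
    main()
```

With the guard in place, another module can import this file and call `main([...])` explicitly without triggering a run at import time.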
petermcaughan	e5189330d5	Address OOM Issue when exporting Whisper (#15880 ) ### Description Remove attention_mask from unnecessary code paths in the whisper export process. ### Motivation and Context Current export script frequently hits OOM error when export whisper-large. Memory profiling shows that this is a result of generating dummy inputs for the `encoder_attention_mask` input for a model pass during exporting - in whisper-large, this dummy tensor can be around 20GB in size. `encoder_attention_mask` is ultimately a dummy input - it's just there to satisfy certain BeamSearch requirements. Thus, we're currently creating a 20GB tensor and passing it to the model, which then discards the input anyways. By removing the code path to generate a dummy encoder_mask tensor, we can reduce the memory requirements to export whisper substantially, while keeping the BeamSearch checks satisfied. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-05-12 11:23:07 -07:00
Tianlei Wu	e0c1fa35a8	update stable diffusion script and doc (#15846 ) ### Description Update script: (1) change some float16 verbose logging to debug level. (2) Let requirements-cuda.txt includes requirements.txt (3) Use an environment variable ORT_DISABLE_TRT_FLASH_ATTENTION=1 to avoid black image in 2.1 model. Update benchmark and doc. (4) Update document to include command lines to build ORT rocm from source. (5) Update optimize_pipeline.py so that user can disable packed qkv/kv from command line options. (6) Update document to use torch < 2.0 for onnx export.	2023-05-09 15:29:13 -07:00
Tianlei Wu	191ee1d3c0	Fix symbolic shape infer empty value_info (#15842 ) ### Description When node output is optional, symbolic shape infer might add an empty value_info item. Add some checking to avoid this. ### Motivation and Context - Stable diffusion optimized model reported invalid data type 0 during inference.	2023-05-08 16:18:35 -07:00
Ted Themistokleous	42d62b8f2b	Fixes to get stable diffusion benchmark running (#15755 ) ### Description Added changes to the MIGraphX EP to support stable diffusion 1. Added parameterized input dimensions to not trigger a precompile to set input parameters in the EP 2. Removed input checking for the Resize operator in the EP, as MIGraphX already performs these checks 3. Add support to the benchmark script to use the MIGraphX execution provider 4. Add support for an odd-valued batch size (3) that was seen on other benchmarks we were comparing against. ### Motivation and Context These changes are required to get stable diffusion models to run on MIGraphX through the EP. Without these changes we see the following incorrect behavior. 1. Resize operators are pushed onto the CPU EP instead of MIGraphX, causing a significant slowdown during runs 2. Precompile operations incorrectly parse the input_ids parameter for our text model, with a 1, which breaks during MIGraphX Compile of onnx. This in turn throws an error and stops any setup before inference. 3. Selecting the correct EP in the benchmark script, which was previously missing the MIGraphX option 5. Suppressed an error we keep seeing with pthread_set_affinity - this is a quality of life change when using the MIGraphX EP This was tested with the benchmark.py script using stable diffusion v2 located in onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/ --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2023-05-06 17:35:21 +08:00
BoarQing	272aab4afa	Fix issues on Windows for Vitis AI (#15810 ) ### Description Fix two errors that are only encountered on Windows ### Motivation and Context onnxruntime::VitisAIProviderFactoryCreator::Create caused a compile error. if (it == provider_options_map.end()) caused an error but executed as normal. Co-authored-by: Zhang <yueqingz@amd.com>	2023-05-04 14:42:19 -07:00
Baiju Meswani	ba7b83ff3c	Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776 )	2023-05-03 13:08:35 -07:00
Prathik Rao	090312af71	add local state dict option (#15759 ) ### Description Adds an option to load local state dictionary for whisper model export. ### Motivation and Context This is useful to demonstrate workflow of using ORT Training to get model weights, downloading said weights onto a local gpu-enabled device, exporting the custom model using `convert_to_onnx.py`, and then nicely feeding the .onnx file into ORT InferenceSession.	2023-05-01 22:08:11 -07:00


1062 commits