Commit graph

1062 commits

Author SHA1 Message Date
Tianlei Wu
77b45c6503
Add Stable Diffusion Benchmark on A100-PCIE-80GB (#16702)
0(1) Fix a bug in https://github.com/microsoft/onnxruntime/pull/16560
that UNet shall be set fp16 flag.
(2) Remove wget in requirements since it is no longer needed.
(3) Add benchmark numbers in A100-PCIE-80GB. Note that CUDA EP have
issue to run in batch size 4 so the number is not added.
2023-07-14 10:37:00 -07:00
mindest
810512c658
[ROCm] TunableOp: add hipBLASLt tuning logic (#16338)
### Description
- Add hipBLASLt tuning logic in place of default hipBLASLt
implementation;
- add kernel explorer for hipBLASLt.

related operators: Gemm, StridedBatchedGemm, and GemmFastGelu.

Temporarily mark algos that require extra workspace as unsupported.
Will add workspace support in later PR, which will change Gemm Params
def and affect multiple files.
2023-07-14 08:20:58 +08:00
Vincent Wang
c07a3b869c
Triton Codegen for ORTModule (#15831)
Fuse connected elementwise and reduce Ops to TritonOp and codegen triton
code to run the kernel.

This PR is co-edited by @wejoncy and @er3x3
2023-07-13 18:17:58 +08:00
mindest
b7fd5af48b
[ROCm] TunableOp: Update rocBLAS get_solutions API (since ROCm5.6) (#16657)
### Description
- Update existing rocBLAS get_solutions API using
`*_get_solutions_by_type` (supported from ROCm5.6); remove the original
nested TunableOp logic.
- Update kernel_explorer.
2023-07-13 11:20:26 +08:00
PeixuanZuo
ebc311365b
[ROCm] Optimize ROCm CI to reduce time (#16620)
This PR mainly optimize ROCm CI test to reduce time and CPU utilization.

- use smaller batch size on strided_batched_gemm/batched_gemm test
- disable cpu training test
- fix test_e2e_padding_elimination Occasional failures on ROCm.
2023-07-13 10:58:03 +08:00
Ye Wang
dd7d721f3c
support rotary embeddings in decoder masked self-attention (#16556)
### Description
<!-- Describe your changes. -->

This PR adds support for rotary embeddings in decoder masked
self-attention

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-07-12 13:48:48 -07:00
Tianlei Wu
2de5807703
Attention fusion for UNet onnx model export from PyTorch 2.* (#16629)
### Description
Tested with stable diffusion unet models exported by pytorch nightly.

Example to run:
```
cd onnxruntime/python/tools/transformers/
python optimizer.py --input unet.onnx  --output unet_fp16.onnx --model_type unet --float16 --opt_level 0
```
2023-07-11 14:35:48 -07:00
Justin Chu
ad994565ae
Fix type annotation for InferenceSession (#16632)
The Sequence should have been annotated to take a Union type; otherwise
the annotation would be invalid.
2023-07-11 06:34:22 -07:00
mindest
347c963d5c
[ROCm] Add ROCm Triton TunableOp for GroupNorm (#16196)
### Description
- Refactor existing Triton TunableOp-related code (based on work in
#15862)
- Add GroupNorm Triton implementation
2023-07-11 13:55:30 +08:00
Tianlei Wu
b8f6235f11
Update stable diffusion benchmark for TensorRT EP (#16560)
### Description

Add Stable Diffusion Text2Image pipelines of TensorRT EP and CUDA EP.
They can automatically export and optimize ONNX model, and create
ONNXRuntime session to use TensorRT EP or CUDA execution provider.

Add support for benchmarking TensorRT.

Add support of cuda graph. The feature is only supported in nightly
package right now.

Engine/Provider to test | command line
---- | ---
CUDA EP | `python benchmark.py -v 1.5`
CUDA EP with cuda graph | `python benchmark.py -v 1.5
--enable_cuda_graph`
TensorRT EP | `python benchmark.py -v 1.5 -r tensorrt`
TensorRT EP with cuda graph | `python benchmark.py -v 1.5 -r tensorrt
--enable_cuda_graph`
TensorRT | `python benchmark.py -v 1.5 -e tensorrt`

Add benchmark numbers of T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch
1.13.1+cu11.7, TensorRT 8.6.1, onnxruntime-gpu 1.15.1 (or
ort-nightly-gpu 1.16 for cuda graph).

TODO: add benchmark numbers of A100-80GB

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-10 09:51:03 -07:00
PeixuanZuo
3b729e5d2f
[ROCm] use cupy for GPU-accelerated computing (#16611)
kernel explorer has lots of tests and need numpy to verify the results
of GPU kernels, it will make CPU utilization very high. This PR use
`cupy ` to replace `numpy` to do compute on GPU to reduce CPU
utilization.

set `KERNEL_EXPLORER_TEST_USE_CUPY=1` to enable cupy.
2023-07-10 17:17:39 +08:00
PeixuanZuo
2f56815344
[Whisper] Fix provider in export procress (#16545)
type of parameter `provider` is string.
2023-07-10 11:55:36 +08:00
cloudhan
01c5d05712
Avoid repeated GemmSoftmaxGemmPermuteTunableOp<HipT> ctor invocation (#16518)
The `GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct,
avoid the ctor invocation will substantially improve the launch time and
get better performance during the decoding. This get <7% e2e time
reduction of whisper large.
2023-07-08 00:25:07 +08:00
Edward Chen
6be7b03e53
Enable -Wshorten-64-to-32 warning if available. (#16524)
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
2023-07-07 08:11:44 -07:00
petermcaughan
47f136e2d3
Speed Up Whisper Export (#16504)
### Description
Add a greedy option to the initializer deduplication process in the
Whisper export.

Currently to detect shared initializers, ORT compares every initializer
against every other initializer (n^2). In the comparison operator, if
the two initializers have different data types (e.g. raw_data and
int_64), both initializers are converted to a numpy array and the cast
result is compared. This cast happens in every comparison, and
exponentially affects the runtime of finding shared initializers. This
cast operation is the bottleneck for the current Whisper export script.

The conversion to the numpy array is useful for detecting equal
initializer values across nodes of different data types (e.g.
recognizing a bias value of 0.0 is the same as a slice index of 0) but
isn't triggered when comparing initializers of the same data type (e.g.
weight value of 0.6 == weight value of 0.6). The latter case is where
the majority of utility is for Whisper, and so by eliminating our path
for comparing numpy arrays for initializers we save a lot of time for
minimal cost.

In other words, this PR adds an option to remove the ability to detect
shared initializers of different types (e.g. Slice Index and MatMul
Constant) while retaining the ability to deduplicate weights.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- Current time to export Whisper-large is prohibitive.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-06-30 12:22:30 -07:00
Mike Guo
aeaa1d650f
make optimized_model_path be in temp folder instead of source model folder for transformer optimization (#16531)
### The optimize_model will generate a temporary model in current model
folder. Most of time, it is fine.

However, the scenario will break when the function run against input
model mount from AzureML. In that case, the mounted folder is read-only.
We have to copy the model to another temp folder to call optimize_model
to workaround this issue. Otherwise, the optimize_model will fail when
creating the optimized model in the read-only folder. However, the model
copy is painful, especially when model is huge.

This PR just expose the optimized_model_path at optimize_model level so
that the caller could decide where to save the temp model.
2023-06-30 20:19:51 +08:00
cao lei
0c5f492493
remove AllocatorMgr class (#16509)
### Description
Remove AllocatorManager class


### Motivation and Context
After the refactor PR #15833 is in, AllocatorManager class is not
referenced anymore.
2023-06-28 15:43:19 -07:00
Tianlei Wu
9407c3270c
GPT-2 attention fusion for transformers >= 4.27 (#16461)
### Description
Before transformers 4.27, the causal mask uses uint8 data type, so there
is extra Cast node to convert it to bool. This adds a pattern that
without Cast node to support attention fusion for GPT-2 models exported
with transformers >= 4.27.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/16453
2023-06-23 15:38:35 -07:00
Hector Li
a8c313dec4
[QNN EP] Python script to modify Onnx model to make it aligned with converted QNN model (#16423)
Python script to modify Onnx model to make it aligned with converted QNN
model

### Description
Onnxruntime QNN EP can support context binary file generated by QNN tool chain. However QNN generated context binary file uses channel last and 8 bits or 16 bits for input and output. This script get the QNN model input & output information from QNN converted model_net.json file, and insert Cast, Transpose nodes to Onnx model if required.
2023-06-23 11:00:51 -07:00
Yufeng Li
89f8f20a61
fix protobuf copyfrom 2G limit (#16422)
### Description
<!-- Describe your changes. -->
protobuf CopyFrom doesn't work for model > 2GB for version 4.23. This PR
removes the copy for Calibrator.

Currently Calibrator copies the ModelProto to avoid changing it. The
reason is that: quantize_static passes a ModelProto to Calibrator to
calibrate quantitation parameters, and then use it for quantization. If
calibrator changes the ModelProto, quantizaiton won't work.

This PR changes quantize_static to pass in a model path to Calibrator
instead of a ModelProto, and make Calibrator only take in model path as
input, which is how it is used in most cases.

This PR also remove the optimization from quantization. User needs to
call pre-process to optimize the model
2023-06-21 20:45:11 -07:00
kunal-vaishnavi
4b69226fca
Fix input typo in Whisper export with beam search (#16439)
### Description
This PR fixes a typo with assigning the `repetition_penalty` input in
the Whisper export with beam search model. It is a follow-up to the
[export stabilization
PR](https://github.com/microsoft/onnxruntime/pull/16297).


### Motivation and Context
The `repetition_penalty` input should be set to `repetition_penalty`
instead of `input_features`.
2023-06-21 18:59:11 -07:00
Chi Lo
4e3cff60fd
CUDA graph support for TRT EP (#16081)
CUDA EP already supports [CUDA
graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
also we observed some models can benefit from using CUDA graph with
`trtexec`. Therefore, this PR enables the CUDA graph support for TRT EP.

The implementation is based on
https://github.com/microsoft/onnxruntime/pull/9978 with the same
[constraints](https://github.com/microsoft/onnxruntime/pull/9978) as
below:

- Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
- Usage of CUDA Graphs is limited to models where-in all the model ops
(graph nodes) can be partitioned to the TRT EP.
- The input/output types of models need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- IObinding is required.
2023-06-21 09:36:45 -07:00
Yufeng Li
d190db7fcd
Update default external_data_location for pre-process of quantization (#16399)
external_data_location should be a string/file_name to indicate the file
name of external data instead of a directory
2023-06-20 09:37:17 -07:00
cao lei
dd72192cf4
ExecutionProvider API refactor - move allocator from EP level to SessionState level and indexed by OrtDevice (#15833)
### Description
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice. By this change, EP level will shift the burden of
maintaining allocators, which will be user friendly for EP developers

---------

Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-06-19 17:44:45 -07:00
kunal-vaishnavi
3f7f90aed0
Stabilize Whisper export with beam search (#16297)
### Description
This PR stabilizes the Whisper export with beam search by adding the
following:
- Remove unused ONNX models and extra folders generated during the
export process
- Specify the Whisper with beam search model's IR version for E2E
integration
- Parity check for Whisper with beam search model between PyTorch and
ORT
- Remove previously exported Whisper with beam search model before
saving newly exported model


### Motivation and Context
- Removing the unused ONNX models and extra folders frees up disk space
after exporting and makes it easier to copy and move the output folder
to other environments.
- Specifying the IR version fixes an issue with generating the ONNX E2E
model
- Adding a parity check helps detect runtime issues during the export
process
- Removing the previously exported Whisper with beam search model
prevents the data file size from doubling when the newly exported model
is saved with the same filename
2023-06-16 18:56:52 -07:00
PeixuanZuo
bcdb81c563
[Whisper] add a fusion option to split input bias from MHA/DMHA (#16049)
Whsiper model contains five different types of attention, q, k, v bias
was fused into Attention/MHA/DMHA op,

encoderdecoderinit subgraph
- Attention: encoder attention
- Attention: decoder self attention + present k, v
- MultiHeadAttention: decoder cross attention + present k and v. q and v
have bias.

decoder subgraph
- DecoderMultiHeadAttention: decoder cross attention + past k, v. q has
bias
- DecoderMultiHeadAttention: decoder self attention + past/present k, v.
q, k, v have bias.

For ROCm EP, MHA/DMHA doesn't support additional bias. This PR add a
fusion option `disable_multi_head_attention_bias` to split q.k,v bias
from MHA/DMHA.
2023-06-16 10:29:48 +08:00
kunal-vaishnavi
79e0230002
Add vocab masks to Whisper export with beam search (#16180)
### Description
This PR adds flags for exporting Whisper with vocab masks for logits
processing. This PR also sets `input_features` back to FP32 precision
for the user and casts `input_features` to FP16 precision when needed.



### Motivation and Context
This helps enable specific logits processing for the exported Whisper
model.
2023-06-08 12:36:35 -07:00
FFFrog
d185bf444d
[CANN] Add IOBinding Support For CANN EP (#15802)
### Description
Add IOBinding Support For CANN EP

### Motivation and Context
Now, Users can use IOBinding feature to speed up the inference on CANN.
2023-06-01 03:13:38 -07:00
Xavier Dupré
e726151b5c
Introduce float 8 types (#14731)
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4MEFNUZ, FloatE5M2FNUZ
as described in PR https://github.com/onnx/onnx/pull/4805. It uses CUDA
API to cast float/half to float8 if CUDA>=11.8, a custom implementation
if CUDA<11.8.

* It implements, Cast, QuantizeLinear, DequantizeLinear for all types on
CPU, only for types FloatE4M3FN, FloatE5M2 on CUDA.
* It extends the supported types for control flow operator, Shape,
Reshape, Identity, If, Loop, Scan, Reshape
* It implements Equal(19).
* Cast, QuantizeLinear, DequantizeLinear operators now support a
parameter `saturate` only valid for float 8 types. It is true by
default. In that case, any value out of range is converted into the
maximum float 8 value. If false, it is infinite.
* QuantizeLinear, DequantizeLinear now supports multiple scales on CUDA
(and ROCm by extension), scale = 1D tensor with one scale per channel

### Motivation and Context
Supports latest onnx version.

Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)

---------

Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-05-30 13:25:58 -07:00
mindest
90e8c8daaf
profile_explorer: add op-kernel correlation info (#15946)
### Description
<!-- Describe your changes. -->
* Add aggregated op-kernel correlation information in profiler explorer
when running inference session.
* Add filtering feature so that we can focus on model runs of interest
(excluding warmup steps, etc.)
2023-05-30 23:25:43 +08:00
cloudhan
2cf0ae7d01
[ROCm] Add AttentionMode to make attention logic streamline (#15978)
Refactor for future kv cache change.
2023-05-26 12:06:36 +08:00
Yuhong Guo
04a8f50674
New configuration to limit the arena extension (#15983)
Add a configuration `max_power_of_two_extend_bytes ` to limit the arena extension size.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
In our real scenario, we observe that if the model is big enough the
BfcArena will extend uncontrollable.
As showed by the following figures, if a model uses more than 16GB
memory, the BfcArena will totally apply for 32GB memory according to the
`kNextPowerOfTwo` strategy. With the new strategy, the extension is
limited. The default maximum extension size is 1GB.

#### Without the new configuration
After loading the model, ORT uses 32G GPU memory.

![image](https://github.com/microsoft/onnxruntime/assets/19584326/42b93c66-b957-4f20-a13b-d34cb390afff)

#### With the new configuration
After loading the model, ORT uses 23G GPU memory.

![image](https://github.com/microsoft/onnxruntime/assets/19584326/5abffeff-9ca3-4187-a262-37fd2764fe1b)

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-05-25 02:19:07 -07:00
Adrian Lizarraga
efc84a43e8
[QNN EP] Add session option to disable fallback to default CPU EP (#16016)
### Description
Adds the session config option `disable_cpu_ep_fallback` to allow the
user to prevent the CPU EP from handling
nodes not supported by other execution providers.

```C++
// Graph nodes that are not supported by the execution providers (EPs) explicitly added to the session are
// assigned (i.e., "fallback") to the CPU EP by default.
//
// This option allows the user to disable the fallback of unsupported graph nodes to the CPU EP.
// If this option is set to "1", session creation will fail if the execution providers other than the CPU EP cannot
// fully support all of the nodes in the graph.
//
// It is invalid to set this option and explicitly add the CPU EP to the session. In this case, session creation
// will also fail with an error.
//
// Option values:
// - "0": CPU EP fallback is not disabled. [DEFAULT]
// - "1": CPU EP fallback is disabled.
static const char* const kOrtSessionOptionsDisableCPUEPFallback = "session.disable_cpu_ep_fallback";
```

#### Example use
```C++
#include "core/session/onnxruntime_cxx_api.h"
#include "core/session/onnxruntime_session_options_config_keys.h"

int main(int argc, char** argv) {
    Ort::SessionOptions so;
    so.AddConfigEntry(kOrtSessionOptionsDisableCPUEPFallback, "1");  // Disable fallback to the CPU EP.

    onnxruntime::ProviderOptions options;
#if defined(_WIN32)
    options["backend_path"] = "QnnCpu.dll";
#else
    options["backend_path"] = "libQnnCpu.so";
#endif

    so.AppendExecutionProvider("QNN", options);

    const ORTCHAR_T* ort_model_path = ORT_MODEL_FOLDER "qnn_ep_partial_support.onnx";
    Ort::Session session(*ort_env, ort_model_path, so);  // Throws exception if nodes fallback to CPU
    // ...
```

### Motivation and Context
Makes it easier for application developers to ensure that the entire
model runs on specific EPs. This is critical for Qualcomm/scenarios. If
the compute cannot be offloaded to the NPU, running on CPU is not
acceptable. (could be the difference between 90 second inference and 6
seconds inference)

---------

Co-authored-by: Pranav Sharma <prs@microsoft.com>
2023-05-23 17:56:32 -07:00
PeixuanZuo
2fddc65c8c
[ROCm] add hipblaslt into GemmFastGelu TunableOp (#15945)
add hipblaslt into GemmFastGelu TunableOp.
2023-05-23 11:07:09 +08:00
Zhang Lei
0f8e66d905
optimization for whisper model with decoder masked multihead attention (#15827)
* graph tools update
* cuda kernel update
* operator spec update and implementation update
* greed search bug fix on wrong assumption for cross/self attention
input length
* avoid use of "" name in value info when loading graph which
historically in many model
2023-05-18 15:38:31 -07:00
Yufeng Li
0fed00c04d
fix topo sort in quantization tool (#16003)
### Description
<!-- Describe your changes. -->
Should not set up dependent node list for empty('') input


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-18 13:43:52 -07:00
stevenlix
270c09a37f
Add timestamp logits processor for whisper (#15853)
Enable timestamp estimation and logits processing for Whisper model.
2023-05-16 21:40:00 -07:00
kailums
f62f722c70
integrate triton into ort (#15862)
### Description
In some scenarios, the triton written kernels are more performant than
CK or other handwritten kernels, so we implement a framework that
onnxruntime can use these triton written kernels.

This PR is to integrate triton into ort, so that ort can use kernels
that written and compiled by triton.

The main change focus on two part:
1. a build part to compile triton written kernel and combine these
kernels into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launch triton written
kernels.

#### Build

To compile triton written kernel, add a script
`tools/ci_build/compile_triton.py`. This script will dynamic load all
kernel files, compile them, and generate `triton_kernel_infos.a` and
`triton_kernel_infos.h`.

`triton_kernel_infos.a` contains all compiled kernel instructions, this
file will be combined into libonnxruntime_providers_rocm.so, using
--whole-archive flag.

`triton_kernel_infos.h` defines a const array that contains all the
metadata for each compiled kernel. These metadata will be used for load
and launch. So this header file is included by 'triton_kernel.cu' which
defines load and launch functions.

Add a build flag in build.py and CMakeList.txt, when building rocm
provider, it will call triton_kernel build command, and generate all
necessary files.

#### C++ Load and Launch

On c++ part, we implement load and launch functions in triton_kernel.cu
and triton_kernel.h.

These two files located in `providers/cuda`, and when compiling rocm,
they will be hipified. so this part supports both cuda and rocm. But
currently we only call triton kernel in rocm.

We also implement a softmax triton op for example. Because there will
generate many kernels for different input shape of softmax, we use
TunableOp to select the best one.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-17 09:35:28 +08:00
Akash
1079df6aaa
Update StableDiffusion path after cloning repo (#15948)
### Description
Correct path to SD files in README



### Motivation and Context
Small typo in path
2023-05-16 08:39:27 -07:00
Prathik Rao
a0ccb95f3c
add option to load pretrained weights for T5 model (#15951)
### Description
<!-- Describe your changes. -->

Adds option to pass in pretrained weights file during T5 inference onnx
export. Mimics the changes made to whisper:
https://github.com/microsoft/onnxruntime/pull/15759

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Required for ONNX Runtime demo being presented at BUILD.
2023-05-15 22:52:35 -07:00
kunal-vaishnavi
5b663d6797
Whisper Multitask and Multilingual (#15936)
### Description
This PR enables Whisper's multitask format and allows a user to use
Whisper for multiple tasks (e.g. transcription, translation) and for
multilingual purposes (e.g. English, Spanish). This PR also removes
`attention_mask` as a required input for Whisper with beam search.

### Usage
Here is an example of how you can use Whisper for English transcription.
```
import numpy as np
import onnxruntime as ort

from datasets import load_dataset
from transformers import AutoConfig, AutoProcessor

model = "openai/whisper-tiny"
config = AutoConfig.from_pretrained(model)
processor = AutoProcessor.from_pretrained(model)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
# forced_decoder_ids is of the format [(1, 50259), (2, 50359), (3, 50363)] and needs to be 
# of the format [50258, 50259, 50359, 50363] where 50258 is the start token id
forced_decoder_ids = [config.decoder_start_token_id] + list(map(lambda token: token[1], forced_decoder_ids))

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
input_features = processor(ds[0]["audio"]["array"], return_tensors="np").input_features

inputs = {
  "input_features": np.float32(input_features),
  "max_length": np.array([26], dtype=np.int32),
  "min_length": np.array([1], dtype=np.int32),
  "num_beams": np.array([2], dtype=np.int32),
  "num_return_sequences": np.array([1], dtype=np.int32),
  "length_penalty": np.array([1.0], dtype=np.float32),
  "repetition_penalty": np.array([1.0], dtype=np.float32),
  "decoder_input_ids": np.array([forced_decoder_ids], dtype=np.int32),
}
sess = ort.InferenceSession("whisper-tiny_beamsearch.onnx", providers=["CPUExecutionProvider"])
outputs = sess.run(None, inputs)

# Print tokens and decoded output
print(outputs[0][0][0])
print(processor.decode(outputs[0][0][0]))
```

If you don't want to provide specific decoder input ids or you want
Whisper to predict the output language and task, you can set
`forced_decoder_ids = [config.decoder_start_token_id]` instead.

### Motivation and Context

As seen in the figure below from the [OpenAI Whisper
paper](https://cdn.openai.com/papers/whisper.pdf), Whisper can be used
for multiple tasks and languages.

![Screenshot 2023-05-12
165215](https://github.com/microsoft/onnxruntime/assets/115581922/49335e39-a79c-4f78-92e9-89b034405f65)
2023-05-15 14:36:33 -07:00
Ye Wang
3418ca28a8
pack qkv in t5 decoder (#15801)
### Description
<!-- Describe your changes. -->

V100, b_4_s_128, max_output_len=64, beam=4

before:
t5_small: 101.28ms
t5_base:  200.07ms

after:
t5_small: 87.65ms
t5_base: 174.44ms



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-05-15 13:45:39 -07:00
Chester Liu
984dd02df3
Update optimize_pipeline.py to use __name__ detection (#15866)
### Description
<!-- Describe your changes. -->

Use `__name__` detection in `optimize_pipeline.py`.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

It prevents unwanted execution of `main` when importing the file.
2023-05-12 20:43:29 -07:00
petermcaughan
e5189330d5
Address OOM Issue when exporting Whisper (#15880)
### Description
Remove attention_mask from unnecessary code paths in the whisper export
process.

### Motivation and Context
Current export script frequently hits OOM error when export
whisper-large. Memory profiling shows that this is a result of
generating dummy inputs for the `encoder_attention_mask` input for a
model pass during exporting - in whisper-large, this dummy tensor can be
around 20GB in size.

`encoder_attention_mask` is ultimately a dummy input - it's just there
to satisfy certain BeamSearch requirements. Thus, we're currently
creating a 20GB tensor and passing it to the model, which then discards
the input anyways. By removing the code path to generate a dummy
encoder_mask tensor, we can reduce the memory requirements to export
whisper substantially, while keeping the BeamSearch checks satisfied.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-05-12 11:23:07 -07:00
Tianlei Wu
e0c1fa35a8
update stable diffusion script and doc (#15846)
### Description
Update script: 
(1) change some float16 verbose logging to debug level.
(2) Let requirements-cuda.txt includes requirements.txt
(3) Use an environment variable ORT_DISABLE_TRT_FLASH_ATTENTION=1 to
avoid black image in 2.1 model. Update benchmark and doc.
(4) Update document to include command lines to build ORT rocm from
source.
(5) Update optimize_pipeline.py so that user can disable packed qkv/kv
from command line options.
(6) Update document to use torch < 2.0 for onnx export.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-09 15:29:13 -07:00
Tianlei Wu
191ee1d3c0
Fix symbolic shape infer empty value_info (#15842)
### Description
When node output is optional, symbolic shape infer might add an empty
value_info item. Add some checking to avoid this.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- 
Stable diffusion optimized model reported invalid data type 0 during
inference.
2023-05-08 16:18:35 -07:00
Ted Themistokleous
42d62b8f2b
Fixes to get stable diffusion benchmark running (#15755)
### Description

Added changes to MIGraphX EP to suppoert stable diffusion

1. Added parameterized input dimensions to not trigger a precompile to
set input parameters in the EP
2. Removed input checking for Resize operator in EP as MIGraphX already
performs these checks
3. Add support to benchmark script to use the MIGraphX execution
provider
4. Add support for an odd valued batch size (3) that was seen on other
benchmarks we were performing comparison on.

### Motivation and Context

These changes are required to get stable diffusion mdoels to run on
MIGraphX through the EP. Without these changes we see the following
incorrect behavior.

1. Resize operators are pushed onto the CPU EP instead of MIGraphX,
causing a significant slowdown during runs
2. Precompile operations incorrectly parse input_ids parameter for our
text model, with a 1, which breaks during MIGraphX Compile of onnx. This
in turn throws an error and stops any setup before inference.
3. Selecting the correct EP in the benchmark script which was previously
missing the MIGraphX option
5. Suppressed an error we keep seeing with pthread_set_affinity - this
is a quality of life change when using the MIGraphX EP

This was testing with the benchmark.py script using stable diffusion v2
located in

onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2023-05-06 17:35:21 +08:00
BoarQing
272aab4afa
Fix issues on Windows for Vitis AI (#15810)
### Description
Fix two errors that is only encountered on windows


### Motivation and Context
For onnxruntime::VitisAIProviderFactoryCreator::Create, it would cause
the compile error.
For if (it == provider_options_map.end()), it would cause an error but
execute as normal

Co-authored-by: Zhang <yueqingz@amd.com>
2023-05-04 14:42:19 -07:00
Baiju Meswani
ba7b83ff3c
Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776) 2023-05-03 13:08:35 -07:00
Prathik Rao
090312af71
add local state dict option (#15759)
### Description

Adds an option to load local state dictionary for whisper model export.

### Motivation and Context

This is useful to demonstrate workflow of using ORT Training to get
model weights, downloading said weights onto a local gpu-enabled device,
exporting the custom model using `convert_to_onnx.py`, and then nicely
feeding the .onnx file into ORT InferenceSession.
2023-05-01 22:08:11 -07:00