Commit graph

15 commits

Author SHA1 Message Date
Tianlei Wu
09c98433e7
[CUDA] stable diffusion benchmark allows IO binding for optimum (#22834)
### Description

Update stable diffusion benchmark:
(1) allow IO binding for optimum.
(2) do not use num_images_per_prompt across all engines for fair
comparison.

Example to run benchmark of optimum on stable diffusion 1.5:
```
git clone https://github.com/tianleiwu/optimum
cd optimum
git checkout tlwu/diffusers-io-binding
pip install -e .

pip install -U onnxruntime-gpu
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion
git checkout tlwu/benchmark_sd_optimum_io_binding
pip install -r requirements/cuda12/requirements.txt

optimum-cli export onnx --model runwayml/stable-diffusion-v1-5  --task text-to-image ./sd_onnx_fp32

python optimize_pipeline.py -i ./sd_onnx_fp32 -o ./sd_onnx_fp16 --float16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16 --use_io_binding
```

Example output on H100_80GB_HBM3: 572 ms with IO binding; 588 ms without
IO binding. IO binding gains 16 ms, or 2.7%.

### Motivation and Context

Optimum is working on enabling I/O binding:
https://github.com/huggingface/optimum/pull/2056. This change could help
test the impact of I/O binding on the performance of Stable
Diffusion.
2024-11-14 00:09:07 -08:00
Bowen Bao
742595b885
Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853)
### Description
Always set `use_io_binding=True` when using optimum.onnxruntime unless
there is a special case.


### Motivation and Context
By default, `ORTModel` under optimum.onnxruntime will choose the
appropriate `use_io_binding` value based on provider and use cases.

> use_io_binding (`Optional[bool]`, defaults to `None`):
> Whether to use IOBinding during inference to avoid memory copy between
> the host and device, or between numpy/torch tensors and ONNX Runtime
> ORTValue. Defaults to `True` if the execution provider is
> CUDAExecutionProvider. For [~onnxruntime.ORTModelForCausalLM],
> defaults to `True` on CPUExecutionProvider; in all other cases
> defaults to `False`.
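
The defaulting rule quoted above can be sketched as a small function. This is only an illustration of the documented behavior, not Optimum's actual code; the helper name `resolve_use_io_binding` and its `is_causal_lm` parameter are hypothetical.

```python
from typing import Optional


def resolve_use_io_binding(
    use_io_binding: Optional[bool],
    provider: str,
    is_causal_lm: bool = False,
) -> bool:
    """Hypothetical helper mirroring the documented defaults."""
    # An explicit True/False from the caller always wins.
    if use_io_binding is not None:
        return use_io_binding
    # Default to True on the CUDA execution provider.
    if provider == "CUDAExecutionProvider":
        return True
    # ORTModelForCausalLM also defaults to True on CPU.
    if is_causal_lm and provider == "CPUExecutionProvider":
        return True
    # All other cases default to False.
    return False


print(resolve_use_io_binding(None, "CPUExecutionProvider", is_causal_lm=True))
```

Passing `use_io_binding=True` explicitly, as this change does, skips the provider-based decision altogether.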

For the Llama token benchmark, using iobinding yields about a 1.7x
speedup, even on CPU. This is because this particular model yields a
large number of outputs (>60). Without iobinding, a copy is performed
for each output from ortvalue to numpy array. This adds significant
overhead to the overall run time.

```
Evaluating Llama2 `model(inputs)` step with past_key_values

Before, w/o iobinding on cpu

Batch Size: 1
Sequence Length: 512
Latency: 0.4518657898902893 s
Throughput: 2.2130464894073856 tps

After, w/ iobinding on cpu

Batch Size: 1
Sequence Length: 512
Latency: 0.2662619352340698 s
Throughput: 3.7557001871893703 tps
```
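
As a quick check, the speedup can be recomputed from the two latencies reported above. The variable names are illustrative only.

```python
# Latencies (seconds) copied from the benchmark output above.
latency_without_iobinding_s = 0.4518657898902893
latency_with_iobinding_s = 0.2662619352340698

# Speedup is the ratio of the old latency to the new latency.
speedup = latency_without_iobinding_s / latency_with_iobinding_s

print(f"speedup: {speedup:.3f}x")
```

The ratio lands just under 1.70, consistent with the 1.69x quoted in the commit title.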
2024-03-12 09:41:11 -07:00
Tianlei Wu
2d6e2e243d
update sdxl demo (#18889)
### Description
(1) Support importing model from Olive.
(2) Add backend engine Torch (Eager and Compile modes) to the demo.
(3) Use fp16 in most places.
(4) Remove some old pipeline scripts that are not useful anymore. They
are replaced by the demo.
(5) Remove old benchmark results that are out of date.
(6) Add PIL image conversion to end to end latency (for fair comparison
with diffusers since the default output type is pil)
(7) Remove some seldom-used options such as force-rebuild-engine,
hf-token, refit etc.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-20 14:46:22 -08:00
Tianlei Wu
59ae3fdfdc
[CUDA] StableDiffusion XL demo with CUDA EP (#17997)
Add CUDA EP to the StableDiffusion XL Demo including:
(1) Add fp16 VAE support for CUDA EP.
(2) Configuration for each model separately (For example, some models
can run with CUDA graph but some models cannot).

Remaining work that should boost performance further:
(1) Enable CUDA Graph for Clip2 and UNet. Currently, part of the graph
is partitioned to CPU, which blocks CUDA graph.
(2) Update the GroupNorm CUDA kernel for the refiner. Currently, the CUDA
kernel only supports a limited number of channels in the refiner, so we
should see some gain there once the limitation is removed.

Extra work that is nice to have (thus lower priority):
(3) Support denoising_end to ensemble base and refiner.
(4) Support classifier free guidance (The idea is from
https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/).


#### Performance on A100-SXM4-80GB

Example commands to test an engine built with static shape or dynamic
shape:
```
engine_name=ORT_CUDA
python demo_txt2img_xl.py --engine $engine_name "some prompt"
python demo_txt2img_xl.py --engine $engine_name --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "some prompt"
```
An engine built with dynamic shape supports different batch sizes (1 to
4 for TRT; 1 to 16 for CUDA) and image sizes (256x256 to 1024x1024).
An engine built with static shape supports only a fixed batch size (1)
and image size (1024x1024).
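
The limits above can be written down as a small check. This is a sketch under assumptions: the function name and the `backend` keys are hypothetical, and it assumes height and width may each vary independently within 256 to 1024, which the text does not state explicitly.

```python
# Maximum batch size for engines built with dynamic shape (from the text above).
MAX_DYNAMIC_BATCH = {"TRT": 4, "CUDA": 16}


def is_request_supported(backend, dynamic_shape, batch_size, height, width):
    """Hypothetical helper: can an engine built this way serve the request?"""
    if not dynamic_shape:
        # Static-shape engines are fixed to batch size 1 and 1024x1024.
        return batch_size == 1 and (height, width) == (1024, 1024)
    # Assumption: each image side may vary independently in [256, 1024].
    return (
        1 <= batch_size <= MAX_DYNAMIC_BATCH[backend]
        and 256 <= height <= 1024
        and 256 <= width <= 1024
    )


print(is_request_supported("CUDA", True, 8, 512, 512))
print(is_request_supported("TRT", True, 8, 512, 512))
```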

The latency (ms) of generating an image of size 1024x1024 (sorted by
total latency):

 Engine | Base (30 Steps)* | Refiner (9 Steps) | Total Latency (ms)
-- | -- | -- | --
ORT_TRT (static shape) | 2467 | 1033 | 3501
TRT (static shape) | 2507 | 1048 | 3555
ORT_CUDA (static shape) | 2630 | 1015 | 3645
ORT_CUDA (dynamic shape) | 2639 | 1016 | 3654
TRT (dynamic shape) | 2777 | 1099 | 3876
ORT_TRT (dynamic shape) | 2890 | 1166 | 4057

\* VAE decoder is not used in Base since the output from base is latent,
which is consumed by refiner to output image.

We can see that ORT_CUDA is faster than the TensorRT-based engines with
dynamic shape, but slower with static shape (the cause is that Clip2 and
UNet cannot run with CUDA Graph right now; we will address the issue later).
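
The comparison can be reproduced from the total latencies in the table. The script below is only a sanity check on the published numbers, and the helper name is illustrative.

```python
# Total latency (ms) for a 1024x1024 image, copied from the table above.
total_latency_ms = {
    ("ORT_TRT", "static"): 3501,
    ("TRT", "static"): 3555,
    ("ORT_CUDA", "static"): 3645,
    ("ORT_CUDA", "dynamic"): 3654,
    ("TRT", "dynamic"): 3876,
    ("ORT_TRT", "dynamic"): 4057,
}


def relative_diff_pct(engine, baseline, shape):
    """Percent difference of `engine` vs `baseline`; negative means faster."""
    a = total_latency_ms[(engine, shape)]
    b = total_latency_ms[(baseline, shape)]
    return 100.0 * (a - b) / b


for shape in ("static", "dynamic"):
    for baseline in ("TRT", "ORT_TRT"):
        diff = relative_diff_pct("ORT_CUDA", baseline, shape)
        print(f"{shape:8s} ORT_CUDA vs {baseline:8s}: {diff:+.1f}%")
```

Positive values in the static rows and negative values in the dynamic rows match the statement above.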

### Motivation and Context
Follow up of https://github.com/microsoft/onnxruntime/pull/17536
2023-10-17 21:30:04 -07:00
Tianlei Wu
a05580ed5b
StableDiffusion XL with TensorRT EP (#17748)
Accelerate StableDiffusion XL with TensorRT EP. It is modified from
TensorRT demo diffusion, and we updated the design to make the pipeline
work with different backend engines.

The following result is from A100 80GB with 30 steps of Base, or 30
steps Base & 30 Steps Refiner to generate 1024x1024 images. The engine
is built with static input shape, and cuda graph is enabled.

  | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | Diff
-- | -- | -- | -- | --
Base | 1 | 2714 | 2679 | -1.3%
Base & Refiner | 1 | 3593 | 3530 | -1.8%
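
The Diff column appears to be the relative latency change of ORT_TRT against TRT; that reading is an assumption, but it reproduces the table values, as this small check shows.

```python
# (TRT latency ms, ORT_TRT latency ms) copied from the table above.
latencies_ms = {
    "Base": (2714, 2679),
    "Base & Refiner": (3593, 3530),
}

diff_pct = {}
for name, (trt_ms, ort_trt_ms) in latencies_ms.items():
    # Negative means ORT_TRT is faster than TRT.
    diff_pct[name] = 100.0 * (ort_trt_ms - trt_ms) / trt_ms
    print(f"{name}: {diff_pct[name]:+.1f}%")
```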

The test environment: onnxruntime-gpu is built from source, and the following packages or
libraries are used in this test:
* tensorrt==8.6.1.post1
* torch==2.2.0.dev20230920+cu121
* transformers==4.31.0
* diffusers==0.19.3
* onnx==1.14.1
* onnx-graphsurgeon==0.3.27
* polygraphy==0.47.1
* protobuf==3.20.2
* onnxruntime-gpu==1.17.0 (built from source of main branch)
* CUDA 12.2.2
* cuDNN 8.9.5.29
* python 3.10.13
2023-10-04 08:01:39 -07:00
Tianlei Wu
b8f6235f11
Update stable diffusion benchmark for TensorRT EP (#16560)
### Description

Add Stable Diffusion Text2Image pipelines for TensorRT EP and CUDA EP.
They can automatically export and optimize the ONNX model, and create an
ONNX Runtime session that uses the TensorRT or CUDA execution provider.

Add support for benchmarking TensorRT.

Add support of cuda graph. The feature is only supported in nightly
package right now.

Engine/Provider to test | command line
---- | ---
CUDA EP | `python benchmark.py -v 1.5`
CUDA EP with cuda graph | `python benchmark.py -v 1.5 --enable_cuda_graph`
TensorRT EP | `python benchmark.py -v 1.5 -r tensorrt`
TensorRT EP with cuda graph | `python benchmark.py -v 1.5 -r tensorrt --enable_cuda_graph`
TensorRT | `python benchmark.py -v 1.5 -e tensorrt`
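
For scripting several runs, the command lines in the table can be assembled programmatically. The helper below is a hypothetical convenience wrapper; it uses only the flags shown in the table (`-v`, `-r`, `-e`, `--enable_cuda_graph`).

```python
def benchmark_command(version="1.5", provider=None, engine=None, cuda_graph=False):
    """Build a benchmark.py command line from the options in the table."""
    cmd = ["python", "benchmark.py", "-v", version]
    if provider is not None:
        cmd += ["-r", provider]  # execution provider, e.g. tensorrt
    if engine is not None:
        cmd += ["-e", engine]  # engine, e.g. tensorrt
    if cuda_graph:
        cmd.append("--enable_cuda_graph")
    return cmd


for options in (
    {},
    {"cuda_graph": True},
    {"provider": "tensorrt"},
    {"provider": "tensorrt", "cuda_graph": True},
    {"engine": "tensorrt"},
):
    print(" ".join(benchmark_command(**options)))
```

Each list can be passed to `subprocess.run` from the directory that contains `benchmark.py`.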

Add benchmark numbers of T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch
1.13.1+cu11.7, TensorRT 8.6.1, onnxruntime-gpu 1.15.1 (or
ort-nightly-gpu 1.16 for cuda graph).

TODO: add benchmark numbers of A100-80GB

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-10 09:51:03 -07:00
Tianlei Wu
e0c1fa35a8
update stable diffusion script and doc (#15846)
### Description
Update script: 
(1) change some float16 verbose logging to debug level.
(2) Let requirements-cuda.txt include requirements.txt
(3) Use an environment variable ORT_DISABLE_TRT_FLASH_ATTENTION=1 to
avoid black image in 2.1 model. Update benchmark and doc.
(4) Update document to include command lines to build ORT rocm from
source.
(5) Update optimize_pipeline.py so that user can disable packed qkv/kv
from command line options.
(6) Update document to use torch < 2.0 for onnx export.
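
The environment variable from item (3) can be set for a single benchmark run without touching the parent shell. This is a minimal sketch: the helper name is hypothetical, and the `-v 2.1` value in the commented command is an assumption about how the 2.1 model is selected.

```python
import os


def env_with_trt_flash_attention_disabled(base_env=None):
    """Return a copy of the environment with ORT_DISABLE_TRT_FLASH_ATTENTION=1."""
    env = dict(os.environ if base_env is None else base_env)
    env["ORT_DISABLE_TRT_FLASH_ATTENTION"] = "1"
    return env


env = env_with_trt_flash_attention_disabled()
print(env["ORT_DISABLE_TRT_FLASH_ATTENTION"])

# Hypothetical usage, run from the stable_diffusion script directory:
# import subprocess
# subprocess.run(["python", "benchmark.py", "-v", "2.1"], env=env, check=True)
```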

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-09 15:29:13 -07:00
Ted Themistokleous
42d62b8f2b
Fixes to get stable diffusion benchmark running (#15755)
### Description

Added changes to MIGraphX EP to support stable diffusion.

1. Added parameterized input dimensions to not trigger a precompile to
set input parameters in the EP
2. Removed input checking for Resize operator in EP as MIGraphX already
performs these checks
3. Add support to benchmark script to use the MIGraphX execution
provider
4. Add support for an odd valued batch size (3) that was seen on other
benchmarks we were performing comparison on.

### Motivation and Context

These changes are required to get stable diffusion models to run on
MIGraphX through the EP. Without these changes we see the following
incorrect behavior.

1. Resize operators are pushed onto the CPU EP instead of MIGraphX,
causing a significant slowdown during runs
2. Precompile operations incorrectly parse input_ids parameter for our
text model, with a 1, which breaks during MIGraphX Compile of onnx. This
in turn throws an error and stops any setup before inference.
3. Selecting the correct EP in the benchmark script which was previously
missing the MIGraphX option
4. Suppressed an error we keep seeing with pthread_set_affinity - this
is a quality of life change when using the MIGraphX EP

This was tested with the benchmark.py script using stable diffusion v2
located in

onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2023-05-06 17:35:21 +08:00
cloudhan
8297148bde
[ROCm] Update benchmark for stable diffusion (#15602)
1. update scripts for ROCm memory measurement.
2. update README to contain ROCm result.
3. address some minor issues in the README
2023-04-23 11:49:40 +08:00
PeixuanZuo
a6279d4cfb
[ROCm] update Stable Diffusion benchmark to support ROCm EP (#15094)
Update Stable Diffusion benchmark to support ROCm EP
2023-03-29 15:19:52 +08:00
Tianlei Wu
c66af46fc1
Doc for Stable Diffusion CUDA Optimizations (#14830)
Add document for stable diffusion optimizations and benchmark.
2023-03-01 19:29:30 -08:00
Tianlei Wu
262e46e8ce
Update stable diffusion benchmark script (#14759)
Update stable diffusion benchmark script:
(1) Test GPU memory usage
(2) Change diffusers version to 0.13, and add support of PyTorch 2.0
including compile
(3) Add support of xformers
(4) Output result to CSV file

Example to run PyTorch 2.0 with torch.compile:
```
pip3 install numpy --pre torch --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117
export TRITON_PTXAS_PATH=/usr/local/cuda-11.7/bin/ptxas
python benchmark.py -e torch -v 1.5 -c 5 -n 1 -b 1 --enable_torch_compile
```
2023-02-21 23:37:38 -08:00
Tianlei Wu
f638c5a2ae
Stable Diffusion CUDA Optimizations Part 3 (#14646)
The third part for stable diffusion CUDA optimizations
(1) Add BiasAdd operator to replace two Add (bias and residual); Add
fusion for BiasAdd
(2) Add Attention fusion for VAE decoder.
(3) Update float16 conversion to handle Resize and GroupNorm. This could
reduce two Cast nodes for each Resize op in fp16 model.
(4) Force inputs and outputs to be float16 to avoid data casts in the
pipeline.
(5) Add options --force_fp32_ops, --inspect etc in optimize script so that
user could force some operator to run in float32 to potentially get
better image quality (with cost of performance).
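
Item (1) can be illustrated with a plain-Python reference. It assumes the fused BiasAdd computes `input + bias + residual` elementwise, with the bias broadcast over the channel dimension; the function names are illustrative and this is not the CUDA kernel.

```python
def add_rows(x, y):
    """Elementwise add of two 2D lists with the same shape."""
    return [[a + b for a, b in zip(xr, yr)] for xr, yr in zip(x, y)]


def add_bias(x, bias):
    """Add a per-channel bias (length C) to every row of x."""
    return [[a + b for a, b in zip(xr, bias)] for xr in x]


def bias_add(x, bias, skip):
    """Fused form (assumed semantics): x + bias + skip in a single pass."""
    return [
        [a + b + s for a, b, s in zip(xr, bias, sr)]
        for xr, sr in zip(x, skip)
    ]


x = [[1, 2, 3], [4, 5, 6]]          # rows x channels
bias = [10, 20, 30]                 # one value per channel
skip = [[100, 100, 100], [200, 200, 200]]  # residual input

unfused = add_rows(add_bias(x, bias), skip)  # two separate Add nodes
fused = bias_add(x, bias, skip)              # one BiasAdd node
print(fused == unfused)
```

The fused form reads each input once and writes the output once, where the two-Add version materializes an intermediate tensor.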

Performance tests show a slight improvement on T4. Average latency is
reduced by 0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.
2023-02-14 12:46:50 -08:00
Tianlei Wu
742658d171
Stable Diffusion CUDA optimizations Part 2 (#14597)
### Description
This is a follow-up of
https://github.com/microsoft/onnxruntime/pull/14428 for Stable Diffusion
CUDA optimizations:
(1) use NhwcConv to replace Conv in onnx graph and add Transpose nodes
accordingly
(2) reduce sequential Transpose nodes to at most one.
(3) symbolic shape infer of NhwcConv
(4) fix add bias transpose which causes CUDA error (launching more than
1024 threads per block) in inferencing fp32 model.
(5) add models (bert, bart, stable_diffusion subdirectories) to package;
(6) remove option --disable_channels_last
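
Item (2) relies on the fact that consecutive Transpose nodes compose into a single permutation, and that an identity permutation can be dropped. The sketch below shows the arithmetic only; it is not the graph transformation in this PR, and the function names are illustrative.

```python
def transpose_shape(shape, perm):
    """Output shape of Transpose: output axis i takes input axis perm[i]."""
    return [shape[axis] for axis in perm]


def compose_perms(first, second):
    """Single permutation equivalent to applying `first`, then `second`."""
    return [first[axis] for axis in second]


def is_identity(perm):
    return list(perm) == list(range(len(perm)))


NCHW_TO_NHWC = [0, 2, 3, 1]
NHWC_TO_NCHW = [0, 3, 1, 2]

# A layout round trip cancels out, so both Transpose nodes can be removed.
round_trip = compose_perms(NCHW_TO_NHWC, NHWC_TO_NCHW)
print(round_trip, is_identity(round_trip))

# Two non-cancelling transposes still merge into one node.
merged = compose_perms(NCHW_TO_NHWC, NCHW_TO_NHWC)
shape = [1, 4, 8, 16]
print(transpose_shape(shape, merged))
```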

Note that 
(1) We can add a few graph transformations to reduce Transpose nodes
further. It is not done in this PR due to time limit.
(2) Stable diffusion 2.1 model outputs black images. It seems that
forcing Attention to float32 could avoid the issue. However, float32
Attention is much slower.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-02-07 07:49:15 -08:00
Tianlei Wu
a6c5ba0185
Stable Diffusion CUDA Optimizations (#14428)
### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of
TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv
operator uses NCHW format).
(4) Update MultiHeadAttention (packed kv and no bias) for cross
attention. This could avoid transpose of kv for TRT fused cross
attention kernel.
(5) Optimization and benchmark script

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not be
applicable to any input size or dimensions. For example, BiasSplitGelu
requires hidden size to be 2560 | 5120 | 10240, and NhwcConv assumes 4D
input/weight.
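
For orientation, here is a plain-Python reference of what BiasSplitGelu is understood to compute: add the bias, split the last dimension into two halves, and multiply the left half by GELU of the right half. The semantics are an assumption based on the SplitGelu description above, the function names are illustrative, and the sketch does not enforce the hidden-size limits of the CUDA kernel.

```python
import math


def gelu(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_split_gelu(row, bias):
    """Reference for one row (assumed semantics): left * gelu(right) after bias."""
    if len(row) != len(bias) or len(row) % 2 != 0:
        raise ValueError("row and bias must have the same even length")
    biased = [x + b for x, b in zip(row, bias)]
    half = len(biased) // 2
    left, right = biased[:half], biased[half:]
    return [l * gelu(r) for l, r in zip(left, right)]


out = bias_split_gelu([1.0, 2.0, -1.0, 2.0], [0.0, 0.0, 1.0, 1.0])
print(out)
```

The output has half as many channels as the input, which is why the kernel constraints are stated in terms of specific hidden sizes.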

There is a minor increase in binary size. For SM=75 only, the python
package wheel size grows by (33757K - 33640K) = 117 KB. It is possible to
move NHWC from template parameter to constructor to reduce binary size
(with slight cost of performance).

Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest
cuDNN to get the best performance.
2023-02-02 23:43:51 -08:00