onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-05 04:17:53 +00:00

Author	SHA1	Message	Date
Tianlei Wu	09c98433e7	[CUDA] stable diffusion benchmark allows IO binding for optimum (#22834)

### Description

Update stable diffusion benchmark:
(1) allow IO binding for optimum.
(2) do not use num_images_per_prompt across all engines for fair comparison.

Example to run benchmark of optimum on stable diffusion 1.5:
```
git clone https://github.com/tianleiwu/optimum
cd optimum
git checkout tlwu/diffusers-io-binding
pip install -e .
pip install -U onnxruntime-gpu
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion
git checkout tlwu/benchmark_sd_optimum_io_binding
pip install -r requirements/cuda12/requirements.txt
optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 --task text-to-image ./sd_onnx_fp32
python optimize_pipeline.py -i ./sd_onnx_fp32 -o ./sd_onnx_fp16 --float16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16 --use_io_binding
```

Example output on H100_80GB_HBM3: 572 ms with IO binding; 588 ms without IO binding; IO binding gains 16 ms, or 2.7%.

### Motivation and Context

Optimum is working on enabling I/O binding: https://github.com/huggingface/optimum/pull/2056. This could help test the impact of I/O binding on the performance of stable diffusion.	2024-11-14 00:09:07 -08:00
Bowen Bao	742595b885	Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853)

### Description

Always set `use_io_binding=True` when using optimum.onnxruntime unless there is a special case.

### Motivation and Context

By default, `ORTModel` under optimum.onnxruntime will choose the appropriate `use_io_binding` value based on provider and use cases.

> use_io_binding (`Optional[bool]`, defaults to `None`):
> Whether to use IOBinding during inference to avoid memory copy between the host and device, or between numpy/torch tensors and ONNX Runtime ORTValue. Defaults to
> `True` if the execution provider is CUDAExecutionProvider. For [~onnxruntime.ORTModelForCausalLM], defaults to `True` on CPUExecutionProvider,
> in all other cases defaults to `False`.

For the Llama token benchmark, using iobinding yields a roughly 1.7x speedup, even on CPU. This is because this particular model yields a large number of outputs (>60). Without iobinding, a copy is performed for each output from ortvalue to numpy array. This adds significant overhead to the overall run time.

```
Evaluating Llama2 `model(inputs)` step with past_key_values

Before, w/o iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.4518657898902893 s
Throughput: 2.2130464894073856 tps

After, w/ iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.2662619352340698 s
Throughput: 3.7557001871893703 tps
```
	2024-03-12 09:41:11 -07:00
Tianlei Wu	2d6e2e243d	update sdxl demo (#18889)

### Description

(1) Support importing model from Olive.
(2) Add backend engine Torch (Eager and Compile modes) to the demo.
(3) Use fp16 in most places.
(4) Remove some old pipeline scripts that are not useful anymore. They are replaced by the demo.
(5) Remove old benchmark results that are out of date.
(6) Add PIL image conversion to end-to-end latency (for fair comparison with diffusers since the default output type is pil).
(7) Remove some options that are seldom used, like force-rebuild-engine, hf-token, refit, etc.	2023-12-20 14:46:22 -08:00
Tianlei Wu	59ae3fdfdc	[CUDA] StableDiffusion XL demo with CUDA EP (#17997)

Add CUDA EP to the StableDiffusion XL Demo including:
(1) Add fp16 VAE support for CUDA EP.
(2) Configuration for each model separately (for example, some models can run with CUDA graph but some models cannot).

Some remaining work that will boost performance further:
(1) Enable CUDA Graph for Clip2 and UNet. Currently, some parts of the graph are partitioned to CPU, which blocks CUDA graph.
(2) Update GroupNorm CUDA kernel for refiner. Currently, the CUDA kernel only supports a limited number of channels in refiner, so we should see some gain there if we remove the limitation.

Some extra work that is nice to have (thus lower priority):
(3) Support denoising_end to ensemble base and refiner.
(4) Support classifier free guidance (the idea is from https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/).

#### Performance on A100-SXM4-80GB

Example commands to test an engine built with static shape or dynamic shape:
```
engine_name=ORT_CUDA
python demo_txt2img_xl.py --engine $engine_name "some prompt"
python demo_txt2img_xl.py --engine $engine_name --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "some prompt"
```
An engine built with dynamic shape can support different batch sizes (1 to 4 for TRT; 1 to 16 for CUDA) and image sizes (256x256 to 1024x1024). An engine built with static shape supports only a fixed batch size (1) and image size (1024x1024).

The latency (ms) of generating an image of size 1024x1024 (sorted by total latency):

Engine | Base (30 Steps)* | Refiner (9 Steps) | Total Latency (ms)
-- | -- | -- | --
ORT_TRT (static shape) | 2467 | 1033 | 3501
TRT (static shape) | 2507 | 1048 | 3555
ORT_CUDA (static shape) | 2630 | 1015 | 3645
ORT_CUDA (dynamic shape) | 2639 | 1016 | 3654
TRT (dynamic shape) | 2777 | 1099 | 3876
ORT_TRT (dynamic shape) | 2890 | 1166 | 4057

\* VAE decoder is not used in Base since the output from base is latent, which is consumed by refiner to output image.

We can see that ORT_CUDA is faster than TRT and ORT_TRT with dynamic shape, but slower than both with static shape (the cause is that Clip2 and UNet cannot run with CUDA Graph right now; we will address the issue later).

### Motivation and Context

Follow up of https://github.com/microsoft/onnxruntime/pull/17536	2023-10-17 21:30:04 -07:00
Tianlei Wu	a05580ed5b	StableDiffusion XL with TensorRT EP (#17748)

Accelerate StableDiffusion XL with TensorRT EP. It is modified from TensorRT demo diffusion, and we updated the design to make the pipeline work with different backend engines.

The following result is from A100 80GB with 30 steps of Base, or 30 steps Base & 30 Steps Refiner to generate 1024x1024 images. The engine is built with static input shape, and cuda graph is enabled.

Pipeline | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | Diff
-- | -- | -- | -- | --
Base | 1 | 2714 | 2679 | -1.3%
Base & Refiner | 1 | 3593 | 3530 | -1.8%

The test environment: onnxruntime-gpu is built from source, and the following packages or libraries are used in this test:
* tensorrt==8.6.1.post1
* torch==2.2.0.dev20230920+cu121
* transformers==4.31.0
* diffusers==0.19.3
* onnx==1.14.1
* onnx-graphsurgeon==0.3.27
* polygraphy==0.47.1
* protobuf==3.20.2
* onnxruntime-gpu==1.17.0 (built from source of main branch)
* CUDA 12.2.2
* cuDNN 8.9.5.29
* python 3.10.13	2023-10-04 08:01:39 -07:00
Tianlei Wu	b8f6235f11	Update stable diffusion benchmark for TensorRT EP (#16560)

### Description

Add Stable Diffusion Text2Image pipelines of TensorRT EP and CUDA EP. They can automatically export and optimize the ONNX model, and create an ONNX Runtime session that uses the TensorRT EP or the CUDA execution provider.

Add support for benchmarking TensorRT.

Add support of cuda graph. The feature is only supported in the nightly package right now.

Engine/Provider to test | command line
---- | ---
CUDA EP | `python benchmark.py -v 1.5`
CUDA EP with cuda graph | `python benchmark.py -v 1.5 --enable_cuda_graph`
TensorRT EP | `python benchmark.py -v 1.5 -r tensorrt`
TensorRT EP with cuda graph | `python benchmark.py -v 1.5 -r tensorrt --enable_cuda_graph`
TensorRT | `python benchmark.py -v 1.5 -e tensorrt`

Add benchmark numbers of T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch 1.13.1+cu11.7, TensorRT 8.6.1, onnxruntime-gpu 1.15.1 (or ort-nightly-gpu 1.16 for cuda graph).

TODO: add benchmark numbers of A100-80GB	2023-07-10 09:51:03 -07:00
Tianlei Wu	e0c1fa35a8	update stable diffusion script and doc (#15846)

### Description

Update script:
(1) Change some float16 verbose logging to debug level.
(2) Let requirements-cuda.txt include requirements.txt.
(3) Use an environment variable ORT_DISABLE_TRT_FLASH_ATTENTION=1 to avoid black image in 2.1 model.

Update benchmark and doc.
(4) Update document to include command lines to build ORT rocm from source.
(5) Update optimize_pipeline.py so that users can disable packed qkv/kv from command line options.
(6) Update document to use torch < 2.0 for onnx export.	2023-05-09 15:29:13 -07:00
Ted Themistokleous	42d62b8f2b	Fixes to get stable diffusion benchmark running (#15755)

### Description

Added changes to MIGraphX EP to support stable diffusion:
1. Added parameterized input dimensions to not trigger a precompile to set input parameters in the EP
2. Removed input checking for Resize operator in EP as MIGraphX already performs these checks
3. Add support to benchmark script to use the MIGraphX execution provider
4. Add support for an odd valued batch size (3) that was seen on other benchmarks we were performing comparison on.

### Motivation and Context

These changes are required to get stable diffusion models to run on MIGraphX through the EP. Without these changes we see the following incorrect behavior:
1. Resize operators are pushed onto the CPU EP instead of MIGraphX, causing a significant slowdown during runs
2. Precompile operations incorrectly parse the input_ids parameter for our text model, with a 1, which breaks during MIGraphX compile of onnx. This in turn throws an error and stops any setup before inference.
3. Selecting the correct EP in the benchmark script, which was previously missing the MIGraphX option
4. Suppressed an error we keep seeing with pthread_set_affinity - this is a quality of life change when using the MIGraphX EP

This was tested with the benchmark.py script using stable diffusion v2 located in onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion/

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2023-05-06 17:35:21 +08:00
cloudhan	8297148bde	[ROCm] Update benchmark for stable diffusion (#15602) 1. update scripts for ROCm memory measurement. 2. update README to contain ROCm result. 3. address some minor issues in the README	2023-04-23 11:49:40 +08:00
PeixuanZuo	a6279d4cfb	[ROCm] update Stable Diffusion benchmark to support ROCm EP (#15094) Update Stable Diffusion benchmark to support ROCm EP	2023-03-29 15:19:52 +08:00
Tianlei Wu	c66af46fc1	Doc for Stable Diffusion CUDA Optimizations (#14830) Add document for stable diffusion optimizations and benchmark.	2023-03-01 19:29:30 -08:00
Tianlei Wu	262e46e8ce	Update stable diffusion benchmark script (#14759)

Update stable diffusion benchmark script:
(1) Test GPU memory usage
(2) Change diffusers version to 0.13, and add support of PyTorch 2.0 including compile
(3) Add support of xformers
(4) Output result to CSV file

Example to run PyTorch 2.0 with torch.compile:
```
pip3 install numpy --pre torch --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117
export TRITON_PTXAS_PATH=/usr/local/cuda-11.7/bin/ptxas
python benchmark.py -e torch -v 1.5 -c 5 -n 1 -b 1 --enable_torch_compile
```
	2023-02-21 23:37:38 -08:00
Tianlei Wu	f638c5a2ae	Stable Diffusion CUDA Optimizations Part 3 (#14646)

The third part for stable diffusion CUDA optimizations:
(1) Add BiasAdd operator to replace two Add (bias and residual); add fusion for BiasAdd.
(2) Add Attention fusion for VAE decoder.
(3) Update float16 conversion to handle Resize and GroupNorm. This could reduce two Cast nodes for each Resize op in fp16 model.
(4) Force inputs and outputs to be float16 to avoid data casts in the pipeline.
(5) Add options --force_fp32_ops, --inspect etc. in the optimize script so that users can force some operators to run in float32 to potentially get better image quality (at the cost of performance).

Performance tests show slight improvement on T4. Average latency is reduced by 0.1 seconds (from 5.35 s to 5.25 s) for 512x512 images in 50 steps.	2023-02-14 12:46:50 -08:00
Tianlei Wu	742658d171	Stable Diffusion CUDA optimizations Part 2 (#14597)

### Description

This is a follow-up of https://github.com/microsoft/onnxruntime/pull/14428 for Stable Diffusion CUDA optimizations:
(1) use NhwcConv to replace Conv in onnx graph and add Transpose nodes accordingly
(2) reduce sequential Transpose nodes to at most one.
(3) symbolic shape infer of NhwcConv
(4) fix add bias transpose which causes CUDA error (launching more than 1024 threads per block) in inferencing fp32 model.
(5) add models (bert, bart, stable_diffusion subdirectories) to package;
(6) remove option --disable_channels_last

Note that
(1) We can add a few graph transformations to reduce Transpose nodes further. It is not done in this PR due to time limit.
(2) Stable diffusion 2.1 model outputs black images. It seems that forcing Attention to float32 could avoid the issue. However, it is much slower to use float32 Attention.	2023-02-07 07:49:15 -08:00
Tianlei Wu	a6c5ba0185	Stable Diffusion CUDA Optimizations (#14428)

### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv operator uses NCHW format).
(4) Update MultiHeadAttention (packed kv and no bias) for cross attention. This could avoid transpose of kv for TRT fused cross attention kernel.
(5) Optimization and benchmark script

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are implemented based on stable diffusion usage. They might not be applicable to any input size or dimensions. For example, BiasSplitGelu requires hidden size to be 2560, 5120, or 10240, and NhwcConv assumes 4D input/weight.

There is a minor increase in binary size. For SM=75 only, the Python package wheel size grows by (33757K - 33640K) = 117 KB. It is possible to move NHWC from template parameter to constructor to reduce binary size (with slight cost of performance).

Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest cuDNN to get the best performance.	2023-02-02 23:43:51 -08:00
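
The IO binding figures quoted in the commit messages for #19853 and #22834 can be cross-checked with a few lines of arithmetic. The sketch below is not part of the repository; the helper names are hypothetical, and it assumes throughput is batch size divided by latency, which matches the reported Llama2 numbers.

```python
# Cross-check of the IO binding numbers quoted in commits #19853 and #22834.
# Helper names are hypothetical; throughput = batch_size / latency is an assumption
# that happens to reproduce the reported "tps" values.

def throughput_tps(batch_size: int, latency_s: float) -> float:
    """Throughput for one model step, assuming batch_size / latency."""
    return batch_size / latency_s

def speedup(before: float, after: float) -> float:
    """Ratio of the slower measurement to the faster one."""
    return before / after

# Llama2 on CPU (#19853): latency in seconds, batch size 1.
llama_before_s = 0.4518657898902893
llama_after_s = 0.2662619352340698
llama_tps_before = throughput_tps(1, llama_before_s)
llama_tps_after = throughput_tps(1, llama_after_s)
llama_speedup = speedup(llama_before_s, llama_after_s)

# Stable Diffusion 1.5 on H100 (#22834): latency in milliseconds.
sd_without_ms = 588
sd_with_ms = 572
sd_gain_ms = sd_without_ms - sd_with_ms
sd_gain_pct = 100.0 * sd_gain_ms / sd_without_ms

print(f"Llama2 throughput: {llama_tps_before:.4f} -> {llama_tps_after:.4f} tps")
print(f"Llama2 speedup: {llama_speedup:.3f}x")
print(f"SD 1.5 gain: {sd_gain_ms} ms ({sd_gain_pct:.1f}%)")
```

The Llama2 ratio comes out just under 1.70, consistent with the 1.69x in the commit title, and the Stable Diffusion gain is 16 ms out of 588 ms, which is the 2.7% reported.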

15 commits