onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-01 03:45:06 +00:00

Author	SHA1	Message	Date
PeixuanZuo	0ecfe83932	[ROCm] add beam search support (#15625 ) add beam search support for ROCm EP.	2023-04-26 17:53:33 +08:00
Tianlei Wu	686fd3c22a	Fix cuda 12.1 windows Build (#15614 ) ### Description Fix the CUDA 12.1 Windows build error caused by an ambiguous cuda namespace. Use a new namespace for attention softmax. Tested with VS 2019 and VS 2022 with the following settings: - OS: Microsoft Windows 11 Enterprise (Version 10.0.22621 Build 22621) - CUDA: cuda_12.1.0_531.14_windows - TensorRT: TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0 - CUDNN: 8.8.1.3 for cuda 12 - Visual Studio Enterprise 2019, version 16.11.26 (MSVC v142) or Visual Studio Enterprise 2022 (64-bit), version 17.5.4 - Python: 3.10 - CMake: 3.25.2 VS 2019: ``` build.bat --cmake_generator "Visual Studio 16 2019" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12" ``` VS 2022: ``` build.bat --cmake_generator "Visual Studio 17 2022" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt_2022 --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12" ``` ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/15242	2023-04-24 10:02:35 -07:00
cloudhan	9e44248bf9	Workaround ROCm global pool (#15481 ) Implement global avg/max pool with reduction	2023-04-23 11:48:43 +08:00
Ye Wang	633dec0b17	refactor some code (#15566 ) ### Description 1. moved onnxruntime/contrib_ops/cuda/decoder to onnxruntime/contrib_ops/cuda/bert 2. create utils.cuh under /bert for shared implementations in decoder_masked_multihead_attention_impl_utils.h and rotary_embedding_util.h 3. refactored relative_attn_bias_impl.cu by reusing the template specializations in utils.cuh --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-04-21 12:57:08 -07:00
PeixuanZuo	59ea35d592	[ROCm] add CK GroupNorm to GroupNormTunable (#15510 ) - Add CK GroupNorm to GroupNormTunable. - Reduce configuration of GroupNormNHWCOp because CK implementation is better. The performance gain on stable diffusion v1.5. Before: ``` 'height': 512 'width': 512 'steps': 50 'batch_size': 1 'batch_count': 5 'num_prompts': 1 'average_latency': 2.4782688856124877 'median_latency': 2.4783748388290405 'provider': 'ROCMExecutionProvider' 'disable_safety_checker': True ``` After: ``` 'height': 512, 'width': 512, 'steps': 50, 'batch_size': 1, 'batch_count': 5, 'num_prompts': 1, 'average_latency': 2.107170510292053, 'median_latency': 2.1067750453948975, 'first_run_memory_MB': -1, 'second_run_memory_MB': -1, 'provider': 'ROCMExecutionProvider', 'disable_safety_checker': True ```	2023-04-19 13:54:59 +08:00
Ye Wang	fbfe92f66a	DecoderMaskedMultiHeadAttention enhancement (#15292 )	2023-04-02 21:53:03 -07:00
Ye Wang	0402f930f2	exclude decoder files in hipify.cmake (#15188 )	2023-03-23 22:40:06 -07:00
Yufeng Li	dccbe9d492	exclude packed_attention* from rocm (#15161 ) exclude Contrib op PackedAttention from ROCM EP	2023-03-23 13:58:57 +08:00
PeixuanZuo	2ff7f3e93a	[ROCm] support optimized Stable Diffusion model (#14980 ) Add BiasSplitGelu/BiasAdd/GroupNorm/NhwcConv operator for ROCm EP. 1. BiasSplitGelu and BiasAdd operators can be automatically hipified from CUDA EP. 2. GroupNorm was hipified from CUDA EP and modified to build. 3. NhwcConv is similar to NhwcConv in CUDA EP, but the MIOpen and cuDNN APIs are different. `miopenConvolutionForwardbias` and `miopenOpTensor` of MIOpen do not support NHWC layout yet, so BinaryElementwise is used in place of `miopenConvolutionForwardbias` for the NHWC layout.	2023-03-14 23:15:37 +08:00
Hariharan Seshadri	112a4d215a	[CUDA] Support decoding multihead self-attention implementation (#14848 )	2023-03-08 09:17:54 -08:00
PeixuanZuo	0f9d2432d2	[ROCm] Add WarpWise Softmax into SoftmaxTunableOp (#14612 ) 1. Add Softmax warpwise_forward into SoftmaxTunableOp. 2. Make the use of TunableOp in the Softmax op optional, and use the original implementation by default. 3. Some other operators use `dispatch_warpwise_softmax_forward /dispatch_warpwise_softmax_forward/ SoftMaxComputeHelper ` directly, but they only have files under the cuda directory, so adding `RocmTuningContext ` for these files requires copying and modifying hipified files. For now, RocmTuningContext is only set to nullptr by default and the other operators are not hipified. Related PR: https://github.com/microsoft/onnxruntime/pull/14541 --------- Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-16 11:26:08 +08:00
PeixuanZuo	326cf2f5e9	[ROCm] add Softmax Tunable Op (#14541 ) ### Description Add Softmax Tunable Op, only include blockwise vec implementation and composable kernel. Related PR: https://github.com/microsoft/onnxruntime/pull/14475, https://github.com/microsoft/onnxruntime/pull/14612 --------- Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-13 15:56:50 +08:00
Tang, Cheng	8f34c8c8ed	Introduce collective ops to ort inference build (#14399 ) ### Description Introduce collective ops into onnxruntime inference build, including 1) AllReduce and AllGather schema in contrib op, controlled by USE_MPI flag 2) AllReduce and AllGather kernel in cuda EP, controlled by ORT_USE_NCCL flag ### Motivation and Context Enable the collective ops in the onnxruntime inference build so that we can run distributed inference with multiple GPUs. The original ncclAllReduce ops in the training build require quite complex configuration, which is not suitable for the inference case, and they are already broken, so we introduce a new implementation. --------- Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-02-07 13:47:48 -08:00
Ye Wang	b539c364ee	Some kernel changes for TULR (#14517 ) ### Description 1. fix a bug in relative position bias kernel where seq_len > 32 2. rename extra_add_qk to relative_position_bias 3. support relative_position_bias in multihead attention (B, N, S, S*) 4. gru_gate support by Lei --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net> Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>	2023-02-07 11:51:06 -08:00
ytaous	d632f9a3fa	[ROCm] Enable Sampling Op UT on AMD (#14581 ) Making basic porting effort to run Sampling UT on ROCm ep, based on the commits: https://github.com/microsoft/onnxruntime/pull/13426 https://github.com/microsoft/onnxruntime/pull/14218 1. enabling EmbedLayerNorm op 2. enabling Sampling op 3. enabling helpers to copy data from CPU->GPU for subgraph This task is the first checkpoint. There could be other missing ops when testing a real model. We will migrate more code onto ROCm as needed. Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-06 20:52:06 -08:00
Tianlei Wu	a6c5ba0185	Stable Diffusion CUDA Optimizations (#14428 ) ### Description Add stable diffusion CUDA kernel optimizations. The following are included: (1) GroupNorm operator. This kernel is from TensorRT 8.5. (2) BiasSplitGelu operator. This kernel is modified from SplitGelu of TensorRT 8.5. We added bias to the SplitGelu. (3) NhwcConv operator. This adds support of NHWC format (ONNX Conv operator uses NCHW format). (4) Update MultiHeadAttention (packed kv and no bias) for cross attention. This could avoid transpose of kv for TRT fused cross attention kernel. (5) Optimization and benchmark script Not included: (1) Script to convert Conv to NhwcConv in onnx graph. (2) Update symbolic shape inference for NhwcConv. (3) Add SeqLen2Spatial operator (4) Documents Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are implemented based on stable diffusion usage. They might not work for arbitrary input sizes or dimensions. For example, BiasSplitGelu requires hidden size to be 2560 | 5120 | 10240, and NhwcConv assumes 4D input/weight. There is a minor increase in binary size. For SM=75 only, python package wheel size adds (33757K - 33640K) = 117 KB. It is possible to move NHWC from template parameter to constructor to reduce binary size (with a slight cost in performance). Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest cuDNN to get the best performance.	2023-02-02 23:43:51 -08:00
PeixuanZuo	1059cf6d98	[ROCm] Fix ROCm build issue caused by REMOVE_ITEM incorrect path (#14534 ) ### Description Fix REMOVE_ITEM not working. `onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc` is hipified from `onnxruntime/contrib_ops/cuda/aten_ops/aten_op.cc`. The correct file path is `${CMAKE_CURRENT_BINARY_DIR}/amdgpu/onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc`, and it exists in the hipified source files list `onnxruntime_rocm_generated_contrib_ops_cc_srcs`. A better way to fix it: if we don't want to build a file, add it to the hipify excluded files so that it is not hipified.	2023-02-03 13:34:59 +08:00
Tianlei Wu	414b012f42	Add memory efficient attention from CUTLASS (#14343 ) ### Description Add memory efficient attention from CUTLASS. TODO (in next pull request): (1) Need performance tests on different GPUs, then add a sequence length threshold (only activate it for long sequence length). (2) Merge changes from https://github.com/NVIDIA/cutlass/pull/773 when it is in cutlass master.	2023-01-20 12:33:01 -08:00
Ye Wang	a01bf8dbb1	rename CrossAttention to MultiHeadAttention (#14201 ) ### Description rename CrossAttention to MultiHeadAttention since this op can also be used as self attention Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-10 10:18:39 -08:00
Tianlei Wu	2cacb24cb0	Add CrossAttention operator (#14146 ) Move separated Q, K and V (without input projection) from Attention to a new operator CrossAttention. The Attention operator is hard to maintain when it must support both with and without input projection in one class. Add a new operator according to feedback. Some changes might be needed in the future, but not in this PR: (1) bias could be optional (We will not pursue that route unless experiments show that fusing Add bias with MatMul instead of this op could improve performance). (2) support packed KV. There are two ways to support it: when key and value are the same Tensor, they are packed; or we can make value optional, and use packed mode when value is empty and the key has packed K/V. (3) support cached key and value, other inputs (like relative position bias), or more attention mask formats. They can be added easily without breaking backward compatibility. (4) ROCm/CPU implementation of this op.	2023-01-06 14:27:40 -08:00
Ye Wang	68518a1b72	Sampling op (#13426 ) ### Description Sampling op for cpu and cuda support huggingface case and custom case Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2022-12-22 17:34:12 -08:00
Tang, Cheng	a81faee41e	Multi-stream execution support (#13495 ) Description: This PR includes the following work: 1. provide stream and related synchronization abstractions in onnxruntime. 2. enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel. 3. deprecate the parallel executor for cpu. 4. deprecate the Fence mechanism. 5. update the cuda / tensorrt EP to support the stream mechanism, including running different requests in different cuda streams. Motivation and Context - Why is this change required? Currently, the execution plan is just a linear list of those primitives, and ort executes them step by step. For any given graph, ORT will serialize it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations: 1. it is difficult to enable inter-node parallelization; we have a half-baked parallel executor, but it is very difficult to make it work with GPU. 2. The fence mechanism can work with the single gpu stream + cpu thread case, but when extended to multiple streams, it is difficult to manage the cross GPU stream synchronizations. 3. our cuda EP relies on the BFCArena to make the memory management work with the GPU async kernels, but the current BFCArena is not aware of the streams, so it doesn't behave correctly when run with multiple streams. This PR enhances our existing execution plan and executor to support multiple stream execution. We use a unified algorithm to manage both single stream and multiple stream scenarios. This PR mainly focuses on the infrastructure support for multiple stream execution; that is, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will be addressed in a future PR. 
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com> Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: cao lei <jslhcl@gmail.com> Co-authored-by: Lei Cao <leca@microsoft.com>	2022-12-15 07:39:29 -08:00
Abhishek Udupa	83c59d2594	Session-aware and thread-safe CUDA profiler (#13706 ) ### Description The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both. ### Motivation and Context [PR 13549](https://github.com/microsoft/onnxruntime/pull/13549) brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well. Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed. Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing. [sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz) [sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz) [check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip) Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>	2022-12-09 13:22:12 -08:00
cloudhan	369a822409	Share TunableOp between CUDA and ROCM EP (#13560 ) Make TunableOp support CUDA kernel authoring and add the corresponding support for kernel explorer	2022-11-11 13:56:44 +08:00
cloudhan	2748f38362	Drop hip_add_library (#13406 ) Switching to use CMake's builtin hip language support.	2022-10-25 12:57:48 +08:00
cloudhan	928c9fc348	Hipify during build instead of before cmake config (#13333 ) ### Description Currently, hipify happens before cmake is configured, and cmake then globs the directories. This PR gets rid of the customized python threading logic and opts for the build system itself to generate the files. This also supersedes the half-baked branch [sukha/hipify-with-cmake](https://github.com/microsoft/onnxruntime/tree/sukha/hipify-with-cmake)	2022-10-20 22:46:22 -07:00

26 commits