onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-06 00:03:22 +00:00

Author	SHA1	Message	Date
Changming Sun	88676e62b9	Remove nsync (#20413 ) ### Description 1. Remove the onnxruntime::OrtMutex class and replace it with ~absl::Mutex~ std::mutex. 2. After this change, most source files will not include <Windows.h> indirectly. ### Motivation and Context To reduce the number of deps we have, and address some Github issues that are related to build ONNX Runtime from source. In PR #3000 , I added a custom implementation of std::mutex . It was mainly because at that time std::mutex's default constructor was not trivial on Windows. If you had such a mutex as a global var, it could not be initialized at compile time. Then VC++ team fixed this issue. Therefore we don't need this custom implementation anymore. This PR also removes nsync. I ran several models tests on Linux. I didn't see any perf difference. This PR also reverts PR #21005 , which is no longer needed since conda has updated its msvc runtime DLL. This PR unblocks #22173 and resolves #22092 . We have a lot of open issues with nsync. This PR can resolve all of them.	2024-10-21 15:32:14 -07:00
Julius Tischbein	1391354265	Adding CUDNN Frontend and use for CUDA NN Convolution (#19470 ) ### Description Added CUDNN Frontend and used it for NHWC convolutions, and optionally fuse activation. #### Backward compatible - For model existed with FusedConv, model can still run. - If ORT is built with cuDNN 8, cuDNN frontend will not be built into binary. Old kernels (using cudnn backend APIs) are used. #### Major Changes - For cuDNN 9, we will enable cudnn frontend to fuse convolution and bias when a provider option `fuse_conv_bias=1`. - Remove the fusion of FusedConv from graph transformer for CUDA provider, so there will not be FusedConv be added to graph for CUDA EP in the future. - Update cmake files regarding to cudnn settings. The search order of CUDNN installation in build are like the following: * environment variable `CUDNN_PATH` * `onnxruntime_CUDNN_HOME` cmake extra defines. If a build starts from build.py/build.sh, user can pass it through `--cudnn_home` parameter, or by environment variable `CUDNN_HOME` if `--cudnn_home` not used. * cudnn python package installation directory like python3.xx/site-packages/nvidia/cudnn * CUDA installation path #### Potential Issues - If ORT is built with cuDNN 8, FusedConv fusion is no longer done automatically, so some model might have performance regression. If user still wants FusedConv operator for performance reason, they can still have multiple ways to walkaround: like use older version of onnxruntime; or use older version of ORT to save optimized onnx, then run with latest version of ORT. We believe that majority users have moved to cudnn 9 when 1.20 release (since the default in ORT and PyTorch is cudnn 9 for 3 months when 1.20 release), so the impact is small. - cuDNN graph uses TF32 by default, and user cannot disable TF32 through the use_tf32 cuda provider option. If user encounters accuracy issue (like in testing), user has to set environment variable `NVIDIA_TF32_OVERRIDE=0` to disable TF32. Need update the document of use_tf32 later. #### Follow ups This is one of PRs that target to enable NHWC convolution in CUDA EP by default if device supports it. There are other changes will follow up to make it possible. (1) Enable `prefer_nhwc` by default for device with sm >= 70. (2) Change `fuse_conv_bias=1` by default after more testing. (3) Add other NHWC operators (like Resize or UpSample). ### Motivation and Context The new CUDNN Frontend library provides the functionality to fuse operations and provides new heuristics for kernel selection. Here it fuses the convolution with the pointwise bias operation. On the [NVIDIA ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/) we get a performance boost from 49.1144 ms to 42.4643 ms per inference on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100-d 1 -i 'prefer_nhwc\|1' resnet50.onnx`). --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>	2024-08-02 15:16:42 -07:00
Changming Sun	25722bb9e3	Add CUDA custom op header files to Linux tarball (#21551 ) ### Description The header files were added in PR #16454. Then, recently I made a PR #21464 that changed how we packed Linux tarballs. The new tarball misses the custom op header files. Therefore I need to make this change. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-08-01 04:23:02 -07:00
Maximilian Müller	cc0de0d526	[Build] Propagate build option for CUDA minimal to TRT (#20695 ) ### Description Extend cuda minimal option to TRT provider, as with TRT 10 no linking to cuDNN is required anymore . Besides that with the new engine dump feature it is also possible to embed an engine in to an ONNX and not ship a builder lib. In addition to that this has roughly the same deserialization time/session setup time that using TRT standalone has. ### Motivation and Context ``` exe_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable\|1 trt_timing_cache_enable\|1 trt_dump_ep_context_model\|1 trt_weightless_engine_enable\|1' model.onnx exe_no_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable\|1 trt_timing_cache_enable\|1 trt_dump_ep_context_model\|1 trt_weightless_engine_enable\|1' model_ctx.onnx ```	2024-07-09 14:40:04 -07:00
Tianlei Wu	b3fc9b5a0e	[CUDA] upgrade cutlass to 3.5.0 (#20940 ) ### Description Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 in Windows - [x] Upgrade cutlass to 3.5.0. - [x] Fix flash attention build error with latest cutlass header files and APIs. This fix is provided by @wangyems. - [x] Update efficient attention to use new cutlass fmha interface. - [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53. - [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test build error for cuda 11.8 to 12.3. - [x] Disable TRT 10 deprecate warnings. The following are not included in this PR: * TRT provider replaces the deprecated APIs. * Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This test is not built by default unless you add `--cmake_extra_defines onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` in build command. To integrate to rel-1.18.1: Either bring in other changes (like onnx 1.16.1), or generate manifest and upload a new ONNX Runtime Build Time Deps artifact based on rel-1.18.1. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/19891 https://github.com/microsoft/onnxruntime/issues/20924 https://github.com/microsoft/onnxruntime/issues/20953	2024-06-11 13:32:15 -07:00
Carson M	5bfca1dc57	[Build] Change `onnxruntime_NVCC_THREADS` from option to cache entry (#20768 ) ### Description Changes the `onnxruntime_NVCC_THREADS` CMake variable from an [`option`](https://cmake.org/cmake/help/latest/command/option.html) to a [cache entry](https://cmake.org/cmake/help/latest/command/set.html#set-cache-entry). ### Motivation and Context Fixes #19833. `option` in CMake (confusingly, IMHO) always defines a boolean option. The original definition of `onnxruntime_NVCC_THREADS` specified a default of `1`, which I presume is coerced to `ON`. Thus, if the option is not overridden with a value of another type, NVCC will receive a malformed option `--threads ON` (rather than the expected `--threads 1`), which causes the error reported in #19833. This error only occurred if compiling ONNX Runtime via CMake with `-Donnxruntime_USE_CUDA=ON`; the CI build script always overrode `onnxruntime_NVCC_THREADS` with a string value: `f1fef19b6e/tools/ci_build/build.py (L1152-L1154)`	2024-05-29 12:28:33 -07:00
Dmitri Smirnov	b95fd4e644	Enable CUDA EP unit testing on Windows (#20039 ) ### Description Address build issues and source code discrepancies. Fix cuda_test_provider gtest argument stack corruption. ### Motivation and Context `OpTester` class that is widely used for kernel testing is not suitable for testing internal classes for EPs that are built as shared objects. Currently, CUDA EP tests run only on Linux. We want to enable testing and developments on Windows, and create a usable pattern for testing of other EPs internals. Alternatives considered: Abstracting EP unit tests into separate test executable such as `onnxruntime_test_all`. This alternative was rejected as it would create a lot more changes in the established patterns, and potentially interfere with CUDA functionality with more complex source code maintanence.	2024-03-27 13:32:36 -07:00
Scott McKay	db59cec82f	Don't reduce warning level for CUDA build on Windows (#19663 ) ### Description <!-- Describe your changes. --> Address warnings so all the ORT projects build with /W4 on Windows. Mainly - unused parameters - variables shadowing other ones ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> #19588 started on this.	2024-03-06 15:03:55 +10:00
Chen Fu	06e684c9f2	Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619 ) ### Description Adding CUDA kernel for block-wise 4b quantized float 16 GEMM, this is specially optimized for Nvidia Ampere GPUs. ### Motivation and Context Trying to improve quantized LLM inference performance on Nvidia Ampere GPUs ### Note: This is implemented by extending CUTLASS, so it has a hard dependency on CUTLASS. However, in current build system, loading of CUTLASS dependency is guarded with: (onnxruntime_USE_FLASH_ATTENTION OR onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION) If both of these options are turned off, then compilation will fail. Why CUTLASS dependency is guarded at all? It's a header file only library that does not introduce any binary if not instantiated. What's the downside of removing all the guards and just include CUTLASS unconditionally?	2024-03-05 09:37:45 -08:00
Maximilian Müller	c20ced4132	Use CMake's find package for CUDA libs (#19673 ) ### Description Answers issue #19640 More details are in the issue, basically I am changing all the include directory and link directory usage to CMake's `CUDA::*` targets	2024-02-27 11:26:48 -08:00
Maximilian Müller	bc219ed553	[TensorRT EP] Enable a minimal CUDA EP compilation without kernels (#19052 ) Adresses https://github.com/microsoft/onnxruntime/issues/18542. I followed the advice given by @RyanUnderhill [here](https://github.com/microsoft/onnxruntime/pull/18731#issuecomment-1848261925) and went with a minimal CUDA EP for now.	2024-01-17 11:33:34 -08:00
Ye Wang	c012e41f93	MoE with Expert Slicing (#18565 ) ### Description <!-- Describe your changes. --> Registered Sharded MoE op under contrib_op/cuda/collective with expert slicing. The broadcast process happens just before adding second bias(if has) and permutation undoing. Tensor slicing is planned but not included in this PR. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-12-05 16:56:38 -08:00
Chen Fu	05046e5452	Adding unit test for sm80 prepack (#18514 ) ### Description Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware ### Motivation and Context Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs. This kernel requires sophisticated quantized weight rearrangement to speedup loading data to tensor-core. To facilitate the addition, this change includes the following: 1. matrix_layout.h A new layout lib that facilitate iterating matrix elements and tiles that balance memory safety and performance. 2. prepack_sm80.h Code for rearranging quantized weight, scales and offsets (aka. prepacking) 3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the memory safety and correctness of the prepacking code. Currently the prepacking code runs on CPU with single threaded code. We run this on CPU in order to minimize GPU memory fragmentation. On the other hand, hopefully we get around to parallelize this part of the code. Should be straight forward with the unit tests in place.	2023-11-28 10:01:09 -08:00
Wei-Sheng Chin	fb6737e893	Distributed Squeeze and Distributed Unsqueeze (#18269 ) Implementat DistributedSqueeze & DistributedUnsqueeze for llama 2.	2023-11-06 20:11:35 -08:00
Wei-Sheng Chin	9e8ad39847	Distributed Reduction (#18206 ) This PR implements distributed reduciton for llama 2. This version doesn't consider any cases requring re-sharding because we haven't seen any use cases. Intutive examples: - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and device_mesh=[0,1] Algorithm: When the reduced axes are not sharded, each device can call reduction directly. The output sharding spec will be identical to input sharding spec. We currently throw when input and output sharding specs are different. Review guideline: - Check 97b8d2f for new op's schema and how new op is registered. - Read tests in 2450f93 to get faimilar with the behavior of these ops. - Check the implementation details in 753d9af.	2023-11-01 08:49:33 -07:00
Wei-Sheng Chin	24f9c1afe3	Distributed Expand (#18126 ) This PR implements DistributedExpand for llama 2. Representative Examples of DistributedExpand: - [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[8, 2] -> output tensor (shape=[8, 2], spec=S[0]R, device_mesh=[0,1])` - [sharding expanded axis is invalid since it must have dim=1 and axis with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[2, 8] -> output tensor (shape=[2, 8], spec=S[0]R, device_mesh=[0,1])` From those examples, we observe a few important behaviors. - The output sharding spec is always the same to the input sharding spec. - Expanding always happen on axis with dimension=1. Otherwise, it will violate the broadcasting rule. - No communication is needed since all computation can happen locally. Let's consider the first example again. If you put the first half tensor (shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on device 1, then `Expand` it with target shape [4, 2] , these two local tensors (shape: [4, 2]) are exactly the same as the one described by output sharding spec. Algorithm: - Compute logical (i.e., unsharded) shapes of input and output. - Compute sharded output shape from logical output. - Call Expand to broadcast local input to sharded output shape. How to review? - Start with [changes in onnxruntime_test_distributed.py](`ea33392f37`). Those tests are good examples for using this op. - [Read expand.h/expand.cc](`e4c49987f5`). Theose changes are for exposing functionalities in Expand to DistributedExpand. - Read distributed_expand.h/distributed_expand.cc. It follows the algorithm described above. The commit `68ac301bba` first sketches the definition of DistributedExpand. The next commit `0eb9330c3b` adds real implementation.	2023-10-28 00:44:02 -07:00
Wei-Sheng Chin	9c32310673	Distributed Reshape Implementation (#18068 ) This DistributedReshape aims at supporting all sharding patterns encountered in llama 2. All patterns found are tested in `TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR implements algorithms to compute the categories below. - All inputs and outputs are replica, so it's computed like a normal Reshape. - Two-axis fusion (if any of the inputs and outputs are sharded). This category convers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`. - Two-axis decomposition (if any of the inputs and outputs are sharded). This category convers, e.g., `[batch x seq, hidden] -> [batch, seq, hidden]`. Review guideline: - Ignore the changes in sharding_spec.h and sharding_spec.cc since they come from another PR #18025. - First, read onnxruntime_test_distributed.py to get familiar with the input/output of DistributedReshape. - Second, check the new APIs in reshape.h/reshape.cc to expose CUDA Reshape kernel to DistributedReshape. - For DistributedReshape, check its `ComputeInternal` for the 3 categories mentioned above.	2023-10-26 22:33:42 -07:00
Tang, Cheng	ca8cab29cd	distributed slice (#17761 ) ### Description Support DistributedSlice kernel in Cuda EP. mainly support following cases: 1. input data is sharded or replica for all axes (including slice axes) 2. slice axes is sharded across different devices. starts / ends / steps sharded across different devices are not supported yet. --------- Co-authored-by: Wei-Sheng Chin <wschin@outlook.com> Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com>	2023-10-12 14:28:00 -07:00
Maximilian Müller	74a8acf405	Set default value for NVCC threads (#17866 ) Without doing this CMake gives a miscellaneous error on windows when checking if NVCC is functional. It will be missing a number after `--threads`. Currently it is only possible to configure through the python build scripts and not CMake only configure - which is what I am usually doing through CLion.	2023-10-11 22:46:40 -07:00
Changming Sun	05ac9f6f2a	Split onnxruntime_providers.cmake to multiple (#17853 ) ### Description Split onnxruntime_providers.cmake to multiple files, for easier editing. No other change was made in this PR.	2023-10-09 20:33:44 -07:00

20 commits