onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-03 23:49:44 +00:00

Author	SHA1	Message	Date
Numfor Tiapo	b8f373b0ae	Add API for NPU Device Selection in the DML EP (#17612 ) Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-11 14:53:00 -07:00
pengwa	0e2782438a	Support inplace update for PythonOp/Grad (#17687 ) ### Support inplace update for PythonOp/Grad This PR is based on another PR https://github.com/microsoft/onnxruntime/pull/17685's branch, to make it easier to review. With PR: PR https://github.com/microsoft/onnxruntime/pull/17685, By default all PythonOp inputs/outputs are assumed to not be inplaced, if during run, we found some inplace update happens (by checking output data address with all inputs data address), we add clone before set it as PythonOp/Grad's outputs. In this case, results are correct, but implicit copies overheads are introduced. This PR allow users to define output input reuse map, to let ORT know how to do the reuse map, avoid such unnecessary copies.	2023-10-10 21:36:45 -07:00
Changming Sun	05ac9f6f2a	Split onnxruntime_providers.cmake to multiple (#17853 ) ### Description Split onnxruntime_providers.cmake to multiple files, for easier editing. No other change was made in this PR.	2023-10-09 20:33:44 -07:00
Baiju Meswani	9c716f4557	Add noexcep_operators to onnxruntime internal libraries (#17850 )	2023-10-09 16:29:41 -07:00
cloudhan	c2bd5b70b2	Fix enable_training and use_migraphx (#17827 )	2023-10-08 11:43:27 +08:00
MistEO	faf9a0f6c7	Fix runtime installation error (#17828 )	2023-10-07 11:50:02 -07:00
JiCheng	3878011ce2	Remove MPI dependency (#17624 ) ### Description <!-- Describe your changes. --> Support launch multi-GPU without MPI ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-06 15:33:18 +08:00
George Wu	b306b02a86	[QNN EP] fixed input for InstanceNormU8 unit test and update copy lib paths (#17806 ) -update InstanceNormU8 with fixed input. With this input, it fails consistently using QNN 2.15.1 -update QNN lib paths (target is deprecated) and additionally copy V73 skel file	2023-10-05 22:17:15 -07:00
Justin Chu	be7541ef4a	[Linter] Bump ruff and remove pylint (#17797 ) Bump ruff version and remove pylint from the linter list. Fix any new error detected by ruff. ### Motivation and Context Ruff covers many of the pylint rules. Since pylint is not enabled in this repo and runs slow, we remove it from the linters	2023-10-05 21:07:33 -07:00
Wei-Sheng Chin	faef9c32fa	ONNX-Native Tensor Parallel: Using Distributed MatMul as Example (#17695 ) This PR introduces - New data structure to represent kernel-level (aka node-level or op-level) tensor sharding informaiton. I consider it as the fundamentaion of ONNX distribtued inference. - Building blocks for distribtued kernels implementation especially stateless implementation for communication ops. - Implementation of DistributedMatMul and its tests. Code structure: - sharding.h/.cc: Function to shard and reshard tensors (calling into NCCL). - sharding_spec.h/.cc: Representation of how a tensor is sharded. - distributed_matmul.h/.cc: Implementation of tensor parallel MatMul. Inputs and outputs are sharded across devices. - onnxruntime_test_distributed.py: distributed operator tests. Example of specifying sharding information ```python @onnxscript.script() def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT: # Run MatMul by sharding x along column axis and w along row axis on # 2 GPUs. return MICROSOFT_OPSET.DistributedMatMul( tensor_x, tensor_w, device_mesh_shape=[2], device_mesh_elements=[0, 1], input_shard_specs=["RS[0]", "S[0]R"], output_shard_specs=["RR"], ) onnx_model = matmul_rs_sr_rr.to_model_proto( input_types=[FLOAT[2, "s"], FLOAT["s", 2]], output_types=[FLOAT[2, 2]], ) ``` In this example, the device mesh can be visualized as 1-D tensor, `[0, 1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the 0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is sharded across `[0, 1]` as well. C++ classes to represent tensor sharding (copied from sharding_spec.h): ```cpp class DeviceMesh { public: // [Device Mesh and Tensor Sharding for Tensor Parallel] // Device mesh is a tensor of device indices. // A tensor can then be partitioned along specific mesh axes. // // Assume we have 4 GPUs indexed by 0, 1, 2, and 3. // Let's consider some examples. // 1. 1D device mesh [0, 1, 2, 3]. In this case, // device_mesh_shape is [4] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor along its axis 1, the // corresponding sharding spec is a string "RS[0]". // 2. 2D device mesh [[0, 1], [2, 3]]. In this case, // device_mesh_shape is [2, 2] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor's // rows along mesh axis 1 and // columns along mesh axis 0, the // corresponding sharding spec is a string "S[1]S[0]". // If that 2-D tensor's value is np.array([[5, 6], [7, 8]]), // GPU 0/1/2/3 owns 5/7/6/8. Below is a visualization the sharding // proccess. // - Start with a 2-D device mesh [[0, 1], [2, 3]] and // a 2-D tensor [[5, 6], [7, 8]] // - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]] // - Split GPU mesh along axis 1 and tensor along // axis 0 for "S[1]" in "S[1]S[0]" // - GPU: [[0], [2]], Tensor: [[5, 6]] // GPU: [[1], [3]], Tensor: [[7, 8]] // - Split GPU mesh along axis 0 and tensor along // axis 1 for "S[0]" in "S[1]S[0]" // - GPU: [[0]], Tensor: [[5]] // - GPU: [[2]], Tensor: [[6]] // - GPU: [[1]], Tensor: [[7]] // - GPU: [[3]], Tensor: [[8]] // Actual shape of device mesh represented by `device_mesh_elements`. std::vector<int64_t> device_mesh_shape; // Flattened device mesh. std::vector<int64_t> device_mesh_elements; }; class AxisPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // This class is the in-memory representation of // 1. if a tensor is sharded or not (aka replica), and // 2. which tensor axis is shard by which device mesh axis. // Let's consider sharding 2-D tensor along column axis on // device mesh [0, 1] as an example. // The required sharding spec RS[0] can be represented by // - AxisPartitionSpec(Condition::Replica, -1) // - AxisPartitionSpec(Condition::Shard, 0) public: // Status of a tensor axis. // A tensor axis can be either sharded or replicated // along a device mesh axis. enum class Condition { Replica, Shard }; // This field tells if a tensor axis is sharded or not. Condition cond; // If a tensor axis is sharded, this field tells which device // mesh axis to distribute the shards along. // If a tensor axis is not sharded, this field is ignored. int device_mesh_axis; // A helper to construct a replica spec for a tensor axis. static AxisPartitionSpec CreateReplica() { return AxisPartitionSpec(Condition::Replica, -1); } // A helper to construct a sharding spec for a tensor axis. // This tensor axis is sharded along `device_mesh_axis` in device mesh. static AxisPartitionSpec CreateShard(int device_mesh_axis) { return AxisPartitionSpec(Condition::Shard, device_mesh_axis); } }; class TensorPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // TensorPartitionSpec holds a collection of AxisPartitionSpec and an // associated DeviceMesh. It is responsible for determining how a tensor // should be partitioned across a device mesh. // // Example 1: RS[0] // In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects. // - The first object is a Replica, denoting that the first axis of the tensor is // not sharded but is instead replicated. // - The second object is a Shard along the 0-th axis of the device mesh. It denotes // that the second axis of the tensor is sharded along the first axis of the // device mesh. // // Example 2: S[0]RR // In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects. // - The first object is a Shard along the 0-th axis of the device mesh, indicating // that the first axis of the tensor is sharded along the first axis of the // device mesh. // - The second and third objects are Replicas, indicating that the second and third // axes of the tensor are not sharded but are instead replicated. public: // axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor, // axis_specs[0] is for row axis and axis_specs[1] is for // column axis. axis_specs[i].device_mesh_axis = j means that // tensor axis i is sharded along device mesh axis j. std::vector<AxisPartitionSpec> axis_specs; // device_mesh: DeviceMesh for sharding the associated tensor. // Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment. DeviceMesh device_mesh; }; ```	2023-10-05 14:22:25 -07:00
Edward Chen	1bc115719c	Unify handling of public headers in onnxruntime.cmake. (#17779 ) The changes in PR #8919 overwrote the PUBLIC_HEADER property value of the `onnxruntime` target with a list that did not include EP-specific headers. We should probably be using a consistent set of header files across packages anyway.	2023-10-04 08:55:08 -07:00
Changming Sun	14d349e290	Enable backtrace in unit tests (#17655 ) ### Description Google test can be built either with absl/re2 or not. This PR enables the build option so that google test framework can print out a nice stacktrace when something went wrong. It helps locate test errors in CI build pipelines. Also, Google test will remove the build option and make it always ON. So sooner or later we must make this change.	2023-09-29 12:32:56 -07:00
MistEO	870b0bc305	Fix typo of cmake (#17715 ) This caused a cmake configuration error.	2023-09-27 11:48:46 -07:00
Mustafa Ateş Uzun	13b0f8a6ce	fix: supported typo (#17216 )	2023-09-27 10:45:27 -07:00
liqun Fu	2be4dc6d04	ONNX 1.15 integration (#17125 ) ### Description this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released ### Motivation and Context Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-09-26 14:44:48 -07:00
Jian Chen	0141e27ca1	Enabling c++ 20 in MacOS build (#16187 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-09-26 11:27:02 -07:00
Tianlei Wu	730fab3050	Refactor Attention cuda kernel (#17578 ) * Break QkvToContext into small functions. Each fused and unfused kernel will have separated function. * Move DecoderAttention kernel to separated file * Move KV cache related kernel to attention_kv_cache.cu ### Motivation and Context To make the code easier to maintain.	2023-09-19 09:49:21 -07:00
Tianlei Wu	adb0be45d3	Refactoring of attention cuda kernel: move prepare qkv and concat_past_to_present (#17559 ) To avoid a huge cu file and make code more readable: - Move PrepareQKV to separate cu file (attention_prepare_qkv.cu) - Move ConcatPastToPresent to attention_concat.cu - Add default value for AttentionData - Add a data structure QkvData to track Q, K and V pointers and track QKV format.	2023-09-15 10:57:29 -07:00
Changming Sun	5af6279440	Fix Android build (#17540 ) ### Description The new cpuinfo library doesn't use clog on Android. Newer XNNPack versions have removed the dependency on clog, but the one we use still has it. So I cherry-pick the XNNPack to our patch file.	2023-09-14 07:36:01 -07:00
Changming Sun	24a3c740c0	Revert "[ROCm][MIGraphX] for googletest dep, set OVERRIDE_FIND_PACKAGE (#16715 )" (#17523 ) This reverts commit `bb136f86c8`, then re-implement it in a different way. I reverted the original change, then added a version constraint to the find_package args. If you still found it picks up wrong gtest version after this change, you may disable `find_package` by setting 'FETCHCONTENT_TRY_FIND_PACKAGE_MODE' to NEVER. For example, the latest gtest version is 1.14.0. If at a later time Google releases a new version of gtest and that one is incompatible with the ONNX Runtime source code you get today and your dev environment already installed the new version and you do not want to create a new clean build environment that is without the package, you can add `--cmake_extra_defines FETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER` to your build command to solve the problem.	2023-09-12 22:39:31 -07:00
Chi Lo	b827ab0efc	[TRT EP] Fix build error for building oss onnx-tensorrt parser (#17468 ) If building ORT TRT with `--use_tensorrt_oss_parse` (meaning ORT wil include [oss onnx-tensorrt parser](https://github.com/onnx/onnx-tensorrt/blob/main/CMakeLists.txt#L82) and build it from source) ,the cmake CUDA_INCLUDE_DIR variable is needed. if not, you will encounter following [ build error](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1133937&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=39e3f98f-7fe5-578c-20bd-5ae5a4590bda): CMake Error: The following variables are used in this project, but they are set to NOTFOUND. Please set them or make sure they are set and tested correctly in the CMake files: /build/Release/_deps/onnx_tensorrt-src/CUDA_INCLUDE_DIR Note: Not quite sure why in the past when CI still tested with oss parser won't hit this issue. probably the CUDA_INCLUDE_DIR was defined somewhere back then.	2023-09-08 20:34:57 -07:00
Caroline Zhu	dcc93909b4	Add training WASM generation to Web CI pipeline (#17319 ) ### Description [Successful pipeline run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results) Added flag to build the training artifacts & updated the pull-wasm-artifacts script to pull the training artifacts as well. Bundled into this PR are minor formatting fixes + naming fixes. ### Motivation and Context [This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended the WASM API wrapper to build training WASM artifacts as well. The ORT training WASM artifacts are required to support ORT training web bindings.	2023-09-08 15:49:47 -07:00
Changming Sun	bc84f52633	Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470 ) ### Description Update C/C++ dependencies abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint to newer versions per request of @ mayeut. He created the following PRs to update the deps: https://github.com/microsoft/onnxruntime/pull/15432 https://github.com/microsoft/onnxruntime/pull/15434 https://github.com/microsoft/onnxruntime/pull/15435 https://github.com/microsoft/onnxruntime/pull/15436 https://github.com/microsoft/onnxruntime/pull/15437 However, our build system needs to fetch the dependencies from an internal mirror that only Microsoft employees have write access to. So I closed his PRs and created this one. This PR also updates abseil to a newer version. This is to prepare for upgrading re2.	2023-09-08 13:35:04 -07:00
Yulong Wang	110a2d0b73	[build][wasm] add js_internal_api.js to link dependency (#17407 ) ### Description add js_internal_api.js to link dependency. Now changes to js_internal_api.js will correctly trigger re-link of ort-wasm.wasm	2023-09-05 20:40:40 -07:00
Changming Sun	c6b0d185b4	Update cmake to 3.27 and upgrade Linux CUDA docker files from CentOS7 to UBI8 (#16856 ) ### Description 1. Update docker files and their build instructions. ARM64 and x86_64 can use the same docker file. 2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8 AB#18990	2023-09-05 18:12:10 -07:00
Lennart Hannink	e3bb2a0cdd	Fix git working dir for ORT_BUILD_INFO (fixes #17197 ) (#17198 ) ### Description Git commands producing `git-commid-id` and `git-branch` are always run in `CMAKE_CURRENT_SOURCE_DIR` (i.e. `onnxruntime/cmake`) ### Motivation and Context Please refer to corresponding issue [#17197](https://github.com/microsoft/onnxruntime/issues/17197).	2023-09-05 09:20:49 -07:00
cloudhan	6ea3908db4	Add ck's streamk and splitk gemm impl (#17280 )	2023-09-04 11:49:07 +08:00
aciddelgado	44101e8771	Flash Attention v2 MHA (#17227 ) ### Description Integrate Flash Attention V2 to PackedMultiHeadAttention, MultiHeadAttention and Attention operators. Flash Attention v2 source code is from https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src. We did some change to remove dependency on Torch, then removed backward and bfloat16 related code. Add benchmark script (see benchmark_mha.sh) to compare different attention kernels for MultiHeadAttention operator. Current limitations for Flash Attention in PackedMultiHeadAttention, MultiHeadAttention and Attention operators: * Relative Position Bias is not supported * Different hidden size for Q and V is not supported * Only float16 is supported * Padding/attention mask is not supported * For MultiHeadAttention, when there is past or present input, bias shall be provided to activate flash attention * For Attention, past or present inputs will deactivate flash attention * Causal is not supported Some limitations (like attention mask and causal) might be removed later. Currently, Flash Attention v2 only works in Linux. For Windows, we will enable later with Cutlass 3.2. Two environment variables can be used for testing purpose: (1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default value is 0 (enable). Set it to "1" to disable it. (2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is "513", which means that we only enable flash attention when sequence length is larger than 512 for packed QKV format. Set it to "0" if you want to use flash attention v2 whenever possible. ### Speedup The following result is from Standard_ND96amsr_A100_v4 VM (A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per second for MultiHeadAttention operator. There are 3 input formats: * `Q,K,V` means separated inputs query, key and value of BxSxNH * `Q,KV` means packed KV, where key is 5D: BxSxNx2xH * `QKV` means packed QKV, where query is 5D: BxSxNx3xH Note that flash attention cannot use packed QKV format, so extra Transpose is needed. We found that TensorRT kernel is faster for sequence length <= 512 for packed QKV. The reason might be no transpose is needed for TensorRT kernel in this format. We also notice that, TensorRT kernel is faster for stable diffusion 512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while flash attention v2 is faster for 1024x1024 image (see seq_len=16384, heads=8, head_dim=40 below). input format \| batch size \| sequence length \| heads \| head dim \| flash_v2 (TFLOPs/s) \| TensorRT (TFLOPs/s) \| Memory Efficient Attention (TFLOPs/s) -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- Q,K,V \| 32 \| 512 \| 64 \| 32 \| 78.1 \| 60.0 \| 39.3 Q,K,V \| 32 \| 512 \| 128 \| 16 \| 46.8 \| 44.1 \| 21.7 Q,K,V \| 16 \| 1024 \| 64 \| 32 \| 99.0 \| 72.8 \| 44.3 Q,K,V \| 16 \| 1024 \| 128 \| 16 \| 54.7 \| 49.2 \| 23.4 Q,K,V \| 8 \| 2048 \| 64 \| 32 \| 113.8 \| 81.2 \| 47.8 Q,K,V \| 8 \| 2048 \| 128 \| 16 \| 59.7 \| 51.9 \| 24.7 Q,K,V \| 4 \| 4096 \| 64 \| 32 \| 122.5 \| 85.6 \| 49.7 Q,K,V \| 4 \| 4096 \| 128 \| 16 \| 62.5 \| 53.3 \| 25.3 Q,K,V \| 2 \| 8192 \| 64 \| 32 \| 127.4 \| 87.5 \| 50.7 Q,K,V \| 2 \| 8192 \| 128 \| 16 \| 64.0 \| 54.2 \| 25.6 Q,K,V \| 1 \| 16384 \| 64 \| 32 \| 129.5 \| 91.0 \| 51.2 Q,K,V \| 1 \| 16384 \| 128 \| 16 \| 64.7 \| 54.5 \| 25.8 Q,K,V \| 1 \| 4096 \| 8 \| 40 \| 51.0 \| 43.6 \| 36.8 Q,K,V \| 1 \| 4096 \| 8 \| 80 \| 97.7 \| 77.0 \| 55.5 Q,K,V \| 1 \| 4096 \| 8 \| 160 \| 120.0 \| 39.7 \| 57.8 Q,K,V \| 4 \| 4096 \| 8 \| 40 \| 89.0 \| 84.4 \| 49.2 Q,K,V \| 4 \| 4096 \| 8 \| 80 \| 133.0 \| 92.2 \| 63.2 Q,K,V \| 4 \| 4096 \| 8 \| 160 \| 164.8 \| 42.7 \| 63.8 Q,K,V \| 1 \| 16384 \| 8 \| 40 \| 96.9 \| 91.3 \| 52.1 Q,K,V \| 1 \| 16384 \| 8 \| 80 \| 142.9 \| 101.5 \| 65.6 Q,K,V \| 1 \| 16384 \| 8 \| 160 \| 177.4 \| 44.2 \| 65.7 Q,K,V \| 128 \| 128 \| 12 \| 64 \| 29.0 \| 26.9 \| 25.7 Q,K,V \| 64 \| 128 \| 12 \| 64 \| 23.1 \| 10.8 \| 21.3 Q,K,V \| 128 \| 384 \| 12 \| 64 \| 83.5 \| 60.8 \| 55.7 Q,K,V \| 64 \| 384 \| 12 \| 64 \| 72.6 \| 40.5 \| 52.8 Q,K,V \| 128 \| 512 \| 12 \| 64 \| 98.9 \| 77.9 \| 62.1 Q,K,V \| 64 \| 512 \| 12 \| 64 \| 94.7 \| 75.6 \| 60.4 Q,KV \| 32 \| 512 \| 64 \| 32 \| 85.9 \| 41.1 \| 41.1 Q,KV \| 32 \| 512 \| 128 \| 16 \| 47.1 \| 21.6 \| 21.6 Q,KV \| 16 \| 1024 \| 64 \| 32 \| 104.4 \| 45.8 \| 45.8 Q,KV \| 16 \| 1024 \| 128 \| 16 \| 54.7 \| 23.6 \| 23.6 Q,KV \| 8 \| 2048 \| 64 \| 32 \| 116.8 \| 48.5 \| 48.5 Q,KV \| 8 \| 2048 \| 128 \| 16 \| 59.8 \| 24.7 \| 24.7 Q,KV \| 4 \| 4096 \| 64 \| 32 \| 124.2 \| 50.1 \| 50.1 Q,KV \| 4 \| 4096 \| 128 \| 16 \| 62.6 \| 25.3 \| 25.3 Q,KV \| 2 \| 8192 \| 64 \| 32 \| 128.5 \| 50.8 \| 50.9 Q,KV \| 2 \| 8192 \| 128 \| 16 \| 64.1 \| 25.6 \| 25.6 Q,KV \| 1 \| 16384 \| 64 \| 32 \| 129.4 \| 51.2 \| 51.2 Q,KV \| 1 \| 16384 \| 128 \| 16 \| 64.8 \| 25.8 \| 25.8 Q,KV \| 1 \| 4096 \| 8 \| 40 \| 67.5 \| 37.7 \| 37.5 Q,KV \| 1 \| 4096 \| 8 \| 80 \| 101.3 \| 56.7 \| 56.6 Q,KV \| 1 \| 4096 \| 8 \| 160 \| 124.0 \| 58.6 \| 58.6 Q,KV \| 4 \| 4096 \| 8 \| 40 \| 90.8 \| 49.8 \| 49.8 Q,KV \| 4 \| 4096 \| 8 \| 80 \| 135.6 \| 63.8 \| 63.8 Q,KV \| 4 \| 4096 \| 8 \| 160 \| 166.3 \| 64.5 \| 64.5 Q,KV \| 1 \| 16384 \| 8 \| 40 \| 97.5 \| 52.3 \| 52.3 Q,KV \| 1 \| 16384 \| 8 \| 80 \| 143.5 \| 65.9 \| 65.8 Q,KV \| 1 \| 16384 \| 8 \| 160 \| 178.4 \| 65.9 \| 65.8 Q,KV \| 128 \| 128 \| 12 \| 64 \| 26.8 \| 48.1 \| 30.9 Q,KV \| 64 \| 128 \| 12 \| 64 \| 28.0 \| 38.9 \| 25.0 Q,KV \| 128 \| 384 \| 12 \| 64 \| 97.7 \| 61.1 \| 61.0 Q,KV \| 64 \| 384 \| 12 \| 64 \| 89.5 \| 57.8 \| 57.9 Q,KV \| 128 \| 512 \| 12 \| 64 \| 111.9 \| 66.7 \| 66.9 Q,KV \| 64 \| 512 \| 12 \| 64 \| 107.2 \| 64.9 \| 64.8 QKV \| 32 \| 512 \| 64 \| 32 \| 77.2 \| 84.7 \| 39.3 QKV \| 32 \| 512 \| 128 \| 16 \| 43.4 \| 53.1 \| 20.9 QKV \| 16 \| 1024 \| 64 \| 32 \| 98.8 \| 87.4 \| 44.6 QKV \| 16 \| 1024 \| 128 \| 16 \| 52.0 \| 54.1 \| 23.2 QKV \| 8 \| 2048 \| 64 \| 32 \| 113.1 \| 89.0 \| 47.9 QKV \| 8 \| 2048 \| 128 \| 16 \| 58.2 \| 54.6 \| 24.5 QKV \| 4 \| 4096 \| 64 \| 32 \| 120.6 \| 89.7 \| 49.7 QKV \| 4 \| 4096 \| 128 \| 16 \| 61.7 \| 54.6 \| 25.2 QKV \| 2 \| 8192 \| 64 \| 32 \| 125.9 \| 89.5 \| 50.7 QKV \| 2 \| 8192 \| 128 \| 16 \| 63.6 \| 54.8 \| 25.5 QKV \| 1 \| 16384 \| 64 \| 32 \| 128.5 \| 92.0 \| 51.2 QKV \| 1 \| 16384 \| 128 \| 16 \| 64.6 \| 54.8 \| 25.7 QKV \| 1 \| 4096 \| 8 \| 40 \| 60.2 \| 69.8 \| 38.1 QKV \| 1 \| 4096 \| 8 \| 80 \| 101.6 \| 75.2 \| 56.7 QKV \| 1 \| 4096 \| 8 \| 160 \| 130.2 \| 41.2 \| 58.4 QKV \| 4 \| 4096 \| 8 \| 40 \| 90.6 \| 91.0 \| 49.5 QKV \| 4 \| 4096 \| 8 \| 80 \| 133.6 \| 98.1 \| 62.8 QKV \| 4 \| 4096 \| 8 \| 160 \| 165.3 \| 43.7 \| 63.9 QKV \| 1 \| 16384 \| 8 \| 40 \| 97.2 \| 92.8 \| 52.1 QKV \| 1 \| 16384 \| 8 \| 80 \| 143.0 \| 103.1 \| 65.6 QKV \| 1 \| 16384 \| 8 \| 160 \| 177.6 \| 44.5 \| 65.7 QKV \| 128 \| 128 \| 12 \| 64 \| 31.1 \| 65.9 \| 27.6 QKV \| 64 \| 128 \| 12 \| 64 \| 26.1 \| 49.8 \| 23.5 QKV \| 128 \| 384 \| 12 \| 64 \| 84.6 \| 88.5 \| 56.1 QKV \| 64 \| 384 \| 12 \| 64 \| 79.1 \| 80.3 \| 53.5 QKV \| 128 \| 512 \| 12 \| 64 \| 97.3 \| 114.2 \| 62.2 QKV \| 64 \| 512 \| 12 \| 64 \| 95.9 \| 110.7 \| 60.6 QKV \| 4 \| 2048 \| 32 \| 128 \| 125.26 \| 44.72 \| 78.15 QKV \| 4 \| 4096 \| 32 \| 128 \| 141.62 \| 46.29 \| 85.84 QKV \| 8 \| 2048 \| 32 \| 128 \| 127.40 \| 45.49 \| 78.75 QKV \| 8 \| 4096 \| 32 \| 128 \| 144.24 \| 46.60 \| 86.95 ### Known Issues NVCC uses huge memory while compiling flash attention CUDA kernel. Linux build with CUDA might fail when machine has limited memory while number of CPUs is large. Walkaround is to use a build machine with larger memory, or use argument like `--nvcc_threads 1` to limit nvcc threads in build. ### Motivation and Context Increases speed and efficiency of MHA or Packed MHA. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>	2023-08-31 13:52:21 -07:00
Wanming Lin	3a53836836	[WebNN EP] Fix compilation with newer flatbuffers (#17367 )	2023-08-31 10:22:15 -07:00
Artem Shilkin	6e60dba726	Fix compilation with newer flatbuffers (#17164 ) In flatbuffers@v23.5.9 was broken forward declaration for FlatBufferBuilder. Trying to compile onnxruntime falls with the following error: ``` flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder') typedef FlatBufferBuilderImpl<false> FlatBufferBuilder; ^ onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here class FlatBufferBuilder; ``` This PR removes these declarations and puts includes instead	2023-08-29 10:28:26 -07:00
Caroline	228db24317	Add training API functions to WASM API (#16521 ) ### Description * Created `wasm/training_api` source and header files & modified WebAssembly CMake to include training flags * The `wasm/training_api` files use an `OrtTrainingManager` handle which is a struct of an OrtCheckpointState and an OrtTrainingSession, rather than creating a CheckpointState handle & a separate TrainingSession handle. * This is so that the TypeScript side only has to manage one handle that will be passed between TrainingSession & CheckpointState representations, rather than the TypeScript side managing separate CheckpointStateHandle and TrainingSessionHandle. ### Motivation and Context WASM API needs to be updated with ORT training API function calls so that ORT training web bindings can be added for on-device training. --------- Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: carzh <carolinezhu@microsoft.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-08-28 11:05:02 -07:00
Arthur Islamov	c262879214	Added DML and CUDA provider support in onnxruntime-node (#16050 ) ### Description I've added changes to support CUDA and DML (only on Windows, on other platforms it will throw an error) ### Motivation and Context It fixes this feature request https://github.com/microsoft/onnxruntime/issues/14127 which is tracked here https://github.com/microsoft/onnxruntime/issues/14529 I was working on StableDiffusion implementation for node.js and it is very slow on CPU, so GPU support is essential. Here is a working demo with a patched and precompiled version https://github.com/dakenf/stable-diffusion-nodejs ---------	2023-08-25 16:57:06 -07:00
Yulong Wang	79c4ed9a45	[js/webgpu] support error pop and kernel name (#17260 ) ### Description This PR contains changes to support error pop and kernel name. - Add a function `JsepGetNodeName` to allow reading kernel name from JS to C++ - When in debug mode ( `env.debug = true;` ) or in profiling mode ( `env.webgpu.profilingMode = 'default';` ), kernel name will be read from ORT; otherwise use the kernel pointer ( a number ) as kernel name to save calls from JS to C++. - When in debug mode, WebGPU validation errors will be recorded and if any error occurs, `inferenceSession.run()` will fail (Promise get rejected). Behavior when not in debug mode is not changed. This is because recording errors are not zero-overhead, and GPU validation errors should occur consistently in and not in debug mode. - Add `jsepOnRunStart()` and `jsepOnRunEnd()` hook to: - allow implementation of the features mentioned above. - pass session ID to backend.	2023-08-25 08:08:15 -07:00
mindest	735cc8e6c8	[ROCm] enable If op for ROCm EP. (#17279 ) ### Description Enable If op for ROCm EP.	2023-08-25 17:49:49 +08:00
Baiju Meswani	fca81cc5d5	ConvTransposeGrad CUDA Kernel (#17201 )	2023-08-24 09:08:06 -07:00
cloudhan	87bef1f3f2	Move composable_kernel to deps.txt (#17245 )	2023-08-23 17:39:16 -07:00
kunal-vaishnavi	edac3ef150	Add LLaMA scripts (#17020 ) ### Description This PR adds the following scripts for LLaMA: - LLaMA conversion (support for TorchScript and Dynamo exporters) - LLaMA parity - LLaMA benchmark - LLaMA quantization - LLaMA integration with [Hugging Face Optimum](https://github.com/huggingface/optimum) ### Motivation and Context This PR adds scripts for using LLaMA. There is a [follow-up PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding scripts for Whisper.	2023-08-22 18:05:11 -07:00
Edward Chen	bd8a488f4b	Enable verbose logging in unit test program with environment variable. (#17133 ) Enable verbose logging in unit test program with environment variable. E.g., `ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 ./onnxruntime_test_all --gtest_filter="<test that I want to see more logs for>"`.	2023-08-22 12:13:52 -07:00
cloudhan	4e6cec4d09	Update ck and enable test (#16383 ) Apply the fix in https://github.com/ROCmSoftwarePlatform/composable_kernel/issues/728 Introduce more kernel instances and allow the introduction of streamk and splitk.	2023-08-22 11:08:55 +08:00
Sheil Kumar	cbaa008391	Bump DirectML version from 1.12.0 to 1.12.1 (#17225 ) Bump DirectML version from 1.12.0 to 1.12.1 Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-08-20 09:55:38 -07:00
Changming Sun	3cec88bd12	FIX: memory leak checker is incompatible with std::stacktrace (#17209 ) ### Description When I worked on PR #17173, I didn't notice that onnxruntime\core\platform\windows\debug_alloc.cc also needs to call dbghelp functions like SymInitialize. So, if we use vc runtime's stacktrace functionality, vc runtime will initialize/uninitialize the dbghelp library independently and vc runtime's stacktrace helper DLLs get unloaded before our memory leak checker starts get work. Then we call SymSetOptions, it crashes. More details: In VC runtime the C++23 stacktrace functions are implemented on top of dbgeng.dll. In C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\crt\src\stl\stacktrace.cpp, you can see it has: ``` dbgeng = LoadLibraryExW(L"dbgeng.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32); ``` The dbgeng.dll is a wrapper around dbghelp.dll. It calls SymInitialize and SymCleanup. dbgeng.dll gets unloaded before our memory leak check starts to run. In theory we should be able to call SymInitialize again if the previous user who called SymInitialize has also called SymCleanup. However, users can use SymRegisterCallback/SymRegisterCallback64/SymRegisterCallbackW64 to register callback functions to dbghelp.dll. These callback functions need to be alive when SymSetOptions(and some other dbghelp APIs) get called. ### Motivation and Context	2023-08-18 17:10:33 -07:00
Changming Sun	ee09a5ff35	Add DISABLE_CUSPARSE_DEPRECATED flag to CUDA build (#17207 ) This is to suppress a warning and make Windows CUDA 12.2 build work.	2023-08-18 10:25:49 -07:00
Chi Lo	2fb148dd88	Temporarily enforce "Debug build" TRT EP with trt oss parser on Windows (#17059 ) This PR handles two changes: 1. There is an issue when running "Debug build" TRT EP with "Release build" TRT builtin parser on Windows. Enforce use oss parser for Debug build. Note: args.config in build.py is an array, for example ["Debug", "Release"...]. The code will be much mess if we made the change there. 2. Update to use latest commit of oss parser. Please see the https://github.com/microsoft/onnxruntime/issues/16273	2023-08-17 12:17:25 -07:00
Changming Sun	5249b7ab7c	Re-implement stacktrace (#17173 ) ### Description Re-implement stacktrace. The new implementation doesn't directly use Windows API, hence can avoid problems regarding to initialize/uninitialize the dbghelp library. ### Motivation and Context	2023-08-16 16:07:49 -07:00
Dmitri Smirnov	f45eef399e	Fix visualization issues with Attribute/Tensor protos (#17188 ) ### Description Protobuf Natvis	2023-08-16 13:56:51 -07:00
RandySheriffH	3dd2c1b4d7	EP context for custom op (#16454 ) Implement infrastructures to allow EP resources surfaced to custom ops. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-08-16 13:03:40 -07:00
Maximilian Müller	7b9d1f18c7	NVTX windows include and link fixes (#16831 ) ### Description For windows headers are not duplicated to the normal cuda include. For linux they are: ``` (base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 \| grep nvTool nvToolsExt.h nvToolsExtCuda.h nvToolsExtCudaRt.h nvToolsExtOpenCL.h nvToolsExtSync.h (base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include \| grep nvTool nvToolsExt.h nvToolsExtCuda.h nvToolsExtCudaRt.h nvToolsExtOpenCL.h nvToolsExtSync.h ``` Is the preference via those added defines or should the include just be changed to be `nvtx3/` ? Also there is no library linking needed on Windows and the library is not even present.	2023-08-16 11:53:58 -07:00
Changming Sun	8e203efc69	Cleanup cmake file (#17154 ) ### Description 1. Clean up cmake files. Remove some unused code 2. Remove the "Semmle" task from tools/ci_build/github/azure-pipelines/templates/win-ci.yml. Semmle is deprecated and replaced by CodeQL.	2023-08-15 10:51:33 -07:00
Matthieu Darbois	5e971bc51a	Rework WIL dependency retrieval/usage (#17130 ) ### Description 1. `onnxruntime_fetchcontent_makeavailable` works around unconditional install commands so that can be used instead of `FetchContent_Populate` 2. This dependency is Windows specific, mark it as such. ### Motivation and Context 1. This simplifies `cmake/external/wil.cmake` not to do anything specific wether WIL was fetched or found 2. Given it's specific to Windows, it might not be available on other OS in specific air-gapped environment such as [conan-center-index](https://github.com/conan-io/conan-center-index). This allows downstream builds not to require specific patches for something not required by the build in the first place.	2023-08-15 09:11:46 -07:00
Wenbing Li	d052c8a45c	Remove the extensions submodule (#17097 ) ### Description Remove the onnxruntime-extensions submodule since it now was used via cmake FetchContent ### Motivation and Context The submodule relies on an outdated version of the extensions, and the build instructions should be updated to eliminate any confusion.	2023-08-14 10:16:33 -07:00

1 2 3 4 5 ...

1481 commits