Commit graph

307 commits

Author SHA1 Message Date
Yulong Wang
c5d28cac4d
Initial WebGPU EP checkin (#22318)
### Description

This change introduces the WebGPU EP into ONNX Runtime.

To make the PR as simple as possible, this PR excluded the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update

This PR now contains only 43 file changes (while the working branch
contains 130+) and hopefully this makes it easier to review.

There is going to be separated PRs for each mentioned above.

Current working branch: #21904
2024-10-08 16:10:46 -07:00
Chen Feiyue
fffd430091
[VSINPU]Code improvement && Slice/Dropout OP support (#21217)
### Description
- Refactor codes to meet line length limit and guard missing warning
- Add slice/dropout op support
- Move vsinpu ep's cmake settings from onnxruntime_providers.cmake to a
separate file
- Modify apis with param onnxruntime::Path because this kind is replaced
by std:filesystem::path by #20920
2024-07-09 20:14:46 -07:00
Chen Feiyue
56b36a58ba
Initial PR for VSINPU execution provider (#20903)
### Description
<!-- Describe your changes. -->
-It is an initial PR for VSINPU execution provider



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- For support VeriSilicon hardware
- TIM-VX(Tensor Interface Module)
(https://github.com/VeriSilicon/TIM-VX) is an integrated software
solution by Verisilicon for our hardware(A311D/i.MX 8M Plus etc.)
design, it is easy to use Verisilicon’s hardware by simply connecting
onnxruntime with the TIM-VX API by this VSINPU execution provider.
2024-06-28 21:48:34 -07:00
Scott McKay
4e5119760d
Add initial support for CoreML ML Program to the CoreML EP. (#19347)
### Description
<!-- Describe your changes. -->
Adds infrastructure to create an ML Package containing the Model using
ML Program. Updated coremltools files to v7.1 to bring in new protobuf
definitions along with the tools to write the weight.bin file and create
an ML Package correctly.

Enables building a CoreML Model on all platforms which means all the
operator builder code can be debugged anywhere. Execution of the
generated CoreML model is obviously limited to Apple platforms.

The Conv operator builder has been updated to be able to generate an ML
Program Operation.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
NeuralNetwork is no longer being developed and ML Program is the
replacement going forward.
2024-02-15 08:46:03 +10:00
Scott McKay
debd1cab10
Add coremltools 7.1 as a dependency (#19389)
### Description
<!-- Describe your changes. -->
Setup usage of coremltools via dependencies instead of copying files. 
Pull in some changes from
https://github.com/microsoft/onnxruntime/pull/19347 in preparation for
supporting ML Program and enabling building the ML Model on all
platforms to make development and testing of CoreML EP code easier.

- Update to coremltools 7.1 
- Add patch for changes required for cross platform build of ML Program
related code
- Generate coreml proto files on all platforms
- mainly to test these changes work everywhere, as the proto files will
be used on all platforms when #19347 is checked in
- rename onnxruntime_coreml_proto target to coreml_proto as it contains
purely coreml protobuf code with no ORT related chagnes

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve setup.
2024-02-03 09:42:21 +10:00
Numfor Tiapo
b8f373b0ae
Add API for NPU Device Selection in the DML EP (#17612)
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-10-11 14:53:00 -07:00
Changming Sun
05ac9f6f2a
Split onnxruntime_providers.cmake to multiple (#17853)
### Description
Split onnxruntime_providers.cmake to multiple files, for easier editing.
No other change was made in this PR.
2023-10-09 20:33:44 -07:00
cloudhan
c2bd5b70b2
Fix enable_training and use_migraphx (#17827) 2023-10-08 11:43:27 +08:00
MistEO
faf9a0f6c7
Fix runtime installation error (#17828) 2023-10-07 11:50:02 -07:00
Wei-Sheng Chin
faef9c32fa
ONNX-Native Tensor Parallel: Using Distributed MatMul as Example (#17695)
This PR introduces
- New data structure to represent kernel-level (aka node-level or
op-level) tensor sharding informaiton. I consider it as the
fundamentaion of ONNX distribtued inference.
- Building blocks for distribtued kernels implementation especially
stateless implementation for communication ops.
- Implementation of DistributedMatMul and its tests.

Code structure:
- sharding.h/.cc: Function to shard and reshard tensors (calling into
NCCL).
- sharding_spec.h/.cc: Representation of how a tensor is sharded.
- distributed_matmul.h/.cc: Implementation of tensor parallel MatMul.
Inputs and outputs are sharded across devices.
- onnxruntime_test_distributed.py: distributed operator tests.

Example of specifying sharding information
```python
        @onnxscript.script()
        def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT:
            # Run MatMul by sharding x along column axis and w along row axis on
            # 2 GPUs.
            return MICROSOFT_OPSET.DistributedMatMul(
                tensor_x,
                tensor_w,
                device_mesh_shape=[2],
                device_mesh_elements=[0, 1],
                input_shard_specs=["RS[0]", "S[0]R"],
                output_shard_specs=["RR"],
            )
        onnx_model = matmul_rs_sr_rr.to_model_proto(
            input_types=[FLOAT[2, "s"], FLOAT["s", 2]],
            output_types=[FLOAT[2, 2]],
        )
```

In this example, the device mesh can be visualized as 1-D tensor, `[0,
1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the
0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is
sharded across `[0, 1]` as well.

C++ classes to represent tensor sharding (copied from sharding_spec.h):
```cpp
class DeviceMesh {
 public:
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // Device mesh is a tensor of device indices.
  // A tensor can then be partitioned along specific mesh axes.
  //
  // Assume we have 4 GPUs indexed by 0, 1, 2, and 3.
  // Let's consider some examples.
  //  1. 1D device mesh [0, 1, 2, 3]. In this case,
  //     device_mesh_shape is [4] and device_mesh_elements
  //     is [0, 1, 2, 3].
  //     If we want to shard a 2-D tensor along its axis 1, the
  //     corresponding sharding spec is a string "RS[0]".
  //  2. 2D device mesh [[0, 1], [2, 3]]. In this case,
  //     device_mesh_shape is [2, 2] and device_mesh_elements
  //     is [0, 1, 2, 3].
  //     If we want to shard a 2-D tensor's
  //     rows along mesh axis 1 and
  //     columns along mesh axis 0, the
  //     corresponding sharding spec is a string "S[1]S[0]".
  //     If that 2-D tensor's value is np.array([[5, 6], [7, 8]]),
  //     GPU 0/1/2/3 owns 5/7/6/8.  Below is a visualization the sharding
  //     proccess.
  //     - Start with a 2-D device mesh [[0, 1], [2, 3]] and
  //       a 2-D tensor [[5, 6], [7, 8]]
  //       - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]]
  //     - Split GPU mesh along axis 1 and tensor along
  //       axis 0 for "S[1]" in "S[1]S[0]"
  //       - GPU: [[0], [2]], Tensor: [[5, 6]]
  //         GPU: [[1], [3]], Tensor: [[7, 8]]
  //     - Split GPU mesh along axis 0 and tensor along
  //       axis 1 for "S[0]" in "S[1]S[0]"
  //       - GPU: [[0]], Tensor: [[5]]
  //       - GPU: [[2]], Tensor: [[6]]
  //       - GPU: [[1]], Tensor: [[7]]
  //       - GPU: [[3]], Tensor: [[8]]

  // Actual shape of device mesh represented by `device_mesh_elements`.
  std::vector<int64_t> device_mesh_shape;

  // Flattened device mesh.
  std::vector<int64_t> device_mesh_elements;
};

class AxisPartitionSpec {
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // This class is the in-memory representation of
  //  1. if a tensor is sharded or not (aka replica), and
  //  2. which tensor axis is shard by which device mesh axis.
  // Let's consider sharding 2-D tensor along column axis on
  // device mesh [0, 1] as an example.
  // The required sharding spec RS[0] can be represented by
  // - AxisPartitionSpec(Condition::Replica, -1)
  // - AxisPartitionSpec(Condition::Shard, 0)
 public:
  // Status of a tensor axis.
  // A tensor axis can be either sharded or replicated
  // along a device mesh axis.
  enum class Condition { Replica,
                         Shard };

  // This field tells if a tensor axis is sharded or not.
  Condition cond;

  // If a tensor axis is sharded, this field tells which device
  // mesh axis to distribute the shards along.
  // If a tensor axis is not sharded, this field is ignored.
  int device_mesh_axis;

  // A helper to construct a replica spec for a tensor axis.
  static AxisPartitionSpec CreateReplica() {
    return AxisPartitionSpec(Condition::Replica, -1);
  }

  // A helper to construct a sharding spec for a tensor axis.
  // This tensor axis is sharded along `device_mesh_axis` in device mesh.
  static AxisPartitionSpec CreateShard(int device_mesh_axis) {
    return AxisPartitionSpec(Condition::Shard, device_mesh_axis);
  }
};

class TensorPartitionSpec {
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // TensorPartitionSpec holds a collection of AxisPartitionSpec and an
  // associated DeviceMesh. It is responsible for determining how a tensor
  // should be partitioned across a device mesh.
  //
  // Example 1: RS[0]
  // In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects.
  // - The first object is a Replica, denoting that the first axis of the tensor is
  //   not sharded but is instead replicated.
  // - The second object is a Shard along the 0-th axis of the device mesh. It denotes
  //   that the second axis of the tensor is sharded along the first axis of the
  //   device mesh.
  //
  // Example 2: S[0]RR
  // In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects.
  // - The first object is a Shard along the 0-th axis of the device mesh, indicating
  //   that the first axis of the tensor is sharded along the first axis of the
  //   device mesh.
  // - The second and third objects are Replicas, indicating that the second and third
  //   axes of the tensor are not sharded but are instead replicated.
 public:
  // axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor,
  //                axis_specs[0] is for row axis and axis_specs[1] is for
  //                column axis. axis_specs[i].device_mesh_axis = j means that
  //                tensor axis i is sharded along device mesh axis j.
  std::vector<AxisPartitionSpec> axis_specs;

  // device_mesh: DeviceMesh for sharding the associated tensor.
  // Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment.
  DeviceMesh device_mesh;
};
```
2023-10-05 14:22:25 -07:00
Changming Sun
14d349e290
Enable backtrace in unit tests (#17655)
### Description
Google test can be built either with absl/re2 or not. This PR enables
the build option so that google test framework can print out a nice
stacktrace when something went wrong. It helps locate test errors in CI
build pipelines.

Also, Google test will remove the build option and make it always ON. So
sooner or later we must make this change.
2023-09-29 12:32:56 -07:00
Mustafa Ateş Uzun
13b0f8a6ce
fix: supported typo (#17216) 2023-09-27 10:45:27 -07:00
Chi Lo
b827ab0efc
[TRT EP] Fix build error for building oss onnx-tensorrt parser (#17468)
If building ORT TRT with `--use_tensorrt_oss_parse` (meaning ORT wil
include [oss onnx-tensorrt
parser](https://github.com/onnx/onnx-tensorrt/blob/main/CMakeLists.txt#L82)
and build it from source) ,the cmake CUDA_INCLUDE_DIR variable is
needed.

if not, you will encounter following [ build
error](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1133937&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=39e3f98f-7fe5-578c-20bd-5ae5a4590bda):

CMake Error: The following variables are used in this project, but they
are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the
CMake files:
    /build/Release/_deps/onnx_tensorrt-src/CUDA_INCLUDE_DIR

Note: Not quite sure why in the past when CI still tested with oss
parser won't hit this issue. probably the CUDA_INCLUDE_DIR was defined
somewhere back then.
2023-09-08 20:34:57 -07:00
cloudhan
6ea3908db4
Add ck's streamk and splitk gemm impl (#17280) 2023-09-04 11:49:07 +08:00
aciddelgado
44101e8771
Flash Attention v2 MHA (#17227)
### Description
Integrate Flash Attention V2 to PackedMultiHeadAttention,
MultiHeadAttention and Attention operators.

Flash Attention v2 source code is from
https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src.
We did some change to remove dependency on Torch, then removed backward
and bfloat16 related code.

Add benchmark script (see benchmark_mha.sh) to compare different
attention kernels for MultiHeadAttention operator.

Current limitations for Flash Attention in PackedMultiHeadAttention,
MultiHeadAttention and Attention operators:
* Relative Position Bias is not supported
* Different hidden size for Q and V is not supported
* Only float16 is supported
* Padding/attention mask is not supported
* For MultiHeadAttention, when there is past or present input, bias
shall be provided to activate flash attention
* For Attention, past or present inputs will deactivate flash attention
* Causal is not supported

Some limitations (like attention mask and causal) might be removed
later.

Currently, Flash Attention v2 only works in Linux. For Windows, we will
enable later with Cutlass 3.2.

Two environment variables can be used for testing purpose:
(1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default
value is 0 (enable). Set it to "1" to disable it.
(2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is
"513", which means that we only enable flash attention when sequence
length is larger than 512 for packed QKV format. Set it to "0" if you
want to use flash attention v2 whenever possible.

### Speedup

The following result is from Standard_ND96amsr_A100_v4 VM
(A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per
second for MultiHeadAttention operator.

There are 3 input formats:
* `Q,K,V` means separated inputs query, key and value of BxSxNH
* `Q,KV` means packed KV, where key is 5D: BxSxNx2xH
* `QKV` means packed QKV, where query is 5D: BxSxNx3xH

Note that flash attention cannot use packed QKV format, so extra
Transpose is needed. We found that TensorRT kernel is faster for
sequence length <= 512 for packed QKV. The reason might be no transpose
is needed for TensorRT kernel in this format.

We also notice that, TensorRT kernel is faster for stable diffusion
512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while
flash attention v2 is faster for 1024x1024 image (see seq_len=16384,
heads=8, head_dim=40 below).

input format | batch size | sequence length | heads | head dim |
flash_v2 (TFLOPs/s) | TensorRT (TFLOPs/s) | Memory Efficient Attention
(TFLOPs/s)
-- | -- | -- | -- | -- | -- | -- | --
Q,K,V | 32 | 512 | 64 | 32 | 78.1 | 60.0 | 39.3
Q,K,V | 32 | 512 | 128 | 16 | 46.8 | 44.1 | 21.7
Q,K,V | 16 | 1024 | 64 | 32 | 99.0 | 72.8 | 44.3
Q,K,V | 16 | 1024 | 128 | 16 | 54.7 | 49.2 | 23.4
Q,K,V | 8 | 2048 | 64 | 32 | 113.8 | 81.2 | 47.8
Q,K,V | 8 | 2048 | 128 | 16 | 59.7 | 51.9 | 24.7
Q,K,V | 4 | 4096 | 64 | 32 | 122.5 | 85.6 | 49.7
Q,K,V | 4 | 4096 | 128 | 16 | 62.5 | 53.3 | 25.3
Q,K,V | 2 | 8192 | 64 | 32 | 127.4 | 87.5 | 50.7
Q,K,V | 2 | 8192 | 128 | 16 | 64.0 | 54.2 | 25.6
Q,K,V | 1 | 16384 | 64 | 32 | 129.5 | 91.0 | 51.2
Q,K,V | 1 | 16384 | 128 | 16 | 64.7 | 54.5 | 25.8
Q,K,V | 1 | 4096 | 8 | 40 | 51.0 | 43.6 | 36.8
Q,K,V | 1 | 4096 | 8 | 80 | 97.7 | 77.0 | 55.5
Q,K,V | 1 | 4096 | 8 | 160 | 120.0 | 39.7 | 57.8
Q,K,V | 4 | 4096 | 8 | 40 | 89.0 | 84.4 | 49.2
Q,K,V | 4 | 4096 | 8 | 80 | 133.0 | 92.2 | 63.2
Q,K,V | 4 | 4096 | 8 | 160 | 164.8 | 42.7 | 63.8
Q,K,V | 1 | 16384 | 8 | 40 | 96.9 | 91.3 | 52.1
Q,K,V | 1 | 16384 | 8 | 80 | 142.9 | 101.5 | 65.6
Q,K,V | 1 | 16384 | 8 | 160 | 177.4 | 44.2 | 65.7
Q,K,V | 128 | 128 | 12 | 64 | 29.0 | 26.9 | 25.7
Q,K,V | 64 | 128 | 12 | 64 | 23.1 | 10.8 | 21.3
Q,K,V | 128 | 384 | 12 | 64 | 83.5 | 60.8 | 55.7
Q,K,V | 64 | 384 | 12 | 64 | 72.6 | 40.5 | 52.8
Q,K,V | 128 | 512 | 12 | 64 | 98.9 | 77.9 | 62.1
Q,K,V | 64 | 512 | 12 | 64 | 94.7 | 75.6 | 60.4
Q,KV | 32 | 512 | 64 | 32 | 85.9 | 41.1 | 41.1
Q,KV | 32 | 512 | 128 | 16 | 47.1 | 21.6 | 21.6
Q,KV | 16 | 1024 | 64 | 32 | 104.4 | 45.8 | 45.8
Q,KV | 16 | 1024 | 128 | 16 | 54.7 | 23.6 | 23.6
Q,KV | 8 | 2048 | 64 | 32 | 116.8 | 48.5 | 48.5
Q,KV | 8 | 2048 | 128 | 16 | 59.8 | 24.7 | 24.7
Q,KV | 4 | 4096 | 64 | 32 | 124.2 | 50.1 | 50.1
Q,KV | 4 | 4096 | 128 | 16 | 62.6 | 25.3 | 25.3
Q,KV | 2 | 8192 | 64 | 32 | 128.5 | 50.8 | 50.9
Q,KV | 2 | 8192 | 128 | 16 | 64.1 | 25.6 | 25.6
Q,KV | 1 | 16384 | 64 | 32 | 129.4 | 51.2 | 51.2
Q,KV | 1 | 16384 | 128 | 16 | 64.8 | 25.8 | 25.8
Q,KV | 1 | 4096 | 8 | 40 | 67.5 | 37.7 | 37.5
Q,KV | 1 | 4096 | 8 | 80 | 101.3 | 56.7 | 56.6
Q,KV | 1 | 4096 | 8 | 160 | 124.0 | 58.6 | 58.6
Q,KV | 4 | 4096 | 8 | 40 | 90.8 | 49.8 | 49.8
Q,KV | 4 | 4096 | 8 | 80 | 135.6 | 63.8 | 63.8
Q,KV | 4 | 4096 | 8 | 160 | 166.3 | 64.5 | 64.5
Q,KV | 1 | 16384 | 8 | 40 | 97.5 | 52.3 | 52.3
Q,KV | 1 | 16384 | 8 | 80 | 143.5 | 65.9 | 65.8
Q,KV | 1 | 16384 | 8 | 160 | 178.4 | 65.9 | 65.8
Q,KV | 128 | 128 | 12 | 64 | 26.8 | 48.1 | 30.9
Q,KV | 64 | 128 | 12 | 64 | 28.0 | 38.9 | 25.0
Q,KV | 128 | 384 | 12 | 64 | 97.7 | 61.1 | 61.0
Q,KV | 64 | 384 | 12 | 64 | 89.5 | 57.8 | 57.9
Q,KV | 128 | 512 | 12 | 64 | 111.9 | 66.7 | 66.9
Q,KV | 64 | 512 | 12 | 64 | 107.2 | 64.9 | 64.8
QKV | 32 | 512 | 64 | 32 | 77.2 | 84.7 | 39.3
QKV | 32 | 512 | 128 | 16 | 43.4 | 53.1 | 20.9
QKV | 16 | 1024 | 64 | 32 | 98.8 | 87.4 | 44.6
QKV | 16 | 1024 | 128 | 16 | 52.0 | 54.1 | 23.2
QKV | 8 | 2048 | 64 | 32 | 113.1 | 89.0 | 47.9
QKV | 8 | 2048 | 128 | 16 | 58.2 | 54.6 | 24.5
QKV | 4 | 4096 | 64 | 32 | 120.6 | 89.7 | 49.7
QKV | 4 | 4096 | 128 | 16 | 61.7 | 54.6 | 25.2
QKV | 2 | 8192 | 64 | 32 | 125.9 | 89.5 | 50.7
QKV | 2 | 8192 | 128 | 16 | 63.6 | 54.8 | 25.5
QKV | 1 | 16384 | 64 | 32 | 128.5 | 92.0 | 51.2
QKV | 1 | 16384 | 128 | 16 | 64.6 | 54.8 | 25.7
QKV | 1 | 4096 | 8 | 40 | 60.2 | **69.8** | 38.1
QKV | 1 | 4096 | 8 | 80 | 101.6 | 75.2 | 56.7
QKV | 1 | 4096 | 8 | 160 | 130.2 | 41.2 | 58.4
QKV | 4 | 4096 | 8 | 40 | 90.6 | **91.0** | 49.5
QKV | 4 | 4096 | 8 | 80 | 133.6 | 98.1 | 62.8
QKV | 4 | 4096 | 8 | 160 | 165.3 | 43.7 | 63.9
QKV | 1 | 16384 | 8 | 40 | 97.2 | 92.8 | 52.1
QKV | 1 | 16384 | 8 | 80 | 143.0 | 103.1 | 65.6
QKV | 1 | 16384 | 8 | 160 | 177.6 | 44.5 | 65.7
QKV | 128 | 128 | 12 | 64 | 31.1 | 65.9 | 27.6
QKV | 64 | 128 | 12 | 64 | 26.1 | 49.8 | 23.5
QKV | 128 | 384 | 12 | 64 | 84.6 | 88.5 | 56.1
QKV | 64 | 384 | 12 | 64 | 79.1 | 80.3 | 53.5
QKV | 128 | 512 | 12 | 64 | 97.3 | 114.2 | 62.2
QKV | 64 | 512 | 12 | 64 | 95.9 | 110.7 | 60.6
QKV | 4 | 2048 | 32 | 128 | 125.26 | 44.72 | 78.15
QKV | 4 | 4096 | 32 | 128 | 141.62 | 46.29 | 85.84
QKV | 8 | 2048 | 32 | 128 | 127.40 | 45.49 | 78.75
QKV | 8 | 4096 | 32 | 128 | 144.24 | 46.60 | 86.95

### Known Issues

NVCC uses huge memory while compiling flash attention CUDA kernel. Linux
build with CUDA might fail when machine has limited memory while number
of CPUs is large. Walkaround is to use a build machine with larger
memory, or use argument like `--nvcc_threads 1` to limit nvcc threads in
build.

### Motivation and Context
Increases speed and efficiency of MHA or Packed MHA.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-08-31 13:52:21 -07:00
Wanming Lin
3a53836836
[WebNN EP] Fix compilation with newer flatbuffers (#17367) 2023-08-31 10:22:15 -07:00
Artem Shilkin
6e60dba726
Fix compilation with newer flatbuffers (#17164)
In flatbuffers@v23.5.9 was broken forward declaration for
FlatBufferBuilder. Trying to compile onnxruntime falls with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
                                     ^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
    class FlatBufferBuilder;
```
This PR removes these declarations and puts includes instead
2023-08-29 10:28:26 -07:00
Chi Lo
2fb148dd88
Temporarily enforce "Debug build" TRT EP with trt oss parser on Windows (#17059)
This PR handles two changes:

1. There is an issue when running "Debug build" TRT EP with "Release
build" TRT builtin parser on Windows. Enforce use oss parser for Debug
build.
Note: args.config in build.py is an array, for example ["Debug",
"Release"...]. The code will be much mess if we made the change there.
2. Update to use latest commit of oss parser.

Please see the https://github.com/microsoft/onnxruntime/issues/16273
2023-08-17 12:17:25 -07:00
Maximilian Müller
7b9d1f18c7
NVTX windows include and link fixes (#16831)
### Description

For windows headers are not duplicated to the normal cuda include. For
linux they are:
```
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
```
Is the preference via those added defines or should the include just be
changed to be `nvtx3/` ?

Also there is no library linking needed on Windows and the library is
not even present.
2023-08-16 11:53:58 -07:00
Matthieu Darbois
5e971bc51a
Rework WIL dependency retrieval/usage (#17130)
### Description
1. `onnxruntime_fetchcontent_makeavailable` works around unconditional
install commands so that can be used instead of `FetchContent_Populate`
2. This dependency is Windows specific, mark it as such.

### Motivation and Context
1. This simplifies `cmake/external/wil.cmake` not to do anything
specific wether WIL was fetched or found
2. Given it's specific to Windows, it might not be available on other OS
in specific air-gapped environment such as
[conan-center-index](https://github.com/conan-io/conan-center-index).
This allows downstream builds not to require specific patches for
something not required by the build in the first place.
2023-08-15 09:11:46 -07:00
Jeff Daily
dbbfc249f7
[ROCm] update header and binary search paths used by cmake (#17083)
This is in preparation for planned ROCm 6.0 changes that are not
backward compatible. However, the adjustments made by this PR to the
current onnxruntime cmake files will work with ROCm 5.x and 6.x.
2023-08-10 11:05:21 +08:00
satyajandhyala
dd24d52737
[JS/Web] Added Gelu contrib operator support to JSEP (#16909)
### Description
Added Gelu operator to JSEP


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-31 09:18:58 -07:00
Vincent Wang
c07a3b869c
Triton Codegen for ORTModule (#15831)
Fuse connected elementwise and reduce Ops to TritonOp and codegen triton
code to run the kernel.

This PR is co-edited by @wejoncy and @er3x3
2023-07-13 18:17:58 +08:00
Edward Chen
6be7b03e53
Enable -Wshorten-64-to-32 warning if available. (#16524)
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
2023-07-07 08:11:44 -07:00
Chi Lo
4e3cff60fd
CUDA graph support for TRT EP (#16081)
CUDA EP already supports [CUDA
graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
also we observed some models can benefit from using CUDA graph with
`trtexec`. Therefore, this PR enables the CUDA graph support for TRT EP.

The implementation is based on
https://github.com/microsoft/onnxruntime/pull/9978 with the same
[constraints](https://github.com/microsoft/onnxruntime/pull/9978) as
below:

- Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
- Usage of CUDA Graphs is limited to models where-in all the model ops
(graph nodes) can be partitioned to the TRT EP.
- The input/output types of models need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- IObinding is required.
2023-06-21 09:36:45 -07:00
Yuhong Guo
48e6186b1a
Move tests from core/providers/cuda/test/* to test/providers/cuda/ and refactor CUDA UT (#16161)
### Description
<!-- Describe your changes. -->

1. Add a new test lib `onnxruntime_providers_cuda_ut` which is similar
to `onnxruntime_providers_cuda` but `onnxruntime_providers_cuda_ut` is
only built if `onnxruntime_BUILD_UNIT_TESTS` is set. We can call all
CUDA UTs through this ut lib without affecting production lib
`onnxruntime_providers_cuda`.
2. Move all test cases from `core/providers/cuda/test/` to
`test/providers/cuda/`. These test cases are built into lib
`onnxruntime_providers_cuda_ut` and run by `./onnxruntime_test_all
--gtest_filter="*CUDA_EP_Unittest*"`. Since the lib is only for test, we
can use gtest macros in the test cases. Previous implementation do not
support using gtest lib in the CUDA UT cases.
3. The cmake code in `cmake/onnxruntime_providers.cmake` is refactored a
bit. A new function `onnxruntime_add_object_library` is to build a
object target. The 2 libs `onnxruntime_providers_cuda_ut` &
`onnxruntime_providers_cuda` share most of the code, so the object files
can be used in both libs, which helps reduce build time. Another
function `config_cuda_provider_shared_module` is used to configure all 3
similar
targets(onnxruntime_providers_cuda_obj/onnxruntime_providers_cuda/onnxruntime_providers_cuda_ut).
4. Refactored the test to call `testing::InitGoogleTest` &
`RUN_ALL_TESTS` in `libonnxruntime_providers_cuda_ut.so`'s `TestAll`.
After this change, we can see all the cases running in
`CUDA_EP_Unittest.All`:

![image](https://github.com/microsoft/onnxruntime/assets/19584326/8ff80df6-060b-4ef0-90b7-657e68d3db87)




### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

After https://github.com/microsoft/onnxruntime/pull/13016, there are
still test files in test/providers/cuda/ that are not moved to
core/providers/cuda/test/ and the test cases are disabled. This PR helps
to clean the unfinished TODOs.

Even through onnxruntime_shared_lib_test covers some test for CUDA
provider. onnxruntime_shared_lib_test works like a coarse grain
end-to-end test for CUDA provider. If CUDA unittest can run cases for a
single component, this wound be helpful for CUDA developers.

---------

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-06-20 14:54:55 -07:00
Prateek Chokse
12dffef768
added support for cmake "find_package" (#8919)
**Description**: 
Adds support for cmake find_package.

**Motivation and Context**
As mentioned in issue #7150 onnxruntime doesn't have support for CMake
find_package, this PR adds that and also adds the CMake package version
file. Now anyone can link onnxruntime like this:
```cmake
find_package(onnxruntime)
add_executable(test Source.cpp)
target_link_libraries(test PRIVATE onnxruntime::onnxruntime)
```
this also simplifies #3124
2023-06-19 22:20:31 -07:00
saurabh
a6ce7b339f
Enable model subgraph execution in OVEP and setting the OpenVINO dll's to the path from the OpenVINO pypi packge in OVEP and fix OVEP windows io buffer sample (#16147)
### Description
This PR enables execution of subgraphs in OVEP and currently, when OVEP
developers install the onnxruntime-openvino package on windows from
pypi, they would have to additionally download OpenVINO windows binaries
and run the setupvars.bat script which sets the environment PATH to
locate the OV dll's. Also this PR fixes issues of OVEP windows io buffer
sample.



### Motivation and Context
Fix: We want to make the user experience easy for OVEP Python developers
on windows platform.
This fix, introduces a function add_openvino_libs_to_path at the
location tools/python/util/add_openvino_win_libs.py.
The above function, can be called by OVEP python users in the
application code and that takes care of setting
the OpenVINO dll's to the path from the OpenVINO pypi packge (openvino)
which was installed.
This change also makes sure that add_openvino_libs_to_path() function is
added to onnxruntime python package
only when it is build for OpenVINO Execution Provider for ONNXRuntime
and not for default ORT python package builds.

New user experience for Python OVEP developers on windows platform:
step 1: pip install onnxruntime-openvino
step 2: pip install openvino
step 3: <Add these 2 lines in the application code>
import onnxruntime.tools.add_openvino_win_libs as utils
utils.add_openvino_libs_to_path()

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
2023-06-16 19:47:09 -07:00
satyajandhyala
889f80082f
[js/web] Added Reduce operators support (#16122)
### Description
Added support for ReduceL1, ReduceL2, ReduceMean, ReduceMin, ReduceMax,
ReduceSum, ReduceLogSum, ReduceLogSumExp, ReduceProd and
ReduceSquareSum.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>
Co-authored-by: guschmue <guschmue@microsoft.com>
2023-06-12 07:46:27 -07:00
Changming Sun
7686193c40
Fix DNNL build (#16201) 2023-06-02 09:46:03 +08:00
Changming Sun
0204594f90
Cleanup WASM cmake code (#15996)
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more nature.
For example,

```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
2023-05-20 18:07:39 -07:00
kailums
f62f722c70
integrate triton into ort (#15862)
### Description
In some scenarios, the triton written kernels are more performant than
CK or other handwritten kernels, so we implement a framework that
onnxruntime can use these triton written kernels.

This PR is to integrate triton into ort, so that ort can use kernels
that written and compiled by triton.

The main change focus on two part:
1. a build part to compile triton written kernel and combine these
kernels into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launch triton written
kernels.

#### Build

To compile triton written kernel, add a script
`tools/ci_build/compile_triton.py`. This script will dynamic load all
kernel files, compile them, and generate `triton_kernel_infos.a` and
`triton_kernel_infos.h`.

`triton_kernel_infos.a` contains all compiled kernel instructions, this
file will be combined into libonnxruntime_providers_rocm.so, using
--whole-archive flag.

`triton_kernel_infos.h` defines a const array that contains all the
metadata for each compiled kernel. These metadata will be used for load
and launch. So this header file is included by 'triton_kernel.cu' which
defines load and launch functions.

Add a build flag in build.py and CMakeList.txt, when building rocm
provider, it will call triton_kernel build command, and generate all
necessary files.

#### C++ Load and Launch

On c++ part, we implement load and launch functions in triton_kernel.cu
and triton_kernel.h.

These two files located in `providers/cuda`, and when compiling rocm,
they will be hipified. so this part supports both cuda and rocm. But
currently we only call triton kernel in rocm.

We also implement a softmax triton op for example. Because there will
generate many kernels for different input shape of softmax, we use
TunableOp to select the best one.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-17 09:35:28 +08:00
cloudhan
dc383ed4ce
Basic CSharp packaging support for ROCm EP (#15535)
This PR mainly fixes building errors when trying to build nupkg for ROCm EP.
It also slighly improve the packaging logic so that devlopers can
produce the nupkg on linux natively.
2023-05-16 07:27:38 +08:00
Wanming Lin
00b1e79e04
Support WebNN EP (#15698)
**Description**: 

This PR intends to enable WebNN EP in ONNX Runtime Web. It translates
the ONNX nodes by [WebNN
API](https://webmachinelearning.github.io/webnn/), which is implemented
in C++ and uses Emscripten [Embind
API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#).
Temporarily using preferred layout **NHWC** for WebNN graph partitions
since the restriction in WebNN XNNPack backend implementation and the
ongoing
[discussion](https://github.com/webmachinelearning/webnn/issues/324) in
WebNN spec that whether WebNN should support both 'NHWC' and 'NCHW'
layouts. No WebNN native EP, only for Web.

**Motivation and Context**:
Allow ONNXRuntime Web developers to access WebNN API to benefit from
hardware acceleration.

**WebNN API Implementation Status in Chromium**:
- Tracked in Chromium issue:
[#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291)
- **CPU device**: based on XNNPack backend, and had been available on
Chrome Canary M112 behind "#enable-experimental-web-platform-features"
flag for Windows and Linux platforms. Further implementation for more
ops is ongoing.
- **GPU device**: based on DML, implementation is ongoing.

**Open**:
- GitHub CI: WebNN currently is only available on Chrome Canary/Dev with
XNNPack backend for Linux and Windows. This is an open to reviewers to
help identify which GitHub CI should involved the WebNN EP and guide me
to enable it. Thanks!
2023-05-08 21:25:10 -07:00
Chunye Wang@AMD
d35850c142
[VitisAI]Update VitisAI EP to be compatible with VitisAI 3.5 (#15673)
### Description

Originally VitisAI EP only works with old version of VitisAI release. 


### Motivation and Context

Update VitisAI EP so that it works together with the current VitisiAI
3.5 and further version of VitisAI. We try our best to make it forward
compatible.

---------

Co-authored-by: Wang Chunye <chunywan@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com>
Co-authored-by: shoucair <shoucai.ren@amd.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
Co-authored-by: BoarQing <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-05-01 08:28:26 -07:00
Changming Sun
65020d433e
Prefast fixes for CUDA EP (#15726)
### Description
1. Adjust cmake flags. Do not modify CMAKE_CXX_FLAGS globally. Only
apply the flags to ORT code.
2. Fix some SDL warnings.
2023-04-29 12:43:12 -07:00
Yulong Wang
b98317b907
[js/webgpu] following up for JSEP/WebGPU code cleanup (#15666)
### Description
This PR resolves a part of non-critical comments from code review
comments in #14579.

- use `USE_JSEP` instead of `USE_JS` in build definition to make it less
ambiguous
- remove unused util functions from util.ts
- fix transpose.h
- other misc fixes
2023-04-25 21:20:03 -07:00
Yulong Wang
14cc02c65c
[js/web] WebGPU backend via JSEP (#14579)
### Description
This change introduced the following new components into ONNX Runtime
Web:
- JavaScript Execution Provider (JSEP)
  - Asynchronized inferencing execution powered by Emscripten's Asyncify
- WebGPU backend implemented in TypeScript
  - initial implementation of kernels:
    - elementwise operators (22)
    - binary operators (5)
    - tensor: Shape, Reshape, Transpose, Gemm
    - nn: Conv, {Global}Maxpool, {Global}AveragePool


Code need to be polished. still working on it.

## Q&A
What is JSEP?
> JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime
execution provider that specifically works on Web environment
(browsers). JSEP allows JavaScript code to kick in from various places
when ONNX Runtime inferences a model.

Why JSEP?
> JSEP is a hybrid mode EP that contains both C/C++ and
TypeScript/JavaScript implementation. There are 2 strong reasons why we
introduces JSEP:
> 1. the C/C++ part helps JSEP to leverage ONNX Runtime's capabilities
as much as possible including graph transformer, optimizers and also the
capabilities to fallback to CPU EP. TypeScript/JavaScript helps JSEP to
develop and debug much easier in the browser for the kernel
implementation.
> 2. the requirement of asynchronized execution from JavaScript API (eg.
`buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a
synchronized context (see "async problem" section below). This is done
by using Emscripten's Asyncify.

What is WebGPU?
> WebGPU is the new GPU API that available in browser. It's one of the
only 2 APIs that currently available to access the GPU from browser (the
other is WebGL).
> WebGPU is designed with more advanced and stronger features comparing
to WebGL and is potentially solution that offer the best GPU performance
for model inferencing that currently available.

What is the async problem and why we have the problem?
> The "async problem" is a problem that you cannot call an async
function in a synchronous context. Think about the following C++ code:
> ```c
> // C-style declarations (API)
> typedef void (*ON_COMPLETE)(PVOID state, DATA *data);
> void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);
> 
> // implementation
> DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
>   // how to implement?
> }
> ```
> The answer is, it's impossible to implement this function. Usually we
try to find a sync version API, or launch a thread to call the async
function and sync-wait on the main thread. Unfortunately, in browser
environment, neither is possible.
>
> WebGPU does not offer any synchronized API for data downloading (GPU
to CPU). This is the only operation that MUST be async. As `OrtRun()`
will eventually call into DataTransfer for copy data from GPU to CPU,
and `OrtRun()` is a synchronized function, this cannot be done in normal
way.

What is Emscripten? How is the Asyncify feature resolved the problem?
> Emscripten is the C/C++ compiler for WebAssembly. It's what we use to
compile ORT and generates the WebAssembly artifacts which runs on
browsers.
>
> Asyncify is a [compiler
feature](https://emscripten.org/docs/porting/asyncify.html) that allows
calling async functions from a synchronized context. In short, it
generates code to unwind and rewind call stack to emulate async
execution. With this feature, we are able to call the async function
inside `OrtRun()` call.

## Design Overview

**Inter-op**

JSEP is doing pretty much same thing to just another EP. It exposes an
interface for inter-op with JavaScript, which is defined in
onnxruntime/wasm/js_internal_api.js:
```js
// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
    Module.jsepBackend = backend;
    Module.jsepAlloc = alloc;
    Module.jsepFree = free;
    Module.jsepCopy = copy;
    Module.jsepCopyAsync = copyAsync;
    Module.jsepCreateKernel = createKernel;
    Module.jsepReleaseKernel = releaseKernel;
    Module.jsepRun = run;
};
```
This simple JavaScript snippet defines all language barrier level
functions that requires by JSEP to achieve implementing kernels and data
transfers using JavaScript inside ONNX Runtime:
- `jsepBackend`: assign the singleton object to webassembly module
- `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc()
and Free()
- `jsepCopy`: synchronized copy ( GPU to GPU, CPU to GPU)
- `jsepCopyAsync`: asynchronized copy ( GPU to CPU)
- `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object
that maintained in JS to match lifecycle of Kernel in ORT
- `jsepRun`: OpKernel::Compute() should call into this

The abstraction above allows to tie as little as possible connections
and dependencies between C/C++ and TypeScript/JavaScript.

**Resource Management**

Lifecycle of tensor data and kernels are managed by ORT(C/C++) but the
implementation are left to JavaScript. JavaScript code are responsible
to implement the callbacks correctly.

For WebGPU, the GPU data is managed by JavaScript using a singleton map
(tensot_data_id => GPUBuffer). GPU pipeline is managed as singleton.
Shaders are managed using a singletonmap (shader_key => gpu_program),
while shader_key is generated by cache_key (OP specific, including
attributes) and input shapes.

**about data transfer**
`js::DataTransfer::CopyTensor` implemented to call either synchronized
or asynchronized copy callback, depending on the destination is GPU or
not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function
to be called in the synchronized context.

**run kernel in JS**

Kernel class constructor calls once `jsepCreateKernel()` with an
optional per-kernel specific serialization to pass attributes into
JavaScript.

`Compute()` are implemented in a way that a metadata serialization is
performed in a base class and JavaScript code can access the data using
the Emscripten specific builtin macro `EM_ASM_*`.

**disabled features**
memory pattern is force disabled, because the WebGPU data is not
presented by a general memory model (a buffer can be represented by
offset + size).
concurrent run support is disabled. WebGPU is stateful and it also has
async function call. To support concurrent run will significantly
increase the complexity and we don't get any real benefit from it.

**prefer channels last**
JSEP prefers channels last and returns `DataLayout::NHWC` in method
`GetPreferredLayout()`. This will let the graph transformers to
preprocess the graph into a channels last form so that a more optimized
WebGPU shader can be used.

**Testing code**
It's impossible to test JSEP directly because JSEP itself does not
contain any kernel implementation. However, it has the kernel
registration which need to work together with the corresponding
JavaScript code. There are unit tests that run onnx models from
JavaScript API.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-04-24 15:21:18 -07:00
George Wu
c2acf69d13
support new include,lib dir structure in upcoming QNN 2.11 (#15605)
upcoming QNN 2.11 will have a different include/lib directory structure.
update cmake files to support the new structure.
2023-04-24 13:10:17 -07:00
Yi Zhang
698e9f71cd
Improve cache hit rate in windows build (#15538)
### Description
1.  Update /Zi to /Z7 in abseil project while using cache
2.  Skip target_precompile_headers while using cache


### Motivation and Context
There're about 1/4 uncacheable calls in Windows GPU compilation with
cache.
```
Uncacheable calls:                   441 / 1641 (26.87%)
  Could not use precompiled header:  361 /  441 (81.86%)
  Preprocessing failed:                1 /  441 ( 0.23%)
  Unsupported compiler option:        79 /  441 (17.91%)
```

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=961916&view=logs&j=5076e696-f193-5f12-2d8a-703dda41a79b&t=9b927034-e3ef-5e25-c6df-387bc37acd63&l=21

The root cause of `Unsupported compiler option` is that /Zi in Abseil
isn't updated to /Z7.
The root cause of `Could not use precompiled header` is the
`target_precompile_headers` creates cmake_pch.pch every time and it's
hash value is changed too.

### Result
It could reduce compilation time by another 20%. 
For example:
It took 16m43 in CUDA training compilation on Windows.
It takes 13m32 after the change.

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=964002&view=logs&s=959c6b43-5937-53e5-5f36-e53cb0249117


### N.B.
In winml project, it's using own target_precompile**d**_header
https://github.com/microsoft/onnxruntime/blob/main/cmake/precompiled_header.cmake.
Just let it be.
2023-04-18 09:31:35 -07:00
mindest
0fdd356abf
[ROCm] Add hipBLASLt GEMM support to Tunable op. (#15351)
### Description
Add hipBLASLt to GEMM Tunable op, which supports GEMM and
StridedBatchedGEMM.

To enable hipBLASLt implementation, add an extra flag to the building
command: `--cmake_extra_defines onnxruntime_USE_HIPBLASLT=ON`.
2023-04-14 17:56:01 +08:00
Stephan Gocht
026fb3ca1e
Fix compilation error when CUDNN_HOME is defined. (#15348) 2023-04-06 08:56:20 -07:00
George Wu
4db10c93d1
[TensorRT EP] make --use_tensorrt_builtin_parser the default behavior in build.py (#15320)
Change the default behavior to link against the nvonnxparser library
(onnx-tensorrt parser) that is included with the TensorRT package.
Previously, the default behavior was to build and statically link
against the OSS onnx-tensorrt parser.
Historically, we wanted to incorporate the latest commits/fixes from OSS
parser.
These days the OSS parser is not significantly different from the
included parser library so there is less reason to build against it by
default.
By linking with parser shared library from TensorRT library, the major
benefit is it's much easier to
build/link against a minor version update of TensorRT. And OnnxRuntime
can be updated with a new TensorRT minor version by simply replacing
TensorRT libraries with the newer version. (because the parser is no
longer statically linked into onnxruntime)

Added --use_tensorrt_oss_parser to build.py to support the previous
default behavior. (build + static link OSS parser)
2023-04-05 07:53:29 -07:00
Matthieu Darbois
85bb13345d
Rework some external targets to ease building with -DFETCHCONTENT_FULLY_DISCONNECTED=ON (#15323)
### Description
Rework some external targets to ease building with
`-DFETCHCONTENT_FULLY_DISCONNECTED=ON`
This will allow package managers to more easily provide an onnxruntime
package by reducing the amount of patching needed downstream at each
version.

### Motivation and Context
Availability of onnxruntime in some C++ package managers
https://github.com/microsoft/onnxruntime/issues/7150
https://github.com/conan-io/conan-center-index/issues/16699
https://github.com/microsoft/vcpkg/issues/20548

My initial intent is to get this in conan but the PR would most likely
be useful (though not tested) to vcpkg as well (and maybe others).
I tried to get only a first batch of not too specific patches (i.e. not
specific to conan).

The first commit reworks `flatbuffers` and just extends what @snnn did
in https://github.com/microsoft/onnxruntime/pull/13991
The second commit reworks `pytorch_cpuinfo`
The third commit reworks `google_nsync`
2023-04-03 17:45:12 -07:00
Chris Austen
93e6902790
resolve undefined symbol: rocblas_create_handle (#15204)
Update migraphx section of onnxruntime_providers.cmake to add the rocblas library
2023-03-26 18:01:58 +08:00
PeixuanZuo
4a8cd4256a
fix miopen new API cannot be supported by ROCm5.2.3 (#15077)
miopenTensorLayout_t was added after MIOpen version 2.18.0. Define it
in ORT when use MIOpen version lower than 2.18.0.
2023-03-17 08:40:35 +08:00
cloudhan
a5ab88247b
ROCm Flash Attention (#14838)
Adds flash attention via composable kernel for ROCm EP
2023-03-16 10:39:58 +08:00
Hector Li
c6074f3a4b
OnnxRuntime QNN EP (#14791)
### Description
Integrate Qualcomm QNN SDK to enable inference on QC hexagon NPU devices

### Motivation and Context
Enable Ort inference on QC hexagon NPU devices.

---------

Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
2023-03-01 13:48:20 -08:00
kailums
9bdd42115c
add build flag for rocblas tune and fix bug (#14797)
### Description
<!-- Describe your changes. -->
1. add a build flag for rocblas tuning feature.

2. fix a build bug when enable rocblas tuning.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The rocblas tunning feature has no build flag to control, only using a
MACRO flag.

So I add an build flag, and fix a code bug when enable rocblas tunning.
2023-02-28 10:37:07 +08:00
PeixuanZuo
326cf2f5e9
[ROCm] add Softmax Tunable Op (#14541)
### Description
Add Softmax Tunable Op, only include blockwise vec implementation and
composable kernel.
Related PR: https://github.com/microsoft/onnxruntime/pull/14475,
https://github.com/microsoft/onnxruntime/pull/14612

---------

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-02-13 15:56:50 +08:00