Commit graph

545 commits

Author SHA1 Message Date
Milos Puzovic
37ac9d391c
Enable Arm Compute Library 23.08 (#17672)
### Description

This PR enables onnxruntime to build with the most recent release of Arm
Compute Library

### Motivation and Context

The latest version of Arm Compute Library that onnxruntime can build is
20.02 which is more than 3 years old.
2024-01-09 14:10:25 -08:00
raoanag
56fcea94e3 Enable QDQ quantization for DML EP (#18367)
### Description
This enables QDQ transforms with the DML EP
2024-01-03 16:13:23 -08:00
cloudhan
de32baeeef
[ROCm] Add GemmFloat8 (#18488) 2023-12-11 11:37:29 +08:00
moyo1997
9479ba525b
Build onnxruntime.dll as arm64x (#18633)
Build onnxruntime.dll as arm64x

Added a .cmake file to generate a link repro of the onnxruntime.dll
during arm64 build. This provides us a directory containing all the
arm64 objs, def file and libs to link to when it is time to build
arm64x onnxruntime.dll during the arm64ec build by passing the
/machine:arm64x flag to the linker along with the arm64 artifacts.

If other dlls need to be built as arm64x, setting the ARM64X_TARGETS
variable in the toplevel cmakelists.txt to include these other targets
is all that will be needed.

Added build_arm64x.bat as a wrapper for the multiple (arm64, then
arm64ec) cmake calls needed to build as arm64x.

AB#22533
2023-12-06 16:49:00 -08:00
snadampal
05a9c95764
[DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847)
### Description
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.


### Motivation and Context
This is to enable the optimized ACL gemm kernels for dnnl execution
provider on aarch64 platform.
2023-12-01 09:16:44 -08:00
George Wu
5c67a00d8e
Revert "remove full protobuf requirement for tensorrt ep" (#18626)
Reverts microsoft/onnxruntime#18413

there's a timing issue here. we eventually want to get this change
merged in but we need to update OSS onnx-tensorrt first.
2023-11-29 22:27:51 -08:00
Rachel Guo
288b80d363
Add MacOS build to ORT C Pod (#18550)
### Description

As title.

1. Add macos build as an optionally enabled arch for pod and changes to
existing build_ios_framework/assemble_c_pod scripts.
2. Enable macos build arch in ios packaging pipeline (currently for
variants other than Mobile) and check the output artifacts are correct.
3. Write MacOS Test Target scheme in the test app and integrate into ios
packaging CI testing pipeline.
Currently the changes only apply to the onnxruntime-c pod, as the original
request was from ORT SPM, which consumes the onnxruntime-c pod only as
the binary target. TODO: could look into adding macos platform to objc
pod as well.

### Motivation and Context
Enable macos platform support in cocoapods, and also potentially produce
a binary target for enabling macos platform in SPM as well.

Replace https://github.com/microsoft/onnxruntime/pull/18334

---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-11-28 10:11:53 -08:00
Xavier Dupré
29a409acaa
Add missing flags DISABLE_FLOAT8_TYPES in GemmFloat8 custom operator for CUDA < 11.8 (#18162)
### Description
PR #16051 introduced operator GemmFloat8 but the flag
DISABLE_FLOAT8_TYPES was missing in a couple of places. This PR addresses
that issue, which allows compilation on CUDA < 11.8.
2023-11-21 14:37:48 +01:00
George Wu
d73073d491
remove full protobuf requirement for tensorrt ep (#18413)
tensorrt can work with protobuf lite.
2023-11-16 20:44:27 -08:00
Justin Chu
c250540722
Bump linter versions (#18341)
Bump linter versions and run format.
2023-11-08 13:04:40 -08:00
Yi Zhang
b7b8b5b2ce
Fix Eigen-3.4.0 URL and hash (#18290)
### Description
Add CI changes for #18287

Install onnx explicitly to pass windows GPU+dml stage.


### Motivation and Context
'eigen-3.4' was referring to a branch, not to a tag. There is now an
Eigen 3.4.1 on that branch, and thus the hash has changed.
See
https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416
2023-11-06 09:19:51 -08:00
Scott McKay
c352e9b1f9
Rework/cleanup the C# build infrastructure for nuget packages. (#18127)
### Description
Update the C# nuget build infrastructure to make building a test nuget
package more user friendly and to simplify
- Remove usage of dotnet and msbuild in CIs
- was temporary requirement until .net 6 MAUI was added to the released
Visual Studio
  - remove SelectedTargets property and its usage
- Add property for excluding mobile targets
  -  generally we exclude based on the nuget package name
- can now specify `/p:IncludeMobileTargets=false` on the command line to
force exclusion
- support building test package using build.py `--build_nuget` better
- limit inclusion of xamarin targets as building with them requires a
lot more infrastructure
- use msbuild directly if xamarin targets are included. use dotnet
otherwise.
- remove quoting of property values as it doesn't appear to be necessary
and breaks when msbuild is being used
- add infrastructure to be able to pack the nuget package on linux with
`dotnet pack`
    - `nuget pack` is not user friendly as-per comments in changes
    - requires stub csproj to provide the nuspec path 
- Remove netstandard1.0 targets from nuspec
  - we removed support from the actual bindings previously
- Remove usage of nuget-staging directory when creating nuget package on
linux
- the nuspec file element has a fully qualified path for a source file
so there is no obvious benefit to copying to a staging directory prior
to packing

### Motivation and Context
Address issues with 1P users trying to create test nuget packages
locally.
Long overdue cleanup of CI complexity.
2023-11-03 09:05:17 -07:00
Preetha Veeramalai
d87216bcb1
Openvino ep ort 23.1 (#17911)
### Description
Integration to OpenVINO 2023.1


### Motivation and Context

- Alignment with latest OpenVINO Version. 
- Device name change from VPUX to NPU and Remove from supported list
until official public support is available.

---------

Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
2023-11-01 08:39:39 -07:00
Hariharan Seshadri
9356986730
Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972)
### Description
This PR:

(1) Fixes AMD builds after #17200 broke them (Need to remember to run
AMD builds while trying to merge external CUDA PRs next time)

(2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time
spent in building a few more files and running a few more tests will not
be much.

Test Linux GPU CI run :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770

### Motivation and Context
Keep the NHWC CUDA ops tested
(https://github.com/microsoft/onnxruntime/pull/17200) and guard against
regressions
2023-10-17 09:23:52 -07:00
aciddelgado
406cd324e0
[CUDA] GroupQueryAttention operator using FlashAttention (#17674)
### Description
Added the Group Query Attention op, which supports a number of Q heads
that is an integer multiple of the number of KV heads. As of now, this op
can only use the FlashAttention kernel, meaning it only supports sm>=80 on Linux.

Results from onnxruntime/test/python/transformers/benchmark_gqa.py show
an on-average ~37% speed-up over Decoder Masked Multi-Head Attention,
with even greater improvements for long past sequence lengths.

```
op      batch   s_kv    heads   h_dim   ms      TFLOPS
gqa     16      2048    8       32      0.34    0.10
dmmha   16      2048    8       32      0.39    0.09
---------
gqa     16      2048    8       64      0.45    0.15
dmmha   16      2048    8       64      0.61    0.11
---------
gqa     16      2048    8       128     0.54    0.25
dmmha   16      2048    8       128     0.83    0.16
---------
gqa     16      2048    16      32      0.45    0.15
dmmha   16      2048    16      32      0.69    0.10
---------
gqa     16      2048    16      64      0.69    0.19
dmmha   16      2048    16      64      0.83    0.16
---------
gqa     16      2048    16      128     0.71    0.38
dmmha   16      2048    16      128     1.28    0.21
---------
gqa     16      2048    32      32      0.58    0.23
dmmha   16      2048    32      32      0.77    0.17
---------
gqa     16      2048    32      64      0.58    0.46
dmmha   16      2048    32      64      1.25    0.21
---------
gqa     16      2048    32      128     0.76    0.71
dmmha   16      2048    32      128     2.15    0.25
---------
gqa     16      2048    64      32      0.68    0.39
dmmha   16      2048    64      32      1.23    0.22
---------
gqa     16      2048    64      64      0.77    0.70
dmmha   16      2048    64      64      2.11    0.25
---------
gqa     16      2048    64      128     1.10    0.97
dmmha   16      2048    64      128     4.06    0.26
---------
gqa     16      2048    128     32      1.00    0.54
dmmha   16      2048    128     32      2.09    0.26
---------
gqa     16      2048    128     64      1.10    0.97
dmmha   16      2048    128     64      4.08    0.26
```


### Motivation and Context
As of now, this op is targeted for use on LLama models, as it supports
kv-caching and different number of heads for Q and KV (Grouped Query
Attention). We plan to add support for more platforms, input formats,
etc. in the future.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-10-09 12:43:12 -07:00
Justin Chu
be7541ef4a
[Linter] Bump ruff and remove pylint (#17797)
Bump ruff version and remove pylint from the linter list. Fix any new
error detected by ruff.

### Motivation and Context

Ruff covers many of the pylint rules. Since pylint is not enabled in
this repo and runs slow, we remove it from the linters
2023-10-05 21:07:33 -07:00
Mustafa Ateş Uzun
13b0f8a6ce
fix: supported typo (#17216) 2023-09-27 10:45:27 -07:00
Patrice Vignola
54a092c427
[DML EP] Complete python IO binding implementation (#17344)
@fdwr This is the part 2 of the pybind work that was started earlier.
This adds the following features to the python IO binding
implementation:

- Use a bucketized allocator in order to reduce the number of resource
allocations
- Implement the following functions: `ortvalue_from_numpy`,
`update_inplace`, `ortvalue_from_shape_and_type` and `numpy`
- Modify the `onnxruntime_test_python_iobinding` tests to also run on
DML
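
As a rough illustration of why bucketizing reduces the number of resource allocations, the sketch below rounds each request up to a power-of-two bucket and reuses freed buffers from the same bucket. The names `bucket_size` and `BucketPool`, the power-of-two policy and the 256-byte minimum are all assumptions for this example; the DML EP allocator may work differently.

```python
def bucket_size(requested_bytes: int, min_bucket_bytes: int = 256) -> int:
    """Round a request up to the next power-of-two bucket (hypothetical policy)."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    size = min_bucket_bytes
    while size < requested_bytes:
        size *= 2
    return size


class BucketPool:
    """Toy free-list pool: freed buffers are reused by later requests in the same bucket."""

    def __init__(self) -> None:
        self.free = {}          # bucket size -> list of reusable buffers
        self.allocations = 0    # number of fresh allocations performed

    def allocate(self, requested_bytes: int) -> bytearray:
        size = bucket_size(requested_bytes)
        bucket = self.free.setdefault(size, [])
        if bucket:
            return bucket.pop()  # reuse instead of allocating again
        self.allocations += 1
        return bytearray(size)

    def release(self, buffer: bytearray) -> None:
        self.free.setdefault(len(buffer), []).append(buffer)
```

Two requests of 300 and 400 bytes both land in the 512-byte bucket, so the second one can reuse the buffer released by the first.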

---------

Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
2023-09-13 07:26:35 -07:00
Scott McKay
e1a9f2ed6d
Fix insufficient space error in Android CI (#17423)
### Description
Remove onnxruntime_test_all from emulator once tests have finished as
it's 1.2GB and takes up too much space given the 2GB maximum partition
size for the emulator.

Side issue is the java build isn't able to strip the binaries in the
java apk which causes that to be 800MB (exceeding the 2GB max). That may
require an Android/Gradle fix as I don't think we can hardcode an NDK
version into our build files.
https://issuetracker.google.com/issues/237187538?pli=1


### Motivation and Context
Fix Android CI build failures for
2023-09-06 10:12:05 +10:00
Tianlei Wu
8818a99c93
Set proper nvcc threads to avoid OOM (#17419)
### Description

There are 8 cu files under [flash
attention](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/bert/flash_attention)
and 4 cu files under [cutlass
fmha](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/bert/cutlass_fmha)
that need a lot of memory to compile.

Previously, the default value was the same as parallel, i.e. the number of CPU cores.
Standard_NC4as_T4_v3 has 4 CPUs and 28 GB memory, and we launched 16
nvcc threads in total (4 parallel jobs, and 4 nvcc threads per job).
Each thread might take 4 GB on average (peak is around 6GB, but threads
are not started at same time). OOM happens since 16 threads might need
close to 64 GB in worst case. When build machine has 64GB or larger
memory, OOM is rare.

Here we set a proper nvcc --threads based on available memory to avoid
OOM.
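
The arithmetic above can be sketched as a small heuristic. This is only an illustration: the function name `pick_nvcc_threads`, the cap of 4 threads and the exact formula are assumptions, and the 4 GB per-thread figure is the average quoted above rather than a value read from build.py.

```python
def pick_nvcc_threads(available_memory_gb: float, parallel_jobs: int,
                      memory_per_thread_gb: float = 4.0, max_threads: int = 4) -> int:
    """Choose an nvcc --threads value so parallel_jobs * threads fits in memory.

    Hypothetical heuristic, not the logic used by the real build script.
    """
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")
    # How many nvcc threads fit in memory at once, across all jobs.
    total_threads = int(available_memory_gb // memory_per_thread_gb)
    threads_per_job = total_threads // parallel_jobs
    # Always allow at least one thread, never more than the cap.
    return max(1, min(max_threads, threads_per_job))


# The 4-CPU, 28 GB machine with 4 parallel jobs:
# 28 // 4 = 7 threads fit in total, 7 // 4 = 1 thread per job.
threads = pick_nvcc_threads(available_memory_gb=28, parallel_jobs=4)
```

With 64 GB and the same 4 parallel jobs, the sketch allows 4 threads per job, matching the observation that OOM is rare on such machines.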

### Motivation and Context
Fix `Python Packaging Pipeline (Training Cuda 11.8)`
2023-09-05 10:59:27 -07:00
aciddelgado
44101e8771
Flash Attention v2 MHA (#17227)
### Description
Integrate Flash Attention V2 to PackedMultiHeadAttention,
MultiHeadAttention and Attention operators.

Flash Attention v2 source code is from
https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src.
We made some changes to remove the dependency on Torch, then removed the
backward and bfloat16 related code.

Add benchmark script (see benchmark_mha.sh) to compare different
attention kernels for MultiHeadAttention operator.

Current limitations for Flash Attention in PackedMultiHeadAttention,
MultiHeadAttention and Attention operators:
* Relative Position Bias is not supported
* Different hidden size for Q and V is not supported
* Only float16 is supported
* Padding/attention mask is not supported
* For MultiHeadAttention, when there is past or present input, bias
shall be provided to activate flash attention
* For Attention, past or present inputs will deactivate flash attention
* Causal is not supported

Some limitations (like attention mask and causal) might be removed
later.

Currently, Flash Attention v2 only works in Linux. For Windows, we will
enable later with Cutlass 3.2.

Two environment variables can be used for testing purpose:
(1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default
value is 0 (enable). Set it to "1" to disable it.
(2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is
"513", which means that we only enable flash attention when sequence
length is larger than 512 for packed QKV format. Set it to "0" if you
want to use flash attention v2 whenever possible.
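
As a hedged illustration of how the two variables interact for the packed QKV format, the sketch below mirrors the behaviour described above. The helper `use_flash_attention_for_packed_qkv` is hypothetical; the real decision is made inside the CUDA operators and also depends on the limitations listed earlier.

```python
import os


def use_flash_attention_for_packed_qkv(sequence_length: int, env=None) -> bool:
    """Mirror the documented behaviour of the two environment variables.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # "1" disables flash attention entirely; the default "0" keeps it enabled.
    if env.get("ORT_DISABLE_FLASH_ATTENTION", "0") == "1":
        return False
    # Default "513": only sequences longer than 512 use flash attention.
    min_seq_len = int(env.get("ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV", "513"))
    return sequence_length >= min_seq_len
```

For a real run the variables would be set in the process environment, for example `os.environ["ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV"] = "0"`, presumably before the inference session is created.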

### Speedup

The following result is from Standard_ND96amsr_A100_v4 VM
(A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per
second for MultiHeadAttention operator.

There are 3 input formats:
* `Q,K,V` means separated inputs query, key and value of BxSxNH
* `Q,KV` means packed KV, where key is 5D: BxSxNx2xH
* `QKV` means packed QKV, where query is 5D: BxSxNx3xH
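
The three formats can be summarized by the tensor shapes they imply. The following Python sketch is illustrative only: the function name `mha_input_shapes` is hypothetical, it assumes that all inputs share the same sequence length, and it assumes the query stays 3D (BxSxNH) in the `Q,KV` format.

```python
def mha_input_shapes(fmt: str, batch: int, seq_len: int, heads: int, head_dim: int) -> dict:
    """Return the input tensor shapes implied by each input format."""
    b, s, n, h = batch, seq_len, heads, head_dim
    if fmt == "Q,K,V":
        # Separate 3D inputs of shape B x S x (N*H).
        shape = (b, s, n * h)
        return {"query": shape, "key": shape, "value": shape}
    if fmt == "Q,KV":
        # Key carries packed K and V as a 5D tensor.
        return {"query": (b, s, n * h), "key": (b, s, n, 2, h)}
    if fmt == "QKV":
        # Query carries packed Q, K and V as a 5D tensor.
        return {"query": (b, s, n, 3, h)}
    raise ValueError(f"unknown input format: {fmt}")
```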

Note that flash attention cannot use packed QKV format, so extra
Transpose is needed. We found that TensorRT kernel is faster for
sequence length <= 512 for packed QKV. The reason might be no transpose
is needed for TensorRT kernel in this format.

We also notice that, TensorRT kernel is faster for stable diffusion
512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while
flash attention v2 is faster for 1024x1024 image (see seq_len=16384,
heads=8, head_dim=40 below).

input format | batch size | sequence length | heads | head dim | flash_v2 (TFLOPs/s) | TensorRT (TFLOPs/s) | Memory Efficient Attention (TFLOPs/s)
-- | -- | -- | -- | -- | -- | -- | --
Q,K,V | 32 | 512 | 64 | 32 | 78.1 | 60.0 | 39.3
Q,K,V | 32 | 512 | 128 | 16 | 46.8 | 44.1 | 21.7
Q,K,V | 16 | 1024 | 64 | 32 | 99.0 | 72.8 | 44.3
Q,K,V | 16 | 1024 | 128 | 16 | 54.7 | 49.2 | 23.4
Q,K,V | 8 | 2048 | 64 | 32 | 113.8 | 81.2 | 47.8
Q,K,V | 8 | 2048 | 128 | 16 | 59.7 | 51.9 | 24.7
Q,K,V | 4 | 4096 | 64 | 32 | 122.5 | 85.6 | 49.7
Q,K,V | 4 | 4096 | 128 | 16 | 62.5 | 53.3 | 25.3
Q,K,V | 2 | 8192 | 64 | 32 | 127.4 | 87.5 | 50.7
Q,K,V | 2 | 8192 | 128 | 16 | 64.0 | 54.2 | 25.6
Q,K,V | 1 | 16384 | 64 | 32 | 129.5 | 91.0 | 51.2
Q,K,V | 1 | 16384 | 128 | 16 | 64.7 | 54.5 | 25.8
Q,K,V | 1 | 4096 | 8 | 40 | 51.0 | 43.6 | 36.8
Q,K,V | 1 | 4096 | 8 | 80 | 97.7 | 77.0 | 55.5
Q,K,V | 1 | 4096 | 8 | 160 | 120.0 | 39.7 | 57.8
Q,K,V | 4 | 4096 | 8 | 40 | 89.0 | 84.4 | 49.2
Q,K,V | 4 | 4096 | 8 | 80 | 133.0 | 92.2 | 63.2
Q,K,V | 4 | 4096 | 8 | 160 | 164.8 | 42.7 | 63.8
Q,K,V | 1 | 16384 | 8 | 40 | 96.9 | 91.3 | 52.1
Q,K,V | 1 | 16384 | 8 | 80 | 142.9 | 101.5 | 65.6
Q,K,V | 1 | 16384 | 8 | 160 | 177.4 | 44.2 | 65.7
Q,K,V | 128 | 128 | 12 | 64 | 29.0 | 26.9 | 25.7
Q,K,V | 64 | 128 | 12 | 64 | 23.1 | 10.8 | 21.3
Q,K,V | 128 | 384 | 12 | 64 | 83.5 | 60.8 | 55.7
Q,K,V | 64 | 384 | 12 | 64 | 72.6 | 40.5 | 52.8
Q,K,V | 128 | 512 | 12 | 64 | 98.9 | 77.9 | 62.1
Q,K,V | 64 | 512 | 12 | 64 | 94.7 | 75.6 | 60.4
Q,KV | 32 | 512 | 64 | 32 | 85.9 | 41.1 | 41.1
Q,KV | 32 | 512 | 128 | 16 | 47.1 | 21.6 | 21.6
Q,KV | 16 | 1024 | 64 | 32 | 104.4 | 45.8 | 45.8
Q,KV | 16 | 1024 | 128 | 16 | 54.7 | 23.6 | 23.6
Q,KV | 8 | 2048 | 64 | 32 | 116.8 | 48.5 | 48.5
Q,KV | 8 | 2048 | 128 | 16 | 59.8 | 24.7 | 24.7
Q,KV | 4 | 4096 | 64 | 32 | 124.2 | 50.1 | 50.1
Q,KV | 4 | 4096 | 128 | 16 | 62.6 | 25.3 | 25.3
Q,KV | 2 | 8192 | 64 | 32 | 128.5 | 50.8 | 50.9
Q,KV | 2 | 8192 | 128 | 16 | 64.1 | 25.6 | 25.6
Q,KV | 1 | 16384 | 64 | 32 | 129.4 | 51.2 | 51.2
Q,KV | 1 | 16384 | 128 | 16 | 64.8 | 25.8 | 25.8
Q,KV | 1 | 4096 | 8 | 40 | 67.5 | 37.7 | 37.5
Q,KV | 1 | 4096 | 8 | 80 | 101.3 | 56.7 | 56.6
Q,KV | 1 | 4096 | 8 | 160 | 124.0 | 58.6 | 58.6
Q,KV | 4 | 4096 | 8 | 40 | 90.8 | 49.8 | 49.8
Q,KV | 4 | 4096 | 8 | 80 | 135.6 | 63.8 | 63.8
Q,KV | 4 | 4096 | 8 | 160 | 166.3 | 64.5 | 64.5
Q,KV | 1 | 16384 | 8 | 40 | 97.5 | 52.3 | 52.3
Q,KV | 1 | 16384 | 8 | 80 | 143.5 | 65.9 | 65.8
Q,KV | 1 | 16384 | 8 | 160 | 178.4 | 65.9 | 65.8
Q,KV | 128 | 128 | 12 | 64 | 26.8 | 48.1 | 30.9
Q,KV | 64 | 128 | 12 | 64 | 28.0 | 38.9 | 25.0
Q,KV | 128 | 384 | 12 | 64 | 97.7 | 61.1 | 61.0
Q,KV | 64 | 384 | 12 | 64 | 89.5 | 57.8 | 57.9
Q,KV | 128 | 512 | 12 | 64 | 111.9 | 66.7 | 66.9
Q,KV | 64 | 512 | 12 | 64 | 107.2 | 64.9 | 64.8
QKV | 32 | 512 | 64 | 32 | 77.2 | 84.7 | 39.3
QKV | 32 | 512 | 128 | 16 | 43.4 | 53.1 | 20.9
QKV | 16 | 1024 | 64 | 32 | 98.8 | 87.4 | 44.6
QKV | 16 | 1024 | 128 | 16 | 52.0 | 54.1 | 23.2
QKV | 8 | 2048 | 64 | 32 | 113.1 | 89.0 | 47.9
QKV | 8 | 2048 | 128 | 16 | 58.2 | 54.6 | 24.5
QKV | 4 | 4096 | 64 | 32 | 120.6 | 89.7 | 49.7
QKV | 4 | 4096 | 128 | 16 | 61.7 | 54.6 | 25.2
QKV | 2 | 8192 | 64 | 32 | 125.9 | 89.5 | 50.7
QKV | 2 | 8192 | 128 | 16 | 63.6 | 54.8 | 25.5
QKV | 1 | 16384 | 64 | 32 | 128.5 | 92.0 | 51.2
QKV | 1 | 16384 | 128 | 16 | 64.6 | 54.8 | 25.7
QKV | 1 | 4096 | 8 | 40 | 60.2 | **69.8** | 38.1
QKV | 1 | 4096 | 8 | 80 | 101.6 | 75.2 | 56.7
QKV | 1 | 4096 | 8 | 160 | 130.2 | 41.2 | 58.4
QKV | 4 | 4096 | 8 | 40 | 90.6 | **91.0** | 49.5
QKV | 4 | 4096 | 8 | 80 | 133.6 | 98.1 | 62.8
QKV | 4 | 4096 | 8 | 160 | 165.3 | 43.7 | 63.9
QKV | 1 | 16384 | 8 | 40 | 97.2 | 92.8 | 52.1
QKV | 1 | 16384 | 8 | 80 | 143.0 | 103.1 | 65.6
QKV | 1 | 16384 | 8 | 160 | 177.6 | 44.5 | 65.7
QKV | 128 | 128 | 12 | 64 | 31.1 | 65.9 | 27.6
QKV | 64 | 128 | 12 | 64 | 26.1 | 49.8 | 23.5
QKV | 128 | 384 | 12 | 64 | 84.6 | 88.5 | 56.1
QKV | 64 | 384 | 12 | 64 | 79.1 | 80.3 | 53.5
QKV | 128 | 512 | 12 | 64 | 97.3 | 114.2 | 62.2
QKV | 64 | 512 | 12 | 64 | 95.9 | 110.7 | 60.6
QKV | 4 | 2048 | 32 | 128 | 125.26 | 44.72 | 78.15
QKV | 4 | 4096 | 32 | 128 | 141.62 | 46.29 | 85.84
QKV | 8 | 2048 | 32 | 128 | 127.40 | 45.49 | 78.75
QKV | 8 | 4096 | 32 | 128 | 144.24 | 46.60 | 86.95

### Known Issues

NVCC uses a large amount of memory while compiling the flash attention
CUDA kernels. A Linux build with CUDA might fail when the machine has
limited memory and a large number of CPUs. A workaround is to use a build
machine with more memory, or to pass an argument like `--nvcc_threads 1`
to limit nvcc threads in the build.

### Motivation and Context
Increases speed and efficiency of MHA or Packed MHA.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-08-31 13:52:21 -07:00
Rachel Guo
b54619509f
Refine build script for adding disable selected data types option (#17284)
### Description

As title. 

### Motivation and Context

We now have multiple data types that we want to disable for minimal
builds to reduce binary size, so it may be worth adding an argument in
the build script for specifying that.

Also, for fp16 types, it may be too restrictive to disable them for all
minimal builds.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2023-08-31 13:32:55 -07:00
Yifan Li
808215366d
Fix Multi GPU TensorRT tests (#17269)
### Description
* Integrate `trt_multi_gpu` test stage in ORT post merge CI (Win-2xA10
vm)
* Deprecate Linux MultiGPU TRT CI (This vm will be deprecated soon)
* Add multi gpu support to existing C# test cases
* Deprecate non-functional flag `--enable_multi_device_tests`

### Motivation and Context
* Two contexts of replacing Linux MultiGPU TRT CI:
* Flag `--enable_multi_device_tests` is not functional, which cannot
detect issues like #17036
* The Linux-2xM60 VM of this CI pool is about to be deprecated 9/6/23.
Need to enable this test in other dualGPU vm pool.
2023-08-25 20:30:45 -07:00
Yulong Wang
9cd4e5af68
[wasm] upgrade emsdk to 3.1.44 (#17069)
### Description
This change upgrades emsdk to 3.1.44.

Because the backend is upgraded to LLVM 16, we need to fix a lot of build
failures caused by "-Wshorten-64-to-32".

Most of the build failures come from the generated `onnx.pb.h`, and these
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignores shorten-64-to-32 warnings.
2023-08-10 16:08:36 -07:00
Changming Sun
7d340256f1
Add "windows_sdk_version" build arg and fix SCA build pipeline (#17062)
### Description
1. Add "--windows_sdk_version" argument to build.py
2. Fix Windows Static Analysis build pipeline. It is failing because it
picks up a different Windows SDK version after a build machine image
update. If we can explicitly specify Windows SDK version, we can avoid
such things happening again.
3. Remove --enable_training from Windows Static Analysis build pipeline
because PR #16993 makes it incompatible with "no_rtti".

AB#18315
2023-08-09 14:01:16 -07:00
Edward Chen
50719d2f8e
[iOS] Add script to get simulator device info. (#17012)
Add script to get iOS simulator device info so we don't need to use hardcoded specifiers which may or may not refer to a valid simulator device.

Add use-xcode-version step to a packaging pipeline so it uses a consistent version of Xcode.
2023-08-08 09:04:06 -07:00
Baiju Meswani
249917a093
Add mac and windows python packages for onnxruntime-training (#16993) 2023-08-07 20:32:55 -07:00
Edward Chen
06096fcb31
Hardcode xcodebuild destination iOS simulator OS to 16.4. (#16982) 2023-08-03 14:49:54 -07:00
Yi Zhang
9f21f694cf
stop support to VS 2019 (#16892)
### Description
Remove VS 2019 code.

### Motivation and Context
2023-07-28 13:09:35 +08:00
Chi Lo
21ef14476b
Bug fix for nested control flow ops for TRT EP (#16343)
Current TRT EP can support model which has nested control flow ops
(multiple level subgraphs). But it fails at a case where the subgraph
has outer scope value that is defined several levels up in the top-level
graph, in this case, the outer scope value is the input of the top-level
graph. The outer scope values are not properly handled during TRT EP's
subgraph reconstruction stage and fails at `graph.resolve()`.

The way ORT gets capability from EPs is a bottom-up approach meaning
inner most subgraph gets handled first. TRT EP reconstructs each
subgraph level by level and following modifications are made to fix the
outer scope values issue:

- `SetGraphOuterScopeValuesAndInputs()` and `SetAllGraphInputs()` are
added to handle outer scope values and add those values as graph inputs
if needed in order to make `graph.resolve()` happy.
- Change to use `GetNodeArgIncludingParentGraphs` so that when creating
the fused TRT node for some subgraphs in
`Graph::CreateFusedSubGraphNode()`, it can get the NodeArgs for outer
scope values from top-level graph.


This PR fixes https://github.com/microsoft/onnxruntime/issues/16217
2023-07-23 16:16:17 -07:00
Edward Chen
df8843c4a7
Upgrade old Python version in packaging pipeline (#16667)
- Upgrade from Python 3.6 to 3.8 in packaging pipeline.
- Raise build.py minimum required Python version.
2023-07-17 08:24:47 -07:00
Edward Chen
1b8d5c43c2
Fix builds (#16646)
- Fix some more `shorten-64-to-32` warnings
- Move minimum build.py Python version back to 3.6
2023-07-11 19:21:25 -07:00
Scott McKay
ce68a4c06a
Fix Linux build failure when onnxruntime_DISABLE_ABSEIL=ON (#16373)
### Description
Add ort_value.h to session_options.h so OrtValue is defined. 

Update a unit test binary to add required include paths. Adding
ort_value.h pulls in more data type headers.

### Motivation and Context
#16193
2023-07-12 11:23:18 +10:00
Edward Chen
6be7b03e53
Enable -Wshorten-64-to-32 warning if available. (#16524)
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
2023-07-07 08:11:44 -07:00
Edward Chen
b668a6da96
Treat Objective-C static analysis warnings as errors (#16293)
- Update Objective-C static analysis check to fail on warnings.
- Address warning.
- Clean up build definition.
2023-06-09 08:51:49 -07:00
Edward Chen
1261d0b8ba
Fix some build issues on MacOS with Xcode 14.3. (#15878)
- Fix flatbuffers flatc warning, unused-but-set-variable.
- Address `-Wshorten-64-to-32` warnings (fix in our code, allow in dependencies' code).
- Update CI builds to use Xcode 14.3.
- Update minimum iOS version to 12.0.
- Update Mac hosted agents to MacOS 13 where possible.
2023-06-07 12:07:11 -07:00
Xavier Dupré
e726151b5c
Introduce float 8 types (#14731)
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, FloatE5M2FNUZ
as described in PR https://github.com/onnx/onnx/pull/4805. It uses CUDA
API to cast float/half to float8 if CUDA>=11.8, a custom implementation
if CUDA<11.8.

* It implements, Cast, QuantizeLinear, DequantizeLinear for all types on
CPU, only for types FloatE4M3FN, FloatE5M2 on CUDA.
* It extends the supported types for control flow operator, Shape,
Reshape, Identity, If, Loop, Scan
* It implements Equal(19).
* Cast, QuantizeLinear, DequantizeLinear operators now support a
parameter `saturate` only valid for float 8 types. It is true by
default. In that case, any value out of range is converted into the
maximum float 8 value. If false, it is infinite.
* QuantizeLinear, DequantizeLinear now supports multiple scales on CUDA
(and ROCm by extension), scale = 1D tensor with one scale per channel

### Motivation and Context
Supports latest onnx version.

Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)

---------

Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-05-30 13:25:58 -07:00
Changming Sun
0204594f90
Cleanup WASM cmake code (#15996)
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more natural.
For example,

```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
2023-05-20 18:07:39 -07:00
RandySheriffH
4dfb89b3ad
Implement mutex-free spin lock for task queue (#14834)
Implemented "lock-free" spinlock to save CPU usage on context switching.
The change has been tested on queene service of Ads team, the lock-free
version of ort (40 threads) saves CPU usage on gen8 (128 logical
processors on 8 numa nodes) windows by nearly half, from 65% to 35%.

For 32 cores, the curve is flat:

Anubis, 32 vCPU, windows, hugging face models,
95 percentile E2E latency in ms:

model | mutex(ms) | mutex-free
--- | --- | ---
 alvert_base_v2 | 34.21 | 34.09
 bert_large_uncased | 116.27| 117.84
 bart_base | 72.06 | 71.99
 distilgpt2 | 25.43 | 25.02
 vit_base_patch16_224 | 37.33 | 37.76

Anubis, 32 vCPU win, Linux, 1st party models,
95 percentile E2E latency in ms:

model | mutex(ms) | mutex-free
--- | --- | ---
deepthink_v2 | 24.35 | 22.95
bing_feeds |  36.96 | 36.48
deep_writes |  14.46 | 14.32
keypoints |  9.34 | 7.69
model11 |  1.71 | 1.66
model12 |  1.82 | 1.44
model2 |  4.21 | 3.95
model6 |  1.08 | 1.05
agiencoder |  0.99 | 0.93
geminet_transformer |  5.32 | 5.24

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-19 10:12:10 -07:00
cloudhan
856afa49dd
[C#] Add missing rocm csharp api (#15540) 2023-05-18 08:15:19 +08:00
Yi Zhang
6d43d51eb0
[Fix] No test result report while not using ctest (#15976)
### Description
1. Set gtest output while ctest is set to empty.
2. onnx_src in _deps shouldn't be removed because
onnx_test_pytorch_converted and onnx_test_pytorch_converted need to read
data from onnx/backend/test/data/..

### Motivation and Context
Test result report is important to find the flaky tests.

### To do
Tests are not consistent.
If ctest_path is empty, onnx_test_pytorch_converted and
onnx_test_pytorch_converted will not be executed, if it's not,
onnxruntime_mlas_test will not be executed.


270c09a37f/tools/ci_build/build.py (L1743-L1753)
2023-05-17 08:31:16 -07:00
kailums
f62f722c70
integrate triton into ort (#15862)
### Description
In some scenarios, kernels written in triton are more performant than
CK or other handwritten kernels, so we implement a framework that lets
onnxruntime use these triton kernels.

This PR integrates triton into ort, so that ort can use kernels that are
written and compiled by triton.

The main changes focus on two parts:
1. a build part that compiles the triton kernels and combines them
into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launching the triton
kernels.

#### Build

To compile the triton kernels, we add a script
`tools/ci_build/compile_triton.py`. This script dynamically loads all
kernel files, compiles them, and generates `triton_kernel_infos.a` and
`triton_kernel_infos.h`.

`triton_kernel_infos.a` contains all compiled kernel instructions. This
file is combined into libonnxruntime_providers_rocm.so using the
--whole-archive flag.

`triton_kernel_infos.h` defines a const array that contains the
metadata for each compiled kernel. This metadata is used for load
and launch, so the header file is included by 'triton_kernel.cu', which
defines the load and launch functions.

A build flag is added in build.py and CMakeList.txt. When building the
rocm provider, it calls the triton_kernel build command and generates all
necessary files.

#### C++ Load and Launch

On the C++ side, we implement load and launch functions in
triton_kernel.cu and triton_kernel.h.

These two files are located in `providers/cuda`, and when compiling for
ROCm they are hipified, so this part supports both CUDA and ROCm.
Currently, however, we only call Triton kernels in ROCm.

We also implement a softmax Triton op as an example. Because many
kernels are generated for the different input shapes of softmax, we use
TunableOp to select the best one.
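
The selection idea can be sketched in a few lines of Python. This is a conceptual illustration of "time each candidate and keep the fastest", not ONNX Runtime's TunableOp implementation; `pick_fastest`, its parameters, and the use of `ValueError` to signal an unsupported input are assumptions made for the example.

```python
import time


def pick_fastest(candidates, args, repeats=5, timer=time.perf_counter):
    """Return the name of the fastest candidate for the given arguments.

    candidates maps a name to a callable. A candidate that raises
    ValueError on the warm-up call is treated as not supporting this
    input and is skipped (a convention assumed for this sketch).
    Returns None when no candidate supports the input.
    """
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        try:
            fn(*args)  # warm-up run, also checks that the input is supported
        except ValueError:
            continue
        start = timer()
        for _ in range(repeats):
            fn(*args)
        elapsed = (timer() - start) / repeats  # mean time per call
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

In practice the chosen candidate would typically be cached per input shape so that the timing cost is paid only once.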


2023-05-17 09:35:28 +08:00
Jian Chen
780442b9f6
Change windows machine pools to use VS2022 (#15806)
### Description

Old pool | New pool | Notes
-- | -- | --
onnxruntime-Win-CPU-2019 | onnxruntime-Win-CPU-2022 |
onnxruntime-Win2019-CPU-training | onnxruntime-Win2022-CPU-training-AMD |
onnxruntime-Win2019-CPU-training-AMD | onnxruntime-Win2022-CPU-training-AMD | Same as the above
onnxruntime-Win2019-GPU-dml-A10 | Needs to be created | You need to create a new image for it first
onnxruntime-Win2019-GPU-T4 | onnxruntime-Win2022-GPU-T4 |
onnxruntime-Win2019-GPU-training-T4 | onnxruntime-Win2022-GPU-T4 | Same as the above because we do not have many T4 GPUs
onnxruntime-tensorrt8-winbuild-T4 | TBD | TBD
Win-CPU-2021 | onnxruntime-Win-CPU-2022 | Will be done in the next PR
Win-CPU-2019 | onnxruntime-Win2022-Intel-CPU | Intel CPU needed for win-ci-pipeline.yml -> `stage: x64_release_dnnl`

### Motivation and Context
With VS2022 we can take advantage of the 64-bit compiler. It also has
better C++20 support.
2023-05-16 10:34:34 -07:00
cloudhan
dc383ed4ce
Basic CSharp packaging support for ROCm EP (#15535)
This PR mainly fixes build errors when trying to build the nupkg for the ROCm EP.
It also slightly improves the packaging logic so that developers can
produce the nupkg natively on Linux.
2023-05-16 07:27:38 +08:00
Wanming Lin
00b1e79e04
Support WebNN EP (#15698)
**Description**: 

This PR enables the WebNN EP in ONNX Runtime Web. It translates
ONNX nodes into [WebNN
API](https://webmachinelearning.github.io/webnn/) calls; the EP is implemented
in C++ and uses the Emscripten [Embind
API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#).
It temporarily uses the preferred layout **NHWC** for WebNN graph partitions
because of a restriction in the WebNN XNNPack backend implementation and the
ongoing
[discussion](https://github.com/webmachinelearning/webnn/issues/324) in
the WebNN spec about whether WebNN should support both 'NHWC' and 'NCHW'
layouts. There is no native WebNN EP; it is only for Web.

**Motivation and Context**:
Allow ONNXRuntime Web developers to access WebNN API to benefit from
hardware acceleration.

**WebNN API Implementation Status in Chromium**:
- Tracked in Chromium issue:
[#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291)
- **CPU device**: based on the XNNPack backend, and has been available on
Chrome Canary M112 behind the "#enable-experimental-web-platform-features"
flag for Windows and Linux platforms. Implementation of more
ops is ongoing.
- **GPU device**: based on DML, implementation is ongoing.

**Open**:
- GitHub CI: WebNN is currently only available on Chrome Canary/Dev with
the XNNPack backend for Linux and Windows. This is an open question for
reviewers: please help identify which GitHub CI should involve the WebNN EP
and guide me in enabling it. Thanks!
2023-05-08 21:25:10 -07:00
Yulong Wang
0457fd0b40
upgrade emsdk to 3.1.37 (#15817)
### Description
upgrade emsdk to 3.1.37

WIP branch to debug the mystery memory issue in web assembly
multi-thread build.
2023-05-08 16:49:47 -07:00
Yulong Wang
33d1372729
[wasm] revert emsdk to v3.1.19 (#15793)
### Description
The multi-thread build generated by the latest emsdk sometimes crashes for an
unknown reason ( error: memory access out of bounds ).

We don't want to break existing ort-web users, so revert emsdk back to
3.1.19 (the same version that ort v1.14.0 uses).
2023-05-04 01:15:01 -07:00
Baiju Meswani
ba7b83ff3c
Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776)
2023-05-03 13:08:35 -07:00
Changming Sun
328cabb194
Download protoc from Github Release instead of Nuget (#15731)
### Description
Download protoc from GitHub Releases instead of NuGet to avoid having a
dependency on nuget.exe on Linux.

### Motivation and Context
To avoid having a dependency on nuget.exe on Linux. Many users' build
environments do not have nuget or dotnet.
2023-05-02 12:18:59 -07:00
Changming Sun
5352f6d9b0
Make "--cuda_version" build arg optional (#15758)
### Description
This change will allow us to build the CUDA EP without installing the CUDA SDK
on Windows.

### Motivation and Context
Nvidia's CUDA installer comes with a VS extension. In the past, we
required installing the extension, which is inconvenient because:
1. Visual Studio must be installed before CUDA SDK. CUDA's installer
will not install the extension if your machine doesn't have Visual
Studio.
2. We need to install CUDA SDK on our build machines, instead of just
downloading it and using it.

After this change, we will not need to install the CUDA SDK on our build
machines, so it will be easier to add support for a different CUDA
version.

Also, fix two PreFast warnings.
2023-05-01 18:00:47 -07:00