Commit graph

1461 commits

Author SHA1 Message Date
Chi Lo
b827ab0efc
[TRT EP] Fix build error for building oss onnx-tensorrt parser (#17468)
If building ORT TRT with `--use_tensorrt_oss_parse` (meaning ORT wil
include [oss onnx-tensorrt
parser](https://github.com/onnx/onnx-tensorrt/blob/main/CMakeLists.txt#L82)
and build it from source) ,the cmake CUDA_INCLUDE_DIR variable is
needed.

if not, you will encounter following [ build
error](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1133937&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=39e3f98f-7fe5-578c-20bd-5ae5a4590bda):

CMake Error: The following variables are used in this project, but they
are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the
CMake files:
    /build/Release/_deps/onnx_tensorrt-src/CUDA_INCLUDE_DIR

Note: Not quite sure why in the past when CI still tested with oss
parser won't hit this issue. probably the CUDA_INCLUDE_DIR was defined
somewhere back then.
2023-09-08 20:34:57 -07:00
Caroline Zhu
dcc93909b4
Add training WASM generation to Web CI pipeline (#17319)
### Description
[Successful pipeline
run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results)

Added flag to build the training artifacts & updated the
pull-wasm-artifacts script to pull the training artifacts as well.

Bundled into this PR are minor formatting fixes + naming fixes.

### Motivation and Context
[This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended
the WASM API wrapper to build training WASM artifacts as well.
The ORT training WASM artifacts are required to support ORT training web
bindings.
2023-09-08 15:49:47 -07:00
Changming Sun
bc84f52633
Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470)
### Description
Update C/C++ dependencies abseil, date, nsync, googletest, wil, mp11,
cpuinfo and safeint to newer versions per request of @
mayeut. He created the following PRs to update the deps:
https://github.com/microsoft/onnxruntime/pull/15432
https://github.com/microsoft/onnxruntime/pull/15434
https://github.com/microsoft/onnxruntime/pull/15435
https://github.com/microsoft/onnxruntime/pull/15436
https://github.com/microsoft/onnxruntime/pull/15437

However, our build system needs to fetch the dependencies from an
internal mirror that only Microsoft employees have write access to. So I
closed his PRs and created this one.

This PR also updates abseil to a newer version. This is to prepare for
upgrading re2.
2023-09-08 13:35:04 -07:00
Yulong Wang
110a2d0b73
[build][wasm] add js_internal_api.js to link dependency (#17407)
### Description
add js_internal_api.js to link dependency. Now changes to
js_internal_api.js will correctly trigger re-link of ort-wasm.wasm
2023-09-05 20:40:40 -07:00
Changming Sun
c6b0d185b4
Update cmake to 3.27 and upgrade Linux CUDA docker files from CentOS7 to UBI8 (#16856)
### Description
1. Update docker files and their build instructions.
ARM64 and x86_64 can use the same docker file.

2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8
AB#18990
2023-09-05 18:12:10 -07:00
Lennart Hannink
e3bb2a0cdd
Fix git working dir for ORT_BUILD_INFO (fixes #17197) (#17198)
### Description
Git commands producing `git-commid-id` and `git-branch` are always run
in `CMAKE_CURRENT_SOURCE_DIR` (i.e. `onnxruntime/cmake`)


### Motivation and Context
Please refer to corresponding issue
[#17197](https://github.com/microsoft/onnxruntime/issues/17197).
2023-09-05 09:20:49 -07:00
cloudhan
6ea3908db4
Add ck's streamk and splitk gemm impl (#17280) 2023-09-04 11:49:07 +08:00
aciddelgado
44101e8771
Flash Attention v2 MHA (#17227)
### Description
Integrate Flash Attention V2 to PackedMultiHeadAttention,
MultiHeadAttention and Attention operators.

Flash Attention v2 source code is from
https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src.
We did some change to remove dependency on Torch, then removed backward
and bfloat16 related code.

Add benchmark script (see benchmark_mha.sh) to compare different
attention kernels for MultiHeadAttention operator.

Current limitations for Flash Attention in PackedMultiHeadAttention,
MultiHeadAttention and Attention operators:
* Relative Position Bias is not supported
* Different hidden size for Q and V is not supported
* Only float16 is supported
* Padding/attention mask is not supported
* For MultiHeadAttention, when there is past or present input, bias
shall be provided to activate flash attention
* For Attention, past or present inputs will deactivate flash attention
* Causal is not supported

Some limitations (like attention mask and causal) might be removed
later.

Currently, Flash Attention v2 only works in Linux. For Windows, we will
enable later with Cutlass 3.2.

Two environment variables can be used for testing purpose:
(1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default
value is 0 (enable). Set it to "1" to disable it.
(2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is
"513", which means that we only enable flash attention when sequence
length is larger than 512 for packed QKV format. Set it to "0" if you
want to use flash attention v2 whenever possible.

### Speedup

The following result is from Standard_ND96amsr_A100_v4 VM
(A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per
second for MultiHeadAttention operator.

There are 3 input formats:
* `Q,K,V` means separated inputs query, key and value of BxSxNH
* `Q,KV` means packed KV, where key is 5D: BxSxNx2xH
* `QKV` means packed QKV, where query is 5D: BxSxNx3xH

Note that flash attention cannot use packed QKV format, so extra
Transpose is needed. We found that TensorRT kernel is faster for
sequence length <= 512 for packed QKV. The reason might be no transpose
is needed for TensorRT kernel in this format.

We also notice that, TensorRT kernel is faster for stable diffusion
512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while
flash attention v2 is faster for 1024x1024 image (see seq_len=16384,
heads=8, head_dim=40 below).

input format | batch size | sequence length | heads | head dim |
flash_v2 (TFLOPs/s) | TensorRT (TFLOPs/s) | Memory Efficient Attention
(TFLOPs/s)
-- | -- | -- | -- | -- | -- | -- | --
Q,K,V | 32 | 512 | 64 | 32 | 78.1 | 60.0 | 39.3
Q,K,V | 32 | 512 | 128 | 16 | 46.8 | 44.1 | 21.7
Q,K,V | 16 | 1024 | 64 | 32 | 99.0 | 72.8 | 44.3
Q,K,V | 16 | 1024 | 128 | 16 | 54.7 | 49.2 | 23.4
Q,K,V | 8 | 2048 | 64 | 32 | 113.8 | 81.2 | 47.8
Q,K,V | 8 | 2048 | 128 | 16 | 59.7 | 51.9 | 24.7
Q,K,V | 4 | 4096 | 64 | 32 | 122.5 | 85.6 | 49.7
Q,K,V | 4 | 4096 | 128 | 16 | 62.5 | 53.3 | 25.3
Q,K,V | 2 | 8192 | 64 | 32 | 127.4 | 87.5 | 50.7
Q,K,V | 2 | 8192 | 128 | 16 | 64.0 | 54.2 | 25.6
Q,K,V | 1 | 16384 | 64 | 32 | 129.5 | 91.0 | 51.2
Q,K,V | 1 | 16384 | 128 | 16 | 64.7 | 54.5 | 25.8
Q,K,V | 1 | 4096 | 8 | 40 | 51.0 | 43.6 | 36.8
Q,K,V | 1 | 4096 | 8 | 80 | 97.7 | 77.0 | 55.5
Q,K,V | 1 | 4096 | 8 | 160 | 120.0 | 39.7 | 57.8
Q,K,V | 4 | 4096 | 8 | 40 | 89.0 | 84.4 | 49.2
Q,K,V | 4 | 4096 | 8 | 80 | 133.0 | 92.2 | 63.2
Q,K,V | 4 | 4096 | 8 | 160 | 164.8 | 42.7 | 63.8
Q,K,V | 1 | 16384 | 8 | 40 | 96.9 | 91.3 | 52.1
Q,K,V | 1 | 16384 | 8 | 80 | 142.9 | 101.5 | 65.6
Q,K,V | 1 | 16384 | 8 | 160 | 177.4 | 44.2 | 65.7
Q,K,V | 128 | 128 | 12 | 64 | 29.0 | 26.9 | 25.7
Q,K,V | 64 | 128 | 12 | 64 | 23.1 | 10.8 | 21.3
Q,K,V | 128 | 384 | 12 | 64 | 83.5 | 60.8 | 55.7
Q,K,V | 64 | 384 | 12 | 64 | 72.6 | 40.5 | 52.8
Q,K,V | 128 | 512 | 12 | 64 | 98.9 | 77.9 | 62.1
Q,K,V | 64 | 512 | 12 | 64 | 94.7 | 75.6 | 60.4
Q,KV | 32 | 512 | 64 | 32 | 85.9 | 41.1 | 41.1
Q,KV | 32 | 512 | 128 | 16 | 47.1 | 21.6 | 21.6
Q,KV | 16 | 1024 | 64 | 32 | 104.4 | 45.8 | 45.8
Q,KV | 16 | 1024 | 128 | 16 | 54.7 | 23.6 | 23.6
Q,KV | 8 | 2048 | 64 | 32 | 116.8 | 48.5 | 48.5
Q,KV | 8 | 2048 | 128 | 16 | 59.8 | 24.7 | 24.7
Q,KV | 4 | 4096 | 64 | 32 | 124.2 | 50.1 | 50.1
Q,KV | 4 | 4096 | 128 | 16 | 62.6 | 25.3 | 25.3
Q,KV | 2 | 8192 | 64 | 32 | 128.5 | 50.8 | 50.9
Q,KV | 2 | 8192 | 128 | 16 | 64.1 | 25.6 | 25.6
Q,KV | 1 | 16384 | 64 | 32 | 129.4 | 51.2 | 51.2
Q,KV | 1 | 16384 | 128 | 16 | 64.8 | 25.8 | 25.8
Q,KV | 1 | 4096 | 8 | 40 | 67.5 | 37.7 | 37.5
Q,KV | 1 | 4096 | 8 | 80 | 101.3 | 56.7 | 56.6
Q,KV | 1 | 4096 | 8 | 160 | 124.0 | 58.6 | 58.6
Q,KV | 4 | 4096 | 8 | 40 | 90.8 | 49.8 | 49.8
Q,KV | 4 | 4096 | 8 | 80 | 135.6 | 63.8 | 63.8
Q,KV | 4 | 4096 | 8 | 160 | 166.3 | 64.5 | 64.5
Q,KV | 1 | 16384 | 8 | 40 | 97.5 | 52.3 | 52.3
Q,KV | 1 | 16384 | 8 | 80 | 143.5 | 65.9 | 65.8
Q,KV | 1 | 16384 | 8 | 160 | 178.4 | 65.9 | 65.8
Q,KV | 128 | 128 | 12 | 64 | 26.8 | 48.1 | 30.9
Q,KV | 64 | 128 | 12 | 64 | 28.0 | 38.9 | 25.0
Q,KV | 128 | 384 | 12 | 64 | 97.7 | 61.1 | 61.0
Q,KV | 64 | 384 | 12 | 64 | 89.5 | 57.8 | 57.9
Q,KV | 128 | 512 | 12 | 64 | 111.9 | 66.7 | 66.9
Q,KV | 64 | 512 | 12 | 64 | 107.2 | 64.9 | 64.8
QKV | 32 | 512 | 64 | 32 | 77.2 | 84.7 | 39.3
QKV | 32 | 512 | 128 | 16 | 43.4 | 53.1 | 20.9
QKV | 16 | 1024 | 64 | 32 | 98.8 | 87.4 | 44.6
QKV | 16 | 1024 | 128 | 16 | 52.0 | 54.1 | 23.2
QKV | 8 | 2048 | 64 | 32 | 113.1 | 89.0 | 47.9
QKV | 8 | 2048 | 128 | 16 | 58.2 | 54.6 | 24.5
QKV | 4 | 4096 | 64 | 32 | 120.6 | 89.7 | 49.7
QKV | 4 | 4096 | 128 | 16 | 61.7 | 54.6 | 25.2
QKV | 2 | 8192 | 64 | 32 | 125.9 | 89.5 | 50.7
QKV | 2 | 8192 | 128 | 16 | 63.6 | 54.8 | 25.5
QKV | 1 | 16384 | 64 | 32 | 128.5 | 92.0 | 51.2
QKV | 1 | 16384 | 128 | 16 | 64.6 | 54.8 | 25.7
QKV | 1 | 4096 | 8 | 40 | 60.2 | **69.8** | 38.1
QKV | 1 | 4096 | 8 | 80 | 101.6 | 75.2 | 56.7
QKV | 1 | 4096 | 8 | 160 | 130.2 | 41.2 | 58.4
QKV | 4 | 4096 | 8 | 40 | 90.6 | **91.0** | 49.5
QKV | 4 | 4096 | 8 | 80 | 133.6 | 98.1 | 62.8
QKV | 4 | 4096 | 8 | 160 | 165.3 | 43.7 | 63.9
QKV | 1 | 16384 | 8 | 40 | 97.2 | 92.8 | 52.1
QKV | 1 | 16384 | 8 | 80 | 143.0 | 103.1 | 65.6
QKV | 1 | 16384 | 8 | 160 | 177.6 | 44.5 | 65.7
QKV | 128 | 128 | 12 | 64 | 31.1 | 65.9 | 27.6
QKV | 64 | 128 | 12 | 64 | 26.1 | 49.8 | 23.5
QKV | 128 | 384 | 12 | 64 | 84.6 | 88.5 | 56.1
QKV | 64 | 384 | 12 | 64 | 79.1 | 80.3 | 53.5
QKV | 128 | 512 | 12 | 64 | 97.3 | 114.2 | 62.2
QKV | 64 | 512 | 12 | 64 | 95.9 | 110.7 | 60.6
QKV | 4 | 2048 | 32 | 128 | 125.26 | 44.72 | 78.15
QKV | 4 | 4096 | 32 | 128 | 141.62 | 46.29 | 85.84
QKV | 8 | 2048 | 32 | 128 | 127.40 | 45.49 | 78.75
QKV | 8 | 4096 | 32 | 128 | 144.24 | 46.60 | 86.95

### Known Issues

NVCC uses huge memory while compiling flash attention CUDA kernel. Linux
build with CUDA might fail when machine has limited memory while number
of CPUs is large. Walkaround is to use a build machine with larger
memory, or use argument like `--nvcc_threads 1` to limit nvcc threads in
build.

### Motivation and Context
Increases speed and efficiency of MHA or Packed MHA.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-08-31 13:52:21 -07:00
Wanming Lin
3a53836836
[WebNN EP] Fix compilation with newer flatbuffers (#17367) 2023-08-31 10:22:15 -07:00
Artem Shilkin
6e60dba726
Fix compilation with newer flatbuffers (#17164)
In flatbuffers@v23.5.9 was broken forward declaration for
FlatBufferBuilder. Trying to compile onnxruntime falls with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
                                     ^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
    class FlatBufferBuilder;
```
This PR removes these declarations and puts includes instead
2023-08-29 10:28:26 -07:00
Caroline
228db24317
Add training API functions to WASM API (#16521)
### Description
* Created `wasm/training_api` source and header files & modified
WebAssembly CMake to include training flags
* The `wasm/training_api` files use an `OrtTrainingManager` handle which
is a struct of an OrtCheckpointState and an OrtTrainingSession, rather
than creating a CheckpointState handle & a separate TrainingSession
handle.
* This is so that the TypeScript side only has to manage one handle that
will be passed between TrainingSession & CheckpointState
representations, rather than the TypeScript side managing separate
CheckpointStateHandle and TrainingSessionHandle.


### Motivation and Context
WASM API needs to be updated with ORT training API function calls so
that ORT training web bindings can be added for on-device training.

---------

Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-08-28 11:05:02 -07:00
Arthur Islamov
c262879214
Added DML and CUDA provider support in onnxruntime-node (#16050)
### Description
I've added changes to support CUDA and DML (only on Windows, on other
platforms it will throw an error)



### Motivation and Context
It fixes this feature request
https://github.com/microsoft/onnxruntime/issues/14127 which is tracked
here https://github.com/microsoft/onnxruntime/issues/14529

I was working on StableDiffusion implementation for node.js and it is
very slow on CPU, so GPU support is essential.

Here is a working demo with a patched and precompiled version
https://github.com/dakenf/stable-diffusion-nodejs

---------
2023-08-25 16:57:06 -07:00
Yulong Wang
79c4ed9a45
[js/webgpu] support error pop and kernel name (#17260)
### Description
This PR contains changes to support error pop and kernel name.

- Add a function `JsepGetNodeName` to allow reading kernel name from JS
to C++
- When in debug mode ( `env.debug = true;` ) or in profiling mode (
`env.webgpu.profilingMode = 'default';` ), kernel name will be read from
ORT; otherwise use the kernel pointer ( a number ) as kernel name to
save calls from JS to C++.
- When in debug mode, WebGPU validation errors will be recorded and if
any error occurs, `inferenceSession.run()` will fail (Promise get
rejected). Behavior when not in debug mode is not changed. This is
because recording errors are not zero-overhead, and GPU validation
errors should occur consistently in and not in debug mode.
- Add `jsepOnRunStart()` and `jsepOnRunEnd()` hook to:
   - allow implementation of the features mentioned above.
   - pass session ID to backend.
2023-08-25 08:08:15 -07:00
mindest
735cc8e6c8
[ROCm] enable If op for ROCm EP. (#17279)
### Description
Enable If op for ROCm EP.
2023-08-25 17:49:49 +08:00
Baiju Meswani
fca81cc5d5
ConvTransposeGrad CUDA Kernel (#17201) 2023-08-24 09:08:06 -07:00
cloudhan
87bef1f3f2
Move composable_kernel to deps.txt (#17245) 2023-08-23 17:39:16 -07:00
kunal-vaishnavi
edac3ef150
Add LLaMA scripts (#17020)
### Description
This PR adds the following scripts for LLaMA:
- LLaMA conversion (support for TorchScript and Dynamo exporters)
- LLaMA parity
- LLaMA benchmark
- LLaMA quantization
- LLaMA integration with [Hugging Face
Optimum](https://github.com/huggingface/optimum)



### Motivation and Context
This PR adds scripts for using LLaMA. There is a [follow-up
PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding
scripts for Whisper.
2023-08-22 18:05:11 -07:00
Edward Chen
bd8a488f4b
Enable verbose logging in unit test program with environment variable. (#17133)
Enable verbose logging in unit test program with environment variable.
E.g., `ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 ./onnxruntime_test_all --gtest_filter="<test that I want to see more logs for>"`.
2023-08-22 12:13:52 -07:00
cloudhan
4e6cec4d09
Update ck and enable test (#16383)
Apply the fix in https://github.com/ROCmSoftwarePlatform/composable_kernel/issues/728
Introduce more kernel instances and allow the introduction of streamk and splitk.
2023-08-22 11:08:55 +08:00
Sheil Kumar
cbaa008391
Bump DirectML version from 1.12.0 to 1.12.1 (#17225)
Bump DirectML version from 1.12.0 to 1.12.1

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-08-20 09:55:38 -07:00
Changming Sun
3cec88bd12
FIX: memory leak checker is incompatible with std::stacktrace (#17209)
### Description
When I worked on PR #17173, I didn't notice that
onnxruntime\core\platform\windows\debug_alloc.cc also needs to call
dbghelp functions like SymInitialize. So, if we use vc runtime's
stacktrace functionality, vc runtime will initialize/uninitialize the
dbghelp library independently and vc runtime's stacktrace helper DLLs
get unloaded before our memory leak checker starts get work. Then we
call SymSetOptions, it crashes.

More details:
In VC runtime the C++23 stacktrace functions are implemented on top of
dbgeng.dll. In C:\Program Files\Microsoft Visual
Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\crt\src\stl\stacktrace.cpp,
you can see it has:
```
                dbgeng = LoadLibraryExW(L"dbgeng.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32);
```
The dbgeng.dll is a wrapper around dbghelp.dll. It calls SymInitialize
and SymCleanup. dbgeng.dll gets unloaded before our memory leak check
starts to run. In theory we should be able to call SymInitialize again
if the previous user who called SymInitialize has also called
SymCleanup. However, users can use
SymRegisterCallback/SymRegisterCallback64/SymRegisterCallbackW64 to
register callback functions to dbghelp.dll. These callback functions
need to be alive when SymSetOptions(and some other dbghelp APIs) get
called.

### Motivation and Context
2023-08-18 17:10:33 -07:00
Changming Sun
ee09a5ff35
Add DISABLE_CUSPARSE_DEPRECATED flag to CUDA build (#17207)
This is to suppress a warning and make Windows CUDA 12.2 build work.
2023-08-18 10:25:49 -07:00
Chi Lo
2fb148dd88
Temporarily enforce "Debug build" TRT EP with trt oss parser on Windows (#17059)
This PR handles two changes:

1. There is an issue when running "Debug build" TRT EP with "Release
build" TRT builtin parser on Windows. Enforce use oss parser for Debug
build.
Note: args.config in build.py is an array, for example ["Debug",
"Release"...]. The code will be much mess if we made the change there.
2. Update to use latest commit of oss parser.

Please see the https://github.com/microsoft/onnxruntime/issues/16273
2023-08-17 12:17:25 -07:00
Changming Sun
5249b7ab7c
Re-implement stacktrace (#17173)
### Description
Re-implement stacktrace. The new implementation doesn't directly use
Windows API, hence can avoid problems regarding to
initialize/uninitialize the dbghelp library.

### Motivation and Context
2023-08-16 16:07:49 -07:00
Dmitri Smirnov
f45eef399e
Fix visualization issues with Attribute/Tensor protos (#17188)
### Description
Protobuf Natvis
2023-08-16 13:56:51 -07:00
RandySheriffH
3dd2c1b4d7
EP context for custom op (#16454)
Implement infrastructures to allow EP resources surfaced to custom ops.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-16 13:03:40 -07:00
Maximilian Müller
7b9d1f18c7
NVTX windows include and link fixes (#16831)
### Description

For windows headers are not duplicated to the normal cuda include. For
linux they are:
```
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
```
Is the preference via those added defines or should the include just be
changed to be `nvtx3/` ?

Also there is no library linking needed on Windows and the library is
not even present.
2023-08-16 11:53:58 -07:00
Changming Sun
8e203efc69
Cleanup cmake file (#17154)
### Description
1. Clean up cmake files. Remove some unused code
2. Remove the "Semmle" task from
tools/ci_build/github/azure-pipelines/templates/win-ci.yml. Semmle is
deprecated and replaced by CodeQL.
2023-08-15 10:51:33 -07:00
Matthieu Darbois
5e971bc51a
Rework WIL dependency retrieval/usage (#17130)
### Description
1. `onnxruntime_fetchcontent_makeavailable` works around unconditional
install commands so that can be used instead of `FetchContent_Populate`
2. This dependency is Windows specific, mark it as such.

### Motivation and Context
1. This simplifies `cmake/external/wil.cmake` not to do anything
specific wether WIL was fetched or found
2. Given it's specific to Windows, it might not be available on other OS
in specific air-gapped environment such as
[conan-center-index](https://github.com/conan-io/conan-center-index).
This allows downstream builds not to require specific patches for
something not required by the build in the first place.
2023-08-15 09:11:46 -07:00
Wenbing Li
d052c8a45c
Remove the extensions submodule (#17097)
### Description
Remove the onnxruntime-extensions submodule since it now was used via
cmake FetchContent


### Motivation and Context
The submodule relies on an outdated version of the extensions, and the
build instructions should be updated to eliminate any confusion.
2023-08-14 10:16:33 -07:00
Yulong Wang
5704e71b89
update onnx.patch to apply wasm build break fix (#17104)
### Description
This PR fixes build break for WebAssembly introduced in
6986981482
(435ad2b1d8).

This change updates onnx.patch in onnxruntime repo. the corresponding PR
in onnx repo is: https://github.com/onnx/onnx/pull/5495.

It may takes a while for the next onnx version bump.
2023-08-11 15:00:39 -07:00
Changming Sun
4728f20f9a
Fix CI build (#17118)
### Description
Some pipelines are failing. It is because PR #16325 set ONNX version to
`rel-1.14.1` . It is a branch name, not a commit or tag name. It means
whenever the branch got a new commit, we will auto pick it and use it.
2023-08-11 10:56:38 -07:00
Yulong Wang
9cd4e5af68
[wasm] upgrade emsdk to 3.1.44 (#17069)
### Description
This change upgrade emsdk to 3.1.44.

Because backend is upgraded to LLVM 16, so need to fix a lot of build
failures caused by "-Wshorten-64-to-32".

most of the build failures comes from generated `onnx.pb.h`, and this
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignore shorten-64-to-32 warnings.
2023-08-10 16:08:36 -07:00
Bowen Bao
6986981482
Bump ONNX version (#16325)
### Description
Bump ONNX version to https://github.com/onnx/onnx/tree/rel-1.14.1 to
include a fix for segfault when shape inferencing nested onnx functions.



### Motivation and Context
Resolves #16170
2023-08-10 11:27:28 -07:00
Jeff Daily
dbbfc249f7
[ROCm] update header and binary search paths used by cmake (#17083)
This is in preparation for planned ROCm 6.0 changes that are not
backward compatible. However, the adjustments made by this PR to the
current onnxruntime cmake files will work with ROCm 5.x and 6.x.
2023-08-10 11:05:21 +08:00
Changming Sun
7d340256f1
Add "windows_sdk_version" build arg and fix SCA build pipeline (#17062)
### Description
1. Add "--windows_sdk_version" argument to build.py
2. Fix Windows Static Analysis build pipeline. It is failing because it
picks up a different Windows SDK version after a build machine image
update. If we can explicitly specify Windows SDK version, we can avoid
such things happening again.
3. Remove --enable_training from Windows Static Analysis build pipeline
because PR #16993 makes it incompatible with "no_rtti".

AB#18315
2023-08-09 14:01:16 -07:00
sfatimar
2c5d4dce77
Openvino ep ort 5.1 (#17042)
OpenVINO EP ORT 5.1 Branch
Changes for the new API to take in OpenVINO Provider Options
and compatibility with OV 2023.1


### Motivation and Context
The change is required for the new API to take in OpenVINO Provider
Options
and make it seamless.

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-08-09 11:50:10 -07:00
Dmitri Smirnov
07dfe34714
Fix FunctionProto visualization (#17063)
### Description
Title

### Motivation and Context
Need to debug function protos
2023-08-09 11:05:52 -07:00
Baiju Meswani
249917a093
Add mac and windows python packages for onnxruntime-training (#16993) 2023-08-07 20:32:55 -07:00
Chen Fu
3c10f027de
4b quantization for weights of LLMs (#16833)
### Description
Blockwise 4b quantization for LLMs. 
1. Introduce 4b block-wise quantization for linear layer weights.
2. Implements matrix multiplication kernel for fp32 x int4
3. Implements special operator MatMulFpQ4
4. Implements quantization tool, that convert MatMul operator to
MatMulFpQ4, when the right hand side is 2D const tensor.


### Motivation and Context
Compress and accelerate LLMs

|Benchmark | Time(ns)|
|-------------|----------|
|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8| 218054|
|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8| 35830155|
|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8| 73479790|
|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8| 270152|
|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8| 35826721|
|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8| 73021200|
|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8| 213832|
|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8| 36749874|
|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8| 72618120|


|Benchmark | Time(ns)|
|-------------|----------|
|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8|   522610|
|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8| 39237689|
|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8| 75983467|

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-08-07 12:23:55 -07:00
Dmitri Smirnov
d5e4bdbe7d
Fix protobuf TaggedStringPtr display (#17008)
### Description
<!-- Describe your changes. -->
Adjust nativs to display tagged strings.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Hard to debug without seeing names.
2023-08-04 17:51:01 -07:00
pengwa
a6887f171f
Refactor schema extraction and output unflattening (#16894)
### Motivation and Context

When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common for us to flatten a structured data into a 1-D
tensor list (required by lib for example torch.onnx.export,
torch.autograd.Function.forward or ORT inference session), then do
subsequent work, then unflatten back to original hierarchy as returned
values.

DeepStage3 hooks support work also need such a lib to do similar things,
so I was proposing to extract this pair of APIs in training/utils/,
which can be more used more generally. Also a comprehensive set of test
data are used for testing unflatten/flatten in unit tests.

Let me know if you have any other suggestions. 


### Refactor schema extraction and output unflattening

Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` . to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared libs, which can be used later by other features (deepspeed stage
3 hook rewrite).

While there are still a few duplicated logic handling flatten with
different task by recursively loop the data struct, will change them
step by step in case of heavy review efforts.
2023-08-04 13:58:21 +08:00
Jeff Daily
1629a6fa75
[ROCm] add gfx1100 and gfx1101 to CMAKE_HIP_ARCHITECTURES (#16972)
### Description
Support additional AMD GPU architectures.

### Motivation and Context
AMD announced expanding support for additional GPUs.

https://community.amd.com/t5/rocm/new-rocm-5-6-release-brings-enhancements-and-optimizations-for/ba-p/614745

This PR is how we will deliver that expanded support to onnxruntime.
2023-08-04 08:38:42 +08:00
Michael Klimenko
07e6648e12
Enable Intel oneAPI DPC++/C++ compiler build (#16587)
Last week I fixed error #16484 found when trying to build onnxruntime
with the icpx compiler. Another thing I found out is that icpx uses
-ffast-math flag by default. You can check it by running the compiler
with -v flag like following:

```bash
# Setup the environment
. /opt/intel/oneapi/setvars.sh
# Compile any file to see all the implicit flags
icpx -v main.cpp
```

This leads to a bunch of warnings during the build like:

```bash
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/cpu/tensor/upsample_op_test.cc:5:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/provider_test_utils.h:6:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/checkers.h:10:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/core/util/math_cpuonly.h:68:
In file included from /mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/Core:172:
/mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/src/Core/MathFunctions.h:1019:12: warning: comparison with NaN always evaluates to false in fast floating point modes [-Wtautological-constant-compare]
    return isnan EIGEN_NOT_A_MACRO (x);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
		   
And some tests are failing as well, usually with infinities involved. To
list a few:

```bash
# ...
1: [  FAILED  ] IsInfTest.test_isinf_float
1: [  FAILED  ] IsInfTest.test_isinf_double
1: [  FAILED  ] IsInfTest.test_isinf_positive_float
1: [  FAILED  ] IsInfTest.test_isinf_positive_double
1: [  FAILED  ] IsInfTest.test_isinf_negative_float
1: [  FAILED  ] IsInfTest.test_isinf_negative_double
1: [  FAILED  ] IsNaNOpTest.IsNaNFloat
1: [  FAILED  ] IsNaNOpTest.IsNaNDouble
# ...
```

This PR adds a quick global check for the IntelLLVM compiler, as in the
way its name is reported by CMake and then, depending on the compiler
driver, sets either MSVC-like or GCC-like switch to disable fast-maths.

Probably a bit cleaner solution would be to use
```target_compile_options(${TARGET} PRIVATE MEOW)``` instead of a
global-wide ```set(CMAKE_CXX_FLAGS MEOW)```, but then we'd be required
to add it to all the individual targets and execution providers and this
will lead to a lot of code duplication.
2023-08-02 12:50:35 -07:00
Chi Lo
f4faceab28
Ignore deprecated declarations warning for TRT EP build (#16948)
In additions to `onnxruntime_test_all`, `onnxruntime_shared_lib_test`
and `onnxruntime_customopregistration_test` should
also add  "-Wno-deprecated-declarations" flag to ignore compiler warning
2023-08-02 09:51:58 -07:00
Changming Sun
73ddba964f
Update the MacOS/Linux build scripts that build/install protobuf from source (#16906)
### Description
1. As a follow-up of #16761, this PR allows build ORT on iOS/Android
without the need to explicitly specify a protoc path. #16761 is for
WASM. This one is for iOS/Android
2. Update the MacOS/Linux build scripts that build/install protobuf from
source. Make them be more flexible. Add the support for
RedHatEnterprise(ubi), which will needed for upgrading the base image
from centos:7 to ubi:8.
3. Update tools/ci_build/github/pai/rocm-ci-pipeline-env.Dockerfile :
the docker file's base image has preinstalled protobuf in /usr/local, we
should uninstall them to avoid conflicts.
2023-07-31 10:51:48 -07:00
Dmitri Smirnov
50764362ac
Update protobuf Natvis visualization (#16911)
### Description
Protobuf library update broke debug visualization.

### Motivation and Context
Hard to debug
2023-07-31 09:35:21 -07:00
satyajandhyala
dd24d52737
[JS/Web] Added Gelu contrib operator support to JSEP (#16909)
### Description
Added Gelu operator to JSEP


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-31 09:18:58 -07:00
Tianlei Wu
742edec5e8
[CUDA] Add PackedMultiHeadAttention operator (#16779)
### Description
Add new operator for MultiHeadAttention with inputs removed padding.
This only supports packed QKV format.
2023-07-28 16:35:38 -07:00
Changming Sun
161a9d1d6d
Add some safety check for conv op (#16839)
### Description
Add some safety check for conv op.
It is to validate if the attributes coming from a conv op are in a valid
range. (shouldn't be too large or too small).
2023-07-27 16:37:55 -07:00