Commit graph

1285 commits

Author SHA1 Message Date
Jian Chen
6891ab5bac
fix_macos (#15018)
### Description
<!-- Describe your changes. -->
This fix macos packaging build on universal2 arch. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-14 21:54:44 -07:00
PeixuanZuo
2ff7f3e93a
[ROCm] support optimized Stable Diffusion model (#14980)
Add BiasSplitGelu/BiasAdd/GroupNorm/NhwcConv operator for ROCm EP.

1. BiasSplitGelu and BiasAdd operators can be automatically hipified
from CUDA EP.
2. GroupNorm was hipified from CUDA EP and modified to build.
3. NhwcConv is similar to NhwcConv in CUDA EP, But the MIOpen API and
cuDnn API are different. `miopenConvolutionForwardbias` and
`miopenOpTensor` of MIOpen doesn't support NHWC layout now, use
BinaryElementwise to replace miopenConvolutionForwardbias(NHWC layout).
2023-03-14 23:15:37 +08:00
Adrian Lizarraga
d8ddd25272
Add InstanceNormalization operator to QNN EP (#14867)
### Description

QNN EP:
- Adds the
[InstanceNormalization](https://onnx.ai/onnx/operators/onnx__InstanceNormalization.html)
operator to QNN EP.
- Fixes graph composition bug when Transpose node is the last node in a
graph.
- Adds check for input shape when GetCapability is called (before and
after layout transformation)
- Should add similar checks for other layout sensitive ops (conv, pool,
...) in a separate PR
- Adds initial QNN op tests for QDQ conv and QDQ  InstanceNormalization
  - Should add tests for other ops in a separate PR

Optimizer:
- Makes InstanceNormalization a layout sensitive operator.
- Adds a custom QDQ group selector for InstanceNormalization.

Quantization tool:
- Adds QDQ support for InstanceNormalization operator.
- Adds python unit test for InstanceNormalization quantization.

### Motivation and Context
Needed to support stable diffusion models with QNN.

---------

Co-authored-by: Hector Li <hecli@microsoft.com>
2023-03-10 14:42:41 -08:00
Maximilian Müller
ad4db12699
TensorRT EP - timing cache (#14767)
### Description

This will enable a user to use a TensorRT timing cache based on #10297
to accelerate build times on a device with the same compute capability.
This will work across models as it simply store kernel runtimes for
specific configurations. Those files are usually very small (only a few
MB) which makes them very easy to ship with an application to accelerate
the build time on the user end.

### Motivation and Context
Especially for workstation use cases TRT build times can be a roadblock.
With a few model from ONNX model zoo i evaluated speedups when a timing
cache is present.
`./build/onnxruntime_perf_test -e tensorrt -I -t 5 -i
"trt_timing_cache_enable|true" <onnx_path>`

|Model | no Cache | with Cache|
| ------------- | ------------- | ------------- |
|efficientnet-lite4-11 | 34.6 s | 7.7 s|
|yolov4 | 108.62 s | 9.4 s|

To capture this is had to modify the onnxruntime_perf_test. The time is
sometimes not captured within "Session creation time cost:" which is why
i introduced "First inference time cost:".

---------

Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
2023-03-10 09:02:27 -08:00
Edward Chen
c46c7ccba5
Update Gradle version (#14862)
- Update Gradle version used in most places from 6.8.3 to 8.0.1. Update Android Gradle Plugin version where applicable.
  Not updated in this change: React Native Android projects (under `js/react_native/`). That can be done later along with updating the React Native projects.

- Add Gradle wrapper in `java/` to make it easier to consistently use a specific Gradle version.
2023-03-08 12:22:06 -08:00
Changming Sun
d9436407b6
Use safe allocator for JNI code (#13999)
### Description
Use a customized allocarray function to replace the original malloc
calls to avoid integer overflow.

### Motivation and Context
Fix Prefast warnings. 

Fixed
[AB#8990](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/8990)
Fixed
[AB#8991](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/8991)
Fixed
[AB#9016](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/9016)
2023-03-08 11:40:55 -08:00
Adam Pocock
47f00b5d49
[Java] Initial on device training support (#14027)
contributor: @Craigacp
2023-03-08 10:01:08 -08:00
Hariharan Seshadri
112a4d215a
[CUDA] Support decoding multihead self-attention implementation (#14848) 2023-03-08 09:17:54 -08:00
George Wu
289f7dbcdd
enable pybind for qnn ep (#14897)
enable python bindings for QNN EP.
tested on Windows Dev Kit 2023 (ARM64) with python 3.11 (ARM64) from 
https://www.python.org/ftp/python/3.11.1/python-3.11.1-arm64.exe
2023-03-03 07:26:53 -08:00
Chun-Wei Chen
70a31e047a
Consume ONNX 1.13.1 in ONNX Runtime (#14812)
### Description
<!-- Describe your changes. -->
Consume ONNX 1.13.1 in ONNX Runtime. (ONNX 1.13.0 to ONNX 1.13.1)


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ONNX 1.13.1 patch was just released yesterday. This PR is making ORT's
ONNX submodule consistent with the latest released ONNX. Not sure
whether this PR is really needed, but let me make it ready. Previous PR
for testing ONNX 1.13.1rc2 :
https://github.com/microsoft/onnxruntime/pull/14634.

Fixed
[AB#13174](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13174)
.
2023-03-02 14:57:35 -08:00
Hector Li
c6074f3a4b
OnnxRuntime QNN EP (#14791)
### Description
Integrate Qualcomm QNN SDK to enable inference on QC hexagon NPU devices

### Motivation and Context
Enable Ort inference on QC hexagon NPU devices.

---------

Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
2023-03-01 13:48:20 -08:00
Chen Fu
acc2ac627f
Fp16 Activations (#14722)
### Description

NEON fp16 SIMD implementation of Activation functions


### Motivation and Context
Step 2 of fp16 SIMD support.

---------

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-28 17:20:40 -08:00
Yulong Wang
69c5edb11b
[wasm] upgrade emsdk from 3.1.19 to 3.1.32 (#14818)
### Description
upgrade emsdk from 3.1.19 to 3.1.32

also add explicit config for stack size (1MB).
2023-02-28 11:06:09 -08:00
kailums
9bdd42115c
add build flag for rocblas tune and fix bug (#14797)
### Description
<!-- Describe your changes. -->
1. add a build flag for rocblas tuning feature.

2. fix a build bug when enable rocblas tuning.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The rocblas tunning feature has no build flag to control, only using a
MACRO flag.

So I add an build flag, and fix a code bug when enable rocblas tunning.
2023-02-28 10:37:07 +08:00
Yulong Wang
6b83ad9659
[js/web] allow unittest (onnxruntime_test_all) to run in browser (#14820)
### Description
allow onnxruntime_test_all to run in browser for WebAssembly build (use
flag `--wasm_run_tests_in_browser`).

To output the logs from stdout correctly, this test needs to be build
with `--enable_wasm_threads`.
2023-02-24 16:45:33 -08:00
Yulong Wang
a631ed77c0
[js/web] support flag 'optimizedModelFilePath' in session options (#14355)
### Description
* Support flag 'optimizedModelFilePath' in session options.

In Node.js, the model will be saved into filesystem just like its
behaviour on native platforms.

In browser, the new model is not saved to filesystem. the file path is
ignored. Instead, a new pop-up window will be launched in browser and
user can 'save' the file as onnx model.

* Add corresponding commandline args for the following session option
flags:
    - optimizedModelFilePath
    - graphOptimizationLevel
2023-02-24 15:50:15 -08:00
Jian Chen
62ee0c8110
Migrating ORT Extensions from Git submodule to cmake FetchContent (#14298)
### Description
<!-- Describe your changes. -->

Merging extensions from Git submodule to cmake FetchContent


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Jian Chen <jchen351@MacBook-Pro.local>
2023-02-22 19:42:36 -08:00
Erick Muñoz
8372c86e7f
[oneDNN] Update to oneDNN v3.0 (#14267)
### Description
Update oneDNN version from 2.7 to 3.0



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-02-17 09:56:29 -08:00
PeixuanZuo
0f9d2432d2
[ROCm] Add WarpWise Softmax into SoftmaxTunableOp (#14612)
1. Add Softmax warpwise_forward into SoftmaxTunableOp.
2. Set Softmax op use tunableOp as optional and use original
implementation by default.
3. There are some other operators use `dispatch_warpwise_softmax_forward
/dispatch_warpwise_softmax_forward/ SoftMaxComputeHelper ` directly. But
they only have files under cuda directory, adding `RocmTuningContext `
for these files requires copying and modifying hipified files. Now only
set RocmTuningContext as nullptr by default and not hipified other
operators.
Related PR: https://github.com/microsoft/onnxruntime/pull/14541

---------

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-02-16 11:26:08 +08:00
Chen Fu
733ca85b73
Cfu fp16 (#14538)
### Description
FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel,
and ARM64 NEON kernel.


### Motivation and Context
First step in creating native support of fp16 model inferencing on ARM64
and AMD64 platforms.

---------

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-15 12:51:53 -08:00
PeixuanZuo
326cf2f5e9
[ROCm] add Softmax Tunable Op (#14541)
### Description
Add Softmax Tunable Op, only include blockwise vec implementation and
composable kernel.
Related PR: https://github.com/microsoft/onnxruntime/pull/14475,
https://github.com/microsoft/onnxruntime/pull/14612

---------

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-02-13 15:56:50 +08:00
cloudhan
9bd022b8be
Add TuningContext for TunableOp (#14557)
This makes the the TunableOp tuning results state free and will allow us to
dump and load offline tuning results.
2023-02-10 14:27:43 +08:00
Kevin Chen
0a6b22018f
Move TRT include_directories to outside scope (#14622)
Signed-off-by: Kevin Chen <kevinch@nvidia.com>

### Description
Previously `include_directories(${TENSORRT_INCLUDE_DIR})` was only done
if `onnxruntime_USE_TENSORRT_BUILTIN_PARSER` was false. This would cause
a build failure when the switch was true as the include directory was
not added.

### Motivation and Context
Fixes TRT build when `onnxruntime_USE_TENSORRT_BUILTIN_PARSER` is true.

---------

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
2023-02-08 10:19:55 -08:00
Valery Chernov
ba8a00f62f
[TVM EP] Support zero copying TVM EP output tensor to ONNX Runtime output tensor (#12593)
**Description**:
Support new feature of TVM Virtual Machine (method `set_outputs`) on TVM
Execution Provider side. It allows to avoid excess copying from TVM EP
output tensor to ONNX Runtime one

**Motivation and Context**
Tests with multiple output topologies and big output tensors shows that
there is overheads spent on copying from TVM EP to ONNX Runtime.
Returning output(s) on preallocated memory for VirtualMachine was
implemented on TVM side.

**Details**
`set_output_zero_copy` provider option for TVM EP switches on/off this
feature. It is true by default.
The feature works for both GraphExecutor and VirtualMachine from TVM.

---------

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
2023-02-08 10:02:20 -08:00
Hector Li
cd7098fdf4
fix snpe build (#14616)
### Description
Fix SNPE build issue caused by cmake dependency refactor

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
fix issue: https://github.com/microsoft/onnxruntime/pull/14547
2023-02-07 15:33:05 -08:00
Tang, Cheng
8f34c8c8ed
Introduce collective ops to ort inference build (#14399)
### Description
Introduce collective ops into onnxruntime inference build, including
1) AllReduce and AllGather schema in contrib op, controlled by USE_MPI
flag
2) AllReduce and AllGather kernel in cuda EP, controlled by ORT_USE_NCCL
flag


### Motivation and Context
Enable the collective ops in onnxruntime inference build so we have the
ability to run distributed inference with multiple GPUs.
The original ncclAllReduce ops in training build require quite complex
configurations, which is not suitable for inference case, and it already
broken. so we introduce a new implementation.

---------

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-02-07 13:47:48 -08:00
Ye Wang
b539c364ee
Some kernel changes for TULR (#14517)
### Description
<!-- Describe your changes. -->
1. fix a bug in relative position bias kernel where seq_len > 32
2. rename extra_add_qk to relative_position_bias
3. support relative_position_bias in multihead attention (B, N, S, S*)
4. gru_gate support by Lei


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
2023-02-07 11:51:06 -08:00
RandySheriffH
b6bec54341
Revert mimalloc from v2.0.9 to v2.0.3 (#14603)
Revert mimalloc from v2.0.9 to v2.0.3 to silence build error in
[post-merge
](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=273075&view=logs&j=f019f681-ae8f-5ee4-d119-02530df66a84&t=6c90c65c-2ab2-56af-633f-b5631256a8e1&l=351)
pipeline.
New dependency version was generated
[here](https://aiinfra.visualstudio.com/Lotus/_artifacts/feed/Lotus/UPack/onnxruntime_build_dependencies/overview/1.0.29).

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: rui-ren <ruiren1225@gmail.com>
2023-02-07 09:58:25 -08:00
Yufeng Li
8de885fdb1
reduce cuda library binary size (#14555)
### Description
Reduce the cuda library size by:
1. refactoring beam_search_top_k to reduce template instantiation. It
saves ~56MB
2. opt out TopK for type uint*, int8_t and int16_t. It saves ~50MB.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-02-07 09:03:14 -08:00
Tianlei Wu
742658d171
Stable Diffusion CUDA optimizations Part 2 (#14597)
### Description
This is a follow-up of
https://github.com/microsoft/onnxruntime/pull/14428 for Stable Diffusion
CUDA optimizations:
(1) use NchwConv to replace Conv in onnx graph and add Tranpose nodes
accordingly
(2) reduce sequential Transpose nodes to at most one.
(3) symbolic shape infer of NchwConv
(4) fix add bias transpose which causes CUDA error (launching more than
1024 threads per block) in inferencing fp32 model.
(5) add models (bert, bart, stable_diffusion subdirectories) to package;
(6) remove option --disable_channels_last

Note that 
(1) We can add a few graph transformations to reduce Transpose nodes
further. It is not done in this PR due to time limit.
(2) Stable diffusion 2.1 model outputs black images. It seems that
forcing Attention to float32 could avoid the issue. However it is much
slow to use float32 Attention.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-02-07 07:49:15 -08:00
ytaous
d632f9a3fa
[ROCm] Enable Sampling Op UT on AMD (#14581)
Making basic porting effort to run Sampling UT on ROCm ep, based on the
commits:

https://github.com/microsoft/onnxruntime/pull/13426
https://github.com/microsoft/onnxruntime/pull/14218

1. enabling EmbedLayerNorm op
2. enabling Sampling op
3. enabling helpers to copy data from CPU->GPU for subgraph

This task is the first checkpoint. There could be other missing ops when
testing a real model.
We will migrate more code onto ROCm as needed.

Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-02-06 20:52:06 -08:00
Ted Themistokleous
c1a0fc55e7
[ROCm][MIGraphX EP]Add back in support for gfx1030 (#14565)
Adds back in proper build support for the Navi gen cards (gfx1030) 

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2023-02-04 11:35:45 +08:00
pengwa
7eca42484c
link mpi when either use_mpi or use_nccl enabled (#14467)
### Only link mpi when either use_mpi or use_nccl enabled

To fix the issue https://github.com/microsoft/onnxruntime/issues/14278. 

Talked with @askhade, we think if users want to enable NCCL/MPi but MPI
is not found, it should be failure instead of warning.
So this PR made the change. As a result, to make CIs pass, we need
disable NCCL/MPI explicitly in the build command. This PR take an
alternative approach, e.g. since NCCL and MPi are not used for
customers, disable NCCL by default if "--disable_nccl" not specified,
disable MPI by default if "--use_mpi" not specified.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-02-03 20:11:50 +08:00
Tianlei Wu
a6c5ba0185
Stable Diffusion CUDA Optimizations (#14428)
### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of
TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv
operator uses NCHW format).
(3) Update MultiHeadAttention (packed kv and no bias) for cross
attention. This could avoid transpose of kv for TRT fused cross
attention kernel.
(4) Optimization and benchmark script

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not be
applicable to any input size or dimensions. For example, BiasSplitGelu
requires hidden size to be 2560 | 5120 | 10240, and NhwcConv assumes 4D
input/weight.

There is minor increasement of binary size. For SM=75 only, python
package wheel size adds (33757K - 33640K) = 117 KB. It is possible to
move NHWC from template parameter to constructor to reduce binary size
(with slight cost of performance).

Note: for RTX 4090/4080/4070 Ti, need build with CUDA 11.8 and latest
cuDNN to get best performance.
2023-02-02 23:43:51 -08:00
PeixuanZuo
1059cf6d98
[ROCm] Fix ROCm build issue caused by REMOVE_ITEM incorrect path (#14534)
### Description
Fix not working REMOVE_ITEM.

`onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc` is hipyfied from
`onnxruntime/contrib_ops/cuda/aten_ops/aten_op.cc`.
The file correct path is
`${CMAKE_CURRENT_BINARY_DIR}/amdgpu/onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc`
and it exists in hipyfied source files list
`onnxruntime_rocm_generated_contrib_ops_cc_srcs`.

A better way to fix it: If we don't want to build a file. Add it into
hipify excluded files and will not hipify it.
2023-02-03 13:34:59 +08:00
RandySheriffH
01cafe89f0
Specify deps in deps.txt and manifest (#14530)
Specify new deps and update cgmanifest.json.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-02-02 09:44:57 -08:00
pengwa
62442c3d27
Enable multiple step run for adamw tests (on device training) (#14520)
(cherry picked from commit 414b73a02123b672e496326664cd2dc3bd6c6d24)

### Rework for PR https://github.com/microsoft/onnxruntime/pull/14068:
Enable multiple step run for adamw tests (on device training)
### Removed duplicated MACRO checks for training.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-02-02 18:40:30 +08:00
Erick Muñoz
d1533c27eb
[oneDNN] Improved thread handling (#13618)
* Added the OrtDnnlProviderOptions structure to expose configuration
options to the user

* The number of threads can be defined by the user with the -i flag on
the perftest

* Number of threads can also be configured via the OMP_NUM_THREADS
environment variable

* The number of threads defined in the OrtDnnlProviderOptions is
prioritized over the environment variable

### Description
Avoids thread oversubscription caused by OpenMP allocating the maximum
number of threads possible for oneDNN EP. Added support for the
OrtDnnlProviderOptions, this will allow for more EP customization
capabilities, and allows for user defined number of threads.



### Motivation and Context
- Improves performances and allows for user to fine tune the number of
threads
2023-01-31 14:37:13 -08:00
Yi Zhang
80f807c03d
upgrade protobuf to 3.20.2 and onnx to 1.13 (#14279)
### Description
upgrade protobuf to 3.20.2, same as onnx 1.13.0

### Motivation and Context
Per component governance requirement and Fixes #14060

unused-parameter error occurs in 2 conditions.
1. compile protolbuf

`onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66:
error: unused parameter ‘prototype’ [-Werror=unused-parameter]`
2. include onnx_pb.h
```
2023-01-28T10:20:15.0410853Z FAILED: CMakeFiles/onnxruntime_pybind11_state.dir/onnxruntime_src/onnxruntime/python/onnxruntime_pybind_iobinding.cc.o 
......
2023-01-28T10:20:15.0466024Z                  from /build/Debug/_deps/onnx-src/onnx/onnx_pb.h:51,
2023-01-28T10:20:15.0466958Z                  from /onnxruntime_src/include/onnxruntime/core/framework/to_tensor_proto_element_type.h:10,
....
2023-01-28T10:20:15.0609678Z /build/Debug/_deps/onnx-build/onnx/onnx-operators-ml.pb.h:1178:25:   required from here
2023-01-28T10:20:15.0610895Z /onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66: error: unused parameter ‘prototype’ [-Werror=unused-parameter]
2023-01-28T10:20:15.0611707Z cc1plus: all warnings being treated as errors

```

https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/874605/logs/22
2023-01-31 12:55:09 -08:00
pengwa
e2dd1315c7
Fix build for --enable_language_interop_ops + DISABLE_ABSEIL=ON (#14469)
### Fix build error on Windows when building with "
--enable_language_interop_ops -cmake_extra_defines
onnxruntime_DISABLE_ABSEIL=ON"

This is a subsequent fix after
https://github.com/microsoft/onnxruntime/pull/14309, which fixed build
for onnxruntime_DISABLE_ABSEIL=ON build.

Going furthur, if we enable --enable_language_interop_ops, there are
following two errors:

```
 test_symm_qgemm.cpp
  test_transpose.cpp
onnxruntime_session.lib(inference_session.obj) : error LNK2019: unresolved external symbol "void __cdecl onnxruntime::L
oadInterOp(class std::basic_string<wchar_t,struct std::char_traits<wchar_t>,class std::allocator<wchar_t> > const &,cla
ss std::vector<struct Ort::CustomOpDomain,class std::allocator<struct Ort::CustomOpDomain> > &,class std::function<void
 __cdecl(char const *)> const &)" (?LoadInterOp@onnxruntime@@YAXAEBV?$basic_string@_WU?$char_traits@_W@std@@V?$allocato
r@_W@2@@std@@AEAV?$vector@UCustomOpDomain@Ort@@V?$allocator@UCustomOpDomain@Ort@@@std@@@3@AEBV?$function@$$A6AXPEBD@Z@3
@@Z) referenced in function "public: __cdecl <lambda_f3a907e0b0a0e11d80d305605215cce8>::operator()(class std::shared_pt
r<class onnxruntime::Model> &)const " (??R<lambda_f3a907e0b0a0e11d80d305605215cce8>@@QEBA@AEAV?$shared_ptr@VModel@onnxr
untime@@@std@@@Z) [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.vcxproj]
onnxruntime_session.lib(inference_session.obj) : error LNK2019: unresolved external symbol "void __cdecl onnxruntime::L
oadInterOp(class onnx::ModelProto const &,class std::vector<struct Ort::CustomOpDomain,class std::allocator<struct Ort:
:CustomOpDomain> > &,class std::function<void __cdecl(char const *)> const &)" (?LoadInterOp@onnxruntime@@YAXAEBVModelP
roto@onnx@@AEAV?$vector@UCustomOpDomain@Ort@@V?$allocator@UCustomOpDomain@Ort@@@std@@@std@@AEBV?$function@$$A6AXPEBD@Z@
5@@Z) referenced in function "public: __cdecl <lambda_340b7b787b9c0f81848d348e60fe6c91>::operator()(class std::shared_p
tr<class onnxruntime::Model> &)const " (??R<lambda_340b7b787b9c0f81848d348e60fe6c91>@@QEBA@AEAV?$shared_ptr@VModel@onnx
runtime@@@std@@@Z) [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.vcxproj]
C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxruntime_test_trainer.exe : fatal error
LNK1120: 2 unresolved externals [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.
vcxproj]
  onnxruntime.vcxproj -> C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxruntime.dll
  onnxruntime_test_utils.vcxproj -> C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxrun
  time_test_utils.lib
CUDACOMPILE : nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may
 be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). [C:\Users\pengwa\dev\onnxruntime
\build\Windows\RelWithDebInfo\custom_op_library.vcxproj]
  cuda_ops.cu
CUDACOMPILE : nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may
 be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). [C:\Users\pengwa\dev\onnxruntime
\build\Windows\RelWithDebInfo\onnxruntime_test_cuda_ops_lib.vcxproj]
```



```
  kernel_type_str_resolver_utils_test.cc
  local_kernel_registry_test.cc
C:\Users\pengwa\dev\onnxruntime\onnxruntime\test\framework\allocation_planner_test.cc(1388,9): error C2220: the followin
g warning is treated as an error [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_all.vcxp
roj]
C:\Users\pengwa\dev\onnxruntime\onnxruntime\test\framework\allocation_planner_test.cc(1388,9): warning C4067: unexpected
 tokens following preprocessor directive - expected a newline [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebI
nfo\onnxruntime_test_all.vcxproj]
```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-01-31 12:34:45 +08:00
Sumit Agarwal
edb377f2cb
[DML EP] Upgrade DML to 1.10.1 (#14433)
### Description
Updated DirectML version to 1.10.1
(https://www.nuget.org/packages/Microsoft.AI.DirectML/1.10.1)



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-01-25 21:07:10 -08:00
Tianlei Wu
94b1791974
Upgrade CUTLASS to v2.11 and add sequence length threshold for cutlass FMHA (#14401)
### Description
Add sequence length threshold for triggering cutlass FMHA in FP32. See
performance test results in
https://github.com/microsoft/onnxruntime/pull/14343 to see how this
threshold is selected.

Upgrade cutlass to v2.11 and update deps.txt and cgmanifest for nuget
pipeline build (test build:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=268574&view=results)
2023-01-25 09:43:48 -08:00
Hector Li
f03c507cf0
Fix fuzz test (#14385)
Fix fuzz test
2023-01-22 22:17:43 -08:00
Tianlei Wu
414b012f42
Add memory efficient attention from CUTLASS (#14343)
### Description
Add memory efficient attention from CUTLASS.

TODO (in next pull request): 
(1) Need performance tests on different GPUs, then add a sequence length
threshold (only activate it for long sequence length).
(2) Merge changes from https://github.com/NVIDIA/cutlass/pull/773 when
it is in cutlass master.
2023-01-20 12:33:01 -08:00
Ashwini Khade
ea7bbd667d
fix headers for training apis (#14350)
### Description
Minor refactor PR for fixing header placement for training apis
2023-01-19 10:26:53 -08:00
Adrian Lizarraga
de17d53c50
Custom Op runtime wrapper (#13427)
### Description

Adds the below C APIs to support custom ops that wrap an entire model to
be inferenced with an external runtime. The current SNPE EP is an
example of an EP that could be ported to use a custom op wrapper. Ex:
The custom op stores the serialized SNPE DLC binary as a string
attribute. The SNPE model is built when the kernel is created. The model
is inferenced with SNPE APIs on call to the kernel's compute method.

#### C APIs
| API | Description | Why |
| ---            | ---        | ---  |
| `KernelInfo_GetInputCount` | Gets number of inputs from
`OrtKernelInfo`. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfo_GetOutputCount` | Gets number of outputs from
`OrtKernelInfo`. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfo_GetInputName` | Gets an input's name. | Query I/O
characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputName` | Gets an output's name. | Query I/O
characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputTypeInfo` | Gets the type/shape information for an
input. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputTypeInfo` | Gets the type/shape information for
an output. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfoGetAttribute_tensor` | Get a OrtValue tensor stored as an
attribute in the graph node | Extract serialized models, weights, etc. |
| `GetSessionConfigEntry` | Get a session configuration value | Need to
be able to get session-time configurations from within custom op |
| `HasSessionConfigEntry` | Check if session configuration entry exists.
| Need to be able to get session-time configurations from within custom
op |

#### Why so many KernelInfo APIs?<sup>1</sup>
Similar APIs currently exist for `OrtKernelContext`, but not
`OrtKernelInfo`. Note that `OrtKernelContext` is passed to the custom op
on call to its kernel's compute() function. However, `OrtKernelInfo` is
available on kernel creation, which occurs when the session is created.
Having these APIs available from `OrtKernelInfo` allows an operator to
trade-off computation time for session-creation time, and vice versa.
Operators that must build expensive state may prefer to do it during
session creation time instead of compute-time.

SNPE is an example of an EP that needs to be able to query `KernelInfo`
for the name, type, and shape of inputs and outputs in order to build
the model from the serialized DLC data. This is an expensive operation.
Other providers (e.g., OpenVINO) are able to query i/o info from the
serialized model, so they do not strictly need these APIs. However, the
APIs can still be used to validate the expected I/O characteristics.

Additionally, several of our CPU contrib ops currently use the same
internal version of these KernelInfo APIs (Ex:
[qlinear_softmax](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc#L71)).
If custom ops are also meant to be a test bed for future ops, then all
custom ops (not just runtime wrappers) would benefit from the addition
of these public KernelInfo APIs (IMO).

#### Example of usage in a custom OP
From
`onnxruntime/test/testdata/custom_op_openvino_wrapper_library/openvino_wrapper.h`

```c++
struct CustomOpOpenVINO : Ort::CustomOpBase<CustomOpOpenVINO, KernelOpenVINO> {
  explicit CustomOpOpenVINO(Ort::ConstSessionOptions session_options);

  CustomOpOpenVINO(const CustomOpOpenVINO&) = delete;
  CustomOpOpenVINO& operator=(const CustomOpOpenVINO&) = delete;

  void* CreateKernel(const OrtApi& api, const OrtKernelInfo* info) const;

  constexpr const char* GetName() const noexcept {
    return "OpenVINO_Wrapper";
  }

  constexpr const char* GetExecutionProviderType() const noexcept {
    return "CPUExecutionProvider";
  }

  // IMPORTANT: In order to wrap a generic runtime-specific model, the custom operator
  // must have a non-homogeneous variadic input and output.

  constexpr size_t GetInputTypeCount() const noexcept {
    return 1;
  }

  constexpr size_t GetOutputTypeCount() const noexcept {
    return 1;
  }

  constexpr ONNXTensorElementDataType GetInputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr ONNXTensorElementDataType GetOutputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetInputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetOutputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr bool GetVariadicInputHomogeneity() const noexcept {
    return false;  // heterogenous
  }

  constexpr bool GetVariadicOutputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  std::vector<std::string> GetSessionConfigKeys() const { return {"device_type"}; }

 private:
  std::unordered_map<std::string, std::string> session_configs_;
};
```

#### How to create a session:
```c++
Ort::Env env;
Ort::SessionOptions session_opts;
Ort::CustomOpConfigs custom_op_configs;

// Create local session config entries for the custom op.
custom_op_configs.AddConfig("OpenVINO_Wrapper", "device_type", "CPU");

// Register custom op library and pass in the custom op configs (optional).
session_opts.RegisterCustomOpsLibrary(lib_name, custom_op_configs);

Ort::Session session(env, model_path.data(), session_opts);
```
### Motivation and Context
Allows creation of simple "wrapper" EPs outside of the main ORT code
base.
2023-01-18 09:09:32 -08:00
Guenther Schmuelling
60290393f3
enable ort-extensions in wasm release builds (#14239)
enable ort-extensions in wasm release builds. sentence piece, gpt2, bert
and word piece tokenizers for now.

wasm size will grow from 8.4MB to 8.9MB.
2023-01-17 12:39:13 -08:00
Scott McKay
7f374f4012
Fix build error on Windows if Python debug libraries are installed (#14308)
### Description
<!-- Describe your changes. -->
If a user installs the debug libraries from Python on Windows the ORT
python project file attempts to use the debug python lib, which
conflicts with a pragma in pyconfig.h that wants the release lib (due to
pybind11 undefining _DEBUG).

Explicitly use the release lib instead of Python::Module so the build
doesn't break.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix obtuse build break.
2023-01-17 09:48:26 +10:00
Jeff Daily
fe052e603b
ROCm header path updates (#14170)
ROCm reorganized header file locations. Use the new locations to avoid
warnings.
2023-01-16 10:28:13 +08:00
Patrice Vignola
99a4036c80
[DML EP] Add FusedMatMul (#14196)
### Description
Add FusedMatMul



### Motivation and Context
- Add the FusedMatMul fusion for DML
- Fix the FusedMatMul logic and tests when transposed batches are
involved
2023-01-12 02:17:04 -08:00