### Description
Upgrade protobuf to 3.20.2, the same version used by onnx 1.13.0.
### Motivation and Context
Required by component governance. Fixes #14060.
An unused-parameter error occurs in two situations.
1. compile protobuf
`onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66:
error: unused parameter ‘prototype’ [-Werror=unused-parameter]`
2. include onnx_pb.h
```
2023-01-28T10:20:15.0410853Z FAILED: CMakeFiles/onnxruntime_pybind11_state.dir/onnxruntime_src/onnxruntime/python/onnxruntime_pybind_iobinding.cc.o
......
2023-01-28T10:20:15.0466024Z from /build/Debug/_deps/onnx-src/onnx/onnx_pb.h:51,
2023-01-28T10:20:15.0466958Z from /onnxruntime_src/include/onnxruntime/core/framework/to_tensor_proto_element_type.h:10,
....
2023-01-28T10:20:15.0609678Z /build/Debug/_deps/onnx-build/onnx/onnx-operators-ml.pb.h:1178:25: required from here
2023-01-28T10:20:15.0610895Z /onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66: error: unused parameter ‘prototype’ [-Werror=unused-parameter]
2023-01-28T10:20:15.0611707Z cc1plus: all warnings being treated as errors
```
https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/874605/logs/22
### Fix build error on Windows when building with "--enable_language_interop_ops --cmake_extra_defines onnxruntime_DISABLE_ABSEIL=ON"
This is a follow-up to
https://github.com/microsoft/onnxruntime/pull/14309, which fixed the
build with onnxruntime_DISABLE_ABSEIL=ON.
Going further, enabling --enable_language_interop_ops produces the
following two errors:
```
test_symm_qgemm.cpp
test_transpose.cpp
onnxruntime_session.lib(inference_session.obj) : error LNK2019: unresolved external symbol "void __cdecl onnxruntime::L
oadInterOp(class std::basic_string<wchar_t,struct std::char_traits<wchar_t>,class std::allocator<wchar_t> > const &,cla
ss std::vector<struct Ort::CustomOpDomain,class std::allocator<struct Ort::CustomOpDomain> > &,class std::function<void
__cdecl(char const *)> const &)" (?LoadInterOp@onnxruntime@@YAXAEBV?$basic_string@_WU?$char_traits@_W@std@@V?$allocato
r@_W@2@@std@@AEAV?$vector@UCustomOpDomain@Ort@@V?$allocator@UCustomOpDomain@Ort@@@std@@@3@AEBV?$function@$$A6AXPEBD@Z@3
@@Z) referenced in function "public: __cdecl <lambda_f3a907e0b0a0e11d80d305605215cce8>::operator()(class std::shared_pt
r<class onnxruntime::Model> &)const " (??R<lambda_f3a907e0b0a0e11d80d305605215cce8>@@QEBA@AEAV?$shared_ptr@VModel@onnxr
untime@@@std@@@Z) [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.vcxproj]
onnxruntime_session.lib(inference_session.obj) : error LNK2019: unresolved external symbol "void __cdecl onnxruntime::L
oadInterOp(class onnx::ModelProto const &,class std::vector<struct Ort::CustomOpDomain,class std::allocator<struct Ort:
:CustomOpDomain> > &,class std::function<void __cdecl(char const *)> const &)" (?LoadInterOp@onnxruntime@@YAXAEBVModelP
roto@onnx@@AEAV?$vector@UCustomOpDomain@Ort@@V?$allocator@UCustomOpDomain@Ort@@@std@@@std@@AEBV?$function@$$A6AXPEBD@Z@
5@@Z) referenced in function "public: __cdecl <lambda_340b7b787b9c0f81848d348e60fe6c91>::operator()(class std::shared_p
tr<class onnxruntime::Model> &)const " (??R<lambda_340b7b787b9c0f81848d348e60fe6c91>@@QEBA@AEAV?$shared_ptr@VModel@onnx
runtime@@@std@@@Z) [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.vcxproj]
C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxruntime_test_trainer.exe : fatal error
LNK1120: 2 unresolved externals [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.
vcxproj]
onnxruntime.vcxproj -> C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxruntime.dll
onnxruntime_test_utils.vcxproj -> C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxrun
time_test_utils.lib
CUDACOMPILE : nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may
be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). [C:\Users\pengwa\dev\onnxruntime
\build\Windows\RelWithDebInfo\custom_op_library.vcxproj]
cuda_ops.cu
CUDACOMPILE : nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may
be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). [C:\Users\pengwa\dev\onnxruntime
\build\Windows\RelWithDebInfo\onnxruntime_test_cuda_ops_lib.vcxproj]
```
```
kernel_type_str_resolver_utils_test.cc
local_kernel_registry_test.cc
C:\Users\pengwa\dev\onnxruntime\onnxruntime\test\framework\allocation_planner_test.cc(1388,9): error C2220: the followin
g warning is treated as an error [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_all.vcxp
roj]
C:\Users\pengwa\dev\onnxruntime\onnxruntime\test\framework\allocation_planner_test.cc(1388,9): warning C4067: unexpected
tokens following preprocessor directive - expected a newline [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebI
nfo\onnxruntime_test_all.vcxproj]
```
### Motivation and Context
Fix warnings C6011, C6385, and C6386 reported by Visual Studio. The fix caps the
maximum number of options for every EP at 128. To my knowledge, 128 is
large enough to support all EPs.
Supporting an arbitrary number of EP options would probably require #13999 and
a "std::vector"-like struct in C.
PyTorch skipped version 1.14 and jumped to 2.0, while the image for the
onnxruntime-CI-nightly-ort-pipeline is still using
nightly-ubuntu2004-cu116-py38-torch1140dev. Switch to the new torch
version image to fix the failure of the pipeline.
A tool that converts an ONNX model to tfevents so that it can be opened in
TensorBoard for visualization. This is especially useful for debugging
when the ONNX model is too large to open in Netron.
usage: onnx2tfevents.py [-h] [--logdir LOGDIR] [--model MODEL]
### Description
As a more generic solution to #13660, always set the OpSchema in
CreateNodeHelper() so that nodes added by transformers have their
OpSchema set.
### Motivation and Context
The CreateEncoderInputs functor was passed to the constructor as nullptr when
the type is MLFloat16.
### Description
### Motivation and Context
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
To support LpPool (18)
### Motivation and Context
for Ort 1.14 release
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
Updated DirectML version to 1.10.1
(https://www.nuget.org/packages/Microsoft.AI.DirectML/1.10.1)
### Motivation and Context
Update Android package custom build script.
- Use later version of various dependencies (CMake, JDK, Android command line tools, Android NDK, Ubuntu). The CMake version was too old for the current ORT code.
- Do in-container build in a directory that is not shared with the host. Resolves some file permission issues and speeds up file access.
Add a nightly build to make sure the script works with the latest ORT.
### Description
Fixes unused `use_memory_efficient_attention` variable in
contrib_ops/cuda/bert/attention_impl.cu.
### Motivation and Context
ORT with CUDA version < 11.6 fails to build for release configurations
due to an unused variable.
```shell
c:\...\onnxruntime\onnxruntime\contrib_ops\cuda\bert\attention_impl.cu(420): error : variable "use_memory_efficient_attention" was declared but never referenced [C:\...\onnxruntime\build\Windows\RelWithDebInfo\onnx
runtime_providers_cuda.vcxproj]
detected during instantiation of "onnxruntime::common::Status onnxruntime::contrib::cuda::QkvToContext(const cudaDeviceProp &, cublasHandle_t &, cudaStream_t, onnxruntime::contrib::AttentionParameters &, onnxruntime::contrib::cuda::AttentionData<T> &) [wit
h T=float]"
(923): here
```
This happens for CUDA < 11.6. Our cmake script turns off
onnxruntime_USE_FLASH_ATTENTION for CUDA < 11.6, which leaves the
aforementioned variable unused outside of asserts (which are removed in
release builds).
The USE_FLASH_ATTENTION option was added by
https://github.com/microsoft/onnxruntime/pull/14343
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
Add a `-t` option for `onnx_test_runner` to allow users to specify
custom tolerance values when running ONNX models.
### Motivation and Context
For some backends, the default tolerance of 1e-5 is too tight to pass
accuracy checks with ONNX model zoo reference values, especially if only
one or two values are mismatched. Having a custom option will allow
different backends to specify their own custom tolerance when running
these models.
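As a rough illustration of why a configurable tolerance helps, the sketch below compares backend outputs against reference values using a combined absolute and relative tolerance. The helper name `outputs_match`, its parameters, and the tolerance formula are assumptions made for illustration; they are not taken from `onnx_test_runner`.

```python
def outputs_match(actual, expected, abs_tol=1e-5, rel_tol=1e-5):
    """Return True if every element is within abs_tol + rel_tol * |expected|.

    Hypothetical helper; the actual check in onnx_test_runner may differ.
    """
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= abs_tol + rel_tol * abs(e)
        for a, e in zip(actual, expected)
    )


reference = [1.0, 2.0, 3.0]
backend_output = [1.0, 2.0, 3.0004]  # a single slightly mismatched value

# With the tight default the run fails; a looser tolerance lets it pass.
strict = outputs_match(backend_output, reference)
relaxed = outputs_match(backend_output, reference, abs_tol=1e-3, rel_tol=1e-3)
```

With the default values a single element that is off by 4e-4 fails the whole comparison, which is the situation a per-backend tolerance is meant to address.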
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
Introduce a cache_dir CLI option for graph serialisation.
Replace the existing use_compile_network and blob_dump_path CLI options for
OpenVINO with a single command line option, "cache_dir", which specifies the
path used for blob dump/load. This improves the developer
experience.
### Motivation and Context
Previously, two separate options were needed to set the cache directory, which was unnecessary.
Co-authored-by: Preetha <preetha.veeramalai@intel.com>
### Description
Remove exclusions for ONNX model tests that now pass due to kernels
being implemented.
Update ONNX update doc to point to correct location for tests.
### Motivation and Context
Run as many tests as possible.
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Add script to fuse nodes to optimized operators in stable diffusion 1.5
models, and a script to convert fp32 models to fp16 models. Tested with
stable diffusion 1.5.
Note that the optimized model needs onnxruntime-gpu v1.14 (release candidate
will be available soon).
Note: We will update the script to work with latest diffusers and stable
diffusion v2 and v2.1 models.
…ckaging_CPU_x86_default (#14332)"
This reverts commit a491f33f54.
### Description
### Motivation and Context
It looks like an ADO issue.
It has now recovered,
so it can be re-enabled.
### Description
### Motivation and Context
### Description
This PR adds PyTorch 2.0 as an option when running the ORT transformer
benchmarking script.
### Motivation and Context
PyTorch released [PyTorch
2.0](https://pytorch.org/get-started/pytorch-2.0/) in the nightly
binaries and a stable release of PyTorch 2.0 is expected in March 2023.
### Description
Add memory efficient attention from CUTLASS.
TODO (in next pull request):
(1) Need performance tests on different GPUs, then add a sequence length
threshold (only activate it for long sequence length).
(2) Merge changes from https://github.com/NVIDIA/cutlass/pull/773 when
it is in cutlass master.
### Description
Remove the unnecessary WaitOnEPStep when the current operator node and its
consumer are in the same stream, even if notifications are filed in
the current node.
### Motivation and Context
In the current code, WaitOnEPStep is always launched whenever a
notification is filed in the input node, regardless of whether the current node
and the input node are in the same stream, which is
unnecessary.
This PR removes the WaitOnEPStep in the same-stream case.
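The rule described above can be sketched as follows. This is a simplified model rather than the ORT execution-plan code: `Node`, `stream_id`, `has_notification`, and `needs_wait_step` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    stream_id: int
    has_notification: bool = False  # True if a notification is filed on this node


def needs_wait_step(producer: Node, consumer: Node) -> bool:
    """A wait is only required when the producer signals from a different stream.

    Work queued on the same stream is already ordered, so no wait is needed there.
    """
    return producer.has_notification and producer.stream_id != consumer.stream_id


producer = Node("MatMul", stream_id=0, has_notification=True)
same_stream_consumer = Node("Add", stream_id=0)
other_stream_consumer = Node("MemcpyToHost", stream_id=1)
```

In this model, `needs_wait_step(producer, same_stream_consumer)` is false even though a notification is filed, while the cross-stream consumer still waits.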
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
This PR extends OrtBackend so that an EP can be configured by
name. If no name is provided, it falls back to the existing mechanism,
which infers the EP from tensor affinity.
### Motivation and Context
Currently OrtBackend needs `get_ort_device()` with the device tag
inferred from torch.Tensor, but the ort device is not yet supported for
dort. This change allows running dort with a supported EP: dort is
configured with the desired EP, and the dort (ort InferenceSession) takes
CPU-affined PyTorch tensors as inputs and then injects data transfer nodes
internally.
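The selection order described here (explicit EP name first, inference from tensor affinity otherwise) might look roughly like the sketch below. The function `choose_ep`, the `DEVICE_TO_EP` table, and the use of the first input's device are assumptions for illustration and do not reflect the actual OrtBackend code; only the execution provider name strings are real ORT identifiers.

```python
from typing import Optional, Sequence

# Illustrative mapping from a device tag to an ORT execution provider name.
DEVICE_TO_EP = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
}


def choose_ep(ep_name: Optional[str], input_devices: Sequence[str]) -> str:
    """Prefer an explicitly configured EP; otherwise infer one from input devices."""
    if ep_name:
        return ep_name
    # Fallback (assumed policy): use the device of the first input, defaulting to CPU.
    device = input_devices[0] if input_devices else "cpu"
    return DEVICE_TO_EP.get(device, "CPUExecutionProvider")
```

For example, `choose_ep("CUDAExecutionProvider", ["cpu"])` keeps the configured EP even though the inputs are CPU tensors, which corresponds to the case where data transfer nodes would be injected.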
### Description
Remove intermediate obj files and re-enable the cache
### Motivation and Context
Recently, the training_debug_x64 pipeline has often failed because it ran out of
disk space.
Deleting the obj files frees nearly 8 GB,
so the compilation cache can be re-enabled.
### Description
### Motivation and Context
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Fix https://github.com/microsoft/onnxruntime/issues/14359
test\greedy_search_top_one.cc(21,44): warning C4244: '=': conversion from 'int32_t' to '_Ty', possible loss of data [C:\Users\11000978\onnxruntime\build\Windows\Debug\onnxruntime_providers_cuda.vcxproj]
### Motivation and Context
As the title says. The fuser in LORT doesn't like scalars. With a recent PyTorch
change, a scalar is introduced somewhere it was not present before. For now, a
simple fix is to check that all inputs are tensors, or one of a few specially
allowed cases, before sending ops to ORT.
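A minimal sketch of such an input check is shown below. The `Tensor` class is a stand-in so the example runs without PyTorch, and both `all_inputs_supported` and the allow-list containing only `None` are hypothetical; the description does not say which non-tensor cases are actually allowed.

```python
class Tensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""


# Hypothetical allow-list; the real set of permitted non-tensor inputs is not given.
ALLOWED_NON_TENSOR_TYPES = (type(None),)


def all_inputs_supported(args) -> bool:
    """Return True only if every argument is a tensor or an explicitly allowed type."""
    return all(
        isinstance(arg, Tensor) or isinstance(arg, ALLOWED_NON_TENSOR_TYPES)
        for arg in args
    )
```

An op whose arguments include a Python scalar, such as `[Tensor(), 1.5]`, is rejected by this check and would therefore not be sent to ORT.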