Update Android package custom build script.
- Use later versions of various dependencies (CMake, JDK, Android command line tools, Android NDK, Ubuntu). The CMake version was too old for the current ORT code.
- Do in-container build in a directory that is not shared with the host. Resolves some file permission issues and speeds up file access.
Add a nightly build to make sure the script works with the latest ORT.
### Description
Fixes unused `use_memory_efficient_attention` variable in
contrib_ops/cuda/bert/attention_impl.cu.
### Motivation and Context
ORT with CUDA version < 11.6 fails to build for release configurations
due to an unused variable.
```shell
c:\...\onnxruntime\onnxruntime\contrib_ops\cuda\bert\attention_impl.cu(420): error : variable "use_memory_efficient_attention" was declared but never referenced [C:\...\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_providers_cuda.vcxproj]
detected during instantiation of "onnxruntime::common::Status onnxruntime::contrib::cuda::QkvToContext(const cudaDeviceProp &, cublasHandle_t &, cudaStream_t, onnxruntime::contrib::AttentionParameters &, onnxruntime::contrib::cuda::AttentionData<T> &) [with T=float]"
(923): here
```
This happens for CUDA < 11.6. Our cmake script turns off
onnxruntime_USE_FLASH_ATTENTION for CUDA < 11.6, which leaves the
aforementioned variable unused outside of asserts (which are removed in
release builds).
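The failure mode described above can be reproduced in miniature. The sketch below is illustrative only: `USE_FLASH_ATTENTION_SKETCH`, `SelectAttentionPath`, and the 512 threshold are hypothetical stand-ins rather than ORT code, and `[[maybe_unused]]` (C++17) is one common remedy, not necessarily the one this change applies.

```cpp
#include <cassert>

// Hypothetical stand-in for the CMake-controlled switch; defaults to "off",
// mirroring what the build does for CUDA < 11.6.
#ifndef USE_FLASH_ATTENTION_SKETCH
#define USE_FLASH_ATTENTION_SKETCH 0
#endif

// Returns 1 if the memory-efficient path would be taken, 0 otherwise.
inline int SelectAttentionPath(int sequence_length) {
  // With the feature off, the only remaining reference to this variable is the
  // assert below, and that assert disappears when NDEBUG is defined.
  // [[maybe_unused]] keeps release builds free of the "never referenced" error.
  [[maybe_unused]] const bool use_memory_efficient_attention =
      (USE_FLASH_ATTENTION_SKETCH != 0) && sequence_length >= 512;  // arbitrary threshold

#if USE_FLASH_ATTENTION_SKETCH
  if (use_memory_efficient_attention) {
    return 1;
  }
#else
  assert(!use_memory_efficient_attention);
#endif
  return 0;
}
```

With the switch left at its default of 0, `SelectAttentionPath` returns 0 for any sequence length, and the build stays warning-free with or without `NDEBUG`.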
The USE_FLASH_ATTENTION option was added by
https://github.com/microsoft/onnxruntime/pull/14343
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
Add a `-t` option for `onnx_test_runner` to allow users to specify
custom tolerance values when running ONNX models.
### Motivation and Context
For some backends, the default tolerance of 1e-5 is too tight to pass
accuracy checks with ONNX model zoo reference values, especially if only
one or two values are mismatched. Having a custom option will allow
different backends to specify their own custom tolerance when running
these models.
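To illustrate why a configurable tolerance matters, here is a minimal sketch of a tolerance-based comparison. `AllClose` and its combined absolute/relative bound are assumptions made for illustration; the description above does not say how `onnx_test_runner` applies the `-t` value internally.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if every element satisfies
// |actual - expected| <= abs_tol + rel_tol * |expected|.
inline bool AllClose(const std::vector<float>& actual,
                     const std::vector<float>& expected,
                     double abs_tol, double rel_tol) {
  if (actual.size() != expected.size()) {
    return false;
  }
  for (std::size_t i = 0; i < actual.size(); ++i) {
    const double a = static_cast<double>(actual[i]);
    const double e = static_cast<double>(expected[i]);
    const double bound = abs_tol + rel_tol * std::fabs(e);
    // Written as !(diff <= bound) so that NaN values are rejected too.
    if (!(std::fabs(a - e) <= bound)) {
      return false;
    }
  }
  return true;
}
```

A single element that is off by about 5e-4 fails this check at 1e-5 but passes at 1e-3, which is the kind of case a per-backend tolerance is meant to cover.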
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
Introduce a `cache_dir` CLI option for graph serialisation.
Replace the existing `use_compile_network` and `blob_dump_path` CLI options for
OpenVINO with a single `cache_dir` option that specifies the path used for
blob dump and load, improving the developer experience.
### Motivation and Context
Previously, two separate options had to be set to configure the cache directory, which was unnecessary.
Co-authored-by: Preetha <preetha.veeramalai@intel.com>
### Description
Remove exclusions for ONNX model tests that now pass due to kernels
being implemented.
Update ONNX update doc to point to correct location for tests.
### Motivation and Context
Run as many tests as possible.
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Add script to fuse nodes to optimized operators in stable diffusion 1.5
models, and a script to convert fp32 models to fp16 models. Tested with
stable diffusion 1.5.
Note that the optimized model needs onnxruntime-gpu v1.14 (release candidate
will be available soon).
Note: We will update the script to work with latest diffusers and stable
diffusion v2 and v2.1 models.
…ckaging_CPU_x86_default (#14332)"
This reverts commit a491f33f54.
### Description
### Motivation and Context
This looks like it was an ADO issue.
It has now recovered,
so it can be re-enabled.
### Description
### Motivation and Context
### Description
This PR adds PyTorch 2.0 as an option when running the ORT transformer
benchmarking script.
### Motivation and Context
PyTorch released [PyTorch
2.0](https://pytorch.org/get-started/pytorch-2.0/) in the nightly
binaries and a stable release of PyTorch 2.0 is expected in March 2023.
### Description
Add memory efficient attention from CUTLASS.
TODO (in next pull request):
(1) Need performance tests on different GPUs, then add a sequence length
threshold (only activate it for long sequence length).
(2) Merge changes from https://github.com/NVIDIA/cutlass/pull/773 when
it is in cutlass master.
### Description
Remove the unnecessary WaitOnEPStep when the current operator node and its
consumer are in the same stream, even if notifications are filed in
the current node.
### Motivation and Context
In the current code, the WaitOnEPStep is always launched as long as
a notification is filed in the input node, regardless of whether the
current node and the input node are in the same stream, which is
unnecessary.
This PR is to remove the WaitOnEPStep for this case.
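The condition described above can be sketched as a small predicate. `EdgeSketch` and `NeedsWaitStep` are hypothetical names used only for illustration; they are not the types or functions used in ORT's execution plan.

```cpp
#include <cstddef>

// Hypothetical, simplified view of one producer/consumer edge.
struct EdgeSketch {
  std::size_t producer_stream;     // stream index of the node producing the input
  std::size_t consumer_stream;     // stream index of the current node
  bool producer_has_notification;  // a notification is filed on the producer
};

// A wait step is only needed when a notification exists AND the two nodes
// run on different streams; work queued on a single stream is already ordered.
inline bool NeedsWaitStep(const EdgeSketch& edge) {
  return edge.producer_has_notification &&
         edge.producer_stream != edge.consumer_stream;
}
```

Under this sketch, an edge whose nodes share a stream never triggers a wait, regardless of whether a notification was filed.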
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
This PR extends OrtBackend to allow configuring an EP by name, and
falls back to the existing mechanism, which infers the EP from
tensor affinity, if no name is provided.
### Motivation and Context
Currently OrtBackend needs `get_ort_device()` with the device tag
inferred from torch.Tensor, but the ort device is not yet supported for
dort. This change allows running dort with a supported EP by configuring
dort with the desired EP and letting dort (the ORT InferenceSession) take
CPU-affined PyTorch tensors as inputs, then inject data transfer nodes
internally.
### Description
Remove intermediate obj files and re-enable the cache.
### Motivation and Context
Recently, the training_debug_x64 pipeline has often failed due to
insufficient disk space.
Deleting the obj files frees nearly 8 GB,
so the compilation cache can be re-enabled.
### Description
### Motivation and Context
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Fix https://github.com/microsoft/onnxruntime/issues/14359
test\greedy_search_top_one.cc(21,44): warning C4244: '=': conversion from 'int32_t' to '_Ty', possible loss of data [C:\Users\11000978\onnxruntime\build\Windows\Debug\onnxruntime_providers_cuda.vcxproj]
### Motivation and Context
As the title says: the fuser in LORT doesn't accept scalars. After a recent
PyTorch change, a scalar now appears where there was none before. A
simple fix is to check that all inputs are tensors, or match certain specially
allowed cases, before sending ops to ORT.
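The check described above amounts to a guard over an op's inputs. The sketch below models it with hypothetical types: `TensorSketch`, `ArgSketch`, and `CanSendToOrt` are illustrative names, and a `double` stands in for a scalar argument. It does not reflect the actual LORT implementation.

```cpp
#include <algorithm>
#include <variant>
#include <vector>

struct TensorSketch {};  // hypothetical placeholder for a tensor argument

// An argument is either a tensor or a plain scalar (modelled as double).
using ArgSketch = std::variant<TensorSketch, double>;

// Returns true if the op may be handed to ORT: either every input is a tensor,
// or the op is one of the specially allowed cases that tolerate scalars.
inline bool CanSendToOrt(const std::vector<ArgSketch>& inputs,
                         bool op_allows_scalars) {
  if (op_allows_scalars) {
    return true;
  }
  return std::all_of(inputs.begin(), inputs.end(), [](const ArgSketch& arg) {
    return std::holds_alternative<TensorSketch>(arg);
  });
}
```

Ops that fail the guard would simply stay outside the fused region instead of being sent to ORT.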
### Description
These changes add support for specifying mask_filter_value as an
attribute of the attention op. However, the ORT optimizer cannot fuse
SkipLayerNorm/Attention/EmbedLayerNorm with the most recent
transformers, so this PR may only address the issue for some older
ONNX models (e.g., the one used in the unit test).
### Motivation and Context
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
- Fix debug node inputs outputs nullptr dereference with ONNX optional types.
- Fix model test memory leak.
- Convert jobs to stages in post-merge-jobs.yml to allow a subset of builds to be enabled when running manually.
- Fix buffer overrun in CumSum op exposed by Mimalloc build.
### Description
Disable the cache to save disk space for training_x64_debug.
### Motivation and Context
As a first step, mitigate the insufficient disk space in training_x64_debug.
Two modifications:
- After [TRT 8.5](https://github.com/microsoft/onnxruntime/pull/13867)
was merged, we could manually set a timeout and make the TRT EP run only a
small portion of the unit tests
(`onnxruntime_SKIP_AND_PERFORM_FILTERED_TENSORRT_TESTS=ON`), because the
additional TRT kernel overhead introduced by TRT 8.5 increases
test time considerably. This PR modifies the checking condition so that the
TensorRT CIs (which can enable the builder placeholder) still run most of the
unit tests.
- Exclude the TRT EP from the [Resize Opset 18](https://github.com/microsoft/onnxruntime/pull/13890)
unit tests, since TensorRT 8.5 supports operators only up to Opset 17.
### Description
Allows the PostAnalysis@2 task for windows CI jobs to continue even if
an error is encountered.
### Motivation and Context
This is a temporary workaround that enables the
`Windows_Packaging_CPU_x86_default` job within the Zip-Nuget-Java-NodeJS
packaging pipeline to finish. A recent update to dotnet 6 has broken the
PostAnalysis task for this job.
This task was originally added by
https://github.com/microsoft/onnxruntime/pull/13694
### Description
Adds the below C APIs to support custom ops that wrap an entire model to
be inferenced with an external runtime. The current SNPE EP is an
example of an EP that could be ported to use a custom op wrapper. Ex:
The custom op stores the serialized SNPE DLC binary as a string
attribute. The SNPE model is built when the kernel is created. The model
is inferenced with SNPE APIs on call to the kernel's compute method.
#### C APIs
| API | Description | Why |
| --- | --- | --- |
| `KernelInfo_GetInputCount` | Gets the number of inputs from `OrtKernelInfo`. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputCount` | Gets the number of outputs from `OrtKernelInfo`. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputName` | Gets an input's name. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputName` | Gets an output's name. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputTypeInfo` | Gets the type/shape information for an input. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputTypeInfo` | Gets the type/shape information for an output. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfoGetAttribute_tensor` | Gets an OrtValue tensor stored as an attribute in the graph node. | Extract serialized models, weights, etc. |
| `GetSessionConfigEntry` | Gets a session configuration value. | Need to be able to get session-time configurations from within a custom op |
| `HasSessionConfigEntry` | Checks if a session configuration entry exists. | Need to be able to get session-time configurations from within a custom op |
#### Why so many KernelInfo APIs?<sup>1</sup>
Similar APIs currently exist for `OrtKernelContext`, but not
`OrtKernelInfo`. Note that `OrtKernelContext` is passed to the custom op
on call to its kernel's compute() function. However, `OrtKernelInfo` is
available on kernel creation, which occurs when the session is created.
Having these APIs available from `OrtKernelInfo` allows an operator to
trade-off computation time for session-creation time, and vice versa.
Operators that must build expensive state may prefer to do it during
session creation time instead of compute-time.
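The trade-off described above can be sketched without any ORT dependency. In the example below, `KernelInfoSketch` and `WrapperKernelSketch` are hypothetical stand-ins (not the real `OrtKernelInfo` or a real custom op kernel); the point is only that state built in the constructor is paid for once at session creation and reused on every compute call.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the I/O metadata a kernel could query at creation time.
struct KernelInfoSketch {
  std::vector<std::string> input_names;
  std::vector<std::string> output_names;
};

class WrapperKernelSketch {
 public:
  // Runs once, when the session is created: the costly model build belongs here.
  explicit WrapperKernelSketch(const KernelInfoSketch& info)
      : input_count_(info.input_names.size()),
        output_count_(info.output_names.size()) {
    ++build_count_;  // stands in for an expensive model build
  }

  // Runs on every inference call and only reuses the prebuilt state.
  std::size_t Compute() const { return input_count_ + output_count_; }

  int build_count() const { return build_count_; }

 private:
  std::size_t input_count_;
  std::size_t output_count_;
  int build_count_ = 0;
};
```

Calling `Compute()` repeatedly never increments `build_count()`, which is the property an operator with expensive state would want.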
SNPE is an example of an EP that needs to be able to query `KernelInfo`
for the name, type, and shape of inputs and outputs in order to build
the model from the serialized DLC data. This is an expensive operation.
Other providers (e.g., OpenVINO) are able to query i/o info from the
serialized model, so they do not strictly need these APIs. However, the
APIs can still be used to validate the expected I/O characteristics.
Additionally, several of our CPU contrib ops currently use the same
internal version of these KernelInfo APIs (Ex:
[qlinear_softmax](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc#L71)).
If custom ops are also meant to be a test bed for future ops, then all
custom ops (not just runtime wrappers) would benefit from the addition
of these public KernelInfo APIs (IMO).
#### Example of usage in a custom OP
From
`onnxruntime/test/testdata/custom_op_openvino_wrapper_library/openvino_wrapper.h`
```c++
struct CustomOpOpenVINO : Ort::CustomOpBase<CustomOpOpenVINO, KernelOpenVINO> {
  explicit CustomOpOpenVINO(Ort::ConstSessionOptions session_options);

  CustomOpOpenVINO(const CustomOpOpenVINO&) = delete;
  CustomOpOpenVINO& operator=(const CustomOpOpenVINO&) = delete;

  void* CreateKernel(const OrtApi& api, const OrtKernelInfo* info) const;

  constexpr const char* GetName() const noexcept {
    return "OpenVINO_Wrapper";
  }

  constexpr const char* GetExecutionProviderType() const noexcept {
    return "CPUExecutionProvider";
  }

  // IMPORTANT: In order to wrap a generic runtime-specific model, the custom operator
  // must have a non-homogeneous variadic input and output.
  constexpr size_t GetInputTypeCount() const noexcept {
    return 1;
  }

  constexpr size_t GetOutputTypeCount() const noexcept {
    return 1;
  }

  constexpr ONNXTensorElementDataType GetInputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr ONNXTensorElementDataType GetOutputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetInputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetOutputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr bool GetVariadicInputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  constexpr bool GetVariadicOutputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  std::vector<std::string> GetSessionConfigKeys() const { return {"device_type"}; }

 private:
  std::unordered_map<std::string, std::string> session_configs_;
};
```
#### How to create a session:
```c++
Ort::Env env;
Ort::SessionOptions session_opts;
Ort::CustomOpConfigs custom_op_configs;
// Create local session config entries for the custom op.
custom_op_configs.AddConfig("OpenVINO_Wrapper", "device_type", "CPU");
// Register custom op library and pass in the custom op configs (optional).
session_opts.RegisterCustomOpsLibrary(lib_name, custom_op_configs);
Ort::Session session(env, model_path.data(), session_opts);
```
### Motivation and Context
Allows creation of simple "wrapper" EPs outside of the main ORT code
base.
### Description
Fix a security warning in the GemmInt8 CUDA kernel.
### Motivation and Context
This addresses the following issue:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/11158/
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
PartitionIntoStreams was incorrectly using std::string instead of
PathString for the config file argument when ORT_ENABLE_STREAM was not
defined.
Also incorporate changes from #14291 to fix build and test issues.
### Motivation and Context
Fix build error on Windows due to mismatched type.
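For context, the mismatch arises because a path string type is typically wide-character on Windows and narrow elsewhere, so a `std::string` parameter only lines up on non-Windows builds. The sketch below illustrates this with hypothetical names (`PathCharSketch`, `PathStringSketch`, `PATH_LITERAL_SKETCH`, `HasPartitionConfig`); it approximates the idea and is not ORT's actual `PathString` definition or the signature of `PartitionIntoStreams`.

```cpp
#include <string>

// Hypothetical platform-dependent path character type.
#ifdef _WIN32
using PathCharSketch = wchar_t;
#define PATH_LITERAL_SKETCH(s) L##s
#else
using PathCharSketch = char;
#define PATH_LITERAL_SKETCH(s) s
#endif

using PathStringSketch = std::basic_string<PathCharSketch>;

// Declaring the parameter as the path string type compiles on every platform.
// Declaring it as std::string would reject wide-character paths on Windows.
inline bool HasPartitionConfig(const PathStringSketch& config_file) {
  return !config_file.empty();
}
```

Both branches of a conditionally compiled function need the same parameter type; otherwise the caller's argument matches only one of them.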
### Description
The DML EP provider factory verifies that the adapter ID refers to a real GPU
(not software emulation such as WARP, which would be quite slow, or the basic
display driver, which lacks D3D compute ability). However, the automated tests
sometimes run on ADO cloud machines that lack a GPU or are in a bad state such
that Windows fell back to software emulation. In such cases, the
`!IsSoftwareAdapter` check in the provider factory ([line
132](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/dml_provider_factory.cc#L132))
fails and the pipeline logs show only E_INVALIDARG. Return a more
immediately enlightening error code,
ERROR_GRAPHICS_INVALID_DISPLAY_ADAPTER, rather than just E_INVALIDARG.
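The idea can be sketched in a platform-neutral way. The enum and function below are hypothetical: `AdapterCheckResult` merely models the two Windows error codes named above, and `ValidateAdapter` is not the provider factory's real code.

```cpp
// Hypothetical result codes modelling the two Windows error codes.
enum class AdapterCheckResult {
  kOk,
  kInvalidArgument,        // generic, models E_INVALIDARG
  kInvalidDisplayAdapter,  // specific, models ERROR_GRAPHICS_INVALID_DISPLAY_ADAPTER
};

// Hypothetical adapter description.
struct AdapterSketch {
  bool is_software;  // e.g. WARP or the basic display driver
};

// Report a software adapter with its own code so that pipeline logs point
// straight at the missing GPU instead of at a generic bad argument.
inline AdapterCheckResult ValidateAdapter(const AdapterSketch* adapter) {
  if (adapter == nullptr) {
    return AdapterCheckResult::kInvalidArgument;
  }
  if (adapter->is_software) {
    return AdapterCheckResult::kInvalidDisplayAdapter;
  }
  return AdapterCheckResult::kOk;
}
```

Keeping the generic code for genuinely malformed arguments preserves the distinction between a caller bug and a machine without a usable GPU.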
### Motivation and Context
- *Why is this change required? What problem does it solve* Pipeline
noise.
- *If it fixes an open issue, please link to the issue here.* NA.