Fix an assertion failure in onnxruntime_test_all in the debug build with CUDA,
which is caused by a test case added in
https://github.com/microsoft/onnxruntime/pull/16075.
Remove an assumption that bias exists in MultiHeadAttention.
Always set the causal mask to the lowest float.
Note that since Hugging Face Transformers v4.21, GPT-2 uses the lowest half value
for FP16 and the lowest float value for FP32:
66fd3a8d62/src/transformers/models/gpt2/modeling_gpt2.py (L199)
Since we assume that most FP16 ONNX models are converted from FP32 models, we
decided to use the lowest float32 value for both half and float models for
consistency.
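The following minimal sketch shows why the lowest float is a safe causal mask value. It is plain C++, not the CUDA kernel, and `CausalSoftmax` is a hypothetical helper. It overwrites future positions with `std::numeric_limits<float>::lowest()` and then runs a numerically stable softmax. After the row maximum is subtracted, `exp` of a masked score underflows to exactly 0, so masked positions get zero probability.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: applies a causal mask to a square [seq_len x seq_len]
// score matrix (row-major) by overwriting future positions (j > i) with the
// lowest finite float, then runs a numerically stable softmax per row.
std::vector<float> CausalSoftmax(std::vector<float> scores, int seq_len) {
  assert(static_cast<int>(scores.size()) == seq_len * seq_len);
  const float mask_value = std::numeric_limits<float>::lowest();
  for (int i = 0; i < seq_len; ++i) {
    float* row = scores.data() + i * seq_len;
    for (int j = i + 1; j < seq_len; ++j) row[j] = mask_value;
    // The row max is always an unmasked score, because j == i is never masked.
    const float row_max = *std::max_element(row, row + seq_len);
    float sum = 0.0f;
    for (int j = 0; j < seq_len; ++j) {
      row[j] = std::exp(row[j] - row_max);  // exp(lowest - max) underflows to 0
      sum += row[j];
    }
    for (int j = 0; j < seq_len; ++j) row[j] /= sum;
  }
  return scores;
}
```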
The mask_filter_value only applies to a raw attention mask (2D, 3D or 4D).
For a 1D mask, masked items are 0.0 after softmax, so the effective mask filter value is
the lowest float. mask_filter_value is not applicable in these cases:
* For BERT models, when users use a 1D mask (required by FMHA).
* For BERT or GPT-2, when a fused kernel is used.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/12843
https://github.com/microsoft/onnxruntime/issues/14363
The Whisper model contains five different types of attention, and the q, k, v biases
were fused into the Attention/MHA/DMHA ops.

encoderdecoderinit subgraph:
- Attention: encoder attention
- Attention: decoder self attention + present k, v
- MultiHeadAttention: decoder cross attention + present k and v. q and v
have bias.

decoder subgraph:
- DecoderMultiHeadAttention: decoder cross attention + past k, v. q has
bias.
- DecoderMultiHeadAttention: decoder self attention + past/present k, v.
q, k, v have bias.

For the ROCm EP, MHA/DMHA doesn't support the additional bias. This PR adds a
fusion option `disable_multi_head_attention_bias` to split the q, k, v bias
out of MHA/DMHA.
This fixes the type lists used to register DML kernels for Microsoft
domain QuantizeLinear and DequantizeLinear. These previously did not
include FP16 and incorrectly used the same type list for both operators.
The new type lists match those of ONNX opset 19, which is not yet
implemented in the DML EP.
### Description
The build pipeline runs on Azure NV12 machines, which will be deprecated
soon because the SKU is too old. This PR moves the pipeline to a
Windows machine with two A10 GPUs.
### Description
- Add a new field to `MLAS_PLATFORM` for S8S8 GEMM dispatch.
- Set this field to either dot product instructions or NEON MLA in
platform.cpp.
- Clean up dispatch selector in qgemm.h.
### Motivation and Context
This allows future extensibility, e.g., for other functions that use other
ARM64 extensions for quantized matrix multiplication.
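The dispatch idea can be sketched as a platform object that picks the S8S8 kernel once at startup. `PlatformSketch`, `GemmS8S8Dispatch`, and the kernel descriptors are hypothetical stand-ins, not the actual `MLAS_PLATFORM` definitions. The sketch also assumes that the CPU feature query (dot product support) has already been done.

```cpp
#include <cassert>

// Hypothetical descriptor standing in for an MLAS GEMM dispatch table.
struct GemmDispatchSketch {
  const char* name;
};

// Two candidate S8S8 implementations (hypothetical names).
constexpr GemmDispatchSketch kS8S8DotProd{"S8S8 dot product"};
constexpr GemmDispatchSketch kS8S8NeonMla{"S8S8 NEON MLA"};

// Platform object holding one dispatch pointer per GEMM flavor.
struct PlatformSketch {
  const GemmDispatchSketch* GemmS8S8Dispatch = nullptr;

  // Select once at startup from CPU features, so call sites no longer
  // need to re-derive the choice.
  explicit PlatformSketch(bool has_dot_product)
      : GemmS8S8Dispatch(has_dot_product ? &kS8S8DotProd : &kS8S8NeonMla) {}
};

// Call sites simply read the pre-selected pointer.
const GemmDispatchSketch* SelectS8S8(const PlatformSketch& platform) {
  assert(platform.GemmS8S8Dispatch != nullptr);
  return platform.GemmS8S8Dispatch;
}
```

With this design, supporting another ARM64 extension means adding one more candidate and one more branch in the constructor. The call sites stay unchanged.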
---------
Co-authored-by: Skand Hurkat <skhurkat@microsoft.com>
"default" should be last element for exports.
This fixes "Module not found: Error: Default condition should be last
one" when importing the onnxruntime-web package in some conditions.
### Description
1. Replacing AMX intrinsics with machine code macro instructions in
QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
### Motivation and Context
The additional AMX flag in CMake adds an extra dependency on the
GCC version to use the feature. These changes allow the AMX feature
to be used with just the CPU ID check.
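A CPU ID check of this kind can be sketched with GCC/Clang's `<cpuid.h>`. CPUID leaf 7, sub-leaf 0 reports AMX-TILE in EDX bit 24 and AMX-INT8 in EDX bit 25. This minimal sketch assumes an x86 GCC/Clang toolchain and is not the MLAS implementation. It also deliberately omits the OS-level checks that real code needs before executing AMX instructions: the XCR0 tile state and, on Linux, the `arch_prctl` permission request.

```cpp
#include <cassert>
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAS_CPUID 1
#endif

// Returns true when CPUID reports both AMX-TILE and AMX-INT8.
// Only the CPUID bits are checked; OS enabling of tile state is NOT checked here.
bool CpuReportsAmxInt8() {
#ifdef SKETCH_HAS_CPUID
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  // Leaf 7, sub-leaf 0: structured extended feature flags.
  if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) == 0) {
    return false;  // leaf 7 not supported by this CPU
  }
  const bool amx_tile = ((edx >> 24) & 1u) != 0;  // EDX bit 24: AMX-TILE
  const bool amx_int8 = ((edx >> 25) & 1u) != 0;  // EDX bit 25: AMX-INT8
  return amx_tile && amx_int8;
#else
  return false;  // non-x86 target or unsupported compiler in this sketch
#endif
}
```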
### Fix Reshape check
The check for a 3D->2D reshape that merges the first dims
has a bug in the following case:
```mermaid
stateDiagram
    in_shape: [768,12,64]
    target: (-1,768)
    out_shape: [768,768]
    in_shape --> Reshape
    target --> Reshape
    Reshape --> out_shape
```
This Reshape passes the upstream Reshape check, but it should not: the target shape merges the last two dims (12 * 64 = 768), not the first two.
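The intended check can be sketched as a shape predicate. `MergesFirstTwoDims` is a hypothetical helper, not the code changed by this PR. A 3D->2D reshape merges the first dims only if two conditions hold: the last dim is carried over unchanged, and the leading dim equals `d0 * d1` (or is `-1`). In the diagrammed case, the target's last dim is 768 rather than 64, so the predicate rejects it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical check: does reshaping the 3D `input` shape to the 2D `target`
// shape merge the FIRST two dims, i.e. [d0, d1, d2] -> [d0 * d1, d2]?
// `target` may contain -1 (inferred dim) as in ONNX Reshape;
// 0 ("copy this dim") entries are not handled in this sketch.
bool MergesFirstTwoDims(const std::vector<int64_t>& input,
                        const std::vector<int64_t>& target) {
  assert(input.size() == 3);
  if (target.size() != 2) return false;
  // The last dim must be carried over unchanged.
  if (target[1] != input[2]) return false;
  // The leading dim must be d0 * d1, or -1 (which then infers that product).
  return target[0] == -1 || target[0] == input[0] * input[1];
}
```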
### Motivation and Context
Since the WebNN API doesn't support the Shape op, the WebNN EP calculates
the ONNX Shape node's output and passes the values to a WebNN constant +
slice as a workaround.
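Because the input shape is static, the values fed to the WebNN constant can be computed on the host. The sketch below computes an ONNX Shape output, including the optional opset-15 `start`/`end` attributes. `StaticShapeOutput` is hypothetical and is not the WebNN EP code. Presumably the slice part of the workaround must honor these attributes: negative values count back from the rank, and both values are clamped to `[0, rank]`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Computes what an ONNX Shape node (opset 15+) would output for a statically
// known input shape, honoring the optional `start`/`end` attributes.
std::vector<int64_t> StaticShapeOutput(
    const std::vector<int64_t>& input_shape, int64_t start = 0,
    int64_t end = std::numeric_limits<int64_t>::max()) {
  const int64_t rank = static_cast<int64_t>(input_shape.size());
  assert(rank >= 0);
  // Negative axes count back from the rank; results are clamped to [0, rank].
  auto normalize = [rank](int64_t axis) -> int64_t {
    if (axis < 0) axis += rank;
    return std::min(std::max(axis, int64_t{0}), rank);
  };
  start = normalize(start);
  end = normalize(end);
  if (start >= end) return {};
  return std::vector<int64_t>(input_shape.begin() + start,
                              input_shape.begin() + end);
}
```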
1. Enable xnnpack test
2. Change the TSA database name from onnxruntime_master to onnxruntime_main.
This is a leftover from renaming the "master" branch to "main".
3. Add two static analysis jobs for WinML and DML
4. Rename the machine pool "aiinfra-dml-winbuild" to
"onnxruntime-Win2019-GPU-dml-A10", so that the internal and public ADO
instances use the same machine pool name.
5. Move Windows GPU CI build pipeline from "onnxruntime-Win2022-GPU-T4"
to "onnxruntime-Win2022-GPU-A10" machine pool, because we do not have
enough T4 GPUs.
Fix two issues related to cuda graph capture:
https://github.com/microsoft/onnxruntime/issues/14942 and
https://github.com/microsoft/onnxruntime/issues/15002
Issue 1: Previously, graph capture started at the second run. However,
memory pattern optimization allocates memory during the second run,
and cudaMalloc is not allowed during graph capture. In this PR,
graph capture starts after 2 runs to avoid the issue.
Issue 2: https://github.com/microsoft/onnxruntime/pull/13495 introduced
multiple stream support, but stream cleanup calls
cudaStreamSynchronize, which is not allowed during CUDA graph capture. In this
PR, we move stream cleanup after CUDA graph capture.
Update the SqueezeNet test model with a dynamic axis so that we can test
with a larger batch size. Add a test that reproduces the bug (when
changing min runs from 2 back to 1).
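The scheduling fix for issue 1 can be sketched as a small state machine. `GraphCaptureSchedule` is hypothetical and is not the CUDA EP's code. The first `min_runs_before_capture` runs execute normally, so memory-pattern allocations with `cudaMalloc` happen outside capture. The next run is captured, and later runs replay the graph. With 2 warm-up runs, capture begins on the third run.

```cpp
#include <cassert>

// Hypothetical per-session schedule for CUDA graph capture.
class GraphCaptureSchedule {
 public:
  enum class Mode { kRegular, kCapture, kReplay };

  explicit GraphCaptureSchedule(int min_runs_before_capture)
      : min_runs_before_capture_(min_runs_before_capture) {
    assert(min_runs_before_capture_ >= 0);
  }

  // Returns how the upcoming run should execute and advances the state.
  Mode NextRun() {
    if (captured_) return Mode::kReplay;
    if (regular_runs_ < min_runs_before_capture_) {
      ++regular_runs_;  // warm-up run: allocations (cudaMalloc) are allowed here
      return Mode::kRegular;
    }
    // Capture run: the caller must defer stream cleanup (which synchronizes
    // the stream) until capture has ended.
    captured_ = true;
    return Mode::kCapture;
  }

 private:
  int min_runs_before_capture_;
  int regular_runs_ = 0;
  bool captured_ = false;
};
```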
### Description
- Add new Ops: Ceil, Exp, Identity, Reciprocal, Tan.
- Set MinSupportedOpSet for unary Ops.
### Motivation and Context
Support more Ops for other models.
The legacy optimization attribute "consumed_inputs" is not supported in
the WebNN EP.
### Description
Publish E2E test logs on build failure too.
### Motivation and Context
Get more information about intermittent test failures.
### Description
Added support for ReduceL1, ReduceL2, ReduceMean, ReduceMin, ReduceMax,
ReduceSum, ReduceLogSum, ReduceLogSumExp, ReduceProd and
ReduceSumSquare.
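For reference, here is how several of these reductions are defined over a whole 1D input. This plain C++ sketch follows the ONNX definitions (without `axes`/`keepdims` handling) and is not the EP's kernels. ReduceLogSumExp uses the usual max-subtraction trick for numerical stability, and ReduceLogSum assumes a positive sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Results of reducing ALL elements of a non-empty 1D input.
struct ReduceResults {
  float sum, l1, l2, sum_square, log_sum, log_sum_exp, mean, prod, min, max;
};

ReduceResults ReduceAll(const std::vector<float>& x) {
  assert(!x.empty());
  float sum = 0.0f, abs_sum = 0.0f, sq_sum = 0.0f, prod = 1.0f;
  float mn = x[0], mx = x[0];
  for (float v : x) {
    sum += v;
    abs_sum += std::fabs(v);
    sq_sum += v * v;
    prod *= v;
    mn = std::min(mn, v);
    mx = std::max(mx, v);
  }
  // log(sum(exp(x))) computed stably as max + log(sum(exp(x - max))).
  float exp_sum = 0.0f;
  for (float v : x) exp_sum += std::exp(v - mx);
  return {sum,                                 // ReduceSum
          abs_sum,                             // ReduceL1
          std::sqrt(sq_sum),                   // ReduceL2
          sq_sum,                              // ReduceSumSquare
          std::log(sum),                       // ReduceLogSum
          mx + std::log(exp_sum),              // ReduceLogSumExp
          sum / static_cast<float>(x.size()),  // ReduceMean
          prod,                                // ReduceProd
          mn,                                  // ReduceMin
          mx};                                 // ReduceMax
}
```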
---------
Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>
Co-authored-by: guschmue <guschmue@microsoft.com>
### Enhance StatisticsSubscriber
There are a few improvements to `StatisticsSubscriber`:
- Reduce the peak memory impact for tensors with very many elements
(which consumed too much GPU memory and caused the original recipe run to fail
with OOM) by splitting the statistics computation into two phases: splitting into
buckets, then merging the results across buckets.
- Allow dumping intermediate tensors. Originally, only the return values of
nn.Module forward() were dumped; since we need to inspect specific
intermediate tensors inside the forward() function, this is now
supported.
- Add documentation for collecting dumps on multiple ranks.
Docs link on this branch for better view:
https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool_v2/docs/ORTModule_Convergence_Notes.md
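The two-phase idea can be sketched with mergeable partial statistics. This is a hypothetical C++ illustration, not the ORTModule (Python) implementation, and the statistics tracked here (min, max, sum, count) are assumptions. Phase 1 computes one small result per bucket, and phase 2 merges them. Presumably the memory benefit on a GPU is that temporaries only need to exist for one bucket at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Partial statistics for one bucket; Merge() combines two partial results.
struct PartialStats {
  double min = std::numeric_limits<double>::infinity();
  double max = -std::numeric_limits<double>::infinity();
  double sum = 0.0;
  std::size_t count = 0;

  void Add(double v) {
    min = std::min(min, v);
    max = std::max(max, v);
    sum += v;
    ++count;
  }
  void Merge(const PartialStats& other) {
    min = std::min(min, other.min);
    max = std::max(max, other.max);
    sum += other.sum;
    count += other.count;
  }
  double Mean() const { return count ? sum / static_cast<double>(count) : 0.0; }
};

PartialStats BucketedStats(const std::vector<float>& data, std::size_t bucket_size) {
  assert(bucket_size > 0);
  // Phase 1: compute one small PartialStats per bucket.
  std::vector<PartialStats> buckets;
  for (std::size_t begin = 0; begin < data.size(); begin += bucket_size) {
    const std::size_t end = std::min(begin + bucket_size, data.size());
    PartialStats bucket;
    for (std::size_t i = begin; i < end; ++i) bucket.Add(data[i]);
    buckets.push_back(bucket);
  }
  // Phase 2: merge the per-bucket results.
  PartialStats total;
  for (const PartialStats& b : buckets) total.Merge(b);
  return total;
}
```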
---------
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
### Description
A few QDQ tests failed on XNNPACK EP.
The likely reason is that the range of input_data doesn't fit the scale and
zero_point.
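The interaction between out-of-range input and `scale`/`zero_point` follows from standard uint8 linear quantization as in ONNX QuantizeLinear: `q = saturate(round(x / scale) + zero_point)` with round-half-to-even. The helpers below are a hypothetical sketch, not the test code. Values outside `[(0 - zero_point) * scale, (255 - zero_point) * scale]` saturate, so results for such data depend on clamping rather than on the kernel under test.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// uint8 linear quantization; std::nearbyint rounds half to even under the
// default rounding mode.
uint8_t QuantizeU8(float x, float scale, uint8_t zero_point) {
  assert(scale > 0.0f);
  const float q = std::nearbyint(x / scale) + static_cast<float>(zero_point);
  return static_cast<uint8_t>(std::min(std::max(q, 0.0f), 255.0f));  // saturate
}

// True if x lies in the real-valued range representable without saturation.
bool FitsQuantRange(float x, float scale, uint8_t zero_point) {
  const float lo = (0.0f - zero_point) * scale;
  const float hi = (255.0f - zero_point) * scale;
  return x >= lo && x <= hi;
}
```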
### Description
This
[PR](e726151b5c (diff-6957596681c25d78e7f3f56485f307fb7e66369309523240209a62c8fa21646b))
left the Identity operator without a registration for versions
greater than 14.
### Motivation and Context
It broke the CANN CI. I added the registration of the Identity operator.
### Description
NNAPI doesn't support a scalar indices input for Gather.
This PR works around it.
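For context, ONNX Gather's output shape is `data.shape[:axis] + indices.shape + data.shape[axis+1:]`, so scalar indices drop the gathered axis. One possible workaround is to gather with indices of shape `[1]` and then remove the extra dimension at `axis`. The helpers below are hypothetical, and the description does not state that the PR uses exactly this approach.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

// ONNX Gather output shape: data[:axis] + indices + data[axis+1:].
Shape GatherOutputShape(const Shape& data, const Shape& indices, std::size_t axis) {
  assert(axis < data.size());
  const auto a = static_cast<std::ptrdiff_t>(axis);
  Shape out(data.begin(), data.begin() + a);
  out.insert(out.end(), indices.begin(), indices.end());
  out.insert(out.end(), data.begin() + a + 1, data.end());
  return out;
}

// Hypothetical workaround: gather with rank-1 indices of shape [1], then drop
// the extra dim at `axis` (e.g. via Reshape/Squeeze) to get the scalar result shape.
Shape GatherWithScalarIndicesViaRank1(const Shape& data, std::size_t axis) {
  Shape out = GatherOutputShape(data, Shape{1}, axis);
  out.erase(out.begin() + static_cast<std::ptrdiff_t>(axis));
  return out;
}
```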
### Description
Refactor the tensor type in onnxruntime-common.
### Motivation and Context
The major motivation is that I am making a local change to address the
API part of #15312, and I am refactoring onnxruntime-common
anyway (#15772).
The `tensor.ts` and `tensor-impl.ts` files are too large, so I split their contents
into multiple files to make the type declarations clearer.
The original target of this change was the API only (i.e., no implementation
refactoring). However, there are a few type/implementation
inconsistencies, so I also made minimal changes to fix them.
### Changes
- extract `TensorUtils` for non-template interfaces
- extract `TensorFactory` for all overloads of `Tensor.fromImage()`
- refactor the options types used for `Tensor.fromImage()`
- fix JSDoc comments to make option descriptions consistent with actual
type declarations
- fix an inconsistency for `options.format` and `options.bitmapFormat`;
change all `bitmapFormat` to `format`
- extract `ConversionUtils` for `tensor.toDataURL()` and
`tensor.toImageData()`
- put implementations into multiple files from `tensor-impl.ts`
- fix a bug that caused a unit test to fail; add comments for a future fix
### Description
Add an API for users to get the version of the current package. Example usage:
```js
import { env } from 'onnxruntime-node';
console.log(env.versions.node); // output "1.16.0"
```
```js
import { env } from 'onnxruntime-web';
console.log(env.versions.web); // output "1.16.0"
console.log(env.versions.common); // output "1.16.0"
console.log(env.versions.node); // output "undefined"
```
#16156
### Description
1. Updated the Mac package workflow for easier debugging.
2. Changed the archive type from tgz to zip, since zip is supported by ESRP.
3. .../dylib.dSYM/Contents/Resources/DWARF/libonnxruntime.1.16.0.dylib
is a debug symbol file, so it couldn't be signed.
### Motivation and Context
It's required by VS Code.
Mac binaries in NuGet should be signed.
### Description
Correctly sets padding when the `auto_pad` attribute is specified for
Conv operator.
### Motivation and Context
Needed to correctly translate ONNX Conv to QNN Conv2d.
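The ONNX Conv `auto_pad` rules can be sketched per spatial axis. `SAME_UPPER`/`SAME_LOWER` pad so that `output = ceil(input / stride)`. When the total padding is odd, the extra element goes at the end for `SAME_UPPER` and at the beginning for `SAME_LOWER`. `AutoPad` is a hypothetical helper and is not the QNN EP code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Returns {pad_begin, pad_end} for one spatial axis per the ONNX auto_pad rules.
std::pair<int64_t, int64_t> AutoPad(const std::string& auto_pad, int64_t in,
                                    int64_t kernel, int64_t stride,
                                    int64_t dilation) {
  assert(kernel > 0 && stride > 0 && dilation > 0);
  if (auto_pad != "SAME_UPPER" && auto_pad != "SAME_LOWER") {
    return {0, 0};  // VALID; for NOTSET the explicit `pads` attribute applies instead
  }
  const int64_t out = (in + stride - 1) / stride;  // ceil(in / stride)
  const int64_t effective_kernel = (kernel - 1) * dilation + 1;
  int64_t total = (out - 1) * stride + effective_kernel - in;
  if (total < 0) total = 0;
  const int64_t small_half = total / 2;
  const int64_t big_half = total - small_half;
  // Odd total: extra pad at the end (SAME_UPPER) or the beginning (SAME_LOWER).
  return auto_pad == "SAME_UPPER" ? std::make_pair(small_half, big_half)
                                  : std::make_pair(big_half, small_half);
}
```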
### Description
SetThreadDescription isn't available in an Azure App Service sandbox.
#15219 removed the check for its availability, making it a hard
dependency. When it's not available, the DLL load fails with a 'procedure
not found' error.
Add back the check.
### Motivation and Context
#15375 - although note this has nothing to do with the original issue.
This is just for
https://github.com/microsoft/onnxruntime/issues/15375#issuecomment-1579464889
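A runtime availability check of this kind can be sketched by resolving the export with `GetProcAddress` instead of linking against it directly. This is one common pattern, shown under the assumption that the function is looked up in `kernel32.dll`. The helper names are hypothetical, and this is not necessarily the check restored by the PR. On non-Windows builds the helpers simply report that the API is unavailable.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>

// Signature of SetThreadDescription, resolved at run time rather than link time.
using SetThreadDescriptionFn = HRESULT(WINAPI*)(HANDLE, PCWSTR);

SetThreadDescriptionFn LookupSetThreadDescription() {
  HMODULE kernel32 = ::GetModuleHandleW(L"kernel32.dll");
  if (kernel32 == nullptr) return nullptr;
  return reinterpret_cast<SetThreadDescriptionFn>(
      ::GetProcAddress(kernel32, "SetThreadDescription"));
}
#endif

bool IsSetThreadDescriptionAvailable() {
#ifdef _WIN32
  return LookupSetThreadDescription() != nullptr;
#else
  return false;  // Windows-only API
#endif
}

// Names the current thread if the API exists; otherwise does nothing.
bool TrySetCurrentThreadName(const wchar_t* name) {
  assert(name != nullptr);
#ifdef _WIN32
  SetThreadDescriptionFn fn = LookupSetThreadDescription();
  return fn != nullptr && SUCCEEDED(fn(::GetCurrentThread(), name));
#else
  (void)name;
  return false;
#endif
}
```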
### Description
Add the model description to the context binary file metadata for validation.
### Motivation and Context
Dump more information for validation
---------
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
### Description
Fix an issue for Conv with dynamic weights
Root cause:
The Conv op builder created the weight input tensor with the wrong name. With dynamic weights, a Transpose node is inserted, so the Conv op builder should use the new name, i.e., the Transpose output. Using the wrong name caused the weight producer to have the wrong output shape.
### Description
Implement the `dispose` method for React Native.
### Motivation and Context
Currently, we cannot release the memory used by a model in the JS
runtime once it is no longer needed; the only way is to
reload the app in debug or restart it in release.
### Description
Adds tests for operators that return error 1002
(QNN_COMMON_ERROR_MEM_ALLOC) when the call to graphFinalize() fails.
This seems to happen for large input sizes.
Operators:
- Sub
- Div
- Conv
- MaxPool
### Motivation and Context
This documents bugs that need to be addressed with unit tests.
### Description
1. Use IAllocatorUniquePtr to replace BufferUniquePtr. This ensures
the deleter is always correct.
2. Change some std::unique_ptr to std::optional.
3. Bypass the arena allocator when allocating the prepack buffers for MLAS.
In this special case, the arena doesn't help. This is purely an
internal implementation change; it doesn't affect our public interface.
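The deleter point in item 1 can be illustrated with a `std::unique_ptr` whose deleter captures the allocator that produced the buffer. This is the idea behind an allocator-aware unique pointer such as IAllocatorUniquePtr. `CountingAllocator`, `AllocatorUniquePtr`, and `MakeBuffer` are hypothetical names for this sketch, and it only supports trivially constructible element types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>

// Hypothetical allocator that counts live allocations for the demo.
class CountingAllocator {
 public:
  void* Alloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p != nullptr) ++live_;
    return p;
  }
  void Free(void* p) {
    if (p != nullptr) --live_;
    std::free(p);
  }
  int live() const { return live_; }

 private:
  int live_ = 0;
};

// The deleter is bound to the allocator that produced the buffer.
template <typename T>
using AllocatorUniquePtr = std::unique_ptr<T, std::function<void(T*)>>;

// For trivially constructible T only; the allocator must outlive the buffer.
template <typename T>
AllocatorUniquePtr<T> MakeBuffer(CountingAllocator& alloc, std::size_t count) {
  T* p = static_cast<T*>(alloc.Alloc(sizeof(T) * count));
  assert(p != nullptr);
  return AllocatorUniquePtr<T>(p, [&alloc](T* ptr) { alloc.Free(ptr); });
}
```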
### Description
This PR adds flags for exporting Whisper with vocab masks for logits
processing. This PR also sets `input_features` back to FP32 precision
for the user and casts `input_features` to FP16 precision when needed.
### Motivation and Context
This helps enable specific logits processing for the exported Whisper
model.
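The following sketch shows vocab-mask logits processing, assuming that a mask value of 1 allows a token and 0 disallows it. `ApplyVocabMask` is a hypothetical helper, not the exported model's code. It sets disallowed logits to the lowest float so that argmax or top-k selection can never pick them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed convention: vocab_mask[i] == 1 allows token i, 0 disallows it.
void ApplyVocabMask(std::vector<float>& logits, const std::vector<int32_t>& vocab_mask) {
  assert(logits.size() == vocab_mask.size());
  for (std::size_t i = 0; i < logits.size(); ++i) {
    if (vocab_mask[i] == 0) logits[i] = std::numeric_limits<float>::lowest();
  }
}
```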
### Description
google/re2 [was
switched](49d776b9d2)
to absl::string_view in version 2023-06-02.
Because `absl::string_view` is a drop-in replacement for `std::string_view`,
it does not have an `as_string()` method.
This PR ensures the forward compatibility with the newest versions of
re2 library.
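A forward-compatible replacement for `as_string()` is to construct the `std::string` from `data()` and `size()`. This works with the older `re2::StringPiece` as well as with `absl::string_view`. The template below is a hypothetical helper, not the patch itself, and `std::string_view` stands in for the re2 types.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Works for any string-view-like type exposing data() and size(), e.g. the
// older re2::StringPiece as well as absl::string_view / std::string_view.
template <typename StringViewLike>
std::string ToStdString(const StringViewLike& sv) {
  return std::string(sv.data(), sv.size());
}
```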
### Description
Update NNAPI Softmax to coerce to 2D when opset is < 13. This prevents
the layout change to NHWC from breaking the implementation, as well as
making it work correctly when the ONNX node's axis != 1.
Add a check for opset 13+ that the axis is the innermost dimension, as we don't
currently handle any other value correctly.
Update tests to add model to check NHWC layout, as well as 4D tests. We
didn't notice the issues with the NNAPI EP as it was only processing
input shapes that were 2D or 4D (which was overly restrictive as well).
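For opset < 13, ONNX Softmax coerces its input to 2D as `[d0 * ... * d(axis-1), d(axis) * ... * d(n-1)]` and normalizes over the second dimension. The sketch below computes that shape; `CoerceTo2D` is hypothetical and is not the NNAPI EP code. It also suggests why a layout change to NHWC breaks a naive implementation: permuting dims changes which elements fall on each side of `axis`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns {rows, cols} of the 2D view used by ONNX Softmax before opset 13.
std::pair<int64_t, int64_t> CoerceTo2D(const std::vector<int64_t>& shape, int64_t axis) {
  const int64_t rank = static_cast<int64_t>(shape.size());
  if (axis < 0) axis += rank;  // negative axis counts from the back
  assert(axis >= 0 && axis < rank);
  int64_t rows = 1, cols = 1;
  for (int64_t i = 0; i < rank; ++i) {
    const int64_t dim = shape[static_cast<std::size_t>(i)];
    if (i < axis) {
      rows *= dim;  // dims before `axis` form the batch (rows)
    } else {
      cols *= dim;  // dims from `axis` on are normalized together
    }
  }
  return {rows, cols};
}
```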
### Motivation and Context
#15949
### Description
We should avoid using the macro because its value is fixed at build time
and may not match the CUDA version used at run time. For example, our prebuilt
packages are built with CUDA 11.8, but people may run the binaries with CUDA 11.4
(the minimum CUDA version we support is CUDA 11.4).
A runtime function should be used to determine the CUDA version instead, for example:
```C++
// cudaRuntimeGetVersion() reports 1000 * major + 10 * minor, so 11.4 is 11040.
int cuda_runtime_version = 0;
CUDA_CALL_THROW(cudaRuntimeGetVersion(&cuda_runtime_version));
ORT_ENFORCE(cuda_runtime_version >= 11040, "ONNX Runtime needs CUDA runtime 11.4 or higher");
```
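`cudaRuntimeGetVersion` encodes the version as `1000 * major + 10 * minor`, so the 11.4 threshold is `11040`. The helpers below are a hypothetical sketch that decodes such a value without requiring CUDA, which can help when formatting error messages.

```cpp
#include <cassert>

struct CudaVersion {
  int major;
  int minor;
};

// Decodes the integer reported by cudaRuntimeGetVersion (e.g. 11040 -> 11.4).
CudaVersion DecodeCudaVersion(int encoded) {
  assert(encoded >= 0);
  return {encoded / 1000, (encoded % 1000) / 10};
}

// Minimum supported runtime stated above: CUDA 11.4.
bool MeetsMinimumCudaRuntime(int encoded) { return encoded >= 11040; }
```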