### Description
Recently Visual Studio and python started to provide native Windows
ARM64 packages. This PR is to provide better support for building on
Windows ARM64. You can do it as what you did for x64. Like:
```
python tools\ci_build\build.py --config Debug --update --skip_submodule_sync --build_dir b --cmake_generator "Visual Studio 17 2022"
```
You do not need to append the "--arm64" build arg, and do not need to
cross-compile protoc for a different arch as you are not cross-compiling.
**caveat:** it does not work with the latest cmake release(3.26.x). It
only works fine with cmake 3.25.x and below. Filed a bug to them:
https://gitlab.kitware.com/cmake/cmake/-/issues/24797
### Motivation and Context
Provide better support for building on Windows ARM64.
### Description
<!-- Describe your changes. -->
As title
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/15110
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
add script to validate generated NPM packages and publish it to
artifacts, so that release pipeline can use it.
once this PR is merged, I will update the NPM package release pipeline.
### Description
Implement Optional Type metadata support in the library.
Implement optional support in C# API along with metadata.
Implement Sequence, Map, Optional test data support
and test execution.
Prune tests and provide more details for failing tests in C# code.
Note, this PR does not enable running onnx test models in C++.
### Motivation and Context
Opset18 optional type support.
Add support for kMSInternalNHWCDomain and kPytorchAtenDomain op domains to op reduction script.
Make it an error if the op reduction script encounters unknown op domains.
### Description
<!-- Describe your changes. -->
1. enabled self-attention fusion in mt-5 decoder graph
2. fix a parity issue
https://github.com/microsoft/onnxruntime/issues/15042
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
CP: [For HLSL shader ops in the DirectML EP (STFT,DFT) FP16 ops should
fallback to CPU when there is no hardware support #15414
](https://github.com/microsoft/onnxruntime/pull/15414)
For HLSL shader ops in the DirectML EP (STFT,DFT) FP16 ops should
fallback to CPU when there is no hardware support.
## Description
Implements support for LeakyReLU in ActivationOpBuilder for CoreML's EP.
### Motivation and Context
This speeds up inference on macOS significantly for models using
LeakyReLU.
### Description
optimize default session options parsing.
- do minimal property assignment to the passed in `options` object.
- modify default value of `enableCpuMemArena` and `enableMemPattern` to
`false`. We don't get benefits from enabling these 2 flags in web
assembly
### Description
1. Move it to a separated pool that use the same image as [the public
hosted
pool](https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml).
Also, create a beta pool which contains the next version image of the
hosted pool, and add jobs in our post merge pipeline to test if the next
version image will break our CI. So, usually we will have at least one
week to prepare.
2. Change the cmake generator in use in our pipelines from "Ninja" to
"MingW Makefile", because the latest version of cmake doesn't work with
the latest version of Ninja. People who prefer Ninja could still use
ninja in their local build by passing "--cmake_generator ninja" to
[build.py](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/build.py).
3. Delete eager mode CI pipeline.
### Motivation and Context
I need to update the software we have in our CI build machines, and I
need to resolve this incompatibility issue. In more detail, the build
error I hit was:
em++: error:
CMakeFilesonnxruntime_mlas_test.dirC_a_work1sonnxruntimetestmlasunittesttest_activation.cpp.o:
No such file or directory
("CMakeFilesonnxruntime_mlas_test.dirC_a_work1sonnxruntimetestmlasunittesttest_activation.cpp.o"
was expected to be an input file, based on the commandline arguments
provided)
After this PR we will deprecate python 3.7 support. The eager mode CI
pipeline is the last one that still use python 3.7. Then we can rework
the PR #10953 made by [fs-eire](https://github.com/fs-eire) last year.
Fixed
[AB#14435](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/14435)
### Description
Adds VIT model type to the benchmark
Also adds Swin (v1) model type
### Motivation and Context
Image models are important and we should verify these work as expected
at the performance we expect.
**Description**: Register an implementation for BatchNormInternal and
add a CPU kernel for BatchNormGradient. This is the third in a series of
PRs to implement BN training on CPU (first was #6946, second was #7539).
**Motivation and Context**
Support training networks with BatchNorm (e.g. convnets). Also note that
there exists a CUDA kernel for BN (forward training & backwards) but
it's currently disabled due to flaky failures; someone more familiar
with those parts can register the implementation for BNInternal on CUDA
(gradient kernel doesn't have to change).
---------
Co-authored-by: Simon Zirui Guo <simonguozirui@berkeley.edu>
Co-authored-by: mindest <linminuser@gmail.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Update the support deepspeed to 0.8.3 as it's the latest version
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This will fix the error of `Skip modifying optimizer because of
unsupported DeepSpeed version`
Co-authored-by: ruiren <ruiren@microsoft.com>
Add workflow to update Objective-C API docs. Remove the Objective-C API doc generation step from the packaging pipeline.
There are similar workflows for automatically updating other language API docs. This change enables this for Objective-C too.
Ensure that we build with a known version of NDK and are not surprised when the default version on the build machine changes.
A similar change was made for other Android build pipelines previously, but this one was missed.
### Description
Fix onnxruntime_mlas build failure with cmake 3.26. Updated CMAKE
generator expression to make sure certain complier flags only apply for
C/CXX compiler.
### Motivation and Context
CMake changed the behavior of ASM_MASM in version 3.26. See
https://gitlab.kitware.com/cmake/cmake/-/issues/24639.
This also fixed the issue of #15101
### Description
Adds support for the AveragePool operator to QNN EP.
### Motivation and Context
This is needed to enable more models to run with QNN EP.
### Description
Adding 'Add' functionality to FP16 Conv operator. It takes a tensor that
has the same shape of the output tensor, and add it to the result
tensor.
### Motivation and Context
Needed to run Resnet 50
### Description
Fix issue in LeakyRelu Opbuilder for HTP backend.
Qnn Prelu(Onnx LeakyRelu) requires alpha data as the 2nd input while
Onnx set it as attribute. HTP backend requires input to be quantized. It
caused Qnn Op validation failed by setting the 2ns input as float32 data
type.
Fix:
Need to set the 2nd input as quantized input for HTP backend. Calculate
the quantization parameter and quantize the alpha data into uint8.
### Motivation and Context
Unblock models with the LeakyRelu execution on QualComm HTP backend.
### Introduce shrunken gather operator
Exist Gather operator schema won't guarantee output element count will
be smaller than input element count.
Actually, it is possible output element count >, =, or < input element
count.
For some cases we know for sure output element count MUST be <= input
element count, we will upstream those Gather operators to reduce compute
flops.
So this PR introduces an ShrunkenGather which explicitly guarantee
output count will be smaller than input count. The operator add
additional restriction on inputs, but still re-use existing Gather's
implementations plus input check during runtime.
This is a requirement for subsequent optimization (Draft PR:
https://github.com/microsoft/onnxruntime/pull/15401) we will do for
label sparsity and embedding sparsity.
### Description
- Now uses QNN's Resize operator for quantized models
- Still uses QNN's ResizeBilinear or ResizeNearestNeighbor for
non-quantized models.
### Motivation and Context
This update is necessary to support more models on QNN HTP backend.
Specifically, we need to support Resize's `pytorch_half_pixel`
coordinate transformation mode on HTP.
### Description
1. The protoc package on nuget.org contains binaries for
Windows_x86/Windows_x64/Linux_x86/Linux_x64/MacOS_x64, which can cover
most use cases. Though it doesn't have binaries for AMR64, they are only
needed when we cross-compile for Intel CPUs on ARM CPUs. It is rare.
When you have such a need, you always can build protoc from source by
yourself and pass it to build.py as "--path_to_protoc_exe". Or if you
have security concerns that you don't want to use prebuilt binaries from
outside, you can do the same thing.
2. Remove GoogleTestAdapter related thing. That part of code is out of
maintain.
### Motivation and Context
As a follow-up of PR #15190.
### Description
Apply `get_shared_initializers()` to the encoder and decoder subgraphs
of Whisper before chaining and exporting the full, final model.
### Motivation and Context
The Whisper export process has some overlap between the encoder and
decoder subgraphs due to the format of the BeamSearch contrib op.
Consequently, there is some shared model data that is duplicated in the
final exported product, which can result in a file size increase of
~40%. This PR takes the methods in `convert_generation.py` and applies
them during the whisper export process.
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
### Description
Update mimalloc dependency.
### Motivation and Context
The latest release contains important fixes including memory leaks and
used by customers.
### Description
The current ONNX export of Whisper utilizes hard-coded values for
token_ids when configuring the BeamSearch node. This PR removes these
literals and instead takes these values straight from the WhisperConfig.
### Motivation and Context
Hard-coding these values can cause some parity issues when comparing to
default PyTorch behavior - this change to take from WhisperConfig
resolves these.
Co-authored-by: Peter McAughan <petermca@microsoft.com>
WindowsAI build failing due to deprecated .NET5 SDK missing in build
image
.NET5 was deprecated last year, and recently the build machine images
have been updated to not include this SDK.
Unblock failing builds by force insalling .NET5 SDK as part of the build
pipeline.
### Description
XNNPack: allow users to choose whether enable CPU MEM arena or not.
Right now it is hardcoded to true and it is not impacted by the on/off
switch in SessionOption. We should make it work.
### Motivation and Context
As we have such a switch in SessionOption, it should work as expected.
Split `IsTunbaleOpEnable` semantics into **enable tunable op for using**
and **enable tunable op for tuning**.
They remain disabled in general for safety purpose. But
- if session is created with onnx model with tuning results embeded
- the embedded tuning results is set to the EP without error `Status`
then we automatically enable the using, tuning remains disabled.
The planned options will be
- `tunable_op_enable`: The top-level switch of `TunableOp`, indicate if we will run into `TunableOp` related logic. **NOTE:** most of our impls have a bottom impl that is acting as a fallback and is set as the default. In this case, we still call into the `TunableOp`, but no kernel selection, no kernel tuning and caching is involved. This reduced our maintainance burden of a duplicate code path.
- `tunable_op_tuning_enable`: The secondary switch of `TunableOp`, indicate if we will run into the tuning related logic of `TunableOp`
Then for the possible future options:
- `tunable_op_tuning_max_iteration`: blahblah
- `tunable_op_tuning_max_duration_ms`: blahblah
- `tunable_op_flash_attention_enable`: blahblah, for example only, we will not have this.
For developer oriented envvar, it is for developers' convenience to inspect the performance impact of tuning. So there is only `ORT_ROCM_TUNABLE_OP_ENABLE`, `ORT_ROCM_TUNABLE_OP_TUNING_ENABLE` to take the fine-grind control of combinations.