In this change
1. Vectorization of k is updated to 4.
2. Tile_A and Tile_B are stored transposed in shared memory, which improves
memory locality for our access pattern.
3. Lane output is switched to individual vectors and its loop is unrolled;
this solves the problem where lane_output was not kept in registers before
(see the sketch below).
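A C++-flavored sketch of the lane-output idea, purely for illustration (the actual change is in the WGSL matmul shader and the names below are made up): a dynamically indexed array may be spilled to local memory, while individually named accumulators with an unrolled loop can stay in registers.
```cpp
// Illustrative sketch only (assumed names); the actual change is in WGSL.
struct Vec4 { float x, y, z, w; };

// Before (conceptually): lane_output[i] was a dynamically indexed array that
// the compiler could spill to local memory between ALU runs.
// After (conceptually): each accumulator is a named value and the loop is
// fully unrolled, so the values stay in registers across the whole k loop.
inline Vec4 MulAdd4(Vec4 acc, Vec4 a, float b) {
  acc.x += a.x * b;
  acc.y += a.y * b;
  acc.z += a.z * b;
  acc.w += a.w * b;
  return acc;
}
```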
Performance improvements from this change are not very consistent. On a Tiger Lake
GPU with driver 32.0.101.6460 (the latest Intel drivers):
```
Baseline
model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 7.36557e+06 <<<<
avg (tokens/s): 135.903
p50 (us): 7.35498e+06
stddev (us): 27599
n: 5 * 1001 token(s)
With Change
model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web\ -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 6.52302e+06 <<<<
avg (tokens/s): 153.457
p50 (us): 6.52224e+06
stddev (us): 10407.3
n: 5 * 1001 token(s)
```
However, comparing the before and after profiles in Intel GPA, one can
clearly see straight runs of ALU work that are no longer interspersed with
writebacks to the local memory that previously held lane_output.

There is a crash in the WebGPU CI pipeline. It happens at process
shutdown when unloading onnxruntime_pybind11_state.pyd.
Here is the callstack:
```
dxil.dll!DxcSwapThreadMalloc() Unknown
dxil.dll!DxcThreadMalloc::DxcThreadMalloc(struct IMalloc *) Unknown
dxil.dll!DxcValidator::Release(void) Unknown
[Inline Frame] webgpu_dawn.dll!Microsoft::WRL::ComPtr<IDxcValidator>::InternalRelease() Line 235 C++
[Inline Frame] webgpu_dawn.dll!Microsoft::WRL::ComPtr<IDxcValidator>::{dtor}() Line 290 C++
webgpu_dawn.dll!dawn::native::d3d12::Backend::`scalar deleting destructor'(unsigned int) C++
webgpu_dawn.dll!`eh vector destructor iterator'(void * ptr, unsigned __int64 size, unsigned __int64 count, void(*)(void *) destructor) C++
webgpu_dawn.dll!dawn::native::InstanceBase::~InstanceBase() Line 197 C++
webgpu_dawn.dll!dawn::native::InstanceBase::`scalar deleting destructor'(unsigned int) C++
webgpu_dawn.dll!dawn::native::InstanceBase::DeleteThis() Line 218 C++
ucrtbase.dll!<lambda>(void)() Unknown
ucrtbase.dll!__crt_seh_guarded_call<int>::operator()<<lambda_7777bce6b2f8c936911f934f8298dc43>,<lambda>(void) &,<lambda_3883c3dff614d5e0c5f61bb1ac94921c>>() Unknown
ucrtbase.dll!_execute_onexit_table() Unknown
onnxruntime_pybind11_state.pyd!dllmain_crt_process_detach(const bool is_terminating) Line 182 C++
> onnxruntime_pybind11_state.pyd!dllmain_dispatch(HINSTANCE__ * const instance, const unsigned long reason, void * const reserved) Line 293 C++
ntdll.dll!LdrpCallInitRoutine() Unknown
ntdll.dll!LdrShutdownProcess() Unknown
ntdll.dll!RtlExitUserProcess() Unknown
kernel32.dll!ExitProcessImplementation() Unknown
ucrtbase.dll!exit_or_terminate_process() Unknown
ucrtbase.dll!common_exit() Unknown
python312.dll!00007ff9cab3ec8d() Unknown
python312.dll!00007ff9cab3efbf() Unknown
python312.dll!00007ff9cab3edee() Unknown
python312.dll!00007ff9cab57f4c() Unknown
python312.dll!00007ff9cab57579() Unknown
python312.dll!00007ff9cab573be() Unknown
python312.dll!00007ff9cab5729b() Unknown
python312.dll!00007ff9cabacfcb() Unknown
python312.dll!00007ff9cabacd7d() Unknown
python312.dll!00007ff9cab99e2d() Unknown
python.exe!00007ff78a641230() Unknown
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
```
It might be caused by an incorrect destruction order of some global variables:
the DX DLLs were destroyed earlier than the WebGPU instance held by our code
in onnxruntime_pybind11_state.pyd.
### Description
(1) Update BiasGelu fusion to support onnx Gelu-20.
Since onnx Gelu-20 supports float/double/bf16/fp16, we also update the
related ops to support these data types in the CUDA and ROCm execution
providers:
(2) Add double support for the Gelu/FastGelu ops in the CUDA/ROCm execution
providers
(3) Add BFloat16 support for the Gelu ops in the CUDA execution provider
(4) Add unit tests
(5) Update the operator documents
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/23491
### Description
Add details about how to access the BrowserStack logs
### Motivation and Context
- A BrowserStack link on its own is confusing to people who don't have
context.
Let me know if you have suggestions to make the text clearer or more
informative.
NDK has two toolchain cmake files as you can see in
https://android.googlesource.com/platform/ndk/+/refs/heads/main/build/cmake
By default, the NDK uses the legacy one to provide the best compatibility.
We don't need that, so this PR changes the build to use the new one.
The new toolchain cmake file uses standard cmake variables like
CMAKE_ANDROID_RTTI to control C++ features.
### Description
This PR will enable python dlpack interface by default.
### Motivation and Context
The dlpack Python interface is useful in inference mode, not only in training
mode, since some pre-processing of inference results may be written in torch
and unnecessary device transfers should be reduced in those cases.
Closes https://github.com/microsoft/onnxruntime/issues/15963
Closes https://github.com/microsoft/onnxruntime/issues/22061
TODOs:
- [x] Add tests like
5407c69028/orttraining/orttraining/test/python/orttraining_test_ortvalue.py
that are unrelated to the training feature
---------
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Add overload of `TryParseStringWithClassicLocale()` that uses `std::from_chars()` for certain types.
Reduce binary size. It recently increased after PR #23526.
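A minimal sketch of what a `std::from_chars()`-based parse can look like for integral types; the helper name and signature below are assumptions for illustration, not the actual overload added in this change.
```cpp
#include <charconv>
#include <string>
#include <type_traits>

// Hedged sketch: a from_chars-based parse for integral types. The real
// TryParseStringWithClassicLocale() overload may differ in name and shape.
template <typename T, typename = std::enable_if_t<std::is_integral_v<T>>>
bool TryParseWithFromChars(const std::string& s, T& value) {
  const char* first = s.data();
  const char* last = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(first, last, value);
  // Require that the whole string was consumed and no overflow occurred.
  return ec == std::errc{} && ptr == last;
}
```
Compared with locale-based stream parsing, `std::from_chars()` does no allocation and no locale lookups, which helps both speed and binary size.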
Fix the issue that the newly generated EP context model is not able to find external data.
### Description
The newly generated EP context model was not able to find the external data file because it lost track of the source model path, which is used to locate the external initializers.
Related issue: https://github.com/microsoft/onnxruntime/issues/23358
### Description
After some investigation and debugging, I decided to follow the recommended
workaround suggested in https://github.com/vitejs/vite/issues/8427.
### Motivation and Context
There is a known issue with Vite 5.x when using a WebAssembly package.
Detailed information is in https://github.com/vitejs/vite/issues/8427.
There were previous attempts to fix this problem (#23487). I tried
various ways to make it work out of the box for Vite users, but none
of them worked: some "fixes" did fix the Vite usage but broke other
use cases/bundlers, and some introduced other issues. Eventually I figured
out that there is no good way to fix this inside ONNX Runtime.
Considering that the root cause is inside Vite and it may be fixed in Vite
v6, I think the best way for now is to follow the recommended workaround.
Fix tensor external data info length parsing issue.
The old implementation was parsing a `size_t` value with `strtol` (via `OrtStrToPtrDiff`) on ARM64 MSVC.
bf023ab3d5/onnxruntime/core/platform/path_lib.h (L74)
If we have `sizeof(size_t) == 8` and `sizeof(long) == 4` (as is the case for x64 and ARM64 MSVC), `strtol` will return a maximum value of `2^31-1` even for a larger, valid `size_t` value. `strtol` will also set `errno` to `ERANGE`, but we weren't checking that.
Updated to use `ParseStringWithClassicLocale` which will parse directly to the target type.
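For illustration only (not the codebase's actual helper), the truncation can be reproduced with a small standalone snippet; the value below is just an example above `LONG_MAX` on MSVC targets where `long` is 32-bit.
```cpp
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <string>

int main() {
  // Example offset larger than 2^31-1; with a 32-bit long (MSVC), strtol
  // saturates at LONG_MAX and sets errno to ERANGE.
  const std::string offset = "2147483648";
  errno = 0;
  long narrow = std::strtol(offset.c_str(), nullptr, 10);
  std::printf("strtol -> %ld, ERANGE=%d\n", narrow, errno == ERANGE);

  // Parsing into a 64-bit type (as the updated code does via
  // ParseStringWithClassicLocale) preserves the full value.
  unsigned long long wide = std::strtoull(offset.c_str(), nullptr, 10);
  std::printf("strtoull -> %llu\n", wide);
  return 0;
}
```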
Added some tests.
Remove inline default transposeHelper and ensure we use the proper check
via CanUse_hipBlasTransposeHelper_MLFloat16
Related to change in ROCm Onnxruntime repo:
https://github.com/ROCm/onnxruntime/pull/82
### Description
Required to correctly limit the grid size of the transpose helper kernel.
### Motivation and Context
Compilation was defaulting to the inline constructor that was removed
instead of using the overloaded case with the proper checks.
Removed the inline default "true" case, as this is incorrect for newer
AMD cards/targets.
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
When the user dumps the EP context model, if some nodes are not partitioned to the EP and they have external initializers, the dumped model still points to the old external data file. It does not make sense for the newly generated model to still point to the old external data file.
Example: a model has nodes A, B, C, D, which all have external initializers in ext.bin, so ext.bin contains data for A, B, C, D.
After dumping the EP context model, node A is on CPU, and nodes B, C, D are on the EP and dumped as an EPContext node. If A's data is still in ext.bin, then the newly generated model has to depend on the old ext.bin, which contains all external data for the old model and is a big overhead.
Fix:
For the newly generated model, the user should have the option to specify a new external data file, so that the newly generated model either packs all initializers into the Onnx model or has all initializers in the new external data file.
Add the option ep.context_model_external_initializers_file_name to specify the new external data file and size threshold. All initializers will be placed in the external data file if the option is specified; otherwise all initializers will be embedded inside the EP context Onnx model.
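For context, a rough sketch of how such a session configuration entry could be set through the C++ API; the `ep.context_enable` key and the value shown for the new option are illustrative assumptions, so check the ORT documentation for the exact keys and value format (including how the size threshold is expressed).
```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::SessionOptions so;
  // Enable EP context model dumping and direct initializers of the newly
  // generated model into a new external data file. The file name below is
  // illustrative; consult the ORT docs for the full set of EP-context options.
  so.AddConfigEntry("ep.context_enable", "1");
  so.AddConfigEntry("ep.context_model_external_initializers_file_name",
                    "new_model_ext.bin");
  return 0;
}
```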
### Motivation and Context
Fix the issue https://github.com/microsoft/onnxruntime/issues/23358
### Description
Allow importing the `.mjs` and `.wasm` files.
When using Vite, this enables a web app to consume ORT-web with a simpler
setup:
```js
import * as ort from 'onnxruntime-web';
import wasmFileUrl from 'onnxruntime-web/.wasm?url';
ort.env.wasm.wasmPaths = { wasm: wasmFileUrl };
```
### Description
- Add a new build flag in build.py to build onnxruntime.dll supporting
interfaces for all primary EPs (QNN, TensorRT, OpenVINO, VitisAI).
- Modify onnxruntime.dll/onnxruntime_shared.dll build settings to remove
the dependency on the IHV SDK toolset being installed on the system.
- Change CMake variables to be explicit about building an EP vs. ORT, e.g.
onnxruntime_USE_TENSORRT vs. onnxruntime_USE_TENSORRT_INTERFACE, to
evolve the build system to build ORT independently of EPs.
### Motivation and Context
Changes in the build system required to evolve the repo to build the
components independently while removing unnecessary dependencies
---------
Co-authored-by: Lei Cao <jslhcl@gmail.com>
Co-authored-by: Karim Vadsariya <kvadsariya@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Delete Prefast workflow until the build failure is fixed
### Motivation and Context
Right now the pipelines are failing due to an environment change from
GitHub.
### Description
Added gradient computation support for the GlobalMaxPool node.
### Motivation and Context
Improve the training capabilities of ONNX Runtime.
### Description
Remove thrust::unary_function which is deprecated in later versions of
CUDA.
### Motivation and Context
Addresses issue: https://github.com/microsoft/onnxruntime/issues/23499
### Description
This PR updates the version of Dawn to
`b9b4a37041dec3dd62ac92014a6cc1aece48d9f3` (ref:
[chromium](67f86f01dd/DEPS (399)))
in the `deps.txt` file.
The newer version of Dawn includes the previous changes from dawn.patch,
so we can remove the patch file.
There are a few interface changes, and the code is updated correspondingly.
### Description
This change avoids creating a copy of the loop variable. GCC 13.3 suggests
using a reference type to prevent copying.
### Motivation and Context
While building onnxruntime 1.20.1 with the latest changes using GCC 13.3, I
get a build error like:
```
onnxruntime-1.20.1/onnxruntime/core/optimizer/selectors_actions/selector_action_transformer.cc: In function 'onnxruntime::common::Status onnxruntime::MatchAndProcess(Graph&, const GraphViewer&, Node&, bool&, const logging::Logger&, const std::string&, const SelectorActionRegistry&, const SatRuntimeOptimizationSaveContext*)':
onnxruntime-1.20.1/onnxruntime/core/optimizer/selectors_actions/selector_action_transformer.cc:150:23: error: loop variable 'op_schema' creates a copy from type 'const gsl::not_null<const onnx::OpSchema*>' [-Werror=range-loop-construct]
150 | for (const auto op_schema : action_saved_state.produced_node_op_schemas) {
| ^~~~~~~~~
onnxruntime-1.20.1/onnxruntime/core/optimizer/selectors_actions/selector_action_transformer.cc:150:23: note: use reference type to prevent copying
150 | for (const auto op_schema : action_saved_state.produced_node_op_schemas) {
| ^~~~~~~~~
| &
```
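A minimal illustration of the fix the compiler note suggests (stand-in types, not the actual ORT code):
```cpp
#include <vector>

struct OpSchemaRef { const void* schema; };  // stand-in for gsl::not_null<const OpSchema*>

void Process(const std::vector<OpSchemaRef>& produced_node_op_schemas) {
  // Before: `const auto op_schema` copied the wrapper on every iteration and
  // trips -Werror=range-loop-construct on GCC 13.3.
  // After: bind by const reference, as the compiler note suggests.
  for (const auto& op_schema : produced_node_op_schemas) {
    (void)op_schema;  // ... use op_schema ...
  }
}
```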
### Description
Adds the new System.Numerics.Tensors as an input/output type when using
dotnet 8.0 and up. It does not change or remove any of the existing APIs;
it only adds additional ones.
### Motivation and Context
Now that C#/Dotnet has an official tensor type built into the language,
we want to expand the places that it can be used.
### Description
Fix shape infer of onnx GroupNorm.
### Motivation and Context
Unable to run shape inference for onnx `GroupNorm`.
[model.onnx](https://raw.githubusercontent.com/onnx/onnx/refs/heads/main/onnx/backend/test/data/node/test_group_normalization_example/model.onnx)
> python D:\source\cognition\onnxruntime\onnxruntime\python\tools\symbolic_shape_infer.py --input model.onnx
Traceback (most recent call last):
  File "D:\source\cognition\onnxruntime\onnxruntime\python\tools\symbolic_shape_infer.py", line 2999, in <module>
    out_mp = SymbolicShapeInference.infer_shapes(
  File "D:\source\cognition\onnxruntime\onnxruntime\python\tools\symbolic_shape_infer.py", line 2935, in infer_shapes
    raise Exception("Incomplete symbolic shape inference")
### Description
Enable coremltools for Linux build. In order to do this, I did:
1. Add uuid-devel to the Linux images and regenerate them.
2. Patch the coremltools code a little bit to add some missing header
files.
### Motivation and Context
To make the code simpler. Later on I will create another PR to remove
the COREML_ENABLE_MLPROGRAM C/C++ macro.
Also, after this PR I will bring more changes to
onnxruntime_provider_coreml.cmake to make it work with vcpkg.
Microsoft.ML.OnnxRuntime is not built with the Release configuration but
with RelWithDebInfo, which is not recognized by the MSBuild SDK. Consequently,
the optimizations are not enabled. A fix would be to simply force the
configuration to Release when building the .NET code even if it was
set to RelWithDebInfo in the command-line arguments, but I could not find
an easy way to do that. Instead, I try to mimic the behavior of the
Release configuration by setting the optimize property.
I can see a 15% performance improvement using this simple model summing
up the 3 inputs:
```csharp
using System.Buffers;
using System.Collections.Frozen;
using System.Net;
using System.Net.Sockets;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;
using Microsoft.ML.OnnxRuntime;
var config = DefaultConfig.Instance; //.WithOptions(ConfigOptions.DisableOptimizationsValidator);
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);
public class OnnxBench
{
    private const int Iterations = 100_000;
    private const int BatchSize = 50;

    private InferenceSession _session = default!;
    private string[] _inputNames = default!;
    private OrtValue[] _inputValues = default!;
    private RunOptions _runOptions = default!;

    [GlobalSetup]
    public void GlobalSetup()
    {
        using SessionOptions sessionOptions = new();
        sessionOptions.InterOpNumThreads = 1;
        sessionOptions.IntraOpNumThreads = 1;
        sessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
        sessionOptions.ExecutionMode = ExecutionMode.ORT_SEQUENTIAL;
        _session = new InferenceSession(
Convert.FromBase64String("CAo6cAoOCgFBCgFCEgFEIgNBZGQKDgoBQwoBRBIBWCIDQWRkEgJscloRCgFBEgwKCggBEgYKAAoCCAFaEQoBQhIMCgoIARIGCgAKAggBWhEKAUMSDAoKCAESBgoACgIIAWIRCgFYEgwKCggBEgYKAAoCCAFCBAoAEBU="),
            sessionOptions);
        _inputNames = ["A", "B", "C"];
        _inputValues =
        [
            OrtValue.CreateTensorValueFromMemory(new float[BatchSize], [BatchSize, 1]),
            OrtValue.CreateTensorValueFromMemory(new float[BatchSize], [BatchSize, 1]),
            OrtValue.CreateTensorValueFromMemory(new float[BatchSize], [BatchSize, 1]),
        ];
        _runOptions = new RunOptions();
    }

    [Benchmark(OperationsPerInvoke = Iterations)]
    public float Run()
    {
        var inputValues0Span = _inputValues[0].GetTensorMutableDataAsSpan<float>();
        var inputValues1Span = _inputValues[1].GetTensorMutableDataAsSpan<float>();
        var inputValues2Span = _inputValues[2].GetTensorMutableDataAsSpan<float>();
        for (int i = 0; i < BatchSize; i += 1)
        {
            inputValues0Span[i] = Random.Shared.NextSingle();
            inputValues1Span[i] = Random.Shared.NextSingle();
            inputValues2Span[i] = Random.Shared.NextSingle();
        }

        float sum = 0f;
        for (int i = 0; i < Iterations; i += 1)
        {
            using var output = _session.Run(_runOptions, _inputNames, _inputValues, _session.OutputNames);
            ReadOnlySpan<float> outputData = output[0].GetTensorDataAsSpan<float>();
            for (int j = 0; j < outputData.Length; j += 1)
            {
                sum += outputData[j];
            }
        }
        return sum;
    }
}
```
| Method | Mean | Error | StdDev |
|------- |---------:|----------:|----------:|
| Before | 5.003 us | 0.0318 us | 0.0297 us |
| After | 4.325 us | 0.0568 us | 0.0503 us |
Fix #16203
Prior to this PR, if `ceil_mode` is on, the calculation of a value
would divide by the kernel size even when the number of remaining pixels is
less than the kernel size, which causes the difference in this operator
between ORT and torch.
However, this fix only applies to the change in #15597, which only
supports AvgPool since opset 19. Older opset versions remain the same,
as they use the mlas files.
Also, the PR fixes the shape mismatch caused by the sliding window starting
from padding. More detail: https://github.com/onnx/onnx/pull/6650 (this
PR is also validated with the tests added in
https://github.com/onnx/onnx/pull/6650).
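A simplified 1-D sketch (not the actual ORT kernel) of the corrected averaging behavior described above:
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified sketch: when ceil_mode lets the last window run past the input,
// divide by the number of elements actually inside the input instead of the
// full kernel size. Output-length bookkeeping is deliberately simplified.
std::vector<float> AvgPool1D(const std::vector<float>& x, size_t kernel, size_t stride) {
  std::vector<float> y;
  for (size_t start = 0; start < x.size(); start += stride) {
    const size_t end = std::min(start + kernel, x.size());  // clip window at the edge
    float sum = 0.f;
    for (size_t i = start; i < end; ++i) sum += x[i];
    y.push_back(sum / static_cast<float>(end - start));  // divide by the valid count
  }
  return y;
}
```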
### Description
Adds `from __future__ import annotations` to the Python script to support
newer annotation syntax on Python 3.8.
### Motivation and Context
The pipeline that runs this script uses Ubuntu 20.04's default Python
version (3.8), which does not support the newer annotation syntax unless it
is imported from __future__.
### Description
Fixes QNN EP builds due to missing function in provider bridge API:
`logging::LoggingManager::HasDefaultLogger()`
### Motivation and Context
A [recent PR](https://github.com/microsoft/onnxruntime/pull/23120) made
QNN EP a shared library. A [different
PR](https://github.com/microsoft/onnxruntime/pull/23435) added use of a
new function to QNN EP that was not part of the provider bridge API. The
CI did not catch it because main was not merged into the first PR before
merging.
### Description
- Makes QNN EP a shared library **by default** when building with
`--use_qnn` or `--use_qnn shared_lib`. Generates the following build
artifacts:
- **Windows**: `onnxruntime_providers_qnn.dll` and
`onnxruntime_providers_shared.dll`
- **Linux**: `libonnxruntime_providers_qnn.so` and
`libonnxruntime_providers_shared.so`
- **Android**: Not supported. Must build QNN EP as a static library.
- Allows QNN EP to still be built as a static library with `--use_qnn
static_lib`. This is primarily for the Android QNN AAR package.
- Unit tests run for both the static and shared QNN EP builds.
### Detailed changes
- Updates Java bindings to support both shared and static QNN EP builds.
- Provider bridge API:
- Adds logging sink ETW to the provider bridge. Allows EPs to register
ETW callbacks for ORT logging.
- Adds a variety of methods for onnxruntime objects that are needed by
QNN EP.
- QNN EP:
- Adds `ort_api.h` and `ort_api.cc` that encapsulates the API provided
by ORT in a manner that allows the EP to be built as either a shared or
static library.
- Adds custom function to transpose weights for Conv and Gemm (instead
of adding util to provider bridge API).
- Adds custom function to quantize data for LeakyRelu (instead of adding
util to provider bridge API).
- Adds custom ETW tracing for QNN profiling events:
- shared library: defines its own TraceLogging provider handle
- static library: uses ORT's TraceLogging provider handle and existing
telemetry provider.
- ORT-QNN Packages:
- **Python**: Pipelines build QNN EP as a shared library by default.
User can build a local python wheel with QNN EP as a static library by
passing `--use_qnn static_lib`.
- **NuGet**: Pipelines build QNN EP as a shared library by default.
`build.py` currently enforces that QNN EP is built as a shared library.
Support for building a QNN NuGet package with a static QNN EP can be added
later if deemed necessary.
- **Android**: Pipelines build QNN EP as a **static library**.
`build.py` enforces QNN EP to be built as a static library. Packaging
multiple shared libraries into an Android AAR package is not currently
supported due to the added need to also distribute a shared libcpp.so
library.
### Motivation and Context
### Description
Add custom vcpkg ports for the following packages:
1. cpuinfo
2. onnx
3. pthreadpool
4. xnnpack
Because:
- The cpuinfo/pthreadpool/xnnpack packages in the official vcpkg repo
are too old.
- XNNPack's version is updated from 2022-12-22 to 2025-01-17
- CPUINFO's version is updated from 2022-07-19 to 2024-12-09
- Pthreadpool's version is updated from 2020-04-10 to 2024-12-17, and
the source code location is changed from
https://github.com/Maratyszcza/pthreadpool to
https://github.com/google/pthreadpool
- The onnx package in the official repo requires building python from
source, which then requires a lot of additional dependencies to be
installed. This PR removes them.
- Added a disable_gcc_warning.patch file for xnnpack for addressing the
issue reported in https://github.com/google/XNNPACK/issues/7650. I will
remove this patch when the issue is fully addressed.
- Added " -DONNX_DISABLE_STATIC_REGISTRATION=ON" to ONNX's config
options.
-
### Description
This PR updates the triplets files that manage the compile flags for
vcpkg packages.
All the changes are autogenerated except for the gen.py file in this PR.
Main changes:
1. Enable debug info for all Linux build configs (Release and Debug).
2. Set CMAKE_CXX_STANDARD in each triplet. The value is set to 20 for
macOS targets and 17 for the others.
3. Only set _FORTIFY_SOURCE in release build. This is to address a build
issue on some platforms with the following glibc change:
"Warn if user requests __FORTIFY_SOURCE but it is disabled"
https://sourceware.org/git/?p=glibc.git;a=commit;f=include/features.h;h=05c2c9618f583ea4acd69b3fe5ae2a2922dd2ddc
### Motivation and Context
Address a Linux build error.
### Description
Add test project that will perform an automated UI test that runs the
unit tests on Android.
### Motivation
- Enables end-to-end on-device MAUI unit testing which we want to add to
the packaging pipelines
### Context
Microsoft.ML.OnnxRuntime.Tests.MAUI uses DeviceRunners.VisualRunners to
allow running the unit tests (found in
Microsoft.ML.OnnxRuntime.Tests.Common) across multiple devices.
DeviceRunners.VisualRunners provides a simple UI with a button that will
run the unit tests and a panel with the unit test results.
In order to automate the process of running the unit tests across mobile
devices, Appium is used for UI testing orchestration (it provides a way
to interact with the UI), and BrowserStack automatically runs these
Appium tests across different mobile devices.
This project does not include the capability to start an Appium server
locally or attach to a local emulator or device.
## Build & run instructions
### Requirements
* A BrowserStack account with access to App Automate
* You can set BrowserStack credentials as environment variables as shown
[here](https://www.browserstack.com/docs/app-automate/appium/getting-started/c-sharp/nunit/integrate-your-tests#CLI)
* ONNXRuntime NuGet package
1. You can either download the [stable NuGet
package](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime) then
follow the instructions from [NativeLibraryInclude.props
file](../Microsoft.ML.OnnxRuntime.Tests.Common/NativeLibraryInclude.props)
to use the downloaded .nupkg file
2. Or follow the [build
instructions](https://onnxruntime.ai/docs/build/android.html) to build
the Android package locally
* The dotnet workloads for maui and maui-android, which do not always
install correctly automatically
1. `dotnet workload install maui`
2. `dotnet workload install maui-android`
* [Appium](https://appium.io/docs/en/latest/quickstart/) and the
[UiAutomator2
driver](https://appium.io/docs/en/latest/quickstart/uiauto2-driver/)
### Run instructions
1. Build the Microsoft.ML.OnnxRuntime.Tests.MAUI project into a signed
APK.
1. Run the following: `dotnet publish -c Release -f net8.0-android` in
the Microsoft.ML.OnnxRuntime.Tests.MAUI directory.
2. Search for the APK files generated. They should be located in
`bin\Release\net8.0-android\publish`.
3. If they're in a different location, edit the `browserstack.yml` file
to target the path to the signed APK.
2. Ensure you've set the BrowserStack credentials as environment
variables.
3. Run the following in the
Microsoft.ML.OnnxRuntime.Tests.Android.BrowserStack directory: `dotnet
test`
4. Navigate to the [BrowserStack App Automate
dashboard](https://app-automate.browserstack.com/dashboard/v2/builds) to
see your test running!
BUG #23273
This PR does below optimizations:
1. When the number of output channels is one, 1) calculate the offset before the
in-channel loop to reduce index-to-offset calculations, and 2) split
`inputChannelsPerGroup` into `inputChannelsPerGroupInt` and
`inputChannelsRemainder` parts so that we can always access 4 values at a time
for `inputChannelsPerGroupInt` (see the sketch after the perf numbers below).
2. Use a precise initial value to reduce useless loop iterations. Thanks to
@jiangzhaoming for the suggestion on this.
With this PR, ConvTranspose goes from 8.4s to 3.7s on Intel Meteor Lake.
On an NV RTX 2000 Ada, it goes from 2.7s to 1.6s.
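A hedged C++-flavored sketch of the split described in optimization 1 (the real code is a WGSL shader operating on vec4 data; the names follow the description above):
```cpp
#include <cstddef>

// Sketch only: split a channel loop into a part that always reads 4 values at
// a time (inputChannelsPerGroupInt) plus a 0-3 element remainder
// (inputChannelsRemainder), so the main loop never needs bounds handling.
float DotSplit(const float* a, const float* b, size_t inputChannelsPerGroup) {
  const size_t inputChannelsPerGroupInt = inputChannelsPerGroup & ~size_t{3};
  const size_t inputChannelsRemainder = inputChannelsPerGroup - inputChannelsPerGroupInt;
  float sum = 0.f;
  // Main part: always safe to read 4 values at a time.
  for (size_t i = 0; i < inputChannelsPerGroupInt; i += 4) {
    sum += a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
  }
  // Remainder part: handle the last 0-3 channels individually.
  for (size_t i = 0; i < inputChannelsRemainder; ++i) {
    sum += a[inputChannelsPerGroupInt + i] * b[inputChannelsPerGroupInt + i];
  }
  return sum;
}
```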
### Description
Use onnx_protobuf.h to suppress some GCC warnings.
All the changes are autogenerated by a shell command.
```bash
find . -type f -exec sed -i 's/#include\s\+<onnx\/onnx_pb.h>/#include "core\/graph\/onnx_protobuf.h"/g' {} \;
```
### Motivation and Context
This PR is needed for making vcpkg work (without disabling all warnings).
This PR is split from another bigger PR per request from a reviewer.
### Description
Suppress some strict-aliasing related warnings in WebGPU EP
For example:
```
/home/chasun/src/onnxruntime/onnxruntime/core/providers/webgpu/math/unary_elementwise_ops.cc:208:30: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
208 | float encoded_value = *reinterpret_cast<const float*>(attr);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This PR does not really fix the problems; it just suppresses the
warnings to make the build pass. Some issues related to strict aliasing could
be fixed by using std::bit_cast, which however requires C++20.
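For reference, the usual well-defined alternatives look like the sketch below; this is not what the PR does (it only suppresses the warnings), just the direction hinted at above:
```cpp
#include <cstdint>
#include <cstring>

// Well-defined ways to reinterpret the bytes of a 32-bit value as float,
// instead of dereferencing a type-punned pointer:
float DecodeAttrMemcpy(uint32_t attr) {
  float value;
  std::memcpy(&value, &attr, sizeof(value));  // allowed by the aliasing rules
  return value;
}

// C++20 only, which is why the PR cannot use it yet:
// float DecodeAttrBitCast(uint32_t attr) { return std::bit_cast<float>(attr); }
```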
### Motivation and Context
Building the code on Azure Linux 3 fails. To reproduce the issue, you may
get an Azure Linux 3 machine and run:
```
python3 tools/ci_build/build.py --update --build --build_wheel --use_xnnpack --build_nodejs --use_webgpu --build_dir b --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --config Release
```
The WebNN CPU device type may now target different backends, such as
CoreML. Legacy special workarounds for the TFLite backend should be
removed, and any resulting failures should be allowed to surface as is, since
these are implementation issues. Additionally, the WebNN EP should adhere to
WebNN API conformance. We assume all the WebNN ops should be supported, so the
per-device-type WebNN op support status in webnn-operators.md is removed as well.
### Description
### Motivation and Context
### Description
Re-implementation of https://github.com/microsoft/onnxruntime/pull/23320
(which was reverted).
- Cleans up QNN logging resources if an error occurs during
initialization.
- Updates `QnnLogging()`, which is a logging callback called by QNN
libs, to handle situations in which ORT logging is unavailable, thus
avoiding a segmentation fault.
- Updates `QnnBackendManager::CreateHtpPowerCfgId()` and
`QnnBackendManager::SetHtpPowerConfig()` to check that backend setup is
complete. These functions get called in QNN EP's `OnRunStart()` even if
QNN backend setup failed and the model is assigned to a different EP.
This prevents a segmentation fault. Our Android tests ran into this
issue because the QNN backend setup failed, the model was then assigned
to CPU EP, and the QNN EP's `OnRunStart()` was still called with an
invalid backend.
### Motivation and Context
If QNN initialization fails at any point, we have to properly clean up
the logging resources so that QNN does not call our `QnnLogging()`
callback after the EP has been destroyed.
Bumps [clang-format](https://github.com/ssciwr/clang-format-wheel) from
19.1.6 to 19.1.7.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="f865928dd2"><code>f865928</code></a>
Bump to v19.1.7</li>
<li>See full diff in <a
href="https://github.com/ssciwr/clang-format-wheel/compare/v19.1.6...v19.1.7">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Moving the Android E2E test steps from macOS 13 to Ubuntu 22.04.
### Motivation and Context
Reduced the dependency on macOS, which is deprecating the x64 version.