### Description
This PR enables onnxruntime to build with the most recent release of Arm
Compute Library
### Motivation and Context
The latest version of Arm Compute Library that onnxruntime can build against is 20.02, which is more than three years old.
### Description
Add heterogeneous support so that this check is skipped for TRT plugins, which can have inputs with different tensor types.
### Description
- Support more test cases for the WebNN EP in suite-test-list.jsonc
- Add a DISABLE_WEBNN flag in build.ts in preparation for the WebNN EP release
- Add a test option, '--webnn-device-type', in test-runner-args-cli.ts to support running WebNN with the 'gpu' deviceType
- Use Chrome Stable as the default browser for WebNN testing to work around a CI limitation
### Description
Add support code for the loongarch64 platform in sqnbitgemm.
```
100% tests passed, 0 tests failed out of 7
Total Test time (real) = 116.99 sec
2023-12-11 10:43:21,287 build [INFO] - Build complete
```
### Description
1. Remove Windows ARM32 from nuget packaging pipelines
2. Add missing component-governance-component-detection-steps.yml to
some build jobs.
### Motivation and Context
Stop supporting Windows ARM32 to align with [Windows' support policy](https://learn.microsoft.com/en-us/windows/arm/arm32-to-arm64). Users who need this feature can still build the DLLs from source; however, we will remove that support later as well.
### Description
Use batchNormalization, layerNormalization, and instanceNormalization instead of meanVarianceNormalization to implement the normalization ops, since meanVarianceNormalization has been deleted from the spec.
Also remove groupNormalization.
### Description
This addresses a 32-bit build error affecting the packaging pipeline.
### Description
- Removes the `--disable_ml_ops` build flag
- Automatically detects the ORT version from the VERSION file via `templates/set-version-number-variables-step.yml`, so we no longer need to create a commit to update ORT versions.
### Motivation and Context
- A new unit test caused failures in the QNN Nuget pipeline because that pipeline did not enable ML ops.
- Automate ORT version specification.
### Description
The FusedConv operator for the ROCm EP could fail to compile the fused operation, in which case it should not attempt to use the failed fusion plan. In addition, the hash for the miopenConvolutionDescriptor_t on newer ROCm versions was failing to use all components of the descriptor.
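For illustration, a sketch of hashing every component of a descriptor, in the spirit of the fix; the `ConvDescriptorView` fields and the `HashCombine` helper are hypothetical stand-ins, not MIOpen's actual miopenConvolutionDescriptor_t layout:
```cpp
#include <cstddef>
#include <functional>
#include <initializer_list>

// Generic hash-combine, similar in spirit to boost::hash_combine.
inline void HashCombine(std::size_t& seed, std::size_t value) {
  seed ^= value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2);
}

struct ConvDescriptorView {  // hypothetical stand-in for the descriptor
  int pad_h, pad_w;
  int stride_h, stride_w;
  int dilation_h, dilation_w;
  int group_count;  // a newer component that must not be skipped
};

std::size_t HashConvDescriptor(const ConvDescriptorView& d) {
  std::size_t seed = 0;
  // Every component participates; skipping one (e.g. group_count) lets
  // distinct descriptors collide, which is the bug described above.
  for (int v : {d.pad_h, d.pad_w, d.stride_h, d.stride_w,
                d.dilation_h, d.dilation_w, d.group_count}) {
    HashCombine(seed, std::hash<int>{}(v));
  }
  return seed;
}
```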
### Description
Also update the op test suite.
### Motivation and Context
Previously, the *total* size in the case `Expand - last dim is not divisible by 4` was a multiple of 4 even though the *last dimension* was not (for example, a shape of [4, 3] has a last dimension of 3 but 12 elements in total), so the bug was never caught.
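As a sketch of the shape arithmetic (these exact shapes are illustrative, not taken from the test suite):
```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

std::int64_t TotalSize(const std::vector<std::int64_t>& shape) {
  return std::accumulate(shape.begin(), shape.end(), std::int64_t{1},
                         std::multiplies<std::int64_t>());
}

int main() {
  // Old-style case: last dim 3 is not divisible by 4, but the total (12)
  // is, so the vectorized path could still be taken without failing.
  const std::vector<std::int64_t> old_case{4, 3};
  // Fixed-style case: neither the last dim (3) nor the total (9) is a
  // multiple of 4, so the non-vectorized path is genuinely exercised.
  const std::vector<std::int64_t> new_case{3, 3};
  std::cout << "old: total=" << TotalSize(old_case)
            << ", last=" << old_case.back() << "\n"
            << "new: total=" << TotalSize(new_case)
            << ", last=" << new_case.back() << "\n";
}
```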
### Description
Change all macOS python packages to use universal2, to reduce the number
of packages we have.
### Motivation and Context
According to [Wikipedia](https://en.wikipedia.org/wiki/MacOS_Big_Sur), macOS 11 is the first macOS version that supports universal2, and it is the minimum macOS version we support, so we no longer need to maintain separate binaries for different CPU architectures.
### Description
ReduceMax/ReduceMin have been updated in ONNX opset 20; implement the updated ops in ORT.
### Motivation and Context
This is for the ORT 1.17.0 release.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
- Add mutex to protect QNN API calls for executing a graph and
extracting the corresponding profile data.
- Ensures QNN EP's execute function does not store unnecessary state (i.e., input and output buffer pointers do not need to be stored as class members).
### Motivation and Context
Allow calling `session.Run()` from multiple threads when using QNN EP.
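A minimal sketch of the resulting pattern, with hypothetical names rather than the actual QNN EP types: one mutex serializes graph execution and profile extraction, and buffer pointers stay local to the call.
```cpp
#include <mutex>
#include <vector>

class QnnGraphRunner {
 public:
  bool Execute(const std::vector<const void*>& inputs,
               const std::vector<void*>& outputs) {
    // Holding the lock across both steps keeps concurrent session.Run()
    // calls from interleaving execution with profile extraction.
    std::lock_guard<std::mutex> guard(qnn_api_mutex_);
    const bool ok = RunGraph(inputs, outputs);
    if (ok) {
      ExtractProfileData();
    }
    return ok;
  }

 private:
  // Placeholders standing in for the underlying QNN API calls.
  bool RunGraph(const std::vector<const void*>&,
                const std::vector<void*>&) { return true; }
  void ExtractProfileData() {}
  std::mutex qnn_api_mutex_;
};
```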
### Description
onnxruntime may raise a "type inference failed" error when a custom operator sets IsHomogeneous to false in its schema. This change makes sure that the TypeInferenceFunction and the schema's type constraints are aligned to prevent that from happening.
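As an illustration of the invariant, here is a sketch using the ONNX OpSchema API (`MyCustomOp` and its types are hypothetical, and the custom-operator registration itself differs): the inference function must accept every type the constraint declares rather than assuming a single fixed type.
```cpp
#include "onnx/defs/schema.h"
#include "onnx/defs/shape_inference.h"

using namespace ONNX_NAMESPACE;

ONNX_OPERATOR_SCHEMA(MyCustomOp)
    .Input(0, "X", "input tensor", "T")
    .Output(0, "Y", "output tensor", "T")
    // The constraint lists every element type the op accepts...
    .TypeConstraint("T", {"tensor(float)", "tensor(double)"},
                    "Allowed input/output types.")
    // ...and the inference function agrees with it by propagating
    // whichever of those types the input actually has.
    .TypeAndShapeInferenceFunction([](InferenceContext& ctx) {
      propagateElemTypeFromInputToOutput(ctx, 0, 0);
    });
```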
---------
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
### Description
A replacement of #18683; tries to resolve #18689.
Specifying the "-s PTHREAD_POOL_SIZE" flag in Emscripten forces the thread pool to initialize before the WebAssembly instance is available.
## Description
This pull request improves the efficiency of inference session creation by eliminating unnecessary `Node::ToProto` invocations, along with their corresponding `~NodeProto` destructor calls.
## Motivation and Context
This pull request targets low-hanging fruit in the inference session creation path. Removing the redundant `Node::ToProto` calls streamlines the code and improves overall performance; the flame graphs show a notable drop in the share of time spent in `Node::ToProto`.
### Code Snippet
```cpp
TEST(InferenceSessionTests, Bench) {
  // Initialize logging manager
  auto logging_manager = std::make_unique<logging::LoggingManager>(
      std::unique_ptr<ISink>(new CLogSink()), logging::Severity::kVERBOSE, false,
      LoggingManager::InstanceType::Temporal);

  // Create environment
  std::unique_ptr<Environment> env;
  auto st = Environment::Create(std::move(logging_manager), env);
  ASSERT_TRUE(st.IsOK());

  // Configure session options
  SessionOptions so;
  so.execution_mode = ExecutionMode::ORT_SEQUENTIAL;
  so.graph_optimization_level = TransformerLevel::Level2;
  so.intra_op_param.thread_pool_size = 1;

  // Initialize and load the InferenceSession
  InferenceSessionTestGlobalThreadPools session1{so, *env};
  ASSERT_STATUS_OK(session1.Load("big.onnx"));
  ASSERT_STATUS_OK(session1.Initialize());
}
```
### `big.onnx` model creation
```python
import onnx
import numpy as np
from spox import argument, build, Tensor, Var
from spox.opset.ai.onnx import v17 as op
from spox.opset.ai.onnx.ml.v3 import label_encoder

a = argument(Tensor(np.int64, ('N',)))
c = a
for x in range(1000):
    c = op.mul(c, op.const(np.ones(10000, dtype=np.int64)))
for x in range(3000):
    all_strings = list("random_string" + str(i) for i in range(100))
    all_ints = list(range(len(all_strings)))
    c = label_encoder(
        c,
        keys_int64s=all_ints,
        values_strings=all_strings
    )
    c = label_encoder(c, keys_strings=all_strings, values_int64s=all_ints)

model: onnx.ModelProto = build(inputs={'a': a}, outputs={'c': c})
onnx.save(model, "big.onnx")
```
Testing in `Release` with `perf` yields:
Before: 3.3% spent in `Node::ToProto`
After: 1.6% spent in `Node::ToProto`
---------
Co-authored-by: Atanas Dimitrov <atanasdimitrov@Atanass-MacBook-Pro.local>
Fixed a small spelling error.
### Description
Small spelling error fix.
### Motivation and Context
It is documentation for the product, and it misspells the word
documentation. This reflects on your product and the quality of the
work.
### Description
SessionOptions.use_deterministic_compute can be set via the Python API. Users requested enabling this setting via the C API as well.
### Motivation and Context
#17416
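A sketch of what the C/C++-side usage could look like via a session config entry; the config key string below is an assumption inferred from the Python option name, so check session_options_config_keys.h for the exact key.
```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::SessionOptions so;
  // Python equivalent: sess_options.use_deterministic_compute = True
  // The key string is an assumption, not confirmed against the header.
  so.AddConfigEntry("session.use_deterministic_compute", "1");
  return 0;
}
```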
### Description
This limits the size of the constant data nodes that the DML EP creates in the DML graph following de-duplication of 1D quantization tensors. In the process, it reduces the check for the maximum size of the constant node.
This is merged from: https://github.com/microsoft/onnxruntime/pull/18494
### Description
Cleanup and rebase from [this
PR](https://github.com/microsoft/onnxruntime/pull/18629)
---------
Co-authored-by: Christian Larson <chrilaMSFT@users.noreply.github.com>
Co-authored-by: Christian Larson <28911437+chrilaMSFT@users.noreply.github.com>
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
Co-authored-by: Anagha Rao <anagrao@microsoft.com>
### Description
This addresses a bug, surfaced as a D3D debug layer error, in a fast path that was added for submission of re-used command lists of fused graph kernels in the DML EP.
### Motivation and Context
The fast path in DmlCommandRecorder::ExecuteCommandList enabled a current non-reused command list, if empty, to be used for commands following submission of the fused command list. The fix ensures the associated command allocator is only re-used after the next fence value is completed, which is higher due to submission of the other command list.
The command recorder design was intended to support batching of provided command list executions; however, it submits command lists immediately as an implementation detail to maximize CPU/GPU parallelism. If that heuristic were removed, it would expose additional issues in this same fast path. Because of this, and because of the complexity and inefficiency of the old batching mechanism, I removed the batching as well.
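A minimal sketch of the allocator-reuse rule the fix enforces, using hypothetical types rather than the DML EP's actual classes: an allocator retired after a submission may only be reset once the GPU's completed fence value has reached the fence value that was current after that submission.
```cpp
#include <cstdint>
#include <queue>
#include <utility>

struct CommandAllocator { void Reset() { /* reclaim memory */ } };

class AllocatorRing {
 public:
  // Called right after a command list using `alloc` is submitted;
  // `fence_after_submit` is the queue's next fence value at that point.
  void Retire(CommandAllocator* alloc, std::uint64_t fence_after_submit) {
    retired_.push({fence_after_submit, alloc});
  }

  // Returns an allocator that is provably idle, or nullptr if none is.
  CommandAllocator* TryReuse(std::uint64_t completed_fence_value) {
    if (retired_.empty() || retired_.front().first > completed_fence_value) {
      return nullptr;  // the GPU may still be reading every retired allocator
    }
    CommandAllocator* alloc = retired_.front().second;
    retired_.pop();
    alloc->Reset();
    return alloc;
  }

 private:
  std::queue<std::pair<std::uint64_t, CommandAllocator*>> retired_;
};
```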
### Description
[DirectML EP] Add DML EP registration for Col2Im operator
### Motivation and Context
Add Col2Im support for opset 18.
This operator is implemented as the DirectML Fold operator.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>