### Description
Add Python 3.12 support to ORT.
### Motivation and Context
Disable the createGroupedConvVectorizeProgramInfo path because of bot failures
on the following two cases:
[webgpu]Conv - conv - vectorize group - B
[webgpu]Conv - conv - vectorize group - D
Bumps
[follow-redirects](https://github.com/follow-redirects/follow-redirects)
from 1.15.2 to 1.15.4.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="65858205e5"><code>6585820</code></a>
Release version 1.15.4 of the npm package.</li>
<li><a
href="7a6567e16d"><code>7a6567e</code></a>
Disallow bracketed hostnames.</li>
<li><a
href="05629af696"><code>05629af</code></a>
Prefer native URL instead of deprecated url.parse.</li>
<li><a
href="1cba8e85fa"><code>1cba8e8</code></a>
Prefer native URL instead of legacy url.resolve.</li>
<li><a
href="72bc2a4229"><code>72bc2a4</code></a>
Simplify _processResponse error handling.</li>
<li><a
href="3d42aecdca"><code>3d42aec</code></a>
Add bracket tests.</li>
<li><a
href="bcbb096b32"><code>bcbb096</code></a>
Do not directly set Error properties.</li>
<li><a
href="192dbe7ce6"><code>192dbe7</code></a>
Release version 1.15.3 of the npm package.</li>
<li><a
href="bd8c81e4f3"><code>bd8c81e</code></a>
Fix resource leak on destroy.</li>
<li><a
href="9c728c314b"><code>9c728c3</code></a>
Split linting and testing.</li>
<li>Additional commits viewable in <a
href="https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Fix missing subgraph candidates for recompute
For subgraphs such as `MatMul+Transpose+Reshape`, the ending node is a
Reshape, which reuses its input buffer in ORT.
The subgraph detection logic currently has a defect that causes such
subgraphs to be missed as recompute candidates.
This PR also adds a few more node types to recompute support.
TODO: add unit test later. This PR is needed for a customer model now.
### Description
ORT web prefers to use a global thread pool for all inference sessions.
See how OrtCreateSession is implemented in
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/wasm/api.cc#L183.
Application code can only use the global thread pool. However, internal
testing code still often uses a per-session thread pool. This PR fixes
the inconsistency.
### Motivation and Context
Replace PR #18476
### Description
The copied QDQ node should have exactly the same attributes as the
original QDQ node. Otherwise, errors can occur when the original
node has attributes with non-default values (for example, axis != 1).
An example use case:
a DequantizeLinear node has more than one consumer in the graph, and its
`axis` attribute is 0.
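The duplication rule described above can be sketched in a few lines. This is a minimal illustration, not the ORT transformer code: `Node` and `duplicate_for_consumer` are hypothetical names, and the graph is reduced to plain Python data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a graph node (not an ORT type)."""
    name: str
    op_type: str
    inputs: List[str]
    outputs: List[str]
    attributes: Dict[str, int] = field(default_factory=dict)


def duplicate_for_consumer(original: Node, suffix: str) -> Node:
    """Create a copy of a Q/DQ node that carries over every attribute."""
    return Node(
        name=f"{original.name}_{suffix}",
        op_type=original.op_type,
        inputs=list(original.inputs),
        outputs=[f"{out}_{suffix}" for out in original.outputs],
        # Copy all attributes. Dropping them would silently fall back to
        # the operator default (axis = 1 for DequantizeLinear).
        attributes=dict(original.attributes),
    )


# A DequantizeLinear node with axis = 0 and two consumers gets two copies.
dq = Node("dq0", "DequantizeLinear", ["w_q", "w_scale", "w_zp"], ["w"], {"axis": 0})
copies = [duplicate_for_consumer(dq, f"dup{i}") for i in range(2)]
```

Each copy keeps `axis = 0` and owns its own attribute dictionary, so later edits to one copy do not leak into the original.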
### Motivation and Context
I see errors like
https://github.com/microsoft/onnxruntime/issues/16188
and this fix could resolve the issue.
### Description
This PR provides a vectorized algorithm for NHWC GroupedConv to improve
performance.
The aggregate time of GroupedConv in mobilenetv2-12 drops from ~4ms to
~1ms on an Intel Alder Lake machine, which is about a 20% improvement for
the whole model.
### Description
Fix a memory leak reported by Valgrind:
```bash
192 bytes in 1 blocks are definitely lost in loss record 1,254 of 1,999
at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x4A93FD5: OrtApis::CreateTensorRTProviderOptions(OrtTensorRTProviderOptionsV2**) (in /code/onnxruntime/build/Linux/Release/libonnxruntime.so.1.17.0)
by 0x1502E1: onnxruntime::perftest::OnnxRuntimeTestSession::OnnxRuntimeTestSession(Ort::Env&, std::random_device&, onnxruntime::perftest::PerformanceTestConfig const&, TestModelInfo const&) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
by 0x15A404: onnxruntime::perftest::PerformanceRunner::PerformanceRunner(Ort::Env&, onnxruntime::perftest::PerformanceTestConfig const&, std::random_device&) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
by 0x14C6D9: real_main(int, char**) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
by 0x145A2A: main (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
```
Add a pointer to help release the TRT EP provider options.
### Description
Support INT4 weight-only quantization (WOQ) via Intel Neural Compressor,
with two algorithms: RTN and GPTQ.
**Note:**
Install `neural-compressor==2.3` to use weight-only quantization.
### Motivation and Context
As large language models (LLMs) become more prevalent, there is a
growing need for quantization methods that can meet the
computational demands of these architectures while maintaining
accuracy. Compared to conventional quantization such as W8A8, weight-only
quantization is probably a better trade-off between performance
and accuracy.
RTN (round-to-nearest) is the most straightforward way to quantize weights.
The GPTQ algorithm provides more accurate quantization but requires more
computational resources.
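To make the RTN idea concrete, here is a minimal pure-Python sketch of group-wise asymmetric round-to-nearest quantization. It is a generic illustration and not the Intel Neural Compressor implementation; the function names are hypothetical, and `group_size` and the asymmetric scheme simply echo the notation used in the configuration names.

```python
from typing import List, Tuple


def rtn_quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, int]:
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = (1 << bits) - 1  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:
        scale = 1.0  # constant group: any non-zero scale works
    zero_point = round(-w_min / scale)
    # Round each weight to the nearest level, then clamp to [0, qmax].
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map quantized levels back to floating point."""
    return [(v - zero_point) * scale for v in q]


def rtn_quantize(weights: List[float], group_size: int = 32, bits: int = 4):
    """Quantize a flat weight list group by group."""
    return [
        rtn_quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


weights = [0.1 * i - 1.0 for i in range(64)]
groups = rtn_quantize(weights, group_size=32)
```

Each group stores its own scale and zero point, so a smaller `group_size` tracks the weight distribution more closely at the cost of more metadata. GPTQ goes further by using calibration data to compensate for rounding error, which this sketch does not attempt.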
### Evaluation results
The following table shows the accuracy results of Llama-2 models
evaluated on the [lambada_openai](https://huggingface.co/datasets/lambada)
task. `GPTQ W4G32Asym` in the Configuration column means that the GPTQ
algorithm is used for 4-bit weight-only quantization with group_size=32 and
scheme=asym.
<table class="tg">
<thead>
<tr>
<th rowspan="2">Model name</th>
<th rowspan="2">Configuration</th>
<th colspan="2">Lambada_openai</th>
<th rowspan="2">Accuracy Ratio<br>[WOQ/FP32]</th>
</tr>
<tr>
<th>Accuracy</th>
<th>Perplexity</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">meta-llama/Llama-2-7b-chat-hf</td>
<td>FP32</td>
<td>0.7058</td>
<td>3.2788</td>
<td>/</td>
</tr>
<tr>
<td>GPTQ<br>W4G32Asym</td>
<td>0.7025</td>
<td>3.4489</td>
<td>99.53%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-7b-hf</td>
<td>FP32</td>
<td>0.7392</td>
<td>3.3950</td>
<td>/</td>
</tr>
<tr>
<td>GPTQ<br>W4G32Asym</td>
<td>0.7326</td>
<td>3.5286</td>
<td>99.11%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-13b-chat-hf</td>
<td>FP32</td>
<td>0.7312</td>
<td>2.9163</td>
<td>/</td>
</tr>
<tr>
<td>GPTQ<br>W4G128Asym</td>
<td>0.7289</td>
<td>3.0061</td>
<td>99.56%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-13b-hf</td>
<td>FP32</td>
<td>0.7677</td>
<td>3.0438</td>
<td>/</td>
</tr>
<tr>
<td>GPTQ<br>W4G32Asym</td>
<td>0.7607</td>
<td>3.1562</td>
<td>99.09%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-70b-chat-hf</td>
<td>FP32</td>
<td>0.7543</td>
<td>2.6181</td>
<td>/</td>
</tr>
<tr>
<td>RTN<br>W4G32Sym</td>
<td>0.7489</td>
<td>2.6850</td>
<td>99.28%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-70b-hf</td>
<td>FP32</td>
<td>0.7964</td>
<td>2.6612</td>
<td>/</td>
</tr>
<tr>
<td>RTN<br>W4G32Sym</td>
<td>0.7896</td>
<td>2.7546</td>
<td>99.15%</td>
</tr>
</tbody>
</table>
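
As a worked example of the last column, the snippet below assumes that "Accuracy Ratio [WOQ/FP32]" is the plain quotient of the quantized accuracy over the FP32 accuracy, as the header suggests. The helper name `accuracy_ratio` is hypothetical; the input values are copied from the first two model rows.

```python
def accuracy_ratio(woq_accuracy: float, fp32_accuracy: float) -> str:
    """Return WOQ accuracy relative to FP32, formatted as a percentage."""
    return f"{woq_accuracy / fp32_accuracy:.2%}"


# Accuracy values from the table: (WOQ, FP32).
llama_7b_chat = accuracy_ratio(0.7025, 0.7058)  # meta-llama/Llama-2-7b-chat-hf
llama_7b = accuracy_ratio(0.7326, 0.7392)       # meta-llama/Llama-2-7b-hf
print(llama_7b_chat, llama_7b)  # prints "99.53% 99.11%"
```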
---------
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Co-authored-by: Wang, Mengni <mengni.wang@intel.com>
### Description
QNN_Nuget_Windows does not allow ai.onnx.ml operators, but the test
test_custom_op_local_function uses LabelEncoder. The operator can be
removed because the test only checks the custom ops API.
### Motivation and Context
Fix test test_custom_op_local_function in QNN_Nuget_Windows pipeline.
Fix error:
```
[ 48%] Built target onnxruntime_optimizer
In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5:
/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory
11 | #include "core/providers/rocm/shared_inc/fast_divmod.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```
This error occurs because onnxruntime_optimizer is missing dependencies on
hipify-generated files.
This resolves the following build errors:
```
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS2724: '"./ops/instance-norm"' has no exported member named 'parseInstanceNormAttributes'. Did you mean 'InstanceNormAttributes'?
19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS6133: 'parseInstanceNormAttributes' is declared but its value is never read.
19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS2305: Module '"./ops/layer-norm"' has no exported member 'parseLayerNormAttributes'.
20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS6133: 'parseLayerNormAttributes' is declared but its value is never read.
20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
```
This PR also processes the subgraph's initializers. A subgraph does not
contain all of its required initializers; some common
initializers are stored in its ancestor graphs. We need to collect all
required initializers and re-map them to the subgraph.
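The collection step can be sketched as a walk up the chain of enclosing graphs. This is an illustration under simplifying assumptions, not the actual implementation: `Graph`, `required_inputs`, and `collect_initializers` are hypothetical names, and tensors are reduced to plain lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Graph:
    """Hypothetical, simplified graph: names and values only, no ORT types."""
    initializers: Dict[str, list]
    required_inputs: List[str] = field(default_factory=list)
    parent: Optional["Graph"] = None


def collect_initializers(subgraph: Graph) -> Dict[str, list]:
    """Resolve each required name in the subgraph or its nearest ancestor."""
    resolved: Dict[str, list] = {}
    for name in subgraph.required_inputs:
        graph: Optional[Graph] = subgraph
        while graph is not None:
            if name in graph.initializers:
                resolved[name] = graph.initializers[name]
                break
            graph = graph.parent
    return resolved


main_graph = Graph({"shared_w": [1, 2]})
middle = Graph({"bias": [0]}, parent=main_graph)
sub = Graph({"local": [3]}, ["local", "shared_w", "bias", "x"], parent=middle)

collected = collect_initializers(sub)
# Re-map: make every resolved initializer directly available to the subgraph.
sub.initializers.update(collected)
```

Names that no graph defines, such as `x` here, are left out because they are runtime inputs rather than initializers.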
### Description
This PR enables onnxruntime to build with the most recent release of Arm
Compute Library.
### Motivation and Context
The latest version of Arm Compute Library that onnxruntime can build with is
20.02, which is more than 3 years old.
### Description
Add heterogeneous support so that this check is skipped for TRT plugins that
have different input tensor types.
### Description
- Support more test cases for WebNN EP in suite-test-list.jsonc
- Add DISABLE_WEBNN flag in build.ts in preparation for the WebNN EP release
- Add test option '--webnn-device-type' in test-runner-args-cli.ts to
support running WebNN with the 'gpu' deviceType
- Use Chrome Stable as the default browser for WebNN testing to work around
the CI limitation.
### Description
Add support code for the loongarch64 platform in sqnbitgemm.
```
100% tests passed, 0 tests failed out of 7
Total Test time (real) = 116.99 sec
2023-12-11 10:43:21,287 build [INFO] - Build complete
```
### Description
1. Remove Windows ARM32 from nuget packaging pipelines
2. Add missing component-governance-component-detection-steps.yml to
some build jobs.
### Motivation and Context
Stop supporting Windows ARM32 to align with [Windows's support
policy](https://learn.microsoft.com/en-us/windows/arm/arm32-to-arm64).
Users who need this feature can still build the DLLs from source.
However, we will remove that support later as well.
### Description
### Motivation and Context
Use batchNormalization, layerNormalization and instanceNormalization
instead of meanVarianceNormalization to implement normalization Ops. The
spec of meanVarianceNormalization has been deleted.
Remove groupNormalization.
### Description
This addresses a 32-bit build error affecting the packaging pipeline.
### Description
- Removes `--disable_ml_ops` build flag
- Automatically detects ORT version from VERSION file via
`templates/set-version-number-variables-step.yml`. We will no longer
need to create a commit to update ORT versions.
### Motivation and Context
- A new unit test caused failures in the QNN Nuget pipeline because it
did not enable ml ops.
- Automate ORT version specification
The FusedConv operator for the ROCm EP could fail to compile the fused
operation, in which case it should not attempt to use the failed fusion
plan. In addition, the hash for the miopenConvolutionDescriptor_t for
newer ROCm versions was failing to use all components of the descriptor.
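Both points can be illustrated with a small cache sketch. It is not the ROCm EP code: `ConvDescriptor`, its fields, and `FusionCache` are hypothetical stand-ins, and a plan is reduced to a string. The cache key covers every descriptor component, and a failed compilation is remembered so that the failed plan is never used.

```python
from dataclasses import astuple, dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class ConvDescriptor:
    """Hypothetical stand-in for the components of a convolution descriptor."""
    pads: Tuple[int, ...]
    strides: Tuple[int, ...]
    dilations: Tuple[int, ...]
    groups: int
    mode: str


def descriptor_key(desc: ConvDescriptor) -> tuple:
    # Use every component so that descriptors differing in any field
    # never share a cached plan.
    return astuple(desc)


class FusionCache:
    def __init__(self, compile_plan: Callable[[ConvDescriptor], Optional[str]]):
        self._compile_plan = compile_plan
        self._plans: Dict[tuple, Optional[str]] = {}

    def run(self, desc: ConvDescriptor) -> str:
        key = descriptor_key(desc)
        if key not in self._plans:
            # A failed compilation is stored as None: not retried, not used.
            self._plans[key] = self._compile_plan(desc)
        plan = self._plans[key]
        return f"fused:{plan}" if plan is not None else "unfused fallback"


calls = []


def compile_plan(desc: ConvDescriptor) -> Optional[str]:
    """Toy compiler (assumption): fusion only succeeds for groups == 1."""
    calls.append(desc)
    return "plan" if desc.groups == 1 else None


cache = FusionCache(compile_plan)
ok = ConvDescriptor((1, 1), (1, 1), (1, 1), 1, "conv")
bad = ConvDescriptor((1, 1), (1, 1), (1, 1), 2, "conv")
```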
### Description
Also update the op test suite.
### Motivation and Context
Previously the *total* size in case `Expand - last dim is not divisible
by 4` was a multiple of 4, even though the *last dimension* was not, so
the bug has never been caught.
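The distinction between the two divisibility conditions can be checked with a short helper. The shapes below are illustrative assumptions, not the shapes used in the op test suite, and the vector width of 4 is taken from the test case name.

```python
from math import prod
from typing import Tuple


def masks_last_dim_case(shape: Tuple[int, ...], width: int = 4) -> bool:
    """True when the total size is a multiple of `width` although the
    last dimension is not, which is the situation described above."""
    return shape[-1] % width != 0 and prod(shape) % width == 0


def exercises_last_dim_case(shape: Tuple[int, ...], width: int = 4) -> bool:
    """True when neither the last dimension nor the total size is a
    multiple of `width`."""
    return shape[-1] % width != 0 and prod(shape) % width != 0


masked_shape = (2, 6)     # last dim 6, total 12: total is a multiple of 4
exercised_shape = (3, 5)  # last dim 5, total 15: neither is a multiple of 4
```

A shape like `(2, 6)` looks like a "last dim not divisible by 4" case but still has a total size that is a multiple of 4, so a code path keyed on total size would not be exercised by it.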
### Description
Change all macOS python packages to use universal2, to reduce the number
of packages we have.
### Motivation and Context
According to [Wikipedia](https://en.wikipedia.org/wiki/MacOS_Big_Sur),
macOS 11 is the first macOS version that supports universal2, and it is
the minimum macOS version we support. So we no longer need to maintain
separate binaries for different CPU architectures.
### Description
ReduceMax/ReduceMin have been updated in ONNX (opset 20). Implement the update in ORT.
### Motivation and Context
This is for the ORT 1.17.0 release.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
- Add mutex to protect QNN API calls for executing a graph and
extracting the corresponding profile data.
- Ensure that QNN EP's execute function does not store unnecessary state
(i.e., input and output buffer pointers do not need to be stored as
class members).
### Motivation and Context
Allow calling `session.Run()` from multiple threads when using QNN EP.
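The pattern behind both changes can be sketched with a lock around a non-thread-safe backend. This is an illustration only: `FakeBackend` and `Model` are hypothetical classes that stand in for the QNN API and the EP's model wrapper, and they do not mirror the real interfaces.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


class FakeBackend:
    """Hypothetical non-thread-safe backend standing in for the QNN API."""

    def __init__(self) -> None:
        self._in_call = False
        self._profile_events: List[int] = []

    def execute(self, inputs: List[int]) -> List[int]:
        assert not self._in_call, "concurrent backend call"
        self._in_call = True
        outputs = [x * 2 for x in inputs]
        self._profile_events.append(len(inputs))
        self._in_call = False
        return outputs

    def extract_profile(self) -> List[int]:
        events, self._profile_events = self._profile_events, []
        return events


class Model:
    def __init__(self, backend: FakeBackend) -> None:
        self._backend = backend
        self._lock = threading.Lock()

    def run(self, inputs: List[int]) -> Tuple[List[int], List[int]]:
        # Buffers stay local to the call instead of being stored on self,
        # so concurrent callers cannot overwrite each other's state.
        with self._lock:
            # Execution and profile extraction share one critical section,
            # so each caller reads the profile data of its own run.
            outputs = self._backend.execute(inputs)
            profile = self._backend.extract_profile()
        return outputs, profile


model = Model(FakeBackend())
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(model.run, [[i, i + 1] for i in range(8)]))
```

Holding the lock across both calls serializes graph execution, which trades some parallelism for correctness when the underlying API is not safe to call concurrently.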
### Description
onnxruntime may raise a "type inference failed" error when a custom
operator sets IsHomogeneous to false in its schema. This change makes
sure that TypeInferenceFunction and schema type constraints are aligned
to prevent that from happening.
---------
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>