### Description
In some environments the test code exhibits undefined behavior. To reproduce it, save the following code as
`test.cpp`:
```c++
#include <iostream>
#include <stdio.h>

int main() {
  char buf[1024];
  int ret = snprintf(buf, sizeof(buf), "%ls", "abc");
  if (ret < 0) {
    std::cout << ret << std::endl;
  } else {
    std::cout << "OK: ret=" << ret << std::endl;
  }
  return 0;
}
```
Then compile it as
```
g++ -DNDEBUG -std=gnu++17 test.cpp -o /tmp/t
```
Or
```
g++ -O2 -DNDEBUG -std=gnu++17 test.cpp -o /tmp/t
```
The first command compiles without optimization; the second enables it.
The two binaries produce different output.
With optimization enabled, the output might be:
```
OK: ret=-1
```
There is no way to explain why the `else` branch is taken when `ret` is
-1. It might be a bug in a specific version of GCC, but at the moment we
cannot change the compiler version. The problem was found with GCC 8.5.0
20210514 (Red Hat 8.5.0-18), which is provided by UBI8; RHEL9 does not
have the problem. `snprintf` is a GCC builtin function, so the problem is
not related to glibc.
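One reading of the reproducer above is that `%ls` is paired with a narrow string literal, whereas the C standard specifies a `wchar_t` pointer argument for `%ls`. The following is a minimal sketch of the two matching pairings, not the change made in this PR; the helper names `format_narrow` and `format_wide` are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: "%s" matches a narrow (char) string argument.
// Returns what snprintf returns: the number of characters that would have
// been written (excluding the terminating null), or a negative value on error.
int format_narrow(char* buf, std::size_t size, const char* text) {
  return std::snprintf(buf, size, "%s", text);
}

// Hypothetical helper: "%ls" matches a wide (wchar_t) string argument, which
// snprintf converts to multibyte characters using the current locale.
int format_wide(char* buf, std::size_t size, const wchar_t* text) {
  return std::snprintf(buf, size, "%ls", text);
}
```

With either helper the argument type agrees with the conversion specifier, so the return value can be checked against zero in the usual way.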
### Description
Whenever a QuantizeLinear or DequantizeLinear node is created, the type
of the weights before quantization must be known in order to create the
scale with the expected type. Another option would be to add many
CastLike operators, but that would push the burden onto the onnxruntime
optimizer.
The PR tries to avoid changing the signature. To do so, it modifies the
scale computation to store the result in a numpy array rather than a
Python float. The numpy array must have the same type as the weights
to quantize.
The PR adds many `assert` statements to check that the type of the scale
is neither a Python type nor float64. They were added to make sure all
the code follows the same logic, and were kept for the first review.
DequantizeLinear and QuantizeLinear cannot be tested with onnx==1.15. PR
https://github.com/onnx/onnx/pull/5709 is needed to fix shape
inference, and PR https://github.com/onnx/onnx/pull/5473 is needed to
support QLinearMatMul with float 16. That explains why some tests are
disabled with float 16.
### Motivation and Context
The current quantization tool assumes every weight is float 32. In
large models such as LLaMA, weights are usually float 16, so the
quantization tool needs to be able to quantize such weights.
### Description
1. Add two build jobs for enabling Address Sanitizer in CI: one for
Windows CPU and one for Linux CPU.
2. Set default compiler flags/linker flags in build.py for normal
Windows/Linux/MacOS build. This can help control compiler flags in a
more centralized way.
3. All Windows binaries in our official packages will be built with
"/PROFILE" flag. Symbols of onnxruntime.dll can be found at [Microsoft
public symbol
server](https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/microsoft-public-symbols).
Limitations:
1. On Linux, Address Sanitizer ignores RPATH settings in ELF binaries.
Therefore, once Address Sanitizer is enabled, we need to set
LD_LIBRARY_PATH manually before running tests; otherwise
libonnxruntime.so may not be able to find custom ops and shared EPs.
2. On Linux, we also need to set LD_PRELOAD before running some tests (if
the main executable, such as python, is not built with Address
Sanitizer). On Windows this is not needed.
3. On Windows, before running python tests we should manually copy the
Address Sanitizer DLL to the onnxruntime/capi directory, because python
3.8 and above has enabled "Safe DLL Search Mode", which does not use the
information provided by the PATH environment variable.
4. On Linux, Address Sanitizer found a lot of memory leaks in our
python binding code. Therefore, right now we cannot enable Address
Sanitizer when building ONNX Runtime with python binding.
5. Address Sanitizer itself uses a lot of memory address space and
delays memory deallocations, which can easily cause OOM issues in 32-bit
applications. For this reason we cannot run all the tests in
onnxruntime_test_all in 32-bit mode with Address Sanitizer. We can still
run individual tests this way; we just cannot run all of them in one
process.
### Motivation and Context
To catch memory issues.
### Description
Set pythonInterpreter in set-python-manylinux-variables-step.yml to fix
a build error:
```
Starting: Set Python manylinux variables
==============================================================================
Task : Python script
Description : Run a Python file or inline script
Version : 0.231.1
Author : Microsoft Corporation
Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/utility/python-script
==============================================================================
##[error]Parameter 'toolPath' cannot be null or empty.
Finishing: Set Python manylinux variables
```
The error occurred because I deleted a bunch of software from the VM
image today. The task might fail if no Python versions are found in
$(Agent.ToolsDirectory).
### Description
Remove the references to CreateFileMapping2 because the function is
mainly for system services. To use the function, we need to link to one
of the four [Windows umbrella
libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries).
It's tricky because a custom build might want to use any of the four. So
I cannot just choose one and add that one to our CMakeLists.txt.
Given it's so complicated and the code is not actually used now, I will
remove it. It is not used because it requires NTDDI_VERSION >=
NTDDI_WIN10_RS5 but in our top level CMakeLists.txt we set the version
to the first Windows 10 release which is lower than RS5.
### Description
Add four quantization ops: MatMulInteger, ConvInteger, DynamicQuantizeLinear
and DequantizeLinear.
Add data types TensorProto_DataType_INT8 and TensorProto_DataType_UINT8.
### Motivation and Context
Support quantized models.
### Description
Change `A / sqrt(B)` to `A * inverseSqrt(B)` in BatchNormalization,
InstanceNormalization, LayerNormalization and SkipLayerNormalization.
### Motivation and Context
For the same reason that the WebGPU spec provides the `inverseSqrt`
built-in.
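As a rough illustration of the rewrite (in C++ rather than WGSL, and with hypothetical names), the reciprocal square root can be computed once and then multiplied in. Here `1.0f / std::sqrt(...)` stands in for the `inverseSqrt` built-in; this is a sketch of the pattern, not the shader code changed by this PR.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: normalizes x as (x - mean) * inverseSqrt(variance + epsilon).
// The reciprocal square root is computed once, so the loop body uses a
// multiplication instead of a division by sqrt(...).
std::vector<float> normalize(const std::vector<float>& x, float mean,
                             float variance, float epsilon) {
  const float inv_std = 1.0f / std::sqrt(variance + epsilon);
  std::vector<float> y;
  y.reserve(x.size());
  for (float v : x) {
    y.push_back((v - mean) * inv_std);
  }
  return y;
}
```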
Bumps
[follow-redirects](https://github.com/follow-redirects/follow-redirects)
from 1.15.2 to 1.15.4.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="65858205e5"><code>6585820</code></a>
Release version 1.15.4 of the npm package.</li>
<li><a
href="7a6567e16d"><code>7a6567e</code></a>
Disallow bracketed hostnames.</li>
<li><a
href="05629af696"><code>05629af</code></a>
Prefer native URL instead of deprecated url.parse.</li>
<li><a
href="1cba8e85fa"><code>1cba8e8</code></a>
Prefer native URL instead of legacy url.resolve.</li>
<li><a
href="72bc2a4229"><code>72bc2a4</code></a>
Simplify _processResponse error handling.</li>
<li><a
href="3d42aecdca"><code>3d42aec</code></a>
Add bracket tests.</li>
<li><a
href="bcbb096b32"><code>bcbb096</code></a>
Do not directly set Error properties.</li>
<li><a
href="192dbe7ce6"><code>192dbe7</code></a>
Release version 1.15.3 of the npm package.</li>
<li><a
href="bd8c81e4f3"><code>bd8c81e</code></a>
Fix resource leak on destroy.</li>
<li><a
href="9c728c314b"><code>9c728c3</code></a>
Split linting and testing.</li>
<li>Additional commits viewable in <a
href="https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
1. Add metrics.py to define the metrics schema used by Anubis
2. Add two examples (llama2 and whisper) of how to save local benchmark
results following Anubis metrics schema
---------
Co-authored-by: Kyle Zhang <Xi.Zhang@microsoft.com>
Co-authored-by: ironman <bitzhangxi@outlook.com>
When the TRT engine cache (precompiled engine) is present, it doesn't
make sense to go over the processes of model verification, model
optimization, TRT EP's GetCapability(), TRT EP's model proto
reconstruction, calling TRT parser and engine compilation.
This PR makes TRT EP skip those processes and directly load the engine
to perform inference.
The feature request:
https://github.com/microsoft/onnxruntime/issues/18072
Features:
- Replace the original model with a TRT-engine-wrapped ONNX model. This
can save a lot of time, as mentioned above.
- How to get a TRT-engine-wrapped ONNX model?
  1. Set the `trt_dump_ep_context_model` provider option to "true" and run
  the inference. You will find "xxx_wrapper.onnx" at the engine cache
  path (the same logic as for generating the engine cache).
  2. Use gen_trt_engine_wrapper_onnx_model.py.
- Three provider options are added:
  - `trt_dump_ep_context_model`: Enables dumping the wrapped ONNX model by TRT EP.
  - `trt_ep_context_embed_mode`: Adds embed_mode as an attribute. 0 means
  engine cache path, 1 means engine binary data.
  - `trt_ep_context_compute_capability_enable`: Adds hardware_arch as an
  attribute. When running the model, TRT EP checks consistency between
  the model's hardware_arch and the GPU's compute capability.
- When the engine cache path is given in the wrapped model, TRT EP
first searches for the engine file using the path relative to the model
path. If it can't find the file, it uses the path as is (depending on
the user, it could be relative to the working directory or an absolute path).
Note:
1. This PR includes the change of
https://github.com/microsoft/onnxruntime/pull/17751
Constraints:
1. The whole model should be fully supported by TRT.
2. Users need to make sure the engine is built with min/max/opt
optimization profiles that are large enough to cover the range of all
inputs. TRT EP will simply fail and won't rebuild the engine if the
input shape is out of range at runtime.
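The engine cache path lookup described above (first relative to the model path, then the path as given) can be sketched as follows. This is only an illustration of that fallback order under the assumption that a plain existence check decides between the two; `resolve_engine_path` is a hypothetical helper, not a function in TRT EP.

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper illustrating the lookup order:
// 1. Try the engine path relative to the directory that contains the model.
// 2. If no file exists there, return the path unchanged, so it is interpreted
//    as the user wrote it (relative to the working directory, or absolute).
fs::path resolve_engine_path(const fs::path& model_path,
                             const fs::path& engine_path) {
  const fs::path candidate = model_path.parent_path() / engine_path;
  if (fs::exists(candidate)) {
    return candidate;
  }
  return engine_path;
}
```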
### Description
This makes detection of identical QDQ scales work with float16 and
bfloat16 rather than failing.
### Motivation and Context
This addresses failures in customer models
### Description
With QDQ enabled for the Dml EP, we are seeing some models that do not
optimize constant nodes whose scale[1] and zeropoint[1] tensor sizes do
not match the input size. The CPU EP accepts this parameter type, so we
are updating the Dml EP to match the CPU behavior.
### Motivation and Context
Want to match CPU EP behavior.
---------
Co-authored-by: Christian Larson <28911437+chrilaMSFT@users.noreply.github.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
This PR has several combined ORT ETW changes that improve ORT log
diagnosability & performance.
- The existing log behavior in the ORT API and Severity behavior remain
the same as compiled by the dev using the ORT API
- The PR keeps the existing design which has 2 TraceLogging providers
defined (although both were not used before this PR)
- Keeps great inference (inf) and session load performance even with
dynamic logging enabled (see below)
- On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, ORT
dynamically _adds_ a new sink reflecting the severity level provided
by ETW, e.g. Critical through Verbose, as needed at runtime
- This allows previous printf style LOGS() statements both for CPU and
NPU cases to flow to ETW via a local trace (if enabled)
- This DOES NOT add any new Telemetry which can optionally be sent to
Microsoft.
- Telemetry are ETW events marked with the Measure keyword that _can_ be
sampled if a box opts-in
- Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and
levels added if they were missing
- If Execution Providers (EPs) can provide more detailed insight into
their HW or component, then this PR allows for those to be dynamically
logged instead of just at compile time
- In this PR, the QNN EP for QC NPUs can have basic or detailed
profiling enabled to give some insight into how the NPU is performing
- When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the
Profiling keyword and level 5 then QC QNN basic profiling info is output
to ETW
### Motivation and Context
- This makes ORT logging and diagnosability more performant (on Windows)
and available in a wider variety of runtime environments.
- Inference times were ~300x+ faster when these logs were output to ETW
rather than to stdout (Verbose severity)
- This style of ETW dynamic tracing is the widely used standard for
Windows components, and even by some 3rd party software since the ETW
API is open and part of the Windows API
- These ETW based logs can be seamlessly combined with other ETW logs
such as an AI component/feature using ORT, OS CPU profiling, scheduling,
and more
- Before the PR, ORT logging is largely printf style and output to a
sink (usually stdout) only if compiled with a certain log Severity. When
enabled the previous logging (to stdout) would vastly slow down
inference times. Once compiled for release the internal ORT logs were
not accessible by anyone except the AI model developer in their dev
inner loop. That means logs could not be enabled on a lab machine, or on
a production system where the runtime behavior or performance might be
different for various reasons on a wide variety of HW.
- This change was tested with performance in mind and tested with a
mobilenet small AI model with onnxruntime_perf_test
- CPU: There was no statistical difference for inf times with the
baseline (main) or this PR whether ETW was enabled or not (both ORT
providers all keywords level 5).
- NPU (QNN on SP9 or Dev Kit 2023 QC SQ3): There was no statistical
difference for inf times with this PR whether ETW (both ORT providers
all keywords) were enabled or not for Level 5 (Verbose). This is even
with QNN Basic profiling turned on and outputting NPU stats to ETW
- As expected and designed, there was perf slowdown when Max Level 255
is enabled which translates to QNN Detailed profiling. This mirrors the
expected slowdown in the NPU when individual model operations are
monitored & recorded as well. This perf is similar to the QNN SDK
Detailed profiling performance separate from this PR. This is designed
to be above level 5 (verbose) as that is commonly the max level used in
many trace profiles and won't affect inf performance.
- Other OSes such as Linux & Android are left untouched for now.
- Out of scope for this PR but TraceLogging is available for Linux with
LTTng tracing. So in the future, this optional tracing could also be
made available on other OSes where a TraceLogging API is available
---------
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
### Description
Currently there are two memory-latency-bound hotspots in the
DecoderMaskedMultiheadAttention kernel when reading from global
memory: one reads K values and the other reads V values.
The current logic for reading both is roughly:
```c++
for (int i = 0; i < all_time_steps; ++i) {
  auto data_in_register = load_chunk_from_global_memory(i);
  do_compute(data_in_register);
}
```
Data must be fetched into registers before compute can begin, so the
compute instruction incurs a data read stall, and this pattern does not
fully utilize the memory bandwidth of the A100. The logic above can be
rewritten with some manual loop unrolling so that more data reads are
triggered "in flight".
Unroll factor: 4
```c++
for (int i = 0; i < all_time_steps; i += 4) {
  auto data_in_register_0 = load_chunk_from_global_memory(i);
  // Do bounds check for the following
  auto data_in_register_1 = load_chunk_from_global_memory(i + 1);
  auto data_in_register_2 = load_chunk_from_global_memory(i + 2);
  auto data_in_register_3 = load_chunk_from_global_memory(i + 3);
  do_compute(data_in_register_0);
  // Do bounds check for the following
  do_compute(data_in_register_1);
  do_compute(data_in_register_2);
  do_compute(data_in_register_3);
}
```
The idea is that the memory read latency is hidden by instructions being
issued for subsequent data reads. See here for more details -
https://forums.developer.nvidia.com/t/global-memory-access-synchronous-or-asynchronous-read-write/3256/4
Kernel clock cycles, latency, and memory bandwidth usage before:
<img width="1210" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/9969784/7a1f41f9-fdaa-47b3-b629-996d7b5eef17">
Kernel clock cycles, latency, and memory bandwidth usage after:
<img width="1205" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/9969784/c76b2d2f-43e3-43c9-a710-b5fae76f69b6">
As can be seen, the kernel latency improves by >30% and memory
throughput improves by >14%.
We have a 1P customer using the Whisper model (sampling using
BeamSearch), and the E2E perf improvement for a representative production
input is >6.5%.
Whisper E2E Latency for sample input before (on A100):
<img width="194" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/9969784/84ef59f5-84f2-4277-b9f8-b04c27336642">
Whisper E2E Latency for sample input after (on A100):
<img width="191" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/9969784/ca9fe5d3-f726-403e-b27c-be4ee07e0625">
Loading more data in flight may not always yield gains; it is
workload dependent. For now, the feature is turned OFF by default. It
can be turned ON by the user when needed.
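The load-then-compute structure with bounds checks can be sketched in plain C++ as below. This is a host-side illustration only: it shows the control flow of issuing a group of loads before the dependent compute, and it does not demonstrate the latency hiding that happens on the GPU. The names `load_chunk`, `do_compute`, and `process_unrolled`, and the squared-sum workload, are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a read from global memory.
inline float load_chunk(const std::vector<float>& memory, std::size_t i) {
  return memory[i];
}

// Hypothetical stand-in for the per-chunk compute: accumulate the square.
inline void do_compute(float chunk, float& acc) {
  acc += chunk * chunk;
}

// Processes the input in groups of kUnroll elements. All loads of a group are
// issued first, then all computes, with a bounds check on each element.
// The inner loops have a compile-time trip count, so a compiler can unroll them.
float process_unrolled(const std::vector<float>& memory) {
  constexpr std::size_t kUnroll = 4;
  const std::size_t n = memory.size();
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; i += kUnroll) {
    float regs[kUnroll] = {};
    for (std::size_t u = 0; u < kUnroll; ++u) {
      if (i + u < n) regs[u] = load_chunk(memory, i + u);  // bounds-checked load
    }
    for (std::size_t u = 0; u < kUnroll; ++u) {
      if (i + u < n) do_compute(regs[u], acc);  // bounds-checked compute
    }
  }
  return acc;
}
```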
### Motivation and Context
Improve BeamSearch performance on CUDA EP
### Description
Adding python3.12 support to ORT
Disable the createGroupedConvVectorizeProgramInfo path due to bot
failures in the following two cases:
[webgpu]Conv - conv - vectorize group - B
[webgpu]Conv - conv - vectorize group - D
### Fix missing subgraph candidates for recompute
For subgraphs such as `MatMul+Transpose+Reshape`, the ending node is a
Reshape, which in ORT reuses its input buffer.
The current subgraph detection logic has a defect, and as a result such
subgraphs are missed as recompute candidates.
This PR also adds a few more node types for recompute support.
TODO: add unit test later. This PR is needed for a customer model now.
### Description
ORT web prefers to use a global thread pool for all inference sessions.
See how OrtCreateSession is implemented in
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/wasm/api.cc#L183.
Application code can only use the global thread pool. However, internal
testing code still often uses a per-session thread pool. This PR fixes
the inconsistency.
### Motivation and Context
Replace PR #18476
### Description
The copied QDQ node should have exactly the same attributes as the
original QDQ node. Otherwise, it might cause errors when the original
node has attributes that use non default values (such as axis != 1
case).
An example use case:
a DequantizeLinear node has more than one consumer in the graph, and its
`axis` attribute is 0.
### Motivation and Context
I see errors like
https://github.com/microsoft/onnxruntime/issues/16188
and this fix could solve the issue.
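The point about carrying over attributes can be illustrated with a simplified model of a node. The sketch below assumes a hypothetical `SimpleNode` struct with integer attributes; it is not ORT's `Node` class or the code changed by this PR.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for a graph node.
struct SimpleNode {
  std::string op_type;
  std::map<std::string, std::int64_t> attributes;
};

// Duplicates a Q/DQ node for an additional consumer, carrying over every
// attribute so that non-default values (for example axis = 0) are preserved.
SimpleNode duplicate_qdq_node(const SimpleNode& original) {
  SimpleNode copy;
  copy.op_type = original.op_type;
  copy.attributes = original.attributes;  // copy all attributes as they are
  return copy;
}
```

If the copy were created with default attributes instead, a node with `axis` set to 0 would silently fall back to the default axis on the duplicated path.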
### Description
This PR provides a vectorized algorithm for NHWC GroupedConv to improve
performance.
The aggregate time of GroupedConv in mobilenetv2-12 drops from ~4ms to
~1ms on an Intel Alder Lake machine, about a 20% improvement for the
whole model.
### Description
To fix memleak:
```bash
192 bytes in 1 blocks are definitely lost in loss record 1,254 of 1,999
at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x4A93FD5: OrtApis::CreateTensorRTProviderOptions(OrtTensorRTProviderOptionsV2**) (in /code/onnxruntime/build/Linux/Release/libonnxruntime.so.1.17.0)
by 0x1502E1: onnxruntime::perftest::OnnxRuntimeTestSession::OnnxRuntimeTestSession(Ort::Env&, std::random_device&, onnxruntime::perftest::PerformanceTestConfig const&, TestModelInfo const&) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
by 0x15A404: onnxruntime::perftest::PerformanceRunner::PerformanceRunner(Ort::Env&, onnxruntime::perftest::PerformanceTestConfig const&, std::random_device&) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
by 0x14C6D9: real_main(int, char**) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
by 0x145A2A: main (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
```
Add a pointer to help release the TRT EP provider options.
### Description
Support INT4 weight-only quantization (WOQ) via Intel Neural Compressor,
including two algorithms: RTN and GPTQ.
**Note:**
Please install `neural-compressor==2.3` for weight-only quantization.
### Motivation and Context
As large language models (LLMs) become more prevalent, there is a
growing need for new and improved quantization methods that can meet the
computational demands of these modern architectures while maintaining
the accuracy. Compared to normal quantization like W8A8, weight only
quantization is probably a better trade-off to balance the performance
and the accuracy.
RTN is the most straightforward way to quantize weight.
GPTQ algorithm provides more accurate quantization but requires more
computational resources.
### Evaluation results
The following table shows the accuracy results of Llama-2 models
evaluated on [lambada_openai](https://huggingface.co/datasets/lambada)
task. `GPTQ W4G32Asym` in configuration column means GPTQ algorithm is
used for 4-bit weight only quantization, setting group_size=32 and
scheme=asym.
<table class="tg">
<thead>
<tr>
<th rowspan="2">Model name</th>
<th rowspan="2">Configuration</th>
<th colspan="2">Lambada_openai</th>
<th rowspan="2">Accuracy Ratio<br>[WOQ/FP32]</th>
</tr>
<tr>
<th>Accuracy</th>
<th>Perplexity</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">meta-llama/Llama-2-7b-chat-hf</td>
<td>FP32</td>
<td>0.7058</td>
<td>3.2788</td>
<td>/</td>
</tr>
<tr>
<td>GPTQ<br>W4G32Asym</td>
<td>0.7025</td>
<td>3.4489</td>
<td>99.53%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-7b-hf</td>
<td>FP32</td>
<td>0.7392</td>
<td>3.3950</td>
<td>/</td>
</tr>
<tr>
<td>GPTQ<br>W4G32Asym</td>
<td>0.7326</td>
<td>3.5286</td>
<td>99.11%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-13b-chat-hf</td>
<td>FP32</td>
<td>0.7312</td>
<td>2.9163</td>
<td>/</td>
</tr>
<tr>
<td>GPTQ<br>W4G128Asym</td>
<td>0.7289</td>
<td>3.0061</td>
<td>99.56%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-13b-hf</td>
<td>FP32</td>
<td>0.7677</td>
<td>3.0438</td>
<td>/</td>
</tr>
<tr>
<td>GPTQ<br>W4G32Asym</td>
<td>0.7607</td>
<td>3.1562</td>
<td>99.09%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-70b-chat-hf</td>
<td>FP32</td>
<td>0.7543</td>
<td>2.6181</td>
<td>/</td>
</tr>
<tr>
<td>RTN<br>W4G32Sym</td>
<td>0.7489</td>
<td>2.6850</td>
<td>99.28%</td>
</tr>
<tr>
<td rowspan="2">meta-llama/Llama-2-70b-hf</td>
<td>FP32</td>
<td>0.7964</td>
<td>2.6612</td>
<td>/</td>
</tr>
<tr>
<td>RTN<br>W4G32Sym</td>
<td>0.7896</td>
<td>2.7546</td>
<td>99.15%</td>
</tr>
</tbody>
</table>
---------
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Co-authored-by: Wang, Mengni <mengni.wang@intel.com>
### Description
QNN_Nuget_Windows does not allow ai.onnx.ml operators but the test
test_custom_op_local_function is using LabelEncoder. The operator can be
removed as the test is only checking custom ops api.
### Motivation and Context
Fix test test_custom_op_local_function in QNN_Nuget_Windows pipeline.
Fix error:
```
[ 48%] Built target onnxruntime_optimizer
In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5:
/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory
11 | #include "core/providers/rocm/shared_inc/fast_divmod.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```
This error is due to onnxruntime_optimizer missing dependencies on
hipify generated files.
This resolves the below build errors:
```
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS2724: '"./ops/instance-norm"' has no exported member named 'parseInstanceNormAttributes'. Did you mean 'InstanceNormAttributes'?
19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS6133: 'parseInstanceNormAttributes' is declared but its value is never read.
19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS2305: Module '"./ops/layer-norm"' has no exported member 'parseLayerNormAttributes'.
20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
~~~~~~~~~~~~~~~~~~~~~~~~
lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS6133: 'parseLayerNormAttributes' is declared but its value is never read.
20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
```
This PR also processes the subgraph's initializers. The subgraph
doesn't contain all of its required initializers; some common
initializers are stored in its ancestor graphs. We need to collect all
required initializers and re-map them to the subgraph.
### Description
This PR enables onnxruntime to build with the most recent release of Arm
Compute Library
### Motivation and Context
The latest version of Arm Compute Library that onnxruntime can build
with is 20.02, which is more than 3 years old.
### Description
Add heterogeneous support to skip this check for TRT plugin which has
different input tensor types
### Description
- Support more test cases for WebNN EP in suite-test-list.jsonc
- Add DISABLE_WEBNN flag in build.ts as preparing for WebNN EP release
- Add test option: '--webnn-device-type' in test-runner-args-cli.ts to
support running WebNN 'gpu' deviceType
- Use Chrome Stable as default browser for WebNN testing to unblock the
CI limitation.
### Description
Add support code for loongarch64 platform in sqnbitgemm
```
100% tests passed, 0 tests failed out of 7
Total Test time (real) = 116.99 sec
2023-12-11 10:43:21,287 build [INFO] - Build complete
```