Commit graph

10329 commits

Author SHA1 Message Date
gunandrose4u
e2c145d37f
Add Anubis metrics schema for local benchmark results uploading (#19018)
### Description
1. Add metrics.py to define the metrics schema used by Anubis
2. Add two examples (llama2 and whisper) of how to save local benchmark
results following Anubis metrics schema



---------

Co-authored-by: Kyle Zhang <Xi.Zhang@microsoft.com>
Co-authored-by: ironman <bitzhangxi@outlook.com>
2024-01-12 14:24:01 +08:00
Chi Lo
46dd0d3f52
[TensorRT EP] Load precompiled TRT engine file directly (#18217)
When the TRT engine cache (precompiled engine) is present, it doesn't
make sense to go over the processes of model verification, model
optimization, TRT EP's GetCapability(), TRT EP's model proto
reconstruction, calling TRT parser and engine compilation.
This PR makes TRT EP skip those processes and directly load the engine
to perform inference.

The feature request:
https://github.com/microsoft/onnxruntime/issues/18072

Features:

- Replace original model with TRT engine wrapped ONNX model. It can save
a lot of time as mentioned above.

- How to get TRT engine wrapped ONNX model?
    1. Set the `trt_dump_ep_context_model` provider option to "true" and run
    inference. You will find "xxx_wrapper.onnx" at the engine cache
    path (the same logic as for generating the engine cache).
    2. Use gen_trt_engine_wrapper_onnx_model.py

- Three provider options are added:
    - `trt_dump_ep_context_model`: Enable dumping of the wrapped ONNX model by TRT EP.
    - `trt_ep_context_embed_mode`: Add embed_mode as an attribute. 0 means engine
    cache path, 1 means engine binary data.
    - `trt_ep_context_compute_capability_enable`: Add hardware_arch as an
    attribute. When running the model, TRT EP will check consistency between
    the model's hardware_arch and the GPU's compute capability.

- When the engine cache path is given in the wrapped model, TRT EP will
first search for the engine file using the path relative to the model
path. If it can't find the file, it will use the path as it is
(depending on the user, this could be relative to the working dir or an absolute path).
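
The options and lookup order above can be sketched in Python. This is a minimal illustration, not the TRT EP implementation: `trt_ep_context_options` and `resolve_engine_path` are hypothetical helper names, and pairing the new options with `trt_engine_cache_enable` / `trt_engine_cache_path` is an assumption based on the note that the wrapper model is written to the engine cache path.

```python
from pathlib import Path


def trt_ep_context_options(cache_dir: str, embed_engine: bool = False) -> dict:
    """Build TensorRT EP provider options that ask for a wrapped ONNX model.

    Hypothetical helper: only the three trt_*ep_context* keys come from this PR;
    enabling the engine cache alongside them is an assumption.
    """
    return {
        "trt_engine_cache_enable": "true",
        "trt_engine_cache_path": cache_dir,
        "trt_dump_ep_context_model": "true",
        # 0 = store the engine cache path, 1 = embed the engine binary data
        "trt_ep_context_embed_mode": "1" if embed_engine else "0",
        "trt_ep_context_compute_capability_enable": "true",
    }


def resolve_engine_path(model_path: str, engine_path: str) -> Path:
    """Mimic the described lookup order (illustrative only).

    First try the engine path relative to the model's directory; if no file
    is there, fall back to the path exactly as given.
    """
    candidate = Path(model_path).parent / engine_path
    if candidate.is_file():
        return candidate
    return Path(engine_path)
```

The resulting dictionary would typically be passed as `providers=[("TensorrtExecutionProvider", options)]` when creating an `onnxruntime.InferenceSession`; that call is left out here so the sketch runs without a TensorRT build.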

Note: 

1. This PR includes the change of
https://github.com/microsoft/onnxruntime/pull/17751


Constraints:

1. The whole model should be fully supported by TRT.
2. Users need to make sure the engine is built with min/max/opt
optimization profiles that are large enough to cover the range of all
inputs. TRT EP will simply fail and won't rebuild the engine if the
input shape is out of range during runtime.
2024-01-11 22:20:54 -08:00
Ye Wang
b6d82834d4
add bfp16 to gqa (#19095)
2024-01-11 20:53:31 -08:00
dependabot[bot]
189be8e997
Bump follow-redirects from 1.15.2 to 1.15.4 in /onnxruntime/test/wasm (#19069)
Bumps
[follow-redirects](https://github.com/follow-redirects/follow-redirects)
from 1.15.2 to 1.15.4.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="65858205e5"><code>6585820</code></a>
Release version 1.15.4 of the npm package.</li>
<li><a
href="7a6567e16d"><code>7a6567e</code></a>
Disallow bracketed hostnames.</li>
<li><a
href="05629af696"><code>05629af</code></a>
Prefer native URL instead of deprecated url.parse.</li>
<li><a
href="1cba8e85fa"><code>1cba8e8</code></a>
Prefer native URL instead of legacy url.resolve.</li>
<li><a
href="72bc2a4229"><code>72bc2a4</code></a>
Simplify _processResponse error handling.</li>
<li><a
href="3d42aecdca"><code>3d42aec</code></a>
Add bracket tests.</li>
<li><a
href="bcbb096b32"><code>bcbb096</code></a>
Do not directly set Error properties.</li>
<li><a
href="192dbe7ce6"><code>192dbe7</code></a>
Release version 1.15.3 of the npm package.</li>
<li><a
href="bd8c81e4f3"><code>bd8c81e</code></a>
Fix resource leak on destroy.</li>
<li><a
href="9c728c314b"><code>9c728c3</code></a>
Split linting and testing.</li>
<li>Additional commits viewable in <a
href="https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=follow-redirects&package-manager=npm_and_yarn&previous-version=1.15.2&new-version=1.15.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-11 16:13:22 -08:00
Aditya Goel
d8962d67f4
RegexFullMatch operator (#18002)

### Motivation and Context
Closes https://github.com/microsoft/onnxruntime/issues/17594.
2024-01-11 15:50:07 -08:00
Jeff Bloomfield
08cf4fbcad
Handle all float types in IsQDQPairSupported (#19085)
### Description
This makes detection of identical QDQ scales work with float16 and
bfloat16 rather than failing.


### Motivation and Context
This addresses failures in customer models
2024-01-11 15:16:44 -08:00
Christian Larson
8a0a972f39
Update DML EP to accept broadcasted tensor of size 1 to match CPU (#19081)
### Description
With QDQ enabled for the DML EP, we are seeing some models that leave
constant nodes unoptimized, with scale and zero-point tensors of size 1
(scale[1] and zeropoint[1]) that do not match the input size. The CPU EP
accepts this parameter type, so this change updates the DML EP to match
CPU behavior.



### Motivation and Context
Want to match CPU EP behavior.

---------

Co-authored-by: Christian Larson <28911437+chrilaMSFT@users.noreply.github.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
2024-01-11 15:15:51 -08:00
Maximilian Müller
daa22f919f
[TensorRT] query GPU properties only once when setting device_id (#19092)
### Description

For most models, repeatedly querying the GPU properties does not add
significant overhead, but for very small models the impact is significant.
The attached screenshot shows the impact when using only 2 FC layers:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/b4fdf8bf-0422-43ab-a49e-7d2996cba68e)
2024-01-11 13:37:10 -08:00
ivberg
4d1243b4b4
ORT ETW dynamic logging that improves ORT diagnosability & performance (#18882)
### Description
This PR has several combined ORT ETW changes that improve ORT log
diagnosability & performance. 
- The existing log behavior in the ORT API and Severity behavior remain
the same as compiled by the dev using the ORT API
- The PR keeps the existing design which has 2 TraceLogging providers
defined (although both were not used before this PR)
- Keeps great inference (inf) and session load performance even with
dynamic logging enabled (see below)
- On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, ORT
will dynamically _add_ a new sink reflecting the severity level provided
by ETW, e.g. Critical through Verbose, as needed at runtime
- This allows previous printf style LOGS() statements both for CPU and
NPU cases to flow to ETW via a local trace (if enabled)
- This DOES NOT add any new Telemetry which can optionally be sent to
Microsoft.
- Telemetry are ETW events marked with the Measure keyword that _can_ be
sampled if a box opts-in
- Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and
levels added if they were missing
- If Execution Providers (EPs) can provide more detailed insight into
their HW or component, then this PR allows for those to be dynamically
logged instead of just at compile time
- In this PR, the QNN EP for QC NPUs can have basic or detailed
profiling enabled to give some insight into how the NPU is performing
- When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the
Profiling keyword and level 5 then QC QNN basic profiling info is output
to ETW
  
### Motivation and Context
- This makes ORT logging and diagnosability more performant (on Windows)
and available in a wider variety of runtime environments.
- Inference (inf) times were ~300x+ faster when these logs were output
to ETW rather than just stdout (Verbose Severity)
- This style of ETW dynamic tracing is the widely used standard for
Windows components, and even by some 3rd party software since the ETW
API is open and part of the Windows API
- These ETW based logs can be seamlessly combined with other ETW logs
such as an AI component/feature using ORT, OS CPU profiling, scheduling,
and more
- Before this PR, ORT logging was largely printf style and output to a
sink (usually stdout) only if compiled with a certain log Severity. When
enabled, the previous logging (to stdout) would vastly slow down
inference times. Once compiled for release the internal ORT logs were
not accessible by anyone except the AI model developer in their dev
inner loop. That means logs could not be enabled on a lab machine, or on
a production system where the runtime behavior or performance might be
different for various reasons on a wide variety of HW.
- This change was tested with performance in mind and tested with a
mobilenet small AI model with onnxruntime_perf_test
- CPU: There was no statistical difference for inf times with the
baseline (main) or this PR whether ETW was enabled or not (both ORT
providers all keywords level 5).
- NPU (QNN on SP9 or Dev Kit 2023 QC SQ3): There was no statistical
difference for inf times with this PR whether ETW (both ORT providers
all keywords) were enabled or not for Level 5 (Verbose). This is even
with QNN Basic profiling turned on and outputting NPU stats to ETW
- As expected and designed, there was perf slowdown when Max Level 255
is enabled which translates to QNN Detailed profiling. This mirrors the
expected slowdown in the NPU when individual model operations are
monitored & recorded as well. This perf is similar to the QNN SDK
Detailed profiling performance separate from this PR. This is designed
to be above level 5 (verbose) as that is commonly the max level used in
many trace profiles and won't affect inf performance.
- Other OSes such as Linux & Android are left untouched for now. 
- Out of scope for this PR but TraceLogging is available for Linux with
LTTng tracing. So in the future, this optional tracing could also be
made available on other OSes where a TraceLogging API is available
2024-01-11 12:43:27 -08:00
Guenther Schmuelling
d0bac8216d
[js/webgpu] fix bcast in where (#19009) 2024-01-11 12:13:24 -08:00
Jian Chen
53497702a6
Fix Nuget CUDA Packaging pipeline (#19054)

---------

Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
2024-01-11 11:59:21 -08:00
RandySheriffH
24e9daf707
Enrich cuda resources with ep options (#19014)
Allow custom ops to access cuda ep options.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2024-01-11 10:56:07 -08:00
Baiju Meswani
58bf836592
Offline tooling for training to use reduction with keepdims=False (#19027) 2024-01-11 10:51:23 -08:00
Aditya Goel
4694edcd41
String concat operator (#17994)

### Motivation and Context
Closes https://github.com/microsoft/onnxruntime/issues/17595.

---------

Signed-off-by: Aditya Goel <agoel4512@gmail.com>
2024-01-11 10:01:43 -08:00
Hariharan Seshadri
f68dfcd888
[CUDA] Improve performance of DecoderMaskedMultiheadAttention on A100 (#18695)
### Description

Currently there are 2 memory latency bound hotspots in the
DecoderMaskedMultiheadAttention kernel in terms of reading from global
memory - one reading K values and the other reading V values.

The current logic to read them both is something like this - 

for(int i=0; i<all_time_steps; ++i) {
  auto data_in_register = load_chunk_from_global_memory(i);
  do_compute(data_in_register);
}

This incurs a data read stall, because data needs to be fetched into the
registers before compute can begin, and it does not fully utilize the
memory bandwidth of the A100. The above logic can be rewritten with some
manual loop unrolling so that more data reads are triggered "in flight".

Unroll factor: 4
for(int i=0; i<all_time_steps; i+=4) {
  auto data_in_register_0 = load_chunk_from_global_memory(i);

  // Do bounds check for the following
  auto data_in_register_1 = load_chunk_from_global_memory(i+1);
  auto data_in_register_2 = load_chunk_from_global_memory(i+2);
  auto data_in_register_3 = load_chunk_from_global_memory(i+3);

  do_compute(data_in_register_0);

  // Do bounds check for the following
  do_compute(data_in_register_1);
  do_compute(data_in_register_2);
  do_compute(data_in_register_3);
}

The idea is that the memory read latency is hidden by instructions being
issued for subsequent data reads. See here for more details -
https://forums.developer.nvidia.com/t/global-memory-access-synchronous-or-asynchronous-read-write/3256/4
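
The control flow of the unrolled loop, including the bounds checks on the trailing elements, can be sketched in Python. This is only an illustration of the ordering (all loads of a group are issued before any compute); Python does not model GPU memory latency, and `process_unrolled`, `load`, and `compute` are hypothetical names rather than anything from the kernel.

```python
def process_unrolled(chunks, load, compute, unroll=4):
    """Issue up to `unroll` loads before any compute, with bounds checks."""
    n = len(chunks)
    for i in range(0, n, unroll):
        # Issue every load for this group first. Clamping the end of the range
        # to n plays the role of the bounds check on i+1 .. i+3.
        in_flight = [load(chunks[j]) for j in range(i, min(i + unroll, n))]
        # Only then consume the loaded values, in their original order.
        for value in in_flight:
            compute(value)
```

With 10 chunks and an unroll factor of 4, the groups have sizes 4, 4, and 2, so the last group simply issues fewer loads.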

Kernel clock cycles, latency, and memory bandwidth usage before:

<img width="1210" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/9969784/7a1f41f9-fdaa-47b3-b629-996d7b5eef17">

Kernel clock cycles, latency, and memory bandwidth usage after:

<img width="1205" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/9969784/c76b2d2f-43e3-43c9-a710-b5fae76f69b6">


As can be seen, the kernel latency is better by >30% and memory
throughput is better by >14%.

We have a 1P customer using the Whisper model (sampling using
BeamSearch), and the E2E perf improvement for a representative production
input is > 6.5%

Whisper E2E Latency for sample input before (on A100):

<img width="194" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/9969784/84ef59f5-84f2-4277-b9f8-b04c27336642">

Whisper E2E Latency for sample input after (on A100):

<img width="191" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/9969784/ca9fe5d3-f726-403e-b27c-be4ee07e0625">


This feature of loading more data in flight may not always yield gains
and it will be workload dependent. For now, keeping the feature turned
OFF by default. It can be turned ON by the user when needed.

### Motivation and Context
Improve BeamSearch performance on CUDA EP
2024-01-11 09:19:12 -08:00
Jian Chen
2eb3db6bf0
Adding python3.12 support to ORT (#18814)
### Description
Adding python3.12 support to ORT



2024-01-11 08:34:28 -08:00
Jiajia Qin
a89db01fce
[js/webgpu] disable GroupedConvVectorize path (#19090)
Disable the createGroupedConvVectorizeProgramInfo path due to bot failures
in the two cases below:
[webgpu]Conv - conv - vectorize group - B
[webgpu]Conv - conv - vectorize group - D
2024-01-11 08:13:14 -08:00
dependabot[bot]
f11713702f
Bump follow-redirects from 1.15.2 to 1.15.4 in /js/node (#19070)
Bumps
[follow-redirects](https://github.com/follow-redirects/follow-redirects)
from 1.15.2 to 1.15.4.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="65858205e5"><code>6585820</code></a>
Release version 1.15.4 of the npm package.</li>
<li><a
href="7a6567e16d"><code>7a6567e</code></a>
Disallow bracketed hostnames.</li>
<li><a
href="05629af696"><code>05629af</code></a>
Prefer native URL instead of deprecated url.parse.</li>
<li><a
href="1cba8e85fa"><code>1cba8e8</code></a>
Prefer native URL instead of legacy url.resolve.</li>
<li><a
href="72bc2a4229"><code>72bc2a4</code></a>
Simplify _processResponse error handling.</li>
<li><a
href="3d42aecdca"><code>3d42aec</code></a>
Add bracket tests.</li>
<li><a
href="bcbb096b32"><code>bcbb096</code></a>
Do not directly set Error properties.</li>
<li><a
href="192dbe7ce6"><code>192dbe7</code></a>
Release version 1.15.3 of the npm package.</li>
<li><a
href="bd8c81e4f3"><code>bd8c81e</code></a>
Fix resource leak on destroy.</li>
<li><a
href="9c728c314b"><code>9c728c3</code></a>
Split linting and testing.</li>
<li>Additional commits viewable in <a
href="https://github.com/follow-redirects/follow-redirects/compare/v1.15.2...v1.15.4">compare
view</a></li>
</ul>
</details>
<br />



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-10 22:08:14 -08:00
pengwa
d03e477b90
Fix missing subgraph candidates for recompute (#19077)
### Fix missing subgraph candidates for recompute

For subgraphs such as `MatMul+Transpose+Reshape`, the ending node is a
Reshape, which in ORT reuses its input buffer.

Currently, the subgraph detection logic has a defect; as a result, those
subgraphs are missed as recompute candidates.

Also append a few more node types for recompute support. 

TODO: add unit test later. This PR is needed for a customer model now.
2024-01-11 12:50:55 +08:00
Yulong Wang
0a0ef958eb
update .vscode/settings.json (#19084)
### Description

`"explicit"` now replaces `true` as the value of the config entry
"source.organizeImports". The latest VSCode modifies this config
automatically.
2024-01-10 19:26:01 -08:00
Changming Sun
053ddfe3fd
Disable per-session thread pool for web (#18480)
### Description
ORT web prefers to use a global thread pool for all inference sessions.
See how OrtCreateSession is implemented in
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/wasm/api.cc#L183
.

Application code can only use the global thread pool. However, internal
testing code still often uses a per-session thread pool. This PR fixes
the inconsistency.

### Motivation and Context
Replace PR #18476
2024-01-10 18:45:49 -08:00
Yvonne Chen
5678317baf
Fix the duplicated QDQ attributes setup issue (#18039)
### Description
The copied QDQ node should have exactly the same attributes as the
original QDQ node. Otherwise, it might cause errors when the original
node has attributes that use non default values (such as axis != 1
case).

An example use case:
A DequantizeLinear node has more than 1 consumer in the graph, and its
axis attribute is 0.

### Motivation and Context
I see the errors like 
https://github.com/microsoft/onnxruntime/issues/16188 
and this fix could solve the issue.
2024-01-10 18:36:33 -08:00
Jiajia Qin
fd6bab4250
[js/webgpu] Provide a vectorized algorithm for GroupedConv (#18884)
### Description
This PR provides a vectorized algorithm for NHWC GroupedConv to improve
performance.

The aggregate time of GroupedConv in mobilenetv2-12 drops from ~4ms to
~1ms on an Intel Alder Lake machine, about a 20% improvement for the whole
model.
2024-01-10 16:12:43 -08:00
Yifan Li
e58319ebfc
[TensorRT EP] Fix memleak (#19053)
### Description
To fix memleak:
```bash
192 bytes in 1 blocks are definitely lost in loss record 1,254 of 1,999
   at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
   by 0x4A93FD5: OrtApis::CreateTensorRTProviderOptions(OrtTensorRTProviderOptionsV2**) (in /code/onnxruntime/build/Linux/Release/libonnxruntime.so.1.17.0)
   by 0x1502E1: onnxruntime::perftest::OnnxRuntimeTestSession::OnnxRuntimeTestSession(Ort::Env&, std::random_device&, onnxruntime::perftest::PerformanceTestConfig const&, TestModelInfo const&) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
   by 0x15A404: onnxruntime::perftest::PerformanceRunner::PerformanceRunner(Ort::Env&, onnxruntime::perftest::PerformanceTestConfig const&, std::random_device&) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
   by 0x14C6D9: real_main(int, char**) (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
   by 0x145A2A: main (in /code/onnxruntime/build/Linux/Release/onnxruntime_perf_test)
```

add ptr to help release trtep provider options


2024-01-10 15:29:34 -08:00
yuwenzho
731b50dfc4
Support INT4 weight only quantize, including RTN and GPTQ 2 algorithms (#17390)
### Description
Support INT4 weight only quantize (WOQ) via Intel Neural Compressor,
including RTN and GPTQ 2 algorithms.

**Note:**
Please install `neural-compressor==2.3` for weight only quantize.

### Motivation and Context
As large language models (LLMs) become more prevalent, there is a
growing need for new and improved quantization methods that can meet the
computational demands of these modern architectures while maintaining
the accuracy. Compared to normal quantization like W8A8, weight only
quantization is probably a better trade-off to balance the performance
and the accuracy.
RTN is the most straightforward way to quantize weight.
GPTQ algorithm provides more accurate quantization but requires more
computational resources.
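
A generic round-to-nearest (RTN) scheme for 4-bit, group-wise, asymmetric weight quantization can be sketched as follows. This is a plain-Python illustration of the general technique under stated assumptions, not the Intel Neural Compressor implementation; the function names are hypothetical, and the defaults mirror the `W4G32Asym` notation (4 bits, group size 32, asymmetric).

```python
def rtn_quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = (1 << bits) - 1                       # 15 for 4-bit
    lo = min(min(weights), 0.0)                  # keep 0.0 representable
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0              # avoid a zero scale
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def rtn_dequantize_group(q, scale, zero_point):
    """Map quantized integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def rtn_quantize(weights, group_size=32, bits=4):
    """Quantize a flat weight list group by group; one (q, scale, zp) per group."""
    return [
        rtn_quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]
```

Each group carries its own scale and zero point, so a smaller group size tracks the local weight range more closely at the cost of storing more metadata.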

### Evaluation results
The following table shows the accuracy results of Llama-2 models
evaluated on [lambada_openai](https://huggingface.co/datasets/lambada)
task. `GPTQ W4G32Asym` in configuration column means GPTQ algorithm is
used for 4-bit weight only quantization, setting group_size=32 and
scheme=asym.
<table class="tg">
<thead>
  <tr>
    <th rowspan="2">Model name</th>
    <th rowspan="2">Configuration</th>
    <th colspan="2">Lambada_openai</th>
    <th rowspan="2">Accuracy Ratio<br>[WOQ/FP32]</th>
  </tr>
  <tr>
    <th>Accuracy</th>
    <th>Perplexity</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-7b-chat-hf</td>
    <td>FP32</td>
    <td>0.7058</td>
    <td>3.2788</td>
    <td>/</td>
  </tr>
  <tr>
    <td>GPTQ<br>W4G32Asym</td>
    <td>0.7025</td>
    <td>3.4489</td>
    <td>99.53%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-7b-hf</td>
    <td>FP32</td>
    <td>0.7392</td>
    <td>3.3950</td>
    <td>/</td>
  </tr>
  <tr>
    <td>GPTQ<br>W4G32Asym</td>
    <td>0.7326</td>
    <td>3.5286</td>
    <td>99.11%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-13b-chat-hf</td>
    <td>FP32</td>
    <td>0.7312</td>
    <td>2.9163</td>
    <td>/</td>
  </tr>
  <tr>
    <td>GPTQ<br>W4G128Asym</td>
    <td>0.7289</td>
    <td>3.0061</td>
    <td>99.56%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-13b-hf</td>
    <td>FP32</td>
    <td>0.7677</td>
    <td>3.0438</td>
    <td>/</td>
  </tr>
  <tr>
    <td>GPTQ<br>W4G32Asym</td>
    <td>0.7607</td>
    <td>3.1562</td>
    <td>99.09%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-70b-chat-hf</td>
    <td>FP32</td>
    <td>0.7543</td>
    <td>2.6181</td>
    <td>/</td>
  </tr>
  <tr>
    <td>RTN<br>W4G32Sym</td>
    <td>0.7489</td>
    <td>2.6850</td>
    <td>99.28%</td>
  </tr>
  <tr>
    <td rowspan="2">meta-llama/Llama-2-70b-hf</td>
    <td>FP32</td>
    <td>0.7964</td>
    <td>2.6612</td>
    <td>/</td>
  </tr>
  <tr>
    <td>RTN<br>W4G32Sym</td>
    <td>0.7896</td>
    <td>2.7546</td>
    <td>99.15%</td>
  </tr>
</tbody>
</table>

---------

Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Co-authored-by: Wang, Mengni <mengni.wang@intel.com>
2024-01-10 15:13:04 -08:00
RandySheriffH
df116b82c7
Custom op API for thread pool (#18980)
Allow custom op to invoke internal thread-pool for parallelism.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2024-01-10 14:13:25 -08:00
Xavier Dupré
cf78d01546
remove use of ai.onnx.ml in test for custom ops and local functions (#19043)
### Description

QNN_Nuget_Windows does not allow ai.onnx.ml operators but the test
test_custom_op_local_function is using LabelEncoder. The operator can be
removed as the test is only checking custom ops api.

### Motivation and Context

Fix test test_custom_op_local_function in QNN_Nuget_Windows pipeline.
2024-01-10 16:36:50 +01:00
PeixuanZuo
5f3113ecd6
[ROCm] Fix hipify error: fast_divmod.h: No such file or directory (#19060)
Fix error:
```
[ 48%] Built target onnxruntime_optimizer

In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5:
/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory
   11 | #include "core/providers/rocm/shared_inc/fast_divmod.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```

This error is due to onnxruntime_optimizer missing dependencies on
hipify generated files.
2024-01-10 14:49:19 +08:00
Xu Xing
ed0f26d3d4
[js/webgpu] Revert parse norm attributes (#19074)
This resolves the below build errors:
```
lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS2724: '"./ops/instance-norm"' has no exported member named 'parseInstanceNormAttributes'. Did you mean 'InstanceNormAttributes'?

19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:19:23 - error TS6133: 'parseInstanceNormAttributes' is declared but its value is never read.

19 import {instanceNorm, parseInstanceNormAttributes} from './ops/instance-norm';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS2305: Module '"./ops/layer-norm"' has no exported member 'parseLayerNormAttributes'.

20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
                      ~~~~~~~~~~~~~~~~~~~~~~~~

lib/wasm/jsep/webgpu/op-resolve-rules.ts:20:20 - error TS6133: 'parseLayerNormAttributes' is declared but its value is never read.

20 import {layerNorm, parseLayerNormAttributes} from './ops/layer-norm';
```
2024-01-09 20:58:50 -08:00
Baiju Meswani
730df1bfa2
Increase MacOS pipeline timeout (#19072) 2024-01-09 18:35:21 -08:00
Changming Sun
b25980c011
Disable rust pipeline for now (#19067)
### Description
They are not working. When we have time to continue working on it, we
can restore them from git history.
2024-01-09 17:09:31 -08:00
Wanming Lin
fa14dcd2b6
[WebNN EP] Support subgraph of the control flow nodes (#18923)
This PR also processes the subgraph's initializers. The subgraph doesn't
contain all of its required initializers; some common initializers are
stored in its ancestor graphs. We need to collect all required
initializers and remap them to the subgraph.
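
The collection step can be sketched in Python as a walk from the subgraph up through its ancestors. This is a minimal sketch under assumptions: the `Graph` class, its fields, and `collect_initializers` are hypothetical stand-ins, not the WebNN EP's data structures.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Graph:
    """Hypothetical stand-in for a graph that may be nested inside a parent."""
    initializers: dict = field(default_factory=dict)   # name -> tensor
    required: set = field(default_factory=set)         # names used by this graph's nodes
    parent: Optional["Graph"] = None


def collect_initializers(subgraph: Graph) -> dict:
    """Resolve each required name in the subgraph first, then in its ancestors."""
    resolved = {}
    for name in subgraph.required:
        graph = subgraph
        while graph is not None:
            if name in graph.initializers:
                resolved[name] = graph.initializers[name]
                break                                   # nearest scope wins
            graph = graph.parent
    return resolved
```

Names that no graph in the chain defines are simply left out of the result, on the assumption that they are regular inputs rather than initializers.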
2024-01-09 15:07:54 -08:00
Xu Xing
76dfe5347c
[js/webgpu] Support uniforms for instance-norm (#18929)
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
2024-01-09 14:56:00 -08:00
Milos Puzovic
37ac9d391c
Enable Arm Compute Library 23.08 (#17672)
### Description

This PR enables onnxruntime to build with the most recent release of Arm
Compute Library

### Motivation and Context

The latest version of Arm Compute Library that onnxruntime can build with
is 20.02, which is more than 3 years old.
2024-01-09 14:10:25 -08:00
Changming Sun
a2afd92093
Format TS code (#19066)
### Description
Format code
2024-01-09 13:41:10 -08:00
Ashwini Khade
897a4163d7
Update transformer version for training CIs (#19046)
### Description
Updating version to resolve security vulnerability.
2024-01-09 12:00:34 -08:00
Yifan Li
574c7caf3a
[TensorRT EP] Clear constrain of trt plugin with different input type (#19044)
### Description
Add heterogeneous support to skip this check for TRT plugin which has
different input tensor types



2024-01-09 10:29:06 -08:00
zesongw
ad6dd0a597
[WebNN] Enable npm unit tests (#18486)
### Description
- Support more test cases for WebNN EP in suite-test-list.jsonc
- Add DISABLE_WEBNN flag in build.ts as preparing for WebNN EP release
- Add test option: '--webnn-device-type' in test-runner-args-cli.ts to
support running WebNN 'gpu' deviceType
- Use Chrome Stable as default browser for WebNN testing to unblock the
CI limitation.
2024-01-09 10:10:57 -08:00
Xu Xing
557ac74c05
[js/webgpu] Support gemm uniforms (#19056)
2024-01-09 09:57:06 -08:00
Xu Xing
42ba2aed54
[js/webgpu] Support pad uniforms (#19057)
2024-01-09 09:34:56 -08:00
Xu Xing
eb92681bfb
[js/webgpu] Support range uniforms (#19055) 2024-01-09 09:33:57 -08:00
junchao-loongson
c1367ae553
Sqnbitgemm: add loongarch64 code path (#18775)
### Description

Add support code for loongarch64 platform in sqnbitgemm

```
100% tests passed, 0 tests failed out of 7

Total Test time (real) = 116.99 sec
2023-12-11 10:43:21,287 build [INFO] - Build complete

```
2024-01-09 09:20:45 -08:00
Xu Xing
dee6a5b371
[js/webgpu] Support uniforms for attention and multihead attention (#18903) 2024-01-09 07:46:30 -08:00
Changming Sun
ab897a4a40
Remove Windows ARM32 from nuget packaging pipelines (#19049)
### Description
1. Remove Windows ARM32 from nuget  packaging pipelines

2. Add missing component-governance-component-detection-steps.yml to
some build jobs.

### Motivation and Context
Stop supporting Windows ARM32 to align with [Windows's support
policy](https://learn.microsoft.com/en-us/windows/arm/arm32-to-arm64).
Users who need this feature still can build the DLLs from source.
However, later on we will remove that support too.
2024-01-09 07:45:03 -08:00
pengwa
7cb8b20db2
Remove mem consuming test case to unblock running ci on lower-end gpu (#19059)
2024-01-09 20:05:34 +08:00
zesongw
eb35896ede
[WebNN EP] Update WebNN normalization ops (#18817)
Use batchNormalization, layerNormalization and instanceNormalization
instead of meanVarianceNormalization to implement normalization Ops. The
spec of meanVarianceNormalization has been deleted.
Remove groupNormalization.
2024-01-08 22:02:44 -08:00
Changming Sun
68c29ece23
In a Linux or Android build check if the compiler support bfloat16 and float16 (#18813)
### Description
Restrict clang version because we have an upcoming change that requires
clang version >=16 , which will mainly affect Android build.
2024-01-08 19:46:33 -08:00
Xu Xing
8f024b7394
[js/webgpu] Support uniforms for layer-norm (#18755) 2024-01-08 18:16:25 -08:00
Guenther Schmuelling
a8bb1df331
[js/webgpu] fix heap access > 2GB (#19010) 2024-01-08 17:58:38 -08:00
Jeff Bloomfield
975a315cd7
Fix x86 build error in GraphDescBuilder.cpp affecting packaging pipeline (#19045)
### Description
This addresses a 32 bit build error affecting the packaging pipeline



2024-01-08 17:49:19 -08:00