Commit graph

11997 commits

Author SHA1 Message Date
Edward Chen
f1be92faf0
Patch fp16 to fix Xcode 16 builds with XNNPACK EP targeting x86_64. (#22294) 2024-10-03 14:17:15 -07:00
Yi Zhang
bbb54985a8
Add MaxPool FP16 in XnnPack EP (#22258)
### Description
Add support for FP16 kernels in the XnnPack execution provider for
MaxPool operations.
Fixes:
[AB#50332](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50332)

### Motivation and Context
The major purpose of this pull request is to add some common
vars/functions and setup a consistent style for adding FP16 kernels in
XnnPack EP.

---------
2024-10-03 18:28:58 +08:00
Caroline Zhu
c73e6afa6c
Migrate Android Java E2E tests from App Center to Browserstack (#22117)
### Description
- removed installing AppCenter + pipeline step that runs AppCenter
Espresso tests
- added script for running AppCenter tests

### Motivation and Context
App Center is getting deprecated in the next year + we have upcoming
Android work that depends on working E2E testing.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-10-02 15:04:58 -07:00
Dmitri Smirnov
224f0651d0
[C#] Expose Multi-Lora support in C# (#22281)
### Description


### Motivation and Context
https://github.com/microsoft/onnxruntime/pull/22046
2024-10-02 10:00:43 -07:00
goldsteinn
4e15b229a0
ThreadPool: Spend less time busy waiting. (#21545)
The purpose of the patch is primarily to save power, but it also has
nice perf benefits (mostly from allowing the system to better distribute
power to cores doing meaningful work).

Changes are twofold:

1)    Decrease WorkerLoop spin count dramatically ~10^6 -> ~10^4. The
       reality is after ~10^4 spins, if there hasn't been any new work
       added its unlikely any new work is imminent so sleep to
       preserve power. This aligns more closely with upstream EigenV3.

2)   Use exponential backoff for waiting on memory. This saves a bit
       more power, and important increases the time between iterations
       in WorkerLoop to help accomidate the dramatically lowering spin
       counts.

Since the tuning for both the iteration counts / backoff counts are
dramatically different for hybrid/non-hybrid systems, this patch
templates the affected functions and dynamically choses based on
`CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of
getting the change in, although its likely we could incur less dynamic
overhead if we added the template argument to the entirety of
`ThreadPoolTempl`.

Measured performance on an [Intel Meteor Lake
CPU](https://www.intel.com/content/www/us/en/products/sku/237329/intel-core-ultra-7-processor-165u-12m-cache-up-to-4-90-ghz/specifications.html)
across a range of models.

Below are the result of 3 runs with each metric being the
value-before-patch / value-after-patch (so for something like inference
time, lower is better).
<div align="center">
<table>
<tr>
<th>Session creation time cost</th>
<td>0.7179</td>
</tr>
<tr>
<th>First inference time cost</th>
<td>0.7156</td>
</tr>
<tr>
<th>Total inference time cost</th>
<td>1.0146</td>
</tr>
<tr>
<th>Total inference requests</th>
<td>0.8874</td>
</tr>
<tr>
<th>Average inference time cost</th>
<td>0.8800</td>
</tr>
<tr>
<th>Total inference run time</th>
<td>1.0146</td>
</tr>
<tr>
<th>Number of inferences per second</th>
<td>0.8955</td>
</tr>
<tr>
<th>Avg CPU usage</th>
<td>0.9462</td>
</tr>
<tr>
<th>Peak working set size</th>
<td>0.9922</td>
</tr>
<tr>
<th>Runs</th>
<td>1.1552</td>
</tr>
<tr>
<th>Min Latency</th>
<td>0.7283</td>
</tr>
<tr>
<th>Max Latency</th>
<td>0.9258</td>
</tr>
<tr>
<th>P50 Latency</th>
<td>0.9534</td>
</tr>
<tr>
<th>P90 Latency</th>
<td>0.9639</td>
</tr>
<tr>
<th>P95 Latency</th>
<td>0.9659</td>
</tr>
<tr>
<th>P99 Latency</th>
<td>0.9640</td>
</tr>
</table>
</div>

So the net result is a 1.16x improvement in throughput and between
1.08-1.37x improvement in latency.
2024-10-01 17:25:02 -07:00
Adam Pocock
14d1bfc34b
[java] Multi-LoRA support (#22280)
### Description
Java parts of Multi-LoRA support - #22046.

### Motivation and Context
API equivalence with Python & C#.

---------

Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
2024-10-01 13:54:37 -07:00
Dmitri Smirnov
1fc2b94644
Address Android warning error (#22285)
### Description
<!-- Describe your changes. -->

### Motivation and Context
Build issue
https://github.com/microsoft/onnxruntime/pull/22046#issuecomment-2386414899
2024-10-01 13:52:25 -07:00
Edward Chen
c24e55b1f1
[Java] Add API for appending QNN EP (#22208)
- Add Java API for appending QNN EP
- Update Java unit test setup
  - Fix issues with setting system properties for tests
  - Unify Windows/non-Windows setup to simplify
2024-10-01 10:18:04 -07:00
Tianlei Wu
e2b9ccc44a
Update SAM2 benchmark for testing torch compile modes and profiling (#22279)
This pull request introduces several enhancements to the benchmarking
process for the SAM2 model, including:
(1) Add profiling capabilities.
(2) test torch compile modes (none will disable compile and fallback to
eager mode)
(3) Update README for setting up the environment.

### Documentation Updates:
* README.md: Updated instructions to create separate conda environments
for GPU and CPU benchmarking, and detailed the parameters and outputs of
the benchmark script.

### Benchmark Script Enhancements:
* benchmark_sam2.py: Added optional parameters for enabling NVTX and
PyTorch profiling, and adjusted the initialization and execution flow to
incorporate these profiling options.

These changes enhance the flexibility and functionality of the
benchmarking process, making it easier to profile and benchmark the SAM2
model on different hardware configurations.
2024-10-01 09:51:12 -07:00
Yufeng Li
96e9c99dce
remove neural-speed (#22236)
### Description
<!-- Describe your changes. -->
NS is not developed anymore and ORT doesn't use it for int4 inference
either. Remove it to clean up the code


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-10-01 09:50:44 -07:00
kunal-vaishnavi
50bda44a70
Fix equation in MatMulNBits op spec (#22253)
### Description
This PR fixes an equation in the MatMulNBits op spec. The old formula is
stated as

```
[CeilDiv((N * n_blocks_per_col + 1) * bits, 8)]
```

but it should be stated as

```
[N * CeilDiv(n_blocks_per_col * bits, 8)]
```

or as

```
[N * FloorDiv((n_blocks_per_col + 1) * bits, 8)]
```

### Motivation and Context
For models such as ChatGLM where the column size is odd, the division
math can be off. For example:


![image_360](https://github.com/user-attachments/assets/a5035bec-4dad-46af-9cb1-24a881eb70a0)

With the old equation, the projections are calculated as follows.

```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
 
4,096 * CeilDiv((107 + 1) * 4, 8) = 4,096 * CeilDiv(108 * 4, 8) = 4,096 * 54 = 221,184

# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
 
13,696 * CeilDiv((32 + 1) * 4, 8) = 13,696 * CeilDiv(33 * 4, 8) = 13,696 * 17 = 232,832
```

With the new equation, the projections are calculated as follows.

```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
 
4,096 * CeilDiv(107 * 4, 8) = 4,096 * 54 = 221,184

# Up projection
B = 13,696 x 32 x 64
zero_points= 219,136
N = 13,696
n_blocks_per_col = 32
 
13,696 * CeilDiv(32 * 4, 8) = 13,696 * 16 = 219,136
```
2024-10-01 09:31:56 -07:00
Mauricio A Rovira Galvez
ffca096b5a
Fixes a crash on macOS 15 when using CoreML. (#22277)
### Description
In macOS 15, apps running with CoreML will crash with an error message
like this one:
```
Terminating app due to uncaught exception 'NSGenericException', reason: 'Failed to set compute_device_types_mask E5RT: Cannot provide zero compute device types. (1)'
```

This can be easily seen when building ONNXRuntime from source and
running the unit tests. The fix was suggested in [this bug
report](https://forums.developer.apple.com/forums/thread/757040).
I've ported the change to ONNXRuntime and verified that:
* The issue is resolved in macOS 15 (all unit tests pass).
* The behaviour is unchanged in macOS 14. 


### Motivation and Context
This fixes #22275 allowing apps using ONNXRuntime with CoreML to work
normally.
2024-10-01 16:06:03 +10:00
Scott McKay
ee7081b828
Fix syntax for some CoreML ML Program supported operator entries (#22268)
### Description
<!-- Describe your changes. -->
Fix syntax so usability checker works as expected.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-10-01 15:49:43 +10:00
Yang Gu
9e5153b688
[js/webgpu] Manage model download with a specific unittest option (#22214)
Currently in debug mode, unit test will always download models to local
file system, which is a bit annoying. This PR fixes this by adding a
specific option to enable model download.
2024-09-30 18:27:43 -07:00
Yang Gu
c75f4a09b7
[js/webgpu] Remove the limitation on axis in softmax (#22231)
In current implementation, axis in softmax has to be the last, which is
an obvious limitation. This PR removes this limitation and will fix
issues #20710 and #22176.
2024-09-30 18:27:11 -07:00
Dmitri Smirnov
d9de054eb5
Multi-Lora support (#22046)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-09-30 15:59:07 -07:00
Jian Chen
40bcb7664d
Revert "Jar Maven Signing - GnuPG and sha256" (#22273)
Reverts microsoft/onnxruntime#22217
2024-09-30 15:07:59 -07:00
Jian Chen
ebcf2fcd16
Replace gradle/wrapper-validation-action with gradle/actions/wrapper-validation-action (#22224)
### Description
Replace gradle/wrapper-validation-action with
gradle/actions/wrapper-validation-action


### Motivation and Context
This is recommended by
https://github.com/gradle/wrapper-validation-action. This job uses
deprecated functionality from the 'gradle/wrapper-validation-action'
action.
2024-09-30 14:29:16 -07:00
Ranjit Ranjan
812075731c
[AIX] Build fix for using system installed protobuf/onnx (#22272)
### Description
To fix the build issues for AIX OS while using system installed
protobuf/onnx.

### Motivation and Context
Code changes in this PR contains:

1. Fix for below compilation issue.
```
collect2: fatal error: library liblibprotobuf-lite not found
compilation terminated.
```
2.  Adding onnx library into dependency list for test applicaitons.
2024-09-30 12:36:21 -07:00
Yi Zhang
d069475a63
Make A100 jobs in PR checks again (#22261)
### Description
if the variable is 1, the job running on A100 in PR checks.
Fixes
[AB#50333](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50333)


### Motivation and Context
We wish more big models which need to run on A100 can be tested in PR
checks, but Azure may decommission A100 agents without notifications
sometimes, which will block merging PRs.
This PR is an improvement of current workaround, making those jobs only
run main branch.
Once we find the A100 are all decommisioned by Azure, we could change
the UseA100 variable to 0 to disable the A100 jobs in PR checks
2024-09-30 08:29:30 -07:00
wejoncy
2cfe1f031d
[CoreML MLProgram] Support Float16 (1/N) (#22068)
### Description
Support Float16 for CoreML MLProgram EP.
Operations:
    "Add", "Mul", "Sub", "Div", "Pow", "Sqrt", "Reciprocal",
"Sigmoid", "Tanh", "Relu", "LeakyRelu", "Concat", "GridSample",
"GlobalAveragePool",
    "Clip", "DepthToSpace", "Resize", "Slice", "Conv",
    "ConvTranspose", "GlobalMaxPool", "Gemm", "MatMul",
    "AveragePool", "MaxPool", "Reshape", "Split", "Transpose"

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-09-30 17:56:47 +08:00
Yang Gu
434f0fa536
[js/webgpu] Fix the crash issue in unsqueeze (#22264)
While allowing axes in unsqueeze to be scalar, its shape couldn't be
always accessed like a vector. This PR fixes issue #22031 so that the
original model could run well.
2024-09-30 02:28:16 -07:00
Yulong Wang
1bda91fc57
[js/webgpu] fix external buffer registration (#22254)
### Description

Fixes the problem of running into failure when GPU inputs shuffled
between iterations.
2024-09-28 10:36:40 -07:00
Enrico Galli
52a8c1cae8
[WebNN EP] Enable IO Bindings with MLTensor (#21301)
### Description
Enables using the MLTensor to pass data between models. 


### Motivation and Context
Using MLTensor instead of ArrayBuffers reduces the number of copies
between the CPU and devices as well as the renderer and GPU process in
Chromium.
2024-09-27 17:24:21 -07:00
Patrice Vignola
ebda23be16
[DML EP] Fix Clip clamping (#22251)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-09-27 16:24:37 -07:00
shiyi
1e3cd86d80
[WebNN EP] Support LSTM op (#20293)
<!-- Describe your changes. -->




<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-09-27 14:23:08 -07:00
liqun Fu
f410e7c4cf
Fix mlas bench crash (#22248)
Fix mlas bench crash

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2024-09-27 13:50:42 -07:00
Sumit Agarwal
529835cc46
[DML EP] Update DML to 1.15.2 (#22247)
### Description
Update DML binary to the current latest redist version
[1.15.2](https://www.nuget.org/packages/Microsoft.AI.DirectML/1.15.2).
2024-09-27 13:20:29 -07:00
Patrice Vignola
20be51525b
Support if node with sequence outputs (#22234)
`If` nodes can have sequence outputs. Those nodes are mapped to the DML
EP to be able to keep the outputs on the GPU, but they actually execute
on the CPU by selecting either the `then` subgraph or the `else`
subgraph.
2024-09-27 12:40:01 -07:00
Patrice Vignola
14ba2fb83c
[DML EP] Add intermediate tensor dumping for DML (#22246)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-09-27 12:39:45 -07:00
Hector Li
6e3163faa5
Update code regarding some QNN bug fixes (#22222)
### Description
Update code regarding some QNN bug fixes:
1. QnnProfile_ExtendedEventData_t.version is not initialized in Qnn
2. Failed to finalize the graph for HardSigmoid with FP16 precision
2024-09-27 09:51:47 -07:00
Kyle
b81e76b9a6
Jar Maven Signing - GnuPG and sha256 (#22217)
### Description
<!-- Describe your changes. -->
Jar maven signing: 
- GnuPG 
- sha256.

Jar packages artifacts: 
- onnxruntime-android-full-aar
- onnxruntime-java
- onnxruntime-java-gpu


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Previously, it is manually signed. 
Goal: make it automatically.
2024-09-27 17:50:06 +08:00
Tianlei Wu
ff8a48ef3b
Update SAM2 benchmark script and doc (#22238)
(1) Fix a bug of parameters order.
(2) Update benchmark script: 
* download test image if not exist
* combine multiple csv files into one file, and remove duplicated lines
(3) Add a section for benchmark in README.md
2024-09-26 20:57:03 -07:00
Scott McKay
3846f84218
Increase React Native E2E (#22230)
### Description
<!-- Describe your changes. -->
Increase the detox setup timeout to 4 minutes. 

The iOS RN E2E tests are taking slightly around 2 mins to setup causing
flakiness.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve RN CI pass rate
2024-09-27 08:59:36 +10:00
Tianlei Wu
2deab75d39
Add numeric_limits for float8 types (#22228)
Add std::numeric_limits for float8 data types to provide a consistent
way to access limits of those types.

Reference:
* https://onnx.ai/onnx/technical/float8.html
2024-09-26 14:42:36 -07:00
Jing Fang
1942e40e05
[ARM64] MatMulNBits: use neon instrinsics to convert between fp16 and fp32 (#22195)
### Description
For fp16 Atype, the fallback operation is convert the data to fp32 and
calculate.
Added neon intrinsics version to speed up the conversion.

Store address alignment and loop unrolling have insignificant impact on
latency so they are omitted.

|Benchmark | Time | CPU |

|--------------|---------------------------------------------|--------------------|
|M_ConvertF16ToF32/baseline/real_time | 1076961 ns | 1083398 ns |
|M_ConvertF16ToF32/aligned:0/real_time | 46785 ns | 46516 ns |
|M_ConvertF16ToF32/aligned:1/real_time | 46631 ns | 46391 ns |
|M_ConvertF16ToF32_unroll2/aligned:0/real_time | 44074 ns | 44392 ns |
|M_ConvertF16ToF32_unroll2/aligned:1/real_time | 44726 ns | 45226 ns |
|M_ConvertF32ToF16/baseline/real_time | 520109 ns | 527329 ns |
|M_ConvertF32ToF16/aligned:0/real_time | 73610 ns | 74015 ns |
|M_ConvertF32ToF16/aligned:1/real_time | 71557 ns | 71525 ns |
|M_ConvertF32ToF16_unroll2/aligned:0/real_time | 64227 ns | 63374 ns |
|M_ConvertF32ToF16_unroll2/aligned:1/real_time | 67428 ns | 67989 ns |



### Motivation and Context
speed up fallback implementation of Fp16 MatMulNBits
2024-09-26 13:55:40 -07:00
jingyanwangms
d0b0ecfdb9
[Running CI] Update TensorRT to 10.4 (#22049)
### Description
TensorRT 10.4 is GA now, update to 10.4



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-09-26 11:10:52 -07:00
Tianlei Wu
7880342e5e
Add numeric_limits for MLFloat16 and BFloat16 (#22197)
### Description
* Add std::numeric_limits for MLFloat16 and BFloat16.
* Update some comments in csharp ORTFloat16.shared.cs.
* Add unit tests (including Clip)

Note that the canonical NaN is not consistent in C++ and C#. C# uses
negative quiet NaN as canonical NaN, while C++ uses positive quiet NaN.
The choice of CSharp Float16.NaN is to be consistent with
System.Half.NaN.

FP16 data returns from CUDA might have 7FFF as NaN; FP16 data from CPU
provider might have 0x7E00 as NaN. Anyway there is no consistent
canonical NaN in ORT right now. Because all these NaNs are aligned with
IEEE spec, there shall not an issue in downstream.

### Motivation and Context
std::numeric_limits is used in codebase but not defined for MLFloat16
and BFloat16. It causes some bugs like
https://github.com/microsoft/onnxruntime/issues/21957 introduced by
https://github.com/microsoft/onnxruntime/pull/21493.
2024-09-25 17:10:05 -07:00
liqun Fu
72b0979e8a
Fix a wrong assignment that causing mlas benchmark to crash (#22221)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2024-09-25 15:53:28 -07:00
saurabh
4d6019fa02
OVEP: Tensor caching fix (#22218)
### Description
1. changing the emplace to [] that does have a difference, emplace will
only create a new entry if it doesn't already exist in the map
2. change the logic of the caching lookup to key off of input/output
names instead of ort raw ptrs.
3. changes OV tensor creation for CPU allocated input/output ORT
tensors. The CPU allocated input/output tensor path was re-allocating OV
tensors based on the ORT input/output tensors. So we'd get 2 copies: ORT
input/output tensor -> OV tensor (OVEP) -> NPU Tensor (NPU plugin).

---------

Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
2024-09-25 14:58:04 -07:00
Hector Li
50d9612bc0
change shared_ptr to unique_ptr to make the ownership clear (#22209)
### Description
change shared_ptr to unique_ptr to make the ownership clear.
2024-09-25 12:58:46 -07:00
Claude
3494f80e83
Check if HTMLCanvasElement exists (i.e. we are not running in a webworker) (#22153)
This fixes #22152


### Description
Tensor.fromImage fails in a webworker context, because HTMLCanvasElement
does not exist:

> HTMLCanvasElement is not defined



### Motivation and Context
This fixes #22152

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-09-25 11:52:52 -07:00
Adrian Lizarraga
a47254eaef
Remove empty (DQ -> Q -> graph output) sequence in TransposeOptimizer (#22172)
### Description
Updates the TransposeOptimizer to also remove empty (DQ -> Q) sequences
that occur at a graph output. An empty DQ->Q sequence results from a
Transpose being optimized out.

Consider the following example model:

![image](https://github.com/user-attachments/assets/4e7bc4eb-ea8a-463b-9672-c4ec5ef779b2)

The TransposeOptimizer removes the final Transpose and leaves an empty
DQ->Q->output_0 sequence. This PR ensures that the final DQ->Q is also
removed.

### Motivation and Context
Models with quantized output can run on QNN EP. The inference latency of
a customer model is impacted by the unnecessary DQ->Q sequence at the
output.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-09-24 21:02:17 -07:00
Caroline Zhu
ee6a91533c
Add BrowserStack mention to project ReadMe (#22207)
### Description
Condition for [BrowserStack support for open-source
projects](https://www.browserstack.com/open-source)

### Motivation and Context
- Considering using BrowserStack for our end-to-end tests for iOS and
Android
2024-09-24 17:14:14 -07:00
Adrian Lizarraga
7811839265
[QNN EP] Always fuse (DQ->Q) to a QNN Convert operator (#22205)
### Description
Previously, we only fused (DQ -> Q) into a QNN Convert if the
quantization types differed (e.g., converting uint8 to uint16). This PR
always fuses DQ -> Q regardless of the quantization type because a
single QNN Convert op is faster than two separate ops.

Example fusions:
- [CURRENTLY SUPPORTED] Convert uint8 to uint16:
  - `uint8 -> DQ -> Q -> uint16` becomes `uint8 -> Convert -> uint16`
- [CURRENTLY SUPPORTED] Convert uint16 to uint8:
  - `uint16 -> DQ -> Q -> uint8` becomes `uint16 -> Convert -> uint8`
- [NEW] Convert uint8 (zp0, scale0) to uint8 (zp1, scale1):
- `uint8(zp0/scale0) -> DQ -> Q -> uint8(zp1/scale1)` becomes
`uint8(zp0/scale0) -> Convert -> uint8(zp1/scale1)`
- [NEW] Convert uint16 (zp0, scale0) to uint16 (zp1, scale1):
- `uint16(zp0/scale0) -> DQ -> Q -> uint16(zp1/scale1)` becomes
`uint16(zp0/scale0) -> Convert -> uint16(zp1/scale1)`


### Motivation and Context
The Transpose optimizer will normally remove empty DQ->Q sequences if
the quantization params are equal. However, for cases in which the
quantization params are not equal, QNN EP should convert DQ->Q to a
single QNN Convert op for performance. This affects a customer model.
2024-09-24 15:51:32 -07:00
amarin16
eb2506d77a
Add MLFloat16 support for LayerNormalization, SkipLayerNormalization (#22063)
Add `MLFloat16` support for:
- `LayerNormalization`
- `SimplifiedLayerNormalization`
- `SkipLayerNormalization`
- `SkipSimplifiedLayerNormalization`

There are existing `LayerNormTest` unit tests that cover the `MLFloat16`
functionality for `LayerNormalization` once `MLFloat16` is registered
(for example
[`LayerNormTest.LayerNorm_Scale_Float16Input`](91c916f9c6/onnxruntime/test/contrib_ops/layer_norm_op_test.cc (L112))).

Similarly, there are unit tests such as
[`SkipLayerNormTest.SkipLayerNormBatch1_Float16`](91c916f9c6/onnxruntime/test/contrib_ops/skiplayernorm_op_test.cc (L255))
that cover MLFloat16 inputs for `SkipLayerNormalization`.
2024-09-24 15:06:27 -07:00
chenduan-amd
61996332ad
[VitisAI] support run_options in vitisai EP end (#22029)
### Description
add OnRunStart() method for Vitis AI execution provider



### Motivation and Context
To dynamically obtain some runtime parameters during execution, use
run_options within the Vitis AI execution provider (EP).
2024-09-24 14:37:05 -07:00
George Wu
7727b4b909
[TensorRT EP] update gen_trt_engine_wrapper_onnx_model.py script (#22184)
update script which was using deprecated num_bindings to num_io_tensors
tested on an engine dumped by trtexec and loaded the engine using
onnxruntime-gpu 1.19.2 python package.
2024-09-24 14:34:05 -07:00
Ye Wang
6cc06ad069
GQA MLFloat16 cpu (#22102)
### Description
<!-- Describe your changes. -->


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Your Name <you@example.com>
2024-09-24 09:51:59 -07:00
Hector Li
5fa4505d1b
Set enable_htp_fp16_precision default to true (#22186)
### Description
Set enable_htp_fp16_precision default to true for HTP backend.
2024-09-24 09:37:53 -07:00