### Description
DML Implementation for
[com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat)
```
.\onnxruntime_test_all.exe --gtest_filter="*MatMulIntegerToFloat.*"
Note: Google Test filter = *MatMulIntegerToFloat.*
[==========] Running 22 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 22 tests from MatMulIntegerToFloat
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms)
[ RUN ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint
[ OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms)
[----------] 22 tests from MatMulIntegerToFloat (8679 ms total)
[----------] Global test environment tear-down
[==========] 22 tests from 1 test suite ran. (8680 ms total)
[ PASSED ] 22 tests.
memleakdbg:
----- No memory leaks detected -----
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* Added `CalculateMatMulIntegerToFloat` to replace the CPU EP run as the reference
* Added more FP32 test cases to isolate all input datatype combinations
* Added fixed input to the `MatMulIntegerToFloat_FP16*` (FP16) test cases
* `onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of
generating FP16 models, but we do not produce any for now
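For readers unfamiliar with the operator, the computation a reference implementation has to reproduce can be sketched in NumPy as follows. This is only an illustration with per-tensor scales and zero points; the function name `matmul_integer_to_float_ref` is hypothetical and is not the `CalculateMatMulIntegerToFloat` helper from the test code.

```python
import numpy as np

def matmul_integer_to_float_ref(a, b, a_scale, b_scale,
                                a_zero_point=0, b_zero_point=0, bias=None):
    """Hypothetical reference: dequantized matmul with optional bias."""
    # Widen to int32 before subtracting zero points to avoid 8-bit overflow.
    a32 = a.astype(np.int32) - np.int32(a_zero_point)
    b32 = b.astype(np.int32) - np.int32(b_zero_point)
    # Integer matmul, then scale the accumulator into float.
    y = (a32 @ b32).astype(np.float32) * np.float32(a_scale) * np.float32(b_scale)
    if bias is not None:
        y = y + bias.astype(np.float32)
    return y

# U8 activations with a zero point, S8 weights without one.
a = np.array([[1, 2], [3, 4]], dtype=np.uint8)
b = np.array([[-1, 0], [2, 1]], dtype=np.int8)
y = matmul_integer_to_float_ref(a, b, a_scale=0.5, b_scale=2.0, a_zero_point=1)
```

The mixed U8/S8 inputs mirror the datatype combinations that the test names above enumerate.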
### Description
Adds the optional parameters `inputs_to_make_channel_last` and
`outputs_to_make_channel_last` to the `qnn_preprocess_model()` function.
```python
"""
inputs_to_make_channel_last: List of graph input names to transpose to be "channel-last". For example,
if "input0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change input0's
shape to (N, D1, D2, ..., Dn, C) and add a transpose node after it.
Original:
input0 (N, C, D1, D2, ..., Dn) --> <Nodes>
Updated:
input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>
This can potentially improve inference latency for QDQ models running on QNN EP because the
additional transpose node may allow other transpose nodes inserted during ORT layout transformation
to cancel out.
outputs_to_make_channel_last: List of graph output names to transpose to be "channel-last". For example,
if "output0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change output0's
shape to (N, D1, D2, ..., Dn, C) and add a transpose node before it.
Original:
<Nodes> --> output0 (N, C, D1, D2, ..., Dn)
Updated:
<Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)
This can potentially improve inference latency for QDQ models running on QNN EP because the
additional transpose node may allow other transpose nodes inserted during ORT layout transformation
to cancel out.
"""
```
**NOTE: If you use these options with the quantization scripts, you'll
have to make sure your data_reader feeds in transposed input data. It
won't happen automatically.**
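As a rough illustration of what "feeding transposed input data" could look like, the sketch below converts a channel-first array to channel-last with NumPy. The helper name `to_channel_last` is hypothetical; how it is wired into a particular data reader is left out.

```python
import numpy as np

def to_channel_last(x):
    """Move the channel axis: (N, C, D1, ..., Dn) -> (N, D1, ..., Dn, C)."""
    return np.ascontiguousarray(np.moveaxis(x, 1, -1))

# Example: an image batch stored as NCHW becomes NHWC.
nchw = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
nhwc = to_channel_last(nchw)
```

A data reader would apply such a conversion to every input listed in `inputs_to_make_channel_last` before returning the feed.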
### Motivation and Context
Native QNN operators use the channel-last data layout, but ONNX uses
channel-first. To bridge the gap, ORT's layout transformer inserts
transposes around layout-sensitive nodes and updates their domain to
indicate that they now operate on channel-last data. The transpose
optimizer is able to remove most of these inserted transposes, but not
all transposes can always be removed (i.e., some could remain at the
graph's inputs and outputs).
We've found that these extra transpose nodes can significantly degrade
inference latency on QNN EP. One workaround (provided by this PR) is to
add _additional_ transpose nodes at the graph inputs or outputs. These
additional nodes can often help the ORT transpose optimizer cancel out
any remaining transpose nodes, which significantly improves latency.
Additionally, it may make more sense for some kinds of inputs to just be
in channel-last form (e.g., images), avoiding the need to pre-transpose
the input data before inference.
Example at the input:
```
Original:
input0 (N, C, D1, D2, ..., Dn) --> <Nodes>
Updated:
input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>
```
Example at the output:
```
Original:
<Nodes> --> output0 (N, C, D1, D2, ..., Dn)
Updated:
<Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)
```
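The shapes in the two diagrams imply the following permutations for the added Transpose nodes. This is a sketch derived from those shapes only; the helper names are hypothetical and are not taken from the PR.

```python
import numpy as np

def perm_to_channel_first(rank):
    """Permutation for (N, D1, ..., Dn, C) -> (N, C, D1, ..., Dn)."""
    return [0, rank - 1] + list(range(1, rank - 1))

def perm_to_channel_last(rank):
    """Permutation for (N, C, D1, ..., Dn) -> (N, D1, ..., Dn, C)."""
    return [0] + list(range(2, rank)) + [1]

# Input side: the graph input is channel-last, the nodes expect channel-first.
input0 = np.empty((2, 5, 7, 3))  # (N, D1, D2, C)
input0_chanfirst = np.transpose(input0, perm_to_channel_first(input0.ndim))

# Output side: the nodes produce channel-first, the graph output is channel-last.
output0 = np.transpose(input0_chanfirst, perm_to_channel_last(input0_chanfirst.ndim))
```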
### Description
Add constraints to MatMul:
- Both inputs must be at least 2D.
- CPU backend: Both inputs must have the same rank.
- CPU backend: The input shapes must match in all but the last two
axes.
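The constraints above can be read as a simple shape check. The sketch below is an illustration only; the function name and the `cpu_backend` flag are hypothetical and do not correspond to the actual validation code.

```python
def matmul_inputs_supported(shape_a, shape_b, cpu_backend=False):
    """Hypothetical check mirroring the MatMul constraints listed above."""
    # Every input must be at least 2D.
    if len(shape_a) < 2 or len(shape_b) < 2:
        return False
    if cpu_backend:
        # Ranks must match.
        if len(shape_a) != len(shape_b):
            return False
        # All dims except the last two must match.
        if tuple(shape_a[:-2]) != tuple(shape_b[:-2]):
            return False
    return True

# A broadcasted 3D x 2D MatMul passes the general check but not the CPU one.
general_ok = matmul_inputs_supported((2, 3, 4), (4, 5))
cpu_ok = matmul_inputs_supported((2, 3, 4), (4, 5), cpu_backend=True)
```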
### Motivation and Context
Prevent regression for some models.
### Description
CPUINFO was disabled in PR #9065 for the following reason:
"api-ms-win-core-kernel32-legacy-*.dll wasn't available in Windows 8
and was added in Windows 10, so cpuinfo breaks our Windows 8 support.
I'm disabling it again."
We no longer support Windows 8, so we can add CPUINFO back.
### Motivation and Context
To make the code simpler. If the library doesn't work as
expected, we can submit a PR to its code base and fix it.
### Description
<!-- Describe your changes. -->
Temporarily disable fp16 Gemm on CPU because it usually needs a
following Cast, which offsets the gain. More fp16 operator
implementations and performance tuning are needed first.
Also fix a fusion error in LayerNormalization.
### Description
Change the WebGPU CI pipeline to use a preinstalled Chrome, which should
improve stability. The Chrome obtained through Puppeteer often fails
to start.
### Follow up fix for Gelu impl
Address the two minor review comments left in
https://github.com/microsoft/onnxruntime/pull/19560.
### Description
<!-- Describe your changes. -->
Add support for:
- Clip/Relu/Relu6
- Add/Mul/Div/Sub/Pow
- GlobalAveragePool/GlobalMaxPool/AveragePool/MaxPool
- Reshape
- Gemm/MatMul
Fix some build issues/warnings from changes.
Fix a couple of potential issues with the Resize op as well (noticed due
to change to reject inputs with empty data at a higher level).
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable mobilenetv2 with ML Program
### Loss function extra inputs
Currently, the loss functions in onnxblock expect exactly two inputs in
their build method.
Occasionally, models may pass additional inputs, causing the build
function to fail.
To solve this issue, we can let users pass a list of loss input names to
be used in the loss function.
Achieved a 1.098x speedup in MultiHeadAttention and a 1.021x end-to-end
speedup in the OCR model by parallelizing the
Transpose_BSNH_to_BNSH operation.
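For context, the layout change that this operation performs can be sketched in NumPy as below. The sketch shows only the data movement (batch, sequence, heads, head size to batch, heads, sequence, head size), not the parallelization added by the change; the dimension sizes are arbitrary.

```python
import numpy as np

batch, seq_len, num_heads, head_size = 2, 4, 3, 8

# BSNH: (batch, sequence, num_heads, head_size)
bsnh = np.arange(batch * seq_len * num_heads * head_size,
                 dtype=np.float32).reshape(batch, seq_len, num_heads, head_size)

# BNSH: (batch, num_heads, sequence, head_size); swap the S and N axes.
bnsh = np.ascontiguousarray(bsnh.transpose(0, 2, 1, 3))
```

Each (batch, head) slice of the output is written independently of the others, which is the kind of structure that lends itself to parallel execution.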
### Description
Ensures that DQ and Q ops use the msft domain if tensor quantization
overrides specify 16-bit integer types.
### Motivation and Context
ONNX does not yet support 16-bit integer types for QuantizeLinear and
DequantizeLinear ops (coming soon). For now, such DQ/Q ops must use the MSFT
domain.
We also have to check if tensor quantization overrides force the use of
16-bit quantization types. If so, we must correctly set the domain for
Q/DQ ops.
#19218 tried to fuse Gather/Slice to Split, but the logic has a problem.
A scalar indices value and a 1-dim indices value in a Gather node produce
different results: a scalar value produces a result tensor with
the axis dim removed, while a 1-dim indices value keeps that dim, even when the
dim value is 1. For example,
Node
|-> Gather(indices=[0], axis=axis)
|-> Gather(indices=[1], axis=axis)
|-> Slice(index=2, axis=axis)
is the same as
Node
|-> Split(axis=axis)
But
Node
|-> Gather(indices=0, axis=axis)
|-> Gather(indices=1, axis=axis)
|-> Slice(index=2, axis=axis)
is the same as
Node
|-> Split(axis=axis)
||-> Squeeze(axis=axis)
||-> Squeeze(axis=axis)
||->
The previous PR doesn't take such cases involving Squeeze/Unsqueeze into
account.
This PR merges #19218 and GatherToSplitFusion into a general fusion, which
relaxes the limit on the number of Gather and Slice nodes. It checks all
Gather and Slice consumers; if the indices of the Gather nodes and the start/end of
the Slice nodes cover the specific dim of the input tensor, we can fuse
them into a Split, adding Squeeze where necessary according to the dim
count of the indices tensor in each Gather.
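The scalar versus 1-dim difference can be checked with a small NumPy analogy, where `np.take` stands in for Gather and `np.split` for Split. This only illustrates the shape semantics and is not the fusion code.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)
axis = 1

# Gather with a scalar index removes the axis; a 1-dim index keeps it.
gather_scalar = np.take(x, 0, axis=axis)   # shape (2, 4)
gather_1d = np.take(x, [0], axis=axis)     # shape (2, 1, 4)

# Split always keeps the axis, so each piece has shape (2, 1, 4).
pieces = np.split(x, 3, axis=axis)

# The 1-dim Gather matches a Split output directly; the scalar Gather
# matches only after a Squeeze on the same axis.
squeezed = np.squeeze(pieces[0], axis=axis)
```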
@rui-ren, please check if the fix can still be applied to your model.
### Description
Updated github/issue-labeler permissions to give write access for
issues. Tried to submit the same PR last week, but the checks kept
failing, so I couldn't merge.
### Motivation and Context
Enables issue labeling again, which has been broken since GitHub Actions
permissions were changed a couple weeks ago.
### Description
Use Chromium Headless for WebGPU tests by default. Still use normal
windowed Chromium when debug=true or perfMode=true.
Use the
[`--headless=new`](https://developer.chrome.com/docs/chromium/new-headless)
mode.
### Motivation and Context
Try a more stable way to launch npm tests to avoid a "chrome not
found" issue in the pipeline, which may be caused by the windowed
application.
### Description
- Adds parameters to `qnn_preprocess_model()` to allow saving the new
model with external data.
- Updates `get_qnn_qdq_config()` to:
  - Load model without external data (it is not needed)
  - Return a quantization configuration with `use_external_data_format`
    set to `True` if the model has external data or if the model is >= 2GB.
### Motivation and Context
Update QNN quantization to better handle large models that use external
data.
### Description
It is a "Bash" task that requires running bash on Windows. Most Windows
operating systems do not have Bash installed. Given that this task is only
for debugging purposes, we can remove it for now.
### Motivation and Context
I am making this change because I am regenerating the VM image in a
different manner, and the new image does not contain bash. Once this PR
is in, I can switch the images.
### Description
Addresses issue #19640.
More details are in the issue. In short, this changes all include
directory and link directory usage to CMake's `CUDA::*` targets.
### Description
Try to move 'env.wasm.trace' to 'env.trace' to make it less confusing,
because it also works in webgpu. Marked 'env.wasm.trace' as deprecated.
### Description
<!-- Describe your changes. -->
* Publish the artifacts as late as possible
  * once published the artifacts are immutable, and any retry will fail if
    they exist
  * if any step fails after publishing the stage cannot be retried
* use powershell to cleanup
  * DeleteFiles is taking >30 mins and causing the stage to timeout
  * powershell took < 1s
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make pipeline more robust
### Description
<!-- Describe your changes. -->
Use UseMultiToolTask and limit the number of cl.exe instances running.
MultiToolTask info:
https://devblogs.microsoft.com/cppblog/improved-parallelism-in-msbuild/
Info on why limiting CL_MPCount can help:
https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows
The current CIs have 4 cores (both physical and logical). Hardcoded the
GPU build in win-ci.yml to use CL_MPCount of 2 as that seems to work
fine. Can adjust if needed to base it on the actual number of cores or
to use build.py to build.
Caveat: I've run about 16 builds and haven't seen a slow build yet, but
as the root cause of the slow builds isn't really known this isn't
guaranteed to be a fix.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Try and prevent super slow GPU builds by reducing number of tasks
potentially running in parallel.
### Description
- Updates the `qnn_preprocess_model()` method to set a name for any new
nodes added to the graph (due to fusion).
- Updates the `qnn_preprocess_model()` method to set a name for any
unnamed nodes that previously existed in the original graph.
- Adds unit tests for fusions (previously missing)
  - Checks that fused node names exist and are unique
  - Checks that fused graph is equivalent to original graph
### Motivation and Context
Nodes are not strictly required to have names. However, a
planned/upcoming feature to support mixed-precision (integer) quantized
models needs nodes to have names.
`/opt/rocm/.info/version-dev` is only available if the `rocm-dev`
metapackage is installed. That metapackage pulls in many packages that
users may not need, so they may opt for fine-grained control instead.
Fall back to `rocm_version.h` in case `rocm-dev` is not installed.
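A minimal sketch of such a fallback is shown below. It assumes the header defines `ROCM_VERSION_MAJOR`, `ROCM_VERSION_MINOR` and `ROCM_VERSION_PATCH` macros; the function name and the use of explicit path parameters are hypothetical and do not reflect the actual build script.

```python
import os
import re
import tempfile

def read_rocm_version(version_file, header_file):
    """Prefer the version file; otherwise parse the header (assumed macro names)."""
    if os.path.isfile(version_file):
        with open(version_file) as f:
            return f.read().strip()
    with open(header_file) as f:
        text = f.read()
    parts = []
    for key in ("MAJOR", "MINOR", "PATCH"):
        match = re.search(r"#define\s+ROCM_VERSION_" + key + r"\s+(\d+)", text)
        if match is None:
            raise ValueError("ROCM_VERSION_" + key + " not found in " + header_file)
        parts.append(match.group(1))
    return ".".join(parts)

# Demo with a stand-in header and no version file present.
with tempfile.TemporaryDirectory() as tmp:
    header = os.path.join(tmp, "rocm_version.h")
    with open(header, "w") as f:
        f.write("#define ROCM_VERSION_MAJOR 6\n"
                "#define ROCM_VERSION_MINOR 0\n"
                "#define ROCM_VERSION_PATCH 2\n")
    version = read_rocm_version(os.path.join(tmp, "version-dev"), header)
```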
### Description
<!-- Describe your changes. -->
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix the bug described in
https://github.com/microsoft/onnxruntime/issues/19430
### Description
<!-- Describe your changes. -->
The RN CI fails intermittently with the error "app seems to idle".
Enable the most verbose logging level (steps to dump
device.log from the detox folder/artifacts can be added if necessary) to at least get
more information.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
### Description
<!-- Describe your changes. -->
Following PR #19229, which added support for an external compute
stream in the CUDA EP, we add the same support for the ROCm EP.
While testing this feature with torch, we found that torch uses stream 0
as the default stream, and `torch.cuda.current_stream()` returns `0`
for the current stream, but ORT treats `0` or `nullptr` as invalid and resets
has_user_compute_stream to false.
The has_user_compute_stream option will be removed in the future.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The motivation for this PR is that we want to use torch.cuda.graph to
capture ORT's running kernels, which requires torch and ORT to run in
the same stream, so we use this API to set ORT's working stream.
### Description
<!-- Describe your changes. -->
A number of Qualcomm Snapdragon chipsets do not produce correct output
if we skip the Reshape, which ironically was a performance optimization
for Snapdragon chips.
Perf testing showed that Squeeze also seems to execute on CPU, so there's
no benefit to using it as an alternative where possible. For example,
Global*Pool -> Reshape to 2D -> Gemm could potentially be replaced
with Global*Pool -> Squeeze dims 2 and 3 -> Gemm if that offered better
performance.
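The two alternatives produce the same tensor when the pooled spatial dims are both 1, which a short NumPy check illustrates. This is only a sketch of the shape equivalence; it says nothing about where either op executes on device.

```python
import numpy as np

# Stand-in for a Global*Pool output: (N, C, 1, 1).
pooled = np.arange(2 * 16, dtype=np.float32).reshape(2, 16, 1, 1)

# Option 1: Reshape to 2D before Gemm.
via_reshape = pooled.reshape(pooled.shape[0], -1)

# Option 2: Squeeze dims 2 and 3 before Gemm.
via_squeeze = np.squeeze(pooled, axis=(2, 3))
```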
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19518
### Description
Fix a bug in build.py that accidentally disabled C# tests for most
builds when "--build_nuget" is specified.
### Motivation and Context
The bug was introduced in PR #8892.
### Description
<!-- Describe your changes. -->
Xcode UI tests seem to be flaky:
https://github.com/orgs/community/discussions/68807
Add a couple of retries if we get a "Timed out while loading
Accessibility." error which is transient.
### Description
This PR adds a feature to serialize all DML EP partitions into DML
currency individually for a given model. This feature can be
dynamically turned on by using the DML EP option
`ep.dml.enable_graph_serialization`.
### Motivation and Context
Useful when a user wants to capture the DML EP specific partitions into DML
currency to mitigate the dependency on the framework.
<!-- - If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
The Windows memory map code casts mapped_offset to DWORD directly, so it is
truncated if it is larger than 2^32-1. We need to set
dwFileOffsetHigh for this case.
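The arithmetic behind the fix can be sketched as follows: a 64-bit offset is split into the high and low 32-bit halves that the Win32 mapping call takes as `dwFileOffsetHigh` and `dwFileOffsetLow`. The helper name is hypothetical, and the snippet only demonstrates the split, not the actual C++ change.

```python
def split_file_offset(offset):
    """Split a 64-bit file offset into (high, low) 32-bit DWORD values."""
    if not 0 <= offset < (1 << 64):
        raise ValueError("offset must fit in 64 bits")
    high = (offset >> 32) & 0xFFFFFFFF
    low = offset & 0xFFFFFFFF
    return high, low

offset = 5 * 1024 ** 3            # 5 GiB, larger than 2^32 - 1
high, low = split_file_offset(offset)

# What a direct cast to a 32-bit DWORD would have kept.
truncated = offset & 0xFFFFFFFF
```

Here `truncated` differs from `offset`, which is the failure mode described above, while recombining `high` and `low` restores the full value.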
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The bug was found in #19450
### Description
Add Whisper Conversion and E2E into Big Models pipeline
---------
Co-authored-by: Your Name <your@email.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
### Description
1. Check GPU status in docker
2. Use stages so that the test stage can leverage existing build
artifacts
### Motivation and Context
To investigate the root cause of the random exception
`CUDA failure 100: no CUDA-capable device is detected`
### Description
Fixes the build break introduced by #19614.
Currently the WebGL backend does not support zero-sized tensors. This change
splits the test data into 2 parts, and only enables zero-sized tensor tests
for WebGPU.
### Description
Stop using apiset in OneCore build: use onecoreuap.lib instead of
onecoreuap_apiset.lib in onecore build.
### Motivation and Context
1. Now all Windows Editions come with Reverse Forwarders. We should just
use the normal onecore libs.
2. Many new Windows APIs are only available in [windows umbrella
libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries).
So these libraries are not specific for Windows CoreOS or Onecore.
3. Going forward we should use "IsApiSetImplemented" to guard our API
usages:
https://learn.microsoft.com/en-us/windows/win32/apiindex/detect-api-set-availability
After this change, our built binaries can pass apivalidator's check.
```
C:\local\apivalidator>apivalidator.exe -BinaryPath:C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll -SupportedApiXmlFiles:onecoreuap_DDIs.xml
ApiValidation:
Summary:
"C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll" is Universal
ApiValidation: All binaries are Universal
```
This gives us an easy way to test ONNX Runtime's compatibility with
Windows versions.
### Description
<!-- Describe your changes. -->
Add helper to run CIs for a branch using `az pipelines`.
This can be used to easily kick off multiple CIs for a branch prior to
creating a PR.
Update run_CIs_for_external_pr.py so the CI list can be shared.
Request json output from `gh pr view` so the current state is more
easily parsed.