### Description
- Adds a dummy bias of all zeros when translating a Conv without an
explicit bias input. This is a workaround for a QNN validation issue
that fails when the optional bias input is not provided.
- Corrects logic for unpacking of **non-zero int4** zero-points. Bug
does not impact models because we currently only support int4
zero-points equal to 0 (symmetric quant). But this would become an issue
in the future if/when QNN supports non-zero int4 zero-points (so good to
fix now).
### Motivation and Context
Support Conv operators without a bias input on QNN EP with the latest
QNN SDK.
### Description
Do not allow clearing Android logs if the emulator is not running
### Motivation and Context
Previously the Clearing Android logs step stuck until the pipeline
timeout. If one of the previous steps failed.
### Description
<!-- Describe your changes. -->
Remove legacy code and wrong message.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is required by Microsoft to remove unwanted error message. This is
required for 8.15 release.
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Use debug info to identify sdpa kernel actually used, and show it in the
output of benchmark_mha.py. This updated benchmark script was used to
get the benchmark results in
https://github.com/microsoft/onnxruntime/pull/21629.
(1) Change the output format of debug info to output like SdpaKernel=*
(2) Add a step to capture stdout from onnxruntime session, and use
regular expression to parse SdpaKernel=* from the captured text.
Other minor changes:
(1) Set different default repeats during benchmark: 100 for CPU; and
10000 for CUDA.
(2) Fix PrintTensorByDims used in console dumper: if it is not enabled,
do not dump tensor.
(3) Update some comments
### Motivation and Context
Sometime, we will use fallback for a sdpa_kernel. It could confuse user
unless we can tell exact kernel is used in benchmark.
### Description
For some reason, run SparseAttention tests in parallel causes random
failure in CI pipeline. Maybe due to out of memory when too many tests
running in parallel.
This will run those tests in sequentially.
### Description
Minor changes to resolve some warnings in ORT
### Motivation and Context
Binskim for WindowsAI (which consumes ORT) treats warnings as errors,
and has hit these warnings.
As a security requirement, warnings like "signed/unsigned mismatch" must
be resolved.
### Description
<!-- Describe your changes. -->
Set the exhaustive tune flag through the MIGraphX API and make this a
Session option in Onnxruntime
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Allow users to use MIGraphX Exhaustive tuning with Onnxruntime
inferences
This goers hand in hand with save/load after a model and been compiled
and tuning has found.
---------
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description
<!-- Describe your changes. -->
No code changes to the EP only changes to the scripts whihc invoke
MIGraphX EP
- One case be explicit to set MIGraphX EP when running gpt2 testing
- The other to ensure we turn off optimizations like tensorRT and allow
MIGraphX to handle graph optimizations
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
MIGraphX has moved away from using rocBLAS and without this, some cases
used in CI shall fail as optmizations will attempt to use rocBLAS
kernels instead of MIGraphx EP directly.
… to int8 for now
Allow for models with biases/full input and only check for int8 support
in EP
### Description
<!-- Describe your changes. -->
Allows for all inputs for MatMulInteger and ConvInteger to be supported
for prequantized models
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fixes issues when using prequantized models that contain weight biases
---------
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
Fix `Orttraining Linux Lazy Tensor CI Pipeline`
- Remove unused import of `torch.onnx._internal.exporter`, whose path is
changed in newer torch (pytorch/pytorch#132429).
- Move import of `register_custom_op_symbolic` from `torch.onnx` into
local function, which causes circular import when running `import
torch.onnx` (at least in the CI environment).
### Description
- Fix computation of axis for `QuantizeLinear` inserted after the
sequence `DQ (per-channel) -> Unsqueeze`. Example:
- Original: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Op`
- After QDQ fix-up: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Q
(axis = 3) -> DQ (axis = 3) -> Op`
- Before this PR, the axis for the inserted Q/DQ ops was not correctly
set to 3 (left as 0).
- Fix normalization of negative axis values for `QuantizeLinear`
inserted after the sequence `DQ (per-channel) ->Transpose`
- Existing code added the wrong rank value to normalize the DQ axis.
### Motivation and Context
Fix errors in handling of per-channel DQ in code that fixes QDQ
NodeUnits.
### Description
Compiling onnxruntime with QNN EP on Windows x86_64 results in a
compilation error:
```shell
$ onnxruntime\test\optimizer\qdq_transformer_test.cc(1,1): error C1128: num
ber of sections exceeded object file format limit: compile with /bigobj [...onnxruntime\build\Debug\onnxruntime_test_all.vcxproj]
```
This PR adds the `/bigobj` compilation flag for the
`qdq_transformer_test.cc` file.
### Description
<!-- Describe your changes. -->
### Motivation and Context
1. Python API doc needs to be merged from a fork, but 1ES self-hosted
pool is only for one github repo.
2. ubuntu-latest will be install numpy above 2.0 by default, and current
python API doc generation doesn't support it.
So I pin numpy < 2.0.0
---------
Avoid producing presentKey/presentValue outputs if pastKey/pastValue
don't exists.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Adds support for the GatherElements operator to QNN EP.
- Adds GatherElements to QDQ quantizer tool.
### Motivation and Context
Enable more models to run on QNN EP.
### Description
Added code in MatMul4BitsQuantizer to quantize Gather to
GatherBlockQuantized.
Only Gather with constant data is quantized.
Since quantized data is in int4, the quantized model will force upgrade
to onnx opset 21.
The implementation purely relies on numpy. If optimization is needed,
C++ kernels can be added later.
Only support default RTN algorithm since GatherBlockQuantized require
zero points to have the same type as quantized data.
### Motivation and Context
Support quantizing gather to int4 in Web scenario.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Bug fix for the ShapeInferContext GetAttrxxxs APIs. Node attribute maybe
is empty.
### Motivation and Context
If the attr value is empty, the expected result through the interface is
empty , but currently, it returns a meaningless {0}.
---------
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: Liu Minyue <mingyue@xilinx.com>
### Description
- TensorRT 10.2.0.19 -> 10.3.0.26
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
This PR fixes the `AttentionProbsSoftmax` recompilation issue when
executing the phi3 model. With this fix, it will further improve the
phi3 performance.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Previously, MultiHeadAttention supports relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supports [1, N, S, T]. This will extend the support to allow [1, N, S,
T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs.
- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script
Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
### Description
This PR modifies the run_dynamo_export function to ensure it mirrors the
behavior of run_torchscript_merged_export rather than
run_torchscript_separate_export. Additionally, I made adjustments to the
main function to ensure that run_dynamo is correctly invoked.
### Motivation and Context
The main motivation for this change is to enable successful export of
LLaMA-2 and LLaMA-3 models using the Dynamo exporter to ONNX.
Previously, the exporter was saving two copies of the weights, which is
inefficient. The modified approach ensures that only one copy of the
weights is saved, and the model can support both scenarios. These
changes enhance the compatibility of the exporter with LLaMA models and
subsequently other models and optimize the export process
### Description
<!-- Describe your changes. -->
Handle targets in subdirectories for external projects. All targets will
now go in a per-project folder under 'External'
e.g. gmock and gtest now get handled correctly and are under
External/googletest vs. existing setup where they ended up as top-level
projects.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve developer experience.
Chrome Canary is helpful to test some new features. With this PR, we can
enable Chrome Canary in unit tests with command like "npm test -- op
abs.jsonc -b=webgpu -e=chromecanary".
### Description
Replace `memset(0)` with `std::fill(T{})`. This would ensure that all
the types are initialized in a portable way.
### Motivation and Context
Some platforms exhibit intermittent failures with NaN results.
Follow up to: https://github.com/microsoft/onnxruntime/pull/21525
Cc: @ranjitshs
### Description
Exclude cuDNN 9 and CUDA 12 DLLs from manylinux wheel to reduce python
package size.
### Motivation and Context
The 1.20.0 ort-nightly-gpu python wheels on linux are suddenly > 800 MB
in size. The wheels built on 1.19 release branch have a size of around
220 MB.
The size change is caused by
https://github.com/microsoft/onnxruntime/pull/19470.
### Description
See
454996d496
for manual changes (excluded auto-generated formatting changes)
### Why
Because the toolsets for old clang-format is out-of-date. This reduces
the development efficiency.
- The NPM package `clang-format` is already in maintenance mode. not
updated since 2 years ago.
- The VSCode extension for clang-format is not maintained for a while,
and a recent Node.js security update made it not working at all in
Windows.
No one in community seems interested in fixing those.
Choose Prettier as it is the most popular TS/JS formatter.
### How to merge
It's easy to break the build:
- Be careful of any new commits on main not included in this PR.
- Be careful that after this PR is merged, other PRs that already passed
CI can merge.
So, make sure there is no new commits before merging this one, and
invalidate js PRs that already passed CI, force them to merge to latest.
Bug: https://github.com/microsoft/onnxruntime/issues/21386
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Pins pytorch-lightning package to version 2.3.3 since version >=2.4.0
requires torch > 2.1.0 which is not compatible with cu118.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ORT 1.19 Release Preparation
### Description
This change addresses a case where we multiply two matrices, and their
inner dimension is 0.
numpy and Eigen which is being used in our CPU EP implementation
correctly handle this case
and output a [M, N] matrix filled with zeros.
### Motivation and Context
This is required to support GenAI empty input Lora implementation.
Addresses: https://github.com/microsoft/onnxruntime/issues/21483
### Description
<!-- Describe your changes. -->
Replace jcenter. It's deprecated and not responding.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CIs