### Description
- Adds a dummy bias of all zeros when translating a Conv without an
explicit bias input. This is a workaround for a QNN validation issue
that fails when the optional bias input is not provided.
- Corrects the logic for unpacking **non-zero int4** zero-points. The bug
does not affect current models because we only support int4
zero-points equal to 0 (symmetric quant), but it would become an issue
if/when QNN supports non-zero int4 zero-points, so it is good to
fix now.
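
As an illustration of the int4 unpacking mentioned above, here is a minimal sketch in Python. It is not the QNN EP code: `unpack_int4` is a hypothetical helper, and it assumes two signed 4-bit values per byte with the first element in the low nibble.

```python
def unpack_int4(packed: bytes, count: int) -> list:
    """Unpack `count` signed int4 values stored two per byte, low nibble first (assumed layout)."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Sign-extend: nibble values 8..15 encode -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    # Drop the padding nibble when `count` is odd.
    return values[:count]
```

With a zero-point of 0 the sign extension has no visible effect, which may be why a bug of this kind can go unnoticed while only symmetric quantization is supported.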
### Motivation and Context
Support Conv operators without a bias input on QNN EP with the latest
QNN SDK.
### Description
Remove legacy code and an incorrect error message.
### Motivation and Context
This is required by Microsoft to remove an unwanted error message. It is
required for the 8.15 release.
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Use debug info to identify the SDPA kernel actually used, and show it in the
output of benchmark_mha.py. This updated benchmark script was used to
get the benchmark results in
https://github.com/microsoft/onnxruntime/pull/21629.
(1) Change the output format of debug info to output like SdpaKernel=*
(2) Add a step to capture stdout from onnxruntime session, and use
regular expression to parse SdpaKernel=* from the captured text.
Other minor changes:
(1) Set different default repeats during benchmark: 100 for CPU; and
10000 for CUDA.
(2) Fix PrintTensorByDims used in console dumper: if it is not enabled,
do not dump tensor.
(3) Update some comments
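
The parsing step described in (2) above could look roughly like the sketch below. The function name, the pattern, and the sample text are assumptions for illustration; the actual logic lives in benchmark_mha.py, and capturing stdout from the native session is not shown here.

```python
import re
from typing import Optional

# Matches the "SdpaKernel=<name>" token in the debug output (pattern is an assumption).
SDPA_KERNEL_PATTERN = re.compile(r"SdpaKernel=(\w+)")

def parse_sdpa_kernel(captured_text: str) -> Optional[str]:
    """Return the kernel name found in the captured text, or None if absent."""
    match = SDPA_KERNEL_PATTERN.search(captured_text)
    return match.group(1) if match else None
```

For example, `parse_sdpa_kernel("... SdpaKernel=EXAMPLE_KERNEL ...")` returns `"EXAMPLE_KERNEL"`, where `EXAMPLE_KERNEL` is a placeholder rather than a real kernel name.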
### Motivation and Context
Sometimes a fallback is used for an sdpa_kernel. This could confuse users
unless the benchmark reports exactly which kernel was used.
### Description
For some reason, running SparseAttention tests in parallel causes random
failures in the CI pipeline, possibly due to running out of memory when too
many tests run in parallel.
This change runs those tests sequentially.
### Description
Minor changes to resolve some warnings in ORT
### Motivation and Context
Binskim for WindowsAI (which consumes ORT) treats warnings as errors,
and has hit these warnings.
As a security requirement, warnings like "signed/unsigned mismatch" must
be resolved.
### Description
Set the exhaustive tune flag through the MIGraphX API and make this a
Session option in Onnxruntime
### Motivation and Context
Allow users to use MIGraphX exhaustive tuning with Onnxruntime
inference.
This goes hand in hand with save/load after a model has been compiled
and a tuning has been found.
---------
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description
No code changes to the EP, only changes to the scripts which invoke
MIGraphX EP:
- One change explicitly sets MIGraphX EP when running gpt2 testing.
- The other ensures we turn off optimizations like TensorRT and allow
MIGraphX to handle graph optimizations.
### Motivation and Context
MIGraphX has moved away from using rocBLAS. Without this change, some cases
used in CI will fail because optimizations will attempt to use rocBLAS
kernels instead of the MIGraphX EP directly.
… to int8 for now
Allow for models with biases/full input and only check for int8 support
in EP
### Description
Allows for all inputs for MatMulInteger and ConvInteger to be supported
for prequantized models
### Motivation and Context
Fixes issues when using prequantized models that contain weight biases
---------
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
- Fix computation of axis for `QuantizeLinear` inserted after the
sequence `DQ (per-channel) -> Unsqueeze`. Example:
- Original: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Op`
- After QDQ fix-up: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Q
(axis = 3) -> DQ (axis = 3) -> Op`
- Before this PR, the axis for the inserted Q/DQ ops was not correctly
set to 3 (left as 0).
- Fix normalization of negative axis values for `QuantizeLinear`
inserted after the sequence `DQ (per-channel) -> Transpose`
- Existing code added the wrong rank value to normalize the DQ axis.
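
The axis bookkeeping described above can be sketched in Python as follows. This is an illustration only, not the ORT C++ implementation; the helper names are hypothetical, and negative axes are assumed to be normalized against the rank of the tensor they refer to.

```python
def normalize_axis(axis: int, rank: int) -> int:
    """Map a possibly negative axis into the range [0, rank)."""
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    return axis + rank if axis < 0 else axis

def axis_after_unsqueeze(dq_axis: int, input_rank: int, unsqueeze_axes: list) -> int:
    """Position of the per-channel axis in the Unsqueeze output."""
    output_rank = input_rank + len(unsqueeze_axes)
    inserted = {normalize_axis(a, output_rank) for a in unsqueeze_axes}
    # Output positions not taken by inserted dims hold the original dims, in order.
    kept_positions = [i for i in range(output_rank) if i not in inserted]
    return kept_positions[normalize_axis(dq_axis, input_rank)]

def axis_after_transpose(dq_axis: int, perm: list) -> int:
    """Position of the per-channel axis in the Transpose output (output dim i = input dim perm[i])."""
    return list(perm).index(normalize_axis(dq_axis, len(perm)))
```

For the example in the description, `axis_after_unsqueeze(0, 1, [0, 1, 2])` evaluates to 3, which matches the axis expected on the inserted Q/DQ ops.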
### Motivation and Context
Fix errors in handling of per-channel DQ in code that fixes QDQ
NodeUnits.
### Description
- Adds support for the GatherElements operator to QNN EP.
- Adds GatherElements to QDQ quantizer tool.
### Motivation and Context
Enable more models to run on QNN EP.
### Description
Added code in MatMul4BitsQuantizer to quantize Gather to
GatherBlockQuantized.
Only Gather with constant data is quantized.
Since the quantized data is in int4, the quantized model is force-upgraded
to onnx opset 21.
The implementation relies purely on numpy. If optimization is needed,
C++ kernels can be added later.
Only the default RTN algorithm is supported, since GatherBlockQuantized requires
zero points to have the same type as the quantized data.
### Motivation and Context
Support quantizing gather to int4 in Web scenario.
### Description
- TensorRT 10.2.0.19 -> 10.3.0.26
### Description
Previously, MultiHeadAttention supported relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supported [1, N, S, T]. This PR extends the support to allow [1, N, S,
T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs.
- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script
Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
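
To illustrate the four supported attention bias shapes, the numpy sketch below adds a bias of each shape to attention scores of shape [B, N, S, T]. The sizes are arbitrary and numpy broadcasting stands in for the kernels, so this shows the intended semantics rather than the CUDA/CPU implementation.

```python
import numpy as np

# Arbitrary example sizes: batch, heads, query length, key length.
B, N, S, T = 2, 3, 4, 5
scores = np.zeros((B, N, S, T), dtype=np.float32)

supported_bias_shapes = [(1, N, S, T), (B, N, S, T), (B, 1, S, T), (1, 1, S, T)]
result_shapes = []
for shape in supported_bias_shapes:
    attention_bias = np.ones(shape, dtype=np.float32)
    # Dimensions of size 1 in the first two axes are broadcast over batch and heads.
    biased = scores + attention_bias
    result_shapes.append(biased.shape)
```

In every case the biased scores keep the full [B, N, S, T] shape.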
### Description
This PR modifies the run_dynamo_export function to ensure it mirrors the
behavior of run_torchscript_merged_export rather than
run_torchscript_separate_export. Additionally, I made adjustments to the
main function to ensure that run_dynamo is correctly invoked.
### Motivation and Context
The main motivation for this change is to enable successful export of
LLaMA-2 and LLaMA-3 models using the Dynamo exporter to ONNX.
Previously, the exporter was saving two copies of the weights, which is
inefficient. The modified approach ensures that only one copy of the
weights is saved, and the model can support both scenarios. These
changes improve the exporter's compatibility with LLaMA models, and
subsequently other models, and optimize the export process.
### Description
Replace `memset(0)` with `std::fill(T{})`. This ensures that all
types are initialized in a portable way.
### Motivation and Context
Some platforms exhibit intermittent failures with NaN results.
Follow up to: https://github.com/microsoft/onnxruntime/pull/21525
Cc: @ranjitshs
Bug: https://github.com/microsoft/onnxruntime/issues/21386
### Description
This change addresses the case where we multiply two matrices whose
inner dimension is 0.
numpy and Eigen (which is used in our CPU EP implementation)
handle this case correctly and output an [M, N] matrix filled with zeros.
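
The numpy behavior referred to above can be checked with a short snippet. The sizes are arbitrary; this only demonstrates the reference semantics, not the ORT kernel.

```python
import numpy as np

M, N = 3, 4
a = np.ones((M, 0), dtype=np.float32)  # inner dimension K = 0
b = np.ones((0, N), dtype=np.float32)

# Each output element is an empty sum, so the result is all zeros.
c = a @ b
```

Here `c` has shape `(M, N)` and contains only zeros.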
### Motivation and Context
This is required to support GenAI empty input Lora implementation.
Addresses: https://github.com/microsoft/onnxruntime/issues/21483
### Description
Fix address sanitizer and memory access bugs 1, 4, 5, 7, and 8 found in
the security fuzz test.
### Motivation and Context
We saw some models fail to run due to OOM, which can be fixed by increasing
trt_max_workspace_size.
This PR removes the size limit by default (max device memory), which is
aligned with trtexec.
### Description
Added an eval model buffer as an optional field in Module so that you can
export for inference using the eval model stored as a buffer.
### Motivation and Context
- Resolves #21152
- The previous solution (PR #21422) produced an eval model that was specific
to the EPs used to train because of unavoidable runtime optimizations
that changed the graph stored with the eval session.
### Description
This PR improves the range calculation for input to Relu/Clip nodes for
the symmetric quantization case.
### Motivation and Context
Currently, in the common scenario of conv
followed by relu in the symmetric quantization config, different scales
could be assigned to the tensors corresponding to the input and output of relu.
The downside is that this may introduce noise due to multiple re-quantizations,
and it makes it difficult to fuse conv-relu nodes for hardware accelerators
that support fused conv-relu.
Instead, it is more efficient to assign the output range of relu as the
input range of relu / output range of upstream op wherever possible.
This adjustment is currently only being done for the asymmetric
quantization case.
For the scenario where the upstream op has multiple consumers, this
assumption could be incorrect. For this case we do not adjust the
ranges.
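
A rough sketch of the range adjustment described above, written as plain Python. The function and its arguments are hypothetical and do not reflect the quantizer's real data structures: ranges are assumed to be a dict of tensor name to (min, max), and consumer counts are assumed to be precomputed.

```python
def propagate_relu_ranges(tensor_ranges: dict, relu_nodes: list, consumer_counts: dict) -> dict:
    """Give each Relu input the same range as the Relu output when it is safe to do so.

    relu_nodes is a list of (input_name, output_name) pairs.
    """
    adjusted = dict(tensor_ranges)
    for input_name, output_name in relu_nodes:
        # Skip when the upstream tensor has other consumers that may need its full range.
        if consumer_counts.get(input_name, 0) != 1:
            continue
        if output_name in adjusted:
            adjusted[input_name] = adjusted[output_name]
    return adjusted
```

With matching ranges, the Relu input and output end up with the same scale, which is what makes a fused conv-relu straightforward for accelerators that support it.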
### Description
Fixes validation of per-channel quantization overrides by not trying to
unnecessarily load the external weights.
### Motivation and Context
The `get_qnn_qdq_config()` explicitly loads models without external data
(i.e., `onnx.load_model(load_external_data=False)`). Afterwards,
`get_qnn_qdq_config()` calls `tensor_proto_to_array()`, which expects
that the external weights are stored in the current working directory.
If the external weights are stored in a different directory, then we get
a crash.
Loading the actual weight values is unnecessary because we only need the
weight shape. This PR removes the unnecessary call to
`tensor_proto_to_array()`.
### Description
Added DequantizeLinear operator for JSEP.
### Description
Add a null_ptr check to avoid a crash when running a session that previously
failed to generate a trt_engine.
### Motivation and Context
Reported and verified by
https://github.com/microsoft/onnxruntime/issues/21567
### Description
This fix addresses the issue of handling multiple QLinear nodes as
outputs from the target node in OVEP. Previously, the stripping logic
only supported a single Q node, leading to incorrect stripping of
additional Q nodes.
### Motivation and Context
The OVEP stripping logic was limited to handling a single Q node as an
output from the target node. As a result, additional Q nodes were being
stripped, despite the stripping rules indicating they should be
retained.
With this fix, OVEP can now properly handle multiple Q nodes according
to the specified stripping rules, ensuring that the fate of each Q node
is correctly determined.
---------
Co-authored-by: sfatimar <sahar.fatima@intel.com>
### Description
Previously, when quantizing MatMul to DQ + MatMul using the 4-bit QDQ tool chain,
the opsets of the domains were not changed.
Now, when quantizing MatMul to DQ + MatMul in QDQ format, the
onnx domain is force-upgraded to opset 21.
### Motivation and Context
In QDQ format, DQ with int4 and blocked quantization is used. This
requires DQ with opset >= 21, so the onnx domain is force-upgraded to
opset 21 when quantizing MatMul to DQ + MatMul.
### Description
Fix wrong per-tensor quantized weight type for matmul.
### Motivation and Context
Fix related bug as described in
https://github.com/microsoft/onnxruntime/issues/21346
### Description
Add a gather that supports block-quantized input data.
### Motivation and Context
To support Web inference scenario with quantized vocabulary embeddings.
### Description
This change enhances the existing Pad Fusion to fuse Pad even if a Cast
operator is present between Pad and Conv/MaxPool/AveragePool. It keeps
the Cast as it is.
<pre>
/*
* Before Fusion:
* Pad
* |
* Cast (Optional)
* |
* Conv/MaxPool/AveragePool
*
* After Fusion:
* Cast (Optional)
* |
* Conv/MaxPool/AveragePool
*/
</pre>
### Description
This change allows an external data path like `a.data` to match
`./a.data`.
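
The idea of treating `a.data` and `./a.data` as the same location can be illustrated in Python with path normalization. This is only a sketch of the concept; `same_external_data_path` is a hypothetical helper and not the ORT implementation.

```python
import os.path

def same_external_data_path(lhs: str, rhs: str) -> bool:
    """Compare two relative paths after collapsing redundant './' components."""
    return os.path.normpath(lhs) == os.path.normpath(rhs)
```

Under this comparison, `same_external_data_path("a.data", "./a.data")` is true, while two different file names still compare as different.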
### Description
* Fix migraphx build error caused by
https://github.com/microsoft/onnxruntime/pull/21598:
Add a conditional compile on code block that depends on ROCm >= 6.2.
Note that the pipeline uses ROCm 6.0.
Unblock orttraining-linux-gpu-ci-pipeline and
orttraining-ortmodule-distributed and orttraining-amd-gpu-ci-pipeline
pipelines:
* Disable a model test in linux GPU training ci pipelines caused by
https://github.com/microsoft/onnxruntime/pull/19470:
Sometimes, cudnn frontend throws an exception that the cudnn graph does not
support a Conv node of the keras_lotus_resnet3D model on V100 GPU.
Note that the same test does not throw an exception in other GPU pipelines. The
failure might be related to cudnn 8.9 and the V100 GPU used in the pipeline
(Ampere GPUs and cuDNN 9.x do not have the issue).
The actual fix requires fallback logic, which will take time to
implement, so we temporarily disable the test in training pipelines.
* Force install torch for cuda 11.8. (The docker image has torch 2.4.0 for
cuda 12.1 to build the torch extension, which is not compatible with cuda
11.8.) Note that this is a temporary workaround. A more elegant fix is to
ensure the right torch version in the docker build step, which might require
updating install_python_deps.sh and the corresponding requirements.txt.
* Skip test_gradient_correctness_conv1d since it causes a segmentation fault.
The root cause needs more investigation (maybe due to cudnn frontend as
well).
* Skip test_aten_attention since it causes an assert failure. The root cause
needs more investigation (maybe due to the torch version).
* Skip orttraining_ortmodule_distributed_tests.py since it fails with an error
that the compiler for the torch extension does not support c++17. One possible
fix is to set the following compile argument inside setup.py of the
extension fused_adam: extra_compile_args['cxx'] = ['-std=c++17'].
However, due to the urgency of unblocking the pipelines, just disable
the test for now.
* Skip test_softmax_bf16_large. For some reason,
torch.cuda.is_bf16_supported() returns True on V100 with torch 2.3.1, so
the test was run in CI, but V100 does not support bf16 natively.
* Fix typo of deterministic
### Description
Improve speed when combining `per-channel` data by using a single
`np.concatenate` call instead of multiple `np.concatenate` calls within a for
loop.
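
The difference between the two approaches can be sketched as follows. The function names and the example data are made up for illustration; the real change is in the quantization tool.

```python
import numpy as np

def combine_in_loop(per_channel: list) -> np.ndarray:
    """Old pattern: every iteration allocates a new array and copies all previous data."""
    combined = per_channel[0]
    for chunk in per_channel[1:]:
        combined = np.concatenate([combined, chunk], axis=0)
    return combined

def combine_once(per_channel: list) -> np.ndarray:
    """New pattern: a single call allocates the result once and copies each chunk once."""
    return np.concatenate(per_channel, axis=0)
```

Both functions return the same array; the single call avoids the repeated reallocation, which matters most when there are many channels.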
### Motivation and Context
Fix the issue https://github.com/microsoft/onnxruntime/issues/21562
Signed-off-by: duansheng.liu <44742794+duanshengliu@users.noreply.github.com>
### Description
- Update pipelines to use QNN SDK 2.25 by default
- Update ifdef condition to apply workaround for QNN LayerNorm
validation bug to QNN SDK 2.25 (as well as 2.24)
### Motivation and Context
Use the latest QNN SDK
### Description
Several tests result in segfaults during the minimal cuda build.
Although test failures are expected due to the limitation of the minimal
cuda EP, failing gracefully would be much preferred.
### Motivation and Context
To reproduce:
1. Build ORT with:
```bash
./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1
```
2. Run `onnxruntime_test_all`
```bash
...
[----------] 1 test from AllocationPlannerTest
[ RUN ] AllocationPlannerTest.ReusedInputCrossDifferentStreams
Segmentation fault (core dumped)
```
### Description
WebNN only supports test mode, so we don't care about the other inputs or
attributes related to training mode. Use WebNN's identity op to implement the
Dropout op directly.
### Description
Changes to set the external data path for model weight files.
Additional fixes to ensure this compiles against the latest v1.19
Onnxruntime.
### Motivation and Context
The motivation for this change set is the separate weight files used for
larger models (like stable diffusion).
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>