### Description
<!-- Describe your changes. -->
1. Use macro defined to check version number
2. Add a new api
3. Fix bug at attr_proto
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
These are some problems we need to address for the final delivery to
Microsoft.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Changes the `onnxruntime_NVCC_THREADS` CMake variable from an
[`option`](https://cmake.org/cmake/help/latest/command/option.html) to a
[cache
entry](https://cmake.org/cmake/help/latest/command/set.html#set-cache-entry).
### Motivation and Context
Fixes#19833.
`option` in CMake (confusingly, IMHO) always defines a *boolean* option.
The original definition of `onnxruntime_NVCC_THREADS` specified a
default of `1`, which I presume is coerced to `ON`. Thus, if the option
is not overridden with a value of another type, NVCC will receive a
malformed option `--threads ON` (rather than the expected `--threads
1`), which causes the error reported in #19833.
This error only occurred if compiling ONNX Runtime via CMake with
`-Donnxruntime_USE_CUDA=ON`; the CI build script always overrode
`onnxruntime_NVCC_THREADS` with a string value:
f1fef19b6e/tools/ci_build/build.py (L1152-L1154)
WebNN spec has added data type constraint for every op, and its CPU
backend (currently is TFLite) has additional constraint. Add
corresponding constraint to each op in WebNN EP.
Note: Temporarily disable fp16 for CPU backend as which is planned to be
ready in Chromium next month.
### Description
Allows mixed-precision overrides that adds a QDQ quantization type
conversion sequence at a graph output that **is not** consumed by other
nodes. This is not a common use-case but should handle it instead of
raising an error.
#### Example
Original model

mixed-precision overrides:
```python
mixed_prec_overrides = {
"input_0": [{"quant_type": QuantType.QUInt16}],
"op_0_out": [
{
"quant_type": QuantType.QUInt16,
"convert": {"quant_type": QuantType.QUInt8},
}
],
}
quantize_static(
float_model_path,
qdq_model_path,
data_reader,
quant_format=QuantFormat.QDQ,
activation_type=QuantType.QUInt8,
op_types_to_quantize=[node.op_type for node in float_model.graph.node],
extra_options={
"TensorQuantOverrides": mixed_prec_overrides,
},
)
```
QDQ model:

### Motivation and Context
This scenario is arising for certain quantization configurations. Should
handle it gracefully.
### Description
<!-- Describe your changes. -->
- Introduce option `trt_engine_hw_compatible` to support engine hardware
compatibility for Ampere+ GPUs
- This enables `nvinfer1::HardwareCompatibilityLevel::kAMPERE_PLUS` flag
when generating engines
- This option has been validated on sm80/86 GPUs, as engine can be
reused across different ampere+ arch:
- Client side need to enable this option as well to leverage existing
sm80+ engines
- If this option is enabled by users which TRT<8.6 or sm<80, there will
be a warning showing this option not supported
Engine naming:
| When | `trt_engine_hw_compat=false` | `trt_engine_hw_compat=true` |
| -------------- |
------------------------------------------------------------ |
------------------------------------------------------------ |
| A100 (sm80) |
TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80**.engine
|
TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80+**.engine
|
| RTX3080 (sm86) |
TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**86**.engine
|
TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80+**.engine
|
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Reference:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#hardware-compat
---------
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
1. Move azcopy environment variables out of script and into an Azure DevOps variable group. Move towards consolidating the managed identity client ID definition in one place.
2. Disable azcopy overwrite. We don't want to accidentally change the files for a released package.
### Description
Runs of the React Native CI are timing out during ComponentDetection
after 8 minutes. This increases the timeout value.
### Motivation and Context
Runs of the React Native CI are timing out during ComponentDetection.
### Description
Adding java build/packaging stage to `cuda-packaging-pipeline.yml`
### Motivation and Context
This way we can enable publishing the Java Cuda 12 along with Nuget CUDA
12
This PR includes the weight-stripped engine feature (thanks @moraxu for
the #20214) which is the major feature for TRT 10 integration.
Two TRT EP options are added:
- `trt_weight_stripped_engine_enable`: Enable weight-stripped engine
build and refit.
- `trt_onnx_model_folder_path`: In the quick load case using embedded
engine model / EPContext mode, the original onnx filename is in the
node's attribute, and this option specifies the directory of that onnx
file if needed.
Normal weight-stripped engine workflow:

Weight-stripped engine and quick load workflow:

see the doc [here
](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#tensorrt-ep-caches)for
more information about EPContext model.
---------
Co-authored-by: yf711 <yifanl@microsoft.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: wejoncy <wejoncy@163.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Yi Zhang <your@email.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: inisis <46103969+inisis@users.noreply.github.com>
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
Co-authored-by: mo-ja <60505697+mo-ja@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com>
Co-authored-by: Atanas Dimitrov <70822030+neNasko1@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Dhruv Matani <dhruvbird@gmail.com>
Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Co-authored-by: Xu Xing <xing.xu@intel.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Andrew Fantino <15876180+afantino951@users.noreply.github.com>
Co-authored-by: Thomas Boby <thomas@boby.uk>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Michal Guzek <mguzek@nvidia.com>
Co-authored-by: George Wu <jywu@microsoft.com>
### Description
Temporarily remove TVM EP's pipeline until someone helps us upgrade TVM
to a newer version which is compatible with the latest ONNX.
### Motivation and Context
The ONNX version that TVM EP uses has a known security vulnerability. We
cannot continue using it in our hosted build environment. This change is temporary
### Description
Extend the DoubleQDQPairsRemover optimizer to also handle sequences that
end in duplicate DQ nodes.
For example, the following sequence:
```
Q1 --> DQ1 --> Q2 --+--> DQ2
|
+--> DQ2'
```
Is now simplified to:
```
Q1 ---+--> DQ2
|
+--> DQ2'
```
### Motivation and Context
The EnsureUniqueDQNodeUnits pass may add duplicate DQ nodes to ensure
valid QDQ node units. The DoubleQDQPairsRemover should still be able to
remove unnecessary QDQ ops if the target sequence ends in duplicate DQ
nodes.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
We introduce rulesets that eliminate QDQ nodes of unsupported types and
for unsupported quantised operators for the NPU device. This leads to
improved performance and accuracy on critical client AI models.
Here's a summary of the changes:
- Introduces the provider option `enable_qdq_optimizer` which when set
to `True` enables stripping of QDQ nodes on the NPU device for models
with `QuantizeLinear` and `DequantizeLinear` layers in them.
`enable_qdq_optimizer` defaults to `False`.
- Always strip out int16/uint16 QDQ layers as these types are not
supported by the NPU compiler.
- Only supported ops `Conv`, `MatMul`, and `Add` retain QDQ layers
around them, specifically identified for optimal inference performance.
OpenVINO EP achieves this by iterating through NodeUnits in the QDQ
model, and reconstructing the graph only with the required layers.
- Added provider APIs to manipulate node units from EP code by
@adrianlizarraga
- Added capability rule for the Pad operator when it takes DQ layers as
input
- Fixes from static code analysis tool
---------
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
### Description
Adding bf16 support to GatherElementsGrad.
---------
Co-authored-by: Adam Louly <adamlouly@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
### Description
Graph member value_info_ (unordered_set) is ordered before its values
are added to the graph proto.
### Motivation and Context
- Without this ordering, the model proto used by the OpenVINO EP is not
deterministic and varies across runs.
- Since the model proto varies, it affects caching attempts by OpenVINO.
Q: If creating a vector to have ordered elements is costly, should we
make value_info_ a std::set that is sorted according to NodeArg names?
Related PR about ordering initializers:
https://github.com/microsoft/onnxruntime/pull/14631
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
This PR adding protoc.exe to make the Nuget Cuda Pipleine, which also
allowing it to get build Java for various CUDA version
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Remove explicitly concatinating pastKey with Key and pastValue with
Value.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Use a common set of prebuilt manylinux base images to build the
packages, to avoid building the manylinux part again and again. The base
images can be used in GenAI and other projects too.
This PR also updates the GCC version for inference python CUDA11/CUDA12
builds from 8 to 11. Later on I will update all other CUDA pipelines to
use GCC 11, to avoid the issue described in
https://github.com/onnx/onnx/issues/6047 and
https://github.com/microsoft/onnxruntime-genai/issues/257 .
### Motivation and Context
To extract the common part as a reusable build infra among different
ONNX Runtime projects.
### Description
Previous all feed are set to nightly, the offcial released feed-id is
not set
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
WebNN CPU backend implementation has been migrated from XNNPack to
TFLite, currently TFLite has not supported WebNN's convTranspose2d yet,
just disable it for now.
Adding Bfloat16 to scale op
---------
Co-authored-by: Adam Louly <adamlouly@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
The workspace usage may be hardware-specific. Moving away from a common workspace size calculation allows more flexibility in the hardware-specific implementations.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Flash attn recompute
1. Allow PythonOp(FlashAttn) can be recomputed correctly.
45879ff5c2
2. Use JSON to pass the selected-to-recompute subgraphs.
3c374da678
#### Better Memory Efficiency
Customer model can run both PyTorch SPDA and Flash Attn, this PR make it
possible to let the Flash Attn path work with ORTModule layerwise
recompute. The peak drop from 45.xGB to 32.xGB if we only compare the
layers (not including other pieces, BTW there are few more optimization
targeting other pieces as well later).
#### Better Perf
Using Flash ATTN bring additionally 16% end to end time reduction, with
highly aligned loss curve.

#### Use JSON File to pass Recompute Plans
To overcome the limitation of max length of the strings defined in
session options.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
#### Problem 1: Broken Transpose QDQ unit
Layout transform's specialized cost function aggressively pushes down
transposes with channel-first or channel-last perms. This can lead to a
situation where a channel-fist/last Transpose gets stuck after being
pushed through an Unsqueeze node that makes the Transpose's perm no
longer channel-first/last. At this point, the specialized cost function
defers to the default const function, which does not see a need to
continue pushing this transpose node. This breaks the QDQ node units for
both the Unsqueeze and the Transpose: DQ -> Unsqueeze -> Transpose -> Q.
<img width="266" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/82f8432d-ca27-451b-8c36-c8d87b806e30">
The transpose optimizer should insert a Q -> DQ pair between the
Unsqueeze and Transpose nodes to fix both QDQ node units: DQ ->
Unsqueeze -> Q[new] -> DQ[new] -> Transpose -> Q
<img width="198" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/5a584bdf-e5db-4622-b3bb-83c060e09261">
#### Problem 2: Inserted Squeeze/Transpose nodes should be constant
folded when possible.
The transpose optimizer inserts Squeeze (and Transpose) ops between an
initializer and a DQ to counteract the effect of Unsqueezing that
initializer if it is consumed by multiple nodes. This results in a graph
where the inserted nodes are not in valid node units:
Original graph where two Mul nodes share a common initializer input:
<img width="456" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/4b9155ae-e32f-41fc-9136-f953b73e92e7">
Resulting graph after transpose optimization without constant folding:
<img width="452" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/3c1bfef1-d45f-4d6e-aa19-1c2929eae3f5">
Here, the circled Transpose and Squeeze nodes operate on a quantized
integer type but are not in valid QDQ node units. The solution is to run
constant folding, which results in:
<img width="405" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/aebdb91f-f38f-4583-adec-33e46126365f">
### Motivation and Context
Improve the layout transformation to allow more models to run on EPs
that prefer the channel-last layout.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add a new GitHub Actions workflow, `.github/workflows/mac.yml`. It contains these jobs:
- ARM64 MacOS CI build.
- Objective-C static analysis build. This was moved over from another Azure DevOps pipeline to make it more visible.
### Description
Have a unified API in OVEP that pass the ONNX graph proto from ORT to OV
for compilation
### Motivation and Context
The earlier implementation used two different flows when onnx model path
is present vs model laoded from memory.
The former directly passed the onnx model path to OV when the graph is
fully supported by EP. While the latter pass the ORT model proto to OV.
This cause a difference in results when ORT optimizations are enabled.
This PR address this issue.