Commit graph

11997 commits

Author SHA1 Message Date
Caroline Zhu
94ce1209f9
Bug fix for gather fusion with on-device training (#20891)
### Description
Update the initializer that's added in GatherSliceToSplitFusion to use
the GenerateNodeArgName function, rather than the GenerateNodeName
function.

GenerateNodeName goes through all the nodes in the graph to see if the
given name is already used and generates a unique one if it has been
used. GenerateNodeArgName iterates through all the node args in the
graph to see if the given name is already used.

### Motivation and Context
* On-device training goes through a generate-artifacts step, where
optimizations are applied; then, when the training artifact is loaded,
additional optimizations are applied. In the first round of
optimizations, a "splits" initializer is added for phi-3. In the
second round of optimizations, another "splits" initializer with
different dimensions and data is added. Because GenerateNodeName only
checks node names, the first "splits" initializer isn't found, causing a
type error claiming that the shape of "splits" does not match the
TensorProto shape.
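The collision described above can be illustrated with a minimal Python sketch. It is not the ONNX Runtime implementation: `generate_unique_name` and the two name sets are hypothetical stand-ins for the node-name and node-arg-name lookups, and the suffix scheme is an assumption.

```python
def generate_unique_name(base, used_names):
    """Return `base` if it is unused in `used_names`, else add a numeric suffix."""
    if base not in used_names:
        return base
    suffix = 1
    while f"{base}_{suffix}" in used_names:
        suffix += 1
    return f"{base}_{suffix}"


# Hypothetical graph state after the first optimization round:
# "splits" already exists as an initializer (a node arg), not as a node.
node_names = {"gather_0", "slice_0"}
node_arg_names = {"input", "splits"}

# Checking the node namespace misses the existing initializer.
wrong = generate_unique_name("splits", node_names)
# Checking the node-arg namespace detects it and picks a fresh name.
right = generate_unique_name("splits", node_arg_names)
```

In this sketch `wrong` reuses a name that is already taken by an initializer, which is the kind of clash the fix avoids by consulting node-arg names instead.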
2024-06-03 14:41:39 -07:00
Jian Chen
456ab09d17
Component Governance fix round 5 (#20905)

### Description
Add $(Build.SourcesDirectory)/cmake/external/onnx/third_party to
cover the case where only a single repo is checked out.



### Motivation and Context
Fix CG issue
https://aiinfra.visualstudio.com/Lotus/_componentGovernance/97926/alert/8862110?typeId=16576846
2024-06-03 14:22:22 -07:00
Wanming Lin
9c6481fa2d
[WebNN EP] Enable ArgMax and ArgMin for CPU backend (#20865)
The WebNN TFLite backend supports ArgMax and ArgMin, but only when
'select_last_index' is 0.
2024-06-03 14:12:11 -07:00
Wanming Lin
c128132dd8
[WebNN EP] TFLite backend only supports Elu with default alpha (#20862) 2024-06-03 14:10:22 -07:00
Jian Chen
ae8df4db8f
Split java's gradle build and test (#20817)
### Description

This PR allows `./gradlew cmakeCheck` to fail on the
Windows_Packaging_(CUDA|TensorRT) job. The job still generates
all the necessary jar and pom files for later stages to consume, and
`./gradlew cmakeCheck` is run again in the
Windows_Packaging_(CUDA|TensorRT)_Testing stage.


### Motivation and Context
Reduce the time of all Java packaging stages by 30+ minutes.
2024-06-03 14:08:45 -07:00
Yulong Wang
ab9f153746
[js/web] allow build target for non dynamic import (#20898)
### Description

This PR allows building ORT web to `ort{.all|.webgpu}.bundle.min.mjs`,
which does not contain any dynamic imports. This makes it possible to use
ORT web via static import in a service worker.

Fixes #20876
2024-06-03 12:33:37 -07:00
Changming Sun
d13cabf7f9
Upgrade GCC and remove the dependency on GCC8's experimental std::filesystem implementation (#20893)
### Description
This PR upgrades CUDA 11 build pipelines' GCC version from 8 to 11. 

### Motivation and Context

GCC8 has an experimental std::filesystem implementation which is not ABI
compatible with the formal one in later GCC releases. It didn't cause
trouble for us; however, the ONNX community has encountered this issue often.
For example, https://github.com/onnx/onnx/issues/6047 . So this PR
increases the minimum supported GCC version from 8 to 9, and removes the
references to GCC's "stdc++fs" library. Please note we compile our code
on RHEL8 and RHEL8's libstdc++ doesn't have the fs library, which means
the binaries in ONNX Runtime's official packages always statically link to
the fs library. It is just a matter of which version of the library, an
experimental one or a more mature one. And it is an implementation
detail that is not visible from outside. Anyway, a newer GCC is better.
It will give us the chance to use many C++20 features.

#### Why were we using GCC 8?
It is because all our Linux packages were built on RHEL8 or its
equivalents. The default GCC version in RHEL8 is 8. RHEL also provides
additional GCC versions from RH devtoolset. UBI8 is the abbreviation of
Red Hat Universal Base Image 8, which is the containerized RHEL8. UBI8
is free, which means it doesn't require a subscription (while RHEL does).
The only devtoolset that UBI8 provides is GCC 12, which is too new to
be used with CUDA 11.8. And our CUDA 11.8 build env is a docker
image from Nvidia that is based on UBI8.
#### How the problem is solved
AlmaLinux is an alternative to RHEL. AlmaLinux 8 provides GCC 11. And
the CUDA 11.8 docker image from Nvidia is open source, which means we
can rebuild the image based on AlmaLinux 8 to get GCC 11. I've done
this, but I cannot republish the new image due to various complicated
license restrictions. Therefore I put it at an internal location in
onnxruntimebuildcache.azurecr.io.
2024-06-03 10:14:08 -07:00
Edward Chen
a7a49189e8
Suppress Eigen warning in onnxruntime/test/onnx/microbenchmark/eigen.cc. (#20892)
Fix ARM64 GCC build with `--build_micro_benchmarks`.
2024-06-03 11:25:56 -05:00
Jian Chen
217b66fd85
Update py-publishing pipeline to use the resource from packaging pipeline (#20888)
### Description



### Motivation and Context
To allow the nightly release to be triggered automatically.
2024-06-01 16:10:02 -07:00
Adrian Lizarraga
5ec7ac80c7
Fix compiler error when onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS is enabled (#20889)
### Description
The recent [PR for int4
support](https://github.com/microsoft/onnxruntime/pull/20362) breaks
builds with the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS option enabled.

This PR adds utility functions for debug printing of int4 tensor
statistics and data.



### Motivation and Context
2024-05-31 18:07:53 -07:00
Patrice Vignola
50ee1b056c
[DML EP] Improve memory usage and fix memory leak in graph capture (#20879)
Phi-3 vision loads 3 models in memory, which means that we have 3
different sessions, 3 different execution providers and 3 different
allocators all loaded at the same time. Since the DML EP uses a
bucketized allocator, this results in a lot of memory fragmentation
across all 3 models that can only be used by the model itself.

To fix that, we can disable the memory arena (term for any kind of
allocator that reuses memory in ORT) as an opt-in option. In the case of
LLMs, we essentially never need to reallocate memory after the initial
graphs have been captured, which means that we gain nothing by using the
bucketized allocator, and it causes unnecessary fragmentation.

---------

Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
2024-05-31 17:24:50 -07:00
Ye Wang
ad769f14a8
Suppress maybe used uninitialized warning as being false alert (#20886)
### Description


### Motivation and Context

It breaks the python package pipeline.
A new run:

https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=477415&view=logs&s=d66927fc-650e-5e6f-874c-ae9229c1e7e4

---------

Co-authored-by: Your Name <you@example.com>
2024-05-31 17:04:58 -07:00
Changming Sun
4e18344028
Delete docs/Python_Dev_Notes.md (#20887)
It is no longer relevant since it is not a problem since python 3.5, and
the minimum python version we support is 3.8.
2024-05-31 14:01:11 -07:00
Yulong Wang
35697d2421
[js/webnn] update API of session options for WebNN (#20816)
### Description

This PR is an API-only change to address the requirements being
discussed in #20729.

There are multiple ways that users may create an ORT session by
specifying the session options differently.

All the code snippets below use the variable `webnnOptions` as follows:
```js
const myWebnnSession = await ort.InferenceSession.create('./model.onnx', {
   executionProviders: [
     webnnOptions
   ]
});
```

### The old way (backward-compatibility)

```js
// all-default, name only
const webnnOptions_0 = 'webnn';

// all-default, properties omitted
const webnnOptions_1 = { name: 'webnn' };

// partial
const webnnOptions_2 = {
  name: 'webnn',
  deviceType: 'cpu'
};

// full
const webnnOptions_3 = {
  name: 'webnn',
  deviceType: 'gpu',
  numThreads: 1,
  powerPreference: 'high-performance'
};
```

### The new way (specify with MLContext)

```js
// options to create MLContext
const options = {
  deviceType: 'gpu',
  powerPreference: 'high-performance'
};

const myMlContext = await navigator.ml.createContext(options);

// options for session options
const webnnOptions = {
  name: 'webnn',
  context: myMlContext,
  ...options
};
```

This should throw (because no deviceType is specified):
```js
const myMlContext = await navigator.ml.createContext({ ... });
const webnnOptions = {
  name: 'webnn',
  context: myMlContext
};
```

### Interop with WebGPU
```js
// get WebGPU device
const adapter = await navigator.gpu.requestAdapter({ ... });
const device = await adapter.requestDevice({ ... });

// set WebGPU adapter and device
ort.env.webgpu.adapter = adapter;
ort.env.webgpu.device = device;

const myMlContext = await navigator.ml.createContext(device);
const webnnOptions = {
  name: 'webnn',
  context: myMlContext,
  gpuDevice: device
};
```

This should throw (because a GPU device and MLContext options cannot
both be specified at the same time):
```js
const webnnOptions = {
  name: 'webnn',
  context: myMlContext,
  gpuDevice: device,
  deviceType: 'gpu'
};
```
2024-05-31 03:25:14 -07:00
Changming Sun
67bc9438d7
Update training packaging pipeline's docker files (#20853)
### Description
Similar to #20786 . The last PR was able to update all pipelines and all
docker files. This is a follow-up to that PR.

### Motivation and Context
1. To extract the common part as a reusable build infra among different
ONNX Runtime projects.
2. Avoid hitting docker hub's limit: 429 Too Many Requests - Server
message: toomanyrequests: You have reached your pull rate limit. You may
increase the limit by authenticating and upgrading:
https://www.docker.com/increase-rate-limit
2024-05-30 23:48:42 -07:00
Edward Chen
00589f578d
Fix bench_sqnbitgemm.cpp benchmark argument name list. (#20858)
Add the "HasBias" argument to the ArgNames() call so it matches with the ArgsProduct() call.
2024-05-30 18:59:54 -07:00
Adrian Lizarraga
b02d5e6d76
[CPU EP] Int4 support for QuantizeLinear, DequantizeLinear, and Transpose (#20362)
### Description
- 4-bit QuantizeLinear(21). **Blocked quantization still missing (i.e.,
do not support the new `block_size` attribute)**
- 4-bit DequantizeLinear(21). **Blocked dequantization still missing
(i.e., do not support the new `block_size` attribute)**
- 4-bit Transpose(21).
- Update quantization tool with int4 types.
- Disable QDQ fusions for 4-bit types. See:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc
- MLAS 4-bit quantization kernels for intel, neon, powerpc.

##### Notes
To calculate a tensor's storage size, we normally get the number of
elements from the shape (i.e., `tensor_shape.Size()`) and multiply by
the size of a single element. This does not directly work for sub-byte
elements like int4 as each element in a `Tensor<Int4x2>` stores **two**
packed int4 elements in a byte. `Tensor::CalculateTensorStorageSize`
should be called to perform the correct calculation for any tensor
element type.
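As a rough illustration of the packing arithmetic in the note above, here is a minimal Python sketch. It is not the ORT implementation: the helper name `packed_int4_storage_bytes` is hypothetical, and rounding an odd element count up to a whole byte is an assumption that follows from storing two int4 values per byte.

```python
def packed_int4_storage_bytes(shape):
    """Bytes needed to store a tensor of int4 elements, two per byte.

    Assumption: an odd number of elements still occupies a full final byte.
    """
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    # Two 4-bit elements share one byte, so round up after halving.
    return (num_elements + 1) // 2


# A 3x3 tensor has 9 elements; naive "elements * sizeof(element)" logic
# does not apply because a single int4 element is smaller than a byte.
bytes_3x3 = packed_int4_storage_bytes((3, 3))
bytes_2x4 = packed_int4_storage_bytes((2, 4))
```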

### Motivation and Context
ONNX 1.16 added the int4 and uint4 types. This initial PR adds the int4
type to ORT and adds int4 implementations for the Quant, Dequant, and
Transpose ops on CPU EP. We still need to add int4 support for many ops
and execution providers. See the ONNX 1.16 release notes:
https://github.com/onnx/onnx/releases.
2024-05-30 18:56:24 -07:00
Edward Chen
a508130456
Address React Native pipeline component detection timeout (#20871)
mac-react-native-ci-pipeline.yml:
- We don't need to run component detection for PR builds so just disable it there.

npm-packaging-pipeline.yml:
- Manually added component detection task was being added twice - removed one.
- Increased timeout of stage where component detection is run since the existing timeout was close for some builds.
2024-05-30 16:37:03 -07:00
Ye Wang
2200a0b3dd
Fix moe tests to run on supported arch (#20872)
### Description

https://github.com/microsoft/onnxruntime/issues/20788

Will do sm70 validation separately. 

### Motivation and Context
2024-05-30 13:26:38 -07:00
Changming Sun
65ef270e06
Update Aten pipeline's docker file to use UBI8 (#20856)
### Description
The pipeline currently uses CentOS 7, which is EOL. This PR updates it to UBI8.

### Motivation and Context
To deprecate CentOS 7.
2024-05-30 07:38:15 -07:00
Yueqing Zhang
59b13b7bbd
[VitisAI] update version and api & bug fix (#20851)
### Description
1. Use a defined macro to check the version number
2. Add a new API
3. Fix a bug in attr_proto


### Motivation and Context
These are some problems we need to address for the final delivery to
Microsoft.
2024-05-30 07:36:53 -07:00
Xu Xing
25ac65375c
[js/webgpu] Fix mha name (#20860)
### Description



### Motivation and Context
2024-05-30 00:01:06 -07:00
Jian Chen
228713f635
adding publishing stage to publish java CUDA 12 pkg to ado (#20834) 2024-05-29 16:24:23 -07:00
Carson M
5bfca1dc57
[Build] Change onnxruntime_NVCC_THREADS from option to cache entry (#20768)
### Description
Changes the `onnxruntime_NVCC_THREADS` CMake variable from an
[`option`](https://cmake.org/cmake/help/latest/command/option.html) to a
[cache
entry](https://cmake.org/cmake/help/latest/command/set.html#set-cache-entry).

### Motivation and Context
Fixes #19833.

`option` in CMake (confusingly, IMHO) always defines a *boolean* option.
The original definition of `onnxruntime_NVCC_THREADS` specified a
default of `1`, which I presume is coerced to `ON`. Thus, if the option
is not overridden with a value of another type, NVCC will receive a
malformed option `--threads ON` (rather than the expected `--threads
1`), which causes the error reported in #19833.

This error only occurred if compiling ONNX Runtime via CMake with
`-Donnxruntime_USE_CUDA=ON`; the CI build script always overrode
`onnxruntime_NVCC_THREADS` with a string value:

f1fef19b6e/tools/ci_build/build.py (L1152-L1154)
2024-05-29 12:28:33 -07:00
Wanming Lin
798cea2350
[WebNN EP] Remove legacy MLOperandDescriptor.type (#20783)
The latest Chrome supports MLOperandDescriptor.dataType, so remove the
legacy MLOperandDescriptor.type.
2024-05-29 10:20:17 -07:00
Wanming Lin
9ea9f9e46a
[WebNN EP] Add data type constraint (#20779)
The WebNN spec has added data type constraints for every op, and its CPU
backend (currently TFLite) has additional constraints. Add the
corresponding constraints to each op in the WebNN EP.

Note: fp16 is temporarily disabled for the CPU backend, as support is
planned to be ready in Chromium next month.
2024-05-29 10:19:51 -07:00
Vincent Wang
e77f238dc6
Update Torch Version to Fix ATen CPU Pipeline Failure (#20845)
Update Torch Version to Fix ATen CPU Pipeline Failure.
2024-05-29 16:04:18 +08:00
Adrian Lizarraga
3044aa8743
[Quant tool] Extend support for QDQ type conversion at graph output (#20841)
### Description
Allows mixed-precision overrides that add a QDQ quantization type
conversion sequence at a graph output that **is not** consumed by other
nodes. This is not a common use case, but the tool should handle it
instead of raising an error.

#### Example
Original model

![image](https://github.com/microsoft/onnxruntime/assets/19691973/4c9c3bb0-4ca1-4213-9259-9d0506ed22f2)

mixed-precision overrides:
```python
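from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

# float_model, float_model_path, qdq_model_path, and data_reader are
# assumed to be defined by the surrounding code.
mixed_prec_overrides = {
    "input_0": [{"quant_type": QuantType.QUInt16}],
    "op_0_out": [
        {
            "quant_type": QuantType.QUInt16,
            # Convert the output from uint16 to uint8 via an extra QDQ pair.
            "convert": {"quant_type": QuantType.QUInt8},
        }
    ],
}
quantize_static(
    float_model_path,
    qdq_model_path,
    data_reader,
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    op_types_to_quantize=[node.op_type for node in float_model.graph.node],
    extra_options={
        "TensorQuantOverrides": mixed_prec_overrides,
    },
)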
```

QDQ model:

![image](https://github.com/microsoft/onnxruntime/assets/19691973/804fc89b-4a00-43bc-a4ff-21edd6f27e98)

### Motivation and Context
This scenario arises for certain quantization configurations and should
be handled gracefully.
2024-05-28 21:27:54 -07:00
Yifan Li
d44be41e1c
[TensorRT EP] Support engine hardware compatibility (#20669)
### Description
- Introduce the option `trt_engine_hw_compatible` to support engine hardware
compatibility for Ampere+ GPUs.
- This enables the `nvinfer1::HardwareCompatibilityLevel::kAMPERE_PLUS` flag
when generating engines.
- This option has been validated on sm80/86 GPUs; an engine can be
reused across different Ampere+ architectures.
- The client side needs to enable this option as well to leverage existing
sm80+ engines.
- If this option is enabled with TRT < 8.6 or sm < 80, a warning is shown
that the option is not supported.

Engine naming:

| When | `trt_engine_hw_compatible=false` | `trt_engine_hw_compatible=true` |
| -------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| A100 (sm80) | TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80**.engine | TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80+**.engine |
| RTX3080 (sm86) | TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**86**.engine | TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80+**.engine |
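
The naming pattern in the table can be summarized with a small Python sketch. This is an illustration only, not the TensorRT EP code: `engine_arch_suffix` is a hypothetical helper, and falling back to the exact architecture when the option is enabled on sm < 80 is an assumption (the description only says a warning is shown in that case).

```python
def engine_arch_suffix(sm: int, hw_compatible: bool) -> str:
    """Return the architecture suffix used in the cached engine file name."""
    # With hardware compatibility on Ampere or newer, engines share one suffix.
    if hw_compatible and sm >= 80:
        return "sm80+"
    # Otherwise the engine is tied to the exact compute capability.
    return f"sm{sm}"


a100_default = engine_arch_suffix(80, hw_compatible=False)
rtx3080_default = engine_arch_suffix(86, hw_compatible=False)
rtx3080_compat = engine_arch_suffix(86, hw_compatible=True)
```

Under this sketch, an engine built on an A100 with the option enabled and one built on an RTX3080 with the option enabled end up with the same suffix, which is what allows reuse across Ampere+ GPUs.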


### Motivation and Context
Reference:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#hardware-compat

---------

Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
2024-05-28 18:12:56 -07:00
Edward Chen
535e9d7114
Update package_release_tasks.py (#20835)
1. Move azcopy environment variables out of script and into an Azure DevOps variable group. Move towards consolidating the managed identity client ID definition in one place.
2. Disable azcopy overwrite. We don't want to accidentally change the files for a released package.
2024-05-28 17:50:25 -07:00
Ye Wang
362a623905
fix a build error with cuda 12.5 (#20770)
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/20765
2024-05-28 10:46:24 -07:00
Adrian Lizarraga
e78b18a2fb
Increase ComponentDetection timeout for React Native CI (#20800)
### Description
Runs of the React Native CI are timing out during ComponentDetection
after 8 minutes. This increases the timeout value.



### Motivation and Context
Runs of the React Native CI are timing out during ComponentDetection.
2024-05-28 08:36:38 -07:00
Jian Chen
b1b8cb05dc
Adding java build and packaging stage to cuda-packaging-pipeline.yml (#20812)
### Description
Adding java build/packaging stage to `cuda-packaging-pipeline.yml`



### Motivation and Context
This enables publishing the Java CUDA 12 package along with the NuGet
CUDA 12 package.
2024-05-27 07:59:19 -07:00
Chi Lo
454fcdde00
[TensorRT EP] Weightless API integration (#20412)
This PR includes the weight-stripped engine feature (thanks @moraxu for
the #20214) which is the major feature for TRT 10 integration.

Two TRT EP options are added:

- `trt_weight_stripped_engine_enable`: Enable weight-stripped engine
build and refit.
- `trt_onnx_model_folder_path`: In the quick load case using embedded
engine model / EPContext mode, the original onnx filename is in the
node's attribute, and this option specifies the directory of that onnx
file if needed.

Normal weight-stripped engine workflow:

![image](https://github.com/microsoft/onnxruntime/assets/54722500/9f314865-cbda-4979-a7ac-b31c7a553b56)
Weight-stripped engine and quick load workflow:

![image](https://github.com/microsoft/onnxruntime/assets/54722500/9f31db51-a7a8-495b-ba25-54c7f904cbad)

See the doc
[here](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#tensorrt-ep-caches)
for more information about the EPContext model.

---------

Co-authored-by: yf711 <yifanl@microsoft.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: wejoncy <wejoncy@163.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Yi Zhang <your@email.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: inisis <46103969+inisis@users.noreply.github.com>
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
Co-authored-by: mo-ja <60505697+mo-ja@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com>
Co-authored-by: Atanas Dimitrov <70822030+neNasko1@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Dhruv Matani <dhruvbird@gmail.com>
Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Co-authored-by: Xu Xing <xing.xu@intel.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Andrew Fantino <15876180+afantino951@users.noreply.github.com>
Co-authored-by: Thomas Boby <thomas@boby.uk>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Michal Guzek <mguzek@nvidia.com>
Co-authored-by: George Wu <jywu@microsoft.com>
2024-05-26 12:24:17 -07:00
Changming Sun
439ed92b96
Remove TVM EP's pipeline (#20813)
### Description
Temporarily remove TVM EP's pipeline until someone helps us upgrade TVM
to a newer version which is compatible with the latest ONNX.

### Motivation and Context
The ONNX version that TVM EP uses has a known security vulnerability. We
cannot continue using it in our hosted build environment. This change is temporary.
2024-05-25 20:42:41 -07:00
Adrian Lizarraga
5bae32eb34
Extend DoubleQDQPairsRemover to handle sequences that end in duplicate DQ nodes (#20759)
### Description
Extend the DoubleQDQPairsRemover optimizer to also handle sequences that
end in duplicate DQ nodes.

For example, the following sequence:
```
 Q1 --> DQ1 --> Q2 --+--> DQ2
                     |
                     +--> DQ2'
```
Is now simplified to:
```
 Q1 ---+--> DQ2
       |
       +--> DQ2'
```
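
The rewrite shown in the diagrams can be sketched on a toy graph in Python. This is a simplified illustration, not the DoubleQDQPairsRemover implementation: the `op_types` and `consumers` dictionaries and the function name are hypothetical, and the sketch ignores scale and zero-point handling, which a real optimizer would presumably need to reconcile.

```python
def remove_double_qdq(op_types, consumers, q1):
    """Rewire Q1 -> DQ1 -> Q2 -> {DQ...} into Q1 -> {DQ...} on a toy graph.

    op_types maps node name -> "Q" or "DQ"; consumers maps node name -> list
    of consumer names. Returns True if the graph was changed.
    """
    def sole_consumer(node, expected_op):
        outs = consumers.get(node, [])
        if len(outs) == 1 and op_types[outs[0]] == expected_op:
            return outs[0]
        return None

    if op_types.get(q1) != "Q":
        return False
    dq1 = sole_consumer(q1, "DQ")
    q2 = sole_consumer(dq1, "Q") if dq1 else None
    if q2 is None:
        return False
    # The sequence may end in one DQ or in several duplicate DQ nodes.
    final_dqs = consumers.get(q2, [])
    if not final_dqs or any(op_types[n] != "DQ" for n in final_dqs):
        return False
    consumers[q1] = list(final_dqs)
    for removed in (dq1, q2):
        consumers.pop(removed, None)
        op_types.pop(removed, None)
    return True


op_types = {"Q1": "Q", "DQ1": "DQ", "Q2": "Q", "DQ2": "DQ", "DQ2'": "DQ"}
consumers = {"Q1": ["DQ1"], "DQ1": ["Q2"], "Q2": ["DQ2", "DQ2'"]}
changed = remove_double_qdq(op_types, consumers, "Q1")
```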


### Motivation and Context
The EnsureUniqueDQNodeUnits pass may add duplicate DQ nodes to ensure
valid QDQ node units. The DoubleQDQPairsRemover should still be able to
remove unnecessary QDQ ops if the target sequence ends in duplicate DQ
nodes.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-05-24 18:30:15 -07:00
Chi Lo
a7bc49a565
[TensorRT EP] Use latest commit of onnx-tensorrt parser (#20758)
The 10.0-GA branch has been updated with several fixes.
https://github.com/onnx/onnx-tensorrt/commits/10.0-GA/
2024-05-24 16:44:16 -07:00
Suryaprakash Shanmugam
1765da17e4
QDQ transformations in the OpenVINO EP for the NPU device (#20622)
We introduce rulesets that eliminate QDQ nodes of unsupported types and
for unsupported quantised operators for the NPU device. This leads to
improved performance and accuracy on critical client AI models.

Here's a summary of the changes:

- Introduces the provider option `enable_qdq_optimizer` which when set
to `True` enables stripping of QDQ nodes on the NPU device for models
with `QuantizeLinear` and `DequantizeLinear` layers in them.
`enable_qdq_optimizer` defaults to `False`.
- Always strip out int16/uint16 QDQ layers as these types are not
supported by the NPU compiler.
- Only supported ops `Conv`, `MatMul`, and `Add` retain QDQ layers
around them, specifically identified for optimal inference performance.
OpenVINO EP achieves this by iterating through NodeUnits in the QDQ
model, and reconstructing the graph only with the required layers.
- Added provider APIs to manipulate node units from EP code by
@adrianlizarraga
- Added capability rule for the Pad operator when it takes DQ layers as
input
- Fixes from static code analysis tool
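
The stripping rules listed above can be condensed into a small Python sketch. It is a deliberate simplification and not the OpenVINO EP ruleset: `keep_qdq`, the two constant sets, and the string type names are hypothetical, and the real rules operate on NodeUnits with more conditions than are modeled here.

```python
# Ops that keep their surrounding QDQ layers, per the summary above.
SUPPORTED_QDQ_OPS = {"Conv", "MatMul", "Add"}
# Quantization types that are always stripped on the NPU device.
UNSUPPORTED_QDQ_TYPES = {"int16", "uint16"}


def keep_qdq(target_op_type: str, quant_type: str) -> bool:
    """Decide whether the QDQ layers around an op are retained."""
    if quant_type in UNSUPPORTED_QDQ_TYPES:
        return False
    return target_op_type in SUPPORTED_QDQ_OPS


conv_uint8 = keep_qdq("Conv", "uint8")
conv_uint16 = keep_qdq("Conv", "uint16")
relu_uint8 = keep_qdq("Relu", "uint8")
```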

---------

Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
2024-05-24 16:25:05 -07:00
Adam Louly
ed8275883a
[Training] Add bf16 support to GatherElementsGrad. (#20796)
### Description
Adding bf16 support to GatherElementsGrad.

---------

Co-authored-by: Adam Louly <adamlouly@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
2024-05-24 15:55:14 -07:00
Suryaprakash Shanmugam
76e1a06986
Fix ordering of value info in GraphProto creation (#20691)
### Description
Graph member value_info_ (unordered_set) is ordered before its values
are added to the graph proto.


### Motivation and Context

- Without this ordering, the model proto used by the OpenVINO EP is not
deterministic and varies across runs.
- Since the model proto varies, it affects caching attempts by OpenVINO.

Q: If creating a vector to have ordered elements is costly, should we
make value_info_ a std::set that is sorted according to NodeArg names?
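
The effect of ordering before serialization can be shown with a minimal Python sketch. It is only an analogy for the C++ change: `NodeArg` here is a hypothetical named tuple, and `serialize_value_info` stands in for writing value info into the graph proto.

```python
from typing import NamedTuple


class NodeArg(NamedTuple):
    name: str
    elem_type: str


def serialize_value_info(value_info: set) -> list:
    """Emit entries sorted by name so the output does not depend on set order."""
    return [
        f"{arg.name}:{arg.elem_type}"
        for arg in sorted(value_info, key=lambda a: a.name)
    ]


# Two sets with the same contents, built in different orders.
run_a = serialize_value_info({NodeArg("z", "float"), NodeArg("a", "int64")})
run_b = serialize_value_info({NodeArg("a", "int64"), NodeArg("z", "float")})
```

Because the entries are sorted by name, both calls produce the same sequence, which is the property a cache keyed on the serialized model relies on.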


Related PR about ordering initializers:
https://github.com/microsoft/onnxruntime/pull/14631
2024-05-24 10:49:32 -07:00
Peishen Yan
cfe68e489e
[WebNN EP] Support Trilu op (#20730)
Adds support for Trilu via WebNN Triangular op
2024-05-24 10:46:54 -07:00
Guenther Schmuelling
33a68d221f
add missing file for pr20791 (#20811)
this file should have been in pr20791 to allow fp16 in the tile
implementation
2024-05-24 09:59:13 -07:00
Jian Chen
10c425a4d5
Fix Onnx >= to == (#20798)
### Description



### Motivation and Context
2024-05-24 09:16:23 -07:00
Jian Chen
fe24006425
Fix Nuget Cuda pipeline package pipeline (#20741)
### Description

This PR adds protoc.exe to the NuGet CUDA pipeline, which also
allows it to build Java for various CUDA versions.

### Motivation and Context
2024-05-24 09:15:57 -07:00
Satya Kumar Jandhyala
bab5037eab
Eliminate explicit Concat operations in Attention (#20556)
### Description
Remove the explicit concatenation of pastKey with Key and pastValue with
Value.



### Motivation and Context
2024-05-24 09:07:57 -07:00
Changming Sun
535a030b1e
Remove manylinux build scripts from python packaging pipeline (#20786)
### Description
Use a common set of prebuilt manylinux base images to build the
packages, to avoid building the manylinux part again and again. The base
images can be used in GenAI and other projects too.
This PR also updates the GCC version for inference python CUDA11/CUDA12
builds from 8 to 11. Later on I will update all other CUDA pipelines to
use GCC 11, to avoid the issue described in
https://github.com/onnx/onnx/issues/6047 and
https://github.com/microsoft/onnxruntime-genai/issues/257 .

### Motivation and Context
To extract the common part as a reusable build infra among different
ONNX Runtime projects.
2024-05-24 08:18:22 -07:00
Jian Chen
884acd4598
Fix Nuget-Cuda publish pipeline (#20794)
### Description
Previously, all feeds were set to nightly, and the official release
feed ID was not set.


### Motivation and Context
2024-05-23 18:27:46 -07:00
Guenther Schmuelling
0cf7caaff2
[js/webgpu] enable fp16 for tile (#20791) 2024-05-23 16:59:39 -07:00
Edward Chen
d1af19db9d
Add some CPU feature detection for Apple platforms. (#20769)
Add CPUIDInfo::ArmAppleInit() to detect CPU features on Apple platforms. This initial implementation is not comprehensive.
2024-05-23 15:59:46 -07:00
Changming Sun
b522df0ae4
Update RE2 to the latest (#20775)
Update RE2 to the latest.

To keep the components up to date.
2024-05-23 14:30:15 -07:00