onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-07 17:15:29 +00:00

Author	SHA1	Message	Date
Adrian Lizarraga	b5eb9e8a8a	[QNN EP] Update to QNN SDK 2.22 (#20628 ) ### Description - Updates pipelines to use QNN SDK 2.22 by default. - Linux QNN pipeline now uses an Ubuntu 22.04 image (required by QNN SDK) - Android QNN pipeline still uses the current Ubuntu 20.04 image. Will update in a separate PR. - Disables QDQ LayerNorm test that triggers QNN's graph finalization error on QNN 2.22 - Increases accuracy tolerance for various HTP tests so that they pass on Windows arm64. ### Motivation and Context Test QNN EP with latest QNN SDK version by default. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2024-06-05 18:25:23 -07:00
Adrian Lizarraga	df28c7d73b	[Quant tool] Improve performance of int4 weight quantization (#20935 ) ### Description - Uses our own quantization functions instead of the ONNX reference implementation of QuantizeLinear when quantizing weights to int4. - Uses a custom function that packs bytes into 4-bit elements. ### Motivation and Context Running the quantization tool to create QDQ models with int4 weights could take up to 7x longer. This PR uses our own quantization and byte packing utilities to improve performance. #### Measurements Model with ~5M parameters to quantize to int4. - Current implementation: 84.5s - Only replace ONNX QuantizeLinear implementation: 50.3s (1.68x speedup) - This PR (replace onnx Q impl, custom packing func): 13.5s (6.26x speedup) --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2024-06-05 16:48:40 -07:00
Chip Kerchner	4cb23b020c	Improvements to the INT8 GEMM portion of the code for Power (#20595 ) These changes improve the GEMM portion of the code for Power. The main code changes are: 1) Change a function to a template parameter so that operations that add/subtract zero are eliminated at compile time, and reuse a vector that holds the mask instead of rebuilding it each time. 2) Process 16 columns at a time in MlasGemmQuantCopyPackB8x8, which should reduce potential page faults by a factor of 4 and also be faster. 3) Unroll MlasQgemmStoreVectorMMA and vectorize other variables.	2024-06-05 14:24:22 -07:00
Yufeng Li	63c13a4811	fix integer overflow in Attention (#20921 ) ### Description The offset used in Attention has data type int, which can overflow for large sequence lengths.	2024-06-05 10:19:26 -07:00
Yueqing Zhang	b374ddd704	[VitisAI] add new api for models (#20899 ) ### Description <!-- Describe your changes. --> Add new APIs. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This change is required for satisfying requirement of Microsoft. --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2024-06-04 22:48:04 -07:00
Jing Fang	3ecb012337	[CPU EP] Add blocked quantization to DequantizeLinear op kernel (#20901 ) ### Description Added blocked quantization to DequantizeLinear op kernel. All existing [input types and output types](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftdequantizelinear) are supported. All axes are supported. The implementation in the PR is naive - single thread and scalar instructions. Multi-threading and vector instructions are planned in the future based on the needs. ### Motivation and Context onnx introduced blocked quantization in opset 21 for [DequantizeLinear](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftdequantizelinear). This PR adds the spec support in onnx runtime.	2024-06-04 14:44:40 -07:00
Jian Chen	5faeaf6437	Remove failOnStderr from Gradle cmakeCheck (#20919 ) ### Description Remove failOnStderr from Gradle cmakeCheck ### Motivation and Context The Gradle is still using the deprecated API	2024-06-04 13:54:49 -07:00
Tianlei Wu	6dfdef7782	update stable diffusion demo requirements (#20914 ) ### Description Update docker and package version for stable diffusion demo. ### Motivation and Context Update onnx to 1.16 for security	2024-06-04 12:08:04 -07:00
liqun Fu	51bc53580d	Update to onnx 1.16.1 (#20702 )	2024-06-04 11:06:28 -07:00
Changming Sun	3dd6fcc089	Upgrade min ios version to 13.0 (#20773 ) To align with Office and other MS products. Office's support policy is: "Office for iPad and iPhone is supported on the two most recent versions of iOS and iPadOS. When a new version of iOS or iPadOS is released, the Office Operating System requirement becomes the two most recent versions: the new version of iOS or iPadOS and the previous version." (from https://products.office.com/office-system-requirements) The latest iOS version is 17. So they support both 17 and 16. Here I set our min iOS version to 13 so that it will be a superset of what Office supports. This change would allow us using C++17's std::filesystem feature in the core framework. The modifications were generated by running ```bash find . -type f -exec sed -i "s/apple_deploy_target[ =]12.0/apple_deploy_target=13.0/g" {} \; ``` Cannot use 15.0 because otherwise iOS packaging would fail with: ``` /Users/runner/work/1/b/apple_framework/intermediates/iphoneos_arm64/Release/_deps/coremltools-src/mlmodel/src/MILBlob/Util/Span.hpp:288:9: error: cannot use 'throw' with exceptions disabled MILVerifyIsTrue(index < Size(), std::range_error, "index out of bounds"); ``` The Google OSS libraries we use only officially support iOS 15+.	2024-06-04 10:15:20 -07:00
Yi Zhang	c5087b9b58	Improve stable diffusion image parity test stability (#20904 ) ### Description 1. Add one image into whitelist, but if the image is hit, the pipeline status is warning. 2. adjust the image parity test tolerance ### Motivation and Context improve pipeline stability	2024-06-04 10:19:32 +08:00
zhijiang	3c561c8b26	fix bug (#20694 ) When the number of elements in a tensor is larger than 2^32, use cuda_long as the dtype of the offset.	2024-06-04 09:22:10 +08:00
Caroline Zhu	94ce1209f9	Bug fix for gather fusion with on-device training (#20891 ) ### Description Update the initializer that's added in GatherSliceToSplitFusion to use the GenerateNodeArgName function, rather than the GenerateNodeName function. GenerateNodeName goes through all the nodes in the graph to see if the given name is already used and generates a unique one if it has been used. GenerateNodeArgName iterates through all the node args in the graph to see if the given name is already used. ### Motivation and Context * on-device training goes through a generate artifacts step, where optimizations are applied, then, when the training artifact is loaded, additional optimizations are applied. In the first round of optimizations, a "splits" initializer is added for phi-3. With the second round of optimizations, another "splits" initializer with different dimensions and data is added. Since we call GenerateNodeName func, the first splits initializer isn't found, causing a type error where it claims the shape of splits does not match the TensorProto shape.	2024-06-03 14:41:39 -07:00
Jian Chen	456ab09d17	Component Governance fix round 5 (#20905 ) …over the case where there is only single repo checked out ### Description adding $(Build.SourcesDirectory)/cmake/external/onnx/third_party to cover the case where there is only single repo checked out ### Motivation and Context Fix CG issue https://aiinfra.visualstudio.com/Lotus/_componentGovernance/97926/alert/8862110?typeId=16576846	2024-06-03 14:22:22 -07:00
Wanming Lin	9c6481fa2d	[WebNN EP] Enable ArgMax and ArgMin for CPU backend (#20865 ) The WebNN TFLite backend supports ArgMax and ArgMin, but only when 'select_last_index' is 0.	2024-06-03 14:12:11 -07:00
Wanming Lin	c128132dd8	[WebNN EP] TFLite backend only supports Elu with default alpha (#20862 )	2024-06-03 14:10:22 -07:00
Jian Chen	ae8df4db8f	Split java's gradle build and test (#20817 ) ### Description This PR allows `./gradlew cmakeCheck` to fail in the Windows_Packaging_(CUDA|TensorRT) job. The job will still generate all the necessary jar and pom files for later stages to consume, and `./gradlew cmakeCheck` will run again in the Windows_Packaging_(CUDA|TensorRT)_Testing stage. ### Motivation and Context Reduce the time of all Java packaging stages by 30+ min.	2024-06-03 14:08:45 -07:00
Yulong Wang	ab9f153746	[js/web] allow build target for non dynamic import (#20898 ) ### Description <!-- Describe your changes. --> This PR allows to build ORT web to `ort{.all\|.webgpu}.bundle.min.mjs`, which does not have any dynamic import. This makes it possible to use ort web via static import in service worker. Fixes #20876	2024-06-03 12:33:37 -07:00
Changming Sun	d13cabf7f9	Upgrade GCC and remove the dependency on GCC8's experimental std::filesystem implementation (#20893 ) ### Description This PR upgrades CUDA 11 build pipelines' GCC version from 8 to 11. ### Motivation and Context GCC8 has an experimental std::filesystem implementation which is not ABI compatible with the formal one in later GCC releases. It didn't cause trouble for us, however, ONNX community has encountered this issue much. For example, https://github.com/onnx/onnx/issues/6047 . So this PR increases the minimum supported GCC version from 8 to 9, and removes the references to GCC's "stdc++fs" library. Please note we compile our code on RHEL8 and RHEL8's libstdc++ doesn't have the fs library, which means the binaries in ONNX Runtime's official packages always static link to the fs library. It is just a matter of which version of the library, an experimental one or a more mature one. And it is an implementation detail that is not visible from outside. Anyway, a newer GCC is better. It will give us the chance to use many C++20 features. #### Why we were using GCC 8? It is because all our Linux packages were built on RHEL8 or its equivalents. The default GCC version in RHEL8 is 8. RHEL also provides additional GCC versions from RH devtoolset. UBI8 is the abbreviation of Red Hat Universal Base Image 8, which is the containerized RHEL8. UBI8 is free, which means it doesn't require a subscription(while RHEL does). The only devtoolset that UBI8 provides is GCC 12, which is too new for being used with CUDA 11.8. And our CUDA 11.8's build env is a docker image from Nvidia that is based on UBI8. #### How the problem is solved Almalinux is an alternative to RHEL. Almalinux 8 provides GCC 11. And the CUDA 11.8 docker image from Nvidia is open source, which means we can rebuild the image based on Almalinux 8 to get GCC 11. I've done this, but I cannot republish the new image due to various complicated license restrictions. 
Therefore I put them at an internal location in onnxruntimebuildcache.azurecr.io.	2024-06-03 10:14:08 -07:00
Edward Chen	a7a49189e8	Suppress Eigen warning in onnxruntime/test/onnx/microbenchmark/eigen.cc. (#20892 ) Fix ARM64 GCC build with `--build_micro_benchmarks`.	2024-06-03 11:25:56 -05:00
Jian Chen	217b66fd85	Update py-publishing pipeline to use the resource from packaging pipeline (#20888 ) ### Motivation and Context To allow the nightly release to be triggered automatically.	2024-06-01 16:10:02 -07:00
Adrian Lizarraga	5ec7ac80c7	Fix compiler error when onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS is enabled (#20889 ) ### Description The recent [PR for int4 support](https://github.com/microsoft/onnxruntime/pull/20362) breaks builds with the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS option enabled. This PR adds utility functions for debug printing of int4 tensor statistics and data. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-05-31 18:07:53 -07:00
Patrice Vignola	50ee1b056c	[DML EP] Improve memory usage and fix memory leak in graph capture (#20879 ) Phi-3 vision loads 3 models in memory, which means that we have 3 different sessions, 3 different execution providers and 3 different allocators all loaded at the same time. Since the DML EP uses a bucketized allocator, this results in a lot of memory fragmentation across all 3 models that can only be used by the model itself. To fix that, we can disable the memory arena (term for any kind of allocator that reuses memory in ORT) as an opt-in option. In the case of LLMs, we essentially never need to reallocate memory after the initial graphs have been capture, which means that we gain nothing by using the bucketized allocator, and it causes unnecessary fragmentation. --------- Co-authored-by: Patrice Vignola <pavignol@microsoft.com>	2024-05-31 17:24:50 -07:00
Ye Wang	ad769f14a8	Suppress maybe used uninitialized warning as being false alert (#20886 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> It breaks the python package pipeline. A new run: https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=477415&view=logs&s=d66927fc-650e-5e6f-874c-ae9229c1e7e4 --------- Co-authored-by: Your Name <you@example.com>	2024-05-31 17:04:58 -07:00
Changming Sun	4e18344028	Delete docs/Python_Dev_Notes.md (#20887 ) It is no longer relevant: the issue has not been a problem since Python 3.5, and the minimum Python version we support is 3.8.	2024-05-31 14:01:11 -07:00
Yulong Wang	35697d2421	[js/webnn] update API of session options for WebNN (#20816 ) ### Description This PR is an API-only change to address the requirements being discussed in #20729. There are multiple ways that users may create an ORT session by specifying the session options differently. All the code snippet below will use the variable `webnnOptions` as this: ```js const myWebnnSession = await ort.InferenceSession.create('./model.onnx', { executionProviders: [ webnnOptions ] }); ``` ### The old way (backward-compatibility) ```js // all-default, name only const webnnOptions_0 = 'webnn'; // all-default, properties omitted const webnnOptions_1 = { name: 'webnn' }; // partial const webnnOptions_2 = { name: 'webnn', deviceType: 'cpu' }; // full const webnnOptions_3 = { name: 'webnn', deviceType: 'gpu', numThreads: 1, powerPreference: 'high-performance' }; ``` ### The new way (specify with MLContext) ```js // options to create MLcontext const options = { deviceType: 'gpu', powerPreference: 'high-performance' }; const myMlContext = await navigator.ml.createContext(options); // options for session options const webnnOptions = { name: 'webnn', context: myMlContext, ...options }; ``` This should throw (because no deviceType is specified): ```js const myMlContext = await navigator.ml.createContext({ ... }); const webnnOptions = { name: 'webnn', context: myMlContext }; ``` ### Interop with WebGPU ```js // get WebGPU device const adaptor = await navigator.gpu.requestAdapter({ ... }); const device = await adaptor.requestDevice({ ... 
}); // set WebGPU adaptor and device ort.env.webgpu.adaptor = adaptor; ort.env.webgpu.device = device; const myMlContext = await navigator.ml.createContext(device); const webnnOptions = { name: 'webnn', context: myMlContext, gpuDevice: device }; ``` This should throw (because cannot specify both gpu device and MLContext option at the same time): ```js const webnnOptions = { name: 'webnn', context: myMlContext, gpuDevice: device, deviceType: 'gpu' }; ```	2024-05-31 03:25:14 -07:00
Changming Sun	67bc9438d7	Update training packaging pipeline's docker files (#20853 ) ### Description Similar to #20786 . The last PR was able to update all pipelines and all docker files. This is a follow-up to that PR. ### Motivation and Context 1. To extract the common part as a reusable build infra among different ONNX Runtime projects. 2. Avoid hitting docker hub's limit: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit	2024-05-30 23:48:42 -07:00
Edward Chen	00589f578d	Fix bench_sqnbitgemm.cpp benchmark argument name list. (#20858 ) Add the "HasBias" argument to the ArgNames() call so it matches with the ArgsProduct() call.	2024-05-30 18:59:54 -07:00
Adrian Lizarraga	b02d5e6d76	[CPU EP] Int4 support for QuantizeLinear, DequantizeLinear, and Transpose (#20362 ) ### Description - 4-bit QuantizeLinear(21). Blocked quantization still missing (i.e., do not support the new `block_size` attribute) - 4-bit DequantizeLinear(21). Blocked dequantization still missing (i.e., do not support the new `block_size` attribute) - 4-bit Transpose(21). - Update quantization tool with int4 types. - Disable QDQ fusions for 4-bit types. See: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc - MLAS 4-bit quantization kernels for intel, neon, powerpc. ##### Notes To calculate a tensor's storage size, we normally get the number of elements from the shape (i.e., `tensor_shape.Size()`) and multiply by the size of a single element. This does not directly work for sub-byte elements like int4 as each element in a `Tensor<Int4x2>` stores two packed int4 elements in a byte. The `Tensor:: CalculateTensorStorageSize` should be called to perform the correct calculation for any tensor element type. ### Motivation and Context ONNX 1.16 added the int4 and uint4 types. This initial PR adds the int4 type to ORT and adds int4 implementations for the Quant, Dequant, and Transpose ops on CPU EP. We still need to add int4 support for many ops and execution providers. See the ONNX 1.16 release notes: https://github.com/onnx/onnx/releases.	2024-05-30 18:56:24 -07:00
Edward Chen	a508130456	Address React Native pipeline component detection timeout (#20871 ) mac-react-native-ci-pipeline.yml: - We don't need to run component detection for PR builds so just disable it there. npm-packaging-pipeline.yml: - Manually added component detection task was being added twice - removed one. - Increased timeout of stage where component detection is run since the existing timeout was close for some builds.	2024-05-30 16:37:03 -07:00
Ye Wang	2200a0b3dd	Fix moe tests to run on supported arch (#20872 ) ### Description <!-- Describe your changes. --> https://github.com/microsoft/onnxruntime/issues/20788 Will do sm70 validation separately. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-05-30 13:26:38 -07:00
Changming Sun	65ef270e06	Update Aten pipeline's docker file to use UBI8 (#20856 ) ### Description Now it uses CentOS 7 which is EOL. This PR updates it to UBI8. ### Motivation and Context To deprecate CentOS 7 .	2024-05-30 07:38:15 -07:00
Yueqing Zhang	59b13b7bbd	[VitisAI] update version and api & bug fix (#20851 ) ### Description <!-- Describe your changes. --> 1. Use macro defined to check version number 2. Add a new api 3. Fix bug at attr_proto ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> These are some problems we need to address for the final delivery to Microsoft.	2024-05-30 07:36:53 -07:00
Xu Xing	25ac65375c	[js/webgpu] Fix mha name (#20860 )	2024-05-30 00:01:06 -07:00
Jian Chen	228713f635	adding publishing stage to publish java CUDA 12 pkg to ado (#20834 )	2024-05-29 16:24:23 -07:00
Carson M	5bfca1dc57	[Build] Change `onnxruntime_NVCC_THREADS` from option to cache entry (#20768 ) ### Description Changes the `onnxruntime_NVCC_THREADS` CMake variable from an [`option`](https://cmake.org/cmake/help/latest/command/option.html) to a [cache entry](https://cmake.org/cmake/help/latest/command/set.html#set-cache-entry). ### Motivation and Context Fixes #19833. `option` in CMake (confusingly, IMHO) always defines a boolean option. The original definition of `onnxruntime_NVCC_THREADS` specified a default of `1`, which I presume is coerced to `ON`. Thus, if the option is not overridden with a value of another type, NVCC will receive a malformed option `--threads ON` (rather than the expected `--threads 1`), which causes the error reported in #19833. This error only occurred if compiling ONNX Runtime via CMake with `-Donnxruntime_USE_CUDA=ON`; the CI build script always overrode `onnxruntime_NVCC_THREADS` with a string value: `f1fef19b6e/tools/ci_build/build.py (L1152-L1154)`	2024-05-29 12:28:33 -07:00
Wanming Lin	798cea2350	[WebNN EP] Remove legacy MLOperandDescriptor.type (#20783 ) The latest Chrome supports MLOperandDescriptor.dataType, so remove the legacy MLOperandDescriptor.type.	2024-05-29 10:20:17 -07:00
Wanming Lin	9ea9f9e46a	[WebNN EP] Add data type constraint (#20779 ) The WebNN spec has added data type constraints for every op, and its CPU backend (currently TFLite) has additional constraints. Add the corresponding constraints to each op in WebNN EP. Note: fp16 is temporarily disabled for the CPU backend, where support is planned to be ready in Chromium next month.	2024-05-29 10:19:51 -07:00
Vincent Wang	e77f238dc6	Update Torch Version to Fix ATen CPU Pipeline Failure (#20845 ) Update Torch Version to Fix ATen CPU Pipeline Failure.	2024-05-29 16:04:18 +08:00
Adrian Lizarraga	3044aa8743	[Quant tool] Extend support for QDQ type conversion at graph output (#20841 ) ### Description Allows mixed-precision overrides that adds a QDQ quantization type conversion sequence at a graph output that is not consumed by other nodes. This is not a common use-case but should handle it instead of raising an error. #### Example Original model ![image](https://github.com/microsoft/onnxruntime/assets/19691973/4c9c3bb0-4ca1-4213-9259-9d0506ed22f2) mixed-precision overrides: ```python mixed_prec_overrides = { "input_0": [{"quant_type": QuantType.QUInt16}], "op_0_out": [ { "quant_type": QuantType.QUInt16, "convert": {"quant_type": QuantType.QUInt8}, } ], } quantize_static( float_model_path, qdq_model_path, data_reader, quant_format=QuantFormat.QDQ, activation_type=QuantType.QUInt8, op_types_to_quantize=[node.op_type for node in float_model.graph.node], extra_options={ "TensorQuantOverrides": mixed_prec_overrides, }, ) ``` QDQ model: ![image](https://github.com/microsoft/onnxruntime/assets/19691973/804fc89b-4a00-43bc-a4ff-21edd6f27e98) ### Motivation and Context This scenario is arising for certain quantization configurations. Should handle it gracefully.	2024-05-28 21:27:54 -07:00
Yifan Li	d44be41e1c	[TensorRT EP] Support engine hardware compatibility (#20669 ) ### Description <!-- Describe your changes. --> - Introduce option `trt_engine_hw_compatible` to support engine hardware compatibility for Ampere+ GPUs - This enables `nvinfer1::HardwareCompatibilityLevel::kAMPERE_PLUS` flag when generating engines - This option has been validated on sm80/86 GPUs, as engine can be reused across different ampere+ arch: - Client side need to enable this option as well to leverage existing sm80+ engines - If this option is enabled by users which TRT<8.6 or sm<80, there will be a warning showing this option not supported Engine naming: \| When \| `trt_engine_hw_compat=false` \| `trt_engine_hw_compat=true` \| \| -------------- \| ------------------------------------------------------------ \| ------------------------------------------------------------ \| \| A100 (sm80) \| TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm80.engine \| TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm80+.engine \| \| RTX3080 (sm86) \| TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm86.engine \| TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm80+.engine \| ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Reference: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#hardware-compat --------- Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>	2024-05-28 18:12:56 -07:00
Edward Chen	535e9d7114	Update package_release_tasks.py (#20835 ) 1. Move azcopy environment variables out of script and into an Azure DevOps variable group. Move towards consolidating the managed identity client ID definition in one place. 2. Disable azcopy overwrite. We don't want to accidentally change the files for a released package.	2024-05-28 17:50:25 -07:00
Ye Wang	362a623905	fix a build error with cuda 12.5 (#20770 ) ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/20765	2024-05-28 10:46:24 -07:00
Adrian Lizarraga	e78b18a2fb	Increase ComponentDetection timeout for React Native CI (#20800 ) ### Description Runs of the React Native CI are timing out during ComponentDetection after 8 minutes. This increases the timeout value. ### Motivation and Context Runs of the React Native CI are timing out during ComponentDetection.	2024-05-28 08:36:38 -07:00
Jian Chen	b1b8cb05dc	Adding java build and packaging stage to cuda-packaging-pipeline.yml (#20812 ) ### Description Adding java build/packaging stage to `cuda-packaging-pipeline.yml` ### Motivation and Context This way we can enable publishing the Java Cuda 12 along with Nuget CUDA 12	2024-05-27 07:59:19 -07:00
Chi Lo	454fcdde00	[TensorRT EP] Weightless API integration (#20412 ) This PR includes the weight-stripped engine feature (thanks @moraxu for the #20214) which is the major feature for TRT 10 integration. Two TRT EP options are added: - `trt_weight_stripped_engine_enable`: Enable weight-stripped engine build and refit. - `trt_onnx_model_folder_path`: In the quick load case using embedded engine model / EPContext mode, the original onnx filename is in the node's attribute, and this option specifies the directory of that onnx file if needed. Normal weight-stripped engine workflow: ![image](https://github.com/microsoft/onnxruntime/assets/54722500/9f314865-cbda-4979-a7ac-b31c7a553b56) Weight-stripped engine and quick load workflow: ![image](https://github.com/microsoft/onnxruntime/assets/54722500/9f31db51-a7a8-495b-ba25-54c7f904cbad) see the doc [here ](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#tensorrt-ep-caches)for more information about EPContext model. 
--------- Co-authored-by: yf711 <yifanl@microsoft.com> Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com> Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com> Co-authored-by: pengwa <pengwa@microsoft.com> Co-authored-by: wejoncy <wejoncy@163.com> Co-authored-by: Yi Zhang <zhanyi@microsoft.com> Co-authored-by: Yi Zhang <your@email.com> Co-authored-by: Pranav Sharma <prs@microsoft.com> Co-authored-by: Adam Pocock <adam.pocock@oracle.com> Co-authored-by: cao lei <jslhcl@gmail.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: inisis <46103969+inisis@users.noreply.github.com> Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com> Co-authored-by: mo-ja <60505697+mo-ja@users.noreply.github.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com> Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com> Co-authored-by: Atanas Dimitrov <70822030+neNasko1@users.noreply.github.com> Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Co-authored-by: Yufeng Li <liyufeng1987@gmail.com> Co-authored-by: Dhruv Matani <dhruvbird@gmail.com> Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com> Co-authored-by: wangshuai09 <391746016@qq.com> Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com> Co-authored-by: Xu Xing <xing.xu@intel.com> Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com> Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com> Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Jian Chen <cjian@microsoft.com> Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Andrew Fantino <15876180+afantino951@users.noreply.github.com> Co-authored-by: Thomas Boby <thomas@boby.uk> Co-authored-by: 
Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: Michal Guzek <mguzek@nvidia.com> Co-authored-by: George Wu <jywu@microsoft.com>	2024-05-26 12:24:17 -07:00
Changming Sun	439ed92b96	Remove TVM EP's pipeline (#20813 ) ### Description Temporarily remove TVM EP's pipeline until someone helps us upgrade TVM to a newer version which is compatible with the latest ONNX. ### Motivation and Context The ONNX version that TVM EP uses has a known security vulnerability. We cannot continue using it in our hosted build environment. This change is temporary	2024-05-25 20:42:41 -07:00
Adrian Lizarraga	5bae32eb34	Extend DoubleQDQPairsRemover to handle sequences that end in duplicate DQ nodes (#20759 ) ### Description Extend the DoubleQDQPairsRemover optimizer to also handle sequences that end in duplicate DQ nodes. For example, the following sequence: ``` Q1 --> DQ1 --> Q2 --+--> DQ2 \| +--> DQ2' ``` Is now simplified to: ``` Q1 ---+--> DQ2 \| +--> DQ2' ``` ### Motivation and Context The EnsureUniqueDQNodeUnits pass may add duplicate DQ nodes to ensure valid QDQ node units. The DoubleQDQPairsRemover should still be able to remove unnecessary QDQ ops if the target sequence ends in duplicate DQ nodes. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-05-24 18:30:15 -07:00
Chi Lo	a7bc49a565	[TensorRT EP] Use latest commit of onnx-tensorrt parser (#20758 ) The 10.0-GA branch has been updated with several fixes. https://github.com/onnx/onnx-tensorrt/commits/10.0-GA/	2024-05-24 16:44:16 -07:00
Suryaprakash Shanmugam	1765da17e4	QDQ transformations in the OpenVINO EP for the NPU device (#20622 ) We introduce rulesets that eliminate QDQ nodes of unsupported types and for unsupported quantised operators for the NPU device. This leads to improved performance and accuracy on critical client AI models. Here's a summary of the changes: - Introduces the provider option `enable_qdq_optimizer` which when set to `True` enables stripping of QDQ nodes on the NPU device for models with `QuantizeLinear` and `DequantizeLinear` layers in them. `enable_qdq_optimizer` defaults to `False`. - Always strip out int16/uint16 QDQ layers as these types are not supported by the NPU compiler. - Only supported ops `Conv`, `MatMul`, and `Add` retain QDQ layers around them, specifically identified for optimal inference performance. OpenVINO EP achieves this by iterating through NodeUnits in the QDQ model, and reconstructing the graph only with the required layers. - Added provider APIs to manipulate node units from EP code by @adrianlizarraga - Added capability rule for the Pad operator when it takes DQ layers as input - Fixes from static code analysis tool --------- Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>	2024-05-24 16:25:05 -07:00
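
The notes on commit b02d5e6d76 point out that two int4 elements share one byte, so a tensor's storage size is not simply the element count times the element size. A minimal Python sketch of that idea follows. The helper names `int4_storage_size`, `pack_int4`, and `unpack_int4` are hypothetical and are not ONNX Runtime APIs, and the nibble order (first element in the low 4 bits) is an assumption of this sketch.

```python
def int4_storage_size(num_elements):
    """Bytes needed when two 4-bit elements share one byte (odd counts round up)."""
    return (num_elements + 1) // 2


def pack_int4(values):
    """Pack signed int4 values (-8..7) two per byte.

    Assumption of this sketch: the first element of each pair goes in the low nibble.
    """
    for v in values:
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in a signed 4-bit integer")
    # Pad odd-length input with a zero nibble so every byte holds a full pair.
    padded = list(values) + [0] * (len(values) % 2)
    packed = bytearray()
    for low, high in zip(padded[0::2], padded[1::2]):
        packed.append((low & 0x0F) | ((high & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(data, num_elements):
    """Inverse of pack_int4: sign-extend each nibble and drop any padding."""
    out = []
    for byte in data:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:num_elements]


packed = pack_int4([1, -1, 7])
print(len(packed), int4_storage_size(3))  # both are 2
print(unpack_int4(packed, 3))             # [1, -1, 7]
```

Three int4 elements occupy two bytes, not three, which is why a dedicated storage-size calculation is needed for sub-byte types.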


11159 commits
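
Commit 3ecb012337 adds blocked quantization to the DequantizeLinear kernel, where each block of `block_size` consecutive elements along an axis has its own scale and zero point. A minimal one-dimensional Python sketch of that arithmetic follows. The function name `dequantize_blocked` is hypothetical, and this is not the kernel's implementation, which handles arbitrary axes and element types.

```python
def dequantize_blocked(x, scales, zero_points, block_size):
    """Dequantize a 1-D list where every `block_size` elements share a scale and zero point.

    Element i uses block i // block_size: y[i] = (x[i] - zero_point[block]) * scale[block].
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    num_blocks = (len(x) + block_size - 1) // block_size  # ceiling division
    if len(scales) != num_blocks or len(zero_points) != num_blocks:
        raise ValueError("need exactly one scale and one zero point per block")
    return [
        (q - zero_points[i // block_size]) * scales[i // block_size]
        for i, q in enumerate(x)
    ]


# Two blocks of two elements each, with different scales and zero points.
print(dequantize_blocked([2, 4, 6, 8], [0.5, 2.0], [0, 4], block_size=2))
# [1.0, 2.0, 4.0, 8.0]
```

With `block_size` equal to the full length, this reduces to ordinary per-tensor dequantization with a single scale and zero point.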