Commit graph

11434 commits

Author SHA1 Message Date
Wanming Lin
8c2ee7b32e
[WebNN EP] Create MLGraphBuilder for every model builder (#21514)
Currently the WebNN spec only allows MLGraphBuilder.build() to be called
once, so we need to create a new builder for every subgraph in the WebNN EP.

Spec change: https://github.com/webmachinelearning/webnn/pull/717
2024-08-01 09:15:31 -07:00
dependabot[bot]
3b73ef2bf7
Bump torch from 1.13.1 to 2.2.0 in /tools/ci_build/github/windows/eager (#21505)
Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.1 to 2.2.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/pytorch/pytorch/releases">torch's
releases</a>.</em></p>
<blockquote>
<h2>PyTorch 2.2: FlashAttention-v2, AOTInductor</h2>
<h1>PyTorch 2.2 Release Notes</h1>
<ul>
<li>Highlights</li>
<li>Backwards Incompatible Changes</li>
<li>Deprecations</li>
<li>New Features</li>
<li>Improvements</li>
<li>Bug fixes</li>
<li>Performance</li>
<li>Documentation</li>
</ul>
<h1>Highlights</h1>
<p>We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2
offers ~2x performance improvements to
<code>scaled_dot_product_attention</code> via FlashAttention-v2
integration, as well as AOTInductor, a new ahead-of-time compilation and
deployment tool built for non-python server-side deployments.</p>
<p>This release also includes improved torch.compile support for
Optimizers, a number of new inductor optimizations, and a new logging
mechanism called TORCH_LOGS.</p>
<p><strong>Please note that we are <a
href="https://redirect.github.com/pytorch/pytorch/issues/114602">deprecating
macOS x86 support</a>, and PyTorch 2.2.x will be the last version that
supports macOS x64.</strong></p>
<p>Along with 2.2, we are also releasing a series of updates to the
PyTorch domain libraries. More details can be found in the library
updates blog.</p>
<p>This release is composed of 3,628 commits and 521 contributors since
PyTorch 2.1. We want to sincerely thank our dedicated community for your
contributions. As always, we encourage you to try these out and report
any issues as we improve 2.2. More information about how to get started
with the PyTorch 2-series can be found at our <a
href="https://pytorch.org/get-started/pytorch-2.0/">Getting Started</a>
page.</p>
<p>Summary:</p>
<ul>
<li><code>scaled_dot_product_attention</code> (SDPA) now supports
FlashAttention-2, yielding around 2x speedups compared to previous
versions.</li>
<li>PyTorch 2.2 introduces a new ahead-of-time extension of
TorchInductor called AOTInductor, designed to compile and deploy PyTorch
programs for non-python server-side.</li>
<li><code>torch.distributed</code> supports a new abstraction for
initializing and representing ProcessGroups called device_mesh.</li>
<li>PyTorch 2.2 ships a standardized, configurable logging mechanism
called TORCH_LOGS.</li>
<li>A number of torch.compile improvements are included in PyTorch 2.2,
including improved support for compiling Optimizers and improved
TorchInductor fusion and layout optimizations.</li>
<li>Please note that we are deprecating macOS x86 support, and PyTorch
2.2.x will be the last version that supports macOS x64.</li>
<li><code>torch.ao.quantization</code> now offers a prototype
<code>torch.export</code> based flow</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="8ac9b20d4b"><code>8ac9b20</code></a>
Run docker release build on final tag (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117131">#117131</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/117182">#117182</a>)</li>
<li><a
href="2490352430"><code>2490352</code></a>
Fix cuInit test on Windows (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117095">#117095</a>)</li>
<li><a
href="3a44bb713f"><code>3a44bb7</code></a>
[CI] Test that cuInit is not called during import (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117043">#117043</a>)</li>
<li><a
href="1c8ba3847d"><code>1c8ba38</code></a>
[CI] Use jemalloc for CUDA builds (<a
href="https://redirect.github.com/pytorch/pytorch/issues/116900">#116900</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116988">#116988</a>)</li>
<li><a
href="96d2ddbafe"><code>96d2ddb</code></a>
Store user model to simplify
ONNXProgram.{adapt_torch_*,__call__} APIs (<a
href="https://redirect.github.com/pytorch/pytorch/issues/1152">#1152</a>...</li>
<li><a
href="738b4a560a"><code>738b4a5</code></a>
Update ONNX's IO Adapter to support FakeTensor with ExportedProgram (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114407">#114407</a>)...</li>
<li><a
href="4cf10bf4dc"><code>4cf10bf</code></a>
[Cherry-pick] [Quant] [PT2] Enable batchnorm in
_move_exported_model_to_eval ...</li>
<li><a
href="7e97e4b4b6"><code>7e97e4b</code></a>
[AARCH64] Fall back to GEMM if mkldnn_matmul fails (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115936">#115936</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116666">#116666</a>)</li>
<li><a
href="1a3e3c7cff"><code>1a3e3c7</code></a>
[CUDA] baddmm should fall back to addmm for batch=1 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114992">#114992</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116518">#116518</a>)</li>
<li><a
href="ab7505f78c"><code>ab7505f</code></a>
Fix broken PyYAML 6.0 on MacOS x86 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115956">#115956</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116551">#116551</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/pytorch/pytorch/compare/v1.13.1...v2.2.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=torch&package-manager=pip&previous-version=1.13.1&new-version=2.2.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-08-01 04:28:43 -07:00
Changming Sun
25722bb9e3
Add CUDA custom op header files to Linux tarball (#21551)
### Description
The header files were added in PR #16454.
Then, recently, PR #21464 changed how we pack Linux tarballs, and the new
tarball is missing the custom op header files.
Therefore this change adds them back.


2024-08-01 04:23:02 -07:00
Adrian Lizarraga
4b8f6dcbb6
[QNN EP] Improve INT4 accuracy (#21582)
### Description
Masks off the top 4 bits of INT4 weights, improving accuracy.



### Motivation and Context
This is a workaround, as the QNN docs state that masking should not be required.
2024-07-31 21:05:11 -07:00
Jing Fang
8540ac4f78
Fix quant_format argument for 4bit quantizer (#21581)
### Description
The original argument accepts the enum values QuantFormat.QOperator or
QuantFormat.QDQ, but the default value was the bare name QOperator.

Change the argument to a str that accepts "QOperator" or "QDQ", and
convert it to QuantFormat after parsing.

### Motivation and Context
Bug fix
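A minimal sketch of the str-to-enum conversion described above. The `QuantFormat` enum here is a local stand-in for illustration, not the actual `onnxruntime.quantization` class:

```python
from enum import Enum

class QuantFormat(Enum):
    # stand-in for onnxruntime.quantization.QuantFormat
    QOperator = 0
    QDQ = 1

def parse_quant_format(value: str) -> QuantFormat:
    """Convert a CLI-friendly string ("QOperator" or "QDQ") to the enum."""
    try:
        return QuantFormat[value]  # enum lookup by member name
    except KeyError:
        raise ValueError(f"quant_format must be 'QOperator' or 'QDQ', got {value!r}")
```

With this, an argparse default can stay a plain string (`default="QOperator"`) and be converted once after parsing.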
2024-07-31 15:30:33 -07:00
Wanming Lin
a3883af7bf
[WebNN EP] Fixed bug in ConvTranspose (#21569)
The constraint for ConvTranspose was placed in the wrong place.
2024-07-31 14:39:21 -07:00
Tianlei Wu
c5f8389648
[CUDA] Fix MultiHeadAttention thread safe and bias support (#21498)
### Description

#### Issues Fixed
(1) **TRT cross attention not thread safe**. [Core changes like
this](6fd7aba3d4)
are used to make it thread-safe:
* Add a once_flag to CumulatedSequenceLengthCache to make sure it is
only initialized once, and make the cache read-only after initialization.
Previously, the content was not read-only, so it could be changed by
another thread and potentially cause a buffer overrun.
* The kernel initialization is not guarded (although the factory for
kernel loading has a static mutex to guard multiple threads), so the
mutable variable could be set by two different threads at the same time.
Add a once_flag to avoid that.

This also requires some workspace computation changes, so I did not
create a separate pull request.

(2) **Bias for cross attention**

That scenario assumes that only the query has bias, not the key and
value. However, this assumption was not verified at runtime, there was
no comment documenting it, and there was no test case, so support for
the scenario was disabled by mistake. The scenario is actually used in
the whisper model (TODO: we shall add whisper tests to the CI pipeline,
and also update the fusion script to verify such assumptions if needed).

The CUDA/CPU kernels support bias for cross attention as long as the
bias is zero for key and value. I updated the check to support the
scenario and added comments wherever this assumption is made.

(3) **Fallback support**

Previously, the unfused kernel did not support the packed qkv and packed
kv formats, so some cases could fail because there was no fallback. I
added new AddBiasTranspose cuda kernels for them to support fallback, so
that no supported case will fail.

#### Improvements

(4) **QKV workspace size**.

The logic for no_qkv_workspace could easily get out of sync because the
related code was scattered across different source files. I refactored
the code to move all related code into one file
(attention_prepare_qkv.cu) and added asserts, so that the logic stays in
sync.

(5) **Remove the confusing concept of passing past in kv**

parameters.pass_past_in_kv is confusing since the k/v in cross attention
is not past state. Remove it and use parameters.qkv_format ==
Q_K_V_BSNH_BNSH_BNSH instead.

The new code does not use past_key/past_value for cross attention, so
the logic is clearer.

(6) **More coverage, less workspace, and fewer transposes for flash and
efficient attention**
Previously, one condition prevented running flash or efficient
attention:
```
 bool past_no_bias = (pass_key_value_as_past || past_key != nullptr || present_key != nullptr) && bias == nullptr;
```
After this change, we can use flash and efficient attention for that
case, and also use less workspace.

For example, for cross attention with bias, the original code used two
additional workspaces:
```
  transpose: past_key (BxNxSxH) => temp_k_workspace (BxSxNxH), past_value (BxNxSxH_v) => temp_v_workspace (BxSxNxH_v)
  add bias: query => q,   temp_k_workspace => k,   temp_v_workspace => v
```

The new logic is:
```
   if (has bias)
      Add bias to query, key, value, and store in q, k, v workspace
   else
      Use query, key and value directly as q, k and v in kernel
```

We can see that we no longer need to allocate temp_k_workspace and
temp_v_workspace, so less memory is used. The new code also saves two
transposes in this case.

Flash and efficient attention support BSNH or BNSH formats for k and v.
In the old code, k/v were always converted to BSNH format, which is not
always necessary. I changed the code to convert k/v to BSNH or BNSH case
by case, so that more cases can be covered by flash or efficient
attention to improve performance.

(7) **Debugging support**
Previously, there was little debug info. This change adds a flag for
debug info to AttentionData, so that we can output debug info during
processing.

It also adds functions to consolidate the dumping of inputs, QKV
processing, and outputs, and an environment variable
`ORT_ENABLE_GPU_DUMP` to allow disabling dumping from the cuda kernel.

#### Summary of changes
(1) Refactor CheckInputs and pass in the operator type.
(2) Refactor PrepareQKV to support fallback for packed qkv or packed kv
inputs.
(3) Change a few cases in PrepareQKV to allow more cases to be covered
by flash and efficient attention.
(4) Use parameters.qkv_format == Q_K_V_BSNH_BNSH_BNSH to replace
parameters.pass_past_in_kv.
(5) Allow bias input for Q_K_V_BSNH_BNSH_BNSH, and add comments about
the assumption that key/value have no bias in this case.
(6) Fix a thread-safety issue in CumulatedSequenceLengthCache handling.
(7) Add test cases to cover all supported scenarios.

Current support scenarios for MultiHeadAttention for CUDA/CPU:

| Q | K | V | pastK | pastV | presentK | presentV | Bias | Op desc |
| ---- | ---- | ---- | ------ | ----- | --------- | -------- | ----- | --------- |
| BSNH | BLNH | BLNH | - | - | - | - | QKV | not packed |
| BLN3H | - | - | - | - | - | - | QKV | qkv packed <br> not supported on CPU |
| BSNH | BLN2H | - | - | - | - | - | --- | kv packed <br> not supported on CPU |
| BSNH | BNLH | BNLH | - | - | - | - | Q-- | cross attention <br> bias for Q only |
| BSNH | BLNH | BLNH | - | - | BNTH | BNTH | QKV | no past <br> only present |
| BSNH | BLNH | BLNH | BNPH | BNPH | BNTH | BNTH | QKV | past and present <br> (not shared buffer) |
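As a rough illustration of the layout letters above (B = batch, N = number of heads, H = head size, and S/L/T/P are sequence lengths), converting between the BSNH and BNSH layouts is just a transpose of the sequence and head axes. A minimal pure-Python sketch (mine, for illustration only):

```python
def bsnh_to_bnsh(x):
    """Transpose a nested-list tensor from [B][S][N][H] to [B][N][S][H]."""
    B, S, N = len(x), len(x[0]), len(x[0][0])
    return [[[x[b][s][n] for s in range(S)] for n in range(N)] for b in range(B)]

# 1 batch, 2 sequence positions, 2 heads, head size 1:
x = [[[[1], [2]], [[3], [4]]]]
assert bsnh_to_bnsh(x) == [[[[1], [3]], [[2], [4]]]]
```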

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/18854
2024-07-31 09:01:05 -07:00
Sheil Kumar
b341c44c20
Fix ETW trace logging crash in multithreading situations (#21566)
### Description
The ETW trace logger appears registered before registration is actually
done, because initialized_ is marked true too early; this caused a crash
in the Lenovo camera application.

A prior attempt to address was made here:
https://github.com/microsoft/onnxruntime/pull/21226
It was reverted here:
https://github.com/microsoft/onnxruntime/pull/21360

### Motivation and Context
The problem is that during initialization of TraceLoggingRegisterEx, it
will reinvoke the callback and attempt reinitialization, which is not
allowed. TraceLoggingRegisterEx however can be initialized concurrently
when initialization happens on multiple threads. For these reasons it
needs to be protected by a lock, but the lock cannot naively block
because the callback's reinvocation will cause a deadlock.

To solve this, another tracking variable, "initializing", is added; it
protects against reinitialization during the first initialization.
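A hypothetical Python sketch of the guard described above (names are illustrative, not the actual ORT C++ code): the "initializing" flag lets the reentrant callback invocation return early instead of deadlocking on the lock.

```python
import threading

class ReentrantSafeInit:
    """Illustrative model of the fix: init may be re-entered by its own callback."""
    def __init__(self):
        self._lock = threading.Lock()
        self._initialized = False
        self._initializing = False

    def _register(self):
        # Stand-in for TraceLoggingRegisterEx, which re-invokes the callback
        # (and therefore this init path) from inside registration.
        self.ensure_initialized()

    def ensure_initialized(self):
        if self._initialized or self._initializing:
            return  # reentrant call from the callback bails out here: no deadlock
        with self._lock:  # serializes concurrent first-time initializers
            if self._initialized or self._initializing:
                return
            self._initializing = True
            self._register()           # reentry is safe: _initializing is set
            self._initialized = True   # only marked true after registration is done
            self._initializing = False
```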

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-07-31 08:59:55 -07:00
Wanming Lin
1d4b161145
[WebNN EP] Support ConvTranspose for TFLite backend (#21291)
### Description
Chromium supports ConvTranspose for TFLite in
https://chromium-review.googlesource.com/c/chromium/src/+/5635194

The constraint is that only default dilations and groups are supported.

---------

Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
2024-07-30 17:46:08 -07:00
Jing Fang
e7aa11607f
Utilize ext data location to reduce qd matmul memory usage (#21451)
### Description

When the graph is quantized to qdq format, the DQ + MatMul is
transformed to MatMulNBits in the level 2 optimizer when the model is
initialized in an inference session.

In the transformation step, tensors are transposed and new tensor protos
are created. Instead of using protobuf arena-allocated memory, the PR
sets the tensor proto to use an external buffer and points the external
location to the memory that contains the tensor buffer allocated by the
CPU.

Then, in the step that creates an OrtValue from the tensor proto, the
memory buffers in the tensor proto are directly assigned to the tensors,
which were originally allocated by the ORT arena.

With these two steps, the peak memory usage of a QDQ format model is the
same as that of a QOperator model. Besides, the model initialization
time is significantly reduced. Take
[Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
for example:

| | QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original code) | QDQ Model (this PR) |
|---|---|---|---|
| peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| initialization time | 3 sec | 9 sec | 5 sec |

### Motivation and Context

When the graph is quantized to qdq format, the DQ + MatMul is converted
to MatMulNBits in the level 2 optimizer.

Originally, the newly created tensor protos used memory allocated by the
protobuf arena. This memory cannot be fully released when the tensor
protos are deleted.
Then, in the tensor-proto-to-OrtValue step, tensors are created using
the ORT arena. Later, in the pre-pack step for MatMulNBits, new
OrtValues are created. The tensors in the ORT arena are not fully
released either.

The two arena memory allocation steps in the DQ + MatMul -> MatMulNBits
transformation result in almost 2x memory consumption during model
initialization.
2024-07-30 15:22:46 -07:00
Sumit Agarwal
1637f22d39
Extend Pad Fusion for AveragePool (#21556)
### Description
This extends the existing pad_fusion to the AveragePool operator, i.e.,
Pad is fused if it is followed by an AveragePool operator.
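A hypothetical sketch of the arithmetic behind such a fusion (the function name and layout assumptions are mine, not the actual ORT optimizer code): the Pad node's spatial padding is folded into the pool's `pads` attribute. This is only valid when the padding semantics match (e.g. constant zero padding that the pool accounts for the same way).

```python
def fuse_pad_into_pool(pad_pads, pool_pads, num_spatial_dims):
    """Fold a preceding Pad's spatial padding into a pool's `pads` attribute.

    pad_pads: ONNX-style Pad pads over all dims, [begins..., ends...]
              (rank = num_spatial_dims + 2 for the N and C dims).
    pool_pads: pool pads over spatial dims only, [begins..., ends...].
    """
    rank = len(pad_pads) // 2
    begins = pad_pads[rank - num_spatial_dims:rank]   # skip N, C dims
    ends = pad_pads[2 * rank - num_spatial_dims:]
    return [p + q for p, q in zip(begins + ends, pool_pads)]

# NCHW: Pad adds 1 on each side of H and 2 on each side of W:
assert fuse_pad_into_pool([0, 0, 1, 2, 0, 0, 1, 2], [0, 0, 0, 0], 2) == [1, 2, 1, 2]
```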



2024-07-30 09:35:45 -07:00
Yi-Hong Lyu
530a2d7b41
Enable FP16 Clip and Handle Bias in FP16 Depthwise Conv (#21493)
- Improved accuracy for face-detection, image-classification, and
object-detection in the GeekBench ML benchmark on ARM64.
- Fixed issue https://github.com/microsoft/onnxruntime/issues/18992
2024-07-30 03:49:14 -07:00
Changming Sun
82036b0497
Remove references to the outdated CUDA EP factory method (#21549)
The function "OrtSessionOptionsAppendExecutionProvider_CUDA" is
deprecated.
2024-07-29 21:59:16 -07:00
vraspar
07d3be5b0e
CoreML: Add ML Program Split Op (#21456)
### Description

Add support for Split Op


### Motivation and Context
Address operator gaps in high priority model.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-07-30 14:04:47 +10:00
Yifan Li
5d78b9a17b
[TensorRT EP] Update TRT OSS Parser to 10.2 (#21552)
### Description
Update TRT OSS Parser to [latest 10.2-GA
branch](f161f95883)


2024-07-29 17:27:38 -07:00
mcollinswisc
8417c325ec
Keep QDQ nodes w/ nonpositive scale around MaxPool (#21182)
### Description
This change adds a check for whether the scale in the QuantizeLinear (or
DequantizeLinear) is a positive scalar, and a new selector to disallow
removing the QDQ around MaxPool if it is not.

### Motivation and Context
Currently, the DropQDQNodesRules optimization removes QuantizeLinear and
DequantizeLinear nodes from DequantizeLinear ∘ MaxPool ∘ QuantizeLinear.
However, if the x_scale/y_scale values are non-positive, the
(de-)quantization changes the ordering of the elements in the input
value, so this optimization is changing the results.


https://github.com/microsoft/onnxruntime/issues/21176
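A small numeric sketch (mine, not the ORT test code) of why the QDQ pair cannot be dropped when the scale is negative: quantization with a negative scale reverses element order, so a MaxPool picks a different element before vs. after dequantization.

```python
def quantize(x, scale, zero_point):
    return round(x / scale) + zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

scale, zp = -0.5, 0
xs = [1.0, 3.0]
qs = [quantize(x, scale, zp) for x in xs]   # [-2, -6]: order is reversed

# DequantizeLinear -> MaxPool -> QuantizeLinear (the original graph):
keep_qdq = quantize(max(dequantize(q, scale, zp) for q in qs), scale, zp)
# MaxPool directly on quantized values (what dropping the QDQ pair computes):
drop_qdq = max(qs)

assert keep_qdq == -6 and drop_qdq == -2   # different results when scale < 0
```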

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-07-30 09:06:51 +10:00
Sophie Schoenmeyer
d98581495f
Update labeling bot (#21548)
The current labeling bot over-applies many of the labels (e.g., ep:CUDA
and platform:windows) and misses some of the APIs + EPs.

We are working on migrating this workflow to GitHub policies, but would
like to use this fix in the meantime to avoid causing any issues with
ORT 1.19.

2024-07-29 16:06:03 -07:00
Adam Reeve
7543dd040b
Propagate NaNs in the CPU min and max operators (#21492)
### Description

Propagates NaN values in the min and max operators so that min or max
with a NaN in either input always produces NaN.

### Motivation and Context

Fixes #21455
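A minimal sketch of the NaN-propagating semantics described above (pure Python, matching the IEEE-754 propagating maximum/minimum rather than the NaN-ignoring fmax/fmin variants):

```python
import math

def min_propagate_nan(a, b):
    """Min that returns NaN if either input is NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a < b else b

def max_propagate_nan(a, b):
    """Max that returns NaN if either input is NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a > b else b

# For contrast, Python's built-in min() is order-dependent with NaN, the kind
# of inconsistency the fix avoids:
# min(float('nan'), 1.0) -> nan, but min(1.0, float('nan')) -> 1.0
```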
2024-07-30 08:50:13 +10:00
Preetha Veeramalai
c39f1c4fd8
ORT- OVEP 1.19 PR-follow up (#21546)
### Description
Follow up PR for bug fixes on 1.19


### Motivation and Context

- Handles 1.19 docker file fixes.
- Sets the default file naming of the epctx onnx model to use the
_ctx.onnx suffix.
- Creates the epctx model directories if they don't exist.

---------

Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
2024-07-29 14:12:36 -07:00
Yulong Wang
b03c9496aa
[js/web] allow load WebAssembly binary from buffer (#21534)
### Description

This PR adds a new option, `ort.env.wasm.wasmBinary`, which allows users
to set a buffer containing the preloaded .wasm file content.

This PR should resolve the problem from latest discussion in #20876.
2024-07-29 13:39:38 -07:00
Xu Xing
0d7cf301a1
[js/webgpu] Add activation Tanh (#21540)
Bug:https://github.com/microsoft/onnxruntime/issues/21467

2024-07-29 11:05:34 -07:00
Jian Chen
79537d0523
Remove tools/ci_build/github/android/run_nnapi_code_coverage.sh (#21371)
### Description
Remove tools/ci_build/github/android/run_nnapi_code_coverage.sh

### Motivation and Context
This file is no longer needed
2024-07-29 10:00:52 -07:00
Jian Chen
bc3713206d
Update QNN pipeline pool (#21482)
### Description
Update QNN pipeline pool 



### Motivation and Context
Ensure all our pipelines use the latest NDK version.
2024-07-29 10:00:21 -07:00
Yi Zhang
05cef469e8
Move on-device training packages publish step (#21539)
### Description
Since the on-device training CPU packaging has been split into a
separate pipeline, its nuget package publishing step must be moved as
well.

### Motivation and Context
Fixes the exception in Nuget Publishing Packaging Pipeline caused by
#21485
2024-07-29 09:59:46 -07:00
mingyueliuh
d8888136e3
Add support tensor element type for register custom op shape infer function (#21387)
### Description
Extends the functionality of the SetOutputShape method in custom op shape inference.


### Motivation and Context
- **SetOutputShape** interface enhancement. The shape inference function actually needs to set both the tensor type and the shape. Add a parameter, **type**, to allow users to specify the tensor type, with **ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT** as the default value to ensure compatibility.

Co-authored-by: mingyue <mingyue@amd.com>
2024-07-29 09:45:52 -07:00
Wanming Lin
94eb70d983
[WebNN EP] Add labels for all WebNN operators (#21516)
In order to provide more diagnosable error messages for developers.

Spec change: https://github.com/webmachinelearning/webnn/pull/742
2024-07-29 08:50:14 -07:00
Xu Xing
5bc12bf209
[js/webgpu] Add activation for conv3d naive (#21466)
2024-07-29 08:47:41 -07:00
Yulong Wang
dbff0cd098
[js/node] enable float16 support for Node.js binding (#20581)
### Description
Enable float16 support for the Node.js binding.

The data of a float16 tensor uses `Uint16Array`.
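For illustration (a sketch using Python's stdlib rather than the JS binding), the same bit-level view of float16 data can be reproduced with `struct`'s binary16 format code `'e'`:

```python
import struct

def float16_bits(value):
    """Return the uint16 bit pattern of a float rounded to IEEE-754 binary16."""
    (bits,) = struct.unpack('<H', struct.pack('<e', value))
    return bits

def bits_to_float16(bits):
    """Reinterpret a uint16 bit pattern as the binary16 value it encodes."""
    (value,) = struct.unpack('<e', struct.pack('<H', bits))
    return value

assert float16_bits(1.0) == 0x3C00   # binary16 encoding of 1.0
assert bits_to_float16(0x3C00) == 1.0
```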
2024-07-28 13:03:17 -07:00
liqun Fu
a4d3a1ce0c
pick changes from https://github.com/onnx/onnx/pull/6195 to fix heap-buffer-overflow in onnx::convPoolShapeInference (#21507)
### Description
onnx 1.16.2 was not available before the ort 1.19.0 code freeze, so the
needed change is picked as a patch.
2024-07-27 15:58:36 -07:00
Jian Chen
7e23212de9
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml (#21529)
### Description
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml


### Motivation and Context
This CI pipeline has been divided into 4 different pipelines.
2024-07-27 15:58:12 -07:00
Ranjit Ranjan
82b2955268
[AIX]test failure fix using gtest-1.15.0 for AIX (#21497)
### Description
The local CI setup for AIX reported test failures after the gtest 1.15.0
upgrade.

### Motivation and Context
The following test failures were observed after the gtest upgrade:

The following tests FAILED:
	  1 - onnxruntime_test_all (ILLEGAL)
	  7 - onnxruntime_logging_apis_test (Subprocess aborted)

To fix this, I am enabling pthread support in gtest, which was disabled
with the previous version of gtest for some reason. With it enabled, the
above tests pass with gtest 1.15.0.
2024-07-27 11:17:22 -07:00
jingyanwangms
48fb8a7e56
Security fuzz address sanitizer fix Bug #2 and #3 (#21528)
### Description
Security fuzz testing with address sanitizer found several bugs.
2024-07-27 11:10:52 -07:00
dependabot[bot]
1ce160883f
Bump Sixlabors.ImageSharp from 2.1.8 to 2.1.9 in /csharp/sample/Microsoft.ML.OnnxRuntime.ResNet50v2Sample (#21444)
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.8 to 2.1.9.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.9</h2>
<h2>What's Changed</h2>
<ul>
<li>[2.1] Fix overflow in MemoryAllocator.Create(options) by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2732">SixLabors/ImageSharp#2732</a></li>
<li>Backport GIF LZW fix to 2.1 by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2756">SixLabors/ImageSharp#2756</a></li>
<li>Backport 2759 to 2.1.x by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2770">SixLabors/ImageSharp#2770</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9">https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="9816ca4501"><code>9816ca4</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2770">#2770</a>
from SixLabors/af/backport-2759-2.1.x</li>
<li><a
href="b33d666ab7"><code>b33d666</code></a>
handle DecodingMode</li>
<li><a
href="6b2030b549"><code>6b2030b</code></a>
Merge branch 'release/2.1.x' into af/backport-2759-2.1.x</li>
<li><a
href="8ffad3f480"><code>8ffad3f</code></a>
Issue2012BadMinCode should decode now</li>
<li><a
href="1f5bf23b9e"><code>1f5bf23</code></a>
skip Issue2758_DecodeWorks</li>
<li><a
href="3bf8c572a0"><code>3bf8c57</code></a>
manual port of 3.1 gif decoder</li>
<li><a
href="28c20ded87"><code>28c20de</code></a>
Clamp JPEG quality estimation results.</li>
<li><a
href="4b910e7f84"><code>4b910e7</code></a>
Decode LZW row by row</li>
<li><a
href="a1f2879771"><code>a1f2879</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2756">#2756</a>
from SixLabors/af/git-av-2.1</li>
<li><a
href="898df7f8ca"><code>898df7f</code></a>
backport <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2749">#2749</a>
to 2.1</li>
<li>Additional commits viewable in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Sixlabors.ImageSharp&package-manager=nuget&previous-version=2.1.8&new-version=2.1.9)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-26 22:31:16 -07:00
maggie1059
10b4a3b90b
Fix conda failure for onnxruntime-directml (#21526)
The change in #21005 works for directly building wheels with `build.py`,
but ort-nightly-directml wheels, as well as the 1.18.1 release of the
onnxruntime-directml python wheel, still do not work with conda since
they're built from the `py-win-gpu.yml` pipeline, which uses
`install_third_party_deps.ps1` to set compile flags.
2024-07-26 22:26:38 -07:00
Yueqing Zhang
d01fc75ef1
[VitisAI] support vaip create ep context nodes & bug fix (#21506)
### Description
1. We decided to move the context node creation back to our own repo because it is more flexible to modify there.
2. We found a bug related to the context node: it would change the inference order. So we fixed it in this PR as well.


### Motivation and Context
This is crucial for the Microsoft release next month.

---------

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2024-07-26 22:15:57 -07:00
zz002
690d745cbf
[VitisAI] 1. KernelDef supports StartVersion and EndVersion (#21519)
### Description

[VitisAI] 1. KernelDef supports StartVersion and EndVersion
2. CapabilityOps checks domain


Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-07-26 20:28:55 -07:00
Scott McKay
5af423c7c0
Set version and other info in the C# dll (#21517)
### Description
Set version and other info in the Microsoft.ML.OnnxRuntime C# dll by
setting GenerateAssemblyInfo to true and passing in the ORT version in
the CI.

Minor re-org of the order of properties so related things are grouped a
little better.

### Motivation and Context
#21475
2024-07-27 13:22:57 +10:00
Tianlei Wu
64819f6f8c
Update benchmark_mha.py to compare with PyTorch SDPA (#21449)
### Description
* Update benchmark_mha.py to compare with the PyTorch SDPA API.
* Write results to a csv file.
* Use the sdpa_kernel CUDA provider option instead of environment
variables for better control.
* Add arguments (`--use_gpu`, `--causal`, etc.) to allow testing
different scenarios.
* Update benchmark_mha.sh to add CPU benchmarks.

For the Q,K,V format, torch uses the BNSH layout while ORT uses BSNH,
so the comparison is not apples-to-apples. However, if the latency
difference is large, that could be a warning sign.

#### Example GPU results

Example results on A100-SXM4-80GB with settings (use_gpu=TRUE,
enable_cuda_graph=FALSE, causal=FALSE, past_sequence_length=0,
intra_op_num_threads=0) in Azure Linux. ORT: build from source with CUDA
12.5; PyTorch 2.3.1 for cuda 12.1.

format | batch_size | sequence_length | num_heads | head_size | latency
(s) | tflops | kernel
-- | -- | -- | -- | -- | -- | -- | --
Q,KV | 4 | 2048 | 32 | 128 | 0.0015 | 179.5 | ort:flash
Q,KV | 4 | 2048 | 32 | 128 | 0.0015 | 179.0 | ort:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0016 | 170.0 | ort:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0016 | 169.5 | ort:flash
QKV | 4 | 2048 | 32 | 128 | 0.0016 | 168.5 | ort:default
QKV | 4 | 2048 | 32 | 128 | 0.0016 | 167.4 | ort:flash
Q,K,V | 4 | 2048 | 32 | 128 | 0.0017 | 159.4 | torch:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0018 | 155.0 | torch:flash
Q,KV | 4 | 2048 | 32 | 128 | 0.0030 | 92.7 | ort:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0030 | 90.9 | ort:efficient
QKV | 4 | 2048 | 32 | 128 | 0.0031 | 89.9 | ort:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0031 | 89.0 | torch:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0054 | 51.3 | torch:math
Q,KV | 4 | 4096 | 32 | 128 | 0.0058 | 191.0 | ort:default
Q,KV | 4 | 4096 | 32 | 128 | 0.0058 | 190.6 | ort:flash
Q,K,V | 4 | 4096 | 32 | 128 | 0.0059 | 187.8 | ort:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0059 | 186.7 | ort:flash
QKV | 4 | 4096 | 32 | 128 | 0.0059 | 185.9 | ort:flash
QKV | 4 | 4096 | 32 | 128 | 0.0059 | 185.8 | ort:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0067 | 163.4 | torch:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0070 | 157.2 | torch:flash
Q,KV | 4 | 4096 | 32 | 128 | 0.0113 | 97.6 | ort:efficient
Q,K,V | 4 | 4096 | 32 | 128 | 0.0114 | 96.4 | ort:efficient
QKV | 4 | 4096 | 32 | 128 | 0.0114 | 96.2 | ort:efficient
Q,K,V | 4 | 4096 | 32 | 128 | 0.0127 | 86.3 | torch:efficient
Q,KV | 8 | 2048 | 32 | 128 | 0.0031 | 177.8 | ort:flash
Q,KV | 8 | 2048 | 32 | 128 | 0.0031 | 177.7 | ort:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0032 | 170.8 | ort:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0032 | 170.3 | ort:flash
QKV | 8 | 2048 | 32 | 128 | 0.0032 | 169.2 | ort:default
QKV | 8 | 2048 | 32 | 128 | 0.0033 | 169.0 | ort:flash
Q,K,V | 8 | 2048 | 32 | 128 | 0.0034 | 161.9 | torch:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0036 | 152.9 | torch:flash
Q,KV | 8 | 2048 | 32 | 128 | 0.0059 | 93.5 | ort:efficient
Q,K,V | 8 | 2048 | 32 | 128 | 0.0060 | 91.3 | ort:efficient
QKV | 8 | 2048 | 32 | 128 | 0.0060 | 91.0 | ort:efficient
Q,K,V | 8 | 2048 | 32 | 128 | 0.0064 | 86.0 | torch:efficient
Q,KV | 8 | 4096 | 32 | 128 | 0.0115 | 190.8 | ort:flash
Q,KV | 8 | 4096 | 32 | 128 | 0.0115 | 190.7 | ort:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0118 | 187.1 | ort:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0118 | 187.0 | ort:flash
QKV | 8 | 4096 | 32 | 128 | 0.0118 | 185.6 | ort:default
QKV | 8 | 4096 | 32 | 128 | 0.0118 | 185.6 | ort:flash
Q,K,V | 8 | 4096 | 32 | 128 | 0.0139 | 158.7 | torch:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0139 | 158.3 | torch:flash
Q,KV | 8 | 4096 | 32 | 128 | 0.0225 | 97.7 | ort:efficient
Q,K,V | 8 | 4096 | 32 | 128 | 0.0227 | 96.8 | ort:efficient
QKV | 8 | 4096 | 32 | 128 | 0.0228 | 96.3 | ort:efficient
Q,K,V | 8 | 4096 | 32 | 128 | 0.0260 | 84.5 | torch:efficient

#### Example CPU results

Dell XPS 8960 with i9-13900 CPU (use_gpu=FALSE, causal=FALSE,
past_sequence_length=0) in Windows. ORT: build from source with CUDA
12.5; PyTorch 2.3.1 for cuda 12.1.

format | causal | batch_size | seq_len | num_heads | head_size | threads
| latency (s) | kernel
-- | -- | -- | -- | -- | -- | -- | -- | --
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 8 | 0.0005 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 0 | 0.0009 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 0 | 0.0009 | ort:math
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 4 | 0.0009 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 2 | 0.0014 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 1 | 0.0025 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 2 | 0.0045 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 24 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 8 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 4 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 1 | 0.0047 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 0 | 0.0019 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 8 | 0.0019 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 0 | 0.0022 | ort:math
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 4 | 0.0030 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 2 | 0.0047 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 1 | 0.0086 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 2 | 0.0161 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 4 | 0.0162 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 8 | 0.0162 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 24 | 0.0165 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 1 | 0.0166 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 8 | 0.0077 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 0 | 0.0091 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 0 | 0.0099 | ort:math
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 4 | 0.0103 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 2 | 0.0177 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 1 | 0.0328 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 2 | 0.0624 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 4 | 0.0624 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 8 | 0.0625 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 24 | 0.0626 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 1 | 0.0640 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 8 | 0.0286 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 0 | 0.0317 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 4 | 0.0367 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 0 | 0.0391 | ort:math
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 2 | 0.0656 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 1 | 0.1235 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 24 | 0.2482 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 2 | 0.2483 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 4 | 0.2483 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 8 | 0.2486 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 1 | 0.2538 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 0 | 0.1038 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 8 | 0.1050 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 0 | 0.1368 | ort:math
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 4 | 0.1535 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 2 | 0.2461 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 1 | 0.4724 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 8 | 0.9835 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 4 | 0.9841 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 24 | 0.9841 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 2 | 0.9873 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 1 | 0.9985 | torch:default


### Motivation and Context
To compare with PyTorch SDPA on CPU and CUDA latency.
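For reference, the computation both sets of kernels implement is scaled dot-product attention. Below is a minimal NumPy sketch in the BNSH layout (batch, num_heads, seq_len, head_size) that PyTorch SDPA uses; this is illustrative only, not the benchmarked code, and it omits masking and the causal variant.

```python
import numpy as np

def sdpa_reference(q, k, v):
    """Reference scaled dot-product attention (no mask, not causal).

    Inputs are in BNSH layout: (batch, num_heads, seq_len, head_size),
    the layout PyTorch SDPA uses; ORT's MultiHeadAttention uses BSNH.
    """
    d = q.shape[-1]
    # attention scores: Q K^T / sqrt(head_size)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)    # softmax over keys
    return probs @ v
```

The flash and efficient kernels in the table compute the same result without materializing the full `probs` matrix, which is where the throughput differences come from.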
2024-07-26 18:45:14 -07:00
Hector Li
fb61e14153
Add QNN EP option context_node_name_prefix to set EPContext node name prefix (#21236)
### Description
Add QNN EP option context_node_name_prefix to set EPContext node name prefix

### Motivation and Context
To work around the QNN context PD memory limit, users need to split the model into pieces and generate the QNN context models separately. The EPContext nodes generated in the separate graphs can end up with the same node name, which causes issues when gluing those EPContext nodes together into a single model.
To avoid this, users can set context_node_name_prefix for each split piece to make the node names unique.
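A sketch of how a distinct prefix might be assigned per split piece via provider options (the option name is from this PR; the piece names and the `backend_path` value are hypothetical, and the resulting list would be passed as `providers` to `onnxruntime.InferenceSession`):

```python
def qnn_providers_for_piece(piece_name):
    """Build a providers list for one split model piece, giving its
    EPContext nodes a unique name prefix via context_node_name_prefix.
    Sketch only; backend_path and piece names are placeholders."""
    qnn_options = {
        "backend_path": "QnnHtp.dll",
        "context_node_name_prefix": f"{piece_name}_",
    }
    return [("QNNExecutionProvider", qnn_options)]

# each split piece gets a distinct prefix, so the generated EPContext
# node names stay unique when the pieces are glued back together
providers_by_piece = {
    name: qnn_providers_for_piece(name) for name in ("piece0", "piece1")
}
```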
2024-07-26 16:56:44 -07:00
Jian Chen
7db7c4e5c8
Separating all GPU stages into different Pipelines (#21521)
### Description
Separating all GPU stages into different Pipelines
2024-07-26 14:54:45 -07:00
Justin Chu
bbbaef3fa6
Update text formatting in generate_cgmanifest.py (#21489)
In the only place I fixed manually, I had forgotten a format string.
2024-07-26 08:46:54 -07:00
Prathik Rao
278f0f5cd2
disables qnn in ort training cpu pipeline (#21510)
### Description

`enable_windows_arm64_qnn` and `enable_windows_x64_qnn` are true by
default but unnecessary for training. This change explicitly sets these
parameters to false for the training pipeline.

### Motivation and Context

ORT 1.19 Release Preparation
2024-07-26 17:23:35 +08:00
Wanming Lin
b6b29309a5
[WebNN EP] Update argMax/argMin to adapt to latest spec (#21452)
The WebNN spec recently changed the definition of argMax/argMin:
- Removed the selectLastIndex option, letting backends decide whether to
select the last index.
- Moved the axes option to the axis input.
2024-07-25 17:07:01 -07:00
aamajumder
166809425e
[DML EP] Register ReduceMin-20 (#20477)
### Description
This PR registers the ReduceMin-20 operator to the DML EP.


2024-07-25 17:06:30 -07:00
Scott McKay
e5302b23c4
Fix SkipLayerNormFusion incorrectly setting modified every time it runs (#21502)
### Description
Current behavior forces all L2 optimizers to loop until they hit the max
number of iterations.

Only update modified if the graph was modified.

### Motivation and Context
Fix unnecessary loops of L2 optimizers during model loading.
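The cost of the bug follows from how the transformer loop works: L2 optimizers are re-applied until no pass reports a change. A simplified sketch of that loop (not ORT's actual code) shows why a pass that always reports `modified=True` forces the maximum number of iterations:

```python
def apply_transformers(graph, transformers, max_iterations=10):
    """Apply graph transformers repeatedly until a fixed point.

    If a transformer reports modified=True even when it changed
    nothing, the fixed point is never reached and the loop always
    runs the full max_iterations passes. Returns the pass count.
    """
    iterations = 0
    for _ in range(max_iterations):
        iterations += 1
        modified = False
        for transform in transformers:
            modified |= transform(graph)
        if not modified:
            break  # fixed point reached; stop early
    return iterations
```

A well-behaved fusion pass returns `False` once there is nothing left to fuse, so the loop exits after one confirming pass.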
2024-07-26 10:00:28 +10:00
Justin Chu
c464ab3aca
Allow cpplint to always be green (#21491)
Allow cpplint to always be green since it is optional. Also changed the
workflow name to reflect that.
2024-07-25 15:57:30 -07:00
Scott McKay
b0e1f7f798
CoreML: Aggregated changes to add all required ops for priority model (#21472)
### Description
Add these changes to one PR to simplify checkin
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)


Other changes:
- Updated partitioning utils to support dropping constant initializers
from a ComputeCapability's inputs.
- Noticed that the list of inputs to the CoreML model was unexpectedly
long because of this.
- We copy constant initializers into the CoreML model, so we don't need
the originals; if they remain as inputs, ORT can't free them because
they appear to be in use.

2024-07-26 08:29:33 +10:00
Scott McKay
3cdf4b917b
Fix Android CI Pipeline code coverage failure (#21504)
### Description
Current failure is due to a version mismatch.

Use llvm-cov from the Android NDK instead of the system gcov so that the
version is correct.

Also comment out publishing to the Azure dashboard to simplify the
setup. The CI prints out the stats for review by developers.

### Motivation and Context
Fix CI pipeline
2024-07-26 07:36:23 +10:00
Hector Li
c23517859e
Qnn batchnorm support input with rank 2 (#21469)
### Description
QNN BatchNorm now supports input with rank 2.
Updated the quantization script to quantize the BatchNorm bias using int32.

---------

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
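Quantizing a bias to int32 conventionally uses scale = input_scale * weight_scale with zero point 0, so the quantized bias adds directly into the int32 accumulator. A minimal pure-Python sketch of that convention (illustrative only, not the quantization script's actual code):

```python
def quantize_bias_int32(bias, input_scale, weight_scale):
    """Quantize a float bias to int32 with
    scale = input_scale * weight_scale and zero point 0
    (the usual QDQ convention for conv/batchnorm bias)."""
    bias_scale = input_scale * weight_scale
    int32_min, int32_max = -(2**31), 2**31 - 1
    quantized = [
        max(int32_min, min(int32_max, round(b / bias_scale))) for b in bias
    ]
    return quantized, bias_scale

def dequantize(quantized, bias_scale):
    """Recover approximate float values from the int32 bias."""
    return [q * bias_scale for q in quantized]
```

With int32 (rather than int8) the rounding error per element is at most half of `bias_scale`, which is why bias tensors are quantized at this wider width.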
2024-07-25 11:44:10 -07:00
Changming Sun
4167b68abf
Split ondevice training cpu packaging pipeline to a separated pipeline (#21485)
### Description
Right now our "Zip-Nuget-Java-Nodejs Packaging Pipeline" is too big.
This OnDevice training part is independent of the others, so it can be
split out. Then our NPM packaging pipeline will not depend on this
training stuff.

### Motivation and Context
Similar to #21235 

Also, this PR fixed a problem: the "NuGet_Test_Linux_Training_CPU" job
downloads artifacts from "onnxruntime-linux-x64" to get the custom-op
shared libs, but forgot to declare that it depends on
"Linux_C_API_Packaging_CPU_x64", which produces that artifact. Such
problems can be hard to find when a pipeline grows big.
2024-07-25 10:58:34 -07:00