Commit graph

10933 commits

Author SHA1 Message Date
Hector Li
55e0aaeeef
fix android build issue (#20389)
fix android build issue
2024-04-19 14:21:34 -07:00
Rachel Guo
f396748ed6
Nuget .NET changes for Mac Catalyst (#19923)
### Description

Add Nuget package changes for adding new 'net6.0-maccatalyst' platform.

The output ORT Nuget package was manually tested and verified in a .NET
MAUI app setup.

### Motivation and Context

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2024-04-19 14:20:03 -07:00
Guenther Schmuelling
497a627a69
fix fp16 for skiplayernorm (#20381)
2024-04-19 12:12:02 -07:00
Dmitri Smirnov
42b700d463
Eliminate stray vector and the contention it creates (#20377)
### Description
Unused vector allocating large memory chunk within a concurrent routine
creates heap contention and is eliminated.

### Motivation and Context
This partially addresses
https://github.com/microsoft/onnxruntime/issues/20373.
2024-04-19 10:27:42 -07:00
Patrice Vignola
4d98f06f93
[DML EP] Add GroupQueryAttention (#20327)
2024-04-19 10:25:29 -07:00
Wanming Lin
7c80c39f74
[WebNN EP] WebNN CPU backend only support up to 4 Split outputs (#20350)
2024-04-19 08:31:22 -07:00
sfatimar
4d1963c2a2
OpenVINO EP Rel 1.18 Changes (#20337)
### Description
These changes include:
- Support for OpenVINO 2024.1
- Import precompiled blobs with EPContext blob
- Separate Device/Precision as input
- Deprecate the CPU_FP32, GPU_FP32 terminology; introduce CPU, GPU
- AUTO GPU, CPU will only create a GPU blob and not a CPU blob



### Motivation and Context
- OpenVINO 2024.1 will be out soon
- Import Precompiled Blob can greatly reduce FEIL/FIL Time. 
- Separating Device/Precision will make the input cleaner

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2024-04-19 00:31:38 -07:00
Yueqing Zhang
9001c69b84
[VitisAI] Add Version Check. Requested by Microsoft (#20347)
### Description
Add a version to onnxruntime_providers_vitisai.dll so that
onnxruntime_vitisai_ep.dll can check whether the version is compatible.
To make sure the old onnxruntime_vitisai_ep.dll still works, we
offset the API struct by the version field.


### Motivation and Context
This is a direct request from Microsoft. The following is the problem
we are trying to solve:

How would you describe the dependency between (a)
onnxruntime_vitisai_ep.dll and (b) onnxruntime_providers_vitisai.dll?
E.g. for each version of (a) there is a minimum required version of (b),
or for each version of (b) there is minimum required version of (a).

Please note that in practice we won't be able to use the exact version
of ORT/EP that you tested against (because we might need to update ORT
for other reasons), but we might be able to accommodate some version
constraints that you specify. As we approach shipping, we'll lock the
version of ORT/EP to allow for stabilization and more detailed testing
(and work with you if it needs to be updated).
2024-04-18 23:05:44 -07:00
Patrice Vignola
12569626cb
Update DML to 1.14.1 (#20380)
2024-04-18 22:43:41 -07:00
Patrice Vignola
b8c90beef2
[DML EP] Add SimplifiedLayerNorm and SkipSimplifiedLayerNorm (#20326)
2024-04-18 22:17:31 -07:00
Chi Lo
a747a00cd3
[TensorRT EP] Use protobuf with debug build on Windows (#20378)
TRT EP implicitly uses oss_parser with debug build on Windows, therefore
it should use protobuf rather than protobuf-lite.
2024-04-18 19:39:08 -07:00
Patrice Vignola
745b426c60
[DML] Update DML to 1.14 (#20304)
I am prefiring this change to pre-run the non-dml checks, and also to
give folks the time to review it before DML gets released. When DML 1.14
officially releases, we'll only need to run the DML pipeline to
automatically pick up the nuget package. This should save us some
valuable time.

Note that DML 1.14 is the release needed for ORT 1.17.4, and DML 1.15
will come soon after.
2024-04-18 16:22:57 -07:00
Adrian Lizarraga
e4c0cb2b9a
[Quant tool] Do not default to contrib Q/DQ ops for 16-bit (#20376)
### Description
Updates the QDQ quantizer to use ONNX Q/DQ ops for 16-bit quantization
if opset >= 21.

### Motivation and Context
The QDQ quantizer previously set the 'com.microsoft' domain on inserted
Q/DQ ops when the model needed 16-bit support. ONNX 1.16.0 added
int16/uint16 support to the QuantizeLinear and DequantizeLinear
operators, so we can change the default behavior.
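
The selection rule described above can be sketched roughly as follows. This is a minimal illustration rather than the quantizer's actual code: the function name `select_qdq_domain` and the constant names are hypothetical, and the real tool may consider other settings as well. Only the 'com.microsoft' domain, the 16-bit condition, and the opset 21 threshold come from the description.

```python
# Hypothetical sketch of choosing the operator domain for inserted Q/DQ nodes.
ONNX_DOMAIN = ""  # the default ONNX operator domain is the empty string
MS_DOMAIN = "com.microsoft"  # contrib-op domain named in the description
FIRST_OPSET_WITH_16BIT_QDQ = 21  # threshold stated in the description


def select_qdq_domain(needs_16bit: bool, opset: int) -> str:
    """Return the domain to set on inserted QuantizeLinear/DequantizeLinear nodes."""
    if needs_16bit and opset < FIRST_OPSET_WITH_16BIT_QDQ:
        # Older opsets lack int16/uint16 Q/DQ, so fall back to the contrib ops.
        return MS_DOMAIN
    return ONNX_DOMAIN
```

With this rule, a 16-bit model at opset 21 or later gets standard ONNX Q/DQ ops, while the same model at an older opset still gets the contrib ops.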
2024-04-18 15:26:07 -07:00
Chi Lo
a8f74e3ec7
[TensorRT EP] TensorRT 10 support (#20167)
This PR adds support for the INT64 tensor type for TRT 10.
This PR is also **compatible with TRT 8.6 and TRT 10**, meaning users can
build ORT TRT against TRT 8.6 or TRT 10.

The timelines for TRT 10 GA and the ORT 1.18 release are very tight (we
don't have enough time to get our CIs installed with TRT 10 GA libraries
and run the build/tests), and Nvidia's new Triton release (whose
timeline is also very close to that of TRT 10 GA) wants to
integrate TRT EP with TRT 10.

Therefore, our approach is to get this PR into ORT 1.18 first, so
everything is fully tested with the TRT 8.6 CIs, and users can still manually
build ORT 1.18 against TRT 10, as in the Triton case.

As for testing TRT 10, once TRT 10 GA is released, we will have another
branch that includes the changes in this PR as well as whatever other
changes are needed, and we will update our CIs with TRT 10.
2024-04-18 14:03:04 -07:00
Yulong Wang
3577a4bd02
[Node.js binding] Allow installation to download CUDA binaries via script (#20364)
### Description
Currently we try to include all prebuilt binaries in the NPM packages.
This worked until we added libonnxruntime_providers_cuda.so
(>400MB) to the NPM package. The NPM registry refuses to accept the new
package because the file is too large.

To make the new NPM package work, we have to remove the large file
from the package and add a new script that runs on package installation.
This script will try to dynamically install the onnxruntime CUDA dynamic
library for Linux/x64.
2024-04-18 13:44:42 -07:00
Guenther Schmuelling
7b017cf9f8
fix web ci: csum tests need fp64 which is not supported on webgpu (#20374)
2024-04-18 12:30:26 -07:00
Adam Louly
ee74fb6908
Introducing ORTPipelineModule - DeepSpeed Parallel Pipeline Support. (#20287)
### Description
Introducing a new class, ORTPipelineModule, to handle wrapping layers for
DeepSpeed pipeline parallelism.


### Motivation and Context
To support pipeline parallelism on ORTModule.

This PR includes initial support for DeepSpeed pipeline
parallelism.

- [x] Support Pipeline parallel where layers are nn Modules in
Sequential.
- [ ] Support LayerSpec and TiedLayerSpec
- [ ] Enable partitioning to accept List
- [ ] Full-GPU Graph Consolidation
- [ ] Subgraph Merging for Inference
2024-04-18 11:30:15 -07:00
Sumit Agarwal
f664f91298
[DML EP] Expose NPU macro via build command (#20306)
### Description
This PR makes the following changes:
- Expose `ENABLE_NPU_ADAPTER_ENUMERATION` macro via build command, so
that a user can enable NPU support for DML EP seamlessly.
- Add the keyword `_dmlEp_` as part of the node name, which is useful
for debugging.



### Motivation and Context
2024-04-18 11:23:13 -07:00
Patrice Vignola
76434907fb
[DML EP] Add graph capture (#20257)
This adds a new "Graph Capture" option to the DML EP, similar to the
CUDA graph functionality. Here's how graph capture works:

- A user can enable graph capture in the session options by setting
`ep.dml.enable_graph_capture` to `true`
- When they want to capture a run, they set `gpu_graph_id` in their
`RunOptions` to a number bigger than 0 (0 is reserved for internal use
according to the CUDA graph documentation).
- Then, when they start the inference, the graph will be captured and
stored in the DML EP for future use
- When they execute the run for a second time with the same id, the
`ReplayGraph` function in the DML EP will be called instead of executing
the kernels, resulting in very low overhead and avoiding kernel
recompilation.

This feature can give performance on par with, or even better than,
specifying the static dimensions at session creation time, and is also
much more flexible.
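
The capture-then-replay flow above can be modeled with a small toy, shown here only to illustrate the control flow. It is not the DML EP implementation and does not use the onnxruntime API: `ToyGraphCache`, its methods, and the integer "kernels" are all hypothetical. Treating ids that are not bigger than 0 as "run without capture" is an assumption of the sketch.

```python
# Toy model of capture/replay keyed by a graph id; not the DML EP code.
from typing import Callable, Dict, List

Kernel = Callable[[int], int]


class ToyGraphCache:
    """Hypothetical stand-in for an EP that can capture and replay a run."""

    def __init__(self) -> None:
        self._captured: Dict[int, List[Kernel]] = {}
        self.compile_count = 0  # how many times kernels were (re)compiled

    def _compile(self, kernels: List[Kernel]) -> List[Kernel]:
        # Stands in for the expensive kernel compilation/setup work.
        self.compile_count += 1
        return list(kernels)

    def run(self, graph_id: int, kernels: List[Kernel], x: int) -> int:
        if graph_id > 0 and graph_id in self._captured:
            steps = self._captured[graph_id]  # replay: skip compilation
        else:
            steps = self._compile(kernels)
            if graph_id > 0:
                self._captured[graph_id] = steps  # capture for later runs
        for step in steps:
            x = step(x)
        return x


cache = ToyGraphCache()
kernels = [lambda v: v + 1, lambda v: v * 2]
first = cache.run(1, kernels, 3)   # first run with id 1: captured
second = cache.run(1, kernels, 3)  # second run with id 1: replayed
```

Both runs produce the same result, but the compile step happens only once, which mirrors the low-overhead replay described above.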
2024-04-18 10:15:00 -07:00
Vincent Wang
c47f446f25
Support BFloat16 for Triton Codegen (#20353)
The previous implementation used a numpy array and numpy data type to store
the constant value and data type, and numpy does not support BFloat16
natively. This PR switches to torch tensors, which support BFloat16.
2024-04-18 17:15:11 +08:00
Wanming Lin
da86f6f408
[WebNN EP] Add operators support table (#20253)
2024-04-17 21:19:46 -07:00
Hector Li
5daeb5e0b0
enable model with external data be loaded from memory buffer (#19089)
### Description
Background:
Users save a large model with its initializer data in an external file, e.g.:
onnx.save_model(onnx_model, "path/to/save/the/model.onnx", save_as_external_data=True, all_tensors_to_one_file=True,
location="filename", size_threshold=1024).
In that case, Ort loads the model, gets the external initializer information (external file name, offset, length), uses the model path to find the external file, and locates the tensor data via the offset and length.
But this does not work if the user loads the model from memory, since Ort loses track of the model path.

This PR adds an API/session option that lets the user provide a table with the external initializer file name as the key, and the pointer to the loaded external file in memory plus the buffer length as the value, so that:

1. the user can load the model from a memory buffer with the external initializers in a memory buffer too.
2. the initializers can be shared across sessions, for different EPs.
3. the user can load the file in any way they want, e.g. mmap.

Internally:
1. At session creation time, Ort goes through the external initializers in the graph and gets the file name, offset, and data length of each external initializer from the TensorProto.
2. With the file name, Ort gets the in-memory file buffer and the buffer length from the table the user provided.
3. Ort locates the tensor buffer within the in-memory file buffer (user provided) using the offset and data length (from the TensorProto).
4. Ort creates the Tensor and replaces the existing Tensor in the graph.


### Motivation and Context
https://github.com/onnx/onnx/blob/main/docs/ExternalData.md
For a model with external data, the TensorProto may have its initializer data in a separate file. The external file location is given as a file path relative to the model path. When the model is loaded through the API that takes a memory buffer, Ort loses track of the
model path, which causes an error if the model has external data. By adding a session option to set the external data buffer, Ort can find the external data correctly when the model is loaded from a memory buffer.
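
The lookup described above (file name to in-memory buffer, then offset and length to tensor bytes) can be sketched as follows. This is a simplified illustration under stated assumptions, not Ort's implementation or API: the function `locate_tensor_bytes`, the `external_files` table, and the sample bytes are hypothetical.

```python
# Hypothetical sketch: resolve an external initializer from a user-provided table.
from typing import Dict


def locate_tensor_bytes(
    buffers: Dict[str, bytes], file_name: str, offset: int, length: int
) -> memoryview:
    """Return a zero-copy view of the tensor bytes inside the in-memory file."""
    if file_name not in buffers:
        raise KeyError(f"no in-memory buffer provided for external file {file_name!r}")
    view = memoryview(buffers[file_name])
    if offset < 0 or length < 0 or offset + length > len(view):
        raise ValueError("offset/length fall outside the provided buffer")
    return view[offset:offset + length]


# Table keyed by external file name; the value is the file content already in memory.
external_files = {"filename": bytes(range(16))}
tensor_bytes = locate_tensor_bytes(external_files, "filename", offset=4, length=8)
```

Returning a `memoryview` keeps the sketch zero-copy, which loosely matches the idea that the same loaded buffer can be shared rather than duplicated.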
2024-04-17 19:01:01 -07:00
Hector Li
301240433c
Log error caused by NPU SSR. (#20356)
### Description
Log error caused by NPU SSR.
2024-04-17 18:27:45 -07:00
Edward Chen
ccaa4d1db2
[MLAS][AArch64] SQNBitGemm M>1 CompFp32 kernel optimization (#20319)
Add ARM NEON intrinsics implementation for `Q4BitBlkDequantBForSgemm_CompFp32`.
2024-04-17 17:50:26 -07:00
Patrice Vignola
6bd6d879a3
[DML EP] Improve python API perf (#20331)
This change improves the python API perf in two ways:

1. Remove unnecessary CPU syncs by sharing a queue between the python
EPs and the allocator.
2. Add an opt-in CPU spinning sync to reduce overhead in applications
that run a lot of inferences per second.
2024-04-17 17:33:37 -07:00
Guenther Schmuelling
a8a77ddfdc
fix csum and enable ut (#20355)
2024-04-17 15:01:06 -07:00
dependabot[bot]
4c3fc26255
Bump Sixlabors.ImageSharp from 2.1.7 to 2.1.8 in /csharp/sample/Microsoft.ML.OnnxRuntime.FasterRcnnSample (#20314)
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.7 to 2.1.8.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.8</h2>
<h2>What's Changed</h2>
<ul>
<li>V2 - Limit Read Palette Indices by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2719">SixLabors/ImageSharp#2719</a></li>
<li>V2 - Clear Pixel Buffers on Decode. by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2717">SixLabors/ImageSharp#2717</a></li>
<li>V2 - Limit all memory allocations in the MemoryAllocator layer by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2715">SixLabors/ImageSharp#2715</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="f21d64188e"><code>f21d641</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2715">#2715</a>
from SixLabors/backport/v2-memlimit</li>
<li><a
href="8f0b4d3e68"><code>8f0b4d3</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2717">#2717</a>
from SixLabors/backport/v2-clear-buffers</li>
<li><a
href="cf9496d284"><code>cf9496d</code></a>
test allocation limits</li>
<li><a
href="3d298db2cd"><code>3d298db</code></a>
Adapt BmpDecoder_ThrowsException_Issue2696 for V2</li>
<li><a
href="a78ce27a2b"><code>a78ce27</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2719">#2719</a>
from SixLabors/backport/v2-check-palette-indices</li>
<li><a
href="e6209147b1"><code>e620914</code></a>
Clamp read palette indices.</li>
<li><a
href="c122185ea0"><code>c122185</code></a>
Clear pixel buffers on decode.</li>
<li><a
href="5c6ec5d6fb"><code>5c6ec5d</code></a>
Limit all allocations</li>
<li>See full diff in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-17 14:47:44 -07:00
Adrian Lizarraga
eae7b705ac
[Quant tool] Fix quantized bias's scale dtype to properly handle fp16 bias inputs (#20340)
### Description
- Fix quantization tool bug that did not correctly set a quantized
bias's scale data type to fp16 if the original bias was fp16.
- Enabled fp16 ConvTranspose quantization unit tests that were disabled.



### Motivation and Context
Python quantization tests for fp16 ConvTranspose were originally
disabled due to a shape inference bug. It turns out that we also have a
bug in our quantizer that does not properly handle fp16 bias inputs.
Fixing the bug allows us to re-enable these tests with the latest
version of ONNX.
2024-04-17 10:24:28 -07:00
Adrian Lizarraga
0a1902525f
Add patch for ONNX 1.16.0 shape inference bug (#20316)
### Description
- Adds a patch that fixes a shape inference bug that caused a segfault:
https://github.com/onnx/onnx/pull/6080
- Fix documentation describing why QLinearMatMul tests are currently
being skipped.



### Motivation and Context
The [PR for integrating with ONNX
1.16.0](https://github.com/microsoft/onnxruntime/pull/19745) disabled
various python quantization tests due to a shape inference bug. This PR
applies the ONNX fix as a patch. We still can't enable the tests because
some of our CIs pip install onnx-1.16.0, which doesn't include the fix.
2024-04-17 10:23:22 -07:00
Hector Li
bb1972264b
Enable provider option to let user provide the profiling file path (#20285)
Enable a provider option to let the user provide the profiling file path.
Separate out the profiling level for ETW, in case ETW is enabled when Ort creates the QNN profiling and then gets disabled when Ort logs the profiling events, or vice versa. Enhance the logic that decides the profiling level.
2024-04-17 09:42:40 -07:00
Scott McKay
8143b0d798
Add helper to get errno and error message (#20324)
### Description
Add platform-aware helper to fetch errno message string.
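
As a rough analogue of what such a helper provides, the sketch below pairs an errno value with its message string in Python. It is not the helper added by this change: the name `errno_and_message` and the nonexistent path are hypothetical and serve only to trigger an error.

```python
# Illustrative analogue only; the actual helper in this change is not shown here.
import errno
import os
from typing import Tuple


def errno_and_message(exc: OSError) -> Tuple[int, str]:
    """Return the errno value and its platform-specific message string."""
    code = exc.errno
    return code, os.strerror(code)


try:
    # Opening a file in a directory that does not exist fails with ENOENT.
    open(os.path.join("this-directory-should-not-exist", "missing.txt"))
except OSError as exc:
    code, message = errno_and_message(exc)
```

The message text differs between platforms, so callers should compare the numeric code rather than the string.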


### Motivation and Context
For usage in #20077

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-04-17 21:17:36 +10:00
Yi Zhang
4d2b98155f
More fixes on random connection exception in Mac Build. (#20328)
### Description
Supplement to #20322



### Motivation and Context
Fixes random connection exceptions in 
Mac build in Python Packaging Pipeline

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=443617&view=logs&j=5849a411-e258-5ce5-39bd-7b65d44961a0&t=ccb871c8-76d9-5e80-55b0-4279efd5567f
and iOS full xcframework

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=443458&view=logs&j=370fd1a2-3dec-5916-4d2c-8aae58c72d28&t=686352ba-ee61-5ad4-8739-e8abd07372a4&s=e9aa87c8-a9ad-51f7-3b12-045ecc319776
2024-04-17 08:37:56 +08:00
Scott McKay
5c8034cc20
Avoid call to Node::ToProto on first Graph::Resolve to improve session creation performance. (#20296)
### Description
The first call to Graph::Resolve occurs when creating the Graph instance
when loading an existing model from ModelProto. As the Node instance
will exactly match the source NodeProto there's no need to call
Node::ToProto in this case.

Add a temporary reference to the original NodeProto to avoid the call on
the first Graph::Resolve.

### Motivation and Context
Better alternative to #19469
2024-04-17 10:07:12 +10:00
jingyanwangms
c11941289b
Add Gemma Rotary Embedding (#20267)
### Description
Add a GemmaRotaryEmbedding kernel, which includes sin and cos in
GemmaRotaryEmbedding forward and apply_rotary_pos_emb. See
gemma_rotary_emb_impl.cu for subgraph details.

### Motivation and Context
2024-04-16 15:31:56 -07:00
dependabot[bot]
7354f3cdd8
Bump transformers from 4.36.0 to 4.38.0 in /tools/ci_build (#20272)
Bumps [transformers](https://github.com/huggingface/transformers) from
4.36.0 to 4.38.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/huggingface/transformers/releases">transformers's
releases</a>.</em></p>
<blockquote>
<h2>v4.38: Gemma, Depth Anything, Stable LM; Static Cache, HF Quantizer,
AQLM</h2>
<h2>New model additions</h2>
<h3>💎 Gemma 💎</h3>
<p>Gemma is a new opensource Language Model series from Google AI that
comes with a 2B and 7B variant. The release comes with the pre-trained
and instruction fine-tuned versions and you can use them via
<code>AutoModelForCausalLM</code>, <code>GemmaForCausalLM</code> or
<code>pipeline</code> interface!</p>
<p>Read more about it in the Gemma release blogpost: <a
href="https://hf.co/blog/gemma">https://hf.co/blog/gemma</a></p>
<pre lang="python"><code>from transformers import AutoTokenizer,
AutoModelForCausalLM
<p>tokenizer =
AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)
model =
AutoModelForCausalLM.from_pretrained(&quot;google/gemma-2b&quot;,
device_map=&quot;auto&quot;, torch_dtype=torch.float16)</p>
<p>input_text = &quot;Write me a poem about Machine Learning.&quot;
input_ids = tokenizer(input_text,
return_tensors=&quot;pt&quot;).to(&quot;cuda&quot;)</p>
<p>outputs = model.generate(**input_ids)
</code></pre></p>
<p>You can use the model with Flash Attention, SDPA, Static cache and
quantization API for further optimizations !</p>
<ul>
<li>Flash Attention 2</li>
</ul>
<pre lang="python"><code>from transformers import AutoTokenizer,
AutoModelForCausalLM
<p>tokenizer =
AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)</p>
<p>model = AutoModelForCausalLM.from_pretrained(
&quot;google/gemma-2b&quot;, device_map=&quot;auto&quot;,
torch_dtype=torch.float16,
attn_implementation=&quot;flash_attention_2&quot;
)</p>
<p>input_text = &quot;Write me a poem about Machine Learning.&quot;
input_ids = tokenizer(input_text,
return_tensors=&quot;pt&quot;).to(&quot;cuda&quot;)</p>
<p>outputs = model.generate(**input_ids)
</code></pre></p>
<ul>
<li>bitsandbytes-4bit</li>
</ul>
<pre lang="python"><code>from transformers import AutoTokenizer,
AutoModelForCausalLM
<p>tokenizer =
AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)</p>
<p>model = AutoModelForCausalLM.from_pretrained(
&quot;google/gemma-2b&quot;, device_map=&quot;auto&quot;,
load_in_4bit=True
)
&lt;/tr&gt;&lt;/table&gt;
</code></pre></p>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="08ab54ada5"><code>08ab54a</code></a>
[ <code>gemma</code>] Adds support for Gemma 💎 (<a
href="https://redirect.github.com/huggingface/transformers/issues/29167">#29167</a>)</li>
<li><a
href="2de9314197"><code>2de9314</code></a>
[<code>Maskformer</code>] safely get backbone config (<a
href="https://redirect.github.com/huggingface/transformers/issues/29166">#29166</a>)</li>
<li><a
href="476957b5b4"><code>476957b</code></a>
🚨 Llama: update rope scaling to match static cache changes (<a
href="https://redirect.github.com/huggingface/transformers/issues/29143">#29143</a>)</li>
<li><a
href="7a4bec6e8f"><code>7a4bec6</code></a>
Release: 4.38.0</li>
<li><a
href="ee3af60be0"><code>ee3af60</code></a>
Add support for fine-tuning CLIP-like models using
contrastive-image-text exa...</li>
<li><a
href="0996a10077"><code>0996a10</code></a>
Revert low cpu mem tie weights (<a
href="https://redirect.github.com/huggingface/transformers/issues/29135">#29135</a>)</li>
<li><a
href="15cfe38942"><code>15cfe38</code></a>
[<code>Core tokenization</code>] <code>add_dummy_prefix_space</code>
option to help with latest is...</li>
<li><a
href="efdd436663"><code>efdd436</code></a>
FIX [<code>PEFT</code> / <code>Trainer</code> ] Handle better peft +
quantized compiled models (<a
href="https://redirect.github.com/huggingface/transformers/issues/29">#29</a>...</li>
<li><a
href="5e95dcabe1"><code>5e95dca</code></a>
[<code>cuda kernels</code>] only compile them when initializing (<a
href="https://redirect.github.com/huggingface/transformers/issues/29133">#29133</a>)</li>
<li><a
href="a7755d2409"><code>a7755d2</code></a>
Generate: unset GenerationConfig parameters do not raise warning (<a
href="https://redirect.github.com/huggingface/transformers/issues/29119">#29119</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/huggingface/transformers/compare/v4.36.0...v4.38.0">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-16 14:21:12 -07:00
dependabot[bot]
efa51de7e4
Bump gradle/wrapper-validation-action from 2 to 3 (#20305)
Bumps
[gradle/wrapper-validation-action](https://github.com/gradle/wrapper-validation-action)
from 2 to 3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/gradle/wrapper-validation-action/releases">gradle/wrapper-validation-action's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.3</h2>
<h2>What's Changed</h2>
<ul>
<li>Update various NPM dependencies</li>
<li>Update wrapper checksums to include Gradle 8.7</li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/gradle/wrapper-validation-action/compare/v2.1.2...v2.1.3">https://github.com/gradle/wrapper-validation-action/compare/v2.1.2...v2.1.3</a></p>
<h2>v2.1.2</h2>
<h2>What's Changed</h2>
<ul>
<li>Update various NPM dependencies</li>
<li>Update wrapper checksums</li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/gradle/wrapper-validation-action/compare/v2.1.1...v2.1.2">https://github.com/gradle/wrapper-validation-action/compare/v2.1.1...v2.1.2</a></p>
<h2>v2.1.1</h2>
<h2>Changelog</h2>
<ul>
<li>[FIX] Add hardcoded checksum for Gradle 7.6.4</li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/gradle/wrapper-validation-action/compare/v2...v2.1.1">https://github.com/gradle/wrapper-validation-action/compare/v2...v2.1.1</a></p>
<h2>v2.1.0</h2>
<p>This release should vastly reduce the number of network requests made
by the <code>wrapper-validation-action</code>, by hardcoding the
checksums of all known Gradle wrapper jars at time of release. With this
improvement, a number of long-standing issues should be addressed (<a
href="https://redirect.github.com/gradle/wrapper-validation-action/issues/164">#164</a>,
<a
href="https://redirect.github.com/gradle/wrapper-validation-action/issues/162">#162</a>,
<a
href="https://redirect.github.com/gradle/wrapper-validation-action/issues/57">#57</a>).</p>
<p>The action should now only make network requests to validate the
checksums of an unknown <code>gradle-wrapper.jar</code>. This can happen
if:</p>
<ul>
<li>The Gradle version was published after this action was released</li>
<li>The <code>gradle-wrapper.jar</code> is truly invalid</li>
</ul>
<h2>Changelog</h2>
<ul>
<li>[NEW] Hardcode list of known checksums to avoid network requests in
most cases (<a
href="https://redirect.github.com/gradle/wrapper-validation-action/issues/161">#161</a>)</li>
</ul>
<p>Huge thanks to <a
href="https://github.com/Marcono1234"><code>@​Marcono1234</code></a> for
contributing this long-awaited improvement.</p>
<h2>v2.0.1</h2>
<p>This patch release fixes error reporting when failing to retrieve the
checksums from services.gradle.org</p>
<ul>
<li>[FIX] After migration from v1 to v2 silently fails (<a
href="https://redirect.github.com/gradle/wrapper-validation-action/issues/174">#174</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="460a3ca55f"><code>460a3ca</code></a>
Delegate to 'gradle/actions/wrapper-validation' (<a
href="https://redirect.github.com/gradle/wrapper-validation-action/issues/200">#200</a>)</li>
<li>See full diff in <a
href="https://github.com/gradle/wrapper-validation-action/compare/v2...v3">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-16 14:20:51 -07:00
dependabot[bot]
e1499a007a
Bump Sixlabors.ImageSharp from 2.1.7 to 2.1.8 in /csharp/sample/Microsoft.ML.OnnxRuntime.ResNet50v2Sample (#20315)
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.7 to 2.1.8.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.8</h2>
<h2>What's Changed</h2>
<ul>
<li>V2 - Limit Read Palette Indices by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2719">SixLabors/ImageSharp#2719</a></li>
<li>V2 - Clear Pixel Buffers on Decode. by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2717">SixLabors/ImageSharp#2717</a></li>
<li>V2 - Limit all memory allocations in the MemoryAllocator layer by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2715">SixLabors/ImageSharp#2715</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="f21d64188e"><code>f21d641</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2715">#2715</a>
from SixLabors/backport/v2-memlimit</li>
<li><a
href="8f0b4d3e68"><code>8f0b4d3</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2717">#2717</a>
from SixLabors/backport/v2-clear-buffers</li>
<li><a
href="cf9496d284"><code>cf9496d</code></a>
test allocation limits</li>
<li><a
href="3d298db2cd"><code>3d298db</code></a>
Adapt BmpDecoder_ThrowsException_Issue2696 for V2</li>
<li><a
href="a78ce27a2b"><code>a78ce27</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2719">#2719</a>
from SixLabors/backport/v2-check-palette-indices</li>
<li><a
href="e6209147b1"><code>e620914</code></a>
Clamp read palette indices.</li>
<li><a
href="c122185ea0"><code>c122185</code></a>
Clear pixel buffers on decode.</li>
<li><a
href="5c6ec5d6fb"><code>5c6ec5d</code></a>
Limit all allocations</li>
<li>See full diff in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Sixlabors.ImageSharp&package-manager=nuget&previous-version=2.1.7&new-version=2.1.8)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-16 14:20:16 -07:00
Yi-Hong Lyu
6b6a62fb40
Add vectorized AVX512F kernel for ReduceMaximumF32Kernel (#20268)
### Description

This commit introduces a new vectorized AVX512F kernel,
MlasReduceMaximumF32KernelAvx512F, which efficiently computes the
maximum value of the supplied buffer. Additionally, microbenchmarks have
been added for MlasComputeSoftmax (inplace),
MlasReduceMaximumF32KernelAvx, MlasComputeSumExpF32KernelAvx512F, and
MlasComputeSoftmaxOutputF32KernelAvx.

### Motivation and Context

The goal of this commit is to enhance the performance of
ReduceMaximumF32Kernel on CPUs with AVX512F instruction support.

name | AVX iterations | AVX real_time | AVX cpu_time | AVX512 iterations | AVX512 real_time | AVX512 cpu_time | time_unit
-- | -- | -- | -- | -- | -- | -- | --
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:3/real_time | 271277304 | 2.58095 | 2.58091 | 263338132 | 2.65661 | 2.65661 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:3/real_time | 271220477 | 2.58095 | 2.58095 | 263509929 | 2.65652 | 2.65649 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:3/real_time | 271240587 | 2.58064 | 2.58064 | 263479542 | 2.65671 | 2.65665 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:3/real_time | 271227745 | 2.58083 | 2.58079 | 263402506 | 2.65657 | 2.65657 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:3/real_time | 271255069 | 2.58073 | 2.58071 | 263463858 | 2.65682 | 2.65682 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:3/real_time | 271257174 | 2.58058 | 2.58052 | 263460120 | 2.65682 | 2.65682 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:4/real_time | 174395051 | 4.01401 | 4.01401 | 197330481 | 3.5465 | 3.54636 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:4/real_time | 174645502 | 3.99691 | 3.99691 | 197474831 | 3.54298 | 3.54278 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:4/real_time | 174523308 | 4.01391 | 4.01386 | 197389981 | 3.54518 | 3.54506 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:4/real_time | 174779200 | 3.99874 | 3.99874 | 197519075 | 3.54227 | 3.54209 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:4/real_time | 174642874 | 4.00645 | 4.00641 | 197642101 | 3.54195 | 3.54188 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:4/real_time | 174546754 | 4.0061 | 4.00608 | 197621033 | 3.54296 | 3.54281 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:5/real_time | 162752651 | 4.30119 | 4.30114 | 215552503 | 3.24767 | 3.24752 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:5/real_time | 162717463 | 4.30123 | 4.30116 | 215541082 | 3.24711 | 3.24695 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:5/real_time | 162718819 | 4.3016 | 4.30153 | 215589239 | 3.24725 | 3.24708 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:5/real_time | 162719596 | 4.30151 | 4.30145 | 215563846 | 3.24956 | 3.24949 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:5/real_time | 162753333 | 4.30125 | 4.30125 | 215537315 | 3.24924 | 3.24908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:5/real_time | 162752258 | 4.3014 | 4.30141 | 215526482 | 3.24744 | 3.24735 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:7/real_time | 143579660 | 4.87526 | 4.87516 | 100000000 | 5.25767 | 5.25752 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:7/real_time | 143585097 | 4.87476 | 4.87467 | 100000000 | 5.41583 | 5.41567 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:7/real_time | 143571011 | 4.87506 | 4.87503 | 182359467 | 3.83773 | 3.83764 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:7/real_time | 143587142 | 4.87487 | 4.8748 | 182397261 | 3.83807 | 3.8379 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:7/real_time | 143578465 | 4.87525 | 4.87521 | 182428602 | 3.83777 | 3.83768 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:7/real_time | 143588555 | 4.87491 | 4.87488 | 125280452 | 5.59791 | 5.59766 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:9/real_time | 284851058 | 2.43476 | 2.43476 | 156879863 | 4.42895 | 4.42884 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:9/real_time | 270700898 | 2.59031 | 2.59024 | 157953114 | 4.42995 | 4.42968 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:9/real_time | 282871172 | 2.45385 | 2.45385 | 157801156 | 4.42817 | 4.42804 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:9/real_time | 285307738 | 2.47009 | 2.47005 | 158058507 | 4.4279 | 4.42786 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:9/real_time | 285709536 | 2.45481 | 2.45476 | 158070961 | 4.42809 | 4.42799 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:9/real_time | 285449733 | 2.47495 | 2.47491 | 158069718 | 4.45026 | 4.45017 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:11/real_time | 189213618 | 3.79684 | 3.79676 | 139459497 | 5.01882 | 5.01871 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:11/real_time | 185600468 | 3.76394 | 3.76376 | 139444892 | 5.01922 | 5.01905 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:11/real_time | 184968668 | 3.80636 | 3.80636 | 139470834 | 5.01948 | 5.01936 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:11/real_time | 183867226 | 3.80432 | 3.80427 | 139481986 | 5.01975 | 5.01944 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:11/real_time | 184301650 | 3.81634 | 3.81634 | 139452846 | 5.01983 | 5.01972 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:11/real_time | 186215795 | 3.82659 | 3.82654 | 139497736 | 5.02119 | 5.02113 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:13/real_time | 135622415 | 5.16256 | 5.16252 | 124661337 | 5.61227 | 5.61194 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:13/real_time | 135618907 | 5.15967 | 5.1596 | 124805224 | 5.6088 | 5.60854 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:13/real_time | 135612192 | 5.15506 | 5.15501 | 124803221 | 5.60901 | 5.60869 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:13/real_time | 135906082 | 5.15818 | 5.15818 | 124776601 | 5.60898 | 5.60886 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:13/real_time | 135369523 | 5.15709 | 5.15682 | 124790370 | 5.60927 | 5.60902 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:13/real_time | 135596827 | 5.1603 | 5.1603 | 124792145 | 5.61637 | 5.61614 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:15/real_time | 110947137 | 5.96511 | 5.96495 | 112861522 | 6.20035 | 6.20014 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:15/real_time | 118004792 | 6.22645 | 6.22628 | 112909900 | 6.20073 | 6.20073 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:15/real_time | 112630319 | 6.25564 | 6.25552 | 112874563 | 6.19932 | 6.19924 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:15/real_time | 117403034 | 6.17263 | 6.17258 | 112927318 | 6.19866 | 6.19842 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:15/real_time | 108921863 | 6.48624 | 6.48612 | 112927746 | 6.20057 | 6.20026 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:15/real_time | 110358148 | 6.66805 | 6.66789 | 112907312 | 6.19938 | 6.19908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:16/real_time | 203419574 | 3.4415 | 3.44137 | 237134525 | 2.95649 | 2.95638 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:16/real_time | 203414035 | 3.4411 | 3.44099 | 237129564 | 2.95178 | 2.95171 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:16/real_time | 203404068 | 3.44157 | 3.44151 | 236981704 | 2.9518 | 2.95167 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:16/real_time | 203391471 | 3.44146 | 3.44137 | 237108807 | 2.95203 | 2.95196 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:16/real_time | 203393801 | 3.44131 | 3.44127 | 237126460 | 2.95278 | 2.95272 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:16/real_time | 203407476 | 3.44181 | 3.44162 | 237154444 | 2.95293 | 2.9528 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:500/real_time | 37551439 | 18.6407 | 18.6407 | 39222534 | 17.858 | 17.8571 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:500/real_time | 37544097 | 18.6404 | 18.6401 | 39174151 | 17.8539 | 17.8536 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:500/real_time | 37549837 | 18.6391 | 18.6391 | 39233956 | 17.8507 | 17.8505 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:500/real_time | 45996345 | 15.2157 | 15.2153 | 39285929 | 17.848 | 17.8474 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:500/real_time | 46012429 | 15.2184 | 15.2179 | 65664865 | 10.7366 | 10.7364 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:500/real_time | 45912375 | 15.2349 | 15.2346 | 65205908 | 10.8498 | 10.8492 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:2000/real_time | 9493955 | 73.7232 | 73.7203 | 10188090 | 68.7931 | 68.7908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:2000/real_time | 9495562 | 73.7173 | 73.7173 | 10180895 | 68.7533 | 68.7511 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:2000/real_time | 9487371 | 73.7852 | 73.7831 | 10164473 | 68.7279 | 68.725 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:2000/real_time | 10816047 | 64.7322 | 64.7287 | 10168481 | 68.8109 | 68.8096 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:2000/real_time | 10808802 | 64.7232 | 64.721 | 19478320 | 36.1471 | 36.1461 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:2000/real_time | 10818192 | 64.7304 | 64.728 | 19419672 | 35.9635 | 35.9635 | ns
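
For readers unfamiliar with the shape of such a kernel, here is a minimal Python sketch of a lane-wise maximum reduction. It is not the MLAS implementation: the function name `reduce_maximum_f32`, the empty-input result of negative infinity, and the single-accumulator layout are all assumptions made for illustration. The only fact borrowed from the hardware is that a 512-bit register holds 16 float32 values.

```python
import math

LANES = 16  # a 512-bit register holds 16 float32 values


def reduce_maximum_f32(values):
    """Hypothetical model of a vectorized max reduction (not the MLAS code)."""
    n = len(values)
    # One accumulator "register", initialised to the identity for max.
    acc = [-math.inf] * LANES
    i = 0
    # Main loop: one element-wise "vector max" per full 16-element block.
    while i + LANES <= n:
        block = values[i:i + LANES]
        acc = [max(a, b) for a, b in zip(acc, block)]
        i += LANES
    # Horizontal reduction: collapse the 16 lanes to one scalar.
    result = max(acc)
    # Scalar tail: handle the remaining n % 16 elements one by one.
    for v in values[i:]:
        result = max(result, v)
    return result
```

In this model, inputs shorter than 16 elements never enter the main loop and are handled entirely by the scalar tail, which is one reason small `D` values are benchmarked separately from `D:500` and `D:2000`.
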
2024-04-16 13:52:43 -07:00
Yifan Li
54f91ea65a
[TensorRT EP] support user_compute_stream in python API (#20168)
### Description

* Implement the `user_compute_stream` Python API for the TensorRT EP
* Using this option implicitly sets `has_user_compute_stream` to
`true`
* Extend the existing TensorRT EP unit test to verify the
`user_compute_stream` option
* Verified in a local PyTorch environment by passing the handle of a
`torch.cuda.Stream()` to `user_compute_stream`:
```python
...
# Before inference
if torch.cuda.is_available():
    s = torch.cuda.Stream()
    option = {"user_compute_stream": str(s.cuda_stream)}
    sess.set_providers(["TensorrtExecutionProvider"], [option])
    options = sess.get_provider_options()

    assert "TensorrtExecutionProvider" in options
    assert options["TensorrtExecutionProvider"].get("user_compute_stream", "") == str(s.cuda_stream)
    assert options["TensorrtExecutionProvider"].get("has_user_compute_stream", "") == "1"
...
```
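
The snippet above can be wrapped in a small helper, sketched below. The helper name `attach_user_stream` is hypothetical; it uses only the `set_providers` and `get_provider_options` calls that already appear in the snippet, and it assumes `sess` is an `onnxruntime.InferenceSession` (or any object exposing those two methods).

```python
def attach_user_stream(sess, stream_handle):
    """Hypothetical helper: hand a CUDA stream handle to the TensorRT EP."""
    # Provider options are passed as strings, so the raw handle is stringified.
    option = {"user_compute_stream": str(int(stream_handle))}
    sess.set_providers(["TensorrtExecutionProvider"], [option])
    # Return the options the session reports back, for verification.
    return sess.get_provider_options().get("TensorrtExecutionProvider", {})
```

With PyTorch, the handle would be `torch.cuda.Stream().cuda_stream`, as in the snippet above.
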
### Motivation and Context
Align with existing `user_compute_stream` python implementations for
[CUDA EP](https://github.com/microsoft/onnxruntime/pull/19229)/[ROCm
EP](https://github.com/microsoft/onnxruntime/pull/19619)
2024-04-16 12:49:29 -07:00
dependabot[bot]
e02aef1ded
Bump transformers from 4.36.0 to 4.38.0 in /onnxruntime/python/tools/transformers/models/stable_diffusion (#20271)
Bumps [transformers](https://github.com/huggingface/transformers) from
4.36.0 to 4.38.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/huggingface/transformers/releases">transformers's
releases</a>.</em></p>
<blockquote>
<h2>v4.38: Gemma, Depth Anything, Stable LM; Static Cache, HF Quantizer,
AQLM</h2>
<h2>New model additions</h2>
<h3>💎 Gemma 💎</h3>
<p>Gemma is a new open-source language model series from Google AI that
comes in 2B and 7B variants. The release includes the pre-trained
and instruction fine-tuned versions, and you can use them via the
<code>AutoModelForCausalLM</code>, <code>GemmaForCausalLM</code> or
<code>pipeline</code> interface!</p>
<p>Read more about it in the Gemma release blogpost: <a
href="https://hf.co/blog/gemma">https://hf.co/blog/gemma</a></p>
<pre lang="python"><code>import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)
model = AutoModelForCausalLM.from_pretrained(&quot;google/gemma-2b&quot;,
device_map=&quot;auto&quot;, torch_dtype=torch.float16)

input_text = &quot;Write me a poem about Machine Learning.&quot;
input_ids = tokenizer(input_text,
return_tensors=&quot;pt&quot;).to(&quot;cuda&quot;)

outputs = model.generate(**input_ids)
</code></pre>
<p>You can use the model with Flash Attention, SDPA, static cache and the
quantization API for further optimizations!</p>
<ul>
<li>Flash Attention 2</li>
</ul>
<pre lang="python"><code>import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)

model = AutoModelForCausalLM.from_pretrained(
    &quot;google/gemma-2b&quot;, device_map=&quot;auto&quot;,
    torch_dtype=torch.float16,
    attn_implementation=&quot;flash_attention_2&quot;
)

input_text = &quot;Write me a poem about Machine Learning.&quot;
input_ids = tokenizer(input_text,
return_tensors=&quot;pt&quot;).to(&quot;cuda&quot;)

outputs = model.generate(**input_ids)
</code></pre>
<ul>
<li>bitsandbytes-4bit</li>
</ul>
<pre lang="python"><code>from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)

model = AutoModelForCausalLM.from_pretrained(
    &quot;google/gemma-2b&quot;, device_map=&quot;auto&quot;,
    load_in_4bit=True
)
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="08ab54ada5"><code>08ab54a</code></a>
[ <code>gemma</code>] Adds support for Gemma 💎 (<a
href="https://redirect.github.com/huggingface/transformers/issues/29167">#29167</a>)</li>
<li><a
href="2de9314197"><code>2de9314</code></a>
[<code>Maskformer</code>] safely get backbone config (<a
href="https://redirect.github.com/huggingface/transformers/issues/29166">#29166</a>)</li>
<li><a
href="476957b5b4"><code>476957b</code></a>
🚨 Llama: update rope scaling to match static cache changes (<a
href="https://redirect.github.com/huggingface/transformers/issues/29143">#29143</a>)</li>
<li><a
href="7a4bec6e8f"><code>7a4bec6</code></a>
Release: 4.38.0</li>
<li><a
href="ee3af60be0"><code>ee3af60</code></a>
Add support for fine-tuning CLIP-like models using
contrastive-image-text exa...</li>
<li><a
href="0996a10077"><code>0996a10</code></a>
Revert low cpu mem tie weights (<a
href="https://redirect.github.com/huggingface/transformers/issues/29135">#29135</a>)</li>
<li><a
href="15cfe38942"><code>15cfe38</code></a>
[<code>Core tokenization</code>] <code>add_dummy_prefix_space</code>
option to help with latest is...</li>
<li><a
href="efdd436663"><code>efdd436</code></a>
FIX [<code>PEFT</code> / <code>Trainer</code> ] Handle better peft +
quantized compiled models (<a
href="https://redirect.github.com/huggingface/transformers/issues/29">#29</a>...</li>
<li><a
href="5e95dcabe1"><code>5e95dca</code></a>
[<code>cuda kernels</code>] only compile them when initializing (<a
href="https://redirect.github.com/huggingface/transformers/issues/29133">#29133</a>)</li>
<li><a
href="a7755d2409"><code>a7755d2</code></a>
Generate: unset GenerationConfig parameters do not raise warning (<a
href="https://redirect.github.com/huggingface/transformers/issues/29119">#29119</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/huggingface/transformers/compare/v4.36.0...v4.38.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=transformers&package-manager=pip&previous-version=4.36.0&new-version=4.38.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-16 11:48:54 -07:00
Adrian Lizarraga
f644ff9fc0
[QNN EP] Support per-channel quantized weights (#20154)
### Description
- Add general support for per-channel quantized weights to QNN EP (HTP
backend).
- Add QNN EP unit tests for per-channel Conv
- Update quantization tool to allow selecting which ops are quantized
per-channel (and which axis) via tensor-level overrides. Currently,
setting `per_channel=True` assumes all Convs, MatMuls, Gemms,
InstanceNormalization, and LayerNormalization ops should be quantized
per-channel using some assumed default axis.
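
As background, the sketch below shows what symmetric per-channel int8 quantization along axis 0 computes: one scale per output channel instead of a single scale for the whole tensor. It is a plain-Python illustration, not the quantization tool's code; the function name `quantize_per_channel_symmetric` and the nested-list weight layout are assumptions.

```python
def quantize_per_channel_symmetric(weight, qmax=127):
    """Hypothetical sketch: quantize each axis-0 channel with its own scale.

    `weight` is a list of channels; each channel is a flat list of floats.
    Returns (scales, quantized) where quantized values lie in [-qmax, qmax].
    """
    scales, quantized = [], []
    for channel in weight:
        max_abs = max((abs(v) for v in channel), default=0.0)
        # Symmetric quantization: zero point is 0, scale maps max_abs to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append(
            [max(-qmax, min(qmax, round(v / scale))) for v in channel]
        )
    return scales, quantized


# Two channels with very different ranges each use the full int8 range.
scales, q = quantize_per_channel_symmetric([[1.0, -1.0, 0.0], [10.0, 2.5, 0.0]])
```

With a single per-tensor scale, the first channel here would be squeezed into roughly a tenth of the int8 range, which is the precision loss per-channel scales avoid.
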

#### Creating QDQ per-channel Conv model example
```python
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model

class DataReader(CalibrationDataReader):
    # TODO: See ONNX Runtime QNN docs for example of a data reader
    # https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#generating-a-quantized-model-x64
    pass

if __name__ == "__main__":
    input_model_path = "model.onnx"

    # Pre-process the original float32 model.
    preproc_model_path = "model.preproc.onnx"
    model_changed = qnn_preprocess_model(input_model_path, preproc_model_path)
    model_to_quantize = preproc_model_path if model_changed else input_model_path

    # Create the data reader only after model_to_quantize has been chosen.
    my_data_reader = DataReader(model_to_quantize)

    # RELEVANT TO THIS PR:
    # Make sure Conv's weight input is quantized to int8/symmetric/per-channel with axis == 0.
    # The presence of the 'axis' key indicates that this is a per-channel quantized weight.
    init_overrides = {'weight': [{'axis': 0, 'quant_type': QuantType.QInt8, 'symmetric': True}]}

    qnn_config = get_qnn_qdq_config(model_to_quantize,
                                    my_data_reader,
                                    init_overrides=init_overrides,
                                    activation_type=QuantType.QUInt16, # uint16 activations
                                    weight_type=QuantType.QUInt8)      # uint8 weights by default

    quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```

float32 model:
<img width="683" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/ca650e49-1ad0-47d8-8c46-17fbc224ca39">

QDQ model (per-channel Conv weight):
<img width="748" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/6bd469f2-968b-4d11-9526-09b3e71f98e7">

### Motivation and Context
Support more models, especially models with int4 quantized weights.
2024-04-16 08:45:35 -07:00
George Wu
08d208b969
[QNN EP] refactor QNN deps/copy logic. start copying deps to target python loc… (#20317)
Copy QNN deps when building the Python bindings as well.
Tweak the wildcard to copy only QNN-related files. The latest SDK from
Qualcomm (>= 2.21) also includes SNPE DLLs, which we don't want to
include.
2024-04-15 22:33:12 -07:00
Yi Zhang
caf692e626
[Fix] Random connection exceptions in MacOS_C_API_Packaging_CPU stage (#20322)
### Description
Add download_deps to reduce downloading from 3rd party websites.


### Motivation and Context
Fix frequent random exceptions such as:
```
CMake Error at abseil_cpp-subbuild/abseil_cpp-populate-prefix/src/abseil_cpp-populate-stamp/download-abseil_cpp-populate.cmake:162 (message):
  Each download failed!

    error: downloading 'https://github.com/abseil/abseil-cpp/archive/refs/tags/20240116.0.zip' failed
          status_code: 35
          status_string: "SSL connect error"
          log:
          --- LOG BEGIN ---
            Trying 20.29.134.23:443...

  Connected to github.com (20.29.134.23) port 443

  ALPN: curl offers h2,http/1.1

  (304) (OUT), TLS handshake, Client hello (1):

  [315 bytes data]

   CAfile: /etc/ssl/cert.pem
   CApath: none

  Recv failure: Operation timed out

  LibreSSL/3.3.6: error:02FFF03C:system library:func(4095):Operation timed
  out

  Closing connection
```

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=443278&view=logs&j=006a7a04-d43b-5fe1-df02-ecafb79c4d6e&t=110edd38-9f3b-50cf-b328-8ed0f915e5c1

---------

Co-authored-by: Yi Zhang <your@email.com>
2024-04-16 13:28:18 +08:00
Wanming Lin
fe1c3a45c1
[WebNN EP] Support NPU deviceType (#20278) 2024-04-15 18:43:46 -07:00
Edward Chen
287ecea2f1
Fix binary size check build publish step. (#20298)
Add `--user` option to pip install command.

Error:
```
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/usr/local/bin/f2py'
Consider using the `--user` option or check the permissions.
```

See #19877.
2024-04-15 10:15:42 -07:00
kunal-vaishnavi
6e4516cef9
Fix parity checker in LLaMA scripts (#20301)
### Description
This PR fixes the parity checker in the LLaMA scripts by adding the
following.
- Enable buffer sharing manually with `use_buffer_share` instead of
`use_gqa`
- Get max sequence length from model's config


### Motivation and Context
This PR fixes an issue with running the parity checker on other
large-language models where `GroupQueryAttention` can be used without
buffer sharing enabled.
2024-04-14 17:59:14 -07:00
Xiang Zhang
bf72f996e3
Enrich logic to fuse rotaryembedding with bias and support partial RE fusion into GQA (#20300)
### Description
This PR mainly focuses on adding two functionalities:
1. Fuse RotaryEmbedding op taking output from previous layers with bias
enabled.

> Matmul->RotaryEmbedding    ----->  Matmul->Add->RotaryEmbedding

2. Fuse GQA op for partial RotaryEmbedding applied in phi-2.

        # Partial rotary embedding
        query_rot, query_pass = (
            query_states[..., : self.rotary_emb.dim],
            query_states[..., self.rotary_emb.dim :],
        )
        key_rot, key_pass = (
            key_states[..., : self.rotary_emb.dim],
            key_states[..., self.rotary_emb.dim :],
        )
        # [batch_size, seq_length, num_heads, head_dim // config.partial_rotary_factor]
        query_rot, key_rot = apply_rotary_pos_emb(query_rot, key_rot, cos, sin, position_ids)

        # [batch_size, seq_length, num_heads, head_dim]
        query_states = torch.cat((query_rot, query_pass), dim=-1)
        key_states = torch.cat((key_rot, key_pass), dim=-1)
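
The split-rotate-concatenate pattern above can be sketched on a single head vector in plain Python. This is an illustration only, not the fused kernel or the model's code: `apply_partial_rotary`, its `base` default of 10000, and the half-rotation convention (`rotate_half`) are assumptions chosen for the example, and `rotary_dim` is assumed to be even.

```python
import math


def rotate_half(x):
    """Map (x1, x2) to (-x2, x1), where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_partial_rotary(vec, rotary_dim, position, base=10000.0):
    """Hypothetical sketch: rotate the first rotary_dim entries, pass the rest."""
    rot, passthrough = vec[:rotary_dim], vec[rotary_dim:]
    half = rotary_dim // 2
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(half)]
    # Angles are duplicated so both halves of each rotated pair share one angle.
    angles = [position * f for f in inv_freq] * 2
    cos = [math.cos(a) for a in angles]
    sin = [math.sin(a) for a in angles]
    rotated = [r * c + h * s for r, c, h, s in zip(rot, cos, rotate_half(rot), sin)]
    # Concatenate, mirroring torch.cat((query_rot, query_pass), dim=-1).
    return rotated + passthrough
```

Only the leading `rotary_dim` entries depend on `position`; the trailing entries are returned unchanged, which is the part of the pattern the fusion has to recognize.
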

# Optimized graph

![image](https://github.com/microsoft/onnxruntime/assets/17421593/76fd8576-7e60-41af-9a4f-48d205fc6b56)
2024-04-14 17:43:00 -07:00
George Wu
7ec51f0a13
pin controlnet_aux version to 0.0.7 to fix Big Models Stable Diffusion pipeline failure (#20302)
A conflict in cv2 causes
"    LayerId = cv2.dnn.DictValue
AttributeError: module 'cv2.dnn' has no attribute 'DictValue'
"
controlnet_aux 0.0.8 pulls in a conflicting version of opencv-python, so
pin controlnet_aux to 0.0.7.

The failing pipeline passes with this change:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1348876&view=results
2024-04-15 00:09:15 +08:00
Patrice Vignola
01acc25d9d
[DML EP] Fix the output shapes of nodes with multiple outputs in the graph builder (#20289)
The graph builder currently doesn't assign the correct shapes for
subgraphs that have more than 1 output, where each output comes from
a different node. `nodeOutputShapes` should be a map of shapes (1:1
relationship), not a map of lists of shapes (1:N relationship), since
an output referenced by `arg->Name()` can only have 1 shape.

For example, consider a subgraph where a node has 2
outputs and each output feeds into an elementwise op. Both elementwise
nodes have a `targetIndex` of 0, and we were using this target index to
query their shape, so both outputs queried the same shape. What we need
to do instead is use the `GraphOutputIndex` of the subgraph
to query the correct output shape of the subgraph.
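
The indexing mistake described above can be illustrated with a small Python model. This is a simplified sketch, not the DML graph builder code: the `OutputEdge` class, the node names, and the shapes are all hypothetical, and only the two index names mirror `targetIndex` and `GraphOutputIndex` from the description.

```python
from dataclasses import dataclass


@dataclass
class OutputEdge:
    name: str                # hypothetical name of a value leaving the subgraph
    target_index: int        # output slot on the node that produces the value
    graph_output_index: int  # position of the value among the subgraph outputs


# Two elementwise nodes, each producing its single output in slot 0.
edges = [
    OutputEdge("relu_out", target_index=0, graph_output_index=0),
    OutputEdge("sigmoid_out", target_index=0, graph_output_index=1),
]
subgraph_output_shapes = [(1, 8, 16), (1, 8, 32)]


def shapes_by_target_index(edges, shapes):
    """Buggy lookup: every edge has target_index 0, so all get shapes[0]."""
    return {e.name: shapes[e.target_index] for e in edges}


def shapes_by_graph_output_index(edges, shapes):
    """Fixed lookup: each subgraph output maps to exactly one shape."""
    return {e.name: shapes[e.graph_output_index] for e in edges}
```

In this model, the buggy lookup assigns `(1, 8, 16)` to both outputs, while the fixed lookup gives each output its own shape.
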
2024-04-12 16:34:49 -07:00
Satya Kumar Jandhyala
b33216be4c
[JS/WebGPU] Improve MatMulNBits perf (#19974)
### Description
Improve performance using shared memory


### Motivation and Context
2024-04-12 11:03:05 -07:00