11997 commits

Author SHA1 Message Date
dependabot[bot]
e1499a007a
Bump Sixlabors.ImageSharp from 2.1.7 to 2.1.8 in /csharp/sample/Microsoft.ML.OnnxRuntime.ResNet50v2Sample (#20315)
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.7 to 2.1.8.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.8</h2>
<h2>What's Changed</h2>
<ul>
<li>V2 - Limit Read Palette Indices by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2719">SixLabors/ImageSharp#2719</a></li>
<li>V2 - Clear Pixel Buffers on Decode. by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2717">SixLabors/ImageSharp#2717</a></li>
<li>V2 - Limit all memory allocations in the MemoryAllocator layer by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2715">SixLabors/ImageSharp#2715</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="f21d64188e"><code>f21d641</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2715">#2715</a>
from SixLabors/backport/v2-memlimit</li>
<li><a
href="8f0b4d3e68"><code>8f0b4d3</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2717">#2717</a>
from SixLabors/backport/v2-clear-buffers</li>
<li><a
href="cf9496d284"><code>cf9496d</code></a>
test allocation limits</li>
<li><a
href="3d298db2cd"><code>3d298db</code></a>
Adapt BmpDecoder_ThrowsException_Issue2696 for V2</li>
<li><a
href="a78ce27a2b"><code>a78ce27</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2719">#2719</a>
from SixLabors/backport/v2-check-palette-indices</li>
<li><a
href="e6209147b1"><code>e620914</code></a>
Clamp read palette indices.</li>
<li><a
href="c122185ea0"><code>c122185</code></a>
Clear pixel buffers on decode.</li>
<li><a
href="5c6ec5d6fb"><code>5c6ec5d</code></a>
Limit all allocations</li>
<li>See full diff in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Sixlabors.ImageSharp&package-manager=nuget&previous-version=2.1.7&new-version=2.1.8)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-16 14:20:16 -07:00
Yi-Hong Lyu
6b6a62fb40
Add vectorized AVX512F kernel for ReduceMaximumF32Kernel (#20268)
### Description

This commit introduces a new vectorized AVX512F kernel,
MlasReduceMaximumF32KernelAvx512F, which efficiently computes the
maximum value of the supplied buffer. Additionally, microbenchmarks have
been added for MlasComputeSoftmax (inplace),
MlasReduceMaximumF32KernelAvx, MlasComputeSumExpF32KernelAvx512F, and
MlasComputeSoftmaxOutputF32KernelAvx.
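
The general shape of a vectorized maximum reduction can be sketched in plain Python. This is only an illustration of the technique, not the MLAS kernel itself: the 16-lane width reflects that a 512-bit register holds 16 float32 values, while the function name `reduce_maximum_f32`, the constants, and the choice of starting value are assumptions made for this sketch.

```python
from typing import Sequence

LANES = 16  # a 512-bit register holds 16 float32 values
FLOAT32_LOWEST = -3.4028234663852886e38  # most negative finite float32 (assumed start value)


def reduce_maximum_f32(values: Sequence[float]) -> float:
    """Return the maximum of `values`, mimicking a 16-lane SIMD reduction."""
    # Vector accumulator: one running maximum per lane.
    lanes = [FLOAT32_LOWEST] * LANES
    full = len(values) - (len(values) % LANES)

    # Main loop: fold one full 16-element block into the lanes per step.
    for start in range(0, full, LANES):
        block = values[start:start + LANES]
        lanes = [max(a, b) for a, b in zip(lanes, block)]

    # Horizontal reduction of the 16 lanes to a single scalar.
    result = max(lanes)

    # Scalar tail for the leftover (len % 16) elements.
    for v in values[full:]:
        result = max(result, v)
    return result
```

In a real kernel the main loop would typically be a vector max instruction per block, and the tail could use masked loads instead of scalar code; the sketch only shows the data flow.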

### Motivation and Context

The goal of this commit is to enhance the performance of
ReduceMaximumF32Kernel on CPUs with AVX512F instruction support.

| name | AVX iterations | AVX real_time | AVX cpu_time | AVX512 iterations | AVX512 real_time | AVX512 cpu_time | time_unit |
| -- | -- | -- | -- | -- | -- | -- | -- |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:3/real_time | 271277304 | 2.58095 | 2.58091 | 263338132 | 2.65661 | 2.65661 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:3/real_time | 271220477 | 2.58095 | 2.58095 | 263509929 | 2.65652 | 2.65649 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:3/real_time | 271240587 | 2.58064 | 2.58064 | 263479542 | 2.65671 | 2.65665 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:3/real_time | 271227745 | 2.58083 | 2.58079 | 263402506 | 2.65657 | 2.65657 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:3/real_time | 271255069 | 2.58073 | 2.58071 | 263463858 | 2.65682 | 2.65682 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:3/real_time | 271257174 | 2.58058 | 2.58052 | 263460120 | 2.65682 | 2.65682 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:4/real_time | 174395051 | 4.01401 | 4.01401 | 197330481 | 3.5465 | 3.54636 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:4/real_time | 174645502 | 3.99691 | 3.99691 | 197474831 | 3.54298 | 3.54278 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:4/real_time | 174523308 | 4.01391 | 4.01386 | 197389981 | 3.54518 | 3.54506 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:4/real_time | 174779200 | 3.99874 | 3.99874 | 197519075 | 3.54227 | 3.54209 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:4/real_time | 174642874 | 4.00645 | 4.00641 | 197642101 | 3.54195 | 3.54188 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:4/real_time | 174546754 | 4.0061 | 4.00608 | 197621033 | 3.54296 | 3.54281 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:5/real_time | 162752651 | 4.30119 | 4.30114 | 215552503 | 3.24767 | 3.24752 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:5/real_time | 162717463 | 4.30123 | 4.30116 | 215541082 | 3.24711 | 3.24695 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:5/real_time | 162718819 | 4.3016 | 4.30153 | 215589239 | 3.24725 | 3.24708 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:5/real_time | 162719596 | 4.30151 | 4.30145 | 215563846 | 3.24956 | 3.24949 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:5/real_time | 162753333 | 4.30125 | 4.30125 | 215537315 | 3.24924 | 3.24908 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:5/real_time | 162752258 | 4.3014 | 4.30141 | 215526482 | 3.24744 | 3.24735 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:7/real_time | 143579660 | 4.87526 | 4.87516 | 100000000 | 5.25767 | 5.25752 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:7/real_time | 143585097 | 4.87476 | 4.87467 | 100000000 | 5.41583 | 5.41567 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:7/real_time | 143571011 | 4.87506 | 4.87503 | 182359467 | 3.83773 | 3.83764 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:7/real_time | 143587142 | 4.87487 | 4.8748 | 182397261 | 3.83807 | 3.8379 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:7/real_time | 143578465 | 4.87525 | 4.87521 | 182428602 | 3.83777 | 3.83768 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:7/real_time | 143588555 | 4.87491 | 4.87488 | 125280452 | 5.59791 | 5.59766 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:9/real_time | 284851058 | 2.43476 | 2.43476 | 156879863 | 4.42895 | 4.42884 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:9/real_time | 270700898 | 2.59031 | 2.59024 | 157953114 | 4.42995 | 4.42968 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:9/real_time | 282871172 | 2.45385 | 2.45385 | 157801156 | 4.42817 | 4.42804 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:9/real_time | 285307738 | 2.47009 | 2.47005 | 158058507 | 4.4279 | 4.42786 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:9/real_time | 285709536 | 2.45481 | 2.45476 | 158070961 | 4.42809 | 4.42799 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:9/real_time | 285449733 | 2.47495 | 2.47491 | 158069718 | 4.45026 | 4.45017 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:11/real_time | 189213618 | 3.79684 | 3.79676 | 139459497 | 5.01882 | 5.01871 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:11/real_time | 185600468 | 3.76394 | 3.76376 | 139444892 | 5.01922 | 5.01905 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:11/real_time | 184968668 | 3.80636 | 3.80636 | 139470834 | 5.01948 | 5.01936 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:11/real_time | 183867226 | 3.80432 | 3.80427 | 139481986 | 5.01975 | 5.01944 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:11/real_time | 184301650 | 3.81634 | 3.81634 | 139452846 | 5.01983 | 5.01972 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:11/real_time | 186215795 | 3.82659 | 3.82654 | 139497736 | 5.02119 | 5.02113 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:13/real_time | 135622415 | 5.16256 | 5.16252 | 124661337 | 5.61227 | 5.61194 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:13/real_time | 135618907 | 5.15967 | 5.1596 | 124805224 | 5.6088 | 5.60854 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:13/real_time | 135612192 | 5.15506 | 5.15501 | 124803221 | 5.60901 | 5.60869 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:13/real_time | 135906082 | 5.15818 | 5.15818 | 124776601 | 5.60898 | 5.60886 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:13/real_time | 135369523 | 5.15709 | 5.15682 | 124790370 | 5.60927 | 5.60902 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:13/real_time | 135596827 | 5.1603 | 5.1603 | 124792145 | 5.61637 | 5.61614 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:15/real_time | 110947137 | 5.96511 | 5.96495 | 112861522 | 6.20035 | 6.20014 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:15/real_time | 118004792 | 6.22645 | 6.22628 | 112909900 | 6.20073 | 6.20073 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:15/real_time | 112630319 | 6.25564 | 6.25552 | 112874563 | 6.19932 | 6.19924 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:15/real_time | 117403034 | 6.17263 | 6.17258 | 112927318 | 6.19866 | 6.19842 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:15/real_time | 108921863 | 6.48624 | 6.48612 | 112927746 | 6.20057 | 6.20026 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:15/real_time | 110358148 | 6.66805 | 6.66789 | 112907312 | 6.19938 | 6.19908 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:16/real_time | 203419574 | 3.4415 | 3.44137 | 237134525 | 2.95649 | 2.95638 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:16/real_time | 203414035 | 3.4411 | 3.44099 | 237129564 | 2.95178 | 2.95171 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:16/real_time | 203404068 | 3.44157 | 3.44151 | 236981704 | 2.9518 | 2.95167 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:16/real_time | 203391471 | 3.44146 | 3.44137 | 237108807 | 2.95203 | 2.95196 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:16/real_time | 203393801 | 3.44131 | 3.44127 | 237126460 | 2.95278 | 2.95272 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:16/real_time | 203407476 | 3.44181 | 3.44162 | 237154444 | 2.95293 | 2.9528 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:500/real_time | 37551439 | 18.6407 | 18.6407 | 39222534 | 17.858 | 17.8571 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:500/real_time | 37544097 | 18.6404 | 18.6401 | 39174151 | 17.8539 | 17.8536 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:500/real_time | 37549837 | 18.6391 | 18.6391 | 39233956 | 17.8507 | 17.8505 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:500/real_time | 45996345 | 15.2157 | 15.2153 | 39285929 | 17.848 | 17.8474 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:500/real_time | 46012429 | 15.2184 | 15.2179 | 65664865 | 10.7366 | 10.7364 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:500/real_time | 45912375 | 15.2349 | 15.2346 | 65205908 | 10.8498 | 10.8492 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:2000/real_time | 9493955 | 73.7232 | 73.7203 | 10188090 | 68.7931 | 68.7908 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:2000/real_time | 9495562 | 73.7173 | 73.7173 | 10180895 | 68.7533 | 68.7511 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:2000/real_time | 9487371 | 73.7852 | 73.7831 | 10164473 | 68.7279 | 68.725 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:2000/real_time | 10816047 | 64.7322 | 64.7287 | 10168481 | 68.8109 | 68.8096 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:2000/real_time | 10808802 | 64.7232 | 64.721 | 19478320 | 36.1471 | 36.1461 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:2000/real_time | 10818192 | 64.7304 | 64.728 | 19419672 | 35.9635 | 35.9635 | ns |
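
One way to read the benchmark numbers above is as a ratio of AVX `real_time` to AVX512 `real_time` for the same configuration. The snippet below is a small illustrative calculation on a handful of rows copied from the table; the helper name `speedup` and the selection of rows are choices made for this sketch.

```python
# (configuration, AVX real_time in ns, AVX512 real_time in ns),
# copied from the ByteAligned:64 rows of the table above.
samples = [
    ("ByteAligned:64/D:4", 4.00645, 3.54195),
    ("ByteAligned:64/D:9", 2.45481, 4.42809),
    ("ByteAligned:64/D:16", 3.44131, 2.95278),
    ("ByteAligned:64/D:500", 15.2184, 10.7366),
    ("ByteAligned:64/D:2000", 64.7232, 36.1471),
]


def speedup(avx_ns: float, avx512_ns: float) -> float:
    """Ratio above 1.0 means the AVX512 run was faster than the AVX run."""
    return avx_ns / avx512_ns


for name, avx_ns, avx512_ns in samples:
    print(f"{name}: {speedup(avx_ns, avx512_ns):.2f}x")
```

For these rows the ratio is above 1 for D:4, D:16, D:500, and D:2000, and below 1 for D:9, where the AVX512 measurement is the slower one.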
2024-04-16 13:52:43 -07:00
Yifan Li
54f91ea65a
[TensorRT EP] support user_compute_stream in python API (#20168)
### Description

* Implement the `user_compute_stream` Python API for TensorRT EP
* Using this option implicitly sets `has_user_compute_stream` to
`true`
* Extend the existing TRT EP unit test to verify the `user_compute_stream` option
* This has been verified in a local PyTorch environment by passing the handle of a
`torch.cuda.Stream()` into `user_compute_stream`:
```python
...
# Before inference
if torch.cuda.is_available():
    s = torch.cuda.Stream()
    option = {"user_compute_stream": str(s.cuda_stream)}
    sess.set_providers(["TensorrtExecutionProvider"], [option])
    options = sess.get_provider_options()

    assert "TensorrtExecutionProvider" in options
    assert options["TensorrtExecutionProvider"].get("user_compute_stream", "") == str(s.cuda_stream)
    assert options["TensorrtExecutionProvider"].get("has_user_compute_stream", "") == "1"
...
```
### Motivation and Context
Align with existing `user_compute_stream` python implementations for
[CUDA EP](https://github.com/microsoft/onnxruntime/pull/19229)/[ROCm
EP](https://github.com/microsoft/onnxruntime/pull/19619)
2024-04-16 12:49:29 -07:00
dependabot[bot]
e02aef1ded
Bump transformers from 4.36.0 to 4.38.0 in /onnxruntime/python/tools/transformers/models/stable_diffusion (#20271)
Bumps [transformers](https://github.com/huggingface/transformers) from
4.36.0 to 4.38.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/huggingface/transformers/releases">transformers's
releases</a>.</em></p>
<blockquote>
<h2>v4.38: Gemma, Depth Anything, Stable LM; Static Cache, HF Quantizer,
AQLM</h2>
<h2>New model additions</h2>
<h3>💎 Gemma 💎</h3>
<p>Gemma is a new opensource Language Model series from Google AI that
comes with a 2B and 7B variant. The release comes with the pre-trained
and instruction fine-tuned versions and you can use them via
<code>AutoModelForCausalLM</code>, <code>GemmaForCausalLM</code> or
<code>pipeline</code> interface!</p>
<p>Read more about it in the Gemma release blogpost: <a
href="https://hf.co/blog/gemma">https://hf.co/blog/gemma</a></p>
<pre lang="python"><code>import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)
model = AutoModelForCausalLM.from_pretrained(&quot;google/gemma-2b&quot;, device_map=&quot;auto&quot;, torch_dtype=torch.float16)
input_text = &quot;Write me a poem about Machine Learning.&quot;
input_ids = tokenizer(input_text, return_tensors=&quot;pt&quot;).to(&quot;cuda&quot;)
outputs = model.generate(**input_ids)
</code></pre>
<p>You can use the model with Flash Attention, SDPA, Static cache and
quantization API for further optimizations !</p>
<ul>
<li>Flash Attention 2</li>
</ul>
<pre lang="python"><code>import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)
model = AutoModelForCausalLM.from_pretrained(
    &quot;google/gemma-2b&quot;, device_map=&quot;auto&quot;,
    torch_dtype=torch.float16,
    attn_implementation=&quot;flash_attention_2&quot;
)
input_text = &quot;Write me a poem about Machine Learning.&quot;
input_ids = tokenizer(input_text, return_tensors=&quot;pt&quot;).to(&quot;cuda&quot;)
outputs = model.generate(**input_ids)
</code></pre>
<ul>
<li>bitsandbytes-4bit</li>
</ul>
<pre lang="python"><code>from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained(&quot;google/gemma-2b&quot;)
model = AutoModelForCausalLM.from_pretrained(
    &quot;google/gemma-2b&quot;, device_map=&quot;auto&quot;,
    load_in_4bit=True
)
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="08ab54ada5"><code>08ab54a</code></a>
[ <code>gemma</code>] Adds support for Gemma 💎 (<a
href="https://redirect.github.com/huggingface/transformers/issues/29167">#29167</a>)</li>
<li><a
href="2de9314197"><code>2de9314</code></a>
[<code>Maskformer</code>] safely get backbone config (<a
href="https://redirect.github.com/huggingface/transformers/issues/29166">#29166</a>)</li>
<li><a
href="476957b5b4"><code>476957b</code></a>
🚨 Llama: update rope scaling to match static cache changes (<a
href="https://redirect.github.com/huggingface/transformers/issues/29143">#29143</a>)</li>
<li><a
href="7a4bec6e8f"><code>7a4bec6</code></a>
Release: 4.38.0</li>
<li><a
href="ee3af60be0"><code>ee3af60</code></a>
Add support for fine-tuning CLIP-like models using
contrastive-image-text exa...</li>
<li><a
href="0996a10077"><code>0996a10</code></a>
Revert low cpu mem tie weights (<a
href="https://redirect.github.com/huggingface/transformers/issues/29135">#29135</a>)</li>
<li><a
href="15cfe38942"><code>15cfe38</code></a>
[<code>Core tokenization</code>] <code>add_dummy_prefix_space</code>
option to help with latest is...</li>
<li><a
href="efdd436663"><code>efdd436</code></a>
FIX [<code>PEFT</code> / <code>Trainer</code> ] Handle better peft +
quantized compiled models (<a
href="https://redirect.github.com/huggingface/transformers/issues/29">#29</a>...</li>
<li><a
href="5e95dcabe1"><code>5e95dca</code></a>
[<code>cuda kernels</code>] only compile them when initializing (<a
href="https://redirect.github.com/huggingface/transformers/issues/29133">#29133</a>)</li>
<li><a
href="a7755d2409"><code>a7755d2</code></a>
Generate: unset GenerationConfig parameters do not raise warning (<a
href="https://redirect.github.com/huggingface/transformers/issues/29119">#29119</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/huggingface/transformers/compare/v4.36.0...v4.38.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=transformers&package-manager=pip&previous-version=4.36.0&new-version=4.38.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-16 11:48:54 -07:00
Adrian Lizarraga
f644ff9fc0
[QNN EP] Support per-channel quantized weights (#20154)
### Description
- Add general support for per-channel quantized weights to QNN EP (HTP
backend).
- Add QNN EP unit tests for per-channel Conv
- Update quantization tool to allow selecting which ops are quantized
per-channel (and which axis) via tensor-level overrides. Currently,
setting `per_channel=True` assumes all Convs, MatMuls, Gemms,
InstanceNormalization, and LayerNormalization ops should be quantized
per-channel using some assumed default axis.

#### Creating QDQ per-channel Conv model example
```python
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model

class DataReader(CalibrationDataReader):
    # TODO: See ONNX Runtime QNN docs for example of a data reader
    # https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#generating-a-quantized-model-x64
    pass

if __name__ == "__main__":
    input_model_path = "model.onnx"

    # Pre-process the original float32 model.
    preproc_model_path = "model.preproc.onnx"
    model_changed = qnn_preprocess_model(input_model_path, preproc_model_path)
    model_to_quantize = preproc_model_path if model_changed else input_model_path

    # Create the data reader only after `model_to_quantize` has been defined.
    my_data_reader = DataReader(model_to_quantize)

    # RELEVANT TO THIS PR:
    # Make sure Conv's weight input is quantized to int8/symmetric/per-channel with axis == 0.
    # The presence of the 'axis' key indicates that this is a per-channel quantized weight.
    init_overrides = {'weight': [{'axis': 0, 'quant_type': QuantType.QInt8, 'symmetric': True}]}

    qnn_config = get_qnn_qdq_config(model_to_quantize,
                                    my_data_reader,
                                    init_overrides=init_overrides,
                                    activation_type=QuantType.QUInt16, # uint16 activations
                                    weight_type=QuantType.QUInt8)      # uint8 weights by default

    quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
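
To make "int8/symmetric/per-channel with axis == 0" concrete, here is a minimal pure-Python sketch of the idea: each slice along axis 0 gets its own scale, and the zero point is 0. It is a textbook formulation under stated assumptions (the `max_abs / 127` scale and the function name `quantize_per_channel_int8` are choices made here), not the exact arithmetic used by the quantization tool or by QNN.

```python
from typing import List, Tuple


def quantize_per_channel_int8(
    weight: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per slice along axis 0."""
    quantized: List[List[int]] = []
    scales: List[float] = []
    for channel in weight:  # axis 0: one output channel per row
        max_abs = max((abs(v) for v in channel), default=0.0)
        # Symmetric: the zero point is 0 and the scale covers [-max_abs, max_abs].
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q = [max(-127, min(127, round(v / scale))) for v in channel]
        quantized.append(q)
        scales.append(scale)
    return quantized, scales


# Two channels with very different magnitudes each use the full int8 range.
q, scales = quantize_per_channel_int8([[1.0, -0.25], [10.0, 2.5]])
```

With a single per-tensor scale, the small-magnitude channel would share the scale chosen for the large one and lose resolution, which is the usual motivation for per-channel weights.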

float32 model:
<img width="683" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/ca650e49-1ad0-47d8-8c46-17fbc224ca39">

QDQ model (per-channel Conv weight):
<img width="748" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/6bd469f2-968b-4d11-9526-09b3e71f98e7">

### Motivation and Context
Support more models, especially models with int4 quantized weights.
2024-04-16 08:45:35 -07:00
George Wu
08d208b969
[QNN EP] refactor QNN deps/copy logic. start copying deps to target python loc… (#20317)
Copy QNN deps when building the Python bindings as well.
Tweak the wildcard to copy only QNN-related files. The latest SDK from
Qualcomm (>= 2.21) also includes SNPE DLLs, which we don't want to
include.
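
The effect of narrowing a wildcard can be sketched with Python's `fnmatch`. The file names and patterns below are hypothetical and chosen only for illustration; the actual wildcard used by the build scripts is not shown in this commit message.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence


def select_qnn_files(
    names: Iterable[str],
    patterns: Sequence[str] = ("Qnn*.dll", "libQnn*.so"),  # hypothetical patterns
) -> List[str]:
    """Keep only the file names that match one of the QNN-specific patterns."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# Hypothetical SDK directory listing containing both QNN and SNPE libraries.
sdk_files = ["QnnHtp.dll", "QnnSystem.dll", "SNPE.dll", "libQnnCpu.so", "readme.txt"]
to_copy = select_qnn_files(sdk_files)
```

A broad pattern such as `*.dll` would also pick up `SNPE.dll` in this listing, while the narrower patterns leave it out.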
2024-04-15 22:33:12 -07:00
Yi Zhang
caf692e626
[Fix] Random connection exceptions in MacOS_C_API_Packaging_CPU stage (#20322)
### Description
Add download_deps to reduce downloading from 3rd party websites.


### Motivation and Context
Fix frequent random exceptions such as:
```
CMake Error at abseil_cpp-subbuild/abseil_cpp-populate-prefix/src/abseil_cpp-populate-stamp/download-abseil_cpp-populate.cmake:162 (message):
  Each download failed!

    error: downloading 'https://github.com/abseil/abseil-cpp/archive/refs/tags/20240116.0.zip' failed
          status_code: 35
          status_string: "SSL connect error"
          log:
          --- LOG BEGIN ---
            Trying 20.29.134.23:443...

  Connected to github.com (20.29.134.23) port 443

  ALPN: curl offers h2,http/1.1

  (304) (OUT), TLS handshake, Client hello (1):

  [315 bytes data]

   CAfile: /etc/ssl/cert.pem
   CApath: none

  Recv failure: Operation timed out

  LibreSSL/3.3.6: error:02FFF03C:system library:func(4095):Operation timed
  out

  Closing connection
```

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=443278&view=logs&j=006a7a04-d43b-5fe1-df02-ecafb79c4d6e&t=110edd38-9f3b-50cf-b328-8ed0f915e5c1

---------

Co-authored-by: Yi Zhang <your@email.com>
2024-04-16 13:28:18 +08:00
Wanming Lin
fe1c3a45c1
[WebNN EP] Support NPU deviceType (#20278)
2024-04-15 18:43:46 -07:00
Edward Chen
287ecea2f1
Fix binary size check build publish step. (#20298)
Add `--user` option to pip install command.

Error:
```
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/usr/local/bin/f2py'
Consider using the `--user` option or check the permissions.
```

See #19877.
2024-04-15 10:15:42 -07:00
kunal-vaishnavi
6e4516cef9
Fix parity checker in LLaMA scripts (#20301)
### Description
This PR fixes the parity checker in the LLaMA scripts with the
following changes.
- Enable buffer sharing manually with `use_buffer_share` instead of
`use_gqa`
- Get the max sequence length from the model's config


### Motivation and Context
This PR fixes an issue with running the parity checker on other
large-language models where `GroupQueryAttention` can be used without
buffer sharing enabled.
2024-04-14 17:59:14 -07:00
Xiang Zhang
bf72f996e3
Enrich logic to fuse rotaryembedding with bias and support partial RE fusion into GQA (#20300)
### Description
This PR mainly focuses on adding two functionalities:
1. Fuse RotaryEmbedding op taking output from previous layers with bias
enabled.

> Matmul->RotaryEmbedding    ----->  Matmul->Add->RotaryEmbedding

2. Fuse GQA op for partial RotaryEmbedding applied in phi-2.

>     # Partial rotary embedding
>     query_rot, query_pass = (
>         query_states[..., : self.rotary_emb.dim],
>         query_states[..., self.rotary_emb.dim :],
>     )
>     key_rot, key_pass = (
>         key_states[..., : self.rotary_emb.dim],
>         key_states[..., self.rotary_emb.dim :],
>     )
>     # [batch_size, seq_length, num_heads, head_dim // config.partial_rotary_factor]
>     query_rot, key_rot = apply_rotary_pos_emb(query_rot, key_rot, cos, sin, position_ids)
>
>     # [batch_size, seq_length, num_heads, head_dim]
>     query_states = torch.cat((query_rot, query_pass), dim=-1)
>     key_states = torch.cat((key_rot, key_pass), dim=-1)

# Optimized graph

![image](https://github.com/microsoft/onnxruntime/assets/17421593/76fd8576-7e60-41af-9a4f-48d205fc6b56)
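
The partial rotary embedding quoted above can be illustrated on a single head vector without any tensor library. This is a simplified sketch: it assumes the common "rotate half" formulation (`x * cos + rotate_half(x) * sin`) and an even `rotary_dim`, and the helper names are made up for this example rather than taken from the fusion code.

```python
from typing import List, Sequence


def rotate_half(x: Sequence[float]) -> List[float]:
    """Swap the two halves of x and negate the new first half."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rotary(x: Sequence[float], cos: Sequence[float], sin: Sequence[float]) -> List[float]:
    """Element-wise x * cos + rotate_half(x) * sin."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


def partial_rotary(
    states: Sequence[float], rotary_dim: int, cos: Sequence[float], sin: Sequence[float]
) -> List[float]:
    """Rotate only the first rotary_dim elements; pass the rest through unchanged."""
    rot, passthrough = states[:rotary_dim], states[rotary_dim:]
    return apply_rotary(rot, cos, sin) + list(passthrough)


# head_dim = 6, but only the first 4 elements take part in the rotation.
head = [1, 2, 3, 4, 5, 6]
unchanged = partial_rotary(head, 4, cos=[1, 1, 1, 1], sin=[0, 0, 0, 0])
quarter_turn = partial_rotary(head, 4, cos=[0, 0, 0, 0], sin=[1, 1, 1, 1])
```

The last two elements come out untouched in both calls, which corresponds to the `query_pass` / `key_pass` slices being concatenated back after the rotation.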
2024-04-14 17:43:00 -07:00
George Wu
7ec51f0a13
pin controlnet_aux version to 0.0.7 to fix Big Models Stable Diffusion pipeline failure (#20302)
A conflict in cv2 causes the following error:
"    LayerId = cv2.dnn.DictValue
AttributeError: module 'cv2.dnn' has no attribute 'DictValue'
"
controlnet_aux 0.0.8 pulls in a conflicting version of opencv-python, so
pin it to 0.0.7.

failing pipeline passes with this change:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1348876&view=results
2024-04-15 00:09:15 +08:00
Patrice Vignola
01acc25d9d
[DML EP] Fix the output shapes of nodes with multiple outputs in the graph builder (#20289)
The graph builder currently doesn't assign the correct shapes for
subgraphs that have more than 1 output, where each output comes from
a different node. `nodeOutputShapes` should be a map of shapes (1:1
relationship), not a map of lists of shapes (1:N relationship), since
an output referenced by `arg->Name()` can only have 1 shape.

Take, for example, a subgraph where a node has 2
outputs and each output feeds into an elementwise op. Both nodes will
have a `targetIndex` of 0, and we were using this target index to query
their shape, resulting in both outputs querying the same shape. In
reality, what we need to do is use the `GraphOutputIndex` of the subgraph
to query the correct output shape of the subgraph.
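
The difference between the two lookups can be sketched with a toy model in Python. Everything here is hypothetical (the output names, the shapes, and both helper functions) and does not mirror the actual C++ data structures; it only shows why indexing by a node-local output index collapses two distinct graph outputs onto one shape.

```python
from typing import Dict, List, Tuple

# Hypothetical subgraph outputs, ordered by graph output index. Each one is
# produced by a different elementwise node, at that node's output index 0.
graph_outputs: List[Tuple[str, int]] = [
    ("relu_a_out", 0),  # (output name, node-local output index)
    ("relu_b_out", 0),
]

# 1:1 map from output name to its shape.
shape_by_name: Dict[str, List[int]] = {
    "relu_a_out": [2, 3],
    "relu_b_out": [2, 5],
}


def shapes_by_target_index(outputs, shapes) -> List[List[int]]:
    """Problematic lookup: use the node-local index to pick a shape."""
    ordered = list(shapes.values())
    return [ordered[target_index] for _, target_index in outputs]


def shapes_by_graph_output(outputs, shapes) -> List[List[int]]:
    """Per-output lookup: each graph output resolves to its own shape."""
    return [shapes[name] for name, _ in outputs]


wrong = shapes_by_target_index(graph_outputs, shape_by_name)
right = shapes_by_graph_output(graph_outputs, shape_by_name)
```

In this toy model the first lookup returns the same shape twice because both node-local indices are 0, while the second returns one shape per graph output.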
2024-04-12 16:34:49 -07:00
Satya Kumar Jandhyala
b33216be4c
[JS/WebGPU] Improve MatMulNBits perf (#19974)
### Description
Improve performance using shared memory


### Motivation and Context
2024-04-12 11:03:05 -07:00
Changming Sun
794d39a977
LLVM16 compat changes (#20294)
The change is similar to #15672 and #11667, for making the code compatible with CUDA 12 and LLVM16 on Mariner2.
2024-04-12 10:16:12 -07:00
liqun Fu
cd7112f800
Integration with ONNX 1.16.0 (#19745)
### Description
Update to the ONNX 1.16.0 branch according to
https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md

ONNX 1.16.0 release notes:
https://github.com/onnx/onnx/releases/tag/v1.16.0

#### Updated ops for CPU EP:
- DequantizeLinear(21)
  - Added int16 and uint16 support + various optimizer tests
  - Missing int4 and uint4 support
  - Missing block dequantization support
- QuantizeLinear(21)
  - Added int16 and uint16 support + various optimizer tests
  - Missing int4 and uint4 support
  - Missing block quantization support
- Cast(21)
  - Missing int4 and uint4 support
- CastLike(21)
  - Missing int4 and uint4 support
- ConstantOfShape(21)
  - Missing int4 and uint4 support
- Identity(21)
  - Missing int4 and uint4 support
- If(21)
  - Missing int4 and uint4 support
- Loop(21)
  - Missing int4 and uint4 support
- Reshape(21)
  - Missing int4 and uint4 support
- Scan(21)
  - Missing int4 and uint4 support
- Shape(21)
  - Missing int4 and uint4 support
- Size(21)
  - Missing int4 and uint4 support
- Flatten(21)
  - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Pad(21)
  - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Squeeze(21)
  - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Transpose(21)
  - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Unsqueeze(21)
  - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
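
As background for the int16 and uint16 support mentioned under QuantizeLinear and DequantizeLinear, the per-tensor arithmetic can be sketched in plain Python. This follows the operator definitions in the ONNX specification (round to nearest even, add the zero point, saturate to the integer range; dequantize as `(x - zero_point) * scale`), but the function names and the list-based interface are assumptions of this sketch and not ONNX Runtime's implementation.

```python
from typing import List, Sequence

INT_RANGES = {"int16": (-32768, 32767), "uint16": (0, 65535)}


def quantize_linear(x: Sequence[float], scale: float, zero_point: int, dtype: str) -> List[int]:
    """y = saturate(round(x / scale) + zero_point) for a 16-bit integer type."""
    lo, hi = INT_RANGES[dtype]
    # Python's round() rounds halves to the nearest even integer.
    return [min(hi, max(lo, round(v / scale) + zero_point)) for v in x]


def dequantize_linear(q: Sequence[int], scale: float, zero_point: int) -> List[float]:
    """y = (x - zero_point) * scale."""
    return [(v - zero_point) * scale for v in q]


quantized = quantize_linear([0.0, 1.0, 100000.0], scale=0.5, zero_point=32768, dtype="uint16")
restored = dequantize_linear(quantized[:2], scale=0.5, zero_point=32768)
```

The third input is far outside the representable range for this scale, so it saturates at the top of the uint16 range instead of wrapping around.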

#### Unimplemented opset 21 features/ops
- int4 and uint4 data type
- QLinearMatMul(21)
- GroupNormalization(21)
- ai.onnx.ml.TreeEnsemble(5)

### Motivation and Context

### Disabled tests
#### ORT Training

orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py
- test_ort_custom_ops: Potential shape inference bug for custom ops

#### Python quantization unit tests
test/onnx/python/quantization (shape inference bug)
- test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16
- test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16
- test_op_gemm.py: test_quantize_qop_gemm_s8s8
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3
- test_op_matmul.py: test_quantize_matmul_u8u8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution
- test_op_relu.py: test_quantize_qop_relu_s8s8

#### ONNX tests
- test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a
maxpool output size bug and added this test. Enable this test when [ORT
PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged.
Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741).
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op
ai.onnx.ml.TreeEnsemble
- test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same
- test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same
- test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same
- test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4
yet
- test_cast_INT4_to_INT8_cpu: same
- test_cast_UINT4_to_FLOAT_cpu: same
- test_cast_UINT4_to_UINT8_cpu: same
- test_cast_INT4_to_FLOAT_cuda
- test_cast_INT4_to_INT8_cuda
- test_cast_UINT4_to_FLOAT_cuda
- test_cast_UINT4_to_UINT8_cuda
- test_constantofshape_float_ones_cuda: ConstantOfShape(21) not
implemented for cuda
- test_constantofshape_int_shape_zero_cuda: same
- test_constantofshape_int_zeros_cuda: same
- test_flatten_axis0_cuda: Flatten(21) not implemented for cuda
- test_flatten_axis1_cuda: same
- test_flatten_axis2_cuda: same
- test_flatten_axis3_cuda: same
- test_flatten_default_axis_cuda: same
- test_flatten_negative_axis1_cuda: same
- test_flatten_negative_axis2_cuda: same
- test_flatten_negative_axis3_cuda: same
- test_flatten_negative_axis4_cuda: same
- test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not
implemented in ORT yet
- test_qlinearmatmul_2D_int8_float32_cpu: same
- test_qlinearmatmul_2D_uint8_float16_cpu: same
- test_qlinearmatmul_2D_uint8_float32_cpu: same
- test_qlinearmatmul_3D_int8_float16_cpu: same
- test_qlinearmatmul_3D_int8_float32_cpu: same
- test_qlinearmatmul_3D_uint8_float16_cpu: same
- test_qlinearmatmul_3D_uint8_float32_cpu: same
- test_qlinearmatmul_2D_int8_float16_cuda: same
- test_qlinearmatmul_2D_int8_float32_cuda: same
- test_qlinearmatmul_2D_uint8_float16_cuda: same
- test_qlinearmatmul_2D_uint8_float32_cuda: same
- test_qlinearmatmul_3D_int8_float16_cuda: same
- test_qlinearmatmul_3D_int8_float32_cuda: same
- test_qlinearmatmul_3D_uint8_float16_cuda: same
- test_qlinearmatmul_3D_uint8_float32_cuda: same
- test_size_cuda: Size(21) not implemented for cuda
- test_size_example_cuda: same
- test_dequantizelinear_blocked: Missing implementation for block
dequant for DequantizeLinear(21)
- test_quantizelinear_blocked_asymmetric: Missing implementation for
block quant for QuantizeLinear(21)
- test_quantizelinear_blocked_symmetric: Missing implementation for
block quant for QuantizeLinear(21)

---------

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
2024-04-12 09:46:49 -07:00
Patrice Vignola
a0d5067341
[DML EP] Move operators to feature level 6.4 (#20290)
Those operators won't be in the next version of DML, but they will come
in the version right after.
2024-04-12 00:02:27 -07:00
Adrian Lizarraga
327fb1fde3
[QNN EP] Use QNN's ResizeBilinear operator for specific configs of ONNX Resize (#20292)
### Description
Uses QNN's ResizeBilinear operator for ONNX Resize with:
- input rank: 4
- mode: linear
- coordinate transformation mode: half_pixel, align_corners, or
asymmetric


#### Mapping matrix of ONNX Resize w/ "linear" mode on HTP backend.
Table entries correspond to the QNN operator used for the given
configuration
(Resize = QNN Resize op, RBL = QNN ResizeBilinear op, X = Unsupported).

| coordinate_transformation_mode | input_rank < 3 | input_rank = 3 | input_rank = 4 | input_rank = 5 | input_rank > 5 |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| half_pixel | X | Resize | RBL | Resize | X |
| pytorch_half_pixel | X | Resize | Resize | Resize | X |
| align_corners | X | Resize | RBL | Resize | X |
| asymmetric | X | Resize | RBL | Resize | X |
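
The matrix above can be read as a small lookup. The Python sketch below only restates the table; it is not the QNN EP's actual C++ logic, and the function name `select_qnn_resize_op` and the returned string labels are hypothetical.

```python
from typing import Optional

# Modes for which the table lists ResizeBilinear (RBL) at input rank 4.
_RBL_MODES = {"half_pixel", "align_corners", "asymmetric"}
# Every coordinate_transformation_mode row that appears in the table.
_TABLE_MODES = _RBL_MODES | {"pytorch_half_pixel"}


def select_qnn_resize_op(input_rank: int, coord_mode: str) -> Optional[str]:
    """Return the QNN op for ONNX Resize (mode="linear") on HTP, or None if unsupported."""
    if coord_mode not in _TABLE_MODES:
        return None  # mode is not covered by the table
    if input_rank < 3 or input_rank > 5:
        return None  # the "X" columns
    if input_rank == 4 and coord_mode in _RBL_MODES:
        return "ResizeBilinear"
    return "Resize"
```

For example, `select_qnn_resize_op(4, "pytorch_half_pixel")` stays on `"Resize"`, since that row never maps to RBL.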



### Motivation and Context
QNN's ResizeBilinear operator seems to perform better (lower latency)
than QNN's Resize operator for certain configurations.
2024-04-11 22:38:55 -07:00
dependabot[bot]
9ca1afa25c
Bump protobufjs from 7.2.4 to 7.2.5 in /js/web (#20270)
Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) from 7.2.4
to 7.2.5.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/protobufjs/protobuf.js/releases">protobufjs's
releases</a>.</em></p>
<blockquote>
<h2>protobufjs: v7.2.5</h2>
<h2><a
href="https://github.com/protobufjs/protobuf.js/compare/protobufjs-v7.2.4...protobufjs-v7.2.5">7.2.5</a>
(2023-08-21)</h2>
<h3>Bug Fixes</h3>
<ul>
<li>crash in comment parsing (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1890">#1890</a>)
(<a
href="eaf9f0a5a4">eaf9f0a</a>)</li>
<li>deprecation warning for new Buffer (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1905">#1905</a>)
(<a
href="e93286ef70">e93286e</a>)</li>
<li>possible infinite loop when parsing option (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1923">#1923</a>)
(<a
href="f2a8620179">f2a8620</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/protobufjs/protobuf.js/blob/master/CHANGELOG.md">protobufjs's
changelog</a>.</em></p>
<blockquote>
<h2><a
href="https://github.com/protobufjs/protobuf.js/compare/protobufjs-v7.2.4...protobufjs-v7.2.5">7.2.5</a>
(2023-08-21)</h2>
<h3>Bug Fixes</h3>
<ul>
<li>crash in comment parsing (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1890">#1890</a>)
(<a
href="eaf9f0a5a4">eaf9f0a</a>)</li>
<li>deprecation warning for new Buffer (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1905">#1905</a>)
(<a
href="e93286ef70">e93286e</a>)</li>
<li>possible infinite loop when parsing option (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1923">#1923</a>)
(<a
href="f2a8620179">f2a8620</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="4436cc748c"><code>4436cc7</code></a>
chore: release master (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1925">#1925</a>)</li>
<li><a
href="e93286ef70"><code>e93286e</code></a>
fix: deprecation warning for new Buffer (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1905">#1905</a>)</li>
<li><a
href="eaf9f0a5a4"><code>eaf9f0a</code></a>
fix: crash in comment parsing (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1890">#1890</a>)</li>
<li><a
href="f2a8620179"><code>f2a8620</code></a>
fix: possible infinite loop when parsing option (<a
href="https://redirect.github.com/protobufjs/protobuf.js/issues/1923">#1923</a>)</li>
<li>See full diff in <a
href="https://github.com/protobufjs/protobuf.js/compare/protobufjs-v7.2.4...protobufjs-v7.2.5">compare
view</a></li>
</ul>
</details>
<br />



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-11 22:07:08 -07:00
Wanming Lin
667d2eb8e6
[WebNN EP] Support Gelu op (#20240)
2024-04-11 19:55:44 -07:00
Patrice Vignola
12042a9387
[DML] Add FastGelu (#20066)
Although DML doesn't have a "fast" gelu approximation operator, its
standard GELU operator is still faster than having to combine all the
separate elementwise operators from different ops.
2024-04-11 14:40:28 -07:00
Yulong Wang
50bd4571ac
[js/web] support SimplifiedLayerNorm and SkipSimplifiedLayerNorm (#20277)
### Description
Support operator `SimplifiedLayerNorm` and `SkipSimplifiedLayerNorm` for
WebGPU backend.
2024-04-11 14:08:50 -07:00
Maximilian Müller
2d0e1df80a
TRT detailed log and strong typed networks (#19695)
### Description
 
@chilo-ms it seems sensible to me to forward the detailed log argument
to the TRT logger itself.
Also, when no precision downcast is wanted, this ensures that the
network actually sticks to ONNX precision when used with TRT 9+.
2024-04-11 13:40:13 -07:00
Jeff Bloomfield
f7e2faf961
Re-enable MatMul QDQ fusions with the DML EP (#20248)
This re-enables MatMul QDQ fusions with the DML EP now that bugs in
related DML kernels previously encountered in the pipeline are expected
to be addressed.
2024-04-11 10:43:20 -07:00
Wanming Lin
ee603ee326
[WebNN EP] Fixed WebNN constant operand is detached issue (#20229)
Wasm allows its memory to grow, which reallocates all array buffers.
The WebNN EP passed a wasm view directly to a WebNN constant, so after a
reallocation the WebNN constant was treated as a detached buffer on the
JS side. Creating a copy of the data for each WebNN constant fixes this.
2024-04-10 20:30:03 -07:00
Yufeng Li
e6ca360695
fix build break in kernel_explorer (#20235)
### Description



### Motivation and Context
2024-04-10 15:01:44 -07:00
Chi Lo
47072509c7
[TensorRT EP] Fix updating provider options from session options (#20246)
The issue occurs when the user specifies a path for "ep.context_file_path"
in session options. `context_cache_path` is a local variable, so it is
destroyed when
`UpdateOrtTensorRTProviderOptionsV2FromSessionOptionsConfigs()` returns.
Later,
`onnxruntime::TensorrtProviderFactoryCreator::Create(&new_tensorrt_options)`
accesses a corrupted memory location, because the options saved the
pointer returned by context_cache_path.c_str().

Inlining
`UpdateOrtTensorRTProviderOptionsV2FromSessionOptionsConfigs()` fixes
this issue.
2024-04-10 12:48:37 -07:00
MasayoshiTsutsui
6a9d8a9030
[js/webgpu] implement DepthToSpace operator in webgpu (#19948)
### Description
This PR supports
[DepthToSpace](https://onnx.ai/onnx/operators/onnx__DepthToSpace.html#depthtospace)
operator in webgpu backend.


### Test
We followed the steps described on [this
page](https://gist.github.com/fs-eire/a55b2c7e10a6864b9602c279b8b75dce)
to build, tested with the following commands, and confirmed that it
passed the Model and Op tests that already existed. (Probably, these
test cases were prepared in the past for WebGL backend)
```
~/onnxruntime/js/web>
% npm test -- suite0 -b=webgpu --wasm-number-threads=1 --debug   
```
##### NOTE
Note that the main branch fails 5 tests for the
resize_upsample_sizes_nearest operator.
Since this PR does not touch that issue, those test cases still fail in
my branch as well.
Should I open an issue for this?


### Motivation and Context
Though the DepthToSpace operator plays a crucial role in
super-resolution domains, it was not supported in webgpu backend.
2024-04-10 12:13:46 -07:00
Dmitri Smirnov
89a96bdc34
Reduce heap contention in Tokenizer (#20196)
### Description
Re-use vector buffers to prevent frequent reallocations.


### Motivation and Context
Reduce process heap contention.


![image](https://github.com/microsoft/onnxruntime/assets/11303988/f0b78062-3d86-45b7-87fd-e0696b170cf8)
2024-04-10 12:12:17 -07:00
Yifan Li
9577fe454d
[EP Perf] Customize onnx-tensorrt commit id when init CI tasks (#20175)
### Description
Customize commit id of onnx-tensorrt in EP Perf CI variables when
testing OSS parsers in different versions

### To Verify

![image](https://github.com/microsoft/onnxruntime/assets/109183385/9dc650d8-377d-4223-8951-f0849b1fe984)

After assigning `onnxTensorrtCommitId` in EP Perf CI Variables, 
CI would prompt during the step of **[Build latest ORT Image with
TensorRT OSS
parser](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=438217&view=logs&j=b6bfa4e2-8141-507f-8ca1-59b3f929fa71&t=fc64e110-ab59-54e4-1c37-853e84a52a7e&l=396450)**:
```
Updated deps.txt with new commit id a43ce67187bab219520fd80f21af8bbd4354bc8c and hash 572535aefef477050f86744dfab1fef840198035
```
And CI would [overwrite the line of onnx_tensorrt in
deps.txt](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=438217&view=logs&j=b6bfa4e2-8141-507f-8ca1-59b3f929fa71&t=fc64e110-ab59-54e4-1c37-853e84a52a7e&l=396451)
which was assigned as:
```
onnx_tensorrt;a43ce67187.zip;572535aefef477050f86744dfab1fef840198035

```
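
As a rough illustration of the manual step this saves, the sketch below builds a `name;url;hash` line in the same shape as the entry above. It assumes the 40-hex-digit value is a SHA-1 of the zip archive; the helper name `make_deps_entry` and the example URL are hypothetical.

```python
import hashlib


def make_deps_entry(name: str, url: str, archive_bytes: bytes) -> str:
    """Format a `name;url;sha1` line like the onnx_tensorrt entry shown above."""
    # Assumption: the last field is the SHA-1 digest of the archive contents.
    sha1_hex = hashlib.sha1(archive_bytes).hexdigest()
    return f"{name};{url};{sha1_hex}"


# Hypothetical usage: in practice the bytes would be the downloaded zip archive.
entry = make_deps_entry("onnx_tensorrt", "https://example.invalid/archive.zip", b"")
print(entry)
```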


### Motivation and Context
To save time of modifying deps.txt and manually calculating zip hash
2024-04-10 09:46:05 -07:00
guyang3532
471e969e2f
Check padding density by input of embedding module (#19821)
### Description
The PaddingElimination optimization is enabled when the density of the
embedding padding is less than 90%, so we need to check that density to
decide whether to enable the optimization.

Before this PR, we only checked the graph inputs and correlated one of
them with the embedding node by iterating the graph from the embedding
node back to a graph input.
This is hard to generalize because there may be complicated patterns
between the graph input and the embedding node.

This PR checks the padding density on the direct input of the embedding
module, rather than on the graph input, at the first graph execution
when exporting the ONNX graph.
If the density is < 90%, a flag PythonOp is inserted after the embedding
node:
```
             Embedding
                 |
             PythonOp (func_name:_FlagPaddingElimination)   (insert if density < 90%)
                 |
             Following graph
```

When PaddingElimination is invoked, it checks whether the flag
PythonOp(func_name:_FlagPaddingElimination) follows the Embedding node;
if so, it removes the flag and performs the padding elimination optimization.
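
A minimal sketch of the density check, for illustration only. The description does not define "density" precisely; this assumes it means the percentage of non-padding entries in the embedding input. The names `valid_token_density`, `should_flag_padding_elimination`, and `padding_idx` are hypothetical and not taken from the PR.

```python
import numpy as np


def valid_token_density(input_ids: np.ndarray, padding_idx: int) -> float:
    """Percentage of entries in the embedding input that are not padding."""
    total = input_ids.size
    if total == 0:
        return 100.0  # nothing to eliminate in an empty input
    valid = int(np.count_nonzero(input_ids != padding_idx))
    return 100.0 * valid / total


def should_flag_padding_elimination(
    input_ids: np.ndarray, padding_idx: int, threshold: float = 90.0
) -> bool:
    """True when the density is below the 90% threshold named in the description."""
    return valid_token_density(input_ids, padding_idx) < threshold
```

With a batch such as `[[5, 6, 0, 0], [7, 0, 0, 0]]` and `padding_idx=0`, the density is 37.5%, so the flag would be inserted.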
2024-04-10 18:45:51 +08:00
Yi Zhang
0acde1157a
Set parallel count to avoid OOM in training GPU packaging pipeline (#20255)
### Description
Make the compilation work on the Azure CPU agent by reducing the
parallel count.



### Motivation and Context
The OOM issue mentioned in #20244 was caused by the low
memory/parallel_count.
2024-04-10 14:05:53 +08:00
pengwa
280b2634c5
Prompt layer-wise recompute when applicable (#20126)
### Prompt layer-wise recompute when applicable

When an export fails and we find that the checkpoint function is used,
give users an explicit prompt to enable layer-wise memory optimization.
- Use of the checkpoint function is a strong indicator that the model is
too large to fit in GPU memory.
- If we don't override the checkpoint function here, ONNX export will
fail in most cases. 1. For old versions of PyTorch, we just throw an
exception when handling the gradient checkpoint feature. 2. For new
versions of PyTorch, an export failure happens.
- Neither failure told users explicitly "HOW" to mitigate it.
This PR does that.



![image](https://github.com/microsoft/onnxruntime/assets/10530022/c0476748-5818-4cc8-b2d6-88c7580fe4da)



### Motivation and Context
2024-04-10 11:50:28 +08:00
Yi Zhang
14d7872ce9
Reuse T4 for Cuda12.2 training packaging pipeline. (#20244)
### Description
The training CUDA 12.2 packaging pipeline
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1308&_a=summary
has always run out of memory since PR #19910.
I tried other CPU agents, for example D64as_v5 (256G memory) and
D32as_v4 (128G memory and 256G SSD temp storage), which still run out of
memory, as in the image below.

![image](https://github.com/microsoft/onnxruntime/assets/16190118/5acde9ef-674f-4b6d-a1b3-b54647645083)


But it works on T4, though T4 only has 4 vCPUs, 28G memory and 180G temp
storage, and it takes much more time.

### Motivation and Context
Restore CUDA 12.2 training packaging pipeline first.
More time is needed to investigate the root cause


### Other Clues.
These 2 compilation steps take nearly 6 minutes with CUDA 12.2 on T4,
and the build runs out of memory on a CPU machine. @ajindal1
cuda12.2 on T4
```
2024-03-14T05:39:08.7726865Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-03-14T05:45:01.3223393Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o

2024-03-14T05:46:07.9218003Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim96_fp16_sm80.cu.o
2024-03-14T05:52:59.2387051Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu.o

```

But they could be finished in about one minute with Cuda 11.8 on CPU
```
cuda11.8 on CPU
2024-04-09T11:34:35.0849836Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-04-09T11:35:53.6648154Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o

cuda11.8 on GPU
024-03-13T12:16:33.4102477Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-03-13T12:19:58.8268272Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o
```
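
The durations quoted above can be checked from the log timestamps. The sketch below assumes that the gap between two consecutive "Building CUDA object" lines approximates the compile time of the first file, which is only roughly true in a parallel build. The helper name `elapsed_seconds` is hypothetical.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Whole seconds between two log timestamps, ignoring the fractional part."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    # Keep only "YYYY-MM-DDTHH:MM:SS" (19 characters); the logs carry 7 fractional digits.
    t0 = datetime.strptime(start[:19], fmt)
    t1 = datetime.strptime(end[:19], fmt)
    return int((t1 - t0).total_seconds())


# Timestamps copied from the CUDA 12.2 on T4 and CUDA 11.8 on CPU logs above.
cuda122_t4 = elapsed_seconds("2024-03-14T05:39:08.7726865Z", "2024-03-14T05:45:01.3223393Z")
cuda118_cpu = elapsed_seconds("2024-04-09T11:34:35.0849836Z", "2024-04-09T11:35:53.6648154Z")
print(cuda122_t4, cuda118_cpu)  # 353 78
```

That is just under 6 minutes for the first step on T4 with CUDA 12.2, against a little over one minute on CPU with CUDA 11.8.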
2024-04-10 09:21:40 +08:00
Dmitri Smirnov
7d8dea9f10
Reduce Heap contention in StringNormalizer (#20182)
### Description
Re-use pre-computed and pre-allocated buffers for UNICODE conversions.
Make sure we do not introduce unnecessary intermediate `std::string`
instances.
Create a Utf8Generic converter for use with non-Windows platforms.

### Motivation and Context
This reduces heap contention for a P1 customer.


![image](https://github.com/microsoft/onnxruntime/assets/11303988/fd39fb01-7361-47d2-8f83-69dbc3bbc65c)
2024-04-09 16:10:31 -07:00
pengwa
81005e2c92
Optimize constant sharing perf (#20143)
### Optimize constant sharing perf

by avoiding renaming the first time we detect a constant pattern.

Currently, every time ConstantSharing runs, for each initializer whose
pattern does not exist yet, we create a new NodeArg with a unique name.
Later, other initializers that share the same pattern are replaced by
that NodeArg.

The problem is that even when there are no real constant sharing cases,
we still modify the graph for each initializer. This is not needed.
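
The idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual C++ pass: it assumes the first initializer seen for a pattern simply keeps its own name, and that only later duplicates are rewritten. The function name `share_constants` and the `(name, pattern)` representation are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def share_constants(initializers: List[Tuple[str, Hashable]]) -> Dict[str, str]:
    """Map each duplicate initializer name to the first name seen with the same pattern.

    `initializers` is a list of (name, pattern) pairs. The result holds only the
    names that need to be replaced, so it is empty when nothing is shared.
    """
    first_name_for_pattern: Dict[Hashable, str] = {}
    replacements: Dict[str, str] = {}
    for name, pattern in initializers:
        if pattern not in first_name_for_pattern:
            # First occurrence: remember it, keep its name, leave the graph untouched.
            first_name_for_pattern[pattern] = name
        else:
            # Later occurrence: point it at the first initializer with this pattern.
            replacements[name] = first_name_for_pattern[pattern]
    return replacements
```

When no two initializers share a pattern, the returned mapping is empty and no graph modification would be required.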

### Motivation and Context
2024-04-09 12:04:36 +08:00
danyue
07b5377f7c
Add INT16 and UINT16 compatibility for relu_quantizelinear (#20187)
### Description
There is a problem in the relu_quantizelinear transformer that causes
wrong results. The purpose of this PR is to solve this problem.


### Motivation and Context
The transformer does not take into account the case where Q's zero point
is tensor(int16) or tensor(uint16), so an error occurs when this
happens.
How to verify:
```python
import onnx
import onnxruntime as ort
import numpy as np
model_name = 'relu_quantize_testcase.onnx'
model = onnx.load(model_name)
ort_input0 = np.random.rand(1, 64, 64, 128).astype(np.float32)
# infer with GraphOptimizationLevel=0
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
ort_session = ort.InferenceSession(
    model_name,
    providers=["CPUExecutionProvider"],
    sess_options=so
)

outputs = [x.name for x in ort_session.get_outputs()]

ort_outs_mod = ort_session.run(outputs, { 'generator/conv2d_input/conv2d/Conv2D:0': ort_input0} )
del ort_session

# infer with GraphOptimizationLevel=default
model_orig = onnx.load(model_name)
ort_session_orig = ort.InferenceSession(model_orig.SerializeToString())

outputs_orig = [x.name for x in ort_session_orig.get_outputs()]

ort_outs_orig = ort_session_orig.run(outputs_orig,  { 'generator/conv2d_input/conv2d/Conv2D:0': ort_input0}  )

# diff
print(np.linalg.norm(ort_outs_mod[0].astype(np.float32) - ort_outs_orig[0].astype(np.float32)))

del ort_session_orig
```

[relu_quantize_testcase.zip](https://github.com/microsoft/onnxruntime/files/14848160/relu_quantize_testcase.zip)
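
For background, a small NumPy sketch of why the zero point type matters. It assumes (this is not stated in the description) that removing a Relu before QuantizeLinear is only safe when the zero point equals the lowest value of the quantized type, so the check has to know the range of each type, including int16 and uint16. The helper names `quantize` and `relu_is_redundant` are hypothetical.

```python
import numpy as np


def quantize(x: np.ndarray, scale: float, zero_point: int, dtype) -> np.ndarray:
    """Reference QuantizeLinear: round, shift by the zero point, saturate to the dtype range."""
    info = np.iinfo(dtype)
    # Work in int64 so adding the zero point cannot overflow before saturation.
    q = np.rint(x / scale).astype(np.int64) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)


def relu_is_redundant(x: np.ndarray, scale: float, zero_point: int, dtype) -> bool:
    """True if QuantizeLinear(Relu(x)) equals QuantizeLinear(x) on this input."""
    with_relu = quantize(np.maximum(x, 0.0), scale, zero_point, dtype)
    without_relu = quantize(x, scale, zero_point, dtype)
    return bool(np.array_equal(with_relu, without_relu))


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu_is_redundant(x, 0.01, -32768, np.int16))  # zero point is the int16 minimum
print(relu_is_redundant(x, 0.01, 0, np.uint16))      # zero point is the uint16 minimum
print(relu_is_redundant(x, 0.01, 0, np.int16))       # zero point is not the int16 minimum
```

In the first two cases saturation already maps negative inputs to the zero point, so the Relu changes nothing; in the third case the outputs differ.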

---------

Co-authored-by: genmingz <genming.zhong@amd.com>
2024-04-08 19:41:43 -07:00
pengwa
41acd8c543
Support more ops for recompute (#20234)
### Support more ops for recompute

To cover Mistral model, and support padding elimination ops.

### Motivation and Context
2024-04-09 09:24:48 +08:00
Adam Louly
22a61a3cf5
Fix Mixtral Parity test to keep it consistent with Transformers. (#20210)
### Description
I recently opened a PR in the HF transformers repo to fix an issue in
the indexing part.

https://github.com/huggingface/transformers/issues/29857

The ONNX exporter was failing because of the tolist() conversion, so we
had to remove it.

I found that the same code is also part of our codebase, so this PR
keeps it consistent.
2024-04-08 13:04:12 -07:00
wejoncy
908a76d675
fix "4bit quantization scales and zeropoint tensor shape" (#19986)
### Description



### Motivation and Context
2024-04-08 10:15:28 -07:00
Jiajie Hu
23d3afd4fe
[js/webgpu] Implement com.microsoft.RotaryEmbedding (#20209)
### Description

https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftrotaryembedding

### Motivation and Context
As per customer request, this helps Phi-2 and Gemma.
2024-04-08 09:11:26 -07:00
cloudhan
e19c778934
Improve KE for commandline and programmatically tuning dispatch (#18778)
2024-04-08 11:08:59 +08:00
Ye Wang
cc3faba616
Support seq_len > 64K in rotary embedding cuda kernel (#20204)
### Description



### Motivation and Context
2024-04-05 19:52:55 -07:00
Francesco
6af02ae06a
Remove non-existing function call (#19416)
This function call is confusing, since it calls a function that has no
definition. The function was correctly replaced, from compute_range to
compute_data, but the call was reintroduced in a later PR.

### Description
Problem as described in [this
issue](https://github.com/microsoft/onnxruntime/issues/18893)

In the examples, several calls of compute_range() from calibrate.py
can be found, as well as in calibrate.py itself.

The problem is that it was [replaced
here](https://github.com/microsoft/onnxruntime/pull/16550/files#diff-75e84436a983e17527f8b5bc585087e7ad75b3b515c2101c2a82dcaecca490de)
from `compute_range()` to `compute_data() -> TensorsData` and then
mistakenly [added back as a call
here](https://github.com/microsoft/onnxruntime/pull/17029/files#diff-75e84436a983e17527f8b5bc585087e7ad75b3b515c2101c2a82dcaecca490de).

### Motivation and Context
This PR proposes removing the confusing call
`self.calibrate_range()` in calibrate.py. Once it is removed and
packaged, the examples from the onnx-runtime-examples repository
must be adapted, since they are already not working. Uses of
`compute_range()` in the examples are linked in [this
issue](https://github.com/microsoft/onnxruntime/issues/18893).
2024-04-05 19:48:48 -07:00
Adrian Lizarraga
05d97e8d18
Update QNN python packages to use QNN SDK version 2.19.2 (#20213)
### Description
Update QNN python packages to use QNN SDK version 2.19.2.



### Motivation and Context
Our CI builds already use QNN SDK version 2.19.2. We should make sure
the ort-nightly-qnn python packages are also built with the same QNN SDK
version.
2024-04-05 17:15:25 -07:00
Yi Zhang
23a5d0a305
Extend time out in Windows GPU packaging jobs (#20207)
### Description
Extend the Windows GPU packaging job build timeout to 6 hours, and the
test stage timeout to 3 hours.



### Motivation and Context
There are still a few timeout issues after refactoring. The probability
is about 20% in
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=84.
I found that a slow build can still finish in 4 hours,
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=434340&view=logs&j=0c6ee496-b38e-55a9-3699-12934156e90f,
although in most cases it takes only about 30 minutes.
This is unlike before, when the build could not complete at all.
So, in this PR, I extend the timeout to 6 hours.

One interesting thing: if one Windows GPU job becomes slow, all
other Windows GPU jobs in the same run become slow too.
So I suspect it has something to do with ADO or virtualization. That is,
it's not completely random.
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=841
2024-04-06 08:03:42 +08:00
Andrew Grigorev
a6611409cc
Fix HalideIR title in third party notices reference (#20190)
2024-04-05 11:12:43 -07:00
dependabot[bot]
2a323eb670
Bump Sixlabors.ImageSharp from 2.1.1 to 2.1.7 in /csharp/sample/Microsoft.ML.OnnxRuntime.ResNet50v2Sample (#19805)
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.1 to 2.1.7.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.7</h2>
<h2>What's Changed</h2>
<ul>
<li>[release/2.1] Disallow allocation attempts of unrepresentable sizes
by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2553">SixLabors/ImageSharp#2553</a></li>
<li>[release/2.1] Tiff decoding robustness improvements (<a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2550">#2550</a>)
by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2554">SixLabors/ImageSharp#2554</a></li>
<li>[release/2.1] PBM decoder robustness improvements and
BufferedReadStream observability by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2555">SixLabors/ImageSharp#2555</a></li>
<li>Backport 2681 by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2688">SixLabors/ImageSharp#2688</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.6...v2.1.7">https://github.com/SixLabors/ImageSharp/compare/v2.1.6...v2.1.7</a></p>
<h2>v2.1.6</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport - Handle EOF in Jpeg bit reader when data is bad to prevent
DOS attack. by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2524">SixLabors/ImageSharp#2524</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.5...v2.1.6">https://github.com/SixLabors/ImageSharp/compare/v2.1.5...v2.1.6</a></p>
<h2>v2.1.5</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2501">#2501</a>
by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2509">SixLabors/ImageSharp#2509</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.4...v2.1.5">https://github.com/SixLabors/ImageSharp/compare/v2.1.4...v2.1.5</a></p>
<h2>v2.1.4</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport WebP fix to 2.1 by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2420">SixLabors/ImageSharp#2420</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.3...v2.1.4">https://github.com/SixLabors/ImageSharp/compare/v2.1.3...v2.1.4</a></p>
<h2>v2.1.3</h2>
<h2>What's Changed</h2>
<ul>
<li>V2 Backport: 2133, 2154 by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2157">SixLabors/ImageSharp#2157</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.2...v2.1.3">https://github.com/SixLabors/ImageSharp/compare/v2.1.2...v2.1.3</a></p>
<h2>v2.1.2</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport - Issue 2123 by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2126">SixLabors/ImageSharp#2126</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.1...v2.1.2">https://github.com/SixLabors/ImageSharp/compare/v2.1.1...v2.1.2</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="fa7d712702"><code>fa7d712</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2688">#2688</a>
from SixLabors/js/backport-2681</li>
<li><a
href="36b3533cc3"><code>36b3533</code></a>
Use correct property to disable upstream warnings.</li>
<li><a
href="94bb7615a1"><code>94bb761</code></a>
Update ImageSharp.csproj</li>
<li><a
href="3ea2574726"><code>3ea2574</code></a>
Update PngDecoderCore.cs</li>
<li><a
href="e74a55fbfd"><code>e74a55f</code></a>
[release/2.1] PBM decoder robustness improvements and BufferedReadStream
obse...</li>
<li><a
href="749b1c04d7"><code>749b1c0</code></a>
[release/2.1] Tiff decoding robustness improvements (<a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2550">#2550</a>)
(<a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2554">#2554</a>)</li>
<li><a
href="3064b78927"><code>3064b78</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2553">#2553</a>
from SixLabors/backport/2.1.x/2545</li>
<li><a
href="f36ec12695"><code>f36ec12</code></a>
Disallow allocation attempts of unrepresentable sizes </li>
<li><a
href="688e242a84"><code>688e242</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2524">#2524</a>
from SixLabors/js/backport-fix-jpeg-dos</li>
<li><a
href="0f17a8be9c"><code>0f17a8b</code></a>
Handle EOF in Jpeg bit reader when data is bad to prevent DOS
attack.</li>
<li>Additional commits viewable in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.1...v2.1.7">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Sixlabors.ImageSharp&package-manager=nuget&previous-version=2.1.1&new-version=2.1.7)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)

You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-05 11:11:52 -07:00
Hector Li
1ccb164c12
Improve the script to add Q, DQ nodes around EPContext node (#20107)
Improve the script that adds Q and DQ nodes around the EPContext node so that the wrapper model uses float data as inputs and outputs. Users don't need to quantize or dequantize the data in their application.
2024-04-05 10:12:01 -07:00
Guenther Schmuelling
c529e05e38
fix ConvTranspose 1D (#20194)
2024-04-05 10:05:32 -07:00