Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.7 to 2.1.8.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.8</h2>
<h2>What's Changed</h2>
<ul>
<li>V2 - Limit Read Palette Indices by <a
href="https://github.com/JimBobSquarePants"><code>@JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2719">SixLabors/ImageSharp#2719</a></li>
<li>V2 - Clear Pixel Buffers on Decode. by <a
href="https://github.com/JimBobSquarePants"><code>@JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2717">SixLabors/ImageSharp#2717</a></li>
<li>V2 - Limit all memory allocations in the MemoryAllocator layer by <a
href="https://github.com/JimBobSquarePants"><code>@JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2715">SixLabors/ImageSharp#2715</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="f21d64188e"><code>f21d641</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2715">#2715</a>
from SixLabors/backport/v2-memlimit</li>
<li><a
href="8f0b4d3e68"><code>8f0b4d3</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2717">#2717</a>
from SixLabors/backport/v2-clear-buffers</li>
<li><a
href="cf9496d284"><code>cf9496d</code></a>
test allocation limits</li>
<li><a
href="3d298db2cd"><code>3d298db</code></a>
Adapt BmpDecoder_ThrowsException_Issue2696 for V2</li>
<li><a
href="a78ce27a2b"><code>a78ce27</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2719">#2719</a>
from SixLabors/backport/v2-check-palette-indices</li>
<li><a
href="e6209147b1"><code>e620914</code></a>
Clamp read palette indices.</li>
<li><a
href="c122185ea0"><code>c122185</code></a>
Clear pixel buffers on decode.</li>
<li><a
href="5c6ec5d6fb"><code>5c6ec5d</code></a>
Limit all allocations</li>
<li>See full diff in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
* Implement `user_compute_stream` python api for TensorRT EP
* Using this option will implicitly set `has_user_compute_stream` as
`true`
* Extend existing TRTEP unit test to verify `user_compute_stream` option
* This has been verified in local pytorch env, with
`torch.cuda.Stream()` passing into `user_compute_stream`:
```python
...
# Before inference
if torch.cuda.is_available():
    s = torch.cuda.Stream()
    option = {"user_compute_stream": str(s.cuda_stream)}
    sess.set_providers(["TensorrtExecutionProvider"], [option])
    options = sess.get_provider_options()
    assert "TensorrtExecutionProvider" in options
    assert options["TensorrtExecutionProvider"].get("user_compute_stream", "") == str(s.cuda_stream)
    assert options["TensorrtExecutionProvider"].get("has_user_compute_stream", "") == "1"
...
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Align with existing `user_compute_stream` python implementations for
[CUDA EP](https://github.com/microsoft/onnxruntime/pull/19229)/[ROCm
EP](https://github.com/microsoft/onnxruntime/pull/19619)
Bumps [transformers](https://github.com/huggingface/transformers) from
4.36.0 to 4.38.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/huggingface/transformers/releases">transformers's
releases</a>.</em></p>
<blockquote>
<h2>v4.38: Gemma, Depth Anything, Stable LM; Static Cache, HF Quantizer,
AQLM</h2>
<h2>New model additions</h2>
<h3>💎 Gemma 💎</h3>
<p>Gemma is a new opensource Language Model series from Google AI that
comes with a 2B and 7B variant. The release comes with the pre-trained
and instruction fine-tuned versions and you can use them via
<code>AutoModelForCausalLM</code>, <code>GemmaForCausalLM</code> or
<code>pipeline</code> interface!</p>
<p>Read more about it in the Gemma release blogpost: <a
href="https://hf.co/blog/gemma">https://hf.co/blog/gemma</a></p>
<pre lang="python"><code>import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.float16)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
</code></pre>
<p>You can use the model with Flash Attention, SDPA, Static cache and
quantization API for further optimizations !</p>
<ul>
<li>Flash Attention 2</li>
</ul>
<pre lang="python"><code>import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
</code></pre>
<ul>
<li>bitsandbytes-4bit</li>
</ul>
<pre lang="python"><code>from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", device_map="auto",
    load_in_4bit=True
)
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="08ab54ada5"><code>08ab54a</code></a>
[ <code>gemma</code>] Adds support for Gemma 💎 (<a
href="https://redirect.github.com/huggingface/transformers/issues/29167">#29167</a>)</li>
<li><a
href="2de9314197"><code>2de9314</code></a>
[<code>Maskformer</code>] safely get backbone config (<a
href="https://redirect.github.com/huggingface/transformers/issues/29166">#29166</a>)</li>
<li><a
href="476957b5b4"><code>476957b</code></a>
🚨 Llama: update rope scaling to match static cache changes (<a
href="https://redirect.github.com/huggingface/transformers/issues/29143">#29143</a>)</li>
<li><a
href="7a4bec6e8f"><code>7a4bec6</code></a>
Release: 4.38.0</li>
<li><a
href="ee3af60be0"><code>ee3af60</code></a>
Add support for fine-tuning CLIP-like models using
contrastive-image-text exa...</li>
<li><a
href="0996a10077"><code>0996a10</code></a>
Revert low cpu mem tie weights (<a
href="https://redirect.github.com/huggingface/transformers/issues/29135">#29135</a>)</li>
<li><a
href="15cfe38942"><code>15cfe38</code></a>
[<code>Core tokenization</code>] <code>add_dummy_prefix_space</code>
option to help with latest is...</li>
<li><a
href="efdd436663"><code>efdd436</code></a>
FIX [<code>PEFT</code> / <code>Trainer</code> ] Handle better peft +
quantized compiled models (<a
href="https://redirect.github.com/huggingface/transformers/issues/29">#29</a>...</li>
<li><a
href="5e95dcabe1"><code>5e95dca</code></a>
[<code>cuda kernels</code>] only compile them when initializing (<a
href="https://redirect.github.com/huggingface/transformers/issues/29133">#29133</a>)</li>
<li><a
href="a7755d2409"><code>a7755d2</code></a>
Generate: unset GenerationConfig parameters do not raise warning (<a
href="https://redirect.github.com/huggingface/transformers/issues/29119">#29119</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/huggingface/transformers/compare/v4.36.0...v4.38.0">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
- Adds general support for per-channel quantized weights to QNN EP (HTP
backend).
- Add QNN EP unit tests for per-channel Conv
- Update quantization tool to allow selecting which ops are quantized
per-channel (and which axis) via tensor-level overrides. Currently,
setting `per_channel=True` assumes all Convs, MatMuls, Gemms,
InstanceNormalization, and LayerNormalization ops should be quantized
per-channel using some assumed default axis.
#### Creating QDQ per-channel Conv model example
```python
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model


class DataReader(CalibrationDataReader):
    # TODO: See ONNX Runtime QNN docs for example of a data reader
    # https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#generating-a-quantized-model-x64
    pass


if __name__ == "__main__":
    input_model_path = "model.onnx"

    # Pre-process the original float32 model.
    preproc_model_path = "model.preproc.onnx"
    model_changed = qnn_preprocess_model(input_model_path, preproc_model_path)
    model_to_quantize = preproc_model_path if model_changed else input_model_path

    # Create the calibration data reader only after the model to quantize is known.
    my_data_reader = DataReader(model_to_quantize)

    # RELEVANT TO THIS PR:
    # Make sure Conv's weight input is quantized to int8/symmetric/per-channel with axis == 0.
    # The presence of the 'axis' key indicates that this is a per-channel quantized weight.
    init_overrides = {'weight': [{'axis': 0, 'quant_type': QuantType.QInt8, 'symmetric': True}]}

    qnn_config = get_qnn_qdq_config(model_to_quantize,
                                    my_data_reader,
                                    init_overrides=init_overrides,
                                    activation_type=QuantType.QUInt16,  # uint16 activations
                                    weight_type=QuantType.QUInt8)       # uint8 weights by default

    quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
float32 model:
<img width="683" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/ca650e49-1ad0-47d8-8c46-17fbc224ca39">
QDQ model (per-channel Conv weight):
<img width="748" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/6bd469f2-968b-4d11-9526-09b3e71f98e7">
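As a rough illustration of what "per-channel with axis == 0" means for a weight, the sketch below computes one symmetric int8 scale per output channel. It is not the quantization tool's implementation: the function name `quantize_per_channel_int8`, the nested-list weight layout, and the clamping range of [-127, 127] are assumptions made for this example.

```python
def quantize_per_channel_int8(weight):
    """Symmetric int8 quantization with one scale per channel along axis 0.

    `weight` is a list of channels; each channel is a flat list of floats.
    (Hypothetical helper for illustration only.)
    """
    scales = []
    quantized = []
    for channel in weight:
        max_abs = max(abs(v) for v in channel)
        # Assumed convention: map the largest magnitude in the channel to 127.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append([max(-127, min(127, round(v / scale))) for v in channel])
    return quantized, scales


if __name__ == "__main__":
    # Two output channels with very different ranges get different scales.
    q, s = quantize_per_channel_int8([[0.0, -1.0], [10.0, 0.0]])
    print(q, s)
```

With a single per-tensor scale, the small-magnitude channel would share the scale of the large one and lose resolution; a per-channel scale avoids that.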
### Motivation and Context
Support more models, especially models with int4 quantized weights.
Copy QNN deps when building Python bindings as well.
Tweak the wildcard to only copy QNN-related files. The latest SDK from
Qualcomm (>= 2.21) also includes SNPE DLLs, which we don't want to
include.
Add `--user` option to pip install command.
Error:
```
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/usr/local/bin/f2py'
Consider using the `--user` option or check the permissions.
```
See #19877.
### Description
This PR fixes the parity checker in the LLaMA scripts by adding the
following.
- Enable buffer sharing manually with `use_buffer_share` instead of
`use_gqa`
- Get max sequence length from model's config
### Motivation and Context
This PR fixes an issue with running the parity checker on other
large-language models where `GroupQueryAttention` can be used without
buffer sharing enabled.
The graph builder currently doesn't assign the correct shapes for
subgraphs that have more than 1 output, and where each output comes from
a different node. `nodeOutputShapes` should be a map of shapes (1:1
relationship), and not a map of lists of shapes (1:N relationship), since
an output referenced by `arg->Name()` can only have 1 shape.
Take, for example, a subgraph where a node has 2 outputs and each output
feeds into an elementwise op. Both elementwise nodes will
have a `targetIndex` of 0, and we were using this target index to query
their shape, resulting in both outputs querying the same shape. What we
actually need to do is use the `GraphOutputIndex` of the subgraph
to query the correct output shape of the subgraph.
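The difference between the two lookups can be sketched in a few lines. This is only a toy model of the situation described above, not the DML graph builder code: the output names, the dictionary layout, and the helper `shapes_by` are hypothetical.

```python
# Shapes of the subgraph outputs, indexed by graph output index (made-up values).
subgraph_output_shapes = [[1, 3], [1, 5]]

# Each elementwise node produces its own output 0, so target_index is 0 for both.
outputs = [
    {"name": "add_out", "target_index": 0, "graph_output_index": 0},
    {"name": "mul_out", "target_index": 0, "graph_output_index": 1},
]


def shapes_by(key):
    """Build a 1:1 map from output name to shape, indexing by the given key."""
    return {o["name"]: subgraph_output_shapes[o[key]] for o in outputs}


wrong = shapes_by("target_index")        # both outputs get the first shape
right = shapes_by("graph_output_index")  # each output gets its own shape

if __name__ == "__main__":
    print(wrong)
    print(right)
```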
### Description
<!-- Describe your changes. -->
Improve performance using shared memory
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Update to the ONNX 1.16.0 branch according to
https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md
ONNX 1.16.0 release notes:
https://github.com/onnx/onnx/releases/tag/v1.16.0
#### Updated ops for CPU EP:
- DequantizeLinear(21)
- Added int16 and uint16 support + various optimizer tests
- Missing int4 and uint4 support
- Missing block dequantization support
- QuantizeLinear(21)
- Added int16 and uint16 support + various optimizer tests
- Missing int4 and uint4 support
- Missing block quantization support
- Cast(21)
- Missing int4 and uint4 support
- CastLike(21)
- Missing int4 and uint4 support
- ConstantOfShape(21)
- Missing int4 and uint4 support
- Identity(21)
- Missing int4 and uint4 support
- If(21)
- Missing int4 and uint4 support
- Loop(21)
- Missing int4 and uint4 support
- Reshape(21)
- Missing int4 and uint4 support
- Scan(21)
- Missing int4 and uint4 support
- Shape(21)
- Missing int4 and uint4 support
- Size(21)
- Missing int4 and uint4 support
- Flatten(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Pad(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Squeeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Transpose(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Unsqueeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
#### Unimplemented opset 21 features/ops
- int4 and uint4 data type
- QLinearMatMul(21)
- GroupNormalization(21)
- ai.onnx.ml.TreeEnsemble(5)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Disabled tests
#### ORT Training
orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py
- test_ort_custom_ops: Potential shape inference bug for custom ops
#### Python quantization unit tests
test/onnx/python/quantization (shape inference bug)
- test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16
- test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16
- test_op_gemm.py: test_quantize_qop_gemm_s8s8
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3
- test_op_matmul.py: test_quantize_matmul_u8u8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution
- test_op_relu.py: test_quantize_qop_relu_s8s8
#### ONNX tests
- test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a
maxpool output size bug and added this test. Enable this test when [ORT
PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged.
Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741).
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op
ai.onnx.ml.TreeEnsemble
- test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same
- test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same
- test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same
- test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4
yet
- test_cast_INT4_to_INT8_cpu: same
- test_cast_UINT4_to_FLOAT_cpu: same
- test_cast_UINT4_to_UINT8_cpu: same
- test_cast_INT4_to_FLOAT_cuda
- test_cast_INT4_to_INT8_cuda
- test_cast_UINT4_to_FLOAT_cuda
- test_cast_UINT4_to_UINT8_cuda
- test_constantofshape_float_ones_cuda: ConstantOfShape(21) not
implemented for cuda
- test_constantofshape_int_shape_zero_cuda: same
- test_constantofshape_int_zeros_cuda: same
- test_flatten_axis0_cuda: Flatten(21) not implemented for cuda
- test_flatten_axis1_cuda: same
- test_flatten_axis2_cuda: same
- test_flatten_axis3_cuda: same
- test_flatten_default_axis_cuda: same
- test_flatten_negative_axis1_cuda: same
- test_flatten_negative_axis2_cuda: same
- test_flatten_negative_axis3_cuda: same
- test_flatten_negative_axis4_cuda: same
- test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not
implemented in ORT yet
- test_qlinearmatmul_2D_int8_float32_cpu: same
- test_qlinearmatmul_2D_uint8_float16_cpu: same
- test_qlinearmatmul_2D_uint8_float32_cpu: same
- test_qlinearmatmul_3D_int8_float16_cpu: same
- test_qlinearmatmul_3D_int8_float32_cpu: same
- test_qlinearmatmul_3D_uint8_float16_cpu: same
- test_qlinearmatmul_3D_uint8_float32_cpu: same
- test_qlinearmatmul_2D_int8_float16_cuda: same
- test_qlinearmatmul_2D_int8_float32_cuda: same
- test_qlinearmatmul_2D_uint8_float16_cuda: same
- test_qlinearmatmul_2D_uint8_float32_cuda: same
- test_qlinearmatmul_3D_int8_float16_cuda: same
- test_qlinearmatmul_3D_int8_float32_cuda: same
- test_qlinearmatmul_3D_uint8_float16_cuda: same
- test_qlinearmatmul_3D_uint8_float32_cuda: same
- test_size_cuda: Size(21) not implemented for cuda
- test_size_example_cuda: same
- test_dequantizelinear_blocked: Missing implementation for block
dequant for DequantizeLinear(21)
- test_quantizelinear_blocked_asymmetric: Missing implementation for
block quant for QuantizeLinear(21)
- test_quantizelinear_blocked_symmetric: Missing implementation for
block quant for QuantizeLinear(21)
---------
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
Although DML doesn't have a "fast" gelu approximation operator, its
standard GELU operator is still faster than having to combine all the
separate elementwise operators from different ops.
### Description
@chilo-ms to me it seems sensible to forward the detailed log argument
to the TRT logger itself.
Also, when no precision downcast is wanted, this ensures that we actually
stick to ONNX precision when used with TRT 9+.
This re-enables MatMul QDQ fusions with the DML EP now that bugs in
related DML kernels previously encountered in the pipeline are expected
to be addressed.
Wasm allows growing the memory size, which causes all array buffers to be
reallocated. The WebNN EP passes a Wasm view directly to a WebNN constant,
which leads to the WebNN constant being treated as a detached buffer on
the JS side. Simply create a copy for the WebNN constant to fix it.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The issue occurs when the user specifies a path for "ep.context_file_path"
in session options. `context_cache_path` is a local variable, so it is
destroyed when
`UpdateOrtTensorRTProviderOptionsV2FromSessionOptionsConfigs()` returns.
Later,
`onnxruntime::TensorrtProviderFactoryCreator::Create(&new_tensorrt_options)`
accesses an invalid memory location, because the options only store the
pointer obtained from context_cache_path.c_str().
Inlining
`UpdateOrtTensorRTProviderOptionsV2FromSessionOptionsConfigs()` fixes
this issue.
### Description
This PR supports
[DepthToSpace](https://onnx.ai/onnx/operators/onnx__DepthToSpace.html#depthtospace)
operator in webgpu backend.
### Test
We followed the steps described on [this
page](https://gist.github.com/fs-eire/a55b2c7e10a6864b9602c279b8b75dce)
to build, tested with the following commands, and confirmed that it
passed the Model and Op tests that already existed. (Probably, these
test cases were prepared in the past for WebGL backend)
```
~/onnxruntime/js/web>
% npm test -- suite0 -b=webgpu --wasm-number-threads=1 --debug
```
##### NOTE
I want to tell you that the main branch version failed 5 tests for the
resize_upsample_sizes_nearest operator.
Since I didn't touch this issue, those test cases still fail in my
branch as well.
Should I post an issue for this?
### Motivation and Context
Though the DepthToSpace operator plays a crucial role in
super-resolution domains, it was not supported in webgpu backend.
### Description
<!-- Describe your changes. -->
Re-use vector buffers to prevent frequent reallocations.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Reduce process heap contention.

### Description
The PaddingElimination optimization is enabled when the density of
embedding padding is less than 90%, so we need to check that density to
decide whether to enable the optimization.
Before this PR, we only checked the graph inputs and correlated one of
them with the embedding node by iterating the graph backwards from the
embedding node to a graph input.
This is hard to generalize because there may be complicated patterns
between the graph input and the embedding node.
This PR checks the padding density on the direct input of the embedding
module rather than on the graph input, at the first graph execution when
exporting the ONNX graph.
If the density is < 90%, a flag PythonOp is inserted after the embedding
node:
```
Embedding
|
PythonOp (func_name:_FlagPaddingElimination) (insert if density < 90%)
|
Following graph
```
When PaddingElimination is invoked, it checks whether the flag
PythonOp(func_name:_FlagPaddingElimination) follows the Embedding node;
if it does, the flag is removed and the padding elimination optimization is applied.
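The density check itself can be sketched as follows. This is a minimal illustration rather than the ORT implementation: it assumes "density" means the percentage of token ids in the embedding input that differ from the padding index, and the names `embedding_density` and `should_flag_padding_elimination` are hypothetical.

```python
def embedding_density(token_ids, padding_idx):
    """Percentage of entries in a batch of token ids that are not padding.

    `token_ids` is a list of rows, each a list of ints (assumed layout).
    """
    flat = [t for row in token_ids for t in row]
    valid = sum(1 for t in flat if t != padding_idx)
    return 100.0 * valid / len(flat)


def should_flag_padding_elimination(token_ids, padding_idx, threshold=90.0):
    """Return True when the input is sparse enough to insert the flag op."""
    return embedding_density(token_ids, padding_idx) < threshold


if __name__ == "__main__":
    batch = [[5, 7, 0, 0], [3, 0, 0, 0]]  # 0 is the assumed padding index
    print(embedding_density(batch, padding_idx=0))
    print(should_flag_padding_elimination(batch, padding_idx=0))
```

Checking the direct input of the embedding module means no graph traversal is needed to find which graph input feeds the embedding.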
### Description
Make the compilation work on the Azure CPU agent by reducing the parallel
count.
### Motivation and Context
The OOM issue mentioned in #20244 was caused by the low
memory/parallel_count ratio.
### Prompt layer-wise when applicable
Give users an explicit prompt on export failure to enable layer-wise
memory optimization if we find that the checkpoint function is used.
- Use of the checkpoint function is a strong indicator that the model is
too large to fit in GPU memory.
- If we don't override the checkpoint function here, ONNX export will
most likely fail. 1. For old PyTorch versions, we just throw an exception
when handling the gradient checkpoint feature. 2. For new PyTorch
versions, an export failure happens.
- Neither failure tells users explicitly "HOW" to mitigate it.
This PR does.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The training CUDA 12.2 packaging pipeline
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1308&_a=summary
has always run out of memory since PR #19910.
I tried other CPU agents, for example D64as_v5 (256G memory) and
D32as_v4 (128G memory and 256G SSD temp storage), which still run out of
memory, as in the image below.

It works on T4, though T4 only has 4 vCPUs, 28G memory and 180G temp
storage, and it takes much more time.
### Motivation and Context
Restore CUDA 12.2 training packaging pipeline first.
More time is needed to investigate the root cause
### Other Clues.
These 2 compilation steps take nearly 6 minutes with CUDA 12.2 on T4,
and they run out of memory on a CPU machine. @ajindal1
cuda12.2 on T4
```
2024-03-14T05:39:08.7726865Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-03-14T05:45:01.3223393Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o
2024-03-14T05:46:07.9218003Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim96_fp16_sm80.cu.o
2024-03-14T05:52:59.2387051Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu.o
```
But they could be finished in about one minute with Cuda 11.8 on CPU
```
cuda11.8 on CPU
2024-04-09T11:34:35.0849836Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-04-09T11:35:53.6648154Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o
cuda11.8 on GPU
024-03-13T12:16:33.4102477Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-03-13T12:19:58.8268272Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o
```
### Description
<!-- Describe your changes. -->
Re-use pre-computed and pre-allocated buffers for UNICODE conversions.
Make sure we do not introduce unnecessary intermediate `std::string`
instances.
Create a Utf8Generic converter for use with non-Windows platforms.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This reduces heap contention in P1 customer.

### Optimize constant sharing perf
by avoiding renaming the first time we detect a constant pattern.
Currently, every time ConstantSharing runs, for each initializer whose
pattern does not exist yet, we create a new NodeArg with a
unique name. Later, if other initializers share the same pattern,
they are replaced by that NodeArg.
The problem is that even when there are no real constant sharing cases,
we still modify the graph for each initializer. This is not needed.
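The idea can be sketched with plain dictionaries. This is a simplified model, not the ConstantSharing transformer itself: initializers are represented as a name-to-pattern mapping, and the function name `share_constants` is hypothetical.

```python
def share_constants(initializers):
    """Map duplicate initializers onto the first one with the same pattern.

    `initializers` maps an initializer name to a hashable "pattern"
    (for example a (dtype, value) tuple). The first initializer seen for a
    pattern keeps its original name, so nothing is renamed unless a second
    initializer with the same pattern actually shows up.
    """
    first_seen = {}    # pattern -> name of the first initializer with it
    replacements = {}  # initializer name -> name it should be replaced by
    for name, pattern in initializers.items():
        if pattern not in first_seen:
            first_seen[pattern] = name  # no graph modification needed
        else:
            replacements[name] = first_seen[pattern]
    return replacements


if __name__ == "__main__":
    inits = {"a": ("float32", 1.0), "b": ("float32", 2.0), "c": ("float32", 1.0)}
    print(share_constants(inits))
```

When no two initializers share a pattern, the result is empty and the graph is left untouched.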
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
There is a problem in relu_quantizelinear transformer that causes wrong
results. The purpose of this PR is to solve this problem.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The transformer does not take into account the case where Q's zero point
is of type tensor(int16) or tensor(uint16), so an error occurs when this
happens.
How to verify:
```python
import onnx
import onnxruntime as ort
import numpy as np

model_name = 'relu_quantize_testcase.onnx'
input_name = 'generator/conv2d_input/conv2d/Conv2D:0'

# Random float32 input of shape (1, 64, 64, 128)
ort_input0 = np.random.rand(1, 64, 64, 128).astype(np.float32)

# infer with GraphOptimizationLevel=0
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
ort_session = ort.InferenceSession(
    model_name,
    providers=["CPUExecutionProvider"],
    sess_options=so
)
outputs = [x.name for x in ort_session.get_outputs()]
ort_outs_mod = ort_session.run(outputs, {input_name: ort_input0})
del ort_session

# infer with GraphOptimizationLevel=default
model_orig = onnx.load(model_name)
ort_session_orig = ort.InferenceSession(model_orig.SerializeToString())
outputs_orig = [x.name for x in ort_session_orig.get_outputs()]
ort_outs_orig = ort_session_orig.run(outputs_orig, {input_name: ort_input0})

# diff between unoptimized and optimized outputs
print(np.linalg.norm(ort_outs_mod[0].astype(np.float32) - ort_outs_orig[0].astype(np.float32)))
del ort_session_orig
```
[relu_quantize_testcase.zip](https://github.com/microsoft/onnxruntime/files/14848160/relu_quantize_testcase.zip)
---------
Co-authored-by: genmingz <genming.zhong@amd.com>
### Support more ops for recompute
To cover Mistral model, and support padding elimination ops.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
I recently opened a PR in the HF transformers repo to fix an issue in the
indexing part.
https://github.com/huggingface/transformers/issues/29857
The ONNX exporter was failing because of the tolist() conversion, so we had
to remove it.
I found out that the code is also part of our codebase, so this PR
keeps the code consistent.
### Description
Update QNN python packages to use QNN SDK version 2.19.2.
### Motivation and Context
Our CI builds already use QNN SDK version 2.19.2. We should make sure
the ort-nightly-qnn python packages are also built with the same QNN SDK
version.
Improve the script to add Q and DQ nodes around the EPContext node so that the wrapper model uses float data as inputs and outputs. Users don't need to quantize or dequantize the data in their application.
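
For reference, the work that the added Q and DQ nodes take over from the application can be sketched as below. This follows the linear quantization formulas of the ONNX QuantizeLinear and DequantizeLinear operators on plain Python lists; the helper names, the uint16 default range, and the example scale and zero point are assumptions for illustration and are not taken from the script.

```python
def quantize_linear(x, scale, zero_point, qmin=0, qmax=65535):
    """q = saturate(round(x / scale) + zero_point), applied element-wise."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]


def dequantize_linear(q, scale, zero_point):
    """x = (q - zero_point) * scale, applied element-wise."""
    return [(v - zero_point) * scale for v in q]


if __name__ == "__main__":
    scale, zero_point = 0.5, 10  # made-up quantization parameters
    q = quantize_linear([0.0, 0.5, 1.0], scale, zero_point)
    print(q)
    print(dequantize_linear(q, scale, zero_point))
```

With these nodes inside the wrapper model, the application feeds and reads float tensors directly.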