### Description
Fix default value 10.2->10.3 in
linux-gpu-tensorrt-daily-perf-pipeline.yml
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
When API docs workflows fail, we typically don't catch the issue until
the most recently generated artifact expires. The current artifact
retention is 60 days, so by decreasing to 30 days, we can ensure that
we're resolving the workflow failures more quickly.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Implementation of a new cast assembly kernel that uses AVX_NE_CONVERT
instructions to accelerate casting from FP16 to FP32. Added CPUID checks
to determine support of the ISA.
### Motivation and Context
Currently FP16 models executed on systems that lack complete FP16
operator support use single precision on every node to run the model,
this means the original FP16 weights have to be casted to FP32 in order
to run the model properly, this change aims to accelerate the casting by
using upconvert instructions and therefore improve performance.
### Description
- Applies a workaround that prevents the histogram-based calibrators
(percentile, entropy, distribution) from crashing. The workaround
involves copying inference outputs that come directly from model inputs.
A description of the bug is here:
https://github.com/microsoft/onnxruntime/issues/21922. **This PR does
not fix the root bug, but instead provides a workaround to _unblock_
users using histogram-based calibration.**
- Adds a unit test that runs all histogram-based calibrators to help
catch future regressions. We didn't have unit tests that ran these
calibration methods.
### Motivation and Context
Trying to quantize a model with the percentile, entropy, or distribution
calibration methods raises an exception:
```shell
File "/.../site-packages/onnxruntime/quantization/quantize.py", line 691, in quantize
quantize_static(
File "/.../site-packages/onnxruntime/quantization/quantize.py", line 525, in quantize_static
calibrator.collect_data(calibration_data_reader)
File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 571, in collect_data
self.collector.collect(clean_merged_dict)
File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 746, in collect
return self.collect_value(name_to_arr)
File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 836, in collect_value
hist, hist_edges = np.histogram(data_arr, self.num_bins, range=(-threshold, threshold))
File "<__array_function__ internals>", line 180, in histogram
File ".../site-packages/numpy/lib/histograms.py", line 793, in histogram
bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
File "/.../site-packages/numpy/lib/histograms.py", line 426, in _get_bin_edges
first_edge, last_edge = _get_outer_edges(a, range)
File "/.../site-packages/numpy/lib/histograms.py", line 315, in _get_outer_edges
raise ValueError(
ValueError: supplied range of [nan, nan] is not finite
```
The calibrators create an augmented model with all tensors (including
model inputs) set as model outputs. The data for outputs that are also
model inputs is corrupted as described in
https://github.com/microsoft/onnxruntime/issues/21922. The corrupted
data sometimes contains `NaN` values that cause numpy's histogram
utilities to raise an exception.
According to https://github.com/microsoft/onnxruntime/issues/20915, we
move the Gelu and LayerNorm fusion to L1 with a condition on the ONNX
opset the model imports (LayerNorm requires opset 16+ and Gelu requires
opset 20+.) If the opset version doesn't meet the requirements, the
fusion is delayed to L2 optimization since the internal contrib op
doesn't have a requirement for any specific ONNX opset.
---------
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
This is more flexible than hardcoding the provisioning profile name or UUID. The name shouldn't usually change but it is not guaranteed to remain constant.
### Description
<!-- Describe your changes. -->
[VitisAI] Add processing for
sessionOptions.AppendExecutionProvider("VitisAI", options)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
for the Float8 types with unsigned zero, we must clear the sign bit when
rounding to zero;
otherwise we end up with 0x80 which is the encoding for NAN.
### Description
Handle all zero and near-zero values the same way, rounding to positive
zero.
Note that I removed one "if" level but did not re-indent the code in
this PR, to make it
easier to see what the actual changes are.
### Motivation and Context
For the two new 8-bit floating point types Float8E4M3FNUZ and
Float8E5M2FNUZ,
converting from a near-zero negative value would end up with the sign
bit set only;
this bit pattern is not negative zero but instead means NAN.
### Description
Remove unused and confusing special constants in MLFloat16 and BFloat16
types.
### Motivation and Context
While looking at adding a specialization for std::numeric_limits for the
16-bit floating point types, I found that there are various special
constants in those types that are confusing or just wrong.
MLFLoat16::Epsilon is not an epsilon at all, but approximates "e". Looks
like a copy-paste bug.
BFloat16::Epsilon does not correspond to `numeric_limits::epsilon()`,
nor even to the C# Float.Epsilon.
Instead, it corresponds to `numeric_limits::min()` which was really
confusing to me.
The "MinValue" constants does correspond to the C# `Float.MinValue`
constant, but this is C++ so it would be better renamed to "LowestValue"
since it corresponds to `numeric_limits::lowest()`. As it was unused
except for some unit tests I have replaced it with the equivalent
`MaxValue.Negate()` here.
There's also an unused `kSignaling_NaNBits` constant which is just wrong
(has the same value as `kPositiveInfinityBits` instead of a NaN).
### Description
<!-- Describe your changes. -->
Fix typo: ai:onnx -> ai.onnx
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Typo.
### Description
Follow-up to: https://github.com/microsoft/onnxruntime/pull/21793
- Support looking past a per-axis DQ to do in-place Unsqueeze/Transpose
of initializers
- Support looking past a per-axis DQ to cancel a Transpose or Squeeze.
### Test models
For all test models, the transpose optimizer pushes a Transpose through
a Mul's input[0]. The Mul's input[1] is optionally unsqueezed and then
transposed.
### I. Test in-place unsqueeze and transpose of per-axis quantized
weight
Original model has input[1] with shape (3,)
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/37b6f60c-77d2-4bd3-8ca2-58dc7c88a304"
/>
</details>
Optimized model has input[1] with shape (1, 3, 1, 1). The initializer
was unsqueezed and transposed in-place.
<details><summary>click expand model image</summary>
<img
src="https://github.com/user-attachments/assets/adb72757-a164-400c-bfef-2a05f0e35825"
/>
</details>
### II. Test canceling existing Squeeze before per-axis DQ
Original model has input[1] that is squeezed.
<details><summary>click expand model image</summary>
<img
src="https://github.com/user-attachments/assets/f27e6742-b563-42a9-ad06-bb3178b0ceb8"
/>
</details>
Optimized model unsqueezed and transposed input[1]. The original squeeze
was removed due to the unsqueeze, leaving only the Transpose.
<details><summary>click expand model image</summary>
<img
src="https://github.com/user-attachments/assets/e56261d4-eba6-4a9f-847b-dcd33548dd07"
/>
</details>
### III. Test canceling existing Transpose before per-axis DQ
Original model has input[1] that is transposed.
<details><summary>click expand model image</summary>
<img
src="https://github.com/user-attachments/assets/f157e04a-572a-479d-8e3b-cf57954df5c0"
/>
</details>
Optimized model transposed input[1], thus canceling the existing
transpose.
<details><summary>click expand model image</summary>
<img
src="https://github.com/user-attachments/assets/63d742ce-3762-4ab2-bdb0-1b507886da9d"
/>
</details>
### IV. Test QDQ fix-up of Transpose/Unsqueeze for per-axis quantization
Original model has input[1] that can be broadcasted.
<details><summary>click expand model image</summary>
<img
src="https://github.com/user-attachments/assets/96c0092c-22ec-486d-882e-e2cb59ffe324"
/>
</details>
The main transpose optimization loop inserts float32 Unsqueeze and
Transpose after the DQ. The qdq fix-up pass inserts new per-axis Q/DQ
ops after the inserted nodes.
<details><summary>click expand model image</summary>
<img
src="https://github.com/user-attachments/assets/b6f89c11-974d-4b35-922f-11effdf06883"
/>
</details>
### Motivation and Context
Enables the TransposeOptimizer to support more models with per-axis QDQ
nodes. Per-axis quantization can improve model accuracy and is used by
EPs like QNN.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Request from Nuget team to add a better readme to the nuget package so
it is displayed nicely on nuget.org.
Previously we were using the ORT repo readme.md but that a) doesn't
display correctly due to limited markdown support on nuget.org, and b)
has a lot of irrelevant info like build pipeline status.
- Created a generic readme.md that includes the ORT description from the
main readme, includes the ORT logo via an acceptable link, and lists the
native nuget packages so the file can be included in any of them as-is.
- Updated the nuget packaging script to add the `readme` tag and use
this file.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Request from MS Nuget team to MS package owners to add.
### Description
Previous image is based on cuda 12.1 and cudnn 8, which is out of date
since we have moved to cudnn 9 since 1.19 release.
(1) Upgrade base image to cuda 12.5.1 and cudnn 9.
(2) Update CMAKE_CUDA_ARCHITECTURES from 52;60;61;70;75;86 to
61;70;75;80;86;90 to support A100 and H100
(3) Make the build faster: exclude unit test; use ninja etc.
(4) upgrade some packages (like packaging etc) before building to avoid
build error.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/21792https://github.com/microsoft/onnxruntime/issues/21532
### Description
Added some change in fuzzer project code to support linux also.
How to test on linux:
1. Make sure you have installed clang/llvm.
2. run below command to build asan instrumented project:
```
CFLAGS="-g -fsanitize=address -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --skip_tests --use_full_protobuf --parallel --fuzz_testing --build_dir build/
```
3. run fuzzer for some time, it will generate *.profraw file:
```
LLVM_PROFILE_FILE="%p.profraw" ./build/Debug/onnxruntime_security_fuzz /t /v onnxruntime/test/testdata/bart_tiny.onnx 1 m
```
4. Get the cov by running below cmd:
```
llvm-profdata merge -sparse *.profraw -o default.profdata
llvm-cov report ./build/Debug/onnxruntime_security_fuzz -instr-profile=default.profdata
```
<img width="1566" alt="Screenshot 2024-09-05 at 4 25 08 PM"
src="https://github.com/user-attachments/assets/2aa0bb83-6634-4d33-b026-3535e97df431">
### Motivation and Context
1. Currently fuzzer only supports windows and MSVC, we can't generate
the code coverage using MSVC. With clang/llvm we can try and use clang
instrumentation and llvm tools like llvm-cov.
2. In future we can add coverage guided fuzzer (libfuzzer) in same
project. (Working on it)
### Description
<!-- Describe your changes. -->
Removed unused headers
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This would cause compile error on machine that didn't install nlohmann.
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
### Description
<!-- Describe your changes. -->
The DML CIs build native and C# as well as sign DLLs in the same CI.
Some parts of that require .net 8 and some .net 6.
Update to use .net 8 in general, and revert to .net 6 for the signing.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix packaging pipeline.
### Description
<!-- Describe your changes. -->
- Update docfx usage.
- The docfx cli is now a dotnet tool.
- Split some commands up so it's easier to debug failures
- Update to .net8.
- Exclude mobile targets from build as the workloads aren't available
and it doesn't change the generated documentation.
- The mobile specific APIs (e.g. enable CoreML EP) still exist in this
case as we check in the implementation if it's valid to use them or not,
so the workloads are not required to generate complete API
documentation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix doc gen.
### Description
Fix one MSVC warning member not initialized
```
Warning C26495 Variable 'onnxruntime::ITuningContext::allocators_' is uninitialized. Always initialize a member variable (type.6). C:\code\onnxruntime\onnxruntime\core\framework\tuning_context.h 22
```
### Description
<!-- Describe your changes. -->
#21618
This PR optimizes grouped conv by 1) more sequential memory access in
gpu 2) reusing input's data to reduce global memory access times.
See `Conv|GroupedConv` op in
[Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) becomes
92 ms from 1058 ms on iGPUs with 32 EU.
For the whole model on my iGPUs with 32 EU,
wav2vec2 model becomes 982ms from 1942 ms.
squeezebert-uncased model becomes 71.86ms from 431.77ms.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Update various test projects to .net8 from EOL frameworks.
Replace the Xamarin based Android and iOS test projects with a MAUI
based project that uses .net 8.
Add new CoreML flags to C# bindings
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Remove usage of EOL frameworks.
### Description
<!-- Describe your changes. -->
Update C# test package dependencies to match #21913
This csproj isn't included in the main sln and was overlooked. We need
the newer xunit version for Assert.Fail which is used in shared unit
test source that is included here as well.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CI failure
### Description
Rename ios_packaging.requirements.txt to ios_packaging/requirements.txt
### Motivation and Context
By doing this, the package within os_packaging/requirements.txt can be
scanned by CG task
### Description
<!-- Describe your changes. -->
Fix bugs in previous implementation and add more situations to go the
optimized path.
Below situations will go to the optimized path.
1. 2d inputs or squeezed 2d inputs
2. channels last or channels first transpose. For example, channel last
transpose: [1, 256, 512, 512] -> [1, 512, 512, 256]
For this case, the transpose becomes [256, 512x512] -> [512x512, 256]
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
For SD Turbo demo, the total transpose time becomes 39.98ms from
122.09ms. And the correspnding percents becomes 3.89% from 11.05% in
this demo.
This PR will also help #21618, the total transpose time in that demo
becomes 17.32 ms from 70.25 ms on my iGPUs.
### Description
<!-- Describe your changes. -->
Register for custom op when testing the performance
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is needed for providers to test their implementation
### Description
<!-- Describe your changes. -->
VitisAI bug fixes in model clone
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
- Remove redundant `OnnxruntimeModuleExampleE2ETest CheckOutputComponentExists` test
- Attempt to close any Application Not Responding (ANR) dialog prior to running Android test
- Add `--take-screenshots failing` option to detox test commands to save screenshots on failure
### Description
<!-- Describe your changes. -->
Change the .data path so it is on the same path as the model path.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This would fix the issue if a model has .data file, the executable can't
read the data if the model is in another directory.
### Description
The function find_cudnn_supported_cuda_versions is not used anymore.
Remove it.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Calling Split API Calls Read+Model in lieu of unified Compile Model call
for export compile flow to ensure memory optimization. Freeing up model
proto and serialized string and read model ov ir later to free up memory
for the ahead pipeline
Optimization during EpCtxt flow
All the Graph related operations require all the Node Attributes to be
set while dealing with model instances internally with them, in the
existing implementation these attributes make a copy when constructing a
Graph dynamically during runtime.
Propose to use these attributes in place without creating a copy to
avoid memory allocation / copy while calling these Graph related
functions.
Changes to ensure the bug fixes related to openvino version and epctxt
file path.
Moving Compiler version to C++20 for getting r-value mem optimizations
benefit
### Motivation and Context
This change is required because memory optimization during Compilation
flow is too high.
---------
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
### Description
The EP wasn't added in session option in onnxruntime_test_all.
### Motivation and Context
After this PR
onnxruntime_test_all --gtest_filter=\*xnnpack\*maxpool\* can step into
8c5336449d/onnxruntime/core/providers/xnnpack/nn/max_pool.cc (L209)
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
revert forceinline for MakeString.
This change reverts https://github.com/microsoft/onnxruntime/pull/21893.
The forceinline was introduced for performance considerations, however
it turns out to have some notable binary size increase, which is a
concern for some binary size sensitive platforms like Android.
I made a few tests locally and found it is not related to whether or not
have used the template struct `if_char_array_make_ptr_t` trick. So I
have to revert this back.
### Description
<!-- Describe your changes. -->
Update some testing dependencies.
Fix various warnings. Mainly around documentation (existing) and unit
test usage (mainly resulting from xunit update).
Invalid angle brackets for generics in documentation were changed to use
curly braces based on
https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/xmldoc/
> To refer to generic identifiers in code reference (cref) elements, you
can use either the escape characters (for example, cref="List<T>")
or braces (cref="List{T}"). As a special case, the compiler parses the
braces as angle brackets to make the documentation comment less
cumbersome to the author when referring to generic identifiers.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Files signature validation after signed by ESRP.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- Add validation after the ESRP process.
- Make sure the targeting pattern/suffix files are signed successfully
by ESRP.
- If the signature is not Valid, then will fail the following stages.
### Description
Vitis AI EP's custom op are completely self contained within Vitis AI EP
implementation (rather than needing to add static functions in
provider_bridge).
---------
Co-authored-by: liumingyue <mingyue@xilinx.com>