11997 commits

Author SHA1 Message Date
Michael Tyler
904b850b44
Update Arm Compute Library Execution Provider (#22032)
### Description
This PR makes the following updates to the Arm Compute Library execution
provider:

- Target Arm Compute Library 24.07  
- Add support for the following operators: 
  - Conv (FP16) 
  - NhwcConv 
  - QLinearConv 
  - MatMul 
  - FusedMatMul 
  - MatMulIntegerToFloat 
- Optimize memory usage and performance
- Expose the enable_fast_math setting 
- Use the main runtime thread pool 



### Motivation and Context
These updates improve performance and memory usage, and enable use of a
more recent version of Arm Compute Library.

@microsoft-github-policy-service agree company="Arm Ltd"

---------

Signed-off-by: Michael Tyler <michael.tyler@arm.com>
2024-09-12 20:51:59 -07:00
Adam Pocock
22437b581b
[java] Fix for OnnxTensor creation when passing in a ByteBuffer containing elements of a different type (#21774)
### Description
Fixes a bug where the buffer offset and position were incorrectly
computed if the user supplied a `ByteBuffer` to `createTensor` but set
the type of the tensor to something other than `INT8`. This would be
more common if the user was trying to load the initializers from a
serialized representation and didn't want to bother with the type
information (which is the case in #21321).

### Motivation and Context
Partial fix for #21321. The remainder of the fix is to add a helper
which allows users to load initializers out of an `onnx_data` file, but
that will require adding protobuf as a dependency for the Java API to
allow the parsing of an ONNX file separately from the native code. It
might be nicer to put that functionality into ORT's C API so it can
return the lengths & offsets of the initializers when provided with an
ONNX file containing external initializers. We hit this kind of thing in
Java more often than other languages as in Java models can be supplied
as classpath resources which we can easily read, but not materialize on
disk for the ORT native library to read.
2024-09-13 12:38:17 +10:00
Adrian Lizarraga
f7bf5a19ba
[QNN EP] Ensure QNN EP rejects nodes with I/O of dynamic shape (#22066)
### Description
Updates QNN EP to properly reject nodes that have inputs or outputs with
dynamic shapes.


### Motivation and Context
Currently, QNN EP does not properly offload subgraphs with dynamic
shapes to the CPU EP. This PR ensures that QNN EP rejects nodes that
consume or generate I/O with dynamic shapes.
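
The rejection rule can be pictured with a small sketch. It is not QNN EP code: the `Node` type, the helper names, and the encoding of a dynamic dimension (`None`, a symbolic name, or a negative value) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

# Assumed encoding: a non-negative int is a static dimension; anything else
# (None, a symbolic name such as "batch", or a negative value) is dynamic.
Dim = Union[int, str, None]
Shape = Optional[List[Dim]]  # None means the shape itself is unknown


@dataclass
class Node:
    """Hypothetical stand-in for a graph node with typed I/O shapes."""
    name: str
    input_shapes: List[Shape] = field(default_factory=list)
    output_shapes: List[Shape] = field(default_factory=list)


def has_dynamic_shape(shape: Shape) -> bool:
    """Return True if the shape is unknown or has any non-static dimension."""
    if shape is None:
        return True
    return any(not isinstance(dim, int) or dim < 0 for dim in shape)


def is_node_supported(node: Node) -> bool:
    """Accept a node only if every input and output shape is fully static."""
    all_shapes = node.input_shapes + node.output_shapes
    return not any(has_dynamic_shape(shape) for shape in all_shapes)
```

Under this sketch, a node whose output is `["batch", 3, 224, 224]` is rejected and would be left for another EP such as the CPU EP.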
2024-09-12 17:18:50 -07:00
mingyueliuh
55ab13e7ca
[VitisAI] support memory buffer contains the TensorProto external data (#22042)
### Description
Extend VitisAI EP `tensor_proto_as_raw` API to support memory buffer
containing the TensorProto external data


### Motivation and Context
To reduce peak memory usage, VitisAI EP needs to support ORT format
models and the session option
`session.use_ort_model_bytes_for_initializers`, which makes the
initializers use the model bytes directly.

Co-authored-by: mingyue <mingyue@xilinx.com>
2024-09-12 16:23:09 -07:00
0xdr3dd
5c361106e6
[Fuzzer] Add two new ORT libfuzzer (Linux clang support for now) (#22055)
### Description
This PR adds two new libFuzzer-based fuzzers to the fuzzer project:
1. Binary libfuzzer 
2. libprotobuf-fuzzer

To compile, run the command below on Linux:
```
LLVM_PROFILE_FILE="%p.profraw" CFLAGS="-g -fsanitize=address,fuzzer-no-link -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address,fuzzer-no-link -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --use_full_protobuf  --parallel --fuzz_testing --build_dir build/
```
Run fuzzer:
```
LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) build/Debug/onnxruntime_libfuzzer_fuzz  testinput -rss_limit_mb=8196 -max_total_time=472800 -fork=2 -jobs=4 -workers=4 -ignore_crashes=1 -max_len=2097152 2>&1 | grep -v "\[libprotobuf ERROR"
```


### Motivation and Context
The existing custom fuzzer is not coverage guided, is slow, and works on
only one model mutation at a time. The new fuzzers are coverage
guided, and we can use more model files as a corpus to increase the
coverage.
2024-09-12 11:50:34 -07:00
wangshuai09
d539c27de8
Fix version check for using -mavxvnni (#21616)
### Description
Require `CMAKE_CXX_COMPILER_VERSION` to be greater than `11` before using
`-mavxvnni`.


### Motivation and Context


Building with `gcc (GCC) 10.3.1` fails with:

`CMakeFiles/onnxruntime_mlas.dir/root/Git.d/onnxruntime/onnxruntime/core/mlas/lib/x86_64/QgemmU8S8KernelAvx2.S.o
cc: error: unrecognized command-line option ‘-mavxvnni’; did you mean
‘-mavx512vnni’?`

`-mavxvnni` is supported since the [GCC 11
release](https://gcc.gnu.org/gcc-11/changes.html), so this PR changes the
version check.
2024-09-12 11:42:17 -07:00
Clément Péron
10883d7997
Suppress GCC warning in TreeEnsembleAggregator (#22062)
### Description
When building with GCC 14.2.1, I got the following warning:

onnxruntime/core/providers/cpu/ml/tree_ensemble_aggregator.h:329:59:
error: template-id not allowed for constructor in C++20
[-Werror=template-id-cdtor]

Remove template parameters from the constructor: The constructor
TreeAggregatorMax<InputType, ThresholdType, OutputType> has been
simplified to TreeAggregatorMax, because the compiler already knows the
template parameters from the class definition.

### Motivation and Context
Fix the build issue

Signed-off-by: Clément Péron <peron.clem@gmail.com>
2024-09-12 19:46:27 +02:00
Yulong Wang
84f73327f5
allow scalar axes for Unsqueeze for WebGPU (#22054)
### Description

Align with CPU behavior.


https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/tensor/unsqueeze.cc#L60-L62
2024-09-12 10:33:37 -07:00
mindest
951b1b7160
[CI] Linux ROCm CI Pipeline: fix error, set trigger rules. (#22069)
### Description
* Correct the wrong EP name for ROCm, fix CI error.
* Update `set-trigger-rules.py`.
* Modify the .yml via `set-trigger-rules.py`
2024-09-12 09:54:32 -07:00
Yi Zhang
ae39c40e5b
fix typo in iOS pipeline (#22067)
### Description



### Motivation and Context
The parameter isn't correct.
It may only be by chance that it has had no negative impact so far.

d8e64bb529/cmake/CMakeLists.txt (L1712-L1717)
2024-09-12 19:07:42 +08:00
Prathik Rao
d495e6cf1c
adds support for Uint8ClampedArray (#21985)
Fixes https://github.com/microsoft/onnxruntime/issues/21753
2024-09-11 22:02:30 -07:00
Lennart Hannink
d8e64bb529
Refactor CoreMLExecution to C++ bridge class (#21857)
Refactor Objective-C++ class `CoreMLExecution` into existing C++ bridge class `onnxruntime::coreml::Execution`.
2024-09-11 16:05:37 -07:00
sfatimar
0309c5f02f
Ovep release lnl 1.2.1 (#22027)
Error codes are added to catch compilation errors and signal a recompile.
Remote Tensors are added to ensure direct memory access for NPU
inferencing.
UMD Bypass cache, enabled with 2024.4, will eliminate the need for disk
caching.
### Motivation and Context
The changes are needed to ensure backward compatibility.
UMD Bypass caching eliminates driver caching.
Remote Tensors lead to a performance improvement when inferencing on the NPU.

---------

Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
2024-09-11 14:55:40 -07:00
Jagadish Krishnamoorthy
b800328628
[ROCm EP/ MIGraphx EP] matmul_nbits: Use GPU_WARP_SIZE_HOST for host side code (#22045)
### Description
For ROCm devices, the host side code needs to call GPU_WARP_SIZE_HOST to
query the warpSize of the underlying GPU device.

### Motivation and Context
Fixes MatMulNBits tests on gfx1100/01, which have a warpSize of 32.

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
2024-09-11 14:52:18 -07:00
Bin Miao
4d82404544
[WebNN EP] Support GRU operator (#20405)
This PR adds support for the GRU operator to the WebNN EP.
@Honry ,  @fdwr thanks!
2024-09-11 14:16:36 -07:00
Xavier Dupré
91c916f9c6
Improve hash_function used by TreeEnsemble (#22043)
### Description
`unordered_map` is implemented differently in Visual Studio and
gcc. Inserting consecutive keys appears to perform poorly on
Windows.



### Motivation and Context
Improve the performance of onnxruntime when initializing trees.
2024-09-11 10:41:04 -07:00
Yi-Hong Lyu
e91ff9438b
Enable Pad->Conv(no pads) fusion (#22001)
### Description


### Motivation and Context
Some models contain the pattern Pad -> Conv. If the Conv doesn't have a pads
attribute, the Pad can be fused into the Conv.
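
A minimal sketch of how the padding could be merged, assuming NCHW-style layout, a constant-mode Pad with value 0, and the ONNX ordering of `pads` (all begin values, then all end values). The function name `fuse_pad_into_conv` is hypothetical, and this is not the optimizer's actual implementation.

```python
from typing import List, Optional


def fuse_pad_into_conv(pad_pads: List[int],
                       conv_pads: Optional[List[int]]) -> Optional[List[int]]:
    """Return the merged Conv pads, or None if the fusion is not valid.

    pad_pads covers every axis (N, C, then spatial): begins first, then ends.
    conv_pads covers spatial axes only; None means the attribute is absent.
    """
    rank = len(pad_pads) // 2
    spatial = rank - 2  # skip the batch (N) and channel (C) axes
    if spatial < 1 or len(pad_pads) != 2 * rank:
        return None

    begins, ends = pad_pads[:rank], pad_pads[rank:]
    # Conv cannot express padding on N or C, and negative pads mean cropping.
    if any(begins[:2]) or any(ends[:2]) or any(p < 0 for p in pad_pads):
        return None

    if conv_pads is None:
        conv_pads = [0] * (2 * spatial)  # missing attribute: no padding
    if len(conv_pads) != 2 * spatial:
        return None

    merged_begins = [c + p for c, p in zip(conv_pads[:spatial], begins[2:])]
    merged_ends = [c + p for c, p in zip(conv_pads[spatial:], ends[2:])]
    return merged_begins + merged_ends
```

For a 4-D input, Pad pads of `[0, 0, 1, 2, 0, 0, 3, 4]` and a Conv without a pads attribute give Conv pads of `[1, 2, 3, 4]`.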
2024-09-11 09:54:15 -07:00
Julius Tischbein
20d94648bb
ConvTranpose using CUDNN Frontend with NHWC support (#21752)
### Description
Added CUDNN Frontend and used it for NHWC ConvTranspose op including
option for bias fusion. Similar to this [Conv
PR](https://github.com/microsoft/onnxruntime/pull/19470)

### Backward compatible
If ORT is built with cuDNN 8, the cuDNN frontend will not be built into
the binary. The old kernels (using cuDNN backend APIs) are used.

### Major Changes
For cuDNN 9, we enable the cuDNN frontend to fuse data gradient
convolution and bias when the provider option fuse_conv_bias=1 is set.

### Potential Issues
cuDNN frontend uses TF32 by default. It can be disabled using the use_tf32
CUDA provider option, but if the cuDNN frontend encounters issues
building an operation graph, it will fall back to using TF32.

### Follow ups
This is one of the PRs that aim to enable NHWC by default in the CUDA EP
when the device supports it; this one covers the ConvTranspose operation.
Other changes will follow to make that possible:
(1) Enable prefer_nhwc by default for device with sm >= 70.
(2) Change fuse_conv_bias=1 by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).

### Motivation and Context
The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution data gradient operation (ConvTranspose) with the
pointwise bias operation.

### Minor Change
The CUDA convolution operation had a small bug when
`GetCudnnConv1dPadToNc1d` was enabled.
2024-09-10 16:51:00 -07:00
PARK DongHa
f633caa0b1
Create CMake option onnxruntime_USE_VCPKG (#21348)
### Changes

1. CMake option `onnxruntime_USE_VCPKG`. It will be used in the vcpkg
   port.
   * Unit tests may fail because this option leads to a mixture of
     unexpected external library versions. In particular, the ONNX,
     Protobuf, and Flatbuffers versions can differ.
2. Overhaul of `onnxruntime_external_deps.cmake`
   * Make `FetchContent_Declare` try `find_package`.
     See
     https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
   * Relocated `FetchContent_Declare` and `FetchContent_MakeAvailable` (or
     `onnxruntime_fetchcontent_makeavailable`) to be closer together.
     It was too hard to navigate the entire file to find related
     sections.
   * Alias `IMPORTED` targets like build targets (e.g. `ONNX::onnx` -->
     `onnx`)

```cmake
# The script uses `find_package` with the changes.
# In this case, use vcpkg to search dependencies
# See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
include(external/onnxruntime_external_deps.cmake)
```

3. Create CMakePresets.json and presets to [run vcpkg in manifest
mode](https://learn.microsoft.com/en-us/vcpkg/concepts/manifest-mode)
   * Currently, it's NOT for training build
   * Main triplets are `x64-windows` and `x64-osx`

```pwsh
Push-Location "cmake"
    cmake --preset "x64-windows-vcpkg"
    cmake --build --preset "x64-windows-vcpkg-debug"
Pop-Location
```
```bash
pushd "cmake"
    cmake --preset "x64-osx-vcpkg"
    cmake --build --preset "x64-osx-vcpkg-debug"
popd
```

4. Updated tools/ci_build/build.py
   * `--use_vcpkg` option: it needs `CMAKE_TOOLCHAIN_FILE` with the
     [vcpkg.cmake toolchain
     script](https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake)
   * `--compile_no_warning_as_error` is recommended because library version
     differences will cause unexpected compiler warnings

```bash
python ./tools/ci_build/build.py \
    --compile_no_warning_as_error \
    --use_vcpkg \
    --cmake_extra_defines "CMAKE_TOOLCHAIN_FILE:FILEPATH=${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" \
    --cmake_extra_defines "VCPKG_TARGET_TRIPLET=..."
```

5. Created Job `Vcpkg` for Windows and macOS
   * Show how to setup and use vcpkg.  
     Similar to the CMakePresets.json usage

### Motivation and Context

* Help #7150
* Help https://github.com/microsoft/vcpkg/pull/36850
   * https://github.com/luncliff/vcpkg-registry/pull/212
   * https://github.com/microsoft/vcpkg/pull/39881
* https://github.com/luncliff/vcpkg-registry/pull/215
   * https://github.com/luncliff/vcpkg-registry/pull/216
   * https://github.com/luncliff/vcpkg-registry/pull/227
* https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
* https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake

### Future Works?

More feature coverage with the vcpkg supported libraries

* CUDA feature support
* Training feature support
2024-09-10 16:39:27 -07:00
kunal-vaishnavi
c5418f35d4
Add fusions for re-designed Phi-3 vision and Phi-3.5 vision ONNX models (#22026)
### Description
This PR adds the optimizer logic to fuse the newly designed exported
ONNX models for Phi-3 vision and Phi-3.5 vision.

### Motivation and Context
After the re-designed export of Phi-3 vision and Phi-3.5 vision, the
ONNX models for the vision component and embedding component contain
`If` and `Loop` ops to handle multi-image support.
2024-09-10 16:18:05 -07:00
dependabot[bot]
19954decaf
Bump body-parser from 1.20.2 to 1.20.3 in /js/web (#22044)
2024-09-10 23:05:44 +00:00
jingyanwangms
4a5d66c15f
Default value 10.2->10.3 in linux-gpu-tensorrt-daily-perf-pipeline.yml (#21823)
### Description
Fix default value 10.2->10.3 in
linux-gpu-tensorrt-daily-perf-pipeline.yml

### Motivation and Context
2024-09-10 15:26:16 -07:00
George Wu
31ae11788a
[QNN EP] Update QNN SDK to 2.26 (#22037)
* update default QNN SDK version to 2.26
* enable layernorm implicit bias workaround for QNN 2.26
* update artifact names for py win arm64 and arm64ec to re-enable
ort-qnn-nightly arm64 python packages
2024-09-10 14:03:06 -07:00
Sophie Schoenmeyer
e7107f41de
Decrease API docs artifact retention days (#22003)
### Description
When API docs workflows fail, we typically don't catch the issue until
the most recently generated artifact expires. The current artifact
retention is 60 days, so by decreasing to 30 days, we can ensure that
we're resolving the workflow failures more quickly.



### Motivation and Context
2024-09-10 10:44:08 -07:00
Erick Muñoz
7489bfee53
Enable AVX NE CONVERT for FP16 to FP32 cast (#21183)
### Description
Implementation of a new cast assembly kernel that uses AVX_NE_CONVERT
instructions to accelerate casting from FP16 to FP32. Added CPUID checks
to determine support of the ISA.

### Motivation and Context
Currently, FP16 models executed on systems that lack complete FP16
operator support run every node in single precision. This means the
original FP16 weights have to be cast to FP32 in order to run the model
properly. This change aims to accelerate the casting by using upconvert
instructions and therefore improve performance.
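
For reference, the result such a cast has to produce can be written as a scalar sketch. This is not the assembly kernel from the PR; it only uses Python's `struct` half-precision format (`e`) to show the expected values, and the function name is hypothetical.

```python
import struct
from typing import List


def fp16_bits_to_fp32(bits: List[int]) -> List[float]:
    """Reinterpret 16-bit patterns as IEEE half floats and widen them.

    Every FP16 value is exactly representable in FP32 (and in Python's
    float), so the widening step is lossless.
    """
    packed = struct.pack(f"<{len(bits)}H", *bits)
    return list(struct.unpack(f"<{len(bits)}e", packed))
```

For example, the bit patterns `0x3C00` and `0xC000` widen to `1.0` and `-2.0`.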
2024-09-09 21:19:31 -07:00
Jake Mathern
d4d419f789
fix more dml warnings (#21980)
### Description
Fixes more warnings in DML execution provider that lead to security
issues in binskim


### Motivation and Context
OS components that include ORT must treat certain warnings as errors,
and cannot disable critical compiler warnings

https://github.com/microsoft/binskim/blob/main/src/BinSkim.Rules/PERules/BA2007.EnableCriticalCompilerWarnings.cs
2024-09-09 17:50:17 -07:00
Jian Chen
93c4c9cb6a
Using wostringstream only on Windows (#21938)
### Description
Using wostringstream only on Windows



### Motivation and Context
As of line
[62](https://github.com/microsoft/onnxruntime/pull/21938/files#diff-47776d020ac08134de4059eab473550237f4999c598ab56afad3676d2f193edcR62),
`stream_` can currently be either `wostringstream` or `ostringstream`
depending on the OS. However, on Unix-like systems, `stream_` should
always be `ostringstream`.
2024-09-09 13:20:17 -07:00
Adrian Lizarraga
c7ae9b977a
[Quantization] Apply workaround for crash when using histogram-based calibrators (#21972)
### Description
- Applies a workaround that prevents the histogram-based calibrators
(percentile, entropy, distribution) from crashing. The workaround
involves copying inference outputs that come directly from model inputs.
A description of the bug is here:
https://github.com/microsoft/onnxruntime/issues/21922. **This PR does
not fix the root bug, but instead provides a workaround to _unblock_
users using histogram-based calibration.**
- Adds a unit test that runs all histogram-based calibrators to help
catch future regressions. We didn't have unit tests that ran these
calibration methods.

### Motivation and Context
Trying to quantize a model with the percentile, entropy, or distribution
calibration methods raises an exception:
```shell
  File "/.../site-packages/onnxruntime/quantization/quantize.py", line 691, in quantize
    quantize_static(
  File "/.../site-packages/onnxruntime/quantization/quantize.py", line 525, in quantize_static
    calibrator.collect_data(calibration_data_reader)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 571, in collect_data
    self.collector.collect(clean_merged_dict)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 746, in collect
    return self.collect_value(name_to_arr)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 836, in collect_value
    hist, hist_edges = np.histogram(data_arr, self.num_bins, range=(-threshold, threshold))
  File "<__array_function__ internals>", line 180, in histogram
  File ".../site-packages/numpy/lib/histograms.py", line 793, in histogram
    bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
  File "/.../site-packages/numpy/lib/histograms.py", line 426, in _get_bin_edges
    first_edge, last_edge = _get_outer_edges(a, range)
  File "/.../site-packages/numpy/lib/histograms.py", line 315, in _get_outer_edges
    raise ValueError(
ValueError: supplied range of [nan, nan] is not finite
```

The calibrators create an augmented model with all tensors (including
model inputs) set as model outputs. The data for outputs that are also
model inputs is corrupted as described in
https://github.com/microsoft/onnxruntime/issues/21922. The corrupted
data sometimes contains `NaN` values that cause numpy's histogram
utilities to raise an exception.
2024-09-09 12:05:41 -07:00
Peishen Yan
2cdc05f189
Move Gelu and LayerNorm fusion to L1 optimization (#21332)
According to https://github.com/microsoft/onnxruntime/issues/20915, we
move the Gelu and LayerNorm fusion to L1 with a condition on the ONNX
opset the model imports (LayerNorm requires opset 16+ and Gelu requires
opset 20+.) If the opset version doesn't meet the requirements, the
fusion is delayed to L2 optimization since the internal contrib op
doesn't have a requirement for any specific ONNX opset.

---------

Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-09-09 13:27:52 +10:00
Yi Zhang
de7a02beef
Add parameter for flexdonwload (#22009)
### Description



### Motivation and Context
Thus, we can run Nuget_Packaging_GPU stage directly
2024-09-08 14:17:55 +08:00
Wanming Lin
ad9afbb042
[WebNN EP] Remove workaround for CPU op supported list (#21962)
We assume all WebNN ops are supported across all backends.
2024-09-06 22:14:52 -07:00
Edward Chen
f3725b9f06
Use output variable from InstallAppleProvisioningProfile task to set provisioning profile UUID. (#22018)
This is more flexible than hardcoding the provisioning profile name or UUID. The name shouldn't usually change but it is not guaranteed to remain constant.
2024-09-06 18:00:34 -07:00
zz002
28b550f091
[VitisAI] Add processing for sessionOptions.AppendExecutionProvider("VitisAI", options) (#21839)
### Description

[VitisAI] Add processing for
sessionOptions.AppendExecutionProvider("VitisAI", options)

### Motivation and Context

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-09-06 14:06:33 -07:00
Arne H Juul
493159b481
near-zero negative values must convert to 0 not NAN (#18473)
for the Float8 types with unsigned zero, we must clear the sign bit when
rounding to zero;
otherwise we end up with 0x80 which is the encoding for NAN.

### Description
Handle all zero and near-zero values the same way, rounding to positive
zero.
Note that I removed one "if" level but did not re-indent the code in
this PR, to make it
easier to see what the actual changes are.

### Motivation and Context
For the two new 8-bit floating point types Float8E4M3FNUZ and
Float8E5M2FNUZ,
converting from a near-zero negative value would end up with the sign
bit set only;
this bit pattern is not negative zero but instead means NAN.
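
The encoding issue can be sketched as follows. This is an illustration rather than the ORT conversion code: `pack_fnuz` and `decode_e4m3fnuz` are hypothetical helpers, and the decoder assumes the Float8E4M3FNUZ layout of 1 sign bit, 4 exponent bits with bias 8, and 3 mantissa bits.

```python
import math

NAN_BITS = 0x80  # in the FNUZ formats, the sign bit alone encodes NaN


def pack_fnuz(sign: int, magnitude: int) -> int:
    """Combine a sign bit with 7 already-rounded magnitude bits.

    If rounding produced a zero magnitude, the sign must be dropped:
    keeping it would yield 0x80, which is NaN rather than negative zero.
    """
    if magnitude == 0:
        return 0x00
    return ((sign & 1) << 7) | (magnitude & 0x7F)


def decode_e4m3fnuz(byte: int) -> float:
    """Decode one Float8E4M3FNUZ byte (assumed layout, see lead-in)."""
    if byte == NAN_BITS:
        return math.nan
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0:  # subnormal values, including the single zero
        return sign * (mantissa / 8.0) * 2.0 ** (1 - 8)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 8)
```

With these helpers, a negative value that rounds to a zero magnitude is packed as `0x00` and decodes to `0.0`, whereas naively keeping the sign would have produced `0x80`, which decodes to NaN.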
2024-09-06 11:41:48 -07:00
Arne H Juul
605a84ffc9
remove unused and confusing float16 constants (#21999)
### Description
Remove unused and confusing special constants in MLFloat16 and BFloat16
types.

### Motivation and Context
While looking at adding a specialization for std::numeric_limits for the
16-bit floating point types, I found that there are various special
constants in those types that are confusing or just wrong.

MLFloat16::Epsilon is not an epsilon at all, but approximates "e". Looks
like a copy-paste bug.
BFloat16::Epsilon does not correspond to `numeric_limits::epsilon()`,
nor even to the C# Float.Epsilon.
Instead, it corresponds to `numeric_limits::min()` which was really
confusing to me.

The "MinValue" constant does correspond to the C# `Float.MinValue`
constant, but this is C++, so it would be better named "LowestValue"
since it corresponds to `numeric_limits::lowest()`. As it was unused
except for some unit tests, I have replaced it with the equivalent
`MaxValue.Negate()` here.

There's also an unused `kSignaling_NaNBits` constant which is just wrong
(has the same value as `kPositiveInfinityBits` instead of a NaN).
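
To make the distinctions concrete, here is a small sketch for IEEE half precision using Python's `struct` format `e`. The constant and helper names are illustrative and are not the ones used in ORT; the values follow from the binary16 layout (5 exponent bits, 10 mantissa bits).

```python
import math
import struct


def fp16_from_bits(bits: int) -> float:
    """Interpret a 16-bit pattern as an IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def fp16_to_bits(value: float) -> int:
    """Return the 16-bit pattern of a value representable in half precision."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


FP16_EPSILON = 2.0 ** -10      # numeric_limits::epsilon(): gap between 1.0 and the next value
FP16_MIN_NORMAL = 2.0 ** -14   # numeric_limits::min(): smallest positive normal value
FP16_MAX = 65504.0             # numeric_limits::max()
FP16_LOWEST = -FP16_MAX        # numeric_limits::lowest(): most negative finite value

POSITIVE_INFINITY_BITS = 0x7C00  # exponent all ones, mantissa zero
EXAMPLE_NAN_BITS = 0x7C01        # exponent all ones, mantissa non-zero
```

Decoding `POSITIVE_INFINITY_BITS` gives infinity, while any pattern with an all-ones exponent and a non-zero mantissa, such as `EXAMPLE_NAN_BITS`, decodes to NaN. That is why a NaN constant cannot share the infinity bit pattern.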
2024-09-05 22:00:48 -07:00
Edward Chen
970ebc2ccf
Fix typo in coreml_supported_mlprogram_ops.md (#22004)
### Description

Fix typo: ai:onnx -> ai.onnx

### Motivation and Context

Typo.
2024-09-06 12:50:56 +10:00
Edward Chen
0c398b3e52
Update Android NDK version to 27.0.12077973. (#21989)
Upgrade to newer version. r26 will be unsupported soon.
2024-09-05 17:57:24 -07:00
Adrian Lizarraga
b011f6fbf6
[TransposeOptimizer] Support Unsqueeze/Transpose of input consumed by per-axis DQ (#21821)
### Description
Follow-up to: https://github.com/microsoft/onnxruntime/pull/21793

- Support looking past a per-axis DQ to do in-place Unsqueeze/Transpose
of initializers
- Support looking past a per-axis DQ to cancel a Transpose or Squeeze.

### Test models
For all test models, the transpose optimizer pushes a Transpose through
a Mul's input[0]. The Mul's input[1] is optionally unsqueezed and then
transposed.

### I. Test in-place unsqueeze and transpose of per-axis quantized
weight
Original model has input[1] with shape (3,)
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/37b6f60c-77d2-4bd3-8ca2-58dc7c88a304"
/>
</details>

Optimized model has input[1] with shape (1, 3, 1, 1). The initializer
was unsqueezed and transposed in-place.
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/adb72757-a164-400c-bfef-2a05f0e35825"
/>
</details>

### II. Test canceling existing Squeeze before per-axis DQ
Original model has input[1] that is squeezed.
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/f27e6742-b563-42a9-ad06-bb3178b0ceb8"
/>
</details>

Optimized model unsqueezed and transposed input[1]. The original squeeze
was removed due to the unsqueeze, leaving only the Transpose.
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/e56261d4-eba6-4a9f-847b-dcd33548dd07"
/>
</details>

### III. Test canceling existing Transpose before per-axis DQ
Original model has input[1] that is transposed.
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/f157e04a-572a-479d-8e3b-cf57954df5c0"
/>
</details>

Optimized model transposed input[1], thus canceling the existing
transpose.
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/63d742ce-3762-4ab2-bdb0-1b507886da9d"
/>
</details>

### IV. Test QDQ fix-up of Transpose/Unsqueeze for per-axis quantization
Original model has input[1] that can be broadcasted.
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/96c0092c-22ec-486d-882e-e2cb59ffe324"
/>
</details>

The main transpose optimization loop inserts float32 Unsqueeze and
Transpose after the DQ. The qdq fix-up pass inserts new per-axis Q/DQ
ops after the inserted nodes.
<details><summary>click to expand model image</summary>
<img
src="https://github.com/user-attachments/assets/b6f89c11-974d-4b35-922f-11effdf06883"
/>
</details>


### Motivation and Context
Enables the TransposeOptimizer to support more models with per-axis QDQ
nodes. Per-axis quantization can improve model accuracy and is used by
EPs like QNN.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-09-05 17:26:17 -07:00
Wanming Lin
23f6604c39
[WebNN EP] Use identity for one input of Max/Min (#21974)
Now WebNN supports `identity` op, use it for `Max` and `Min` ops with
only one input.
2024-09-05 16:47:40 -07:00
Scott McKay
20c802afd4
Add better native nuget package readme (#21889)
### Description
Request from Nuget team to add a better readme to the nuget package so
it is displayed nicely on nuget.org.

Previously we were using the ORT repo readme.md but that a) doesn't
display correctly due to limited markdown support on nuget.org, and b)
has a lot of irrelevant info like build pipeline status.

- Created a generic readme.md that includes the ORT description from the
main readme, includes the ORT logo via an acceptable link, and lists the
native nuget packages so the file can be included in any of them as-is.
- Updated the nuget packaging script to add the `readme` tag and use
this file.


### Motivation and Context
Request from MS Nuget team to MS package owners to add.
2024-09-06 08:28:14 +10:00
Tianlei Wu
c7d0ded079
[CUDA] Update Dockerfile.cuda with cuda 12.5.1 and cudnn 9 (#21987)
### Description
Previous image is based on cuda 12.1 and cudnn 8, which is out of date
since we have moved to cudnn 9 since 1.19 release.
(1) Upgrade base image to cuda 12.5.1 and cudnn 9.
(2) Update CMAKE_CUDA_ARCHITECTURES from 52;60;61;70;75;86 to
61;70;75;80;86;90 to support A100 and H100
(3) Make the build faster: exclude unit test; use ninja etc.
(4) upgrade some packages (like packaging etc) before building to avoid
build error.

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/21792
https://github.com/microsoft/onnxruntime/issues/21532
2024-09-05 15:25:40 -07:00
0xdr3dd
2dae8aaced
[Fuzzer] Add fuzzer support for linux (#21996)
### Description
Changed the fuzzer project code so that it also supports Linux.

How to test on Linux:
1. Make sure you have installed clang/llvm.
2. Run the command below to build the ASan-instrumented project:
```
CFLAGS="-g -fsanitize=address -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --skip_tests --use_full_protobuf  --parallel --fuzz_testing --build_dir build/
```

3. Run the fuzzer for some time; it will generate *.profraw files:
```
LLVM_PROFILE_FILE="%p.profraw" ./build/Debug/onnxruntime_security_fuzz /t /v onnxruntime/test/testdata/bart_tiny.onnx 1 m
```
4. Get the coverage by running the commands below:
```
llvm-profdata merge -sparse *.profraw -o default.profdata
llvm-cov report ./build/Debug/onnxruntime_security_fuzz  -instr-profile=default.profdata
```

<img width="1566" alt="Screenshot 2024-09-05 at 4 25 08 PM"
src="https://github.com/user-attachments/assets/2aa0bb83-6634-4d33-b026-3535e97df431">



### Motivation and Context
1. Currently the fuzzer only supports Windows and MSVC, and we can't
generate code coverage using MSVC. With clang/llvm we can use clang
instrumentation and llvm tools like llvm-cov.
2. In the future we can add a coverage-guided fuzzer (libfuzzer) to the
same project. (Working on it)
2024-09-05 11:52:15 -07:00
Yueqing Zhang
f4d62eeb2e
[VitisAI] remove unused header (#21890)
### Description
Removed unused headers


### Motivation and Context
The unused headers would cause a compile error on machines that don't have nlohmann installed.

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2024-09-05 08:37:15 -07:00
Javier Martinez
840f896c5f
Uncomment line in OVEP that was commented out in error (#21973)
### Description
One line change to re-enable a line incorrectly commented out in an
earlier commit



### Motivation and Context
Fix issue introduced with [PR
21872](https://github.com/microsoft/onnxruntime/pull/21872#discussion_r1736744441)
2024-09-05 08:34:55 -07:00
Scott McKay
8b661f7157
Fix DML packaging CIs (#21997)
### Description
The DML CIs build native and C# as well as sign DLLs in the same CI.
Some parts of that require .net 8 and some .net 6.
Update to use .net 8 in general, and revert to .net 6 for the signing.


### Motivation and Context
Fix packaging pipeline.
2024-09-05 22:30:40 +08:00
Scott McKay
5e24c5d5f8
Fix C# doc generation workflow (#21988)
### Description
- Update docfx usage. 
  - The docfx cli is now a dotnet tool.
  - Split some commands up so it's easier to debug failures
- Update to .net8.
- Exclude mobile targets from build as the workloads aren't available
and it doesn't change the generated documentation.
- The mobile specific APIs (e.g. enable CoreML EP) still exist in this
case as we check in the implementation if it's valid to use them or not,
so the workloads are not required to generate complete API
documentation.

### Motivation and Context
Fix doc gen.
2024-09-05 13:54:17 +10:00
Yulong Wang
2e83541eba
fix one build warning in MSVC (#21983)
### Description

Fix one MSVC warning about an uninitialized member:


```
Warning	C26495	Variable 'onnxruntime::ITuningContext::allocators_' is uninitialized. Always initialize a member variable (type.6).  C:\code\onnxruntime\onnxruntime\core\framework\tuning_context.h	22		
```
2024-09-04 17:51:14 -07:00
Jiajia Qin
3580e01348
[js/webgpu] Optimize grouped conv (#21892)
### Description
#21618

This PR optimizes grouped conv by 1) more sequential memory access in
gpu 2) reusing input's data to reduce global memory access times.

The `Conv|GroupedConv` op in
[Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) drops
from 1058 ms to 92 ms on iGPUs with 32 EUs.

For the whole model on my iGPUs with 32 EUs:
- wav2vec2 drops from 1942 ms to 982 ms.
- squeezebert-uncased drops from 431.77 ms to 71.86 ms.


### Motivation and Context
2024-09-04 17:16:35 -07:00
mindest
30f07758a2
Add packaging version constraint. (#21814)
### Description
Newer `setuptools` requires newer version of `packaging`, due to
function update.

### Motivation and Context
Fixes #21792
2024-09-04 16:57:04 -07:00
Prathik Rao
ed232dc1ef
Sets enable_windows_arm64ec_qnn to false in training CI (#21981)
### Description



### Motivation and Context
2024-09-04 16:01:14 -07:00