onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-21 21:52:11 +00:00

Author	SHA1	Message	Date
Michael Tyler	904b850b44	Update Arm Compute Library Execution Provider (#22032 ) ### Description This PR makes the following updates to the Arm Compute Library execution provider: - Target Arm Compute Library 24.07 - Add support for the following operators: - Conv (FP16) - NhwcConv - QLinearConv - MatMul - FusedMatMul - MatMulIntegerToFloat - Optimize memory usage and performance - Expose the enable_fast_math setting - Use the main runtime thread pool ### Motivation and Context These updates improve performance and memory usage, and enable use of a more recent version of Arm Compute Library. @microsoft-github-policy-service agree company="Arm Ltd" --------- Signed-off-by: Michael Tyler <michael.tyler@arm.com>	2024-09-12 20:51:59 -07:00
Adam Pocock	22437b581b	[java] Fix for OnnxTensor creation when passing in a ByteBuffer containing elements of a different type (#21774 ) ### Description Fixes a bug where the buffer offset and position was incorrectly computed if the user supplied a `ByteBuffer` to `createTensor` but set the type of the tensor to something other than `INT8`. This would be more common if the user was trying to load the initializers from a serialized representation and didn't want to bother with the type information (which is the case in #21321). ### Motivation and Context Partial fix for #21321. The remainder of the fix is to add a helper which allows users to load initializers out of an `onnx_data` file, but that will require adding protobuf as a dependency for the Java API to allow the parsing of an ONNX file separately from the native code. It might be nicer to put that functionality into ORT's C API so it can return the lengths & offsets of the initializers when provided with an ONNX file containing external initializers. We hit this kind of thing in Java more often than other languages as in Java models can be supplied as classpath resources which we can easily read, but not materialize on disk for the ORT native library to read.	2024-09-13 12:38:17 +10:00
Adrian Lizarraga	f7bf5a19ba	[QNN EP] Ensure QNN EP rejects nodes with I/O of dynamic shape (#22066 ) ### Description Updates QNN EP to properly reject nodes that have inputs or outputs with dynamic shapes. ### Motivation and Context Currently, QNN EP does not properly offload subgraphs with dynamic shapes to the CPU EP. This PR ensures that QNN EP rejects nodes that consume or generate I/O with dynamic shapes.	2024-09-12 17:18:50 -07:00
mingyueliuh	55ab13e7ca	[VitisAI] support memory buffer contains the TensorProto external data (#22042 ) ### Description Extend the VitisAI EP `tensor_proto_as_raw` API to support a memory buffer containing the TensorProto external data. ### Motivation and Context To reduce peak memory usage, the VitisAI EP needs to support ORT format models and the session option `session.use_ort_model_bytes_for_initializers`, which enables using the model bytes directly for initializers. Co-authored-by: mingyue <mingyue@xilinx.com>	2024-09-12 16:23:09 -07:00
0xdr3dd	5c361106e6	[Fuzzer] Add two new ORT libfuzzer (Linux clang support for now) (#22055 ) ### Description This PR adds two new libFuzzer targets to the fuzzer project. 1. Binary libfuzzer 2. libprotobuf-fuzzer To compile, run the command below on Linux: ``` LLVM_PROFILE_FILE="%p.profraw" CFLAGS="-g -fsanitize=address,fuzzer-no-link -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address,fuzzer-no-link -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --use_full_protobuf --parallel --fuzz_testing --build_dir build/ ``` Run fuzzer: ``` LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) build/Debug/onnxruntime_libfuzzer_fuzz testinput -rss_limit_mb=8196 -max_total_time=472800 -fork=2 -jobs=4 -workers=4 -ignore_crashes=1 -max_len=2097152 2>&1 | grep -v "\[libprotobuf ERROR" ``` ### Motivation and Context The existing custom fuzzer is not coverage guided, is slow, and works on one model mutation at a time. The new fuzzers are coverage guided, and we can use more models' files as a corpus to increase the coverage.	2024-09-12 11:50:34 -07:00
wangshuai09	d539c27de8	Fix version check for using -mavxvnni (#21616 ) ### Description Require `CMAKE_CXX_COMPILER_VERSION` to be greater than `11` before using `-mavxvnni`. ### Motivation and Context Building with `gcc (GCC) 10.3.1` fails with `CMakeFiles/onnxruntime_mlas.dir/root/Git.d/onnxruntime/onnxruntime/core/mlas/lib/x86_64/QgemmU8S8KernelAvx2.S.o cc: error: unrecognized command-line option ‘-mavxvnni’; did you mean ‘-mavx512vnni’?`. `-mavxvnni` is supported since the [GCC 11 Release](https://gcc.gnu.org/gcc-11/changes.html), so this PR changes the version check.	2024-09-12 11:42:17 -07:00
Clément Péron	10883d7997	Suppress GCC warning in TreeEnsembleAggregator (#22062 ) ### Description When building with GCC 14.2.1, I got the following warning: onnxruntime/core/providers/cpu/ml/tree_ensemble_aggregator.h:329:59: error: template-id not allowed for constructor in C++20 [-Werror=template-id-cdtor] Remove template parameters from the constructor: The constructor TreeAggregatorMax<InputType, ThresholdType, OutputType> has been simplified to TreeAggregatorMax, because the compiler already knows the template parameters from the class definition. ### Motivation and Context Fix the build issue Signed-off-by: Clément Péron <peron.clem@gmail.com>	2024-09-12 19:46:27 +02:00
Yulong Wang	84f73327f5	allow scalar axes for Unsqueeze for WebGPU (#22054 ) ### Description Align with CPU behavior. https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/tensor/unsqueeze.cc#L60-L62	2024-09-12 10:33:37 -07:00
mindest	951b1b7160	[CI] Linux ROCm CI Pipeline: fix error, set trigger rules. (#22069 ) ### Description * Correct the wrong EP name for ROCm, fix CI error. * Update `set-trigger-rules.py`. * Modify the .yml via `set-trigger-rules.py`	2024-09-12 09:54:32 -07:00
Yi Zhang	ae39c40e5b	fix typo in iOS pipeline (#22067 ) ### Motivation and Context The parameter isn't correct. It may simply have had no negative impact so far by chance. `d8e64bb529/cmake/CMakeLists.txt (L1712-L1717)`	2024-09-12 19:07:42 +08:00
Prathik Rao	d495e6cf1c	adds support for Uint8ClampedArray (#21985 ) Fixes https://github.com/microsoft/onnxruntime/issues/21753	2024-09-11 22:02:30 -07:00
Lennart Hannink	d8e64bb529	Refactor CoreMLExecution to C++ bridge class (#21857 ) Refactor Objective-C++ class `CoreMLExecution` into existing C++ bridge class `onnxruntime::coreml::Execution`.	2024-09-11 16:05:37 -07:00
sfatimar	0309c5f02f	Ovep release lnl 1.2.1 (#22027 ) Error codes are added to catch compilation errors and signal a recompile. Remote Tensors are added to ensure direct memory access for NPU inferencing. UMD Bypass cache, enabled with 2024.4, eliminates the need for disk caching. ### Motivation and Context The changes are needed to ensure backward compatibility. UMD Bypass caching eliminates driver caching. Remote Tensors lead to performance improvement with inferencing on NPU. --------- Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com> Co-authored-by: saurabh <saurabh1.kale@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>	2024-09-11 14:55:40 -07:00
Jagadish Krishnamoorthy	b800328628	[ROCm EP/ MIGraphx EP] matmul_nbits: Use GPU_WARP_SIZE_HOST for host side code (#22045 ) ### Description For ROCm device, the host side code needs to call GPU_WARP_SIZE_HOST to query warpSize of the underlying GPU device. ### Motivation and Context Fixes MatMulNBits tests on gfx1100/01 which has warpSize of 32. Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>	2024-09-11 14:52:18 -07:00
Bin Miao	4d82404544	[WebNN EP] Support GRU operator (#20405 ) This PR adds support for the GRU operator in the WebNN EP. @Honry, @fdwr thanks!	2024-09-11 14:16:36 -07:00
Xavier Dupré	91c916f9c6	Improve hash_function used by TreeEnsemble (#22043 ) ### Description unordered_map are implemented in a different way on VisualStudio and gcc. It seems that inserting consecutive keys has a poor performance on Windows. ### Motivation and Context Improve the performance of onnxruntime when initializing trees.	2024-09-11 10:41:04 -07:00
Yi-Hong Lyu	e91ff9438b	Enable Pad->Conv(no pads) fusion (#22001 ) ### Motivation and Context Some models contain the pattern Pad -> Conv. If the Conv doesn't have a pads attribute, the Pad can be fused into the Conv.	2024-09-11 09:54:15 -07:00
Julius Tischbein	20d94648bb	ConvTranpose using CUDNN Frontend with NHWC support (#21752 ) ### Description Added CUDNN Frontend and used it for the NHWC ConvTranspose op, including an option for bias fusion. Similar to this [Conv PR](https://github.com/microsoft/onnxruntime/pull/19470) ### Backward compatible If ORT is built with cuDNN 8, the cuDNN frontend will not be built into the binary. Old kernels (using cudnn backend APIs) are used. ### Major Changes For cuDNN 9, we will enable the cudnn frontend to fuse data gradient convolution and bias when the provider option fuse_conv_bias=1 is set. ### Potential Issues The cuDNN frontend uses TF32 by default. It can be disabled using the use_tf32 cuda provider option, but if the cuDNN frontend encounters issues building an operation graph it will fall back to using TF32. ### Follow ups This is one of the PRs that aim to enable NHWC (here, the ConvTranspose operation) in the CUDA EP by default if the device supports it. Other changes will follow to make it possible. (1) Enable prefer_nhwc by default for device with sm >= 70. (2) Change fuse_conv_bias=1 by default after more testing. (3) Add other NHWC operators (like Resize or UpSample). ### Motivation and Context The new CUDNN Frontend library provides the functionality to fuse operations and provides new heuristics for kernel selection. Here it fuses the convolution data gradient operation (ConvTranspose) with the pointwise bias operation. ### Minor Change There was a small bug in the CUDA convolution operation when `GetCudnnConv1dPadToNc1d` was enabled.	2024-09-10 16:51:00 -07:00
PARK DongHa	f633caa0b1	Create CMake option `onnxruntime_USE_VCPKG` (#21348 ) ### Changes 1. CMake option `onnxruntime_USE_VCPKG`. It will be used in the vcpkg port * Unit test may fail because this option leads to a mixture of unexpected external library versions. Especially ONNX, Protobuf, and Flatbuffers version can be different 2. Overhaul of `onnxruntime_external_deps.cmake` * Make `FetchContent_Declare` to try `find_package`. See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html * Relocated `FetchContent_Declare` and `FetchContent_MakeAvailable`(or `onnxruntime_fetchcontent_makeavailable`) to closer lines. It was too hard to navigate the entire file to search related sections... * Alias `IMPORTED` targets like build targets (e.g. `ONNX::onnx` --> `onnx`) ```cmake # The script uses `find_package` with the changes. # In this case, use vcpkg to search dependencies # See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html include(external/onnxruntime_external_deps.cmake) ``` 3. Create CMakePresets.json and presets to [run vcpkg in manifest mode](https://learn.microsoft.com/en-us/vcpkg/concepts/manifest-mode) * Currently, it's NOT for training build * Main triplets are `x64-windows` and `x64-osx` ```pwsh Push-Location "cmake" cmake --preset "x64-windows-vcpkg" cmake --build --preset "x64-windows-vcpkg-debug" Pop-Location ``` ```bash pushd "cmake" cmake --preset "x64-osx-vcpkg" cmake --build --preset "x64-osx-vcpkg-debug" popd ``` 4. Updated tools/ci_build/build.py * `--use_vcpkg` option: it needs `CMAKE_TOOLCHAIN_FILE` with [vcpkg.cmake toolchain script](https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake) * `--compile_no_warning_as_error` is recommended because library version differences will cause unexpected compiler warnings ```bash python ./tools/ci_build/build.py \ --compile_no_warning_as_error \ --use_vcpkg \ --cmake_extra_defines "CMAKE_TOOLCHAIN_FILE:FILEPATH=${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" \ --cmake_extra_defines "VCPKG_TARGET_TRIPLET=..." ``` 5. Created Job `Vcpkg` for Windows and macOS * Show how to setup and use vcpkg. Similar to the CMakePresets.json usage ### Motivation and Context * Help #7150 * Help https://github.com/microsoft/vcpkg/pull/36850 * https://github.com/luncliff/vcpkg-registry/pull/212 * https://github.com/microsoft/vcpkg/pull/39881 * https://github.com/luncliff/vcpkg-registry/pull/215 * https://github.com/luncliff/vcpkg-registry/pull/216 * https://github.com/luncliff/vcpkg-registry/pull/227 * https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html * https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake ### Future Works? More feature coverage with the vcpkg supported libraries * CUDA feature support * Training feature support	2024-09-10 16:39:27 -07:00
kunal-vaishnavi	c5418f35d4	Add fusions for re-designed Phi-3 vision and Phi-3.5 vision ONNX models (#22026 ) ### Description This PR adds the optimizer logic to fuse the newly designed exported ONNX models for Phi-3 vision and Phi-3.5 vision. ### Motivation and Context After the re-designed export of Phi-3 vision and Phi-3.5 vision, the ONNX models for the vision component and embedding component contain `If` and `Loop` ops to handle multi-image support.	2024-09-10 16:18:05 -07:00
dependabot[bot]	19954decaf	Bump body-parser from 1.20.2 to 1.20.3 in /js/web (#22044 )	2024-09-10 23:05:44 +00:00
jingyanwangms	4a5d66c15f	Default value 10.2->10.3 in linux-gpu-tensorrt-daily-perf-pipeline.yml (#21823 ) ### Description Fix default value 10.2->10.3 in linux-gpu-tensorrt-daily-perf-pipeline.yml	2024-09-10 15:26:16 -07:00
George Wu	31ae11788a	[QNN EP] Update QNN SDK to 2.26 (#22037 ) * update default QNN SDK version to 2.26 * enable layernorm implicit bias workaround for QNN 2.26 * update artifact names for py win arm64 and arm64ec to re-enable ort-qnn-nightly arm64 python packages	2024-09-10 14:03:06 -07:00
Sophie Schoenmeyer	e7107f41de	Decrease API docs artifact retention days (#22003 ) ### Description When API docs workflows fail, we typically don't catch the issue until the most recently generated artifact expires. The current artifact retention is 60 days, so by decreasing to 30 days, we can ensure that we're resolving the workflow failures more quickly.	2024-09-10 10:44:08 -07:00
Erick Muñoz	7489bfee53	Enable AVX NE CONVERT for FP16 to FP32 cast (#21183 ) ### Description Implementation of a new cast assembly kernel that uses AVX_NE_CONVERT instructions to accelerate casting from FP16 to FP32. Added CPUID checks to determine support of the ISA. ### Motivation and Context Currently FP16 models executed on systems that lack complete FP16 operator support use single precision on every node to run the model, this means the original FP16 weights have to be casted to FP32 in order to run the model properly, this change aims to accelerate the casting by using upconvert instructions and therefore improve performance.	2024-09-09 21:19:31 -07:00
Jake Mathern	d4d419f789	fix more dml warnings (#21980 ) ### Description Fixes more warnings in DML execution provider that lead to security issues in binskim ### Motivation and Context OS components that include ORT must treat certain warnings as errors, and cannot disable critical compiler warnings https://github.com/microsoft/binskim/blob/main/src/BinSkim.Rules/PERules/BA2007.EnableCriticalCompilerWarnings.cs	2024-09-09 17:50:17 -07:00
Jian Chen	93c4c9cb6a	Using wostringstream only on Windows (#21938 ) ### Description Using wostringstream only on Windows ### Motivation and Context From line [62](https://github.com/microsoft/onnxruntime/pull/21938/files#diff-47776d020ac08134de4059eab473550237f4999c598ab56afad3676d2f193edcR62), currently, `stream_` can be either `wostringstream` or `ostringstream` depending on the OS; however, for Unix-like systems, `stream_` should be `ostringstream` instead of `wostringstream`.	2024-09-09 13:20:17 -07:00
Adrian Lizarraga	c7ae9b977a	[Quantization] Apply workaround for crash when using histogram-based calibrators (#21972 ) ### Description - Applies a workaround that prevents the histogram-based calibrators (percentile, entropy, distribution) from crashing. The workaround involves copying inference outputs that come directly from model inputs. A description of the bug is here: https://github.com/microsoft/onnxruntime/issues/21922. This PR does not fix the root bug, but instead provides a workaround to _unblock_ users using histogram-based calibration. - Adds a unit test that runs all histogram-based calibrators to help catch future regressions. We didn't have unit tests that ran these calibration methods. ### Motivation and Context Trying to quantize a model with the percentile, entropy, or distribution calibration methods raises an exception: ```shell File "/.../site-packages/onnxruntime/quantization/quantize.py", line 691, in quantize quantize_static( File "/.../site-packages/onnxruntime/quantization/quantize.py", line 525, in quantize_static calibrator.collect_data(calibration_data_reader) File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 571, in collect_data self.collector.collect(clean_merged_dict) File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 746, in collect return self.collect_value(name_to_arr) File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 836, in collect_value hist, hist_edges = np.histogram(data_arr, self.num_bins, range=(-threshold, threshold)) File "<__array_function__ internals>", line 180, in histogram File ".../site-packages/numpy/lib/histograms.py", line 793, in histogram bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights) File "/.../site-packages/numpy/lib/histograms.py", line 426, in _get_bin_edges first_edge, last_edge = _get_outer_edges(a, range) File "/.../site-packages/numpy/lib/histograms.py", line 315, in _get_outer_edges raise ValueError( ValueError: supplied range of [nan, nan] is not finite ``` The calibrators create an augmented model with all tensors (including model inputs) set as model outputs. The data for outputs that are also model inputs is corrupted as described in https://github.com/microsoft/onnxruntime/issues/21922. The corrupted data sometimes contains `NaN` values that cause numpy's histogram utilities to raise an exception.	2024-09-09 12:05:41 -07:00
Peishen Yan	2cdc05f189	Move Gelu and LayerNorm fusion to L1 optimization (#21332 ) According to https://github.com/microsoft/onnxruntime/issues/20915, we move the Gelu and LayerNorm fusion to L1 with a condition on the ONNX opset the model imports (LayerNorm requires opset 16+ and Gelu requires opset 20+.) If the opset version doesn't meet the requirements, the fusion is delayed to L2 optimization since the internal contrib op doesn't have a requirement for any specific ONNX opset. --------- Co-authored-by: Scott McKay <Scott.McKay@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-09-09 13:27:52 +10:00
Yi Zhang	de7a02beef	Add parameter for flexdonwload (#22009 ) ### Motivation and Context With this parameter, we can run the Nuget_Packaging_GPU stage directly.	2024-09-08 14:17:55 +08:00
Wanming Lin	ad9afbb042	[WebNN EP] Remove workaround for CPU op supported list (#21962 ) We assume all WebNN ops are supported across all backends.	2024-09-06 22:14:52 -07:00
Edward Chen	f3725b9f06	Use output variable from InstallAppleProvisioningProfile task to set provisioning profile UUID. (#22018 ) This is more flexible than hardcoding the provisioning profile name or UUID. The name shouldn't usually change but it is not guaranteed to remain constant.	2024-09-06 18:00:34 -07:00
zz002	28b550f091	[VitisAI] Add processing for sessionOptions.AppendExecutionProvider("VitisAI", options) (#21839 ) ### Description [VitisAI] Add processing for sessionOptions.AppendExecutionProvider("VitisAI", options) --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2024-09-06 14:06:33 -07:00
Arne H Juul	493159b481	near-zero negative values must convert to 0 not NAN (#18473 ) for the Float8 types with unsigned zero, we must clear the sign bit when rounding to zero; otherwise we end up with 0x80 which is the encoding for NAN. ### Description Handle all zero and near-zero values the same way, rounding to positive zero. Note that I removed one "if" level but did not re-indent the code in this PR, to make it easier to see what the actual changes are. ### Motivation and Context For the two new 8-bit floating point types Float8E4M3FNUZ and Float8E5M2FNUZ, converting from a near-zero negative value would end up with the sign bit set only; this bit pattern is not negative zero but instead means NAN.	2024-09-06 11:41:48 -07:00
Arne H Juul	605a84ffc9	remove unused and confusing float16 constants (#21999 ) ### Description Remove unused and confusing special constants in MLFloat16 and BFloat16 types. ### Motivation and Context While looking at adding a specialization for std::numeric_limits for the 16-bit floating point types, I found that there are various special constants in those types that are confusing or just wrong. MLFloat16::Epsilon is not an epsilon at all, but approximates "e". Looks like a copy-paste bug. BFloat16::Epsilon does not correspond to `numeric_limits::epsilon()`, nor even to the C# Float.Epsilon. Instead, it corresponds to `numeric_limits::min()` which was really confusing to me. The "MinValue" constant does correspond to the C# `Float.MinValue` constant, but this is C++ so it would be better renamed to "LowestValue" since it corresponds to `numeric_limits::lowest()`. As it was unused except for some unit tests I have replaced it with the equivalent `MaxValue.Negate()` here. There's also an unused `kSignaling_NaNBits` constant which is just wrong (has the same value as `kPositiveInfinityBits` instead of a NaN).	2024-09-05 22:00:48 -07:00
Edward Chen	970ebc2ccf	Fix typo in coreml_supported_mlprogram_ops.md (#22004 ) ### Description Fix typo: ai:onnx -> ai.onnx ### Motivation and Context Typo.	2024-09-06 12:50:56 +10:00
Edward Chen	0c398b3e52	Update Android NDK version to 27.0.12077973. (#21989 ) Upgrade to newer version. r26 will be unsupported soon.	2024-09-05 17:57:24 -07:00
Adrian Lizarraga	b011f6fbf6	[TransposeOptimizer] Support Unsqueeze/Transpose of input consumed by per-axis DQ (#21821 ) ### Description Follow-up to: https://github.com/microsoft/onnxruntime/pull/21793 - Support looking past a per-axis DQ to do in-place Unsqueeze/Transpose of initializers - Support looking past a per-axis DQ to cancel a Transpose or Squeeze. ### Test models For all test models, the transpose optimizer pushes a Transpose through a Mul's input[0]. The Mul's input[1] is optionally unsqueezed and then transposed. ### I. Test in-place unsqueeze and transpose of per-axis quantized weight Original model has input[1] with shape (3,). Optimized model has input[1] with shape (1, 3, 1, 1). The initializer was unsqueezed and transposed in-place. ### II. Test canceling existing Squeeze before per-axis DQ Original model has input[1] that is squeezed. Optimized model unsqueezed and transposed input[1]. The original squeeze was removed due to the unsqueeze, leaving only the Transpose. ### III. Test canceling existing Transpose before per-axis DQ Original model has input[1] that is transposed. Optimized model transposed input[1], thus canceling the existing transpose. ### IV. Test QDQ fix-up of Transpose/Unsqueeze for per-axis quantization Original model has input[1] that can be broadcasted. The main transpose optimization loop inserts float32 Unsqueeze and Transpose after the DQ. The qdq fix-up pass inserts new per-axis Q/DQ ops after the inserted nodes. ### Motivation and Context Enables the TransposeOptimizer to support more models with per-axis QDQ nodes. Per-axis quantization can improve model accuracy and is used by EPs like QNN. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-09-05 17:26:17 -07:00
Wanming Lin	23f6604c39	[WebNN EP] Use identity for one input of Max/Min (#21974 ) Now WebNN supports `identity` op, use it for `Max` and `Min` ops with only one input.	2024-09-05 16:47:40 -07:00
Scott McKay	20c802afd4	Add better native nuget package readme (#21889 ) ### Description Request from Nuget team to add a better readme to the nuget package so it is displayed nicely on nuget.org. Previously we were using the ORT repo readme.md but that a) doesn't display correctly due to limited markdown support on nuget.org, and b) has a lot of irrelevant info like build pipeline status. - Created a generic readme.md that includes the ORT description from the main readme, includes the ORT logo via an acceptable link, and lists the native nuget packages so the file can be included in any of them as-is. - Updated the nuget packaging script to add the `readme` tag and use this file. ### Motivation and Context Request from MS Nuget team to MS package owners to add.	2024-09-06 08:28:14 +10:00
Tianlei Wu	c7d0ded079	[CUDA] Update Dockerfile.cuda with cuda 12.5.1 and cudnn 9 (#21987 ) ### Description Previous image is based on cuda 12.1 and cudnn 8, which is out of date since we have moved to cudnn 9 since 1.19 release. (1) Upgrade base image to cuda 12.5.1 and cudnn 9. (2) Update CMAKE_CUDA_ARCHITECTURES from 52;60;61;70;75;86 to 61;70;75;80;86;90 to support A100 and H100 (3) Make the build faster: exclude unit test; use ninja etc. (4) upgrade some packages (like packaging etc) before building to avoid build error. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/21792 https://github.com/microsoft/onnxruntime/issues/21532	2024-09-05 15:25:40 -07:00
0xdr3dd	2dae8aaced	[Fuzzer] Add fuzzer support for linux (#21996 ) ### Description Added some change in fuzzer project code to support linux also. How to test on linux: 1. Make sure you have installed clang/llvm. 2. run below command to build asan instrumented project: ``` CFLAGS="-g -fsanitize=address -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --skip_tests --use_full_protobuf --parallel --fuzz_testing --build_dir build/ ``` 3. run fuzzer for some time, it will generate .profraw file: ``` LLVM_PROFILE_FILE="%p.profraw" ./build/Debug/onnxruntime_security_fuzz /t /v onnxruntime/test/testdata/bart_tiny.onnx 1 m ``` 4. Get the cov by running below cmd: ``` llvm-profdata merge -sparse .profraw -o default.profdata llvm-cov report ./build/Debug/onnxruntime_security_fuzz -instr-profile=default.profdata ``` ### Motivation and Context 1. Currently fuzzer only supports windows and MSVC, we can't generate the code coverage using MSVC. With clang/llvm we can try and use clang instrumentation and llvm tools like llvm-cov. 2. In future we can add coverage guided fuzzer (libfuzzer) in same project. (Working on it)	2024-09-05 11:52:15 -07:00
Yueqing Zhang	f4d62eeb2e	[VitisAI] remove unused header (#21890 ) ### Description Removed unused headers ### Motivation and Context The unused headers would cause a compile error on machines that don't have nlohmann installed. Co-authored-by: Yueqing Zhang <yueqingz@amd.com>	2024-09-05 08:37:15 -07:00
Javier Martinez	840f896c5f	Uncomment line in OVEP that was commented out in error (#21973 ) ### Description One line change to re-enable a line incorrectly commented out in an earlier commit ### Motivation and Context Fix issue introduced with [PR 21872](https://github.com/microsoft/onnxruntime/pull/21872#discussion_r1736744441)	2024-09-05 08:34:55 -07:00
Scott McKay	8b661f7157	Fix DML packaging CIs (#21997 ) ### Description The DML CIs build native and C# as well as sign DLLs in the same CI. Some parts of that require .net 8 and some .net 6. Update to use .net 8 in general, and revert to .net 6 for the signing. ### Motivation and Context Fix packaging pipeline.	2024-09-05 22:30:40 +08:00
Scott McKay	5e24c5d5f8	Fix C# doc generation workflow (#21988 ) ### Description - Update docfx usage. - The docfx cli is now a dotnet tool. - Split some commands up so it's easier to debug failures - Update to .net8. - Exclude mobile targets from build as the workloads aren't available and it doesn't change the generated documentation. - The mobile specific APIs (e.g. enable CoreML EP) still exist in this case as we check in the implementation if it's valid to use them or not, so the workloads are not required to generate complete API documentation. ### Motivation and Context Fix doc gen.	2024-09-05 13:54:17 +10:00
Yulong Wang	2e83541eba	fix one build warning in MSVC (#21983 ) ### Description Fix one MSVC warning member not initialized ``` Warning C26495 Variable 'onnxruntime::ITuningContext::allocators_' is uninitialized. Always initialize a member variable (type.6). C:\code\onnxruntime\onnxruntime\core\framework\tuning_context.h 22 ```	2024-09-04 17:51:14 -07:00
Jiajia Qin	3580e01348	[js/webgpu] Optimize grouped conv (#21892 ) ### Description #21618 This PR optimizes grouped conv by 1) more sequential memory access in gpu 2) reusing input's data to reduce global memory access times. The `Conv|GroupedConv` op in [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) drops from 1058 ms to 92 ms on iGPUs with 32 EU. For the whole model on my iGPUs with 32 EU, the wav2vec2 model drops from 1942 ms to 982 ms, and the squeezebert-uncased model drops from 431.77 ms to 71.86 ms.	2024-09-04 17:16:35 -07:00
mindest	30f07758a2	Add packaging version constraint. (#21814 ) ### Description Newer `setuptools` requires newer version of `packaging`, due to function update. ### Motivation and Context Fixes #21792	2024-09-04 16:57:04 -07:00
Prathik Rao	ed232dc1ef	Sets enable_windows_arm64ec_qnn to false in training CI (#21981 )	2024-09-04 16:01:14 -07:00


11997 commits
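
As a minimal sketch of the Float8 behaviour described in commit 493159b481 above (near-zero negative values must convert to 0, not NaN), the Python below decodes Float8E4M3FNUZ bytes. It assumes the layout from the ONNX Float8 documentation: 1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 8, and the bit pattern 0x80 reserved for NaN. The names `decode_e4m3fnuz` and `encode_rounded_zero` are hypothetical and are not onnxruntime APIs; the sketch only illustrates why the sign bit has to be cleared when a value rounds to zero.

```python
import math

# In the FNUZ types there is no negative zero: the sign bit alone encodes NaN.
NAN_BITS = 0x80
EXPONENT_BIAS = 8  # assumed bias for Float8E4M3FNUZ


def decode_e4m3fnuz(byte: int) -> float:
    """Decode one Float8E4M3FNUZ byte into a Python float (illustrative only)."""
    if byte == NAN_BITS:
        return math.nan
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0:
        # Subnormal: no implicit leading one.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - EXPONENT_BIAS)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - EXPONENT_BIAS)


def encode_rounded_zero(negative: bool, clear_sign: bool) -> int:
    """Return the byte for a value whose magnitude rounded to zero.

    clear_sign=False models the behaviour before the fix (sign bit kept);
    clear_sign=True models the behaviour after the fix (always positive zero).
    """
    sign_bit = 0x80 if (negative and not clear_sign) else 0x00
    return sign_bit  # all magnitude bits are zero


buggy = encode_rounded_zero(negative=True, clear_sign=False)
fixed = encode_rounded_zero(negative=True, clear_sign=True)
print(hex(buggy), decode_e4m3fnuz(buggy))
print(hex(fixed), decode_e4m3fnuz(fixed))
```

Keeping the sign bit on a zero magnitude yields the NaN encoding, while clearing it yields positive zero, which is the rounding behaviour the commit message asks for.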