onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-15 18:23:41 +00:00

Author	SHA1	Message	Date
Chi Lo	fa4cbcd36b	[TensorRT EP] Add new provider option to exclude nodes from running on TRT (#22681 ) Add new provider option `trt_op_types_to_exclude`: - User can provide op type list to be excluded from running on TRT - e.g. `trt_op_types_to_exclude="MaxPool"` There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) from TRT versions 10.0 to 10.7. TRT EP excludes DDS ops from running on TRT by default, user can override default value with empty string to include all ops.	2024-11-13 11:34:43 -08:00
shiyi	3adcf4d714	[WebNN] Remove validation for coordinate_transformation_mode (#22811 ) The performance cost of falling back to the CPU EP is high for several resampling nodes and causes multiple partitions in SD Turbo and VAE decoder. Since the asymmetric mode with nearest to floor and integer scales is identical to half_pixel anyway, stick with the WebNN EP.	2024-11-13 11:12:00 -08:00
Xu Xing	ff57ac4f3d	[js/webgpu] Add scatterND (#22755 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-13 09:13:00 -08:00
liqun Fu	bc2b1b5e37	Fix issue #22796 - a typo: (__GNUC__ > 9) -> (__GNUC__ > 10) (#22807 ) ### Description fix #22796 Signed-off-by: liqunfu <liqun.fu@microsoft.com>	2024-11-12 18:56:35 -08:00
Xiang Zhang	69a36eb231	Revert Implement DML copy for Lora Adapters (#22814 ) Revert https://github.com/microsoft/onnxruntime/pull/22396	2024-11-12 17:45:59 -05:00
Jing Fang	7fa69461fd	[ARM] MatMulNBits FP16 support - kernels only (#22806 ) ### Description A break down PR of https://github.com/microsoft/onnxruntime/pull/22651 Add fp16 kernels. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-12 14:28:47 -08:00
Jiajia Qin	7e0dd9d433	[js/webgpu] Optimize Expand (#22752 ) Use components = 4 if possible. llama3.2-1B becomes 20 tokens/s from 18 tokens/s on my iGPUs.	2024-11-12 12:37:19 -08:00
Jiajia Qin	05c8dc9d1c	[js/webgpu] Optimize ConvTranspose (#22774 ) BUG #22031 The overall time of ConvTranspose in Demucs model becomes 517.41 ms from 1415.65 ms on my iGPUs.	2024-11-12 12:37:07 -08:00
Bin Miao	67f5be0da2	[WebNN EP] Support LRN operator (#22775 ) WebNN doesn't provide dedicate op for LRN, use a couple of WebNN ops to emulate it in WebNN EP: pow -> transpose -> pad -> averagePool -> transpose -> mul -> add -> pow -> div @Honry @fdwr PTAL, thanks!	2024-11-12 11:53:52 -08:00
junchao-zhao	fd5b1a18ee	Fix LARCH64 compile error (#22759 ) ### Description Currently loongarch has not implemented AIsSigned qgemm, so I added bypass for it	2024-11-12 11:47:43 -08:00
Jian Chen	75a44582ba	Update all JDK version to 17 (#22786 )	2024-11-12 11:42:18 -08:00
Ted Themistokleous	2b0f3435d2	[MIGraphX EP] Add support for Gelu, BiasGelu, FastGelu operators (#22808 ) ### Description Adds support for different flavours of gelu already supported in MIGraphX	2024-11-12 11:04:15 -08:00
dtang317	9836ef1c89	register Identity and QLinearMatmul for opset21 (#22804 ) ### Description This PR registers the following opset 21 operators: Idenity-21 OlieanrMatmul-21 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-12 09:36:19 -08:00
amarin16	f0ac5e0d3d	Update skip layer norm (#22719 ) Update the `SkipLayerNorm` implementation to address issues.	2024-11-12 07:01:30 -08:00
Wanming Lin	cdc8db9984	[WebNN] Fixed WebNN Module undefined issue (#22795 ) `Module.jsepRegisterMLConstant` will be shorten by Closure Compiler in offical release, this would cause undefined error. Fix it by using `Module['jsepRegisterMLConstant']`.	2024-11-11 21:31:24 -08:00
Adrian Lizarraga	0ad44d0f79	[Quant Tool] Flaky test due to Pad reflect bug (#22798 ) ### Description Fixes a unit test that would fail intermittently due to an existing bug with Pad (reflect mode). When the number of padded values is >= the inner dimension size, the ORT Pad implementation accesses invalid memory. This PR makes the number of padding values less than the inner dimension size to avoid triggering the bug. ### Motivation and Context See related issues: https://github.com/microsoft/onnxruntime/issues/8265 https://github.com/microsoft/onnxruntime/issues/11828 https://github.com/microsoft/onnxruntime/issues/20801 Here's a valgrind trace obtained on a Linux machine (with `sess_options.enable_cpu_mem_arena = False`) ``` ==864228== Invalid read of size 4 ==864228== at 0x2716272A: void onnxruntime::PadInnermostAxis<unsigned int>(unsigned int, unsigned int, long, unsigned long) (pad.cc:370) ==864228== by 0x2715D213: onnxruntime::common::Status onnxruntime::PadImpl<unsigned int>(onnxruntime::OpKernelContext, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, onnxruntime::Mode const&, unsigned int) (pad.cc:551) ==864228== by 0x2715B2BB: onnxruntime::Pad::Compute(onnxruntime::OpKernelContext) const (pad.cc:725) ==864228== by 0x276FF6A7: onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) (sequential_executor.cc:484) ==864228== by 0x276F4A04: onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (execution_steps.cc:73) ... ``` The above is obtained with the basic Pad(reflect) example on the [ONNX Pad operator spec page](https://onnx.ai/onnx/operators/onnx__Pad.html#summary): ```python data = [ [1.0, 1.2], [2.3, 3.4], [4.5, 5.7], ] pads = [0, 2, 0, 0] mode = 'reflect' # Expected output by ONNX spec expected_output = [ [1.0, 1.2, 1.0, 1.2], [2.3, 3.4, 2.3, 3.4], [4.5, 5.7, 4.5, 5.7], ] # Bugged output from onnxruntime has invalid/uninitialized data for the first element in the inner dimension # invalid data may be 0.0, inf, nan, etc. ort_output = [ [inf, 1.2, 1.0, 1.2], [inf, 3.4, 2.3, 3.4], [inf, 5.7, 4.5, 5.7], ] ```	2024-11-11 19:49:27 -08:00
shiyi	f7d1f0fc5e	Reland "[WebNN] Fallback the node when its output doesn't have shape info" (#22685 ) The previous PR was reverted because it causes the whole model to fallback when there is output shape info missing. This PR fixes the issue by removing redundant fallbacks.	2024-11-11 16:30:10 -08:00
Adrian Lizarraga	b1e0930eab	Fix build for linux python wheel (#22801 ) ### Description Fixes command for building Linux python packages by preventing an empty `-p` command-line option from being passed to a subsequent build script: `1f3b675453/tools/ci_build/github/linux/run_python_dockerbuild.sh (L37)` ### Motivation and Context A recent [PR ](https://github.com/microsoft/onnxruntime/pull/22773)introduced a new optional command-line option (`-p`) to pass custom python exe paths. We need to check if the option is empty before forwarding the option to a separate build script.	2024-11-11 15:20:07 -08:00
Jian Chen	885a7acd45	Fix warning - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (#22800 ) ### Description This PR Fix warning - `LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format` from all Dockerfile ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-11 13:05:34 -08:00
Xavier Dupré	1f3b675453	Fix MatMulBnFusion to exclude cases when tensors are not 2D tensors (#22762 ) ### Description Fixes #22512, MatMul, Add can be fused into a single Gemm even if tensors dimensions are > 2. The PR excludes that cases. ### Motivation and Context ORT crashes on valid models due to that unexpected fusion.	2024-11-11 19:48:25 +01:00
Dmitri Smirnov	c5276ac448	Revert "enable serialize prepacked weights into data file (#22256 )" (#22788 ) This reverts commit `c5b6be045f`. ### Description Revert ### Motivation and Context This needs simpler and more robust approach	2024-11-11 09:59:05 -08:00
sheetalarkadam	e8f1d73b0b	Add Android QNN Browserstack test (#22434 ) Add Android QNN Browserstack test ### Motivation and Context Real device test in CI	2024-11-10 16:10:29 -08:00
Preetha Veeramalai	c9ed016b12	OVEP Dynamic WorkloadType support (#22779 ) ### Description Support to set EPdynamic options in OVEP ### Motivation and Context relate to https://github.com/microsoft/onnxruntime/pull/22282 --------- Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>	2024-11-09 23:26:29 -08:00
shiyi	63cb53257b	[WebNN] Support steps >= 1 for slice operator (#22708 ) Co-authored-by: Wanming Lin <wanming.lin@intel.com>	2024-11-09 18:20:52 -08:00
Wanming Lin	b9b1a0353a	[WebNN] QDQ's axis should be used for broadcasting (#22721 ) For per-axis quantization/dequantization, WebNN requires the scale and zero_point inputs to be broadcastable. Axis should be used for reshape these two inputs.	2024-11-09 18:19:46 -08:00
zz002	d3ad76b2cf	[VitisAI] Cache node subgraph when necessary (#22073 ) ### Description <!-- Describe your changes. --> [VitisAI] Cache node subgraph when necessary ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com> Co-authored-by: zhenzew <zhenzew@amd.com>	2024-11-08 23:17:16 -08:00
Yi Zhang	ef281f850a	Add XNNPack build on Linux ARM64 and improve Linux CPU (#22773 ) ### Description 1. Add XNNPack build on Linux ARM64 2. Build only one python wheel for PR request. [AB#49763](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/49763) ### Motivation and Context Why I add xnnpack build on Linux ARM64 rather than Windows ARM64. Becuase KleidiAI doesn't support Windows ``` IF(XNNPACK_TARGET_PROCESSOR STREQUAL "arm64" AND XNNPACK_ENABLE_ARM_I8MM AND NOT CMAKE_C_COMPILER_ID STREQUAL "MSVC") IF (XNNPACK_ENABLE_KLEIDIAI) MESSAGE(STATUS "Enabling KleidiAI for Arm64") ENDIF() ELSE() SET(XNNPACK_ENABLE_KLEIDIAI OFF) ENDIF() ``` ---------	2024-11-09 11:26:19 +08:00
Justin Chu	a8539ec7d1	Ignore all whitespace lint messages for cpplint (#22781 ) ### Description Ignore all whitespace lint messages for cpplint. Remove redundant configs in dml/. ### Motivation and Context They are handled automatically by clang-format and creates too much noise in the PR files tab.	2024-11-08 14:31:28 -08:00
Adrian Lizarraga	020d52d92c	[Quant Tool] Add reduce_range option to get_qdq_config() (#22782 ) ### Description Adds `reduce_range` option to `get_qdq_config()` ### Motivation and Context Make it easier to set this option when calling get_qdq_config(). Otherwise, user has to set the option manually.	2024-11-08 14:04:11 -08:00
xhcao	b5ee4ac760	[js/webgpu] support GridSample operator (#22652 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-08 11:02:36 -08:00
jzm-intel	d9b91682f1	WebGPU JSEP: Make shader code not depend on input broadcasting patterns (#22536 ) This PR make MatMul shaders not depend on inputs broadcasting pattern, but only depend on input ranks and their shape provided in uniform. This change fix the issue that currently shaders code are different for different broadcasting, but have identical cache key and results in wrong cache hit.	2024-11-08 11:00:51 -08:00
Michael Cho	4d614e15bd	Fix build with GCC 11 (#22770 ) ### Description Fix a build error seen with GCC 11 when building at Homebrew on our Linux x86_64 Ubuntu 22.04 CI (GitHub action runner). ### Motivation and Context When building latest v1.20.0 at Homebrew (https://github.com/Homebrew/homebrew-core/pull/196547), we hit a build failure with GCC 11: ``` [ 65%] Building CXX object CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o /home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/shims/linux/super/g++-11 -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DENABLE_CPU_FP16_TRAINING_OPS -DHAS_STRING_VIEW=1 -DNSYNC_ATOMIC_CPP11 -DONLY_C_LOCALE=0 -DONNX_ML=1 -DONNX_NAMESPACE=onnx -DORT_ENABLE_STREAM -DORT_NO_RTTI -DPLATFORM_POSIX -DPROTOBUF_USE_DLLS -D_GNU_SOURCE -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/utf8_range-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime/core/session -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/pytorch_cpuinfo-src/include -I/tmp/onnxruntime-20241103-6403-lh3bwj/build -I/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-build -ffunction-sections -fdata-sections -Wno-restrict -DCPUINFO_SUPPORTED -O3 -DNDEBUG -fPIC -fno-rtti -Wall -Wextra -Wno-deprecated-copy -Wno-tautological-pointer-compare -Wno-nonnull-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Werror -MD -MT CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -MF CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o.d -o CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -c /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc: In function ‘void onnx_transpose_optimization::Permute1DConstant(onnx_transpose_optimization::api::GraphRef&, onnx_transpose_optimization::api::NodeRef&, onnx_transpose_optimization::api::TensorRef&, size_t, std::string_view, const std::vector<long int>&)’: /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1114:10: error: ‘memcpy’ is not a member of ‘std’; did you mean ‘wmemcpy’? 1114 \| std::memcpy(dst, src, bytes_per_val); \| ^~~~~~ \| wmemcpy ``` It is possible this error may not occur on different GCC versions if `cstring` has been indirectly included by another header.	2024-11-07 21:04:57 -08:00
Jian Chen	e7987a6b0b	Replace reference to python 3.8 with python 3.10 (#22692 ) ### Description This PR will set default python to 3.10 except tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml. This is needed because we are no longer using python 3.8 This PR excludes changes for Big Models CI, because it will require additional changes. Which will be track in USER STORY 52729 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-07 16:51:40 -08:00
Ranjit Ranjan	193671295e	[AIX] Fix for AIX build break (#22745 ) ### Description With recent changes, below build error is found under AIX. ``` ld: 0706-012 The -p flag is not recognized. ld: 0706-012 The -a flag is not recognized. ld: 0706-012 The -t flag is not recognized. ld: 0706-012 The -h flag is not recognized. ld: 0706-012 The -= flag is not recognized. ld: 0706-012 The -$ flag is not recognized. ld: 0706-012 The -$ flag is not recognized. ld: 0706-012 The -O flag is not recognized. ld: 0706-027 The -R IGIN flag is ignored. collect2: error: ld returned 255 exit status ``` ### Motivation and Context AIX linker doesn't support -rpath option , so blocking this option under AIX.	2024-11-07 13:22:22 -08:00
raoanag	f16036b6f5	[DML EP] Prefer MatMulInteger over MatMulIntegerToFloat in case of (#22469 ) ### Description Skip `MatMulIntegerToFloat` fusion in case of DML EP for cases where model uses Quantization before `MatMulInteger`. This is mainly done to be resource efficient, and we have better `MatMulInteger` Metacommand coverage which computes in int data type ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-07 10:02:01 -08:00
Yulong Wang	a436b3af1a	[webgpu] fix indices type when it's 4D (#22758 ) ### Description Fix indices type from `array<u32, 4>` to `vec4<u32>` when the variable is 4D.	2024-11-07 08:10:05 -08:00
jzm-intel	6a295eb75b	[JS/WebGPU] Creating devices with subgroup features enabled if possible (#21833 ) This CL make WebGPU backend support subgroup features and thus allow using subgroup optimizations in the future. ### Description With this CL WebGPU backends will create devices with subgroups and subgroups-f16 features (both are under origin trial in Chrome) or chromium-experimental-subgroups feature enabled whenever available. ### Motivation and Context This CL would allow WebGPU operator shaders to use subgroup optimizations in the future, and might get some significant speedup with these optimization.	2024-11-07 02:13:40 -08:00
Yifan Li	3b7a6eba69	[TensorRT EP] support TensorRT 10.6-GA (#22644 ) ### Description <!-- Describe your changes. --> * Update CI with TRT 10.6 * Update oss parser to [10.6-GA-ORT-DDS ](https://github.com/onnx/onnx-tensorrt/tree/10.6-GA-ORT-DDS) and update dependency version * Update Py-cuda11 CI to use trt10.6 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> (There will be 3rd PR to further reduce trt_version hardcoding)	2024-11-06 14:33:46 -08:00
Adrian Lizarraga	aa0cf1c5e1	[Quant Tool] Update QDQ Pad, Slice, Softmax (#22676 ) ### Description Updates python quantization tool: - Ensures QDQ Pad has equal quantization parameters across input and output for certain Pad configurations. - Ensures QDQ Slice always has equal quantization parameters across input and output. - Fixes bug when Softmax is _excluded_ from quantization. ### Motivation and Context QDQ Pad and Slice have lower latency on QNN EP when their quantization parameters are equal.	2024-11-06 14:06:29 -08:00
Caroline Zhu	0221693e43	[Mobile] Add E2E BrowserStack tests for iOS tests (#22610 ) ### Description - Changes running the E2E iOS tests from running in App Center to running in BrowserStack - Steps for running locally can be found in the OneNote ### Motivation and Context - Follow-up of #22117 - App Center (the previous platform for running E2E mobile tests) is getting deprecated in 2025 ### Misc info Additional build steps were required to get the necessary testing artifacts for BrowserStack. App Center consumed an entire folder, while BrowserStack requests the following: 1. a ZIP file of all the tests 2. an IPA file of the test app #### Flow Here is a rough outline of what is happening in the pipeline: 1. The build_and_assemble_apple_pods.py script builds the relevant frameworks (currently, this means packages for iOS and Mac) 4. The test_apple_packages.py script installs the necessary cocoapods for later steps 5. XCode task to build for testing builds the iOS target for the test app 6. Now that the test app and the tests have been built, we can zip them, creating the tests .zip file 7. To create the IPA file, we need to create a .plist XML file which is generated by the generate_plist.py script. - Attempts to use the Xcode@5 task to automatically generate the plist file failed. - Also, building for testing generates some plist files -- these cannot be used to export an IPA file. 8. We run the Xcode task to build an .xcarchive file, which is required for creating an IPA file. 9. We use xcodebuild in a script step to build an IPA file with the xcarchive and plist files from the last two steps. 10. Finally, we can run the tests using the BrowserStack script. --------- Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-11-06 11:22:29 -08:00
Adrian Lizarraga	4f6993d567	[Quant Tool] Prevent int32 quantized bias from clipping by adjusting the weight's scale (#22020 ) ### Description Fixes scenario in which a bias input quantized to int32 has a scale that is too small. A bias with a scale that is smaller than a certain threshold will overflow the range of an `int32` when quantized, which significantly decreases accuracy. Credit to @yihonglyu for finding out about this issue and the fix. ### Motivation and Context Consider the following Convolution with very small weights and a constant bias input of `[5, -4.5]`. ![image](https://github.com/user-attachments/assets/4bde2bd9-892f-4ae9-887b-61a6668779a1) The QDQ quantizer first computes the following quantization scale for `input_0` and `weight`: - `input_0`: scale=0.5 - `weight`: scale=7.843e-10 [really small] The QDQ quantizer then computes the bias input's scale as follows: ``` bias_scale = input_0_scale * weight_0_scale = 0.5 * 7.843e-10 = 3.9215686274509805e-11 ``` This `bias_scale` is too small. Before this PR, the QDQ quantizer would quantize the f32 bias with this `bias_scale`: ``` bias_quant = round(bias_f32 / bias_scale) = round([5.0/bias_scale, -4.5/bias_scale]) = [127500000000, -114750000000] ``` These quantized bias values exceed the range of int32, and so are clipped to [int32.min(), int32.max()], which is very inaccurate. #### New approach This PR increases the `weight_0_scale` by the necessary amount to ensure that `bias_scale` (which equals `weight_0_scale * input_0_scale`) is appropriate for the int32 quantization type. The smallest valid bias scale is given by the normal scale formula: `bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max - int32_min)` Then, we compute the candidate bias scale: `bias_scale_candidate = input_0_scale * weight_0_scale` If the candidate scale is smaller than the smallest valid scale, we increase the `weight_0_scale` by the necessary ratio: ```python if bias_scale_candidate < bias_smallest_valid_scale: ratio = bias_smallest_valid_scale / bias_scale_candidate weight_0_scale = ratio * weight_0_scale ``` Then, we recompute the final bias scale: ```python bias_scale = input_0_scale * weight_0_scale ``` #### Impact on accuracy Here's the above model's quantized output compared to the f32 (ground-truth) output. - Before PR: - f32 model output[0]: 5.0f - qdq model output[0]: 0.075 - SNR: 0.1369 (higher is better) - After PR: - f32 model output[0]: 5.0f - qdq model output[0]: 4.992 - SNR: 55.656 (higher is better)	2024-11-06 10:44:54 -08:00
Adrian Lizarraga	2c1b17ce98	[Quant Tool] Introduce get_qdq_config() helper to get QDQ configurations (#22677 ) ### Description Introduces the `get_qdq_config()` function to get a quantization configuration for a full integer QDQ model. This function provides an easier way of specifying commonly used options and sets convenient defaults. Specifically: - Instead of requiring the user to pass a dictionary of `extra_options`, the new interface adds function parameters for common settings: - All calibrator settings - Whether activations/weights are symmetric - Whether to keep or fuse relu/clip into Q - Minimum real range for quantization - Dictionary of tensor quantization overrides. - Automatically scans the input floating-point model and fills out the operator types to quantize. Otherwise, only a limited number of operator types would be quantized by default. - Detects if the input model uses external data. If so, ensures that the generated QDQ model also uses external data. - Detects if the model will use newly introduced quantization types (int4/int16) with an older opset. If so, forces the use of the `com.microsoft` domain for Q/DQ ops, which support all types. - Automatically enables the "extra option" called `ForceQuantizeNoInputCheck` to ensure data movement operators (e.g., Transpose) are always quantized. - User can pass a function to indicate which nodes to exclude from quantization. - The user can still pass their own `extra_options` to override any of the above if necessary. ```python from onnxruntime.quantization import get_int_qdq_config, quantize # , ... # Get QDQ configuration qdq_config = get_int_qdq_config( float_model, data_reader, calibrate_method=CalibrationMethod.Percentile, calibrate_args={"percentile": 99.98}, # Converted to extra_options activation_type=QuantType.QUInt8, weight_type=QuantType.QInt8, per_channel=True, nodes_to_exclude=["Mul"], # Could also be a function. Ex: `lambda model, node: node.op_type == "Softmax"` # Other options converted to extra_options: min_real_range=0.0001, keep_removable_activations=True, activation_symmetric=True, weight_symmetric=True, ) # Quantize model quantize(float_model_path, qdq_model_path, qdq_config) ``` ### Motivation and Context Need a version of `get_qnn_qdq_config()` that is not EP-specific.	2024-11-06 10:27:02 -08:00
Tianlei Wu	72186bbb71	[CUDA] Build nhwc ops by default (#22648 ) ### Description * Build cuda nhwc ops by default. * Deprecate `--enable_cuda_nhwc_ops` in build.py and add `--disable_cuda_nhwc_ops` option Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops will be disabled automatically. ### Motivation and Context In general, NHWC is faster than NCHW for convolution in Nvidia GPUs with Tensor Cores, and this could improve performance for vision models. This is the first step to prefer NHWC for CUDA in 1.21 release. Next step is to do some tests on popular vision models. If it help in most models and devices, set `prefer_nhwc=1` as default cuda provider option.	2024-11-06 09:54:55 -08:00
Tianlei Wu	ba22d7879a	[CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above (#22713 ) ### Description Based on https://github.com/microsoft/onnxruntime/pull/9700, and extend it to ArgMin as well. This pull request introduces several enhancements and fixes related to the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The changes ensure proper handling of these operators across different versions and improve kernel registration and fallback mechanisms. Key changes include: #### Enhancements to `ArgMax` and `ArgMin` Operators: * Added new kernel class registrations for `ArgMax` and `ArgMin` for different data types and versions in `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`. [[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972) [[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215) [[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659) [[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827) [[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939) [[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180) * Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle fallback to CPU when the `select_last_index` attribute is set to 1, as CUDA does not support this attribute. [[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622) [[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674) #### Macro and Kernel Registration Improvements: * Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with `REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and `REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version handling. [[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29) [[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46) * Updated kernel registration for `ArgMax` and `ArgMin` to use the new macros, ensuring proper version handling and support for different data types. #### Safety Checks: * Added safety checks in the `ArgMax` and `ArgMin` classes to ensure `select_last_index` is not set to 1, as it is not supported on CUDA. [[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99) [[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117) #### Testing Enhancements: * Added new tests for `ArgMax` and `ArgMin` operators to verify behavior when `select_last_index` is set to 0, ensuring compatibility with both CPU and CUDA execution providers. [[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360) [[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699) ### Motivation and Context Improve CUDA kernel coverage for stable diffusion model and hence improve its performance on CUDA	2024-11-06 09:54:32 -08:00
Tianlei Wu	d993ec313f	[CUDA] Fix NumericLimits (#22738 ) ### Description * Fix `NumericLimits<float>` that used infinity as max, which is not consistent with `std::numeric_limits<float>::max()` In Windows, (float)(1e+300) is used for INFINITY, which causes compiler error in Visual Studio 2022 v17.12 Preview 5. * Rename `NumericLimits<T>::Min` to Lowest to be consistent with std::numeric_limits * Fix topk implementation: use `NumericLimits<CudaT>` instead of `NumericLimits<T>` in kernel. That could avoid defining a confusing defintion of `NumericLimits<MLFloat16>` that returns half instead of MLFloat16. * Use CUDART_MAX_NORMAL_FP16 if possible. It sets bits value directly, which is faster than converting float to half. Note that NumericLimits does not support __nv_bfloat16 and _nv_fp8_e4m3 and __nv_fp8_e5m2 right now. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/22728	2024-11-06 09:53:49 -08:00
Enrico Galli	1cb5ceedf3	[WebNN EP] Fix issues with MLTensor caching (#22701 ) This PR fixes a bug that occurs when searching for compatible `MLTensor` in the cache. We were missing checking the number of dimensions in the shape. This would mean that a cached buffer of shape `[1]` could match for `[1, 1, 256, 256]`. This PR also adds better handling when attempting to force an `MLTensor` to a different shape.	2024-11-06 09:17:11 -08:00
Yang Gu	811231e418	[js/webgpu] Destroy staging buffers aggressively during weights uploading (#22726 ) In current implementation, all the staging buffers for weights uploading are destroyed after first batch of kernel execution. It requires a lot of memory as all the staging buffers couldn't be reused. It also hurts the startup time (weights uploading only happens in session creation), as weights uploading is delayed to a very late time. This PR uses a very aggressive way to submit queue and destroy staging buffers, so that the related GPU memory could be reused as much as possible, though the real situation depends on the WebGPU and driver implementation. The aggressive queue submission also moves GPU operations to a very early time, which helps the startup time. Some buffer uploading benchmarks are composed to compare multiple solutions, regarding to the memory and time consumption. Benchmarks can be found at https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html, while detailed test data can be found at https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit. I also tested phi3.5 on 2 machines, first inference time improved from 5141ms to 3579ms and from 4327ms to 2947ms separately.	2024-11-06 08:55:15 -08:00
Edward Chen	742a0d30be	[C# MauiModelTester] Fix icon name in Info.plist (#21666 ) Fix icon name in Info.plist. It now matches the icon at `csharp/tools/MauiModelTester/Resources/AppIcon/onnxruntime_icon.png`.	2024-11-05 16:55:38 -08:00
Jian Chen	deee48002c	Enable CUDA Python Test (#22717 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-05 16:26:50 -08:00
Yulong Wang	0371e92419	[webgpu] change default validation mode (#22730 ) ### Description Change default validation mode in Release build from "wgpuOnly" to "basic"	2024-11-05 16:18:05 -08:00

1 2 3 4 5 ...

11989 commits