onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-19 19:00:47 +00:00

Author	SHA1	Message	Date
George Wu	c270fe6dd3	[QNN EP] Fix naming convention of ort-nightly-qnn package (#22157) It followed the ROCm example below it, which isn't the naming convention we want to follow. Didn't fix ROCm because I'm not sure whether there are consumers using its naming convention.	2024-09-19 17:33:31 -07:00
Hector Li	03ce996b7c	Fix QNN random crash for UTs with multi-threaded runs (#22160) ### Description Fix a random crash in QNN UTs with multi-threaded runs, such as QnnHTPBackendTests.MultithreadHtpPowerCfgDefaultAndRunOption. Root cause: the last-minute code change `b4e26bd5f9` replaced `static std::mutex mutex;` with `OrtMutex mutex;` and dropped the `static`.	2024-09-19 16:39:13 -07:00
raoanag	73b5c3354c	Set Transpose attribute instead of manipulating MatMul strides (#21927) ### Description Update the DML EP so that the `FusedMatMul` ORT graph node has its TransA/B attributes set instead of having its strides updated.	2024-09-19 16:26:20 -07:00
Scott McKay	bd60add8ce	Update nuget.exe used in WindowsAI NuGet packaging so the `readme` property is supported (#22141) ### Description Use the latest nuget.exe so that the `readme` property is supported. ### Motivation and Context #22137	2024-09-19 19:06:47 +10:00
Scott McKay	99ee6eeca2	Enable Android 16 KB page size support (#22076) ### Description Add linker flags to support 16 KB page sizes on Android. See https://source.android.com/docs/core/architecture/16kb-page-size/16kb#build-lib-16kb-alignment ### Motivation and Context #21837	2024-09-19 18:53:57 +10:00
Wanming Lin	e33b08ead1	[WebNN EP] Use both MLOperandDescriptor.dimensions and MLOperandDescriptor.shape (#22121) The spec renames MLOperandDescriptor.dimensions to MLOperandDescriptor.shape. To support older Chromium versions, we will keep both in the WebNN EP for a while. Fixed #22120	2024-09-19 01:20:40 -07:00
George Wu	944d87381d	[QNN EP] Set up Python packaging pipeline for Linux x64 (#22132) Set up a pipeline to produce nightly Linux x64 wheels for onnxruntime-qnn. These can be used for offline context binary generation.	2024-09-18 23:24:32 -07:00
mguynn-intc	d5f6343a4a	Implementation of AVX-VNNI-INT8 dot product instructions in MLAS GEMM (#21984) ### Description The ONNX Runtime implementation of S8S8 was using the default C++ implementation; with this new ISA, all Int8 variants of QGemm can use VNNI dot-product and full AVX2 instructions. Starting with LNL, all signed/unsigned variants support VNNI instructions. Renamed structs and functions to better indicate support for all Int8 types rather than only U8X8. ### Motivation and Context LNL hardware implements the new ISA, and this code enables that ISA in QGemm. S8S8 speed is improved to match the existing U8S8 code. S8U8 would also match that speed if ONNX formally accepted the data type.	2024-09-18 22:18:23 -07:00
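For readers unfamiliar with the ISA, here is a minimal Python sketch of the per-lane arithmetic behind a signed-by-signed VNNI dot product, modeled on the AVX-VNNI-INT8 `VPDPBSSD` form. Each 32-bit accumulator lane adds the sum of four int8 x int8 products. The function name `dpbssd` and the list-based layout are illustrative assumptions, not MLAS code.

```python
from typing import List


def dpbssd(acc: List[int], a: List[int], b: List[int]) -> List[int]:
    """Emulate a signed int8 x signed int8 dot-product-accumulate.

    Lane i of the result is acc[i] + sum(a[4i + j] * b[4i + j] for j in 0..3),
    wrapped to signed 32-bit (the non-saturating form is assumed).
    """
    if len(a) != len(b) or len(a) != 4 * len(acc):
        raise ValueError("a and b must supply four int8 values per accumulator lane")
    if any(not -128 <= v <= 127 for v in a + b):
        raise ValueError("a and b must contain int8 values")
    result = []
    for i, lane in enumerate(acc):
        total = lane + sum(a[4 * i + j] * b[4 * i + j] for j in range(4))
        # Wrap to the signed 32-bit range, as a non-saturating hardware add would.
        result.append((total + 2**31) % 2**32 - 2**31)
    return result


# Two lanes: 1 - 2 + 3 - 4 = -2, and 10 + (127 + 127 - 128 - 128) = 8.
print(dpbssd([0, 10], [1, -2, 3, -4, 127, 127, -128, -128], [1] * 8))
```

In an S8S8 GEMM, `a` and `b` would roughly correspond to groups of four adjacent K-dimension values taken from a row of A and a column of B. Before this change, such a path ran through the default C++ loops instead.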
Yi Zhang	560778fd07	Use macOS 12 for ESRP code signing (#22134) ### Description Fix regression caused by #17361.	2024-09-19 12:06:41 +08:00
Tianlei Wu	a9740d6f96	Add ONNX export script for Segment Anything v2 (#22119) ### Description Add ONNX export script for Segment Anything v2 (SAM2). ### Limitations * Video is not supported; only images are supported right now. * The decoder does not support batch inference. ### Credits The demo is based on the [SAM2 notebook](https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/image_predictor_example.ipynb), modified to run with ORT. The decoder export is inspired by https://github.com/vietanhdev/samexporter. ### Demo Example output of the demo: ![sam2_demo](https://github.com/user-attachments/assets/9a9fa360-8c20-482e-9935-a7aba9cf15de) ### Motivation and Context To support optimization of SAM2 image segmentation.	2024-09-18 14:31:59 -07:00
Patrice Vignola	05acfb90ab	[DML EP] Add QDQ+MatMul fusion into MatMulNBits (#22114)	2024-09-17 22:37:45 -07:00
Adrian Lizarraga	b8dae685e4	[QNN EP] Build Python 3.12 wheel for Windows ARM64 (#22118) ### Description Builds an ARM64 Python 3.12 wheel for the QNN EP.	2024-09-17 21:16:31 -07:00
Fangjun Kuang	c6dc787a3d	Update q4common.h to include the missing header (#21786) Fixes #21748 CC @gyagp	2024-09-17 20:55:56 -07:00
dependabot[bot]	7e98926810	Bump body-parser from 1.20.1 to 1.20.3 in /onnxruntime/test/wasm (#22106)	2024-09-17 22:59:40 +00:00
Atanas Dimitrov	275eb404bf	Speed up `CumSum` for large arrays (#22048) ### Description This PR refactors the `CPU` kernel for the `CumSum` operator. The new implementation strives to have as little indirection as possible. ### Motivation and Context Currently, the `CumSum` operator performs very poorly on 1D tensors (it was slower than a Python loop). This is caused by the extensive use of `SliceIterator`s. Here is a relevant snippet:
```python
import time

import ndonnx as ndx
import numpy as np
import onnx
import onnxruntime as ort


def test_cumsum(sz):
    a = ndx.array(shape=(sz,), dtype=ndx.int64)
    b = ndx.cumsum(a)
    model = ndx.build({'a': a}, {'b': b})
    onnx.save(model, "model.onnx")
    input = np.ones(sz, np.int64)
    start = time.time()
    result = ort.InferenceSession(model.SerializeToString()).run(None, {'a': input})
    end = time.time()
    return end - start


def test_cumsum_by_hand(sz):
    input = np.ones(sz, np.int64)
    start = time.time()
    answer = [0]
    for i in input:
        answer.append(answer[-1] + i)
    end = time.time()
    return end - start


print(test_cumsum(int(1e7)))
print(test_cumsum_by_hand(int(1e7)))
```
Before
```console
0.9794480800628662
0.4518160820007324
```
After
```console
0.02483987808227539
0.5496008396148682
```
The `model.onnx`: ![image](https://github.com/user-attachments/assets/a213d6ff-86c3-49b5-a493-ebfd97deaa41) The flame graph: ![profile-3](https://github.com/user-attachments/assets/c7418a05-cb65-4d72-a76d-6a6b05b4ba4d)	2024-09-17 15:53:07 -07:00
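Here is a minimal sketch of the low-indirection approach in plain Python. It treats a contiguous row-major tensor as (outer, n, inner) and accumulates with three straight loops instead of per-slice iterators. The sketch assumes a contiguous buffer and the default inclusive forward mode (`exclusive=0`, `reverse=0`). It is not the C++ kernel from this PR.

```python
from typing import List, Sequence


def cumsum_flat(data: Sequence[int], shape: Sequence[int], axis: int) -> List[int]:
    """Inclusive cumulative sum along `axis` of a row-major tensor stored flat.

    Axes before `axis` fold into `outer` and axes after it into `inner`, so each
    element only needs its predecessor exactly `inner` positions earlier.
    """
    rank = len(shape)
    if axis < 0:
        axis += rank
    if not 0 <= axis < rank:
        raise ValueError("axis out of range")
    outer = 1
    for d in shape[:axis]:
        outer *= d
    inner = 1
    for d in shape[axis + 1:]:
        inner *= d
    n = shape[axis]
    if outer * n * inner != len(data):
        raise ValueError("shape does not match data length")

    out = list(data)
    for o in range(outer):
        base = o * n * inner
        for i in range(1, n):
            row = base + i * inner
            prev = row - inner
            for k in range(inner):
                out[row + k] += out[prev + k]
    return out


print(cumsum_flat([1, 2, 3, 4, 5, 6], [2, 3], axis=1))  # [1, 3, 6, 4, 9, 15]
```

For the 1D case from the benchmark, `outer` and `inner` are both 1, so the work collapses to a single linear pass over the buffer.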
Yi Zhang	b94ba09e4f	Upgrade XNNPACK to latest version (#22012) ### Description Update XNNPACK to the latest version (Sep 4). - Some op outputs have changed; channel and stride parameters have moved into the reshape functions, e.g. `96962a602d`. - The input parameters of XNNPACK's resize-related functions have changed significantly. - KleidiAI is added as a dependency on ARM64. - The latest XNNPACK consists of two static libraries, microkernels-prod and xnnpack. Without microkernels-prod, linking fails with undefined symbols. - Add ORT_TARGET_PROCESSOR to get the real target processor in CMake.	2024-09-17 10:12:16 -07:00
Jian Chen	fa68ae2def	Update pool to macOS-13 (#17361) ### Description See https://github.com/microsoft/onnxruntime-extensions/pull/476 and https://github.com/actions/runner-images/issues/7671 ### Current issue
- [ ] For the default Xcode 15.2 that comes with macOS 13, we need to update the boost container header boost/container_hash/hash.hpp to pass the build.
- [x] For Xcode 14.2, the build passed but `Run React Native Detox Android e2e Test` failed. Possibly a flaky test: https://github.com/microsoft/onnxruntime/pull/21969
- [x] For Xcode 14.3.1, we encountered the following issue in `Build React Native Detox iOS e2e Tests`:
```
ld: file not found: /Applications/Xcode_14.3.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/arc/libarclite_iphonesimulator.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
Appending the following code to the end of both ios/Podfile files fixed the issue:
```
post_install do |installer|
  installer.generated_projects.each do |project|
    project.targets.each do |target|
      target.build_configurations.each do |config|
        config.build_settings['IPHONEOS_DEPLOYMENT_TARGET'] = '13.0'
      end
    end
  end
end
```
- [x] https://github.com/facebook/react-native/issues/32483 Applying changes to ios/Podfile:
```
pre_install do |installer|
  # Custom pre-install script or commands
  puts "Running pre-install script..."

  # Recommended fix for https://github.com/facebook/react-native/issues/32483
  # from https://github.com/facebook/react-native/issues/32483#issuecomment-966784501
  system("sed -i '' 's/typedef uint8_t clockid_t;//' \"${SRCROOT}/Pods/RCT-Folly/folly/portability/Time.h\"")
end
```
- [ ] Detox environment setup exceeded the 120000 ms timeout during the iOS e2e test.
### Dependent
- [x] https://github.com/microsoft/onnxruntime/pull/21159
--------- Co-authored-by: Changming Sun <chasun@microsoft.com>	2024-09-17 10:07:30 -07:00
Chi Lo	6dcdc70aa7	[TensorRT EP] Add supportsModelV2 (#22081) `supportsModel` is deprecated in TRT 10.1. Add `supportsModelV2` but still keep `supportsModel`, as we still need to support TRT 8.6, where `supportsModelV2` is not supported.	2024-09-17 09:52:28 -07:00
Wanming Lin	9786909ab5	[WebNN EP] Support QuantizeLinear and DequantizeLinear ops (#22097)	2024-09-17 08:18:47 -07:00
Xu Xing	afd642a194	[js/webgpu] Replace array with string in transpose perm (#21930) Perf test data (100000 iterations): Array: 12.599999997764826ms String: 1.6000000014901161ms Perf test case:
```
const permFunctionBodyArray = (rank: number, input: string): string => {
  const reverseFunc: string[] = [];
  reverseFunc.push(`fn perm(i: int) -> int { var a: int;`);
  for (let i = 0; i < rank; ++i) {
    reverseFunc.push(input);
  }
  reverseFunc.push('return a;}');
  return reverseFunc.join('\n');
};

const permFunctionBodyString = (rank: number, input: string): string => {
  let reverseFunc = `fn perm(i: int) -> int { var a: int;`;
  for (let i = 0; i < rank; ++i) {
    reverseFunc += input;
  }
  reverseFunc += 'return a;}';
  return reverseFunc;
};

const count = 100000;
let start: number, end: number;

console.time('array');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyArray(3, 'input');
}
end = performance.now();
console.timeEnd('array');
console.log('Array: ' + (end - start));

console.time('string');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyString(3, 'input');
}
end = performance.now();
console.log('String: ' + (end - start));
console.timeEnd('string');
```	2024-09-16 23:17:46 -07:00
Yang Gu	2db6b734f5	[js/webgpu] Fix issue to run model demucs (#22074) This fixes issue #22031 so that the demucs model can run. For ConvTranspose, outputPadding.length can be 1 while spatialRank is 2; the fix is to append enough 0s to outputPadding. Conv has a similar issue: kernelShape.length can sometimes be 1 while inputs[1].dims.length is 4; the fix is likewise to append enough 0s to kernelShape.	2024-09-16 23:17:10 -07:00
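The zero-padding fix can be sketched in a few lines of Python. The helper name `pad_attribute` is hypothetical. The sketch also assumes that the target length is the number of spatial axes, taken here as the weight rank minus 2. The actual WebGPU code may derive that length differently.

```python
from typing import List, Sequence


def pad_attribute(values: Sequence[int], rank: int, fill: int = 0) -> List[int]:
    """Right-pad a per-spatial-axis attribute with `fill` up to `rank` entries."""
    if len(values) > rank:
        raise ValueError(f"attribute has {len(values)} entries, expected at most {rank}")
    return list(values) + [fill] * (rank - len(values))


weight_dims = [64, 32, 3, 3]             # hypothetical 4D Conv weight (inputs[1].dims)
spatial_rank = len(weight_dims) - 2      # assumption: spatial rank = weight rank - 2
print(pad_attribute([1], spatial_rank))  # outputPadding [1] -> [1, 0]
```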
Yulong Wang	291a5352b2	[js/web] remove training release (#22103) ### Description Remove training from onnxruntime-web. Follow-up to #22082.	2024-09-16 10:56:22 -07:00
Erick Muñoz	e93f14e00d	Check partial conversion on FP16 to FP32 AVX Cast kernel (#22091) ### Description Added checks to convert partial vectors in the early stages of the FP16-to-FP32 cast that uses the AVX NE CONVERT ISA. ### Motivation and Context Avoid storing data outside the output buffer; these checks were missing in the [original PR](https://github.com/microsoft/onnxruntime/pull/21183). This fix prevents memory corruption when the output buffer has a size in [16n + 1, 16n + 7] with 0< n	2024-09-16 09:20:06 -07:00
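Here is a minimal Python sketch of partial-vector (tail) handling. It is only an illustration, under several assumptions. The `struct` module's `e` half-precision format stands in for the AVX NE CONVERT instructions, and Python floats stand in for FP32. The vector width of 8 is also an assumption. The point is that the last block converts only the remaining values, so nothing is written past the end of the output.

```python
import struct
from typing import List

VECTOR_WIDTH = 8  # assumed lanes per vector (8 x FP32 fills a 256-bit register)


def fp16_to_fp32(src: bytes, count: int) -> List[float]:
    """Convert `count` little-endian FP16 values, one vector-sized block at a time.

    Full blocks convert VECTOR_WIDTH values; the final partial block converts
    only the remaining values, so no write lands past index count - 1.
    """
    if len(src) < 2 * count:
        raise ValueError("source buffer holds fewer than `count` FP16 values")
    out = [0.0] * count
    i = 0
    while i < count:
        n = min(VECTOR_WIDTH, count - i)  # n < VECTOR_WIDTH only on the tail
        out[i:i + n] = struct.unpack_from(f"<{n}e", src, 2 * i)
        i += n
    assert len(out) == count  # the output buffer never grew
    return out


values = [0.5, -1.0, 2.0] * 3  # 9 values: one full block plus a 1-value tail
print(fp16_to_fp32(struct.pack("<9e", *values), 9))
```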
George Wu	1a1669fe81	Use node name in transpose optimizer when adding nodes rather than op type (#22084) Patch from @john-dance: "The main change is simple: use the original node name rather than the original node op_type when creating new nodes. Here are my comments on the change: ------ ONNX Runtime uses the op_type as the basis for a new node name, so a node claimed by the QNN EP might be named Conv_token_1, with no relation to the original /conv1/Conv. This patch: 1. Adds OpName as a virtual function in NodeRef and implements it in ApiNode. 2. AddNode now takes an op_name and an op_type and passes them both to CreateNodeHelper. 3. CreateNodeHelper uses the op_name rather than the op_type in GenerateNodeName. 4. Direct calls to AddNode are modified to either use the NodeRef if available, or just repeat the op_type if not. The result is that the new nodes are named something like /conv1/Conv_token_1, allowing a straightforward mapping back to the original model node (if it exists in the original graph)."	2024-09-16 09:12:13 -07:00
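The naming rule described in this patch can be sketched in Python. `NameGenerator` and `new_node_name` are hypothetical stand-ins, not ORT's `GenerateNodeName` or `CreateNodeHelper`. The counter scheme is likewise an assumption chosen to reproduce the `_token_1` suffix quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NameGenerator:
    """Hypothetical stand-in for a graph's unique node-name generator."""
    counts: Dict[str, int] = field(default_factory=dict)

    def generate(self, base: str) -> str:
        self.counts[base] = self.counts.get(base, 0) + 1
        return f"{base}_token_{self.counts[base]}"


def new_node_name(gen: NameGenerator, op_type: str, op_name: Optional[str] = None) -> str:
    # Prefer the original node's name so the new node can be traced back to it;
    # fall back to the op_type when there is no source node.
    return gen.generate(op_name or op_type)


gen = NameGenerator()
print(new_node_name(gen, "Conv", "/conv1/Conv"))  # /conv1/Conv_token_1
print(new_node_name(gen, "Transpose"))            # Transpose_token_1
```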
Adam Pocock	6d7235ba5a	[Java] Exposing SessionOptions.SetDeterministicCompute (#18998) ### Description Exposes `SetDeterministicCompute` in Java, added to the C API by #18944. ### Motivation and Context Parity between C and Java APIs.	2024-09-16 11:55:38 +10:00
Adam Pocock	02e00dc023	[java] Adding ability to load a model from a memory mapped byte buffer (#20062) ### Description Adds support for constructing an `OrtSession` from a `java.nio.ByteBuffer`. These buffers can be memory-mapped from files, so no copies of the model protobuf need to be held in Java, which reduces peak memory usage during session construction. ### Motivation and Context Reduces memory usage during model construction by requiring fewer copies on the Java side. Should help with #19599.	2024-09-16 08:31:55 +10:00
Wanming Lin	c63dd0234b	[WebNN EP] Use opSupportLimits to dynamically check data type support (#22025) - Remove hard-coded data type checks and use WebNN's opSupportLimits instead - Add HasSupportedOutputsImpl for output data type validation - Get preferred layout info from opSupportLimits - Move the Not op to logical_op_builder.cc, where it belongs. This avoids the inconsistent input names in `unary_op_builder.cc`.	2024-09-13 21:36:20 -07:00
liqun Fu	a89bddd5c2	Matmul_nbits kernel for mlas sqnbits to support Fp16 inputs (#21807)	2024-09-13 14:55:08 -07:00
aciddelgado	7e2c722459	Add Continuous Decoding support in GQA (#21523) ### Description This PR adds support for continuous decoding for batch_size = 1 inputs. GQA can now take inputs of arbitrary length, using seqlens_k as total_sequence_length - 1 and the sequence length of QKV as new_sequence_length. This change does not affect the default behavior of GQA. ### Motivation and Context Prior to this change, it was impossible to support inputs with sequence_length > 1 when past context was given. This use case is essential to making continuous decoding work, which is one of our current efforts in ORT-GenAI.	2024-09-13 13:21:11 -07:00
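Here is how the lengths fit together under continuous decoding, as a Python sketch. The description states two things: seqlens_k holds total_sequence_length - 1, and the QKV sequence length is new_sequence_length. The sketch additionally assumes that the cached (past) length is the difference between the two. `GqaLengths` and `gqa_lengths` are hypothetical helpers, not part of the GQA operator's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GqaLengths:
    total_sequence_length: int
    new_sequence_length: int
    past_sequence_length: int


def gqa_lengths(seqlens_k: int, qkv_sequence_length: int) -> GqaLengths:
    """Derive GQA lengths for one batch entry (batch_size = 1).

    Assumes seqlens_k == total_sequence_length - 1 and that the QKV sequence
    length is the number of new tokens; the cached (past) length is the rest.
    """
    total = seqlens_k + 1
    past = total - qkv_sequence_length
    if qkv_sequence_length < 1 or past < 0:
        raise ValueError("new tokens must be between 1 and total_sequence_length")
    return GqaLengths(total, qkv_sequence_length, past)


# 5 tokens already cached, 3 new tokens appended in one call.
print(gqa_lengths(seqlens_k=7, qkv_sequence_length=3))
```

With `qkv_sequence_length == 1`, this reduces to ordinary token-by-token decoding. Values greater than 1 together with a non-empty past are the case this PR enables.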
Changming Sun	59b7b6bb7c	Remove training from web CI pipeline (#22082) ### Description Remove training from the web CI pipeline.	2024-09-13 09:52:49 -07:00
Michael Tyler	904b850b44	Update Arm Compute Library Execution Provider (#22032) ### Description This PR makes the following updates to the Arm Compute Library execution provider: - Target Arm Compute Library 24.07 - Add support for the following operators: Conv (FP16), NhwcConv, QLinearConv, MatMul, FusedMatMul, MatMulIntegerToFloat - Optimize memory usage and performance - Expose the enable_fast_math setting - Use the main runtime thread pool ### Motivation and Context These updates improve performance and memory usage, and enable use of a more recent version of Arm Compute Library. @microsoft-github-policy-service agree company="Arm Ltd" --------- Signed-off-by: Michael Tyler <michael.tyler@arm.com>	2024-09-12 20:51:59 -07:00
Adam Pocock	22437b581b	[java] Fix for OnnxTensor creation when passing in a ByteBuffer containing elements of a different type (#21774) ### Description Fixes a bug where the buffer offset and position were incorrectly computed if the user supplied a `ByteBuffer` to `createTensor` but set the tensor type to something other than `INT8`. This would be more common if the user was trying to load the initializers from a serialized representation and didn't want to bother with the type information (which is the case in #21321). ### Motivation and Context Partial fix for #21321. The remainder of the fix is to add a helper that allows users to load initializers out of an `onnx_data` file, but that will require adding protobuf as a dependency of the Java API to allow parsing an ONNX file separately from the native code. It might be nicer to put that functionality into ORT's C API so it can return the lengths and offsets of the initializers when provided with an ONNX file containing external initializers. We hit this kind of thing more often in Java than in other languages because Java models can be supplied as classpath resources, which we can easily read but cannot materialize on disk for the ORT native library to read.	2024-09-13 12:38:17 +10:00
Adrian Lizarraga	f7bf5a19ba	[QNN EP] Ensure QNN EP rejects nodes with I/O of dynamic shape (#22066) ### Description Updates QNN EP to properly reject nodes that have inputs or outputs with dynamic shapes. ### Motivation and Context Currently, QNN EP does not properly offload subgraphs with dynamic shapes to the CPU EP. This PR ensures that QNN EP rejects nodes that consume or generate I/O with dynamic shapes.	2024-09-12 17:18:50 -07:00
mingyueliuh	55ab13e7ca	[VitisAI] Support memory buffers containing TensorProto external data (#22042) ### Description Extend the VitisAI EP `tensor_proto_as_raw` API to support a memory buffer containing the TensorProto external data. ### Motivation and Context To reduce peak memory usage, the VitisAI EP needs to support ORT format models and the session option `session.use_ort_model_bytes_for_initializers`, which enables using the model bytes directly for initializers. Co-authored-by: mingyue <mingyue@xilinx.com>	2024-09-12 16:23:09 -07:00
0xdr3dd	5c361106e6	[Fuzzer] Add two new ORT libfuzzers (Linux clang support for now) (#22055) ### Description This PR adds two new libfuzzers to the fuzzer project: 1. Binary libfuzzer 2. libprotobuf-fuzzer To compile, run the command below on Linux: ``` LLVM_PROFILE_FILE="%p.profraw" CFLAGS="-g -fsanitize=address,fuzzer-no-link -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address,fuzzer-no-link -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --use_full_protobuf --parallel --fuzz_testing --build_dir build/ ``` Run the fuzzer: ``` LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) build/Debug/onnxruntime_libfuzzer_fuzz testinput -rss_limit_mb=8196 -max_total_time=472800 -fork=2 -jobs=4 -workers=4 -ignore_crashes=1 -max_len=2097152 2>&1 | grep -v "\[libprotobuf ERROR" ``` ### Motivation and Context The existing custom fuzzer is not coverage-guided, is slow, and works on only one model mutation at a time. The new fuzzers are coverage-guided, and we can use more model files as a corpus to increase coverage.	2024-09-12 11:50:34 -07:00
wangshuai09	d539c27de8	Fix version check for using -mavxvnni (#21616) ### Description Require `CMAKE_CXX_COMPILER_VERSION` to be greater than `11` for using '-mavxvnni'. ### Motivation and Context Building `CMakeFiles/onnxruntime_mlas.dir/root/Git.d/onnxruntime/onnxruntime/core/mlas/lib/x86_64/QgemmU8S8KernelAvx2.S.o` fails with `cc: error: unrecognized command-line option ‘-mavxvnni’; did you mean ‘-mavx512vnni’?` when using `gcc (GCC) 10.3.1`. `-mavxvnni` is supported since the [GCC 11 release](https://gcc.gnu.org/gcc-11/changes.html), so this PR changes the version check.	2024-09-12 11:42:17 -07:00
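The intended rule is "GCC 11 or newer". A tiny Python model of that comparison follows. It is only a sketch: the real check lives in CMake, and the helper name `gcc_supports_mavxvnni` is hypothetical.

```python
def gcc_supports_mavxvnni(version: str) -> bool:
    """Return True if a GCC version string (e.g. "10.3.1") is GCC 11 or newer.

    GCC 11 is the first release that accepts -mavxvnni.
    """
    major = int(version.split(".")[0])
    return major >= 11


print(gcc_supports_mavxvnni("10.3.1"))  # False: the failing compiler from the report
```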
Clément Péron	10883d7997	Suppress GCC warning in TreeEnsembleAggregator (#22062) ### Description When building with GCC 14.2.1, I got the following warning (reported as an error): onnxruntime/core/providers/cpu/ml/tree_ensemble_aggregator.h:329:59: error: template-id not allowed for constructor in C++20 [-Werror=template-id-cdtor] Remove template parameters from the constructor: the constructor TreeAggregatorMax<InputType, ThresholdType, OutputType> has been simplified to TreeAggregatorMax, because the compiler already knows the template parameters from the class definition. ### Motivation and Context Fix the build issue. Signed-off-by: Clément Péron <peron.clem@gmail.com>	2024-09-12 19:46:27 +02:00
Yulong Wang	84f73327f5	Allow scalar axes for Unsqueeze in WebGPU (#22054) ### Description Align with CPU behavior. https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/tensor/unsqueeze.cc#L60-L62	2024-09-12 10:33:37 -07:00
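Here is a Python sketch of accepting scalar axes alongside the usual normalization of ONNX Unsqueeze axes. The output rank is the input rank plus the number of axes, and negative axes count from the end of the output. `unsqueeze_shape` is a hypothetical helper for illustration, not the WebGPU or CPU kernel code.

```python
from typing import List, Sequence, Union


def unsqueeze_shape(input_shape: Sequence[int], axes: Union[int, Sequence[int]]) -> List[int]:
    # Accept a scalar axes value by treating it as a one-element list.
    axes_list = [axes] if isinstance(axes, int) else list(axes)
    out_rank = len(input_shape) + len(axes_list)
    normalized = []
    for a in axes_list:
        if not -out_rank <= a < out_rank:
            raise ValueError(f"axis {a} out of range for output rank {out_rank}")
        normalized.append(a + out_rank if a < 0 else a)
    if len(set(normalized)) != len(normalized):
        raise ValueError("duplicate axes")
    # Insert a 1 at each requested axis and copy input dims everywhere else.
    dims = iter(input_shape)
    return [1 if i in normalized else next(dims) for i in range(out_rank)]


print(unsqueeze_shape([3, 4], 0))  # [1, 3, 4]
```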
mindest	951b1b7160	[CI] Linux ROCm CI Pipeline: fix error, set trigger rules (#22069) ### Description * Correct the wrong EP name for ROCm to fix the CI error. * Update `set-trigger-rules.py`. * Modify the .yml via `set-trigger-rules.py`.	2024-09-12 09:54:32 -07:00
Yi Zhang	ae39c40e5b	Fix typo in iOS pipeline (#22067) ### Motivation and Context The parameter isn't correct. It may only be by chance that it hasn't had a negative impact so far. `d8e64bb529/cmake/CMakeLists.txt (L1712-L1717)`	2024-09-12 19:07:42 +08:00
Prathik Rao	d495e6cf1c	Add support for Uint8ClampedArray (#21985) Fixes https://github.com/microsoft/onnxruntime/issues/21753	2024-09-11 22:02:30 -07:00
Lennart Hannink	d8e64bb529	Refactor CoreMLExecution to C++ bridge class (#21857) Refactor Objective-C++ class `CoreMLExecution` into existing C++ bridge class `onnxruntime::coreml::Execution`.	2024-09-11 16:05:37 -07:00
sfatimar	0309c5f02f	OVEP release LNL 1.2.1 (#22027) Error codes are added to catch compilation errors and signal a recompile. Remote tensors are added to ensure direct memory access for NPU inferencing. The UMD bypass cache, enabled with 2024.4, eliminates the need for disk caching. ### Motivation and Context The changes are needed to ensure backward compatibility. UMD bypass caching eliminates driver caching. Remote tensors improve inference performance on the NPU. --------- Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com> Co-authored-by: saurabh <saurabh1.kale@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>	2024-09-11 14:55:40 -07:00
Jagadish Krishnamoorthy	b800328628	[ROCm EP/ MIGraphx EP] matmul_nbits: Use GPU_WARP_SIZE_HOST for host side code (#22045) ### Description For ROCm devices, host-side code needs to call GPU_WARP_SIZE_HOST to query the warpSize of the underlying GPU device. ### Motivation and Context Fixes MatMulNBits tests on gfx1100/01, which have a warpSize of 32. Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>	2024-09-11 14:52:18 -07:00
Bin Miao	4d82404544	[WebNN EP] Support GRU operator (#20405) This PR adds support for the GRU operator to the WebNN EP. @Honry, @fdwr thanks!	2024-09-11 14:16:36 -07:00
Xavier Dupré	91c916f9c6	Improve hash_function used by TreeEnsemble (#22043) ### Description unordered_map is implemented differently in Visual Studio and GCC. Inserting consecutive keys seems to perform poorly on Windows. ### Motivation and Context Improve the performance of ONNX Runtime when initializing trees.	2024-09-11 10:41:04 -07:00
Yi-Hong Lyu	e91ff9438b	Enable Pad->Conv (no pads) fusion (#22001) ### Motivation and Context Some models have the pattern Pad -> Conv. If the Conv doesn't have a pads attribute, the Pad can be fused into the Conv.	2024-09-11 09:54:15 -07:00
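Here is a minimal Python sketch of the fusion condition. It maps a Pad's `pads` onto Conv `pads` when the Conv has none of its own. Beyond what the description says, it assumes an (N, C, spatial...) input, constant-mode zero padding, non-negative pads, and `auto_pad = NOTSET` on the Conv. The function name is hypothetical, and this is not ORT's optimizer code.

```python
from typing import List, Optional, Sequence


def fuse_pad_into_conv(
    pad_pads: Sequence[int],
    mode: str = "constant",
    constant_value: float = 0.0,
) -> Optional[List[int]]:
    """Return Conv `pads` equivalent to Pad -> Conv, or None if fusion is unsafe.

    `pad_pads` uses the ONNX Pad layout [b_0, ..., b_{r-1}, e_0, ..., e_{r-1}]
    over an (N, C, spatial...) tensor; Conv `pads` has the same begin/end layout
    but covers only the spatial axes. The Conv is assumed to have no pads of
    its own and auto_pad = NOTSET.
    """
    if len(pad_pads) % 2 != 0 or len(pad_pads) < 6:
        raise ValueError("pads must hold 2 * rank entries for a rank >= 3 tensor")
    if mode != "constant" or constant_value != 0.0:
        return None  # Conv pads with zeros, so only zero constant padding matches
    if any(p < 0 for p in pad_pads):
        return None  # negative pads crop, which Conv pads cannot express
    rank = len(pad_pads) // 2
    begins, ends = list(pad_pads[:rank]), list(pad_pads[rank:])
    if any(begins[:2]) or any(ends[:2]):
        return None  # padding the batch or channel axis cannot move into Conv
    return begins[2:] + ends[2:]


# NCHW Pad with H padded (1, 3) and W padded (2, 4) -> Conv pads [1, 2, 3, 4].
print(fuse_pad_into_conv([0, 0, 1, 2, 0, 0, 3, 4]))
```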
Julius Tischbein	20d94648bb	ConvTranspose using cuDNN Frontend with NHWC support (#21752) ### Description Added the cuDNN Frontend and used it for the NHWC ConvTranspose op, including an option for bias fusion. Similar to this [Conv PR](https://github.com/microsoft/onnxruntime/pull/19470). ### Backward compatible If ORT is built with cuDNN 8, the cuDNN frontend will not be built into the binary, and the old kernels (using cuDNN backend APIs) are used. ### Major Changes For cuDNN 9, we will enable the cuDNN frontend to fuse the data-gradient convolution and bias when the provider option fuse_conv_bias=1 is set. ### Potential Issues The cuDNN frontend uses TF32 by default. It can be disabled using the use_tf32 CUDA provider option, but if the cuDNN frontend encounters issues building an operation graph, it will fall back to using TF32. ### Follow ups This is one of the PRs that aim to enable NHWC by default in the CUDA EP (here, the ConvTranspose operation) if the device supports it. Other changes will follow to make this possible: (1) Enable prefer_nhwc by default for devices with sm >= 70. (2) Set fuse_conv_bias=1 by default after more testing. (3) Add other NHWC operators (like Resize or UpSample). ### Motivation and Context The new cuDNN Frontend library provides functionality to fuse operations and new heuristics for kernel selection. Here it fuses the convolution data-gradient operation (ConvTranspose) with the pointwise bias operation. ### Minor Change There was a small bug in the CUDA convolution operation when `GetCudnnConv1dPadToNc1d` was enabled.	2024-09-10 16:51:00 -07:00
PARK DongHa	f633caa0b1	Create CMake option `onnxruntime_USE_VCPKG` (#21348) ### Changes
1. CMake option `onnxruntime_USE_VCPKG`. It will be used in the vcpkg port.
   * Unit tests may fail because this option can lead to a mixture of unexpected external library versions. In particular, the ONNX, Protobuf, and Flatbuffers versions can differ.
2. Overhaul of `onnxruntime_external_deps.cmake`
   * Make `FetchContent_Declare` try `find_package`. See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
   * Relocated `FetchContent_Declare` and `FetchContent_MakeAvailable` (or `onnxruntime_fetchcontent_makeavailable`) so that related calls sit closer together. It was too hard to navigate the entire file to find related sections.
   * Alias `IMPORTED` targets like build targets (e.g. `ONNX::onnx` --> `onnx`)
```cmake
# The script uses `find_package` with the changes.
# In this case, use vcpkg to search dependencies
# See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
include(external/onnxruntime_external_deps.cmake)
```
3. Create CMakePresets.json and presets to [run vcpkg in manifest mode](https://learn.microsoft.com/en-us/vcpkg/concepts/manifest-mode)
   * Currently, this is NOT for the training build
   * The main triplets are `x64-windows` and `x64-osx`
```pwsh
Push-Location "cmake"
cmake --preset "x64-windows-vcpkg"
cmake --build --preset "x64-windows-vcpkg-debug"
Pop-Location
```
```bash
pushd "cmake"
cmake --preset "x64-osx-vcpkg"
cmake --build --preset "x64-osx-vcpkg-debug"
popd
```
4. Updated tools/ci_build/build.py
   * `--use_vcpkg` option: it needs `CMAKE_TOOLCHAIN_FILE` set to the [vcpkg.cmake toolchain script](https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake)
   * `--compile_no_warning_as_error` is recommended because library version differences will cause unexpected compiler warnings
```bash
python ./tools/ci_build/build.py \
    --compile_no_warning_as_error \
    --use_vcpkg \
    --cmake_extra_defines "CMAKE_TOOLCHAIN_FILE:FILEPATH=${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" \
    --cmake_extra_defines "VCPKG_TARGET_TRIPLET=..."
```
5. Created a `Vcpkg` job for Windows and macOS
   * Shows how to set up and use vcpkg, similar to the CMakePresets.json usage
### Motivation and Context
* Help #7150
* Help https://github.com/microsoft/vcpkg/pull/36850
* https://github.com/luncliff/vcpkg-registry/pull/212
* https://github.com/microsoft/vcpkg/pull/39881
* https://github.com/luncliff/vcpkg-registry/pull/215
* https://github.com/luncliff/vcpkg-registry/pull/216
* https://github.com/luncliff/vcpkg-registry/pull/227
* https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
* https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake
### Future Works?
More feature coverage with the vcpkg-supported libraries
* CUDA feature support
* Training feature support	2024-09-10 16:39:27 -07:00
kunal-vaishnavi	c5418f35d4	Add fusions for re-designed Phi-3 vision and Phi-3.5 vision ONNX models (#22026) ### Description This PR adds the optimizer logic to fuse the newly designed exported ONNX models for Phi-3 vision and Phi-3.5 vision. ### Motivation and Context After the re-designed export of Phi-3 vision and Phi-3.5 vision, the ONNX models for the vision component and embedding component contain `If` and `Loop` ops to handle multi-image support.	2024-09-10 16:18:05 -07:00


11677 commits