onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-20 21:40:57 +00:00

Author	SHA1	Message	Date
Edward Chen	209ff86d52	Get build working on Xcode 16 (#22168)	2024-09-24 08:33:03 -07:00
Adam Reeve	ce13f651d8	Fix NaN propagation for float16 min and max operators (#22161) This makes min and max with NaN for either operand always return NaN for float16 data, matching the behaviour of float and double. The behaviour for floats and doubles was previously fixed for the CPU provider in #21492 and the CUDA provider in #19984, but these PRs didn't fix the behaviour for float16 due to tests causing asan errors. The memory access violations with float16 data have now been fixed in #22135, so this PR is a follow-up to make float16 min and max behave the same as float and double for both the CPU and CUDA providers now that we can add tests for this. ### Motivation and Context Relevant previous issues (not float16 specific): * #21455 * https://github.com/onnx/onnx/issues/6003	2024-09-24 08:25:20 -07:00
Adam Pocock	cfa45df6b5	[java] Migrate OnnxTensors created from arrays over to a backing Java buffer (#18556) ### Description Following from #16578 and #16835 this migrates over `OnnxTensor.createTensor(<array>)` to first instantiate a `java.nio.Buffer` and then copy the array into that buffer in Java before creating the tensor. It also changes the `OnnxTensor.getValue()` method which returns a multidimensional array so it does the array construction and value copy in Java. This allows the removal of some unpleasant recursive C code which repeatedly calls into the JVM to traverse Java's arrays. The equivalent Java code is still unpleasant and recursive, but it's easier to reason about and memory safe. As a bonus, more `OnnxTensor`s are now backed by buffers which allow users to pin memory and reduce allocations by reusing them for same sized inputs. Some of the JNI code which parses Java arrays still exists as it's used by `OnnxMap`, removing that will be the target of a future refactor. Strings are still processed in JNI as it is easier to work with String tensors and UTF-8 arrays in C. ### Motivation and Context Minimizing the amount of JNI code makes it easier to maintain and using buffers in preference to arrays allows for fewer allocations.	2024-09-24 15:36:52 +10:00
Scott McKay	ae66d0e7cf	Update ROCm reduction to match recent CUDA change (#22192) ### Description Add handling of a missing optional axes input to the ROCm reduction ops. Matches CUDA EP change from #22149. ### Motivation and Context Fix pipeline.	2024-09-24 11:58:48 +10:00
Tianlei Wu	0806879ad4	Update lintrunner requirements (#22185) ### Description * Add lintrunner to requirements-lintrunner.txt * Lock lintrunner and lintrunner-adapter version * Update documentation ### Motivation and Context The document is not up to date.	2024-09-23 18:27:16 -07:00
Dmitri Smirnov	a7c9f27d2d	Remove training pipelines from Win CPU CI as redundant (#22190)	2024-09-23 18:15:41 -07:00
Yulong Wang	df25006d1b	upgrade micromatch to v4.0.8 (#22174) ### Description Upgrade `micromatch` to v4.0.8 https://github.com/advisories/GHSA-952p-6rrq-rcjv	2024-09-23 14:39:32 -07:00
Hann Wang	7a782b7213	[ROCm] fix rocm-6.2 build issues (#21993) Composable Kernel build fails under ROCm 6.2. This PR patches Composable Kernel the same way as https://github.com/ROCm/composable_kernel/pull/1346 * fix buffer resource to match "s" constraint * add missing memory clobber	2024-09-23 14:01:54 -07:00
Christian Bourjau	1a84f53c35	Make argmin/argmax support identical data types and add int64 support (#21641)	2024-09-23 13:02:29 -07:00
Jiajia Qin	80e9df826e	[js/webgpu] Optimize InstanceNormalization (#21995) ### Description InstanceNormalization computes `y = scale * (x - mean) / sqrt(variance + epsilon) + B`, where mean and variance are computed per instance per channel. Computing the mean and variance per channel is a reduction, which favors the NCHW layout because adjacent threads can then access contiguous data in GPU memory. This PR optimizes both NHWC and NCHW InstanceNormalization. To calculate the mean and variance efficiently, the input must be NCHW rather than NHWC; shared memory is then used for the reduction that produces `channel_scale` and `channel_shift`. With this PR, `channel_scale` and `channel_shift` are computed the same way for NHWC and NCHW InstanceNormalization, and the overall performance of the two is now very close. The data below comes from SD Turbo profiling results. Before (InstanceNormalization overall time: 140.84 ms): InstanceNormComputeMean 129.70 ms; InstanceNormalizationNHWC 10.55 ms; InstanceNormComputeChannelScaleShift 0.59 ms. After (InstanceNormalization overall time: 59.44 ms): InstanceNormComputeChannelScaleShift 28.57 ms; TransposeShared 20.19 ms; InstanceNormalizationNHWC 10.68 ms.	2024-09-23 11:32:09 -07:00
Chester Liu	9b37b3ea44	Specify the paths of system tools when building Apple framework (#22056) ### Description Specify the paths of `ar`, `ld` and `libtool` when building the Apple framework. ### Motivation and Context Sometimes non-system executables come before the system-provided ones. This PR prevents that from happening.	2024-09-23 17:19:30 +08:00
Hector Li	b636b275aa	Fix an issue that QNN models shared from other session use the session logger from that session (#22170) ### Description Fix an issue where QNN models shared from another session also use the session logger from that producer session, which causes confusion. Make the QNN model compute function use the session logger from the current session.	2024-09-21 20:41:56 -07:00
Tianlei Wu	171b901e32	Add benchmark script for segment anything v2 (#22169) ### Description Add a benchmark script for segment anything v2. It depends on https://github.com/microsoft/onnxruntime/pull/22119 for ONNX export, and https://github.com/microsoft/onnxruntime/pull/22167 for SAM2 graph fusion. ### Motivation and Context Benchmark SAM2 model performance.	2024-09-20 21:32:37 -07:00
Tianlei Wu	1431215dcf	Add fusion script for segment anything v2 (#22167) ### Description * Add MultiHeadAttention fusion for SAM2. * Add LayerNormalization fusion for NCHW format by inserting Transpose from NCHW to NHWC before layer normalization, and add another Transpose after layer norm to convert NHWC back to NCHW. Hopefully, those extra Transpose nodes will be removed when prefer_nhwc is enabled later. * Add a condition that the input shall be 3D when fusing SkipLayerNorm. * Update convert_to_onnx.py to add `--optimize` and `--use_gpu` options to output optimized ONNX models for CPU/CUDA EPs. * Add an option `--dtype fp16|fp32` in convert_to_onnx.py to support converting the optimized model to float16. * Update the demo to use the optimized ONNX models. ### Motivation and Context To support optimization, for CPU/CUDA EPs, of the SAM2 model exported in https://github.com/microsoft/onnxruntime/pull/22119	2024-09-20 21:32:16 -07:00
Dmitri Smirnov	fe8a10caa4	Address ZeroK case for Gemm for CPU and CUDA (#22111) ### Description When K == 0, output an MxN matrix filled with the bias if present, or with zeros otherwise. This brings it in line with MatMul behavior, especially when Gemm is used to fuse MatMul with Add. ### Motivation and Context * Comply with numpy spec of MatMul * Address a case when empty initializers are used for computation.	2024-09-20 17:24:13 -07:00
Yi Zhang	8d2d40781c	set CMAKE_SYSTEM_PROCESSOR in xnnpack.cmake (#22155) ### Motivation and Context By default, CMAKE_SYSTEM_PROCESSOR is the same as CMAKE_HOST_SYSTEM_PROCESSOR (https://cmake.org/cmake/help/latest/variable/CMAKE_SYSTEM_PROCESSOR.html). KleidiAI uses CMAKE_SYSTEM_PROCESSOR to determine whether to include some arm64 ukernels (https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CMakeLists.txt#L134). We use a Mac with an Intel CPU to cross-compile for Mac with ARM in the iOS packaging pipeline, so we need to set CMAKE_SYSTEM_PROCESSOR to the same value as ORT_TARGET_PROCESSOR.	2024-09-20 15:19:26 -07:00
Scott McKay	d4692835bf	Fix std::chrono/date conflict for mac builds with C++20 (#22138) ### Description Fix usage of C++ std::chrono::operator<< in Mac builds for a wider range of Xcode versions/targets. ### Motivation and Context #21033	2024-09-20 11:18:24 -07:00
Scott McKay	da3bd45cdd	Fix CUDA reduction ops handling of optional axes input (#22149) ### Description The optional `axes` input may exist with an empty name and be a nullptr. Update the CUDA implementation to handle this. ### Motivation and Context #22035	2024-09-20 13:44:47 +10:00
Adam Reeve	f3cbe76059	Fix memory access violations in the CPU float16 min and max operators (#22135) ### Description Fixes the logic for getting the number of elements for the input and output spans in the `MinMaxMLFloat16` method. This was incorrectly using the full number of elements in the output rather than the number of elements in the current span, which worked fine with 1D inputs but breaks with 2D inputs. This meant that as the `BroadcastLooper` iterated over spans, `MinMaxMLFloat16` would start at a position further forward in the input and output and read and write further beyond the end of the input and output respectively, causing the asan error in #21558 and sometimes segfaults in larger examples. ### Motivation and Context Fixes #21558. From further testing, this issue didn't only cause asan errors in tests but causes segfaults with larger sized inputs.	2024-09-19 18:04:10 -07:00
Jing Fang	b0ef1f3923	[CPU EP] Refactor MatMulNBits to decouple type implementation (#22140) ### Description Decouple implementation for different A types to improve readability and maintainability. ### Motivation and Context As more types are added, the implementation can differ a lot between types. Besides, different hardware may require different implementations. This PR creates an abstraction boundary where different implementations can plug in easily.	2024-09-19 17:57:35 -07:00
George Wu	c270fe6dd3	[qnn ep] fix naming convention of ort-nightly-qnn package (#22157) followed the rocm example below it which isn't the naming convention we want to follow. didn't fix rocm because i'm not sure if there are consumers using its naming convention.	2024-09-19 17:33:31 -07:00
Hector Li	03ce996b7c	Fix QNN random crash for UT with multi-thread run (#22160) ### Description Fix a random crash in QNN UTs with multi-threaded runs, such as QnnHTPBackendTests.MultithreadHtpPowerCfgDefaultAndRunOption. Root cause: a last-minute code change (`b4e26bd5f9`) replaced `static std::mutex mutex;` with `OrtMutex mutex;` and dropped the `static`.	2024-09-19 16:39:13 -07:00
raoanag	73b5c3354c	Set Transpose Attribute instead of manipulating MatMul Strides (#21927) ### Description Update the DML EP so that the `FusedMatMul` ORT graph node has the TransA/B attributes set instead of updating the strides.	2024-09-19 16:26:20 -07:00
Scott McKay	bd60add8ce	Update nuget.exe used in WindowsAI nuget packaging so `readme` property is supported. (#22141) ### Description Use the latest nuget.exe so that the `readme` property is supported. ### Motivation and Context #22137	2024-09-19 19:06:47 +10:00
Scott McKay	99ee6eeca2	Enable Android 16 KB page size support (#22076) ### Description Add linker flags to support 16 KB page sizes on Android. See https://source.android.com/docs/core/architecture/16kb-page-size/16kb#build-lib-16kb-alignment ### Motivation and Context #21837	2024-09-19 18:53:57 +10:00
Wanming Lin	e33b08ead1	[WebNN EP] Use both MLOperandDescriptor.dimensions and MLOperandDescriptor.shape (#22121) The spec renames MLOperandDescriptor.dimensions to MLOperandDescriptor.shape. To support older Chromium versions, we will keep both in the WebNN EP for a while. Fixed #22120	2024-09-19 01:20:40 -07:00
George Wu	944d87381d	[QNN EP] set up py packaging pipeline for Linux x64 (#22132) Set up a pipeline to produce nightly Linux x64 wheels for onnxruntime-qnn. This can be used for offline context binary generation.	2024-09-18 23:24:32 -07:00
mguynn-intc	d5f6343a4a	Implementation of AVX-VNNI-INT8 dot product instructions into MLAS GEMM (#21984) ### Description ONNXRuntime implementation of S8S8 was using the default C++ implementation; with this new ISA, all variants of QGemm Int8 can support VNNI dot product and full AVX2 instructions. All signed/unsigned variants support VNNI instructions starting with LNL. Renamed structs and functions to better indicate support of all Int8 vs U8X8. ### Motivation and Context LNL HW implemented new ISA, and this code enables that ISA in QGemm. Speed is improved for S8S8 to match with existing U8S8 code. S8U8 would also match speed if ONNX formally accepted the data type.	2024-09-18 22:18:23 -07:00
Yi Zhang	560778fd07	use mac 12 for esrp code sign (#22134) ### Description Fix regression caused by #17361.	2024-09-19 12:06:41 +08:00
Tianlei Wu	a9740d6f96	Add onnx export script for segment anything v2 (#22119) ### Description Add ONNX export script for segment anything v2 (SAM2). ### Limitations * Does not support video. Only images are supported right now. * The decoder does not support batch inference. ### Credits The demo is based on the [SAM2 notebook](https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/image_predictor_example.ipynb) and modified to run with ORT. The export of the decoder is inspired by https://github.com/vietanhdev/samexporter. ### Demo Example output of demo: ![sam2_demo](https://github.com/user-attachments/assets/9a9fa360-8c20-482e-9935-a7aba9cf15de) ### Motivation and Context To support optimization of SAM2 image segmentation.	2024-09-18 14:31:59 -07:00
Patrice Vignola	05acfb90ab	[DML EP] Add QDQ+MatMul fusion into MatMulNBits (#22114)	2024-09-17 22:37:45 -07:00
Adrian Lizarraga	b8dae685e4	[QNN EP] Build Python 3.12 wheel for Windows ARM64 (#22118) ### Description Builds arm64 Python 3.12 wheel for QNN EP.	2024-09-17 21:16:31 -07:00
Fangjun Kuang	c6dc787a3d	Update q4common.h to include the missing header (#21786) Fixes #21748 CC @gyagp	2024-09-17 20:55:56 -07:00
dependabot[bot]	7e98926810	Bump body-parser from 1.20.1 to 1.20.3 in /onnxruntime/test/wasm (#22106)	2024-09-17 22:59:40 +00:00
Atanas Dimitrov	275eb404bf	Speedup `CumSum` for large arrays (#22048) ### Description This PR refactors the `CPU` kernel for the `CumSum` operator. The new implementation strives to have as little indirection as possible. ### Motivation and Context Currently the `CumSum` operator performs very poorly on 1D tensors (it was slower than a Python loop). This is caused by the extensive use of `SliceIterator`s. Here is a relevant snippet:
```python
import time

import ndonnx as ndx
import numpy as np
import onnx
import onnxruntime as ort


def test_cumsum(sz):
    a = ndx.array(shape=(sz,), dtype=ndx.int64)
    b = ndx.cumsum(a)
    model = ndx.build({'a': a}, {'b': b})
    onnx.save(model, "model.onnx")
    input = np.ones(sz, np.int64)
    start = time.time()
    result = ort.InferenceSession(model.SerializeToString()).run(None, {'a': input})
    end = time.time()
    return end - start


def test_cumsum_by_hand(sz):
    input = np.ones(sz, np.int64)
    start = time.time()
    answer = [0]
    for i in input:
        answer.append(answer[-1] + i)
    end = time.time()
    return end - start


print(test_cumsum(int(1e7)))
print(test_cumsum_by_hand(int(1e7)))
```
Before
```console
0.9794480800628662
0.4518160820007324
```
After
```console
0.02483987808227539
0.5496008396148682
```
The `model.onnx`: <img width="214" alt="image" src="https://github.com/user-attachments/assets/a213d6ff-86c3-49b5-a493-ebfd97deaa41"> The flame graph: ![profile-3](https://github.com/user-attachments/assets/c7418a05-cb65-4d72-a76d-6a6b05b4ba4d)	2024-09-17 15:53:07 -07:00
Yi Zhang	b94ba09e4f	Upgrade XNNPACK to latest version (#22012) ### Description Update XNNPACK to the latest version (Sep 4) - Some op outputs are changed; channel or stride params are moved into the reshape function, e.g. `96962a602d` - Input params of XNNPACK's resize-related functions are changed a lot - KleidiAI is added as a dependency on ARM64 - The latest XNNPACK includes 2 static libs, microkernels-prod and xnnpack. Without microkernels-prod, it throws an undefined symbols error. - Add ORT_TARGET_PROCESSOR to get the real processor target in CMake	2024-09-17 10:12:16 -07:00
Jian Chen	fa68ae2def	Update pool to MacOS-13 (#17361) ### Description See https://github.com/microsoft/onnxruntime-extensions/pull/476 and https://github.com/actions/runner-images/issues/7671 ### Current issue
- [ ] For the default Xcode 15.2 that comes with MacOS-13, we need to update the boost container header boost/container_hash/hash.hpp version to pass the build.
- [x] For Xcode 14.2, the build passed but `Run React Native Detox Android e2e Test` failed. Possibly a flaky test, https://github.com/microsoft/onnxruntime/pull/21969
- [x] For Xcode 14.3.1, we encountered the following issue in `Build React Native Detox iOS e2e Tests`:
```
ld: file not found: /Applications/Xcode_14.3.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/arc/libarclite_iphonesimulator.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
Applied the following code to the end of both ios/Podfile, which fixed the issue:
```
post_install do |installer|
  installer.generated_projects.each do |project|
    project.targets.each do |target|
      target.build_configurations.each do |config|
        config.build_settings['IPHONEOS_DEPLOYMENT_TARGET'] = '13.0'
      end
    end
  end
end
```
- [x] https://github.com/facebook/react-native/issues/32483 Applying changes to ios/Podfile:
```
pre_install do |installer|
  # Custom pre-install script or commands
  puts "Running pre-install script..."

  # Recommended fix for https://github.com/facebook/react-native/issues/32483
  # from https://github.com/facebook/react-native/issues/32483#issuecomment-966784501
  system("sed -i '' 's/typedef uint8_t clockid_t;//' \"${SRCROOT}/Pods/RCT-Folly/folly/portability/Time.h\"")
end
```
- [ ] Detox environment setup exceeded the timeout of 120000 ms during the iOS e2e test
### Dependent
- [x] https://github.com/microsoft/onnxruntime/pull/21159
--------- Co-authored-by: Changming Sun <chasun@microsoft.com>	2024-09-17 10:07:30 -07:00
Chi Lo	6dcdc70aa7	[TensorRT EP] Add supportsModelV2 (#22081) `supportsModel` is deprecated in TRT 10.1. Add `supportsModelV2` but still keep `supportsModel`, as we still need to support TRT 8.6, where `supportsModelV2` is not supported.	2024-09-17 09:52:28 -07:00
Wanming Lin	9786909ab5	[WebNN EP] Support QuantizeLinear and DequantizeLinear ops (#22097)	2024-09-17 08:18:47 -07:00
Xu Xing	afd642a194	[js/webgpu] Replace array with string in transpose perm (#21930) Perf test data (100000 iterations): Array: 12.599999997764826 ms; String: 1.6000000014901161 ms. Perf test case:
```
const permFunctionBodyArray = (rank: number, input: string): string => {
  const reverseFunc = [];
  reverseFunc.push(`fn perm(i: int) -> int { var a: int;`);
  for (let i = 0; i < rank; ++i) {
    reverseFunc.push(input);
  }
  reverseFunc.push('return a;}');
  return reverseFunc.join('\n');
};

const permFunctionBodyString = (rank: number, input: string): string => {
  let reverseFunc = `fn perm(i: int) -> int { var a: int;`;
  for (let i = 0; i < rank; ++i) {
    reverseFunc += input;
  }
  reverseFunc += 'return a;}';
  return reverseFunc;
};

const count = 100000;
let start, end;

// Time the array-based builder.
console.time('array');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyArray(3, 'input');
}
end = performance.now();
console.timeEnd('array');
console.log('Array: ' + (end - start));

// Time the string-concatenation builder.
console.time('string');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyString(3, 'input');
}
end = performance.now();
console.log('String: ' + (end - start));
console.timeEnd('string');
```
	2024-09-16 23:17:46 -07:00
Yang Gu	2db6b734f5	[js/webgpu] Fix issue to run model demucs (#22074) This is to fix issue #22031 to run model demucs. For conv-transpose, outputPadding.length could be 1, while spatialRank is 2. The fix is to append enough 0s to outputPadding. For conv, the issue is similar. kernelShape.length sometimes could be 1, while inputs[1].dims.length is 4. The fix is also to append enough 0s to kernelShape.	2024-09-16 23:17:10 -07:00
Yulong Wang	291a5352b2	[js/web] remove training release (#22103) ### Description Remove training from onnxruntime-web. Follow-up to #22082.	2024-09-16 10:56:22 -07:00
Erick Muñoz	e93f14e00d	Check partial conversion on FP16 to FP32 AVX Cast kernel (#22091) ### Description Added checks to convert partial vectors in the early stages of the FP16 to FP32 cast using AVX NE CONVERT ISA. ### Motivation and Context Avoid storing data in sections outside of the output buffer, these checks are missing on the [original PR](https://github.com/microsoft/onnxruntime/pull/21183). This fix prevents memory corruption when the output buffer has a size [n16 + 1, n16 + 7] with 0< n	2024-09-16 09:20:06 -07:00
George Wu	1a1669fe81	use node name in transpose optimizer when adding nodes rather than optype (#22084) patch from @john-dance "The main change is simple: Use the original node name rather than the original node op_type when creating new nodes. Here are my comments on the change: ------ The onnx runtime uses the op_type as the basis for a new node name, so a node claimed by QNN EP might be named Conv_token_1 with no relation to the original /conv1/Conv. This patch: 1. Adds OpName as a virtual function in NodeRef and implements it in ApiNode. 2. AddNode now takes an op_name and op_type and passes them both to CreateNodeHelper. 3. CreateNodeHelper uses the op_name rather than the op_type in GenerateNodeName 4. Direct calls to AddNode are modified to either use the NodeRef if available, or just repeat the op_type if not available. The result is that the new nodes are named something like /conv1/Conv_token_1, allowing a straight forward mapping back to the original model node (if they exist in the original graph)."	2024-09-16 09:12:13 -07:00
Adam Pocock	6d7235ba5a	[Java] Exposing SessionOptions.SetDeterministicCompute (#18998) ### Description Exposes `SetDeterministicCompute` in Java, added to the C API by #18944. ### Motivation and Context Parity between C and Java APIs.	2024-09-16 11:55:38 +10:00
Adam Pocock	02e00dc023	[java] Adding ability to load a model from a memory mapped byte buffer (#20062) ### Description Adds support for constructing an `OrtSession` from a `java.nio.ByteBuffer`. These buffers can be memory mapped from files which means there doesn't need to be copies of the model protobuf held in Java, reducing peak memory usage during session construction. ### Motivation and Context Reduces memory usage on model construction by not requiring as many copies on the Java side. Should help with #19599.	2024-09-16 08:31:55 +10:00
Wanming Lin	c63dd0234b	[WebNN EP] Use opSupportLimits to dynamically check data type support (#22025) - Remove hard-coded data type checks and use WebNN's opSupportLimits instead - Add HasSupportedOutputsImpl for output data type validation - Get preferred layout info from opSupportLimits - Move Not op to logical_op_builder.cc because it should be there. This avoids the inconsistent input names in `unary_op_builder.cc`.	2024-09-13 21:36:20 -07:00
liqun Fu	a89bddd5c2	Matmul_nbits kernel for mlas sqnbits to support Fp16 inputs (#21807)	2024-09-13 14:55:08 -07:00
aciddelgado	7e2c722459	Add Continuous Decoding support in GQA (#21523) ### Description This PR adds support for Continuous Decoding for batch_size = 1 input. From now on, GQA can take arbitrary-length input using seqlens_k as total_sequence_length - 1 and the sequence length of qkv as new_sequence_length. This change will not affect the default behavior of GQA. ### Motivation and Context Prior to this change it was impossible to support sequence_length > 1 inputs when past context was given. This use case is essential to making continuous decoding work, which is one of our current efforts in ORT-GenAI.	2024-09-13 13:21:11 -07:00
Changming Sun	59b7b6bb7c	Remove training from web ci pipeline (#22082) ### Description Remove training from web ci pipeline	2024-09-13 09:52:49 -07:00
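
The zero-K Gemm behaviour described in fe8a10caa4 (#22111) can be illustrated with a small pure-Python reference. This is only a sketch under stated assumptions, not the ONNX Runtime kernel: the function name `gemm_reference`, the explicit `n` argument, and the requirement that `bias` already has shape M x N (no broadcasting) are all hypothetical choices made for the illustration.

```python
def gemm_reference(a, b, n, bias=None):
    """Hypothetical reference Gemm: returns a @ b (+ bias) as an M x N list of lists.

    a is a list of M rows of length K, b is a list of K rows of length N.
    n is passed explicitly because it cannot be read from b when K == 0.
    bias, if given, is assumed to already be M x N (no broadcasting here).
    """
    m = len(a)
    k = len(b)
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            # When K == 0 the sum runs over nothing and evaluates to 0,
            # so the result is just the bias (or zero without a bias).
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            row.append(acc + (bias[i][j] if bias is not None else 0.0))
        out.append(row)
    return out


# K == 0: A is 2 x 0 and B is 0 x 3, so the output is 2 x 3.
zeros = gemm_reference([[], []], [], 3)
biased = gemm_reference([[], []], [], 3, bias=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

Under these assumptions, `zeros` is a 2 x 3 matrix of zeros and `biased` equals the bias itself, which is the outcome the commit message describes for K == 0.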


11997 commits
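
The NaN-propagation rule from ce13f651d8 (#22161), where min and max return NaN if either operand is NaN, can be sketched in plain Python. The helper names `nan_min` and `nan_max` are hypothetical, and the sketch works on Python floats rather than float16 tensors; it only illustrates the semantics and is not the operator implementation.

```python
import math


def nan_min(x, y):
    """Hypothetical helper: minimum that returns NaN if either operand is NaN."""
    if math.isnan(x) or math.isnan(y):
        return math.nan
    return min(x, y)


def nan_max(x, y):
    """Hypothetical helper: maximum that returns NaN if either operand is NaN."""
    if math.isnan(x) or math.isnan(y):
        return math.nan
    return max(x, y)


# Python's built-in min is order-dependent with NaN, because every
# comparison against NaN is False; the helpers above are not.
builtin_result = min(1.0, math.nan)   # keeps the first argument, 1.0
propagated = nan_min(1.0, math.nan)   # NaN regardless of argument order
```

The explicit `isnan` check is what makes the result independent of argument order; a bare comparison-based min or max silently drops the NaN in one of the two orders.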