onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-18 18:52:16 +00:00

Author	SHA1	Message	Date
Edward Chen	2ec1f94bfd	Make MlasTestFixture::mlas_tester an inline variable. (#18263 ) Make MlasTestFixture::mlas_tester an inline variable. With this change we no longer need to define `MlasTestFixture<T>::mlas_tester` outside of the class definition.	2023-11-03 10:50:21 -07:00
Changming Sun	4c4d79a612	Change a bitwise logical xor to logical wise (#18246 ) ### Description Change a bitwise logical xor to logical-wise ### Motivation and Context For Boolean values we should not use bitwise operations.	2023-11-03 10:42:51 -07:00
Numfor Tiapo	192caee81f	Fix Signed Mismatch (#18258 ) This PR fixes the signed mismatch warning in DmlRuntimeFusedGraphKernel. This warning is treated as an error on the x86 versions of our internal builds, preventing us from updating to the latest ORT.	2023-11-03 10:16:37 -07:00
satyajandhyala	e207060ac9	[JS/Web] Added Uniforms support to unary ops. (#18223 ) ### Description Added uniforms support to unary ops. ### Motivation and Context Improve performance	2023-11-03 09:30:54 -07:00
winskuo-quic	90f205e79c	[QNN EP] Fix Pad UT (#17982 ) ### Description QNN EP has 2 unit tests failing: TEST_F(QnnHTPBackendTests, DISABLED_PadReflectMode) TEST_F(QnnHTPBackendTests, DISABLED_Pad4dOutOfRangePadConstantValue) For the first unit test, in QNN's master definition, it is stated that when using MIRROR_REFLECT, the before and after pad amounts must not be greater than shape(in[0])[i] - 1. Therefore, we need to change the pad amount from {0,2,0,0} to {0,1,0,0}. For second unit test, QNN does not have limitations stating that pad constant should be smaller than input[0]. The reason that the test is failing is because the unit test did not take the pad constant into consideration when doing quantization. ### Motivation and Context Fix the 2 unit tests mentioned in description.	2023-11-03 09:21:33 -07:00
Scott McKay	c352e9b1f9	Rework/cleanup the C# build infrastructure for nuget packages. (#18127 ) ### Description Update the C# nuget build infrastructure to make building a test nuget package more user friendly and to simplify - Remove usage of dotnet and msbuild in CIs - was temporary requirement until .net 6 MAUI was added to the released Visual Studio - remove SelectedTargets property and its usage - Add property for excluding mobile targets - generally we exclude based on the nuget package name - can now specify `/p:IncludeMobileTargets=false` on the command line to force exclusion - support building test package using build.py `--build_nuget` better - limit inclusion of xamarin targets as building with them requires a lot more infrastructure - use msbuild directly if xamarin targets are included. use dotnet otherwise. - remove quoting of property values as it doesn't appear to be necessary and breaks when msbuild is being used - add infrastructure to be able to pack the nuget package on linux with `dotnet pack` - `nuget pack` is not user friendly as-per comments in changes - requires stub csproj to provide the nuspec path - Remove netstandard1.0 targets from nuspec - we removed support from the actual bindings previously - Remove usage of nuget-staging directory when creating nuget package on linux - the nuspec file element has a fully qualified path for a source file so there is no obvious benefit to copying to a staging directory prior to packing ### Motivation and Context Address issues with 1P users trying to create test nuget packages locally. Long overdue cleanup of CI complexity.	2023-11-03 09:05:17 -07:00
Scott McKay	4f2096be38	Update XNNPACK to latest version (#18038 ) ### Description Update XNNPACK to latest version - adds fp16 kernels and various other improvements - requires pthreadpool update as well Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API - 'setup' is split into 'reshape' and 'setup' - some ops use a workspace buffer - copied workspace allocation from XNNPACK unit test code - some suffixes changed Added wrapper for XNNPACK caches to base XNNPACK EP kernel - simplifies usage - XNNPACK split out the code and weights caches, but the code cache isn't currently usable via the public API - we could use the internal types if we think it's required for performance reasons. non-trivial though as we'd need to propagate ifdef values from the XNNPACK build up to the ORT build. - using XNNPACK internals would also mean we would not be able to support using a pre-built XNNPACK package - not an issue currently Fixed opset registration for internal NHWC domain - was not being tied to the ONNX version, so nodes inserted by layout transformation had the incorrect opset - a number of other places needed updating once this issue was fixed Remove support for NCHW Resize from XNNPACK EP so it's NHWC only - we only supported NCHW for fp32, - doing so adds complexity in multiple places (XNNPACK EP kernel implementation, layout transformation and transpose optimization) - unclear if that complexity provides any benefit. can add back if required by production scenario ### Motivation and Context We're looking at enabling fp16 support for CoreML and NNAPI. If we do that we need a good fallback story if the CPU EP will be used. The XNNPACK fp16 kernels will hopefully provide that. NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That can be done as required in separate EPs and should be relatively simple to do.	2023-11-03 09:04:28 -07:00
Sumit Agarwal	e36d003765	Introduce new optimizer Pad + Conv/MaxPool (#18136 ) ### Description Introducing new L1 optimizer to fuse Pad to its child node if the child node is Conv or MaxPool. Pad -> Conv = Conv Pad -> MaxPool = MaxPool Major Conditions: - It will only fuse for the `Constant` mode of padding. - Conv/MaxPool should not have optional `indices` output tensor - Padding value for non-spatial dimensions should be zero and for spatial dimensions padding values should be positive for `pad` operator. For other conditions please see `SatisfyCondition()` in `pad_fusion.cc`.	2023-11-03 07:17:02 -07:00
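The fusion conditions listed in the commit above can be illustrated with a minimal sketch. It is not the code in `pad_fusion.cc`: the function name `fuse_pad_into_conv` is hypothetical, the input is assumed to be laid out with batch and channel as the first two axes, and only the conditions quoted in the commit message are checked (the real `SatisfyCondition()` checks more). The pads layout (all begins, then all ends) follows the ONNX `Pad` and `Conv` operator definitions.

```python
def fuse_pad_into_conv(pad_pads, conv_pads, mode="constant"):
    """Return the fused Conv `pads`, or None if the fusion does not apply.

    pad_pads:  Pad operator pads for every axis, [begin_0..begin_n, end_0..end_n].
    conv_pads: Conv pads for spatial axes only, [begin_0..begin_k, end_0..end_k].
    Assumption: axes 0 and 1 are batch and channel (non-spatial).
    """
    if mode != "constant":
        return None  # only the Constant padding mode is fused

    rank = len(pad_pads) // 2
    n_spatial = rank - 2
    begins, ends = pad_pads[:rank], pad_pads[rank:]

    # Non-spatial axes must not be padded at all.
    if any(p != 0 for p in begins[:2] + ends[:2]):
        return None
    # Spatial padding must not be negative (negative pads would crop the input).
    if any(p < 0 for p in begins[2:] + ends[2:]):
        return None

    fused = list(conv_pads)
    for i in range(n_spatial):
        fused[i] += begins[2 + i]            # add begin padding of spatial axis i
        fused[n_spatial + i] += ends[2 + i]  # add end padding of spatial axis i
    return fused


# A 4-D input padded by (1, 2) at the start and (3, 4) at the end of H and W,
# feeding a Conv that already pads every side by 1.
print(fuse_pad_into_conv([0, 0, 1, 2, 0, 0, 3, 4], [1, 1, 1, 1]))
# Padding the channel axis blocks the fusion.
print(fuse_pad_into_conv([0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]))
```

After a successful fusion the Pad node can be dropped and the child node consumes Pad's input directly, which is what "Pad -> Conv = Conv" expresses.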
Scott McKay	016b75260b	Pre-link when creating static library for apple framework (#18241 ) ### Description Pre-link with `ld -r` to apply symbol visibility when the static library is created to replicate XCode's Single Object Pre-link. Current builds set the visibility flags but that doesn't get applied until the static library is linked into something else, which can be too late. Pre-linking fixes this. The pre-link uses the .o files from the ORT static libraries and the .a files from external libraries. This combination limits the symbols included from the .a files to things required by the ORT .o files. In order to minimize changes elsewhere in the build we extract the .o files from the ORT static libraries using `ar -x`. Re-ordered the pieces used to build the Apple framework to make it a little more readable. Fixed a couple of misc issues with missing symbols from the minimal build that show up when pre-linking is applied. ### Motivation and Context Will hopefully address #17722	2023-11-03 23:38:29 +10:00
Xavier Dupré	1439da36fe	Partially disable QGemm tests for float 8 types (#18196 ) ### Description The quantization tool assumes QGemm is implemented for float 8 types but it is not yet supported. The condition partially disabling the test was not robust enough. This is changed by this PR.	2023-11-03 10:17:50 +01:00
Yi Zhang	9f5a6856fe	Rerun the flaky ort-web tests automatically (#18187 ) ### Description Retry 3 times at most if the web test fails. ### Motivation and Context Web GPU tests are not stable. From this link, we could find these ort-web tests are all in top 10 failing tasks. https://dev.azure.com/onnxruntime/onnxruntime/_pipeline/analytics/stageawareoutcome?definitionId=161&contextType=build. Generally, it could pass by manually rerunning it. So, enable it to rerun automatically. These test steps duration isn't long. So, it won't take too long to retry.	2023-11-03 16:34:56 +08:00
Changming Sun	d8d79521ca	Disable ccache for DML (#18230 ) ### Description Disable ccache for DML. This change is similar to #18104. Now the DML build job is having the same timeout issue. I don't know why. But disabling ccache probably would help.	2023-11-02 16:00:55 -07:00
xhcao	8d48d3e9cc	[js/web] optimize reduce related operators (#17957 )	2023-11-02 12:51:48 -07:00
Prathik Rao	8978bdc59d	add bfloat16 support for where operator (#18118 ) ### Description Adds bfloat16 as a valid input parameter type for where node for ONNX opset 16+. ### Motivation and Context Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-02 12:23:20 -07:00
pengwa	c8e1038eab	Optimize 4bit Qlora training (#18131 ) ### Optimize 4bit Qlora training Extend existing `MatmulBnb4bit` to its usage in training scenarios. The PR includes the following changes: 1. Add special `torch.autograd.Function` export logic for `bitsandbytes.autograd._functions.MatMul4Bit` that takes precedence over the common PythonOp exporter. 2. Add a `training_mode` optional attribute for op `MatmulBnb4bit`, which helps skip some inference-specific logic in the implementation. 3. Add a `transB` optional attribute, which defaults to 1; setting it to 0 is needed for backward usage. Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly 2.9% throughput gains. The reason is: `bitsandbytes.autograd._functions.MatMul4Bit` calls `ctx.save_for_backward`, which would need an additional copy in PythonOp; otherwise, the tensor might be released by ORT while the backward op still references it. Removing the clones also reduces peak memory consumption because `bitsandbytes.autograd._functions.MatMul4Bit` saved tensors that are not needed in the backward compute.	2023-11-02 09:46:11 -07:00
Caroline Zhu	e3b043ba17	[js/web/training] runTrainStep implementation (#18006 ) ### Description * based on design document & following InferenceSession's run implementation, implemented TrainingSession.runTrainStep ### Motivation and Context * Adding web bindings for training #### Related work * #16521 allowed for training artifacts to be built * #17333 added interfaces for training * #17474 allowed for training package to be built + added training backend to web package * #17891 implementation for createTrainingSession on the TypeScript side [SHOULD BE MERGED IN BEFORE THIS PR] --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-11-02 08:32:50 -07:00
Yifan Li	f9d5705db4	[EP Perf] Fix sort (#18174 ) ### Description The existing sort can't correctly handle the index within test data folder/input name strings when the index is larger than 10. Update the logic to sort based on the index fetched by regex.	2023-11-02 08:05:08 -07:00
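The sorting problem described in the commit above is the usual lexicographic-versus-numeric ordering issue. A minimal sketch, assuming names that end in a numeric index (the `test_data_set_N` names and the helper `index_key` are illustrative, not taken from the PR):

```python
import re


def index_key(name):
    """Sort key: the last run of digits in the name, as an integer."""
    digits = re.findall(r"\d+", name)
    return int(digits[-1]) if digits else -1


names = [f"test_data_set_{i}" for i in (0, 1, 2, 10, 11)]

lexicographic = sorted(names)          # plain string sort puts "..._10" before "..._2"
by_index = sorted(names, key=index_key)  # numeric sort keeps 2 before 10

print(lexicographic)
print(by_index)
```

Comparing the extracted integer instead of the raw string is what keeps index 2 ahead of index 10.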
Frank Dong	dabd395fdf	llama 70b model fusion and sharding (#18175 ) ### Description Support llama-70b model fusion and sharding ### Motivation and Context This change enables sharding and exporting the llama-70b model into ONNX, as this model is too large for a single GPU. This change also fuses the llama-70b model, whose repeat_kv pattern differs from llama-7b and llama-13b.	2023-11-02 06:03:59 -07:00
aciddelgado	178f7caaeb	GQA Memory Efficient Kernel (#17920 ) Implement Cutlass Memory Efficient Attention Kernel into Group Query Attention Operator. ### Motivation and Context Before this change, Group Query Attention Operator was supported only by Flash-Attention. While this is the most efficient kernel for the operation, it only supports sm >= 80. Cutlass Memory Efficient Attention Kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.	2023-11-01 20:04:22 -07:00
satyajandhyala	a2e9ba72d5	[JS/Web]Added FusedConv. (#17766 ) ### Description Added FusedConv and FusedConvTranspose ### Motivation and Context Improve performance	2023-11-01 15:34:51 -07:00
Wei-Sheng Chin	9e8ad39847	Distributed Reduction (#18206 ) This PR implements distributed reduction for llama 2. This version doesn't consider any cases requiring re-sharding because we haven't seen any use cases. Intuitive examples: - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and device_mesh=[0,1] Algorithm: When the reduced axes are not sharded, each device can call reduction directly. The output sharding spec will be identical to input sharding spec. We currently throw when input and output sharding specs are different. Review guideline: - Check 97b8d2f for new op's schema and how new op is registered. - Read tests in 2450f93 to get familiar with the behavior of these ops. - Check the implementation details in 753d9af.	2023-11-01 08:49:33 -07:00
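The first supported example in the commit above can be checked with a small sketch in plain Python. It only simulates two devices with nested lists and uses a sum reduction as the stand-in for Reduce; the helper names are hypothetical and nothing here reflects the actual operator implementation.

```python
def reduce_sum_axis0(t):
    """Sum a [d0][d1][d2] nested list over axis 0, keeping the axis (result is [1][d1][d2])."""
    d1, d2 = len(t[0]), len(t[0][0])
    return [[[sum(t[i][j][k] for i in range(len(t))) for k in range(d2)]
             for j in range(d1)]]


def shard_last_axis(t, n_devices):
    """Split the last axis evenly across devices (spec RRS[0] on a 1-D device mesh)."""
    step = len(t[0][0]) // n_devices
    return [[[row[r * step:(r + 1) * step] for row in mat] for mat in t]
            for r in range(n_devices)]


def concat_last_axis(shards):
    """Reassemble the logical tensor from per-device shards of the last axis."""
    return [[sum((s[i][j] for s in shards), []) for j in range(len(shards[0][0]))]
            for i in range(len(shards[0]))]


# Logical [2, 4, 6] tensor with distinct entries.
tensor = [[[i * 24 + j * 6 + k for k in range(6)] for j in range(4)] for i in range(2)]

full = reduce_sum_axis0(tensor)                       # unsharded reference, shape [1, 4, 6]
local = [reduce_sum_axis0(s) for s in shard_last_axis(tensor, 2)]  # each device reduces its shard
distributed = concat_last_axis(local)                 # still sharded on the last axis

print(distributed == full)
```

Because axis 0 is not sharded, each device holds every element it needs, so no communication is required and the output keeps the input's sharding spec.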
Preetha Veeramalai	d87216bcb1	Openvino ep ort 23.1 (#17911 ) ### Description Integration to OpenVINO 2023.1 ### Motivation and Context - Alignment with latest OpenVINO Version. - Device name change from VPUX to NPU and Remove from supported list until official public support is available. --------- Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com> Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com>	2023-11-01 08:39:39 -07:00
weischan-quic	69f029797d	[QNN EP] Fix Batch Normalization Op Builder (#17981 ) ### Description There is a gap between onnx’s definition of batch normalization and QNN’s. According to the formula: onnx: `(X - input_mean) / sqrt(input_var + epsilon) * scale + B` QNN: `X * weight + bias` We can then deduce that: `weight = scale / sqrt(var + epsilon)` `bias = B – (mean * scale / sqrt(var + epsilon))` We must calculate the weight and bias, and their quantization parameters for QNN in QNN EP. Therefore, `scale`, `B`, `input_mean`, and `input_var` must be static (`initializer`). Implementation: Firstly, dequantize `scale`, `B`, `input_mean`, and `input_var` to floating point. Second, calculate `weight` and `bias`, and their quantization parameters. Finally, quantize `weight` and `bias`, and add them into `TensorWrapper` ### Motivation and Context Fix QnnHTPBackendTests.BatchNorm1D and QnnHTPBackendTests.BatchNorm2D failures	2023-10-31 23:04:42 -07:00
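The weight and bias derivation in the commit above can be verified numerically. This is a scalar, floating-point sketch only: the function names are hypothetical, and the quantization and dequantization steps that the PR performs around this computation are left out.

```python
import math


def onnx_batchnorm(x, scale, b, mean, var, eps):
    """ONNX form: (X - input_mean) / sqrt(input_var + epsilon) * scale + B."""
    return (x - mean) / math.sqrt(var + eps) * scale + b


def fold_to_weight_bias(scale, b, mean, var, eps):
    """Fold the four static parameters into the `X * weight + bias` form."""
    weight = scale / math.sqrt(var + eps)
    bias = b - mean * weight
    return weight, bias


scale, b, mean, var, eps = 1.5, 0.25, 0.4, 2.0, 1e-5
weight, bias = fold_to_weight_bias(scale, b, mean, var, eps)

for x in (-3.0, 0.0, 0.4, 7.5):
    expected = onnx_batchnorm(x, scale, b, mean, var, eps)
    folded = x * weight + bias
    print(x, math.isclose(expected, folded, rel_tol=1e-12, abs_tol=1e-12))
```

The folding is only possible ahead of time when `scale`, `B`, `input_mean`, and `input_var` are known constants, which is why the commit requires them to be initializers.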
aciddelgado	819b5a3eba	Split KV on MHA and Attention ops (#18007 ) ### Description Implement Split KV optimization for FlashAttention in MHA and Attention operators. ### Motivation and Context Can help further accelerate these ops.	2023-10-31 21:05:42 -07:00
Wanming Lin	c181159783	[WebNN EP] Restore to use deviceType enum (#18154 ) The Chromium implementation will support `MLDeviceType` enum to align with spec. CL: https://chromium-review.googlesource.com/c/chromium/src/+/4986939	2023-10-31 20:30:32 -07:00
kunal-vaishnavi	d1b85f5fb4	Reduce LLaMA memory usage (#18181 ) ### Description This PR reduces the memory usage when exporting and benchmarking LLaMA. ### Motivation and Context - Exporting: The PyTorch model is deleted from memory after a successful export instead of deleting it from memory after exporting + converting the ONNX model to the desired precision. - Benchmarking: In the ONNX model with GroupQueryAttention, the KV cache inputs use the same GPU memory for both the prompt and token generation benchmarks.	2023-10-31 17:53:52 -07:00
RandySheriffH	2b95e74fa1	Versioning for custom op (#18088 ) Allow custom ops to have versions. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-10-31 16:50:27 -07:00
Scott McKay	62c7894ffe	Add mobile CIs to list run by script for external PRs. (#18094 ) ### Description Add the mobile CIs to the list so we check external PRs don't break those. ### Motivation and Context Recent external PR was found to break iOS CI after checkin	2023-11-01 09:25:48 +10:00
Aditya Goel	ed41a2836c	Fix cast removal bug (#17953 ) The `RemoveDuplicateCastTransformer` fairly naively removed Cast nodes from the graph without considering precision loss when using the same `TypeGroup`. For instance, F64 -> F32 -> F64 would be optimised out of the graph. I also noticed that signedness was not accounted for, which is not covered by any existing issue but is a problem. For example doing int -> unsigned int -> int produces very different values for negative inputs and so should not be optimised out. One could argue that we shouldn't be performing such cast elimination at all (at least not in this transformer). The original scope might be well restricted to only eliminating unnecessary casts from the `InsertCastTransformer` and no others. ### Motivation and Context This should fix https://github.com/microsoft/onnxruntime/issues/17565, https://github.com/microsoft/onnxruntime/issues/9915 and https://github.com/microsoft/onnxruntime/issues/8787.	2023-10-31 15:48:32 -07:00
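The precision-loss case named in the commit above (F64 -> F32 -> F64) is easy to reproduce outside ONNX Runtime. The sketch below uses Python's `struct` module to round a double through single precision; the helper name is hypothetical, and the signedness case is not covered here.

```python
import struct


def cast_f64_to_f32_to_f64(x):
    """Round-trip a Python float (IEEE double) through IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


x = 0.1
round_tripped = cast_f64_to_f32_to_f64(x)

print(x == round_tripped)      # the two casts are not a no-op for 0.1
print(abs(x - round_tripped))  # size of the rounding error introduced by F32

# Values exactly representable in F32 survive, which is why the bug can hide in tests.
print(cast_f64_to_f32_to_f64(0.5) == 0.5)
```

Removing the two Cast nodes from a graph would therefore change its numerical output whenever the input is not exactly representable in the narrower type.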
liqun Fu	20f2dd8b6b	use onnx rel-1.15.0, update cgman, cmake/external and requirement hash (#18177 )	2023-10-31 14:58:21 -07:00
Tianlei Wu	95f053c652	[CUDA] Update GroupNorm and Add SkipGroupNorm (#18091 ) * Add a new operator SkipGroupNorm to support skip and bias inputs. * Update GroupNorm kernel to support number of channels used in SD XL Refiner. * Add epsilon in kernel * Add parity and performance test script * Remove many limitations including max batch size, max number of groups, c % cPerBlock ==0 etc. ### Motivation and Context Update GroupNorm to support SD XL Refiner and beyond.	2023-10-31 10:27:20 -07:00
Jian Chen	29e40987e3	Update batch file to set PATH for Cuda with TRT (#18182 ) ### Description Update batch file to set PATH for Cuda with TRT	2023-10-31 10:22:40 -07:00
Vincent Wang	1c25fe5580	Fix PoliCheck (#18180 ) Fix PoliCheck by changing some words, which was from Triton flash attention's original code.	2023-10-31 13:53:11 +08:00
cloudhan	08dce54266	Improve tunable verbose log (#17328 )	2023-10-31 13:10:21 +08:00
Jian Chen	8a574b874c	Update setup_env_cuda.bat (#18176 )	2023-10-30 21:28:02 -07:00
Patrice Vignola	8ed9bd6eca	Add one more MHA mask pattern (#18164 ) Add an MHA mask pattern for the scenario where the mask has already been broadcasted via an Expand node.	2023-10-30 21:21:51 -07:00
PeixuanZuo	efef6407bc	[ROCm] update rocm package exclude libs (#18130 ) update rocm package exclude libs. - change librocblas.so.0 to librocblas.so.3 which is used on ROCm5.6 and ROCm5.7 - add librocfft.so.0, libhipfft.so.0, libhiprtc.so.5 and sort the list.	2023-10-31 08:41:01 +08:00
Jiajia Qin	785e2b1eae	[js/webgpu] Optimize softmax by vector (#18153 ) ### Description This PR enables `softmax` outputs max supported components instead of scalar for each thread. Softmax with input[0]: [12,4096,4096] becomes 47.86 ms from 55.11 ms	2023-10-30 16:05:35 -07:00
Yufeng Li	90d1f537cb	optimize SLN with large dimension (#18138 ) ### Description Optimize SkipLayerNorm for large dimension (>=2048) by handling 8 elements in one thread. It avoids re-writing and re-loading the sum of input, skip and bias to main memory. It reduces the latency of dimension 4096 with small batch size from ~18us to ~3.8us on A100.	2023-10-30 14:12:17 -07:00
Patrice Vignola	348a963238	[DML EP] Handle non-raw data in dynamic graph compilation (#18160 )	2023-10-30 13:48:34 -07:00
Chen Fu	4819fbf31c	Augment blockwise quantization (#18101 ) ### Description Augment block wise 4b quantization -- plain CPU impl ### Motivation and Context Allow column wise or row wise blocks. Experiments show row wise quantization in LLM weight matrices achieves better precision. Added tests for quantization and dequantization code.	2023-10-30 09:14:37 -07:00
Hector Li	be2f72a315	[QNN EP] Disable early termination in GetCapability (#18140 ) [QNN EP] Disable early termination in GetCapability if there are multiple partitions and context binary is enabled ### Description The QNN EP context binary cache feature only supports a single partition for now. We have early termination in GetCapability. After PR https://github.com/microsoft/onnxruntime/pull/17764, there is no Level 1 optimization any more for the 1st GetCapability call, so the graph transformer EnsureUniqueDQForNodeUnit is not applied. If an initializer -> DQ is shared by multiple node units, the node is not identified as a node unit group. QNN EP reports many unsupported nodes because of this in the 1st GetCapability call. The 2nd GetCapability call still works normally. Disable the early termination in GetCapability and delay the decision to Compile.	2023-10-30 08:34:49 -07:00
Yulong Wang	9bba990871	[js/web] fix a few package consuming problems (#18109 ) ### Description This PR tries to fix a part of the NPM package consuming problems for onnxruntime-web (ES module) as described in #10913: - reduce the package size to fit the 150MB restriction in jsdelivr, by removing dev build targets for uncommon exports - add default export to support `import ort from 'onnxruntime-web';` (currently only supports `import * as ort from 'onnxruntime-web';`)	2023-10-30 08:11:43 -07:00
Yi Zhang	436056dcd7	Revert "Disable dml stage in windows GPU pipeline temporarily. (#18034 )" (#18150 ) This reverts commit `99b8dcaae2`. ### Motivation and Context Restore the dml stage in windows GPU pipeline. Agent issue is solved by adding Feature.DisableGpuDriver in pool properties.	2023-10-30 15:41:07 +08:00
Hariharan Seshadri	8ebdd3bbca	Fix regression in perf test runner (#18139 )	2023-10-29 19:26:12 -07:00
snadampal	0e34100484	create memory descriptors based on the tensor dimensions (#15848 ) ### Description Arm Compute Library (ACL) backend requires explicit memory format tag initialization to decide whether the tensor can be computed with the ACL kernels. Hence, the src, weights and dst memory descriptor format is set based on the tensor dimensions instead of using the format::any tag. ### Motivation and Context The change enables ACL kernels for DNNL matmul ops on aarch64 platform.	2023-10-29 09:43:12 -07:00
Wei-Sheng Chin	24f9c1afe3	Distributed Expand (#18126 ) This PR implements DistributedExpand for llama 2. Representative Examples of DistributedExpand: - [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[8, 2]) -> output tensor (shape=[8, 2], spec=S[0]R, device_mesh=[0,1])` - [sharding expanded axis is invalid since it must have dim=1 and axis with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[2, 8]) -> output tensor (shape=[2, 8], spec=S[0]R, device_mesh=[0,1])` From those examples, we observe a few important behaviors. - The output sharding spec is always the same as the input sharding spec. - Expanding always happens on an axis with dimension=1. Otherwise, it will violate the broadcasting rule. - No communication is needed since all computation can happen locally. Let's consider the first example again. If you put the first half tensor (shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on device 1, then `Expand` it with target shape [4, 2], these two local tensors (shape: [4, 2]) are exactly the same as the one described by output sharding spec. Algorithm: - Compute logical (i.e., unsharded) shapes of input and output. - Compute sharded output shape from logical output. - Call Expand to broadcast local input to sharded output shape. How to review? - Start with [changes in onnxruntime_test_distributed.py](`ea33392f37`). Those tests are good examples for using this op. - [Read expand.h/expand.cc](`e4c49987f5`). Those changes are for exposing functionalities in Expand to DistributedExpand. - Read distributed_expand.h/distributed_expand.cc. It follows the algorithm described above. The commit `68ac301bba` first sketches the definition of DistributedExpand. The next commit `0eb9330c3b` adds real implementation.	2023-10-28 00:44:02 -07:00
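The first example in the commit above (shape [8, 1], spec S[0]R, two devices) can be simulated with nested lists. This is a sketch of the idea only; `expand_cols` is a hypothetical stand-in for Expand restricted to broadcasting a size-1 last axis, and the two "devices" are just list slices.

```python
def expand_cols(t, cols):
    """Broadcast an [n, 1] nested list to [n, cols]."""
    return [[row[0]] * cols for row in t]


# Logical (unsharded) input of shape [8, 1].
full_input = [[i] for i in range(8)]

# spec=S[0]R on device_mesh=[0, 1]: axis 0 is split in half, axis 1 is replicated.
shards = [full_input[:4], full_input[4:]]

# Logical target shape is [8, 2], so the sharded target shape per device is [4, 2].
local_outputs = [expand_cols(shard, 2) for shard in shards]

# Stacking the local results along axis 0 gives the logical output.
gathered = local_outputs[0] + local_outputs[1]

print(gathered == expand_cols(full_input, 2))
```

Each device expands only its own rows, so the local results already match the output sharding spec without any exchange between devices.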
RandySheriffH	8daabf3f15	Tune min version supporting custom op ComputeV2 (#18134 ) Set min version supporting custom_op::ComputeV2 to 16, since the feature has been released since ort 1.16. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-10-27 16:09:07 -07:00
zesongw	d9695dea6d	[WebNN EP] Remove Conv initializer constraint for GPU (#18129 ) ### Description WebNN can now handle Conv with filter as input. ### Motivation and Context Support more models with WebNN.	2023-10-27 13:57:01 -07:00
sophies927	28ad3ff799	Fix stale bot issue (#18064 ) ### Description Previously used GitHub stale app is now deprecated, so I deleted that file and added a new GitHub Actions workflow to automatically apply the stale label to inactive issues.	2023-10-27 10:57:28 -07:00

9899 commits