onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-11 17:48:34 +00:00

Author	SHA1	Message	Date
Tianlei Wu	72186bbb71	[CUDA] Build nhwc ops by default (#22648 ) ### Description * Build cuda nhwc ops by default. * Deprecate `--enable_cuda_nhwc_ops` in build.py and add `--disable_cuda_nhwc_ops` option Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops will be disabled automatically. ### Motivation and Context In general, NHWC is faster than NCHW for convolution in Nvidia GPUs with Tensor Cores, and this could improve performance for vision models. This is the first step to prefer NHWC for CUDA in 1.21 release. Next step is to do some tests on popular vision models. If it help in most models and devices, set `prefer_nhwc=1` as default cuda provider option.	2024-11-06 09:54:55 -08:00
Tianlei Wu	ba22d7879a	[CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above (#22713 ) ### Description Based on https://github.com/microsoft/onnxruntime/pull/9700, and extend it to ArgMin as well. This pull request introduces several enhancements and fixes related to the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The changes ensure proper handling of these operators across different versions and improve kernel registration and fallback mechanisms. Key changes include: #### Enhancements to `ArgMax` and `ArgMin` Operators: * Added new kernel class registrations for `ArgMax` and `ArgMin` for different data types and versions in `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`. [[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972) [[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215) [[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659) [[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827) [[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939) [[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180) * Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle fallback to CPU when the `select_last_index` attribute is set to 1, as CUDA does not support this attribute. [[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622) [[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674) #### Macro and Kernel Registration Improvements: * Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with `REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and `REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version handling. [[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29) [[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46) * Updated kernel registration for `ArgMax` and `ArgMin` to use the new macros, ensuring proper version handling and support for different data types. #### Safety Checks: * Added safety checks in the `ArgMax` and `ArgMin` classes to ensure `select_last_index` is not set to 1, as it is not supported on CUDA. [[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99) [[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117) #### Testing Enhancements: * Added new tests for `ArgMax` and `ArgMin` operators to verify behavior when `select_last_index` is set to 0, ensuring compatibility with both CPU and CUDA execution providers. [[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360) [[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699) ### Motivation and Context Improve CUDA kernel coverage for stable diffusion model and hence improve its performance on CUDA	2024-11-06 09:54:32 -08:00
Tianlei Wu	d993ec313f	[CUDA] Fix NumericLimits (#22738 ) ### Description * Fix `NumericLimits<float>` that used infinity as max, which is not consistent with `std::numeric_limits<float>::max()` In Windows, (float)(1e+300) is used for INFINITY, which causes compiler error in Visual Studio 2022 v17.12 Preview 5. * Rename `NumericLimits<T>::Min` to Lowest to be consistent with std::numeric_limits * Fix topk implementation: use `NumericLimits<CudaT>` instead of `NumericLimits<T>` in kernel. That could avoid defining a confusing defintion of `NumericLimits<MLFloat16>` that returns half instead of MLFloat16. * Use CUDART_MAX_NORMAL_FP16 if possible. It sets bits value directly, which is faster than converting float to half. Note that NumericLimits does not support __nv_bfloat16 and _nv_fp8_e4m3 and __nv_fp8_e5m2 right now. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/22728	2024-11-06 09:53:49 -08:00
Enrico Galli	1cb5ceedf3	[WebNN EP] Fix issues with MLTensor caching (#22701 ) This PR fixes a bug that occurs when searching for compatible `MLTensor` in the cache. We were missing checking the number of dimensions in the shape. This would mean that a cached buffer of shape `[1]` could match for `[1, 1, 256, 256]`. This PR also adds better handling when attempting to force an `MLTensor` to a different shape.	2024-11-06 09:17:11 -08:00
Yang Gu	811231e418	[js/webgpu] Destroy staging buffers aggressively during weights uploading (#22726 ) In current implementation, all the staging buffers for weights uploading are destroyed after first batch of kernel execution. It requires a lot of memory as all the staging buffers couldn't be reused. It also hurts the startup time (weights uploading only happens in session creation), as weights uploading is delayed to a very late time. This PR uses a very aggressive way to submit queue and destroy staging buffers, so that the related GPU memory could be reused as much as possible, though the real situation depends on the WebGPU and driver implementation. The aggressive queue submission also moves GPU operations to a very early time, which helps the startup time. Some buffer uploading benchmarks are composed to compare multiple solutions, regarding to the memory and time consumption. Benchmarks can be found at https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html, while detailed test data can be found at https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit. I also tested phi3.5 on 2 machines, first inference time improved from 5141ms to 3579ms and from 4327ms to 2947ms separately.	2024-11-06 08:55:15 -08:00
Edward Chen	742a0d30be	[C# MauiModelTester] Fix icon name in Info.plist (#21666 ) Fix icon name in Info.plist. It now matches the icon at `csharp/tools/MauiModelTester/Resources/AppIcon/onnxruntime_icon.png`.	2024-11-05 16:55:38 -08:00
Jian Chen	deee48002c	Enable CUDA Python Test (#22717 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-05 16:26:50 -08:00
Yulong Wang	0371e92419	[webgpu] change default validation mode (#22730 ) ### Description Change default validation mode in Release build from "wgpuOnly" to "basic"	2024-11-05 16:18:05 -08:00
Hector Li	017246260f	support Qnn 2 28 (#22724 ) ### Description support Qnn 2.28 update default qnn vesion to 2.28 in build pipeline	2024-11-05 15:41:15 -08:00
dtang317	aa097a5992	Fix GRU tests (#22716 ) ### Description Many GRU tests were being skipped due to an error in MLOperatorAuthorImpl.cpp. The issue was caused by activation function names not being capitalized (e.g., ‘sigmoid’), while The AttrValue was using mixed cases (e.g., ‘Sigmoid’, ‘LeakyRelu’), which resulted in an ‘unsupported activation function’ error in DMLOperatorRecurrentNeuralNetwork.cpp. This PR fixes the issue by making the DML EP activation function name case-insensitive, and capitalizing the activation function names in the tests. ref PR: https://github.com/microsoft/onnxruntime/pull/15914 ref bug: https://dev.azure.com/microsoft/OS/_workitems/edit/44571772 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: nums11 <numsmt2@gmail.com>	2024-11-05 14:38:28 -08:00
Jiajia Qin	d5b2730ff8	[js/webgpu] Increase workgroupSize if only one workgroup is dispached (#22709 ) #22031 For reduce related ops, we should increase workgroupSize to improve parallelism if only one workgroup is dispatched. The total ReduceMean time becomes 8.98 ms from 77.79 ms on my iGPUs.	2024-11-05 13:13:52 -08:00
Sevag H	00461d1205	Replace gsl::narrow with narrow in WebNN code (#22733 ) Replace use of `gsl::narrow` with `narrow` to build for WebNN @snnn ### Description Building for WebNN with exceptions disabled cannot use `gsl::narrow`. Replace with `narrow` ### Motivation and Context Address issue #22712	2024-11-05 10:22:34 -08:00
ivberg	db72096d17	Revert to err logging instead of LOGS_DEFAULT macro (#22720 ) Revert to err logging instead of LOGS_DEFAULT macro due to issue seen during testing. "onnxruntime::logging::LoggingManager::DefaultLogger Attempt to use DefaultLogger but none has been registered." ### Description Revert part of PR suggestion to prevent crash for scenario seen in #22699. Previously we had tested w/o this macro ### Motivation and Context Previous PR #22699 it was suggested to use LOGS_DEFAULT() but that does not work during early init. Safer to use std::cerr instead like the original PR had it.	2024-11-05 09:51:26 -08:00
Jian Chen	3711a655bc	Update DNNL CI python to 310 (#22691 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-05 09:14:48 -08:00
Yi Zhang	33a2059ced	Remove webgpu ep in mobile packaging stages (#22725 ) ### Description The nuget-zip-java packaging pipeline has been failed for 4 days since it's introduced in #22591	2024-11-05 09:14:26 -08:00
Changming Sun	66980e4646	Refactor the cmake code that is related to delay loading (#22646 ) ### Description Refactor the cmake code that is related to delay loading. Provide a cmake option to control if delay loading should be enabled or not. Disabling the option when python is enabled, due to a known issue. ### Motivation and Context ONNX Runtime's python package depends on DirectML.dll, but supposedly the DLL should be delay loaded. This PR only refactor the code. It doesn't change the behavior.	2024-11-04 16:30:50 -08:00
Jiajia Qin	64d8e25b4c	[js/webgpu] Optimize Gemm (#22706 ) BUG #22031 The total Gemm time in demucs model becomes 181.14 ms from over 1000 ms on my iGPUs. ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-04 15:05:21 -08:00
Yulong Wang	bd5dbf86fe	support WebGPU EP in Node.js binding (#22660 ) ### Description This change enhances the Node.js binding with the following features: - support WebGPU EP - lazy initialization of `OrtEnv` - being able to initialize ORT with default log level setting from `ort.env.logLevel`. - session options: - `enableProfiling` and `profileFilePrefix`: support profiling. - `externalData`: explicit external data (optional in Node.js binding) - `optimizedModelFilePath`: allow dumping optimized model for diagnosis purpose - `preferredOutputLocation`: support IO binding. ====================================================== `Tensor.download()` is not implemented in this PR. Build pipeline update is not included in this PR.	2024-11-04 21:09:07 +00:00
Wanming Lin	6c21ab7337	[WebNN] Support SimplifiedLayerNormalization op (#22674 ) WebNN doesn't provide dedicate op for SimplifiedLayerNormalization, use a couple of WebNN ops to emulate it in WebNN EP. X --> Pow --> ReduceMean --> Add --> Sqrt --> Div -> Mul	2024-11-04 12:25:11 -08:00
Wanming Lin	57668339e4	[WebNN EP] Align QDQ ops with latest Chromium implementation (#22180 ) - Pass inputs to WebNN directly, WebNN will handle the broadcasting - If `zero_point` is not provided, make a WebNN Constant with 0 values and same shape as `scale` input	2024-11-04 12:21:49 -08:00
Kyle	74adfc2099	Nuget Windows AI Pipeline, Disable SDL Submodules. (#22711 ) ### Description <!-- Describe your changes. --> Set SDL's git submodule to false. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> * Previous job's SDL logs:It has 'git submodule sync' command, which means 'git submodule sync synchronizes all submodules while git submodule sync' * After set sdl git submodules to false, the logs don't have 'git submodule sync' command.	2024-11-04 08:39:28 -08:00
Wanming Lin	c64459fdd7	[WebNN] Don't skip scalar tensor registration (#22688 ) ORT will optimize same scalar initializers into one, we should not skip such scalar registration as a WebNN Constant.	2024-11-03 21:33:34 -08:00
Bin Miao	777fe7922c	[WebNN EP] Support Sign and CumSum operators (#22616 ) This PR supports Sign and CumSum operators for WebNN EP. @Honry @fdwr PTAL, thanks.	2024-11-03 20:08:16 -08:00
Christian Larson	ac6fe48375	Adjust max chunk size to fix error limit check from DX12 for large resources that are CPU accessible. (#22680 ) Adjust max chunk size to fix error limit check from DX12 for large resources that are CPU accessible. ### Description Current agility SDK restricts CPU visible buffers to 0xFFFF0000 bytes or slightly smaller than 4GiB. Verified restriction is still in latest Agility SDK 1.614.1. ### Motivation and Context Allocation of Resources 4GiB or larger fail in DX12 verification layer. --------- Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2024-11-03 18:08:12 -08:00
Tianlei Wu	120cb5a804	[Doc] Add I/O binding example using onnx data type in python API summary (#22695 ) ### Description Add I/O binding example using onnx data type in python API summary. The API is available since 1.20 release. ### Motivation and Context Follow up of https://github.com/microsoft/onnxruntime/pull/22306 to add some documentation.	2024-11-02 12:51:37 -07:00
mindest	4ffc1ff3b4	DMMHA: add unit tests; fix CPU, CUDA kernel (#22567 ) ### Description Fixes: (1) cpu kernel: applying scale before bias and mask like other MHA ops (2) cpu kernel: correct offset during appending past to present. (3) cuda kernel: apply mask if provided; fix output_qk offset. Add DMMHA unit tests	2024-11-02 21:05:56 +08:00
ivberg	2e4e221da8	Fix crash when running ORT under low integrity process like Edge where ETW registration can fail (#22699 ) ### Description Make ETW provider registration non-fatal and not throw an exception Needs to work under build with exceptions enabled & --disable_exceptions ### Motivation and Context ORT should not crash Addresses #22475. Private tested by filer of that issue	2024-11-01 22:58:47 -07:00
Xavier Dupré	d419df4076	Fix too strict assert in onnx_quantizer.py (#22283 ) ### Description Partial answer to issue #19997. The example succeeds after this change.	2024-11-01 19:08:35 -07:00
Justin Chu	6d5e970642	[CI] Set up proper permissions for linting workflow (#22696 ) Allow writing security events to post lint messages on PRs.	2024-11-01 18:14:52 -07:00
Wanming Lin	62c3476822	[WebNN] Remove some useless verbose logs (#22690 ) These logs are not quite useful and create a lot of noise during debugging, especially when working with large models.	2024-11-01 14:33:25 -07:00
Justin Chu	f7cabf6d4c	Use suggest-changes@v2 (#22667 ) Use suggest-changes@v2 (https://github.com/parkerbxyz/suggest-changes/issues/36#issuecomment-2447605058) to post suggested changes as comments instead of requested changes to streamline the review process. - Also updated the script to `set +e` to ignore exit code only for the linter run. So that if there is errors in dependency installation we can still get signals.	2024-11-01 13:13:04 -07:00
Scott McKay	ba0bb43b00	Rework the native library usage so that a pre-built ORT native package can be easily used (#22345 ) ### Description The local build of the native library was being included by almost every project, but is only needed to run tests. Due to the multiple inclusions attempting to use a pre-built package was clashing with any local builds that were available. Create a helper file to include either a local built of a pre-built package and include that in the two test projects. Cleanup various miscellaous things. ### Motivation and Context Create setup to simplify running on-device tests with the nuget packages.	2024-11-01 11:03:33 -07:00
Jiajia Qin	8fbbf2fd4f	[js/webgpu] Optimize MatMul with M = 1 (#22577 ) ### Description <!-- Describe your changes. --> BUG #22031 In the demucs model, there are lots of MatMul ops with shapes like below: `input[0]: [3448,1,512] \| float32, input[1]: [512,1536] \| float32, output[0]: [3448,1,1536] \| float32` We can see that for this kind of shape, the batch size is a big value, but M = 1. Our current algorithm is based on [M, N] to partition tiles, which is not efficient for such kind of shapes. This PR reshapes the inputs to improve the matmul performance. Before: [3448,1,512] x [512,1536] = [3448,1,1536] After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536] , then the output can be reshaped to [3448, 1, 1536] The overall MatMul time in demucs model becomes 1778.45 ms from 4418.17 ms on my iGPUs. --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2024-11-01 08:04:42 -07:00
wejoncy	9daf7664fc	[CoreML] ML Program more ops (2/N) (#22480 ) - cast - argmax - gelu - cast - LayerNorm - GroupNorm - InstanceNorm ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-11-01 08:37:56 +08:00
shiyi	c7ecc081ca	Revert "[WebNN] Fallback the node when its output doesn't have shape info" (#22669 ) Reverts microsoft/onnxruntime#22556 since it causes incorrect fallback.	2024-10-31 16:07:56 -07:00
Changming Sun	f9bc24e1a7	Add concurrency setting to codeql workflow (#22678 ) ### Description 1. Add concurrency setting to codeql workflow 2. Modify lint workflow's PATH setting. ### Motivation and Context To save machine resource.	2024-10-31 16:01:07 -07:00
Yi Zhang	8e8b62b8b5	Build CUDA and DML together (#22602 ) ### Description Now, we need to build cuda and dml in one package. But CUDA EP and DML EP can't run in one process. It will throw the exception of `the GPU device instance has been suspended` So the issue is CUDA EP and DML EP coexist in compile time but can't exist in run time. This PR is to split cuda ep test and dml ep test in all unit tests. The solution is to use 2 environment variable, NO_CUDA_TEST and NO_DML_TEST, in CI. For example, if NO_CUDA_TEST is set, the DefaultCudaExecutionProvider will be nullptr, and the test will not run with CUDA EP. In debugging, the CUDAExecutionProvider will not be called. I think, as long as cuda functions, like cudaSetDevice, are not called, DML EP tests can pass. Disabled java test of testDIrectML because it doesn't work now even without CUDA EP.	2024-10-31 15:51:13 -07:00
Wanming Lin	7b9db658c3	Fixed a minor bug in layout transformation for Resize (#21954 ) Since opset 18, 'scales' and 'sizes' constant inputs can be 2D tensors, transpose for 2D tensors are not supported at current implementation, fix it by only allowing 4D constant inputs.	2024-10-31 14:20:10 -07:00
dtang317	55e0128b13	[DML EP] Cast to bool correctly, adding explicit clip after cast (#22665 ) ### Description The CastNonStringTester test in CastOpTest was failing due to bitwise mismatches when casting other types to bool. This was caused by bool being represented as uint8 in DML. Added a clipping step in DmlOperatorCast to ensure correct bitwise matching after casting to bool ref: https://dev.azure.com/microsoft/OS/_workitems/edit/44572678 ### Motivation and Context	2024-10-31 13:27:37 -07:00
Tianlei Wu	1b60209938	[CUDA/ROCm/Migraphx] consolidate gpu data transfer (#22609 ) ### Description Consolidate the gpu data transfer in CUDA, ROCm and Migraphx EP. (1) Remove some redundant stream synchronize on default stream according to spec of cudaMemcpy (2) consolidate CUDA, ROCm and MigrphaX to try use same logic. ### Motivation This is a follow up on reviewing https://github.com/microsoft/onnxruntime/pull/22589. ### Context https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior ##### cudaMemcpy() * For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed. * For transfers from pinned host memory to device memory, the function is synchronous with respect to the host. * For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed. * For transfers from device memory to device memory, no host-side synchronization is performed. * For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host. #### cudaMemcpyAsync * For transfers between device memory and pageable host memory, the function might be synchronous with respect to host. * For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host. * If pageable memory must first be staged to pinned memory, the driver may synchronize with the stream and stage the copy into pinned memory. * For all other transfers, the function should be fully asynchronous. https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html ##### hipMemcpyAsync() If host or dest are not pinned, the memory copy will be performed synchronously. For best performance, use hipHostMalloc to allocate host memory that is transferred asynchronously. on HCC hipMemcpyAsync does not support overlapped H2D and D2H copies. For hipMemcpy, the copy is always performed by the device associated with the specified stream. ##### hipMemcpy() For hipMemcpy, the copy is always performed by the current device (set by hipSetDevice). https://github.com/ROCm/ROCm/blob/roc-5.7.x/tools/autotag/templates/rocm_changes/5.6.1.md ROCm 5.6.1 release note: hipMemcpy device-to-device (intra device) is now asynchronous with respect to the host	2024-10-31 09:52:50 -07:00
Pranav Sharma	69fe58b0ef	Fix formatting of DML EP files that was disturbed in an earlier PR. (#22672 )	2024-10-31 07:37:28 -07:00
sstamenk	a2070bf091	Fix input shape related compile logs for MIGraphX EP to be semantically correct (#22624 ) As the title suggests, recompilation is done if a mismatch is detected. Changed the logs to reflect that behavior.	2024-10-30 19:46:35 -07:00
Changming Sun	60bfa7fab1	Update publish-python-apidocs.yml (#22655 ) To fix a permission error	2024-10-30 19:25:29 -07:00
Wanming Lin	eb66bfa7b4	[WebNN] Convert MLOperand methods into readonly attributes (#22653 ) Adapt to spec change at https://github.com/webmachinelearning/webnn/pull/774	2024-10-30 17:54:49 -07:00
Wanming Lin	fc375a6f58	[WebNN] Support And, Or and Xor ops (#22598 ) Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>	2024-10-30 17:52:10 -07:00
Yi Zhang	5200e098aa	Not using predefined marco to check EP (#22654 ) ### Description We'll build CUDA EP and DML EP in one package. As a result, USE_DML and USE_CUDA will coexist. We can't use predefined macros to check EP any more ### Motivation and Context Other changes are in test code, so I make this change of core runtime into one PR.	2024-10-31 08:46:13 +08:00
Pranav Sharma	03ea5dc495	Distinguish between DML and the generic 'GPU' term. This is needed for packaging DML EP in the same ORT GPU pkg. (#22657 ) ### Description Distinguish between DML and the generic 'GPU' term. This is needed for packaging DML EP in the same ORT GPU pkg. ### Motivation and Context Customer requirement.	2024-10-30 11:58:34 -07:00
Enrico Galli	df236c7894	[WebNN EP] Add cache for `MLContext`s in the `WebNNBackend` (#22510 ) ### Description This change adds a cache of `MLContext`s keyed by their options to the `WebNNBackend`. This makes is so that multiple `InferenceSession`s create with the same options will share the same context. ### Motivation and Context Since `MLTensor`s are tied `MLContext`s, developer can't easily share tensors between `InferenceSession` (outside of manually an `MLContext` and specifying the `context` options). This leads strange behaviors such as, ```js const sessionsA = ort.InferenceSession.create(urlA, { executionProviders: ["webnn"], preferredOutputLocation: "ml-buffer", }); const sessionsB = ort.InferenceSession.create(urlB, { executionProviders: ["webnn"], }); const temp = await sessionA.run({/* arguments */}); const result = await sessionB.run({"input":temp["output"]}); // ERROR: Failed to execute 'dispatch' on 'MLContext': Invalid inputs: The context of MLGraph doesn't match the context of the MLTensor with name "input". ``` We encountered this behavior when updating the transformers.js version in the developer preview demos. microsoft/webnn-developer-preview#46	2024-10-30 10:26:33 -07:00
shiyi	46ff240821	[WebNN] Add ScatterElements and GatherElements (#22534 )	2024-10-30 10:20:21 -07:00
shiyi	86b3b89a6b	[WebNN EP] Check if the tensor shape has 0 dimension (#22573 ) WebNN doesn't support empty tensor.	2024-10-30 08:18:21 -07:00

1 2 3 4 5 ...

11947 commits