onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-14 20:48:00 +00:00

Author	SHA1	Message	Date
Yulong Wang	080c67e900	[WebGPU] allow build WebGPU EP for WebAssembly (#23364 ) ### Description This PR allows WebGPU EP to be built with Emscripten for WebAssembly, Including: - cmake build files update to support correct setup for Emscripten. - code changes to fix build breaks for wasm - change in Web CI pipeline to add a build-only target for wasm with `--use_webgpu`.	2025-01-16 10:52:17 -08:00
Ted Themistokleous	7cd08a6004	[MigraphX EP] [ROCm EP] Upstream ROCm changes for bugfixes and features (#23249 ) Add support to mainline Onnxruntime of changes from the ROCm Team's changes ### Motivation and Context Various bugfixes, and changes added between ROCm 6.2 and 6.3 that haven't been upstreamed yet to mainline --------- Co-authored-by: Yueqing Zhang <yuz75@Pitt.edu> Co-authored-by: Yueqing Zhang <yueqingz@amd.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Artur Wojcik <artur.wojcik@outlook.com> Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com> Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com> Co-authored-by: ikalinic <ilija.kalinic@amd.com> Co-authored-by: sstamenk <sstamenk@amd.com>	2025-01-15 12:57:04 -08:00
Changming Sun	6a7ea5c896	Update xnnpack, cpuinfo and pthreadpool (#23362 ) ### Description Update xnnpack to remove the dependency on psimd and fp16 libraries. However, coremltool still depends on them, which will be addressed later. Also, update CPUINFO because the latest xnnpack requires CPUINFO's avx10 support. ### Motivation and Context The fewer dependencies the better.	2025-01-15 09:42:15 -08:00
Yulong Wang	444fcebaa4	Pre-requisites of upgrading EMSDK (#23347) ### Description This PR contains part of the changes in #23318. The reason for creating this PR: the work to support building the WebGPU EP in WASM depends on #23318, which cannot be merged because it is blocked upstream (https://github.com/llvm/llvm-project/issues/122166). This PR contains the changes that can be safely merged separately, which unblocks development of WebGPU EP builds for WASM. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-01-14 11:07:21 -08:00
Changming Sun	4e4fd2bdcf	Update ORT extension to the latest (#23314 ) Update ORT extension to the latest, to include some build system fixes.	2025-01-13 18:59:42 -08:00
Yulong Wang	a74817ab10	add missing build dependency for onnxruntime_providers_webgpu (#23324) ### Description Fixes the build when the flag `--target onnxruntime_providers_webgpu` is specified. Otherwise the following error occurs: ``` range.cc D:\code\onnxruntime\build\Windows\Debug\_deps\onnx-src\onnx\onnx_pb.h(65,10): error C1083: Cannot open include file: 'onnx/onnx-ml.pb.h': No such file or directory [D:\code\onnxruntime\build\Windows\Debug\onnxruntime_providers_webgpu.vcxproj] (compiling source file '../../../onnxruntime/core/providers/webgpu/math/binary_elementwise_ops.cc') ```	2025-01-10 18:07:12 -08:00
Changming Sun	b461f06a15	Remove a hack in adjust_global_compile_flags.cmake (#23313 ) ### Description Remove a hack in adjust_global_compile_flags.cmake because the issue should have been resolved.	2025-01-10 18:05:43 -08:00
Changming Sun	1ce59577d5	Add VCPKG triplet files (#23298 ) Add VCPKG triplet files. All the triplet files are automatically generated by gen.py. Put the files there to ease use.	2025-01-09 16:18:51 -08:00
Changming Sun	0ec2171b9f	Update Linux docker images (#23244 ) The new images contain the following updates: 1. Added Git, Ninja and VCPKG to all docker images 2. Updated CPU containers' GCC version from 12 to 14 3. Pinned CUDA 12 images' CUDNN version to 9.5(The latest one is 9.6) 4. Addressed container supply chain warnings by building CUDA 12 images from scratch(avoid using Nvidia's prebuilt images) 5. Updated manylinux commit id to 75aeda9d18eafb323b00620537c8b4097d4bef48 Also, this PR updated some source code to make the CPU EP's source code compatible with GCC 14.	2025-01-09 10:20:33 -08:00
PARK DongHa	5b9c968eaa	Correct ONNX and Protobuf version in vcpkg build (#23285 ) ### Description Changes vcpkg manifest and configuration file (vcpkg.json & vcpkg-configuration.json) * Update vcpkg version to https://github.com/microsoft/vcpkg/releases/tag/2024.12.16 * Use protobuf 3.21.12(= `v21.12`) to sync with [cmake/deps.txt](https://github.com/microsoft/onnxruntime/blob/main/cmake/deps.txt) * Resolve https://github.com/microsoft/onnxruntime/issues/22750 * Add `onnx` to vcpkg manifest so `find_package(ONNX)` and `find_dependency(Protobuf)` can work as expected. * Currently, It uses 1.16.2 * v1.17.0 will become available after https://github.com/microsoft/vcpkg/pull/42942 However, `onnx` in vcpkg doesn't configure `ONNX_DISABLE_STATIC_REGISTRATION` build option. * https://github.com/microsoft/vcpkg/pull/38879 * Create "cmake/vcpkg-triplets/" folder and triplet files which use `VCPKG_CMAKE_CONFIGURE_OPTIONS` for the option * This requires `VCPKG_OVERLAY_TRIPLETS` environment variable for CI steps, which is a bit inconvenient. I will try to find simple way to get same result ### Motivation and Context * Help #23158 * "ONNX is not consumed from vcpkg" * "Mismatch protobuf version. When vcpkg is enabled , we should not fetch protoc from Github which may cause version mismatches." * https://github.com/microsoft/vcpkg/pull/43126 * #21348	2025-01-08 12:25:17 -08:00
Changming Sun	69bb53db85	Enable delay loading hooker for python packages (#23227 ) ### Description Enable delay loading hooker for python packages	2024-12-31 10:12:31 -08:00
liqun Fu	a9a881cc98	Integrate onnx 1.17.0 (#21897 ) ### Description <!-- Describe your changes. --> for ORT 1.21.0 release Create following related issues to track skipped tests due to updated ONNX operators in the ONNX 1.17.0 release: https://github.com/microsoft/onnxruntime/issues/23162 https://github.com/microsoft/onnxruntime/issues/23164 https://github.com/microsoft/onnxruntime/issues/23163 https://github.com/microsoft/onnxruntime/issues/23161 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: Liqun Fu <liqun.fu@microsoft.com> Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com> Co-authored-by: yf711 <yifanl@microsoft.com>	2024-12-24 09:02:02 -08:00
Yulong Wang	6806174096	fix webgpu delay load test (#23157) ### Description This change fixes the WebGPU delay load test. <details> <summary>Fix UB in macro</summary> The following C++ code outputs `2, 1` in MSVC, while it outputs `1, 1` in GCC:
```c++
#include <iostream>

#define A 1
#define B 1
#define ENABLE defined(A) && defined(B)

#if ENABLE
int x = 1;
#else
int x = 2;
#endif

#if defined(A) && defined(B)
int y = 1;
#else
int y = 2;
#endif

int main() { std::cout << x << ", " << y << "\n"; }
```
Clang reports `macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]`. </details> <details> <summary>Fix condition of build option onnxruntime_ENABLE_DELAY_LOADING_WIN_DLLS</summary> Delay load is explicitly disabled when the Python binding is being built. This change modifies the condition. </details>	2024-12-20 13:37:12 -08:00
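The undefined behavior described in the commit above comes from a macro whose expansion produces `defined`. A minimal sketch of one portable alternative follows, assuming the hypothetical macro names `FEATURE_A`, `FEATURE_B`, and `FEATURE_ENABLED` (none of them come from the ONNX Runtime source): evaluate `defined(...)` directly in an `#if`, record the result as a plain 0/1 macro, and test that macro afterwards. This is not necessarily the fix the commit applied.

```cpp
#include <cassert>

// Hypothetical feature macros, for illustration only.
#define FEATURE_A 1
#define FEATURE_B 1

// Evaluate defined() directly inside #if, where its meaning is well defined,
// and store the outcome as a plain integer macro.
#if defined(FEATURE_A) && defined(FEATURE_B)
#define FEATURE_ENABLED 1
#else
#define FEATURE_ENABLED 0
#endif

// FEATURE_ENABLED expands to 0 or 1, so this test never relies on
// 'defined' appearing as the result of a macro expansion.
#if FEATURE_ENABLED
constexpr int kFeatureValue = 1;
#else
constexpr int kFeatureValue = 2;
#endif

// Returns the value selected by the preprocessor branch above.
int FeatureValue() { return kFeatureValue; }

// Small check: both macros are defined, so the enabled branch is taken.
void CheckFeatureValue() { assert(FeatureValue() == 1); }
```

Because the condition is reduced to an integer before it is reused, the selected branch should not depend on how a particular compiler treats `defined` inside a macro expansion.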
Changming Sun	fcc34da5e9	Fix a tiny problem in winml.cmake (#23173 ) ### Description CMake's [target_link_libraries](https://cmake.org/cmake/help/latest/command/target_link_libraries.html#id2) function accepts plain library name(like `re2`) or target name(like `re2::re2`) or some other kinds of names. "plain library names" are old-fashioned, for compatibility only. We should use target names. ### Motivation and Context To make vcpkg work with winml build. See #23158	2024-12-20 11:48:43 -08:00
Yifan Li	d9d07ad8ae	[TensorRT EP] support TensorRT 10.7-GA (#23011 ) ### Description <!-- Describe your changes. --> Update CIs to TRT10.7 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-12-19 10:39:15 -08:00
Yulong Wang	8680244ebc	Fix delay load for WebGPU EP and DML EP (#23111) ### Description This change fixes the DLL delay load problem for the WebGPU EP and DirectML EP. See the detailed explanation below. ### Problem When onnxruntime.dll uses delay loading for its dependencies, the dependencies are loaded using `LoadLibraryEx()`, which searches the directory of the process (.exe) instead of the directory of this library (onnxruntime.dll). This is a problem for the Node.js binding and the Python binding, because Windows will try to find the dependencies in the directory of node.exe or python.exe, which is not the directory of onnxruntime.dll. There was a previous attempt to fix this by loading DirectML.dll during initialization of the onnxruntime Node.js binding, which works for the DML EP but is not a good solution because it does not really "delay" the load. For WebGPU, the situation became worse because webgpu_dawn.dll depends on dxil.dll and dxcompiler.dll, which are explicitly loaded at runtime in the code using `LoadLibraryA()`. This has the same DLL search problem. ### Solutions For onnxruntime.dll loading its direct dependencies, the problem can be resolved by setting the [`__pfnDliNotifyHook2` hook](https://learn.microsoft.com/en-us/cpp/build/reference/understanding-the-helper-function?view=msvc-170#structure-and-constant-definitions) to load from an absolute path that is constructed from the onnxruntime.dll folder and the DLL name. For webgpu_dawn.dll loading dxil.dll and dxcompiler.dll, the hook does not work because they are explicitly loaded in the code. Instead, the problem can be resolved by ~~using WIN32 API `SetDllDirectory()` to add the onnxruntime.dll folder to the search path.~~ preloading the 2 DLLs from the onnxruntime.dll folder.	2024-12-19 10:23:48 -08:00
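The hook-based solution in the commit above loads each dependency from a path built from the onnxruntime.dll folder and the DLL name. A minimal sketch of only that path-construction step follows, written with portable C++17 `std::filesystem` so it compiles on any platform. `SiblingLibraryPath` is a hypothetical helper name, not a function from ONNX Runtime; a real delay-load hook would presumably obtain the module path from a Windows API such as `GetModuleFileNameW` and pass the result to `LoadLibraryExW`, and both of those steps are omitted here.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: given the full path of the module that performs the
// load and the bare file name of a dependency, return the dependency's path
// inside the same folder as the module.
fs::path SiblingLibraryPath(const fs::path& module_path,
                            const std::string& dependency_name) {
  // parent_path() drops the module's own file name and keeps its folder.
  return module_path.parent_path() / dependency_name;
}

// Small check using made-up relative paths.
void CheckSiblingLibraryPath() {
  const fs::path module = fs::path("pkg") / "bin" / "onnxruntime.dll";
  const fs::path expected = fs::path("pkg") / "bin" / "DirectML.dll";
  assert(SiblingLibraryPath(module, "DirectML.dll") == expected);
}
```

If `module_path` is absolute, the returned path is absolute as well, which is the property that keeps the lookup independent of the directory of node.exe or python.exe.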
Yulong Wang	3a0b958586	add 2 CMake build options of Dawn (#23096)
### Description
This change adds the following CMake build options for Dawn:
- onnxruntime_BUILD_DAWN_MONOLITHIC_LIBRARY
  - OFF by default
  - when enabled, builds Dawn as a monolithic library (webgpu_dawn.dll)
- onnxruntime_ENABLE_DAWN_BACKEND_VULKAN
  - OFF by default
  - when enabled, build with Vulkan backend for Dawn on Windows
- onnxruntime_ENABLE_DAWN_BACKEND_D3D12
  - ON by default
  - when enabled, build with DirectX 12 backend for Dawn on Windows

### File Size Comparison (Windows)
| Build | cmdline | File Size |
|---|---|---|
| Baseline | --config Release<br/> --build_shared_lib | `12,755,456 onnxruntime.dll` |
| WebGPU D3D12 (default) | --use_webgpu<br/> --config Release<br/> --build_shared_lib | `17,082,368 dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`18,708,480 onnxruntime.dll` |
| WebGPU D3D12+Vulkan | --use_webgpu<br/> --config Release<br/> --build_shared_lib<br/> --cmake_extra_defines<br/> onnxruntime_ENABLE_DAWN_BACKEND_D3D12=1<br/> onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=1 | `17,081,344 dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`19,388,416 onnxruntime.dll` |
| WebGPU Vulkan | --use_webgpu<br/> --config Release<br/> --build_shared_lib<br/> --cmake_extra_defines<br/> onnxruntime_ENABLE_DAWN_BACKEND_D3D12=0<br/> onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=1 | `17,615,872 onnxruntime.dll` |
| Monolithic | --use_webgpu<br/> --config Release<br/> --build_shared_lib<br/> --cmake_extra_defines<br/> onnxruntime_BUILD_DAWN_MONOLITHIC_LIBRARY=1 | `17,082,368 dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`13,277,696 onnxruntime.dll`<br/>` 5,616,640 webgpu_dawn.dll` |
| External Dawn | --use_webgpu<br/> --config Release<br/> --build_shared_lib<br/> --cmake_extra_defines<br/> onnxruntime_USE_EXTERNAL_DAWN=1<br/> --skip_tests | `17,081,344 dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`13,277,184 onnxruntime.dll` |
	2024-12-13 16:05:48 -08:00
Dmitri Smirnov	890a719c91	Remove deprecated static from Eigen that contributes to size increase (#23084 ) ### Description <!-- Describe your changes. --> This patches Eigen source to remove an unused deprecated static var. ### Motivation and Context Internal customer request.	2024-12-12 10:19:47 -08:00
Ankit Maheshkar	1f88284f96	OVEP 1.21.0 Development Updates (#23080 ) ### Description OVEP development changes for ORT 1.21 Release ### Motivation and Context - Has Critical Bug Fixes - Improved Performance optimizations for both memory & inference latency (https://github.com/intel/onnxruntime/pull/513) - Enabled Model Compilation using NPUW (https://github.com/intel/onnxruntime/pull/508) - Fixed support for EPContext embed mode 0 for lower memory utilization - Updated NuGet package name as `Intel.ML.OnnxRuntime.OpenVino` - Fixed QDQ Stripping logic on NPU	2024-12-11 22:26:32 -08:00
Edward Chen	fa6ad202aa	Minor updates to onnxruntime_java.cmake (#23068 ) - Use `ANDROID` instead of `CMAKE_SYSTEM_NAME STREQUAL "Android"`. - Put common gradle arguments into `COMMON_GRADLE_ARGS` to make them easier to reuse.	2024-12-10 15:44:36 -08:00
Misha Chornyi	bf4d3e1a5b	Update vcpkg.json - lock flatbuffer version (#23046 ) ### Description Locking version introduced in: `03ea5dc495/onnxruntime/core/flatbuffers/schema/ort_training_checkpoint.fbs.h (L11-L13)` ### Motivation and Context Resolve issue for version `>=1.20.` https://github.com/microsoft/onnxruntime/issues/22666	2024-12-10 11:23:01 -08:00
Jing Fang	bd5a759d0c	[ARM CPU] Add rotary embedding fp16 kernel (#23013 ) ### Description Add fp16 kernel to rotary embedding to boost performance. ### Motivation and Context Part of performance optimization work for group query attention	2024-12-06 13:25:48 -08:00
Yulong Wang	a615bd6688	Bump version of Dawn to 12a3b24c4 (#23002 ) ### Description Upgrade version of Dawn. Removed dawn.patch, because all patches are included in upstream. Updated code that affected by API changes (`const char*` -> `WGPUStringView`) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-12-04 09:47:16 -08:00
Yulong Wang	e84b8e7bd5	allow specify a custom local source path for Dawn (#22999) ### Description Allows building ONNX Runtime from a custom local path to Dawn's source code. Usage: ```sh build --use_webgpu --cmake_extra_defines "onnxruntime_CUSTOM_DAWN_SRC_PATH=C:/src/dawn" ```	2024-12-03 19:25:22 -08:00
Kee	8c52fa3924	[VSINPU]Split/Pad and some element-wise OPs support (#22916 ) ### Description -Add split/pad/neg/not/ceil/round/min/max op support -Fix conv2d op default pads value issue -Add VSINPU EP to support python bindings ### Motivation and Context -New OPs support for VSINPU EP --------- Signed-off-by: Kee <xuke537@hotmail.com>	2024-12-02 13:57:30 -08:00
Aleksei Nikiforov	f6e1d44829	Add option to force generic algorithms on x86 (#22917 ) Option is named onnxruntime_FORCE_GENERIC_ALGORITHMS Follow up to https://github.com/microsoft/onnxruntime/pull/22125. ### Description This change adds compile-time option to disable optimized algorithms and use generic algorithms (exclude AVX* and SSE etc in GEMM) on x86. This new option is intended only for testing these algorithms, not for production use. Following build command on linux x86_64 builds onnxruntime with new option enabled: `./build.sh --parallel --cmake_extra_defines onnxruntime_FORCE_GENERIC_ALGORITHMS=1` ### Motivation and Context This change allows testing generic algorithms. This may be needed for platforms which don't have optimized implementations available, like in https://github.com/microsoft/onnxruntime/pull/22125.	2024-11-21 13:45:46 -08:00
Changming Sun	13346fdf18	Cleanup code (#22827) ### Description 1. Delete the TVM EP because it is no longer maintained. 2. Delete ortmodule-related docker files and scripts.	2024-11-19 14:13:33 -08:00
Jing Fang	c73a3d1804	[ARM] MatMulNBits fp16 support - connect kernels (#22856 ) ### Description A breakdown PR of https://github.com/microsoft/onnxruntime/pull/22651 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-15 14:59:11 -08:00
Po-Wei (Vincent)	bbe7c87738	Fix 1.20 cuda minimal build failure (#22751 ) ### Description Fixes build failure for the cuda minimal build ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> [This change](https://github.com/microsoft/onnxruntime/pull/19470) in 1.20 is causing build failures for the cuda minimal build. Essentially, some cudnn logic was not guarded by the `USE_CUDA_MINIMAL`. Also the build is looking for cudnn while in the cuda minimal build it shouldn't depend on it, resulting in linking error. cc @gedoensmax @chilo-ms	2024-11-15 10:50:55 -08:00
Preetha Veeramalai	ac9c135b95	Ovep develop 1.21 (#22824 ) ### Description OVEP development changes for ORT 1.21 Release ### Motivation and Context Has critical bug fixes Support for concurrency execution of models is enabled Support for OV 2024.5 Memory optimizations for NPU platform --------- Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: TejalKhade28 <tejal.khade@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>	2024-11-14 20:10:07 -08:00
Jing Fang	c02b398980	[ARM] MatMulNBits Fp16 support - API change only (#22826 ) ### Description A break-down PR of https://github.com/microsoft/onnxruntime/pull/22651 Op API change only. - add template to functions and classes that support fp32 and fp16 - rename functions, classes and files that support fp32 and fp16 from SQNBxxx to QNBxxx ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-14 10:38:59 -08:00
Jing Fang	7fa69461fd	[ARM] MatMulNBits FP16 support - kernels only (#22806 ) ### Description A break down PR of https://github.com/microsoft/onnxruntime/pull/22651 Add fp16 kernels. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-11-12 14:28:47 -08:00
zz002	d3ad76b2cf	[VitisAI] Cache node subgraph when necessary (#22073 ) ### Description <!-- Describe your changes. --> [VitisAI] Cache node subgraph when necessary ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com> Co-authored-by: zhenzew <zhenzew@amd.com>	2024-11-08 23:17:16 -08:00
Ranjit Ranjan	193671295e	[AIX] Fix for AIX build break (#22745) ### Description With recent changes, the following build error occurs on AIX. ``` ld: 0706-012 The -p flag is not recognized. ld: 0706-012 The -a flag is not recognized. ld: 0706-012 The -t flag is not recognized. ld: 0706-012 The -h flag is not recognized. ld: 0706-012 The -= flag is not recognized. ld: 0706-012 The -$ flag is not recognized. ld: 0706-012 The -$ flag is not recognized. ld: 0706-012 The -O flag is not recognized. ld: 0706-027 The -R IGIN flag is ignored. collect2: error: ld returned 255 exit status ``` ### Motivation and Context The AIX linker doesn't support the -rpath option, so this change blocks that option on AIX.	2024-11-07 13:22:22 -08:00
Yifan Li	3b7a6eba69	[TensorRT EP] support TensorRT 10.6-GA (#22644 ) ### Description <!-- Describe your changes. --> * Update CI with TRT 10.6 * Update oss parser to [10.6-GA-ORT-DDS ](https://github.com/onnx/onnx-tensorrt/tree/10.6-GA-ORT-DDS) and update dependency version * Update Py-cuda11 CI to use trt10.6 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> (There will be 3rd PR to further reduce trt_version hardcoding)	2024-11-06 14:33:46 -08:00
Tianlei Wu	72186bbb71	[CUDA] Build nhwc ops by default (#22648) ### Description * Build CUDA NHWC ops by default. * Deprecate `--enable_cuda_nhwc_ops` in build.py and add the `--disable_cuda_nhwc_ops` option. Note that this requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops will be disabled automatically. ### Motivation and Context In general, NHWC is faster than NCHW for convolution on Nvidia GPUs with Tensor Cores, and this could improve performance for vision models. This is the first step toward preferring NHWC for CUDA in the 1.21 release. The next step is to run tests on popular vision models. If it helps on most models and devices, set `prefer_nhwc=1` as the default CUDA provider option.	2024-11-06 09:54:55 -08:00
Changming Sun	66980e4646	Refactor the cmake code that is related to delay loading (#22646 ) ### Description Refactor the cmake code that is related to delay loading. Provide a cmake option to control if delay loading should be enabled or not. Disabling the option when python is enabled, due to a known issue. ### Motivation and Context ONNX Runtime's python package depends on DirectML.dll, but supposedly the DLL should be delay loaded. This PR only refactor the code. It doesn't change the behavior.	2024-11-04 16:30:50 -08:00
Yulong Wang	7a8fa12850	Add implementation of WebGPU EP (#22591 ) ### Description This PR adds the actual implementation of the WebGPU EP based on https://github.com/microsoft/onnxruntime/pull/22318. This change includes the following: <details> <summary><b>core framework of WebGPU EP</b></summary> - WebGPU EP factory classes for: - handling WebGPU options - creating WebGPU EP instance - creating WebGPU context - WebGPU Execution Provider classes - GPU Buffer allocator - data transfer - Buffer management classes - Buffer Manager - BufferCacheManager - DisabledCacheManager - SimpleCacheManager - LazyReleaseCacheManager - BucketCacheManager - Program classes - Program (base) - Program Cache Key - Program Manager - Shader helper classes - Shader Helper - ShaderIndicesHelper - ShaderVariableHelper - Utils - GPU Query based profiler - compute context - string utils - Miscs - Python binding webgpu support (basic) </details> <details> <summary><b>Kernel implementation</b></summary> - onnx.ai (default opset): - Elementwise (math): Abs, Neg, Floor, Ceil, Reciprocal, Sqrt, Exp, Erf, Log, Sin, Cos, Tan, Asin, Acos, Atan, Sinh, Cosh, Asinh, Acosh, Atanh, Tanh, Not, Cast - Elementwise (activation): Sigmoid, HardSigmoid, Clip, Elu, Relu, LeakyRelu, ThresholdedRelu, Gelu - Binary (math): Add, Sub, Mul, Div, Pow, Equal, Greater, GreaterOrEqual, Less, LessOrEqual - (Tensors): Shape, Reshape, Squeeze, Unsqueeze - Where - Transpose - Concat - Expand - Gather - Tile - Range - LayerNormalization - com.microsoft - FastGelu - MatMulNBits - MultiHeadAttention - RotaryEmbedding - SkipLayerNormalization - LayerNormalization - SimplifiedLayerNormalization - SkipSimplifiedLayerNormalization </details> <details> <summary><b>Build, test and CI pipeline integration</b></summary> - build works for Windows, macOS and iOS - support onnxruntime_test_all and python node test - added a new unit test for `--use_external_dawn` build flag. 
- updated MacOS pipeline to build with WebGPU support - added a new pipeline for WebGPU Windows </details> This change does not include: - Node.js binding support for WebGPU (will be a separate PR)	2024-10-29 18:29:40 -07:00
Indy Zhu	e2e837584f	[DML EP] Update DML to 1.15.4 (#22635 ) ### Description [DML EP] Update DML to 1.15.4 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> We want the customer to use the latest DirectML.	2024-10-29 17:13:57 -07:00
Tianlei Wu	b4afc6266f	[ROCm] Python 3.10 in ROCm CI, and ROCm 6.2.3 in MigraphX CI (#22527 ) ### Description Upgrade python from 3.9 to 3.10 in ROCm and MigraphX docker files and CI pipelines. Upgrade ROCm version to 6.2.3 in most places except ROCm CI, see comment below. Some improvements/upgrades on ROCm/Migraphx docker or pipeline: * rocm 6.0/6.1.3 => 6.2.3 * python 3.9 => 3.10 * Ubuntu 20.04 => 22.04 * Also upgrade ml_dtypes, numpy and scipy packages. * Fix message "ROCm version from ..." with correct file path in CMakeList.txt * Exclude some NHWC tests since ROCm EP lacks support for NHWC convolution. #### ROCm CI Pipeline: ROCm 6.1.3 is kept in the pipeline for now. - Failed after upgrading to ROCm 6.2.3: `HIPBLAS_STATUS_INVALID_VALUE ; GPU=0 ; hostname=76123b390aed ; file=/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=170 ; expr=hipblasSetStream(hipblas_handle_, stream);` . It need further investigation. - cupy issues: (1) It currently supports numpy < 1.27, might not work with numpy 2.x. So we locked numpy==1.26.4 for now. (2) cupy support of ROCm 6.2 is still in progress: https://github.com/cupy/cupy/issues/8606. Note that miniconda issues: its libstdc++.so.6 and libgcc_s.so.1 might have conflict with the system ones. So we created links to use the system ones. #### MigraphX CI pipeline MigraphX CI does not use cupy, and we are able to use ROCm 6.2.3 and numpy 2.x in the pipeline. #### Other attempts Other things that I've tried which might help in the future: Attempt to use a single docker file for both ROCm and Migraphx: https://github.com/microsoft/onnxruntime/pull/22478 Upgrade to ubuntu 24.04 and python 3.12, and use venv like [this](`27903e7ff1/tools/ci_build/github/linux/docker/rocm-ci-pipeline-env.Dockerfile`). ### Motivation and Context In 1.20 release, ROCm nuget packaging pipeline will use 6.2: https://github.com/microsoft/onnxruntime/pull/22461. This upgrades rocm to 6.2.3 in CI pipelines to be consistent.	
2024-10-25 11:47:16 -07:00
Satya Kumar Jandhyala	4ed5bec2e7	[JS/WebGPU] Support WASM64 (#21836 ) ### Description Support wasm64 ### Motivation and Context Overcome memory limitations --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2024-10-24 20:21:51 -07:00
Changming Sun	88676e62b9	Remove nsync (#20413 ) ### Description 1. Remove the onnxruntime::OrtMutex class and replace it with ~absl::Mutex~ std::mutex. 2. After this change, most source files will not include <Windows.h> indirectly. ### Motivation and Context To reduce the number of deps we have, and address some Github issues that are related to build ONNX Runtime from source. In PR #3000 , I added a custom implementation of std::mutex . It was mainly because at that time std::mutex's default constructor was not trivial on Windows. If you had such a mutex as a global var, it could not be initialized at compile time. Then VC++ team fixed this issue. Therefore we don't need this custom implementation anymore. This PR also removes nsync. I ran several models tests on Linux. I didn't see any perf difference. This PR also reverts PR #21005 , which is no longer needed since conda has updated its msvc runtime DLL. This PR unblocks #22173 and resolves #22092 . We have a lot of open issues with nsync. This PR can resolve all of them.	2024-10-21 15:32:14 -07:00
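The commit above replaces a custom mutex class with `std::mutex` and notes that a mutex used as a global variable should be initializable at compile time. A minimal sketch of that usage pattern follows; `g_id_mutex`, `g_last_id`, and `NextId` are hypothetical names chosen for illustration and are not taken from ONNX Runtime. It relies on the C++ standard declaring the default constructor of `std::mutex` as `constexpr`.

```cpp
#include <cassert>
#include <mutex>

// A namespace-scope std::mutex. Its default constructor is constexpr in the
// C++ standard, so a conforming library can initialize it without running
// any code at program startup.
std::mutex g_id_mutex;
int g_last_id = 0;

// Hands out increasing ids; the global mutex guards the shared counter.
int NextId() {
  std::lock_guard<std::mutex> lock(g_id_mutex);
  return ++g_last_id;
}

// Small check: two consecutive calls return consecutive ids.
void CheckNextId() {
  const int first = NextId();
  const int second = NextId();
  assert(second == first + 1);
}
```

The sketch is single-threaded for brevity; the lock only matters once `NextId` is called from several threads.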
Jeff Daily	5aabc53121	[ROCm] redo hipify of version controlled files (#22449 ) ### Description Updates the ROCm EP opsets to match the current CUDA EP opsets. Also enable the test CApiTest.basic_cuda_graph_with_annotation. Note that some changes are whitespace-only. These changes were made to improve the comparison of corresponding ROCm and CUDA EP source files when using a side by side diff tool. ### Motivation and Context The ROCm EP derives from the CUDA EP. Many source files are shared between the EPs and "hipified" during the ROCm EP build, however quite a few files within the ROCm EP are under source control after their initial hipification. Over time these ROCm EP files get stale relative to their CUDA EP counterparts. It becomes necessary to re-hipify these otherwise static files in order to pick up important changes such as opset differences.	2024-10-18 12:40:54 -07:00
Edward Chen	7964d3aef6	Specify iOS simulator runtime version (#22474 ) - Allow specification of iOS simulator runtime version to use. - Pick simulator runtime version (iphonesimulator 16.4) that is supported by the Xcode version (14.3.1) that we use. - Disable CoreML EP's DepthToSpace op support for CoreML version less than 7, with DCR mode, and FP16 input. It doesn't produce the correct output in this case. - Some cleanup of iOS test infrastructure.	2024-10-18 09:26:06 -07:00
Jeff Daily	8c21680ffc	[ROCm] prefer hip interfaces over roc during hipify (#22394 ) ### Description Change the hipify step to remove the -roc option to hipify-perl. This will prefer hipblas over rocblas. rocblas can still be called directly such as in TunableOp. ### Motivation and Context hip interfaces are preferred over roc for porting from cuda to hip. Calling roc interfaces is meant for ROCm-specific enhancements or extensions.	2024-10-14 20:34:03 -07:00
amarin16	7d17c466ec	Add microbenchmark for layer normalization and improve latency (#22223 ) - Added a microbenchmark for the `LayerNormalization` MLFloat16 support added in https://github.com/microsoft/onnxruntime/pull/22063. - Updated the `LayerNormalization` MLFloat16 implementation to improve the latency. ``` ---------------------------------------------------------------------------------------------- Original MLFloat16 support Time CPU Iterations ---------------------------------------------------------------------------------------------- BM_LayerNormalization<MLFloat16, float>/1/real_time 15599 us 15625 us 47 BM_LayerNormalization<MLFloat16, float>/1/real_time 14714 us 14824 us 39 BM_LayerNormalization<MLFloat16, float>/1/real_time 14634 us 14688 us 50 ---------------------------------------------------------------------------------------------- Updated MLFloat16 support Time CPU Iterations ---------------------------------------------------------------------------------------------- BM_LayerNormalization<MLFloat16, float>/1/real_time 7276 us 7254 us 84 BM_LayerNormalization<MLFloat16, float>/1/real_time 6820 us 6720 us 93 BM_LayerNormalization<MLFloat16, float>/1/real_time 6840 us 6882 us 84 ```	2024-10-14 18:47:27 -07:00
Tianlei Wu	de93f40240	[CUDA] Lean Attention (#22352 ) ### Description Add [Lean Attention](https://arxiv.org/abs/2405.10480) and the integration with MultiHeadAttention operator for LLM in GPU. LeanAttention speeds up self-attention for the token-generation phase (decode-phase) of decoder-only transformer models, especially on long context lengths. - [x] Initial implementation of Lean Attention (by Srikant Bharadwaj) - [x] Integration with MultiHeadAttention operator - [x] Add parity tests - [x] Add benchmark #### Implementation Details (1) Lean Attention is enabled in build for Linux, and disabled for Windows (2) Lean Attention is disabled by default. Need enable it through cuda provider option sdpa_kernel, or use environment variable `ORT_ENABLE_LEAN_ATTENTION=1` (3) It only works for token-generation (sequence_length==1, past_sequence_length > 0). (4) Like flash attention, it only works in Ampere or newer GPU. We can revisit #1 and #2 after comparing with DecoderMaskedMultiHeadAttention and XQA kernels. #### Benchmark ``` cd onnxruntime/test/python/transformers /bin/bash benchmark_mha.sh lean ``` Example outputs in H100: Note that past and present does not share buffer for MHA for now, so we can see low tflops. The relative ratio will change after buffer sharing is enabled. But we expect that the order (kernel A is faster than B) will remain the same after buffer sharing is enabled. Note that common settings `sequence_length=1; causal=True;attn_bias=None;cuda_graph=False` are not shown in the below table. 
batch_size | past_sequence_length | num_heads | head_size | average_latency | tflops | kernel
-- | -- | -- | -- | -- | -- | --
1 | 512 | 16 | 64 | 0.000059 | 0.0178 | ort:flash
1 | 512 | 16 | 64 | 0.000068 | 0.0155 | ort:efficient
1 | 512 | 16 | 64 | 0.000065 | 0.0161 | ort:math
1 | 512 | 16 | 64 | 0.000060 | 0.0176 | ort:lean
1 | 512 | 32 | 128 | 0.000062 | 0.0674 | ort:flash
1 | 512 | 32 | 128 | 0.000064 | 0.0661 | ort:efficient
1 | 512 | 32 | 128 | 0.000067 | 0.0625 | ort:math
1 | 512 | 32 | 128 | 0.000062 | 0.0678 | ort:lean
1 | 1024 | 16 | 64 | 0.000061 | 0.0345 | ort:flash
1 | 1024 | 16 | 64 | 0.000086 | 0.0244 | ort:efficient
1 | 1024 | 16 | 64 | 0.000065 | 0.0322 | ort:math
1 | 1024 | 16 | 64 | 0.000063 | 0.0332 | ort:lean
1 | 1024 | 32 | 128 | 0.000075 | 0.1125 | ort:flash
1 | 1024 | 32 | 128 | 0.000088 | 0.0951 | ort:efficient
1 | 1024 | 32 | 128 | 0.000079 | 0.1068 | ort:math
1 | 1024 | 32 | 128 | 0.000072 | 0.1171 | ort:lean
1 | 2048 | 16 | 64 | 0.000069 | 0.0606 | ort:flash
1 | 2048 | 16 | 64 | 0.000125 | 0.0336 | ort:efficient
1 | 2048 | 16 | 64 | 0.000064 | 0.0655 | ort:lean
1 | 2048 | 32 | 128 | 0.000098 | 0.1720 | ort:flash
1 | 2048 | 32 | 128 | 0.000132 | 0.1270 | ort:efficient
1 | 2048 | 32 | 128 | 0.000092 | 0.1828 | ort:lean
1 | 4096 | 16 | 64 | 0.000076 | 0.1097 | ort:flash
1 | 4096 | 16 | 64 | 0.000207 | 0.0406 | ort:efficient
1 | 4096 | 16 | 64 | 0.000069 | 0.1209 | ort:lean
1 | 4096 | 32 | 128 | 0.000140 | 0.2394 | ort:flash
1 | 4096 | 32 | 128 | 0.000213 | 0.1575 | ort:efficient
1 | 4096 | 32 | 128 | 0.000139 | 0.2419 | ort:lean
1 | 8192 | 16 | 64 | 0.000104 | 0.1609 | ort:flash
1 | 8192 | 16 | 64 | 0.000392 | 0.0428 | ort:efficient
1 | 8192 | 16 | 64 | 0.000093 | 0.1809 | ort:lean
1 | 8192 | 32 | 128 | 0.000212 | 0.3160 | ort:flash
1 | 8192 | 32 | 128 | 0.000360 | 0.1866 | ort:efficient
1 | 8192 | 32 | 128 | 0.000212 | 0.3162 | ort:lean
1 | 16384 | 16 | 64 | 0.000139 | 0.2410 | ort:flash
1 | 16384 | 16 | 64 | 0.000731 | 0.0459 | ort:efficient
1 | 16384 | 16 | 64 | 0.000136 | 0.2465 | ort:lean
1 | 16384 | 32 | 128 | 0.000361 | 0.3722 | ort:flash
1 | 16384 | 32 | 128 | 0.000667 | 0.2014 | ort:efficient
1 | 16384 | 32 | 128 | 0.000357 | 0.3765 | ort:lean
1 | 32768 | 16 | 64 | 0.000210 | 0.3194 | ort:flash
1 | 32768 | 16 | 64 | 0.001428 | 0.0470 | ort:efficient
1 | 32768 | 16 | 64 | 0.000209 | 0.3211 | ort:lean
1 | 32768 | 32 | 128 | 0.000659 | 0.4074 | ort:flash
1 | 32768 | 32 | 128 | 0.001270 | 0.2114 | ort:efficient
1 | 32768 | 32 | 128 | 0.000651 | 0.4123 | ort:lean
1 | 65536 | 16 | 64 | 0.000355 | 0.3785 | ort:flash
1 | 65536 | 16 | 64 | 0.002736 | 0.0491 | ort:efficient
1 | 65536 | 16 | 64 | 0.000349 | 0.3845 | ort:lean
1 | 65536 | 32 | 128 | 0.001251 | 0.4290 | ort:flash
1 | 65536 | 32 | 128 | 0.002480 | 0.2165 | ort:efficient
1 | 65536 | 32 | 128 | 0.001239 | 0.4333 | ort:lean
4 | 512 | 16 | 64 | 0.000063 | 0.0665 | ort:flash
4 | 512 | 16 | 64 | 0.000069 | 0.0607 | ort:efficient
4 | 512 | 16 | 64 | 0.000066 | 0.0634 | ort:math
4 | 512 | 16 | 64 | 0.000062 | 0.0674 | ort:lean
4 | 512 | 32 | 128 | 0.000100 | 0.1677 | ort:flash
4 | 512 | 32 | 128 | 0.000099 | 0.1703 | ort:efficient
4 | 512 | 32 | 128 | 0.000108 | 0.1557 | ort:math
4 | 512 | 32 | 128 | 0.000092 | 0.1818 | ort:lean
4 | 1024 | 16 | 64 | 0.000077 | 0.1094 | ort:flash
4 | 1024 | 16 | 64 | 0.000099 | 0.0850 | ort:efficient
4 | 1024 | 16 | 64 | 0.000081 | 0.1038 | ort:math
4 | 1024 | 16 | 64 | 0.000072 | 0.1161 | ort:lean
4 | 1024 | 32 | 128 | 0.000143 | 0.2343 | ort:flash
4 | 1024 | 32 | 128 | 0.000137 | 0.2447 | ort:efficient
4 | 1024 | 32 | 128 | 0.000150 | 0.2245 | ort:math
4 | 1024 | 32 | 128 | 0.000135 | 0.2496 | ort:lean
4 | 2048 | 16 | 64 | 0.000096 | 0.1757 | ort:flash
4 | 2048 | 16 | 64 | 0.000156 | 0.1078 | ort:efficient
4 | 2048 | 16 | 64 | 0.000089 | 0.1892 | ort:lean
4 | 2048 | 32 | 128 | 0.000223 | 0.3010 | ort:flash
4 | 2048 | 32 | 128 | 0.000217 | 0.3101 | ort:efficient
4 | 2048 | 32 | 128 | 0.000209 | 0.3209 | ort:lean
4 | 4096 | 16 | 64 | 0.000137 | 0.2448 | ort:flash
4 | 4096 | 16 | 64 | 0.000256 | 0.1312 | ort:efficient
4 | 4096 | 16 | 64 | 0.000133 | 0.2530 | ort:lean
4 | 4096 | 32 | 128 | 0.000389 | 0.3450 | ort:flash
4 | 4096 | 32 | 128 | 0.000376 | 0.3574 | ort:efficient
4 | 4096 | 32 | 128 | 0.000354 | 0.3794 | ort:lean
4 | 8192 | 16 | 64 | 0.000210 | 0.3198 | ort:flash
4 | 8192 | 16 | 64 | 0.000453 | 0.1480 | ort:efficient
4 | 8192 | 16 | 64 | 0.000206 | 0.3260 | ort:lean
4 | 8192 | 32 | 128 | 0.000725 | 0.3705 | ort:flash
4 | 8192 | 32 | 128 | 0.000693 | 0.3874 | ort:efficient
4 | 8192 | 32 | 128 | 0.000653 | 0.4114 | ort:lean
4 | 16384 | 16 | 64 | 0.000355 | 0.3782 | ort:flash
4 | 16384 | 16 | 64 | 0.000849 | 0.1581 | ort:efficient
4 | 16384 | 16 | 64 | 0.000346 | 0.3874 | ort:lean
4 | 16384 | 32 | 128 | 0.001395 | 0.3848 | ort:flash
4 | 16384 | 32 | 128 | 0.001337 | 0.4017 | ort:efficient
4 | 16384 | 32 | 128 | 0.001252 | 0.4288 | ort:lean
4 | 32768 | 16 | 64 | 0.000647 | 0.4146 | ort:flash
4 | 32768 | 16 | 64 | 0.001649 | 0.1628 | ort:efficient
4 | 32768 | 16 | 64 | 0.000639 | 0.4204 | ort:lean
4 | 32768 | 32 | 128 | 0.002721 | 0.3947 | ort:flash
4 | 32768 | 32 | 128 | 0.002601 | 0.4128 | ort:efficient
4 | 32768 | 32 | 128 | 0.002434 | 0.4411 | ort:lean
4 | 65536 | 16 | 64 | 0.001231 | 0.4361 | ort:flash
4 | 65536 | 16 | 64 | 0.003238 | 0.1658 | ort:efficient
4 
\| 65536 \| 16 \| 64 \| 0.001217 \| 0.4412 \| ort:lean 4 \| 65536 \| 32 \| 128 \| 0.005357 \| 0.4009 \| ort:flash 4 \| 65536 \| 32 \| 128 \| 0.005118 \| 0.4196 \| ort:efficient 4 \| 65536 \| 32 \| 128 \| 0.004781 \| 0.4492 \| ort:lean 16 \| 512 \| 16 \| 64 \| 0.000098 \| 0.1724 \| ort:flash 16 \| 512 \| 16 \| 64 \| 0.000104 \| 0.1616 \| ort:efficient 16 \| 512 \| 16 \| 64 \| 0.000118 \| 0.1420 \| ort:math 16 \| 512 \| 16 \| 64 \| 0.000087 \| 0.1926 \| ort:lean 16 \| 512 \| 32 \| 128 \| 0.000220 \| 0.3062 \| ort:flash 16 \| 512 \| 32 \| 128 \| 0.000208 \| 0.3237 \| ort:efficient 16 \| 512 \| 32 \| 128 \| 0.000237 \| 0.2838 \| ort:math 16 \| 512 \| 32 \| 128 \| 0.000209 \| 0.3216 \| ort:lean 16 \| 1024 \| 16 \| 64 \| 0.000136 \| 0.2465 \| ort:flash 16 \| 1024 \| 16 \| 64 \| 0.000150 \| 0.2235 \| ort:efficient 16 \| 1024 \| 16 \| 64 \| 0.000148 \| 0.2266 \| ort:math 16 \| 1024 \| 16 \| 64 \| 0.000129 \| 0.2611 \| ort:lean 16 \| 1024 \| 32 \| 128 \| 0.000367 \| 0.3663 \| ort:flash 16 \| 1024 \| 32 \| 128 \| 0.000351 \| 0.3829 \| ort:efficient 16 \| 1024 \| 32 \| 128 \| 0.000400 \| 0.3357 \| ort:math 16 \| 1024 \| 32 \| 128 \| 0.000349 \| 0.3853 \| ort:lean 16 \| 2048 \| 16 \| 64 \| 0.000209 \| 0.3206 \| ort:flash 16 \| 2048 \| 16 \| 64 \| 0.000243 \| 0.2762 \| ort:efficient 16 \| 2048 \| 16 \| 64 \| 0.000201 \| 0.3338 \| ort:lean 16 \| 2048 \| 32 \| 128 \| 0.000671 \| 0.4002 \| ort:flash 16 \| 2048 \| 32 \| 128 \| 0.000645 \| 0.4163 \| ort:efficient 16 \| 2048 \| 32 \| 128 \| 0.000642 \| 0.4185 \| ort:lean 16 \| 4096 \| 16 \| 64 \| 0.000360 \| 0.3732 \| ort:flash 16 \| 4096 \| 16 \| 64 \| 0.000425 \| 0.3162 \| ort:efficient 16 \| 4096 \| 16 \| 64 \| 0.000341 \| 0.3933 \| ort:lean 16 \| 4096 \| 32 \| 128 \| 0.001292 \| 0.4156 \| ort:flash 16 \| 4096 \| 32 \| 128 \| 0.001251 \| 0.4291 \| ort:efficient 16 \| 4096 \| 32 \| 128 \| 0.001241 \| 0.4327 \| ort:lean 16 \| 8192 \| 16 \| 64 \| 0.000666 \| 0.4030 \| ort:flash 16 \| 8192 \| 16 \| 64 \| 0.000804 \| 0.3339 \| 
ort:efficient 16 \| 8192 \| 16 \| 64 \| 0.000627 \| 0.4283 \| ort:lean 16 \| 8192 \| 32 \| 128 \| 0.002541 \| 0.4226 \| ort:flash 16 \| 8192 \| 32 \| 128 \| 0.002454 \| 0.4376 \| ort:efficient 16 \| 8192 \| 32 \| 128 \| 0.002438 \| 0.4405 \| ort:lean 16 \| 16384 \| 16 \| 64 \| 0.001292 \| 0.4156 \| ort:flash 16 \| 16384 \| 16 \| 64 \| 0.001571 \| 0.3417 \| ort:efficient 16 \| 16384 \| 16 \| 64 \| 0.001217 \| 0.4411 \| ort:lean 16 \| 16384 \| 32 \| 128 \| 0.005042 \| 0.4260 \| ort:flash 16 \| 16384 \| 32 \| 128 \| 0.004859 \| 0.4420 \| ort:efficient 16 \| 16384 \| 32 \| 128 \| 0.004827 \| 0.4449 \| ort:lean 16 \| 32768 \| 16 \| 64 \| 0.002537 \| 0.4233 \| ort:flash 16 \| 32768 \| 16 \| 64 \| 0.003103 \| 0.3461 \| ort:efficient 16 \| 32768 \| 16 \| 64 \| 0.002385 \| 0.4501 \| ort:lean 16 \| 32768 \| 32 \| 128 \| 0.009961 \| 0.4312 \| ort:flash 16 \| 32768 \| 32 \| 128 \| 0.009605 \| 0.4472 \| ort:efficient 16 \| 32768 \| 32 \| 128 \| 0.009524 \| 0.4510 \| ort:lean 16 \| 65536 \| 16 \| 64 \| 0.005019 \| 0.4279 \| ort:flash 16 \| 65536 \| 16 \| 64 \| 0.006133 \| 0.3502 \| ort:efficient 16 \| 65536 \| 16 \| 64 \| 0.004703 \| 0.4566 \| ort:lean 16 \| 65536 \| 32 \| 128 \| 0.019746 \| 0.4350 \| ort:flash 16 \| 65536 \| 32 \| 128 \| 0.019027 \| 0.4515 \| ort:efficient 16 \| 65536 \| 32 \| 128 \| 0.018864 \| 0.4554 \| ort:lean ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-10-14 14:49:37 -07:00
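The H100 table above can be read as per-configuration speedups of the lean kernel over the others. The following is a minimal sketch that does this arithmetic for a handful of rows copied verbatim from the table; the `LATENCY` dictionary and the `speedup` helper are illustrative names, not part of onnxruntime or of `benchmark_mha.sh`.

```python
# average_latency values copied from the H100 table in commit de93f40240.
# Key: (batch_size, past_sequence_length, num_heads, head_size).
# LATENCY and speedup() are hypothetical helpers for illustration only.
LATENCY = {
    (1, 512, 16, 64): {"flash": 0.000059, "efficient": 0.000068, "lean": 0.000060},
    (1, 8192, 16, 64): {"flash": 0.000104, "efficient": 0.000392, "lean": 0.000093},
    (4, 65536, 32, 128): {"flash": 0.005357, "efficient": 0.005118, "lean": 0.004781},
    (16, 65536, 16, 64): {"flash": 0.005019, "efficient": 0.006133, "lean": 0.004703},
}


def speedup(config, baseline="flash", candidate="lean"):
    """Return baseline latency divided by candidate latency (>1 means candidate is faster)."""
    row = LATENCY[config]
    return row[baseline] / row[candidate]


if __name__ == "__main__":
    for config in LATENCY:
        ratios = {
            baseline: round(speedup(config, baseline=baseline), 3)
            for baseline in ("flash", "efficient")
        }
        print(config, ratios)
```

On these rows the lean kernel is roughly on par with flash at short context and pulls ahead as `past_sequence_length` grows, which matches the commit's claim about long context lengths. As the commit notes, the ratios may change once past/present buffer sharing is enabled.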
Vishnudas Thaniel S	35adba21c7	Ovep develop lnl 1.2 (#22424 ) ### Description - Support OV2024.4 - Refactor tensor initialization check for external weights - Support loading OV Config - OVEP: Tensor Caching fix, fix accuracy issues - Refactor device memory implementation to make it more generic ### Motivation and Context The changes are required to fix accuracy issues, support loading of OV config, and support OV2024.4 --------- Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: n1harika <niharika.sathish@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>	2024-10-14 12:10:01 -07:00
Edward Chen	04404ea482	Fix Xcode 16 iOS build issues (#22379 ) - Work around Xcode 16 iOS test build issue: `error: Multiple commands produce '.../PlugIns'`. - Fix link error in iOS static framework test. - Update build.py to check for the right kind of build before running iOS tests on the simulator. - Update Xcode 16 build images to 'macos-15' because that's the only image that will have Xcode 16 soon. See https://github.com/actions/runner-images/issues/10703.	2024-10-14 09:24:38 -07:00
Ted Themistokleous	572e43c5d7	[MIGraphX EP/ ROCm EP] add gfx1200, gfx1201 to CMAKE_HIP_ARCHITECTURES (#22348 ) ### Description Add additional gfx targets for AMD GPU support ### Motivation and Context Required to integrate mainline onnxruntime support for AMD GPUs --------- Co-authored-by: Stefan Sokolovic <stsokolo@amd.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2024-10-11 17:31:36 -07:00


1807 commits