onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-27 03:11:28 +00:00

Author	SHA1	Message	Date
mingyue	5ba2a1ed97	fix rebase error	2025-02-06 16:53:21 -08:00
mingyue	dcd9931533	onnxruntime_providers_vitisai.dll enable shared link Hybrid CRT	2025-02-06 16:53:10 -08:00
mingyue	20e4374d58	use internal vaip for rebase test	2025-02-06 16:52:48 -08:00
mingyue	def49fe12f	fix rebase error	2025-02-06 16:52:00 -08:00
Liu Minyue	6eeafca6e0	Add some VitisAI EP C APIs fix compile error move configuration to vaip update vaip repo remote url add vaip_get_default_config API move pattern zoo to internal (#1) * move pattern zoo to internal * change external vaip --------- Co-authored-by: mingyue <mingyue@amd.com> Co-authored-by: mingyue <mingyue@xilinx.com> add get_patterrn_list API and change get_pattern API (#2) * add get_patterrn_list API and change get_pattern API * lintrunner -a --------- Co-authored-by: mingyue <mingyue@amd.com> Co-authored-by: mingyue <mingyue@xilinx.com> change xcompiler_compile API by CPS (#3) Co-authored-by: mingyue <mingyue@amd.com> [deps] change vaip remote and branch Add vaip_get/has_mem_xclbin APIs (#4) * Add vaip_get/has_mem_xclbin APIs --------- Co-authored-by: mingyue <mingyue@xilinx.com> vaip point to github/amd use main branch	2025-02-06 16:51:36 -08:00
Chunye Wang	7cbcf5cd4b	refactor VitisAI EP for opensource update to use integrated vaip update vaip as the single entry point for cmake add dummy vaip_xcompile_run. onnxruntime_vitisia_ep.dll is optional. vaip_xcompiler_compile may be nullptr. update create_ep_context_nodes.	2025-02-06 16:48:17 -08:00
Ashrit Shetty	4b5b5f7101	Update win-ort-main to tip main 250123 (#23473 ) ### Description This PR is to update the win-ort-main branch to the tip main branch as of 2025-01-23. ### PR List ddf0d377a7 [QNN EP] Add LoggingManager::HasDefaultLogger() to provider bridge API (#23467) 05fbbdf91f [QNN EP] Make QNN EP a shared library (#23120) 1336566d7f Add custom vcpkg ports (#23456) 2e1173c411 Update the compile flags for vcpkg packages (#23455) 1f628a9858 [Mobile] Add BrowserStack Android MAUI Test (#23383) 009cae0ec8 [js/webgpu] Optimize ConvTranspose (Continue) (#23429) 04a4a694cb Use onnx_protobuf.h to suppress some GCC warnings (#23453) 2e3b62b4b0 Suppress some strict-aliasing related warnings in WebGPU EP (#23454) b708f9b1dc Bump ruff from 0.9.1 to 0.9.2 (#23427) c0afc66b2a [WebNN] Remove workarounds for TFLite backend (#23406) 8a821ff7f9 Bump vite from 6.0.7 to 6.0.11 in /js/web/test/e2e/exports/testcases/vite-default (#23446) 220c1a203e Make ORT and Dawn use the same protobuf/abseil source code (#23447) b7b5792147 Change MacOS-13 to ubuntu on for android-java-api-aar-test.yml. (#23444) 19d0d2a30f WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP (#23365) 95b8effbc4 [QNN EP]: Clean up QNN logging resources if an error occurs during initialization (#23435) 626134c5b5 Bump clang-format from 19.1.6 to 19.1.7 (#23428) 0cf975301f Fix eigen external deps (#23439) f9440aedce Moving RN_CI Android Testing to Linux (#23422) 1aa5902ff4 [QNN EP] workaround for QNN validation bug for Tanh with uint16 quantized output (#23432) 7f5582a0e2 Seperate RN andriod and IOS into 2 separated Stages. 
(#23400) 73deac2e7f Implement some missing element wise Add/Sub/Mul/Div/Neg operations for CPU and CUDA EPs (#23090) 949fe42af4 Upgrade Java version from react-native/android to Java 17 (#23066) 0892c23463 Update Qnn SDK default version to 2.30 (#23411) 94c099bcec Fix type cast build error (#23423) d633e571d1 [WebNN EP] Fix AddInitializersToSkip issues (#23354) e988ef00e2 [QNN EP] Fix regression for MatMul with two quantized/dynamic uint16 inputs (#23419) 7538795f6b Update onnxruntime binary size checks ci pipeline's docker image (#23405) 6c5ea41cad Revert "[QNN EP] Clean up correctly from a partial setup (#23320)" (#23420) e866804bbe Enable comprehension simplification in ruff rules (#23414) 0a5f1f392c bugfix: string_view of invalid memory (#23417) 4cc38e0277 fix crash when first input of BatchNormalization is 1-D (#23387) 033441487f Target py310 and modernize codebase with ruff (#23401) 87341ac010 [QNN EP] Fix segfault when unregistering HTP shared memory handles (#23402) ### Motivation and Context This update includes the change to make QNN-EP a shared library. 
--------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Changming Sun <chasun@microsoft.com> Co-authored-by: Peishen Yan <peishen.yan@intel.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Hector Li <hecli@microsoft.com> Co-authored-by: Jian Chen <cjian@microsoft.com> Co-authored-by: Alexis Tsogias <1114095+Zyrin@users.noreply.github.com> Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: sushraja-msft <44513542+sushraja-msft@users.noreply.github.com> Co-authored-by: Wanming Lin <wanming.lin@intel.com> Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com> Co-authored-by: Caroline Zhu <wolfivyaura@gmail.com>	2025-01-23 09:12:03 -08:00
Ashrit Shetty	df873177eb	Update win-ort-main to tip main 250116 (#23398 ) ### Description This PR is to update the win-ort-main branch to the tip main branch as of 2025-01-16. ### Motivation and Context This update includes the OpenVino fix for debug builds. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: Liqun Fu <liqun.fu@microsoft.com> Signed-off-by: Junze Wu <junze.wu@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com> Co-authored-by: Yueqing Zhang <yuz75@Pitt.edu> Co-authored-by: amancini-N <63410090+amancini-N@users.noreply.github.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: liqun Fu <liqfu@microsoft.com> Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com> Co-authored-by: yf711 <yifanl@microsoft.com> Co-authored-by: Wanming Lin <wanming.lin@intel.com> Co-authored-by: wejoncy <wejoncy@163.com> Co-authored-by: wejoncy <wejoncy@.com> Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: Changming Sun <chasun@microsoft.com> Co-authored-by: Jean-Michaël Celerier <jeanmichael.celerier+github@gmail.com> Co-authored-by: Dmitry Deshevoy <mityada@gmail.com> Co-authored-by: xhcao <xinghua.cao@intel.com> Co-authored-by: Yueqing Zhang <yueqingz@amd.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com> Co-authored-by: Wu, Junze <junze.wu@intel.com> Co-authored-by: Jian Chen <cjian@microsoft.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Matthieu Darbois <mayeut@users.noreply.github.com> Co-authored-by: Prathik Rao <prathik.rao@gmail.com> Co-authored-by: wonchung-microsoft <wonchung@microsoft.com> Co-authored-by: Vincent Wang <wangwchpku@outlook.com> Co-authored-by: PARK DongHa <luncliff@gmail.com> Co-authored-by: Hector Li 
<hecli@microsoft.com> Co-authored-by: Sam Webster <13457618+samwebster@users.noreply.github.com> Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com> Co-authored-by: Corentin Maravat <101636442+cocotdf@users.noreply.github.com> Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jie Chen <jie.a.chen@intel.com> Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com> Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Co-authored-by: Ted Themistokleous <107195283+TedThemistokleous@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Artur Wojcik <artur.wojcik@outlook.com> Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com> Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com> Co-authored-by: ikalinic <ilija.kalinic@amd.com> Co-authored-by: sstamenk <sstamenk@amd.com> Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com> Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>	2025-01-16 15:20:25 -08:00
Yulong Wang	6806174096	fix webgpu delay load test (#23157 ) ### Description This change fixes the WebGPU delay load test. <details> <summary>Fix UB in macro</summary> The following C++ code outputs `2, 1` in MSVC, while it outputs `1, 1` in GCC:
```c++
#include <iostream>
#define A 1
#define B 1
#define ENABLE defined(A) && defined(B)
#if ENABLE
int x = 1;
#else
int x = 2;
#endif
#if defined(A) && defined(B)
int y = 1;
#else
int y = 2;
#endif
int main() { std::cout << x << ", " << y << "\n"; }
```
Clang reports `macro expansion producing 'defined' has undefined behavior [-Wexpansion-to-defined]`. </details> <details> <summary>Fix condition of build option onnxruntime_ENABLE_DELAY_LOADING_WIN_DLLS</summary> Delay load is explicitly disabled when the Python binding is being built. This change modifies the condition accordingly. </details>	2024-12-20 13:37:12 -08:00
Changming Sun	fcc34da5e9	Fix a tiny problem in winml.cmake (#23173 ) ### Description CMake's [target_link_libraries](https://cmake.org/cmake/help/latest/command/target_link_libraries.html#id2) function accepts a plain library name (like `re2`), a target name (like `re2::re2`), or some other kinds of names. Plain library names are old-fashioned and kept for compatibility only. We should use target names. ### Motivation and Context To make vcpkg work with winml build. See #23158	2024-12-20 11:48:43 -08:00
Dmitri Smirnov	00b262dbb4	Implement pre-packed blobs serialization on disk and their memory mapping on load (#23069 ) ### Description Pre-packing is a feature that allows kernels to rearrange weight data to gain performance at inference time. Currently, pre-packed blobs are shared only when cross-session weight sharing is enabled, and only for those weights that the user marks as shared. Otherwise, the data resides on the heap and is owned by the kernels, so it may be duplicated. This change enables pre-packed data to be stored on disk alongside the external initializers. The pre-packed blobs are memory mapped and loaded into either the cross-session shared container or a new container that shares pre-packed blobs within the session. With the new approach, pre-packed blobs are always owned by the shared container, using the existing pre-pack mechanism for sharing. When cross-session sharing is enabled, the external container owns the data. When it is not enabled, a separate container owned by the root `SessionState` owns and shares the data. To facilitate this new approach, we introduce a new container that works in two modes. In the first mode, when an optimized model is being saved and pre-packed weights saving is enabled, the new container records pre-packed blobs and serializes them to disk using the existing `ToGraphProtoWithExternalInitializers` function. To externalize the pre-packed weights, we introduce a new session option, `kOrtSessionOptionsSavePrePackedConstantInitializers`. Note that pre-packing must be enabled (the default) for this to work. The `ToGraphProtoWithExternalInitializers` function is modified to recurse into subgraphs to make sure we properly account for local initializer names. In the second mode, the container simply holds the pre-packed weights memory-mapped from disk and shares them with the kernels. ### Motivation and Context Reduce memory usage by pre-packed initializers and externalize them.	2024-12-20 10:49:08 -08:00
xhcao	29bccad96d	[webgpu] fix compiling error (#23139 )	2024-12-20 09:05:23 -08:00
mingyue	4aca8f33df	[Bug Fix] Missing CustomOp SchemaRegister when generating EPContext ONNX model (#23091 ) ### Description Enhancements to EPContext Operations: 1. Introduced support for the bfloat16 data type in EPContext operations. 2. Bug Fix: Missing Custom OP Schema Registration when generating the EPContext ONNX model --------- Co-authored-by: mingyue <mingyue@xilinx.com> Co-authored-by: Hector Li <hecli@microsoft.com>	2024-12-19 16:47:13 -08:00
Jiajia Qin	7c782f6741	[webgpu] Always use tile matmulnbits for block_size = 32 (#23140 ) ### Description After the prefill-time optimization in #23102, always using the tiled matmulnbits with block_size = 32 appears to bring better performance for the phi3 model, even on discrete GPUs. Phi3 goes from 32.82 tokens/sec to 42.64 tokens/sec in easy mode on my NV RTX 2000 GPU.	2024-12-19 16:22:53 -08:00
Yulong Wang	b4a6a0d511	[WebGPU EP] allows GPUDevice to be released after use (#23144 ) ### Description This change allows the `WebGpuContext` class to be released after all active inference sessions are released. This will cause: - for default context (ID=0), the underlying `wgpu::Device` and `wgpu::Adapter` to be released, together with all resources created by the Device. - for custom context (ID>0), the reference counts of passed in Instance, Adapter and Device will decrement correctly.	2024-12-19 15:33:40 -08:00
Yifan Li	d9d07ad8ae	[TensorRT EP] support TensorRT 10.7-GA (#23011 ) ### Description Update CIs to TRT10.7	2024-12-19 10:39:15 -08:00
Yifan Li	a3bb3f1487	[TensorRT EP] New CIs to test TRT+minimal CUDA build (#23028 ) ### Description New CIs: [Linux_TRT_Minimal_CUDA_Test_CI](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=230&_a=summary) and [Win_TRT_Minimal_CUDA_Test_CI](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=231). They are configured to check that ORT-TRTEP still builds with minimal CUDA. * The yaml content follows the Linux TRT CI yaml, with a different build arg and cache name. * The build arg follows [[TensorRT EP] Enable a minimal CUDA EP compilation without kernels](https://github.com/microsoft/onnxruntime/pull/19052#issuecomment-1888066851). ### Motivation and Context Monitor whether users can build ORT-TRTEP-minimalCUDA without any blocker (the build takes ~30min).	2024-12-19 10:30:39 -08:00
Yulong Wang	8680244ebc	Fix delay load for WebGPU EP and DML EP (#23111 ) ### Description This change fixes the DLL delay load problem for the WebGPU EP and DirectML EP. See detailed explanation below. ### Problem When onnxruntime.dll uses delay loading for its dependencies, the dependencies are loaded using `LoadLibraryEx()`, which searches the directory of the process (.exe) instead of the directory of this library (onnxruntime.dll). This is a problem for the Node.js binding and the Python binding, because Windows will try to find the dependencies in the directory of node.exe or python.exe, which is not the directory of onnxruntime.dll. There was a previous attempt to fix this by loading DirectML.dll during initialization of the onnxruntime Node.js binding. That works for the DML EP but is not a good solution because it does not really "delay" the load. For WebGPU, the situation became worse because webgpu_dawn.dll depends on dxil.dll and dxcompiler.dll, which are explicitly dynamically loaded in the code using `LoadLibraryA()`. This has the same DLL search problem. ### Solutions For onnxruntime.dll loading its direct dependencies, the problem can be resolved by setting the [`__pfnDliNotifyHook2` hook](https://learn.microsoft.com/en-us/cpp/build/reference/understanding-the-helper-function?view=msvc-170#structure-and-constant-definitions) to load from an absolute path constructed from the onnxruntime.dll folder and the DLL name. For webgpu_dawn.dll loading dxil.dll and dxcompiler.dll, the hook does not work because they are explicitly loaded in the code. Instead, the problem can be resolved by ~~using WIN32 API `SetDllDirectory()` to add the onnxruntime.dll folder to the search path.~~ preloading the 2 DLLs from the onnxruntime.dll folder.	2024-12-19 10:23:48 -08:00
Yulong Wang	780735098d	[nodejs binding] Fix building in latest clang (#23146 ) ### Description This change fixes the build break for Node.js binding on latest AppleClang: ``` ...tensor_helper.cc:65:5 error: integer value -1 is outside of the valid range of values [0,15] for the enumeration type 'napi_typedarray_type' [-Wenum-constexpr-conversion] ``` Use the underlying type of enum `napi_typedarray_type` for `DATA_TYPE_TYPEDARRAY_MAP` to solve this issue. Because the underlying type is implementation defined (it's `int` for MSVC and `unsigned int` for Clang), we use `std::underlying_type_t` to get the correct type.	2024-12-19 10:23:27 -08:00
Yulong Wang	ae6dcc839e	Revert "[js/webgpu] disable failed tests temporarily (#23127 )" (#23130 ) ### Description This reverts commit `9115682d69`. ### Motivation and Context	2024-12-18 18:07:50 -08:00
Prathik Rao	31e6e1010c	gather elements webgpu implementation (#23137 ) Increases operator coverage for WebGPU EP.	2024-12-18 16:29:26 -08:00
Changming Sun	5d7030e4c6	Revert DML pipeline changes (#23135 ) ### Description Previously we wanted to add DirectML EP to existing onnxruntime Windows CUDA packages. After careful consideration, we will postpone the change. This PR reverts some pipeline changes previously made by @mszhanyi and @jchen351 .	2024-12-18 10:42:10 -08:00
Changming Sun	e76bd2f5e9	Update CODEOWNERS: remove onnxruntime-es (#21677 ) Removing this restriction for now.	2024-12-17 13:39:13 -08:00
Wanming Lin	a5b60ec03f	[WebNN] Add limit to QDQ ops (#23076 ) WebNN requires the `scale_shape` to be a subsample of the `input_shape`.	2024-12-17 12:52:08 -08:00
Enrico Galli	54edb43e77	[WebNN] Fixes MLTensor caching across different contexts (#23100 ) We weren't checking that MLTensors were from the same context before reusing them. Found while debugging microsoft/webnn-developer-preview#69	2024-12-17 12:51:16 -08:00
Tianlei Wu	5afab787db	Update python version metadata (remove 3.7, 3.8, 3.9; add 3.13). (#23067 ) ### Description * Update python version metadata to be in sync with latest python packages (onnxruntime, onnxruntime-gpu and onnxruntime-qnn). * Update black format target-version to 3.10, and use lintrunner to format all files. * Update the lintrunner installation command line to be consistent. * Include `requirements-lintrunner.txt` in `requirements-dev.txt` to avoid duplicated settings. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/22993 Python support by numpy: https://numpy.org/neps/nep-0029-deprecation_policy.html#drop-schedule ``` On Apr 05, 2024 drop support for Python 3.9 On Apr 04, 2025 drop support for Python 3.10 ```	2024-12-17 10:59:20 -08:00
Jiajia Qin	0981bbf4ca	[webgpu] Optimize matmulnbits with M > 1 (#23102 ) This is the webgpu native ep implementation of #23092. I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype to test. Meanwhile, applied https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype/pull/2 to print the first token time. The result is like below:
The latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
Decoding first token with input 449 tokens: 13.0 sec
Decoding remaining 210 tokens: 11.8 sec
17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
Decoding first token with input 449 tokens: 7.3 sec
Decoding remaining 210 tokens: 7.0 sec
29.81 tokens/sec
```
-------------------------------------------------------------------------
With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
Decoding first token with input 449 tokens: 8.5 sec
Decoding remaining 208 tokens: 12.1 sec
17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
Decoding first token with input 449 tokens: 4.1 sec
Decoding remaining 210 tokens: 7.2 sec
28.98 tokens/sec
```
From above data, you can see that with this PR, both intel (13s -> 8.5s) and NV (7.3s -> 4.1s) GPUs for the first token time are performing better.	2024-12-16 20:47:40 -08:00
Yulong Wang	9115682d69	[js/webgpu] disable failed tests temporarily (#23127 ) ### Description Those test cases started failing for unknown reasons. To unblock the CI, I disabled them temporarily to buy time to investigate the root cause.	2024-12-16 15:35:47 -08:00
Dmitri Smirnov	ae97068137	Fix Pybind memory leak (#23105 ) ### Description Array GETITEM returns a new reference, which causes a leak. ### Motivation and Context Address https://github.com/microsoft/onnxruntime/issues/22271	2024-12-16 10:38:23 -08:00
tianf-fff	a4eb8f27b6	[VitisAI] Add profiler interface for vitisai (#23032 ) ### Description Add common interfaces for vitis ep profiler. ### Motivation and Context Vitis ep can collect and record api and kernel timestamps in file when onnxruntime '-p' is enabled.	2024-12-16 09:09:48 -08:00
Changming Sun	2ff66b80e0	Fix a deadlock bug in EigenNonBlockingThreadPool.h (#23098 ) ### Description This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with weakly ordered memory model, such as ARM64.	2024-12-16 09:05:12 -08:00
Yulong Wang	3a0b958586	add 2 CMake build options of Dawn (#23096 ) ### Description This change adds the following CMake build options for Dawn:
- onnxruntime_BUILD_DAWN_MONOLITHIC_LIBRARY
  - OFF by default
  - when enabled, builds Dawn as a monolithic library (webgpu_dawn.dll)
- onnxruntime_ENABLE_DAWN_BACKEND_VULKAN
  - OFF by default
  - when enabled, build with Vulkan backend for Dawn on Windows
- onnxruntime_ENABLE_DAWN_BACKEND_D3D12
  - ON by default
  - when enabled, build with DirectX 12 backend for Dawn on Windows

### File Size Comparison (Windows)

| Build | cmdline | File Size |
|---|---|---|
| Baseline | --config Release<br/> --build_shared_lib | `12,755,456 onnxruntime.dll` |
| WebGPU D3D12 (default) | --use_webgpu<br/> --config Release<br/> --build_shared_lib | `17,082,368 dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`18,708,480 onnxruntime.dll` |
| WebGPU D3D12+Vulkan | --use_webgpu<br/> --config Release<br/> --build_shared_lib<br/> --cmake_extra_defines<br/> onnxruntime_ENABLE_DAWN_BACKEND_D3D12=1<br/> onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=1 | `17,081,344 dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`19,388,416 onnxruntime.dll` |
| WebGPU Vulkan | --use_webgpu<br/> --config Release<br/> --build_shared_lib<br/> --cmake_extra_defines<br/> onnxruntime_ENABLE_DAWN_BACKEND_D3D12=0<br/> onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=1 | `17,615,872 onnxruntime.dll` |
| Monolithic | --use_webgpu<br/> --config Release<br/> --build_shared_lib<br/> --cmake_extra_defines<br/> onnxruntime_BUILD_DAWN_MONOLITHIC_LIBRARY=1 | `17,082,368 dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`13,277,696 onnxruntime.dll`<br/>` 5,616,640 webgpu_dawn.dll` |
| External Dawn | --use_webgpu<br/> --config Release<br/> --build_shared_lib<br/> --cmake_extra_defines<br/> onnxruntime_USE_EXTERNAL_DAWN=1<br/> --skip_tests | `17,081,344 dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`13,277,184 onnxruntime.dll` |

	2024-12-13 16:05:48 -08:00
genmingz@AMD	62e7e24f17	Add attrProto.release_s interface (#22977 ) ### Description Add AttributeProto.release_s interface, which is used to obtain the string in the attribute using move semantics instead of copying it ### Motivation and Context The ep_context node stores a lot of information in attributes, which may cause the memory usage to increase. Use this interface to avoid memory waste --------- Co-authored-by: GenMing Zhong <genmingz@xlnx.xilinx.com> Co-authored-by: genmingz <genmingz@amd.com>	2024-12-12 21:13:43 -08:00
Hector Li	2a36fd4f6e	Fix the ctx_gen tool to make sure all generated ctx.onnx have max_size (#23097 ) ### Description Fix the qnn_ctx_gen tool to make sure all generated ctx.onnx have max_size	2024-12-12 21:12:02 -08:00
Hector Li	f43f40facf	Backward compatible with old QNN version (#23095 ) ### Description Make QNN EP compilable with old QNN version	2024-12-12 17:04:20 -08:00
Yulong Wang	01539ee7ab	[js/webgpu] fix Conv2DMatMul shader's out-of-bound read (#23085 ) ### Description Fix a bug caused by potential out-of-bound reads of `W` in the Conv2DMatMul shader. ### Motivation and Context Fixes #22983	2024-12-12 11:33:53 -08:00
Dmitri Smirnov	890a719c91	Remove deprecated static from Eigen that contributes to size increase (#23084 ) ### Description This patches Eigen source to remove an unused deprecated static var. ### Motivation and Context Internal customer request.	2024-12-12 10:19:47 -08:00
Ankit Maheshkar	1f88284f96	OVEP 1.21.0 Development Updates (#23080 ) ### Description OVEP development changes for ORT 1.21 Release ### Motivation and Context - Has Critical Bug Fixes - Improved Performance optimizations for both memory & inference latency (https://github.com/intel/onnxruntime/pull/513) - Enabled Model Compilation using NPUW (https://github.com/intel/onnxruntime/pull/508) - Fixed support for EPContext embed mode 0 for lower memory utilization - Updated NuGet package name as `Intel.ML.OnnxRuntime.OpenVino` - Fixed QDQ Stripping logic on NPU	2024-12-11 22:26:32 -08:00
Hector Li	ebb968d34a	disable the EP context embed model by default in session option (#23070 ) change the default value for session option ep.context_embed_mode to 0 to avoid the model loading memory overhead	2024-12-11 17:26:29 -08:00
Yulong Wang	e605870783	[js/web] Update API for `ort.env.webgpu` (#23026 ) ### Description This PR is a replacement of #21671. It offers a new way of accessing the following:
- `ort.env.webgpu.adapter`:
  - deprecating. There is no point in getting its value. Once `GPUDevice.adapterInfo` is supported, there is no point in setting it either.
- `ort.env.webgpu.device`:
  - set: accepts a `GPUDevice` if the user created one. Use at your own risk.
  - get: returns a `Promise<GPUDevice>`. If no device exists, a new one is created; otherwise the existing one is returned.
- `ort.env.webgpu.powerPreference`:
  - deprecating. Users are encouraged to set `ort.env.webgpu.device` if necessary.
- `ort.env.webgpu.forceFallbackAdapter`:
  - deprecating. Users are encouraged to set `ort.env.webgpu.device` if necessary.
	2024-12-11 10:24:14 -08:00
sushraja-msft	8800830a44	Implement 2d tiled matmulnbits specialized for prefill (#23058 ) ### Description This change implements matmul4bits with tiling both for A and B. This is beneficial for prefill scenarios on Intel integrated GPUs, because each row of A has to run through the same set of shared rows of B. This change should improve core occupancy and model_benchmark does indicate improvements for prefill. The same shader is not used for generation because when A has just a single row, the other threads in the workgroup get unused and that hurts performance. ``` -- Baseline run on an Alderlake GPU -- C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500 Batch size: 1, prompt tokens: 501, tokens to generate: 128 Prompt processing (time to first token): avg (us): 1.72338e+07 avg (tokens/s): 29.0707 << p50 (us): 1.72548e+07 stddev (us): 57012.8 n: 5 * 501 token(s) Token generation: avg (us): 79227.5 avg (tokens/s): 12.6219 p50 (us): 79284.4 stddev (us): 2109.72 n: 635 * 1 token(s) Token sampling: avg (us): 15.8198 avg (tokens/s): 63211.8 p50 (us): 14.3 stddev (us): 8.67178 n: 640 * 1 token(s) E2E generation (entire generation loop): avg (ms): 27297.8 p50 (ms): 27269.8 stddev (ms): 89.4322 n: 5 Peak working set size (bytes): 5490987008 WebGPU device lost (2): Device was destroyed. 
----------------------------------- With Prefill Optimization ---- C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500 Batch size: 1, prompt tokens: 501, tokens to generate: 128 Prompt processing (time to first token): avg (us): 1.2135e+07 avg (tokens/s): 41.2856 << p50 (us): 1.21288e+07 stddev (us): 21282.1 n: 5 * 501 token(s) Token generation: avg (us): 78945.3 avg (tokens/s): 12.667 p50 (us): 78900.7 stddev (us): 2232.43 n: 635 * 1 token(s) Token sampling: avg (us): 20.5608 avg (tokens/s): 48636.3 p50 (us): 18.7 stddev (us): 19.0409 n: 640 * 1 token(s) E2E generation (entire generation loop): avg (ms): 22163.8 p50 (ms): 22160.1 stddev (ms): 31.3122 n: 5 Peak working set size (bytes): 5478862848 WebGPU device lost (2): Device was destroyed. ```	2024-12-10 17:07:11 -08:00
amancini-N	d8de3c4096	[CUDA EP] Fix BeamSearch on T5 with sequence_as_input_ids (#20667 ) (#20668 ) ### Description Change the implementation of BeamSearch op when using CUDA EP: in case of T5 model, and in case the decoder input_ids are sequences, copy the sequences device-to-device instead of host-to-device ### Motivation and Context - Fixes #20667	2024-12-10 16:20:47 -08:00
shiyi	02f0af0d08	[WebNN] Improve data type check of slice op (#22988 ) A follow-up of [[WebNN] Support negative steps for slice](https://github.com/microsoft/onnxruntime/pull/22871#discussion_r1847929774). Slice op is emulated by reverse+slice when steps < 0 so `SliceOpBuilder::HasSupportedInputsImpl()` should also check the supported data types of reverse. --------- Co-authored-by: Wanming Lin <wanming.lin@intel.com>	2024-12-10 15:48:16 -08:00
Edward Chen	fa6ad202aa	Minor updates to onnxruntime_java.cmake (#23068 ) - Use `ANDROID` instead of `CMAKE_SYSTEM_NAME STREQUAL "Android"`. - Put common gradle arguments into `COMMON_GRADLE_ARGS` to make them easier to reuse.	2024-12-10 15:44:36 -08:00
Jiajia Qin	defcc4f819	[webgpu] Optimize Expand (#23052 ) ### Description Use components = 4 if possible. This is the webgpu native implementation from #22752	2024-12-10 14:58:57 -08:00
Misha Chornyi	bf4d3e1a5b	Update vcpkg.json - lock flatbuffer version (#23046 ) ### Description Locking version introduced in: `03ea5dc495/onnxruntime/core/flatbuffers/schema/ort_training_checkpoint.fbs.h (L11-L13)` ### Motivation and Context Resolve issue for version `>=1.20.` https://github.com/microsoft/onnxruntime/issues/22666	2024-12-10 11:23:01 -08:00
Jian Chen	5f7b9d0245	Upgrade gradle to 8.7 (#23016 ) ### Description This PR only upgrades the gradle version and the `com.android.tools.build:gradle` version in build.gradle. It only updates the react-native library gradle version, not the e2e test.	2024-12-10 10:49:03 -08:00
A-Satti	b14b4ec703	Restore Qspectre flag (#23060 ) Restore a removed Qspectre flag and update comment ### Motivation and Context Adjustment for PR `f5293d253c`	2024-12-09 21:52:21 -08:00
Scott McKay	708ee8556e	Reduce default logger usage (#23030 ) ### Description We have use cases where multiple sessions are created concurrently. Minimizing the usage of the default logger is important for these scenarios. Wire through the session logger to as many places as possible. The EP logger can also be used once the session is created (can't be used during EP construction/kernel registration but can be used in GetCapability and Compile). ### Motivation and Context Improve logging when there are concurrent sessions.	2024-12-10 12:54:14 +11:00
wejoncy	e12421be30	[CoreML] more performance flag (#22975 ) ### Description Refactor Unsqueeze's implementation, add more flags to boost performance, and add a profile flag. --------- Co-authored-by: jicwen <jicwen@YiMacBook-Pro.local> Co-authored-by: wejoncy <wejoncy@.com> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2024-12-10 09:35:05 +08:00


12130 commits