onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-16 18:31:27 +00:00

Author	SHA1	Message	Date
mindest	90e8c8daaf	profile_explorer: add op-kernel correlation info (#15946 ) ### Description * Add aggregated op-kernel correlation information in profiler explorer when running inference session. * Add filtering feature so that we can focus on model runs of interest (excluding warmup steps, etc.)	2023-05-30 23:25:43 +08:00
Yi Zhang	31fc25d2c2	[Fix] Check if CUDA is downloaded in AGENT_TEMPDIRECTORY (#16142 ) ### Description supplement of #15915 ### Motivation and Context fix nuget pipeline exception in the stage of Final_Jar_Testing_Windows_GPU ``` JUnit Jupiter:ProviderOptionsTest:testCUDAOptions() MethodSource [className = 'ai.onnxruntime.providers.ProviderOptionsTest', methodName = 'testCUDAOptions', methodParameterTypes = ''] => ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: D:\a\_work\1\s\onnxruntime\core\session\provider_bridge_ort.cc:1131 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\cloudtest\AppData\Local\Temp\onnxruntime-java17193857285260738736\onnxruntime_providers_cuda.dll" ``` ### Verification https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=313476&view=results	2023-05-30 13:14:08 +08:00
Jian Chen	6abdc3a87b	Fix static analysis bug (#16114 )	2023-05-28 10:58:07 -07:00
Yi Zhang	73584f9360	More fixes on nuget pipeline (#16091 ) ### Description 1. Pipeline parameters could not be compared as strings; change them to booleans. 2. Windows_CI_GPU_DML_DEV_arm64 on the pool onnxruntime-Win-CPU-2022 failed the prefast step; change the pool to aiinfra-dml-winbuild. 3. Skip test_zfnet512, which fails in Nuget_Test_Win_Training_CPU. Todo: only Final_Jar_Testing_Windows_GPU still fails. https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=313042&view=logs&s=d66543d5-16de-5a48-6ecb-a36e21ff8d4d&j=d9489789-5e39-5a05-13ab-9aaf7b4d386f	2023-05-27 08:59:12 +08:00
Alexander Visheratin	415c26e46e	[JS/WebGPU] Squeeze operator implementation (#16024 ) ### Description This PR adds an implementation of the `Squeeze` operator to WebGPU JSEP. The implementation follows the [operator schema](https://github.com/onnx/onnx/blob/main/docs/Operators.md#Squeeze) and allows one or two inputs. ### How was it tested 1. I created two models. Without `axes`: ```Python import onnx.helper node = onnx.helper.make_node( "Squeeze", inputs=["T"], outputs=["y"], ) graph = onnx.helper.make_graph([node], "test", [onnx.helper.make_tensor_value_info("T", 1, [3, 1, 4, 5])], [onnx.helper.make_tensor_value_info("y", 1, [3, 4, 5])]) onnx.save(onnx.helper.make_model(graph), "squeeze.onnx") ``` And with `axes`: ```Python import onnx.helper node = onnx.helper.make_node( "Squeeze", inputs=["T", "axes"], outputs=["y"], ) graph = onnx.helper.make_graph([node], "test", [onnx.helper.make_tensor_value_info("T", 1, [3, 1, 4, 5]), onnx.helper.make_tensor_value_info("axes", 7, [1])], [onnx.helper.make_tensor_value_info("y", 1, [3, 4, 5])]) onnx.save(onnx.helper.make_model(graph), "squeeze-dim.onnx") ``` 2. I compiled the runtime using @fs-eire's [instructions](https://gist.github.com/fs-eire/a55b2c7e10a6864b9602c279b8b75dce). 3. I ran the test models in the browser using this minimal setup: ```HTML <html> <script src=".\dist\ort.webgpu.min.js"></script> <script> async function run() { const session = await ort.InferenceSession.create('squeeze-dim.onnx', {executionProviders: ['webgpu']}); console.log(session); const input = new ort.Tensor('float32', new Float32Array(60), [3, 1, 4, 5]); const dim = new ort.Tensor('int64', [-3n], [1]); const output = await session.run({ "T": input, "axes": dim }); console.log(output); } run(); </script> </html> ``` ### Motivation and Context Improve operator coverage for WebGPU JSEP.	2023-05-26 15:53:05 -07:00
Scott McKay	5e41d1600a	Add new QNN CIs to azp run tool (#16109 ) ### Description Add 2 new QNN CIs to tools/python/run_CIs_for_external_pr.py ### Motivation and Context Update tool so it runs all current CIs	2023-05-27 08:46:16 +10:00
Dmitri Smirnov	9939092e71	[CPP API]Fix constness in C++API (#16103 ) ### Description `CreateMap` and `CreateSequence` should be able to take in const data.	2023-05-26 14:09:00 -07:00
Jeff Bloomfield	54fdb640fe	Address performance regression with duplicate initializers across DML partitions (#16087 ) This addresses a DML performance regression introduced by the constant sharing pass. The constant sharing pass identifies small initializer tensors which contain identical values and merges them. This could have the effect of causing DML to treat those tensors as non-constant and skip certain optimization. To prevent this, there is now an element count threshold below which the DML EP will enable this optimization, even though it results in duplicate work uploading and pre-processing the common tensor at multiple operators.	2023-05-26 13:37:34 -07:00
Changming Sun	a5410515ad	Fix: Some fields in OrtCUDAProviderOptionsV2 struct are not initialized (#16113 ) ### Description The file include/onnxruntime/core/providers/cuda/cuda_provider_options.h is a C++ file. It is not for C. Before this commit, this header file is already not compatible with C compilers. Because it has: ``` onnxruntime::ArenaExtendStrategy arena_extend_strategy; ``` And this file is intended to be internal only. It is an internal header file. It should not be included in onnxruntime_c_api.h and should not be used with the public C APIs. User can only get the instance of OrtCUDAProviderOptionsV2 via CreateCUDAProviderOptions. In such a way we can add new members to this struct without breaking binary compatibility. Since it is an internal header, we can safely use C++ grammar there.	2023-05-26 11:34:22 -07:00
cao lei	4ab7d410ae	ExecutionProvider API refactor - Deattach allocator from EP by creating local cpu allocator instead (#16084 ) ### Description ExecutionProvider API refactor - Detach allocator from EP by creating local cpu allocator instead ### Motivation and Context This PR is a refactor that creates a local CPU allocator instead of getting the allocator from ExecutionProvider. The final goal is to fully detach allocators from ExecutionProvider and keep them at the session level, indexed by OrtDevice.	2023-05-26 04:54:42 -07:00
Edward Chen	4bfb8d3303	Update calls to OrtArenaCfg constructor to pass additional parameter. (#16104 ) Update calls to OrtArenaCfg constructor to pass additional parameter. Updating some call sites after change in #15983. Fix CI build.	2023-05-26 12:41:42 +08:00
cloudhan	2cf0ae7d01	[ROCm] Add AttentionMode to make attention logic streamline (#15978 ) Refactor for future kv cache change.	2023-05-26 12:06:36 +08:00
Skand Hurkat	b28e927ca4	Read AA64ISAR0_EL1 to check dot product support (#16082 ) ### Description Use an assembly instruction to read the `AA64ISAR0_EL1` register for dot product support. ### Motivation and Context The only reliable way to check for supported instruction extensions in ARM is to query the instruction set attribute registers. [Dot product instructions can be checked using bits 47:44 in the AA64ISAR0_EL1 register](https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers/ID-AA64ISAR0-EL1--AArch64-Instruction-Set-Attribute-Register-0?lang=en#fieldset_0-47_44). On `qemu-aarch64` with the `a64fx` cpu which does not support the dot product instructions, running a quantized BERT-Large (from MLPerf) results in `SIGILL`. With the change, the program continues without using the dot product instructions. Also verified that `S8S8_SDOT` kernels are invoked when running on hardware that supports dot product instructions. --------- Co-authored-by: Skand Hurkat <skhurkat@microsoft.com>	2023-05-25 17:05:30 -07:00
Wanming Lin	0d1a8cc651	[WebNN EP] Use NCHW as preferred layout for DML backend (#16037 ) To improve performance on DML backend.	2023-05-25 09:47:41 -07:00
Yuhong Guo	04a8f50674	New configuration to limit the arena extension (#15983 ) Add a configuration `max_power_of_two_extend_bytes` to limit the arena extension size. ### Motivation and Context In our real scenario, we observe that if the model is big enough, the BfcArena extends uncontrollably. As shown in the following figures, if a model uses more than 16GB of memory, the BfcArena requests 32GB in total under the `kNextPowerOfTwo` strategy. With the new strategy, the extension is limited. The default maximum extension size is 1GB. #### Without the new configuration After loading the model, ORT uses 32G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/42b93c66-b957-4f20-a13b-d34cb390afff) #### With the new configuration After loading the model, ORT uses 23G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/5abffeff-9ca3-4187-a262-37fd2764fe1b) Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-05-25 02:19:07 -07:00
Changming Sun	60bb07307b	Fix the TRT GPU build job in python packaging pipeline (#16073 ) 1. Cherry-pick #16054 back to the main branch 2. Replace onnxruntime-gpu-winbuild-t4 with onnxruntime-Win2022-GPU-T4. The latter one has VS2022. --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-05-25 00:09:08 -07:00
Changming Sun	cc0c5e5612	Fix an error in test/shared_lib/test_inference.cc (#16090 ) ### Description Fix an error in test/shared_lib/test_inference.cc. It should use ASSERT_NEAR to test float values. ### Motivation and Context Our OpenVino pipeline is failing because of this.	2023-05-24 22:59:28 -07:00
Yi Zhang	76fd9aa745	[Fix] Some pipelines have to be using VS2019 (#16034 ) ### Description ### Motivation and Context Fix nuget and python package pipeline. 1. ARM 32 build isn't officially supported by VS2022. https://developercommunity.visualstudio.com/t/Compilation-Error-with-VS2022-ARM/10285309 2. onnxruntime-gpu-winbuild-T4 and onnxruntime-gpu-winbuild-tensorrt8-T4 don't have VS 2022	2023-05-25 09:55:35 +08:00
pengwa	34fe8fb069	Type hint for ORTModule (#15938 ) ### Type hint for ORTModule Add type hints for ORTModule. Refine comments. The reason for removing the interface execution_session_run_forward from `orttraining/orttraining/python/training/ortmodule/_graph_execution_manager.py`: PR `cc275e7529 (diff-497e18dc8878818205b81fd80f85942548d8aa15d0f1204ce3e3d9795e3dd195)` and some commits before it break the function interface contract between the parent class in _graph_execution_manager.py and its children in _training_manager.py and _inference_manager.py, so there is no need to keep this interface. ### Other EE work opportunities 1. Use logger correctly. 2. Remove some duplicated logic that parses inputs/outputs recursively. 3. Clean up environment variable usage.	2023-05-25 09:28:20 +08:00
Sumit Agarwal	70d2dc8209	[DML EP] Fix issue with --dml_path build option (#15972 ) ### Description The DML_PACKAGE_DIR cmake variable is not set properly when the dml_path build option is used. ### Motivation and Context It is required for the DML Perf dashboard.	2023-05-24 19:20:40 -05:00
Zhang Lei	63c9973b7a	Fix cuda provider crash on it (#16056 )	2023-05-24 16:13:11 -07:00
yf711	105f5f0f20	Avoid trt deprecated api warnings shown as errors during ORT-TRT build (#16035 ) ### Description Avoid trt deprecated api warnings being shown as errors when building onnxruntime_test_all. This issue is only visible when installing trt via binaries, rather than deb/rpm pkg (CI pipelines). The change is similar to the existing set_property for onnxruntime_providers_tensorrt `89ea503024/cmake/onnxruntime_providers.cmake (L421)` ### Motivation and Context onnxruntime/test/unittest_main/[test_main.cc](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/unittest_main/test_main.cc#L32) includes nvinfer.h, which includes deprecated trt apis and generates warnings. When building onnxruntime_test_all, the warnings are shown as errors and block the build. ### Doubts Although this issue is visible with trt tar binaries but not with trt deb/rpm pkgs, their file sizes and hashes are the same (creation times vary) across the two ways of installing headers/libs. \| tarBin \| pkg \| \| ------------------------------------------------------------ \| ------------------------------------------------------------ \| \| 997284784 Apr 26 15:15 libnvinfer_builder_resource.so.8.6.1 \| 997284784 Apr 26 22:21 libnvinfer_builder_resource.so.8.6.1 \| \| 235369632 Apr 26 15:14 libnvinfer.so.8.6.1 \| 235369632 Apr 26 22:21 libnvinfer.so.8.6.1 \|	2023-05-24 13:19:27 -07:00
yf711	84f1af7ff5	ort build flag fix (#16072 ) ### Description * Sync and clean build flag `--use_tensorrt_builtin_parser` from existing CI config as this becomes default flag * cuda version update	2023-05-24 12:32:10 -07:00
Guenther Schmuelling	20857c4ff2	workaround test failure in ci (#16070 ) don't run wasm proxy test on debug build to unblock ci. Needs some longer debugging.	2023-05-24 21:01:06 +08:00
Shukant Pal	f316bc57c4	[CoreML EP] Implement Unary & Reduce operators (#15532 ) ### Description This change is a follow-up to #15327. It adds Unary operators (Sqrt, Reciprocal) and Reduce operators (ReduceSum, ReduceMean). I've tried to follow existing patterns in the code :-) ### Motivation and Context This reduces fragmentation across EPs when using CoreML on macOS, thereby speeding up execution. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-05-24 18:16:59 +10:00
Linnea May	954ea6604a	[DML EP] Register pad18 (#15985 ) ### Description Pad18 adds the `axes` input, which indicates which axes the padding values should be applied to. Add logic to manipulate paddings into DML padding operator inputs. --------- Co-authored-by: Linnea May <linneamay@microsoft.com>	2023-05-23 18:25:36 -07:00
Wanming Lin	bcd8b73343	[WebNN EP] Upgrade max supported opset to 19 (#16036 )	2023-05-23 18:02:20 -07:00
Adrian Lizarraga	efc84a43e8	[QNN EP] Add session option to disable fallback to default CPU EP (#16016 ) ### Description Adds the session config option `disable_cpu_ep_fallback` to allow the user to prevent the CPU EP from handling nodes not supported by other execution providers. ```C++ // Graph nodes that are not supported by the execution providers (EPs) explicitly added to the session are // assigned (i.e., "fallback") to the CPU EP by default. // // This option allows the user to disable the fallback of unsupported graph nodes to the CPU EP. // If this option is set to "1", session creation will fail if the execution providers other than the CPU EP cannot // fully support all of the nodes in the graph. // // It is invalid to set this option and explicitly add the CPU EP to the session. In this case, session creation // will also fail with an error. // // Option values: // - "0": CPU EP fallback is not disabled. [DEFAULT] // - "1": CPU EP fallback is disabled. static const char* const kOrtSessionOptionsDisableCPUEPFallback = "session.disable_cpu_ep_fallback"; ``` #### Example use ```C++ #include "core/session/onnxruntime_cxx_api.h" #include "core/session/onnxruntime_session_options_config_keys.h" int main(int argc, char** argv) { Ort::SessionOptions so; so.AddConfigEntry(kOrtSessionOptionsDisableCPUEPFallback, "1"); // Disable fallback to the CPU EP. onnxruntime::ProviderOptions options; #if defined(_WIN32) options["backend_path"] = "QnnCpu.dll"; #else options["backend_path"] = "libQnnCpu.so"; #endif so.AppendExecutionProvider("QNN", options); const ORTCHAR_T* ort_model_path = ORT_MODEL_FOLDER "qnn_ep_partial_support.onnx"; Ort::Session session(*ort_env, ort_model_path, so); // Throws exception if nodes fallback to CPU // ... ``` ### Motivation and Context Makes it easier for application developers to ensure that the entire model runs on specific EPs. This is critical for Qualcomm/scenarios. If the compute cannot be offloaded to the NPU, running on CPU is not acceptable. (could be the difference between 90 second inference and 6 seconds inference) --------- Co-authored-by: Pranav Sharma <prs@microsoft.com>	2023-05-23 17:56:32 -07:00
Ryan Hill	b9d39e3405	Fix cuda Transpose bug 16039 (#16042 ) ### Description Transpose will fail in cuda for FLOAT16 for tensors larger than 1048x1048 due to our optimized case exceeding the cuda grid size of 65536. The fix is to just use our regular cuda transpose in these cases. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/16039	2023-05-23 17:19:30 -07:00
Adrian Lizarraga	96ee72d7f8	[QNN EP] Support Resize with 'asymmetric' transformation mode on HTP backend (#16060 ) ### Description - Adds support for Resize with the `asymmetric` coordinate transformation mode on the QNN HTP backend. - Adds unit test that shows this is only correct if the `nearest_mode` is `"floor"`. ### Motivation and Context This is needed to enable more models to run on the QNN HTP backend. Note: QNN's ONNX converter tool translates an ONNX Resize op with `{mode: "nearest", coordinate_transformation_mode: "asymmetric", "nearest_mode": <ANYTHING>}` to QNN's ResizeNearestNeighbor with `{align_corners: 0, half_pixel: 0}`. Unit tests show that this is only accurate if the ONNX attribute nearest_mode is "floor". Need to investigate how to handle other rounding modes. Ideally, we would use QNN's own Resize operator (instead of ResizeNearestNeighbor), but that doesn't support the "asymmetric" coordinate transformation mode on the HTP backend.	2023-05-23 16:04:19 -07:00
Scott McKay	55c3f4b28f	Fix CoreML Flatten handling of axis attribute (#16046 ) ### Description The CoreML EP implementation was not reading the axis attribute correctly causing an incorrect output shape to be produced for a Flatten node. That issue gets hidden as the Tensor to write the output to is created by the CoreML EP using the inferred output shape (which is correct) and we provide the Tensor's buffer but not the shape when executing the CoreML model. As the flatten isn't changing or moving any data nothing breaks when we test with only a Flatten node in the model. Fix the attribute name and add a test that uses a model with a Flatten followed by a Mul which requires broadcasting. Both nodes are handled by CoreML, so if the axis is not correctly processed the output from Flatten will not be broadcastable and the CoreML model execution will fail. ### Motivation and Context Bug fix.	2023-05-24 08:27:32 +10:00
Hector Li	1b498e414f	[QNN EP] Redirect Qnn log to Ort log (#16019 ) ### Description Redirect Qnn log to Ort log. Set Qnn log level align with Ort log level Always output Qnn log as Ort verbose log ### Motivation and Context Redirect Qnn log to Ort log instead of print to console. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-05-23 09:22:14 -07:00
pengwa	b457cfaa8f	Enable conditional optimization automatically (#15885 ) ### Enable conditional optimization on inputs Label sparsity based optimization can be enabled depending on the input inspection result. So this PR introduces a conditional optimization path for ORTModule, where we automatically detect data sparsity from label or embedding inputs and enable the graph optimization accordingly, without any user interaction. This feature adds a new requirement: passing the pre_grad graph transformation config to OrtModuleGraphBuilder is delayed from its `Initialize` phase to its `Build` phase, because only after `_initialize_graph_builder` can we detect the input sparsity and decide whether to enable the label/embed sparsity based graph optimizations. Add UT cases for the label/embed input runtime inspector.	2023-05-23 13:08:05 +08:00
Baiju Meswani	de0a973b6e	[Bug Fix] Incorrect comparison for FromBuffer in TrainingSession.cs (#16022 )	2023-05-22 21:21:54 -07:00
PeixuanZuo	2fddc65c8c	[ROCm] add hipblaslt into GemmFastGelu TunableOp (#15945 ) add hipblaslt into GemmFastGelu TunableOp.	2023-05-23 11:07:09 +08:00
Dmitri Smirnov	684e900e96	Remove NETSTANDARD1.1 moniker and NETSTD1.1 specific code (#16018 ) ### Description Remove NETSTANDARD1.1 moniker and NETSTD1.1 specific code. We no longer target this platform. ### Motivation and Context The NETSTANDARD1.1 target constrains development and the modern libraries we would like to use in the code, while it is apparently no longer required by customers.	2023-05-22 17:33:46 -07:00
RandySheriffH	d35361bf9d	Fix python pipeline for AzureEP without using root (#16023 ) Fix python pipeline for AzureEP without using root, this is for 1.15. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-22 16:38:47 -07:00
satyajandhyala	22a578c06c	Use node name to uniquify the subgraph nodes. (#15855 ) ### Description Use the unique name of the function node to uniquify the subgraph node names. ### Motivation and Context Prevent duplicate node names in the graph. https://github.com/microsoft/onnxruntime/issues/15849 --------- Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>	2023-05-22 16:15:14 -07:00
zhijiang	4dc4470cc7	Fix fusion for two LayerNorm sharing same input but with different weights (#15919 ) In gpt_j_residual (https://arxiv.org/pdf/2204.06745.pdf), two LN nodes share the same input. ORT runs the CSE graph optimization before LN fusion, which modifies the LN graph pattern and thus causes LN fusion to fail. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/40990fd6-796f-4edf-be0b-3203e8503678)	2023-05-22 08:26:36 +08:00
zhijiang	5607a7151a	Introduce register-efficient warp-wise Softmax (#15266 ) Improve softmax forward when the number of elements to softmax over is in (1024, 2048]. Optimizations done in the PR: 1. Originally ORT called softmax_block_forward when the shape is 1500, which takes 5.53ms. ORT has another implementation, softmax_warp_forward, which needs only 4.74ms, so the function selection logic now calls the faster version. 2. softmax_warp_forward uses registers to cache the input in fp32 mode. This consumes many registers when the element count is large and makes warp occupancy quite low; also compiler can do some of its optimizations. So the PR implements another version of softmax_warp_forward that uses shared memory instead of registers to cache the input. Also, when the for loop in the function has many iterations, disabling loop unrolling makes the kernel even faster. The perf table comparing softmax_warp_forward1 (the original version) and softmax_warp_forward2: ![image](https://user-images.githubusercontent.com/43435212/228491963-cf87e3b3-e69e-454c-bab6-7e62a25bf76b.png) In the OpenAI Whisper case, the kernel gain is 5.53ms/3.03ms, about 82% (softmax_block_forward vs softmax_warp_forward2)	2023-05-22 08:26:03 +08:00
Changming Sun	0204594f90	Cleanup WASM cmake code (#15996 ) ### Description Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if (CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code look more natural. For example, ```cmake if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY) ``` becomes ```cmake if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten") ```	2023-05-20 18:07:39 -07:00
Yulong Wang	e9e6bedf37	[js/webgpu] generate operator table for webgpu (#15954 ) ### Description [js/webgpu] generate operator table for webgpu	2023-05-20 12:20:41 -07:00
Yulong Wang	18f17c555d	[js/webgpu] fix buffer size when download (#15990 ) ### Description fix buffer size when download. buffer size should always be padded to multiple of 4. resolved issue described in #15796 > ![Image](https://user-images.githubusercontent.com/26504141/239093785-9417dffc-6f00-47b2-956d-402b43bdb0a9.png)	2023-05-20 00:21:18 -07:00
Patrice Vignola	85cacf315b	[DML EP] Add MultiHeadAttention and fix Attention (#15727 )	2023-05-19 15:07:14 -07:00
Yulong Wang	dc06c255b4	fix transpose optimizer on GPU EP (#15988 ) ### Description because of #15618 , the default allocator changed to device allocator, which will be GPU instead of CPU. in transpose optimizer we expect to read data from initializers so a CPU allocator is required here. this change fixes transpose optimizer on GPU EP Fixes the issue referred to in #15869, #15796	2023-05-19 14:33:45 -07:00
Hector Li	4324d2173b	[QNN EP] Enable Qnn context cache to save model initialization time (#15815 ) ### Description Enable the Qnn context cache feature to save model initialization time. Provider options: qnn_context_cache_enable\|1 enables the cache feature; qnn_context_cache_path sets the cache path, which defaults to model_file.onnx.bin. ### Motivation and Context Model initialization takes a long time because of the cost of converting the Onnx model to a Qnn model. Qnn can serialize the Qnn context to a file, so the next time the user can load the cached context and execute the graph, saving that cost. --------- Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>	2023-05-19 10:52:17 -07:00
RandySheriffH	4dfb89b3ad	Implement mutex-free spin lock for task queue (#14834 ) Implemented "lock-free" spinlock to save CPU usage on context switching. The change has been tested on queene service of Ads team, the lock-free version of ort (40 threads) saves CPU usage on gen8 (128 logical processors on 8 numa nodes) windows by nearly half, from 65% to 35%. For 32 cores, the curve is flat: Anubis, 32 vCPU, windows, hugging face models, 95 percentile E2E latency in ms: model \| mutex(ms) \| mutex-free --- \| --- \| --- alvert_base_v2 \| 34.21 \| 34.09 bert_large_uncased \| 116.27\| 117.84 bart_base \| 72.06 \| 71.99 distilgpt2 \| 25.43 \| 25.02 vit_base_patch16_224 \| 37.33 \| 37.76 Anubis, 32 vCPU win, Linux, 1st party models, 95 percentile E2E latency in ms: model \| mutex(ms) \| mutex-free --- \| --- \| --- deepthink_v2 \| 24.35 \| 22.95 bing_feeds \| 36.96 \| 36.48 deep_writes \| 14.46 \| 14.32 keypoints \| 9.34 \| 7.69 model11 \| 1.71 \| 1.66 model12 \| 1.82 \| 1.44 model2 \| 4.21 \| 3.95 model6 \| 1.08 \| 1.05 agiencoder \| 0.99 \| 0.93 geminet_transformer \| 5.32 \| 5.24 --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-19 10:12:10 -07:00
cloudhan	0b0a359520	[CAPI] CAPI impl refactor (#15974 ) 1. Better options string building 2. avoid potential `new` `delete`	2023-05-19 11:40:56 +08:00
Patrice Vignola	310b22aa0c	[DML EP] Update DirectML version to 1.12.0 (#16011 )	2023-05-18 19:37:12 -07:00
PeixuanZuo	d78bbf5ef2	[ROCm] remove ROCm5.2.3, ROCm5.3, ROCm5.4 from pipeline (#16004 ) remove ROCm5.2.3, ROCm5.3, ROCm5.4 from pipeline.	2023-05-19 10:29:01 +08:00
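
Commit b28e927ca4 above detects dot product support from bits 47:44 of the `AA64ISAR0_EL1` register. A minimal Python sketch of that bit-field test follows. It operates on a register value that has already been read (the read itself needs an AArch64 instruction and is not shown); the function name `has_dot_product`, the constants, and the sample values are illustrative assumptions, not code from the PR.

```python
# Sketch only: names and sample values are hypothetical, not taken from the PR.
DP_SHIFT = 44  # lowest bit of the field occupying bits 47:44
DP_MASK = 0xF  # the field is 4 bits wide


def has_dot_product(isar0: int) -> bool:
    """Return True if the 4-bit field at bits 47:44 of the value is nonzero.

    Assumption: any nonzero field value is treated as "dot product
    instructions are implemented".
    """
    return ((isar0 >> DP_SHIFT) & DP_MASK) != 0


if __name__ == "__main__":
    print(has_dot_product(0x1 << 44))  # field value 1
    print(has_dot_product(0xF << 40))  # only bits 43:40 set, field value 0
```

Shifting before masking keeps the test independent of whatever the neighbouring fields of the register contain.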

8889 commits