onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-21 21:52:11 +00:00

Author	SHA1	Message	Date
Chi Lo	4e3cff60fd	CUDA graph support for TRT EP (#16081 ) CUDA EP already supports [CUDA graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs), and we observed that some models benefit from using CUDA graph with `trtexec`. Therefore, this PR enables CUDA graph support for the TRT EP. The implementation is based on https://github.com/microsoft/onnxruntime/pull/9978 with the same [constraints](https://github.com/microsoft/onnxruntime/pull/9978) as below: - Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported. - Usage of CUDA Graphs is limited to models wherein all the model ops (graph nodes) can be partitioned to the TRT EP. - The input/output types of models need to be tensors. - Shapes of inputs/outputs cannot change across inference calls. - IOBinding is required.	2023-06-21 09:36:45 -07:00
Yuhong Guo	48e6186b1a	Move tests from core/providers/cuda/test/* to test/providers/cuda/ and refactor CUDA UT (#16161 ) ### Description <!-- Describe your changes. --> 1. Add a new test lib `onnxruntime_providers_cuda_ut`, which is similar to `onnxruntime_providers_cuda` but is only built if `onnxruntime_BUILD_UNIT_TESTS` is set. We can call all CUDA UTs through this UT lib without affecting the production lib `onnxruntime_providers_cuda`. 2. Move all test cases from `core/providers/cuda/test/` to `test/providers/cuda/`. These test cases are built into the lib `onnxruntime_providers_cuda_ut` and run by `./onnxruntime_test_all --gtest_filter="CUDA_EP_Unittest"`. Since the lib is only for testing, we can use gtest macros in the test cases. The previous implementation did not support using the gtest lib in the CUDA UT cases. 3. The cmake code in `cmake/onnxruntime_providers.cmake` is refactored a bit. A new function `onnxruntime_add_object_library` builds an object target. The 2 libs `onnxruntime_providers_cuda_ut` & `onnxruntime_providers_cuda` share most of the code, so the object files can be used in both libs, which helps reduce build time. Another function `config_cuda_provider_shared_module` is used to configure all 3 similar targets (onnxruntime_providers_cuda_obj/onnxruntime_providers_cuda/onnxruntime_providers_cuda_ut). 4. Refactored the test to call `testing::InitGoogleTest` & `RUN_ALL_TESTS` in `libonnxruntime_providers_cuda_ut.so`'s `TestAll`. After this change, we can see all the cases running in `CUDA_EP_Unittest.All`: ![image](https://github.com/microsoft/onnxruntime/assets/19584326/8ff80df6-060b-4ef0-90b7-657e68d3db87) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. 
--> After https://github.com/microsoft/onnxruntime/pull/13016, there are still test files in test/providers/cuda/ that are not moved to core/providers/cuda/test/ and the test cases are disabled. This PR helps to clean up the unfinished TODOs. Even though onnxruntime_shared_lib_test covers some tests for the CUDA provider, it works like a coarse-grained end-to-end test. If CUDA unit tests can run cases for a single component, this would be helpful for CUDA developers. --------- Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-06-20 14:54:55 -07:00
cao lei	dd72192cf4	ExecutionProvider API refactor - move allocator from EP level to SessionState level and indexed by OrtDevice (#15833 ) ### Description This PR refactors the ExecutionProvider API for memory management by moving allocators from the EP level to the SessionState level, indexed by OrtDevice. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This PR refactors the ExecutionProvider API for memory management by moving allocators from the EP level to the SessionState level, indexed by OrtDevice. With this change, the EP level no longer carries the burden of maintaining allocators, which is friendlier for EP developers. --------- Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-06-19 17:44:45 -07:00
Changming Sun	5754cd7d1d	Add fp16 support to CPU EP gemm op (#15506 )	2023-06-15 14:38:17 -07:00
Changming Sun	b72fe664c1	Refactor prepack buffer code (#16280 ) ### Description 1. Use IAllocatorUniquePtr to replace BufferUniquePtr. It will ensure the deleter is always right. 2. Change some std::unique_ptr to std::optional 3. Bypass Arena allocator when allocating the prepack buffers for mlas. In this special case, Arena doesn't help any. And this change is just an internal implementation change, it doesn't affect our public interface.	2023-06-08 14:42:02 -07:00
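The first item of the prepack refactor above relies on a smart pointer whose deleter remembers which allocator produced the buffer. The following is a minimal sketch of that idea only; the `Allocator` interface, `CountingAllocator`, `AllocatorUniquePtr` and `MakeBuffer` are hypothetical names for illustration and are not taken from the ORT sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>

// Hypothetical allocator interface, standing in for a real one.
struct Allocator {
  virtual ~Allocator() = default;
  virtual void* Alloc(std::size_t bytes) = 0;
  virtual void Free(void* p) = 0;
};

// Toy allocator that counts live allocations so leaks are easy to spot.
struct CountingAllocator : Allocator {
  int live = 0;
  void* Alloc(std::size_t bytes) override { ++live; return std::malloc(bytes); }
  void Free(void* p) override { --live; std::free(p); }
};

// The deleter is type-erased, so the pointer type does not depend on the allocator.
template <typename T>
using AllocatorUniquePtr = std::unique_ptr<T, std::function<void(T*)>>;

// Allocates `count` elements and binds the matching Free() into the deleter.
template <typename T>
AllocatorUniquePtr<T> MakeBuffer(std::shared_ptr<Allocator> alloc, std::size_t count) {
  T* raw = static_cast<T*>(alloc->Alloc(count * sizeof(T)));
  // Capturing the shared_ptr keeps the allocator alive as long as the buffer.
  return AllocatorUniquePtr<T>(raw, [alloc](T* p) { alloc->Free(p); });
}
```

Because the deleter travels with the pointer, a buffer created with `MakeBuffer<float>(alloc, 16)` is always returned to the allocator it came from when it goes out of scope, which is the property the commit message describes as the deleter always being right.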
Dmitri Smirnov	908e940660	[CPP Api] Remove deprecated CustomOp API (#16256 ) ### Description Custom Op API has been deprecated in 1.15 release. We are removing it.	2023-06-07 14:03:13 -07:00
PeixuanZuo	1b518c6836	[ROCm] add early stop to tunable profile progress (#15716 ) For TunableOp, some instances may perform very badly and take a long time during the profiling process. Add the `tunable_op_max_tuning_duration_ms` parameter to limit the maximum tuning time.	2023-06-01 10:18:25 +08:00
Xavier Dupré	e726151b5c	Introduce float 8 types (#14731 ) ### Description The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, FloatE5M2FNUZ as described in PR https://github.com/onnx/onnx/pull/4805. It uses the CUDA API to cast float/half to float8 if CUDA>=11.8, and a custom implementation if CUDA<11.8. * It implements Cast, QuantizeLinear, DequantizeLinear for all types on CPU, and only for types FloatE4M3FN, FloatE5M2 on CUDA. * It extends the supported types for control flow operators, Shape, Reshape, Identity, If, Loop, Scan * It implements Equal(19). * Cast, QuantizeLinear, DequantizeLinear operators now support a parameter `saturate` that is only valid for float 8 types. It is true by default. In that case, any value out of range is converted into the maximum float 8 value. If false, the result is infinite. * QuantizeLinear, DequantizeLinear now support multiple scales on CUDA (and ROCm by extension), scale = 1D tensor with one scale per channel ### Motivation and Context Supports latest onnx version. Fixes [AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395) --------- Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2023-05-30 13:25:58 -07:00
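The `saturate` behaviour described in the float 8 commit can be pictured with a small range check. This is a rough sketch for FloatE5M2 only, whose largest finite magnitude is 57344 (1.75 × 2^15); `ApplyE5M2Range` is a hypothetical helper, not ORT code, and it deliberately ignores rounding of in-range values to 8 bits, so values slightly above the maximum that a real conversion might round down are treated here as out of range.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Largest finite FloatE5M2 magnitude: 1.75 * 2^15.
constexpr float kE5M2Max = 57344.0f;

// Range handling only (assumption: rounding to 8 bits is done elsewhere).
// saturate == true  -> out-of-range values clamp to +/- kE5M2Max.
// saturate == false -> out-of-range values become +/- infinity.
float ApplyE5M2Range(float x, bool saturate) {
  if (std::isnan(x) || std::fabs(x) <= kE5M2Max) {
    return x;  // NaN and in-range values pass through unchanged
  }
  const float limit = saturate ? kE5M2Max : std::numeric_limits<float>::infinity();
  return std::copysign(limit, x);  // keep the sign of the input
}
```

For example, `ApplyE5M2Range(1.0e6f, true)` yields 57344, while the same call with `saturate` set to false yields positive infinity.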
Dmitri Smirnov	9939092e71	[CPP API]Fix constness in C++API (#16103 ) ### Description `CreateMap` and `CreateSequence` should be able to take in const data.	2023-05-26 14:09:00 -07:00
Changming Sun	a5410515ad	Fix: Some fields in OrtCUDAProviderOptionsV2 struct are not initialized (#16113 ) ### Description The file include/onnxruntime/core/providers/cuda/cuda_provider_options.h is a C++ file. It is not for C. Before this commit, this header file is already not compatible with C compilers. Because it has: ``` onnxruntime::ArenaExtendStrategy arena_extend_strategy; ``` And this file is intended to be internal only. It is an internal header file. It should not be included in onnxruntime_c_api.h and should not be used with the public C APIs. User can only get the instance of OrtCUDAProviderOptionsV2 via CreateCUDAProviderOptions. In such a way we can add new members to this struct without breaking binary compatibility. Since it is an internal header, we can safely use C++ grammar there.	2023-05-26 11:34:22 -07:00
Yuhong Guo	04a8f50674	New configuration to limit the arena extension (#15983 ) Add a configuration `max_power_of_two_extend_bytes` to limit the arena extension size. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> In our real scenario, we observe that if the model is big enough, the BfcArena extends uncontrollably. As shown by the following figures, if a model uses more than 16GB of memory, the BfcArena will request 32GB of memory in total under the `kNextPowerOfTwo` strategy. With the new configuration, the extension is limited. The default maximum extension size is 1GB. #### Without the new configuration After loading the model, ORT uses 32G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/42b93c66-b957-4f20-a13b-d34cb390afff) #### With the new configuration After loading the model, ORT uses 23G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/5abffeff-9ca3-4187-a262-37fd2764fe1b) Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-05-25 02:19:07 -07:00
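To see why capping the extension size matters, here is a simplified model of a doubling arena. It assumes each extension is twice the previous one until it reaches a cap; `TotalReservedBytes` and its parameters are hypothetical, and the real BfcArena has more rules (for example, it also honours the size of the pending request), so the numbers below are not expected to match the screenshots in the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Simplified model: every extension doubles the previous one, but never exceeds
// max_extend_bytes. Returns the total bytes reserved once needed_bytes fit.
std::uint64_t TotalReservedBytes(std::uint64_t needed_bytes,
                                 std::uint64_t first_extend_bytes,
                                 std::uint64_t max_extend_bytes) {
  assert(first_extend_bytes > 0 && max_extend_bytes > 0);
  std::uint64_t next = std::min(first_extend_bytes, max_extend_bytes);
  std::uint64_t total = 0;
  while (total < needed_bytes) {
    total += next;
    // Double without overflowing, then clamp to the configured cap.
    next = (next > max_extend_bytes / 2) ? max_extend_bytes : next * 2;
  }
  return total;
}
```

In this model, with a 1 MB first extension and no effective cap, a 17 GB requirement ends with 32767 MB reserved (about 32 GB); with a 1 GB cap the same requirement ends with 18431 MB.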
Adrian Lizarraga	efc84a43e8	[QNN EP] Add session option to disable fallback to default CPU EP (#16016 ) ### Description Adds the session config option `disable_cpu_ep_fallback` to allow the user to prevent the CPU EP from handling nodes not supported by other execution providers. ```C++ // Graph nodes that are not supported by the execution providers (EPs) explicitly added to the session are // assigned (i.e., "fallback") to the CPU EP by default. // // This option allows the user to disable the fallback of unsupported graph nodes to the CPU EP. // If this option is set to "1", session creation will fail if the execution providers other than the CPU EP cannot // fully support all of the nodes in the graph. // // It is invalid to set this option and explicitly add the CPU EP to the session. In this case, session creation // will also fail with an error. // // Option values: // - "0": CPU EP fallback is not disabled. [DEFAULT] // - "1": CPU EP fallback is disabled. static const char* const kOrtSessionOptionsDisableCPUEPFallback = "session.disable_cpu_ep_fallback"; ``` #### Example use ```C++ #include "core/session/onnxruntime_cxx_api.h" #include "core/session/onnxruntime_session_options_config_keys.h" int main(int argc, char** argv) { Ort::SessionOptions so; so.AddConfigEntry(kOrtSessionOptionsDisableCPUEPFallback, "1"); // Disable fallback to the CPU EP. onnxruntime::ProviderOptions options; #if defined(_WIN32) options["backend_path"] = "QnnCpu.dll"; #else options["backend_path"] = "libQnnCpu.so"; #endif so.AppendExecutionProvider("QNN", options); const ORTCHAR_T* ort_model_path = ORT_MODEL_FOLDER "qnn_ep_partial_support.onnx"; Ort::Session session(*ort_env, ort_model_path, so); // Throws exception if nodes fallback to CPU // ... ``` ### Motivation and Context Makes it easier for application developers to ensure that the entire model runs on specific EPs. This is critical for Qualcomm/scenarios. 
If the compute cannot be offloaded to the NPU, running on the CPU is not acceptable (it could be the difference between a 90-second inference and a 6-second inference). --------- Co-authored-by: Pranav Sharma <prs@microsoft.com>	2023-05-23 17:56:32 -07:00
Hector Li	4324d2173b	[QNN EP] Enable Qnn context cache to save model initialization time (#15815 ) ### Description Enable the Qnn context cache feature to save model initialization time. Provider options: qnn_context_cache_enable\|1 to enable the cache feature; qnn_context_cache_path to set the cache path. It is set to model_file.onnx.bin by default. ### Motivation and Context Model initialization takes a long time because of the cost of converting the Onnx model to a Qnn model. Qnn has a feature to serialize the Qnn context to a file, so next time the user can load the cached context and execute the graph without paying that cost. --------- Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>	2023-05-19 10:52:17 -07:00
RandySheriffH	4dfb89b3ad	Implement mutex-free spin lock for task queue (#14834 ) Implemented "lock-free" spinlock to save CPU usage on context switching. The change has been tested on queene service of Ads team, the lock-free version of ort (40 threads) saves CPU usage on gen8 (128 logical processors on 8 numa nodes) windows by nearly half, from 65% to 35%. For 32 cores, the curve is flat: Anubis, 32 vCPU, windows, hugging face models, 95 percentile E2E latency in ms: model \| mutex(ms) \| mutex-free --- \| --- \| --- alvert_base_v2 \| 34.21 \| 34.09 bert_large_uncased \| 116.27\| 117.84 bart_base \| 72.06 \| 71.99 distilgpt2 \| 25.43 \| 25.02 vit_base_patch16_224 \| 37.33 \| 37.76 Anubis, 32 vCPU win, Linux, 1st party models, 95 percentile E2E latency in ms: model \| mutex(ms) \| mutex-free --- \| --- \| --- deepthink_v2 \| 24.35 \| 22.95 bing_feeds \| 36.96 \| 36.48 deep_writes \| 14.46 \| 14.32 keypoints \| 9.34 \| 7.69 model11 \| 1.71 \| 1.66 model12 \| 1.82 \| 1.44 model2 \| 4.21 \| 3.95 model6 \| 1.08 \| 1.05 agiencoder \| 0.99 \| 0.93 geminet_transformer \| 5.32 \| 5.24 --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-19 10:12:10 -07:00
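For readers unfamiliar with the term, a spin lock avoids putting the waiting thread to sleep, which is where the context-switching savings reported above come from. The sketch below is a generic test-and-set spin lock built on `std::atomic<bool>`; it is an illustration only and not the implementation from this commit, and `SpinLock` and `TaskCounter` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Minimal test-and-set spin lock (illustrative; production versions usually
// add a pause instruction or back-off inside the wait loop).
class SpinLock {
 public:
  void lock() noexcept {
    // exchange() returns the previous value: keep spinning while it was held.
    while (locked_.exchange(true, std::memory_order_acquire)) {
    }
  }
  bool try_lock() noexcept {
    // Succeeds only if the lock was free before this call.
    return !locked_.exchange(true, std::memory_order_acquire);
  }
  void unlock() noexcept { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};

// Hypothetical shared counter guarded by the spin lock.
class TaskCounter {
 public:
  void Add(int n) {
    std::lock_guard<SpinLock> guard(lock_);
    total_ += n;
  }
  int Total() {
    std::lock_guard<SpinLock> guard(lock_);
    return total_;
  }

 private:
  SpinLock lock_;
  int total_ = 0;
};
```

Since `SpinLock` provides `lock()`, `try_lock()` and `unlock()`, it works with `std::lock_guard` and `std::unique_lock` like a regular mutex. Spinning pays off when critical sections are short, as with queue push and pop; for long waits a blocking mutex is usually the better choice.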
cloudhan	856afa49dd	[C#] Add missing rocm csharp api (#15540 )	2023-05-18 08:15:19 +08:00
Baiju Meswani	6b7181d31d	Add C# API documentation for training (and some other changes) (#15935 )	2023-05-16 03:15:24 -07:00
cloudhan	dc383ed4ce	Basic CSharp packaging support for ROCm EP (#15535 ) This PR mainly fixes build errors when trying to build the nupkg for the ROCm EP. It also slightly improves the packaging logic so that developers can produce the nupkg on Linux natively.	2023-05-16 07:27:38 +08:00
Dmitri Smirnov	896a963492	Adust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921 ) ### Description This PR partially reverts changes introduced in https://github.com/microsoft/onnxruntime/pull/15643 We make the two APIs always return std::string in UTF-8. We also move the entry points from OrtApiBase to OrtApi to make them versioned. ### Motivation and Context `GetVersionString` always returns x.y.z numbers that are not subject to internationalization. `GetBuildInfoString` can hold international chars, but UTF-8 should be fine to contain those. We prefix them with u8"" in case the compiler default charset is not UTF-8. Furthermore, creating platform-dependent APIs is discouraged. `ORTCHAR_T` is platform dependent and was created for paths only. On non-Unix platforms it would still produce a `std::string` that can only contain UTF-8. The API was introduced after the latest release, and can still be adjusted.	2023-05-13 13:45:07 -07:00
Maximilian Müller	143551092f	fix: setting builder optimization level to TRT 8.6 default (#15897 ) The actual released default level is 3 and not the previously used 2. Just a small sample of the effects: ![Screenshot 2023-05-10 at 15 49 55](https://github.com/microsoft/onnxruntime/assets/44298237/5a694446-22c0-4943-9ddf-80670781878f)	2023-05-12 13:29:30 -07:00
Hector Li	1bebc88069	[SNPE EP] Add option to enable SNPE init caching feature (#15917 ) ### Description [SNPE EP] Add option to enable SNPE init caching feature ### Motivation and Context To save model initialization time	2023-05-12 07:57:11 -07:00
Wanming Lin	00b1e79e04	Support WebNN EP (#15698 ) Description: This PR intends to enable WebNN EP in ONNX Runtime Web. It translates the ONNX nodes using the [WebNN API](https://webmachinelearning.github.io/webnn/); the EP is implemented in C++ and uses the Emscripten [Embind API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#). It temporarily uses the preferred layout NHWC for WebNN graph partitions because of a restriction in the WebNN XNNPack backend implementation and the ongoing [discussion](https://github.com/webmachinelearning/webnn/issues/324) in the WebNN spec about whether WebNN should support both 'NHWC' and 'NCHW' layouts. There is no WebNN native EP; it is only for Web. Motivation and Context: Allow ONNXRuntime Web developers to access the WebNN API to benefit from hardware acceleration. WebNN API Implementation Status in Chromium: - Tracked in Chromium issue: [#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291) - CPU device: based on the XNNPack backend, and has been available on Chrome Canary M112 behind the "#enable-experimental-web-platform-features" flag for Windows and Linux platforms. Further implementation for more ops is ongoing. - GPU device: based on DML, implementation is ongoing. Open: - GitHub CI: WebNN is currently only available on Chrome Canary/Dev with the XNNPack backend for Linux and Windows. This is an open question for reviewers: please help identify which GitHub CI should involve the WebNN EP and guide me to enable it. Thanks!	2023-05-08 21:25:10 -07:00
RandySheriffH	8e610f25d8	Implement lite custom op API (#15778 ) Implement a set of new APIs for lightweight custom ops registration, to save efforts from schema-composing. A few highlights: - Support build-time type inference; - Support function-as-op for "stateless" ops; - Support structure-as-op for "stateful" ops; - Support varied input/output forms such as span, scalar, and tensors, either optional or non-optional. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-04 09:49:17 -07:00
Changming Sun	1fb2f2605b	Update VERSION_NUMBER (#15773 ) ### Description 1. Update VERSION_NUMBER for preparing the upcoming release. This PR's commit will not be included in the 1.15 release branch 2. Delete package/rpm/onnxruntime.spec since it was not used in past years. ### Motivation and Context Preparing the release. Fixed [AB#15311](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15311)	2023-05-03 15:07:34 -07:00
Baiju Meswani	ba7b83ff3c	Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776 )	2023-05-03 13:08:35 -07:00
Chen Fu	bc58fd5413	fix compilation error in no absl build (#15769 ) ### Description Fix no-absl build error:	2023-05-02 08:20:49 -07:00
Changming Sun	034698cf6a	Revert "Implement lite custom op API (#15590 )" (#15768 ) This reverts commit `cdf4fc49fc` because it breaks the "debug_node_input_output" build in "Post Merge" pipeline	2023-05-02 01:10:10 -07:00
Ye Wang	391f897983	Bring back SLN cuda kernel and use provider options to switch to standard implementation (#15660 )	2023-05-01 18:35:26 -07:00
cao lei	d58fa9805b	ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice (#15618 ) ### Description ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice ### Motivation and Context Currently “Location” is represented as ORTMemoryInfo, which is OrtDevice + OrtMemType, while OrtDevice is represent as DeviceType + DeviceId + MemType. As we can see there is some unnecessary hierarchy, the proposal is to make it a clear definition that to use OrtDevice as an abstraction for Location --------- Co-authored-by: Lei Cao <leca@microsoft.com>	2023-05-01 10:06:00 -07:00
RandySheriffH	cdf4fc49fc	Implement lite custom op API (#15590 ) Implement a set of new APIs for lightweight custom ops registration, to save efforts on schema-composing. A few highlights: 1. Support build-time type inference; 2. Support function-as-op for "stateless" ops; 3. Support structure-as-op for "stateful" ops; 4. Support varied input/output forms such as span, scalar, and tensors, either optional or non-optional. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-01 08:45:26 -07:00
Chen Fu	0e9472d391	NHWC graph optimizer (#15724 ) ### Description Augment nhwc graph optimizer to accommodate fp16 operators. ### Motivation and Context With new fp16 conv operator added. This operator prefers NHWC data layout. We need to augment existing graph optimizers to better utilize the new operator.	2023-05-01 08:44:07 -07:00
Chunye Wang@AMD	d35850c142	[VitisAI]Update VitisAI EP to be compatible with VitisAI 3.5 (#15673 ) ### Description Originally the VitisAI EP only worked with old versions of the VitisAI release. ### Motivation and Context Update the VitisAI EP so that it works together with the current VitisAI 3.5 and later versions of VitisAI. We try our best to make it forward compatible. --------- Co-authored-by: Wang Chunye <chunywan@xilinx.com> Co-authored-by: mingyue <mingyue@amd.com> Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com> Co-authored-by: liumingyue <mingyue@xilinx.com> Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com> Co-authored-by: shoucair <shoucai.ren@amd.com> Co-authored-by: zz002 <zhenze.wang@amd.com> Co-authored-by: BoarQing <yuz75@Pitt.edu> Co-authored-by: Yueqing Zhang <yueqingz@amd.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2023-05-01 08:28:26 -07:00
Jeff Bloomfield	3df3a85114	Default kOrtSessionOptionsDisableQuantQDQ to 1 when the DML EP is registered (#15725 ) This addresses a performance regression in some INT8 models with the DirectML EP by defaulting OrtSessionOptionsDisableQuantQDQ to 1 when the EP is registered. This regression occurred due to the introduction of the QDQ propagation transformer, which is based on this session option. That transformer maximizes the number of nodes which are executed as quantized by logically propagating quantize operators upstream and dequantize operators downstream. However, it does this simply by inserting QDQ pairs, with an expectation that something will recognize sequences of DQ->Op->Q. This logic and related L2 transformers are not currently enabled for the DirectML EP. This change also removes a noisy warning when the session option for memory pattern is overridden as the DirectML EP is registered.	2023-05-01 08:26:03 -07:00
Chi Lo	6e652d0554	Support explicit TRT profiles from provider options (#15546 ) The previous behavior of the TRT EP was to set TRT optimization profiles for dynamic shape input based on input tensor values. Users couldn't explicitly specify the profiles. This PR lets users specify min/max/opt profiles through three newly added provider options: `trt_profile_min_shapes`, `trt_profile_max_shapes` and `trt_profile_opt_shapes` with the format of "input1:dim1xdim2...,input2:dim3xdim4...". (Note: It's similar to --minShapes, --maxShapes and --optShapes of trtexec command-line [flags](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags)) For example, if you are using onnxruntime_perf_test, you can try this: `./onnxruntime_perf_test -e tensorrt -r 1 -i "trt_profile_min_shapes\|imgs:1x3x384x288 trt_profile_max_shapes\|imgs:32x3x384x288 trt_profile_opt_shapes\|imgs:16x3x384x288" your_model_path` If the engine cache is enabled, you still need to provide these three explicit provider options in order to use this feature. ORT TRT will compare the min/max/opt profile shape with the ones saved in the .profile file to decide whether to rebuild the engine. Constraints to use these provider options: (1) Need to specify min/max/opt profile shapes for all the dynamic shape inputs. This feature was also requested by other users: https://github.com/microsoft/onnxruntime/issues/13851	2023-04-30 22:30:26 -07:00
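The shape-string format quoted in the commit above ("input1:dim1xdim2...,input2:dim3xdim4...") is easy to picture with a small parser. This is a sketch under stated assumptions, not the TRT EP's own parsing code: `ParseProfileShapes` is a hypothetical helper, and splitting on the last `:` (so that input names containing a colon still work) is a design choice made here for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: turns "imgs:1x3x384x288,mask:1x128" into
// {input name -> list of dimensions}.
std::map<std::string, std::vector<std::int64_t>> ParseProfileShapes(const std::string& spec) {
  std::map<std::string, std::vector<std::int64_t>> shapes;
  std::istringstream entries(spec);
  std::string entry;
  while (std::getline(entries, entry, ',')) {
    // Split on the last ':' so names that themselves contain ':' are kept whole.
    const auto colon = entry.rfind(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 == entry.size()) {
      throw std::invalid_argument("expected name:dim1xdim2..., got: " + entry);
    }
    std::vector<std::int64_t> dims;
    std::istringstream dim_stream(entry.substr(colon + 1));
    std::string dim;
    while (std::getline(dim_stream, dim, 'x')) {
      dims.push_back(std::stoll(dim));  // throws on empty or non-numeric input
    }
    shapes[entry.substr(0, colon)] = dims;
  }
  return shapes;
}
```

Parsing the min, max and opt strings this way gives three maps that can be checked against each other, for instance to confirm that every dynamic input appears in all three, which is the constraint the commit message lists.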
Changming Sun	65020d433e	Prefast fixes for CUDA EP (#15726 ) ### Description 1. Adjust cmake flags. Do not modify CMAKE_CXX_FLAGS globally. Only apply the flags to ORT code. 2. Fix some SDL warnings.	2023-04-29 12:43:12 -07:00
Yuhong Guo	41dcf0d32e	Expose build information in dynamic lib (#15643 ) ### Description <!-- Describe your changes. --> 1. Add Build Info API to onnx. 2. Fix compile error while building onnxruntime_benchmark on macOS. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> 1. When the Onnxruntime lib is serving online, we need a way to detect how this lib was built. This PR helps the developer get the build information using `strings`, such as git branch, git commit id, build type and cmake cxx flags, as shown below. ![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png) ![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png) If the build env has no git, there will be no git-related info: ![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png) 2. Fix the following compile error while building the benchmark on macOS. ![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png) --------- Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-04-28 21:57:31 -07:00
Chen Fu	be08b47e7b	Refine cast optimizer for safety (#15658 ) ### Description The cast optimizer may convert an fp16 node to fp32. This used to be safe as all fp16 kernels had fp32 implementations. As this assumption is no longer true, we need to check the validity of the operation. ### Motivation and Context The main work here is to introduce an API to check whether a kernel is registered. Currently we don't have a way to do that without an operator node. This needs to be augmented. We need to query whether a kernel is registered by its properties only, so that we can judge whether it is safe to construct a node long before we actually do so.	2023-04-28 09:32:54 -07:00
sfatimar	ebaafac3f5	Openvino ep ort 5.0 (#15626 ) ### Description The PR adds VPU support to OpenVINO Execution Provider Bug fixes for GPU, CPU. Changes to OpenVINO Backend in Serialized Model API for faster First Inference Latency. Deprecation to HDDL-VADM and MYRIAD, removed code Support OpenVINO 2023.0 Dynamic Shapes Support for iGPU ### Motivation and Context - VPU is an upcoming hardware that can provide AI Acceleration for Client Systems through OpenVINO - If it fixes an open issue, please link to the issue here. --> --------- Signed-off-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-04-25 20:59:42 -07:00
Baiju Meswani	5885abfb35	Training Documentation (#15612 )	2023-04-25 11:44:12 -07:00
Yulong Wang	14cc02c65c	[js/web] WebGPU backend via JSEP (#14579 ) ### Description This change introduced the following new components into ONNX Runtime Web: - JavaScript Execution Provider (JSEP) - Asynchronous inferencing execution powered by Emscripten's Asyncify - WebGPU backend implemented in TypeScript - initial implementation of kernels: - elementwise operators (22) - binary operators (5) - tensor: Shape, Reshape, Transpose, Gemm - nn: Conv, {Global}Maxpool, {Global}AveragePool The code needs to be polished; still working on it. ## Q&A What is JSEP? > JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime execution provider that specifically works in the Web environment (browsers). JSEP allows JavaScript code to kick in from various places when ONNX Runtime runs inference on a model. Why JSEP? > JSEP is a hybrid mode EP that contains both C/C++ and TypeScript/JavaScript implementations. There are 2 strong reasons why we introduce JSEP: > 1. the C/C++ part helps JSEP leverage ONNX Runtime's capabilities as much as possible, including graph transformers, optimizers and also the ability to fall back to the CPU EP. TypeScript/JavaScript makes it much easier to develop and debug the kernel implementation in the browser. > 2. the requirement of asynchronous execution from the JavaScript API (e.g. `buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a synchronous context (see "async problem" section below). This is solved by using Emscripten's Asyncify. What is WebGPU? > WebGPU is the new GPU API that is available in the browser. It's one of the only 2 APIs currently available to access the GPU from the browser (the other is WebGL). > WebGPU is designed with more advanced and stronger features compared to WebGL and is potentially the solution that offers the best GPU performance for model inferencing that is currently available. What is the async problem and why do we have the problem? 
> The "async problem" is that you cannot call an async function in a synchronous context. Think about the following C code: > ```c > // C-style declarations (API) > typedef void (ON_COMPLETE)(PVOID state, DATA data); > void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete); > > // implementation > DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) { > // how to implement? > } > ``` > The answer is, it's impossible to implement this function. Usually we try to find a sync version of the API, or launch a thread to call the async function and sync-wait on the main thread. Unfortunately, in the browser environment, neither is possible. > > WebGPU does not offer any synchronous API for data downloading (GPU to CPU). This is the only operation that MUST be async. As `OrtRun()` will eventually call into DataTransfer to copy data from GPU to CPU, and `OrtRun()` is a synchronous function, this cannot be done in the normal way. What is Emscripten? How does the Asyncify feature resolve the problem? > Emscripten is the C/C++ compiler for WebAssembly. It's what we use to compile ORT and generate the WebAssembly artifacts which run in browsers. > > Asyncify is a [compiler feature](https://emscripten.org/docs/porting/asyncify.html) that allows calling async functions from a synchronous context. In short, it generates code to unwind and rewind the call stack to emulate async execution. With this feature, we are able to call the async function inside the `OrtRun()` call. ## Design Overview Inter-op JSEP does pretty much the same thing as any other EP. 
It exposes an interface for inter-op with JavaScript, which is defined in onnxruntime/wasm/js_internal_api.js: ```js // init JSEP Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) { Module.jsepBackend = backend; Module.jsepAlloc = alloc; Module.jsepFree = free; Module.jsepCopy = copy; Module.jsepCopyAsync = copyAsync; Module.jsepCreateKernel = createKernel; Module.jsepReleaseKernel = releaseKernel; Module.jsepRun = run; }; ``` This simple JavaScript snippet defines all the language-barrier-level functions that JSEP requires to implement kernels and data transfers using JavaScript inside ONNX Runtime: - `jsepBackend`: assign the singleton object to the WebAssembly module - `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc() and Free() - `jsepCopy`: synchronous copy (GPU to GPU, CPU to GPU) - `jsepCopyAsync`: asynchronous copy (GPU to CPU) - `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object that is maintained in JS to match the lifecycle of the Kernel in ORT - `jsepRun`: OpKernel::Compute() should call into this The abstraction above keeps the connections and dependencies between C/C++ and TypeScript/JavaScript to a minimum. Resource Management The lifecycle of tensor data and kernels is managed by ORT (C/C++) but the implementation is left to JavaScript. JavaScript code is responsible for implementing the callbacks correctly. For WebGPU, the GPU data is managed by JavaScript using a singleton map (tensor_data_id => GPUBuffer). The GPU pipeline is managed as a singleton. Shaders are managed using a singleton map (shader_key => gpu_program), where shader_key is generated from cache_key (OP specific, including attributes) and input shapes. about data transfer `js::DataTransfer::CopyTensor` is implemented to call either the synchronous or the asynchronous copy callback, depending on whether the destination is the GPU or not. 
Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function to be called in the synchronous context. run kernel in JS The kernel class constructor calls `jsepCreateKernel()` once, with an optional per-kernel specific serialization to pass attributes into JavaScript. `Compute()` is implemented in a way that a metadata serialization is performed in a base class and JavaScript code can access the data using the Emscripten-specific builtin macro `EM_ASM_`. disabled features* memory pattern is force disabled, because the WebGPU data is not represented by a general memory model (where a buffer can be represented by offset + size). concurrent run support is disabled. WebGPU is stateful and it also has async function calls. Supporting concurrent runs would significantly increase the complexity and we don't get any real benefit from it. prefer channels last JSEP prefers channels last and returns `DataLayout::NHWC` in method `GetPreferredLayout()`. This lets the graph transformers preprocess the graph into a channels-last form so that a more optimized WebGPU shader can be used. Testing code It's impossible to test JSEP directly because JSEP itself does not contain any kernel implementation. However, it has the kernel registration, which needs to work together with the corresponding JavaScript code. There are unit tests that run onnx models from the JavaScript API. --------- Co-authored-by: Scott McKay <skottmckay@gmail.com>	2023-04-24 15:21:18 -07:00
cao lei	dc53ddef7a	Create a new C API KernelContext_GetAllocator() for Custom Op scenario (#15591 ) ### Description Create a new C API KernelContext_GetAllocator() for Custom Op scenario ### Motivation and Context Create a new C API KernelContext_GetAllocator() for Custom Op scenario	2023-04-23 21:54:35 -07:00
Dmitri Smirnov	a5dec8eedf	[C#] Improve string marshalling and reduce GC pressure (#15545 ) ### Description Reduce the number of auxiliary objects created to reduce GC pressure. Eliminate GCHandle-type memory pinning in most places. Improve string marshalling by allocating unmanaged memory that does not require pinning. Change native methods from `IntPtr` to `byte[]` (marshalling pinning is more efficient). Allocate input/output UTF-8 names in the unmanaged heap for the lifetime of the InferenceSession, so we do not keep converting and pinning them on every Run. Introduce a new native API that allows allocating and converting/copying strings directly into a native tensor. The PR delivers around 50% latency improvement and fewer GC pauses. Inspired by: https://github.com/microsoft/onnxruntime/pull/15520 ### Motivation and Context Clients experience GC pressure and performance degradation when dealing with string tensors. Co-Authored-By: @tannergooding	2023-04-20 15:12:51 -07:00
Chi Lo	6115c8fd1f	Add TRT plugins support using custom ops (#13847 ) This PR makes ORT support TRT plugins using custom ops. ORT TRT can automatically register all TRT plugins from the TRT plugin registry as custom ops. No ORT code change is needed when new TRT plugins are introduced. The previous way for ORT to support TRT plugins was using contrib ops, but there are some concerns about it: - Contrib ops are shipped as part of the ORT binary by default. TRT-related plugins should not be in the default ORT. - Contrib ops are designed for internal ops and developed for the CPU and CUDA EPs. Therefore, using custom ops is a good approach to support TRT plugins. The following are the major modifications: 1. Add a new `GetCustomOpDomainList` provider API which allows a provider to create its own custom op domain list, which ORT can then register. The provider is responsible for freeing all the custom op domain instances it created. 2. Move the OrtCustomOpDomain struct definition to framework_provider_common.h, since this struct is now used by both the framework and the EPs. 3. Several TRT plugins are registered as ONNX schema ops through contrib ops with the ONNX domain. To avoid breaking old models that use those TRT plugins and to maintain backward compatibility, we need to keep the old/legacy TRT plugins with the ONNX domain. Moving forward, all newly added TRT plugins should be registered with the `trt.plugins` domain. 4. TRT plugins don't have an API to get the number of inputs/outputs of the registered plugins, so ORT TRT uses variadic inputs/outputs to bypass the ONNX node validation. 5. Add a new TRT provider option, `trt_extra_plugin_lib_paths`, with which the user can specify any extra plugin libs, for example, `fastertransformer/build/lib/libvit_plugin.so` or `fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`	2023-04-18 20:24:32 -07:00
Justin Chu	cf19c3697d	Run clang-format in CI (#15524 ) ### Description Run clang-format in CI. Formatted all c/c++, objective-c/c++ files. Excluded ``` 'onnxruntime/core/mlas/', 'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/', ``` because they contain assembly or are data heavy ### Motivation and Context Coding style consistency	2023-04-18 09:26:58 -07:00
liqun Fu	919d8f2660	update with onnx main (#14929 )	2023-04-18 08:42:51 -07:00
cao lei	c2221d919f	create a stream in DeviceStreamCollection for memory pattern (#15426 ) ### Description Create a stream in DeviceStreamCollection for memory pattern case to fix the thread safe issue 15154 ### Motivation and Context This is to fix the bug 15154 https://github.com/microsoft/onnxruntime/issues/15154	2023-04-17 10:06:55 -07:00
Maximilian Müller	fbe88fccbd	Exposing new TRT build options (#15089 ) ### Description This will add a few TRT options, some of which are only available on TRT 8.6: - heuristics - sparsity - optimization level (8.6 only) - auxiliary stream (8.6 only) - tactic source selection I am not sure yet which tests I should add for these options. As those are mostly simple TRT flags, I am not sure to what level I should test. For heuristics, something similar to `44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)` should be possible, but for all others we would essentially only be testing whether there is a crash when the option is set. Also, if I forgot some option that would be good to have, feel free to speak up!	2023-04-14 09:47:36 -07:00
Dmitri Smirnov	ce3b4eabd3	Implement Optional Metadata support and C# test support (#15314 ) ### Description Implement Optional Type metadata support in the library. Implement optional support in C# API along with metadata. Implement Sequence, Map, Optional test data support and test execution. Prune tests and provide more details for failing tests in C# code. Note, this PR does not enable running onnx test models in C++. ### Motivation and Context Opset18 optional type support.	2023-04-11 09:41:59 -07:00
cloudhan	71a4e7eb97	Automatically enable tunable op usage for production models (#15156 ) Split the `IsTunbaleOpEnable` semantics into enabling tunable op for using and enabling tunable op for tuning. Both remain disabled in general for safety purposes. But if - the session is created with an ONNX model with embedded tuning results - the embedded tuning results are set to the EP without an error `Status` then we automatically enable using, while tuning remains disabled. The planned options will be - `tunable_op_enable`: The top-level switch of `TunableOp`, indicating whether we will run into `TunableOp` related logic. NOTE: most of our impls have a bottom impl that acts as a fallback and is set as the default. In this case, we still call into the `TunableOp`, but no kernel selection, kernel tuning or caching is involved. This reduces our maintenance burden of a duplicate code path. - `tunable_op_tuning_enable`: The secondary switch of `TunableOp`, indicating whether we will run into the tuning related logic of `TunableOp` Then for the possible future options: - `tunable_op_tuning_max_iteration`: blahblah - `tunable_op_tuning_max_duration_ms`: blahblah - `tunable_op_flash_attention_enable`: blahblah, for example only, we will not have this. The developer-oriented envvars are for developers' convenience, to inspect the performance impact of tuning. So there are only `ORT_ROCM_TUNABLE_OP_ENABLE` and `ORT_ROCM_TUNABLE_OP_TUNING_ENABLE`, to take fine-grained control of the combinations.	2023-04-06 13:52:47 +08:00
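The enablement rule in the tunable op commit message above can be sketched as a small decision function. This is only an illustration of the stated rule, not code from ONNX Runtime: `resolveTunableOpFlags` and the two input fields are hypothetical, and the returned keys simply reuse the planned option names quoted in the message.

```js
// Hypothetical sketch of the rule described in the commit message.
// `hasEmbeddedTuningResults` and `tuningResultsSetWithoutError` are
// illustrative field names, not part of any ONNX Runtime API.
function resolveTunableOpFlags(session) {
  // Both switches stay disabled in general, for safety.
  const flags = { tunable_op_enable: false, tunable_op_tuning_enable: false };

  // Using is enabled automatically only when the model carries embedded
  // tuning results AND they were set on the EP without an error status.
  if (session.hasEmbeddedTuningResults && session.tuningResultsSetWithoutError) {
    flags.tunable_op_enable = true;
  }

  // Tuning is never enabled automatically.
  return flags;
}
```

The point of the split is visible in the return value: a production model with valid embedded results gets the tuned kernels without ever triggering a tuning pass.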
Edward Chen	9f942e1a3e	Graph transformer to ensure unique DQ nodes for QDQ node units (#15145 ) ### Description Add required graph transformer to duplicate DQ nodes to ensure that QDQ node units have unique DQ nodes. This condition is necessary for QDQ node unit processing. ### Motivation and Context There is an existing Python utility that does this: `c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)` This PR implements it as a graph transformer so it is integrated into ORT and does not require a separate step to update the model. There are also tests to ensure that its effects are not undone by basic level graph optimizations.	2023-03-31 08:39:43 +10:00
FFFrog	ecb89ed752	[CANN] Multi-stream execution support for CANN EP. (#14058 ) ### Description Multi-stream execution support for CANN EP. ### Motivation and Context CANN EP is currently unavailable due to the introduction of a new mechanism for multi-stream execution [#13495](https://github.com/microsoft/onnxruntime/pull/13495), the deletion of the Fence-based synchronization mechanism, and the failure to update the relevant logic of CANN EP synchronously. This PR is to fix it.	2023-03-29 11:57:22 -07:00


824 commits