onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-20 21:40:57 +00:00

Author	SHA1	Message	Date
RandySheriffH	8e610f25d8	Implement lite custom op API (#15778 ) Implement a set of new APIs for lightweight custom ops registration, to save efforts from schema-composing. A few highlights: - Support build-time type inference; - Support function-as-op for "stateless" ops; - Support structure-as-op for "stateful" ops; - Support varied input/output forms such as span, scalar, and tensors, either optional or non-optional. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-04 09:49:17 -07:00
Changming Sun	1fb2f2605b	Update VERSION_NUMBER (#15773 ) ### Description 1. Update VERSION_NUMBER for preparing the upcoming release. This PR's commit will not be included in the 1.15 release branch 2. Delete package/rpm/onnxruntime.spec since it was not used in past years. ### Motivation and Context Preparing the release. Fixed [AB#15311](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15311)	2023-05-03 15:07:34 -07:00
Baiju Meswani	ba7b83ff3c	Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776 )	2023-05-03 13:08:35 -07:00
Chen Fu	bc58fd5413	fix compilation error in no absl build (#15769 ) ### Description Fix no-absl build error:	2023-05-02 08:20:49 -07:00
Changming Sun	034698cf6a	Revert "Implement lite custom op API (#15590 )" (#15768 ) This reverts commit `cdf4fc49fc` because it breaks the "debug_node_input_output" build in "Post Merge" pipeline	2023-05-02 01:10:10 -07:00
Ye Wang	391f897983	Bring back SLN cuda kernel and use provider options to switch to standard implementation (#15660 )	2023-05-01 18:35:26 -07:00
cao lei	d58fa9805b	ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice (#15618 ) ### Description ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice ### Motivation and Context Currently “Location” is represented as OrtMemoryInfo, which is OrtDevice + OrtMemType, while OrtDevice is represented as DeviceType + DeviceId + MemType. This hierarchy is unnecessary, so the proposal is to use OrtDevice as the single abstraction for Location. --------- Co-authored-by: Lei Cao <leca@microsoft.com>	2023-05-01 10:06:00 -07:00
RandySheriffH	cdf4fc49fc	Implement lite custom op API (#15590 ) Implement a set of new APIs for lightweight custom ops registration, to save efforts on schema-composing. A few highlights: 1. Support build-time type inference; 2. Support function-as-op for "stateless" ops; 3. Support structure-as-op for "stateful" ops; 4. Support varied input/output forms such as span, scalar, and tensors, either optional or non-optional. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-01 08:45:26 -07:00
Chen Fu	0e9472d391	NHWC graph optimizer (#15724 ) ### Description Augment nhwc graph optimizer to accommodate fp16 operators. ### Motivation and Context A new fp16 conv operator has been added, and it prefers the NHWC data layout. We need to augment the existing graph optimizers to better utilize the new operator.	2023-05-01 08:44:07 -07:00
Chunye Wang@AMD	d35850c142	[VitisAI]Update VitisAI EP to be compatible with VitisAI 3.5 (#15673 ) ### Description Originally the VitisAI EP only worked with an old version of the VitisAI release. ### Motivation and Context Update the VitisAI EP so that it works with the current VitisAI 3.5 and later versions of VitisAI. We try our best to make it forward compatible. --------- Co-authored-by: Wang Chunye <chunywan@xilinx.com> Co-authored-by: mingyue <mingyue@amd.com> Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com> Co-authored-by: liumingyue <mingyue@xilinx.com> Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com> Co-authored-by: shoucair <shoucai.ren@amd.com> Co-authored-by: zz002 <zhenze.wang@amd.com> Co-authored-by: BoarQing <yuz75@Pitt.edu> Co-authored-by: Yueqing Zhang <yueqingz@amd.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2023-05-01 08:28:26 -07:00
Jeff Bloomfield	3df3a85114	Default kOrtSessionOptionsDisableQuantQDQ to 1 when the DML EP is registered (#15725 ) This addresses a performance regression in some INT8 models with the DirectML EP by defaulting OrtSessionOptionsDisableQuantQDQ to 1 when the EP is registered. This regression occurred due to the introduction of the QDQ propagation transformer, which is based on this session option. That transformer maximizes the number of nodes which are executed as quantized by logically propagating quantize operators upstream and dequantize operators downstream. However, it does this simply by inserting QDQ pairs, with an expectation that something will recognize sequences of DQ->Op->Q. This logic and related L2 transformers are not currently enabled for the DirectML EP. This change also removes a noisy warning when the session option for memory pattern is overridden as the DirectML EP is registered.	2023-05-01 08:26:03 -07:00
Chi Lo	6e652d0554	Support explicit TRT profiles from provider options (#15546 ) Previously, the TRT EP set TRT optimization profiles for dynamic shape inputs based on input tensor values, and users could not explicitly specify the profiles. This PR lets users specify min/max/opt profiles through three newly added provider options: `trt_profile_min_shapes`, `trt_profile_max_shapes` and `trt_profile_opt_shapes`, with the format "input1:dim1xdim2...,input2:dim3xdim4...". (Note: It's similar to --minShapes, --maxShapes and --optShapes of trtexec command-line [flags](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags)) For example, if you are using onnxruntime_perf_test, you can try this: `./onnxruntime_perf_test -e tensorrt -r 1 -i "trt_profile_min_shapes|imgs:1x3x384x288 trt_profile_max_shapes|imgs:32x3x384x288 trt_profile_opt_shapes|imgs:16x3x384x288" your_model_path` If the engine cache is enabled, you still need to provide these three explicit provider options in order to use this feature. ORT TRT will compare the min/max/opt profile shapes with the ones saved in the .profile file to decide whether to rebuild the engine. Constraints to use these provider options: (1) Need to specify min/max/opt profile shapes for all the dynamic shape inputs. This feature is also requested by other users: https://github.com/microsoft/onnxruntime/issues/13851	2023-04-30 22:30:26 -07:00
Changming Sun	65020d433e	Prefast fixes for CUDA EP (#15726 ) ### Description 1. Adjust cmake flags. Do not modify CMAKE_CXX_FLAGS globally. Only apply the flags to ORT code. 2. Fix some SDL warnings.	2023-04-29 12:43:12 -07:00
Yuhong Guo	41dcf0d32e	Expose build information in dynamic lib (#15643 ) ### Description 1. Add Build Info API to onnx. 2. Fix compile error while building onnxruntime_benchmark on macOS. ### Motivation and Context 1. When the Onnxruntime lib is serving online, we need a way to detect how the lib was built. This PR helps the developer get the build information using `strings`, such as git branch, git commit id, build type and cmake cxx flags, as shown below. ![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png) ![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png) If the build env has no git, there will be no git-related info: ![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png) 3. Fix the following compile error while building the benchmark on macOS. ![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png) --------- Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-04-28 21:57:31 -07:00
Chen Fu	be08b47e7b	Refine cast optimizer for safety (#15658 ) ### Description Cast optimizer may convert a fp16 node to fp32. This used to be safe as all fp16 kernels had fp32 implementations. As this assumption is no longer true, we need to check the validity of the operation. ### Motivation and Context Main work here is to introduce an API to check whether a kernel is registered. Currently we don't have a way to do that without an operator node. This needs to be augmented. We need to query whether a kernel is registered by its properties only, so that we can judge whether it is safe to construct a node long before we actually do so.	2023-04-28 09:32:54 -07:00
sfatimar	ebaafac3f5	Openvino ep ort 5.0 (#15626 ) ### Description This PR adds VPU support to the OpenVINO Execution Provider. Bug fixes for GPU and CPU. Changes to the OpenVINO backend in the Serialized Model API for faster first inference latency. Deprecates HDDL-VADM and MYRIAD and removes the related code. Supports OpenVINO 2023.0. Dynamic shapes support for iGPU. ### Motivation and Context - VPU is upcoming hardware that can provide AI acceleration for client systems through OpenVINO. --------- Signed-off-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-04-25 20:59:42 -07:00
Baiju Meswani	5885abfb35	Training Documentation (#15612 )	2023-04-25 11:44:12 -07:00
Yulong Wang	14cc02c65c	[js/web] WebGPU backend via JSEP (#14579 ) ### Description This change introduces the following new components into ONNX Runtime Web: - JavaScript Execution Provider (JSEP) - Asynchronous inferencing execution powered by Emscripten's Asyncify - WebGPU backend implemented in TypeScript - initial implementation of kernels: - elementwise operators (22) - binary operators (5) - tensor: Shape, Reshape, Transpose, Gemm - nn: Conv, {Global}Maxpool, {Global}AveragePool The code needs to be polished; still working on it. ## Q&A What is JSEP? > JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime execution provider that works specifically in the Web environment (browsers). JSEP allows JavaScript code to kick in from various places when ONNX Runtime inferences a model. Why JSEP? > JSEP is a hybrid mode EP that contains both C/C++ and TypeScript/JavaScript implementations. There are 2 strong reasons why we introduce JSEP: > 1. the C/C++ part helps JSEP leverage ONNX Runtime's capabilities as much as possible, including graph transformers, optimizers and the ability to fall back to the CPU EP. TypeScript/JavaScript makes the kernel implementation much easier to develop and debug in the browser. > 2. the requirement of asynchronous execution from the JavaScript API (e.g. `buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a synchronous context (see "async problem" section below). This is solved by using Emscripten's Asyncify. What is WebGPU? > WebGPU is the new GPU API that is available in browsers. It's one of the only 2 APIs currently available to access the GPU from a browser (the other is WebGL). > WebGPU is designed with more advanced and stronger features compared to WebGL and is potentially the solution that offers the best GPU performance currently available for model inferencing. What is the async problem and why do we have it? 
> The "async problem" is that you cannot call an async function in a synchronous context. Think about the following C code: > ```c > // C-style declarations (API) > typedef void (ON_COMPLETE)(PVOID state, DATA data); > void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete); > > // implementation > DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) { > // how to implement? > } > ``` > The answer is, it's impossible to implement this function. Usually we try to find a sync version of the API, or launch a thread to call the async function and sync-wait on the main thread. Unfortunately, in the browser environment, neither is possible. > > WebGPU does not offer any synchronous API for data downloading (GPU to CPU). This is the only operation that MUST be async. As `OrtRun()` will eventually call into DataTransfer to copy data from GPU to CPU, and `OrtRun()` is a synchronous function, this cannot be done in the normal way. What is Emscripten? How does the Asyncify feature resolve the problem? > Emscripten is the C/C++ compiler for WebAssembly. It's what we use to compile ORT and generate the WebAssembly artifacts which run in browsers. > > Asyncify is a [compiler feature](https://emscripten.org/docs/porting/asyncify.html) that allows calling async functions from a synchronous context. In short, it generates code to unwind and rewind the call stack to emulate async execution. With this feature, we are able to call the async function inside the `OrtRun()` call. ## Design Overview Inter-op JSEP does pretty much the same thing as any other EP. 
It exposes an interface for inter-op with JavaScript, which is defined in onnxruntime/wasm/js_internal_api.js: ```js // init JSEP Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) { Module.jsepBackend = backend; Module.jsepAlloc = alloc; Module.jsepFree = free; Module.jsepCopy = copy; Module.jsepCopyAsync = copyAsync; Module.jsepCreateKernel = createKernel; Module.jsepReleaseKernel = releaseKernel; Module.jsepRun = run; }; ``` This simple JavaScript snippet defines all the language-boundary functions that JSEP requires to implement kernels and data transfers in JavaScript inside ONNX Runtime: - `jsepBackend`: assign the singleton object to the WebAssembly module - `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc() and Free() - `jsepCopy`: synchronous copy (GPU to GPU, CPU to GPU) - `jsepCopyAsync`: asynchronous copy (GPU to CPU) - `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object maintained in JS to match the lifecycle of the Kernel in ORT - `jsepRun`: OpKernel::Compute() should call into this The abstraction above keeps the connections and dependencies between C/C++ and TypeScript/JavaScript to a minimum. Resource Management: The lifecycle of tensor data and kernels is managed by ORT (C/C++), but the implementation is left to JavaScript. JavaScript code is responsible for implementing the callbacks correctly. For WebGPU, the GPU data is managed by JavaScript using a singleton map (tensor_data_id => GPUBuffer). The GPU pipeline is managed as a singleton. Shaders are managed using a singleton map (shader_key => gpu_program), where shader_key is generated from cache_key (OP specific, including attributes) and input shapes. About data transfer: `js::DataTransfer::CopyTensor` is implemented to call either the synchronous or the asynchronous copy callback, depending on whether the destination is GPU or not. 
Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function to be called in the synchronous context. Run kernel in JS: The Kernel class constructor calls `jsepCreateKernel()` once, with an optional per-kernel serialization to pass attributes into JavaScript. `Compute()` is implemented so that a metadata serialization is performed in a base class and JavaScript code can access the data using the Emscripten-specific builtin macro `EM_ASM_`. Disabled features: Memory pattern is force-disabled, because WebGPU data is not represented by a general memory model (in which a buffer can be represented by offset + size). Concurrent run support is disabled. WebGPU is stateful and it also has async function calls. Supporting concurrent runs would significantly increase the complexity, and we would not get any real benefit from it. Prefer channels last: JSEP prefers channels last and returns `DataLayout::NHWC` from the method `GetPreferredLayout()`. This lets the graph transformers preprocess the graph into a channels-last form so that a more optimized WebGPU shader can be used. Testing code: It's impossible to test JSEP directly because JSEP itself does not contain any kernel implementation. However, it has the kernel registration, which needs to work together with the corresponding JavaScript code. There are unit tests that run onnx models from the JavaScript API. --------- Co-authored-by: Scott McKay <skottmckay@gmail.com>	2023-04-24 15:21:18 -07:00
cao lei	dc53ddef7a	Create a new C API KernelContext_GetAllocator() for Custom Op scenario (#15591 ) ### Description Create a new C API KernelContext_GetAllocator() for Custom Op scenario ### Motivation and Context Create a new C API KernelContext_GetAllocator() for Custom Op scenario	2023-04-23 21:54:35 -07:00
Dmitri Smirnov	a5dec8eedf	[C#] Improve string marshalling and reduce GC pressure (#15545 ) ### Description Reduce the number of auxiliary objects created to reduce GC pressure. Eliminate GCHandle type of memory pinning in most places. Improve string marshalling by allocating unmanaged memory that does not require pinning. Change native methods from `IntPtr` to `byte[]` (marshalling pinning is more efficient). Allocate input/output UTF-8 names in the unmanaged heap for the lifetime of InferenceSession, so we do not keep converting and pinning them on every Run. Introduce a new native API that allows allocating and converting/copying strings directly into a native tensor. The PR delivers around 50% latency improvements and fewer GC pauses. Inspired by: https://github.com/microsoft/onnxruntime/pull/15520 ### Motivation and Context Clients experience GC pressure and performance degradation when dealing with string tensors. Co-Authored-By: @tannergooding	2023-04-20 15:12:51 -07:00
Chi Lo	6115c8fd1f	Add TRT plugins support using custom ops (#13847 ) This PR makes ORT support TRT plugins using custom ops. ORT TRT can automatically register all TRT plugins from the TRT plugins registry as custom ops. There is no code change needed for ORT when new TRT plugins are introduced. The previous way for ORT to support TRT plugins was using contrib ops, but there are some concerns about it: - Contrib ops are shipped as part of the ORT binary by default. TRT related plugins should not be in the default ORT. - Contrib ops are designed for internal ops and developed for cpu and cuda EPs. Therefore, using custom ops is a good approach to support TRT plugins. The following are the major modifications: 1. Add a new `GetCustomOpDomainList` provider api which allows a provider to create its own custom op domain list so that ORT can register this domain list. The provider has the responsibility to free all the custom op domain instances it created. 2. Move the OrtCustomOpDomain struct definition to framework_provider_common.h since this struct is being used by the framework and EPs now. 3. There are several TRT plugins registered as onnx schema ops through contrib ops with the onnx domain. In order not to break old models using those TRT plugins which were registered with the ONNX domain, and to maintain backward compatibility, we need to keep the old/legacy TRT plugins with the onnx domain. Moving forward, all newly added TRT plugins should be registered with the `trt.plugins` domain. 4. TRT plugins don't have an api to get the number of inputs/outputs of the registered plugins, so ORT TRT uses variadic inputs/outputs to bypass the onnx node validation. 5. Add a new trt provider option, `trt_extra_plugin_lib_paths`, so the user can specify any extra plugin lib, for example, `fastertransformer/build/lib/libvit_plugin.so` or `fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`	2023-04-18 20:24:32 -07:00
Justin Chu	cf19c3697d	Run clang-format in CI (#15524 ) ### Description Run clang-format in CI. Formatted all c/c++, objective-c/c++ files. Excluded ``` 'onnxruntime/core/mlas/', 'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/', ``` because they contain assembly or are data heavy. ### Motivation and Context Coding style consistency	2023-04-18 09:26:58 -07:00
liqun Fu	919d8f2660	update with onnx main (#14929 )	2023-04-18 08:42:51 -07:00
cao lei	c2221d919f	create a stream in DeviceStreamCollection for memory pattern (#15426 ) ### Description Create a stream in DeviceStreamCollection for memory pattern case to fix the thread safe issue 15154 ### Motivation and Context This is to fix the bug 15154 https://github.com/microsoft/onnxruntime/issues/15154	2023-04-17 10:06:55 -07:00
Maximilian Müller	fbe88fccbd	Exposing new TRT build options (#15089 ) ### Description This will add a few TRT options, some of which are only available on TRT 8.6: - heuristics - sparsity - optimization level (8.6 only) - auxiliary stream (8.6 only) - tactic source selection I am not sure yet which tests I should add for these options. As those are mostly simple TRT flags, I am not sure to what level I should test. For heuristics, something similar to `44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)` should be possible, but for all the others we would essentially only be testing whether or not there is a crash when the option is set. Also, if I forgot some option that would be good to have, feel free to speak up!	2023-04-14 09:47:36 -07:00
Dmitri Smirnov	ce3b4eabd3	Implement Optional Metadata support and C# test support (#15314 ) ### Description Implement Optional Type metadata support in the library. Implement optional support in C# API along with metadata. Implement Sequence, Map, Optional test data support and test execution. Prune tests and provide more details for failing tests in C# code. Note, this PR does not enable running onnx test models in C++. ### Motivation and Context Opset18 optional type support.	2023-04-11 09:41:59 -07:00
cloudhan	71a4e7eb97	Automatically enable tunable op usage for production models (#15156 ) Split `IsTunbaleOpEnable` semantics into enable tunable op for using and enable tunable op for tuning. They remain disabled in general for safety. But - if the session is created with an onnx model with tuning results embedded - the embedded tuning results are set to the EP without an error `Status` then we automatically enable the using, while tuning remains disabled. The planned options will be - `tunable_op_enable`: The top-level switch of `TunableOp`, indicating whether we will run into `TunableOp` related logic. NOTE: most of our impls have a bottom impl that acts as a fallback and is set as the default. In this case, we still call into the `TunableOp`, but no kernel selection, kernel tuning or caching is involved. This reduces our maintenance burden of a duplicate code path. - `tunable_op_tuning_enable`: The secondary switch of `TunableOp`, indicating whether we will run into the tuning related logic of `TunableOp` Then for the possible future options: - `tunable_op_tuning_max_iteration`: blahblah - `tunable_op_tuning_max_duration_ms`: blahblah - `tunable_op_flash_attention_enable`: blahblah, for example only, we will not have this. The developer-oriented envvars are for developers' convenience, to inspect the performance impact of tuning. So there are only `ORT_ROCM_TUNABLE_OP_ENABLE` and `ORT_ROCM_TUNABLE_OP_TUNING_ENABLE` to take fine-grained control of the combinations.	2023-04-06 13:52:47 +08:00
Edward Chen	9f942e1a3e	Graph transformer to ensure unique DQ nodes for QDQ node units (#15145 ) ### Description Add required graph transformer to duplicate DQ nodes to ensure that QDQ node units have unique DQ nodes. This condition is necessary for QDQ node unit processing. ### Motivation and Context There is an existing Python utility that does this: `c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)` This PR implements it as a graph transformer so it is integrated into ORT and does not require a separate step to update the model. There are also tests to ensure that its effects are not undone by basic level graph optimizations.	2023-03-31 08:39:43 +10:00
FFFrog	ecb89ed752	[CANN] Multi-stream execution support for CANN EP. (#14058 ) ### Description Multi-stream execution support for CANN EP. ### Motivation and Context CANN EP is currently unavailable due to the introduction of a new mechanism for multi-stream execution [#13495](https://github.com/microsoft/onnxruntime/pull/13495), the deletion of the Fence-based synchronization mechanism, and the failure to update the relevant logic of CANN EP synchronously. This PR is to fix it.	2023-03-29 11:57:22 -07:00
Scott McKay	eb8f6c7c52	Transpose optimizer enhancements (#15117 ) ### Description - Add debug infrastructure to dump out model at various stages of transpose optimization. - Handle more scenarios where Transpose -> Reshape can be merged. - Run L1 optimizers after layout transform to constant fold initializers that had their layout changed. - Use cost check for Concat post layout transform as pushing a Transpose through it can potentially add Transpose nodes to multiple other inputs. - Update internal testing EP to support test where you want it to take all nodes, use NHWC layout, and to use dummy static kernels instead of compiling so the ops in the graph post-initialization can be counted. - Misc cleanup in InferenceSession to not unnecessarily pass args to TransposeGraph for class members. ### Motivation and Context Address perf issue seen with model where a Transpose gets blocked by a Reshape that could have been treated as a Transpose. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-03-28 08:28:17 +10:00
Nat Kershaw (MSFT)	3064fa7611	Fix C API docs error (#15216 )	2023-03-27 14:34:18 -07:00
Dmitri Smirnov	2de15c5d50	Re-work OrtApi struct to satisfy C++20 compilers (#15183 ) ### Description Remove `deletion` of copy functions from `OrtApi` as its initialization no longer compiles in C++20. Introduce a non-copyable member to implicitly delete copy ctor. ### Motivation and Context Inspired by https://github.com/microsoft/onnxruntime/pull/14901 Solution credits: @RyanUnderhill Cc: @georgthegreat	2023-03-24 13:52:17 -07:00
Chi Lo	c964da7ea2	FasterTransformer model wrapper using custom op (#15013 ) ### Description We are introducing the FasterTransformer model-level integration using the ORT [custom op runtime wrapper](https://github.com/microsoft/onnxruntime/pull/13427). In order to make the FT wrapper/integration work, two things need to be done: - New API `KernelInfoGetConstantInput_tensor`. (Done in this PR) During custom op kernel initialization, it needs to get the model weights (saved as the node's constant inputs) ready for FT's weights instantiation. That's why we need to add this new API to make kernel info capable of getting constant inputs. - Custom op and custom op kernel to wrap the FT model. (Will be provided in onnxruntime extensions or inference examples) During custom op kernel initialization, it can fetch attributes from kernel info to determine which kind of FT model instance to create. During custom op kernel compute/inference, it can get input/output from the kernel context and then assign input/output buffers for the model instance to run.	2023-03-20 09:05:30 -07:00
Adrian Lizarraga	e42f7487df	Add logging APIs for custom operators (#14416 ) ### Description Add logging APIs for custom ops. This PR introduces a `OrtLogger` type, which can be retrieved from a `OrtKernelInfo` or `OrtKernelContext`. The kernel info's logger is the session logger stored in the execution provider. The kernel context's logger is a run logger. ### Motivation and Context Allows custom ops to log information in a manner consistent with built-in ops. Example usage in custom op: ```C++ struct MyCustomKernel { MyCustomKernel(const OrtApi& api, const OrtKernelInfo* info) { Ort::ConstKernelInfo kinfo(info); this->logger_ = kinfo.GetLogger(); // ... ORT_CXX_LOGF_NOEXCEPT(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_ERROR, "Error: %s", err_msg); } void Compute(OrtKernelContext* context) { ORT_CXX_LOG(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_VERBOSE, "Calling compute..."); // ... } // ... private: Ort::Logger logger_; }; ```	2023-03-17 15:05:28 -07:00
wejoncy	028c2372fa	remove disable_cpu_soft temporarily	2023-03-15 13:23:56 +08:00
JiCheng	8383a54f9d	Update include/onnxruntime/core/providers/nnapi/nnapi_provider_factory.h Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-03-15 13:23:56 +08:00
wejoncy	3873a55bd3	[NNAPI] fix feature_level query	2023-03-15 13:23:56 +08:00
Maximilian Müller	ad4db12699	TensorRT EP - timing cache (#14767 ) ### Description This will enable a user to use a TensorRT timing cache based on #10297 to accelerate build times on a device with the same compute capability. This will work across models as it simply stores kernel runtimes for specific configurations. Those files are usually very small (only a few MB), which makes them very easy to ship with an application to accelerate the build time on the user end. ### Motivation and Context Especially for workstation use cases, TRT build times can be a roadblock. With a few models from the ONNX model zoo I evaluated speedups when a timing cache is present. `./build/onnxruntime_perf_test -e tensorrt -I -t 5 -i "trt_timing_cache_enable|true" <onnx_path>`
| Model | no Cache | with Cache |
| ------------- | ------------- | ------------- |
| efficientnet-lite4-11 | 34.6 s | 7.7 s |
| yolov4 | 108.62 s | 9.4 s |
To capture this I had to modify the onnxruntime_perf_test. The time is sometimes not captured within "Session creation time cost:", which is why I introduced "First inference time cost:". --------- Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>	2023-03-10 09:02:27 -08:00
Xavier Dupré	5930e7e22f	Introduce RemovableAttributes (#14868 ) ### Description TreeEnsemble* kernels fully copy all the parameters from the onnx graph. Even if they are no longer needed or unused (hitrates), they remain in memory. For big models (>= 200 trees, max_depth > 10), the model usually weighs more than 10 Mb. This change offers a kernel the possibility to remove all unneeded attributes after they were used to create the session. Attributes are deleted after the model was possibly saved, at the end of session creation. The current design is open to debate: * it stores the list of removable attributes in class `onnxruntime::Node`, * the node is marked as `const` every time this implementation needs to register the name of a removable attribute or to remove them. The current implementation is just a POC as it needs to cast `onnxruntime::Node` into `const onnxruntime::Node`. Should we keep the list of removable attributes in `onnxruntime::Node`? ### Motivation and Context Motivation is mostly to reduce memory consumption. --------- Signed-off-by: xadupre <xadupre@microsoft.com>	2023-03-07 12:37:12 +01:00
Dmitri Smirnov	8d87fdcfa1	Add GetVersionString API for C++, C# and Python (#14873 ) ### Description Added APIs. ### Motivation and Context Addresses https://github.com/microsoft/onnxruntime/issues/14584 Cc: @Craigacp	2023-03-02 17:11:07 -08:00
Hector Li	c6074f3a4b	OnnxRuntime QNN EP (#14791 ) ### Description Integrate Qualcomm QNN SDK to enable inference on QC hexagon NPU devices ### Motivation and Context Enable Ort inference on QC hexagon NPU devices. --------- Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>	2023-03-01 13:48:20 -08:00
Scott McKay	b7fde84341	Changes to support standalone custom ops in a minimal build. (#14497 ) ### Description Changes to support standalone custom ops in a minimal build. Also incorporates changes from #14492 (needed to test builds prior to that being checked in). We first need to save the schema info from the operators used by the standalone op invoker in the ORT format model. Add mechanism for that. Merge the kernel lookup logic so the same is used in full and minimal build. NOTE: the version matching is now consistent with all other kernel lookups, and the call to CreateOp MUST use the exact version for the operator. Previously matching wasn't as strict, but this can lead to the incorrect kernel being chosen. Add tests. NOTE: There is currently no way to detect the ops/types/opsets used inside these custom ops as they don't exist until we create kernels, which is after model loading completes (which is the point the ORT format model is saved). Due to that they have to be manually added to the configuration used to do the reduced ops build. That shouldn't be too hard for the custom op author to add given the custom op implementation is specifying the op, opset and type constraints (i.e. they have the info and it's just a case of capturing/formatting it correctly). ### Motivation and Context Enable usage of the standalone op invoker by custom ops in a minimal build. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-03-01 11:22:54 +10:00
James Yuzawa	d925055a3e	Fix broken and outdated links in documentation (#14092 ) ### Description I fixed some broken links in the C API documentation, but then did a quick pass over all of the links I could find and then fixed those. ### Motivation and Context I got some 404's when exploring the documentation and wanted to fix it.	2023-02-23 10:48:04 -08:00
Sheil Kumar	1b7f65437e	Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442 ) Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP Opset 11 introduced the following sequence related operators: - SequenceAt - SequenceConstruct - SequenceEmpty - SequenceLength - SequenceErase - SequenceInsert - ConcatFromSequence With the exception of ConcatFromSequence, all of the above operators were implemented with CPU kernels that a) required all of the contained tensors to also be on CPU, and b) would clone each tensor into a new sequence as a side effect of each operator. The implementation of sequences is backend agnostic, as they don't affect actual tensor layout or manipulate the contents of the tensors. In addition, with the exception of SequenceAt, the other operators need not make copies of the underlying referenced tensors. Consequently, this change does the following: 1) Sequence* operators (except SequenceAt) no longer copy the contents of a sequence of tensors on every kernel execution. 2) SequenceAt uses the DataTransferManager to copy tensors agnostic to backend. 3) The internal container implemented by TensorSeq has changed from onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor does not support copy or assignment construction, so it must have a singular owner. However, if the same tensor participates in multiple containers it would have multiple container "owners" and this would not be possible. 4) Other code that accessed values from TensorSeq has associated changes to extract Tensors from OrtValues now. In addition, DirectML execution was very slow when the above Sequence operators were added to a graph, as this caused MemcpyToHost and MemcpyFromHost kernels to be inserted between the graph and the sequence operators. To optimize DirectML, 1) The CPU implementations for the Sequence* ops were registered as DML implementations. Since the above changes also include making the CPU kernel implementations EP agnostic, the CPU kernels can be added as is. 2) The ConcatFromSequence operator needed to be implemented on DirectML. However, there was little DirectML EP operator framework support for operators that accept/output sequences of tensors. This change has modified the internal COM interfaces to include new APIs to interrogate for sequence shapes, and extract the needed tensors from TensorSeq. --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-02-21 18:08:28 -08:00
Christian Veenhuis	9fbb2b4742	Fix broken link in onnxruntime_c_api.h (#14748 ) ### Description Fix the broken link in header file onnxruntime_c_api.h w.r.t. the graph optimization levels (line 300). ### Motivation and Context This fix solves open issue #14741	2023-02-21 15:07:06 -08:00
Yuriy Chernyshov	973aaf110b	Improve compatibility with certain STL's We use customized libc++ which uses raw pointers as std::vector::iterators. As per [expr.pre.incr](https://eel.is/c++draft/expr.compound#expr.pre.incr), builtin `operator++` can only be applied to an lvalue, while `std::vector::begin()` returns an rvalue. See [this](https://godbolt.org/z/d3a1aKTWP) godbolt snippet for the details.	2023-02-21 14:06:16 -08:00
Dale Phurrough	68db1b62a8	add noexcept to `InitApi()` and `GetApi()` (#13869 ) ### Description * add noexcept to `InitApi()` and `GetApi()` ### Motivation and Context * fixes microsoft/onnxruntime#12581	2023-02-15 16:49:16 -08:00
cao lei	50fa151298	remove device_id parameter out of ExecutionProvider::GetAllocator() (#14580 ) ### Description Remove the parameter device_id out of ExecutionProvider::GetAllocator() function ### Motivation and Context The parameter device_id is not necessary. We can fully rely on the second parameter OrtMemType mem_type to determine the device_id when getting allocator from executionProvider.	2023-02-13 10:01:07 -08:00
cloudhan	9bd022b8be	Add TuningContext for TunableOp (#14557 ) This makes the TunableOp tuning results state free and will allow us to dump and load offline tuning results.	2023-02-10 14:27:43 +08:00
Maximilian Müller	e9ab56fa64	Adding RunOptions synchronization behaviour to C/C++ API (#14088 ) ### Description This exposes the already existing interface for asynchronous work of all CUDA-based EPs (CUDA + TensorRT). ### Motivation and Context This is something requested in #12216. It will enable users to build an efficient data pipeline with ONNXRuntime and CUDA pre-/post-processing. PCI traffic to the CUDA device can be run during inference as soon as the postprocessing has consumed the input buffer and it can be overwritten. To do this, work has to be submitted asynchronously to the device. Please see the screenshots below illustrating this using NSight Systems. Async: <img width="1401" alt="image" src="https://user-images.githubusercontent.com/44298237/209894303-706460ed-cbdb-4be2-a2e4-0c111ec875dd.png"> Synchronous: <img width="1302" alt="image" src="https://user-images.githubusercontent.com/44298237/209894630-1ce40925-bbd5-470d-b888-46553ab75fb9.png"> Note the gap between the 2 inference runs, due to issuing PCI traffic in between and to the CPU overhead of the active synchronization. --------- Co-authored-by: Chi Lo <chi.lo@microsoft.com>	2023-02-07 19:59:28 -08:00

803 commits