### Description
I misunderstood how UpdateCUDAProviderOptions and
UpdateTensorRTProviderOptions work in the C API: I had assumed they
updated the existing options struct, but they actually re-initialize the
struct to its defaults and then apply only the values in the update. I've
rewritten the Java bindings for those classes so that they aggregate all
the updates and apply them in one go. I also updated the C API
documentation to note this behaviour. I haven't checked whether any of
the other providers with an options struct behave the same way; we only
expose CUDA and TensorRT's options in Java.
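For illustration, here's a minimal C++ sketch of the aggregated-update pattern against the C API; the option keys shown are just examples:
```
// Sketch: collect every key/value pair first, then issue a single
// UpdateCUDAProviderOptions call, since each call resets the struct to
// defaults before applying the keys it is given.
#include <onnxruntime_cxx_api.h>
#include <array>

void ConfigureCuda(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_options));

  std::array<const char*, 2> keys{"device_id", "arena_extend_strategy"};
  std::array<const char*, 2> values{"0", "kSameAsRequested"};
  Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_options, keys.data(), values.data(), keys.size()));

  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_CUDA_V2(session_options, cuda_options));
  api.ReleaseCUDAProviderOptions(cuda_options);
}
```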
There's a small unrelated update to add a private constructor to the
Fp16Conversions classes to remove a documentation warning (they
shouldn't be instantiated anyway as they are utility classes containing
static methods).
### Motivation and Context
Fixes #20544.
### Description
Remove the excess trailing semicolon from a specific macro.
### Motivation and Context
I am preparing automatic generation of onnxruntime bindings for Perl,
and the parser (ucpp) breaks on the "double semicolon" error on the
lines where the macro is applied.
Fix
onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637:
error: argument 'session' of command @param is not found in the argument
list of
```
OrtApi::AddExternalInitializersFromFilesInMemory(
OrtSessionOptions *options,
const char *const *external_initializer_file_names,
char *const *external_initializer_file_buffer_array,
const size_t *external_initializer_file_lengths,
size_t num_external_initializer_files)
```
Bump up version in main from 1.18.0 to 1.19.0 since the release branch
has been cut.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Introduce memory-efficient topo sort (for training)
~~and lazily initialize the Priority-Based and Memory-Efficient topo
sorts. Because in most cases they are not needed, this avoids the
overhead of GraphViewer construction for most use cases.~~
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Background:
A user saves a large model with the initializer data in an external file, e.g.:
onnx.save_model(onnx_model, "path/to/save/the/model.onnx", save_as_external_data=True, all_tensors_to_one_file=True,
location="filename", size_threshold=1024).
In that case, Ort loads the model, gets the external initializer information (external file name, offset, length), uses the model path to find the external file, and locates the tensor data via the offset and length.
But this doesn't work if the user loads the model from memory, since Ort loses track of the model path.
This PR adds an API/session option that lets the user provide a table keyed by the external initializer file name, with the pointer to the loaded external file in memory and the buffer length as the value. As a result:
1. The user can load the model from a memory buffer with the external initializers in memory buffers too.
2. The initializers can be shared across sessions, for different EPs.
3. The user can load the file in any way they want, e.g. mmap.
Internally:
1. At session creation time, Ort goes through the external initializers in the graph and gets each one's file name, offset, and data length from its TensorProto.
2. With the file name, Ort gets the in-memory file buffer and its length from the table the user provided.
3. Ort locates the tensor buffer within the user-provided in-memory file buffer using the offset and data length from the TensorProto.
4. Ort creates the Tensor and replaces the existing Tensor in the graph.
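As a rough usage sketch (the file name and buffer variables are placeholders), the new API can be called like this:
```
// Sketch: register an in-memory copy of the external-data file so a model
// loaded from a memory buffer can resolve its external initializers.
#include <onnxruntime_cxx_api.h>

void AddExternalFiles(Ort::SessionOptions& session_options,
                      char* weights_buffer, size_t weights_length) {
  const OrtApi& api = Ort::GetApi();
  // The name must match the "location" recorded in the TensorProto external_data.
  const char* file_names[] = {"filename"};
  char* file_buffers[] = {weights_buffer};  // file contents already loaded, e.g. via mmap
  const size_t file_lengths[] = {weights_length};
  Ort::ThrowOnError(api.AddExternalInitializersFromFilesInMemory(
      session_options, file_names, file_buffers, file_lengths, 1));
}
```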
### Motivation and Context
https://github.com/onnx/onnx/blob/main/docs/ExternalData.md
For a model with external data, the TensorProto may keep initializer data in a separate file. The external file location is stored as a file path relative to the model path. With the API that loads a model from a memory buffer, the model path is lost, so loading a model with external data fails. By adding a session option to set the external data buffers, Ort can find the external data correctly when the model is loaded from a memory buffer.
Add a provider option to let the user provide the profiling file path.
Separate out the profiling level for ETW, to handle cases where ETW is enabled when Ort creates the QNN profiling but gets disabled by the time Ort logs the profiling events, and vice versa. Enhance the logic that decides the profiling level.
### Description
<!-- Describe your changes. -->
The first call to Graph::Resolve occurs when creating the Graph instance
when loading an existing model from ModelProto. As the Node instance
will exactly match the source NodeProto there's no need to call
Node::ToProto in this case.
Add a temporary reference to the original NodeProto to avoid the call on
the first Graph::Resolve.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Better alternative to #19469
### Description
For C++ standards >= 20, use `std::chrono::operator<<` in place of
`date::operator<<` to fix ambiguous operator compile error.
### Motivation and Context
The external dependency HowardHinnant/date has a conflict with
std::chrono for >=C++20.
Solves #20137
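A minimal sketch of the idea (not the actual ORT code), assuming a toolchain where the C++20 `<chrono>` stream operators are available:
```
#include <chrono>
#include <iostream>
#if __cplusplus < 202002L
#include "date/date.h"
#endif

void PrintTimestamp(std::chrono::system_clock::time_point tp) {
  // Pick the stream operator by language standard to avoid the ambiguity
  // between the date:: and std::chrono:: overloads.
#if __cplusplus >= 202002L
  using std::chrono::operator<<;
#else
  using date::operator<<;
#endif
  std::cout << std::chrono::floor<std::chrono::milliseconds>(tp) << '\n';
}
```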
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Certain graph transformers slow down session loading while providing no
runtime perf benefit.
Allow clients to exclude them.
### Description
The dml_provider_factory header file can't be used in C programs as it
defines C++ inline operators. This PR rearranges that header file so
that it looks like valid C when used from C, and also makes a couple of
small modifications to the Java code so it correctly binds to the DML EP
at build time.
I'm having some difficulty testing it as I think it's pulling in the old
version of DirectML on my computer and I can't figure out what the
library loading path is in Java to make it look at the recent version I
downloaded. So the test I added fails with:
```
InferenceTest > testDirectML() FAILED
ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Exception during initialization: <path-to-ort>\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(518)\onnxruntime.dll!00007FFF74819333: (caller: 00007FFF74793509) Exception(3) tid(4f58) 80070057 The parameter is incorrect.
at app//ai.onnxruntime.OrtSession.createSession(Native Method)
at app//ai.onnxruntime.OrtSession.<init>(OrtSession.java:74)
at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:236)
at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:221)
at app//ai.onnxruntime.InferenceTest.openSessionSqueezeNet(InferenceTest.java:1961)
at app//ai.onnxruntime.InferenceTest.runProvider(InferenceTest.java:665)
at app//ai.onnxruntime.InferenceTest.testDirectML(InferenceTest.java:657)
```
But it does correctly compile, and this error seems very similar to
other issues with the DML provider when it doesn't like a model due to
the loaded library being old. The test is using the squeezenet file
that's been in the repo since 2019. If someone can help me figure out
how to get the right version of DML in the library path I can test it
more on my end. I tried adding the folder with the new version into the
system path, but I'm not very familiar with Windows' library loading
behaviour.
### Motivation and Context
Fixes #19656 to allow use of the DirectML EP from ORT Java.
cc @martinb35
### Description
New `dump_om_model` flag for the **CANN EP**, which defaults to "True".
### Motivation and Context
When building an ONNX model with the CANN EP, the intermediate **OM (offline
model for Ascend NPU)** is automatically saved. Some users don't want to
dump the OM when resources are limited.
This PR resolves that with `dump_om_model=False`
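A hypothetical usage sketch, assuming the CANN EP exposes the V2-style provider-options C API (CreateCANNProviderOptions / UpdateCANNProviderOptions) and accepts the flag as a string key/value; the exact value spelling may differ:
```
// Sketch: turn off OM dumping for the CANN EP via its provider options.
#include <onnxruntime_cxx_api.h>

void AppendCann(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtCANNProviderOptions* cann_options = nullptr;
  Ort::ThrowOnError(api.CreateCANNProviderOptions(&cann_options));
  const char* keys[] = {"dump_om_model"};
  const char* values[] = {"False"};
  Ort::ThrowOnError(api.UpdateCANNProviderOptions(cann_options, keys, values, 1));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_CANN(session_options, cann_options));
  api.ReleaseCANNProviderOptions(cann_options);
}
```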
### Description
<!-- Describe your changes. -->
Add API functions GetAliasMap and ReleaseAliasMap to OrtCustomOp
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add API functions GetAliasMap and ReleaseAliasMap to OrtCustomOp
### Description
<!-- Describe your changes. -->
Use OrtCustomOp's new GetMayInplace API in CreateKernelCreateInfo to
hook up the in-place map of custom ops
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR uses OrtCustomOp's new GetMayInplace API in
CreateKernelCreateInfo to hook up the in-place map of custom ops
### Description
Expose Reserve() in OrtAllocator to allow custom allocators to work when
session.use_device_allocator_for_initializers is specified.
Update: this change has been verified by Bing Ads and brings a
significant benefit in terms of memory utilization: 30GB less memory and
also better CPU utilization.
### Motivation and Context
https://microsoft-my.sharepoint.com/:w:/p/prs/Eeidf5YNtWtKrPVkfuTDsuABak1oL4QRpuBGuhqRbLKoJg?e=Zl3bah
### Description
Address build issues and source code discrepancies.
Fix cuda_test_provider gtest argument stack corruption.
### Motivation and Context
The `OpTester` class that is widely used for kernel testing is not
suitable for testing internal classes of EPs that are built as shared
objects.
Currently, CUDA EP tests run only on Linux.
We want to enable testing and development on Windows,
and create a usable pattern for testing other EPs' internals.
Alternatives considered:
Abstracting EP unit tests into a separate test executable such as
`onnxruntime_test_all`.
This alternative was rejected as it would create many more changes to
the established patterns,
and potentially interfere with CUDA functionality through more complex
source code maintenance.
### Description
<!-- Describe your changes. -->
This change addresses the following issues with the current CustomOP
output type inference:
- The function does not take optional inputs into account. When an input
is absent, inference is silently aborted and no output type is inferred
(P1 customer issue).
- Inferring the output type from the input type for multi-kernel custom
ops is done based on the last kernel definition in the sequence. No
attempt is made to match the kernel based on the input type.
- Inference is aborted when variadic inputs/outputs are detected and
the generated input/output names fail to obtain type constraints. This
is not immediately clear from the code, because the custom op schema is
not available within the inference function.
- No error reporting.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Most custom ops lack their own type and shape inference function, as it
was only recently introduced. For that reason, it is important to fix this.
This change is inspired by a customer issue.
This is a follow up on:
- https://github.com/microsoft/onnxruntime/pull/15184
- https://github.com/cbourjau/ort-custom-op/pull/11
- https://github.com/microsoft/onnxruntime-extensions/issues/451
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers
### Motivation and Context
This PR includes changes that make ORT handle 2GB+ checkpoints.
To do that we need to upgrade Flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945
- Modified the commitHash and the hash for the new version
- Removed the patch for the Rust generator's unused-variable warning, as
it no longer produces this - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.
---------
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
### Description
<!-- Describe your changes. -->
Add 2 C APIs for ORT extensions:
- KernelInfo_GetAllocator
- OrtCustomOp::GetMayInplace
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add 2 C APIs for the ORT extensions project, which will leverage them
for the GroupQueryAttention custom op.
### Description
<!-- Describe your changes. -->
Add a new API, KernelContext_GetScratchBuffer, to get a scratch buffer
from the kernel context.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add a new API, KernelContext_GetScratchBuffer, to get a scratch buffer
from the kernel context. It will be used in the ORT extensions project
for the GroupQueryAttention custom op.
### Description
<!-- Describe your changes. -->
1. Add a config key in run_options to control CUDA graph usage at runtime.
2. Enhance the CUDA graph class to support saving and retrieving multiple
graphs in one ORT session.
3. Provide a model modification/inference example on Phi-2.
4. Benchmarks show an average 13% latency reduction in token generation.
Limitation: the TRT EP and ROCm EP haven't adopted this feature yet; we
can revisit this in the future.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Add CUDA NHWC support for SpaceToDepth and DepthToSpace.
- Add a new test which verifies that SpaceToDepth swizzling for the H
axis is correct.
- If CUDA NHWC is enabled, run all tests on the CUDA EP with NHWC as
well.
### Motivation and Context
Add more NHWC operations to avoid layout transformations and improve
efficiency when using the CUDA EP.
### Description
<!-- Describe your changes. -->
Address warnings so all the ORT projects build with /W4 on Windows.
Mainly
- unused parameters
- variables shadowing other ones
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19588 started on this.
### Description
Implement IsInf-10 and IsInf-20 for CUDA.
Also add FP16 types on CPU.
### Motivation and Context
Certain models lag in performance because IsInf is not available on CUDA.
### ONNX Gelu Op in Opset 20
Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op
1. Move the CPU Gelu implementation from
`onnxruntime/contrib_ops/cpu/activations.h/cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for the approximate attribute 'none'.
2. Duplicate some logic from
`onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` into
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for the approximate attribute 'tanh'.
3. Register the ONNX domain Gelu CPU kernel from opset 20 in
`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`.
4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to
`onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and
`onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu`
respectively, as the implementation for the approximate attribute
'tanh'.
5. Implement the logic for the approximate attribute 'none' in
`onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`.
6. Register the ONNX domain Gelu CUDA kernel from opset 20 in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
7. ROCm EP related changes.
8. Enrich the tests for ONNX domain Gelu in
`onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.
### Description
Currently, the QNN HTP performance mode is set during session creation and there's no way to change it afterwards. There is a requirement to set a high performance mode for high-priority requests and set it back to a low performance mode later, to save power when incoming requests are idle, for example.
Now, the performance mode is still kept at the session level in the QNN EP options and is used as the default; Ort QNN EP sets it once if the user sets it.
In addition, there are settings (qnn.htp_perf_mode and qnn.htp_perf_mode_post_run) in the run options to change the performance mode before and after a session run. A recommended scenario is for the user to set a high performance mode before the inference run so the result comes back ASAP, and a low performance mode after the inference to save power.
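For illustration, a minimal sketch of the new run-option keys; the mode value strings are assumed to follow the QNN EP's existing htp_performance_mode values:
```
// Sketch: request a high performance mode for this run and drop back to a
// power-saving mode once the run completes.
#include <onnxruntime_cxx_api.h>

Ort::RunOptions MakeBurstRunOptions() {
  Ort::RunOptions run_options;
  run_options.AddConfigEntry("qnn.htp_perf_mode", "burst");                     // applied before the run
  run_options.AddConfigEntry("qnn.htp_perf_mode_post_run", "low_power_saver");  // applied after the run
  return run_options;
}
```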
### Description
<!-- Describe your changes. -->
Adds infrastructure to create an ML Package containing the Model using
ML Program. Updated coremltools files to v7.1 to bring in new protobuf
definitions along with the tools to write the weight.bin file and create
an ML Package correctly.
Enables building a CoreML Model on all platforms which means all the
operator builder code can be debugged anywhere. Execution of the
generated CoreML model is obviously limited to Apple platforms.
The Conv operator builder has been updated to be able to generate an ML
Program Operation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
NeuralNetwork is no longer being developed and ML Program is the
replacement going forward.
### Description
<!-- Describe your changes. -->
An overridable initializer should not have a fixed value included in an
NNAPI model as it could be changed at runtime. The current check doesn't
include validating that the initializer is constant.
I was updating GetClipMinMax as part of adding CoreML EP ML Program
support, and this set of changes was required to make both CoreML and
NNAPI do the more correct thing of using IsConstantInitializer.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make NNAPI and CoreML EPs more correct.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
[TF32](https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/)
can help boost performance on GPUs with SM >= 80. Sometimes a user
observes accuracy loss, or needs to disable TF32 for testing purposes. To
disable TF32, it is also possible to set the environment variable
`NVIDIA_TF32_OVERRIDE=0`. However, sometimes we do not want to use an
environment variable, to avoid impacting other applications, or we want
finer control (like one session using TF32 and another session not). This
provider option helps with that.
Here we add a provider option `use_tf32`. When `use_tf32 = 0`, we disable
TF32 for float MatMul/GEMM in cuBLAS. It applies to the MatMulNBits,
Attention, LongformerAttention, PackedAttention, and
PackedMultiHeadAttention operators when a float GEMM is used internally
in the operator. Note that it does not impact other data types; for
example, FP8 GEMM could still use TF32 in accumulation.
Previously, cublasGemmStridedBatchedHelper did not use TF32 in
inference. Here we enable TF32 by default, so we might observe a speedup
for FP32 transformer models on SM >= 80.
There is another PR that enables the option for cuDNN Conv later.
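A minimal sketch (assumed usage) of disabling TF32 for one session via the new provider option:
```
#include <onnxruntime_cxx_api.h>

void DisableTf32(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_options));
  const char* keys[] = {"use_tf32"};
  const char* values[] = {"0"};  // disable TF32 for float MatMul/GEMM in cuBLAS
  Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_options, keys, values, 1));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_CUDA_V2(session_options, cuda_options));
  api.ReleaseCUDAProviderOptions(cuda_options);
}
```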
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/15407
https://github.com/microsoft/onnxruntime/issues/19288
### Description
<!-- Describe your changes. -->
Refactor the VAIEP to use MSFT's standalone API
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
### Description
<!-- Describe your changes. -->
Make the EP member function GenerateMetaDefId a standalone function,
decoupling it from the EP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change is part of the ExecutionProvider API refactoring; we will
make a clean ExecutionProvider API first for the later EPv2 work
### Description
This PR adds an SbgemmKernel for aarch64. This includes the Sbgemm kernel
implementing matrix multiplication with bfloat16 SIMD instructions
(bfmmla), and MatMul operator changes to invoke the Sbgemm kernel. To
enable the Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"
The PR also adds new test cases for mlas and ort.
### Motivation and Context
This is to improve MatMul performance on aarch64 platform.
I have run the below benchmarking script (BERT, RoBERTa and GPT-2 model
inference) on an AWS Graviton3-based c7g.4xl instance and observed a
1.2x-1.76x performance improvement compared to the sgemm (fp32) kernel
performance.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
The unit test precision results match the sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync `
### Description
- Adds the following session options to configure the device:
- `soc_model`: The SoC model number. Refer to the QNN SDK documentation
for valid values. Defaults to "0" (unknown).
- `htp_arch`: The minimum HTP architecture the driver will use to select
compatible QNN operators.
- `device_id`: The ID of the device to use when setting 'htp_arch'.
Defaults to "0" (for single device).
### Motivation and Context
Allow more configuration.
Several changes:
1. To align with how other EPs set EP context configs in session
options, for example [QNN
EP](https://github.com/microsoft/onnxruntime/pull/18877), EP context
configs for the TRT EP can be configured through (see the sketch after this list):
   1. Session options: `ep.context_enable`, `ep.context_file_path` and
   `ep.context_embed_mode`
   2. Provider options: `trt_dump_ep_context_model`,
   `trt_ep_context_file_path` and `trt_dump_ep_context_embed_mode`
   3. The above settings have a 1:1 mapping, and the provider options take
   priority over the session options.
```
Please note that there are rules for using following context model related provider options:
1. In the case of dumping the context model and loading the context model,
for security reason, TRT EP doesn't allow the "ep_cache_context" node attribute of EP context node to be
the absolute path or relative path that is outside of context model directory.
It means engine cache needs to be in the same directory or sub-directory of context model.
2. In the case of dumping the context model, the engine cache path will be changed to the relative path of context model directory.
For example:
If "trt_dump_ep_context_model" is enabled and "trt_engine_cache_enable" is enabled,
if "trt_ep_context_file_path" is "./context_model_dir",
- if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir"
- if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir"
```
2. The user can decide the naming of the dumped "EP context" model by using
`trt_ep_context_file_path`; please see GetCtxModelPath() for more
details.
3. Added suggested comments from
https://github.com/microsoft/onnxruntime/pull/18217
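A minimal sketch of the session-options route described in item 1 (the file path is an example):
```
#include <onnxruntime_cxx_api.h>

void EnableEpContextDump(Ort::SessionOptions& session_options) {
  session_options.AddConfigEntry("ep.context_enable", "1");
  session_options.AddConfigEntry("ep.context_file_path", "./context_model_dir/model_ctx.onnx");
  session_options.AddConfigEntry("ep.context_embed_mode", "0");  // 0: engine cache kept beside the model
}
```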
Fix an issue where the generated context cache model's inputs/outputs order is not guaranteed
### Description
Currently, the QNN EP generates the context cache model in the Compile() method, which only has access to the partitioned graph; the inputs/outputs order for the partitioned graph is not guaranteed, and the EP doesn't have a view of the user's input model. The context cache model generation has to move to a higher level, into GraphPartitioner, which has a view of the partitioned model.
This is also a breakdown of the PR for multi-partition support:
https://github.com/microsoft/onnxruntime/pull/18865
### Description
<!-- Describe your changes. -->
Bump up version to 1.18.0 since the release branch has been cut.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
### Description
<!-- Describe your changes. -->
Add a new option `trt_engine_cache_prefix` to customize the TRT EP engine
cache prefix, i.e.:
- If the user specifies `trt_engine_cache_prefix|FRCNN
trt_engine_cache_enable|true` when running an FRCNN model, the cache will
be saved/loaded as `FRCNN_2068723788287043730_*_sm80.engine`. The engine
profile follows the same pattern.
- If this option is skipped, the engine will be saved/loaded as
`TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_2068723788287043730_*_*_sm80.engine`
by default.
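A minimal sketch (assumed usage) of setting the prefix together with engine caching through the TensorRT provider options:
```
#include <onnxruntime_cxx_api.h>

void ConfigureTrtCache(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));
  const char* keys[] = {"trt_engine_cache_enable", "trt_engine_cache_path", "trt_engine_cache_prefix"};
  const char* values[] = {"1", "./trt_cache", "FRCNN"};
  Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 3));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(session_options, trt_options));
  api.ReleaseTensorRTProviderOptions(trt_options);
}
```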
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/16708
---------
Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Pass through the ConfigOptions from the session via OpKernelInfo so that
kernel behavior can be configured.
Initial usage would be to optionally enable a fast path for ARM64 bfloat16 GEMM - see #17031
Other usages could be things like selecting the exact implementations of the activation functions for RNN operators instead of the default approximations (e.g. use [sigmoid_exact instead of sigmoid](2d6e2e243d/onnxruntime/core/providers/cpu/rnn/rnn_helpers.h (L379-L382)))
OpKernelInfo already passes through things from the session state, and adding a new ConfigOptions member
is the simpler update. It's also a natural fit given it provides state/info to the kernel.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Introduce AppendExecutionProvider_OpenVINO_V2 API and support for OV
2023.3.
### Context
- The API is added to facilitate customers in using published official
Microsoft onnxruntime libraries with OVEP libraries.
- Add support for OpenVINO 2023.3 official release.
- Extend operator coverage
- GH fixes
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
When the TRT engine cache (a precompiled engine) is present, it doesn't
make sense to go through model verification, model optimization, the TRT
EP's GetCapability(), the TRT EP's model proto reconstruction, calling the
TRT parser, and engine compilation.
This PR makes the TRT EP skip those processes and directly load the engine
to perform inference.
The feature request:
https://github.com/microsoft/onnxruntime/issues/18072
Features:
- Replace original model with TRT engine wrapped ONNX model. It can save
a lot of time as mentioned above.
- How to get TRT engine wrapped ONNX model?
1. Set the `trt_dump_ep_context_model` provider option to "true" and run the
inference. You will find the "xxx_wrapper.onnx" at the engine cache
path (the same logic as generating the engine cache).
2. Use gen_trt_engine_wrapper_onnx_model.py
- Three provider options are added:
  - `trt_dump_ep_context_model`: enable dumping the wrapped ONNX model by the TRT EP.
  - `trt_ep_context_embed_mode`: add embed_mode as an attribute. 0 means engine
cache path, 1 means engine binary data.
  - `trt_ep_context_compute_capability_enable`: add hardware_arch as an
attribute. When running the model, the TRT EP will check consistency between
the model's hardware_arch and the GPU's compute capability.
- When the engine cache path is given in the wrapped model, the TRT EP will
first search for the engine file using the path (relative to the model
path); if it can't find it, it will use the path as-is
(depending on the user, it could be relative to the working dir or an absolute path)
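For illustration, a minimal sketch (assumed usage) of dumping an engine-wrapped model with these provider options:
```
#include <onnxruntime_cxx_api.h>

void DumpEngineWrappedModel(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));
  const char* keys[] = {"trt_engine_cache_enable", "trt_dump_ep_context_model", "trt_ep_context_embed_mode"};
  const char* values[] = {"1", "1", "0"};  // embed_mode 0: keep the engine as a separate cache file
  Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 3));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(session_options, trt_options));
  api.ReleaseTensorRTProviderOptions(trt_options);
}
```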
Note:
1. This PR includes the change of
https://github.com/microsoft/onnxruntime/pull/17751
Constraints:
1. The whole model should be fully supported by TRT.
2. Users need to make sure the engine is built with min/max/opt
optimization profiles that are large enough to cover the range of all
inputs. The TRT EP will simply fail and won't rebuild the engine if the
input shape is out of range at runtime.
### Description
This PR has several combined ORT ETW changes that improve ORT log
diagnosability & performance.
- The existing log behavior in the ORT API and Severity behavior remain
the same as compiled by the dev using the ORT API
- The PR keeps the existing design which has 2 TraceLogging providers
defined (although both were not used before this PR)
- Keeps great inference (inf) and session load performance even with
dynamic logging enabled (see below)
- On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, ORT
will dynamically _add_ a new sink reflecting the severity level provided
by ETW, e.g. Critical - Verbose, per the need at runtime
- This allows previous printf style LOGS() statements both for CPU and
NPU cases to flow to ETW via a local trace (if enabled)
- This DOES NOT add any new Telemetry which can optionally be sent to
Microsoft.
- Telemetry are ETW events marked with the Measure keyword that _can_ be
sampled if a box opts-in
- Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and
levels added if they were missing
- If Execution Providers (EPs) can provide more detailed insight into
their HW or component, then this PR allows for those to be dynamically
logged instead of just at compile time
- In this PR, the QNN EP for QC NPUs can have basic or detailed
profiling enabled to give some insight into how the NPU is performing
- When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the
Profiling keyword and level 5 then QC QNN basic profiling info is output
to ETW
### Motivation and Context
- This make ORT logging and diagnosability more performant (on Windows)
and available in a wider variety of runtime environments.
- Inference times were drastically (~300x+) better/faster when these
logs were output to ETW vs just stdout (Verbose
severity)
- This style of ETW dynamic tracing is the widely used standard for
Windows components, and even by some 3rd party software since the ETW
API is open and part of the Windows API
- These ETW based logs can be seamlessly combined with other ETW logs
such as an AI component/feature using ORT, OS CPU profiling, scheduling,
and more
- Before the PR, ORT logging is largely printf style and output to a
sink (usually stdout) only if compiled with a certain log Severity. When
enabled the previous logging (to stdout) would vastly slow down
inference times. Once compiled for release the internal ORT logs were
not accessible by anyone except the AI model developer in their dev
inner loop. That means logs could not be enabled on a lab machine, or on
a production system where the runtime behavior or performance might be
different for various reasons on a wide variety of HW.
- This change was tested with performance in mind and tested with a
mobilenet small AI model with onnxruntime_perf_test
- CPU: There was no statistical difference for inf times with the
baseline (main) or this PR whether ETW was enabled or not (both ORT
providers all keywords level 5).
- NPU (QNN on SP9 or Dev Kit 2023 QC SQ3): There was no statistical
difference for inf times with this PR whether ETW (both ORT providers
all keywords) were enabled or not for Level 5 (Verbose). This is even
with QNN Basic profiling turned on and outputting NPU stats to ETW
- As expected and designed, there was perf slowdown when Max Level 255
is enabled which translates to QNN Detailed profiling. This mirrors the
expected slowdown in the NPU when individual model operations are
monitored & recorded as well. This perf is similar to the QNN SDK
Detailed profiling performance separate from this PR. This is designed
to be above level 5 (verbose) as that is commonly the max level used in
many trace profiles and won't affect inf performance.
- Other OSes such as Linux & Android are left untouched for now.
- Out of scope for this PR but TraceLogging is available for Linux with
LTTng tracing. So in the future, this optional tracing could also be
made available on other OSes where a TraceLogging API is available
### Description
<!-- Describe your changes. -->
SessionOptions use_deterministic_compute can currently be set via the Python API only.
Users requested enabling it via the C API as well.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17416
### Description
<!-- Describe your changes. -->
If we fail to calculate the buffer size (due to overflow) we currently
return a nullptr. This is inconsistent as an actual memory allocation
failure throws. An overflow would typically be due to bad input so an
exception makes more sense given that.
Change to throw so code using MakeUniquePtr* and AllocArray* doesn't
need to check for nullptr.
Add some extra info to the log message to help debugging.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Should help with #18905 by avoiding the invalid attempted usage of a
nullptr from the allocation. Extra info _might_ help with figuring out
where the overflow is coming from which is the real issue.
Move QNN EP provider options to session options
### Description
We need to use a session option to support multi-partition for the context cache feature. To smooth the transition, move the provider options to session options first.
This is the first step for PR
https://github.com/microsoft/onnxruntime/pull/18865