onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-04 23:59:56 +00:00

Author	SHA1	Message	Date
snadampal	77da2ef278	[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031 ) ### Description This PR adds SbgemmKernel for aarch64. This includes Sbgemm kernel to implement matrix multiplication with bfloat16 SIMD instructions (bfmmla) and MatMul operator changes to invoke the Sbgemm kernel. To enable Sbgemm kernel, set the following session option: "kOrtSessionOptionsGemmFastMathMode" The PR also adds new test cases for mlas and ort. ### Motivation and Context This is to improve MatMul performance on aarch64 platform. I have run the below benchmarking script (bert, roberta and gpt2 model inference) on AWS Graviton3 based c7g.4xl instance and observed 1.2x-1.76x performance improvement compared to sgemm (fp32) kernel performance. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` And the unit test precision results match the sgemm kernel results. `./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync `	2024-01-22 14:43:06 -08:00
Adrian Lizarraga	8d9d751179	[QNN EP] Expose device-level session options (#19212 ) ### Description - Adds the following session options to configure the device: - `soc_model`: The SoC model number. Refer to the QNN SDK documentation for valid values. Defaults to "0" (unknown). - `htp_arch`: The minimum HTP architecture the driver will use to select compatible QNN operators. - `device_id`: The ID of the device to use when setting 'htp_arch'. Defaults to "0" (for single device). ### Motivation and Context Allow more configuration.	2024-01-22 12:47:42 -08:00
Chi Lo	f3402de01e	[TensorRT EP] Enhance EP context configs in session options and provider options (#19154 ) Several changes: 1. To align with other EPs' setting of EP context configs in session options, for example [QNN EP](https://github.com/microsoft/onnxruntime/pull/18877), EP context configs for TRT EP can be configured through: 1. Session Options: `ep.context_enable`, `ep.context_file_path` and `ep.context_embed_mode` 2. Provider Options: `trt_dump_ep_context_model`, `trt_ep_context_file_path` and `trt_dump_ep_context_embed_mode` 3. The settings above map 1:1, and provider options take priority over session options. ``` Please note that there are rules for using the following context model related provider options: 1. In the case of dumping the context model and loading the context model, for security reasons, TRT EP doesn't allow the "ep_cache_context" node attribute of EP context node to be an absolute path or a relative path that is outside of the context model directory. It means the engine cache needs to be in the same directory or a sub-directory of the context model. 2. In the case of dumping the context model, the engine cache path will be changed to a path relative to the context model directory. For example: If "trt_dump_ep_context_model" is enabled and "trt_engine_cache_enable" is enabled, if "trt_ep_context_file_path" is "./context_model_dir", - if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir" - if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir" ``` 2. User can decide the naming of the dumped "EP context" model by using `trt_ep_context_file_path`, please see GetCtxModelPath() for more details. 3. Added suggested comments from https://github.com/microsoft/onnxruntime/pull/18217	2024-01-21 10:51:58 -08:00
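The engine cache path rule quoted in the commit above can be sketched in plain Python. This is only an illustration, not TRT EP code: the helper name `resolve_engine_cache_dir` is hypothetical, and applying the "no absolute path, no path outside the context model directory" restriction to the input of this helper is an assumption made for the example. The two accepted cases are the ones given in the commit message.

```python
import posixpath


def resolve_engine_cache_dir(context_dir: str, engine_cache_path: str) -> str:
    """Hypothetical helper: where the engine cache lands when the context model is dumped.

    Mirrors the two examples in the commit message; the rejection rules are an
    assumption based on the stated security restriction.
    """
    if not engine_cache_path:
        # Empty cache path: the cache sits next to the context model.
        return context_dir
    if posixpath.isabs(engine_cache_path):
        raise ValueError("engine cache path must not be absolute")
    normalized = posixpath.normpath(engine_cache_path)
    if normalized == ".." or normalized.startswith("../"):
        raise ValueError("engine cache path must stay inside the context model directory")
    if normalized == ".":
        return context_dir
    # Relative cache path: interpreted relative to the context model directory.
    return posixpath.join(context_dir, normalized)


print(resolve_engine_cache_dir("./context_model_dir", ""))
print(resolve_engine_cache_dir("./context_model_dir", "engine_dir"))
```

The helper works on strings only and never touches the file system, so it can be used to check a configuration before a session is created.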
Hector Li	6e17571f2f	Fix issue that the generated context cache model inputs/outputs order is not guaranteed (#19195 ) ### Description Currently, QNN EP generates the context cache model in the Compile() method, which only has access to the partitioned graph. The inputs/outputs order for the partitioned graph is not guaranteed, and the EP doesn't have a view of the input user model. The context cache model generation therefore has to move to a higher level, in GraphPartitioner, which has the view of the partitioned model. This is also a breakdown of the PR for multi-partition support: https://github.com/microsoft/onnxruntime/pull/18865	2024-01-19 15:16:17 -08:00
Maximilian Müller	bc219ed553	[TensorRT EP] Enable a minimal CUDA EP compilation without kernels (#19052 ) Addresses https://github.com/microsoft/onnxruntime/issues/18542. I followed the advice given by @RyanUnderhill [here](https://github.com/microsoft/onnxruntime/pull/18731#issuecomment-1848261925) and went with a minimal CUDA EP for now.	2024-01-17 11:33:34 -08:00
Rachel Guo	bd9d8fb2a5	[ORT 1.17.0 release] Bump up version to 1.18.0 (#19170 ) ### Description Bump up version to 1.18.0 since the release branch has been cut. Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2024-01-17 11:18:32 -08:00
Hector Li	62a4e9103e	Add extreme_power_saver for htp_performance_mode (#19111 ) ### Description Add extreme_power_saver mode for htp_performance_mode	2024-01-12 19:07:02 -08:00
Yifan Li	443aeb851c	[TensorRT EP] Customizable engine cache prefix (#19083 ) ### Description Add new option `trt_engine_cache_prefix` to customize the TRT EP engine cache prefix. For example: - If the user specifies `trt_engine_cache_prefix|FRCNN trt_engine_cache_enable|true` when running the FRCNN model - the cache will be saved/loaded as `FRCNN_2068723788287043730__sm80.engine`. The engine profile follows the same pattern. - If this option is skipped, the engine will be saved/loaded as `TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_2068723788287043730__*_sm80.engine` by default. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/16708 --------- Co-authored-by: Chi Lo <Chi.Lo@microsoft.com> Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>	2024-01-12 18:10:05 -08:00
Scott McKay	8f2e57f5d0	Make session configuration options available to kernels via OpKernelInfo (#18897 ) ### Description Pass through the ConfigOptions from the session via OpKernelInfo so that kernel behavior can be configured. Initial usage would be to optionally enable a fast path for ARM64 bfloat16 GEMM - see #17031 Other usages could be things like selecting the exact implementations of the activation functions for RNN operators instead of the default approximations (e.g. use [sigmoid_exact instead of sigmoid](`2d6e2e243d/onnxruntime/core/providers/cpu/rnn/rnn_helpers.h (L379-L382)`)) OpKernelInfo is already passing through things from the session state, and adding a new member of ConfigOptions is the simpler update. It's also a more natural fit given it's providing state/info to the kernel.	2024-01-13 10:02:43 +10:00
Preetha Veeramalai	c340bf08f6	Openvino EP code changes for 1.17 update (#19023 ) ### Description Introduce AppendExecutionProvider_OpenVINO_V2 API and support for OV 2023.3. ### Context - The API is added to facilitate customers in using published official Microsoft onnxruntime libraries with OVEP libraries. - Add support for OpenVINO 2023.3 official release. - Extend operator coverage - GH fixes --------- Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>	2024-01-12 13:20:51 -08:00
Chi Lo	46dd0d3f52	[TensorRT EP] Load precompiled TRT engine file directly (#18217 ) When the TRT engine cache (precompiled engine) is present, it doesn't make sense to go over the processes of model verification, model optimization, TRT EP's GetCapability(), TRT EP's model proto reconstruction, calling TRT parser and engine compilation. This PR makes TRT EP skip those processes and directly load the engine to perform inference. The feature request: https://github.com/microsoft/onnxruntime/issues/18072 Features: - Replace original model with TRT engine wrapped ONNX model. It can save a lot of time as mentioned above. - How to get TRT engine wrapped ONNX model? 1. Set `trt_dump_ep_context_model` provider option to "true" and run the inference. You will find the "xxx_wrapper.onnx" at the engine cache path. (The same logic of generating engine cache) 2. Use gen_trt_engine_wrapper_onnx_model.py - Three provider options are added, `trt_dump_ep_context_model`: Enable dump wrapped onnx model by TRT EP `trt_ep_context_embed_mode`: Add embed_mode as attribute. 0 means engine cache path, 1 means engine binary data. `trt_ep_context_compute_capability_enable`: Add hardware_arch as attribute. When running the model, TRT EP will check consistency between model's hardware_arch and GPU's compute capability. - When the engine cache path is given in the wrapped model, TRT EP will first search for the engine file using the path (relative to model path), if it can't find it, it will change to use the path as it is (depends on user, could be relative to working dir or absolute path) Note: 1. This PR includes the change of https://github.com/microsoft/onnxruntime/pull/17751 Constraints: 1. The whole model should be fully supported by TRT. 2. Users need to make sure the engine is built with min/max/opt optimization profiles that are large enough to cover the range of all inputs. TRT EP will simply fail and won't rebuild the engine if the input shape is out of range during runtime.	2024-01-11 22:20:54 -08:00
ivberg	4d1243b4b4	ORT ETW dynamic logging that improves ORT diagnosability & performance (#18882 ) ### Description This PR has several combined ORT ETW changes that improve ORT log diagnosability & performance. - The existing log behavior in the ORT API and Severity behavior remain the same as compiled by the dev using the ORT API - The PR keeps the existing design which has 2 TraceLogging providers defined (although both were not used before this PR) - Keeps great inference (inf) and session load performance even with dynamic logging enabled (see below) - On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, then ORT will dynamically _add_ a new sink reflecting the severity level provided by ETW dynamically, e.g. Critical through Verbose, per the need at runtime - This allows previous printf style LOGS() statements both for CPU and NPU cases to flow to ETW via a local trace (if enabled) - This DOES NOT add any new Telemetry which can optionally be sent to Microsoft. - Telemetry are ETW events marked with the Measure keyword that _can_ be sampled if a box opts-in - Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and levels added if they were missing - If Execution Providers (EPs) can provide more detailed insight into their HW or component, then this PR allows for those to be dynamically logged instead of just at compile time - In this PR, the QNN EP for QC NPUs can have basic or detailed profiling enabled to give some insight into how the NPU is performing - When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the Profiling keyword and level 5 then QC QNN basic profiling info is output to ETW ### Motivation and Context - This makes ORT logging and diagnosability more performant (on Windows) and available in a wider variety of runtime environments.
- The performance difference for inf times was ~300x+ drastically better/faster when these logs were output to ETW vs just stdout (Verbose Severity) - This style of ETW dynamic tracing is the widely used standard for Windows components, and even by some 3rd party software since the ETW API is open and part of the Windows API - These ETW based logs can be seamlessly combined with other ETW logs such as an AI component/feature using ORT, OS CPU profiling, scheduling, and more - Before the PR, ORT logging is largely printf style and output to a sink (usually stdout) only if compiled with a certain log Severity. When enabled the previous logging (to stdout) would vastly slow down inference times. Once compiled for release the internal ORT logs were not accessible by anyone except the AI model developer in their dev inner loop. That means logs could not be enabled on a lab machine, or on a production system where the runtime behavior or performance might be different for various reasons on a wide variety of HW. - This change was tested with performance in mind and tested with a mobilenet small AI model with onnxruntime_perf_test - CPU: There was no statistical difference for inf times with the baseline (main) or this PR whether ETW was enabled or not (both ORT providers all keywords level 5). - NPU (QNN on SP9 or Dev Kit 2023 QC SQ3): There was no statistical difference for inf times with this PR whether ETW (both ORT providers all keywords) were enabled or not for Level 5 (Verbose). This is even with QNN Basic profiling turned on and outputting NPU stats to ETW - As expected and designed, there was perf slowdown when Max Level 255 is enabled which translates to QNN Detailed profiling. This mirrors the expected slowdown in the NPU when individual model operations are monitored & recorded as well. This perf is similar to the QNN SDK Detailed profiling performance separate from this PR. 
This is designed to be above level 5 (verbose) as that is commonly the max level used in many trace profiles and won't affect inf performance. - Other OSes such as Linux & Android are left untouched for now. - Out of scope for this PR but TraceLogging is available for Linux with LTTng tracing. So in the future, this optional tracing could also be made available on other OSes where a TraceLogging API is available	2024-01-11 12:43:27 -08:00
RandySheriffH	24e9daf707	Enrich cuda resources with ep options (#19014 ) Allow custom ops to access cuda ep options. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2024-01-11 10:56:07 -08:00
RandySheriffH	df116b82c7	Custom op API for thread pool (#18980 ) Allow custom op to invoke internal thread-pool for parallelism. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2024-01-10 14:13:25 -08:00
Scott McKay	8e9188e265	Add SessionOptions use_deterministic_compute to the C and C++ APIs. (#18944 ) ### Description SessionOptions use_deterministic_compute can be set via the python API. User request to enable setting via C API. ### Motivation and Context #17416	2024-01-04 11:12:48 +10:00
Scott McKay	df740d7d15	Throw if unique_ptr or array allocation fails due to SafeInt overflow (#18941 ) ### Description If we fail to calculate the buffer size (due to overflow) we currently return a nullptr. This is inconsistent as an actual memory allocation failure throws. An overflow would typically be due to bad input so an exception makes more sense given that. Change to throw so code using MakeUniquePtr* and AllocArray* doesn't need to check for nullptr. Add some extra info to the log message to help debugging. ### Motivation and Context Should help with #18905 by avoiding the invalid attempted usage of a nullptr from the allocation. Extra info _might_ help with figuring out where the overflow is coming from which is the real issue.	2024-01-03 07:57:51 +10:00
Hector Li	8931854528	Move some QNN EP provider options to session options (#18877 ) Move QNN EP provider options to session options ### Description Need to use session options to support multi-partition for the context cache feature. To smooth the transition, move the provider options to session options first. This is the first step for PR https://github.com/microsoft/onnxruntime/pull/18865	2023-12-20 00:13:38 -08:00
pengwa	ccf3b2054b	Allow layer-wise recompute (#18566 ) ### Allow layer-wise recompute Previously, users/developers had to specify the subgraphs to recompute; now we introduce a more user-friendly way to enable recompute for all detected stashed activation recomputation subgraphs. This sacrifices getting the best configs, but makes it easier to support user requirements when they switch from PyTorch per-layer gradient checkpointing to ORTModule. `ORTMODULE_MEMORY_OPT_LEVEL` is introduced to control the usage. By default, it is 0, i.e. `USER_SPECIFIED`: all subgraphs defined in `ORTMODULE_MEMORY_OPT_CONFIG` will be recomputed. So this is compatible with existing recompute usage in ORTModule integrated models. Using `ORTMODULE_MEMORY_OPT_LEVEL=1`, we will enable all recompute plans detected, so those configs in `ORTMODULE_MEMORY_OPT_CONFIG` will not be respected any more. Add unit tests using 3-layer Bloom models. https://github.com/microsoft/onnxruntime/blob/pengwa/add_aggresive_recompute/docs/Memory_Optimizer.md	2023-12-12 08:44:05 +08:00
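The two modes described in the commit above can be selected through environment variables. The following is a minimal sketch, assuming the variables are read when ORTModule is set up and therefore need to be in the environment beforehand; the helper `memory_opt_env` and the constant name `ALL_DETECTED` are hypothetical (the commit only names `USER_SPECIFIED` for level 0).

```python
import os

USER_SPECIFIED = 0  # level 0: recompute only the subgraphs listed in ORTMODULE_MEMORY_OPT_CONFIG
ALL_DETECTED = 1    # level 1: enable every detected recompute plan (name is an assumption)


def memory_opt_env(level: int, config: str = "") -> dict:
    """Hypothetical helper: build the environment variables for the chosen level."""
    if level not in (USER_SPECIFIED, ALL_DETECTED):
        raise ValueError("only levels 0 and 1 are described in the commit message")
    env = {"ORTMODULE_MEMORY_OPT_LEVEL": str(level)}
    # Per the commit message, the config is not respected at level 1,
    # so it is only passed along for level 0.
    if level == USER_SPECIFIED and config:
        env["ORTMODULE_MEMORY_OPT_CONFIG"] = config
    return env


# Example: request layer-wise recompute for all detected plans.
os.environ.update(memory_opt_env(ALL_DETECTED))
```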
Hector Li	ccfea55942	[QNN EP] Enable QNN HTP VTCM size setting (#18653 ) ### Description [QNN EP] Enable QNN HTP VTCM size setting	2023-11-30 21:09:13 -08:00
Edward Chen	14a343441d	Fix Objective-C static analysis build (#18606 ) - Patch abseil to fix a compile error about not finding `cxxabi.h`. - Fix some static analysis warnings.	2023-11-28 17:14:20 -08:00
pengwa	43a5147e01	Memory optimization refactor and refinement (#17481 ) ### Memory optimization refactor and refinement Currently the memory optimizer runs graph transformations and prints recompute opportunities at INFO level, while the ORT backend has many INFO level logs, making it hard for users to find that information. So we are looking for a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, we can first create an ORT session, at which point the allocation plan is done, so the analysis will consider buffer reuse as well. This can avoid giving some recomputation subgraphs that are reusing other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull request makes the following refactoring: 1. Print the log in the ORTModule Python script, along with ORTModule feature enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. We are analyzing memory optimization opportunities considering ORT memory planning. This is done by first creating the execution graph without enabling MemoryOptimizer, then calling `execution_agent.get_serialized_ortmodule_memory_stat`, which internally considers the session memory allocation planner when analyzing memory optimization opportunities. As a direct result, the memory optimization opportunities can show those stashed activations that are reusing other buffers. 3. Move recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies for their own implementation. This will make introducing new strategies (for example compression and decompression) easier.
New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - *** ONNX Runtime Training (ORTModule) is accelerating your model *** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0inputs_input_ids_dim1(384.0inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0inputs_input_ids_dim0inputs_input_ids_dim1(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0inputs_input_ids_dim0inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT 
will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== \|Freq \| Memory Optimization Opportunities (Clustered by node-level activation patterns) \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|3 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(3), \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(2), \| \| \| - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=2 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \| \| \| \| \|>>Option 2 : RecomputeWithCompromise subgraph Cast+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. 
\| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. *********************************************************************** ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-23 11:39:00 +08:00
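The `ORTMODULE_MEMORY_OPT_CONFIG` values shown in the logs above are comma-separated entries such as `BiasGelu+:1:-1`. A small sketch of reading and writing that format follows. The field meanings are inferred from the sample log only (the second field appears to select `Recompute` for 1 and `RecomputeWithCompromise` for 2, and the third matches the "requested count"); they are assumptions, not a documented schema, and the names `PlanConfig`, `parse_memory_opt_config` and `format_memory_opt_config` are hypothetical.

```python
from typing import List, NamedTuple


class PlanConfig(NamedTuple):
    subgraph: str  # e.g. "Reshape+Where+BiasSoftmax+"
    strategy: int  # assumed: 1 = Recompute, 2 = RecomputeWithCompromise
    count: int     # assumed: requested count, -1 as used in the sample log


def parse_memory_opt_config(value: str) -> List[PlanConfig]:
    """Split a comma-separated config string into its plan entries."""
    plans = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split from the right so the subgraph name is kept whole.
        subgraph, strategy, count = item.rsplit(":", 2)
        plans.append(PlanConfig(subgraph, int(strategy), int(count)))
    return plans


def format_memory_opt_config(plans: List[PlanConfig]) -> str:
    """Join plan entries with commas, as Note 1 in the log describes."""
    return ",".join(f"{p.subgraph}:{p.strategy}:{p.count}" for p in plans)


plans = parse_memory_opt_config("BiasGelu+:1:-1,Cast+:2:-1")
print(format_memory_opt_config(plans))
```

Round-tripping through the two helpers is a cheap way to catch a malformed entry before exporting the variable.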
Dmitri Smirnov	81a763a9eb	Make TensorShapeVector use InlinedVector<Int64_t> to reduce template instantiations (#18519 ) ### Description Use InlinedVector<int64> instead of <int64_t,5> to reduce the number of template instantiations. ### Motivation and Context The reported size reduction is small, just a few KB. Just trying it out.	2023-11-21 14:13:50 -08:00
Sheil Kumar	2a01622536	Hide NPU Adapter selection behind macro (#18515 ) Hide NPU Adapter selection behind macro --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-21 08:47:56 -08:00
RandySheriffH	53917a3353	Move up members in Lite Custom Op hierarchy for possible memleaks. (#18478 ) Move data member in LiteOpFunc to its parent to avoid possible mem leaks. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-11-18 15:00:54 -08:00
Edward Chen	0a4d76d98b	MLAS AArch64 quantized int4 Gemm kernel (#18031 ) - Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs. - Connect MatMulNBits contrib op to MLAS function.	2023-11-15 09:31:54 -08:00
Dmitri Smirnov	f19c673595	If Branch Constant Folding (#18105 ) ### Description When the `If` condition proves to be a constant value, inline the corresponding subgraph, enabling further constant folding and optimization. ### Motivation and Context Newly converted models feature lots of nested `If` nodes that can be inlined and collapsed. In particular, for the sample models we are gaining on TorchScript exported models. For `HF Mobile Bert Dynamo` runtime went down from 0.069 -> 0.046. In total, AOT inlining + `If` constant folding yields an improvement of about 50%, 0.102 -> 0.046, bringing us very close to TorchScript exported models. `HF Bart Dynamo` further improves 0.668 -> 0.45. AOT + `If` constant folding improves 0.98 -> 0.45. Earlier the size of HF Mobile Bert was 161Mb+; it is now 98Mb. The HF Bart Dynamo pre-optimized model was about 1.2Gb; it is now 710MB. ![image](https://github.com/microsoft/onnxruntime/assets/11303988/1491a247-d371-4e66-85a3-2aeb702e8ca0)	2023-11-13 17:33:30 -08:00
RandySheriffH	646f77a94b	Align context virtuals (#18396 ) Deprecate ROCM context virtual function, to align with CUDA. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-11-11 12:41:37 +10:00
RandySheriffH	59262dfc63	Add cuda context headers to zip (#18330 ) Expose cuda context headers for cuda custom ops. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-11-09 14:53:58 -08:00
Ted Themistokleous	8d50313816	[Migraphx EP] Static int8 QDQ support (#17931 ) ### Description <!-- Describe your changes. --> Adding static int8 quantization support for MIGraphX Execution Provider - Allows for parsing in calibration tables generated by Onnxruntime or TensorRT's toolsets - Add proper environment variables into the MIGraphX EP - Update python API to include updating execution provider flags -> was missing on python side - Hook into MIGraphX's int8 quantitation and optimization of models ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Required so that we can get onnxruntime to pass in models while leveraging the existing tooling for int8 static QDQ quantization. First step in a series of PRs which will add further static quantization on the operator level as MIGraphX releases further support. These changes drew heavily from the tensorRT EP should allow for similar functionality for GPU based (versus CPU) quantization of models before an inference is performed. --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com> Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>	2023-11-09 17:46:49 +08:00
Hector Li	55c19d6ab5	[QNN EP] Enable option to set QNN context priority (#18315 ) Enable option qnn_context_priority to set QNN context priority, options: "low", "normal", "normal_high", "high". ### Description Enable option qnn_context_priority to set QNN context priority, options: "low", "normal", "normal_high", "high". This feature guarantees the model inference with higher priority. Tested with onnxruntime_perf_test tool using same model. 1. Run the model on the NPU with single instance, the latency is 300ms. 2. Run the same model on NPU with 2 instance at same time. Case 1: both with same priority (high ) -- latency is 600ms Case 2: 1 with low priority -- latency is 30,000ms 1 with high priority -- latency is 300ms Case 3: 1 with normal priority -- latency is 15,000ms 1 with high priority -- latency is 300ms	2023-11-08 20:56:36 -08:00
Justin Chu	c250540722	Bump linter versions (#18341 ) Bump linter versions and run format.	2023-11-08 13:04:40 -08:00
Adrian Lizarraga	a0eeeafa80	[QNN EP] Session option for graph optimization (#18262 ) ### Description Adds the QNN session option `htp_graph_finalization_optimization_mode` to enable QNN graph optimizations at the expense of longer preparation time. ### Motivation and Context Allow enabling QNN graph optimizations per app/model.	2023-11-08 10:06:15 -08:00
Preetha Veeramalai	d87216bcb1	Openvino ep ort 23.1 (#17911 ) ### Description Integration to OpenVINO 2023.1 ### Motivation and Context - Alignment with latest OpenVINO Version. - Device name change from VPUX to NPU and Remove from supported list until official public support is available. --------- Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com> Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com>	2023-11-01 08:39:39 -07:00
RandySheriffH	2b95e74fa1	Versioning for custom op (#18088 ) Allow custom ops to have versions. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-10-31 16:50:27 -07:00
Maximilian Müller	2eeafc37bc	Enable global TRT timing cache (#17865 ) I am adding a new `trt_timing_cache_path` option. Internally it is handled as `global_cache_path_` and will be set via a fall through approach: 1. no path provided => workdir 2. `trt_engine_cache_path` provided but no `trt_timing_cache_path` => `trt_engine_cache_path` 3. `trt_timing_cache_path` provided => `trt_timing_cache_path` (if not provided `trt_engine_cache_path` will still be workdir) ### Motivation and Context A TRT timing cache can be reused across multiple models as it only holds kernel timings and it is common that network "patterns" are reused. This can accelerate build times a lot. --------- Co-authored-by: Carson M <carson@pyke.io>	2023-10-27 09:23:19 -07:00
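The fall-through order described in the commit above can be sketched in a few lines of Python. This is only an illustration of the three rules as stated in the message, not the TensorRT EP implementation: the function name `resolve_cache_paths` and its `workdir` parameter are hypothetical, and only the option names `trt_engine_cache_path` and `trt_timing_cache_path` come from the commit.

```python
import os


def resolve_cache_paths(trt_engine_cache_path="", trt_timing_cache_path="", workdir=None):
    """Hypothetical helper: return (engine_cache_dir, timing_cache_dir).

    Mirrors the fall-through described in the commit message:
      1. no path provided                -> both default to the working directory
      2. only the engine cache path set  -> timing cache follows the engine cache path
      3. timing cache path set           -> timing cache uses it; the engine cache
                                            stays at the working directory unless set
    """
    if workdir is None:
        workdir = os.getcwd()
    engine_dir = trt_engine_cache_path or workdir
    timing_dir = trt_timing_cache_path or engine_dir
    return engine_dir, timing_dir
```

Called with only `trt_timing_cache_path="shared_timing"`, the sketch returns the working directory for the engine cache and `shared_timing` for the timing cache, which matches rule 3.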
Patrice Vignola	538e97cbda	[DML EP] Add dynamic graph compilation (#17876 ) Historically, DML was only able to fuse partitions when all sizes are known in advance or when we were overriding them at session creation time. But in practice, it should be possible to compile partitions at compute time if the caller knows that the dimensions won't be changed for every inference (e.g. resizing a webcam window, or padding the input to powers of 2). This graph will be cached and reused until the sizes change. This is an opt-in option gated under the `enable_dynamic_graph_fusion` option, which means that it will only be enabled when the caller requests it since they have more context on how their model will be called between inferences. This PR also adds the option to disable metacommands from the python API, which is an option for the C API but was lacking for python.	2023-10-25 19:56:16 -07:00
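The "cached and reused until the sizes change" behaviour from the commit above can be illustrated with a small shape-keyed cache. This is a conceptual sketch in plain Python, not the DML EP code: the class `ShapeKeyedGraphCache` and the `compile_fn` callback are hypothetical names introduced only for illustration.

```python
class ShapeKeyedGraphCache:
    """Hypothetical sketch: keep one compiled graph and recompile only when
    the concrete input shapes differ from the ones it was compiled for."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # assumed callback: shapes -> compiled graph
        self._key = None               # shapes the cached graph was compiled for
        self._graph = None
        self.compile_count = 0         # exposed so the caching effect is observable

    def get(self, input_shapes):
        # Normalise to a hashable, comparable key of concrete dimensions.
        key = tuple(tuple(shape) for shape in input_shapes)
        if key != self._key:
            self._graph = self._compile_fn(key)
            self._key = key
            self.compile_count += 1
        return self._graph
```

Repeated calls with the same shapes reuse the stored graph; a call with a different shape (for example after a window resize) triggers exactly one recompilation.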
liqun Fu	efa0cc2562	implement isinf20 and isnan20 (#17874 )	2023-10-24 10:58:54 -07:00
Dmitri Smirnov	2c50b75a26	Functions Ahead Of Time inlining (#17764 ) ### Description Inline functions in an EP aware fashion. The result of this PR is that models that have been inlined by the ONNX inliner and optimized and models that have been AOT inlined appear to be visually identical. For tests I used two models. The only difference is the resulting size because the ONNX inliner removes local function definitions and AOT does not. Difference in sizes for `HF Mobile` model was 2.5 MB, and for `HF Bart` it was ~500K. It seems that the resulting model size affects the load time more than the actual optimizations. In general, the inlined models grow in size very fast and can easily exceed the 2Gb limit. Q. Should we make AOT optional? `If` constant folding and the removal of local inlined models will be coming in other PRs. Some stats: ![image](https://github.com/microsoft/onnxruntime/assets/11303988/fcb4c815-2e06-4574-8d96-5a0a727d1ecf)	2023-10-23 17:42:20 -07:00
RandySheriffH	009cd4ea2e	Allow cuda custom ops allocate deferred cpu mem (#17893 ) Expose a new allocator from cuda stream. The allocator manages deferred cpu memory which only get recycled before stream destruction. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-10-20 16:12:21 -07:00
Maximilian Müller	7c17e33c07	Make CUDA a NHWC EP (#17200 ) ### Description CUDA inference speed heavily relies on Tensor Cores. To have tensor cores achieve the optimal throughput they require the data layout to be NHWC rather than NCHW. ### Motivation and Context Especially for convolutional networks this is very important. I will illustrate this using a very simple network: ``` import torch import torch.nn as nn class Net1(nn.Module): def __init__(self): super(Net1, self).__init__() # 1 input image channel, 6 output channels, 5x5 square convolution # kernel self.m = nn.ModuleList([ nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1), nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1), nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1), nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False), nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False), ]) def forward(self, x): for module in self.m: x = module(x) return x if __name__ == "__main__": dtype = torch.half device = "cuda" dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device) model = Net1().to(dtype=dtype, device=device) input_names = ["input1"] output_names = ["output1"] torch.onnx.export(model, dummy_input, "test.onnx", input_names=input_names, output_names=output_names) ``` I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test -e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges. Current master launches below kernels: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b) If I add the introduced `-l` flag we see below kernels: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008) Notice the missing NCHW<>NHWC kernels per operation. The layout optimizer introduced a transpose op as first and last op of the whole network. 
The `op_generic_tensor_kernel` shows the bias used, which should also be optimized out next. Measured across some very basic models: \| CUDA EP \| NCHW [ms] \| NHWC [ms] \| Speedup \| \|:------------------------\|--------------------------------------:\|-----------------------------------------:\|------------------:\| \| \| -e cuda -t 5 -q \| -e cuda -t 5 -q -l \| \| \| resnet101-v2-7_bs8_fp16 \| 18.33 \| 13.07 \| 1.4 \| \| resnet101-v2-7_bs8 \| 21.8 \| 12.06 \| 1.81 \| \| test \| 102.07 \| 73.62 \| 1.39 \| Average speedup: 1.53 ## Outlook The next step is to write a templated unit test to check the correctness of NHWC vs NCHW ops. After that we have to transition more ops to measure perf improvements on a broader range of models. Currently this is not easily possible as we do not support all ops in the NHWC domain. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2023-10-16 10:16:37 -07:00
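The speedup column and the quoted average in the table above can be re-derived from the reported latencies. The snippet below is only a check of that arithmetic, using the NCHW and NHWC timings exactly as given in the commit message; the variable names are illustrative.

```python
# (NCHW ms, NHWC ms) per model, copied from the table in the commit message.
timings_ms = {
    "resnet101-v2-7_bs8_fp16": (18.33, 13.07),
    "resnet101-v2-7_bs8": (21.8, 12.06),
    "test": (102.07, 73.62),
}

# Speedup is the NCHW latency divided by the NHWC latency.
speedups = {name: nchw / nhwc for name, (nchw, nhwc) in timings_ms.items()}

# The reported figure is the plain arithmetic mean of the per-model speedups.
average_speedup = sum(speedups.values()) / len(speedups)

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
print(f"average: {average_speedup:.2f}x")
```

Rounded to two decimals, the per-model values and the mean agree with the 1.4, 1.81, 1.39 and 1.53 figures in the table.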
RandySheriffH	c6c3555d0e	Custom op shape inference API (#17737 ) Add c/cxx API to allow custom ops do shape inference. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-10-13 12:57:42 -07:00
Zhang Lei	762703e037	Support output cross qk, dtw and more for whisper model (#17500 ) Support cross qk in beam search for whisper model and related features Make whisper exporting tools support cross qk and some related features, * extra_decoding_ids * no_speech_prob Implement DTW kernel, unfold tensor kernel with unit test Several fix related with multiple session running parallel, like: * guard multihead_attention, fused_fp16_runner_ * some memory allocation with stream awareness * add use_ep_level_unified_stream option	2023-10-13 11:47:15 -07:00
Numfor Tiapo	b8f373b0ae	Add API for NPU Device Selection in the DML EP (#17612 ) Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-11 14:53:00 -07:00
Hector Li	385fab5bae	[QNN EP] Qnn cache improvement (#17757 ) ### Description Improve the QNN context binary cache feature to reduce the memory overhead and initialization time overhead. Instead of dumping a Qnn context binary file with metadata as header, we dump a Onnx format file with metadata inside Onnx node. ### Motivation and Context reduce the memory overhead and initialization time overhead	2023-10-06 15:56:33 -07:00
Chi Lo	569876fb16	[TensorRT EP] Refactor OrtTensorRTProviderOptions initialization and make it easy to add new field (#17617 ) Two major modifications of this PR: 1. Refactor OrtTensorRTProviderOptions initialization and make it easy to add new field. 2. Make Python API capable of using TensorRT plugins by adding new Python binding api `register_tensorrt_plugins_as_custom_ops`. (It needs to register the EP's custom op domain before model load. For the C++ API, it's slightly different: when calling SessionOptionsAppendExecutionProvider_TensorRT_XX, it appends the custom op domain to the session options. Later ORT can register the custom op domain from the session options before model loading.)	2023-10-06 14:12:20 -07:00
Adrian Lizarraga	8e6019af2e	[QNN EP] Enable QNN Saver for debugging issues (#17747 ) ### Description - Enables option to use the QNN Saver backend for dumping QNN API calls to file. - Adds logic to read environment variable `ORT_UNIT_TEST_ENABLE_QNN_SAVER` from QNN EP unit tests. If enabled, unit tests will use the QNN Saver backend and dump files to `./saver_output/`. ### Motivation and Context QNN Saver makes it easier to debug issues when unit tests fail. The output files generated by QNN Saver can be used to replay the exact QNN API calls that lead to a specific error condition. QNN Saver dumps QNN API calls (and weights) to disk. - saver_output/saver_output.c: C file containing all QNN API calls. - saver_output/params.bin: binary file containing all input/output/parameter tensor data provided during tensor creation, op config validation, and graph execution. Enabling the QNN Saver backend has 2 note-worthy effects: 1. All QNN API calls will succeed. 2. Inference output returns dummy data. Because the output files from QNN Saver are always overwritten, it is recommended to run individual unit tests via the `--gtest_filter` command-line option. Example (linux): ```shell $ ORT_UNIT_TEST_ENABLE_QNN_SAVER=1 ./onnxruntime_test_all --gtest_filter=QnnHTPBackendTests.Resize_DownSample_Linear_AlignCorners ```	2023-10-03 16:24:33 -07:00
Pranav Sharma	668c70ee11	Add support for specifying a custom logging function per session. (#17727 ) ### Description Add support for specifying a custom logging function per session. Bindings for other languages will be added after this PR is merged. ### Motivation and Context Users want a way to override the logging provided by the environment.	2023-09-29 19:46:55 -07:00
Scott McKay	33295ed883	Handle string initializers in constant folding (#17422 ) ### Description <!-- Describe your changes. --> * Allow either an allocator or a MemBuffer to be used when creating an OrtValue from a TensorProto * `Tensor<std::string>` requires an allocator to allocate/free the string values * Forcing the buffer to be allocated outside of the Tensor doesn't seem to provide any benefit in this usage as the Tensor class disables copy and assignment (so we wouldn't create 2 copies of the buffer via the Tensor class that externally managing the buffer would avoid) * New approach means we don't need to manage the buffers in the optimizer Info class as the Tensor dtor will do that * Update naming - MLValue was replaced by OrtValue a long time ago ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> #17392	2023-09-27 21:15:58 +10:00
RandySheriffH	37dcefb5b7	Patch lite custom op API (#17605 ) A few enhancements: - Support compute returning status; - Support variadic; --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-09-26 14:02:18 -07:00
Vincent Wang	e6301eee6a	Bump Up Version to 1.17.0 (#17587 ) Bump up version to 1.17.0 as the 1.16.0 release branch had been branched out.	2023-09-20 11:02:58 +08:00
Dmitri Smirnov	fdb132643d	Remove redundant Resolve() after each inlined function (#17556 ) ### Description Remove `Resolve()` on the entire graph as each function is resolved. We retain `Resolve()` after each inlining iteration. ### Motivation and Context Poor performance for inlining the model and session initialization. Original model before Resolve() removal FunctionTest.Profiling (65953 ms) After Resolve() Removal FunctionTest.Profiling (2911 ms) RelWithDebInfo pre-inlined model. Presumably because it runs Level1 optimizers Non-inlined model consists of functions and Level1 optimizers have no effect. FunctionTest.Profiling (9851 ms)	2023-09-15 12:13:37 -07:00
cao lei	32f5658abb	remove gsl to make status.h independent from gsl (#17402 ) ### Description <!-- Describe your changes. --> Make status.h independent from gsl. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> In the coming new feature external EP API (see the prototype https://github.com/microsoft/onnxruntime/pull/16718), we need to expose stream in the public header, however, stream is dependent on status.h which is dependent on gsl. We are seeking a way to decouple stream from gsl. From Changming's comment offline, prefast is disabled so all GSL_SUPPRESS are not taking any effect now. He will handle the warnings when enable prefast in the future	2023-09-13 21:47:43 -07:00
Yulong Wang	550293d9ad	OrtMemoryInfo: support new name "WebGPU_Buffer" (#17469 ) ### Description Add new name "WebGPU_Buffer" to OrtMemoryInfo. This is one of the prerequisites for supporting IO binding for WebGPU buffer in onnxruntime-web. list of prerequisites PRs: #17465 #17469 (this one)	2023-09-08 16:37:35 -07:00
Xavier Dupré	024f1dd72b	Fix float 8 rounding on CPU (#16940 ) ### Description Fix float 8 rounding issues discovered in issue #16938 (only CPU provider).	2023-09-07 20:48:25 +02:00
RandySheriffH	6c39641ea2	Fix a memleak in RunAsync python (#17326 ) Release ort value outputs that are created and released from ort::run(...). --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-08-30 12:54:17 -07:00
Artem Shilkin	6e60dba726	Fix compilation with newer flatbuffers (#17164 ) In flatbuffers@v23.5.9 was broken forward declaration for FlatBufferBuilder. Trying to compile onnxruntime falls with the following error: ``` flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder') typedef FlatBufferBuilderImpl<false> FlatBufferBuilder; ^ onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here class FlatBufferBuilder; ``` This PR removes these declarations and puts includes instead	2023-08-29 10:28:26 -07:00
pengwa	18d5cfdb85	Fix build - redefinition of default argument for ‘long unsigned int Extent’ (#17281 ) ### Fix build - redefinition of default argument for ‘long unsigned int Extent’ One of the training customer env, building ORT, there is such a build error. The GCC version are ``` aiscuser@node-0:/tmp/onnxruntime$ gcc --version gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 aiscuser@node-0:/tmp/onnxruntime$ g++ --version g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 ``` But on our dev node using same GCC/G++, we don't have build issue., not sure what's the difference but giving an explict type when creating `gsl::span` fixed the problem. ``` /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:394:7: error: redefinition of default argument for ‘long unsigned int Extent’ 394 \| class span \| ^~~~ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span_ext:46:51: note: original definition appeared here 46 \| template <class ElementType, std::size_t Extent = dynamic_extent> \| ^~~~~~~~~~~~~~~ /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:82:93: error: return type ‘class gsl::span<const std::byte>’ is incomplete 82 \| [[nodiscard]] inline gsl::span<const std::byte> AsByteSpan(const void* data, size_t length) { \| ^ /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h: In function ‘void onnxruntime::AsByteSpan(const void, size_t)’: /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: class template argument deduction failed: 83 \| return gsl::span(reinterpret_cast<const std::byte>(data), length); \| ^ /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: no matching function for call to ‘span(const std::byte, size_t&)’ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: candidate: ‘template<class Type, long unsigned int Extent> gsl::span(Type (&)[Extent])-> gsl::span<ElementType, FirstExtent>’ 740 \| span(Type (&)[Extent]) -> 
span<Type, Extent>; \| ^~~~ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: template argument deduction/substitution failed: /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note: mismatched types ‘Type [Extent]’ and ‘const std::byte’ 83 \| return gsl::span(reinterpret_cast<const std::byte>(data), length); \| ^ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: candidate: ‘template<class Type, long unsigned int Size> gsl::span(std::array<_Tp, _Nm>&)-> gsl::span<ElementType, FirstExtent>’ 743 \| span(std::array<Type, Size>&) -> span<Type, Size>; \| ^~~~ /tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: template argument deduction/substitution failed: /tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note: mismatched types ‘std::array<_Tp, _Nm>’ and ‘const std::byte’ 83 \| return gsl::span(reinterpret_cast<const std::byte*>(data), length); \| ^ ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-08-25 00:40:40 +08:00
Scott McKay	b3cb775cf9	Two fixes involving minimal builds (#17000 ) ### Description <!-- Describe your changes. --> - allocation planner was breaking if graph had no nodes - in this particular model a branch of an If node returned an outer scope value directly. - if the model used non-tensor types and sparse tensors are disabled, the call to IsSparseTensor throws an exception, which prematurely terminates the code. - it's perfectly fine to check if a value is a sparse tensor when support for them is disabled. we just can't do anything with that OrtValue, which is what the current ifdef's after the call to IsSparseTensor handle. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix model execution failure for partner with model that uses sequences in a minimal build with sparse tensors disabled.	2023-08-23 16:01:22 +10:00
Edward Chen	ae62d752d6	Prevent GSL_SUPPRESS arguments from being modified by clang-format (#17242 ) Prevent `GSL_SUPPRESS` arguments from being modified by clang-format and update existing usages. clang-format was changing something like `GSL_SUPPRESS(r.11)` to `GSL_SUPPRESS(r .11)`. For some compilers (e.g., clang), the `gsl::suppress` attribute takes a quoted string argument. We don't want to insert spaces there.	2023-08-22 18:26:53 -07:00
Edward Chen	d6cd41cfc1	[CoreML EP] Add Shape, Gather, and Slice ops (#17153 ) Add CoreML EP shape related ops: - Shape - Gather - Slice Add support for int64/int32 inputs in CoreML EP.	2023-08-18 22:34:34 -07:00
Dmitri Smirnov	5c54b64a63	Create NodeArgs for all Constant nodes and initializers for functions being inlined (#17089 ) ### Description When functions are inlined and constant nodes are being converted to initializers, we need to create NodeArg for them. Similar for inlined function subgraph, but we choose to give priority to non-constant nodes and then fill the gaps with constant and initializers. ### Motivation and Context This addresses issue https://github.com/microsoft/onnxruntime/issues/16813 for `eca_halonext26ts_mod.onnx` model where it fails to remove unused initializer because `NodeArg` was not created for it.	2023-08-17 14:22:28 -07:00
Changming Sun	5249b7ab7c	Re-implement stacktrace (#17173 ) ### Description Re-implement stacktrace. The new implementation doesn't directly use Windows API, hence can avoid problems regarding to initialize/uninitialize the dbghelp library. ### Motivation and Context	2023-08-16 16:07:49 -07:00
RandySheriffH	3dd2c1b4d7	EP context for custom op (#16454 ) Implement infrastructures to allow EP resources surfaced to custom ops. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-08-16 13:03:40 -07:00
Yulong Wang	9cd4e5af68	[wasm] upgrade emsdk to 3.1.44 (#17069 ) ### Description This change upgrades emsdk to 3.1.44. Because the backend is upgraded to LLVM 16, we need to fix a lot of build failures caused by "-Wshorten-64-to-32". Most of the build failures come from the generated `onnx.pb.h`, and this can be fixed by including "core/graph/onnx_protobuf.h", which detects and ignores shorten-64-to-32 warnings.	2023-08-10 16:08:36 -07:00
Chi Lo	7361c283c7	Add API for updating CUDA EP provider option user compute stream (#17037 ) Add a generic `UpdateCUDAProviderOptionsWithValue()` C API to update CUDA EP provider options where its data type is pointer that can't be represented by string. Note: Please see some comments for the similar [PR ](https://github.com/microsoft/onnxruntime/pull/16965)for TRT EP.	2023-08-09 09:24:19 -07:00
Chi Lo	fc8003349e	Add API for updating TRT EP provider option user compute stream (#16965 ) Add a generic `UpdateTensorRTProviderOptionsWithValue()` C API to update TensorRT provider options where its data type is pointer that can't be represented by string.	2023-08-04 15:14:43 -07:00
Edward Chen	f98d3f8a23	[CoreML EP] Enable inputs with dynamic shape (#16915 ) Enable node inputs with dynamic shape to be handled by the CoreML EP.	2023-08-03 18:15:00 -07:00
satyajandhyala	dd24d52737	[JS/Web] Added Gelu contrib operator support to JSEP (#16909 ) ### Description Added Gelu operator to JSEP ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-07-31 09:18:58 -07:00
Dmitri Smirnov	bf006d34a9	Used feature macro for if constexpr in a public header (#16836 ) ### Description Use feature macro for `if constexpr` ### Motivation and Context We still do not require customers to use C++17 compiler.	2023-07-25 21:42:30 -07:00
kunal-vaishnavi	b7176f9826	Fix bug with saving model optimized by inference session (#16716 ) ### Description A [previous PR](https://github.com/microsoft/onnxruntime/pull/16531) added a temporary directory to save the model optimizations after loading a model into an `InferenceSession`. Many models that have an external data file, however, require the data file to be in the same directory as the ONNX model file. Because the model is saved in a temporary directory and the data is saved in another directory, this causes a `FileNotFoundError` error when trying to load the model in the temporary directory. This PR fixes this error by saving the external data file in the same directory that the optimized model is located in. ### Motivation and Context This PR fixes a bug with using a temporary directory while running the optimizer for models that have an external data file.	2023-07-20 18:44:28 -07:00
Xavier Dupré	2bc9fbb621	Fix url in the code documentation (graph optimizations) (#16770 ) ### Description Fix a wrong url in the documentation as mentioned in issue #16678. ### Motivation and Context Better documentation.	2023-07-20 07:02:22 -07:00
Dmitri Smirnov	e752cbe7f2	Work on eliminating Internal Compiler Error (#16741 ) ### Description <!-- Describe your changes. --> Replace the offending bitwise `operator \|` with if() logic for ARM.	2023-07-18 10:17:52 -07:00
cloudhan	a45b834722	Fix warning about uninitialized member (#16736 ) #16506 Cause almost every translation units on linux complaint ``` [1175/1235] Building CXX object CMakeFiles/onnxruntime_test_all.dir/home/guangyunhan/onnxruntime/orttraining/orttraining/test/training_ops/cuda/softmax_test.cc.o In file included from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/float16.h:18, from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/data_types.h:17, from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/tensor.h:17, from /home/guangyunhan/onnxruntime/onnxruntime/test/common/tensor_op_test_utils.h:16, from /home/guangyunhan/onnxruntime/onnxruntime/test/providers/compare_provider_test_utils.h:7, from /home/guangyunhan/onnxruntime/orttraining/orttraining/test/training_ops/cuda/softmax_test.cc:4: /home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h: In instantiation of ‘static constexpr uint16_t onnxruntime_float16::Float16Impl<Derived>::ToUint16Impl(float) [with Derived = onnxruntime::MLFloat16; uint16_t = short unsigned int]’: /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/float16.h:42:66: required from here /home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h:241:7: note: ‘union onnxruntime_float16::detail::float32_bits’ has no user-provided default constructor 241 \| union float32_bits { \| ^~~~~~~~~~~~ /home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h:242:16: note: and the implicitly-defined constructor does not initialize ‘unsigned int onnxruntime_float16::detail::float32_bits::u’ 242 \| unsigned int u; \| ^ ``` This PR shut the compiler up.	2023-07-17 11:33:54 -07:00
Dmitri Smirnov	b8c40b7813	Fix parameter naming that fails Doc generation. (#16717 ) ### Description Rename `FromBits` param name to match the docs. ### Motivation and Context Fix API Doc generation.	2023-07-16 22:02:05 -07:00
RandySheriffH	e1ca8ee6d4	RunAsync C/CXX API (#16613 ) Implement RunAsync API - the session will run in a thread of intra-op thread pool. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-07-16 16:51:40 -07:00
Dmitri Smirnov	853c4ff0a5	[C#, CPP] Introduce Float16/BFloat16 support and tests for C#, C++ (#16506 ) ### Description Introduce `Float16/BFloat16` support for C# and C++ APIs. Users should be able to perform conversions from `float` to/from `Float16/BFloat16`, compare values, and test for `NaN`, `Infinity`, and whether the number is denormalized. ### Motivation and Context Users filed issues such as: https://github.com/microsoft/onnxruntime/issues/14303	2023-07-14 10:46:52 -07:00
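To make the operations listed in the commit above concrete (conversion to and from `float`, and tests for NaN, infinity and denormalized values), here is a minimal sketch of the same ideas for IEEE 754 binary16 using only the Python standard library. It does not use or reproduce the C# or C++ `Float16` API added by the PR; the helper names are hypothetical.

```python
import math
import struct


def float_to_half_bits(value):
    """Round a Python float to binary16 and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_to_float(bits):
    """Interpret a 16-bit pattern as binary16 and widen it to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def is_subnormal_half(bits):
    """A binary16 value is denormalized when its 5 exponent bits are zero
    and its 10 mantissa bits are not."""
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    return exponent == 0 and mantissa != 0


one = float_to_half_bits(1.0)          # bit pattern 0x3C00
inf = half_bits_to_float(0x7C00)       # exponent all ones, mantissa zero
nan = half_bits_to_float(0x7E00)       # exponent all ones, mantissa non-zero
tiny = float_to_half_bits(2.0 ** -24)  # smallest positive subnormal, 0x0001
```

Note that `struct.pack` with the `e` format raises `OverflowError` for finite values too large for binary16, so callers that want saturation to infinity would need to handle that case themselves.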
cao lei	329e8156d4	clean unused parameter in ORT_UNUSED_PARAMETER (#16538 ) ### Description clean unused parameter in ORT_UNUSED_PARAMETER ### Motivation and Context clean unused parameters in ORT_UNUSED_PARAMETER which are introduced from #15833	2023-07-07 13:20:36 -07:00
Edward Chen	6be7b03e53	Enable `-Wshorten-64-to-32` warning if available. (#16524 ) - Fix some warnings from Xcode build (`-Wshorten-64-to-32`). - Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet. - Some clean up in build.py including setting CMake generator more consistently.	2023-07-07 08:11:44 -07:00
Xavier Dupré	d906d48ae9	Support custom ops taking float 8 tensors as inputs and outputs (#16323 ) ### Description C API for custom ops does not support float 8 types. This PR changes that. ### Motivation and Context The list of operators supporting float 8 is very limited. It should be extended to custom ops to let developers add customized operators for these specific types.	2023-07-06 14:36:06 +02:00
cao lei	0c5f492493	remove AllocatorMgr class (#16509 ) ### Description Remove AllocatorManager class ### Motivation and Context After the refactor PR #15833 is in, AllocatorManager class is not referenced anymore.	2023-06-28 15:43:19 -07:00
Baiju Meswani	efeb6672d6	Temporary optimizer support for ort format models in non minimal build (#16485 )	2023-06-28 11:35:57 -07:00
Christian Bourjau	6dd4e4801a	Allow custom operator functions to safely propagate errors through the C-API (#16479 ) ### Description This PR implements a backward-compatible way to define custom operators with fallible compute functions. The C++ API template gained an optional `Fallible` argument. Closes #14287 ### Motivation and Context #14287 contains more context. The gist is that the current C-API defines compute operations of custom operators as functions returning `void` rather than an `OrtStatusPtr`. Currently, errors are often propagated across the C-ABI using C++ exceptions. That is very unsafe and is undefined behavior. Moreover, it is difficult for languages other than C++ to use this approach even if they wanted to. A C-compliant, sound, and safe way to propagate errors allows for non-C++ fallible custom operators. ### An example in action https://github.com/cbourjau/ort-custom-op/pull/6/files is a demonstration of how this PR can be used to write safe and fallible custom operators in Rust.	2023-06-28 08:16:32 -07:00
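The general pattern behind this change, returning a status across the C ABI instead of letting an exception escape, can be sketched as follows. This is an illustration under stated assumptions, not the onnxruntime custom-op API: `DemoStatus`, `CheckedDivide`, and `DemoCompute` are hypothetical names, and the status is simplified to a static message pointer rather than an `OrtStatusPtr`.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>

// Hypothetical status type for this sketch: nullptr means success,
// otherwise a pointer to a static error message.
using DemoStatus = const char*;

// Internal C++ implementation that may throw.
static float CheckedDivide(float numerator, float denominator) {
  if (denominator == 0.0f) {
    throw std::invalid_argument("division by zero");
  }
  return numerator / denominator;
}

// C-ABI entry point. Exceptions are caught here and converted into a
// status value, so nothing ever unwinds across the C boundary.
extern "C" DemoStatus DemoCompute(const float* a, const float* b,
                                  float* out, std::size_t count) noexcept {
  try {
    for (std::size_t i = 0; i < count; ++i) {
      out[i] = CheckedDivide(a[i], b[i]);
    }
    return nullptr;  // success
  } catch (const std::exception&) {
    return "DemoCompute failed: invalid input";
  } catch (...) {
    return "DemoCompute failed: unknown error";
  }
}
```

A caller in any language with a C FFI only needs to compare the returned pointer against null, which is what makes this style usable from non-C++ code.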
Pranav Sharma	a270d8407e	Allow saving of large models after optimization (github issue 12882) (#16440 ) ### Description Allow saving of large models after optimization. ### Motivation and Context Addresses https://github.com/microsoft/onnxruntime/issues/12882	2023-06-21 22:46:26 -07:00
Chi Lo	4e3cff60fd	CUDA graph support for TRT EP (#16081 ) CUDA EP already supports [CUDA graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs), also we observed some models can benefit from using CUDA graph with `trtexec`. Therefore, this PR enables the CUDA graph support for TRT EP. The implementation is based on https://github.com/microsoft/onnxruntime/pull/9978 with the same [constraints](https://github.com/microsoft/onnxruntime/pull/9978) as below: - Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported. - Usage of CUDA Graphs is limited to models where-in all the model ops (graph nodes) can be partitioned to the TRT EP. - The input/output types of models need to be tensors. - Shapes of inputs/outputs cannot change across inference calls. - IObinding is required.	2023-06-21 09:36:45 -07:00
Yuhong Guo	48e6186b1a	Move tests from core/providers/cuda/test/* to test/providers/cuda/ and refactor CUDA UT (#16161 ) ### Description 1. Add a new test lib `onnxruntime_providers_cuda_ut`, which is similar to `onnxruntime_providers_cuda` but is only built if `onnxruntime_BUILD_UNIT_TESTS` is set. We can call all CUDA UTs through this UT lib without affecting the production lib `onnxruntime_providers_cuda`. 2. Move all test cases from `core/providers/cuda/test/` to `test/providers/cuda/`. These test cases are built into the lib `onnxruntime_providers_cuda_ut` and run by `./onnxruntime_test_all --gtest_filter="CUDA_EP_Unittest"`. Since the lib is only for tests, we can use gtest macros in the test cases. The previous implementation did not support using the gtest lib in the CUDA UT cases. 3. The cmake code in `cmake/onnxruntime_providers.cmake` is refactored a bit. A new function `onnxruntime_add_object_library` builds an object target. The 2 libs `onnxruntime_providers_cuda_ut` & `onnxruntime_providers_cuda` share most of the code, so the object files can be used in both libs, which helps reduce build time. Another function `config_cuda_provider_shared_module` is used to configure all 3 similar targets (onnxruntime_providers_cuda_obj/onnxruntime_providers_cuda/onnxruntime_providers_cuda_ut). 4. Refactored the test to call `testing::InitGoogleTest` & `RUN_ALL_TESTS` in `libonnxruntime_providers_cuda_ut.so`'s `TestAll`. After this change, we can see all the cases running in `CUDA_EP_Unittest.All`: ![image](https://github.com/microsoft/onnxruntime/assets/19584326/8ff80df6-060b-4ef0-90b7-657e68d3db87) ### Motivation and Context After https://github.com/microsoft/onnxruntime/pull/13016, there are still test files in test/providers/cuda/ that are not moved to core/providers/cuda/test/ and the test cases are disabled. This PR helps to clean up the unfinished TODOs. Even though onnxruntime_shared_lib_test covers some tests for the CUDA provider, it works like a coarse-grained end-to-end test. If the CUDA unit tests can run cases for a single component, this would be helpful for CUDA developers. --------- Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-06-20 14:54:55 -07:00
cao lei	dd72192cf4	ExecutionProvider API refactor - move allocator from EP level to SessionState level and indexed by OrtDevice (#15833 ) ### Description This PR refactors the ExecutionProvider API for memory management by moving allocators from the EP level to the SessionState level, indexed by OrtDevice. ### Motivation and Context With this change, the EP level is relieved of the burden of maintaining allocators, which makes things easier for EP developers. --------- Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-06-19 17:44:45 -07:00
Changming Sun	5754cd7d1d	Add fp16 support to CPU EP gemm op (#15506 )	2023-06-15 14:38:17 -07:00
Changming Sun	b72fe664c1	Refactor prepack buffer code (#16280 ) ### Description 1. Use IAllocatorUniquePtr to replace BufferUniquePtr. It will ensure the deleter is always right. 2. Change some std::unique_ptr to std::optional 3. Bypass Arena allocator when allocating the prepack buffers for mlas. In this special case, Arena doesn't help any. And this change is just an internal implementation change, it doesn't affect our public interface.	2023-06-08 14:42:02 -07:00
Dmitri Smirnov	908e940660	[CPP Api] Remove deprecated CustomOp API (#16256 ) ### Description Custom Op API has been deprecated in 1.15 release. We are removing it.	2023-06-07 14:03:13 -07:00
PeixuanZuo	1b518c6836	[ROCm] add early stop to tunable profile progress (#15716 ) For TunableOp, some instances may have very bad performance and take a long time during the profiling process. Add a `tunable_op_max_tuning_duration_ms` parameter to limit the maximum tuning time.	2023-06-01 10:18:25 +08:00
Xavier Dupré	e726151b5c	Introduce float 8 types (#14731 ) ### Description The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, FloatE5M2FNUZ as described in PR https://github.com/onnx/onnx/pull/4805. It uses the CUDA API to cast float/half to float8 if CUDA>=11.8, and a custom implementation if CUDA<11.8. * It implements Cast, QuantizeLinear, DequantizeLinear for all types on CPU, and only for types FloatE4M3FN, FloatE5M2 on CUDA. * It extends the supported types for control flow operators, Shape, Reshape, Identity, If, Loop, Scan. * It implements Equal(19). * Cast, QuantizeLinear, DequantizeLinear operators now support a parameter `saturate`, valid only for float 8 types. It is true by default. In that case, any value out of range is converted into the maximum float 8 value. If false, it is infinite. * QuantizeLinear, DequantizeLinear now support multiple scales on CUDA (and ROCm by extension), scale = 1D tensor with one scale per channel ### Motivation and Context Supports latest onnx version. Fixes [AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395) --------- Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2023-05-30 13:25:58 -07:00
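The `saturate` rule in this commit message can be sketched in isolation. The helper below is hypothetical and models only the out-of-range decision as the message words it (maximum value when saturating, infinity otherwise). It does not round to the float 8 grid, and it assumes infinite inputs count as out of range. The two constants are the largest finite values of the E4M3FN and E5M2 formats.

```cpp
#include <cmath>
#include <limits>

// Largest finite magnitudes of two of the float 8 formats.
constexpr float kMaxE4M3FN = 448.0f;
constexpr float kMaxE5M2 = 57344.0f;

// Hypothetical helper: applies only the out-of-range rule described in the
// commit message. In-range values and NaN pass through unchanged; a real
// cast would additionally round in-range values to the float 8 grid.
inline float ApplySaturate(float value, float max_finite, bool saturate) {
  if (std::isnan(value) || std::fabs(value) <= max_finite) {
    return value;
  }
  // Assumption of this sketch: infinite inputs are treated as out of range.
  const float limit =
      saturate ? max_finite : std::numeric_limits<float>::infinity();
  return std::copysign(limit, value);
}
```

With `saturate` left at its default of true, `ApplySaturate(1000.0f, kMaxE4M3FN, true)` clamps to `448.0f`; with `saturate` false, an out-of-range E5M2 input maps to a signed infinity.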
Dmitri Smirnov	9939092e71	[CPP API]Fix constness in C++API (#16103 ) ### Description `CreateMap` and `CreateSequence` should be able to take in const data.	2023-05-26 14:09:00 -07:00
Changming Sun	a5410515ad	Fix: Some fields in OrtCUDAProviderOptionsV2 struct are not initialized (#16113 ) ### Description The file include/onnxruntime/core/providers/cuda/cuda_provider_options.h is a C++ file. It is not for C. Before this commit, this header file is already not compatible with C compilers. Because it has: ``` onnxruntime::ArenaExtendStrategy arena_extend_strategy; ``` And this file is intended to be internal only. It is an internal header file. It should not be included in onnxruntime_c_api.h and should not be used with the public C APIs. User can only get the instance of OrtCUDAProviderOptionsV2 via CreateCUDAProviderOptions. In such a way we can add new members to this struct without breaking binary compatibility. Since it is an internal header, we can safely use C++ grammar there.	2023-05-26 11:34:22 -07:00
Yuhong Guo	04a8f50674	New configuration to limit the arena extension (#15983 ) Add a configuration `max_power_of_two_extend_bytes` to limit the arena extension size. ### Motivation and Context In our real scenario, we observe that if the model is big enough, the BfcArena extends uncontrollably. As shown by the following figures, if a model uses more than 16GB of memory, the BfcArena will request 32GB of memory in total according to the `kNextPowerOfTwo` strategy. With the new strategy, the extension is limited. The default maximum extension size is 1GB. #### Without the new configuration After loading the model, ORT uses 32G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/42b93c66-b957-4f20-a13b-d34cb390afff) #### With the new configuration After loading the model, ORT uses 23G GPU memory. ![image](https://github.com/microsoft/onnxruntime/assets/19584326/5abffeff-9ca3-4187-a262-37fd2764fe1b) Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>	2023-05-25 02:19:07 -07:00
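The effect of capping a doubling extension strategy can be shown with a small model. This is a simplified sketch, not the BfcArena implementation: `TotalReserved` is a hypothetical function that assumes each extension is twice the previous one until the cap is reached, and sizes are in arbitrary units (MB in the example below).

```cpp
#include <algorithm>
#include <cstdint>

// Simplified model of an arena that grows in doubling steps.
// Returns the total amount reserved once at least `needed` units are
// available. `max_extend` caps the size of any single extension.
// Hypothetical helper; the real arena logic is more involved.
inline std::uint64_t TotalReserved(std::uint64_t needed,
                                   std::uint64_t first_extend,
                                   std::uint64_t max_extend) {
  std::uint64_t total = 0;
  std::uint64_t extend = first_extend;
  while (total < needed) {
    total += std::min(extend, max_extend);
    if (extend < max_extend) {
      extend *= 2;  // doubling phase, stops growing once the cap is hit
    }
  }
  return total;
}
```

In this model, a request of 17 GB (17408 MB) starting from 1 MB extensions reserves 32767 MB when the cap is effectively unlimited, but only 18431 MB with a 1024 MB cap. The exact figures in the commit message differ because the real allocator and workload are not modeled here.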
Adrian Lizarraga	efc84a43e8	[QNN EP] Add session option to disable fallback to default CPU EP (#16016 ) ### Description Adds the session config option `disable_cpu_ep_fallback` to allow the user to prevent the CPU EP from handling nodes not supported by other execution providers.
```C++
// Graph nodes that are not supported by the execution providers (EPs) explicitly added to the session are
// assigned (i.e., "fallback") to the CPU EP by default.
//
// This option allows the user to disable the fallback of unsupported graph nodes to the CPU EP.
// If this option is set to "1", session creation will fail if the execution providers other than the CPU EP cannot
// fully support all of the nodes in the graph.
//
// It is invalid to set this option and explicitly add the CPU EP to the session. In this case, session creation
// will also fail with an error.
//
// Option values:
// - "0": CPU EP fallback is not disabled. [DEFAULT]
// - "1": CPU EP fallback is disabled.
static const char* const kOrtSessionOptionsDisableCPUEPFallback = "session.disable_cpu_ep_fallback";
```
#### Example use
```C++
#include "core/session/onnxruntime_cxx_api.h"
#include "core/session/onnxruntime_session_options_config_keys.h"

int main(int argc, char** argv) {
  Ort::SessionOptions so;
  so.AddConfigEntry(kOrtSessionOptionsDisableCPUEPFallback, "1");  // Disable fallback to the CPU EP.

  onnxruntime::ProviderOptions options;
#if defined(_WIN32)
  options["backend_path"] = "QnnCpu.dll";
#else
  options["backend_path"] = "libQnnCpu.so";
#endif
  so.AppendExecutionProvider("QNN", options);

  const ORTCHAR_T* ort_model_path = ORT_MODEL_FOLDER "qnn_ep_partial_support.onnx";
  Ort::Session session(*ort_env, ort_model_path, so);  // Throws exception if nodes fallback to CPU
  // ...
```
### Motivation and Context Makes it easier for application developers to ensure that the entire model runs on specific EPs. This is critical for Qualcomm/scenarios. If the compute cannot be offloaded to the NPU, running on CPU is not acceptable. (could be the difference between 90 second inference and 6 seconds inference) --------- Co-authored-by: Pranav Sharma <prs@microsoft.com>	2023-05-23 17:56:32 -07:00
Hector Li	4324d2173b	[QNN EP] Enable Qnn context cache to save model initialization time (#15815 ) ### Description Enable the Qnn context cache feature to save model initialization time. Provider options: `qnn_context_cache_enable|1` to enable the cache feature; `qnn_context_cache_path` to set the cache path. It is set to model_file.onnx.bin by default. ### Motivation and Context Model initialization takes a long time because of the cost of converting the Onnx model to a Qnn model. Qnn has a feature to serialize the Qnn context to a file, so that next time the user can load it from the cached context and execute the graph without paying that cost. --------- Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>	2023-05-19 10:52:17 -07:00
RandySheriffH	4dfb89b3ad	Implement mutex-free spin lock for task queue (#14834 ) Implemented "lock-free" spinlock to save CPU usage on context switching. The change has been tested on queene service of Ads team, the lock-free version of ort (40 threads) saves CPU usage on gen8 (128 logical processors on 8 numa nodes) windows by nearly half, from 65% to 35%. For 32 cores, the curve is flat:

Anubis, 32 vCPU, windows, hugging face models, 95 percentile E2E latency in ms:

| model | mutex(ms) | mutex-free |
| --- | --- | --- |
| alvert_base_v2 | 34.21 | 34.09 |
| bert_large_uncased | 116.27 | 117.84 |
| bart_base | 72.06 | 71.99 |
| distilgpt2 | 25.43 | 25.02 |
| vit_base_patch16_224 | 37.33 | 37.76 |

Anubis, 32 vCPU win, Linux, 1st party models, 95 percentile E2E latency in ms:

| model | mutex(ms) | mutex-free |
| --- | --- | --- |
| deepthink_v2 | 24.35 | 22.95 |
| bing_feeds | 36.96 | 36.48 |
| deep_writes | 14.46 | 14.32 |
| keypoints | 9.34 | 7.69 |
| model11 | 1.71 | 1.66 |
| model12 | 1.82 | 1.44 |
| model2 | 4.21 | 3.95 |
| model6 | 1.08 | 1.05 |
| agiencoder | 0.99 | 0.93 |
| geminet_transformer | 5.32 | 5.24 |

--------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-05-19 10:12:10 -07:00
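For readers unfamiliar with the technique, a mutex-free spin lock can be built from a single atomic flag. The sketch below is a generic illustration and not the code from this commit: `SpinLock` and `CountWithSpinLock` are hypothetical names, and the lock busy-waits without any backoff.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal spin lock built on std::atomic_flag. Waiting threads busy-wait
// in user space instead of blocking in the kernel, which avoids context
// switches when the critical section is very short.
class SpinLock {
 public:
  void lock() noexcept {
    // test_and_set returns the previous value: true means another thread
    // currently holds the lock, so keep spinning.
    while (flag_.test_and_set(std::memory_order_acquire)) {
    }
  }
  void unlock() noexcept { flag_.clear(std::memory_order_release); }

 private:
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical demo: several threads increment a shared counter under
// the spin lock and the final value is returned.
inline int CountWithSpinLock(int thread_count, int iterations) {
  SpinLock lock;
  int counter = 0;
  std::vector<std::thread> workers;
  workers.reserve(static_cast<std::size_t>(thread_count));
  for (int t = 0; t < thread_count; ++t) {
    workers.emplace_back([&lock, &counter, iterations] {
      for (int i = 0; i < iterations; ++i) {
        lock.lock();
        ++counter;
        lock.unlock();
      }
    });
  }
  for (std::thread& worker : workers) {
    worker.join();
  }
  return counter;
}
```

A spin lock like this only pays off when the protected section is short and the number of waiters is modest; under heavy contention, spinning wastes cycles that a blocking mutex would give back to the scheduler.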
cloudhan	856afa49dd	[C#] Add missing rocm csharp api (#15540 )	2023-05-18 08:15:19 +08:00
Baiju Meswani	6b7181d31d	Add C# API documentation for training (and some other changes) (#15935 )	2023-05-16 03:15:24 -07:00
cloudhan	dc383ed4ce	Basic CSharp packaging support for ROCm EP (#15535 ) This PR mainly fixes build errors when trying to build the nupkg for ROCm EP. It also slightly improves the packaging logic so that developers can produce the nupkg on Linux natively.	2023-05-16 07:27:38 +08:00


957 commits