### Description
I misunderstood how UpdateCUDAProviderOptions and
UpdateTensorRTProviderOptions work in the C API: I had assumed they
updated the existing options struct, but they actually re-initialize the
struct to its defaults and then apply only the values in the update. I've
rewritten the Java bindings for those classes so that they aggregate all
the updates and apply them in one go. I also updated the C API
documentation to note this behaviour. I haven't checked whether any of
the other providers with an options struct behave the same way; we only
expose CUDA and TensorRT's options in Java.
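For illustration, here's a minimal C++ sketch of the aggregated-update pattern against the C API; the option keys shown are just examples:
```
// Sketch: collect every key/value pair first, then issue a single
// UpdateCUDAProviderOptions call, since each call resets the struct to
// defaults before applying the keys it is given.
#include <onnxruntime_cxx_api.h>
#include <array>

void ConfigureCuda(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_options));

  std::array<const char*, 2> keys{"device_id", "arena_extend_strategy"};
  std::array<const char*, 2> values{"0", "kSameAsRequested"};
  Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_options, keys.data(), values.data(), keys.size()));

  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_CUDA_V2(session_options, cuda_options));
  api.ReleaseCUDAProviderOptions(cuda_options);
}
```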
There's a small unrelated update to add a private constructor to the
Fp16Conversions classes to remove a documentation warning (they
shouldn't be instantiated anyway as they are utility classes containing
static methods).
### Motivation and Context
Fixes #20544.
### Description
Remove the excess trailing semicolon from a specific macro.
### Motivation and Context
I am preparing automatic generation of onnxruntime bindings for Perl,
and the parser (ucpp) breaks on the "double semicolon" error on the
lines where the macro is applied.
Fix
onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637:
error: argument 'session' of command @param is not found in the argument
list of
```
OrtApi::AddExternalInitializersFromFilesInMemory(
OrtSessionOptions *options,
const char *const *external_initializer_file_names,
char *const *external_initializer_file_buffer_array,
const size_t *external_initializer_file_lengths,
size_t num_external_initializer_files)
```
Bump up version in main from 1.18.0 to 1.19.0 since the release branch
has been cut.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Introduce memory-efficient topo sort (for training)
~~and lazily initialize the Priority-Based and Memory-Efficient topo
sorts. Because in most cases they are not needed, this avoids the
overhead of GraphViewer construction for most use cases.~~
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Background:
A user saves a large model with the initializer data in an external file, e.g.:
onnx.save_model(onnx_model, "path/to/save/the/model.onnx", save_as_external_data=True, all_tensors_to_one_file=True,
location="filename", size_threshold=1024).
In that case, Ort loads the model, gets the external initializer information (external file name, offset, length), uses the model path to find the external file, and locates the tensor data via the offset and length.
But this doesn't work if the user loads the model from memory, since Ort loses track of the model path.
This PR adds an API/session option that lets the user provide a table keyed by the external initializer file name, with the pointer to the loaded external file in memory and the buffer length as the value. As a result:
1. The user can load the model from a memory buffer with the external initializers in memory buffers too.
2. The initializers can be shared across sessions, for different EPs.
3. The user can load the file in any way they want, e.g. mmap.
Internally:
1. At session creation time, Ort goes through the external initializers in the graph and gets each one's file name, offset, and data length from its TensorProto.
2. With the file name, Ort gets the in-memory file buffer and its length from the table the user provided.
3. Ort locates the tensor buffer within the user-provided in-memory file buffer using the offset and data length from the TensorProto.
4. Ort creates the Tensor and replaces the existing Tensor in the graph.
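As a rough usage sketch (the file name and buffer variables are placeholders), the new API can be called like this:
```
// Sketch: register an in-memory copy of the external-data file so a model
// loaded from a memory buffer can resolve its external initializers.
#include <onnxruntime_cxx_api.h>

void AddExternalFiles(Ort::SessionOptions& session_options,
                      char* weights_buffer, size_t weights_length) {
  const OrtApi& api = Ort::GetApi();
  // The name must match the "location" recorded in the TensorProto external_data.
  const char* file_names[] = {"filename"};
  char* file_buffers[] = {weights_buffer};  // file contents already loaded, e.g. via mmap
  const size_t file_lengths[] = {weights_length};
  Ort::ThrowOnError(api.AddExternalInitializersFromFilesInMemory(
      session_options, file_names, file_buffers, file_lengths, 1));
}
```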
### Motivation and Context
https://github.com/onnx/onnx/blob/main/docs/ExternalData.md
For a model with external data, the TensorProto may keep initializer data in a separate file. The external file location is stored as a file path relative to the model path. With the API that loads a model from a memory buffer, the model path is lost, so loading a model with external data fails. By adding a session option to set the external data buffers, Ort can find the external data correctly when the model is loaded from a memory buffer.
Add a provider option to let the user provide the profiling file path.
Separate out the profiling level for ETW, to handle cases where ETW is enabled when Ort creates the QNN profiling but gets disabled by the time Ort logs the profiling events, and vice versa. Enhance the logic that decides the profiling level.
### Description
<!-- Describe your changes. -->
The first call to Graph::Resolve occurs when creating the Graph instance
when loading an existing model from ModelProto. As the Node instance
will exactly match the source NodeProto there's no need to call
Node::ToProto in this case.
Add a temporary reference to the original NodeProto to avoid the call on
the first Graph::Resolve.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Better alternative to #19469
### Description
For C++ standards >= 20, use `std::chrono::operator<<` in place of
`date::operator<<` to fix ambiguous operator compile error.
### Motivation and Context
The external dependency HowardHinnant/date has a conflict with
std::chrono for >=C++20.
Solves #20137
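A minimal sketch of the idea (not the actual ORT code), assuming a toolchain where the C++20 `<chrono>` stream operators are available:
```
#include <chrono>
#include <iostream>
#if __cplusplus < 202002L
#include "date/date.h"
#endif

void PrintTimestamp(std::chrono::system_clock::time_point tp) {
  // Pick the stream operator by language standard to avoid the ambiguity
  // between the date:: and std::chrono:: overloads.
#if __cplusplus >= 202002L
  using std::chrono::operator<<;
#else
  using date::operator<<;
#endif
  std::cout << std::chrono::floor<std::chrono::milliseconds>(tp) << '\n';
}
```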
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Certain graph transformers slow down session loading while providing no
runtime perf benefit.
Allow clients to exclude them.
### Description
The dml_provider_factory header file can't be used in C programs as it
defines C++ inline operators. This PR rearranges that header file so
that it looks like valid C when used from C, and also makes a couple of
small modifications to the Java code so it correctly binds to the DML EP
at build time.
I'm having some difficulty testing it as I think it's pulling in the old
version of DirectML on my computer and I can't figure out what the
library loading path is in Java to make it look at the recent version I
downloaded. So the test I added fails with:
```
InferenceTest > testDirectML() FAILED
ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Exception during initialization: <path-to-ort>\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(518)\onnxruntime.dll!00007FFF74819333: (caller: 00007FFF74793509) Exception(3) tid(4f58) 80070057 The parameter is incorrect.
at app//ai.onnxruntime.OrtSession.createSession(Native Method)
at app//ai.onnxruntime.OrtSession.<init>(OrtSession.java:74)
at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:236)
at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:221)
at app//ai.onnxruntime.InferenceTest.openSessionSqueezeNet(InferenceTest.java:1961)
at app//ai.onnxruntime.InferenceTest.runProvider(InferenceTest.java:665)
at app//ai.onnxruntime.InferenceTest.testDirectML(InferenceTest.java:657)
```
But it does correctly compile, and this error seems very similar to
other issues with the DML provider when it doesn't like a model due to
the loaded library being old. The test is using the squeezenet file
that's been in the repo since 2019. If someone can help me figure out
how to get the right version of DML in the library path I can test it
more on my end. I tried adding the folder with the new version into the
system path, but I'm not very familiar with Windows' library loading
behaviour.
### Motivation and Context
Fixes #19656 to allow use of the DirectML EP from ORT Java.
cc @martinb35
### Description
New `dump_om_model` flag for the **CANN EP**, which defaults to "True".
### Motivation and Context
When building an ONNX model with the CANN EP, the intermediate **OM (offline
model for Ascend NPU)** is automatically saved. Some users don't want to
dump the OM when resources are limited.
This PR resolves that with `dump_om_model=False`
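A hypothetical usage sketch, assuming the CANN EP exposes the V2-style provider-options C API (CreateCANNProviderOptions / UpdateCANNProviderOptions) and accepts the flag as a string key/value; the exact value spelling may differ:
```
// Sketch: turn off OM dumping for the CANN EP via its provider options.
#include <onnxruntime_cxx_api.h>

void AppendCann(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtCANNProviderOptions* cann_options = nullptr;
  Ort::ThrowOnError(api.CreateCANNProviderOptions(&cann_options));
  const char* keys[] = {"dump_om_model"};
  const char* values[] = {"False"};
  Ort::ThrowOnError(api.UpdateCANNProviderOptions(cann_options, keys, values, 1));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_CANN(session_options, cann_options));
  api.ReleaseCANNProviderOptions(cann_options);
}
```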
### Description
<!-- Describe your changes. -->
Add API functions GetAliasMap and ReleaseAliasMap to OrtCustomOp
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add API functions GetAliasMap and ReleaseAliasMap to OrtCustomOp
### Description
<!-- Describe your changes. -->
Use OrtCustomOp's new GetMayInplace API in CreateKernelCreateInfo to
hook up the in-place map of custom ops
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR uses OrtCustomOp's new GetMayInplace API in
CreateKernelCreateInfo to hook up the in-place map of custom ops
### Description
Expose Reserve() in OrtAllocator to allow custom allocators to work when
session.use_device_allocator_for_initializers is specified.
Update: this change has been verified by Bing Ads and brings a
significant benefit in terms of memory utilization: 30GB less memory and
also better CPU utilization.
### Motivation and Context
https://microsoft-my.sharepoint.com/:w:/p/prs/Eeidf5YNtWtKrPVkfuTDsuABak1oL4QRpuBGuhqRbLKoJg?e=Zl3bah
### Description
Address build issues and source code discrepancies.
Fix cuda_test_provider gtest argument stack corruption.
### Motivation and Context
The `OpTester` class that is widely used for kernel testing is not
suitable for testing internal classes of EPs that are built as shared
objects.
Currently, CUDA EP tests run only on Linux.
We want to enable testing and development on Windows,
and create a usable pattern for testing other EPs' internals.
Alternatives considered:
Abstracting EP unit tests into a separate test executable such as
`onnxruntime_test_all`.
This alternative was rejected as it would create many more changes to
the established patterns,
and potentially interfere with CUDA functionality through more complex
source code maintenance.
### Description
<!-- Describe your changes. -->
This change addresses the following issues with the current CustomOP
output type inference:
- The function does not take optional inputs into account. When an input
is absent, inference is silently aborted and no output type is inferred
(P1 customer issue).
- Inferring the output type from the input type for multi-kernel custom
ops is done based on the last kernel definition in the sequence. No
attempt is made to match the kernel based on the input type.
- Inference is aborted when variadic inputs/outputs are detected and
the generated input/output names fail to obtain type constraints. This
is not immediately clear from the code, because the custom op schema is
not available within the inference function.
- No error reporting.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Most custom ops lack their own type and shape inference function, as it
was only recently introduced. For that reason, it is important to fix this.
This change is inspired by a customer issue.
This is a follow up on:
- https://github.com/microsoft/onnxruntime/pull/15184
- https://github.com/cbourjau/ort-custom-op/pull/11
- https://github.com/microsoft/onnxruntime-extensions/issues/451
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers
### Motivation and Context
This PR includes changes that make ORT handle 2GB+ checkpoints.
To do that we need to upgrade Flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945
- Modified the commitHash and the hash for the new version
- Removed the patch for the Rust generator's unused-variable warning, as
it no longer produces this - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.
---------
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
### Description
<!-- Describe your changes. -->
Add 2 C APIs for ORT extensions:
- KernelInfo_GetAllocator
- OrtCustomOp::GetMayInplace
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add 2 C APIs for the ORT extensions project, which will leverage them
for the GroupQueryAttention custom op.
### Description
<!-- Describe your changes. -->
Add a new API, KernelContext_GetScratchBuffer, to get a scratch buffer
from the kernel context.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add a new API, KernelContext_GetScratchBuffer, to get a scratch buffer
from the kernel context. It will be used in the ORT extensions project
for the GroupQueryAttention custom op.
### Description
<!-- Describe your changes. -->
1. Add a config key in run_options to control CUDA graph usage at runtime.
2. Enhance the CUDA graph class to support saving and retrieving multiple
graphs in one ORT session.
3. Provide a model modification/inference example on Phi-2.
4. Benchmarks show an average 13% latency reduction in token generation.
Limitation: the TRT EP and ROCm EP haven't adopted this feature yet; we
can revisit this in the future.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Add CUDA NHWC support for SpaceToDepth and DepthToSpace.
- Add a new test which verifies that SpaceToDepth swizzling for the H
axis is correct.
- If CUDA NHWC is enabled, run all tests on the CUDA EP with NHWC as
well.
### Motivation and Context
Add more NHWC operations to avoid layout transformations and improve
efficiency when using the CUDA EP.
### Description
<!-- Describe your changes. -->
Address warnings so all the ORT projects build with /W4 on Windows.
Mainly
- unused parameters
- variables shadowing other ones
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19588 started on this.
### Description
Implement IsInf-10 and IsInf-20 for CUDA.
Also add FP16 types on CPU.
### Motivation and Context
Certain models lag in performance because IsInf is not available on CUDA.
### ONNX Gelu Op in Opset 20
Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op
1. Move the CPU Gelu implementation from
`onnxruntime/contrib_ops/cpu/activations.h/cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for the approximate attribute 'none'.
2. Duplicate some logic from
`onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` into
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for the approximate attribute 'tanh'.
3. Register the ONNX domain Gelu CPU kernel from opset 20 in
`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`.
4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to
`onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and
`onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu`
respectively, as the implementation for the approximate attribute
'tanh'.
5. Implement the logic for the approximate attribute 'none' in
`onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`.
6. Register the ONNX domain Gelu CUDA kernel from opset 20 in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
7. ROCm EP related changes.
8. Enrich the tests for ONNX domain Gelu in
`onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.
### Description
Currently, the QNN HTP performance mode is set during session creation and there's no way to change it afterwards. There is a requirement to set a high performance mode for high-priority requests and set it back to a low performance mode later, to save power when incoming requests are idle, for example.
Now, the performance mode is still kept at the session level in the QNN EP options and is used as the default; Ort QNN EP sets it once if the user sets it.
In addition, there are settings (qnn.htp_perf_mode and qnn.htp_perf_mode_post_run) in the run options to change the performance mode before and after a session run. A recommended scenario is for the user to set a high performance mode before the inference run so the result comes back ASAP, and a low performance mode after the inference to save power.
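For illustration, a minimal sketch of the new run-option keys; the mode value strings are assumed to follow the QNN EP's existing htp_performance_mode values:
```
// Sketch: request a high performance mode for this run and drop back to a
// power-saving mode once the run completes.
#include <onnxruntime_cxx_api.h>

Ort::RunOptions MakeBurstRunOptions() {
  Ort::RunOptions run_options;
  run_options.AddConfigEntry("qnn.htp_perf_mode", "burst");                     // applied before the run
  run_options.AddConfigEntry("qnn.htp_perf_mode_post_run", "low_power_saver");  // applied after the run
  return run_options;
}
```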
### Description
<!-- Describe your changes. -->
Adds infrastructure to create an ML Package containing the Model using
ML Program. Updated coremltools files to v7.1 to bring in new protobuf
definitions along with the tools to write the weight.bin file and create
an ML Package correctly.
Enables building a CoreML Model on all platforms which means all the
operator builder code can be debugged anywhere. Execution of the
generated CoreML model is obviously limited to Apple platforms.
The Conv operator builder has been updated to be able to generate an ML
Program Operation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
NeuralNetwork is no longer being developed and ML Program is the
replacement going forward.
### Description
<!-- Describe your changes. -->
An overridable initializer should not have a fixed value included in an
NNAPI model as it could be changed at runtime. The current check doesn't
include validating that the initializer is constant.
I was updating GetClipMinMax as part of adding CoreML EP ML Program
support, and this set of changes was required to make both CoreML and
NNAPI do the more correct thing of using IsConstantInitializer.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make NNAPI and CoreML EPs more correct.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
[TF32](https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/)
can help boost performance on GPUs with SM >= 80. Sometimes a user
observes accuracy loss, or needs to disable TF32 for testing purposes. To
disable TF32, it is also possible to set the environment variable
`NVIDIA_TF32_OVERRIDE=0`. However, sometimes we do not want to use an
environment variable, to avoid impacting other applications, or we want
finer control (like one session using TF32 and another session not). This
provider option helps with that.
Here we add a provider option `use_tf32`. When `use_tf32 = 0`, we disable
TF32 for float MatMul/GEMM in cuBLAS. It applies to the MatMulNBits,
Attention, LongformerAttention, PackedAttention, and
PackedMultiHeadAttention operators when a float GEMM is used internally
in the operator. Note that it does not impact other data types; for
example, FP8 GEMM could still use TF32 in accumulation.
Previously, cublasGemmStridedBatchedHelper did not use TF32 in
inference. Here we enable TF32 by default, so we might observe a speedup
for FP32 transformer models on SM >= 80.
There is another PR that enables the option for cuDNN Conv later.
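A minimal sketch (assumed usage) of disabling TF32 for one session via the new provider option:
```
#include <onnxruntime_cxx_api.h>

void DisableTf32(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_options));
  const char* keys[] = {"use_tf32"};
  const char* values[] = {"0"};  // disable TF32 for float MatMul/GEMM in cuBLAS
  Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_options, keys, values, 1));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_CUDA_V2(session_options, cuda_options));
  api.ReleaseCUDAProviderOptions(cuda_options);
}
```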
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/15407
https://github.com/microsoft/onnxruntime/issues/19288
### Description
<!-- Describe your changes. -->
Refactor the VAIEP to use MSFT's standalone API
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
### Description
<!-- Describe your changes. -->
Make the EP member function GenerateMetaDefId a standalone function,
decoupling it from the EP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change is part of the ExecutionProvider API refactoring; we will
make a clean ExecutionProvider API first for the later EPv2 work
### Description
This PR adds an SbgemmKernel for aarch64. This includes the Sbgemm kernel
implementing matrix multiplication with bfloat16 SIMD instructions
(bfmmla), and MatMul operator changes to invoke the Sbgemm kernel. To
enable the Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"
The PR also adds new test cases for mlas and ort.
### Motivation and Context
This is to improve MatMul performance on aarch64 platform.
I have run the below benchmarking script (BERT, RoBERTa and GPT-2 model
inference) on an AWS Graviton3-based c7g.4xl instance and observed a
1.2x-1.76x performance improvement compared to the sgemm (fp32) kernel
performance.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
The unit test precision results match the sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync `
### Description
- Adds the following session options to configure the device:
- `soc_model`: The SoC model number. Refer to the QNN SDK documentation
for valid values. Defaults to "0" (unknown).
- `htp_arch`: The minimum HTP architecture the driver will use to select
compatible QNN operators.
- `device_id`: The ID of the device to use when setting 'htp_arch'.
Defaults to "0" (for single device).
### Motivation and Context
Allow more configuration.
Several changes:
1. To align with how other EPs set EP context configs in session
options, for example [QNN
EP](https://github.com/microsoft/onnxruntime/pull/18877), EP context
configs for the TRT EP can be configured through (see the sketch after this list):
   1. Session options: `ep.context_enable`, `ep.context_file_path` and
   `ep.context_embed_mode`
   2. Provider options: `trt_dump_ep_context_model`,
   `trt_ep_context_file_path` and `trt_dump_ep_context_embed_mode`
   3. The above settings have a 1:1 mapping, and the provider options take
   priority over the session options.
```
Please note that there are rules for using following context model related provider options:
1. In the case of dumping the context model and loading the context model,
for security reason, TRT EP doesn't allow the "ep_cache_context" node attribute of EP context node to be
the absolute path or relative path that is outside of context model directory.
It means engine cache needs to be in the same directory or sub-directory of context model.
2. In the case of dumping the context model, the engine cache path will be changed to the relative path of context model directory.
For example:
If "trt_dump_ep_context_model" is enabled and "trt_engine_cache_enable" is enabled,
if "trt_ep_context_file_path" is "./context_model_dir",
- if "trt_engine_cache_path" is "" -> the engine cache will be saved to "./context_model_dir"
- if "trt_engine_cache_path" is "engine_dir" -> the engine cache will be saved to "./context_model_dir/engine_dir"
```
2. The user can decide the naming of the dumped "EP context" model by using
`trt_ep_context_file_path`; please see GetCtxModelPath() for more
details.
3. Added suggested comments from
https://github.com/microsoft/onnxruntime/pull/18217
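A minimal sketch of the session-options route described in item 1 (the file path is an example):
```
#include <onnxruntime_cxx_api.h>

void EnableEpContextDump(Ort::SessionOptions& session_options) {
  session_options.AddConfigEntry("ep.context_enable", "1");
  session_options.AddConfigEntry("ep.context_file_path", "./context_model_dir/model_ctx.onnx");
  session_options.AddConfigEntry("ep.context_embed_mode", "0");  // 0: engine cache kept beside the model
}
```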
Fix an issue where the generated context cache model's inputs/outputs order is not guaranteed
### Description
Currently, the QNN EP generates the context cache model in the Compile() method, which only has access to the partitioned graph; the inputs/outputs order for the partitioned graph is not guaranteed, and the EP doesn't have a view of the user's input model. The context cache model generation has to move to a higher level, into GraphPartitioner, which has a view of the partitioned model.
This is also a breakdown of the PR for multi-partition support:
https://github.com/microsoft/onnxruntime/pull/18865
### Description
<!-- Describe your changes. -->
Bump up version to 1.18.0 since the release branch has been cut.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
### Description
<!-- Describe your changes. -->
Add a new option `trt_engine_cache_prefix` to customize the TRT EP engine
cache prefix, i.e.:
- If the user specifies `trt_engine_cache_prefix|FRCNN
trt_engine_cache_enable|true` when running an FRCNN model, the cache will
be saved/loaded as `FRCNN_2068723788287043730_*_sm80.engine`. The engine
profile follows the same pattern.
- If this option is skipped, the engine will be saved/loaded as
`TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_2068723788287043730_*_*_sm80.engine`
by default.
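A minimal sketch (assumed usage) of setting the prefix together with engine caching through the TensorRT provider options:
```
#include <onnxruntime_cxx_api.h>

void ConfigureTrtCache(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));
  const char* keys[] = {"trt_engine_cache_enable", "trt_engine_cache_path", "trt_engine_cache_prefix"};
  const char* values[] = {"1", "./trt_cache", "FRCNN"};
  Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 3));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(session_options, trt_options));
  api.ReleaseTensorRTProviderOptions(trt_options);
}
```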
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/16708
---------
Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Pass through the ConfigOptions from the session via OpKernelInfo so that
kernel behavior can be configured.
Initial usage would be to optionally enable a fast path for ARM64 bfloat16 GEMM - see #17031
Other usages could be things like selecting the exact implementations of the activation functions for RNN operators instead of the default approximations (e.g. use [sigmoid_exact instead of sigmoid](2d6e2e243d/onnxruntime/core/providers/cpu/rnn/rnn_helpers.h (L379-L382)))
OpKernelInfo already passes through things from the session state, and adding a new ConfigOptions member
is the simpler update. It's also a natural fit given it provides state/info to the kernel.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Introduce AppendExecutionProvider_OpenVINO_V2 API and support for OV
2023.3.
### Context
- The API is added to facilitate customers in using published official
Microsoft onnxruntime libraries with OVEP libraries.
- Add support for OpenVINO 2023.3 official release.
- Extend operator coverage
- GH fixes
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
When the TRT engine cache (a precompiled engine) is present, it doesn't
make sense to go through model verification, model optimization, the TRT
EP's GetCapability(), the TRT EP's model proto reconstruction, calling the
TRT parser, and engine compilation.
This PR makes the TRT EP skip those processes and directly load the engine
to perform inference.
The feature request:
https://github.com/microsoft/onnxruntime/issues/18072
Features:
- Replace original model with TRT engine wrapped ONNX model. It can save
a lot of time as mentioned above.
- How to get TRT engine wrapped ONNX model?
1. Set the `trt_dump_ep_context_model` provider option to "true" and run the
inference. You will find the "xxx_wrapper.onnx" at the engine cache
path (the same logic as generating the engine cache).
2. Use gen_trt_engine_wrapper_onnx_model.py
- Three provider options are added:
  - `trt_dump_ep_context_model`: enable dumping the wrapped ONNX model by the TRT EP.
  - `trt_ep_context_embed_mode`: add embed_mode as an attribute. 0 means engine
cache path, 1 means engine binary data.
  - `trt_ep_context_compute_capability_enable`: add hardware_arch as an
attribute. When running the model, the TRT EP will check consistency between
the model's hardware_arch and the GPU's compute capability.
- When the engine cache path is given in the wrapped model, the TRT EP will
first search for the engine file using the path (relative to the model
path); if it can't find it, it will use the path as-is
(depending on the user, it could be relative to the working dir or an absolute path)
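For illustration, a minimal sketch (assumed usage) of dumping an engine-wrapped model with these provider options:
```
#include <onnxruntime_cxx_api.h>

void DumpEngineWrappedModel(Ort::SessionOptions& session_options) {
  const OrtApi& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));
  const char* keys[] = {"trt_engine_cache_enable", "trt_dump_ep_context_model", "trt_ep_context_embed_mode"};
  const char* values[] = {"1", "1", "0"};  // embed_mode 0: keep the engine as a separate cache file
  Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 3));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(session_options, trt_options));
  api.ReleaseTensorRTProviderOptions(trt_options);
}
```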
Note:
1. This PR includes the change of
https://github.com/microsoft/onnxruntime/pull/17751
Constraints:
1. The whole model should be fully supported by TRT.
2. Users need to make sure the engine is built with min/max/opt
optimization profiles that are large enough to cover the range of all
inputs. The TRT EP will simply fail and won't rebuild the engine if the
input shape is out of range at runtime.
### Description
This PR has several combined ORT ETW changes that improve ORT log
diagnosability & performance.
- The existing log behavior in the ORT API and Severity behavior remain
the same as compiled by the dev using the ORT API
- The PR keeps the existing design which has 2 TraceLogging providers
defined (although both were not used before this PR)
- Keeps great inference (inf) and session load performance even with
dynamic logging enabled (see below)
- On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, ORT
will dynamically _add_ a new sink reflecting the severity level provided
by ETW, e.g. Critical - Verbose, per the need at runtime
- This allows previous printf style LOGS() statements both for CPU and
NPU cases to flow to ETW via a local trace (if enabled)
- This DOES NOT add any new Telemetry which can optionally be sent to
Microsoft.
- Telemetry are ETW events marked with the Measure keyword that _can_ be
sampled if a box opts-in
- Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and
levels added if they were missing
- If Execution Providers (EPs) can provide more detailed insight into
their HW or component, then this PR allows for those to be dynamically
logged instead of just at compile time
- In this PR, the QNN EP for QC NPUs can have basic or detailed
profiling enabled to give some insight into how the NPU is performing
- When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the
Profiling keyword and level 5 then QC QNN basic profiling info is output
to ETW
### Motivation and Context
- This make ORT logging and diagnosability more performant (on Windows)
and available in a wider variety of runtime environments.
- Inference times were drastically (~300x+) better/faster when these
logs were output to ETW vs just stdout (Verbose
severity)
- This style of ETW dynamic tracing is the widely used standard for
Windows components, and even by some 3rd party software since the ETW
API is open and part of the Windows API
- These ETW based logs can be seamlessly combined with other ETW logs
such as an AI component/feature using ORT, OS CPU profiling, scheduling,
and more
- Before the PR, ORT logging is largely printf style and output to a
sink (usually stdout) only if compiled with a certain log Severity. When
enabled the previous logging (to stdout) would vastly slow down
inference times. Once compiled for release the internal ORT logs were
not accessible by anyone except the AI model developer in their dev
inner loop. That means logs could not be enabled on a lab machine, or on
a production system where the runtime behavior or performance might be
different for various reasons on a wide variety of HW.
- This change was tested with performance in mind and tested with a
mobilenet small AI model with onnxruntime_perf_test
- CPU: There was no statistical difference for inf times with the
baseline (main) or this PR whether ETW was enabled or not (both ORT
providers all keywords level 5).
- NPU (QNN on SP9 or Dev Kit 2023 QC SQ3): There was no statistical
difference for inf times with this PR whether ETW (both ORT providers
all keywords) were enabled or not for Level 5 (Verbose). This is even
with QNN Basic profiling turned on and outputting NPU stats to ETW
- As expected and designed, there was perf slowdown when Max Level 255
is enabled which translates to QNN Detailed profiling. This mirrors the
expected slowdown in the NPU when individual model operations are
monitored & recorded as well. This perf is similar to the QNN SDK
Detailed profiling performance separate from this PR. This is designed
to be above level 5 (verbose) as that is commonly the max level used in
many trace profiles and won't affect inf performance.
- Other OSes such as Linux & Android are left untouched for now.
- Out of scope for this PR but TraceLogging is available for Linux with
LTTng tracing. So in the future, this optional tracing could also be
made available on other OSes where a TraceLogging API is available
### Description
<!-- Describe your changes. -->
SessionOptions use_deterministic_compute can currently be set via the Python API only.
Users requested enabling it via the C API as well.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17416
### Description
<!-- Describe your changes. -->
If we fail to calculate the buffer size (due to overflow) we currently
return a nullptr. This is inconsistent as an actual memory allocation
failure throws. An overflow would typically be due to bad input so an
exception makes more sense given that.
Change to throw so code using MakeUniquePtr* and AllocArray* doesn't
need to check for nullptr.
Add some extra info to the log message to help debugging.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Should help with #18905 by avoiding the invalid attempted usage of a
nullptr from the allocation. Extra info _might_ help with figuring out
where the overflow is coming from which is the real issue.
Move QNN EP provider options to session options
### Description
We need to use a session option to support multi-partition for the context cache feature. To smooth the transition, move the provider options to session options first.
This is the first step for PR
https://github.com/microsoft/onnxruntime/pull/18865