Add new provider option `trt_op_types_to_exclude`:
- Users can provide a list of op types to exclude from running on TRT
- e.g. `trt_op_types_to_exclude="MaxPool"`
There is a known performance issue with the DDS ops (NonMaxSuppression,
NonZero and RoiAlign) in TRT versions 10.0 to 10.7. TRT EP therefore excludes
the DDS ops from running on TRT by default; users can override the default
value with an empty string to include all ops.
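A minimal sketch of setting the option via the public C API provider-option helpers (an ORT build with the TensorRT EP is assumed):

```c++
// Minimal sketch using the C API provider-option helpers (TensorRT EP build assumed).
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakeTrtSessionOptions() {
  const OrtApi& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));

  const char* keys[] = {"trt_op_types_to_exclude"};
  const char* values[] = {"MaxPool"};  // "" would re-include all ops, overriding the DDS default
  Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 1));

  Ort::SessionOptions so;
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(so, trt_options));
  api.ReleaseTensorRTProviderOptions(trt_options);
  return so;
}
```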
- cast
- argmax
- gelu
- LayerNorm
- GroupNorm
- InstanceNorm
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Distinguish between DML and the generic 'GPU' term. This is needed for
packaging the DML EP in the same ORT GPU package.
### Motivation and Context
Customer requirement.
### Description
Allow some classes to be default-constructed.
The effect is the same as constructing them with nullptr.
Make the default ctor visible from the base classes.
### Motivation and Context
Multiple customers reported that when storing Ort::Value
in a std::vector, the vector cannot be resized.
We enable that by allowing Ort::Value to be default-constructed.
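A sketch of what this enables; the `resize` below is the operation that previously failed to compile:

```c++
// Default-constructed Ort::Value behaves as if constructed with nullptr,
// so containers can be resized before the values exist.
#include <vector>
#include <onnxruntime_cxx_api.h>

void PreallocateOutputs(size_t n) {
  std::vector<Ort::Value> outputs;
  outputs.resize(n);  // previously ill-formed: Ort::Value had no default ctor
  // Elements can be assigned later, e.g. outputs[0] = Ort::Value::CreateTensor(...).
}
```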
### Description
part of https://github.com/microsoft/onnxruntime/issues/21448
This change is intended to save CPU memory during model load for
inference.
Added session option save_prepacked_constant_initializers. With
save_prepacked_constant_initializers turned on:
1. optimize the model with an inference session; prepacked external
initializers will be saved into the data file.
2. load the optimized model and external data file with prepacked
initializers; no prepacking is needed.
3. run inference with the optimized model and data file.
Tested with the model Phi-3-mini-instruct-onnx, comparing ORT 1.12.0 against
this change: peak memory usage dropped from **5.438 GB to 2.726 GB**.
This change takes advantage of ORT loading external initializers with mmap
on CPU. Prepacking uses extra heap memory, so omitting the prepack step saves
that memory (roughly the same size as the external initializers).
Next step:
Change all the CPU kernels that implement the PrePack method and test them
properly. Will do in the next PR.
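A hedged sketch of the intended two-phase flow; the config key string below is inferred from the option name in this description and may differ from the exact constant in the headers:

```c++
// Hedged sketch: check onnxruntime_session_options_config_keys.h for the
// exact config key constant; the string here is an assumption.
#include <onnxruntime_cxx_api.h>

void SaveOptimizedModelWithPrepackedData(const ORTCHAR_T* model_in,
                                         const ORTCHAR_T* model_out) {
  Ort::Env env;
  Ort::SessionOptions so;
  so.AddConfigEntry("session.save_prepacked_constant_initializers", "1");
  so.SetOptimizedModelFilePath(model_out);  // step 1: optimize + save prepacked initializers
  Ort::Session session(env, model_in, so);
  // Steps 2-3: later sessions load model_out and its data file and skip prepacking.
}
```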
### Motivation and Context
### Description
1. Remove the onnxruntime::OrtMutex class and replace it with
~absl::Mutex~ std::mutex.
2. After this change, most source files will not include <Windows.h>
indirectly.
### Motivation and Context
To reduce the number of deps we have, and address some Github issues
that are related to build ONNX Runtime from source.
In PR #3000, I added a custom implementation of std::mutex. It was
mainly because at that time std::mutex's default constructor was not
trivial on Windows. If you had such a mutex as a global var, it could
not be initialized at compile time. Then VC++ team fixed this issue.
Therefore we don't need this custom implementation anymore.
This PR also removes nsync. I ran several models tests on Linux. I
didn't see any perf difference.
This PR also reverts PR #21005 , which is no longer needed since conda
has updated its msvc runtime DLL.
This PR unblocks #22173 and resolves #22092. We have a lot of open
issues with nsync; this PR can resolve all of them.
### Description
Updates the ROCm EP opsets to match the current CUDA EP opsets. Also
enables the test CApiTest.basic_cuda_graph_with_annotation.
Note that some changes are whitespace-only. These changes were made to
improve the comparison of corresponding ROCm and CUDA EP source files
when using a side by side diff tool.
### Motivation and Context
The ROCm EP derives from the CUDA EP. Many source files are shared
between the EPs and "hipified" during the ROCm EP build; however, quite a
few files within the ROCm EP are under source control after their
initial hipification. Over time these ROCm EP files get stale relative
to their CUDA EP counterparts. It becomes necessary to re-hipify these
otherwise static files in order to pick up important changes such as
opset differences.
### Description
Adds QNN provider option `offload_graph_io_quantization` to offload
graph input quantization and graph output dequantization to the CPU EP.
The option is disabled by default to maintain current behavior.
### Motivation and Context
Offloading the handling of I/O quantization to the CPU EP significantly
improves inference latency for many models.
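A minimal sketch of enabling the option (an ORT build with the QNN EP is assumed; the `backend_path` value is illustrative):

```c++
// Minimal sketch: QNN EP options are passed as string key/value pairs.
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakeQnnSessionOptions() {
  Ort::SessionOptions so;
  so.AppendExecutionProvider("QNN", {{"backend_path", "QnnHtp.dll"},
                                     {"offload_graph_io_quantization", "1"}});
  return so;
}
```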
### Description
For now, CoreML in ORT only supports running mlmodels on CPU/ALL; however,
CPU_GPU can sometimes be a lot faster.
This PR adds an option to select different hardware to boost performance.
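A hedged sketch; the flag name `COREML_FLAG_USE_CPU_AND_GPU` follows the existing `COREML_FLAG_*` convention in coreml_provider_factory.h and is an assumption here, as is the header path:

```c++
// Hedged sketch: flag name and header path are assumptions per the
// COREML_FLAG_* convention in coreml_provider_factory.h.
#include <onnxruntime_cxx_api.h>
#include "coreml_provider_factory.h"

void AppendCoreML(Ort::SessionOptions& so) {
  // Allow CPU and GPU instead of CPU-only or ALL.
  Ort::ThrowOnError(
      OrtSessionOptionsAppendExecutionProvider_CoreML(so, COREML_FLAG_USE_CPU_AND_GPU));
}
```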
### Motivation and Context
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Change the hipify step to remove the -roc option to hipify-perl. This
will prefer hipblas over rocblas. rocblas can still be called directly
such as in TunableOp.
### Motivation and Context
hip interfaces are preferred over roc for porting from cuda to hip.
Calling roc interfaces is meant for ROCm-specific enhancements or
extensions.
### Description
- Support OV2024.4
- Refactor the tensor initialization check for external weights
- Support loading an OV config
- OVEP: tensor caching fix; fix accuracy issues
- Refactor the device memory implementation to make it more generic
### Motivation and Context
The changes are required to fix accuracy issues, support loading of an OV
config, and support OV2024.4.
---------
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
### Description
Add SetEpDynamicOptions and remove workload_type from run/session
options.
### Motivation and Context
Added SetEpDynamicOptions as a dynamic way of changing EP settings, even
in the middle of a Run.
Using the workload_type run/session options to set Efficient/Default mode
for workloads does not cover all the scenarios and can lead to priority
inversions. This PR works toward a new API to support setting
Efficient/Default mode for workloads.
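A hedged sketch of calling the new entry point through the C API; the key/value strings here are illustrative, not confirmed option names:

```c++
// Hedged sketch: the option key/value below are illustrative assumptions.
#include <onnxruntime_cxx_api.h>

void SetEfficientMode(Ort::Session& session) {
  const char* keys[] = {"ep.dynamic.workload_type"};  // illustrative key
  const char* values[] = {"Efficient"};
  Ort::ThrowOnError(Ort::GetApi().SetEpDynamicOptions(session, keys, values, 1));
}
```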
---------
Co-authored-by: Luis E. Pena <luispena@microsoft.com>
This reverts commit 4e15b229a0.
Reason: We are seeing an increase in the number of deadlocks after this
PR. We have a release coming up next week and do not have enough time to
investigate the root cause, hence reverting this PR temporarily.
Moreover, this is causing an increase in the binary size.
### Description
We are seeing an [increase in the number of
deadlocks](https://github.com/microsoft/onnxruntime/pull/22315#issuecomment-2394821893)
after this PR. We have a release coming up next week and do not have
enough time to investigate the root cause, hence reverting this PR
temporarily.
### Motivation and Context
See above.
### Description
This change introduces the WebGPU EP into ONNX Runtime.
To make the PR as simple as possible, this PR excluded the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update
This PR now contains only 43 file changes (while the working branch
contains 130+), which hopefully makes it easier to review.
There will be separate PRs for each item mentioned above.
Current working branch: #21904
The purpose of the patch is primarily to save power, but it also has
nice perf benefits (mostly from allowing the system to better distribute
power to cores doing meaningful work).
Changes are twofold:
1) Decrease the WorkerLoop spin count dramatically, ~10^6 -> ~10^4. The
   reality is that after ~10^4 spins, if there hasn't been any new work
   added, it's unlikely any new work is imminent, so sleep to
   preserve power. This aligns more closely with upstream EigenV3.
2) Use exponential backoff for waiting on memory. This saves a bit
   more power, and importantly increases the time between iterations
   in WorkerLoop to help accommodate the dramatically lowered spin
   counts.
Since the tuning for both the iteration counts / backoff counts are
dramatically different for hybrid/non-hybrid systems, this patch
templates the affected functions and dynamically chooses based on
`CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of
getting the change in, although it's likely we could incur less dynamic
overhead if we added the template argument to the entirety of
`ThreadPoolTempl`.
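An illustrative sketch of the two ideas (not the actual Eigen/ORT worker loop); `max_spins` and the 1024-microsecond cap are placeholder values:

```c++
// Bounded spin followed by exponential backoff; values are placeholders.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

void WaitForWork(std::atomic<bool>& has_work, int max_spins /* ~10^4 */) {
  for (int i = 0; i < max_spins; ++i) {  // bounded spin instead of ~10^6 iterations
    if (has_work.load(std::memory_order_relaxed)) return;
  }
  // No work arrived during the spin phase; back off exponentially to save power.
  auto delay = std::chrono::microseconds(1);
  while (!has_work.load(std::memory_order_relaxed)) {
    std::this_thread::sleep_for(delay);
    delay = std::min(delay * 2, std::chrono::microseconds(1024));
  }
}
```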
Measured performance on an [Intel Meteor Lake
CPU](https://www.intel.com/content/www/us/en/products/sku/237329/intel-core-ultra-7-processor-165u-12m-cache-up-to-4-90-ghz/specifications.html)
across a range of models.
Below are the results of 3 runs, with each metric being the
value-before-patch / value-after-patch (so for something like inference
time, lower is better).
| Metric | Ratio (before / after) |
|---|---|
| Session creation time cost | 0.7179 |
| First inference time cost | 0.7156 |
| Total inference time cost | 1.0146 |
| Total inference requests | 0.8874 |
| Average inference time cost | 0.8800 |
| Total inference run time | 1.0146 |
| Number of inferences per second | 0.8955 |
| Avg CPU usage | 0.9462 |
| Peak working set size | 0.9922 |
| Runs | 1.1552 |
| Min Latency | 0.7283 |
| Max Latency | 0.9258 |
| P50 Latency | 0.9534 |
| P90 Latency | 0.9639 |
| P95 Latency | 0.9659 |
| P99 Latency | 0.9640 |
So the net result is a 1.16x improvement in throughput and between
1.08-1.37x improvement in latency.
### Description
Enables using the MLTensor to pass data between models.
### Motivation and Context
Using MLTensor instead of ArrayBuffers reduces the number of copies
between the CPU and devices as well as the renderer and GPU process in
Chromium.
### Description
* Add std::numeric_limits for MLFloat16 and BFloat16.
* Update some comments in csharp ORTFloat16.shared.cs.
* Add unit tests (including Clip)
Note that the canonical NaN is not consistent between C++ and C#. C# uses
negative quiet NaN as the canonical NaN, while C++ uses positive quiet NaN.
The choice of C# Float16.NaN is to be consistent with System.Half.NaN.
FP16 data returned from CUDA might have 0x7FFF as NaN; FP16 data from the
CPU provider might have 0x7E00 as NaN. So there is no consistent canonical
NaN in ORT right now, but because all these NaNs conform to the IEEE spec,
this should not be an issue downstream.
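A hedged sketch of what the new specializations enable for generic code (MLFloat16/BFloat16 live in core/framework/float16.h in the ORT tree):

```c++
// Generic clamp that now also works for MLFloat16/BFloat16 thanks to the
// std::numeric_limits specializations added in this PR.
#include <limits>
#include "core/framework/float16.h"

template <typename T>
T ClipValue(T v, T lo, T hi) {
  if (v != v) return std::numeric_limits<T>::quiet_NaN();  // NaN != NaN per IEEE 754
  return v < lo ? lo : (v > hi ? hi : v);
}
```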
### Motivation and Context
std::numeric_limits is used in the codebase but not defined for MLFloat16
and BFloat16. This causes bugs like
https://github.com/microsoft/onnxruntime/issues/21957 introduced by
https://github.com/microsoft/onnxruntime/pull/21493.
### Description
This PR makes the following updates to the Arm Compute Library execution
provider:
- Target Arm Compute Library 24.07
- Add support for the following operators:
- Conv (FP16)
- NhwcConv
- QLinearConv
- MatMul
- FusedMatMul
- MatMulIntegerToFloat
- Optimize memory usage and performance
- Expose the enable_fast_math setting
- Use the main runtime thread pool
### Motivation and Context
These updates improve performance and memory usage, and enable use of a
more recent version of Arm Compute Library.
---------
Signed-off-by: Michael Tyler <michael.tyler@arm.com>
Error codes are added to catch compilation errors and signal recompilation.
Remote tensors are added to ensure direct memory access for NPU
inferencing.
The UMD bypass cache enabled with 2024.4 eliminates the need for disk caching.
### Motivation and Context
The changes are needed to ensure backward compatibility.
UMD bypass caching eliminates driver caching.
Remote tensors lead to a performance improvement when inferencing on NPU.
---------
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
For the Float8 types with unsigned zero, we must clear the sign bit when
rounding to zero; otherwise we end up with 0x80, which is the encoding for
NaN.
### Description
Handle all zero and near-zero values the same way, rounding to positive
zero.
Note that I removed one "if" level but did not re-indent the code in
this PR, to make it
easier to see what the actual changes are.
### Motivation and Context
For the two new 8-bit floating point types Float8E4M3FNUZ and
Float8E5M2FNUZ, converting from a near-zero negative value would end up
with only the sign bit set; this bit pattern is not negative zero but
instead means NaN.
### Description
Remove unused and confusing special constants in MLFloat16 and BFloat16
types.
### Motivation and Context
While looking at adding a specialization for std::numeric_limits for the
16-bit floating point types, I found that there are various special
constants in those types that are confusing or just wrong.
MLFloat16::Epsilon is not an epsilon at all, but approximates "e". Looks
like a copy-paste bug.
BFloat16::Epsilon does not correspond to `numeric_limits::epsilon()`,
nor even to the C# Float.Epsilon.
Instead, it corresponds to `numeric_limits::min()` which was really
confusing to me.
The "MinValue" constants does correspond to the C# `Float.MinValue`
constant, but this is C++ so it would be better renamed to "LowestValue"
since it corresponds to `numeric_limits::lowest()`. As it was unused
except for some unit tests I have replaced it with the equivalent
`MaxValue.Negate()` here.
There's also an unused `kSignaling_NaNBits` constant which is just wrong
(has the same value as `kPositiveInfinityBits` instead of a NaN).
Call the split Read Model + Compile Model APIs in lieu of the unified
Compile Model call in the export compile flow to ensure memory
optimization. Free up the model proto and serialized string, and read the
model OV IR later, to free up memory for the pipeline ahead.
Optimization during the EPContext flow:
All the Graph-related operations require all the Node attributes to be
set when dealing with model instances internally. In the existing
implementation, these attributes are copied when constructing a Graph
dynamically at runtime.
We propose to use these attributes in place, without creating a copy, to
avoid memory allocation/copying when calling these Graph-related
functions.
Changes to ensure the bug fixes related to the OpenVINO version and the
EPContext file path.
Move the compiler version to C++20 to get the benefit of r-value memory
optimizations.
### Motivation and Context
This change is required because memory consumption during the compilation
flow is too high.
---------
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
### Description
Revert forceinline for MakeString.
This change reverts https://github.com/microsoft/onnxruntime/pull/21893.
The forceinline was introduced for performance considerations; however,
it turns out to cause a notable binary size increase, which is a
concern for binary-size-sensitive platforms like Android.
I ran a few tests locally and found the size increase is not related to
whether the template struct `if_char_array_make_ptr_t` trick is used, so
I have to revert this.
### Description
- make `MakeString` force inline
- refactor ORT_FORCEINLINE macro - move to one place to avoid macro
redefinition error
- ~~add a `StringJoin` utility~~
### Motivation and Context
### Description
This PR adds the session and run option workload_type. This option is the
knob for applications to enable/disable the processor's
performance-efficient mode.
### Motivation and Context
The efficient mode is co-engineered with processor vendors to allow
applications to voluntarily be serviced at a more energy-efficient
performance level. This functionality can be used by long-running,
latency-insensitive applications to reduce energy consumption.
### Description
This PR introduces support for custom external data loaders. An EP can
register a custom external data loader to override the default behavior,
making it possible to upload initializers directly to GPU.
### Motivation and Context
- In ONNX Runtime Web, WebAssembly uses 32-bit as pointer type
(`sizeof(size_t)==4`), which means there is a 4GB hard limit on the
maximum memory. As the ONNX models get larger, this becomes a blocker
for supporting medium-sized language models.
- ORT runs out of memory because the current code always loads data into
CPU memory, including the .onnx file (protobuf) and external data
file(s). However, when using a GPU EP, the big data does not need to be
kept on CPU, because the only thing ORT does is load the data into
memory, upload it to the GPU, and then release it.
- Some platforms offer developers a way to upload data directly to the
GPU. For example, WebGPU allows uploading from any ArrayBuffer (it can
be a side buffer that does not count toward the 4GB) directly to the GPU.
This helps reduce CPU memory usage significantly.
### Design
Class `ExternalDataLoader` and `ExternalDataLoaderManager` are
introduced. They are similar to `DataTransfer` and
`DataTransferManager`. `InferenceSession` owns the manager object, and
`SessionState` keeps a reference to it.
Added a new method `GetExternalDataLoader` in `IExecutionProvider`. An
EP can override the method to register an instance of custom external
data loader.
The key function in an `ExternalDataLoader` class is the `LoadTensor` method:
```c++
// the tensor is pre-created using the TensorProto info of the initializer
// and the MemoryInfo (from allocation plan).
virtual common::Status LoadTensor(const Env& env,
                                  const std::filesystem::path& data_file_path,
                                  FileOffsetType data_offset,
                                  SafeInt<size_t> data_length,
                                  Tensor& tensor) const;
```
This function can be registered by an EP; calls go through a few layers
and eventually reach `DeserializeTensorProto()` in the finalizing stage
of session initialization, where initializer tensors are created. The
behavior is changed to first look for a registered external data loader
that can handle the current memory info. If an instance is available, use
the loader; otherwise respect the old code path.
### Description
Address issue #21524.
Enable offset alignment for models saved in the external data format.
Python data converter fix here: https://github.com/onnx/onnx/pull/6248
### Motivation and Context
### Description
Set the exhaustive-tune flag through the MIGraphX API and make this a
session option in ONNX Runtime.
### Motivation and Context
Allow users to use MIGraphX exhaustive tuning with ONNX Runtime
inference.
This goes hand in hand with save/load after a model has been compiled
and tuning results have been found.
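A hedged sketch; the field name `migraphx_exhaustive_tune` is an assumption based on this description and the existing `OrtMIGraphXProviderOptions` naming convention:

```c++
// Hedged sketch: the exhaustive-tune field name is an assumption.
#include <onnxruntime_cxx_api.h>

void AppendMIGraphX(Ort::SessionOptions& so) {
  OrtMIGraphXProviderOptions options{};
  options.device_id = 0;
  options.migraphx_exhaustive_tune = 1;  // assumed field: enable exhaustive tuning
  so.AppendExecutionProvider_MIGraphX(options);
}
```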
---------
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description
Bug fix for the ShapeInferContext GetAttrxxxs APIs. A node attribute may
be empty.
### Motivation and Context
If the attr value is empty, the expected result through the interface is
empty, but currently it returns a meaningless {0}.
---------
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: Liu Minyue <mingyue@xilinx.com>
We saw some models fail to run due to OOM, which could be fixed by
increasing trt_max_workspace_size.
This PR removes the size limitation by default (up to max device memory),
which is aligned with trtexec.
### Description
Added CUDNN Frontend and used it for NHWC convolutions, and optionally
fuse activation.
#### Backward compatible
- Models that already contain FusedConv can still run.
- If ORT is built with cuDNN 8, the cuDNN frontend will not be built into
the binary. The old kernels (using cuDNN backend APIs) are used.
#### Major Changes
- For cuDNN 9, we will enable the cuDNN frontend to fuse convolution and
bias when the provider option `fuse_conv_bias=1` is set.
- Remove the FusedConv fusion from the graph transformer for the CUDA
provider, so FusedConv will not be added to the graph for the CUDA EP
in the future.
- Update cmake files regarding cuDNN settings. The search order for the
cuDNN installation at build time is as follows:
* environment variable `CUDNN_PATH`
* `onnxruntime_CUDNN_HOME` cmake extra define. If a build starts from
build.py/build.sh, users can pass it through the `--cudnn_home` parameter,
or via the environment variable `CUDNN_HOME` if `--cudnn_home` is not used.
* cudnn python package installation directory like
python3.xx/site-packages/nvidia/cudnn
* CUDA installation path
#### Potential Issues
- If ORT is built with cuDNN 8, FusedConv fusion is no longer done
automatically, so some models might have a performance regression. Users
who still want the FusedConv operator for performance reasons have
multiple workarounds: use an older version of onnxruntime, or use an
older version of ORT to save the optimized onnx model and then run it
with the latest version of ORT. We believe the majority of users will
have moved to cuDNN 9 by the 1.20 release (the default in ORT and PyTorch
will have been cuDNN 9 for 3 months by then), so the impact is small.
- The cuDNN graph uses TF32 by default, and users cannot disable TF32
through the use_tf32 CUDA provider option. If a user encounters an
accuracy issue (like in testing), they have to set the environment
variable `NVIDIA_TF32_OVERRIDE=0` to disable TF32. The documentation of
use_tf32 needs updating later.
#### Follow ups
This is one of the PRs that target enabling NHWC convolution in the CUDA
EP by default when the device supports it. Other changes will follow to
make it possible.
(1) Enable `prefer_nhwc` by default for device with sm >= 70.
(2) Change `fuse_conv_bias=1` by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).
### Motivation and Context
The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100 -d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
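A minimal sketch of opting in via the CUDA provider options V2 C API (cuDNN 9 build assumed; key names per this description):

```c++
// Minimal sketch: enable the new fusion and NHWC layout via provider options V2.
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakeCudaSessionOptions() {
  const OrtApi& api = Ort::GetApi();
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_options));

  const char* keys[] = {"fuse_conv_bias", "prefer_nhwc"};
  const char* values[] = {"1", "1"};
  Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_options, keys, values, 2));

  Ort::SessionOptions so;
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_CUDA_V2(so, cuda_options));
  api.ReleaseCUDAProviderOptions(cuda_options);
  return so;
}
```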
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
### Description
When the graph is quantized to qdq format, the DQ + MatMul is
transformed to MatMulNBits in the level 2 optimizer when the model is
initialized in an inference session.
In the transformation step, tensors are transposed and new tensor protos
are created. Instead of using protobuf-arena-allocated memory, the PR
sets the tensor proto to use an external buffer and points the external
location to the memory location containing the tensor buffer allocated
on the CPU.
Then, in the step that creates OrtValue using the tensor proto, the
memory buffers in the tensor proto are directly assigned to the tensors
which were originally allocated by Ort Arena.
With these two steps, the peak memory usage of QDQ format model is the
same as usage of QOperator model. Besides, the model initialization time
is significantly reduced. Take
[Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
for example:
| | QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original code) | QDQ Model (this PR) |
|---|---|---|---|
| peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| initialization time | 3 sec | 9 sec | 5 sec |
### Motivation and Context
When the graph is quantized to qdq format, the DQ + MatMul is converted
to MatMulNBits in the level 2 optimizer.
Originally, the newly created tensor protos used memory allocated by the
protobuf arena. This memory cannot be fully released when the tensor
protos are deleted.
Then, in the tensor-proto-to-OrtValue step, tensors are created using the
ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues are
created. The tensors in the ORT arena are not fully released either.
The two arena memory allocation steps in the DQ + MatMul -> MatMulNBits
transformation will result in almost 2x memory consumption in the model
initialization.
### Description
Functionality extension for the SetOutputShape method in custom op shape inference.
### Motivation and Context
- **SetOutputShape** interface enhancement: the shape-inference function needs to set both the tensor type and the shape. Add a parameter **type** to allow users to specify the tensor type, with **ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT** as the default value to ensure compatibility.
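A hedged sketch of the extended call inside a custom op's shape-inference function (exact wrapper signatures per the C++ API headers):

```c++
// Hedged sketch: the type argument defaults to ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT,
// so existing callers keep compiling unchanged.
#include <onnxruntime_cxx_api.h>

Ort::Status InferMyOpShape(Ort::ShapeInferContext& ctx) {
  Ort::ShapeInferContext::Shape shape = ctx.GetInputShape(0);
  // New: the output element type can be specified along with the shape.
  return ctx.SetOutputShape(0, shape, ONNX_TENSOR_ELEMENT_DATA_TYPE_INT64);
}
```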
Co-authored-by: mingyue <mingyue@amd.com>
### Description
Add QNN EP option `context_node_name_prefix` to set the EPContext node name prefix.
### Motivation and Context
To work around the QNN context PD memory limit, users need to split the model into pieces and generate the QNN context model for each piece separately. It can happen that EPContext nodes generated in separate graphs have the same node name, which causes an issue when gluing those EPContext nodes together into a single model.
To avoid this, users can set context_node_name_prefix for each split piece to make the node names unique, as sketched below.
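```c++
// Minimal sketch (QNN EP build assumed; backend_path value is illustrative):
// give each split piece its own prefix so the generated EPContext node names
// stay unique when the pieces are glued back into one model.
#include <string>
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakePieceOptions(const std::string& piece_prefix) {
  Ort::SessionOptions so;
  so.AppendExecutionProvider("QNN", {{"backend_path", "QnnHtp.dll"},
                                     {"context_node_name_prefix", piece_prefix}});
  return so;
}
```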
### Description
Add these changes to one PR to simplify checkin
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)
Other changes
- updated partitioning utils to support dropping constant initializers
from a ComputeCapability's inputs.
- noticed that the list of inputs to the CoreML model was unexpectedly
long due to this
- we copy constant initializers into the CoreML model so we don't need
the originals, and if they remain as inputs, ORT can't free them as they
appear to be in use.
### Motivation and Context