### Description
<!-- Describe your changes. -->
Add these changes in one PR to simplify check-in:
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)
Other changes
- updated partitioning utils to support dropping constant initializers from a ComputeCapability's inputs
- noticed that the list of inputs to the CoreML model was unexpectedly long due to this
- we copy constant initializers into the CoreML model, so we don't need the originals, and if they remain as inputs ORT can't free them because they appear to be in use
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This is a partial change from
[fajin/qdqmatmulnbitstoolchain](https://github.com/microsoft/onnxruntime/pull/21180).
The original PR is blocked by Web CI failures.
MatMulNBits is a heavily optimized matmul operation. Currently a MatMul
can be converted to MatMulNBits to speed up the model inference.
However, MatMulNBits is an ORT only op. To make the graph compatible
with ONNX ops and utilize MatMulNBits at the same time, we introduce
Q/DQ support for MatMulNBits.
To convert MatMul ops in a model to MatMulNBits:
1. Use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using the QDQ mode (see the sketch below).
2. In an ORT session, DQ + MatMul is fused into MatMulNBits.
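For illustration, a minimal sketch of step 1 in Python. Treat the exact class and parameter names (`MatMul4BitsQuantizer`, `quant_format=QuantFormat.QDQ`, `quant.model.model`) as assumptions based on the current tooling, not a spec:
```python
# Sketch only: API names/parameters are assumptions based on matmul_4bits_quantizer.py.
import onnx
from onnxruntime.quantization import QuantFormat
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("model_fp32.onnx")
quant = MatMul4BitsQuantizer(
    model,
    block_size=32,                  # quantization block size along K
    is_symmetric=True,              # symmetric weight quantization
    quant_format=QuantFormat.QDQ,   # emit DQ + MatMul instead of MatMulNBits
)
quant.process()
onnx.save_model(quant.model.model, "model_int4_qdq.onnx")
```
At session load time, the resulting DQ + MatMul pairs are then fused back into MatMulNBits, as described in step 2.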
#### Note
MatMulNBits assumes the B weight is uint4. When no zero point (zp) is provided, zp defaults to 8, which is different from DQ: DQ defaults zp to 0 when none is provided, and DQ also supports int4. Therefore some conversions are introduced during the DQ + MatMul --> MatMulNBits step.
#### Perf
Using the QDQ format increases model initialization time and memory consumption. With the current implementation, model init time increased from ~4s to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB.
The memory increase is due to:
1. In the optimizer, after transposing the B weight, an in-memory tensor proto is created using protobuf's arena.
2. In the finalize step, when saving initializers and prepacking, the ORT arena is used to create buffers for initializers.
The memory allocated by arenas cannot be fully deallocated.
If ORT arena memory allocation is disabled, the memory consumption of both the QDQ format and the original format is ~2.2GB.
The time increase is mainly due to multiple memory copies, but it can be further optimized.
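For reference, one way to disable the ORT CPU arena from Python when measuring memory (a sketch, not part of this change):
```python
import onnxruntime as ort

sess_opts = ort.SessionOptions()
# Allocate with the default allocator instead of the arena, so memory can be fully released.
sess_opts.enable_cpu_mem_arena = False
sess = ort.InferenceSession("model_int4_qdq.onnx", sess_opts)
```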
### Motivation and Context
Please see description for details.
### Description
* Add a CUDA provider option `sdpa_kernel` to choose which attention kernel to run, for testing purposes.
* Allow dumping which attention kernel is used per node.
* Reserve a flag for cuDNN flash attention, which will be added soon.
#### CUDA provider option sdpa_kernel
In addition to the environment variable, we also support setting it via a provider option. Note that the setting is global per session. This can help with performance testing of each kernel.
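For example, the option could be set from Python roughly as follows (a sketch; the integer value selects a kernel, and the exact value-to-kernel mapping is an assumption here):
```python
import onnxruntime as ort

providers = [
    # sdpa_kernel selects the attention kernel for this session (testing only).
    ("CUDAExecutionProvider", {"sdpa_kernel": "2"}),
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)
```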
#### Attention Kernel Debug Info
Set the environment variable `ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1`,
and ORT will print the SDPA kernel used for each node.
For example
```
ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1 ./onnxruntime_test_all --gtest_filter=MultiHeadAttentionTest*
```
It will show debug information about the kernel used in testing:
```
[ RUN ] MultiHeadAttentionTest.SelfAttention_Batch2_HeadSize32_NoBias_NoMask_PackedQKV
AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=0 TRT_FUSED_ATTENTION=1 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=1 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1
Operator=MultiHeadAttention Node=node1 DataType=fp16 TRT_FUSED_ATTENTION=1
AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=1 TRT_FUSED_ATTENTION=0 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=0 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1
Operator=MultiHeadAttention Node=node1 DataType=fp16 EFFICIENT_ATTENTION=1
```
In this test case, the debug info shows that one session uses TRT fused
attention and another session uses efficient attention.
# Why so many commits
- Runtime debugging - which is necessary
- Three different approaches to EP context model - as a result testing back and forth
- Windows compatibility issues - this development has been done on Linux for convenience
# "Open" (?) questions
- Full offloading to a specific EP
- Dumping EP context models by EPs vs [by
ONNXRT](e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L725))
- [Node name to pick
nodes](e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L654))
# VitisAI EP made three variant implementations that have respective pros and cons (and of course we can combine them)
## Serialize and cache the list of compute capabilities and the original ONNX model itself
## In `ComputeCapability()`, serialize and cache the backend compilation cache and the related necessary cache info such as cache dir and cache key
## In `Compile()`, serialize and cache the backend compilation cache and the related necessary cache info such as cache dir and cache key
# EP context model creation
- Precondition
Session option configuration `kOrtSessionOptionEpContextEnable` (aka "ep.context_enable") is enabled.
- Approach 1
- Steps
1. EP creates an ONNX model whose main graph has EP context nodes (i.e., node type is "EPContext").
2. EP implements/overrides `IExecutionProvider::GetEpContextNodes()` method.
3. ONNXRT core creates an EP context model and saves/dumps it.
- `CreateEpContextModel()` in the file "graph_partitioner.cc"
- In `get_ep_context_node()`, `Node::Name()` is used to check whether a node is an EP context node. This limits EP context model creation to `IExecutionProvider::Compile()`.
- The workaround is (1) not implementing `IExecutionProvider::GetEpContextNodes()` and (2) having the EP dump the EP context model itself.
4. Optionally, the EP can also dump the EP context model it created by itself.
- Examples
- `QNNExecutionProvider`
- `VitisAIExecutionProvider`
- Approach 2
- Steps
1. EP creates an ONNX model whose main graph has EP context nodes (i.e., node type is "EPContext").
2. EP does NOT implement `IExecutionProvider::GetEpContextNodes()` at all.
3. EP dumps the EP context model it created.
- Examples
- `TensorrtExecutionProvider`
- UPDATES
- TRT EP is switching to leveraging
`IExecutionProvider::GetEpContextNodes()`
- `OpenVINOExecutionProvider` (?)
# What to cache in EP context nodes
- Non Compilation based EPs
- Examples
- `VitisAIExecutionProvider`
- Characteristics
- Heavy lifting work happens in `IExecutionProvider::GetCapability()`.
- Preconditions
- `IExecutionProvider::GetCapability()` is only called once by ONNXRT.
- Cache content
- Serialization of a list of `ComputeCapability`
- Not EP-specific
- Serialized using `onnx::FunctionProto`
- EP-specific cache
- Compilation based EPs
- Examples
- `QNNExecutionProvider`
- `TensorrtExecutionProvider`
- `MIGraphXExecutionProvider`
- `OpenVINOExecutionProvider`
- Cache content
- EP-specific cache
# Requirements
- Offline / AOT compilation of ONNX models with EP context cache
- Compile somewhere, run everywhere
- Pseudo code with brief explanation
```
GenerateCache(original_onnx_file, cache_onnx_file)
    model_buffer = load(original_onnx_file)            --> Load the original ONNX model file
    model_buffer = decrypt(model_buffer)
    session_options = { kOrtSessionOptionEpContextEnable: true,
                        kOrtSessionOptionEpContextFilePath: temp_file }  --> Set necessary configs
    Ort::CreateSessionFromArray(model_buffer, session_options)  --> The new ONNX model with EP context is created and dumped into the user-specified file "temp_file"
    temp_buffer = encrypt(temp_file)
    write(temp_buffer, cache_onnx_file)                --> Write the encrypted content of "temp_file" into the "cache_onnx_file" file

InitializeInferenceSession(cache_onnx_file)
    model_buffer = load(cache_onnx_file)               --> Load the ONNX model with EP context from the file generated in the previous step
    model_buffer = decrypt(model_buffer)
    session_options = { }
    Ort::CreateSessionFromArray(model_buffer, session_options)  --> Create and initialize a session with the EP context model
```
- Python code with comments
- EP context model creation
```python
import onnxruntime as onnxrt
# Session options for creating an ONNX model with EP context cache.
sess_opts = onnxrt.SessionOptions()
# Verbose.
sess_opts.log_severity_level = 0
# This is REQUIRED.
sess_opts.add_session_config_entry("ep.context_enable", "1")
# This is OPTIONAL.
# Either an absolute path (preferred for now) or a relative path (WIP) is okay.
# sess_opts.add_session_config_entry("ep.context_file_path", "/some/path/to/original_model_ctx.onnx")
# This is OPTIONAL.
sess_opts.add_session_config_entry("ep.context_embed_mode", "1")
orig_model_location = "/some/path/to/original_model.onnx"
sess = onnxrt.InferenceSession(orig_model_location, sess_opts,
providers=["VitisAIExecutionProvider"], provider_options=[])
```
- Inference run with an EP context model
```python
import onnxruntime as onnxrt
# Session options for running inference with an EP context model.
sess_opts = onnxrt.SessionOptions()
# Default EP context model path.
# ep_ctx_model_location = "/some/path/to/original_model.onnx_ctx.onnx"
# User-configured EP context model path.
ep_ctx_model_location = "/some/path/to/original_model_ctx.onnx"
sess = onnxrt.InferenceSession(ep_ctx_model_location, sess_opts,
providers=["VitisAIExecutionProvider"], provider_options=[])
model_inputs = {}
run_opts = onnxrt.RunOptions()
# Verbose.
run_opts.log_severity_level = 1
sess.run(None, model_inputs, run_opts)
```
---------
Co-authored-by: Glen Cao <glen@Glens-MacBook-Air.local>
### Description
There are many typos reported by the reviewdog [Optional Lint] action
(example:
https://github.com/microsoft/onnxruntime/actions/runs/9864564489/job/27239732367);
this PR fixes some of them.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
It might be easier if we just directly include the original gsl headers.
"core/common/gsl.h" is an indirection that doesn't provide extra help.
### Description
Delete path.h and replace all occurrences of onnxruntime::Path with
std::filesystem::path.
Previously we couldn't use C++17's std::filesystem because it was not
supported in iOS 12 (which was released in 2018). Now we have dropped
support for iOS 12.
### Motivation and Context
To simplify code. For example, if an EP wants to use the Path class, now
it can use it directly without going through a wrapper. And the standard
implementation handles various path types better. (We didn't give much
consideration to UNC paths, "/" as a path separator on Windows, etc.)
### Description
<!-- Describe your changes. -->
- It is an initial PR for the VSINPU execution provider
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- To support VeriSilicon hardware
- TIM-VX (Tensor Interface Module)
(https://github.com/VeriSilicon/TIM-VX) is an integrated software
solution by VeriSilicon for our hardware designs (A311D/i.MX 8M Plus,
etc.). This VSINPU execution provider makes it easy to use VeriSilicon's
hardware by simply connecting onnxruntime to the TIM-VX API.
### Description
1. Update the functions in tensorprotoutils.h to use
std::filesystem::path instead of onnxruntime::Path. Eventually we can
remove the whole onnxruntime::Path class, but to keep this PR small I am
not doing that here.
2. Remove the _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING macro
def when TensorRT EP is enabled.
### Description
Adds the ability for MIGraphX EP to save off or load compiled models to
save time between inferences.
Via command line, the user should be able to set the save ability with:
- ORT_MIGRAPHX_SAVE_COMPILED_MODEL
- ORT_MIGRAPHX_SAVE_COMPILE_PATH

The user should be able to set the load ability with:
- ORT_MIGRAPHX_LOAD_COMPILED_MODEL
- ORT_MIGRAPHX_LOAD_COMPILE_PATH

Via the Onnxruntime API:
- migx_save_compiled_model
- migx_save_model_name
- migx_load_compiled_model
- migx_load_model_name
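For example, these could be passed as provider options from Python (a sketch; the value formats, such as "1" for booleans and the file name, are assumptions):
```python
import onnxruntime as ort

providers = [
    (
        "MIGraphXExecutionProvider",
        {
            "migx_save_compiled_model": "1",         # save the compiled model after the first compile
            "migx_save_model_name": "resnet50.mxr",  # where to save it
            # "migx_load_compiled_model": "1",       # or load a previously compiled model instead
            # "migx_load_model_name": "resnet50.mxr",
        },
    )
]
sess = ort.InferenceSession("resnet50.onnx", providers=providers)
```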
### Motivation and Context
The motivation for this is to leverage MIGraphX's existing API to
save/load models after our compile step of graph optimization. For
larger models, or models compiled with additional tuning steps, this
saves time after the first compile and inference run, and thus speeds up
the user experience to encourage development.
---------
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Release backward inputs per static graph ref count
For the output buffer marked as external output:
1. Remove the additional ref count we used to avoid reusing the buffer.
Instead, when we find a reused input/output buffer, we make sure the
reused buffer is not generated by nodes that have external outputs.
2. Remove the ref count on pybind feed inputs, which previously existed
until run_backward completed. Instead, pass mutable feeds, and clear the
feeds vector once it has been copied into session state and is no longer
needed, before running the graph sequentially.
#### Before the change:
One of the backward inputs is 3.9GB, it lives until the backward ends.

#### With the change:
The 3.9GB is released when the last node depending on that tensor
completed.

Note: the peak did not change though; we have more work to do to reduce
the peak.
#### Others
It was found that a few tests had been updated to use incorrect expected
values in a previous code refactoring:
a81faee41e (diff-9e8fbae7d3dff24106cd17564949f320e943cb3048eae07813c7de144f140419L382).
This PR fixes them, and I think now all test cases are back to normal.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Windows - Fully dynamic ETW controlled logging for ORT and QNN logs
The logging support is documented here
-
https://onnxruntime.ai/docs/performance/tune-performance/logging_tracing.html#tracing---windows
-
https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html#tracelogging-etw-windows-profiling
Also add support for logging ORT SessionCreation on ETW CaptureState
### Motivation and Context
The previous ETW support only worked if you enabled ETW before the
session started. There can commonly be long-lived AI inference processes
that need to be traced & debugged. This enables logging fully on the
fly.
Without this support a dev would have to end up killing a process or
stopping a service in order to get tracing. We had to do this for a
recent issue with QNN, and it was a bit painful to get the logs and it
ruined the repro.
### Testing
I tested with the following cases
- Leaving default ORT run
- Enabling ETW prior to start and leaving running for entire session +
inferences, then stopping
- Starting ORT session + inf, then enabling and stopping ETW
- Start ORT session with long-running inferences
- wpr -start
[ort.wprp](e6228575e4/ort.wprp (L4))
-start
[etw_provider.wprp](e6228575e4/onnxruntime/test/platform/windows/logging/etw_provider.wprp)
- Wait a few seconds
- wpr -stop ort.etl
- Inferences are still running
- Verify ONNXRuntimeLogEvent provider events are present and new
SessionCreation_CaptureState event under Microsoft.ML.ONNXRuntime
provider
Related: #18882, #19428
### Description
The recent [PR for int4
support](https://github.com/microsoft/onnxruntime/pull/20362) breaks
builds with the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS option enabled.
This PR adds utility functions for debug printing of int4 tensor
statistics and data.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- 4-bit QuantizeLinear(21). **Blocked quantization is still missing (i.e., the new `block_size` attribute is not supported)**
- 4-bit DequantizeLinear(21). **Blocked dequantization is still missing (i.e., the new `block_size` attribute is not supported)**
- 4-bit Transpose(21).
- Update quantization tool with int4 types.
- Disable QDQ fusions for 4-bit types. See:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc
- MLAS 4-bit quantization kernels for intel, neon, powerpc.
##### Notes
To calculate a tensor's storage size, we normally get the number of
elements from the shape (i.e., `tensor_shape.Size()`) and multiply by
the size of a single element. This does not work directly for sub-byte
elements like int4, as each element in a `Tensor<Int4x2>` stores **two**
packed int4 elements in a byte. `Tensor::CalculateTensorStorageSize`
should be called to perform the correct calculation for any tensor
element type.
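As a rough illustration of the packed sizing rule (a Python sketch, not the actual `Tensor` implementation):
```python
def int4_storage_bytes(num_elements: int) -> int:
    # Two int4 values are packed into one byte, so round up to a whole byte.
    return (num_elements + 1) // 2

assert int4_storage_bytes(7) == 4  # 7 int4 elements occupy 4 bytes, not 7 * element_size
assert int4_storage_bytes(8) == 4
```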
### Motivation and Context
ONNX 1.16 added the int4 and uint4 types. This initial PR adds the int4
type to ORT and adds int4 implementations for the Quant, Dequant, and
Transpose ops on CPU EP. We still need to add int4 support for many ops
and execution providers. See the ONNX 1.16 release notes:
https://github.com/onnx/onnx/releases.
### Description
<!-- Describe your changes. -->
- Introduce option `trt_engine_hw_compatible` to support engine hardware compatibility for Ampere+ GPUs
- This enables the `nvinfer1::HardwareCompatibilityLevel::kAMPERE_PLUS` flag when generating engines
- This option has been validated on sm80/86 GPUs, as an engine can be reused across different Ampere+ architectures:
  - The client side needs to enable this option as well to leverage existing sm80+ engines
- If this option is enabled by users with TRT < 8.6 or sm < 80, a warning will be shown that this option is not supported
Engine naming:
| When | `trt_engine_hw_compat=false` | `trt_engine_hw_compat=true` |
| -------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| A100 (sm80) | TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80**.engine | TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80+**.engine |
| RTX3080 (sm86) | TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**86**.engine | TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_9454133937466702238_0_0_sm**80+**.engine |
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Reference:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#hardware-compat
---------
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
This PR includes the weight-stripped engine feature (thanks @moraxu for
#20214), which is the major feature of the TRT 10 integration.
Two TRT EP options are added:
- `trt_weight_stripped_engine_enable`: Enable weight-stripped engine
build and refit.
- `trt_onnx_model_folder_path`: In the quick load case using embedded
engine model / EPContext mode, the original onnx filename is in the
node's attribute, and this option specifies the directory of that onnx
file if needed.
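These two options can be set as TensorRT EP provider options, for example from Python (a sketch; the value formats are assumptions):
```python
import onnxruntime as ort

providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_weight_stripped_engine_enable": "1",          # build/refit weight-stripped engines
            "trt_onnx_model_folder_path": "/path/to/onnx_dir",  # folder of the original onnx, if needed for refit
        },
    )
]
sess = ort.InferenceSession("model.onnx", providers=providers)
```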
Normal weight-stripped engine workflow:

Weight-stripped engine and quick load workflow:

See the doc
[here](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#tensorrt-ep-caches)
for more information about the EPContext model.
---------
Co-authored-by: yf711 <yifanl@microsoft.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: wejoncy <wejoncy@163.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Yi Zhang <your@email.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: inisis <46103969+inisis@users.noreply.github.com>
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
Co-authored-by: mo-ja <60505697+mo-ja@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com>
Co-authored-by: Atanas Dimitrov <70822030+neNasko1@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Dhruv Matani <dhruvbird@gmail.com>
Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Co-authored-by: Xu Xing <xing.xu@intel.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Andrew Fantino <15876180+afantino951@users.noreply.github.com>
Co-authored-by: Thomas Boby <thomas@boby.uk>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Michal Guzek <mguzek@nvidia.com>
Co-authored-by: George Wu <jywu@microsoft.com>
### Flash attn recompute
1. Allow PythonOp(FlashAttn) to be recomputed correctly.
45879ff5c2
2. Use JSON to pass the selected-to-recompute subgraphs.
3c374da678
#### Better Memory Efficiency
The customer model can run with both PyTorch SDPA and Flash Attn; this
PR makes it possible for the Flash Attn path to work with ORTModule
layerwise recompute. The peak drops from 45.xGB to 32.xGB if we only
compare the layers (not including other pieces; BTW there are a few more
optimizations targeting other pieces coming later).
#### Better Perf
Using Flash Attn brings an additional 16% end-to-end time reduction,
with a highly aligned loss curve.

#### Use JSON File to pass Recompute Plans
To overcome the limitation on the maximum length of strings defined in
session options.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
I misunderstood how UpdateCUDAProviderOptions and
UpdateTensorRTProviderOptions work in the C API: I had assumed that they
updated the options struct, but they actually re-initialize the struct
to the defaults and then apply only the values in the update. I've
rewritten the Java bindings for those classes so that they aggregate all
the updates and apply them in one go. I also updated the C API
documentation to note this behaviour. I haven't checked whether any of
the other providers with an options struct behave this way; we only
expose CUDA and TensorRT's options in Java.
There's a small unrelated update to add a private constructor to the
Fp16Conversions classes to remove a documentation warning (they
shouldn't be instantiated anyway as they are utility classes containing
static methods).
### Motivation and Context
Fixes #20544.
### Description
Remove an excess trailing semicolon from a specific macro.
### Motivation and Context
I am preparing automatic generation of onnxruntime bindings for Perl,
and the parser (ucpp) breaks due to the "double semicolon" error on the
lines where the macro is applied.
Fix:
onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637: error: argument 'session' of command @param is not found in the argument list of
```
OrtApi::AddExternalInitializersFromFilesInMemory(
OrtSessionOptions *options,
const char *const *external_initializer_file_names,
char *const *external_initializer_file_buffer_array,
const size_t *external_initializer_file_lengths,
size_t num_external_initializer_files)
```
Bump up version in main from 1.18.0 to 1.19.0 since the release branch
has been cut.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Introduce memory efficient topo sort (for training)
~~and lazily initialize the Priority-Based and Memory-Efficient topo
sorts. Because in most cases they are not needed, we avoid the overhead
of GraphViewer construction for most use cases.~~
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Background:
Users save large models with initializer data in an external file, e.g.:
onnx.save_model(onnx_model, "path/to/save/the/model.onnx", save_as_external_data=True, all_tensors_to_one_file=True,
location="filename", size_threshold=1024).
In that case, Ort loads the model, gets the external initializer information (external file name, offset, length), uses the model path to find the external file, and locates the tensor data via the offset and length.
But this doesn't work if the user loads the model from memory, since Ort loses track of the model path.
This PR adds an API/session option to let the user provide a table with the external initializer file name as the key, and the pointer to the loaded external file in memory plus the buffer length as the value. So that:
1. The user can load the model from a memory buffer with the external initializers in memory buffers too.
2. The initializers can be shared across sessions, for different EPs.
3. The user can load the file in any way they want, e.g. mmap.
Internally:
1. At session creation time, Ort goes through the external initializers in the graph and gets the file name, offset, and data length of each external initializer from its TensorProto.
2. With the file name, Ort gets the in-memory file buffer and buffer length from the table the user provided.
3. Ort locates the tensor data within the user-provided in-memory file buffer using the offset and data length (from the TensorProto).
4. Ort creates the Tensor and replaces the existing Tensor in the graph.
### Motivation and Context
https://github.com/onnx/onnx/blob/main/docs/ExternalData.md
For a model with external data, the TensorProto may have initializer data in a separate file. The external file location is set via a file path relative to the model path. With the API that loads a model from a memory buffer, the model path is lost, which causes an error if the model has external data. By adding a session option to set the external data buffers, Ort can find the external data correctly when the model is loaded from a memory buffer.
Enable a provider option to let the user provide the profiling file path.
Separate out the profiling level for ETW, in case of switches such as ETW being enabled when Ort creates the QNN profiling and then disabled when Ort logs the profiling events, or vice versa. Enhance the logic that decides the profiling level.
### Description
<!-- Describe your changes. -->
The first call to Graph::Resolve occurs when creating the Graph instance
when loading an existing model from ModelProto. As the Node instance
will exactly match the source NodeProto there's no need to call
Node::ToProto in this case.
Add a temporary reference to the original NodeProto to avoid the call on
the first Graph::Resolve.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Better alternative to #19469
### Description
For C++ standards >= 20, use `std::chrono::operator<<` in place of
`date::operator<<` to fix ambiguous operator compile error.
### Motivation and Context
The external dependency HowardHinnant/date has a conflict with
std::chrono for >=C++20.
Solves #20137
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Certain transformers slow down session loading time while providing no
runtime perf benefits.
Allow clients to exclude them.
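A sketch of how a client might exclude specific optimizers via a session config entry; the config key name and the transformer names below are assumptions for illustration, not confirmed by this description:
```python
import onnxruntime as ort

sess_opts = ort.SessionOptions()
# Hypothetical config key: skip the listed graph transformers during session load.
sess_opts.add_session_config_entry(
    "optimization.disable_specified_optimizers",
    "NchwcTransformer;NhwcTransformer",
)
sess = ort.InferenceSession("model.onnx", sess_opts)
```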
### Description
The dml_provider_factory header file can't be used in C programs as it
defines C++ inline operators. This PR rearranges that header file so
that it looks like valid C when used from C, and also makes a couple of
small modifications to the Java code so it correctly binds to the DML EP
at build time.
I'm having some difficulty testing it as I think it's pulling in the old
version of DirectML on my computer and I can't figure out what the
library loading path is in Java to make it look at the recent version I
downloaded. So the test I added fails with:
```
InferenceTest > testDirectML() FAILED
ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Exception during initialization: <path-to-ort>\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(518)\onnxruntime.dll!00007FFF74819333: (caller: 00007FFF74793509) Exception(3) tid(4f58) 80070057 The parameter is incorrect.
at app//ai.onnxruntime.OrtSession.createSession(Native Method)
at app//ai.onnxruntime.OrtSession.<init>(OrtSession.java:74)
at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:236)
at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:221)
at app//ai.onnxruntime.InferenceTest.openSessionSqueezeNet(InferenceTest.java:1961)
at app//ai.onnxruntime.InferenceTest.runProvider(InferenceTest.java:665)
at app//ai.onnxruntime.InferenceTest.testDirectML(InferenceTest.java:657)
```
But it does correctly compile, and this error seems very similar to
other issues with the DML provider when it doesn't like a model due to
the loaded library being old. The test is using the squeezenet file
that's been in the repo since 2019. If someone can help me figure out
how to get the right version of DML in the library path I can test it
more on my end. I tried adding the folder with the new version into the
system path, but I'm not very familiar with Windows' library loading
behaviour.
### Motivation and Context
Fixes#19656 to allow use of the DirectML EP from ORT Java.
cc @martinb35
### Description
A new flag `dump_om_model` for the **CANN EP**, which defaults to "True".
### Motivation and Context
When building an onnx model with the CANN EP, the intermediate **OM
(offline model for Ascend NPU)** is automatically saved. Some users
don't want to dump the OM when resources are limited.
This PR resolves this situation with `dump_om_model=False`.
### Description
<!-- Describe your changes. -->
Add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp
### Description
<!-- Describe your changes. -->
use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo to
hook the inplace map of custom ops
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is to use OrtCustomOp's new API GetMayInplace in
CreateKernelCreateInfo to hook the inplace map of custom ops
### Description
Expose Reserve() in OrtAllocator to allow custom allocators to work when
session.use_device_allocator_for_initializers is specified.
Update: this change has been verified by Bing Ads and brings a
significant benefit in terms of memory utilization: 30GB less memory and
also better CPU utilization.
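For reference, the session config that triggers this path from Python (a sketch; the value "1" is the usual enable convention and assumed here):
```python
import onnxruntime as ort

sess_opts = ort.SessionOptions()
# Use the (possibly custom) device allocator for initializers;
# with this change, the allocator's Reserve() serves these one-off allocations.
sess_opts.add_session_config_entry("session.use_device_allocator_for_initializers", "1")
sess = ort.InferenceSession("model.onnx", sess_opts, providers=["CUDAExecutionProvider"])
```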
### Motivation and Context
https://microsoft-my.sharepoint.com/:w:/p/prs/Eeidf5YNtWtKrPVkfuTDsuABak1oL4QRpuBGuhqRbLKoJg?e=Zl3bah
### Description
Address build issues and source code discrepancies.
Fix cuda_test_provider gtest argument stack corruption.
### Motivation and Context
The `OpTester` class that is widely used for kernel testing is not
suitable for testing internal classes of EPs that are built as shared
objects.
Currently, CUDA EP tests run only on Linux.
We want to enable testing and development on Windows,
and create a usable pattern for testing other EPs' internals.
Alternatives considered:
Abstracting EP unit tests into a separate test executable such as
`onnxruntime_test_all`.
This alternative was rejected as it would create a lot more changes to
the established patterns, and could potentially interfere with CUDA
functionality while making source code maintenance more complex.
### Description
<!-- Describe your changes. -->
This change addresses the following issues with the current CustomOP
Output Type inference
- The function does not take optional inputs into account. When an input
is absent, the inference is silently aborted and no output type is
inferred (P1 customer issue).
- Inferring the output type from the input type for multi-kernel custom
ops is done based on the last kernel definition in the sequence. No
attempt is made to match the kernel based on the input type.
- Inference is aborted when variadic inputs/outputs are detected when
the generated input/output names fail to obtain type constraints. This
is not immediately clear from the code, because custom op schema is not
available within the inference function.
- No error reporting.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Most custom ops lack their own type and shape inference function, as it
was only recently introduced. For that reason, it is important to fix this.
This change is inspired by a customer issue.
This is a follow up on:
- https://github.com/microsoft/onnxruntime/pull/15184
- https://github.com/cbourjau/ort-custom-op/pull/11
- https://github.com/microsoft/onnxruntime-extensions/issues/451
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers
### Motivation and Context
This PR includes changes that will make ort handle 2GB+ checkpoints.
To do that we need to upgrade flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945
- Modified the commitHash and the hash for the new version
- Removed the patch for the Rust generator's unused variable warning as
the new version no longer produces it - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.
---------
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
### Description
<!-- Describe your changes. -->
Add 2 C API for ORT extension:
- KernelInfo_GetAllocator
- OrtCustomOp::GetMayInplace
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add 2 C API for ORT extension project, which will leverage these 2 APIs
for GroupQueryAttention custom op.
### Description
<!-- Describe your changes. -->
add new API KernelContext_GetScratchBuffer to get scratch buffer from
kernel context
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
add new API KernelContext_GetScratchBuffer to get scratch buffer from
kernel context which will be used in ORT extension project for
GroupQueryAttention custom op
### Description
<!-- Describe your changes. -->
1. Add a config key in run_options to control CUDA graph behaviour at runtime (see the sketch below).
2. Enhance the CUDA graph class to support saving and retrieving multiple graphs in one ORT session.
3. Provide a model modification/inference example on Phi2.
4. Benchmark shows an average of 13% latency reduction in token generation.
Limitation: the TRT EP and ROCm EP haven't adopted this feature yet; we
can revisit this in the future.
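A sketch of how the runtime control might look from Python; the run-option key name (`gpu_graph_id`) is an assumption for illustration, as the exact key is not spelled out above:
```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "phi2_modified.onnx",
    providers=[("CUDAExecutionProvider", {"enable_cuda_graph": "1"})],
)

run_opts = ort.RunOptions()
# Hypothetical key: select which captured CUDA graph to use for this run.
run_opts.add_run_config_entry("gpu_graph_id", "1")

model_inputs = {}  # pass the real model inputs here
outputs = sess.run(None, model_inputs, run_opts)
```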
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Adding CUDA NHWC support for SpaceToDepth and DepthToSpace
- Add a new test which verifies that SpaceToDepth swizzling for the H
axis is correct.
- If CUDA NHWC is enabled, run all tests on the CUDA EP with NHWC as
well.
### Motivation and Context
Adding more NHWC operations to avoid layout transformations when using
the CUDA EP for more efficiency.
### Description
<!-- Describe your changes. -->
Address warnings so all the ORT projects build with /W4 on Windows.
Mainly
- unused parameters
- variables shadowing other ones
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19588 started on this.
### Description
Implement IsInf-10, 20 for CUDA.
Also add FP16 types on CPU.
### Motivation and Context
Certain models lag in performance due to IsInf not available on CUDA.
### ONNX Gelu Op in Opset 20
Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op
1. Move the CPU GELU implementation from
`onnxruntime/contrib_ops/cpu/activations.h/cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for the approximate attribute being 'none'.
2. Duplicate some logic from
`onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for the approximate attribute being 'tanh'.
3. Register ONNX domain Gelu CPU kernel from opset 20 in
`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`.
4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to
`onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and
`onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu`
respectively, as the implementation for approximate attribute to be
'tanh'.
5. Implement the logic for approximate attribute to be 'none' in
`onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`.
6. Register ONNX domain Gelu CUDA kernel from opset 20 in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
7. ROCM ep related changes.
8. Enrich the tests for ONNX domain Gelu in
`onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.
### Description
Currently, the QNN HTP performance mode is set during session creation; there's no way to change it afterwards. There is a requirement to set a high performance mode for high-priority requests and to set it back to a low performance mode later to save power, for example when incoming requests are idle.
Now, we still keep the performance mode at the session level in the QNN EP options, which is used as the default. The ORT QNN EP will set it once if the user sets it.
And there are settings (qnn.htp_perf_mode and qnn.htp_perf_mode_post_run) in the run options to change the performance mode before and after a session run. A recommended scenario is that the user sets a high performance mode before the inference run so that the result comes back ASAP, and sets a low performance mode after the inference to save power.
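For example, switching the mode around a latency-critical run could look like this from Python (a sketch; the mode value strings such as "burst" and "low_power_saver" mirror the QNN EP's htp_performance_mode values and are assumptions here):
```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", {
        "backend_path": "QnnHtp.dll",          # HTP backend
        "htp_performance_mode": "default",     # session-level default mode
    })],
)

run_opts = ort.RunOptions()
# Raise the HTP performance mode just before this run...
run_opts.add_run_config_entry("qnn.htp_perf_mode", "burst")
# ...and drop it back once the run completes, to save power.
run_opts.add_run_config_entry("qnn.htp_perf_mode_post_run", "low_power_saver")

model_inputs = {}  # pass the real model inputs here
outputs = sess.run(None, model_inputs, run_opts)
```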