### Description
Use InlinedVector<int64> instead of <int64_t,5> to reduce on the number
of template instantiations.
### Motivation and Context
The reported size reduction is small, just a few Ks. Just trying it out.
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.
- Connect MatMulNBits contrib op to MLAS function.
### Description
When and if `If` condition proves to be a constant value, inline the
corresponding subgraph yielding to more constant folding and
optimization.
### Motivation and Context
Newly converted models feature lots of nested `If` nodes that can be
inlined and collapsed.
In particular, for the sample models we are gaining on TorchScript
exported models.
For `HF Mobile Bert Dynamo` runtime went down from 0.069 -> 0.046. In
total, AOT inlining + `If` constant folding
yields improvement of about 50% 0.102 -> 0.046. Brining us very close to
TorchScript exported models.
`HF Bart Dynamo` further improves 0.668 -> 0.45. AOT + `If` constant
folding improves 0.98 -> 0.45
Earlier the size of
HF Mobile Bert **161Mb+**, now **98Mb**
HF Bart Dynamo pre-optimized model was about **1.2Gb**. It is now
**710MB**

### Description
<!-- Describe your changes. -->
Adding static int8 quantization support for MIGraphX Execution Provider
- Allows for parsing in calibration tables generated by Onnxruntime or
TensorRT's toolsets
- Add proper environment variables into the MIGraphX EP
- Update python API to include updating execution provider flags -> was
missing on python side
- Hook into MIGraphX's int8 quantitation and optimization of models
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Required so that we can get onnxruntime to pass in models while
leveraging the existing tooling for int8 static QDQ quantization.
First step in a series of PRs which will add further static quantization
on the operator level as MIGraphX releases further support.
These changes drew heavily from the tensorRT EP should allow for similar
functionality for GPU based (versus CPU) quantization of models before
an inference is performed.
---------
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Enable option qnn_context_priority to set QNN context priority, options:
"low", "normal", "normal_high", "high".
### Description
Enable option qnn_context_priority to set QNN context priority, options:
"low", "normal", "normal_high", "high".
This feature guarantees the model inference with higher priority. Tested
with onnxruntime_perf_test tool using same model.
1. Run the model on the NPU with single instance, the latency is 300ms.
2. Run the same model on NPU with 2 instance at same time.
Case 1:
both with same priority (high ) -- latency is 600ms
Case 2:
1 with low priority -- latency is 30,000ms
1 with high priority -- latency is 300ms
Case 3:
1 with normal priority -- latency is 15,000ms
1 with high priority -- latency is 300ms
### Description
Adds the QNN session option `htp_graph_finalization_optimization_mode`
to enable QNN graph optimizations at the expense of longer preparation
time.
### Motivation and Context
Allow enabling QNN graph optimizations per app/model.
### Description
Integration to OpenVINO 2023.1
### Motivation and Context
- Alignment with latest OpenVINO Version.
- Device name change from VPUX to NPU and Remove from supported list
until official public support is available.
---------
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
I am adding a new `trt_timing_cache_path` option. Internally it is
handled as `global_cache_path_` and will be set via a fall through
approach:
1. no path provided => workdir
2. `trt_engine_cache_path` provided but no `trt_timing_cache_path` =>
`trt_engine_cache_path`
3. `trt_timing_cache_path` provided => `trt_timing_cache_path` (if not
provided `trt_engine_cache_path` will still be workdir)
### Motivation and Context
A TRT timing cache can be reused across multiple models as it only holds
kernel timings and it is common that network "patterns" are reused. This
can accelerate build times a lot.
---------
Co-authored-by: Carson M <carson@pyke.io>
Historically, DML was only able to fuse partitions when all sizes are
known in advance or when we were overriding them at session creation
time. But in practice, it should be possible to compile partitions at
compute time if the caller knows that the dimensions won't be changed
for every inference (e.g. resizing a webcam window, or padding the input
to powers of 2). This graph will be cached and reused until the sizes
change.
This is an opt-in option gated under the `enable_dynamic_graph_fusion`
option, which means that it will only be enabled when the caller
requests it since they have more context on how their model will be
called between inferences.
This PR also adds the option to disable metacommands from the python
API, which is an option for the C API but was lacking for python.
### Description
Inline functions in an EP aware fashion.
The result of this PR is that models that are having been inlined by
ONNX inliner and optimized and models that have been AOT inlined appear
to be visually identical.
For tests I used two models. The only difference is the resulting size
because ONNX inliner removes local function definitions and AOT does
not. Difference in sizes for `HF Mobile` model was 2.5 MB, and for `HF
Bart` it was ~500K. It seems that the resuling model size affects the
load time more than the actual optimizations.
In general, the inlined models grow in size very fast and can easily
exceed 2Gb limit.
Q. Should we make AOT optional?
`If` costant folding and the removal of local inlined models will be
coming in other PRs.
Some stats:

Expose a new allocator from cuda stream.
The allocator manages deferred cpu memory which only get recycled before
stream destruction.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
CUDA inference speed heavily relies on Tensor Cores. To have tensor
cores achieve the optimal throughput they require the data layout to be
NHWC rather than NCHW.
### Motivation and Context
Especially for convolutional networks this is very important. I will
illustrate this using a very simple network:
```
import torch
import torch.nn as nn
class Net1(nn.Module):
def __init__(self):
super(Net1, self).__init__()
# 1 input image channel, 6 output channels, 5x5 square convolution
# kernel
self.m = nn.ModuleList([
nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
])
def forward(self, x):
for module in self.m:
x = module(x)
return x
if __name__ == "__main__":
dtype = torch.half
device = "cuda"
dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
model = Net1().to(dtype=dtype, device=device)
input_names = ["input1"]
output_names = ["output1"]
torch.onnx.export(model, dummy_input, "test.onnx",
input_names=input_names, output_names=output_names)
```
I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test
-e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges.
Current master launches below kernels:

If I add the introduced `-l` flag we see below kernels:

Notice the missing NCHW<>NHWC kernels per operation. The layout
optimizer introduced a transpose op as first and last op of the whole
network. The `op_generic_tensor_kernel` shows the bias used which should
also be optimized out next.
Measured across some very basic models:
| CUDA EP | **NCHW** [ms] | **NHWC** [ms] | Speedup |
|:------------------------|--------------------------------------:|-----------------------------------------:|------------------:|
| | -e cuda -t 5 -q | -e cuda -t 5 -q -l | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |
Average speedup: 1.53
## Outlook
Next the mission will be to first write a templated unit test to check
for correctness of NHWC vs NCHW ops. After that we have to transition
more ops to measure perf improvements on a broader range of models.
Currently this is not easily possible as we can do not support all ops
in the NHWC domain.
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Support cross qk in beam search for whisper model and related features
Make whisper exporting tools support cross qk and some related features,
* extra_decoding_ids
* no_speech_prob
Implement DTW kernel, unfold tensor kernel with unit test Several fix
related with multiple session running parallel, like:
* guard multihead_attention, fused_fp16_runner_
* some memory allocation with stream awareness
* add use_ep_level_unified_stream option
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a Qnn context binary file with metadata as header, we
dump a Onnx format file with metadata inside Onnx node.
### Motivation and Context
reduce the memory overhead and initialization time overhead
Two major modifications of this PR:
1. Refactor OrtTensorRTProviderOptions initialization and make it easy
to add new field.
2. Make Python API capable of using TensorRT plugins by adding new
Python binding api `register_tensorrt_plugins_as_custom_ops`. (It needs
to register ep's custom op domain before model load. For C++ API, it's
slightly different, when calling
SessionOptionsAppendExecutionProvider_TensorRT_XX, it appends cutom op
domain to session option. Later ORT can register custom op domain from
session option before model loading)
### Description
- Enables option to use the QNN Saver backend for dumping QNN API calls
to file.
- Adds logic to read environment variable
`ORT_UNIT_TEST_ENABLE_QNN_SAVER` from QNN EP unit tests. If enabled,
unit tests will use the QNN Saver backend and dump files to
`./saver_output/`.
### Motivation and Context
QNN Saver makes it easier to debug issues when unit tests fail. The
output files generated by QNN Saver can be used to replay the exact QNN
API calls that lead to a specific error condition.
QNN Saver dumps QNN API calls (and weights) to disk.
- saver_output/saver_output.c: C file containing all QNN API calls.
- saver_output/params.bin: binary file containing all
input/output/parameter tensor data provided during tensor creation, op
config validation, and graph execution.
Enabling the QNN Saver backend has 2 note-worthy effects:
1. All QNN API calls will succeed.
2. Inference output returns dummy data.
Because the output files from QNN Saver are always overwritten, it is
recommended to run individual unit tests via the `--gtest_filter`
command-line option.
Example (linux):
```shell
$ ORT_UNIT_TEST_ENABLE_QNN_SAVER=1 ./onnxruntime_test_all --gtest_filter=QnnHTPBackendTests.Resize_DownSample_Linear_AlignCorners
```
### Description
Add support for specifying a custom logging function per session.
Bindings for other languages will be added after this PR is merged.
### Motivation and Context
Users want a way to override the logging provided by the environment.
### Description
<!-- Describe your changes. -->
* Allow either an allocator or a MemBuffer to be used when creating an
OrtValue from an TensorProto
* `Tensor<std::string>` requires an allocator to allocate/free the
string values
* Forcing the buffer to be allocated outside of the Tensor doesn't seem
to provide any benefit in this usage as the Tensor class disables copy
and assignment (so we wouldn't create 2 copies of the buffer via the
Tensor class that externally managing the would buffer avoid)
* New approach means we don't need to manage the buffers in the
optimizer Info class as the Tensor dtor will do that
* Update naming - MLValue was replaced by OrtValue a long time ago
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17392
### Description
Remove `Resolve()` on the entire graph as each function is resolved.
We retain `Resolve()` after each inlining iteration.
### Motivation and Context
Poor performance for inlining the model and session initialization.
Original model before Resolve() removal
FunctionTest.Profiling (**65953 ms**)
After Resolve() Removal
FunctionTest.Profiling (**2911 ms**)
RelWithDebInfo pre-inlined model. Presumably because it runs Level1
optimizers
Non-inlined model consists of functions and Level1 optimizers have no
effect.
FunctionTest.Profiling (**9851 ms**)
### Description
<!-- Describe your changes. -->
Make status.h independent from gsl.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
In the coming new feature external EP API (see the prototype
https://github.com/microsoft/onnxruntime/pull/16718), we need to expose
stream in the public header, however, stream is dependent on status.h
which is dependent on gsl. We are seeking a way to decouple stream from
gsl.
From Changming's comment offline, prefast is disabled so all
GSL_SUPPRESS are not taking any effect now. He will handle the warnings
when enable prefast in the future
### Description
Add new name "WebGPU_Buffer" to OrtMemoryInfo.
This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.
list of prerequisites PRs:
#17465#17469 (this one)
In flatbuffers@v23.5.9 was broken forward declaration for
FlatBufferBuilder. Trying to compile onnxruntime falls with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
class FlatBufferBuilder;
```
This PR removes these declarations and puts includes instead
### Fix build - redefinition of default argument for ‘long unsigned int
Extent’
One of the training customer env, building ORT, there is such a build
error. The GCC version are
```
aiscuser@node-0:/tmp/onnxruntime$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
aiscuser@node-0:/tmp/onnxruntime$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
```
But on our dev node using same GCC/G++, we don't have build issue., not
sure what's the difference but giving an explict type when creating
`gsl::span` fixed the problem.
```
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:394:7: error: redefinition of default argument for ‘long unsigned int Extent’
394 | class span
| ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span_ext:46:51: note: original definition appeared here
46 | template <class ElementType, std::size_t Extent = dynamic_extent>
| ^~~~~~~~~~~~~~~
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:82:93: error: return type ‘class gsl::span<const std::byte>’ is incomplete
82 | [[nodiscard]] inline gsl::span<const std::byte> AsByteSpan(const void* data, size_t length) {
| ^
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h: In function ‘void onnxruntime::AsByteSpan(const void*, size_t)’:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: class template argument deduction failed:
83 | return gsl::span(reinterpret_cast<const std::byte*>(data), length);
| ^
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: error: no matching function for call to ‘span(const std::byte*, size_t&)’
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: candidate: ‘template<class Type, long unsigned int Extent> gsl::span(Type (&)[Extent])-> gsl::span<ElementType, FirstExtent>’
740 | span(Type (&)[Extent]) -> span<Type, Extent>;
| ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:740:1: note: template argument deduction/substitution failed:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note: mismatched types ‘Type [Extent]’ and ‘const std::byte*’
83 | return gsl::span(reinterpret_cast<const std::byte*>(data), length);
| ^
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: candidate: ‘template<class Type, long unsigned int Size> gsl::span(std::array<_Tp, _Nm>&)-> gsl::span<ElementType, FirstExtent>’
743 | span(std::array<Type, Size>&) -> span<Type, Size>;
| ^~~~
/tmp/onnxruntime/build/Linux/RelWithDebInfo/_deps/gsl-src/include/gsl/span:743:1: note: template argument deduction/substitution failed:
/tmp/onnxruntime/include/onnxruntime/core/common/span_utils.h:83:68: note: mismatched types ‘std::array<_Tp, _Nm>’ and ‘const std::byte*’
83 | return gsl::span(reinterpret_cast<const std::byte*>(data), length);
| ^
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
- allocation planner was breaking if graph had no nodes
- in this particular model a branch of an If node returned an outer
scope value directly.
- if model used non-tensor types and sparse tensors are disabled the
call to IsSpareTensor causes an exception when prematurely terminates
the code.
- it's perfectly fine to check if a value is a sparse tensor when
support for them is disabled. we just can't do anything with that
OrtValue which is what the current ifdef's after the call to
IsSparseTensor handle.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix model execution failure for partner with model that uses sequences
in a minimal build with sparse tensors disabled.
Prevent `GSL_SUPPRESS` arguments from being modified by clang-format and update existing usages.
clang-format was changing something like `GSL_SUPPRESS(r.11)` to `GSL_SUPPRESS(r .11)`.
For some compilers (e.g., clang), the `gsl::suppress` attribute takes a quoted string argument. We don't want to insert spaces there.
### Description
When functions are inlined and constant nodes are being converted to
initializers, we need to create NodeArg for them.
Similar for inlined function subgraph, but we choose to give priority to
non-constant nodes and then fill the gaps with constant and
initializers.
### Motivation and Context
This addresses issue
https://github.com/microsoft/onnxruntime/issues/16813 for
`eca_halonext26ts_mod.onnx` model where it fails to remove unused
initializer because `NodeArg` was not created for it.
### Description
Re-implement stacktrace. The new implementation doesn't directly use
Windows API, hence can avoid problems regarding to
initialize/uninitialize the dbghelp library.
### Motivation and Context
### Description
This change upgrade emsdk to 3.1.44.
Because backend is upgraded to LLVM 16, so need to fix a lot of build
failures caused by "-Wshorten-64-to-32".
most of the build failures comes from generated `onnx.pb.h`, and this
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignore shorten-64-to-32 warnings.
Add a generic `UpdateCUDAProviderOptionsWithValue()` C API to update
CUDA EP provider options where its data type is pointer that can't be
represented by string.
Note: Please see some comments for the similar [PR
](https://github.com/microsoft/onnxruntime/pull/16965)for TRT EP.
Add a generic `UpdateTensorRTProviderOptionsWithValue()` C API to update
TensorRT provider options where its data type is pointer that can't be
represented by string.
### Description
Added Gelu operator to JSEP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
A [previous PR](https://github.com/microsoft/onnxruntime/pull/16531)
added a temporary directory to save the model optimizations after
loading a model into an `InferenceSession`. Many models that have an
external data file, however, require the data file to be in the same
directory as the ONNX model file. Because the model is saved in a
temporary directory and the data is saved in another directory, this
causes a `FileNotFoundError` error when trying to load the model in the
temporary directory.
This PR fixes this error by saving the external data file in the same
directory that the optimized model is located in.
### Motivation and Context
This PR fixes a bug with using a temporary directory while running the
optimizer for models that have an external data file.