### Description
A [previous PR](https://github.com/microsoft/onnxruntime/pull/16531)
added a temporary directory to save the model optimizations after
loading a model into an `InferenceSession`. Many models that have an
external data file, however, require the data file to be in the same
directory as the ONNX model file. Because the model is saved in a
temporary directory and the data is saved in another directory, this
causes a `FileNotFoundError` error when trying to load the model in the
temporary directory.
This PR fixes this error by saving the external data file in the same
directory that the optimized model is located in.
### Motivation and Context
This PR fixes a bug with using a temporary directory while running the
optimizer for models that have an external data file.
#16506 Cause almost every translation units on linux complaint
```
[1175/1235] Building CXX object CMakeFiles/onnxruntime_test_all.dir/home/guangyunhan/onnxruntime/orttraining/orttraining/test/training_ops/cuda/softmax_test.cc.o
In file included from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/float16.h:18,
from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/data_types.h:17,
from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/tensor.h:17,
from /home/guangyunhan/onnxruntime/onnxruntime/test/common/tensor_op_test_utils.h:16,
from /home/guangyunhan/onnxruntime/onnxruntime/test/providers/compare_provider_test_utils.h:7,
from /home/guangyunhan/onnxruntime/orttraining/orttraining/test/training_ops/cuda/softmax_test.cc:4:
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h: In instantiation of ‘static constexpr uint16_t onnxruntime_float16::Float16Impl<Derived>::ToUint16Impl(float) [with Derived = onnxruntime::MLFloat16; uint16_t = short unsigned int]’:
/home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/float16.h:42:66: required from here
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h:241:7: note: ‘union onnxruntime_float16::detail::float32_bits’ has no user-provided default constructor
241 | union float32_bits {
| ^~~~~~~~~~~~
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h:242:16: note: and the implicitly-defined constructor does not initialize ‘unsigned int onnxruntime_float16::detail::float32_bits::u’
242 | unsigned int u;
| ^
```
This PR shut the compiler up.
### Description
Introduce `Float16/BFloat16` support for C# and C++ APIs.
User should be able to perform conversions from `float` to/from
`Float16/BFloat16`, compare values and tests for `NaN, Inifnity, and
whether the number is denormalized.`
### Motivation and Context
User filed issues such as:
https://github.com/microsoft/onnxruntime/issues/14303
### Description
clean unused parameter in ORT_UNUSED_PARAMETER
### Motivation and Context
clean unused parameters in ORT_UNUSED_PARAMETER which are introduced
from #15833
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
### Description
C API for custom ops does not support float 8 types. This PR changes
that.
### Motivation and Context
The list of operators supporting float 8 is very limited. It should be
extended to custom ops to let developpers add customized operators for
these specific types.
### Description
Remove AllocatorManager class
### Motivation and Context
After the refactor PR #15833 is in, AllocatorManager class is not
referenced anymore.
### Description
This PR implements a backward-compatible way to define custom operators
with fallible compute functions. The C++ API templated gained an
optional `Fallible` argument. Closes#14287
### Motivation and Context
#14287 contains more context. The gist is that the current C-API defines
compute operations of custom operators as functions returning `void`
rather than an `OrtStatusPtr`. Currently, errors are often propagated
across the C-ABI using C++ exceptions. That is very unsafe and undefined
behavior. Moreover, it is difficult for languages other than C++ to use
this approach even if they wanted to. A C-compliant sound and safe way
to propagate errors allows for non-C++ fallible custom operators.
### An example in action
https://github.com/cbourjau/ort-custom-op/pull/6/files is a
demonstration of how this PR can be used to write safe and fallible
custom operators in Rust.
CUDA EP already supports [CUDA
graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
also we observed some models can benefit from using CUDA graph with
`trtexec`. Therefore, this PR enables the CUDA graph support for TRT EP.
The implementation is based on
https://github.com/microsoft/onnxruntime/pull/9978 with the same
[constraints](https://github.com/microsoft/onnxruntime/pull/9978) as
below:
- Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
- Usage of CUDA Graphs is limited to models where-in all the model ops
(graph nodes) can be partitioned to the TRT EP.
- The input/output types of models need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- IObinding is required.
### Description
<!-- Describe your changes. -->
1. Add a new test lib `onnxruntime_providers_cuda_ut` which is similar
to `onnxruntime_providers_cuda` but `onnxruntime_providers_cuda_ut` is
only built if `onnxruntime_BUILD_UNIT_TESTS` is set. We can call all
CUDA UTs through this ut lib without affecting production lib
`onnxruntime_providers_cuda`.
2. Move all test cases from `core/providers/cuda/test/` to
`test/providers/cuda/`. These test cases are built into lib
`onnxruntime_providers_cuda_ut` and run by `./onnxruntime_test_all
--gtest_filter="*CUDA_EP_Unittest*"`. Since the lib is only for test, we
can use gtest macros in the test cases. Previous implementation do not
support using gtest lib in the CUDA UT cases.
3. The cmake code in `cmake/onnxruntime_providers.cmake` is refactored a
bit. A new function `onnxruntime_add_object_library` is to build a
object target. The 2 libs `onnxruntime_providers_cuda_ut` &
`onnxruntime_providers_cuda` share most of the code, so the object files
can be used in both libs, which helps reduce build time. Another
function `config_cuda_provider_shared_module` is used to configure all 3
similar
targets(onnxruntime_providers_cuda_obj/onnxruntime_providers_cuda/onnxruntime_providers_cuda_ut).
4. Refactored the test to call `testing::InitGoogleTest` &
`RUN_ALL_TESTS` in `libonnxruntime_providers_cuda_ut.so`'s `TestAll`.
After this change, we can see all the cases running in
`CUDA_EP_Unittest.All`:

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
After https://github.com/microsoft/onnxruntime/pull/13016, there are
still test files in test/providers/cuda/ that are not moved to
core/providers/cuda/test/ and the test cases are disabled. This PR helps
to clean the unfinished TODOs.
Even through onnxruntime_shared_lib_test covers some test for CUDA
provider. onnxruntime_shared_lib_test works like a coarse grain
end-to-end test for CUDA provider. If CUDA unittest can run cases for a
single component, this wound be helpful for CUDA developers.
---------
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
### Description
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice. By this change, EP level will shift the burden of
maintaining allocators, which will be user friendly for EP developers
---------
Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
1. Use IAllocatorUniquePtr to replace BufferUniquePtr. It will ensure
the deleter is always right.
2. Change some std::unique_ptr to std::optional
3. Bypass Arena allocator when allocating the prepack buffers for mlas.
In this special case, Arena doesn't help any. And this change is just an
internal implementation change, it doesn't affect our public interface.
For TunableOp, some instance may has very bad performance and it will
take a long time during profile process.
Add `tunable_op_max_tuning_duration_ms` parameter to limit max tuning
time.
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4MEFNUZ, FloatE5M2FNUZ
as described in PR https://github.com/onnx/onnx/pull/4805. It uses CUDA
API to cast float/half to float8 if CUDA>=11.8, a custom implementation
if CUDA<11.8.
* It implements, Cast, QuantizeLinear, DequantizeLinear for all types on
CPU, only for types FloatE4M3FN, FloatE5M2 on CUDA.
* It extends the supported types for control flow operator, Shape,
Reshape, Identity, If, Loop, Scan, Reshape
* It implements Equal(19).
* Cast, QuantizeLinear, DequantizeLinear operators now support a
parameter `saturate` only valid for float 8 types. It is true by
default. In that case, any value out of range is converted into the
maximum float 8 value. If false, it is infinite.
* QuantizeLinear, DequantizeLinear now supports multiple scales on CUDA
(and ROCm by extension), scale = 1D tensor with one scale per channel
### Motivation and Context
Supports latest onnx version.
Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)
---------
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
### Description
The file include/onnxruntime/core/providers/cuda/cuda_provider_options.h
is a C++ file. It is not for C.
Before this commit, this header file is already not compatible with C compilers. Because it has:
```
onnxruntime::ArenaExtendStrategy arena_extend_strategy;
```
And this file is intended to be internal only. It is an internal header file. It should not be included in onnxruntime_c_api.h and should not be used with the public C APIs. User can only get the instance of OrtCUDAProviderOptionsV2 via CreateCUDAProviderOptions. In such a way we can add new members to this struct without breaking binary compatibility.
Since it is an internal header, we can safely use C++ grammar there.
Add a configuration `max_power_of_two_extend_bytes ` to limit the arena extension size.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
In our real scenario, we observe that if the model is big enough the
BfcArena will extend uncontrollable.
As showed by the following figures, if a model uses more than 16GB
memory, the BfcArena will totally apply for 32GB memory according to the
`kNextPowerOfTwo` strategy. With the new strategy, the extension is
limited. The default maximum extension size is 1GB.
#### Without the new configuration
After loading the model, ORT uses 32G GPU memory.

#### With the new configuration
After loading the model, ORT uses 23G GPU memory.

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
### Description
Adds the session config option `disable_cpu_ep_fallback` to allow the
user to prevent the CPU EP from handling
nodes not supported by other execution providers.
```C++
// Graph nodes that are not supported by the execution providers (EPs) explicitly added to the session are
// assigned (i.e., "fallback") to the CPU EP by default.
//
// This option allows the user to disable the fallback of unsupported graph nodes to the CPU EP.
// If this option is set to "1", session creation will fail if the execution providers other than the CPU EP cannot
// fully support all of the nodes in the graph.
//
// It is invalid to set this option and explicitly add the CPU EP to the session. In this case, session creation
// will also fail with an error.
//
// Option values:
// - "0": CPU EP fallback is not disabled. [DEFAULT]
// - "1": CPU EP fallback is disabled.
static const char* const kOrtSessionOptionsDisableCPUEPFallback = "session.disable_cpu_ep_fallback";
```
#### Example use
```C++
#include "core/session/onnxruntime_cxx_api.h"
#include "core/session/onnxruntime_session_options_config_keys.h"
int main(int argc, char** argv) {
Ort::SessionOptions so;
so.AddConfigEntry(kOrtSessionOptionsDisableCPUEPFallback, "1"); // Disable fallback to the CPU EP.
onnxruntime::ProviderOptions options;
#if defined(_WIN32)
options["backend_path"] = "QnnCpu.dll";
#else
options["backend_path"] = "libQnnCpu.so";
#endif
so.AppendExecutionProvider("QNN", options);
const ORTCHAR_T* ort_model_path = ORT_MODEL_FOLDER "qnn_ep_partial_support.onnx";
Ort::Session session(*ort_env, ort_model_path, so); // Throws exception if nodes fallback to CPU
// ...
```
### Motivation and Context
Makes it easier for application developers to ensure that the entire
model runs on specific EPs. This is critical for Qualcomm/scenarios. If
the compute cannot be offloaded to the NPU, running on CPU is not
acceptable. (could be the difference between 90 second inference and 6
seconds inference)
---------
Co-authored-by: Pranav Sharma <prs@microsoft.com>
### Description
Enable Qnn Context cache feature to save model initialization time
Provider options:
qnn_context_cache_enable|1 to enable the cache feature
qnn_context_cache_path to set the cache path. It is set to model_file.onnx.bin by default.
### Motivation and Context
Model initialization time takes long because the cost of conversion from Onnx model to Qnn model. Qnn have feature to serialize the Qnn context to file, then next time user can load it from the cache context and execute the graph to save the cost.
---------
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
This PR mainly fixes building errors when trying to build nupkg for ROCm EP.
It also slighly improve the packaging logic so that devlopers can
produce the nupkg on linux natively.
### Description
This PR partially reverts changes introduced in
https://github.com/microsoft/onnxruntime/pull/15643
We make two API return std::string always in UTF-8.
We also move the entry points from OrtApiBase to OrtApi to make them
versioned.
### Motivation and Context
`GetVersionString` always returns x.y.z numbers that are not subject to
internationalization.
`GetBuildInfoString` can hold international chars, but UTF-8 should be
fine to contain those.
We prefix them with u8"" in case the compiler default charset is not
UTF-8.
Furthermore, creating platform dependent APIs is discouraged.
`ORTCHAR_T` is platform dependent and was created for paths only.
On non-unix platforms would still produce `std::string` that can only
contain UTF-8
The API was introduced after the latest release, and can still be
adjusted.
**Description**:
This PR intends to enable WebNN EP in ONNX Runtime Web. It translates
the ONNX nodes by [WebNN
API](https://webmachinelearning.github.io/webnn/), which is implemented
in C++ and uses Emscripten [Embind
API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#).
Temporarily using preferred layout **NHWC** for WebNN graph partitions
since the restriction in WebNN XNNPack backend implementation and the
ongoing
[discussion](https://github.com/webmachinelearning/webnn/issues/324) in
WebNN spec that whether WebNN should support both 'NHWC' and 'NCHW'
layouts. No WebNN native EP, only for Web.
**Motivation and Context**:
Allow ONNXRuntime Web developers to access WebNN API to benefit from
hardware acceleration.
**WebNN API Implementation Status in Chromium**:
- Tracked in Chromium issue:
[#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291)
- **CPU device**: based on XNNPack backend, and had been available on
Chrome Canary M112 behind "#enable-experimental-web-platform-features"
flag for Windows and Linux platforms. Further implementation for more
ops is ongoing.
- **GPU device**: based on DML, implementation is ongoing.
**Open**:
- GitHub CI: WebNN currently is only available on Chrome Canary/Dev with
XNNPack backend for Linux and Windows. This is an open to reviewers to
help identify which GitHub CI should involved the WebNN EP and guide me
to enable it. Thanks!
Implement a set of new APIs for lightweight custom ops registration, to
save efforts from schema-composing.
A few highlights:
- Support build-time type inference;
- Support function-as-op for "stateless" ops;
- Support structure-as-op for "stateful" ops;
- Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
1. Update VERSION_NUMBER for preparing the upcoming release. This PR's
commit will not be included in the 1.15 release branch
2. Delete package/rpm/onnxruntime.spec since it was not used in past
years.
### Motivation and Context
Preparing the release.
Fixed
[AB#15311](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15311)
### Description
ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice
### Motivation and Context
Currently “Location” is represented as ORTMemoryInfo, which is OrtDevice
+ OrtMemType, while OrtDevice is represent as DeviceType + DeviceId +
MemType. As we can see there is some unnecessary hierarchy, the proposal
is to make it a clear definition that to use OrtDevice as an abstraction
for Location
---------
Co-authored-by: Lei Cao <leca@microsoft.com>
Implement a set of new APIs for lightweight custom ops registration, to
save efforts on schema-composing.
A few highlights:
1. Support build-time type inference;
2. Support function-as-op for "stateless" ops;
3. Support structure-as-op for "stateful" ops;
4. Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Augment nhwc graph optimizer to accommodate fp16 operators.
### Motivation and Context
With new fp16 conv operator added. This operator prefers NHWC data
layout. We need to augment existing graph optimizers to better utilize
the new operator.
### Description
Originally VitisAI EP only works with old version of VitisAI release.
### Motivation and Context
Update VitisAI EP so that it works together with the current VitisiAI
3.5 and further version of VitisAI. We try our best to make it forward
compatible.
---------
Co-authored-by: Wang Chunye <chunywan@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com>
Co-authored-by: shoucair <shoucai.ren@amd.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
Co-authored-by: BoarQing <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
This addresses a performance regression in some INT8 models with the
DirectML EP by defaulting OrtSessionOptionsDisableQuantQDQ to 1 when the
EP is registered.
This regression occured due to the introduction of the QDQ propagation
transformer, which is based on this session option. That transformer
maximizes the number of nodes which are executed as quantized by
logically propagating quantize operators upstream and dequantize
operators downstream. However, it does this simply by inserting QDQ
pairs, with an expectation that something will recognize sequences of
DQ->Op->Q. This logic and related L2 transformers are not currently
enabled for the DirectML EP.
This change also removes a noisy warning when the session option for
memory pattern is overriden as the DirectML EP is registered.
Previous behavior of TRT EP to set TRT optimization profiles for dynamic
shape input is based on input tensor values. Users can't explicitly
specify the profiles.
This PR makes users capable of specifying min/max/opt profiles through
newly added three provider options:
`trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`
with the format of "input1:dim1xdim2...,input2:dim3xdim4...".
(Note: It's similar to --minShapes, --maxShapes and --optShapes of
trtexec command-line
[flags](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags))
For example, if you are using onnxruntime_perf_test, you can try this:
`./onnxruntime_perf_test -e tensorrt -r 1 -i
"trt_profile_min_shapes|imgs:1x3x384x288
trt_profile_max_shapes|imgs:32x3x384x288
trt_profile_opt_shapes|imgs:16x3x384x288" your_model_path`
If the engine cache is enabled, you still need to provide these three
explicit provider options in order to use this feature. ORT TRT will
compare the min/max/opt profile shape with the ones saved in .profile
file to decide whether to rebuild the engine.
Constraints to use these provider options: (1) Need to specify
min/max/opt profile shapes for all the dynamic shape input
This feature is also requested by other users:
https://github.com/microsoft/onnxruntime/issues/13851
### Description
Cast optimizer may convert a fp16 node to fp32. This used to be safe as
all fp16 kernels has fp32 implementation. As this assumption is no
longer true, we need to check the validity of the operation
### Motivation and Context
Main work here is to introduce an API to check whether a kernel is
registered. Currently we don't have a way to do that without an operator
node. This needs to be augmented. We need to query whether a kernel is
registered by its property only, so that we can judge whether it is safe
to construct a node long before we actually do so.