### Description
- Refactor code to meet the line length limit and address the missing-guard warning
- Add slice/dropout op support
- Move vsinpu ep's cmake settings from onnxruntime_providers.cmake to a
separate file
- Modify APIs that take an onnxruntime::Path parameter, because that type
was replaced by std::filesystem::path in #20920
### Description
Extend the CUDA minimal option to the TRT provider, since with TRT 10
linking against cuDNN is no longer required.
In addition, with the new engine dump feature it is possible to embed an
engine into an ONNX model and not ship a builder lib.
This also has roughly the same deserialization time/session setup time
as using TRT standalone.
### Motivation and Context
```
exe_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model.onnx
exe_no_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model_ctx.onnx
```
### Description
The ETW trace logger is falsely reported as registered because
`initialized_` is set to true before registration completes, which
causes a crash in the Lenovo camera application.
[Bug 42610244](https://microsoft.visualstudio.com/OS/_workitems/edit/42610244):
[Watson Failure] caused by
SVCHOSTGROUP_Camera_INVALID_POINTER_READ_c0000005_onnxruntime.dll!onnxruntime::logging::Logger::Log
### Description
It might be easier if we just directly include the original gsl headers.
"core/common/gsl.h" is an indirection that doesn't provide extra help.
### Description
This PR enables the API added in #20816 and moves context creation to
JS.
### Motivation and Context
In order to enable I/O Binding with the upcoming
[MLBuffer](https://github.com/webmachinelearning/webnn/issues/542) API
in the WebNN specification, we need to share the same `MLContext` across
multiple sessions. This is because `MLBuffer`s are restricted to the
`MLContext` where they were created. This PR enables developers to use
the same `MLContext` across multiple sessions.
Currently the WebNN TFLite backend allows the filter of
conv2d/convTranspose2d to be an input. Remove the constraint and apply
the necessary transpose/reshape operations to the filter input.
### Description
Support MatMulNBits shape inference in SymbolicShapeInference.
MatMulNBits's B input is rank-2, so implicit merge does not apply.
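As a rough illustration of the shape rule involved, the output keeps the leading dimensions of input A and replaces the last one with N. The sketch below assumes N is read from the node's `N` attribute; the helper name `matmul_nbits_output_shape` is hypothetical and this is not the code from symbolic_shape_infer.py.

```python
from typing import List, Union

# A dimension is either a concrete size or a symbolic name such as "batch".
Dim = Union[int, str]


def matmul_nbits_output_shape(a_shape: List[Dim], n: int) -> List[Dim]:
    """Hypothetical helper: output shape is A's leading dims followed by N."""
    if len(a_shape) < 1:
        raise ValueError("input A must have rank >= 1")
    # B holds packed quantized data, so its shape is not merged with A's.
    # The output's last dimension is taken from N instead (assumed attribute).
    return list(a_shape[:-1]) + [n]
```

For example, an A of shape `["batch", "seq", 3072]` with `N=9216` would give `["batch", "seq", 9216]`, with the symbolic dimensions carried through unchanged.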
### Motivation and Context
[Issue with performing shape inference using symbolic_shape_infer.py with Phi-3 ONNX Models · Issue #21194 · microsoft/onnxruntime (github.com)](https://github.com/microsoft/onnxruntime/issues/21194)
Fix an error in fp8*fp8 when input A is e5m2 and input B is e4m3.
### Description
Delete path.h and replace all occurrences of onnxruntime::Path with
std::filesystem::path.
Previously we couldn't use C++17's std::filesystem because it was not
supported on iOS 12 (which was released in 2018). Support for iOS 12 has
now been dropped.
### Motivation and Context
To simplify the code. For example, if an EP wants to use a path class, it
can now use std::filesystem::path directly without going through a
wrapper. The standard implementation also handles various path types
better. (The old class gave little consideration to UNC paths, "/" as a
path separator on Windows, etc.)
Add SparseAttention CPU implementation.
- [x] Refactoring GQAAttentionBase
- [x] Add SparseAttention implementation
- [x] Add test cases
This is the unfused version. A flash attention version will be added later.
### Description
Move the current ROCm packaging stages into a single workflow.
This reduces the chance that one failed stage prevents all nightly
packages from being generated.
### Motivation and Context
Our plan is to reduce the complexity of the current zip-nuget pipeline
to improve the stability and performance of nightly package generation.
The ROCm packaging stages have no dependencies on other packaging jobs
and are the most time-consuming route.
After this change, the duration of the most used CPU/CUDA/Mobile
packaging workflow can be reduced from roughly 3h20m to 2h30m.
### Problem
Currently, the codebase contains some logic pertaining to model
re-export checks and graph_builder reinitialization checks. Ideally,
these operations should function akin to a state machine. However, upon
inspecting the implementation, it becomes apparent that certain states
are checked or set in various scattered locations. This fragmentation
makes it challenging to comprehend when a re-export or re-initialization
will be triggered. For optimal clarity and maintainability, it is
advisable to consolidate these states into a cohesive component, rather
than dispersing them within the current graph execution manager.
Furthermore, the process of model exports and post-export processing for
stage 3 support or memory-efficient gradient management introduces
considerable complexity. To enhance the codebase's structure, it would
be beneficial to extract these intricate functionalities into a
dedicated component, divorcing them from the current graph execution
manager.
As part of the effort to improve the codebase, it's essential to address
inconsistencies in handling input/output flatten/unflatten operations.
Currently, there are several functions performing these operations
recursively, each with slightly different implementations. This
inconsistency leads to varying support for input/output data types and
structures in different parts of the code. To rectify this, the proposed
pull request simplifies these operations into a set of primitive
functions, ensuring uniformity. This not only streamlines the code but
also facilitates the maintenance of consistency when introducing bug
fixes or supporting new data types. One thing to mention here: input
output handling is deeply bound to the graph transition mentioned above,
so it is difficult to make this change separately.
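To make the idea of a single pair of flatten/unflatten primitives concrete, here is a minimal sketch. The function names and the schema encoding are assumptions for illustration only and are not the helpers introduced by this PR. It handles lists, tuples, and dicts; anything else is treated as a leaf.

```python
from typing import Any, List, Tuple


def flatten(data: Any) -> Tuple[List[Any], Any]:
    """Flatten nested lists/tuples/dicts into a leaf list plus a schema."""
    leaves: List[Any] = []

    def _walk(node: Any) -> Any:
        if isinstance(node, (list, tuple)):
            return (type(node), [_walk(item) for item in node])
        if isinstance(node, dict):
            return (dict, {key: _walk(value) for key, value in node.items()})
        # Anything else is a leaf; the schema stores its index in `leaves`.
        leaves.append(node)
        return len(leaves) - 1

    schema = _walk(data)
    return leaves, schema


def unflatten(leaves: List[Any], schema: Any) -> Any:
    """Rebuild the original structure from the leaf list and the schema."""
    if isinstance(schema, int):
        return leaves[schema]
    kind, children = schema
    if kind is dict:
        return {key: unflatten(leaves, child) for key, child in children.items()}
    return kind(unflatten(leaves, child) for child in children)
```

Because every caller goes through the same two functions, supporting a new container type or fixing a bug only has to happen in one place.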
While acknowledging the complexity of this logic, it is reassuring
that the codebase benefits from an extensive suite of unit tests that
cover all possible branches. Despite the intricacies, ensuring the
passage of all tests has been a time-intensive but necessary aspect of
this development effort.
### Design
Introduce `GraphTransitionManager` and put all model export and
post-export processing logic in it.
1. Re-export check
2. Do export
3. Re-post-export process check
4. Do post-export process
5. Return `PostExportProcessedModelInfo`, which contains all the
information we need to pass to ORT to build the gradient graph.
(Currently we do the same for training and evaluating. Ideally we should
not do it for evaluating, but this PR keeps the behavior as it is now;
the change can be made later.)
```
# Input names for the pre-gradient-build graph.
# This may be different with the one in ExportedGraph since we may modify the graph inputs as needed
# for example when memory efficient gradient management is enabled.
self.onnx_graph_input_names: list[str] = onnx_graph_input_names

# A subset of onnx_graph_input_names.
# Input names that require gradients for the pre-gradient-build graph.
self.onnx_graph_input_names_require_grad: list[str] = onnx_graph_input_names_require_grad

# Create symbolic names for each dimension of the graph input (e.g. onnx_graph_input_names).
# The key is the input name, the value is a dict of {dim_index: symbolic_dim_name}
# e.g. {"input1": {0: "input1_dim0", 1: "input1_dim1"}, "input2": {0: "input2_dim0"}}
self.onnx_graph_input_dynamic_axes_map: dict[str, dict[int, str]] = onnx_graph_input_dynamic_axes_map

self.buffer_for_ort_runs: dict[str, torch.Tensor] = OrderedDict()
self.onnx_graph_input_names_user_defined = (
    onnx_graph_input_names_user_defined  # The ONNX graph input names excluding the parameters, buffers.
)

# The ONNX graph input names excluding the parameters, buffers.
self.onnx_graph_input_names_require_grad_user_defined = onnx_graph_input_names_require_grad_user_defined

self._post_export_processed_model: onnx.ModelProto | None = post_export_processed_model

# A function to access the input data from the args and kwargs.
# If it is not None, the length is same as onnx_graph_input_names.
# For i-th input name, we can use the i-th function to get the input data from args and kwargs.
self.data_accessor: list[callable] | None = data_accessor

# Used for unflattening the outputs from the ORT forward run.
self.module_forward_output_schema: ORTModelInputOutputSchemaType | None = module_forward_output_schema
```
The `GraphTransitionManager` instance is a property of
`GraphExecutionManager` (e.g. `TrainingManager` or `InferenceManager`):
1. Use
`self._graph_transition_manager.use_cache_or_reconstruct_post_processed_model(inputs, kwargs)`
to check whether the PyTorch module needs a re-export or
re-post-export-process.
2. Use
`self._graph_transition_manager._post_export_processed_model_info.construct_inputs`
to construct the list of inputs used for ORT runs.
3. Use
`self._graph_transition_manager._post_export_processed_model_info.restore_outputs(user_outputs)`
to restore the outputs to the original PyTorch output structure.
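The five design steps can be read as a small cache-or-rebuild state machine. The following is only a sketch of that control flow: `TransitionSketch`, its fields, and the use of an "input schema" tuple as the re-export trigger are all assumptions made for illustration, and strings stand in for ONNX models. The real `GraphTransitionManager` checks more conditions than this.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class TransitionSketch:
    """Hypothetical stand-in for the export / post-export flow."""

    export_fn: Callable[[Tuple], str]      # assumed: schema -> exported model
    post_process_fn: Callable[[str], str]  # assumed: exported -> processed model
    _exported: Optional[str] = None
    _exported_key: Optional[Tuple] = None
    _processed: Optional[str] = None

    def use_cache_or_reconstruct(self, input_schema: Tuple) -> Tuple[bool, str]:
        # Steps 1-2: re-export only when nothing is cached or the schema changed.
        need_export = self._exported is None or self._exported_key != input_schema
        if need_export:
            self._exported = self.export_fn(input_schema)
            self._exported_key = input_schema
        # Steps 3-4: redo post-export processing only after a fresh export
        # or when no processed model is cached yet.
        need_post = need_export or self._processed is None
        if need_post:
            self._processed = self.post_process_fn(self._exported)
        # Step 5: report whether anything was rebuilt, plus the processed model.
        return need_post, self._processed
```

Keeping both checks in one method is the point of the design: a caller can see in a single place when a re-export or a re-post-export-process is triggered.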
### Description
Add some macros to help print data to the console for debugging purposes.
Example usage:
```
int input_id;
vector<int> some_vector;
DUMP_CPU_TENSOR_INIT()
DUMP_CPU_TENSOR("some vector", some_vector);
DUMP_STRING("input_id=", input_id);
```
- To enable dumping of the thread id, set the environment variable
`ORT_DUMP_THREAD_ID=0`.
- Users can disable dumping with the environment variable
`ORT_ENABLE_CPU_DUMP=0`.
### Description
This is a prerequisite step for reducing the complexity of the current
zip-nuget pipeline.
Some packaging tasks can be cut from the most complex NuGet pipeline
and still be published easily.
### Description
Disable using CoreML ML Program for a matmul where one of the inputs is
1D as the CoreML implementation appears to be broken. See
https://github.com/apple/coremltools/issues/2263
Add some debugging notes.
### Motivation and Context
Fix failing test on macos-14.
### Description
- This is the initial PR for the VSINPU execution provider.
### Motivation and Context
- Adds support for VeriSilicon hardware.
- TIM-VX (Tensor Interface Module)
(https://github.com/VeriSilicon/TIM-VX) is an integrated software
solution by VeriSilicon for its hardware (A311D, i.MX 8M Plus, etc.).
The VSINPU execution provider makes it easy to use VeriSilicon's
hardware by connecting onnxruntime to the TIM-VX API.
### Description
Update upstream packaging pipeline name to make it more meaningful.
### Motivation and Context
The upstream pipeline used to build only NuGet packages, but now it
also builds Zip and Java packages, so changing the name makes it more
meaningful.
### Description
1. Update the functions in tensorprotoutils.h to use
std::filesystem::path instead of onnxruntime::Path. Eventually we can
remove the whole onnxruntime::Path class, but to keep this PR small I am
not doing that here.
2. Remove the _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING macro
definition when the TensorRT EP is enabled.
### Description
Bring the Vitis AI EP in sync with the TensorProto data types supported
by ONNX 1.16.
Add an error message when graph resolution fails, to help with troubleshooting.
### Motivation and Context
ONNX 1.15 and 1.16 add support for some new TensorProto data types, such as:
- FLOAT8E4M3FN
- FLOAT8E4M3FNUZ
- FLOAT8E5M2
- FLOAT8E5M2FNUZ
- UINT4
- INT4
---------
Co-authored-by: liumingyue <mingyue@xilinx.com>
### Description
Provide user level options to control the fallback on CPU for models not
supported on Intel's NPU hardware.
### Motivation and Context
- The current OVEP workflow allows a safe fallback from OV NPU to OV CPU
on compilation failures. It also supports MLAS CPU fallback in the
presence of unsupported custom ops.
- This PR provides a build-time option to disable fallback from OV NPU to
OV CPU.
- The session option "kOrtSessionOptionsDisableCPUEPFallback" disables
both OV CPU and MLAS CPU fallback.
- It also includes a bug fix for proto creation.
---------
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
### Description
1. Add QNN UTs for the QNN Pad op with FP16 data on the HTP backend
2. Improve the Pad op builder to handle an invalid optional input
3. Add a UT for ReduceSum with FP16 precision and 5D input to reproduce an issue
### Description
As suggested by SciPy's doc, we will
`Build against NumPy 2.0.0, then it will work for all NumPy versions
with the same major version number (NumPy does maintain backwards ABI
compatibility), and as far back as NumPy 1.19 series at the time of
writing`
I think it works because in
[numpyconfig.h#L64](https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/numpyconfig.h#L64)
there is a macro NPY_FEATURE_VERSION. By default it is set to
NPY_1_19_API_VERSION. And the NPY_FEATURE_VERSION macro controls ABI.
This PR only upgrades the build-time dependency. When a user installs
ONNX Runtime, they can still use numpy 1.x.
### Motivation and Context
Recently numpy published a new version, 2.0.0, which is incompatible with the latest ONNX Runtime release.
### Description
### Motivation and Context
The pipeline is green even when the Llama2 parity_check fails.
This PR should be merged after the exception below is resolved.
```
2024-06-25 03:49:43.621298481 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17}
An error occurred while verifying parity: Error in execution: Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17}
Traceback (most recent call last):
  File "/workspace/onnxruntime/python/tools/transformers/models/llama/convert_to_onnx.py", line 1043, in main
    parity_check(parity_cmd)
  File "/workspace/onnxruntime/python/tools/transformers/models/llama/llama_parity.py", line 298, in main
    verify_parity(args, location, use_auth_token, kv_cache_ortvalues, pytorch_model=llama, config=config)
  File "/workspace/onnxruntime/python/tools/transformers/models/llama/llama_parity.py", line 137, in verify_parity
    ort_model.run_with_iobinding(io_binding)
  File "/home/onnxruntimedev/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 331, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17}
```
The exception appears to be caused by #19832.
ONNX's Expand supports bidirectional broadcasting, while WebNN's expand
op only supports unidirectional broadcasting. Thus we should calculate
the output shape for the 'newShape' input of WebNN's expand op.
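A minimal sketch of that output-shape calculation follows. It applies the usual right-aligned broadcasting rule to the input shape and the requested shape; the function name is hypothetical and this is not the code used in the WebNN EP.

```python
from itertools import zip_longest
from typing import List, Sequence


def bidirectional_broadcast_shape(
    input_shape: Sequence[int], target_shape: Sequence[int]
) -> List[int]:
    """Compute the broadcast result of two shapes, aligning from the right."""
    result = []
    pairs = zip_longest(reversed(input_shape), reversed(target_shape), fillvalue=1)
    for a, b in pairs:
        if a != b and a != 1 and b != 1:
            raise ValueError(
                f"cannot broadcast {list(input_shape)} with {list(target_shape)}"
            )
        # A dimension of 1 stretches to match the other side.
        result.append(b if a == 1 else a)
    return result[::-1]
```

For instance, an input of shape `[3, 1]` expanded with an ONNX `shape` of `[2, 1, 6]` yields `[2, 3, 6]`. Passing that computed shape as `newShape` keeps the WebNN call unidirectional, because the input can always be broadcast to it.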
### Description
The split op uses pin_memory when splitting into different sizes,
but pin_memory cannot be used with cudagraph.
Add a new implementation for transformer scenarios only: it splits
qkv_proj into q, k, v without using pin_memory.
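The slicing arithmetic behind such a split is simple because the three sizes are fixed and known up front. The sketch below shows it on a single row using plain Python lists. The function name and the assumption that k and v share one size are illustrative only, and it does not reflect the actual kernel implementation.

```python
from typing import List, Sequence, Tuple


def split_qkv(
    row: Sequence[float], q_size: int, kv_size: int
) -> Tuple[List[float], List[float], List[float]]:
    """Split one fused qkv_proj row into q, k, v using fixed offsets."""
    if len(row) != q_size + 2 * kv_size:
        raise ValueError("row length must equal q_size + 2 * kv_size")
    # Offsets depend only on the static sizes, so no per-call size list
    # has to be built and copied.
    q = list(row[:q_size])
    k = list(row[q_size:q_size + kv_size])
    v = list(row[q_size + kv_size:])
    return q, k, v
```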
### Description
Setting the log level after environment creation is too late in some
cases.
If the DML EP is enabled, it will create a composite sink with the
original logger using the creation-time log severity, as well as an
additional ETW sink. Because the composite sink saves the current
severity level for each sink, you cannot get verbose log output to
stdout even if you set that at the session level.
I don't know enough about the setup that combines ETW with the original
sink to say whether we should also be updating the severity of
individual sinks in the combined sink, so this change is limited to
making the unit tests behave in the expected manner when the default log
severity is set in the background and not directly controlled.
### Motivation and Context
Make it possible to get verbose output to stdout when the DML EP is
enabled.
### Description
Skip softmax BF16 test for ROCm, because BFloat16 is unsupported by
MIOpen, and `torch.cuda.is_available()` also returns `True` for ROCm.
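Since `torch.cuda.is_available()` cannot tell the two apart, one way a test could distinguish ROCm from CUDA is to look at `torch.version.hip`, which is `None` on non-ROCm builds. The sketch below is an assumption about how such a check might look, not necessarily what this PR does. The helper takes the module as a parameter so it can be exercised without PyTorch installed, and the two `SimpleNamespace` objects are stand-ins, not real torch modules.

```python
from types import SimpleNamespace


def is_rocm_build(torch_module) -> bool:
    """Return True when the given torch-like module reports a HIP (ROCm) build."""
    version = getattr(torch_module, "version", None)
    return getattr(version, "hip", None) is not None


# Stand-ins that mimic the relevant attributes of a CUDA and a ROCm build.
fake_cuda_torch = SimpleNamespace(version=SimpleNamespace(hip=None, cuda="12.1"))
fake_rocm_torch = SimpleNamespace(version=SimpleNamespace(hip="6.0", cuda=None))
```

In a real test this would be called as `is_rocm_build(torch)` and used to skip the BF16 softmax case.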