Commit graph

11329 commits

Author SHA1 Message Date
Xu Xing
92a8407b39
[js/webgpu] Remove unnecessary initialization of var (#21312)
This var has been initialized to 0 in tint, so no need extra loop to do
it again:
```
  float tint_symbol_52[1][4] = (float[1][4])0;
  {
    for(int tint_symbol_53 = 0; (tint_symbol_53 < 1); tint_symbol_53 = (tint_symbol_53 + 1)) {
      {
        for(int tint_symbol_54 = 0; (tint_symbol_54 < 4); tint_symbol_54 = (tint_symbol_54 + 1)) {
          tint_symbol_52[min(uint(tint_symbol_53), 0u)][min(uint(tint_symbol_54), 3u)] = 0.0f;
        }
      }
    }
  }
```
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-12 12:34:34 -07:00
Yi Zhang
f2ebd1cd6b
[Fix] Exception in iosDynamicFramework Post-Merge workflow (#21262)
### Description
the exception was caused by
3dd6fcc089


Why I add skip_macos_test
because there's new an exception in
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1425579&view=logs&j=c90c5af3-67d5-5936-5a62-71c93ebfca65&t=01038f35-8e78-5801-1aa1-d9647bb65858
```
2024-07-05T14:41:09.3864740Z mkdir -p /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest/Contents/Frameworks
2024-07-05T14:41:09.3933430Z mkdir: /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest: Operation not permitted
2024-07-05T14:41:09.3996760Z /var/folders/0f/b0mzpg5d31z074x3z5lzkdxc0000gn/T/tmp97ycvwq5/apple_package_test/Pods/Target Support Files/Pods-macos_package_testUITests/Pods-macos_package_testUITests-frameworks.sh: line 7: realpath: command not found
2024-07-05T14:41:09.4003170Z :18: error: Unexpected failure
2024-07-05T14:41:11.1323470Z error: Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest (in target 'macos_package_testUITests' from project 'apple_package_test')
2024-07-05T14:41:11.1325620Z 
2024-07-05T14:41:11.8731110Z 
2024-07-05T14:41:11.8733040Z Test session results, code coverage, and logs:
2024-07-05T14:41:11.8734820Z 	/Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Logs/Test/Test-macos_package_test-2024.07.05_14-40-38-+0000.xcresult
2024-07-05T14:41:11.8735530Z 
2024-07-05T14:41:11.8906210Z Testing failed:
2024-07-05T14:41:11.8911060Z 	Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest
2024-07-05T14:41:11.8912570Z 	Unexpected failure
2024-07-05T14:41:11.8913690Z 	Testing cancelled because the build failed.
2024-07-05T14:41:11.8914380Z 
2024-07-05T14:41:11.8914970Z ** TEST FAILED **
2024-07-05T14:41:11.8915480Z 
2024-07-05T14:41:11.8915780Z 
2024-07-05T14:41:11.8916750Z The following build commands failed:
2024-07-05T14:41:11.8919280Z 	PhaseScriptExecution [CP]\ Embed\ Pods\ Frameworks /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Intermediates.noindex/apple_package_test.build/Debug/macos_package_testUITests.build/Script-059136A7770CA5376C30F2FD.sh (in target 'macos_package_testUITests' from project 'apple_package_test')
2024-07-05T14:41:11.8922180Z (1 failure)
```

And I find macos test is skipped in

9ef28f092f/tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml (L119-L127)
as well.
Maybe it is an known issue.
2024-07-12 09:24:12 -07:00
Ted Themistokleous
4ac4cd2668
Migraphx ep windows build (#21284)
### Description
Repeat of #21084 with removal of policy CMP0144 to suppress warnings
which uses CMake 3.27.0.


### Motivation and Context


Already approved PR: 
https://github.com/microsoft/onnxruntime/pull/21084

Removed the added policy from CMake 3.27.0.
2024-07-11 21:21:38 -07:00
mingyueliuh
42b7cedb06
[VitisAI] custom op support multiple outputs (#21280)
### Description
The implementation inside EP requires registering some custom ops which are only used in the model compilation phase. Currently only single output is supported.



### Motivation and Context
Now the demand upgrade requires support for multiple outputs, so the shaper infer of ep custom op needs to be extended to support multiple outputs

---------

Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
2024-07-11 16:04:18 -07:00
Qingnan Duan
80b56feb41
Implement FlashAttention for CPU (#20805)
### Description
Implement [FlashAttention](https://arxiv.org/pdf/2205.14135) and
[FlashAttention-2](https://arxiv.org/pdf/2307.08691) for
MultiHeadAttention on CPU.


### Motivation and Context
Accelerate the execution of MultiHeadAttention.

Current performance: 10ms vs 16ms (com.microsoft.MultiHeadAttention) on
my Linux machine and 10ms vs 38ms (com.microsoft.MultiHeadAttention) on
my Windows machine. May need further optimizations.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Qingnan Duan <qiduan@microsoft.com>
2024-07-11 14:19:59 -07:00
Edward Chen
33e7c7f6ec
Enable Android CI build stages to run in parallel. (#21314)
Enable Android CI build stages to run in parallel to possibly reduce total build time.
2024-07-11 10:09:09 -07:00
Yi Zhang
41ea47be1e
Move QNN nuget package stages out of the big Nuget packaging pipeline. (#21306)
### Description
1. remove QNN stages from the big packaging pipeline
2. Add publish nightly package in the current [QNN Nuget
pipeline](https://dev.azure.com/aiinfra/Lotus/_builddefinitionId=1234])



### Motivation and Context
Reduce the complexity of the big Nuget packaging pipelines.

---------

Co-authored-by: Yi Zhang <your@email.com>
2024-07-11 09:07:23 -07:00
pengwa
88336ffa92
Fix typos - 1st Wave (#21278)
### Description

There are so many typos reported by the review dog, [Optional Lint]
actions (example:
https://github.com/microsoft/onnxruntime/actions/runs/9864564489/job/27239732367),
this PR is to fix some of them.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-07-11 13:35:08 +08:00
pengwa
0a1178add9
Fix lint C++ actions (#21303)
### Description
<!-- Describe your changes. -->


83e0c6b96e
is the last commit having Lint C++ actions pass.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/96bf005e-5815-46d0-ac17-c6094200957c)



4a7eaff1d9
is the first commit let it fail.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/72a9271e-7b4b-40f8-83a5-f28b82c5e726)



Reviewdog/action-cpplint@master changed since that day.
https://github.com/reviewdog/action-cpplint/pull/42/files make
action-cpplint starts using reviewdog release
https://github.com/reviewdog/reviewdog/releases/tag/v0.19.0.


Optional Lint also failed with many typos, should be also related to the
same reason. Let's fix that in different prs.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-11 09:46:41 +08:00
Changming Sun
fe6ef404b5
Enable LTO for Android build (#21243)
### Description
Enable LTO for Android build, which can reduce binary size by 6%.
2024-07-10 18:44:17 -07:00
Sheil Kumar
28af544278
[DirectML] Broadcast NC-dims for Tensors A&B in DynamicQuantizeMatMul (#21298)
### Description
[DirectML] Broadcast NC-dims for Tensors A&B in DynamicQuantizeMatMul
The DynamicQuantizeMatMul allows input tensors in NCHW format, and
DirectML requires that input tensors share the same batch and channel
dimensions. Tensors A and B should be broadcast (if possible) to the
corresponding output NC dims.

### Motivation and Context
Certain models which use DynamicQuantizeMatMul hit a crash when the NC
dims are intended to be broadcast.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-07-10 17:35:47 -07:00
Edward Chen
20cd3394fc
[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation (#21193)
Update AArch64 SQNBitGemm CompInt8 kernels to process matrix in tiles. E.g., computing the output in 2x2 tiles allows us to compute four elements of the output with one read of two rows of A and two columns of B.

Also moved some code around as it was getting big for a single file.
2024-07-10 15:39:26 -07:00
Changming Sun
8749fa381e
Update absl (#21300)
### Description
Our macOS pipeline are failing because of a build error in absl.
However, the bug fix we need is not available in the latest ABSL
release.

Here  is the issue: https://github.com/abseil/abseil-cpp/pull/1536
And here is the fix:
779a3565ac
 
GTests uses ABSL. But this ABSL target also depends on GTest. So, it is
a circular dependency. We should be able to avoid that by avoid building
tests for ABSL. However, the version we are using has a problem with
that: it has cmake target that still depends on GTest even when testing
is disabled.

It's strange that we suddenly hit this problem and it only happens on macOS.
2024-07-10 11:14:15 -07:00
Adrian Lizarraga
5753f8da8c
[QNN EP] Initial INT4 support (#21171)
### Description
- Adds support for int4 quantized weights (per-tensor and per-channel)
on QNN EP
- Adds test script that creates an INT4 qdq model with a Conv
- Adds a unit tests demonstrating accuracy issues.



### Motivation and Context
This is the next step in being able to run models that use 4-bit
quantized weights on QNN EP.
2024-07-10 10:03:53 -07:00
Pavan Goyal
1b82d835d8
[Fix] InterOpNumThreads Session Option for ONNX ReactNative Package (#21263)
### Description
This PR resolves a bug related to setting the **interOpNumThreads**
session option when creating an **ORTSession**. Currently, when the
**interOpNumThreads** option is passed from React Native, the native
module incorrectly sets **intraOpNumThreads** instead of
**interOpNumThreads**.


### Motivation and Context
Since this is a bug, users of the Onnx React Native package may believe
that they are setting **interOpNumThreads** correctly, So this change is
required. Refer to the code snippet below for details
<img width="634" alt="Screenshot 2024-07-05 at 9 28 58 PM"
src="https://github.com/microsoft/onnxruntime/assets/88655321/70a8f216-553a-4f4c-9481-e6871f0e37e6">
2024-07-10 07:00:18 -07:00
Ștefan Talpalaru
1b19045afa
[build] allow MPI on Unix when NCCL is disabled (#21175)
### Description

CMake logic fixed to allow enabling MPI while NCCL is disabled.

### Motivation and Context

MPI is also used on the CPU backend, not only with CUDA, so it makes
sense to decouple it properly from NCCL (which is for dealing with
multiple Nvidia GPUs).
2024-07-09 21:21:40 -07:00
Hann Wang
d28c26a919
[ROCm] fix: obtain AMD GPU memory info through rocm_smi library (#21190)
### Description
Previously ROCMExecutionProvider uses `hipMemGetInfo` to obtain the
sizes of total memory and available memory. However, this API has been
broken since ROCm 5.7. In this PR, we use `rocm_smi` library instead of
`hipMemGetInfo`.


### Motivation and Context

`hipMemGetInfo` API has been broken since ROCm 5.7 and inference with
ROCMExecutionProvider will lead to following errors:

```
HIP failure 1: invalid argument ; GPU=0 ; hostname=4cc4900475fe ; file=/onnxruntime/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=229 ; expr=hipMemGetInfo(&free, &total);
```

MIOpen has a brute-force fix for this
(911e671895/src/hip/handlehip.cpp (L72)).
Instead of hard-coding available memory to 16GB, I suppose we could
obtain memory info through `rocm_smi` library as in this PR.
2024-07-09 20:35:26 -07:00
Chen Feiyue
fffd430091
[VSINPU]Code improvement && Slice/Dropout OP support (#21217)
### Description
- Refactor codes to meet line length limit and guard missing warning
- Add slice/dropout op support
- Move vsinpu ep's cmake settings from onnxruntime_providers.cmake to a
separate file
- Modify apis with param onnxruntime::Path because this kind is replaced
by std:filesystem::path by #20920
2024-07-09 20:14:46 -07:00
Maximilian Müller
cc0de0d526
[Build] Propagate build option for CUDA minimal to TRT (#20695)
### Description

Extend cuda minimal option to TRT provider, as with TRT 10 no linking to
cuDNN is required anymore
.
Besides that with the new engine dump feature it is also possible to
embed an engine in to an ONNX and not ship a builder lib.
In addition to that this has roughly the same deserialization
time/session setup time that using TRT standalone has.

### Motivation and Context

```
exe_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model.onnx


exe_no_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model_ctx.onnx
```
2024-07-09 14:40:04 -07:00
Edward Chen
307b34a820
[NNAPI EP] Track skipped initializer usage (#21286)
Track skipped initializer usage in NNAPI EP to account for usage by other nodes.
2024-07-09 13:43:22 -07:00
Xiang Zhang
1ab162fbca
Fix ETW Sink Initialize unproperly locking (#21226)
### Description
ETW trace logger is fakely registered as initialized_ is marked as true
before the registration is done, causing crashing issue for Lenovo
camera application.

[Bug
42610244](https://microsoft.visualstudio.com/OS/_workitems/edit/42610244):
[Watson Failure] caused by
SVCHOSTGROUP_Camera_INVALID_POINTER_READ_c0000005_onnxruntime.dll!onnxruntime::logging::Logger::Log
2024-07-09 10:55:41 -07:00
Jian Chen
d1c19e79ea
Update OpenVino CI Ubuntu to 22.04 (#21127)
### Description
[Update OpenVino CI Ubuntu to
22.04](312fab5b3f)



### Motivation and Context
Ubuntu 22.04 is needed for linux C++20
2024-07-09 09:56:44 -07:00
Wanming Lin
eeb8fc0931
[WebNN EP] Release WebNN MLGraphBuilder after Compile to free memory (#21200)
This would help release the constants bound by the MLGraphBuilder.
2024-07-09 08:49:58 -07:00
Changming Sun
2c53b4a534
Remove core/common/gsl.h (#20894)
### Description
It might be easier if we just directly include the original gsl headers.
"core/common/gsl.h" is an indirection that doesn't provide extra help.
2024-07-08 18:09:39 -07:00
Enrico Galli
4c3c809bdb
[js/webnn] Enable user-supplied MLContext (#20600)
### Description
This PR enables the API added in #20816 as well as moving context
creation to JS.

### Motivation and Context
In order to enable I/O Binding with the upcoming
[MLBuffer](https://github.com/webmachinelearning/webnn/issues/542) API
in the WebNN specification, we need to share the same `MLContext` across
multiple sessions. This is because `MLBuffer`s are restricted to the
`MLContext` where they were created. This PR enables developers to use
the same `MLContext` across multiple sessions.
2024-07-08 10:19:39 -07:00
Wanming Lin
cd516a1677
[WebNN EP] Remove constraint for conv ops on CPU backend (#21237)
Currently WebNN TFLite backend allows the filter of
conv2d/convTranspose2d be an input. Remove the constraint and operate
necessary transpose/reshape operations for the filter input.
2024-07-08 10:14:43 -07:00
zz002
4a7eaff1d9
[vitisai] Fix build failure introduced by #20920 (#21247)
### Description

Fix build failure introduced by #20920
2024-07-08 05:44:30 -07:00
Jing Fang
83e0c6b96e
Add MatMulNBits shape infer to SymbolicShapeInference (#21246)
### Description
Support MatMulNBits shape infer in SymbolicShapeInference

MatMulNBits's B input is rank-2, so implicit merge does not apply.

### Motivation and Context
[Issue with performing shape inference using symbolic_shape_infer.py
with Phi-3 ONNX Models · Issue #21194 · microsoft/onnxruntime
(github.com)](https://github.com/microsoft/onnxruntime/issues/21194)
2024-07-05 16:24:57 -07:00
KnightYao
9ef28f092f
[Fix Bug] Fp8*Fp8 Run Error (#20911)
Fix fp8*fp8 when input A is e5m2, input B is e4m3 will run error

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-05 17:11:59 +02:00
pengwa
3f6b7430d6
Use cuda memset async (#21216)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-05 17:27:45 +08:00
Baiju Meswani
0bbd061a54
Exclude azure ep from gen_def.cc (#21250)
Addresses python packaging pipeline failure.
2024-07-04 10:50:27 -07:00
Changming Sun
07c429191e
Delete path.h (#21211)
### Description
Delete path.h and replace all occurrences of onnxruntime::Path with
std::filesystem::path.
Previously we couldn't use C++17's std::filesystem because it was not
supported in iOS 12(which was released in 2018). Now we dropped the
support for iOS 12.

### Motivation and Context
To simplify code. For example, if an EP wants to use the Path class, now
it can directly use it without going through a wrapper. And the standard
implementation can handle various path types better. (We didn't take
much consideration on UNC path, "/" as a path separator on Windows,
etc).
2024-07-04 15:54:13 +08:00
kailums
40d4b2ec75
exclude split3inner kernel on rocm ep (#21238)
### Description
There is an issue when using split3inner kernel on rocm-6.0.3, exclude
these code from rocm EP.
2024-07-04 14:32:28 +08:00
Tianlei Wu
7d9b12a2e3
[CPU] SparseAttention op (#21110)
Add SparseAttention cpu implementation.
- [x] Refactoring GQAAttentionBase
- [x] Add SparseAttention implementation
- [x] Add test cases

This is unfused version. Flash attention version will be added later.
2024-07-03 21:51:57 -07:00
Yi Zhang
30b6e82e7d
Make ROCm packaging stages to a single workflow (#21235)
### Description
Make current ROCm packaging stages to a single workflow.
Reduce the possibility of all nightly packages can't be generated by one
failed stage



### Motivation and Context
Our plan is to reduce the complexity of the current zip-nuget pipeline
to improve the stability and performance of nightly packages generation.
ROCm packaging stages has no dependencies with other packaging jobs and
it's the most time-consuming route.
After this change, the most used CPU/CUDA/Mobile packaging workflow
duration can be reduced roughly from 3h20m to 2h30m.
2024-07-04 11:07:04 +08:00
cloudhan
f39ee14b46
Add GQA support for ROCm (#21032) 2024-07-03 14:55:31 +08:00
pengwa
4932e04053
ORTModule GraphTransitionManager (#19007)
### Problem

Currently, the codebase contains some logics pertaining to model
re-export checks and graph_builder reinitialization checks. Ideally,
these operations should function akin to a state machine. However, upon
inspecting the implementation, it becomes apparent that certain states
are checked or set in various scattered locations. This fragmentation
makes it challenging to comprehend when a re-export or re-initialization
will be triggered. For optimal clarity and maintainability, it is
advisable to consolidate these states into a cohesive component, rather
than dispersing them within the current graph execution manager.

Furthermore, the process of model exports and post-export processing for
stage 3 support or memory-efficient gradient management introduces
considerable complexity. To enhance the codebase's structure, it would
be beneficial to extract these intricate functionalities into a
dedicated component, divorcing them from the current graph execution
manager.

As part of the effort to improve the codebase, it's essential to address
inconsistencies in handling input/output flatten/unflatten operations.
Currently, there are several functions performing these operations
recursively, each with slightly different implementations. This
inconsistency leads to varying support for input/output data types and
structures in different parts of the code. To rectify this, the proposed
pull request simplifies these operations into a set of primitive
functions, ensuring uniformity. This not only streamlines the code but
also facilitates the maintenance of consistency when introducing bug
fixes or supporting new data types. One thing to mention here: input
output handling is deeply bound to the graph transition mentioned above,
so it is difficult to make this change separately.

While acknowledging the complexity of these logics, it is reassuring
that the codebase benefits from an extensive suite of unit tests that
cover all possible branches. Despite the intricacies, ensuring the
passage of all tests has been a time-intensive but necessary aspect of
this development effort.

### Design


Introduce `GraphTransitionManager` and put all model export and
post-export processing logics in it.
1. Re-export check
2. Do export
3. Re-post-export process check
4. Do post-export process
5. Return `PostExportProcessedModelInfo`, which contains all the
information we need, to pass to ORT to build gradient graph (currently
we do the same for training or evaluating, but ideally we should not do
it for evaluating, let's keep this behavior as it is now, and make the
change later).
    ```
          # Input names for the pre-gradient-build graph.
# This may be different with the one in ExportedGraph since we may
modify the graph inputs as needed
# for example when memory efficient gradient management is enabled.
self.onnx_graph_input_names: list[str] = onnx_graph_input_names
  
          # A subset of onnx_graph_input_names.
# Input names that require gradients for the pre-gradient-build graph.
self.onnx_graph_input_names_require_grad: list[str] =
onnx_graph_input_names_require_grad
  
# Create symbolic names for each dimension of the graph input (e.g.
onnx_graph_input_names).
# The key is the input name, the value is a dict of {dim_index:
symbolic_dim_name}
# e.g. {"input1": {0: "input1_dim0", 1: "input1_dim1"}, "input2": {0:
"input2_dim0"}}
self.onnx_graph_input_dynamic_axes_map: dict[str, dict[int, str]] =
onnx_graph_input_dynamic_axes_map
  
self.buffer_for_ort_runs: dict[str, torch.Tensor] = OrderedDict()
          self.onnx_graph_input_names_user_defined = (
onnx_graph_input_names_user_defined # The ONNX graph input names
excluding the parameters, buffers.
          )
  
# The ONNX graph input names excluding the parameters, buffers.
self.onnx_graph_input_names_require_grad_user_defined =
onnx_graph_input_names_require_grad_user_defined
  
self._post_export_processed_model: onnx.ModelProto | None =
post_export_processed_model
  
# A function to access the input data from the args and kwargs.
# If it is not None, the length is same as onnx_graph_input_names.
# For i-th input name, we can use the i-th function to get the input
data from args and kwargs.
          self.data_accessor: list[callable] | None = data_accessor
  
          # Used for unflattening the outputs from the ORT forward run.
self.module_forward_output_schema: ORTModelInputOutputSchemaType | None
= module_forward_output_schema```




The `GraphTransitionManager` instance is a property of
`GraphExecutionManager` (e.g. `TrainingManager` or ``InferenceManager),
1. Use
'self._graph_transition_manager.use_cache_or_reconstruct_post_processed_model(inputs,
kwargs)' to check whether the PyTorch module need a re-export or
re-post-export-process.
2. Use
`self._graph_transition_manager._post_export_processed_model_info.construct_inputs`
to construct the list of inputs used for ORT runs.
3. Use
`self._graph_transition_manager._post_export_processed_model_info.restore_outputs(user_outputs)`
to restore the outputs in original PyTorch output structure.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-03 10:53:31 +08:00
Baiju Meswani
116398c1a4
onnxruntime shared lib inside python package (#21223) 2024-07-02 15:37:50 -07:00
Tianlei Wu
7df97f1987
Add debugging helper to dump string, vector and thread id (#21224)
### Description

Add some macro to help print data to console for debugging purpose.

Example usage:
```
int input_id;
vector<int> some_vector;

DUMP_CPU_TENSOR_INIT()
DUMP_CPU_TENSOR("some vector", some_vector);
DUMP_STRING("input_id=", input_id);
```

- To enable dump thread id, set environment variable
`ORT_DUMP_THREAD_ID=0`.
- User can disable dumping by environment variable
`ORT_ENABLE_CPU_DUMP=0`.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-02 11:24:04 -07:00
Yifan Li
7be1d4aad3
[TensorRT EP] Update TRT10.0 deprecated api (#20989)
### Description
<!-- Describe your changes. -->

Note:
* This PR would remove C4996 suppression in
tensorrt_execution_provider.cc only (according to Nvidia, places with
nvinfer.h included need C4996 suppression, when /Zc:__cplusplus is
enabled in ORT win build)
* A follow-up PR will be raised to update deprecated TRT Plugin api
usage.

Here are deprecated apis to be updated in this PR:
| deprecated api | Update |
| ------------------------------------------------------------ |
------------------------------------------------------------ |
|
[kCUBLAS](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a)
| / |
|
[kCUBLAS_LT](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a)
| / |
|
[kCUDNN](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a)
| / |
|
[reallocateOutput](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html#acae6441d4029584cc1c6550917518691)
| Superseded by
[reallocateOutputAsync](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html#aa40eeb891c1dfe4c1bbf1eabe8c705ab)
with cudaStream_t argument |
|
[createExecutionContextWithoutDeviceMemory](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#adc86bcc42b098204997396ef2b1093fb)
| Superseded by
[createExecutionContext()](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#a35de29aa6134165a5b14a537e6d99e82)
with parameter.<br />Check
[ExecutionContextAllocationStrategy::kUSER_MANAGED](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac6251a050df629edfc0ce037fa366503)
for more detail |




### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
TRT deprecated api list:
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/deprecated.html
2024-07-01 22:55:20 -07:00
Yi Zhang
beb2496748
Templatize publishing nuget package (#21199)
### Description
It's the prerequisite step of reducing complexity of current zip-nuget
pipeline.
Some packaging tasks could be cut from the most complex nuget pipline
and easily be published

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-02 09:24:19 +08:00
Scott McKay
8c2689877f
CoreML: Disable 1D ML Program matmul due to bug in coreml (#21186)
### Description
Disable using CoreML ML Program for a matmul where one of the inputs is
1D as the CoreML implementation appears to be broken. See
https://github.com/apple/coremltools/issues/2263

Add some debugging notes. 

### Motivation and Context
Fix failing test on macos-14.
2024-06-29 12:19:51 -07:00
Chen Feiyue
56b36a58ba
Initial PR for VSINPU execution provider (#20903)
### Description
<!-- Describe your changes. -->
-It is an initial PR for VSINPU execution provider



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- For support VeriSilicon hardware
- TIM-VX(Tensor Interface Module)
(https://github.com/VeriSilicon/TIM-VX) is an integrated software
solution by Verisilicon for our hardware(A311D/i.MX 8M Plus etc.)
design, it is easy to use Verisilicon’s hardware by simply connecting
onnxruntime with the TIM-VX API by this VSINPU execution provider.
2024-06-28 21:48:34 -07:00
Jian Chen
9007ede102
Update upstream packaging pipeline name to make it more meaningful. (#21154)
### Description
Update upstream packaging pipeline name to make it more meaningful.



### Motivation and Context
The upstream pipeline used to only building Nuget packages, but now it
also builds Zip and Java. So change the name will make it more
meaningful.
2024-06-28 21:40:09 -07:00
Changming Sun
3a83f8b317
Update the functions in tensorprotoutils.h to use std::filesystem::path instead (#20920)
### Description
1. Update the functions in tensorprotoutils.h to use
std::filesystem::path instead of onnxruntime::Path. Eventually we can
remove the whole onnxruntime::Path class, but to this PR small I am not
doing that.
2. Remove the _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING macro
def when TensorRT EP is enabled.
2024-06-28 20:03:57 -07:00
Jian Chen
0cbe7eec5e
Uppdate nuget to Use Nuget 6.10.x (#21209)
### Description
Uppdate nuget to Use Nuget 6.10.x
2024-06-28 19:49:54 -07:00
mingyueliuh
7e93cd7f8b
[VitisAI] Align TensorProto_DataType with onnx1.16 (#21067)
### Description
Vitis AI EP synchronously supports the TensorProto data types supported
by ONNX 1.16.
Add error message show when graph resolve fail for troubleshooting.


### Motivation and Context
ONNX 1.15 & 1.16 add support some new TensorProto DataType , such as 
- FLOAT8E4M3FN
- FLOAT8E4M3FNUZ
- FLOAT8E5M2
- FLOAT8E5M2FNUZ
- UINT4
- INT4

---------

Co-authored-by: liumingyue <mingyue@xilinx.com>
2024-06-28 17:19:20 -07:00
Preetha Veeramalai
6baaaf5165
OVEP options to disable CPU fallback at compile time (#21166)
### Description
Provide user level options to control the fallback on CPU for models not
supported on Intel's NPU hardware.


### Motivation and Context
- Current workflow of OVEP allows safe fallback from OV NPU to OV CPU on
compilation failures. Also supports MLAS CPU fallback in presence of
unsupported custom ops.
- The PR provides a build-time option to disable fallback from OV NPU to
OV CPU.
- The session Option "kOrtSessionOptionsDisableCPUEPFallback" disables
OV CPU and MLAS CPU fallback.
- Also has bug fix for proto creation.

---------

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
2024-06-28 08:31:02 -07:00
Hector Li
21ad004237
Add QNN UTs for QNN Pad Op with FP16 data on HTP backend (#21142)
### Description
1. Add QNN UTs for QNN Pad Op with FP16 data on HTP backend
2. Improve Pad op builder to handle invalid optional input
3. Add UT for ReduceSum for FP16 precision with 5D for issue reproduce
2024-06-27 22:09:13 -07:00
Yi Zhang
587e92c279
Add FP32 and INT4 test in Llama2 (#21187)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-06-28 06:18:26 +08:00