Reverts microsoft/onnxruntime#21226
Causes any onnxruntime app to hang on Windows ARM64. Our pipelines do
not have the same ETW environment, so we couldn't catch it.

The call to TraceLoggingRegisterEx() recursively calls back into
LazyInitialize():
LazyInitialize() -> TraceLoggingRegisterEx() ->
ORT_TL_EtwEnableCallback() -> Instance() -> LazyInitialize()
The original code got out of the recursive loop by checking the
`initialized_` flag.
Description: ### Description
This is a partial change ported from fajin/qdqmatmulnbitstoolchain. That
branch has issues resolving the web CI.
MatMulNBits is a heavily optimized matmul operation. Currently a MatMul
can be converted to MatMulNBits to speed up the model inference.
However, MatMulNBits is an ORT only op. To make the graph compatible
with ONNX ops and utilize MatMulNBits at the same time, we introduce
Q/DQ support for MatMulNBits.
To convert MatMul ops in a model to MatMulNBits:
1. use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using
QDQ mode.
2. In ORT session, DQ + MatMul is fused to MatMulNBits
#### Note
MatMulNBits assume B weight is uint4. When no zp is provided, zp
defaults to 8, which is different from DQ. DQ defaults zp to 0 when no
zp provided. And DQ supports int4. Therefore some conversions are
introduced during DQ + MatMul --> MatMulNBits step.
#### Perf
Using QDQ format will increase the model initialization time and memory
consumption. With current implement, model init time increased from ~4s
to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB.
The memory increase is due to
1. in optimizer, after transpose the B weight, a in-memory tensor proto
is created using protobuf's arena.
2. in finalize step, when saving initializer and prepacking, ORT arena
is used to create buffers for initializers.
The memory allocated by arenas cannot be fully deallocated.
If disable ORT arena memory allocation, the memory consumptions of both
QDQ format and original format are ~2.2GB.
The time increase is mainly due to multiple memory copy, but can be
further optimized.
### Motivation and Context
Please see description for details.
### Description
Combining android build and test step into one job
### Motivation and Context
Reduce runtime by removing additional machine allocation, and artifact
uploading and downloading.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Supress C4996 deprecated api warning as errors as a walkaround to build
ORT with TRT10.2GA on Windows
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Four apis were recently declared as deprecated, which are being used by
core code of TRT EP.
Temporally suppress deprecated api warnings before updating these apis
### Description
Resolve#21281 and #10589 .
1. Change libonnxruntime.so's SONAME: remove the minor and patch
version.
By default when creating an ELF shared object, linker will set the
file's internal DT_SONAME field to the specified name which is the file
name plus SOVERSION . For example, the file name for our library is
libonnxruntime.so. And by default SOVERSION is the lib's VERSION number,
which is something like 1.19.0. So the DT_SONAME field in
libonnxruntime.so is something like libonnxruntime.so.1.18.0. You can
use readelf tool to examine it.
```
readelf -d libonnxruntime.so | grep SONAME
0x000000000000000e (SONAME) Library soname: [libonnxruntime.so.1.18.0]
```
When an executable is linked with a shared object which has a DT_SONAME
field, then when the executable is run the dynamic linker will attempt
to load the shared object specified by the DT_SONAME field rather than
using the file name(which is libonnxruntime.so) given to the linker.
After this change, the SONAME will be shorten to "libonnxruntime.so.1"
instead.
2. Set default version strings for Windows DLLs, to resolve#10589
- Pass a list of files instead of path separator-delimited string to project.files(). See this issue: https://github.com/gradle/gradle/issues/19817
- Check for host (instead of target) being Windows when using fallback patch program.
### Description
Resolve#21267 . onnxruntime_perf_test does not work properly if the
input model path url is just a single filename without any path
separator. For example,
```
./onnxruntime_perf_test -t 10 model.onnx
```
The problem was introduced in #19196 by me.
# Why so many commits
- Runtime debugging - which is necessary
- Three different approaches to EP context model - as a result testing back and forth
- Windows compatibility issues - this development has been done on Linux for convenience
# "Open" (?) questions
- Full offloading to a specific EP
- Dumping EP context models by EPs vs [by
ONNXRT](e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L725))
- [Node name to pick
nodes](e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L654))
# VitisAI EP made three variant implementations that have respective pros and cons (and of course we can combine them)
## Serialize and cache the list of compute capabilities and the original
ONNX model itself
## In `ComputeCapability()`, serialize and cache the backend compilation cache and the related necessary cache info such as cache dir and cache key
## In `Compile()`, serialize and cache the backend compilation cache and the related necessary cache info such as cache dir and cache key
# EP context model creation
- Precondition
Session option configuration `kOrtSessionOptionEpContextEnable` (aka "ep.context_enable") is enabled.
- Approach 1
- Steps
1. EP creates an ONNX model whose main graph has EP context nodes (i.e., node type is "EPContext").
2. EP implements/overrides `IExecutionProvider::GetEpContextNodes()` method.
3. ONNXRT core creates an EP context model and saves/dumps it.
- `CreateEpContextModel()` in the file "graph_partitioner.cc"
- In `get_ep_context_node()`, `Node::Name()` is used to check whether a node is an EP context node. This limits that EP model creation can only happen in `IExecutionProvider::Compile()`.
- The workaround is (1) not implementing `IExecutionProvider::GetEpContextNodes()` and (2) dumping the EP context model by EP itself.
4. Optionally, EP can also dump the EP context model it created by
iteself.
- Examples
- `QNNExecutionProvider`
- `VitisAIExecutionProvider`
- Approach 2
- Steps
1. EP creates an ONNX model whose main graph has EP context nodes (i.e., node type is "EPContext").
2. EP does NOT implement `IExecutionProvider::GetEpContextNodes()` at all.
3. EP dumps the EP context model it created.
- Examples
- `TensorrtExecutionProvider`
- UPDATES
- TRT EP is switching to leveraging
`IExecutionProvider::GetEpContextNodes()`
- `OpenVINOExecutionProvider` (?)
# What to cache in EP context nodes
- Non Compilation based EPs
- Examples
- `VitisAIExecutionProvider`
- Characteristics
- Heavy lifting work happens in `IExecutionProvider::GetCapability()`.
- Preconditions
- `IExecutionProvider::GetCapability()` is only called once by ONNXRT.
- Cache content
- Serialization of a list of `ComputeCapability`
- Not EP-specific
- Serialized using `onnx::FunctionProto`
- EP-specific cache
- Compilation based EPs
- Examples
- `QNNExecutionProvider`
- `TensorrtExecutionProvider`
- `MIGraphXExecutionProvider`
- `OpenVINOExecutionProvider`
- Cache content
- EP-specific cache
# Requirements
- Offline / AOT compilation of ONNX models with EP context cache
- Compile somewhere, run everywhere
- Pseudo code with brief explanation
```
GenerateCache(original_onnx_file, cache_onnx_file) model_buffer = load(original_onnx_file) --> Load the original ONNX model file
model_buffer = decrypt(model_buffer)
session_options = { kOrtSessionOptionEpContextEnable: true,
kOrtSessionOptionEpContextFilePath: temp_file } --> Set necessary configs
Ort::CreateSessionFromArray(model_buffer, session_options) --> The new ONNX model with EP context is created and dumped into the user specified file "temp_file"
temp_buffer = encrypt(temp_file)
write(temp_buffer, cache_onnx_file) --> Write the encypted context of "temp_file" into the "cache_onnx_file" file
InitializeInferenceSession(cache_onnx_file)
model_buffer = load(cache_onnx_file) --> Load the ONNX model with EP context from the file generated in the previous step
model_buffer = decrypt(model_buffer)
session_options = { }
Ort::CreateSessionFromArray(model_buffer, session_options) --> Create and initalize an session with the EP context model
```
- Python code with comments
- EP context model creation
```python
import onnxruntime as onnxrt
# Session options for creating an ONNX model with EP context cache.
sess_opts = onnxrt.SessionOptions()
# Verbose.
sess_opts.log_severity_level = 0
# This is REQUIRED.
sess_opts.add_session_config_entry("ep.context_enable", "1")
# This is OPTIONAL.
# Either an absolute path (preferred for now) or a relative path (WIP)
is okay.
# sess_opts.add_session_config_entry("ep.context_file_path",
"/some/path/to/original_model_ctx.onnx")
# This is OPTIONAL.
sess_opts.add_session_config_entry("ep.context_embed_mode", "1")
orig_model_location = "/some/path/to/original_model.onnx"
sess = onnxrt.InferenceSession(orig_model_location, sess_opts,
providers=["VitisAIExecutionProvider"], provider_options=[])
```
- Inference run with an EP context model
```python
import onnxruntime as onnxrt
# Session options for creating an ONNX model with EP context cache.
sess_opts = onnxrt.SessionOptions()
# Default EP context model path.
# ep_ctx_model_location = "/some/path/to/origina_model.onnx_ctx.onnx"
# User configured EP context model path.
ep_ctx_model_location = "/some/path/to/origina_model_ctx.onnx"
sess = onnxrt.InferenceSession(ep_ctx_model_location, sess_opts,
providers=["VitisAIExecutionProvider"], provider_options=[])
model_inputs = {}
run_opts = onnxrt.RunOptions()
# Verbose.
run_opts.log_severity_level = 1
sess.run(None, model_inputs, run_opts)
```
---------
Co-authored-by: Glen Cao <glen@Glens-MacBook-Air.local>
This var has been initialized to 0 in tint, so no need extra loop to do
it again:
```
float tint_symbol_52[1][4] = (float[1][4])0;
{
for(int tint_symbol_53 = 0; (tint_symbol_53 < 1); tint_symbol_53 = (tint_symbol_53 + 1)) {
{
for(int tint_symbol_54 = 0; (tint_symbol_54 < 4); tint_symbol_54 = (tint_symbol_54 + 1)) {
tint_symbol_52[min(uint(tint_symbol_53), 0u)][min(uint(tint_symbol_54), 3u)] = 0.0f;
}
}
}
}
```
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
the exception was caused by
3dd6fcc089
Why I add skip_macos_test
because there's new an exception in
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1425579&view=logs&j=c90c5af3-67d5-5936-5a62-71c93ebfca65&t=01038f35-8e78-5801-1aa1-d9647bb65858
```
2024-07-05T14:41:09.3864740Z mkdir -p /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest/Contents/Frameworks
2024-07-05T14:41:09.3933430Z mkdir: /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest: Operation not permitted
2024-07-05T14:41:09.3996760Z /var/folders/0f/b0mzpg5d31z074x3z5lzkdxc0000gn/T/tmp97ycvwq5/apple_package_test/Pods/Target Support Files/Pods-macos_package_testUITests/Pods-macos_package_testUITests-frameworks.sh: line 7: realpath: command not found
2024-07-05T14:41:09.4003170Z :18: error: Unexpected failure
2024-07-05T14:41:11.1323470Z error: Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest (in target 'macos_package_testUITests' from project 'apple_package_test')
2024-07-05T14:41:11.1325620Z
2024-07-05T14:41:11.8731110Z
2024-07-05T14:41:11.8733040Z Test session results, code coverage, and logs:
2024-07-05T14:41:11.8734820Z /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Logs/Test/Test-macos_package_test-2024.07.05_14-40-38-+0000.xcresult
2024-07-05T14:41:11.8735530Z
2024-07-05T14:41:11.8906210Z Testing failed:
2024-07-05T14:41:11.8911060Z Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest
2024-07-05T14:41:11.8912570Z Unexpected failure
2024-07-05T14:41:11.8913690Z Testing cancelled because the build failed.
2024-07-05T14:41:11.8914380Z
2024-07-05T14:41:11.8914970Z ** TEST FAILED **
2024-07-05T14:41:11.8915480Z
2024-07-05T14:41:11.8915780Z
2024-07-05T14:41:11.8916750Z The following build commands failed:
2024-07-05T14:41:11.8919280Z PhaseScriptExecution [CP]\ Embed\ Pods\ Frameworks /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Intermediates.noindex/apple_package_test.build/Debug/macos_package_testUITests.build/Script-059136A7770CA5376C30F2FD.sh (in target 'macos_package_testUITests' from project 'apple_package_test')
2024-07-05T14:41:11.8922180Z (1 failure)
```
And I find macos test is skipped in
9ef28f092f/tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml (L119-L127)
as well.
Maybe it is an known issue.
### Description
Repeat of #21084 with removal of policy CMP0144 to suppress warnings
which uses CMake 3.27.0.
### Motivation and Context
Already approved PR:
https://github.com/microsoft/onnxruntime/pull/21084
Removed the added policy from CMake 3.27.0.
### Description
The implementation inside EP requires registering some custom ops which are only used in the model compilation phase. Currently only single output is supported.
### Motivation and Context
Now the demand upgrade requires support for multiple outputs, so the shaper infer of ep custom op needs to be extended to support multiple outputs
---------
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
### Description
Implement [FlashAttention](https://arxiv.org/pdf/2205.14135) and
[FlashAttention-2](https://arxiv.org/pdf/2307.08691) for
MultiHeadAttention on CPU.
### Motivation and Context
Accelerate the execution of MultiHeadAttention.
Current performance: 10ms vs 16ms (com.microsoft.MultiHeadAttention) on
my Linux machine and 10ms vs 38ms (com.microsoft.MultiHeadAttention) on
my Windows machine. May need further optimizations.
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Qingnan Duan <qiduan@microsoft.com>
### Description
1. remove QNN stages from the big packaging pipeline
2. Add publish nightly package in the current [QNN Nuget
pipeline](https://dev.azure.com/aiinfra/Lotus/_builddefinitionId=1234])
### Motivation and Context
Reduce the complexity of the big Nuget packaging pipelines.
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
There are so many typos reported by the review dog, [Optional Lint]
actions (example:
https://github.com/microsoft/onnxruntime/actions/runs/9864564489/job/27239732367),
this PR is to fix some of them.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
[DirectML] Broadcast NC-dims for Tensors A&B in DynamicQuantizeMatMul
The DynamicQuantizeMatMul allows input tensors in NCHW format, and
DirectML requires that input tensors share the same batch and channel
dimensions. Tensors A and B should be broadcast (if possible) to the
corresponding output NC dims.
### Motivation and Context
Certain models which use DynamicQuantizeMatMul hit a crash when the NC
dims are intended to be broadcast.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Update AArch64 SQNBitGemm CompInt8 kernels to process matrix in tiles. E.g., computing the output in 2x2 tiles allows us to compute four elements of the output with one read of two rows of A and two columns of B.
Also moved some code around as it was getting big for a single file.
### Description
Our macOS pipeline are failing because of a build error in absl.
However, the bug fix we need is not available in the latest ABSL
release.
Here is the issue: https://github.com/abseil/abseil-cpp/pull/1536
And here is the fix:
779a3565ac
GTests uses ABSL. But this ABSL target also depends on GTest. So, it is
a circular dependency. We should be able to avoid that by avoid building
tests for ABSL. However, the version we are using has a problem with
that: it has cmake target that still depends on GTest even when testing
is disabled.
It's strange that we suddenly hit this problem and it only happens on macOS.
### Description
- Adds support for int4 quantized weights (per-tensor and per-channel)
on QNN EP
- Adds test script that creates an INT4 qdq model with a Conv
- Adds a unit tests demonstrating accuracy issues.
### Motivation and Context
This is the next step in being able to run models that use 4-bit
quantized weights on QNN EP.
### Description
This PR resolves a bug related to setting the **interOpNumThreads**
session option when creating an **ORTSession**. Currently, when the
**interOpNumThreads** option is passed from React Native, the native
module incorrectly sets **intraOpNumThreads** instead of
**interOpNumThreads**.
### Motivation and Context
Since this is a bug, users of the Onnx React Native package may believe
that they are setting **interOpNumThreads** correctly, So this change is
required. Refer to the code snippet below for details
<img width="634" alt="Screenshot 2024-07-05 at 9 28 58 PM"
src="https://github.com/microsoft/onnxruntime/assets/88655321/70a8f216-553a-4f4c-9481-e6871f0e37e6">
### Description
CMake logic fixed to allow enabling MPI while NCCL is disabled.
### Motivation and Context
MPI is also used on the CPU backend, not only with CUDA, so it makes
sense to decouple it properly from NCCL (which is for dealing with
multiple Nvidia GPUs).
### Description
Previously ROCMExecutionProvider uses `hipMemGetInfo` to obtain the
sizes of total memory and available memory. However, this API has been
broken since ROCm 5.7. In this PR, we use `rocm_smi` library instead of
`hipMemGetInfo`.
### Motivation and Context
`hipMemGetInfo` API has been broken since ROCm 5.7 and inference with
ROCMExecutionProvider will lead to following errors:
```
HIP failure 1: invalid argument ; GPU=0 ; hostname=4cc4900475fe ; file=/onnxruntime/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=229 ; expr=hipMemGetInfo(&free, &total);
```
MIOpen has a brute-force fix for this
(911e671895/src/hip/handlehip.cpp (L72)).
Instead of hard-coding available memory to 16GB, I suppose we could
obtain memory info through `rocm_smi` library as in this PR.
### Description
- Refactor codes to meet line length limit and guard missing warning
- Add slice/dropout op support
- Move vsinpu ep's cmake settings from onnxruntime_providers.cmake to a
separate file
- Modify apis with param onnxruntime::Path because this kind is replaced
by std:filesystem::path by #20920
### Description
Extend cuda minimal option to TRT provider, as with TRT 10 no linking to
cuDNN is required anymore
.
Besides that with the new engine dump feature it is also possible to
embed an engine in to an ONNX and not ship a builder lib.
In addition to that this has roughly the same deserialization
time/session setup time that using TRT standalone has.
### Motivation and Context
```
exe_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model.onnx
exe_no_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model_ctx.onnx
```
### Description
ETW trace logger is fakely registered as initialized_ is marked as true
before the registration is done, causing crashing issue for Lenovo
camera application.
[Bug
42610244](https://microsoft.visualstudio.com/OS/_workitems/edit/42610244):
[Watson Failure] caused by
SVCHOSTGROUP_Camera_INVALID_POINTER_READ_c0000005_onnxruntime.dll!onnxruntime::logging::Logger::Log
### Description
It might be easier if we just directly include the original gsl headers.
"core/common/gsl.h" is an indirection that doesn't provide extra help.
### Description
This PR enables the API added in #20816 as well as moving context
creation to JS.
### Motivation and Context
In order to enable I/O Binding with the upcoming
[MLBuffer](https://github.com/webmachinelearning/webnn/issues/542) API
in the WebNN specification, we need to share the same `MLContext` across
multiple sessions. This is because `MLBuffer`s are restricted to the
`MLContext` where they were created. This PR enables developers to use
the same `MLContext` across multiple sessions.
Currently WebNN TFLite backend allows the filter of
conv2d/convTranspose2d be an input. Remove the constraint and operate
necessary transpose/reshape operations for the filter input.
### Description
Support MatMulNBits shape infer in SymbolicShapeInference
MatMulNBits's B input is rank-2, so implicit merge does not apply.
### Motivation and Context
[Issue with performing shape inference using symbolic_shape_infer.py
with Phi-3 ONNX Models · Issue #21194 · microsoft/onnxruntime
(github.com)](https://github.com/microsoft/onnxruntime/issues/21194)
Fix fp8*fp8 when input A is e5m2, input B is e4m3 will run error
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Delete path.h and replace all occurrences of onnxruntime::Path with
std::filesystem::path.
Previously we couldn't use C++17's std::filesystem because it was not
supported in iOS 12(which was released in 2018). Now we dropped the
support for iOS 12.
### Motivation and Context
To simplify code. For example, if an EP wants to use the Path class, now
it can directly use it without going through a wrapper. And the standard
implementation can handle various path types better. (We didn't take
much consideration on UNC path, "/" as a path separator on Windows,
etc).
Add SparseAttention cpu implementation.
- [x] Refactoring GQAAttentionBase
- [x] Add SparseAttention implementation
- [x] Add test cases
This is unfused version. Flash attention version will be added later.
### Description
Make current ROCm packaging stages to a single workflow.
Reduce the possibility of all nightly packages can't be generated by one
failed stage
### Motivation and Context
Our plan is to reduce the complexity of the current zip-nuget pipeline
to improve the stability and performance of nightly packages generation.
ROCm packaging stages has no dependencies with other packaging jobs and
it's the most time-consuming route.
After this change, the most used CPU/CUDA/Mobile packaging workflow
duration can be reduced roughly from 3h20m to 2h30m.
### Problem
Currently, the codebase contains some logics pertaining to model
re-export checks and graph_builder reinitialization checks. Ideally,
these operations should function akin to a state machine. However, upon
inspecting the implementation, it becomes apparent that certain states
are checked or set in various scattered locations. This fragmentation
makes it challenging to comprehend when a re-export or re-initialization
will be triggered. For optimal clarity and maintainability, it is
advisable to consolidate these states into a cohesive component, rather
than dispersing them within the current graph execution manager.
Furthermore, the process of model exports and post-export processing for
stage 3 support or memory-efficient gradient management introduces
considerable complexity. To enhance the codebase's structure, it would
be beneficial to extract these intricate functionalities into a
dedicated component, divorcing them from the current graph execution
manager.
As part of the effort to improve the codebase, it's essential to address
inconsistencies in handling input/output flatten/unflatten operations.
Currently, there are several functions performing these operations
recursively, each with slightly different implementations. This
inconsistency leads to varying support for input/output data types and
structures in different parts of the code. To rectify this, the proposed
pull request simplifies these operations into a set of primitive
functions, ensuring uniformity. This not only streamlines the code but
also facilitates the maintenance of consistency when introducing bug
fixes or supporting new data types. One thing to mention here: input
output handling is deeply bound to the graph transition mentioned above,
so it is difficult to make this change separately.
While acknowledging the complexity of these logics, it is reassuring
that the codebase benefits from an extensive suite of unit tests that
cover all possible branches. Despite the intricacies, ensuring the
passage of all tests has been a time-intensive but necessary aspect of
this development effort.
### Design
Introduce `GraphTransitionManager` and put all model export and
post-export processing logics in it.
1. Re-export check
2. Do export
3. Re-post-export process check
4. Do post-export process
5. Return `PostExportProcessedModelInfo`, which contains all the
information we need, to pass to ORT to build gradient graph (currently
we do the same for training or evaluating, but ideally we should not do
it for evaluating, let's keep this behavior as it is now, and make the
change later).
```
# Input names for the pre-gradient-build graph.
# This may be different with the one in ExportedGraph since we may
modify the graph inputs as needed
# for example when memory efficient gradient management is enabled.
self.onnx_graph_input_names: list[str] = onnx_graph_input_names
# A subset of onnx_graph_input_names.
# Input names that require gradients for the pre-gradient-build graph.
self.onnx_graph_input_names_require_grad: list[str] =
onnx_graph_input_names_require_grad
# Create symbolic names for each dimension of the graph input (e.g.
onnx_graph_input_names).
# The key is the input name, the value is a dict of {dim_index:
symbolic_dim_name}
# e.g. {"input1": {0: "input1_dim0", 1: "input1_dim1"}, "input2": {0:
"input2_dim0"}}
self.onnx_graph_input_dynamic_axes_map: dict[str, dict[int, str]] =
onnx_graph_input_dynamic_axes_map
self.buffer_for_ort_runs: dict[str, torch.Tensor] = OrderedDict()
self.onnx_graph_input_names_user_defined = (
onnx_graph_input_names_user_defined # The ONNX graph input names
excluding the parameters, buffers.
)
# The ONNX graph input names excluding the parameters, buffers.
self.onnx_graph_input_names_require_grad_user_defined =
onnx_graph_input_names_require_grad_user_defined
self._post_export_processed_model: onnx.ModelProto | None =
post_export_processed_model
# A function to access the input data from the args and kwargs.
# If it is not None, the length is same as onnx_graph_input_names.
# For i-th input name, we can use the i-th function to get the input
data from args and kwargs.
self.data_accessor: list[callable] | None = data_accessor
# Used for unflattening the outputs from the ORT forward run.
self.module_forward_output_schema: ORTModelInputOutputSchemaType | None
= module_forward_output_schema```
The `GraphTransitionManager` instance is a property of
`GraphExecutionManager` (e.g. `TrainingManager` or ``InferenceManager),
1. Use
'self._graph_transition_manager.use_cache_or_reconstruct_post_processed_model(inputs,
kwargs)' to check whether the PyTorch module need a re-export or
re-post-export-process.
2. Use
`self._graph_transition_manager._post_export_processed_model_info.construct_inputs`
to construct the list of inputs used for ORT runs.
3. Use
`self._graph_transition_manager._post_export_processed_model_info.restore_outputs(user_outputs)`
to restore the outputs in original PyTorch output structure.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add some macro to help print data to console for debugging purpose.
Example usage:
```
int input_id;
vector<int> some_vector;
DUMP_CPU_TENSOR_INIT()
DUMP_CPU_TENSOR("some vector", some_vector);
DUMP_STRING("input_id=", input_id);
```
- To enable dump thread id, set environment variable
`ORT_DUMP_THREAD_ID=0`.
- User can disable dumping by environment variable
`ORT_ENABLE_CPU_DUMP=0`.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
It's the prerequisite step of reducing complexity of current zip-nuget
pipeline.
Some packaging tasks could be cut from the most complex nuget pipline
and easily be published
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Disable using CoreML ML Program for a matmul where one of the inputs is
1D as the CoreML implementation appears to be broken. See
https://github.com/apple/coremltools/issues/2263
Add some debugging notes.
### Motivation and Context
Fix failing test on macos-14.