### Description
Replace inline pip install with pip install from requirements*.txt
### Motivation and Context
so that CG (Component Governance) can recognize the dependencies
### Dependency
- [x] https://github.com/microsoft/onnxruntime/pull/21085
### Description
The WebNN spec introduces a new option, `outputDataType`, for the `argMax` and
`argMin` ops. Its default value is `int32`, so we should explicitly set it
to `int64` for the WebNN EP.
Spec CR: "Add outputDataType to argmin/argmax"
https://github.com/webmachinelearning/webnn/pull/730
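For context, a minimal sketch (using the standard onnx helper API) of why the override matters: ONNX defines the ArgMax/ArgMin output as int64, while WebNN's `outputDataType` defaults to int32, so the EP must request int64 explicitly.
```python
# Minimal sketch: the ONNX spec types ArgMax/ArgMin output as int64.
import onnx
from onnx import TensorProto, helper

node = helper.make_node("ArgMax", inputs=["x"], outputs=["y"], axis=1, keepdims=0)
graph = helper.make_graph(
    [node], "argmax_demo",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [2, 3])],
    [helper.make_tensor_value_info("y", TensorProto.INT64, [2])],  # int64 per the ONNX spec
)
onnx.checker.check_model(helper.make_model(graph))
```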
### Description
- [x] Rewrite FusedMHARunnerFP16v2 to make it thread-safe.
- [x] Add multi-threading tests
Previously, the kernel parameters `params` were stored as a member of the mha
runner, which means different threads might change the params at the same time
and impact the other threads.
For example, if batch_size and seq_len were changed by another thread to
larger values in setup(...), a buffer overrun might happen in run(...)
because a kernel could read/write memory out of the range of the allocated
buffers.
In the new implementation, I changed the API and removed the mutable member
variables to make it thread-safe: each caller now passes its own params
instance (for example, a stack local), so concurrent setup(...) and run(...)
calls no longer share state. Below is a summary of the change:
Before:
```
class FusedMHARunnerFP16v2::mhaImpl {
  void setup(int seq_len, int batch_size) {
    // change scalar params
  }
  void run(input, output) {
    // change params for input and output pointers
    // launch kernel using params
  }
  Fused_multihead_attention_params_v2 params;  // mutable, not thread-safe
};
```
After:
```
class FusedMHARunnerFP16v2::FmhaImpl {
  void setup(int seq_len, int batch_size, Fused_multihead_attention_params_v2& params) {
    // change params
  }
  void run(params, input, output) {
    // change params with input and output pointers
    // launch kernel using params
  }
};
```
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/18854
https://github.com/microsoft/onnxruntime/issues/21413
### Description
This is a partial change from
[fajin/qdqmatmulnbitstoolchain](https://github.com/microsoft/onnxruntime/pull/21180).
The original PR is blocked by Web CI failures.
MatMulNBits is a heavily optimized matmul operation. Currently a MatMul
can be converted to MatMulNBits to speed up model inference.
However, MatMulNBits is an ORT-only op. To make the graph compatible
with ONNX ops and utilize MatMulNBits at the same time, we introduce
Q/DQ support for MatMulNBits.
To convert MatMul ops in a model to MatMulNBits:
1. Use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using the
QDQ mode (a sketch follows below).
2. In an ORT session, DQ + MatMul is fused into MatMulNBits.
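A sketch of step 1, assuming the MatMul4BitsQuantizer interface from the ORT quantization tooling; exact class and parameter names may differ from the script's current API.
```python
# Hypothetical sketch of step 1; see matmul_4bits_quantizer.py for the real interface.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("model.onnx")
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()  # rewrites eligible MatMul ops
# Step 2 happens at session load: ORT fuses DQ + MatMul into MatMulNBits.
onnx.save(quantizer.model.model, "model_qdq.onnx")
```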
#### Note
MatMulNBits assumes the B weight is uint4. When no zp is provided, zp
defaults to 8, which is different from DQ: DQ defaults zp to 0 when none is
provided, and DQ supports int4. Therefore some conversions are
introduced during the DQ + MatMul --> MatMulNBits step.
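An illustrative numpy sketch of that zero-point conversion (not the actual fusion code): shifting int4 weights into the uint4 range leaves the dequantized values unchanged.
```python
# Illustrative: map a DQ int4 weight (default zp 0) to the uint4 layout
# MatMulNBits expects (default zp 8); dequantized values must match.
import numpy as np

w_int4 = np.array([-8, -1, 0, 3, 7], dtype=np.int8)  # int4 values from DQ
zp_int4 = 0                                          # DQ default zero point
w_uint4 = (w_int4 + 8).astype(np.uint8)              # shift into [0, 15]
zp_uint4 = zp_int4 + 8                               # MatMulNBits default zero point
scale = 0.1
assert np.allclose((w_int4 - zp_int4) * scale,
                   (w_uint4.astype(np.int8) - zp_uint4) * scale)
```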
#### Perf
Using the QDQ format will increase model initialization time and memory
consumption. With the current implementation, model init time increased from
~4s to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB.
The memory increase is due to:
1. in the optimizer, after transposing the B weight, an in-memory tensor proto
is created using protobuf's arena.
2. in the finalize step, when saving initializers and prepacking, the ORT arena
is used to create buffers for initializers.
The memory allocated by arenas cannot be fully deallocated.
If ORT arena memory allocation is disabled, the memory consumption of both
the QDQ format and the original format is ~2.2GB.
The time increase is mainly due to multiple memory copies, but can be
further optimized.
### Motivation and Context
Please see description for details.
### Description
Add CoreML ML Program Resize
- refactor existing logic to simplify it and share it between the
NeuralNetwork and MLProgram checks
- add handling for some new attributes
  - antialias and axes: these should have been handled when the CoreML EP
max opset was set to 21
### Motivation and Context
Support priority models
### Description
* Add a CUDA provider option `sdpa_kernel` to choose which attention kernel to run, for testing purposes.
* Allow dumping which attention kernel is used per node.
* Reserve a flag for cudnn flash attention, which will be added soon.
#### CUDA provider option sdpa_kernel
Instead of setting an environment variable, we also support setting it as a
provider option. Note that the setting is global within a session. That can
help with performance testing of each kernel.
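A minimal sketch of setting it from Python (the value shown is illustrative; the accepted values for `sdpa_kernel` are defined by the CUDA EP):
```python
# Minimal sketch: select the sdpa kernel via a CUDA provider option.
import onnxruntime as ort

providers = [("CUDAExecutionProvider", {"sdpa_kernel": "1"})]  # value is illustrative
sess = ort.InferenceSession("model.onnx", providers=providers)
```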
#### Attention Kernel Debug Info
Set the environment variable `ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1`,
and ORT will print the sdpa kernel used by each node. For example:
```
ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1 ./onnxruntime_test_all --gtest_filter=MultiHeadAttentionTest*
```
It will show debug information about the kernels used during the tests:
```
[ RUN ] MultiHeadAttentionTest.SelfAttention_Batch2_HeadSize32_NoBias_NoMask_PackedQKV
AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=0 TRT_FUSED_ATTENTION=1 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=1 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1
Operator=MultiHeadAttention Node=node1 DataType=fp16 TRT_FUSED_ATTENTION=1
AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=1 TRT_FUSED_ATTENTION=0 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=0 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1
Operator=MultiHeadAttention Node=node1 DataType=fp16 EFFICIENT_ATTENTION=1
```
In this test case, the debug info shows that one session uses TRT fused
attention and another session uses efficient attention.
### Description
```
# npm audit report
socket.io 3.0.0 - 4.6.2
Severity: high
socket.io has an unhandled 'error' event - https://github.com/advisories/GHSA-25hc-qcg6-38wj
Depends on vulnerable versions of engine.io
fix available via `npm audit fix`
node_modules/socket.io
ws 8.0.0 - 8.17.0
Severity: high
ws affected by a DoS when handling a request with many HTTP headers - https://github.com/advisories/GHSA-3h5v-q93c-6h6q
fix available via `npm audit fix`
node_modules/ws
engine.io 0.7.8 - 0.7.9 || 6.0.0 - 6.5.4
Depends on vulnerable versions of ws
node_modules/engine.io
socket.io-adapter 2.5.2 - 2.5.4
Depends on vulnerable versions of ws
node_modules/socket.io-adapter
4 high severity vulnerabilities
```
### Description
Moves the `Relu -> QuantizeLinear` fusion to Level2 optimizations for
CPU EP only.
### Motivation and Context
See the related PR for motivation and context:
https://github.com/microsoft/onnxruntime/pull/20627
Update the SQNBitGemm ARM NEON kernel to compute a 4x2 tile of output.
Note: Also tried 2x4 and 4x4 tiles, but observed the best microbenchmark results with 4x2 tiles.
### Description
<!-- Describe your changes. -->
* promote TRT version to 10.2.0.19
* EP_Perf CI: clean up configs for legacy TRT<8.6; promote the test env to
trt10.2-cu118/cu125
* skip two tests, as Float8/BF16 are supported by TRT>10.0 but the TRT CI
machines are not hardware-compatible with these types:
```
1: [ FAILED ] 2 tests, listed below:
1: [ FAILED ] IsInfTest.test_isinf_bfloat16
1: [ FAILED ] IsInfTest.test_Float8E4M3FN
```
### Description
<!-- Describe your changes. -->
There is a bug with kernels running on ROCm 6.0, so change the CI docker image
to ROCm 6.1.
For the torch installed in the docker image, switch to the ROCm repo when the
version is not 6.0.
### Description
Enablement of onnxruntime for AIX and fixes for issues related to the
big-endian platform.
### Motivation and Context
The changes in this PR contain:
1. Enablement code for building onnxruntime on the AIX operating system.
2. While testing the build on AIX, we found issues related to the big-endian
platform. More details about a few of those issues can be found in [Big
endian issue: Graph Transformation Attention Fusion tests are failing
#12921](https://github.com/microsoft/onnxruntime/issues/12921).
Below is the list of files and a description of each change.
1. cmake/CMakeLists.txt
[BUILDING on AIX issue] check for "IBMClang" is added for handling
-Wno-unused-parameter
2. cmake/external/onnxruntime_external_deps.cmake
[BUILDING on AIX issue]Enabling gtest_disable_pthreads for AIX
3. cmake/onnxruntime.cmake
[BUILDING on AIX issue]
o Blocking code for AIX that generates generated_source.c and further
requires some symbol files.
o Excluding unsupported linker flags like --Xlinker on AIX
o iconv linking
4. cmake/onnxruntime_framework.cmake
[BUILDING on AIX issue] Excluding -Wl,-rpath='$ORIGIN' on AIX
5. cmake/onnxruntime_mlas.cmake
[BUILDING on AIX issue] POWER10 related macro/function definitions.
6. cmake/onnxruntime_providers_cpu.cmake
[BUILDING on AIX issue] Excluding unsupported linker flags like --Xlinker
on AIX
7. cmake/onnxruntime_unittests.cmake
[BUILDING on AIX issue]
o Excluding unsupported linker flags like --Xlinker on AIX
o Adding required libraries for the AIX linker under applications like
onnxruntime_shared_lib_test, onnxruntime_logging_apis, etc.
8. cmake/patches/flatbuffers/flatbuffers.patch
[BUILDING on AIX issue] Handling of TypeCode in
include/flatbuffers/flatbuffers.h under AIX + clang
9. onnxruntime/contrib_ops/cpu/murmur_hash3.cc
[Big endian issue] Byte-conversion handling in the compute() and getblock()
routines
10. onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc
[Big endian issue] Handling of test failures. Byte swapping for quant_value.
11. onnxruntime/core/framework/tensorprotoutils.cc
[Big endian issue]
Implementation of SetRawDataInTensorProto and ConvertRawDataInTensorProto.
o SetRawDataInTensorProto: wrapper for set_raw_data(); calls
ConvertRawDataInTensorProto() on big-endian systems
o ConvertRawDataInTensorProto: function used mainly on big-endian systems
for byte-swapping of tensor raw_data
12. onnxruntime/core/framework/tensorprotoutils.h
[Big endian issue]
Declaration of SetRawDataInTensorProto, ConvertRawDataInTensorProto
13. onnxruntime/core/graph/graph.cc
[Big endian issue]
o Call ConvertRawDataInTensorProto for SPARSE_TENSOR type
o Call ConvertRawDataInTensorProto for SaveToOrtFormat
14. onnxruntime/core/mlas/lib/platform.cpp
[BUILDING on AIX issue] POWER10 related enablement for AIX
15. onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp
[BUILDING on AIX issue]Handling of __vector under AIX+clang
16. onnxruntime/core/mlas/lib/qgemm.h
[BUILDING on AIX issue] Adding _AIX flag
17. onnxruntime/core/mlas/lib/qlmul.cpp
[BUILDING on AIX issue] Handling of __vector under AIX+clang
18. onnxruntime/core/optimizer/attention_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
19. onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
20. onnxruntime/core/optimizer/constant_folding.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
21. onnxruntime/core/optimizer/embed_layer_norm_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
22. onnxruntime/core/optimizer/nchwc_transformer.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
23. onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
24. onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
25. onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
26.
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
27. onnxruntime/core/optimizer/reshape_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
28. onnxruntime/core/optimizer/stft_decomposition.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
29.
onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
30. onnxruntime/core/platform/path_lib.h
[BUILDING on AIX issue] Moving to a normal function call instead of a
template
31. onnxruntime/core/platform/posix/env.cc
[BUILDING on AIX issue] Blocking syscall.h on AIX
32. onnxruntime/core/session/inference_session.cc
[Big endian issue] Removing the ORT_RETURN_IF_NOT(FLATBUFFERS_LITTLEENDIAN) check
33. onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc
[Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer
and ExternalWriteReadWithLoadInitializers
34. onnxruntime/test/framework/sparse_kernels_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
35. onnxruntime/test/framework/tensorutils_test.cc
[Big endian issue] Helper method ConvertEndianessForVector, called from the
required places.
36. onnxruntime/test/framework/test_tensor_loader.cc
o [BUILDING on AIX issue] Handling of getcwd for AIX
o [Big endian issue] Byte swapping in run_external_data_test
37. onnxruntime/test/onnx/main.cc
[Big endian issue] including <thread> for AIX
38. onnxruntime/test/onnx/tensorprotoutils.cc
[Big endian issue] Byte swapping in UnpackTensorWithRawData
39. onnxruntime/test/optimizer/graph_transform_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
40. onnxruntime/test/optimizer/graph_transform_test_builder.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
41. onnxruntime/test/optimizer/graph_transform_test_builder.h
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
42. onnxruntime/test/optimizer/initializer_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
43. onnxruntime/test/optimizer/nchwc_optimizer_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
44. onnxruntime/test/providers/base_tester.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
45. onnxruntime/test/providers/cpu/generator/random_test.cc
[BUILDING on AIX issue] Adding AIX check in MultinomialGoodCase
---------
Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>
The test_flash_attn_rocm.py from
https://github.com/microsoft/onnxruntime/pull/21032 failed frequently.
For example, I saw two failed jobs today:
E Max absolute difference: 0.002167
E Max absolute difference: 0.002686
Adjust the abs threshold from 0.002 to 0.005, and use default relative tolerance rtol=0.001.
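For reference, a numpy-style sketch of the kind of check being relaxed (illustrative; the test may use a different helper). The comparison passes when |actual - expected| <= atol + rtol * |expected|:
```python
# Illustrative tolerance check, not the test's actual code.
import numpy as np

expected = np.array([0.5, -0.25])
actual = expected + 0.0027  # matches the observed max absolute difference
# With the old atol=0.002 this fails, since 0.0027 > 0.002 + 0.001 * |expected|.
np.testing.assert_allclose(actual, expected, rtol=1e-3, atol=5e-3)
```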
There are build errors when building with CUDA 12.5 and
`--cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=ON
onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON`.
Temporarily exclude blkq4_fp16_gemm_sm80_test to unblock the CUDA 12.5 build.
### Description
Revert the wrong change in
https://github.com/microsoft/onnxruntime/pull/20920
### Motivation and Context
It would save the data at a wrong position
### Description
This is a partial change ported from fajin/qdqmatmulnbitstoolchain. That
branch has issues resolving the web CI.
MatMulNBits is a heavily optimized matmul operation. Currently a MatMul
can be converted to MatMulNBits to speed up model inference.
However, MatMulNBits is an ORT-only op. To make the graph compatible
with ONNX ops and utilize MatMulNBits at the same time, we introduce
Q/DQ support for MatMulNBits.
To convert MatMul ops in a model to MatMulNBits:
1. Use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using the
QDQ mode.
2. In an ORT session, DQ + MatMul is fused into MatMulNBits.
#### Note
MatMulNBits assumes the B weight is uint4. When no zp is provided, zp
defaults to 8, which is different from DQ: DQ defaults zp to 0 when none is
provided, and DQ supports int4. Therefore some conversions are
introduced during the DQ + MatMul --> MatMulNBits step.
#### Perf
Using the QDQ format will increase model initialization time and memory
consumption. With the current implementation, model init time increased from
~4s to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB.
The memory increase is due to:
1. in the optimizer, after transposing the B weight, an in-memory tensor proto
is created using protobuf's arena.
2. in the finalize step, when saving initializers and prepacking, the ORT arena
is used to create buffers for initializers.
The memory allocated by arenas cannot be fully deallocated.
If ORT arena memory allocation is disabled, the memory consumption of both
the QDQ format and the original format is ~2.2GB.
The time increase is mainly due to multiple memory copies, but can be
further optimized.
### Motivation and Context
Please see description for details.
Resolves #21204.
To reproduce the issue, build the code with
```
python3 tools/ci_build/build.py --build_dir /tmp/build13 --config Debug --skip_submodule_sync --build_shared_lib --parallel --use_binskim_compliant_compile_flags --build_csharp --enable_onnx_tests --update --build --build_wheel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda --cmake_extra_defines onnxruntime_DISABLE_CONTRIB_OPS=ON onnxruntime_BUILD_UNIT_TESTS=OFF --skip_tests
```
Then run the following python script:
```python
#!/usr/bin/python3
import onnxruntime as ort
providers = ["CUDAExecutionProvider"]
ort_sess = ort.InferenceSession('/data/onnx/opset17/test_gemm_default_no_bias/model.onnx', providers=providers)
```
Without this fix, you will see an error:
Failed to load library libonnxruntime_providers_cuda.so with error:
/tmp/build18/Debug/onnxruntime/capi/libonnxruntime_providers_cuda.so:
undefined symbol:
_ZN11onnxruntime4cuda21BuildKernelCreateInfoINS0_57kCudaExecutionProvider_GridSample_kOnnxDomain_ver16_floatEEENS_16KernelCreateInfoEv
To prevent VitisAI EP build breaks, add a stage to the Windows CPU CI
Pipeline that builds the Vitis AI EP on Windows.
There are no external dependencies for the build. Tests have to be disabled,
though, as the EP has external SW/HW dependencies.
This will at least allow us to prevent the build breaks that have happened
on multiple occasions recently.
Tested in
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1432346&view=results
and it seems to run fine.
Reverts microsoft/onnxruntime#21226
Causes any onnxruntime app to hang on Windows ARM64. Our pipelines do
not have the same ETW environment, so we couldn't catch it.

The call to TraceLoggingRegisterEx() recursively calls back into
LazyInitialize():
LazyInitialize() -> TraceLoggingRegisterEx() -> ORT_TL_EtwEnableCallback() -> Instance() -> LazyInitialize()
The original code got out of the recursive loop by checking the
`initialized_` flag.
### Description
This is a partial change ported from fajin/qdqmatmulnbitstoolchain. That
branch has issues resolving the web CI.
MatMulNBits is a heavily optimized matmul operation. Currently a MatMul
can be converted to MatMulNBits to speed up model inference.
However, MatMulNBits is an ORT-only op. To make the graph compatible
with ONNX ops and utilize MatMulNBits at the same time, we introduce
Q/DQ support for MatMulNBits.
To convert MatMul ops in a model to MatMulNBits:
1. Use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using the
QDQ mode.
2. In an ORT session, DQ + MatMul is fused into MatMulNBits.
#### Note
MatMulNBits assumes the B weight is uint4. When no zp is provided, zp
defaults to 8, which is different from DQ: DQ defaults zp to 0 when none is
provided, and DQ supports int4. Therefore some conversions are
introduced during the DQ + MatMul --> MatMulNBits step.
#### Perf
Using the QDQ format will increase model initialization time and memory
consumption. With the current implementation, model init time increased from
~4s to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB.
The memory increase is due to:
1. in the optimizer, after transposing the B weight, an in-memory tensor proto
is created using protobuf's arena.
2. in the finalize step, when saving initializers and prepacking, the ORT arena
is used to create buffers for initializers.
The memory allocated by arenas cannot be fully deallocated.
If ORT arena memory allocation is disabled, the memory consumption of both
the QDQ format and the original format is ~2.2GB.
The time increase is mainly due to multiple memory copies, but can be
further optimized.
### Motivation and Context
Please see description for details.
### Description
Combine the Android build and test steps into one job.
### Motivation and Context
Reduce runtime by removing an additional machine allocation and the artifact
uploading and downloading.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Suppress the C4996 deprecated-API warnings-as-errors as a workaround to build
ORT with TRT 10.2 GA on Windows.
### Motivation and Context
Four APIs were recently declared deprecated and are used by core code of the
TRT EP.
Temporarily suppress deprecated-API warnings until these APIs are updated.
### Description
Resolves #21281 and #10589.
1. Change libonnxruntime.so's SONAME: remove the minor and patch
version.
By default, when creating an ELF shared object, the linker will set the
file's internal DT_SONAME field to the specified name, which is the file
name plus SOVERSION. For example, the file name for our library is
libonnxruntime.so, and by default SOVERSION is the lib's VERSION number,
which is something like 1.18.0. So the DT_SONAME field in
libonnxruntime.so is something like libonnxruntime.so.1.18.0. You can
use the readelf tool to examine it.
```
readelf -d libonnxruntime.so | grep SONAME
0x000000000000000e (SONAME) Library soname: [libonnxruntime.so.1.18.0]
```
When an executable is linked against a shared object that has a DT_SONAME
field, then when the executable is run, the dynamic linker will attempt
to load the shared object specified by the DT_SONAME field rather than
the file name (libonnxruntime.so) given to the linker.
After this change, the SONAME will be shortened to "libonnxruntime.so.1"
instead.
2. Set default version strings for Windows DLLs, to resolve #10589.
- Pass a list of files instead of path separator-delimited string to project.files(). See this issue: https://github.com/gradle/gradle/issues/19817
- Check for host (instead of target) being Windows when using fallback patch program.
### Description
Resolves #21267. onnxruntime_perf_test does not work properly if the
input model path is just a single filename without any path separator.
For example,
```
./onnxruntime_perf_test -t 10 model.onnx
```
The problem was introduced in #19196 by me.
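An illustrative sketch of the pitfall (not the perf test's actual C++ code): a bare filename has no directory component, so code that derives the test-data directory from the model path must fall back instead of using an empty string.
```python
# Illustrative only: deriving a directory from a model path.
import os

for path in ("model.onnx", "some/dir/model.onnx"):
    model_dir = os.path.dirname(path) or "."  # "" for a bare filename -> fall back to "."
    print(path, "->", model_dir)
```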
# Why so many commits
- Runtime debugging, which was necessary
- Three different approaches to the EP context model, and as a result, testing back and forth
- Windows compatibility issues, since this development was done on Linux for convenience
# "Open" (?) questions
- Full offloading to a specific EP
- Dumping EP context models by EPs vs [by
ONNXRT](e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L725))
- [Node name to pick
nodes](e2abba18ea/onnxruntime/core/framework/graph_partitioner.cc (L654))
# VitisAI EP made three variant implementations that have respective pros and cons (and of course we can combine them)
## Serialize and cache the list of compute capabilities and the original ONNX model itself
## In `ComputeCapability()`, serialize and cache the backend compilation cache and the related necessary cache info such as cache dir and cache key
## In `Compile()`, serialize and cache the backend compilation cache and the related necessary cache info such as cache dir and cache key
# EP context model creation
- Precondition
Session option configuration `kOrtSessionOptionEpContextEnable` (aka "ep.context_enable") is enabled.
- Approach 1
- Steps
1. EP creates an ONNX model whose main graph has EP context nodes (i.e., node type is "EPContext").
2. EP implements/overrides `IExecutionProvider::GetEpContextNodes()` method.
3. ONNXRT core creates an EP context model and saves/dumps it.
- `CreateEpContextModel()` in the file "graph_partitioner.cc"
- In `get_ep_context_node()`, `Node::Name()` is used to check whether a node is an EP context node. This limits that EP model creation can only happen in `IExecutionProvider::Compile()`.
- The workaround is (1) not implementing `IExecutionProvider::GetEpContextNodes()` and (2) dumping the EP context model by EP itself.
4. Optionally, EP can also dump the EP context model it created by itself.
- Examples
- `QNNExecutionProvider`
- `VitisAIExecutionProvider`
- Approach 2
- Steps
1. EP creates an ONNX model whose main graph has EP context nodes (i.e., node type is "EPContext").
2. EP does NOT implement `IExecutionProvider::GetEpContextNodes()` at all.
3. EP dumps the EP context model it created.
- Examples
- `TensorrtExecutionProvider`
- UPDATES
- TRT EP is switching to leveraging
`IExecutionProvider::GetEpContextNodes()`
- `OpenVINOExecutionProvider` (?)
# What to cache in EP context nodes
- Non Compilation based EPs
- Examples
- `VitisAIExecutionProvider`
- Characteristics
- Heavy lifting work happens in `IExecutionProvider::GetCapability()`.
- Preconditions
- `IExecutionProvider::GetCapability()` is only called once by ONNXRT.
- Cache content
- Serialization of a list of `ComputeCapability`
- Not EP-specific
- Serialized using `onnx::FunctionProto`
- EP-specific cache
- Compilation based EPs
- Examples
- `QNNExecutionProvider`
- `TensorrtExecutionProvider`
- `MIGraphXExecutionProvider`
- `OpenVINOExecutionProvider`
- Cache content
- EP-specific cache
# Requirements
- Offline / AOT compilation of ONNX models with EP context cache
- Compile somewhere, run everywhere
- Pseudo code with brief explanation
```
GenerateCache(original_onnx_file, cache_onnx_file):
    model_buffer = load(original_onnx_file)    --> Load the original ONNX model file
    model_buffer = decrypt(model_buffer)
    session_options = { kOrtSessionOptionEpContextEnable: true,
                        kOrtSessionOptionEpContextFilePath: temp_file }    --> Set necessary configs
    Ort::CreateSessionFromArray(model_buffer, session_options)    --> The new ONNX model with EP context is created and dumped into the user-specified file "temp_file"
    temp_buffer = encrypt(temp_file)
    write(temp_buffer, cache_onnx_file)    --> Write the encrypted content of "temp_file" into the "cache_onnx_file" file

InitializeInferenceSession(cache_onnx_file):
    model_buffer = load(cache_onnx_file)    --> Load the ONNX model with EP context from the file generated in the previous step
    model_buffer = decrypt(model_buffer)
    session_options = { }
    Ort::CreateSessionFromArray(model_buffer, session_options)    --> Create and initialize a session with the EP context model
```
- Python code with comments
- EP context model creation
```python
import onnxruntime as onnxrt

# Session options for creating an ONNX model with EP context cache.
sess_opts = onnxrt.SessionOptions()
# Verbose.
sess_opts.log_severity_level = 0
# This is REQUIRED.
sess_opts.add_session_config_entry("ep.context_enable", "1")
# This is OPTIONAL.
# Either an absolute path (preferred for now) or a relative path (WIP) is okay.
# sess_opts.add_session_config_entry("ep.context_file_path", "/some/path/to/original_model_ctx.onnx")
# This is OPTIONAL.
sess_opts.add_session_config_entry("ep.context_embed_mode", "1")

orig_model_location = "/some/path/to/original_model.onnx"
sess = onnxrt.InferenceSession(orig_model_location, sess_opts,
                               providers=["VitisAIExecutionProvider"], provider_options=[])
```
- Inference run with an EP context model
```python
import onnxruntime as onnxrt

# Session options for running an ONNX model with EP context cache.
sess_opts = onnxrt.SessionOptions()
# Default EP context model path.
# ep_ctx_model_location = "/some/path/to/original_model.onnx_ctx.onnx"
# User-configured EP context model path.
ep_ctx_model_location = "/some/path/to/original_model_ctx.onnx"
sess = onnxrt.InferenceSession(ep_ctx_model_location, sess_opts,
                               providers=["VitisAIExecutionProvider"], provider_options=[])

model_inputs = {}
run_opts = onnxrt.RunOptions()
# Verbose.
run_opts.log_severity_level = 1
sess.run(None, model_inputs, run_opts)
```
---------
Co-authored-by: Glen Cao <glen@Glens-MacBook-Air.local>
This var has already been initialized to 0 by tint, so there is no need for an
extra loop to do it again:
```
float tint_symbol_52[1][4] = (float[1][4])0;
{
for(int tint_symbol_53 = 0; (tint_symbol_53 < 1); tint_symbol_53 = (tint_symbol_53 + 1)) {
{
for(int tint_symbol_54 = 0; (tint_symbol_54 < 4); tint_symbol_54 = (tint_symbol_54 + 1)) {
tint_symbol_52[min(uint(tint_symbol_53), 0u)][min(uint(tint_symbol_54), 3u)] = 0.0f;
}
}
}
}
```
### Description
The exception was caused by
3dd6fcc089
Why I added skip_macos_test: there is a new exception in
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1425579&view=logs&j=c90c5af3-67d5-5936-5a62-71c93ebfca65&t=01038f35-8e78-5801-1aa1-d9647bb65858
```
2024-07-05T14:41:09.3864740Z mkdir -p /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest/Contents/Frameworks
2024-07-05T14:41:09.3933430Z mkdir: /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest: Operation not permitted
2024-07-05T14:41:09.3996760Z /var/folders/0f/b0mzpg5d31z074x3z5lzkdxc0000gn/T/tmp97ycvwq5/apple_package_test/Pods/Target Support Files/Pods-macos_package_testUITests/Pods-macos_package_testUITests-frameworks.sh: line 7: realpath: command not found
2024-07-05T14:41:09.4003170Z :18: error: Unexpected failure
2024-07-05T14:41:11.1323470Z error: Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest (in target 'macos_package_testUITests' from project 'apple_package_test')
2024-07-05T14:41:11.1325620Z
2024-07-05T14:41:11.8731110Z
2024-07-05T14:41:11.8733040Z Test session results, code coverage, and logs:
2024-07-05T14:41:11.8734820Z /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Logs/Test/Test-macos_package_test-2024.07.05_14-40-38-+0000.xcresult
2024-07-05T14:41:11.8735530Z
2024-07-05T14:41:11.8906210Z Testing failed:
2024-07-05T14:41:11.8911060Z Sandbox: mkdir(72212) deny(1) file-write-create /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Products/Debug/macos_package_testUITests.xctest
2024-07-05T14:41:11.8912570Z Unexpected failure
2024-07-05T14:41:11.8913690Z Testing cancelled because the build failed.
2024-07-05T14:41:11.8914380Z
2024-07-05T14:41:11.8914970Z ** TEST FAILED **
2024-07-05T14:41:11.8915480Z
2024-07-05T14:41:11.8915780Z
2024-07-05T14:41:11.8916750Z The following build commands failed:
2024-07-05T14:41:11.8919280Z PhaseScriptExecution [CP]\ Embed\ Pods\ Frameworks /Users/runner/Library/Developer/Xcode/DerivedData/apple_package_test-akksnidsbpojopfdqrclgsoqqerv/Build/Intermediates.noindex/apple_package_test.build/Debug/macos_package_testUITests.build/Script-059136A7770CA5376C30F2FD.sh (in target 'macos_package_testUITests' from project 'apple_package_test')
2024-07-05T14:41:11.8922180Z (1 failure)
```
And I found that the macOS test is skipped in
9ef28f092f/tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml (L119-L127)
as well. Maybe it is a known issue.
### Description
Repeat of #21084, with the policy CMP0144 (used to suppress warnings, and
requiring CMake 3.27.0) removed.
### Motivation and Context
Already approved PR:
https://github.com/microsoft/onnxruntime/pull/21084
Removed the added policy from CMake 3.27.0.
### Description
The implementation inside the EP requires registering some custom ops that are only used in the model compilation phase. Currently only a single output is supported.
### Motivation and Context
A demanded upgrade now requires support for multiple outputs, so the shape inference of the EP custom op needs to be extended accordingly.
---------
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
### Description
Implement [FlashAttention](https://arxiv.org/pdf/2205.14135) and
[FlashAttention-2](https://arxiv.org/pdf/2307.08691) for
MultiHeadAttention on CPU.
### Motivation and Context
Accelerate the execution of MultiHeadAttention.
Current performance: 10ms vs 16ms (com.microsoft.MultiHeadAttention) on
my Linux machine and 10ms vs 38ms (com.microsoft.MultiHeadAttention) on
my Windows machine. May need further optimizations.
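The core trick the papers build on, sketched in numpy (a toy single-query version, not ORT's actual kernel): an online softmax that walks K/V in blocks while tracking a running max and normalizer, so the full attention matrix is never materialized.
```python
# Toy single-query streaming attention (online softmax), illustrative only.
import numpy as np

def streaming_attention(q, K, V, block=64):
    m, s = -np.inf, 0.0                  # running max and softmax normalizer
    out = np.zeros(V.shape[1])
    for i in range(0, K.shape[0], block):
        scores = K[i:i + block] @ q      # attention logits for this block
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)        # rescale previously accumulated terms
        p = np.exp(scores - m_new)
        s = s * scale + p.sum()
        out = out * scale + p @ V[i:i + block]
        m = m_new
    return out / s

q, K, V = np.random.randn(8), np.random.randn(100, 8), np.random.randn(100, 8)
w = np.exp(K @ q - (K @ q).max()); w /= w.sum()
assert np.allclose(streaming_attention(q, K, V), w @ V)
```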
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Qingnan Duan <qiduan@microsoft.com>
### Description
1. Remove the QNN stages from the big packaging pipeline.
2. Add nightly package publishing to the current [QNN Nuget
pipeline](https://dev.azure.com/aiinfra/Lotus/_builddefinitionId=1234).
### Motivation and Context
Reduce the complexity of the big Nuget packaging pipelines.
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
There are many typos reported by the reviewdog [Optional Lint]
action (example:
https://github.com/microsoft/onnxruntime/actions/runs/9864564489/job/27239732367);
this PR fixes some of them.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
[DirectML] Broadcast NC-dims for Tensors A&B in DynamicQuantizeMatMul
DynamicQuantizeMatMul allows input tensors in NCHW format, and
DirectML requires that input tensors share the same batch and channel
dimensions. Tensors A and B should be broadcast (if possible) to the
corresponding output NC dims.
### Motivation and Context
Certain models which use DynamicQuantizeMatMul hit a crash when the NC
dims are intended to be broadcast.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Update the AArch64 SQNBitGemm CompInt8 kernels to process the matrix in tiles. E.g., computing the output in 2x2 tiles allows us to compute four elements of the output with one read of two rows of A and two columns of B.
Also moved some code around, as it was getting big for a single file.
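A scalar sketch of the register-tiling idea (illustrative Python, using a 2x2 tile as in the example above; the real kernel uses NEON intrinsics):
```python
# Illustrative 2x2 output tiling: each k-iteration reads two rows of A and
# two columns of B once and updates four accumulators. Assumes M and N even.
def matmul_2x2_tiled(A, B, M, N, K):
    C = [[0.0] * N for _ in range(M)]
    for i in range(0, M, 2):
        for j in range(0, N, 2):
            c00 = c01 = c10 = c11 = 0.0
            for k in range(K):
                a0, a1 = A[i][k], A[i + 1][k]    # one read of two rows of A
                b0, b1 = B[k][j], B[k][j + 1]    # one read of two columns of B
                c00 += a0 * b0; c01 += a0 * b1
                c10 += a1 * b0; c11 += a1 * b1
            C[i][j], C[i][j + 1] = c00, c01
            C[i + 1][j], C[i + 1][j + 1] = c10, c11
    return C
```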
### Description
Our macOS pipelines are failing because of a build error in absl, and the
bug fix we need is not available in the latest ABSL release.
Here is the issue: https://github.com/abseil/abseil-cpp/pull/1536
And here is the fix:
779a3565ac
GTest uses ABSL, but this ABSL target also depends on GTest, so it is
a circular dependency. We should be able to avoid that by not building
ABSL's tests. However, the version we are using has a problem with
that: it has a cmake target that still depends on GTest even when testing
is disabled.
It's strange that we suddenly hit this problem and that it only happens on macOS.
### Description
- Adds support for int4 quantized weights (per-tensor and per-channel)
on QNN EP
- Adds a test script that creates an INT4 QDQ model with a Conv
- Adds unit tests demonstrating accuracy issues.
### Motivation and Context
This is the next step in being able to run models that use 4-bit
quantized weights on QNN EP.
### Description
This PR resolves a bug related to setting the **interOpNumThreads**
session option when creating an **ORTSession**. Currently, when the
**interOpNumThreads** option is passed from React Native, the native
module incorrectly sets **intraOpNumThreads** instead of
**interOpNumThreads**.
### Motivation and Context
Since this is a bug, users of the ONNX Runtime React Native package may believe
that they are setting **interOpNumThreads** correctly, so this change is
required. Refer to the code snippet below for details.
<img width="634" alt="Screenshot 2024-07-05 at 9 28 58 PM"
src="https://github.com/microsoft/onnxruntime/assets/88655321/70a8f216-553a-4f4c-9481-e6871f0e37e6">
### Description
Fixed the CMake logic to allow enabling MPI while NCCL is disabled.
### Motivation and Context
MPI is also used on the CPU backend, not only with CUDA, so it makes
sense to decouple it properly from NCCL (which is for dealing with
multiple Nvidia GPUs).
### Description
Previously, ROCMExecutionProvider used `hipMemGetInfo` to obtain the total
and available memory sizes. However, this API has been broken since
ROCm 5.7. In this PR, we use the `rocm_smi` library instead of
`hipMemGetInfo`.
### Motivation and Context
The `hipMemGetInfo` API has been broken since ROCm 5.7, and inference with
ROCMExecutionProvider leads to the following errors:
```
HIP failure 1: invalid argument ; GPU=0 ; hostname=4cc4900475fe ; file=/onnxruntime/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=229 ; expr=hipMemGetInfo(&free, &total);
```
MIOpen has a brute-force fix for this
(911e671895/src/hip/handlehip.cpp (L72)).
Instead of hard-coding the available memory to 16GB, I suppose we could
obtain the memory info through the `rocm_smi` library, as in this PR.