### Description
**Multi-stream** execution support for **CANN EP**.
### Motivation and Context
**CANN EP** is currently **unavailable** due to the introduction of a
new mechanism for multi-stream execution
[#13495](https://github.com/microsoft/onnxruntime/pull/13495), the
deletion of the Fence-based synchronization mechanism, and the failure
to update the relevant logic of **CANN EP** synchronously.
This PR is to fix it.
### Description
windows update python3.11
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <chasun@chasunlinux.lw3b1xzoyrkuzm34swpscft0ff.dx.internal.cloudapp.net>
### Description
transpose.cc:
Arithmetic overflow: Using operator '-' on a 4 byte value and then
casting the result to a 8 byte value. Cast the value to the wider type
before calling operator '-' to avoid overflow (io.2).
cuda_provider_factory.cc:
The type 'struct onnxruntime::ProviderInfo_CUDA_Impl' with a virtual
function needs either public virtual or protected non-virtual destructor
(c.35).
### Description
<!-- Describe your changes. -->
- Add debug infrastructure to dump out model at various stages of
transpose optimization.
- Handle more scenarios where Transpose -> Reshape can be merged.
- Run L1 optimizers after layout transform to constant fold initializers
that had their layout changed.
- Use cost check for Concat post layout transform as pushing a Transpose
through it can potentially add Transpose nodes to multiple other inputs.
- Update internal testing EP to support test where you want it to take
all nodes, use NHWC layout, and to use dummy static kernels instead of
compiling so the ops in the graph post-initialization can be counted.
- Misc cleanup in InferenceSession to not unnecessarily pass args to
TransposeGraph for class members.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Address perf issue seen with model where a Transpose gets blocked by a
Reshape that could have been treated as a Transpose.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…ML in Win32/MSVC.
### Description
Use onnxruntime::narrow to silence some warnings that are turned into
errors when compiling the DML provider in Win32.
Also one case of warning turned to error for mixing int loop variable
type with a vector size() as upper bound.
### Motivation and Context
Solves
[https://github.com/microsoft/onnxruntime/issues/14595](https://github.com/microsoft/onnxruntime/issues/14595)
Co-authored-by: bengt.gustafsson <bengt.gustafsson@contextvision.se>
### Description
1. Move Linux CPU pipelines to an AMD CPU pool which is cheaper
2. Enable CCache for orttraining pipeline
### Motivation and Context
Azure AMD CPU machines are generally much cheaper than Intel CPU
machines. However, they don't have local disks.
### Description
Pause caching the docker images in pipeline cache in Linux Aten
Pipeline.
### Motivation and Context
We need to work out a better way to reduce the storage.
### Description
- Adds support for newer opset of Reduction operators (ReduceSum,
ReduceMax, ReduceMin, ReduceMean, ReduceProd) with axes as an
initializer input.
- Adds tests for HTP and CPU backends.
### Motivation and Context
Newer opset versions changed the `axes` attribute into an optional
input. This PR adds support for these newer reduction operators as long
as the axes input is defined as an initializer. The goal is to enable more models on QNN.
### Description
Improve Slice to support Onnx opset9 which has starts, ends & axes in
node attributes.
### Motivation and Context
To unblock some models.
### Description
`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and MacOs.
The checks are enforced by the `Python format` workflow.
This PR adopts `lintrunner` to onnxruntime and fixed ~2000 flake8 errors
in Python code. `lintrunner` now runs all required python lints
including `ruff`(replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.
Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.
Lints that are more complicated to fix are applied `# noqa` for now and
should be fixed in follow up PRs.
### Notable changes
1. This PR **removed some suboptimal patterns**:
- `not xxx in` -> `xxx not in` membership checks
- bare excepts (`except:` -> `except Exception`)
- unused imports
The follow up PR will remove:
- `import *`
- mutable values as default in function definitions (`def func(a=[])`)
- more unused imports
- unused local variables
2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.
3. Removed the legacy flake8 ci flow and updated docs.
4. The added workflow supports SARIF code scanning reports on github,
example snapshot:

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Unified linting experience in CI and local.
Replacing https://github.com/microsoft/onnxruntime/pull/14306
---------
Signed-off-by: Justin Chu <justinchu@microsoft.com>
### Description
<!-- Describe your changes. -->
Remove `deletion` of copy functions from `OrtApi` as its initialization
no longer compiles in C++20.
Introduce a non-copyable member to implicitly delete copy ctor.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Inspired by https://github.com/microsoft/onnxruntime/pull/14901
Solution credits: @RyanUnderhill
Cc: @georgthegreat
### Description
Apply the DML graph fusion transformer optimization independently of ORT
graph optimization level.
### Motivation and Context
The DML graph fusion transformer is not a graph optimizer in the normal
sense: it isn't optimizing the ONNX graph structure, but rather fusing
nodes into what will later become a single IDMLCompiledOperator (using
IDMLDevice1::CompileGraph). This transformer can't be done ahead of time
(hence why it's disabled if saving an optimized model), but it's also
gated by the ORT graph optimization level; this makes it impossible to
preoptimize ONNX models ("offline mode") and then later disable graph
optimizations for better startup performance ("online mode") while
benefiting from DML graph fusion.
- Add compute type for Skiplayernorm to fix ROCm CI and get more
accurate results.
SkipLayerNorm:
type T: input, skip, bias
type U: epsilon, compute result
type V: output, beta, gamma
- refactor the usage of aligned_vector, reduce the usage of
`reinterpret_cast`.
### Description
Use build.sourceversion in docker image cache key.
### Motivation and Context
We used filpath as the cache key in #14496.
In most cases, the docker base image tag is latest.
So, the hash of the files couldn't be aware of the change of base image.
As the result, the docker image restored, but the image will still be
rebuilt .
The maintenance cost would be huge if we pin image hash in docker file.
For example,
https://quay.io/repository/pypa/manylinux2014_x86_64?tab=tags&tag=latest,
it's updated almost every week.
So far, the build.sourceversion is the right way to keep cache is
updated and valid.
### Description
<!-- Describe your changes. -->
Currently we bail on constant folding if QDQ is enabled and we hit a DQ
node. However, if we have a simple DQ -> X -> Q node unit where the DQ
and X do not produce graph outputs, their output only has one consumer,
and X is deterministic, we can constant fold all three nodes.
Add support for this simple scenario primarily to constant fold a QDQ
model that has had initializers updated by layout transformation, which
results in patterns like `initializer -> DQ -> Transpose -> Q` or
`initializer- > DQ -> Unsqueeze -> Q -> DQ -> Transpose -> Q` if the
initializer is broadcast.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve end result of layout transformation on a QDQ model.
### Description
<!-- Describe your changes. -->
As synced offline, rename this op and will create another op for mha
that supports both self and cross attention.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
<!-- Describe your changes. -->
1. upgrade cutlass to 3.0 that containing attn_bias support.
2. extend Attention/MHA to use memory efficient attention when
rel_pos_bias with [1, num_head, s, s*] and 1d mask with [2 * batch_size
+ 1] are present.
new mask format introduction:
MASK_1D_KEY_SEQ_LEN_START,
[3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1],
query_start[0], ..., query_start[batch_size - 1], query_end[batch_size -
1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size -
1]]
e.g
2D mask with [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to this
1D mask is [3, 5, 0, 6, 12, 0, 6, 12]
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
It potentially benefits tnlrv6 and t5(encoder)
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
* Current SkipLayernorm did not using stable algo and cause correctness
issue.
* Enrich existing layernorm kernel to accept bias and residual.
* Tune standard layernorm threads.y according to elements and device
property.
* Remove existing skiplayernorm cuda implementation.
Fix two issues: (1) GPT attention fusion: get_parent could return None when the input is
initializer, add a check (2) ONNX node could have optional inputs and outputs. During
prune_graph, we shall exclude empty inputs/outputs. Here we exclude ""
from output_name_to_node and input_name_to_nodes.
Add an option allow_remove_graph_inputs in prune_graph
### Description
Fix issue relating to unused inputs of model-local functions. ORT
creates a schema for all such functions. The creation of this schema
does not handle unused function-inputs. The schema-creation relies on
the use of the function-inputs to infer type-constraints for the input,
and the code ends up creating an erroneous input-descriptor when there
is no use of the function-input.
The fix is to create an input with the given name, with a
type-constraint that allows all types.
### Motivation and Context
Fix https://github.com/microsoft/onnxruntime/issues/15046
Fix https://github.com/microsoft/onnx-script/issues/524
---------
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>