### Description
Rework some external targets to ease building with
`-DFETCHCONTENT_FULLY_DISCONNECTED=ON`
This will allow package managers to more easily provide an onnxruntime
package by reducing the amount of patching needed downstream at each
version.
### Motivation and Context
Availability of onnxruntime in some C++ package managers
https://github.com/microsoft/onnxruntime/issues/7150https://github.com/conan-io/conan-center-index/issues/16699https://github.com/microsoft/vcpkg/issues/20548
My initial intent is to get this in conan but the PR would most likely
be useful (though not tested) to vcpkg as well (and maybe others).
I tried to get only a first batch of not too specific patches (i.e. not
specific to conan).
The first commit reworks `flatbuffers` and just extends what @snnn did
in https://github.com/microsoft/onnxruntime/pull/13991
The second commit reworks `pytorch_cpuinfo`
The third commit reworks `google_nsync`
Temporarily remove Azure build check to unblock PR(s).
We need to investigate the sudden build failure and reenable.
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Ensure that Loop operators run on CPU.
Fix memcpy for Sequence Tensors, so that empty sequences (like when
SequenceEmpty runs on DirectML) can be copied back to CPU.
MSVC and gcc are both not good at optimizing select(), even in trivial
usage outside of ORT.
gcc seems to do better with -ffast-math (not used by ORT) but /fp:fast
does nothing for MSVC
This PR delivers a 33% speedup on the same model (360us -> 270us on
Windows; 205 us -> 153 us on Linux; measured on different systems).
TODO: Examine and fix Elu and other similar activation functions for the
use of `Eigen::select`
Co-authored-by: @fpribeiro
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add a tool to convert fused BERT like model to packing mode
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Followed by https://github.com/microsoft/onnxruntime/pull/14881
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
rocm python packaging pipeline failed because manylinux version and
manylinux.patch update.
1. fix duplicate `epel-release` installation issue, ROCm pipeline
install it at the begin of the dockerfile to install rocm libs. remove
duplicate installation on install-runtime-packages.sh.
```
/var/tmp/yum-root-sMRl36/epel-release-latest-7.noarch.rpm: does not update installed package.
Error: Nothing to do
```
2. add python10 to fix error below.
```
+ /opt/python/cp310-cp310/bin/python -m venv /opt/_internal/tools
build_scripts/finalize.sh: line 40: /opt/python/cp310-cp310/bin/python: No such file or directory
```
3. add python10 to rocm pipeline.
pipeline link:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=294776&view=results
### Description
Removing fp16 support from apple build
### Motivation and Context
FP16 support on ARM64 only available after armv8.2a, thus the clang
compiler needs a compilation flag `-march=armv8.2-a+fp16`.
Unfortunately, our current universal build does not support hardware
specific compilation flags on cpp source files, as it would cause
trouble when compiling against more than one hardware target. Until we
figure out how to remove this limitation, had to disable fp16 support
for Apple systems.
### Description
<!-- Describe your changes. -->
Add required graph transformer to duplicate DQ nodes to ensure that QDQ
node units have unique DQ nodes. This condition is necessary for QDQ
node unit processing.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
There is an existing Python utility that does this:
c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)
This PR implements it as a graph transformer so it is integrated into
ORT and does not require a separate step to update the model. There are
also tests to ensure that its effects are not undone by basic level
graph optimizations.
### Description
When calculating symbolic shape like `mul(get_int_val(values=[1024,
0.5]))`,
the current script calls `get_int_val()` to get values, which values
becomes `[1024, 0]`.
Thus, the result of `mul(values)`->`mul([1024,0])`=0, but the expected
shape size is 512
Fix: for math binary operations like `mul()` and `div()`,
don't convert input shapes into integers if any possible precision loss
happen;
keep the input shape as float, finish the operation, and cast final
result into integer and output the shape.
Test cases are added:
1. mul(1024, 0.5)=>512 (before this fix, the output would be 0, as float
0.5 would be converted to int 0)
2. div(768, 1.5)=>512 (before this fix, the output would be 768, as
float 1.5 would be converted to int 0)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR uses cudaStreamNonBlocking flag when creating cuda stream,
meaning the created stream will run concurrently with default stream, no
implicit synchronization with default stream.
### Motivation and Context
This PR is required for the perf concern
### Description
DNNL EP doesn't support opset18 yet. So, let it skip such tests so that
we could still test the other EPs.
The models mentioned above are ONNX node tests that live in
github.com/onnx/onnx
### Description
<!-- Describe your changes. -->
Add all the ONNX layout sensitive ops from opset 11 on.
Make list in transpose optimizer consistent.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
When we run L1 optimizers after layout transform in a full build it
needs a schema for any layout sensitive ops that get converted to the
internal domain. Previously we did not run L1, so we got away without
having schemas unless the EP used a static kernel for the nhwc version
of the op.
### Description
1. Make 2 cache tasks in one pipeline really works
2. Each building step has its own environment variable CCACHE_DIR
instead of job variables.
3. Extenal Protobuf compilation cache only updates with deps.txt. It
doesn't generate new cache in every commit.
### Motivation and Context
The simple workflow is as below
```
--------build with ccache-------
|
cache
|
{CCACHE_DIR}-----cache stat.
```
```
-------Cache@2------
|
download cache
|
{path}--------upload cache
```
1. {XXX} means environment variable or task input.
2. {CCACHE_DIR} must be consistent with {path}. Ccache produces caches
in {CCACHE_DIR} and Cache@2 download cache into {path} and tar {path}
and upload it.
3. Protobuf changes with deps.txt so that it would reduce the storage
size.
4. Next step, we may split the compilation into 2 steps, one for
external dependencies and another for ORT.
### Description
1. move the cache task definition into template
2. In debug mode, the compiler mtime is different in different machine.
So, change the CCACHE_COMPILERCHECK to content.
### Motivation and Context
1. Accelerate the CoreML pipeline.
Test run:
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=938040&view=logs&j=1ac7588f-a5bd-5ff7-4a8a-a34869d50220
With Cache, the run can be finished in 12 minutes. Without cache, it
takes about 1 hour.
3. Make the cache function easy to use and maintain.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
This PR disables browser test temporarily. The test randomly fails and
we are investigating the issue. Disable the test to unblock others.
### Description
1. Remove Linux jobs for ORT-Extension combined build
2. Add a macOS build job for ORT-Extension combined build
3. Adjust the yaml file so that it can support two different ADO
instances.
### Motivation and Context
To test our code better. And it will enable us to run such tests for
every commit in the main branch. It would be easier for us to figure out
which change caused a build break.
See
[AB#13435](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13435)
### Description
**Multi-stream** execution support for **CANN EP**.
### Motivation and Context
**CANN EP** is currently **unavailable** due to the introduction of a
new mechanism for multi-stream execution
[#13495](https://github.com/microsoft/onnxruntime/pull/13495), the
deletion of the Fence-based synchronization mechanism, and the failure
to update the relevant logic of **CANN EP** synchronously.
This PR is to fix it.
### Description
windows update python3.11
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <chasun@chasunlinux.lw3b1xzoyrkuzm34swpscft0ff.dx.internal.cloudapp.net>
### Description
transpose.cc:
Arithmetic overflow: Using operator '-' on a 4 byte value and then
casting the result to a 8 byte value. Cast the value to the wider type
before calling operator '-' to avoid overflow (io.2).
cuda_provider_factory.cc:
The type 'struct onnxruntime::ProviderInfo_CUDA_Impl' with a virtual
function needs either public virtual or protected non-virtual destructor
(c.35).
### Description
<!-- Describe your changes. -->
- Add debug infrastructure to dump out model at various stages of
transpose optimization.
- Handle more scenarios where Transpose -> Reshape can be merged.
- Run L1 optimizers after layout transform to constant fold initializers
that had their layout changed.
- Use cost check for Concat post layout transform as pushing a Transpose
through it can potentially add Transpose nodes to multiple other inputs.
- Update internal testing EP to support test where you want it to take
all nodes, use NHWC layout, and to use dummy static kernels instead of
compiling so the ops in the graph post-initialization can be counted.
- Misc cleanup in InferenceSession to not unnecessarily pass args to
TransposeGraph for class members.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Address perf issue seen with model where a Transpose gets blocked by a
Reshape that could have been treated as a Transpose.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…ML in Win32/MSVC.
### Description
Use onnxruntime::narrow to silence some warnings that are turned into
errors when compiling the DML provider in Win32.
Also one case of warning turned to error for mixing int loop variable
type with a vector size() as upper bound.
### Motivation and Context
Solves
[https://github.com/microsoft/onnxruntime/issues/14595](https://github.com/microsoft/onnxruntime/issues/14595)
Co-authored-by: bengt.gustafsson <bengt.gustafsson@contextvision.se>
### Description
1. Move Linux CPU pipelines to an AMD CPU pool which is cheaper
2. Enable CCache for orttraining pipeline
### Motivation and Context
Azure AMD CPU machines are generally much cheaper than Intel CPU
machines. However, they don't have local disks.