### Description
Disable the cache to save disk space for training_x64_debug.
### Motivation and Context
As a first step toward mitigating insufficient disk space in training_x64_debug.
Two modifications:
- After [TRT 8.5](https://github.com/microsoft/onnxruntime/pull/13867)
was merged, we can manually set a timeout and make TRT EP run only a
small portion of the unit tests
(`onnxruntime_SKIP_AND_PERFORM_FILTERED_TENSORRT_TESTS=ON`), because the
additional TRT kernel overhead introduced by TRT 8.5 increases test time
significantly. This PR modifies the check condition so that TensorRT CIs
(which can enable the builder placeholder) still run most of the unit
tests.
- Exclude TRT EP from [Resize Opset
18](https://github.com/microsoft/onnxruntime/pull/13890) unit tests
since TensorRT 8.5 supports operators up to Opset 17.
### Description
Allows the PostAnalysis@2 task in Windows CI jobs to continue even if
an error is encountered.
### Motivation and Context
This is a temporary workaround that enables the
`Windows_Packaging_CPU_x86_default` job within the Zip-Nuget-Java-NodeJS
packaging pipeline to finish. A recent update to dotnet 6 has broken the
PostAnalysis task for this job.
This task was originally added by
https://github.com/microsoft/onnxruntime/pull/13694
### Description
Add a compilation cache to the Linux CPU Aten pipeline.
With the cache, the pipeline can complete in as little as 6 minutes.
### Motivation and Context
1. Accelerate the pipeline.
2. It's the shortest pipeline that uses a docker image. I'll use it to try
moving the storage of the Linux docker image from ACR to the ADO pipeline cache.
### Description
Add a new install_shared_deps.sh
### Motivation and Context
Azcopy, Ninja, Node.js and CCache are all needed, but the steps to
install them were copied everywhere.
### Description
Use pytest-xdist to distribute tests across multiple CPUs to speed up
test execution.
Use pytest-rerunfailures to rerun failed tests in case of a pytest-xdist
worker crash.
`pytest -n 16` can reduce pytest time from 80 minutes to 20 minutes.
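For reference, a minimal sketch of an equivalent programmatic invocation (the worker count, rerun count, and test directory here are illustrative, not the CI's exact values):
```
# requires: pip install pytest pytest-xdist pytest-rerunfailures
import pytest

# -n 16: pytest-xdist distributes tests across 16 worker processes.
# --reruns 2: pytest-rerunfailures retries a failed test up to 2 times,
# which also covers transient failures from a crashed xdist worker.
exit_code = pytest.main(["-n", "16", "--reruns", "2", "kernel_explorer_tests/"])
raise SystemExit(exit_code)
```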
### Motivation and Context
Currently, the kernel explorer pytest run in the ROCm CI takes nearly 1
hour 20 minutes. It will take even longer as we add more TunableOps in
the future.
### Description
Use dlsym/GetProcAddress to lookup a custom ops registration function by
name and call it.
This will work better on mobile platforms, where the custom ops library is
linked against directly and there isn't necessarily a filesystem from which
a library path can be loaded.
The alternative is to wire up passing in the address of the function, but
that has multiple complications which differ by platform.
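Conceptually, the lookup works like the following Python ctypes sketch (the library path and symbol name below are hypothetical; the actual change is native code using dlsym/GetProcAddress):
```
import ctypes

# Hypothetical library and symbol name, for illustration only.
lib = ctypes.CDLL("./libcustom_ops.so")  # dlopen on POSIX / LoadLibrary on Windows
register_fn = lib.RegisterMyCustomOps    # dlsym / GetProcAddress lookup by name
# The resolved native function can now be called directly; the real
# registration function takes ORT-specific arguments that are elided here.
```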
### Motivation and Context
Enable using ort and ort-ext packages on mobile platforms.
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Changes to incorporate OpenVINO EP 2022.3
### Motivation and Context
This change is required to incorporate OpenVINO EP 2022.3.
Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Aravind <aravindx.gunda@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: flexci <mohsinmx>
Fix https://github.com/microsoft/onnxruntime/issues/14017.
Before: `shape_value = np.asarray([0, 0, np.array([4]), np.array([8])], dtype=np.int64)` raises an error on numpy 1.24.
After: `shape_value = np.asarray([0, 0, 4, 8], dtype=np.int64)` works on numpy 1.24.
Update test environment to use numpy 1.24.
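A minimal repro sketch of the behavior change:
```
import numpy as np

# numpy >= 1.24 rejects the ragged sequence below with an error
# (earlier releases only emitted a deprecation warning):
# shape_value = np.asarray([0, 0, np.array([4]), np.array([8])], dtype=np.int64)

# Using plain scalars works on numpy 1.24:
shape_value = np.asarray([0, 0, 4, 8], dtype=np.int64)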
### Description
Enable creating a dedicated build for on-device training. With this PR we
can build a lean binary for on-device training using the flag
`--enable_training_apis`. This binary includes only the essentials like
training ops, optimizers, etc., and NOT features like ATen fallback,
strided tensors, gradient builders, etc. This binary also removes all
the deprecated components like training::TrainingSession and OrtTrainer.
### Motivation and Context
This enables our partners to create a lean binary for on-device
training.
### Description
Update the MIGraphX version used in ORT to rocm-5.4.0
### Motivation and Context
The previously used branch, migraphx_for_ort, is no longer updated and
has fallen far behind the latest MIGraphX release branch. More discussion here:
https://github.com/microsoft/onnxruntime/issues/14126#issuecomment-1373201049
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
1. Set the WithCache default value to false in the macOS CI workflow too.
2. Add today's date to the cache key to keep the cache size from growing
indefinitely.
With WithCache enabled, the pipeline duration dropped from over 70
minutes to just over 10 minutes.
### Description
Add today's date to the cache key.
### Motivation and Context
The Microsoft-hosted agent has only 10 GB of space for the build.
To limit cache size, the pipeline only uses caches generated today.
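A minimal sketch of the key construction (the base key name and format here are illustrative, not the pipeline's exact values):
```
from datetime import date

# Embedding today's date in the key means the pipeline never reuses
# (and keeps growing) a cache entry from a previous day.
def make_cache_key(base_key: str) -> str:
    return f"{base_key}-{date.today().isoformat()}"  # e.g. "ccache-linux-2023-01-11"
```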
### Description
1. Renames all references to on-device training as training APIs. This
keeps the naming general; nothing really prevents us from using the
same APIs on servers/non-edge devices.
2. Updates the ENABLE_TRAINING option: with this PR, when this option is
enabled, training APIs and torch interop are also enabled.
3. Refactors the onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
- Removed the user-facing option.
- Set onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON, as we always build with torch interop.
Once this PR is merged, selecting --enable_training will produce a
"FULL build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs
Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (front-end tools for training artifact preparation when
using the training APIs)
### Motivation and Context
The intention is to simplify the options for building training-enabled
builds. This is part of the larger work item to create a dedicated build
for learning-on-the-edge scenarios with just the training APIs enabled.
Implement CloudEP for hybrid inferencing.
The PR introduces no new APIs; customers can configure session and
run options to do inferencing with an Azure [Triton
endpoint](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint).
A sample configuration in Python looks like:
```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton')
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com')
sess_opt.add_session_config_entry('cloud.model_name', 'detection2')
sess_opt.add_session_config_entry('cloud.model_version', '7')  # optional, default '1'
sess_opt.add_session_config_entry('cloud.verbose', '1')  # optional, default '0' (not verbose)
...
run_opt.add_run_config_entry('use_cloud', '1')  # '0' for local inferencing, '1' for the cloud endpoint
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input': input_}, run_opt)
```
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Deprecate one-step beam search since it lacks maintenance (some tests
fail) and its performance is not optimal.
For users who still need this feature, please use an older version
(<=1.13.1) of onnxruntime to export the one-step beam search model; the
exported model can still run in the latest onnxruntime.
It is recommended to use
[convert_generation.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py)
to generate a beam search ONNX model for better performance.
### Description
Update the CUDA ArgMin/ArgMax op kernels to have an end version of 11,
since opset 12+ is not supported yet.
With the way these kernels are currently registered, the documentation
shows support for opset 11+, which is not accurate.
### Motivation and Context
Fix #13781
### Description
Update absl to a new version
### Motivation and Context
The new version contains fixes that are needed for the Nvidia GPU build.
Once we update to that version, we no longer need to maintain our
private patches for the Nvidia GPU build.
### Description
Update versions of a few build dependencies for the onnxruntime NPM
packages.
Update the Node.js version to v16.x in Linux CI; v12 is long out of
date. See the [nodejs release
schedule](https://github.com/nodejs/release#release-schedule).
### Motivation and Context
- Upgrading to the latest webpack allows using the latest Node.js LTS
version; the previous webpack version does not work on Node.js v18,
which is fixed in the latest version.
- Upgrading to the latest typescript, ts-loader and other dev
dependencies accelerates the build and bundling.
- The upgrade also resolves security warnings about vulnerabilities in
out-of-date versions.
### Description
Add the ability to run a graph.
### Motivation and Context
A brief description is as follows (see the sketch after the list):
1) If the whole graph is supported, it will be processed directly by the
graph engine.
2) If the whole graph is not supported, it will be divided into
subgraphs and single operators; the subgraphs run on the graph engine,
and the single operators fall back to the traditional mode.
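A schematic of that dispatch policy; every function here is a hypothetical placeholder, not the EP's real API:
```
# Schematic only: illustrates the whole-graph / partition fallback policy.
def run_graph(graph, engine_supports, run_on_engine, partition, run_traditional):
    if engine_supports(graph):
        # Whole graph handled directly by the graph engine.
        return [run_on_engine(graph)]
    # Otherwise split into engine-supported subgraphs and leftover operators.
    subgraphs, single_ops = partition(graph)
    results = [run_on_engine(sg) for sg in subgraphs]
    results += [run_traditional(op) for op in single_ops]  # fallback path
    return results
```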
### Description
For compilation inside a container, the ADO Cache task doesn't work
directly.
The workaround is to mount the cache directory into the container and
let CCache inside the container read/write the cache data.
In short, we just leverage the ADO API to download/upload the cache data.
Post-job tasks run in stack (LIFO) order, so the PostBuildCleanUp task
must be defined first so that it executes last.
Otherwise, the Cache task would fail to upload the cache because the
agent directory has already been cleaned.
**Description**: This PR includes the following work:
1. Provide stream and related synchronization abstractions in
onnxruntime.
2. Enhance onnxruntime's execution planner / executor / memory arena to
support executing multiple streams in parallel.
3. Deprecate the parallel executor for CPU.
4. Deprecate the Fence mechanism.
5. Update the CUDA / TensorRT EPs to support the stream mechanism,
running different requests in different CUDA streams.
**Motivation and Context**
- Why is this change required?
Currently, the execution plan is just a linear list of primitives that
ORT executes step by step: for any given graph, ORT serializes it to a
fixed execution order. This sequential execution design simplifies most
scenarios, but it has the following limitations:
1. It is difficult to enable inter-node parallelization; we have a
half-baked parallel executor, but it is very difficult to make it work
with GPUs.
2. The fence mechanism can work for the single-GPU-stream + CPU-thread
case, but when extended to multiple streams, it is difficult to manage
the cross-stream synchronization.
3. Our CUDA EP relies on the BFCArena to make memory management work
with asynchronous GPU kernels, but the current BFCArena is not
stream-aware, so it does not behave correctly when running with multiple
streams.
This PR enhances our existing execution plan and executor to support
multi-stream execution. We use a unified algorithm to manage both
single-stream and multi-stream scenarios.
This PR mainly focuses on the infrastructure support for multi-stream
execution; that is, given a valid stream assignment, onnxruntime can
execute it correctly. How to generate a good stream assignment for a
given model will come in a future PR.
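For context, the multi-stream infrastructure is exercised by issuing concurrent Run calls; a minimal sketch (the model path, input name, and shape are hypothetical):
```
import threading
import numpy as np
import onnxruntime as ort

# Hypothetical model; with stream support, concurrent Run() calls can be
# served on different CUDA streams instead of being serialized on one.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

def run_once():
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    sess.run(None, {"input": x})  # "input" is a hypothetical input name

threads = [threading.Thread(target=run_once) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```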
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
Fix a problem where the macOS CI pipeline doesn't run tests. It is due
to a code refactoring I recently made.
### Motivation and Context
Add the tests back.
Integrate TensorRT 8.5
- Update TensorRT EP to support TensorRT 8.5
- Update relevant CI pipelines
- Disable known unsupported ops for TensorRT
- Make the timeout configurable.
We observed more than [20
hours](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=256729&view=logs&j=71ce39d8-054f-502a-dcd0-e89fa9931f40)
of unit test runtime with TensorRT 8.5 in package pipelines. Because we
can't use the placeholder to significantly reduce testing time in
package pipelines (the c-api application test would deadlock), we only
run the subsets of model tests and unit tests that are related to TRT (a
new build flag `--test_all_timeout` is added and set to 72000 seconds by
package pipelines). Note that we still run all the tests in the TensorRT
CI pipelines to keep full test coverage.
- Includes https://github.com/microsoft/onnxruntime/pull/13918 to fix an
onnx-tensorrt compile error.
Co-authored-by: George Wu <jywu@microsoft.com>
### Description
Update protobuf version to 3.18.3 in
tools/ci_build/github/linux/docker/scripts/requirements.txt.
### Motivation and Context
Address component governance alert CVE-2022-1941
### Description
- Adds a dockerfile for Ubuntu with TensorRT 8.5.1.1.
- Adds an option to run the EP Perf pipeline with TensorRT 8.5.
### Motivation and Context
Necessary to benchmark models with TensorRT 8.5
### Description
<!-- Describe your changes. -->
1. Remove the ROCm 5.3 pipeline because it has a rocblas bug and we
don't need it.
2. We removed the dependency on the CentOS docker image provided by
AMD (https://hub.docker.com/r/rocm/dev-centos-7) and now build the ROCm
CentOS base image ourselves. The reference dockerfile
(https://github.com/RadeonOpenCompute/ROCm-docker/blob/master/dev/Dockerfile-centos-7)
is far more than we need, so we simplified it into our own ROCm
manylinux dockerfile.
3. Different ROCm versions now use the same dockerfile,
`Dockerfile.manylinux2014_rocm`.
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>