### Description
<!-- Describe your changes. -->
Support launch multi-GPU without MPI
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
-update InstanceNormU8 with fixed input. With this input, it fails
consistently using QNN 2.15.1
-update QNN lib paths (target is deprecated) and additionally copy V73
skel file
Bump ruff version and remove pylint from the linter list. Fix any new
error detected by ruff.
### Motivation and Context
Ruff covers many of the pylint rules. Since pylint is not enabled in
this repo and runs slow, we remove it from the linters
This PR introduces
- New data structure to represent kernel-level (aka node-level or
op-level) tensor sharding informaiton. I consider it as the
fundamentaion of ONNX distribtued inference.
- Building blocks for distribtued kernels implementation especially
stateless implementation for communication ops.
- Implementation of DistributedMatMul and its tests.
Code structure:
- sharding.h/.cc: Function to shard and reshard tensors (calling into
NCCL).
- sharding_spec.h/.cc: Representation of how a tensor is sharded.
- distributed_matmul.h/.cc: Implementation of tensor parallel MatMul.
Inputs and outputs are sharded across devices.
- onnxruntime_test_distributed.py: distributed operator tests.
Example of specifying sharding information
```python
@onnxscript.script()
def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT:
# Run MatMul by sharding x along column axis and w along row axis on
# 2 GPUs.
return MICROSOFT_OPSET.DistributedMatMul(
tensor_x,
tensor_w,
device_mesh_shape=[2],
device_mesh_elements=[0, 1],
input_shard_specs=["RS[0]", "S[0]R"],
output_shard_specs=["RR"],
)
onnx_model = matmul_rs_sr_rr.to_model_proto(
input_types=[FLOAT[2, "s"], FLOAT["s", 2]],
output_types=[FLOAT[2, 2]],
)
```
In this example, the device mesh can be visualized as 1-D tensor, `[0,
1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the
0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is
sharded across `[0, 1]` as well.
C++ classes to represent tensor sharding (copied from sharding_spec.h):
```cpp
class DeviceMesh {
public:
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// Device mesh is a tensor of device indices.
// A tensor can then be partitioned along specific mesh axes.
//
// Assume we have 4 GPUs indexed by 0, 1, 2, and 3.
// Let's consider some examples.
// 1. 1D device mesh [0, 1, 2, 3]. In this case,
// device_mesh_shape is [4] and device_mesh_elements
// is [0, 1, 2, 3].
// If we want to shard a 2-D tensor along its axis 1, the
// corresponding sharding spec is a string "RS[0]".
// 2. 2D device mesh [[0, 1], [2, 3]]. In this case,
// device_mesh_shape is [2, 2] and device_mesh_elements
// is [0, 1, 2, 3].
// If we want to shard a 2-D tensor's
// rows along mesh axis 1 and
// columns along mesh axis 0, the
// corresponding sharding spec is a string "S[1]S[0]".
// If that 2-D tensor's value is np.array([[5, 6], [7, 8]]),
// GPU 0/1/2/3 owns 5/7/6/8. Below is a visualization the sharding
// proccess.
// - Start with a 2-D device mesh [[0, 1], [2, 3]] and
// a 2-D tensor [[5, 6], [7, 8]]
// - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]]
// - Split GPU mesh along axis 1 and tensor along
// axis 0 for "S[1]" in "S[1]S[0]"
// - GPU: [[0], [2]], Tensor: [[5, 6]]
// GPU: [[1], [3]], Tensor: [[7, 8]]
// - Split GPU mesh along axis 0 and tensor along
// axis 1 for "S[0]" in "S[1]S[0]"
// - GPU: [[0]], Tensor: [[5]]
// - GPU: [[2]], Tensor: [[6]]
// - GPU: [[1]], Tensor: [[7]]
// - GPU: [[3]], Tensor: [[8]]
// Actual shape of device mesh represented by `device_mesh_elements`.
std::vector<int64_t> device_mesh_shape;
// Flattened device mesh.
std::vector<int64_t> device_mesh_elements;
};
class AxisPartitionSpec {
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// This class is the in-memory representation of
// 1. if a tensor is sharded or not (aka replica), and
// 2. which tensor axis is shard by which device mesh axis.
// Let's consider sharding 2-D tensor along column axis on
// device mesh [0, 1] as an example.
// The required sharding spec RS[0] can be represented by
// - AxisPartitionSpec(Condition::Replica, -1)
// - AxisPartitionSpec(Condition::Shard, 0)
public:
// Status of a tensor axis.
// A tensor axis can be either sharded or replicated
// along a device mesh axis.
enum class Condition { Replica,
Shard };
// This field tells if a tensor axis is sharded or not.
Condition cond;
// If a tensor axis is sharded, this field tells which device
// mesh axis to distribute the shards along.
// If a tensor axis is not sharded, this field is ignored.
int device_mesh_axis;
// A helper to construct a replica spec for a tensor axis.
static AxisPartitionSpec CreateReplica() {
return AxisPartitionSpec(Condition::Replica, -1);
}
// A helper to construct a sharding spec for a tensor axis.
// This tensor axis is sharded along `device_mesh_axis` in device mesh.
static AxisPartitionSpec CreateShard(int device_mesh_axis) {
return AxisPartitionSpec(Condition::Shard, device_mesh_axis);
}
};
class TensorPartitionSpec {
// [Device Mesh and Tensor Sharding for Tensor Parallel]
// TensorPartitionSpec holds a collection of AxisPartitionSpec and an
// associated DeviceMesh. It is responsible for determining how a tensor
// should be partitioned across a device mesh.
//
// Example 1: RS[0]
// In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects.
// - The first object is a Replica, denoting that the first axis of the tensor is
// not sharded but is instead replicated.
// - The second object is a Shard along the 0-th axis of the device mesh. It denotes
// that the second axis of the tensor is sharded along the first axis of the
// device mesh.
//
// Example 2: S[0]RR
// In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects.
// - The first object is a Shard along the 0-th axis of the device mesh, indicating
// that the first axis of the tensor is sharded along the first axis of the
// device mesh.
// - The second and third objects are Replicas, indicating that the second and third
// axes of the tensor are not sharded but are instead replicated.
public:
// axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor,
// axis_specs[0] is for row axis and axis_specs[1] is for
// column axis. axis_specs[i].device_mesh_axis = j means that
// tensor axis i is sharded along device mesh axis j.
std::vector<AxisPartitionSpec> axis_specs;
// device_mesh: DeviceMesh for sharding the associated tensor.
// Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment.
DeviceMesh device_mesh;
};
```
The changes in PR #8919 overwrote the PUBLIC_HEADER property value of the `onnxruntime` target with a list that did not include EP-specific headers. We should probably be using a consistent set of header files across packages anyway.
### Description
Google test can be built either with absl/re2 or not. This PR enables
the build option so that google test framework can print out a nice
stacktrace when something went wrong. It helps locate test errors in CI
build pipelines.
Also, Google test will remove the build option and make it always ON. So
sooner or later we must make this change.
### Description
this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released
### Motivation and Context
Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* Break QkvToContext into small functions. Each fused and unfused kernel
will have separated function.
* Move DecoderAttention kernel to separated file
* Move KV cache related kernel to attention_kv_cache.cu
### Motivation and Context
To make the code easier to maintain.
To avoid a huge cu file and make code more readable:
- Move PrepareQKV to separate cu file (attention_prepare_qkv.cu)
- Move ConcatPastToPresent to attention_concat.cu
- Add default value for AttentionData
- Add a data structure QkvData to track Q, K and V pointers and track
QKV format.
### Description
The new cpuinfo library doesn't use clog on Android. Newer XNNPack
versions have removed the dependency on clog, but the one we use still
has it. So I cherry-pick the XNNPack to our patch file.
This reverts commit bb136f86c8, then
re-implement it in a different way.
I reverted the original change, then added a version constraint to the
find_package args.
If you still found it picks up wrong gtest version after this change,
you may disable `find_package` by setting
'FETCHCONTENT_TRY_FIND_PACKAGE_MODE' to NEVER. For example, the latest
gtest version is 1.14.0. If at a later time Google releases a new
version of gtest and that one is incompatible with the ONNX Runtime
source code you get today and your dev environment already installed the
new version and you do not want to create a new clean build environment
that is without the package, you can add `--cmake_extra_defines
FETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER` to your build command to solve
the problem.
### Description
[Successful pipeline
run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results)
Added flag to build the training artifacts & updated the
pull-wasm-artifacts script to pull the training artifacts as well.
Bundled into this PR are minor formatting fixes + naming fixes.
### Motivation and Context
[This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended
the WASM API wrapper to build training WASM artifacts as well.
The ORT training WASM artifacts are required to support ORT training web
bindings.
### Description
1. Update docker files and their build instructions.
ARM64 and x86_64 can use the same docker file.
2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8
AB#18990
### Description
Git commands producing `git-commid-id` and `git-branch` are always run
in `CMAKE_CURRENT_SOURCE_DIR` (i.e. `onnxruntime/cmake`)
### Motivation and Context
Please refer to corresponding issue
[#17197](https://github.com/microsoft/onnxruntime/issues/17197).
In flatbuffers@v23.5.9 was broken forward declaration for
FlatBufferBuilder. Trying to compile onnxruntime falls with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
class FlatBufferBuilder;
```
This PR removes these declarations and puts includes instead
### Description
* Created `wasm/training_api` source and header files & modified
WebAssembly CMake to include training flags
* The `wasm/training_api` files use an `OrtTrainingManager` handle which
is a struct of an OrtCheckpointState and an OrtTrainingSession, rather
than creating a CheckpointState handle & a separate TrainingSession
handle.
* This is so that the TypeScript side only has to manage one handle that
will be passed between TrainingSession & CheckpointState
representations, rather than the TypeScript side managing separate
CheckpointStateHandle and TrainingSessionHandle.
### Motivation and Context
WASM API needs to be updated with ORT training API function calls so
that ORT training web bindings can be added for on-device training.
---------
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
This PR contains changes to support error pop and kernel name.
- Add a function `JsepGetNodeName` to allow reading kernel name from JS
to C++
- When in debug mode ( `env.debug = true;` ) or in profiling mode (
`env.webgpu.profilingMode = 'default';` ), kernel name will be read from
ORT; otherwise use the kernel pointer ( a number ) as kernel name to
save calls from JS to C++.
- When in debug mode, WebGPU validation errors will be recorded and if
any error occurs, `inferenceSession.run()` will fail (Promise get
rejected). Behavior when not in debug mode is not changed. This is
because recording errors are not zero-overhead, and GPU validation
errors should occur consistently in and not in debug mode.
- Add `jsepOnRunStart()` and `jsepOnRunEnd()` hook to:
- allow implementation of the features mentioned above.
- pass session ID to backend.
### Description
This PR adds the following scripts for LLaMA:
- LLaMA conversion (support for TorchScript and Dynamo exporters)
- LLaMA parity
- LLaMA benchmark
- LLaMA quantization
- LLaMA integration with [Hugging Face
Optimum](https://github.com/huggingface/optimum)
### Motivation and Context
This PR adds scripts for using LLaMA. There is a [follow-up
PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding
scripts for Whisper.
Enable verbose logging in unit test program with environment variable.
E.g., `ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 ./onnxruntime_test_all --gtest_filter="<test that I want to see more logs for>"`.
### Description
When I worked on PR #17173, I didn't notice that
onnxruntime\core\platform\windows\debug_alloc.cc also needs to call
dbghelp functions like SymInitialize. So, if we use vc runtime's
stacktrace functionality, vc runtime will initialize/uninitialize the
dbghelp library independently and vc runtime's stacktrace helper DLLs
get unloaded before our memory leak checker starts get work. Then we
call SymSetOptions, it crashes.
More details:
In VC runtime the C++23 stacktrace functions are implemented on top of
dbgeng.dll. In C:\Program Files\Microsoft Visual
Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\crt\src\stl\stacktrace.cpp,
you can see it has:
```
dbgeng = LoadLibraryExW(L"dbgeng.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32);
```
The dbgeng.dll is a wrapper around dbghelp.dll. It calls SymInitialize
and SymCleanup. dbgeng.dll gets unloaded before our memory leak check
starts to run. In theory we should be able to call SymInitialize again
if the previous user who called SymInitialize has also called
SymCleanup. However, users can use
SymRegisterCallback/SymRegisterCallback64/SymRegisterCallbackW64 to
register callback functions to dbghelp.dll. These callback functions
need to be alive when SymSetOptions(and some other dbghelp APIs) get
called.
### Motivation and Context
This PR handles two changes:
1. There is an issue when running "Debug build" TRT EP with "Release
build" TRT builtin parser on Windows. Enforce use oss parser for Debug
build.
Note: args.config in build.py is an array, for example ["Debug",
"Release"...]. The code will be much mess if we made the change there.
2. Update to use latest commit of oss parser.
Please see the https://github.com/microsoft/onnxruntime/issues/16273
### Description
Re-implement stacktrace. The new implementation doesn't directly use
Windows API, hence can avoid problems regarding to
initialize/uninitialize the dbghelp library.
### Motivation and Context
### Description
For windows headers are not duplicated to the normal cuda include. For
linux they are:
```
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
```
Is the preference via those added defines or should the include just be
changed to be `nvtx3/` ?
Also there is no library linking needed on Windows and the library is
not even present.
### Description
1. Clean up cmake files. Remove some unused code
2. Remove the "Semmle" task from
tools/ci_build/github/azure-pipelines/templates/win-ci.yml. Semmle is
deprecated and replaced by CodeQL.
### Description
1. `onnxruntime_fetchcontent_makeavailable` works around unconditional
install commands so that can be used instead of `FetchContent_Populate`
2. This dependency is Windows specific, mark it as such.
### Motivation and Context
1. This simplifies `cmake/external/wil.cmake` not to do anything
specific wether WIL was fetched or found
2. Given it's specific to Windows, it might not be available on other OS
in specific air-gapped environment such as
[conan-center-index](https://github.com/conan-io/conan-center-index).
This allows downstream builds not to require specific patches for
something not required by the build in the first place.
### Description
Remove the onnxruntime-extensions submodule since it now was used via
cmake FetchContent
### Motivation and Context
The submodule relies on an outdated version of the extensions, and the
build instructions should be updated to eliminate any confusion.
### Description
This PR fixes build break for WebAssembly introduced in
6986981482
(435ad2b1d8).
This change updates onnx.patch in onnxruntime repo. the corresponding PR
in onnx repo is: https://github.com/onnx/onnx/pull/5495.
It may takes a while for the next onnx version bump.
### Description
Some pipelines are failing. It is because PR #16325 set ONNX version to
`rel-1.14.1` . It is a branch name, not a commit or tag name. It means
whenever the branch got a new commit, we will auto pick it and use it.
### Description
This change upgrade emsdk to 3.1.44.
Because backend is upgraded to LLVM 16, so need to fix a lot of build
failures caused by "-Wshorten-64-to-32".
most of the build failures comes from generated `onnx.pb.h`, and this
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignore shorten-64-to-32 warnings.
This is in preparation for planned ROCm 6.0 changes that are not
backward compatible. However, the adjustments made by this PR to the
current onnxruntime cmake files will work with ROCm 5.x and 6.x.
### Description
1. Add "--windows_sdk_version" argument to build.py
2. Fix Windows Static Analysis build pipeline. It is failing because it
picks up a different Windows SDK version after a build machine image
update. If we can explicitly specify Windows SDK version, we can avoid
such things happening again.
3. Remove --enable_training from Windows Static Analysis build pipeline
because PR #16993 makes it incompatible with "no_rtti".
AB#18315
OpenVINO EP ORT 5.1 Branch
Changes for the new API to take in OpenVINO Provider Options
and compatibility with OV 2023.1
### Motivation and Context
The change is required for the new API to take in OpenVINO Provider
Options
and make it seamless.
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
<!-- Describe your changes. -->
Adjust nativs to display tagged strings.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Hard to debug without seeing names.
### Motivation and Context
When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common for us to flatten a structured data into a 1-D
tensor list (required by lib for example torch.onnx.export,
torch.autograd.Function.forward or ORT inference session), then do
subsequent work, then unflatten back to original hierarchy as returned
values.
DeepStage3 hooks support work also need such a lib to do similar things,
so I was proposing to extract this pair of APIs in training/utils/,
which can be more used more generally. Also a comprehensive set of test
data are used for testing unflatten/flatten in unit tests.
Let me know if you have any other suggestions.
### Refactor schema extraction and output unflattening
Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` . to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared libs, which can be used later by other features (deepspeed stage
3 hook rewrite).
While there are still a few duplicated logic handling flatten with
different task by recursively loop the data struct, will change them
step by step in case of heavy review efforts.
Last week I fixed error #16484 found when trying to build onnxruntime
with the icpx compiler. Another thing I found out is that icpx uses
-ffast-math flag by default. You can check it by running the compiler
with -v flag like following:
```bash
# Setup the environment
. /opt/intel/oneapi/setvars.sh
# Compile any file to see all the implicit flags
icpx -v main.cpp
```
This leads to a bunch of warnings during the build like:
```bash
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/cpu/tensor/upsample_op_test.cc:5:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/provider_test_utils.h:6:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/checkers.h:10:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/core/util/math_cpuonly.h:68:
In file included from /mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/Core:172:
/mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/src/Core/MathFunctions.h:1019:12: warning: comparison with NaN always evaluates to false in fast floating point modes [-Wtautological-constant-compare]
return isnan EIGEN_NOT_A_MACRO (x);
^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
And some tests are failing as well, usually with infinities involved. To
list a few:
```bash
# ...
1: [ FAILED ] IsInfTest.test_isinf_float
1: [ FAILED ] IsInfTest.test_isinf_double
1: [ FAILED ] IsInfTest.test_isinf_positive_float
1: [ FAILED ] IsInfTest.test_isinf_positive_double
1: [ FAILED ] IsInfTest.test_isinf_negative_float
1: [ FAILED ] IsInfTest.test_isinf_negative_double
1: [ FAILED ] IsNaNOpTest.IsNaNFloat
1: [ FAILED ] IsNaNOpTest.IsNaNDouble
# ...
```
This PR adds a quick global check for the IntelLLVM compiler, as in the
way its name is reported by CMake and then, depending on the compiler
driver, sets either MSVC-like or GCC-like switch to disable fast-maths.
Probably a bit cleaner solution would be to use
```target_compile_options(${TARGET} PRIVATE MEOW)``` instead of a
global-wide ```set(CMAKE_CXX_FLAGS MEOW)```, but then we'd be required
to add it to all the individual targets and execution providers and this
will lead to a lot of code duplication.
In additions to `onnxruntime_test_all`, `onnxruntime_shared_lib_test`
and `onnxruntime_customopregistration_test` should
also add "-Wno-deprecated-declarations" flag to ignore compiler warning
### Description
1. As a follow-up of #16761, this PR allows build ORT on iOS/Android
without the need to explicitly specify a protoc path. #16761 is for
WASM. This one is for iOS/Android
2. Update the MacOS/Linux build scripts that build/install protobuf from
source. Make them be more flexible. Add the support for
RedHatEnterprise(ubi), which will needed for upgrading the base image
from centos:7 to ubi:8.
3. Update tools/ci_build/github/pai/rocm-ci-pipeline-env.Dockerfile :
the docker file's base image has preinstalled protobuf in /usr/local, we
should uninstall them to avoid conflicts.
### Description
Added Gelu operator to JSEP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add some safety check for conv op.
It is to validate if the attributes coming from a conv op are in a valid
range. (shouldn't be too large or too small).
### Description
Changes allow downloading prebuilt protoc compiler when building
WebAssebly version on mac systems.
Otherwise it tries to build a js/wasm version of protoc and throws an
error while executing it: "protoc.js permission denied"
### Motivation and Context
I need to switch between my main working computer and a PC to make
changes to WebAssebly build. Would like not to do that anymore.
Otherwise, an unsupported version of gtest/gmock will be found at
/opt/conda/include for ROCm builds. Though this issue was initially
found for ROCm builds, the issue is generic. onnxruntime requires a
specific version of googletest and should not rely on locating
googletest using find_package.
The ROCm error was:
```
In file included from /opt/conda/include/gmock/gmock-spec-builders.h:75,
from /opt/conda/include/gmock/gmock-generated-function-mockers.h:47,
from /opt/conda/include/gmock/gmock-function-mocker.h:39,
from /opt/conda/include/gmock/gmock.h:61,
from /stage/onnxruntime/onnxruntime/test/util/test_utils.cc:17:
/opt/conda/include/gmock/gmock-matchers.h: In instantiation of ‘bool testing::internal::PointwiseMatcher<TupleMatcher, RhsContainer>::Impl<LhsContainer>::
MatchAndExplain(LhsContainer, testing::MatchResultListener*) const [with LhsContainer = const gsl::span<const float>&; TupleMatcher = testing::internal::
FloatingEq2Matcher<float>; RhsContainer = gsl::span<const float>]’:
/opt/conda/include/gmock/gmock-matchers.h:2303:10: required from here
/opt/conda/include/gmock/gmock-matchers.h:2312:48: error: no type named ‘const_iterator’ in ‘testing::internal::PointwiseMatcher<testing::internal::
FloatingEq2Matcher<float>, gsl::span<const float> >::Impl<const gsl::span<const float>&>::LhsStlContainer’ {aka ‘class gsl::span<const float>’}
```
- Fix link errors by including the needed onnxruntime-extensions libraries in the static framework.
- Add Objective-C API to register custom ops from embedded onnxruntime-extensions.
Caveat: Not all onnxruntime-extensions build options are working yet. E.g., building with the onnxruntime-extensions OpenCV dependency does not work.
### Description
Introduce `Float16/BFloat16` support for C# and C++ APIs.
User should be able to perform conversions from `float` to/from
`Float16/BFloat16`, compare values and tests for `NaN, Inifnity, and
whether the number is denormalized.`
### Motivation and Context
User filed issues such as:
https://github.com/microsoft/onnxruntime/issues/14303
### Description
1. Replacing AMX intrinsics with machine code macros in QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
3. Fixing the link time optimization (LTO) issue introduced with asm
.include of an assembly file.
I have moved the AMX instruction macro definitions from
QgemmU8S8KernelAmxCommon.S to the amx_common.h to fix the LTO issue.
Note that I am also pushing the macros defined in
QgemmU8S8KernelAmxCommon.S for future reference.
A special thanks to @laxmansole who helped in the development of the
instruction macro definitions for AMX intrinsics and fixing the LTO
issue.
### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
GCC version to use the feature.These changes should allow the usage of
the AMX feature with just the CPU ID check.
### Description
- Update existing rocBLAS get_solutions API using
`*_get_solutions_by_type` (supported from ROCm5.6); remove the original
nested TunableOp logic.
- Update kernel_explorer.
### Description
<!-- Describe your changes. -->
Add ort_value.h to session_options.h so OrtValue is defined.
Update a unit test binary to add required include paths. Adding
ort_value.h pulls in more data type headers.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#16193
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
### Description
<!-- Describe your changes. -->
Split out the more basic changes from #15552 for easier review.
Re-organize to clarify the structure
- Separate out generic base functionality from ORT specific components
- pass in handlers for internal ORT ops to Optimize
- Split out layout transformation from transpose optimization
- Separate out level 1 transpose optimizer
- Cleanup some naming to try and clarify things like an optimizer vs.
general optimization code
Most of the changes are from this movement of code.
Two implementation changes:
- the extended handlers are queried first in GetHandler
- allows the extended handlers to override the default behaviour for an
ONNX operator
- simplify the Optimize function to remove OptimizerMode.
- `can_modify_node` is used instead of `mode` and
`ignore_assigned_nodes` and a long description of the current usage is
added. I don't _think_ that changes the current behavior and hopefully
clarifies what happens and when, and makes the base transpose optimizer
implementation more generic.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Create a cleaner separation to support adding EP specific logic next to
cleanly handle where an EP has additional layout sensitive behaviour
required (e.g. it's Resize implementation only handles one layout).
### Description
Remove AllocatorManager class
### Motivation and Context
After the refactor PR #15833 is in, AllocatorManager class is not
referenced anymore.
### Description
- Implement Objective-C binding for `ORTTrainingSession`
- Add `ORTUtils` utility class to handle conversion between C++ and
Objective-C types
- Add test case for saving checkpoint
- Add unit test cases for `ORTTrainingSession`
### Motivation and Context
This PR is part of implementing Objective-C bindings for training API.
It implements objective-c binding for training session. The objective-C
API closely resembles the C++ API.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Update file list to adjust for recent changes to test infra.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
🛠️ __Changes in this pull request:__
This pull request introduces two significant changes to the project:
- Changing on device training checkpoint format: The current
implementation stores the on device training checkpoint as a sequence of
tensors in multiple files inside a checkpoint folder, which can be
inefficient in terms of storage and performance. In this PR, I have
modified the checkpoint format to utilize the flatbuffer table to save
the checkpoint to a single file, providing a more compact and efficient
representation. The changes around this are twofold:
- Add the checkpoint flatbuffer schema that will generate the necessary
checkpoint source files.
- Update the checkpoint saving and loading functionality to use the new
format.
- Adding support for onnxruntime minimal build: To support scenarios
where binary size is a constraint, I made changes to ensure that the
training build can work well with the minimal build.
🔍 __Open Issues:__
- In order to extract the optimizer type, the existing implementation
re-loaded the onnx optimizer model and parsed it. This is no longer
possible, since the model format can either be onnx or ort. One idea is
to do the same for ort format optimizer model. This needs some
investigation.
- Changes to the offline tooling to generate ort format training
artifacts.
- End-to-end training example showcasing the use of the minimal training
build.
- Add support for export model for inferencing in a minimal build.
CUDA EP already supports [CUDA
graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
also we observed some models can benefit from using CUDA graph with
`trtexec`. Therefore, this PR enables the CUDA graph support for TRT EP.
The implementation is based on
https://github.com/microsoft/onnxruntime/pull/9978 with the same
[constraints](https://github.com/microsoft/onnxruntime/pull/9978) as
below:
- Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
- Usage of CUDA Graphs is limited to models where-in all the model ops
(graph nodes) can be partitioned to the TRT EP.
- The input/output types of models need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- IObinding is required.
### Description
<!-- Describe your changes. -->
1. Add a new test lib `onnxruntime_providers_cuda_ut` which is similar
to `onnxruntime_providers_cuda` but `onnxruntime_providers_cuda_ut` is
only built if `onnxruntime_BUILD_UNIT_TESTS` is set. We can call all
CUDA UTs through this ut lib without affecting production lib
`onnxruntime_providers_cuda`.
2. Move all test cases from `core/providers/cuda/test/` to
`test/providers/cuda/`. These test cases are built into lib
`onnxruntime_providers_cuda_ut` and run by `./onnxruntime_test_all
--gtest_filter="*CUDA_EP_Unittest*"`. Since the lib is only for test, we
can use gtest macros in the test cases. Previous implementation do not
support using gtest lib in the CUDA UT cases.
3. The cmake code in `cmake/onnxruntime_providers.cmake` is refactored a
bit. A new function `onnxruntime_add_object_library` is to build a
object target. The 2 libs `onnxruntime_providers_cuda_ut` &
`onnxruntime_providers_cuda` share most of the code, so the object files
can be used in both libs, which helps reduce build time. Another
function `config_cuda_provider_shared_module` is used to configure all 3
similar
targets(onnxruntime_providers_cuda_obj/onnxruntime_providers_cuda/onnxruntime_providers_cuda_ut).
4. Refactored the test to call `testing::InitGoogleTest` &
`RUN_ALL_TESTS` in `libonnxruntime_providers_cuda_ut.so`'s `TestAll`.
After this change, we can see all the cases running in
`CUDA_EP_Unittest.All`:

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
After https://github.com/microsoft/onnxruntime/pull/13016, there are
still test files in test/providers/cuda/ that are not moved to
core/providers/cuda/test/ and the test cases are disabled. This PR helps
to clean the unfinished TODOs.
Even through onnxruntime_shared_lib_test covers some test for CUDA
provider. onnxruntime_shared_lib_test works like a coarse grain
end-to-end test for CUDA provider. If CUDA unittest can run cases for a
single component, this wound be helpful for CUDA developers.
---------
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
**Description**:
Adds support for cmake find_package.
**Motivation and Context**
As mentioned in issue #7150 onnxruntime doesn't have support for CMake
find_package, this PR adds that and also adds the CMake package version
file. Now anyone can link onnxruntime like this:
```cmake
find_package(onnxruntime)
add_executable(test Source.cpp)
target_link_libraries(test PRIVATE onnxruntime::onnxruntime)
```
this also simplifies #3124
### Description
<!-- Describe your changes. -->
Split up OpTester to separate out operator vs model testing. This led to
a lot of other cleanups/refactoring.
- create BaseTester class and derived OpTester/ModelTester classes to
limit APIs to what is applicable for each test type
- e.g. adding an attribute isn't relevant to a model test
- cleanup structure
- don't expose member variables either directly or via public methods
returning them
- split out checkers so they can be easily re-used
- refactor so there's one public Check method for comparing two OrtValue
instances containing any data type
- refactor the GradientOpTester usage
- it required a lot of OpTester internals to be exposed and no other
tests needed this
- it also returned Status through various parts which prevented the
usage of the google test macros which provide better output. change to
return void and use the macros.
- fix some other minor issues
- update some cmake files so all the source files are included
- remove some low value helpers (FetchTensor and GetShapeVector)
- remove some outdated code to allow unreleased opset versions from when
onnx opset 15 wasn't released
- move files from test/util/include/test to test/util/include
- doesn't seem to be any reason for the additional subdirectory given
they're not files use to test the code in test/util
- files were moved with no changes
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Cleanup test infrastructure.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
This PR enables execution of subgraphs in OVEP and currently, when OVEP
developers install the onnxruntime-openvino package on windows from
pypi, they would have to additionally download OpenVINO windows binaries
and run the setupvars.bat script which sets the environment PATH to
locate the OV dll's. Also this PR fixes issues of OVEP windows io buffer
sample.
### Motivation and Context
Fix: We want to make the user experience easy for OVEP Python developers
on windows platform.
This fix, introduces a function add_openvino_libs_to_path at the
location tools/python/util/add_openvino_win_libs.py.
The above function, can be called by OVEP python users in the
application code and that takes care of setting
the OpenVINO dll's to the path from the OpenVINO pypi packge (openvino)
which was installed.
This change also makes sure that add_openvino_libs_to_path() function is
added to onnxruntime python package
only when it is build for OpenVINO Execution Provider for ONNXRuntime
and not for default ORT python package builds.
New user experience for Python OVEP developers on windows platform:
step 1: pip install onnxruntime-openvino
step 2: pip install openvino
step 3: <Add these 2 lines in the application code>
import onnxruntime.tools.add_openvino_win_libs as utils
utils.add_openvino_libs_to_path()
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
### Description
Before this PR, the CMake code assumed that when on Windows a
multiple-config CMake generator was used, while on non-Windows there was
the assumption of a single-config CMake generator. After this PR this
information is obtained from the
[`GENERATOR_IS_MULTI_CONFIG`](https://cmake.org/cmake/help/latest/prop_gbl/GENERATOR_IS_MULTI_CONFIG.html)
global CMake propery.
### Motivation and Context
I discovered this problem when building with Ninja generator on Windows,
but I guess this should fix problems also on non-Windows platforms when
using a multiple-config generator (such as Xcode on macOS or "Ninja
Multi-Config" on all platforms).
See
https://cmake.org/cmake/help/latest/prop_gbl/GENERATOR_IS_MULTI_CONFIG.html
for more info.
### Description
1. Replacing AMX intrinsics with machine code macro instructions in
QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
GCC version to use the feature.These changes should allow the usage of
the AMX feature with just the CPU ID check.
### Description
Added support for ReduceL1, ReduceL2, ReduceMean, ReduceMin, ReduceMax,
ReduceSum, ReduceLogSum, ReduceLogSumExp, ReduceProd and
ReduceSquareSum.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>
Co-authored-by: guschmue <guschmue@microsoft.com>
### Description
Implement Objective-C binding for `ORTCheckPoint`. Additionally,
- Modify `onnxruntime_objectivec.cmake` to only include training header
and sources when training flag is enabled
- Enable objective-c binding for `orttraining-mac-ci-pipeline`
### Motivation and Context
This PR is part of implementing Objective-C bindings for training API.
It implements objective-c binding for ORTCheckPoint class. The
objective-C API closely resembles the C++ API.
**Note**: The test for saving checkpoint is skipped as it requires use
of training session. It will be added when the objective-c binding for
`ORTTrainingSession` is added.
- Fix flatbuffers flatc warning, unused-but-set-variable.
- Address `-Wshorten-64-to-32` warnings (fix in our code, allow in dependencies' code).
- Update CI builds to use Xcode 14.3.
- Update minimum iOS version to 12.0.
- Update Mac hosted agents to MacOS 13 where possible.
We implemented a number of new ops and data types to support running
segment anything model on Chromium WebNN DML backend (POC) in a forked
branch https://github.com/honry/onnxruntime/tree/stable-diffusion
In this PR, we migrate the changes in the forked branch to main branch,
includes:
- 22 new ops
- New tensor data types: bool, int32, uint32, uint64, int64, float16 (As
JavaScript hasn't shipped Float16Array, we use Uint16Array as a
workaound)
- Handle empty input tensors and duplicated outputs
- Fixed some nits
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4MEFNUZ, FloatE5M2FNUZ
as described in PR https://github.com/onnx/onnx/pull/4805. It uses CUDA
API to cast float/half to float8 if CUDA>=11.8, a custom implementation
if CUDA<11.8.
* It implements, Cast, QuantizeLinear, DequantizeLinear for all types on
CPU, only for types FloatE4M3FN, FloatE5M2 on CUDA.
* It extends the supported types for control flow operator, Shape,
Reshape, Identity, If, Loop, Scan, Reshape
* It implements Equal(19).
* Cast, QuantizeLinear, DequantizeLinear operators now support a
parameter `saturate` only valid for float 8 types. It is true by
default. In that case, any value out of range is converted into the
maximum float 8 value. If false, it is infinite.
* QuantizeLinear, DequantizeLinear now supports multiple scales on CUDA
(and ROCm by extension), scale = 1D tensor with one scale per channel
### Motivation and Context
Supports latest onnx version.
Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)
---------
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Add a configuration `max_power_of_two_extend_bytes ` to limit the arena extension size.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
In our real scenario, we observe that if the model is big enough the
BfcArena will extend uncontrollable.
As showed by the following figures, if a model uses more than 16GB
memory, the BfcArena will totally apply for 32GB memory according to the
`kNextPowerOfTwo` strategy. With the new strategy, the extension is
limited. The default maximum extension size is 1GB.
#### Without the new configuration
After loading the model, ORT uses 32G GPU memory.

#### With the new configuration
After loading the model, ORT uses 23G GPU memory.

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
### Description
DML_PACKAGE_DIR cmake variable is not getting set properly when dml_path
build options is used.
### Motivation and Context
- Why is this change required? What problem does it solve?
It is required for DML Perf dashboard.
<!--- If it fixes an open issue, please link to the issue here. -->
### Description
Avoid trt deprecated api warnings shown as errors when building
onnxruntime_test_all
This issue is only visible when installing trt via binaries, rather than
deb/rpm pkg (CI pipelines)
The change is similar to existing set_property for
onnxruntime_providers_tensorrt
89ea503024/cmake/onnxruntime_providers.cmake (L421)
### Motivation and Context
onnxruntime/test/unittest_main/[test_main.cc](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/unittest_main/test_main.cc#L32)
includes nvinfer.h, which includes deprecated trt apis and and generates
warnings.
When building onnxruntime_test_all, it will show warnings as errors and
block the build.
### Doubts
Although this issue is visible on trt tar binaries but not on trt
deb/rpm pkgs,
Their file size&hash are the same (creation time vary), regarding
headers/libs installing in different ways.
| tarBin | pkg |
| ------------------------------------------------------------ |
------------------------------------------------------------ |
| 997284784 Apr 26 15:15 libnvinfer_builder_resource.so.8.6.1 |
997284784 Apr 26 22:21 libnvinfer_builder_resource.so.8.6.1 |
| 235369632 Apr 26 15:14 libnvinfer.so.8.6.1 | 235369632 Apr 26 22:21
libnvinfer.so.8.6.1 |
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more nature.
For example,
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
### Description
This PR adds the training headers to the training android packages.
### Motivation and Context
Training headers need to be added as part of the training android
packages, however because of the typo in the cmake these headers were
not being added. This PR fixes the issue.
### Description
Revert a change in #15797: restore the correct version of emsdk
### Motivation and Context
Without change, when you build it on Windows you will see:
```
2023-05-17 19:41:30,093 build [INFO] - Activating emsdk...
2023-05-17 19:41:30,093 util.run [INFO] - Running subprocess in 'C:\src\onnxruntime2\cmake\external\emsdk'
'C:\src\onnxruntime2\cmake\external\emsdk\emsdk.bat' activate 3.1.37
error: tool or SDK not found: '3.1.37'
```
### Description
In some scenarios, the triton written kernels are more performant than
CK or other handwritten kernels, so we implement a framework that
onnxruntime can use these triton written kernels.
This PR is to integrate triton into ort, so that ort can use kernels
that written and compiled by triton.
The main change focus on two part:
1. a build part to compile triton written kernel and combine these
kernels into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launch triton written
kernels.
#### Build
To compile triton written kernel, add a script
`tools/ci_build/compile_triton.py`. This script will dynamic load all
kernel files, compile them, and generate `triton_kernel_infos.a` and
`triton_kernel_infos.h`.
`triton_kernel_infos.a` contains all compiled kernel instructions, this
file will be combined into libonnxruntime_providers_rocm.so, using
--whole-archive flag.
`triton_kernel_infos.h` defines a const array that contains all the
metadata for each compiled kernel. These metadata will be used for load
and launch. So this header file is included by 'triton_kernel.cu' which
defines load and launch functions.
Add a build flag in build.py and CMakeList.txt, when building rocm
provider, it will call triton_kernel build command, and generate all
necessary files.
#### C++ Load and Launch
On c++ part, we implement load and launch functions in triton_kernel.cu
and triton_kernel.h.
These two files located in `providers/cuda`, and when compiling rocm,
they will be hipified. so this part supports both cuda and rocm. But
currently we only call triton kernel in rocm.
We also implement a softmax triton op for example. Because there will
generate many kernels for different input shape of softmax, we use
TunableOp to select the best one.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR mainly fixes building errors when trying to build nupkg for ROCm EP.
It also slighly improve the packaging logic so that devlopers can
produce the nupkg on linux natively.
### Description
This PR partially reverts changes introduced in
https://github.com/microsoft/onnxruntime/pull/15643
We make two API return std::string always in UTF-8.
We also move the entry points from OrtApiBase to OrtApi to make them
versioned.
### Motivation and Context
`GetVersionString` always returns x.y.z numbers that are not subject to
internationalization.
`GetBuildInfoString` can hold international chars, but UTF-8 should be
fine to contain those.
We prefix them with u8"" in case the compiler default charset is not
UTF-8.
Furthermore, creating platform dependent APIs is discouraged.
`ORTCHAR_T` is platform dependent and was created for paths only.
On non-unix platforms would still produce `std::string` that can only
contain UTF-8
The API was introduced after the latest release, and can still be
adjusted.
When building the FlatBuffers dependencies, gcc13 emits a
stringop-overflow warning. All warnings being turned into errors, that
fails the compilation of FlatBuffers, and as a consequence also fails
the build of onnxruntime.
This commit adds the application of a patch to FlatBuffers's
CMakeList.txt, to add -Wno-error=stringop-overflow to the
CMAKE_CXX_FLAGS.
### Description
this is for ort 1.15 release to work with onnx 1.14
It shall be merged after onnx 1.14 release and before ort 1.15 release.
### Motivation and Context
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
- Update DML version to 1.11.0
- Disable Gemm+Softmax fusion
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
**Description**:
This PR intends to enable WebNN EP in ONNX Runtime Web. It translates
the ONNX nodes by [WebNN
API](https://webmachinelearning.github.io/webnn/), which is implemented
in C++ and uses Emscripten [Embind
API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#).
Temporarily using preferred layout **NHWC** for WebNN graph partitions
since the restriction in WebNN XNNPack backend implementation and the
ongoing
[discussion](https://github.com/webmachinelearning/webnn/issues/324) in
WebNN spec that whether WebNN should support both 'NHWC' and 'NCHW'
layouts. No WebNN native EP, only for Web.
**Motivation and Context**:
Allow ONNXRuntime Web developers to access WebNN API to benefit from
hardware acceleration.
**WebNN API Implementation Status in Chromium**:
- Tracked in Chromium issue:
[#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291)
- **CPU device**: based on XNNPack backend, and had been available on
Chrome Canary M112 behind "#enable-experimental-web-platform-features"
flag for Windows and Linux platforms. Further implementation for more
ops is ongoing.
- **GPU device**: based on DML, implementation is ongoing.
**Open**:
- GitHub CI: WebNN currently is only available on Chrome Canary/Dev with
XNNPack backend for Linux and Windows. This is an open to reviewers to
help identify which GitHub CI should involved the WebNN EP and guide me
to enable it. Thanks!
### Description
latest emsdk generated multi-thread version sometimes crash with unknown
reason ( error: memory access out of bounds ).
we don't want to break existing ort-web users, so revert emsdk back to
3.1.19 (same to what ort v1.14.0 uses)
### Description
Download protoc from Github Release instead of Nuget to avoid having
dependency on nuget.exe on Linux
### Motivation and Context
To avoid having dependency on nuget.exe on Linux. Many users' build
environment do not have nuget or dotnet.
### Description
This change will allow us building CUDA EP without installing CUDA SDK
on Windows.
### Motivation and Context
Nvidia's CUDA installer comes with a VS extension. In the past, we
require installing the extension. It is a little bit inconvenient since:
1. Visual Studio must be installed before CUDA SDK. CUDA's installer
will not install the extension if your machine doesn't have Visual
Studio.
2. We need to install CUDA SDK on our build machines, instead of just
downloading it and using it.
After this change, we will not need to install CUDA SDK on our build
machines. So it will be easier to add a support for a different CUDA
version.
Also, fix two PreFast warnings.
### Description
This PR creates Nuget and Android for Training.
### Motivation and Context
These packages are intended to be released in ORT 1.15 to enable
On-Device Training Scenarios.
## Packaging Story for Learning On The Edge Release
### Nuget Packages:
1. New Native package -> **Microsoft.ML.OnnxRuntime.Training** (Native
package will contain binaries for: win-x86, win-x64, win-arm, win-arm64,
linux-x64, linux-arm64, android)
2. C# bindings will be added to existing package ->
**Microsoft.ML.OnnxRuntime.Managed**
### Android Package published to Maven:
1. New package for training (full build) ->
**onnxruntime-training-android-full-aar**
### Python Package published to PyPi:
1. Python bindings and offline tooling will be added to the existing ort
training package -> **onnxruntime-training**
### Description
While building ORT for DML EP with `dml_EXTERNAL_PROJECT` flag, 2
variables (`DML_SHARED_LIB`, `DML_PACKAGE_DIR`) value is not set
properly.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Originally VitisAI EP only works with old version of VitisAI release.
### Motivation and Context
Update VitisAI EP so that it works together with the current VitisiAI
3.5 and further version of VitisAI. We try our best to make it forward
compatible.
---------
Co-authored-by: Wang Chunye <chunywan@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com>
Co-authored-by: shoucair <shoucai.ren@amd.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
Co-authored-by: BoarQing <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
### Description
In #8953 I introduced a change in our onnxruntime_mlas.cmake that it
enables "ASM_MASM" cmake language for all Windows build.
```cmake
enable_language(ASM_MASM)
```
Before the change, it is only enabled when onnxruntime_target_platform
equals to x64.
However, cmake 3.26 added a new language: ASM_MARMASM.
According to cmake's manual,
ASM_MASM is for Microsoft Assembler
ASM_MARMASM is for Microsoft ARM Assembler. This one is new in cmake
3.26.
We should choose the right one according to
${onnxruntime_target_platform}.
### Description
This PR adds the Whisper custom export scripts to the wheel.
### Motivation and Context
This enables access to the custom export scripts in the wheel.
### Description
This PR resolves a part of non-critical comments from code review
comments in #14579.
- use `USE_JSEP` instead of `USE_JS` in build definition to make it less
ambiguous
- remove unused util functions from util.ts
- fix transpose.h
- other misc fixes
### Description
The PR adds VPU support to OpenVINO Execution Provider
Bug fixes for GPU, CPU.
Changes to OpenVINO Backend in Serialized Model API for faster First
Inference Latency.
Deprecation to HDDL-VADM and MYRIAD, removed code
Support OpenVINO 2023.0
Dynamic Shapes Support for iGPU
### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO
- If it fixes an open issue, please link to the issue here. -->
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
Fix iconv link issue. The library is used in string_normalizer.cc.
### Motivation and Context
Though iconv is part of POSIX standard, some systems may have additional iconv providers, for example GNU iconv, that is not in the standard c runtime library. In these cases we may need to link to additional libraries.
However, this change has two caveats:
1. It may silently pull in GNU libraries into libonnxruntime.so, and make the shared library not distributable.
2. The detection of iconv library runs before we add additional include folders to ORT. So the detection may be inaccurate.
### Description
This change introduced the following new components into ONNX Runtime
Web:
- JavaScript Execution Provider (JSEP)
- Asynchronized inferencing execution powered by Emscripten's Asyncify
- WebGPU backend implemented in TypeScript
- initial implementation of kernels:
- elementwise operators (22)
- binary operators (5)
- tensor: Shape, Reshape, Transpose, Gemm
- nn: Conv, {Global}Maxpool, {Global}AveragePool
Code need to be polished. still working on it.
## Q&A
What is JSEP?
> JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime
execution provider that specifically works on Web environment
(browsers). JSEP allows JavaScript code to kick in from various places
when ONNX Runtime inferences a model.
Why JSEP?
> JSEP is a hybrid mode EP that contains both C/C++ and
TypeScript/JavaScript implementation. There are 2 strong reasons why we
introduces JSEP:
> 1. the C/C++ part helps JSEP to leverage ONNX Runtime's capabilities
as much as possible including graph transformer, optimizers and also the
capabilities to fallback to CPU EP. TypeScript/JavaScript helps JSEP to
develop and debug much easier in the browser for the kernel
implementation.
> 2. the requirement of asynchronized execution from JavaScript API (eg.
`buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a
synchronized context (see "async problem" section below). This is done
by using Emscripten's Asyncify.
What is WebGPU?
> WebGPU is the new GPU API that available in browser. It's one of the
only 2 APIs that currently available to access the GPU from browser (the
other is WebGL).
> WebGPU is designed with more advanced and stronger features comparing
to WebGL and is potentially solution that offer the best GPU performance
for model inferencing that currently available.
What is the async problem and why we have the problem?
> The "async problem" is a problem that you cannot call an async
function in a synchronous context. Think about the following C++ code:
> ```c
> // C-style declarations (API)
> typedef void (*ON_COMPLETE)(PVOID state, DATA *data);
> void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);
>
> // implementation
> DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
> // how to implement?
> }
> ```
> The answer is, it's impossible to implement this function. Usually we
try to find a sync version API, or launch a thread to call the async
function and sync-wait on the main thread. Unfortunately, in browser
environment, neither is possible.
>
> WebGPU does not offer any synchronized API for data downloading (GPU
to CPU). This is the only operation that MUST be async. As `OrtRun()`
will eventually call into DataTransfer for copy data from GPU to CPU,
and `OrtRun()` is a synchronized function, this cannot be done in normal
way.
What is Emscripten? How is the Asyncify feature resolved the problem?
> Emscripten is the C/C++ compiler for WebAssembly. It's what we use to
compile ORT and generates the WebAssembly artifacts which runs on
browsers.
>
> Asyncify is a [compiler
feature](https://emscripten.org/docs/porting/asyncify.html) that allows
calling async functions from a synchronized context. In short, it
generates code to unwind and rewind call stack to emulate async
execution. With this feature, we are able to call the async function
inside `OrtRun()` call.
## Design Overview
**Inter-op**
JSEP is doing pretty much same thing to just another EP. It exposes an
interface for inter-op with JavaScript, which is defined in
onnxruntime/wasm/js_internal_api.js:
```js
// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
Module.jsepBackend = backend;
Module.jsepAlloc = alloc;
Module.jsepFree = free;
Module.jsepCopy = copy;
Module.jsepCopyAsync = copyAsync;
Module.jsepCreateKernel = createKernel;
Module.jsepReleaseKernel = releaseKernel;
Module.jsepRun = run;
};
```
This simple JavaScript snippet defines all language barrier level
functions that requires by JSEP to achieve implementing kernels and data
transfers using JavaScript inside ONNX Runtime:
- `jsepBackend`: assign the singleton object to webassembly module
- `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc()
and Free()
- `jsepCopy`: synchronized copy ( GPU to GPU, CPU to GPU)
- `jsepCopyAsync`: asynchronized copy ( GPU to CPU)
- `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object
that maintained in JS to match lifecycle of Kernel in ORT
- `jsepRun`: OpKernel::Compute() should call into this
The abstraction above allows to tie as little as possible connections
and dependencies between C/C++ and TypeScript/JavaScript.
**Resource Management**
Lifecycle of tensor data and kernels are managed by ORT(C/C++) but the
implementation are left to JavaScript. JavaScript code are responsible
to implement the callbacks correctly.
For WebGPU, the GPU data is managed by JavaScript using a singleton map
(tensot_data_id => GPUBuffer). GPU pipeline is managed as singleton.
Shaders are managed using a singletonmap (shader_key => gpu_program),
while shader_key is generated by cache_key (OP specific, including
attributes) and input shapes.
**about data transfer**
`js::DataTransfer::CopyTensor` implemented to call either synchronized
or asynchronized copy callback, depending on the destination is GPU or
not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function
to be called in the synchronized context.
**run kernel in JS**
Kernel class constructor calls once `jsepCreateKernel()` with an
optional per-kernel specific serialization to pass attributes into
JavaScript.
`Compute()` are implemented in a way that a metadata serialization is
performed in a base class and JavaScript code can access the data using
the Emscripten specific builtin macro `EM_ASM_*`.
**disabled features**
memory pattern is force disabled, because the WebGPU data is not
presented by a general memory model (a buffer can be represented by
offset + size).
concurrent run support is disabled. WebGPU is stateful and it also has
async function call. To support concurrent run will significantly
increase the complexity and we don't get any real benefit from it.
**prefer channels last**
JSEP prefers channels last and returns `DataLayout::NHWC` in method
`GetPreferredLayout()`. This will let the graph transformers to
preprocess the graph into a channels last form so that a more optimized
WebGPU shader can be used.
**Testing code**
It's impossible to test JSEP directly because JSEP itself does not
contain any kernel implementation. However, it has the kernel
registration which need to work together with the corresponding
JavaScript code. There are unit tests that run onnx models from
JavaScript API.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
TensorRT will load/unload libraries as builder objects are created and
torn down. This will happen for
every single unit test, which leads to excessive test execution time due
to that overhead.
This overhead has steadily increased over the past few TensorRT versions
as the library objects get bigger leading to
8 hours to run all the unit tests. Nvidia suggests to keep a placeholder
builder object around to avoid this.
### Description
Updating the build option for enabling training in java builds from
ENABLE_TRAINING -> ENABLE_TRAINING_APIS.
In the native codebase ENABLE_TRAINING is used for enabling full
training and ENABLE_TRAINING_APIS is used for creating the lte builds
with training apis. Making the change to sync the naming convention
across all the language bindings.
It was a bit confusing to see ENABLE_TRAINING when debugging the android
build failures for training. Making this change just to improve
readability of logs during debugging.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fix CUDA 12.1 Windows build error of cuda namespace ambiguous. Use a new namespace for attention softmax.
Tested with VS 2019 and VS 2022 with the following settings:
- OS: Microsoft Windows 11 Enterprise (Version 10.0.22621 Build 22621)
- CUDA: cuda_12.1.0_531.14_windows
- TensorRT: TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0
- CUDNN: 8.8.1.3 for cuda 12
- Visual Studio Enterprise 2019, version 16.11.26 (MSVC v142) or
Visual Studio Enterprise 2022 (64-bit), version 17.5.4
- Python: 3.10
- CMake: 3.25.2
VS 2019:
```
build.bat --cmake_generator "Visual Studio 16 2019" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12"
```
VS 2022:
```
build.bat --cmake_generator "Visual Studio 17 2022" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt_2022 --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12"
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/15242
### Description
<!-- Describe your changes. -->
1. moved onnxruntime/contrib_ops/cuda/decoder to
onnxruntime/contrib_ops/cuda/bert
2. create utils.cuh under /bert for shared implementations in
decoder_masked_multihead_attention_impl_utils.h and
rotary_embedding_util.h
3. refactored relative_attn_bias_impl.cu by reusing the template
specializations in utils.cuh
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
<!-- Describe your changes. -->
Add Swift Package Manager (SPM) support for ORT based on #14621
- uses the existing objective-c bindings
- some re-organization of the directory structure was required but the
contents of the files are unchanged, apart from adjustments due to file
movements
Add tool for updating ORT native pod used in the SPM package
Update CIs to use ORT native pod from build, and build/test using SPM
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
iOS developers are using SPM as much as cocoapods, so adding SPM means
both are catered for.
A recent commit added an assembler check if the ASM dialect was ATT
This unfortunately broke the AMX build for systems that don't have the
ASM-ATT dialect.
This change assumes if the CMAKE_ASM-ATT_COMPILER_ID is not found and
the CMAKE_ASM_COMPILER_ID is "GNU" based on all the other already passed
checks AMX is supported by the compiler and assembler.
### Description
### Motivation and Context
On my build system the recent change to add the ASM-ATT version check
disabled AMX code from the build.
---------
Signed-off-by: George Nash <george.nash@intel.com>
### Description
https://github.com/microsoft/onnxruntime/pull/15538
Above pull request breaks Windows build on cmake 3.25 or earlier. This
should fix it.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).
Some of the added optimizations include:
- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
- For Whisper's encoder and decoder
- Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
- For Whisper's decoder and decoder with past
- Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
- CPU:
- Different Q and KV sequence lengths
- Parallel memset for large sequence lengths
- Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
add
- Separate present key-value output into present key and present value
(for multi-head attention spec)
- CUDA:
- Use memory efficient attention compute kernel with present state (for
decoder)
- Multi-head attention kernel changes
- CPU:
- Introduction of multi-head attention CPU kernel (previously did not
exist)
- Use AddBiasReshape instead of AddBiasTranspose when sequence length =
1 (for decoder with past)
- Different Q, K, V input shapes
- Pass past key and past value directly as key and value
- CUDA:
- Use memory efficient attention compute kernel with past and/or present
state (for decoder with past)
### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline
import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/
directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'
# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
directory,
use_io_binding=(device == 'cuda'),
provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
device=(-1 if device == 'cpu' else 0),
)
# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```
Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920
### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)
This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427
This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
### Description
Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.
Excluded
```
'onnxruntime/core/mlas/**',
'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```
because they contain assembly or is data heavy
### Motivation and Context
Coding style consistency
### Description
Add hipBLASLt to GEMM Tunable op, which supports GEMM and
StridedBatchedGEMM.
To enable hipBLASLt implementation, add an extra flag to the building
command: `--cmake_extra_defines onnxruntime_USE_HIPBLASLT=ON`.
### Optimize SCE loss compute
Compute optimization based on label data sparsity:
- Insert ShrunkenGather before SCELoss node, to filter out invalid
labels for compute.
- Support ShrunkenGather upstream.
- Added test for the above.
- Added flag to enable label sparsity optimization with env var, by
default disabled now. Will enable after comprehensive benchmarking
later.
- Extract common logic into test_optimizer_utils.h/cc from
core/optimizer/compute_optimzier_test.cc, then the common functions can
be shared by both core/optimizer/compute_optimzier_test.cc and
orttraining/core/optimizer/compute_optimzier_test.cc
- Extract common logic into shared_utils.h/cc: `GetONNXOpSetVersion` and
`Create1DInitializerFromVector`
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
* Integrate TRT 8.6EA on relevant Linux/Windows/pkg pipelines
* Update onnx-tensorrt to 8.6
* Add new dockerfiles for TRT 8.6 and clean old ones
* Update
[CGManifest](https://github.com/microsoft/onnxruntime/tree/main/cgmanifests)
files and ort build deps version
* yml/script update
* Enable built-in TRT parser option on TRT related pipelines by default
* Exclude test TopKOperator.Top3ExplicitAxisInfinity out of TRT EP tests
(8.6-EA has issue with topk operator)
### Description
Fix onnxruntime_mlas build failure with cmake 3.26. Updated CMAKE
generator expression to make sure certain complier flags only apply for
C/CXX compiler.
### Motivation and Context
CMake changed the behavior of ASM_MASM in version 3.26. See
https://gitlab.kitware.com/cmake/cmake/-/issues/24639.
This also fixed the issue of #15101
### Description
Update mimalloc dependency.
### Motivation and Context
The latest release contains important fixes including memory leaks and
used by customers.
### Description
<!-- Describe your changes. -->
Add clog back to onnxruntime_EXTERNAL_LIBRARIES.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix iOS packaging pipeline build failure.
Change the default behavior to link against the nvonnxparser library
(onnx-tensorrt parser) that is included with the TensorRT package.
Previously, the default behavior was to build and statically link
against the OSS onnx-tensorrt parser.
Historically, we wanted to incorporate the latest commits/fixes from OSS
parser.
These days the OSS parser is not significantly different from the
included parser library so there is less reason to build against it by
default.
By linking with parser shared library from TensorRT library, the major
benefit is it's much easier to
build/link against a minor version update of TensorRT. And OnnxRuntime
can be updated with a new TensorRT minor version by simply replacing
TensorRT libraries with the newer version. (because the parser is no
longer statically linked into onnxruntime)
Added --use_tensorrt_oss_parser to build.py to support the previous
default behavior. (build + static link OSS parser)
### Description
Rework some external targets to ease building with
`-DFETCHCONTENT_FULLY_DISCONNECTED=ON`
This will allow package managers to more easily provide an onnxruntime
package by reducing the amount of patching needed downstream at each
version.
### Motivation and Context
Availability of onnxruntime in some C++ package managers
https://github.com/microsoft/onnxruntime/issues/7150https://github.com/conan-io/conan-center-index/issues/16699https://github.com/microsoft/vcpkg/issues/20548
My initial intent is to get this in conan but the PR would most likely
be useful (though not tested) to vcpkg as well (and maybe others).
I tried to get only a first batch of not too specific patches (i.e. not
specific to conan).
The first commit reworks `flatbuffers` and just extends what @snnn did
in https://github.com/microsoft/onnxruntime/pull/13991
The second commit reworks `pytorch_cpuinfo`
The third commit reworks `google_nsync`
### Description
Removing fp16 support from apple build
### Motivation and Context
FP16 support on ARM64 only available after armv8.2a, thus the clang
compiler needs a compilation flag `-march=armv8.2-a+fp16`.
Unfortunately, our current universal build does not support hardware
specific compilation flags on cpp source files, as it would cause
trouble when compiling against more than one hardware target. Until we
figure out how to remove this limitation, had to disable fp16 support
for Apple systems.
### Description
<!-- Describe your changes. -->
Add required graph transformer to duplicate DQ nodes to ensure that QDQ
node units have unique DQ nodes. This condition is necessary for QDQ
node unit processing.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
There is an existing Python utility that does this:
c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)
This PR implements it as a graph transformer so it is integrated into
ORT and does not require a separate step to update the model. There are
also tests to ensure that its effects are not undone by basic level
graph optimizations.
### Description
1. Remove Linux jobs for ORT-Extension combined build
2. Add a macOS build job for ORT-Extension combined build
3. Adjust the yaml file so that it can support two different ADO
instances.
### Motivation and Context
To test our code better. And it will enable us to run such tests for
every commit in the main branch. It would be easier for us to figure out
which change caused a build break.
See
[AB#13435](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13435)
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Adds support for newer opset of Reduction operators (ReduceSum,
ReduceMax, ReduceMin, ReduceMean, ReduceProd) with axes as an
initializer input.
- Adds tests for HTP and CPU backends.
### Motivation and Context
Newer opset versions changed the `axes` attribute into an optional
input. This PR adds support for these newer reduction operators as long
as the axes input is defined as an initializer. The goal is to enable more models on QNN.
### Description
<!-- Describe your changes. -->
1. upgrade cutlass to 3.0 that containing attn_bias support.
2. extend Attention/MHA to use memory efficient attention when
rel_pos_bias with [1, num_head, s, s*] and 1d mask with [2 * batch_size
+ 1] are present.
new mask format introduction:
MASK_1D_KEY_SEQ_LEN_START,
[3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1],
query_start[0], ..., query_start[batch_size - 1], query_end[batch_size -
1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size -
1]]
e.g
2D mask with [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to this
1D mask is [3, 5, 0, 6, 12, 0, 6, 12]
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
It potentially benefits tnlrv6 and t5(encoder)
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Statistics tool for ORTModule convergence parity
As ORTModule get more and more validated, it is pretty fast to
intergrade PyTorch based model with ORT.
The same time, we need make sure once there is convergence issue, we
don't spend months of time to investigate. As part of this efforts, this
PR is introducing a tool to dump activation statistics without much
involvement from users. The dumping results contains only some statistic
numbers plus sampled data, which is not big, compared with dumping all
the tensors, it is much faster and space efficient.
For us to use it, two single lines are needed before wrapping ORTModule.
For baseline run, need also apply the same trick.
```
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```
Once you run the steps, following command can be used to merge result
into per-step-summary respectively for ORT and baseline runs.
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```
Docs is added here as part of this PR [convergence investigation
notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md)
Based on the generated merged files, we can compare them with tools.

### Design and Implementation
This PR introduced a common mechanism registering custom logic for
nn.Module's post forward hooks. And statistics for activation
(StatisticsSubscriber) is one of the implementations. If there is other
needs, we can define another XXSubscriber to do the customized things.
### Description
This PR includes the following changes:
- upgrade js dependencies
- enable STRICT mode for web assembly build.
- corresponding fix for cmake-js upgrade
- corresponsing fix for linter upgrade
- upgrade default typescript compile option of:
- `moduleResolution`: from `node` to `node16`
- `target`: from `es2017` to `es2020`
- fix ESM module import in commonJS source file
## change explanation
### changes to onnxruntime_webassembly.cmake
`-s WASM=1` and `-s LLD_REPORT_UNDEFINED` in latest version is
by-default and deprecated.
### changes to onnxruntime_node.cmake
The npm package `cmake-js` updated its way to find file `node.lib`.
previously it downloads this file from Node.js public release channel,
and now it generates it from a definition file.
The node.js release channel does not contain a windows/arm64 version, so
previously cmake-js will fail to download `node.lib` for that platform.
this is why we made special handling to download the unofficial binary
to build. now this is no longer needed so we removed that from the cmake
file.
### changes to tsconfig.json
`node16` module resolution supports async import and `es2020` as target
supports top level await.
### Description
<!-- Describe your changes. -->
AMX isn't supportted until assembler 2.40 even though the GCC frontend
supports it.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Convolution for fp16 datatype. Use NHWC for computation. For NCHW input,
it rearranges the input tensor to NHWC format before computing the
result.
Support two optional fusion:
1. Activation
2. Add (not yet implemented)
### Motivation and Context
Accelerating fp16 inference
### Description
<!-- Describe your changes. -->
We are introducing the FasterTransfomer model-level integration using
ORT [custom op runtime
wrapper](https://github.com/microsoft/onnxruntime/pull/13427).
In order to make the FT wrapper/integration work, two things need to be
done:
- New API `KernelInfoGetConstantInput_tensor`. (Done in this PR)
During custom op kernel initialization, it needs to get the model
weights (saved as node's constant inputs) ready for FT's weights
instantiation. What's why we need to add this new API to make kernel
info capable of getting constant inputs.
- Custom op and custom op kernel to wrap FT model. (Will provide in
onnxruntime extensions or inference examples)
During custom op kernel initialization, it can fetch attributes from
kernel info to determine which kind of FT model instance create. During
custom op kernel compute/inference, it can get input/output from kernel
context and then assign input/output buffers for model instance to run.
### Description
Add logging APIs for custom ops.
This PR introduces a `OrtLogger` type, which can be retrieved from a
`OrtKernelInfo` or `OrtKernelContext`. The kernel info's logger is the session logger stored
in the execution provider. The kernel context's logger is a run logger.
### Motivation and Context
Allows custom ops to log information in a manner consistent with
built-in ops.
Example usage in custom op:
```C++
struct MyCustomKernel {
MyCustomKernel(const OrtApi& api, const OrtKernelInfo* info) {
Ort::ConstKernelInfo kinfo(info);
this->logger_ = kinfo.GetLogger();
// ...
ORT_CXX_LOGF_NOEXCEPT(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_ERROR, "Error: %s", err_msg);
}
void Compute(OrtKernelContext* context) {
ORT_CXX_LOG(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_VERBOSE, "Calling compute...");
// ...
}
// ...
private:
Ort::Logger logger_;
};
```
### Description
<!-- Describe your changes. -->
This fix macos packaging build on universal2 arch.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add BiasSplitGelu/BiasAdd/GroupNorm/NhwcConv operator for ROCm EP.
1. BiasSplitGelu and BiasAdd operators can be automatically hipified
from CUDA EP.
2. GroupNorm was hipified from CUDA EP and modified to build.
3. NhwcConv is similar to NhwcConv in CUDA EP, But the MIOpen API and
cuDnn API are different. `miopenConvolutionForwardbias` and
`miopenOpTensor` of MIOpen doesn't support NHWC layout now, use
BinaryElementwise to replace miopenConvolutionForwardbias(NHWC layout).
### Description
QNN EP:
- Adds the
[InstanceNormalization](https://onnx.ai/onnx/operators/onnx__InstanceNormalization.html)
operator to QNN EP.
- Fixes graph composition bug when Transpose node is the last node in a
graph.
- Adds check for input shape when GetCapability is called (before and
after layout transformation)
- Should add similar checks for other layout sensitive ops (conv, pool,
...) in a separate PR
- Adds initial QNN op tests for QDQ conv and QDQ InstanceNormalization
- Should add tests for other ops in a separate PR
Optimizer:
- Makes InstanceNormalization a layout sensitive operator.
- Adds a custom QDQ group selector for InstanceNormalization.
Quantization tool:
- Adds QDQ support for InstanceNormalization operator.
- Adds python unit test for InstanceNormalization quantization.
### Motivation and Context
Needed to support stable diffusion models with QNN.
---------
Co-authored-by: Hector Li <hecli@microsoft.com>
### Description
This will enable a user to use a TensorRT timing cache based on #10297
to accelerate build times on a device with the same compute capability.
This will work across models as it simply store kernel runtimes for
specific configurations. Those files are usually very small (only a few
MB) which makes them very easy to ship with an application to accelerate
the build time on the user end.
### Motivation and Context
Especially for workstation use cases TRT build times can be a roadblock.
With a few model from ONNX model zoo i evaluated speedups when a timing
cache is present.
`./build/onnxruntime_perf_test -e tensorrt -I -t 5 -i
"trt_timing_cache_enable|true" <onnx_path>`
|Model | no Cache | with Cache|
| ------------- | ------------- | ------------- |
|efficientnet-lite4-11 | 34.6 s | 7.7 s|
|yolov4 | 108.62 s | 9.4 s|
To capture this is had to modify the onnxruntime_perf_test. The time is
sometimes not captured within "Session creation time cost:" which is why
i introduced "First inference time cost:".
---------
Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
- Update Gradle version used in most places from 6.8.3 to 8.0.1. Update Android Gradle Plugin version where applicable.
Not updated in this change: React Native Android projects (under `js/react_native/`). That can be done later along with updating the React Native projects.
- Add Gradle wrapper in `java/` to make it easier to consistently use a specific Gradle version.
### Description
<!-- Describe your changes. -->
Consume ONNX 1.13.1 in ONNX Runtime. (ONNX 1.13.0 to ONNX 1.13.1)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ONNX 1.13.1 patch was just released yesterday. This PR is making ORT's
ONNX submodule consistent with the latest released ONNX. Not sure
whether this PR is really needed, but let me make it ready. Previous PR
for testing ONNX 1.13.1rc2 :
https://github.com/microsoft/onnxruntime/pull/14634.
Fixed
[AB#13174](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13174)
.
### Description
<!-- Describe your changes. -->
1. add a build flag for rocblas tuning feature.
2. fix a build bug when enable rocblas tuning.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The rocblas tunning feature has no build flag to control, only using a
MACRO flag.
So I add an build flag, and fix a code bug when enable rocblas tunning.
### Description
allow onnxruntime_test_all to run in browser for WebAssembly build (use
flag `--wasm_run_tests_in_browser`).
To output the logs from stdout correctly, this test needs to be build
with `--enable_wasm_threads`.
### Description
* Support flag 'optimizedModelFilePath' in session options.
In Node.js, the model will be saved into filesystem just like its
behaviour on native platforms.
In browser, the new model is not saved to filesystem. the file path is
ignored. Instead, a new pop-up window will be launched in browser and
user can 'save' the file as onnx model.
* Add corresponding commandline args for the following session option
flags:
- optimizedModelFilePath
- graphOptimizationLevel
### Description
<!-- Describe your changes. -->
Merging extensions from Git submodule to cmake FetchContent
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Jian Chen <jchen351@MacBook-Pro.local>
### Description
Update oneDNN version from 2.7 to 3.0
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
1. Add Softmax warpwise_forward into SoftmaxTunableOp.
2. Set Softmax op use tunableOp as optional and use original
implementation by default.
3. There are some other operators use `dispatch_warpwise_softmax_forward
/dispatch_warpwise_softmax_forward/ SoftMaxComputeHelper ` directly. But
they only have files under cuda directory, adding `RocmTuningContext `
for these files requires copying and modifying hipified files. Now only
set RocmTuningContext as nullptr by default and not hipified other
operators.
Related PR: https://github.com/microsoft/onnxruntime/pull/14541
---------
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel,
and ARM64 NEON kernel.
### Motivation and Context
First step in creating native support of fp16 model inferencing on ARM64
and AMD64 platforms.
---------
Co-authored-by: Chen Fu <fuchen@microsoft.com>
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
Previously `include_directories(${TENSORRT_INCLUDE_DIR})` was only done
if `onnxruntime_USE_TENSORRT_BUILTIN_PARSER` was false. This would cause
a build failure when the switch was true as the include directory was
not added.
### Motivation and Context
Fixes TRT build when `onnxruntime_USE_TENSORRT_BUILTIN_PARSER` is true.
---------
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
**Description**:
Support new feature of TVM Virtual Machine (method `set_outputs`) on TVM
Execution Provider side. It allows to avoid excess copying from TVM EP
output tensor to ONNX Runtime one
**Motivation and Context**
Tests with multiple output topologies and big output tensors shows that
there is overheads spent on copying from TVM EP to ONNX Runtime.
Returning output(s) on preallocated memory for VirtualMachine was
implemented on TVM side.
**Details**
`set_output_zero_copy` provider option for TVM EP switches on/off this
feature. It is true by default.
The feature works for both GraphExecutor and VirtualMachine from TVM.
---------
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
### Description
Fix SNPE build issue caused by cmake dependency refactor
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
fix issue: https://github.com/microsoft/onnxruntime/pull/14547
### Description
Introduce collective ops into onnxruntime inference build, including
1) AllReduce and AllGather schema in contrib op, controlled by USE_MPI
flag
2) AllReduce and AllGather kernel in cuda EP, controlled by ORT_USE_NCCL
flag
### Motivation and Context
Enable the collective ops in onnxruntime inference build so we have the
ability to run distributed inference with multiple GPUs.
The original ncclAllReduce ops in training build require quite complex
configurations, which is not suitable for inference case, and it already
broken. so we introduce a new implementation.
---------
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
<!-- Describe your changes. -->
1. fix a bug in relative position bias kernel where seq_len > 32
2. rename extra_add_qk to relative_position_bias
3. support relative_position_bias in multihead attention (B, N, S, S*)
4. gru_gate support by Lei
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
### Description
Reduce the cuda library size by:
1. refactoring beam_search_top_k to reduce template instantiation. It
saves ~56MB
2. opt out TopK for type uint*, int8_t and int16_t. It saves ~50MB.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This is a follow-up of
https://github.com/microsoft/onnxruntime/pull/14428 for Stable Diffusion
CUDA optimizations:
(1) use NchwConv to replace Conv in onnx graph and add Tranpose nodes
accordingly
(2) reduce sequential Transpose nodes to at most one.
(3) symbolic shape infer of NchwConv
(4) fix add bias transpose which causes CUDA error (launching more than
1024 threads per block) in inferencing fp32 model.
(5) add models (bert, bart, stable_diffusion subdirectories) to package;
(6) remove option --disable_channels_last
Note that
(1) We can add a few graph transformations to reduce Transpose nodes
further. It is not done in this PR due to time limit.
(2) Stable diffusion 2.1 model outputs black images. It seems that
forcing Attention to float32 could avoid the issue. However it is much
slow to use float32 Attention.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Making basic porting effort to run Sampling UT on ROCm ep, based on the
commits:
https://github.com/microsoft/onnxruntime/pull/13426https://github.com/microsoft/onnxruntime/pull/14218
1. enabling EmbedLayerNorm op
2. enabling Sampling op
3. enabling helpers to copy data from CPU->GPU for subgraph
This task is the first checkpoint. There could be other missing ops when
testing a real model.
We will migrate more code onto ROCm as needed.
Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Only link mpi when either use_mpi or use_nccl enabled
To fix the issue https://github.com/microsoft/onnxruntime/issues/14278.
Talked with @askhade, we think if users want to enable NCCL/MPi but MPI
is not found, it should be failure instead of warning.
So this PR made the change. As a result, to make CIs pass, we need
disable NCCL/MPI explicitly in the build command. This PR take an
alternative approach, e.g. since NCCL and MPi are not used for
customers, disable NCCL by default if "--disable_nccl" not specified,
disable MPI by default if "--use_mpi" not specified.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add stable diffusion CUDA kernel optimizations.
The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of
TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv
operator uses NCHW format).
(3) Update MultiHeadAttention (packed kv and no bias) for cross
attention. This could avoid transpose of kv for TRT fused cross
attention kernel.
(4) Optimization and benchmark script
Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents
Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not be
applicable to any input size or dimensions. For example, BiasSplitGelu
requires hidden size to be 2560 | 5120 | 10240, and NhwcConv assumes 4D
input/weight.
There is minor increasement of binary size. For SM=75 only, python
package wheel size adds (33757K - 33640K) = 117 KB. It is possible to
move NHWC from template parameter to constructor to reduce binary size
(with slight cost of performance).
Note: for RTX 4090/4080/4070 Ti, need build with CUDA 11.8 and latest
cuDNN to get best performance.
### Description
Fix not working REMOVE_ITEM.
`onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc` is hipyfied from
`onnxruntime/contrib_ops/cuda/aten_ops/aten_op.cc`.
The file correct path is
`${CMAKE_CURRENT_BINARY_DIR}/amdgpu/onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc`
and it exists in hipyfied source files list
`onnxruntime_rocm_generated_contrib_ops_cc_srcs`.
A better way to fix it: If we don't want to build a file. Add it into
hipify excluded files and will not hipify it.
(cherry picked from commit 414b73a02123b672e496326664cd2dc3bd6c6d24)
### Rework for PR https://github.com/microsoft/onnxruntime/pull/14068:
Enable multiple step run for adamw tests (on device training)
### Removed duplicated MACRO checks for training.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* Added the OrtDnnlProviderOptions structure to expose configuration
options to the user
* The number of threads can be defined by the user with the -i flag on
the perftest
* Number of threads can also be configured via the OMP_NUM_THREADS
environment variable
* The number of threads defined in the OrtDnnlProviderOptions is
prioritized over the environment variable
### Description
Avoids thread oversubscription caused by OpenMP allocating the maximum
number of threads possible for oneDNN EP. Added support for the
OrtDnnlProviderOptions, this will allow for more EP customization
capabilities, and allows for user defined number of threads.
### Motivation and Context
- Improves performances and allows for user to fine tune the number of
threads
### Description
upgrade protobuf to 3.20.2, same as onnx 1.13.0
### Motivation and Context
Per component governance requirement and Fixes#14060
unused-parameter error occurs in 2 conditions.
1. compile protolbuf
`onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66:
error: unused parameter ‘prototype’ [-Werror=unused-parameter]`
2. include onnx_pb.h
```
2023-01-28T10:20:15.0410853Z FAILED: CMakeFiles/onnxruntime_pybind11_state.dir/onnxruntime_src/onnxruntime/python/onnxruntime_pybind_iobinding.cc.o
......
2023-01-28T10:20:15.0466024Z from /build/Debug/_deps/onnx-src/onnx/onnx_pb.h:51,
2023-01-28T10:20:15.0466958Z from /onnxruntime_src/include/onnxruntime/core/framework/to_tensor_proto_element_type.h:10,
....
2023-01-28T10:20:15.0609678Z /build/Debug/_deps/onnx-build/onnx/onnx-operators-ml.pb.h:1178:25: required from here
2023-01-28T10:20:15.0610895Z /onnxruntime_src/cmake/external/protobuf/src/google/protobuf/repeated_ptr_field.h:752:66: error: unused parameter ‘prototype’ [-Werror=unused-parameter]
2023-01-28T10:20:15.0611707Z cc1plus: all warnings being treated as errors
```
https://dev.azure.com/onnxruntime/2a773b67-e88b-4c7f-9fc0-87d31fea8ef2/_apis/build/builds/874605/logs/22
### Fix build error on Windows when building with "
--enable_language_interop_ops -cmake_extra_defines
onnxruntime_DISABLE_ABSEIL=ON"
This is a subsequent fix after
https://github.com/microsoft/onnxruntime/pull/14309, which fixed build
for onnxruntime_DISABLE_ABSEIL=ON build.
Going furthur, if we enable --enable_language_interop_ops, there are
following two errors:
```
test_symm_qgemm.cpp
test_transpose.cpp
onnxruntime_session.lib(inference_session.obj) : error LNK2019: unresolved external symbol "void __cdecl onnxruntime::L
oadInterOp(class std::basic_string<wchar_t,struct std::char_traits<wchar_t>,class std::allocator<wchar_t> > const &,cla
ss std::vector<struct Ort::CustomOpDomain,class std::allocator<struct Ort::CustomOpDomain> > &,class std::function<void
__cdecl(char const *)> const &)" (?LoadInterOp@onnxruntime@@YAXAEBV?$basic_string@_WU?$char_traits@_W@std@@V?$allocato
r@_W@2@@std@@AEAV?$vector@UCustomOpDomain@Ort@@V?$allocator@UCustomOpDomain@Ort@@@std@@@3@AEBV?$function@$$A6AXPEBD@Z@3
@@Z) referenced in function "public: __cdecl <lambda_f3a907e0b0a0e11d80d305605215cce8>::operator()(class std::shared_pt
r<class onnxruntime::Model> &)const " (??R<lambda_f3a907e0b0a0e11d80d305605215cce8>@@QEBA@AEAV?$shared_ptr@VModel@onnxr
untime@@@std@@@Z) [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.vcxproj]
onnxruntime_session.lib(inference_session.obj) : error LNK2019: unresolved external symbol "void __cdecl onnxruntime::L
oadInterOp(class onnx::ModelProto const &,class std::vector<struct Ort::CustomOpDomain,class std::allocator<struct Ort:
:CustomOpDomain> > &,class std::function<void __cdecl(char const *)> const &)" (?LoadInterOp@onnxruntime@@YAXAEBVModelP
roto@onnx@@AEAV?$vector@UCustomOpDomain@Ort@@V?$allocator@UCustomOpDomain@Ort@@@std@@@std@@AEBV?$function@$$A6AXPEBD@Z@
5@@Z) referenced in function "public: __cdecl <lambda_340b7b787b9c0f81848d348e60fe6c91>::operator()(class std::shared_p
tr<class onnxruntime::Model> &)const " (??R<lambda_340b7b787b9c0f81848d348e60fe6c91>@@QEBA@AEAV?$shared_ptr@VModel@onnx
runtime@@@std@@@Z) [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.vcxproj]
C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxruntime_test_trainer.exe : fatal error
LNK1120: 2 unresolved externals [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_trainer.
vcxproj]
onnxruntime.vcxproj -> C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxruntime.dll
onnxruntime_test_utils.vcxproj -> C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\RelWithDebInfo\onnxrun
time_test_utils.lib
CUDACOMPILE : nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may
be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). [C:\Users\pengwa\dev\onnxruntime
\build\Windows\RelWithDebInfo\custom_op_library.vcxproj]
cuda_ops.cu
CUDACOMPILE : nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may
be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). [C:\Users\pengwa\dev\onnxruntime
\build\Windows\RelWithDebInfo\onnxruntime_test_cuda_ops_lib.vcxproj]
```
```
kernel_type_str_resolver_utils_test.cc
local_kernel_registry_test.cc
C:\Users\pengwa\dev\onnxruntime\onnxruntime\test\framework\allocation_planner_test.cc(1388,9): error C2220: the followin
g warning is treated as an error [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebInfo\onnxruntime_test_all.vcxp
roj]
C:\Users\pengwa\dev\onnxruntime\onnxruntime\test\framework\allocation_planner_test.cc(1388,9): warning C4067: unexpected
tokens following preprocessor directive - expected a newline [C:\Users\pengwa\dev\onnxruntime\build\Windows\RelWithDebI
nfo\onnxruntime_test_all.vcxproj]
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Updated DirectML version to 1.10.1
(https://www.nuget.org/packages/Microsoft.AI.DirectML/1.10.1)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add memory efficient attention from CUTLASS.
TODO (in next pull request):
(1) Need performance tests on different GPUs, then add a sequence length
threshold (only activate it for long sequence length).
(2) Merge changes from https://github.com/NVIDIA/cutlass/pull/773 when
it is in cutlass master.
### Description
Adds the below C APIs to support custom ops that wrap an entire model to
be inferenced with an external runtime. The current SNPE EP is an
example of an EP that could be ported to use a custom op wrapper. Ex:
The custom op stores the serialized SNPE DLC binary as a string
attribute. The SNPE model is built when the kernel is created. The model
is inferenced with SNPE APIs on call to the kernel's compute method.
#### C APIs
| API | Description | Why |
| --- | --- | --- |
| `KernelInfo_GetInputCount` | Gets number of inputs from
`OrtKernelInfo`. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfo_GetOutputCount` | Gets number of outputs from
`OrtKernelInfo`. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfo_GetInputName` | Gets an input's name. | Query I/O
characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputName` | Gets an output's name. | Query I/O
characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputTypeInfo` | Gets the type/shape information for an
input. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputTypeInfo` | Gets the type/shape information for
an output. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfoGetAttribute_tensor` | Get a OrtValue tensor stored as an
attribute in the graph node | Extract serialized models, weights, etc. |
| `GetSessionConfigEntry` | Get a session configuration value | Need to
be able to get session-time configurations from within custom op |
| `HasSessionConfigEntry` | Check if session configuration entry exists.
| Need to be able to get session-time configurations from within custom
op |
#### Why so many KernelInfo APIs?<sup>1</sup>
Similar APIs currently exist for `OrtKernelContext`, but not
`OrtKernelInfo`. Note that `OrtKernelContext` is passed to the custom op
on call to its kernel's compute() function. However, `OrtKernelInfo` is
available on kernel creation, which occurs when the session is created.
Having these APIs available from `OrtKernelInfo` allows an operator to
trade-off computation time for session-creation time, and vice versa.
Operators that must build expensive state may prefer to do it during
session creation time instead of compute-time.
SNPE is an example of an EP that needs to be able to query `KernelInfo`
for the name, type, and shape of inputs and outputs in order to build
the model from the serialized DLC data. This is an expensive operation.
Other providers (e.g., OpenVINO) are able to query i/o info from the
serialized model, so they do not strictly need these APIs. However, the
APIs can still be used to validate the expected I/O characteristics.
Additionally, several of our CPU contrib ops currently use the same
internal version of these KernelInfo APIs (Ex:
[qlinear_softmax](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc#L71)).
If custom ops are also meant to be a test bed for future ops, then all
custom ops (not just runtime wrappers) would benefit from the addition
of these public KernelInfo APIs (IMO).
#### Example of usage in a custom OP
From
`onnxruntime/test/testdata/custom_op_openvino_wrapper_library/openvino_wrapper.h`
```c++
struct CustomOpOpenVINO : Ort::CustomOpBase<CustomOpOpenVINO, KernelOpenVINO> {
explicit CustomOpOpenVINO(Ort::ConstSessionOptions session_options);
CustomOpOpenVINO(const CustomOpOpenVINO&) = delete;
CustomOpOpenVINO& operator=(const CustomOpOpenVINO&) = delete;
void* CreateKernel(const OrtApi& api, const OrtKernelInfo* info) const;
constexpr const char* GetName() const noexcept {
return "OpenVINO_Wrapper";
}
constexpr const char* GetExecutionProviderType() const noexcept {
return "CPUExecutionProvider";
}
// IMPORTANT: In order to wrap a generic runtime-specific model, the custom operator
// must have a non-homogeneous variadic input and output.
constexpr size_t GetInputTypeCount() const noexcept {
return 1;
}
constexpr size_t GetOutputTypeCount() const noexcept {
return 1;
}
constexpr ONNXTensorElementDataType GetInputType(size_t /* index */) const noexcept {
return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
}
constexpr ONNXTensorElementDataType GetOutputType(size_t /* index */) const noexcept {
return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
}
constexpr OrtCustomOpInputOutputCharacteristic GetInputCharacteristic(size_t /* index */) const noexcept {
return INPUT_OUTPUT_VARIADIC;
}
constexpr OrtCustomOpInputOutputCharacteristic GetOutputCharacteristic(size_t /* index */) const noexcept {
return INPUT_OUTPUT_VARIADIC;
}
constexpr bool GetVariadicInputHomogeneity() const noexcept {
return false; // heterogenous
}
constexpr bool GetVariadicOutputHomogeneity() const noexcept {
return false; // heterogeneous
}
std::vector<std::string> GetSessionConfigKeys() const { return {"device_type"}; }
private:
std::unordered_map<std::string, std::string> session_configs_;
};
```
#### How to create a session:
```c++
Ort::Env env;
Ort::SessionOptions session_opts;
Ort::CustomOpConfigs custom_op_configs;
// Create local session config entries for the custom op.
custom_op_configs.AddConfig("OpenVINO_Wrapper", "device_type", "CPU");
// Register custom op library and pass in the custom op configs (optional).
session_opts.RegisterCustomOpsLibrary(lib_name, custom_op_configs);
Ort::Session session(env, model_path.data(), session_opts);
```
### Motivation and Context
Allows creation of simple "wrapper" EPs outside of the main ORT code
base.
### Description
<!-- Describe your changes. -->
If a user installs the debug libraries from Python on Windows the ORT
python project file attempts to use the debug python lib, which
conflicts with a pragma in pyconfig.h that wants the release lib (due to
pybind11 undefining _DEBUG).
Explicitly use the release lib instead of Python::Module so the build
doesn't break.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix obtuse build break.
### Description
Add FusedMatMul
### Motivation and Context
- Add the FusedMatMul fusion for DML
- Fix the FusedMatMul logic and tests when transposed batches are
involved
### Description
<!-- Describe your changes. -->
Use dlsym/GetProcAddress to lookup a custom ops registration function by
name and call it.
This will be better on mobile platforms where the custom ops library is
linked against, and there isn't necessarily a filesystem that a library
path can be loaded from.
Alternative is to wire up passing in the address of the function, but
that has multiple complications which differ by platform.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable using ort and ort-ext packages on mobile platforms.
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Changes to incorporate OpenVINO EP 2022.3
### Motivation and Context
This change is required to incorportate OpenVINO EP 2022.3
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Aravind <aravindx.gunda@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: flexci <mohsinmx>
Use json format to save and load partition config, previously it was
csv, which brought issues among windows and posix due to different line
breaks.
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Enable creating dedicated build for on device training. With this PR we
can build a lean binary for on device training using flag
--enable_training_apis. This binary includes only the essentials like
training ops, optimizers etc and NOT features like Aten fallback,
strided tensors, gradient builders etc . This binary also removes all
the deprecated components like training::TrainingSession and OrtTrainer
etc
### Motivation and Context
This enables our partners to create a lean binary for on device
training.
### Description
<!-- Describe your changes. -->
rename the CrossAttention to MultiheadAttention since this op can also
be used as self attention
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Move separated Q, K and V (without input projection) from Attention to a
new operator CrossAttention.
The Attention operator is hard to maintain when we need support with and
without input projection in one class. Add a new operator according to
feedback.
Some change might need in the future, but not in this PR:
(1) bias could be optional (We will not proceed that route unless
experiments show that fusing Add bias with MatMul instead of this op
could improve performance).
(2) support packed KV. There are two ways to support it: when key and
value are same Tensor, they are packed; or we can make value as
optional, and use packed mode when value is empty and the key has packed
K/V.
(3) support cached key and value, and other (like relative position
bias), or more attention mask format. They can be added easily without
breaking backward compatible.
(4) ROCm/CPU implementation of this op.
### Description
<!-- Describe your changes. -->
Add GemmFastGelu CK implementation.
TODO
1. The performance of CK GemmFastGelu in ORT is not good as using CK
directly, still need to investigate the reason and improve the CK in
ORT.
`GemmFastGeluUnfused float16 NN m=49152 n=3072 k=768 2298.8064 us 100.89
tflops`
`withbias DeviceGemmMultipleD_Xdl_CShuffle<256, 256, 128, 32, 8, 8,
Default> LoopScheduler: Default, PipelineVersion: v1 float16 NN m=49152
n=3072 k=768 2401.9799 us 96.56 tflops`
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`
### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).
The new API automatically cleans up the library and allows the user to
write more straightforward code.
### Description
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495
Changes including:
- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr
### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
1. Renames all references of on device training to training apis. This
is to keep the naming general. Nothing really prevents us from using the
same apis on servers\non-edge devices.
2. Update ENABLE_TRAINING option: With this PR when this option is
enabled, training apis and torch interop is also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
- Removed user facing option
- Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop.
Once this PR is merged when --enable_training is selected we will do a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs
Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (Front end tools for training artifacts prep when using
trianing apis)
### Motivation and Context
Intention is to simply the options for building training enabled builds.
This is part of the larger work item to create dedicated build for
learning on the edge scenarios with just training apis enabled.
Implement CloudEP for hybrid inferencing.
The PR introduces zero new API, customers could configure session and
run options to do inferencing with Azure [triton
endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint)
Sample configuration in python be like:
```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton');
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com');
sess_opt.add_session_config_entry('cloud.model_name', 'detection2');
sess_opt.add_session_config_entry('cloud.model_version', '7'); // optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1'); // optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1') # 0 for local inferencing, 1 for cloud endpoint.
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input':input_}, run_opt)
```
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
It's from the PR #14085
On multiple running msbuilds , it throws the exception of
```
22-12-30T16:35:34.2423207Z ##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(155,5): Error MSB3073: The command "setlocal
"C:\Program Files\CMake\bin\cmake.exe" -E copy D:/a/_work/1/b/RelWithDebInfo/dnnl/install/bin/dnnl.dll D:/a/_work/1/b/RelWithDebInfo/RelWithDebInfo
if %errorlevel% neq 0 goto :cmEnd
:cmEnd
endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone
:cmErrorLevel
exit /b %1
:cmDone
if %errorlevel% neq 0 goto :VCEnd
:VCEnd" exited with code 1.
```
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=847423&view=logs&j=249e9d58-0012-5814-27cf-6a201adbd9cf&t=182b9780-832e-5dcb-3957-d6aa3ece582f
It should make sure that the onnxruntime_test_all project depends on
dnnl project.
### Description
Remove Abseil module placement specifications
### Motivation and Context
Allow Cmake defaults take place and possible redirection of all
submodules for sharing between the local builds.
### Description
<!-- Describe your changes. -->
Sampling op for cpu and cuda
support huggingface case and custom case
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Optimize computation orders
In `Roberta/Electra`, when `ClassificationHead` is used, there is
slicing operation on features on sequence_length dimensions, then loss
calculations only depend on this sliced data. This is a slicing at axis
1. Before slicing the shape is [batch, sequence_length, hidden], after
slicing, it becomes [batch , hidden_stage]
We had opportunities to bring this slicing earlier as much as possible,
by passing through simple elementwise ops (like Add/Div), or
Layernorm/Softmax(if their reduce axis is after the slicing axis), or
even MatMul's the left operand (if only it did not affect the last
dims).
For operators like Reshape/Transpose, it is special since they have
either data specified (after slicing we need update), or they have perm
specified, which requires the input rank remain unchanged. So for those
kinds of operators, we can remain the original rank, but just leave the
sliced dim to be 1, after the compute completed, we do a Squeeze.
```
class RobertaClassificationHead(nn.Module):
"""Head for sentence-level classification tasks."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
classifier_dropout = (
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
)
self.dropout = nn.Dropout(classifier_dropout)
self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
def forward(self, features, **kwargs):
x = features[:, 0, :] # take <s> token (equiv. to [CLS])
x = self.dropout(x)
x = self.dense(x)
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
return x
```
src\transformers\models\roberta\modeling_roberta.py
src\transformers\models\electra\modeling_electra.py
#### Benchmark
A simple benchmark shows Robeta training latency dropped from 208ms ~
199ms. 4.5+% reduction.
More comprehensive tests are on the way.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Use target name for flatbuffers.
Add version range for flatbuffers. It is similar to #13870
### Motivation and Context
To fix a build error:
```
CMake Error at onnxruntime_graph.cmake:88 (add_dependencies):
The dependency target "flatbuffers" of target "onnxruntime_graph" does not
exist.
Call Stack (most recent call first):
CMakeLists.txt:1490 (include)
```
It happens when flatbuffers library is already installed. For example,
on Ubuntu people may get it from apt-get. But, the one provided by
Ubuntu 20.04 is not compatible with our code. The one in Ubuntu 22.04
works fine.
### Description
Update absl to a new version
### Motivation and Context
The new version contains fixes that are needed for Nvidia GPU build.
Once we update it to that version, we don't need to maintain our private
patches for Nvidia GPU build.
Fix error: builtin __has_trivial_destructor is deprecated; use __is_trivially_destructible instead [-Werror,-Wdeprecated-builtins]
This is not a clean fix as in 13783, users will need to manually set `CMAKE_HIP_FLAGS="-Wno-deprecated-builtins"` if they want to use self-built hipclang combining with ROCm 5.3.* or older.
### Description
Add the ability to run graph
### Motivation and Context
A brief description is as follows:
1) If the whole graph is supported, then will be processed by the graph
engine, directly.
2) If the whole graph is not supported, the whole graph will be divided
into subgraphs and single operators; The sub-graphs will be run on graph
engine, and the single operators will fallback to the traditional mode.
**Description**: This PR including following works:
1. provide stream and related synchronization abstractions in
onnxruntime.
2. enhance onnxruntime's execution planner / executor / memory arena to
support execute multiple streams in parallel.
3. deprecate the parallel executor for cpu.
4. deprecate the Fence mechanism.
5. update the cuda / tensorrt EP to support the stream mechanism,
support running different request in different cuda stream.
**Motivation and Context**
- Why is this change required?
currently, the execution plan is just a linear list of those primitives,
ort will execute them step by step. For any given graph, ORT will
serialize it to a fixed execution order. This sequential execution
design simplifies most scenarios, but it has the following limitations:
1. it is difficult to enable inter-node parallelization, we have a
half-baked parallel executor but it is very difficult to make it work
with GPU.
2. The fence mechanism can work with single gpu stream + cpu thread
case, but when extend to multiple stream, it is difficult to manage the
cross GPU stream synchronizations.
3. our cuda EP rely on the BFCArena to make the memory management work
with the GPU async kernels, but current BFCArena is not aware of the
streams, so it doesn't behavior correctly when run with multiple
streams.
This PR enhance our existing execution plan and executor to support
multiple stream execution. we use an unified algorithm to mange both
single stream and multiple stream scenarios.
This PR mainly focus on the infrastructure support for multiple stream
execution, that is said, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be in the future PR.
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
Integrate TensorRT 8.5
- Update TensorRT EP to support TensorRT 8.5
- Update relevant CI pipelines
- Disable known non-supported ops for TensorRT
- Make timeout configurable.
We observe more than [20
hours](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=256729&view=logs&j=71ce39d8-054f-502a-dcd0-e89fa9931f40)
of running unit tests with TensorRT 8.5 in package pipelines. Because we
can't use placeholder to significantly reduce testing time (c-api
application test will deadlock) in package pipelines, we only run
subsets of model tests and unit tests that are related to TRT (add new
build flag--test_all_timeout and set it to 72000 seconds by package
pipelines). Just to remember, we still run all the tests in TensorRT CI
pipelines to have full test coverage.
- include https://github.com/microsoft/onnxruntime/pull/13918 to fix
onnx-tensorrt compile error.
Co-authored-by: George Wu <jywu@microsoft.com>
### Description
Fix usage of enable_training_ops and reduce ifdef complexity for
training builds.
### Motivation and Context
This is the second refactoring PR towards creating a dedicated build for
on device training. This PR aims to reduce some complexity. We can set
ENABLE_TRAINING_OPS in cmake when either ENABLE_TRAINING or
ENABLE_TRAINING_ON_DEVICE is selected, this way we dont have to use if
defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE )
everywhere in the code.
- If it fixes an open issue, please link to the issue here. -->
### Description
Use onnxruntime_fetchcontent_makeavailable cmake function for TRT. See
the comment for the reason.
### Motivation and Context
To support a newer TRT version. Previously they have a "BUILD_EXE" build
option to allow us to exclude such things from build. But in
https://github.com/onnx/onnx-tensorrt/pull/879 they deleted the build
option. It wouldn't be a problem if we continue to use git submodules as
before, because cmake's add_subdirectories function has an
"EXCLUDE_FROM_ALL" keyword. However, cmake's FetchContent module
doesn't. That's why I needed to create our own version of the macro.
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.
### Motivation and Context
[PR 13549](https://github.com/microsoft/onnxruntime/pull/13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.
Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.
Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.
[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)
[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)
[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)
Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
### Description
Improve the handling of "/external:I". The
"onnxruntime_external_lib_include_dir" variable may be:
1. A simple file path
2. A cmake generator expression like "$<INSTALL_INTERFACE:include>",
"$<TARGET_PROPERTY:onnx_proto,INTERFACE_INCLUDE_DIRECTORIES>",
"$<BUILD_INTERFACE:xxxx>". It seems that we can't simply put them in to
the "target_compile_options" line. So this PR tries to parse the
expression and extract the part we need out.
### Motivation and Context
Resolve the Github issue: https://github.com/microsoft/onnxruntime/issues/13893
To fix a build error:
/home/xxxxxxxxxxxxx/onnxruntime/build/Linux/Debug/tensorboard/compat/proto/cost_graph.pb.cc:17:8:
error:
‘PROTOBUF_INTERNAL_EXPORT_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto’
does not name a type
17 | extern
PROTOBUF_INTERNAL_EXPORT_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto
::PROTOBUF_NAMESPACE_ID::internal::SCCInfo<1>
scc_info_TensorShapeProto_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto;
### Description
Decouple strided tensor support from ENABLE_TRAINING
### Motivation and Context
This is step 1 for creating a dedicated build for on device training.
Intention is
1. We can set ENABLE_STRIDED_TENSORS in cmake when either
ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, this way we
dont have to use if defined(ENABLE_TRAINING) ||
defined(ENABLE_TRAINING_ON_DEVICE ) everywhere in the code.
2. This also paves the way to easily enable strided tensor support for
inference in future (if required).
### Description
This PR enables building nuget packages locally for on device training
using --build_nuget arg.
This PR also enables the C# bindings by default in the managed package.
If a user triggers any training apis when the native binary is not built
for training, an exception with message "Training is disabled in the
current build. Please build ONNXRuntime from source with the build flags
enable_training and enable_training_on_device. " is thrown.
Build command for creating nuget packes for on device training:
build.bat --enable_training --enable_training_on_device --build_nuget
2 Nuget packages are built
1. Microsoft.ML.OnnxRuntime.Managed
2. Microsoft.ML.OnnxRuntime.Training OR
Microsoft.ML.OnnxRuntime.Training.Gpu
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
## Description
1. Convert some git submodules to cmake external projects
2. Update nsync from
[1.23.0](https://github.com/google/nsync/releases/tag/1.23.0) to
[1.25.0](https://github.com/google/nsync/releases/tag/1.25.0)
3. Update re2 from 2021-06-01 to 2022-06-01
4. Update wil from an old commit to 1.0.220914.1 tag
5. Update gtest to a newer commit so that it can optionally leverage
absl/re2 for parsing command line flags.
The following git submodules are deleted:
1. FP16
2. safeint
3. XNNPACK
4. cxxopts
5. dlpack
7. flatbuffers
8. googlebenchmark
9. json
10. mimalloc
11. mp11
12. pthreadpool
More will come.
## Motivation and Context
There are 3 ways of integrating 3rd party C/C++ libraries into ONNX
Runtime:
1. Install them to a system location, then use cmake's find_package
module to locate them.
2. Use git submodules
6. Use cmake's external projects(externalproject_add).
At first when this project was just started, we considered both option 2
and option 3. We preferred option 2 because:
1. It's easier to handle authentication. At first this project was not
open source, and it had some other non-public dependencies. If we use
git submodule, ADO will handle authentication smoothly. Otherwise we
need to manually pass tokens around and be very careful on not exposing
them in build logs.
2. At that time, cmake fetched dependencies after "cmake" finished
generating vcprojects/makefiles. So it was very difficult to make cflags
consistent. Since cmake 3.11, it has a new command: FetchContent, which
fetches dependencies when it generates vcprojects/makefiles just before
add_subdirectories, so the parent project's variables/settings can be
easily passed to the child projects.
And when the project went on, we had some new concerns:
1. As we started to have more and more EPs and build configs, the number
of submodules grew quickly. For more developers, most ORT submodules are
not relevant to them. They shouldn't need to download all of them.
2. It is impossible to let two different build configs use two different
versions of the same dependency. For example, right now we have protobuf
3.18.3 in the submodules. Then every EP must use the same version.
Whenever we have a need to upgrade protobuf, we need to coordinate
across the whole team and many external developers. I can't manage it
anymore.
3. Some projects want to manage the dependencies in a different way,
either because of their preference or because of compliance
requirements. For example, some Microsoft teams want to use vcpkg, but
we don't want to force every user of onnxruntime using vcpkg.
7. Someone wants to dynamically link to protobuf, but our build script
only does static link.
8. Hard to handle security vulnerabilities. For example, whenever
protobuf has a security patch, we have a lot of things to do. But if we
allowed people to build ORT with a different version of protobuf without
changing ORT"s source code, the customer who build ORT from source will
be able to act on such things in a quicker way. They will not need to
wait ORT having a patch release.
9. Every time we do a release, github will also publish a source file
zip file and a source file tarball for us. But they are not usable,
because they miss submodules.
### New features
After this change, users will be able to:
1. Build the dependencies in the way they want, then install them to
somewhere(for example, /usr or a temp folder).
2. Or download the dependencies by using cmake commands from these
dependencies official website
3. Similar to the above, but use your private mirrors to migrate supply
chain risks.
4. Use different versions of the dependencies, as long as our source
code is compatible with them. For example, you may use you can't use
protobuf 3.20.x as they need code changes in ONNX Runtime.
6. Only download the things the current build needs.
10. Avoid building external dependencies again and again in every build.
### Breaking change
The onnxruntime_PREFER_SYSTEM_LIB build option is removed you could think from now
it is default ON. If you don't like the new behavior, you can set FETCHCONTENT_TRY_FIND_PACKAGE_MODE to NEVER.
Besides, for who relied on the onnxruntime_PREFER_SYSTEM_LIB build
option, please be aware that this PR will change find_package calls from
Module mode to Config mode. For example, in the past if you have
installed protobuf from apt-get from ubuntu 20.04's official repo,
find_package can find it and use it. But after this PR, it won't. This
is because that protobuf version provided by Ubuntu 20.04 is too old to
support the "config mode". It can be resolved by getting a newer version
of protobuf from somewhere.
### Description
1. Move C/C++ deps' URLs to deps.txt, and download the dependencies from
Azure Devops Artifacts instead of github.
2. Add "EXCLUDE_FROM_ALL" keyword to the cmake external projects, so
that we only build the parts we need and avoid installing the 3rd-party
dependencies when people run `make install` in ORT's build directory.
However, at this moment cmake itself doesn't have the feature. So I
copied their code to cmake/external/helper_functions.cmake and modified
it.
This PR is split from #13523, to make that one smaller.
### Motivation and Context
1. Secure the supply chain
2. Make it be possible to automatically detect if ORT has an old
dependency that hasn't been updated from a long time.
fix for https://github.com/microsoft/onnxruntime/issues/13383,
https://github.com/microsoft/onnxruntime/issues/13408
Currently ort-web doesn't catch exceptions because turning on exception
catching increases the binary size by 3MB (~30%).
But ort can throw (ie onnx errors or ORT_ENFORCE) and there is no
useable error message.
Turning on exception catching just for top level api released file will
fix the error messages at minimal increase of binary size.
Patch Protobuf and ONNX's cmake files and enforce BinSkim check.
This PR has overlap with #13523 . I would prefer to get this one merged
first so that we can finished the BinSkim work, and I try to make this
PR as small as possible.
### Description
After this change, you will see GSL.natvis and wil.nativs files will be
added to every onnxruntime_xxx project.
Like this:

This is because in onnxruntime_common.cmake we have:
```cmake
if (MSVC)
set(ABSEIL_NATVIS_FILE "abseil-cpp.natvis")
target_sources(
onnxruntime_common
INTERFACE $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/external/${ABSEIL_NATVIS_FILE}>)
endif()
```
It sets a property, INTERFACE_SOURCES, on the target
"onnxruntime_common".
Then if anyone else uses:
```
target_link_libraries(mytarget PRIVATE onnxruntime_common)
```
The nativis file will be added to `mytarget`.
However, in this project we don't use such things for the targets that
are static libraries. For example, onnxruntime_graph is a static
library.
Instead, we use the `onnxruntime_add_include_to_target ` function to
explicitly control what we want to propagate . The function was written
before we started to have nativis files. So it doesn't pass a source
file from one static library to another. Now we have the need. Probably
only for Windows.
### Motivation and Context
Add natvis files to every project.
### Description
Fix round 6
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add a DML registration for Shape to avoid copying back to the CPU just
to get the shape of a GPU tensor.
### Motivation and Context
When using free dimensions, many Transformers models extensively use the
`Shape` operator. This causes hundreds of GPU->CPU copy that should be
completely avoidable. Note that this change also uses the same
heuristics as other providers (e.g. CUDA) to force some tensors on the
CPU in certain situations.
Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
### Description
In the TVM EP, this adds more entries to the conversion from
`ONNXTensorElementDataType` to `DLDataType`. Additionally, it removes an
unused function and updates the TVM revision to allow running models
from recent revisions of TVM.
### Motivation and Context
In the TVM EP, the mapping from `ONNXTensorElementDataType` to
`DLDataType` was incomplete and neglected several integer types (in
particular `ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT8` and
`ONNX_TENSOR_ELEMENT_DATA_TYPE_INT8`) which prevented some models from
running.
Co-authored-by: Peter Salas <psalas@octoml.ai>
1. Remove the cmake option onnxruntime_DEV_MODE and replace it with
"--compile-no-warning-as-error"
2. Suppress some GSL warnings because now we treat nvcc diag warnings as
errors
### Description
Upgrade cmake version to 3.24 because I need to use a new feature that
is only provided in that version and later. Starting from cmake 3.24,
the
[FetchContent](https://cmake.org/cmake/help/latest/module/FetchContent.html#module:FetchContent)
module and the
[find_package()](https://cmake.org/cmake/help/latest/command/find_package.html#command:find_package)
command now support integration capabilities, which means calls to
"FetchContent" can be implicitly redirected to "find_package", and vice
versa. Users can use a cmake variable to control the behavior. So, we
don't need to provide such a build option. We can delete our
"onnxruntime_PREFER_SYSTEM_LIB" build option and let cmake handle it.
And it would be easier for who wants to use vcpkg.
### Motivation and Context
Provide a unified package management method, and get aligned with the
community. This change is split from #13523 for easier review.
### Description
This adds bfloat16 support to the oneDNN ep.
When using the oneDNN ep this enables bfloat16 support for the following
ops:
Exp, Sigmoid, Tanh, Relu, MatMul, Gelu, BiasGelu, Add, Sub,
Mul, Div, Div, Sqrt, Pow, ReduceMean, Abs, Cast, Equal, Exp,
FastGelu, FusedMatMul, Gemm, Greter, GreaterOrEqual, LeakyRelu,
Less, LessOrEqual, LRN, ReduceOps, Reshape, Squeeze, Transpose,
and Unsqueeze.
LayerNorm with some internal casting.
BatchNorm only enabled BFloat16 for input and output, scale and bias
still need fp32 input.
Added bfloat16 unit tests for all of the operators in question. When
possible we reused the already existing unit tests that were added by
CUDA and ROCM eps.
In many of the unit tests an unusual pattern will be seen
#if defined(USE_DNNL)
TEST(Test, bfloat16_test) {
#if defined(USE_DNNL)
// oneDNN ep specific code
#endif
//test code
}
#endif
Although it looks unusual this was purposely done if another ep
implements bfloat16 support for that operator they will be able to
enable the unit test by adding there execution provider to the first
line without needing to edit inside the test.
Example: `#if defined(USE_CUDA) || defined(USE_DNNL)` see the
MatMul_float16 test in matmul_test.cc for and example of how this is
useful.
Additionally two new ISA checks (AVX512_BF16 and AMX-BF16) were added to
the cpuid_info code in. This was important to detecting is bfloat16
operations are supported by the CPU.
### Motivation and Context
This expands the capabilities of the oneDNN execution provider to
support models containing bfloat16 operations.
Signed-off-by: George Nash <george.nash@intel.com>
Signed-off-by: Ruihan-Yin <ruihan.yin@intel.com>
1. Update CK to its latest develop branch
2. `-mllvm -amdgpu-early-inline-all=true` is critical to CK's
performance, ensure it is properly configured.
- The flags are propagated from target `hip-lang::device`'s
`INTERFACE_COMPILE_OPTIONS`, we must not manually add the flags.
- Instead, we must ensure this target is properly configured by checking
_CMAKE_HIP_DEVICE_RUNTIME_TARGET is set.
TL,DR
`hip-lang::device` sometime will be not be properly configured if our
`CMAKE_PREFIX_PATH` is not configured carefully. In the CI docker, the
configuration is in good state, but on dev machine it is not, which then
silently result poor performance for kernels. We fixed it in this PR and
add a guard to avoid unsuccessful future editing and to prevent
convoluted debugging process.
`_CMAKE_HIP_DEVICE_RUNTIME_TARGET ` is shared in
`/opt/rocm/lib/cmake/hip-lang/hip-lang-config.cmake` and it is internal
to
[CMake](https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6121/diffs),
the variable name will not be changed in the foreseeable future.
The oneDNN 2.7.1 release includes multiple functional and performance
improvements.
Signed-off-by: George Nash <george.nash@intel.com>
### Description
Update the oneDNN library from 2.7.0 to 2.7.1. This contains multiple
functional and performance improvements.
### Motivation and Context
This is a minor point release from the oneDNN library that gives
performance and functional fixes that were found in the oneDNN 2.7
library shortly after release.
Signed-off-by: George Nash <george.nash@intel.com>
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
### Description
support building xnnpack for IOS
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Currently, hipify happens before cmake is configured and then cmake glob
the directories. This get rids of thoes customized python threading
logic and opt for build system itself to generate the files.
This also supersede the half baked branch
[sukha/hipify-with-cmake](https://github.com/microsoft/onnxruntime/tree/sukha/hipify-with-cmake)
**Description**: Changes to the MIGraphx execution provider code to
allow for stream synchronization on the gpu side
**Motivation and Context**
Performance boost by removing redundant host to device synchronizations
The current implementation of the execution provider continuously calls
hipDeviceSynchronize() between computations which adds overhead and an
idle wait between the GPU's computations. This is noticeable during
device
This change leverages new functionality that's been added to MIGraphX to
allow for GPU side synchronization which avoids the need for
host->device waits.
To maintain backwards compatibility with older MIGraphX versions, the
compile time define MIGRAPHX_STREAM_SYNC has been added to the API to
allow for older version operate with newer builds of onnxruntime without
loss of functionality to the current feature set as of (08/09/22)
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
`python setup.py develop` doesn't install PyTorch as a normal package in
site-packages anymore, and the user must stay at PyTorch's root
directory to call `import torch`. This will break LORT tests because
LORT tests contains `import torch` and are called outside PyTorch root
directory. To make PyTorch a normal package again, this PR build PyTorch
with `python setup.py install`.
Fix warnings and enable dev mode for ROCm CI:
* Fix ROCm headers complaining "This file is deprecated. Use the header file from ..."
* Disable warning signed and unsigned compare for kernel explorer
* Fix unused and nondiscard warnings
* Enable dev mode for ROCm CI
* Walkaround error "unknown warning option '-Wno-nonnull-compare'" in kernel explorer by using '-Wno-unknown-warning-option' to ignore the unknown option
* Fix error "unused parameter 'mask'"
* Fix warning "instantiation of variable 'onnxruntime::rocm::Consts<float>::One' required here, but no definition is available", etc. Fixed by using C++17's inline (implied by constexpr) static initialization.
* Remove unused variable
* Add the missing `override` specifier
Update for ROCm CI before reland tunable GEMM #12853. This PR also update
composable kernel to use CMakes's HIP language support so that we can
mix C/C++ compiler with HIP compiler instead of locking to hip-clang
### Description
fix XNNPACK on WebAssembly SIMD.
Flag "-msimd128" need to be applied to every source file when compiling
WASM SIMD. Currently only a part of the source files are compiled with
this flag so we get inconsistent result for
`sizeof(xnn_f32_minmax_params)` because the type definition include a
`#ifdef` for `__wasm_simd128__`. The inconsistency causes writing
garbage data to a stack variable and eventually cause the crash.
XNNPACK libraries are C libraries so need to apply the build flags not
only to `CMAKE_CXX_FLAGS` but also to `CMAKE_C_FLAGS`.
### Description
This updates the oneDNN library used by oneDNN ep from version 2.6 to
version 2.7
### Motivation and Context
This brings in the many improvements incorporated into the oneDNN
library to the oneDNN execution provider.
Signed-off-by: George Nash <george.nash@intel.com>
Change ROCm to use tunable GEMM. It is not enabled in this PR. This will drastically improve GEMM performance in some shapes and dtypes configuration. This will benefit the overall performance for BERT inference and hopefully, training, when enabled.
### Description
<!-- Describe your changes. -->
As title
-Split long OpBuilder and OpSupportChecker files into individual
operator files.
-Add OpBuilder/SupportChecker registry factories.
-Combine the functionality of op_builder and op_support_checker into one
op_builder.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The NNAPI OPBuilder was splitted into OPBuilder (For EP::Compile) and
OPSupportChecker (for EP::GetCapability)
At the time it was reasonable choice, but OPBuilder/OPSupportChecker
share some logic and has to use addition helper.
Clean up now to make NNAPI OPBuilder/OPSupportChecker into single
OPBuilder (similar to what CoreML EP has)
Fixesmicrosoft/onnxruntime#12969
### Motivation and Context
Build is broken, can't find cudnn.lib with nvidia official install of
cuDNN
Alternative method is to use `IF(EXISTS
${onnxruntime_CUDNN_HOME}/lib/x64/cudnn.lib)` to test for legacy
location and only add the legacy dir to the path, else add the current
official `lib/` dir.
1. Update CUDA version from 11.4 to 11.6.
2. Update Manylinux version
3. Upgrade GCC version from 10 to 11 for most x86_64 pipelines. CentOS 7 ARM64 doesn't have GCC 11 yet.
4. Refactor python packaging pipeline:
a. Split Linux GPU build job to two parts, build and test, so that the
build part doesn't need to use a GPU machine
b. Make the Linux GPU build job and Linux CPU build job more similar: share the same bash script and yaml file.
5. Temporarily disable Attention_Mask1D_Fp16_B2_FusedNoPadding because it is causing one of our packaging pipeline to fail. I have created an ADO task for this.
Some Ops in EP directory instead of contrib_ops directory will
require TunableOp. We will also need to add EP level session tuning
options for it. So move those code all at once.
Also remove duplicated utility functions.
**Description**: This PR adds Ascend CANN execution provider support.
**Motivation and Context**
- Why is this change required? What problem does it solve?
As the info shown in the issue. CANN is the API layer for Ascend
processor. Add CANN EP can allow user run onnx model on Ascend hardware
via onnxruntime
The detail change:
1. Added CANN EP framework.
2. Added the basic operators to support ResNet and VGG model.
3. Added C/C++、Python API support
- If it fixes an open issue, please link to the issue here.
https://github.com/microsoft/onnxruntime/issues/11477
Author:
lijiawei <lijiawei19@huawei.com>
wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: FFrog <ljw1101.vip@gmail.com>
This changes are to align OV 2022.2 Release with ORT . Changes
CPU FP16 Support, dGPU Support, RHEL Dockerfile, Ubuntu 20 Dockerfile
**Motivation and Context**
- This change is required to ensure ORT-OpenVINO Execution Provider is
aligned with latest changes.
- If it fixes an open issue, please link to the issue here.
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: shamaksx <shamax.kshirsagar@intel.com>
Co-authored-by: pratiksha <pratikshax.bapusaheb.vanse@intel.com>
Co-authored-by: pratiksha <mohsinx.mohammad@intel.com>
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: nmaajidk <n.maajid.khan@intel.com>
Co-authored-by: Mateusz Tabaka <mateusz.tabaka@intel.com>
Co-authored-by: intel <intel@iotgecsp-nuc04.iind.intel.com>
**Description**: **Python API Bindings for on device training. **
**Motivation and Context**
- This PR contains api bindings so python users can perform a whole
training loop.
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
* consume ONNX 1.12.1 to prevent vulnerability issue while loading external tensors
* update ONNX 1.12.1
* test updated PR
* use official rel-1.12.1 commit
* drop nuphar code and configs
* refactor test case
* format python
* remove nuphar from training test
* remove commented nuphar logics
* restore llvm setting
* drop nuphar ci
* fix compile err
* fix compile err
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
* upgrade emsdk to 3.1.19
* fix build break
* ignore '-Wunused-but-set-variable' in eigen
* add malloc and free in exported functions
* EXPORTED_FUNCTIONS
* Add first pass of rocm kernel profiler
* Clean up rocm_profiler. Format args. Demangle kernel names.
Add Api EventRecords
* Remove debug output
* Temporarily disable profiling unit test 'api record check' for cupti
* Fix compile error for non-gpu builds
* Use common file for demangle and pid/tid. Namespace ThreadUtil. Fix gpu buffer clearing.
* Merge demangle into profiler_common
* Merge demangle into profiler_common part 2
* Style cleanup
* Resolve linking issues via ProviderHost interface
* Demangle cuda kernel names
* Clean up comments
* Fix formatting
* Fix anal retentive formatting