Implement a set of new APIs for lightweight custom ops registration, to
save efforts from schema-composing.
A few highlights:
- Support build-time type inference;
- Support function-as-op for "stateless" ops;
- Support structure-as-op for "stateful" ops;
- Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
1. Update VERSION_NUMBER for preparing the upcoming release. This PR's
commit will not be included in the 1.15 release branch
2. Delete package/rpm/onnxruntime.spec since it was not used in past
years.
### Motivation and Context
Preparing the release.
Fixed
[AB#15311](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15311)
### Description
ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice
### Motivation and Context
Currently “Location” is represented as ORTMemoryInfo, which is OrtDevice
+ OrtMemType, while OrtDevice is represent as DeviceType + DeviceId +
MemType. As we can see there is some unnecessary hierarchy, the proposal
is to make it a clear definition that to use OrtDevice as an abstraction
for Location
---------
Co-authored-by: Lei Cao <leca@microsoft.com>
Implement a set of new APIs for lightweight custom ops registration, to
save efforts on schema-composing.
A few highlights:
1. Support build-time type inference;
2. Support function-as-op for "stateless" ops;
3. Support structure-as-op for "stateful" ops;
4. Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Augment nhwc graph optimizer to accommodate fp16 operators.
### Motivation and Context
With new fp16 conv operator added. This operator prefers NHWC data
layout. We need to augment existing graph optimizers to better utilize
the new operator.
### Description
Originally VitisAI EP only works with old version of VitisAI release.
### Motivation and Context
Update VitisAI EP so that it works together with the current VitisiAI
3.5 and further version of VitisAI. We try our best to make it forward
compatible.
---------
Co-authored-by: Wang Chunye <chunywan@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com>
Co-authored-by: shoucair <shoucai.ren@amd.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
Co-authored-by: BoarQing <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
This addresses a performance regression in some INT8 models with the
DirectML EP by defaulting OrtSessionOptionsDisableQuantQDQ to 1 when the
EP is registered.
This regression occured due to the introduction of the QDQ propagation
transformer, which is based on this session option. That transformer
maximizes the number of nodes which are executed as quantized by
logically propagating quantize operators upstream and dequantize
operators downstream. However, it does this simply by inserting QDQ
pairs, with an expectation that something will recognize sequences of
DQ->Op->Q. This logic and related L2 transformers are not currently
enabled for the DirectML EP.
This change also removes a noisy warning when the session option for
memory pattern is overriden as the DirectML EP is registered.
Previous behavior of TRT EP to set TRT optimization profiles for dynamic
shape input is based on input tensor values. Users can't explicitly
specify the profiles.
This PR makes users capable of specifying min/max/opt profiles through
newly added three provider options:
`trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`
with the format of "input1:dim1xdim2...,input2:dim3xdim4...".
(Note: It's similar to --minShapes, --maxShapes and --optShapes of
trtexec command-line
[flags](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags))
For example, if you are using onnxruntime_perf_test, you can try this:
`./onnxruntime_perf_test -e tensorrt -r 1 -i
"trt_profile_min_shapes|imgs:1x3x384x288
trt_profile_max_shapes|imgs:32x3x384x288
trt_profile_opt_shapes|imgs:16x3x384x288" your_model_path`
If the engine cache is enabled, you still need to provide these three
explicit provider options in order to use this feature. ORT TRT will
compare the min/max/opt profile shape with the ones saved in .profile
file to decide whether to rebuild the engine.
Constraints to use these provider options: (1) Need to specify
min/max/opt profile shapes for all the dynamic shape input
This feature is also requested by other users:
https://github.com/microsoft/onnxruntime/issues/13851
### Description
Cast optimizer may convert a fp16 node to fp32. This used to be safe as
all fp16 kernels has fp32 implementation. As this assumption is no
longer true, we need to check the validity of the operation
### Motivation and Context
Main work here is to introduce an API to check whether a kernel is
registered. Currently we don't have a way to do that without an operator
node. This needs to be augmented. We need to query whether a kernel is
registered by its property only, so that we can judge whether it is safe
to construct a node long before we actually do so.
### Description
The PR adds VPU support to OpenVINO Execution Provider
Bug fixes for GPU, CPU.
Changes to OpenVINO Backend in Serialized Model API for faster First
Inference Latency.
Deprecation to HDDL-VADM and MYRIAD, removed code
Support OpenVINO 2023.0
Dynamic Shapes Support for iGPU
### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO
- If it fixes an open issue, please link to the issue here. -->
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
This change introduced the following new components into ONNX Runtime
Web:
- JavaScript Execution Provider (JSEP)
- Asynchronized inferencing execution powered by Emscripten's Asyncify
- WebGPU backend implemented in TypeScript
- initial implementation of kernels:
- elementwise operators (22)
- binary operators (5)
- tensor: Shape, Reshape, Transpose, Gemm
- nn: Conv, {Global}Maxpool, {Global}AveragePool
Code need to be polished. still working on it.
## Q&A
What is JSEP?
> JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime
execution provider that specifically works on Web environment
(browsers). JSEP allows JavaScript code to kick in from various places
when ONNX Runtime inferences a model.
Why JSEP?
> JSEP is a hybrid mode EP that contains both C/C++ and
TypeScript/JavaScript implementation. There are 2 strong reasons why we
introduces JSEP:
> 1. the C/C++ part helps JSEP to leverage ONNX Runtime's capabilities
as much as possible including graph transformer, optimizers and also the
capabilities to fallback to CPU EP. TypeScript/JavaScript helps JSEP to
develop and debug much easier in the browser for the kernel
implementation.
> 2. the requirement of asynchronized execution from JavaScript API (eg.
`buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a
synchronized context (see "async problem" section below). This is done
by using Emscripten's Asyncify.
What is WebGPU?
> WebGPU is the new GPU API that available in browser. It's one of the
only 2 APIs that currently available to access the GPU from browser (the
other is WebGL).
> WebGPU is designed with more advanced and stronger features comparing
to WebGL and is potentially solution that offer the best GPU performance
for model inferencing that currently available.
What is the async problem and why we have the problem?
> The "async problem" is a problem that you cannot call an async
function in a synchronous context. Think about the following C++ code:
> ```c
> // C-style declarations (API)
> typedef void (*ON_COMPLETE)(PVOID state, DATA *data);
> void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);
>
> // implementation
> DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
> // how to implement?
> }
> ```
> The answer is, it's impossible to implement this function. Usually we
try to find a sync version API, or launch a thread to call the async
function and sync-wait on the main thread. Unfortunately, in browser
environment, neither is possible.
>
> WebGPU does not offer any synchronized API for data downloading (GPU
to CPU). This is the only operation that MUST be async. As `OrtRun()`
will eventually call into DataTransfer for copy data from GPU to CPU,
and `OrtRun()` is a synchronized function, this cannot be done in normal
way.
What is Emscripten? How is the Asyncify feature resolved the problem?
> Emscripten is the C/C++ compiler for WebAssembly. It's what we use to
compile ORT and generates the WebAssembly artifacts which runs on
browsers.
>
> Asyncify is a [compiler
feature](https://emscripten.org/docs/porting/asyncify.html) that allows
calling async functions from a synchronized context. In short, it
generates code to unwind and rewind call stack to emulate async
execution. With this feature, we are able to call the async function
inside `OrtRun()` call.
## Design Overview
**Inter-op**
JSEP is doing pretty much same thing to just another EP. It exposes an
interface for inter-op with JavaScript, which is defined in
onnxruntime/wasm/js_internal_api.js:
```js
// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
Module.jsepBackend = backend;
Module.jsepAlloc = alloc;
Module.jsepFree = free;
Module.jsepCopy = copy;
Module.jsepCopyAsync = copyAsync;
Module.jsepCreateKernel = createKernel;
Module.jsepReleaseKernel = releaseKernel;
Module.jsepRun = run;
};
```
This simple JavaScript snippet defines all language barrier level
functions that requires by JSEP to achieve implementing kernels and data
transfers using JavaScript inside ONNX Runtime:
- `jsepBackend`: assign the singleton object to webassembly module
- `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc()
and Free()
- `jsepCopy`: synchronized copy ( GPU to GPU, CPU to GPU)
- `jsepCopyAsync`: asynchronized copy ( GPU to CPU)
- `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object
that maintained in JS to match lifecycle of Kernel in ORT
- `jsepRun`: OpKernel::Compute() should call into this
The abstraction above allows to tie as little as possible connections
and dependencies between C/C++ and TypeScript/JavaScript.
**Resource Management**
Lifecycle of tensor data and kernels are managed by ORT(C/C++) but the
implementation are left to JavaScript. JavaScript code are responsible
to implement the callbacks correctly.
For WebGPU, the GPU data is managed by JavaScript using a singleton map
(tensot_data_id => GPUBuffer). GPU pipeline is managed as singleton.
Shaders are managed using a singletonmap (shader_key => gpu_program),
while shader_key is generated by cache_key (OP specific, including
attributes) and input shapes.
**about data transfer**
`js::DataTransfer::CopyTensor` implemented to call either synchronized
or asynchronized copy callback, depending on the destination is GPU or
not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function
to be called in the synchronized context.
**run kernel in JS**
Kernel class constructor calls once `jsepCreateKernel()` with an
optional per-kernel specific serialization to pass attributes into
JavaScript.
`Compute()` are implemented in a way that a metadata serialization is
performed in a base class and JavaScript code can access the data using
the Emscripten specific builtin macro `EM_ASM_*`.
**disabled features**
memory pattern is force disabled, because the WebGPU data is not
presented by a general memory model (a buffer can be represented by
offset + size).
concurrent run support is disabled. WebGPU is stateful and it also has
async function call. To support concurrent run will significantly
increase the complexity and we don't get any real benefit from it.
**prefer channels last**
JSEP prefers channels last and returns `DataLayout::NHWC` in method
`GetPreferredLayout()`. This will let the graph transformers to
preprocess the graph into a channels last form so that a more optimized
WebGPU shader can be used.
**Testing code**
It's impossible to test JSEP directly because JSEP itself does not
contain any kernel implementation. However, it has the kernel
registration which need to work together with the corresponding
JavaScript code. There are unit tests that run onnx models from
JavaScript API.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Create a new C API KernelContext_GetAllocator() for Custom Op scenario
### Motivation and Context
Create a new C API KernelContext_GetAllocator() for Custom Op scenario
### Description
Reduce a number of auxillary objects created to reduce GC pressure.
Eliminate GCHandle type of memory pinning in most of the places.
Improve string marshalling by allocating unmanaged memory that does not
require pinning. Change native methods from `IntPtr` to `byte[]`
(marshalling pinning is more efficient).
Allocate input/output UTF-8 names in unmanaged heap for the lifetime of
InferenceSession. So we do not keep converting them and pinning on every
Run.
Introduce a new native API that allows to allocate and convert/copy
strings directly into a native tensor.
The PR delivers around 50% latency improvements and less GC pauses.
Inspired by: https://github.com/microsoft/onnxruntime/pull/15520
### Motivation and Context
Client experience GC pressure and performance degradation when dealing
with string tensors.
Co-Authored-By: @tannergooding
This PR makes ORT support TRT plugin using custom ops. ORT TRT can
automatically register all TRT plugins from TRT plugins registry as
custom ops. There is no code change needed for ORT when new TRT plugins
are introduced.
Previous way for ORT to support TRT plugins was using contrib ops, but
there are some concerns about it:
- Contrib ops are shipped as part of the ORT binary by default. TRT
related plugins should not be in the default ORT.
- Contrib ops are designed for internal ops and developed for cpu and
cuda EPs.
Therefore, using custom ops is a good approach to support TRT plugins.
Followings are the major modifications:
1. Add new `GetCustomOpDomainList` provider api which allows provider to
create its own custom op domain list and ORT can register this domain
list. Provider has the responsibility to free all the custom op domain
instances it created.
2. Move OrtCustomOpDomain struct definition to
framework_provider_common.h since this struct is being used by framework
and EPs now.
3. There are several TRT plugins registered as onnx schema op through
contrib op with onnx domain. In order not to break the old models using
those TRT plugins which were registered with ONNX domain and maintain
backward compatible, we need to keep the old/legacy TRT plugins with
onnx domain. Moving forward, all newly added TRT plugins should be
registered with `trt.plugins` domain.
4. TRT plugin doesn't have an api to get number of inputs/outputs of the
registered plugins, so ORT TRT uses variadic inputs/outputs to bypass
the onnx node validation.
5. Add new trt provider option, `trt_extra_plugin_lib_paths`, user can
specify any extra plugin lib, for example,
`fastertransformer/build/lib/libvit_plugin.so` or
`fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`
### Description
Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.
Excluded
```
'onnxruntime/core/mlas/**',
'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```
because they contain assembly or is data heavy
### Motivation and Context
Coding style consistency
### Description
Create a stream in DeviceStreamCollection for memory pattern case to fix
the thread safe issue 15154
### Motivation and Context
This is to fix the bug 15154
https://github.com/microsoft/onnxruntime/issues/15154
### Description
This will add a few TRT options, some of them are only available on TRT
8.6:
- heuristics
- sparsity
- optimization level (8.6 only)
- auxiliary stream (8.6 only)
- tactic source selection
I am no sure yet which tests is should add for these options. As those
are mostly simple TRT flags i am not sure to what level i should test.
For heuristics something similar to
44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)
should be possible for, but for all other essentially we would only be
testing if there is a crash or not if the option is set.
Also if i forgot some option that would be good to have feel free to
speak up !
### Description
Implement Optional Type metadata support in the library.
Implement optional support in C# API along with metadata.
Implement Sequence, Map, Optional test data support
and test execution.
Prune tests and provide more details for failing tests in C# code.
Note, this PR does not enable running onnx test models in C++.
### Motivation and Context
Opset18 optional type support.
Split `IsTunbaleOpEnable` semantics into **enable tunable op for using**
and **enable tunable op for tuning**.
They remain disabled in general for safety purpose. But
- if session is created with onnx model with tuning results embeded
- the embedded tuning results is set to the EP without error `Status`
then we automatically enable the using, tuning remains disabled.
The planned options will be
- `tunable_op_enable`: The top-level switch of `TunableOp`, indicate if we will run into `TunableOp` related logic. **NOTE:** most of our impls have a bottom impl that is acting as a fallback and is set as the default. In this case, we still call into the `TunableOp`, but no kernel selection, no kernel tuning and caching is involved. This reduced our maintainance burden of a duplicate code path.
- `tunable_op_tuning_enable`: The secondary switch of `TunableOp`, indicate if we will run into the tuning related logic of `TunableOp`
Then for the possible future options:
- `tunable_op_tuning_max_iteration`: blahblah
- `tunable_op_tuning_max_duration_ms`: blahblah
- `tunable_op_flash_attention_enable`: blahblah, for example only, we will not have this.
For developer oriented envvar, it is for developers' convenience to inspect the performance impact of tuning. So there is only `ORT_ROCM_TUNABLE_OP_ENABLE`, `ORT_ROCM_TUNABLE_OP_TUNING_ENABLE` to take the fine-grind control of combinations.
### Description
<!-- Describe your changes. -->
Add required graph transformer to duplicate DQ nodes to ensure that QDQ
node units have unique DQ nodes. This condition is necessary for QDQ
node unit processing.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
There is an existing Python utility that does this:
c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)
This PR implements it as a graph transformer so it is integrated into
ORT and does not require a separate step to update the model. There are
also tests to ensure that its effects are not undone by basic level
graph optimizations.
### Description
**Multi-stream** execution support for **CANN EP**.
### Motivation and Context
**CANN EP** is currently **unavailable** due to the introduction of a
new mechanism for multi-stream execution
[#13495](https://github.com/microsoft/onnxruntime/pull/13495), the
deletion of the Fence-based synchronization mechanism, and the failure
to update the relevant logic of **CANN EP** synchronously.
This PR is to fix it.
### Description
<!-- Describe your changes. -->
- Add debug infrastructure to dump out model at various stages of
transpose optimization.
- Handle more scenarios where Transpose -> Reshape can be merged.
- Run L1 optimizers after layout transform to constant fold initializers
that had their layout changed.
- Use cost check for Concat post layout transform as pushing a Transpose
through it can potentially add Transpose nodes to multiple other inputs.
- Update internal testing EP to support test where you want it to take
all nodes, use NHWC layout, and to use dummy static kernels instead of
compiling so the ops in the graph post-initialization can be counted.
- Misc cleanup in InferenceSession to not unnecessarily pass args to
TransposeGraph for class members.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Address perf issue seen with model where a Transpose gets blocked by a
Reshape that could have been treated as a Transpose.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Remove `deletion` of copy functions from `OrtApi` as its initialization
no longer compiles in C++20.
Introduce a non-copyable member to implicitly delete copy ctor.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Inspired by https://github.com/microsoft/onnxruntime/pull/14901
Solution credits: @RyanUnderhill
Cc: @georgthegreat
### Description
<!-- Describe your changes. -->
We are introducing the FasterTransfomer model-level integration using
ORT [custom op runtime
wrapper](https://github.com/microsoft/onnxruntime/pull/13427).
In order to make the FT wrapper/integration work, two things need to be
done:
- New API `KernelInfoGetConstantInput_tensor`. (Done in this PR)
During custom op kernel initialization, it needs to get the model
weights (saved as node's constant inputs) ready for FT's weights
instantiation. What's why we need to add this new API to make kernel
info capable of getting constant inputs.
- Custom op and custom op kernel to wrap FT model. (Will provide in
onnxruntime extensions or inference examples)
During custom op kernel initialization, it can fetch attributes from
kernel info to determine which kind of FT model instance create. During
custom op kernel compute/inference, it can get input/output from kernel
context and then assign input/output buffers for model instance to run.
### Description
Add logging APIs for custom ops.
This PR introduces a `OrtLogger` type, which can be retrieved from a
`OrtKernelInfo` or `OrtKernelContext`. The kernel info's logger is the session logger stored
in the execution provider. The kernel context's logger is a run logger.
### Motivation and Context
Allows custom ops to log information in a manner consistent with
built-in ops.
Example usage in custom op:
```C++
struct MyCustomKernel {
MyCustomKernel(const OrtApi& api, const OrtKernelInfo* info) {
Ort::ConstKernelInfo kinfo(info);
this->logger_ = kinfo.GetLogger();
// ...
ORT_CXX_LOGF_NOEXCEPT(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_ERROR, "Error: %s", err_msg);
}
void Compute(OrtKernelContext* context) {
ORT_CXX_LOG(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_VERBOSE, "Calling compute...");
// ...
}
// ...
private:
Ort::Logger logger_;
};
```
### Description
This will enable a user to use a TensorRT timing cache based on #10297
to accelerate build times on a device with the same compute capability.
This will work across models as it simply store kernel runtimes for
specific configurations. Those files are usually very small (only a few
MB) which makes them very easy to ship with an application to accelerate
the build time on the user end.
### Motivation and Context
Especially for workstation use cases TRT build times can be a roadblock.
With a few model from ONNX model zoo i evaluated speedups when a timing
cache is present.
`./build/onnxruntime_perf_test -e tensorrt -I -t 5 -i
"trt_timing_cache_enable|true" <onnx_path>`
|Model | no Cache | with Cache|
| ------------- | ------------- | ------------- |
|efficientnet-lite4-11 | 34.6 s | 7.7 s|
|yolov4 | 108.62 s | 9.4 s|
To capture this is had to modify the onnxruntime_perf_test. The time is
sometimes not captured within "Session creation time cost:" which is why
i introduced "First inference time cost:".
---------
Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
### Description
TreeEnsemble* kernels fully copies all the parameters from the onnx
graph. Even if they are no longer needed or unused (hitrates), they
remain in memory. For big models >= 200 trees, max_depth > 10, the model
usually weights more than 10 Mb. This change offers a kernel the
possibility to remove all unneeded attributes after they were used to
create the session. Attributes are deleted after the model was possibly
saved, at the of the session creation.
The current design is to be debatted:
* it stored the list of removable attributes in class
`onnxruntime::Node`,
* the node is marked as `const` everytime this implementation needs to
register the name of a removable attribute or to remove them.
The current implementation is just a POC as it needs to cast
`onnxruntime::Node*` into `const onnxruntime::Node*`.
Should we keep the list of removable attributes in `onnxruntime::Node`?
### Motivation and Context
Motivation is mostly to reduce memory consumption.
---------
Signed-off-by: xadupre <xadupre@microsoft.com>
### Description
<!-- Describe your changes. -->
Changes to support standalone custom ops in a minimal build. Also
incorporates changes from #14492 (needed to test builds prior to that
being checked in).
We first need to save the schema info from the operators used by the
standalone op invoker in the ORT format model. Add mechanism for that.
Merge the kernel lookup logic so the same is used in full and minimal
build. NOTE: the version matching is now consistent with all other
kernel lookups, and the call to CreateOp MUST use the exact version for
the operator. Previously matching wasn't as strict, but this can lead to
the incorrect kernel being chosen.
Add tests.
NOTE: There is currently no way to detect the ops/types/opsets used
inside these custom ops as they don't exist until we create kernels,
which is after model loading completes (which is the point the ORT
format model is saved). Due to that they have to be manually added to
the configuration used to do the reduced ops build. That shouldn't be
too hard for the custom op author to add given the custom op
implementation is specifying the op, opset and type constraints (i.e.
they have the info and it's just a case of capturing/formatting it
correctly).
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable usage of the standalone op invoker by custom ops in a minimal
build.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
I fixed some broken links in the C API documentation, but then did a
quick pass over all of the links I could find and then fixed those.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
I got some 404's when exploring the documentation and wanted to fix it.
Enable Opset11 Sequence Ops on DirectML, and make the CPU
implementations agnostic to backend EP
Opset 11 introduced the following sequence related operators:
- SequenceAt
- SequenceConstruct
- SequenceEmpty
- SequenceLength
- SequenceErase
- SequenceInsert
- ConcatFromSequence
With the exception of ConcatFromSequence, all of the above operators
were implemented with CPU kernels that a) required all of the contained
tensors to also be on CPU, and b) would clone each tensor into a new
sequence as a side effect of each operator. The implementation of
sequences are backend agnostic, as they dont affect actual tensor layout
or manipulate the contents of the tensors. In addition, with the
exception of SequenceAt, the other operators need not make copies of the
underlying referenced tensors.
Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copies the contents
of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors agnostic to
backend.
3) The internal container implemented by TensorSeq has changed from
onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor
does not support copy or assignment construction, so it must have a
singular owner. However, is same tensor participates in multiple
containers it would have multiple container "owners" and this would not
be possible.
4) Other code that accessed values from TensorSeq have associated
changes to extract Tensors from OrtValues now.
In addition, DirectML execution was very slow when the above Sequence
operators were added to a graph, as this caused MemcpyToHost and
MemcpyFromHost kernels to be inserted between the graph and the sequence
operators. To optimize DirectML,
1) The CPU implementations for the Sequence* ops were registered as DML
implementations. Since the above changes also includes making the CPU
kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML.
However, there was little DirectML EP operator framework support for
operators that accept/output sequences of tensors. This change has
modified the internal COM interfaces to include new apis to interrogate
for sequence shapes, and extract the needed tensors from TensorSeq.
---------
Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
### Description
Fix the broken link in header file onnxruntime_c_api.h w.r.t. the graph
optimization levels (line 300).
### Motivation and Context
This fix solves open issue #14741
We use customized libc++ which uses raw pointers as std::vector::iterators.
As per [expr.pre.incr](https://eel.is/c++draft/expr.compound#expr.pre.incr), builtin `operator++` can only be applied to lvalue, while `std::vector::begin()` returns an rvalue.
See [this](https://godbolt.org/z/d3a1aKTWP) godbolt snippet for the details.
### Description
Remove the parameter device_id out of ExecutionProvider::GetAllocator()
function
### Motivation and Context
The parameter device_id is not necessary. We can fully rely on the
second parameter OrtMemType mem_type to determine the device_id when
getting allocator from executionProvider.
### Description
This is exposing the already existent interface of asynchronous work of
all CUDA base EP's (CUDA + TensorRT).
### Motivation and Context
This is something requested in #12216. It will enable users to build an
efficient data pipeline with ONNXRuntime and CUDA pre-/post-processing.
PCI traffic to the CUDA device can be run during inference as soon as
the postprocessing consumed the input buffer and it can be overwritten.
To do this work has to be submitted async to the device. Please see
below screenshots showing the illustration of this using NSight Systems.
Async:
<img width="1401" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894303-706460ed-cbdb-4be2-a2e4-0c111ec875dd.png">
Synchronous:
<img width="1302" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894630-1ce40925-bbd5-470d-b888-46553ab75fb9.png">
Note the gap in between the 2 inference runs due to issuing PCI traffic
in between and to the CPU overhead the active synchronization has.
---------
Co-authored-by: Chi Lo <chi.lo@microsoft.com>