Commit graph

957 commits

Author SHA1 Message Date
Dmitri Smirnov
896a963492
Adust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921)
### Description

This PR partially reverts changes introduced in
https://github.com/microsoft/onnxruntime/pull/15643

We make two API return std::string always in UTF-8.

We also move the entry points from OrtApiBase to OrtApi to make them
versioned.

### Motivation and Context

`GetVersionString` always returns x.y.z numbers that are not subject to
internationalization.
`GetBuildInfoString` can hold international chars, but UTF-8 should be
fine to contain those.
We prefix them with u8"" in case the compiler default charset is not
UTF-8.
Furthermore, creating platform dependent APIs is discouraged.
`ORTCHAR_T` is platform dependent and was created for paths only.
On non-unix platforms would still produce `std::string` that can only
contain UTF-8

The API was introduced after the latest release, and can still be
adjusted.
2023-05-13 13:45:07 -07:00
Maximilian Müller
143551092f
fix: setting builder optimization level to TRT 8.6 default (#15897)
The actual released default level is 3 and not the previously used 2.

Just a small sample of the effects:
![Screenshot 2023-05-10 at 15 49
55](https://github.com/microsoft/onnxruntime/assets/44298237/5a694446-22c0-4943-9ddf-80670781878f)
2023-05-12 13:29:30 -07:00
Hector Li
1bebc88069
[SNPE EP] Add option to enable SNPE init caching feature (#15917)
### Description
[SNPE EP] Add option to enable SNPE init caching feature

### Motivation and Context
To save model initialization time
2023-05-12 07:57:11 -07:00
Wanming Lin
00b1e79e04
Support WebNN EP (#15698)
**Description**: 

This PR intends to enable WebNN EP in ONNX Runtime Web. It translates
the ONNX nodes by [WebNN
API](https://webmachinelearning.github.io/webnn/), which is implemented
in C++ and uses Emscripten [Embind
API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#).
Temporarily using preferred layout **NHWC** for WebNN graph partitions
since the restriction in WebNN XNNPack backend implementation and the
ongoing
[discussion](https://github.com/webmachinelearning/webnn/issues/324) in
WebNN spec that whether WebNN should support both 'NHWC' and 'NCHW'
layouts. No WebNN native EP, only for Web.

**Motivation and Context**:
Allow ONNXRuntime Web developers to access WebNN API to benefit from
hardware acceleration.

**WebNN API Implementation Status in Chromium**:
- Tracked in Chromium issue:
[#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291)
- **CPU device**: based on XNNPack backend, and had been available on
Chrome Canary M112 behind "#enable-experimental-web-platform-features"
flag for Windows and Linux platforms. Further implementation for more
ops is ongoing.
- **GPU device**: based on DML, implementation is ongoing.

**Open**:
- GitHub CI: WebNN currently is only available on Chrome Canary/Dev with
XNNPack backend for Linux and Windows. This is an open to reviewers to
help identify which GitHub CI should involved the WebNN EP and guide me
to enable it. Thanks!
2023-05-08 21:25:10 -07:00
RandySheriffH
8e610f25d8
Implement lite custom op API (#15778)
Implement a set of new APIs for lightweight custom ops registration, to
save efforts from schema-composing.
A few highlights:

- Support build-time type inference;
- Support function-as-op for "stateless" ops;
- Support structure-as-op for "stateful" ops;
- Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-04 09:49:17 -07:00
Changming Sun
1fb2f2605b
Update VERSION_NUMBER (#15773)
### Description

1. Update VERSION_NUMBER for preparing the upcoming release. This PR's
commit will not be included in the 1.15 release branch
2. Delete package/rpm/onnxruntime.spec since it was not used in past
years.

### Motivation and Context
Preparing the release.

Fixed
[AB#15311](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15311)
2023-05-03 15:07:34 -07:00
Baiju Meswani
ba7b83ff3c
Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776) 2023-05-03 13:08:35 -07:00
Chen Fu
bc58fd5413
fix compilation error in no absl build (#15769)
### Description

Fix no-absl build error:
2023-05-02 08:20:49 -07:00
Changming Sun
034698cf6a
Revert "Implement lite custom op API (#15590)" (#15768)
This reverts commit cdf4fc49fc because it
breaks the "debug_node_input_output" build in "Post Merge" pipeline
2023-05-02 01:10:10 -07:00
Ye Wang
391f897983
Bring back SLN cuda kernel and use provider options to switch to standard implementation (#15660) 2023-05-01 18:35:26 -07:00
cao lei
d58fa9805b
ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice (#15618)
### Description
ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice



### Motivation and Context
Currently “Location” is represented as ORTMemoryInfo, which is OrtDevice
+ OrtMemType, while OrtDevice is represent as DeviceType + DeviceId +
MemType. As we can see there is some unnecessary hierarchy, the proposal
is to make it a clear definition that to use OrtDevice as an abstraction
for Location

---------

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-05-01 10:06:00 -07:00
RandySheriffH
cdf4fc49fc
Implement lite custom op API (#15590)
Implement a set of new APIs for lightweight custom ops registration, to
save efforts on schema-composing.
A few highlights:

1. Support build-time type inference;
2. Support function-as-op for "stateless" ops;
3. Support structure-as-op for "stateful" ops;
4. Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-01 08:45:26 -07:00
Chen Fu
0e9472d391
NHWC graph optimizer (#15724)
### Description

Augment nhwc graph optimizer to accommodate fp16 operators.


### Motivation and Context

With new fp16 conv operator added. This operator prefers NHWC data
layout. We need to augment existing graph optimizers to better utilize
the new operator.
2023-05-01 08:44:07 -07:00
Chunye Wang@AMD
d35850c142
[VitisAI]Update VitisAI EP to be compatible with VitisAI 3.5 (#15673)
### Description

Originally VitisAI EP only works with old version of VitisAI release. 


### Motivation and Context

Update VitisAI EP so that it works together with the current VitisiAI
3.5 and further version of VitisAI. We try our best to make it forward
compatible.

---------

Co-authored-by: Wang Chunye <chunywan@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com>
Co-authored-by: shoucair <shoucai.ren@amd.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
Co-authored-by: BoarQing <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-05-01 08:28:26 -07:00
Jeff Bloomfield
3df3a85114
Default kOrtSessionOptionsDisableQuantQDQ to 1 when the DML EP is registered (#15725)
This addresses a performance regression in some INT8 models with the
DirectML EP by defaulting OrtSessionOptionsDisableQuantQDQ to 1 when the
EP is registered.

This regression occured due to the introduction of the QDQ propagation
transformer, which is based on this session option. That transformer
maximizes the number of nodes which are executed as quantized by
logically propagating quantize operators upstream and dequantize
operators downstream. However, it does this simply by inserting QDQ
pairs, with an expectation that something will recognize sequences of
DQ->Op->Q. This logic and related L2 transformers are not currently
enabled for the DirectML EP.

This change also removes a noisy warning when the session option for
memory pattern is overriden as the DirectML EP is registered.
2023-05-01 08:26:03 -07:00
Chi Lo
6e652d0554
Support explicit TRT profiles from provider options (#15546)
Previous behavior of TRT EP to set TRT optimization profiles for dynamic
shape input is based on input tensor values. Users can't explicitly
specify the profiles.

This PR makes users capable of specifying min/max/opt profiles through
newly added three provider options:

`trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`
with the format of "input1:dim1xdim2...,input2:dim3xdim4...".
(Note: It's similar to --minShapes, --maxShapes and --optShapes of
trtexec command-line
[flags](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags))

For example, if you are using onnxruntime_perf_test, you can try this:

`./onnxruntime_perf_test -e tensorrt -r 1 -i
"trt_profile_min_shapes|imgs:1x3x384x288
trt_profile_max_shapes|imgs:32x3x384x288
trt_profile_opt_shapes|imgs:16x3x384x288" your_model_path`

If the engine cache is enabled, you still need to provide these three
explicit provider options in order to use this feature. ORT TRT will
compare the min/max/opt profile shape with the ones saved in .profile
file to decide whether to rebuild the engine.

Constraints to use these provider options: (1) Need to specify
min/max/opt profile shapes for all the dynamic shape input

 

This feature is also requested by other users:
https://github.com/microsoft/onnxruntime/issues/13851
2023-04-30 22:30:26 -07:00
Changming Sun
65020d433e
Prefast fixes for CUDA EP (#15726)
### Description
1. Adjust cmake flags. Do not modify CMAKE_CXX_FLAGS globally. Only
apply the flags to ORT code.
2. Fix some SDL warnings.
2023-04-29 12:43:12 -07:00
Yuhong Guo
41dcf0d32e
Expose build information in dynamic lib (#15643)
### Description
<!-- Describe your changes. -->
1. Add Build Info API to onnx.
2. Fix compile error while building onnxruntime_benchmark in MacOs.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
1. When Onnxruntime lib is serving online, we need a way to detect how
this lib is built. This PR helps the developer to get the build
information using `strings` such as git branch, git commit id, build
type and cmake cxx flags, which is showed as follows.


![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png)


![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png)

If the build env has no git, there will be no git related infor:


![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png)


3. Fix the following compile error while building benchmark in MacOs.

![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png)

---------

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-04-28 21:57:31 -07:00
Chen Fu
be08b47e7b
Refine cast optimizer for safety (#15658)
### Description

Cast optimizer may convert a fp16 node to fp32. This used to be safe as
all fp16 kernels has fp32 implementation. As this assumption is no
longer true, we need to check the validity of the operation



### Motivation and Context

Main work here is to introduce an API to check whether a kernel is
registered. Currently we don't have a way to do that without an operator
node. This needs to be augmented. We need to query whether a kernel is
registered by its property only, so that we can judge whether it is safe
to construct a node long before we actually do so.
2023-04-28 09:32:54 -07:00
sfatimar
ebaafac3f5
Openvino ep ort 5.0 (#15626)
### Description
The PR adds VPU support to OpenVINO Execution Provider
Bug fixes for GPU, CPU. 
Changes to OpenVINO Backend in Serialized Model API for faster First
Inference Latency.
Deprecation to HDDL-VADM and MYRIAD, removed code
Support OpenVINO 2023.0 
Dynamic Shapes Support for iGPU

### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO
- If it fixes an open issue, please link to the issue here. -->

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-04-25 20:59:42 -07:00
Baiju Meswani
5885abfb35
Training Documentation (#15612) 2023-04-25 11:44:12 -07:00
Yulong Wang
14cc02c65c
[js/web] WebGPU backend via JSEP (#14579)
### Description
This change introduced the following new components into ONNX Runtime
Web:
- JavaScript Execution Provider (JSEP)
  - Asynchronized inferencing execution powered by Emscripten's Asyncify
- WebGPU backend implemented in TypeScript
  - initial implementation of kernels:
    - elementwise operators (22)
    - binary operators (5)
    - tensor: Shape, Reshape, Transpose, Gemm
    - nn: Conv, {Global}Maxpool, {Global}AveragePool


Code need to be polished. still working on it.

## Q&A
What is JSEP?
> JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime
execution provider that specifically works on Web environment
(browsers). JSEP allows JavaScript code to kick in from various places
when ONNX Runtime inferences a model.

Why JSEP?
> JSEP is a hybrid mode EP that contains both C/C++ and
TypeScript/JavaScript implementation. There are 2 strong reasons why we
introduces JSEP:
> 1. the C/C++ part helps JSEP to leverage ONNX Runtime's capabilities
as much as possible including graph transformer, optimizers and also the
capabilities to fallback to CPU EP. TypeScript/JavaScript helps JSEP to
develop and debug much easier in the browser for the kernel
implementation.
> 2. the requirement of asynchronized execution from JavaScript API (eg.
`buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a
synchronized context (see "async problem" section below). This is done
by using Emscripten's Asyncify.

What is WebGPU?
> WebGPU is the new GPU API that available in browser. It's one of the
only 2 APIs that currently available to access the GPU from browser (the
other is WebGL).
> WebGPU is designed with more advanced and stronger features comparing
to WebGL and is potentially solution that offer the best GPU performance
for model inferencing that currently available.

What is the async problem and why we have the problem?
> The "async problem" is a problem that you cannot call an async
function in a synchronous context. Think about the following C++ code:
> ```c
> // C-style declarations (API)
> typedef void (*ON_COMPLETE)(PVOID state, DATA *data);
> void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);
> 
> // implementation
> DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
>   // how to implement?
> }
> ```
> The answer is, it's impossible to implement this function. Usually we
try to find a sync version API, or launch a thread to call the async
function and sync-wait on the main thread. Unfortunately, in browser
environment, neither is possible.
>
> WebGPU does not offer any synchronized API for data downloading (GPU
to CPU). This is the only operation that MUST be async. As `OrtRun()`
will eventually call into DataTransfer for copy data from GPU to CPU,
and `OrtRun()` is a synchronized function, this cannot be done in normal
way.

What is Emscripten? How is the Asyncify feature resolved the problem?
> Emscripten is the C/C++ compiler for WebAssembly. It's what we use to
compile ORT and generates the WebAssembly artifacts which runs on
browsers.
>
> Asyncify is a [compiler
feature](https://emscripten.org/docs/porting/asyncify.html) that allows
calling async functions from a synchronized context. In short, it
generates code to unwind and rewind call stack to emulate async
execution. With this feature, we are able to call the async function
inside `OrtRun()` call.

## Design Overview

**Inter-op**

JSEP is doing pretty much same thing to just another EP. It exposes an
interface for inter-op with JavaScript, which is defined in
onnxruntime/wasm/js_internal_api.js:
```js
// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
    Module.jsepBackend = backend;
    Module.jsepAlloc = alloc;
    Module.jsepFree = free;
    Module.jsepCopy = copy;
    Module.jsepCopyAsync = copyAsync;
    Module.jsepCreateKernel = createKernel;
    Module.jsepReleaseKernel = releaseKernel;
    Module.jsepRun = run;
};
```
This simple JavaScript snippet defines all language barrier level
functions that requires by JSEP to achieve implementing kernels and data
transfers using JavaScript inside ONNX Runtime:
- `jsepBackend`: assign the singleton object to webassembly module
- `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc()
and Free()
- `jsepCopy`: synchronized copy ( GPU to GPU, CPU to GPU)
- `jsepCopyAsync`: asynchronized copy ( GPU to CPU)
- `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object
that maintained in JS to match lifecycle of Kernel in ORT
- `jsepRun`: OpKernel::Compute() should call into this

The abstraction above allows to tie as little as possible connections
and dependencies between C/C++ and TypeScript/JavaScript.

**Resource Management**

Lifecycle of tensor data and kernels are managed by ORT(C/C++) but the
implementation are left to JavaScript. JavaScript code are responsible
to implement the callbacks correctly.

For WebGPU, the GPU data is managed by JavaScript using a singleton map
(tensot_data_id => GPUBuffer). GPU pipeline is managed as singleton.
Shaders are managed using a singletonmap (shader_key => gpu_program),
while shader_key is generated by cache_key (OP specific, including
attributes) and input shapes.

**about data transfer**
`js::DataTransfer::CopyTensor` implemented to call either synchronized
or asynchronized copy callback, depending on the destination is GPU or
not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function
to be called in the synchronized context.

**run kernel in JS**

Kernel class constructor calls once `jsepCreateKernel()` with an
optional per-kernel specific serialization to pass attributes into
JavaScript.

`Compute()` are implemented in a way that a metadata serialization is
performed in a base class and JavaScript code can access the data using
the Emscripten specific builtin macro `EM_ASM_*`.

**disabled features**
memory pattern is force disabled, because the WebGPU data is not
presented by a general memory model (a buffer can be represented by
offset + size).
concurrent run support is disabled. WebGPU is stateful and it also has
async function call. To support concurrent run will significantly
increase the complexity and we don't get any real benefit from it.

**prefer channels last**
JSEP prefers channels last and returns `DataLayout::NHWC` in method
`GetPreferredLayout()`. This will let the graph transformers to
preprocess the graph into a channels last form so that a more optimized
WebGPU shader can be used.

**Testing code**
It's impossible to test JSEP directly because JSEP itself does not
contain any kernel implementation. However, it has the kernel
registration which need to work together with the corresponding
JavaScript code. There are unit tests that run onnx models from
JavaScript API.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-04-24 15:21:18 -07:00
cao lei
dc53ddef7a
Create a new C API KernelContext_GetAllocator() for Custom Op scenario (#15591)
### Description
Create a new C API KernelContext_GetAllocator() for Custom Op scenario



### Motivation and Context
Create a new C API KernelContext_GetAllocator() for Custom Op scenario
2023-04-23 21:54:35 -07:00
Dmitri Smirnov
a5dec8eedf
[C# ] Improve string marshalling and reduce GC pressure (#15545)
### Description

  Reduce a number of auxillary objects created to reduce GC pressure.
Eliminate GCHandle type of memory pinning in most of the places.
Improve string marshalling by allocating unmanaged memory that does not
require pinning. Change native methods from `IntPtr` to `byte[]`
(marshalling pinning is more efficient).

Allocate input/output UTF-8 names in unmanaged heap for the lifetime of
InferenceSession. So we do not keep converting them and pinning on every
Run.

Introduce a new native API that allows to allocate and convert/copy
strings directly into a native tensor.

The PR delivers around 50% latency improvements and less GC pauses.

Inspired by: https://github.com/microsoft/onnxruntime/pull/15520

### Motivation and Context
Client experience GC pressure and performance degradation when dealing
with string tensors.


Co-Authored-By: @tannergooding
2023-04-20 15:12:51 -07:00
Chi Lo
6115c8fd1f
Add TRT plugins support using custom ops (#13847)
This PR makes ORT support TRT plugin using custom ops. ORT TRT can
automatically register all TRT plugins from TRT plugins registry as
custom ops. There is no code change needed for ORT when new TRT plugins
are introduced.

Previous way for ORT to support TRT plugins was using contrib ops, but
there are some concerns about it:

- Contrib ops are shipped as part of the ORT binary by default. TRT
related plugins should not be in the default ORT.
- Contrib ops are designed for internal ops and developed for cpu and
cuda EPs.

Therefore, using custom ops is a good approach to support TRT plugins. 

Followings are the major modifications:

1. Add new `GetCustomOpDomainList` provider api which allows provider to
create its own custom op domain list and ORT can register this domain
list. Provider has the responsibility to free all the custom op domain
instances it created.
2. Move OrtCustomOpDomain struct definition to
framework_provider_common.h since this struct is being used by framework
and EPs now.
3. There are several TRT plugins registered as onnx schema op through
contrib op with onnx domain. In order not to break the old models using
those TRT plugins which were registered with ONNX domain and maintain
backward compatible, we need to keep the old/legacy TRT plugins with
onnx domain. Moving forward, all newly added TRT plugins should be
registered with `trt.plugins` domain.
4. TRT plugin doesn't have an api to get number of inputs/outputs of the
registered plugins, so ORT TRT uses variadic inputs/outputs to bypass
the onnx node validation.
5. Add new trt provider option, `trt_extra_plugin_lib_paths`, user can
specify any extra plugin lib, for example,
`fastertransformer/build/lib/libvit_plugin.so` or
`fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`
2023-04-18 20:24:32 -07:00
Justin Chu
cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or is data heavy


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
liqun Fu
919d8f2660
update with onnx main (#14929) 2023-04-18 08:42:51 -07:00
cao lei
c2221d919f
create a stream in DeviceStreamCollection for memory pattern (#15426)
### Description
Create a stream in DeviceStreamCollection for memory pattern case to fix
the thread safe issue 15154



### Motivation and Context
This is to fix the bug 15154
https://github.com/microsoft/onnxruntime/issues/15154
2023-04-17 10:06:55 -07:00
Maximilian Müller
fbe88fccbd
Exposing new TRT build options (#15089)
### Description

This will add a few TRT options, some of them are only available on TRT
8.6:
- heuristics
- sparsity
- optimization level (8.6 only)
- auxiliary stream (8.6 only)
- tactic source selection

I am no sure yet which tests is should add for these options. As those
are mostly simple TRT flags i am not sure to what level i should test.
For heuristics something similar to
44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)
should be possible for, but for all other essentially we would only be
testing if there is a crash or not if the option is set.
Also if i forgot some option that would be good to have feel free to
speak up !
2023-04-14 09:47:36 -07:00
Dmitri Smirnov
ce3b4eabd3
Implement Optional Metadata support and C# test support (#15314)
### Description
Implement Optional Type metadata support in the library.
Implement optional support in C# API along with metadata.
Implement Sequence, Map, Optional test data support
and test execution.

Prune tests and provide more details for failing tests in C# code.

Note, this PR does not enable running onnx test models in C++.

### Motivation and Context
Opset18 optional type support.
2023-04-11 09:41:59 -07:00
cloudhan
71a4e7eb97
Automatically enable tunable op usage for production models (#15156)
Split `IsTunbaleOpEnable` semantics into **enable tunable op for using**
and **enable tunable op for tuning**.

They remain disabled in general for safety purpose. But
- if session is created with onnx model with tuning results embeded
- the embedded tuning results is set to the EP without error `Status`

then we automatically enable the using, tuning remains disabled.

The planned options will be
- `tunable_op_enable`: The top-level switch of `TunableOp`, indicate if we will run into `TunableOp` related logic. **NOTE:** most of our impls have a bottom impl that is acting as a fallback and is set as the default. In this case, we still call into the `TunableOp`, but no kernel selection, no kernel tuning and caching is involved. This reduced our maintainance burden of a duplicate code path.
- `tunable_op_tuning_enable`: The secondary switch of `TunableOp`, indicate if we will run into the tuning related logic of `TunableOp`

Then for the possible future options:
- `tunable_op_tuning_max_iteration`: blahblah
- `tunable_op_tuning_max_duration_ms`: blahblah
- `tunable_op_flash_attention_enable`: blahblah, for example only, we will not have this.

For developer oriented envvar, it is for developers' convenience to inspect the performance impact of tuning. So there is only `ORT_ROCM_TUNABLE_OP_ENABLE`, `ORT_ROCM_TUNABLE_OP_TUNING_ENABLE` to take the fine-grind control of combinations.
2023-04-06 13:52:47 +08:00
Edward Chen
9f942e1a3e
Graph transformer to ensure unique DQ nodes for QDQ node units (#15145)
### Description
<!-- Describe your changes. -->

Add required graph transformer to duplicate DQ nodes to ensure that QDQ
node units have unique DQ nodes. This condition is necessary for QDQ
node unit processing.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

There is an existing Python utility that does this: 

c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)

This PR implements it as a graph transformer so it is integrated into
ORT and does not require a separate step to update the model. There are
also tests to ensure that its effects are not undone by basic level
graph optimizations.
2023-03-31 08:39:43 +10:00
FFFrog
ecb89ed752
[CANN] Multi-stream execution support for CANN EP. (#14058)
### Description
**Multi-stream** execution support for **CANN EP**.

### Motivation and Context
**CANN EP** is currently **unavailable** due to the introduction of a
new mechanism for multi-stream execution
[#13495](https://github.com/microsoft/onnxruntime/pull/13495), the
deletion of the Fence-based synchronization mechanism, and the failure
to update the relevant logic of **CANN EP** synchronously.

This PR is to fix it.
2023-03-29 11:57:22 -07:00
Scott McKay
eb8f6c7c52
Transpose optimizer enhancements (#15117)
### Description
<!-- Describe your changes. -->
- Add debug infrastructure to dump out model at various stages of
transpose optimization.
- Handle more scenarios where Transpose -> Reshape can be merged.
- Run L1 optimizers after layout transform to constant fold initializers
that had their layout changed.
- Use cost check for Concat post layout transform as pushing a Transpose
through it can potentially add Transpose nodes to multiple other inputs.
- Update internal testing EP to support test where you want it to take
all nodes, use NHWC layout, and to use dummy static kernels instead of
compiling so the ops in the graph post-initialization can be counted.
- Misc cleanup in InferenceSession to not unnecessarily pass args to
TransposeGraph for class members.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Address perf issue seen with model where a Transpose gets blocked by a
Reshape that could have been treated as a Transpose.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-03-28 08:28:17 +10:00
Nat Kershaw (MSFT)
3064fa7611
Fix C API docs error (#15216) 2023-03-27 14:34:18 -07:00
Dmitri Smirnov
2de15c5d50
Re-work OrtApi struct to satisfy C++20 compilers (#15183)
### Description
<!-- Describe your changes. -->
Remove `deletion` of copy functions from `OrtApi` as its initialization
no longer compiles in C++20.
Introduce a non-copyable member to implicitly delete copy ctor.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Inspired by https://github.com/microsoft/onnxruntime/pull/14901

Solution credits: @RyanUnderhill 

Cc: @georgthegreat
2023-03-24 13:52:17 -07:00
Chi Lo
c964da7ea2
FasterTransformer model wrapper using custom op (#15013)
### Description
<!-- Describe your changes. -->
We are introducing the FasterTransfomer model-level integration using
ORT [custom op runtime
wrapper](https://github.com/microsoft/onnxruntime/pull/13427).
In order to make the FT wrapper/integration work, two things need to be
done:

- New API `KernelInfoGetConstantInput_tensor`. (Done in this PR)
During custom op kernel initialization, it needs to get the model
weights (saved as node's constant inputs) ready for FT's weights
instantiation. What's why we need to add this new API to make kernel
info capable of getting constant inputs.

- Custom op and custom op kernel to wrap FT model. (Will provide in
onnxruntime extensions or inference examples)
During custom op kernel initialization, it can fetch attributes from
kernel info to determine which kind of FT model instance create. During
custom op kernel compute/inference, it can get input/output from kernel
context and then assign input/output buffers for model instance to run.
2023-03-20 09:05:30 -07:00
Adrian Lizarraga
e42f7487df
Add logging APIs for custom operators (#14416)
### Description
Add logging APIs for custom ops.

This PR introduces a `OrtLogger` type, which can be retrieved from a
`OrtKernelInfo` or `OrtKernelContext`. The kernel info's logger is the session logger stored
in the execution provider. The kernel context's logger is a run logger.



### Motivation and Context
Allows custom ops to log information in a manner consistent with
built-in ops.

Example usage in custom op:
```C++
struct MyCustomKernel {
  MyCustomKernel(const OrtApi& api, const OrtKernelInfo* info) {
    Ort::ConstKernelInfo kinfo(info);
    this->logger_ = kinfo.GetLogger();
    // ...
    ORT_CXX_LOGF_NOEXCEPT(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_ERROR, "Error: %s", err_msg);
  }

  void Compute(OrtKernelContext* context) {
    ORT_CXX_LOG(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_VERBOSE, "Calling compute...");
    // ...
  }

  // ...
 private:
  Ort::Logger logger_;
};
```
2023-03-17 15:05:28 -07:00
wejoncy
028c2372fa remove disable_cpu_soft temporarily 2023-03-15 13:23:56 +08:00
JiCheng
8383a54f9d Update include/onnxruntime/core/providers/nnapi/nnapi_provider_factory.h
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-03-15 13:23:56 +08:00
wejoncy
3873a55bd3 [NNAPI] fix feature_level query 2023-03-15 13:23:56 +08:00
Maximilian Müller
ad4db12699
TensorRT EP - timing cache (#14767)
### Description

This will enable a user to use a TensorRT timing cache based on #10297
to accelerate build times on a device with the same compute capability.
This will work across models as it simply store kernel runtimes for
specific configurations. Those files are usually very small (only a few
MB) which makes them very easy to ship with an application to accelerate
the build time on the user end.

### Motivation and Context
Especially for workstation use cases TRT build times can be a roadblock.
With a few model from ONNX model zoo i evaluated speedups when a timing
cache is present.
`./build/onnxruntime_perf_test -e tensorrt -I -t 5 -i
"trt_timing_cache_enable|true" <onnx_path>`

|Model | no Cache | with Cache|
| ------------- | ------------- | ------------- |
|efficientnet-lite4-11 | 34.6 s | 7.7 s|
|yolov4 | 108.62 s | 9.4 s|

To capture this is had to modify the onnxruntime_perf_test. The time is
sometimes not captured within "Session creation time cost:" which is why
i introduced "First inference time cost:".

---------

Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
2023-03-10 09:02:27 -08:00
Xavier Dupré
5930e7e22f
Introduce RemovableAttributes (#14868)
### Description
TreeEnsemble* kernels fully copies all the parameters from the onnx
graph. Even if they are no longer needed or unused (hitrates), they
remain in memory. For big models >= 200 trees, max_depth > 10, the model
usually weights more than 10 Mb. This change offers a kernel the
possibility to remove all unneeded attributes after they were used to
create the session. Attributes are deleted after the model was possibly
saved, at the of the session creation.

The current design is to be debatted:
* it stored the list of removable attributes in class
`onnxruntime::Node`,
* the node is marked as `const` everytime this implementation needs to
register the name of a removable attribute or to remove them.

The current implementation is just a POC as it needs to cast
`onnxruntime::Node*` into `const onnxruntime::Node*`.

Should we keep the list of removable attributes in `onnxruntime::Node`?

### Motivation and Context
Motivation is mostly to reduce memory consumption.

---------

Signed-off-by: xadupre <xadupre@microsoft.com>
2023-03-07 12:37:12 +01:00
Dmitri Smirnov
8d87fdcfa1
Add GetVersionSting API for C++, C# and Python (#14873)
### Description
Added APIs.

### Motivation and Context
Addresses https://github.com/microsoft/onnxruntime/issues/14584

Cc: @Craigacp cp
2023-03-02 17:11:07 -08:00
Hector Li
c6074f3a4b
OnnxRuntime QNN EP (#14791)
### Description
Integrate Qualcomm QNN SDK to enable inference on QC hexagon NPU devices

### Motivation and Context
Enable Ort inference on QC hexagon NPU devices.

---------

Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
2023-03-01 13:48:20 -08:00
Scott McKay
b7fde84341
Changes to support standalone custom ops in a minimal build. (#14497)
### Description
<!-- Describe your changes. -->
Changes to support standalone custom ops in a minimal build. Also
incorporates changes from #14492 (needed to test builds prior to that
being checked in).

We first need to save the schema info from the operators used by the
standalone op invoker in the ORT format model. Add mechanism for that.

Merge the kernel lookup logic so the same is used in full and minimal
build. NOTE: the version matching is now consistent with all other
kernel lookups, and the call to CreateOp MUST use the exact version for
the operator. Previously matching wasn't as strict, but this can lead to
the incorrect kernel being chosen.

Add tests.

NOTE: There is currently no way to detect the ops/types/opsets used
inside these custom ops as they don't exist until we create kernels,
which is after model loading completes (which is the point the ORT
format model is saved). Due to that they have to be manually added to
the configuration used to do the reduced ops build. That shouldn't be
too hard for the custom op author to add given the custom op
implementation is specifying the op, opset and type constraints (i.e.
they have the info and it's just a case of capturing/formatting it
correctly).


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable usage of the standalone op invoker by custom ops in a minimal
build.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-03-01 11:22:54 +10:00
James Yuzawa
d925055a3e
Fix broken and outdated links in documentation (#14092)
### Description
<!-- Describe your changes. -->

I fixed some broken links in the C API documentation, but then did a
quick pass over all of the links I could find and then fixed those.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

I got some 404's when exploring the documentation and wanted to fix it.
2023-02-23 10:48:04 -08:00
Sheil Kumar
1b7f65437e
Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442)
Enable Opset11 Sequence Ops on DirectML, and make the CPU
implementations agnostic to backend EP

Opset 11 introduced the following sequence related operators:
    - SequenceAt
    - SequenceConstruct
    - SequenceEmpty
    - SequenceLength
    - SequenceErase
    - SequenceInsert 
    - ConcatFromSequence

With the exception of ConcatFromSequence, all of the above operators
were implemented with CPU kernels that a) required all of the contained
tensors to also be on CPU, and b) would clone each tensor into a new
sequence as a side effect of each operator. The implementation of
sequences are backend agnostic, as they dont affect actual tensor layout
or manipulate the contents of the tensors. In addition, with the
exception of SequenceAt, the other operators need not make copies of the
underlying referenced tensors.

Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copies the contents
of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors agnostic to
backend.
3) The internal container implemented by TensorSeq has changed from
onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor
does not support copy or assignment construction, so it must have a
singular owner. However, is same tensor participates in multiple
containers it would have multiple container "owners" and this would not
be possible.
4) Other code that accessed values from TensorSeq have associated
changes to extract Tensors from OrtValues now.

In addition, DirectML execution was very slow when the above Sequence
operators were added to a graph, as this caused MemcpyToHost and
MemcpyFromHost kernels to be inserted between the graph and the sequence
operators. To optimize DirectML,
1) The CPU implementations for the Sequence* ops were registered as DML
implementations. Since the above changes also includes making the CPU
kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML.
However, there was little DirectML EP operator framework support for
operators that accept/output sequences of tensors. This change has
modified the internal COM interfaces to include new apis to interrogate
for sequence shapes, and extract the needed tensors from TensorSeq.

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
2023-02-21 18:08:28 -08:00
Christian Veenhuis
9fbb2b4742
Fix broken link in onnxruntime_c_api.h (#14748)
### Description
Fix the broken link in header file onnxruntime_c_api.h w.r.t. the graph
optimization levels (line 300).

### Motivation and Context
This fix solves open issue #14741
2023-02-21 15:07:06 -08:00
Yuriy Chernyshov
973aaf110b Improve compatibility with certain STL's
We use customized libc++ which uses raw pointers as std::vector::iterators.

As per [expr.pre.incr](https://eel.is/c++draft/expr.compound#expr.pre.incr), builtin `operator++` can only be applied to lvalue, while `std::vector::begin()` returns an rvalue.

See [this](https://godbolt.org/z/d3a1aKTWP) godbolt snippet for the details.
2023-02-21 14:06:16 -08:00
Dale Phurrough
68db1b62a8
add noexcept to InitApi() and GetApi() (#13869)
### Description

* add noexcept to `InitApi()` and `GetApi()`

### Motivation and Context

* fixes microsoft/onnxruntime#12581
2023-02-15 16:49:16 -08:00
cao lei
50fa151298
remove device_id parameter out of ExecutionProvider::GetAllocator() (#14580)
### Description
Remove the parameter device_id out of ExecutionProvider::GetAllocator()
function



### Motivation and Context
The parameter device_id is not necessary. We can fully rely on the
second parameter OrtMemType mem_type to determine the device_id when
getting allocator from executionProvider.
2023-02-13 10:01:07 -08:00
cloudhan
9bd022b8be
Add TuningContext for TunableOp (#14557)
This makes the the TunableOp tuning results state free and will allow us to
dump and load offline tuning results.
2023-02-10 14:27:43 +08:00
Maximilian Müller
e9ab56fa64
Adding RunOptions synchronization behaviour to C/C++ API (#14088)
### Description
This is exposing the already existent interface of asynchronous work of
all CUDA base EP's (CUDA + TensorRT).


### Motivation and Context
This is something requested in #12216. It will enable users to build an
efficient data pipeline with ONNXRuntime and CUDA pre-/post-processing.
PCI traffic to the CUDA device can be run during inference as soon as
the postprocessing consumed the input buffer and it can be overwritten.
To do this work has to be submitted async to the device. Please see
below screenshots showing the illustration of this using NSight Systems.

Async: 
<img width="1401" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894303-706460ed-cbdb-4be2-a2e4-0c111ec875dd.png">

Synchronous:
<img width="1302" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894630-1ce40925-bbd5-470d-b888-46553ab75fb9.png">

Note the gap in between the 2 inference runs due to issuing PCI traffic
in between and to the CPU overhead the active synchronization has.

---------

Co-authored-by: Chi Lo <chi.lo@microsoft.com>
2023-02-07 19:59:28 -08:00
Nat Kershaw (MSFT)
638f21b969
Upgrade doxygen to fix C API docs build issue (#13950) 2023-02-03 09:43:29 -08:00
Baiju Meswani
3d8fa4d77b
GetTrainingApi to not print to stderr when not an ort training build (#14515) 2023-02-02 13:28:32 -08:00
Dmitri Smirnov
61e7636e61
Re-work GetAvailableProviders API (#14486)
### Description
Re-work `OrtApi::GetAvailableProviders` in a way that the data is
returned in a single allocation.
Fix exception safety issues and fix `Release` function. 
Remove warning suppressions.
Fix exception safety issue in C++ API.
Fix exception safety issue in C# API.
Move EP name length enforcement to the implementation.

### Motivation and Context
The original motivation comes from
https://github.com/microsoft/onnxruntime/issues/14378.
However, the API is already implemented.

Cc: @prabhat00155
2023-02-01 14:38:04 -08:00
Erick Muñoz
d1533c27eb
[oneDNN] Improved thread handling (#13618)
* Added the OrtDnnlProviderOptions structure to expose configuration
options to the user

* The number of threads can be defined by the user with the -i flag on
the perftest

* Number of threads can also be configured via the OMP_NUM_THREADS
environment variable

* The number of threads defined in the OrtDnnlProviderOptions is
prioritized over the environment variable

### Description
Avoids thread oversubscription caused by OpenMP allocating the maximum
number of threads possible for oneDNN EP. Added support for the
OrtDnnlProviderOptions, this will allow for more EP customization
capabilities, and allows for user defined number of threads.



### Motivation and Context
- Improves performances and allows for user to fine tune the number of
threads
2023-01-31 14:37:13 -08:00
sfatimar
77b455b969
Ort openvino 4.3 cli (#14341)
### Description
Introduce cache_dir CLI for graph serialisation.
Replace existing use_compile_network and blob_dump_path cli options for
openvino with a single command line option "cache_dir" specifying the
path that needs to be passed for blob dump/load improving the developer
experience.

### Motivation and Context?
We were having two values to set cache dir which was unnecessary

Co-authored-by: Preetha <preetha.veeramalai@intel.com>
2023-01-23 14:17:52 -08:00
Adrian Lizarraga
de17d53c50
Custom Op runtime wrapper (#13427)
### Description

Adds the below C APIs to support custom ops that wrap an entire model to
be inferenced with an external runtime. The current SNPE EP is an
example of an EP that could be ported to use a custom op wrapper. Ex:
The custom op stores the serialized SNPE DLC binary as a string
attribute. The SNPE model is built when the kernel is created. The model
is inferenced with SNPE APIs on call to the kernel's compute method.

#### C APIs
| API | Description | Why |
| ---            | ---        | ---  |
| `KernelInfo_GetInputCount` | Gets number of inputs from
`OrtKernelInfo`. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfo_GetOutputCount` | Gets number of outputs from
`OrtKernelInfo`. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfo_GetInputName` | Gets an input's name. | Query I/O
characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputName` | Gets an output's name. | Query I/O
characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputTypeInfo` | Gets the type/shape information for an
input. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputTypeInfo` | Gets the type/shape information for
an output. | Query I/O characteristics during kernel
creation<sup>1</sup> |
| `KernelInfoGetAttribute_tensor` | Get a OrtValue tensor stored as an
attribute in the graph node | Extract serialized models, weights, etc. |
| `GetSessionConfigEntry` | Get a session configuration value | Need to
be able to get session-time configurations from within custom op |
| `HasSessionConfigEntry` | Check if session configuration entry exists.
| Need to be able to get session-time configurations from within custom
op |

#### Why so many KernelInfo APIs?<sup>1</sup>
Similar APIs currently exist for `OrtKernelContext`, but not
`OrtKernelInfo`. Note that `OrtKernelContext` is passed to the custom op
on call to its kernel's compute() function. However, `OrtKernelInfo` is
available on kernel creation, which occurs when the session is created.
Having these APIs available from `OrtKernelInfo` allows an operator to
trade-off computation time for session-creation time, and vice versa.
Operators that must build expensive state may prefer to do it during
session creation time instead of compute-time.

SNPE is an example of an EP that needs to be able to query `KernelInfo`
for the name, type, and shape of inputs and outputs in order to build
the model from the serialized DLC data. This is an expensive operation.
Other providers (e.g., OpenVINO) are able to query i/o info from the
serialized model, so they do not strictly need these APIs. However, the
APIs can still be used to validate the expected I/O characteristics.

Additionally, several of our CPU contrib ops currently use the same
internal version of these KernelInfo APIs (Ex:
[qlinear_softmax](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc#L71)).
If custom ops are also meant to be a test bed for future ops, then all
custom ops (not just runtime wrappers) would benefit from the addition
of these public KernelInfo APIs (IMO).

#### Example of usage in a custom OP
From
`onnxruntime/test/testdata/custom_op_openvino_wrapper_library/openvino_wrapper.h`

```c++
struct CustomOpOpenVINO : Ort::CustomOpBase<CustomOpOpenVINO, KernelOpenVINO> {
  explicit CustomOpOpenVINO(Ort::ConstSessionOptions session_options);

  CustomOpOpenVINO(const CustomOpOpenVINO&) = delete;
  CustomOpOpenVINO& operator=(const CustomOpOpenVINO&) = delete;

  void* CreateKernel(const OrtApi& api, const OrtKernelInfo* info) const;

  constexpr const char* GetName() const noexcept {
    return "OpenVINO_Wrapper";
  }

  constexpr const char* GetExecutionProviderType() const noexcept {
    return "CPUExecutionProvider";
  }

  // IMPORTANT: In order to wrap a generic runtime-specific model, the custom operator
  // must have a non-homogeneous variadic input and output.

  constexpr size_t GetInputTypeCount() const noexcept {
    return 1;
  }

  constexpr size_t GetOutputTypeCount() const noexcept {
    return 1;
  }

  constexpr ONNXTensorElementDataType GetInputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr ONNXTensorElementDataType GetOutputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetInputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetOutputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr bool GetVariadicInputHomogeneity() const noexcept {
    return false;  // heterogenous
  }

  constexpr bool GetVariadicOutputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  std::vector<std::string> GetSessionConfigKeys() const { return {"device_type"}; }

 private:
  std::unordered_map<std::string, std::string> session_configs_;
};
```

#### How to create a session:
```c++
Ort::Env env;
Ort::SessionOptions session_opts;
Ort::CustomOpConfigs custom_op_configs;

// Create local session config entries for the custom op.
custom_op_configs.AddConfig("OpenVINO_Wrapper", "device_type", "CPU");

// Register custom op library and pass in the custom op configs (optional).
session_opts.RegisterCustomOpsLibrary(lib_name, custom_op_configs);

Ort::Session session(env, model_path.data(), session_opts);
```
### Motivation and Context
Allows creation of simple "wrapper" EPs outside of the main ORT code
base.
2023-01-18 09:09:32 -08:00
Jian Chen
d95249f516
Removing Double QDQ from Graphs (#14024)
### Description
When there are 2 QDQ pair back to back, we want to delete the 1 Q and 1
DQ nodes.
ex:
Q->DQ->Q->DQ  =====> Q->DQ



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-01-16 19:06:57 -08:00
Scott McKay
b9ecd428c1
Add ability to register custom ops by specifying a function name (#14177)
### Description
<!-- Describe your changes. -->
Use dlsym/GetProcAddress to lookup a custom ops registration function by
name and call it.

This will be better on mobile platforms where the custom ops library is
linked against, and there isn't necessarily a filesystem that a library
path can be loaded from.

Alternative is to wire up passing in the address of the function, but
that has multiple complications which differ by platform.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable using ort and ort-ext packages on mobile platforms.

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-01-12 15:11:34 +10:00
RandySheriffH
83ad562826
Rename CloudEP to AzureEP (#14175)
Rename CloudEP to AzureEP.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-11 12:25:04 -08:00
Ashwini Khade
d92c663f28
Create dedicated build for training api (#14136)
### Description
Enable creating dedicated build for on device training. With this PR we
can build a lean binary for on device training using flag
--enable_training_apis. This binary includes only the essentials like
training ops, optimizers etc and NOT features like Aten fallback,
strided tensors, gradient builders etc . This binary also removes all
the deprecated components like training::TrainingSession and OrtTrainer
etc

### Motivation and Context
This enables our partners to create a lean binary for on device
training.
2023-01-10 20:58:04 -08:00
we1559
c65a03699a
add ThreadingOptions, wraps OrtThreadingOptions (#13711)
…threadpools' options of The Env.

### Description
<!-- Describe your changes. -->
add a c++ class ThreadingOptions, wraps OrtThreadingOptions
as I described in issue #13710 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

close #13710

Co-authored-by: zengxiangneng <zengxiangneng@360.cn>
2023-01-06 11:21:10 -08:00
Abhishek Udupa
d460c01b8c
Fix skew between GPU/CPU timestamps in ORT profiler (#14004)
### Description
This PR fixes the skew between GPU/CPU timestamps with a more reliable
algorithm.

### Motivation and Context
An earlier implementation attempted to guess the right correction to
apply, but this led to misleading profile outputs. This PR fixes this
problem by utilizing a more reliable technique to normalize GPU
timestamps. Attached are sample profile outputs and visualization
screen-grabs from a run of a transformer-based model before and after
the fix.

Before Fix:

![profile_visualization_cuda_without_fix](https://user-images.githubusercontent.com/17418420/208197234-7390d8e3-4354-4e67-93cf-958c319146ee.png)

After Fix:

![profile_visualization_cuda_with_fix](https://user-images.githubusercontent.com/17418420/208197230-3e108b82-8dfa-476b-9277-7895639a3785.png)

Profiler outputs that are rendered in the visualizations above:

[sample_outputs.zip](https://github.com/microsoft/onnxruntime/files/10249689/sample_outputs.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2023-01-05 11:07:26 -08:00
Adrian Lizarraga
68794d0ac1
Improve custom op library handle cleanup (#14099)
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`

### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).

The new API automatically cleans up the library and allows the user to
write more straightforward code.
2023-01-04 17:56:29 -08:00
cao lei
b29a1c7348
Address follow-up comments on multistream pr #13495 (#13992)
### Description
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Changes including:

- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr


### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-01-03 16:33:36 -08:00
RandySheriffH
587e891cae
CloudEP (#13855)
Implement CloudEP for hybrid inferencing.
The PR introduces zero new API, customers could configure session and
run options to do inferencing with Azure [triton
endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint)
Sample configuration in python be like:

```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton');
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com');
sess_opt.add_session_config_entry('cloud.model_name', 'detection2');
sess_opt.add_session_config_entry('cloud.model_version', '7'); // optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1'); // optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1') # 0 for local inferencing, 1 for cloud endpoint.
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input':input_}, run_opt)
```

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-03 10:03:15 -08:00
Adrian Lizarraga
3bbcc2799f
Support for custom op variadic inputs/outputs (#13946)
### Description
Adds support for variadic inputs and outputs to custom operators.

### Motivation and Context
Needed for custom ops that wrap external runtimes/models and maybe TensorRT plugins.
2022-12-23 11:41:15 -08:00
Changming Sun
fc2a6db573
Update absl to the latest release (#13990)
### Description
Update absl to a new version

### Motivation and Context
The new version contains fixes that are needed for Nvidia GPU build.
Once we update it to that version, we don't need to maintain our private
patches for Nvidia GPU build.
2022-12-19 14:25:13 -08:00
FFFrog
6705915af8
[CANN] Add the ability to run graph (#13728)
### Description
Add the ability to run graph

### Motivation and Context
A brief description is as follows:
1) If the whole graph is supported, then will be processed by the graph
engine, directly.
2) If the whole graph is not supported, the whole graph will be divided
into subgraphs and single operators; The sub-graphs will be run on graph
engine, and the single operators will fallback to the traditional mode.
2022-12-16 06:57:40 -08:00
Abhishek Udupa
c882601425
Add noexcept annotation to address prefast warnings (#13965)
### Description
Add noexcept annotations to move constructors and assignment ops to
address prefast warnings.
(see
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/11012/)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2022-12-15 09:44:22 -08:00
Tang, Cheng
a81faee41e
Multi-stream execution support (#13495)
**Description**: This PR including following works:
1. provide stream and related synchronization abstractions in
onnxruntime.
2. enhance onnxruntime's execution planner / executor / memory arena to
support execute multiple streams in parallel.
3. deprecate the parallel executor for cpu.
4. deprecate the Fence mechanism. 
5. update the cuda / tensorrt EP to support the stream mechanism,
support running different request in different cuda stream.

**Motivation and Context**
- Why is this change required? 
currently, the execution plan is just a linear list of those primitives,
ort will execute them step by step. For any given graph, ORT will
serialize it to a fixed execution order. This sequential execution
design simplifies most scenarios, but it has the following limitations:
1. it is difficult to enable inter-node parallelization, we have a
half-baked parallel executor but it is very difficult to make it work
with GPU.
2. The fence mechanism can work with single gpu stream + cpu thread
case, but when extend to multiple stream, it is difficult to manage the
cross GPU stream synchronizations.
3. our cuda EP rely on the BFCArena to make the memory management work
with the GPU async kernels, but current BFCArena is not aware of the
streams, so it doesn't behavior correctly when run with multiple
streams.

This PR enhance our existing execution plan and executor to support
multiple stream execution. we use an unified algorithm to mange both
single stream and multiple stream scenarios.
This PR mainly focus on the infrastructure support for multiple stream
execution, that is said, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be in the future PR.

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
2022-12-15 07:39:29 -08:00
Jeff Daily
c9edc01c0b
[ROCm] float16.h should use __HIP__ not USE_ROCM (#13684)
The float16.h header is shared between the CPU and ROCm EPs. The
USE_ROCM macro is defined universally, but for the float16.h header we
only wish to detect the hip-clang compiler. Otherwise, the CPU EP fails
to build because of -Werror -Wuninitialized caused by the USE_ROCM code
additions, and the CPU EP should be using a different code path.
2022-12-13 15:34:42 -08:00
RandySheriffH
75584c5fa8
Enabling thread pool to be numa-aware (#13778)
The PR enables ort thread pool to be numa-aware, so that threads could
be evenly created and distributed among numa nodes.
In addition, to facilitate performance tuning, the PR opens a new API
allowing customers to attach threads to certain logical processors.
Please check the API
[definition](https://github.com/microsoft/onnxruntime/pull/13778/files#diff-5845a5c76fb64abdc8f0cffe21b37f8da1712674eb3abc4cd87190891be1bd48)
for details.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-12-12 10:33:55 -08:00
JiCheng
22fa62152a
Pass SessionOptions to XnnpackProviderFactoryCreator. (#13318)
### Description
To pass session_options to Xnnpack EP via
`XnnpackProviderFactoryCreator` for Initializing xnnpack's threadpool.

If you want to use different threadpool size or even disable xnnpack's
threadpool, just setting intra_threadpool to 1 by xnnpack EP's
provider_options.


### Motivation and Context

Co-authored-by: Guangyun Han <guangyunhan@microsoft.com>
Co-authored-by: Jicheng Wen <jicwen@microsoft.com>
2022-12-10 14:23:46 +08:00
Abhishek Udupa
83c59d2594
Session-aware and thread-safe CUDA profiler (#13706)
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.

### Motivation and Context
[PR 13549](https://github.com/microsoft/onnxruntime/pull/13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.


[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)

[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)

[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2022-12-09 13:22:12 -08:00
Ashwini Khade
983877c712
Decouple strided tensor support from ENABLE_TRAINING (#13829)
### Description
Decouple strided tensor support from ENABLE_TRAINING

### Motivation and Context
This is step 1 for creating a dedicated build for on device training.
Intention is

1. We can set ENABLE_STRIDED_TENSORS in cmake when either
ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, this way we
dont have to use if defined(ENABLE_TRAINING) ||
defined(ENABLE_TRAINING_ON_DEVICE ) everywhere in the code.

2. This also paves the way to easily enable strided tensor support for
inference in future (if required).
2022-12-07 09:22:21 -08:00
stevenlix
ce0025d3f2
Fallback Pow op in layer norm to FP32 in TRT to avoid overflow (#13639)
Accuracy loss is observed when transformer models such as BERT, DeBERTa,
ViT are running in TRT FP16 mode. The cause is that overflow happens at
Pow op in layer norm.
This PR provides the option to force Pow to run in TRT FP32 precision if
overflow occurs.

Co-authored-by: Ubuntu <azureuser@orteplinuxdev.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>
2022-11-29 13:37:31 -08:00
Changming Sun
87e6a26c5d
Enforce Prefast check in Windows CPU CI pipeline (#13735)
Right now we fix the warnings in an ad-hoc way. We run static analysis
in nightly builds, then create work items for the finding it found. Our
CI build pipelines run the same scan but do not break the build. So,
this PR will fix the remaining findings in the CPU EP(including the
training part) and enforce the check. Later on we can continue to expand
the scope.

We still have some warnings left in the JNI part. I will try to address
them later in the next month.
2022-11-23 09:25:02 -08:00
cloudhan
9e649d1ac4
Allow CUDA EP enable or disable TunableOp via session options and environment variable (#13601)
This ports #13116 from ROCm EP to CUDA EP
2022-11-15 14:43:54 +08:00
Abhishek Udupa
9954454c65
Make the ROCM profiler thread-safe, session-aware and preserve logical ordering between CPU and GPU events (#13549)
### Description
The existing ROCM profiler has a few shortcomings, which this PR fixes.

### Motivation and Context
The existing ROCM profiler:
1. Is not thread-safe
2. Is not session-aware: i.e., if multiple inference sessions enable
profiling, then events (esp GPU events) get mixed up between the
sessions
3. Has some issues with respect to coding standards.

This PR addresses all of the above by cleanly re-implementing parts of
the ROCM profiler as required.

Attached are 4 profile outputs from a multi-session run of the
StableDiffusion model, as well as a quick-and-dirty script that checks
the profile outputs for the invariants claimed.


[sd_profile_outputs.tar.gz](https://github.com/microsoft/onnxruntime/files/9924608/sd_profile_outputs.tar.gz)


[check_profile_output_wellformedness.zip](https://github.com/microsoft/onnxruntime/files/9924614/check_profile_output_wellformedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2022-11-10 10:25:41 -08:00
Edward Chen
215732f74b
Ignore saved runtime optimizations when updating ORT format model <v5. (#13393)
The old runtime optimization format is not readily convertible to the new one without extra information for translating kernel def hashes.
Ignore such saved runtime optimizations and output a warning for now.
2022-11-08 13:36:46 -08:00
yf711
8b9065a396
Add getter/setter of C# OrtEnv log level (#13402)
### Description
* Add getter/setter to access and update C# OrtEnv log level
* Add C API about updating ort env with custom log level to support the
setter above (Following [pybind
implementation](952c99304a/onnxruntime/python/onnxruntime_pybind_state.cc (L923-L924)))
* Add test case to verify getter & setter


### Motivation and Context
* For C++/Python, the log level can be adjusted via OrtEnv, and this
feature is missing in C# binding
2022-11-04 21:46:00 -07:00
pengwa
a3e7da60e7
Trade subgraph recompute for memory (#12852)
**Description**: Subgraph-level recompute

This PR adds an optional capability trading additional re-computation
for better memory efficiency. Specifically, a pre-defined operator list
used to iterate the Graph to find some subgraphs for recompute, to
reduce some stashed activations whose lifetime across forward and
backward pass.

When training with ORTModule, by default, the graph transformer will
scan the execution graph to find all eligible subgraph to recompute,
along with sizes that can save. An example looks like below.
If we want to enable some of them to recompute, we can define env
variable this way:
`export
ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```

[1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>:  User config:
[1,0]<stderr>:  Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>:  =================================
[1,0]<stderr>:  Subgraph: BitmaskDropout+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 1,024 x   Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: BiasGelu+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:labels_dim0 x      Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:23
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Add+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x     Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Sub+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:1,024 x 1,024 x    Frequency:97
[1,0]<stderr>:                  PatternShape:3 x 1,024 x        Frequency:1
[1,0]<stderr>:                  PatternShape:8 x 64 x   Frequency:24
[1,0]<stderr>:                  PatternShape:1,024 x 4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x 1,024 x    Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  =================================
```


"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" means what kind of subgraph will be recomputed, in this case,
it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.

**Baseline**

On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100
GPUs. With latest main branch, we can run batch size 16, and the maximum
batch size < 32. So 16 is usually chosen by data scientists. 65% of 40GB
memory is used during training. The SamplesPerSec=479.2543353561354.


![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png)

**With this PR**

Gelu is recomputed for saving memory peak, batch size 32 can be run. The
97% of 40GB A100 is used, the SamplesPerSec=562.041593991271 (**1.17X**
of baseline).


![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)


**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.
2022-11-03 13:49:41 +08:00
Wei-Sheng Chin
b5904c40dd
Enable ORT in TorchDynamo (#13259)
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
2022-11-01 11:19:29 -07:00
Adrian Lizarraga
9d867a07c0
Fix regression in CustomOpApi::GetTensorData (#13450)
- Reverts change to CustomOpApi::GetTensorData introduced by commit 5dae0c477d,
which causes infinite recursion.
- Moves EndsProfilingAllocated to non-const session implementation
(C++ API header).
2022-10-31 12:20:49 -07:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Fei Hu
943e156f4c
Allow custom ops to set input memory type (#10879) 2022-10-28 21:45:26 -07:00
cloudhan
fc12abf6b1
Enable/Disbale tunable GEMM by using tunable switch in provider options and env var (#13116)
Related PRs #12853

This allows the user enable/disbale tunable GEMM on demand.
2022-10-19 22:35:08 -07:00
Scott McKay
565da71275
Make 'env' argument to Session const (#13362)
### Description
<!-- Describe your changes. -->
The Env argument does not need to be mutable to call the underlying C
API. Update the Ort::Session ctor to have a const Env.

All other changes are from clang-format running. 

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Cleanup
2022-10-19 14:23:24 +10:00
Dmitri Smirnov
f5e3165cc3
Fix move Base::operator= (#13355)
### Description
Base::operator= move is broken, loses a valid ptr.

### Motivation and Context
Address
https://github.com/microsoft/onnxruntime/pull/13215#discussion_r997814275
2022-10-18 13:07:40 -07:00
Dmitri Smirnov
4a63cd0290
Improve thread pool creation failure handling. (#13313)
### Description
Detect and report thread creation failure on Windows.
Do not throw out of constructor after the thread is created,
the thread handle is lost and cannot be joined, resulting in a deadlock.

Make setting a thread priority on Linux consistent with windows.
Set thread priority in the thread itself. Log failure properly,
but do not exit the thread.

### Motivation and Context
Address issues https://github.com/microsoft/onnxruntime/issues/13291
And
https://github.com/microsoft/onnxruntime/issues/13285#issuecomment-1278063223
2022-10-15 17:57:19 -07:00
Dmitri Smirnov
f0fbff6dd4
Adjust docs to comply with Doxygen requirements (#13302)
### Description
Fix up param names in docs

### Motivation and Context
Make pipelines pass
2022-10-12 18:07:18 -07:00
cloudhan
1e55949a70
Fix unsound hipify in ROCm EP (#13269)
Some cuda related things is still left in the rocm ep statically
hipified code. Eliminate them to avoid confusion.
2022-10-12 08:32:42 +08:00
cloudhan
2cf5d04e3d
Fix clang-tidy(cppcoreguidelines-pro-bounds-array-to-pointer-decay) (#13241)
clang-tidy says "Do not implicitly decay an array into a pointer; consider using gsl::array_view or an explicit cast instead"

It is a false positive scattering around all our codebase when using
helper macros. It is becuase for function with 4 char name, say `main`,
the type of __FUNCTION__ and __PRETTY_FUNCTION__ is `char [5]`.
2022-10-11 13:16:48 +08:00
Dmitri Smirnov
5dae0c477d
Deprecate CustomApi and refactor public API for better safety and consistency (#13215)
### Description
Deprecate CustomOpApi and refactor dependencies for exception safety and
eliminate memory leaks.
Refactor API classes for clear ownership and semantics.
Introduce `InitProviderOrtApi()`

### Motivation and Context
Make public API better and safer.

Special note about `Ort::Unowned`. The class suffers from the following
problems:

1. It is not able to hold const pointers to the underlying C objects.
This forces users to `const_cast` and circumvent constness of the
returned object. The user is now able to call mutating interfaces on the
object which violates invariants and may be a thread-safety issue. It
also enables to take ownership of the pointer and destroy it
unintentionally (see examples below).
2. The objects that are unowned cannot be copied and that makes coding
inconvenient and at times unsafe.
3. It directly inherits from the type it `unowns`.

All of the above creates great conditions for inadvertent unowned object
mutations and destructions. Consider the following examples of object
slicing, one of them is from a real customer issue and the other one I
accidentally coded myself (and I am supposed to know how this works).
None of the below can be solved by aftermarket patches and can be hard
to diagnose.

#### Example 1 slicing of argument
```cpp
void SlicingOnArgument(Ort::Value& value) {
  // This will take possession of the input and if the argument
  // is Ort::Unowned<Ort::Value> it would again double free the ptr
  // regardless if it was const or not since we cast it away.
  Ort::Value output_values[] = {std::move(value)};
}

void main() {
  const OrtValue* ptr = nullptr;  // some value does not matter
  Ort::Unowned<Ort::Value> unowned{const_cast<OrtValue*>(ptr)};
  // onowned is destroyed when the call returns.
  SlicingOnArgument(unowned);
}
```

#### Example 2 slicing of return value
```cpp
// The return will be sliced to Ort::Value that would own and relase (double free the ptr)
Ort::Value SlicingOnReturn() {
  const OrtValue* ptr = nullptr; // some value does not matter
  Ort::Unowned<Ort::Value> unowned{const_cast<OrtValue*>(ptr)};
  return unowned;
}
```
2022-10-06 14:57:37 -07:00
Edward Chen
5c89c37f7f
Consolidate enabled/default kernel def type constraints (#13034)
Consolidate enabled/default kernel def type constraint types into enabled.
2022-09-27 14:04:15 -07:00
RandySheriffH
a83a9ed6b0
Remove miscellaneous nuphar configs (#13070)
Remove a handful of nuphar related configurations after deprecation.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-09-26 13:41:28 -07:00
Edward Chen
5f611b63a1
Make classes IKernelTypeStrResolver and IKernelLookup have protected destructors. (#13059) 2022-09-23 09:16:45 -07:00
wangxiyuan
952c99304a
Add CANN EP (#12416)
**Description**: This PR adds Ascend CANN execution provider support.

**Motivation and Context**
- Why is this change required? What problem does it solve?
As the info shown in the issue. CANN is the API layer for Ascend
processor. Add CANN EP can allow user run onnx model on Ascend hardware
via onnxruntime
  The detail change:
  1. Added CANN EP framework.
  2. Added the basic operators to support ResNet and VGG model.
  3. Added C/C++、Python API support
- If it fixes an open issue, please link to the issue here.
   https://github.com/microsoft/onnxruntime/issues/11477

Author: 
lijiawei <lijiawei19@huawei.com>
wangxiyuan <wangxiyuan1007@gmail.com>

Co-authored-by: FFrog <ljw1101.vip@gmail.com>
2022-09-22 14:53:40 -07:00
sfatimar
cccbe90764
Openvino ep 2022.2 v4.2 (#13023)
This changes are to align OV 2022.2 Release with ORT . Changes
CPU FP16 Support, dGPU Support, RHEL Dockerfile, Ubuntu 20 Dockerfile 

**Motivation and Context**
- This change is required to ensure ORT-OpenVINO Execution Provider is
aligned with latest changes.
- If it fixes an open issue, please link to the issue here.

Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: shamaksx <shamax.kshirsagar@intel.com>
Co-authored-by: pratiksha <pratikshax.bapusaheb.vanse@intel.com>
Co-authored-by: pratiksha <mohsinx.mohammad@intel.com>
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: nmaajidk <n.maajid.khan@intel.com>
Co-authored-by: Mateusz Tabaka <mateusz.tabaka@intel.com>
Co-authored-by: intel <intel@iotgecsp-nuc04.iind.intel.com>
2022-09-22 12:31:40 -07:00
Edward Chen
454f77cd94
Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791)
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.

# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
2022-09-20 14:24:59 -07:00
Cheng
f26054deca
[XNNPACK] Support running in multi-thread with seperate pthreadpool (#11762)
**Description**: Describe your changes.
XNNPACK takes pthreadpool as its internal threadpool implemtation, it
couples calculation and parallelization. Thus it's impossible to
leverage ORT's threadpool (EIGEN/OPENMP based). So we enabled
pthreadpool in XNNPACK EP in this PR.

Case 1:  Pthreadpool coexist with ORT-threadpool simply
Expriments setup
hardware:RedMi8A with 8 cores, ARMv7
The two threadpool has the same pool size form 1 to 8.
Two models: mobilenet_v2 and mobilenet_egetppu.
we can see the picture below and draw a conclusion, latency are even
higher from 5 threads or more.


![image](https://user-images.githubusercontent.com/9417365/190550127-2304adfe-97ac-4aeb-91a0-4606b5305a82.png)

Case 2:
For the reason of performance regression with 5 more threads,
ORT-threads are spinning on CPU and diddn't realease it after
computation finished. It's equivalent of creating 5x2 threads for
parallelization while we have only 8 cpu cores.
So I mannuly disabled spinning after ort-threadpool finished and enabled
it when enter ort-threadpool.
The result is quite normal now.

![image](https://user-images.githubusercontent.com/9417365/190675230-0d85dd02-01f0-4255-967d-e3dbb2a1fe52.png)


Case 3:
Even we achieved a reasonable results with disabling spinning, Will
ORT-threadpool still impact performance of pthreadpool?
we have expriment setting up as: Setting ORT-threadpool size
(intra_thread_num) as 1, and only pthreadpool created.
Attention that, almost a third of ops are running by CPU EP. we are
surprisingly find that disabling ort-threadpool is even better in
performance than creating two threadpool.


![image](https://user-images.githubusercontent.com/9417365/190556480-d6507396-d777-44fc-94e1-938d2b9bb7d7.png)


Case 4:
Use a unified threadpool between CPU ep and XNNPACK ep.
It's the fastest among all. But if we take the similar workload
partition strategy as ORT-threadpool, it could be faster.


![image](https://user-images.githubusercontent.com/9417365/190674908-a68fd20f-bdf4-41f9-bf0a-76b304cda490.png)

**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.

Co-authored-by: Jicheng Wen <jicwen@microsoft.com>
2022-09-20 16:02:15 +08:00
Tang, Cheng
739b5675c8
remove legacy compile api (#12932)
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-09-15 13:18:40 -07:00
Dmitri Smirnov
bc2df1bf95
Remove previously deprecated API (#12935)
Remove previously deprecated API
Format JS code, address review comments
NPM Formatting
2022-09-14 10:58:03 -07:00
Scott McKay
1016c33519
Fix prefast warning in upsample.cc. (#12938)
* Fix prefast warning.
* Fix some other static analysis warnings.
2022-09-14 08:14:33 +10:00
Cheng
8cedafe250
[xnnpack] Have Initializer in Mobile related EPs in Minimal_build and creating EP specific dynamic-schema (#12555)
* Remove the dependence of Qlinearsoftmax schema

* refactor initializerview &&  create shared schema

* Dynamic Create EP specific schema

* Have Initializer in minimal_build

* address comments

* remove CancelFuseSubGraph
2022-09-06 14:32:15 +08:00
ashbhandare
27dde0b51f
Csharp bindings for on-device training APIs (#12404) 2022-09-02 13:13:48 -07:00
Yulong Wang
82a28cc2c3
upgrade emsdk to 3.1.19 (#12690)
* upgrade emsdk to 3.1.19

* fix build break

* ignore '-Wunused-but-set-variable' in eigen

* add malloc and free in exported functions

* EXPORTED_FUNCTIONS
2022-08-30 13:42:45 -07:00
Baiju Meswani
b83ea3c2ff
Address prefast static analysis warnings (#12756) 2022-08-29 10:09:32 -07:00
mwootton
817dc94345
Add first pass of rocm kernel profiler (#10911)
* Add first pass of rocm kernel profiler

* Clean up rocm_profiler. Format args. Demangle kernel names.
Add Api EventRecords

* Remove debug output

* Temporarily disable profiling unit test 'api record check' for cupti

* Fix compile error for non-gpu builds

* Use common file for demangle and pid/tid.  Namespace ThreadUtil.  Fix gpu buffer clearing.

* Merge demangle into profiler_common

* Merge demangle into profiler_common part 2

* Style cleanup

* Resolve linking issues via ProviderHost interface

* Demangle cuda kernel names

* Clean up comments

* Fix formatting

* Fix anal retentive formatting
2022-08-26 19:38:03 -07:00
edgchen1
c270ea148a Move 'using common::Status;' from common.h to status.h. 2022-08-26 15:05:53 -07:00
Yulong Wang
c144acc534
Replace 'master' branch ref to 'main' in the code (#12547) 2022-08-22 10:48:12 -07:00
Dmitri Smirnov
9481893b58
Replace to lock_guard as lighter class for locking (#12616)
Replace to lock_guard as lighter class
2022-08-17 11:08:31 -07:00
Haoming Chen
8a038b9b0c
Fix a build error (#12600)
LLVM compiler complains the std::hash<const char*> and suggests std::hash<const void*>. But the intention is to hash the name string instead of the pointer. So use std::hash<std::string> to be explicit.
2022-08-17 10:49:54 -07:00
Scott McKay
0b0c51e028
Support direct usage of ORT format model flatbuffer for initializers (#12465)
* Add ability to use ORT format model flatbuffer directly for intiializers by leveraging the TensorProto external data infrastructure.

Requires user to provide ORT format model bytes when creating the session, and set both `session.use_ort_model_bytes_directly` and `session.use_ort_model_bytes_for_initializers` to 1 in SessionOptions config entries (AddSessionConfigEntry in C API).
2022-08-12 18:31:43 +10:00
Changming Sun
ac7538b909
Remove CUDA 10.2 support (#12541) 2022-08-10 22:46:41 -07:00
Dmitri Smirnov
c10704a501
Use alignas instead of naive padding to avoid false cache sharing (#12514)
PerThread and ChildThreadStat alignas
2022-08-10 11:23:20 -07:00
Cheng
64e991a9fc
[Qlinearsoftmax] contrib cpu (#12177)
* [Qlinearsoftmax] contrib cpu

* int8 implementation

* contrib operator md

* qdq transformer test

* new attribute: opset

* doc

* quantized tool

* remove template to reduce Binary size

* doc of contribe operators

* enforce x_shape is valid

* fix reduce_size if input-shape is dynamic

* add UT

* register one op for reducing binarysize

* kernel hash update

* docs/ContribOperators.md
2022-08-10 10:52:02 +08:00
Hector Li
730240d2a5
remove the link the comments (#12510) 2022-08-08 15:20:40 -07:00
Scott McKay
8d830adf24
Rework parts of Graph::Resolve to reduce memory usage (#12176)
* Rework some aspects of Graph::Resolve to reduce memory usage.
2022-08-05 13:20:25 +10:00
Dmitri Smirnov
a4ef0e7f7b
Remove dynamic allocation for ThreadPool ParallelSection (#12429)
Use InlinedVector in a TP
Store per thread parallel section in std::optional and avoid memory allocation
2022-08-04 09:46:16 -07:00
Ryan Hill
52d4699788
Minor doc fixes (#12388) 2022-08-03 19:47:36 -07:00
Hariharan Seshadri
d5a1c01b38
Add C++ Session ctor taking model bytes and OrtPrepackedWeightsContainer (#12333) 2022-07-29 12:32:43 -07:00
Yateng Hong
c579497134
Fix TRT custom op issue (#12283)
* Pass schema registry on CreateModel.

* Fix ORT_MINIMAL_BUILD.

* Fix build issue.
2022-07-29 03:39:56 -07:00
Ryan Hill
3e014a5e5d
Fix C header to stop people accidentally copying the OrtApi by value (#12297)
* Fix C header to stop people accidentally copying the OrtApi by value
* Remove api_ from KernelTwo
2022-07-25 19:19:40 -07:00
Dmitri Smirnov
3bf614fd47
Eliminate memory allocations per recent profiling (#12225)
* Alloc begin

FeedsFetches refactoring
Refactor Tensor class
Fix buffer deletor
Remove new/delete deleted
Adjust alloc move
Fix up xnnpack provider
Clarifying the comment on Create()
2022-07-25 14:14:38 -07:00
Ashwini Khade
ceb76429db
Merge pull request #12056 from microsoft/bmeswani/merge-training_dev/on_device_poc
Merge On-Device-Training Offline Tooling and C/C++ APIs
2022-07-21 15:09:48 -07:00
Baiju Meswani
cbf08c7a7b Make GetTrainingApi as a part of the OrtApis, add Training API documentation and address other pull request review comments 2022-07-21 18:11:48 +00:00
Dmitri Smirnov
4f106d2b3b
Eliminate unnecessary status lock acquisition in TP (#12196)
Eliminate unnecessary status lock acquisition in the Thread Pool
2022-07-19 14:16:12 -07:00
Chen Fu
040c2f4517
x86/64 U8S8 Gemm Precision Fix (#12088)
Add a graph optimization that convert u8s8 matrix multiplication to u8u8 if needed

In x86/64 platforms, specifically SSE4.1, AVX2 and AVX512 CPUs provide better performance computing u8s8 matrix multiplications. Unfortunately, the higher performance comes with value overflow problems, as described in:
https://www.intel.com/content/www/us/en/develop/documentation/onednn-developer-guide-and-reference/top/advanced-topics/nuances-of-int8-computations.html

In this change we added a session option "session.x64quantprecision" (default off). For operators that calls u8s8 matrix multiplications, e.g. QAttention, we convert them to u8u8 when the following conditions are all satisfied:

1. Current CPU is SSE4.1, AVX2 or AVX512 with no VNNI support
2. Session option "session.x64quantprecision" is on.
3. Constant weight tensor contains values outside of [-64, 63] range

Note that when weight tensor is not constant, QDQS8ToU8Transformer should already convert it to u8.
2022-07-13 10:12:25 -07:00
Baiju Meswani
a457ddc41d Merge branch 'master' of https://github.com/microsoft/onnxruntime into bmeswani/merge_pr 2022-06-30 21:53:07 +00:00
Baiju Meswani
6e8edfff0c
Separate training apis from shared core apis (#12027) 2022-06-29 14:12:29 -07:00
RandySheriffH
d5fcb432fa
Generalize native op creation (#11539)
* create op from ep

* read input count from context

* create holder to host nodes

* fix typo

* cast type before comparison

* throw error on API fail

* silence warning from minimal build

* switch to unique_ptr with deleter to host nodes

* fix typo

* fix build err for minimal

* fix build err for minimal

* add UT for conv

* enable test on CUDA

* add comment

* fix typo

* use gsl::span and string view for Node constructor

* Added two APIs - CopyKernelInfo and ReleaseKernelInfo

* pass gsl::span by value

* switch to span<NodeArg* const> to allow for reference to const containers

* fix typo

* fix reduced build err

* fix reduced build err

* refactoring node construction logic

* rename exceptions

* add input and output count as arguments for op creation

* refactor static member

* use ORT_CATCH instead of catch

* cancel try catch

* add static value name map

* format input definition and set err code

* fix comments

* fix typo
2022-06-27 21:12:15 -07:00
Baiju Meswani
d25cf4df26 Merge branch 'master' into training_dev/on_device_poc 2022-06-24 20:18:19 +00:00
Dmitri Smirnov
088bc7494b
Deprecate APIs returning raw ptrs and provide replacements (#11922)
Provider better documentation
2022-06-24 09:50:04 -07:00
G. Ramalingam
b1411c8357
Restructure function inliner (#11731)
* Add nested function call tests

* Add overload for Specialize

* Pass symboltable to onnx shape inference

* Avoid renaming empty names

* Enable sequence_map tests which failed before this change
2022-06-24 09:21:31 -07:00
Dmitri Smirnov
607b7df060
Allow saving on CPU usage for infrequent inference requests by reducing thread spinning (#11841)
Introduce Start/Stop threadpool spinning switch
Add a session config option to force spinning stop at the end of the Run()
2022-06-23 10:04:37 -07:00
sfatimar
61a74f2f4d
Mohsin/enable dynamic shapes (#11867)
* Add pypi build changes to latest Master

* Add ORT training part of OV build

* Disabling SqueezeOpTest.BadAxes

* Add ONNXruntime branch ARG to Docker build

* Changes to include file details versions

* Commit File Version Updates

* Change naming for linux build

* Add fix for pylint format errors

* Fix pylint warnings.

* Enable Dynamic Shapes for OV_API_20

* Update requirements.txt whl version- internal_ci fix

* Update backend_manager.cc MYRIAD Fix

* Update wheel version in requirements.txt

* Update backend_manager.cc

* Update backend_manager.cc

* Update backend_manager.cc

* Update setup.py

* Fix pylint warnings

* Fix pylint warnings 2

* Update backend_manager.cc

* Update backend_manager.cc

* Update backend_manager.cc

* Update backend_manager.cc

* Update backend_manager.cc

* Update backend_manager.cc

* Update backend_manager.cc

* Update backend_manager.cc

Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com>
2022-06-21 08:03:58 -07:00
Dmitri Smirnov
267a424e52
Retry Rework execution frame to reduce memory allocations (#11897)
* Revert "Revert "Refactor ExecutionFrame and SessionState to reduce memory all… (#11888)"

This reverts commit d2cbae3a04.

* Revert prepacked_weights to avoid indirect inclusion in CUDA and TRT code that breaks the build.
2022-06-20 10:29:43 -07:00
Edward Chen
a93fe7824a
Update EP compile API deprecation warning message. (#11808)
Minor wording update to warning message to clarify that the function style Compile API is deprecated now and will be removed soon.
Also updated some code comments.
2022-06-17 12:49:24 -07:00
Yi Zhang
d2cbae3a04
Revert "Refactor ExecutionFrame and SessionState to reduce memory all… (#11888)
Revert "Refactor ExecutionFrame and SessionState to reduce memory allocations and improve data locality (#11804)"

This reverts commit 2ecba6fd25.
2022-06-17 17:07:21 +08:00
stevenlix
bd65acd08d
Share execution context memory between TensorRT subgraphs (#11859)
* share trt context memory

* update parser to 8.4-EA

* update parser to 8.4-GA

* add context memory sharing enable option

* update parser to 8.2-GA

* fix format issue

* reverse orders

* fix format

* fix format

* fix issues
2022-06-16 22:42:40 -07:00
Dmitri Smirnov
2ecba6fd25
Refactor ExecutionFrame and SessionState to reduce memory allocations and improve data locality (#11804)
Refactor ExecutionFrame and SessionState for better data locality and less memory allocations.
2022-06-16 16:50:48 -07:00
Scott McKay
d64f23fec0
EP factory creation cleanup and enhancements. (#11798)
* Rework the EP factory creation setup so we're not cut-and-pasting function declarations in multiple places.
Convert append EP for SNPE to be generic, and also use for XNNPACK.
Add XNNPACK to C# API

* Don't need stub for MIGraphX as it's using provider bridge.

* Remove old 'create' functions that aren't applicable now that the EPs are built as separate libraries.

* Only use EPs that require the layout transform if the opset is supported by the layout transformer.

* Update wasm registration of xnnpack.
2022-06-16 07:01:41 +10:00
Ashwini Khade
f63e28c92f
C API version 0.001 (#11758)
* C API version 0.001

* fix linker issues

* fixes for save checkpoint api

* plus fixes based on tests

* plus test_runner and other changes

* Plus cosmetic updates

* remove unnecessary headers

* plus some updates

* plus more changes

Co-authored-by: Ashwini Khade <askhade@microsoft.com@orttrainingdev10.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-06-15 11:13:35 -07:00
Chen Fu
d936751aad
QlinearConv threading adjustments (#11228)
* Reserve the first core for the main thread

Currently in "auto affinity" mode the worker threads are affinized to cores 0..(N-1), leaving the very last core for the main thread. This patch preserves core #0 for the main thread, and affinizes the worker threads to cores 1..N.

* Avoid unneeded spin_pause in thread pool's worker threads

Remove unneeded PAUSE instruction (0.1-0.2 usec latency) after a worker thread finds a task to execute.

* MLAS/x86: optimize QLinearConv on hybrid CPUs

Existing 4x task granularity for task partitioning on hybrid CPUs is
not sufficient to compensate the difference of VNNI instructions
throughput
between performance and efficient cores. This patch...

* Increases granularity for QLinearConv by 2x, to have 2x more tasks
with 2x
  smaller output count

  * Limits QLinearConv task count from above, to avoid output count per
  task
    getting smaller than kernel's capability

    * Remove hardcoded task count for QLineConv as it limited scaling on
      16+ cores CPUs

* MLAS/x86: optimize QLinearConv on hybrid CPUs

Existing 4x task granularity for task partitioning on hybrid CPUs is not sufficient to compensate the difference of VNNI instructions
throughput between performance and efficient cores. This patch...

  * Increases granularity for QLinearConv by 2x, to have 2x more tasks
  with 2x smaller output count

  * Limits QLinearConv task count from above, to avoid output count per
  task getting smaller than kernel's capability

  * Remove hardcoded task count for QLineConv as it limited scaling on
  16+ cores CP

* Addressing comments

* combining x86 ARM branches in qlinearconv threaded job partition

* revert first core assignment

Co-authored-by: Saurabh <saurabh.tangri@intel.com>
Co-authored-by: Chen Fu <fuchen@microsoft.com>
2022-06-14 14:42:12 -07:00
Vincent Wang
5ecfaef042
ATen Fallback for Inference (#11597)
* aten op for inference

* fix build error

* more some code to training only

* remove domain from operator name

* move aten_op_executor ext out from ortmodule

* add pipeline

* add exec mode

* fix script

* fix ut script

* fix test pipeline

* failure test

* rollback

* bugfix

* resolve comments

* enable aten for python build only

* fix win build

* use target_compile_definitions

* support io binding

* turn off aten by default

* fix ut

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Co-authored-by: zhijxu <zhijxu@microsoft.com>
2022-06-09 16:07:30 +08:00
Scott McKay
927bac0f86
Rework allocator sharing to work for multiple devices. (#11700)
* Rework allocator sharing to work for multiple devices.
* Update SessionState to not use allocator name in matching for consistency with IExecutionProvider. The name doesn't have any clear meaning (e.g. we use the same name for the per-thread allocator in the CUDA EP as the shared allocate there and in the TRT EP).
  * NOTE: this means we will have one allocator per OrtMemType+OrtDevice. 
* Reverse order when doing allocator setup in SessionState. This will result in the CPU and CUDA EPs allocators being preferred (they are the most configurable), and also means the per-thread CUDA allocator for default GPU memory will be used even when TRT is enabled. 
  * NOTE: Combined with the change to remove the allocator name from the key this will mean that if CUDA and TRT or ROCM and MIGraphX are both enabled the CUDA/ROCM per-thread allocator will be used to allocate GPU memory.  
* Use InsertAllocator instead of TryInsertAllocator. Each EP should be registered once, and we should only enter RegisterAllocator once, so the 'try' should not be required and would indicate an unexpected setup was involved. i.e. better to fail and figure out if we need to support that setup.
* Add some clarifying comments around how replace allocator works.
* Add unit testing for setup where EP has local allocator that may get out of sync with values in the IExecutionProvider base class.
* Fix invalid check of whether data is on CPU to use device info instead of allocator name.
2022-06-09 17:38:38 +10:00
Changming Sun
3c1dd9514d
Revert "fixed point based requantization on arm64 (#11540)" (#11732)
This reverts commit 1f2c926. Because it makes our packaging pipeline crash

Error message:

[ RUN ] QLinearConvTest.Conv3D_S8S8_Depthwise
Test #1: onnxruntime_test_all ...................Subprocess killed***Exception: 838.24 sec

We haven't successfully reproduced the bug on a real ARM64 hardware. Currently we only saw it showed up with qemu. More investigations are on-going.
2022-06-03 19:12:25 -07:00
Hector Li
95a16c1ffe
Snpe ep (#11665)
* Initiate Ort SNPE EP
* fix snpe ep windows build which is caused by the utility method (ToUTF8String) name change on master
* correct the source path for libonnxruntime.so while building for andorid package
* add AdditionalDependencies for amr64
* On MS-Windows, the patchfile must be a text file, i.e. CR-LF must be used as line endings. A file with LF may give the error: "Assertion failed, hunk, file patch.c, line 343," unless the option '--binary' is given.
* fix build failure if snpe is not enabled
* update doc for contrib op
* separate out snpe ep settings to onnxruntime_snpe_provider.cmake
* renaming according review comments
* update according review comments
2022-06-03 14:10:02 -07:00
Scott McKay
4445dd6bc1
XNNPACK EP (#11445)
* Implement XNNPACK support via an EP.
  * Layout transform uses the GraphPartitioner infrastructure.
  * Node fusion is supported.
  * Conv and MaxPool implementations were ported from Changming's PR.
  * Added optional mutex in InferenceSession::Run as we only want to allow sequential calls if xnnpack is enabled
2022-06-03 20:22:34 +10:00
Yufeng Li
1f2c92673b
fixed point based requantization on arm64 (#11540)
* fixed point based requantization on arm64

* reverse MlasConvSymDepthwiseKernel u8s8 and s8s8 order
2022-06-02 12:34:17 -07:00
Edward Chen
738d9b153c
Consolidate several types into onnxruntime::ArgType. (#11430) 2022-05-09 14:44:28 -07:00
Tang, Cheng
3f3c5fcd68
Unify the Compile API for mobile build and normal build (#10632)
* use the lightweight compile api as default; use dnnl ep for testing

* apply to tensorrt ep

* fix the missing files

* fix build

* fix the copy issue on linux

* migrate migraphx and openvino ep

* fix openvino build break

* fix linux build

* fix unused parameter

* fix coreml build

* use graph view's filtered initializers

* fix openvino break

* fix tvm compile api

* fix tvm / rknpu / vitisai ep build

* add IsInitializedTensor in graph_viewer; fix nuphar build

* use serializer directly as tvm ep is still static lib

* fix the type mismatch

* fix the type mismatch

* fix merge conflict

* add a comment

* fix minimal build

* fix the DML EP's legacy approach

* save type/shape in dnnl IR

* fix linux break

* fix tvm failure

* dnnl ep: move initializer referenced out of dnnl subgraph

* Revert "add IsInitializedTensor in graph_viewer; fix nuphar build"

This reverts commit 1cc3c7f08c16fee4fe3309a67209eb769d479587.

* add IsInitializedTensor to graph viewer

* add the legacy code for nuphar build to temporarily make nuphar build work

* ignore internal test for nuphar

* remove the out of date tests

* keep the legacy API in EP for a while

* turn serializer into a static function

* update comments

* fix tvm build

* Update include/onnxruntime/core/framework/execution_provider.h

Co-authored-by: Pranav Sharma <prs@microsoft.com>

* Update include/onnxruntime/core/framework/execution_provider.h

Co-authored-by: Pranav Sharma <prs@microsoft.com>

* Update onnxruntime/core/framework/execution_provider.cc

Co-authored-by: Pranav Sharma <prs@microsoft.com>

* updatee comments; add warning message for legacy compil call

* add a flag to control out of scope arg in serialization

* fix trt  build; improve the test

* resolve merege errors

* fix a typo

Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
2022-05-05 08:30:07 -07:00
Changming Sun
963e1ace4e
Fix SAL annotations for custom op (#11432)
Fix SAL annotations for custom op. For example, "_In_" only applies to pointers, not integers.
2022-05-04 10:47:28 -07:00
RandySheriffH
8d69b9398b
APIs for custom op to invoke ort operator directly (#10713)
* draft kernel creation

* setup eager context

* call into kernel in eager mode

* redefine test case

* refact eager context

* add comment

* remove header

* rename argument

* redefine API definition with types

* list outputs as argument

* switch to int to represent length

* fix compile err

* create attribute API

* add test case for topk

* remove bool from c api

* add gru test case

* remove var

* fix compile warnings

* rename status

* fix compile err

* exclude sparse tensor

* fix comments

* fix comments

* fix build err

* rename file and move location

* format code

* move file to session folder

* fix comments

Co-authored-by: Randy <Randy@randysmac.attlocal.net>
2022-05-03 14:16:30 -07:00
Changming Sun
5023f6750b
Revert "Call pluggable EP's shutdown function in Environment::~Environment() (#11120)" (#11393)
This reverts commit 4983d6e5d6. We can't destroy OrtEnv through python's atexit function, because at that time there might be many other ORT python objects alive.
2022-05-02 14:38:31 -07:00
Tang, Cheng
4b875e3543
Re-implment the function support in onnxruntime (#11167)
* initial fix

* refactor the function handle

* update the implementation

* fix linux build break

* fix training build

* fix minmal build

* fix gradient checker

* deprecate the local function members in graph. host it in model

* fix changming's comments

* fix comments about inlined containers

* fix a missed inlined container

* fix training build

* avoid const for std string_view

Co-authored-by: Cheng Tang <chenta@microsoft.com>
2022-04-29 10:15:58 -07:00
Vincent Wang
1c64351e09
Create Tensor with Strides (#11294)
* create tensor with strides

* resolve comments

* refactor

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2022-04-28 16:49:37 +08:00
Dmitri Smirnov
a7d0158c24
Introduce a way to disable Abseil library (#11353)
Introduce a way to disable Abseil library.
Use cmake extra args, no new build switch.
2022-04-27 08:57:52 -07:00
Edward Chen
4d0214f851
Move Contains() helper function to a higher common.h. (#11289) 2022-04-21 09:31:48 -07:00
Gary Miguel
7aa4af238a
Add strict_shape_type_inference config option (#11081)
Prior to this, certain shape and type errors were surfaced only when
the model was using the latest known op set version.

Providing users an explicit option allows for better testing of code
that produces models, which includes unit tests within this repo and
other repos such as the TF-ONNX and PT-ONNX converters.

Remove the previous behavior which seems quite counter-intuitive:
an otherwise identical model with a later op set version should be treated
identically in this regard.

The option defaults to false to avoid causing errors for users that
rely on the previous permissive behavior.

Turned on the strict enforcement by default in OpTester, which revealed a few
disagreements between ORT and ONNX on what the correct output shape should
be.

Fix shape inference bug in ReduceSumTraining with noop_with_empty_axes=1
which was revealed.

Fix TensorOpTest.Unsqueeze_scalar, which was testing negative axes on an
op set version where the op did not actually support negative axes.

Fixes #9506.
2022-04-21 08:32:40 -07:00
Edward Chen
4854a09340
Consolidate utils::ToTensorProtoElementType, TypeToDataType, and data_types_internal::ToTensorDataType. (#9824) 2022-04-20 12:45:53 -07:00
Ahmad Zakaria
63ff391b16
add AppendExecutionProvider_CUDA_V2 to the C++ api (#11153) 2022-04-14 17:33:27 -07:00
Vincent Wang
9707181257
fix build error (#11199) 2022-04-13 13:09:19 +08:00
Dmitri Smirnov
12c687f594
Rework initializer.cc to eliminate code duplication (#11131)
Rework initializer.cc to eliminate code duplication and add type enforcement.
 Address review comments.  Add literal operators for MLFloat16 abd BFloat16 and tests.
2022-04-08 09:42:31 -07:00
Justin Stoecker
7609694464
Enable building with a GDK (#11126) 2022-04-07 15:06:31 -07:00
Changming Sun
4983d6e5d6
Call pluggable EP's shutdown function in Environment::~Environment() (#11120)
I disabled some tests temporarily. I will move them to a separated executable file in another PR.

In the future, I want to combine onnxruntime::Environment and OrtEnv classes. Now we have 3 env classes, it is too confusing:

1. onnxruntime::Env
2. onnxruntime::Environment
3. OrtEnv
Our python binding uses onnxruntime::Environment, while all other language bindings use OrtEnv. So python doesn't unload EPs but the others do. It's better to make them consistent.

Please note even I added the call, currently the unload function still is a no-op on Linux. So, currently on Windows we must unload the EPs while on Linux we must not do it.
2022-04-07 14:11:29 -07:00
Dmitri Smirnov
2700261f7c
Provide an API to supply external initializers data from user buffers (#11109)
Imlpement AddExternalInitializers
2022-04-07 12:21:53 -07:00
Maajid khan
81fa28bc56
OpenVINO-EP v4.0 Release PR with OpenVINO 2022.1 (#11025)
* Enabling ov-ep for 2022.1 Release

->Added ov-ep 2022.1 flow
->Validated CPU Unit tests with OV
Master using onnxruntime_test_all unit
tests.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fix for output mismatch b/w OpenVINO and ONNX

Refer:
https://jira.devtools.intel.com/browse/CVS-60310

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enabling Adobe ops

->Enable Resize op for iGPU
->Enable Add op for iGPU

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Removing irrelevant conditions

->Removing some conditions from
GetCapability() which are now not
required. (Removed conditions for
OV version support less than 2021.2)

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enable upsample op

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enable Adobe proxy-e model

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Removing any extra conditions for Opset13 ops

* Opset13 changes

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Exception handling for devices

* Added comments

* Implement GPU Throttling feature

*Added GPU Throttling feature for iGPU's.
when user enables it as a runtime option,
it helps in reducing overall CPU usage
of the application

*Added changes to exercise this option
using onnxruntime_perf_test application.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Renaming the runtime config option

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Added the user to video and users group

* Handling_GPU.0_GPU.1

* Handling special conditions

->Handling corner cases for
device_type checks

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Modification to include new api 2.0 changes in the code

* Added opset13 changes

->Enabled Few ops
->Added Debug info for case 3b in getcapability()

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enabling ov-ep for 2022.1 Release

->Added ov-ep 2022.1 flow
->Validated CPU Unit tests with OV
Master using onnxruntime_test_all unit
tests.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fix for output mismatch b/w OpenVINO and ONNX

Refer:
https://jira.devtools.intel.com/browse/CVS-60310

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enabling Adobe ops

->Enable Resize op for iGPU
->Enable Add op for iGPU

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Removing irrelevant conditions

->Removing some conditions from
GetCapability() which are now not
required. (Removed conditions for
OV version support less than 2021.2)

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enable upsample op

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enable Adobe proxy-e model

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Removing any extra conditions for Opset13 ops

* Opset13 changes

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Exception handling for devices

* Added comments

* Implement GPU Throttling feature

*Added GPU Throttling feature for iGPU's.
when user enables it as a runtime option,
it helps in reducing overall CPU usage
of the application

*Added changes to exercise this option
using onnxruntime_perf_test application.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Renaming the runtime config option

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Added the user to video and users group

* Handling_GPU.0_GPU.1

* Handling special conditions

->Handling corner cases for
device_type checks

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Added opset13 changes

->Enabled Few ops
->Added Debug info for case 3b in getcapability()

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Log comments updated

* Changes to enable 2.0 api

* Enabling ov-ep for 2022.1 Release

->Added ov-ep 2022.1 flow
->Validated CPU Unit tests with OV
Master using onnxruntime_test_all unit
tests.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fix for output mismatch b/w OpenVINO and ONNX

Refer:
https://jira.devtools.intel.com/browse/CVS-60310

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enabling Adobe ops

->Enable Resize op for iGPU
->Enable Add op for iGPU

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Removing irrelevant conditions

->Removing some conditions from
GetCapability() which are now not
required. (Removed conditions for
OV version support less than 2021.2)

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enable upsample op

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Enable Adobe proxy-e model

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Removing any extra conditions for Opset13 ops

* Opset13 changes

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Exception handling for devices

* Added comments

* Implement GPU Throttling feature

*Added GPU Throttling feature for iGPU's.
when user enables it as a runtime option,
it helps in reducing overall CPU usage
of the application

*Added changes to exercise this option
using onnxruntime_perf_test application.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Renaming the runtime config option

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Added the user to video and users group

* Handling_GPU.0_GPU.1

* Handling special conditions

->Handling corner cases for
device_type checks

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Added opset13 changes

->Enabled Few ops
->Added Debug info for case 3b in getcapability()

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fix build issue

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixes issues

*Fixes compiler warnings c4458 on windows.
*Fixes the bug in device_type check logic
*Adds print info for enable_opencl_throttling
option in onnxruntime_perf_test

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* commit to make openvino_2021.4 compatible

* Fixed IO Buffer Optimization

* Fix output names issue

* Fix 2021.3 branch

* Bug Fix for Multiple inputs/outputs

- Assigns the right output_name and
input_name for the graph when
returned by CompiledModel::inputs()
OV function.

- Also takex care of output mismatch
issue b/w openvino output and onnx
output

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Add comments for the changes made

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* IO Buffer Changes

* Commit for Disabling GPU Throttling for 2021.4

* Updated branch

* Fix windows build

->Fixed windows build in debug mode
->Disabled scatternd3_tensor_int64

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixed CPP Unit tests for CPU

-Fixed shrink, MVN, ReduceL2, Maxpool,
upsample, scatter, slice, reshape,
unsqueeze.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixed first set of GPU Tests

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixed additional failing tests on GPU

->Added conditions to disable certain ops
under certain conditions

->Disabled certain tests

->Added some op supports for no_dimension
supported

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Added Expand op support for CPU

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Added condition for squeeze op

->Shape can't have empty axes attribute

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Add support for LessOrEqual op function

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* OV Interface wait for replaced by indefinite wait call

* use names from ONNX model to access OV tensors

This chnage is to use the input/output names
retrieved from original onnx model to access
OV tensors and to check if there's any input
or output names mismatch b/w ONNX naming
and OV naming.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixes Myriad unit tests and other issues

->Fixes Myriad CPP unit tests
->Fixes output mismatch issue with models with
sub graph partitioning

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fix segfault issue

->Fixed case 3b condition in get_capability()
which was causing the segfault issue

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixed build isuse with ov 2021.4 with I/O buffer

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Disables performance counters for I/O Buffer

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixed inputs/outputs mismatch for HDDL with 2022.1

Signed-off-by: Mohammad Amir Aqeel <mohammadx.amir.aqeel@intel.com>

* Fix to enable GPU FP16

* Enabled mlperf_ssd_mobilenet_300 model fully on CPU

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Added ov version specific dll packaging for nuget

* Fixed conditions for few ops

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Dockerfile updates

* Updated License Info

-Updated the copyrights License Info
-modified FP16 transformations with OV 2022.1

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Disabling mlperf_ssd_mobilenet_300 model

->Disabled this model for openvino. The
test is failing in Internal_CI pipelines.

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Disabling failing python CPU Tests

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

* Fixed flake8 python errors

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

Co-authored-by: hdgx <harinix.d.g@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com>
Co-authored-by: Mohammad Amir Aqeel <mohammadx.amir.aqeel@intel.com>
2022-04-06 13:30:33 -07:00
Vincent Wang
3b6cee8059
[CUDA] Optimize Conv and ConvGrad for Training (#10999)
* Optimize Conv and ConvGrad for Training

* add provider option to control

* fix typo
2022-03-29 07:31:36 +08:00
Chi Lo
8ba52b0a05
Bump master version to 1.12 (#10797)
* bump master version to 1.11

* bump master version to 1.12
2022-03-28 12:30:11 -07:00
Scott McKay
47c09e6701
Clarify usage of kOnnxDomainAlias. (#10962)
* Clarify usage of kOnnxDomainAlias.
2022-03-25 09:52:59 +10:00
Leandro Gracia Gil
1cc2cfb7b8
Move #ifndef ORT_CXX_API_THROW to the no exceptions case. (#10937)
This is related to https://github.com/microsoft/onnxruntime/issues/10564
which introduced a fix in the wrong case where exceptions are enabled.
2022-03-21 11:12:56 -07:00
Valery Chernov
625a1f7673
[TVM EP] code refactor (#10655)
* rename info to options for TVM EP

* transfer options processing from TVMExecutionProvider to TVMEPOptions

* transfer TVMRunner to separated files

* implement TVMCompiler class

* replace CompileFunc by TVMCompiler object. update TVMRunner. now it does not depend on TvmExecutionProvider

* correct logging of TVM EP options

* RunnerImpl, GERunnerImpl and VMRunnerImpl were implemented

* add prepareComputeInfo method

* remove update_output_shapes flag

* embed all TVM EP dependences to tvm namespace. transfer model compilation from TVMRunner. connect TVMRunnerImpl to TVMRunner

* refactor compileModel method

* small cleaning

* separate TVM EP options data store and processing

* replace TvmTensorShape by InlinedVector with max_size 5

* correct indentation

* update TVM hash

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
2022-03-16 13:55:04 +01:00
Edward Chen
f468ea40e5
Refactor Node::AddAttribute() (#10869) 2022-03-16 14:53:00 +10:00
Edward Chen
e53422c6d0
Update convert_onnx_models_to_ort.py to support runtime optimizations. (#10765)
Add runtime optimization support to ONNX -> ORT format conversion script.
Replace `--optimization_level`, `--use_nnapi`, and `--use_coreml` with a new `--optimization_style` option.
2022-03-14 16:50:41 -07:00
Hariharan Seshadri
e80ff63274
Fix bug in MemcpyToHost (#10816) 2022-03-10 07:02:27 -08:00
Edward Chen
c147c9dda6
Remove ORT_ENABLE_RUNTIME_OPTIMIZATION_IN_MINIMAL_BUILD. (#10778)
Remove ORT_ENABLE_RUNTIME_OPTIMIZATION_IN_MINIMAL_BUILD as it is now implied by ORT_EXTENDED_MINIMAL_BUILD.
Remove related CMake option.
2022-03-08 16:18:49 -08:00
Vincent Wang
4a38f9e31d
enable strided tensor for training only (#10748) 2022-03-08 08:31:28 +08:00
Fei Hu
60acfd3dd8
Support CUDA Graph in the CUDA EP (#9978) 2022-03-06 20:47:31 -08:00
Scott McKay
e337f5faf3
Enable QDQ cleanup and NHWC optimizers in an extended minimal build. (#10729)
* Enable QDQ cleanup and NHWC optimizers in an extended minimal build.
2022-03-04 15:45:42 +10:00
Rachel Guo
a9dc50ba8b
Add option to force QDQIsInt8Allowed to return true when exporting to ORT format (#10719)
* wip

* save

* minor update

* fix

* fix

* Revert "fix"

This reverts commit a76f364b2d.

* revert

* revert

* revert submodule removal

* address pr comments

* minor fix

* address cr comments

* fix format

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2022-03-02 23:26:14 -08:00
Yulong Wang
f4b2d3af2b
Upgrade emsdk to 3.1.3 (#10577) 2022-02-28 23:52:41 -08:00
Vincent Wang
9a22b5d253
Strided Tensor Support for Eager Mode (#10578)
* strided tensor for eager mode

* fix build and resolve comments

* fix win x86 build
2022-03-01 14:25:31 +08:00
Dmitri Smirnov
e23a224518
Fix CUDA 10.2 compile error due to inlined_containers.h inclusion (#10702)
Fix CUDA 10.2 compile error due to inlined_containers.h inclusion
 into a common CUDA header.
 Use NumberOfNodes() to reserve space in a hash table
 Prefer separate call to reserve() rather than passing in the
 hash table constructor. They have somewhat different meaning.
2022-02-28 19:56:44 -08:00
cloudhan
3243c9579f
Fix VLOG?_DEFAULT macros usability. (#10568)
* Add `set_default_logger_verbosity` api.

* fix docs

* make flake8 happy
2022-03-01 13:16:26 +10:00
Scott McKay
1f6d8248da
Add optional optimizer to remove leftover Q->DQ pairs after all other QDQ processing has completed (#10659)
Add an optimizer that can remove leftover Q->DQ pairs. Depending on the model this may help with performance and/or improve accuracy. Optional as it could make things worse so user needs to be aware of this and test what works best for their scenario. Enable with SessionOptions config param `session.enable_quant_qdq_cleanup`
2022-03-01 08:05:02 +10:00
Thiago Crepaldi
e788cc2a23
Convert com.microsoft::ATen into org.pytorch.aten::ATen onnx op (#10060)
Signed-off-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2022-02-28 14:14:45 -05:00
Ryan Hill
eb116595d4
Add ability to customize ORT_CXX_API_THROW (#10688) 2022-02-28 00:15:10 -08:00
Dmitri Smirnov
b30e0e2283
Remove inline_containers include from tensor_shape (#10682)
Hide Inlined Hash set and maps guts behind template forward declarations.
Currently CUDA 10.2 compiler can not compile abseil but provider interfaces
use those types in their signatures. InlinedVector seems to be fine.
Introduce core/common/inlined_containers_fwd.h header
2022-02-26 20:07:18 -08:00
Dmitri Smirnov
2679711bee
Refactor transformers and other code to reduce memory allocation calls (#10523)
Work on minimizing memory management calls by
  reducing number of allocations and copies.
  Replace std::unordered_set to InlinedHashSet
  and add usage of InlinedVector.
  Employ std::move() to minimize copying and memory allocations.
  Remove copying of the const shared data into each of the
  PropagateCast transformer instances.
  Move inlined_containers.h header to include/common
  Adjust AsSpan imlementation for C++ < 17
2022-02-24 16:17:14 -08:00
RandySheriffH
e056fbaa51
Add restrictions for hybrid cpus for thread pool task distribution (#10393)
* add restrictions for hybrid cpus

* add unit test to mock hybrid cpu

* attach hybrid flag

* add mocking interface to CpuInfo

* make is_hybrid

* make mock function const

* add force_hybrid for thread pool

* remove header
2022-02-17 14:34:09 -08:00
Ashwini Khade
f436d3437e
Add layout transformer for NNAPI (#10371)
* Add layout transformer for NNAPI

* plus merge fixes

* plus some more merge fixes

* test fixes

* comments + cleanup

* plus updates

* post merge changes

* enable layout transformer in extended minimal build

* plus more comments

* more tests + fix CI

* plus updates per review

* more updates per review

* fix file name

* fix qdq tests

* plus more updates

* plus updates

* typo fix

* fix qdq selection in 2nd optimization pass

* fix typo

* fix a test

* update dependency structure for layout transformer

* plus updates

* more updates

* plus change

* more updates to fix linker error in minimal build

* remove unnecessary headers
2022-02-15 20:25:29 -08:00
Vincent Wang
ceb1e2b1a6
[ROCm] Bugfix of BFloat16-float conversion and Add FastGelu Kernel for AMD (#10557)
* bf16 bugfix on amd

* enable fastgelu ut on amd
2022-02-16 11:11:08 +08:00
Valery Chernov
1cdc23aba4
[TVM EP] Rename Standalone TVM (STVM) Execution Provider to TVM EP (#10260)
* update java API for STVM EP. Issue is from PR#10019

* use_stvm -> use_tvm

* rename stvm worktree

* STVMAllocator -> TVMAllocator

* StvmExecutionProviderInfo -> TvmExecutionProviderInfo

* stvm -> tvm for cpu_targets. resolve onnxruntime::tvm and origin tvm namespaces conflict

* STVMRunner -> TVMRunner

* StvmExecutionProvider -> TvmExecutionProvider

* tvm::env_vars

* StvmProviderFactory -> TvmProviderFactory

* rename factory funcs

* StvmCPUDataTransfer -> TvmCPUDataTransfer

* small clean

* STVMFuncState -> TVMFuncState

* USE_TVM -> NUPHAR_USE_TVM

* USE_STVM -> USE_TVM

* python API: providers.stvm -> providers.tvm. clean TVM_EP.md

* clean build scripts #1

* clean build scripts, java frontend and others #2

* once more clean #3

* fix build of nuphar tvm test

* final transfer stvm namespace to onnxruntime::tvm

* rename stvm->tvm

* NUPHAR_USE_TVM -> USE_NUPHAR_TVM

* small fixes for correct CI tests

* clean after rebase. Last renaming stvm to tvm, separate TVM and Nuphar in cmake and build files

* update CUDA support for TVM EP

* roll back CudaNN home check

* ERROR for not positive input shape dimension instead of WARNING

* update documentation for CUDA

* small corrections after review

* update GPU description

* update GPU description

* misprints were fixed

* cleaned up error msgs

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
Co-authored-by: Thierry Moreau <tmoreau@octoml.ai>
2022-02-15 10:21:02 +01:00
Chi Lo
0f5d0a091a
Make user capable of adding new field in OrtTensorRTProviderOptionsV2 as new provider option (#10450)
* modify code for add additional field in OrtTensorRTProviderOptionsV2

* add include file

* fix typo

* fix bug

* add comment

* fix code

* revert change
2022-02-05 11:15:12 -08:00