Commit graph

824 commits

Author SHA1 Message Date
Chi Lo
4e3cff60fd
CUDA graph support for TRT EP (#16081)
CUDA EP already supports [CUDA
graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
also we observed some models can benefit from using CUDA graph with
`trtexec`. Therefore, this PR enables the CUDA graph support for TRT EP.

The implementation is based on
https://github.com/microsoft/onnxruntime/pull/9978 with the same
[constraints](https://github.com/microsoft/onnxruntime/pull/9978) as
below:

- Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
- Usage of CUDA Graphs is limited to models where-in all the model ops
(graph nodes) can be partitioned to the TRT EP.
- The input/output types of models need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- IObinding is required.
2023-06-21 09:36:45 -07:00
Yuhong Guo
48e6186b1a
Move tests from core/providers/cuda/test/* to test/providers/cuda/ and refactor CUDA UT (#16161)
### Description
<!-- Describe your changes. -->

1. Add a new test lib `onnxruntime_providers_cuda_ut` which is similar
to `onnxruntime_providers_cuda` but `onnxruntime_providers_cuda_ut` is
only built if `onnxruntime_BUILD_UNIT_TESTS` is set. We can call all
CUDA UTs through this ut lib without affecting production lib
`onnxruntime_providers_cuda`.
2. Move all test cases from `core/providers/cuda/test/` to
`test/providers/cuda/`. These test cases are built into lib
`onnxruntime_providers_cuda_ut` and run by `./onnxruntime_test_all
--gtest_filter="*CUDA_EP_Unittest*"`. Since the lib is only for test, we
can use gtest macros in the test cases. Previous implementation do not
support using gtest lib in the CUDA UT cases.
3. The cmake code in `cmake/onnxruntime_providers.cmake` is refactored a
bit. A new function `onnxruntime_add_object_library` is to build a
object target. The 2 libs `onnxruntime_providers_cuda_ut` &
`onnxruntime_providers_cuda` share most of the code, so the object files
can be used in both libs, which helps reduce build time. Another
function `config_cuda_provider_shared_module` is used to configure all 3
similar
targets(onnxruntime_providers_cuda_obj/onnxruntime_providers_cuda/onnxruntime_providers_cuda_ut).
4. Refactored the test to call `testing::InitGoogleTest` &
`RUN_ALL_TESTS` in `libonnxruntime_providers_cuda_ut.so`'s `TestAll`.
After this change, we can see all the cases running in
`CUDA_EP_Unittest.All`:

![image](https://github.com/microsoft/onnxruntime/assets/19584326/8ff80df6-060b-4ef0-90b7-657e68d3db87)




### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

After https://github.com/microsoft/onnxruntime/pull/13016, there are
still test files in test/providers/cuda/ that are not moved to
core/providers/cuda/test/ and the test cases are disabled. This PR helps
to clean the unfinished TODOs.

Even through onnxruntime_shared_lib_test covers some test for CUDA
provider. onnxruntime_shared_lib_test works like a coarse grain
end-to-end test for CUDA provider. If CUDA unittest can run cases for a
single component, this wound be helpful for CUDA developers.

---------

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-06-20 14:54:55 -07:00
cao lei
dd72192cf4
ExecutionProvider API refactor - move allocator from EP level to SessionState level and indexed by OrtDevice (#15833)
### Description
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice. By this change, EP level will shift the burden of
maintaining allocators, which will be user friendly for EP developers

---------

Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-06-19 17:44:45 -07:00
Changming Sun
5754cd7d1d
Add fp16 support to CPU EP gemm op (#15506) 2023-06-15 14:38:17 -07:00
Changming Sun
b72fe664c1
Refactor prepack buffer code (#16280)
### Description
1. Use IAllocatorUniquePtr to replace BufferUniquePtr. It will ensure
the deleter is always right.
2. Change some std::unique_ptr to std::optional
3. Bypass Arena allocator when allocating the prepack buffers for mlas.
In this special case, Arena doesn't help any. And this change is just an
internal implementation change, it doesn't affect our public interface.
2023-06-08 14:42:02 -07:00
Dmitri Smirnov
908e940660
[CPP Api] Remove deprecated CustomOp API (#16256)
### Description
Custom Op API has been deprecated in 1.15 release. We are removing it.
2023-06-07 14:03:13 -07:00
PeixuanZuo
1b518c6836
[ROCm] add early stop to tunable profile progress (#15716)
For TunableOp, some instance may has very bad performance and it will
take a long time during profile process.
Add `tunable_op_max_tuning_duration_ms` parameter to limit max tuning
time.
2023-06-01 10:18:25 +08:00
Xavier Dupré
e726151b5c
Introduce float 8 types (#14731)
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4MEFNUZ, FloatE5M2FNUZ
as described in PR https://github.com/onnx/onnx/pull/4805. It uses CUDA
API to cast float/half to float8 if CUDA>=11.8, a custom implementation
if CUDA<11.8.

* It implements, Cast, QuantizeLinear, DequantizeLinear for all types on
CPU, only for types FloatE4M3FN, FloatE5M2 on CUDA.
* It extends the supported types for control flow operator, Shape,
Reshape, Identity, If, Loop, Scan, Reshape
* It implements Equal(19).
* Cast, QuantizeLinear, DequantizeLinear operators now support a
parameter `saturate` only valid for float 8 types. It is true by
default. In that case, any value out of range is converted into the
maximum float 8 value. If false, it is infinite.
* QuantizeLinear, DequantizeLinear now supports multiple scales on CUDA
(and ROCm by extension), scale = 1D tensor with one scale per channel

### Motivation and Context
Supports latest onnx version.

Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)

---------

Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-05-30 13:25:58 -07:00
Dmitri Smirnov
9939092e71
[CPP API]Fix constness in C++API (#16103)
### Description
`CreateMap` and `CreateSequence` should be able to take in const data.
2023-05-26 14:09:00 -07:00
Changming Sun
a5410515ad
Fix: Some fields in OrtCUDAProviderOptionsV2 struct are not initialized (#16113)
### Description
The file include/onnxruntime/core/providers/cuda/cuda_provider_options.h
is a C++ file. It is not for C.

Before this commit, this header file is already not compatible with C compilers. Because it has:
```
onnxruntime::ArenaExtendStrategy arena_extend_strategy;
```

And this file is intended to be internal only. It is an internal header file. It should not be included in onnxruntime_c_api.h and should not be used with the public C APIs. User can only get the instance of OrtCUDAProviderOptionsV2 via CreateCUDAProviderOptions. In such a way we can add new members to this struct without breaking binary compatibility.
Since it is an internal header, we can safely use C++ grammar there.
2023-05-26 11:34:22 -07:00
Yuhong Guo
04a8f50674
New configuration to limit the arena extension (#15983)
Add a configuration `max_power_of_two_extend_bytes ` to limit the arena extension size.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
In our real scenario, we observe that if the model is big enough the
BfcArena will extend uncontrollable.
As showed by the following figures, if a model uses more than 16GB
memory, the BfcArena will totally apply for 32GB memory according to the
`kNextPowerOfTwo` strategy. With the new strategy, the extension is
limited. The default maximum extension size is 1GB.

#### Without the new configuration
After loading the model, ORT uses 32G GPU memory.

![image](https://github.com/microsoft/onnxruntime/assets/19584326/42b93c66-b957-4f20-a13b-d34cb390afff)

#### With the new configuration
After loading the model, ORT uses 23G GPU memory.

![image](https://github.com/microsoft/onnxruntime/assets/19584326/5abffeff-9ca3-4187-a262-37fd2764fe1b)

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-05-25 02:19:07 -07:00
Adrian Lizarraga
efc84a43e8
[QNN EP] Add session option to disable fallback to default CPU EP (#16016)
### Description
Adds the session config option `disable_cpu_ep_fallback` to allow the
user to prevent the CPU EP from handling
nodes not supported by other execution providers.

```C++
// Graph nodes that are not supported by the execution providers (EPs) explicitly added to the session are
// assigned (i.e., "fallback") to the CPU EP by default.
//
// This option allows the user to disable the fallback of unsupported graph nodes to the CPU EP.
// If this option is set to "1", session creation will fail if the execution providers other than the CPU EP cannot
// fully support all of the nodes in the graph.
//
// It is invalid to set this option and explicitly add the CPU EP to the session. In this case, session creation
// will also fail with an error.
//
// Option values:
// - "0": CPU EP fallback is not disabled. [DEFAULT]
// - "1": CPU EP fallback is disabled.
static const char* const kOrtSessionOptionsDisableCPUEPFallback = "session.disable_cpu_ep_fallback";
```

#### Example use
```C++
#include "core/session/onnxruntime_cxx_api.h"
#include "core/session/onnxruntime_session_options_config_keys.h"

int main(int argc, char** argv) {
    Ort::SessionOptions so;
    so.AddConfigEntry(kOrtSessionOptionsDisableCPUEPFallback, "1");  // Disable fallback to the CPU EP.

    onnxruntime::ProviderOptions options;
#if defined(_WIN32)
    options["backend_path"] = "QnnCpu.dll";
#else
    options["backend_path"] = "libQnnCpu.so";
#endif

    so.AppendExecutionProvider("QNN", options);

    const ORTCHAR_T* ort_model_path = ORT_MODEL_FOLDER "qnn_ep_partial_support.onnx";
    Ort::Session session(*ort_env, ort_model_path, so);  // Throws exception if nodes fallback to CPU
    // ...
```

### Motivation and Context
Makes it easier for application developers to ensure that the entire
model runs on specific EPs. This is critical for Qualcomm/scenarios. If
the compute cannot be offloaded to the NPU, running on CPU is not
acceptable. (could be the difference between 90 second inference and 6
seconds inference)

---------

Co-authored-by: Pranav Sharma <prs@microsoft.com>
2023-05-23 17:56:32 -07:00
Hector Li
4324d2173b
[QNN EP] Enable Qnn context cache to save model initialization time (#15815)
### Description
Enable Qnn Context cache feature to save model initialization time
Provider options:
qnn_context_cache_enable|1 to enable the cache feature
qnn_context_cache_path to set the cache path. It is set to model_file.onnx.bin by default.

### Motivation and Context
Model initialization time takes long because the cost of conversion from Onnx model to Qnn model. Qnn have feature to serialize the Qnn context to file, then next time user can load it from the cache context and execute the graph to save the cost.

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
2023-05-19 10:52:17 -07:00
RandySheriffH
4dfb89b3ad
Implement mutex-free spin lock for task queue (#14834)
Implemented "lock-free" spinlock to save CPU usage on context switching.
The change has been tested on queene service of Ads team, the lock-free
version of ort (40 threads) saves CPU usage on gen8 (128 logical
processors on 8 numa nodes) windows by nearly half, from 65% to 35%.

For 32 cores, the curve is flat:

Anubis, 32 vCPU, windows, hugging face models,
95 percentile E2E latency in ms:

model | mutex(ms) | mutex-free
--- | --- | ---
 alvert_base_v2 | 34.21 | 34.09
 bert_large_uncased | 116.27| 117.84
 bart_base | 72.06 | 71.99
 distilgpt2 | 25.43 | 25.02
 vit_base_patch16_224 | 37.33 | 37.76

Anubis, 32 vCPU win, Linux, 1st party models,
95 percentile E2E latency in ms:

model | mutex(ms) | mutex-free
--- | --- | ---
deepthink_v2 | 24.35 | 22.95
bing_feeds |  36.96 | 36.48
deep_writes |  14.46 | 14.32
keypoints |  9.34 | 7.69
model11 |  1.71 | 1.66
model12 |  1.82 | 1.44
model2 |  4.21 | 3.95
model6 |  1.08 | 1.05
agiencoder |  0.99 | 0.93
geminet_transformer |  5.32 | 5.24

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-19 10:12:10 -07:00
cloudhan
856afa49dd
[C#] Add missing rocm csharp api (#15540) 2023-05-18 08:15:19 +08:00
Baiju Meswani
6b7181d31d
Add C# API documentation for training (and some other changes) (#15935) 2023-05-16 03:15:24 -07:00
cloudhan
dc383ed4ce
Basic CSharp packaging support for ROCm EP (#15535)
This PR mainly fixes building errors when trying to build nupkg for ROCm EP.
It also slighly improve the packaging logic so that devlopers can
produce the nupkg on linux natively.
2023-05-16 07:27:38 +08:00
Dmitri Smirnov
896a963492
Adust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921)
### Description

This PR partially reverts changes introduced in
https://github.com/microsoft/onnxruntime/pull/15643

We make two API return std::string always in UTF-8.

We also move the entry points from OrtApiBase to OrtApi to make them
versioned.

### Motivation and Context

`GetVersionString` always returns x.y.z numbers that are not subject to
internationalization.
`GetBuildInfoString` can hold international chars, but UTF-8 should be
fine to contain those.
We prefix them with u8"" in case the compiler default charset is not
UTF-8.
Furthermore, creating platform dependent APIs is discouraged.
`ORTCHAR_T` is platform dependent and was created for paths only.
On non-unix platforms would still produce `std::string` that can only
contain UTF-8

The API was introduced after the latest release, and can still be
adjusted.
2023-05-13 13:45:07 -07:00
Maximilian Müller
143551092f
fix: setting builder optimization level to TRT 8.6 default (#15897)
The actual released default level is 3 and not the previously used 2.

Just a small sample of the effects:
![Screenshot 2023-05-10 at 15 49
55](https://github.com/microsoft/onnxruntime/assets/44298237/5a694446-22c0-4943-9ddf-80670781878f)
2023-05-12 13:29:30 -07:00
Hector Li
1bebc88069
[SNPE EP] Add option to enable SNPE init caching feature (#15917)
### Description
[SNPE EP] Add option to enable SNPE init caching feature

### Motivation and Context
To save model initialization time
2023-05-12 07:57:11 -07:00
Wanming Lin
00b1e79e04
Support WebNN EP (#15698)
**Description**: 

This PR intends to enable WebNN EP in ONNX Runtime Web. It translates
the ONNX nodes by [WebNN
API](https://webmachinelearning.github.io/webnn/), which is implemented
in C++ and uses Emscripten [Embind
API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#).
Temporarily using preferred layout **NHWC** for WebNN graph partitions
since the restriction in WebNN XNNPack backend implementation and the
ongoing
[discussion](https://github.com/webmachinelearning/webnn/issues/324) in
WebNN spec that whether WebNN should support both 'NHWC' and 'NCHW'
layouts. No WebNN native EP, only for Web.

**Motivation and Context**:
Allow ONNXRuntime Web developers to access WebNN API to benefit from
hardware acceleration.

**WebNN API Implementation Status in Chromium**:
- Tracked in Chromium issue:
[#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291)
- **CPU device**: based on XNNPack backend, and had been available on
Chrome Canary M112 behind "#enable-experimental-web-platform-features"
flag for Windows and Linux platforms. Further implementation for more
ops is ongoing.
- **GPU device**: based on DML, implementation is ongoing.

**Open**:
- GitHub CI: WebNN currently is only available on Chrome Canary/Dev with
XNNPack backend for Linux and Windows. This is an open to reviewers to
help identify which GitHub CI should involved the WebNN EP and guide me
to enable it. Thanks!
2023-05-08 21:25:10 -07:00
RandySheriffH
8e610f25d8
Implement lite custom op API (#15778)
Implement a set of new APIs for lightweight custom ops registration, to
save efforts from schema-composing.
A few highlights:

- Support build-time type inference;
- Support function-as-op for "stateless" ops;
- Support structure-as-op for "stateful" ops;
- Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-04 09:49:17 -07:00
Changming Sun
1fb2f2605b
Update VERSION_NUMBER (#15773)
### Description

1. Update VERSION_NUMBER for preparing the upcoming release. This PR's
commit will not be included in the 1.15 release branch
2. Delete package/rpm/onnxruntime.spec since it was not used in past
years.

### Motivation and Context
Preparing the release.

Fixed
[AB#15311](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15311)
2023-05-03 15:07:34 -07:00
Baiju Meswani
ba7b83ff3c
Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776) 2023-05-03 13:08:35 -07:00
Chen Fu
bc58fd5413
fix compilation error in no absl build (#15769)
### Description

Fix no-absl build error:
2023-05-02 08:20:49 -07:00
Changming Sun
034698cf6a
Revert "Implement lite custom op API (#15590)" (#15768)
This reverts commit cdf4fc49fc because it
breaks the "debug_node_input_output" build in "Post Merge" pipeline
2023-05-02 01:10:10 -07:00
Ye Wang
391f897983
Bring back SLN cuda kernel and use provider options to switch to standard implementation (#15660) 2023-05-01 18:35:26 -07:00
cao lei
d58fa9805b
ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice (#15618)
### Description
ExecutionProvider API refactor - replace OrtMemoryInfo with OrtDevice



### Motivation and Context
Currently “Location” is represented as ORTMemoryInfo, which is OrtDevice
+ OrtMemType, while OrtDevice is represent as DeviceType + DeviceId +
MemType. As we can see there is some unnecessary hierarchy, the proposal
is to make it a clear definition that to use OrtDevice as an abstraction
for Location

---------

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-05-01 10:06:00 -07:00
RandySheriffH
cdf4fc49fc
Implement lite custom op API (#15590)
Implement a set of new APIs for lightweight custom ops registration, to
save efforts on schema-composing.
A few highlights:

1. Support build-time type inference;
2. Support function-as-op for "stateless" ops;
3. Support structure-as-op for "stateful" ops;
4. Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-01 08:45:26 -07:00
Chen Fu
0e9472d391
NHWC graph optimizer (#15724)
### Description

Augment nhwc graph optimizer to accommodate fp16 operators.


### Motivation and Context

With new fp16 conv operator added. This operator prefers NHWC data
layout. We need to augment existing graph optimizers to better utilize
the new operator.
2023-05-01 08:44:07 -07:00
Chunye Wang@AMD
d35850c142
[VitisAI]Update VitisAI EP to be compatible with VitisAI 3.5 (#15673)
### Description

Originally VitisAI EP only works with old version of VitisAI release. 


### Motivation and Context

Update VitisAI EP so that it works together with the current VitisiAI
3.5 and further version of VitisAI. We try our best to make it forward
compatible.

---------

Co-authored-by: Wang Chunye <chunywan@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com>
Co-authored-by: shoucair <shoucai.ren@amd.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
Co-authored-by: BoarQing <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-05-01 08:28:26 -07:00
Jeff Bloomfield
3df3a85114
Default kOrtSessionOptionsDisableQuantQDQ to 1 when the DML EP is registered (#15725)
This addresses a performance regression in some INT8 models with the
DirectML EP by defaulting OrtSessionOptionsDisableQuantQDQ to 1 when the
EP is registered.

This regression occured due to the introduction of the QDQ propagation
transformer, which is based on this session option. That transformer
maximizes the number of nodes which are executed as quantized by
logically propagating quantize operators upstream and dequantize
operators downstream. However, it does this simply by inserting QDQ
pairs, with an expectation that something will recognize sequences of
DQ->Op->Q. This logic and related L2 transformers are not currently
enabled for the DirectML EP.

This change also removes a noisy warning when the session option for
memory pattern is overriden as the DirectML EP is registered.
2023-05-01 08:26:03 -07:00
Chi Lo
6e652d0554
Support explicit TRT profiles from provider options (#15546)
Previous behavior of TRT EP to set TRT optimization profiles for dynamic
shape input is based on input tensor values. Users can't explicitly
specify the profiles.

This PR makes users capable of specifying min/max/opt profiles through
newly added three provider options:

`trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`
with the format of "input1:dim1xdim2...,input2:dim3xdim4...".
(Note: It's similar to --minShapes, --maxShapes and --optShapes of
trtexec command-line
[flags](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags))

For example, if you are using onnxruntime_perf_test, you can try this:

`./onnxruntime_perf_test -e tensorrt -r 1 -i
"trt_profile_min_shapes|imgs:1x3x384x288
trt_profile_max_shapes|imgs:32x3x384x288
trt_profile_opt_shapes|imgs:16x3x384x288" your_model_path`

If the engine cache is enabled, you still need to provide these three
explicit provider options in order to use this feature. ORT TRT will
compare the min/max/opt profile shape with the ones saved in .profile
file to decide whether to rebuild the engine.

Constraints to use these provider options: (1) Need to specify
min/max/opt profile shapes for all the dynamic shape input

 

This feature is also requested by other users:
https://github.com/microsoft/onnxruntime/issues/13851
2023-04-30 22:30:26 -07:00
Changming Sun
65020d433e
Prefast fixes for CUDA EP (#15726)
### Description
1. Adjust cmake flags. Do not modify CMAKE_CXX_FLAGS globally. Only
apply the flags to ORT code.
2. Fix some SDL warnings.
2023-04-29 12:43:12 -07:00
Yuhong Guo
41dcf0d32e
Expose build information in dynamic lib (#15643)
### Description
<!-- Describe your changes. -->
1. Add Build Info API to onnx.
2. Fix compile error while building onnxruntime_benchmark in MacOs.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
1. When Onnxruntime lib is serving online, we need a way to detect how
this lib is built. This PR helps the developer to get the build
information using `strings` such as git branch, git commit id, build
type and cmake cxx flags, which is showed as follows.


![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png)


![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png)

If the build env has no git, there will be no git related infor:


![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png)


3. Fix the following compile error while building benchmark in MacOs.

![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png)

---------

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-04-28 21:57:31 -07:00
Chen Fu
be08b47e7b
Refine cast optimizer for safety (#15658)
### Description

Cast optimizer may convert a fp16 node to fp32. This used to be safe as
all fp16 kernels has fp32 implementation. As this assumption is no
longer true, we need to check the validity of the operation



### Motivation and Context

Main work here is to introduce an API to check whether a kernel is
registered. Currently we don't have a way to do that without an operator
node. This needs to be augmented. We need to query whether a kernel is
registered by its property only, so that we can judge whether it is safe
to construct a node long before we actually do so.
2023-04-28 09:32:54 -07:00
sfatimar
ebaafac3f5
Openvino ep ort 5.0 (#15626)
### Description
The PR adds VPU support to OpenVINO Execution Provider
Bug fixes for GPU, CPU. 
Changes to OpenVINO Backend in Serialized Model API for faster First
Inference Latency.
Deprecation to HDDL-VADM and MYRIAD, removed code
Support OpenVINO 2023.0 
Dynamic Shapes Support for iGPU

### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO
- If it fixes an open issue, please link to the issue here. -->

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-04-25 20:59:42 -07:00
Baiju Meswani
5885abfb35
Training Documentation (#15612) 2023-04-25 11:44:12 -07:00
Yulong Wang
14cc02c65c
[js/web] WebGPU backend via JSEP (#14579)
### Description
This change introduced the following new components into ONNX Runtime
Web:
- JavaScript Execution Provider (JSEP)
  - Asynchronized inferencing execution powered by Emscripten's Asyncify
- WebGPU backend implemented in TypeScript
  - initial implementation of kernels:
    - elementwise operators (22)
    - binary operators (5)
    - tensor: Shape, Reshape, Transpose, Gemm
    - nn: Conv, {Global}Maxpool, {Global}AveragePool


Code need to be polished. still working on it.

## Q&A
What is JSEP?
> JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime
execution provider that specifically works on Web environment
(browsers). JSEP allows JavaScript code to kick in from various places
when ONNX Runtime inferences a model.

Why JSEP?
> JSEP is a hybrid mode EP that contains both C/C++ and
TypeScript/JavaScript implementation. There are 2 strong reasons why we
introduces JSEP:
> 1. the C/C++ part helps JSEP to leverage ONNX Runtime's capabilities
as much as possible including graph transformer, optimizers and also the
capabilities to fallback to CPU EP. TypeScript/JavaScript helps JSEP to
develop and debug much easier in the browser for the kernel
implementation.
> 2. the requirement of asynchronized execution from JavaScript API (eg.
`buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a
synchronized context (see "async problem" section below). This is done
by using Emscripten's Asyncify.

What is WebGPU?
> WebGPU is the new GPU API that available in browser. It's one of the
only 2 APIs that currently available to access the GPU from browser (the
other is WebGL).
> WebGPU is designed with more advanced and stronger features comparing
to WebGL and is potentially solution that offer the best GPU performance
for model inferencing that currently available.

What is the async problem and why we have the problem?
> The "async problem" is a problem that you cannot call an async
function in a synchronous context. Think about the following C++ code:
> ```c
> // C-style declarations (API)
> typedef void (*ON_COMPLETE)(PVOID state, DATA *data);
> void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);
> 
> // implementation
> DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
>   // how to implement?
> }
> ```
> The answer is, it's impossible to implement this function. Usually we
try to find a sync version API, or launch a thread to call the async
function and sync-wait on the main thread. Unfortunately, in browser
environment, neither is possible.
>
> WebGPU does not offer any synchronized API for data downloading (GPU
to CPU). This is the only operation that MUST be async. As `OrtRun()`
will eventually call into DataTransfer for copy data from GPU to CPU,
and `OrtRun()` is a synchronized function, this cannot be done in normal
way.

What is Emscripten? How is the Asyncify feature resolved the problem?
> Emscripten is the C/C++ compiler for WebAssembly. It's what we use to
compile ORT and generates the WebAssembly artifacts which runs on
browsers.
>
> Asyncify is a [compiler
feature](https://emscripten.org/docs/porting/asyncify.html) that allows
calling async functions from a synchronized context. In short, it
generates code to unwind and rewind call stack to emulate async
execution. With this feature, we are able to call the async function
inside `OrtRun()` call.

## Design Overview

**Inter-op**

JSEP is doing pretty much same thing to just another EP. It exposes an
interface for inter-op with JavaScript, which is defined in
onnxruntime/wasm/js_internal_api.js:
```js
// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
    Module.jsepBackend = backend;
    Module.jsepAlloc = alloc;
    Module.jsepFree = free;
    Module.jsepCopy = copy;
    Module.jsepCopyAsync = copyAsync;
    Module.jsepCreateKernel = createKernel;
    Module.jsepReleaseKernel = releaseKernel;
    Module.jsepRun = run;
};
```
This simple JavaScript snippet defines all language barrier level
functions that requires by JSEP to achieve implementing kernels and data
transfers using JavaScript inside ONNX Runtime:
- `jsepBackend`: assign the singleton object to webassembly module
- `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc()
and Free()
- `jsepCopy`: synchronized copy ( GPU to GPU, CPU to GPU)
- `jsepCopyAsync`: asynchronized copy ( GPU to CPU)
- `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object
that maintained in JS to match lifecycle of Kernel in ORT
- `jsepRun`: OpKernel::Compute() should call into this

The abstraction above allows to tie as little as possible connections
and dependencies between C/C++ and TypeScript/JavaScript.

**Resource Management**

Lifecycle of tensor data and kernels are managed by ORT(C/C++) but the
implementation are left to JavaScript. JavaScript code are responsible
to implement the callbacks correctly.

For WebGPU, the GPU data is managed by JavaScript using a singleton map
(tensot_data_id => GPUBuffer). GPU pipeline is managed as singleton.
Shaders are managed using a singletonmap (shader_key => gpu_program),
while shader_key is generated by cache_key (OP specific, including
attributes) and input shapes.

**about data transfer**
`js::DataTransfer::CopyTensor` implemented to call either synchronized
or asynchronized copy callback, depending on the destination is GPU or
not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function
to be called in the synchronized context.

**run kernel in JS**

Kernel class constructor calls once `jsepCreateKernel()` with an
optional per-kernel specific serialization to pass attributes into
JavaScript.

`Compute()` are implemented in a way that a metadata serialization is
performed in a base class and JavaScript code can access the data using
the Emscripten specific builtin macro `EM_ASM_*`.

**disabled features**
memory pattern is force disabled, because the WebGPU data is not
presented by a general memory model (a buffer can be represented by
offset + size).
concurrent run support is disabled. WebGPU is stateful and it also has
async function call. To support concurrent run will significantly
increase the complexity and we don't get any real benefit from it.

**prefer channels last**
JSEP prefers channels last and returns `DataLayout::NHWC` in method
`GetPreferredLayout()`. This will let the graph transformers to
preprocess the graph into a channels last form so that a more optimized
WebGPU shader can be used.

**Testing code**
It's impossible to test JSEP directly because JSEP itself does not
contain any kernel implementation. However, it has the kernel
registration which need to work together with the corresponding
JavaScript code. There are unit tests that run onnx models from
JavaScript API.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-04-24 15:21:18 -07:00
cao lei
dc53ddef7a
Create a new C API KernelContext_GetAllocator() for Custom Op scenario (#15591)
### Description
Create a new C API KernelContext_GetAllocator() for Custom Op scenario



### Motivation and Context
Create a new C API KernelContext_GetAllocator() for Custom Op scenario
2023-04-23 21:54:35 -07:00
Dmitri Smirnov
a5dec8eedf
[C# ] Improve string marshalling and reduce GC pressure (#15545)
### Description

  Reduce a number of auxillary objects created to reduce GC pressure.
Eliminate GCHandle type of memory pinning in most of the places.
Improve string marshalling by allocating unmanaged memory that does not
require pinning. Change native methods from `IntPtr` to `byte[]`
(marshalling pinning is more efficient).

Allocate input/output UTF-8 names in unmanaged heap for the lifetime of
InferenceSession. So we do not keep converting them and pinning on every
Run.

Introduce a new native API that allows to allocate and convert/copy
strings directly into a native tensor.

The PR delivers around 50% latency improvements and less GC pauses.

Inspired by: https://github.com/microsoft/onnxruntime/pull/15520

### Motivation and Context
Client experience GC pressure and performance degradation when dealing
with string tensors.


Co-Authored-By: @tannergooding
2023-04-20 15:12:51 -07:00
Chi Lo
6115c8fd1f
Add TRT plugins support using custom ops (#13847)
This PR makes ORT support TRT plugin using custom ops. ORT TRT can
automatically register all TRT plugins from TRT plugins registry as
custom ops. There is no code change needed for ORT when new TRT plugins
are introduced.

Previous way for ORT to support TRT plugins was using contrib ops, but
there are some concerns about it:

- Contrib ops are shipped as part of the ORT binary by default. TRT
related plugins should not be in the default ORT.
- Contrib ops are designed for internal ops and developed for cpu and
cuda EPs.

Therefore, using custom ops is a good approach to support TRT plugins. 

Followings are the major modifications:

1. Add new `GetCustomOpDomainList` provider api which allows provider to
create its own custom op domain list and ORT can register this domain
list. Provider has the responsibility to free all the custom op domain
instances it created.
2. Move OrtCustomOpDomain struct definition to
framework_provider_common.h since this struct is being used by framework
and EPs now.
3. There are several TRT plugins registered as onnx schema op through
contrib op with onnx domain. In order not to break the old models using
those TRT plugins which were registered with ONNX domain and maintain
backward compatible, we need to keep the old/legacy TRT plugins with
onnx domain. Moving forward, all newly added TRT plugins should be
registered with `trt.plugins` domain.
4. TRT plugin doesn't have an api to get number of inputs/outputs of the
registered plugins, so ORT TRT uses variadic inputs/outputs to bypass
the onnx node validation.
5. Add new trt provider option, `trt_extra_plugin_lib_paths`, user can
specify any extra plugin lib, for example,
`fastertransformer/build/lib/libvit_plugin.so` or
`fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`
2023-04-18 20:24:32 -07:00
Justin Chu
cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or is data heavy


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
liqun Fu
919d8f2660
update with onnx main (#14929) 2023-04-18 08:42:51 -07:00
cao lei
c2221d919f
create a stream in DeviceStreamCollection for memory pattern (#15426)
### Description
Create a stream in DeviceStreamCollection for memory pattern case to fix
the thread safe issue 15154



### Motivation and Context
This is to fix the bug 15154
https://github.com/microsoft/onnxruntime/issues/15154
2023-04-17 10:06:55 -07:00
Maximilian Müller
fbe88fccbd
Exposing new TRT build options (#15089)
### Description

This will add a few TRT options, some of them are only available on TRT
8.6:
- heuristics
- sparsity
- optimization level (8.6 only)
- auxiliary stream (8.6 only)
- tactic source selection

I am no sure yet which tests is should add for these options. As those
are mostly simple TRT flags i am not sure to what level i should test.
For heuristics something similar to
44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)
should be possible for, but for all other essentially we would only be
testing if there is a crash or not if the option is set.
Also if i forgot some option that would be good to have feel free to
speak up !
2023-04-14 09:47:36 -07:00
Dmitri Smirnov
ce3b4eabd3
Implement Optional Metadata support and C# test support (#15314)
### Description
Implement Optional Type metadata support in the library.
Implement optional support in C# API along with metadata.
Implement Sequence, Map, Optional test data support
and test execution.

Prune tests and provide more details for failing tests in C# code.

Note, this PR does not enable running onnx test models in C++.

### Motivation and Context
Opset18 optional type support.
2023-04-11 09:41:59 -07:00
cloudhan
71a4e7eb97
Automatically enable tunable op usage for production models (#15156)
Split `IsTunbaleOpEnable` semantics into **enable tunable op for using**
and **enable tunable op for tuning**.

They remain disabled in general for safety purpose. But
- if session is created with onnx model with tuning results embeded
- the embedded tuning results is set to the EP without error `Status`

then we automatically enable the using, tuning remains disabled.

The planned options will be
- `tunable_op_enable`: The top-level switch of `TunableOp`, indicate if we will run into `TunableOp` related logic. **NOTE:** most of our impls have a bottom impl that is acting as a fallback and is set as the default. In this case, we still call into the `TunableOp`, but no kernel selection, no kernel tuning and caching is involved. This reduced our maintainance burden of a duplicate code path.
- `tunable_op_tuning_enable`: The secondary switch of `TunableOp`, indicate if we will run into the tuning related logic of `TunableOp`

Then for the possible future options:
- `tunable_op_tuning_max_iteration`: blahblah
- `tunable_op_tuning_max_duration_ms`: blahblah
- `tunable_op_flash_attention_enable`: blahblah, for example only, we will not have this.

For developer oriented envvar, it is for developers' convenience to inspect the performance impact of tuning. So there is only `ORT_ROCM_TUNABLE_OP_ENABLE`, `ORT_ROCM_TUNABLE_OP_TUNING_ENABLE` to take the fine-grind control of combinations.
2023-04-06 13:52:47 +08:00
Edward Chen
9f942e1a3e
Graph transformer to ensure unique DQ nodes for QDQ node units (#15145)
### Description
<!-- Describe your changes. -->

Add required graph transformer to duplicate DQ nodes to ensure that QDQ
node units have unique DQ nodes. This condition is necessary for QDQ
node unit processing.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

There is an existing Python utility that does this: 

c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)

This PR implements it as a graph transformer so it is integrated into
ORT and does not require a separate step to update the model. There are
also tests to ensure that its effects are not undone by basic level
graph optimizations.
2023-03-31 08:39:43 +10:00
FFFrog
ecb89ed752
[CANN] Multi-stream execution support for CANN EP. (#14058)
### Description
**Multi-stream** execution support for **CANN EP**.

### Motivation and Context
**CANN EP** is currently **unavailable** due to the introduction of a
new mechanism for multi-stream execution
[#13495](https://github.com/microsoft/onnxruntime/pull/13495), the
deletion of the Fence-based synchronization mechanism, and the failure
to update the relevant logic of **CANN EP** synchronously.

This PR is to fix it.
2023-03-29 11:57:22 -07:00