Commit graph

216 commits

Author SHA1 Message Date
Chi Lo
4e3cff60fd
CUDA graph support for TRT EP (#16081)
The CUDA EP already supports [CUDA
graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
and we have observed that some models benefit from using CUDA graph with
`trtexec`. This PR therefore enables CUDA graph support for the TRT EP.

The implementation is based on
https://github.com/microsoft/onnxruntime/pull/9978 with the same
[constraints](https://github.com/microsoft/onnxruntime/pull/9978) as
below:

- Models with control-flow ops (i.e., If, Loop, and Scan ops) are not
supported.
- Usage of CUDA graphs is limited to models in which all the model ops
(graph nodes) can be partitioned to the TRT EP.
- The input/output types of models need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- IOBinding is required.
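
The constraints above amount to a checklist that has to hold before graph capture is attempted. A minimal sketch of that check, assuming a hypothetical `ModelTraits` summary struct and `CanUseCudaGraph` helper (neither is part of ONNX Runtime):

```cpp
// Hypothetical summary of a model/session; not an ONNX Runtime type.
struct ModelTraits {
  bool has_control_flow_ops;  // contains If, Loop, or Scan
  bool all_nodes_on_trt_ep;   // every graph node is assigned to the TRT EP
  bool io_are_tensors;        // all inputs/outputs are tensors
  bool io_shapes_are_static;  // shapes do not change across inference calls
  bool uses_io_binding;       // inference goes through IOBinding
};

// Returns true only when every constraint listed above is satisfied.
bool CanUseCudaGraph(const ModelTraits& m) {
  return !m.has_control_flow_ops && m.all_nodes_on_trt_ep && m.io_are_tensors &&
         m.io_shapes_are_static && m.uses_io_binding;
}
```

A model that fails any single item, for example one containing a Loop op, would fall back to regular execution in this sketch.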
2023-06-21 09:36:45 -07:00
cao lei
dd72192cf4
ExecutionProvider API refactor - move allocator from EP level to SessionState level and indexed by OrtDevice (#15833)
### Description
This PR refactors the ExecutionProvider API for memory management by
moving allocators from the EP level to the SessionState level, where they
are indexed by OrtDevice.



### Motivation and Context
This PR refactors the ExecutionProvider API for memory management by
moving allocators from the EP level to the SessionState level, where they
are indexed by OrtDevice. With this change, the EP level no longer carries
the burden of maintaining allocators, which is friendlier for EP developers.

---------

Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-06-19 17:44:45 -07:00
Hariharan Seshadri
63f5573354
Relax node placement check for CUDA Graph usage (#16358) 2023-06-15 14:03:08 -07:00
Yuhong Guo
04a8f50674
New configuration to limit the arena extension (#15983)
Add a configuration `max_power_of_two_extend_bytes` to limit the arena extension size.


### Motivation and Context
In our production scenario, we observe that if the model is big enough, the
BfcArena extends uncontrollably.
As shown in the following figures, if a model uses more than 16GB of
memory, the BfcArena requests 32GB of memory in total under the
`kNextPowerOfTwo` strategy. With the new configuration, the extension is
limited. The default maximum extension size is 1GB.
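
To see why doubling overshoots, here is a simplified model of the behavior described above. It is a sketch under stated assumptions, not the BfcArena code: the helper `ReservedBytes`, the 1 MiB initial region, and the 17 GiB requirement are all illustrative choices.

```cpp
#include <cstdint>
#include <limits>

// Simplified model: each extension reserves a region of the current size and
// then doubles the size of the next region, up to max_extend bytes.
// Returns the total bytes reserved once `required` bytes are covered.
std::uint64_t ReservedBytes(std::uint64_t required, std::uint64_t initial_region,
                            std::uint64_t max_extend) {
  std::uint64_t reserved = 0;
  std::uint64_t next_region = initial_region;
  while (reserved < required) {
    reserved += next_region;
    // Double the next region without overflowing, clamped to max_extend.
    next_region = (next_region > max_extend / 2) ? max_extend : next_region * 2;
  }
  return reserved;
}
```

In this model a 17 GiB requirement reserves 32767 MiB (about 32 GiB) when the doubling is unbounded, and 18431 MiB (about 18 GiB) with a 1 GiB cap. These numbers illustrate the mechanism only and are not meant to reproduce the measurements in the figures.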

#### Without the new configuration
After loading the model, ORT uses 32G GPU memory.

![image](https://github.com/microsoft/onnxruntime/assets/19584326/42b93c66-b957-4f20-a13b-d34cb390afff)

#### With the new configuration
After loading the model, ORT uses 23G GPU memory.

![image](https://github.com/microsoft/onnxruntime/assets/19584326/5abffeff-9ca3-4187-a262-37fd2764fe1b)

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-05-25 02:19:07 -07:00
Changming Sun
cc0c5e5612
Fix an error in test/shared_lib/test_inference.cc (#16090)
### Description
Fix an error in test/shared_lib/test_inference.cc. It should use
ASSERT_NEAR to test float values.
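
For context, GoogleTest's `ASSERT_NEAR(val1, val2, abs_error)` passes when the absolute difference between the two values does not exceed `abs_error`. A minimal sketch of the same comparison, using a hypothetical helper `IsNear` that is not part of the test file:

```cpp
#include <cmath>

// Same idea as ASSERT_NEAR: accept values whose absolute difference
// is within the given tolerance instead of requiring exact equality.
bool IsNear(float actual, float expected, float abs_error) {
  return std::fabs(actual - expected) <= abs_error;
}
```

Exact equality on floats can fail across execution providers because of small rounding differences, which is why a tolerance is the safer check here.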

### Motivation and Context
Our OpenVino pipeline is failing because of this.
2023-05-24 22:59:28 -07:00
Dmitri Smirnov
896a963492
Adjust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921)
### Description

This PR partially reverts changes introduced in
https://github.com/microsoft/onnxruntime/pull/15643

We make the two APIs always return `std::string` in UTF-8.

We also move the entry points from OrtApiBase to OrtApi to make them
versioned.

### Motivation and Context

`GetVersionString` always returns x.y.z numbers that are not subject to
internationalization.
`GetBuildInfoString` can hold international chars, but UTF-8 should be
fine to contain those.
We prefix them with u8"" in case the compiler default charset is not
UTF-8.
Furthermore, creating platform dependent APIs is discouraged.
`ORTCHAR_T` is platform dependent and was created for paths only.
On non-Unix platforms it would still produce a `std::string` that can only
contain UTF-8.

The API was introduced after the latest release, and can still be
adjusted.
2023-05-13 13:45:07 -07:00
RandySheriffH
8e610f25d8
Implement lite custom op API (#15778)
Implement a set of new APIs for lightweight custom op registration, to
save the effort of composing schemas.
A few highlights:

- Support build-time type inference;
- Support function-as-op for "stateless" ops;
- Support structure-as-op for "stateful" ops;
- Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-04 09:49:17 -07:00
RandySheriffH
e3ec2b3a8e
Exclude cases from reduced build (#15779)
Exclude cases from reduced build to unblock pipeline.

Fixed
[AB#15326](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15326)

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-02 21:05:54 -07:00
Changming Sun
034698cf6a
Revert "Implement lite custom op API (#15590)" (#15768)
This reverts commit cdf4fc49fc because it
breaks the "debug_node_input_output" build in "Post Merge" pipeline
2023-05-02 01:10:10 -07:00
RandySheriffH
cdf4fc49fc
Implement lite custom op API (#15590)
Implement a set of new APIs for lightweight custom op registration, to
save the effort of composing schemas.
A few highlights:

1. Support build-time type inference;
2. Support function-as-op for "stateless" ops;
3. Support structure-as-op for "stateful" ops;
4. Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-01 08:45:26 -07:00
Yuhong Guo
41dcf0d32e
Expose build information in dynamic lib (#15643)
### Description
1. Add Build Info API to onnx.
2. Fix compile error while building onnxruntime_benchmark on macOS.


### Motivation and Context
1. When the Onnxruntime lib is serving online, we need a way to detect how
this lib was built. This PR helps the developer get the build
information using `strings`, such as git branch, git commit id, build
type and cmake cxx flags, as shown below.


![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png)


![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png)

If the build env has no git, there will be no git-related info:


![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png)


2. Fix the following compile error while building benchmark on macOS.

![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png)

---------

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-04-28 21:57:31 -07:00
RandySheriffH
9773e76c44
Single-schema-multi-kernel (#15184)
This PR allows custom ops with different input types to share the same op
name in a graph.
The idea is to go over all ops with the same name and merge their input/output
types into a single type-inference function.
With this enhancement, custom op nodes inside a graph can have the same
op type provided that their input/output types differ.
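
The merging idea can be sketched as a small registry that groups kernels by op name and resolves the output type from the input type. All names here (`ElemType`, `KernelSignature`, `MergedSchemas`) are hypothetical and only illustrate the approach. This is not the ONNX Runtime implementation.

```cpp
#include <map>
#include <string>
#include <vector>

enum class ElemType { kFloat, kInt64, kString };

// One kernel registered under a given op name (illustrative only).
struct KernelSignature {
  ElemType input;
  ElemType output;
};

class MergedSchemas {
 public:
  // Kernels that share an op name are collected under a single entry.
  void AddKernel(const std::string& op_name, KernelSignature sig) {
    kernels_[op_name].push_back(sig);
  }

  // Merged type inference: find the kernel whose input type matches and
  // report its output type. Returns false if no kernel matches.
  bool InferOutputType(const std::string& op_name, ElemType input, ElemType* output) const {
    auto it = kernels_.find(op_name);
    if (it == kernels_.end()) return false;
    for (const KernelSignature& sig : it->second) {
      if (sig.input == input) {
        *output = sig.output;
        return true;
      }
    }
    return false;
  }

 private:
  std::map<std::string, std::vector<KernelSignature>> kernels_;
};
```

Two kernels named `MyOp`, one taking float and one taking int64, can then coexist, and each node resolves to the kernel that matches its input type.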

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-04-27 13:39:59 -07:00
cao lei
dc53ddef7a
Create a new C API KernelContext_GetAllocator() for Custom Op scenario (#15591)
### Description
Create a new C API KernelContext_GetAllocator() for Custom Op scenario



### Motivation and Context
Create a new C API KernelContext_GetAllocator() for Custom Op scenario
2023-04-23 21:54:35 -07:00
Dmitri Smirnov
a5dec8eedf
[C# ] Improve string marshalling and reduce GC pressure (#15545)
### Description

Reduce the number of auxiliary objects created to reduce GC pressure.
Eliminate GCHandle-based memory pinning in most places.
Improve string marshalling by allocating unmanaged memory that does not
require pinning. Change native methods from `IntPtr` to `byte[]`
(marshalling pinning is more efficient).

Allocate input/output UTF-8 names on the unmanaged heap for the lifetime of
InferenceSession, so we do not keep converting and pinning them on every
Run.

Introduce a new native API that allocates and converts/copies
strings directly into a native tensor.

The PR delivers around 50% latency improvement and fewer GC pauses.

Inspired by: https://github.com/microsoft/onnxruntime/pull/15520

### Motivation and Context
Clients experience GC pressure and performance degradation when dealing
with string tensors.


Co-Authored-By: @tannergooding
2023-04-20 15:12:51 -07:00
Justin Chu
cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or are data heavy


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
Maximilian Müller
fbe88fccbd
Exposing new TRT build options (#15089)
### Description

This adds a few TRT options, some of which are only available on TRT
8.6:
- heuristics
- sparsity
- optimization level (8.6 only)
- auxiliary stream (8.6 only)
- tactic source selection

I am not sure yet which tests I should add for these options. As these
are mostly simple TRT flags, I am not sure to what level I should test them.
For heuristics, something similar to
44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)
should be possible, but for all the others we would essentially only be
testing whether setting the option causes a crash.
Also, if I forgot an option that would be good to have, feel free to
speak up!
2023-04-14 09:47:36 -07:00
Changming Sun
4a0b86eba6
Update the post-merge pipeline (#14965)
### Description
1.  Remove Linux jobs for ORT-Extension combined build
2.  Add a macOS build job for ORT-Extension combined build
3. Adjust the yaml file so that it can support two different ADO
instances.


### Motivation and Context
To test our code better. And it will enable us to run such tests for
every commit in the main branch. It would be easier for us to figure out
which change caused a build break.

See
[AB#13435](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13435)
2023-03-29 13:12:07 -07:00
Chi Lo
c964da7ea2
FasterTransformer model wrapper using custom op (#15013)
### Description
We are introducing the FasterTransformer model-level integration using
ORT [custom op runtime
wrapper](https://github.com/microsoft/onnxruntime/pull/13427).
In order to make the FT wrapper/integration work, two things need to be
done:

- New API `KernelInfoGetConstantInput_tensor`. (Done in this PR)
During custom op kernel initialization, the kernel needs the model
weights (saved as the node's constant inputs) to instantiate FT's
weights. That's why we add this new API to make kernel
info capable of getting constant inputs.

- Custom op and custom op kernel to wrap the FT model. (Will be provided in
onnxruntime extensions or inference examples)
During custom op kernel initialization, it can fetch attributes from
kernel info to determine which kind of FT model instance to create. During
custom op kernel compute/inference, it can get input/output from kernel
context and then assign input/output buffers for the model instance to run.
2023-03-20 09:05:30 -07:00
Adrian Lizarraga
e42f7487df
Add logging APIs for custom operators (#14416)
### Description
Add logging APIs for custom ops.

This PR introduces an `OrtLogger` type, which can be retrieved from an
`OrtKernelInfo` or `OrtKernelContext`. The kernel info's logger is the session logger stored
in the execution provider. The kernel context's logger is a run logger.



### Motivation and Context
Allows custom ops to log information in a manner consistent with
built-in ops.

Example usage in custom op:
```C++
struct MyCustomKernel {
  MyCustomKernel(const OrtApi& api, const OrtKernelInfo* info) {
    Ort::ConstKernelInfo kinfo(info);
    this->logger_ = kinfo.GetLogger();
    // ...
    ORT_CXX_LOGF_NOEXCEPT(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_ERROR, "Error: %s", err_msg);
  }

  void Compute(OrtKernelContext* context) {
    ORT_CXX_LOG(this->logger_, OrtLoggingLevel::ORT_LOGGING_LEVEL_VERBOSE, "Calling compute...");
    // ...
  }

  // ...
 private:
  Ort::Logger logger_;
};
```
2023-03-17 15:05:28 -07:00
Christian Veenhuis
59dfcfdce7
Fix typos in sources: operater, tranform, neccessary, trainig (#14907)
### Description
While browsing the sources I found several typos.
I collected them into a single PR and fixed them.
The typos are: operater, tranform, neccessary, trainig.
After the fix, none of them can be found anymore:

$ git grep "operater"
$ git grep "tranform"
$ git grep "neccessary"
$ git grep "trainig"
$ 

### Motivation and Context
Since some of the typos are in example notebooks and markdown files,
users can see them.
2023-03-13 22:45:04 -07:00
Dmitri Smirnov
8d87fdcfa1
Add GetVersionString API for C++, C# and Python (#14873)
### Description
Added APIs.

### Motivation and Context
Addresses https://github.com/microsoft/onnxruntime/issues/14584

Cc: @Craigacp cp
2023-03-02 17:11:07 -08:00
Scott McKay
b7fde84341
Changes to support standalone custom ops in a minimal build. (#14497)
### Description
Changes to support standalone custom ops in a minimal build. Also
incorporates changes from #14492 (needed to test builds prior to that
being checked in).

We first need to save the schema info from the operators used by the
standalone op invoker in the ORT format model. Add mechanism for that.

Merge the kernel lookup logic so the same is used in full and minimal
build. NOTE: the version matching is now consistent with all other
kernel lookups, and the call to CreateOp MUST use the exact version for
the operator. Previously matching wasn't as strict, but this can lead to
the incorrect kernel being chosen.

Add tests.

NOTE: There is currently no way to detect the ops/types/opsets used
inside these custom ops as they don't exist until we create kernels,
which is after model loading completes (which is the point the ORT
format model is saved). Due to that they have to be manually added to
the configuration used to do the reduced ops build. That shouldn't be
too hard for the custom op author to add given the custom op
implementation is specifying the op, opset and type constraints (i.e.
they have the info and it's just a case of capturing/formatting it
correctly).


### Motivation and Context
Enable usage of the standalone op invoker by custom ops in a minimal
build.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-03-01 11:22:54 +10:00
Scott McKay
549cbc7e69
Fix issue with schema lookup where there are custom ops using the ONNX domain (#14492)
### Description
Fix issue with schema lookup where there are custom ops using the ONNX
domain.

Update testing infrastructure to use an explicit domain for custom ops.
Using an empty string clashes with the ONNX domain and can cause
unexpected issues. It's also a bad example for external users as our
docs point to the unit tests.

Fix a couple of places using exact matching of the node since version to
be slightly more flexible and use a range (which aligns with how the
kernel lookup works).
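
The difference between exact and range matching can be sketched as follows. `KernelVersionRange`, `MatchesExact`, and `MatchesRange` are hypothetical names used only for illustration and do not appear in the ONNX Runtime sources.

```cpp
// Hypothetical kernel registration record; not an ONNX Runtime type.
struct KernelVersionRange {
  int start;  // first opset version the kernel supports (inclusive)
  int end;    // last opset version the kernel supports (inclusive)
};

// Exact matching: only the version the kernel was introduced in is accepted.
bool MatchesExact(const KernelVersionRange& k, int node_since_version) {
  return k.start == node_since_version;
}

// Range matching: any version inside [start, end] is accepted.
bool MatchesRange(const KernelVersionRange& k, int node_since_version) {
  return k.start <= node_since_version && node_since_version <= k.end;
}
```

With a kernel covering versions 11 through 12, a node whose since version is 12 is rejected by the exact check but accepted by the range check.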

### Motivation and Context
Fixes a problem that came up when adding support for standalone custom
ops in an ORT format model. Separating these changes out to simplify
review.
2023-02-03 08:05:18 +10:00
Erick Muñoz
d1533c27eb
[oneDNN] Improved thread handling (#13618)
* Added the OrtDnnlProviderOptions structure to expose configuration
options to the user

* The number of threads can be defined by the user with the -i flag on
the perftest

* Number of threads can also be configured via the OMP_NUM_THREADS
environment variable

* The number of threads defined in the OrtDnnlProviderOptions is
prioritized over the environment variable

### Description
Avoids thread oversubscription caused by OpenMP allocating the maximum
number of threads possible for the oneDNN EP. Adds support for
OrtDnnlProviderOptions, which allows more EP customization
and lets the user define the number of threads.
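
The priority order described in the bullet list above (provider options first, then `OMP_NUM_THREADS`) can be sketched as below. `ResolveNumThreads` and its parameters are hypothetical and are not the oneDNN EP's actual code. A value of 0 stands in for "not set".

```cpp
#include <cstdlib>

// options_threads: thread count from the provider options (0 = not set).
// env_value: raw OMP_NUM_THREADS string, or nullptr when the variable is unset.
// fallback: value used when neither source provides a positive count.
int ResolveNumThreads(int options_threads, const char* env_value, int fallback) {
  if (options_threads > 0) return options_threads;  // provider options win
  if (env_value != nullptr) {
    int parsed = std::atoi(env_value);
    if (parsed > 0) return parsed;  // then the environment variable
  }
  return fallback;
}
```

A caller could pass `std::getenv("OMP_NUM_THREADS")` as `env_value`. Keeping the environment lookup outside the helper makes the priority rule easy to test.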



### Motivation and Context
- Improves performance and allows the user to fine-tune the number of
threads
2023-01-31 14:37:13 -08:00
RandySheriffH
36ba3d8d21
Exclude a multi-stream case from reduced ops build (#14351)
Exclude a multi-stream case from reduced ops build to unblock
[pipeline](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=120&_a=summary).

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-19 14:39:25 -08:00
Adrian Lizarraga
de17d53c50
Custom Op runtime wrapper (#13427)
### Description

Adds the below C APIs to support custom ops that wrap an entire model to
be inferenced with an external runtime. The current SNPE EP is an
example of an EP that could be ported to use a custom op wrapper. Ex:
The custom op stores the serialized SNPE DLC binary as a string
attribute. The SNPE model is built when the kernel is created. The model
is inferenced with SNPE APIs on call to the kernel's compute method.

#### C APIs
| API | Description | Why |
| --- | --- | --- |
| `KernelInfo_GetInputCount` | Gets number of inputs from `OrtKernelInfo`. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputCount` | Gets number of outputs from `OrtKernelInfo`. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputName` | Gets an input's name. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputName` | Gets an output's name. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputTypeInfo` | Gets the type/shape information for an input. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputTypeInfo` | Gets the type/shape information for an output. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfoGetAttribute_tensor` | Get a OrtValue tensor stored as an attribute in the graph node | Extract serialized models, weights, etc. |
| `GetSessionConfigEntry` | Get a session configuration value | Need to be able to get session-time configurations from within custom op |
| `HasSessionConfigEntry` | Check if session configuration entry exists. | Need to be able to get session-time configurations from within custom op |

#### Why so many KernelInfo APIs?<sup>1</sup>
Similar APIs currently exist for `OrtKernelContext`, but not
`OrtKernelInfo`. Note that `OrtKernelContext` is passed to the custom op
on call to its kernel's compute() function. However, `OrtKernelInfo` is
available on kernel creation, which occurs when the session is created.
Having these APIs available from `OrtKernelInfo` allows an operator to
trade-off computation time for session-creation time, and vice versa.
Operators that must build expensive state may prefer to do it during
session creation time instead of compute-time.

SNPE is an example of an EP that needs to be able to query `KernelInfo`
for the name, type, and shape of inputs and outputs in order to build
the model from the serialized DLC data. This is an expensive operation.
Other providers (e.g., OpenVINO) are able to query i/o info from the
serialized model, so they do not strictly need these APIs. However, the
APIs can still be used to validate the expected I/O characteristics.

Additionally, several of our CPU contrib ops currently use the same
internal version of these KernelInfo APIs (Ex:
[qlinear_softmax](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc#L71)).
If custom ops are also meant to be a test bed for future ops, then all
custom ops (not just runtime wrappers) would benefit from the addition
of these public KernelInfo APIs (IMO).

#### Example of usage in a custom OP
From
`onnxruntime/test/testdata/custom_op_openvino_wrapper_library/openvino_wrapper.h`

```c++
struct CustomOpOpenVINO : Ort::CustomOpBase<CustomOpOpenVINO, KernelOpenVINO> {
  explicit CustomOpOpenVINO(Ort::ConstSessionOptions session_options);

  CustomOpOpenVINO(const CustomOpOpenVINO&) = delete;
  CustomOpOpenVINO& operator=(const CustomOpOpenVINO&) = delete;

  void* CreateKernel(const OrtApi& api, const OrtKernelInfo* info) const;

  constexpr const char* GetName() const noexcept {
    return "OpenVINO_Wrapper";
  }

  constexpr const char* GetExecutionProviderType() const noexcept {
    return "CPUExecutionProvider";
  }

  // IMPORTANT: In order to wrap a generic runtime-specific model, the custom operator
  // must have a non-homogeneous variadic input and output.

  constexpr size_t GetInputTypeCount() const noexcept {
    return 1;
  }

  constexpr size_t GetOutputTypeCount() const noexcept {
    return 1;
  }

  constexpr ONNXTensorElementDataType GetInputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr ONNXTensorElementDataType GetOutputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetInputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetOutputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr bool GetVariadicInputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  constexpr bool GetVariadicOutputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  std::vector<std::string> GetSessionConfigKeys() const { return {"device_type"}; }

 private:
  std::unordered_map<std::string, std::string> session_configs_;
};
```

#### How to create a session:
```c++
Ort::Env env;
Ort::SessionOptions session_opts;
Ort::CustomOpConfigs custom_op_configs;

// Create local session config entries for the custom op.
custom_op_configs.AddConfig("OpenVINO_Wrapper", "device_type", "CPU");

// Register custom op library and pass in the custom op configs (optional).
session_opts.RegisterCustomOpsLibrary(lib_name, custom_op_configs);

Ort::Session session(env, model_path.data(), session_opts);
```
### Motivation and Context
Allows creation of simple "wrapper" EPs outside of the main ORT code
base.
2023-01-18 09:09:32 -08:00
Scott McKay
dab900dfa0
Fix type mismatch when ORT_ENABLE_STREAM is off (#14324)
### Description
PartitionIntoStreams was incorrectly using std::string instead of
PathString for the config file argument when ORT_ENABLE_STREAM was not
defined.

Also incorporate changes from #14291 to fix build and test issues.

### Motivation and Context
Fix build error on Windows due to mismatched type.
2023-01-18 13:45:00 +10:00
Scott McKay
b9ecd428c1
Add ability to register custom ops by specifying a function name (#14177)
### Description
Use dlsym/GetProcAddress to look up a custom ops registration function by
name and call it.

This will be better on mobile platforms where the custom ops library is
linked against, and there isn't necessarily a filesystem that a library
path can be loaded from.

The alternative is to wire up passing in the address of the function, but
that has multiple complications which differ by platform.

### Motivation and Context
Enable using ort and ort-ext packages on mobile platforms.

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-01-12 15:11:34 +10:00
Adrian Lizarraga
68794d0ac1
Improve custom op library handle cleanup (#14099)
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`

### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).

The new API automatically cleans up the library and allows the user to
write more straightforward code.
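
The ordering requirement (the library handle must outlive the session) can be illustrated with plain RAII. The types below are hypothetical stand-ins that only record when they are destroyed. They do not show how `RegisterCustomOpsLibrary_V2` is implemented.

```cpp
#include <string>
#include <vector>

// Stand-in for a loaded custom op library; logs when it is released.
struct FakeLibraryHandle {
  explicit FakeLibraryHandle(std::vector<std::string>& log) : log_(log) {}
  ~FakeLibraryHandle() { log_.push_back("library closed"); }
  std::vector<std::string>& log_;
};

// Stand-in for a session that uses kernels from the library.
struct FakeSession {
  explicit FakeSession(std::vector<std::string>& log) : log_(log) {}
  ~FakeSession() { log_.push_back("session destroyed"); }
  std::vector<std::string>& log_;
};

// Members are destroyed in reverse declaration order, so declaring the
// library first guarantees that it outlives the session.
struct SessionWithLibrary {
  explicit SessionWithLibrary(std::vector<std::string>& log) : library(log), session(log) {}
  FakeLibraryHandle library;
  FakeSession session;
};

// Creates and destroys the owner, returning the recorded teardown order.
std::vector<std::string> TeardownOrder() {
  std::vector<std::string> log;
  {
    SessionWithLibrary owner(log);
  }  // owner goes out of scope here
  return log;
}
```

In this sketch the session is always torn down before the library is closed, which is the order the old API left to the user to get right.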
2023-01-04 17:56:29 -08:00
Tianlei Wu
6a9dc6c993
[CUDA] Update fused MHA to support flash attention and causal mask (#13953)
### Description
Update fused attention kernels to support flash attention and causal
mask (GPT-2 initial decoder run).

Note: Causal kernels are from FasterTransformer 5.2. Flash attention
kernels that are not causal are from TensorRT 8.5.1.

#### Performance Test of bert-base model

Test like the following:
```
 python -m onnxruntime.transformers.benchmark -m bert-base-cased -b 1 4 8 16 32 64 -s 512 -t 1000 -o by_script -g -p fp16 -i 3 --use_mask_index
```

The original Flash Attention is from
https://github.com/HazyResearch/flash-attention. RemovePadding and
RestorePadding are added before/after the original flash attention but
not for this PR, so the results are not an apples-to-apples comparison. They are
included for reference only.

Average latency (ms) of float16 bert-base-cased model:

* A100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512 | b128_s512
-- | -- | -- | -- | -- | -- | -- | --
Unfused | 1.83 | 5.00 | 9.31 | 17.76 | 34.47 | 67.43 | 133.38
TRT Fused | 2.05 | 3.58 | 5.70 | 10.96 | 21.22 | 41.23 | 80.56
Flash Attention (from FT) | 1.43 | 3.20 | 5.71 | 10.95 | 22.19 | 42.96 | 84.54
Flash Attention (from TRT) | 1.44 | 3.28 | 5.70 | 10.86 | 21.00 | 40.56 | 79.53
Original Flash Attention | 1.81 | 4.04 | 6.82 | 13.06 | 24.62 | 46.58 | 91.10

* T4

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 8.17 | 29.86 | 59.56 | 115.77 | 236.66 | 461.43
Flash Attention (from FT) | 5.65 | 21.12 | 44.94 | 86.83 | 174.16 | 351.38
Flash Attention (from TRT) | 5.73 | 21.49 | 45.49 | 89.15 | 174.37 | 352.08
Original Flash Attention | 6.22 | 22.16 | 43.39 | 83.8 | 168.77 | 337.04

* V100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 3.77 | 10.48 | 19.53 | 37.63 | 73.68 | 145.58
Flash Attention (from FT) | 3.21 | 8.25 | 14.95 | 28.83 | 56.28 | 111.15

#### Performance Test of GPT-2 model
Test like the following:
`python benchmark_gpt2.py -m distilgpt2 -o --stage 1 --use_gpu -p fp16 -b 1 4 8 16 32 64 128 -s 0 --sequence_lengths 8 16 32 64 128 256 512`
* A100

Note that flash attention is used as fused attention when
sequence_length > 128.

batch_size | sequence_length | with Fused Attention | without Fused Attention | A100 Gain
-- | -- | -- | -- | --
1 | 8 | 0.93 | 1 | 7.0%
4 | 8 | 0.82 | 0.88 | 6.8%
8 | 8 | 0.84 | 0.88 | 4.5%
16 | 8 | 0.92 | 0.97 | 5.2%
32 | 8 | 1.15 | 1.17 | 1.7%
64 | 8 | 1.68 | 1.72 | 2.3%
128 | 8 | 2.76 | 2.78 | 0.7%
1 | 16 | 0.95 | 0.95 | 0.0%
4 | 16 | 0.83 | 0.88 | 5.7%
8 | 16 | 0.91 | 0.97 | 6.2%
16 | 16 | 1.12 | 1.17 | 4.3%
32 | 16 | 1.67 | 1.72 | 2.9%
64 | 16 | 2.73 | 2.76 | 1.1%
128 | 16 | 4.96 | 4.95 | -0.2%
1 | 32 | 0.94 | 0.88 | -6.8%
4 | 32 | 0.91 | 0.97 | 6.2%
8 | 32 | 1.12 | 1.17 | 4.3%
16 | 32 | 1.65 | 1.71 | 3.5%
32 | 32 | 2.69 | 2.76 | 2.5%
64 | 32 | 4.86 | 4.94 | 1.6%
128 | 32 | 9.35 | 9.38 | 0.3%
1 | 64 | 0.84 | 0.88 | 4.5%
4 | 64 | 1.1 | 1.17 | 6.0%
8 | 64 | 1.64 | 1.73 | 5.2%
16 | 64 | 2.66 | 2.77 | 4.0%
32 | 64 | 4.82 | 4.97 | 3.0%
64 | 64 | 9.23 | 9.4 | 1.8%
128 | 64 | 18.54 | 19.12 | 3.0%
1 | 128 | 0.91 | 0.98 | 7.1%
4 | 128 | 1.68 | 1.74 | 3.4%
8 | 128 | 2.71 | 2.83 | 4.2%
16 | 128 | 4.85 | 5.09 | 4.7%
32 | 128 | 9.32 | 9.69 | 3.8%
64 | 128 | 18.54 | 19.44 | 4.6%
128 | 128 | 36.86 | 38.47 | 4.2%
1 | 256 | 1.15 | 1.23 | 6.5%
4 | 256 | 2.71 | 2.95 | 8.1%
8 | 256 | 4.87 | 5.3 | 8.1%
16 | 256 | 9.32 | 10.23 | 8.9%
32 | 256 | 18.6 | 20.53 | 9.4%
64 | 256 | 36.93 | 40.41 | 8.6%
128 | 256 | 72.84 | 80.14 | 9.1%
1 | 512 | 1.68 | 1.96 | 14.3%
4 | 512 | 4.9 | 6.02 | 18.6%
8 | 512 | 9.4 | 11.59 | 18.9%
16 | 512 | 18.71 | 23.05 | 18.8%
32 | 512 | 37.13 | 45.46 | 18.3%
64 | 512 | 74.04 | 89.88 | 17.6%
128 | 512 | NA | NA | NA
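
The Gain column appears to be the latency reduction relative to the unfused run, i.e. `(unfused - fused) / unfused`. This reading is inferred from the table rows rather than stated by the author. A small sketch that reproduces a few entries, using a hypothetical helper `GainPercent`:

```cpp
#include <cmath>

// Relative latency reduction in percent, rounded to one decimal place.
// fused_ms and unfused_ms are average latencies in milliseconds.
double GainPercent(double fused_ms, double unfused_ms) {
  double gain = (unfused_ms - fused_ms) / unfused_ms * 100.0;
  return std::round(gain * 10.0) / 10.0;
}
```

For example, the first A100 row (0.93 ms fused, 1 ms unfused) gives 7.0, and the row with 0.94 ms fused versus 0.88 ms unfused gives -6.8, matching the table.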

* T4:

batch_size | sequence_length | with Fused Attention | with Unfused Attention | T4 Gain
-- | -- | -- | -- | --
1 | 8 | 1.97 | 2.11 | 6.6%
4 | 8 | 2.2 | 2.25 | 2.2%
8 | 8 | 2.77 | 3.1 | 10.6%
16 | 8 | 4.17 | 4.2 | 0.7%
32 | 8 | 6.86 | 6.82 | -0.6%
64 | 8 | 14.88 | 14.92 | 0.3%
128 | 8 | 31.4 | 31.29 | -0.4%
1 | 16 | 1.61 | 1.71 | 5.8%
4 | 16 | 2.13 | 2.31 | 7.8%
8 | 16 | 3.38 | 3.67 | 7.9%
16 | 16 | 6.16 | 6.54 | 5.8%
32 | 16 | 14.16 | 14.76 | 4.1%
64 | 16 | 30.36 | 30.57 | 0.7%
128 | 16 | 63.14 | 63.57 | 0.7%
1 | 32 | 1.53 | 1.69 | 9.5%
4 | 32 | 3.34 | 3.66 | 8.7%
8 | 32 | 6.25 | 6.64 | 5.9%
16 | 32 | 14.12 | 14.9 | 5.2%
32 | 32 | 28.96 | 29.82 | 2.9%
64 | 32 | 61.07 | 61.77 | 1.1%
128 | 32 | 116.38 | 117.98 | 1.4%
1 | 64 | 2.01 | 2.21 | 9.0%
4 | 64 | 6.18 | 6.67 | 7.3%
8 | 64 | 13.72 | 14.49 | 5.3%
16 | 64 | 28.71 | 29.83 | 3.8%
32 | 64 | 58.65 | 60.68 | 3.3%
64 | 64 | 113.09 | 113.17 | 0.1%
128 | 64 | 205.21 | 209.4 | 2.0%
1 | 128 | 3.37 | 3.76 | 10.4%
4 | 128 | 13.54 | 14.85 | 8.8%
8 | 128 | 28.32 | 30.22 | 6.3%
16 | 128 | 58.16 | 62.09 | 6.3%
32 | 128 | 109.17 | 113.99 | 4.2%
64 | 128 | 198.9 | 207.1 | 4.0%
128 | 128 | 413.25 | 421.82 | 2.0%
1 | 256 | 6.33 | 7.05 | 10.2%
4 | 256 | 28.09 | 31.49 | 10.8%
8 | 256 | 57.47 | 62.76 | 8.4%
16 | 256 | 106.77 | 117.95 | 9.5%
32 | 256 | 197.02 | 208.58 | 5.5%
64 | 256 | 406.81 | 431.36 | 5.7%
128 | 256 | NA | NA | NA
1 | 512 | 13.84 | 16.32 | 15.2%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA

* V100:

batch_size | sequence_length | with Fused Attention | with Unfused Attention | V100 Gain
-- | -- | -- | -- | --
1 | 8 | 1.31 | 1.6 | 18.1%
4 | 8 | 1.17 | 1.26 | 7.1%
8 | 8 | 1.43 | 1.79 | 20.1%
16 | 8 | 2.14 | 1.96 | -9.2%
32 | 8 | 2.91 | 3.08 | 5.5%
64 | 8 | 5.32 | 5.27 | -0.9%
128 | 8 | 9.34 | 8.97 | -4.1%
1 | 16 | 1.41 | 1.58 | 10.8%
4 | 16 | 1.38 | 1.49 | 7.4%
8 | 16 | 1.81 | 2.2 | 17.7%
16 | 16 | 2.8 | 2.83 | 1.1%
32 | 16 | 4.94 | 4.99 | 1.0%
64 | 16 | 8.88 | 8.84 | -0.5%
128 | 16 | 17.35 | 17.2 | -0.9%
1 | 32 | 1.38 | 1.77 | 22.0%
4 | 32 | 1.77 | 1.93 | 8.3%
8 | 32 | 2.71 | 2.86 | 5.2%
16 | 32 | 5.03 | 4.92 | -2.2%
32 | 32 | 8.8 | 8.79 | -0.1%
64 | 32 | 17.29 | 17.23 | -0.3%
128 | 32 | 33.27 | 33.1 | -0.5%
1 | 64 | 1.67 | 1.87 | 10.7%
4 | 64 | 2.69 | 2.76 | 2.5%
8 | 64 | 4.87 | 4.94 | 1.4%
16 | 64 | 8.73 | 8.81 | 0.9%
32 | 64 | 16.92 | 17.24 | 1.9%
64 | 64 | 33 | 33.38 | 1.1%
128 | 64 | 65.33 | 65.86 | 0.8%
1 | 128 | 2.03 | 2.22 | 8.6%
4 | 128 | 4.9 | 5.04 | 2.8%
8 | 128 | 8.76 | 8.81 | 0.6%
16 | 128 | 17.06 | 17.29 | 1.3%
32 | 128 | 33.25 | 33.56 | 0.9%
64 | 128 | 65.54 | 66.5 | 1.4%
128 | 128 | 130.44 | 131.44 | 0.8%
1 | 256 | 2.78 | 2.86 | 2.8%
4 | 256 | 8.75 | 9.04 | 3.2%
8 | 256 | 17 | 17.68 | 3.8%
16 | 256 | 33.19 | 34.32 | 3.3%
32 | 256 | 65.43 | 67.86 | 3.6%
64 | 256 | 129.92 | 134.68 | 3.5%
128 | 256 | NA | NA | NA
1 | 512 | 4.95 | 5.32 | 7.0%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA
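
The Gain column in these tables appears to be the relative reduction of the fused measurement against the unfused one, `(unfused - fused) / unfused`. This is inferred from the rows themselves (for example, 1.53 vs 1.69 gives 9.5%) rather than stated in the message. A minimal sketch of that arithmetic, where `RelativeGainPercent` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of onnxruntime): relative gain of the fused
// measurement over the unfused one, in percent, rounded to one decimal place
// to match the precision used in the tables.
double RelativeGainPercent(double fused, double unfused) {
  assert(unfused > 0.0);
  const double gain = (unfused - fused) / unfused * 100.0;
  return std::round(gain * 10.0) / 10.0;
}
```

A negative result, as in the V100 row for batch size 16 and sequence length 8, means the fused measurement was larger than the unfused one.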
2022-12-31 10:33:54 -08:00
Adrian Lizarraga
3bbcc2799f
Support for custom op variadic inputs/outputs (#13946)
### Description
Adds support for variadic inputs and outputs to custom operators.

### Motivation and Context
Needed for custom ops that wrap external runtimes/models and maybe TensorRT plugins.
2022-12-23 11:41:15 -08:00
RandySheriffH
a061fedb5d
Exclude affinity-setting logic from minimal build (#13967)
Comment out the affinity-setting logic which introduced an unnecessary
binary size increase for the minimal build.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-12-15 14:43:42 -08:00
Tang, Cheng
a81faee41e
Multi-stream execution support (#13495)
**Description**: This PR includes the following work:
1. Provide stream and related synchronization abstractions in
onnxruntime.
2. Enhance onnxruntime's execution planner / executor / memory arena to
support executing multiple streams in parallel.
3. Deprecate the parallel executor for CPU.
4. Deprecate the Fence mechanism.
5. Update the CUDA / TensorRT EPs to support the stream mechanism and
to support running different requests in different CUDA streams.

**Motivation and Context**
- Why is this change required?
Currently, the execution plan is just a linear list of primitives that
ORT executes step by step. For any given graph, ORT serializes it to a
fixed execution order. This sequential execution design simplifies most
scenarios, but it has the following limitations:
1. It is difficult to enable inter-node parallelization. We have a
half-baked parallel executor, but it is very difficult to make it work
with GPU.
2. The fence mechanism works for the single GPU stream + CPU thread
case, but when extended to multiple streams, it is difficult to manage
the cross-stream GPU synchronization.
3. Our CUDA EP relies on the BFCArena to make memory management work
with asynchronous GPU kernels, but the current BFCArena is not aware of
streams, so it doesn't behave correctly when run with multiple
streams.

This PR enhances our existing execution plan and executor to support
multiple-stream execution. We use a unified algorithm to manage both
single-stream and multiple-stream scenarios.
This PR mainly focuses on the infrastructure support for multiple-stream
execution; that is, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be addressed in a future PR.
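
To illustrate what a stream assignment implies for synchronization, here is a toy model that is not onnxruntime's actual planner code. Given a stream index per node and the producer/consumer edges of the graph, it lists the edges that cross streams. Each of those would need a notification on the producer's stream and a wait on the consumer's stream, while edges inside one stream are already ordered by the stream. `Edge` and `CrossStreamEdges` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model for illustration only; not onnxruntime's real data structures.
struct Edge {
  std::size_t producer;  // index of the node that writes the value
  std::size_t consumer;  // index of the node that reads it
};

// Returns the edges whose endpoints are assigned to different streams.
// Only these edges need explicit cross-stream synchronization.
std::vector<Edge> CrossStreamEdges(const std::vector<int>& stream_of_node,
                                   const std::vector<Edge>& edges) {
  std::vector<Edge> result;
  for (const Edge& e : edges) {
    assert(e.producer < stream_of_node.size());
    assert(e.consumer < stream_of_node.size());
    if (stream_of_node[e.producer] != stream_of_node[e.consumer]) {
      result.push_back(e);
    }
  }
  return result;
}
```

With all nodes on a single stream the result is empty, which is consistent with handling the single-stream and multiple-stream cases through one algorithm.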

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
2022-12-15 07:39:29 -08:00
RandySheriffH
3addbabc59
Fix react native ci (#13948)
Fix the build error in the React Native CI pipeline by adding the common
header.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-12-12 19:38:55 -08:00
RandySheriffH
75584c5fa8
Enabling thread pool to be numa-aware (#13778)
The PR enables ort thread pool to be numa-aware, so that threads could
be evenly created and distributed among numa nodes.
In addition, to facilitate performance tuning, the PR opens a new API
allowing customers to attach threads to certain logical processors.
Please check the API
[definition](https://github.com/microsoft/onnxruntime/pull/13778/files#diff-5845a5c76fb64abdc8f0cffe21b37f8da1712674eb3abc4cd87190891be1bd48)
for details.
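
As a rough illustration of "evenly distributed", the sketch below assigns threads to NUMA nodes round-robin, so per-node thread counts differ by at most one. The round-robin policy and the name `AssignThreadsToNumaNodes` are assumptions made for this example; the PR's actual placement logic may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's code): thread i is placed on NUMA node
// i % num_nodes, so per-node thread counts differ by at most one.
std::vector<std::size_t> AssignThreadsToNumaNodes(std::size_t num_threads,
                                                   std::size_t num_nodes) {
  assert(num_nodes > 0);
  std::vector<std::size_t> node_of_thread(num_threads);
  for (std::size_t i = 0; i < num_threads; ++i) {
    node_of_thread[i] = i % num_nodes;
  }
  return node_of_thread;
}
```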

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-12-12 10:33:55 -08:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Fei Hu
943e156f4c
Allow custom ops to set input memory type (#10879) 2022-10-28 21:45:26 -07:00
Dmitri Smirnov
5dae0c477d
Deprecate CustomApi and refactor public API for better safety and consistency (#13215)
### Description
Deprecate CustomOpApi and refactor dependencies for exception safety and
to eliminate memory leaks.
Refactor API classes for clear ownership and semantics.
Introduce `InitProviderOrtApi()`.

### Motivation and Context
Make public API better and safer.

Special note about `Ort::Unowned`. The class suffers from the following
problems:

1. It is not able to hold const pointers to the underlying C objects.
This forces users to `const_cast` and circumvent constness of the
returned object. The user is now able to call mutating interfaces on the
object, which violates invariants and may be a thread-safety issue. It
also makes it possible to take ownership of the pointer and destroy it
unintentionally (see examples below).
2. The objects that are unowned cannot be copied and that makes coding
inconvenient and at times unsafe.
3. It directly inherits from the type it `unowns`.

All of the above create ideal conditions for inadvertent unowned object
mutations and destructions. Consider the following examples of object
slicing; one of them is from a real customer issue and the other one I
accidentally coded myself (and I am supposed to know how this works).
None of the below can be solved by aftermarket patches and can be hard
to diagnose.

#### Example 1 slicing of argument
```cpp
void SlicingOnArgument(Ort::Value& value) {
  // This takes possession of the input. If the argument is actually an
  // Ort::Unowned<Ort::Value>, the ptr would be double freed,
  // regardless of whether it was const, since we cast the constness away.
  Ort::Value output_values[] = {std::move(value)};
}

int main() {
  const OrtValue* ptr = nullptr;  // some value, does not matter
  Ort::Unowned<Ort::Value> unowned{const_cast<OrtValue*>(ptr)};
  // The unowned value is destroyed when the call returns.
  SlicingOnArgument(unowned);
}
```

#### Example 2 slicing of return value
```cpp
// The return value is sliced to Ort::Value, which would own and release the ptr (double free).
Ort::Value SlicingOnReturn() {
  const OrtValue* ptr = nullptr;  // some value, does not matter
  Ort::Unowned<Ort::Value> unowned{const_cast<OrtValue*>(ptr)};
  return unowned;
}
```
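
The two examples above need the onnxruntime headers to compile. The following self-contained sketch reproduces the same slicing hazard with toy types. `Handle`, `Owned`, and `UnownedByInheritance` are hypothetical stand-ins, not the real `Ort::` classes, and a release counter replaces the actual free so the effect can be observed safely.

```cpp
#include <utility>

// Toy stand-ins for illustration; none of these are real onnxruntime types.
struct Handle {
  int release_count = 0;  // counts how often the handle was "freed"
};

// Owning wrapper: releases the handle in its destructor.
class Owned {
 public:
  explicit Owned(Handle* p) : p_(p) {}
  Owned(Owned&& other) noexcept : p_(other.p_) { other.p_ = nullptr; }
  Owned(const Owned&) = delete;
  ~Owned() {
    if (p_) ++p_->release_count;
  }

 protected:
  Handle* p_;
};

// Non-owning wrapper built the problematic way: it inherits from the owning
// type and avoids the release only by clearing the pointer in its destructor.
class UnownedByInheritance : public Owned {
 public:
  explicit UnownedByInheritance(Handle* p) : Owned(p) {}
  ~UnownedByInheritance() { p_ = nullptr; }
};

// Accepts the base type by reference, so an UnownedByInheritance binds to it.
void TakeByMove(Owned& value) {
  Owned stolen{std::move(value)};  // base-class move constructor runs
}  // stolen is destroyed here and releases a handle it never owned

// The unowned wrapper alone does not release the handle.
int ReleasesWhenUsedDirectly() {
  Handle h;
  { UnownedByInheritance u{&h}; }
  return h.release_count;
}

// Passing the unowned wrapper as its base type causes an unexpected release.
int ReleasesWhenSliced() {
  Handle h;
  {
    UnownedByInheritance u{&h};
    TakeByMove(u);
  }
  return h.release_count;
}
```

In this toy model `ReleasesWhenUsedDirectly()` reports no release, while `ReleasesWhenSliced()` reports one release of a handle the wrapper was never supposed to own. With a real owner elsewhere, that extra release is the double free described above.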
2022-10-06 14:57:37 -07:00
Scott McKay
60e4d012e0
Fix unused variable warning from reduced ops build (#12889) 2022-09-09 08:08:56 +10:00
RandySheriffH
d3b684cd9e
Drop nuphar (#11555)
* drop nuphar code and configs

* refactor test case

* format python

* remove nuphar from training test

* remove commented nuphar logics

* restore llvm setting

* drop nuphar ci

* fix compile err

* fix compile err

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-09-07 15:11:18 -07:00
Hariharan Seshadri
cde504ebbf
Fix/Suppress some VC static analyzer warnings (#12713) 2022-08-24 23:39:40 -07:00
Hariharan Seshadri
d5a1c01b38
Add C++ Session ctor taking model bytes and OrtPrepackedWeightsContainer (#12333) 2022-07-29 12:32:43 -07:00
Hariharan Seshadri
73310b2a0f
Fix Reduced Ops build pipeline (#12144)
Fix ReducedOps build pipeline
2022-07-11 19:02:38 -07:00
Hariharan Seshadri
2e27a7e330
Skip Constant Folding for ops producing an optional type output (#11839) 2022-06-30 13:38:35 -07:00
RandySheriffH
d5fcb432fa
Generalize native op creation (#11539)
* create op from ep

* read input count from context

* create holder to host nodes

* fix typo

* cast type before comparison

* throw error on API fail

* silence warning from minimal build

* switch to unique_ptr with deleter to host nodes

* fix typo

* fix build err for minimal

* fix build err for minimal

* add UT for conv

* enable test on CUDA

* add comment

* fix typo

* use gsl::span and string view for Node constructor

* Added two APIs - CopyKernelInfo and ReleaseKernelInfo

* pass gsl::span by value

* switch to span<NodeArg* const> to allow for reference to const containers

* fix typo

* fix reduced build err

* fix reduced build err

* refactoring node construction logic

* rename exceptions

* add input and output count as arguments for op creation

* refactor static member

* use ORT_CATCH instead of catch

* cancel try catch

* add static value name map

* format input definition and set err code

* fix comments

* fix typo
2022-06-27 21:12:15 -07:00
Dmitri Smirnov
088bc7494b
Deprecate APIs returning raw ptrs and provide replacements (#11922)
Provide better documentation
2022-06-24 09:50:04 -07:00
Scott McKay
927bac0f86
Rework allocator sharing to work for multiple devices. (#11700)
* Rework allocator sharing to work for multiple devices.
* Update SessionState to not use allocator name in matching for consistency with IExecutionProvider. The name doesn't have any clear meaning (e.g. we use the same name for the per-thread allocator in the CUDA EP as the shared allocator there and in the TRT EP).
  * NOTE: this means we will have one allocator per OrtMemType+OrtDevice. 
* Reverse order when doing allocator setup in SessionState. This will result in the CPU and CUDA EPs allocators being preferred (they are the most configurable), and also means the per-thread CUDA allocator for default GPU memory will be used even when TRT is enabled. 
  * NOTE: Combined with the change to remove the allocator name from the key this will mean that if CUDA and TRT or ROCM and MIGraphX are both enabled the CUDA/ROCM per-thread allocator will be used to allocate GPU memory.  
* Use InsertAllocator instead of TryInsertAllocator. Each EP should be registered once, and we should only enter RegisterAllocator once, so the 'try' should not be required and would indicate an unexpected setup was involved. i.e. better to fail and figure out if we need to support that setup.
* Add some clarifying comments around how replace allocator works.
* Add unit testing for setup where EP has local allocator that may get out of sync with values in the IExecutionProvider base class.
* Fix invalid check of whether data is on CPU to use device info instead of allocator name.
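
A minimal sketch of the keying scheme described above, using toy types rather than the real SessionState code: allocators are stored per memory type and device, the allocator name is not part of the key, and a second insert for the same key is reported as a failure instead of being silently ignored. `AllocatorKey`, `AllocatorRegistry`, and the allocator names used with them are hypothetical.

```cpp
#include <map>
#include <string>
#include <tuple>

// Toy stand-ins for illustration; not the real onnxruntime definitions.
struct AllocatorKey {
  int mem_type;     // stands in for OrtMemType
  int device_type;  // stands in for the OrtDevice type (CPU, GPU, ...)
  int device_id;    // stands in for the OrtDevice id

  bool operator<(const AllocatorKey& other) const {
    return std::tie(mem_type, device_type, device_id) <
           std::tie(other.mem_type, other.device_type, other.device_id);
  }
};

class AllocatorRegistry {
 public:
  // Returns false if an allocator is already registered for this key.
  // The allocator name is stored but deliberately not used for matching.
  bool Insert(const AllocatorKey& key, const std::string& allocator_name) {
    return allocators_.emplace(key, allocator_name).second;
  }

  // Returns the registered allocator name, or nullptr if there is none.
  const std::string* Find(const AllocatorKey& key) const {
    auto it = allocators_.find(key);
    return it == allocators_.end() ? nullptr : &it->second;
  }

 private:
  std::map<AllocatorKey, std::string> allocators_;
};
```

Under this scheme, whichever provider registers first for a given memory type and device supplies the allocator for it, which is why the order of allocator setup matters.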
2022-06-09 17:38:38 +10:00
RandySheriffH
8467af832f
Fix reduced pipeline by excluding test case standalone op (#11458)
* exclude reduce build from standalone op test

* exclude test from reduced op build
2022-05-06 16:19:49 -07:00
RandySheriffH
8d69b9398b
APIs for custom op to invoke ort operator directly (#10713)
* draft kernel creation

* setup eager context

* call into kernel in eager mode

* redefine test case

* refactor eager context

* add comment

* remove header

* rename argument

* redefine API definition with types

* list outputs as argument

* switch to int to represent length

* fix compile err

* create attribute API

* add test case for topk

* remove bool from c api

* add gru test case

* remove var

* fix compile warnings

* rename status

* fix compile err

* exclude sparse tensor

* fix comments

* fix comments

* fix build err

* rename file and move location

* format code

* move file to session folder

* fix comments

Co-authored-by: Randy <Randy@randysmac.attlocal.net>
2022-05-03 14:16:30 -07:00
Dmitri Smirnov
2700261f7c
Provide an API to supply external initializers data from user buffers (#11109)
Implement AddExternalInitializers
2022-04-07 12:21:53 -07:00