Commit graph

719 commits

Author SHA1 Message Date
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Fei Hu
943e156f4c
Allow custom ops to set input memory type (#10879) 2022-10-28 21:45:26 -07:00
cloudhan
fc12abf6b1
Enable/Disbale tunable GEMM by using tunable switch in provider options and env var (#13116)
Related PRs #12853

This allows the user enable/disbale tunable GEMM on demand.
2022-10-19 22:35:08 -07:00
Scott McKay
565da71275
Make 'env' argument to Session const (#13362)
### Description
<!-- Describe your changes. -->
The Env argument does not need to be mutable to call the underlying C
API. Update the Ort::Session ctor to have a const Env.

All other changes are from clang-format running. 

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Cleanup
2022-10-19 14:23:24 +10:00
Dmitri Smirnov
f5e3165cc3
Fix move Base::operator= (#13355)
### Description
Base::operator= move is broken, loses a valid ptr.

### Motivation and Context
Address
https://github.com/microsoft/onnxruntime/pull/13215#discussion_r997814275
2022-10-18 13:07:40 -07:00
Dmitri Smirnov
4a63cd0290
Improve thread pool creation failure handling. (#13313)
### Description
Detect and report thread creation failure on Windows.
Do not throw out of constructor after the thread is created,
the thread handle is lost and cannot be joined, resulting in a deadlock.

Make setting a thread priority on Linux consistent with windows.
Set thread priority in the thread itself. Log failure properly,
but do not exit the thread.

### Motivation and Context
Address issues https://github.com/microsoft/onnxruntime/issues/13291
And
https://github.com/microsoft/onnxruntime/issues/13285#issuecomment-1278063223
2022-10-15 17:57:19 -07:00
Dmitri Smirnov
f0fbff6dd4
Adjust docs to comply with Doxygen requirements (#13302)
### Description
Fix up param names in docs

### Motivation and Context
Make pipelines pass
2022-10-12 18:07:18 -07:00
cloudhan
1e55949a70
Fix unsound hipify in ROCm EP (#13269)
Some cuda related things is still left in the rocm ep statically
hipified code. Eliminate them to avoid confusion.
2022-10-12 08:32:42 +08:00
cloudhan
2cf5d04e3d
Fix clang-tidy(cppcoreguidelines-pro-bounds-array-to-pointer-decay) (#13241)
clang-tidy says "Do not implicitly decay an array into a pointer; consider using gsl::array_view or an explicit cast instead"

It is a false positive scattering around all our codebase when using
helper macros. It is becuase for function with 4 char name, say `main`,
the type of __FUNCTION__ and __PRETTY_FUNCTION__ is `char [5]`.
2022-10-11 13:16:48 +08:00
Dmitri Smirnov
5dae0c477d
Deprecate CustomApi and refactor public API for better safety and consistency (#13215)
### Description
Deprecate CustomOpApi and refactor dependencies for exception safety and
eliminate memory leaks.
Refactor API classes for clear ownership and semantics.
Introduce `InitProviderOrtApi()`

### Motivation and Context
Make public API better and safer.

Special note about `Ort::Unowned`. The class suffers from the following
problems:

1. It is not able to hold const pointers to the underlying C objects.
This forces users to `const_cast` and circumvent constness of the
returned object. The user is now able to call mutating interfaces on the
object which violates invariants and may be a thread-safety issue. It
also enables to take ownership of the pointer and destroy it
unintentionally (see examples below).
2. The objects that are unowned cannot be copied and that makes coding
inconvenient and at times unsafe.
3. It directly inherits from the type it `unowns`.

All of the above creates great conditions for inadvertent unowned object
mutations and destructions. Consider the following examples of object
slicing, one of them is from a real customer issue and the other one I
accidentally coded myself (and I am supposed to know how this works).
None of the below can be solved by aftermarket patches and can be hard
to diagnose.

#### Example 1 slicing of argument
```cpp
void SlicingOnArgument(Ort::Value& value) {
  // This will take possession of the input and if the argument
  // is Ort::Unowned<Ort::Value> it would again double free the ptr
  // regardless if it was const or not since we cast it away.
  Ort::Value output_values[] = {std::move(value)};
}

void main() {
  const OrtValue* ptr = nullptr;  // some value does not matter
  Ort::Unowned<Ort::Value> unowned{const_cast<OrtValue*>(ptr)};
  // onowned is destroyed when the call returns.
  SlicingOnArgument(unowned);
}
```

#### Example 2 slicing of return value
```cpp
// The return will be sliced to Ort::Value that would own and relase (double free the ptr)
Ort::Value SlicingOnReturn() {
  const OrtValue* ptr = nullptr; // some value does not matter
  Ort::Unowned<Ort::Value> unowned{const_cast<OrtValue*>(ptr)};
  return unowned;
}
```
2022-10-06 14:57:37 -07:00
Edward Chen
5c89c37f7f
Consolidate enabled/default kernel def type constraints (#13034)
Consolidate enabled/default kernel def type constraint types into enabled.
2022-09-27 14:04:15 -07:00
RandySheriffH
a83a9ed6b0
Remove miscellaneous nuphar configs (#13070)
Remove a handful of nuphar related configurations after deprecation.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-09-26 13:41:28 -07:00
Edward Chen
5f611b63a1
Make classes IKernelTypeStrResolver and IKernelLookup have protected destructors. (#13059) 2022-09-23 09:16:45 -07:00
wangxiyuan
952c99304a
Add CANN EP (#12416)
**Description**: This PR adds Ascend CANN execution provider support.

**Motivation and Context**
- Why is this change required? What problem does it solve?
As the info shown in the issue. CANN is the API layer for Ascend
processor. Add CANN EP can allow user run onnx model on Ascend hardware
via onnxruntime
  The detail change:
  1. Added CANN EP framework.
  2. Added the basic operators to support ResNet and VGG model.
  3. Added C/C++、Python API support
- If it fixes an open issue, please link to the issue here.
   https://github.com/microsoft/onnxruntime/issues/11477

Author: 
lijiawei <lijiawei19@huawei.com>
wangxiyuan <wangxiyuan1007@gmail.com>

Co-authored-by: FFrog <ljw1101.vip@gmail.com>
2022-09-22 14:53:40 -07:00
sfatimar
cccbe90764
Openvino ep 2022.2 v4.2 (#13023)
This changes are to align OV 2022.2 Release with ORT . Changes
CPU FP16 Support, dGPU Support, RHEL Dockerfile, Ubuntu 20 Dockerfile 

**Motivation and Context**
- This change is required to ensure ORT-OpenVINO Execution Provider is
aligned with latest changes.
- If it fixes an open issue, please link to the issue here.

Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: shamaksx <shamax.kshirsagar@intel.com>
Co-authored-by: pratiksha <pratikshax.bapusaheb.vanse@intel.com>
Co-authored-by: pratiksha <mohsinx.mohammad@intel.com>
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: nmaajidk <n.maajid.khan@intel.com>
Co-authored-by: Mateusz Tabaka <mateusz.tabaka@intel.com>
Co-authored-by: intel <intel@iotgecsp-nuc04.iind.intel.com>
2022-09-22 12:31:40 -07:00
Edward Chen
454f77cd94
Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791)
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.

# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
2022-09-20 14:24:59 -07:00
Cheng
f26054deca
[XNNPACK] Support running in multi-thread with seperate pthreadpool (#11762)
**Description**: Describe your changes.
XNNPACK takes pthreadpool as its internal threadpool implemtation, it
couples calculation and parallelization. Thus it's impossible to
leverage ORT's threadpool (EIGEN/OPENMP based). So we enabled
pthreadpool in XNNPACK EP in this PR.

Case 1:  Pthreadpool coexist with ORT-threadpool simply
Expriments setup
hardware:RedMi8A with 8 cores, ARMv7
The two threadpool has the same pool size form 1 to 8.
Two models: mobilenet_v2 and mobilenet_egetppu.
we can see the picture below and draw a conclusion, latency are even
higher from 5 threads or more.


![image](https://user-images.githubusercontent.com/9417365/190550127-2304adfe-97ac-4aeb-91a0-4606b5305a82.png)

Case 2:
For the reason of performance regression with 5 more threads,
ORT-threads are spinning on CPU and diddn't realease it after
computation finished. It's equivalent of creating 5x2 threads for
parallelization while we have only 8 cpu cores.
So I mannuly disabled spinning after ort-threadpool finished and enabled
it when enter ort-threadpool.
The result is quite normal now.

![image](https://user-images.githubusercontent.com/9417365/190675230-0d85dd02-01f0-4255-967d-e3dbb2a1fe52.png)


Case 3:
Even we achieved a reasonable results with disabling spinning, Will
ORT-threadpool still impact performance of pthreadpool?
we have expriment setting up as: Setting ORT-threadpool size
(intra_thread_num) as 1, and only pthreadpool created.
Attention that, almost a third of ops are running by CPU EP. we are
surprisingly find that disabling ort-threadpool is even better in
performance than creating two threadpool.


![image](https://user-images.githubusercontent.com/9417365/190556480-d6507396-d777-44fc-94e1-938d2b9bb7d7.png)


Case 4:
Use a unified threadpool between CPU ep and XNNPACK ep.
It's the fastest among all. But if we take the similar workload
partition strategy as ORT-threadpool, it could be faster.


![image](https://user-images.githubusercontent.com/9417365/190674908-a68fd20f-bdf4-41f9-bf0a-76b304cda490.png)

**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.

Co-authored-by: Jicheng Wen <jicwen@microsoft.com>
2022-09-20 16:02:15 +08:00
Tang, Cheng
739b5675c8
remove legacy compile api (#12932)
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-09-15 13:18:40 -07:00
Dmitri Smirnov
bc2df1bf95
Remove previously deprecated API (#12935)
Remove previously deprecated API
Format JS code, address review comments
NPM Formatting
2022-09-14 10:58:03 -07:00
Scott McKay
1016c33519
Fix prefast warning in upsample.cc. (#12938)
* Fix prefast warning.
* Fix some other static analysis warnings.
2022-09-14 08:14:33 +10:00
Cheng
8cedafe250
[xnnpack] Have Initializer in Mobile related EPs in Minimal_build and creating EP specific dynamic-schema (#12555)
* Remove the dependence of Qlinearsoftmax schema

* refactor initializerview &&  create shared schema

* Dynamic Create EP specific schema

* Have Initializer in minimal_build

* address comments

* remove CancelFuseSubGraph
2022-09-06 14:32:15 +08:00
ashbhandare
27dde0b51f
Csharp bindings for on-device training APIs (#12404) 2022-09-02 13:13:48 -07:00
Yulong Wang
82a28cc2c3
upgrade emsdk to 3.1.19 (#12690)
* upgrade emsdk to 3.1.19

* fix build break

* ignore '-Wunused-but-set-variable' in eigen

* add malloc and free in exported functions

* EXPORTED_FUNCTIONS
2022-08-30 13:42:45 -07:00
Baiju Meswani
b83ea3c2ff
Address prefast static analysis warnings (#12756) 2022-08-29 10:09:32 -07:00
mwootton
817dc94345
Add first pass of rocm kernel profiler (#10911)
* Add first pass of rocm kernel profiler

* Clean up rocm_profiler. Format args. Demangle kernel names.
Add Api EventRecords

* Remove debug output

* Temporarily disable profiling unit test 'api record check' for cupti

* Fix compile error for non-gpu builds

* Use common file for demangle and pid/tid.  Namespace ThreadUtil.  Fix gpu buffer clearing.

* Merge demangle into profiler_common

* Merge demangle into profiler_common part 2

* Style cleanup

* Resolve linking issues via ProviderHost interface

* Demangle cuda kernel names

* Clean up comments

* Fix formatting

* Fix anal retentive formatting
2022-08-26 19:38:03 -07:00
edgchen1
c270ea148a Move 'using common::Status;' from common.h to status.h. 2022-08-26 15:05:53 -07:00
Yulong Wang
c144acc534
Replace 'master' branch ref to 'main' in the code (#12547) 2022-08-22 10:48:12 -07:00
Dmitri Smirnov
9481893b58
Replace to lock_guard as lighter class for locking (#12616)
Replace to lock_guard as lighter class
2022-08-17 11:08:31 -07:00
Haoming Chen
8a038b9b0c
Fix a build error (#12600)
LLVM compiler complains the std::hash<const char*> and suggests std::hash<const void*>. But the intention is to hash the name string instead of the pointer. So use std::hash<std::string> to be explicit.
2022-08-17 10:49:54 -07:00
Scott McKay
0b0c51e028
Support direct usage of ORT format model flatbuffer for initializers (#12465)
* Add ability to use ORT format model flatbuffer directly for intiializers by leveraging the TensorProto external data infrastructure.

Requires user to provide ORT format model bytes when creating the session, and set both `session.use_ort_model_bytes_directly` and `session.use_ort_model_bytes_for_initializers` to 1 in SessionOptions config entries (AddSessionConfigEntry in C API).
2022-08-12 18:31:43 +10:00
Changming Sun
ac7538b909
Remove CUDA 10.2 support (#12541) 2022-08-10 22:46:41 -07:00
Dmitri Smirnov
c10704a501
Use alignas instead of naive padding to avoid false cache sharing (#12514)
PerThread and ChildThreadStat alignas
2022-08-10 11:23:20 -07:00
Cheng
64e991a9fc
[Qlinearsoftmax] contrib cpu (#12177)
* [Qlinearsoftmax] contrib cpu

* int8 implementation

* contrib operator md

* qdq transformer test

* new attribute: opset

* doc

* quantized tool

* remove template to reduce Binary size

* doc of contribe operators

* enforce x_shape is valid

* fix reduce_size if input-shape is dynamic

* add UT

* register one op for reducing binarysize

* kernel hash update

* docs/ContribOperators.md
2022-08-10 10:52:02 +08:00
Hector Li
730240d2a5
remove the link the comments (#12510) 2022-08-08 15:20:40 -07:00
Scott McKay
8d830adf24
Rework parts of Graph::Resolve to reduce memory usage (#12176)
* Rework some aspects of Graph::Resolve to reduce memory usage.
2022-08-05 13:20:25 +10:00
Dmitri Smirnov
a4ef0e7f7b
Remove dynamic allocation for ThreadPool ParallelSection (#12429)
Use InlinedVector in a TP
Store per thread parallel section in std::optional and avoid memory allocation
2022-08-04 09:46:16 -07:00
Ryan Hill
52d4699788
Minor doc fixes (#12388) 2022-08-03 19:47:36 -07:00
Hariharan Seshadri
d5a1c01b38
Add C++ Session ctor taking model bytes and OrtPrepackedWeightsContainer (#12333) 2022-07-29 12:32:43 -07:00
Yateng Hong
c579497134
Fix TRT custom op issue (#12283)
* Pass schema registry on CreateModel.

* Fix ORT_MINIMAL_BUILD.

* Fix build issue.
2022-07-29 03:39:56 -07:00
Ryan Hill
3e014a5e5d
Fix C header to stop people accidentally copying the OrtApi by value (#12297)
* Fix C header to stop people accidentally copying the OrtApi by value
* Remove api_ from KernelTwo
2022-07-25 19:19:40 -07:00
Dmitri Smirnov
3bf614fd47
Eliminate memory allocations per recent profiling (#12225)
* Alloc begin

FeedsFetches refactoring
Refactor Tensor class
Fix buffer deletor
Remove new/delete deleted
Adjust alloc move
Fix up xnnpack provider
Clarifying the comment on Create()
2022-07-25 14:14:38 -07:00
Ashwini Khade
ceb76429db
Merge pull request #12056 from microsoft/bmeswani/merge-training_dev/on_device_poc
Merge On-Device-Training Offline Tooling and C/C++ APIs
2022-07-21 15:09:48 -07:00
Baiju Meswani
cbf08c7a7b Make GetTrainingApi as a part of the OrtApis, add Training API documentation and address other pull request review comments 2022-07-21 18:11:48 +00:00
Dmitri Smirnov
4f106d2b3b
Eliminate unnecessary status lock acquisition in TP (#12196)
Eliminate unnecessary status lock acquisition in the Thread Pool
2022-07-19 14:16:12 -07:00
Chen Fu
040c2f4517
x86/64 U8S8 Gemm Precision Fix (#12088)
Add a graph optimization that convert u8s8 matrix multiplication to u8u8 if needed

In x86/64 platforms, specifically SSE4.1, AVX2 and AVX512 CPUs provide better performance computing u8s8 matrix multiplications. Unfortunately, the higher performance comes with value overflow problems, as described in:
https://www.intel.com/content/www/us/en/develop/documentation/onednn-developer-guide-and-reference/top/advanced-topics/nuances-of-int8-computations.html

In this change we added a session option "session.x64quantprecision" (default off). For operators that calls u8s8 matrix multiplications, e.g. QAttention, we convert them to u8u8 when the following conditions are all satisfied:

1. Current CPU is SSE4.1, AVX2 or AVX512 with no VNNI support
2. Session option "session.x64quantprecision" is on.
3. Constant weight tensor contains values outside of [-64, 63] range

Note that when weight tensor is not constant, QDQS8ToU8Transformer should already convert it to u8.
2022-07-13 10:12:25 -07:00
Baiju Meswani
a457ddc41d Merge branch 'master' of https://github.com/microsoft/onnxruntime into bmeswani/merge_pr 2022-06-30 21:53:07 +00:00
Baiju Meswani
6e8edfff0c
Separate training apis from shared core apis (#12027) 2022-06-29 14:12:29 -07:00
RandySheriffH
d5fcb432fa
Generalize native op creation (#11539)
* create op from ep

* read input count from context

* create holder to host nodes

* fix typo

* cast type before comparison

* throw error on API fail

* silence warning from minimal build

* switch to unique_ptr with deleter to host nodes

* fix typo

* fix build err for minimal

* fix build err for minimal

* add UT for conv

* enable test on CUDA

* add comment

* fix typo

* use gsl::span and string view for Node constructor

* Added two APIs - CopyKernelInfo and ReleaseKernelInfo

* pass gsl::span by value

* switch to span<NodeArg* const> to allow for reference to const containers

* fix typo

* fix reduced build err

* fix reduced build err

* refactoring node construction logic

* rename exceptions

* add input and output count as arguments for op creation

* refactor static member

* use ORT_CATCH instead of catch

* cancel try catch

* add static value name map

* format input definition and set err code

* fix comments

* fix typo
2022-06-27 21:12:15 -07:00
Baiju Meswani
d25cf4df26 Merge branch 'master' into training_dev/on_device_poc 2022-06-24 20:18:19 +00:00
Dmitri Smirnov
088bc7494b
Deprecate APIs returning raw ptrs and provide replacements (#11922)
Provider better documentation
2022-06-24 09:50:04 -07:00