Commit graph

737 commits

Author SHA1 Message Date
Changming Sun
fc2a6db573
Update absl to the latest release (#13990)
### Description
Update absl to a new version

### Motivation and Context
The new version contains fixes that are needed for Nvidia GPU build.
Once we update it to that version, we don't need to maintain our private
patches for Nvidia GPU build.
2022-12-19 14:25:13 -08:00
FFFrog
6705915af8
[CANN] Add the ability to run graph (#13728)
### Description
Add the ability to run graph

### Motivation and Context
A brief description is as follows:
1) If the whole graph is supported, then will be processed by the graph
engine, directly.
2) If the whole graph is not supported, the whole graph will be divided
into subgraphs and single operators; The sub-graphs will be run on graph
engine, and the single operators will fallback to the traditional mode.
2022-12-16 06:57:40 -08:00
Abhishek Udupa
c882601425
Add noexcept annotation to address prefast warnings (#13965)
### Description
Add noexcept annotations to move constructors and assignment ops to
address prefast warnings.
(see
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/11012/)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2022-12-15 09:44:22 -08:00
Tang, Cheng
a81faee41e
Multi-stream execution support (#13495)
**Description**: This PR including following works:
1. provide stream and related synchronization abstractions in
onnxruntime.
2. enhance onnxruntime's execution planner / executor / memory arena to
support execute multiple streams in parallel.
3. deprecate the parallel executor for cpu.
4. deprecate the Fence mechanism. 
5. update the cuda / tensorrt EP to support the stream mechanism,
support running different request in different cuda stream.

**Motivation and Context**
- Why is this change required? 
currently, the execution plan is just a linear list of those primitives,
ort will execute them step by step. For any given graph, ORT will
serialize it to a fixed execution order. This sequential execution
design simplifies most scenarios, but it has the following limitations:
1. it is difficult to enable inter-node parallelization, we have a
half-baked parallel executor but it is very difficult to make it work
with GPU.
2. The fence mechanism can work with single gpu stream + cpu thread
case, but when extend to multiple stream, it is difficult to manage the
cross GPU stream synchronizations.
3. our cuda EP rely on the BFCArena to make the memory management work
with the GPU async kernels, but current BFCArena is not aware of the
streams, so it doesn't behavior correctly when run with multiple
streams.

This PR enhance our existing execution plan and executor to support
multiple stream execution. we use an unified algorithm to mange both
single stream and multiple stream scenarios.
This PR mainly focus on the infrastructure support for multiple stream
execution, that is said, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be in the future PR.

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
2022-12-15 07:39:29 -08:00
Jeff Daily
c9edc01c0b
[ROCm] float16.h should use __HIP__ not USE_ROCM (#13684)
The float16.h header is shared between the CPU and ROCm EPs. The
USE_ROCM macro is defined universally, but for the float16.h header we
only wish to detect the hip-clang compiler. Otherwise, the CPU EP fails
to build because of -Werror -Wuninitialized caused by the USE_ROCM code
additions, and the CPU EP should be using a different code path.
2022-12-13 15:34:42 -08:00
RandySheriffH
75584c5fa8
Enabling thread pool to be numa-aware (#13778)
The PR enables ort thread pool to be numa-aware, so that threads could
be evenly created and distributed among numa nodes.
In addition, to facilitate performance tuning, the PR opens a new API
allowing customers to attach threads to certain logical processors.
Please check the API
[definition](https://github.com/microsoft/onnxruntime/pull/13778/files#diff-5845a5c76fb64abdc8f0cffe21b37f8da1712674eb3abc4cd87190891be1bd48)
for details.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-12-12 10:33:55 -08:00
JiCheng
22fa62152a
Pass SessionOptions to XnnpackProviderFactoryCreator. (#13318)
### Description
To pass session_options to Xnnpack EP via
`XnnpackProviderFactoryCreator` for Initializing xnnpack's threadpool.

If you want to use different threadpool size or even disable xnnpack's
threadpool, just setting intra_threadpool to 1 by xnnpack EP's
provider_options.


### Motivation and Context

Co-authored-by: Guangyun Han <guangyunhan@microsoft.com>
Co-authored-by: Jicheng Wen <jicwen@microsoft.com>
2022-12-10 14:23:46 +08:00
Abhishek Udupa
83c59d2594
Session-aware and thread-safe CUDA profiler (#13706)
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.

### Motivation and Context
[PR 13549](https://github.com/microsoft/onnxruntime/pull/13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.


[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)

[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)

[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2022-12-09 13:22:12 -08:00
Ashwini Khade
983877c712
Decouple strided tensor support from ENABLE_TRAINING (#13829)
### Description
Decouple strided tensor support from ENABLE_TRAINING

### Motivation and Context
This is step 1 for creating a dedicated build for on device training.
Intention is

1. We can set ENABLE_STRIDED_TENSORS in cmake when either
ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, this way we
dont have to use if defined(ENABLE_TRAINING) ||
defined(ENABLE_TRAINING_ON_DEVICE ) everywhere in the code.

2. This also paves the way to easily enable strided tensor support for
inference in future (if required).
2022-12-07 09:22:21 -08:00
stevenlix
ce0025d3f2
Fallback Pow op in layer norm to FP32 in TRT to avoid overflow (#13639)
Accuracy loss is observed when transformer models such as BERT, DeBERTa,
ViT are running in TRT FP16 mode. The cause is that overflow happens at
Pow op in layer norm.
This PR provides the option to force Pow to run in TRT FP32 precision if
overflow occurs.

Co-authored-by: Ubuntu <azureuser@orteplinuxdev.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>
2022-11-29 13:37:31 -08:00
Changming Sun
87e6a26c5d
Enforce Prefast check in Windows CPU CI pipeline (#13735)
Right now we fix the warnings in an ad-hoc way. We run static analysis
in nightly builds, then create work items for the finding it found. Our
CI build pipelines run the same scan but do not break the build. So,
this PR will fix the remaining findings in the CPU EP(including the
training part) and enforce the check. Later on we can continue to expand
the scope.

We still have some warnings left in the JNI part. I will try to address
them later in the next month.
2022-11-23 09:25:02 -08:00
cloudhan
9e649d1ac4
Allow CUDA EP enable or disable TunableOp via session options and environment variable (#13601)
This ports #13116 from ROCm EP to CUDA EP
2022-11-15 14:43:54 +08:00
Abhishek Udupa
9954454c65
Make the ROCM profiler thread-safe, session-aware and preserve logical ordering between CPU and GPU events (#13549)
### Description
The existing ROCM profiler has a few shortcomings, which this PR fixes.

### Motivation and Context
The existing ROCM profiler:
1. Is not thread-safe
2. Is not session-aware: i.e., if multiple inference sessions enable
profiling, then events (esp GPU events) get mixed up between the
sessions
3. Has some issues with respect to coding standards.

This PR addresses all of the above by cleanly re-implementing parts of
the ROCM profiler as required.

Attached are 4 profile outputs from a multi-session run of the
StableDiffusion model, as well as a quick-and-dirty script that checks
the profile outputs for the invariants claimed.


[sd_profile_outputs.tar.gz](https://github.com/microsoft/onnxruntime/files/9924608/sd_profile_outputs.tar.gz)


[check_profile_output_wellformedness.zip](https://github.com/microsoft/onnxruntime/files/9924614/check_profile_output_wellformedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2022-11-10 10:25:41 -08:00
Edward Chen
215732f74b
Ignore saved runtime optimizations when updating ORT format model <v5. (#13393)
The old runtime optimization format is not readily convertible to the new one without extra information for translating kernel def hashes.
Ignore such saved runtime optimizations and output a warning for now.
2022-11-08 13:36:46 -08:00
yf711
8b9065a396
Add getter/setter of C# OrtEnv log level (#13402)
### Description
* Add getter/setter to access and update C# OrtEnv log level
* Add C API about updating ort env with custom log level to support the
setter above (Following [pybind
implementation](952c99304a/onnxruntime/python/onnxruntime_pybind_state.cc (L923-L924)))
* Add test case to verify getter & setter


### Motivation and Context
* For C++/Python, the log level can be adjusted via OrtEnv, and this
feature is missing in C# binding
2022-11-04 21:46:00 -07:00
pengwa
a3e7da60e7
Trade subgraph recompute for memory (#12852)
**Description**: Subgraph-level recompute

This PR adds an optional capability trading additional re-computation
for better memory efficiency. Specifically, a pre-defined operator list
used to iterate the Graph to find some subgraphs for recompute, to
reduce some stashed activations whose lifetime across forward and
backward pass.

When training with ORTModule, by default, the graph transformer will
scan the execution graph to find all eligible subgraph to recompute,
along with sizes that can save. An example looks like below.
If we want to enable some of them to recompute, we can define env
variable this way:
`export
ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```

[1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>:  User config:
[1,0]<stderr>:  Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>:  =================================
[1,0]<stderr>:  Subgraph: BitmaskDropout+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 1,024 x   Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: BiasGelu+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:labels_dim0 x      Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:23
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Add+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x     Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Sub+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:1,024 x 1,024 x    Frequency:97
[1,0]<stderr>:                  PatternShape:3 x 1,024 x        Frequency:1
[1,0]<stderr>:                  PatternShape:8 x 64 x   Frequency:24
[1,0]<stderr>:                  PatternShape:1,024 x 4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x 1,024 x    Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  =================================
```


"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" means what kind of subgraph will be recomputed, in this case,
it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.

**Baseline**

On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100
GPUs. With latest main branch, we can run batch size 16, and the maximum
batch size < 32. So 16 is usually chosen by data scientists. 65% of 40GB
memory is used during training. The SamplesPerSec=479.2543353561354.


![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png)

**With this PR**

Gelu is recomputed for saving memory peak, batch size 32 can be run. The
97% of 40GB A100 is used, the SamplesPerSec=562.041593991271 (**1.17X**
of baseline).


![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)


**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.
2022-11-03 13:49:41 +08:00
Wei-Sheng Chin
b5904c40dd
Enable ORT in TorchDynamo (#13259)
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
2022-11-01 11:19:29 -07:00
Adrian Lizarraga
9d867a07c0
Fix regression in CustomOpApi::GetTensorData (#13450)
- Reverts change to CustomOpApi::GetTensorData introduced by commit 5dae0c477d,
which causes infinite recursion.
- Moves EndsProfilingAllocated to non-const session implementation
(C++ API header).
2022-10-31 12:20:49 -07:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Fei Hu
943e156f4c
Allow custom ops to set input memory type (#10879) 2022-10-28 21:45:26 -07:00
cloudhan
fc12abf6b1
Enable/Disbale tunable GEMM by using tunable switch in provider options and env var (#13116)
Related PRs #12853

This allows the user enable/disbale tunable GEMM on demand.
2022-10-19 22:35:08 -07:00
Scott McKay
565da71275
Make 'env' argument to Session const (#13362)
### Description
<!-- Describe your changes. -->
The Env argument does not need to be mutable to call the underlying C
API. Update the Ort::Session ctor to have a const Env.

All other changes are from clang-format running. 

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Cleanup
2022-10-19 14:23:24 +10:00
Dmitri Smirnov
f5e3165cc3
Fix move Base::operator= (#13355)
### Description
Base::operator= move is broken, loses a valid ptr.

### Motivation and Context
Address
https://github.com/microsoft/onnxruntime/pull/13215#discussion_r997814275
2022-10-18 13:07:40 -07:00
Dmitri Smirnov
4a63cd0290
Improve thread pool creation failure handling. (#13313)
### Description
Detect and report thread creation failure on Windows.
Do not throw out of constructor after the thread is created,
the thread handle is lost and cannot be joined, resulting in a deadlock.

Make setting a thread priority on Linux consistent with windows.
Set thread priority in the thread itself. Log failure properly,
but do not exit the thread.

### Motivation and Context
Address issues https://github.com/microsoft/onnxruntime/issues/13291
And
https://github.com/microsoft/onnxruntime/issues/13285#issuecomment-1278063223
2022-10-15 17:57:19 -07:00
Dmitri Smirnov
f0fbff6dd4
Adjust docs to comply with Doxygen requirements (#13302)
### Description
Fix up param names in docs

### Motivation and Context
Make pipelines pass
2022-10-12 18:07:18 -07:00
cloudhan
1e55949a70
Fix unsound hipify in ROCm EP (#13269)
Some cuda related things is still left in the rocm ep statically
hipified code. Eliminate them to avoid confusion.
2022-10-12 08:32:42 +08:00
cloudhan
2cf5d04e3d
Fix clang-tidy(cppcoreguidelines-pro-bounds-array-to-pointer-decay) (#13241)
clang-tidy says "Do not implicitly decay an array into a pointer; consider using gsl::array_view or an explicit cast instead"

It is a false positive scattering around all our codebase when using
helper macros. It is becuase for function with 4 char name, say `main`,
the type of __FUNCTION__ and __PRETTY_FUNCTION__ is `char [5]`.
2022-10-11 13:16:48 +08:00
Dmitri Smirnov
5dae0c477d
Deprecate CustomApi and refactor public API for better safety and consistency (#13215)
### Description
Deprecate CustomOpApi and refactor dependencies for exception safety and
eliminate memory leaks.
Refactor API classes for clear ownership and semantics.
Introduce `InitProviderOrtApi()`

### Motivation and Context
Make public API better and safer.

Special note about `Ort::Unowned`. The class suffers from the following
problems:

1. It is not able to hold const pointers to the underlying C objects.
This forces users to `const_cast` and circumvent constness of the
returned object. The user is now able to call mutating interfaces on the
object which violates invariants and may be a thread-safety issue. It
also enables to take ownership of the pointer and destroy it
unintentionally (see examples below).
2. The objects that are unowned cannot be copied and that makes coding
inconvenient and at times unsafe.
3. It directly inherits from the type it `unowns`.

All of the above creates great conditions for inadvertent unowned object
mutations and destructions. Consider the following examples of object
slicing, one of them is from a real customer issue and the other one I
accidentally coded myself (and I am supposed to know how this works).
None of the below can be solved by aftermarket patches and can be hard
to diagnose.

#### Example 1 slicing of argument
```cpp
void SlicingOnArgument(Ort::Value& value) {
  // This will take possession of the input and if the argument
  // is Ort::Unowned<Ort::Value> it would again double free the ptr
  // regardless if it was const or not since we cast it away.
  Ort::Value output_values[] = {std::move(value)};
}

void main() {
  const OrtValue* ptr = nullptr;  // some value does not matter
  Ort::Unowned<Ort::Value> unowned{const_cast<OrtValue*>(ptr)};
  // onowned is destroyed when the call returns.
  SlicingOnArgument(unowned);
}
```

#### Example 2 slicing of return value
```cpp
// The return will be sliced to Ort::Value that would own and relase (double free the ptr)
Ort::Value SlicingOnReturn() {
  const OrtValue* ptr = nullptr; // some value does not matter
  Ort::Unowned<Ort::Value> unowned{const_cast<OrtValue*>(ptr)};
  return unowned;
}
```
2022-10-06 14:57:37 -07:00
Edward Chen
5c89c37f7f
Consolidate enabled/default kernel def type constraints (#13034)
Consolidate enabled/default kernel def type constraint types into enabled.
2022-09-27 14:04:15 -07:00
RandySheriffH
a83a9ed6b0
Remove miscellaneous nuphar configs (#13070)
Remove a handful of nuphar related configurations after deprecation.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-09-26 13:41:28 -07:00
Edward Chen
5f611b63a1
Make classes IKernelTypeStrResolver and IKernelLookup have protected destructors. (#13059) 2022-09-23 09:16:45 -07:00
wangxiyuan
952c99304a
Add CANN EP (#12416)
**Description**: This PR adds Ascend CANN execution provider support.

**Motivation and Context**
- Why is this change required? What problem does it solve?
As the info shown in the issue. CANN is the API layer for Ascend
processor. Add CANN EP can allow user run onnx model on Ascend hardware
via onnxruntime
  The detail change:
  1. Added CANN EP framework.
  2. Added the basic operators to support ResNet and VGG model.
  3. Added C/C++、Python API support
- If it fixes an open issue, please link to the issue here.
   https://github.com/microsoft/onnxruntime/issues/11477

Author: 
lijiawei <lijiawei19@huawei.com>
wangxiyuan <wangxiyuan1007@gmail.com>

Co-authored-by: FFrog <ljw1101.vip@gmail.com>
2022-09-22 14:53:40 -07:00
sfatimar
cccbe90764
Openvino ep 2022.2 v4.2 (#13023)
This changes are to align OV 2022.2 Release with ORT . Changes
CPU FP16 Support, dGPU Support, RHEL Dockerfile, Ubuntu 20 Dockerfile 

**Motivation and Context**
- This change is required to ensure ORT-OpenVINO Execution Provider is
aligned with latest changes.
- If it fixes an open issue, please link to the issue here.

Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: shamaksx <shamax.kshirsagar@intel.com>
Co-authored-by: pratiksha <pratikshax.bapusaheb.vanse@intel.com>
Co-authored-by: pratiksha <mohsinx.mohammad@intel.com>
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: nmaajidk <n.maajid.khan@intel.com>
Co-authored-by: Mateusz Tabaka <mateusz.tabaka@intel.com>
Co-authored-by: intel <intel@iotgecsp-nuc04.iind.intel.com>
2022-09-22 12:31:40 -07:00
Edward Chen
454f77cd94
Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791)
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.

# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
2022-09-20 14:24:59 -07:00
Cheng
f26054deca
[XNNPACK] Support running in multi-thread with seperate pthreadpool (#11762)
**Description**: Describe your changes.
XNNPACK takes pthreadpool as its internal threadpool implemtation, it
couples calculation and parallelization. Thus it's impossible to
leverage ORT's threadpool (EIGEN/OPENMP based). So we enabled
pthreadpool in XNNPACK EP in this PR.

Case 1:  Pthreadpool coexist with ORT-threadpool simply
Expriments setup
hardware:RedMi8A with 8 cores, ARMv7
The two threadpool has the same pool size form 1 to 8.
Two models: mobilenet_v2 and mobilenet_egetppu.
we can see the picture below and draw a conclusion, latency are even
higher from 5 threads or more.


![image](https://user-images.githubusercontent.com/9417365/190550127-2304adfe-97ac-4aeb-91a0-4606b5305a82.png)

Case 2:
For the reason of performance regression with 5 more threads,
ORT-threads are spinning on CPU and diddn't realease it after
computation finished. It's equivalent of creating 5x2 threads for
parallelization while we have only 8 cpu cores.
So I mannuly disabled spinning after ort-threadpool finished and enabled
it when enter ort-threadpool.
The result is quite normal now.

![image](https://user-images.githubusercontent.com/9417365/190675230-0d85dd02-01f0-4255-967d-e3dbb2a1fe52.png)


Case 3:
Even we achieved a reasonable results with disabling spinning, Will
ORT-threadpool still impact performance of pthreadpool?
we have expriment setting up as: Setting ORT-threadpool size
(intra_thread_num) as 1, and only pthreadpool created.
Attention that, almost a third of ops are running by CPU EP. we are
surprisingly find that disabling ort-threadpool is even better in
performance than creating two threadpool.


![image](https://user-images.githubusercontent.com/9417365/190556480-d6507396-d777-44fc-94e1-938d2b9bb7d7.png)


Case 4:
Use a unified threadpool between CPU ep and XNNPACK ep.
It's the fastest among all. But if we take the similar workload
partition strategy as ORT-threadpool, it could be faster.


![image](https://user-images.githubusercontent.com/9417365/190674908-a68fd20f-bdf4-41f9-bf0a-76b304cda490.png)

**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.

Co-authored-by: Jicheng Wen <jicwen@microsoft.com>
2022-09-20 16:02:15 +08:00
Tang, Cheng
739b5675c8
remove legacy compile api (#12932)
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-09-15 13:18:40 -07:00
Dmitri Smirnov
bc2df1bf95
Remove previously deprecated API (#12935)
Remove previously deprecated API
Format JS code, address review comments
NPM Formatting
2022-09-14 10:58:03 -07:00
Scott McKay
1016c33519
Fix prefast warning in upsample.cc. (#12938)
* Fix prefast warning.
* Fix some other static analysis warnings.
2022-09-14 08:14:33 +10:00
Cheng
8cedafe250
[xnnpack] Have Initializer in Mobile related EPs in Minimal_build and creating EP specific dynamic-schema (#12555)
* Remove the dependence of Qlinearsoftmax schema

* refactor initializerview &&  create shared schema

* Dynamic Create EP specific schema

* Have Initializer in minimal_build

* address comments

* remove CancelFuseSubGraph
2022-09-06 14:32:15 +08:00
ashbhandare
27dde0b51f
Csharp bindings for on-device training APIs (#12404) 2022-09-02 13:13:48 -07:00
Yulong Wang
82a28cc2c3
upgrade emsdk to 3.1.19 (#12690)
* upgrade emsdk to 3.1.19

* fix build break

* ignore '-Wunused-but-set-variable' in eigen

* add malloc and free in exported functions

* EXPORTED_FUNCTIONS
2022-08-30 13:42:45 -07:00
Baiju Meswani
b83ea3c2ff
Address prefast static analysis warnings (#12756) 2022-08-29 10:09:32 -07:00
mwootton
817dc94345
Add first pass of rocm kernel profiler (#10911)
* Add first pass of rocm kernel profiler

* Clean up rocm_profiler. Format args. Demangle kernel names.
Add Api EventRecords

* Remove debug output

* Temporarily disable profiling unit test 'api record check' for cupti

* Fix compile error for non-gpu builds

* Use common file for demangle and pid/tid.  Namespace ThreadUtil.  Fix gpu buffer clearing.

* Merge demangle into profiler_common

* Merge demangle into profiler_common part 2

* Style cleanup

* Resolve linking issues via ProviderHost interface

* Demangle cuda kernel names

* Clean up comments

* Fix formatting

* Fix anal retentive formatting
2022-08-26 19:38:03 -07:00
edgchen1
c270ea148a Move 'using common::Status;' from common.h to status.h. 2022-08-26 15:05:53 -07:00
Yulong Wang
c144acc534
Replace 'master' branch ref to 'main' in the code (#12547) 2022-08-22 10:48:12 -07:00
Dmitri Smirnov
9481893b58
Replace to lock_guard as lighter class for locking (#12616)
Replace to lock_guard as lighter class
2022-08-17 11:08:31 -07:00
Haoming Chen
8a038b9b0c
Fix a build error (#12600)
LLVM compiler complains the std::hash<const char*> and suggests std::hash<const void*>. But the intention is to hash the name string instead of the pointer. So use std::hash<std::string> to be explicit.
2022-08-17 10:49:54 -07:00
Scott McKay
0b0c51e028
Support direct usage of ORT format model flatbuffer for initializers (#12465)
* Add ability to use ORT format model flatbuffer directly for intiializers by leveraging the TensorProto external data infrastructure.

Requires user to provide ORT format model bytes when creating the session, and set both `session.use_ort_model_bytes_directly` and `session.use_ort_model_bytes_for_initializers` to 1 in SessionOptions config entries (AddSessionConfigEntry in C API).
2022-08-12 18:31:43 +10:00
Changming Sun
ac7538b909
Remove CUDA 10.2 support (#12541) 2022-08-10 22:46:41 -07:00
Dmitri Smirnov
c10704a501
Use alignas instead of naive padding to avoid false cache sharing (#12514)
PerThread and ChildThreadStat alignas
2022-08-10 11:23:20 -07:00