Commit graph

1222 commits

Author SHA1 Message Date
PeixuanZuo
4eac0db3af
[ROCm] Add GemmFastGelu CK implementation (#13759)
### Description

Add GemmFastGelu CK implementation.

TODO
1. The performance of CK GemmFastGelu in ORT is not as good as using CK
directly. We still need to investigate the reason and improve the CK
integration in ORT.
`GemmFastGeluUnfused float16 NN m=49152 n=3072 k=768 2298.8064 us 100.89
tflops`
`withbias DeviceGemmMultipleD_Xdl_CShuffle<256, 256, 128, 32, 8, 8,
Default> LoopScheduler: Default, PipelineVersion: v1 float16 NN m=49152
n=3072 k=768 2401.9799 us 96.56 tflops`

### Motivation and Context

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-05 17:53:30 +08:00
Adrian Lizarraga
68794d0ac1
Improve custom op library handle cleanup (#14099)
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`

### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).

The new API automatically cleans up the library and allows the user to
write more straightforward code.
2023-01-04 17:56:29 -08:00
cao lei
b29a1c7348
Address follow-up comments on multistream pr #13495 (#13992)
### Description
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Changes include:

- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr


### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-01-03 16:33:36 -08:00
Ashwini Khade
68b5b2d7d3
Refactor training build options (#13964)
### Description
1. Renames all references of "on device training" to "training apis". This
keeps the naming general; nothing really prevents us from using the
same apis on servers/non-edge devices.
2. Updates the ENABLE_TRAINING option: with this PR, when this option is
enabled, training apis and torch interop are also enabled.
3. Refactors the onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
   - Removes the user-facing option.
   - Sets onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
     onnxruntime_ENABLE_TRAINING is ON, as we always build with torch interop.

Once this PR is merged when --enable_training is selected we will do a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs

Features include:
1. ATen Fallback
2. All training ops, including communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (front-end tools for training artifacts prep when using
training apis)

### Motivation and Context
The intention is to simplify the options for building training-enabled builds.
This is part of the larger work item to create a dedicated build for
learning-on-the-edge scenarios with just training apis enabled.
2023-01-03 13:28:16 -08:00
RandySheriffH
587e891cae
CloudEP (#13855)
Implement CloudEP for hybrid inferencing.
The PR introduces no new API; customers can configure session and
run options to run inference against an Azure [triton
endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint)
A sample configuration in Python looks like:

```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton')
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com')
sess_opt.add_session_config_entry('cloud.model_name', 'detection2')
sess_opt.add_session_config_entry('cloud.model_version', '7')  # optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1')  # optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1') # 0 for local inferencing, 1 for cloud endpoint.
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input':input_}, run_opt)
```

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-03 10:03:15 -08:00
Yi Zhang
52e3fe961d
add dnnl dependency in unittest.cmake (#14104)
### Description
This follows up on PR #14085.
When multiple msbuild instances run at the same time, the build fails with:
```
22-12-30T16:35:34.2423207Z ##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(155,5): Error MSB3073: The command "setlocal
"C:\Program Files\CMake\bin\cmake.exe" -E copy D:/a/_work/1/b/RelWithDebInfo/dnnl/install/bin/dnnl.dll D:/a/_work/1/b/RelWithDebInfo/RelWithDebInfo
if %errorlevel% neq 0 goto :cmEnd
:cmEnd
endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone
:cmErrorLevel
exit /b %1
:cmDone
if %errorlevel% neq 0 goto :VCEnd
:VCEnd" exited with code 1.
```

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=847423&view=logs&j=249e9d58-0012-5814-27cf-6a201adbd9cf&t=182b9780-832e-5dcb-3957-d6aa3ece582f
This change makes sure that the onnxruntime_test_all project depends on
the dnnl project.
2023-01-03 11:24:06 +08:00
Tianlei Wu
6a9dc6c993
[CUDA] Update fused MHA to support flash attention and causal mask (#13953)
### Description
Update fused attention kernels to support flash attention and causal
mask (GPT-2 initial decoder run).

Note: Causal kernels are from FasterTransformer 5.2. Flash attention
kernels that are not causal are from TensorRT 8.5.1.

#### Performance Test of bert-base model

Test like the following:
```
 python -m onnxruntime.transformers.benchmark -m bert-base-cased -b 1 4 8 16 32 64 -s 512 -t 1000 -o by_script -g -p fp16 -i 3 --use_mask_index
```

Original Flash Attention is from
https://github.com/HazyResearch/flash-attention. RemovePadding and
RestorePadding are added before/after the original flash attention, but
not in this PR, so the result is not an apples-to-apples comparison. It is
included for reference only.

Average latency (ms) of float16 bert-base-cased model:

* A100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512 | b128_s512
-- | -- | -- | -- | -- | -- | -- | --
Unfused | 1.83 | 5.00 | 9.31 | 17.76 | 34.47 | 67.43 | 133.38
TRT Fused | 2.05 | 3.58 | 5.70 | 10.96 | 21.22 | 41.23 | 80.56
Flash Attention (from FT) | 1.43 | 3.20 | 5.71 | 10.95 | 22.19 | 42.96 | 84.54
Flash Attention (from TRT) | 1.44 | 3.28 | 5.70 | 10.86 | 21.00 | 40.56 | 79.53
Original Flash Attention | 1.81 | 4.04 | 6.82 | 13.06 | 24.62 | 46.58 | 91.10

* T4

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 8.17 | 29.86 | 59.56 | 115.77 | 236.66 | 461.43
Flash Attention (from FT) | 5.65 | 21.12 | 44.94 | 86.83 | 174.16 | 351.38
Flash Attention (from TRT) | 5.73 | 21.49 | 45.49 | 89.15 | 174.37 | 352.08
Original Flash Attention | 6.22 | 22.16 | 43.39 | 83.8 | 168.77 | 337.04

* V100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 3.77 | 10.48 | 19.53 | 37.63 | 73.68 | 145.58
Flash Attention (from FT) | 3.21 | 8.25 | 14.95 | 28.83 | 56.28 | 111.15

#### Performance Test of GPT-2 model
Test like the following:
`python benchmark_gpt2.py -m distilgpt2 -o --stage 1 --use_gpu -p fp16 -b 1 4 8 16 32 64 128 -s 0 --sequence_lengths 8 16 32 64 128 256 512`
* A100

Note that flash attention is used as fused attention when
sequence_length > 128.
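
The Gain column in the GPT-2 tables that follow is consistent with the relative latency reduction `(unfused - fused) / unfused`. The PR does not state the formula, so treat this as an assumption; the sketch below recomputes a few A100 rows under it (`gain_percent` is a name chosen here for illustration).

```python
def gain_percent(fused_ms: float, unfused_ms: float) -> float:
    """Relative latency reduction of the fused kernel, in percent (assumed formula)."""
    return (unfused_ms - fused_ms) / unfused_ms * 100.0


# Rows copied from the A100 table: (batch_size, sequence_length, fused, unfused)
rows = [(1, 8, 0.93, 1.0), (4, 512, 4.9, 6.02), (1, 32, 0.94, 0.88)]
for batch, seq, fused, unfused in rows:
    print(f"batch={batch} seq={seq}: {gain_percent(fused, unfused):.1f}%")
```

A negative value means the fused kernel was slower than the unfused one for that shape.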

batch_size | sequence_length | with Fused Attention | without Fused Attention | A100 Gain
-- | -- | -- | -- | --
1 | 8 | 0.93 | 1 | 7.0%
4 | 8 | 0.82 | 0.88 | 6.8%
8 | 8 | 0.84 | 0.88 | 4.5%
16 | 8 | 0.92 | 0.97 | 5.2%
32 | 8 | 1.15 | 1.17 | 1.7%
64 | 8 | 1.68 | 1.72 | 2.3%
128 | 8 | 2.76 | 2.78 | 0.7%
1 | 16 | 0.95 | 0.95 | 0.0%
4 | 16 | 0.83 | 0.88 | 5.7%
8 | 16 | 0.91 | 0.97 | 6.2%
16 | 16 | 1.12 | 1.17 | 4.3%
32 | 16 | 1.67 | 1.72 | 2.9%
64 | 16 | 2.73 | 2.76 | 1.1%
128 | 16 | 4.96 | 4.95 | -0.2%
1 | 32 | 0.94 | 0.88 | -6.8%
4 | 32 | 0.91 | 0.97 | 6.2%
8 | 32 | 1.12 | 1.17 | 4.3%
16 | 32 | 1.65 | 1.71 | 3.5%
32 | 32 | 2.69 | 2.76 | 2.5%
64 | 32 | 4.86 | 4.94 | 1.6%
128 | 32 | 9.35 | 9.38 | 0.3%
1 | 64 | 0.84 | 0.88 | 4.5%
4 | 64 | 1.1 | 1.17 | 6.0%
8 | 64 | 1.64 | 1.73 | 5.2%
16 | 64 | 2.66 | 2.77 | 4.0%
32 | 64 | 4.82 | 4.97 | 3.0%
64 | 64 | 9.23 | 9.4 | 1.8%
128 | 64 | 18.54 | 19.12 | 3.0%
1 | 128 | 0.91 | 0.98 | 7.1%
4 | 128 | 1.68 | 1.74 | 3.4%
8 | 128 | 2.71 | 2.83 | 4.2%
16 | 128 | 4.85 | 5.09 | 4.7%
32 | 128 | 9.32 | 9.69 | 3.8%
64 | 128 | 18.54 | 19.44 | 4.6%
128 | 128 | 36.86 | 38.47 | 4.2%
1 | 256 | 1.15 | 1.23 | 6.5%
4 | 256 | 2.71 | 2.95 | 8.1%
8 | 256 | 4.87 | 5.3 | 8.1%
16 | 256 | 9.32 | 10.23 | 8.9%
32 | 256 | 18.6 | 20.53 | 9.4%
64 | 256 | 36.93 | 40.41 | 8.6%
128 | 256 | 72.84 | 80.14 | 9.1%
1 | 512 | 1.68 | 1.96 | 14.3%
4 | 512 | 4.9 | 6.02 | 18.6%
8 | 512 | 9.4 | 11.59 | 18.9%
16 | 512 | 18.71 | 23.05 | 18.8%
32 | 512 | 37.13 | 45.46 | 18.3%
64 | 512 | 74.04 | 89.88 | 17.6%
128 | 512 | NA | NA | NA

* T4:

batch_size | sequence_length | with Fused Attention | with Unfused Attention | T4 Gain
-- | -- | -- | -- | --
1 | 8 | 1.97 | 2.11 | 6.6%
4 | 8 | 2.2 | 2.25 | 2.2%
8 | 8 | 2.77 | 3.1 | 10.6%
16 | 8 | 4.17 | 4.2 | 0.7%
32 | 8 | 6.86 | 6.82 | -0.6%
64 | 8 | 14.88 | 14.92 | 0.3%
128 | 8 | 31.4 | 31.29 | -0.4%
1 | 16 | 1.61 | 1.71 | 5.8%
4 | 16 | 2.13 | 2.31 | 7.8%
8 | 16 | 3.38 | 3.67 | 7.9%
16 | 16 | 6.16 | 6.54 | 5.8%
32 | 16 | 14.16 | 14.76 | 4.1%
64 | 16 | 30.36 | 30.57 | 0.7%
128 | 16 | 63.14 | 63.57 | 0.7%
1 | 32 | 1.53 | 1.69 | 9.5%
4 | 32 | 3.34 | 3.66 | 8.7%
8 | 32 | 6.25 | 6.64 | 5.9%
16 | 32 | 14.12 | 14.9 | 5.2%
32 | 32 | 28.96 | 29.82 | 2.9%
64 | 32 | 61.07 | 61.77 | 1.1%
128 | 32 | 116.38 | 117.98 | 1.4%
1 | 64 | 2.01 | 2.21 | 9.0%
4 | 64 | 6.18 | 6.67 | 7.3%
8 | 64 | 13.72 | 14.49 | 5.3%
16 | 64 | 28.71 | 29.83 | 3.8%
32 | 64 | 58.65 | 60.68 | 3.3%
64 | 64 | 113.09 | 113.17 | 0.1%
128 | 64 | 205.21 | 209.4 | 2.0%
1 | 128 | 3.37 | 3.76 | 10.4%
4 | 128 | 13.54 | 14.85 | 8.8%
8 | 128 | 28.32 | 30.22 | 6.3%
16 | 128 | 58.16 | 62.09 | 6.3%
32 | 128 | 109.17 | 113.99 | 4.2%
64 | 128 | 198.9 | 207.1 | 4.0%
128 | 128 | 413.25 | 421.82 | 2.0%
1 | 256 | 6.33 | 7.05 | 10.2%
4 | 256 | 28.09 | 31.49 | 10.8%
8 | 256 | 57.47 | 62.76 | 8.4%
16 | 256 | 106.77 | 117.95 | 9.5%
32 | 256 | 197.02 | 208.58 | 5.5%
64 | 256 | 406.81 | 431.36 | 5.7%
128 | 256 | NA | NA | NA
1 | 512 | 13.84 | 16.32 | 15.2%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA

* V100:

batch_size | sequence_length | with Fused Attention | with Unfused Attention | V100 Gain
-- | -- | -- | -- | --
1 | 8 | 1.31 | 1.6 | 18.1%
4 | 8 | 1.17 | 1.26 | 7.1%
8 | 8 | 1.43 | 1.79 | 20.1%
16 | 8 | 2.14 | 1.96 | -9.2%
32 | 8 | 2.91 | 3.08 | 5.5%
64 | 8 | 5.32 | 5.27 | -0.9%
128 | 8 | 9.34 | 8.97 | -4.1%
1 | 16 | 1.41 | 1.58 | 10.8%
4 | 16 | 1.38 | 1.49 | 7.4%
8 | 16 | 1.81 | 2.2 | 17.7%
16 | 16 | 2.8 | 2.83 | 1.1%
32 | 16 | 4.94 | 4.99 | 1.0%
64 | 16 | 8.88 | 8.84 | -0.5%
128 | 16 | 17.35 | 17.2 | -0.9%
1 | 32 | 1.38 | 1.77 | 22.0%
4 | 32 | 1.77 | 1.93 | 8.3%
8 | 32 | 2.71 | 2.86 | 5.2%
16 | 32 | 5.03 | 4.92 | -2.2%
32 | 32 | 8.8 | 8.79 | -0.1%
64 | 32 | 17.29 | 17.23 | -0.3%
128 | 32 | 33.27 | 33.1 | -0.5%
1 | 64 | 1.67 | 1.87 | 10.7%
4 | 64 | 2.69 | 2.76 | 2.5%
8 | 64 | 4.87 | 4.94 | 1.4%
16 | 64 | 8.73 | 8.81 | 0.9%
32 | 64 | 16.92 | 17.24 | 1.9%
64 | 64 | 33 | 33.38 | 1.1%
128 | 64 | 65.33 | 65.86 | 0.8%
1 | 128 | 2.03 | 2.22 | 8.6%
4 | 128 | 4.9 | 5.04 | 2.8%
8 | 128 | 8.76 | 8.81 | 0.6%
16 | 128 | 17.06 | 17.29 | 1.3%
32 | 128 | 33.25 | 33.56 | 0.9%
64 | 128 | 65.54 | 66.5 | 1.4%
128 | 128 | 130.44 | 131.44 | 0.8%
1 | 256 | 2.78 | 2.86 | 2.8%
4 | 256 | 8.75 | 9.04 | 3.2%
8 | 256 | 17 | 17.68 | 3.8%
16 | 256 | 33.19 | 34.32 | 3.3%
32 | 256 | 65.43 | 67.86 | 3.6%
64 | 256 | 129.92 | 134.68 | 3.5%
128 | 256 | NA | NA | NA
1 | 512 | 4.95 | 5.32 | 7.0%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA
2022-12-31 10:33:54 -08:00
Dmitri Smirnov
d762aa2a4c
Let Cmake decide where to place abseil (#14057)
### Description
Remove Abseil module placement specifications


### Motivation and Context
Allow CMake defaults to take effect, and make it possible to redirect all
submodules for sharing between local builds.
2022-12-23 12:08:13 -08:00
Ye Wang
68518a1b72
Sampling op (#13426)
### Description

Add a Sampling op for CPU and CUDA.
It supports both the Hugging Face case and the custom case.


### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2022-12-22 17:34:12 -08:00
pengwa
2f5bf75e51
Optimize computation orders (#13672)
### Optimize computation orders

In `Roberta/Electra`, when `ClassificationHead` is used, the features are
sliced along the sequence_length dimension, and the loss calculation
depends only on this sliced data. The slice is on axis 1: before slicing
the shape is [batch, sequence_length, hidden]; after slicing, it becomes
[batch, hidden_stage].

We have the opportunity to move this slicing as early as possible, by
passing it up through simple elementwise ops (like Add/Div),
Layernorm/Softmax (if their reduce axis is after the slicing axis), or
even the left operand of MatMul (as long as the slice does not affect the
last dims).

Operators like Reshape/Transpose are special: they either have data
specified (which we need to update after slicing), or they have perm
specified, which requires the input rank to remain unchanged. For those
kinds of operators we keep the original rank and leave the sliced dim
as 1; after the compute completes, we do a Squeeze.

```
class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
```
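The reordering described above can be illustrated with a small sketch in plain Python lists. It is not the graph transformer from this PR: `elementwise`, `slice_late`, `slice_early`, `slice_keep_rank` and `squeeze_axis1` are hypothetical helpers, and the elementwise function is only a stand-in for ops such as Add/Div. It shows that slicing token 0 before the elementwise work gives the same result while touching far fewer elements.

```python
def elementwise(x):
    # Stand-in for a chain of elementwise ops such as Add/Div.
    return (x + 1.0) / 2.0


def slice_late(features):
    # Compute on every token, then keep only token 0 of each sequence.
    full = [[[elementwise(v) for v in token] for token in seq] for seq in features]
    return [seq[0] for seq in full]


def slice_early(features):
    # Keep only token 0 first, then compute on the much smaller tensor.
    sliced = [seq[0] for seq in features]
    return [[elementwise(v) for v in token] for token in sliced]


def slice_keep_rank(features):
    # Rank-preserving variant: the sliced axis stays, with size 1.
    return [[seq[0]] for seq in features]


def squeeze_axis1(x):
    # Drop the size-1 axis once the rank-sensitive ops are done.
    return [seq[0] for seq in x]


batch, seq_len, hidden = 2, 4, 3
features = [[[float(b * 100 + s * 10 + h) for h in range(hidden)]
             for s in range(seq_len)] for b in range(batch)]

assert slice_late(features) == slice_early(features)
print("elementwise calls when slicing late:", batch * seq_len * hidden)
print("elementwise calls when slicing early:", batch * hidden)
```

The rank-preserving pair mirrors the Reshape/Transpose handling: the tensor keeps three dimensions until `squeeze_axis1` removes the size-1 axis.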

src\transformers\models\roberta\modeling_roberta.py
src\transformers\models\electra\modeling_electra.py

#### Benchmark

A simple benchmark shows Roberta training latency dropped from 208ms to
199ms, a 4.5+% reduction.
More comprehensive tests are on the way.

### Motivation and Context
2022-12-22 15:12:52 +08:00
Changming Sun
05137e6ec4
Use target name for flatbuffers (#13991)
### Description

Use target name for flatbuffers.
Add version range for flatbuffers. It is similar to #13870 
### Motivation and Context
To fix a build error:
```
CMake Error at onnxruntime_graph.cmake:88 (add_dependencies):
  The dependency target "flatbuffers" of target "onnxruntime_graph" does not
  exist.
Call Stack (most recent call first):
  CMakeLists.txt:1490 (include)
```

It happens when flatbuffers library is already installed. For example,
on Ubuntu people may get it from apt-get. But, the one provided by
Ubuntu 20.04 is not compatible with our code. The one in Ubuntu 22.04
works fine.
2022-12-20 11:44:02 -08:00
Changming Sun
fc2a6db573
Update absl to the latest release (#13990)
### Description
Update absl to a new version

### Motivation and Context
The new version contains fixes that are needed for Nvidia GPU build.
Once we update it to that version, we don't need to maintain our private
patches for Nvidia GPU build.
2022-12-19 14:25:13 -08:00
cloudhan
2df046fc67
Fix deprecated-builtins (#14001)
Fix error: builtin __has_trivial_destructor is deprecated; use __is_trivially_destructible instead [-Werror,-Wdeprecated-builtins]

This is not a clean fix as in 13783, users will need to manually set `CMAKE_HIP_FLAGS="-Wno-deprecated-builtins"` if they want to use self-built hipclang combining with ROCm 5.3.* or older.
2022-12-17 18:17:05 +08:00
FFFrog
6705915af8
[CANN] Add the ability to run graph (#13728)
### Description
Add the ability to run graph

### Motivation and Context
A brief description is as follows:
1) If the whole graph is supported, it is processed by the graph
engine directly.
2) If the whole graph is not supported, it is divided into sub-graphs
and single operators; the sub-graphs run on the graph engine, and the
single operators fall back to the traditional mode.
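
A minimal sketch of that split, assuming the graph can be treated as a topologically ordered list of op names. The real partitioning works on the ONNX graph and is not shown in this PR description; `partition`, the op names and the `supported` set are all hypothetical.

```python
from itertools import groupby


def partition(nodes, supported):
    """Split an ordered op list into ("graph", ops) and ("single_op", [op]) parts."""
    parts = []
    for is_supported, run in groupby(nodes, key=lambda op: op in supported):
        run = list(run)
        if is_supported:
            # A maximal run of supported ops becomes one sub-graph for the graph engine.
            parts.append(("graph", run))
        else:
            # Unsupported ops fall back to single-operator mode, one at a time.
            parts.extend(("single_op", [op]) for op in run)
    return parts


supported = {"Conv", "Relu", "Add"}
print(partition(["Conv", "Relu", "Add"], supported))
print(partition(["Conv", "Relu", "NonMaxSuppression", "Add"], supported))
```

When every op is supported, the result is a single "graph" part, which corresponds to case 1.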
2022-12-16 06:57:40 -08:00
Tang, Cheng
a81faee41e
Multi-stream execution support (#13495)
**Description**: This PR includes the following work:
1. Provide stream and related synchronization abstractions in
onnxruntime.
2. Enhance onnxruntime's execution planner / executor / memory arena to
support executing multiple streams in parallel.
3. Deprecate the parallel executor for CPU.
4. Deprecate the Fence mechanism.
5. Update the CUDA / TensorRT EP to support the stream mechanism, and
support running different requests in different CUDA streams.

**Motivation and Context**
- Why is this change required?
Currently, the execution plan is just a linear list of those primitives,
and ORT executes them step by step. For any given graph, ORT
serializes it to a fixed execution order. This sequential execution
design simplifies most scenarios, but it has the following limitations:
1. It is difficult to enable inter-node parallelization. We have a
half-baked parallel executor, but it is very difficult to make it work
with GPU.
2. The fence mechanism works for the single GPU stream + CPU thread
case, but when extended to multiple streams, it is difficult to manage the
cross-GPU-stream synchronizations.
3. Our CUDA EP relies on the BFCArena to make memory management work
with the GPU async kernels, but the current BFCArena is not aware of
streams, so it doesn't behave correctly when run with multiple
streams.

This PR enhances our existing execution plan and executor to support
multiple stream execution. We use a unified algorithm to manage both
single stream and multiple stream scenarios.
This PR mainly focuses on the infrastructure support for multiple stream
execution; that is, given a valid stream assignment, onnxruntime
can execute it correctly. How to generate a good stream assignment for a
given model will be addressed in a future PR.
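
To make "given a valid stream assignment, execute it correctly" concrete, here is a toy sketch rather than onnxruntime's executor. It assumes each stream can be modelled as a Python thread and each cross-stream synchronization as a `threading.Event`; `run_streams`, `assignment`, `deps` and `kernels` are hypothetical names.

```python
import threading


def run_streams(assignment, deps, kernels):
    """Run each stream's nodes in order on its own thread.

    assignment: {stream_id: [node, ...]}; deps: {node: [producer nodes]};
    kernels: {node: callable taking the producers' results}.
    """
    done = {n: threading.Event() for nodes in assignment.values() for n in nodes}
    results = {}
    lock = threading.Lock()

    def worker(nodes):
        for node in nodes:
            producers = deps.get(node, [])
            for producer in producers:
                done[producer].wait()      # wait for producers, possibly on other streams
            value = kernels[node](*[results[p] for p in producers])
            with lock:
                results[node] = value
            done[node].set()               # notify consumers on any stream

    threads = [threading.Thread(target=worker, args=(nodes,))
               for nodes in assignment.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


assignment = {0: ["a", "c"], 1: ["b", "d"]}   # two streams
deps = {"c": ["a"], "d": ["a", "b"]}          # "d" depends on "a" from stream 0
kernels = {"a": lambda: 2, "b": lambda: 3,
           "c": lambda a: a * 10, "d": lambda a, b: a + b}
print(run_streams(assignment, deps, kernels))
```

The sketch relies on the assignment being valid: if the per-stream orders contradicted the dependencies, the waits would deadlock, which is why generating the assignment is a separate problem.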

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Lei Cao <leca@microsoft.com>
2022-12-15 07:39:29 -08:00
Chi Lo
5b492cbae3
[TensorRT EP] support TensorRT 8.5 (#13867)
Integrate TensorRT 8.5

- Update TensorRT EP to support TensorRT 8.5
- Update relevant CI pipelines
- Disable known non-supported ops for TensorRT
- Make timeout configurable.
We observed more than [20
hours](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=256729&view=logs&j=71ce39d8-054f-502a-dcd0-e89fa9931f40)
of running unit tests with TensorRT 8.5 in package pipelines. Because we
can't use placeholder to significantly reduce testing time (the c-api
application test will deadlock) in package pipelines, we only run the
subsets of model tests and unit tests that are related to TRT (a new
build flag --test_all_timeout is added and set to 72000 seconds by the
package pipelines). Note that we still run all the tests in the TensorRT CI
pipelines to keep full test coverage.

- Include https://github.com/microsoft/onnxruntime/pull/13918 to fix the
onnx-tensorrt compile error.

Co-authored-by: George Wu <jywu@microsoft.com>
2022-12-14 13:06:03 -08:00
Ashwini Khade
6090d8cd6e
Fix usage of enable_training_ops and reduce ifdef complexity for training builds (#13888)
### Description
Fix usage of enable_training_ops and reduce ifdef complexity for
training builds.




### Motivation and Context
This is the second refactoring PR towards creating a dedicated build for
on device training. This PR aims to reduce some complexity. We can set
ENABLE_TRAINING_OPS in cmake when either ENABLE_TRAINING or
ENABLE_TRAINING_ON_DEVICE is selected; this way we don't have to use if
defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE)
everywhere in the code.
2022-12-14 08:32:46 -08:00
Changming Sun
070769d61d
Use onnxruntime_fetchcontent_makeavailable cmake function for TRT (#13918)
### Description
Use onnxruntime_fetchcontent_makeavailable cmake function for TRT. See
the comment for the reason.


### Motivation and Context
To support a newer TRT version. Previously onnx-tensorrt had a "BUILD_EXE"
build option that allowed us to exclude such things from the build. But in
https://github.com/onnx/onnx-tensorrt/pull/879 they deleted the build
option. It wouldn't be a problem if we continued to use git submodules as
before, because cmake's add_subdirectory command has an
"EXCLUDE_FROM_ALL" keyword. However, cmake's FetchContent module
doesn't. That's why I needed to create our own version of the macro.
2022-12-12 11:27:46 -08:00
RandySheriffH
75584c5fa8
Enabling thread pool to be numa-aware (#13778)
The PR enables ort thread pool to be numa-aware, so that threads could
be evenly created and distributed among numa nodes.
In addition, to facilitate performance tuning, the PR opens a new API
allowing customers to attach threads to certain logical processors.
Please check the API
[definition](https://github.com/microsoft/onnxruntime/pull/13778/files#diff-5845a5c76fb64abdc8f0cffe21b37f8da1712674eb3abc4cd87190891be1bd48)
for details.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-12-12 10:33:55 -08:00
Abhishek Udupa
83c59d2594
Session-aware and thread-safe CUDA profiler (#13706)
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.

### Motivation and Context
[PR 13549](https://github.com/microsoft/onnxruntime/pull/13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.


[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)

[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)

[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)
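
The attached check script is not reproduced here. As a rough sketch of the kind of check described, the snippet below assumes a trace file is a JSON array of event objects with a numeric `ts` field; that layout, and the name `check_trace`, are assumptions made for illustration. It reports events whose timestamp goes backwards (equal timestamps are accepted).

```python
import json


def check_trace(text):
    """Return a list of problems found in a JSON trace; empty means well-formed."""
    try:
        events = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(events, list):
        return ["top-level value is not a list of events"]

    problems = []
    last_ts = None
    for index, event in enumerate(events):
        ts = event.get("ts") if isinstance(event, dict) else None
        if not isinstance(ts, (int, float)):
            problems.append(f"event {index} has no numeric 'ts'")
            continue
        if last_ts is not None and ts < last_ts:
            problems.append(f"event {index}: ts {ts} < previous {last_ts}")
        last_ts = ts
    return problems


print(check_trace('[{"ts": 1}, {"ts": 2}, {"ts": 2}]'))
print(check_trace('[{"ts": 5}, {"ts": 3}]'))
```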

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2022-12-09 13:22:12 -08:00
Changming Sun
d5b45226be
Improve the handling of /external:I (#13904)
### Description

Improve the handling of "/external:I". The
"onnxruntime_external_lib_include_dir" variable may be:

1. A simple file path
2. A cmake generator expression like "$<INSTALL_INTERFACE:include>",
"$<TARGET_PROPERTY:onnx_proto,INTERFACE_INCLUDE_DIRECTORIES>",
"$<BUILD_INTERFACE:xxxx>". It seems that we can't simply put them in to
the "target_compile_options" line. So this PR parses the
expression and extracts the part we need.
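
The PR does this parsing in CMake. The Python sketch below only illustrates the idea and rests on an assumption the description does not confirm: that the part worth extracting is the path inside `$<BUILD_INTERFACE:...>`, and that other expressions are skipped. `extract_include_dir` is a hypothetical helper.

```python
import re

# Matches a whole-string generator expression of the form $<KIND:body>.
_GENEX = re.compile(r"^\$<(?P<kind>[A-Z_]+):(?P<body>.*)>$")


def extract_include_dir(value):
    """Return a plain path usable after '/external:I', or None if there is none."""
    match = _GENEX.match(value)
    if match is None:
        return value                      # case 1: already a simple file path
    if match.group("kind") == "BUILD_INTERFACE":
        return match.group("body")        # path that applies while building
    return None                           # e.g. INSTALL_INTERFACE or TARGET_PROPERTY


for item in ["/opt/onnx/include",
             "$<BUILD_INTERFACE:/src/onnx>",
             "$<INSTALL_INTERFACE:include>",
             "$<TARGET_PROPERTY:onnx_proto,INTERFACE_INCLUDE_DIRECTORIES>"]:
    print(item, "->", extract_include_dir(item))
```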

### Motivation and Context
Resolve the Github issue: https://github.com/microsoft/onnxruntime/issues/13893
2022-12-09 11:44:32 -08:00
Changming Sun
05dc1165a5
Add protobuf version constraint (#13870)
To fix a build error:


/home/xxxxxxxxxxxxx/onnxruntime/build/Linux/Debug/tensorboard/compat/proto/cost_graph.pb.cc:17:8:
error:
‘PROTOBUF_INTERNAL_EXPORT_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto’
does not name a type
17 | extern
PROTOBUF_INTERNAL_EXPORT_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto
::PROTOBUF_NAMESPACE_ID::internal::SCCInfo<1>
scc_info_TensorShapeProto_tensorboard_2fcompat_2fproto_2ftensor_5fshape_2eproto;
2022-12-08 16:14:16 -08:00
Yulong Wang
dbf47284d1
[wasm] disable closure compiler in debug build (#13865)
### Description
Disable the closure compiler in debug builds. After this change, emscripten
will only run the closure compiler in release builds.
2022-12-08 13:18:19 -08:00
Changming Sun
81c2defd3b
Remove unused git submodules (#13830)
2022-12-07 21:59:16 -08:00
Ashwini Khade
983877c712
Decouple strided tensor support from ENABLE_TRAINING (#13829)
### Description
Decouple strided tensor support from ENABLE_TRAINING

### Motivation and Context
This is step 1 for creating a dedicated build for on device training.
Intention is

1. We can set ENABLE_STRIDED_TENSORS in cmake when either
ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected; this way we
don't have to use if defined(ENABLE_TRAINING) ||
defined(ENABLE_TRAINING_ON_DEVICE) everywhere in the code.

2. This also paves the way to easily enable strided tensor support for
inference in future (if required).
2022-12-07 09:22:21 -08:00
cloudhan
f79d38181b
Fix hipify to avoid nccl_service.h: No such file or directory (#13852)
Fix various flaky build errors caused by onnxruntime_session missing dependencies on hipify-generated files.
2022-12-07 09:10:37 +08:00
Changming Sun
d12521d7b2
Upgrade pybind11 (#13853)
Upgrade pybind11 to include the fix for #9735
2022-12-06 15:39:23 -08:00
Ashwini Khade
65201e47bf
Enable nuget packages for on device training (#13637)
### Description
This PR enables building nuget packages locally for on device training
using --build_nuget arg.
This PR also enables the C# bindings by default in the managed package.
If a user triggers any training apis when the native binary is not built
for training, an exception with message "Training is disabled in the
current build. Please build ONNXRuntime from source with the build flags
enable_training and enable_training_on_device. " is thrown.

Build command for creating nuget packages for on device training:
build.bat --enable_training --enable_training_on_device --build_nuget

2 Nuget packages are built
1. Microsoft.ML.OnnxRuntime.Managed
2. Microsoft.ML.OnnxRuntime.Training OR
Microsoft.ML.OnnxRuntime.Training.Gpu



### Motivation and Context
2022-12-05 14:54:09 -08:00
Changming Sun
04900f96c1
Improve dependency management (#13523)
## Description
1. Convert some git submodules to cmake external projects
2. Update nsync from
[1.23.0](https://github.com/google/nsync/releases/tag/1.23.0) to
[1.25.0](https://github.com/google/nsync/releases/tag/1.25.0)
3. Update re2 from 2021-06-01 to 2022-06-01
4. Update wil from an old commit to 1.0.220914.1 tag
5. Update gtest to a newer commit so that it can optionally leverage
absl/re2 for parsing command line flags.

The following git submodules are deleted:

1. FP16
2. safeint
3. XNNPACK
4. cxxopts
5. dlpack
6. flatbuffers
7. googlebenchmark
8. json
9. mimalloc
10. mp11
11. pthreadpool

More will come.

## Motivation and Context
There are 3 ways of integrating 3rd party C/C++ libraries into ONNX
Runtime:
1. Install them to a system location, then use cmake's find_package
module to locate them.
2. Use git submodules.
3. Use cmake's external projects (externalproject_add).

At first when this project was just started, we considered both option 2
and option 3. We preferred option 2 because:

1. It's easier to handle authentication. At first this project was not
open source, and it had some other non-public dependencies. If we use
git submodule, ADO will handle authentication smoothly. Otherwise we
need to manually pass tokens around and be very careful on not exposing
them in build logs.
2. At that time, cmake fetched dependencies after "cmake" finished
generating vcprojects/makefiles. So it was very difficult to make cflags
consistent. Since cmake 3.11, it has a new command: FetchContent, which
fetches dependencies when it generates vcprojects/makefiles just before
add_subdirectory, so the parent project's variables/settings can be
easily passed to the child projects.

As the project went on, we had some new concerns:
1. As we started to have more and more EPs and build configs, the number
of submodules grew quickly. For most developers, most ORT submodules are
not relevant to them. They shouldn't need to download all of them.
2. It is impossible to let two different build configs use two different
versions of the same dependency. For example, right now we have protobuf
3.18.3 in the submodules. Then every EP must use the same version.
Whenever we have a need to upgrade protobuf, we need to coordinate
across the whole team and many external developers. I can't manage it
anymore.
3. Some projects want to manage the dependencies in a different way,
either because of their preference or because of compliance
requirements. For example, some Microsoft teams want to use vcpkg, but
we don't want to force every user of onnxruntime to use vcpkg.
4. Some users want to dynamically link to protobuf, but our build script
only does static linking.
5. It is hard to handle security vulnerabilities. For example, whenever
protobuf has a security patch, we have a lot of things to do. But if we
allowed people to build ORT with a different version of protobuf without
changing ORT's source code, customers who build ORT from source would
be able to act on such things more quickly. They would not need to
wait for ORT to have a patch release.
6. Every time we do a release, GitHub also publishes a source zip file
and a source tarball for us. But they are not usable,
because they are missing the submodules.
 
### New features

After this change, users will be able to:
1. Build the dependencies in the way they want, then install them to
somewhere (for example, /usr or a temp folder).
2. Or download the dependencies by using cmake commands from these
dependencies' official websites.
3. Similar to the above, but use private mirrors to mitigate supply
chain risks.
4. Use different versions of the dependencies, as long as our source
code is compatible with them. For example, you can't use
protobuf 3.20.x, as it needs code changes in ONNX Runtime.
5. Only download the things the current build needs.
6. Avoid building external dependencies again and again in every build.

### Breaking change
The onnxruntime_PREFER_SYSTEM_LIB build option is removed. You can think of it as being
ON by default from now on. If you don't like the new behavior, you can set FETCHCONTENT_TRY_FIND_PACKAGE_MODE to NEVER.
Besides, if you relied on the onnxruntime_PREFER_SYSTEM_LIB build
option, please be aware that this PR changes find_package calls from
Module mode to Config mode. For example, in the past, if you had
installed protobuf with apt-get from Ubuntu 20.04's official repo,
find_package could find it and use it. After this PR, it won't. This
is because the protobuf version provided by Ubuntu 20.04 is too old to
support Config mode. It can be resolved by getting a newer version
of protobuf from somewhere else.
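
The two options described above could look roughly like the following. This is only a sketch, not taken from ORT's build scripts: the install prefix `/opt/protobuf` is a placeholder, and whether ORT's own scripts call `find_package` exactly this way is an assumption.

```cmake
# Option A: opt out of the new behavior. With NEVER, FetchContent does not
# redirect to find_package(), so dependencies are always downloaded.
# Equivalent command line: cmake -DFETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER ...
set(FETCHCONTENT_TRY_FIND_PACKAGE_MODE NEVER CACHE STRING "")

# Option B: keep the new behavior, but point CMake at a newer protobuf that
# ships Config-mode package files. "/opt/protobuf" is a placeholder prefix.
list(APPEND CMAKE_PREFIX_PATH "/opt/protobuf")

# CONFIG restricts the search to <name>Config.cmake / <name>-config.cmake
# files and skips the FindProtobuf.cmake module bundled with CMake.
find_package(Protobuf CONFIG)
```

Only one of the two options is normally needed; they are shown together here for comparison.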
2022-12-01 09:51:59 -08:00
Patrice Vignola
4128e44b4f
[DML EP] Upgrade DML to 1.10.0 (#13796)
### Description
Upgrade DML to 1.10.0
2022-11-30 21:32:14 -08:00
Changming Sun
29ed8811e5
Move C/C++ deps' URLs to deps.txt (#13769)
### Description
1. Move C/C++ deps' URLs to deps.txt, and download the dependencies from
Azure DevOps Artifacts instead of GitHub.
2. Add the "EXCLUDE_FROM_ALL" keyword to the cmake external projects, so
that we only build the parts we need and avoid installing the 3rd-party
dependencies when people run `make install` in ORT's build directory.
However, cmake itself doesn't have this feature at the moment, so I
copied the relevant cmake code to cmake/external/helper_functions.cmake
and modified it.
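
One common way to get a similar effect with CMake 3.24 is to populate the dependency manually and add it with `add_subdirectory(... EXCLUDE_FROM_ALL)`. The sketch below illustrates that idea only. It is not the code in cmake/external/helper_functions.cmake, and the function name `example_fetch_excluded` is hypothetical.

```cmake
include(FetchContent)

# Hypothetical helper: make a dependency that was already declared with
# FetchContent_Declare() available, but keep its targets out of the
# default "all" target of the parent project.
function(example_fetch_excluded dep_name)
  # FetchContent exposes its result variables with a lower-cased prefix.
  string(TOLOWER "${dep_name}" lc_name)
  FetchContent_GetProperties(${dep_name})
  if(NOT ${lc_name}_POPULATED)
    # Download and unpack only; nothing is added to the build yet.
    FetchContent_Populate(${dep_name})
    # Targets in this directory are built only when something depends on them.
    add_subdirectory(${${lc_name}_SOURCE_DIR} ${${lc_name}_BINARY_DIR} EXCLUDE_FROM_ALL)
  endif()
endfunction()
```

Newer CMake releases (3.28 and later) accept `EXCLUDE_FROM_ALL` directly in `FetchContent_Declare`, which makes a helper like this unnecessary there.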

This PR is split from #13523, to make that one smaller. 

### Motivation and Context
1. Secure the supply chain.
2. Make it possible to automatically detect if ORT has an old
dependency that hasn't been updated for a long time.
2022-11-29 18:06:35 -08:00
Guenther Schmuelling
2d523c507e
for wasm catch exceptions at top level api (#13644)
fix for https://github.com/microsoft/onnxruntime/issues/13383,
https://github.com/microsoft/onnxruntime/issues/13408

Currently ort-web doesn't catch exceptions because turning on exception
catching increases the binary size by 3MB (~30%).
But ORT can throw (e.g. ONNX errors or ORT_ENFORCE), and in that case
there is no usable error message.

Turning on exception catching just for the top-level API file will
fix the error messages with a minimal increase in binary size.
2022-11-28 10:24:34 -08:00
Edward Chen
4901987d1d
Remove SafeInt dependency from Objective-C API. (#13698)
2022-11-18 17:06:12 -08:00
Changming Sun
3e9e5e9d6d
Patch Protobuf and ONNX's cmake files and enforce BinSkim check (#13694)
Patch Protobuf and ONNX's cmake files and enforce BinSkim check.

This PR overlaps with #13523. I would prefer to get this one merged
first so that we can finish the BinSkim work, and I tried to make this
PR as small as possible.
2022-11-18 10:09:47 -08:00
Changming Sun
7a57976d1a
Make natvis files work better (#13665)
### Description
After this change, you will see that the GSL.natvis and wil.natvis files
are added to every onnxruntime_xxx project.

Like this:

![image](https://user-images.githubusercontent.com/856316/202081013-314145a8-7a0f-4f45-bf85-f9ed0e247c63.png)

This is because in onnxruntime_common.cmake we have:

```cmake
if (MSVC)
  set(ABSEIL_NATVIS_FILE "abseil-cpp.natvis")
  target_sources(
      onnxruntime_common
      INTERFACE $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/external/${ABSEIL_NATVIS_FILE}>)
endif()
```
It sets a property, INTERFACE_SOURCES, on the target
"onnxruntime_common".

Then if anyone else uses:
```
target_link_libraries(mytarget PRIVATE onnxruntime_common)
```
The natvis file will be added to `mytarget`.

However, in this project we don't use such things for the targets that
are static libraries. For example, onnxruntime_graph is a static
library.

Instead, we use the `onnxruntime_add_include_to_target` function to
explicitly control what we want to propagate. The function was written
before we started to have natvis files, so it doesn't pass a source
file from one static library to another. Now we need that, probably
only on Windows.
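
As a rough sketch of what such propagation could look like (this is not the actual `onnxruntime_add_include_to_target` implementation, and the function name `example_add_interface_sources` is hypothetical), a helper can read `INTERFACE_SOURCES` from one target and add the files to another:

```cmake
# Hypothetical helper: copy the INTERFACE_SOURCES of each dependency
# (for example, natvis files) into dst_target without linking to it.
function(example_add_interface_sources dst_target)
  foreach(dep IN LISTS ARGN)
    if(TARGET ${dep})
      get_target_property(dep_sources ${dep} INTERFACE_SOURCES)
      # get_target_property() yields "<var>-NOTFOUND" when the property is
      # unset, which if() treats as false.
      if(dep_sources)
        target_sources(${dst_target} PRIVATE ${dep_sources})
      endif()
    endif()
  endforeach()
endfunction()

# Usage (target names follow the text above):
# example_add_interface_sources(onnxruntime_graph onnxruntime_common)
```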

### Motivation and Context

Add natvis files to every project.
2022-11-17 19:13:40 -08:00
Jian Chen
8442d9df2c
Cjian/c4244 round 6 (#13663)
### Description
Fix round 6 
2022-11-16 16:26:11 -05:00
cloudhan
369a822409
Share TunableOp between CUDA and ROCM EP (#13560)
Make TunableOp support CUDA kernel authoring and add the corresponding support to kernel explorer.
2022-11-11 13:56:44 +08:00
Patrice Vignola
3482180ec2
DML EP add a registration for Shape and Size (#13442)
### Description
Add a DML registration for Shape to avoid copying back to the CPU just
to get the shape of a GPU tensor.



### Motivation and Context
When using free dimensions, many Transformers models use the `Shape`
operator extensively. This causes hundreds of GPU->CPU copies that should
be completely avoidable. Note that this change also uses the same
heuristics as other providers (e.g. CUDA) to force some tensors onto the
CPU in certain situations.

Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
2022-11-08 19:29:37 -08:00
Peter Salas
b383312f4c
[tvm] Add support for int8 models, update TVM revision (#13519)
### Description
In the TVM EP, this adds more entries to the conversion from
`ONNXTensorElementDataType` to `DLDataType`. Additionally, it removes an
unused function and updates the TVM revision to allow running models
from recent revisions of TVM.

### Motivation and Context
In the TVM EP, the mapping from `ONNXTensorElementDataType` to
`DLDataType` was incomplete and neglected several integer types (in
particular `ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT8` and
`ONNX_TENSOR_ELEMENT_DATA_TYPE_INT8`) which prevented some models from
running.

Co-authored-by: Peter Salas <psalas@octoml.ai>
2022-11-08 11:28:32 -08:00
Changming Sun
efcbdac58e
Remove the cmake option: onnxruntime_DEV_MODE (#13573)
1. Remove the cmake option onnxruntime_DEV_MODE and replace it with
"--compile-no-warning-as-error"
2. Suppress some GSL warnings because now we treat nvcc diag warnings as
errors
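
For context, CMake 3.24 has built-in support for this. The fragment below is a minimal sketch, assuming warnings-as-errors is switched on through `CMAKE_COMPILE_WARNING_AS_ERROR`. Whether ORT sets it exactly this way is an assumption, and the project, target, and file names are placeholders.

```cmake
cmake_minimum_required(VERSION 3.24)
project(werror_example LANGUAGES CXX)

# Initializes the COMPILE_WARNING_AS_ERROR property of every target
# created after this line.
set(CMAKE_COMPILE_WARNING_AS_ERROR ON)

# example.cc is a placeholder source file.
add_library(example_lib STATIC example.cc)

# To ignore the setting for one configure run without editing this file:
#   cmake --compile-no-warning-as-error -S . -B build
```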
2022-11-07 09:06:28 -08:00
Changming Sun
23da468154
Upgrade cmake version to 3.24 (#13569)
### Description
Upgrade cmake version to 3.24 because I need to use a new feature that
is only provided in that version and later. Starting from cmake 3.24,
the
[FetchContent](https://cmake.org/cmake/help/latest/module/FetchContent.html#module:FetchContent)
module and the
[find_package()](https://cmake.org/cmake/help/latest/command/find_package.html#command:find_package)
command now support integration capabilities, which means calls to
"FetchContent" can be implicitly redirected to "find_package", and vice
versa. Users can use a cmake variable to control the behavior, so we
don't need to provide such a build option. We can delete our
"onnxruntime_PREFER_SYSTEM_LIB" build option and let cmake handle it.
This also makes things easier for those who want to use vcpkg.
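
A minimal sketch of the integration described above follows. The dependency, URL, and package name are chosen for illustration and are not copied from ORT's cmake files.

```cmake
cmake_minimum_required(VERSION 3.24)
project(fetch_or_find_example LANGUAGES CXX)

include(FetchContent)

# FIND_PACKAGE_ARGS lets FetchContent_MakeAvailable() try find_package()
# first and fall back to downloading only if no installed package is found.
FetchContent_Declare(
  GSL
  URL https://github.com/microsoft/GSL/archive/refs/tags/v4.0.0.zip
  FIND_PACKAGE_ARGS NAMES Microsoft.GSL
)
FetchContent_MakeAvailable(GSL)

# The behavior can be controlled without editing this file, e.g.:
#   cmake -DFETCHCONTENT_TRY_FIND_PACKAGE_MODE=ALWAYS ...  (always try find_package first)
#   cmake -DFETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER ...   (always download)
```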


### Motivation and Context

Provide a unified package management method, and get aligned with the
community. This change is split from #13523 for easier review.
2022-11-04 22:58:51 -07:00
George Nash
0296bc74c1
oneDNN ep bf16 enabling (#13484)
### Description
 This adds bfloat16 support to the oneDNN ep.

When using the oneDNN ep this enables bfloat16 support for the following
ops:

Exp, Sigmoid, Tanh, Relu, MatMul, Gelu, BiasGelu, Add, Sub,
Mul, Div, Sqrt, Pow, ReduceMean, Abs, Cast, Equal,
FastGelu, FusedMatMul, Gemm, Greater, GreaterOrEqual, LeakyRelu,
Less, LessOrEqual, LRN, ReduceOps, Reshape, Squeeze, Transpose,
and Unsqueeze.

LayerNorm with some internal casting. 
BatchNorm only enabled BFloat16 for input and output, scale and bias
still need fp32 input.

Added bfloat16 unit tests for all of the operators in question. When
possible we reused the already existing unit tests that were added by
CUDA and ROCM eps.

In many of the unit tests an unusual pattern will be seen:

    #if defined(USE_DNNL)
    TEST(Test, bfloat16_test) {
      #if defined(USE_DNNL)
        // oneDNN ep specific code
      #endif
       //test code
    }
    #endif

Although it looks unusual, this was done on purpose: if another ep
implements bfloat16 support for that operator, they will be able to
enable the unit test by adding their execution provider to the first
line without needing to edit inside the test.

Example: `#if defined(USE_CUDA) || defined(USE_DNNL)`. See the
MatMul_float16 test in matmul_test.cc for an example of how this is
useful.

Additionally, two new ISA checks (AVX512_BF16 and AMX-BF16) were added to
the cpuid_info code. These are needed to detect whether bfloat16
operations are supported by the CPU.

### Motivation and Context
This expands the capabilities of the oneDNN execution provider to
support models containing bfloat16 operations.

Signed-off-by: George Nash <george.nash@intel.com>
Signed-off-by: Ruihan-Yin <ruihan.yin@intel.com>
2022-11-04 18:25:09 -07:00
Edward Chen
4401f50c5e
Change GSL download to use HTTPS URL. (#13563)
2022-11-04 18:01:18 -07:00
cloudhan
2de883c592
Update CK and fix performance issue on dev machine (#13531)
1. Update CK to its latest develop branch
2. `-mllvm -amdgpu-early-inline-all=true` is critical to CK's
performance, ensure it is properly configured.
- The flags are propagated from target `hip-lang::device`'s
`INTERFACE_COMPILE_OPTIONS`, we must not manually add the flags.
- Instead, we must ensure this target is properly configured by checking
_CMAKE_HIP_DEVICE_RUNTIME_TARGET is set.

TL;DR

`hip-lang::device` is sometimes not properly configured if our
`CMAKE_PREFIX_PATH` is not set up carefully. In the CI docker, the
configuration is in a good state, but on a dev machine it is not, which
then silently results in poor performance for kernels. We fixed it in
this PR and added a guard to catch faulty future edits and to prevent a
convoluted debugging process.

`_CMAKE_HIP_DEVICE_RUNTIME_TARGET` is shared in
`/opt/rocm/lib/cmake/hip-lang/hip-lang-config.cmake` and it is internal
to
[CMake](https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6121/diffs),
so the variable name will not change in the foreseeable future.
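
A guard of the kind described could be sketched as follows. This is an illustration based only on the description above, not the exact code in the PR, and the error message wording is made up.

```cmake
# Assumes the HIP language has already been enabled at this point,
# for example via enable_language(HIP).
if(NOT DEFINED _CMAKE_HIP_DEVICE_RUNTIME_TARGET)
  message(FATAL_ERROR
    "hip-lang::device does not appear to be configured. "
    "Check that CMAKE_PREFIX_PATH lets CMake find hip-lang-config.cmake "
    "(for example under /opt/rocm); otherwise the flags that CK kernels "
    "rely on may be silently missing.")
endif()
```

Failing at configure time turns a silent performance regression into an immediate, visible error.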
2022-11-03 19:32:30 +08:00
George Nash
77be22f379
[oneDNN ep] Update from oneDNN v2.7.0 to oneDNN v2.7.1 (#13536)
The oneDNN 2.7.1 release includes multiple functional and performance
improvements.

Signed-off-by: George Nash <george.nash@intel.com>

### Description
Update the oneDNN library from 2.7.0 to 2.7.1. This contains multiple
functional and performance improvements.



### Motivation and Context
This is a minor point release of the oneDNN library that provides
performance and functional fixes for issues found in the oneDNN 2.7
library shortly after release.

Signed-off-by: George Nash <george.nash@intel.com>
2022-11-02 15:57:49 -07:00
Changming Sun
b1e1b25e04
Delete CUB (#13534)
### Description
Delete CUB

### Motivation and Context
Because it is already in the CUDA SDK.
2022-11-02 13:06:22 -07:00
Wei-Sheng Chin
b5904c40dd
Enable ORT in TorchDynamo (#13259)
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
2022-11-01 11:19:29 -07:00
PeixuanZuo
c8886c5b4c
Revert "Update CK and fix performance due to lacking -amdgpu-early-inline-all=true (#13493)"
This reverts commit 4dd053cc15.
2022-11-01 13:05:55 +08:00
Baiju Meswani
c557a55816
Fix on-device training ExportModelForInferencing api (#13510)
2022-10-31 21:29:06 -07:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416)
2022-10-29 04:15:20 -07:00