Commit graph

1001 commits

Author SHA1 Message Date
sfatimar
ebaafac3f5
Openvino ep ort 5.0 (#15626)
### Description
The PR adds VPU support to OpenVINO Execution Provider
Bug fixes for GPU, CPU. 
Changes to OpenVINO Backend in Serialized Model API for faster First
Inference Latency.
Deprecation of HDDL-VADM and MYRIAD; related code removed
Support OpenVINO 2023.0 
Dynamic Shapes Support for iGPU

### Motivation and Context
- VPU is upcoming hardware that can provide AI acceleration for
client systems through OpenVINO

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-04-25 20:59:42 -07:00
Ye Wang
d05777ddb6
stabilize fusion script with a separate create_attention_node() (#15670)
### Description

Previously, the script used create_attention_node() from the base class
in fusion_attention.py. Changes to that file could sometimes silently
lead to generating a bad model.

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-04-25 13:07:58 -07:00
cloudhan
d1354dcc83
[ROCm] Add stable diffusion benchmark results for MI100 (#15646) 2023-04-23 18:29:35 +08:00
cloudhan
8297148bde
[ROCm] Update benchmark for stable diffusion (#15602)
1. update scripts for ROCm memory measurement.
2. update README to contain ROCm result.
3. address some minor issue in the README
2023-04-23 11:49:40 +08:00
kunal-vaishnavi
3de33e00c7
Fix issues for Whisper export with beam search (#15619)
### Description
This PR fixes an issue with calling the ORT transformer optimizer script
on the custom export of Whisper with beam search. It also includes the
[fix](https://github.com/microsoft/onnxruntime/pull/15616) for the GPU
out-of-memory issue.



### Motivation and Context
With this PR fix, the optimizer runs as described in the [Whisper model
optimization PR](https://github.com/microsoft/onnxruntime/pull/15473).
2023-04-21 00:08:58 -07:00
Yufeng Li
373f912e51
add quantization support for whisper (#15589)
### Description
Add dynamic quantization support for whisper model.
There are 3 options to try out:
- quantize_embedding_layer: whether to quantize the embedding layer of
the decoder model
- quantize_per_channel: whether to quantize per channel for Gemm or
MatMul
- quantize_reduce_range: use 7 bits to quantize MatMul or Gemm. Use when
hitting accuracy issues on x64 CPUs without VNNI.
2023-04-20 14:22:11 -07:00
PeixuanZuo
59ea35d592
[ROCm] add CK GroupNorm to GroupNormTunable (#15510)
- Add CK GroupNorm to GroupNormTunable.
- Reduce configuration of GroupNormNHWCOp because CK implementation is
better.

The performance gain on stable diffusion v1.5.
Before:
```
'height': 512
'width': 512
'steps': 50
'batch_size': 1
'batch_count': 5
'num_prompts': 1
'average_latency': 2.4782688856124877
'median_latency': 2.4783748388290405
'provider': 'ROCMExecutionProvider'
'disable_safety_checker': True 
```

After:
```
'height': 512,
'width': 512,
'steps': 50,
'batch_size': 1,
'batch_count': 5,
'num_prompts': 1,
'average_latency': 2.107170510292053,
'median_latency': 2.1067750453948975,
'first_run_memory_MB': -1,
'second_run_memory_MB': -1,
'provider': 'ROCMExecutionProvider',
'disable_safety_checker': True
```
2023-04-19 13:54:59 +08:00
Chi Lo
6115c8fd1f
Add TRT plugins support using custom ops (#13847)
This PR makes ORT support TRT plugin using custom ops. ORT TRT can
automatically register all TRT plugins from TRT plugins registry as
custom ops. There is no code change needed for ORT when new TRT plugins
are introduced.

The previous way for ORT to support TRT plugins was to use contrib ops,
but there are some concerns about it:

- Contrib ops are shipped as part of the ORT binary by default. TRT
related plugins should not be in the default ORT.
- Contrib ops are designed for internal ops and developed for cpu and
cuda EPs.

Therefore, using custom ops is a good approach to support TRT plugins. 

The major modifications are as follows:

1. Add new `GetCustomOpDomainList` provider api which allows provider to
create its own custom op domain list and ORT can register this domain
list. Provider has the responsibility to free all the custom op domain
instances it created.
2. Move OrtCustomOpDomain struct definition to
framework_provider_common.h since this struct is being used by framework
and EPs now.
3. Several TRT plugins are registered as ONNX schema ops through
contrib ops with the ONNX domain. In order not to break old models using
those TRT plugins, and to maintain backward compatibility, we need to
keep the old/legacy TRT plugins with the ONNX domain. Moving forward,
all newly added TRT plugins should be registered with the `trt.plugins`
domain.
4. TRT plugin doesn't have an api to get number of inputs/outputs of the
registered plugins, so ORT TRT uses variadic inputs/outputs to bypass
the onnx node validation.
5. Add a new TRT provider option, `trt_extra_plugin_lib_paths`, with
which the user can specify any extra plugin libs, for example,
`fastertransformer/build/lib/libvit_plugin.so` or
`fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`
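
As a rough sketch of how this option might be passed from Python: the option name `trt_extra_plugin_lib_paths` and the semicolon-separated format come from item 5 above, while the helper name `trt_provider_with_plugins` and the `model.onnx` path are made up for illustration.

```python
def trt_provider_with_plugins(plugin_libs):
    """Build a (provider_name, options) pair for the TensorRT EP.

    Hypothetical helper: joins plugin library paths with ';', the
    separator used in the example above.
    """
    options = {"trt_extra_plugin_lib_paths": ";".join(plugin_libs)}
    return ("TensorrtExecutionProvider", options)


provider = trt_provider_with_plugins(
    [
        "fastertransformer/build/lib/libvit_plugin.so",
        "fastertransformer/build/lib/libvit_plugin_v2.so",
    ]
)
print(provider)

# With a TensorRT-enabled build of onnxruntime, the pair can be passed as:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=[provider])
# ("model.onnx" is a placeholder path.)
```

Keeping the option in a `(name, options)` pair matches how per-provider options are handed to `InferenceSession` through its `providers` argument.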
2023-04-18 20:24:32 -07:00
kunal-vaishnavi
901c2bc384
Whisper Model Optimization (#15473)
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
   - For Whisper's encoder and decoder
   - Modified symbolic shape inference for present output when no past
     input exists (for decoder)
- Multi-head attention fusions
   - For Whisper's decoder and decoder with past
   - Packed MatMul for the 3 MatMuls excluded in multi-head attention
     fusion
- Attention kernel changes
   - CPU:
      - Different Q and KV sequence lengths
      - Parallel memset for large sequence lengths
      - Convert broadcast add after MatMul of Q and K (add_qk) to
        element-wise add
      - Separate present key-value output into present key and present
        value (for multi-head attention spec)
   - CUDA:
      - Use memory efficient attention compute kernel with present state
        (for decoder)
- Multi-head attention kernel changes
   - CPU:
      - Introduction of multi-head attention CPU kernel (previously did
        not exist)
      - Use AddBiasReshape instead of AddBiasTranspose when sequence
        length = 1 (for decoder with past)
      - Different Q, K, V input shapes
      - Pass past key and past value directly as key and value
   - CUDA:
      - Use memory efficient attention compute kernel with past and/or
        present state (for decoder with past)

### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920

### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427

This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
2023-04-18 17:13:54 -07:00
Justin Chu
cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or are data heavy.


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
liqun Fu
919d8f2660
update with onnx main (#14929) 2023-04-18 08:42:51 -07:00
Justin Chu
9d26f8f4fe
Use os.fspath on Path (#15530)
### Description

Use os.fspath instead of str() on a path object. 

### Motivation and Context

I learned today that os.fspath is the right way to go:
https://github.com/charliermarsh/ruff/issues/3675#issuecomment-1494975508
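
A small illustration of why `os.fspath` can be preferable to `str()`; the path and variable names here are made up. For a real path object both give the same string, but `os.fspath` rejects values that are not paths, while `str()` converts anything.

```python
import os
from pathlib import PurePosixPath

model_path = PurePosixPath("models") / "model.onnx"

# For a real path object, both calls give the same string.
assert os.fspath(model_path) == str(model_path) == "models/model.onnx"

# The difference shows up when the value is not a path at all:
# str() converts anything, os.fspath() refuses.
not_a_path = None
bogus_name = str(not_a_path)  # "None", a bogus file name that fails much later

rejected = False
try:
    os.fspath(not_a_path)
except TypeError as exc:
    rejected = True
    print(f"rejected early: {exc}")
```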
2023-04-17 16:59:40 -07:00
Zhang Lei
a30b57da6e
Fix/Enhance convert_generation tool for SkipLayerNorm, op_block_list... (#15368)
Now that SkipLayerNorm uses fp32 for internal calculation and a
numerically stable algorithm, enable it for fp16 here.
Make the op_block_list a command line argument to help future tools.
Other minor changes.
2023-04-17 14:44:37 -07:00
Justin Chu
a36caba073
Bump ruff in CI (#15533)
### Description

Bump ruff version in CI and fixed new lint errors. 

- This change enables the flake8-implicit-str-concat rules, which help
detect unintended string concatenations:
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update gitignore to include common python files that we want to
exclude.
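
For context, the kind of unintended concatenation these rules aim to catch looks like the following. The list contents are illustrative, not taken from the code base.

```python
# A missing comma silently merges two list items into one string.
providers = ["CPUExecutionProvider", "CUDAExecutionProvider" "ROCMExecutionProvider"]

print(len(providers))  # 2, not 3
print(providers[1])    # CUDAExecutionProviderROCMExecutionProvider
```

Python accepts this without any warning, which is why a lint rule is useful here.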


### Motivation and Context

Code quality
2023-04-17 10:11:44 -07:00
Maximilian Müller
fbe88fccbd
Exposing new TRT build options (#15089)
### Description

This adds a few TRT options, some of which are only available on TRT
8.6:
- heuristics
- sparsity
- optimization level (8.6 only)
- auxiliary stream (8.6 only)
- tactic source selection

I am not sure yet which tests I should add for these options. As those
are mostly simple TRT flags, I am not sure to what level I should test.
For heuristics, something similar to
44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)
should be possible, but for all the others we would essentially only be
testing whether there is a crash when the option is set.
Also, if I forgot some option that would be good to have, feel free to
speak up!
2023-04-14 09:47:36 -07:00
pengwa
bf32dbbd9b
Share more constant initializers (#15461)
### Share more constant initializers.

The `ConstantSharing` transformer originally handled only single-value
initializers (scalar or 1D).

This PR shares more cases so that the common subexpression
elimination transformer can remove more duplicated nodes.

Originally, we used a single
vector<std::variant<float,half,int32,int64>> to store different scalar
values. In this PR, we create an unordered map whose key is
data_type + rank + element count and whose value is a vector of
`InitializerValue`.

For a specific initializer that fulfils the condition, we first find
the corresponding vector of `InitializerValue` by its
<data_type + rank + element count>, then search the vector to check
whether the constant tensor already exists. After that, a value id is
returned, which is combined with <data_type + rank + element count>
to form the pattern key that decides which tensor to reuse
(legacy code).
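
The lookup described above can be sketched roughly as follows. This is a toy Python model, not the C++ implementation: the class name `ConstantPool`, the method `pattern_key`, and the sample values are all hypothetical.

```python
from collections import defaultdict


class ConstantPool:
    """Toy model of the lookup described above (not ORT's actual C++ code)."""

    def __init__(self):
        # (data_type, rank, element_count) -> list of distinct values seen
        self._buckets = defaultdict(list)

    def pattern_key(self, data_type, shape, values):
        rank = len(shape)
        count = len(values)
        bucket = self._buckets[(data_type, rank, count)]
        value = tuple(values)
        if value not in bucket:
            bucket.append(value)
        # The index inside the bucket plays the role of the value id.
        value_id = bucket.index(value)
        return (data_type, rank, count, value_id)


pool = ConstantPool()
key_a = pool.pattern_key("int64", (2,), [-1, 64])
key_b = pool.pattern_key("int64", (2,), [-1, 64])   # same content as key_a
key_c = pool.pattern_key("int64", (2,), [-1, 128])  # different content

print(key_a == key_b, key_a == key_c)
```

Initializers that produce the same pattern key are candidates for sharing one tensor; bucketing by type, rank, and element count keeps each search short.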

### Motivation and Context

One example we see here is:

```mermaid
stateDiagram
    [*] --> LayerNorm(b,s,64)
    LayerNorm(b,s,64) --> Reshape1
    Shape1_Const[b*s,64] --> Reshape1

    LayerNorm(b,s,64) --> Reshape2
    Shape2_Const[b*s,64] --> Reshape2


    Reshape1 --> AttentionSubGraph
    Reshape2 -->  Add
    AttentionSubGraph--> Add
   Add --> [*]
```

Ideally, CommonSubexpressionElimination could remove one of `Reshape1`
and `Reshape2`, but since `Shape1_Const` and `Shape2_Const` are
different NodeArg*, it did not remove the duplication.

This is one example: removing the duplication brings more
opportunities to apply graph transformations.
2023-04-14 07:41:07 -07:00
Wei-Sheng Chin
d76cf374c4
Capture both ValueError and RuntimeError (#15503) 2023-04-13 19:29:34 -07:00
PeixuanZuo
ce1eb6d629
[ROCm] Add Tunable GroupNorm (#15298)
refactor GroupNorm and Add Tunable GroupNorm
2023-04-12 10:55:42 +08:00
Ye Wang
ef42fd09fb
google/mt5 optimization and fix (#15454)
### Description
1. enabled self-attention fusion in mt-5 decoder graph
2. fix a parity issue
https://github.com/microsoft/onnxruntime/issues/15042


### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-04-11 00:09:11 -07:00
cloudhan
9acbfc6a29
ROCm MHA (#15279)
Add MultiHeadAttention for ROCm EP.

**Before:**
```
'engine': 'onnxruntime'
'version': '1.15.0'
'height': 512
'width': 512
'steps': 50
'batch_size': 1
'batch_count': 5
'num_prompts': 1
'average_latency': 3.878769588470459
'median_latency': 3.8792178630828857
'first_run_memory_MB': -1
'second_run_memory_MB': -1
'model_name': 'runwayml/stable-diffusion-v1-5'
'directory': './sd-v1-5-onnx-fp16-nomha'
'provider': 'ROCMExecutionProvider'
'disable_safety_checker': True
```

**After:**
```
'engine': 'onnxruntime'
'version': '1.15.0'
'height': 512
'width': 512
'steps': 50
'batch_size': 1
'batch_count': 5
'num_prompts': 1
'average_latency': 2.364924430847168
'median_latency': 2.3650705814361572
'first_run_memory_MB': -1
'second_run_memory_MB': -1
'model_name': 'runwayml/stable-diffusion-v1-5'
'directory': './sd-v1-5-onnx-fp16'
'provider': 'ROCMExecutionProvider'
'disable_safety_checker': True
```
2023-04-11 13:20:44 +08:00
Ye Wang
34f22daf25
Support T5 Beam Search with DecoderMaskedMHA (#15386)
### Description
tldr:
Latency improvement
t5-small: 37.8% 
t5-base: 24.5%


Benchmark on V100

Before:
T5-small
ORT {'test_times': 1, 'latency_variance': '0.00',
'latency_90_percentile': '104.74', 'latency_95_percentile': '104.74',
'latency_99_percentile': '104.74', 'average_latency_ms': '104.74',
'QPS': '19.10', 'parity': True}
T5-base
ORT {'test_times': 1, 'latency_variance': '0.00',
'latency_90_percentile': '200.93', 'latency_95_percentile': '200.93',
'latency_99_percentile': '200.93', 'average_latency_ms': '200.93',
'QPS': '9.95', 'parity': True}



After:
T5-small
ORT {'test_times': 1, 'latency_variance': '0.00',
'latency_90_percentile': '76.01', 'latency_95_percentile': '76.01',
'latency_99_percentile': '76.01', 'average_latency_ms': '76.01', 'QPS':
'26.31', 'parity': True}
T5-base
ORT {'test_times': 1, 'latency_variance': '0.00',
'latency_90_percentile': '161.40', 'latency_95_percentile': '161.40',
'latency_99_percentile': '161.40', 'average_latency_ms': '161.40',
'QPS': '12.39', 'parity': True}
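
A quick check of how the headline percentages relate to the numbers above. The function names are illustrative. The stated 37.8% and 24.5% match the before/after latency ratio minus one; measured as a plain reduction in latency, the gains are smaller (roughly 27% and 20%).

```python
def speedup_pct(before_ms, after_ms):
    """Ratio-based gain: how much faster 'after' is relative to 'before'."""
    return (before_ms / after_ms - 1.0) * 100.0


def latency_reduction_pct(before_ms, after_ms):
    """Share of the original latency that was removed."""
    return (1.0 - after_ms / before_ms) * 100.0


# average_latency_ms values copied from the benchmark output above
results = {"t5-small": (104.74, 76.01), "t5-base": (200.93, 161.40)}
for name, (before, after) in results.items():
    print(
        f"{name}: speedup {speedup_pct(before, after):.1f}%, "
        f"latency reduction {latency_reduction_pct(before, after):.1f}%"
    )
```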


### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-04-08 12:50:18 -07:00
Ryan Hill
56beac4b5b
VIT model handling in the Benchmark.sh file (#15045)
### Description
Adds VIT model type to the benchmark
Also adds Swin (v1) model type

### Motivation and Context
Image models are important and we should verify these work as expected
at the performance we expect.
2023-04-07 20:17:29 -07:00
Hector Li
276c0a00e4
Reuse QDQConv for ConvTranspose to generate the QDQ model (#15385)
### Description
Reuse QDQConv for ConvTranspose to generate the QDQ model

### Motivation and Context
Generate the correct QDQ model
2023-04-06 15:07:44 -07:00
petermcaughan
2bd8e4a130
Petermca/whisper dedup (#15365)
### Description
Apply `get_shared_initializers()` to the encoder and decoder subgraphs
of Whisper before chaining and exporting the full, final model.


### Motivation and Context
The Whisper export process has some overlap between the encoder and
decoder subgraphs due to the format of the BeamSearch contrib op.
Consequently, there is some shared model data that is duplicated in the
final exported product, which can result in a file size increase of
~40%. This PR takes the methods in `convert_generation.py` and applies
them during the whisper export process.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-04-06 13:27:05 -07:00
petermcaughan
d0cca91cfb
Fix token_id values for whisper export (#15362)
### Description
The current ONNX export of Whisper utilizes hard-coded values for
token_ids when configuring the BeamSearch node. This PR removes these
literals and instead takes these values straight from the WhisperConfig.



### Motivation and Context
Hard-coding these values can cause some parity issues when comparing to
default PyTorch behavior - this change to take from WhisperConfig
resolves these.

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-04-06 11:01:21 -07:00
cloudhan
71a4e7eb97
Automatically enable tunable op usage for production models (#15156)
Split `IsTunableOpEnable` semantics into **enable tunable op for using**
and **enable tunable op for tuning**.

Both remain disabled in general for safety. But if
- the session is created with an ONNX model with tuning results embedded, and
- the embedded tuning results are set to the EP without an error `Status`,

then we automatically enable using; tuning remains disabled.
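
The rule above can be summarized in a few lines. This is only a sketch of the decision logic: `TunableOpState` and `state_after_session_creation` are hypothetical names, not part of ORT.

```python
from dataclasses import dataclass


@dataclass
class TunableOpState:
    use_enabled: bool = False     # "enable tunable op for using"
    tuning_enabled: bool = False  # "enable tunable op for tuning"


def state_after_session_creation(has_embedded_results, set_results_ok):
    """Hypothetical helper mirroring the rule above; not ORT's real code."""
    state = TunableOpState()  # both switches are off by default
    if has_embedded_results and set_results_ok:
        state.use_enabled = True  # tuning stays disabled
    return state


print(state_after_session_creation(True, True))
print(state_after_session_creation(True, False))
```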

The planned options will be
- `tunable_op_enable`: The top-level switch of `TunableOp`; indicates whether we will run into `TunableOp` related logic. **NOTE:** most of our impls have a bottom impl that acts as a fallback and is set as the default. In this case, we still call into the `TunableOp`, but no kernel selection, kernel tuning, or caching is involved. This reduces our maintenance burden of a duplicate code path.
- `tunable_op_tuning_enable`: The secondary switch of `TunableOp`; indicates whether we will run into the tuning related logic of `TunableOp`

Then for the possible future options:
- `tunable_op_tuning_max_iteration`: blahblah
- `tunable_op_tuning_max_duration_ms`: blahblah
- `tunable_op_flash_attention_enable`: blahblah, for example only, we will not have this.

The developer-oriented envvars exist for developers' convenience, to inspect the performance impact of tuning. So there are only `ORT_ROCM_TUNABLE_OP_ENABLE` and `ORT_ROCM_TUNABLE_OP_TUNING_ENABLE`, which give fine-grained control over the combinations.
2023-04-06 13:52:47 +08:00
Leso_KN
ea6b32fea8
Fix: Add def main() in onnxruntime_test.py (#15208) 2023-04-05 12:31:39 -07:00
Justin Chu
a96e19abc4
Add type annotations to onnxruntime_inference_collection.py (#15364)
### Description

Add type annotations to `onnxruntime_inference_collection.py`



### Motivation and Context

Fixes #15334
2023-04-05 10:32:49 -07:00
Hariharan Seshadri
5294cd0c55
Print value errors in ort.InferenceSession to user (#15360) 2023-04-04 16:01:24 -07:00
Anton Korablin
207c57219a
Add support for full ViT optimization (#15289)
Add support for ViT optimization in optimizer.py
As ViT architecture follows BERT rather closely, we can easily reuse
BERT fusions for ViT. The only difference is that ViT does not have
attention mask, which means there is no Add node in qk paths.
Make the necessary changes in onnx_exporter.py to be able to cover
optimizations with test.
2023-04-04 14:05:24 -07:00
Severin Simmler
4400e80452
Allow Path objects for deserialization of ONNX models (#15307) 2023-04-04 11:38:00 -07:00
petermcaughan
f30e2d4387
Whisper Export (#15247)
### Description
Add scripts to export Whisper model to ONNX and integrate the ORT
BeamSearch op with the resulting graphs.

Example command to execute this script:

python convert_to_onnx.py -m openai/whisper-large --output whisper -e

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-04-04 05:01:04 -07:00
Yufeng Li
c08d6b42e8
Add tool to support packing mode for BERT model (#15283)
### Description
Add a tool to convert a fused BERT-like model to packing mode


### Motivation and Context
2023-03-31 08:46:47 -07:00
yf711
dc61d3b5b6
Fix symbolic shape inference script on precision loss issue (#15215)
### Description
When calculating a symbolic shape like `mul(get_int_val(values=[1024,
0.5]))`,
the current script calls `get_int_val()` to get the values, which
become `[1024, 0]`.
Thus, the result of `mul(values)`->`mul([1024,0])`=0, but the expected
shape size is 512.

Fix: for binary math operations like `mul()` and `div()`,
don't convert input shapes into integers if any precision loss is
possible;
keep the input shapes as floats, finish the operation, then cast the
final result to an integer and output the shape.

Test cases are added:
1. mul(1024, 0.5)=>512 (before this fix, the output would be 0, as float
0.5 would be converted to int 0)
2. div(768, 1.5)=>512 (before this fix, the output would be 768, as
float 1.5 would be converted to int 1)
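
The two test cases can be reproduced with plain Python arithmetic. The function names below are illustrative and are not the ones used in the symbolic shape inference script.

```python
def mul_truncating(values):
    """Old behaviour: cast every operand to int before multiplying."""
    result = 1
    for v in values:
        result *= int(v)
    return result


def mul_fixed(values):
    """Fixed behaviour: multiply as given, cast only the final result."""
    result = 1
    for v in values:
        result *= v
    return int(result)


def div_truncating(a, b):
    """Old behaviour: cast both operands to int before dividing."""
    return int(a) // int(b)


def div_fixed(a, b):
    """Fixed behaviour: divide as floats, cast only the final result."""
    return int(a / b)


print(mul_truncating([1024, 0.5]), mul_fixed([1024, 0.5]))  # 0 512
print(div_truncating(768, 1.5), div_fixed(768, 1.5))        # 768 512
```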

### Motivation and Context
2023-03-30 12:15:27 -07:00
PeixuanZuo
a6279d4cfb
[ROCm] update Stable Diffusion benchmark to support ROCm EP (#15094)
Update Stable Diffusion benchmark to support ROCm EP
2023-03-29 15:19:52 +08:00
Tianlei Wu
f752bb9973
Update stable diffusion benchmark results: A100 and PyTorch 2.0 (#15195)
Update stable diffusion benchmark results with A100 results and PyTorch 2.0 number.
2023-03-28 19:47:22 -07:00
Justin Chu
938e2136c6
Enable pylint and numpy rules (#15218)
### Description

Enable pylint and numpy rules

### Motivation and Context

Modernize numpy usage and enable more quality checks
2023-03-27 20:37:53 -07:00
cloudhan
d3565779c3
Allow bert_perf_test.py to load/save tuning results (#15096) 2023-03-26 18:03:08 +08:00
Justin Chu
d834ec895a
Adopt lintrunner as the linting tool - take 2 (#15085)
### Description

`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and macOS.
The checks are enforced by the `Python format` workflow.

This PR adopts `lintrunner` to onnxruntime and fixed ~2000 flake8 errors
in Python code. `lintrunner` now runs all required python lints
including `ruff`(replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.

Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.

Lints that are more complicated to fix are suppressed with `# noqa` for
now and should be fixed in follow-up PRs.

### Notable changes

1. This PR **removed some suboptimal patterns**:

	- `not xxx in` -> `xxx not in` membership checks
	- bare excepts (`except:` -> `except Exception`)
	- unused imports
	
	The follow up PR will remove:
	
	- `import *`
	- mutable values as default in function definitions (`def func(a=[])`)
	- more unused imports
	- unused local variables

2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.

3. Removed the legacy flake8 ci flow and updated docs.

4. The added workflow supports SARIF code scanning reports on github,
example snapshot:
	

![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png)

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant
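
To illustrate two of the patterns listed under item 1, here is a small standalone example. The function and variable names are made up and do not come from the code base.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, at definition time, and then shared.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # A fresh list is created on every call that omits the argument.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


append_bad("a")
second_bad = append_bad("b")  # still sees "a" from the previous call
print(second_bad)             # ['a', 'b']

print(append_good("a"), append_good("b"))  # ['a'] ['b']

# Membership test: same result, but the second spelling reads unambiguously.
nodes = {"Add", "MatMul"}
assert (not "Gemm" in nodes) == ("Gemm" not in nodes)
```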

### Motivation and Context

Unified linting experience in CI and local.

Replacing https://github.com/microsoft/onnxruntime/pull/14306

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-03-24 15:29:03 -07:00
PeixuanZuo
7eb6dbe7d8
[ROCm] Add compute type for Skiplayernorm to fix ROCm CI (#15192)
- Add compute type for Skiplayernorm to fix ROCm CI and get more
accurate results.

SkipLayerNorm:
type T: input, skip, bias
type U: epsilon, compute result
type V: output, beta, gamma

- refactor the usage of aligned_vector, reduce the usage of
`reinterpret_cast`.
2023-03-24 19:31:14 +08:00
Ye Wang
44ba23e0f5
Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166)
### Description

As synced offline, rename this op; another op will be created for MHA
that supports both self and cross attention.

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-03-23 12:31:38 -07:00
Hariharan Seshadri
7033346605
Support mask_filter_value attribute in DecoderMaskedMultiheadAttention (#15158) 2023-03-23 11:00:09 -07:00
Tianlei Wu
88a66a289b
Fix prune_graph and gpt attention fusion scripts (#15147)
Fix two issues: (1) GPT attention fusion: get_parent could return None when the input is
an initializer, so add a check. (2) An ONNX node could have optional inputs and outputs. During
prune_graph, we shall exclude empty inputs/outputs. Here we exclude ""
from output_name_to_node and input_name_to_nodes.

Add an option allow_remove_graph_inputs in prune_graph
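
A minimal sketch of the second fix, assuming a stand-in `Node` class that mimics the `name`, `input`, and `output` fields of an ONNX node. The function bodies are illustrative and are not the actual script code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # stand-in mimicking the name/input/output fields of an ONNX node
    name: str
    input: list = field(default_factory=list)
    output: list = field(default_factory=list)


def output_name_to_node(nodes):
    """Map each produced tensor name to its node, skipping "" placeholders."""
    return {out: node for node in nodes for out in node.output if out}


def input_name_to_nodes(nodes):
    """Map each consumed tensor name to the nodes that read it."""
    mapping = {}
    for node in nodes:
        for name in node.input:
            if name:  # "" marks an omitted optional input
                mapping.setdefault(name, []).append(node)
    return mapping


nodes = [
    Node("ln", input=["x", "", "bias"], output=["y", ""]),
    Node("add", input=["y", "x"], output=["z"]),
]
producers = output_name_to_node(nodes)
consumers = input_name_to_nodes(nodes)
print(sorted(producers), sorted(consumers))
```

Without the emptiness check, every omitted optional input or output would be keyed under `""`, so unrelated nodes would appear connected during pruning.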
2023-03-23 09:45:16 -07:00
pengwa
1d32285536
Statistics tool for ORTModule convergence parity (#15020)
### Statistics tool for ORTModule convergence parity

As ORTModule gets more and more validated, it is pretty fast to
integrate a PyTorch-based model with ORT.

At the same time, we need to make sure that once there is a convergence
issue, we don't spend months investigating it. As part of this effort,
this PR introduces a tool to dump activation statistics without much
involvement from users. The dumped results contain only some statistics
plus sampled data, which is not big; compared with dumping all
the tensors, it is much faster and more space efficient.

To use it, two lines are needed before wrapping ORTModule.
The same trick also needs to be applied to the baseline run.

```
+	from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+	SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```

Once you run the steps, the following command can be used to merge the
results into a per-step summary for the ORT and baseline runs respectively.
 
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```

Docs are added here as part of this PR [convergence investigation
notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md)

Based on the generated merged files, we can compare them with tools. 


![image](https://user-images.githubusercontent.com/10530022/224653929-4e4480bd-bb02-4bbe-bd44-2672bdf91a87.png)

### Design and Implementation

This PR introduces a common mechanism for registering custom logic for
nn.Module's post-forward hooks. Statistics for activations
(StatisticsSubscriber) is one of the implementations. If there are other
needs, we can define another XXSubscriber to do the customized things.
2023-03-23 20:34:24 +08:00
cloudhan
039ca10822
Move offline_tuning.py, so that the utility will be packaged with whl distribution (#15124)
Just file move.
2023-03-23 15:24:41 +08:00
cloudhan
71b67ec1e2
Refactor ke register to be decentralized (#15036)
So that we can remove all unnecessary header files
2023-03-22 14:49:26 +08:00
Tianlei Wu
3e2d453b64
Supports model > 2GB in fp16 conversion with onnx shape inference (#15067)
(1) Allow model to be path, and use infer_shapes_path to fix
https://github.com/microsoft/onnxruntime/issues/15063
(2) Add some logging for float data truncation
(3) Add RandomUniformLike to default op_block_list
(4) Some minor changes to use f string.
2023-03-21 15:08:28 -07:00
Faith Xu
ef76b3aeb8
Transformers tool - update readme to link to docs page (#14964)
### Description
Transformers tool documentation has been moved to:
https://onnxruntime.ai/docs/performance/transformers-optimization.html
2023-03-21 11:56:19 -07:00
cloudhan
98ab4a62d6
Fix ROCm 5.2.3 pipeline (#15073)
Make CK optional again.
2023-03-17 15:59:57 +08:00
cloudhan
a5ab88247b
ROCm Flash Attention (#14838)
Adds flash attention via composable kernel for ROCm EP
2023-03-16 10:39:58 +08:00