Commit graph

11 commits

Author SHA1 Message Date
Justin Chu
3d2ddf96e3
Bump ruff linter to 0.2.1 (#19471)
### Motivation and Context

Include new lint rules
2024-02-08 16:08:27 -08:00
Tianlei Wu
c695de91ee
Update eval_squad to use API of latest optimum (#17918)
Update eval_squad with latest optimum. 

Tested with:
* optimum 1.13.1
* transformers 4.31.0
* onnxruntime-gpu 1.16.0
* onnx 1.14.1
* datasets 2.14.5
* evaluate 0.4.0
* torch version 2.2.0.dev20230920+cu121

Example output in A100:

{'exact': 86.66035950804162, 'f1': 92.99622739711005, 'total': 10570,
'HasAns_exact': 86.66035950804162, 'HasAns_f1': 92.99622739711005,
'HasAns_total': 10570, 'best_exact': 86.66035950804162,
'best_exact_thresh': 0.9998456239700317, 'best_f1': 92.9962273971104,
'best_f1_thresh': 0.9998456239700317, 'total_time_in_seconds':
84.74025378189981, 'samples_per_second': 124.73410838731417,
'latency_in_seconds': 0.008017053337928081, 'provider':
'CUDAExecutionProvider', 'disable_fused_attention': False,
'pretrained_model_name':
'bert-large-uncased-whole-word-masking-finetuned-squad', 'onnx_path':
'./bert-large-uncased-whole-word-masking-finetuned-squad/optimized_model.onnx',
'batch_size': 1, 'sequence_length': 384, 'use_io_binding': True}
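
The throughput and latency fields in this output look like they are derived from `total` and `total_time_in_seconds`. The check below is a minimal sketch under that assumption; the relationship is inferred from the reported numbers, not taken from the eval_squad.py source.

```python
import math

# Values copied from the example output above.
total = 10570
total_time_in_seconds = 84.74025378189981
samples_per_second = 124.73410838731417
latency_in_seconds = 0.008017053337928081

# Assumed relationships (inferred from the numbers, not from the script).
derived_throughput = total / total_time_in_seconds
derived_latency = total_time_in_seconds / total

print(math.isclose(derived_throughput, samples_per_second, rel_tol=1e-6))
print(math.isclose(derived_latency, latency_in_seconds, rel_tol=1e-6))
```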
2023-10-13 10:39:15 -07:00
Tianlei Wu
d65aa5400c
clean up transformers scripts (#17179)
(1) Remove class BertOptimizationOptions, which was deprecated a long
time ago
(2) Move sys path settings to `__init__.py`, and update imports
(3) Fix bert_perf_test to run properly.
(4) Fix an onnx path in a whisper test case
(5) Fix a few typos
(6) Update comments in bert_perf_test regarding graph inputs
2023-08-17 23:14:49 -07:00
PeixuanZuo
ebcd9b5cae
Fix deprecated optimum interface (#17112)
The `latest_model_name` argument to create an {self.__class__.__name__}
is deprecated since optimum 1.6.0. Replace it with `model_name`
2023-08-16 12:33:36 +08:00
Justin Chu
0c1a5098dc
Disable PERF* rules in ruff to allow better readability (#16834)
### Description

Disable two PERF* rules in ruff to allow better readability. The
rationale is commented inline. This change also removes the noqa
directives that became unused because of the rule change.

### Motivation and Context

Readability
2023-07-25 15:38:22 -07:00
Justin Chu
d79515041c
[Better Engineering] Bump ruff to 0.0.278 and fix new lint errors (#16789)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ #16789

Bump ruff to 0.0.278 and fix new lint errors. I added noqa to all
existing RUF012 errors (the rule requires mutable class variables to be
annotated with `ClassVar`), as well as to all PERF issues.
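
As an illustration of what RUF012 asks for, here is a minimal sketch. The class and attribute names are hypothetical and do not come from the onnxruntime code base.

```python
from typing import ClassVar


class FusionRegistry:  # hypothetical class, for illustration only
    # RUF012 flags a mutable class attribute such as `supported_ops = []`
    # unless it is annotated with ClassVar.
    supported_ops: ClassVar[list[str]] = ["Attention", "LayerNormalization"]

    # The alternative for existing code: suppress the rule on that line.
    legacy_ops = ["Gelu"]  # noqa: RUF012


print(FusionRegistry.supported_ops)
```

The annotation makes it explicit that the list is shared by all instances rather than being a per-instance default.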

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-07-21 12:53:41 -07:00
Justin Chu
d834ec895a
Adopt lintrunner as the linting tool - take 2 (#15085)
### Description

`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and macOS.
The checks are enforced by the `Python format` workflow.

This PR adopts `lintrunner` in onnxruntime and fixes ~2000 flake8 errors
in Python code. `lintrunner` now runs all required Python lints,
including `ruff` (replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.

Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.

Lints that are more complicated to fix are suppressed with `# noqa` for
now and should be fixed in follow-up PRs.

### Notable changes

1. This PR **removed some suboptimal patterns**:

	- `not xxx in` -> `xxx not in` membership checks
	- bare excepts (`except:` -> `except Exception`)
	- unused imports
	
	The follow up PR will remove:
	
	- `import *`
	- mutable values as default in function definitions (`def func(a=[])`)
	- more unused imports
	- unused local variables

2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.

3. Removed the legacy flake8 ci flow and updated docs.

4. The added workflow supports SARIF code scanning reports on github,
example snapshot:
	

![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png)

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant
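
The patterns named in item 1 can be sketched as follows. The function names are hypothetical and chosen only for illustration; the comments show the form the linter steers away from.

```python
def find_missing(required, available=None):
    # Avoid a mutable default such as `available=[]`; it is shared across calls.
    if available is None:
        available = []
    # Prefer `name not in available` over `not name in available`.
    return [name for name in required if name not in available]


def parse_int(text, fallback=0):
    try:
        return int(text)
    # `except Exception` instead of a bare `except:`, which would also
    # swallow KeyboardInterrupt and SystemExit.
    except Exception:
        return fallback


print(find_missing(["onnx", "torch"], ["onnx"]))
print(parse_int("384"), parse_int("n/a"))
```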

### Motivation and Context

Unified linting experience in CI and local.

Replacing https://github.com/microsoft/onnxruntime/pull/14306

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-03-24 15:29:03 -07:00
Tianlei Wu
abe1642a0c
Update fusion for distilbert accuracy test on SQuAD (#13748)
(1) Embed layer fusion to work with --use_mask_index.
(2) Parse num_heads and hidden_size from a pattern of Concat shape node.
(3) Fix a typo (CUDAExcecutionProvider=> CUDAExecutionProvider) in eval_squad.py
(4) Update example comments in eval_squad.py to use optimized fp16 model.
(5) Update tests in test_optimizer.py
2022-11-29 13:06:39 -08:00
Tianlei Wu
e306b44e98
Improve coverage of fused MHA in Attention (#13732)
Previously, fused attention was applied only to a limited set of
sequence lengths (64, 96, 128, 256, 384, 512). This change expands
support to all sequence lengths <= 384 for V100 and T4, or <= 512 for
A100.

Previously, fused attention only worked for batch_size=1. After this
change, fused MHA has no limit on batch_size.
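
The coverage rule described above can be written down as a small lookup. This is a hypothetical helper that mirrors only the limits stated in this message; it is not onnxruntime's actual kernel selection logic, which may depend on other conditions.

```python
# Hypothetical table: maximum sequence length per GPU, as stated above.
MAX_FUSED_SEQ_LEN = {"V100": 384, "T4": 384, "A100": 512}


def fused_mha_covers(gpu: str, sequence_length: int) -> bool:
    """Return True if the stated limits cover this GPU and sequence length."""
    limit = MAX_FUSED_SEQ_LEN.get(gpu)
    return limit is not None and 0 < sequence_length <= limit


print(fused_mha_covers("T4", 300))    # a length that was not in the old fixed list
print(fused_mha_covers("T4", 512))
print(fused_mha_covers("A100", 512))
```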

## Accuracy Tests on SQuAD

Using optimized fp16 onnx model of
distilbert-base-cased-distilled-squad, we test the CUDA EP with IO
Binding using eval_squad.py:

disable_fused_attention | batch_size | sequence_length | exact | f1 | samples_per_second | latency_in_ms
-- | -- | -- | -- | -- | -- | --
TRUE | 1 | 384 | 79.6 | 86.8 | 283.5 | 3.5
TRUE | 2 | 384 | 79.6 | 86.8 | 308.3 | 3.2
FALSE | 1 | 384 | 79.6 | 86.8 | 313.2 | 3.2
FALSE | 2 | 384 | 79.6 | 86.8 | 340.9 | 2.9
TRUE | 1 | 300 | 79.3 | 86.6 | 278.5 | 3.6
TRUE | 2 | 300 | 79.4 | 86.6 | 301.8 | 3.3
FALSE | 1 | 300 | 79.4 | 86.6 | 305.8 | 3.3
FALSE | 2 | 300 | 79.4 | 86.6 | 335.9 | 3.0

The results show that the same accuracy is achieved with and without fused attention.

Note that the latency numbers here are just for reference
(eval_squad.py has not been optimized for speed). We can see that it is
about 10% faster with fused attention than without fused attention.
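
The "about 10%" figure can be checked against the samples_per_second column of the table above. A minimal sketch, with the values copied by hand from that table:

```python
# (batch_size, sequence_length): (samples/s without fused, samples/s with fused)
throughput = {
    (1, 384): (283.5, 313.2),
    (2, 384): (308.3, 340.9),
    (1, 300): (278.5, 305.8),
    (2, 300): (301.8, 335.9),
}

# Relative throughput gain of fused attention over the unfused path.
speedup_pct = {
    key: (fused / unfused - 1.0) * 100.0
    for key, (unfused, fused) in throughput.items()
}

for key, pct in speedup_pct.items():
    print(key, f"{pct:.1f}%")
```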

Package versions used: onnx 1.12.0, torch 1.13.0, transformers 4.24.0,
optimum 1.5.0, datasets 2.7.0, evaluate 0.3.0

## Performance Test of bert-base-cased on T4 GPU
```
sudo nvidia-smi -rgc
export ORT_DISABLE_FUSED_ATTENTION=0
python benchmark.py -m bert-base-cased -e onnxruntime -g -p fp16 -o by_script -i 3 -t 1000 -b 1 8  -s 8 16 32 64 80 96 120 128 --use_mask_index --overwrite
```

Disable_Fused_Attention | b1_s8 | b1_s16 | b1_s32 | b1_s64 | b1_s80 | b1_s96 | b1_s120 | b1_s128 | b8_s8 | b8_s16 | b8_s32 | b8_s64 | b8_s80 | b8_s96 | b8_s120 | b8_s128
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
FALSE | 1.32 | 1.28 | 1.33 | 1.51 | 1.71 | 1.79 | 1.99 | 2.04 | 1.56 | 1.99 | 2.85 | 4.88 | 6.03 | 7.03 | 9.2 | 9.34
TRUE | 1.37 | 1.34 | 1.44 | 1.68 | 1.89 | 1.99 | 2.15 | 2.21 | 1.63 | 2.31 | 3.19 | 5.48 | 6.98 | 8.14 | 10.54 | 10.66
Latency Reduction | 3.6% | 4.5% | 7.6% | 10.1% | 9.5% | 10.1% | 7.4% | 7.7% | 4.3% | 13.9% | 10.7% | 10.9% | 13.6% | 13.6% | 12.7% | 12.4%

Perf gain is observed in all sequence lengths tested.
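
The Latency Reduction row is consistent with computing (unfused - fused) / unfused per column. A short sketch that reproduces it from the two latency rows, assuming that formula (it is inferred from the numbers, not stated in the message):

```python
# Latencies copied from the table above, columns b1_s8 ... b8_s128.
fused = [1.32, 1.28, 1.33, 1.51, 1.71, 1.79, 1.99, 2.04,
         1.56, 1.99, 2.85, 4.88, 6.03, 7.03, 9.2, 9.34]
unfused = [1.37, 1.34, 1.44, 1.68, 1.89, 1.99, 2.15, 2.21,
           1.63, 2.31, 3.19, 5.48, 6.98, 8.14, 10.54, 10.66]

# Percentage reduction relative to the unfused (Disable_Fused_Attention=TRUE) run.
reduction_pct = [round((u - f) / u * 100, 1) for f, u in zip(fused, unfused)]

print(reduction_pct)
```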
2022-11-23 10:19:04 -08:00
Ted Themistokleous
9168e25738
Patch eval_squad.py script for Python < 3.8 and multiple Execution Providers (#13524)
Need this for benchmarks to function correctly with older containers

This fixes import errors when attempting to run eval_squad.py to
evaluate bert distilled models

Adds a change to the previously merged #12947 which fails when using
Python version < 3.8 to run this script.

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2022-11-23 15:37:39 +08:00
Tianlei Wu
d80212d42c
Add script for question answering (SQuAD) accuracy evaluation of BERT model (#12947)
Add script to evaluate accuracy of BERT/DistilBERT/Roberta models on question-answering task.

By default, pretrained model
`bert-large-uncased-whole-word-masking-finetuned-squad` will be used if
model name is not specified. If onnx path is not specified, optimum will
be used to export an ONNX model for testing.

Example usage:

* Evaluate with CPU execution provider:
`python eval_squad.py`

* Evaluate with CUDA execution provider:
`python eval_squad.py --use_gpu`

* Evaluate an optimized onnx model for
'distilbert-base-cased-distilled-squad' with sequence lengths
128/192/256/384 on first 100 samples:
`python eval_squad.py -m distilbert-base-cased-distilled-squad --use_gpu
-s 128 192 256 384 --onnx_path ./optimized_fp16.onnx -t 100`
2022-10-25 09:21:01 -07:00