onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-11 00:49:31 +00:00

Author	SHA1	Message	Date
Justin Chu	3d2ddf96e3	Bump ruff linter to 0.2.1 (#19471) ### Motivation and Context Include new lint rules	2024-02-08 16:08:27 -08:00
Tianlei Wu	c695de91ee	Update eval_squad to use API of latest optimum (#17918) Update eval_squad to work with the latest optimum. Tested with: * optimum 1.13.1 * transformers 4.31.0 * onnxruntime-gpu 1.16.0 * onnx 1.14.1 * datasets 2.14.5 * evaluate 0.4.0 * torch version 2.2.0.dev20230920+cu121 Example output on A100: {'exact': 86.66035950804162, 'f1': 92.99622739711005, 'total': 10570, 'HasAns_exact': 86.66035950804162, 'HasAns_f1': 92.99622739711005, 'HasAns_total': 10570, 'best_exact': 86.66035950804162, 'best_exact_thresh': 0.9998456239700317, 'best_f1': 92.9962273971104, 'best_f1_thresh': 0.9998456239700317, 'total_time_in_seconds': 84.74025378189981, 'samples_per_second': 124.73410838731417, 'latency_in_seconds': 0.008017053337928081, 'provider': 'CUDAExecutionProvider', 'disable_fused_attention': False, 'pretrained_model_name': 'bert-large-uncased-whole-word-masking-finetuned-squad', 'onnx_path': './bert-large-uncased-whole-word-masking-finetuned-squad/optimized_model.onnx', 'batch_size': 1, 'sequence_length': 384, 'use_io_binding': True}	2023-10-13 10:39:15 -07:00
Tianlei Wu	d65aa5400c	clean up transformers scripts (#17179) (1) Remove class BertOptimizationOptions, which was deprecated long ago (2) Move sys path settings to `__init__.py`, and update imports (3) Fix bert_perf_test to run properly. (4) Fix an ONNX path in a whisper test case (5) Fix a few typos (6) Update comments in bert_perf_test regarding graph inputs	2023-08-17 23:14:49 -07:00
PeixuanZuo	ebcd9b5cae	Fix deprecated optimum interface (#17112) The `latest_model_name` argument to create an {self.__class__.__name__} is deprecated since optimum 1.6.0. Replace it with `model_name`	2023-08-16 12:33:36 +08:00
Justin Chu	0c1a5098dc	Disable PERF* rules in ruff to allow better readability (#16834) ### Description Disable two PERF* rules in ruff to allow better readability. Rationale commented inline. This change also removes the noqa directives left unused by the rule change. ### Motivation and Context Readability	2023-07-25 15:38:22 -07:00
Justin Chu	d79515041c	[Better Engineering] Bump ruff to 0.0.278 and fix new lint errors (#16789) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #16789 Bump ruff to 0.0.278 and fix new lint errors. I added noqa to all existing RUF012 errors (the rule requires mutable class variables to be annotated with `ClassVar`), as well as to all PERF issues. Signed-off-by: Justin Chu <justinchu@microsoft.com>	2023-07-21 12:53:41 -07:00
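As a rough illustration of what the RUF012 rule mentioned above asks for, the sketch below annotates a mutable class attribute with `ClassVar`. The class `FusionRegistry` and its members are hypothetical and are not taken from the onnxruntime code base.

```python
from typing import ClassVar, List


class FusionRegistry:
    """Hypothetical class used only to illustrate the ClassVar annotation."""

    # Without the ClassVar annotation, ruff's RUF012 would flag this mutable
    # class-level default, because it is shared by every instance.
    fused_ops: ClassVar[List[str]] = []

    def register(self, op_type: str) -> None:
        # Appends to the single list shared at class level.
        self.fused_ops.append(op_type)


first = FusionRegistry()
second = FusionRegistry()
first.register("Attention")

# Both instances see the same list, which is exactly what ClassVar documents.
shared_view = second.fused_ops
```

The annotation does not change runtime behavior; it only makes the sharing explicit to readers and type checkers.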
Justin Chu	d834ec895a	Adopt lintrunner as the linting tool - take 2 (#15085) ### Description `lintrunner` is a linter runner successfully used by pytorch, onnx and onnx-script. It provides a uniform experience running linters locally and in CI. It supports all major dev systems: Windows, Linux and macOS. The checks are enforced by the `Python format` workflow. This PR adopts `lintrunner` in onnxruntime and fixes ~2000 flake8 errors in Python code. `lintrunner` now runs all required Python lints, including `ruff` (replacing `flake8`), `black` and `isort`. Future lints like `clang-format` can be added. Most errors are auto-fixed by `ruff` and the fixes should be considered robust. Lints that are more complicated to fix are suppressed with `# noqa` for now and should be fixed in follow-up PRs. ### Notable changes 1. This PR removed some suboptimal patterns: - `not xxx in` -> `xxx not in` membership checks - bare excepts (`except:` -> `except Exception`) - unused imports The follow-up PR will remove: - `import *` - mutable values as default in function definitions (`def func(a=[])`) - more unused imports - unused local variables 2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than flake8 and is more robust. We are using it successfully in onnx and onnx-script. It also supports auto-fixing many flake8 errors. 3. Removed the legacy flake8 CI flow and updated docs. 4. The added workflow supports SARIF code scanning reports on GitHub, example snapshot: ![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png) 5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant ### Motivation and Context Unified linting experience in CI and local. Replacing https://github.com/microsoft/onnxruntime/pull/14306 --------- Signed-off-by: Justin Chu <justinchu@microsoft.com>	2023-03-24 15:29:03 -07:00
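The patterns named in the commit above (membership checks, bare excepts, mutable default arguments) can be sketched as follows. The function names are hypothetical and the snippet is only an illustration of the preferred forms, not code from the PR.

```python
def is_missing(key, mapping):
    # Preferred form: `key not in mapping` rather than `not key in mapping`.
    return key not in mapping


def parse_int(text):
    try:
        return int(text)
    except Exception:  # instead of a bare `except:`, which also traps SystemExit
        return None


def append_item(item, items=None):
    # A mutable default such as `items=[]` is created once and shared across
    # calls; using None and creating the list inside avoids that.
    if items is None:
        items = []
    items.append(item)
    return items


first_call = append_item("a")
second_call = append_item("b")
```

With the `None` default, `first_call` and `second_call` are independent lists; with `def append_item(item, items=[])` the second call would also contain `"a"`.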
Tianlei Wu	abe1642a0c	Update fusion for distilbert accuracy test on SQuAD (#13748) (1) Embed layer fusion to work with --use_mask_index. (2) Parse num_heads and hidden_size from a pattern of Concat shape node. (3) Fix a typo (CUDAExcecutionProvider => CUDAExecutionProvider) in eval_squad.py (4) Update example comments in eval_squad.py to use optimized fp16 model. (5) Update tests in test_optimizer.py	2022-11-29 13:06:39 -08:00
Tianlei Wu	e306b44e98	Improve coverage of fused MHA in Attention (#13732) Previously, fused attention was applied only to a limited set of sequence lengths (64, 96, 128, 256, 384, 512). This change expands support to all sequence lengths <= 384 for V100 and T4, or <= 512 for A100. Previously, fused attention only worked for batch_size=1. After this change, fused MHA has no limit on batch_size.

## Accuracy Tests on SQuAD

Using the optimized fp16 onnx model of distilbert-base-cased-distilled-squad, we test the CUDA EP with IO Binding using eval_squad.py:

| disable_fused_attention | batch_size | sequence_length | exact | f1 | samples_per_second | latency_in_ms |
| -- | -- | -- | -- | -- | -- | -- |
| TRUE | 1 | 384 | 79.6 | 86.8 | 283.5 | 3.5 |
| TRUE | 2 | 384 | 79.6 | 86.8 | 308.3 | 3.2 |
| FALSE | 1 | 384 | 79.6 | 86.8 | 313.2 | 3.2 |
| FALSE | 2 | 384 | 79.6 | 86.8 | 340.9 | 2.9 |
| TRUE | 1 | 300 | 79.3 | 86.6 | 278.5 | 3.6 |
| TRUE | 2 | 300 | 79.4 | 86.6 | 301.8 | 3.3 |
| FALSE | 1 | 300 | 79.4 | 86.6 | 305.8 | 3.3 |
| FALSE | 2 | 300 | 79.4 | 86.6 | 335.9 | 3.0 |

It shows that the same accuracy could be achieved with or without fused attention. Note that the latency numbers here are just for reference (eval_squad.py has not been optimized for speed). We can see that it is about 10% faster with fused attention than without.

Versions of packages used: onnx 1.12.0, torch 1.13.0, transformers 4.24.0, optimum 1.5.0, datasets 2.7.0, evaluate 0.3.0

## Performance Test of bert-base-cased on T4 GPU

```
sudo nvidia-smi -rgc
export ORT_DISABLE_FUSED_ATTENTION=0
python benchmark.py -m bert-base-cased -e onnxruntime -g -p fp16 -o by_script -i 3 -t 1000 -b 1 8 -s 8 16 32 64 80 96 120 128 --use_mask_index --overwrite
```

| Disable_Fused_Attention | b1_s8 | b1_s16 | b1_s32 | b1_s64 | b1_s80 | b1_s96 | b1_s120 | b1_s128 | b8_s8 | b8_s16 | b8_s32 | b8_s64 | b8_s80 | b8_s96 | b8_s120 | b8_s128 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| FALSE | 1.32 | 1.28 | 1.33 | 1.51 | 1.71 | 1.79 | 1.99 | 2.04 | 1.56 | 1.99 | 2.85 | 4.88 | 6.03 | 7.03 | 9.2 | 9.34 |
| TRUE | 1.37 | 1.34 | 1.44 | 1.68 | 1.89 | 1.99 | 2.15 | 2.21 | 1.63 | 2.31 | 3.19 | 5.48 | 6.98 | 8.14 | 10.54 | 10.66 |
| Latency Reduction | 3.6% | 4.5% | 7.6% | 10.1% | 9.5% | 10.1% | 7.4% | 7.7% | 4.3% | 13.9% | 10.7% | 10.9% | 13.6% | 13.6% | 12.7% | 12.4% |

Perf gain is observed in all sequence lengths tested.	2022-11-23 10:19:04 -08:00
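The Latency Reduction row in the performance table above appears consistent with taking the latency saved by fused attention as a fraction of the latency with fused attention disabled. A minimal sketch of that arithmetic follows, using four columns copied from the table; the function name and the formula are an inference from the numbers, not something stated in the commit.

```python
def latency_reduction_percent(fused, unfused):
    """Percentage of latency saved relative to the run with fused attention disabled."""
    return (unfused - fused) / unfused * 100.0


# Values copied from the table: Disable_Fused_Attention FALSE (fused) and TRUE (unfused).
fused_latency = {"b1_s8": 1.32, "b1_s64": 1.51, "b8_s16": 1.99, "b8_s128": 9.34}
unfused_latency = {"b1_s8": 1.37, "b1_s64": 1.68, "b8_s16": 2.31, "b8_s128": 10.66}

reductions = {
    name: round(latency_reduction_percent(fused_latency[name], unfused_latency[name]), 1)
    for name in fused_latency
}
print(reductions)
```

For these four columns the rounded results match the reported 3.6%, 10.1%, 13.9% and 12.4%.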
Ted Themistokleous	9168e25738	Patch eval_squad.py script for Python < 3.8 and multiple Execution Providers (#13524) Needed for benchmarks to function correctly with older containers. This fixes import errors when attempting to run eval_squad.py to evaluate distilled BERT models. Adds a change to the previously merged #12947, which fails when the script is run with Python < 3.8. Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2022-11-23 15:37:39 +08:00
Tianlei Wu	d80212d42c	Add script for question answering (SQuAD) accuracy evaluation of BERT model (#12947) Add a script to evaluate the accuracy of BERT/DistilBERT/RoBERTa models on the question-answering task. By default, the pretrained model `bert-large-uncased-whole-word-masking-finetuned-squad` is used if no model name is specified. If no onnx path is specified, optimum is used to export an ONNX model for testing. Example usage: * Evaluate with CPU execution provider: `python eval_squad.py` * Evaluate with CUDA execution provider: `python eval_squad.py --use_gpu` * Evaluate an optimized onnx model for 'distilbert-base-cased-distilled-squad' with sequence lengths 128/192/256/384 on first 100 samples: `python eval_squad.py -m distilbert-base-cased-distilled-squad --use_gpu -s 128 192 256 384 --onnx_path ./optimized_fp16.onnx -t 100`	2022-10-25 09:21:01 -07:00
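To make the example invocations above easier to read, here is a hypothetical `argparse` stand-in that accepts the same flags (`-m`, `--use_gpu`, `-s`, `--onnx_path`, `-t`). It is not the parser from eval_squad.py: the destination names and the defaults for `-s` and `-t` are assumptions made for illustration, and only the default model name comes from the commit message.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the eval_squad.py command line, for illustration only."""
    parser = argparse.ArgumentParser(description="Sketch of the flags shown in the usage examples")
    # Default model name as stated in the commit message.
    parser.add_argument("-m", dest="model_name",
                        default="bert-large-uncased-whole-word-masking-finetuned-squad")
    parser.add_argument("--use_gpu", action="store_true")
    # Assumed default; the commit message does not state one.
    parser.add_argument("-s", dest="sequence_lengths", nargs="+", type=int, default=[384])
    parser.add_argument("--onnx_path", default=None)
    # Assumed convention: 0 means "evaluate all samples".
    parser.add_argument("-t", dest="total", type=int, default=0)
    return parser


# The third usage example from the commit message, split into argv tokens.
example_args = build_parser().parse_args(
    ["-m", "distilbert-base-cased-distilled-squad", "--use_gpu",
     "-s", "128", "192", "256", "384",
     "--onnx_path", "./optimized_fp16.onnx", "-t", "100"]
)
default_args = build_parser().parse_args([])
```

Parsing an empty argument list corresponds to the plain `python eval_squad.py` invocation, which the commit describes as a CPU run with the default model.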

11 commits