onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-08 00:23:03 +00:00

Author	SHA1	Message	Date
sfatimar	ebaafac3f5	Openvino ep ort 5.0 (#15626 ) ### Description This PR: - Adds VPU support to the OpenVINO Execution Provider - Fixes bugs for GPU and CPU - Changes the OpenVINO backend to use the Serialized Model API for faster first-inference latency - Deprecates HDDL-VADM and MYRIAD and removes the related code - Supports OpenVINO 2023.0 - Adds dynamic shapes support for iGPU ### Motivation and Context - VPU is upcoming hardware that can provide AI acceleration for client systems through OpenVINO --------- Signed-off-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-04-25 20:59:42 -07:00
Ye Wang	d05777ddb6	stabilize fusion script with a separate create_attention_node() (#15670 ) ### Description Previously the script used create_attention_node() from the base class in fusion_attention.py, so changes in that file could silently lead to generating a bad model. ### Motivation and Context --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-04-25 13:07:58 -07:00
cloudhan	d1354dcc83	[ROCm] Add stable diffusion benchmark results for MI100 (#15646 )	2023-04-23 18:29:35 +08:00
cloudhan	8297148bde	[ROCm] Update benchmark for stable diffusion (#15602 ) 1. update scripts for ROCm memory measurement. 2. update README to contain ROCm result. 3. address some minor issue in the README	2023-04-23 11:49:40 +08:00
kunal-vaishnavi	3de33e00c7	Fix issues for Whisper export with beam search (#15619 ) ### Description This PR fixes an issue with calling the ORT transformer optimizer script on the custom export of Whisper with beam search. It also includes the [fix](https://github.com/microsoft/onnxruntime/pull/15616) for the GPU out-of-memory issue. ### Motivation and Context With this PR fix, the optimizer runs as described in the [Whisper model optimization PR](https://github.com/microsoft/onnxruntime/pull/15473).	2023-04-21 00:08:58 -07:00
Yufeng Li	373f912e51	add quantization support for whisper (#15589 ) ### Description Add dynamic quantization support for the Whisper model. There are 3 options to try out: - quantize_embedding_layer: whether to quantize the embedding layer of the decoder model - quantize_per_channel: whether to quantize per channel for Gemm or MatMul - quantize_reduce_range: use 7-bit quantization for MatMul or Gemm. Use this when hitting accuracy issues on x64 CPUs without VNNI.	2023-04-20 14:22:11 -07:00
PeixuanZuo	59ea35d592	[ROCm] add CK GroupNorm to GroupNormTunable (#15510 ) - Add CK GroupNorm to GroupNormTunable. - Reduce configuration of GroupNormNHWCOp because CK implementation is better. The performance gain on stable diffusion v1.5. Before: ``` 'height': 512 'width': 512 'steps': 50 'batch_size': 1 'batch_count': 5 'num_prompts': 1 'average_latency': 2.4782688856124877 'median_latency': 2.4783748388290405 'provider': 'ROCMExecutionProvider' 'disable_safety_checker': True ``` After: ``` 'height': 512, 'width': 512, 'steps': 50, 'batch_size': 1, 'batch_count': 5, 'num_prompts': 1, 'average_latency': 2.107170510292053, 'median_latency': 2.1067750453948975, 'first_run_memory_MB': -1, 'second_run_memory_MB': -1, 'provider': 'ROCMExecutionProvider', 'disable_safety_checker': True ```	2023-04-19 13:54:59 +08:00
Chi Lo	6115c8fd1f	Add TRT plugins support using custom ops (#13847 ) This PR makes ORT support TRT plugins using custom ops. ORT TRT can automatically register all TRT plugins from the TRT plugin registry as custom ops. No code change is needed in ORT when new TRT plugins are introduced. The previous way for ORT to support TRT plugins was using contrib ops, but there are some concerns about it: - Contrib ops are shipped as part of the ORT binary by default. TRT related plugins should not be in the default ORT. - Contrib ops are designed for internal ops and developed for cpu and cuda EPs. Therefore, using custom ops is a good approach to support TRT plugins. The major modifications are as follows: 1. Add a new `GetCustomOpDomainList` provider api which allows a provider to create its own custom op domain list so that ORT can register this domain list. The provider is responsible for freeing all the custom op domain instances it created. 2. Move the OrtCustomOpDomain struct definition to framework_provider_common.h since this struct is now used by both the framework and EPs. 3. There are several TRT plugins registered as onnx schema ops through contrib ops with the onnx domain. In order not to break old models that use those TRT plugins registered with the ONNX domain, and to maintain backward compatibility, we need to keep the old/legacy TRT plugins with the onnx domain. Moving forward, all newly added TRT plugins should be registered with the `trt.plugins` domain. 4. TRT plugins don't have an api to get the number of inputs/outputs of the registered plugins, so ORT TRT uses variadic inputs/outputs to bypass the onnx node validation. 5. Add a new trt provider option, `trt_extra_plugin_lib_paths`, with which the user can specify any extra plugin libs, for example, `fastertransformer/build/lib/libvit_plugin.so` or `fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`	2023-04-18 20:24:32 -07:00
kunal-vaishnavi	901c2bc384	Whisper Model Optimization (#15473 ) ### Description This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper). Some of the added optimizations include: - Pruning of duplicate/unnecessary inputs and outputs - Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source) - Attention fusions - For Whisper's encoder and decoder - Modified symbolic shape inference for present output when no past input exists (for decoder) - Multi-head attention fusions - For Whisper's decoder and decoder with past - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion - Attention kernel changes - CPU: - Different Q and KV sequence lengths - Parallel memset for large sequence lengths - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add - Separate present key-value output into present key and present value (for multi-head attention spec) - CUDA: - Use memory efficient attention compute kernel with present state (for decoder) - Multi-head attention kernel changes - CPU: - Introduction of multi-head attention CPU kernel (previously did not exist) - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past) - Different Q, K, V input shapes - Pass past key and past value directly as key and value - CUDA: - Use memory efficient attention compute kernel with past and/or present state (for decoder with past) ### Usage To use the optimizations, run the ORT transformer optimizer script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, 
depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention ``` Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum): ``` from transformers.onnx.utils import get_preprocessor from optimum.onnxruntime import ORTModelForSpeechSeq2Seq from optimum.pipelines import pipeline as ort_pipeline import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/ directory = './whisper_opt' # Where the optimized ONNX models are located model_name = 'openai/whisper-tiny' device = 'cpu' # Get pipeline processor = get_preprocessor(model_name) model = ORTModelForSpeechSeq2Seq.from_pretrained( directory, use_io_binding=(device == 'cuda'), provider='CPUExecutionProvider', ).to(device) pipe = ort_pipeline( "automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, device=(-1 if device == 'cpu' else 0), ) # Load audio file and run pipeline audio = whisper.load_audio('tests/jfk.flac') audio = whisper.pad_or_trim(audio) outputs = pipe([audio]) print(outputs) ``` Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes: - https://github.com/huggingface/optimum/pull/872 - https://github.com/huggingface/optimum/pull/920 ### Motivation and Context This PR helps the following issues: - https://github.com/microsoft/onnxruntime/issues/15100 - https://github.com/microsoft/onnxruntime/issues/15235 - https://github.com/huggingface/optimum/issues/869 (work in progress) This PR can be used with the other currently merged Whisper PRs: - https://github.com/microsoft/onnxruntime/pull/15247 - https://github.com/microsoft/onnxruntime/pull/15339 - https://github.com/microsoft/onnxruntime/pull/15362 - https://github.com/microsoft/onnxruntime/pull/15365 - https://github.com/microsoft/onnxruntime/pull/15427 This PR uses changes 
from the following merged PRs: - https://github.com/microsoft/onnxruntime/pull/14198 - https://github.com/microsoft/onnxruntime/pull/14146 - https://github.com/microsoft/onnxruntime/pull/14201 - https://github.com/microsoft/onnxruntime/pull/14928 (this introduced the new multi-head attention spec)	2023-04-18 17:13:54 -07:00
Justin Chu	cf19c3697d	Run clang-format in CI (#15524 ) ### Description Run clang-format in CI. Formatted all c/c++, objective-c/c++ files. Excluded ``` 'onnxruntime/core/mlas/', 'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/', ``` because they contain assembly or are data heavy ### Motivation and Context Coding style consistency	2023-04-18 09:26:58 -07:00
liqun Fu	919d8f2660	update with onnx main (#14929 )	2023-04-18 08:42:51 -07:00
Justin Chu	9d26f8f4fe	Use os.fspath on Path (#15530 ) ### Description Use os.fspath instead of str() on a path object. ### Motivation and Context I learned today that os.fspath is the right way to go: https://github.com/charliermarsh/ruff/issues/3675#issuecomment-1494975508	2023-04-17 16:59:40 -07:00
Zhang Lei	a30b57da6e	Fix/Enhance convert_generation tool for SkipLayerNorm, op_block_list... (#15368 ) Now that SkipLayerNorm uses fp32 for internal calculation and a numerically stable algorithm, enable it for fp16 here. Make op_block_list a command line argument to help future tools. Other minor changes.	2023-04-17 14:44:37 -07:00
Justin Chu	a36caba073	Bump ruff in CI (#15533 ) ### Description Bump the ruff version in CI and fix new lint errors. - This change enables the flake8-implicit-str-concat rules, which help detect unintended string concatenations: https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc - Update gitignore to include common python files that we want to exclude. ### Motivation and Context Code quality	2023-04-17 10:11:44 -07:00
Maximilian Müller	fbe88fccbd	Exposing new TRT build options (#15089 ) ### Description This will add a few TRT options, some of which are only available on TRT 8.6: - heuristics - sparsity - optimization level (8.6 only) - auxiliary stream (8.6 only) - tactic source selection I am not sure yet which tests I should add for these options. As those are mostly simple TRT flags, I am not sure to what level I should test. For heuristics, something similar to `44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)` should be possible, but for all others we would essentially only be testing whether there is a crash when the option is set. Also, if I forgot some option that would be good to have, feel free to speak up!	2023-04-14 09:47:36 -07:00
pengwa	bf32dbbd9b	Share more constant initializers (#15461 ) ### Share more constant initializers. The `ConstantSharing` transformer originally only handled single-value initializers (scalar or 1D). This PR shares more cases so that the common subexpression elimination transformer can remove more duplicated nodes. Originally, we used a single vector<std::variant<float,half,int32,int64>> to store different scalar values. In this PR, we create an unordered map whose key is data_type + rank + element count and whose value is a vector of `InitializerValue`. For a specific initializer that fulfils the condition, we find the corresponding vector of `InitializerValue` by its <data_type + rank + element count>, then search the vector to see whether the constant tensor already exists. After that, a value id is returned, which is combined with <data_type + rank + element count> to form the pattern key that decides which tensor to reuse (legacy code). ### Motivation and Context One example we see here is: ```mermaid stateDiagram [] --> LayerNorm(b,s,64) LayerNorm(b,s,64) --> Reshape1 Shape1_Const[bs,64] --> Reshape1 LayerNorm(b,s,64) --> Reshape2 Shape2_Const[bs,64] --> Reshape2 Reshape1 --> AttentionSubGraph Reshape2 --> Add AttentionSubGraph--> Add Add --> [] ``` Ideally CommonSubexpressionElimination can remove one of `Reshape1` and `Reshape2`, but because `Shape1_Const` and `Shape2_Const` are different NodeArg*, it did not remove the duplication. This is one example: removing the duplication will bring more opportunities to apply graph transformations.	2023-04-14 07:41:07 -07:00
Wei-Sheng Chin	d76cf374c4	Capture both ValueError and RuntimeError (#15503 )	2023-04-13 19:29:34 -07:00
PeixuanZuo	ce1eb6d629	[ROCm] Add Tunable GroupNorm (#15298 ) refactor GroupNorm and Add Tunable GroupNorm	2023-04-12 10:55:42 +08:00
Ye Wang	ef42fd09fb	google/mt5 optimization and fix (#15454 ) ### Description 1. enabled self-attention fusion in mt-5 decoder graph 2. fix a parity issue https://github.com/microsoft/onnxruntime/issues/15042 ### Motivation and Context --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-04-11 00:09:11 -07:00
cloudhan	9acbfc6a29	ROCm MHA (#15279 ) Add MultiHeadAttention for ROCm EP. Before: ``` 'engine': 'onnxruntime' 'version': '1.15.0' 'height': 512 'width': 512 'steps': 50 'batch_size': 1 'batch_count': 5 'num_prompts': 1 'average_latency': 3.878769588470459 'median_latency': 3.8792178630828857 'first_run_memory_MB': -1 'second_run_memory_MB': -1 'model_name': 'runwayml/stable-diffusion-v1-5' 'directory': './sd-v1-5-onnx-fp16-nomha' 'provider': 'ROCMExecutionProvider' 'disable_safety_checker': True ``` After: ``` 'engine': 'onnxruntime' 'version': '1.15.0' 'height': 512 'width': 512 'steps': 50 'batch_size': 1 'batch_count': 5 'num_prompts': 1 'average_latency': 2.364924430847168 'median_latency': 2.3650705814361572 'first_run_memory_MB': -1 'second_run_memory_MB': -1 'model_name': 'runwayml/stable-diffusion-v1-5' 'directory': './sd-v1-5-onnx-fp16' 'provider': 'ROCMExecutionProvider' 'disable_safety_checker': True ```	2023-04-11 13:20:44 +08:00
Ye Wang	34f22daf25	Support T5 Beam Search with DecoderMaskedMHA (#15386 ) ### Description tldr: Latency improvement t5-small: 37.8% t5-base: 24.5% Benchmark on V100 Before: T5-small ORT {'test_times': 1, 'latency_variance': '0.00', 'latency_90_percentile': '104.74', 'latency_95_percentile': '104.74', 'latency_99_percentile': '104.74', 'average_latency_ms': '104.74', 'QPS': '19.10', 'parity': True} T5-base ORT {'test_times': 1, 'latency_variance': '0.00', 'latency_90_percentile': '200.93', 'latency_95_percentile': '200.93', 'latency_99_percentile': '200.93', 'average_latency_ms': '200.93', 'QPS': '9.95', 'parity': True} After: T5-small ORT {'test_times': 1, 'latency_variance': '0.00', 'latency_90_percentile': '76.01', 'latency_95_percentile': '76.01', 'latency_99_percentile': '76.01', 'average_latency_ms': '76.01', 'QPS': '26.31', 'parity': True} T5-base ORT {'test_times': 1, 'latency_variance': '0.00', 'latency_90_percentile': '161.40', 'latency_95_percentile': '161.40', 'latency_99_percentile': '161.40', 'average_latency_ms': '161.40', 'QPS': '12.39', 'parity': True} ### Motivation and Context --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-04-08 12:50:18 -07:00
Ryan Hill	56beac4b5b	VIT model handling in the Benchmark.sh file (#15045 ) ### Description Adds VIT model type to the benchmark Also adds Swin (v1) model type ### Motivation and Context Image models are important and we should verify these work as expected at the performance we expect.	2023-04-07 20:17:29 -07:00
Hector Li	276c0a00e4	Reuse QDQConv for ConvTranspose to generate the QDQ model (#15385 ) ### Description Reuse QDQConv for ConvTranspose to generate the QDQ model ### Motivation and Context Generate the correct QDQ model	2023-04-06 15:07:44 -07:00
petermcaughan	2bd8e4a130	Petermca/whisper dedup (#15365 ) ### Description Apply `get_shared_initializers()` to the encoder and decoder subgraphs of Whisper before chaining and exporting the full, final model. ### Motivation and Context The Whisper export process has some overlap between the encoder and decoder subgraphs due to the format of the BeamSearch contrib op. Consequently, there is some shared model data that is duplicated in the final exported product, which can result in a file size increase of ~40%. This PR takes the methods in `convert_generation.py` and applies them during the whisper export process. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-04-06 13:27:05 -07:00
petermcaughan	d0cca91cfb	Fix token_id values for whisper export (#15362 ) ### Description The current ONNX export of Whisper utilizes hard-coded values for token_ids when configuring the BeamSearch node. This PR removes these literals and instead takes these values straight from the WhisperConfig. ### Motivation and Context Hard-coding these values can cause some parity issues when comparing to default PyTorch behavior - this change to take from WhisperConfig resolves these. Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-04-06 11:01:21 -07:00
cloudhan	71a4e7eb97	Automatically enable tunable op usage for production models (#15156 ) Split the `IsTunableOpEnable` semantics into enabling tunable op for use and enabling tunable op for tuning. Both remain disabled in general for safety. But - if the session is created with an onnx model with tuning results embedded - and the embedded tuning results are set to the EP without an error `Status` then we automatically enable usage, while tuning remains disabled. The planned options will be - `tunable_op_enable`: The top-level switch of `TunableOp`; indicates whether we will run into `TunableOp` related logic. NOTE: most of our impls have a bottom impl that acts as a fallback and is set as the default. In this case, we still call into the `TunableOp`, but no kernel selection, kernel tuning or caching is involved. This reduces our maintenance burden of a duplicate code path. - `tunable_op_tuning_enable`: The secondary switch of `TunableOp`; indicates whether we will run into the tuning related logic of `TunableOp` Then for the possible future options: - `tunable_op_tuning_max_iteration`: blahblah - `tunable_op_tuning_max_duration_ms`: blahblah - `tunable_op_flash_attention_enable`: blahblah, for example only, we will not have this. The developer oriented envvars are for developers' convenience when inspecting the performance impact of tuning. So there are only `ORT_ROCM_TUNABLE_OP_ENABLE` and `ORT_ROCM_TUNABLE_OP_TUNING_ENABLE` to take fine-grained control of the combinations.	2023-04-06 13:52:47 +08:00
Leso_KN	ea6b32fea8	Fix: Add def main() in onnxruntime_test.py (#15208 )	2023-04-05 12:31:39 -07:00
Justin Chu	a96e19abc4	Add type annotations to `onnxruntime_inference_collection.py` (#15364 ) ### Description Add type annotations to `onnxruntime_inference_collection.py` ### Motivation and Context Fixes #15334	2023-04-05 10:32:49 -07:00
Hariharan Seshadri	5294cd0c55	Print value errors in ort.InferenceSession to user (#15360 )	2023-04-04 16:01:24 -07:00
Anton Korablin	207c57219a	Add support for full ViT optimization (#15289 ) Add support for ViT optimization in optimizer.py As ViT architecture follows BERT rather closely, we can easily reuse BERT fusions for ViT. The only difference is that ViT does not have attention mask, which means there is no Add node in qk paths. Make the necessary changes in onnx_exporter.py to be able to cover optimizations with test.	2023-04-04 14:05:24 -07:00
Severin Simmler	4400e80452	Allow `Path` objects for deserialization of ONNX models (#15307 )	2023-04-04 11:38:00 -07:00
petermcaughan	f30e2d4387	Whisper Export (#15247 ) ### Description Add scripts to export Whisper model to ONNX and integrate the ORT BeamSearch op with the resulting graphs. Example command to execute this script: python convert_to_onnx.py -m openai/whisper-large --output whisper -e --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-04-04 05:01:04 -07:00
Yufeng Li	c08d6b42e8	Add tool to support packing mode for BERT model (#15283 ) ### Description Add a tool to convert a fused BERT-like model to packing mode ### Motivation and Context	2023-03-31 08:46:47 -07:00
yf711	dc61d3b5b6	Fix symbolic shape inference script on precision loss issue (#15215 ) ### Description When calculating a symbolic shape like `mul(get_int_val(values=[1024, 0.5]))`, the current script calls `get_int_val()` to get the values, which turns them into `[1024, 0]`. Thus, the result of `mul(values)`->`mul([1024,0])`=0, but the expected shape size is 512. Fix: for math binary operations like `mul()` and `div()`, don't convert input shapes into integers if any precision loss could happen; keep the input shape as float, finish the operation, then cast the final result into an integer and output the shape. Test cases are added: 1. mul(1024, 0.5)=>512 (before this fix, the output would be 0, as float 0.5 would be converted to int 0) 2. div(768, 1.5)=>512 (before this fix, the output would be 768, as float 1.5 would be converted to int 1) ### Motivation and Context	2023-03-30 12:15:27 -07:00
PeixuanZuo	a6279d4cfb	[ROCm] update Stable Diffusion benchmark to support ROCm EP (#15094 ) Update Stable Diffusion benchmark to support ROCm EP	2023-03-29 15:19:52 +08:00
Tianlei Wu	f752bb9973	Update stable diffusion benchmark results: A100 and PyTorch 2.0 (#15195 ) Update stable diffusion benchmark results with A100 results and PyTorch 2.0 number.	2023-03-28 19:47:22 -07:00
Justin Chu	938e2136c6	Enable pylint and numpy rules (#15218 ) ### Description Enable pylint and numpy rules ### Motivation and Context Modernize numpy usage and enable more quality checks	2023-03-27 20:37:53 -07:00
cloudhan	d3565779c3	Allow bert_perf_test.py to load/save tuning results (#15096 )	2023-03-26 18:03:08 +08:00
Justin Chu	d834ec895a	Adopt lintrunner as the linting tool - take 2 (#15085 ) ### Description `lintrunner` is a linter runner successfully used by pytorch, onnx and onnx-script. It provides a uniform experience running linters locally and in CI. It supports all major dev systems: Windows, Linux and macOS. The checks are enforced by the `Python format` workflow. This PR adopts `lintrunner` for onnxruntime and fixes ~2000 flake8 errors in Python code. `lintrunner` now runs all required python lints including `ruff` (replacing `flake8`), `black` and `isort`. Future lints like `clang-format` can be added. Most errors are auto-fixed by `ruff` and the fixes should be considered robust. Lints that are more complicated to fix are marked `# noqa` for now and should be fixed in follow up PRs. ### Notable changes 1. This PR removed some suboptimal patterns: - `not xxx in` -> `xxx not in` membership checks - bare excepts (`except:` -> `except Exception`) - unused imports The follow up PR will remove: - `import *` - mutable values as default in function definitions (`def func(a=[])`) - more unused imports - unused local variables 2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than flake8 and is more robust. We are using it successfully in onnx and onnx-script. It also supports auto-fixing many flake8 errors. 3. Removed the legacy flake8 ci flow and updated docs. 4. The added workflow supports SARIF code scanning reports on github, example snapshot: ![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png) 5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant ### Motivation and Context Unified linting experience in CI and local. Replacing https://github.com/microsoft/onnxruntime/pull/14306 --------- Signed-off-by: Justin Chu <justinchu@microsoft.com>	2023-03-24 15:29:03 -07:00
PeixuanZuo	7eb6dbe7d8	[ROCm] Add compute type for Skiplayernorm to fix ROCm CI (#15192 ) - Add compute type for Skiplayernorm to fix ROCm CI and get more accurate results. SkipLayerNorm: type T: input, skip, bias type U: epsilon, compute result type V: output, beta, gamma - refactor the usage of aligned_vector, reduce the usage of `reinterpret_cast`.	2023-03-24 19:31:14 +08:00
Ye Wang	44ba23e0f5	Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166 ) ### Description As synced offline, rename this op and will create another op for mha that supports both self and cross attention. ### Motivation and Context --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-23 12:31:38 -07:00
Hariharan Seshadri	7033346605	Support mask_filter_value attribute in DecoderMaskedMultiheadAttention (#15158 )	2023-03-23 11:00:09 -07:00
Tianlei Wu	88a66a289b	Fix prune_graph and gpt attention fusion scripts (#15147 ) Fix two issues: (1) GPT attention fusion: get_parent could return None when the input is initializer, add a check (2) ONNX node could have optional inputs and outputs. During prune_graph, we shall exclude empty inputs/outputs. Here we exclude "" from output_name_to_node and input_name_to_nodes. Add an option allow_remove_graph_inputs in prune_graph	2023-03-23 09:45:16 -07:00
pengwa	1d32285536	Statistics tool for ORTModule convergence parity (#15020 ) ### Statistics tool for ORTModule convergence parity As ORTModule gets more and more validated, it is pretty fast to integrate a PyTorch based model with ORT. At the same time, we need to make sure that once there is a convergence issue, we don't spend months investigating it. As part of this effort, this PR introduces a tool to dump activation statistics without much involvement from users. The dumped results contain only some statistics plus sampled data, which is not big; compared with dumping all the tensors, it is much faster and more space efficient. To use it, two lines are needed before wrapping the model with ORTModule. The baseline run needs the same two lines. ``` + from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber + SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)]) ``` Once you have run the steps, the following command can be used to merge the results into per-step summaries for the ORT and baseline runs respectively. ```bash python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output ``` Docs are added as part of this PR: [convergence investigation notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md) Based on the generated merged files, we can compare them with tools. ![image](https://user-images.githubusercontent.com/10530022/224653929-4e4480bd-bb02-4bbe-bd44-2672bdf91a87.png) ### Design and Implementation This PR introduces a common mechanism for registering custom logic as nn.Module post-forward hooks. Statistics for activations (StatisticsSubscriber) is one of the implementations. If there are other needs, we can define another XXSubscriber to do the customized work.	2023-03-23 20:34:24 +08:00
cloudhan	039ca10822	Move offline_tuning.py, so that the utility will be packaged with the whl distribution (#15124 ) Just a file move.	2023-03-23 15:24:41 +08:00
cloudhan	71b67ec1e2	Refactor ke register to be decentralized (#15036 ) So that we can remove all unnecessary header files	2023-03-22 14:49:26 +08:00
Tianlei Wu	3e2d453b64	Supports model > 2GB in fp16 conversion with onnx shape inference (#15067 ) (1) Allow model to be path, and use infer_shapes_path to fix https://github.com/microsoft/onnxruntime/issues/15063 (2) Add some logging for float data truncation (3) Add RandomUniformLike to default op_block_list (4) Some minor changes to use f string.	2023-03-21 15:08:28 -07:00
Faith Xu	ef76b3aeb8	Transformers tool - update readme to link to docs page (#14964 ) ### Description Transformers tool documentation has been moved to: https://onnxruntime.ai/docs/performance/transformers-optimization.html	2023-03-21 11:56:19 -07:00
cloudhan	98ab4a62d6	Fix ROCm 5.2.3 pipeline (#15073 ) Make CK optional again.	2023-03-17 15:59:57 +08:00
cloudhan	a5ab88247b	ROCm Flash Attention (#14838 ) Adds flash attention via composable kernel for ROCm EP	2023-03-16 10:39:58 +08:00
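
The precision-loss fix described in dc61d3b5b6 (#15215) can be illustrated with a minimal sketch. The helper names below are hypothetical and are not taken from the symbolic shape inference script; they only contrast truncating each operand to int before the operation with computing in float and casting the final result.

```python
import math


def mul_truncating(values):
    # Hypothetical "before" behaviour: every operand is cast to int first,
    # so 0.5 becomes 0 and the product collapses to 0.
    return math.prod(int(v) for v in values)


def mul_preserving(values):
    # Hypothetical "after" behaviour: multiply in float, cast once at the end.
    return int(math.prod(float(v) for v in values))


def div_truncating(a, b):
    # int(1.5) is 1, so 768 / 1.5 is computed as 768 // 1.
    return int(a) // int(b)


def div_preserving(a, b):
    # Divide in float, then cast the final result to int.
    return int(float(a) / float(b))


print(mul_truncating([1024, 0.5]))  # 0
print(mul_preserving([1024, 0.5]))  # 512
print(div_truncating(768, 1.5))     # 768
print(div_preserving(768, 1.5))     # 512
```

The two test cases listed in that commit, mul(1024, 0.5) and div(768, 1.5), both yield 512 once the cast is deferred to the final result.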


1001 commits