onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-08 00:23:03 +00:00

Author	SHA1	Message	Date
Justin Chu	938e2136c6	Enable pylint and numpy rules (#15218 ) ### Description Enable pylint and numpy rules ### Motivation and Context Modernize numpy usage and enable more quality checks	2023-03-27 20:37:53 -07:00
cloudhan	d3565779c3	Allow bert_perf_test.py to load/save tuning results (#15096 )	2023-03-26 18:03:08 +08:00
Justin Chu	d834ec895a	Adopt lintrunner as the linting tool - take 2 (#15085 ) ### Description `lintrunner` is a linter runner successfully used by pytorch, onnx and onnx-script. It provides a uniform experience running linters locally and in CI. It supports all major dev systems: Windows, Linux and macOS. The checks are enforced by the `Python format` workflow. This PR adopts `lintrunner` to onnxruntime and fixed ~2000 flake8 errors in Python code. `lintrunner` now runs all required Python lints including `ruff`(replacing `flake8`), `black` and `isort`. Future lints like `clang-format` can be added. Most errors are auto-fixed by `ruff` and the fixes should be considered robust. Lints that are more complicated to fix are applied `# noqa` for now and should be fixed in follow up PRs. ### Notable changes 1. This PR removed some suboptimal patterns: - `not xxx in` -> `xxx not in` membership checks - bare excepts (`except:` -> `except Exception`) - unused imports The follow up PR will remove: - `import *` - mutable values as default in function definitions (`def func(a=[])`) - more unused imports - unused local variables 2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than flake8 and is more robust. We are using it successfully in onnx and onnx-script. It also supports auto-fixing many flake8 errors. 3. Removed the legacy flake8 CI flow and updated docs. 4. The added workflow supports SARIF code scanning reports on GitHub, example snapshot: ![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png) 5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Unified linting experience in CI and local. Replacing https://github.com/microsoft/onnxruntime/pull/14306 --------- Signed-off-by: Justin Chu <justinchu@microsoft.com>	2023-03-24 15:29:03 -07:00
PeixuanZuo	7eb6dbe7d8	[ROCm] Add compute type for Skiplayernorm to fix ROCm CI (#15192 ) - Add compute type for Skiplayernorm to fix ROCm CI and get more accurate results. SkipLayerNorm: type T: input, skip, bias type U: epsilon, compute result type V: output, beta, gamma - refactor the usage of aligned_vector, reduce the usage of `reinterpret_cast`.	2023-03-24 19:31:14 +08:00
Ye Wang	44ba23e0f5	Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166 ) ### Description <!-- Describe your changes. --> As synced offline, rename this op and will create another op for mha that supports both self and cross attention. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-23 12:31:38 -07:00
Hariharan Seshadri	7033346605	Support mask_filter_value attribute in DecoderMaskedMultiheadAttention (#15158 )	2023-03-23 11:00:09 -07:00
Tianlei Wu	88a66a289b	Fix prune_graph and gpt attention fusion scripts (#15147 ) Fix two issues: (1) GPT attention fusion: get_parent could return None when the input is initializer, add a check (2) ONNX node could have optional inputs and outputs. During prune_graph, we shall exclude empty inputs/outputs. Here we exclude "" from output_name_to_node and input_name_to_nodes. Add an option allow_remove_graph_inputs in prune_graph	2023-03-23 09:45:16 -07:00
pengwa	1d32285536	Statistics tool for ORTModule convergence parity (#15020 ) ### Statistics tool for ORTModule convergence parity As ORTModule gets more and more validated, it is pretty fast to integrate a PyTorch-based model with ORT. At the same time, we need to make sure that once there is a convergence issue, we don't spend months investigating it. As part of this effort, this PR introduces a tool to dump activation statistics without much involvement from users. The dumped results contain only some statistics plus sampled data, so they are small; compared with dumping all the tensors, this is much faster and more space efficient. To use it, two lines are needed before wrapping ORTModule. The baseline run needs the same two lines. ``` + from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber + SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)]) ``` Once you have run the steps, the following command can be used to merge the results into per-step summaries for the ORT and baseline runs respectively. ```bash python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output ``` Docs are added as part of this PR: [convergence investigation notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md) Based on the generated merged files, we can compare them with tools. ![image](https://user-images.githubusercontent.com/10530022/224653929-4e4480bd-bb02-4bbe-bd44-2672bdf91a87.png) ### Design and Implementation This PR introduces a common mechanism for registering custom logic on nn.Module's post-forward hooks. Statistics for activations (StatisticsSubscriber) is one of the implementations. If there are other needs, we can define another XXSubscriber to do the customized work.	2023-03-23 20:34:24 +08:00
cloudhan	039ca10822	Move offline_tuning.py, so that the utility will be packaged with the whl distribution (#15124 ) Just file move.	2023-03-23 15:24:41 +08:00
cloudhan	71b67ec1e2	Refactor ke register to be decentralized (#15036 ) So that we can remove all unnecessary header files	2023-03-22 14:49:26 +08:00
Tianlei Wu	3e2d453b64	Supports model > 2GB in fp16 conversion with onnx shape inference (#15067 ) (1) Allow model to be path, and use infer_shapes_path to fix https://github.com/microsoft/onnxruntime/issues/15063 (2) Add some logging for float data truncation (3) Add RandomUniformLike to default op_block_list (4) Some minor changes to use f string.	2023-03-21 15:08:28 -07:00
Faith Xu	ef76b3aeb8	Transformers tool - update readme to link to docs page (#14964 ) ### Description Transformers tool documentation has been moved to: https://onnxruntime.ai/docs/performance/transformers-optimization.html	2023-03-21 11:56:19 -07:00
cloudhan	98ab4a62d6	Fix ROCm 5.2.3 pipeline (#15073 ) Make CK optional again.	2023-03-17 15:59:57 +08:00
cloudhan	a5ab88247b	ROCm Flash Attention (#14838 ) Adds flash attention via composable kernel for ROCm EP	2023-03-16 10:39:58 +08:00
Hariharan Seshadri	ed7ab1660d	[CUDA] Add option to use DecoderMaskedMultiheadAttention in BeamSearch (#14990 )	2023-03-15 17:16:32 -07:00
Tianlei Wu	bdfdebfca7	Fix ReduceSum in attention fusion (#15047 ) Fix https://github.com/microsoft/onnxruntime/issues/14959. ReduceSum-13 move axes from attribute to node input.	2023-03-14 20:34:17 -07:00
PeixuanZuo	c70838cbbb	[ROCm] add Conv, NhwcConv benchmark to microbench (#15017 ) Add Conv, NhwcConv benchmark to microbench. Related PR: https://github.com/microsoft/onnxruntime/pull/14982, https://github.com/microsoft/onnxruntime/pull/14980	2023-03-15 11:07:17 +08:00
Ye Wang	0fa00429d5	[T5 optimization] script fusions and fixes (#14967 ) ### Description <!-- Describe your changes. --> 1. added script for t5 encoder self attention and t5 decoder self/cross attention fusions. 2. added simplified layernorm fusion for --external_data_format scenario. (otherwise relying on ORT optimizer) 3. added rel_pos_bias shape inference code, modified attention/mha shape inference script. 4. reworked graph_topologic_sort() because the current implementation is not functioning correctly. also added an option to topo-sort the graph in a deterministic way to let tests pass. note: 1. the t5-beamsearch export code is slightly modified. specifically, encoder_hidden_states(ehs) is no longer an input to the t5 decoder since the ehs is not actually used in the graph execution. 2. recent PRs do not add optimizations to t5 on cpu. 3. the fp32 model(encoder and decoder) for t5-small, t5-base and t5-large can get a parity of e-5 and the corresponding beam search models generate same results as pytorch. 4. fp16(mixed-precision) models, however, get a parity around 3e-2 and some have a maximum diff a bit over 3e-2. But the beam search models still generate same results as pytorch (based on limited input data) 5. mt-5 model has a parity issue at the moment, even before any optimization. will investigate later. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-13 23:35:56 -07:00
Christian Veenhuis	59dfcfdce7	Fix typos in sources: operater, tranform, neccessary, trainig (#14907 ) ### Description While browsing the sources I found several typos here and there. I collected them to a single PR and fixed them. Namely these typos are: operater, tranform, neccessary, trainig. After fixing none of them was found anymore: $ git grep "operater" $ git grep "tranform" $ git grep "neccessary" $ git grep "trainig" $ ### Motivation and Context Since some of the typos are in example notebooks and markdown files, users can see them.	2023-03-13 22:45:04 -07:00
Adrian Lizarraga	d8ddd25272	Add InstanceNormalization operator to QNN EP (#14867 ) ### Description QNN EP: - Adds the [InstanceNormalization](https://onnx.ai/onnx/operators/onnx__InstanceNormalization.html) operator to QNN EP. - Fixes graph composition bug when Transpose node is the last node in a graph. - Adds check for input shape when GetCapability is called (before and after layout transformation) - Should add similar checks for other layout sensitive ops (conv, pool, ...) in a separate PR - Adds initial QNN op tests for QDQ conv and QDQ InstanceNormalization - Should add tests for other ops in a separate PR Optimizer: - Makes InstanceNormalization a layout sensitive operator. - Adds a custom QDQ group selector for InstanceNormalization. Quantization tool: - Adds QDQ support for InstanceNormalization operator. - Adds python unit test for InstanceNormalization quantization. ### Motivation and Context Needed to support stable diffusion models with QNN. --------- Co-authored-by: Hector Li <hecli@microsoft.com>	2023-03-10 14:42:41 -08:00
Dmitri Smirnov	0d7855ea5a	Re-work global objects dependencies in pybind layer. (#14941 ) ### Description Re-work handling of static objects in pybind. Make sure we ref-count Environment from Sessions. The following has been done: - Make global objects function static. This ensures that the objects are constructed on demand. The first object constructed is destructed last. This is platform independent. - Make global objects ownership shared as suggested by pybind since they are not surfaced at Python level, and they cannot be referred to by dependent python objects. Verified that all python objects are GCed before globals are destroyed. This takes care of inference session dependency on environment and its default logger and this is also platform independent. - Utilize pybind atexit mechanism to clear execution providers and unload CUDA libraries (as suggested by https://github.com/microsoft/onnxruntime/pull/14903) . Since this is registered for module exit, it takes place before any other globals are destroyed and clears shared objects state or even unloads the libraries. This should also work in a platform independent way. ### Motivation and Context - Global object destruction order is managed manually and that becomes a source of trouble. We want to make it deterministic and platform independent. - Frequent hangs in Python layer due to the static object's destruction order. Some of the Python session objects are being garbage collected after main exits and they require ORT environment to be alive. (Use after free)	2023-03-10 13:55:31 -08:00
Maximilian Müller	ad4db12699	TensorRT EP - timing cache (#14767 ) ### Description This will enable a user to use a TensorRT timing cache based on #10297 to accelerate build times on a device with the same compute capability. This will work across models as it simply stores kernel runtimes for specific configurations. Those files are usually very small (only a few MB) which makes them very easy to ship with an application to accelerate the build time on the user end. ### Motivation and Context Especially for workstation use cases TRT build times can be a roadblock. With a few models from ONNX model zoo I evaluated speedups when a timing cache is present. `./build/onnxruntime_perf_test -e tensorrt -I -t 5 -i "trt_timing_cache_enable\|true" <onnx_path>` \|Model \| no Cache \| with Cache\| \| ------------- \| ------------- \| ------------- \| \|efficientnet-lite4-11 \| 34.6 s \| 7.7 s\| \|yolov4 \| 108.62 s \| 9.4 s\| To capture this I had to modify the onnxruntime_perf_test. The time is sometimes not captured within "Session creation time cost:" which is why I introduced "First inference time cost:". --------- Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>	2023-03-10 09:02:27 -08:00
mindest	bf2cc808a1	[ROCm] SkipLayerNorm: add more configs for block size; loosen constraints (#14900 ) ### Description * add more configs for `threads_per_block` in SkipLayerNorm, also in kernel explorer. * loosen constraints for hidden_size, so that `SkipLayerNormSmallOp` can be selected for larger hidden sizes. * add flag for optional output in kernel_explorer ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-03-09 22:27:01 +08:00
Hariharan Seshadri	112a4d215a	[CUDA] Support decoding multihead self-attention implementation (#14848 )	2023-03-08 09:17:54 -08:00
Kyushick Lee	c696392f0c	Support external output tensors for DORT (#14516 ) ### Description <!-- Describe your changes. --> Support externally-managed output tensors (torch Tensors) for dort. Add `preallocate_output` option to OrtBackend to rely on externally-managed output tensors for dort. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> DORT currently allocates and returns output ortvalues and convert them to torch Tensors. The conversion based on dlpack does not support torch Tensors for custom Aten backends, and it is not yet possible to transfer the ownership from ortvalue to external handle (torch Tensor). To avoid this issue, the PR change provides an option (`preallocate_output`) to allocate output tensors externally in pytorch, which creates torch Tensor for an Aten backend, and let dort take pointers from torch Tensors to construct output ortvalues instead of allocating them inside InferenceSession.	2023-03-07 21:32:23 -08:00
George Wu	289f7dbcdd	enable pybind for qnn ep (#14897 ) enable python bindings for QNN EP. tested on Windows Dev Kit 2023 (ARM64) with python 3.11 (ARM64) from https://www.python.org/ftp/python/3.11.1/python-3.11.1-arm64.exe	2023-03-03 07:26:53 -08:00
Dmitri Smirnov	8d87fdcfa1	Add GetVersionString API for C++, C# and Python (#14873 ) ### Description Added APIs. ### Motivation and Context Addresses https://github.com/microsoft/onnxruntime/issues/14584 Cc: @Craigacp cp	2023-03-02 17:11:07 -08:00
Tianlei Wu	c66af46fc1	Doc for Stable Diffusion CUDA Optimizations (#14830 ) Add document for stable diffusion optimizations and benchmark.	2023-03-01 19:29:30 -08:00
pengwa	79aa0acdd0	SCELoss(SCELossGrad) support half(float) input float(half) output (#13972 ) ### Description A follow up change for https://github.com/microsoft/onnxruntime/pull/13616. SoftmaxCrossEntropyLossInternal/SoftmaxCrossEntropyLossInternalGrad support different types for input and output. Add SCELoss(SCELossGrad) support for half(float) input and float(half) output. ### Test Note #### Add tests for variant input and output types. To add such tests, the existing testing code for sce loss and scelossinternal gradient had to be refactored. Originally, for FP32 input and output, the CPU kernels ran as the baseline, CUDA/ROCm then ran with the same data, and CompareTester compared the results with the CPU run. For FP16 input and output, the CPU (which did not have half kernels) ran Cast_to_float->CPU kernel->cast_to_half as the baseline, CUDA/ROCm then ran with the same data but using the half implementation, and CompareTester compared the results with the CPU run. Now, we want to support runs with different input and output types. The proposed change is to always run the CPU kernels with float input and output as the baseline (because the CPU only has float kernel implementations); this is the first step of every test. Then we run the CUDA/ROCm kernels using half_input_half_output, float_input_float_output, half_input_float_output and float_input_half_output, if a corresponding kernel is registered. Afterwards, we compare the CUDA/ROCm results with the CPU float baselines. One thing deserves a special note: CompareOpTester's result comparison can be looser than OpTester's. Roughly speaking, the former tolerates diff <= atol + rtol * expected_value, while the latter tolerates diff < atol && diff < rtol * expected_value. When the expected value is very small, as in many of our test cases, the former can pass while the latter fails. 
So the refactoring also moves the check outside of OpTester and explicitly checks the values the way CompareOpTester did (to align with the previous behaviour). ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-02-28 18:02:08 +08:00
Ivan Komarov	9f6d452ca6	Fix `ValueError` when testing PyTorch performance (#14450 ) ### Description Fixed an exception that is thrown inside `transformers` when trying to test PyTorch performance: ``` > python convert_generation.py -m gpt2 --output gpt2_greedy_search.onnx --num_beams 1 --num_return_sequences 1 --torch_performance	2023-02-24 21:39:14 -08:00
Ted Themistokleous	702a61c3bb	Add verbose and optimization args for parity tests (Gelu, Layernorm, … (#14739 ) …GPT_Attention) Some EPs require that onnxruntime and optimum optimizations are turned off in order to run correctly. Allowing this option during test runs allows the EP and library to perform their own optimization and be more representative of actual use case conditions. Important for EPs like MIGraphX which require optimizations to be off for certain operations ### Description <!-- Describe your changes. --> Allow flags to turn off optimizations and add verbose output to confirm which EP is being used for the inference run and validate fallbacks ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Related to: #14702 & #14700 --------- Signed-off-by: Ted Themistokleous <tthemist@amd.com> Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2023-02-24 18:43:13 +08:00
PeixuanZuo	687118a159	[ROCm] Fix Skiplayernorm error and open ke test with optional output (#14794 ) Fix Skiplayernorm error and open kernel explorer test with optional output.	2023-02-24 12:46:42 +08:00
James Yuzawa	d925055a3e	Fix broken and outdated links in documentation (#14092 ) ### Description <!-- Describe your changes. --> I fixed some broken links in the C API documentation, but then did a quick pass over all of the links I could find and then fixed those. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> I got some 404's when exploring the documentation and wanted to fix it.	2023-02-23 10:48:04 -08:00
Ivan Komarov	16b39e5b87	`symbolic_shape_infer.py`: Fix slicing a tensor that has a sympy.Min() in its shape (#14384 ) ### Description `_infer_Slice()` is a function (arguably the most complex one) in `symbolic_shape_infer.py` that infers the shape of the output of a `Slice` node. This commit fixes an edge case in `_infer_Slice()` caused by a SymPy quirk. When both the end of the slice (let's call it `e`) and the corresponding dimension of the sliced tensor (let's call it `dim`) are arbitrary symbolic expressions, `symbolic_shape_infer.py` [checks](`de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1728)`) if `e <= dim`. Comparing symbolic expressions is hard in general, so if the comparison fails, `symbolic_shape_infer.py` [gives up](`de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1734)`) and assumes that `e` is equal to `dim`. A failure of this sort currently happens for expressions of the form `Y - X >= 0` where `Y` contains a `sympy.Min()` (`symbolic_shape_infer.py` tries to rewrite `X <= Y` comparisons in various ways, and `Y - X >= 0` is [one of them](`de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1664)`)). 
A simple example to illustrate this: ```python >>> import sympy >>> X = sympy.Symbol('X', positive=True, integer=True) >>> >>> y1 = 9999 >>> Y1 = X + y1 - 5000 >>> bool(Y1 - X >= 0) True >>> >>> y2 = X + 4999 >>> Y2 = X + y2 - 5000 >>> bool(Y2 - X >= 0) True >>> >>> y3 = sympy.Min(y1, y2) >>> Y3 = X + y3 - 5000 >>> bool(Y3 - X >= 0) Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../venv/lib/python3.9/site-packages/sympy/core/relational.py", line 511, in __bool__ raise TypeError("cannot determine truth value of Relational") TypeError: cannot determine truth value of Relational ``` If you assume that `X` is a positive symbol (`symbolic_shape` [does assume](`de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L2129)`) this for graph inputs), then both `Y1 >= X` and `Y2 >= X` hold, and SymPy can prove this. This means that `Y3 >= X` also holds (since `Y3` is essentially equal to either `Y1` or `Y2`, depending on the value of `X`), but this is too hard for SymPy to prove. I confirmed that this is still the case for the latest SymPy version (`1.11.1`). This commit tries to fix this edge case by slightly rewriting the expression containing `sympy.Min()`. I explain the details in the comments in `symbolic_shape_infer.py`, so I won't duplicate them in the PR description. ### Motivation and Context This sounds like a very contrived example, but it actually appeared in the wild when we tried to infer shapes for an ONNX graph exported from PyTorch that used relative-position multihead attention from Fairseq. The problematic line is [here](`7d050ada7d/fairseq/modules/espnet_multihead_attention.py (L192)`). In our codebase, we have something like `matrix_bd = matrix_bd[:, :, :, : matrix_ac.size(-1)]` before we add `matrix_ac` and `matrix_bd`. 
`matrix_bd` is itself a result of another slice, hence its shape contains `sympy.Min()`, and the SymPy weirdness described above prevents `symbolic_shape_infer.py` from correctly inferring the final shape of `matrix_bd`. Then `symbolic_shape_infer.py` explodes when we try to add `matrix_ac` and `matrix_bd`, because their shapes are not compatible. I added a small self-contained unit test to illustrate the problem. Without the fix, `slice_out_cropped` has shape `[N + Min(42, N + 21) - 22]`, and `input` has shape `[N]`, and we get this: ``` > python onnxruntime_test_python_symbolic_shape_infer.py ..................Cannot determine if 22 - N < 0 Unable to determine if N <= N + Min(42, N + 21) - 22, treat as equal E.... ====================================================================== ERROR: test_slice_of_min (__main__.TestSymbolicShapeInferenceForSlice) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/dfyz/onnxruntime/onnxruntime/test/python/onnxruntime_test_python_symbolic_shape_infer.py", line 460, in test_slice_of_min model = SymbolicShapeInference.infer_shapes(onnx.helper.make_model(graph_def)) File "/home/dfyz/onnxruntime/onnxruntime/test/python/../../python/tools/symbolic_shape_infer.py", line 2461, in infer_shapes raise Exception("Incomplete symbolic shape inference") Exception: Incomplete symbolic shape inference ---------------------------------------------------------------------- Ran 23 tests in 0.486s FAILED (errors=1) ``` With the fix, both tensors have shape `[N]`, and the test passes. --------- Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>	2023-02-23 15:32:37 +01:00
kunal-vaishnavi	460b3ff4fd	Update pattern matching for EmbedLayerNormalization fusion (#14344 ) ### Description This PR addresses the case where an optional Gather node is in the subgraph pattern. The optional node is now fused with the other nodes matched in the pattern to create an EmbedLayerNormalization node. ### Motivation and Context The original subgraph pattern is ``` Gather Gather \ / Add \| LayerNormalization \| Attention \| ... ``` and the new subgraph pattern is ``` Gather Gather \ / Gather (optional) Add \ \| LayerNormalization \| Attention \| ... ```	2023-02-22 12:57:14 -08:00
Tianlei Wu	262e46e8ce	Update stable diffusion benchmark script (#14759 ) Update stable diffusion benchmark script: (1) Test GPU memory usage (2) Change diffusers version to 0.13, and add support of PyTorch 2.0 including compile (3) Add support of xformers (4) Output result to CSV file Example to run PyTorch 2.0 with torch.compile: ``` pip3 install numpy --pre torch --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117 export TRITON_PTXAS_PATH=/usr/local/cuda-11.7/bin/ptxas python benchmark.py -e torch -v 1.5 -c 5 -n 1 -b 1 --enable_torch_compile ```	2023-02-21 23:37:38 -08:00
Sheil Kumar	1b7f65437e	Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442 ) Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP Opset 11 introduced the following sequence related operators: - SequenceAt - SequenceConstruct - SequenceEmpty - SequenceLength - SequenceErase - SequenceInsert - ConcatFromSequence With the exception of ConcatFromSequence, all of the above operators were implemented with CPU kernels that a) required all of the contained tensors to also be on CPU, and b) would clone each tensor into a new sequence as a side effect of each operator. The implementation of sequences is backend agnostic, as they don't affect actual tensor layout or manipulate the contents of the tensors. In addition, with the exception of SequenceAt, the other operators need not make copies of the underlying referenced tensors. Consequently, this change does the following: 1) Sequence* operators (except SequenceAt) no longer copy the contents of a sequence of tensors on every kernel execution. 2) SequenceAt uses the DataTransferManager to copy tensors agnostic to backend. 3) The internal container implemented by TensorSeq has changed from onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor does not support copy or assignment construction, so it must have a singular owner. However, if the same tensor participates in multiple containers it would have multiple container "owners" and this would not be possible. 4) Other code that accessed values from TensorSeq have associated changes to extract Tensors from OrtValues now. In addition, DirectML execution was very slow when the above Sequence operators were added to a graph, as this caused MemcpyToHost and MemcpyFromHost kernels to be inserted between the graph and the sequence operators. To optimize DirectML, 1) The CPU implementations for the Sequence* ops were registered as DML implementations. 
Since the above changes also includes making the CPU kernel implementations EP agnostic, the CPU kernels can be added as is. 2) The ConcatFromSequence operator needed to be implemented on DirectML. However, there was little DirectML EP operator framework support for operators that accept/output sequences of tensors. This change has modified the internal COM interfaces to include new apis to interrogate for sequence shapes, and extract the needed tensors from TensorSeq. --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-02-21 18:08:28 -08:00
fxmarty	f76ff8c558	Initialize bias_weight in fusion_skiplayernorm.py (#14751 ) As per title, fixes https://github.com/microsoft/onnxruntime/issues/13625 Encountered the issue when using the optimization with a codegen model.	2023-02-21 10:42:08 -08:00
Tianlei Wu	c0d2472ede	Disable fused causal attention (#14732 ) There is an accuracy regression in the GPT-2 model. Top1 match rate (vs PyTorch model) drops about 1%. The cause is that the fused causal attention uses fp16 accumulation. Disable it by default and add an environment variable ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1 to turn it on manually. It also updated the GPT-2 parity test script to generate left side padding to reflect the actual usage. To test: ``` python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m gpt2 --output gpt2.onnx -o -p fp16 --use_gpu ``` The top1-match-rate in the output is on-par with ORT 1.13.1.	2023-02-21 09:53:31 -08:00
Tianlei Wu	6f99fb9d4b	Stable Diffusion CUDA Optimizations Part 5 (#14706 ) Add a fusion to remove transpose in subgraph like ``` --> Gemm --> Unsqueeze(axes=[2]) --> Unsqueeze(axes=[3]) --> Add --> Transpose([0,2,3,1]) --> GroupNorm ``` With this fusion, we can remove 22 Transpose nodes in UNet, and reduce latency by 0.1 second per image in T4.	2023-02-16 01:10:00 -08:00
PeixuanZuo	0f9d2432d2	[ROCm] Add WarpWise Softmax into SoftmaxTunableOp (#14612 ) 1. Add Softmax warpwise_forward into SoftmaxTunableOp. 2. Set Softmax op use tunableOp as optional and use original implementation by default. 3. There are some other operators use `dispatch_warpwise_softmax_forward /dispatch_warpwise_softmax_forward/ SoftMaxComputeHelper ` directly. But they only have files under cuda directory, adding `RocmTuningContext ` for these files requires copying and modifying hipified files. Now only set RocmTuningContext as nullptr by default and not hipified other operators. Related PR: https://github.com/microsoft/onnxruntime/pull/14541 --------- Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-16 11:26:08 +08:00
Tianlei Wu	eb2ac72fa9	Stable Diffusion CUDA Optimizations Part 4 (#14680 ) (1) Support packed QKV format in MultiHeadAttention. This format could avoid add bias transpose when TRT fused kernel is used. (2) Add cache for cumulated sequence length computation. For SD, it only needs to be computed once since sequence length is fixed. (3) Do not allocate qkv workspace to save memory for packed KV or QKV. (4) Add unit tests for packed kv and packed qkv format in MultiHeadAttention (5) Mark some fusion options for SD only Performance tests show slight improvement in T4. Average latency reduced 0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5 models. Memory usage drops from 5.1GB to 4.8GB.	2023-02-15 14:55:42 -08:00
ytaous	d49cea05fa	[ROCm] Support for gpt2-based model inferencing (#14675 ) When inferencing real gpt2-based model, found some gaps between CUDA and ROCm codebase. The fixes include: 1. minimum code change to fix tensor shape on Attention Op 2. Support optional output tensor with SkipLayerNorm 3. fix a build error found on MI200 --------- Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-15 00:16:00 -08:00
cloudhan	a216c9a3fa	Offline tuning (#14558 ) Add the ability to get and set tuning results of an inference session. Also add tool to manipulate onnx file to embed the results into the model file and automatically load it on session initialization.	2023-02-15 14:17:34 +08:00
Tianlei Wu	f638c5a2ae	Stable Diffusion CUDA Optimizations Part 3 (#14646 ) The third part for stable diffusion CUDA optimizations (1) Add BiasAdd operator to replace two Add (bias and residual); Add fusion for BiasAdd (2) Add Attention fusion for VAE decoder. (3) Update float16 conversion to handle Resize and GroupNorm. This could reduce two Cast nodes for each Resize op in fp16 model. (4) Force inputs and outputs to be float16 to avoid data casts in the pipeline. (5) Add options --force_fp32_ops, --inspect etc in optimize script so that user could force some operator to run in float32 to potentially get better image quality (with cost of performance). Performance tests show slight improvement in T4. Average latency reduced 0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.	2023-02-14 12:46:50 -08:00
Ye Wang	2a4c9a5cbf	[T5 optimization] fuse rel_pos_bias and remove extended mask (#14645 ) ### Description <!-- Describe your changes. --> 1. fuse rel_pos_bias in T5. 2. remove extended masks in T5 decoder and decoder_init since they generate all zeros 3. fix a bug in onnx_model.py ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-02-14 10:13:50 -08:00
PeixuanZuo	326cf2f5e9	[ROCm] add Softmax Tunable Op (#14541 ) ### Description Add Softmax Tunable Op, only include blockwise vec implementation and composable kernel. Related PR: https://github.com/microsoft/onnxruntime/pull/14475, https://github.com/microsoft/onnxruntime/pull/14612 --------- Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-13 15:56:50 +08:00
Chen Fu	0de4bc7050	add symmetric quant in softmax (#14640 ) ### Description https://github.com/microsoft/onnxruntime/issues/14626 ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/14626	2023-02-10 08:36:04 -08:00
cloudhan	9bd022b8be	Add TuningContext for TunableOp (#14557 ) This makes the TunableOp tuning results state free and will allow us to dump and load offline tuning results.	2023-02-10 14:27:43 +08:00
Tianlei Wu	cfda876a3f	Remove torch package from requirements.txt of stable diffusion models (#14630 ) ### Description Remove torch package from requirements to unblock nuget windowsai pipeline which does not allow --extra-index-url ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-02-08 12:18:17 -08:00
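
Note on commit 79aa0acdd0 (#13972): its message contrasts two tolerance checks, a combined one (`diff <= atol + rtol * expected_value`) and a stricter one that requires both `diff < atol` and `diff < rtol * expected_value`. The sketch below is only an illustration of why the two disagree for very small expected values. The function names are hypothetical, the use of `abs(expected)` is an assumption, and none of this is the actual OpTester or CompareOpTester code.

```python
def combined_tolerance_ok(actual, expected, atol, rtol):
    """Hypothetical helper: pass when diff <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def separate_tolerance_ok(actual, expected, atol, rtol):
    """Hypothetical helper: pass only when diff < atol AND diff < rtol * |expected|."""
    diff = abs(actual - expected)
    return diff < atol and diff < rtol * abs(expected)


if __name__ == "__main__":
    # A very small expected value: the relative term rtol * |expected| is tiny,
    # so the stricter check rejects a diff that the combined check accepts.
    expected, actual = 1e-6, 1.5e-6
    atol, rtol = 1e-5, 1e-3
    print("combined:", combined_tolerance_ok(actual, expected, atol, rtol))
    print("separate:", separate_tolerance_ok(actual, expected, atol, rtol))
```

With these sample numbers the absolute difference is within `atol` but far above `rtol * |expected|`, so only the combined check accepts it.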

965 commits