onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-07 00:13:17 +00:00

Author	SHA1	Message	Date
Sheil Kumar	1b7f65437e	Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442 ) Opset 11 introduced the following sequence-related operators: - SequenceAt - SequenceConstruct - SequenceEmpty - SequenceLength - SequenceErase - SequenceInsert - ConcatFromSequence With the exception of ConcatFromSequence, all of the above operators were implemented with CPU kernels that a) required all of the contained tensors to also be on CPU, and b) would clone each tensor into a new sequence as a side effect of each operator. The implementation of sequences is backend agnostic, as it doesn't affect actual tensor layout or manipulate the contents of the tensors. In addition, with the exception of SequenceAt, the other operators need not make copies of the underlying referenced tensors. Consequently, this change does the following: 1) Sequence* operators (except SequenceAt) no longer copy the contents of a sequence of tensors on every kernel execution. 2) SequenceAt uses the DataTransferManager to copy tensors agnostic to backend. 3) The internal container implemented by TensorSeq has changed from onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor does not support copy or assignment construction, so it must have a singular owner. However, if the same tensor participates in multiple containers, it would have multiple container "owners", which would not be possible. 4) Other code that accessed values from TensorSeq has associated changes to extract Tensors from OrtValues now. In addition, DirectML execution was very slow when the above Sequence operators were added to a graph, as this caused MemcpyToHost and MemcpyFromHost kernels to be inserted between the graph and the sequence operators. To optimize DirectML, 1) The CPU implementations for the Sequence* ops were registered as DML implementations. Since the above changes also include making the CPU kernel implementations EP agnostic, the CPU kernels can be added as is. 2) The ConcatFromSequence operator needed to be implemented on DirectML. However, there was little DirectML EP operator framework support for operators that accept/output sequences of tensors. This change has modified the internal COM interfaces to include new APIs to interrogate for sequence shapes, and extract the needed tensors from TensorSeq. --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-02-21 18:08:28 -08:00
Vincent Wang	e9ec4c098b	[CUDA] Fix FP16 Precision for Sigmoid Op (#14727 ) The current Sigmoid CUDA kernel uses the target data type for all computation. For some small negative numbers, if using FP16, it will lose precision. For example, for input [-7.8477, 7.3320, -7.8008, 6.6016], the expected output is [3.9047e-04, 9.9935e-01, 4.0919e-04, 9.9864e-01], but the current kernel generates [0.0000, 0.9990, 0.0000, 0.9990]. If a sub-graph contains Sigmoid, such as BinaryCrossEntropyWithLogits, it's likely to produce NaN as the compute result. The PR fixes this by using FP32 for the kernel's internal computation. Note that the fix will not cause a perf regression, as CUDA's _Exp also does float-to-half casting, so the fix doesn't introduce an extra cast. We move the casts to the very beginning and end of the whole kernel so that the other parts of the computation are also in FP32 (instead of only Exp).	2023-02-22 09:16:22 +08:00
Christian Veenhuis	9fbb2b4742	Fix broken link in onnxruntime_c_api.h (#14748 ) ### Description Fix the broken link in header file onnxruntime_c_api.h w.r.t. the graph optimization levels (line 300). ### Motivation and Context This fix solves open issue #14741	2023-02-21 15:07:06 -08:00
Scott McKay	b234df3dd0	Refactor the cost check used by the transpose optimizer (#14690 ) ### Description Refactor the cost check used by the transpose optimizer to separate out ORT specific logic. Change the post-layout transform optimization to only skip the cost check when moving the layout transform nodes around. Fall back to the normal cost check for all other transpose nodes. Clean up some const correctness. Refactor usage of ResizeHandler slightly so the clang-formatting is nicer. ### Motivation and Context Address performance issue seen in SNPE model where a non-layout transpose node was moved. See https://github.com/microsoft/onnxruntime/pull/14547 for more details. Improve separation between generic transpose optimization code and any ORT specific code.	2023-02-22 08:56:29 +10:00
Edward Chen	755161100a	Fix CoreML API usage memory leak. (#14738 ) - Fix CoreML API usage memory leak by putting CoreML API prediction call in an `@autoreleasepool` block as suggested in #14455 and [here](https://developer.apple.com/forums/thread/692425). Conservatively wrapping all CoreML API usage. - Use MLModelConfiguration.computeUnits instead of deprecated MLPredictionOptions.usesCPUOnly (originally in #11382).	2023-02-21 14:08:03 -08:00
Yuriy Chernyshov	973aaf110b	Improve compatibility with certain STL's We use customized libc++ which uses raw pointers as std::vector::iterators. As per [expr.pre.incr](https://eel.is/c++draft/expr.compound#expr.pre.incr), builtin `operator++` can only be applied to lvalue, while `std::vector::begin()` returns an rvalue. See [this](https://godbolt.org/z/d3a1aKTWP) godbolt snippet for the details.	2023-02-21 14:06:16 -08:00
RandySheriffH	e6a8e6c438	Drop the test folder accidentally added (#14718 ) The test folder was accidentally added with zero usage in main, let's drop it. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-02-21 13:59:06 -08:00
fxmarty	f76ff8c558	Initialize bias_weight in fusion_skiplayernorm.py (#14751 ) As per title, fixes https://github.com/microsoft/onnxruntime/issues/13625 Encountered the issue when using the optimization with a codegen model.	2023-02-21 10:42:08 -08:00
Tianlei Wu	c0d2472ede	Disable fused causal attention (#14732 ) There is an accuracy regression in the GPT-2 model. Top1 match rate (vs PyTorch model) drops about 1%. The cause is that the fused causal attention uses fp16 accumulation. Disable it by default and add an environment variable ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1 to turn it on manually. It also updates the GPT-2 parity test script to generate left side padding to reflect the actual usage. To test: ``` python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m gpt2 --output gpt2.onnx -o -p fp16 --use_gpu ``` The top1-match-rate in the output is on par with ORT 1.13.1.	2023-02-21 09:53:31 -08:00
PeixuanZuo	25e10f413e	[FIX] USE_COMPOSABLE_KERNEL is not defined on ROCm5.2.3 (#14750 ) Fix build failure on ROCm5.2.3	2023-02-21 11:13:37 +08:00
cao lei	3d79b1f06e	Create new stream for data copy for IOBinding input scenario (#14719 ) ### Description Create a new stream for data copy for the IOBinding input scenario ### Motivation and Context Previously in bindInput(), a nullptr Stream was passed to copy data across devices. This caused the default stream to be used, which hurt performance. This PR is to fix https://github.com/microsoft/onnxruntime/issues/14484 --------- Co-authored-by: Lei Cao <leca@microsoft.com>	2023-02-20 09:47:57 -08:00
Yi Zhang	1ea360148f	restore opset18 test (#14677 ) ### Description Reenable disabled opset18 tests	2023-02-20 18:19:10 +08:00
pengwa	fbf5d09a0c	Fix random failure of ortmodule_api.py::test_unused_parameters (#14729 ) ### Fix random failure of ortmodule_api.py::test_unused_parameters Fix FAILED orttraining_test_ortmodule_api.py::test_unused_parameters[model1-none_pt_params1] for orttraining-linux-gpu-ci-pipeline CI pipeline ``` =================================== FAILURES =================================== ________________ test_unused_parameters[model1-none_pt_params1] ________________ model = UnusedMiddleParameterNet( (fc1): Linear(in_features=784, out_features=500, bias=True) (relu): ReLU() (fc2): Linear(in_features=500, out_features=400, bias=True) (fc3): Linear(in_features=500, out_features=10, bias=True) ) none_pt_params = ['fc2.weight', 'fc2.bias'] @pytest.mark.parametrize( "model, none_pt_params", [ (UnusedBeginParameterNet(784, 500, 400, 10), ["fc1.weight", "fc1.bias"]), (UnusedMiddleParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]), (UnusedEndParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]), ], ) def test_unused_parameters(model, none_pt_params): device = "cuda" N, D_in, H1, H2, D_out = 64, 784, 500, 400, 10 model = model.to(device) ort_model = ORTModule(copy.deepcopy(model)) # Make sure model runs without any exception for _ in range(5): x = torch.randn(N, D_in, device=device) y = copy.deepcopy(x) out_pt = model(x) out_ort = ort_model(y) loss_pt = out_pt.sum() loss_pt.backward() loss_ort = out_ort.sum() loss_ort.backward() _test_helpers.assert_values_are_close(out_ort, out_pt) > _test_helpers.assert_gradients_match_and_reset_gradient(ort_model, model, none_pt_params=none_pt_params) orttraining_test_ortmodule_api.py:4050: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _test_helpers.py:216: in assert_gradients_match_and_reset_gradient assert_values_are_close(ort_param.grad, pt_param.grad, rtol=rtol, atol=atol) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ``` Initially the test ran fine. As more and more tests were added, the randomly generated data seen by ortmodule_api.py::test_unused_parameters changed, and it is now more likely to generate input data whose result breaks the existing rtol and atol. For the example value 0.1041 the difference is very small, e.g. abs_diff: 2.2649765014648438e-06. > torch.allclose judges the values not equal because: abs_diff > 0.1041 * rtol + atol = 1.041e-1 * 1e-5 + 1e-6 = 2.041e-6. > Additionally, according to the math [here](`7b31bcda2e/orttraining/orttraining/test/python/_test_helpers.py (L230)`), the maximum atol is 1.2238311910550692e-06 > current atol (1e-6), and the maximum rtol is 1.2149855137977283e-05 > current rtol (1e-5). This PR loosens atol to 1e-5 and rtol to 1e-4.	2023-02-20 18:09:53 +08:00
Edward Chen	ad78579b66	Update java/build.gradle to not use deprecated features that were removed in gradle 8.0. (#14733 ) ### Description Update java/build.gradle to not use deprecated features that were removed in gradle 8.0. Also move gradle wrapper setup from a script into a step template. ### Motivation and Context Fix builds which use hosted Mac agents and gradle. Recently the system version of gradle got upgraded to 8.0. Even though we use an older gradle wrapper version, java/build.gradle is still processed with gradle 8.0 in the initial call to `gradle wrapper`.	2023-02-20 11:19:49 +08:00
Erick Muñoz	8372c86e7f	[oneDNN] Update to oneDNN v3.0 (#14267 ) ### Description Update oneDNN version from 2.7 to 3.0	2023-02-17 09:56:29 -08:00
Wei-Sheng Chin	7b31bcda2e	Disable LazyTensor-ORT Test (#14703 ) As title since LazyTensor is replaced by Dynamo in PyTorch 2.0.	2023-02-17 17:46:51 +08:00
Hector Li	0dda42b46c	Enable some ops for QDQ Node Unit support (#14701 ) ### Description Enable some ops for QDQ Node Unit support: Flatten, Split, GlobalAveragePool, ReduceMean, Relu, Sigmoid, Sqrt, Div, Mul, Pow, Sub.	2023-02-16 17:14:31 -08:00
Ryan Hill	892f59b31a	Add string support to tile op (#14686 ) ### Description Add std::string tensor type support to Tile operator ### Motivation and Context Multiple users are hitting this missing feature: https://github.com/microsoft/onnxruntime/issues/14511	2023-02-16 14:59:44 -08:00
Baiju Meswani	ae205a7924	QAT POC tutorial (#14577 )	2023-02-16 14:38:18 -08:00
Baiju Meswani	4e686a9a7d	Support building a QAT onnx model using onnxblock (#14551 )	2023-02-16 14:38:01 -08:00
Zhang Lei	ff3aed8540	fix error due to () not used on operator priority. (#14699 )	2023-02-16 13:11:52 -08:00
Tianlei Wu	6f99fb9d4b	Stable Diffusion CUDA Optimizations Part 5 (#14706 ) Add a fusion to remove transpose in subgraph like ``` --> Gemm --> Unsqueeze(axes=[2]) --> Unsqueeze(axes=[3]) --> Add --> Transpose([0,2,3,1]) --> GroupNorm ``` With this fusion, we can remove 22 Transpose nodes in UNet, and reduce latency by 0.1 second per image in T4.	2023-02-16 01:10:00 -08:00
PeixuanZuo	0f9d2432d2	[ROCm] Add WarpWise Softmax into SoftmaxTunableOp (#14612 ) 1. Add Softmax warpwise_forward into SoftmaxTunableOp. 2. Make TunableOp optional for the Softmax op and use the original implementation by default. 3. Some other operators use `dispatch_warpwise_softmax_forward /dispatch_warpwise_softmax_forward/ SoftMaxComputeHelper ` directly, but they only have files under the cuda directory, so adding `RocmTuningContext ` for these files requires copying and modifying hipified files. For now, only set RocmTuningContext to nullptr by default and do not hipify the other operators. Related PR: https://github.com/microsoft/onnxruntime/pull/14541 --------- Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-16 11:26:08 +08:00
Ted Themistokleous	601aab2ac8	[Testing] Arrange parity utilities for onnxruntime parity tests to set order pr… (#14700 ) Current configuration has CPU as the highest priority as per the specification found at : https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession providers – Optional sequence of providers in order of decreasing precedence. Values can either be provider names or tuples of (provider name, options dict). If not provided, then all available providers are used with the default precedence. Sets correct operator precedence for the EPs in parity utilities for test runs Ruling out any odd out of order issues when setting up tests for multiple EPs Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2023-02-16 11:23:57 +08:00
Dale Phurrough	68db1b62a8	add noexcept to `InitApi()` and `GetApi()` (#13869 ) ### Description * add noexcept to `InitApi()` and `GetApi()` ### Motivation and Context * fixes microsoft/onnxruntime#12581	2023-02-15 16:49:16 -08:00
Tianlei Wu	eb2ac72fa9	Stable Diffusion CUDA Optimizations Part 4 (#14680 ) (1) Support packed QKV format in MultiHeadAttention. This format can avoid the add-bias transpose when the TRT fused kernel is used. (2) Add a cache for the cumulated sequence length computation. For SD, it only needs to be computed once since the sequence length is fixed. (3) Do not allocate the qkv workspace, to save memory, for packed KV or QKV. (4) Add unit tests for packed kv and packed qkv format in MultiHeadAttention (5) Mark some fusion options for SD only Performance tests show slight improvement on T4. Average latency reduced 0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5 models. Memory usage drops from 5.1GB to 4.8GB.	2023-02-15 14:55:42 -08:00
Edward Chen	5605c3d454	Make some variables constexpr in orttraining/orttraining/training_ops/cuda/optimizer/lamb.cc. (#14698 )	2023-02-15 14:10:59 -08:00
Chen Fu	733ca85b73	Cfu fp16 (#14538 ) ### Description FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel, and ARM64 NEON kernel. ### Motivation and Context First step in creating native support of fp16 model inferencing on ARM64 and AMD64 platforms. --------- Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-15 12:51:53 -08:00
Yi Zhang	b1abb8c656	skip col2im_pads test (#14685 ) ### Description skip col2im_pads test in model test. ### Motivation and Context The failed test blocks updating the new image.	2023-02-15 17:32:22 +08:00
ytaous	d49cea05fa	[ROCm] Support for gpt2-based model inferencing (#14675 ) When inferencing real gpt2-based model, found some gaps between CUDA and ROCm codebase. The fixes include: 1. minimum code change to fix tensor shape on Attention Op 2. Support optional output tensor with SkipLayerNorm 3. fix a build error found on MI200 --------- Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-15 00:16:00 -08:00
cloudhan	a216c9a3fa	Offline tuning (#14558 ) Add the ability to get and set tuning results of an inference session. Also add tool to manipulate onnx file to embed the results into the model file and automatically load it on session initialization.	2023-02-15 14:17:34 +08:00
Tianlei Wu	f638c5a2ae	Stable Diffusion CUDA Optimizations Part 3 (#14646 ) The third part for stable diffusion CUDA optimizations (1) Add BiasAdd operator to replace two Add (bias and residual); Add fusion for BiasAdd (2) Add Attention fusion for VAE decoder. (3) Update float16 conversion to handle Resize and GroupNorm. This could reduce two Cast nodes for each Resize op in fp16 model. (4) Force inputs and outputs to be float16 to avoid data casts in the pipeline. (5) Add options --force_fp32_ops, --inspect etc in optimize script so that user could force some operator to run in float32 to potentially get better image quality (with cost of performance). Performance tests show slight improvement in T4. Average latency reduced 0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.	2023-02-14 12:46:50 -08:00
Dmitri Smirnov	6eeeecf07f	Remove erroneous function cast (#14673 ) ### Description The custom thread entry point was declared `__stdcall` even though the API dictated a different type. Casting caused improper cleanup of the stack and crash manifested only in 32-bit Debug builds. ### Motivation and Context This addresses https://github.com/microsoft/onnxruntime/issues/14613	2023-02-14 11:35:33 -08:00
Ye Wang	2a4c9a5cbf	[T5 optimization] fuse rel_pos_bias and remove extended mask (#14645 ) ### Description 1. fuse rel_pos_bias in T5. 2. remove extended masks in T5 decoder and decoder_init since they generate all zeros 3. fix a bug in onnx_model.py --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-02-14 10:13:50 -08:00
Ted Themistokleous	37033975bb	[MIGraphX EP] Add support for Mod OP (#14647 ) This has been available since July 25th 2022 in MIGraphX. It appeared to be missing from the list of supported ops https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/pull/1302 ### Description Add in node name for Mod Operator to be supported by MIGraphX ### Motivation and Context Expand available functionality to Onnxruntime for the MIGraphX EP Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2023-02-14 16:26:11 +08:00
Zachary Streeter	7e9f543af4	Use miopenGetConvolutionSpatialDim if ROCm5.5 (#14483 ) MIOpen created a new API to get the spatial dimensions.	2023-02-14 13:04:12 +08:00
Patrice Vignola	ce9a71620f	Fix DML release build (#14661 ) ### Description Fixes the DML release build for 1.14.1. This was initially fixed by https://github.com/microsoft/onnxruntime/pull/13417 for 1.13.1, but the changes didn't make their way back to the main branch.	2023-02-13 17:31:11 -08:00
Dmitri Smirnov	6e1008877e	Update OrtEnv class documentation (#14650 ) ### Description Tell more about `OrtEnv` class. ### Motivation and Context Need to mention the importance of creating `OrtEnv` first.	2023-02-13 16:40:44 -08:00
cao lei	50fa151298	remove device_id parameter out of ExecutionProvider::GetAllocator() (#14580 ) ### Description Remove the parameter device_id out of ExecutionProvider::GetAllocator() function ### Motivation and Context The parameter device_id is not necessary. We can fully rely on the second parameter OrtMemType mem_type to determine the device_id when getting allocator from executionProvider.	2023-02-13 10:01:07 -08:00
Baiju Meswani	22de2798f2	Update typing hints to support python 3.8 for training apis (#14649 )	2023-02-13 09:52:05 -08:00
PeixuanZuo	326cf2f5e9	[ROCm] add Softmax Tunable Op (#14541 ) ### Description Add Softmax Tunable Op, only include blockwise vec implementation and composable kernel. Related PR: https://github.com/microsoft/onnxruntime/pull/14475, https://github.com/microsoft/onnxruntime/pull/14612 --------- Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2023-02-13 15:56:50 +08:00
Misha Chornyi	12d91173c4	Add extra include to fix build w/ CUDA 12 (#14659 ) Signed-off-by: Cliff Woolley <jwoolley@nvidia.com> ### Description Including file to fix build w/CUDA 12 ### Motivation and Context It should allow users to compile against CUDA 12 Signed-off-by: Cliff Woolley <jwoolley@nvidia.com> Co-authored-by: Cliff Woolley <jwoolley@nvidia.com>	2023-02-11 16:14:40 -08:00
guyang3532	ba00f3a134	fix problem of reduplicate input names (#14163 ) Contributor: @guyang3532	2023-02-10 12:57:51 -08:00
Chen Fu	0de4bc7050	add symmetric quant in softmax (#14640 ) ### Description https://github.com/microsoft/onnxruntime/issues/14626 ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/14626	2023-02-10 08:36:04 -08:00
cloudhan	9bd022b8be	Add TuningContext for TunableOp (#14557 ) This makes the TunableOp tuning results state free and will allow us to dump and load offline tuning results.	2023-02-10 14:27:43 +08:00
Nat Kershaw (MSFT)	9a9d45fefb	Add instructions for previewing docs changes (#12528 )	2023-02-09 16:25:46 -08:00
Baiju Meswani	94bc0fe029	Skip all training opset model tests (#14636 )	2023-02-09 14:56:50 -08:00
Ryan Hill	02bba3e268	Switch to a static local variable to avoid global constexpr warning (#14638 ) ### Description Switch to a static local variable to fix the warning Comments in the code so it's clear that it's intentional. ### Motivation and Context Prefast warning: [prefast:Warning]: C26426 (in 'onnxruntime::cuda::`dynamic initializer for 'castOpTypeConstraints''') Global initializer calls a non-constexpr function 'onnxruntime::DataTypeImpl::GetTensorType<onnxruntime::MLFloat16>' (i.22).	2023-02-09 14:25:02 -08:00
Justin Stoecker	23f0e44265	Fix SAL annotation in private DML EP interface (#14639 ) In #14461 I added a private interface to MLOperatorAuthorPrivate.h to pipe ORT node names through to the debug name of DML operators/graphs. The wrong SAL annotation was used on the `Get*Name` methods, which confused static analysis tools into thinking there is a potential buffer overrun.	2023-02-09 10:27:20 -08:00
JiCheng	c5b485d25f	[prefast:Warning]: C26451 (#14628 )	2023-02-09 17:04:52 +08:00
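
The tolerance arithmetic quoted in commit fbf5d09a0c can be checked by hand. The sketch below is not taken from the repository: `within_tolerance` is a hypothetical helper that applies the elementwise criterion documented for `torch.allclose`, `|actual - expected| <= atol + rtol * |expected|`, in plain Python, and it assumes that 0.1041 plays the role of the reference value, as the commit message's arithmetic suggests.

```python
def within_tolerance(actual, expected, rtol, atol):
    """Hypothetical helper: the scalar form of the torch.allclose criterion."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values quoted in the commit message.
expected = 0.1041
abs_diff = 2.2649765014648438e-06
actual = expected + abs_diff

# Allowed deviation under the old and the loosened tolerances.
old_bound = 1e-6 + 1e-5 * abs(expected)  # about 2.041e-06
new_bound = 1e-5 + 1e-4 * abs(expected)  # about 2.041e-05

print("old tolerances pass:", within_tolerance(actual, expected, rtol=1e-5, atol=1e-6))
print("new tolerances pass:", within_tolerance(actual, expected, rtol=1e-4, atol=1e-5))
```

With the old settings the allowed deviation is smaller than the observed difference, so the comparison fails. With the loosened settings the bound is ten times larger and the same pair of values passes.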

8186 commits