onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-13 18:08:13 +00:00

Author	SHA1	Message	Date
dependabot[bot]	ffdcde7cc7	Bump minimatch from 3.0.4 to 3.0.5 in /js/web (#13722 ) Bumps [minimatch](https://github.com/isaacs/minimatch) from 3.0.4 to 3.0.5. Commits: `707e1b2` 3.0.5; `a8763f4` Improve redos protection, add many tests; `bafa295` Use master branch for travis badge; `013d64d` update travis. See full diff in [compare view](https://github.com/isaacs/minimatch/compare/v3.0.4...v3.0.5). Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-12-07 13:14:59 -08:00
Adam Louly	f453d2845e	adding get and set lr for optimizer (#13661 ) ### Description Exposing get and set Learning rate for optimizer ### Motivation and Context you can now set learning rate for optimizer.	2022-12-07 11:59:11 -08:00
Ashwini Khade	983877c712	Decouple strided tensor support from ENABLE_TRAINING (#13829 ) ### Description Decouple strided tensor support from ENABLE_TRAINING ### Motivation and Context This is step 1 for creating a dedicated build for on device training. Intention is 1. We can set ENABLE_STRIDED_TENSORS in cmake when either ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, this way we dont have to use if defined(ENABLE_TRAINING) \|\| defined(ENABLE_TRAINING_ON_DEVICE ) everywhere in the code. 2. This also paves the way to easily enable strided tensor support for inference in future (if required).	2022-12-07 09:22:21 -08:00
Yi Zhang	f6c493793d	Revert "skip TestCUDAProviderOptions in End2EndTest (#13737 )" (#13874 ) This reverts commit `87d5703b14`. ### Motivation and Context There was a bug in Linux CUDA installation. The OS image is updated. The TestCUDAProviderOptions could be reenabled.	2022-12-07 23:33:59 +08:00
Yi Zhang	ae2a9373ab	reenable quant model tests (#13871 ) ### Motivation and Context Test data in the image has been fixed.	2022-12-07 23:33:22 +08:00
Patrice Vignola	96d8d2c278	[DML EP] Add SkipLayerNormalization (#13849 ) ### Description Add SkipLayerNormalization for the DML EP	2022-12-07 01:49:14 -08:00
Hariharan Seshadri	004a1538d3	Extend vocab padding for logits MatMul for fp16 GPT2 GreedySearch (#13842 )	2022-12-06 19:39:20 -08:00
cloudhan	f79d38181b	Fix hipify to avoid nccl_service.h: No such file or directory (#13852 ) Fix various flaky build error due to onnxruntime_session missing dependencies on hipify generated files.	2022-12-07 09:10:37 +08:00
Changming Sun	d12521d7b2	Upgrade pybind11 (#13853 ) Upgrade pybind11 to include the fix for #9735	2022-12-06 15:39:23 -08:00
Yi Zhang	78d18fbf34	Use CacheTask to Accelerate MacOS build (#13859 ) ### Description Use CCache and ADO CacheTask to Accelerate MacOS build. ref: https://learn.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops ### Motivation and Context The MacOS CI duration could be reduced from more than 70minutes to 10 minutes https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=824912&view=results	2022-12-07 07:14:40 +08:00
Yi Zhang	d2188fbff9	skip resnet50-int8 model test in training (#13856 )	2022-12-06 22:47:24 +08:00
Ashwini Khade	65201e47bf	Enable nuget packages for on device training (#13637 ) ### Description This PR enables building nuget packages locally for on device training using the --build_nuget arg. This PR also enables the C# bindings by default in the managed package. If a user triggers any training APIs when the native binary is not built for training, an exception with the message "Training is disabled in the current build. Please build ONNXRuntime from source with the build flags enable_training and enable_training_on_device. " is thrown. Build command for creating nuget packages for on device training: build.bat --enable_training --enable_training_on_device --build_nuget Two nuget packages are built: 1. Microsoft.ML.OnnxRuntime.Managed 2. Microsoft.ML.OnnxRuntime.Training OR Microsoft.ML.OnnxRuntime.Training.Gpu	2022-12-05 14:54:09 -08:00
JiCheng	d5574e6999	LayerNorm test fix (#13840 ) ### Description LayerNorm test cases with fp16/bf16 fail on Android and iOS because neither platform supports these combinations of datatypes. https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=134&_a=summary https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=53&_a=summary	2022-12-05 22:49:22 +08:00
Hariharan Seshadri	5f4e0c95ec	Misc minor bug fixes in transformer kernels (#13780 )	2022-12-04 21:30:57 -08:00
mindest	f34ebbc8ff	fix a wrong assert condition in benchmark_helper (#13821 ) ### Description fix a wrong assert condition in benchmark_helper.py (introduced in #13455)	2022-12-03 18:50:47 +08:00
Pranav Sharma	335b62bde6	Fix invocation of GetInputMemoryType. (#13828 ) ### Description GetInputMemoryType was introduced in ver 13 in [this PR](https://github.com/microsoft/onnxruntime/pull/10879). The ver check introduced in this PR allows custom ops compiled using older versions to work with newer versions (> 12) of the ORT binary. ### Motivation and Context Fixes binary compatibility.	2022-12-02 18:42:14 -08:00
Patrice Vignola	b53bbe7370	[DML EP] Add an implementation for NonZero (#13768 ) ### Description Add the NonZero op for DML ### Motivation and Context NonZero is used in a few transformer models, so having a DML implementation will stop large tensors from being transferred to the CPU and back to the GPU	2022-12-02 18:39:21 -08:00
Gaz Iqbal	b9702587df	[oneDNN] Implemented Concat Op (#13646 ) ### Description This PR implements the Concat Operator for the OneDNN Execution Provider. ### Motivation and Context - As part of evaluating ORT performance on ARM based targets such as Graviton3, we discovered that the OneDNN EP had some gaps on operator coverage. - The Concat Operator is fairly common and used in models such as Yolov5, MobileNet, DistillBert and GPT2 - For Yolov5 specifically, this improves average inference time over 100 runs on Graviton3 from 180.2ms to 115.5ms when using OneDNN + ARM Compute Library. Co-authored-by: Gaz Iqbal <giqbal@octoml.ai>	2022-12-02 13:30:37 -08:00
Patrice Vignola	c2d08fd73a	[DML EP] Add support for LayerNorm (scale == nullptr) != (bias == nullptr) (#13818 ) ### Description Add support for LayerNorm scale == nullptr != bias == nullptr	2022-12-02 13:19:53 -08:00
Patrice Vignola	a0b470bc35	[DML EP] Add mixed datatype support for DML's LayerNorm contrib op (#13734 ) ### Description Add mixed datatype support for DML's LayerNorm contrib op. ### Motivation and Context The fusion logic removes casts around LayerNorm in the graph because the contrib version of the op supports mixed datatypes. Scale, Bias and Output's datatypes must match, but input's datatype can be different.	2022-12-01 14:08:18 -08:00
JiCheng	82d123b6c9	[quick fix] Build onnxruntime under DISABLE_ABSEIL (#13799 )	2022-12-01 10:00:31 -08:00
Changming Sun	04900f96c1	Improve dependency management (#13523 ) ## Description 1. Convert some git submodules to cmake external projects 2. Update nsync from [1.23.0](https://github.com/google/nsync/releases/tag/1.23.0) to [1.25.0](https://github.com/google/nsync/releases/tag/1.25.0) 3. Update re2 from 2021-06-01 to 2022-06-01 4. Update wil from an old commit to 1.0.220914.1 tag 5. Update gtest to a newer commit so that it can optionally leverage absl/re2 for parsing command line flags. The following git submodules are deleted: 1. FP16 2. safeint 3. XNNPACK 4. cxxopts 5. dlpack 6. flatbuffers 7. googlebenchmark 8. json 9. mimalloc 10. mp11 11. pthreadpool More will come. ## Motivation and Context There are 3 ways of integrating 3rd party C/C++ libraries into ONNX Runtime: 1. Install them to a system location, then use cmake's find_package module to locate them. 2. Use git submodules 3. Use cmake's external projects (externalproject_add). When this project was just started, we considered both option 2 and option 3. We preferred option 2 because: 1. It's easier to handle authentication. At first this project was not open source, and it had some other non-public dependencies. If we use git submodules, ADO will handle authentication smoothly. Otherwise we need to manually pass tokens around and be very careful not to expose them in build logs. 2. At that time, cmake fetched dependencies after "cmake" finished generating vcprojects/makefiles. So it was very difficult to make cflags consistent. Since cmake 3.11, it has a new command: FetchContent, which fetches dependencies when it generates vcprojects/makefiles just before add_subdirectories, so the parent project's variables/settings can be easily passed to the child projects. As the project went on, we had some new concerns: 1. As we started to have more and more EPs and build configs, the number of submodules grew quickly. For most developers, most ORT submodules are not relevant to them. They shouldn't need to download all of them. 2. It is impossible to let two different build configs use two different versions of the same dependency. For example, right now we have protobuf 3.18.3 in the submodules. Then every EP must use the same version. Whenever we have a need to upgrade protobuf, we need to coordinate across the whole team and many external developers. I can't manage it anymore. 3. Some projects want to manage the dependencies in a different way, either because of their preference or because of compliance requirements. For example, some Microsoft teams want to use vcpkg, but we don't want to force every user of onnxruntime to use vcpkg. 4. Someone wants to dynamically link to protobuf, but our build script only does static linking. 5. It is hard to handle security vulnerabilities. For example, whenever protobuf has a security patch, we have a lot of things to do. But if we allowed people to build ORT with a different version of protobuf without changing ORT's source code, customers who build ORT from source would be able to act on such things more quickly. They would not need to wait for ORT to ship a patch release. 6. Every time we do a release, github also publishes a source zip file and a source tarball for us. But they are not usable, because they lack the submodules. ### New features After this change, users will be able to: 1. Build the dependencies in the way they want, then install them somewhere (for example, /usr or a temp folder). 2. Or download the dependencies by using cmake commands from the dependencies' official websites 3. Similar to the above, but use private mirrors to mitigate supply chain risks. 4. Use different versions of the dependencies, as long as our source code is compatible with them. For example, you can't use protobuf 3.20.x, as it needs code changes in ONNX Runtime. 5. Only download the things the current build needs. 6. Avoid building external dependencies again and again in every build. ### Breaking change The onnxruntime_PREFER_SYSTEM_LIB build option is removed; you can think of it as defaulting to ON from now on. If you don't like the new behavior, you can set FETCHCONTENT_TRY_FIND_PACKAGE_MODE to NEVER. Besides, for those who relied on the onnxruntime_PREFER_SYSTEM_LIB build option, please be aware that this PR changes find_package calls from Module mode to Config mode. For example, in the past if you had installed protobuf via apt-get from Ubuntu 20.04's official repo, find_package could find it and use it. But after this PR, it won't. This is because the protobuf version provided by Ubuntu 20.04 is too old to support "config mode". It can be resolved by getting a newer version of protobuf from somewhere.	2022-12-01 09:51:59 -08:00
Patrice Vignola	e9b92fdf33	[DML EP] Add DML implementation for BiasGelu (#13795 ) ### Description Add DML implementation for BiasGelu	2022-12-01 09:23:19 -08:00
Numfor Tiapo	e0dcbc3832	Fix C26436 prefast errors (#13774 ) Fixes errors 9196, 9214, 9255, and 9314. Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2022-12-01 09:07:44 -08:00
Patrice Vignola	4128e44b4f	[DML EP] Upgrade DML to 1.10.0 (#13796 ) ### Description Upgrade DML to 1.10.0	2022-11-30 21:32:14 -08:00
Yi Zhang	777c474f61	skip quantized model C# tests on GPU (#13782 ) ### Description Skip quantized model C# tests on GPU too. ### Motivation and Context It looks the current test result isn't reasonable. https://github.com/onnx/models/issues/581 Once we update the image, the quantized model [test data will be generated with VNNI](`ba629906dd`), the CI would be broken.	2022-12-01 12:33:20 +08:00
Wei-Sheng Chin	7df8f84228	Improve DORT document (#13790 ) 1. Refine words based on PyTorch changes. 2. Make the need of inference mode clearer. A test is added.	2022-11-30 16:55:25 -08:00
Yulong Wang	77c97b6f16	[js/rn] support load model from buffer on Android (#12676 ) Description: [js/React Native] Add android implementation for creating session from buffer. #12500 Co-authored-by: Rachel Guo <guorachel@microsoft.com>	2022-11-30 10:55:55 -08:00
Wei-Sheng Chin	639d285670	[DORT] Catch up with yesterday's PyTorch change (#13779 ) Fix recent CI failures.	2022-11-30 09:23:44 -08:00
Xavier Dupré	441b30b2d2	Move a function call outside a loop in ORTModule (#13771 ) ### Description The proposed change is useful for ORTModule when the output graph has multiple outputs. ### Motivation and Context performance Signed-off-by: xadupre <xadupre@microsoft.com>	2022-11-30 12:49:41 +01:00
Patrice Vignola	08ed09d20b	Add DML support to the transformers benchmark.py script (#13776 ) ### Description Add DML support to the transformers benchmark.py script ### Motivation and Context Before this change, running the `benchmark.py` script when the `onnxruntime-directml` package is installed resulted in an error because it expects a CUDA or ROCM framework.	2022-11-29 18:57:52 -08:00
Changming Sun	29ed8811e5	Move C/C++ deps' URLs to deps.txt (#13769 ) ### Description 1. Move C/C++ deps' URLs to deps.txt, and download the dependencies from Azure Devops Artifacts instead of github. 2. Add "EXCLUDE_FROM_ALL" keyword to the cmake external projects, so that we only build the parts we need and avoid installing the 3rd-party dependencies when people run `make install` in ORT's build directory. However, at this moment cmake itself doesn't have the feature. So I copied their code to cmake/external/helper_functions.cmake and modified it. This PR is split from #13523, to make that one smaller. ### Motivation and Context 1. Secure the supply chain 2. Make it possible to automatically detect if ORT has an old dependency that hasn't been updated for a long time.	2022-11-29 18:06:35 -08:00
Jeff Bloomfield	571dc5a1f1	Support external weights in DML execution provider (#13740 ) ### Description This enables support for external weights in the DML execution provider when its graph optimization logic is reached. ### Motivation and Context External weights are encountered after optimization is applied to transformer models.	2022-11-29 15:47:16 -08:00
stevenlix	ce0025d3f2	Fallback Pow op in layer norm to FP32 in TRT to avoid overflow (#13639 ) Accuracy loss is observed when transformer models such as BERT, DeBERTa, ViT are running in TRT FP16 mode. The cause is that overflow happens at Pow op in layer norm. This PR provides the option to force Pow to run in TRT FP32 precision if overflow occurs. Co-authored-by: Ubuntu <azureuser@orteplinuxdev.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>	2022-11-29 13:37:31 -08:00
Chi Lo	0327606d2d	Revert TRT EP Linux CI to run unit tests in container (#13766 ) Revert TRT EP Linux CI to old behavior that code build and unit tests are both executing in container. So that we don't have to update the VM image for native Ubuntu to include latest TRT libraries every time newer version of TRT is introduced.	2022-11-29 13:15:27 -08:00
Tianlei Wu	abe1642a0c	Update fusion for distilbert accuracy test on SQuAD (#13748 ) (1) Embed layer fusion to work with --use_mask_index. (2) Parse num_heads and hidden_size from a pattern of Concat shape node. (3) Fix a typo (CUDAExcecutionProvider=> CUDAExecutionProvider) in eval_squad.py (4) Update example comments in eval_squad.py to use optimized fp16 model. (5) Update tests in test_optimizer.py	2022-11-29 13:06:39 -08:00
FFrog	181628ced1	[CANN] add more operators (#13578 ) ### Description Add new operators and enhance existing ones. ### Motivation and Context The operators of the CANN EP are modified as follows: The list of enhanced operators is as follows: - Add - Sub - Mul - Div - Gemm - MatMul - AveragePool - GlobalAveragePool - MaxPool - GlobalMaxPool - Dropout The new operators are as follows: - Abs - Neg - Floor - Ceil - Reciprocal - Sqrt - Log - Exp - Erf - Round - Sin - Cos - Cast - Reshape - Transpose The remaining operators will be supported in the next PRs.	2022-11-29 12:08:36 -08:00
Baiju Meswani	2c29938846	[QAT] Introduce FakeQuant op (#13649 )	2022-11-29 08:43:37 -08:00
sfatimar	49c3768985	Enabled ops for DeBERTa model (#13690 ) ### Description Enabled the GatherElements op to enable the DeBERTa model ### Motivation and Context This change is required to enable the DeBERTa model, which is relevant to MSFT. Co-authored-by: mayavijx <mayax.vijayan@intel.com>	2022-11-28 22:39:32 -08:00
pengwa	7c53b6eee8	Skip the tests of saving tensor in backward (#13767 ) ### skip the tests of saving tensor in backward The test failed randomly; Let's skip it until the issue got fixed to unblock the CIs.	2022-11-29 13:02:26 +08:00
Vincent Wang	3c258c878c	[CUDA] Optimize Slice Kernel (#13641 ) The PR optimizes the Slice CUDA kernel in two ways: - Coalesce dimensions so there is less divmod during the kernel compute - Split data load and write for better memory throughput Perf results (cycle counts from Nsight Compute) on V100 using real cases from Huggingface's XLNet model, old vs. new: [8,12,2048,1024], axis=2, start=1, end=2048: 1838687 vs. 1539846; [8,12,1024,2047], axis=3, start=0, end=1024: 951383 vs. 722203	2022-11-29 09:18:03 +08:00
JiCheng	47780b7f3b	[XNNPACK] add more computation heavy ops (#13270 ) ### Description This is the first PR adding the remaining ops for the XNNPACK EP. I am going to add: - [x] ConvTranspose f32 qu8 qs8 - [x] ~~UnMaxpool f32 qu8 qs8~~ - [x] Resize f32 qu8 qs8 - [ ] GEMM see https://github.com/microsoft/onnxruntime/pull/13126 Support for the remaining operations will be separated into another PR.	2022-11-29 09:09:26 +08:00
Dmitri Smirnov	4fbe16e493	Ifdef cpuinfo code on platforms we do not set affinity (#13486 ) ### Description Remove code that invokes cpuinfo library on platforms we do not set affinity. ### Motivation and Context `cpuinfo` library increases binary size.	2022-11-28 13:44:16 -08:00
Guenther Schmuelling	2d523c507e	for wasm catch exceptions at top level api (#13644 ) fix for https://github.com/microsoft/onnxruntime/issues/13383, https://github.com/microsoft/onnxruntime/issues/13408 Currently ort-web doesn't catch exceptions because turning on exception catching increases the binary size by 3MB (~30%). But ort can throw (ie onnx errors or ORT_ENFORCE) and there is no useable error message. Turning on exception catching just for top level api released file will fix the error messages at minimal increase of binary size.	2022-11-28 10:24:34 -08:00
Faith Xu	b7c3862330	Update resource section in readme (#13724 ) ### Description - adds link to release plans page - adds link to youtube channel	2022-11-28 09:42:31 -08:00
Jicheng Tang	b4a4fa5aac	Fix compile error with protobuf RepeatedIterator (#13731 ) ### Description There are some compile errors with google::protobuf::internal::RepeatedIterator. Replace reinterpret_cast with &(iter), where iter is of RepeatedIterator type. ### Motivation and Context My protobuf version is: - libprotoc 3.21.5 - g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 When I use the build command: ``` ./build.sh --use_cuda --cudnn_home /usr --cuda_home /usr/local/cuda --config Debug --build_shared_lib --parallel ``` there are some compile errors like this: - error 1 onnxruntime/test/util/test_utils.cc:186:105: error: no matching function for call to ‘make_span(google::protobuf::RepeatedField<long int>::const_iterator, google::protobuf::RepeatedField<long int>::const_iterator)’ 186 \| ind_span = gsl::make_span(indices_proto.int64_data().cbegin(), indices_proto.int64_data().cend()); - error 2 onnxruntime/test/onnx/tensorprotoutils.cc:101:56: error: invalid cast from type ‘google::protobuf::internal::RepeatedIterator<const long unsigned int>’ to type ‘const uint32_t’ {aka ‘const unsigned int’} 101 \| p_data++ = reinterpret_cast<const T>(data_iter);	2022-11-28 09:33:53 -08:00
Numfor Tiapo	aa1390e963	Fix Prefast Errors (#13675 ) Fixes all C28204, C6031, and C26814 prefast errors. Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2022-11-28 09:16:22 -08:00
Ted Themistokleous	c6bea4f02f	Modify MIGraphX EP for Accuracy tests (#13455 ) Allows MIGraphX EP to run the following additional tests. Also adds support to get MIGraphX to run eval_squad.py Reference to the Rocm EP changes: https://github.com/microsoft/onnxruntime/pull/13306 Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com> Co-authored-by: Ted Themistokleous <tthemist@amd.com>	2022-11-27 18:26:49 +08:00
Yufeng Li	4ca62b9ee8	fix build break in test/beam_search_topk.cc (#13739 )	2022-11-23 21:20:51 -08:00
Vincent Wang	47e7630378	[CUDA] Transpose3DImpl Supporting more Cases (#13611 ) CUDA's Transpose3DImpl transposes [batch, m, n] to [batch, n, m]. Currently it requires both m and n to be divisible by 32 or 16; otherwise the compute falls back to the general implementation, which is slow. This PR removes that limitation. Profiling on V100 with the tensor sizes below, cycle counts from Nsight Compute (old vs. new): [3072,64,512]: 760793 vs. 727140; [3072,16,2048]: 854303 vs. 851146; [3072,2048,12]: 986924 vs. 737884; [3072,1024,24]: 1212427 vs. 495117. Even with the extra IF statements added to the kernel implementation, there is almost no impact on the cases the old version already handled (cases 1 and 2), and cases 3 and 4, which previously fell back to the general implementation, are much faster. The data above was collected using FP16 tensors; similar results were observed for float tensors. This PR improves the perf of ORT training of Huggingface's XLNet model, which has [8,1024,1024,12].permute(0,3,1,2).	2022-11-24 09:40:48 +08:00


7788 commits