onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-02 03:55:34 +00:00

Author	SHA1	Message	Date
Yi Zhang	209b6dbd97	add onnxruntime_float16.h into header validation list. (#17935 ) ### Description supplement of #17637 ### Motivation and Context	2023-10-16 08:52:16 +08:00
Scott McKay	ae211999dd	Attempt to make the usage of the Android emulator in CIs more robust (#17903 ) ### Description <!-- Describe your changes. --> Android emulator usage updates: - Change approach to detecting boot has completed - use `-delay-adb` and a simple command (`ls`) with `wait-for-device` as the first step - this ensures enough startup has occurred for adb to be responsive - use secondary loop on the python side to check for sys.boot_completed to be set - doing the check on the python side provides more feedback and seems to work well - make the 'stop' logic more precise by using psutil - add internal timeout of 20 mins for emulator startup - waiting for the CI jobs overall timeout is way too long - value is hardcoded for now (most CIs startup in under 10 mins) but could be made configurable if needed CI updates: - add template for using the Android emulator - update CIs to use template - reorder React Native CI - minimize the time the Android emulator or iOS simulator is running by moving some build steps around - don't run both at the same time - unnecessary and potentially adds significant memory pressure to the machine - fix QNN Android emulator CI as much as possible - now everything works apart from running onnx_test_runner with the QNN EP ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix inconsistent detection of the emulator boot completing. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-10-15 08:42:36 +10:00
Zhipeng Han	a55b2688b6	Add save_attribute option to quantize_static (#17945 ) ### Description <!-- Describe your changes. --> Some models carry very large Constant tensors. Estimated size of the RWKV model: ONNX graph (8MB), initializer tensors (200MB), constants (~5.7GB). `onnx.save_model` fails on such a model because the Constants are not written to external data; only the initializer tensors are. This change exposes a parameter to support writing constants to external data. Model owners can customize the output behavior, and the default behavior is unchanged. Quantizing the model and saving it locally fails because the output size exceeds 2GB, even with `use_external_data_format=True`. The `use_external_data_format` flag only outputs initializer tensors to external data. Use the `convert_attribute` flag to output all tensors to external data. ``` def convert_model_to_external_data( model: ModelProto, all_tensors_to_one_file: bool = True, location: Optional[str] = None, size_threshold: int = 1024, include_attribute: bool = False, ) -> None: tensors = _get_initializer_tensors(model) if include_attribute: tensors = _get_all_tensors(model) ... ``` `onnx.external_data_helper.convert_model_to_external_data` supports writing attributes to external data with the flag `include_attribute=True`. However, this parameter is hidden by `onnxruntime\quantization\onnx_model.py`, so the constants (~5.7GB) within the model hit the protobuf 2GB limit with default parameters. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix https://github.com/microsoft/onnxruntime/issues/17944 --------- Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>	2023-10-14 06:29:08 -07:00
Yufeng Li	11af34440a	Add MatMul 4bits support on GPU (#17890 ) ### Description <!-- Describe your changes. --> Add a contrib op MatMulNBits and related toolchain to support quantization on weight. This PR only adds support for 4bits. It: - add schema for contrib op MatMulNBits which can support 1-7 bits quantization on weight. - a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for 4bits MatMulNBits and related benchmark tool - tool to quantization model with 4bits. Next: - add general and more efficient kernels for 4bits MatMulNBits on CPU and GPU	2023-10-13 16:55:30 -07:00
Adrian Lizarraga	3b69d9b886	[QNN EP] Fix index-out-of-bounds bug in Slice builder when initializer is shared (#17905 ) ### Description There's an index-out-of-bounds bug that is triggered when a Slice operator shares an initializer with another operator that is processed first. In this case, QNN EP fails to properly initialize a `raw_starts` (or `raw_ends`) vector, which is later indexed by a call to `SliceOp::PrepareForComputeHelper()`. ### Motivation and Context Fix bug that blocks https://github.com/microsoft/onnxruntime/pull/17764	2023-10-13 13:46:29 -07:00
Wanming Lin	9c65d5558c	[WebNN EP] Fixed wasm heap overflow for big model (#17902 ) When processing initialized tensors, WebNN unpacked tensors unnecessarily even though they are already stored as raw byte data. This could cause a WASM heap overflow when running big models. Fixed this issue by pointing directly to the tensor raw data. --------- Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>	2023-10-13 13:40:31 -07:00
RandySheriffH	c6c3555d0e	Custom op shape inference API (#17737 ) Add c/cxx API to allow custom ops do shape inference. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-10-13 12:57:42 -07:00
Zhang Lei	762703e037	Support output cross qk, dtw and more for whisper model (#17500 ) Support cross qk in beam search for whisper model and related features Make whisper exporting tools support cross qk and some related features, * extra_decoding_ids * no_speech_prob Implement DTW kernel, unfold tensor kernel with unit test Several fixes related to multiple sessions running in parallel, like: * guard multihead_attention, fused_fp16_runner_ * some memory allocation with stream awareness * add use_ep_level_unified_stream option	2023-10-13 11:47:15 -07:00
Tianlei Wu	c695de91ee	Update eval_squad to use API of latest optimum (#17918 ) Update eval_squad with latest optimum. Tested with: * optimum 1.13.1 * transformers 4.31.0 * onnxruntime-gpu 1.16.0 * onnx 1.14.1 * datasets 2.14.5 * evaluate 0.4.0 * torch version 2.2.0.dev20230920+cu121 Example output in A100: {'exact': 86.66035950804162, 'f1': 92.99622739711005, 'total': 10570, 'HasAns_exact': 86.66035950804162, 'HasAns_f1': 92.99622739711005, 'HasAns_total': 10570, 'best_exact': 86.66035950804162, 'best_exact_thresh': 0.9998456239700317, 'best_f1': 92.9962273971104, 'best_f1_thresh': 0.9998456239700317, 'total_time_in_seconds': 84.74025378189981, 'samples_per_second': 124.73410838731417, 'latency_in_seconds': 0.008017053337928081, 'provider': 'CUDAExecutionProvider', 'disable_fused_attention': False, 'pretrained_model_name': 'bert-large-uncased-whole-word-masking-finetuned-squad', 'onnx_path': './bert-large-uncased-whole-word-masking-finetuned-squad/optimized_model.onnx', 'batch_size': 1, 'sequence_length': 384, 'use_io_binding': True}	2023-10-13 10:39:15 -07:00
Yufeng Li	7551dd039f	Fix build break with cuda 12.2 (#17922 ) ### Description <!-- Describe your changes. --> nvcc 12.2 crashes while building onnxruntime/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_* for SM<8.0. nvcc 11.8 works though. It should be a bug in nvcc 12.2. This PR excludes building flashattention for arch < 800.	2023-10-13 10:21:06 -07:00
Chi Lo	28c1944561	[TensorRT EP] Add back creation cudnn and cublas using external cuda stream (#17912 ) Add back creation cudnn and cublas using external cuda stream.	2023-10-13 10:18:28 -07:00
Sheil Kumar	635d3faa3b	Remove half of the compiled shaders for gridsample with unused types (#17928 ) Remove half of the compiled shaders for gridsample with unused types Shader compilations for Bool and Double datatypes are not needed for GPU evaluation and are causing binary bloat. Removing them. Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-13 09:29:37 -07:00
Tianlei Wu	67d7eb3ac5	[CUDA] Fix SkipLayerNorm strict mode when skip has broadcast (#17896 ) In SLN strict mode, current code (#16510) does not handle skip broadcast nicely . There are two issues: (1) skip related parameters is not passed to cuda kernel in strict mode (2) Strict mode kernel also has bug in handling skip broadcasting (like cuWelfordMuSigma2 does not handle skip broadcasting). Here we remove the support of skip broadcasting in strict mode, and operator will return error message that strict mode only support same shape of input and skip. Other changes: * skip_size is misleading when there is no broadcasting. Change to correct value. * Refactor the code to be more efficient: (1) no need to check whether there is broadcasting in kernel. (2) remove one local buffer (load input to sum_v directly to save a local buffer copy). * compute input + bias + skip instead of input + skip + bias. The order is followed common pattern in transformers model (Here assume graph fusion will distinguish input and skip correctly, need double check fusion code later). * update unit test so that strict mode is triggered in each test case (unless skip broadcasting) to have higher test coverage. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> SLN strict mode does not support skip broadcast but current code will silently run (kernel might fail)	2023-10-13 07:51:37 -07:00
Adrian Lizarraga	4e2a03d5fa	[QNN EP] Fix topological node unit traversal during validation (#17913 ) ### Description We need to ensure that tensors are first created and validated by their producers. If we don't, then builders that need to modify their outputs may not be able to do so if consumers are processed first (due to caching of tensors). For example, the Tanh builder may need to override its output quant param for 16-bit QDQ. I've encountered a scenario (while working on a partner model) where the override was not being correctly applied due to the graph traversal order. I tried to fix this bug in a previous [PR](https://github.com/microsoft/onnxruntime/pull/17877#discussion_r1353676802), but my fix was incorrect.	2023-10-12 23:37:01 -07:00
Adrian Lizarraga	dad70ad4e8	[QNN EP] Handle rank 3 InstanceNormalization with N != 1 (#17897 ) ### Description The QNN HTP backend does not support rank 3 InstanceNorm if the batch size is not 1. To work around this limitation, QNN EP can wrap a rank 4 QNN InstanceNorm op with Reshapes (with the H dim set to 1). ### Motivation and Context Enable support for more models.	2023-10-12 21:52:09 -07:00
PeixuanZuo	0c5b1598d3	[ROCm] Add ROCm Debug wheels to private ADO Feeds (#17887 ) Add ROCm Debug wheels to private ADO Feeds	2023-10-13 10:28:10 +08:00
Jeff Daily	07317316cc	CUDA EP vs ROCM EP hipify audit (#17776 ) Migrate most CUDA EP improvements and changes to ROCM EP. The process involves using hipify against all CUDA EP files (i.e. do not exclude any files from onnxruntime_rocm_hipify.cmake) then vimdiff compare them against the ROCM EP files that are under source control and pull in most changes. These changes include functional as well as formatting and makes comparing CUDA EP and ROCM EP easier, though it makes the PR diff somewhat less obvious due to formatting changes. - hipify audit of onnxruntime/core/providers/rocm, enable ops - Loop - Scan - hipify audit of onnxruntime/contrib_ops/rocm - fix contrib ops search implementation - enable more contrib ops - Affine - ComplexMul - ConvTransposeWithDynamicPads - Crop - DynamicSlice - FFT [Rfft, Irfft] - GreedySearch - ImageScaler - ParametricSoftplus - ScaledTanh - ThresholdRelu --------- Co-authored-by: cloudhan <cloudhan@outlook.com>	2023-10-13 10:13:53 +08:00
Scott McKay	ba7f20ac57	Fix illegal opcode error from mlas (#17885 ) ### Description <!-- Describe your changes. --> Use cpuinfo value when checking whether dot product is available. Reading the ID_AA64ISAR0_EL1 register is unsafe. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> #17647 #17541 #17851	2023-10-13 08:27:15 +10:00
Tang, Cheng	ca8cab29cd	distributed slice (#17761 ) ### Description Support DistributedSlice kernel in Cuda EP. mainly support following cases: 1. input data is sharded or replica for all axes (including slice axes) 2. slice axes is sharded across different devices. starts / ends / steps sharded across different devices are not supported yet. --------- Co-authored-by: Wei-Sheng Chin <wschin@outlook.com> Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com>	2023-10-12 14:28:00 -07:00
Changming Sun	3f3ece4a39	Update NDK to 26.0.10792818 (#17852 ) ### Description Update NDK to 26.0.10792818 which is included in every macOS build machine so that we do not need to download a different version every time in every build. ### Motivation and Context Downloading NDK on-the-fly is a main contributor of Android related build failures.	2023-10-12 14:08:43 -07:00
zesongw	163218d6d7	[WebNN EP] Update Op Softmax for readability (#17826 ) ### Description <!-- Describe your changes. --> Improve readability by fixing misplaced comments and utilizing std::rotate. ### Motivation and Context Resolve some comments in https://github.com/microsoft/onnxruntime/pull/17714	2023-10-12 12:02:50 -07:00
Caroline Zhu	c373a808a2	Add "glue" between training WASM artifacts and training web (#17474 ) ### Description * follows the packaging approach according to the design document * adds `ENABLE_TRAINING` boolean flag to `BUILD_DEFS` * modifies `package.json` to include training submodule * modifies build script to handle, validate, and minimize training WASM artifacts * adds the binding for the new backend with training enabled & the new training artifacts * adds training backend * edits `index.ts` to use training backend depending on `BUILD_DEFS` * edits `wasm-factory.ts` to use the training artifacts if necessary ### Motivation and Context * we are in the process of adding web bindings to enable training. * Adding the "glue" to allow onnxruntime-web to use the training WASM artifacts is required for this work. * Since BUILD_DEFS is defined and used at build time, I thought that it made sense to bundle the changes to building in the same PR. #### Related work * #16521 allowed for training artifacts to be built * #17333 must be merged in before this one --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-10-12 11:16:56 -07:00
Changming Sun	809c8905fe	Update TestCase.cc: exclude a test for DML (#17909 ) ### Description Update TestCase.cc: exclude a test for DML ### Motivation and Context The test is failing due to GPU driver update.	2023-10-12 10:08:07 -07:00
Sheil Kumar	2efab54f9c	Fix missing registration for new dml device selection API (#17898 ) Fix missing registration for new dml device selection API Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-12 09:41:45 -07:00
Vincent Wang	fa0a79a921	Fix Triton Compile Error for Codegened Dropout Code (#17899 )	2023-10-12 20:57:14 +08:00
Yi Zhang	9d07ca3621	Move compliance check before publishing pipeline artifact (#17857 ) ### Description <!-- Describe your changes. --> ### Motivation and Context Compliance check would fail randomly but the stage couldn't be rerun if the pipeline artifacts are already published. There's the error like `Artifact xxxx already exists`. We had to restart the whole pipeline if there's a random error in compliance check.	2023-10-12 15:48:04 +08:00
Maximilian Müller	74a8acf405	Set default value for NVCC threads (#17866 ) Without doing this CMake gives a miscellaneous error on windows when checking if NVCC is functional. It will be missing a number after `--threads`. Currently it is only possible to configure through the python build scripts and not CMake only configure - which is what I am usually doing through CLion.	2023-10-11 22:46:40 -07:00
Tianlei Wu	e2cd6748fc	Fix GroupNorm fusion: skip if num of channels not supported (#17869 ) Right now, GroupNorm only supports a limited number of channels (320, 640, 960, 1280, 1920, 2560, 128, 256, 512). Skip the fusion if the number of channels is not supported. ### Motivation and Context SD XL refiner model uses number of channels 384, 768, 1152, 2304 and 3072 in GroupNorm.	2023-10-11 22:45:22 -07:00
Yulong Wang	25bbd8d4eb	[js/web] allow gpu IO binding tests to fail temporarily (#17892 ) ### Description allow gpu IO binding tests to fail temporarily. when the root cause is still in investigation, use `continueOnError: true` to allow the test to fail without blocking PRs.	2023-10-11 21:21:21 -07:00
Changming Sun	138ccecd22	Change how "NPM packaging pipeline" downloads packages from another pipeline (#17838 ) ### Description "NPM packaging pipeline" needs to download an artifact from "Zip-Nuget-Java-Nodejs Packaging Pipeline". It has been a long-standing issue that the two pipelines often use different commit ids. This change declares 'Zip-Nuget-Java-Nodejs Packaging Pipeline' as a resource, so that "NPM packaging pipeline" will always fetch from the pipeline run that triggers this NPM pipeline. Their official document says: "When you define a resource trigger, if its pipeline resource is from the same repo as the current pipeline, triggering follows the same branch and commit on which the event is raised."	2023-10-11 21:07:27 -07:00
Yi Zhang	20798a9f03	Enable onnx_test_runner to run the whole models dir in CI machine (#17863 ) ### Description 1. If the model should be skipped, don't load it. 2. print loaded tests and skipped tests 3. add more same filters as of the onnxruntime_test_all. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-12 12:01:02 +08:00
Wanming Lin	b3cab55d68	[WebNN EP] Add a duplicate entry to support new "dataType" (#17841 ) WebNN spec renames "type" as "dataType" at https://github.com/webmachinelearning/webnn/pull/464, add a duplicate entry for "dataType" in order to workaround the compatibility issue.	2023-10-11 19:13:13 -07:00
Adrian Lizarraga	565bead85f	[QNN EP] Support Softmax/LogSoftmax with any axis attribute (#17877 ) ### Description The QNN HTP backend only supports Softmax/LogSoftmax operators with an axis attribute set to `input_rank - 1` (i.e., the last dimension). This PR adds support for any axis by wrapping the QNN operator in transposes. ### Motivation and Context Support more models.	2023-10-11 17:43:42 -07:00
pengwa	63dc5dc1a9	Add document for PythonOp (#17888 ) ### Add document for PythonOp https://github.com/microsoft/onnxruntime/blob/pengwa/pythonop_doc/docs/ORTModule_PythonOp_Notes.md ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-12 08:36:22 +08:00
Yulong Wang	d532645bed	[js/webgpu] revise uniform support (#17871 ) ### Description <!-- Describe your changes. --> work for items (2) and (3) in #17860	2023-10-11 16:41:46 -07:00
Numfor Tiapo	b8f373b0ae	Add API for NPU Device Selection in the DML EP (#17612 ) Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-11 14:53:00 -07:00
Yulong Wang	a441a71e8e	[js/web] support different export format for ort-web (#17878 ) ### Description support different export format for ort-web.	2023-10-11 09:38:51 -07:00
pengwa	0e2782438a	Support inplace update for PythonOp/Grad (#17687 ) ### Support inplace update for PythonOp/Grad This PR is based on another PR https://github.com/microsoft/onnxruntime/pull/17685's branch, to make it easier to review. With PR https://github.com/microsoft/onnxruntime/pull/17685, all PythonOp inputs/outputs are assumed by default not to be updated in place. If an inplace update is detected during a run (by checking the output data address against all input data addresses), a clone is added before the tensor is set as PythonOp/Grad's output. In this case, results are correct, but the implicit copies introduce overhead. This PR allows users to define an output-input reuse map so that ORT knows which buffers are reused and can avoid such unnecessary copies.	2023-10-10 21:36:45 -07:00
Abhishek Jindal	54b7503c30	create patch for allgather fn for deepspeed stage 3 (#17855 ) ### Description <!-- Describe your changes. --> Patch for All gather fn for Deepspeed Stage 3 changes ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-11 11:15:06 +08:00
Tianlei Wu	948c8369a0	[CUDA/ROCm] Remove limitation of BiasAdd (#17848 ) Previously, BiasAdd only supported hidden dimensions of 320, 640 and 1280 for stable diffusion. This adds a kernel that could support any number of channels. ### Motivation and Context Stable Diffusion XL refiner model uses hidden dimensions of 768 or 1536, which was not supported in BiasAdd.	2023-10-10 20:08:45 -07:00
Yulong Wang	5228332c9f	[js] upgrade JS shared dev dependencies (#17831 ) ### Description upgrade JS shared dev dependencies. - webpack: removed - eslint: upgrade to latest. - eslint config upgraded to be compatible with latest version - typescript upgrade to v5 - update module "CommonJS" to "Node16" in tsconfig - update deprecated config "importsNotUsedAsValues" to "verbatimModuleSyntax" - remove webpack bundles in onnxruntime-common	2023-10-10 17:44:39 -07:00
Yulong Wang	c6f1a1ce69	update build_jsep.bat to add release build flags (#17471 ) ### Description flags `--enable_wasm_api_exception_catching --disable_rtti` are used in release build, so fix the build_jsep.bat script to make it more consistent with CI.	2023-10-10 17:38:35 -07:00
Tianlei Wu	d637111e9f	[CUDA/ROCm] Update BiasSplitGelu for SD XL Refiner model (#17849 ) SD XL Refiner model has new hidden dimension sizes not supported by BiasSplitGelu. This updates the kernel to support them. ### Motivation and Context Current BiasSplitGelu does not support optimization for SD XL refiner model.	2023-10-10 11:07:27 -07:00
Hector Li	9a1c884ba3	[QNN EP] Add script to generate Onnx model from native QNN generated context binary file (#17859 ) Add script to generate Onnx model from native QNN generated context binary file. This is used for QNN EP example code.	2023-10-10 10:54:35 -07:00
Yulong Wang	d9b9c5a537	[js/webgpu] support using uniform buffer (#17803 ) ### Description support using uniform buffer. This PR allows to use uniform buffer in shader program, so that some runtime information (eg. input/output shape) is no longer need to be hardcoded into shader code. There are 2 commits in this PR: - [667f31c](`667f31c83d`): framework changes to support uniform buffer, as well as updates in program manager, gpu data manager and indices helper. - [09e1d2a](`09e1d2ad1d`): an example change for operator `Transpose` to use input's rank-only instead of dims as shader key. With this change, model mobilenetv2-12 shader compile times dropped from 71 to 52.	2023-10-10 00:31:12 -07:00
Yi Zhang	53be802f39	Onnx_test_runner and onnxruntime_test_all use the same broken test list. (#17840 )	2023-10-10 13:03:58 +08:00
Changming Sun	05ac9f6f2a	Split onnxruntime_providers.cmake to multiple (#17853 ) ### Description Split onnxruntime_providers.cmake to multiple files, for easier editing. No other change was made in this PR.	2023-10-09 20:33:44 -07:00
Scott McKay	046939b0c1	Include CoreML in mac os python packages (#17844 ) ### Description <!-- Describe your changes. --> Include CoreML EP in python package. I've added to the base package as CoreML comes from the OS so there are no additional libraries to distribute. Updated the CPU-based provider list to add the AzureEP, which is also included in the base package, to fix some test failures. Without this the infrastructure thinks a device copy implementation is required between AzureEP and CoreML nodes, which is not the case as the AzureEP is CPU based. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> #16989	2023-10-10 11:44:32 +10:00
Baiju Meswani	9c716f4557	Add noexcep_operators to onnxruntime internal libraries (#17850 )	2023-10-09 16:29:41 -07:00
aciddelgado	406cd324e0	[CUDA] GroupQueryAttention operator using FlashAttention (#17674 ) ### Description Added Group Query Attention op, supporting integer multiple number of heads for Q / KV. As of now, this op can only use FlashAttention kernel, meaning it only supports sm>=80 on Linux. Results from onnxruntime/test/python/transformers/benchmark_gqa.py show an on-average ~37% speed-up over Decoder Masked Multi-Head Attention, with even greater improvements for long past sequence lengths. ``` op batch s_kv heads h_dim ms TFLOPS gqa 16 2048 8 32 0.34 0.10 dmmha 16 2048 8 32 0.39 0.09 --------- gqa 16 2048 8 64 0.45 0.15 dmmha 16 2048 8 64 0.61 0.11 --------- gqa 16 2048 8 128 0.54 0.25 dmmha 16 2048 8 128 0.83 0.16 --------- gqa 16 2048 16 32 0.45 0.15 dmmha 16 2048 16 32 0.69 0.10 --------- gqa 16 2048 16 64 0.69 0.19 dmmha 16 2048 16 64 0.83 0.16 --------- gqa 16 2048 16 128 0.71 0.38 dmmha 16 2048 16 128 1.28 0.21 --------- gqa 16 2048 32 32 0.58 0.23 dmmha 16 2048 32 32 0.77 0.17 --------- gqa 16 2048 32 64 0.58 0.46 dmmha 16 2048 32 64 1.25 0.21 --------- gqa 16 2048 32 128 0.76 0.71 dmmha 16 2048 32 128 2.15 0.25 --------- gqa 16 2048 64 32 0.68 0.39 dmmha 16 2048 64 32 1.23 0.22 --------- gqa 16 2048 64 64 0.77 0.70 dmmha 16 2048 64 64 2.11 0.25 --------- gqa 16 2048 64 128 1.10 0.97 dmmha 16 2048 64 128 4.06 0.26 --------- gqa 16 2048 128 32 1.00 0.54 dmmha 16 2048 128 32 2.09 0.26 --------- gqa 16 2048 128 64 1.10 0.97 dmmha 16 2048 128 64 4.08 0.26 ``` ### Motivation and Context As of now, this op is targeted for use on LLama models, as it supports kv-caching and different number of heads for Q and KV (Grouped Query Attention). We plan to add support for more platforms, input formats, etc. in the future. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>	2023-10-09 12:43:12 -07:00

9783 commits
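
Commit 11af34440a describes the naive 4-bit MatMulNBits path as MatMul(A, Dequantize(B)). The idea can be sketched in pure Python as below. This is a minimal illustration under stated assumptions: symmetric blockwise quantization with one scale per block, a fixed zero point of 8, and two codes packed per byte (low nibble first). The function names, block layout, and packing order are hypothetical and are not taken from the operator's schema or kernels.

```python
# Illustrative sketch only: names and data layout are assumptions,
# not the MatMulNBits schema or the ONNX Runtime implementation.

def quantize_4bit(weights, block_size):
    """Quantize a flat list of floats to 4-bit codes, one scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in block:
            q = round(w / scale) + 8           # assumed zero point of 8
            codes.append(max(0, min(15, q)))   # clamp to the 4-bit range
    return codes, scales

def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte, low nibble first."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4)
                 for i in range(0, len(padded), 2))

def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:count]

def dequantize_4bit(packed, scales, count, block_size):
    """Turn packed codes back into floats using the per-block scales."""
    codes = unpack_nibbles(packed, count)
    return [(c - 8) * scales[i // block_size] for i, c in enumerate(codes)]

def matmul_4bit(a, packed_cols, scales_cols, k, block_size):
    """Compute A @ B where every column of B is stored 4-bit quantized."""
    b_cols = [dequantize_4bit(p, s, k, block_size)
              for p, s in zip(packed_cols, scales_cols)]
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]

# Two columns of B whose values happen to be exactly representable.
columns = [[7.0, -7.0, 3.0, 0.0], [14.0, -2.0, 0.0, 6.0]]
packed_cols, scales_cols = [], []
for col in columns:
    col_codes, col_scales = quantize_4bit(col, block_size=4)
    packed_cols.append(pack_nibbles(col_codes))
    scales_cols.append(col_scales)

a = [[1.0, 2.0, 3.0, 4.0]]
result = matmul_4bit(a, packed_cols, scales_cols, k=4, block_size=4)
print(result)  # [[2.0, 34.0]]
```

Dequantizing the whole of B before multiplying keeps the sketch short, which mirrors why the commit calls this path naive; the same commit message lists more efficient kernels as follow-up work.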