onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-21 02:18:09 +00:00

Author	SHA1	Message	Date
Ashrit Shetty	4b5b5f7101	Update win-ort-main to tip main 250123 (#23473 ) ### Description This PR is to update the win-ort-main branch to the tip main branch as of 2025-01-23. ### PR List ddf0d377a7 [QNN EP] Add LoggingManager::HasDefaultLogger() to provider bridge API (#23467) 05fbbdf91f [QNN EP] Make QNN EP a shared library (#23120) 1336566d7f Add custom vcpkg ports (#23456) 2e1173c411 Update the compile flags for vcpkg packages (#23455) 1f628a9858 [Mobile] Add BrowserStack Android MAUI Test (#23383) 009cae0ec8 [js/webgpu] Optimize ConvTranspose (Continue) (#23429) 04a4a694cb Use onnx_protobuf.h to suppress some GCC warnings (#23453) 2e3b62b4b0 Suppress some strict-aliasing related warnings in WebGPU EP (#23454) b708f9b1dc Bump ruff from 0.9.1 to 0.9.2 (#23427) c0afc66b2a [WebNN] Remove workarounds for TFLite backend (#23406) 8a821ff7f9 Bump vite from 6.0.7 to 6.0.11 in /js/web/test/e2e/exports/testcases/vite-default (#23446) 220c1a203e Make ORT and Dawn use the same protobuf/abseil source code (#23447) b7b5792147 Change MacOS-13 to ubuntu on for android-java-api-aar-test.yml. (#23444) 19d0d2a30f WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP (#23365) 95b8effbc4 [QNN EP]: Clean up QNN logging resources if an error occurs during initialization (#23435) 626134c5b5 Bump clang-format from 19.1.6 to 19.1.7 (#23428) 0cf975301f Fix eigen external deps (#23439) f9440aedce Moving RN_CI Android Testing to Linux (#23422) 1aa5902ff4 [QNN EP] workaround for QNN validation bug for Tanh with uint16 quantized output (#23432) 7f5582a0e2 Seperate RN andriod and IOS into 2 separated Stages. 
(#23400) 73deac2e7f Implement some missing element wise Add/Sub/Mul/Div/Neg operations for CPU and CUDA EPs (#23090) 949fe42af4 Upgrade Java version from react-native/android to Java 17 (#23066) 0892c23463 Update Qnn SDK default version to 2.30 (#23411) 94c099bcec Fix type cast build error (#23423) d633e571d1 [WebNN EP] Fix AddInitializersToSkip issues (#23354) e988ef00e2 [QNN EP] Fix regression for MatMul with two quantized/dynamic uint16 inputs (#23419) 7538795f6b Update onnxruntime binary size checks ci pipeline's docker image (#23405) 6c5ea41cad Revert "[QNN EP] Clean up correctly from a partial setup (#23320)" (#23420) e866804bbe Enable comprehension simplification in ruff rules (#23414) 0a5f1f392c bugfix: string_view of invalid memory (#23417) 4cc38e0277 fix crash when first input of BatchNormalization is 1-D (#23387) 033441487f Target py310 and modernize codebase with ruff (#23401) 87341ac010 [QNN EP] Fix segfault when unregistering HTP shared memory handles (#23402) ### Motivation and Context This update includes the change to make QNN-EP a shared library. 
--------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Changming Sun <chasun@microsoft.com> Co-authored-by: Peishen Yan <peishen.yan@intel.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Hector Li <hecli@microsoft.com> Co-authored-by: Jian Chen <cjian@microsoft.com> Co-authored-by: Alexis Tsogias <1114095+Zyrin@users.noreply.github.com> Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: sushraja-msft <44513542+sushraja-msft@users.noreply.github.com> Co-authored-by: Wanming Lin <wanming.lin@intel.com> Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com> Co-authored-by: Caroline Zhu <wolfivyaura@gmail.com>	2025-01-23 09:12:03 -08:00
Ashrit Shetty	df873177eb	Update win-ort-main to tip main 250116 (#23398 ) ### Description This PR is to update the win-ort-main branch to the tip main branch as of 2025-01-16. ### Motivation and Context This update includes the OpenVino fix for debug builds. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: Liqun Fu <liqun.fu@microsoft.com> Signed-off-by: Junze Wu <junze.wu@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com> Co-authored-by: Yueqing Zhang <yuz75@Pitt.edu> Co-authored-by: amancini-N <63410090+amancini-N@users.noreply.github.com> Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com> Co-authored-by: liqun Fu <liqfu@microsoft.com> Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com> Co-authored-by: yf711 <yifanl@microsoft.com> Co-authored-by: Wanming Lin <wanming.lin@intel.com> Co-authored-by: wejoncy <wejoncy@163.com> Co-authored-by: wejoncy <wejoncy@.com> Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: Changming Sun <chasun@microsoft.com> Co-authored-by: Jean-Michaël Celerier <jeanmichael.celerier+github@gmail.com> Co-authored-by: Dmitry Deshevoy <mityada@gmail.com> Co-authored-by: xhcao <xinghua.cao@intel.com> Co-authored-by: Yueqing Zhang <yueqingz@amd.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com> Co-authored-by: Wu, Junze <junze.wu@intel.com> Co-authored-by: Jian Chen <cjian@microsoft.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Matthieu Darbois <mayeut@users.noreply.github.com> Co-authored-by: Prathik Rao <prathik.rao@gmail.com> Co-authored-by: wonchung-microsoft <wonchung@microsoft.com> Co-authored-by: Vincent Wang <wangwchpku@outlook.com> Co-authored-by: PARK DongHa <luncliff@gmail.com> Co-authored-by: Hector Li 
<hecli@microsoft.com> Co-authored-by: Sam Webster <13457618+samwebster@users.noreply.github.com> Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com> Co-authored-by: Corentin Maravat <101636442+cocotdf@users.noreply.github.com> Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jie Chen <jie.a.chen@intel.com> Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com> Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Co-authored-by: Ted Themistokleous <107195283+TedThemistokleous@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Artur Wojcik <artur.wojcik@outlook.com> Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com> Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com> Co-authored-by: ikalinic <ilija.kalinic@amd.com> Co-authored-by: sstamenk <sstamenk@amd.com> Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com> Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>	2025-01-16 15:20:25 -08:00
mingyue	4aca8f33df	[Bug Fix] Missing CustomOp SchemaRegister when generator EPContext ONNX model (#23091 ) ### Description Enhancements to EPContext Operations: 1. Introduced support for the bfloat16 data type in EPContext operations. 2. Bug Fix: Missing Custom OP Schema Registration when generating an EPContext ONNX model --------- Co-authored-by: mingyue <mingyue@xilinx.com> Co-authored-by: Hector Li <hecli@microsoft.com>	2024-12-19 16:47:13 -08:00
Tianlei Wu	5afab787db	Update python version metadata (remove 3.7, 3.8, 3.9; add 3.13). (#23067 ) ### Description * Update python version metadata to be in sync with latest python packages (onnxruntime, onnxruntime-gpu and onnxruntime-qnn). * Update black format target-version to 3.10, and use lintrunner to format all files. * Update the lintrunner installation command line to be consistent. * Include `requirements-lintrunner.txt` in `requirements-dev.txt` to avoid duplicated settings. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/22993 Python support by numpy: https://numpy.org/neps/nep-0029-deprecation_policy.html#drop-schedule ``` On Apr 05, 2024 drop support for Python 3.9 On Apr 04, 2025 drop support for Python 3.10 ```	2024-12-17 10:59:20 -08:00
Hector Li	401d16c671	Enable QNN HTP spill fill buffer setting to save RAM usage. (#22853 ) ### Description Enable the QNN HTP spill fill buffer setting to save RAM usage. This feature is available after QNN 2.28 and requires regenerating the QNN context binary. https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-backend-api Requirements: 1. Regenerate the ONNX model with the QNN context binary by setting the EP option enable_htp_spill_fill_buffer = 1. 2. Works for a model with multiple context binaries. The 2 ONNX models with context binaries need to be merged manually into 1 ONNX model. 3. Generating the context binary offline requires a Linux platform, since the QnnSystem lib is not available for the Windows x86_64 platform. Nothing extra is needed when running model inference. The generated EPContext node will have a max_size attribute with the maximum spill fill buffer size for the context binary. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-12-06 11:36:52 -08:00
Yulong Wang	7b0fa407eb	fix requirements.txt path (#22946 ) ### Description #22380 removed the file `tools/ci_build/github/linux/docker/inference/x86_64/python/cpu/scripts/requirements.txt`, but it is still used in `dockerfiles/Dockerfile.cuda`. This change updates the path to requirements.txt. Fixes #22945.	2024-12-04 13:08:29 -08:00
Xavier Dupré	a2ba3cb547	Implementation of TreeEnsemble ai.onnx.ml==5 (#22333 ) ### Description Merges PR #21851, #21222. Implements TreeEnsemble from ai.onnx.ml==5 (CPU). --------- Co-authored-by: Bilyana Indzheva <bilyana2002@gmail.com> Co-authored-by: Bilyana Indzheva <36890669+bili2002@users.noreply.github.com> Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>	2024-11-22 19:48:23 +01:00
Changming Sun	13346fdf18	Cleanup code (#22827 ) ### Description 1. Delete TVM EP because it is no longer maintained. 2. Delete ortmodule related docker files and scripts.	2024-11-19 14:13:33 -08:00
dtang317	12dfe2859c	Register groupnorm for opset 21 (#22830 ) ### Description This PR registers GroupNormalization for opset 21	2024-11-14 10:06:30 -08:00
dtang317	9836ef1c89	register Identity and QLinearMatmul for opset21 (#22804 ) ### Description This PR registers the following opset 21 operators: Identity-21 QLinearMatMul-21	2024-11-12 09:36:19 -08:00
Tianlei Wu	72186bbb71	[CUDA] Build nhwc ops by default (#22648 ) ### Description * Build cuda nhwc ops by default. * Deprecate `--enable_cuda_nhwc_ops` in build.py and add `--disable_cuda_nhwc_ops` option Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops will be disabled automatically. ### Motivation and Context In general, NHWC is faster than NCHW for convolution on Nvidia GPUs with Tensor Cores, and this could improve performance for vision models. This is the first step to prefer NHWC for CUDA in the 1.21 release. The next step is to run tests on popular vision models. If it helps on most models and devices, `prefer_nhwc=1` will be set as the default cuda provider option.	2024-11-06 09:54:55 -08:00
Tianlei Wu	ba22d7879a	[CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above (#22713 ) ### Description Based on https://github.com/microsoft/onnxruntime/pull/9700, extended to ArgMin as well. This pull request introduces several enhancements and fixes related to the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The changes ensure proper handling of these operators across different versions and improve kernel registration and fallback mechanisms. Key changes include: #### Enhancements to `ArgMax` and `ArgMin` Operators: * Added new kernel class registrations for `ArgMax` and `ArgMin` for different data types and versions in `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`. * Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle fallback to CPU when the `select_last_index` attribute is set to 1, as CUDA does not support this attribute. #### Macro and Kernel Registration Improvements: * Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with `REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and `REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version handling. * Updated kernel registration for `ArgMax` and `ArgMin` to use the new macros, ensuring proper version handling and support for different data types. #### Safety Checks: * Added safety checks in the `ArgMax` and `ArgMin` classes to ensure `select_last_index` is not set to 1, as it is not supported on CUDA. #### Testing Enhancements: * Added new tests for `ArgMax` and `ArgMin` operators to verify behavior when `select_last_index` is set to 0, ensuring compatibility with both CPU and CUDA execution providers. ### Motivation and Context Improve CUDA kernel coverage for stable diffusion model and hence improve its performance on CUDA	2024-11-06 09:54:32 -08:00
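The `select_last_index` behavior behind the CPU fallback described in this commit can be illustrated in plain Python. This is a minimal sketch, not ONNX Runtime code: `argmax_1d` and `needs_cpu_fallback` are hypothetical helpers, and the only facts taken from the commit message are that CUDA does not support `select_last_index = 1` and that such nodes fall back to the CPU.

```python
def argmax_1d(values, select_last_index=0):
    """Index of the maximum of a 1-D list.

    With ties, select_last_index=0 returns the first matching index and
    select_last_index=1 returns the last one (hypothetical helper).
    """
    best = max(values)
    indices = [i for i, v in enumerate(values) if v == best]
    return indices[-1] if select_last_index else indices[0]


def needs_cpu_fallback(select_last_index):
    """Hypothetical predicate: per the commit message, the CUDA kernel does
    not handle select_last_index == 1, so those nodes run on the CPU."""
    return select_last_index == 1


data = [1, 3, 3, 2]
print(argmax_1d(data, select_last_index=0))  # first of the tied maxima
print(argmax_1d(data, select_last_index=1))  # last of the tied maxima
print(needs_cpu_fallback(1))
```

The two settings differ only when the extreme value appears more than once, which is why the new tests can compare the CPU and CUDA providers with `select_last_index` set to 0.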
Tianlei Wu	120cb5a804	[Doc] Add I/O binding example using onnx data type in python API summary (#22695 ) ### Description Add I/O binding example using onnx data type in python API summary. The API is available since 1.20 release. ### Motivation and Context Follow up of https://github.com/microsoft/onnxruntime/pull/22306 to add some documentation.	2024-11-02 12:51:37 -07:00
dtang317	5b4e2a636b	DML EP Register Opset 21 (#22547 ) ### Description This PR registers the following opset 21 operators: - Size-21 - CastLike-21 - ConstantOfShape-21 - Flatten-21 - Pad-21 - Transpose-21 ### Motivation and Context	2024-10-25 09:21:19 -07:00
Hector Li	fc2be09386	Enable QLinearMatMul for opset21 (#22488 ) ### Description Enable QLinearMatMul for opset21	2024-10-22 14:33:36 -07:00
Akshay Sonawane	e5c2e50849	bumps up version in main from 1.20 -> 1.21 (#22482 ) Bump up version in main from 1.20.0 to 1.21.0 since the release branch has been cut.	2024-10-17 12:32:35 -07:00
mindest	1fa219d7d5	DecoderMaskedMultiHeadAttention CPU kernel. (#22292 ) ### Description DecoderMaskedMultiHeadAttention CPU kernel.	2024-10-12 13:43:00 -07:00
mindest	3c80aa9fee	Add CPU kernels for DynamicTimeWarping and UnfoldTensor. (#22033 ) ### Description Add CPU kernels for DynamicTimeWarping and UnfoldTensor.	2024-10-11 09:44:18 -07:00
kunal-vaishnavi	50bda44a70	Fix equation in MatMulNBits op spec (#22253 ) ### Description This PR fixes an equation in the MatMulNBits op spec. The old formula is stated as ``` [CeilDiv((N * n_blocks_per_col + 1) * bits, 8)] ``` but it should be stated as ``` [N * CeilDiv(n_blocks_per_col * bits, 8)] ``` or as ``` [N * FloorDiv((n_blocks_per_col + 1) * bits, 8)] ``` ### Motivation and Context For models such as ChatGLM where the column size is odd, the division math can be off. For example: ![image_360](https://github.com/user-attachments/assets/a5035bec-4dad-46af-9cb1-24a881eb70a0) With the old equation, the projections are calculated as follows. ``` # Down projection B = 4,096 x 107 x 64 zero_points = 221,184 N = 4,096 n_blocks_per_col = 107 4,096 * CeilDiv((107 + 1) * 4, 8) = 4,096 * CeilDiv(108 * 4, 8) = 4,096 * 54 = 221,184 # Up projection B = 13,696 x 32 x 64 zero_points = 219,136 N = 13,696 n_blocks_per_col = 32 13,696 * CeilDiv((32 + 1) * 4, 8) = 13,696 * CeilDiv(33 * 4, 8) = 13,696 * 17 = 232,832 ``` With the new equation, the projections are calculated as follows. ``` # Down projection B = 4,096 x 107 x 64 zero_points = 221,184 N = 4,096 n_blocks_per_col = 107 4,096 * CeilDiv(107 * 4, 8) = 4,096 * 54 = 221,184 # Up projection B = 13,696 x 32 x 64 zero_points= 219,136 N = 13,696 n_blocks_per_col = 32 13,696 * CeilDiv(32 * 4, 8) = 13,696 * 16 = 219,136 ```	2024-10-01 09:31:56 -07:00
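The arithmetic in this commit message can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the "old" form follows the worked example in the message (`N * CeilDiv((n_blocks_per_col + 1) * bits, 8)`), and `bits = 4` is taken from that example.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def zero_points_old(n: int, n_blocks_per_col: int, bits: int) -> int:
    # Old form as used in the worked example of the commit message.
    return n * ceil_div((n_blocks_per_col + 1) * bits, 8)


def zero_points_new(n: int, n_blocks_per_col: int, bits: int) -> int:
    # Corrected form: N * CeilDiv(n_blocks_per_col * bits, 8).
    return n * ceil_div(n_blocks_per_col * bits, 8)


# (name, N, n_blocks_per_col, expected zero_points size from the message)
cases = [
    ("down projection", 4096, 107, 221_184),
    ("up projection", 13_696, 32, 219_136),
]

for name, n, blocks, expected in cases:
    old = zero_points_old(n, blocks, bits=4)
    new = zero_points_new(n, blocks, bits=4)
    print(f"{name}: old={old}, new={new}, expected={expected}")
```

For the down projection both forms give 221,184. For the up projection the old form gives 232,832 while the corrected form gives the expected 219,136.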
Patrice Vignola	20be51525b	Support if node with sequence outputs (#22234 ) `If` nodes can have sequence outputs. Those nodes are mapped to the DML EP to be able to keep the outputs on the GPU, but they actually execute on the CPU by selecting either the `then` subgraph or the `else` subgraph.	2024-09-27 12:40:01 -07:00
amarin16	eb2506d77a	Add MLFloat16 support for LayerNormalization, SkipLayerNormalization (#22063 ) Add `MLFloat16` support for: - `LayerNormalization` - `SimplifiedLayerNormalization` - `SkipLayerNormalization` - `SkipSimplifiedLayerNormalization` There are existing `LayerNormTest` unit tests that cover the `MLFloat16` functionality for `LayerNormalization` once `MLFloat16` is registered (for example [`LayerNormTest.LayerNorm_Scale_Float16Input`](`91c916f9c6/onnxruntime/test/contrib_ops/layer_norm_op_test.cc (L112)`)). Similarly, there are unit tests such as [`SkipLayerNormTest.SkipLayerNormBatch1_Float16`](`91c916f9c6/onnxruntime/test/contrib_ops/skiplayernorm_op_test.cc (L255)`) that cover MLFloat16 inputs for `SkipLayerNormalization`.	2024-09-24 15:06:27 -07:00
Ye Wang	6cc06ad069	GQA MLFloat16 cpu (#22102 ) --------- Co-authored-by: Your Name <you@example.com>	2024-09-24 09:51:59 -07:00
Tianlei Wu	0806879ad4	Update lintrunner requirements (#22185 ) ### Description * Add lintrunner to requirements-lintrunner.txt * Lock lintrunner and lintrunner-adapter version * Update documentation ### Motivation and Context The document is not up to date.	2024-09-23 18:27:16 -07:00
Christian Bourjau	1a84f53c35	Make argmin/armax support identical data types and add int64 support (#21641 )	2024-09-23 13:02:29 -07:00
liqun Fu	a89bddd5c2	Matmul_nbits kernel for mlas sqnbits to support Fp16 inputs (#21807 )	2024-09-13 14:55:08 -07:00
aciddelgado	7e2c722459	Add Continuous Decoding support in GQA (#21523 ) ### Description This PR adds support for Continuous Decoding for batch_size = 1 input. From now on, GQA can take arbitrary-length input using seqlens_k as total_sequence_length - 1 and the sequence length of qkv as new_sequence_length. This change does not affect the default behavior of GQA. ### Motivation and Context Prior to this change, it was impossible to support sequence_length > 1 inputs when past context was given. This use case is essential to making continuous decoding work, which is one of our current efforts in ORT-GenAI.	2024-09-13 13:21:11 -07:00
aciddelgado	509cb54d6f	softcap gqa (#21683 ) ### Description Implement softcap for GQA. ### Motivation and Context Fixes certain models, such as Gemma-2, that need softcap so they don't output NaNs.	2024-08-30 19:11:04 -07:00
Jing Fang	5dee95fa10	[CUDA] Support CUDA EP blocked quantization in Q/DQ ops. (#21846 ) ### Description 1. Added CUDA EP support for blocked quantization in QuantizeLinear and DequantizeLinear ops. 2. Currently, CUDA EP blocked quantization only supports int4/uint4 quantized types and float32/float16 unquantized types. 3. Added CUDA EP support in the QDQ selector/action transformer. CUDA EP is only added to the DQ + MatMul -> MatMulNBits rule. EP support for the other rules is unchanged. ### Motivation and Context ONNX opset 21 introduced blocked quantization for Q/DQ ops. ORT originally supported blocked quantization only on the CPU EP.	2024-08-30 18:28:00 -07:00
Ye Wang	1d059b8702	Phi3 MoE cuda kernel (#21819 ) --------- Co-authored-by: Your Name <you@example.com>	2024-08-27 09:21:30 -07:00
Tianlei Wu	6e57576988	Support Smooth Softmax in GroupQueryAttention (#21867 ) ### Description Softmax (formula 1) is defined as follows: ```math y_{i} = \frac{exp(x_{i})}{\sum_{i} exp(x_{i})} ``` After applying softmax, each element will be in the range of $(0, 1)$, and the elements will add up to 1, so that they can be interpreted as probabilities. However, in language models, softmax has two issues: * When all elements are -inf (for example, a whole row is masked when a query token is padding), the result is not defined since exp(-inf)=0 and a division by zero is encountered in the above formula. * Why do we need to normalize in a way that treats every query word as equally important (each row sums to 1)? Smooth Softmax (formula 2) is a modified version that introduces a smooth factor as follows: ```math s_{i} = \frac{exp(x_{i})}{1+ \sum_{i} exp(x_{i})} ``` This formula can tackle the above two issues: * It handles the special case in which all elements are -inf: the result $s_{i}$ is 0 for every element in that case. * The sum of all elements $\sum_{i}{s_{i}} = \frac{\sum_{i}{exp(x_{i})}}{1+ \sum_{i} exp(x_{i})}$ is in the range of (0, 1), so that we can train the model to assign different importance to different query words. Since the exponential is prone to overflow or underflow, formula 3 can be used to get a stable result: ```math s_{i} = \frac{exp(x_{i} + c)}{exp(c)+ \sum_{i} exp(x_{i} +c)} ``` c can be any value in theory. In practice, the constant c should be chosen so that $exp(c)$ and $exp(x_{i} +c)$ do not overflow (or underflow) at the same time. A reasonable choice is formula 4: ```math c=-\max_{i} \{ x_i \} ``` or, applying the constraint c <= 0, formula 5: ```math c=-\max(0, \max_{i} \{ x_i \}) ``` The latter (formula 5) ensures that $s_{i}$ falls back to formula 2 when all elements are negative. For the CPU provider, smooth softmax is implemented in MLAS. The CPU implementation uses formula 5. 
@wangyems implemented the smooth softmax in flash attention for CUDA, which requires Ampere or newer GPU. The implementation of smooth softmax in flash attention uses formula 4. --------- Co-authored-by: Ye Wang	2024-08-26 23:13:15 -07:00
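Formulas 3 and 5 from this commit message can be combined into a short pure-Python sketch. It is only an illustration of the math, not the MLAS or flash attention implementation, and `smooth_softmax` is a hypothetical name.

```python
import math


def smooth_softmax(x):
    """Smooth softmax over a 1-D list, using formula 3 with the shift
    c = -max(0, max_i x_i) from formula 5 for numerical stability."""
    c = -max(0.0, max(x))
    exps = [math.exp(v + c) for v in x]
    denom = math.exp(c) + sum(exps)
    return [e / denom for e in exps]


neg_inf = float("-inf")

# A fully masked row yields all zeros instead of a division by zero.
print(smooth_softmax([neg_inf, neg_inf, neg_inf]))

# The outputs sum to less than 1, unlike ordinary softmax.
print(sum(smooth_softmax([1.0, 2.0, 3.0])))

# Large inputs do not overflow because of the shift by c.
print(smooth_softmax([1000.0, 1000.0]))
```

When every input is negative, `c` is 0 and the function reduces to formula 2, which matches the fallback behavior described for formula 5.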
Patrice Vignola	de6ebcbb54	[DML] Add int4 QDQ (#21592 )	2024-08-20 23:44:58 -07:00
Yi Zhang	9f7e19cedd	[Fix] Make python API doc generation in Microsoft-hosted Agent (#21766 ) ### Motivation and Context 1. The Python API doc needs to be merged from a fork, but the 1ES self-hosted pool works for only one GitHub repo. 2. ubuntu-latest installs numpy above 2.0 by default, which the current Python API doc generation doesn't support, so numpy is pinned to < 2.0.0.	2024-08-20 23:32:38 +08:00
Tianlei Wu	d79e3c5791	Extend Attention Bias Broadcast Support (#21710 ) ### Description Previously, MultiHeadAttention supported relative position bias of shape [1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention supported [1, N, S, T]. This change extends the support to allow [1, N, S, T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs. - [x] Rename the "relative position bias" input to "attention bias" because it can also be used for other types of bias, like ALiBi (Attention with Linear Biases) or attention mask. - [x] Update unfused kernel to support broadcasting the 2nd dimension of attention bias. - [x] Update efficient attention to support broadcasting the 2nd dimension of attention bias. - [x] Update operators (MultiHeadAttention, DecoderMaskedMultiHeadAttention, Attention, PackedAttention, PackedMultiHeadAttention) to support broadcast attention bias on CUDA and CPU EPs. - [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that those EPs do not support broadcasting attention_bias for now). - [x] Add attention bias tests for MultiHeadAttention. - [x] Update operator documents - [x] Update benchmark script Other changes: * Fix some checks in multihead-attention.ts * Add helper functions to dump tensors given dimensions.	2024-08-16 15:40:04 -07:00
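The four attention bias shapes listed in this commit message amount to allowing the first two dimensions to be either 1 or the full size. A minimal sketch of that rule follows; `attention_bias_shape_ok` is a hypothetical helper, not an ONNX Runtime function, and the letters B, N, S and T are used exactly as in the message.

```python
def attention_bias_shape_ok(bias_shape, b, n, s, t):
    """Return True if bias_shape is one of [1, N, S, T], [B, N, S, T],
    [B, 1, S, T] or [1, 1, S, T] for the given B, N, S and T."""
    if len(bias_shape) != 4:
        return False
    d0, d1, d2, d3 = bias_shape
    # The first two dimensions may be broadcast (size 1); the last two may not.
    return d0 in (1, b) and d1 in (1, n) and d2 == s and d3 == t


B, N, S, T = 2, 8, 16, 32
for shape in ([1, N, S, T], [B, N, S, T], [B, 1, S, T], [1, 1, S, T], [B, N, 1, T]):
    print(shape, attention_bias_shape_ok(shape, B, N, S, T))
```

Only the last shape in the loop is rejected, since the commit message does not list broadcasting over the third dimension.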
Yi Zhang	b92908e197	[Fix] Python API doc generation (#21717 ) ### Motivation and Context Make Python API doc generation workflow work. ### Verification Run https://github.com/microsoft/onnxruntime/actions/runs/10364762858	2024-08-14 08:48:29 +08:00
Jing Fang	f30581ed2c	[CPU EP] Add block quantized Gather contrib op (#21630 ) ### Description Add a gather that supports block-quantized input data. ### Motivation and Context To support Web inference scenario with quantized vocabulary embeddings.	2024-08-09 12:15:11 -07:00
Edward Chen	a5ce65d87a	Clean up some mobile package related files and their usages. (#21606 ) The mobile packages have been removed.	2024-08-05 16:38:20 -07:00
Prathik Rao	134f47743e	bumps up version in main from 1.19 -> 1.20 (#21588 ) Bump up version in main from 1.19.0 to 1.20.0 since the release branch has been cut.	2024-08-05 15:46:04 -07:00
Atanas Dimitrov	d0a6f57d74	Add reduce kernels for bigger types (#21490 )	2024-08-01 12:21:16 -07:00
Yi-Hong Lyu	530a2d7b41	Enable FP16 Clip and Handle Bias in FP16 Depthwise Conv (#21493 ) - Improved accuracy for face-detection, image-classification, and object-detection in the GeekBench ML benchmark on ARM64. - Fixed issue https://github.com/microsoft/onnxruntime/issues/18992	2024-07-30 03:49:14 -07:00
aamajumder	166809425e	[DML EP] Register ReduceMin-20 (#20477 ) ### Description This PR registers the ReduceMin-20 operator to the DML EP.	2024-07-25 17:06:30 -07:00
Preetha Veeramalai	ca47f0fdd3	OVEP - PR 1.19 (#21443 ) ### Description Add OVEP features for 1.19 The PR has, - Added support for EpCtx with ORT Session options for optimized performance. - Added bug fixes - Support for OV 2024.3 --------- Co-authored-by: ubuntu <ubuntu@ubuntu-mtlp-118727.iind.intel.com> Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: Maheshkar <ankit.maheshkar@intel.com>	2024-07-24 23:45:31 -07:00
Sheil Kumar	dd010edb37	Update DirectML from 1.14.1 to 1.15.0 (#21323 ) Update DirectML from 1.14.1 to 1.15.0 --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com> Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2024-07-22 16:59:03 -07:00
Prathik Rao	11ad299451	Adds ATen fallback for scaled_dot_product_attention (#21107 ) ### Description Introduces an ATen fallback for `torch.nn.functional.scaled_dot_product_attention`. This operator was introduced in torch 2.0 and, since then, has had many updates including the implementation of memory efficient attention for V100 machines. The current torchscript exporter exports a subgraph for attention which does not provide the same memory savings that PyTorch's memory efficient attention kernel provides. Allowing fallback to PyTorch ATen op for attention helps mitigate memory spike issues for models leveraging memory efficient attention. ### Motivation and Context Memory issues arose when integrating ONNX Runtime Training with AML Stable Diffusion. --------- Co-authored-by: root <prathikrao@microsoft.com>	2024-07-22 16:37:04 -07:00
mindest	5b9369e93c	Fix typos according to reviewdog report. (#21335 ) ### Description Fix typos based on reviewdog report but with some exceptions/corrections.	2024-07-22 13:37:32 -07:00
Tianlei Wu	7d9b12a2e3	[CPU] SparseAttention op (#21110 ) Add SparseAttention cpu implementation. - [x] Refactoring GQAAttentionBase - [x] Add SparseAttention implementation - [x] Add test cases This is unfused version. Flash attention version will be added later.	2024-07-03 21:51:57 -07:00
Xavier Dupré	c501c6ffaf	Rename a mispelled filename in the documentation (#21066 ) ### Description Rename a file in the documentation	2024-06-17 18:18:41 +02:00
Frank Dong	8aa2667ae6	add bf16 for Tile CUDA executor (#20854 ) ### Description add bf16 for Tile CUDA executor ### Motivation and Context required change to support phimm model for ORT training	2024-06-17 05:52:13 -07:00
zkep	7313accd44	Update Dockerfile.cuda (#21042 )	2024-06-13 23:50:03 -07:00
wejoncy	bd61ae530b	relax seq len checking in rotary_emb (#20778 ) ### Description Length checking is even more strict for packed batching input. There are two cases for a batch of input_ids. - padded sequences, where all inputs have equal length. ``` |----******| |------------| |--------| |-*********| ``` - packed sequences, where the input_ids have different lengths `|----|---------|----|-|` The max_seq_length comes either from graph_inputs or from the position_ids. In most cases, the max_seq_length of rotary_cache is cached in the model and shared among all layers. --------- Co-authored-by: kailums <kalu@microsoft.com>	2024-06-08 18:39:06 +08:00
Scott McKay	3ecf48e3b5	Add support for Trilu<bool>. (#20917 ) ### Description Trilu<bool> is used by phi-3 when exported with torch.onnx.export.	2024-06-06 15:21:34 +10:00


746 commits