Commit graph

746 commits

Author SHA1 Message Date
Ashrit Shetty
4b5b5f7101
Update win-ort-main to tip main 250123 (#23473)
### Description
This PR is to update the win-ort-main branch to the tip main branch as
of 2025-01-23.

### PR List
ddf0d377a7 [QNN EP] Add LoggingManager::HasDefaultLogger() to provider
bridge API (#23467)
05fbbdf91f [QNN EP] Make QNN EP a shared library (#23120)
1336566d7f Add custom vcpkg ports (#23456)
2e1173c411 Update the compile flags for vcpkg packages (#23455)
1f628a9858 [Mobile] Add BrowserStack Android MAUI Test (#23383)
009cae0ec8 [js/webgpu] Optimize ConvTranspose (Continue) (#23429)
04a4a694cb Use onnx_protobuf.h to suppress some GCC warnings (#23453)
2e3b62b4b0 Suppress some strict-aliasing related warnings in WebGPU EP
(#23454)
b708f9b1dc Bump ruff from 0.9.1 to 0.9.2 (#23427)
c0afc66b2a [WebNN] Remove workarounds for TFLite backend (#23406)
8a821ff7f9 Bump vite from 6.0.7 to 6.0.11 in
/js/web/test/e2e/exports/testcases/vite-default (#23446)
220c1a203e Make ORT and Dawn use the same protobuf/abseil source code
(#23447)
b7b5792147 Change MacOS-13 to ubuntu on for
android-java-api-aar-test.yml. (#23444)
19d0d2a30f WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP
(#23365)
95b8effbc4 [QNN EP]: Clean up QNN logging resources if an error occurs
during initialization (#23435)
626134c5b5 Bump clang-format from 19.1.6 to 19.1.7 (#23428)
0cf975301f Fix eigen external deps (#23439)
f9440aedce Moving RN_CI Android Testing to Linux (#23422)
1aa5902ff4 [QNN EP] workaround for QNN validation bug for Tanh with
uint16 quantized output (#23432)
7f5582a0e2 Seperate RN andriod and IOS into 2 separated Stages. (#23400)
73deac2e7f Implement some missing element wise Add/Sub/Mul/Div/Neg
operations for CPU and CUDA EPs (#23090)
949fe42af4 Upgrade Java version from react-native/android to Java 17
(#23066)
0892c23463 Update Qnn SDK default version to 2.30 (#23411)
94c099bcec Fix type cast build error (#23423)
d633e571d1 [WebNN EP] Fix AddInitializersToSkip issues (#23354)
e988ef00e2 [QNN EP] Fix regression for MatMul with two quantized/dynamic
uint16 inputs (#23419)
7538795f6b Update onnxruntime binary size checks ci pipeline's docker
image (#23405)
6c5ea41cad Revert "[QNN EP] Clean up correctly from a partial setup
(#23320)" (#23420)
e866804bbe Enable comprehension simplification in ruff rules (#23414)
0a5f1f392c bugfix: string_view of invalid memory (#23417)
4cc38e0277 fix crash when first input of BatchNormalization is 1-D
(#23387)
033441487f Target py310 and modernize codebase with ruff (#23401)
87341ac010 [QNN EP] Fix segfault when unregistering HTP shared memory
handles (#23402)

### Motivation and Context
This update includes the change to make QNN-EP a shared library.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Peishen Yan <peishen.yan@intel.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Alexis Tsogias <1114095+Zyrin@users.noreply.github.com>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sushraja-msft <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Caroline Zhu <wolfivyaura@gmail.com>
2025-01-23 09:12:03 -08:00
Ashrit Shetty
df873177eb
Update win-ort-main to tip main 250116 (#23398)
### Description
This PR is to update the win-ort-main branch to the tip main
branch as of 2025-01-16.

### Motivation and Context
This update includes the OpenVino fix for debug builds.

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: Junze Wu <junze.wu@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Yueqing Zhang <yuz75@Pitt.edu>
Co-authored-by: amancini-N <63410090+amancini-N@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: liqun Fu <liqfu@microsoft.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com>
Co-authored-by: yf711 <yifanl@microsoft.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: wejoncy <wejoncy@163.com>
Co-authored-by: wejoncy <wejoncy@.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Jean-Michaël Celerier <jeanmichael.celerier+github@gmail.com>
Co-authored-by: Dmitry Deshevoy <mityada@gmail.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Wu, Junze <junze.wu@intel.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Matthieu Darbois <mayeut@users.noreply.github.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: wonchung-microsoft <wonchung@microsoft.com>
Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: PARK DongHa <luncliff@gmail.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Sam Webster <13457618+samwebster@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: Corentin Maravat <101636442+cocotdf@users.noreply.github.com>
Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Ted Themistokleous <107195283+TedThemistokleous@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@outlook.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Co-authored-by: ikalinic <ilija.kalinic@amd.com>
Co-authored-by: sstamenk <sstamenk@amd.com>
Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
2025-01-16 15:20:25 -08:00
mingyue
4aca8f33df
[Bug Fix] Missing CustomOp SchemaRegister when generator EPContext ONNX model (#23091)
### Description
Enhancements to EPContext Operations:
1. Introduced support for the bfloat16 data type in EPContext operations.
2. Bug Fix: Missing Custom OP Schema Registration when generating the EPContext ONNX model

---------

Co-authored-by: mingyue <mingyue@xilinx.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
2024-12-19 16:47:13 -08:00
Tianlei Wu
5afab787db
Update python version metadata (remove 3.7, 3.8, 3.9; add 3.13). (#23067)
### Description

* Update python version metadata to be in sync with latest python
packages (onnxruntime, onnxruntime-gpu and onnxruntime-qnn).
* Update black format target-version to 3.10, and use lintrunner to
format all files.
* Update the lintrunner installation command line to be consistent.
* Include `requirements-lintrunner.txt` in `requirements-dev.txt` to
avoid duplicated settings.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/22993

Python support by numpy:
https://numpy.org/neps/nep-0029-deprecation_policy.html#drop-schedule
```
On Apr 05, 2024 drop support for Python 3.9
On Apr 04, 2025 drop support for Python 3.10
```
2024-12-17 10:59:20 -08:00
Hector Li
401d16c671
Enable QNN HTP spill fill buffer setting to save RAM usage. (#22853)
### Description
Enable QNN HTP spill fill buffer setting to save RAM usage.
This feature is available after QNN 2.28. The QNN context binary needs
to be re-generated.

https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-backend-api

Requirements:
1. Re-generate the ONNX model with the QNN context binary by setting the
EP option enable_htp_spill_fill_buffer = 1.
2. Works for a model with multiple context binaries. The 2 ONNX models
with context binaries need to be merged manually into 1 ONNX model.
3. Generating the context binary offline requires a Linux platform, since
the QnnSystem lib is not available for the Windows x86_64 platform.
Nothing extra is needed when running model inference.
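
As a rough sketch of requirement 1, the options could be assembled as below. Only `enable_htp_spill_fill_buffer` comes from this commit message; the `backend_path` provider option and the `ep.context_enable` / `ep.context_file_path` session config keys are assumed from the ONNX Runtime QNN EP documentation, and the helper name and file names are hypothetical.

```python
def qnn_spill_fill_options(context_model_path, backend_path="QnnHtp.dll"):
    """Build option dicts for generating a context-binary model with spill fill enabled.

    Hypothetical helper: key names other than enable_htp_spill_fill_buffer
    are assumptions, not taken from this commit.
    """
    provider_options = {
        "backend_path": backend_path,            # assumed QNN EP option
        "enable_htp_spill_fill_buffer": "1",     # option named in the commit message
    }
    session_config = {
        "ep.context_enable": "1",                # assumed: dump an EPContext model
        "ep.context_file_path": context_model_path,
    }
    return provider_options, session_config


provider_options, session_config = qnn_spill_fill_options("model_ctx.onnx")

# With onnxruntime installed, these would be applied roughly as:
#   so = onnxruntime.SessionOptions()
#   for key, value in session_config.items():
#       so.add_session_config_entry(key, value)
#   onnxruntime.InferenceSession(
#       "model.onnx", so,
#       providers=[("QNNExecutionProvider", provider_options)])
```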

The generated EPContext node will have a max_size attribute with the
maximum spill fill buffer size for the context binary
<img width="353" alt="image"
src="https://github.com/user-attachments/assets/a3bf48be-a8da-4381-8a1d-3f2558eea37d">

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-12-06 11:36:52 -08:00
Yulong Wang
7b0fa407eb
fix requirements.txt path (#22946)
### Description

#22380 removes the file
`tools/ci_build/github/linux/docker/inference/x86_64/python/cpu/scripts/requirements.txt`
but it is still used in `dockerfiles/Dockerfile.cuda`.

This change updates the file path of the requirements.txt

fixes #22945.
2024-12-04 13:08:29 -08:00
Xavier Dupré
a2ba3cb547
Implementation of TreeEnsemble ai.onnx.ml==5 (#22333)
### Description
Merges PR #21851, #21222.

Implements TreeEnsemble from ai.onnx.ml==5 (CPU).

---------

Co-authored-by: Bilyana Indzheva <bilyana2002@gmail.com>
Co-authored-by: Bilyana Indzheva <36890669+bili2002@users.noreply.github.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
2024-11-22 19:48:23 +01:00
Changming Sun
13346fdf18
Cleanup code (#22827)
### Description
1.  Delete TVM EP because it is no longer maintained.
2.  Delete ortmodule related docker files and scripts.
2024-11-19 14:13:33 -08:00
dtang317
12dfe2859c
Register groupnorm for opset 21 (#22830)
### Description
This PR registers GroupNormalization for opset 21



### Motivation and Context
2024-11-14 10:06:30 -08:00
dtang317
9836ef1c89
register Identity and QLinearMatmul for opset21 (#22804)
### Description
This PR registers the following opset 21 operators:

Identity-21
QLinearMatMul-21



### Motivation and Context
2024-11-12 09:36:19 -08:00
Tianlei Wu
72186bbb71
[CUDA] Build nhwc ops by default (#22648)
### Description

* Build cuda nhwc ops by default.
* Deprecate `--enable_cuda_nhwc_ops` in build.py and add
`--disable_cuda_nhwc_ops` option

Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops
will be disabled automatically.

### Motivation and Context

In general, NHWC is faster than NCHW for convolution in Nvidia GPUs with
Tensor Cores, and this could improve performance for vision models.

This is the first step toward preferring NHWC for CUDA in the 1.21 release.
The next step is to run tests on popular vision models. If it helps in most
models and devices, set `prefer_nhwc=1` as the default cuda provider option.
2024-11-06 09:54:55 -08:00
Tianlei Wu
ba22d7879a
[CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above (#22713)
### Description
Based on https://github.com/microsoft/onnxruntime/pull/9700, and extend
it to ArgMin as well.

This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.

Key changes include:

#### Enhancements to `ArgMax` and `ArgMin` Operators:

* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.

* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle
fallback to CPU when the `select_last_index` attribute is set to 1, as
CUDA does not support this attribute.
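
To illustrate what `select_last_index` changes (and so why it triggers the CPU fallback), here is a minimal NumPy reference sketch. It is not the ORT kernel; the helper name is hypothetical and only the tie-breaking behaviour is modelled.

```python
import numpy as np


def argmax_select_last(x, axis=0, select_last_index=False):
    """Reference ArgMax along `axis`; on ties, pick the first or the last index."""
    if not select_last_index:
        # np.argmax already returns the first occurrence of the maximum.
        return np.argmax(x, axis=axis)
    # Reverse along the axis, take the first maximum, then map the index back.
    flipped = np.flip(x, axis=axis)
    return x.shape[axis] - 1 - np.argmax(flipped, axis=axis)


x = np.array([[1, 5, 5],
              [7, 2, 7]])
first = argmax_select_last(x, axis=1)                          # ties -> first index
last = argmax_select_last(x, axis=1, select_last_index=True)   # ties -> last index
```

The two results differ only on rows that contain a repeated maximum, which is the case the CUDA kernels do not cover.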

#### Macro and Kernel Registration Improvements:

* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.

* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.

#### Safety Checks:

* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.

#### Testing Enhancements:

* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.

### Motivation and Context
Improve CUDA kernel coverage for stable diffusion model and hence
improve its performance on CUDA
2024-11-06 09:54:32 -08:00
Tianlei Wu
120cb5a804
[Doc] Add I/O binding example using onnx data type in python API summary (#22695)
### Description

Add I/O binding example using onnx data type in python API summary. The
API is available since 1.20 release.

### Motivation and Context

Follow up of https://github.com/microsoft/onnxruntime/pull/22306 to add
some documentation.
2024-11-02 12:51:37 -07:00
dtang317
5b4e2a636b
DML EP Register Opset 21 (#22547)
### Description
This PR registers the following opset 21 operators:
- Size-21
- CastLike-21
- ConstantOfShape-21
- Flatten-21
- Pad-21
- Transpose-21



### Motivation and Context
2024-10-25 09:21:19 -07:00
Hector Li
fc2be09386
Enable QLinearMatMul for opset21 (#22488)
### Description
Enable QLinearMatMul for opset21
2024-10-22 14:33:36 -07:00
Akshay Sonawane
e5c2e50849
bumps up version in main from 1.20 -> 1.21 (#22482)
Bump up version in main from 1.20.0 to 1.21.0 since the release branch
has been cut.
2024-10-17 12:32:35 -07:00
mindest
1fa219d7d5
DecoderMaskedMultiHeadAttention CPU kernel. (#22292)
### Description
DecoderMaskedMultiHeadAttention CPU kernel.
2024-10-12 13:43:00 -07:00
mindest
3c80aa9fee
Add CPU kernels for DynamicTimeWarping and UnfoldTensor. (#22033)
### Description
Add CPU kernels for DynamicTimeWarping and UnfoldTensor.
2024-10-11 09:44:18 -07:00
kunal-vaishnavi
50bda44a70
Fix equation in MatMulNBits op spec (#22253)
### Description
This PR fixes an equation in the MatMulNBits op spec. The old formula is
stated as

```
[CeilDiv((N * n_blocks_per_col + 1) * bits, 8)]
```

but it should be stated as

```
[N * CeilDiv(n_blocks_per_col * bits, 8)]
```

or as

```
[N * FloorDiv((n_blocks_per_col + 1) * bits, 8)]
```

### Motivation and Context
For models such as ChatGLM where the column size is odd, the division
math can be off. For example:


![image_360](https://github.com/user-attachments/assets/a5035bec-4dad-46af-9cb1-24a881eb70a0)

With the old equation, the projections are calculated as follows.

```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
 
4,096 * CeilDiv((107 + 1) * 4, 8) = 4,096 * CeilDiv(108 * 4, 8) = 4,096 * 54 = 221,184

# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
 
13,696 * CeilDiv((32 + 1) * 4, 8) = 13,696 * CeilDiv(33 * 4, 8) = 13,696 * 17 = 232,832
```

With the new equation, the projections are calculated as follows.

```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
 
4,096 * CeilDiv(107 * 4, 8) = 4,096 * 54 = 221,184

# Up projection
B = 13,696 x 32 x 64
zero_points= 219,136
N = 13,696
n_blocks_per_col = 32
 
13,696 * CeilDiv(32 * 4, 8) = 13,696 * 16 = 219,136
```
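
The arithmetic above can be reproduced with a short sketch. The function names are hypothetical; the "old" variant follows the expression as it is evaluated in the worked numbers (`N * CeilDiv((n_blocks_per_col + 1) * bits, 8)`), and `bits = 4` is taken from those numbers.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def zero_points_size_old(n, n_blocks_per_col, bits):
    # Old expression as used in the worked example above.
    return n * ceil_div((n_blocks_per_col + 1) * bits, 8)


def zero_points_size_new(n, n_blocks_per_col, bits):
    # Corrected expression: N * CeilDiv(n_blocks_per_col * bits, 8).
    return n * ceil_div(n_blocks_per_col * bits, 8)


down_old = zero_points_size_old(4096, 107, 4)
down_new = zero_points_size_new(4096, 107, 4)
up_old = zero_points_size_old(13696, 32, 4)
up_new = zero_points_size_new(13696, 32, 4)
```

Both expressions agree for the down projection (odd `n_blocks_per_col`), while only the new one matches the actual `zero_points` size for the up projection.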
2024-10-01 09:31:56 -07:00
Patrice Vignola
20be51525b
Support if node with sequence outputs (#22234)
`If` nodes can have sequence outputs. Those nodes are mapped to the DML
EP to be able to keep the outputs on the GPU, but they actually execute
on the CPU by selecting either the `then` subgraph or the `else`
subgraph.
2024-09-27 12:40:01 -07:00
amarin16
eb2506d77a
Add MLFloat16 support for LayerNormalization, SkipLayerNormalization (#22063)
Add `MLFloat16` support for:
- `LayerNormalization`
- `SimplifiedLayerNormalization`
- `SkipLayerNormalization`
- `SkipSimplifiedLayerNormalization`

There are existing `LayerNormTest` unit tests that cover the `MLFloat16`
functionality for `LayerNormalization` once `MLFloat16` is registered
(for example
[`LayerNormTest.LayerNorm_Scale_Float16Input`](91c916f9c6/onnxruntime/test/contrib_ops/layer_norm_op_test.cc (L112))).

Similarly, there are unit tests such as
[`SkipLayerNormTest.SkipLayerNormBatch1_Float16`](91c916f9c6/onnxruntime/test/contrib_ops/skiplayernorm_op_test.cc (L255))
that cover MLFloat16 inputs for `SkipLayerNormalization`.
2024-09-24 15:06:27 -07:00
Ye Wang
6cc06ad069
GQA MLFloat16 cpu (#22102)
### Description


### Motivation and Context

---------

Co-authored-by: Your Name <you@example.com>
2024-09-24 09:51:59 -07:00
Tianlei Wu
0806879ad4
Update lintrunner requirements (#22185)
### Description
* Add lintrunner to requirements-lintrunner.txt
* Lock lintrunner and lintrunner-adapter version
* Update documentation

### Motivation and Context
The document is not up to date.
2024-09-23 18:27:16 -07:00
Christian Bourjau
1a84f53c35
Make argmin/armax support identical data types and add int64 support (#21641) 2024-09-23 13:02:29 -07:00
liqun Fu
a89bddd5c2
Matmul_nbits kernel for mlas sqnbits to support Fp16 inputs (#21807) 2024-09-13 14:55:08 -07:00
aciddelgado
7e2c722459
Add Continuous Decoding support in GQA (#21523)
### Description
This PR will add support for Continuous Decoding for batch_size = 1
input. From now on, GQA can take arbitrary length input using seqlens_k
as total_sequence_length - 1 and the sequence length of qkv as
new_sequence_length.

**This change will not affect the default behavior of GQA**



### Motivation and Context
Prior to this change it was impossible to support sequence_length > 1
inputs when past context was given. This use case is essential to making
continuous decoding work, which is one of our current efforts in
ORT-GenAI.
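
A small sketch of the length bookkeeping described above, assuming that the total sequence length is the past (cached) length plus the new input length. That relation and the helper name are assumptions for illustration; only `seqlens_k = total_sequence_length - 1` comes from the description.

```python
def gqa_continuous_decoding_lengths(past_sequence_length, new_sequence_length):
    """Return the length values GQA would be fed for one continuous-decoding step."""
    # Assumption: total length = tokens already in the KV cache + new tokens.
    total_sequence_length = past_sequence_length + new_sequence_length
    return {
        "total_sequence_length": total_sequence_length,
        "seqlens_k": total_sequence_length - 1,      # as stated in the description
        "qkv_sequence_length": new_sequence_length,  # sequence length of q/k/v inputs
    }


# Ten cached tokens, then a four-token chunk appended in one call.
lengths = gqa_continuous_decoding_lengths(past_sequence_length=10, new_sequence_length=4)
```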
2024-09-13 13:21:11 -07:00
aciddelgado
509cb54d6f
softcap gqa (#21683)
### Description
Implement softcap for gqa.

### Motivation and Context
Fixes certain models like Gemma-2, which need softcap to work so that
they don't output NaNs.
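
For context, a minimal NumPy sketch of logit soft-capping, assuming the commonly used tanh form `softcap * tanh(scores / softcap)`. The commit message does not state the formula, so treat this as an illustration rather than the GQA kernel.

```python
import numpy as np


def softcap_scores(scores, softcap):
    """Squash attention scores smoothly into [-softcap, softcap] (assumed tanh form)."""
    scores = np.asarray(scores, dtype=np.float64)
    return softcap * np.tanh(scores / softcap)


raw = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
capped = softcap_scores(raw, softcap=50.0)
```

Small scores pass through almost unchanged, while extreme scores are bounded by the cap before softmax is applied.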
2024-08-30 19:11:04 -07:00
Jing Fang
5dee95fa10
[CUDA] Support CUDA EP blocked quantization in Q/DQ ops. (#21846)
### Description
1. Added CUDA EP support for blocked quantization in QuantizeLinear and
DequantizeLinear ops.
2. Currently CUDA EP blocked quantization only supports int4/uint4
quantized types and float32/float16 unquantized types.
3. Added CUDA EP support in QDQ selector/action transformer. CUDA EP is
only added to the DQ + MatMul -> MatMulNBits rule. EP support for the
other rules is not changed.



### Motivation and Context
ONNX opset 21 introduced blocked quantization for Q/DQ ops. ORT
originally supported blocked quantization only on the CPU EP.
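
As a reference for what blocked dequantization computes, here is a NumPy sketch in which each run of `block_size` elements along `axis` shares one scale and zero point. The helper name is hypothetical, and int4 values are held in an int8 array for simplicity.

```python
import numpy as np


def dequantize_blocked(q, scales, zero_points, block_size, axis=0):
    """Blocked DequantizeLinear reference: y = (q - zero_point) * scale per block."""
    n = q.shape[axis]
    keep = np.arange(n)
    # Expand per-block parameters to per-element, trimming a partial last block.
    s = np.take(np.repeat(scales, block_size, axis=axis), keep, axis=axis)
    z = np.take(np.repeat(zero_points, block_size, axis=axis), keep, axis=axis)
    return (q.astype(np.float32) - z.astype(np.float32)) * s.astype(np.float32)


q = np.array([[1, 2], [3, 4], [-1, -2], [7, -8]], dtype=np.int8)  # int4 range values
scales = np.array([[0.5, 1.0], [2.0, 0.25]], dtype=np.float32)    # one row per block
zero_points = np.zeros((2, 2), dtype=np.int8)
y = dequantize_blocked(q, scales, zero_points, block_size=2, axis=0)
```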
2024-08-30 18:28:00 -07:00
Ye Wang
1d059b8702
Phi3 MoE cuda kernel (#21819)
### Description



### Motivation and Context

---------

Co-authored-by: Your Name <you@example.com>
2024-08-27 09:21:30 -07:00
Tianlei Wu
6e57576988
Support Smooth Softmax in GroupQueryAttention (#21867)
### Description

Softmax (formula 1) is like the following:
```math
y_{i} = \frac{exp(x_{i})}{\sum_{i} exp(x_{i})}
```
After applying softmax, each element will be in the range of $(0, 1)$,
and the elements will add up to 1, so that they can be interpreted as
probabilities.

However, in language models, softmax has two issues:
* When all elements are -inf (for example, a whole row is masked when a
query token is padding), the result is not defined since exp(-inf)=0 and
a division by zero is encountered in the above formula.
* Why do we need to normalize in a way that treats each query word as
equally important (each row sums to 1)?

**Smooth Softmax** (formula 2) is a modified version that introduces a
smooth factor like the following:
```math
s_{i} = \frac{exp(x_{i})}{1+ \sum_{i} exp(x_{i})}
```

This formula could tackle the above two issues:
* It could handle the special case that all elements are -inf: the
result $s_{i}$ is 0 for every element in such case.
* Sum of all elements $\sum_{i}{s_{i}} = \frac{\sum_{i}{exp(x_{i})}}{1+
\sum_{i} exp(x_{i})}$ is in the range of (0, 1), so that we can train
the model to assign different importance to different query words.

Since exponential is prone to overflow or underflow, to get stable
result, formula 3 can be used:
```math
s_{i} = \frac{exp(x_{i} + c)}{exp(c)+ \sum_{i} exp(x_{i} +c)}
```
c can be any value in theory. In practice, the choice of constant c should
prevent $exp(c)$ and $exp(x_{i} +c)$ from overflowing (or underflowing) at
the same time. A reasonable choice is formula 4:
```math
c=-\max_{i} \{ x_i \}
```
or apply a constraint that c <= 0, as in formula 5:

```math
c=-\max(0, \max_{i} \{ x_i \})
```
The latter (formula 5) ensures that $s_{i}$ falls back to formula 2
when all elements are negative.

For CPU provider, smooth softmax is implemented in MLAS. CPU
implementation uses formula 5.

@wangyems implemented the smooth softmax in flash attention for CUDA,
which requires Ampere or newer GPU. The implementation of smooth softmax
in flash attention uses formula 4.
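
The following NumPy sketch shows formula 3 with the constant from formula 5. It is a reference illustration only, not the MLAS or flash attention implementation, and the function name is hypothetical.

```python
import numpy as np


def smooth_softmax(x):
    """Smooth softmax over the last axis: exp(x + c) / (exp(c) + sum(exp(x + c)))."""
    x = np.asarray(x, dtype=np.float64)
    # Formula 5: c = -max(0, max_i x_i), so c <= 0 and c stays finite for all -inf rows.
    c = -np.maximum(0.0, np.max(x, axis=-1, keepdims=True))
    e = np.exp(x + c)
    return e / (np.exp(c) + np.sum(e, axis=-1, keepdims=True))


masked_row = smooth_softmax([-np.inf, -np.inf, -np.inf])  # fully masked row
normal_row = smooth_softmax([1.0, 2.0, 3.0])
large_row = smooth_softmax([1000.0, 1000.0])              # would overflow in formula 2
```

With formula 5 a fully masked row yields all zeros instead of NaN, and each row sums to a value strictly below 1.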

---------

Co-authored-by: Ye Wang
2024-08-26 23:13:15 -07:00
Patrice Vignola
de6ebcbb54
[DML] Add int4 QDQ (#21592) 2024-08-20 23:44:58 -07:00
Yi Zhang
9f7e19cedd
[Fix] Make python API doc generation in Microsoft-hosted Agent (#21766)
### Description



### Motivation and Context
1. The Python API doc needs to be merged from a fork, but the 1ES
self-hosted pool is only for one GitHub repo.
2. ubuntu-latest installs numpy 2.0 or above by default, and the current
Python API doc generation doesn't support it, so numpy is pinned to
< 2.0.0.

---------
2024-08-20 23:32:38 +08:00
Tianlei Wu
d79e3c5791
Extend Attention Bias Broadcast Support (#21710)
### Description
Previously, MultiHeadAttention supported relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supported [1, N, S, T]. This change extends the support to allow [1, N, S,
T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs.

- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script

Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
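
The four accepted bias shapes correspond to ordinary broadcasting over the first two dimensions. A NumPy sketch follows, assuming B, N, S, T denote batch size, number of heads, query sequence length and total key length (the commit message does not define them); the helper name is hypothetical.

```python
import numpy as np


def add_attention_bias(scores, bias):
    """Add an attention bias to scores of shape [B, N, S, T], broadcasting dims 0 and 1."""
    B, N, S, T = scores.shape
    allowed = {(1, N, S, T), (B, N, S, T), (B, 1, S, T), (1, 1, S, T)}
    if bias.shape not in allowed:
        raise ValueError(f"unsupported attention bias shape {bias.shape}")
    return scores + bias  # NumPy broadcasting handles the size-1 dimensions


scores = np.zeros((2, 3, 4, 5), dtype=np.float32)
# A [B, 1, S, T] bias: one value per batch entry, shared by all heads.
per_batch_bias = np.arange(2, dtype=np.float32).reshape(2, 1, 1, 1) * np.ones(
    (2, 1, 4, 5), dtype=np.float32
)
out = add_attention_bias(scores, per_batch_bias)
```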
2024-08-16 15:40:04 -07:00
Yi Zhang
b92908e197
[Fix] Python API doc generation (#21717)
### Description



### Motivation and Context
Make Python API doc generation workflow work.

### Verification Run
https://github.com/microsoft/onnxruntime/actions/runs/10364762858
2024-08-14 08:48:29 +08:00
Jing Fang
f30581ed2c
[CPU EP] Add block quantized Gather contrib op (#21630)
### Description
Add a gather that supports block-quantized input data.


### Motivation and Context
To support Web inference scenario with quantized vocabulary embeddings.
2024-08-09 12:15:11 -07:00
Edward Chen
a5ce65d87a
Clean up some mobile package related files and their usages. (#21606)
The mobile packages have been removed.
2024-08-05 16:38:20 -07:00
Prathik Rao
134f47743e
bumps up version in main from 1.19 -> 1.20 (#21588)
Bump up version in main from 1.19.0 to 1.20.0 since the release branch
has been cut.
2024-08-05 15:46:04 -07:00
Atanas Dimitrov
d0a6f57d74
Add reduce kernels for bigger types (#21490) 2024-08-01 12:21:16 -07:00
Yi-Hong Lyu
530a2d7b41
Enable FP16 Clip and Handle Bias in FP16 Depthwise Conv (#21493)
- Improved accuracy for face-detection, image-classification, and
object-detection in the GeekBench ML benchmark on ARM64.
- Fixed issue https://github.com/microsoft/onnxruntime/issues/18992
2024-07-30 03:49:14 -07:00
aamajumder
166809425e
[DML EP] Register ReduceMin-20 (#20477)
### Description
This PR registers the ReduceMin-20 operator to the DML EP.


### Motivation and Context
2024-07-25 17:06:30 -07:00
Preetha Veeramalai
ca47f0fdd3
OVEP - PR 1.19 (#21443)
### Description
Add OVEP  features for 1.19 

The PR has,
- Added support for EpCtx with ORT Session options for optimized
performance.
- Added bug fixes
- Support for OV 2024.3

---------

Co-authored-by: ubuntu <ubuntu@ubuntu-mtlp-118727.iind.intel.com>
Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Maheshkar <ankit.maheshkar@intel.com>
2024-07-24 23:45:31 -07:00
Sheil Kumar
dd010edb37
Update DirectML from 1.14.1 to 1.15.0 (#21323)
Update DirectML from 1.14.1 to 1.15.0

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
2024-07-22 16:59:03 -07:00
Prathik Rao
11ad299451
Adds ATen fallback for scaled_dot_product_attention (#21107)
### Description

Introduces an ATen fallback for
`torch.nn.functional.scaled_dot_product_attention`. This operator was
introduced in torch 2.0 and, since then, has had many updates including
the implementation of memory efficient attention for V100 machines. The
current torchscript exporter exports a subgraph for attention which does
not provide the same memory savings that PyTorch's memory efficient
attention kernel provides. Allowing fallback to PyTorch ATen op for
attention helps mitigate memory spike issues for models leveraging
memory efficient attention.

### Motivation and Context

Memory issues arose when integrating ONNX Runtime Training with AML
Stable Diffusion.

---------

Co-authored-by: root <prathikrao@microsoft.com>
2024-07-22 16:37:04 -07:00
mindest
5b9369e93c
Fix typos according to reviewdog report. (#21335)
### Description
Fix typos based on reviewdog report but with some
exceptions/corrections.
2024-07-22 13:37:32 -07:00
Tianlei Wu
7d9b12a2e3
[CPU] SparseAttention op (#21110)
Add SparseAttention cpu implementation.
- [x] Refactoring GQAAttentionBase
- [x] Add SparseAttention implementation
- [x] Add test cases

This is the unfused version. A flash attention version will be added later.
2024-07-03 21:51:57 -07:00
Xavier Dupré
c501c6ffaf
Rename a mispelled filename in the documentation (#21066)
### Description
Rename a file in the documentation
2024-06-17 18:18:41 +02:00
Frank Dong
8aa2667ae6
add bf16 for Tile CUDA executor (#20854)
### Description
add bf16 for Tile CUDA executor



### Motivation and Context
required change to support phimm model for ORT training
2024-06-17 05:52:13 -07:00
zkep
7313accd44
Update Dockerfile.cuda (#21042) 2024-06-13 23:50:03 -07:00
wejoncy
bd61ae530b
relax seq len checking in rotary_emb (#20778)
### Description
Length checking is even more strict for packed batching input.
There are two cases for a batch of input_ids.
- padded seq with equal length of inputs. 
```
|----********|
|------------|
|--------****|
|-***********|
```
- packed seqs with different length of input_ids
`|----|---------|----|-|`

The max_seq_length comes either from graph_inputs or from the position_ids.
In most cases, however, the max_seq_length of rotary_cache is cached
in the model and shared among all layers.
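
To make the packed case concrete, here is a small sketch that derives position ids for the packed layout drawn above, assuming positions restart at 0 for every packed sequence. That convention and the helper name are assumptions, not taken from the commit.

```python
def packed_position_ids(lengths):
    """Position ids for sequences packed back to back, restarting at 0 for each one."""
    return [pos for length in lengths for pos in range(length)]


# Lengths read off the packed diagram: |----|---------|----|-|
lengths = [4, 9, 4, 1]
position_ids = packed_position_ids(lengths)

packed_length = len(position_ids)           # length of the packed input row
required_cache_len = max(position_ids) + 1  # rotary cache rows actually indexed
```

Under this assumption the packed row is longer than the number of rotary positions it touches, so a check that compares the input length against the cached max_seq_length is stricter than necessary.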

### Motivation and Context

---------

Co-authored-by: kailums <kalu@microsoft.com>
2024-06-08 18:39:06 +08:00
Scott McKay
3ecf48e3b5
Add support for Trilu<bool>. (#20917)
### Description
Trilu<bool> is used by phi-3 when exported with torch.onnx.export.

### Motivation and Context
2024-06-06 15:21:34 +10:00