### Description
<!-- Describe your changes. -->
Currently figuring out if the protobuf dependency is building protoc it
is a little obtuse and inconsistent
* in some places we directly set protobuf_BUILD_PROTOC_BINARIES to OFF
to indicate the protobuf dependency is not building protoc
* e.g. macOS/iOS/visionOS builds
* for a user provided protoc path we don't set
protobuf_BUILD_PROTOC_BINARIES, and inside protobuf_function.cmake that
determines if `protobuf::protoc` is added as a dependency or not
*
0dda8b0c44/cmake/external/protobuf_function.cmake (L40-L45)
To be more consistent/explicit, set protobuf_BUILD_PROTOC_BINARIES to
OFF when ONNX_CUSTOM_PROTOC_EXECUTABLE set and valid.
Remove outdated script that built and external protoc binary which was
used in later builds. The build setup will fetch a pre-built protoc so
there's no need for this additional build.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make it easier to figure out if protoc is coming from the protobuf
dependency.
### Description
There was a bug with gqa on cpu where on token case, with batch_size >
1, and with past_present_share_buffer off, the output would occasionally
contain nans. this pr fixes that. it also updates documentation and
fixes posid gen for rotary in cuda in prompt case.
### Motivation and Context
this pr solves the GQA CPU bug as well as updates the documentation and
makes seqlens_k irrelevant for prompt case, which is useful to prevent
user error.
Made some changes to the arm64x.cmake script to:
- handle edge case
- Enable Projects that include onnxruntime as submodule and build it, to
be able to build as x without causing onnxruntime build_as_x to fail.
### Description
<!-- Describe your changes. -->
Operator should not modify input tensors because they are managed by
framework and may be reused by other nodes.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
This PR supports profiling and tuning MoE Gemm kernels in the very first
run and store the best configuration to reuse in the following runs. The
Gemm id (the key to the config map, int64_t) is determined by num_rows,
gemm_n and gemm_k for each type.
First 32 bits are total_rows, next 16 bits are gemm_n, next 16 bits are
gemm_k
int64_t key = total_rows;
key = key << 16 | gemm_n;
key = key << 16 | gemm_k;
Mixtral-fp16 on 2 A100 with tp=2. batch size = 1, seq_len = 1k
| | Prompt | Token |
| :--- | :---: | ---: |
| before | 138ms | 16.4ms |
| after | 100ms | 13.9ms |
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Fix missing node during mem efficient topo sort
Some nodes are not cusumed by the backward path, they are also not
generating graph outputs. We missed those nodes, so this PR fix that and
add related tests.
A side note: we should remove those nodes that are not used for
computing any graph outputs in a graph transformer. (TODO)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
I misunderstood how UpdateCUDAProviderOptions and
UpdateTensorRTProviderOptions work in the C API, I had assumed that they
updated the options struct, however they re-initialize the struct to the
defaults then only apply the values in the update. I've rewritten the
Java bindings for those classes so that they aggregate all the updates
and apply them in one go. I also updated the C API documentation to note
that these classes have this behaviour. I've not checked if any of the
other providers with an options struct have this behaviour, we only
expose CUDA and TensorRT's options in Java.
There's a small unrelated update to add a private constructor to the
Fp16Conversions classes to remove a documentation warning (they
shouldn't be instantiated anyway as they are utility classes containing
static methods).
### Motivation and Context
Fixes#20544.
### Description
Follow up of https://github.com/microsoft/onnxruntime/pull/20216 to add
sparse attention kernel compiled by Triton for H100 (sm90).
- [x] Refine sparse attention v1 kernel compilation (remove some
combinations)
- [x] compile kernels for v1 kernels
- [x] compile kernels for H100
- [x] run performance tests
### Performane
Test setting `batch_size=4, num_heads=32, max_seq_len=8192,
head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8,
num_layout=8`
We compare sparse attention to corresponding GQA with local attention
windows size 1024, or GQA with dense causal. Note that ORT-GQA-Dense has
more computation than ORT-SparseAtt, while ORT-GQA-Local has less
computation (no vertial strides) than ORT-SparseAtt. They are added for
reference. It is not fair comparison, but could show the benefit of
sparsity vs dense.
Example results in Azure Standard_ND96isr_H100_v5 VM with NVIDIA
H100-80GB-HBM3 GPU (sm=90):
```
prompt-sm90-batch4-head32-d128-local16-vert8-torch.float16:
sequence_length TORCH-GQA ORT-GQA-Dense ORT-GQA-Local ORT-SparseAtt
0 16.0 0.079877 0.006362 0.006403 0.042758
1 32.0 0.086920 0.016404 0.016686 0.044183
2 64.0 0.090727 0.020429 0.020409 0.045343
3 128.0 0.128148 0.032009 0.031984 0.051516
4 256.0 0.323933 0.074110 0.073920 0.068308
5 512.0 1.021856 0.162167 0.161951 0.109226
6 1024.0 3.596002 0.452629 0.452780 0.231653
7 2048.0 13.865088 1.499534 1.195749 0.515488
8 4096.0 0.000000 5.454785 2.669682 1.163233
9 8192.0 0.000000 22.068159 6.018604 2.772873
token-sm90-batch4-head32-d128-local16-vert8-torch.float16:
past_sequence_length TORCH-GQA ORT-GQA-Dense ORT-GQA-Local ORT-SparseAtt
0 16.0 0.104460 0.012652 0.012661 0.069549
1 32.0 0.113866 0.012776 0.012765 0.069024
2 64.0 0.124600 0.016791 0.012672 0.069397
3 128.0 0.108658 0.017900 0.018294 0.074844
4 256.0 0.115463 0.029409 0.029608 0.078911
5 512.0 0.149824 0.033968 0.033701 0.092998
6 1024.0 0.234050 0.042930 0.042951 0.116920
7 2048.0 0.390695 0.061462 0.043008 0.121555
8 4096.0 0.000000 0.097505 0.042948 0.134757
9 8191.0 0.000000 0.165861 0.043542 0.158796
```
The following might be able to help performance on short sequence
length. Need update operator spec:
Fall back to flash attention when total_sequence length < local_blocks * block_size
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
As a follow-up of #20506
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
with past/present shared same buffer, the present seq length is
different with total sequence length. The size of cos/sin cache should
be checked with sequence length.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Adds support for float32/float16 HardSigmoid on HTP backend.
Decomposes `HardSigmoid(X)` into `max(0, min(1, alpha * X + beta))`.
- Fuses the sequence `X * HardSigmoid<alpha=1/6, beta=0.5>(X)` into a
single `HardSwish(x)`. Only applies to non-quantized HardSigmoid/Mul.
### Motivation and Context
QNN does not natively support HardSigmoid. These changes expand model
support on QNN EP.
### Description
<!-- Describe your changes. -->
This PR registers DFT-20 to the DML EP.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Updates QNN pipelines to use QNN SDK 2.21
- Downloads QNN SDK from Azure storage to avoid having to rebuild images
when a new version is released.
### Motivation and Context
Test with the latest QNN SDK.
### Description
Follow up of #20216 to add kernel for sm=75 (GPU like T4, Geforce RTX
2080, GeForce GTX 1650 Ti, NVIDIA TITAN RTX, RTX 4000 etc)
- [x] Add kernel for sm=75
- [x] Update dispatch code to use sm to call different kernel.
- [x] Update compile script to use num_stages=2 instead of 3 for sm=75
- [x] Refactor test script and add tests for bfloat16.
- [x] Fix performance test of token generation (previously we did not
concatenate past_key)
- [x] Fix debug build
- [x] Run performance test and update numbers.
For sm=70, the v1 kernel can be compiled but there is error in compiling
v2 kernel. So it is skipped in this pull request.
Performance Test on T4 GPU (using Standard_NC4as_T4_v3 Azure VM) with
`batch_size=4, num_heads=32, max_seq_len=8192, head_size=128,
sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8`
We compare sparse attention to corresponding GQA with dense causal. Note
that GQA with dense need more computation since no sparsity is used. The
TORCH-GQA use naive implementation (using cuSPARSE Block-SpMM could be
faster).
```
prompt-sm75-batch4-head32-d128-local16-vert8-torch.float16:
sequence_length TORCH-GQA ORT-GQA-Dense ORT-SparseAtt
1 32.0 0.184173 2.994347 0.089064
2 64.0 0.303300 3.023986 0.107418
3 128.0 0.887795 3.073728 0.174213
4 256.0 2.797654 3.246899 0.357869
5 512.0 10.055048 3.814039 0.893903
6 1024.0 37.849937 5.818439 2.658720
7 2048.0 148.641785 13.638480 7.202690
8 4096.0 OOM 43.556847 17.680954
9 8192.0 OOM 161.628540 44.336670
token-sm75-batch4-head32-d128-local16-vert8-torch.float16:
past_sequence_length TORCH-GQA ORT-GQA-Dense ORT-SparseAtt
1 32.0 0.110353 2.996305 0.137509
2 64.0 0.145088 3.006860 0.165424
3 128.0 0.219500 3.036448 0.192001
4 256.0 0.347496 3.071341 0.249125
5 512.0 0.595842 3.135225 0.398726
6 1024.0 1.081216 3.261110 0.612744
7 2048.0 2.060307 3.515578 0.685670
8 4096.0 OOM 4.022986 0.819707
9 8191.0 OOM 5.024528 1.072912
```
### Motivation and Context
To inference Phi-3-small in T4 GPU
### Description
<!-- Describe your changes. -->
This branch is based on rel-1.18.0 and supports TensorRT 10-GA.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
removing excess trailing semicolon from specific macro
### Motivation and Context
I am preparing automatic generation of onnxruntime bindings for perl,
and the parser (ucpp) has broken due to the "double semicolon" error in
the subsequent lines where the macro is applied.
Only define CPUIDInfo::pytorch_cpuinfo_init_ data member when CPUINFO_SUPPORTED is defined. It can cause unused variable warnings in some compilations.
Fix
onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637:
error: argument 'session' of command @param is not found in the argument
list of
```
OrtApi::AddExternalInitializersFromFilesInMemory(
OrtSessionOptions *options,
const char *const *external_initializer_file_names,
char *const *external_initializer_file_buffer_array,
const size_t *external_initializer_file_lengths,
size_t num_external_initializer_files)
```
### Description
Add CUDA implementation for block sparse attention for Phi-3-small.
Block sparse attention was proposed in [Sparse
Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also
adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with different
sparse layout.
In Phi-3-small, the sparse layout is static, and works with
unidirectional (causal) attention.
Compared to dense attention, the benefit of block sparse is to speed up
both training and inference. It could save memory thus support longer
context length.
- [x] Add operator spec and shape inference
- [x] Symbolic shape inference
- [x] Refactor GroupQueryAttention to expose common kernels for kv cache
concatenation, q/k/v transpose etc.
- [x] Add cuda kernel to convert block mask to CSR format
- [x] Add cuda kernel to generate position ids
- [x] Add compile script and template files to convert triton kernel to
cubin and dispatcher.
- [x] Add triton kernel v1 for prompt
- [x] Add triton kernel v2 for token generation and support padding
- [x] Update IO Binding Helper to allow buffer sharing.
- [x] Test relevance
- [x] Test performance
### Performance
Test in A100-SXM4-80GB with `batch_size=4, num_heads=32,
max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16,
vert_stride=8, num_layout=8`
We compare sparse attention to corresponding GQA with local attention
windows size 1024, or GQA with dense causal.
Average latency in milliseconds (for fused attention kernel used in
prompt prefilling):
seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748
Average latency in milliseconds (for fused attention kernel used in
token generation:
past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530 | 0.0592 | 0.0988
512 | 0.0445| 0.0447 | 0.1150
1024 | 0.0634 | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146
We can see that the kernel for token generation still have room to
improve.
#### Limitations
Only support right-side padding and unidirectional attention.
The following are not supported in the first version:
(1) Packed mode like PackedMultiHeadAttention where input has been
removed padding.
(2) paged attention.
(3) bidirectional attention.
(4) GPU compute capacity that is not 8.0, 8.6 and 8.9.
(5) Left side padding.
Some of these limitations will be removed in the future (may be in a new
operator).
Bump up version in main from 1.18.0 to 1.19.0 since the release branch
has been cut.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
https://github.com/microsoft/onnxruntime/pull/20418
Add back Catalyst changes only for now.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Distribute writing-to-output work over all threads in MatMulNBits.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Originally, Prelu in QNN will fail when the input is fp16 and alpha is fp32.
QNN requires alpha is fp16 when input is fp16.
This can be resolved by casting alpha to fp16 and pass it to QNN.
### Motivation and Context
Makes QNN Prelu support fp16 case.
---------
Co-authored-by: Hector Li <hecli@microsoft.com>
In CMakeLists.txt:set_msvc_c_cpp_compiler_warning_level(), the regex should match the value that gets added by the function. The latter got updated, so this change updates the former to match.
### Description
<!-- Describe your changes. -->
Update order of steps
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CI
### Description
<!-- Describe your changes. -->
[VitisAI] Solve the problem that gsl cannot be found when compiling
under linux
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
```
tvm_execution_provider.cc
denormal.cc
D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5): error C2660: 'onnxruntime::GraphViewerToProto': function does not take 4 arguments [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_providers_tvm.vcxproj]
D:\a\onnxruntime\onnxruntime\onnxruntime\core\graph\graph_proto_serializer.h(10,6):
see declaration of 'onnxruntime::GraphViewerToProto'
D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5):
while trying to match the argument list '(const onnxruntime::GraphViewer, onnx::GraphProto, bool, bool)'
cpuid_uarch.cc
get_execution_providers.cc
abi_session_options.cc
bias_dropout_fusion.cc
if.cc
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Perform computation in fp32 and convert finally to fp16.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Error:
**Artifact name input: e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss)
##[error]Artifact name is not valid:
e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss). It cannot contain '\', /',
"', ':', '<', '>', '|', '*', and '?'**
Date not correctly showing up in the artifact name. Use predefined
pipeline variable BuildNumber instead which also serves similarly as a
timestamp.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
RN CI failure
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
### Description
<!-- Describe your changes. -->
flatbuffers::String::c_str returns a pointer that may not be null
terminated.
This causes a warning when building on an A100 with gcc 11. Not clear
why other builds with gcc 11 (e.g. Ubuntu 22.04 WSL) don't generate a
warning. Either way it's safer to use str() as that constructs a
std::string with data() and size().
Unclear if this is an issue in reality as it's reading from the
flatbuffer and most likely didn't write out an empty string in order to
save space. There's no perf need to use c_str instead of str, and in
LOAD_STR_FROM_ORT_FORMAT we need to convert the return value to a
std::string anyway.
```c++
struct String : public Vector<char> {
const char *c_str() const { return reinterpret_cast<const char *>(Data()); }
std::string str() const { return std::string(c_str(), size()); }
```
```
inlined from ‘onnxruntime::common::Status onnxruntime::fbs::utils::LoadAttributeOrtFormat(const onnxruntime::fbs::Attribute&, onnx::AttributeProto&, std::unique_ptr<onnxruntime::Graph>&, onnxruntime::Graph&, onnxruntime::Node&, const onnxruntime::OrtFormatLoadOptions&, const onnxruntime::logging::Logger&)’ at /frdong_data/onnxruntime/onnxruntime/core/graph/graph_flatbuffers_utils.cc:385:3:
/usr/include/c++/11/bits/char_traits.h:399:32: error: ‘long unsigned int __builtin_strlen(const char*)’ reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix build error on A100
The order of defines for these test have to be in the same order. If we
check for TRT -> CUDA ->DML wen cannot reverse that order in later
defines as we might want to build for multiple EPs.
+@PatriceVignola
### Description
mlas matmul nbits implementation requires packed b. have a condition for
this.
need to update this logic if it changes.
### Motivation and Context
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
Following the issue #19223, introduce `per_channel` attribute in
`MinMaxCalibrater` to develop per-channel calibration.
If required, this new functionality should be implemented in the other
_Calibraters_ (`HistogramCalibrater`, `EntropyCalibrater`, ...).
### Motivation and Context
- This is the first part to solve #19223's proposal.
- If per channel calibration was allowed, the quantization algorithm
could be updated to improve quantization performance, i.e. weights
quantization per channel and not per tensor. That is why it would be
interesting to have a 'per_channel' option in any 'Calibrater' class to
produce a set of calibration vectors instead of a single scalar.