Commit graph

11997 commits

Author SHA1 Message Date
Tianlei Wu
7c3a25225f
[CUDA] update test_flash_attn_cuda.py for Windows (#21006)
Currently test_flash_attn_cuda.py can only run on Linux because it uses
triton for the rotary reference implementation, and the triton Python
package is not available on Windows.

This changes the script to allow the test to run on Windows, so that we can
test memory efficient attention on Windows.

Because of this limitation, rotary is excluded from testing on Windows.
2024-06-13 12:50:02 -07:00
Ye Wang
f35dd1407f
custom allreduce cuda kernel (#20703)
### Description

Conditionally route to custom AllReduce kernel when buffer size and gpu
numbers meet certain requirements. Otherwise, keep using NCCL's
AllReduce.
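
The routing decision described above can be sketched as a small predicate. This is only an illustration: the commit message does not state the actual limits, so `MAX_CUSTOM_BUFFER_BYTES`, `SUPPORTED_GPU_COUNTS`, and `choose_allreduce` are hypothetical names and values.

```python
# Hypothetical thresholds: the real limits used by the PR are not given
# in the commit message.
MAX_CUSTOM_BUFFER_BYTES = 8 * 1024 * 1024
SUPPORTED_GPU_COUNTS = {2, 4, 8}


def choose_allreduce(buffer_bytes: int, num_gpus: int) -> str:
    """Pick the AllReduce path for one call: 'custom' or 'nccl'."""
    size_ok = 0 < buffer_bytes <= MAX_CUSTOM_BUFFER_BYTES
    gpus_ok = num_gpus in SUPPORTED_GPU_COUNTS
    # Only take the custom kernel when both requirements are met;
    # everything else keeps using NCCL's AllReduce.
    return "custom" if size_ok and gpus_ok else "nccl"


print(choose_allreduce(1 * 1024 * 1024, 4))
print(choose_allreduce(64 * 1024 * 1024, 4))
```

Keeping NCCL as the fallback means a call that misses either requirement behaves exactly as it did before the change.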


---------

Co-authored-by: Ye Wang <wangye@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
Co-authored-by: Your Name <you@example.com>
2024-06-13 11:09:49 -07:00
Jian Chen
9daed5565a
Component Governance Fix round 6 (#21021)
2024-06-13 09:10:51 -07:00
Changming Sun
73271dd329
Move jobs in onnxruntime-Win2022-GPU-T4 machine pool to onnxruntime-Win2022-GPU-A10 (#21023)
### Description
Move jobs in onnxruntime-Win2022-GPU-T4 machine pool to
onnxruntime-Win2022-GPU-A10

### Motivation and Context
To reduce the variants of VM images we need to maintain. Now we have 3:
1. Windows 2022 CPU
2. Windows 2022 GPU A10
3. Windows 2022 GPU T4

This change allows us to remove the last one.
2024-06-12 22:04:40 -07:00
Jian Chen
4e18b0b7ce
Upgrade braces from 3.0.2 to 3.0.3 to fix the vulnerability (#21022) 2024-06-12 18:02:52 -07:00
Chen Fu
6fb09055d4
Adding a sm80 q4 gemm kernel for small tiles (#20545)
### Description

Implementation of a q4 gemm cuda kernel for small tiles and small
sequence_len or batch_size (<=16)

### Performance Test Results

| Problem Shape | New Kernel | | | Current Kernel | |
| ------------------: | ----------- | ------- | -- | ------------- | ------- |
| **(M x N x K)** | **Latency (ms)** | **GFLOPS** | | **Latency (ms)** | **GFLOPS** |
| 1 x 3072 x 3072 | 0.008124 | 2310.93 | | 0.017231 | 1095.39 |
| 16 x 3072 x 3072 | 0.011263 | 26813.7 | | 0.017431 | 17325.4 |
| 32 x 3072 x 3072 | 0.018559 | 32544.3 | | 0.079493 | 7597.89 |
| 64 x 3072 x 3072 | 0.030364 | 39782.1 | | 0.079387 | 15216 |
| 1024 x 3072 x 3072 | 0.387194 | 49916.5 | | 0.080849 | 239054 |
| | | | | | |
| 1 x 3072 x 9216 | 0.015734 | 3598.77 | | 0.043404 | 1304.55 |
| 16 x 3072 x 9216 | 0.023611 | 38371.3 | | 0.043388 | 20859.1 |
| 32 x 3072 x 9216 | 0.038652 | 46878 | | 0.224353 | 8076.31 |
| 64 x 3072 x 9216 | 0.072334 | 50099.5 | | 0.224338 | 16153.6 |
| 1024 x 3072 x 9216 | 1.02872 | 56363.2 | | 0.231284 | 250696 |
| | | | | | |
| 1 x 8192 x 3072 | 0.015787 | 3188.18 | | 0.017714 | 2841.28 |
| 16 x 8192 x 3072 | 0.025933 | 31053.3 | | 0.017919 | 44942.2 |
| 32 x 8192 x 3072 | 0.042633 | 37778.9 | | 0.079407 | 20282.9 |
| 64 x 8192 x 3072 | 0.070061 | 45977.5 | | 0.079531 | 40502.8 |
| 1024 x 8192 x 3072 | 1.01264 | 50896.3 | | 0.237244 | 217243 |
| | | | | | |
| 1 x 3072 x 8192 | 0.014444 | 3484.56 | | 0.038961 | 1291.85 |
| 16 x 3072 x 8192 | 0.020433 | 39411.8 | | 0.039056 | |
| 32 x 3072 x 8192 | 0.03459 | 46563.5 | | 0.200189 | 8045.47 |
| 64 x 3072 x 8192 | 0.063319 | 50873.4 | | 0.20029 | 16082.8 |
| 1024 x 3072 x 8192 | 0.928282 | 55521.5 | | 0.205883 | 250334 |
| | | | | | |
| 1 x 5120 x 5120 | 0.014573 | 3597.79 | | 0.02604 | 2013.42 |
| 16 x 5120 x 5120 | 0.025638 | 32719.5 | | 0.026194 | 32024.4 |
| 32 x 5120 x 5120 | 0.037421 | 44834.2 | | 0.127676 | 13140.4 |
| 64 x 5120 x 5120 | 0.065593 | 51155.9 | | 0.127706 | 26274.8 |
| 1024 x 5120 x 5120 | 1.00217 | 53570.9 | | 0.256388 | 209398 |
| | | | | | |
| 1 x 17920 x 5120 | 0.053868 | 3406.49 | | 0.04715 | 3891.84 |
| 16 x 17920 x 5120 | 0.071952 | 40805.1 | | 0.049755 | 59009.3 |
| 32 x 17920 x 5120 | 0.123657 | 47486.3 | | 0.129812 | 45234.8 |
| 64 x 17920 x 5120 | 0.222113 | 52874.2 | | 0.129781 | 90491.6 |
| 1024 x 17920 x 5120 | 3.50124 | 53668.1 | | 0.770569 | 243852 |
| | | | | | |
| 1 x 1280 x 5120 | 0.007029 | 1864.66 | | 0.025954 | 505.027 |
| 16 x 1280 x 5120 | 0.008122 | 25821.6 | | 0.025953 | 8080.59 |
| 32 x 1280 x 5120 | 0.012498 | 33558.7 | | 0.127618 | 3286.62 |
| 64 x 1280 x 5120 | 0.022049 | 38044.6 | | 0.127762 | 6565.81 |
| 1024 x 1280 x 5120 | 0.258547 | 51912.4 | | 0.128425 | 104511 |
| | | | | | |
| 1 x 5120 x 17920 | 0.049096 | 3737.59 | | 0.109703 | 1672.7 |
| 16 x 5120 x 17920 | 0.073145 | 40139.7 | | 0.110608 | 26544.3 |
| 32 x 5120 x 17920 | 0.11405 | 51486.3 | | 0.430942 | 13626 |
| 64 x 5120 x 17920 | 0.210022 | 55918.1 | | 0.430948 | 27251.7 |
| 1024 x 5120 x 17920 | 4.571 | 41108 | | 0.860118 | 218464 |
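
The GFLOPS columns appear consistent with the usual GEMM operation count of 2·M·N·K divided by the latency. That relation is inferred from the numbers in the table, not stated in the commit; the helper `gflops` below is a hypothetical name used only for this check.

```python
def gflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """Throughput in GFLOPS, counting 2*M*N*K operations per GEMM."""
    flops = 2 * m * n * k
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e9


# The 16 x 3072 x 3072 row from the table above.
new_kernel = gflops(16, 3072, 3072, 0.011263)
current_kernel = gflops(16, 3072, 3072, 0.017431)
speedup = 0.017431 / 0.011263  # latency ratio, current / new

print(f"new: {new_kernel:.1f} GFLOPS")
print(f"current: {current_kernel:.1f} GFLOPS")
print(f"speedup: {speedup:.2f}x")
```

The computed values agree with the table to well within 1%; the small differences presumably come from rounding of the reported latencies.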
2024-06-12 16:02:26 -07:00
Changming Sun
feec8efae4
Add "-allow-unsupported-compiler" flags to Windows CUDA flags (#21004)
### Description
Add "-allow-unsupported-compiler" flags to Windows CUDA flags. This
change only impacts our pipelines. By default it would not reach this
code path.

### Motivation and Context
nvcc refuses to work with the latest VS toolset unless this flag is set.

Without this change, our CI build fails when the compiler is the
latest VS 2022 17.10. Here is the log:
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1405549&view=logs&j=6df8fe70-7b8f-505a-8ef0-8bf93da2bac7&t=c7e55e04-f02b-57dc-d19a-29b7d3528c44&l=715

The error message is:
`D:\a\_work\_temp\v11.8\include\crt/host_config.h(153): fatal error
C1189: #error: -- unsupported Microsoft Visual Studio version! Only the
versions between 2017 and 2022 (inclusive) are supported! The nvcc flag
'-allow-unsupported-compiler' can be used to override this version
check; however, using an unsupported host compiler may cause compilation
failure or incorrect run time execution. Use at your own risk.
[D:\a\_work\1\b\RelWithDebInfo\CMakeFiles\CMakeScratch\TryCompile-g5rudf\cmTC_7b8ff.vcxproj]`
2024-06-12 14:23:00 -07:00
Tianlei Wu
a2b0a69dcc
Update MultiHeadAttention benchmark to test CPU (#20972)
### Description
The MultiHeadAttention benchmark script currently supports only the cuda
provider.
This updates the script to support testing the cpu operator and plotting gpu
latency.

### Motivation and Context
Benchmark for the coming cpu flash attention.
2024-06-12 13:04:25 -07:00
Changming Sun
99f0fe3fae
Fix a few issues in "Zip-Nuget-Java-Nodejs Packaging Pipeline" (#21014)
### Description
Fix a few issues in the Windows TRT job in "Zip-Nuget-Java-Nodejs
Packaging Pipeline":
1. It is a Windows job. It should not use bash (which is usually not
available on Windows).
2. When it sets ADO vars, it is missing a semicolon.

Here is the doc of how to set ADO vars via scripts:
https://learn.microsoft.com/en-us/azure/devops/pipelines/process/set-variables-scripts?view=azure-devops&tabs=bash

As the doc shows, a semicolon is required. Without the semicolon, the vars
will have an extra quotation mark in their values.
2024-06-12 09:44:24 -07:00
Baiju Meswani
94aa21c3dd
Define _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR (#21005)
https://github.com/microsoft/STL/pull/3824 introduces constexpr mutex.
An older version of msvcp140.dll will lead to `A dynamic link library
(DLL) initialization routine failed`.

This error can be encountered if using conda Python since conda packages
msvc dlls and these are older right now.

This PR disables the constexpr mutex so that ort package can work with
older msvc dlls.

Thanks @snnn for the discovery.
2024-06-11 22:23:28 -07:00
Jing Fang
9be30348b9
[CPU EP] Add blocked quantization to QuantizeLinear op kernel (#20977)
### Description
Add blocked quantization to QuantizeLinear op kernel.

If the quantize axis is not the last axis, block the tensor using 1x128
blocks. Blocks are dispatched to multiple threads for concurrent
processing. Currently only scalar instructions are supported.

If the quantize axis is the last axis, block the tensor using 1 x
quant_block_size blocks. Blocks are dispatched to multiple threads for
concurrent processing. If the output type is an integer type, call the mlas
kernel to use SIMD instructions in each block.
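
As a rough illustration of the last-axis case, the sketch below quantizes one row in 1 x quant_block_size blocks, each block with its own scale and zero point. It is a scalar, single-threaded sketch in plain Python, not the kernel from this PR; the function name, the int4 range, and the sample data are assumptions.

```python
def quantize_row_blocked(row, scales, block_size, zero_points=None,
                         qmin=-8, qmax=7):
    """Quantize one row; element i uses the scale of block i // block_size."""
    out = []
    for i, x in enumerate(row):
        block = i // block_size
        zp = zero_points[block] if zero_points is not None else 0
        # Python's round() rounds halves to even.
        q = round(x / scales[block]) + zp
        out.append(max(qmin, min(qmax, q)))  # saturate to the target range
    return out


row = [0.5, -1.0, 1.5, 2.0, 10.0, -20.0, 30.0, 40.0]
scales = [0.5, 10.0]  # one scale per block of 4 elements
print(quantize_row_blocked(row, scales, block_size=4))
```

Because each block depends only on its own scale and zero point, blocks can be processed independently, which is what makes dispatching them to separate threads straightforward.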

#### Benchmark data
20 core 2GHz CPU, RelWithDebInfo config, 196 x 4096 tensor, quantize
float to int4x2

Quantize before last axis:
 * single thread, scalar instruction: 31380900 ns
 * 8 thread, scalar instruction: 5098620 ns

Quantize last axis:
 * single thread, scalar instruction: 27927900 ns
 * 8 thread, SIMD instruction: 102261 ns

More threads, SIMD instructions, and a larger block size all help.

### Motivation and Context
ONNX added blocked quantization to QuantizeLinear in opset 21
2024-06-11 20:25:28 -07:00
Yi Zhang
17d5dc503f
Upgrade ESRP signing task from v2 to v5 (#20995)
2024-06-12 08:31:53 +08:00
cloudhan
67c8befd1d
test: refactor flash_attn tests to use parameterized (#20913)
Use `parameterized` to decompose the huge test case. This makes it
possible to add ROCm support.

---------

Co-authored-by: Guangyun Han <guangyunhan@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
2024-06-11 15:57:20 -07:00
Tianlei Wu
b3fc9b5a0e
[CUDA] upgrade cutlass to 3.5.0 (#20940)
### Description
Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 in
Windows
- [x] Upgrade cutlass to 3.5.0.
- [x] Fix flash attention build error with latest cutlass header files
and APIs. This fix is provided by @wangyems.
- [x] Update efficient attention to use new cutlass fmha interface.
- [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53.
- [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test
build error for cuda 11.8 to 12.3.
- [x] Disable TRT 10 deprecate warnings. 

The following are not included in this PR:
* TRT provider replaces the deprecated APIs.
* Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This
test is not built by default unless you add `--cmake_extra_defines
onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` in build command.

To integrate to rel-1.18.1: Either bring in other changes (like onnx
1.16.1), or generate manifest and upload a new ONNX Runtime Build Time
Deps artifact based on rel-1.18.1.

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/19891
https://github.com/microsoft/onnxruntime/issues/20924
https://github.com/microsoft/onnxruntime/issues/20953
2024-06-11 13:32:15 -07:00
Yulong Wang
dd805ff77d
[js/web] ESM: use the bundled target as default export (#20991)
### Description
ESM: use the bundled target as default export

In this change, the default import of the following entries:
```
import from 'onnxruntime-web';
import from 'onnxruntime-web/all';
import from 'onnxruntime-web/webgpu';
```
will use the "bundled" version, which has no dynamic import.

This change should only apply to ESM on web.
2024-06-11 11:14:55 -07:00
Jian Chen
05032e5e5f
Updating cudnn from 8 to 9 on existing cuda 12 docker image (#20925)
### Description
Adding support of cudnn 9 


### Motivation and Context
Keep the existing cuda 12.2 with nvidia driver 535
2024-06-11 09:37:16 -07:00
Wanming Lin
043ef5c95f
[WebNN EP] Support latest WebNN softmax op (#20827)
Latest WebNN softmax supports N-D input and axis parameter.
2024-06-11 08:27:14 -07:00
Changming Sun
ae4a2e6b3f
Publish Build Symbols for DML nightly nuget package (#20988)
### Description
Publish Build Symbols for DML nightly nuget package.
2024-06-10 17:53:22 -07:00
Changming Sun
dc545d366d
Publish debug symbols for Windows python packages (#20973)
### Description
1. Publish debug symbols for Windows python packages. This PR will
publish them to ADO. Later on I will also replicate them to Microsoft
Symbol Server.
2. Build the packages in Release mode instead of RelWithDebInfo, to be
consistent with the other platforms(Linux/macOS/...)


### Motivation and Context
To help debug things. Sometimes we found an issue, but we couldn't debug
it because we didn't have symbols, and once we rebuilt the package
locally the issue was gone. This change would be helpful for such
scenarios.

Build log:
https://aiinfra.visualstudio.com/Lotus/_build?definitionId=841
2024-06-10 12:33:49 -07:00
Changming Sun
92ae60b01f
Revert a cmake change in protobuf_cmake.patch (#20964)
Avoid patching external projects unless absolutely necessary
#20875
2024-06-10 11:20:33 -07:00
Hector Li
007d106b73
Disable inference on CPU if CPU fallback is disabled (#20976)
### Description
Don't allow model inference on CPU (Ort CPU EP or QNN EP CPU backend) if
CPU fallback is disabled.
2024-06-10 09:27:43 -07:00
Hector Li
3c6d409937
Enable Hardsigmoid for QNN EP using direct SDK support (#20956)
### Description
Enable Hardsigmoid for QNN EP using the SDK's direct support instead
of decomposing it into its constituent ops, so that it can support
quantized models
2024-06-10 09:16:25 -07:00
Edward Chen
855c1cffc9
Update comment in cpuid_info.cc (#20974)
Update comments to indicate that we don't need to set CPUIDInfo::is_armv8_narrow_ld_ on Apple platforms.
2024-06-10 08:52:38 -05:00
wejoncy
bd61ae530b
relax seq len checking in rotary_emb (#20778)
### Description
Length checking is even more strict for packed batching input.
There are two cases for a batch of input_ids.
- padded seq with equal length of inputs. 
```
|----********|
|------------|
|--------****|
|-***********|
```
- packed seqs with different length of input_ids
`|----|---------|----|-|`

The max_seq_length comes either from graph_inputs or from the position_ids.
In most cases, however, the max_seq_length of rotary_cache is cached
in the model and shared among all layers.
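
One reading of why a strict length check is a problem for packed input: the packed buffer can hold more tokens in total than any single sequence is long, while the rotary cache only needs to cover the largest position id. The sketch below illustrates that reading with the lengths from the packed diagram above; `packed_position_ids` is a hypothetical helper, not code from this PR.

```python
def packed_position_ids(seq_lens):
    """Position ids for sequences packed back to back; each restarts at 0."""
    ids = []
    for n in seq_lens:
        ids.extend(range(n))
    return ids


# Lengths taken from the packed diagram: |----|---------|----|-|
seq_lens = [4, 9, 4, 1]
position_ids = packed_position_ids(seq_lens)

total_tokens = len(position_ids)            # length of the packed buffer
needed_cache_len = max(position_ids) + 1    # rows of rotary cache actually used

print(total_tokens, needed_cache_len)
```

Under this reading, comparing the packed buffer length against the cached max_seq_length rejects inputs that would only ever index positions already covered by the cache.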


---------

Co-authored-by: kailums <kalu@microsoft.com>
2024-06-08 18:39:06 +08:00
Edward Chen
981893c318
Remove deprecated "mobile" packages (#20941)
# Description

This PR removes the building of the ORT "mobile" packages and much of the associated infrastructure which is no longer needed.

Not removed yet - tools/ci_build/github/android/mobile_package.required_operators.config and the helper scripts that depend on it.

# Motivation and Context

The mobile packages were deprecated in 1.18. Users should use the full packages (Android - onnxruntime-android, iOS - onnxruntime-c/onnxruntime-objc) instead or do a custom build.
2024-06-07 16:20:32 -05:00
Changming Sun
a53f692832
Update c-api-noopenmp-packaging-pipelines.yml: remove CUDA version parameter (#20955)
### Description
Update c-api-noopenmp-packaging-pipelines.yml: remove the CUDA version
parameter to reduce confusion. This pipeline is for generating CUDA 11
packages only, not CUDA 12.

### Motivation and Context
In the last release we accidentally published CUDA 12(instead of CUDA
11) packages to nuget.org.
We also tried to publish CUDA 12 packages to
https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly.
Luckily it didn't go through because a package with the same version
number already existed there. Every time someone runs this pipeline
with the CUDA version set to 12, the built packages are published to
https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly.
The GenAI team's build pipelines are based on the nightly packages, so
the GenAI team sometimes builds their packages with CUDA 12 and sometimes
with CUDA 11, unpredictably.
Therefore, please limit the use of pipeline parameters. Most Azure
DevOps yml files are template files, and they should use parameters. But
the top-level yml files should use them much more carefully.
2024-06-07 11:19:59 -07:00
Jian Chen
d32adb26f2
Refactor deprecated gradle syntax (#20922)
Replaces the deprecated API.
This should be verified with the `Gradle cmakeCheck` step in the
`Windows_Packaging_CPU_x64_default` stage of the Zip-Nuge-...
pipeline.
2024-06-07 11:08:52 -07:00
ivberg
74028e4bdc
Fully dynamic ETW controlled logging for ORT and QNN logs (#20537)
### Description
Windows - Fully dynamic ETW controlled logging for ORT and QNN logs

The logging support is documented here 
-
https://onnxruntime.ai/docs/performance/tune-performance/logging_tracing.html#tracing---windows
-
https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html#tracelogging-etw-windows-profiling

Also add support for logging ORT SessionCreation on ETW CaptureState

### Motivation and Context
The previous ETW support only worked if you enabled ETW before the
session started. There can commonly be long-lived AI inference processes
that need to be traced & debugged. This enables logging fully on the
fly.

Without this support a dev would have to end up killing a process or
stopping a service in order to get tracing. We had to do this for a
recent issue with QNN, and it was a bit painful to get the logs and it
ruined the repro.

### Testing
I tested with the following cases
- Leaving default ORT run
- Enabling ETW prior to start and leaving running for entire session +
inferences, then stopping
- Starting ORT session + inf, then enabling and stopping ETW
  - Start ORT session w/ long running Inferences
  - wpr -start [ort.wprp](e6228575e4/ort.wprp (L4)) -start [etw_provider.wprp](e6228575e4/onnxruntime/test/platform/windows/logging/etw_provider.wprp)
  - Wait a few seconds
  - wpr -stop ort.etl
  - Inferences are still running
- Verify ONNXRuntimeLogEvent provider events are present and new
SessionCreation_CaptureState event under Microsoft.ML.ONNXRuntime
provider

Related:
#18882
#19428
2024-06-06 21:11:14 -07:00
Changming Sun
f8b5c2805e
Update abseil-cpp.cmake: add version check (#20962)
Some dev environments come with a preinstalled abseil. For example,
conda users often do that. If the preinstalled abseil version is
incompatible with what we have in cmake/deps.txt, it could result in a
hard-to-understand build error. This PR adds a version check to improve
that.
2024-06-06 19:42:31 -07:00
Jian Chen
96228c86a0
Adding Job names to jobs without a name (#20961)
### Description
Adding Job names to jobs without a name

### Motivation and Context
This way we will know which job fails CG scan.
2024-06-06 19:09:21 -07:00
Adrian Lizarraga
128bfc0665
[MLAS] Use C-style casting for power vector instructions (#20957)
### Description
Uses C-style casting for Power vector instructions in
`MlasQuantizeLinearInt4Kernel`.



### Motivation and Context
Vector commands (e.g., vec_xst) need C-style casting to support various
compiler versions.
ONNX Runtime CI pipelines do not build with all compiler versions. The
recent INT4 PR broke the powerpc build for certain compiler versions
because it uses C++-style `static_cast<>`.

See:
https://github.com/microsoft/onnxruntime/pull/20362#discussion_r1630106164

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2024-06-06 15:11:59 -07:00
Hector Li
05889b33ef
Support loading from model with multiple QNN context binary (#20930)
### Description
Support loading from model with multiple QNN context binary

### Motivation and Context
A context binary model generated by QNN EP contains only a single QNN context.
Because of the QNN PD memory limitation, a large model (>3.5GB) has to be split into 2 smaller models, and a context binary model is then generated for each. Users can load the smaller context binary models, but doing so requires 2 Ort sessions. Users want to glue the split models into 1 model (with multiple EPContext nodes) so that a single Ort session can do the work.
QNN EP has a limitation: it only supports loading from a single QNN context binary. This PR removes that limitation to unblock this user scenario.

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
2024-06-06 14:44:57 -07:00
Wanming Lin
52874f628a
[WebNN EP] Remove some constraints for CPU backend (#20900)
Following constraints have been supported by WebNN TFLite backend:
- Concat: supports up to 4 inputs
- Matmul: supports broadcasting
- Resize: supports nearest mode
- Split: supports up to 4 outputs
2024-06-06 08:22:41 -07:00
Wanming Lin
da1f8f9274
[WebNN EP] TFLite backend only supports limited ranges for Clip (#20863) 2024-06-06 08:22:18 -07:00
Guenther Schmuelling
c749bd997a
webgpu quickgelu (#20939) 2024-06-06 08:21:33 -07:00
Chester Liu
5b87544aab
Add conditional check in Get/Set current GPU device id (#20932)
### Description

Add conditional check in Get/Set current GPU device id


### Motivation and Context

Currently with ROCm build, calling `GetCurrentGpuDeviceId` will still
try to find CUDA libraries and log the following error message:

```text
[E:onnxruntime:, provider_bridge_ort.cc:1836 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1511 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libonnxruntime_providers_cuda.so: cannot open shared object file: No such file or directory
```

This is unnecessary and confusing.
2024-06-06 17:10:14 +08:00
Scott McKay
3ecf48e3b5
Add support for Trilu<bool>. (#20917)
### Description
Trilu<bool> is used by phi-3 when exported with torch.onnx.export.
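
For reference, a minimal sketch of what Trilu does on a boolean matrix, following the ONNX operator's definition (keep the upper or lower triangle relative to diagonal `k`, zero the rest, which for bool means `False`). The function name `trilu_bool` is hypothetical, and the lower-triangular mask shown is just one typical use; the commit does not say how the exported phi-3 graph uses the op.

```python
def trilu_bool(matrix, k=0, upper=True):
    """Keep the upper (j >= i + k) or lower (j <= i + k) triangle of a bool matrix."""
    return [
        [
            bool(value) and ((j >= i + k) if upper else (j <= i + k))
            for j, value in enumerate(row)
        ]
        for i, row in enumerate(matrix)
    ]


ones = [[True] * 3 for _ in range(3)]
lower = trilu_bool(ones, upper=False)
for row in lower:
    print(row)
```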

2024-06-06 15:21:34 +10:00
Chester Liu
eb2ec66716
Initialize device_id in cuda_call & rocm_call (#20933)
### Description

Initialize `device_id` with `-1` in  `cuda_call` and `rocm_call`.

### Motivation and Context

From PyTorch code:
bb2de3b101/c10/cuda/CUDAFunctions.cpp (L217-L324)

If `cudaGetDevice` or `hipGetDevice` failed, an uninitialized `int`
would produce a random number that changes during each run:

```text
[with ERRTYPE = hipError_t; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] HIP failure 101: invalid device ordinal ; GPU=32741 ; hostname=e6724be2a31a ; file=/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h ; line=66 ; expr=hipGetDeviceProperties(&deviceProp, 0); 
```

Notice the `GPU` value above. Using `-1` would clearly indicate such
failure and avoid confusion.
2024-06-06 11:19:09 +08:00
Adrian Lizarraga
b5eb9e8a8a
[QNN EP] Update to QNN SDK 2.22 (#20628)
### Description
- Updates pipelines to use QNN SDK 2.22 by default.
- Linux QNN pipeline now uses an Ubuntu 22.04 image (required by QNN
SDK)
- Android QNN pipeline still uses the current Ubuntu 20.04 image. Will
update in a separate PR.
- Disables QDQ LayerNorm test that triggers QNN's graph finalization
error on QNN 2.22
- Increases accuracy tolerance for various HTP tests so that they pass
on Windows arm64.



### Motivation and Context
Test QNN EP with latest QNN SDK version by default.

---------

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2024-06-05 18:25:23 -07:00
Adrian Lizarraga
df28c7d73b
[Quant tool] Improve performance of int4 weight quantization (#20935)
### Description
- Uses our own quantization functions instead of the ONNX reference
implementation of QuantizeLinear when quantizing weights to int4.
- Uses a custom function that packs bytes into 4-bit elements.
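
A minimal sketch of the kind of packing involved, assuming the ONNX int4 layout (two elements per byte, the first element in the low 4 bits). `pack_int4` and `unpack_int4` are hypothetical names; this is not the quantization tool's actual implementation.

```python
def pack_int4(values):
    """Pack signed 4-bit values (-8..7) two per byte, first value in the low nibble."""
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("int4 values must be in [-8, 7]")
    nibbles = [v & 0x0F for v in values]  # two's complement, 4 bits
    if len(nibbles) % 2:
        nibbles.append(0)                 # pad an odd count with a zero nibble
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))


def unpack_int4(data, count):
    """Inverse of pack_int4; `count` drops the padding nibble if there was one."""
    out = []
    for byte in data:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


packed = pack_int4([1, -2, 7])
print(packed.hex())
print(unpack_int4(packed, 3))
```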



### Motivation and Context
Running the quantization tool to create QDQ models with int4 weights
could take up to 7x longer. This PR uses our own quantization and byte
packing utilities to improve performance.

#### Measurements
Model with ~5M parameters to quantize to int4.

- Current implementation: **84.5s**
- Only replace ONNX QuantizeLinear implementation: **50.3s** (1.68x
speedup)
- This PR (replace onnx Q impl, custom packing func): **13.5s** (6.26x
speedup)

---------

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2024-06-05 16:48:40 -07:00
Chip Kerchner
4cb23b020c
Improvements to the INT8 GEMM portion of the code for Power (#20595)
These are changes to improve GEMM portion of the code for Power.

There are 3 main code changes:
1) Changing a function to a template parameter so that operations that
add/sub zero are eliminated at compile time. Plus reuse a vector that
has the mask instead of rebuilding each time.
2) Add processing 16 columns at a time in MlasGemmQuantCopyPackB8x8 -
this should reduce potential page faults by a factor of 4 and also be
faster.
3) Unroll MlasQgemmStoreVectorMMA and vectorize other variables.
2024-06-05 14:24:22 -07:00
Yufeng Li
63c13a4811
fix integer overflow in Attention (#20921)
### Description
The offset used in attention has data type int. It can overflow for
large sequence lengths.
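
A worked example of the arithmetic, under assumed shapes: the commit does not say which offset overflows, so the formula below (the start of the last head's sequence_length x sequence_length block) and the sizes are hypothetical. The wrap-around is simulated only to show the magnitude; in C++ signed overflow is undefined behavior.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_int32(value: int) -> int:
    """Simulate storing a value in a signed 32-bit int (two's complement wrap)."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


# Hypothetical shape, chosen only to illustrate the overflow.
batch_size, num_heads, sequence_length = 1, 32, 16384
last_head = batch_size * num_heads - 1
offset = last_head * sequence_length * sequence_length

print(offset, offset > INT32_MAX)
print(wrap_int32(offset))
```

Computing such offsets in a 64-bit type avoids the problem.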


2024-06-05 10:19:26 -07:00
Yueqing Zhang
b374ddd704
[VitisAI] add new api for models (#20899)
### Description
Add new APIs.


### Motivation and Context
This change is required to satisfy a requirement from Microsoft.

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-06-04 22:48:04 -07:00
Jing Fang
3ecb012337
[CPU EP] Add blocked quantization to DequantizeLinear op kernel (#20901)
### Description
Added blocked quantization to DequantizeLinear op kernel. All existing
[input types and output
types](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftdequantizelinear)
are supported. All axes are supported.

The implementation in this PR is naive: single-threaded, with scalar
instructions. Multi-threading and vector instructions are planned for the
future, as needed.


### Motivation and Context
onnx introduced blocked quantization in opset 21 for
[DequantizeLinear](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftdequantizelinear).
This PR adds the spec support in onnx runtime.
2024-06-04 14:44:40 -07:00
Jian Chen
5faeaf6437
Remove failOnStderr from Gradle cmakeCheck (#20919)
### Description
Remove failOnStderr from Gradle cmakeCheck



### Motivation and Context
Gradle is still using the deprecated API
2024-06-04 13:54:49 -07:00
Tianlei Wu
6dfdef7782
update stable diffusion demo requirements (#20914)
### Description
Update docker and package version for stable diffusion demo.

### Motivation and Context
Update onnx to 1.16 for security
2024-06-04 12:08:04 -07:00
liqun Fu
51bc53580d
Update to onnx 1.16.1 (#20702) 2024-06-04 11:06:28 -07:00
Changming Sun
3dd6fcc089
Upgrade min ios version to 13.0 (#20773)
To align with Office and other MS products.
Office's support policy is:
"Office for iPad and iPhone is supported on the two most recent versions
of iOS and iPadOS. When a new version of iOS or iPadOS is released, the
Office Operating System requirement becomes the two most recent
versions: the new version of iOS or iPadOS and the previous version."
(from https://products.office.com/office-system-requirements)

The latest iOS version is 17. So they support both 17 and 16. Here I set
our min iOS version to 13 so that it will be a superset of what Office
supports.

This change would allow us to use C++17's std::filesystem feature in the
core framework. The modifications were generated by running
```bash
 find . -type f -exec sed -i "s/apple_deploy_target[ =]12.0/apple_deploy_target=13.0/g"  {} \;
```

Cannot use 15.0 because otherwise iOS packaging would fail with:

```
/Users/runner/work/1/b/apple_framework/intermediates/iphoneos_arm64/Release/_deps/coremltools-src/mlmodel/src/MILBlob/Util/Span.hpp:288:9: error: cannot use 'throw' with exceptions disabled
        MILVerifyIsTrue(index < Size(), std::range_error, "index out of bounds");
```

The Google OSS libraries we use only officially support iOS 15+.
2024-06-04 10:15:20 -07:00
Yi Zhang
c5087b9b58
Improve stable diffusion image parity test stability (#20904)
### Description
1. Add one image to the whitelist; if that image is hit, the pipeline
status is a warning.
2. Adjust the image parity test tolerance.



### Motivation and Context
improve pipeline stability
2024-06-04 10:19:32 +08:00
zhijiang
3c561c8b26
fix bug (#20694)
When the number of elements in a tensor is larger than 2^32, use
cuda_long as the dtype of the offset.
2024-06-04 09:22:10 +08:00