Commit graph

11012 commits

Author SHA1 Message Date
Adrian Lizarraga
0dda8b0c44
[QNN EP] Update QNN SDK to 2.21 (#20534)
### Description
- Updates QNN pipelines to use QNN SDK 2.21
- Downloads QNN SDK from Azure storage to avoid having to rebuild images
when a new version is released.


### Motivation and Context
Test with the latest QNN SDK.
2024-05-01 20:17:35 -07:00
Tianlei Wu
87076553b0
[CUDA] Add SparseAttention kernel for sm=75 (#20531)
### Description
Follow up of #20216 to add kernel for sm=75 (GPU like T4, Geforce RTX
2080, GeForce GTX 1650 Ti, NVIDIA TITAN RTX, RTX 4000 etc)

- [x] Add kernel for sm=75
- [x] Update dispatch code to use sm to call different kernel.
- [x] Update compile script to use num_stages=2 instead of 3 for sm=75
- [x] Refactor test script and add tests for bfloat16.
- [x] Fix performance test of token generation (previously we did not
concatenate past_key)
- [x] Fix debug build
- [x] Run performance test and update numbers.

For sm=70, the v1 kernel can be compiled but there is error in compiling
v2 kernel. So it is skipped in this pull request.

Performance Test on T4 GPU (using Standard_NC4as_T4_v3 Azure VM) with
`batch_size=4, num_heads=32, max_seq_len=8192, head_size=128,
sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8`

We compare sparse attention to corresponding GQA with dense causal. Note
that GQA with dense need more computation since no sparsity is used. The
TORCH-GQA use naive implementation (using cuSPARSE Block-SpMM could be
faster).

```
prompt-sm75-batch4-head32-d128-local16-vert8-torch.float16:
   sequence_length   TORCH-GQA  ORT-GQA-Dense  ORT-SparseAtt
1             32.0    0.184173       2.994347       0.089064
2             64.0    0.303300       3.023986       0.107418
3            128.0    0.887795       3.073728       0.174213
4            256.0    2.797654       3.246899       0.357869
5            512.0   10.055048       3.814039       0.893903
6           1024.0   37.849937       5.818439       2.658720
7           2048.0  148.641785      13.638480       7.202690
8           4096.0    OOM           43.556847      17.680954
9           8192.0    OOM           161.628540      44.336670

token-sm75-batch4-head32-d128-local16-vert8-torch.float16:
   past_sequence_length  TORCH-GQA  ORT-GQA-Dense  ORT-SparseAtt
1                  32.0   0.110353       2.996305       0.137509
2                  64.0   0.145088       3.006860       0.165424
3                 128.0   0.219500       3.036448       0.192001
4                 256.0   0.347496       3.071341       0.249125
5                 512.0   0.595842       3.135225       0.398726
6                1024.0   1.081216       3.261110       0.612744
7                2048.0   2.060307       3.515578       0.685670
8                4096.0   OOM            4.022986       0.819707
9                8191.0   OOM            5.024528       1.072912
```

### Motivation and Context

To inference Phi-3-small in T4 GPU
2024-05-01 19:52:13 -07:00
Scott McKay
f9febc4f35
Remove usage of 'required reason' iOS API from protobuf (#20529)
### Description
<!-- Describe your changes. -->

Using certain APIs is about to require a [privacy
manifest](https://developer.apple.com/documentation/bundleresources/privacy_manifest_files/describing_use_of_required_reason_api)
to be added to a package.

Our version of protobuf uses `mach_absolute_time`. Patch as per
https://github.com/protocolbuffers/protobuf/pull/15662/ to remove usage.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Usage of API will require a privacy manifest for an iOS app to be
accepted as of 5/1/2024
#20519
2024-05-02 08:21:08 +10:00
Yifan Li
29417762f7
[TensorRT EP] support TensorRT 10-GA (#20506)
### Description
<!-- Describe your changes. -->
This branch is based on rel-1.18.0 and supports TensorRT 10-GA.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-01 11:10:53 -07:00
Shubham Bhokare
7a7344dcc2
Update openai-whisper version in requirements.txt (#20473)
### Description
Update openai-whisper version in requirements.txt
2024-04-30 22:25:41 -07:00
Hector Li
755aaea9a6
Qnn nuget update (#20527)
### Description
Update Qnn nuget package to include Qnn libs and license file
2024-04-30 22:12:53 -07:00
Yi Zhang
91baeb8495
Reduce downloads to NodeJS to mitigate random connection exception. (#20518)
### Description
There was connection exception in docker build in package pipeline
```
48.26 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
456.0 curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
```

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=453140&view=logs&j=f9f5b320-fa10-56c4-debe-61ea69c74793&t=1656e225-defa-5b12-8935-2a0a93e76a67&s=3c85d903-a183-5028-775e-d63999fcc9ae

In fact, docker image shouldn't be rebuilt this time.

Checked the code, The docker image tag in Linux_C_API_Packaging_GPU_x64
of onnxruntimecuda${{ variables.CUDA_VERSION_MAJOR }}build was same as
the image tag of Linux-gpu-ci-pipeline, but their docker files are
different.

So changing the Linux GPU pipeline's image tag to avoid packaging
pipeline docker image overridden unexpectedly.
2024-05-01 09:04:56 +08:00
vividsnow
5c3a1bc3b8
update onnxruntime_c_api.h (#20360)
### Description
removing excess trailing semicolon from specific macro

### Motivation and Context
I am preparing automatic generation of onnxruntime bindings for perl,
and the parser (ucpp) has broken due to the "double semicolon" error in
the subsequent lines where the macro is applied.
2024-04-30 16:47:24 -07:00
Edward Chen
a7fc0e8370
Only define CPUIDInfo::pytorch_cpuinfo_init_ data member when CPUINFO_SUPPORTED is defined. (#20509)
Only define CPUIDInfo::pytorch_cpuinfo_init_ data member when CPUINFO_SUPPORTED is defined. It can cause unused variable warnings in some compilations.
2024-04-30 16:10:13 -07:00
Yi-Hong Lyu
33e883fbc4
Fix the doxygen error (#20515)
Fix
onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637:
error: argument 'session' of command @param is not found in the argument
list of

```
OrtApi::AddExternalInitializersFromFilesInMemory(
    OrtSessionOptions *options,
    const char *const *external_initializer_file_names,
    char *const *external_initializer_file_buffer_array,
    const size_t *external_initializer_file_lengths,
    size_t num_external_initializer_files)
```
2024-04-30 11:45:03 -07:00
Tianlei Wu
9f0fae29e8
[CUDA] Add SparseAttention operator for Phi-3-small (#20216)
### Description
Add CUDA implementation for block sparse attention for Phi-3-small.

Block sparse attention was proposed in [Sparse
Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also
adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with different
sparse layout.

In Phi-3-small, the sparse layout is static, and works with
unidirectional (causal) attention.

Compared to dense attention, the benefit of block sparse is to speed up
both training and inference. It could save memory thus support longer
context length.

- [x] Add operator spec and shape inference
- [x] Symbolic shape inference
- [x] Refactor GroupQueryAttention to expose common kernels for kv cache
concatenation, q/k/v transpose etc.
- [x] Add cuda kernel to convert block mask to CSR format
- [x] Add cuda kernel to generate position ids
- [x] Add compile script and template files to convert triton kernel to
cubin and dispatcher.
- [x] Add triton kernel v1 for prompt
- [x] Add triton kernel v2 for token generation and support padding
- [x] Update IO Binding Helper to allow buffer sharing.
- [x] Test relevance
- [x] Test performance

### Performance
Test in A100-SXM4-80GB with `batch_size=4, num_heads=32,
max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16,
vert_stride=8, num_layout=8`

We compare sparse attention to corresponding GQA with local attention
windows size 1024, or GQA with dense causal.

Average latency in milliseconds (for fused attention kernel used in
prompt prefilling):

seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748

Average latency in milliseconds (for fused attention kernel used in
token generation:

past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530  | 0.0592 | 0.0988
512 | 0.0445| 0.0447 | 0.1150
1024 | 0.0634  | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146

We can see that the kernel for token generation still have room to
improve.

#### Limitations
Only support right-side padding and unidirectional attention.

The following are not supported in the first version:
(1) Packed mode like PackedMultiHeadAttention where input has been
removed padding.
(2) paged attention.
(3) bidirectional attention.
(4) GPU compute capacity that is not 8.0, 8.6 and 8.9.
(5) Left side padding.

Some of these limitations will be removed in the future (may be in a new
operator).
2024-04-30 09:06:29 -07:00
Yi-Hong Lyu
b2481e3602
Bump up version in main from 1.18.0 to 1.19.0 (#20489)
Bump up version in main from 1.18.0 to 1.19.0 since the release branch
has been cut.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-04-29 20:21:41 -07:00
Yulong Wang
b1085b51ca
[js/web] update README (#20492)
### Description

Update README.md in /js/web/

- update compatibility table
- update links to onnxruntime.ai
2024-04-29 17:56:23 -07:00
Chi Lo
a1558fe117
[TensorRT EP] Make TRT EP use priority-based topo sort (#20512)
This PR is needed for
https://github.com/microsoft/onnxruntime/pull/20411 to make sure TRT EP
use priority-based topo sort for consistency across TRT EP.
2024-04-29 16:00:43 -07:00
Rachel Guo
8c31f27dd1
Catalyst nuget package .NET changes only (#20424)
### Description
<!-- Describe your changes. -->

https://github.com/microsoft/onnxruntime/pull/20418

Add back Catalyst changes only for now.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2024-04-29 15:39:48 -07:00
Satya Kumar Jandhyala
99b0e19f11
[JS/WebGPU] MatMulNBits remove unnecessary condition (#20396)
Distribute writing-to-output work over all threads in MatMulNBits.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-29 14:27:21 -07:00
winskuo-quic
509cbcae6e
[QNN EP] Support prelu fp16 (#20428)
### Description
Originally, Prelu in QNN will fail when the input is fp16 and alpha is fp32.
QNN requires alpha is fp16 when input is fp16.
This can be resolved by casting alpha to fp16 and pass it to QNN.

### Motivation and Context
Makes QNN Prelu support fp16 case.

---------

Co-authored-by: Hector Li <hecli@microsoft.com>
2024-04-29 13:26:51 -07:00
Edward Chen
358f5bb022
Update regex to match correct pattern. (#20483)
In CMakeLists.txt:set_msvc_c_cpp_compiler_warning_level(), the regex should match the value that gets added by the function. The latter got updated, so this change updates the former to match.
2024-04-29 10:43:31 -07:00
George Wu
49b2bebe85
[qnn ep] include qnn sdk in onnxruntime-qnn python whl (#20485)
script changes to include qnn sdk libs with onnxruntime-qnn python
package.
2024-04-29 09:44:54 -07:00
guyang3532
3e4db2c686
Fuse Cast + SoftmaxCrossEntropyLossInternal (#20334)
### Description
Fuse Cast + SoftmaxCrossEntropyLossInternal to
SoftmaxCrossEntropyLossInternal.
2024-04-29 14:12:10 +08:00
Scott McKay
923b0ef323
Run fuzz testing before the CG task cleans up the build directory (#20500)
### Description
<!-- Describe your changes. -->
Update order of steps


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CI
2024-04-29 16:02:53 +10:00
zz002
50e41984b0
[VitisAI] Solve the problem that gsl cannot be found when compiling under linux (#20466)
### Description
<!-- Describe your changes. -->

[VitisAI] Solve the problem that gsl cannot be found when compiling
under linux

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-04-28 20:56:16 -07:00
pengwa
f31486c8b7
Disable test_aten_conv_bf16 to unblock amd ci (#20499)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-29 11:38:40 +08:00
pengwa
b789677e65
Fix Onnxruntime-TVM CI (#20498)
### Description

```
  tvm_execution_provider.cc
  denormal.cc
D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5): error C2660: 'onnxruntime::GraphViewerToProto': function does not take 4 arguments [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_providers_tvm.vcxproj]
  D:\a\onnxruntime\onnxruntime\onnxruntime\core\graph\graph_proto_serializer.h(10,6):
  see declaration of 'onnxruntime::GraphViewerToProto'
  D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5):
  while trying to match the argument list '(const onnxruntime::GraphViewer, onnx::GraphProto, bool, bool)'
  
  cpuid_uarch.cc
  get_execution_providers.cc
  abi_session_options.cc
  bias_dropout_fusion.cc
  if.cc
```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-29 11:03:57 +08:00
Satya Kumar Jandhyala
736cbb3925
[JS/WebGU] Support fp16 in Attention by performing the computation in fp32. (#20486)
### Description
Perform computation in fp32 and convert finally to fp16.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-27 08:30:26 -07:00
Rachel Guo
ff505b9f44
Follow up fix for #20472 (#20484)
### Description
<!-- Describe your changes. -->

Error: 

**Artifact name input: e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss)
##[error]Artifact name is not valid:
e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss). It cannot contain '\', /',
"', ':', '<', '>', '|', '*', and '?'**

Date not correctly showing up in the artifact name. Use predefined
pipeline variable BuildNumber instead which also serves similarly as a
timestamp.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

RN CI failure

---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2024-04-27 13:42:24 +10:00
Scott McKay
321c1e5730
Use flatbuffers::String::str instead of c_str. (#20487)
### Description
<!-- Describe your changes. -->
flatbuffers::String::c_str returns a pointer that may not be null
terminated.

This causes a warning when building on an A100 with gcc 11. Not clear
why other builds with gcc 11 (e.g. Ubuntu 22.04 WSL) don't generate a
warning. Either way it's safer to use str() as that constructs a
std::string with data() and size().

Unclear if this is an issue in reality as it's reading from the
flatbuffer and most likely didn't write out an empty string in order to
save space. There's no perf need to use c_str instead of str, and in
LOAD_STR_FROM_ORT_FORMAT we need to convert the return value to a
std::string anyway.

```c++
struct String : public Vector<char> {
  const char *c_str() const { return reinterpret_cast<const char *>(Data()); }
  std::string str() const { return std::string(c_str(), size()); }
```

```
    inlined from ‘onnxruntime::common::Status onnxruntime::fbs::utils::LoadAttributeOrtFormat(const onnxruntime::fbs::Attribute&, onnx::AttributeProto&, std::unique_ptr<onnxruntime::Graph>&, onnxruntime::Graph&, onnxruntime::Node&, const onnxruntime::OrtFormatLoadOptions&, const onnxruntime::logging::Logger&)’ at /frdong_data/onnxruntime/onnxruntime/core/graph/graph_flatbuffers_utils.cc:385:3:
/usr/include/c++/11/bits/char_traits.h:399:32: error: ‘long unsigned int __builtin_strlen(const char*)’ reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
```

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix build error on A100
2024-04-27 13:41:38 +10:00
Maximilian Müller
a0775d74a1
Fix: Shared lib tests fail during build for CUDA,TRT,DML (#20453)
The order of defines for these test have to be in the same order. If we
check for TRT -> CUDA ->DML wen cannot reverse that order in later
defines as we might want to build for multiple EPs.

+@PatriceVignola
2024-04-26 20:25:24 -07:00
liqun Fu
2f5fe4500d
mlas nbit matmul requires packed_b (#20482)
### Description

mlas matmul nbits implementation requires packed b. have a condition for
this.

need to update this logic if it changes.


### Motivation and Context

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2024-04-26 17:18:53 -07:00
Johan MEJIA
619ceeed9e
[python] MinMax calibration per channel (#19285)
### Description

Following the issue #19223, introduce `per_channel` attribute in
`MinMaxCalibrater` to develop per-channel calibration.

If required, this new functionality should be implemented in the other
_Calibraters_ (`HistogramCalibrater`, `EntropyCalibrater`, ...).

### Motivation and Context
- This is the first part to solve #19223's proposal.
- If per channel calibration was allowed, the quantization algorithm
could be updated to improve quantization performance, i.e. weights
quantization per channel and not per tensor. That is why it would be
interesting to have a 'per_channel' option in any 'Calibrater' class to
produce a set of calibration vectors instead of a single scalar.
2024-04-26 12:40:49 -07:00
Wanming Lin
ddd4e8c3e3
[WebNN EP] Improve activation fusion (#20320)
- Create a common util to get supported activation set
- Fuse activation to BatchNormalization if possible
2024-04-26 08:16:55 -07:00
Rachel Guo
88904b9220
Add unique identifier to e2e_test_logs artifacts in react-native-ci.yml (#20472)
### Description
<!-- Describe your changes. -->

As title.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-26 22:20:10 +10:00
Scott McKay
b842effa29
Fix some x86 build warnings in training code (#20451)
### Description
<!-- Describe your changes. -->
Fix some misc build warnings from x86 Windows build


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-26 20:29:21 +10:00
Scott McKay
aa27dadd1c
Use download.onnxruntime.ai in podspec (#20474)
### Description
<!-- Describe your changes. -->
Update to more generic url


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-26 20:28:54 +10:00
Hector Li
d2d4639ddb
fix the build issue for Win Arm64 Release build (#20475)
### Description
Fix the build error for Win ARM64 Release build.
graph_transform_test.cc(1,1): error C1128: number of sections exceeded
object file format limit: compile with /bigobj
[D:\build\Windows\Release\onnxruntime_test_all.vcxproj]


### Motivation and Context
Fix issue: https://github.com/microsoft/onnxruntime/issues/20406
2024-04-25 22:08:19 -07:00
liqun Fu
cc26b2dac2
Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels (#20163)
### Description

```
Avx2:
Int8

NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16: 	90.96			25.15			-72%					7.65				11.71			53%
Blklen32:	90.73			48.55			-46%					7.86				14.28			81%
Blklen64:	89.49			68.84			-23%					8.30				15.78			90%
Blklen128:	87.38			78.37			-10%					7.90				16.05			103%
Blklen256:	89.45			82.36			-7%					8.30				16.56			99%

Fp32		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	91.36			105.18		15%				7.57			9.52		25%
Blklen32:	89.30			105.99			18%					7.65				9.68			26%
Blklen64:	89.53			101.41			13%					7.97				9.84			23%
Blklen128:	85.23			99.71			16%					7.86				10.39			32%
Blklen256:	88.46			97.94			10%					8.32				10.23			22%

Avx512vnni:
Int8		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	132.18			21.56			-83%					10.34				11.48			11%
Blklen32:	168.28			43.69			-74%					11.85				14.73			24%
Blklen64:	201.81			60.29			-70%					12.36				15.47			25%
Blklen128:	194.92			57.04			-71%					13.03				14.67			12%
Blklen256:	218.76			70.20			-68%					13.33				16.31			22%

Fp32		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	102.81			92.74			-9%					8.41				9.18			9%
Blklen32:	109.49			97.08			-11%					8.83				11.51			30%
Blklen64:	104.13			101.57			-2%					9.32				12.00			28%
Blklen128:	108.45			103.69			-4%					9.58				12.45			29%
Blklen256:	109.43			106.43			-2%					9.19				12.2			32%

```

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
2024-04-25 21:30:50 -07:00
Chi Lo
bbc30feb63
Make execution order an option for GraphViewerToProto() (#20411)
**Current issue:**

Once ORT gets the capability from EP's GetCapability(), it creates a
graph viewer based on the capability as below:
`viewers.push_back(std::make_unique<GraphViewer>(graph,
*cur_capability.sub_graph));` or see the code
[here](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/graph_partitioner.cc#L458).

At this point, the graph viewer has the chance to generate the wrong
order of `nodes_in_topological_order_` when calling
[Graph::ReverseDFSFrom](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_viewer.cc#L107),
so that during EP Compile(), EP might create the "wrong nodes ordering"
model proto from the graph viewer when calling
[GraphViewerToProto()](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_proto_serializer.cc#L37)
because of the `nodes_in_topological_order_`.

This is a problem for TRT EP to refit weights to the "weightless"
engine. Since the engine is built from the model proto provided by TRT
EP and the weights is in the original onnx model. The model proto and
the orignal onnx model are not the same in terms of node ordering which
makes TRT complain when refitting.

**The original model (subgraph of ResNet50):**
<img width="442" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/54722500/bb9a641d-f2f2-46c3-aebf-4084a08ff289">

**The serialized model proto generated by TRT EP:**
(The highlighted part has the wrong node order compared to the original
model.)
<img width="340" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/54722500/bbc6bf34-f960-4753-9474-a18ebc2dc48b">

**The solution 1:**
Change default comparator to `NodeCompare::operator() {return
n1->Index() > n2->Index();}`

The root cause of the different node order between original model and EP
generated model is from graph viewer [generating
](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_viewer.cc#L107)the
different `nodes_in_topological_order_`. Modifying the
`NodeCompare::operator()` for sorting can fix the problem.

The `NodeCompare::operator()` will be used in
[Graph::ReverseDFSFrom](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L1760)
where the input nodes of the current node will be
[sorted](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L1802)
based on node index.
Due to the sorted nodes will be pushed into a stack which later
determines the final topological node order in a "first in, last out"
approach, the larger node index should be pushed into the stack first.
So that we can get a topological node order aligns with smaller index
node comes first.

**The solution 2 (This PR uses this solution):**
Use priority-based BFS for topological sort in GraphViewerToProto().
2024-04-25 16:07:36 -07:00
Satya Kumar Jandhyala
21b3cbc3af
[WIP][JS/WebGPU] Inputs Key and Value could be 4-dims. (#20470)
### Description
The Key and Value inputs could be 4-dims


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-25 13:33:46 -07:00
Edward Chen
2c19db0af1
Put x64 specific benchmark code into ifdefs. (#20456) 2024-04-25 12:33:12 -07:00
Frank Dong
227c4419fc
add bf16 support for few ops (#20385)
### Description
Add bf16 support for below ops:
ConstantOfShape
Exp
Erf
convolution
PythonOp



### Motivation and Context
phimm model works on bf16, ORT need support bf16 on previous ops to work
with phimm on bf16
2024-04-25 11:28:34 -07:00
Yi Zhang
464f199b95
Extend mac package jobs time out limit (#20459) 2024-04-25 10:13:13 -07:00
Yi-Hong Lyu
edffa2a180
Optimize MlasComputeSoftmax with prefetch (#20393)
The prefetching instructions (_mm_prefetch) is used to anticipate memory
accesses by prefetching the next row of the input buffer. This
optimization is designed to reduce the impact of memory latency, thereby
enhancing the performance of the MlasComputeSoftmax function. As a
result, the worst-case performance of the OCR model has improved by
approximately 50ms, which equates to a 3% improvement.
2024-04-25 08:28:59 -07:00
Chi Lo
a077330c3e
[TensorRT] adapt for TRT lib name change after TRT 10 GA (#20445)
For TensorRT 10 GA onwards, the TensorRT libraries will have major
version appended to the end on Windows, for example, nvinfer_10.dll,
nvinfer_plugin_10.dll, nvonnxparser_10.dll ...

Change cmake file accordingly.
2024-04-24 21:46:54 -07:00
Yi Zhang
e5947f5729
Two improvements in pipelines (#20449)
### Description
1. Update the image name to avoid docker image wouldn't be overwrite.
there was an mistake that variables.CUDA_VERSION_MAJOR is always empty

14fcf0a52d/tools/ci_build/github/azure-pipelines/stages/nuget-linux-cuda-packaging-stage.yml (L120)
3. set one artifact name as variable to make the job rerunnable



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-25 10:15:40 +08:00
Xavier Dupré
218b6b0a73
Fix missing argument when calling _get_quantize_input_nodes (#20245)
### Description
The current code is calling one method with a missing argument.



### Motivation and Context
It breaks Olive's unittests.

---------

Co-authored-by: Xavier Dupré <xavier.dupre@gmail.com>
2024-04-25 00:46:48 +02:00
Yulong Wang
a5182a2ef3
[js/web] update test condition for '--force-localhost' (#20450)
### Description

Fixes the NPM packaging pipeline failure.
2024-04-24 12:14:03 -07:00
Edward Chen
9cc5badc49
Fix Objective-C static analysis warnings. (#20417)
Replace most usages of [NSString stringWithUTF8String:] with checked helper function. The issue is that the former can return nil.
2024-04-24 11:48:29 -07:00
maggie1059
dfd4bce36e
Use compute queues by default in DML EP (#20438)
### Description
We originally only use compute queues for compute-only devices; this
change sets the default for DX12 devices to use compute queues as well.



### Motivation and Context
There have been issues with TDRs occurring when using the current
default queues, which doesn't happen on compute queues.
2024-04-24 10:44:16 -07:00
Xavier Dupré
f78215adad
Fix quantization tools for issue #19529 (#19591)
### Description
Fix issue #19529, the code was using a variable loop outside a loop.
2024-04-24 19:16:27 +02:00
Scott McKay
a46bab6364
Update podspec url to use AFD hostname (#20452)
Update to use AFD url when generating podspec
2024-04-24 09:37:24 -07:00