Commit graph

9531 commits

Author SHA1 Message Date
cloudhan
6ea3908db4
Add ck's streamk and splitk gemm impl (#17280) 2023-09-04 11:49:07 +08:00
Jiajia Qin
5e747071be
[js/webgpu] Fix bug in conv2dByMatMul path (#17369)
### Description
For the conv2dByMatMul path, the output shape of the simulated matmul is
a reshape of the original conv2d output shape. We pass this information
to `createMatmulProgramInfo` so that it can handle the output correctly.
2023-09-02 00:16:28 -07:00
Tianlei Wu
e745575187
fix assert error in attention fusion script (#17375)
Add a check of num_heads and hidden_size to avoid assert error (https://github.com/microsoft/onnxruntime/issues/17254)
2023-09-01 08:18:50 -07:00
Tianlei Wu
e23f16adbf
output all parameters in the bert_perf_test tool (#17379)
Currently, some parameters are missing from the output file. This change
outputs all parameters.

Example output:

Latency(ms) | Latency_P50 | Latency_P75 | Latency_P90 | Latency_P95 | Latency_P99 | Throughput(QPS) | model | graph_optimization_level | intra_op_num_threads | batch_size | sequence_length | test_cases | test_times | use_gpu | use_io_binding | average_sequence_length | random_sequence_length
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
10.91 | 11.16 | 11.3 | 11.7 | 11.78 | 11.84 | 91.66 | model.onnx | ENABLE_ALL | 4 | 1 | 512 | 1 | 10 | TRUE | TRUE | 64 | FALSE
2023-09-01 08:17:58 -07:00
Baiju Meswani
8b98ecad70
Change RuntimeError to ImportError (#17380)
The `onnxruntime-validation` for ORTModule checks for `ImportError`:


44101e8771/onnxruntime/python/onnxruntime_validation.py (L73-L75)

If any other kind of error is raised, it does not silently fail and will
raise an exception. This causes a problem when ortmodule is explicitly
not made available on win/mac packages since we currently raise a
RuntimeError.
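
The difference between the two exception types can be illustrated with a minimal Python sketch. The names `load_training_module` and `validate_training_module` are hypothetical stand-ins, not the actual code in `onnxruntime_validation.py`; the only assumption is that the validation step catches `ImportError` and nothing else.

```python
def load_training_module(available, error_type):
    """Hypothetical stand-in for importing ortmodule on a given platform."""
    if not available:
        raise error_type("ortmodule is not available on this platform")
    return "ortmodule"


def validate_training_module(available, error_type):
    """Hypothetical validation step that tolerates only ImportError."""
    try:
        return load_training_module(available, error_type)
    except ImportError:
        # A missing module is treated as "training support not installed".
        return None


# ImportError is swallowed, so validation completes quietly.
print(validate_training_module(False, ImportError))

# RuntimeError is not matched by `except ImportError` and propagates.
try:
    validate_training_module(False, RuntimeError)
except RuntimeError as exc:
    print("propagated:", exc)
```

Raising `ImportError` on platforms where ortmodule is intentionally absent lets a check of this shape fail silently, which is the behaviour the change relies on.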

Resolves issue:
https://github.com/microsoft/onnxruntime-training-examples/issues/161
2023-09-01 09:56:40 +08:00
Rachel Guo
16cfcd0590
Fix NNAPI optional input handling checks and unblock Android CI pipeline test failures (#17358)
### Description

- Fix missing optional input checks, originally reported in a GitHub
issue about no shape on the Resize op.
- Exclude Antialias support for Opset 18 + Resize for NNAPI
- Unblock Android CI pipeline test failures.

### Motivation and Context
Bug fixes.

Issue:
https://github.com/microsoft/onnxruntime/issues/17035

thanks @skottmckay for pointing out the cause.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2023-08-31 16:40:22 -07:00
aciddelgado
44101e8771
Flash Attention v2 MHA (#17227)
### Description
Integrate Flash Attention V2 to PackedMultiHeadAttention,
MultiHeadAttention and Attention operators.

Flash Attention v2 source code is from
https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src.
We changed it to remove the dependency on Torch, then removed the
backward and bfloat16-related code.

Add benchmark script (see benchmark_mha.sh) to compare different
attention kernels for MultiHeadAttention operator.

Current limitations for Flash Attention in PackedMultiHeadAttention,
MultiHeadAttention and Attention operators:
* Relative Position Bias is not supported
* Different hidden size for Q and V is not supported
* Only float16 is supported
* Padding/attention mask is not supported
* For MultiHeadAttention, when there is past or present input, bias
shall be provided to activate flash attention
* For Attention, past or present inputs will deactivate flash attention
* Causal is not supported

Some limitations (like attention mask and causal) might be removed
later.

Currently, Flash Attention v2 only works in Linux. For Windows, we will
enable later with Cutlass 3.2.

Two environment variables can be used for testing purposes:
(1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default
value is 0 (enable). Set it to "1" to disable it.
(2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is
"513", which means that we only enable flash attention when sequence
length is larger than 512 for packed QKV format. Set it to "0" if you
want to use flash attention v2 whenever possible.
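
The interaction of the two variables can be sketched in Python as follows. This is only an illustration of the rule described above for the packed QKV format, not ONNX Runtime's implementation; the function name `flash_attention_allowed` is hypothetical, and the other limitations listed earlier (data type, masks, and so on) are not modelled.

```python
import os


def flash_attention_allowed(sequence_length, env=None):
    """Hypothetical helper mirroring the described rule for packed QKV."""
    env = os.environ if env is None else env
    # "1" disables flash attention entirely; the default "0" leaves it enabled.
    if env.get("ORT_DISABLE_FLASH_ATTENTION", "0") == "1":
        return False
    # Default "513": only sequence lengths larger than 512 qualify.
    min_seq_len = int(env.get("ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV", "513"))
    return sequence_length >= min_seq_len


print(flash_attention_allowed(512, {}))
print(flash_attention_allowed(513, {}))
print(flash_attention_allowed(128, {"ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV": "0"}))
print(flash_attention_allowed(4096, {"ORT_DISABLE_FLASH_ATTENTION": "1"}))
```

In a real test run the variables would be set in the process environment before the inference session is created.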

### Speedup

The following result is from Standard_ND96amsr_A100_v4 VM
(A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per
second for MultiHeadAttention operator.

There are 3 input formats:
* `Q,K,V` means separated inputs query, key and value of BxSxNH
* `Q,KV` means packed KV, where key is 5D: BxSxNx2xH
* `QKV` means packed QKV, where query is 5D: BxSxNx3xH
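
As a rough illustration of the three layouts, the sketch below returns the tensor shapes for each format, with B = batch size, S = sequence length, N = number of heads and H = head size. The helper `mha_input_shapes` is hypothetical, and the query shape for the `Q,KV` format is an assumption (the text only states the shape of the packed key).

```python
def mha_input_shapes(input_format, B, S, N, H):
    """Hypothetical helper: input shapes for each MultiHeadAttention format."""
    if input_format == "Q,K,V":
        # Separate query, key and value, each BxSxNH.
        shape = (B, S, N * H)
        return {"query": shape, "key": shape, "value": shape}
    if input_format == "Q,KV":
        # Packed KV: key is 5D. The query shape here is an assumption.
        return {"query": (B, S, N * H), "key": (B, S, N, 2, H)}
    if input_format == "QKV":
        # Packed QKV: query is 5D.
        return {"query": (B, S, N, 3, H)}
    raise ValueError(f"unknown input format: {input_format}")


print(mha_input_shapes("QKV", B=32, S=512, N=64, H=32))
```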

Note that flash attention cannot use the packed QKV format directly, so
an extra Transpose is needed. We found that the TensorRT kernel is faster
for sequence length <= 512 with packed QKV. The reason might be that the
TensorRT kernel needs no transpose in this format.

We also noticed that the TensorRT kernel is faster for a stable diffusion
512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while
flash attention v2 is faster for a 1024x1024 image (see seq_len=16384,
heads=8, head_dim=40 below).

input format | batch size | sequence length | heads | head dim | flash_v2 (TFLOPs/s) | TensorRT (TFLOPs/s) | Memory Efficient Attention (TFLOPs/s)
-- | -- | -- | -- | -- | -- | -- | --
Q,K,V | 32 | 512 | 64 | 32 | 78.1 | 60.0 | 39.3
Q,K,V | 32 | 512 | 128 | 16 | 46.8 | 44.1 | 21.7
Q,K,V | 16 | 1024 | 64 | 32 | 99.0 | 72.8 | 44.3
Q,K,V | 16 | 1024 | 128 | 16 | 54.7 | 49.2 | 23.4
Q,K,V | 8 | 2048 | 64 | 32 | 113.8 | 81.2 | 47.8
Q,K,V | 8 | 2048 | 128 | 16 | 59.7 | 51.9 | 24.7
Q,K,V | 4 | 4096 | 64 | 32 | 122.5 | 85.6 | 49.7
Q,K,V | 4 | 4096 | 128 | 16 | 62.5 | 53.3 | 25.3
Q,K,V | 2 | 8192 | 64 | 32 | 127.4 | 87.5 | 50.7
Q,K,V | 2 | 8192 | 128 | 16 | 64.0 | 54.2 | 25.6
Q,K,V | 1 | 16384 | 64 | 32 | 129.5 | 91.0 | 51.2
Q,K,V | 1 | 16384 | 128 | 16 | 64.7 | 54.5 | 25.8
Q,K,V | 1 | 4096 | 8 | 40 | 51.0 | 43.6 | 36.8
Q,K,V | 1 | 4096 | 8 | 80 | 97.7 | 77.0 | 55.5
Q,K,V | 1 | 4096 | 8 | 160 | 120.0 | 39.7 | 57.8
Q,K,V | 4 | 4096 | 8 | 40 | 89.0 | 84.4 | 49.2
Q,K,V | 4 | 4096 | 8 | 80 | 133.0 | 92.2 | 63.2
Q,K,V | 4 | 4096 | 8 | 160 | 164.8 | 42.7 | 63.8
Q,K,V | 1 | 16384 | 8 | 40 | 96.9 | 91.3 | 52.1
Q,K,V | 1 | 16384 | 8 | 80 | 142.9 | 101.5 | 65.6
Q,K,V | 1 | 16384 | 8 | 160 | 177.4 | 44.2 | 65.7
Q,K,V | 128 | 128 | 12 | 64 | 29.0 | 26.9 | 25.7
Q,K,V | 64 | 128 | 12 | 64 | 23.1 | 10.8 | 21.3
Q,K,V | 128 | 384 | 12 | 64 | 83.5 | 60.8 | 55.7
Q,K,V | 64 | 384 | 12 | 64 | 72.6 | 40.5 | 52.8
Q,K,V | 128 | 512 | 12 | 64 | 98.9 | 77.9 | 62.1
Q,K,V | 64 | 512 | 12 | 64 | 94.7 | 75.6 | 60.4
Q,KV | 32 | 512 | 64 | 32 | 85.9 | 41.1 | 41.1
Q,KV | 32 | 512 | 128 | 16 | 47.1 | 21.6 | 21.6
Q,KV | 16 | 1024 | 64 | 32 | 104.4 | 45.8 | 45.8
Q,KV | 16 | 1024 | 128 | 16 | 54.7 | 23.6 | 23.6
Q,KV | 8 | 2048 | 64 | 32 | 116.8 | 48.5 | 48.5
Q,KV | 8 | 2048 | 128 | 16 | 59.8 | 24.7 | 24.7
Q,KV | 4 | 4096 | 64 | 32 | 124.2 | 50.1 | 50.1
Q,KV | 4 | 4096 | 128 | 16 | 62.6 | 25.3 | 25.3
Q,KV | 2 | 8192 | 64 | 32 | 128.5 | 50.8 | 50.9
Q,KV | 2 | 8192 | 128 | 16 | 64.1 | 25.6 | 25.6
Q,KV | 1 | 16384 | 64 | 32 | 129.4 | 51.2 | 51.2
Q,KV | 1 | 16384 | 128 | 16 | 64.8 | 25.8 | 25.8
Q,KV | 1 | 4096 | 8 | 40 | 67.5 | 37.7 | 37.5
Q,KV | 1 | 4096 | 8 | 80 | 101.3 | 56.7 | 56.6
Q,KV | 1 | 4096 | 8 | 160 | 124.0 | 58.6 | 58.6
Q,KV | 4 | 4096 | 8 | 40 | 90.8 | 49.8 | 49.8
Q,KV | 4 | 4096 | 8 | 80 | 135.6 | 63.8 | 63.8
Q,KV | 4 | 4096 | 8 | 160 | 166.3 | 64.5 | 64.5
Q,KV | 1 | 16384 | 8 | 40 | 97.5 | 52.3 | 52.3
Q,KV | 1 | 16384 | 8 | 80 | 143.5 | 65.9 | 65.8
Q,KV | 1 | 16384 | 8 | 160 | 178.4 | 65.9 | 65.8
Q,KV | 128 | 128 | 12 | 64 | 26.8 | 48.1 | 30.9
Q,KV | 64 | 128 | 12 | 64 | 28.0 | 38.9 | 25.0
Q,KV | 128 | 384 | 12 | 64 | 97.7 | 61.1 | 61.0
Q,KV | 64 | 384 | 12 | 64 | 89.5 | 57.8 | 57.9
Q,KV | 128 | 512 | 12 | 64 | 111.9 | 66.7 | 66.9
Q,KV | 64 | 512 | 12 | 64 | 107.2 | 64.9 | 64.8
QKV | 32 | 512 | 64 | 32 | 77.2 | 84.7 | 39.3
QKV | 32 | 512 | 128 | 16 | 43.4 | 53.1 | 20.9
QKV | 16 | 1024 | 64 | 32 | 98.8 | 87.4 | 44.6
QKV | 16 | 1024 | 128 | 16 | 52.0 | 54.1 | 23.2
QKV | 8 | 2048 | 64 | 32 | 113.1 | 89.0 | 47.9
QKV | 8 | 2048 | 128 | 16 | 58.2 | 54.6 | 24.5
QKV | 4 | 4096 | 64 | 32 | 120.6 | 89.7 | 49.7
QKV | 4 | 4096 | 128 | 16 | 61.7 | 54.6 | 25.2
QKV | 2 | 8192 | 64 | 32 | 125.9 | 89.5 | 50.7
QKV | 2 | 8192 | 128 | 16 | 63.6 | 54.8 | 25.5
QKV | 1 | 16384 | 64 | 32 | 128.5 | 92.0 | 51.2
QKV | 1 | 16384 | 128 | 16 | 64.6 | 54.8 | 25.7
QKV | 1 | 4096 | 8 | 40 | 60.2 | **69.8** | 38.1
QKV | 1 | 4096 | 8 | 80 | 101.6 | 75.2 | 56.7
QKV | 1 | 4096 | 8 | 160 | 130.2 | 41.2 | 58.4
QKV | 4 | 4096 | 8 | 40 | 90.6 | **91.0** | 49.5
QKV | 4 | 4096 | 8 | 80 | 133.6 | 98.1 | 62.8
QKV | 4 | 4096 | 8 | 160 | 165.3 | 43.7 | 63.9
QKV | 1 | 16384 | 8 | 40 | 97.2 | 92.8 | 52.1
QKV | 1 | 16384 | 8 | 80 | 143.0 | 103.1 | 65.6
QKV | 1 | 16384 | 8 | 160 | 177.6 | 44.5 | 65.7
QKV | 128 | 128 | 12 | 64 | 31.1 | 65.9 | 27.6
QKV | 64 | 128 | 12 | 64 | 26.1 | 49.8 | 23.5
QKV | 128 | 384 | 12 | 64 | 84.6 | 88.5 | 56.1
QKV | 64 | 384 | 12 | 64 | 79.1 | 80.3 | 53.5
QKV | 128 | 512 | 12 | 64 | 97.3 | 114.2 | 62.2
QKV | 64 | 512 | 12 | 64 | 95.9 | 110.7 | 60.6
QKV | 4 | 2048 | 32 | 128 | 125.26 | 44.72 | 78.15
QKV | 4 | 4096 | 32 | 128 | 141.62 | 46.29 | 85.84
QKV | 8 | 2048 | 32 | 128 | 127.40 | 45.49 | 78.75
QKV | 8 | 4096 | 32 | 128 | 144.24 | 46.60 | 86.95

### Known Issues

NVCC uses a huge amount of memory while compiling the flash attention
CUDA kernels. A Linux build with CUDA might fail when the machine has
limited memory but a large number of CPUs. The workaround is to use a
build machine with more memory, or to pass an argument like
`--nvcc_threads 1` to limit nvcc threads in the build.

### Motivation and Context
Increases speed and efficiency of MHA or Packed MHA.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-08-31 13:52:21 -07:00
Rachel Guo
b54619509f
Refine build script for adding disable selected data types option (#17284)
### Description

As title. 

### Motivation and Context

We now have multiple data types that we want to disable for minimal
builds to reduce binary size. It may be worth adding an argument to the
build script for specifying them.

Also, for fp16, it may be too restrictive to disable it for all minimal
builds.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2023-08-31 13:32:55 -07:00
Chi Lo
30a450dcf8
Fix TRT EP's cuda graph feature (#17355)
When users run inference with CUDA graph enabled from multiple threads,
only the main thread that creates the inference session successfully
initializes a CUDA graph instance. Other threads that call the inference
run directly hit a segfault because the allocation/initialization of the
CUDA graph instance is never called for them.
This PR fixes this issue.
2023-08-31 13:32:15 -07:00
Changming Sun
ae90b716ff
Change _mm512_setzero to _mm512_setzero_ps (#17362)
### Description
_mm512_setzero is just an alias of _mm512_setzero_ps, and it is a wrong
one.
See: https://gcc.gnu.org/legacy-ml/gcc-patches/2018-05/msg00338.html
And
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avx512fintrin.h
2023-08-31 13:11:44 -07:00
Wanming Lin
3a53836836
[WebNN EP] Fix compilation with newer flatbuffers (#17367) 2023-08-31 10:22:15 -07:00
Jian Chen
e60493525f
[js/webgpu] Adding support for abs with int32 type (#17359)
2023-08-31 08:13:54 -07:00
Jiajia Qin
352b745deb
[js/webgpu] Add input/output shapes information to profiling (#17342)
### Description
This PR is to enhance the profiling information.
With the PR, the profiling result is like below:
```
[profiling] kernel "[Split] 51288384" input[0]: 1,256,64,64, output[0]: 1,256,64,64, execution time: 37135 ns
program-manager.ts:114 
[profiling] kernel "[Concat] 52361040" input[0]: 1,256,64,64, output[0]: 1,256,64,64, execution time: 50833 ns
program-manager.ts:114 
[profiling] kernel "[Transpose] 52375264" input[0]: 1,256,64,64, output[0]: 1,64,64,256, execution time: 99791 ns
program-manager.ts:114 
[profiling] kernel "[Sub] 51098472" input[0]: , input[1]: 1, output[0]: 1, execution time: 7448 ns
program-manager.ts:114 
[profiling] kernel "[Mul] 51344440" input[0]: 1, input[1]: 1,256,1,1, output[0]: 1,256,1,1, execution time: 8334 ns
```
Without this PR, the profiling result is like below:
```
[profiling] kernel "52097928|[Split] 52097928" execution time: 37760 ns
program-manager.ts:105 
[profiling] kernel "41898328|[Concat] 41898328" execution time: 51666 ns
program-manager.ts:105 
[profiling] kernel "41915648|[Transpose] 41915648" execution time: 95416 ns
program-manager.ts:105 
[profiling] kernel "49757856|[Sub] 49757856" execution time: 7969 ns
program-manager.ts:105 
[profiling] kernel "51680504|[Mul] 51680504" execution time: 8906 ns
```
With the new information, we can easily see which op shapes perform
poorly. It can also help us check whether ops with very small shapes run
on the GPU.
2023-08-31 08:12:28 -07:00
Changming Sun
bbf28f09f2
Fix a build warning: a constexpr function calls a non-constexpr function (#17363)
### Description
The warning is:
```
/onnxruntime_src/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1202:41: error: call to non-‘constexpr’ function ‘bool onnx_transpose_optimization::TransposeQuantizeDequantizeAxis(const onnx_transpose_optimization::api::GraphRef&, const std::vector<long int>&, onnx_transpose_optimization::api::NodeRef&)’
   return TransposeQuantizeDequantizeAxis(graph, perm, node);
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```
The function TransposeQuantizeDequantizeAxis is not constexpr. 
2023-08-31 06:56:27 -07:00
pengwa
58af36b49a
Fuse ScaledSum and its backward BatchScale (#16517)
### Fuse ScaledSum and its backward BatchScale

For deberta models, there is a pattern

a / scalar_0 + b / scalar_1 + c / scalar_2

We can fuse this into a ScaledSum operator that takes 2 (or 3) inputs
and 2 (or 3) scalar attributes and generates one output.

For the backward, the gradient of a, b and c will be computed with
BatchScale.
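
The fused pattern and its gradient can be sketched in plain Python. This is a sketch under stated assumptions, not the operator implementation: `scaled_sum` and `batch_scale` are hypothetical reference functions over flat lists, and the backward is assumed to divide the incoming gradient by each scalar, which is what differentiating `a / scalar_0 + b / scalar_1 + c / scalar_2` gives.

```python
def scaled_sum(inputs, scalars):
    """Forward: out[i] = sum over k of inputs[k][i] / scalars[k]."""
    length = len(inputs[0])
    return [sum(x[i] / s for x, s in zip(inputs, scalars)) for i in range(length)]


def batch_scale(grad_output, scalars):
    """Backward: one gradient per input, each equal to grad_output / scalar."""
    return [[g / s for g in grad_output] for s in scalars]


a, b = [2.0, 4.0], [3.0, 6.0]
print(scaled_sum([a, b], [2.0, 3.0]))
print(batch_scale([1.0, 2.0], [2.0, 4.0]))
```

Computing all input gradients from the same `grad_output` in one pass is what makes a single backward operator sufficient.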

### Benchmark on 8x32GV100

```bash
torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_mlm.py  --model_name_or_path microsoft/deberta-v3-large --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 400 --logging_steps 1 --use_module_with_loss --deepspeed aml_ds_config_zero_1.json --per_device_train_batch_size 10
```

#### Main Branch

```
Total overhead: 127954ms where export takes 116489ms.
  epoch                    =      14.29
  train_loss               =     4.9803
  train_runtime            = 0:10:27.29
  train_samples            =       2223
  train_samples_per_second =     51.013
  train_steps_per_second   =      0.638


throughput per GPU = 14.29* 2223/ (627.29 - 127.954) / 8 (gpu) = 7.952 samples/second
```

#### This PR

```
Total overhead: 128761ms where export takes 118510ms.
***** train metrics *****
  epoch                    =      14.29
  train_loss               =     4.6144
  train_runtime            = 0:10:04.31
  train_samples            =       2223
  train_samples_per_second =     52.953
  train_steps_per_second   =      0.662

throughput per GPU = 14.29*2223 / (604.31 - 128.761) / 8 = 8.350 samples/second
```

About a 5% performance gain (7.952 -> 8.350 samples/second per GPU).
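
The per-GPU throughput figures can be reproduced with a few lines of Python. The formula is the one written in the two logs (epochs times samples, divided by runtime minus overhead, divided by the number of GPUs); the function name `throughput_per_gpu` is only illustrative, and runtimes are converted to seconds (0:10:27.29 = 627.29 s, 0:10:04.31 = 604.31 s).

```python
def throughput_per_gpu(epochs, samples, runtime_s, overhead_s, gpus):
    """Samples per second per GPU, excluding the one-off overhead."""
    return epochs * samples / (runtime_s - overhead_s) / gpus


main_branch = throughput_per_gpu(14.29, 2223, 627.29, 127.954, 8)
this_pr = throughput_per_gpu(14.29, 2223, 604.31, 128.761, 8)
gain_percent = (this_pr / main_branch - 1.0) * 100.0

print(f"main branch: {main_branch:.3f} samples/second")
print(f"this PR:     {this_pr:.3f} samples/second")
print(f"gain:        {gain_percent:.1f}%")
```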
2023-08-31 14:55:27 +08:00
Tianlei Wu
c11ed065ba
Fix SkipLayerNorm fusion in transformer optimizer (#17320)
### Description
Fix issues:
(1) When the output of the Add before the LayerNormalization node is a
graph output, SkipLayerNormalization shall output it, but currently it
does not.
(2) When there is a Cast before the Add bias, the Cast output (instead of
its input) shall be used as the SkipLayerNormalization input.
(3) The skip input is not the second input of the fused node. According
to the op spec, skip shall be the second input. This could cause issues
when we add skip broadcasting support later.

### Motivation and Context

Fusion for Clip model of SDXL failed since the last hidden state is a
graph output.
2023-08-30 21:12:18 -07:00
Yi Zhang
507a40e1e9
Add compiler cache in Linux GPU TensorRT CI. (#17348)
### Description
Add the compiler cache in the Linux GPU TensorRT CI.
This saves about 30 minutes on the GPU machine (52 minutes -> 24 minutes).

PS.
There are only white-space differences in the dockerfile.

2023-08-31 08:13:26 +08:00
Hector Li
47fe7fe900
Enable QDQ node unit support for Log op (#17354)
### Description
Enable QDQ node unit support for Log op
2023-08-30 16:23:33 -07:00
cao lei
64f06d0b4a
only Flush once for the same stream in copyInputAcrossDevice() (#17303)
### Description
In the CopyInputAcrossDevice() function, we assign each feed a stream to
copy across devices. Once the copy is done, each stream triggers the
Flush() function, which is undesired: the same stream should be flushed
only once.
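
The idea of flushing each distinct stream only once can be sketched in Python. This is not the C++ code of the change; `FakeStream` and `flush_each_stream_once` are hypothetical names, and streams are assumed to be shared objects that several feeds may point to.

```python
class FakeStream:
    """Hypothetical stand-in for a device stream that counts its flushes."""

    def __init__(self, name):
        self.name = name
        self.flush_count = 0

    def flush(self):
        self.flush_count += 1


def flush_each_stream_once(feed_streams):
    """Flush every distinct stream once, however many feeds share it."""
    seen = set()
    for stream in feed_streams:
        if stream is None or id(stream) in seen:
            continue  # no stream for this feed, or already flushed
        seen.add(id(stream))
        stream.flush()


copy_stream, other_stream = FakeStream("copy"), FakeStream("other")
# Many feeds share the same stream, as in a subgraph with many input feeds.
flush_each_stream_once([copy_stream, copy_stream, other_stream, None, copy_stream])
print(copy_stream.flush_count, other_stream.flush_count)
```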


### Motivation and Context
This change is to address a perf issue of TLNGv4 inference which
contains subgraph with many input feeds.
2023-08-30 16:10:26 -07:00
Wanming Lin
70e8c23944
[WebNN EP] Fix bug in interpreting ONNX's pads into WebNN's padding (#17325)
ONNX's pads are [beginning_height, beginning_width, ending_height,
ending_width], while WebNN's padding is [beginning_height,
ending_height, beginning_width, ending_width]. We should permute
ONNX's pads with the order [0, 2, 1, 3] for WebNN.
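
A minimal Python sketch of this permutation for the 2-D case is shown below. The function name `onnx_pads_to_webnn_padding` is hypothetical and the code is illustrative only; the actual fix lives in the WebNN EP.

```python
def onnx_pads_to_webnn_padding(pads):
    """Reorder [begin_h, begin_w, end_h, end_w] to [begin_h, end_h, begin_w, end_w]."""
    if len(pads) != 4:
        raise ValueError("expected 4 pad values for a 2-D operator")
    order = [0, 2, 1, 3]
    return [pads[i] for i in order]


# begin_h=1, begin_w=2, end_h=3, end_w=4
print(onnx_pads_to_webnn_padding([1, 2, 3, 4]))
```

Passing the pads through unchanged only works when the padding is symmetric in a way that hides the swap, which is why asymmetric pads expose the bug.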
2023-08-30 13:15:39 -07:00
RandySheriffH
6c39641ea2
Fix a memleak in RunAsync python (#17326)
Release ort value outputs that are created and released from
ort::run(...).

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-30 12:54:17 -07:00
Jian Chen
081c0692a4
Update to nodejs version from 16 to 18.17.1 (#17351)
### Description
Update the Node.js version from 16 to 18.17.1



### Motivation and Context
Node.js 16 will reach EOL in September 2023
2023-08-30 12:41:48 -07:00
Nat Kershaw (MSFT)
2da08c477a
Add website publish placeholder (#17318) 2023-08-30 11:01:54 -07:00
Changming Sun
71da0824f3
Upgrade binskim and fix an error in nuget packaging pipeline (#17340)
### Description
Upgrade binskim and fix an error in nuget packaging pipeline.
2023-08-30 07:52:06 -07:00
Adrian Lizarraga
21ae86e405
[QNN EP] Fix test zero-point calculation and flaky MatMul test (#17338)
### Description
- Fix incorrect zero-point calculation in unit tests. Affects int8(signed) QDQ models.
- Replace flaky MatMul test that occasionally fails on main branch with a version that uses explicit inputs.

### Motivation and Context
Fix bug and improve test accuracy and stability.
2023-08-29 23:16:57 -07:00
Jian Chen
922629aad8
Upgrade Centos7 to Alamlinux8 (#16907)
### Motivation and Context
Get the latest gcc 12 by default

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
2023-08-29 21:05:36 -07:00
Tianlei Wu
c961f67b5e
Handle dtype attribute in float16 conversion script (#17321)
Some operators have dtype attribute (search `dtype` in
https://github.com/onnx/onnx/blob/main/docs/Operators.md).
This change makes sure the dtype attribute is handled correctly in
float16 conversion.
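
One plausible reading of "handled correctly" is that a `dtype` attribute requesting float32 is switched to float16 during conversion. The sketch below illustrates that reading with plain dictionaries instead of ONNX protobuf objects; the function `convert_dtype_attributes` and the node layout are hypothetical and may differ from what the script actually does. The enum values 1 and 10 are `TensorProto.FLOAT` and `TensorProto.FLOAT16` in ONNX.

```python
FLOAT = 1     # onnx.TensorProto.FLOAT
FLOAT16 = 10  # onnx.TensorProto.FLOAT16


def convert_dtype_attributes(nodes):
    """Hypothetical pass: rewrite float32 `dtype` attributes to float16 in place."""
    for node in nodes:
        attributes = node.get("attributes", {})
        if attributes.get("dtype") == FLOAT:
            attributes["dtype"] = FLOAT16
    return nodes


nodes = [
    {"op_type": "RandomNormalLike", "attributes": {"dtype": FLOAT}},
    {"op_type": "EyeLike", "attributes": {"dtype": 7}},  # 7 is INT64, left untouched
    {"op_type": "Add"},
]
print(convert_dtype_attributes(nodes))
```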
2023-08-29 18:41:56 -07:00
Adam Louly
8224891236
add logits option to generate artifacts (#17276)
### Description

Add the ability to export logits as an output for the train and eval
graphs in generate_artifacts.
It remains optional.
2023-08-29 16:55:31 -07:00
cloudhan
f3682eee3b
Fix log color, otherwise, the immediate line followed by the colored log will be tainted (#17329) 2023-08-30 07:46:04 +08:00
Ryan Hill
c438360c1e
Noticed a simple simplification in beam_search_topk (#17275)
### Description
There was an Init() method that does exactly what the lines I replaced
did, so I switched to it.

### Motivation and Context
Simpler with no drawbacks.
2023-08-29 15:17:33 -07:00
Yi Zhang
d4a61ac71f
Pr trggiers generated by code (#17247)
### Description
1. Refactor the trigger rules generation.
2. Skip all doc changes in PR pipelines.


### Motivation and Context 
Make all trigger rules generated by running set-trigger-rules.py to
reduce inconsistencies.
It's easy to make mistakes when copying and pasting manually.

For example, these 2 excludes are different. Why?

4e6cec4d09/tools/ci_build/github/azure-pipelines/linux-ci-pipeline.yml (L16-L18)


4e6cec4d09/tools/ci_build/github/azure-pipelines/linux-gpu-ci-pipeline.yml (L27-L29)


### Note
All changes in workflow yamls are generated by code.
Please review the **skip-js.yml, skip-docs.yml and
set-trigger-rules.py**.

@fs-eire, please double check the filter rules in skip-js.yml and the
skipped workflows.

7023c2edff/tools/ci_build/set-trigger-rules.py (L14-L41)
2023-08-30 05:57:03 +08:00
AtanasDimitrovQC
fd0917b27b
Propagate noop_with_empty_axes in reduce operators. (#16845) 2023-08-29 14:15:03 -07:00
kushalpatil07
7b92057376
EvalStep called with wrong inputs onnxruntime_training_cxx_inline.h (#17331) 2023-08-29 14:14:35 -07:00
Yulong Wang
e5ca3f3dcb
[js/api] introducing IO binding for tensor (#16452)

### Description
This PR adds a few properties, methods and factories to the Tensor type
to support the IO-binding feature. This will allow users to create
tensors from GPU/CPU bound data without forcing a transfer of data
between CPU and GPU.

This change is a way to resolve #15312

### Change Summary
1. Add properties to `Tensor` type:
a. `location`: indicates where the data resides. Valid values are
`cpu`, `cpu-pinned`, `texture`, `gpu-buffer`.
b. `texture`: sits alongside `data`; a readonly property of
`WebGLTexture` type, available only when `location === 'texture'`.
c. `gpuBuffer`: sits alongside `data`; a readonly property of
`GPUBuffer` type, available only when `location === 'gpu-buffer'`.

2. Add methods to `Tensor` type (usually dealing with inference
outputs):
- async function `getData()` allows user to download data from GPU to
CPU manually.
- function `dispose()` allows user to release GPU resources manually.

3. Add factories for creating `Tensor` instances:
    a. `fromTexture()` to create a WebGL texture bound tensor data
    b. `fromGpuBuffer()` to create a WebGPUBuffer bound tensor data
    c. `fromPinnedBuffer()` to create a tensor using a CPU pinned buffer

### Examples:

Create tensors from a texture and pass them to the inference session as inputs:
```js
// when creating the session, specify that we prefer 'image_output:0' to be stored on GPU as a texture
const session = await InferenceSession.create('./my_model.onnx', {
  executionProviders: [ 'webgl' ],
  preferredOutputLocation: { 'image_output:0': 'texture' }
});

// ...

const myImageTexture = getTexture(); // user's function to get a texture
const myFeeds = { input0: Tensor.fromTexture(myImageTexture, { width: 224, height: 224 }) }; // shape [1, 224, 224, 4], RGBA format.
const results = await session.run(myFeeds);
const myOutputTexture = results['image_output:0'].texture;
```
2023-08-29 12:58:26 -07:00
Chen Fu
8827363fd2
Bugfixes: dangling pointers and python property typo (#17285)
### Description
Bug fixes


### Motivation and Context
Fixing one dangling pointer, and one python property name typo
2023-08-29 12:50:15 -07:00
Jiajia Qin
fffefb1c22
[js/webgpu] Optimize matmul (#16969)
### Description
Changes in this PR:
1) use the optimized version `makeMatMulPacked[Vec4]Source` to support
matmul.
2) enable the conv2dByMatMul path.
3) support broadcast
4) use IndicesHelper.

MatMul with M = 512, K = 512, N = 512 drops from 15ms to 2ms with
profilingMode enabled on my ADL.
2023-08-29 12:40:57 -07:00
Patrice Vignola
4880f1da46
Fix attention fusion for UNet onnx model export when using LoRA weights (#17249)
### Description
Tested with stable diffusion unet models exported by both pytorch 2.1.0
(nightly) and pytorch 1.13.1, with and without LoRA weights.



### Motivation and Context
LoRA weights modify the unet model by adding matmul and scale
operations to every q/k/v/out tensor, which breaks the current MHA
pattern recognition.
2023-08-29 11:59:30 -07:00
Hector Li
761c4333b5
[QNN EP] GridSample op support (#17317)
### Description
QNN EP GridSample op support
2023-08-29 11:41:59 -07:00
Hector Li
742b192a34
[QNN EP] Enable GlobalMaxPool op (#17304)
### Description
[QNN EP] Enable GlobalMaxPool op
2023-08-29 11:25:34 -07:00
Artem Shilkin
6e60dba726
Fix compilation with newer flatbuffers (#17164)
flatbuffers@v23.5.9 broke the forward declaration of
FlatBufferBuilder. Trying to compile onnxruntime fails with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
                                     ^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
    class FlatBufferBuilder;
```
This PR removes these declarations and uses includes instead.
2023-08-29 10:28:26 -07:00
Yi Zhang
0e9e9b2a67
Fix one exception in post merge (#17327)
2023-08-29 19:24:50 +08:00
Baiju Meswani
5d2c57363f
Sign CUDA Kernel (#17293) 2023-08-28 21:03:58 -07:00
Baiju Meswani
38ea8c3931
Increase max error tolerance for ConvTransposeGrad test (#17315) 2023-08-28 17:05:40 -07:00
Tianlei Wu
ee9d046112
Fix model serialization with external data in current directory (#17311)
When the original model has external data in the current directory,
saving the optimized model raises a File not found exception while
looking for the external data file under the root directory "/". With
this fix, the current directory is searched in this case.

I manually tested an extra case and it works: the original model has
external data in the root directory ("/"), and the optimized model is
saved to the current directory.

BTW, another bug was found: when
"session.optimized_model_external_initializers_min_size_in_bytes" is set
to a large value, some tensors still point to the original external
data file. A TODO was added in the unit test for this bug. Possible
solution: load external data into memory before saving the model.
2023-08-28 16:06:04 -07:00
Caroline
228db24317
Add training API functions to WASM API (#16521)
### Description
* Created `wasm/training_api` source and header files & modified
WebAssembly CMake to include training flags
* The `wasm/training_api` files use an `OrtTrainingManager` handle which
is a struct of an OrtCheckpointState and an OrtTrainingSession, rather
than creating a CheckpointState handle & a separate TrainingSession
handle.
* This is so that the TypeScript side only has to manage one handle that
will be passed between TrainingSession & CheckpointState
representations, rather than the TypeScript side managing separate
CheckpointStateHandle and TrainingSessionHandle.


### Motivation and Context
WASM API needs to be updated with ORT training API function calls so
that ORT training web bindings can be added for on-device training.

---------

Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-08-28 11:05:02 -07:00
Hariharan Seshadri
cbd97515cd
[JS/WebGPU] Support GatherElements kernel (#17243)
### Description
As title


### Motivation and Context
Improve WebGPU kernel coverage
2023-08-28 09:55:25 -07:00
mindest
53169f59e5
[ROCm] Sort candidate solutions in rocBLAS/hipBLASLt for deterministic offline tuning (#17297)
### Description

Sort the candidates in rocBLAS/hipBLASLt to make sure that they are
properly ordered and can be correctly fetched by saved indices in
offline tuning cases.
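
Why sorting matters for saved indices can be shown with a small Python sketch. The candidate records and the `solution_id` sort key are hypothetical; the source does not say which key the real code sorts on. The point is only that a stable ordering makes a stored index refer to the same candidate on every run.

```python
def sorted_candidates(candidates):
    """Return candidates in a deterministic order (hypothetical sort key)."""
    return sorted(candidates, key=lambda candidate: candidate["solution_id"])


# The same set of solutions, enumerated in a different order on two runs.
tuning_run = [{"solution_id": 7}, {"solution_id": 3}, {"solution_id": 5}]
inference_run = [{"solution_id": 5}, {"solution_id": 7}, {"solution_id": 3}]

saved_index = 1  # index recorded during offline tuning
picked_while_tuning = sorted_candidates(tuning_run)[saved_index]
picked_at_inference = sorted_candidates(inference_run)[saved_index]
print(picked_while_tuning, picked_at_inference)
```

Without the sort, index 1 would select different solutions in the two runs above.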
2023-08-28 16:34:21 +08:00
cloudhan
bf8b1681f9
Build nuget pkg for ROCm (#16791)
Add nuget pkg building and publishing for ROCm EP

---------
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
2023-08-28 13:35:08 +08:00
Yulong Wang
bb1871332f
[js/webgpu] add kernel Not and Equal (#17306)
### Description
This PR adds kernel implementation for operator "Not" and "Equal". Also
removed download cache in gpu data manager.

**Why removing download cache**
The following test case failed. ("Or" is on CPU, "Greater" and "Equal"
are on JSEP)

![image](https://github.com/microsoft/onnxruntime/assets/7679871/8d9798ad-2703-4fb9-907e-ff716c67d0b2)
After debugging, I found that both "Equal" and "Greater" use the same
output GPU Data ID. This is because when ORT executes the graph, it
first runs "Equal", allowing its shader to write into GPU Data ID 2; then
a Gpu2Cpu copy is issued for it (because "Or" is currently on the CPU
EP); at this point, ORT considers GPU Data ID=2 free to use, so it
reuses it as the output for "Greater". This means there is no allocation
for the output of the "Greater" kernel, and both kernels write to GPU
Data ID=2.

For the gpu data manager, there will be 2 downloads from the same GPU
buffer. Previously I thought this was a waste of resources, so I cached
the data. But now it shows that we need to perform 2 downloads because
the GPU data has already changed. The download data cache should be
removed.
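
The stale-cache problem can be reproduced with a small Python simulation. All names here (`GpuBuffers`, `CachedDownloader`) are hypothetical and the lists stand in for GPU buffers; this is a sketch of the failure mode, not the gpu data manager code.

```python
class GpuBuffers:
    """Hypothetical GPU memory: buffers addressed by GPU Data ID."""

    def __init__(self):
        self.data = {}

    def write(self, data_id, values):
        self.data[data_id] = list(values)

    def download(self, data_id):
        return list(self.data[data_id])


class CachedDownloader:
    """Hypothetical download cache keyed only by GPU Data ID."""

    def __init__(self, buffers):
        self.buffers = buffers
        self.cache = {}

    def download(self, data_id):
        if data_id not in self.cache:
            self.cache[data_id] = self.buffers.download(data_id)
        return self.cache[data_id]


buffers = GpuBuffers()
cached = CachedDownloader(buffers)

buffers.write(2, [1, 0])           # "Equal" writes its output to GPU Data ID 2
equal_output = cached.download(2)  # Gpu2Cpu copy for the CPU "Or" kernel

buffers.write(2, [0, 1])             # ID 2 is reused as the output of "Greater"
greater_cached = cached.download(2)  # stale: still the "Equal" result
greater_direct = buffers.download(2)
print(equal_output, greater_cached, greater_direct)
```

Because the buffer ID is reused while its contents change, a cache keyed by ID alone returns the first kernel's data for the second download.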


2023-08-27 19:50:17 -07:00
simonjub
4eedd3bb46
[TRT EP] Fix logic to reach cache encryption code. (#17111)
### Description
This is a followup to PR #15519 that is closed in favor of this one.


### Motivation and Context
In the current implementation of the TRT cache, no code path can create
an encrypted TRT engine cache when the flags engine_cache_enable and
engine_decryption_enable are both true. This was originally raised in
issue #12551.
2023-08-26 20:09:03 -07:00