Commit graph

7983 commits

Author SHA1 Message Date
PeixuanZuo
d3a09cf77f
[ROCm] use pytest-xdist for fast pytest (#14261)
### Description

Use pytest-xdist to distribute tests across multiple CPUs to speed up
test execution.
Use pytest-rerunfailures to rerun failed tests in case pytest-xdist
crashes.
`pytest -n 16` can reduce pytest time from 80 minutes to 20 minutes.


### Motivation and Context
The kernel explorer pytest run in the ROCm CI currently takes nearly 1 hour
20 minutes. It will take even longer as we add more tunableOp in the future.
2023-01-13 16:57:50 +08:00
Scott McKay
9bd9206928
Attempt to fix flaky Windows GPU CI Pipeline 'cuda' stage. (#14281)
### Description
Change tolerance for tests involving MNIST and cuda to try and fix flaky
CI tests.

Errors from CI:

ModelTests/ModelTest.Run/cuda__models_zoo_opset8_MNIST_model 
expected 4.0755 (40826a83), got 4.06948 (40823938), diff: 0.00601721,
tol=0.0050755 idx=4. 2 of 10 differ
ModelTests/ModelTest.Run/cuda__models_zoo_opset7_MNIST_model
expected 7.89851 (40fcc09e), got 7.88879 (40fc70f8), diff: 0.00972271,
tol=0.00889851 idx=4. 4 of 10 differ
ModelTests/ModelTest.Run/cuda__models_zoo_opset12_MNIST12_mnist12
expected -5.50068 (c0b00595), got -5.49023 (c0afaff0), diff: 0.0104547,
tol=0.00650068 idx=1. 1 of 10 differ

Use rtol of 1e-2 if cuda is enabled. Use same for openvino for
simplicity.
 
```
>>> expected = np.array([4.0755, 7.89851, -5.50068], dtype=np.float32)
>>> actual = np.array([4.06948,  7.88879, -5.49023], dtype=np.float32)
>>> np.isclose(expected, actual, rtol=1e-2, atol=1e-3)
array([ True,  True,  True])
```
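
The `tol=` values in the CI errors above are consistent with an allowed difference of `atol + rtol * |expected|` with `atol = rtol = 1e-3` (for example, 1e-3 + 1e-3 * 4.0755 = 0.0050755). A minimal sketch, assuming that formula, of why the old settings fail and `rtol=1e-2` passes. `within_tolerance` is a hypothetical helper for illustration, not part of the test framework:

```python
# Values copied from the CI errors above: (expected, actual).
CASES = [(4.0755, 4.06948), (7.89851, 7.88879), (-5.50068, -5.49023)]


def within_tolerance(expected, actual, rtol, atol=1e-3):
    """Return True if |expected - actual| <= atol + rtol * |expected|."""
    return abs(expected - actual) <= atol + rtol * abs(expected)


# rtol=1e-3 reproduces the tol= values in the log, so every case fails.
old = [within_tolerance(e, a, rtol=1e-3) for e, a in CASES]
# rtol=1e-2 widens the allowed difference enough for all three to pass.
new = [within_tolerance(e, a, rtol=1e-2) for e, a in CASES]
print(old, new)
```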

Whitespace changes are from clang-format. 


### Motivation and Context
CI fails semi-frequently causing unnecessary re-runs.
2023-01-13 18:09:49 +10:00
Yufeng Li
16e39807e0
presence_mask should be sampling only (#14275) 2023-01-12 22:09:17 -08:00
Ashwini Khade
cc7799835e
Enable a single build with optimized inference and on device training (#14241)
### Description
Right now prepacking code is not compiled when training is enabled. Our
partners want a single build of ort which can do both optimized
inference + training on device. This PR enables prepacking code in a
training build and controls whether it is enabled or not using already
existing session option - kOrtSessionOptionsConfigDisablePrepacking

For inference scenarios, prepacking is turned on by default, and this
behavior is unchanged by this PR.
For training scenarios, prepacking is disabled by default, and an error
is thrown if the user explicitly enables it.



### Motivation and Context
Enable both optimized inference as well as on device training in a
single build. For on device training use flag --enable_training_apis.
2023-01-12 21:36:43 -08:00
Vincent Wang
fb3c1221e4
Fix Prefast Warning (#14250)
Fix two prefast:Warning related to constexpr.
2023-01-13 10:16:35 +08:00
Scott McKay
ea12b674c0
Disable the failing opset 18 model tests that are breaking the packaging pipeline (#14259)
### Description
Skip tests for opset18 models that we haven't implemented kernels for
yet.

Slice was checked in today so those failures should go away.

Resize: #13890 (all resize failures are fixed by this PR as confirmed in
output
[here](https://dev.azure.com/aiinfra/530acbc4-21bc-487d-8cd8-348ff451d2ff/_apis/build/builds/264725/logs/729))
Col2Im: #12311
ScatterND and ScatterElement: #14224
Pad (should also fix CenterCropPad failures): #14219
Bitwise ops: #14197
Optional: Unknown if we're intending to support this in 1.14

Not sure about SoftPlus as that is failing due to `Could not find an
implementation for Exp(1)`. ORT supports Exp from opset 6 and on, and it
seems incorrect for the test model created for opset 18 to be using a
version of Exp that is so old. Would have expected it to use the latest
- Exp(13). @liqunfu is this something that requires a fix to the ONNX
model?


### Motivation and Context
Fix pipeline
2023-01-13 09:55:52 +10:00
Ye Wang
c9a53c9255
Some changes to Sampling Op (#14218)
### Description
1. Add an optional input to pass in a seed.
2. Add two UTs: one for top_p=0.5, another for top_p=0.01 (creates the
greedy search result, in convert_generation.py).
3. Fix a bug in the CPU kernel.

### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-12 14:15:26 -08:00
Hariharan Seshadri
3898b22a1a
Fix some prefast warnings (#14247) 2023-01-12 11:15:23 -08:00
Numfor Tiapo
dee36f8ade
DML EP Register ScatterND-16 (#14240)
This PR registers ScatterND-16 to the DML EP

- CPU fallback is added if the reduction attribute is in use, as this is
not yet supported by DML.

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2023-01-12 10:39:25 -08:00
Yufeng Li
8f7eb75c3e
fix greedysearch token out of range bug (#14242)
Bug: the last sentence generates a token outside the vocabulary size.
Cause: the total element count should be computed with the padded
vocabulary size.
2023-01-12 09:06:05 -08:00
shalvamist
5c16e0befb
[web] utility functions for tensor<->image conversion in ORT web (#13603)
### Description
Add data processing capabilities to ORT Web.
This PR focuses on converting raw data to and from Tensors.

### Motivation and Context
Enabling different app building use cases to leverage ORT in a more
natural form.
Currently, the user needs to process the data and call Tensor
constructors - these util functions will provide a direct path to
generating ORT tensors.

Co-authored-by: shalvamist <shalva.mist@microsoft.com>
2023-01-12 09:05:18 -08:00
Patrice Vignola
99a4036c80
[DML EP] Add FusedMatMul (#14196)
### Description
Add FusedMatMul



### Motivation and Context
- Add the FusedMatMul fusion for DML
- Fix the FusedMatMul logic and tests when transposed batches are
involved
2023-01-12 02:17:04 -08:00
cloudhan
712f781702
Make CK an optional dependencies and only built with ck if ROCm >= 5.3 (#14232)
Recently, ck dropped ROCm 5.2 support, which is causing packaging
pipeline failures. This PR works around it.
2023-01-12 17:09:40 +08:00
Scott McKay
b9ecd428c1
Add ability to register custom ops by specifying a function name (#14177)
### Description
Use dlsym/GetProcAddress to lookup a custom ops registration function by
name and call it.
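
The lookup-by-name mechanism can be illustrated from Python with `ctypes`, which resolves symbols through dlsym on POSIX and GetProcAddress on Windows. This is only a sketch of the mechanism, not the ORT API added by this PR. `lookup_symbol` and the missing symbol name are hypothetical, and `ctypes.CDLL(None)` (open the current process) is assumed to run on a POSIX system:

```python
import ctypes


def lookup_symbol(library, name):
    """Return the callable exported as `name`, or None if it is missing."""
    # Attribute lookup on a ctypes library object resolves the symbol via
    # dlsym (POSIX) or GetProcAddress (Windows).
    return getattr(library, name, None)


# CDLL(None) opens the current process on POSIX systems, so symbols that are
# already linked in can be found without a library path on disk.
process = ctypes.CDLL(None)

# strlen stands in for a registration function that is linked into the app.
strlen = lookup_symbol(process, "strlen")
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t
length = strlen(b"custom_ops")

# A name that is not exported simply is not found.
missing = lookup_symbol(process, "NoSuchRegistrationFunction")
```

Looking the function up in the already-loaded process is what makes this approach attractive when the custom ops library is linked in rather than shipped as a separate file.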

This will be better on mobile platforms where the custom ops library is
linked against, and there isn't necessarily a filesystem that a library
path can be loaded from.

Alternative is to wire up passing in the address of the function, but
that has multiple complications which differ by platform.

### Motivation and Context
Enable using ort and ort-ext packages on mobile platforms.

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-01-12 15:11:34 +10:00
guyang3532
f5b4b0f77d
Add support for 'axes' attr of unsqueeze in opset 13 and add ut (#14071)
Since opset 13, the 'axes' attr of unsqueeze has become an input of
unsqueeze. Add support for it and add a ut.
2023-01-12 10:45:15 +08:00
sfatimar
7654cd50e8
Openvino ep 2022.3 v4.3 (#14210)
### Description
Changes to incorporate OpenVINO EP 2022.3


### Motivation and Context
This change is required to incorporate OpenVINO EP 2022.3

Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Aravind <aravindx.gunda@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: flexci <mohsinmx>
2023-01-11 16:31:26 -08:00
Nat Kershaw (MSFT)
26a6b40554
Delete add-issues-to-project (#14147)
It is not currently possible to get the appropriate org level
permissions for this action.

It should soon be possible to do this automatically within
[GitHub](https://github.com/microsoft/github-operations/issues/59#issuecomment-1275424404).
2023-01-11 14:33:37 -08:00
Scott McKay
dd2df460b3
Split(18) (#14015)
### Description
Opset 18 Split changes. Adds ability to specify num_outputs which also
allows uneven splitting.
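
As a rough illustration of uneven splitting, the sketch below computes chunk sizes under the assumption (my reading of the ONNX Split-18 description, not taken from this PR) that every chunk except the last has size ceil(dim / num_outputs) and the last chunk holds the remainder. `split_sizes` is a hypothetical helper, and its handling of edge cases is not claimed to match ORT:

```python
import math


def split_sizes(dim, num_outputs):
    """Sizes of the chunks when `dim` elements are split into `num_outputs` parts."""
    if num_outputs <= 0:
        raise ValueError("num_outputs must be positive")
    # All chunks but the last use the rounded-up size.
    chunk = math.ceil(dim / num_outputs)
    # The last chunk takes whatever is left, so it may be smaller.
    last = dim - chunk * (num_outputs - 1)
    if last <= 0:
        raise ValueError("dim is too small for this many outputs")
    return [chunk] * (num_outputs - 1) + [last]


even = split_sizes(6, 3)
uneven = split_sizes(7, 3)
```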

https://github.com/onnx/onnx/releases/tag/v1.13.0

### Motivation and Context
Support ONNX opset 18.
2023-01-12 08:14:10 +10:00
Tianlei Wu
012b34dc4e
Add --use_multi_head_attention in transformers fusion (#14198)
Add an option --use_multi_head_attention to fuse the model with the
MultiHeadAttention operator instead of the Attention operator for testing
purposes.

Note that MultiHeadAttention can be used in self-attention and
cross-attention, while the Attention operator is used for self-attention
only. In the Attention operator, there are packed Q/K/V weights for input
projection, but that MatMul of input projection is excluded from
MultiHeadAttention.
2023-01-11 13:20:05 -08:00
RandySheriffH
83ad562826
Rename CloudEP to AzureEP (#14175)
Rename CloudEP to AzureEP.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-11 12:25:04 -08:00
Tianlei Wu
3b79b8eb1d
fix reshape fusion error in numpy 1.24 (#14231)
Fix https://github.com/microsoft/onnxruntime/issues/14017. 

Before: shape_value = np.asarray([0, 0, np.array([4]), np.array([8])],
dtype=np.int64) raises an error in numpy 1.24.
After: shape_value = np.asarray([0, 0, 4, 8], dtype=np.int64) works in
numpy 1.24.

Update test environment to use numpy 1.24.
2023-01-11 10:37:41 -08:00
Patrice Vignola
52fc1ff21e
[DML EP] Fix FusedMatMul crash when batch > 1 (#14225)
### Description
Fix FusedMatMul crash when batch > 1



### Motivation and Context
FusedMatMul calls `SetStrides` on its input tensors but doesn't update
the tensorSizeInBytes value. Calling `SetStrides` is very error-prone
because it puts the tensor in an invalid state, and the caller needs to
manually adjust it after the call. To avoid this situation in the
future, we now update the size of the tensor in the `SetStrides` call
itself.
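
The design choice described above, updating the size inside the setter so that strides and size cannot diverge, can be sketched as follows. `TensorDesc` is a hypothetical stand-in, and the size formula (offset of the last element plus one, times the element size) is an assumption for illustration rather than the DML EP's actual computation:

```python
class TensorDesc:
    """Hypothetical stand-in for a strided tensor description."""

    def __init__(self, sizes, element_size):
        self.sizes = list(sizes)
        self.element_size = element_size
        self.set_strides(self._packed_strides())

    def _packed_strides(self):
        # Row-major strides for a densely packed tensor.
        strides, step = [], 1
        for size in reversed(self.sizes):
            strides.append(step)
            step *= size
        return strides[::-1]

    def set_strides(self, strides):
        # Update strides and the byte size together so the two cannot diverge.
        self.strides = list(strides)
        last_index = sum((n - 1) * s for n, s in zip(self.sizes, self.strides))
        self.size_in_bytes = (last_index + 1) * self.element_size


desc = TensorDesc(sizes=[2, 3, 4], element_size=4)
packed_bytes = desc.size_in_bytes
desc.set_strides([24, 8, 1])  # pad each row of 4 elements out to 8
padded_bytes = desc.size_in_bytes
```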
2023-01-11 10:17:34 -08:00
pengwa
a4180d79c5
Multi-tensor SGDOptimizer (on device training) (#14083)
Implement SGDOptimizerV2 taking sequence of weights and gradients as
inputs.
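
A minimal sketch of the multi-tensor idea, assuming plain SGD (`w <- w - lr * g`) with no momentum or weight decay. The actual inputs and attributes of SGDOptimizerV2 are not described in this message, and `sgd_step` is a hypothetical helper that uses lists of floats in place of tensors:

```python
def sgd_step(weights, gradients, learning_rate):
    """Apply w <- w - lr * g to every tensor in the sequence, in place."""
    if len(weights) != len(gradients):
        raise ValueError("weights and gradients must have the same length")
    # One call updates the whole sequence instead of one call per tensor.
    for w, g in zip(weights, gradients):
        for i, grad in enumerate(g):
            w[i] -= learning_rate * grad
    return weights


weights = [[1.0, 2.0], [0.5]]
gradients = [[0.5, -1.0], [2.0]]
sgd_step(weights, gradients, learning_rate=0.5)
```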

For CPU EP and CUDA EP only.

Added tests.
2023-01-11 10:15:53 -08:00
RandySheriffH
ecd5ce0b33
Use json format to save and load partition config (#14169)
Use json format to save and load the partition config. Previously it was
csv, which caused issues between Windows and POSIX due to different line
breaks.
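
A small sketch of why JSON is more robust here: JSON parsers treat both `\n` and `\r\n` as insignificant whitespace, while naive line-based CSV handling can leave a stray `\r` in the last field. The config keys below are hypothetical (the real partition-config schema is not shown in this message), and the CSV failure shown is one plausible mode, not necessarily the one that was hit:

```python
import json

# Hypothetical config: the real partition-config schema is not shown here.
config = {"nodes": [{"name": "MatMul_0", "ep": "CPUExecutionProvider"}]}

text = json.dumps(config, indent=2)
# Simulate a file written with Windows line endings and read on POSIX.
windows_text = text.replace("\n", "\r\n")
round_tripped = json.loads(windows_text)

# The same payload as naive CSV keeps the stray "\r" in the last field.
csv_line = "MatMul_0,CPUExecutionProvider\r\n"
naive_fields = csv_line.rstrip("\n").split(",")
```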

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-11 10:03:14 -08:00
Yufeng Li
7a9a6bcebd
Improve TopP sampling (#14192)
### Description
Improve TopP sampling's filter kernel with cub::scan. It reduces TopP
sampling latency from 3.67 to 0.92 for batch size 8 and vocabulary size
51k.
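
For readers unfamiliar with the role of a scan in TopP filtering, here is a CPU sketch in Python: the cumulative probability over the sorted tokens is a prefix sum, which is the step a scan primitive computes in parallel on the GPU. `top_p_filter` is a hypothetical helper, and the exact cutoff rule (keep a token while the probability mass before it is below `top_p`) is an assumption, not taken from the kernel:

```python
from itertools import accumulate


def top_p_filter(probs, top_p):
    """Return token ids kept by nucleus (top-p) filtering."""
    # Sort token ids by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_probs = [probs[i] for i in order]
    # Inclusive prefix sum; on the GPU this step is what a scan primitive
    # computes in parallel.
    inclusive = list(accumulate(sorted_probs))
    # A token is kept while the mass *before* it is still below top_p.
    kept = []
    for token, prob, total in zip(order, sorted_probs, inclusive):
        if total - prob < top_p:
            kept.append(token)
    return kept


kept = top_p_filter([0.05, 0.5, 0.15, 0.3], top_p=0.7)
```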
2023-01-11 08:40:17 -08:00
Ashwini Khade
d92c663f28
Create dedicated build for training api (#14136)
### Description
Enable creating a dedicated build for on device training. With this PR we
can build a lean binary for on device training using the flag
--enable_training_apis. This binary includes only the essentials, such as
training ops and optimizers, and NOT features like Aten fallback,
strided tensors, or gradient builders. This binary also removes all
the deprecated components, such as training::TrainingSession and
OrtTrainer.

### Motivation and Context
This enables our partners to create a lean binary for on device
training.
2023-01-10 20:58:04 -08:00
dependabot[bot]
3a39736a2c
Bump json5 from 2.2.0 to 2.2.3 in /js/web (#14110) 2023-01-11 02:27:42 +00:00
Ye Wang
342a5bf2b7
Improve rpb cuda kernel (#14195)
### Description
Average latency (ms) of float16 relative position bias cuda kernel on
V100:

Kernel\Seq_Len | 16 | 32 | 64 | 128 | 256 | 384 | 512 | 768 | 1024 | 2048 | 4096
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Before | 0.0494 | 0.0654 | 0.1519 | 0.4322 | 1.1865 | 2.4091 | 4.3676 | 14.912 | 36.517 | 142.09 | 561.80
After | 0.0483 | 0.0651 | 0.1294 | 0.3858 | 1.1128 | 2.2988 | 3.8391 | 14.290 | 34.542 | 136.13 | 529.54

### Motivation and Context
Review of this comment
https://github.com/microsoft/onnxruntime/pull/14149/#discussion_r1063152021

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-10 17:06:42 -08:00
RandySheriffH
d152452d4b
Tune test case for hybrid cpu (#14204)
Tune test case for hybrid cpu architecture.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-10 12:54:02 -08:00
Chen Fu
90142899bd
Supporting Intel AMX instructions in quantized GEMM (#14042)
### Description
Using Intel AMX int8 instructions to accelerate quantized GEMM


### Motivation and Context
AMX instructions accelerate quantized GEMM significantly:

Prepacked B perf numbers (latency in ns)

GEMM Config | AVX512Vnni | AMX
-- | --: | --:
M:384/N:1024/K:1024/Batch:1/Threads:4 | 1057511 | 285393
M:384/N:1024/K:3072/Batch:1/Threads:4 | 2643929 | 700397
M:384/N:1024/K:4096/Batch:1/Threads:4 | 3784750 | 890701
M:384/N:4096/K:1024/Batch:1/Threads:4 | 2378139 | 887251
M:384/N:1024/K:1024/Batch:1/Threads:16 | 307137 | 138481
M:384/N:1024/K:3072/Batch:1/Threads:16 | 855730 | 295027
M:384/N:1024/K:4096/Batch:1/Threads:16 | 1126878 | 317395
M:384/N:4096/K:1024/Batch:1/Threads:16 | 781963 | 237014
M:1536/N:1024/K:1024/Batch:1/Threads:16 | 538864 | 181459
M:1536/N:1024/K:3072/Batch:1/Threads:16 | 1681002 | 561600
M:1536/N:1024/K:4096/Batch:1/Threads:16 | 2158127 | 717470
M:1536/N:4096/K:1024/Batch:1/Threads:16 | 2428622 | 896140
M:3072/N:1024/K:1024/Batch:1/Threads:16 | 1058029 | 357031
M:3072/N:1024/K:3072/Batch:1/Threads:16 | 3138504 | 1095857
M:3072/N:1024/K:4096/Batch:1/Threads:16 | 4155640 | 1386183
M:3072/N:4096/K:1024/Batch:1/Threads:16 | 4679030 | 1778624
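
The table reports raw latencies, so the speedup has to be read off as the ratio AVX512Vnni / AMX. A short sketch that computes it for a few rows copied from the table (the selection of rows is arbitrary):

```python
# (config, AVX512Vnni ns, AMX ns), copied from a few rows of the table above.
ROWS = [
    ("M:384/N:1024/K:1024/Batch:1/Threads:4", 1057511, 285393),
    ("M:384/N:1024/K:4096/Batch:1/Threads:4", 3784750, 890701),
    ("M:384/N:1024/K:1024/Batch:1/Threads:16", 307137, 138481),
    ("M:3072/N:4096/K:1024/Batch:1/Threads:16", 4679030, 1778624),
]

# Speedup is the old latency divided by the new latency.
speedups = {name: vnni / amx for name, vnni, amx in ROWS}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```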

Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-01-10 12:16:27 -08:00
Ye Wang
a01bf8dbb1
rename CrossAttention to MultiHeadAttention (#14201)
### Description

Rename CrossAttention to MultiHeadAttention, since this op can also
be used for self-attention.

### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-10 10:18:39 -08:00
Guenther Schmuelling
6b8c72cfa6
pin ort-ext to 81e7799c69044c745239202085eb0a98f102937b (#14044)
Pin onnxruntime-extension to 81e7799c69044c745239202085eb0a98f102937b in
preparation for enabling the extension in the wasm build.
2023-01-10 10:10:17 -08:00
Numfor Tiapo
f4ea781b81
DML EP Register Identity-16 (#14053)
This PR Registers Identity-16 to the DML EP.

ONNX Backend tests and optional type tests were skipped pending future
additions.

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2023-01-10 09:16:09 -08:00
Tianlei Wu
05e26f302a
Hot fix for prefast failure to unblock python package pipeline (#14206)
### Description
Hot fix python packaging pipeline failures by disabling an attention op
test which causes cl crashes in prefast build.

Verified that python package is good with this hot fix:

https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=263786&view=results

### Motivation and Context

The prefast build fails with a cl crash:
```
 cl : command line error D8040: error creating or communicating with child process
```
The cause is high stack usage in an attention op unit test introduced in
https://github.com/microsoft/onnxruntime/pull/13953.
2023-01-10 07:57:32 -08:00
Adrian Lizarraga
3d8b596cb9
Use a local copy of murmurhash3 in TensorRT shared library (#14207)
### Description
Uses a local copy of murmurhash3 in TensorRT.



### Motivation and Context
The current murmurhash3 implementation is located in core/framework,
which is not linked to the provider shared library. This causes a
segfault when tensorrt shared library is used standalone.
2023-01-10 07:24:06 -08:00
Ryan Hill
da57c0a701
Add protected destructor to Provider structure (#14152)
### Description
Add protected destructor so that any inherited classes can't
accidentally be deleted through a pointer to the base.

Fixes this prefast warning:
The type 'struct onnxruntime::CUDA_Provider' with a virtual function
needs either public virtual or protected non-virtual destructor (c.35).

Internal bug 8999
2023-01-09 23:04:04 -08:00
Ryan Hill
f8117b6f87
Add catch-all exception handler to API_IMPL_END (#14194)
### Description
Fairly self explanatory. Someone pointed out we could miss some
exceptions, and we never want to throw exceptions through the C API.

### Motivation and Context
This doesn't fix any known issue, it's just a good idea to have.
2023-01-09 21:58:46 -08:00
PeixuanZuo
33367fa2dc
[MIGraphX] update the MIGraphX version used in ORT to rocm-5.4.0 (#14184)
### Description
Update the MIGraphX version used in ORT to rocm-5.4.0

### Motivation and Context
The previous branch, migraphx_for_ort, is no longer updated and is too far
behind the latest MIGraphX release branch. More discussion here:
https://github.com/microsoft/onnxruntime/issues/14126#issuecomment-1373201049

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-10 13:40:25 +08:00
Yi Zhang
6463f4383b
make WITHCACHE as an option in MacOS workflow (#14188)
### Description
1. Set the WithCache default value to false in the macOS CI workflow too.
2. Add today's date to the cache key to keep the cache size from growing.

With the cache enabled, the pipeline duration drops from over 70 minutes
to over 10 minutes.
2023-01-10 10:54:19 +08:00
Tianlei Wu
7e751ac6e6
update convert_generation for Attention op change (#14191)
We removed the key and value inputs in
https://github.com/microsoft/onnxruntime/pull/14146, so
convert_generation needs to be updated as well.
2023-01-09 18:04:44 -08:00
Patrice Vignola
c151afec71
[DML EP] Fix unconnected node removal logic (#14193)
### Description
Fix unconnected node removal logic



### Motivation and Context
The edges need to be removed before the nodes themselves, otherwise the
indices will reference the wrong nodes.
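
The ordering issue can be sketched with a toy graph in which edges store indices into a node list. The structure and the `remove_nodes` helper are hypothetical and much simpler than the DML graph description. The point is only that edges are filtered while the indices still refer to the original list, and are remapped after the nodes are compacted:

```python
def remove_nodes(nodes, edges, dead):
    """Drop the nodes whose indices are in `dead` and fix up the edges.

    `edges` is a list of (from_index, to_index) pairs into `nodes`.
    """
    # 1. Remove edges first, while indices still refer to the original list.
    live_edges = [(a, b) for a, b in edges if a not in dead and b not in dead]
    # 2. Compact the node list and record where each survivor moved to.
    remap, kept = {}, []
    for old_index, node in enumerate(nodes):
        if old_index not in dead:
            remap[old_index] = len(kept)
            kept.append(node)
    # 3. Rewrite the surviving edges with the new indices.
    return kept, [(remap[a], remap[b]) for a, b in live_edges]


nodes = ["input", "unused", "matmul", "output"]
edges = [(0, 1), (0, 2), (2, 3)]
kept, new_edges = remove_nodes(nodes, edges, dead={1})
```

If the nodes were compacted first, the old edge `(2, 3)` would point at `"output"` and past the end of the list instead of at `"matmul"` and `"output"`.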
2023-01-09 15:40:09 -08:00
Sumit Agarwal
906f578be8
[DML EP] Update DML_FEATURE_LEVEL 5.0 (#14172)
### Description
The DML EP was using a very old feature level (2.0), which may cause
execution failures for models that use the latest operators when running
against an old DirectML.dll.



### Motivation and Context
2023-01-09 13:00:56 -08:00
liqun Fu
1be36913cc
to work with onnx 1.13 rc, implement ver 18 reduce and optional ops, … (#13765) 2023-01-09 10:26:16 -08:00
Xavier Dupré
79dc39600f
Replace distutils by setuptools to import build_ext (#14108)
### Description
Uses setuptools instead of distutils.



### Motivation and Context
Fixes #14107.
2023-01-09 11:48:01 +01:00
Patrice Vignola
64541a587d
[DML EP] Remove unconnected nodes from the graph (#14155)
### Description
Remove unconnected nodes from the DML EP graph.



### Motivation and Context
Some operators like `EmbedLayerNorm` have many outputs, and some of the
outputs are non-optional. But in practice, they act like optional
outputs because they can have a value of 0, which means that the rest of
the model doesn't need to depend on those. The problem with that is that
DML will implicitly remove those outputs from the graph, but the nodes
that feed into that output will stay and become unconnected from the
rest of the graph, which is illegal in DML. Removing unconnected nodes
as a last pass will make sure that those nodes are getting removed and
will simplify the logic of individual operators by not having to account
for these special cases.
2023-01-08 15:20:52 -08:00
Zhang Lei
74fe45bf09
activate past_present_share_buffer for sampling node (#14166) 2023-01-07 19:36:39 -08:00
cloudhan
be879c11ee
Add batched and strided batched gemm as TunableOp (#13841) 2023-01-07 19:11:40 +08:00
Ye Wang
5eac2c1f41
relational attention bias cuda op (#14149)
### Description

This cuda op implements the compute_bias() method in T5 Attention
including the permutation.

Notes:
1. bias_table needs to be saved in col-major. Be careful when
implementing the fusion script.
2. The second input (sequence length) is placed on cpu. (Using the Shape
node's output should be good.)
3. The first dimension of the output is 1, so extra_add_qk in attention
should support broadcasting.
4. compute_bias() is only used in self-attn in t5.

TODO: docs change will be applied later

### Motivation and Context
It's part of the process of optimizing t5 attention as well as t5 based
generation model

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-06 17:32:58 -08:00
cloudhan
8e2163018d
Ignore more build directories and clangd files (#14154)
Ignore all `build_*` directories in repo root. Ignore `.cache` and
`compile_commands.json` which are related to clangd cache and
configuration.
2023-01-07 06:58:57 +08:00
Tianlei Wu
2cacb24cb0
Add CrossAttention operator (#14146)
Move separated Q, K and V (without input projection) from Attention to a
new operator CrossAttention.

The Attention operator is hard to maintain when we need to support both
with and without input projection in one class. Add a new operator
according to feedback.

Some changes might be needed in the future, but not in this PR:
(1) bias could be optional (We will not take that route unless
experiments show that fusing Add bias with MatMul instead of this op
could improve performance).
(2) support packed KV. There are two ways to support it: when key and
value are the same Tensor, they are packed; or we can make value
optional, and use packed mode when value is empty and the key has packed
K/V.
(3) support cached key and value, and other inputs (like relative
position bias), or more attention mask formats. They can be added easily
without breaking backward compatibility.
(4) ROCm/CPU implementation of this op.
2023-01-06 14:27:40 -08:00