Commit graph

9170 commits

Author SHA1 Message Date
Dmitri Smirnov
e752cbe7f2
Work on eliminating Internal Compiler Error (#16741)
### Description
Replace the offending bitwise `operator |` with if() logic for ARM.
2023-07-18 10:17:52 -07:00
Wei-Sheng Chin
b71ebf91a5
[DORT] Reduce global configs to make enabling dynamic shape easier (#16720)
There are several global configs used by DORT.
```py
DEFAULT_ONNX_EXPORTER_OPTIONS = torch.onnx._internal.exporter.ResolvedExportOptions(
    torch.onnx._internal.exporter.ExportOptions()
)

# TODO(wechi): This line must generate result identical to the call of
# _create_onnx_supports_op_overload_table(...) inside
# create_onnx_friendly_decomposition_table(...) in
# torch/onnx/_internal/fx/decomposition_table.py.
_SUPPORT_DICT = torch.onnx._internal.fx.decomposition_table._create_onnx_supports_op_overload_table(
    DEFAULT_ONNX_EXPORTER_OPTIONS.onnx_registry
)  # type: ignore

_EXTRA_SUPPORT_DICT: Dict[str, Any] = {
    "getattr": None,
    "_operator.getitem": None,
}

DORT_DECOMPOSITION_TABLE = DEFAULT_ONNX_EXPORTER_OPTIONS.decomposition_table
```

We can see all but `_EXTRA_SUPPORT_DICT` are deduced from
ONNX exporter's options. As there are many ways to configure ONNX
exporter's options, we decided to move these variables to `OrtBackend`'s
`__init__` so that the construction of `OrtBackend` becomes more
flexible (especially for enabling dynamic shape or not).
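
A minimal sketch of the idea, assuming nothing about the real DORT code: `ExportOptions` and `Backend` below are hypothetical stand-ins (not the `torch.onnx` or `OrtBackend` classes), used only to show settings being deduced per instance instead of once at module import.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ExportOptions:
    """Hypothetical stand-in for the exporter's options."""

    dynamic_shapes: bool = False
    decomposition_table: Dict[str, Any] = field(default_factory=dict)


class Backend:
    """Hypothetical backend that derives its settings per instance."""

    def __init__(self, options: Optional[ExportOptions] = None):
        self.options = options if options is not None else ExportOptions()
        # Values that used to be module-level globals are now deduced
        # from the options handed to this particular instance.
        self.decomposition_table = self.options.decomposition_table
        # Only the extra support table does not depend on the options.
        self.extra_support_dict: Dict[str, Any] = {
            "getattr": None,
            "_operator.getitem": None,
        }


# Two backends with different configurations can now coexist.
static_backend = Backend()
dynamic_backend = Backend(ExportOptions(dynamic_shapes=True))
```

With module-level globals, both backends would have shared one configuration fixed at import time.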
2023-07-18 09:06:58 -07:00
PeixuanZuo
9b549c646c
[ROCm] fix kernel explorer GemmSoftmaxGemm test (#16735)
GemmSoftmaxGemmTunble occasionally fails with a large numerical error.
The root cause is that CK's Strided Batched Gemm has a larger error
under a specific initialization distribution
(`multinormal_distribution`).

The Generic (Gemm1 + Softmax + Gemm2) implementation is one instance of
GemmSoftmaxGemmTunble. Gemm1 and Gemm2 in the Generic implementation are
TunableOps when tuning is enabled. In some cases GemmSoftmaxGemmTunble
selects the Generic implementation while Gemm1 or Gemm2 selects the CK
implementation, so the result of GemmSoftmaxGemmTunble is affected by CK.

- Loosen the tolerance.
- Add `GemmSoftmaxGemmPermuteGenericNestedTunable` to test the Generic
implementation with tuning enabled.
2023-07-18 16:47:39 +08:00
zhangsibo1129
9ba5cdbaa4
[CANN EP] Fix Float16 support for CANN EP (#16733)
### Description

Replace the constructor function `MLFloat16()` with the public member
function `FromBits()` in the file
`onnxruntime/core/providers/cann/cann_common.cc`

### Motivation and Context
PR [#16506](https://github.com/microsoft/onnxruntime/pull/16506) changed
the public constructor function `MLFloat16(uint16_t x)` to private, and
added a public function `MLFloat16::FromBits(uint16_t x)` in the file
`include/onnxruntime/core/framework/float16.h`, which broke the CANN CI.

This PR aligns the CANN behavior with the modified class `MLFloat16`.
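
To illustrate the difference between constructing from a value and constructing from raw bits, here is a hedged Python sketch. The `Half` class is hypothetical and is not the ORT `MLFloat16` implementation; it only shows why a named `from_bits` factory is less ambiguous than a constructor taking an integer.

```python
import struct


class Half:
    """Hypothetical half-precision wrapper (not the ORT class)."""

    def __init__(self, value: float):
        # Converting constructor: round a Python float to IEEE binary16.
        (self.bits,) = struct.unpack("<H", struct.pack("<e", value))

    @classmethod
    def from_bits(cls, bits: int) -> "Half":
        # Named factory: reinterpret a raw 16-bit pattern, no rounding.
        obj = cls.__new__(cls)
        obj.bits = bits & 0xFFFF
        return obj

    def to_float(self) -> float:
        (value,) = struct.unpack("<e", struct.pack("<H", self.bits))
        return value


one = Half.from_bits(0x3C00)  # bit pattern of 1.0 in binary16
also_one = Half(1.0)          # numeric value 1.0
```

Both objects hold the same bits, but the call site makes clear whether the argument is a number or a bit pattern.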
2023-07-17 23:24:51 -07:00
cloudhan
0cab7e1a37
[ROCm] Generalize FastGeLU (#16623)
Allow the whole pipeline to be parameterized with unary elementwise
functor.
2023-07-18 11:23:12 +08:00
Scott McKay
ad90352a68
Add MAUI test app that can be used to test model loading and performance (#16658)
### Description
MAUI test app with tooling to add model and generated or provided input
test data.

The app will load the model and validate the output. It can also run a
specified number of iterations to provide basic performance information.

<img width="401" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/979079/daf3af13-fb22-4cbb-9159-486b483a7485">

### Motivation and Context
Primarily to make it easier to test an arbitrary model on iOS. A MAUI
app allows testing on all platforms.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-07-18 08:21:18 +10:00
cloudhan
a45b834722
Fix warning about uninitialized member (#16736)
#16506 causes almost every translation unit on Linux to emit this warning:

```
[1175/1235] Building CXX object CMakeFiles/onnxruntime_test_all.dir/home/guangyunhan/onnxruntime/orttraining/orttraining/test/training_ops/cuda/softmax_test.cc.o
In file included from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/float16.h:18,
                 from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/data_types.h:17,
                 from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/tensor.h:17,
                 from /home/guangyunhan/onnxruntime/onnxruntime/test/common/tensor_op_test_utils.h:16,
                 from /home/guangyunhan/onnxruntime/onnxruntime/test/providers/compare_provider_test_utils.h:7,
                 from /home/guangyunhan/onnxruntime/orttraining/orttraining/test/training_ops/cuda/softmax_test.cc:4:
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h: In instantiation of ‘static constexpr uint16_t onnxruntime_float16::Float16Impl<Derived>::ToUint16Impl(float) [with Derived = onnxruntime::MLFloat16; uint16_t = short unsigned int]’:
/home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/float16.h:42:66:   required from here
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h:241:7: note: ‘union onnxruntime_float16::detail::float32_bits’ has no user-provided default constructor
  241 | union float32_bits {
      |       ^~~~~~~~~~~~
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h:242:16: note: and the implicitly-defined constructor does not initialize ‘unsigned int onnxruntime_float16::detail::float32_bits::u’
  242 |   unsigned int u;
      |                ^
```

This PR silences the warning.
2023-07-17 11:33:54 -07:00
Edward Chen
df8843c4a7
Upgrade old Python version in packaging pipeline (#16667)
- Upgrade from Python 3.6 to 3.8 in packaging pipeline.
- Raise build.py minimum required Python version.
2023-07-17 08:24:47 -07:00
Dmitri Smirnov
b8c40b7813
Fix parameter naming that fails Doc generation. (#16717)
### Description
Rename `FromBits` param name to match the docs.

### Motivation and Context
Fix API Doc generation.
2023-07-16 22:02:05 -07:00
RandySheriffH
e1ca8ee6d4
RunAsync C/CXX API (#16613)
Implement RunAsync API - the session will run in a thread of intra-op
thread pool.
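
The general shape of such an API can be sketched in Python. This is an illustration of the pattern only: `run_async`, `run_fn`, and `on_done` are hypothetical names, and the real C/C++ signatures are not reproduced here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_async(pool, run_fn, inputs, callback):
    """Run run_fn(inputs) on a pool thread, then call callback(outputs, error)."""

    def task():
        try:
            outputs, error = run_fn(inputs), None
        except Exception as exc:  # report failures through the callback
            outputs, error = None, exc
        callback(outputs, error)

    return pool.submit(task)


results = []
done = threading.Event()


def on_done(outputs, error):
    results.append((outputs, error))
    done.set()


with ThreadPoolExecutor(max_workers=2) as pool:
    # The caller returns immediately; the result arrives in on_done.
    run_async(pool, lambda xs: [x * 2 for x in xs], [1, 2, 3], on_done)
    done.wait(timeout=5)
```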

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-07-16 16:51:40 -07:00
Ryan Hill
2cf31a20cf
Cuda: Decoder Masked Multihead Attention Q values get corrupted when using cross attention (#16721)
### Description
Some code was accidentally moved into the
`if(!params.is_cross_attention)` block; it must stay outside to work in
both cases.

### Motivation and Context
This causes invalid results. We detected this as a performance bug, as
it caused the EOS early exit to never happen, and the runs would always
take max_length to complete which was slow.
2023-07-15 00:41:06 -07:00
Wanming Lin
2b7a94e65b
[WebNN EP] Make some types clearer (#16705)
It's a follow-up to address comments in
https://github.com/microsoft/onnxruntime/pull/16671#discussion_r1261761828
and
https://github.com/microsoft/onnxruntime/pull/16671#discussion_r1261763873
2023-07-14 17:39:36 -07:00
Ryan Hill
2ae041f390
atomicAdd returns previous value, not current value. (#16690)
### Description
Mistake in beam scorer processing, atomicAdd result should be compared
with '1' vs '0' as it returns the original value, not the latest value.

This error just results in slow perf, nothing fails.
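
The fetch-and-add semantics can be sketched in Python. The kernel code is not shown in this message, so the example assumes a countdown counter that each worker decrements; `AtomicInt` is a hypothetical lock-based stand-in for a hardware atomic.

```python
import threading


class AtomicInt:
    """Hypothetical lock-based stand-in for an atomic integer."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        """Add delta and return the value held before the addition."""
        with self._lock:
            previous = self._value
            self._value += delta
            return previous

    def load(self):
        with self._lock:
            return self._value


remaining = AtomicInt(3)
last_flags = []
for _ in range(3):
    previous = remaining.fetch_add(-1)
    # The worker that drives the counter to 0 sees previous == 1.
    # Testing previous == 0 would never be true in this loop.
    last_flags.append(previous == 1)
```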

### Motivation and Context
Fixes #16642
2023-07-14 15:46:57 -07:00
Wei-Sheng Chin
44fd98ebfe
[DORT] Enable aten::full by implementing extra logics to select EP (#16699)
DORT only selects devices from input arguments (type: torch.Tensor).
However, it errors out when a graph doesn't have any inputs (e.g., a
single aten::full graph). This PR addresses this problem by changing the
EP selection to

1. First, inspect graph inputs. If there are some valid devices, use them
plus a default one (`OrtBackend.ep: str`).
2. Otherwise, inspect graph outputs carried by `torch.fx.GraphModule` and
use all valid devices plus the default `OrtBackend.ep`.
3. When both (1) and (2) fail, use the default EP specified by
`OrtBackend.ep`.
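
The fallback order above can be sketched as follows. This is a hedged illustration: `select_eps` is a hypothetical helper that takes already-extracted device names (with `None` for values that carry no device) and is not the code in the PR.

```python
from typing import Iterable, List, Optional


def select_eps(
    input_devices: Iterable[Optional[str]],
    output_devices: Iterable[Optional[str]],
    default_ep: str,
) -> List[str]:
    """Pick execution providers from inputs, then outputs, then the default."""

    def valid_plus_default(devices):
        eps: List[str] = []
        for ep in devices:
            if ep is not None and ep not in eps:
                eps.append(ep)
        if eps and default_ep not in eps:
            eps.append(default_ep)
        return eps

    eps = valid_plus_default(input_devices)       # step 1: graph inputs
    if not eps:
        eps = valid_plus_default(output_devices)  # step 2: graph outputs
    if not eps:
        eps = [default_ep]                        # step 3: default only
    return eps


# A graph with no inputs, e.g. one that only creates a tensor.
no_input_choice = select_eps([], ["CUDAExecutionProvider"], "CPUExecutionProvider")
```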
2023-07-14 15:42:25 -07:00
Edward Chen
f236768d5c
[ios] Enable --use_extensions with custom built iOS pod (#16711)
- Fix link errors by including the needed onnxruntime-extensions libraries in the static framework.
- Add Objective-C API to register custom ops from embedded onnxruntime-extensions.

Caveat: Not all onnxruntime-extensions build options are working yet. E.g., building with the onnxruntime-extensions OpenCV dependency does not work.
2023-07-14 15:37:16 -07:00
G. Ramalingam
4faee2e44c
Fix issue in constant-propagation inside function subgraph (#16330)
### Description

The SequenceMap function-op has a graph-attribute. ORT's
constant-folding optimization may identify constant-expressions inside
the subgraph and promote them to constants, stored as initializers in
the main graph. When it does this, the optimization updates the subgraph
to remove the corresponding nodes.

When we expand a SequenceMap node by inlining its function-expansion, we
need to use this updated subgraph. However, the existing code uses the
original graph-attribute (GraphProto), instead of regenerating it from
the modified subgraph. This results in producing a graph with duplicate
definitions for the constant-folded variable, resulting in an error
during graph-resolve.

This PR fixes this issue (just a single line fix), and adds a test-case
to cover this scenario.

---------

Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
2023-07-14 14:44:59 -07:00
Wanming Lin
ea43671eb6
[WebNN EP] Support several activation ops (#16693)
Support Elu, HardSigmoid, HardSwish, Softplus, Softsign, Tanh.
2023-07-14 14:36:15 -07:00
Adrian Lizarraga
a189e76fde
[QNN EP] Fix error handling for Softmax/ReduceOps (#16700)
### Description
- Fix check for Softmax with axis attributes not equal to -1. QNN EP
only supports axis values equal to -1 (or rank - 1).
- Explicit error when Reduce* ops have an input with rank > 4 on HTP
backend (unsupported).
- Correctly filter out partitions that only contain a single
QuantizeLinear or DequantizeLinear node.
- Add tests for the above and clean up unnecessary usage of test
description labels.
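
The Softmax axis rule in the first item can be expressed as a small check. This is a sketch of the rule as stated, not the QNN EP code; `softmax_axis_is_last` is a hypothetical helper name.

```python
def softmax_axis_is_last(axis: int, rank: int) -> bool:
    """Return True when axis refers to the last dimension (-1 or rank - 1)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    if not -rank <= axis < rank:
        raise ValueError("axis out of range for the given rank")
    # Negative axes count from the end, so -1 normalizes to rank - 1.
    normalized = axis + rank if axis < 0 else axis
    return normalized == rank - 1


supported = softmax_axis_is_last(-1, 4)
unsupported = softmax_axis_is_last(1, 4)
```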



### Motivation and Context
Make it easier to debug why a model may not be supported.
2023-07-14 13:47:23 -07:00
Baiju Meswani
9889f0f507
Add support for training apis to support custom ops (#16601) 2023-07-14 11:15:51 -07:00
Adrian Lizarraga
19169afe30
[QNN EP] Add option to skip unit tests in the QNN NuGet packaging pipeline (#16164)
Add option to skip unit tests in the QNN NuGet packaging pipeline.
2023-07-14 10:52:05 -07:00
Dmitri Smirnov
853c4ff0a5
[C#, CPP] Introduce Float16/BFloat16 support and tests for C#, C++ (#16506)
### Description
Introduce `Float16/BFloat16` support for C# and C++ APIs.
Users should be able to perform conversions between `float` and
`Float16/BFloat16`, compare values, and test for `NaN`, `Infinity`, and
whether the number is denormalized.
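
For reference, the classification tests follow directly from the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits). The sketch below is plain Python, not the C# or C++ API; `classify_half` is a hypothetical helper.

```python
EXPONENT_MASK = 0x7C00  # 5 exponent bits
MANTISSA_MASK = 0x03FF  # 10 mantissa bits


def classify_half(bits: int) -> str:
    """Classify a raw binary16 bit pattern, ignoring the sign bit."""
    exponent = bits & EXPONENT_MASK
    mantissa = bits & MANTISSA_MASK
    if exponent == EXPONENT_MASK:
        # All exponent bits set: infinity if the mantissa is zero, else NaN.
        return "nan" if mantissa else "infinity"
    if exponent == 0:
        # Zero exponent: zero if the mantissa is zero, else denormalized.
        return "subnormal" if mantissa else "zero"
    return "normal"


kinds = [classify_half(b) for b in (0x7C00, 0x7E00, 0x0001, 0x3C00, 0x8000)]
```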

### Motivation and Context
User filed issues such as:
https://github.com/microsoft/onnxruntime/issues/14303
2023-07-14 10:46:52 -07:00
Tianlei Wu
77b45c6503
Add Stable Diffusion Benchmark on A100-PCIE-80GB (#16702)
(1) Fix a bug in https://github.com/microsoft/onnxruntime/pull/16560:
the fp16 flag shall be set for UNet.
(2) Remove wget from requirements since it is no longer needed.
(3) Add benchmark numbers for A100-PCIE-80GB. Note that CUDA EP has an
issue running with batch size 4, so that number is not added.
2023-07-14 10:37:00 -07:00
Yi Zhang
36b121d8c2
add more check to Web CI on cache restore (#16689)
### Description



### Motivation and Context
Make sure the data is correct.
2023-07-14 10:00:13 +08:00
mindest
810512c658
[ROCm] TunableOp: add hipBLASLt tuning logic (#16338)
### Description
- Add hipBLASLt tuning logic in place of default hipBLASLt
implementation;
- add kernel explorer for hipBLASLt.

related operators: Gemm, StridedBatchedGemm, and GemmFastGelu.

Temporarily mark algos that require extra workspace as unsupported.
Will add workspace support in later PR, which will change Gemm Params
def and affect multiple files.
2023-07-14 08:20:58 +08:00
Scott McKay
a3fc04ba74
Fix CodeCoverage pipeline (#16684)
### Description
Delete second reference to onnxruntime_api_tests_without_env in the code
coverage commands. One was removed in #16373 and the duplicate wasn't
noticed.



### Motivation and Context
Fix pipeline.
2023-07-14 07:47:04 +10:00
Yulong Wang
d1d65978f6
[js/web] fix file size trim for wasm only .min.js (#16681)
### Description
fix file size trim for wasm only .min.js

minimal build `ort.wasm.min.js` and `ort.wasm-core.min.js` should
exclude JSEP related source code.
2023-07-13 14:20:51 -07:00
Danny Friar
5de2e2fb76
Call lazy_reset_grad in on-device training docs (#16696) 2023-07-13 13:29:54 -07:00
Dipanjan Sengupta
a461608409
Amx flag removal (#16527)
### Description
1. Replacing AMX intrinsics with machine code macros in QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
3. Fixing the link time optimization (LTO) issue introduced with asm
.include of an assembly file.

I have moved the AMX instruction macro definitions from
QgemmU8S8KernelAmxCommon.S to the amx_common.h to fix the LTO issue.
Note that I am also pushing the macros defined in
QgemmU8S8KernelAmxCommon.S for future reference.

A special thanks to @laxmansole who helped in the development of the
instruction macro definitions for AMX intrinsics and fixing the LTO
issue.

### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
the GCC version to use the feature. These changes should allow the usage of
the AMX feature with just the CPU ID check.
2023-07-13 11:19:49 -07:00
Vincent Wang
c07a3b869c
Triton Codegen for ORTModule (#15831)
Fuse connected elementwise and reduce Ops to TritonOp and codegen triton
code to run the kernel.

This PR is co-edited by @wejoncy and @er3x3
2023-07-13 18:17:58 +08:00
Wanming Lin
7cac114e52
[WebNN EP] Support Abs and Neg ops (#16672) 2023-07-13 00:44:22 -07:00
Wanming Lin
d5b76cff60
[WebNN EP] Fixed build error (#16671)
The build break was caused by enabling `-Wshorten-64-to-32` in
https://github.com/microsoft/onnxruntime/pull/16524
2023-07-12 23:37:24 -07:00
mindest
b7fd5af48b
[ROCm] TunableOp: Update rocBLAS get_solutions API (since ROCm5.6) (#16657)
### Description
- Update existing rocBLAS get_solutions API using
`*_get_solutions_by_type` (supported from ROCm5.6); remove the original
nested TunableOp logic.
- Update kernel_explorer.
2023-07-13 11:20:26 +08:00
PeixuanZuo
ebc311365b
[ROCm] Optimize ROCm CI to reduce time (#16620)
This PR mainly optimizes the ROCm CI tests to reduce time and CPU utilization.

- use smaller batch size on strided_batched_gemm/batched_gemm test
- disable cpu training test
- fix occasional test_e2e_padding_elimination failures on ROCm.
2023-07-13 10:58:03 +08:00
cloudhan
af89496fc7
Allow generic pipeline to accept some params for cross attention (#16519)
Allow `GemmSoftmaxGemmPermuteGenericPipeline<T>` to be used in some
cross attention cases, so that rocblas is chosen instead of ck when
rocblas is better for the small problem. The improvement is ~20% e2e time
reduction on some test cases for whisper large.

**Note:** This is because ck has some performance issue if the sequence
length is merely 1, and should be improved in the future.
2023-07-13 09:31:31 +08:00
cloudhan
3866614519
Avoid cmake repeatedly printing DISABLE_FLOAT8_TYPES=ON (#16656) 2023-07-13 09:29:20 +08:00
Yi Zhang
f3b40abe29
Use pipeline cache to cache onnx node test data. (#16659)
### Description
Use pipeline cache instead of reading data from the image.


### Motivation and Context
1. To reduce the browser dependency of custom image.
2. The onnx node test data is less than 30M and the cache download time
is very short.
2023-07-13 09:26:27 +08:00
Rachel Guo
111382746e
[js/rn] Add test for validating "executionProvider" options (#16651)
### Description

As title.

Validation at JS call level in E2E app is not included. Can cover
together in a separate pr.

### Motivation and Context

Test coverage.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2023-07-12 14:55:47 -07:00
Ye Wang
dd7d721f3c
support rotary embeddings in decoder masked self-attention (#16556)
### Description

This PR adds support for rotary embeddings in decoder masked
self-attention

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-07-12 13:48:48 -07:00
Sheil Kumar
0c956bef0a
[WinML] Fix warnings in OnnxruntimeEngine and OnnxruntimeEngineBuilder (#16679)
Fix [prefast:Warning]: C6101 (in
'_winml::OnnxruntimeEngine::CreateTensorValueFromDefaultAllocator')
Fix [prefast:Warning]: C6101 (in
'_winml::OnnxruntimeEngineBuilder::CreateEngine')

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-07-12 13:09:50 -07:00
pengwa
2449ded20f
Use autograd_inlining for model export (#16665)
### Use autograd_inlining for model export

Starting from some versions of PyTorch, there is an issue related to custom
autograd.Function inlining, even though we register a custom export
function for the autograd.Function (e.g. when custom autograd function
is enabled).

As an option, the PyTorch exporter adds a new flag during export so that we
can disable the inlining. https://github.com/pytorch/pytorch/pull/104067

Currently the PyTorch change is only in nightly builds, so this PR dynamically
checks the torch.onnx.export's signature and decides to use
`autograd_inlining` when it exists.
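
The signature check can be sketched with `inspect.signature`. The two exporter functions below are hypothetical stand-ins so the example runs without PyTorch; in real use `export_fn` would be `torch.onnx.export`, and the value passed here (`False`, to disable inlining) is an assumption based on the description above.

```python
import inspect


def export_with_flag(model, path, autograd_inlining=True):
    """Hypothetical exporter that knows the newer keyword."""
    return {"path": path, "autograd_inlining": autograd_inlining}


def export_without_flag(model, path):
    """Hypothetical exporter from an older release."""
    return {"path": path}


def call_export(export_fn, model, path):
    kwargs = {}
    # Only pass the keyword when the callee's signature declares it.
    if "autograd_inlining" in inspect.signature(export_fn).parameters:
        kwargs["autograd_inlining"] = False
    return export_fn(model, path, **kwargs)


new_result = call_export(export_with_flag, object(), "model.onnx")
old_result = call_export(export_without_flag, object(), "model.onnx")
```

Checking the signature rather than the library version keeps the code working on nightly builds whose version strings do not reliably indicate whether the flag exists.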


### Motivation and Context
2023-07-12 20:57:24 +08:00
PeixuanZuo
596dbe277e
[ROCm] add upgrade to fix security issue (#16668) 2023-07-12 17:57:18 +08:00
Yulong Wang
ecca11340a
[js/common] allow creating (u)int64 tensors in 2 ways (#16541)
### Description
allow creating (u)int64 tensors from either a number array or a bigint
array.

before:

```js
// TypeScript thinks this is good, but it actually does not work
// runtime error: Uncaught TypeError: Cannot convert 1 to a BigInt
const myTensor1 = new Tensor('int64', [1, 2, 3, 4], [2, 2]);

// runtime good, but TypeScript thinks myTensor2 is a string tensor
const myTensor2 = new Tensor('int64', [1n, 2n, 3n, 4n], [2, 2]);
```

after:
```js
// both work at runtime and TypeScript populates the correct types
const myTensor1 = new Tensor('int64', [1, 2, 3, 4], [2, 2]);
const myTensor2 = new Tensor('int64', [1n, 2n, 3n, 4n], [2, 2]);
```
2023-07-11 21:07:36 -07:00
Aditya Goel
8e393e0b8c
Unique operator with double (#16359)
### Description
The [ONNX
standard](https://github.com/onnx/onnx/blob/main/docs/Operators.md#type-constraints-181)
permits the `Unique` operator to have `double` input tensor element
type, however this was not supported in onnxruntime. This PR enables
this kernel.

### Motivation and Context
The lack of support for `float64` forces users currently to cast to
`float32` instead. This loss of precision can be severely problematic in
feature engineering pipelines downstream of the `Unique` operator. It
would be good to prevent this by updating ORT to reflect the standard
and support `double` input tensors.
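
A small illustration of the precision problem, using only the standard library: two values that are distinct as `float64` collapse into one after a round trip through `float32`, so a `Unique` over the cast data reports fewer elements. `to_float32` is a hypothetical helper for the demonstration.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


values = [1.0, 1.0 + 1e-10, 2.0]

# As float64, all three values are distinct.
unique_f64 = sorted(set(values))

# After casting to float32, 1.0 + 1e-10 rounds to exactly 1.0.
unique_f32 = sorted(set(to_float32(v) for v in values))
```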

---------

Signed-off-by: Aditya Goel <agoel4512@gmail.com>
2023-07-11 20:24:14 -07:00
Edward Chen
1b8d5c43c2
Fix builds (#16646)
- Fix some more `shorten-64-to-32` warnings
- Move minimum build.py Python version back to 3.6
2023-07-11 19:21:25 -07:00
Scott McKay
ce68a4c06a
Fix Linux build failure when onnxruntime_DISABLE_ABSEIL=ON (#16373)
### Description
Add ort_value.h to session_options.h so OrtValue is defined. 

Update a unit test binary to add required include paths. Adding
ort_value.h pulls in more data type headers.

### Motivation and Context
#16193
2023-07-12 11:23:18 +10:00
Tianlei Wu
2de5807703
Attention fusion for UNet onnx model export from PyTorch 2.* (#16629)
### Description
Tested with stable diffusion unet models exported by pytorch nightly.

Example to run:
```
cd onnxruntime/python/tools/transformers/
python optimizer.py --input unet.onnx  --output unet_fp16.onnx --model_type unet --float16 --opt_level 0
```
2023-07-11 14:35:48 -07:00
Yulong Wang
b4bf7d5044
[js/web/test] accelerate 'npm test' suite0/1 init time (#16558)
### Description
This change reduces the number of calls to globby functions so that it
accelerates the initialization for 'npm test' with suite0/1 tests from
~14sec to <2sec.
2023-07-11 14:34:40 -07:00
Ti-Tai Wang
72076e5320
Update converter registry usage in orttraining_test_dort_custom_ops.py (#16663)
Fix Orttraining Linux Lazy Tensor CI       

Orttraining Linux Lazy Tensor CI is broken.
The error message is
AttributeError: 'OnnxRegistry' object has no attribute 'register'
2023-07-11 12:03:12 -07:00
satyajandhyala
d41bbac7b9
[Web/JS] Added Expand operator support. (#16577)
### Description
Added Expand operator support.



### Motivation and Context
2023-07-11 09:38:16 -07:00
Tommy Au
1b07bbceaa
Update build.bat Prevent spaces in path (#16635)
### Description
Simply add double quotes so the script works when there are spaces in the path.


### Motivation and Context
If there are spaces in the path, the .bat file cannot run and an error occurs.
Adding double quotes fixes this.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-07-11 07:07:08 -07:00