8216 commits

Author SHA1 Message Date
Chen Fu
26abaeb284
Fix half precision gemm test accumulation error (#14842)
### Description

Half precision gemm test requirement relaxation

### Motivation and Context

Most CPUs do not support mixed precision accumulation, only fused
mul & add. As a result, different striding on the K dimension may lead to
different rounding errors. The accumulated rounding error may be very
significant, so setting an approximation ratio does NOT always work.
What's more, a relaxed test condition may hide real implementation
problems. So this is only a compromise fix.
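
To see how the K stride alone can change an fp16 result, here is a small sketch that emulates half-precision rounding with Python's `struct` format `'e'`. The helper names (`to_fp16`, `dot_fp16`) and the all-ones input are illustrative assumptions, not code from the kernel or the test.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def dot_fp16(a, b, k_stride: int) -> float:
    """Dot product with every intermediate rounded to fp16.

    K is walked in blocks of `k_stride`; each block is summed on its own
    and then added to the running total.
    """
    total = 0.0
    for start in range(0, len(a), k_stride):
        block = 0.0
        for x, y in zip(a[start:start + k_stride], b[start:start + k_stride]):
            block = to_fp16(block + to_fp16(x * y))
        total = to_fp16(total + block)
    return total

ones = [1.0] * 4096
print(dot_fp16(ones, ones, k_stride=4096))  # 2048.0: 2048 + 1 rounds back to 2048
print(dot_fp16(ones, ones, k_stride=16))    # 4096.0: every partial sum is exact
```

The exact answer is 4096, so in this contrived case the two strides disagree by a factor of two, which is why a fixed approximation ratio cannot cover every input.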

More rigorous tests require manual effort:
1. Change the K stride of the kernel under test to be 16.
2. Force the K stride of the fp16 kernel to 16
3. Change the test oracle to be exact match.
4. Pass this test and then change it back :-(.

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-27 13:23:14 -08:00
Albert Grigoryan
e097e4e93c
Objective-C lib: Added support for int64 and uint64. (#14405)
### Description
Added support for int64 and uint64 in Objective-C lib.

### Motivation and Context
Int64 is rarely used, but we needed it. The Int64 inference worked after
the change (tested).
2023-02-24 23:25:16 -08:00
satyajandhyala
dcc8fe656b
Fix type mismatch in CUDA Trilu op. (#12863)
Added type cast to int64_t to avoid overflow errors/alerts.
2023-02-24 23:24:26 -08:00
Ivan Komarov
9f6d452ca6
Fix ValueError when testing PyTorch performance (#14450)
### Description
Fixed an exception that is thrown inside `transformers` when trying to
test PyTorch performance:
```
> python convert_generation.py -m gpt2 --output gpt2_greedy_search.onnx --num_beams 1 --num_return_sequences 1 --torch_performance
```
2023-02-24 21:39:14 -08:00
Tianlei Wu
6603be26c6
Fix prefast warnings in CUDA ops (#14814)
Fix some prefast warnings in CUDA ops.
2023-02-24 21:23:08 -08:00
cloudhan
3bcdb0a83a
Fix TunableOp signature generation (#14709)
Currently, all generated op signatures are `TunableOp<ParamsT, TimerT>`, which
causes signature collisions when using the rocblas solution API. This was
not a problem before `TuningContext`, because the `KernelMap`s were not
shared.

The root cause is that the signature is initialized on Op construction,
specifically only in the base class ctor, so the type info
captures only the base class type. That is, only ParamsT and the base
class. After this change, we encode the derived class type in
the op signature.
2023-02-25 12:05:07 +08:00
Chen Fu
bc5d0c83d1
Fix prefast scan complaints (#14823)
### Description

Fix prefast warning
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/12821/


### Motivation and Context

This one is just excessive and annoying. It warned that I should cast
value to 8bytes before using * operator. But the expression is:
(size_t(64) * size_t(1024))

size_t is 8 bytes on 64b systems. And when it is 4 bytes on 32b systems,
the cast is expensive and completely unnecessary.
2023-02-24 17:23:22 -08:00
Yulong Wang
6b83ad9659
[js/web] allow unittest (onnxruntime_test_all) to run in browser (#14820)
### Description
allow onnxruntime_test_all to run in browser for WebAssembly build (use
flag `--wasm_run_tests_in_browser`).

To output the logs from stdout correctly, this test needs to be built
with `--enable_wasm_threads`.
2023-02-24 16:45:33 -08:00
Yulong Wang
a631ed77c0
[js/web] support flag 'optimizedModelFilePath' in session options (#14355)
### Description
* Support flag 'optimizedModelFilePath' in session options.

In Node.js, the model will be saved into filesystem just like its
behaviour on native platforms.

In the browser, the new model is not saved to the filesystem and the file path is
ignored. Instead, a new pop-up window is launched in the browser and the
user can 'save' the file as an onnx model.

* Add corresponding commandline args for the following session option
flags:
    - optimizedModelFilePath
    - graphOptimizationLevel
2023-02-24 15:50:15 -08:00
Rachel Guo
0700788b6e
Disable e2e android react native CI test temporarily (#14803)
### Description

Disable e2e android react native test temporarily to unblock the CI
failure with no easy fix.

### Motivation and Context

Temp solution to unblock CI failure.
2023-02-24 09:32:18 -08:00
Ted Themistokleous
702a61c3bb
Add verbose and optimization args for parity tests (Gelu, Layernorm, GPT_Attention) (#14739)

Some EPs require that onnxruntime and optimum optimizations are turned
off in order to run correctly. Allowing this option during test runs
allows the EP and library to perform their own optimization and be more
representative of actual use case conditions.

Important for EPs like MIGraphX, which require optimizations to be off
for certain operations.

### Description

Allow flags to turn off optimizations and add verbose output to confirm
which EP is being used for the inference run and validate fallbacks

### Motivation and Context

Related to: #14702 & #14700

---------

Signed-off-by: Ted Themistokleous <tthemist@amd.com>
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2023-02-24 18:43:13 +08:00
Tianlei Wu
d3785ef8f6
Fix decoder_attention scratch buffer size and prefast warnings (#14808)
(1) Change `GetScratchBuffer<T>(element_count * element_size)` to
`GetScratchBuffer<T>(element_count)` to avoid allocating more memory
than needed.
(2) Fix prefast:Warning C26451: Arithmetic overflow: Using operator '*'
on a 4 byte value and then casting the result to a 8 byte value.
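
Warning C26451 in item (2) is about a product being computed in 32 bits before it is widened. Below is a minimal sketch of that effect, using `ctypes.c_int32` to emulate 4-byte wraparound in Python. The function names and the sample operands are made up for illustration and do not come from the CUDA code.

```python
import ctypes

def mul_then_widen(a: int, b: int) -> int:
    """Multiply as 32-bit values, then widen: the product wraps before widening."""
    return ctypes.c_int64(ctypes.c_int32(a * b).value).value

def widen_then_mul(a: int, b: int) -> int:
    """Widen each operand to 64 bits first, then multiply."""
    return ctypes.c_int64(ctypes.c_int64(a).value * ctypes.c_int64(b).value).value

print(mul_then_widen(70000, 70000))  # 605032704 (wrapped modulo 2**32)
print(widen_then_mul(70000, 70000))  # 4900000000
```

Casting one operand to the 8-byte type before the multiplication is the usual way to silence the warning and avoid the wraparound.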
2023-02-23 21:58:28 -08:00
Justin Stoecker
928289c414
STFT for DML EP (#14736)
### Description
Implements the STFT operator for the DirectML execution provider. This
is implemented as a custom op, just like the DFT kernel, because it's
implemented as a composite of two operators (DML Mul/Identity + DFT). As
such, this inherits the same restrictions as the existing DFT kernel
(requires power-of-two window sizes for now).

This change also adds a native FP16 shader to DFT so that both DFT/STFT
kernels support float16 tensors. There is no typed UAV fallback or
emulation path, so the HW _needs_ to support native float16. It also
appears the stockham shader was compiled with all optimizations disabled
and debug symbols (tsk tsk, Sheil), and this has been fixed.

This is passing all existing STFT tests (i.e. all of 1). I'm adding some
additional collateral in the Windows AI conformance tests in parallel to
check some extra cases.

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
2023-02-23 21:12:22 -08:00
PeixuanZuo
687118a159
[ROCm] Fix Skiplayernorm error and open ke test with optional output (#14794)
Fix Skiplayernorm error and open kernel explorer test with optional
output.
2023-02-24 12:46:42 +08:00
Patrice Vignola
abb51ec975
[DML EP] Fix GetInputTensor crash when accessing null tensor (#14809)
2023-02-23 18:13:40 -08:00
Jian Chen
29428cd9dc
Cjian/pr into main for 1.14.1 fix (#14805)
### Description

PR a change made to 1.14.1 into Main branch as well. 

### Motivation and Context
2023-02-23 18:10:57 -08:00
Dmitri Smirnov
d034b432ec
Add support for handling sbyte (Int8) data in C# inference tests (#14807)
### Description
Add sbyte specific test case support.

### Motivation and Context
C# test data loading code and comparators are missing sbyte (Int8)
specializations.
This causes a test to fail.
2023-02-23 17:05:28 -08:00
cao lei
a012d60777
Make MemcpyToHost to a separate stream for performance gain (#14487)
### Description
Put MemcpyToHost on a separate stream for a performance gain in the default
DeviceBasedPartitioner



### Motivation and Context
Our experiments show that putting MemcpyToHost on a separate stream lets
it run in parallel with other kernels, especially compute-intensive
ones.

---------

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-02-23 14:52:01 -08:00
Nat Kershaw (MSFT)
664e296270
Re-add api:javascript and api:java to the labeler (#14238)
2023-02-23 13:20:33 -08:00
James Yuzawa
d925055a3e
Fix broken and outdated links in documentation (#14092)
### Description

I fixed some broken links in the C API documentation, but then did a
quick pass over all of the links I could find and then fixed those.

### Motivation and Context

I got some 404's when exploring the documentation and wanted to fix it.
2023-02-23 10:48:04 -08:00
Ivan Komarov
16b39e5b87
symbolic_shape_infer.py: Fix slicing a tensor that has a sympy.Min() in its shape (#14384)
### Description

`_infer_Slice()` is a function (arguably the most complex one) in
`symbolic_shape_infer.py` that infers the shape of the output of a
`Slice` node. This commit fixes an edge case in `_infer_Slice()` caused
by a SymPy quirk.

When both the end of the slice (let's call it `e`) and the corresponding
dimension of the sliced tensor (let's call it `dim`) are arbitrary
symbolic expressions, `symbolic_shape_infer.py`
[checks](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1728))
if `e <= dim`. Comparing symbolic expressions is hard in general, so if
the comparison fails, `symbolic_shape_infer.py` [gives
up](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1734))
and assumes that `e` is equal to `dim`.

A failure of this sort currently happens for expressions of the form `Y
- X >= 0` where `Y` contains a `sympy.Min()` (`symbolic_shape_infer.py`
tries to rewrite `X <= Y` comparisons in various ways, and `Y - X >= 0`
is [one of
them](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1664))).
A simple example to illustrate this:
```python
>>> import sympy
>>> X = sympy.Symbol('X', positive=True, integer=True)
>>> 
>>> y1 = 9999
>>> Y1 = X + y1 - 5000
>>> bool(Y1 - X >= 0)
True
>>> 
>>> y2 = X + 4999
>>> Y2 = X + y2 - 5000
>>> bool(Y2 - X >= 0)
True
>>> 
>>> y3 = sympy.Min(y1, y2)
>>> Y3 = X + y3 - 5000
>>> bool(Y3 - X >= 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.9/site-packages/sympy/core/relational.py", line 511, in __bool__
    raise TypeError("cannot determine truth value of Relational")
TypeError: cannot determine truth value of Relational
```

If you assume that `X` is a positive symbol (`symbolic_shape` [does
assume](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L2129))
this for graph inputs), then both `Y1 >= X` and `Y2 >= X` hold, and
SymPy can prove this. This means that `Y3 >= X` also holds (since `Y3`
is essentially equal to either `Y1` or `Y2`, depending on the value of
`X`), but this is too hard for SymPy to prove. I confirmed that this is
still the case for the latest SymPy version (`1.11.1`).

This commit tries to fix this edge case by slightly rewriting the
expression containing `sympy.Min()`. I explain the details in the
comments in `symbolic_shape_infer.py`, so I won't duplicate them in the
PR description.
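
The exact rewrite used by the commit is described in the source comments and is not reproduced here. As a rough illustration of the underlying idea, the hypothetical helper below (`provably_nonnegative` is not a function in `symbolic_shape_infer.py`) splits on each argument of a `sympy.Min()`: since `Min(a, b)` always equals `a` or `b`, proving the inequality for every argument is sufficient.

```python
import sympy

def provably_nonnegative(expr) -> bool:
    """Return True only if SymPy can prove expr >= 0, splitting on each Min()."""
    expr = sympy.sympify(expr)
    mins = sorted(expr.atoms(sympy.Min), key=str)
    if mins:
        first = mins[0]
        # Min(a, b) is always one of its arguments, so check each case in turn.
        return all(provably_nonnegative(expr.subs(first, arg)) for arg in first.args)
    try:
        return bool(expr >= 0)
    except TypeError:
        # SymPy could not decide the comparison; treat it as "not proven".
        return False

X = sympy.Symbol("X", positive=True, integer=True)
Y3 = X + sympy.Min(9999, X + 4999) - 5000
print(provably_nonnegative(Y3 - X))
```

This is a sufficient check only: a `False` result means "could not prove it", not "the inequality is false".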

### Motivation and Context
This sounds like a very contrived example, but it actually appeared in
the wild when we tried to infer shapes for an ONNX graph exported from
PyTorch that used relative-position multihead attention from Fairseq.
The problematic line is
[here](7d050ada7d/fairseq/modules/espnet_multihead_attention.py (L192)).
In our codebase, we have something like `matrix_bd = matrix_bd[:, :, :,
: matrix_ac.size(-1)]` before we add `matrix_ac` and `matrix_bd`.
`matrix_bd` is itself a result of another slice, hence its shape
contains `sympy.Min()`, and the SymPy weirdness described above prevents
`symbolic_shape_infer.py` from correctly inferring the final shape of
`matrix_bd`. Then `symbolic_shape_infer.py` explodes when we try to add
`matrix_ac` and `matrix_bd`, because their shapes are not compatible.

I added a small self-contained unit test to illustrate the problem.
*Without* the fix, `slice_out_cropped` has shape `[N + Min(42, N + 21) -
22]`, and `input` has shape `[N]`, and we get this:
```
> python onnxruntime_test_python_symbolic_shape_infer.py
..................Cannot determine if 22 - N < 0
Unable to determine if N <= N + Min(42, N + 21) - 22, treat as equal
E....
======================================================================
ERROR: test_slice_of_min (__main__.TestSymbolicShapeInferenceForSlice)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dfyz/onnxruntime/onnxruntime/test/python/onnxruntime_test_python_symbolic_shape_infer.py", line 460, in test_slice_of_min
    model = SymbolicShapeInference.infer_shapes(onnx.helper.make_model(graph_def))
  File "/home/dfyz/onnxruntime/onnxruntime/test/python/../../python/tools/symbolic_shape_infer.py", line 2461, in infer_shapes
    raise Exception("Incomplete symbolic shape inference")
Exception: Incomplete symbolic shape inference

----------------------------------------------------------------------
Ran 23 tests in 0.486s

FAILED (errors=1)
```

*With* the fix, both tensors have shape `[N]`, and the test passes.

---------

Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
2023-02-23 15:32:37 +01:00
pengwa
b392d5351f
Fix memory profiler (#14695)
### Fix memory profiler

A follow up fix for PR
https://github.com/microsoft/onnxruntime/pull/13495

In ORTModule training, `PartialExecuteThePlan` is called twice, so we need
to create the log event after the backward graph run completes in order to
collect activation info for the whole training graph.

Also change some log levels to verbose, to avoid too many logs at levels
above verbose.

### Motivation and Context
2023-02-23 18:05:21 +08:00
Jian Chen
62ee0c8110
Migrating ORT Extensions from Git submodule to cmake FetchContent (#14298)
### Description

Merging extensions from Git submodule to cmake FetchContent


### Motivation and Context

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Jian Chen <jchen351@MacBook-Pro.local>
2023-02-22 19:42:36 -08:00
Ye Wang
58da3cacdf
support NeoX-style rotary embedding (#14785)
### Description



### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-02-22 18:21:34 -08:00
Yi Zhang
cad7ef93e6
use python 3.9.7 in windowai packaging pipeline (#14766)
### Description
Use python3.9.7 in windowsAI packaging pipeline.


### Motivation and Context
In WindowsAI packaging pipeline, cdpxwin1809 is deprecated and it will
no longer be cached from March 2023.

I used the recommended image
[onebranch.azurecr.io/windows/ltsc2019/vse2022:latest](https://onebranch.visualstudio.com/OneBranch/_wiki/wikis/OneBranch.wiki/4587/Container-Images?anchor=recommendation-for-windows-container-image-userst)
But it always failed to pass arm32 jobs. 
It's very likely a regression in VS2022. 
One user reported a similar issue #14190. 
I've submitted a bug to visual studio.
https://developercommunity.visualstudio.com/t/Compilation-Error-with-VS2022-ARM/10285309.

For `onebranch.azurecr.io/windows/ltsc2019/vse2019:latest`, there's an
exception: `Error : init_sys_streams: can't initialize sys standard
streams`.
It could be solved by updating from python3.7 to python3.9 in the
pipeline.

(https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=279301&view=results)
The root cause might be conflicts between different python versions.
The python built into the onebranch image is python 3.9.7.

(https://onebranch.visualstudio.com/OneBranch/_wiki/wikis/OneBranch.Code.Wiki/6766/manifest)


### Ref

https://onebranch.visualstudio.com/OneBranch/_wiki/wikis/OneBranch.wiki/4587/Container-Images?anchor=windows-build-images
2023-02-23 09:48:42 +08:00
Tianlei Wu
5c8325a6cb
Add Environment Variables for cuda tensor dumper (#14780)
(1) Add two environment variables to configure the cuda dumper:
`ORT_TENSOR_SNIPPET_THRESHOLD` and `ORT_TENSOR_SNIPPET_EDGE_ITEMS`
(2) Move IConsoleDumper definition to a separate file console_dumper.h.
2023-02-22 15:59:10 -08:00
kunal-vaishnavi
460b3ff4fd
Update pattern matching for EmbedLayerNormalization fusion (#14344)
### Description
This PR addresses the case where an optional Gather node is in the
subgraph pattern. The optional node is now fused with the other nodes
matched in the pattern to create an EmbedLayerNormalization node.



### Motivation and Context
The original subgraph pattern is
```
                      Gather    Gather
                           \   /
                            Add
                             |           
                     LayerNormalization
                             |           
                          Attention
                             |  
                            ...
```
and the new subgraph pattern is
```
                      Gather    Gather
                           \   /
   Gather (optional)        Add
                   \         |           
                     LayerNormalization
                             |           
                          Attention
                             |  
                            ...
```
2023-02-22 12:57:14 -08:00
Edward Chen
b3b9be19b1
Update clang-tidy path for updated Mac image. (#14760)
Update clang-tidy path for updated Mac image. Fix Objective-C static analysis build.
2023-02-22 09:00:42 -08:00
Xavier Dupré
df45b12fdf
Reduce memory usage for TreeEnsemble operators (#14670)
### Description
The onnx file is about 5Mb for a lightgbm model with 500 trees.
onnxruntime uses an additional 10Mb to compute the inference and keeps the
onnx structure. This PR reduces the memory usage by almost 50%. The
memory used by the onnx node could be freed if there is no optimized
graph to save but that's not covered by this PR.

### Motivation and Context
Reduce memory usage.
2023-02-22 12:20:02 +01:00
Tianlei Wu
262e46e8ce
Update stable diffusion benchmark script (#14759)
Update stable diffusion benchmark script:
(1) Test GPU memory usage
(2) Change diffusers version to 0.13, and add support of PyTorch 2.0
including compile
(3) Add support of xformers
(4) Output result to CSV file

Example to run PyTorch 2.0 with torch.compile:
```
pip3 install numpy --pre torch --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117
export TRITON_PTXAS_PATH=/usr/local/cuda-11.7/bin/ptxas
python benchmark.py -e torch -v 1.5 -c 5 -n 1 -b 1 --enable_torch_compile
```
2023-02-21 23:37:38 -08:00
Sheil Kumar
1b7f65437e
Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442)
Enable Opset11 Sequence Ops on DirectML, and make the CPU
implementations agnostic to backend EP

Opset 11 introduced the following sequence related operators:
    - SequenceAt
    - SequenceConstruct
    - SequenceEmpty
    - SequenceLength
    - SequenceErase
    - SequenceInsert 
    - ConcatFromSequence

With the exception of ConcatFromSequence, all of the above operators
were implemented with CPU kernels that a) required all of the contained
tensors to also be on CPU, and b) would clone each tensor into a new
sequence as a side effect of each operator. The implementation of
sequences is backend agnostic, as it doesn't affect actual tensor layout
or manipulate the contents of the tensors. In addition, with the
exception of SequenceAt, the other operators need not make copies of the
underlying referenced tensors.

Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copy the contents
of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors agnostic to
backend.
3) The internal container implemented by TensorSeq has changed from
onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor
does not support copy or assignment construction, so it must have a
single owner. However, if the same tensor participates in multiple
containers, it would have multiple container "owners", which would not
be possible.
4) Other code that accessed values from TensorSeq has associated
changes to extract Tensors from OrtValues now.

In addition, DirectML execution was very slow when the above Sequence
operators were added to a graph, as this caused MemcpyToHost and
MemcpyFromHost kernels to be inserted between the graph and the sequence
operators. To optimize DirectML,
1) The CPU implementations for the Sequence* ops were registered as DML
implementations. Since the above changes also include making the CPU
kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML.
However, there was little DirectML EP operator framework support for
operators that accept/output sequences of tensors. This change has
modified the internal COM interfaces to include new apis to interrogate
for sequence shapes, and extract the needed tensors from TensorSeq.

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
2023-02-21 18:08:28 -08:00
Vincent Wang
e9ec4c098b
[CUDA] Fix FP16 Precision for Sigmoid Op (#14727)
The current Sigmoid CUDA kernel uses the target data type for all computation.
For some small negative numbers, FP16 loses precision.
For example, for input [-7.8477, 7.3320, -7.8008, 6.6016], the expected
output is [3.9047e-04, 9.9935e-01, 4.0919e-04, 9.9864e-01], but the current
kernel generates [0.0000, 0.9990, 0.0000, 0.9990]. If a
sub-graph contains Sigmoid, such as BinaryCrossEntropyWithLogits, it is
likely to produce NaN as the result.

The PR fixes this by using FP32 for the kernel's internal computation. Note
that the fix does not cause a perf regression: CUDA's _Exp already does
float-to-half casting, so the fix doesn't introduce an extra cast. We move
the casts to the very beginning and end of the kernel so that the other parts
of the computation are also in FP32 (instead of only Exp).
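
The effect can be reproduced outside CUDA with a small sketch that emulates fp16 rounding through Python's `struct` format `'e'`. The two-branch sigmoid formulation below is an assumption chosen for illustration (it is not copied from the kernel), and Python doubles stand in for FP32 in the wide path.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def sigmoid(x: float, rnd) -> float:
    """Two-branch sigmoid; `rnd` is applied after every intermediate step."""
    if x > 0:
        return rnd(1.0 / rnd(1.0 + rnd(math.exp(-x))))
    return rnd(1.0 - rnd(1.0 / rnd(1.0 + rnd(math.exp(x)))))

def sigmoid_all_fp16(x: float) -> float:
    """Every intermediate is rounded to fp16."""
    return sigmoid(to_fp16(x), to_fp16)

def sigmoid_wide_internal(x: float) -> float:
    """Cast in, compute in wider precision, cast only the final result to fp16."""
    return to_fp16(sigmoid(to_fp16(x), lambda v: v))

for value in (-7.8477, 7.3320, -7.8008, 6.6016):
    print(value, sigmoid_all_fp16(value), sigmoid_wide_internal(value))
```

With this formulation the all-fp16 path returns exactly 0.0 for the two negative inputs, because `1 + exp(x)` rounds to 1.0 in half precision, while the wide path keeps a small positive value.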
2023-02-22 09:16:22 +08:00
Christian Veenhuis
9fbb2b4742
Fix broken link in onnxruntime_c_api.h (#14748)
### Description
Fix the broken link in header file onnxruntime_c_api.h w.r.t. the graph
optimization levels (line 300).

### Motivation and Context
This fix solves open issue #14741
2023-02-21 15:07:06 -08:00
Scott McKay
b234df3dd0
Refactor the cost check used by the transpose optimizer (#14690)
### Description
Refactor the cost check used by the transpose optimizer to separate out
ORT specific logic.

Change the post-layout transform optimization to only skip the cost
check when moving the layout transform nodes around. Fall back to the
normal cost check for all other transpose nodes.

Clean up some const correctness.

Refactor usage of ResizeHandler slightly so the clang-formatting is
nicer.

### Motivation and Context
Address performance issue seen in SNPE model where a non-layout
transpose node was moved. See
https://github.com/microsoft/onnxruntime/pull/14547 for more details.
Improve separation between generic transpose optimization code and any
ORT specific code.
2023-02-22 08:56:29 +10:00
Edward Chen
755161100a
Fix CoreML API usage memory leak. (#14738)
- Fix CoreML API usage memory leak by putting CoreML API prediction call in an `@autoreleasepool` block as suggested in #14455 and [here](https://developer.apple.com/forums/thread/692425). Conservatively wrapping all CoreML API usage.

- Use MLModelConfiguration.computeUnits instead of deprecated MLPredictionOptions.usesCPUOnly (originally in #11382).
2023-02-21 14:08:03 -08:00
Yuriy Chernyshov
973aaf110b
Improve compatibility with certain STL's
We use customized libc++ which uses raw pointers as std::vector::iterators.

As per [expr.pre.incr](https://eel.is/c++draft/expr.compound#expr.pre.incr), builtin `operator++` can only be applied to lvalue, while `std::vector::begin()` returns an rvalue.

See [this](https://godbolt.org/z/d3a1aKTWP) godbolt snippet for the details.
2023-02-21 14:06:16 -08:00
RandySheriffH
e6a8e6c438
Drop the test folder accidentally added (#14718)
The test folder was accidentally added with zero usage in main, let's
drop it.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-02-21 13:59:06 -08:00
fxmarty
f76ff8c558
Initialize bias_weight in fusion_skiplayernorm.py (#14751)
As per title, fixes
https://github.com/microsoft/onnxruntime/issues/13625

Encountered the issue when using the optimization with codegen model.
2023-02-21 10:42:08 -08:00
Tianlei Wu
c0d2472ede
Disable fused causal attention (#14732)
There is an accuracy regression in the GPT-2 model. The top1 match rate (vs. the PyTorch
model) drops by about 1%. The cause is that the fused causal attention uses fp16
accumulation. Disable it by default and add an environment variable
ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1 to turn it on manually.
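
A minimal sketch of opting back in from Python. It assumes the variable is read by the native library from the process environment, so it is set before any session is created. Session creation itself is omitted here.

```python
import os

# Assumption: the flag has to be present in the process environment before the
# model is loaded, so set it at the very start of the script.
os.environ["ORT_ENABLE_FUSED_CAUSAL_ATTENTION"] = "1"

# Creating the inference session on a CUDA build would follow here.
print(os.environ["ORT_ENABLE_FUSED_CAUSAL_ATTENTION"])
```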

This change also updates the GPT-2 parity test script to generate left side
padding to reflect the actual usage.

To test:
```
python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m gpt2 --output gpt2.onnx -o -p fp16 --use_gpu
```
The top1-match-rate in the output is on-par with ORT 1.13.1.
2023-02-21 09:53:31 -08:00
PeixuanZuo
25e10f413e
[FIX] USE_COMPOSABLE_KERNEL is not defined on ROCm5.2.3 (#14750)
Fix build failure on ROCm5.2.3
2023-02-21 11:13:37 +08:00
cao lei
3d79b1f06e
Create new stream for data copy for IOBinding input scenario (#14719)
### Description
Create new stream for data copy for IOBinding input scenario



### Motivation and Context
Previously in bindInput(), a nullptr Stream was passed to copy data across
devices. This caused the default stream to be used, which hurt
performance.
This PR fixes https://github.com/microsoft/onnxruntime/issues/14484

---------

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-02-20 09:47:57 -08:00
Yi Zhang
1ea360148f
restore opset18 test (#14677)
### Description
Reenable disabled opset18 tests


### Motivation and Context
2023-02-20 18:19:10 +08:00
pengwa
fbf5d09a0c
Fix random failure of ortmodule_api.py::test_unused_parameters (#14729)
### Fix random failure of ortmodule_api.py::test_unused_parameters

Fix FAILED
orttraining_test_ortmodule_api.py::test_unused_parameters[model1-none_pt_params1]
for orttraining-linux-gpu-ci-pipeline CI pipeline

```
=================================== FAILURES ===================================
________________ test_unused_parameters[model1-none_pt_params1] ________________

model = UnusedMiddleParameterNet(
  (fc1): Linear(in_features=784, out_features=500, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=500, out_features=400, bias=True)
  (fc3): Linear(in_features=500, out_features=10, bias=True)
)
none_pt_params = ['fc2.weight', 'fc2.bias']

    @pytest.mark.parametrize(
        "model, none_pt_params",
        [
            (UnusedBeginParameterNet(784, 500, 400, 10), ["fc1.weight", "fc1.bias"]),
            (UnusedMiddleParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]),
            (UnusedEndParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]),
        ],
    )
    def test_unused_parameters(model, none_pt_params):
        device = "cuda"
    
        N, D_in, H1, H2, D_out = 64, 784, 500, 400, 10
        model = model.to(device)
        ort_model = ORTModule(copy.deepcopy(model))
    
        # Make sure model runs without any exception
        for _ in range(5):
            x = torch.randn(N, D_in, device=device)
            y = copy.deepcopy(x)
    
            out_pt = model(x)
            out_ort = ort_model(y)
            loss_pt = out_pt.sum()
            loss_pt.backward()
            loss_ort = out_ort.sum()
            loss_ort.backward()
            _test_helpers.assert_values_are_close(out_ort, out_pt)
>           _test_helpers.assert_gradients_match_and_reset_gradient(ort_model, model, none_pt_params=none_pt_params)

orttraining_test_ortmodule_api.py:4050: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_test_helpers.py:216: in assert_gradients_match_and_reset_gradient
    assert_values_are_close(ort_param.grad, pt_param.grad, rtol=rtol, atol=atol)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

```

Initially the test ran fine. As more and more tests were inserted,
the randomly generated data seen by
ortmodule_api.py::test_unused_parameters changed, and it is now more likely
to generate input data whose result breaks the existing rtol and atol.

In the failing example, the value 0.1041 differs only slightly, e.g. abs_diff:
2.2649765014648438e-06.
> torch.allclose judges the values not equal because abs_diff > 0.1041 *
rtol + atol = 1.041e-1 * 1e-5 + 1e-6 = 2.041e-6.
> Additionally, according to the math
[here](7b31bcda2e/orttraining/orttraining/test/python/_test_helpers.py (L230)),
the maximum atol is 1.2238311910550692e-06 > current atol (1e-6), and the maximum
rtol is 1.2149855137977283e-05 > current rtol (1e-5).

This PR loosens atol to 1e-5 and rtol to 1e-4.
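
The arithmetic above can be checked with a few lines of plain Python. The helper name `within_tolerance` is made up for this sketch. It applies the closeness criterion documented for `torch.allclose`, `|a - b| <= atol + rtol * |b|`, to the numbers quoted in this description.

```python
def within_tolerance(abs_diff: float, expected: float, rtol: float, atol: float) -> bool:
    """Closeness criterion: abs_diff <= atol + rtol * |expected|."""
    return abs_diff <= atol + rtol * abs(expected)

abs_diff = 2.2649765014648438e-06
expected = 0.1041

print(within_tolerance(abs_diff, expected, rtol=1e-5, atol=1e-6))  # False: bound is 2.041e-6
print(within_tolerance(abs_diff, expected, rtol=1e-4, atol=1e-5))  # True: bound is 2.041e-5
```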
2023-02-20 18:09:53 +08:00
Edward Chen
ad78579b66
Update java/build.gradle to not use deprecated features that were removed in gradle 8.0. (#14733)
### Description

Update java/build.gradle to not use deprecated features that were
removed in gradle 8.0.
Also move gradle wrapper setup from a script into a step template.

### Motivation and Context

Fix builds which use hosted Mac agents and gradle.

Recently the system version of gradle got upgraded to 8.0. Even though
we use an older gradle wrapper version, java/build.gradle is still
processed with gradle 8.0 in the initial call to `gradle wrapper`.
2023-02-20 11:19:49 +08:00
Erick Muñoz
8372c86e7f
[oneDNN] Update to oneDNN v3.0 (#14267)
### Description
Update oneDNN version from 2.7 to 3.0



### Motivation and Context
2023-02-17 09:56:29 -08:00
Wei-Sheng Chin
7b31bcda2e
Disable LazyTensor-ORT Test (#14703)
As title since LazyTensor is replaced by Dynamo in PyTorch 2.0.
2023-02-17 17:46:51 +08:00
Hector Li
0dda42b46c
Enable some ops for QDQ Node Unit support (#14701)
### Description
Enable some ops for QDQ Node Unit support: Flatten, Split,
GlobalAveragePool, ReduceMean, Relu, Sigmoid, Sqrt, Div, Mul, Pow, Sub.
2023-02-16 17:14:31 -08:00
Ryan Hill
892f59b31a
Add string support to tile op (#14686)
### Description
Add std::string tensor type support to Tile operator


### Motivation and Context
Multiple users are hitting this missing feature:
https://github.com/microsoft/onnxruntime/issues/14511
2023-02-16 14:59:44 -08:00
Baiju Meswani
ae205a7924
QAT POC tutorial (#14577)
2023-02-16 14:38:18 -08:00
Baiju Meswani
4e686a9a7d
Support building a QAT onnx model using onnxblock (#14551)
2023-02-16 14:38:01 -08:00