Commit graph

8241 commits

Author SHA1 Message Date
cloudhan
a997bb46b6
Refactor rocm attention (#14688)
Extract QKV projection and attention computation into pipelines (composed from gemms and kernel launch). 

This will allow us to introduce CK flash attention in the next PR.
2023-03-03 12:16:11 +08:00
Changming Sun
f3b6664384
Remove Python 3.7 from the python packaging pipeline (#14887)
### Description
1. Remove Python 3.7 from the python packaging pipeline. It is planned
for the next release and approved by the PMs. Also we will add 3.11, but
it will be addressed in another PR.
2. Stop generating python packages based on Ubuntu 18.04 which will
reach EOL next month. We will either replace them with Ubuntu 20.04 or a
CentOS 8 variant.
2023-03-02 19:44:49 -08:00
guyang3532
c49f250a14
Del ort_model._modules to forward its accessing to torch_model._modules (#14563)
Missing '_modules' attribute in ORTModule will cause load_state_dict for
wrapped_ortmodule to fail.

Reference: https://github.com/microsoft/onnxruntime/pull/7847
2023-03-03 10:12:37 +08:00
Dmitri Smirnov
8d87fdcfa1
Add GetVersionString API for C++, C# and Python (#14873)
### Description
Added APIs.

### Motivation and Context
Addresses https://github.com/microsoft/onnxruntime/issues/14584

Cc: @Craigacp cp
2023-03-02 17:11:07 -08:00
Zachary Streeter
6e2ca15140
added miopenGetConvolutionSpatialDim if ROCm5.5 (#14772)
The API should be `miopenGetConvolutionSpatialDim(cdesc, &spatial_dim)`, NOT `miopenGetConvolutionDescriptorSize(cdesc, &spatial_dim)`
2023-03-03 08:02:32 +08:00
Yuriy Chernyshov
0ebe8e34f8
Do not use ADL to invoke std algos (#14851)
This is a follow up for #14716
2023-03-02 15:38:33 -08:00
Chun-Wei Chen
70a31e047a
Consume ONNX 1.13.1 in ONNX Runtime (#14812)
### Description
Consume ONNX 1.13.1 in ONNX Runtime. (ONNX 1.13.0 to ONNX 1.13.1)


### Motivation and Context
ONNX 1.13.1 patch was just released yesterday. This PR is making ORT's
ONNX submodule consistent with the latest released ONNX. Not sure
whether this PR is really needed, but let me make it ready. Previous PR
for testing ONNX 1.13.1rc2 :
https://github.com/microsoft/onnxruntime/pull/14634.

Fixed
[AB#13174](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13174)
.
2023-03-02 14:57:35 -08:00
G. Ramalingam
2facc5efe6
Fix function inliner bug re. outer-scope names (#14734)
### Description

Fix the function inliner logic for renaming variables. Typically, a
FunctionProto does not contain references to outer-scope names. However,
special cases, such as the function-expansion of SequenceMap, can
generate such FunctionProtos. Extend the renaming logic to ensure that
references to outer-scope names are not renamed.

### Motivation and Context
Fixes https://github.com/onnx/onnx/issues/4892

Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
2023-03-02 12:08:37 -08:00
Chen Fu
603026fb84
Transpose for 16b tensors (#14877)
### Description
Matrix transpose for 16b tensors (shorts, and half precision floats)


### Motivation and Context

Need it for fp16 operations
2023-03-02 11:32:05 -08:00
Rachel Guo
7cd4b334a9
[CoreML EP] Add Flatten Op and LRN Op support (#14857)
### Description

As title.

CoreML Spec for reference: 


https://apple.github.io/coremltools/mlmodel/Format/NeuralNetwork.html#flattento2dlayerparams


https://apple.github.io/coremltools/mlmodel/Format/NeuralNetwork.html#lrnlayerparams

### Motivation and Context

Fill CoreML Clipchamp usage gaps.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2023-03-02 09:43:15 -08:00
Hector Li
bf35ad2aa3
[Qnn EP] Call Qnn deviceCreate during backend setup and deviceFree during shutdown (#14875)
### Description
Call Qnn deviceCreate during backend setup and call deviceFree during
shutdown

### Motivation and Context
Align with Qnn formal setup and shutdown procedure.
2023-03-02 08:54:13 -08:00
cao lei
d69823f764
Do not create Barrier and triggerDownstream steps if the corresponding nodes are split by yield Op in training scenario (#14570)
### Description
Do not create Barrier and triggerDownstream steps during execution plan
creation if the corresponding nodes are split by yield Op in training
scenario.



### Motivation and Context
In the training scenario, the forward and backward passes run two
different partial subgraphs of a graph. If two nodes sit in different
partial graphs and in different streams, triggerDownstream/Barrier steps
are still created between them. These behave quite differently from the
inference case, because one of the steps will not be executed when it is
not in the range being run. To make this work, there is currently a hacky
way to trigger the barrier step explicitly for training.
This PR adds a check and does not create Barrier and triggerDownstream
steps if the corresponding nodes are split by a yield Op in the training
scenario, so the hacky way is no longer needed.
2023-03-02 07:08:29 -08:00
Tianlei Wu
c66af46fc1
Doc for Stable Diffusion CUDA Optimizations (#14830)
Add document for stable diffusion optimizations and benchmark.
2023-03-01 19:29:30 -08:00
Hector Li
c6074f3a4b
OnnxRuntime QNN EP (#14791)
### Description
Integrate Qualcomm QNN SDK to enable inference on QC hexagon NPU devices

### Motivation and Context
Enable Ort inference on QC hexagon NPU devices.

---------

Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
2023-03-01 13:48:20 -08:00
George Wu
6044312a43
fix TRT dockerfile documentation https://github.com/microsoft/onnxruntime/issues/14556 (#14600)
address https://github.com/microsoft/onnxruntime/issues/14556
2023-03-01 07:02:42 -08:00
Scott McKay
b7fde84341
Changes to support standalone custom ops in a minimal build. (#14497)
### Description
Changes to support standalone custom ops in a minimal build. Also
incorporates changes from #14492 (needed to test builds prior to that
being checked in).

We first need to save the schema info from the operators used by the
standalone op invoker in the ORT format model. Add mechanism for that.

Merge the kernel lookup logic so the same is used in full and minimal
build. NOTE: the version matching is now consistent with all other
kernel lookups, and the call to CreateOp MUST use the exact version for
the operator. Previously matching wasn't as strict, but this can lead to
the incorrect kernel being chosen.

Add tests.

NOTE: There is currently no way to detect the ops/types/opsets used
inside these custom ops as they don't exist until we create kernels,
which is after model loading completes (which is the point the ORT
format model is saved). Due to that they have to be manually added to
the configuration used to do the reduced ops build. That shouldn't be
too hard for the custom op author to add given the custom op
implementation is specifying the op, opset and type constraints (i.e.
they have the info and it's just a case of capturing/formatting it
correctly).


### Motivation and Context
Enable usage of the standalone op invoker by custom ops in a minimal
build.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-03-01 11:22:54 +10:00
Chen Fu
acc2ac627f
Fp16 Activations (#14722)
### Description

NEON fp16 SIMD implementation of Activation functions


### Motivation and Context
Step 2 of fp16 SIMD support.

---------

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-28 17:20:40 -08:00
Yulong Wang
69c5edb11b
[wasm] upgrade emsdk from 3.1.19 to 3.1.32 (#14818)
### Description
upgrade emsdk from 3.1.19 to 3.1.32

also add explicit config for stack size (1MB).
2023-02-28 11:06:09 -08:00
Nat Kershaw (MSFT)
95c777b745
Fix link to High Level Design (#11786)
Address #11661
2023-02-28 11:05:54 -08:00
Yi Zhang
6320decf04
increase Test GPU Job's timeout to 8 hours (#14850)
### Description

### Motivation and Context
In practice, 6 hours is not enough to finish the job.
2023-02-28 18:52:03 +08:00
pengwa
79aa0acdd0
SCELoss(SCELossGrad) support half(float) input float(half) output (#13972)
### Description

A follow up change for
https://github.com/microsoft/onnxruntime/pull/13616.

SoftmaxCrossEntropyLossInternal/SoftmaxCrossEntropyLossInternalGrad
support different type for input and output.

Add SCELoss(SCELossGrad) support half(float) input float(half) output

### Test Note

#### Add tests for variant input and output types

To add such tests, the existing testing code for SCE loss and the
SCELossInternal gradient had to be refactored.

Originally:

For FP32 input and output, the CPU kernels run first as the baseline;
CUDA/ROCm then runs with the same data, and CompareOpTester is used to
compare against the CPU results.

For FP16 input and output, the CPU has no half kernels, so the baseline
runs Cast_to_float->CPU kernel->cast_to_half; CUDA/ROCm then runs with the
same data using the half implementation, and CompareOpTester is used to
compare against the CPU results.

Now we want to support running with different input and output types. The
proposed change is to always run the CPU kernels with float input and
output as the baseline (because the CPU only has float kernel
implementations); this step is the very first thing in every test.

Then we run the CUDA/ROCm kernels using half_input_half_output,
float_input_float_output, half_input_float_output, and
float_input_half_output, if a corresponding kernel is registered.

Afterwards, we compare the CUDA/ROCm results with the CPU float baselines.

One thing deserves a special note: CompareOpTester's result comparison can
be looser than OpTester's. Roughly speaking, the former tolerates diff <=
atol + rtol*expected_value, while the latter tolerates diff < atol && diff <
rtol*expected_value. When the expected value is very small, as in many of
our test cases, the former can pass while the latter fails. So the
refactoring also moves the check outside of OpTester and explicitly checks
the values the way CompareOpTester did (to match the previous behaviour).
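
The difference between the two acceptance rules mentioned in this note can be sketched in a few lines of Python. This is only an illustration, not the OpTester or CompareOpTester code: the function names are hypothetical, and taking `abs(expected)` for the relative term is an assumption.

```python
def combined_tolerance_ok(actual, expected, atol, rtol):
    """Looser rule: a single bound that adds the absolute and relative terms."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def separate_tolerance_ok(actual, expected, atol, rtol):
    """Stricter rule: the difference must be below both bounds at once."""
    diff = abs(actual - expected)
    return diff < atol and diff < rtol * abs(expected)


# A tiny expected value makes rtol * expected almost zero, so the
# stricter rule rejects a difference that the looser rule accepts.
expected, actual = 1e-6, 2e-6
print(combined_tolerance_ok(actual, expected, atol=1e-5, rtol=1e-3))
print(separate_tolerance_ok(actual, expected, atol=1e-5, rtol=1e-3))
```

With these values the combined rule accepts the result and the separate rule rejects it, which is the small-expected-value case the note describes.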

### Motivation and Context
2023-02-28 18:02:08 +08:00
Justin Stoecker
08699c8052
Address SDL warnings in recent STFT changes (#14847)
### Description
Addresses two separate SDL warnings, neither of which point to a cause
for concern:
1. `The expression '0<=_Param_(1)&&_Param_(1)<=3-1' is not true at this
call.
at
D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\\DmlSTFT.h@443,33`.
In other words, the tool thinks one of the calls to
`barriers[barrierCount++]` will be an index-out-of-range issue, even
though this is not currently possible. Switching to a normal C array avoids
this complaint.
2. `'_Param_(1)' could be '0': this does not adhere to the specification
for the function 'CD3DX12_RESOURCE_BARRIER::UAV'`. The d3dx12 helper for
UAV barriers has the wrong SAL annotation and doesn't allow a null
resource (`_In_`), even though a null resource is legal and well
defined. Updated the annotation to `_In_out_` and created a PR upstream.

### Motivation and Context
Pacify SDL tasks in CI pipelines.
2023-02-27 21:01:34 -08:00
kailums
9bdd42115c
add build flag for rocblas tune and fix bug (#14797)
### Description
1. Add a build flag for the rocblas tuning feature.

2. Fix a build bug when enabling rocblas tuning.


### Motivation and Context
The rocblas tuning feature has no build flag to control it, only a
MACRO flag.

So I add a build flag, and fix a code bug when enabling rocblas tuning.
2023-02-28 10:37:07 +08:00
Yulong Wang
2d079c6333
[js/web] disable multi-thread test on Node.js in E2E test (#14844)
### Description
disable multi-thread test on Node.js in E2E test.

The multi-thread test on Node.js in the E2E test never worked, but the CI
does not pick up the error every time. So this became a flaky test case
which sometimes causes a build break.

Disable this test for now; it should be re-enabled once it is fixed.
2023-02-27 16:01:51 -08:00
Yi Zhang
0be20dc0f6
Run GPU test job after all CPU test jobs succeed. (#14833)
### Description
Make the GPU job depend on all CPU jobs.

### Motivation and Context
GPU resources are very limited in the packaging pipeline,
and the GPU test job is very time consuming.
If even one CPU job fails, the workflow fails, so the GPU job is
meaningless.
To utilize GPU resources more efficiently, run the GPU job only after all
CPU jobs succeed.

### Test pipeline

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=280905&view=results
2023-02-28 07:44:51 +08:00
Chen Fu
26abaeb284
Fix half precision gemm test accumulation error (#14842)
### Description

Half precision gemm test requirement relaxation

### Motivation and Context

Most CPUs do not support mixed precision accumulation, only fused
mul & add. As a result, different striding on the K dimension may lead to
rounding error. The accumulation of these rounding errors may be very
significant, so setting an approximation ratio does NOT always work.
What's more, a relaxed test condition may hide real implementation
problems. So this is only a compromise fix.

More rigorous tests require manual efforts:
1. Change the K stride of the kernel under test to be 16.
2. Force the K stride of the fp16 kernel to 16
3. Change the test oracle to be exact match.
4. Pass this test and then change it back :-(.
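
The effect of the K stride on accumulated rounding error can be illustrated with a small, self-contained Python sketch. It is not MLAS code: `to_fp16` and `fp16_sum` are hypothetical helpers that emulate half precision accumulation by rounding every intermediate sum through the IEEE half format (`struct` format `e`), and the block size stands in for the K stride.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sum(values, block):
    """Sum values in blocks of `block`, rounding every step to fp16."""
    total = 0.0
    for start in range(0, len(values), block):
        partial = 0.0
        for v in values[start:start + block]:
            partial = to_fp16(partial + v)
        total = to_fp16(total + partial)
    return total


ones = [1.0] * 4096
# One element at a time: the sum stalls once adding 1 no longer changes
# the fp16 value (2048 + 1 rounds back to 2048).
print(fp16_sum(ones, 1))   # 2048.0
# Blocks of 16: every partial and running sum is exactly representable.
print(fp16_sum(ones, 16))  # 4096.0
```

Both calls add the same 4096 values, yet the results differ by a factor of two. This is why a fixed approximation ratio cannot cover every combination of strides, and why the manual procedure above forces both kernels to the same K stride before asking for an exact match.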

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-27 13:23:14 -08:00
Albert Grigoryan
e097e4e93c
Objective-C lib: Added support for int64 and uint64. (#14405)
### Description
Added support for int64 and uint64 in Objective-C lib.

### Motivation and Context
Int64 is rarely used, but we needed it. The Int64 inference worked after
the change (tested).
2023-02-24 23:25:16 -08:00
satyajandhyala
dcc8fe656b
Fix type mismatch in CUDA Trilu op. (#12863)
Added type cast to int64_t to avoid overflow errors/alerts.
2023-02-24 23:24:26 -08:00
Ivan Komarov
9f6d452ca6
Fix ValueError when testing PyTorch performance (#14450)
### Description
Fixed an exception that is thrown inside `transformers` when trying to
test PyTorch performance:
```
> python convert_generation.py -m gpt2 --output gpt2_greedy_search.onnx --num_beams 1 --num_return_sequences 1 --torch_performance
```
2023-02-24 21:39:14 -08:00
Tianlei Wu
6603be26c6
Fix prefast warnings in CUDA ops (#14814)
Fix some prefast warnings in CUDA ops.
2023-02-24 21:23:08 -08:00
cloudhan
3bcdb0a83a
Fix TunableOp signature generation (#14709)
Currently, all generated op sigs are `TunableOp<ParamsT, TimerT>` which
is causing signature collision when using rocblas solution api. This was
not a problem before `TuningContext`, because the `KernelMap`s were not
shared.

The root cause is that the signature is initialized on Op construction,
specifically, only in the base class ctor, which causes the type info to
capture only base class type info, that is, only the ParamsT and the base
class. After this change, we will encode the derived class type in
the op sig.
2023-02-25 12:05:07 +08:00
Chen Fu
bc5d0c83d1
Fix prefast scan complaints (#14823)
### Description

Fix prefast warning
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/12821/


### Motivation and Context

This one is just excessive and annoying. It warned that I should cast
value to 8bytes before using * operator. But the expression is:
(size_t(64) * size_t(1024))

size_t is 8 bytes on 64b systems. And when it is 4 bytes on 32b systems,
the cast is expensive and completely unnecessary.
2023-02-24 17:23:22 -08:00
Yulong Wang
6b83ad9659
[js/web] allow unittest (onnxruntime_test_all) to run in browser (#14820)
### Description
allow onnxruntime_test_all to run in browser for WebAssembly build (use
flag `--wasm_run_tests_in_browser`).

To output the logs from stdout correctly, this test needs to be built
with `--enable_wasm_threads`.
2023-02-24 16:45:33 -08:00
Yulong Wang
a631ed77c0
[js/web] support flag 'optimizedModelFilePath' in session options (#14355)
### Description
* Support flag 'optimizedModelFilePath' in session options.

In Node.js, the model will be saved into filesystem just like its
behaviour on native platforms.

In browser, the new model is not saved to the filesystem; the file path is
ignored. Instead, a new pop-up window will be launched in the browser and
the user can 'save' the file as an ONNX model.

* Add corresponding commandline args for the following session option
flags:
    - optimizedModelFilePath
    - graphOptimizationLevel
2023-02-24 15:50:15 -08:00
Rachel Guo
0700788b6e
Disable e2e android react native CI test temporarily (#14803)
### Description

Disable e2e android react native test temporarily to unblock the CI
failure with no easy fix.

### Motivation and Context

Temp solution to unblock CI failure.
2023-02-24 09:32:18 -08:00
Ted Themistokleous
702a61c3bb
Add verbose and optimization args for parity tests (Gelu, Layernorm, … (#14739)
…GPT_Attention)

Some EPs require that onnxruntime and optimum optimizations are turned
off in order to run correctly. Allowing this option during test runs
allows the EP and library to perform their own optimization and be more
representative of actual use case conditions.

Important for EPs like MIGraphX which require optimizations to be off
for certain operations.

### Description

Allow flags to turn off optimizations and add verbose output to confirm
which EP is being used for the inference run and validate fallbacks

### Motivation and Context

Related to: #14702 & #14700

---------

Signed-off-by: Ted Themistokleous <tthemist@amd.com>
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2023-02-24 18:43:13 +08:00
Tianlei Wu
d3785ef8f6
Fix decoder_attention scratch buffer size and prefast warnings (#14808)
(1) Change `GetScratchBuffer<T>(element_count * element_size)` to
`GetScratchBuffer<T>(element_count)` to avoid allocating more memory
than needed.
(2) Fix prefast:Warning C26451: Arithmetic overflow: Using operator '*'
on a 4 byte value and then casting the result to a 8 byte value.
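
Warning C26451 in item (2) is about the order of multiplication and widening. As a hedged illustration in Python (not the ONNX Runtime code), the sketch below models 4-byte arithmetic as unsigned 32-bit wraparound. The helper names are hypothetical. In C++, signed overflow is undefined behavior rather than a guaranteed wrap, so this only models the typical failure.

```python
def wrap_uint32(value):
    """Keep only the low 32 bits, as a 4-byte unsigned multiply would."""
    return value & 0xFFFFFFFF


def multiply_then_widen(count, size):
    """Multiply in 4 bytes first; widening afterwards cannot recover lost bits."""
    return wrap_uint32(count * size)


def widen_then_multiply(count, size):
    """Widen the operands first; Python ints stand in for 8-byte values."""
    return count * size


count, size = 65536, 65536
print(multiply_then_widen(count, size))  # 0
print(widen_then_multiply(count, size))  # 4294967296
```

Casting one operand to the wider type before applying `*` is the usual way to satisfy this warning, because the product is then computed at full width.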
2023-02-23 21:58:28 -08:00
Justin Stoecker
928289c414
STFT for DML EP (#14736)
### Description
Implements the STFT operator for the DirectML execution provider. This
is implemented as a custom op, just like the DFT kernel, because it's
implemented as a composite of two operators (DML Mul/Identity + DFT). As
such, this inherits the same restrictions as the existing DFT kernel
(requires power-of-two window sizes for now).

This change also adds a native FP16 shader to DFT so that both DFT/STFT
kernels support float16 tensors. There is no typed UAV fallback or
emulation path, so the HW _needs_ to support native float16. It also
appears the stockham shader was compiled with all optimizations disabled
and debug symbols (tsk tsk, Sheil), and this has been fixed.

This is passing all existing STFT tests (i.e. all of 1). I'm adding some
additional collateral in the Windows AI conformance tests in parallel to
check some extra cases.

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
2023-02-23 21:12:22 -08:00
PeixuanZuo
687118a159
[ROCm] Fix Skiplayernorm error and open ke test with optional output (#14794)
Fix Skiplayernorm error and open kernel explorer test with optional
output.
2023-02-24 12:46:42 +08:00
Patrice Vignola
abb51ec975
[DML EP] Fix GetInputTensor crash when accessing null tensor (#14809)
2023-02-23 18:13:40 -08:00
Jian Chen
29428cd9dc
Cjian/pr into main for 1.14.1 fix (#14805)
### Description

PR a change made to 1.14.1 into Main branch as well. 

### Motivation and Context
2023-02-23 18:10:57 -08:00
Dmitri Smirnov
d034b432ec
Add support for handling sbyte (Int8) data in C# inference tests (#14807)
### Description
Add sbyte specific test case support.

### Motivation and Context
C# Test Data loading code and comparators are missing sbyte (Int8)
specializations.
This fails a test
2023-02-23 17:05:28 -08:00
cao lei
a012d60777
Make MemcpyToHost to a separate stream for performance gain (#14487)
### Description
Put MemcpyToHost on a separate stream for a performance gain in the default
DeviceBasedPartitioner.



### Motivation and Context
Our experiments show that putting MemcpyToHost on a separate stream lets
it run in parallel with other kernels, especially the compute-intensive
ones.

---------

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-02-23 14:52:01 -08:00
Nat Kershaw (MSFT)
664e296270
Re-add api:javascript and api:java to the labeler (#14238)
2023-02-23 13:20:33 -08:00
James Yuzawa
d925055a3e
Fix broken and outdated links in documentation (#14092)
### Description

I fixed some broken links in the C API documentation, but then did a
quick pass over all of the links I could find and then fixed those.

### Motivation and Context

I got some 404's when exploring the documentation and wanted to fix it.
2023-02-23 10:48:04 -08:00
Ivan Komarov
16b39e5b87
symbolic_shape_infer.py: Fix slicing a tensor that has a sympy.Min() in its shape (#14384)
### Description

`_infer_Slice()` is a function (arguably the most complex one) in
`symbolic_shape_infer.py` that infers the shape of the output of a
`Slice` node. This commit fixes an edge case in `_infer_Slice()` caused
by a SymPy quirk.

When both the end of the slice (let's call it `e`) and the corresponding
dimension of the sliced tensor (let's call it `dim`) are arbitrary
symbolic expressions, `symbolic_shape_infer.py`
[checks](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1728))
if `e <= dim`. Comparing symbolic expressions is hard in general, so if
the comparison fails, `symbolic_shape_infer.py` [gives
up](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1734))
and assumes that `e` is equal to `dim`.

A failure of this sort currently happens for expressions of the form `Y
- X >= 0` where `Y` contains a `sympy.Min()` (`symbolic_shape_infer.py`
tries to rewrite `X <= Y` comparisons in various ways, and `Y - X >= 0`
is [one of
them](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1664))).
A simple example to illustrate this:
```python
>>> import sympy
>>> X = sympy.Symbol('X', positive=True, integer=True)
>>> 
>>> y1 = 9999
>>> Y1 = X + y1 - 5000
>>> bool(Y1 - X >= 0)
True
>>> 
>>> y2 = X + 4999
>>> Y2 = X + y2 - 5000
>>> bool(Y2 - X >= 0)
True
>>> 
>>> y3 = sympy.Min(y1, y2)
>>> Y3 = X + y3 - 5000
>>> bool(Y3 - X >= 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.9/site-packages/sympy/core/relational.py", line 511, in __bool__
    raise TypeError("cannot determine truth value of Relational")
TypeError: cannot determine truth value of Relational
```

If you assume that `X` is a positive symbol (`symbolic_shape` [does
assume](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L2129))
this for graph inputs), then both `Y1 >= X` and `Y2 >= X` hold, and
SymPy can prove this. This means that `Y3 >= X` also holds (since `Y3`
is essentially equal to either `Y1` or `Y2`, depending on the value of
`X`), but this is too hard for SymPy to prove. I confirmed that this is
still the case for the latest SymPy version (`1.11.1`).

This commit tries to fix this edge case by slightly rewriting the
expression containing `sympy.Min()`. I explain the details in the
comments in `symbolic_shape_infer.py`, so I won't duplicate them in the
PR description.
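
One way to picture a rewrite of this kind is that `Min(a, b) + c >= 0` holds exactly when both `a + c >= 0` and `b + c >= 0` hold, so each argument of the `Min` can be checked on its own. The sketch below is pure Python and is not the code from `symbolic_shape_infer.py`: the `Linear` class and `min_plus_offset_nonneg` are hypothetical, and expressions are restricted to the form `a*X + b` with `X` a positive integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    """The expression a*X + b, where X is an integer symbol with X >= 1."""
    a: int
    b: int

    def provably_nonneg(self):
        # With a >= 0 the smallest value is reached at X = 1, i.e. a + b.
        return self.a >= 0 and self.a + self.b >= 0


def min_plus_offset_nonneg(min_args, offset):
    """Decide Min(*min_args) + offset >= 0 by checking every Min argument."""
    return all(
        Linear(arg.a + offset.a, arg.b + offset.b).provably_nonneg()
        for arg in min_args
    )


# Y3 - X from the example above is Min(9999, X + 4999) - 5000.
min_args = [Linear(0, 9999), Linear(1, 4999)]
print(min_plus_offset_nonneg(min_args, Linear(0, -5000)))  # True
print(min_plus_offset_nonneg(min_args, Linear(0, -5001)))  # False
```

Splitting the `Min` turns one comparison that is hard to decide into two easy ones, matching the observation that `Y1 >= X` and `Y2 >= X` can each be proved separately.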

### Motivation and Context
This sounds like a very contrived example, but it actually appeared in
the wild when we tried to infer shapes for an ONNX graph exported from
PyTorch that used relative-position multihead attention from Fairseq.
The problematic line is
[here](7d050ada7d/fairseq/modules/espnet_multihead_attention.py (L192)).
In our codebase, we have something like `matrix_bd = matrix_bd[:, :, :,
: matrix_ac.size(-1)]` before we add `matrix_ac` and `matrix_bd`.
`matrix_bd` is itself a result of another slice, hence its shape
contains `sympy.Min()`, and the SymPy weirdness described above prevents
`symbolic_shape_infer.py` from correctly inferring the final shape of
`matrix_bd`. Then `symbolic_shape_infer.py` explodes when we try to add
`matrix_ac` and `matrix_bd`, because their shapes are not compatible.

I added a small self-contained unit test to illustrate the problem.
*Without* the fix, `slice_out_cropped` has shape `[N + Min(42, N + 21) -
22]`, and `input` has shape `[N]`, and we get this:
```
> python onnxruntime_test_python_symbolic_shape_infer.py
..................Cannot determine if 22 - N < 0
Unable to determine if N <= N + Min(42, N + 21) - 22, treat as equal
E....
======================================================================
ERROR: test_slice_of_min (__main__.TestSymbolicShapeInferenceForSlice)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dfyz/onnxruntime/onnxruntime/test/python/onnxruntime_test_python_symbolic_shape_infer.py", line 460, in test_slice_of_min
    model = SymbolicShapeInference.infer_shapes(onnx.helper.make_model(graph_def))
  File "/home/dfyz/onnxruntime/onnxruntime/test/python/../../python/tools/symbolic_shape_infer.py", line 2461, in infer_shapes
    raise Exception("Incomplete symbolic shape inference")
Exception: Incomplete symbolic shape inference

----------------------------------------------------------------------
Ran 23 tests in 0.486s

FAILED (errors=1)
```

*With* the fix, both tensors have shape `[N]`, and the test passes.

---------

Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
2023-02-23 15:32:37 +01:00
pengwa
b392d5351f
Fix memory profiler (#14695)
### Fix memory profiler

A follow up fix for PR
https://github.com/microsoft/onnxruntime/pull/13495

In ORTModule training, `PartialExecuteThePlan` is called twice, so we need
to create the log event after the backward graph run completes to collect
the whole training graph's activation info.

Also change some log levels to verbose, to avoid too many logs at log
levels above verbose.

### Motivation and Context
2023-02-23 18:05:21 +08:00
Jian Chen
62ee0c8110
Migrating ORT Extensions from Git submodule to cmake FetchContent (#14298)
### Description

Merging extensions from Git submodule to cmake FetchContent


### Motivation and Context

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Jian Chen <jchen351@MacBook-Pro.local>
2023-02-22 19:42:36 -08:00
Ye Wang
58da3cacdf
support NeoX-style rotary embedding (#14785)
### Description



### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-02-22 18:21:34 -08:00
Yi Zhang
cad7ef93e6
use python 3.9.7 in WindowsAI packaging pipeline (#14766)
### Description
Use python3.9.7 in windowsAI packaging pipeline.


### Motivation and Context
In WindowsAI packaging pipeline, cdpxwin1809 is deprecated and it will
no longer be cached from March 2023.

I used the recommended image
[onebranch.azurecr.io/windows/ltsc2019/vse2022:latest](https://onebranch.visualstudio.com/OneBranch/_wiki/wikis/OneBranch.wiki/4587/Container-Images?anchor=recommendation-for-windows-container-image-userst)
But it always failed to pass arm32 jobs. 
It's very likely a regression in VS2022. 
One user reported a similar issue #14190. 
I've submitted a bug to visual studio.
https://developercommunity.visualstudio.com/t/Compilation-Error-with-VS2022-ARM/10285309.

For `onebranch.azurecr.io/windows/ltsc2019/vse2019:latest`, there's an
exception: `Error : init_sys_streams: can't initialize sys standard
streams`.
It could be solved by updating from python3.7 to python3.9 in the
pipeline.

(https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=279301&view=results)
The root cause might be conflicts between different python versions.
The inherent python in onebranch image is python 3.9.7.

(https://onebranch.visualstudio.com/OneBranch/_wiki/wikis/OneBranch.Code.Wiki/6766/manifest)


### Ref

https://onebranch.visualstudio.com/OneBranch/_wiki/wikis/OneBranch.wiki/4587/Container-Images?anchor=windows-build-images
2023-02-23 09:48:42 +08:00