Commit graph

8594 commits

Author SHA1 Message Date
Edward Chen
4b74cb1741
Make docker command fail if bash command fails. (#15564)
Add `set -e` so that failing bash commands will cause the containing docker command to fail.
2023-04-20 13:38:58 -07:00
Baiju Meswani
46210556f0
BatchnormInternal avoid setting num_channels if input shape is not known (#15544) 2023-04-20 12:57:16 -07:00
Baiju Meswani
11b0a18de6
Add support for cuda 11.8 and python 3.11 for training (#15548) 2023-04-20 12:56:45 -07:00
Justin Chu
1f7c2f724f
Fix lintrunner configurations (#15586)
### Description

- Fix lintrunner configurations to always use `python` instead of
`python3`.
- Set up dependabot
- Moved dependencies to requirements-lintrunner to allow dependabot to
update it similar to https://github.com/onnx/onnx/pull/5124
2023-04-20 08:54:26 -07:00
Adrian Lizarraga
9df96c7d5b
[QNN EP] Fix shape inference of NHWC Resize (#15477)
### Description
Adds schema for NHWC Resize that uses the default ONNX type/shape
inferencing.


### Motivation and Context
The QNN EP requires the Resize operator to be NHWC. Currently, the
Resize operator fails type and shape inference because the current
schema changes the input to NCHW, but the `scales` and `sizes` inputs
remain in NHWC.

This PR adds a schema for NHWC Resize that allows it to use the default
ONNX type/shape inference while still remaining in the internal NHWC
domain.
2023-04-20 07:25:25 -07:00
Scott McKay
446c478fbd
Add iOS Swift Package Manager support (#15297)
### Description
<!-- Describe your changes. -->
Add Swift Package Manager (SPM) support for ORT based on  #14621
- uses the existing objective-c bindings
- some re-organization of the directory structure was required but the
contents of the files are unchanged, apart from adjustments due to file
movements

Add tool for updating ORT native pod used in the SPM package
Update CIs to use ORT native pod from build, and build/test using SPM



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
iOS developers are using SPM as much as cocoapods, so adding SPM means
both are catered for.
2023-04-20 16:18:35 +10:00
Yi Zhang
64b63921a2
Retry the step of Start Android simulator (#15584)
### Description
Add Retry once There's a failure in `Start Android Simulator`. 

### Motivation and Context
`Start Android Simulator` isn't stable enough and the pipeline would
hang.

We could find many instances in
https://dev.azure.com/onnxruntime/onnxruntime/_pipeline/analytics/stageawareoutcome?definitionId=188&contextType=build
2023-04-20 12:06:35 +08:00
Yi Zhang
5b6f79e79b
Improve windows build cache steps (#15537)
### Description
1. Split deps' compilation cache and ort's
2. reduce the caches generation in merge branch.

### Motivation and Context
Reduce pipeline cache stage.
2023-04-20 09:42:22 +08:00
Chen Fu
29d00fb776
Set proper default values for pool attributes (#15559)
### Description
Setting proper default value for attributes of pool operators


### Motivation and Context
Fixed AB#14719

Global pooling and pooling operators usually share the same underlying
implementation. When we detect the operator is global, code for setting
up the attributes is skipped. This may cause un-deterministic behavior.
2023-04-19 17:24:35 -07:00
George Nash
f2889b41c1
[AMX] Update assembler check (#15501)
A recent commit added an assembler check if the ASM dialect was ATT

This unfortunately broke the AMX build for systems that don't have the
ASM-ATT dialect.

This change assumes if the CMAKE_ASM-ATT_COMPILER_ID is not found and
the CMAKE_ASM_COMPILER_ID is "GNU" based on all the other already passed
checks AMX is supported by the compiler and assembler.

### Description




### Motivation and Context
On my build system the recent change to add the ASM-ATT version check
disabled AMX code from the build.

---------

Signed-off-by: George Nash <george.nash@intel.com>
2023-04-19 14:16:26 -07:00
Chen Fu
142220ad87
Fix cmake 3.25 debug info config (#15565)
### Description

https://github.com/microsoft/onnxruntime/pull/15538
Above pull request breaks Windows build on cmake 3.25 or earlier. This
should fix it.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-04-19 09:14:19 -07:00
Yi Zhang
573e4cf95f
[Fix] Python Packaging Pipeline exception. (#15568)
### Description
supplement of #15299

### Motivation and Context
It broke Python Packaging Pipeline since April 12.
2023-04-19 21:57:14 +08:00
PeixuanZuo
59ea35d592
[ROCm] add CK GroupNorm to GroupNormTunable (#15510)
- Add CK GroupNorm to GroupNormTunable.
- Reduce configuration of GroupNormNHWCOp because CK implementation is
better.

The performance gain on stable diffusion v1.5.
Before:
```
'height': 512
'width': 512
'steps': 50
'batch_size': 1
'batch_count': 5
'num_prompts': 1
'average_latency': 2.4782688856124877
'median_latency': 2.4783748388290405
'provider': 'ROCMExecutionProvider'
'disable_safety_checker': True 
```

After:
```
'height': 512, 
'width': 512, 
'steps': 50, 
'batch_size': 1,
'batch_count': 5,
'num_prompts': 1, 
'average_latency': 2.107170510292053,
 'median_latency': 2.1067750453948975,
 'first_run_memory_MB': -1, 
'second_run_memory_MB': -1,
'provider': 'ROCMExecutionProvider', 
'disable_safety_checker': True
```
2023-04-19 13:54:59 +08:00
Dmitri Smirnov
a66af390fa
[C#] Allow passing various options when creating singleton Environment object. (#14723)
### Description
Re-work OrtEnv class so we can pass various options when creating the
environment such as:
- logId
- initial logging level
- thread options
- user supplied logging function

Create the default instance when SessionOptions are instantiated as
users often forget to do so.

### Motivation and Context
We lack this capability.
Inspired by
https://github.com/microsoft/onnxruntime/pull/13822
https://github.com/microsoft/onnxruntime/pull/13951
https://github.com/microsoft/onnxruntime/pull/11593


Cc: @thoron

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-04-18 21:49:55 -07:00
Chi Lo
6115c8fd1f
Add TRT plugins support using custom ops (#13847)
This PR makes ORT support TRT plugin using custom ops. ORT TRT can
automatically register all TRT plugins from TRT plugins registry as
custom ops. There is no code change needed for ORT when new TRT plugins
are introduced.

Previous way for ORT to support TRT plugins was using contrib ops, but
there are some concerns about it:

- Contrib ops are shipped as part of the ORT binary by default. TRT
related plugins should not be in the default ORT.
- Contrib ops are designed for internal ops and developed for cpu and
cuda EPs.

Therefore, using custom ops is a good approach to support TRT plugins. 

Followings are the major modifications:

1. Add new `GetCustomOpDomainList` provider api which allows provider to
create its own custom op domain list and ORT can register this domain
list. Provider has the responsibility to free all the custom op domain
instances it created.
2. Move OrtCustomOpDomain struct definition to
framework_provider_common.h since this struct is being used by framework
and EPs now.
3. There are several TRT plugins registered as onnx schema op through
contrib op with onnx domain. In order not to break the old models using
those TRT plugins which were registered with ONNX domain and maintain
backward compatible, we need to keep the old/legacy TRT plugins with
onnx domain. Moving forward, all newly added TRT plugins should be
registered with `trt.plugins` domain.
4. TRT plugin doesn't have an api to get number of inputs/outputs of the
registered plugins, so ORT TRT uses variadic inputs/outputs to bypass
the onnx node validation.
5. Add new trt provider option, `trt_extra_plugin_lib_paths`, user can
specify any extra plugin lib, for example,
`fastertransformer/build/lib/libvit_plugin.so` or
`fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`
2023-04-18 20:24:32 -07:00
Yulong Wang
cb83d2b1a9
[js/web] allow script to use partial success build (#15547)
### Description
allow script `npm run pull:wasm` to use partial success build.
2023-04-18 17:41:47 -07:00
kunal-vaishnavi
901c2bc384
Whisper Model Optimization (#15473)
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
   - For Whisper's encoder and decoder
- Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
   - For Whisper's decoder and decoder with past
- Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
   - CPU:
      - Different Q and KV sequence lengths
      - Parallel memset for large sequence lengths
- Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
add
- Separate present key-value output into present key and present value
(for multi-head attention spec)
   - CUDA:
- Use memory efficient attention compute kernel with present state (for
decoder)
- Multi-head attention kernel changes
   - CPU:
- Introduction of multi-head attention CPU kernel (previously did not
exist)
- Use AddBiasReshape instead of AddBiasTranspose when sequence length =
1 (for decoder with past)
      - Different Q, K, V input shapes
      - Pass past key and past value directly as key and value
   - CUDA:
- Use memory efficient attention compute kernel with past and/or present
state (for decoder with past)

### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920

### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427

This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
2023-04-18 17:13:54 -07:00
Ye Wang
53d304d4d2
optimize gated gru cuda kernel (#15525)
### Description
<!-- Describe your changes. -->

Improvement with Tulrv6 on A100

![image](https://user-images.githubusercontent.com/52801275/232602055-518726da-3a9a-4e2e-8def-2cd855c8225d.png)


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-04-18 14:23:43 -07:00
Justin Chu
831734a46e
Fix lint errors missed due to new commits (#15558)
Follow up of #15524
2023-04-18 12:55:02 -07:00
Yi Zhang
698e9f71cd
Improve cache hit rate in windows build (#15538)
### Description
1.  Update /Zi to /Z7 in abseil project while using cache
2.  Skip target_precompile_headers while using cache


### Motivation and Context
There're about 1/4 uncacheable calls in Windows GPU compilation with
cache.
```
Uncacheable calls:                   441 / 1641 (26.87%)
  Could not use precompiled header:  361 /  441 (81.86%)
  Preprocessing failed:                1 /  441 ( 0.23%)
  Unsupported compiler option:        79 /  441 (17.91%)
```

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=961916&view=logs&j=5076e696-f193-5f12-2d8a-703dda41a79b&t=9b927034-e3ef-5e25-c6df-387bc37acd63&l=21

The root cause of `Unsupported compiler option` is that /Zi in Abseil
isn't updated to /Z7.
The root cause of `Could not use precompiled header` is the
`target_precompile_headers` creates cmake_pch.pch every time and it's
hash value is changed too.

### Result
It could reduce compilation time by another 20%. 
For example:
It took 16m43 in CUDA training compilation on Windows.
It takes 13m32 after the change.

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=964002&view=logs&s=959c6b43-5937-53e5-5f36-e53cb0249117


### N.B.
In winml project, it's using own target_precompile**d**_header
https://github.com/microsoft/onnxruntime/blob/main/cmake/precompiled_header.cmake.
Just let it be.
2023-04-18 09:31:35 -07:00
Justin Chu
cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or is data heavy


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
Sheil Kumar
2700d01642
Add Bluestein Z-Chirp CPU EP implementation for the DFT operator (#15522)
Add Bluestein Z-Chirp CPU EP implementation for the DFT operator

While the current DFT operator has an FFT implementation for signal
lengths of size 2^N, it currently only has a naive implementation for
completeness sake. The non-power of 2 case is very slow.

The appropriate algorithm to use here is the Bluestein Z-Chirp
algorithm, which evalutates a single DFT with 3 FFT calculations (2
forwards and 1 inverse) and a chirp signal. Luckily, the chirp signal
and one of these FFT operations can be precomputed (B).

The resulting computation performs multiple DFTs on longer signals, but
in the end is faster because the individual sub-DFT computations can
leverage the faster FFT implementation under the hood.

---------

Co-authored-by: stevenlix <38092805+stevenlix@users.noreply.github.com>
2023-04-18 09:06:05 -07:00
liqun Fu
919d8f2660
update with onnx main (#14929) 2023-04-18 08:42:51 -07:00
pengwa
d8dfda2e08
Minor fix for differently scoped cpu_ep usage (#15550)
### Minor fix for differently scoped cpu_ep

cpu_ep is under `#ifndef DISABLE_CONTRIB_OPS`, but one of its usage is
not under the same condition.

```
#ifndef DISABLE_CONTRIB_OPS
  const InlinedHashSet<std::string_view> cpu_ep = {onnxruntime::kCpuExecutionProvider};
#endif
```

### Motivation and Context

Postmoterm: https://github.com/microsoft/onnxruntime/pull/15461 passed
all CIs except Linux/Windows TVM CIs. I did not check the detailed error
message then because they are failed for some reason for a few days at
least. While checking the details, after PR 15461, the error messge
changes from

Before constant sharing change: TVM CI error message:

```
https://github.com/microsoft/onnxruntime/actions/runs/4700368634/jobs/8334955814

ERROR: testBooleanInputs (__main__.TestInferenceSession)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "onnxruntime_test_python.py", line 617, in testBooleanInputs
    sess = onnxrt.InferenceSession(get_name("logicaland.onnx"), providers=available_providers)
  File "D:\a\onnxruntime\onnxruntime\build\Release\Release\onnxruntime\capi\onnxruntime_inference_collection.py", line 383, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "D:\a\onnxruntime\onnxruntime\build\Release\Release\onnxruntime\capi\onnxruntime_inference_collection.py", line 435, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_api.cc:49 onnxruntime::tvm::TVMCompile compile != nullptr was false. Unable to retrieve 'tvm_onnx_import_and_compile'.
```

to 

```
D:\a\onnxruntime\onnxruntime\onnxruntime\core\optimizer\graph_transformer_utils.cc(213,67): error C2065: 'cpu_ep': undeclared identifier [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_optimizer.vcxproj]
D:\a\onnxruntime\onnxruntime\onnxruntime\core\optimizer\graph_transformer_utils.cc(213,19): error C2672: 
```

This PR fixes the build the issue, The error message of Windows/Linux
TVM CIs are back to the original ones.
2023-04-18 16:51:11 +08:00
PeixuanZuo
8bec6cd029
Refactor FusedConv test (#15512)
Refactor FusedConv test.
2023-04-18 15:22:31 +08:00
Justin Chu
9d26f8f4fe
Use os.fspath on Path (#15530)
### Description
<!-- Describe your changes. -->

Use os.fspath instead of str() on a path object. 

### Motivation and Context

I learned today that os.fspath is the right way to go:
https://github.com/charliermarsh/ruff/issues/3675#issuecomment-1494975508
2023-04-17 16:59:40 -07:00
Zhang Lei
a30b57da6e
Fix/Enhance convert_generation tool for SkipLayerNorm, op_block_list... (#15368)
After SkipLayernorm using fp32 for internal calculation and using
numeric stable algorithm, enable it for fp16 here.
Make the op_block_list a command line argument to help future tools.
Other minor changes.
2023-04-17 14:44:37 -07:00
Justin Chu
a36caba073
Bump ruff in CI (#15533)
### Description

Bump ruff version in CI and fixed new lint errors. 

- This change enables the flake8-implicit-str-concat rules which helps
detect unintended string concatenations:
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update gitignore to include common python files that we want to
exclude.


### Motivation and Context

Code quality
2023-04-17 10:11:44 -07:00
cao lei
c2221d919f
create a stream in DeviceStreamCollection for memory pattern (#15426)
### Description
Create a stream in DeviceStreamCollection for memory pattern case to fix
the thread safe issue 15154



### Motivation and Context
This is to fix the bug 15154
https://github.com/microsoft/onnxruntime/issues/15154
2023-04-17 10:06:55 -07:00
Ashwini Khade
8fa65aba0e
enable training tests for csharp bindings (#15513)
### Description
Simple fix to enable training tests in csharp through build.py script.
2023-04-17 09:57:23 -07:00
cloudhan
7ed3bfde51
Fix FusedConv for ROCm (#15460)
1. Fix undesired runtime optimization for non-Relu activation.
3. Fix false positive runtime error log due to fusion failure.
2023-04-17 11:41:00 +08:00
Wei-Sheng Chin
ac6ceffb2c
Force using fixed random seeds for flaky tests (#15515)
Some gradient-related tests fail frequently due to their math
properties. This PR fixes their random seed so that it's possible to
debug in the future.

Fixed
[AB#14605](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/14605),
[AB#14604](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/14604)
2023-04-14 18:44:51 -07:00
Adrian Lizarraga
5ebe700a9b
[QNN EP] Fix pool and conv op tests (#15504)
### Description
- Fixes QNN unit tests for pool and conv ops.
- Temporarily disables QNN Resize tests until we fix type/shape
inferencing for NHWC Resize.

### Motivation and Context
The Linux QNN CI Pipeline has not run unit tests for a week (see
https://github.com/microsoft/onnxruntime/pull/15497). Some tests broke
in the meantime.

Fixed
[AB#14625](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/14625)

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
2023-04-14 13:18:38 -07:00
Maximilian Müller
fbe88fccbd
Exposing new TRT build options (#15089)
### Description

This will add a few TRT options, some of them are only available on TRT
8.6:
- heuristics
- sparsity
- optimization level (8.6 only)
- auxiliary stream (8.6 only)
- tactic source selection

I am no sure yet which tests is should add for these options. As those
are mostly simple TRT flags i am not sure to what level i should test.
For heuristics something similar to
44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)
should be possible for, but for all other essentially we would only be
testing if there is a crash or not if the option is set.
Also if i forgot some option that would be good to have feel free to
speak up !
2023-04-14 09:47:36 -07:00
Yi Zhang
4e1f75810c
Add compilation cache in 2 Linux CPU pipelines and refactor the Linux build step with cache (#15484)
### Description
1. Add compilation cache in Linux CPU ARM and Linux Minimal Build.
2. Integrate 4 Linux CPU build step with cache into one.
3. install ccache from source code in Linux ARM64 image.

### Motivation and Context
1. Enable more build steps with compilation cache.
2. Make it easier to add cache.

It could save 40 more minutes of compilation time in Linux ARM64.

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=959619&view=logs&j=1e0830bb-fd74-5d0a-5029-1c63b4266d7b&t=75260ed7-7566-5947-2095-566660191920
2023-04-14 23:56:59 +08:00
pengwa
bf32dbbd9b
Share more constant initializers (#15461)
### Share more constant initializers.

`ConstantSharing` transformer originally only handle single value
initializer (scalar or 1D).

This PR tried to share more cases to make common subexpression
elimination transformer to remove more duplicated nodes.

Originally, we used a single
vector<std::variant<float,half,int32,int64>> to store different scalar
values. In this PR, we create a unordered map with its key being
data_type + rank + element count, and its value is a vector of
`InitializerValue`.

For one specific initializer, if it fulfils the condition, then finally
will find the corresponding vector of `InitializerValue` by its
<data_type + rank + element count>, then search from the vector whether
the constant tensor already exist or not. After that, a value id is
returned, which will be combined together with <data_type + rank +
element count> to form the pattern key to decide which tensor to reuse
(legacy code).

### Motivation and Context

One example we see here is:

```mermaid
stateDiagram
    [*] --> LayerNorm(b,s,64)
    LayerNorm(b,s,64) --> Reshape1
    Shape1_Const[b*s,64] --> Reshape1

    LayerNorm(b,s,64) --> Reshape2
    Shape2_Const[b*s,64] --> Reshape2


    Reshape1 --> AttentionSubGraph
    Reshape2 -->  Add
    AttentionSubGraph--> Add
   Add --> [*]
```

Ideally CommonSubexpressionElimination can remove one of `Reshape1` and
`Reshape2`, while since `Shape1_Const` and `Shape2_Const` are different
NodeArg*, so it did not remove the duplication.

This is an example: removing the duplication will bring more
opportunities to apply graph transformations.
2023-04-14 07:41:07 -07:00
Changming Sun
f297bbb89b
Fix an indent error in build.py (#15497)
### Description
Fix an indent error in build.py

### Motivation and Context
The problem was introduced in #15395 when I was deleting unused code.
2023-04-14 06:32:46 -07:00
mindest
0fdd356abf
[ROCm] Add hipBLASLt GEMM support to Tunable op. (#15351)
### Description
Add hipBLASLt to GEMM Tunable op, which supports GEMM and
StridedBatchedGEMM.

To enable hipBLASLt implementation, add an extra flag to the building
command: `--cmake_extra_defines onnxruntime_USE_HIPBLASLT=ON`.
2023-04-14 17:56:01 +08:00
Sunghoon
fda0aa14c8
SkipLayerNorm fusion with different input and output type (#15500)
SkipLayerNorm fusion fuses LayerNorm and one or more Add kernels now.
While LayerNormalization kernel allows different input and output type
by definition, SkipLayerNormalization must have the same input and
output type.

This graph is valid as the output of Add node is float16 and two inputs
from initializers are float.


![image](https://user-images.githubusercontent.com/35605090/231874079-3f3b03cc-f751-4ad9-a002-31116a35117f.png)

But, when Add and LayerNormalization are fused, it fails because two
inputs of Add node are float16 type and SkipLayerNormalization must have
the same input types. To avoid this failure, this PR adds Cast node
before inputs of SkipLayerNormalization when input and output type are
different and output type is float. The above graph is fused as follows,


![image](https://user-images.githubusercontent.com/35605090/231874097-6405713a-7c95-4b5b-a293-1305976edc94.png)

For performance, it'd better for SkipLayerNormalization to support
different input and output type, but this PR is to unblock Turing NLR v5
base mode in Babel. When we have more cases, we can support it.
2023-04-13 23:07:47 -07:00
Wei-Sheng Chin
d76cf374c4
Capture both ValueError and RuntimeError (#15503) 2023-04-13 19:29:34 -07:00
Akshay Sonawane
56ad68120e
Add support to use sequence as input ids in decoder inputs to Beam Search CUDA Op (#15232)
Add support to use sequence as input ids in decoder inputs to Beam
Search CUDA Op

### Description
Currently Beam search Op is only supported for CPU EP, added support for
CUDA EP.

### Motivation and Context
- For Turing models inference was throwing segmentation fault due to
copy failing in cuda memory, also beam search support was not present in
cuda.
2023-04-13 13:35:33 -07:00
Changming Sun
5bed8d0285
Disable XNNPack EP's tests in Windows CI pipeline (#15406)
### Description

1. Disable XNNPack EP's tests in Windows CI pipeline
The EP code has a known problem(memory alignment), but the problem does
not impact the usages that we ship the code to. Now we only use XNNPack
EP in mobile apps and web usages. We have already pipelines to cover
these usages. We need to prioritize fixing the bugs found in these
pipelines, and there no resource to put on this Windows one. We can
re-enable the tests once we reached an agreement on how to fix the
memory alignment bug.

2.  Delete anybuild.yml which was for an already deleted pipeline.
3. Move Windows CPU pipelines to AMD CPU machine pools which are
cheaper.
4. Disable some qdq/int8 model tests that will fail if the CPU doesn't
have Intel AVX512 8-bit instructions.
2023-04-13 12:19:32 -07:00
zhijiang
05ec22330f
softmax perf improvement pr2 - import softmax bw (#15199)
when dimension to do softmax is 2048, original ort code will fallback to
cudnn, while with some optimization on ort's softmax_warp_backward, we
can be faster than cudnn implementation.

the ideas to optimize softmax_warp_backward is:
1. instead of saving intermediate result in register, we just recompute
to save resource
2. save the input data in fp16 instead of fp32 to further save resource

the perf numbers:

![image](https://user-images.githubusercontent.com/43435212/227476335-ae0b61c4-cd15-40b7-b743-a956fadaedda.png)

please be noted that when dim to do softmax is less than 2048, nothing
will be changed, so only gives perf number of 2048 case.


add more perf number for smaller batch size

![image](https://user-images.githubusercontent.com/43435212/231676120-c8944b09-a664-43f3-a1e8-dfe729c6e816.png)
2023-04-13 14:57:01 +08:00
mindest
67ac36101c
disable BatchNormalizationGrad test (#15485)
### Description
Temporarily disable BatchNormalizationGrad test due to random failure.

Example:

```
2023-04-12T06:33:24.1593811Z 1: [ RUN ] GradientCheckerTest.BatchNormalizationGrad
2023-04-12T06:33:27.5603881Z 1: D:\a\_work\1\s\orttraining\orttraining\test\gradient\gradient_ops_test.cc(1468): error: Value of: IsErrorWithinTolerance(max_error, error_tolerance)
2023-04-12T06:33:27.5604509Z 1: Actual: false
2023-04-12T06:33:27.5604719Z 1: Expected: true
2023-04-12T06:33:27.5604997Z 1: max_error: 1.776702880859375; tolerance: 0.019999999552965164; ORT test random seed: 2552121240;
2023-04-12T06:33:27.5605266Z 1: Google Test trace:
2023-04-12T06:33:27.5605531Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 8910
2023-04-12T06:33:27.5605843Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 5678
2023-04-12T06:33:27.5606478Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 1234
2023-04-12T06:33:27.8285560Z 1: D:\a\_work\1\s\orttraining\orttraining\test\gradient\gradient_ops_test.cc(1493): error: Value of: IsErrorWithinTolerance(max_error, error_tolerance)
2023-04-12T06:33:27.8286181Z 1: Actual: false
2023-04-12T06:33:27.8286404Z 1: Expected: true
2023-04-12T06:33:27.8286669Z 1: max_error: 1.776702880859375; tolerance: 0.019999999552965164; ORT test random seed: 2552121240;
2023-04-12T06:33:27.8286942Z 1: Google Test trace:
2023-04-12T06:33:27.8287208Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 8910
2023-04-12T06:33:27.8287532Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 5678
2023-04-12T06:33:27.8287849Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 1234
2023-04-12T06:33:51.6368960Z 1: [ FAILED ] GradientCheckerTest.BatchNormalizationGrad (27475 ms)
```
2023-04-13 14:53:47 +08:00
Changming Sun
a22dc65a81
Add a missing header to cuda_common.h (#15489)
### Description

The following three lines are needed before including some cutlass
header files, because cutlass uses "and"/"or" keywords. Generally it
should not be a problem without this header, but nvcc is not strictly
compliant to C++ standard.

```c++
#ifdef __cplusplus
#include <ciso646>
#endif
```

We didn't hit this problem because the above code exists in absl. We
always include absl headers first. However, ABSL recently deleted them!
https://github.com/abseil/abseil-cpp/pull/1246

The cutlass dependency was introduced in #14343 , after we had abseil.
2023-04-12 22:16:59 -07:00
pengwa
516c8e95fa
Optimize SCE loss compute (#15401)
### Optimize SCE loss compute

Compute optimization based on label data sparsity:
- Insert ShrunkenGather before SCELoss node, to filter out invalid
labels for compute.
- Support ShrunkenGather upstream.
- Added test for the above.
- Added flag to enable label sparsity optimization with env var, by
default disabled now. Will enable after comprehensive benchmarking
later.
- Extract common logic into test_optimizer_utils.h/cc from
core/optimizer/compute_optimzier_test.cc, then the common functions can
be shared by both core/optimizer/compute_optimzier_test.cc and
orttraining/core/optimizer/compute_optimzier_test.cc
- Extract common logic into shared_utils.h/cc: `GetONNXOpSetVersion` and
`Create1DInitializerFromVector`


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-04-13 13:02:12 +08:00
Justin Chu
07b64d5275
Remove codecov from requirements-dev.txt (#15487)
### Motivation and Context
It is no longer supported, and we don't really use it.
2023-04-12 18:48:02 -07:00
Patrice Vignola
fd7f0c3cfc
[DML EP] Use ORT node names in DML execution plans (#15411) 2023-04-12 16:44:53 -07:00
G. Ramalingam
e361e3f138
Fix bug in handling of variadics in function schema creation (#15409)
### Description

The code handling variadic parameters when creating a schema for a
function has a minor bug.
The checking logic was nested inside a conditional, instead of being
outside.
Fix the logic, and add a test-case. This bugs manifests itself when the
first parameter in the
variadic list is not an input/output of the enclosing function.

### Motivation and Context

Fixes https://github.com/microsoft/onnxruntime/issues/15404

---------

Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
2023-04-12 14:32:24 -07:00
Yulong Wang
e1e8852213
[build/npm] dump ORT_COMMON_FROM from validation (#15475)
### Description
dump ORT_COMMON_FROM from validation

This writes environment variable ORT_COMMON_FROM for later steps in the
release pipeline to use.
2023-04-12 13:48:19 -07:00