Commit graph

8653 commits

Author SHA1 Message Date
Yulong Wang
a02c885f86
[js/webgpu] add implementation of Relu, LeakyRelu and ThresholdedRelu (#15668)
### Description
add implementation of Relu, LeakyRelu and ThresholdedRelu
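
For reference, the three activations can be sketched as scalar functions. This is a minimal Python sketch that follows the ONNX operator definitions (default `alpha` of 0.01 for LeakyRelu and 1.0 for ThresholdedRelu). It is not the WebGPU shader code added by this PR, and the function names are illustrative.

```python
def relu(x):
    # Relu: y = max(0, x)
    return x if x > 0.0 else 0.0


def leaky_relu(x, alpha=0.01):
    # LeakyRelu: y = x for x >= 0, alpha * x otherwise
    return x if x >= 0.0 else alpha * x


def thresholded_relu(x, alpha=1.0):
    # ThresholdedRelu: y = x for x > alpha, 0 otherwise
    return x if x > alpha else 0.0
```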
2023-04-26 15:11:01 -07:00
Justin Chu
76ddc92fbd
Enable RUFF as a formatter (#15699)
### Description

RUFF can now format since lintrunner-adapters v0.8. Removed the RUFF-FIX
linter.



### Motivation and Context

Better engineering
2023-04-26 14:04:07 -07:00
Yufeng Li
d7ba9814cf
[prefast:Warning]: C26409 ('PackedAttention<onnxruntime::MLFloat16>::TryGettingFusedRunner') (#15663)
### Description



### Motivation and Context
2023-04-26 14:03:36 -07:00
Patrice Vignola
97c4cab6b7
[DML EP] Massage SkipLayerNorm axes to better target metacommands (#15676)
DML's MVN metacommand needs all axes except for batch and channel to be
reduced. By adding trailing dimensions of 1's and their corresponding
axes, the operation stays the same but we are now able to call
metacommands.
2023-04-26 14:00:36 -07:00
Hector Li
4c7b5032da
[QNN EP]Support unpack initializer from external data source (#15694)
### Description
Support unpacking initializers from an external data source.

### Motivation and Context
Support unpacking initializers from an external data source.
2023-04-26 13:39:40 -07:00
yf711
28985c47b7
[TensorRT EP] Unleash opset16-17 onnx model tests (#15657)
### Description
In 2021 we restricted ONNX node test CI execution for ORT-TRT to opsets
14-15, the latest opsets that the TRT EP could support at the time.

Update this range to opsets 14-17 to improve ORT-TRT unit test
coverage, as [Nvidia announced that TRT 8.6 supports
opset 17](https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md).
2023-04-26 11:44:19 -07:00
kunal-vaishnavi
cfb8c0e2ca
Add Whisper custom export to wheel (#15685)
### Description
This PR adds the Whisper custom export scripts to the wheel.



### Motivation and Context
This enables access to the custom export scripts in the wheel.
2023-04-26 10:45:52 -07:00
yf711
d701dcd027
Fix Linux MultiGPU TensorRT CI (#15697)
### Description
* Reverting default TensorRT version to 8.5 as temporary fix
  
* Apart from that, this PR temporarily leaves this CI as a place to
validate user behavior that uses TRT 8.5 with latest ORT

### Context
* This CI pool is equipped with 2x Tesla M60 GPUs, which are no longer
supported by TensorRT 8.6.
* Currently, other CIs use a single-T4 VM, but there is no VM with
2x T4 or another suitable dual-GPU configuration in the range.
* Once we decide which VM instance to migrate this CI to, TRT 8.6 can
be enabled on this CI

* According to
[Nvidia](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html):
* TensorRT 8.5.3 was the last release supporting NVIDIA Kepler (SM 3.x)
and NVIDIA Maxwell (SM 5.x) devices. *These devices are no longer
supported in TensorRT 8.6*. NVIDIA Pascal (SM 6.x) devices are
deprecated in TensorRT 8.6.
2023-04-26 10:01:33 -07:00
PeixuanZuo
0ecfe83932
[ROCm] add beam search support (#15625)
add beam search support for ROCm EP.
2023-04-26 17:53:33 +08:00
Xavier Dupré
699c9a520b
Fix TVM pipelines (#15653)
### Description
Fix TVM pipelines by adding the missing TVM dependency (attrs).
2023-04-26 09:55:05 +02:00
Yulong Wang
b98317b907
[js/webgpu] following up for JSEP/WebGPU code cleanup (#15666)
### Description
This PR resolves some of the non-critical comments from the code review
of #14579.

- use `USE_JSEP` instead of `USE_JS` in build definition to make it less
ambiguous
- remove unused util functions from util.ts
- fix transpose.h
- other misc fixes
2023-04-25 21:20:03 -07:00
sfatimar
ebaafac3f5
Openvino ep ort 5.0 (#15626)
### Description
This PR adds VPU support to the OpenVINO Execution Provider, along with:
- Bug fixes for GPU and CPU.
- Changes to the OpenVINO backend in the Serialized Model API for faster
first inference latency.
- Deprecation of HDDL-VADM and MYRIAD, with the related code removed.
- Support for OpenVINO 2023.0.
- Dynamic shapes support for iGPU.

### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-04-25 20:59:42 -07:00
Changming Sun
b1b6e5522e
Update cuda 11.6 to 11.8 for Windows pipelines (#15684)
### Description
Update CUDA 11.6 to 11.8 for Windows pipelines.
This PR is just for Windows CUDA pipelines. It does not include any change
for Linux pipelines or TensorRT pipelines.

### Motivation and Context
It is a planned feature for the upcoming ONNX Runtime release.
2023-04-25 20:23:57 -07:00
Rui Ren
db6a9bc033
support latest deepspeed version for optim (#15682)
### Description

support the latest deepspeed 0.9.1 for the next release


### Motivation and Context
This avoids the warning message `Skip modifying optimizer because of
unsupported DeepSpeed version`

---------

Co-authored-by: ruiren <ruiren@microsoft.com>
2023-04-25 20:12:23 -07:00
Hector Li
3dc9720cfc
[QNN EP] Enable Qnn EP op support Elu, HardSwish, Atan (#15681)
### Description
Enable some Ops for QNN EP: Elu, HardSwish, Atan

### Motivation and Context
unblock more models
2023-04-25 20:11:06 -07:00
Wei-Sheng Chin
1524f73a09
Implement two easier random tensor generator (RTG) for flaky tests (#15517)
Some math ops have very poor numerical stability and inherent randomness
(e.g., exp/log with reduction on large elements). To maintain the same
test coverage with a lower CI failure rate, we can gradually replace flaky
tests' RTG with the ones implemented in this PR: try Discrete first.
If still unstable, use Circular.

Overall recommended strategy for handling a flaky test:
- Find if it uses `Uniform` in
`onnxruntime/test/common/tensor_op_test_utils.h`. If yes, replace
`Uniform` with `Discrete` implemented in this PR. For
`candidate_values`, we can try `[-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5,
2]`, `[-2, -1, 0, 1, 2]`, `[-1, 0, 1]`, and `[0, 1]` and choose the most
difficult one among those passing 100 runs.
- If `Discrete` fails to meet the stability requirement, switch to
`Circular` and repeat the `candidate_values` selection process.

Let's keep an eye on the two bugs mentioned in
https://github.com/microsoft/onnxruntime/pull/15515. If the related unit
tests fail again, we can replace the underlying
`RandomValueGenerator::Uniform` with
`FixedPatternValueGenerator::Discrete` or
`FixedPatternValueGenerator::Circular` implemented in this PR.
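
The two generators can be pictured with a small Python sketch. It assumes that `Discrete` draws each element at random from `candidate_values` and that `Circular` cycles through `candidate_values` in order; the text above does not spell this out, so treat both behaviours and the function names below as illustrative stand-ins rather than the C++ helpers in `tensor_op_test_utils.h`.

```python
import random


def discrete(num_elements, candidate_values, seed=0):
    # Assumed behaviour: every element is picked at random from a small,
    # fixed set of values, so results stay numerically tame.
    rng = random.Random(seed)
    return [rng.choice(candidate_values) for _ in range(num_elements)]


def circular(num_elements, candidate_values):
    # Assumed behaviour: fully deterministic, repeating the candidates in order.
    return [candidate_values[i % len(candidate_values)] for i in range(num_elements)]
```

With `candidate_values = [-1, 0, 1]`, `circular(5, ...)` removes randomness entirely, which is why it serves as the fallback when the discrete variant is still flaky.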
2023-04-25 17:52:44 -07:00
Numfor Tiapo
f44f6c5b2e
Fix Prefast Errors (#15651)
This PR adds fixes for prefast errors with the following codes:

- C26814
- C26451
- C26400
2023-04-25 16:41:39 -07:00
Rui Ren
4c3e350a6a
fix ORTModuleONNXModelException fallback OOM (#15523)
### Description
### Error 
```
RuntimeError: There was an error while exporting the PyTorch model to ONNX:-

Traceback (most recent call last):
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
    raise exception
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 385, in _get_exported_model
    torch.onnx.export(self._flattened_module,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/__init__.py", line 305, in export
    return utils.export(model, args, f, export_params, verbose, training,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 118, in export
    _export(model, args, f, export_params, verbose, training, input_names, output_names,
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 743, in _export
    proto, export_map, val_use_external_data_format = graph._export_onnx(
RuntimeError: ONNX export failed: Couldn't export Python operator XDropout
```
The error leads to an out-of-memory (OOM) issue, because the log.txt file
grows to **26 GB**.


### Motivation and Context
The root cause is that in each `_forward`
```
      if log_level <= _logger.LogLevel.WARNING and not self._raised_ORTModuleONNXModelException:
          warnings.warn(
              (
                  f"Fallback to PyTorch due to exception {type(self._exception)} was triggered. "
                  "Report this issue with a minimal repro at https://www.github.com/microsoft/onnxruntime. "
                  f"See details below:\n\n{_utils.get_exception_as_string(self._exception)}"
              ),
              UserWarning,
          )
```


the code above is called and logs the `exception` through
`get_exception_as_string`.

In my training case, this leads to 40k `Traceback` dumps to stdout
and 110 million lines of `onnx graph` output, and runs into OOM.
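
A common way to stop a warning like this from repeating on every forward call is a warn-once flag. The sketch below is only an illustration of that pattern under that assumption; `FallbackNotifier` and `count_warnings` are hypothetical names and not part of ORTModule.

```python
import warnings


class FallbackNotifier:
    """Hypothetical helper: emits the fallback warning at most once."""

    def __init__(self):
        self._warned = False  # set after the first warning is emitted

    def notify(self, exception):
        if self._warned:
            return False  # already reported, stay quiet
        warnings.warn(
            f"Fallback to PyTorch due to exception {type(exception)} was triggered.",
            UserWarning,
        )
        self._warned = True
        return True


def count_warnings(num_forward_calls):
    # Simulates repeated forward calls and counts how many warnings are emitted.
    notifier = FallbackNotifier()
    error = RuntimeError("ONNX export failed")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for _ in range(num_forward_calls):
            notifier.notify(error)
    return len(caught)
```

Regardless of how many forward calls are simulated, only the first one produces log output.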

### Validation

After the fix above, the log.txt file is only **2.4 MB**.

---------

Co-authored-by: ruiren <ruiren@microsoft.com>
2023-04-25 15:10:31 -07:00
Yulong Wang
d30831d829
[js/webgpu] make RunFunction return void (#15669)
### Description
Make `RunFunction` return `void`.

The return value is meaningless in the OpResolveRule context. This change
also allows any JavaScript error to be caught, in which case
`computeKernel()` returns a non-zero value.
2023-04-25 14:14:26 -07:00
Chen Fu
2fa10fb803
Fp16 onnx pool operators, relu, leakyrelu (#15498)
### Description
Adding the fp16 onnx operator implementations:
 maxpool, averagepool, global average pool, relu, leaky relu


### Motivation and Context

Continue with support for fp16. Standard onnx operator implementations are needed as a basis for the graph optimizers to work.
2023-04-25 14:01:47 -07:00
Changming Sun
9bf08bdb52
Fix iconv link issue (#15592)
### Description
Fix iconv link issue. The library is used in string_normalizer.cc. 

### Motivation and Context
Though iconv is part of the POSIX standard, some systems may have additional iconv providers, for example GNU iconv, that are not in the standard C runtime library. In these cases we may need to link to additional libraries. 
However, this change has two caveats:
1. It may silently pull in GNU libraries into libonnxruntime.so,  and make the shared library not distributable. 
2. The detection of iconv library runs before we add additional include folders to ORT. So the detection may be inaccurate.
2023-04-25 13:28:36 -07:00
Ye Wang
d05777ddb6
stabilize fusion script with a separate create_attention_node() (#15670)
### Description

Previously, the script used create_attention_node() from the base class in
fusion_attention.py. Changes in that file could sometimes silently
lead to generating a bad model.

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-04-25 13:07:58 -07:00
Baiju Meswani
5885abfb35
Training Documentation (#15612) 2023-04-25 11:44:12 -07:00
Ye Wang
d00197aaa7
initialize cache_indir explicitly in beamsearch with encoder decoder model (#15667) 2023-04-25 11:05:21 -07:00
Chi Lo
e1755541cc
Fix TRT timing cache test (#15588)
The TRT EP test for the timing cache has flawed logic: it enables the
timing cache for both sessions when comparing TRT engine build times,
which is why CI got some intermittent failures.

This PR disables the timing cache test that compares the engine build
time with the timing cache enabled versus disabled, until we find a model
that can benefit from the timing cache.
2023-04-25 10:20:26 -07:00
Wei-Sheng Chin
d0c3f92ec6
[DORT] Fix fake tensor problem caused by PyTorch change (#15664)
This should make `Orttraining Linux Lazy Tensor CI Pipeline` green
again.
2023-04-25 19:56:42 +08:00
Yulong Wang
3440d3a08e
remove 'lib/' from .gitignore (#15613)
The 'lib/' entry causes the source folder /js/web/lib/ to be ignored.
2023-04-24 18:43:32 -07:00
Ashwini Khade
124ea0a801
remove compute optimizer from lte (learning on the edge) builds (#15637)
### Description
Removing compute optimizer from on device training builds.

### Motivation and Context
1. mitigate android build failures
2. reduce binary size

Since only CPU EP is enabled for LTE builds, we can optimize the models
offline.
2023-04-24 15:57:15 -07:00
Yulong Wang
14cc02c65c
[js/web] WebGPU backend via JSEP (#14579)
### Description
This change introduced the following new components into ONNX Runtime
Web:
- JavaScript Execution Provider (JSEP)
  - Asynchronized inferencing execution powered by Emscripten's Asyncify
- WebGPU backend implemented in TypeScript
  - initial implementation of kernels:
    - elementwise operators (22)
    - binary operators (5)
    - tensor: Shape, Reshape, Transpose, Gemm
    - nn: Conv, {Global}Maxpool, {Global}AveragePool


The code still needs to be polished; work on it is ongoing.

## Q&A
What is JSEP?
> JSEP, aka JavaScript Execution Provider, is a new ONNX Runtime
execution provider that specifically targets web environments
(browsers). JSEP allows JavaScript code to kick in from various places
when ONNX Runtime runs inference on a model.

Why JSEP?
> JSEP is a hybrid mode EP that contains both C/C++ and
TypeScript/JavaScript implementation. There are 2 strong reasons why we
introduced JSEP:
> 1. The C/C++ part helps JSEP leverage ONNX Runtime's capabilities
as much as possible, including graph transformers, optimizers and the
ability to fall back to the CPU EP. The TypeScript/JavaScript part makes
kernel implementations much easier to develop and debug in the
browser.
> 2. The requirement of asynchronous execution from JavaScript APIs (e.g.
`buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a
synchronous context (see the "async problem" section below). This is
solved by using Emscripten's Asyncify.

What is WebGPU?
> WebGPU is the new GPU API that is available in browsers. It's one of the
only 2 APIs currently available to access the GPU from a browser (the
other is WebGL).
> WebGPU is designed with more advanced and more powerful features than
WebGL and is potentially the solution that offers the best GPU performance
for model inferencing currently available.

What is the async problem and why do we have it?
> The "async problem" is a problem that you cannot call an async
function in a synchronous context. Think about the following C++ code:
> ```c
> // C-style declarations (API)
> typedef void (*ON_COMPLETE)(PVOID state, DATA *data);
> void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);
> 
> // implementation
> DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
>   // how to implement?
> }
> ```
> The answer is, it's impossible to implement this function. Usually we
try to find a sync version of the API, or launch a thread to call the async
function and sync-wait on the main thread. Unfortunately, in a browser
environment, neither is possible.
>
> WebGPU does not offer any synchronous API for data downloading (GPU
to CPU). This is the only operation that MUST be async. As `OrtRun()`
will eventually call into DataTransfer to copy data from GPU to CPU,
and `OrtRun()` is a synchronous function, this cannot be done in the
normal way.

What is Emscripten? How does the Asyncify feature resolve the problem?
> Emscripten is the C/C++ compiler for WebAssembly. It's what we use to
compile ORT and generate the WebAssembly artifacts which run in
browsers.
>
> Asyncify is a [compiler
feature](https://emscripten.org/docs/porting/asyncify.html) that allows
calling async functions from a synchronized context. In short, it
generates code to unwind and rewind call stack to emulate async
execution. With this feature, we are able to call the async function
inside `OrtRun()` call.

## Design Overview

**Inter-op**

JSEP does pretty much the same thing as any other EP. It exposes an
interface for inter-op with JavaScript, which is defined in
onnxruntime/wasm/js_internal_api.js:
```js
// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
    Module.jsepBackend = backend;
    Module.jsepAlloc = alloc;
    Module.jsepFree = free;
    Module.jsepCopy = copy;
    Module.jsepCopyAsync = copyAsync;
    Module.jsepCreateKernel = createKernel;
    Module.jsepReleaseKernel = releaseKernel;
    Module.jsepRun = run;
};
```
This simple JavaScript snippet defines all the language-boundary
functions that JSEP requires to implement kernels and data
transfers using JavaScript inside ONNX Runtime:
- `jsepBackend`: assign the singleton object to webassembly module
- `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc()
and Free()
- `jsepCopy`: synchronous copy (GPU to GPU, CPU to GPU)
- `jsepCopyAsync`: asynchronous copy (GPU to CPU)
- `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object
that is maintained in JS to match the lifecycle of a Kernel in ORT
- `jsepRun`: OpKernel::Compute() should call into this

The abstraction above keeps the connections and dependencies between
C/C++ and TypeScript/JavaScript to a minimum.

**Resource Management**

The lifecycle of tensor data and kernels is managed by ORT (C/C++), but the
implementation is left to JavaScript. JavaScript code is responsible
for implementing the callbacks correctly.

For WebGPU, the GPU data is managed by JavaScript using a singleton map
(tensor_data_id => GPUBuffer). The GPU pipeline is managed as a singleton.
Shaders are managed using a singleton map (shader_key => gpu_program),
where shader_key is generated from cache_key (OP specific, including
attributes) and input shapes.
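
The shader map can be pictured as a cache keyed by the op-specific `cache_key` plus the input shapes. The following is a minimal Python sketch of that idea only; the actual backend is written in TypeScript, and the class name, key format and `build` callback here are hypothetical.

```python
class ProgramCache:
    """Hypothetical stand-in for the shader map (shader_key => gpu_program)."""

    def __init__(self):
        self._programs = {}
        self.builds = 0  # counts how many programs were actually built

    @staticmethod
    def shader_key(cache_key, input_shapes):
        # Combine the op-specific key with every input shape (format is assumed).
        shapes = ";".join(",".join(str(d) for d in shape) for shape in input_shapes)
        return f"{cache_key}[{shapes}]"

    def get_or_build(self, cache_key, input_shapes, build):
        key = self.shader_key(cache_key, input_shapes)
        if key not in self._programs:
            self._programs[key] = build()  # build only on a cache miss
            self.builds += 1
        return self._programs[key]
```

Two runs of the same op with the same attributes and input shapes reuse one program, while a new input shape triggers a new build.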

**about data transfer**
`js::DataTransfer::CopyTensor` is implemented to call either the synchronous
or the asynchronous copy callback, depending on whether the destination is
GPU or not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async
function so it can be called in the synchronous context.
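
The dispatch rule can be sketched in a few lines of Python. This is a simplification for illustration: the device strings and callback parameters are hypothetical, and the real async path is awaited through Asyncify rather than returned directly.

```python
def copy_tensor(src_device, dst_device, copy_sync, copy_async):
    # Destination on GPU (GPU to GPU, CPU to GPU): the synchronous callback.
    if dst_device == "gpu":
        return copy_sync(src_device, dst_device)
    # Destination not on GPU (GPU to CPU download): the asynchronous callback.
    return copy_async(src_device, dst_device)
```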

**run kernel in JS**

The Kernel class constructor calls `jsepCreateKernel()` once, with an
optional per-kernel specific serialization to pass attributes into
JavaScript.

`Compute()` is implemented in a way that a metadata serialization is
performed in a base class and JavaScript code can access the data using
the Emscripten specific builtin macro `EM_ASM_*`.

**disabled features**
Memory pattern is forcibly disabled, because the WebGPU data is not
represented by a general memory model (in which a buffer can be
represented by offset + size).
Concurrent run support is disabled. WebGPU is stateful and it also has
async function calls. Supporting concurrent runs would significantly
increase the complexity, and we would not get any real benefit from it.

**prefer channels last**
JSEP prefers channels last and returns `DataLayout::NHWC` from the method
`GetPreferredLayout()`. This lets the graph transformers
preprocess the graph into a channels last form so that a more optimized
WebGPU shader can be used.
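
To make the layout difference concrete, the sketch below converts a flat NCHW buffer into NHWC order in plain Python. It only illustrates the index mapping between the two layouts; it is not how the graph transformers or the WebGPU shaders perform the conversion, and the function name is illustrative.

```python
def nchw_to_nhwc(data, n, c, h, w):
    # data is a flat list in NCHW order; the result is flat NHWC order.
    out = [0] * (n * c * h * w)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = ((ni * h + hi) * w + wi) * c + ci
                    out[dst] = data[src]
    return out
```

In NHWC order, all channel values of one pixel sit next to each other in memory.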

**Testing code**
It's impossible to test JSEP directly because JSEP itself does not
contain any kernel implementation. However, it has the kernel
registration, which needs to work together with the corresponding
JavaScript code. There are unit tests that run ONNX models from the
JavaScript API.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-04-24 15:21:18 -07:00
George Wu
8dd32fed47
[TensorRT EP] avoid excessive library load/unload overhead when running unit tests. (#15639)
TensorRT loads/unloads libraries as builder objects are created and
torn down. This happens for every single unit test, which leads to
excessive test execution time due to that overhead.
This overhead has steadily increased over the past few TensorRT versions
as the library objects get bigger, to the point that it takes
8 hours to run all the unit tests. Nvidia suggests keeping a placeholder
builder object around to avoid this.
2023-04-24 14:43:13 -07:00
George Wu
c2acf69d13
support new include,lib dir structure in upcoming QNN 2.11 (#15605)
upcoming QNN 2.11 will have a different include/lib directory structure.
update cmake files to support the new structure.
2023-04-24 13:10:17 -07:00
Ashwini Khade
ccb2243ee7
Update build option for training in java to enable_training_api (#15638)
### Description
Updating the build option for enabling training in java builds from
ENABLE_TRAINING -> ENABLE_TRAINING_APIS.
In the native codebase ENABLE_TRAINING is used for enabling full
training and ENABLE_TRAINING_APIS is used for creating the lte builds
with training apis. Making the change to sync the naming convention
across all the language bindings.

It was a bit confusing to see ENABLE_TRAINING when debugging the android
build failures for training. Making this change just to improve
readability of logs during debugging.

### Motivation and Context
2023-04-24 11:53:08 -07:00
Tianlei Wu
686fd3c22a
Fix cuda 12.1 windows Build (#15614)
### Description
Fix the CUDA 12.1 Windows build error caused by an ambiguous cuda namespace. Use a new namespace for attention softmax.

Tested with VS 2019 and VS 2022 with the following settings:
- OS: Microsoft Windows 11 Enterprise (Version 10.0.22621 Build 22621)
- CUDA: cuda_12.1.0_531.14_windows
- TensorRT: TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0
- CUDNN: 8.8.1.3 for cuda 12
- Visual Studio Enterprise 2019, version 16.11.26 (MSVC v142) or
  Visual Studio Enterprise 2022 (64-bit), version 17.5.4
- Python: 3.10
- CMake: 3.25.2

VS 2019:
```
build.bat --cmake_generator "Visual Studio 16 2019" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12"
```

VS 2022:
```
build.bat --cmake_generator "Visual Studio 17 2022" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt_2022 --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12"
```


### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/15242
2023-04-24 10:02:35 -07:00
cao lei
dc53ddef7a
Create a new C API KernelContext_GetAllocator() for Custom Op scenario (#15591)
### Description
Create a new C API KernelContext_GetAllocator() for Custom Op scenario



### Motivation and Context
Create a new C API KernelContext_GetAllocator() for Custom Op scenario
2023-04-23 21:54:35 -07:00
Hector Li
a8e2833050
[QNN EP]Unblock Qnn EP for Csharp support (#15640)
### Description
Unblock Qnn EP for Csharp support

### Motivation and Context
Enable Csharp support for Qnn EP
2023-04-23 21:28:34 -07:00
Changming Sun
c82bebde6a
Fix the TestCUDAProviderOptions test error (#15649)
The test limits the GPU's running memory requirement to 20MB. That might
have been enough in the past, but it seems insufficient now, as we upgrade
CUDA to newer versions or add more kernels/graph transformers to our code.
Therefore we need to increase it. Our test log shows the model sometimes
needs 128MB of memory, so I set the limit to 256MB.
2023-04-24 11:21:59 +08:00
PeixuanZuo
9df1a5e605
[ROCm] enable LayerNorm opset Ver17 for ROCm EP (#15601)
enable LayerNorm opset Ver17 for ROCm EP.
2023-04-24 10:30:06 +08:00
Erick Muñoz
45c82eefb4
[OneDNN] Fix poolgrad bug (#15557)
* Fixed default dilation value for poolgrad ops

### Description
Changed default dilation value to 0 in poolgrad ops



### Motivation and Context
Fixes error on unit tests when --enable_training --use_dnnl flags are
active and
2023-04-23 08:20:26 -07:00
cloudhan
d1354dcc83
[ROCm] Add stable diffusion benchmark results for MI100 (#15646) 2023-04-23 18:29:35 +08:00
cloudhan
8297148bde
[ROCm] Update benchmark for stable diffusion (#15602)
1. update scripts for ROCm memory measurement.
2. update README to contain ROCm result.
3. address some minor issues in the README
2023-04-23 11:49:40 +08:00
cloudhan
9e44248bf9
Workaround ROCm global pool (#15481)
Implement global avg/max pool with reduction
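
The idea of expressing a global pool as a reduction can be shown with a small Python sketch: each channel's H x W plane is reduced to a single value (mean for average pool, max for max pool). This is an illustration on nested lists only, with hypothetical function names, and says nothing about the ROCm kernels actually used.

```python
def global_pool(x, reduce_fn):
    # x is a nested list shaped [N][C][H][W]; the result is shaped [N][C][1][1].
    return [
        [[[reduce_fn([v for row in channel for v in row])]] for channel in sample]
        for sample in x
    ]


def mean(values):
    return sum(values) / len(values)


def global_average_pool(x):
    # Reduce-mean over the spatial axes.
    return global_pool(x, mean)


def global_max_pool(x):
    # Reduce-max over the spatial axes.
    return global_pool(x, max)
```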
2023-04-23 11:48:43 +08:00
Baiju Meswani
fd6ecc3909
Add env to the TrainingSession constructor (#15635) 2023-04-21 21:05:46 -07:00
Hector Li
fab3e33105
[Qnn EP]Enable Gelu op support (#15631)
### Description
Enable Gelu contrib op support

### Motivation and Context
unblock models with contrib op Gelu
2023-04-21 16:54:34 -07:00
Patrice Vignola
0080bb0331
Add NCHW transpose for GroupNorm (#15634)
It gives about a 2x perf improvement on Stable Diffusion on some
hardware.
2023-04-21 15:18:11 -07:00
Patrice Vignola
b49d428299
[DML EP] Add missing newline to image test logging (#15596) 2023-04-21 13:39:07 -07:00
Tianlei Wu
5a675d9113
Disable random failing DML image batch test (#15624)
### Description
Disable a test with random failure in Windows GPU CI Pipeline like the
following:

```
11: [       OK ] BatchTest/BatchTest.BatchSupport/163 (0 ms)
11: [ RUN      ] BatchTest/BatchTest.BatchSupport/164
11: D:\a\_work\1\s\winml\test\image\imagetests.cpp(186): error: Expected: m_model_binding.Bind(output_data_binding_name, output_video_frames) doesn't throw an exception.
11:   Actual: it throws.
11: D:\a\_work\1\s\winml\test\image\imagetests.cpp(211): error: Expected: m_result = m_session.Evaluate(m_model_binding, L"") doesn't throw an exception.
11:   Actual: it throws.
11: total errors is 0/2073600, errors rate is 0total errors is 0/2073600, errors rate is 0total errors is 0/2073600, errors rate is 0[  FAILED  ] BatchTest/BatchTest.BatchSupport/164, where GetParam() = ((L"fns-candy_Bgr8_Batch3.onnx", 0, { L"1080.jpg", L"fish_720_Gray.png", L"fish_720.png" }, 3, false), 0, 1, 1, 1, 4-byte object <02-00 00-00>) (3203 ms)
```

Since https://github.com/microsoft/onnxruntime/pull/15468 was merged to
main, about 10~15% of build jobs have failed in this test.
2023-04-21 13:29:56 -07:00
Ye Wang
633dec0b17
refactor some code (#15566)
### Description

1. moved onnxruntime/contrib_ops/cuda/decoder to
onnxruntime/contrib_ops/cuda/bert
2. create utils.cuh under /bert for shared implementations in
decoder_masked_multihead_attention_impl_utils.h and
rotary_embedding_util.h
3. refactored relative_attn_bias_impl.cu by reusing the template
specializations in utils.cuh

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-04-21 12:57:08 -07:00
Baiju Meswani
b5a1941835
C, C++, Python, C# API update for on device training (#15518) 2023-04-21 11:36:01 -07:00
Zhang Lei
a6d6e45be2
Tune block size for layer_norm considering #rows and GPU resource (#15410)
fine tune cuda layernorm block size considering number of rows to
process together with column number, and hardware resources (number of
SMs, etc)

Co-authored-by: Lei Zhang <phill.zhang@gmail.com>
2023-04-21 09:49:21 -07:00
Rachel Guo
2cb3fb18b5
Integrate React Native E2E test with detox framework (#15133)
### Description

Integrate react native e2e test framework with detox.
https://wix.github.io/Detox/

Good build in CI:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=946695&view=results

### Motivation and Context

Write cross-platform end-to-end tests in JavaScript. 
Resolve flaky e2e tests in react native ci pipelines.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2023-04-21 09:46:26 -07:00