Commit graph

1371 commits

Author SHA1 Message Date
PeixuanZuo
2fddc65c8c
[ROCm] add hipblaslt into GemmFastGelu TunableOp (#15945)
add hipblaslt into GemmFastGelu TunableOp.
2023-05-23 11:07:09 +08:00
RandySheriffH
d35361bf9d
Fix python pipeline for AzureEP without using root (#16023)
Fix python pipeline for AzureEP without using root, this is for 1.15.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-22 16:38:47 -07:00
Changming Sun
0204594f90
Cleanup WASM cmake code (#15996)
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more nature.
For example,

```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
2023-05-20 18:07:39 -07:00
RandySheriffH
4dfb89b3ad
Implement mutex-free spin lock for task queue (#14834)
Implemented "lock-free" spinlock to save CPU usage on context switching.
The change has been tested on queene service of Ads team, the lock-free
version of ort (40 threads) saves CPU usage on gen8 (128 logical
processors on 8 numa nodes) windows by nearly half, from 65% to 35%.

For 32 cores, the curve is flat:

Anubis, 32 vCPU, windows, hugging face models,
95 percentile E2E latency in ms:

model | mutex(ms) | mutex-free
--- | --- | ---
 alvert_base_v2 | 34.21 | 34.09
 bert_large_uncased | 116.27| 117.84
 bart_base | 72.06 | 71.99
 distilgpt2 | 25.43 | 25.02
 vit_base_patch16_224 | 37.33 | 37.76

Anubis, 32 vCPU win, Linux, 1st party models,
95 percentile E2E latency in ms:

model | mutex(ms) | mutex-free
--- | --- | ---
deepthink_v2 | 24.35 | 22.95
bing_feeds |  36.96 | 36.48
deep_writes |  14.46 | 14.32
keypoints |  9.34 | 7.69
model11 |  1.71 | 1.66
model12 |  1.82 | 1.44
model2 |  4.21 | 3.95
model6 |  1.08 | 1.05
agiencoder |  0.99 | 0.93
geminet_transformer |  5.32 | 5.24

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-19 10:12:10 -07:00
Patrice Vignola
310b22aa0c
[DML EP] Update DirectML version to 1.12.0 (#16011) 2023-05-18 19:37:12 -07:00
Ashwini Khade
0c815a95b7
android package fix (#15999)
### Description
This PR adds the training headers to the training android packages.


### Motivation and Context
Training headers need to be added as part of the training android
packages, however because of the typo in the cmake these headers were
not being added. This PR fixes the issue.
2023-05-18 09:21:03 -07:00
Changming Sun
842b1a3472
Revert a change in #15797: restore the correct version of emsdk (#15995)
### Description
Revert a change in #15797: restore the correct version of emsdk


### Motivation and Context
Without change, when you build it on Windows you will see:
```
2023-05-17 19:41:30,093 build [INFO] - Activating emsdk...
2023-05-17 19:41:30,093 util.run [INFO] - Running subprocess in 'C:\src\onnxruntime2\cmake\external\emsdk'
  'C:\src\onnxruntime2\cmake\external\emsdk\emsdk.bat' activate 3.1.37
error: tool or SDK not found: '3.1.37'
```
2023-05-18 07:41:38 -07:00
kailums
f62f722c70
integrate triton into ort (#15862)
### Description
In some scenarios, the triton written kernels are more performant than
CK or other handwritten kernels, so we implement a framework that
onnxruntime can use these triton written kernels.

This PR is to integrate triton into ort, so that ort can use kernels
that written and compiled by triton.

The main change focus on two part:
1. a build part to compile triton written kernel and combine these
kernels into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launch triton written
kernels.

#### Build

To compile triton written kernel, add a script
`tools/ci_build/compile_triton.py`. This script will dynamic load all
kernel files, compile them, and generate `triton_kernel_infos.a` and
`triton_kernel_infos.h`.

`triton_kernel_infos.a` contains all compiled kernel instructions, this
file will be combined into libonnxruntime_providers_rocm.so, using
--whole-archive flag.

`triton_kernel_infos.h` defines a const array that contains all the
metadata for each compiled kernel. These metadata will be used for load
and launch. So this header file is included by 'triton_kernel.cu' which
defines load and launch functions.

Add a build flag in build.py and CMakeList.txt, when building rocm
provider, it will call triton_kernel build command, and generate all
necessary files.

#### C++ Load and Launch

On c++ part, we implement load and launch functions in triton_kernel.cu
and triton_kernel.h.

These two files located in `providers/cuda`, and when compiling rocm,
they will be hipified. so this part supports both cuda and rocm. But
currently we only call triton kernel in rocm.

We also implement a softmax triton op for example. Because there will
generate many kernels for different input shape of softmax, we use
TunableOp to select the best one.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-17 09:35:28 +08:00
cloudhan
dc383ed4ce
Basic CSharp packaging support for ROCm EP (#15535)
This PR mainly fixes building errors when trying to build nupkg for ROCm EP.
It also slighly improve the packaging logic so that devlopers can
produce the nupkg on linux natively.
2023-05-16 07:27:38 +08:00
Dmitri Smirnov
896a963492
Adust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921)
### Description

This PR partially reverts changes introduced in
https://github.com/microsoft/onnxruntime/pull/15643

We make two API return std::string always in UTF-8.

We also move the entry points from OrtApiBase to OrtApi to make them
versioned.

### Motivation and Context

`GetVersionString` always returns x.y.z numbers that are not subject to
internationalization.
`GetBuildInfoString` can hold international chars, but UTF-8 should be
fine to contain those.
We prefix them with u8"" in case the compiler default charset is not
UTF-8.
Furthermore, creating platform dependent APIs is discouraged.
`ORTCHAR_T` is platform dependent and was created for paths only.
On non-unix platforms would still produce `std::string` that can only
contain UTF-8

The API was introduced after the latest release, and can still be
adjusted.
2023-05-13 13:45:07 -07:00
RandySheriffH
7c4e8267e7
Implement openAI endpoint invoker for nuget (#15797)
Implement openAI audio endpoint, and enable nuget packaging.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-05-11 22:04:02 -07:00
Jian Chen
1a73d61829
Update eigen to 3.4 and remove the eigen from git submodule (#15875)
### Description
Update eigen to 3.4 and remove the eigen from git submodule

### Motivation and Context
We need to have eigen 3.4 for c++20
2023-05-11 11:56:59 -07:00
Changming Sun
7c58d013aa
Remove Ubuntu 18.04 usages (#15781)
### Description
Remove Ubuntu 18.04 usages because it will be EOL this month.

### Motivation and Context
2023-05-11 11:44:00 -07:00
sdegrande
cf062dbdb1
FlatBuffers fails to compile with gcc13. (#15787)
When building the FlatBuffers dependencies, gcc13 emits a
stringop-overflow warning. All warnings being turned into errors, that
fails the compilation of FlatBuffers, and as a consequence also fails
the build of onnxruntime.

This commit adds the application of a patch to FlatBuffers's
CMakeList.txt, to add -Wno-error=stringop-overflow to the
CMAKE_CXX_FLAGS.
2023-05-11 11:20:19 -07:00
liqun Fu
ac9ae9f7c5
update onnx release 1.14 for docker files (#15680)
### Description
this is for ort 1.15 release to work with onnx 1.14
It shall be merged after onnx 1.14 release and before ort 1.15 release.


### Motivation and Context

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2023-05-10 13:15:56 -07:00
Sumit Agarwal
b473e3f3c6
[DML EP] Update DirectML version to 1.11.0 (#15858)
### Description
- Update DML version to 1.11.0
- Disable Gemm+Softmax fusion



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-09 12:48:15 -07:00
Wanming Lin
00b1e79e04
Support WebNN EP (#15698)
**Description**: 

This PR intends to enable WebNN EP in ONNX Runtime Web. It translates
the ONNX nodes by [WebNN
API](https://webmachinelearning.github.io/webnn/), which is implemented
in C++ and uses Emscripten [Embind
API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#).
Temporarily using preferred layout **NHWC** for WebNN graph partitions
since the restriction in WebNN XNNPack backend implementation and the
ongoing
[discussion](https://github.com/webmachinelearning/webnn/issues/324) in
WebNN spec that whether WebNN should support both 'NHWC' and 'NCHW'
layouts. No WebNN native EP, only for Web.

**Motivation and Context**:
Allow ONNXRuntime Web developers to access WebNN API to benefit from
hardware acceleration.

**WebNN API Implementation Status in Chromium**:
- Tracked in Chromium issue:
[#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291)
- **CPU device**: based on XNNPack backend, and had been available on
Chrome Canary M112 behind "#enable-experimental-web-platform-features"
flag for Windows and Linux platforms. Further implementation for more
ops is ongoing.
- **GPU device**: based on DML, implementation is ongoing.

**Open**:
- GitHub CI: WebNN currently is only available on Chrome Canary/Dev with
XNNPack backend for Linux and Windows. This is an open to reviewers to
help identify which GitHub CI should involved the WebNN EP and guide me
to enable it. Thanks!
2023-05-08 21:25:10 -07:00
Yulong Wang
0457fd0b40
upgrade emsdk to 3.1.37 (#15817)
### Description
upgrade emsdk to 3.1.37

WIP branch to debug the mystery memory issue in web assembly
multi-thread build.
2023-05-08 16:49:47 -07:00
Guenther Schmuelling
5a43828b3d
update ort extensions to 94142d8391c9791ec71c38336436319a2d4ac7a0 (#15688)
needed to get tokenizers/decode for whisper

---------

Co-authored-by: Shalva Mist <shalvamist@microsoft.com>
2023-05-05 09:48:07 -07:00
cloudhan
412d05a1d2
[ROCm] Update cmake (#15807)
Followup of #15775
2023-05-04 11:20:56 -07:00
Yulong Wang
33d1372729
[wasm] revert emsdk to v3.1.19 (#15793)
### Description
latest emsdk generated multi-thread version sometimes crash with unknown
reason ( error: memory access out of bounds ).

we don't want to break existing ort-web users, so revert emsdk back to
3.1.19 (same to what ort v1.14.0 uses)
2023-05-04 01:15:01 -07:00
Baiju Meswani
ba7b83ff3c
Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776) 2023-05-03 13:08:35 -07:00
Changming Sun
41c082fdde
Add a Github workflow for Prefast (#15763) 2023-05-03 11:42:51 -07:00
Changming Sun
328cabb194
Download protoc from Github Release instead of Nuget (#15731)
### Description
Download protoc from Github Release instead of Nuget to avoid having
dependency on nuget.exe on Linux

### Motivation and Context
To avoid having dependency on nuget.exe on Linux. Many users' build
environment do not have nuget or dotnet.
2023-05-02 12:18:59 -07:00
Changming Sun
5352f6d9b0
Make "--cuda_version" build arg optional (#15758)
### Description
This change will allow us building CUDA EP without installing CUDA SDK
on Windows.

### Motivation and Context
Nvidia's CUDA installer comes with a VS extension. In the past, we
require installing the extension. It is a little bit inconvenient since:
1. Visual Studio must be installed before CUDA SDK. CUDA's installer
will not install the extension if your machine doesn't have Visual
Studio.
2. We need to install CUDA SDK on our build machines, instead of just
downloading it and using it.

After this change, we will not need to install CUDA SDK on our build
machines. So it will be easier to add a support for a different CUDA
version.

Also, fix two PreFast warnings.
2023-05-01 18:00:47 -07:00
Ashwini Khade
0ffae8073b
Creating Nuget and Android packages for Training (#15712)
### Description
This PR creates Nuget and Android for Training. 


### Motivation and Context
These packages are intended to be released in ORT 1.15 to enable
On-Device Training Scenarios.

## Packaging Story for Learning On The Edge Release
### Nuget Packages:
1. New Native package -> **Microsoft.ML.OnnxRuntime.Training** (Native
package will contain binaries for: win-x86, win-x64, win-arm, win-arm64,
linux-x64, linux-arm64, android)
2. C# bindings will be added to existing package ->
**Microsoft.ML.OnnxRuntime.Managed**

### Android Package published to Maven:
1. New package for training (full build) ->
**onnxruntime-training-android-full-aar**

### Python Package published to PyPi:
1. Python bindings and offline tooling will be added to the existing ort
training package -> **onnxruntime-training**
2023-05-01 12:59:56 -07:00
Sumit Agarwal
4c4f688a93
[DML EP] Fix dml_external_project (#15656)
### Description
While building ORT for DML EP with `dml_EXTERNAL_PROJECT` flag, 2
variables (`DML_SHARED_LIB`, `DML_PACKAGE_DIR`) value is not set
properly.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-01 12:02:56 -07:00
Chunye Wang@AMD
d35850c142
[VitisAI]Update VitisAI EP to be compatible with VitisAI 3.5 (#15673)
### Description

Originally VitisAI EP only works with old version of VitisAI release. 


### Motivation and Context

Update VitisAI EP so that it works together with the current VitisiAI
3.5 and further version of VitisAI. We try our best to make it forward
compatible.

---------

Co-authored-by: Wang Chunye <chunywan@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com>
Co-authored-by: shoucair <shoucai.ren@amd.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
Co-authored-by: BoarQing <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-05-01 08:28:26 -07:00
Changming Sun
65020d433e
Prefast fixes for CUDA EP (#15726)
### Description
1. Adjust cmake flags. Do not modify CMAKE_CXX_FLAGS globally. Only
apply the flags to ORT code.
2. Fix some SDL warnings.
2023-04-29 12:43:12 -07:00
Yuhong Guo
41dcf0d32e
Expose build information in dynamic lib (#15643)
### Description
<!-- Describe your changes. -->
1. Add Build Info API to onnx.
2. Fix compile error while building onnxruntime_benchmark in MacOs.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
1. When Onnxruntime lib is serving online, we need a way to detect how
this lib is built. This PR helps the developer to get the build
information using `strings` such as git branch, git commit id, build
type and cmake cxx flags, which is showed as follows.


![image](https://user-images.githubusercontent.com/19584326/233794371-b2f95a2c-27fb-4709-a6dd-bf4bb12b0b5b.png)


![image](https://user-images.githubusercontent.com/19584326/233794360-f96f5d2e-332c-405c-83f1-370ccc2b86f8.png)

If the build env has no git, there will be no git related infor:


![image](https://user-images.githubusercontent.com/19584326/234558596-298c1b01-9a90-41bf-9372-7259a8f8e5be.png)


3. Fix the following compile error while building benchmark in MacOs.

![image](https://user-images.githubusercontent.com/19584326/233793571-c261ac1f-47b2-434d-a293-7e9edc6c8a66.png)

---------

Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2023-04-28 21:57:31 -07:00
Changming Sun
d3e8d7a70d
Better support for cmake 3.26 and Windows ARM64 (#15704)
### Description

In #8953 I introduced a change in our onnxruntime_mlas.cmake that it
enables "ASM_MASM" cmake language for all Windows build.
```cmake
enable_language(ASM_MASM)
```
Before the change, it is only enabled when onnxruntime_target_platform
equals to x64.

However, cmake 3.26 added a new language:  ASM_MARMASM.

According to cmake's manual,
ASM_MASM is for Microsoft Assembler
ASM_MARMASM is for Microsoft ARM Assembler. This one is new in cmake
3.26.

We should choose the right one according to
${onnxruntime_target_platform}.
2023-04-27 10:25:45 -07:00
kunal-vaishnavi
cfb8c0e2ca
Add Whisper custom export to wheel (#15685)
### Description
This PR adds the Whisper custom export scripts to the wheel.



### Motivation and Context
This enables access to the custom export scripts in the wheel.
2023-04-26 10:45:52 -07:00
PeixuanZuo
0ecfe83932
[ROCm] add beam search support (#15625)
add beam search support for ROCm EP.
2023-04-26 17:53:33 +08:00
Yulong Wang
b98317b907
[js/webgpu] following up for JSEP/WebGPU code cleanup (#15666)
### Description
This PR resolves a part of non-critical comments from code review
comments in #14579.

- use `USE_JSEP` instead of `USE_JS` in build definition to make it less
ambiguous
- remove unused util functions from util.ts
- fix transpose.h
- other misc fixes
2023-04-25 21:20:03 -07:00
sfatimar
ebaafac3f5
Openvino ep ort 5.0 (#15626)
### Description
The PR adds VPU support to OpenVINO Execution Provider
Bug fixes for GPU, CPU. 
Changes to OpenVINO Backend in Serialized Model API for faster First
Inference Latency.
Deprecation to HDDL-VADM and MYRIAD, removed code
Support OpenVINO 2023.0 
Dynamic Shapes Support for iGPU

### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO
- If it fixes an open issue, please link to the issue here. -->

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-04-25 20:59:42 -07:00
Changming Sun
9bf08bdb52
Fix iconv link issue (#15592)
### Description
Fix iconv link issue. The library is used in string_normalizer.cc. 

### Motivation and Context
Though iconv is part of POSIX standard, some systems may have additional iconv providers, for example GNU iconv, that is not in the standard c runtime library. In these cases we may need to link to additional libraries. 
However, this change has two caveats:
1. It may silently pull in GNU libraries into libonnxruntime.so,  and make the shared library not distributable. 
2. The detection of iconv library runs before we add additional include folders to ORT. So the detection may be inaccurate.
2023-04-25 13:28:36 -07:00
Yulong Wang
14cc02c65c
[js/web] WebGPU backend via JSEP (#14579)
### Description
This change introduced the following new components into ONNX Runtime
Web:
- JavaScript Execution Provider (JSEP)
  - Asynchronized inferencing execution powered by Emscripten's Asyncify
- WebGPU backend implemented in TypeScript
  - initial implementation of kernels:
    - elementwise operators (22)
    - binary operators (5)
    - tensor: Shape, Reshape, Transpose, Gemm
    - nn: Conv, {Global}Maxpool, {Global}AveragePool


Code need to be polished. still working on it.

## Q&A
What is JSEP?
> JSEP, aka JavaScript Execution Provider, is a new ONNXRuntime
execution provider that specifically works on Web environment
(browsers). JSEP allows JavaScript code to kick in from various places
when ONNX Runtime inferences a model.

Why JSEP?
> JSEP is a hybrid mode EP that contains both C/C++ and
TypeScript/JavaScript implementation. There are 2 strong reasons why we
introduces JSEP:
> 1. the C/C++ part helps JSEP to leverage ONNX Runtime's capabilities
as much as possible including graph transformer, optimizers and also the
capabilities to fallback to CPU EP. TypeScript/JavaScript helps JSEP to
develop and debug much easier in the browser for the kernel
implementation.
> 2. the requirement of asynchronized execution from JavaScript API (eg.
`buffer.mapAsync()`) makes it impossible to run `OrtRun()` in a
synchronized context (see "async problem" section below). This is done
by using Emscripten's Asyncify.

What is WebGPU?
> WebGPU is the new GPU API that available in browser. It's one of the
only 2 APIs that currently available to access the GPU from browser (the
other is WebGL).
> WebGPU is designed with more advanced and stronger features comparing
to WebGL and is potentially solution that offer the best GPU performance
for model inferencing that currently available.

What is the async problem and why we have the problem?
> The "async problem" is a problem that you cannot call an async
function in a synchronous context. Think about the following C++ code:
> ```c
> // C-style declarations (API)
> typedef void (*ON_COMPLETE)(PVOID state, DATA *data);
> void read_data_from_file(FILEHANDLE file, ON_COMPLETE on_complete);
> 
> // implementation
> DATA * my_impl_read_data_from_file_sync(FILEHANDLE file) {
>   // how to implement?
> }
> ```
> The answer is, it's impossible to implement this function. Usually we
try to find a sync version API, or launch a thread to call the async
function and sync-wait on the main thread. Unfortunately, in browser
environment, neither is possible.
>
> WebGPU does not offer any synchronized API for data downloading (GPU
to CPU). This is the only operation that MUST be async. As `OrtRun()`
will eventually call into DataTransfer for copy data from GPU to CPU,
and `OrtRun()` is a synchronized function, this cannot be done in normal
way.

What is Emscripten? How is the Asyncify feature resolved the problem?
> Emscripten is the C/C++ compiler for WebAssembly. It's what we use to
compile ORT and generates the WebAssembly artifacts which runs on
browsers.
>
> Asyncify is a [compiler
feature](https://emscripten.org/docs/porting/asyncify.html) that allows
calling async functions from a synchronized context. In short, it
generates code to unwind and rewind call stack to emulate async
execution. With this feature, we are able to call the async function
inside `OrtRun()` call.

## Design Overview

**Inter-op**

JSEP is doing pretty much same thing to just another EP. It exposes an
interface for inter-op with JavaScript, which is defined in
onnxruntime/wasm/js_internal_api.js:
```js
// init JSEP
Module["jsepInit"] = function (backend, alloc, free, copy, copyAsync, createKernel, releaseKernel, run) {
    Module.jsepBackend = backend;
    Module.jsepAlloc = alloc;
    Module.jsepFree = free;
    Module.jsepCopy = copy;
    Module.jsepCopyAsync = copyAsync;
    Module.jsepCreateKernel = createKernel;
    Module.jsepReleaseKernel = releaseKernel;
    Module.jsepRun = run;
};
```
This simple JavaScript snippet defines all language barrier level
functions that requires by JSEP to achieve implementing kernels and data
transfers using JavaScript inside ONNX Runtime:
- `jsepBackend`: assign the singleton object to webassembly module
- `jsepAlloc` and `jsepFree`: implementation of data transfer's Alloc()
and Free()
- `jsepCopy`: synchronized copy ( GPU to GPU, CPU to GPU)
- `jsepCopyAsync`: asynchronized copy ( GPU to CPU)
- `jsepCreateKernel` and `jsepReleaseKernel`: a corresponding object
that maintained in JS to match lifecycle of Kernel in ORT
- `jsepRun`: OpKernel::Compute() should call into this

The abstraction above allows to tie as little as possible connections
and dependencies between C/C++ and TypeScript/JavaScript.

**Resource Management**

Lifecycle of tensor data and kernels are managed by ORT(C/C++) but the
implementation are left to JavaScript. JavaScript code are responsible
to implement the callbacks correctly.

For WebGPU, the GPU data is managed by JavaScript using a singleton map
(tensot_data_id => GPUBuffer). GPU pipeline is managed as singleton.
Shaders are managed using a singletonmap (shader_key => gpu_program),
while shader_key is generated by cache_key (OP specific, including
attributes) and input shapes.

**about data transfer**
`js::DataTransfer::CopyTensor` implemented to call either synchronized
or asynchronized copy callback, depending on the destination is GPU or
not. Emscripten's macro `EM_ASYNC_JS` is used to wrap the async function
to be called in the synchronized context.

**run kernel in JS**

Kernel class constructor calls once `jsepCreateKernel()` with an
optional per-kernel specific serialization to pass attributes into
JavaScript.

`Compute()` are implemented in a way that a metadata serialization is
performed in a base class and JavaScript code can access the data using
the Emscripten specific builtin macro `EM_ASM_*`.

**disabled features**
memory pattern is force disabled, because the WebGPU data is not
presented by a general memory model (a buffer can be represented by
offset + size).
concurrent run support is disabled. WebGPU is stateful and it also has
async function call. To support concurrent run will significantly
increase the complexity and we don't get any real benefit from it.

**prefer channels last**
JSEP prefers channels last and returns `DataLayout::NHWC` in method
`GetPreferredLayout()`. This will let the graph transformers to
preprocess the graph into a channels last form so that a more optimized
WebGPU shader can be used.

**Testing code**
It's impossible to test JSEP directly because JSEP itself does not
contain any kernel implementation. However, it has the kernel
registration which need to work together with the corresponding
JavaScript code. There are unit tests that run onnx models from
JavaScript API.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-04-24 15:21:18 -07:00
George Wu
8dd32fed47
[TensorRT EP] avoid excessive library load/unload overhead when running unit tests. (#15639)
TensorRT will load/unload libraries as builder objects are created and
torn down. This will happen for
every single unit test, which leads to excessive test execution time due
to that overhead.
This overhead has steadily increased over the past few TensorRT versions
as the library objects get bigger leading to
8 hours to run all the unit tests. Nvidia suggests to keep a placeholder
builder object around to avoid this.
2023-04-24 14:43:13 -07:00
George Wu
c2acf69d13
support new include,lib dir structure in upcoming QNN 2.11 (#15605)
upcoming QNN 2.11 will have a different include/lib directory structure.
update cmake files to support the new structure.
2023-04-24 13:10:17 -07:00
Ashwini Khade
ccb2243ee7
Update build option for training in java to enable_training_api (#15638)
### Description
Updating the build option for enabling training in java builds from
ENABLE_TRAINING -> ENABLE_TRAINING_APIS.
In the native codebase ENABLE_TRAINING is used for enabling full
training and ENABLE_TRAINING_APIS is used for creating the lte builds
with training apis. Making the change to sync the naming convention
across all the language bindings.

It was a bit confusing to see ENABLE_TRAINING when debugging the android
build failures for training. Making this change just to improve
readability of logs during debugging.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-04-24 11:53:08 -07:00
Tianlei Wu
686fd3c22a
Fix cuda 12.1 windows Build (#15614)
### Description
Fix CUDA 12.1 Windows build error of cuda namespace ambiguous. Use a new namespace for attention softmax.

Tested with VS 2019 and VS 2022 with the following settings:
- OS: Microsoft Windows 11 Enterprise (Version 10.0.22621 Build 22621)
- CUDA: cuda_12.1.0_531.14_windows
- TensorRT: TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0
- CUDNN: 8.8.1.3 for cuda 12
- Visual Studio Enterprise 2019, version 16.11.26 (MSVC v142) or
  Visual Studio Enterprise 2022 (64-bit), version 17.5.4
- Python: 3.10
- CMake: 3.25.2

VS 2019:
```
build.bat --cmake_generator "Visual Studio 16 2019" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12"
```

VS 2022:
```
build.bat --cmake_generator "Visual Studio 17 2022" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=52;60;61;70;75;80;86" --skip_submodule_sync --parallel --build_shared_lib --update --build --build_dir .\build\trt_2022 --use_cuda --cuda_version "12.1" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1" --cudnn_home "C:\CuDNN\8.8.1.3_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT-8.6.0.12.Windows10.x86_64.cuda-12.0\TensorRT-8.6.0.12"
```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

https://github.com/microsoft/onnxruntime/issues/15242
2023-04-24 10:02:35 -07:00
cloudhan
9e44248bf9
Workaround ROCm global pool (#15481)
Implement global avg/max pool with reduction
2023-04-23 11:48:43 +08:00
Ye Wang
633dec0b17
refactor some code (#15566)
### Description
<!-- Describe your changes. -->

1. moved onnxruntime/contrib_ops/cuda/decoder to
onnxruntime/contrib_ops/cuda/bert
2. create utils.cuh under /bert for shared implementations in
decoder_masked_multihead_attention_impl_utils.h and
rotary_embedding_util.h
3. refactored relative_attn_bias_impl.cu by reusing the template
specializations in utils.cuh

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-04-21 12:57:08 -07:00
Scott McKay
446c478fbd
Add iOS Swift Package Manager support (#15297)
### Description
<!-- Describe your changes. -->
Add Swift Package Manager (SPM) support for ORT based on  #14621
- uses the existing objective-c bindings
- some re-organization of the directory structure was required but the
contents of the files are unchanged, apart from adjustments due to file
movements

Add tool for updating ORT native pod used in the SPM package
Update CIs to use ORT native pod from build, and build/test using SPM



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
iOS developers are using SPM as much as cocoapods, so adding SPM means
both are catered for.
2023-04-20 16:18:35 +10:00
George Nash
f2889b41c1
[AMX] Update assembler check (#15501)
A recent commit added an assembler check if the ASM dialect was ATT

This unfortunately broke the AMX build for systems that don't have the
ASM-ATT dialect.

This change assumes if the CMAKE_ASM-ATT_COMPILER_ID is not found and
the CMAKE_ASM_COMPILER_ID is "GNU" based on all the other already passed
checks AMX is supported by the compiler and assembler.

### Description




### Motivation and Context
On my build system the recent change to add the ASM-ATT version check
disabled AMX code from the build.

---------

Signed-off-by: George Nash <george.nash@intel.com>
2023-04-19 14:16:26 -07:00
Chen Fu
142220ad87
Fix cmake 3.25 debug info config (#15565)
### Description

https://github.com/microsoft/onnxruntime/pull/15538
Above pull request breaks Windows build on cmake 3.25 or earlier. This
should fix it.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-04-19 09:14:19 -07:00
PeixuanZuo
59ea35d592
[ROCm] add CK GroupNorm to GroupNormTunable (#15510)
- Add CK GroupNorm to GroupNormTunable.
- Reduce configuration of GroupNormNHWCOp because CK implementation is
better.

The performance gain on stable diffusion v1.5.
Before:
```
'height': 512
'width': 512
'steps': 50
'batch_size': 1
'batch_count': 5
'num_prompts': 1
'average_latency': 2.4782688856124877
'median_latency': 2.4783748388290405
'provider': 'ROCMExecutionProvider'
'disable_safety_checker': True 
```

After:
```
'height': 512, 
'width': 512, 
'steps': 50, 
'batch_size': 1,
'batch_count': 5,
'num_prompts': 1, 
'average_latency': 2.107170510292053,
 'median_latency': 2.1067750453948975,
 'first_run_memory_MB': -1, 
'second_run_memory_MB': -1,
'provider': 'ROCMExecutionProvider', 
'disable_safety_checker': True
```
2023-04-19 13:54:59 +08:00
kunal-vaishnavi
901c2bc384
Whisper Model Optimization (#15473)
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
   - For Whisper's encoder and decoder
- Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
   - For Whisper's decoder and decoder with past
- Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
   - CPU:
      - Different Q and KV sequence lengths
      - Parallel memset for large sequence lengths
- Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
add
- Separate present key-value output into present key and present value
(for multi-head attention spec)
   - CUDA:
- Use memory efficient attention compute kernel with present state (for
decoder)
- Multi-head attention kernel changes
   - CPU:
- Introduction of multi-head attention CPU kernel (previously did not
exist)
- Use AddBiasReshape instead of AddBiasTranspose when sequence length =
1 (for decoder with past)
      - Different Q, K, V input shapes
      - Pass past key and past value directly as key and value
   - CUDA:
- Use memory efficient attention compute kernel with past and/or present
state (for decoder with past)

### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920

### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427

This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
2023-04-18 17:13:54 -07:00
Yi Zhang
698e9f71cd
Improve cache hit rate in windows build (#15538)
### Description
1.  Update /Zi to /Z7 in abseil project while using cache
2.  Skip target_precompile_headers while using cache


### Motivation and Context
There're about 1/4 uncacheable calls in Windows GPU compilation with
cache.
```
Uncacheable calls:                   441 / 1641 (26.87%)
  Could not use precompiled header:  361 /  441 (81.86%)
  Preprocessing failed:                1 /  441 ( 0.23%)
  Unsupported compiler option:        79 /  441 (17.91%)
```

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=961916&view=logs&j=5076e696-f193-5f12-2d8a-703dda41a79b&t=9b927034-e3ef-5e25-c6df-387bc37acd63&l=21

The root cause of `Unsupported compiler option` is that /Zi in Abseil
isn't updated to /Z7.
The root cause of `Could not use precompiled header` is the
`target_precompile_headers` creates cmake_pch.pch every time and it's
hash value is changed too.

### Result
It could reduce compilation time by another 20%. 
For example:
It took 16m43 in CUDA training compilation on Windows.
It takes 13m32 after the change.

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=964002&view=logs&s=959c6b43-5937-53e5-5f36-e53cb0249117


### N.B.
In winml project, it's using own target_precompile**d**_header
https://github.com/microsoft/onnxruntime/blob/main/cmake/precompiled_header.cmake.
Just let it be.
2023-04-18 09:31:35 -07:00
Justin Chu
cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or is data heavy


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00