Commit graph

11091 commits

Author SHA1 Message Date
Jian Chen
0a10a3003a
component-governance fix round 4 (#20754)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-22 11:05:24 -07:00
Yulong Wang
e412bc1919
[doc] update file size table for ORT Web (#20755)
2024-05-22 11:04:57 -07:00
Xu Xing
f1fef19b6e
[js/webgpu] Support shared memory for transpose 2d (#19267)
For 1024x1024: 18.7 ms without shared memory, 13.2 ms with shared memory.
2024-05-22 08:15:44 -07:00
Yulong Wang
068bb3d5ee
[js/webgpu] add missing space in build script (#20752) 2024-05-21 16:24:34 -07:00
Chi Lo
df01e0d497
[TensorRT EP] Update ORT kernel output with TRT DDS int64 output for TRT 10 (#20738)
TRT 10 natively supports int64 tensors, so the code that binds the ORT
kernel output to the TRT DDS int64 output needs to be updated.
2024-05-21 09:03:48 -07:00
pengwa
8a98874e7e
Flash attention recompute (#20603)
### Flash attn recompute

1. Allow PythonOp(FlashAttn) to be recomputed correctly.
45879ff5c2
2. Use JSON to pass the selected-to-recompute subgraphs.
3c374da678

#### Better Memory Efficiency 

Customer models can run both PyTorch SDPA and Flash Attn; this PR makes
it possible for the Flash Attn path to work with ORTModule layerwise
recompute. Peak memory drops from 45.x GB to 32.x GB when comparing only
the layers (other pieces not included; BTW there are a few more
optimizations targeting other pieces coming later).

#### Better Perf

Using Flash Attn brings an additional 16% end-to-end time reduction,
with a closely aligned loss curve.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/bb63894a-f281-49bc-a8e6-ff818439be38)

#### Use JSON File to pass Recompute Plans

To overcome the maximum-length limitation of strings defined in session
options.

2024-05-21 13:38:19 +08:00
Adrian Lizarraga
8acf60f35c
Layout transform: Fix-up QDQ units and add constant folding (#20685)
### Description

#### Problem 1: Broken Transpose QDQ unit
Layout transform's specialized cost function aggressively pushes down
transposes with channel-first or channel-last perms. This can lead to a
situation where a channel-fist/last Transpose gets stuck after being
pushed through an Unsqueeze node that makes the Transpose's perm no
longer channel-first/last. At this point, the specialized cost function
defers to the default const function, which does not see a need to
continue pushing this transpose node. This breaks the QDQ node units for
both the Unsqueeze and the Transpose: DQ -> Unsqueeze -> Transpose -> Q.

<img width="266" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/82f8432d-ca27-451b-8c36-c8d87b806e30">


The transpose optimizer should insert a Q -> DQ pair between the
Unsqueeze and Transpose nodes to fix both QDQ node units: DQ ->
Unsqueeze -> Q[new] -> DQ[new] -> Transpose -> Q

<img width="198" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/5a584bdf-e5db-4622-b3bb-83c060e09261">


#### Problem 2: Inserted Squeeze/Transpose nodes should be constant
folded when possible.
The transpose optimizer inserts Squeeze (and Transpose) ops between an
initializer and a DQ to counteract the effect of Unsqueezing that
initializer if it is consumed by multiple nodes. This results in a graph
where the inserted nodes are not in valid node units:

Original graph where two Mul nodes share a common initializer input:
<img width="456" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/4b9155ae-e32f-41fc-9136-f953b73e92e7">

Resulting graph after transpose optimization without constant folding:
<img width="452" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/3c1bfef1-d45f-4d6e-aa19-1c2929eae3f5">

Here, the circled Transpose and Squeeze nodes operate on a quantized
integer type but are not in valid QDQ node units. The solution is to run
constant folding, which results in:
<img width="405" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/aebdb91f-f38f-4583-adec-33e46126365f">


### Motivation and Context
Improve the layout transformation to allow more models to run on EPs
that prefer the channel-last layout.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-05-20 20:19:06 -07:00
Jian Chen
372974e5d6
Using CPU pool to build Linux GPU C API Package (#20648)
2024-05-20 15:25:14 -07:00
Wanming Lin
87d49e3dda
[WebNN EP] Add WebNN operators doc to README.md (#20734) 2024-05-20 14:57:40 -07:00
Wanming Lin
0399d1b12d
[WebNN EP] Update chromium flag (#20732)
WebNN is currently enabled behind the "Enables WebNN API" flag.
2024-05-20 14:57:30 -07:00
Jian Chen
ddafbf2224
Component Governance fix round 3 (#20689)
2024-05-20 13:39:09 -07:00
Jian Chen
11df22b59b
Reenabling Nuget Cuda Packaging Pipeline (#20688)
2024-05-20 10:37:15 -07:00
Edward Chen
fefae0cd04
Add Mac CI GitHub Actions workflow (#20717)
Add a new GitHub Actions workflow, `.github/workflows/mac.yml`. It contains these jobs:
- ARM64 MacOS CI build.
- Objective-C static analysis build. This was moved over from another Azure DevOps pipeline to make it more visible.
2024-05-20 10:27:03 -07:00
Preetha Veeramalai
ebed2c3785
Unified OV compile_model API in OVEP (#20700)
### Description
Provide a unified API in OVEP that passes the ONNX graph proto from ORT
to OV for compilation.


### Motivation and Context
The earlier implementation used two different flows depending on whether
an ONNX model path was present or the model was loaded from memory.
The former directly passed the ONNX model path to OV when the graph is
fully supported by the EP, while the latter passed the ORT model proto
to OV.

This caused a difference in results when ORT optimizations are enabled.
This PR addresses the issue.
2024-05-20 10:20:28 -07:00
Yulong Wang
036fcd93d4
[js/web] optimize module export and deployment (#20165)
### Description

This PR makes a number of optimizations to onnxruntime-web's module
export and deployment.

See each section below for more details.

#### Preview

> [onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21)

> ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~

> ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~

> ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~

<details>
<summary><h4>Breaking changes</h4></summary>

There is no code change required, but there are a few differences
regarding **code import**, **flags**, **bundler config** and
**deployment steps**.

#### Importing:

Import table is changed. See following for details.

<details>
<summary><h5>Current import table:</h5></summary>

  | Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
  | `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
  | `ort.training` | `onnxruntime-web/training` |  |  | ✔️ | ✔️<sup>\[1]</sup> | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
  | `ort.wasm-core` | `onnxruntime-web/wasm-core` |  |  | ✔️ |  |  |
  | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ✔️<sup>\[2]</sup> |  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

* [1] Not tested; may not actually work.
* [2] Not working; this is a mistake in the build config.

</details>

<details>
<summary><h5>Proposed update:</h5></summary>

  | Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
  | `ort.all` | ~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ | ✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
  | `ort.training` | `onnxruntime-web/training` |  |  | ✔️ | ✔️ | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
  | ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ |  |  | ~~✔️~~ |  |  |
  | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ~~✔️~~ |  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

</details>

#### Flags:

The following flags are deprecated:
- `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in
build.

The following flags changed their type:
- `env.wasm.wasmPaths`: When using this flag as a string (for the URL
prefix), nothing is changed. When using this flag as an object (for
per-file path overrides), the type changed:
  ```diff
  -  export interface Old_WasmFilePaths{
  -    'ort-wasm.wasm'?: string;
  -    'ort-wasm-threaded.wasm'?: string;
  -    'ort-wasm-simd.wasm'?: string;
  -    'ort-training-wasm-simd.wasm'?: string;
  -    'ort-wasm-simd-threaded.wasm'?: string;
  -  };
  +  export interface New_WasmFilePaths {
  +    /**
  +     * Specify the override path for the main .wasm file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .wasm file is:
  +     * - `ort-wasm-simd-threaded.wasm` for default build
  +     * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and WebNN)
  +     * - `ort-training-wasm-simd-threaded.wasm` for training build
  +     */
  +    wasm?: URL|string;
  +    /**
  +     * Specify the override path for the main .mjs file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .mjs file is:
  +     * - `ort-wasm-simd-threaded.mjs` for default build
  +     * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and WebNN)
  +     * - `ort-training-wasm-simd-threaded.mjs` for training build
  +     */
  +    mjs?: URL|string;
  +  }
  ```
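As an illustrative sketch of the new object shape (the URLs below are made-up placeholders, not real deployment paths):

```javascript
// Hypothetical per-file override following the New_WasmFilePaths shape above.
// The URLs are illustrative placeholders.
const wasmPaths = {
  // Override path for the main .wasm file (should be an absolute path/URL).
  wasm: 'https://example.com/dist/ort-wasm-simd-threaded.wasm',
  // Override path for the main .mjs file (should be an absolute path/URL).
  mjs: 'https://example.com/dist/ort-wasm-simd-threaded.mjs',
};

// In application code this would be assigned before creating a session:
//   ort.env.wasm.wasmPaths = wasmPaths;
console.log(`${Object.keys(wasmPaths).length} overrides`);
```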

#### Bundler compatibility:

Config changes are needed for bundlers. See usage examples in
/js/web/test/e2e/ for Webpack, Parcel, and Rollup.

#### Deployment:

- If consuming from a CDN, there is no breaking change.
- If consuming from a local server, all `ort-*.wasm` and `ort-*.mjs`
files (6 files in total) in the dist folder need to be copied.
(Previously only the `ort-*.wasm` files needed to be copied.)

</details>
<details>
<summary><h4>Problems</h4></summary>

There are a few problems with the current module export and deployment:

- The script URL cannot be correctly inferred when imported as ESM.
- Workers are forcefully encoded using a Blob URL, which makes
onnxruntime-web not work in CSP environments and Node.js when using the
proxy or multi-threading feature.
- The generated JS code (by Emscripten) is encoded using
`function.toString()`, which is unstable and error-prone.
- Running with a different Emscripten build always requires the build
step, making it difficult to swap artifacts in development/debug.
</details>
<details>
<summary><h4>Goals</h4></summary>

- Full ESM support
- Support various ways to import, including:
- import from HTML's `<script>` tag (IIFE format, exporting to global
variable `ort`)
    ```html
    <script src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script>
    ```
  - import from source code inside `<script type="module">` tag (ESM)
    ```html
    <script type="module">
      import * as ort from "https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs";

      // using 'ort'
    </script>
    ```
- import in a CommonJS project (CJS format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js
    const ort = require('onnxruntime-web');
    ```
- import in an ESM project (ESM format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js (or main.mjs)
    import * as ort from 'onnxruntime-web';
    ```
- Support popular bundlers when importing onnxruntime-web into a CJS/ESM
project.
  - webpack (esm requires extra post-process step)
  - rollup
  - parcel (esm requires extra post-process step)
  - More bundlers **TBD**
- Multi-threading support for Node.js

NOTE: keeping a single JavaScript file (the all-in-one bundle) is no
longer a goal, because technically it conflicts with the other
requirements.
</details>

<details>
<summary><h4>Important Design Decisions</h4></summary>

- Drop support of single JavaScript output.
  - The current onnxruntime-web distribution uses a single JavaScript
file to include all code. While this has a few benefits, it also creates
the problems mentioned above. Since ESM is used more and more widely,
and browsers are making more restrictive security checks and
requirements, the old Blob-based solution is going to be replaced.
  - To achieve the requirements, specifically CSP environment support,
we have to offer a non-Blob-based solution. Therefore, we have to
distribute multiple files and drop the single-file solution.

- Do not run a parser/postprocess on the Emscripten-generated JavaScript.
  - Emscripten is evolving quickly, so we should only depend on what's
in its documentation instead of certain implementation details. (For
example, we currently patch its code to deal with a special variable
`_scriptDir`.)
  - Keeping the generated files as-is also helps to:
    - reduce the size of ort.min.js
    - make it easier to replace build artifacts in development/debug

- Drop support for non-SIMD and non-multi-thread builds. This helps to
reduce the number of artifacts in the distribution.
  - (Fixed-size) SIMD is supported in any mainstream JS environment.
  - Multi-threading as a WebAssembly feature is supported in any
mainstream JS environment. In some environments the feature is guarded
by a cross-origin policy, but it can still work as long as no worker is
created.

- Use ESM output for the Emscripten-generated JavaScript.
  - There are 2 ways to dynamically import classic (UMD) modules, and
neither of them is recommended:
    - dynamically creating a `<script>` tag. This changes the HTML
structure and has quite a lot of compatibility issues.
    - using `fetch()` and `eval()`. However, `eval` is strongly
discouraged because of its significant perf hit.
  - Importing ESM is super easy: just use the `import()` call.
Considering ESM is widely supported in modern browsers and Node.js, this
is the better option.

- Add a Blob-based solution as a fallback for cross-origin workers.
  - There are still wide use cases of importing onnxruntime-web from a
CDN. In this usage, make it possible to create a worker by using
`fetch()` + `Blob` to create a same-origin Blob URL.

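A minimal sketch of this fallback idea, with illustrative names (not the actual onnxruntime-web implementation):

```javascript
// Wrap (possibly cross-origin) script text in a same-origin Blob URL so it
// can be passed to `new Worker(url)` or dynamic `import(url)`.
// In a browser, `scriptText` would come from `fetch(url).then(r => r.text())`.
function createSameOriginUrl(scriptText) {
  const blob = new Blob([scriptText], { type: 'text/javascript' });
  return URL.createObjectURL(blob);
}

const workerUrl = createSameOriginUrl('self.onmessage = () => {};');
console.log(workerUrl.startsWith('blob:'));
```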
</details>

<details>
<summary><h4>Distribution File Manifest</h4></summary>

The distribution folder contains the following files:

- WebAssembly artifacts. These files are the result of compiling the
ONNX Runtime C++ code to WebAssembly by Emscripten.

  | File Name | Build Flags |
  |------|-----|
  | ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` |
  | ort-training-wasm-simd-threaded.mjs <br/> ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/> `--enable_wasm_simd` <br/> `--enable_wasm_threads` |
  | ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep` <br/> `--use_webnn` |

- onnxruntime-web JavaScript artifacts. These files are generated by
ESBuild as the entry point for onnxruntime-web.

  There are multiple build targets for different use cases:
  | Target Name | Path for "import" or "require" | Description |
  |------|-----|-----|
  | `ort` | `onnxruntime-web` | The default target. |
  | `ort.all` | `onnxruntime-web/all` | The target including webgl. |
  | `ort.node` | `onnxruntime-web` | The default target for Node.js. |
  | `ort.training` | `onnxruntime-web/training` | The target including training APIs. |
  | `ort.wasm` | `onnxruntime-web/wasm` | The target including only the WebAssembly (CPU) EP. |
  | `ort.webgl` | `onnxruntime-web/webgl` | The target including only the WebGL EP. |


  For each target, there are multiple files generated:
  | File Name | Description |
  |------|-----|
  | [target].js | The entry point for the target. IIFE and CommonJS format. |
  | [target].mjs | The entry point for the target. ESM format. |
  | [target].min.js <br/> [target].min.js.map | The entry point for the target. Minimized with sourcemap. IIFE and CommonJS format. |
  | [target].min.mjs <br/> [target].min.mjs.map | The entry point for the target. Minimized with sourcemap. ESM format. |
  | [target].proxy.mjs | (if applicable) The proxy ESM module for the target. |
  | [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if applicable) The proxy ESM module for the target. Minimized with sourcemap. |

</details>

<details>
<summary><h4>Dynamic Import Explained</h4></summary>

- Local Served | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort-wasm-simd-threaded.mjs]
                    |
                    + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                    |
                    + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
                                        |
                                        + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Local Served | Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort.proxy.min.mjs]
                    |
                    + new Worker()--> [ort.proxy.min.mjs (worker)]
                                        |
                                        + import()--> [ort-wasm-simd-threaded.mjs]
                                                        |
                                                        + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                        |
                                                        + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
                                                                            |
                                                                            + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Cross Origin | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort-wasm-simd-threaded.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort-wasm-simd-threaded)]
                        |
                        + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                        |
                        + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
                                            |
                                            + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```

- Cross Origin | Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort.proxy.min.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort.proxy)]
                        |
                        + new Worker()--> [blob:... (ort.proxy) (worker)]
                                            |
                                            + fetch('ort-wasm-simd-threaded.mjs')
                                                |
                                                + URL.createObjectURL(res.blob())
                                                |
                                                + import()--> [blob:... (ort-wasm-simd-threaded)]
                                                                |
                                                                + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                                |
                                                                + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
                                                                                    |
                                                                                    + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
</details>
2024-05-20 09:51:16 -07:00
kunal-vaishnavi
ca22a5a9d0
Add fusions for OpenAI CLIP (#20721)
### Description
This PR adds fusions for [OpenAI's CLIP
model](https://huggingface.co/openai/clip-vit-large-patch14-336). Here
is an example of how to run the ORT transformer optimizer for the linked
CLIP model.

```
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers
$ python3 optimizer.py --input /path/to/model.onnx --output /path/to/model_opt.onnx --model_type clip --num_heads 16 --hidden_size 1024 --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps optimize multi-modal models that use CLIP for the vision
encoder.
2024-05-18 08:27:16 -07:00
cloudhan
5d07291247
hipify int4 gemv (#20666)
Hipify MatMulNBits to accommodate the needs of the Phi3 ONNX release.
2024-05-18 16:59:03 +08:00
kunal-vaishnavi
72a3bde330
Add GQA on CPU in LLaMA scripts (#20720)
### Description
This PR adds support for adding GroupQueryAttention (GQA) in models that
are running on CPU.

### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA
for CUDA only. With the recently added support for [GQA on
CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](https://github.com/microsoft/onnxruntime/pull/20598).
2024-05-17 23:23:57 -07:00
Dmitri Smirnov
bd7a0fb377
[C API Docs ] Address doxygen errors (#20714)
### Description
Make C API compliant with Doxygen expectations

### Motivation and Context
Doc workflow is failing.
2024-05-17 23:23:20 -07:00
Tianlei Wu
2e7de54565
[CUDA] Fix SparseAttention Kernel (#20716)
### Description

Currently, there is one bool flag to indicate whether the kernel is
loaded. However, there are v1 and v2 kernels, so the flag allows only
one version of the kernel to be loaded. We use the v1 kernel for prompt
and the v2 kernel for token generation, and the flag causes an issue
when we want both prompt and token generation.

This bug was found in an integration test. The unit tests only test one
kernel at a time, so the issue was not found before.

Another possible workaround without this fix is to set the environment
variable `ORT_DISABLE_SPARSE_ATTENTION_V1=1`.
2024-05-17 22:42:19 -07:00
guyang3532
d7f7c3b343
Fix bug when Embedding has >2 outputs (#20678) 2024-05-17 16:12:57 +08:00
Xu Xing
6b58fcc00b
[js/web] Refine conv attributes (#20684)
2024-05-16 18:00:57 -07:00
Edward Chen
e81c8676e3
MatMulNBits + Add fusion (#20587)
- Add MatMulNBits Bias input
- Add graph transformer to fuse MatMulNBits + Add
2024-05-16 11:00:59 -07:00
Tom McDonald
1e1b3f9689
Remove ref struct return usage (#20132)
### Description
Removes ref struct return usage on netstandard 2.0 builds.

### Motivation and Context
Unblocks .NET native compilation
2024-05-16 09:46:19 -07:00
Yifan Li
47a178b518
[EP Perf] Fix on EP Perf (#20683)
### Description
<!-- Describe your changes. -->
* Partially revert [previous
change](https://github.com/microsoft/onnxruntime/pull/19804), and
   * Redo concurrency_test_result parser outside of post.py
* Add support of syncing memtest result to db


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
To fix the error when CI is running on two model groups.
- When running on two model groups, the [previous
change](https://github.com/microsoft/onnxruntime/pull/19804) wrongly
navigates two levels up in the directory after running one model group,
while one level is needed. After that, the script can't find another
model group.
- Running on one model group can't repro the issue
2024-05-15 21:38:52 -07:00
Wanming Lin
f5bfbd6d81
[WebNN EP] Remove activation fusion (#20635)
The WebNN spec has removed the activation option for conv and
batchNormalization, so we no longer need additional activation fusion in
the WebNN EP.

[edit by fdwr] Note this is handled in the browser now, which knows more
about the backend platform version and can more safely make decisions
about which fusions are possible (e.g. for the DirectML backend, whether
softmax and gelu can fuse successfully with their base operator).
2024-05-15 16:49:07 -07:00
Jian Chen
d1e66f0446
Increase NPM ComponentDetection.Timeout: 1200 (#20681)
2024-05-15 13:41:59 -07:00
Changming Sun
ee3f2f4ebf
Update docs/Model_Test.md (#11466)
Update the instructions of how to get test models.
2024-05-15 11:33:11 -07:00
Hans
2a17958b34
[js/rn] Fix some bugs (#20242)
### Description
<!-- Describe your changes. -->
- Fix `logSeverityLevel`
- Correct how RCTCxxBridge is obtained; the old method could get the
wrong bridge in some cases



---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-05-15 10:32:08 -07:00
Jian Chen
87ed1e3e3f
Component governance fix round 2 (#20679) 2024-05-14 17:15:15 -07:00
Edward Chen
113aa2992f
Update React Native CI (#20673)
- Move iOS package build to separate job so it can run in parallel with Android AAR build and be decoupled from the test stage. The test stage fails sometimes (not infrequently) and may need to be re-run.
- Update stop iOS simulator step so it doesn't fail if the start step doesn't run.
2024-05-14 14:10:56 -07:00
Jian Chen
83a871f890
Fix critical and High issues from Component Governance (#20611)
2024-05-14 09:17:23 -07:00
Hector Li
0e11d0c4f8
Enable Qnn nuget nightly (#20662)
### Description
Enable Qnn nuget nightly
2024-05-13 21:28:43 -07:00
Yi Zhang
c131ea89e1
Nuget Publish pipelines should be triggered by rel-* automatically too. (#20652)
### Description
Also set allowPackageConflicts = True:
`#allowPackageConflicts: false # boolean. Optional. Use when command =
push && nuGetFeedType = internal. Allow duplicates to be skipped.
Default: false.`

https://learn.microsoft.com/en-us/azure/devops/pipelines/tasks/reference/nuget-command-v2?view=azure-pipelines

If the publish partially fails, we don't need to rerun the whole
package-generation workflow.
2024-05-13 13:18:16 -07:00
Xu Xing
8c59cd4fce
[js/webgpu] Support GroupQueryAttention (#20237)
TODOs:
1. Handle H * params.kvNumHeads greater than work group size limit.
2. Support BNSH kv cache.
2024-05-13 09:43:37 -07:00
Edward Chen
90d49ccb9a
Allow path pattern to be specified in package_release_tasks.py. (#20650)
Do more in the Python helper script so the Bash code in the release definition can be simplified.
2024-05-13 09:16:04 -07:00
Adrian Lizarraga
643ed14720
Quant tool: make removal of Clip/Relu ops configurable (#20616)
### Description
Adds the extra option `QDQKeepRemovableActivations` to optionally
prevent automatic removal of Clip/Relu ops in QDQ models. The current
default behavior, which is to remove Clip/Relu, remains the same if the
new option is not enabled.

### Motivation and Context
Explicitly representing these Relu/Clip operators in the QDQ model is
necessary if optimizations or EP transformations will later remove
QuantizeLinear/DequantizeLinear operators from the model.
2024-05-10 17:23:24 -07:00
Yi-Hong Lyu
49d197a8e6
Enable ClipQuantFusion exclusively on CPU EP (#20627)
### Motivation and Context

The Intel NPU does not support 16-bit int quantized operators.
Consequently, the execution provider removes the
QuantizeLinear/DeQuantizeLinear (Q/DQ) operators from node units and
executes the operation as FP16 in the backend. However, if a Clip
operator was fused into a Q operator in the node unit, the removal of
Q/DQ operators results in inaccuracies because the effect of the
original Clip operators is lost.

Consider the following example:
- FP32 model: -> Op_FP32 -> Clip ->
- QDQ model: -> (DQ-> Op_FP32 -> Q) -> (DQ' -> Clip -> Q') ->
- After ClipQuantFusion: -> (DQ-> Op_FP32 -> Q) -> (DQ' -> Q') ->
- Intel Execution Provider strips Q/DQ: -> Op_FP16 ->

To solve this issue, we have enabled ClipQuantFusion exclusively on the
CPU execution provider.
2024-05-10 16:07:42 -07:00
Jian Chen
4fe565a62a
Java CUDA 12 support (#20583)
### Description

- This PR combines all CUDA 12 stages into the Zip-nuget-... pipeline.
- It also enables CUDA 12 support.



2024-05-10 14:16:22 -07:00
Tianlei Wu
85facd678b
[CUDA] Benchmark GQA on popular LLM models (#20646)
### Description
Update benchmark_gqa.py to test latency on popular models (like
Llama3-8b, Llama3-70b, Mixtral-8x22B-v0.1 and Phi-3 etc).

Note that this is the latency of just one GroupQueryAttention node, not
the whole model. For example, packed QKV might need more time in GQA but
is faster in the MatMul of the input projection; the overall effect is
not measured here.

Example output on A100-SXM4-80GB:
```
prompt-sm80-Llama3-8B-b1-h32_8x128-fp16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.019073                 0.016264
1             32.0       0.017768                 0.017957
2             64.0       0.023304                 0.023192
3            128.0       0.032541                 0.031348
4            256.0       0.048329                 0.049484
5            512.0       0.095294                 0.095950
6           1024.0       0.228050                 0.228980
7           2048.0       0.663820                 0.663308
8           4096.0       2.243657                 2.242999
9           8192.0       8.197120                 8.186282

token-sm80-Llama3-8B-b1-h32_8_d128-fp16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.018516                 0.015398
1                  32.0       0.015687                 0.016079
2                  64.0       0.016115                 0.016053
3                 128.0       0.018727                 0.019413
4                 256.0       0.036373                 0.035962
5                 512.0       0.041701                 0.042203
6                1024.0       0.053730                 0.053750
7                2048.0       0.076382                 0.075707
8                4096.0       0.121876                 0.121802
9                8191.0       0.211292                 0.211254

prompt-sm80-Llama3-8B-b4-h32_8x128-fp16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.024558                 0.022070
1             32.0       0.021276                 0.021406
2             64.0       0.044172                 0.027789
3            128.0       0.069100                 0.059071
4            256.0       0.146569                 0.106717
5            512.0       0.270472                 0.244461
6           1024.0       0.690024                 0.692501
7           2048.0       2.308546                 2.325453
8           4096.0       8.724295                 8.957337
9           8192.0      39.030785                41.381378

token-sm80-Llama3-8B-b4-h32_8_d128-fp16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.018893                 0.018611
1                  32.0       0.018124                 0.018190
2                  64.0       0.018115                 0.018156
3                 128.0       0.023291                 0.023733
4                 256.0       0.038357                 0.038351
5                 512.0       0.047117                 0.047792
6                1024.0       0.066272                 0.065409
7                2048.0       0.104196                 0.104527
8                4096.0       0.180557                 0.180424
9                8191.0       0.332545                 0.332714

prompt-sm80-Llama3-70B-b1-h64_8x128-fp16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.040974                 0.015852
1             32.0       0.017839                 0.018615
2             64.0       0.023956                 0.022704
3            128.0       0.044622                 0.035229
4            256.0       0.080241                 0.075237
5            512.0       0.143457                 0.144322
6           1024.0       0.380473                 0.381731
7           2048.0       1.217328                 1.214505
8           4096.0       4.305315                 4.286324
9           8192.0      15.918250                15.933440

token-sm80-Llama3-70B-b1-h64_8_d128-fp16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.016148                 0.015612
1                  32.0       0.015616                 0.015616
2                  64.0       0.016082                 0.016070
3                 128.0       0.019470                 0.019130
4                 256.0       0.036617                 0.037296
5                 512.0       0.042087                 0.042176
6                1024.0       0.053704                 0.053587
7                2048.0       0.076918                 0.076365
8                4096.0       0.122534                 0.121984
9                8191.0       0.212961                 0.213330

prompt-sm80-Llama3-70B-b4-h64_8x128-fp16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.031137                 0.026270
1             32.0       0.030938                 0.032009
2             64.0       0.040833                 0.059118
3            128.0       0.084899                 0.085482
4            256.0       0.163951                 0.166310
5            512.0       0.420436                 0.423721
6           1024.0       1.282019                 1.283482
7           2048.0       4.397661                 4.420121
8           4096.0      16.931839                17.456945
9           8192.0      77.896706                83.007484

token-sm80-Llama3-70B-b4-h64_8_d128-fp16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.026106                 0.026061
1                  32.0       0.025678                 0.025589
2                  64.0       0.025438                 0.025965
3                 128.0       0.033879                 0.033320
4                 256.0       0.058078                 0.057656
5                 512.0       0.078010                 0.078153
6                1024.0       0.106353                 0.098079
7                2048.0       0.160039                 0.159153
8                4096.0       0.282527                 0.283346
9                8191.0       0.546207                 0.542135

prompt-sm80-Mistral-7B-v0.1-b1-h32_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Local  ORT-GQA-Dense-PackedQKV  ORT-GQA-Local-PackedQKV
0              16.0       0.015722       0.015655                 0.015666                 0.016150
1              32.0       0.018590       0.018562                 0.018136                 0.024617
2              64.0       0.022480       0.023085                 0.023184                 0.023160
3             128.0       0.029948       0.030581                 0.030839                 0.031464
4             256.0       0.048532       0.049099                 0.049424                 0.049408
5             512.0       0.095096       0.095665                 0.096174                 0.096175
6            1024.0       0.228606       0.228942                 0.228434                 0.229568
7            2048.0       0.660832       0.661943                 0.662170                 0.663979
8            4096.0       2.238001       2.243999                 2.242243                 2.241707
9            8192.0       8.173824       6.147072                 8.187648                 6.152822
10          16384.0      33.826305      14.486015                34.849792                14.938283
11          32768.0     176.702469      32.725330               184.309753                34.736130

token-sm80-Mistral-7B-v0.1-b1-h32_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Local  ORT-GQA-Dense-PackedQKV  ORT-GQA-Local-PackedQKV
0                   16.0       0.015407       0.016042                 0.016030                 0.015429
1                   32.0       0.015525       0.016115                 0.016768                 0.016052
2                   64.0       0.015556       0.016079                 0.015383                 0.016008
3                  128.0       0.019302       0.018644                 0.018680                 0.019278
4                  256.0       0.036924       0.035900                 0.036753                 0.036786
5                  512.0       0.041482       0.041434                 0.041646                 0.042238
6                 1024.0       0.053587       0.052972                 0.052888                 0.052856
7                 2048.0       0.075749       0.075807                 0.076528                 0.075945
8                 4096.0       0.122053       0.122016                 0.122115                 0.122216
9                 8192.0       0.212069       0.121317                 0.211919                 0.121087
10               16384.0       0.394036       0.121202                 0.393661                 0.121483
11               32767.0       0.757216       0.124326                 0.757659                 0.124157

prompt-sm80-Mistral-7B-v0.1-b4-h32_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Local  ORT-GQA-Dense-PackedQKV  ORT-GQA-Local-PackedQKV
0              16.0       0.018418       0.018911                 0.023387                 0.019256
1              32.0       0.021085       0.021132                 0.022143                 0.022251
2              64.0       0.026743       0.026770                 0.027942                 0.027714
3             128.0       0.057922       0.058483                 0.058800                 0.059402
4             256.0       0.105927       0.104876                 0.106695                 0.105996
5             512.0       0.242958       0.242543                 0.244599                 0.244774
6            1024.0       0.689321       0.689347                 0.691759                 0.692334
7            2048.0       2.308250       2.304410                 2.321587                 2.317875
8            4096.0       8.705210       8.713682                 8.927418                 8.903866
9            8192.0      39.630848      28.227926                41.604607                29.648554
10          16384.0     175.553543      61.422592               183.384064                64.560127
11          32768.0     772.296692     132.006912               813.537292               138.996735

token-sm80-Mistral-7B-v0.1-b4-h32_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Local  ORT-GQA-Dense-PackedQKV  ORT-GQA-Local-PackedQKV
0                   16.0       0.018127       0.018691                 0.018661                 0.018681
1                   32.0       0.018183       0.018812                 0.018739                 0.018759
2                   64.0       0.018081       0.018116                 0.018136                 0.018153
3                  128.0       0.023257       0.023146                 0.023114                 0.023103
4                  256.0       0.038665       0.038102                 0.038120                 0.038759
5                  512.0       0.047181       0.047156                 0.047012                 0.046382
6                 1024.0       0.066047       0.066103                 0.066604                 0.066076
7                 2048.0       0.104427       0.103770                 0.103799                 0.103807
8                 4096.0       0.180951       0.180373                 0.180173                 0.180154
9                 8192.0       0.334018       0.180801                 0.333269                 0.180690
10               16384.0       0.638682       0.180965                 0.638543                 0.180202
11               32767.0       1.249536       0.184779                 1.249963                 0.184624

prompt-sm80-Mixtral-8x22B-v0.1-b1-h48_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.015699                 0.015563
1              32.0       0.017931                 0.017719
2              64.0       0.029975                 0.022875
3             128.0       0.031038                 0.055747
4             256.0       0.050191                 0.050845
5             512.0       0.125187                 0.122813
6            1024.0       0.304004                 0.301824
7            2048.0       0.936454                 0.931546
8            4096.0       3.264547                 3.255931
9            8192.0      12.062719                12.030080
10          16384.0      49.018368                48.970749
11          32768.0     261.211151               254.461945
12          65536.0    1221.138428              1197.559814

token-sm80-Mixtral-8x22B-v0.1-b1-h48_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.015980                 0.016024
1                   32.0       0.015440                 0.016165
2                   64.0       0.015987                 0.015979
3                  128.0       0.020837                 0.018715
4                  256.0       0.036240                 0.036747
5                  512.0       0.042477                 0.041813
6                 1024.0       0.052950                 0.052956
7                 2048.0       0.076084                 0.076691
8                 4096.0       0.122233                 0.121540
9                 8192.0       0.212469                 0.212433
10               16384.0       0.394937                 0.394996
11               32768.0       0.757285                 0.757257
12               65535.0       1.484867                 1.485015

prompt-sm80-Mixtral-8x22B-v0.1-b4-h48_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.024119                 0.018755
1              32.0       0.022214                 0.022267
2              64.0       0.028045                 0.027562
3             128.0       0.062894                 0.079766
4             256.0       0.135146                 0.134483
5             512.0       0.331323                 0.329094
6            1024.0       0.984576                 0.982221
7            2048.0       3.353564                 3.351021
8            4096.0      12.762113                12.778350
9            8192.0      58.599422                57.704449
10          16384.0     263.392242               258.709503
11          32768.0    1155.789795              1128.622070
12          65536.0    5014.187012              4874.590332

token-sm80-Mixtral-8x22B-v0.1-b4-h48_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.018148                 0.018813
1                   32.0       0.018929                 0.018840
2                   64.0       0.018745                 0.018232
3                  128.0       0.023864                 0.023822
4                  256.0       0.038603                 0.038694
5                  512.0       0.048347                 0.047630
6                 1024.0       0.066957                 0.067392
7                 2048.0       0.105094                 0.105058
8                 4096.0       0.181941                 0.181808
9                 8192.0       0.334227                 0.334324
10               16384.0       0.640429                 0.640961
11               32768.0       1.267897                 1.269120
12               65535.0       2.534238                 2.504408

prompt-sm80-Phi-3-mini-128k-b1-h32_32x96-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.016112                 0.026949
1              32.0       0.016486                 0.017284
2              64.0       0.020910                 0.020994
3             128.0       0.029306                 0.029452
4             256.0       0.044604                 0.044642
5             512.0       0.090079                 0.086868
6            1024.0       0.208169                 0.208094
7            2048.0       0.604687                 0.607910
8            4096.0       2.029056                 2.046771
9            8192.0       7.792128                 7.906303
10          16384.0      34.271233                34.418175
11          32768.0     160.377853               159.980545
12          65536.0     733.443054               734.722046

token-sm80-Phi-3-mini-128k-b1-h32_32_d96-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.016339                 0.015718
1                   32.0       0.016572                 0.015964
2                   64.0       0.016182                 0.016192
3                  128.0       0.019373                 0.018621
4                  256.0       0.021856                 0.022463
5                  512.0       0.028943                 0.028888
6                 1024.0       0.041124                 0.041104
7                 2048.0       0.067668                 0.067542
8                 4096.0       0.117528                 0.117447
9                 8192.0       0.216241                 0.215492
10               16384.0       0.413434                 0.414047
11               32768.0       0.811085                 0.810612
12               65536.0       1.606189                 1.606458
13              131071.0       3.193037                 3.192491

prompt-sm80-Phi-3-mini-128k-b4-h32_32x96-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.019385                 0.019403
1              32.0       0.019801                 0.020006
2              64.0       0.025958                 0.025376
3             128.0       0.056445                 0.055909
4             256.0       0.103180                 0.102221
5             512.0       0.244224                 0.244360
6            1024.0       0.703066                 0.709327
7            2048.0       2.307456                 2.335001
8            4096.0       8.334522                 8.406760
9            8192.0      33.340416                33.758209
10          16384.0     144.141312               145.005569
11          32768.0     655.496216               655.656982
12          65536.0    2981.463135              2984.790039

token-sm80-Phi-3-mini-128k-b4-h32_32_d96-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.018701                 0.018185
1                   32.0       0.020625                 0.019213
2                   64.0       0.019936                 0.019943
3                  128.0       0.023648                 0.023689
4                  256.0       0.030309                 0.030305
5                  512.0       0.043501                 0.043801
6                 1024.0       0.067314                 0.068014
7                 2048.0       0.108649                 0.108134
8                 4096.0       0.186053                 0.186848
9                 8192.0       0.339973                 0.339742
10               16384.0       0.643288                 0.644366
11               32768.0       1.261468                 1.261510
12               65536.0       2.502252                 2.501820
13              131071.0       4.990437                 4.989521

prompt-sm80-Phi-3-small-128k-b1-h32_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.025280                 0.023331
1              32.0       0.023071                 0.025931
2              64.0       0.022883                 0.026258
3             128.0       0.030658                 0.031445
4             256.0       0.057659                 0.057073
5             512.0       0.095589                 0.106579
6            1024.0       0.228532                 0.229402
7            2048.0       0.662315                 0.663349
8            4096.0       2.242885                 2.248095
9            8192.0       8.194646                 8.180395
10          16384.0      33.926659                35.130882
11          32768.0     175.320068               184.967163
12          65536.0     810.447876               847.632385

token-sm80-Phi-3-small-128k-b1-h32_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.015517                 0.016038
1                   32.0       0.016372                 0.015477
2                   64.0       0.015472                 0.016016
3                  128.0       0.019291                 0.018664
4                  256.0       0.036250                 0.035990
5                  512.0       0.041691                 0.042238
6                 1024.0       0.053730                 0.053126
7                 2048.0       0.075912                 0.076439
8                 4096.0       0.121336                 0.121334
9                 8192.0       0.213104                 0.212443
10               16384.0       0.394353                 0.394272
11               32768.0       0.756965                 0.757017
12               65536.0       1.484548                 1.485371
13              131071.0       2.939200                 2.939552

prompt-sm80-Phi-3-small-128k-b4-h32_8x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.044326                 0.019298
1              32.0       0.021840                 0.021408
2              64.0       0.027492                 0.027802
3             128.0       0.058128                 0.059431
4             256.0       0.104300                 0.106019
5             512.0       0.242562                 0.244948
6            1024.0       0.689614                 0.692305
7            2048.0       2.297931                 2.312857
8            4096.0       8.654848                 8.843170
9            8192.0      38.770176                40.929279
10          16384.0     175.572998               183.692291
11          32768.0     780.126221               820.551697
12          65536.0    3357.564941              3488.527344

token-sm80-Phi-3-small-128k-b4-h32_8_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.018061                 0.017995
1                   32.0       0.018225                 0.018851
2                   64.0       0.018203                 0.018104
3                  128.0       0.023161                 0.023651
4                  256.0       0.038421                 0.037673
5                  512.0       0.047590                 0.046938
6                 1024.0       0.065639                 0.066055
7                 2048.0       0.103545                 0.103581
8                 4096.0       0.180461                 0.179998
9                 8192.0       0.332667                 0.332564
10               16384.0       0.638503                 0.639094
11               32768.0       1.249180                 1.249479
12               65536.0       2.469457                 2.471666
13              131071.0       4.915362                 4.914499

prompt-sm80-Phi-3-medium-128K-b1-h40_10x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.025759                 0.016318
1              32.0       0.018282                 0.018111
2              64.0       0.022642                 0.022978
3             128.0       0.030860                 0.037988
4             256.0       0.055703                 0.050318
5             512.0       0.113465                 0.113776
6            1024.0       0.267678                 0.268292
7            2048.0       0.795202                 0.797222
8            4096.0       2.737953                 2.740435
9            8192.0      10.101760                10.149092
10          16384.0      43.326466                43.990013
11          32768.0     230.886398               229.886978
12          65536.0    1067.412476              1052.922852

token-sm80-Phi-3-medium-128K-b1-h40_10_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.016122                 0.015582
1                   32.0       0.015594                 0.016262
2                   64.0       0.016099                 0.015512
3                  128.0       0.018708                 0.019510
4                  256.0       0.037582                 0.036341
5                  512.0       0.042411                 0.041894
6                 1024.0       0.053278                 0.053914
7                 2048.0       0.076553                 0.076636
8                 4096.0       0.121539                 0.121610
9                 8192.0       0.212083                 0.212377
10               16384.0       0.395086                 0.395280
11               32768.0       0.757879                 0.757888
12               65536.0       1.486093                 1.486915
13              131071.0       2.941728                 2.941408

prompt-sm80-Phi-3-medium-128K-b4-h40_10x128-fp16:
    sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0              16.0       0.019448                 0.018872
1              32.0       0.022290                 0.022380
2              64.0       0.027986                 0.027955
3             128.0       0.062699                 0.062175
4             256.0       0.124868                 0.125247
5             512.0       0.298873                 0.298169
6            1024.0       0.862584                 0.863467
7            2048.0       2.944640                 2.957824
8            4096.0      11.318656                11.390720
9            8192.0      52.606976                52.019199
10          16384.0     232.616959               230.360062
11          32768.0    1024.171997              1019.540466
12          65536.0    4377.362305              4354.510742

token-sm80-Phi-3-medium-128K-b4-h40_10_d128-fp16:
    past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                   16.0       0.018192                 0.018175
1                   32.0       0.018999                 0.018319
2                   64.0       0.018447                 0.018897
3                  128.0       0.023863                 0.023195
4                  256.0       0.037712                 0.038192
5                  512.0       0.048863                 0.048548
6                 1024.0       0.067244                 0.066473
7                 2048.0       0.105203                 0.105021
8                 4096.0       0.180712                 0.180429
9                 8192.0       0.334948                 0.334734
10               16384.0       0.640662                 0.639709
11               32768.0       1.252196                 1.251684
12               65536.0       2.474927                 2.474280
13              131071.0       4.930829                 4.959340
```
2024-05-10 14:14:15 -07:00
guyang3532
cfe830b248
Generalize label input sparsity check and refactor (#20636)
### Description
The InsertGatherBeforeSceLoss optimization is enabled when the density
of the label padding is less than 90%, so we need to check the label
padding density to decide whether to enable the optimization.

Before this PR, we checked the graph inputs and correlated one with the
SCE node by iterating the graph from the SCE node back to a graph input.
This is hard to generalize because there may be complicated patterns
between the graph input and the SCE node.

This PR instead checks the padding density on the direct input of the
SCE module, rather than on the graph input, during the first graph
execution when exporting the ONNX graph.
If the density is < 90%, a flag PythonOp is inserted after the SCE node as:
```
           SoftmaxCrossEntropy
		  |
            PythonOp (func_name: FlagAndPrintDensity)   (insert if density < 90%)
		  |
            Following graph
```

When InsertGatherBeforeSceLoss is invoked, it checks whether the flag
PythonOp (func_name: FlagAndPrintDensity) is present after the SCE node;
if so, it removes the flag and performs the padding elimination
optimization.

If the environment variable ORTMODULE_PRINT_INPUT_DENSITY is 1, the
PythonOp (func_name: FlagAndPrintDensity) prints the input density at
each step. In this case the PythonOp is not removed.
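As a rough illustration of the density check described above, here is a minimal sketch. It assumes labels use -100 as the padding/ignore value and that "density" means the fraction of valid (non-padding) label positions; both are assumptions for illustration, not taken from the PR.

```python
IGNORE_INDEX = -100  # assumed padding/ignore label value (hypothetical)

def label_density(labels):
    """Fraction of label positions that are valid (not padding)."""
    flat = [v for row in labels for v in row]
    return sum(v != IGNORE_INDEX for v in flat) / len(flat)

# Two sequences of length 4; only 3 of 8 positions carry real labels.
labels = [[5, 7, -100, -100],
          [3, -100, -100, -100]]
density = label_density(labels)  # 3 valid of 8 -> 0.375
enable_gather = density < 0.9    # sparse enough to insert the Gather
```

In the real flow this check runs on the direct SCE input at the first graph execution, rather than offline as here.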
2024-05-10 21:55:43 +08:00
vivianw-amd
e124cf8e76
set unload to false to prevent crash when linux lib load not successfully (#20626)
### Description
<!-- Describe your changes. -->
During VitisAI shared library load, set unload to false to prevent a
crash when the Linux library fails to load.


### Motivation and Context

In a Linux environment, when the library is not loaded successfully, the
process ends up crashing without giving any useful message. The fix
prevents the crash and emits a useful message when the shared library is
not loaded correctly.
2024-05-10 00:01:23 -07:00
Tianlei Wu
01dd991f97
Update SparseAttention op spec to make it more flexible (#20625)
### Description
Make the operator more flexible:
(1) Decouple the max sequence lengths of the rotary cache, kv cache, and
block mask; they are now allowed to have different values.
(2) Replace the dense block_mask with a CSR format (block_row_indices
and block_col_indices) to improve performance.
(3) Mark past_key and past_value as required inputs, since they are
needed to compute the shapes of present_key and present_value.

### Motivation and Context
(1) LongRoPE has short and long rotary caches, which have different
lengths.
(2) Most users do not have enough GPU memory to run the maximum sequence
length of 128K. This change allows users to test with a smaller kv cache
length without running out of memory.
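For illustration, converting a dense 0/1 block mask to CSR-style indices works roughly as below. This is a sketch only; the exact layout expected by SparseAttention's block_row_indices / block_col_indices is assumed here, not confirmed from the op spec.

```python
def block_mask_to_csr(block_mask):
    """Convert a dense 0/1 block mask to CSR row offsets + column indices.

    row_offsets[r]..row_offsets[r+1] delimits the active columns of row r.
    """
    row_offsets = [0]
    col_indices = []
    for row in block_mask:
        for c, v in enumerate(row):
            if v:
                col_indices.append(c)
        row_offsets.append(len(col_indices))
    return row_offsets, col_indices

# 3x3 causal (lower-triangular) block mask:
mask = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
rows, cols = block_mask_to_csr(mask)
# rows == [0, 1, 3, 6]; cols == [0, 0, 1, 0, 1, 2]
```

The CSR form stores only the active blocks, so a kernel can skip empty blocks without scanning the full dense mask.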
2024-05-09 22:15:21 -07:00
George Wu
a0c4bd4da7
[qnn ep] sign onnxruntime.dll/pyd for qnn packages (#20634)
sign only onnxruntime.dll and onnxruntime_pybind11_state.pyd in
packages.
2024-05-09 20:45:44 -07:00
pengwa
56f7035521
Improve perf for mem efficient grad mgmt (#20480)
### Improve perf for mem efficient grad mgmt

When the memory-efficient gradient management feature is enabled, the
weight-retrieval PythonOp for every layer is launched at the beginning
of the forward pass, which leaves the GPU stream idle for a few
milliseconds. The reason is that the ReversedDFS ordering cannot always
handle such input branching well, so we introduce a
distance-to-input-leaf concept into the ReversedDFS, which moves not
only the problematic PythonOp but also the Cast ops following the
weight retrieval to the places where they are needed.
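The distance-to-input-leaf idea can be sketched roughly as follows: compute each node's distance to the nearest graph input, then use it as a tie-breaker when ordering nodes. This is a simplified illustration under an assumed graph representation, not the actual ORT implementation.

```python
from collections import deque

def distance_to_inputs(graph_inputs, consumers):
    """BFS distance from each node to its nearest graph input.

    `consumers` maps a node name to the nodes consuming its output
    (a hypothetical graph representation, not ORT's internal one).
    """
    dist = {n: 0 for n in graph_inputs}
    queue = deque(graph_inputs)
    while queue:
        node = queue.popleft()
        for nxt in consumers.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

# weight -> PythonOp(retrieve) -> Cast -> MatMul <- input
consumers = {
    "weight": ["PythonOp"],
    "PythonOp": ["Cast"],
    "Cast": ["MatMul"],
    "input": ["MatMul"],
}
dist = distance_to_inputs(["weight", "input"], consumers)
# dist["PythonOp"] == 1, dist["Cast"] == 2, dist["MatMul"] == 1
```

With such distances available, an ordering pass can defer the weight-retrieval branch until just before its consumer instead of launching it at the start of the forward pass.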

Main branch: 102.19s - 26.35s = 75.84s for 260 steps (4627 samples),
61.04 samples/second.
This PR: 100.28s - 25.10s = 75.18s for 260 steps, 61.54 samples/second
(+0.8% gain).

Main branch:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/75c4131e-dade-49b0-aa8b-ee1c637ad9a8)


This PR:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/e590a536-3b80-4f51-b89f-f25a55ddd7e2)


2024-05-10 08:09:17 +08:00
Yi Zhang
5a18818e1d
Migrate training storage from SAS to managed identity (#20618)
### Description
orttrainingtestdatascus only stores mnist, whose size is just 64M, in
Azure File.
To meet security requirements and reduce maintenance cost, move the
test data to lotusscus and save it in Azure Blob.
2024-05-09 15:44:29 -07:00
Jon Campbell
768c79317c
Enable QNN HTP support for Node (#20576)
### Description
Add support for using Onnx Runtime with Node

### Motivation and Context
ONNX Runtime supports the QNN HTP, but does not support it for Node.js.
This adds baseline support for using ONNX Runtime with Node.

Note it does not update the node packages that are distributed
officially. This simply patches the onnxruntime.dll to allow 'qnn' to be
used as an execution provider.

Testing was done using the existing onnxruntime-node package. The
`onnxruntime.dll` and `onnxruntime_binding.node` were swapped into
`node_modules\onnxruntime-node\bin\napi-v3\win32\arm64` with the newly
built version, then the various QNN dlls and .so files were placed next
to the onnxruntime.dll. Testing was performed on a variety of models and
applications, but the easiest test is to modify the [node quickstart
example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/quick-start_onnxruntime-node).
2024-05-09 13:11:07 -07:00
Jian Chen
d1cbb3e076
The time for nuget pkg should be consistent (#20522)
This pull request primarily involves changes to the build scripts in the
`tools/ci_build/github/azure-pipelines` directory. The changes add build
date and time information to the build process. This is achieved by
introducing two new parameters, `BuildDate` and `BuildTime`, and
incorporating them into the `msbuildArguments` in multiple locations.

Addition of new parameters:

*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59R309-R310):
Added `BuildDate` and `BuildTime` parameters using the pipeline's start
time.

Incorporation of new parameters in `msbuildArguments`:

*
[`tools/ci_build/github/azure-pipelines/c-api-noopenmp-packaging-pipelines.yml`](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL947-R948):
Added `CurrentDate` and `CurrentTime` parameters to `msbuildArguments`
in multiple locations.
[[1]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL947-R948)
[[2]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1092-R1093)
[[3]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1114-R1115)
[[4]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1137-R1138)
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L446-R448):
Incorporated the `CurrentDate` and `CurrentTime` parameters into
`msbuildArguments`.
2024-05-09 11:35:45 -07:00
Tianlei Wu
69cfcba38a
[CUDA] Sparse Attention support 128k sequence length (#20614)
### Description
When the sequence length is 128K, block_mask has 2048 rows, which was
not supported by the previous kernel.
(1) Add a new kernel to handle more than 1024 rows, where each thread
handles two rows.
(2) Add a test for sequence length 128K.
2024-05-08 20:54:38 -07:00
Edward Chen
a0db2187ee
Update CocoaPods package release script. (#20608)
- Update method for uploading to Azure storage to use managed identity.
- Allow helper script tasks to be split across different calls.
- Rewrite helper script in Python.

Motivation:
Recently the Azure storage account configuration was changed and now the old way of uploading to it no longer works.
2024-05-08 16:17:26 -07:00