### Flash attn recompute
1. Allow PythonOp(FlashAttn) to be recomputed correctly.
45879ff5c2
2. Use JSON to pass the selected-to-recompute subgraphs.
3c374da678
#### Better Memory Efficiency
The customer model can run both PyTorch SDPA and Flash Attn; this PR makes it
possible for the Flash Attn path to work with ORTModule layerwise
recompute. The peak memory drops from 45.xGB to 32.xGB if we only compare the
layers (not including other pieces; BTW, there are a few more optimizations
targeting other pieces coming later).
#### Better Perf
Using Flash Attn brings an additional 16% end-to-end time reduction, with a
highly aligned loss curve.

#### Use JSON File to pass Recompute Plans
To overcome the maximum-length limitation on strings defined in session
options.
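For illustration, a minimal sketch of the idea: the plan is written to a JSON file, and the file path (a short string) rather than the full plan string is handed to the configuration. The plan contents and the config key below are hypothetical, not the actual names used by this PR.

```python
import json
import tempfile

# Hypothetical sketch only: serialize the selected-to-recompute subgraphs to a
# JSON file so the plan is no longer limited by the max string length of a
# session option value.
recompute_plan = {
    "recompute_subgraphs": [
        {"subgraph": "FlashAttn+", "apply_count": 24},  # illustrative entry
    ]
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(recompute_plan, f)
    plan_path = f.name

# The short file path is then passed through session options, e.g. something
# like:
#   session_options.add_session_config_entry("<recompute_plan_key>", plan_path)
# where "<recompute_plan_key>" stands in for the real config key.
```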
### Description
#### Problem 1: Broken Transpose QDQ unit
Layout transform's specialized cost function aggressively pushes down
transposes with channel-first or channel-last perms. This can lead to a
situation where a channel-first/last Transpose gets stuck after being
pushed through an Unsqueeze node that makes the Transpose's perm no
longer channel-first/last. At this point, the specialized cost function
defers to the default cost function, which does not see a need to
continue pushing this transpose node. This breaks the QDQ node units for
both the Unsqueeze and the Transpose: DQ -> Unsqueeze -> Transpose -> Q.
<img width="266" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/82f8432d-ca27-451b-8c36-c8d87b806e30">
The transpose optimizer should insert a Q -> DQ pair between the
Unsqueeze and Transpose nodes to fix both QDQ node units: DQ ->
Unsqueeze -> Q[new] -> DQ[new] -> Transpose -> Q
<img width="198" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/5a584bdf-e5db-4622-b3bb-83c060e09261">
#### Problem 2: Inserted Squeeze/Transpose nodes should be constant folded when possible
The transpose optimizer inserts Squeeze (and Transpose) ops between an
initializer and a DQ to counteract the effect of Unsqueezing that
initializer if it is consumed by multiple nodes. This results in a graph
where the inserted nodes are not in valid node units:
Original graph where two Mul nodes share a common initializer input:
<img width="456" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/4b9155ae-e32f-41fc-9136-f953b73e92e7">
Resulting graph after transpose optimization without constant folding:
<img width="452" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/3c1bfef1-d45f-4d6e-aa19-1c2929eae3f5">
Here, the circled Transpose and Squeeze nodes operate on a quantized
integer type but are not in valid QDQ node units. The solution is to run
constant folding, which results in:
<img width="405" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/19691973/aebdb91f-f38f-4583-adec-33e46126365f">
### Motivation and Context
Improve the layout transformation to allow more models to run on EPs
that prefer the channel-last layout.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Add a new GitHub Actions workflow, `.github/workflows/mac.yml`. It contains these jobs:
- ARM64 macOS CI build.
- Objective-C static analysis build. This was moved over from another Azure DevOps pipeline to make it more visible.
### Description
Have a unified API in OVEP that passes the ONNX graph proto from ORT to OV
for compilation.
### Motivation and Context
The earlier implementation used two different flows depending on whether the
ONNX model path is present or the model is loaded from memory.
The former directly passed the ONNX model path to OV when the graph is
fully supported by the EP, while the latter passed the ORT model proto to OV.
This caused a difference in results when ORT optimizations are enabled.
This PR addresses the issue.
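For context, these are the two ways a model reaches the EP through the ORT Python API (session created from a path vs. from bytes); with this change both hand the same graph proto to OV. The model path is a placeholder:

```python
import onnxruntime as ort

providers = ["OpenVINOExecutionProvider"]

# Flow 1: model path present.
sess_from_path = ort.InferenceSession("model.onnx", providers=providers)

# Flow 2: model loaded from memory.
with open("model.onnx", "rb") as f:
    model_bytes = f.read()
sess_from_bytes = ort.InferenceSession(model_bytes, providers=providers)
```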
### Description
This PR makes a number of optimizations to onnxruntime-web's module export
and deployment.
See each section below for more details.
#### Preview
> [onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21)
> ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~
> ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~
> ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~
<details>
<summary><h4>Breaking changes</h4></summary>
There is no code change required, but there are a few differences
regarding **code import**, **flags**, **bundler config** and
**deployment steps**.
#### Importing:
The import table is changed. See the following for details.
<details>
<summary><h5>Current import table:</h5></summary>
| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training |
|------|-----|-----|-----|-----|-----|-----|
| `ort` (default) | `onnxruntime-web` | ✔️ | ❌ | ✔️ | ✔️ | ❌ |
| `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ | ❌ |
| `ort.node` | `onnxruntime-web` | ❌ | ❌ | ✔️ | ❌ | ❌ |
| `ort.training` | `onnxruntime-web/training` | ❌ | ❌ | ✔️ | ✔️<sup>\[1]</sup> | ✔️ |
| `ort.wasm` | `onnxruntime-web/wasm` | ❌ | ❌ | ✔️ | ✔️ | ❌ |
| `ort.wasm-core` | `onnxruntime-web/wasm-core` | ❌ | ❌ | ✔️ | ❌ | ❌ |
| `ort.webgl` | `onnxruntime-web/webgl` | ✔️ | ❌ | ❌ | ✔️<sup>\[2]</sup> | ❌ |
| `ort.webgpu` | `onnxruntime-web/webgpu` | ❌ | ✔️ | ✔️ | ✔️ | ❌ |

* \[1] Not tested; may not actually work.
* \[2] Not working; this is a mistake in the build config.
</details>
<details>
<summary><h5>Proposed update:</h5></summary>
| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm | Proxy | Training |
|------|-----|-----|-----|-----|-----|-----|
| `ort` (default) | `onnxruntime-web` | ✔️ | ❌ | ✔️ | ✔️ | ❌ |
| `ort.all` | ~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ | ✔️ | ✔️ | ❌ |
| `ort.node` | `onnxruntime-web` | ❌ | ❌ | ✔️ | ❌ | ❌ |
| `ort.training` | `onnxruntime-web/training` | ❌ | ❌ | ✔️ | ✔️ | ✔️ |
| `ort.wasm` | `onnxruntime-web/wasm` | ❌ | ❌ | ✔️ | ✔️ | ❌ |
| ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ | ~~❌~~ | ~~❌~~ | ~~✔️~~ | ~~❌~~ | ~~❌~~ |
| `ort.webgl` | `onnxruntime-web/webgl` | ✔️ | ❌ | ❌ | ~~✔️~~ ❌ | ❌ |
| `ort.webgpu` | `onnxruntime-web/webgpu` | ❌ | ✔️ | ✔️ | ✔️ | ❌ |
</details>
#### Flags:
The following flags are deprecated:
- `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in the
build.

The following flags changed their type:
- `env.wasm.wasmPaths`: When using this flag as a string (for the URL
prefix), nothing is changed. When using this flag as an object (for
per-file path overrides), the type changed:
```diff
- export interface Old_WasmFilePaths{
- 'ort-wasm.wasm'?: string;
- 'ort-wasm-threaded.wasm'?: string;
- 'ort-wasm-simd.wasm'?: string;
- 'ort-training-wasm-simd.wasm'?: string;
- 'ort-wasm-simd-threaded.wasm'?: string;
- };
+ export interface New_WasmFilePaths {
+ /**
+ * Specify the override path for the main .wasm file.
+ *
+ * This path should be an absolute path.
+ *
+ * If not modified, the filename of the .wasm file is:
+ * - `ort-wasm-simd-threaded.wasm` for default build
+ * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and WebNN)
+ * - `ort-training-wasm-simd-threaded.wasm` for training build
+ */
+ wasm?: URL|string;
+ /**
+ * Specify the override path for the main .mjs file.
+ *
+ * This path should be an absolute path.
+ *
+ * If not modified, the filename of the .mjs file is:
+ * - `ort-wasm-simd-threaded.mjs` for default build
+ * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and WebNN)
+ * - `ort-training-wasm-simd-threaded.mjs` for training build
+ */
+ mjs?: URL|string;
+ }
```
#### Bundler compatibility:
Config changes are needed for bundlers. See the usage examples in
/js/web/test/e2e/ for Webpack, Parcel, and Rollup.
#### Deployment:
- If consuming from a CDN, there is no breaking change.
- If consuming from a local server, you need to copy all `ort-*.wasm` and
`ort-*.mjs` files (6 files in total) from the dist folder. (Previously, only
the `ort-*.wasm` files needed to be copied.)
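For a local-server deployment, a copy step along these lines works; this is a minimal Python sketch, and the source and destination paths are examples:

```python
import glob
import shutil

# Copy the runtime artifacts next to your served bundle. With the new layout
# both the .wasm and .mjs files are needed (6 files in total).
for pattern in ("ort-*.wasm", "ort-*.mjs"):
    for path in glob.glob(f"node_modules/onnxruntime-web/dist/{pattern}"):
        shutil.copy(path, "public/")
```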
</details>
<details>
<summary><h4>Problems</h4></summary>
There are a few problems with the current module export and deployment:
- The script URL cannot be correctly inferred when imported as ESM.
- Workers are forcefully encoded using a Blob URL, which prevents
onnxruntime-web from working in CSP environments and in Node.js when using
the proxy or multi-threading feature.
- Generated JS code (by Emscripten) is encoded using
`function.toString()`, which is unstable and error-prone.
- Running with a different Emscripten build always requires the build step,
making it difficult to swap artifacts in development/debug.
</details>
<details>
<summary><h4>Goals</h4></summary>
- Full ESM support
- Support various ways to import, including:
- import from HTML's `<script>` tag (IIFE format, exporting to global
variable `ort`)
```html
<script
src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script>
```
- import from source code inside `<script type="module">` tag (ESM)
```html
<script type="module">
import * as ort from
"https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs";
// using 'ort'
</script>
```
- import in a CommonJS project (CJS format, resolve from package.json
"exports" field)
```js
// myProject/main.js
const ort = require('onnxruntime-web');
```
- import in an ESM project (ESM format, resolve from package.json
"exports" field)
```js
// myProject/main.js (or main.mjs)
import * as ort from 'onnxruntime-web';
```
- Support popular bundlers when importing onnxruntime-web into a CJS/ESM
project.
- webpack (esm requires extra post-process step)
- rollup
- parcel (esm requires extra post-process step)
- More bundlers **TBD**
- Multi-threading support for Node.js
NOTE: keeping a single JavaScript file (the all-in-one bundle) is no
longer a goal, because technically it conflicts with the other
requirements.
</details>
<details>
<summary><h4>Important Design Decisions</h4></summary>
- Drop support for single-JavaScript-file output.
  - The current onnxruntime-web distribution uses a single JavaScript file
to include all code. While there are a few benefits, it also creates the
problems mentioned above. Since ESM is being used more and more widely, and
browsers are applying more restrictive security checks and requirements, the
old Blob-based solution is going to be replaced.
  - To achieve the requirements, specifically CSP environment support, we
have to offer a non-Blob-based solution. Therefore, we have to distribute
multiple files and drop the single-file solution.
- Do not run a parser/postprocess on Emscripten-generated JavaScript.
  - Emscripten is evolving quickly, so we should only depend on what is in
its documentation instead of particular implementation details. (For
example, we currently patch its code to deal with a special variable
`_scriptDir`.)
  - Keeping the generated files as-is also helps to:
    - reduce the size of ort.min.js
    - make it easier to replace build artifacts in development/debug
- Drop support for non-SIMD and non-multi-thread builds. This helps to
reduce the number of artifacts in the distribution.
  - (Fixed-size) SIMD is supported in every mainstream JS environment.
  - Multi-threading as a WebAssembly feature is supported in every
mainstream JS environment. In some environments the feature is guarded by
cross-origin policy, but it still works as long as no worker is created.
- Use ESM output for Emscripten-generated JavaScript.
  - There are two ways to dynamically import classic (UMD) modules, and
neither of them is recommended:
    - dynamically creating a `<script>` tag. This changes the HTML
structure and has quite a lot of compatibility issues.
    - using `fetch()` and `eval()`. However, `eval` is strongly discouraged
because of the significant perf hit.
  - Importing ESM is easy: just use the `import()` call. Considering ESM is
widely supported in modern browsers and Node.js, this is the better option.
- Add a Blob-based solution as a fallback for cross-origin workers.
  - There are still wide use cases of importing onnxruntime-web from a CDN.
For this usage, make it possible to create a worker by using
`fetch()`+`Blob` to create a same-origin Blob URL.
</details>
<details>
<summary><h4>Distribution File Manifest</h4></summary>
The distribution folder contains the following files:
- WebAssembly artifacts. These files are the result of compiling the
ONNX Runtime C++ code to WebAssembly by Emscripten.
| File Name | Build Flags |
|------|-----|
| ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-training-wasm-simd-threaded.mjs <br/> ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/> `--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm | `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep` <br/> `--use_webnn` |
- onnxruntime-web JavaScript artifacts. These files are generated by
ESBuild as the entry point for onnxruntime-web.
There are multiple build targets for different use cases:
| Target Name | Path for "import" or "require" | Description |
|------|-----|-----|
| `ort` | `onnxruntime-web` | The default target. |
| `ort.all` | `onnxruntime-web/all` | The target including WebGL. |
| `ort.node` | `onnxruntime-web` | The default target for Node.js. |
| `ort.training` | `onnxruntime-web/training` | The target including training APIs. |
| `ort.wasm` | `onnxruntime-web/wasm` | The target including only the WebAssembly (CPU) EP. |
| `ort.webgl` | `onnxruntime-web/webgl` | The target including only the WebGL EP. |
For each target, there are multiple files generated:
| File Name | Description |
|------|-----|
| [target].js | The entry point for the target. IIFE and CommonJS format. |
| [target].mjs | The entry point for the target. ESM format. |
| [target].min.js <br/> [target].min.js.map | The entry point for the target. Minimized with sourcemap. IIFE and CommonJS format. |
| [target].min.mjs <br/> [target].min.mjs.map | The entry point for the target. Minimized with sourcemap. ESM format. |
| [target].proxy.mjs | (if applicable) The proxy ESM module for the target. |
| [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if applicable) The proxy ESM module for the target. Minimized with sourcemap. |
</details>
<details>
<summary><h4>Dynamic Import Explained</h4></summary>
- Local Served | No Proxy:
```
[Bundle or ort.min.js]
|
+ import()--> [ort-wasm-simd-threaded.mjs]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
|
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
```
- Local Served | Proxy:
```
[Bundle or ort.min.js]
|
+ import()--> [ort.proxy.min.mjs]
|
+ new Worker()--> [ort.proxy.min.mjs (worker)]
|
+ import()--> [ort-wasm-simd-threaded.mjs]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
|
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
```
- Cross Origin | No Proxy:
```
[Bundle or ort.min.js]
|
+ fetch('ort-wasm-simd-threaded.mjs')
|
+ URL.createObjectURL(res.blob())
|
+ import()--> [blob:... (ort-wasm-simd-threaded)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
|
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
```
- Cross Origin | Proxy
```
[Bundle or ort.min.js]
|
+ fetch('ort.proxy.min.mjs')
|
+ URL.createObjectURL(res.blob())
|
+ import()--> [blob:... (ort.proxy)]
|
+ new Worker()--> [blob:... (ort.proxy) (worker)]
|
+ fetch('ort-wasm-simd-threaded.mjs')
|
+ URL.createObjectURL(res.blob())
|
+ import()--> [blob:... (ort-wasm-simd-threaded)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
|
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
```
</details>
### Description
This PR adds fusions for [OpenAI's CLIP
model](https://huggingface.co/openai/clip-vit-large-patch14-336). Here
is an example of how to run the ORT transformer optimizer for the linked
CLIP model.
```
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers
$ python3 optimizer.py --input /path/to/model.onnx --output /path/to/model_opt.onnx --model_type clip --num_heads 16 --hidden_size 1024 --use_external_data_format --opt_level 0
```
### Motivation and Context
This PR helps optimize multi-modal models that use CLIP for the vision
encoder.
### Description
This PR adds support for using GroupQueryAttention (GQA) in models that
run on CPU.
### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA
for CUDA only. With the recently added support for [GQA on
CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](https://github.com/microsoft/onnxruntime/pull/20598).
### Description
Currently, there is one bool flag to indicate whether the kernel is loaded.
However, there are v1 and v2 kernels, so the flag allows only one
version of the kernel to be loaded. We use the v1 kernel for prompt and the
v2 kernel for token generation, so the flag causes an issue when we want
both prompt and token generation.
This bug was found in an integration test. The unit tests only test one
kernel at a time, so the issue was not found before.
Another possible workaround without this fix is to set the environment
variable `ORT_DISABLE_SPARSE_ATTENTION_V1=1`.
### Description
* Partially revert the [previous change](https://github.com/microsoft/onnxruntime/pull/19804)
* Redo the concurrency_test_result parser outside of post.py
* Add support for syncing the memtest result to the db
### Motivation and Context
To fix the error when CI is running on two model groups.
- When running on two model groups, the [previous
change](https://github.com/microsoft/onnxruntime/pull/19804) wrongly
navigates two levels up in the directory tree after running one model group,
while only one level is needed. After that, the script can't find the other
model group.
- Running on one model group doesn't repro the issue.
The WebNN spec has removed the activation option for conv and
batchNormalization, so we no longer need the additional activation fusion in
the WebNN EP.
[edit by fdwr] Note this is handled in the browser now, which knows more
about the backend platform version and can more safely make decisions
about which fusions are possible (e.g. for the DirectML backend, whether
softmax and gelu can fuse successfully with their base operator).
### Description
- Fix `logSeverityLevel`
- Correct how RCTCxxBridge is obtained; the old method could get the wrong
bridge in some cases
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
- Move iOS package build to separate job so it can run in parallel with Android AAR build and be decoupled from the test stage. The test stage fails sometimes (not infrequently) and may need to be re-run.
- Update stop iOS simulator step so it doesn't fail if the start step doesn't run.
### Description
Adds the extra option `QDQKeepRemovableActivations` to optionally
prevent automatic removal of Clip/Relu ops in QDQ models. The current
default behavior, which is to remove Clip/Relu, remains the same if the
new option is not enabled.
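For example, a sketch of enabling the option via the quantization tool's `extra_options` (the model paths and the calibration data reader are placeholders):

```python
from onnxruntime.quantization import QuantFormat, quantize_static

# Keep removable Clip/Relu ops explicit in the QDQ model instead of letting
# the quantizer remove them. `data_reader` is a placeholder for your
# CalibrationDataReader implementation.
quantize_static(
    "model.onnx",
    "model.qdq.onnx",
    calibration_data_reader=data_reader,
    quant_format=QuantFormat.QDQ,
    extra_options={"QDQKeepRemovableActivations": True},
)
```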
### Motivation and Context
Explicitly representing these Relu/Clip operators in the QDQ model is
necessary if optimizations or EP transformations will later remove
QuantizeLinear/DequantizeLinear operators from the model.
### Motivation and Context
The Intel NPU does not support 16-bit int quantized operators.
Consequently, the execution provider removes the
QuantizeLinear/DeQuantizeLinear (Q/DQ) operators from node units and
executes the operation as FP16 in the backend. However, if a Clip
operator was fused into a Q operator in the node unit, the removal of
Q/DQ operators results in inaccuracies because the effect of the
original Clip operator is lost.
Consider the following example:
- FP32 model: -> Op_FP32 -> Clip ->
- QDQ model: -> (DQ-> Op_FP32 -> Q) -> (DQ' -> Clip -> Q') ->
- After ClipQuantFusion: -> (DQ-> Op_FP32 -> Q) -> (DQ' -> Q') ->
- Intel Execution Provider strips Q/DQ: -> Op_FP16 ->
To solve this issue, we have enabled ClipQuantFusion exclusively on the
CPU execution provider.
### Description
- This PR combines all CUDA 12 stages into the Zip-nuget-... pipeline.
- It also enables CUDA 12 support.
### Description
The InsertGatherBeforeSceLoss optimization is enabled when the density of
label padding is less than 90%, so we need to check the label padding
density to decide whether to enable the optimization.
Before this PR, we just checked the graph inputs and correlated one of them
with the SCE node by iterating the graph from the SCE node back to a graph
input. This is hard to make general because there may be complicated
patterns between the graph input and the SCE node.
This PR instead checks the padding density on the direct input of the SCE
module, at the first graph execution when exporting the ONNX graph.
If the density is < 90%, a flag PythonOp is inserted after the SCE node as:
```
SoftmaxCrossEntropy
|
PythonOp (func_name: FlagAndPrintDensity) (insert if density < 90%)
|
Following graph
```
When InsertGatherBeforeSceLoss is invoked, it checks whether the flag
PythonOp (func_name: FlagAndPrintDensity) follows the SCE node; if so, it
removes it and performs the padding elimination optimization.
If the environment variable ORTMODULE_PRINT_INPUT_DENSITY is 1, we print the
input density at each step via the PythonOp (func_name:
FlagAndPrintDensity). In this case the PythonOp is not removed.
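For reference, a minimal way to turn on the density printing in a training script; the wrapped `model` is a placeholder, and the env var must be set before the ONNX export happens:

```python
import os

# Set before ORTModule exports the ONNX graph so the FlagAndPrintDensity
# PythonOp is kept and prints the label padding density every step.
os.environ["ORTMODULE_PRINT_INPUT_DENSITY"] = "1"

from onnxruntime.training.ortmodule import ORTModule

model = ORTModule(model)  # `model` is your existing torch.nn.Module
```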
### Description
During VitisAI shared library load, set unload to false to prevent a crash
when the Linux library does not load successfully.
### Motivation and Context
In a Linux environment, when the library is not loaded successfully, the
process ends up crashing without giving any useful message.
The fix prevents the crash and gives a useful message when the shared
library is not loaded correctly.
### Description
Make the operator more flexible:
(1) Decouple the max sequence lengths of the rotary cache, kv cache and
block mask. They are allowed to have different values.
(2) Replace the dense block_mask with CSR format (block_row_indices and
block_col_indices) to improve performance (see the sketch below).
(3) Mark past_key and past_value as required inputs since we need them
to compute the shape of present_key and present_value.
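For clarity, a sketch of the dense-to-CSR conversion implied by the new inputs; the function below is illustrative (standard CSR construction), not code from this PR:

```python
import numpy as np

def dense_block_mask_to_csr(block_mask: np.ndarray):
    """Convert a dense 0/1 block mask (rows x cols) to CSR, matching the
    block_row_indices / block_col_indices inputs described above."""
    rows, _ = block_mask.shape
    block_row_indices = np.zeros(rows + 1, dtype=np.int32)
    block_col_indices = []
    for r in range(rows):
        cols = np.nonzero(block_mask[r])[0]
        block_col_indices.extend(cols.tolist())
        # Row pointer: cumulative count of nonzero blocks so far.
        block_row_indices[r + 1] = len(block_col_indices)
    return block_row_indices, np.array(block_col_indices, dtype=np.int32)
```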
### Motivation and Context
(1) LongRoPE has short and long rotary caches, which have different lengths.
(2) Most users do not have enough GPU memory to run the maximum sequence
length of 128K. This change allows users to use a smaller kv cache length
for testing without hitting out-of-memory.
### Improve perf for mem efficient grad mgmt
When the memory efficient gradient management feature is enabled, the
weight-retrieval PythonOp for every layer is launched at the beginning of
the forward pass, which makes the GPU stream idle for a few milliseconds.
The reason is that the ReversedDFS ordering cannot ALWAYS handle such input
branching well, so we introduce a distance-to-input-leaf concept when doing
the ReversedDFS, which moves not only the problematic PythonOp to the place
where it is needed, but also the Cast ops following the weight retrieval.
Main branch: 102.19s - 26.35s = 75.84s for 260 steps (4627 samples),
61.04 samples/second
This PR: 100.28s - 25.10s = 75.18s for 260 steps, 61.54 samples/second
(+0.8% gain)
### Description
orttrainingtestdatascus only stores MNIST, whose size is only 64 MB, in
Azure Files.
To meet security requirements and reduce maintenance cost, move the test
data to lotusscus and save it in Azure Blob storage.
### Description
Add support for using ONNX Runtime with Node.js
### Motivation and Context
ONNX Runtime supports the QNN HTP, but does not support it for Node.js.
This adds baseline support for using ONNX Runtime with Node.js.
Note it does not update the officially distributed node packages. This
simply patches onnxruntime.dll to allow 'qnn' to be used as an execution
provider.
Testing was done using the existing onnxruntime-node package. The
`onnxruntime.dll` and `onnxruntime_binding.node` were swapped into
`node_modules\onnxruntime-node\bin\napi-v3\win32\arm64` with the newly
built version, then the various QNN dlls and .so files were placed next
to the onnxruntime.dll. Testing was performed on a variety of models and
applications, but the easiest test is to modify the [node quickstart
example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/quick-start_onnxruntime-node).
This pull request primarily involves changes to the build scripts in the
`tools/ci_build/github/azure-pipelines` directory. The changes add build
date and time information to the build process. This is achieved by
introducing two new parameters, `BuildDate` and `BuildTime`, and
incorporating them into the `msbuildArguments` in multiple locations.
Addition of new parameters:
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59R309-R310):
Added `BuildDate` and `BuildTime` parameters using the pipeline's start
time.
Incorporation of new parameters in `msbuildArguments`:
*
[`tools/ci_build/github/azure-pipelines/c-api-noopenmp-packaging-pipelines.yml`](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL947-R948):
Added `CurrentDate` and `CurrentTime` parameters to `msbuildArguments`
in multiple locations.
[[1]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL947-R948)
[[2]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1092-R1093)
[[3]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1114-R1115)
[[4]](diffhunk://#diff-efb530efd945fdd9d3e1b92e53d25cc8db7df2e28071c364b07a7193092de01bL1137-R1138)
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L446-R448):
Incorporated the `CurrentDate` and `CurrentTime` parameters into
`msbuildArguments`.
### Description
When the sequence length is 128K, block_mask has 2048 rows, which is not
supported by the previous kernel.
(1) Add a new kernel to handle more than 1024 rows, where each thread
handles two rows.
(2) Add a test for sequence length 128K.
- Update method for uploading to Azure storage to use managed identity.
- Allow helper script tasks to be split across different calls.
- Rewrite helper script in Python.
Motivation:
Recently the Azure storage account configuration was changed and now the old way of uploading to it no longer works.