### Description
Delete all Prefast tasks because the new VS 17.7 version crashes every
time we run the task on our CI build servers. We cannot reproduce the
crash locally, and the problem blocks us from installing security
patches on our CI build machines.
We will use [CodeQL](https://codeql.github.com/) instead.
### Motivation and Context
Address some security alerts.
### Description
Release the session after use in npm test.
This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.
List of prerequisite PRs:
#17465, #17469, #17470 (this one)
During optimization of SDXL UNet, prune_graph takes up to 5 minutes.
The cause is that finding a node by searching through all nodes is
time-consuming. This optimization reduces the latency of prune_graph to
2 seconds.
The new algorithm uses a hash table (key is the first node output, value
is the node) to speed up the lookup.
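The lookup change described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Node` class and both helper functions are hypothetical stand-ins, and only the idea of keying a dictionary by each node's first output comes from the description.

```python
from typing import Dict, List, NamedTuple, Optional


class Node(NamedTuple):
    """Hypothetical stand-in for a graph node: a name plus its output names."""
    name: str
    outputs: List[str]


def build_output_index(nodes: List[Node]) -> Dict[str, Node]:
    """Map each node's first output name to the node. Built once in O(n)."""
    return {node.outputs[0]: node for node in nodes if node.outputs}


def find_producer_linear(nodes: List[Node], output_name: str) -> Optional[Node]:
    """Scan-based lookup for comparison: O(n) work on every query."""
    for node in nodes:
        if node.outputs and node.outputs[0] == output_name:
            return node
    return None


nodes = [Node(f"node_{i}", [f"out_{i}"]) for i in range(1000)]
index = build_output_index(nodes)

# Each query against the index is a constant-time dictionary lookup.
producer = index.get("out_42")
```

With the index, repeated lookups during pruning no longer rescan the node list, which is where the scan-based version spends its time on large graphs.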
The old provisioning profile no longer works. Switched to a temporary one that we can use before a new one is available. The temporary one has a different name.
Revert to the old TRT EP behavior of guarding the whole compute_function
with a lock_guard.
The current TRT EP only puts the lock_guard around a critical section
inside compute_function, which is clearly wrong.
One thread can be updating the engine in compute_function while another
thread still accesses the stale or corrupted engine instance from code
outside the critical section, for example
`int total_bindings = trt_engine->getNbBindings()`.
Making the whole compute_function the critical section should avoid
this.
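As a rough Python analogue of holding a `std::lock_guard` for the whole function, the sketch below keeps one lock for the entire call so that no thread can read the engine while another is replacing it. `EngineRunner`, its dictionary "engine", and `run_concurrently` are hypothetical illustrations, not part of the TRT EP.

```python
import threading
from typing import List


class EngineRunner:
    """Hypothetical stand-in for an EP that may rebuild its engine per call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._engine = {"num_bindings": 2}

    def compute(self, num_bindings: int) -> int:
        # The lock is held for the whole call, so reads of the engine can
        # never interleave with another thread's rebuild of it.
        with self._lock:
            if self._engine["num_bindings"] != num_bindings:
                self._engine = {"num_bindings": num_bindings}  # "rebuild"
            return self._engine["num_bindings"]


def run_concurrently(runner: EngineRunner, requests: List[int]) -> List[int]:
    """Call runner.compute once per request, each on its own thread."""
    results = [0] * len(requests)

    def worker(i: int) -> None:
        results[i] = runner.compute(requests[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(requests))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The trade-off is that calls are fully serialized, which gives up concurrency inside the function in exchange for never observing a half-updated engine.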
### Description
This PR proposes a change that should speed up inference for the
TreeEnsemble* kernels. Previously, when traversing a decision tree, the
`TreeNodeElement` pointer would be incremented or decremented to the
appropriate child node. I assume this was because the
`truenode_inc_or_first_weight` and `falsenode_inc_or_n_weights` members
were overloaded for two purposes.
In this PR, we now assign the true branch pointer. We also initialise
`nodes_` in a pre-order traversal which means that the false branch's
position can be resolved statically and does not need to be stored.
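A minimal sketch of the layout idea, under stated assumptions: nodes are written to a flat list in pre-order with the false child visited first, so the false branch always sits at `index + 1` and only the true branch needs a stored index. The `TreeNode` class, the dictionary-based flat nodes, and the `<=` comparison are illustrative choices and not the kernel's actual data structures.

```python
from typing import List, Optional


class TreeNode:
    """Hypothetical input tree: internal nodes have two children, leaves a value."""

    def __init__(self, feature: int = -1, threshold: float = 0.0,
                 true_child: Optional["TreeNode"] = None,
                 false_child: Optional["TreeNode"] = None,
                 value: float = 0.0) -> None:
        self.feature = feature
        self.threshold = threshold
        self.true_child = true_child
        self.false_child = false_child
        self.value = value


def flatten_preorder(root: TreeNode) -> List[dict]:
    """Flatten the tree so each false child directly follows its parent."""
    flat: List[dict] = []

    def visit(node: TreeNode) -> int:
        index = len(flat)
        entry = {"feature": node.feature, "threshold": node.threshold,
                 "value": node.value, "true_index": -1}
        flat.append(entry)
        if node.true_child is not None and node.false_child is not None:
            visit(node.false_child)                       # lands at index + 1
            entry["true_index"] = visit(node.true_child)  # stored explicitly
        return index

    visit(root)
    return flat


def predict(flat: List[dict], x: List[float]) -> float:
    """Walk the flat tree: follow true_index, or step to index + 1 for false."""
    i = 0
    while flat[i]["true_index"] != -1:
        node = flat[i]
        i = node["true_index"] if x[node["feature"]] <= node["threshold"] else i + 1
    return flat[i]["value"]


tree = TreeNode(feature=0, threshold=0.5,
                true_child=TreeNode(value=1.0),
                false_child=TreeNode(feature=1, threshold=2.0,
                                     true_child=TreeNode(value=2.0),
                                     false_child=TreeNode(value=3.0)))
flat = flatten_preorder(tree)
```

Because the false branch is implied by position, each flat node carries one index instead of two, and a run of false branches walks consecutive memory.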
I observe the following speedups. The benchmarks used are derived from
those in https://github.com/siboehm/lleaves/tree/master/benchmarks and
the baseline is the main branch.
NYC Dataset
--------------
| Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 176.539 | 155.709 | 145.119 | 11.7989 | 17.7976 |
| 4 | 59.9015 | 51.9652 | 50.0884 | 13.2488 | 16.382 |
| 8 | 34.5561 | 31.3024 | 28.2535 | 9.41581 | 18.2387 |
Airline Dataset
---------------
| Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 2127.34 | 1389.7 | 920.373 | 34.6745 | 56.736 |
| 4 | 723.307 | 481.634 | 310.618 | 33.4122 | 57.0558 |
| 8 | 420.722 | 278.397 | 185.265 | 33.8286 | 55.9651 |
mtpl2 Dataset
--------------
| Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 1143.62 | 1020.04 | 998.171 | 10.8055 | 13.0988 |
| 4 | 386.153 | 339.905 | 328.061 | 11.9764 | 14.3729 |
| 8 | 225.995 | 200.665 | 199.057 | 11.2084 | 13.4408 |
These were run on an M2 Pro with 16GB of RAM. All times are in
milliseconds, averaged over 10 runs with a batch size of 100,000.
### Motivation and Context
Performance improvements.
### Description
Updates the version of the QNN SDK used by the CI pipelines. Enables
some tests fixed by 2.14.1; Resize still needs to be looked into in a
separate PR.
### Motivation and Context
Test latest version of QNN SDK.
A recent change was made in
5a83a67f32
to make `ep_type` a reference instead of having it be a copy, presumably
to avoid assigning strings (so `auto& ep_type =
node->GetExecutionProviderType()` instead of `auto ep_type =
node->GetExecutionProviderType()`). The problem with this change is that
calling `node->SetExecutionProviderType(kCpuExecutionProvider)` also
changes the value seen through the reference, which makes it impossible
to revert the node to its previous EP.
This change fixes this bug and adds an optimization over the previous
approach by only assigning a string when we know that we are dealing
with a non-CPU node.
### Description
Added Einsum operator support to JSEP.
### Motivation and Context
### Description
Update the Web CI pipelines:
- Remove the parameter 'WebTemplate'. Since we started to support
webgpu, linux-web-ci.yml no longer works and is already out of date.
Remove this file and the parameter so that we always use win-web-ci.yml.
- Split the flag `RunWebGpuTests` into two flags, one for release and
one for debug. Currently CI only runs webgpu tests on the release build,
but we want to be able to run them on the debug build as well.

After this PR is merged, the next step is to enable both Debug and
Release webgpu tests in the PostMerge pipeline.
This makes it possible to call `optimize_by_onnxruntime` for float32 unet if `--use_external_data_format` is also used.
### Motivation and Context
When using `optimize_pipeline.py` without `--float16`, `optimize_by_onnxruntime` was not called for unet.
### Description
Add new name "WebGPU_Buffer" to OrtMemoryInfo.
This is one of the prerequisites for supporting IO binding for WebGPU
buffer in onnxruntime-web.
List of prerequisite PRs:
#17465, #17469 (this one)
### Description
[Successful pipeline
run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results)
Added flag to build the training artifacts & updated the
pull-wasm-artifacts script to pull the training artifacts as well.
Bundled into this PR are minor formatting fixes + naming fixes.
### Motivation and Context
[This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended
the WASM API wrapper to build training WASM artifacts as well.
The ORT training WASM artifacts are required to support ORT training web
bindings.
### Description
This PR contains a few changes in /js/common/ to support a coming PR for
a full implementation of webgpu IO binding.
- Allows pass-through if a value returned by `handler.run()`, as called
by `InferenceSession.run()` (inference-session-impl.ts), is already a
Tensor instance. Specifically, onnxruntime-node and
onnxruntime-react-native use native bindings to generate a Tensor-like
object, so we need to create a real Tensor instance here; for
onnxruntime-web the return value is already a Tensor instance.
- Adds new types to the GPU buffer supported types: `'float32'|'int32'` ->
`'float32'|'float16'|'int32'|'int64'|'uint32'|'bool'`
- Exports the types `GpuBufferDataTypes`, `CpuPinnedDataTypes` and
`TextureDataTypes`.
The embedding sum can be a graph output (for example, when exporting
with output hidden states enabled). Previously, we only checked whether
there were multiple child nodes to decide whether to output the
embedding sum from the fused node. This fix also checks whether the sum
is a graph output and, if so, retains its name.
### Description
The yaml file changes made in #16050 do not really work. Currently the
pipeline is failing with error:
```
Error: Not found SourceFolder: C:\a\_work\5\b\RelWithDebInfo\RelWithDebInfo\nuget-artifacts\onnxruntime-win-x64\lib
```
So, I will revert the yaml changes first to bring the pipeline back.
Some people are waiting for our nightly packages.
Test run:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=351104&view=results
### Motivation and Context
### Description
Install dotnet 6.0 in the docker image.
Move the C# build and test into docker.
### Motivation and Context
### Note
The migration of the unit tests and symbolic shape inference will be in
another PR.
Some initializers are added without the raw=True flag, so those tensors
cannot be saved to external data. If those tensors exceed 2GB in total,
the optimized model cannot be saved because of the protobuf size limit.
This change saves attention weights and biases as raw data.
Note: using raw data for shape tensors is optional since they are tiny.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/17212
https://github.com/microsoft/onnxruntime/issues/15349
### Description
graph_save only saves the proto of the graph instead of the entire model.
### Motivation and Context
We would like to export part of a model as a new model for unit tests,
so we have to change the API to support this.
### Description
Include support for neg.int32.
### Motivation and Context
### Description
Commit fffefb1c22 (#16969) optimized
matmul and also fixed broadcasting, so #17191 is no longer needed.
However, the operator test file newly added in that PR by @dakenf is
helpful, so this PR picks it up to enhance the tests.
### Description
Clean up JSDoc for onnxruntime-common:
- Replace "@internal" with "@ignore", as JSDoc does not use "@internal".
Using "@ignore" keeps the content out of the generated docs.
### Description
Remove unnecessary cargo:rerun-if-changed declaration.
### Motivation and Context
'cargo:rerun-if-changed' declarations tell Cargo when to re-run the
build script. The intention is that if the build script depends on other
files, then Cargo knows to re-run if those files change. It stores the
output and checks it before each build. The intention is that one emits
the declarations for _inputs_ of the build.
This rerun-if-changed declaration is a declaration on the _output_ of
the build, and stores the absolute path of the output. This is not a
useful declaration because the output path is unique to the build script
- there is no way for anything else to change it.
However, this does generate unnecessary rebuilds in some cases, for
example if the dependent repository is moved in the filesystem. This
causes me some issues when using https://crane.dev, as due to some
implementation details, if a crate being moved triggers a rebuild, by
default the build is broken.
To summarise:
- declaration is redundant
- causes issues in niche cases.
### Description
1. Update docker files and their build instructions.
ARM64 and x86_64 can use the same docker file.
2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8
AB#18990
### Description
Remove onnxruntime_test_all from emulator once tests have finished as
it's 1.2GB and takes up too much space given the 2GB maximum partition
size for the emulator.
A side issue is that the Java build isn't able to strip the binaries in
the Java APK, which causes it to be 800MB (which pushes the total past
the 2GB max). That may require an Android/Gradle fix as I don't think we
can hardcode an NDK version into our build files.
https://issuetracker.google.com/issues/237187538?pli=1
### Motivation and Context
Fix Android CI build failures for
### Description
This PR changes the Whisper export scripts to further optimize the
process of removing duplicate initializers from two subgraphs.
The current Greedy approach is quicker by a large factor, but results in
some duplicate initializers not being caught and removed. This not only
results in a slightly larger Whisper model, but also a model that uses
more GPU memory.
The approach in this PR uses data hashes and caches to keep the export
quick while no longer relying on a greedy approach.
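The hash-and-cache idea can be sketched as follows. This is an illustrative outline only: the `(name, raw bytes)` representation, the `deduplicate` helper, and the choice of SHA-256 are assumptions, and the actual export scripts are not shown here.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical representation: an initializer is a (name, raw bytes) pair.
Initializer = Tuple[str, bytes]


def deduplicate(initializers: List[Initializer]) -> Tuple[List[Initializer], Dict[str, str]]:
    """Keep the first tensor for each distinct payload and map duplicates to it."""
    seen: Dict[str, str] = {}      # digest -> name of the first tensor with that data
    kept: List[Initializer] = []
    renames: Dict[str, str] = {}   # duplicate name -> name of the tensor kept
    for name, data in initializers:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            renames[name] = seen[digest]
        else:
            seen[digest] = name
            kept.append((name, data))
    return kept, renames


kept, renames = deduplicate([
    ("encoder.weight", b"\x01\x02"),
    ("decoder.weight", b"\x01\x02"),  # same bytes as encoder.weight
    ("decoder.bias", b"\x03"),
])
```

Each tensor is hashed once and every later comparison is a dictionary lookup, so all byte-identical duplicates are found without pairwise comparisons. A real implementation would presumably also compare data type and shape before treating two tensors as the same.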
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
### Description
Validate output types and shapes. Make sure sparse initializers are
taken into account.
### Motivation and Context
ORT currently does not validate output types or shapes. Further, neither
inputs nor outputs take into account sparse initializers that are
converted from dense.
It is currently possible to pre-allocate an output buffer with the wrong
type or shape.
Cc: @Craigacp
### Description
The 8 cu files under [flash
attention](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/bert/flash_attention)
and the 4 cu files under [cutlass
fmha](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/bert/cutlass_fmha)
need a lot of memory to compile.
Previously, the default number of nvcc threads was the same as the
number of parallel jobs, i.e. the number of CPU cores.
Standard_NC4as_T4_v3 has 4 CPUs and 28 GB of memory, and we launched 16
nvcc threads in total (4 parallel jobs with 4 nvcc threads per job).
Each thread might take 4 GB on average (the peak is around 6 GB, but
threads are not all started at the same time). OOM happens since 16
threads might need close to 64 GB in the worst case. When the build
machine has 64 GB or more memory, OOM is rare.
Here we set a suitable nvcc --threads value based on available memory to
avoid OOM.
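The arithmetic above can be sketched as a small helper. The function name, its parameters, and the exact formula are assumptions for illustration; only the 4 GB per-thread average and the machine sizes come from the description, and the formula used by the build script may differ.

```python
def nvcc_threads(memory_gb: float, parallel_jobs: int, gb_per_thread: float = 4.0) -> int:
    """Estimate nvcc threads per job so the total stays within the memory budget.

    Hypothetical helper: assumes each nvcc thread needs about gb_per_thread GB.
    """
    total_threads = int(memory_gb // gb_per_thread)   # threads the memory can hold
    return max(1, total_threads // parallel_jobs)     # always allow at least one


# 28 GB with 4 parallel jobs: 7 threads fit in total, so 1 thread per job.
small_machine = nvcc_threads(28, 4)
# 64 GB with 4 parallel jobs: 16 threads fit in total, so 4 threads per job.
large_machine = nvcc_threads(64, 4)
```

Under these assumptions the 28 GB machine drops from 16 concurrent nvcc threads to 4, which keeps the estimated usage around 16 GB on average.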
### Motivation and Context
Fix `Python Packaging Pipeline (Training Cuda 11.8)`