WebNN Spec adds missing 64-bit integers support for `reduceL1`,
`reduceSum`, `reduceSumSquare` and `reduceProduct` ops at this
[PR](https://github.com/webmachinelearning/webnn/pull/695), which has
already been implemented in Chromium. Update corresponding data type
constraints in WebNN EP.
Besides, WebNN CPU backend currently doesn't support `uint64` and
`uint32` for these ops.
### Description
<!-- Describe your changes. -->
Fix the threshold limit
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We use this to debug the graph after each graph transformation. But, the
const data inside the model is useless.
So, we decided to remove that information to save disk space.
Co-authored-by: Chunye Wang <chunywan@xlnx.xilinx.com>
### Description
Adds the ability for MIGraphX EP to save off or load compiled models to
save time between inferences.
Via Command line
User should be able to set the save ability with
ORT_MIGRAPHX_SAVE_COMPILED_MODEL
ORT_MIGRAPHX_SAVE_COMPILE_PATH
User should be able to set the load ability with
ORT_MIGRAPHX_LOAD_COMPILED_MODEL
ORT_MIGRAPHX_LOAD_COMPILE_PATH
via Onnxruntime API
migx_save_compiled_model
migx_save_model_name
migx_load_compiled_model
migx_load_model_name
### Motivation and Context
The motivation for this is to leverage MIGraphX's existing API to
save/load models after our compile step of graph optimization. For
larger models or models which were compiled with additional tuning
steps, this saves time after first compile and inference run, and thus
speeds up the user experience in order to encourage development.
---------
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
<!-- Describe your changes. -->
The tools should really all come from the same Android NDK, so using
`shutil.which` adds potential confusion when we do a lookup for the
target program by name first due to adding `dirnames.insert(0, "")` as
the first directory entry to lookup as it will match the filename
anywhere in the current path.
That's problematic as the emulator should come from
<sdk_tools>/emulator/emulator (see
[here](https://www.stkent.com/2017/08/10/update-your-path-for-the-new-android-emulator-location.html)),
but the paths on the CI machines result in the old location of
<sdk_tools>/tools/emulator being selected. This leads to the emulator
failing to run on arm64 macOS CIs as the old emulator does not look for
the arm64 binary.
At the most you may have multiple cmdline-tools versions installed, but
if we need to support explicitly specifying a version for that path that
can be added.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make emulator run on arm64 macOS machines.
Add some helper functions to dump 4D tensors to help debugging.
Example to use it:
(1) Change DUMP_TENSOR_LEVEL from 0 to 2 in
contrib_ops/cpu/utils/debug_macros.h to enable dumping. Without
enabling, the dumping code will not be built into ORT binary.
(2) Add a few lines to dump tensors like
```
DUMP_CPU_TENSOR_INIT();
DUMP_CPU_TENSOR("tensor name", tensor_data, dim0, dim1, dim2, dim3);
```
Changes:
- [x] Add functions to dump 4D int32/int64/float/half tensors in CPU
- [x] Add functions to dump 4D int32/int64 tensors in CUDA
- [x] Change namespace (remove .transformers from namespace, and move
files to utils directory)
### Description
- Fixes compilation error for "reduced operator" builds with no FP16
kernels and `MLAS_F16VEC_INTRINSICS_SUPPORTED` enabled.
- Fixes linker error for "reduced operator" builds with QNN EP by
excluding QNN EP unit tests. QNN EP unit tests require CPU EP operator
implementations to evaluate accuracy.
### Motivation and Context
Need to be able to build a reduced operator build with QNN EP. See
https://github.com/microsoft/onnxruntime/blob/main/docs/Reduced_Operator_Kernel_build.md
The following example operator config file causes a compilation error
when either `MLAS_F16VEC_INTRINSICS_SUPPORTED` is defined or QNN EP is
enabled.
```
# reduced_op_config.txt
ai.onnx;12;Add
```
```shell
python tools\ci_build\build.py --include_ops_by_config reduced_op_config.txt --config Debug --build_wheel --build_shared_lib --skip_tests --build_dir build --parallel --use_qnn --qnn_home '<QNN_ROOT_DIR>'
```
Avoid using command line flags to pass in CMAKE_PREFIX_PATH. Use
environment variables instead.
Because, otherwise the value of CMAKE_PREFIX_PATH could get encoded
twice. For example, if the prefix is `C:\a\root`, then in
tools/ci_build/github/windows/helpers.ps1 we set it in Env:CMAKE_ARGS
which will be consumed by ONNX. Then when ONNX get it and decoded it,
ONNX will get `C:aroot` instead. Then because the path doesn't exist,
the CMAKE_PREFIX_PATH couldn't take effect when the script installs
ONNX. This PR fixes the issue.
The issue got discovered when I tried to upgrade cmake to a newer
version. Now our Windows CPU CI build pipeline uses cmake 3.27. In the
main branch even the CMAKE_PREFIX_PATH setting does not work, cmake
still can find protoc.exe from the directories. However, starting from
3.28 cmake changed it. With the newer cmake versions the find_library(),
find_path(), and find_file() cmake commands no longer search in
installation prefixes derived from the PATH environment variable.
### Release backward inputs per static graph ref count
For the output buffer marked as external output:
1. Remove the additional ref count we used for avoiding reusing buffer.
Instead, when we find reuse input/output buffer, we will make sure the
reused buffer not not generated by nodes that has external outputs.
2. Remove the ref count of pybind feed inputs, which exists all the time
until the run_backward completed. Instead, passing a mutuble feeds, and
we clean the feeds vector once that is copied into session states and
not needed any more before run the graph sequencentially.
#### Before the change:
One of the backward inputs is 3.9GB, it lives until the backward ends.

#### With the change:
The 3.9GB is released when the last node depending on that tensor
completed.

Be noted: the peak did not change though, we have more work to do to
reduce on the peak.
#### Others
It is found there are few tests that were updated to use incorrect
expected values in previous code refactoring
a81faee41e (diff-9e8fbae7d3dff24106cd17564949f320e943cb3048eae07813c7de144f140419L382).
This PR tries to fix them back, and I think now all test cases are back
to normal.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Several changes to remove warnings or errors on CPU benchmark and other
benchmarks.
- Phi-3 codes doesn't require auth token, but trust_remote_code flag.
Add "--trust" to enable only trust_remote_code.
- Phi3 models are not working with sdpa and needs to be run with eager
mode.
- Fix CPU io binding error with null device id and element_type
mismatch.
- use_buffer_share only when engine is ort.
### Description
<!-- Describe your changes. -->
Check if the onnxruntime_vitisai_ep.dll already linked. If yes, don't
use dlopen.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
If a executable is linked directly with onnxruntime_vitisai_ep.dll.
dlopen would cause two DLL with same content loaded. So we have to check
that case.
Linux can detect such situation automatically.
### Description
Fix some typos in Vitis AI EP.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: liumingyue <mingyue@xilinx.com>
Currently test_flash_attn_cuda.py can only run in Linux. It is because
it uses triton for rotary reference implementation, and triton python
package is not available in Windows.
This changes the script to allow the test run in Windows, so that we can
test memory efficient attention in Windows.
Due to limitation, rotary is excluded in testing on Windows.
### Description
<!-- Describe your changes. -->
Conditionally route to custom AllReduce kernel when buffer size and gpu
numbers meet certain requirements. Otherwise, keep using NCCL's
AllReduce.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ye Wang <wangye@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
Co-authored-by: Your Name <you@example.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Move jobs in onnxruntime-Win2022-GPU-T4 machine pool to
onnxruntime-Win2022-GPU-A10
### Motivation and Context
To reduce the variants of VM images we need to maintain. Now we have 3:
1. Windows 2022 CPU
2. Windows 2022 GPU A10
3. Windows 2022 GPU T4
This change allows us removing the last one.
### Description
Add "-allow-unsupported-compiler" flags to Windows CUDA flags. This
change only impacts our pipelines. By default it would not reach this
code path.
### Motivation and Context
nvcc refuses working with the latest VS toolset unless this flag is set.
If without this change, our CI build will fail with the compiler is the
latest VS 2022 17.10. Here is the log:
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1405549&view=logs&j=6df8fe70-7b8f-505a-8ef0-8bf93da2bac7&t=c7e55e04-f02b-57dc-d19a-29b7d3528c44&l=715
The error message is:
`D:\a\_work\_temp\v11.8\include\crt/host_config.h(153): fatal error
C1189: #error: -- unsupported Microsoft Visual Studio version! Only the
versions between 2017 and 2022 (inclusive) are supported! The nvcc flag
'-allow-unsupported-compiler' can be used to override this version
check; however, using an unsupported host compiler may cause compilation
failure or incorrect run time execution. Use at your own risk.
[D:\a\_work\1\b\RelWithDebInfo\CMakeFiles\CMakeScratch\TryCompile-g5rudf\cmTC_7b8ff.vcxproj]`
### Description
MultiHeadAttention benchmark script only supports cuda provider right
now.
This updates the script to support testing cpu operator and ploting gpu
latency.
### Motivation and Context
Benchmark for the coming cpu flash attention.
### Description
Fix a few issues in the Windows TRT job in "Zip-Nuget-Java-Nodejs
Packaging Pipeline":
1. It is a Windows job. It should not use bash(which is usually not
available on Windows).
2. When it sets ADO vars, it missed a semicolon
Here is the doc of how to set ADO vars via scripts:
https://learn.microsoft.com/en-us/azure/devops/pipelines/process/set-variables-scripts?view=azure-devops&tabs=bash
You could see it needs a semicolon . Without the semicolon , the vars
will have an extra quotation mark in their values.
https://github.com/microsoft/STL/pull/3824 introduces constexpr mutex.
An older version of msvcp140.dll will lead to ```A dynamic link library
(DLL) initialization routine failed```.
This error can be encountered if using conda Python since conda packages
msvc dlls and these are older right now.
This PR disables the constexpr mutex so that ort package can work with
older msvc dlls.
Thanks @snnn for the discovery.
### Description
Add blocked quantization to QuantizeLinear op kernel.
If the quantize axis is not the last axis, block the tensor using 1x128
blocks. Blocks are dispatched to multiple threads for concurrently
processing. Currently only support scalar instructions.
If the quantize axis is the last axis, block the tensor using 1 x
quant_block_size blocks. Blocks are dispatched to multiple threads for
concurrent processing. If output type is int types, call mlas kernel to
use the SIMD instructions in each block.
#### Benchmark data
20 core 2GHz CPU, RelWithDebInfo config, 196 x 4096 tensor, quantize
float to int4x2
Quantize before last axis:
* single thread, scalar instruction: 31380900 ns
* 8 thread, scalar instruction: 5098620 ns
Quantize last axis:
* single thread, scalar instruction: 27927900 ns
* 8 thread, SIMD instruction: 102261 ns
more thread, SIMD instruction, larger block size helps
### Motivation and Context
ONNX added blocked quantization to QuantizeLinear in optset 21
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Use `parameterized` to decompose the huge test case. This will make
adding ROCm support be possible.
---------
Co-authored-by: Guangyun Han <guangyunhan@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
### Description
Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 in
Windows
- [x] Upgrade cutlass to 3.5.0.
- [x] Fix flash attention build error with latest cutlass header files
and APIs. This fix is provided by @wangyems.
- [x] Update efficient attention to use new cutlass fmha interface.
- [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53.
- [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test
build error for cuda 11.8 to 12.3.
- [x] Disable TRT 10 deprecate warnings.
The following are not included in this PR:
* TRT provider replaces the deprecated APIs.
* Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This
test is not built by default unless you add `--cmake_extra_defines
onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` in build command.
To integrate to rel-1.18.1: Either bring in other changes (like onnx
1.16.1), or generate manifest and upload a new ONNX Runtime Build Time
Deps artifact based on rel-1.18.1.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/19891https://github.com/microsoft/onnxruntime/issues/20924https://github.com/microsoft/onnxruntime/issues/20953
### Description
ESM: use the bundled target as default export
In this change, the default import of the following entries:
```
import from 'onnxruntime-web';
import from 'onnxruntime-web/all';
import from 'onnxruntime-web/webgpu';
```
will use the "bundled" version, which has no dynamic import.
This change should only apply to ESM on web.
### Description
1. Publish debug symbols for Windows python packages. This PR will
publish them to ADO. Later on I will also replicate them to Microsoft
Symbol Server.
2. Build the packages in Release mode instead of RelWithDebInfo, to be
consistent with the other platforms(Linux/macOS/...)
### Motivation and Context
To help debug things. Sometimes we found an issue, but we couldn't debug
it because we didn't have symbols, and once we rebuilt the package
locally the issue was gone. This change would be helpful for such
scenarios.
Build log:
https://aiinfra.visualstudio.com/Lotus/_build?definitionId=841
### Description
Enable Hardsigmoid for QNN EP using SDK support direct support instead
of decomposing to its constituent ops so it can support the quantized
model
### Description
Length checking is even more strict for packed batching input.
There are two cases for a batch of input_ids.
- padded seq with equal length of inputs.
```
|----********|
|------------|
|--------****|
|-***********|
```
- packed seqs with different length of input_ids
`|----|---------|----|-|`
The max_seq_length is either from graph_inputs or the position_ids.
While in most of cases, we will cache the max_seq_length of rotary_cache
in the model ans shared among all layers.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: kailums <kalu@microsoft.com>
# Description
This PR removes the building of the ORT "mobile" packages and much of the associated infrastructure which is no longer needed.
Not removed yet - tools/ci_build/github/android/mobile_package.required_operators.config and the helper scripts that depend on it.
# Motivation and Context
The mobile packages were deprecated in 1.18. Users should use the full packages (Android - onnxruntime-android, iOS - onnxruntime-c/onnxruntime-objc) instead or do a custom build.
### Description
Update c-api-noopenmp-packaging-pipelines.yml: remove CUDA version
parameter
To reduce confusion. This pipeline is for generating CUDA 11 packages.
Just it. Not CUDA 12.
### Motivation and Context
In the last release we accidentally published CUDA 12(instead of CUDA
11) packages to nuget.org.
We also tried to publish CUDA 12 packages to
https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly.
Luckily it didn't go through because a package with the same version
number already existed there. Every time when someone runs this pipeline
with CUDA version set to 12, the built packages will be published to
https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly.
And GenAI team's build pipelines are based on the nightly packages. So
sometimes GenAI team builds their packages with CUDA 12 and sometimes
with CUDA 11, which is very random.
Therefore, please limit the use of pipeline parameters. Most Azure
DevOps yml files are template files. They should use parameters. But the
top level yml files should be more careful on that.
To replaced deprecated API.
Should verify with the `Gradle cmakeCheck` step from
`Windows_Packaging_CPU_x64_default` stage from the Zip-Nuge-...
pipeline.
### Description
Windows - Fully dynamic ETW controlled logging for ORT and QNN logs
The logging support is documented here
-
https://onnxruntime.ai/docs/performance/tune-performance/logging_tracing.html#tracing---windows
-
https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html#tracelogging-etw-windows-profiling
Also add support for logging ORT SessionCreation on ETW CaptureState
### Motivation and Context
The previous ETW support only worked if you enabled ETW before the
session started. There can commonly be long-lived AI inference processes
that need to be traced & debugged. This enables logging fully on the
fly.
Without this support a dev would have to end up killing a process or
stopping a service in order to get tracing. We had to do this for a
recent issue with QNN, and it was a bit painful to get the logs and it
ruined the repro.
### Testing
I tested with the following cases
- Leaving default ORT run
- Enabling ETW prior to start and leaving running for entire session +
inferences, then stopping
- Starting ORT session + inf, then enabling and stopping ETW
- Start ORT session /w long running Inferences
- wpr -start
[ort.wprp](e6228575e4/ort.wprp (L4))
-start
[etw_provider.wprp](e6228575e4/onnxruntime/test/platform/windows/logging/etw_provider.wprp)
- Wait a few seconds
- wpr -stop ort.etl
- Inferences are still running
- Verify ONNXRuntimeLogEvent provider events are present and new
SessionCreation_CaptureState event under Microsoft.ML.ONNXRuntime
provider
Related:
#18882#19428
Some dev environments come with a preinstalled abseil. For example,
conda users often do that. If the preinstalled abseil version is
incompatible with what we have in cmake/deps.txt, it could result in a
hard-to-understand build error. This PR adds a version check to improve
that.
### Description
Uses C-style casting for Power vector instructions in
`MlasQuantizeLinearInt4Kernel`.
### Motivation and Context
Vector commands (e.g., vec_xst) need C-style casting to support various
compiler versions.
ONNX Runtime CI pipelines do not build with all compiler versions. The
recent INT4 PR broke the powerpc build for certain compiler versions
because it uses C++-style `static_cast<>`.
See:
https://github.com/microsoft/onnxruntime/pull/20362#discussion_r1630106164
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
Support loading from model with multiple QNN context binary
### Motivation and Context
QNN EP generated context binary model only has one single QNN context.
Because of QNN PD memory limitation, large model (>3.5GB) has to be split into 2 smaller models. Then generate the model with context binary. User can load from the smaller models with context binary. The problem is it requires 2 Ort session. User want to glue the split models into 1 (with multiple EPContext nodes) so that they can use 1 Ort session to do the work.
QNN EP has limitation which only support loading from 1 single QNN context binary. This PR removes that limitation to unblock this user scenario.
---------
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Following constraints have been supported by WebNN TFLite backend:
- Concat: supports up to 4 inputs
- Matmul: supports broadcasting
- Resize: supports nearest mode
- Split: supports up to 4 outputs