1. Upgrade nodejs from 16.x to 18.x for Windows pipelines
2. Avoid using Azure DevOps "NodeTool" on Linux. The tool installs
nodejs from internet or local disk cache. But we already moved all Linux
tests to docker. So we do not need the installer anymore.
3. Remove some other unused code.
### Description
update Chromium browser launch command line flags
Canary already using dxc so no need to specify
'--enable-dawn-features=use_dxc' for canary.
### Description
allow DataTransfer to deal with zero sized input.
This is a standalone fix for zero-sized tensor handling for JSEP
DataTransfer. There are other components in JSEP not supporting
zero-sized tensors need to be fixed.
### Description
allow JsCustomAllocator to deal with zero sized input.
This is a standalone fix for zero-sized tensor handling for
JsCustomAllocator. There are other components in JSEP not supporting
zero-sized tensors need to be fixed.
### Description
One quantization case was not covered by the current list of unit tests.
This PR adds a unit test to cover that case with the fix. It fixes the
issue #17619.
### Motivation and Context
### Description
ort-web build step - webpack consumes the amount of memory on the edge
of Node.js(V8)'s default max-old-space-size, so increase the default
memory size to 5GB to avoid this issue.
### Description
QNN SDK version 2.14.1 fixed several issues with the QNN Resize
operator. This PR integrates the fixes and simplifies the
implementation.
### Motivation and Context
Improve Resize operator and test coverage.
Fix ARMv7 build error on Linux.
### Description
`cpuinfo_*` functions are only available if `CPUINFO_SUPPORTED` set and
therefore `"cpuinfo.h"` included.
Fixed with extended conditional code.
### Motivation and Context
Compilation with ARMv7 on Linux system fails.
### Description
This PR optimizes the gather op, which is improved ~6ms in segment
anything model in ADL.
The problem in original algorithm is that it includes a for loop to
calculate a block size of data. However, the block size may be very
large, like `65536`. In GPU shader, we should try to avoid large loop in
shader and try to use more threads to do it parallelly.
Before:
```
[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 6886207 ns
```
After:
```
[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 11719 ns
### Description
1. For binary ops, the components is always 4. So the dispatchGroup
should be : `{x: Math.ceil(outputSize / 64 /* workgroup size */ / 4 /*
component size */)}` instead of `{x: Math.ceil(outputSize / 64 /*
workgroup size */ / (vectorize ? 4 : 1) /* vec size */)}`.
2. If any of a or b only has one element, we still can use the vectorize
path since the same value will be broadcasted.
### Description
Add float16 and bfloat16 data type support for VitisAI ep
### Motivation and Context
The VitisAI ep has added the bfloat datatype support. So we would like
to register the datatype from onnxruntime side to enable them.
---------
Signed-off-by: Yiming Hu <yiming.hu@amd.com>
### Model post process for zero stage3 training
This is the last change to make single GPU/Multiple GPUs run pass.
Design details:
https://microsoft.sharepoint.com/:p:/t/ONNX2/EfNfJ43necpIoPI6x5M2zvYBVbfjoPQmG4Boc_F7-tHm1w?e=ekQwA6&nav=eyJzSWQiOjMxNiwiY0lkIjoxMDE1Nzg3NDZ9
`PyTorch` runs with ZeROOffloadSubscriber:
```
model = prepare_model(...)
from onnxruntime.training.utils.hooks import configure_ort_compatible_zero_stage3
configure_ort_compatible_zero_stage3()
```
`ORTModule` runs with ZeROOffloadSubscriber:
```
os.environ['ORTMODULE_ENABLE_ZERO_STAGE3'] = '1'
from onnxruntime.training.ortmodule import ORTModule
model = ORTModule(self.model)
```
It will be fairly easy to debug convergence issue if both ORT and
PyTorch can run the same offload path.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Layer norm fusion changes required for deepspeed stage 3, also includes
test case.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
It helps fusing layer norm for Deepspeed Stage 3. Added a test case
scenario which ensures that the fusion is working properly for the
scenario.
This adds an additional check before enabling MlasGemmU8S8DispatchAmx
for GEMM operations. After checking the CPUID for AMX-TILE and AMX-INT8,
an additional check is added that checks value of the XCR0 register.
The value in the OXR0 register is set by the OS and indicates support
for various CPU features. In this case the bits indicating XTILECFG and
XTILEDATA support are checked.
### Description
This adds an additional check before enabling MlasGemmU8S8DispatchAmx
for GEMM operations. After checking the CPUID for AMX-TILE and AMX-INT8,
an additional check is added that checks value of the XCR0 register.
The value in the OXR0 register is set by the OS and indicates support
for various CPU features. In this case the bits indicating XTILECFG and
XTILEDATA support are checked.
### Motivation and Context
Fix for crash reported directly by customer. When running older Windows
server OS on newer Gen4 Xeon processors.
Signed-off-by: Nash <george.nash@intel.com>
### Description
1. Remove 'dnf update' from docker build scripts, because it upgrades TRT
packages from CUDA 11.x to CUDA 12.x.
To reproduce it, you can run the following commands in a CentOS CUDA
11.x docker image such as nvidia/cuda:11.8.0-cudnn8-devel-ubi8.
```
export v=8.6.1.6-1.cuda11.8
dnf install -y libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} libnvinfer-vc-plugin8-${v} libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} libnvinfer-vc-plugin-devel-${v} libnvinfer-headers-devel-${v} libnvinfer-headers-plugin-devel-${v}
dnf update -y
```
The last command will generate the following outputs:
```
========================================================================================================================
Package Architecture Version Repository Size
========================================================================================================================
Upgrading:
libnvinfer-devel x86_64 8.6.1.6-1.cuda12.0 cuda 542 M
libnvinfer-headers-devel x86_64 8.6.1.6-1.cuda12.0 cuda 118 k
libnvinfer-headers-plugin-devel x86_64 8.6.1.6-1.cuda12.0 cuda 14 k
libnvinfer-plugin-devel x86_64 8.6.1.6-1.cuda12.0 cuda 13 M
libnvinfer-plugin8 x86_64 8.6.1.6-1.cuda12.0 cuda 13 M
libnvinfer-vc-plugin-devel x86_64 8.6.1.6-1.cuda12.0 cuda 107 k
libnvinfer-vc-plugin8 x86_64 8.6.1.6-1.cuda12.0 cuda 251 k
libnvinfer8 x86_64 8.6.1.6-1.cuda12.0 cuda 543 M
libnvonnxparsers-devel x86_64 8.6.1.6-1.cuda12.0 cuda 467 k
libnvonnxparsers8 x86_64 8.6.1.6-1.cuda12.0 cuda 757 k
libnvparsers-devel x86_64 8.6.1.6-1.cuda12.0 cuda 2.0 M
libnvparsers8 x86_64 8.6.1.6-1.cuda12.0 cuda 854 k
Installing dependencies:
cuda-toolkit-12-0-config-common noarch 12.0.146-1 cuda 7.7 k
cuda-toolkit-12-config-common noarch 12.2.140-1 cuda 7.9 k
libcublas-12-0 x86_64 12.0.2.224-1 cuda 361 M
libcublas-devel-12-0 x86_64 12.0.2.224-1 cuda 397 M
Transaction Summary
========================================================================================================================
```
As you can see from the output, they are CUDA 12 packages.
The problem can also be solved by lock the packages' versions by using
"dnf versionlock" command right after installing the CUDA/TRT packages.
However, going forward, to get the better reproducibility, I suggest
manually fix dnf package versions in the installation scripts like we do
for TRT now.
```bash
v="8.6.1.6-1.cuda11.8" &&\
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo &&\
yum -y install libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} libnvinfer-vc-plugin8-${v}\
libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} libnvinfer-vc-plugin-devel-${v} libnvinfer-headers-devel-${v} libnvinfer-headers-plugin-devel-${v}
```
When we have a need to upgrade a package due to security alert or some
other reasons, we manually change the version string instead of relying
on "dnf update". Though this approach increases efforts, it can make our
pipeines more stable.
2. Move python test to docker
### Motivation and Context
Right now the nightly gpu package mixes using CUDA 11.x and CUDA 12.x
and the result package is totally not usable(crashes every time)
### Description
Include onnxruntime_float16.h in the package.
### Motivation and Context
This was missed in the recently released 1.16 pkgs (except Nuget).
manylinux build is used for nightly packaging generation and it's hard
to capture issue in time when related files change. This PR add
manylinux build in CI.
### Description
Adds more operator unit tests (all op types should now have at least 1
unit test):
- [x] Reshape
- [x] Flatten
- [x] Squeeze
- [x] Unsqueeze
- [x] Gemm
- [x] Clip
- Enable QDQ Clip on HTP backend (when not optimized away by L1
ClipQuantFusion optimizer)
- Add support for 16-bit QDQ Clip to ClipQuantFusion optimizer
- [x] Split
- [x] Topk
- Enable QDQ TopK on HTP backend
- [x] Tile
- Enable QDQ Tile on HTP backend
### Motivation and Context
Increase QNN operator support and test coverage.
The extensions submodule was removed in [this
PR](https://github.com/microsoft/onnxruntime/pull/17097) but not deleted
from the list of git modules. This causes breaks in code ingesting ORT
that references the git modules for an accurate list of submodules.
This change removes the extensions from the list of git modules to
resolve this issue.
### Description
Instead, set level to DEBUG for the logger returned.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Otherwise, this function call overrides root logger level setting, which
affects logging facility of other python packages.
* Break QkvToContext into small functions. Each fused and unfused kernel
will have separated function.
* Move DecoderAttention kernel to separated file
* Move KV cache related kernel to attention_kv_cache.cu
### Motivation and Context
To make the code easier to maintain.
### Description
Fixes a bug in `get_shared_initializers` where `signature_cache1,
signature_cache2` are passed as positional arguments to
`remove_shared_initializers` but their positions don't match the
function signature. So `signature_cache1` is passed to `min_elements`
and causes comparison error at line 907.
Pass the arguments as kwargs so that it doesn't rely on their positions.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fixes the bug described above.
### Description
1. use standard win build template
2. enable compiler cache
### Motivation and Context
Make win build task easy to maintain and accelerate the pipeline.
### Description
When trying to use the TRT EP option trt_extra_plugin_lib_paths I
noticed that my custom op library was not being loaded by the EP. After
some digging I found that code was missing to update this option when
UpdateTensorRTProviderOptions() is used to set it.
At the same time I noticed that char arrays were allocated in that
function and wondered where they are de-allocated. When I found it was
done in ReleaseTensorRTProviderOptions(), I noticed that a few
de-allocations were missing.
### Motivation and Context
This PR fixes the problems described above.
### Description
* TensorRT EP can fall back to CUDA EP if it's explicitly assigned
* MIGraphX can fall back to ROCM if it's explicitly assigned
Test cases:
| When user specifies providers= | self._fallback_providers= |
| ------------------------------------------------------------ |
------------------------------------------------- |
| ["TensorrtExecutionProvider", "CUDAExecutionProvider"] |
["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider",("CUDAExecutionProvider", cuda_options)]
| ["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider"] | ["CPUExecutionProvider"] |
| [("TensorrtExecutionProvider", trt_options)] |
["CPUExecutionProvider"] |
| [("TensorrtExecutionProvider", trt_options), ("CUDAExecutionProvider",
cuda_options)] | ["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider", "CPUExecutionProvider"] |
["CPUExecutionProvider"] |
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Apply comments of https://github.com/microsoft/onnxruntime/issues/17394
and unify the logic to [MIGraphX, ROCM]
### Description
Update test to explicitly fail for webnn without proxy.
I am doing this change because if I test webnn with other backend
together, it silently enables proxy. I want to make test runner behave
with less implicit flag reset. If proxy is not enabled, webnn test
should fail.
@Honry please let me know if other places (eg. CI scripts) should change
also.
### Description
Remove `Resolve()` on the entire graph as each function is resolved.
We retain `Resolve()` after each inlining iteration.
### Motivation and Context
Poor performance for inlining the model and session initialization.
Original model before Resolve() removal
FunctionTest.Profiling (**65953 ms**)
After Resolve() Removal
FunctionTest.Profiling (**2911 ms**)
RelWithDebInfo pre-inlined model. Presumably because it runs Level1
optimizers
Non-inlined model consists of functions and Level1 optimizers have no
effect.
FunctionTest.Profiling (**9851 ms**)
To avoid a huge cu file and make code more readable:
- Move PrepareQKV to separate cu file (attention_prepare_qkv.cu)
- Move ConcatPastToPresent to attention_concat.cu
- Add default value for AttentionData
- Add a data structure QkvData to track Q, K and V pointers and track
QKV format.
- [x] Optimize SDXL models exported by optimum.
- [x] Enable it to run locally instead of using module.
- [x] Detect external data file in original model, and save with same
format by default.
- [x] Add tests
### Example
```
pip install optimum transformers diffusers onnx onnxruntime-gpu>=1.16
optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl ./sd_xl_base_onnx
python -m onnxruntime.transformers.models.stable_diffusion.optimize_pipeline -i ./sd_xl_base_onnx -o ./sd_xl_base_fp16 --float16
```
### Known issues
(1) VAE decoder cannot be converted to float16. Otherwise, there will be
black image in output.
(2) To use the float16 models, need a minor change in optimum to convert
the inputs for VAE decoder from float16 to float32 since we keep VAE
decoder as float32. The change is to append a line like the following
after [this
line](afd2b5a366/optimum/pipelines/diffusers/pipeline_stable_diffusion_xl.py (L483))
```
latents = latents.astype(np.float32)
```
### Description
WebNN CPU backend expects slope of PRelu to be a static value. For now,
we will not support it.
### Motivation and Context
Fallback this case to pass the CI.
### Description
The files should not have the minor version number. The names were added
in #17365 by mistake.
### Motivation and Context
We did not successfully exclude them out.