The extensions submodule was removed in [this
PR](https://github.com/microsoft/onnxruntime/pull/17097) but was not deleted
from the list of git modules. This breaks code that ingests ORT and
relies on the git modules for an accurate list of submodules.
This change removes the extensions entry from the list of git modules to
resolve this issue.
### Description
Instead, set level to DEBUG for the logger returned.
### Motivation and Context
Otherwise, this function call overrides the root logger's level setting, which
affects the logging of other Python packages.
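
A minimal sketch of the idea, using only the standard `logging` module. The helper name `get_logger` and the logger name `my_package.tool` are hypothetical and not taken from the change itself; the point is only that `setLevel` on a named logger leaves the root logger's level untouched.

```python
import logging


def get_logger(name: str) -> logging.Logger:
    """Hypothetical helper: return a DEBUG-level logger without touching the root logger."""
    logger = logging.getLogger(name)
    # Only this named logger is set to DEBUG; the root logger keeps its own level.
    logger.setLevel(logging.DEBUG)
    return logger


root_level_before = logging.getLogger().level
logger = get_logger("my_package.tool")  # hypothetical logger name
root_level_after = logging.getLogger().level
```

Other packages that read the root logger's level see the same value before and after the call.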
* Break QkvToContext into small functions. Each fused and unfused kernel
will have a separate function.
* Move DecoderAttention kernel to a separate file
* Move KV cache related kernels to attention_kv_cache.cu
### Motivation and Context
To make the code easier to maintain.
### Description
Fixes a bug in `get_shared_initializers` where `signature_cache1,
signature_cache2` are passed as positional arguments to
`remove_shared_initializers` but their positions don't match the
function signature. So `signature_cache1` is passed to `min_elements`
and causes comparison error at line 907.
Pass the arguments as kwargs so that it doesn't rely on their positions.
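
The failure mode can be reproduced with a small stand-in. The function below is a hypothetical stub, not the real `remove_shared_initializers`; its parameter order and the `"s_"` prefix are assumptions chosen so that a positionally passed cache lands in `min_elements`, as described above.

```python
def remove_shared_initializers_stub(graph1, graph2, shared_prefix="shared_", min_elements=1024,
                                    signature_cache1=None, signature_cache2=None):
    """Hypothetical stand-in; the parameter order is an assumption for illustration."""
    element_count = 2048  # made-up tensor size
    # min_elements is expected to be an int; a dict here raises TypeError.
    return element_count >= min_elements


cache1, cache2 = {}, {}

# Positional call: cache1 lands in min_elements and cache2 in signature_cache1.
try:
    remove_shared_initializers_stub("g1", "g2", "s_", cache1, cache2)
    positional_ok = True
except TypeError:
    positional_ok = False

# Keyword call: each cache reaches the parameter it was meant for.
keyword_ok = remove_shared_initializers_stub(
    "g1", "g2", "s_", signature_cache1=cache1, signature_cache2=cache2
)
```

Passing by keyword keeps the call correct even if the order of the optional parameters changes later.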
### Motivation and Context
Fixes the bug described above.
### Description
1. use standard win build template
2. enable compiler cache
### Motivation and Context
Make win build task easy to maintain and accelerate the pipeline.
### Description
When trying to use the TRT EP option trt_extra_plugin_lib_paths I
noticed that my custom op library was not being loaded by the EP. After
some digging I found that code was missing to update this option when
UpdateTensorRTProviderOptions() is used to set it.
At the same time I noticed that char arrays were allocated in that
function and wondered where they are de-allocated. When I found it was
done in ReleaseTensorRTProviderOptions(), I noticed that a few
de-allocations were missing.
### Motivation and Context
This PR fixes the problems described above.
### Description
* TensorRT EP can fall back to CUDA EP if it's explicitly assigned
* MIGraphX can fall back to ROCM if it's explicitly assigned
Test cases:
| When user specifies providers= | self._fallback_providers= |
| --- | --- |
| ["TensorrtExecutionProvider", "CUDAExecutionProvider"] | ["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider", ("CUDAExecutionProvider", cuda_options)] | ["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider"] | ["CPUExecutionProvider"] |
| [("TensorrtExecutionProvider", trt_options)] | ["CPUExecutionProvider"] |
| [("TensorrtExecutionProvider", trt_options), ("CUDAExecutionProvider", cuda_options)] | ["CUDAExecutionProvider", "CPUExecutionProvider"] |
| ["TensorrtExecutionProvider", "CPUExecutionProvider"] | ["CPUExecutionProvider"] |
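
The rule behind the test cases can be sketched as follows. This is an illustration only: `compute_fallback_providers` is a hypothetical helper, not the function used in the session code, and it returns provider names without their options, as the table does.

```python
def compute_fallback_providers(providers):
    """Hypothetical helper reproducing the table above (names only, options dropped)."""
    # A provider entry is either a name or a (name, options) tuple.
    names = [p if isinstance(p, str) else p[0] for p in providers]
    fallback_pairs = [
        ("TensorrtExecutionProvider", "CUDAExecutionProvider"),
        ("MIGraphXExecutionProvider", "ROCMExecutionProvider"),
    ]
    for primary, secondary in fallback_pairs:
        # Fall back to the secondary EP only if the user explicitly listed it.
        if primary in names and secondary in names:
            return [secondary, "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


cuda_options = {"device_id": 0}  # placeholder options
trt_options = {"device_id": 0}   # placeholder options
```

The MIGraphX and ROCM pair goes through the same loop, which is the unification described in this change.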
### Motivation and Context
Address the comments in https://github.com/microsoft/onnxruntime/issues/17394
and apply the same logic to [MIGraphX, ROCM]
### Description
Update the test to explicitly fail for webnn without proxy.
I am making this change because when I test webnn together with other
backends, the test runner silently enables proxy. I want the test runner
to reset flags less implicitly. If proxy is not enabled, the webnn test
should fail.
@Honry please let me know if other places (e.g. CI scripts) should also
change.
### Description
Remove `Resolve()` on the entire graph as each function is resolved.
We retain `Resolve()` after each inlining iteration.
### Motivation and Context
Inlining the model and initializing the session performed poorly.
Original model before `Resolve()` removal:
FunctionTest.Profiling (**65953 ms**)
After `Resolve()` removal:
FunctionTest.Profiling (**2911 ms**)
RelWithDebInfo pre-inlined model. Presumably because it runs Level1
optimizers.
The non-inlined model consists of functions, so Level1 optimizers have no
effect on it.
FunctionTest.Profiling (**9851 ms**)
To avoid a huge cu file and make code more readable:
- Move PrepareQKV to separate cu file (attention_prepare_qkv.cu)
- Move ConcatPastToPresent to attention_concat.cu
- Add default value for AttentionData
- Add a data structure QkvData to track Q, K and V pointers and track
QKV format.
- [x] Optimize SDXL models exported by optimum.
- [x] Enable it to run locally instead of using module.
- [x] Detect external data file in original model, and save with same
format by default.
- [x] Add tests
### Example
```
pip install optimum transformers diffusers onnx onnxruntime-gpu>=1.16
optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl ./sd_xl_base_onnx
python -m onnxruntime.transformers.models.stable_diffusion.optimize_pipeline -i ./sd_xl_base_onnx -o ./sd_xl_base_fp16 --float16
```
### Known issues
(1) The VAE decoder cannot be converted to float16; otherwise the output
is a black image.
(2) To use the float16 models, a minor change in optimum is needed to convert
the inputs of the VAE decoder from float16 to float32, since we keep the VAE
decoder as float32. The change is to append a line like the following
after [this
line](afd2b5a366/optimum/pipelines/diffusers/pipeline_stable_diffusion_xl.py (L483))
```
latents = latents.astype(np.float32)
```
### Description
WebNN CPU backend expects slope of PRelu to be a static value. For now,
we will not support it.
### Motivation and Context
Fall back in this case so that the CI passes.
### Description
The files should not have the minor version number. The names were added
in #17365 by mistake.
### Motivation and Context
We did not successfully exclude them.
### Description
This adds a missing member initialization.
### Motivation and Context
It caused an access violation in
`Dml::GraphDescBuilder::BuildGraphDesc`.
### Description
The previous implementation used two loops over the H * W elements to
calculate the `mean` and `squaredNorm` values in one thread, and that same
thread also wrote H * W output elements. This is very slow when H * W is
large, which is usually the case in a model. For example, in the
`candy-8` model, the [H, W] shapes for the `InstanceNormalization` op are
[224,224], [112,112], and [56,56]. On my ADL, `[1,224,224,32]` takes
17 ms. See below:
```
[profiling] kernel "23848328|[InstanceNormalization] 23848328" input[0]: [1,224,224,32] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,224,224,32] | float32, execution time: 17007914 ns
```
This PR uses workgroup memory to optimize the original algorithm.
The advantage is that the 64 (workgroupSize) threads in one workgroup
compute the `mean` and `squaredNorm` values in parallel.
In addition, each thread writes only `H * W / workgroupSize` outputs,
which greatly reduces the per-thread work. With this
optimization, `[1,224,224,32]` takes about 3 ms, and the main overhead is the
two extra `transpose` ops. `createInstanceNormProgramInfo` itself needs only
`0.64` ms. See below:
```
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,224,224,32] | float32, output[0]: [1,32,224,224] | float32, execution time: 1543792 ns
program-manager.ts:115
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, input[1]: [32] | float32, input[2]: [32] | float32, output[0]: [1,32,224,224] | float32, execution time: 642652 ns
program-manager.ts:115
[profiling] kernel "23003600|[InstanceNormalization] 23003600" input[0]: [1,32,224,224] | float32, output[0]: [1,224,224,32] | float32, execution time: 991608 ns
```
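
The two-phase reduction described above can be simulated on the CPU. This is a sketch, not the WGSL shader from this PR: the function names are hypothetical, `squaredNorm` is assumed to mean the sum of squared deviations from the mean, and the "threads" are simulated with strided slices.

```python
WORKGROUP_SIZE = 64  # the workgroupSize mentioned above


def workgroup_mean_and_squared_norm(values, workgroup_size=WORKGROUP_SIZE):
    """Simulate one workgroup reducing the H * W elements of a single (n, c) slice."""
    n = len(values)
    # Phase 1: each simulated thread t sums elements t, t + size, t + 2 * size, ...
    partial_sums = [sum(values[t::workgroup_size]) for t in range(workgroup_size)]
    # Phase 2: combine the per-thread partials (done in workgroup memory on the GPU).
    mean = sum(partial_sums) / n
    # Same two phases for the sum of squared deviations.
    partial_squares = [
        sum((v - mean) ** 2 for v in values[t::workgroup_size])
        for t in range(workgroup_size)
    ]
    return mean, sum(partial_squares)


def instance_norm_slice(values, scale, bias, epsilon=1e-5):
    """Normalize one slice with the reduced statistics."""
    mean, squared_norm = workgroup_mean_and_squared_norm(values)
    inv_std = 1.0 / (squared_norm / len(values) + epsilon) ** 0.5
    return [(v - mean) * inv_std * scale + bias for v in values]


sample = [float(i % 7) for i in range(256)]  # made-up input data
normalized = instance_norm_slice(sample, scale=1.0, bias=0.0)
```

Each simulated thread touches only `H * W / workgroupSize` elements per pass, which is where the per-thread saving comes from.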
This PR currently applies the new algorithm only to the NCHW format. For
the NHWC format, one option is to transpose the input so that it can use the
new algorithm, at the cost of two extra transposes.
@dakenf also proposed another way to optimize NHWC. For details, see
[here](d45a96616d/js/web/lib/wasm/jsep/webgpu/ops/instance-norm.ts).
I checked @dakenf's method. Its performance is similar to transpose +
optimized NCHW; which of the two is slightly faster depends on the GPU.
So I prefer that this PR covers only the NCHW part.
@dakenf can submit his NHWC optimization separately.
### Description
As title.
iOS uses a different syntax for specifying language and region codes:
https://developer.apple.com/documentation/xcode/choosing-localization-regions-and-scripts
The current `default_locale` does not work for iOS.
### Motivation and Context
Issue:
https://github.com/microsoft/onnxruntime/issues/17017
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
The new cpuinfo library doesn't use clog on Android. Newer XNNPack
versions have removed the dependency on clog, but the one we use still
has it. So I cherry-picked the XNNPack change into our patch file.
### Description
Some use cases need to create a boolean tensor.
I've tested this on [this
project](https://github.com/hans00/react-native-transformers-example)
### Motivation and Context
Add handling for `ONNX_TENSOR_ELEMENT_DATA_TYPE_BOOL`.
This requires #15556 (which does not seem to be included in the latest
release, v1.15.1).
### Description
Make status.h independent from gsl.
### Motivation and Context
For the upcoming external EP API feature (see the prototype
https://github.com/microsoft/onnxruntime/pull/16718), we need to expose
stream in the public header. However, stream depends on status.h,
which depends on gsl. We are looking for a way to decouple stream from
gsl.
According to Changming's offline comment, prefast is disabled, so none of
the GSL_SUPPRESS annotations currently take effect. He will handle the
warnings when prefast is enabled in the future.
### Description
PR 15470 updated some C/C++ dependencies. The change caused ROCM EP's
nightly build to fail. See issue
https://github.com/ROCm-Developer-Tools/HIP/issues/2082 for
background. The root cause is that the HIP compiler has a special
requirement: HIP's include dirs must be searched before the operating
system's include folder, /usr/include. HIP adds "-isystem" in front of
"/usr/include". gcc or clang searches the folders added with "-I"
first, then the "-isystem" folders. This works fine as long as we do not
add "-I/usr/include" to the compile commands for *.cu files. It goes wrong if
we have already installed an open source library to /usr and want to use the
prebuilt library from there instead of the one from the current build dir.
### Motivation and Context
@fdwr This is part 2 of the pybind work that was started earlier.
This adds the following features to the python IO binding
implementation:
- Use a bucketized allocator in order to reduce the number of resource
allocations
- Implement the following functions: `ortvalue_from_numpy`,
`update_inplace`, `ortvalue_from_shape_and_type` and `numpy`
- Modify the `onnxruntime_test_python_iobinding` tests to also run on
DML
---------
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
### Description
supplement of #17417
### Description
Update the prepack script to use an exact version.
The prepack script for onnxruntime-node, onnxruntime-web and
onnxruntime-react-native updates the version of the
"onnxruntime-common" dependency that they reference.
Previously, "~" (the tilde symbol) was used. This may cause NPM to choose an
older version (if the old version matches the version requirement and
was already installed, so it hits the cache). See also
https://semver.npmjs.com/. [This
build](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1134671&view=results)
is caused by this issue.
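
To illustrate why a tilde range is looser than an exact pin, here is a small sketch of the range check. It is not the prepack script or npm's resolver: the helper names are hypothetical, and it assumes plain `major.minor.patch` versions with no prerelease tags, where `~1.16.0` means `>=1.16.0 <1.17.0`.

```python
def parse_version(version):
    """Split a plain 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_tilde(version, base):
    """True if `version` falls in the range `~base` (same major.minor, patch >= base patch)."""
    v, b = parse_version(version), parse_version(base)
    return v[:2] == b[:2] and v >= b


def satisfies_exact(version, base):
    """True only for the single pinned version."""
    return parse_version(version) == parse_version(base)


candidates = ["1.15.9", "1.16.0", "1.16.3", "1.17.0"]  # made-up version list
tilde_matches = [c for c in candidates if satisfies_tilde(c, "1.16.0")]
exact_matches = [c for c in candidates if satisfies_exact(c, "1.16.0")]
```

With a tilde range more than one version satisfies the requirement, so which one gets installed can depend on what is already cached; an exact version leaves a single candidate.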
### Description
### Motivation and Context
This reverts commit bb136f86c8 and then
re-implements it in a different way.
I reverted the original change, then added a version constraint to the
find_package args.
If the build still picks up the wrong gtest version after this change,
you can disable `find_package` by setting
'FETCHCONTENT_TRY_FIND_PACKAGE_MODE' to NEVER. For example, the latest
gtest version is 1.14.0. Suppose Google later releases a new
version of gtest that is incompatible with the ONNX Runtime
source code you have today, your dev environment already has the
new version installed, and you do not want to create a new clean build
environment without the package. In that case, you can add
`--cmake_extra_defines FETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER` to your
build command to solve the problem.
### Description
The name of the nightly ACPT image has been updated to
`ptebic.azurecr.io/internal/aifx/acpt/nightly-ubuntu-cuda-torch-dev`.
The previous image alias contained `cu118`, `torch210dev` or `py38`, so any
version update would break the training nightly pipeline.
### Motivation and Context
Use a constant image alias to avoid pipeline failures.
### Description
Delete all Prefast tasks because the new VS 17.7 version crashes every
time we run the task on our CI build servers. However, we cannot
reproduce it locally. This problem blocks us from installing security
patches on our CI build machines.
We will use [CodeQL](https://codeql.github.com/) instead.
### Motivation and Context
Address some security alerts.