### Description
Some code was accidentally moved into the
`if (!params.is_cross_attention)` block; it must stay outside to work in
both cases.
### Motivation and Context
This caused invalid results. We detected it as a performance bug: the
EOS early exit never happened, so runs always took max_length iterations
to complete, which was slow.
### Description
Fix a mistake in the beam scorer processing: the atomicAdd result should
be compared with '1' rather than '0', since atomicAdd returns the
original value, not the updated value. This error only caused slow
performance; nothing failed.
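For reference, CUDA's `atomicAdd` returns the value the counter held *before* the addition; a tiny Python mimic of those semantics (illustrative only, not the ORT kernel code):

```python
class Counter:
    """Mimics a CUDA atomic counter for illustration purposes."""
    def __init__(self, value=0):
        self.value = value

    def atomic_add(self, delta):
        # Like CUDA's atomicAdd, return the value *before* the addition.
        old = self.value
        self.value += delta
        return old

c = Counter()
first = c.atomic_add(1)   # returns 0, the original value (counter is now 1)
second = c.atomic_add(1)  # returns 1 (counter is now 2)
```

So code that needs to detect a particular increment must compare the returned value against the pre-increment state, not the post-increment one.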
### Motivation and Context
Fixes #16642
DORT only selects devices from input arguments (type: torch.Tensor).
However, it errors out when a graph doesn't have any inputs (e.g., a
single aten::full graph). This PR addresses the problem by changing the
EP selection to:
1. First, inspect graph inputs. If there are valid devices, use them
plus a default one (`OrtBackend.ep: str`).
2. Otherwise, inspect graph outputs carried by `torch.fx.GraphModule`
and use all valid devices plus the default `OrtBackend.ep`.
3. When both (1) and (2) fail, use the default EP specified by
`OrtBackend.ep`.
- Fix link errors by including the needed onnxruntime-extensions libraries in the static framework.
- Add Objective-C API to register custom ops from embedded onnxruntime-extensions.
Caveat: Not all onnxruntime-extensions build options are working yet. E.g., building with the onnxruntime-extensions OpenCV dependency does not work.
### Description
The SequenceMap function-op has a graph-attribute. ORT's
constant-folding optimization may identify constant-expressions inside
the subgraph and promote them to constants, stored as initializers in
the main graph. When it does this, the optimization updates the subgraph
to remove the corresponding nodes.
When we expand a SequenceMap node by inlining its function-expansion, we
need to use this updated subgraph. However, the existing code uses the
original graph-attribute (GraphProto) instead of regenerating it from
the modified subgraph. This produces a graph with duplicate definitions
for the constant-folded variable, which causes an error during
graph-resolve.
This PR fixes this issue (just a single line fix), and adds a test-case
to cover this scenario.
---------
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
### Description
- Fix the check for Softmax with an axis attribute not equal to -1. The
QNN EP only supports axis values equal to -1 (or rank - 1).
- Raise an explicit error when Reduce* ops have an input with rank > 4
on the HTP backend (unsupported).
- Correctly filter out partitions that only contain a single
QuantizeLinear or DequantizeLinear node.
- Add tests for the above and clean up unnecessary usage of test
description labels.
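The axis rule behind the first fix can be expressed compactly (illustrative Python helper, not the QNN EP source):

```python
def qnn_supports_softmax_axis(axis: int, rank: int) -> bool:
    """QNN only supports Softmax over the last dimension, i.e. an axis
    of -1 or, equivalently, rank - 1 (illustrative helper)."""
    normalized = axis if axis >= 0 else axis + rank
    return normalized == rank - 1
```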
### Motivation and Context
Make it easier to debug why a model may not be supported.
### Description
Introduce `Float16/BFloat16` support for C# and C++ APIs.
Users should be able to convert between `float` and
`Float16`/`BFloat16`, compare values, and test for `NaN`, `Infinity`,
and whether a number is denormalized.
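The kinds of checks the new API enables can be mimicked with standard IEEE 754 binary16 bit patterns; a Python sketch of the semantics (not the C#/C++ API itself):

```python
import struct

def to_float16_bits(f: float) -> int:
    # Round-trip a float through IEEE 754 binary16 ('e' struct format).
    return struct.unpack("<H", struct.pack("<e", f))[0]

def float16_is_nan(bits: int) -> bool:
    # NaN: all exponent bits set (0x7C00) and a nonzero mantissa.
    return (bits & 0x7C00) == 0x7C00 and (bits & 0x03FF) != 0

def float16_is_denormal(bits: int) -> bool:
    # Denormalized: zero exponent with a nonzero mantissa.
    return (bits & 0x7C00) == 0 and (bits & 0x03FF) != 0
```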
### Motivation and Context
User filed issues such as:
https://github.com/microsoft/onnxruntime/issues/14303
(1) Fix a bug in https://github.com/microsoft/onnxruntime/pull/16560:
the fp16 flag shall be set for UNet.
(2) Remove wget in requirements since it is no longer needed.
(3) Add benchmark numbers on A100-PCIE-80GB. Note that the CUDA EP has
an issue running batch size 4, so that number is not added.
### Description
- Add hipBLASLt tuning logic in place of the default hipBLASLt
implementation;
- add a kernel explorer for hipBLASLt.
Related operators: Gemm, StridedBatchedGemm, and GemmFastGelu.
Temporarily mark algos that require extra workspace as unsupported.
Workspace support will be added in a later PR, which will change the
Gemm Params definition and affect multiple files.
### Description
Delete second reference to onnxruntime_api_tests_without_env in the code
coverage commands. One was removed in #16373 and the duplicate wasn't
noticed.
### Motivation and Context
Fix pipeline.
### Description
Fix file size trimming for the wasm-only `.min.js` bundles:
the minimal builds `ort.wasm.min.js` and `ort.wasm-core.min.js` should
exclude JSEP-related source code.
### Description
1. Replacing AMX intrinsics with machine code macros in QGEMM kernel.
2. Removing AMX build flags for GCC in cmake file.
3. Fixing the link time optimization (LTO) issue introduced with asm
.include of an assembly file.
I have moved the AMX instruction macro definitions from
QgemmU8S8KernelAmxCommon.S to the amx_common.h to fix the LTO issue.
Note that I am also pushing the macros defined in
QgemmU8S8KernelAmxCommon.S for future reference.
A special thanks to @laxmansole who helped in the development of the
instruction macro definitions for AMX intrinsics and fixing the LTO
issue.
### Motivation and Context
The additional AMX flag in cmake adds an extra layer of dependency on
the GCC version to use the feature. These changes should allow using the
AMX feature with just the CPU ID check.
### Description
- Update the existing rocBLAS get_solutions API to use
`*_get_solutions_by_type` (supported since ROCm 5.6); remove the
original nested TunableOp logic.
- Update kernel_explorer.
This PR mainly optimizes the ROCm CI tests to reduce time and CPU utilization.
- use smaller batch size on strided_batched_gemm/batched_gemm test
- disable cpu training test
- fix occasional test_e2e_padding_elimination failures on ROCm.
Allow `GemmSoftmaxGemmPermuteGenericPipeline<T>` to be used in some
cross-attention cases, opting for rocBLAS instead of CK when rocBLAS is
better for the small problem. The improvement is ~20% e2e time reduction
on some test cases for Whisper large.
**Note:** This is because CK has a performance issue when the sequence
length is merely 1; this should be improved in the future.
### Description
Use pipeline cache instead of reading data from the image.
### Motivation and Context
1. Reduce the browser dependency on the custom image.
2. The ONNX node test data is less than 30 MB, so the cache download
time is very short.
### Description
As title.
Validation at the JS call level in the E2E app is not included; it can
be covered in a separate PR.
### Motivation and Context
Test coverage.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
This PR adds support for rotary embeddings in decoder masked
self-attention
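For context, a rotary embedding rotates each consecutive pair of channels by a position-dependent angle. A minimal sketch in plain Python (not the ORT kernel, which fuses this into the attention computation):

```python
import math

def rotary_embed(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of vector x by angles that
    depend on the token position (illustrative sketch)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out
```

Since each pair is purely rotated, vector norms are preserved, and position 0 is the identity.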
### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Use autograd_inlining for model export
In some versions of PyTorch, there is an issue related to custom
autograd.Function inlining, even though we register a custom export
function for the autograd.Function (e.g. when the custom autograd
function is enabled).
As an option, the PyTorch exporter added a new flag that can disable the
inlining during export: https://github.com/pytorch/pytorch/pull/104067
Currently the PyTorch change is only in the nightly build, so this PR
dynamically checks torch.onnx.export's signature and uses
`autograd_inlining` when it exists.
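The signature check can be done with the standard library's `inspect` module; a sketch using stand-in functions rather than the real `torch.onnx.export`:

```python
import inspect

def supports_autograd_inlining(export_fn) -> bool:
    """True if export_fn accepts an `autograd_inlining` keyword."""
    return "autograd_inlining" in inspect.signature(export_fn).parameters

# Hypothetical stand-ins for the old and new torch.onnx.export signatures:
def old_export(model, args, f): ...
def new_export(model, args, f, autograd_inlining=True): ...
```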
### Motivation and Context
### Description
Allow creating (u)int64 tensors from either a number array or a bigint
array.
before:
```js
// TypeScript thinks this is fine, but it actually does not work
// runtime error: Uncaught TypeError: Cannot convert 1 to a BigInt
const myTensor1 = new Tensor('int64', [1, 2, 3, 4], [2, 2]);
// works at runtime, but TypeScript thinks myTensor2 is a string tensor
const myTensor2 = new Tensor('int64', [1n, 2n, 3n, 4n], [2, 2]);
```
after:
```js
// both work at runtime and TypeScript populates the correct types
const myTensor1 = new Tensor('int64', [1, 2, 3, 4], [2, 2]);
const myTensor2 = new Tensor('int64', [1n, 2n, 3n, 4n], [2, 2]);
```
### Description
The [ONNX
standard](https://github.com/onnx/onnx/blob/main/docs/Operators.md#type-constraints-181)
permits the `Unique` operator to have a `double` input tensor element
type; however, this was not supported in onnxruntime. This PR enables
this kernel.
### Motivation and Context
The lack of support for `float64` forces users currently to cast to
`float32` instead. This loss of precision can be severely problematic in
feature engineering pipelines downstream of the `Unique` operator. It
would be good to prevent this by updating ORT to reflect the standard
and support `double` input tensors.
---------
Signed-off-by: Aditya Goel <agoel4512@gmail.com>
### Description
Add ort_value.h to session_options.h so OrtValue is defined.
Update a unit test binary to add required include paths. Adding
ort_value.h pulls in more data type headers.
### Motivation and Context
#16193
### Description
This change reduces the number of calls to globby functions, which
accelerates initialization for 'npm test' with suite0/1 tests from
~14 sec to <2 sec.
Fix Orttraining Linux Lazy Tensor CI
Orttraining Linux Lazy Tensor CI is broken.
The error message is:
`AttributeError: 'OnnxRegistry' object has no attribute 'register'`
### Description
Added Expand operator support.
### Motivation and Context
### Description
Simply add double quotes to handle spaces in the path.
### Motivation and Context
If there are spaces in the path, the bat file cannot run and an error
occurs. Adding a simple pair of double quotes fixes this.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Log ORTModule initialization overhead
When profiling a model, for example
```
torchrun --nproc_per_node=1 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path microsoft/deberta-v3-large --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss
{'train_runtime': 303.8711, 'train_samples_per_second': 0.658, 'train_steps_per_second': 0.658, 'train_loss': 6.569518616199494, 'epoch': 0.09}
100%|200/200 [05:03<00:00, 1.52s/it]
***** train metrics *****
epoch = 0.09
train_loss = 6.5695
train_runtime = 0:05:03.87
train_samples = 2223
train_samples_per_second = 0.658
train_steps_per_second = 0.658
```
The end-to-end time is 303s (train_runtime = 0:05:03.87), but the
ORTModule first-step initialization (including export, graph build, etc.)
takes about 255s. So when we compare the end-to-end time of a baseline
ORT with an improved version of ORT, there are no apparent perf gains,
since an x% gain over the (303 - 255)s of real compute is diluted
across the overall 303s. This is misleading!
So this PR outputs the ORTModule initialization overhead, so we can
manually compute the real compute time and get the perf gains.
If the log level is >= WARNING, only the total end-to-end time + export
time is logged; otherwise, a more detailed breakdown is logged.
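To make the dilution concrete, here is the back-of-the-envelope arithmetic with the numbers from the run above (the 20% compute-side speedup is a hypothetical figure):

```python
total = 303.0            # end-to-end train_runtime in seconds
init = 255.0             # ORTModule first-step initialization overhead
compute = total - init   # ~48 s of actual training compute

# A hypothetical 20% speedup on the compute portion alone:
improved_total = init + compute * 0.8
end_to_end_gain = 1 - improved_total / total  # ~3.2%, not 20%
```

Without subtracting the initialization overhead, a real 20% kernel-level win looks like a ~3% end-to-end change.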


### Description
Add support for the InstanceNormalization and GroupNormalization ops via MeanVarianceNormalization.
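The decomposition is the standard one: instance normalization is mean-variance normalization over each channel's elements, followed by the per-channel scale and bias. A sketch in plain Python (illustrative, not the WebNN EP code):

```python
import math

def mean_variance_normalize(xs, epsilon=1e-5):
    """Normalize a flat list of values to zero mean and unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return [(x - mean) / math.sqrt(var + epsilon) for x in xs]

def instance_normalize(channel, scale=1.0, bias=0.0, epsilon=1e-5):
    # InstanceNormalization on one channel = MVN plus an affine transform.
    return [scale * y + bias for y in mean_variance_normalize(channel, epsilon)]
```

GroupNormalization follows the same pattern, with the normalization taken over each group of channels instead of a single channel.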
### Motivation and Context
Enable more models, like the Olive'ified SD UNet, to run on the WebNN EP.