### Description
In PR #19073 I mistunderstood the value of "--parallel". Instead of
testing if args.parallel is None or not , I should test the returned
value of number_of_parallel_jobs function.
If build.py was invoked without --parallel, then args.parallel equals to
1. Because it is the default value. Then we should not add "/MP".
However, the current code adds it. Because if `args.paralllel` is
evaluated to `if 1` , which is True.
If build.py was invoked with --parallel with additional numbers, then
args.parallel equals to 0. Because it is unspecified. Then we should add
"/MP". However, the current code does not add it. Because `if
args.paralllel` is evaluated to `if 0` , which is False.
This also adds a new build flag: use_binskim_compliant_compile_flags, which is intended to be only used in ONNX Runtime team's build pipelines for compliance reasons.
### Motivation and Context
### Description
1. Add visual parity test based on openai clip model
2. Add trigger rules
### Motivation and Context
1. check generated image is expected
2. reduce unnecessary triggers
### Description
Adds the ability to specify general session configuration entries via
the `-C` command-line option.
Example: `-C "session.disable_cpu_ep_fallback|1 ep.context_enable|1"`
Some session config entries can already be set via dedicated
command-line options. If the user uses multiple command-line options to
set the same session config entry, we'll print a warning. Note that the
dedicated command-line options will take precedence.
### Motivation and Context
Allows setting session configurations when testing EPs. QNN EP, for
example, uses the `session.disable_cpu_ep_fallback` and `ep.context_*`
options.
### Description
It is an important feature to pass user cuda stream to avoid
synchronization in python API. Here we allow user to pass cuda stream
for CUDA provider. Note that TRT or ROCm provider need similar change,
which are not included in this pull request.
Note that we will set `has_user_compute_stream` automatically based on
whether there is cuda stream passed, so setting
`has_user_compute_stream` through python API has no effect.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/19094
### Description
<!-- Describe your changes. -->
Make EP's member function, GenerateMetaDefId, a standalone function
which decouples from EP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change is for ExecutionProvider API refactoring, we will make a
clean ExecutionProvider API first for later EPv2 work
### Description
Fix two issues:
(1) We can only use single quote inside `bash -c "..."`. Current
pipeline job stopped at `python3 demo_txt2img.py astronaut` and skip the
following commands. In this change, we remove the remaining commands to
get same effect (otherwise, the pipeline runtime might be 2 hours
instead of 15 minutes).
(2) Fix a typo of Stable.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for
onnxruntime CUDA build), there is no need to have a flag to disable
cutlass.
Changes:
(1) Reverted https://github.com/microsoft/onnxruntime/pull/18761
(2) remove the condition to build cutlass.
(3) Fix a few build errors or warnings during testing CUDA 11.4 build.
Note that SM 89 and 90 (including fp8) requires CUDA 11.8 or later.
Flash attention and cutlass fused multihead attention will not be built
for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if
you want to support latest GPUs.
It is better to include it in 1.17.0 (otherwise, the release branch
might encounter build failure with CUDA 11.4).
Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**
Example build command used in Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc
sh build.sh --config Release --build_shared_lib --parallel --use_cuda --cuda_version 11.4 \
--cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
--disable_types float8
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
TRT EP's GetTensorRTCustomOpDomainList() will create vector of
OrtCustomOpDomain objects and release the ownership of those objects.
But, thoses objects are not released forever.
In session level, we need to make TRT EP remember what OrtCustomOpDomain
objects it created and release them at EP destruction time.
### Description
```math
\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=
\left\{
\begin{array}{cc}
-\frac{1-e^{-2\cdot(-x)}}{1+e^{-2\cdot(-x)}}, & x<0 \\
0, & x=0 \\
\frac{1-e^{-2x}}{1+e^{-2x}}, & x>0
\end{array}
\right.
```
### Motivation and Context
On some platforms,
$$\tanh(1000)=\frac{e^{1000}-e^{-1000}}{e^{1000}+e^{-1000}}$$ would
produce NaN instead of 0.999... or 1 (imagine $e^{1000}=\infty$ and
$\frac{\infty}{\infty}$ explodes).
Split GroupNorm implementation into multiple files, to make ROCm EP can
reuse cuda code.
Related PR: https://github.com/microsoft/onnxruntime/pull/19158
---------
Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Implement Pad-18 for Cuda.
### Motivation and Context
Latest models converted by Dynamo fall back on CPU for Pad with
performance degradation.
This contributes to
https://github.com/microsoft/onnx-rewriter/issues/126
### Description
This pull request introduces the necessary changes to enable RISC-V
64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V
architecture has gained popularity as an open standard instruction set
architecture, and this contribution aims to extend ONNX Runtime's
compatibility to include RISC-V, thereby broadening the reach of ONNX
models to a wider range of devices.
### Motivation and Context
RISC-V is a free and open-source instruction set architecture (ISA)
based on established RISC principles. It is provided under open licenses
without fees. Due to its extensibility and freedom in both software and
hardware, RISC-V is poised for widespread adoption in the future,
especially in applications related to AI, parallel computing, and data
centers.
### Example Build Command
```
./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests
```
### Documentation Updates
Relevant sections of the documentation will be updated to reflect the
newly supported RISC-V 64-bit cross-compilation feature.
https://github.com/microsoft/onnxruntime/pull/19239
---------
Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
WebNN only supports 4-D inputs for conv2d and convTranspose2d, this PR
supports 3-D inputs (i.e. conv1d) by prepending a 1 size dimension and
several reshape operations.
### Description
<!-- Describe your changes. -->
remove old python files
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We have a new op MatMulNBits and this one is deprecated.
When we enable webgpu profiling mode between session.create and
session.run, current implementation has a problem to create querySet
(and also queryResolveBuffer) if we share the commandEncoder with inputs
upload. This PR fixes this by moving the querySet creation to the place
we set queryType.
### Description
Update abseil to a release tag and register neural_speed to CG.
### Motivation and Context
Now we are using a non-relesed version of abseil. Using a tag is better.
### Description
Added Uniforms to SkipLayerNorm
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Fix amd test failure
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
These changes add rotary embedding and packed qkv input to gqa. As of
now, the changes are only supported with Flash-Attention (SM >= 80) but
should soon be supported with Memory Efficient Attention as well.
### Motivation and Context
With the fusion of rotary embedding into this Attention op, we hope to
observe some perf gain. The packed QKV should also provide some perf
gain in the context of certain models, like Llama2, that would benefit
from running ops on the fused QKV matrix, rather than the separate Q, K,
and V.
---------
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
The decomposition pass (e.g., converting torch.add to aten.add) in DORT
no longer exists. Therefore, we have to use `use_aot_autograd=True` to
enable Dynamo's built-in operator decomposition. I think we need to add
the decomposition pass back to DORT or remove `use_aot_autograd` (remove
because it will always be `true`).
### Description
<!-- Describe your changes. -->
Add `temperature` as an input to WhisperBeamSearch op and initialize
correctly in parameter setup.
### Motivation and Context
Currently, temperature is included as an attribute to the BeamSearch op,
which doesn't let the model act dynamically in a single inference
session. By including this variable as an input, the temperature value
can be altered in any inference call (important for 1P teams)
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
### Description
1. Update Linux GPU machine from T4 to A10, sm=8.6
2. update the tolerance
### Motivation and Context
1. Free more T4 and test with higher compute capability.
2. ORT enables TF32 in GEMM for A10/100. TF32 will cause precsion loss
and fail this test
```
2024-01-19T13:27:18.8302842Z [ RUN ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-19T13:27:25.8438153Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:25.8438641Z Expected equality of these values:
2024-01-19T13:27:25.8438841Z COMPARE_RESULT::SUCCESS
2024-01-19T13:27:25.8439276Z Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:25.8439464Z ret.first
2024-01-19T13:27:25.8445514Z Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:25.8445962Z expected 0.145984 (3e157cc1), got 0.975133 (3f79a24b), diff: 0.829149, tol=0.0114598 idx=375. 20 of 388 differ
2024-01-19T13:27:25.8446198Z
2024-01-19T13:27:25.8555736Z [ FAILED ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12, where GetParam() = "cuda_../models/zoo/opset12/SSD/ssd-12.onnx" (7025 ms)
2024-01-19T13:27:25.8556077Z [ RUN ] ModelTests/ModelTest.Run/cuda__models_zoo_opset12_YOLOv312_yolov312
2024-01-19T13:27:29.3174318Z /onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:347: Failure
2024-01-19T13:27:29.3175144Z Expected equality of these values:
2024-01-19T13:27:29.3175389Z COMPARE_RESULT::SUCCESS
2024-01-19T13:27:29.3175812Z Which is: 4-byte object <00-00 00-00>
2024-01-19T13:27:29.3176080Z ret.first
2024-01-19T13:27:29.3176322Z Which is: 4-byte object <01-00 00-00>
2024-01-19T13:27:29.3178431Z expected 4.34958 (408b2fb8), got 4.51324 (40906c80), diff: 0.16367, tol=0.0534958 idx=9929. 22 of 42588 differ
```
3. some other test like SSD throw other exception, so skip them
'''
2024-01-22T09:07:40.8446910Z [ RUN ]
ModelTests/ModelTest.Run/cuda__models_zoo_opset12_SSD_ssd12
2024-01-22T09:07:51.5587571Z
/onnxruntime_src/onnxruntime/test/providers/cpu/model_tests.cc:358:
Failure
2024-01-22T09:07:51.5588512Z Expected equality of these values:
2024-01-22T09:07:51.5588870Z COMPARE_RESULT::SUCCESS
2024-01-22T09:07:51.5589467Z Which is: 4-byte object <00-00 00-00>
2024-01-22T09:07:51.5589953Z ret.first
2024-01-22T09:07:51.5590462Z Which is: 4-byte object <01-00 00-00>
2024-01-22T09:07:51.5590841Z expected 1, got 63
'''
### Description
Update unet fusion for [stable diffusion webui
extension](https://github.com/tianleiwu/Stable-Diffusion-WebUI-OnnxRuntime):
(1) Update fusion pattern to support fp16 unet model.
(2) Add progress bar
(3) Use a cached map to speed up dtype or shape lookup in shape
inference result.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Adds a job to create a nightly python package for ORT/QNN on Windows
ARM64.
Must build onnxruntime-qnn with python 3.11 and numpy 1.25.
**Note: pipeline run may take up to 3 hrs**
### Motivation and Context
Make it possible to get a nightly python package with the latest updates
to QNN EP.
Issue #19161
### Description
<!-- Describe your changes. -->
`env.webgpu.profiling` is a global flag. It may change before each
session.run. So the best place is to update it in `onRunStart` event.
After this, we can directly check `this.queryType`'s value. Without this
pr, we need to make sure that `getCommandEncoder()` is called before
checking `this.queryType`. Otherwise, it may happen that
`pendingKernels`'s length is not equal to `pendingDispatchNumber`'s
length. See the two ugly workarounds
[1)](e630dbf528 (diff-006fc84d3997f96a29b8033bd2075d6a0a9509211bd5812a6b934fc74fedfd9dR267-R268))
and
[2)](e630dbf528 (diff-618fe297fbe7a1da586380163b8fd2627311ccc217640a3c5cdc9c17a33472c1R73-R80))
if we don't introduce `onRunStart`. Or we need to call `setQueryType` in
each kernel run.
### Description
This op is required in mobilenetv3-small-100. With this PR,
mobilenetv3-small-100 model becomes less than 10 ms from over 100 ms on
ADL.
### Description
<!-- Describe your changes. -->
Register DML operators for opset 19.
- Cast19
- Castlike19
- Constant19
- Equal19
- Identity19
- QuantizeLinear19
- DequantizeLinear19
- Reshape19
- Shape19
- Size
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: linnealovespie <linneamay@microsoft.com>
### Description
This PR adds SbgemmKernel for aarch64. This includes Sbegmm kernel to
implement matrix multiplication with bfloat16 SIMD instructions (bfmmla)
and MatMul operator changes to invoke the Sbgemm kernel. To enable
Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"
The PR also adds new test cases for mlas and ort.
### Motivation and Context
This is to improve MatMul performance on aarch64 platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed 1.2x
-1.76x performance improvement compared to sgemm (fp32) kernel
performance.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
And the unit test precision results are matching to sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync `
### Description
- Adds the following session options to configure the device:
- `soc_model`: The SoC model number. Refer to the QNN SDK documentation
for valid values. Defaults to "0" (unknown).
- `htp_arch`: The minimum HTP architecture the driver will use to select
compatible QNN operators.
- `device_id`: The ID of the device to use when setting 'htp_arch'.
Defaults to "0" (for single device).
### Motivation and Context
Allow more configuration.
### Description
<!-- Describe your changes. -->
1. support causal mask in MHA cpu
2. support custom rotary_dim in rotary_emb
3. add bf16 for rotary_emb
4. fix a bug in attention rotary
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->