### Description
<!-- Describe your changes. -->
Fix the input shape of FastGelu
Minor improvement for the documentation of kernel explorer
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
1. Move DML packaging pipelines to aiinfra-dml-winbuild machine pool
2. Delete
tools/ci_build/github/azure-pipelines/templates/windowsai-nuget-build.yml
because the pipeline has been migrated to Onebranch. I monitored it for
months, it worked well.
### Description
<!-- Describe your changes. -->
Fix the input shape of FastGelu
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for
forward and 5 Ops for backward. The PR is to fuse this to a single Op
named QuickGelu and its gradient QuickGeluGrad.
For CUDA, tested in V100 using input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

After, FW takes 115us, BW takes 139us, which is much faster.

For CPU kernel, using same shape and float type:
Before, FW takes 10us, BW takes 49us
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]
After, FW takes 4us, BW takes 5us, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
### Description
support building xnnpack for IOS
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Round 2 of fixing C4244
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Fix document generation CI. It's not currently updating the docs as
we're skipping the tests, which is the invocation of build.py that would
have generated the documentation.
Setup specific task to generate documentation for greater clarity.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Operator kernel documentation is not getting updated and is now out of
date.
### Description
Enables all datatypes supported for DML for `Abs` and `Sign`.
### Motivation and Context
`Abs` and `Sign` haven't been updated since DML started to support all
datatypes for them. These ops are used in some transformer models and
were forcing unnecessary copies between the CPU and the GPU.
### Description
First round of fixes for C4244 error.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
DML EP was a special EP w.r.t. capability fusion. It used to fuse a
capability outside the IExecutionProvider::Compile() call. But after
recent re-architecture #13131, it is no longer a special case.
### Motivation and Context
Why is this change required? What problem does it solve?
To make DML EP consistent with the ORT design.
- If it fixes an open issue, please link to the issue here. N/A
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
### Description
Fix logging for affinity failures on Linux.
Make `GetCpuCores()` consistently return the number of physical cores.
Use `CpuInfo` library to correctly set affinities for Linux where
supported.
Make windows generate affinity masks as ordinals and convert them to
masks at the setting site.
Allow setting multiple logical processors affinity masks per thread.
We continue to set all logical processors as thread affinity per
physical core.
### Motivation and Context
Error logging on Linux uses `pthread_self()` which does not return
Thread ID.
Fix default affinity mask generation on Windows. The following are the
issues with Windows:
- `GetThreadAffinityMasks()` returns bitmasks, but on other platforms it
returns ordinals generated for the hardware concurrency
- The maximum number of processors supported for requires a mask of
64-bits, but `size_t` type used is not always 64-bit
- The masks returned per physical core may have multiple bits set,
because the mask applies to several logical cores hosted by the physical
core. In the past, customers complained that their threads jump from one
core to another which adversely affects performance. The decision was
made to stay this way.
- 64-bit masks do not allow for logical processors with IDs that are
outside of 0-63 range.
### Description
Updates naming scheme for docker images built by the EP Perf pipeline.
Specifically, the docker image name is no longer based on the branch
name.
### Motivation and Context
The docker image name used by EP Perf pipeline is built from the branch
name. This makes the pipeline fail for branches with uppercase letters
because docker image names can only contain lower-case letters.
### Description
Implements a Python script for quick analysis of a generated JSON
profile from ORT.
### Motivation and Context
This PR implements a script that lists kernels that take up the most
time in a JSON profile, from both the CPU and GPU points-of-view. The
script also supports various options for CSV output, grouping of kernels
wrt shape of input tensors and wrt kernel dimensions.
Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
### Description
<!-- Describe your changes. -->
The subgraph below meet the SkipLayerNorm fusion pattern, but the fusion
rules also required every input dimension has a certain value. So the
subgraph below cannot fused to SkipLayerNorm.
subgraph we want to fuse

fusion pattern 3
[Sub1] [Sub2]
\ /
\ /
\ /
Add1
|
LayerNormalization
This change allow inputs of FirstAdd operator has dimension which only
has dim_param.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
This adds the "NHCW" format support for einsum MatMul. The logic is
basically a merge of the existing Transpose and MatMul Einsum
implementations.
### Motivation and Context
Some transformer models that I'm tracking use Einsum quite often during
a single inference, and about half of those were "NHCW" MatMul Einsums.
Supporting them will reduce the number of copies to the CPU.
### Description
Register all datatypes for DML's `Where` operator since DML now supports
everything.
### Motivation and Context
Some transformer models use the `Where` operator on int64 data, but
since DML wasn't supporting it, it needed to fall back to the CPU.
**Description**: Describe your changes.
fuse MatMul + FastGelu -> GemmFastGelu
prepare for AMD optimized fused operator GemmFastGelu
usage:
python benchmark.py -g -m bert-base-cased --sequence_length 384
--batch_sizes 128 --provider=rocm -p fp16 --disable_embed_layer_norm
--enable_gemm_fast_gelu
**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.
### Description
enabling on device training apis in the packaging pipelines.
### Motivation and Context
adding on device training flag so we can enable the on-device training
apis for Federated learning scenarios
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Allow separated Q, K and V inputs to support cross attention:
* Q: [batch_size, sequence_length, hidden_size]
* K: [batch_size, kv_sequence_length, hidden_size]
* V: [batch_size, kv_sequence_length, v_hidden_size]
* Output: [batch_size, sequence_length, v_hidden_size]
To use separated Q/K/V inputs, the input tensor is for query, and two
optional inputs are added for key and value. Weights for input
projection is not included for now, so the MatMul of input projection
shall be done out of Attention operator, but Add bias is included for
performance consideration.
### Description
The documentation pipeline does not require an actual GPU, and running
on GPU-capable agents costs more. So to enable running on CPU-only
devices and to potentially consolidate future pipelines, and since the
tests are not actually executed on this device anyway (it just needs to
initialize the EP for the sake of operator kernel enumeration), add an
initialization flag to skip the software device check - this is only an
internal overload not exposed in the public API. See
https://github.com/microsoft/onnxruntime/pull/13308.
### Motivation and Context
- *If it fixes an open issue, please link to the issue here.* NA
Add script to evaluate accuracy of BERT/DistilBERT/Roberta models on question-answering task.
By default, pretrained model
`bert-large-uncased-whole-word-masking-finetuned-squad` will be used if
model name is not specified. If onnx path is not specified, optimum will
be used to export an ONNX model for testing.
Example usage:
* Evaluate with CPU execution provider:
`python eval_squad.py`
* Evaluate with CUDA execution provider:
`python eval_squad.py --use_gpu`
* Evaluate an optimized onnx model for
'distilbert-base-cased-distilled-squad' with sequence lengths
128/192/256/384 on first 100 samples:
`python eval_squad.py -m distilbert-base-cased-distilled-squad --use_gpu
-s 128 192 256 384 --onnx_path ./optimized_fp16.onnx -t 100`
Some op will use a buffer for input and output at the same time, so it will do inplace update to it.
If we blindly tune over the `params`, there will be accumulated update to that buffer during FindFastest,
which is an undesired side effect. In this case, we use a proxy params struct for the tuning to avoid this side effect.
Add env variable to control disabling custom autogard function support.
When using ORTModule, if the torch model has torch.nn.Function, if user
confirms that it can be exported to ONNX (for example, by inline
PythonOp) and the backward implementation is matched to the forward
impl, user can export "ORTMODULE_DISABLE_CUSTOM_AUTOGRAD_SUPPORT=1" to
disable the custom autograd support so that it won't use ORT's PythonOp
to fallback to PyTorch. Exporting to ONNX sometimes can leverage some
graph optimizations in ORT so that perf is better.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Wrap SkipLayerNormoriginal implementation as a function.
Use it as part of SkipLayerNormTunableOp.
Use it in Kernel explorer to compare the gap between TunableOp and
Original implementation.
the profile output like below:
`float16 8 512 768 <class
'_kernel_explorer.SkipLayerNorm_half_Original'> 23.48 us 804.04 GB/s
float16 8 512 768 <class '_kernel_explorer.SkipLayerNorm_half_Tunable'>
20.41 us 925.00 GB/s
...`
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
1. update model name structure in model_tests.cpp with source name. To
avoid
`Condition test_param_names.count(param_name) == 0 failed. Duplicate
parameterized test name 'BERT_Squad_opset10_CPU'`
2. skip some failed models https://github.com/onnx/models/issues/568
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
DML EP kernel for com.microsoft.attention operator. It has been
implemented via DML_Graph. References for this implementation:
1. [Hugging Face Attention for
BERT](310340d0d0/src/transformers/models/bert/modeling_bert.py (L245-L284))
2. Chapter 3 of book Orielly: Natural Language Processing with
Transformers, Revised Edition
This PR also
- includes a very tiny fix for QLinearSigmoid kernel, which is storing
the temporary object into a named variable.
- enables 4 L2 transformers LayerNorm, Gelu, MatMulScale, Attention.
### Motivation and Context
- Why is this change required? What problem does it solve?
One of the main operators used in Transformer-based model. It
contributes to the overall perf of DML EP for Transformer models.
- If it fixes an open issue, please link to the issue here. N/A
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
Some models from model zoo failed in the Linux CPU workflow.
https://github.com/onnx/models/issues/562
Skip them temporarily.
###Verfication
Linux CPU CI passed with beta image
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=789772&view=results
**2022-10-21T13:31:17.6740348Z Skip symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/Inception-1-int8/inception-v1-12-int8.onnx**
2022-10-21T13:31:17.6740998Z Running symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/DenseNet-121-12-int8/densenet-12-int8.onnx
2022-10-21T13:31:17.6741618Z Running symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/MNIST-12/mnist-12.onnx
**2022-10-21T13:31:17.6742207Z Skip symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/SSD-int8/ssd-12-int8.onnx**
2022-10-21T13:31:17.6742898Z Running symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/ResNet50_fp32/resnet50-v1-12.onnx
2022-10-21T13:31:17.6743544Z Running symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/MobileNet
v2-1.0-fp32/mobilenetv2-12.onnx
2022-10-21T13:31:17.6744259Z Running symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/ResNet101_DUC_HDC-12/ResNet101-DUC-12.onnx
2022-10-21T13:31:17.6744891Z Running symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/YOLOv3-12-int8/yolov3-12-int8.onnx
2022-10-21T13:31:17.6745501Z Running symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/AlexNet/bvlcalexnet-12.onnx
2022-10-21T13:31:17.6746114Z Running symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/ZFNet-512-int8/zfnet512-12-int8.onnx
**2022-10-21T13:31:17.6746768Z Skip symbolic shape inference on :
/mnt/vss/_work/1/b/Release/../models/zoo/opset12/SSD-MobilenetV1-12-int8/ssd_mobilenet_v1_12-int8.onnx**
### Description
Bumping up version number to 1.14.0
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->