Update a few optimizations for Stable Diffusion XL:
(1) Add SkipGroupNorm fusion
(2) Remvoe GroupNorm fusion limits. Previously, we only fuse GroupNorm
when channels is one of `320, 640, 960, 1280, 1920, 2560, 128, 256, 512`
so some GroupNorm in refiner was not fused.
(3) Tune SkipLayerNormalization to use vectorized kernel for hidden size
320, 640 and 1280.
Pipeline Improvements:
(4) Enable cuda graph for unetxl.
(5) Change optimization to generate optimized fp32 model with ORT, then
convert to fp16. Otherwise, fp16 model might be invalid.
(6) Add option to enable-vae-slicing.
Bug fixes:
(a) Fix vae decode in SD demo.
(b) Fix UnipPC add_noise missing a parameter.
(c) EulerA exception in SDXL demo. Disable it for now.
(d) Batch size > 4 has error in VAE without slicing. Force to enable vae
slicing when batch size > 4.
#### Performance Test on A100-SXM4-80GB
Description about the experiment in results:
*Baseline*: removed GroupNorm fusion limits; CUDA graph is enabled in
Clip and VAE, but not in Clip2 and UNet.
*UNetCG*: Enable Cuda Graph on UNet
*SLN*: Tune SkipLayerNormalization
*SGN*: Add SkipGroupNorm fusion
The latency (ms) of generating an image of size 1024x1024 with 30 steps
base model and 9 steps of refiner model:
| Baseline | UNetCG| UNetCG+SLN | UNetCG+SLN+SGN
-- | -- | -- | -- | --
Base Clip | 3.74 | 3.70 | 3.88 | 3.81
Base Unet x30 | 2567.73 | 2510.69 | 2505.09 | 2499.99
Refiner Clip | 7.59 | 7.42 | 7.41 | 7.58
Refiner Unet x 9 | 814.43 | 803.03 | 802.20 | 799.06
Refiner VAE Decoder | 84.62 | 85.18 | 85.24 | 87.43
E2E | 3480.56 | 3412.05 | 3405.77 | 3400.23
We can see that enable cuda graph brought major gain (around 68ms). SLN
Tuning has about 7ms gain. SkipGroupNorm fusion has 5ms gain.
SkipGroupNorm fusion won't reduce latency much, while it also has
benefit of reducing memory usage, so it is recommended to enable it.
### Motivation and Context
Additional optimizations upon previous work in
https://github.com/microsoft/onnxruntime/pull/17536.
Although SimplifiedLayerNorm is faster than LayerNorm, DML doesn't have
an optimized implementation for the former yet and LayerNorm ends up
being faster.
### Description
Enable Expand Op.
There no directly mapping from Onnx Expand op to QNN. Need to use
ElementWiseMultiply to do the data broadcast. Basically create the 2nd
input with value 1.0 and use the shape data from Expand op.
### Description
TestInlinedLocalFunctionNotRemoved checks that local functions are not
removed but TVM EP optimizes the whole graph after it is inlined.
1. Now we use a released version of ONNX, so we can directly download a
prebuilt package from pypi.org. We do not need to build one from source.
2. Update protobuf python package's version to match the C/C++ version
we are using.
3. Update tensorboard python python because the current one is
incompatible with the newer protobuf version.
### Description
Add CI changes for #18287
Install onnx explicitly to pass windows GPU+dml stage.
### Motivation and Context
'eigen-3.4' was refering to a branch, not to a tag. There is now an
Eigen 3.4.1 on that branch, and thus the hash has changed.
See
https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416
### Description
<!-- Describe your changes. -->
When take a tensor's data as raw, clear data with other types within the
tensor.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve? -->
One model's graph transformation caused a node with multiple data types.
This would make the model valid.
### Description
This PR removes an internal `ORT_ENFORCE` when binding `torch.tensor`
inputs using IO binding for end-to-end scripts.
### Motivation and Context
In merged exports of PyTorch models to ONNX, each past key and past
value in the past KV cache has an input shape of `(batch_size,
num_heads, past_sequence_length, head_size)`. In the first pass through
the model to process the prompt, `past_sequence_length = 0`. Therefore,
each of these inputs is of shape `(batch_size, num_heads, 0,
head_size)`. In subsequent passes, `past_sequence_length > 0`.
When binding a `torch.tensor` of shape `(batch_size, num_heads, 0,
head_size)` with `io_binding.bind_input`, the tensor's `data_ptr()` must
be passed. For a `torch.tensor` of this shape, its `data_ptr()` returns
0. Because it returns 0, the existing `ORT_ENFORCE` is therefore false
and an error is raised. By removing the internal `ORT_ENFORCE`, no error
is raised and the model runs successfully.
LLaMA-2 Example:
Input Name | Input Size | Device | Device ID | Torch Dtype | data_ptr()
------------- | ----------- | ------- | ----------- | ------------- |
-----------
input_ids | torch.Size([1, 11]) | cuda | 7 | torch.int64 |
140639561842688
attention_mask | torch.Size([1, 11]) | cuda | 7 | torch.int64 |
140639561843200
position_ids | torch.Size([1, 11]) | cuda | 7 | torch.int64 |
140639561844224
past_key_values.0.key | torch.Size([1, 32, 0, 128]) | cuda | 7 |
torch.float32 | 0
past_key_values.0.value | torch.Size([1, 32, 0, 128]) | cuda | 7 |
torch.float32 | 0
... | ... | ... | ... | ... | ...
When the model has "shape tensor" as one of the inputs and user provides
explicit profile shapes for it, TRT EP doesn't correctly set the "shape
tensor" input.
Also, there is a bug for applying explicit profile shapes for the shape
tensor input.
Note: It seems the model has shape tensor input is a rare case. Most of
the cases, the inputs are all execution tensor.
### Description
Replace block-wise 4b quantization implementation
### Motivation and Context
In https://github.com/microsoft/onnxruntime/pull/18101 we have an
augmented block-wise 4b quantization interface and implementation. Here
we use this new implementation in onnxruntime contrib ops
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Make MlasTestFixture::mlas_tester an inline variable. With this change we no longer need to define `MlasTestFixture<T>::mlas_tester` outside of the class definition.
This PR fixes the the signed mismatch warning in
DmlRuntimeFusedGraphKernel. This warning is treated as an error on the
x86 versions of our internal builds preventing us from updating to
latest ORT.
### Description
QNN EP has 2 unit tests failing:
TEST_F(QnnHTPBackendTests, DISABLED_PadReflectMode)
TEST_F(QnnHTPBackendTests, DISABLED_Pad4dOutOfRangePadConstantValue)
For the first unit test, in QNN's master definition, it is stated that
when using MIRROR_REFLECT, the before and after pad amounts must not be
greater than shape(in[0])[i] - 1. Therefore, we need to change the pad
amount from {0,2,0,0} to {0,1,0,0}.
For second unit test, QNN does not have limitations stating that pad
constant should be smaller than input[0]. The reason that the test is
failing is because the unit test did not take the pad constant into
consideration when doing quantization.
### Motivation and Context
Fix the 2 unit tests mentioned in description.
### Description
Update the C# nuget build infrastructure to make building a test nuget
package more user friendly and to simplify
- Remove usage of dotnet and msbuild in CIs
- was temporary requirement until .net 6 MAUI was added to the released
Visual Studio
- remove SelectedTargets property and its usage
- Add property for excluding mobile targets
- generally we exclude based on the nuget package name
- can now specify `/p:IncludeMobileTargets=false` on the command line to
force exclusion
- support building test package using build.py `--build_nuget` better
- limit inclusion of xamarin targets as building with them requires a
lot more infrastructure
- use msbuild directly if xamarin targets are included. use dotnet
otherwise.
- remove quoting of property values as it doesn't appear to be necessary
and breaks when msbuild is being used
- add infrastructure to be able to pack the nuget package on linux with
`dotnet pack`
- `nuget pack` is not user friendly as-per comments in changes
- requires stub csproj to provide the nuspec path
- Remove netstandard1.0 targets from nuspec
- we removed support from the actual bindings previously
- Remove usage of nuget-staging directory when creating nuget package on
linux
- the nuspec file element has a fully qualified path for a source file
so there is no obvious benefit to copying to a staging directory prior
to packing
### Motivation and Context
Address issues with 1P users trying to create test nuget packages
locally.
Long overdue cleanup of CI complexity.
### Description
<!-- Describe your changes. -->
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well
Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
- some ops use a workspace buffer
- copied workspace allocation from XNNPACK unit test code
- some suffixes changed
Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
isn't currently usable via the public API
- we could use the internal types if we think it's required for
performance reasons. non-trivial though as we'd need to propagate ifdef
values from the XNNPACK build up to the ORT build.
- using XNNPACK internals would also mean we would not be able to
support using a pre-build XNNPACK package
- not an issue currently
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed
Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.
NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
### Description
Introducing new L1 optimizer to fuse Pad to it's child node if the child
node is Conv or MaxPool.
Pad -> Conv = Conv
Pad -> MaxPool = MaxPool
Major Conditions:
- It will only fuse for the `Constant` mode of padding.
- Conv/MaxPool should not have optional `indices` output tensor
- Padding value for non-spatial dimensions should be zero and for
spatial dimensions padding values should be positive for `pad` operator.
For other conditions please see `SatisfyCondition()` in `pad_fusion.cc`.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Pre-link with `ld -r` to apply symbol visibility when the static library
is created to replicate XCode's Single Object Pre-link.
Current builds set the visibility flags but that doesn't get applied
until the static library is linked into something else, which can be too
late. Pre-linking fixes this.
The pre-link uses the .o files from the ORT static libraries and the .a
files from external libraries. This combination limits the symbols
included from the .a files to things required by the ORT .o files.
In order to minimize changes elsewhere in the build we extract the .o
files from the ORT static libraries using `ar -x`.
Re-ordered the pieces use to build the Apple framework to make it a
little more readable.
Fixed a couple of misc issues with missing symbols from the minimal
build that show up when pre-linking is applied.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Will hopefully address #17722
### Description
The quantization tool assumes QGemm is implemented for float 8 types but
it is not yet supported. The condition partially disabling the test was
not robust enough. This is changed by this PR.
### Description
Retry 3 times at most if the web test fails.
### Motivation and Context
Web GPU tests are not stable.
From this link, we could find these ort-web tests are all in top 10
failing tasks.
https://dev.azure.com/onnxruntime/onnxruntime/_pipeline/analytics/stageawareoutcome?definitionId=161&contextType=build.
Generally, it could pass by manually rerunning it.
So, enable it to rerun automatically.
These test steps duration isn't long. So, it won't take too long to
retry.
### Description
Disable ccache for DML. This change is similar to #18104. Now the DML
build job is having the same timeout issue. I don't know why. But
disabling ccache probably would help.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Adds bfloat16 as a valid input parameter type for where node for ONNX
opset 16+.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.
---------
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Optimize 4bit Qlora training
Extent existing `MatmulBnb4bit` to its usage in training scenarios.
The PR includes following changes:
1. Add special `torch.autograd.Function` export logic for
`bitsandbytes.autograd._functions.MatMul4Bit` that is preferred before
common PythonOp exporter.
2. Add `training_mode` optional attribute for op `MatmulBnb4bit`, which
help skip some inference specific logic in implementation.
3. Add `transB` optional attribute, which is by default be 1; setting it
to be 0 is needed by backward usage.
Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly ~2.9%
throughput gains. The reason is:
`bitsandbytes.autograd._functions.MatMul4Bit` has logic
`ctx.save_for_backward`, which would need an additional copy in
PythonOp, otherwise, the tensor might be released by ORT, while backward
op still references it.
Removing the clones also reduce the peak memory consumptions because
`bitsandbytes.autograd._functions.MatMul4Bit` saved tensors that are not
needed in backward compute.
### Description
* based on design document & following InferenceSession's run
implementation, implemented TrainingSession.runTrainStep
### Motivation and Context
* Adding web bindings for training
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allowed for training package to be built + added training
backend to web package
* #17891 implementation for createTrainingSession on the TypeScript side
**[SHOULD BE MERGED IN BEFORE THIS PR]**
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
### Description
Existing sort can't handle the index within test data folders/inputs
name string correctly, when index is larger than 10
Update the logic to sort based on index fetched by regex
### Description
Support llama-70b model fusion and shardding
### Motivation and Context
This change enables shard and export llama-70b model into Onnx as this
model is too large for single GPU.
This change also fuses llama-70b model with repeat_kv pattern different
with llama-7b and llama-13b.
Implement Cutlass Memory Efficient Attention Kernel into Group Query
Attention Operator.
### Motivation and Context
Before this change, Group Query Attention Operator was supported only by
Flash-Attention. While this is the most efficient kernel for the
operation, it only supports sm >= 80. Cutlass Memory Efficient Attention
Kernel supports sm >= 53, allowing us to support a broader range of GPU
hardware.
### Description
Added FusedConv and FusedConvTranspose
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
This PR implements distributed reduciton for llama 2. This version
doesn't consider any cases requring re-sharding because we haven't seen
any use cases.
Intutive examples:
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] ->
Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and
device_mesh=[0,1]
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] ->
Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and
device_mesh=[0,1]
- [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1]
-> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and
device_mesh=[0,1]
Algorithm:
When the reduced axes are not sharded, each device can call reduction
directly. The output sharding spec will be identical to input sharding
spec. We currently throw when input and output sharding specs are
different.
Review guideline:
- Check 97b8d2f for new op's schema and how new op is registered.
- Read tests in 2450f93 to get faimilar with the behavior of these ops.
- Check the implementation details in 753d9af.
### Description
Integration to OpenVINO 2023.1
### Motivation and Context
- Alignment with latest OpenVINO Version.
- Device name change from VPUX to NPU and Remove from supported list
until official public support is available.
---------
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
### Description
There is a gap between onnx’s definition of batch normalization and
QNN’s.
According to the formula:
onnx: `(X - input_mean) / sqrt(input_var + epsilon) * scale + B`
QNN: `X * weight + bias`
We can then deduce that:
`weight = scale / sqrt(var + epsilon)`
`bias = B – (mean * scale / sqrt(var + epsilon))`
We must calculate the weight and bias, and their quantization parameters
for QNN in QNN EP.
Therefore, `scale`, `B`, `input_mean`, and `input_var` must be static
(`initializer`).
Implementation:
Firstly, dequantize `scale`, `B`, `input_mean`, and `input_var` to
floating point.
Second, calculate `weight` and `bias`, and their quantization
parameters.
Finally, quantize `weight` and `bias`, and add them into `TensorWrapper`
### Motivation and Context
Fix QnnHTPBackendTests.BatchNorm1D and QnnHTPBackendTests.BatchNorm2D
failures
### Description
Implement Split KV optimization for FlashAttention in MHA and Attention
operators.
### Motivation and Context
Can help further accelerate these ops.
### Description
This PR reduces the memory usage when exporting and benchmarking LLaMA.
### Motivation and Context
- Exporting: The PyTorch model is deleted from memory after a successful
export instead of deleting it from memory after exporting + converting
the ONNX model to the desired precision.
- Benchmarking: In the ONNX model with GroupQueryAttention, the KV cache
inputs use the same GPU memory for both the prompt and token generation
benchmarks.
### Description
<!-- Describe your changes. -->
Add the mobile CIs to the list so we check external PRs don't break
those.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Recent external PR was found to break iOS CI after checkin
The `RemoveDuplicateCastTransformer` fairly naively removed Cast nodes
from the graph without considering precision loss when using the same
`TypeGroup`. For instance, F64 -> F32 -> F64 would be optimised out of
the graph.
I also noticed that signedness was not accounted for, which is not
covered by any existing issue but is a problem. For example doing int ->
unsigned int -> int produces very different values for negative inputs
and so should not be optimised out
One could argue that we shouldn't be performing such cast elimination at
all (at least not in this transformer). The original scope might be well
restricted to only eliminating unnecessary casts from the
`InsertCastTransformer` and no others.
### Motivation and Context
This should fix https://github.com/microsoft/onnxruntime/issues/17565,
ttps://github.com/microsoft/onnxruntime/issues/9915 and
https://github.com/microsoft/onnxruntime/issues/8787.
* Add a new operator SkipGroupNorm to support skip and bias inputs.
* Update GroupNorm kernel to support number of channels used in SD XLrefiner.
* Add epsilon in kernel
* Add parity and performance test script
* Remove many limitations including max batch size, max number of groups, c % cPerBlock ==0 etc.
### Motivation and Context
Update GroupNorm to support SD XL Refiner and beyond.
### Description
Update batch file to set PATH for Cuda with TRT
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
update rocm package exclude libs.
- change librocblas.so.0 to librocblas.so.3 which is used on ROCm5.6 and
ROCm5.7
- add librocfft.so.0, libhipfft.so.0, libhiprtc.so.5 and sort the list.
### Description
This PR enables `softmax` outputs max supported components instead of
scalar for each thread.
Softmax with input[0]: [12,4096,4096] becomes 47.86 ms from 55.11 ms