This change refactors the matmul/conv-related programs to support shape
uniforms. Currently only matmul shape uniforms are fully enabled.
TODOs: add input dependencies for the conv-related programs, and turn
clipMax and clipMin into uniforms.
### Description
Rework some aspects of the transpose optimizer to ensure we have valid
QDQ node units when it is done.
Conceptually we need to let individual Transpose nodes move through the
graph when optimizing. That can invalidate existing QDQ node units or
require new ones. We can fix this after inserting new nodes, or when
transpose optimization finishes moving Transpose nodes.
Fix when inserting new node
- TransposeInputs can add an Unsqueeze (to broadcast) and Transpose to a
  node's inputs
  - if there was a DQ node providing the input, add a Q -> DQ after
    inserting the Unsqueeze/Transpose to make a QDQ node unit for the new
    node.
    - Unsqueeze/Transpose don't change data, so we can copy the
      type/scale/zero point from the existing DQ
Fixes when transpose optimization completes moving Transpose nodes
- Remove empty DQ -> Q pairs if the type/scale/zero point match
  - Pushing a Transpose through may have resulted in an existing
    Transpose/Reshape being cancelled and removed leaving an empty QDQ node
    unit
  - the Transpose being moved may have started in a QDQ node unit
- Transpose that got blocked inside existing QDQ node unit
  - e.g. if we hit a DQ -> MatMul -> Q node unit the Transpose gets
    blocked after the DQ
  - insert a Q -> DQ after the Transpose to put it in a QDQ node unit and
    repair the original QDQ node unit
- Transpose moves past a DQ providing a graph output
  - insert a Q -> DQ so the Transpose is in a QDQ node unit
This replaces the existing phase 2 logic which flipped a DQ -> Transpose
to fix a broken QDQ node unit. The new approach should handle more
scenarios and hopefully produce a better graph.
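
As a rough illustration of the "remove empty DQ -> Q pairs" cleanup described above, here is a minimal Python sketch over a simplified linear chain of nodes. The `Node` and `QParams` types and the `remove_empty_dq_q_pairs` helper are hypothetical and exist only for this example. The real optimizer works on the graph API in C++ and has to deal with multiple consumers, graph outputs and so on, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QParams:
    """Hypothetical container for the values that must match."""
    dtype: str
    scale: float
    zero_point: int


@dataclass
class Node:
    """Hypothetical node: op_type is "Q", "DQ", or any other operator name."""
    op_type: str
    qparams: Optional[QParams] = None


def remove_empty_dq_q_pairs(chain: List[Node]) -> List[Node]:
    """Drop a DQ that is immediately followed by a Q with identical parameters.

    Such a pair has nothing between it, so it is an empty QDQ node unit.
    """
    result: List[Node] = []
    for node in chain:
        prev = result[-1] if result else None
        if (
            prev is not None
            and prev.op_type == "DQ"
            and node.op_type == "Q"
            and prev.qparams == node.qparams
        ):
            result.pop()  # remove the DQ and skip the matching Q
            continue
        result.append(node)
    return result
```

For example, if a Transpose that used to sit between a DQ and a Q was cancelled, the chain `Q -> DQ -> Q -> DQ -> Transpose` with identical parameters collapses to `Q -> DQ -> Transpose`. A pair whose parameters differ is left in place.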
Additionally the logic to handle updates to shared initializers that
feed DQ nodes was simplified (i.e. largely removed). When we update the
shared initializer a Squeeze (if broadcast) and Transpose is added
between the initializer and the DQ for other usages of it. We only need
to check for this pattern in EstimateTransposeValueCost by looking past
a DQ node. We do not need to track the individual DQ nodes leading to an
updated shared initializer.
### Motivation and Context
Initially, to fix a QNN issue where a non-const input was transposed and
the QDQ node units were broken.
### Description
Added Uniforms to Expand operator kernel
### Motivation and Context
Improve performance
### Description
`generate_artifacts` generates 4 graphs for training. All graphs should
share the same opset version, the one coming from the model to train,
but the opset of the optimizer graph is left undefined. onnxruntime then
uses the latest version defined by onnx, which onnxruntime does not
necessarily support.
### Motivation and Context
The code does not let the user change it.
The current WebNN CPU (XNNPack) backend supports only a limited list of
ops. Fall back directly for ops that are unsupported for the WebNN "cpu"
deviceType. This is a workaround because an op may be included in
MLGraphBuilder for the DirectML backend but have no XNNPack
implementation in Chromium.
### Description
- Fix QDQ optimizer logic that drops Q/DQ ops from Split node groups so
  that it only occurs when all input/output quantization parameters are
  equal.
  - Currently, the selector used for this optimization does not ensure
    that all quantization parameters are equal.
- Support dropping Q/DQ ops from Split node groups with the optional split
  input (introduced in opset 13). This was not working previously.
### Motivation and Context
Fix bugs in handling of QDQ Split node groups.
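
The condition in the first bullet can be sketched in a few lines of Python. This is only an illustration of the check, not the selector code itself. `QuantParams` and `can_drop_split_qdq` are hypothetical names, and comparing scales with exact equality is an assumption of this sketch.

```python
from typing import Sequence, Tuple

# Hypothetical representation: (element type, scale, zero point).
QuantParams = Tuple[str, float, int]


def can_drop_split_qdq(
    input_params: QuantParams, output_params: Sequence[QuantParams]
) -> bool:
    """Return True only if every output Q uses the same parameters as the input DQ."""
    return all(params == input_params for params in output_params)
```

If even one output of the Split is quantized with a different scale or zero point, the Q/DQ ops have to stay, because dropping them would change the values seen downstream.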
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
Add support for the LCM model
(https://huggingface.co/latent-consistency/lcm-sdxl) in the SDXL demo.
Since the LCM model does not need classifier-free guidance, there is no
need to use a negative prompt. The input and output shapes differ from
the original SDXL model: there is no need to double the batch dimension.
We also save metadata to the image, and update the image filename to
include the scheduler and steps.
#### Latency (milliseconds) of generating 1024x1024 images on
A100-SXM4-80GB GPU
Engines are built with static input shape, and CUDA graph is enabled.
For dynamic shape input, the latency could be higher.
Batch Size | Pipeline | Steps | ORT_CUDA | ORT_TRT | TRT 8.6
-- | -- | -- | -- | -- | --
1 | LCM SDXL | 4 | 275 | 249 | 258
1 | LCM SDXL | 8 | 460 | 423 | 430
1 | SDXL Base | 30 | 2566 | 2535 | 2569
4 | LCM SDXL | 4 | 925 | 887 | 1032
4 | LCM SDXL | 8 | 1539 | 1493 | 1662
4 | SDXL Base | 30 | 9227 | 9408 | 9678
### Description
Use `InlinedVector<int64_t>` instead of `InlinedVector<int64_t, 5>` to
reduce the number of template instantiations.
### Motivation and Context
The reported size reduction is small, just a few KB. Just trying it out.
### Description
Build ORT-training packaging pipeline for CUDA 12.2
### Motivation and Context
This helps customers using CUDA 12, who will no longer need to build
ORT-training from source.
Test run:
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=382993&view=logs&s=130be951-c2f3-5601-5709-434b5e50ddb0
### Description
PR #16051 introduced the GemmFloat8 operator, but the
DISABLE_FLOAT8_TYPES flag was missing in a couple of places. This PR
addresses that issue, which allows compilation with CUDA < 11.8.
### Description
For some models, we need to re-run model.forward to get past-kv.
### Description
It was a mistake to use 2 different names for the Clip operator in
op-resolve-rules.ts for different opsets. An optimized implementation can
handle both cases (opset < 11 and opset >= 11). Remove "ClipV10" as an
entry from the table.
### Description
Currently, the binary algorithms are divided into a vectorized one
(efficient) and a non-vectorized one (less efficient). The following
situations use the vectorized one:
1) A or B's shape length is 1.
2) The shared dimensions length of A and B is divisible by 4.
3) A and B have the same shape.
This PR adds another situation that uses the vectorized algorithm:
4) A or B's last dimension is divisible by 4.
With this change, the aggregate time of Add in sam-b-encoder drops from
409.12 ms to 309.65 ms on Intel ADL.
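
The four conditions can be sketched as a single predicate. This Python version is only an illustration (the real code lives in the WebGPU binary op implementation) and it rests on two interpretations that are assumptions of this sketch: "shape length is 1" is read as "the tensor has exactly one element", and "shared dimensions length" is read as the product of the trailing dimensions that A and B have in common. `shared_trailing_size` and `can_vectorize` are hypothetical names.

```python
from math import prod
from typing import Sequence


def shared_trailing_size(a: Sequence[int], b: Sequence[int]) -> int:
    """Product of the trailing dims that are identical in a and b (assumed reading of 2)."""
    shared = 1
    for dim_a, dim_b in zip(reversed(a), reversed(b)):
        if dim_a != dim_b:
            break
        shared *= dim_a
    return shared


def can_vectorize(a: Sequence[int], b: Sequence[int]) -> bool:
    """Return True if any of the four situations listed above applies."""
    if tuple(a) == tuple(b):  # 3) same shape
        return True
    if prod(a) == 1 or prod(b) == 1:  # 1) one-element operand (assumed reading)
        return True
    if shared_trailing_size(a, b) % 4 == 0:  # 2) shared dims divisible by 4
        return True
    # 4) new in this PR: the last dimension of either operand is divisible by 4
    return (len(a) > 0 and a[-1] % 4 == 0) or (len(b) > 0 and b[-1] % 4 == 0)
```

Under these assumptions, shapes `[3, 8]` and `[3, 1]` take the vectorized path only because of the new condition 4, while `[5, 7]` and `[5, 1]` still use the non-vectorized algorithm.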
### Description
Truncate trailing non-existing arguments.
Make sure we do not skip the non-existing arguments in the middle,
because shape inference relies on their proper position.
This also affects the argument position in the Edges, which must be
properly rebuilt each time an If node branch is inlined.
Make sure that when we rename Defs in subgraphs, the new renamed defs
are created in those subgraphs instead of pointing to outer scope defs.
Add unit test.
### Motivation and Context
This is a follow up for
https://github.com/microsoft/onnxruntime/pull/18105
Currently, the non-trailing arguments are simply ignored and the edges
are created with potentially incorrect positions.
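
The difference between truncating trailing arguments and skipping middle ones can be shown with a small Python sketch. It assumes a non-existing argument is represented by an empty name, following the ONNX convention for omitted optional inputs. `truncate_trailing_missing_args` is a hypothetical helper for illustration, not the function used in the inliner.

```python
from typing import List


def truncate_trailing_missing_args(arg_names: List[str]) -> List[str]:
    """Drop missing (empty-name) args from the end only.

    Missing args in the middle are kept so that every remaining arg
    stays at its original position.
    """
    end = len(arg_names)
    while end > 0 and arg_names[end - 1] == "":
        end -= 1
    return arg_names[:end]
```

For inputs `["X", "", "W", "", ""]` the result is `["X", "", "W"]`: `W` is still the third argument, which is what shape inference and the rebuilt edges depend on. Filtering out every empty name would instead move `W` to the second position.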
### Description
Optimize the eslint config to:
- set parserOptions.project to `true` so that @typescript-eslint/parser
  finds the tsconfig.json file nearest to each source file. This avoids
  parsing extra files and may help to:
  - reduce the possibility of seeing OOM or stack overflow with "npm run
    lint"
  - speed up processing
- enforce the rule "no-underscore-dangle" with a list of exceptions.
### Description
Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for
QLoRA fine-tuning.
- On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16
dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16`
type which uses float for compute.
- I have validated the op in a llama2-7b training scenario. The losses
match pytorch training and the training throughput is better.
- Cannot add a bfloat16 case in the op unit test since casting BFloat16
to and from float multiple times during the test causes the required
tolerances to be unachievable.
The custom autograd function exporter in onnxruntime-training is updated
to support the latest version of bitsandbytes. They changed how the
`quant_state` is stored.
### Motivation and Context
Enable QLoRA fine-tuning with bfloat16.
### Description
### Motivation and Context
A recent PyTorch change broke DORT CI, and [a
patch](https://github.com/pytorch/pytorch/pull/113697) has been merged
into PyTorch main. To update DORT's CI, this PR makes a dummy change.
### Description
It causes our "NPM Packaging Pipeline" to fail.
### Description
[js] update a few packages
- update semver
- update reference of onnx_proto to local folder in order to upgrade
protobufjs@7.2.4
Resolve AB#18513
### Description
Motivation for this PR is code cleanup.
1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
### Description
This PR addresses https://github.com/microsoft/onnxruntime/issues/17652.
The deprecated `MLMultiArray.dataPointer` is replaced with
`.getBytesWithHandler`, as suggested by the docs.
For now, I am only checking that the output `MLMultiArray` is
contiguous, returning unsupported operation when that is not the case.
I think this is already better than what we have right now, so we can
block unsafe calls to `.dataPointer` (if any..).
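
For reference, the contiguity check boils down to comparing the reported strides with the strides a densely packed row-major array of the same shape would have. The Python sketch below only illustrates that logic and is not the Objective-C++ code in this PR. `is_contiguous` is a hypothetical name, and strides are assumed to be expressed in elements rather than bytes.

```python
from typing import Sequence


def is_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Return True if strides describe a dense row-major layout for shape.

    Assumes all dims are >= 1. Strides of size-1 dims are ignored because
    they never affect addressing.
    """
    if len(shape) != len(strides):
        raise ValueError("shape and strides must have the same rank")
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True
```

A `[2, 3, 4]` array with strides `[12, 4, 1]` is contiguous, so a single `memcpy` is safe. The same shape with strides `[16, 4, 1]` has padding between the outer slices and would need a strided copy.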
I would be happy to implement the handling of the non-contiguous case
(replacing `memcpy` for such cases) as suggested by @edgchen1, but I am
not sure how to reproduce that case to add a corresponding unit-test.
Would we have to define a custom `MLCustomLayer` to get a non-contiguous
output from a model..?
### Motivation and Context
Fix https://github.com/microsoft/onnxruntime/issues/17652.
---------
Co-authored-by: nicolo-lucchesi <nicolo.lucchesi@hexagon.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
This is a narrow implementation of Attention/MultiHeadAttention as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask
But it works well for StableDiffusion and can be extended later. It
reduces VRAM usage as it combines many ops into a few.
I've updated the demo at https://islamov.ai/stable-diffusion-webgpu/. It
takes ~13 s for 1 image with 20 steps on RTX3090Ti and about 25 s on M1
Pro.
VRAM usage is about 8 GB if you don't use img2img.
Going to focus on SDXL now.
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Change the RotaryEmbeddings op implementation to add support for a 4D
input tensor with shape [batch, num_heads, seq_len, head_size].
### Motivation and Context
The current RotaryEmbedding op only supports a 3D input tensor with shape
[batch, seq_len, hidden_size].
For the llamav2 model, when using FusionRotaryEmbeddings to fuse only the
RotaryEmbeddings op, there is a transpose operation for query and key, so
the input tensor of RotaryEmbeddings becomes 4D: [batch, num_heads,
seq_len, head_size].
The current RotaryEmbeddings implementation can't support this scenario,
so it needs to support a 4D input tensor.
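
To make the two layouts concrete, the following Python sketch converts a 3D [batch, seq_len, hidden_size] tensor (as nested lists) into the 4D [batch, num_heads, seq_len, head_size] layout produced by the reshape plus transpose. It assumes hidden_size = num_heads * head_size with heads stored one after another along the hidden dimension. `to_4d` is a hypothetical helper and has nothing to do with the kernel code.

```python
from typing import List

Tensor3D = List[List[List[float]]]
Tensor4D = List[List[List[List[float]]]]


def to_4d(x: Tensor3D, num_heads: int) -> Tensor4D:
    """[batch, seq_len, hidden_size] -> [batch, num_heads, seq_len, head_size]."""
    hidden_size = len(x[0][0])
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    head_size = hidden_size // num_heads
    return [
        [
            # For head n, take that head's slice of every sequence position.
            [row[n * head_size:(n + 1) * head_size] for row in batch]
            for n in range(num_heads)
        ]
        for batch in x
    ]
```

Element (b, n, s, d) of the 4D tensor is element (b, s, n * head_size + d) of the 3D tensor, so an op that supports both layouts only needs to change how it computes offsets.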
### Description
Always run emsdk_env.sh before build.py, even when ccache is disabled.
This is a follow up to #18434. That PR didn't handle the case when
ccache was disabled.
It's possible that a subgraph of the "If" control flow op has no nodes.
The TRT EP should consider this kind of subgraph to be fully supported by
TRT.
The faster rcnn model mentioned in
https://github.com/microsoft/onnxruntime/issues/17434 is such a case.
### Description
Implement a preliminary version of local (sliding window) attention.
It is currently only supported by Flash Attention (sm >= 80, Linux) and
only supports sliding attention with a large cached kv.
### Motivation and Context
This change enables running Mistral and other models which use sliding
window attention.
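
As a reminder of what sliding window attention means, the sketch below builds the boolean mask for a causal local window in plain Python. It assumes the window counts the current token, so each query sees itself and the `window - 1` positions before it. Implementations differ on that detail, and this is not how the Flash Attention kernel computes it. `local_attention_mask` is a hypothetical name.

```python
from typing import List


def local_attention_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[q][k] is True if query position q may attend to key position k.

    Causal (k <= q) and limited to the last `window` positions up to and
    including q. The inclusive window is an assumption of this sketch.
    """
    return [
        [(q - window < k <= q) for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

With `seq_len=4` and `window=2`, position 3 attends only to positions 2 and 3, while full causal attention would also include positions 0 and 1.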
### Description
QNN can't run MatMul on v68 if both inputs are dynamic and uint16 quantized. Make it run by inserting a Convert op to convert 1 input to int8.
### Description
Update usability checker and related infrastructure to support checking
models > 2GB.
- Add ability to set flag to keep initializers as external data
  - we optimize the model as part of the checking so need to write out a
    new copy.
- Handle issue with ONNX shape inferencing silently failing
  - use API that supports large models but requires writing the model to a
    new file
  - automate cleanup of that copy of the model
### Motivation and Context
Allow analysis of LLMs to determine gaps for mobile usage.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Allow empty shapes and do not validate them for inputs/outputs in
InferenceSession::ValidateInputsOutputs().
### Motivation and Context
https://github.com/microsoft/onnxruntime/pull/17301 disallowed empty
shapes.
However, many models depend on them as a way to pass shapes of different
ranks.
### Description
Support uniforms in Slice op
### Motivation and Context
Improve performance
In LoRA code, conv1d is used to do the projection for qkv. The conv1d
calculation is mathematically equivalent to matmul, and matmul is much
faster than conv1d.
The substitution made by the graph optimizer is: 1 conv1d >> 2 split + 1
squeeze + group_num matmul + 1 concat
With this optimizer, we see a 10%+ improvement in one 1P model.
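
The equivalence is easy to check by hand for the simplest case. The Python sketch below assumes a conv1d with kernel size 1, a single group, no bias, stride 1 and no padding; those restrictions are assumptions of this example, not something stated above. In that case, squeezing the kernel dimension of the weight gives a plain matrix, and the convolution is exactly a matmul. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv1d_pointwise(x: Matrix, w: List[Matrix]) -> Matrix:
    """x: [c_in, length], w: [c_out, c_in, 1] -> y: [c_out, length]."""
    c_in, length = len(x), len(x[0])
    return [
        [sum(w[o][c][0] * x[c][t] for c in range(c_in)) for t in range(length)]
        for o in range(len(w))
    ]


def squeeze_kernel(w: List[Matrix]) -> Matrix:
    """Drop the size-1 kernel dimension: [c_out, c_in, 1] -> [c_out, c_in]."""
    return [[taps[0] for taps in row] for row in w]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a [m, k] and b [k, n]."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]
```

With more than one group, the same idea applies per group: split the input channels and the weight by group, run one matmul per group, and concatenate the results, which matches the split/squeeze/matmul/concat pattern listed above.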
TRT builder instantiation is slow (see
[here](https://github.com/microsoft/onnxruntime/issues/18071)).
In the current TRT EP, we instantiate a builder object every time we need
one. Multiple places need the TRT builder, so this causes huge
performance overhead.