Commit graph

10036 commits

Author SHA1 Message Date
Xu Xing
fa106942a7
[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452)
This change refactored matmul/conv related programs to support shape
uniforms. Currently only matmul shape uniforms are fully enabled.
TODOs: add input dependencies for conv related programs, turn clipMax
and clipMin to uniforms.
2023-11-22 14:42:55 -08:00
Scott McKay
42c6799c59
Update transpose optimization to be more QDQ aware (#18444)
### Description
Rework some aspects of the transpose optimizer to ensure we have valid
QDQ node units when it is done.

Conceptually we need to let individual Transpose nodes move through the
graph when optimizing. That can invalidate existing QDQ node units or
require new ones. We can fix this after inserting new nodes, or when
transpose optimization finishes moving Transpose nodes.

Fix when inserting new node
- TransposeInputs can add an Unsqueeze (to broadcast) and Transpose to a
node's inputs
- if there was a DQ node providing the input, add a Q -> DQ after
inserting the Unsqueeze/Transpose to make a QDQ node unit for the new
node.
- Unsqueeze/Transpose don't change data, so we can copy the
type/scale/zero point from the existing DQ

Fixes when transpose optimization completes moving Transpose nodes
- Remove empty DQ -> Q pairs if the type/scale/zero point match
  - Pushing a Transpose through may have resulted in an existing
    Transpose/Reshape being cancelled and removed leaving an empty QDQ node
    unit
  - the Transpose being moved may have started in a QDQ node unit
- Transpose that got blocked inside existing QDQ node unit
  - e.g. if we hit a DQ -> MatMul -> Q node unit the Transpose gets
    blocked after the DQ
  - insert a Q -> DQ after the Transpose to put it in a QDQ node unit and
    repair the original QDQ node unit
- Transpose moves past a DQ providing a graph output
  - insert a Q -> DQ so the Transpose is in a QDQ node unit
This replaces the existing phase 2 logic which flipped a DQ -> Transpose
to fix a broken QDQ node unit. The new approach should handle more
scenarios and hopefully produce a better graph.

Additionally the logic to handle updates to shared initializers that
feed DQ nodes was simplified (i.e. largely removed). When we update the
shared initializer a Squeeze (if broadcast) and Transpose is added
between the initializer and the DQ for other usages of it. We only need
to check for this pattern in EstimateTransposeValueCost by looking past
a DQ node. We do not need to track the individual DQ nodes leading to an
updated shared initializer.

### Motivation and Context
Initially to fix a QNN issue where a non-const input was transposed and
the QDQ node units were broken.
2023-11-23 08:27:47 +10:00
satyajandhyala
841f7ed3e0
[JS/Web] Added uniform to Expand op. (#18558)
### Description
Added Uniforms to Expand operator kernel


### Motivation and Context
Improve performance
2023-11-22 14:14:24 -08:00
Arthur Islamov
1c555c5fc1
[JS/Web] Resize & BiasSplitGelu fp16 support (#18536)
### Description
Resize and BiasSplitGelu fp16 support on WebGPU
2023-11-22 12:12:07 -08:00
Xavier Dupré
3f0ebd6736
Fix opset import in GemmFloat8 python unit tests (#18489)
### Description
The unit tests fail if a development version of onnx is used. The
opsets are set to 19.
2023-11-22 09:15:24 -08:00
Xavier Dupré
32fabb5555
Fix opset version of the optimizer in function generate_artifacts (#18300)
### Description
`generate_artifacts` generates 4 graphs for training. All graphs should
share the same opset version, the one coming from the model to train,
but the optimizer is left undefined. onnxruntime is using the latest
version defined by onnx but onnxruntime does not necessarily support it.

### Motivation and Context
The code does not let the user change it.
2023-11-22 09:15:11 -08:00
Wanming Lin
89723c8612
[WebNN EP] Mark and fallback unsupported op for WebNN CPU backend (#18472)
The current WebNN CPU (XNNPack) backend supports a limited op list, so
unsupported ops fall back directly for the WebNN "cpu" deviceType. This is a
workaround because an op may be included in MLGraphBuilder for the DirectML
backend but have no XNNPack implementation in Chromium.
2023-11-22 09:05:30 -08:00
Vincent Wang
3bc9efc7b2
[ORTModule] Adjust Attention Patterns for Efficient Attention ATen Fallback (#18471)
Adjust attention patterns to match the latest Whisper+exporter. Also add
some condition checks and docs.
2023-11-22 15:24:05 +08:00
Adrian Lizarraga
7c573054b6
[QDQ Optimizer] Fix logic that drops Q/DQ ops from QDQ split node groups (#18394)
### Description
- Fix QDQ optimizer logic that drops Q/DQ ops from Split node groups so
that it only occurs when all input/output quantization parameters are
equal.
- Currently, the selector used for this optimization does not ensure
that all quantization parameters are equal.
- Support dropping Q/DQ ops from Split node groups with optional split
inputs (introduced in opset 13). This was not working previously.
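
The equality condition in the first bullet can be illustrated with a minimal sketch. This is not the optimizer's actual selector: `QuantParams` and `can_drop_qdq` are hypothetical names, and each Q/DQ node is modelled as a plain (dtype, scale, zero point) tuple.

```python
from typing import NamedTuple, Sequence


class QuantParams(NamedTuple):
    """Hypothetical stand-in for the parameters of one Q or DQ node."""
    dtype: str
    scale: float
    zero_point: int


def can_drop_qdq(input_params: QuantParams,
                 output_params: Sequence[QuantParams]) -> bool:
    """Return True only if every output Q node uses exactly the same
    quantization parameters as the input DQ node."""
    return all(p == input_params for p in output_params)


same = QuantParams("uint8", 0.5, 128)
other = QuantParams("uint8", 0.25, 128)
print(can_drop_qdq(same, [same, same]))   # True
print(can_drop_qdq(same, [same, other]))  # False
```

If any output uses a different scale or zero point, the Q/DQ ops perform a real requantization and cannot be removed without changing values.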


### Motivation and Context
Fix bugs in handling of QDQ Split node groups.

---------

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2023-11-21 21:31:31 -08:00
Tianlei Wu
62da3b1ca4
SDXL Latent Consistency Model (LCM) optimization (#18526)
Add support for the LCM model
(https://huggingface.co/latent-consistency/lcm-sdxl) in the SDXL demo.

Since the LCM model does not need classifier-free guidance, there is no
need to use a negative prompt. The input and output shapes differ
from the original SDXL model: there is no need to double the batch dimension.

We also save metadata to the image, and update the image filename to include
the scheduler and steps.

#### Latency (milliseconds) of generating 1024x1024 images on
A100-SXM4-80GB GPU

Engines are built with static input shape, and CUDA graph is enabled.
For dynamic shape input, the latency could be slower.

Batch Size | Pipeline | Steps | ORT_CUDA | ORT_TRT | TRT 8.6
-- | -- | -- | -- | -- | --
1 | LCM SDXL | 4 | 275 | 249 | 258
1 | LCM SDXL | 8 | 460 | 423 | 430
1 | SDXL Base | 30 | 2566 | 2535 | 2569
4 | LCM SDXL | 4 | 925 | 887 | 1032
4 | LCM SDXL | 8 | 1539 | 1493 | 1662
4 | SDXL Base | 30 | 9227 | 9408 | 9678
2023-11-21 21:27:49 -08:00
Yulong Wang
d455b0f8fd
[js/web] use Chrome in CI for npm tests (#18522)
### Description
Use Chrome in CI for npm tests. Previously we used Edge; however, it
sometimes crashes for reasons not yet identified.
2023-11-21 18:03:57 -08:00
Jiajia Qin
ac8598a837
[js/webgpu] enable f16 for concat (#18528)
### Description
With this PR, the `realesrgan-t64-f16` model takes 32.8 ms, down from 1052.55
ms. The whole model now runs on jsep.
2023-11-21 14:26:00 -08:00
Dmitri Smirnov
81a763a9eb
Make TensorShapeVector to use InlinedVector<Int64_t> to reduce on template instantiations (#18519)
### Description
Use InlinedVector<int64> instead of <int64_t,5> to reduce the number
of template instantiations.

### Motivation and Context
The reported size reduction is small, just a few Ks. Just trying it out.
2023-11-21 14:13:50 -08:00
Abhishek Jindal
680a526e73
Training packaging pipeline for cuda12 (#18524)
### Description
Build ORT-training packaging pipeline for CUDA 12.2


### Motivation and Context
This will help customers using CUDA 12, who will no longer need to build
ORT-training from source.

Test run:
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=382993&view=logs&s=130be951-c2f3-5601-5709-434b5e50ddb0
2023-11-21 13:19:21 -08:00
Sheil Kumar
2a01622536
Hide NPU Adapter selection behind macro (#18515)
Hide NPU Adapter selection behind macro

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-11-21 08:47:56 -08:00
Xavier Dupré
29a409acaa
Add missing flags DISABLE_FLOAT8_TYPES in GemmFloat8 custom operator for CUDA < 11.8 (#18162)
### Description
PR #16051 introduced operator GemmFloat8 but the flag
DISABLE_FLOAT8_TYPES was missing in a couple of places. This PR addresses
that issue, which allows compilation on CUDA < 11.8.
2023-11-21 14:37:48 +01:00
JiCheng
a608c002a3
fix past-kv in general LLM exporter (#18529)
### Description

For some models, we need to re-run model.forward to get past-kv.

2023-11-21 19:04:55 +08:00
Yulong Wang
c7fd930330
[js/web] unify resolve rules for "Clip" (#18527)
### Description
It was a mistake to use two different names for the Clip operator in
op-resolve-rules.ts for different opsets. An optimized implementation can
handle both cases (opset < 11 and opset >= 11). Remove "ClipV10" as an
entry from the table.
2023-11-20 23:18:06 -08:00
Jiajia Qin
abdf8b7c3f
[js/webgpu] Optimize broadcast binary. (#18185)
### Description
Currently, the binary algorithms are divided into the vectorized one
(efficient) and the non-vectorized one (less efficient). The following
situations use the vectorized one:
1) A or B's shape length is 1.
2) The shared dimensions length of A and B is divisible by 4.
3) A and B have the same shape.

This PR adds another situation that uses the vectorized algorithm:
4) A or B's last dimension is divisible by 4.

With this change, the aggregate time of Add in sam-b-encoder becomes
309.65 ms from 409.12 ms on Intel ADL.
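
The four cases listed above could be combined into one predicate, roughly as sketched below. This is an illustration, not the WebGPU kernel's code: `use_vectorized` and `shared_trailing_size` are hypothetical names. Two readings are assumptions: case 1 is taken to mean that A or B holds a single element, and "shared dimensions length" is taken to mean the product of the trailing dimensions on which both shapes agree.

```python
from math import prod
from typing import Sequence


def shared_trailing_size(a: Sequence[int], b: Sequence[int]) -> int:
    """Product of the trailing dimensions on which a and b agree
    (assumed reading of "shared dimensions length")."""
    size = 1
    for dim_a, dim_b in zip(reversed(a), reversed(b)):
        if dim_a != dim_b:
            break
        size *= dim_a
    return size


def use_vectorized(a: Sequence[int], b: Sequence[int]) -> bool:
    """Decide whether the vectorized binary path applies to shapes a and b."""
    if list(a) == list(b):                    # case 3: same shape
        return True
    if prod(a) == 1 or prod(b) == 1:          # case 1 (assumed: single element)
        return True
    if shared_trailing_size(a, b) % 4 == 0:   # case 2: shared dims divisible by 4
        return True
    # case 4 (added by this PR): either last dimension is divisible by 4
    return a[-1] % 4 == 0 or b[-1] % 4 == 0


# Only case 4 applies here: the shapes share no trailing dimension,
# but A's last dimension (8) is divisible by 4.
print(use_vectorized([2, 8], [2, 1]))
```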
2023-11-20 16:52:17 -08:00
Dmitri Smirnov
cc542024ce
Create edges with arg positions correctly accounting for non-existing args (#18462)
### Description
Truncate trailing non-existing arguments.
Make sure we do not skip the non-existing arguments in the middle,
because shape inference relies on their proper position.
This also affects the argument positions in the Edges, which must be
properly rebuilt each time an If node branch is inlined.
Make sure that when we rename Defs in subgraphs, the renamed defs are
created in those subgraphs instead of pointing to outer scope defs.
Add unit test.
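
The difference between trailing and middle non-existing arguments can be shown with a small sketch. It is not the graph code itself: `trim_trailing_missing` is a hypothetical helper, and a missing argument is modelled as `None`.

```python
from typing import List, Optional


def trim_trailing_missing(args: List[Optional[str]]) -> List[Optional[str]]:
    """Drop missing (None) arguments at the end only. Missing arguments in
    the middle are kept so later arguments keep their positional index."""
    end = len(args)
    while end > 0 and args[end - 1] is None:
        end -= 1
    return args[:end]


args = ["X", None, "W", None, None]
trimmed = trim_trailing_missing(args)
print(trimmed)  # ['X', None, 'W']

# Edge positions come from the preserved indices: "W" stays in slot 2.
print([(i, name) for i, name in enumerate(trimmed) if name is not None])
# [(0, 'X'), (2, 'W')]
```

Had the middle `None` been skipped, "W" would have moved to slot 1 and any edge built from it would carry the wrong argument position.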

### Motivation and Context
This is a follow up for
https://github.com/microsoft/onnxruntime/pull/18105
Currently, the non-trailing arguments are simply ignored and the edges
are created
with potentially incorrect positions.
2023-11-20 14:49:09 -08:00
Yulong Wang
247ce21859
[js] optimize eslint config (#18460)
### Description
optimize eslint config to:
- set parserOptions.project to `true` to allow @typescript-eslint/parser
to find the nearest tsconfig.json file to that source file. This helps
to avoid parsing extra files, which may help with:
   - reducing the possibility of seeing OOM or stack overflow with "npm run
     lint"
   - faster processing
- enforce rule "no-underscore-dangle" with a list of exceptions.
2023-11-20 12:00:56 -08:00
Jian Chen
1dd9bf5340
Remove setup_env_azure.bat (#18482)
2023-11-20 09:58:15 -08:00
Jambay Kinley
1af0681554
Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484)
### Description
Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for
QLoRA fine-tuning.
- On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16
dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16`
type which uses float for compute.
- I have validated the op in a llama2-7b training scenario. The losses
match pytorch training and the training throughput is better.
- Cannot add a bfloat16 case in the op unit test since casting BFloat16
to and from float multiple times during the test causes the required
tolerances to be unachievable.

The custom autograd function exporter in onnxruntime-training is updated
to support the latest version of bitsandbytes. They changed how the
`quant_state` is stored.

### Motivation and Context
Enable QLoRA fine-tuning with bfloat16.
2023-11-20 09:52:58 -08:00
Jian Chen
d97fc1824f
Create a new Python Package pipeline for CUDA 12 (#18348)
2023-11-20 09:48:28 -08:00
Wei-Sheng Chin
3bcc137eb4
Tiny change to trigger the update of DORT's CI image (#18507)
A recent PyTorch change broke DORT CI and [a
patch](https://github.com/pytorch/pytorch/pull/113697) has been merged
into PyTorch main. In order to update DORT's CI, we made a dummy change in
this PR.
2023-11-19 22:09:11 -08:00
Changming Sun
dc9ab4f821
Update setup.py: replace libcudart.so.12.0 with libcudart.so.12 (#18501)
2023-11-19 22:06:32 -08:00
Akshay Sonawane
97cc40d75a
Add fusion patterns for conformer-transducer model (#18461)
### Description
Add conformer-transducer model type to optimizer. This PR adds pattern
matches for attention shown below:
Unfused attention:

![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956)

Fused attention:

![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)
2023-11-18 23:39:04 -08:00
RandySheriffH
53917a3353
Move up members in Lite Custom Op hierarchy for possible memleaks. (#18478)
Move data member in LiteOpFunc to its parent to avoid possible mem
leaks.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-11-18 15:00:54 -08:00
Changming Sun
9364c05170
Update web-ci.yml: remove depth=1 (#18500)
### Description
It causes our "NPM Packaging Pipeline" to fail.


2023-11-17 22:49:03 -08:00
Yulong Wang
34c5424456
[js] update a few packages (#18499)
### Description
[js] update a few packages

- update semver
- update reference of onnx_proto to local folder in order to upgrade
protobufjs@7.2.4

Resolve AB#18513
2023-11-17 22:40:51 -08:00
Ashwini Khade
02333293de
Removed all the deprecated python training code and related tests and utils (#18333)
### Description
Motivation for this PR is code cleanup.

1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
2023-11-17 18:19:21 -08:00
Nicolò Lucchesi
cbb85b4874
[CoreML] Adapt to MLMultiArray.dataPointer deprecation (#17726)
### Description
This PR addresses https://github.com/microsoft/onnxruntime/issues/17652.
The deprecated `MLMultiArray.dataPointer` is replaced with
`.getBytesWithHandler`, as suggested by the docs.
For now, I am only checking that the output `MLMultiArray` is
contiguous, returning unsupported operation when that is not the case.
I think this is already better than what we have right now, so we can
block unsafe calls to `.dataPointer` (if any..).

I would be happy to implement the handling of the non-contiguous case
(replacing `memcpy` for such cases) as suggested by @edgchen1, but I am
not sure how to reproduce that case to add a corresponding unit-test.
Would we have to define a custom `MLCustomLayer` to get a non-contiguous
output from a model..?

### Motivation and Context
Fix https://github.com/microsoft/onnxruntime/issues/17652.

---------

Co-authored-by: nicolo-lucchesi <nicolo.lucchesi@hexagon.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-11-17 17:58:49 -08:00
Changming Sun
41f9379f3c
Update NDK version to 26.1.10909125 (#18493)
### Description
Similar to #17852


### Motivation and Context
To avoid downloading NDK
2023-11-17 14:14:01 -08:00
Arthur Islamov
fac3e33da5
[js/web] JSEP Attention & MultiHeadAttention (#17742)
### Description
This is a narrow implementation of Attention/MultiHeadAttention as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask

But it works well for StableDiffusion and can be extended later. It
reduces VRAM usage as it combines many ops into a few.
I've updated the demo here: https://islamov.ai/stable-diffusion-webgpu/ It
takes ~13 sec for 1 image with 20 steps on an RTX3090Ti and about 25 s on an M1
Pro.
VRAM usage is about 8 GB if you don't use img2img.

Going to focus on SDXL now

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-11-17 12:23:52 -08:00
Wanming Lin
a5537f2f56
[WebNN Ep] Slice's axes and steps inputs should be constant initializers (#18427)
2023-11-17 08:01:40 -08:00
kailums
1a29460919
rope support 4D input tensor (#18454)
### Description

Change the RotaryEmbeddings op implementation to add support for a 4D input
tensor with shape [batch, num_heads, seq_len, head_size].

### Motivation and Context
The current RotaryEmbedding op only supports a 3D input tensor with shape
[batch, seq_len, hidden_size].

For llamav2 model, when using FusionRotaryEmbeddings to only fuse
RotaryEmbeddings op, there will be a transpose operation for query and
key, and then the input tensor of RotaryEmbeddings becomes 4D [batch,
num_heads, seq_len, head_size].

This scenario can't be supported by current RotaryEmbeddings
implementation. So it needs to support 4D input tensor.
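
The relation between the two layouts can be sketched with nested lists. This is only an illustration of the shapes, not the op's implementation: `to_4d` is a hypothetical helper, and it assumes hidden_size = num_heads * head_size with each head occupying a contiguous slice of the hidden dimension.

```python
from typing import List


def to_4d(x3d: List[List[List[int]]], num_heads: int) -> List[List[List[List[int]]]]:
    """Convert [batch][seq_len][hidden_size] into
    [batch][num_heads][seq_len][head_size] (reshape, then transpose)."""
    hidden_size = len(x3d[0][0])
    head_size = hidden_size // num_heads
    return [
        [
            [row[h * head_size:(h + 1) * head_size] for row in batch]
            for h in range(num_heads)
        ]
        for batch in x3d
    ]


# batch=1, seq_len=2, hidden_size=4, num_heads=2 -> head_size=2
x3d = [[[0, 1, 2, 3], [4, 5, 6, 7]]]
print(to_4d(x3d, num_heads=2))
# [[[[0, 1], [4, 5]], [[2, 3], [6, 7]]]]
```

Element (b, h, s, d) of the 4D tensor corresponds to element (b, s, h * head_size + d) of the 3D tensor under this assumption.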
2023-11-17 20:38:15 +08:00
Changming Sun
5eb5056c61
Always run emsdk_env.sh before build.py, even when ccache is disabled (#18477)
### Description
Always run emsdk_env.sh before build.py, even when ccache is disabled

This is a follow up to #18434. That PR didn't handle the case when
ccache was disabled.
2023-11-16 21:37:29 -08:00
George Wu
d73073d491
remove full protobuf requirement for tensorrt ep (#18413)
tensorrt can work with protobuf lite.
2023-11-16 20:44:27 -08:00
Chi Lo
f17b6afe3c
[TensorRT EP] Fix bug for no nodes in subgraph at GetCapability (#18449)
It's possible that a subgraph of the "If" control flow op has no nodes.
The TRT EP should consider this kind of subgraph fully supported by TRT.

The faster rcnn model mentioned in this issue
https://github.com/microsoft/onnxruntime/issues/17434 is such a case.
2023-11-16 19:56:05 -08:00
aciddelgado
adb56df2e8
Aciddelgado/gqa local (#18375)
### Description
Implement preliminary version of local (sliding window) attention.
Currently only supported by Flash Attention (sm >= 80, Linux). Currently
only supports sliding attention with a large cached kv.



### Motivation and Context
This change enables running Mistral and other models which use sliding
window attention.
2023-11-16 15:01:06 -08:00
Hector Li
6a4e4488da
[QNN EP] Support Qnn MatMul with 2 dynamic inputs which are uint16 quantized (#18469)
### Description
QNN can't run MatMul on v68 if both inputs are dynamic and uint16 quantized. Make it run by inserting a Convert op to convert one input to int8.
2023-11-16 13:44:15 -08:00
Scott McKay
e7a524fea9
Update to allow large models to be checked for mobile support. (#18357)
### Description
Update usability checker and related infrastructure to support checking
models > 2GB.
- Add ability to set flag to keep initializers as external data
  - we optimize the model as part of the checking so need to write out a
    new copy.
- Handle issue with ONNX shape inferencing silently failing
  - use API that supports large models but requires writing the model to a
    new file
  - automate cleanup of that copy of the model

### Motivation and Context
Allow analysis of LLMs to determine gaps for mobile usage.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-11-17 07:20:16 +10:00
Dmitri Smirnov
b6b9aff608
Allow empty shapes and do not validate them for inputs/outputs (#18442)
### Description
Allow empty shapes and do not validate them for inputs/outputs at the
InferenceSession::ValidateInputsOutputs().

### Motivation and Context
https://github.com/microsoft/onnxruntime/pull/17301 disallowed empty
shapes.
However, many models depend on them as a way to pass shapes of different
ranks.
2023-11-16 13:15:48 -08:00
Chi Lo
3588fbac13
[TensorRT EP] Fix memory leak for cudnn/cublas (#18467)
Free memory for cudnn/cublas instances at TRT EP destruction.
https://github.com/microsoft/onnxruntime/issues/18466
2023-11-16 10:23:08 -08:00
satyajandhyala
b291b20fa0
[JS/Web]Added uniforms support to Slice op. (#18422)
### Description
Support uniforms in Slice op



### Motivation and Context
Improve performance
2023-11-16 09:44:13 -08:00
Wanming Lin
999752a35d
[WebNN EP] Support GreaterOrEqual and LessOrEqual ops (#18411)
2023-11-16 08:01:58 -08:00
Tianlei Wu
119e86ec16
SDXL demo: Add Option to disable refiner (#18455)
Add option to disable refiner and only run base model.
2023-11-16 06:43:18 -08:00
zhijiang
16d7f55193
lora conv1d replacement (#16643)
LoRA code uses conv1d to do the projection for qkv. The conv1d
calculation is mathematically equivalent to matmul, and matmul is
much faster than conv1d.
The substitution made by the graph optimizer is: 1 conv1d >> 2 split + 1
squeeze + group_num matmul + 1 concat

With this optimizer, we see a gain of 10%+ in one 1P model.
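
The equivalence can be checked numerically with a small pure-Python sketch. It is not the graph optimizer itself and it rests on assumptions: kernel size 1, stride 1, no padding and no bias. The function names and the nested-list tensor layout are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def conv1d_k1(x: Matrix, w: List[Matrix], groups: int) -> Matrix:
    """Reference grouped Conv1d with kernel size 1.
    x: [C_in][L], w: [C_out][C_in // groups][1]."""
    c_in_g = len(x) // groups
    c_out_g = len(w) // groups
    out = []
    for o in range(len(w)):
        g = o // c_out_g  # group this output channel belongs to
        out.append([
            sum(w[o][c][0] * x[g * c_in_g + c][t] for c in range(c_in_g))
            for t in range(len(x[0]))
        ])
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_matmul_concat(x: Matrix, w: List[Matrix], groups: int) -> Matrix:
    """Same result via Split (input and weight), Squeeze, MatMul, Concat."""
    c_in_g = len(x) // groups
    c_out_g = len(w) // groups
    out: Matrix = []
    for g in range(groups):
        x_g = x[g * c_in_g:(g + 1) * c_in_g]         # Split the input channels
        w_g = w[g * c_out_g:(g + 1) * c_out_g]       # Split the weight
        w_g2 = [[k[0] for k in row] for row in w_g]  # Squeeze the kernel axis
        out.extend(matmul(w_g2, x_g))                # MatMul, then Concat
    return out


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # C_in=4, L=3
w = [[[1], [0]], [[0], [1]], [[1], [1]], [[2], [-1]]]  # C_out=4, groups=2
print(conv1d_k1(x, w, groups=2) == split_matmul_concat(x, w, groups=2))  # True
```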
2023-11-16 17:08:06 +08:00
guyang3532
751aa8d31a
fix axis of layernorm for UpstreamReshape (#18425)
Similar to https://github.com/microsoft/onnxruntime/pull/17255,
update the axis for LayerNormalization when a Reshape is upstreamed through it.
2023-11-16 16:29:00 +08:00
Chi Lo
18a3675bf7
[TensorRT EP] Only instantiate TRT builder once (#18100)
TRT builder instantiation is slow (see
[here](https://github.com/microsoft/onnxruntime/issues/18071)).
In the current TRT EP, we instantiate a builder object every time we need it.
Multiple places need the TRT builder, so this causes huge
performance overhead.
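
The general idea of creating an expensive object once and reusing it can be sketched as follows. This is a generic lazy-initialization pattern, not the TRT EP's code: `ExpensiveBuilder` and `ExecutionProvider` are hypothetical stand-ins.

```python
class ExpensiveBuilder:
    """Hypothetical stand-in for an object that is slow to construct."""
    instances = 0  # counts how many times the constructor ran

    def __init__(self) -> None:
        ExpensiveBuilder.instances += 1


class ExecutionProvider:
    """Hypothetical owner that creates the builder lazily and reuses it."""

    def __init__(self) -> None:
        self._builder = None

    def get_builder(self) -> ExpensiveBuilder:
        # Construct on first use only; later callers get the cached object.
        if self._builder is None:
            self._builder = ExpensiveBuilder()
        return self._builder


ep = ExecutionProvider()
first = ep.get_builder()
second = ep.get_builder()
print(first is second, ExpensiveBuilder.instances)  # True 1
```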
2023-11-15 23:39:41 -08:00