onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-02 03:55:34 +00:00

Author	SHA1	Message	Date
Xu Xing	fa106942a7	[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452 ) This change refactored matmul/conv related programs to support shape uniforms. Currently only matmul shape uniforms are fully enabled. TODOs: add input dependencies for conv related programs, turn clipMax and clipMin to uniforms.	2023-11-22 14:42:55 -08:00
Scott McKay	42c6799c59	Update transpose optimization to be more QDQ aware (#18444 ) ### Description Rework some aspects of the transpose optimizer to ensure we have valid QDQ node units when it is done. Conceptually we need to let individual Transpose nodes move through the graph when optimizing. That can invalidate existing QDQ node units or require new ones. We can fix this after inserting new nodes, or when transpose optimization finishes moving Transpose nodes. Fix when inserting new node - TransposeInputs can add an Unsqueeze (to broadcast) and Transpose to a node's inputs - if there was a DQ node providing the input, add a Q -> DQ after inserting the Unsqueeze/Transpose to make a QDQ node unit for the new node. - Unsqueeze/Transpose don't change data, so we can copy the type/scale/zero point from the existing DQ Fixes when transpose optimization completes moving Transpose nodes - Remove empty DQ -> Q pairs if the type/scale/zero point match - Pushing a Transpose through may have resulted in an existing Transpose/Reshape being cancelled and removed leaving an empty QDQ node unit - the Transpose being moved may have started in a QDQ node unit - Transpose that got blocked inside existing QDQ node unit - e.g. if we hit a DQ -> MatMul -> Q node unit the Transpose gets blocked after the DQ - insert a Q -> DQ after the Transpose to put it in a QDQ node unit and repair the original QDQ node unit - Transpose moves past a DQ providing a graph output - insert a Q -> DQ so the Transpose is in a QDQ node unit This replaces the existing phase 2 logic which flipped a DQ -> Transpose to fix a broken QDQ node unit. The new approach should handle more scenarios and hopefully produce a better graph. Additionally the logic to handle updates to shared initializers that feed DQ nodes was simplified (i.e. largely removed).
When we update the shared initializer, a Squeeze (if broadcast) and Transpose are added between the initializer and the DQ for other usages of it. We only need to check for this pattern in EstimateTransposeValueCost by looking past a DQ node. We do not need to track the individual DQ nodes leading to an updated shared initializer. ### Motivation and Context Initially to fix a QNN issue with a non-const input being transposed and the QDQ node units being broken.	2023-11-23 08:27:47 +10:00
satyajandhyala	841f7ed3e0	[JS/Web] Added uniform to Expand op. (#18558 ) ### Description Added Uniforms to Expand operator kernel ### Motivation and Context Improve performance	2023-11-22 14:14:24 -08:00
Arthur Islamov	1c555c5fc1	[JS/Web] Resize & BiasSplitGelu fp16 support (#18536 ) ### Description Resize and BiasSplitGelu fp16 support on WebGPU	2023-11-22 12:12:07 -08:00
Xavier Dupré	3f0ebd6736	Fix opset import in GemmFloat8 python unit tests (#18489 ) ### Description The unit tests fail if a development version of onnx is used. The opsets are set to 19.	2023-11-22 09:15:24 -08:00
Xavier Dupré	32fabb5555	Fix opset version of the optimizer in function generate_artifacts (#18300 ) ### Description `generate_artifacts` generates 4 graphs for training. All graphs should share the same opset version, the one coming from the model to train, but the optimizer is left undefined. onnxruntime is using the latest version defined by onnx but onnxruntime does not necessarily support it. ### Motivation and Context The code does not let the user change it.	2023-11-22 09:15:11 -08:00
Wanming Lin	89723c8612	[WebNN EP] Mark and fallback unsupported op for WebNN CPU backend (#18472 ) The current WebNN CPU (XNNPack) backend supports only a limited list of ops, so unsupported ops fall back directly for the WebNN "cpu" deviceType. This is a workaround because an op may be included in MLGraphBuilder for the DirectML backend but have no XNNPack implementation in Chromium.	2023-11-22 09:05:30 -08:00
Vincent Wang	3bc9efc7b2	[ORTModule] Adjust Attention Patterns for Efficient Attention ATen Fallback (#18471 ) Adjust attention patterns to match the latest Whisper+exporter. Also add some condition checks and docs.	2023-11-22 15:24:05 +08:00
Adrian Lizarraga	7c573054b6	[QDQ Optimizer] Fix logic that drops Q/DQ ops from QDQ split node groups (#18394 ) ### Description - Fix QDQ optimizer logic that drops Q/DQ ops from Split node groups so that it only occurs when all input/output quantization parameters are equal. - Currently, the selector used for this optimization does not ensure that all quantization parameters are equal. - Support dropping Q/DQ ops from Split node groups with optional split inputs (introduced opset 13). This was not working previously. ### Motivation and Context Fix bugs in handling of QDQ Split node groups. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-11-21 21:31:31 -08:00
Tianlei Wu	62da3b1ca4	SDXL Latent Consistency Model (LCM) optimization (#18526 ) Add support of LCM model (https://huggingface.co/latent-consistency/lcm-sdxl) in SDXL demo. Since the LCM model does not need classifier-free guidance, there is no need to use a negative prompt. The input and output shape is different from the original SDXL model: no need to double the batch dimension. We also save metadata to the image, and update the image filename to include scheduler and steps. #### Latency (milliseconds) of generating 1024x1024 images in A100-SXM4-80GB GPU Engines are built with static input shape, and CUDA graph is enabled. For dynamic shape input, the latency could be slower.
| Batch Size | Pipeline | Steps | ORT_CUDA | ORT_TRT | TRT 8.6 |
| -- | -- | -- | -- | -- | -- |
| 1 | LCM SDXL | 4 | 275 | 249 | 258 |
| 1 | LCM SDXL | 8 | 460 | 423 | 430 |
| 1 | SDXL Base | 30 | 2566 | 2535 | 2569 |
| 4 | LCM SDXL | 4 | 925 | 887 | 1032 |
| 4 | LCM SDXL | 8 | 1539 | 1493 | 1662 |
| 4 | SDXL Base | 30 | 9227 | 9408 | 9678 |
	2023-11-21 21:27:49 -08:00
Yulong Wang	d455b0f8fd	[js/web] use Chrome in CI for npm tests (#18522 ) ### Description Use Chrome in CI for npm tests. Previously we used Edge, but it sometimes crashes for reasons not yet identified.	2023-11-21 18:03:57 -08:00
Jiajia Qin	ac8598a837	[js/webgpu] enable f16 for concat (#18528 ) ### Description With this PR, the `realesrgan-t64-f16` model takes 32.8 ms, down from 1052.55 ms. Now the whole model runs on jsep.	2023-11-21 14:26:00 -08:00
Dmitri Smirnov	81a763a9eb	Make TensorShapeVector use InlinedVector<Int64_t> to reduce template instantiations (#18519 ) ### Description Use InlinedVector<int64> instead of <int64_t,5> to reduce the number of template instantiations. ### Motivation and Context The reported size reduction is small, just a few Ks. Just trying it out.	2023-11-21 14:13:50 -08:00
Abhishek Jindal	680a526e73	Training packaging pipeline for cuda12 (#18524 ) ### Description Build ORT-training packaging pipeline for CUDA 12.2 ### Motivation and Context This will help any customer using CUDA 12 and would not need to build ORT-training from source Test run: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=382993&view=logs&s=130be951-c2f3-5601-5709-434b5e50ddb0	2023-11-21 13:19:21 -08:00
Sheil Kumar	2a01622536	Hide NPU Adapter selection behind macro (#18515 ) Hide NPU Adapter selection behind macro --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-21 08:47:56 -08:00
Xavier Dupré	29a409acaa	Add missing flags DISABLE_FLOAT8_TYPES in GemmFloat8 custom operator for CUDA < 11.8 (#18162 ) ### Description PR #16051 introduced operator GemmFloat8 but the flag DISABLE_FLOAT8_TYPES was missing in a couple of places. This PR addresses that issue, which allows compilation on CUDA < 11.8.	2023-11-21 14:37:48 +01:00
JiCheng	a608c002a3	fix past-kv in general LLM exporter (#18529 ) ### Description For some models, we need to re-run model.forward to get past-kv	2023-11-21 19:04:55 +08:00
Yulong Wang	c7fd930330	[js/web] unify resolve rules for "Clip" (#18527 ) ### Description It was a mistake to use 2 different names for Clip operator in op-resolve-rules.ts for different opset. An optimized implementation can handle both cases (opset < 11 and opset >=11). Remove "ClipV10" as an entry from the table.	2023-11-20 23:18:06 -08:00
Jiajia Qin	abdf8b7c3f	[js/webgpu] Optimize broadcast binary. (#18185 ) ### Description Currently, the binary algorithms are divided into the vectorized one (efficient) and the non-vectorized one (less efficient). The following situations use the vectorized one: 1) A or B's shape length is 1. 2) The shared dimensions length of A and B are divisible by 4. 3) A and B have the same shape. This PR adds another situation that uses the vectorized algorithm: 4) A or B's last dimension is divisible by 4. With this change, the aggregate time of Add in sam-b-encoder becomes 309.65 ms from 409.12 ms on Intel ADL.	2023-11-20 16:52:17 -08:00
Dmitri Smirnov	cc542024ce	Create edges with arg positions correctly accounting for non-existing args (#18462 ) ### Description Truncate trailing non-existing arguments. Make sure we do not skip the non-existing arguments in the middle, because shape inference relies on their proper position. This also affects the argument position in the Edges, which must be properly rebuilt each time an If node branch is inlined. Make sure that when we rename Defs in subgraphs, new renamed defs are created in those subgraphs instead of pointing to outer scope defs. Add unit test. ### Motivation and Context This is a follow up for https://github.com/microsoft/onnxruntime/pull/18105 Currently, the non-trailing arguments are simply ignored and the edges are created with potentially incorrect positions.	2023-11-20 14:49:09 -08:00
Yulong Wang	247ce21859	[js] optimize eslint config (#18460 ) ### Description Optimize eslint config to: - set parserOptions.project to `true` to allow @typescript-eslint/parser to find the nearest tsconfig.json file to that source file. This helps to avoid parsing extra files, which may help to: - reduce the possibility of seeing OOM or stack overflow with "npm run lint" - speed up processing - enforce rule "no-underscore-dangle" with a list of exceptions.	2023-11-20 12:00:56 -08:00
Jian Chen	1dd9bf5340	Remove setup_env_azure.bat (#18482 )	2023-11-20 09:58:15 -08:00
Jambay Kinley	1af0681554	Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484 ) ### Description Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for QLoRA fine-tuning. - On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16 dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16` type which uses float for compute. - I have validated the op in a llama2-7b training scenario. The losses match pytorch training and the training throughput is better. - Cannot add a bfloat16 case in the op unit test since casting BFloat16 to and from float multiple times during the test causes the required tolerances to be unachievable. The custom autograd function exporter in onnxruntime-training is updated to support the latest version of bitsandbytes. They changed how the `quant_state` is stored. ### Motivation and Context Enable QLoRA fine-tuning with bfloat16.	2023-11-20 09:52:58 -08:00
Jian Chen	d97fc1824f	Create a new Python Package pipeline for CUDA 12 (#18348 )	2023-11-20 09:48:28 -08:00
Wei-Sheng Chin	3bcc137eb4	Tiny change to trigger the update of DORT's CI image (#18507 ) Recent PyTorch breaks DORT CI and [a patch](https://github.com/pytorch/pytorch/pull/113697) has been merged into PyTorch main. In order to update DORT's CI, we made a dummy change in this PR.	2023-11-19 22:09:11 -08:00
Changming Sun	dc9ab4f821	Update setup.py: replace libcudart.so.12.0 with libcudart.so.12 (#18501 )	2023-11-19 22:06:32 -08:00
Akshay Sonawane	97cc40d75a	Add fusion patterns for conformer-transducer model (#18461 ) ### Description Add conformer-transducer model type to optimizer. This PR adds pattern matches for attention shown below: Unfused attention: ![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956) Fused attention: ![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)	2023-11-18 23:39:04 -08:00
RandySheriffH	53917a3353	Move up members in Lite Custom Op hierarchy for possible memleaks. (#18478 ) Move data member in LiteOpFunc to its parent to avoid possible mem leaks. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-11-18 15:00:54 -08:00
Changming Sun	9364c05170	Update web-ci.yml: remove depth=1 (#18500 ) ### Description It causes our "NPM Packaging Pipeline" to fail.	2023-11-17 22:49:03 -08:00
Yulong Wang	34c5424456	[js] update a few packages (#18499 ) ### Description [js] update a few packages - update semver - update reference of onnx_proto to local folder in order to upgrade protobufjs@7.2.4 Resolve AB#18513	2023-11-17 22:40:51 -08:00
Ashwini Khade	02333293de	Removed all the deprecated python training code and related tests and utils (#18333 ) ### Description Motivation for this PR is code cleanup. 1. Remove all deprecated python code related to orttrainer, old checkpoint, related tests and utils 2. Cleanup orttraining_pybind_state.cc to remove all deprecated bindings.	2023-11-17 18:19:21 -08:00
Nicolò Lucchesi	cbb85b4874	[CoreML] Adapt to `MLMultiArray.dataPointer` deprecation (#17726 ) ### Description This PR addresses https://github.com/microsoft/onnxruntime/issues/17652. The deprecated `MLMultiArray.dataPointer` is replaced with `.getBytesWithHandler`, as suggested by the docs. For now, I am only checking that the output `MLMultiArray` is contiguous, returning unsupported operation when that is not the case. I think this is already better than what we have right now, so we can block unsafe calls to `.dataPointer` (if any..). I would be happy to implement the handling of the non-contiguous case (replacing `memcpy` for such cases) as suggested by @edgchen1, but I am not sure how to reproduce that case to add a corresponding unit-test. Would we have to define a custom `MLCustomLayer` to get a non-contiguous output from a model..? ### Motivation and Context Fix https://github.com/microsoft/onnxruntime/issues/17652. --------- Co-authored-by: nicolo-lucchesi <nicolo.lucchesi@hexagon.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-17 17:58:49 -08:00
Changming Sun	41f9379f3c	Update NDK version to 26.1.10909125 (#18493 ) ### Description Similar to #17852 ### Motivation and Context To avoid downloading NDK	2023-11-17 14:14:01 -08:00
Arthur Islamov	fac3e33da5	[js/web] JSEP Attention & MultiHeadAttention (#17742 ) ### Description This is a narrow implementation of Attention/MultiHeadAttention as it does not support: a. inputs 5-7 for MHA b. packed QKV/KV c. past/present d. attention mask But it works well for StableDiffusion and can be extended later. It reduces VRAM usage as it combines many ops into few I've updated demo here https://islamov.ai/stable-diffusion-webgpu/ it takes ~13sec for 1 image with 20 steps on RTX3090Ti and about 25s on M1 Pro VRAM usage is about 8gb if you don't use img2img Going to focus on SDXL now --------- Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-11-17 12:23:52 -08:00
Wanming Lin	a5537f2f56	[WebNN Ep] Slice's axes and steps inputs should be constant initializers (#18427 )	2023-11-17 08:01:40 -08:00
kailums	1a29460919	rope support 4D input tensor (#18454 ) ### Description Change the RotaryEmbeddings op implementation to add support for a 4D input tensor with shape [batch, num_heads, seq_len, head_size]. ### Motivation and Context The current RotaryEmbedding op only supports a 3D input tensor with shape [batch, seq_len, hidden_size]. For the llamav2 model, when using FusionRotaryEmbeddings to only fuse the RotaryEmbeddings op, there will be a transpose operation for query and key, and then the input tensor of RotaryEmbeddings becomes 4D [batch, num_heads, seq_len, head_size]. This scenario can't be supported by the current RotaryEmbeddings implementation, so it needs to support a 4D input tensor.	2023-11-17 20:38:15 +08:00
Changming Sun	5eb5056c61	Always run emsdk_env.sh before build.py, even when ccache is disabled (#18477 ) ### Description Always run emsdk_env.sh before build.py, even when ccache is disabled This is a follow up to #18434. That PR didn't handle the case when ccache was disabled.	2023-11-16 21:37:29 -08:00
George Wu	d73073d491	remove full protobuf requirement for tensorrt ep (#18413 ) tensorrt can work with protobuf lite.	2023-11-16 20:44:27 -08:00
Chi Lo	f17b6afe3c	[TensorRT EP] Fix bug for no nodes in subgraph at GetCapability (#18449 ) It's possible that subgraph of the "If" control flow op has no nodes. TRT EP should consider this kind of subgraph is fully supported by TRT. The faster rcnn model mentioned in this issue https://github.com/microsoft/onnxruntime/issues/17434 is the case.	2023-11-16 19:56:05 -08:00
aciddelgado	adb56df2e8	Aciddelgado/gqa local (#18375 ) ### Description Implement preliminary version of local (sliding window) attention. Currently only supported by Flash Attention (sm >= 80, Linux). Currently only supports sliding attention with a large cached kv. ### Motivation and Context This change enables running Mistral and other models which use sliding window attention.	2023-11-16 15:01:06 -08:00
Hector Li	6a4e4488da	[QNN EP] Support Qnn MatMul with 2 dynamic inputs which are uint16 quantized (#18469 ) ### Description QNN can't run MatMul if both inputs are dynamic inputs with uint16 quantized on v68. Make it run by inserting Convert op to convert 1 input to int8	2023-11-16 13:44:15 -08:00
Scott McKay	e7a524fea9	Update to allow large models to be checked for mobile support. (#18357 ) ### Description Update usability checker and related infrastructure to support checking models > 2GB. - Add ability to set flag to keep initializers as external data - we optimize the model as part of the checking so need to write out a new copy. - Handle issue with ONNX shape inferencing silently failing - use API that supports large models but requires writing the model to a new file - automate cleanup of that copy of the model ### Motivation and Context Allow analysis of LLMs to determine gaps for mobile usage. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-17 07:20:16 +10:00
Dmitri Smirnov	b6b9aff608	Allow empty shapes and do not validate them for inputs/outputs (#18442 ) ### Description Allow empty shapes and do not validate them for inputs/outputs at the InferenceSession::ValidateInputsOutputs(). ### Motivation and Context https://github.com/microsoft/onnxruntime/pull/17301 disallowed empty shapes. However, many models depend on them as a way to pass shapes of different ranks.	2023-11-16 13:15:48 -08:00
Chi Lo	3588fbac13	[TensorRT EP] Fix memory leak for cudnn/cublas (#18467 ) Free memory for cudnn/cublas instances at TRT EP destruction. https://github.com/microsoft/onnxruntime/issues/18466	2023-11-16 10:23:08 -08:00
satyajandhyala	b291b20fa0	[JS/Web] Added uniforms support to Slice op. (#18422 ) ### Description Support uniforms in Slice op ### Motivation and Context Improve performance	2023-11-16 09:44:13 -08:00
Wanming Lin	999752a35d	[WebNN EP] Support GreaterOrEqual and LessOrEqual ops (#18411 )	2023-11-16 08:01:58 -08:00
Tianlei Wu	119e86ec16	SDXL demo: Add Option to disable refiner (#18455 ) Add option to disable refiner and only run base model.	2023-11-16 06:43:18 -08:00
zhijiang	16d7f55193	lora conv1d replacement (#16643 ) In LoRA code, conv1d is used to do the projection for qkv. The conv1d calculation is mathematically equivalent to matmul, and matmul is much faster than conv1d. The substitution made by the graph optimizer is: 1 conv1d >> 2 split + 1 squeeze + group_num matmul + 1 concat. With this optimizer, we see 10%+ in one 1P model	2023-11-16 17:08:06 +08:00
guyang3532	751aa8d31a	fix axis of layernorm for UpstreamReshape (#18425 ) Similar to https://github.com/microsoft/onnxruntime/pull/17255 update axis for Layernormalization when Reshape upstream it.	2023-11-16 16:29:00 +08:00
Chi Lo	18a3675bf7	[TensorRT EP] Only instantiate TRT builder once (#18100 ) The TRT builder instantiation is slow (see [here](https://github.com/microsoft/onnxruntime/issues/18071)). In the current TRT EP, we instantiate a builder object every time we need it. Multiple places need the TRT builder, so this causes huge performance overhead.	2023-11-15 23:39:41 -08:00
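
Note on commit 16d7f55193 (lora conv1d replacement): the claim that the conv1d projection is mathematically equivalent to matmul can be checked with a small sketch. This is not the optimizer's code. It assumes a grouped conv1d with kernel size 1 and no bias, and the function names `conv1d_pointwise_grouped`, `matmul`, and `split_matmul_concat` are hypothetical, written in plain Python only to illustrate the split + matmul + concat rewrite.

```python
def conv1d_pointwise_grouped(x, w, groups):
    """Grouped conv1d with kernel size 1 and no bias (assumed setting).

    x: input,  shape [c_in][length]
    w: weight, shape [c_out][c_in // groups] (kernel axis of size 1 squeezed away)
    """
    c_in, length, c_out = len(x), len(x[0]), len(w)
    in_per_g, out_per_g = c_in // groups, c_out // groups
    y = [[0] * length for _ in range(c_out)]
    for o in range(c_out):
        g = o // out_per_g  # group that output channel o belongs to
        for t in range(length):
            y[o][t] = sum(w[o][i] * x[g * in_per_g + i][t] for i in range(in_per_g))
    return y


def matmul(a, b):
    """Plain matrix product of a [m][k] and b [k][n]."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_matmul_concat(x, w, groups):
    """Same result via split (input and weight), one matmul per group, concat."""
    in_per_g, out_per_g = len(x) // groups, len(w) // groups
    y = []
    for g in range(groups):
        x_g = x[g * in_per_g:(g + 1) * in_per_g]    # split input channels
        w_g = w[g * out_per_g:(g + 1) * out_per_g]  # split weight rows
        y.extend(matmul(w_g, x_g))                  # concat along channel axis
    return y


# Tiny integer example: 4 input channels, 6 output channels, 2 groups.
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
w = [[1, 0], [0, 1], [2, 3], [1, 1], [-1, 2], [3, -2]]
assert conv1d_pointwise_grouped(x, w, 2) == split_matmul_concat(x, w, 2)
```

With kernel size 1, each output position depends only on the input channels of its group at the same position, which is exactly one matrix product per group. The commit's actual graph pattern (2 split + 1 squeeze + group_num matmul + 1 concat) operates on ONNX nodes and may differ in detail from this sketch.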


10036 commits