onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-07 04:39:07 +00:00

Author	SHA1	Message	Date
Yulong Wang	d455b0f8fd	[js/web] use Chrome in CI for npm tests (#18522 ) ### Description Use Chrome in CI for npm tests. Previously we used Edge, but it sometimes crashes for reasons not yet identified.	2023-11-21 18:03:57 -08:00
Jiajia Qin	ac8598a837	[js/webgpu] enable f16 for concat (#18528 ) ### Description With this PR, `realesrgan-t64-f16` models take 32.8 ms, down from 1052.55 ms. The whole model now runs on JSEP.	2023-11-21 14:26:00 -08:00
Dmitri Smirnov	81a763a9eb	Make TensorShapeVector to use InlinedVector<Int64_t> to reduce on template instantiations (#18519 ) ### Description Use InlinedVector<int64> instead of <int64_t,5> to reduce the number of template instantiations. ### Motivation and Context The reported size reduction is small, just a few KB. Just trying it out.	2023-11-21 14:13:50 -08:00
Abhishek Jindal	680a526e73	Training packaging pipeline for cuda12 (#18524 ) ### Description Build ORT-training packaging pipeline for CUDA 12.2 ### Motivation and Context This helps customers using CUDA 12, who will no longer need to build ORT-training from source. Test run: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=382993&view=logs&s=130be951-c2f3-5601-5709-434b5e50ddb0	2023-11-21 13:19:21 -08:00
Sheil Kumar	2a01622536	Hide NPU Adapter selection behind macro (#18515 ) Hide NPU Adapter selection behind macro --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-21 08:47:56 -08:00
Xavier Dupré	29a409acaa	Add missing flags DISABLE_FLOAT8_TYPES in GemmFloat8 custom operator for CUDA < 11.8 (#18162 ) ### Description PR #16051 introduced the GemmFloat8 operator, but the DISABLE_FLOAT8_TYPES flag was missing in a couple of places. This PR addresses that issue, which allows compilation on CUDA < 11.8.	2023-11-21 14:37:48 +01:00
JiCheng	a608c002a3	fix past-kv in general LLM exporter (#18529 ) ### Description For some models, we need to re-run model.forward to get past-kv.	2023-11-21 19:04:55 +08:00
Yulong Wang	c7fd930330	[js/web] unify resolve rules for "Clip" (#18527 ) ### Description It was a mistake to use 2 different names for Clip operator in op-resolve-rules.ts for different opset. An optimized implementation can handle both cases (opset < 11 and opset >=11). Remove "ClipV10" as an entry from the table.	2023-11-20 23:18:06 -08:00
Jiajia Qin	abdf8b7c3f	[js/webgpu] Optimize broadcast binary. (#18185 ) ### Description Currently, the binary algorithms are divided into a vectorized one (efficient) and a non-vectorized one (less efficient). The following situations use the vectorized one: 1) A or B's shape length is 1. 2) The shared dimensions length of A and B is divisible by 4. 3) A and B have the same shape. This PR adds another situation that uses the vectorized algorithm: 4) A or B's last dimension is divisible by 4. With this change, the aggregate time of Add in sam-b-encoder drops from 409.12 ms to 309.65 ms on Intel ADL.	2023-11-20 16:52:17 -08:00
Dmitri Smirnov	cc542024ce	Create edges with arg positons correctly accounting for non-existing args (#18462 ) ### Description Truncate trailing non-existing arguments. Make sure we do not skip non-existing arguments in the middle, because shape inference relies on their proper position. This also affects the argument position in the Edges, which must be properly rebuilt each time an If node branch is inlined. Make sure that when we rename Defs in subgraphs, new renamed defs are created in those subgraphs instead of pointing to outer scope defs. Add unit test. ### Motivation and Context This is a follow-up to https://github.com/microsoft/onnxruntime/pull/18105 Currently, the non-trailing arguments are simply ignored and the edges are created with potentially incorrect positions.	2023-11-20 14:49:09 -08:00
Yulong Wang	247ce21859	[js] optimize eslint config (#18460 ) ### Description Optimize the eslint config to: - set parserOptions.project to `true` to allow @typescript-eslint/parser to find the nearest tsconfig.json file to each source file. This helps to avoid parsing extra files, which may help with: - reducing the possibility of seeing OOM or stack overflow with "npm run lint" - faster processing - enforce rule "no-underscore-dangle" with a list of exceptions.	2023-11-20 12:00:56 -08:00
Jian Chen	1dd9bf5340	Remove setup_env_azure.bat (#18482 )	2023-11-20 09:58:15 -08:00
Jambay Kinley	1af0681554	Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484 ) ### Description Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for QLoRA fine-tuning. - On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16 dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16` type which uses float for compute. - I have validated the op in a llama2-7b training scenario. The losses match pytorch training and the training throughput is better. - Cannot add a bfloat16 case in the op unit test since casting BFloat16 to and from float multiple times during the test causes the required tolerances to be unachievable. The custom autograd function exporter in onnxruntime-training is updated to support the latest version of bitsandbytes. They changed how the `quant_state` is stored. ### Motivation and Context Enable QLoRA fine-tuning with bfloat16.	2023-11-20 09:52:58 -08:00
Jian Chen	d97fc1824f	Create a new Python Package pipeline for CUDA 12 (#18348 )	2023-11-20 09:48:28 -08:00
Wei-Sheng Chin	3bcc137eb4	Tiny change to trigger the update of DORT's CI image (#18507 ) Recent PyTorch breaks DORT CI and [a patch](https://github.com/pytorch/pytorch/pull/113697) has been merged into PyTorch main. In order to update DORT's CI, we made dummy change in this PR.	2023-11-19 22:09:11 -08:00
Changming Sun	dc9ab4f821	Update setup.py: replace libcudart.so.12.0 with libcudart.so.12 (#18501 )	2023-11-19 22:06:32 -08:00
Akshay Sonawane	97cc40d75a	Add fusion patterns for conformer-transducer model (#18461 ) ### Description Add conformer-transducer model type to optimizer. This PR adds pattern matches for attention shown below: Unfused attention: ![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956) Fused attention: ![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)	2023-11-18 23:39:04 -08:00
RandySheriffH	53917a3353	Move up members in Lite Custom Op hierarchy for possible memleaks. (#18478 ) Move data member in LiteOpFunc to its parent to avoid possible mem leaks. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-11-18 15:00:54 -08:00
Changming Sun	9364c05170	Update web-ci.yml: remove depth=1 (#18500 ) ### Description It causes our "NPM Packaging Pipeline" to fail.	2023-11-17 22:49:03 -08:00
Yulong Wang	34c5424456	[js] update a few packages (#18499 ) ### Description [js] update a few packages - update semver - update reference of onnx_proto to local folder in order to upgrade protobufjs@7.2.4 Resolve AB#18513	2023-11-17 22:40:51 -08:00
Ashwini Khade	02333293de	Removed all the deprecated python training code and related tests and utils (#18333 ) ### Description Motivation for this PR is code cleanup. 1. Remove all deprecated python code related to orttrainer, old checkpoint, related tests and utils 2. Cleanup orttraining_pybind_state.cc to remove all deprecated bindings.	2023-11-17 18:19:21 -08:00
Nicolò Lucchesi	cbb85b4874	[CoreML] Adapt to `MLMultiArray.dataPointer` deprecation (#17726 ) ### Description This PR addresses https://github.com/microsoft/onnxruntime/issues/17652. The deprecated `MLMultiArray.dataPointer` is replaced with `.getBytesWithHandler`, as suggested by the docs. For now, I am only checking that the output `MLMultiArray` is contiguous, returning unsupported operation when that is not the case. I think this is already better than what we have right now, so we can block unsafe calls to `.dataPointer` (if any..). I would be happy to implement the handling of the non-contiguous case (replacing `memcpy` for such cases) as suggested by @edgchen1, but I am not sure how to reproduce that case to add a corresponding unit-test. Would we have to define a custom `MLCustomLayer` to get a non-contiguous output from a model..? ### Motivation and Context Fix https://github.com/microsoft/onnxruntime/issues/17652. --------- Co-authored-by: nicolo-lucchesi <nicolo.lucchesi@hexagon.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-17 17:58:49 -08:00
Changming Sun	41f9379f3c	Update NDK version to 26.1.10909125 (#18493 ) ### Description Similar to #17852 ### Motivation and Context To avoid downloading NDK	2023-11-17 14:14:01 -08:00
Arthur Islamov	fac3e33da5	[js/web] JSEP Attention & MultiHeadAttention (#17742 ) ### Description This is a narrow implementation of Attention/MultiHeadAttention as it does not support: a. inputs 5-7 for MHA b. packed QKV/KV c. past/present d. attention mask But it works well for StableDiffusion and can be extended later. It reduces VRAM usage as it combines many ops into few I've updated demo here https://islamov.ai/stable-diffusion-webgpu/ it takes ~13sec for 1 image with 20 steps on RTX3090Ti and about 25s on M1 Pro VRAM usage is about 8gb if you don't use img2img Going to focus on SDXL now --------- Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-11-17 12:23:52 -08:00
Wanming Lin	a5537f2f56	[WebNN Ep] Slice's axes and steps inputs should be constant initializers (#18427 )	2023-11-17 08:01:40 -08:00
kailums	1a29460919	rope support 4D input tensor (#18454 ) ### Description Change the RotaryEmbeddings op implementation to add support for a 4D input tensor with shape [batch, num_heads, seq_len, head_size]. ### Motivation and Context The current RotaryEmbedding op only supports a 3D input tensor with shape [batch, seq_len, hidden_size]. For the llamav2 model, when using FusionRotaryEmbeddings to fuse only the RotaryEmbeddings op, there is a transpose operation for query and key, so the input tensor of RotaryEmbeddings becomes 4D [batch, num_heads, seq_len, head_size]. The current RotaryEmbeddings implementation can't support this scenario, so it needs to support a 4D input tensor.	2023-11-17 20:38:15 +08:00
Changming Sun	5eb5056c61	Always run emsdk_env.sh before build.py, even when ccache is disabled (#18477 ) ### Description Always run emsdk_env.sh before build.py, even when ccache is disabled This is a follow up to #18434. That PR didn't handle the case when ccache was disabled.	2023-11-16 21:37:29 -08:00
George Wu	d73073d491	remove full protobuf requirement for tensorrt ep (#18413 ) tensorrt can work with protobuf lite.	2023-11-16 20:44:27 -08:00
Chi Lo	f17b6afe3c	[TensorRT EP] Fix bug for no nodes in subgraph at GetCapability (#18449 ) It's possible that subgraph of the "If" control flow op has no nodes. TRT EP should consider this kind of subgraph is fully supported by TRT. The faster rcnn model mentioned in this issue https://github.com/microsoft/onnxruntime/issues/17434 is the case.	2023-11-16 19:56:05 -08:00
aciddelgado	adb56df2e8	Aciddelgado/gqa local (#18375 ) ### Description Implement preliminary version of local (sliding window) attention. Currently only supported by Flash Attention (sm >= 80, Linux). Currently only supports sliding attention with a large cached kv. ### Motivation and Context This change enables to run Mistral and other models which use sliding window attention.	2023-11-16 15:01:06 -08:00
Hector Li	6a4e4488da	[QNN EP] Support Qnn MatMul with 2 dynamic inputs which are uint16 quantized (#18469 ) ### Description QNN can't run MatMul if both inputs are dynamic inputs with uint16 quantized on v68. Make it run by inserting Convert op to convert 1 input to int8	2023-11-16 13:44:15 -08:00
Scott McKay	e7a524fea9	Update to allow large models to be checked for mobile support. (#18357 ) ### Description Update usability checker and related infrastructure to support checking models > 2GB. - Add ability to set flag to keep initializers as external data - we optimize the model as part of the checking so need to write out a new copy. - Handle issue with ONNX shape inferencing silently failing - use API that supports large models but requires writing the model to a new file - automate cleanup of that copy of the model ### Motivation and Context Allow analysis of LLMs to determine gaps for mobile usage. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-17 07:20:16 +10:00
Dmitri Smirnov	b6b9aff608	Allow empty shapes and do not validate them for inputs/outputs (#18442 ) ### Description Allow empty shapes and do not validate them for inputs/outputs at the InferenceSession::ValidateInputsOutputs(). ### Motivation and Context https://github.com/microsoft/onnxruntime/pull/17301 disallowed empty shapes. However, many models depend on them as a way to pass shapes of different ranks.	2023-11-16 13:15:48 -08:00
Chi Lo	3588fbac13	[TensorRT EP] Fix memory leak for cudnn/cublas (#18467 ) Free memory for cudnn/cublas instances at TRT EP destruction. https://github.com/microsoft/onnxruntime/issues/18466	2023-11-16 10:23:08 -08:00
satyajandhyala	b291b20fa0	[JS/Web]Added uniforms support to Slice op. (#18422 ) ### Description Support uniforms in Slice op ### Motivation and Context Improve performance	2023-11-16 09:44:13 -08:00
Wanming Lin	999752a35d	[WebNN EP] Support GreaterOrEqual and LessOrEqual ops (#18411 )	2023-11-16 08:01:58 -08:00
Tianlei Wu	119e86ec16	SDXL demo: Add Option to disable refiner (#18455 ) Add option to disable refiner and only run base model.	2023-11-16 06:43:18 -08:00
zhijiang	16d7f55193	lora conv1d replacement (#16643 ) In LoRA code, conv1d is used to do the projection for qkv. The conv1d calculation is mathematically equivalent to matmul, and matmul is much faster than conv1d. The graph optimizer substitution is: 1 conv1d >> 2 split + 1 squeeze + group_num matmul + 1 concat. With this optimizer, we see 10%+ in one 1P model	2023-11-16 17:08:06 +08:00
guyang3532	751aa8d31a	fix axis of layernorm for UpstreamReshape (#18425 ) Similar to https://github.com/microsoft/onnxruntime/pull/17255 update axis for Layernormalization when Reshape upstream it.	2023-11-16 16:29:00 +08:00
Chi Lo	18a3675bf7	[TensorRT EP] Only instantiate TRT builder once (#18100 ) The TRT builder instantiation is slow (see [here](https://github.com/microsoft/onnxruntime/issues/18071)). In the current TRT EP, we instantiate a builder object every time we need it. Multiple places need the TRT builder, so this causes huge performance overhead.	2023-11-15 23:39:41 -08:00
Yulong Wang	6f9f653ada	[wasm] increase test max memory from 2G to 4G (#18459 ) ### Description increase max memory from 2G to 4G for onnxruntime_test_all in WebAssembly build.	2023-11-15 17:51:04 -08:00
Dmitri Smirnov	6f863ae2ad	Allow optional axes tensor to be null and ignore it as optional (#18423 ) ### Description Our function inliner converts call nodes to a proto. `Node::ToProto()` function recreates optional NodeArgs into a `NodeProto`. While handling missing input parameters, our inliner simply renames them as empty strings. `Graph::InlineFunctionProto()` recreates missing NodeArgs even though the original call node did not have them. This results in the below mentioned issue. The inlined model has the following entries, notice the second argument is present, but has no value in `ReduceSum` call (from a Dynamo exported model). > InsertedPrecisionFreeCast__inlfunc__aten_linalg_vector_norm_no_dim_onnx_result_12 = ReduceSum <keepdims: int = 0, noop_with_empty_axes: int = 0> (InsertedPrecisionFreeCast__inlfunc_ReduceL1_data_abs, ) We now allow second input to ReduceSum to be nullptr and ignore it as it is optional. ### Motivation and Context This seeks to address https://github.com/microsoft/onnxruntime/issues/18338	2023-11-15 16:09:05 -08:00
Changming Sun	cc840c5289	Fix a bug in SaveInputOutputNamesToNodeMapping function (#18456 ) ### Description Fix a bug in SaveInputOutputNamesToNodeMapping function. The fix was provided by Scott. ### Motivation and Context	2023-11-15 14:51:42 -08:00
Edward Chen	0a4d76d98b	MLAS AArch64 quantized int4 Gemm kernel (#18031 ) - Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs. - Connect MatMulNBits contrib op to MLAS function.	2023-11-15 09:31:54 -08:00
Yulong Wang	586f06f5a1	[js/web] set noUnusedParameters to true and fix a few bugs (#18404 ) ### Description - set tsconfig "noUnusedParameters" to `true` and fix a few bugs discovered by typescript. How unused parameters are fixed: - for most code (webgl), add underscore as prefix, which is the standard ignore pattern for typescript check. - remove unused parameter from function and modify corresponding function calls (jsep) - fix a bug in ArgMinMax: these 2 operators do not have more than one input, so `createArgMinMaxAttributesFromInputs()` is removed. - add proxy main.ts into typescript check and fix a bug in parameter passing - fixed `run()` function call and add typecheck fix (hack)	2023-11-15 09:16:29 -08:00
Vincent Wang	ed89ca573a	[ORTModule] Support User Config for Triton Codegen, Bugfix for Reduce-to-scalar (#18448 ) User can provide Triton codegen config JSON through env variable. Also fix some bugs related to reduction to scalar case.	2023-11-15 17:16:38 +08:00
Vincent Wang	b0699d901c	Support Graph Input and Initializer for GatherToSplit Fusion (#18412 ) Support graph input and initializer for GatherToSplit fusion. Previously the fusion requires Gather nodes consume some other node which cannot be graph input or initializer. This helps some model training with such case so that we will not have GatherGrad in the final graph. GatherGrad is super inefficient in kernel implementation.	2023-11-15 13:46:38 +08:00
Tianlei Wu	d738ff16ec	SDXL demo: consistent opt shape and seed (#18445 ) ### Description A few refinements: (1) Use fixed optimized shape for dynamic engine of TRT. (2) Use same seed in base and refiner. (3) Save metadata to png file so that it is easy to reproduce. (4) Disable EulerA scheduler for XL since it has issue in refiner with 1.16.2. (5) Limit height and width to be divisible by 64. (6) Update document to add a link of downloading optimized model. --------- Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>	2023-11-14 20:24:32 -08:00
Jian Chen	05526b354b	Adding new yaml file for downloading cuda, and trt from azure blob (#18443 ) This also sets the Path variable for the downloaded libraries.	2023-11-14 19:47:39 -08:00
Ye Wang	f9af94009b	onboard MoE (#18279 ) ### Description 1. Introduce MoE CUDA op to ORT based on FT implementation. 2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows. Remove patch file for cutlass 3.0.0. 3. Sharded MoE implementation will come with another PR. Limitation: __CUDA_ARCH__ >= 700	2023-11-14 16:48:51 -08:00
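
Commit cc542024ce above describes truncating trailing non-existing arguments while keeping non-existing arguments in the middle, so that the remaining arguments keep their positions. The following is only a minimal Python sketch of that idea, not the onnxruntime implementation (which is C++). It assumes a missing argument is represented by an empty string, as ONNX does for omitted optional inputs; the names `MISSING`, `trim_trailing_missing`, and `edge_positions` are hypothetical.

```python
from typing import List, Tuple

# Assumption: an omitted optional argument is represented by an empty name.
MISSING = ""


def trim_trailing_missing(arg_names: List[str]) -> List[str]:
    """Drop missing arguments at the end; keep missing ones in the middle."""
    end = len(arg_names)
    while end > 0 and arg_names[end - 1] == MISSING:
        end -= 1
    return arg_names[:end]


def edge_positions(arg_names: List[str]) -> List[Tuple[int, str]]:
    """Pair each existing argument with its original slot index.

    Middle gaps are not compacted, so a later argument keeps the slot
    index it had in the original argument list.
    """
    trimmed = trim_trailing_missing(arg_names)
    return [(i, name) for i, name in enumerate(trimmed) if name != MISSING]


if __name__ == "__main__":
    args = ["data", "", "axes", "", ""]
    print(trim_trailing_missing(args))  # ['data', '', 'axes']
    print(edge_positions(args))         # [(0, 'data'), (2, 'axes')]
```

In this sketch `axes` stays at slot 2 even though slot 1 is empty. Compacting the list instead would move it to slot 1, which is the kind of incorrect position the commit message warns about.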

10026 commits