Commit graph

618 commits

Author SHA1 Message Date
Prathik Rao
7a3da4526f
add bfloat16 support for CUDA Neg kernel (#18306)
### Description

Registers BFloat16 datatype as valid input type for CUDA Neg Kernel.

### Motivation and Context

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-08 18:32:12 -08:00
pengwa
2151c79bf1
Tune ORTModule logging experience a bit (#18298)
### Tune logging experience a bit

Since the last update to the ORTModule logging experience, we found a few
issues:
1. The `INFO` level outputs too many things, including PyTorch exporter
verbose logs (tracing graphs) on every rank. On this level, we only
want to:
- Output a little more information to users than the `WARNING` level,
for example the memory recomputation recommendations or other
not-fully-ready features.
- Output a little more information for a quick diagnostic, collected
on rank-0 only.
2. The ONNX Runtime logging filter during graph build and session init
sometimes hides issues (for example a segment fault), leaving no
useful information in `WARNING`/`INFO` for users to report to us. This
is not good!
3. Some of our devs like using `pdb` to debug Python code, but adding
`import pdb; pdb.set_trace()` in the model's code can hang when they use
`INFO` or `WARNING`, because the export happens and all output gets
redirected due to log filtering. The only workaround is to switch to
`VERBOSE`, which outputs far too many logs.

The corresponding changes proposed here are:
1. For `INFO` logging:
    - We only log on rank-0.
    - We restrict the ORT backend logging level to WARNING in this
case, because the ORT backend outputs way too many logs that should be
under verbose, and we cannot guarantee they get cleaned up
immediately once they are added.
    - We output the PyTorch exporter verbose log (including the tracing graph),
which is useful for a quick diagnostic when an issue happens.
2. Remove all logging filtering on the ORT backend, so segment fault
details will not be hidden if the issue happens again.
3. Introduce a `DEVINFO` logging level:
    - Logs on all ranks.
    - Sets the ORT backend logging level to INFO.
    - Turns off all PyTorch exporter logging filtering (to unblock
pdb debugging).
4. Currently, using Memory Optimizer requires DEVINFO (which will
output the ORT backend INFO log), so the memory optimizer document is updated to
reflect this. https://github.com/microsoft/onnxruntime/pull/17481 will
move the requirement back to INFO for showing memory optimization info.

You can check
https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations
for a better view of different log levels.
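For reference, the log level is chosen through `DebugOptions` when wrapping the model. A minimal sketch (assuming `DebugOptions` and `LogLevel` are importable from `onnxruntime.training.ortmodule`, and that a `DEVINFO` member matches the name introduced in this PR):

```python
import torch
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

model = torch.nn.Linear(4, 4)

# WARNING (default): quiet. INFO: rank-0 diagnostics, ORT backend capped at WARNING.
# DEVINFO: all ranks, ORT backend INFO, exporter log filtering fully off.
ort_model = ORTModule(model, DebugOptions(log_level=LogLevel.INFO))
```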

This PR also extracts some changes from a bigger one,
https://github.com/microsoft/onnxruntime/pull/17481, to reduce its
complexity for review.

---------

Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
2023-11-08 17:42:50 +08:00
aciddelgado
3dece27f51
GQA Flash Attention with Attention Mask (#18283)
### Description
GQA now works only with Flash Attention and takes an attention mask input,
allowing for batched input. Note: this PR disables Memory Efficient
Attention, so only the Flash Attention kernel can be used.



### Motivation and Context
Allows GQA to work with batched input.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2023-11-07 17:47:51 -08:00
liqun Fu
6127dd1d2d
implement gridsample 20 (#17744) 2023-11-07 10:42:41 -08:00
Patrice Vignola
800ae7742c
[DML EP] Add RotaryEmbedding (#18158)
This is a graph implementation of RotaryEmbedding since there's no time
to add it to DML before 1.16.2, but it eventually should move into
DirectML since we're bandwidth-bound.
2023-11-07 08:26:11 -08:00
Prathik Rao
8978bdc59d
add bfloat16 support for where operator (#18118)
### Description

Adds bfloat16 as a valid input type for the Where node for ONNX
opset 16+.

### Motivation and Context

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-02 12:23:20 -07:00
pengwa
c8e1038eab
Optimize 4bit Qlora training (#18131)
### Optimize 4bit Qlora training

Extend the existing `MatmulBnb4bit` to cover training scenarios.

The PR includes the following changes:
1. Add special `torch.autograd.Function` export logic for
`bitsandbytes.autograd._functions.MatMul4Bit` that takes precedence over the
common PythonOp exporter.
2. Add an optional `training_mode` attribute for op `MatmulBnb4bit`, which
helps skip some inference-specific logic in the implementation.
3. Add an optional `transB` attribute, which defaults to 1; setting it
to 0 is needed for backward usage.

Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly ~2.9%
throughput gains. The reason is that
`bitsandbytes.autograd._functions.MatMul4Bit` calls
`ctx.save_for_backward`, which would need an additional copy under
PythonOp; otherwise the tensor might be released by ORT while the backward
op still references it.

Removing the clones also reduces peak memory consumption, because
`bitsandbytes.autograd._functions.MatMul4Bit` saves tensors that are not
needed in the backward compute.
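For context, the `ctx.save_for_backward` pattern that forces the extra copy under `PythonOp` looks roughly like the following simplified stand-in (illustrative only, not the actual `bitsandbytes` implementation):

```python
import torch

class MatMul4BitLike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, B):
        # bitsandbytes saves tensors here for backward; when run as a PythonOp,
        # ORT may free these buffers unless an extra copy (clone) is made.
        ctx.save_for_backward(A, B)
        return A @ B

    @staticmethod
    def backward(ctx, grad_out):
        A, B = ctx.saved_tensors
        return grad_out @ B.t(), A.t() @ grad_out

A = torch.randn(2, 3, requires_grad=True)
B = torch.randn(3, 4, requires_grad=True)
MatMul4BitLike.apply(A, B).sum().backward()
```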
2023-11-02 09:46:11 -07:00
aciddelgado
178f7caaeb
GQA Memory Efficient Kernel (#17920)
Implement the Cutlass Memory Efficient Attention kernel in the Group Query
Attention operator.

### Motivation and Context
Before this change, the Group Query Attention operator was supported only by
Flash Attention. While that is the most efficient kernel for the
operation, it only supports sm >= 80. The Cutlass Memory Efficient Attention
kernel supports sm >= 53, allowing us to support a broader range of GPU
hardware.
2023-11-01 20:04:22 -07:00
Preetha Veeramalai
d87216bcb1
Openvino ep ort 23.1 (#17911)
### Description
Integration with OpenVINO 2023.1.


### Motivation and Context

- Alignment with the latest OpenVINO version.
- Change the device name from VPUX to NPU and remove it from the supported list
until official public support is available.

---------

Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
2023-11-01 08:39:39 -07:00
Tianlei Wu
95f053c652
[CUDA] Update GroupNorm and Add SkipGroupNorm (#18091)
* Add a new operator SkipGroupNorm to support skip and bias inputs.
* Update GroupNorm kernel to support the number of channels used in the SD XL refiner.
* Add epsilon in kernel
* Add parity and performance test script
* Remove many limitations including max batch size, max number of groups, c % cPerBlock == 0, etc.

### Motivation and Context

Update GroupNorm to support SD XL Refiner and beyond.
2023-10-31 10:27:20 -07:00
Xavier Dupré
b5f242e978
GemmFloat8 as a contrib ops (#16051)
### Description
Add support for Gemm with float 8 as a contrib op.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-10-27 14:33:55 +02:00
Tang, Cheng
37873be86d
enable reduce ops on opset18 (#18053)
### Description
Opset 18 applies the "axes as input" change from ReduceSum to all the
other reduce ops. Our CUDA kernel actually supports it, but we didn't
enable it for opset 18. This PR updates the reduce ops' kernel
registration to enable the "axes as input" behavior for opset 18.

As part of the fix, I also simplified the reduce op kernel registration.
ORT doesn't require the kernel definition to be exactly the
same as the ONNX op definition. In our case, where we share the same kernel
for all the reduce ops (from version 1 to version 18), we don't need to
maintain different versions of kernel definitions; we can simplify it by
just using a single kernel definition for multiple versions. Although
in some cases we might register more types for legacy versions, that is
harmless: the framework uses the schema to validate the graph, not the kernel
definition.
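To illustrate the opset-18 behavior, `axes` becomes an optional input instead of an attribute. A minimal sketch with `onnx.helper` (node and tensor names are arbitrary):

```python
import numpy as np
import onnx
import onnxruntime as ort
from onnx import TensorProto, helper

# Opset 18: axes is the optional second input of ReduceMax, not an attribute.
node = helper.make_node("ReduceMax", ["data", "axes"], ["reduced"], keepdims=1)
graph = helper.make_graph(
    [node],
    "reduce_axes_as_input",
    [helper.make_tensor_value_info("data", TensorProto.FLOAT, [2, 3])],
    [helper.make_tensor_value_info("reduced", TensorProto.FLOAT, None)],
    initializer=[helper.make_tensor("axes", TensorProto.INT64, [1], [1])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 18)])

sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
print(sess.run(None, {"data": np.arange(6, dtype=np.float32).reshape(2, 3)}))
```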

---------

Co-authored-by: Cheng Tang <chenta@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2023-10-26 16:57:21 -07:00
Jambay Kinley
d30d4d372a
Add MatMul FP4 and NF4 Support (#18066)
### Description
Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to
support quantization on weights.

This PR adds:
- a schema for contrib op MatMulBnb4, which can support FP4 (4-bit floating
point) and NF4 (4-bit NormalFloat) quantization on weights.
- a naive implementation for MatMulBnb4 on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)) (see the sketch after this list).
- a special GemV implementation for MatMulBnb4 and a related benchmark
tool.
- a tool to quantize a model to FP4 or NF4.
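The naive MatMul(A, Dequantize(B)) path can be pictured with a small NumPy sketch for NF4, where B stores one 4-bit code per weight plus one absmax scale per block (the lookup-table values are approximate and the block layout is illustrative, not the operator's exact packing):

```python
import numpy as np

# 16 NF4 quantization levels (approximate normalized NormalFloat values in [-1, 1]).
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.723, 1.0])

def dequantize_nf4(codes, absmax, block_size):
    # codes: one 4-bit index per weight; absmax: one scale per block
    w = NF4[codes]
    return (w.reshape(-1, block_size) * absmax[:, None]).reshape(codes.shape)

def matmul_bnb4_reference(A, codes, absmax, block_size, K, N):
    B = dequantize_nf4(codes, absmax, block_size).reshape(K, N)
    return A @ B

A = np.random.randn(2, 8).astype(np.float32)
codes = np.random.randint(0, 16, size=8 * 4)                   # K*N = 8*4 weights
absmax = np.random.rand(codes.size // 4).astype(np.float32)    # block_size = 4
print(matmul_bnb4_reference(A, codes, absmax, 4, 8, 4).shape)  # (2, 4)
```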
2023-10-25 15:34:58 -07:00
liqun Fu
706e13e0c9
implement affinegrid cpu kernel (#17777) 2023-10-25 10:46:04 -07:00
liqun Fu
efa0cc2562
implement isinf20 and isnan20 (#17874) 2023-10-24 10:58:54 -07:00
kunal-vaishnavi
2a17d5cf32
LLaMA Model Optimization (#18021)
### Description
This PR contains fusion-level and kernel-level optimizations for [Meta's
LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

Some of the added optimizations include:

- SimplifiedLayerNorm changes
  - Fusions for multiple variants
- SkipSimplifiedLayerNorm changes
  - Kernel support for CPU
- Rotary embeddings (previously did not exist)
  - Fusions for multiple variants
  - CPU and CUDA kernels
  - Supports interleaving and non-interleaving in the same kernels
  - Optimized cache that requires half of its originally exported sizes
- Reduced from `(max_sequence_length, head_size)` to
`(max_sequence_length, head_size / 2)`
- Multi-head attention
  - Support for 2D and 3D attention masks
- Group query attention (for FP16 CUDA and INT4 CUDA)
  - Integration with flash attention v2 and past-present buffer sharing
- Removes need for `attention_mask` input as it is supported in the
kernel
- 4 bit quantization
  - `block_size` parameter is available for customizing
- Support the new changes for [Microsoft
version](https://github.com/microsoft/Llama-2-Onnx)
- Support combinations of the below variants (ex: export ORT version and
run with Optimum)

Supported variants of LLaMA-2 include:
- [ORT
version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
- Produces one ONNX file that is already optimized (and quantized if
requested)
  - Integrates with Optimum
- [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
  - Already exported and available off-the-shelf
  - Faster versions of those models will be uploaded there soon
- [Hugging Face version](https://huggingface.co/meta-llama)
  - Models that end with `-hf`
- Some older and current versions of
[`transformers`](https://github.com/huggingface/transformers) and
[`optimum`](https://github.com/huggingface/optimum) that export the
model to ONNX differently
- Note that while some older versions are supported, it is recommended
to use the latest package versions.

### Usage

To use the optimizations, please see `README.md` for details. Please
note the various `requirements.txt` files for the package versions
recommended in order to use these changes.

To run the ORT transformer optimizer separately, run the script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/14997
- https://github.com/microsoft/onnxruntime/issues/16254
- https://github.com/microsoft/onnxruntime/issues/17681
- https://github.com/microsoft/onnxruntime/issues/17925
- https://github.com/microsoft/onnxruntime-inference-examples/issues/320

This PR uses changes from the following PRs:
- https://github.com/pytorch/pytorch/pull/104468
- https://github.com/pytorch/pytorch/pull/109759
- https://github.com/microsoft/onnxruntime/pull/17020
- https://github.com/microsoft/onnxruntime/pull/17674
- https://github.com/microsoft/onnxruntime/pull/17890
- https://github.com/microsoft/onnxruntime/pull/17920
- https://github.com/huggingface/transformers/pull/26162
- https://github.com/huggingface/optimum/pull/1257
- https://github.com/huggingface/optimum/pull/1289
- https://github.com/huggingface/optimum/pull/1462

### New TorchDynamo Exporter (experimental stage)

This PR uses changes from the following issues and PRs to begin
supporting the [new TorchDynamo
exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):
- https://github.com/huggingface/transformers/pull/26307
- https://github.com/pytorch/pytorch/issues/104903
- https://github.com/pytorch/pytorch/pull/105040
- https://github.com/microsoft/onnxscript/pull/847
- https://github.com/microsoft/onnxscript/pull/862
- https://github.com/microsoft/onnxscript/issues/493
2023-10-23 13:00:56 -07:00
Yufeng Li
11af34440a
Add MatMul 4bits support on GPU (#17890)
### Description
Add a contrib op MatMulNBits and related toolchain to support
quantization on weights. This PR only adds support for 4 bits. It adds:

- a schema for contrib op MatMulNBits, which can support 1-7 bit
quantization on weights.
- a naive implementation of 4-bit MatMulNBits on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special GemV implementation for 4-bit MatMulNBits and a related
benchmark tool.
- a tool to quantize a model with 4 bits (example usage sketched below).

Next:
- add general and more efficient kernels for 4-bit MatMulNBits on CPU
and GPU.
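For reference, invoking the quantization tool might look like the sketch below (hedged: the `MatMul4BitsQuantizer` class name, module path, and arguments are my assumption of the tool added here and may differ from the shipped API):

```python
import onnx
# Assumed module/class names; check the tool added in this PR for the exact API.
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("model_fp32.onnx")
quant = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quant.process()  # rewrites eligible MatMul nodes to MatMulNBits with 4-bit weights
quant.model.save_model_to_file("model_int4.onnx", use_external_data_format=True)
```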
2023-10-13 16:55:30 -07:00
Zhang Lei
762703e037
Support output cross qk, dtw and more for whisper model (#17500)
Support cross qk in beam search for the whisper model and related features.
Make the whisper exporting tools support cross qk and some related features:
* extra_decoding_ids
* no_speech_prob

Implement the DTW kernel and an unfold tensor kernel with unit tests. Several fixes
related to running multiple sessions in parallel, such as:

* guard multihead_attention, fused_fp16_runner_
* some memory allocation with stream awareness
* add use_ep_level_unified_stream option
2023-10-13 11:47:15 -07:00
pengwa
63dc5dc1a9
Add document for PythonOp (#17888)
### Add document for PythonOp



https://github.com/microsoft/onnxruntime/blob/pengwa/pythonop_doc/docs/ORTModule_PythonOp_Notes.md



2023-10-12 08:36:22 +08:00
aciddelgado
406cd324e0
[CUDA] GroupQueryAttention operator using FlashAttention (#17674)
### Description
Added the Group Query Attention op, supporting a number of Q heads that is an
integer multiple of the number of KV heads. As of now, this op can only use the
FlashAttention kernel, meaning it only supports sm>=80 on Linux.

Results from onnxruntime/test/python/transformers/benchmark_gqa.py show
an on-average ~37% speed-up over Decoder Masked Multi-Head Attention,
with even greater improvements for long past sequence lengths.

```
op      batch   s_kv    heads   h_dim   ms      TFLOPS
gqa     16      2048    8       32      0.34    0.10
dmmha   16      2048    8       32      0.39    0.09
---------
gqa     16      2048    8       64      0.45    0.15
dmmha   16      2048    8       64      0.61    0.11
---------
gqa     16      2048    8       128     0.54    0.25
dmmha   16      2048    8       128     0.83    0.16
---------
gqa     16      2048    16      32      0.45    0.15
dmmha   16      2048    16      32      0.69    0.10
---------
gqa     16      2048    16      64      0.69    0.19
dmmha   16      2048    16      64      0.83    0.16
---------
gqa     16      2048    16      128     0.71    0.38
dmmha   16      2048    16      128     1.28    0.21
---------
gqa     16      2048    32      32      0.58    0.23
dmmha   16      2048    32      32      0.77    0.17
---------
gqa     16      2048    32      64      0.58    0.46
dmmha   16      2048    32      64      1.25    0.21
---------
gqa     16      2048    32      128     0.76    0.71
dmmha   16      2048    32      128     2.15    0.25
---------
gqa     16      2048    64      32      0.68    0.39
dmmha   16      2048    64      32      1.23    0.22
---------
gqa     16      2048    64      64      0.77    0.70
dmmha   16      2048    64      64      2.11    0.25
---------
gqa     16      2048    64      128     1.10    0.97
dmmha   16      2048    64      128     4.06    0.26
---------
gqa     16      2048    128     32      1.00    0.54
dmmha   16      2048    128     32      2.09    0.26
---------
gqa     16      2048    128     64      1.10    0.97
dmmha   16      2048    128     64      4.08    0.26
```


### Motivation and Context
As of now, this op is targeted for use on LLaMA models, as it supports
kv-caching and a different number of heads for Q and KV (Grouped Query
Attention). We plan to add support for more platforms, input formats,
etc. in the future.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-10-09 12:43:12 -07:00
kyoshisuki
ba72bb6f98
Fix a typo in ABI_Dev_Notes.md (#17832) 2023-10-09 07:51:34 -07:00
Hector Li
385fab5bae
[QNN EP] Qnn cache improvement (#17757)
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a QNN context binary file with the metadata as a header, we
dump an ONNX format file with the metadata inside an ONNX node.

### Motivation and Context
Reduce the memory overhead and initialization time overhead.
2023-10-06 15:56:33 -07:00
liqun Fu
2be4dc6d04
ONNX 1.15 integration (#17125)
### Description
This is for ORT 1.17.0 - make ORT use the ONNX release 1.15.0 branch. This will eventually be updated to the release tag once ONNX 1.15.0 is released.


### Motivation and Context
Prepare for the ORT 1.17.0 release. People can start working on new and updated ONNX ops in ORT.
---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2023-09-26 14:44:48 -07:00
Nicolò Lucchesi
4ab0e17fe8
[Technical docs] Fixed a couple of old links in FAQ.md (#17415)
### Description
Updated a couple of old links in the technical documentation that were
pointing to files present prior to the migration to
https://onnxruntime.ai/docs.
2023-09-26 13:38:24 -07:00
Vincent Wang
e6301eee6a
Bump Up Version to 1.17.0 (#17587)
Bump up the version to 1.17.0 as the 1.16.0 release branch has been branched
out.
2023-09-20 11:02:58 +08:00
Adrian Lizarraga
dea425e7c1
[QNN/CPU EP] Add 16-bit Quantize/Dequantize contrib ops (#17015)
### Description
- Adds 16-bit integer support to:
- Quantization kernel implementations: Intel, Neon, and Power intrinsics
  - DequantizeLinear and QuantizeLinear contrib ops
  - QNN EP Quantize and Dequantize operators
  - Python quantization scripts
- Disables QDQ fusions for most 16-bit QDQ node groups (need to add
16-bit support to QLinear* ops)
- Retains support for dropping QDQ nodes from Split, Gather, Reshape,
Transpose, Squeeze, and Unsqueeze node groups.

Sample python code to generate QDQ model with 16-bit activations and
8-bit weights:
```python
    quantize_static(
        input_model_path,
        output_model_path,
        data_reader,
        quant_format=args.quant_format,
        per_channel=args.per_channel,
        activation_type=QuantType.QUInt16,
        weight_type=QuantType.QUInt8,
        extra_options={"DedicatedQDQPair": True, "ForceQuantizeNoInputCheck": True, "UseQDQContribOps": True},
    )
``` 

Note that enabling the `UseQDQContribOps` extra option is not strictly
necessary. If the 16-bit types are used without enabling
`UseQDQContribOps`, the QDQ ops' domains are overridden to
'com.microsoft', and a warning is printed to stdout.

### Automated Tests
MLAS/CPU EP:
- [x] 16-bit QuantizeLinear computation
- [x] 16-bit DequantizeLinear computation

Optimizer:
- [x] Transpose QDQ fusion
- [x] Gather QDQ fusion
- [x] Reshape QDQ fusion
- [x] Squeeze QDQ fusion
- [x] Unsqueeze QDQ fusion
- [x] Split drop QDQ
- [x] DoubleQDQPairRemover 
- [x] Transpose optimization
- [x] EnsureUniqueDQForNodeUnit
- [x] Common subexpression elimination (DQ not removed)
- [x] Constant folding

QNN EP:
- [x] Conv 16-bit activations, 8-bit weights
- [x] MatMul 16-bit activations, 8-bit weights
- [x] Unary 16-bit QDQ ops
- [x] Binary 16-bit QDQ ops

Quantization tool:
- [x] Test creation of 16-bit QDQ model
### Motivation and Context
Support mixed precision (8bit weights, 16bit activations) models.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-09-18 09:43:34 -07:00
Nat Kershaw (MSFT)
a2fba28f6c
Remove extraneous javascript includes (#17558) 2023-09-14 20:43:24 -07:00
Nat Kershaw (MSFT)
bbcf4b45dc
Upgrade doxygen to 1.9.8 (#17525) 2023-09-12 20:44:27 -07:00
Baiju Meswani
5d2c57363f
Sign CUDA Kernel (#17293) 2023-08-28 21:03:58 -07:00
Adrian Lizarraga
5a83a67f32
Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127)
### Description
- Enables int32 support for com.microsoft.DequantizeLinear (contrib op)
- Makes the `zero_point` input optional for Quantize/Dequantize contrib
ops
- Enables QDQ transformations with the Quantize/Dequantize contrib ops
- Update tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests,
TransposeOptimizerTests

### Testing
List of tested graph transformations:
- [x] QDQSelectorActionTransformer
  - qdq_transformer_test.cc
- [x] QDQS8ToU8Transformer
  - qdq_transformer_test.cc
- [x] DoubleQDQPairsRemover
  - qdq_transformer_test.cc
- [x] IdenticalChildrenConsolidation
  - qdq_transformer_test.cc
- [x] QDQPropagation
  - qdq_transformer_test.cc
- [x] QDQFinalCleanup
  - qdq_transformer_test.cc
- [x] CliQuantFusion
  - qdq_transformer_test.cc
- [x] ReluQuantFusion
  - qdq_transformer_test.cc
- [x] EnsureUniqueDQForNodeUnit 
  - ensure_unique_dq_for_node_unit_test.cc
- [x] TransposeOptimizer 
  - transpose_optimizer_test.cc
- [x] CommonSubexpressionElimination
  - graph_transform_test.cc
- [x] ConstantFolding
  - graph_transform_test.cc

### Motivation and Context
We need to [support mixed 16-bit/8-bit precision QDQ
models](https://github.com/microsoft/onnxruntime/pull/17015). This PR is
the first step in achieving this goal: we need to make QDQ contrib ops
work with our optimizations/transformations.
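For illustration, a contrib-domain DequantizeLinear with the `zero_point` input omitted can be built like this (a minimal sketch with `onnx.helper`; names and shapes are arbitrary):

```python
import onnx
from onnx import TensorProto, helper

# DequantizeLinear in the com.microsoft domain, int32 input, optional zero_point omitted.
dq = helper.make_node(
    "DequantizeLinear", ["x_q", "x_scale"], ["x_fp32"], domain="com.microsoft"
)
graph = helper.make_graph(
    [dq],
    "contrib_dq",
    [helper.make_tensor_value_info("x_q", TensorProto.INT32, [4]),
     helper.make_tensor_value_info("x_scale", TensorProto.FLOAT, [])],
    [helper.make_tensor_value_info("x_fp32", TensorProto.FLOAT, [4])],
)
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 19), helper.make_opsetid("com.microsoft", 1)],
)
onnx.save(model, "contrib_dq.onnx")
```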

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-08-25 09:57:51 -07:00
pengwa
d90afc697b
Introduce ZeROOffloadSubscriber for ORTModule (#17006)
### Introduce ZeROOffloadSubscriber for ORTModule

As part of the work to integrate ORTModule with DeepSpeed stage 3, this PR
mainly focuses on moving the original PyTorch-based (hook-leveraging) param
partition/offload implementation to an ORTModule-compatible implementation.

Changes include:
1. Refactor `SubscriberBase`/`SubcriberManager` to support
pre-forward/post-forward hooks.
2. Implement a new `ZeROOffloadSubscriber` by re-using the DeepSpeed hook
functions as much as possible. All hook functions are defined in
`DeepSpeedZeRoOffload._register_hooks_recursively` and
`DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and the good thing is that
the closure is not complex: all hooks reference the owning
`DeepSpeedZeRoOffload` instance, so we can create a new hook function with
`FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance,
then call the newly created function in the subscriber's
`pre_forward_module_apply_impl` and `post_forward_module_apply_impl`
interfaces (see the toy sketch after this description).
3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to
register the `ZeROOffloadSubscriber` for the model, so we don't need to
change any code in the DeepSpeed repo (at least so far).
4. Fix the ATen embedding custom symbolic exporter function by
tolerating a weight size of (0) (changed by DeepSpeed zero stage 3).

UT will be added once stage 3 is fully supported.
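The `FunctionType` trick mentioned in point 2 can be pictured with a toy sketch (illustrative only; these are not DeepSpeed's actual hook functions):

```python
import types

class Offloader:
    def __init__(self, name):
        self.name = name

def make_hooks(owner):
    # DeepSpeed defines its hooks as closures that reference the owning offload instance.
    def pre_forward_hook(module, inputs):
        print(f"{owner.name}: gathering params for {module}")
    return pre_forward_hook

original_hook = make_hooks(Offloader("zero-stage3"))

# Rebuild the hook as a standalone function; the closure referencing `owner` carries over,
# so the copy can be invoked from the subscriber's pre_forward_module_apply_impl.
standalone_hook = types.FunctionType(
    original_hook.__code__,
    original_hook.__globals__,
    name="pre_forward_module_apply_impl",
    argdefs=original_hook.__defaults__,
    closure=original_hook.__closure__,
)

standalone_hook("linear_layer", inputs=None)
```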

2023-08-25 00:15:22 +08:00
Emmanuel Ferdman
08ca624d2b
Fix: update hyperlinks to the Jupyter notebooks (#16145)
### Description

This PR fixes broken hyperlinks in the documentation that should lead
users to Jupyter notebooks. Currently, the hyperlinks are not working as
intended. The PR resolves this issue by updating the hyperlinks to
correctly direct users to the Jupyter notebooks.


### Motivation and Context

It fixes broken hyperlinks leading to the Jupyter notebooks.
2023-08-21 09:53:05 -07:00
Wenbing Li
d052c8a45c
Remove the extensions submodule (#17097)
### Description
Remove the onnxruntime-extensions submodule since it is now used via
CMake FetchContent.


### Motivation and Context
The submodule relies on an outdated version of the extensions, and the
build instructions should be updated to eliminate any confusion.
2023-08-14 10:16:33 -07:00
liqun Fu
6697635b91
To support size opset 19 (#15689) 2023-08-11 14:48:53 -07:00
sfatimar
2c5d4dce77
Openvino ep ort 5.1 (#17042)
OpenVINO EP ORT 5.1 branch.
Changes for the new API to take in OpenVINO Provider Options,
and for compatibility with OV 2023.1.


### Motivation and Context
The change is required for the new API to take in OpenVINO Provider
Options
and make it seamless.
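With the new API, provider options are passed as a dictionary alongside the provider name. A minimal sketch (assuming the `device_type` and `num_of_threads` keys; the exact options recognized by this OpenVINO EP release may differ):

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("OpenVINOExecutionProvider", {"device_type": "CPU_FP32", "num_of_threads": 4}),
        "CPUExecutionProvider",
    ],
)
```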

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-08-09 11:50:10 -07:00
pengwa
6e6f582e08
Use full qualified name for PythonOp export (#17021)
### Use full qualified name for PythonOp export

Originally, when there are identically named torch.autograd.Function classes in
different modules, for example:

`a.b.c.Gelu` vs. `d.e.func.<locals>.Gelu`

we by default throw an exception to let the user know we cannot
distinguish the two Gelus, because during model export we did not capture the module
path. The workaround was the
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` variable, introduced to ignore a duplicated-name
Gelu that is not used by the model run. This obviously has limitations, for
example if both Gelus are used in training.

This PR finds a way to construct a fully qualified name.

`def _export_pt_1_10(g, n, *args, **kwargs):`

1. In the exporter function, kwargs contains `name` and `module`; in the
above example:
   `a.b.c.Gelu`  --> name: `Gelu`, module: `a.b.c`
   `d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`

Using name and module is not enough to get a fully qualified name. In
the second case, `d.e` is the module path, inside which there is a
function called `func`, and in this function there is a local
torch.autograd.Function named `Gelu` (many of our UTs look like this). We
can only get `d.e.Gelu`, but this is not the correct fully qualified name.

The reason for this: `kwargs[name]` or `n.name` only returns the class's
name, not the class's fully qualified name. (Note that `kwargs[module]` is
correct.)

2. `n` is a torch.Node; we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `._self` to
get the torch.autograd.Function class. From it we can get the `module` and
the `class`'s fully qualified name; added together, they give the full qualified name.

With the above change, we don't need to use `kwargs[name]` and
`kwargs[module]`, and don't need to check naming conflicts or the
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var any more.
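A minimal sketch of recovering the fully qualified name once the class itself is in hand (illustrative; in the exporter the class is obtained through `n.pyobj()`/`._self` as described above):

```python
import torch

def make_gelu():
    class Gelu(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return torch.nn.functional.gelu(x)

        @staticmethod
        def backward(ctx, grad):
            return grad  # placeholder backward, only the class name matters here

    return Gelu

cls = make_gelu()
# __qualname__ keeps the "<locals>" part that plain name/module lookups lose.
print(f"{cls.__module__}.{cls.__qualname__}")  # e.g. __main__.make_gelu.<locals>.Gelu
```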
2023-08-09 10:58:33 +08:00
Xavier Dupré
d0316ee768
Updating QDQ to support Float8E4M3FN (#16550)
### Description
Naive update of the quantization tools to support Float8E4M3FN for Gemm.
2023-08-08 12:18:48 +02:00
Chen Fu
3c10f027de
4b quantization for weights of LLMs (#16833)
### Description
Blockwise 4-bit quantization for LLMs.
1. Introduce 4-bit block-wise quantization for linear layer weights.
2. Implement a matrix multiplication kernel for fp32 x int4.
3. Implement the special operator MatMulFpQ4.
4. Implement a quantization tool that converts the MatMul operator to
MatMulFpQ4 when the right-hand side is a 2D const tensor.


### Motivation and Context
Compress and accelerate LLMs

|Benchmark | Time(ns)|
|-------------|----------|
|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8| 218054|
|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8| 35830155|
|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8| 73479790|
|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8| 270152|
|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8| 35826721|
|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8| 73021200|
|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8| 213832|
|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8| 36749874|
|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8| 72618120|


|Benchmark | Time(ns)|
|-------------|----------|
|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8|   522610|
|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8| 39237689|
|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8| 75983467|

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-08-07 12:23:55 -07:00
Khalia Spear
4e6ea730d6
Broadcasting for SLN for CPU and CUDA (#16510)
### Description
Enhanced SkipLayerNorm by implementing broadcasting for both CPU and
CUDA



### Motivation and Context
The input and skip tensors no longer have to be the same size, which
means the op can accept data where the skip shape matches the input shape,
has a shape of {1, sequence_length, hidden_size},
or {sequence_length, hidden_size}.
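A NumPy reference of the broadcast behavior (a sketch of the math only, not the actual CPU/CUDA kernels):

```python
import numpy as np

def skip_layer_norm(x, skip, gamma, beta, eps=1e-12):
    # skip may be (batch, seq, hidden), (1, seq, hidden) or (seq, hidden);
    # NumPy broadcasting covers all three shapes accepted after this change.
    h = x + skip
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps) * gamma + beta

x = np.random.randn(2, 4, 8).astype(np.float32)
skip = np.random.randn(4, 8).astype(np.float32)  # broadcast across the batch dimension
out = skip_layer_norm(x, skip, np.ones(8, np.float32), np.zeros(8, np.float32))
print(out.shape)  # (2, 4, 8)
```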

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2023-08-07 09:55:42 -07:00
Tianlei Wu
50bf310dea
[CUDA] RelativePositionBias supports input with padding removed (#16923)
Update RelativePositionBias to support input with padding removed.
- [x] add bias transpose kernel
- [x] add test
- [x] update operator document
2023-08-01 16:39:09 -07:00
Tianlei Wu
1fbd1ed179
[CUDA] PackedMultiHeadAttention support Bias and separated Q, K and V inputs (#16913)
### Description
Follow-up change for PackedMultiHeadAttention added in
https://github.com/microsoft/onnxruntime/pull/16779:
- [x] Add Bias input
- [x] Add CUDA kernels to support separated query, key and values
inputs.
- [x] Update operator documents
- [x] Add unit tests
2023-08-01 15:30:41 -07:00
Patrice Vignola
49512e558a
[DML EP] Add I/O binding and If operator (#16859)
Being able to leverage I/O binding for DML and registering `If` for the
DML EP allows us to avoid copying the past/present key/values back and
forth between the CPU and the GPU after every token.

This gives us a 25% performance increase for Dolly V2 with 128 tokens on
an RTX 4090.
2023-07-31 19:45:59 -07:00
Tianlei Wu
742edec5e8
[CUDA] Add PackedMultiHeadAttention operator (#16779)
### Description
Add a new operator for MultiHeadAttention whose inputs have padding removed.
It only supports the packed QKV format.
2023-07-28 16:35:38 -07:00
Alexey Kamenev
7c05f7bab1
Fix IRFFT contrib op output dimension calculation (#15662)
### Description
Fixes the issue with IRFFT output dimension calculation as described in
#13236

### Motivation and Context
Please refer to #13236 for detailed description.

Specifically, [this code](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cuda/math/fft_ops.cc#L103) computes the output dimension as:
```
out_dim = in_dim * 2 - 1
```
while it should be this instead:
```
out_dim = 2 * (in_dim - 1)
```
(assuming the original signal has an even number of samples, of course).

For example, if the original signal has 4 samples, then the round trip should look something like:
```
4 -> (one-sided RFFT) -> 3 (complex) -> (one-sided IRFFT) -> 4
```
with the current code the output will be a signal with 5 points.
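The round trip is easy to check with NumPy, whose one-sided IRFFT uses the corrected output-length convention:

```python
import numpy as np

x = np.random.randn(4)
X = np.fft.rfft(x)    # one-sided RFFT: 4 real samples -> 3 complex bins
y = np.fft.irfft(X)   # default output length is 2 * (in_dim - 1) = 4
print(X.shape, y.shape)  # (3,) (4,)
# The old formula in_dim * 2 - 1 would have produced 5 samples instead.
```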

---------

Co-authored-by: Alexey Kamenev <akamenev@nvidia.com>
Co-authored-by: Nick Geneva <nicholasgeneva@gmail.com>
2023-07-28 15:52:37 -07:00
Yi Zhang
9f21f694cf
stop support to VS 2019 (#16892)
### Description
Remove VS 2019 code.

2023-07-28 13:09:35 +08:00
Prathik Rao
779fba1666
ORT Cache (#16744)
### Description

This PR adds support to cache the exported training/evaluation ONNX
model in `ORTModule`. On future runs, instead of exporting the model
again, we can pick up the model from a location on disk and run
`ORTModule` training/evaluation.

### Motivation and Context

ORT Training DRI Contribution

---------

Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
2023-07-27 09:00:43 -07:00
Patrice Vignola
649930142f
[DML EP] Add NCHW and float16 gamma/beta support for GroupNorm (#16814)
This will remove transposes that are not needed in the DML kernel. To
keep backward compatibility, the default behavior is to set NHWC when no
attribute is set.
2023-07-25 21:43:29 -07:00
Justin Chu
0c1a5098dc
Disable PERF* rules in ruff to allow better readability (#16834)
### Description

Disable two PERF* rules in ruff to allow better readability. The rationale is
commented inline. This change also removes the noqa directives made unused
by the rule change.

### Motivation and Context

Readability
2023-07-25 15:38:22 -07:00
Justin Chu
d79515041c
[Better Engineering] Bump ruff to 0.0.278 and fix new lint errors (#16789)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ #16789

Bump ruff to 0.0.278 and fix new lint errors. I added noqa to all
existing RUF012 errors which requires mutable class variables to be
annotated with `ClassVar`, as well as all PERF issues.

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-07-21 12:53:41 -07:00
saurabh
24566058b3
ovep dockerfile and wheel docs changes (#16482)
### Description
This PR includes changes to the documentation in the _readmeOV.rst_ file
and also changes to the dockerfile that enable building ORT with the
latest OpenVINO 2023.0.0.



### Motivation and Context
Modified the dockerfile to incorporate the latest version of OpenVINO
(2023.0.0) for building Onnxruntime.
The changes in the PR aim to improve the overall user experience by
providing accurate and up-to-date documentation while leveraging the latest
OpenVINO 2023.0.0.
2023-07-19 09:01:09 -07:00