onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-18 21:21:17 +00:00

Author	SHA1	Message	Date
Yulong Wang	25bbd8d4eb	[js/web] allow gpu IO binding tests to fail temporarily (#17892 ) ### Description allow gpu IO binding tests to fail temporarily. when the root cause is still in investigation, use `continueOnError: true` to allow the test to fail without blocking PRs.	2023-10-11 21:21:21 -07:00
Changming Sun	138ccecd22	Change how "NPM packaging pipeline" downloads packages from another pipeline (#17838 ) ### Description "NPM packaging pipeline" needs to download an artifact from "Zip-Nuget-Java-Nodejs Packaging Pipeline". It has been a long-standing issue that the two pipelines often use different commit ids. This change declares 'Zip-Nuget-Java-Nodejs Packaging Pipeline' as a resource, so that "NPM packaging pipeline" will always fetch from the pipeline run that triggers this NPM pipeline. Their official documentation says: "When you define a resource trigger, if its pipeline resource is from the same repo as the current pipeline, triggering follows the same branch and commit on which the event is raised."	2023-10-11 21:07:27 -07:00
Yi Zhang	20798a9f03	Enable onnx_test_runner to run the whole models dir in CI machine (#17863 ) ### Description 1. If the model should be skipped, don't load it. 2. Print loaded tests and skipped tests. 3. Add more of the same filters that onnxruntime_test_all uses. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-12 12:01:02 +08:00
Wanming Lin	b3cab55d68	[WebNN EP] Add a duplicate entry to support new "dataType" (#17841 ) WebNN spec renames "type" as "dataType" at https://github.com/webmachinelearning/webnn/pull/464, add a duplicate entry for "dataType" in order to workaround the compatibility issue.	2023-10-11 19:13:13 -07:00
Adrian Lizarraga	565bead85f	[QNN EP] Support Softmax/LogSoftmax with any axis attribute (#17877 ) ### Description The QNN HTP backend only supports Softmax/LogSoftmax operators with an axis attribute set to `input_rank - 1` (i.e., the last dimension). This PR adds support for any axis by wrapping the QNN operator in transposes. ### Motivation and Context Support more models.	2023-10-11 17:43:42 -07:00
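The transpose-wrapping idea in the Softmax commit above can be sketched in a few lines. This is a minimal pure-Python illustration restricted to 2-D inputs, not the QNN EP code: `softmax_last_axis`, `transpose_2d` and `softmax_2d` are hypothetical names, and `softmax_last_axis` stands in for a backend kernel that only handles the last dimension.

```python
import math


def softmax_last_axis(rows):
    """Stand-in for a kernel that only supports softmax over the last axis."""
    out = []
    for row in rows:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose_2d(matrix):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def softmax_2d(matrix, axis):
    """Softmax over either axis, using only the last-axis kernel."""
    if axis in (1, -1):
        return softmax_last_axis(matrix)
    if axis in (0, -2):
        # Move the requested axis to the end, run the kernel, move it back.
        return transpose_2d(softmax_last_axis(transpose_2d(matrix)))
    raise ValueError(f"axis {axis} is out of range for a 2-D input")
```

For example, `softmax_2d([[1.0, 2.0], [3.0, 4.0]], axis=0)` normalizes each column, so the entries of every column sum to 1.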
pengwa	63dc5dc1a9	Add document for PythonOp (#17888 ) ### Add document for PythonOp https://github.com/microsoft/onnxruntime/blob/pengwa/pythonop_doc/docs/ORTModule_PythonOp_Notes.md ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-12 08:36:22 +08:00
Yulong Wang	d532645bed	[js/webgpu] revise uniform support (#17871 ) ### Description <!-- Describe your changes. --> work for items (2) and (3) in #17860	2023-10-11 16:41:46 -07:00
Numfor Tiapo	b8f373b0ae	Add API for NPU Device Selection in the DML EP (#17612 ) Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-11 14:53:00 -07:00
Yulong Wang	a441a71e8e	[js/web] support different export format for ort-web (#17878 ) ### Description support different export format for ort-web.	2023-10-11 09:38:51 -07:00
pengwa	0e2782438a	Support inplace update for PythonOp/Grad (#17687 ) ### Support inplace update for PythonOp/Grad This PR is based on the branch of another PR, https://github.com/microsoft/onnxruntime/pull/17685, to make it easier to review. With PR https://github.com/microsoft/onnxruntime/pull/17685, all PythonOp inputs/outputs are by default assumed not to be inplaced. If an inplace update is detected during the run (by comparing each output's data address with the data addresses of all inputs), a clone is added before the tensor is set as one of PythonOp/Grad's outputs. In this case the results are correct, but implicit copy overhead is introduced. This PR allows users to define an output-to-input reuse map that tells ORT how buffers are reused, avoiding such unnecessary copies.	2023-10-10 21:36:45 -07:00
Abhishek Jindal	54b7503c30	create patch for allgather fn for deepspeed stage 3 (#17855 ) ### Description <!-- Describe your changes. --> Patch for All gather fn for Deepspeed Stage 3 changes ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-11 11:15:06 +08:00
Tianlei Wu	948c8369a0	[CUDA/ROCm] Remove limitation of BiasAdd (#17848 ) Previously, BiasAdd only supports hidden dimensions of 32, 640 and 1280 for stable diffusion. This adds a kernel that could support any number of channels. ### Motivation and Context Stable Diffusion XL refiner model uses hidden dimensions of 768 or 1536, which was not supported in BiasAdd.	2023-10-10 20:08:45 -07:00
Yulong Wang	5228332c9f	[js] upgrade JS shared dev dependencies (#17831 ) ### Description upgrade JS shared dev dependencies. - webpack: removed - eslint: upgrade to latest. - eslint config upgraded to compatible with latest version - typescript upgrade to v5 - update module "CommonJS" to "Node16" in tsconfig - update deprecated config "importsNotUsedAsValues" to "verbatimModuleSyntax" - remove webpack bundles in onnxruntime-common	2023-10-10 17:44:39 -07:00
Yulong Wang	c6f1a1ce69	update build_jsep.bat to add release build flags (#17471 ) ### Description flags `--enable_wasm_api_exception_catching --disable_rtti` are used in release build, so fix the build_jsep.bat script to make it more consistent with CI.	2023-10-10 17:38:35 -07:00
Tianlei Wu	d637111e9f	[CUDA/ROCm] Update BiasSplitGelu for SD XL Refiner model (#17849 ) SD XL Refiner model has new hidden dimension sizes not supported by BiasSplitGelu. This updates the kernel to support them. ### Motivation and Context Current BiasSplitGelu does not support optimization for SD XL refiner model.	2023-10-10 11:07:27 -07:00
Hector Li	9a1c884ba3	[QNN EP] Add script to generate Onnx model from native QNN generated context binary file (#17859 ) Add script to generate Onnx model from native QNN generated context binary file. This is used for QNN EP example code.	2023-10-10 10:54:35 -07:00
Yulong Wang	d9b9c5a537	[js/webgpu] support using uniform buffer (#17803 ) ### Description support using uniform buffer. This PR allows using a uniform buffer in the shader program, so that some runtime information (e.g. input/output shape) no longer needs to be hardcoded into shader code. There are 2 commits in this PR: - [667f31c](`667f31c83d`): framework changes to support uniform buffer, as well as updates in program manager, gpu data manager and indices helper. - [09e1d2a](`09e1d2ad1d`): an example change for operator `Transpose` to use input's rank-only instead of dims as shader key. With this change, model mobilenetv2-12 shader compile times dropped from 71 to 52.	2023-10-10 00:31:12 -07:00
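Why a rank-only shader key reduces compilations can be shown with a toy cache. This is a sketch under assumptions, not the js/webgpu program manager: the key functions, `count_compilations` and the sample shapes are all hypothetical, and the model only counts distinct cache keys.

```python
def shader_key_by_dims(op, shape):
    """Key that bakes the full shape into the shader: one program per shape."""
    return (op, tuple(shape))


def shader_key_by_rank(op, shape):
    """Key that uses only the rank; actual dims would arrive via a uniform buffer."""
    return (op, len(shape))


def count_compilations(calls, key_fn):
    """Count how many distinct programs a cache keyed by key_fn would compile."""
    cache = set()
    for op, shape in calls:
        cache.add(key_fn(op, shape))
    return len(cache)


# Hypothetical sequence of Transpose invocations in one model.
calls = [
    ("Transpose", [1, 3, 224, 224]),
    ("Transpose", [1, 32, 112, 112]),
    ("Transpose", [1, 1280]),
]
```

With these sample calls, keying by dims gives three cache entries while keying by rank gives two, because the two rank-4 inputs share one program.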
Yi Zhang	53be802f39	Onnx_test_runner and onnxruntime_test_all use the same broken test list. (#17840 )	2023-10-10 13:03:58 +08:00
Changming Sun	05ac9f6f2a	Split onnxruntime_providers.cmake to multiple (#17853 ) ### Description Split onnxruntime_providers.cmake to multiple files, for easier editing. No other change was made in this PR.	2023-10-09 20:33:44 -07:00
Scott McKay	046939b0c1	Include CoreML in mac os python packages (#17844 ) ### Description <!-- Describe your changes. --> Include CoreML EP in python package. I've added to the base package as CoreML comes from the OS so there are no additional libraries to distribute. Updated the CPU-based provider list to add the AzureEP, which is also included in the base package, to fix some test failures. Without this the infrastructure thinks a device copy implementation is required between AzureEP and CoreML nodes, which is not the case as the AzureEP is CPU based. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> #16989	2023-10-10 11:44:32 +10:00
Baiju Meswani	9c716f4557	Add noexcep_operators to onnxruntime internal libraries (#17850 )	2023-10-09 16:29:41 -07:00
aciddelgado	406cd324e0	[CUDA] GroupQueryAttention operator using FlashAttention (#17674 ) ### Description Added Group Query Attention op, supporting integer multiple number of heads for Q / KV. As of now, this op can only use FlashAttention kernel, meaning it only supports sm>=80 on Linux. Results from onnxruntime/test/python/transformers/benchmark_gqa.py show an on-average ~37% speed-up over Decoder Masked Multi-Head Attention, with even greater improvements for long past sequence lengths. ``` op batch s_kv heads h_dim ms TFLOPS gqa 16 2048 8 32 0.34 0.10 dmmha 16 2048 8 32 0.39 0.09 --------- gqa 16 2048 8 64 0.45 0.15 dmmha 16 2048 8 64 0.61 0.11 --------- gqa 16 2048 8 128 0.54 0.25 dmmha 16 2048 8 128 0.83 0.16 --------- gqa 16 2048 16 32 0.45 0.15 dmmha 16 2048 16 32 0.69 0.10 --------- gqa 16 2048 16 64 0.69 0.19 dmmha 16 2048 16 64 0.83 0.16 --------- gqa 16 2048 16 128 0.71 0.38 dmmha 16 2048 16 128 1.28 0.21 --------- gqa 16 2048 32 32 0.58 0.23 dmmha 16 2048 32 32 0.77 0.17 --------- gqa 16 2048 32 64 0.58 0.46 dmmha 16 2048 32 64 1.25 0.21 --------- gqa 16 2048 32 128 0.76 0.71 dmmha 16 2048 32 128 2.15 0.25 --------- gqa 16 2048 64 32 0.68 0.39 dmmha 16 2048 64 32 1.23 0.22 --------- gqa 16 2048 64 64 0.77 0.70 dmmha 16 2048 64 64 2.11 0.25 --------- gqa 16 2048 64 128 1.10 0.97 dmmha 16 2048 64 128 4.06 0.26 --------- gqa 16 2048 128 32 1.00 0.54 dmmha 16 2048 128 32 2.09 0.26 --------- gqa 16 2048 128 64 1.10 0.97 dmmha 16 2048 128 64 4.08 0.26 ``` ### Motivation and Context As of now, this op is targeted for use on LLama models, as it supports kv-caching and different number of heads for Q and KV (Grouped Query Attention). We plan to add support for more platforms, input formats, etc. in the future. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>	2023-10-09 12:43:12 -07:00
kyoshisuki	ba72bb6f98	Fix a typo in ABI_Dev_Notes.md (#17832 )	2023-10-09 07:51:34 -07:00
Wei-Sheng Chin	60f19ab001	Fix Pad's quantization (#17807 ) Fix #17760. The upstream exporter creates an empty string as Pad's 3rd input, and the quantization tool 1) considers that a valid tensor name and 2) adds corresponding invalid quantization nodes. This PR adds a condition check so that the quantization tool works.	2023-10-08 22:09:23 -07:00
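In ONNX, an empty string in a node's input list marks an omitted optional input, so a tool that walks inputs has to skip it. The check described above might look roughly like this; `inputs_to_quantize` is a hypothetical helper for illustration and is not the code from the PR.

```python
def inputs_to_quantize(node_inputs):
    """Return only real tensor names, skipping omitted optional inputs ("")."""
    return [name for name in node_inputs if name != ""]


# Pad with its optional 3rd input (constant_value) left out by the exporter.
pad_inputs = ["data", "pads", ""]
real_inputs = inputs_to_quantize(pad_inputs)
```

Here `real_inputs` contains only `"data"` and `"pads"`, so no quantization node would be created for the missing third input.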
PeixuanZuo	2ef6ee674c	[ROCm] Update ROCm and MIGraphX CI to ROCm5.7 (#17834 ) - Update ROCm and MIGraphX CI to ROCm5.7 - Simplify test exclude file. Some tests will output `registered execution providers ROCMExecutionProvider were unable to run the model.` if they cannot run. - Add `enable_training` build argument for MIGraphX pipeline.	2023-10-09 10:29:11 +08:00
cloudhan	c2bd5b70b2	Fix enable_training and use_migraphx (#17827 )	2023-10-08 11:43:27 +08:00
MistEO	faf9a0f6c7	Fix runtime installation error (#17828 )	2023-10-07 11:50:02 -07:00
Wei-Sheng Chin	b5a103ae16	Upgrade transformers to fix CI (#17823 ) Python package pipeline fails due to "tokenizers" compilation. Since "tokenizers" is a dep of "transformers", we update its version in the hope that a newer release resolves the error. ``` error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell` --> tokenizers-lib/src/models/bpe/trainer.rs:517:47 ```	2023-10-07 09:51:24 -07:00
Changming Sun	b76994dc3a	Improve CUDA EP's GetCapability (#17809 ) Improve CUDA EP's GetCapability: Add layout transformer support. Currently the code detects if a node is already assigned to some EP, if yes, it will directly return. ```c++ if (!node.GetExecutionProviderType().empty()) { return; } ``` So, if you call the GetCapability function twice, ```c++ auto caps = GetCapability(); assign_nodes_to_eps(..., caps, ...); auto caps2 = GetCapability(); ``` The second GetCapability() call will return fewer results than the first one. Layout transformer needs to call GetCapability twice as above. So the current GetCapability() implementation is incompatible with the Layout transformer. It is not an issue right now because the CUDA EP doesn't need to do layout transform. But we might want to support a different layout.	2023-10-07 09:05:02 -07:00
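The problem described in the GetCapability commit above can be reproduced with a small model. This is a sketch, not the CUDA EP implementation: `Node`, `assign` and both capability functions are hypothetical, and the "repeatable" variant is only one assumed way to make a second call return the same nodes.

```python
class Node:
    def __init__(self, name, supported):
        self.name = name
        self.supported = supported
        self.assigned_ep = ""  # empty string means "not assigned yet"


def get_capability_early_return(nodes):
    """Mirrors the quoted check: any node that already has an EP is skipped."""
    return [n.name for n in nodes if n.supported and not n.assigned_ep]


def get_capability_repeatable(nodes, ep="CUDAExecutionProvider"):
    """Assumed variant: nodes already claimed by this EP are reported again."""
    return [n.name for n in nodes if n.supported and n.assigned_ep in ("", ep)]


def assign(nodes, names, ep):
    """Record the EP on every node whose name was claimed."""
    for n in nodes:
        if n.name in names:
            n.assigned_ep = ep


nodes = [Node("conv", True), Node("custom_op", False)]
first = get_capability_early_return(nodes)
assign(nodes, first, "CUDAExecutionProvider")
second = get_capability_early_return(nodes)   # fewer results than `first`
repeat = get_capability_repeatable(nodes)     # same results as `first`
```

After assignment, the early-return version reports nothing on the second call, which is the behavior a caller that needs two passes cannot work with.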
PeixuanZuo	37f4f27da0	[ROCm] ONNX Runtime training rocm package for ADO (#17683 ) - We will publish the onnxruntime-training-rocm package on ADO feeds. The onnxruntime-training package will be solely for CUDA. - Add new pipeline for onnxruntime-training-rocm ADO feeds https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1278. Only the package built with the latest ROCm version is published to ADO.	2023-10-07 10:45:35 +08:00
pengwa	7201def4ec	Fix convergence for dolly+stage3 training (#17685 ) ### Fix convergence for dolly+stage3 training In [ZeROOffloadSubscriber](`216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L359C7-L359C28)`), we defined some PythonOps that take an input and return it inplace, for example: `216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L223C20-L223C20)`. However, when ORT runs such a PythonOp, it may release the input OrtValue once the op completes, causing the data to be erased or overwritten. The OrtValue returned by the PythonOp still points to that address, so reading from or writing to it may produce wrong results or even undefined behavior. ``` /bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py:28: UserWarning: .rank-0: onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction->Backward: ONNX Op attribute 'tensor_reuse_map' doesn't indicate 8-th output is reusing any input, but detected inplace_map indicates it is reusing some input index. A clone will be done before returning to ORT, to align with ORT's NO Buffer reuse plan. Please update inplace_map explicitly to avoid such a copy. 
warnings.warn(f".rank-{get_rank()}: {message}") 0%\|▏ \| 1/1000 [00:04<1:15:08, 4.51s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,023 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 14.1406, 'learning_rate': 0, 'epoch': 0.0} 0%\|▏ \| 1/1000 [00:04<1:15:08, 4.51s/it]Invalidate trace cache @ step 5: expected module 6, but got module 7 0%\|▍ \| 2/1000 [00:04<31:53, 1.92s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,124 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|▋ \| 3/1000 [00:04<18:05, 1.09s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,227 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|▋ \| 3/1000 [00:04<18:05, 1.09s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,326 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|█▏ \| 5/1000 [00:04<08:44, 1.90it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,419 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|█▏ \| 5/1000 [00:04<08:44, 1.90it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,505 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|█▋ \| 7/1000 [00:05<05:28, 3.02it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,597 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|█▋ \| 7/1000 [00:05<05:28, 3.02it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,690 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 
0, 'epoch': 0.0} 1%\|██▏ \| 9/1000 [00:05<03:57, 4.17it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,791 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▏ \| 9/1000 [00:05<03:57, 4.17it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,889 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▋ \| 11/1000 [00:05<03:06, 5.32it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,981 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▋ \| 11/1000 [00:05<03:06, 5.32it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,073 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 1%\|███▏ \| 13/1000 [00:05<02:33, 6.42it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,166 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 1%\|███▏ \| 13/1000 [00:05<02:33, 6.42it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,256 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|███▌ \| 15/1000 [00:05<02:12, 7.43it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,348 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|███▌ \| 15/1000 [00:05<02:12, 7.43it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,439 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|████ \| 17/1000 [00:06<01:59, 8.22it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,535 
>> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|████ \| 17/1000 [00:06<01:59, 8.22it/s]Traceback (most recent call last): File "examples/onnxruntime/training/language-modeling/run_clm.py", line 600, in <module> main() File "examples/onnxruntime/training/language-modeling/run_clm.py", line 548, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 457, in train return inner_training_loop( File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 781, in _inner_training_loop self.deepspeed.step() File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 2084, in step self._take_model_step(lr_kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 1990, in _take_model_step self.optimizer.step() File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(args, kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1854, in step if self._overflow_check_and_loss_scale_update(): File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(args, *kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1788, in _overflow_check_and_loss_scale_update self._update_scale(self.overflow) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 2132, in _update_scale self.loss_scaler.update_scale(has_overflow) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale raise Exception( Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run. 
2%\|████ \| 17/1000 [00:06<06:07, 2.67it/s] [2023-09-25 08:30:51,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1065120) of binary: /bert_ort/pengwa/py38/bin/python Traceback (most recent call last): File "/bert_ort/pengwa/py38/bin/torchrun", line 8, in <module> sys.exit(main()) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper return f(args, **kwargs) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ examples/onnxruntime/training/language-modeling/run_clm.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-09-25_08:30:51 host : orttrainingdev10.internal.cloudapp.net rank : 0 (local_rank: 0) exitcode : 1 (pid: 1065120) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ (/bert_ort/pengwa/py38) pengwa@microsoft.com@orttrainingdev10:/bert_ort/pengwa/optim ``` ## The Fix For those output that are reusing input, but ORT is not aware of, we detected on the fly (the first iteration, by checking the output tensor addresses with input tensor 
addresses) , then do implicit copy before set it as PythonOp's output tensors. With this fix: (left: PyTorch, right: ORT) ![image](https://github.com/microsoft/onnxruntime/assets/10530022/0d72f431-2abd-4e52-af99-19974b85edde)	2023-10-07 08:40:19 +08:00
Bowen Bao	891b50cc68	General INFO logging tracking occurrence of GraphTransformer modification (#17819 ) ### Description Adds logging to `GraphTransformer::Apply` that reports whether a modification has taken place or not. ### Motivation and Context A general high-level info log to track which optimizations occurred for a given model. It helps improve dynamo exported model performance by comparing the triggered transformations against those of the torchscript exported model.	2023-10-06 17:03:26 -07:00
Hector Li	385fab5bae	[QNN EP] Qnn cache improvement (#17757 ) ### Description Improve the QNN context binary cache feature to reduce the memory overhead and initialization time overhead. Instead of dumping a Qnn context binary file with metadata as header, we dump an Onnx format file with metadata inside an Onnx node. ### Motivation and Context Reduce the memory overhead and initialization time overhead.	2023-10-06 15:56:33 -07:00
Chi Lo	569876fb16	[TensorRT EP] Refactor OrtTensorRTProviderOptions initialization and make it easy to add new field (#17617 ) Two major modifications of this PR: 1. Refactor OrtTensorRTProviderOptions initialization and make it easy to add a new field. 2. Make the Python API capable of using TensorRT plugins by adding a new Python binding api `register_tensorrt_plugins_as_custom_ops`. (It needs to register the EP's custom op domain before model load. For the C++ API, it's slightly different: when calling SessionOptionsAppendExecutionProvider_TensorRT_XX, it appends the custom op domain to the session option. Later ORT can register the custom op domain from the session option before model loading.)	2023-10-06 14:12:20 -07:00
Yulong Wang	6ea493571e	[js/web] use esbuild to accelerate bundle build (#17745 ) ### Description Use esbuild to accelerate bundle build. This change uses esbuild to replace webpack for onnxruntime-web. Bundle build time reduced from ~20sec to ~0.6sec on my windows dev box. A few changes applied: - import nodejs modules using "node:" prefix - remove enum declaration inside namespace (EncoderUsage) - use "fs/promises" to replace the old promisify from "util" - separate ort-web and test-runner. Previously they were bundled together, now they are built into 2 files. - optimize karma runner launch time - remove unnecessary sourcemap preprocessor. sourcemaps are handled inside esbuild - remove unnecessary proxies (because ort-web and test-runner are separated now, the paths are correctly inferred) - remove file watcher for test data - optimize special handling as esbuild plugins: - polyfill dummy imports for node.js modules when targeting browser. - load as content string for ort-wasm-.worker.js - load as content string for ./proxy-worker/main.ts - a source patch to ort-wasm-threaded*.js (see details in comments in code) - updated debug configurations for sourcemap mapping to ensure out-of-box good dev experience	2023-10-06 13:37:37 -07:00
Kaz Nishimura	be1e51af2a	Add length checks to fusion_transpose.py (#17608 ) This change adds list length checks to node's inputs in fusion_transpose.py. It bypasses the optimization if not applicable. ### Motivation and Context Unsqueeze in opset (<13) has only one input, which causes runtime exceptions.	2023-10-06 12:06:13 -07:00
Changming Sun	735df7e2a8	[webgpu]: add a simple GetCapability implementation (#17643 ) Most of the function body was copied from CUDA EP.	2023-10-06 10:52:17 -07:00
Sheil Kumar	cb9408e89c	Enable cpp20 builds for DML EP and WinML API (#17800 ) Enable cpp20 builds for DML EP and WinML API 1) Missing typename for templated types 2) unmove helper for inline references to rvalue temporaries This is okay since per the standard a temporary bound to a reference parameter in a function call exists until the end of the full expression containing that function call: if the function returns a reference, which outlives the full expression, it becomes a dangling reference. 3) static now not needed for template specializations --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-06 10:33:38 -07:00
JiCheng	3878011ce2	Remove MPI dependency (#17624 ) ### Description <!-- Describe your changes. --> Support launch multi-GPU without MPI ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-10-06 15:33:18 +08:00
George Wu	b306b02a86	[QNN EP] fixed input for InstanceNormU8 unit test and update copy lib paths (#17806 ) -update InstanceNormU8 with fixed input. With this input, it fails consistently using QNN 2.15.1 -update QNN lib paths (target is deprecated) and additionally copy V73 skel file	2023-10-05 22:17:15 -07:00
Justin Chu	be7541ef4a	[Linter] Bump ruff and remove pylint (#17797 ) Bump ruff version and remove pylint from the linter list. Fix any new error detected by ruff. ### Motivation and Context Ruff covers many of the pylint rules. Since pylint is not enabled in this repo and runs slow, we remove it from the linters	2023-10-05 21:07:33 -07:00
Adrian Lizarraga	7417fd41e2	[QNN EP] Add better unit tests for rank 5 ReduceSum (#17802 ) ### Description We previously had a unit test that checked that QNN EP rejected rank 5 reduce ops. This PR: - Allows the underlying QNN APIs to validate the input rank for Reduce ops. - Modifies a rank 5 ReduceSum unit test so that it can be used to reproduce a graph finalization error on QNN SDK 2.15.1. - Adds a new rank 5 ReduceSum unit test with a configuration that is known to work in QNN SDK 2.15.1. ### Motivation and Context Allows us to more easily test/verify rank 5 support for ReduceSum.	2023-10-05 16:16:05 -07:00
Rachel Guo	5be79e2e29	Remove swift files on ORT main repo (#17799 ) ### Description <!-- Describe your changes. --> Move the swift files to ORT SPM repo now: https://github.com/microsoft/onnxruntime-swift-package-manager ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>	2023-10-05 15:27:15 -07:00
Wei-Sheng Chin	faef9c32fa	ONNX-Native Tensor Parallel: Using Distributed MatMul as Example (#17695 ) This PR introduces - New data structure to represent kernel-level (aka node-level or op-level) tensor sharding information. I consider it as the foundation of ONNX distributed inference. - Building blocks for distributed kernels implementation especially stateless implementation for communication ops. - Implementation of DistributedMatMul and its tests. Code structure: - sharding.h/.cc: Function to shard and reshard tensors (calling into NCCL). - sharding_spec.h/.cc: Representation of how a tensor is sharded. - distributed_matmul.h/.cc: Implementation of tensor parallel MatMul. Inputs and outputs are sharded across devices. - onnxruntime_test_distributed.py: distributed operator tests. Example of specifying sharding information ```python @onnxscript.script() def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT: # Run MatMul by sharding x along column axis and w along row axis on # 2 GPUs. return MICROSOFT_OPSET.DistributedMatMul( tensor_x, tensor_w, device_mesh_shape=[2], device_mesh_elements=[0, 1], input_shard_specs=["RS[0]", "S[0]R"], output_shard_specs=["RR"], ) onnx_model = matmul_rs_sr_rr.to_model_proto( input_types=[FLOAT[2, "s"], FLOAT["s", 2]], output_types=[FLOAT[2, 2]], ) ``` In this example, the device mesh can be visualized as 1-D tensor, `[0, 1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the 0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is sharded across `[0, 1]` as well. C++ classes to represent tensor sharding (copied from sharding_spec.h): ```cpp class DeviceMesh { public: // [Device Mesh and Tensor Sharding for Tensor Parallel] // Device mesh is a tensor of device indices. // A tensor can then be partitioned along specific mesh axes. // // Assume we have 4 GPUs indexed by 0, 1, 2, and 3. // Let's consider some examples. // 1. 1D device mesh [0, 1, 2, 3]. 
In this case, // device_mesh_shape is [4] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor along its axis 1, the // corresponding sharding spec is a string "RS[0]". // 2. 2D device mesh [[0, 1], [2, 3]]. In this case, // device_mesh_shape is [2, 2] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor's // rows along mesh axis 1 and // columns along mesh axis 0, the // corresponding sharding spec is a string "S[1]S[0]". // If that 2-D tensor's value is np.array([[5, 6], [7, 8]]), // GPU 0/1/2/3 owns 5/7/6/8. Below is a visualization the sharding // proccess. // - Start with a 2-D device mesh [[0, 1], [2, 3]] and // a 2-D tensor [[5, 6], [7, 8]] // - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]] // - Split GPU mesh along axis 1 and tensor along // axis 0 for "S[1]" in "S[1]S[0]" // - GPU: [[0], [2]], Tensor: [[5, 6]] // GPU: [[1], [3]], Tensor: [[7, 8]] // - Split GPU mesh along axis 0 and tensor along // axis 1 for "S[0]" in "S[1]S[0]" // - GPU: [[0]], Tensor: [[5]] // - GPU: [[2]], Tensor: [[6]] // - GPU: [[1]], Tensor: [[7]] // - GPU: [[3]], Tensor: [[8]] // Actual shape of device mesh represented by `device_mesh_elements`. std::vector<int64_t> device_mesh_shape; // Flattened device mesh. std::vector<int64_t> device_mesh_elements; }; class AxisPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // This class is the in-memory representation of // 1. if a tensor is sharded or not (aka replica), and // 2. which tensor axis is shard by which device mesh axis. // Let's consider sharding 2-D tensor along column axis on // device mesh [0, 1] as an example. // The required sharding spec RS[0] can be represented by // - AxisPartitionSpec(Condition::Replica, -1) // - AxisPartitionSpec(Condition::Shard, 0) public: // Status of a tensor axis. // A tensor axis can be either sharded or replicated // along a device mesh axis. 
enum class Condition { Replica, Shard }; // This field tells if a tensor axis is sharded or not. Condition cond; // If a tensor axis is sharded, this field tells which device // mesh axis to distribute the shards along. // If a tensor axis is not sharded, this field is ignored. int device_mesh_axis; // A helper to construct a replica spec for a tensor axis. static AxisPartitionSpec CreateReplica() { return AxisPartitionSpec(Condition::Replica, -1); } // A helper to construct a sharding spec for a tensor axis. // This tensor axis is sharded along `device_mesh_axis` in device mesh. static AxisPartitionSpec CreateShard(int device_mesh_axis) { return AxisPartitionSpec(Condition::Shard, device_mesh_axis); } }; class TensorPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // TensorPartitionSpec holds a collection of AxisPartitionSpec and an // associated DeviceMesh. It is responsible for determining how a tensor // should be partitioned across a device mesh. // // Example 1: RS[0] // In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects. // - The first object is a Replica, denoting that the first axis of the tensor is // not sharded but is instead replicated. // - The second object is a Shard along the 0-th axis of the device mesh. It denotes // that the second axis of the tensor is sharded along the first axis of the // device mesh. // // Example 2: S[0]RR // In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects. // - The first object is a Shard along the 0-th axis of the device mesh, indicating // that the first axis of the tensor is sharded along the first axis of the // device mesh. // - The second and third objects are Replicas, indicating that the second and third // axes of the tensor are not sharded but are instead replicated. public: // axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor, // axis_specs[0] is for row axis and axis_specs[1] is for // column axis. 
axis_specs[i].device_mesh_axis = j means that // tensor axis i is sharded along device mesh axis j. std::vector<AxisPartitionSpec> axis_specs; // device_mesh: DeviceMesh for sharding the associated tensor. // Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment. DeviceMesh device_mesh; }; ```	2023-10-05 14:22:25 -07:00
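The "RS[0]" and "S[1]S[0]" examples in the comment above can be checked with a small pure-Python sketch. It is not the ONNX Runtime implementation: the function `shard_2d` and its argument layout are hypothetical, the tensor is a nested list rather than a framework tensor, and each sharded axis is assumed to divide evenly by the size of its mesh axis.

```python
from itertools import product


def shard_2d(tensor, mesh_shape, mesh_elements, axis_specs):
    """Return {device_id: sub-tensor} for a 2-D tensor given as nested lists.

    axis_specs[i] is None when tensor axis i is replicated ("R"), or the
    device mesh axis j that tensor axis i is sharded along ("S[j]").
    Hypothetical helper for illustration only.
    """
    dims = (len(tensor), len(tensor[0]))
    owned = {}
    # Walk mesh coordinates in row-major order, matching the flattened
    # device_mesh_elements layout described in the comment.
    coords = product(*(range(n) for n in mesh_shape))
    for flat_index, coord in enumerate(coords):
        bounds = []
        for tensor_axis, mesh_axis in enumerate(axis_specs):
            if mesh_axis is None:
                # Replicated axis: every device keeps the full extent.
                bounds.append((0, dims[tensor_axis]))
            else:
                # Sharded axis: the device's coordinate on the mesh axis
                # selects which equal-sized slice it owns.
                size = dims[tensor_axis] // mesh_shape[mesh_axis]
                start = coord[mesh_axis] * size
                bounds.append((start, start + size))
        (r0, r1), (c0, c1) = bounds
        owned[mesh_elements[flat_index]] = [row[c0:c1] for row in tensor[r0:r1]]
    return owned


tensor = [[5, 6], [7, 8]]
two_d = shard_2d(tensor, [2, 2], [0, 1, 2, 3], [1, 0])  # "S[1]S[0]"
one_d = shard_2d(tensor, [2], [0, 1], [None, 0])        # "RS[0]"
print(two_d)
print(one_d)
```

With the 2-D mesh, devices 0/1/2/3 end up owning 5/7/6/8, which agrees with the visualization in the comment.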
Benedikt Hilmes	742069a8e8	Add option for max intermediate outputs for MinMaxCalibrater (#17029 ) ### Description Adds the option to set max_intermediate_outputs for quantization with the MinMaxCalibrater via extra_options, following the structure of existing flags. ### Motivation and Context When running quantization with the MinMaxCalibrater on larger datasets, one quickly runs out of memory because it tries to load the full dataset. Since merging and clearing of the intermediate_outputs is already implemented within the Calibrater, this simply adds an optional flag to make use of these functions during quantization.	2023-10-05 11:43:12 -07:00
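The memory argument in this commit message can be illustrated with a toy calibrator. This is a sketch only: the class `RunningMinMax` is hypothetical and is not the onnxruntime `MinMaxCalibrater`. It shows how merging buffered outputs every `max_intermediate_outputs` batches bounds the buffer size while producing the same min/max range.

```python
class RunningMinMax:
    """Hypothetical stand-in for a min/max calibrator (not the onnxruntime class)."""

    def __init__(self, max_intermediate_outputs=None):
        self.max_intermediate_outputs = max_intermediate_outputs
        self.intermediate_outputs = []  # raw batches waiting to be merged
        self.range = None               # (min, max) merged so far
        self.peak_buffered = 0          # largest buffer size seen, for inspection

    def collect(self, batches):
        for batch in batches:
            self.intermediate_outputs.append(batch)
            self.peak_buffered = max(self.peak_buffered, len(self.intermediate_outputs))
            limit = self.max_intermediate_outputs
            if limit is not None and len(self.intermediate_outputs) >= limit:
                # Fold the buffer into the running range and free it early.
                self._merge_and_clear()
        # Merge whatever is left once the data is exhausted.
        self._merge_and_clear()
        return self.range

    def _merge_and_clear(self):
        for batch in self.intermediate_outputs:
            lo, hi = min(batch), max(batch)
            if self.range is None:
                self.range = (lo, hi)
            else:
                self.range = (min(self.range[0], lo), max(self.range[1], hi))
        self.intermediate_outputs.clear()


data = [[0.5, -1.0], [2.0, 0.0], [3.5, 1.0], [-4.0, 0.25]]
unbounded = RunningMinMax()
bounded = RunningMinMax(max_intermediate_outputs=2)
unbounded_range = unbounded.collect(data)
bounded_range = bounded.collect(data)
print(unbounded_range, unbounded.peak_buffered)
print(bounded_range, bounded.peak_buffered)
```

Both instances report the same range; only the peak number of buffered batches differs.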
Edward Chen	b6bef0f063	Add test for iOS dynamic framework (#17790 ) Add test to cover iOS dynamic framework usage.	2023-10-05 11:18:51 -07:00
Ye Wang	0e988239cc	[BeamSearch]optimize key cache reordering (#17771 ) ### Description Replace onnxruntime::cuda::Transpose4DKernelParallelizeMultipleElementsPerThreadInInnermostDim() with a custom transpose kernel in ReorderPastState(). The original implementation doesn't benefit from vectorized loads or coalesced writes, and doesn't fully utilize the threads in a block. Benchmarked with the TNLGv4 model (batch=4, seq_len=4K): transpose kernel speedup: ~1.9X (392 μs -> 206 μs); overall reordering speedup: ~1.48X. Latency: before: ![image](https://github.com/microsoft/onnxruntime/assets/52801275/34c7ab73-3da1-4c41-a036-e9fb6a966891) after: ![image](https://github.com/microsoft/onnxruntime/assets/52801275/337818ec-9598-4d8a-9e9b-7215b6862498) GPU matrix: before: ![image](https://github.com/microsoft/onnxruntime/assets/52801275/4962248f-703c-49bd-8586-deaeccd9bce0) after: ![image](https://github.com/microsoft/onnxruntime/assets/52801275/a795a892-4c5d-432d-8375-0bb67385d2bc) --------- Co-authored-by: Your Name <you@example.com>	2023-10-05 10:29:11 -07:00
Hector Li	e1a089c23c	[QNN EP] Skip Op validation for Q & DQ node with 5D data (#17792 ) ### Description Skip Op validation for Q & DQ nodes with 5D data to work around a bug in QNN.	2023-10-05 09:54:56 -07:00
Tianlei Wu	d6dad96923	Add CUDA EP in StableDiffusion demo (#17788 ) Add CUDA EP to the demo of stable diffusion. ### A100 Performance Test

| | Engine Property | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | ORT_CUDA Latency (ms) | TORCH Latency (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| SD 1.5, 50 steps, 512x512 | Static Input Shape | 1 | 861 | 851 | 861 | N/A |
| SD 1.5, 50 steps, 512x512 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 1 | 974 | 1079 | 928 | 1222 |
| SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 1 | 2492 | OOM | 1901 | 1971 |
| SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 4 | 9091 | OOM | 6785 | 6700 |

We can see that ORT_CUDA is the most robust one for handling dynamic input shape. PyTorch could be a good choice if you run large batch size. The above result is from one A100-SXM4-80GB GPU (in Standard_ND96amsr_A100_v4 Azure VM) with 50 steps to generate 512x512 or 768x768 images using StableDiffusion 1.5. Onnxruntime-gpu is built from source, and the following packages or libraries are used in this test:

* tensorrt==8.6.1.post1
* torch==2.2.0.dev20230920+cu121
* transformers==4.31.0
* diffusers==0.19.3
* onnx==1.14.1
* onnx-graphsurgeon==0.3.27
* polygraphy==0.47.1
* protobuf==3.20.2
* onnxruntime-gpu==1.17.0 (built from source of main branch)
* CUDA 12.2.2
* cuDNN 8.9.5.29
* python 3.10.13

For static input shape, the engine is built with static batch size and static image shape, and cuda graph is enabled. For dynamic input shape, the engine is built to support dynamic batch size and dynamic image shape, and cuda graph is disabled. The TensorRT engine is built for batch size 1~4, image size 256x256 ~ 1024x1024, and the optimized image size is 512x512. 
The scripts to test static and dynamic input shapes look like the following:
```
prompt="a cute magical flying dog, fantasy art drawn by disney concept artists, highly detailed, digital paintining"
for e in TRT ORT_TRT ORT_CUDA
do
  python demo_txt2img.py --engine $e "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape --height 768 --width 768 "$prompt"
done
```
Performance of PyTorch is from commands like the following:
```
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 512 --width 512
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 768 --width 768
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 4 --height 768 --width 768
```	2023-10-05 08:19:20 -07:00
Jiajia Qin	db3901ab97	[js/webgpu] Enable the NCHW ConvMatMul path (#17717 ) 1) Enable pointwise NCHW conv2d by MatMul. 2) Enable non-pointwise NCHW conv2d by convMatMul. 3) Fix bug when `sameSize` is true --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-10-05 00:26:01 -07:00
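The first item in this commit message rests on the fact that a pointwise (1x1) convolution over NCHW data is a matrix product of the weight matrix with the channel-by-pixel matrix. The sketch below checks that equivalence in pure Python for a single batch element with no bias, stride 1, and no padding; it is not the WebGPU shader code, and all function names are hypothetical.

```python
def conv1x1_direct(x, w):
    """x: [C_in][H][W], w: [C_out][C_in] -> [C_out][H][W], summed per pixel."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [sum(w[o][c] * x[c][i][j] for c in range(c_in)) for j in range(width)]
            for i in range(height)
        ]
        for o in range(len(w))
    ]


def matmul(a, b):
    """Plain matrix product of nested lists: [M][K] x [K][N] -> [M][N]."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def conv1x1_as_matmul(x, w):
    height, width = len(x[0]), len(x[0][0])
    # In NCHW each channel's pixels are contiguous, so flattening every
    # channel to a row of H*W values is only a reshape.
    flat = [[value for row in channel for value in row] for channel in x]
    out = matmul(w, flat)  # [C_out][H*W]
    # Reshape each output row back to [H][W].
    return [[row[i * width:(i + 1) * width] for i in range(height)] for row in out]


x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 input channels, 2x2 pixels
w = [[1, 0], [0, 1], [1, 1]]              # 3 output channels
direct = conv1x1_direct(x, w)
via_matmul = conv1x1_as_matmul(x, w)
print(via_matmul)
```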


9755 commits