Commit graph

9755 commits

Author SHA1 Message Date
Yulong Wang
25bbd8d4eb
[js/web] allow gpu IO binding tests to fail temporarily (#17892)
### Description
allow gpu IO binding tests to fail temporarily.

While the root cause is still under investigation, use `continueOnError: true`
to allow the test to fail without blocking PRs.
2023-10-11 21:21:21 -07:00
Changming Sun
138ccecd22
Change how "NPM packaging pipeline" downloads packages from another pipeline (#17838)
### Description
"NPM packaging pipeline" needs to download an artifact from
"Zip-Nuget-Java-Nodejs Packaging Pipeline".
It has been a long-standing issue that these two pipelines often use
different commit ids.
This change declares 'Zip-Nuget-Java-Nodejs Packaging Pipeline' as a
resource, so that "NPM packaging pipeline" will always fetch from the
pipeline run that triggers this NPM pipeline.
Their official document says:
"When you define a resource trigger, if its pipeline resource is from
the same repo as the current pipeline, triggering follows the same
branch and commit on which the event is raised."
2023-10-11 21:07:27 -07:00
Yi Zhang
20798a9f03
Enable onnx_test_runner to run the whole models dir in CI machine (#17863)
### Description
1. If a model should be skipped, don't load it.
2. Print the loaded tests and the skipped tests.
3. Add more of the same filters that onnxruntime_test_all uses.


2023-10-12 12:01:02 +08:00
Wanming Lin
b3cab55d68
[WebNN EP] Add a duplicate entry to support new "dataType" (#17841)
The WebNN spec renames "type" to "dataType" in
https://github.com/webmachinelearning/webnn/pull/464. Add a duplicate
entry for "dataType" to work around the compatibility issue.
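
A minimal sketch of the duplicate-entry idea, written in Python for illustration only (the real change lives in the WebNN EP's own code). The function name `operand_descriptor` and the `dimensions` field are assumptions, not taken from the PR.

```python
def operand_descriptor(data_type, dimensions):
    """Build a descriptor that carries the data type under both key names.

    Hypothetical helper: older consumers read "type", newer ones read
    "dataType", so both keys are populated with the same value.
    """
    return {
        "type": data_type,      # key used before the spec rename
        "dataType": data_type,  # key used after the spec rename
        "dimensions": list(dimensions),
    }


desc = operand_descriptor("float32", [1, 3, 224, 224])
```

Carrying both keys keeps one descriptor usable on either side of the rename, at the cost of a redundant field.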
2023-10-11 19:13:13 -07:00
Adrian Lizarraga
565bead85f
[QNN EP] Support Softmax/LogSoftmax with any axis attribute (#17877)
### Description
The QNN HTP backend only supports Softmax/LogSoftmax operators with an
axis attribute set to `input_rank - 1` (i.e., the last dimension). This
PR adds support for any axis by wrapping the QNN operator in transposes.
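
The transpose-wrapping idea can be sketched in plain Python for the 2-D case. This is only an illustration under stated assumptions: `softmax_last_axis` stands in for a backend operator that handles the last dimension only, and all function names are hypothetical rather than QNN EP code.

```python
import math


def softmax_last_axis(matrix):
    """Softmax over each row, i.e. over the last dimension only."""
    out = []
    for row in matrix:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(matrix):
    """Swap the two dimensions of a 2-D list."""
    return [list(col) for col in zip(*matrix)]


def softmax_any_axis_2d(matrix, axis):
    """Softmax over either axis, using only the last-axis primitive."""
    if axis in (1, -1):
        return softmax_last_axis(matrix)
    if axis in (0, -2):
        # Move the target axis last, apply softmax, then move it back.
        return transpose(softmax_last_axis(transpose(matrix)))
    raise ValueError("axis out of range for a 2-D input")


columns_normalized = softmax_any_axis_2d([[1.0, 2.0], [3.0, 4.0]], axis=0)
```

For higher ranks the same pattern applies with a general permutation that moves the chosen axis to the end and its inverse afterwards.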


### Motivation and Context
Support more models.
2023-10-11 17:43:42 -07:00
pengwa
63dc5dc1a9
Add document for PythonOp (#17888)
### Add document for PythonOp



https://github.com/microsoft/onnxruntime/blob/pengwa/pythonop_doc/docs/ORTModule_PythonOp_Notes.md



2023-10-12 08:36:22 +08:00
Yulong Wang
d532645bed
[js/webgpu] revise uniform support (#17871)
### Description

work for items (2) and (3) in #17860
2023-10-11 16:41:46 -07:00
Numfor Tiapo
b8f373b0ae
Add API for NPU Device Selection in the DML EP (#17612)
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-10-11 14:53:00 -07:00
Yulong Wang
a441a71e8e
[js/web] support different export format for ort-web (#17878)
### Description
support different export format for ort-web.
2023-10-11 09:38:51 -07:00
pengwa
0e2782438a
Support inplace update for PythonOp/Grad (#17687)
### Support inplace update for PythonOp/Grad

This PR is based on the branch of another PR,
https://github.com/microsoft/onnxruntime/pull/17685, to make it
easier to review.

With PR https://github.com/microsoft/onnxruntime/pull/17685, all
PythonOp inputs/outputs are assumed by default not to be updated in place.
If, during a run, we find that an in-place update happened (by comparing
each output's data address with the data addresses of all inputs), we add a
clone before setting it as PythonOp/Grad's output. In this case the results
are correct, but implicit copy overhead is introduced.

This PR allows users to define an output-to-input reuse map that tells ORT
how buffers are reused, avoiding such unnecessary copies.
2023-10-10 21:36:45 -07:00
Abhishek Jindal
54b7503c30
create patch for allgather fn for deepspeed stage 3 (#17855)
### Description
Patch for All gather fn for Deepspeed Stage 3 changes


2023-10-11 11:15:06 +08:00
Tianlei Wu
948c8369a0
[CUDA/ROCm] Remove limitation of BiasAdd (#17848)
Previously, BiasAdd only supported hidden dimensions of 32, 640 and 1280
for stable diffusion. This adds a kernel that can support any number
of channels.

### Motivation and Context
Stable Diffusion XL refiner model uses hidden dimensions of 768 or 1536,
which was not supported in BiasAdd.
2023-10-10 20:08:45 -07:00
Yulong Wang
5228332c9f
[js] upgrade JS shared dev dependencies (#17831)
### Description
upgrade JS shared dev dependencies.

- webpack: removed
- eslint: upgrade to latest.
   - eslint config upgraded to be compatible with the latest version
- typescript: upgrade to v5
   - update module "CommonJS" to "Node16" in tsconfig
   - update deprecated config "importsNotUsedAsValues" to "verbatimModuleSyntax"
- remove webpack bundles in onnxruntime-common
2023-10-10 17:44:39 -07:00
Yulong Wang
c6f1a1ce69
update build_jsep.bat to add release build flags (#17471)
### Description
flags `--enable_wasm_api_exception_catching --disable_rtti` are used in
release build, so fix the build_jsep.bat script to make it more
consistent with CI.
2023-10-10 17:38:35 -07:00
Tianlei Wu
d637111e9f
[CUDA/ROCm] Update BiasSplitGelu for SD XL Refiner model (#17849)
The SD XL Refiner model has new hidden dimension sizes not supported by BiasSplitGelu. This updates the kernel to support them.

### Motivation and Context
Current BiasSplitGelu does not support optimization for SD XL refiner model.
2023-10-10 11:07:27 -07:00
Hector Li
9a1c884ba3
[QNN EP] Add script to generate Onnx model from native QNN generated context binary file (#17859)
Add script to generate Onnx model from native QNN generated context
binary file. This is used for QNN EP example code.
2023-10-10 10:54:35 -07:00
Yulong Wang
d9b9c5a537
[js/webgpu] support using uniform buffer (#17803)
### Description
support using uniform buffer.

This PR allows using a uniform buffer in shader programs, so that some
runtime information (e.g. input/output shape) no longer needs to be
hardcoded into shader code.

There are 2 commits in this PR:
-
[667f31c](667f31c83d):
framework changes to support uniform buffer, as well as updates in
program manager, gpu data manager and indices helper.
-
[09e1d2a](09e1d2ad1d):
an example change for operator `Transpose` to use only the input's rank
instead of its dims as the shader key. With this change, the number of
shader compilations for model mobilenetv2-12 dropped from 71 to 52.
2023-10-10 00:31:12 -07:00
Yi Zhang
53be802f39
Onnx_test_runner and onnxruntime_test_all use the same broken test list. (#17840)
2023-10-10 13:03:58 +08:00
Changming Sun
05ac9f6f2a
Split onnxruntime_providers.cmake to multiple (#17853)
### Description
Split onnxruntime_providers.cmake to multiple files, for easier editing.
No other change was made in this PR.
2023-10-09 20:33:44 -07:00
Scott McKay
046939b0c1
Include CoreML in mac os python packages (#17844)
### Description
Include CoreML EP in python package.

I've added to the base package as CoreML comes from the OS so there are
no additional libraries to distribute.

Updated the CPU-based provider list to add the AzureEP, which is also
included in the base package, to fix some test failures. Without this
the infrastructure thinks a device copy implementation is required
between AzureEP and CoreML nodes, which is not the case as the AzureEP
is CPU based.

### Motivation and Context
#16989
2023-10-10 11:44:32 +10:00
Baiju Meswani
9c716f4557
Add noexcep_operators to onnxruntime internal libraries (#17850)
2023-10-09 16:29:41 -07:00
aciddelgado
406cd324e0
[CUDA] GroupQueryAttention operator using FlashAttention (#17674)
### Description
Added a Group Query Attention op that supports a number of Q heads that is
an integer multiple of the number of KV heads. As of now, this op can only
use the FlashAttention kernel, meaning it only supports sm>=80 on Linux.
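
The Q/KV head relationship can be illustrated with a small index mapping. This sketch assumes the common convention that consecutive query heads share one KV head; the function name is hypothetical and the CUDA kernel itself is not shown.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the KV head index that a given query head attends with.

    Assumes contiguous grouping: query heads [0, g) use KV head 0,
    [g, 2g) use KV head 1, and so on, where g = num_q_heads // num_kv_heads.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be an integer multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# 8 query heads sharing 2 KV heads: four query heads per KV head.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```

When the two head counts are equal, the mapping is the identity and the op reduces to ordinary multi-head attention.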

Results from onnxruntime/test/python/transformers/benchmark_gqa.py show
an on-average ~37% speed-up over Decoder Masked Multi-Head Attention,
with even greater improvements for long past sequence lengths.

```
op      batch   s_kv    heads   h_dim   ms      TFLOPS
gqa     16      2048    8       32      0.34    0.10
dmmha   16      2048    8       32      0.39    0.09
---------
gqa     16      2048    8       64      0.45    0.15
dmmha   16      2048    8       64      0.61    0.11
---------
gqa     16      2048    8       128     0.54    0.25
dmmha   16      2048    8       128     0.83    0.16
---------
gqa     16      2048    16      32      0.45    0.15
dmmha   16      2048    16      32      0.69    0.10
---------
gqa     16      2048    16      64      0.69    0.19
dmmha   16      2048    16      64      0.83    0.16
---------
gqa     16      2048    16      128     0.71    0.38
dmmha   16      2048    16      128     1.28    0.21
---------
gqa     16      2048    32      32      0.58    0.23
dmmha   16      2048    32      32      0.77    0.17
---------
gqa     16      2048    32      64      0.58    0.46
dmmha   16      2048    32      64      1.25    0.21
---------
gqa     16      2048    32      128     0.76    0.71
dmmha   16      2048    32      128     2.15    0.25
---------
gqa     16      2048    64      32      0.68    0.39
dmmha   16      2048    64      32      1.23    0.22
---------
gqa     16      2048    64      64      0.77    0.70
dmmha   16      2048    64      64      2.11    0.25
---------
gqa     16      2048    64      128     1.10    0.97
dmmha   16      2048    64      128     4.06    0.26
---------
gqa     16      2048    128     32      1.00    0.54
dmmha   16      2048    128     32      2.09    0.26
---------
gqa     16      2048    128     64      1.10    0.97
dmmha   16      2048    128     64      4.08    0.26
```


### Motivation and Context
As of now, this op is targeted for use on LLama models, as it supports
kv-caching and different number of heads for Q and KV (Grouped Query
Attention). We plan to add support for more platforms, input formats,
etc. in the future.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-10-09 12:43:12 -07:00
kyoshisuki
ba72bb6f98
Fix a typo in ABI_Dev_Notes.md (#17832)
2023-10-09 07:51:34 -07:00
Wei-Sheng Chin
60f19ab001
Fix Pad's quantization (#17807)
Fix #17760. The upstream exporter creates an empty string as Pad's 3rd input,
and the quantization tool 1) treats it as a valid tensor name and
2) adds corresponding invalid quantization nodes. This PR adds a
condition check so that the quantization tool works.
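
The kind of condition check described here can be sketched as a filter over a node's input names. In ONNX, an empty string in a node's input list marks an omitted optional input, so it should never be treated as a tensor. The helper name below is hypothetical and is not the quantization tool's actual code.

```python
def tensors_to_quantize(input_names):
    """Keep only real tensor names; an empty name means 'input omitted'."""
    return [name for name in input_names if name != ""]


# Pad with data and pads provided, and the optional 3rd input left empty.
pad_inputs = ["x", "pads", ""]
valid_inputs = tensors_to_quantize(pad_inputs)
```

Filtering before any quantize/dequantize nodes are created avoids generating nodes that reference a tensor that does not exist.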
2023-10-08 22:09:23 -07:00
PeixuanZuo
2ef6ee674c
[ROCm] Update ROCm and MIGraphX CI to ROCm5.7 (#17834)
- Update ROCm and MIGraphX CI to ROCm5.7
- Simplify test exclude file. Some tests will output `registered
execution providers ROCMExecutionProvider were unable to run the model.`
if they cannot run.
- Add `enable_training` build argument for MIGraphX pipeline.
2023-10-09 10:29:11 +08:00
cloudhan
c2bd5b70b2
Fix enable_training and use_migraphx (#17827)
2023-10-08 11:43:27 +08:00
MistEO
faf9a0f6c7
Fix runtime installation error (#17828)
2023-10-07 11:50:02 -07:00
Wei-Sheng Chin
b5a103ae16
Upgrade transformers to fix CI (#17823)
The Python package pipeline fails due to "tokenizers" compilation. Since
"tokenizers" is a dependency of "transformers", we upgrade "transformers"
in the hope that a newer version resolves the failure.

```
error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
--> tokenizers-lib/src/models/bpe/trainer.rs:517:47
```
2023-10-07 09:51:24 -07:00
Changming Sun
b76994dc3a
Improve CUDA EP's GetCapability (#17809)
Improve CUDA EP's GetCapability: add layout transformer support.
Currently the code checks whether a node is already assigned to some EP
and, if so, returns immediately.
```c++
    if (!node.GetExecutionProviderType().empty()) {
      return;
    }
```

So, if you call the GetCapability function twice,

```c++
auto caps = GetCapability();
assign_nodes_to_eps(..., caps, ...);
auto caps2 = GetCapability();
```
The second GetCapability() call will return fewer results than the first
one. Layout transformer needs to call GetCapability twice as above. So
the current GetCapability() implementation is incompatible with the
Layout transformer. It is not an issue right now because the CUDA EP
doesn't need to do layout transform.  But we might want to support a
different layout.
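
The effect of the early return can be reproduced with a toy model. This sketch is in Python rather than C++ and uses hypothetical names throughout. The second function shows one possible way to keep repeated calls stable (also accepting nodes already assigned to the same EP) and is an assumption, not necessarily what this PR implements.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    op_type: str
    ep: str = ""  # empty string means "not assigned to any EP yet"


def get_capability_skip_assigned(nodes, supported_ops):
    """Mimics the early return: any node that already has an EP is skipped."""
    return [n.name for n in nodes if not n.ep and n.op_type in supported_ops]


def get_capability_keep_own(nodes, supported_ops, ep_name):
    """Hypothetical variant: nodes already assigned to this EP still count."""
    return [
        n.name for n in nodes
        if n.ep in ("", ep_name) and n.op_type in supported_ops
    ]


def assign_nodes(nodes, names, ep_name):
    """Record the EP on every claimed node."""
    for n in nodes:
        if n.name in names:
            n.ep = ep_name


nodes = [Node("a", "Conv"), Node("b", "Relu"), Node("c", "Loop")]
supported = {"Conv", "Relu"}

first = get_capability_skip_assigned(nodes, supported)       # claims a and b
assign_nodes(nodes, first, "CUDA")
second = get_capability_skip_assigned(nodes, supported)      # now empty
stable = get_capability_keep_own(nodes, supported, "CUDA")   # a and b again
```

The second call returning an empty list is the incompatibility described above: a caller that needs to ask twice sees fewer results once assignment has happened.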
2023-10-07 09:05:02 -07:00
PeixuanZuo
37f4f27da0
[ROCm] ONNX Runtime training rocm package for ADO (#17683)
- We will publish the onnxruntime-training-rocm package on ADO feeds.
The onnxruntime-training package will be solely for CUDA.

- Add a new pipeline for the onnxruntime-training-rocm ADO feeds:
https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1278. Only
the package built with the latest ROCm version is published to ADO.
2023-10-07 10:45:35 +08:00
pengwa
7201def4ec
Fix convergence for dolly+stage3 training (#17685)
### Fix convergence for dolly+stage3 training

In
[ZeROOffloadSubscriber](216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L359C7-L359C28)),
we defined some PythonOps that take an input and return it in place, for
example:

216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L223C20-L223C20).
While this is possible, when ORT runs such a PythonOp, it releases the
input OrtValue once the op completes, which can cause the data to be
erased or overwritten. But the OrtValue returned by the PythonOp still
points to that address, so reading from or writing to it may produce a
wrong result or even undefined behavior.


```
/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py:28: UserWarning: .rank-0: onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction->Backward: ONNX Op attribute 'tensor_reuse_map' doesn't indicate 8-th output is reusing any input, but detected inplace_map indicates it is reusing some input index. A clone will be done before returning to ORT, to align with ORT's NO Buffer reuse plan. Please update inplace_map explicitly to avoid such a copy.
  warnings.warn(f".rank-{get_rank()}: {message}")
  0%|▏                                                                                                                                                                                                                                               | 1/1000 [00:04<1:15:08,  4.51s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,023 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 14.1406, 'learning_rate': 0, 'epoch': 0.0}
  0%|▏                                                                                                                                                                                                                                               | 1/1000 [00:04<1:15:08,  4.51s/it]Invalidate trace cache @ step 5: expected module 6, but got module 7
  0%|▍                                                                                                                                                                                                                                                 | 2/1000 [00:04<31:53,  1.92s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,124 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  0%|▋                                                                                                                                                                                                                                                 | 3/1000 [00:04<18:05,  1.09s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,227 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  0%|▋                                                                                                                                                                                                                                                 | 3/1000 [00:04<18:05,  1.09s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,326 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  0%|█▏                                                                                                                                                                                                                                                | 5/1000 [00:04<08:44,  1.90it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,419 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  0%|█▏                                                                                                                                                                                                                                                | 5/1000 [00:04<08:44,  1.90it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,505 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|█▋                                                                                                                                                                                                                                                | 7/1000 [00:05<05:28,  3.02it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,597 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|█▋                                                                                                                                                                                                                                                | 7/1000 [00:05<05:28,  3.02it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,690 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|██▏                                                                                                                                                                                                                                               | 9/1000 [00:05<03:57,  4.17it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,791 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|██▏                                                                                                                                                                                                                                               | 9/1000 [00:05<03:57,  4.17it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,889 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|██▋                                                                                                                                                                                                                                              | 11/1000 [00:05<03:06,  5.32it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,981 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
  1%|██▋                                                                                                                                                                                                                                              | 11/1000 [00:05<03:06,  5.32it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,073 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  1%|███▏                                                                                                                                                                                                                                             | 13/1000 [00:05<02:33,  6.42it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,166 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  1%|███▏                                                                                                                                                                                                                                             | 13/1000 [00:05<02:33,  6.42it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,256 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  2%|███▌                                                                                                                                                                                                                                             | 15/1000 [00:05<02:12,  7.43it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,348 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  2%|███▌                                                                                                                                                                                                                                             | 15/1000 [00:05<02:12,  7.43it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,439 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  2%|████                                                                                                                                                                                                                                             | 17/1000 [00:06<01:59,  8.22it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,535 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
  2%|████                                                                                                                                                                                                                                             | 17/1000 [00:06<01:59,  8.22it/s]Traceback (most recent call last):
  File "examples/onnxruntime/training/language-modeling/run_clm.py", line 600, in <module>
    main()
  File "examples/onnxruntime/training/language-modeling/run_clm.py", line 548, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 457, in train
    return inner_training_loop(
  File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 781, in _inner_training_loop
    self.deepspeed.step()
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 2084, in step
    self._take_model_step(lr_kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 1990, in _take_model_step
    self.optimizer.step()
  File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1854, in step
    if self._overflow_check_and_loss_scale_update():
  File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1788, in _overflow_check_and_loss_scale_update
    self._update_scale(self.overflow)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 2132, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
  2%|████                                                                                                                                                                                                                                             | 17/1000 [00:06<06:07,  2.67it/s]
[2023-09-25 08:30:51,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1065120) of binary: /bert_ort/pengwa/py38/bin/python
Traceback (most recent call last):
  File "/bert_ort/pengwa/py38/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/onnxruntime/training/language-modeling/run_clm.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-25_08:30:51
  host      : orttrainingdev10.internal.cloudapp.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1065120)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(/bert_ort/pengwa/py38) pengwa@microsoft.com@orttrainingdev10:/bert_ort/pengwa/optim
```

## The Fix

For outputs that reuse an input without ORT being aware of it, we
detect the reuse on the fly (in the first iteration, by comparing the
output tensor addresses with the input tensor addresses), then make an
implicit copy before setting them as the PythonOp's output tensors.
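
The detect-then-copy step can be sketched as follows. This is a simplified model, not the ORTModule runner: `bytearray` objects stand in for tensors, Python's `id()` stands in for the data address, and the function name is hypothetical.

```python
def clone_aliased_outputs(inputs, outputs):
    """Copy any output that is the very same buffer as one of the inputs.

    Returns the safe output list and the indices that had to be cloned.
    """
    input_addresses = {id(t) for t in inputs}  # stand-in for data pointers
    safe_outputs = []
    cloned_indices = []
    for index, out in enumerate(outputs):
        if id(out) in input_addresses:
            # The output aliases an input that may be released after the
            # op completes, so hand back an independent copy instead.
            safe_outputs.append(bytearray(out))
            cloned_indices.append(index)
        else:
            safe_outputs.append(out)
    return safe_outputs, cloned_indices


x = bytearray(b"weights")
fresh = bytearray(b"result")
outs, cloned = clone_aliased_outputs([x], [x, fresh])
```

The copy keeps results correct at the cost of extra memory traffic, which is why declaring the reuse explicitly is preferable when the in-place behavior is known up front.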


With this fix: (left: PyTorch, right: ORT)


![image](https://github.com/microsoft/onnxruntime/assets/10530022/0d72f431-2abd-4e52-af99-19974b85edde)
2023-10-07 08:40:19 +08:00
Bowen Bao
891b50cc68
General INFO logging tracking occurance of GraphTransformer modification (#17819)
### Description
Adds logging to `GraphTransformer::Apply` that reports whether a
modification has taken place.



### Motivation and Context
General high-level INFO logging to track which optimizations occurred
for a given model. This helps improve the performance of dynamo-exported
models by monitoring how the triggered transformations differ from those
of a TorchScript-exported model.
2023-10-06 17:03:26 -07:00
Hector Li
385fab5bae
[QNN EP] Qnn cache improvement (#17757)
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a QNN context binary file with metadata as a header, we
dump an ONNX-format file with the metadata inside an ONNX node.

### Motivation and Context
Reduce the memory overhead and initialization time overhead.
2023-10-06 15:56:33 -07:00
Chi Lo
569876fb16
[TensorRT EP] Refactor OrtTensorRTProviderOptions initialization and make it easy to add new field (#17617)
Two major modifications of this PR:

1. Refactor OrtTensorRTProviderOptions initialization and make it easy
to add new fields.
2. Make the Python API capable of using TensorRT plugins by adding a new
Python binding API, `register_tensorrt_plugins_as_custom_ops`. (The EP's
custom op domain needs to be registered before model load. The C++ API is
slightly different: calling
SessionOptionsAppendExecutionProvider_TensorRT_XX appends the custom op
domain to the session options, and ORT can later register the custom op
domain from the session options before model loading.)
2023-10-06 14:12:20 -07:00
Yulong Wang
6ea493571e
[js/web] use esbuild to accelerate bundle build (#17745)
### Description

Use esbuild to accelerate bundle build.

This change uses esbuild to replace webpack for onnxruntime-web. Bundle
build time dropped from ~20sec to ~0.6sec on my Windows dev box.

A few changes applied:
- import nodejs modules using the "node:" prefix
- remove enum declaration inside namespace (EncoderUsage)
- use "fs/promises" to replace the old promisify from "util"
- separate ort-web and test-runner. Previously they were bundled
together; now they are built into 2 files.
- optimize karma runner launch time
    - remove unnecessary sourcemap preprocessor. sourcemaps are handled inside esbuild
    - remove unnecessary proxies (because ort-web and test-runner are separated now, the paths are correctly inferred)
    - remove file watcher for test data
- optimize special handling as esbuild plugins:
    - polyfill dummy imports for node.js modules when targeting browser.
    - load as content string for ort-wasm-*.worker.js
    - load as content string for ./proxy-worker/main.ts
    - a source patch to ort-wasm*-threaded*.js (see details in comments in code)
- updated debug configurations for sourcemap mapping to ensure
out-of-box good dev experience
2023-10-06 13:37:37 -07:00
Kaz Nishimura
be1e51af2a
Add length checks to fusion_transpose.py (#17608)
This change adds list length checks to node's inputs in fusion_transpose.py. It bypasses the optimization if not applicable.

### Motivation and Context
Unsqueeze in opsets below 13 has only one input, which caused runtime exceptions.
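
A length check of this kind might look like the sketch below. The helper name is hypothetical and is not taken from fusion_transpose.py. It relies on the fact that Unsqueeze takes `axes` as an attribute before opset 13 and as a second input from opset 13 onward.

```python
def unsqueeze_axes_input(op_type, input_names):
    """Return the name of the axes input, or None if the node has none.

    Returning None lets the caller skip the fusion instead of indexing
    past the end of the input list.
    """
    if op_type != "Unsqueeze":
        return None
    if len(input_names) < 2:
        # Opset < 13 style node: axes is an attribute, not an input.
        return None
    return input_names[1]


old_style = unsqueeze_axes_input("Unsqueeze", ["x"])
new_style = unsqueeze_axes_input("Unsqueeze", ["x", "axes"])
```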
2023-10-06 12:06:13 -07:00
Changming Sun
735df7e2a8
[webgpu]: add a simple GetCapability implementation (#17643)
Most of the function body was copied from CUDA EP.
2023-10-06 10:52:17 -07:00
Sheil Kumar
cb9408e89c
Enable cpp20 builds for DML EP and WinML API (#17800)
Enable cpp20 builds for DML EP and WinML API

1) Missing typename for templated types
2) unmove helper for inline references to rvalue temporaries
This is okay since per the standard a temporary bound to a reference
parameter in a function call exists until the end of the full expression
containing that function call: if the function returns a reference,
which outlives the full expression, it becomes a dangling reference.

3) static now not needed for template specializations

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-10-06 10:33:38 -07:00
JiCheng
3878011ce2
Remove MPI dependency (#17624)
### Description

Support launching multi-GPU runs without MPI.


2023-10-06 15:33:18 +08:00
George Wu
b306b02a86
[QNN EP] fixed input for InstanceNormU8 unit test and update copy lib paths (#17806)
- Update InstanceNormU8 with fixed input. With this input, it fails
consistently using QNN 2.15.1.
- Update QNN lib paths (target is deprecated) and additionally copy the V73
skel file.
2023-10-05 22:17:15 -07:00
Justin Chu
be7541ef4a
[Linter] Bump ruff and remove pylint (#17797)
Bump ruff version and remove pylint from the linter list. Fix any new
error detected by ruff.

### Motivation and Context

Ruff covers many of the pylint rules. Since pylint is not enabled in
this repo and runs slowly, we remove it from the linters.
2023-10-05 21:07:33 -07:00
Adrian Lizarraga
7417fd41e2
[QNN EP] Add better unit tests for rank 5 ReduceSum (#17802)
### Description
We previously had a unit test that checked that QNN EP rejected rank 5 reduce ops. This PR:
- Allows the underlying QNN APIs to validate the input rank for Reduce ops.
- Modifies a rank 5 ReduceSum unit test so that it can be used to reproduce a graph finalization error on QNN SDK 2.15.1.
- Adds a new rank 5 ReduceSum unit test with a configuration that is known to work in QNN SDK 2.15.1.

### Motivation and Context
Allows us to more easily test/verify rank 5 support for ReduceSum.
2023-10-05 16:16:05 -07:00
Rachel Guo
5be79e2e29
Remove swift files on ORT main repo (#17799)
### Description

Move the swift files to ORT SPM repo now:
https://github.com/microsoft/onnxruntime-swift-package-manager



---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2023-10-05 15:27:15 -07:00
Wei-Sheng Chin
faef9c32fa
ONNX-Native Tensor Parallel: Using Distributed MatMul as Example (#17695)
This PR introduces
- New data structure to represent kernel-level (aka node-level or
op-level) tensor sharding information. I consider it the
foundation of ONNX distributed inference.
- Building blocks for distributed kernel implementations, especially
stateless implementations of communication ops.
- Implementation of DistributedMatMul and its tests.

Code structure:
- sharding.h/.cc: Function to shard and reshard tensors (calling into
NCCL).
- sharding_spec.h/.cc: Representation of how a tensor is sharded.
- distributed_matmul.h/.cc: Implementation of tensor parallel MatMul.
Inputs and outputs are sharded across devices.
- onnxruntime_test_distributed.py: distributed operator tests.

Example of specifying sharding information
```python
        @onnxscript.script()
        def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT:
            # Run MatMul by sharding x along column axis and w along row axis on
            # 2 GPUs.
            return MICROSOFT_OPSET.DistributedMatMul(
                tensor_x,
                tensor_w,
                device_mesh_shape=[2],
                device_mesh_elements=[0, 1],
                input_shard_specs=["RS[0]", "S[0]R"],
                output_shard_specs=["RR"],
            )
        onnx_model = matmul_rs_sr_rr.to_model_proto(
            input_types=[FLOAT[2, "s"], FLOAT["s", 2]],
            output_types=[FLOAT[2, 2]],
        )
```

In this example, the device mesh can be visualized as a 1-D tensor, `[0,
1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the
0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is
sharded across `[0, 1]` as well.
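
To make the spec strings concrete, here is a small sketch that parses strings such as `"RS[0]"` into one entry per tensor axis. The grammar (`R` for replicated, `S[j]` for sharded along mesh axis `j`) is inferred from the examples in this message; `parse_shard_spec` is a hypothetical helper, not part of ONNX Runtime.

```python
import re

# One token per tensor axis: "R" (replicated) or "S[j]" (sharded on mesh axis j).
_AXIS_TOKEN = re.compile(r"R|S\[(\d+)\]")


def parse_shard_spec(spec):
    """Parse e.g. "RS[0]" into [None, 0]; None marks a replicated axis."""
    axes = []
    pos = 0
    while pos < len(spec):
        match = _AXIS_TOKEN.match(spec, pos)
        if match is None:
            raise ValueError(f"bad sharding spec {spec!r} at offset {pos}")
        axes.append(None if match.group(0) == "R" else int(match.group(1)))
        pos = match.end()
    return axes


for spec in ["RS[0]", "S[0]R", "RR"]:
    print(spec, "->", parse_shard_spec(spec))
```

Under this reading, `"RS[0]"` for `tensor_x` leaves the rows replicated and splits the columns over mesh axis 0, while `"S[0]R"` for `tensor_w` splits the rows.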

C++ classes to represent tensor sharding (copied from sharding_spec.h):
```cpp
class DeviceMesh {
 public:
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // Device mesh is a tensor of device indices.
  // A tensor can then be partitioned along specific mesh axes.
  //
  // Assume we have 4 GPUs indexed by 0, 1, 2, and 3.
  // Let's consider some examples.
  //  1. 1D device mesh [0, 1, 2, 3]. In this case,
  //     device_mesh_shape is [4] and device_mesh_elements
  //     is [0, 1, 2, 3].
  //     If we want to shard a 2-D tensor along its axis 1, the
  //     corresponding sharding spec is a string "RS[0]".
  //  2. 2D device mesh [[0, 1], [2, 3]]. In this case,
  //     device_mesh_shape is [2, 2] and device_mesh_elements
  //     is [0, 1, 2, 3].
  //     If we want to shard a 2-D tensor's
  //     rows along mesh axis 1 and
  //     columns along mesh axis 0, the
  //     corresponding sharding spec is a string "S[1]S[0]".
  //     If that 2-D tensor's value is np.array([[5, 6], [7, 8]]),
  //     GPU 0/1/2/3 owns 5/7/6/8.  Below is a visualization of the sharding
  //     process.
  //     - Start with a 2-D device mesh [[0, 1], [2, 3]] and
  //       a 2-D tensor [[5, 6], [7, 8]]
  //       - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]]
  //     - Split GPU mesh along axis 1 and tensor along
  //       axis 0 for "S[1]" in "S[1]S[0]"
  //       - GPU: [[0], [2]], Tensor: [[5, 6]]
  //         GPU: [[1], [3]], Tensor: [[7, 8]]
  //     - Split GPU mesh along axis 0 and tensor along
  //       axis 1 for "S[0]" in "S[1]S[0]"
  //       - GPU: [[0]], Tensor: [[5]]
  //       - GPU: [[2]], Tensor: [[6]]
  //       - GPU: [[1]], Tensor: [[7]]
  //       - GPU: [[3]], Tensor: [[8]]

  // Actual shape of device mesh represented by `device_mesh_elements`.
  std::vector<int64_t> device_mesh_shape;

  // Flattened device mesh.
  std::vector<int64_t> device_mesh_elements;
};

class AxisPartitionSpec {
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // This class is the in-memory representation of
  //  1. if a tensor is sharded or not (aka replica), and
  //  2. which tensor axis is sharded by which device mesh axis.
  // Let's consider sharding 2-D tensor along column axis on
  // device mesh [0, 1] as an example.
  // The required sharding spec RS[0] can be represented by
  // - AxisPartitionSpec(Condition::Replica, -1)
  // - AxisPartitionSpec(Condition::Shard, 0)
 public:
  // Status of a tensor axis.
  // A tensor axis can be either sharded or replicated
  // along a device mesh axis.
  enum class Condition { Replica,
                         Shard };

  // This field tells if a tensor axis is sharded or not.
  Condition cond;

  // If a tensor axis is sharded, this field tells which device
  // mesh axis to distribute the shards along.
  // If a tensor axis is not sharded, this field is ignored.
  int device_mesh_axis;

  // A helper to construct a replica spec for a tensor axis.
  static AxisPartitionSpec CreateReplica() {
    return AxisPartitionSpec(Condition::Replica, -1);
  }

  // A helper to construct a sharding spec for a tensor axis.
  // This tensor axis is sharded along `device_mesh_axis` in device mesh.
  static AxisPartitionSpec CreateShard(int device_mesh_axis) {
    return AxisPartitionSpec(Condition::Shard, device_mesh_axis);
  }
};

class TensorPartitionSpec {
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // TensorPartitionSpec holds a collection of AxisPartitionSpec and an
  // associated DeviceMesh. It is responsible for determining how a tensor
  // should be partitioned across a device mesh.
  //
  // Example 1: RS[0]
  // In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects.
  // - The first object is a Replica, denoting that the first axis of the tensor is
  //   not sharded but is instead replicated.
  // - The second object is a Shard along the 0-th axis of the device mesh. It denotes
  //   that the second axis of the tensor is sharded along the first axis of the
  //   device mesh.
  //
  // Example 2: S[0]RR
  // In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects.
  // - The first object is a Shard along the 0-th axis of the device mesh, indicating
  //   that the first axis of the tensor is sharded along the first axis of the
  //   device mesh.
  // - The second and third objects are Replicas, indicating that the second and third
  //   axes of the tensor are not sharded but are instead replicated.
 public:
  // axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor,
  //                axis_specs[0] is for row axis and axis_specs[1] is for
  //                column axis. axis_specs[i].device_mesh_axis = j means that
  //                tensor axis i is sharded along device mesh axis j.
  std::vector<AxisPartitionSpec> axis_specs;

  // device_mesh: DeviceMesh for sharding the associated tensor.
  // Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment.
  DeviceMesh device_mesh;
};
```
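
The "S[1]S[0]" walkthrough in the `DeviceMesh` comment can be checked with a short Python sketch. It is an illustration under stated assumptions, not the NCCL-backed code in sharding.cc: the mesh is assumed to be flattened in row-major order, each sharded axis is assumed to split evenly, and `mesh_coordinate` and `local_slices` are hypothetical names.

```python
def mesh_coordinate(device_id, mesh_shape, mesh_elements):
    """Coordinate of device_id in the mesh (row-major flattening assumed)."""
    flat = mesh_elements.index(device_id)
    coord = []
    for dim in reversed(mesh_shape):
        coord.append(flat % dim)
        flat //= dim
    return coord[::-1]


def local_slices(tensor_shape, axis_specs, device_id, mesh_shape, mesh_elements):
    """Slice of each tensor axis owned by device_id.

    axis_specs[i] is None for a replicated axis, or the mesh axis that
    tensor axis i is sharded along.
    """
    coord = mesh_coordinate(device_id, mesh_shape, mesh_elements)
    slices = []
    for dim, mesh_axis in zip(tensor_shape, axis_specs):
        if mesh_axis is None:
            slices.append(slice(0, dim))
            continue
        parts = mesh_shape[mesh_axis]
        if dim % parts != 0:
            raise ValueError("sketch assumes evenly divisible axes")
        size = dim // parts
        start = coord[mesh_axis] * size
        slices.append(slice(start, start + size))
    return slices


tensor = [[5, 6], [7, 8]]
owned = {}
for device in range(4):
    # "S[1]S[0]": rows follow mesh axis 1, columns follow mesh axis 0.
    rows, cols = local_slices([2, 2], [1, 0], device, [2, 2], [0, 1, 2, 3])
    owned[device] = [row[cols] for row in tensor[rows]]
print(owned)  # {0: [[5]], 1: [[7]], 2: [[6]], 3: [[8]]}
```

The result agrees with the comment: GPUs 0/1/2/3 own 5/7/6/8.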
2023-10-05 14:22:25 -07:00
Benedikt Hilmes
742069a8e8
Add option for max intermediate outputs for MinMaxCalibrater (#17029)
### Description
Adds the option to set max_intermediate_outputs for quantization with
the MinMaxCalibrater via extra_options, following the structure of
existing flags.


### Motivation and Context
When running quantization with the MinMaxCalibrater with larger
datasets, one quickly runs out of memory since it tries to load the full
dataset. Since merging and clearing of the intermediate_outputs is
already implemented within the Calibrater this simply adds an optional
flag to make use of these functions during quantization.
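
The memory effect of capping intermediate outputs can be illustrated with a small, self-contained sketch. This is not the MinMaxCalibrater code; `running_min_max` and its `peak` return value are invented here to show the idea of merging and clearing buffered outputs once a limit is reached.

```python
def running_min_max(batches, max_intermediate_outputs=None):
    """Return (min, max, peak), where peak is the most batches held at once.

    With max_intermediate_outputs=None every batch is buffered before
    merging; with an integer limit the buffer is merged and cleared each
    time it reaches that size.
    """
    pending = []
    lo, hi = float("inf"), float("-inf")
    peak = 0

    def merge():
        nonlocal lo, hi
        for values in pending:
            lo = min(lo, min(values))
            hi = max(hi, max(values))
        pending.clear()

    for batch in batches:
        pending.append(list(batch))
        peak = max(peak, len(pending))
        if max_intermediate_outputs is not None and len(pending) >= max_intermediate_outputs:
            merge()
    merge()  # fold in whatever is still buffered
    return lo, hi, peak


data = [[float(i), float(i) + 0.5] for i in range(10)]
print(running_min_max(data))
print(running_min_max(data, max_intermediate_outputs=2))
```

Both calls compute the same range, but the capped call never holds more than two batches at a time.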
2023-10-05 11:43:12 -07:00
Edward Chen
b6bef0f063
Add test for iOS dynamic framework (#17790)
Add test to cover iOS dynamic framework usage.
2023-10-05 11:18:51 -07:00
Ye Wang
0e988239cc
[BeamSearch]optimize key cache reordering (#17771)
### Description

Replace
onnxruntime::cuda::Transpose4DKernelParallelizeMultipleElementsPerThreadInInnermostDim()
with a custom transpose kernel in ReorderPastState(). The original
implementation doesn't benefit from vectorized loads or coalesced
writes, and does not fully utilize the threads in a block.

Benchmarked with the TNLGv4 model (batch=4, seq_len=4K):
- transpose kernel speedup: ~1.9X (392 μs -> 206 μs)
- overall reordering speedup: ~1.48X

Latency:
before:

![image](https://github.com/microsoft/onnxruntime/assets/52801275/34c7ab73-3da1-4c41-a036-e9fb6a966891)
after:

![image](https://github.com/microsoft/onnxruntime/assets/52801275/337818ec-9598-4d8a-9e9b-7215b6862498)

GPU matrix:
before:

![image](https://github.com/microsoft/onnxruntime/assets/52801275/4962248f-703c-49bd-8586-deaeccd9bce0)
after:

![image](https://github.com/microsoft/onnxruntime/assets/52801275/a795a892-4c5d-432d-8375-0bb67385d2bc)



---------

Co-authored-by: Your Name <you@example.com>
2023-10-05 10:29:11 -07:00
Hector Li
e1a089c23c
[QNN EP] Skip Op validation for Q & DQ node with 5D data (#17792)
[QNN EP] Skip Op validation for Q & DQ node with 5D data

### Description
Skip Op validation for Q & DQ nodes with 5D data to work around a bug in
QNN.
2023-10-05 09:54:56 -07:00
Tianlei Wu
d6dad96923
Add CUDA EP in StableDiffusion demo (#17788)
Add CUDA EP to the demo of stable diffusion.

### A100 Performance
Test | Engine Property | Batch Size | TRT Latency (ms) | ORT_TRT Latency (ms) | ORT_CUDA Latency (ms) | TORCH Latency (ms)
-- | -- | -- | -- | -- | -- | --
SD 1.5, 50 steps, 512x512 | Static Input Shape | 1 | 861 | 851 | 861 | N/A
SD 1.5, 50 steps, 512x512 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 1 | 974 | 1079 | 928 | 1222
SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 1 | 2492 | OOM | 1901 | 1971
SD 1.5, 50 steps, 768x768 | Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 | 4 | 9091 | OOM | 6785 | 6700

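ORT_CUDA is the most robust of the three engines for dynamic input
shapes: it has the lowest latency in every dynamic-shape row, and unlike
ORT_TRT it does not run out of memory at 768x768. PyTorch could be a good
choice for large batch sizes (6700 ms vs. 6785 ms at batch size 4).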

The above result is from one A100-SXM4-80GB GPU (in
Standard_ND96amsr_A100_v4 Azure VM) with 50 steps to generate 512x512 or
768x768 images using StableDiffusion 1.5. Onnxruntime-gpu is built from
source, and the following packages or libraries are used in this test:
* tensorrt==8.6.1.post1
* torch==2.2.0.dev20230920+cu121
* transformers==4.31.0
* diffusers==0.19.3
* onnx==1.14.1
* onnx-graphsurgeon==0.3.27
* polygraphy==0.47.1
* protobuf==3.20.2
* onnxruntime-gpu==1.17.0 (built from source of main branch)
* CUDA 12.2.2
* cuDNN 8.9.5.29
* python 3.10.13

For static input shape, the engine is built with static batch size and
static image shape, and cuda graph is enabled.

For dynamic input shape, the engine is built to support dynamic batch
size and dynamic image shape, and cuda graph is disabled. The TensorRT
engine is built for batch size 1~4, image size 256x256 ~ 1024x1024, and
the optimized image size is 512x512.

The scripts to test static and dynamic input shapes look like the
following:
```
prompt="a cute magical flying dog, fantasy art drawn by disney concept artists, highly detailed, digital paintining"
for e in TRT ORT_TRT ORT_CUDA
do
  python demo_txt2img.py --engine $e "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "$prompt"
  python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape --height 768 --width 768 "$prompt"
done
```

Performance of PyTorch is from commands like the following:
```
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 512 --width 512
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 768 --width 768
python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 4 --height 768 --width 768
```
2023-10-05 08:19:20 -07:00
Jiajia Qin
db3901ab97
[js/webgpu] Enable the NCHW ConvMatMul path (#17717)
1) Enable pointwise NCHW conv2d by MatMul.
2) Enable non-pointwise NCHW conv2d by convMatMul.
3) Fix bug when `sameSize` is true

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-10-05 00:26:01 -07:00