Commit graph

9347 commits

Author SHA1 Message Date
Yulong Wang
9cd4e5af68
[wasm] upgrade emsdk to 3.1.44 (#17069)
### Description
This change upgrades emsdk to 3.1.44.

Because the backend is upgraded to LLVM 16, we need to fix a lot of build
failures caused by "-Wshorten-64-to-32".

Most of the build failures come from the generated `onnx.pb.h`, and they
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignores the shorten-64-to-32 warnings.
2023-08-10 16:08:36 -07:00
dependabot[bot]
66b45e0085
Bump actions/upload-pages-artifact from 1 to 2 (#16727)
Bumps
[actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact)
from 1 to 2.
Release notes for v2.0.0: ⚠️ BREAKING CHANGE: the built-in `chmod` commands
are removed in v2 (actions/upload-pages-artifact#69), and the README is
updated for v2 (actions/upload-pages-artifact#70). See [all code
changes](https://github.com/actions/upload-pages-artifact/compare/v1...v2)
since the previous release.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-08-10 15:00:35 -07:00
Justin Chu
83240d1346
Bump clang-format to 16.0.6 in CI (#17099)
### Description

Bump clang-format to 16.0.6 in CI to take in fixes.
2023-08-10 13:53:04 -07:00
Bowen Bao
6986981482
Bump ONNX version (#16325)
### Description
Bump ONNX version to https://github.com/onnx/onnx/tree/rel-1.14.1 to
include a fix for a segfault when running shape inference on nested ONNX functions.



### Motivation and Context
Resolves #16170
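
For reference, the failure mode can be exercised through onnx's Python
shape-inference entry point; the model path below is illustrative:

```
import onnx

model = onnx.load("model_with_nested_functions.onnx")  # hypothetical model
# With onnx < 1.14.1 this could segfault when the model contains nested functions.
inferred = onnx.shape_inference.infer_shapes(model)
```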
2023-08-10 11:27:28 -07:00
Changming Sun
6dffd1a890
Update model_tests.cc: avoid auto adding new tests from new opsets (#17084)
### Description
1. Update model_tests.cc: avoid auto-adding new tests from new opsets.
2. Simplify the "ConcatPathComponent" function. It does not need to be a
template.

### Motivation and Context
All our Windows/Linux CI build machines are preloaded with some test
data. In model_tests.cc, we auto-add all of it to
onnxruntime_test_all.exe's unit tests. However, this causes problems when
we update the CI build machine images: new data can make pipelines
suddenly fail.
Therefore, instead of auto-discovering test data and adding all of it to
tests, this PR changes the code to explicitly specify the opset names.

This change doesn't impact how the Web CI pipeline runs its tests.

Going forward, the workflow would be:
Step 1: Update the onnx version in deps.txt.
Step 2: Update js/scripts/prepare-onnx-node-tests.ts, as in #16943. It is
better to put steps 1 and 2 in the same PR.
Step 3: The onnxruntime-es team regenerates the VM images, tests them, and
deploys them.
Step 4: Enable the new opset test data for EPs.


[AB#18340](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18340)
2023-08-10 11:11:26 -07:00
PeixuanZuo
12837ba5c7
[ROCm] Update CI based on ubuntu 22.04 (#17076)
- Update ROCm version to ROCm 5.6
- Update CI to be based on Ubuntu 22.04
2023-08-10 09:51:29 -07:00
BoarQing
87285323e6
[VITISAI] nested subgraph is unsupported for now (#17067)
### Description
Return an empty ComputeCapability when a graph contains a nested subgraph.


### Motivation and Context
For now, our architecture does not support nested subgraphs, so we
return an empty ComputeCapability in this case.
2023-08-10 09:45:13 -07:00
BoarQing
1b081d51dc
[VITISAI] node arg can be used more than once (#17068)
### Description
A node arg can be matched multiple times.


### Motivation and Context
Previously, we thought a node's name must be unique and thus could be
used as an identifier. However, we recently found that a node's name can
be empty, making it impossible to tell which node is which. So, we use
the node arg to differentiate nodes, and to do so we need to match a node
arg more than once.
2023-08-10 09:44:27 -07:00
satyajandhyala
e8a9d4f04d
[JS/Web] Fix Resize kMSInternalNHWCDomain (#17023)
### Description
Fix some failing Resize tests.




---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2023-08-10 09:14:43 -07:00
guyang3532
ef6f4a4aa1
support broadcast shape for elementwise node in padding elimination (#16710)
With the PaddingElimination optimizer, input1 of an element-wise op may be
flattened like:

```
  input1 (shape:[batch_size, seq_len, ...])        input1 (shape:[valid_tokens, ...])
        \                                               \
         \               input2                          \               input2
          \                /              ----->          \               /
           \              /                                \             /
	    Element-wise Op                                Element-wise Op
```
So, the shape of input2 should be processed accordingly:
1. If input2.shape.dim_size <= input1.shape.dim_size-2, i.e. input2 has
no [batch_size, seq_len] at the beginning, we do not need to process the
shape of input2 because it is compatible with the flattened shape of
input1 (shape: [valid_tokens, ...]).
   
2. If the shape of input2 has the same dim_size as the shape of input1
and has [batch_size, seq_len] at the beginning, then to be compatible
with the flattened shape of input1 we need to insert the flatten pattern
for input2 as well, which flattens the shape of input2 from [batch_size,
seq_len, ...] to [valid_tokens, ...].
   
  
3. (Done in this PR.) In other cases for the shape of input2, like [1,
seq_len, ...] or [batch_size, 1, ...], we first need to expand it to
[batch_size, seq_len, ...], which is convenient to flatten, and then
insert the flatten pattern; see the sketch below.
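
A minimal numpy sketch of case 3 (shapes and the valid-token index are made up):

```
import numpy as np

batch_size, seq_len, hidden = 2, 4, 8
valid_token_index = np.array([0, 1, 5, 6])  # hypothetical non-padding positions

input2 = np.random.randn(1, seq_len, hidden).astype(np.float32)    # [1, seq_len, ...]
expanded = np.broadcast_to(input2, (batch_size, seq_len, hidden))  # Expand first
flattened = expanded.reshape(batch_size * seq_len, hidden)[valid_token_index]
assert flattened.shape == (len(valid_token_index), hidden)         # [valid_tokens, ...]
```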
2023-08-10 19:07:22 +08:00
cloudhan
b4e0fc87ea
[ROCm] Make KE reports with better format (#17049) 2023-08-10 17:44:32 +08:00
pengwa
0471f6fbb3
Check type for building gradient graph (#17046)
### Check type for building gradient graph

**Bug1**: 

To fix the error when running the model with ORTModule + Stage 3:

```
Exception happens when running  <bound method Function.apply of <class 'onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction'>>
Traceback (most recent call last):
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py", line 207, in call_python_forward_function
    wrapped_arg.requires_grad = is_training_mode and grad_flag
RuntimeError: only Tensors of floating point and complex dtype can require gradients

```

This is because when running PythonOpA, the 3rd input is int64; we find
that it requires a gradient during the check in the gradient builder, so we
set its requires_grad = True, but PyTorch considers this incorrect and
throws the exception. So we need to understand why the ORT gradient builder
thinks the 3rd input needs a gradient.
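
The PyTorch-side behavior is easy to reproduce in isolation:

```
import torch

t = torch.zeros(3, dtype=torch.int64)
try:
    t.requires_grad = True  # PyTorch rejects requires_grad on integer tensors
except RuntimeError as e:
    print(e)  # only Tensors of floating point and complex dtype can require gradients
```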


`ReverseBFSWithStopGradient` does a reverse BFS from the graph outputs,
collecting all nodes that are needed for computing them. It defines a
queue, initially adds all nodes that generate graph outputs, then iterates
over the nodes one by one, checking each node's inputs: if an input did not
hit a stop edge and its node arg type is an allowed type (float, etc.), the
input's producing node is appended to the queue for the next iteration of
work.

PythonOpA is such a node that is needed to compute the graph outputs, so
IsReachable(PythonOpA) returns True.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/c4c53fb9-15f7-4e8d-9aa2-7fc20555a001)

In the above code snippet, when node is PythonOpB and next_node is
PythonOpA, we did not check the node_arg type between node and next_node on
the connection from PythonOpA's 3rd input to PythonOpB's outputs. So we
appended the int64-typed node args to the sets that require gradients.


**Fix1**: add the node arg type check before appending it to the
requires-grad lists.


After the fix, a unit test failed:
"orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax[data_type0-True-0-min]
Fatal Python error: Segmentation fault". After investigation, this is
another bug.

**Bug2**: 

Without the above Fix1, the execution graph looks like this:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/b2fd4b03-95c7-414a-b268-2ba6a7300105)

As you can see, the int64 type has a gradient edge built even though it is
not used by any consumer, and the execution runs well. Thinking twice,
though, an int type should not have a gradient edge built.

With Fix1, the execution graph looks like this:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/1870d3cc-2fe5-4aa7-ad6b-0d88dcc40f8a)

So the int-typed node arg no longer has a gradient edge built. **Fix1**
fixes this problem.

Another bug happens with the initial "y_node_arg_names", e.g. in this
case ATen's two outputs, the 1st one float and the 2nd one int. When we check
the y_node
(6e6f582e08/orttraining/orttraining/core/framework/gradient_graph_builder.cc (L60C16-L60C16)),
we did not check the data type before adding it into `y_node_args_`, which is
the list of graph output node args that require gradients. As a result,
`non_differentiable_y_node_arg_names_` did not contain the int-typed graph
output.

Then
6e6f582e08/orttraining/orttraining/core/framework/ortmodule_graph_builder.cc (L312C18-L312C18)
tries to get the grad node arg into `yield_output_node_args`, BUT the grad
node arg is not built for the int-typed node arg (with **Fix1**). So we
insert a nullptr, and later, when we use it, we get a segmentation fault.

**Fix2**

Again, we add the type check when handling y_node_args, and also add a
null check when getting the gradient node arg and appending it into
yield_output_node_args.
2023-08-10 14:24:42 +08:00
Baiju Meswani
31cbd63af7
GRU Training and GRU Gradient Kernels (#16929) 2023-08-09 21:24:47 -07:00
BoarQing
249c2221b6
[VITISAI] remove unused code (#17066)
### Description
Remove unused code.
2023-08-09 21:07:36 -07:00
Jeff Daily
dbbfc249f7
[ROCm] update header and binary search paths used by cmake (#17083)
This is in preparation for planned ROCm 6.0 changes that are not
backward compatible. However, the adjustments made by this PR to the
current onnxruntime cmake files will work with ROCm 5.x and 6.x.
2023-08-10 11:05:21 +08:00
PeixuanZuo
7c7c991417
[ROCm] Workaround type conversion issue (#17074) 2023-08-10 11:04:11 +08:00
Patrice Vignola
7201dbebe5
[DML EP] Split fused kernels when the persistent resource is too big (#16780)
The approach is the following:

1. Build partitions
2. Try compiling each partition into an `IDMLCompiledOperator`
3. If the compiled operator's persistent resource is bigger than 4GB,
tell the partitioner to split the partition in the middle and try again.
4. Once all partitions have been successfully compiled into an
`IDMLCompiledOperator`, fuse the partitions into an ORT operator and
register them all.

This change is relatively simple (essentially a retry mechanism), but it
required a lot of refactoring to make sure that we don't modify the graph
until **all** partitions have been compiled successfully. This is because
partially modifying the graph before making sure that all partitions can
be compiled would break future retries.

This path is not expected to be used often, and even then the loop is
rarely expected to iterate more than twice. This is a very specific edge
case for large models that were able to merge a large number of nodes into
a single partition.
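
A rough Python sketch of the retry loop; `try_compile` and its
`persistent_resource_size` attribute stand in for the real DML compilation path:

```
def compile_partitions(partition, try_compile, limit=4 << 30):
    """Split partitions whose persistent resource exceeds `limit` (sketch)."""
    stack, compiled = [partition], []
    while stack:
        part = stack.pop()
        op = try_compile(part)  # hypothetical: returns .persistent_resource_size
        if op.persistent_resource_size > limit and len(part) > 1:
            mid = len(part) // 2
            stack += [part[:mid], part[mid:]]  # split in the middle, try again
        else:
            compiled.append(op)
    # the graph is only modified once every partition has compiled successfully
    return compiled
```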
2023-08-09 19:53:15 -07:00
BoarQing
e951f837e4
[VITISAI] fix out of bound error on graph with loop (#17065)
### Description
Check the bounds of node_get_inputs to avoid an out-of-bound error.


### Motivation and Context
A model with a Loop would encounter this error. Currently we do not
support a custom op for Loop, so ideally it should throw an error and fall
back to CPU evaluation.
2023-08-09 18:38:30 -07:00
Baiju Meswani
f17efb5c7b
Copy to buffer for both trainable as well as non trainable parameters (#17070) 2023-08-09 17:23:24 -07:00
Hector Li
555f346923
[QNN EP] Enable DepthToSpace & SpaceToDepth Ops (#17038)
### Description
[QNN EP] Enable DepthToSpace & SpaceToDepth Ops
2023-08-09 16:52:15 -07:00
Zimon Tai
a3e02e8e2a
Fix Resize op input check (#16594)
### Description
onnxjs contains a `Resize` op input check that has been outdated since opset
9. Currently `Resize` supports up to 4 inputs. This PR loosens the input
check.



### Motivation and Context

Fixes #15636
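
For reference, a maximal-input Resize node built with onnx.helper (an empty
string marks an omitted optional input):

```
from onnx import helper

resize = helper.make_node(
    "Resize",
    inputs=["X", "roi", "", "sizes"],  # X, roi, scales (omitted), sizes
    outputs=["Y"],
    mode="nearest",
)
```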
2023-08-09 15:42:30 -07:00
Changming Sun
7d340256f1
Add "windows_sdk_version" build arg and fix SCA build pipeline (#17062)
### Description
1. Add "--windows_sdk_version" argument to build.py
2. Fix the Windows Static Analysis build pipeline. It is failing because it
picks up a different Windows SDK version after a build machine image
update. If we can explicitly specify the Windows SDK version, we can
prevent this from happening again.
3. Remove --enable_training from Windows Static Analysis build pipeline
because PR #16993 makes it incompatible with "no_rtti".

AB#18315
2023-08-09 14:01:16 -07:00
Adrian Lizarraga
d793e239b0
[QNN EP] Increase tolerance for ReduceProd test on x64 Windows (#17078)
### Description
Slightly increases the allowable error tolerance for ReduceProd tests on
x64 Windows/Linux with the QNN CPU backend.


### Motivation and Context
A recent [PR](https://github.com/microsoft/onnxruntime/pull/16916)
updated the input range for ReduceProd tests, which uncovered an
inaccuracy for ReduceProd on x64 Windows/Linux with the QNN CPU backend.
This PR updates the allowable error tolerance and adds a TODO for
investigation.

This is needed to ensure the QNN_Nuget_Windows pipeline runs
successfully.
2023-08-09 13:52:14 -07:00
Patrice Vignola
4bc2287a85
Fix GroupNorm tests failing when no providers are supported (#17054) 2023-08-09 13:14:13 -07:00
RandySheriffH
a7542f48d6
Make AzureEP default for python and c# packaging (#17025)
Make AzureEP the default for Python and C# packaging, with unit tests.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-09 12:36:52 -07:00
sfatimar
2c5d4dce77
Openvino ep ort 5.1 (#17042)
OpenVINO EP ORT 5.1 branch.
Changes for the new API to take in OpenVINO Provider Options,
and compatibility with OV 2023.1.


### Motivation and Context
The change is required for the new API to take in OpenVINO Provider
Options
and make it seamless.

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-08-09 11:50:10 -07:00
Adam Pocock
03c3e91b0d
[java] Relaxing CoreML test (#16777)
### Description
Reduces precision on the CoreML provider test as it returns slightly
different answers than the other tested providers. Checked on a 2020 13"
M1 MBP.

### Motivation and Context
Fixes Java CoreML test failure after #16763.
2023-08-09 11:43:05 -07:00
Dmitri Smirnov
07dfe34714
Fix FunctionProto visualization (#17063)
### Description
Title

### Motivation and Context
Need to debug function protos
2023-08-09 11:05:52 -07:00
Chi Lo
7361c283c7
Add API for updating CUDA EP provider option user compute stream (#17037)
Add a generic `UpdateCUDAProviderOptionsWithValue()` C API to update CUDA
EP provider options whose data type is a pointer and thus can't be
represented as a string.

Note: please see the comments on the similar
[PR](https://github.com/microsoft/onnxruntime/pull/16965) for TRT EP.
2023-08-09 09:24:19 -07:00
cloudhan
a4902ee65b
[CUDA][ROCm] Allow allocating ScratchBuffer from TuningContext (#17028)
By switching to the ORT native stream, we can allocate the scratch buffer
directly from the tuning context.
2023-08-10 00:05:10 +08:00
pengwa
6e6f582e08
Use full qualified name for PythonOp export (#17021)
### Use full qualified name for PythonOp export

Originally, when there are identically named torch.autograd.Functions in
different modules, for example:

`a.b.c.Gelu` vs. `d.e.func.<locals>.Gelu`

we by default throw an exception to make the user aware that we cannot
distinguish the two Gelus, because we did not record the module path during
model export. As a workaround we introduced
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` to ignore a duplicate-named Gelu
that is not used by the model run. This obviously has limitations, for
example if both Gelus are used in training.



This PR finds a way to construct a fully qualified name.

`def _export_pt_1_10(g, n, *args, **kwargs):`

1. In the exporter function, kwargs contains `name` and `module`; in the
above example:
   `a.b.c.Gelu`  --> name: `Gelu`, module: `a.b.c`
   `d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`

Using name and module is not enough to get a fully qualified name. In the
second case, `d.e` is the module path, inside which there is a function
called `func` that defines a local torch.autograd.Function named `Gelu`.
(Many of our unit tests look like this.) We can only get `d.e.Gelu`, which
is not the correct fully qualified name.

The reason: `kwargs["name"]` and `n.name` only return the class's name,
not the class's fully qualified name. (Note that `kwargs["module"]` is
correct.)

2. `n` is a torch.Node; we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `._self` to get
the torch.autograd.Function class. From that we can get the module's and
the class's fully qualified names and, concatenated together, the fully
qualified name.

With the above change, we no longer need `kwargs["name"]` and
`kwargs["module"]`, and no longer need the naming-conflict check or the
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var.
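
A sketch of the name construction; the attribute access mirrors the
description above and is illustrative, not the exact PR code:

```
def python_op_full_qual_name(n):
    # n.pyobj() is the autograd Function's apply-method instance; its bound
    # object (called `_self` above) is the torch.autograd.Function class.
    func_class = n.pyobj().__self__
    return f"{func_class.__module__}.{func_class.__qualname__}"  # e.g. d.e.func.<locals>.Gelu
```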
2023-08-09 10:58:33 +08:00
Dmitri Smirnov
c424e42594
[C++] Correctly handle scalar inputs in reduction ops, enforce Transpose perm attribute matches input rank. (#17041)
### Description

This PR addresses the following issues related to the use of the
functions in ORT.

- https://github.com/microsoft/onnxruntime/issues/16492
- https://github.com/microsoft/onnxruntime/issues/16997
- https://github.com/microsoft/onnxruntime/issues/14678
- Partially addresses
https://github.com/microsoft/onnxruntime/issues/16813

The optimization case for a scalar input did not correctly recognize it
as such.
The Transpose kernel assumed that the `perm` attribute would always match
the input tensor rank.

### Motivation and Context
These issues cause crashes and erratic behavior.
2023-08-08 14:47:01 -07:00
Tianlei Wu
fb11c67368
Fix SkipLayerNorm for 2D input (#17014)
Fix an obvious bug:
(1) In packing mode, the input for SLN has two dimensions (introduced by
#15283): [token_count, hidden_size]. The current code, `element_count =
input_dims[0] * sequence_length * hidden_size`, ends up computing
element_count = token_count * hidden_size * hidden_size, causing an
invalid memory write in the CUDA kernel and an ORT crash.

And two minor issues:
(2) potential integer overflow in `static_cast<int>(element_count)`;
(3) some dead code after `return LaunchSkipLayerNormKernel` that will
never have a chance to run.
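
Illustrative arithmetic with made-up dims (how sequence_length picks up
the wrong value in the buggy path is assumed here):

```
token_count, hidden_size = 128, 768
input_dims = [token_count, hidden_size]                # 2D packed input
sequence_length = input_dims[1]                        # wrongly equals hidden_size
wrong = input_dims[0] * sequence_length * hidden_size  # token_count * hidden_size^2
right = token_count * hidden_size                      # elements actually present
print(wrong // right)                                  # hidden_size-fold overcount
```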
2023-08-08 14:04:03 -07:00
Chi Lo
73037978f8
Add PerThreadContext for TRT EP (#16599)
Maintaining one execution context on a per-thread basis is suggested by the
TRT
[doc](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#threading)
to avoid synchronization issues.
With the previous TRT EP, we did see synchronization issues when running
multithreaded inference on some models, for example FasterRCNN.

This PR leverages the per-thread-context implementation from the CUDA EP.
The modifications are:

- Move the CUDA graph and IExecutionContext objects to the per-thread context.
- Remove the lock_guard that was previously placed around the whole
compute_func() and put lock_guards in the blocks where multiple threads may
update the kernel function state, access one builder, create/serialize/save
the engine, save the profile, and serialize/save the timing cache.
- On CentOS, don't unload the TRT EP shared library; leave it around so
that the destructor of thread-local data is still accessible when threads
exit.

Note: Tested this PR with onnxruntime_perf_test and the overhead of
PerThreadContext is small.
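
The per-thread-context idea, sketched in Python with `threading.local` (the
actual implementation is C++ inside the EP):

```
import threading

class PerThreadContext(threading.local):
    # __init__ runs once per accessing thread, so each thread lazily gets
    # its own stand-in for an IExecutionContext
    def __init__(self):
        self.execution_context = object()

ctx = PerThreadContext()
ids = set()

def worker():
    ids.add(id(ctx.execution_context))  # distinct per thread

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(len(ids))  # 3: one context per thread
```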
2023-08-08 13:02:34 -07:00
Yulong Wang
56bced0581
[js/web] enable webgpu in browser unit test (#16310)
### Description
Enable WebGPU in the browser unit tests.

The CI pipeline uses Edge v113+, which enables WebGPU.

===

**UPDATE on 08/07/2023:**
- add flags to the Edge browser launch command line so that Edge on CI
agents can initialize WebGPU correctly.
- ONLY enable WebGPU in the web release build. Other pipelines use the flag
`-b=wasm,webgl,xnnpack` to specify the other 3 backends explicitly.
- disable "Resize"-related failing tests. Once they are fixed, the tests
can be re-enabled.

---------

Co-authored-by: Satya Jandhyala <satya.k.jandhyala@gmail.com>
2023-08-08 11:45:04 -07:00
Arthur Islamov
c3f04251c7
[js/web] JSEP LayerNormalization and InstanceNormalizations kernels (#16830)
### Description
Added two kernels, for LayerNormalization and InstanceNormalization.

Also set a higher `maxBufferSize` limit when requesting the GPU device, as
by default it's limited to 256 MB and allocating a 600 MB buffer fails
while running fp32 StableDiffusion weights.


### Motivation and Context
These two are used in StableDiffusion and many other networks
2023-08-08 09:09:37 -07:00
Chi Lo
5b9bf8b663
[TensorRT EP] Fix bug for using correct device id for EP allocator (#17036)
The code always uses device id 0. Fix it to use the provider option
`device_id_`.
2023-08-08 09:06:44 -07:00
Edward Chen
50719d2f8e
[iOS] Add script to get simulator device info. (#17012)
Add script to get iOS simulator device info so we don't need to use hardcoded specifiers which may or may not refer to a valid simulator device.

Add use-xcode-version step to a packaging pipeline so it uses a consistent version of Xcode.
2023-08-08 09:04:06 -07:00
Ti-Tai Wang
45ea907f53
Fix orttraining_test_dort.py (#17034)
The converter has moved `opset_version` out of `torch.onnx.ExportOptions`
and into `torch.onnx.OnnxRegistry`.
This PR fixes the usage in DORT.
2023-08-08 08:11:48 -07:00
Xavier Dupré
d0316ee768
Updating QDQ to support Float8E4M3FN (#16550)
### Description
A naive update of the quantization tools to support Float8E4M3FN for Gemm.
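
A sketch of the resulting QDQ pattern; the node construction is
illustrative, and the scale/zero-point initializers are assumed to exist in
the graph:

```
from onnx import TensorProto, helper

# In a QDQ model the zero-point initializer's element type selects the
# quantized type; for float8 that would be TensorProto.FLOAT8E4M3FN.
q = helper.make_node("QuantizeLinear", ["W", "w_scale", "w_zero_point"], ["W_q"])
dq = helper.make_node("DequantizeLinear", ["W_q", "w_scale", "w_zero_point"], ["W_dq"])
print(TensorProto.FLOAT8E4M3FN)
```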
2023-08-08 12:18:48 +02:00
RandySheriffH
063e9054b8
RunAsync in C# (#16890)
Implement c# binding for RunAsync.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-07 22:19:38 -07:00
Baiju Meswani
249917a093
Add mac and windows python packages for onnxruntime-training (#16993) 2023-08-07 20:32:55 -07:00
Yi-Hong Lyu
e48dc3b281
Parallelize Transpose (#16854)
It gives up to a 5.6% improvement for prompt processing and a 2.3% improvement for token generation in the LLaMA 7B case.
2023-08-07 14:25:53 -07:00
Chen Fu
3c10f027de
4b quantization for weights of LLMs (#16833)
### Description
Blockwise 4b quantization for LLMs:
1. Introduces 4b block-wise quantization for linear layer weights.
2. Implements a matrix multiplication kernel for fp32 x int4.
3. Implements the special operator MatMulFpQ4.
4. Implements a quantization tool that converts the MatMul operator to
MatMulFpQ4 when the right-hand side is a 2D const tensor.


### Motivation and Context
Compress and accelerate LLMs

|Benchmark | Time(ns)|
|-------------|----------|
|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8| 218054|
|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8| 35830155|
|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8| 73479790|
|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8| 270152|
|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8| 35826721|
|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8| 73021200|
|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8| 213832|
|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8| 36749874|
|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8| 72618120|


|Benchmark | Time(ns)|
|-------------|----------|
|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8|   522610|
|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8| 39237689|
|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8| 75983467|
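
A numpy sketch of the underlying blockwise 4b idea (symmetric variant;
block size and guard epsilon are illustrative):

```
import numpy as np

def quantize_blockwise_4b(w, block_size=32):
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # int4 symmetric range [-7, 7]
    scale = np.maximum(scale, 1e-8)                          # guard all-zero blocks
    q = np.clip(np.round(blocks / scale), -7, 7).astype(np.int8)
    return q, scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_blockwise_4b(w)
w_hat = (q * scale).reshape(w.shape)  # dequantized approximation
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```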

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-08-07 12:23:55 -07:00
Ti-Tai Wang
8a335b8347
Update torch.onnx.OnnxRegistry usage in DORT tests (#17009)
Update the usage of torch.onnx.OnnxRegistry, as it's officially
published in PyTorch: https://github.com/pytorch/pytorch/pull/106140.
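
Usage now looks roughly like this (assuming the PyTorch 2.1-era dynamo
export API):

```
import torch

registry = torch.onnx.OnnxRegistry()
print(registry.opset_version)  # opset lives on the registry now
export_options = torch.onnx.ExportOptions(onnx_registry=registry)
```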

---------

Co-authored-by: Wei-Sheng Chin <wechi@microsoft.com>
2023-08-07 10:15:51 -07:00
Khalia Spear
4e6ea730d6
Broadcasting for SLN for CPU and CUDA (#16510)
### Description
Enhanced SkipLayerNorm by implementing broadcasting for both CPU and
CUDA



### Motivation and Context
The input and skip tensors no longer have to be the same size: the skip
shape can be the same as the input shape, {1, sequence_length,
hidden_size}, or {sequence_length, hidden_size}.
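
The accepted shapes follow standard broadcasting, e.g.:

```
import numpy as np

batch, seq, hidden = 2, 4, 8
x = np.random.randn(batch, seq, hidden).astype(np.float32)
for skip_shape in [(batch, seq, hidden), (1, seq, hidden), (seq, hidden)]:
    skip = np.random.randn(*skip_shape).astype(np.float32)
    assert (x + skip).shape == (batch, seq, hidden)  # all three broadcast
```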

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2023-08-07 09:55:42 -07:00
pengwa
3649376f09
Fix few small bugs (#17019)
### Fix a few bugs

1. In symbolic shape inference, there was no None check before getting the length.
2. Rename PythonOp/PythonOpGrad's attribute `name` to `func_name`;
otherwise, when we use onnx.helper.make_node to create a node, `name`
conflicts with the node name.
3. Filter shape-inference warnings for PythonOp for torch 2.0 or newer.
4. Close the file descriptors for log suppression. Without the fix, two
extra fds are left open after the log suppression exits its context.
Before entering log suppression (left), before exiting log suppression (right):

![image](https://github.com/microsoft/onnxruntime/assets/10530022/3cd3057a-59f9-4c89-8359-d9b32c49a17e)
   With the fix, no fds are added after the context exits.

![image](https://github.com/microsoft/onnxruntime/assets/10530022/03454a8f-ab48-4552-bb9b-293a4f51be67)
2023-08-07 14:01:36 +08:00
Chi Lo
a451318820
Refactor TRT EP error message with details (#17007)
If users use `trt_profile_min_shapes`, `trt_profile_max_shapes`, and
`trt_profile_opt_shapes`, they need to provide all the dynamic-shape
inputs with associated shape profiles.
In the case where the main graph is partitioned into TRT/CUDA subgraphs,
if an input of a subgraph also has a dynamic shape, users need to provide
its shape profiles as well. Users might not notice this, so TRT EP will
tell them which input shape profiles need to be provided.

The new warning message is:

```
  Traceback (most recent call last):
    File "/home/azureuser/disk2/debug/optional_inputs.py", line 218, in <module>
      test_optional_input_dynamic(trt_profile=True, optional=True)
    File "/home/azureuser/disk2/debug/optional_inputs.py", line 195, in test_optional_input_dynamic
      session = ort.InferenceSession(
    File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 
  419, in __init__
      self._create_inference_session(providers, provider_options, disabled_optimizers)
    File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 
  471, in _create_inference_session
      sess.initialize_session(providers, provider_options, disabled_optimizers)
  onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : User needs to provide all the 
  dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options.
  Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide 
  shape profiles for the TRT subgraph's input if it's dynamic shape input.
  Following input(s) has no associated shape profiles provided: x1
```

Please see this github issue:
https://github.com/microsoft/onnxruntime/issues/16600
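
For reference, a provider-options sketch that supplies the profiles; the
name:dims string format and the values are illustrative:

```
import onnxruntime as ort

trt_options = {
    "trt_profile_min_shapes": "x1:1x3x224x224",
    "trt_profile_opt_shapes": "x1:8x3x224x224",
    "trt_profile_max_shapes": "x1:32x3x224x224",
}
session = ort.InferenceSession(
    "model.onnx",  # hypothetical model with dynamic-shape input x1
    providers=[("TensorrtExecutionProvider", trt_options)],
)
```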
2023-08-06 09:04:21 -07:00
Dmitri Smirnov
d5e4bdbe7d
Fix protobuf TaggedStringPtr display (#17008)
### Description
Adjust the natvis files to display tagged strings.

### Motivation and Context
Hard to debug without seeing names.
2023-08-04 17:51:01 -07:00
Sheil Kumar
78a5f049f4
[DML] Model corrupter during layernorm fusion and DmlNonZeroOperator crashes (#16918)
[DML] Model corruption during layernorm fusion and DmlNonZeroOperator
crashes

Two issues fixed in this PR:
1) Changes to layernorm fusion regressed DirectML. This has been disabled
for DML to unblock models.
2) DmlNonZero needs to create an operator call that needs to know the
number of non-zero elements (size in bytes). This therefore needs to be
allocated during compute, but it was being allocated during
initialization, causing the output tensor size to mismatch the operator's
expectations.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-08-04 17:44:54 -07:00