Commit graph

9125 commits

Author SHA1 Message Date
Tianlei Wu
2de5807703
Attention fusion for UNet onnx model export from PyTorch 2.* (#16629)
### Description
Tested with stable diffusion unet models exported by pytorch nightly.

Example to run:
```
cd onnxruntime/python/tools/transformers/
python optimizer.py --input unet.onnx  --output unet_fp16.onnx --model_type unet --float16 --opt_level 0
```
2023-07-11 14:35:48 -07:00
Yulong Wang
b4bf7d5044
[js/web/test] accelerate 'npm test' suite0/1 init time (#16558)
### Description
This change reduces the number of calls to globby functions so that it
accelerates the initialization for 'npm test' with suite0/1 tests from
~14sec to <2sec.
2023-07-11 14:34:40 -07:00
Ti-Tai Wang
72076e5320
Update converter registry usage in orttraining_test_dort_custom_ops.py (#16663)
Fix Orttraining Linux Lazy Tensor CI       

Orttraining Linux Lazy Tensor CI is broken.
The error message is
AttributeError: 'OnnxRegistry' object has no attribute 'register'
2023-07-11 12:03:12 -07:00
satyajandhyala
d41bbac7b9
[Web/JS] Added Expand operator support. (#16577)
### Description
Added Expand operator support.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-11 09:38:16 -07:00
Tommy Au
1b07bbceaa
Update build.bat Prevent spaces in path (#16635)
### Description
<!-- Describe your changes. -->
Simply add double quotes to prevent there is spaces in the path


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
As if there are spaces in path the bat cannot run, error would occurs.
So with a simple double quotes can fix these problems
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-07-11 07:07:08 -07:00
Justin Chu
ad994565ae
Fix type annotation for InferenceSession (#16632)
The Sequence should have been annotated to take a Union type; otherwise
the annotation would be invalid.
2023-07-11 06:34:22 -07:00
dependabot[bot]
617b3a84ba
Bump semver from 5.7.1 to 5.7.2 in /js/react_native/e2e (#16653) 2023-07-11 10:40:19 +00:00
dependabot[bot]
3608592ef5
Bump semver from 5.7.1 to 5.7.2 in /js/react_native (#16652) 2023-07-11 10:39:57 +00:00
pengwa
1ebc5d3879
Log ORTModule initialization overhead (#16529)
### Log ORTModule initialization overhead

When profiling some model for example 

```
 torchrun --nproc_per_node=1 examples/onnxruntime/training/language-modeling/run_mlm.py  --model_name_or_path microsoft/deberta-v3-large --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 200 --logging_steps 1 --use_module_with_loss

{'train_runtime': 303.8711, 'train_samples_per_second': 0.658, 'train_steps_per_second': 0.658, 'train_loss': 6.569518616199494, 'epoch': 0.09}
100%|200/200 [05:03<00:00,  1.52s/it]
***** train metrics *****
  epoch                    =       0.09
  train_loss               =     6.5695
  train_runtime            = 0:05:03.87
  train_samples            =       2223
  train_samples_per_second =      0.658
  train_steps_per_second   =      0.658


```



The end to end time is 303s (train_runtime=0:05:03.87), but the
ORTModule first step initialization (including export, graph build, etc)
takes about 255s, so when we compare the end to end time for a baseline
ORT with an improved version of ORT, there is no perf gains, since the
x% gains over (303-255) is diluted out among the overall 303s. This is
misleading!

So this PR outputs the ORTModule initialization overhead in the output,
then we can manually compute the real compte time and get the perf
gains.


If the log level is >= WARNING, then only the total end to end time +
export time is logged, otherwise, more details of break down is logged:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/8e34283d-4868-4f22-b65b-9f00d10d8fb7)



![image](https://github.com/microsoft/onnxruntime/assets/10530022/c13bcfad-0d79-483d-a886-e238efcbe657)
2023-07-11 14:11:29 +08:00
mindest
347c963d5c
[ROCm] Add ROCm Triton TunableOp for GroupNorm (#16196)
### Description
- Refactor existing Triton TunableOp-related code (based on work in
#15862)
- Add GroupNorm Triton implementation
2023-07-11 13:55:30 +08:00
Yulong Wang
5b6c1394cb
[js/test] CI: use pre-downloaded testdata in image (#16562)
### Description
update web CI to use pre-downloaded testdata in image
2023-07-10 22:22:14 -07:00
zesongw
53057ec1f5
[WebNN EP] Support InstanceNormalization and GroupNormalization. (#16604)
### Description
Add support for Op InstanceNormalization and GroupNormalization via MeanVarianceNormalization.

### Motivation and Context
Enable more models like Olive'ified SD unet to run on WebNN EP.
2023-07-10 20:51:53 -07:00
pengwa
15cb2f5a8a
Warn the user when nondet kernels are invoked in det mode (#16571)
### Give user warnings if nondeterministic kernels got called when
Deterministic flag is set

When we do accuracy investigation (for example training convergence
issue debug), usually we will set `use_deterministic_compute ` to be
true.

```
 SessionOptions sess_options;
 sess_options.use_deterministic_compute = true;
```

While in recent investigation, it is found GatherElementsGrad kernel
(who used atomic add) generate non-deterministic results, making a
deberta model ouput pretty different loss curve every time we run it
even we fix the seed, remove the dropout ratio, and set
use_deterministic_compute to be true. It turned out to be an expected
problem if we do the add in different order by cuda threads. The order
cannot be guaranteed.

So this PR will give warnings when users set `use_deterministic_compute
`, but some kernels don't have determinstic kernel impl, has to run with
non-determinstic impls. This would at least let users know the results
is not determinstic though that flag is set to be True.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/99ff60f5-21a4-44cf-bf5b-323d698b7147)

Only print the message once in case it floods training logs.
2023-07-11 11:45:47 +08:00
Edward Chen
b4c4e2b594
[objc] Add session options register custom ops with function pointer API (#16603) 2023-07-10 18:54:32 -07:00
Adam Louly
211fe5988e
add steps to write modulewithloss wrapper (#16486)
### Description
This PR includes documentation updates, providing step-by-step
instructions on how to implement the ModuleWithLoss wrapper in a
different codebase.
The documentation outlines the necessary code changes and offers
customization options based on specific requirements.

---------

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-07-11 09:07:35 +08:00
dependabot[bot]
de1b66a25a
Bump tough-cookie from 4.0.0 to 4.1.3 in /js/react_native (#16633)
Bumps [tough-cookie](https://github.com/salesforce/tough-cookie) from
4.0.0 to 4.1.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/salesforce/tough-cookie/releases">tough-cookie's
releases</a>.</em></p>
<blockquote>
<h2>4.1.3</h2>
<p>Security fix for Prototype Pollution discovery in <a
href="https://redirect.github.com/salesforce/tough-cookie/issues/282">#282</a>.
This is a minor release, although output from the <code>inspect</code>
utility is affected by this change, we felt this change was important
enough to be pushed into the next patch.</p>
<h2>4.1.2 -- Patch and Bugfix Release</h2>
<h2>What's Changed</h2>
<ul>
<li>fix: allow set cookies with localhost by <a
href="https://github.com/colincasey"><code>@​colincasey</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/253">salesforce/tough-cookie#253</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/salesforce/tough-cookie/compare/v4.1.1...v4.1.2">https://github.com/salesforce/tough-cookie/compare/v4.1.1...v4.1.2</a></p>
<h2>4.1.1</h2>
<h2>Patch Release</h2>
<h2>What's Changed</h2>
<ul>
<li>fix: allow special use domains by default by <a
href="https://github.com/colincasey"><code>@​colincasey</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/249">salesforce/tough-cookie#249</a></li>
<li>4.1.1 Patch -- allow special use domains by default by <a
href="https://github.com/awaterma"><code>@​awaterma</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/250">salesforce/tough-cookie#250</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/salesforce/tough-cookie/compare/v4.1.0...v4.1.1">https://github.com/salesforce/tough-cookie/compare/v4.1.0...v4.1.1</a></p>
<h2>4.1.0</h2>
<p>v4.1.0</p>
<p>Minor release, focused mainly on resolving reported issues and some
minor feature work.</p>
<h2>What's Changed</h2>
<ul>
<li>Create CHANGELOG.md by <a
href="https://github.com/ShivanKaul"><code>@​ShivanKaul</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/189">salesforce/tough-cookie#189</a></li>
<li>Missing param validation issue145 by <a
href="https://github.com/medelibero-sfdc"><code>@​medelibero-sfdc</code></a>
in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/193">salesforce/tough-cookie#193</a></li>
<li>Create SECURITY.md by <a
href="https://github.com/ShivanKaul"><code>@​ShivanKaul</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/201">salesforce/tough-cookie#201</a></li>
<li>Create CODE_OF_CONDUCT.md by <a
href="https://github.com/ShivanKaul"><code>@​ShivanKaul</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/200">salesforce/tough-cookie#200</a></li>
<li>Fix for issue <a
href="https://redirect.github.com/salesforce/tough-cookie/issues/195">#195</a>
by <a
href="https://github.com/medelibero-sfdc"><code>@​medelibero-sfdc</code></a>
in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/202">salesforce/tough-cookie#202</a></li>
<li>Add explanation and more special-use domains by <a
href="https://github.com/ShivanKaul"><code>@​ShivanKaul</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/203">salesforce/tough-cookie#203</a></li>
<li>Sync of constructor options for serialization by <a
href="https://github.com/medelibero-sfdc"><code>@​medelibero-sfdc</code></a>
in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/204">salesforce/tough-cookie#204</a></li>
<li>Returned null in case of empty cookie value by <a
href="https://github.com/vsin12"><code>@​vsin12</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/196">salesforce/tough-cookie#196</a></li>
<li>132 str trim not a function by <a
href="https://github.com/awaterma"><code>@​awaterma</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/209">salesforce/tough-cookie#209</a></li>
<li>Fix for issue <a
href="https://redirect.github.com/salesforce/tough-cookie/issues/153">#153</a>
by <a
href="https://github.com/medelibero-sfdc"><code>@​medelibero-sfdc</code></a>
in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/210">salesforce/tough-cookie#210</a></li>
<li>Fix permuteDomain with trailing dot by <a
href="https://github.com/ruoho-sfdc"><code>@​ruoho-sfdc</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/216">salesforce/tough-cookie#216</a></li>
<li>Issue <a
href="https://redirect.github.com/salesforce/tough-cookie/issues/213">#213</a>
-- added gh-actions flow for building and testing tough-co… by <a
href="https://github.com/awaterma"><code>@​awaterma</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/218">salesforce/tough-cookie#218</a></li>
<li>Issue <a
href="https://redirect.github.com/salesforce/tough-cookie/issues/210">#210</a>
-- Updated workflow to use npm install. by <a
href="https://github.com/awaterma"><code>@​awaterma</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/220">salesforce/tough-cookie#220</a></li>
<li>@<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/215">GH-215</a>
-- Tests that document localhost behavior when set as domain. by <a
href="https://github.com/awaterma"><code>@​awaterma</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/221">salesforce/tough-cookie#221</a></li>
<li>fix: MemoryCookieStore methods should exist on the prototype, not on
the class. by <a
href="https://github.com/wjhsf"><code>@​wjhsf</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/226">salesforce/tough-cookie#226</a></li>
<li>Unit test cases for <code>allowSpecialUseDomain</code> option by <a
href="https://github.com/colincasey"><code>@​colincasey</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/225">salesforce/tough-cookie#225</a></li>
<li>[Snyk] Upgrade universalify from 0.1.2 to 0.2.0 by <a
href="https://github.com/snyk-bot"><code>@​snyk-bot</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/228">salesforce/tough-cookie#228</a></li>
<li>React Native Support by <a
href="https://github.com/colincasey"><code>@​colincasey</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/227">salesforce/tough-cookie#227</a></li>
<li>Adding Updating CODEOWNERS with ECCN as per Export Control
Compliance by <a
href="https://github.com/svc-scm"><code>@​svc-scm</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/223">salesforce/tough-cookie#223</a></li>
<li>fix: domain match routine by <a
href="https://github.com/colincasey"><code>@​colincasey</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/236">salesforce/tough-cookie#236</a></li>
<li>Stop using the internal NodeJS punycode module by <a
href="https://github.com/gboer"><code>@​gboer</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/238">salesforce/tough-cookie#238</a></li>
<li>Initial documentation review by <a
href="https://github.com/mcarey86"><code>@​mcarey86</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/234">salesforce/tough-cookie#234</a></li>
<li>fix: distinguish between no samesite and samesite=none by <a
href="https://github.com/colincasey"><code>@​colincasey</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/240">salesforce/tough-cookie#240</a></li>
<li>Prepare tough-cookie 4.1 for publishing (updated GitHub actions,
move… by <a
href="https://github.com/awaterma"><code>@​awaterma</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/242">salesforce/tough-cookie#242</a></li>
<li>4.1.0 release to NPM by <a
href="https://github.com/awaterma"><code>@​awaterma</code></a> in <a
href="https://redirect.github.com/salesforce/tough-cookie/pull/245">salesforce/tough-cookie#245</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="4ff4d29f6c"><code>4ff4d29</code></a>
4.1.3 release preparation, update the package and lib/version to 4.1.3.
(<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/284">#284</a>)</li>
<li><a
href="12d474791b"><code>12d4747</code></a>
Prevent prototype pollution in cookie memstore (<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/283">#283</a>)</li>
<li><a
href="f06b72d1d4"><code>f06b72d</code></a>
Fix documentation for store.findCookies, missing allowSpecialUseDomain
proper...</li>
<li><a
href="b1a8898ee3"><code>b1a8898</code></a>
fix: allow set cookies with localhost (<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/253">#253</a>)</li>
<li><a
href="ec707966e6"><code>ec70796</code></a>
4.1.1 Patch -- allow special use domains by default (<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/250">#250</a>)</li>
<li><a
href="d4ac5801dd"><code>d4ac580</code></a>
fix: allow special use domains by default (<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/249">#249</a>)</li>
<li><a
href="79c2f7d373"><code>79c2f7d</code></a>
4.1.0 release to NPM (<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/245">#245</a>)</li>
<li><a
href="4fafc179a7"><code>4fafc17</code></a>
Prepare tough-cookie 4.1 for publishing (updated GitHub actions, move
Dockerf...</li>
<li><a
href="aa4396da7a"><code>aa4396d</code></a>
fix: distinguish between no samesite and samesite=none (<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/240">#240</a>)</li>
<li><a
href="b8d751188d"><code>b8d7511</code></a>
Modernize README (<a
href="https://redirect.github.com/salesforce/tough-cookie/issues/234">#234</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/salesforce/tough-cookie/compare/v4.0.0...v4.1.3">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=tough-cookie&package-manager=npm_and_yarn&previous-version=4.0.0&new-version=4.1.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.

[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-10 11:23:24 -07:00
Tianlei Wu
b8f6235f11
Update stable diffusion benchmark for TensorRT EP (#16560)
### Description

Add Stable Diffusion Text2Image pipelines of TensorRT EP and CUDA EP.
They can automatically export and optimize ONNX model, and create
ONNXRuntime session to use TensorRT EP or CUDA execution provider.

Add support for benchmarking TensorRT.

Add support of cuda graph. The feature is only supported in nightly
package right now.

Engine/Provider to test | command line
---- | ---
CUDA EP | `python benchmark.py -v 1.5`
CUDA EP with cuda graph | `python benchmark.py -v 1.5
--enable_cuda_graph`
TensorRT EP | `python benchmark.py -v 1.5 -r tensorrt`
TensorRT EP with cuda graph | `python benchmark.py -v 1.5 -r tensorrt
--enable_cuda_graph`
TensorRT | `python benchmark.py -v 1.5 -e tensorrt`

Add benchmark numbers of T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch
1.13.1+cu11.7, TensorRT 8.6.1, onnxruntime-gpu 1.15.1 (or
ort-nightly-gpu 1.16 for cuda graph).

TODO: add benchmark numbers of A100-80GB

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-10 09:51:03 -07:00
PeixuanZuo
2fd5e1cc39
[ROCm] fix shell bug (#16641)
`set -ex` with `grep` will exit when grep doesn't meet any string.
2023-07-10 17:31:27 +08:00
PeixuanZuo
3b729e5d2f
[ROCm] use cupy for GPU-accelerated computing (#16611)
kernel explorer has lots of tests and need numpy to verify the results
of GPU kernels, it will make CPU utilization very high. This PR use
`cupy ` to replace `numpy` to do compute on GPU to reduce CPU
utilization.

set `KERNEL_EXPLORER_TEST_USE_CUPY=1` to enable cupy.
2023-07-10 17:17:39 +08:00
cloudhan
5fee3f4302
Remove the special min cmake for rocm (#16570)
#15807 fixed the building error for rocm with cmake 3.26. The
specialized relaxation of the cmake version is not needed anymore.
2023-07-10 13:19:48 +08:00
PeixuanZuo
2f56815344
[Whisper] Fix provider in export procress (#16545)
type of parameter `provider` is string.
2023-07-10 11:55:36 +08:00
PeixuanZuo
cb4bf4f5c8
[ROCm] Move ROCm build step on CPU only machine (#16596)
- Move ROCm build step on CPU only machine
- Add the performance data of the huggingface bert-large model on the
MI200
- At the beginning of the test step, check the agent's GPU usage and
kill the threads occupying the GPU, which may be left over from previous
tasks that exited abnormally.
- Use different docker images during the build and test steps. The
difference is the `uid` and `user` when build docker image and create
docker container.
2023-07-10 11:55:10 +08:00
pengwa
bcebd3b1ca
Allow upstream for Slice on single axis (#16410)
### Allow upstream for Slice on single axis

#### Benchmark on 8x32GB V100 + DeepSpeed

On Bloom560M model, there is 1.5% throughput gains on the same max batch
size 6.
```
torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_clm.py  --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --per_device_train_batch_size 6 --per_device_eval_batch_size 1 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 200 --logging_steps 1 --use_module_with_loss --deepspeed aml_ds_config_zero_1.json
```

##### Main branch

```
Total overhead: 38957ms where export takes 35493ms.
***** train metrics *****
  epoch                    =       4.08
  train_loss               =     2.6841
  train_runtime            = 0:03:10.67
  train_samples            =       2318
  train_samples_per_second =     50.348
  train_steps_per_second   =      1.049

throughput  per gpu=4.08 * 2318 / (190.67 - 38.957) / 8(gpu) = 7.792 samples/second
```

##### This PR

```
Total overhead: 38649ms where export takes 34946ms.

***** train metrics *****
  epoch                    =       4.08
  train_loss               =     2.6757
  train_runtime            = 0:03:08.08
  train_samples            =       2318
  train_samples_per_second =      51.04
  train_steps_per_second   =      1.063

throughput  per gpu=4.08 * 2318 / (188.08 - 38.649) / 8(gpu) = 7.911 samples/second
```

#### Benchmark on 4x16GB V100 + AutoCast

On Bloom560M model, there is 1.8% throughput gains on the same batch
size, 24% gains with corresponding maximum batch size.

Also it allow ORT run bigger batch size (from 3 to 4) on following
recipe.

```
torchrun --nproc_per_node=4 examples/onnxruntime/training/language-modeling/run_clm.py  --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --per_device_train_batch_size 3 --per_device_eval_batch_size 1 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 200 --logging_steps 1 --use_module_with_loss
```

##### Main branch

```
Total overhead: 4789ms where export takes 3798ms.
***** train metrics *****
  epoch                    =       1.02
  train_loss               =    20.3338
  train_runtime            = 0:01:42.78
  train_samples            =       2343
  train_samples_per_second =     23.349
  train_steps_per_second   =      1.946

throughput  per gpu=1.02 * 2343 / (102.78 - 4.789) / 4(gpu) = 6.097 samples/second
```

##### This PR

```
Total overhead: 4608ms where export takes 3555ms.
***** train metrics *****
  epoch                    =       1.02
  train_loss               =    20.3364
  train_runtime            = 0:01:40.87
  train_samples            =       2343
  train_samples_per_second =     23.792

throughput  per gpu=1.02 * 2343 / (100.87 - 4.608) / 4(gpu) = 6.207 samples/second
```

With this PR, also can run batch size 4 (main branch fails), 

```
Total overhead: 4743ms where export takes 3698ms.
***** train metrics *****
  epoch                    =       1.36
  train_loss               =    20.2096
  train_runtime            = 0:01:50.42
  train_samples            =       2343
  train_samples_per_second =     28.979
  train_steps_per_second   =      1.811


throughput  per gpu= 1.36 *  2343 / (110 - 4.743) / 4(gpu) =7.57 sample/second
```



#### Benchmark on 8x32GB V100 + AutoCast

On Bloom560M model, there is 0.9% throughput gains on the same batch
size, 8.6% gains with corresponding maximum batch size.

Also it allow ORT run bigger batch size (from 3 to 4) on following
recipe.

```
torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_clm.py  --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --per_device_train_batch_size 3 --per_device_eval_batch_size 1 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 200 --logging_steps 1 --use_module_with_loss

```

##### Main branch

```
Total overhead: 55259ms where export takes 51140ms.
***** train metrics *****
  epoch                    =       2.06
  train_loss               =     2.8788
  train_runtime            = 0:02:36.65
  train_samples            =       2318
  train_samples_per_second =      30.64
  train_steps_per_second   =      1.277

throughput  per gpu=2.06 * 2318 / (156.65 - 55.259) / 8(gpu) = 5.887 samples/second
```

##### This PR

```
Total overhead: 55712ms where export takes 51418ms.
***** train metrics *****
  epoch                    =       2.06
  train_loss               =     2.8696
  train_runtime            = 0:02:36.19
  train_samples            =       2318
  train_samples_per_second =     30.731
  train_steps_per_second   =       1.28

throughput  per gpu=2.06 * 2318/ (156.19 - 55.712) / 8(gpu) = 5.940 samples/second
```

With this PR, also can run batch size 4 (main branch fails), 

```
Total overhead: 54238ms where export takes 49899ms.
***** train metrics *****
  epoch                    =       2.74
  train_loss               =     2.7692
  train_runtime            = 0:02:58.47
  train_samples            =       2318
  train_samples_per_second =     35.859
  train_steps_per_second   =      1.121

throughput  per gpu= 2.74 * 2318 / (178.47 - 54.238) / 8(gpu) =6.391sample/second
```
2023-07-10 08:36:11 +08:00
Yulong Wang
67f4cd54fa
fix JSEP build break (#16636)
### Description
fix JSEP build break. the build break was caused by enabling
`-Wshorten-64-to-32` while merging the CI.
2023-07-09 08:53:11 -07:00
satyajandhyala
00e8f2a2a9
[Web/JS] Add ConvTranspose support (#16433)
### Description
Add ConvTranspose support for WebGPU


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-08 11:10:50 -07:00
Yulong Wang
5c6613875c
[js/web] [JSEP] allow passing data in kernel compute (#16621)
### Description
allow passing data in OpKernel::Compute() from C++ to JS.
2023-07-07 14:27:30 -07:00
cao lei
329e8156d4
clean unused parameter in ORT_UNUSED_PARAMETER (#16538)
### Description
clean unused parameter in ORT_UNUSED_PARAMETER


### Motivation and Context
clean unused parameters in ORT_UNUSED_PARAMETER which are introduced
from #15833
2023-07-07 13:20:36 -07:00
satyajandhyala
e55a20ece8
[Web/JS] Added Split operator support. (#16567)
### Description
Added WeGPU/JSEP Split operator support.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-07 12:16:10 -07:00
Adrian Lizarraga
dd13252506
[QNN EP] Support 1D Conv/ConvTranspose (#16563)
### Description
- Adds support for 1D (rank 3) convolutions to QNN EP
- Implements 1D convolutions as 2D convolutions with height == 1.
Reshape nodes are added at the inputs and outputs as necessary.
- Adds more unit tests for Conv and ConvTranspose (2D and 1D).


### Motivation and Context
Allow more models to run on QNN EP.
2023-07-07 10:37:49 -07:00
satyajandhyala
5933a183df
[Web/JS] Added missing L1Reduce and L2Reduce oprator kernels. (#16580)
### Description
Add missing L1Reduce and L2Reduce operator kernels.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-07 09:55:55 -07:00
cloudhan
01c5d05712
Avoid repeated GemmSoftmaxGemmPermuteTunableOp<HipT> ctor invocation (#16518)
The `GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct,
avoid the ctor invocation will substantially improve the launch time and
get better performance during the decoding. This get <7% e2e time
reduction of whisper large.
2023-07-08 00:25:07 +08:00
Xavier Dupré
47a0289ee6
[CI] Removes type2 in process_registration and fix Windows GPU Reduced Ops CI Pipeline (#16530)
### Description
Windows GPU Reduced Ops CI Pipeline is broken due to the introduction of
a second template type in registered kernels. The python code checking
the registration is broken due to that. This PR addresses this issue on
the python side by keeping only one type equal to the concatenation of
the two types.
2023-07-07 18:21:06 +02:00
Edward Chen
6be7b03e53
Enable -Wshorten-64-to-32 warning if available. (#16524)
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
2023-07-07 08:11:44 -07:00
Edward Chen
e22b0836e7
[objc] Update docs and fix static analysis build (#16617)
- Update some documentation comments.
- Use onnxruntime_training.h as the umbrella header so training API docs are included in generated docs.
- Fix static analysis build.
2023-07-07 07:58:54 -07:00
Vincent Wang
2a11f29eaa
[CUDA] Optimize BiasGelu/BiasGeluGrad Kernel (#16608)
The PR optimizes BiasGelu/BiasGeluGrad CUDA kernel by 3 changes:
- Use Erf instead of Normcdf for half compute
- Change CUDA thread organization for BiasGelu kernel instead of using
binary elementwise template
- Add vectorized support

Using BiasGelu(A[256, 128, 768] + B[768]) in V100 as example, the perf
number below are in us
Before change, FW: 246.37, BW: 292.77
Use Erf, FW: 152.86, BW: 238.98
All above changes, FW: 132.45, BW: 199.14

For Huggingface's bertweet-base model, with the changes, the step time
(FW+BW) reduces from 324.71766 ms to 316.42552 ms, which is 1.026x
faster.

Using Erf is for half data only, evaluation shows that for float on
CUDA, Normcdf is faster. I didn't check the perf for BFloat16 or on AMD,
so keep them unchanged.
2023-07-07 08:28:38 +08:00
Scott McKay
697dd12f6e
Re-organize the transpose optimization and layout transformation files. (#16246)
### Description
<!-- Describe your changes. -->
Split out the more basic changes from #15552 for easier review.

Re-organize to clarify the structure
- Separate out generic base functionality from ORT specific components
  - pass in handlers for internal ORT ops to Optimize
- Split out layout transformation from transpose optimization
- Separate out level 1 transpose optimizer
- Cleanup some naming to try and clarify things like an optimizer vs.
general optimization code

Most of the changes are from this movement of code.

Two implementation changes:
- the extended handlers are queried first in GetHandler
- allows the extended handlers to override the default behaviour for an
ONNX operator
- simplify the Optimize function to remove OptimizerMode. 
- `can_modify_node` is used instead of `mode` and
`ignore_assigned_nodes` and a long description of the current usage is
added. I don't _think_ that changes the current behavior and hopefully
clarifies what happens and when, and makes the base transpose optimizer
implementation more generic.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Create a cleaner separation to support adding EP specific logic next to
cleanly handle where an EP has additional layout sensitive behaviour
required (e.g. it's Resize implementation only handles one layout).
2023-07-07 08:24:47 +10:00
Hariharan Seshadri
5ffd58c8e6
Fix Reduced Ops pipeline (#16612) 2023-07-06 14:32:59 -07:00
Dmitri Smirnov
51c42ae64a
[C#] Allow users to quickly populate native string buffers with utf8 bytes (#16559)
### Description
Introduce an API that allows users to gain access to a string tensor
element buffer of requested length in bytes
so then can quickly load any utf8 data.

### Motivation and Context
Useful for testing an otherwise.
2023-07-06 09:51:26 -07:00
Yulong Wang
d13f3153d7
[js/webgpu] enable op test for webgpu (#16542)
### Description
This change enables the JSON-format operator tests for webgpu.

Usage:

```
npm test -- op abs.jsonc -b=webgpu
```
2023-07-06 08:35:19 -07:00
Xavier Dupré
d906d48ae9
Support custom ops taking float 8 tensors as inputs and outputs (#16323)
### Description
C API for custom ops does not support float 8 types. This PR changes
that.



### Motivation and Context
The list of operators supporting float 8 is very limited. It should be
extended to custom ops to let developpers add customized operators for
these specific types.
2023-07-06 14:36:06 +02:00
zhangsibo1129
180292f426
Modify CANN EP to align with the EP API refactor and fix CANN CI (#16490)
Modify CANN EP `CANNExecutionProvider::CreatePreferredAllocators`,
`CANNExecutionProvider::CreateCannAllocator`
to align with the EP API refactor and fix CANN CI for
https://github.com/microsoft/onnxruntime/pull/15833#issuecomment-1601568295
in this [PR](https://github.com/microsoft/onnxruntime/pull/15833)
2023-07-05 19:24:32 -07:00
cloudhan
b84c63db2a
Fix ORT_RETURN condition mistake (#16520)
In #16339, the `ORT_ENFORCE(cuda_device_arch_ >= 530` (throw) it changed
to `ORT_RETURN_IF` (Status) but the condition is negated. This fixes the
problem.
2023-07-06 09:13:13 +08:00
Yi Zhang
fed08e070a
Add compiler cache in linux wasm build (#16579)
### Description
Add compiler cache in wasm build to accelerate web ci

### Motivation and Context
It could reduce the pipeline duration by 30 minutes.
web ci could be completed in 2 hours with cache.

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1053219&view=results
2023-07-06 06:58:48 +08:00
Vrajang Parikh
fd8ad9b950
Enable iOS packaging for training (#16525)
### Description
Enable support for building iOS packages/CocoaPods with training API

- Add `Training` Package variant and config files in current iOS
packaging utilities to enable creation of training packages

### Motivation and Context
This PR introduces new `Training` variant in
`build_and_assemble_ios_pods.py` script which allows creating pods for
iOS with training API enabled.

The sample script to build training pods:

```
python3 tools/ci_build/github/apple/build_and_assemble_ios_pods.py --variant Training \
--build-settings-file  tools/ci_build/github/apple/default_full_ios_training_framework_build_settings.json \ 
-b=-- path_to_protoc_exe=<path/to/protoc>
``` 

Note: build settings file should have `--enable_training` as a build
parameter.


Simply adding training packaging increases the duration of the Azure
pipeline for packaging by 70 minutes. To address this issue, we need to
parallelize pod creation. In order not to further strain the pipeline,
the changes for training packaging will be added in another PR, which
optimizes the packaging pipeline.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-07-05 13:27:59 -07:00
Adam Pocock
ba91457183
[java] Adding addExternalInitializers and addInitializer to OrtSession.SessionOptions (#16198)
### Description
Adds support for adding external initializers or overriding initializers
to a session options from Java.

### Motivation and Context
We want to instantiate large models from Java without filesystem access.

cc @yuslepukhin
2023-07-05 12:51:59 -07:00
Yulong Wang
661fd4b978
[js/rn] always use 'typescript' from /js/ folder (#16554)
### Description
always use 'typescript' from /js/ folder. This allows all NPM packages
to use the same typescript version.

- remove 'typescript' from /js/react_native/package.json. use the one
from /js/package.json
- remove unused '@types/fs-extra'
2023-07-05 12:26:56 -07:00
satyajandhyala
a7c892106d
[Web/JS] Support WebGPU Concat operator (#16543)
### Description
Add Concat operator



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-05 11:59:45 -07:00
Chi Lo
d8792f8040
Fix TRT EP allocator memory leak (#16552)
Fix memory leak issue which comes from TRT EP's allocator object not
being released upon destruction.
Following is the log from valgrind:
```
==1911860== 100,272 (56 direct, 100,216 indirect) bytes in 1 blocks are definitely lost in loss record 1,751 of 1,832
==1911860==    at 0x483CFA3: operator new(unsigned long) (vg_replace_malloc.c:472)
==1911860==    by 0x315DC2: std::_MakeUniq<onnxruntime::OrtAllocatorImplWrappingIAllocator>::__single_object std::make_unique<onnxruntime::OrtAllocatorImplWrappingIAllocator, std::shared_ptr<onnxruntime::IAllocator> >(std::shared_ptr<onnxruntime::IAllocator>&&) (unique_ptr.h:857)
==1911860==    by 0x30EE7B: OrtApis::KernelContext_GetAllocator(OrtKernelContext const*, OrtMemoryInfo const*, OrtAllocator**) (custom_ops.cc:121)
==1911860==    by 0x660D115: onnxruntime::TensorrtExecutionProvider::Compile(std::vector<onnxruntime::IExecutionProvider::FusedNodeAndGraph, std::allocator<onnxruntime::IExecutionProvider::FusedNodeAndGraph> > const&, std::vector<onnxruntime::NodeComputeInfo, std::allocator<onnxruntime::NodeComputeInfo> >&)::{lambda(void*, OrtApi const*, OrtKernelContext*)#3}::operator()(void*, OrtApi const*, OrtKernelContext*) const (tensorrt_execution_provider.cc:2223)
```
This issue happens after this [EP allocator
refactor](https://github.com/microsoft/onnxruntime/pull/15833)
2023-07-05 09:25:05 -07:00
Aditya Goel
9799d43c36
LabelEncoder kernel creation improvement (#16516) 2023-07-05 07:09:20 -07:00
PeixuanZuo
e2526714e2
[ROCm] Move MIGraphX build step on CPU only machine (#16582)
- Move MIGraphX build step on CPU only machine
- Use ccache on build step
- Not pass host uid into docker build process.
2023-07-05 13:55:28 +08:00