Commit graph

9068 commits

Author SHA1 Message Date
dependabot[bot]
c8a94f1ef7
Bump actions/setup-dotnet from 2 to 3 (#16403)
Bumps [actions/setup-dotnet](https://github.com/actions/setup-dotnet)
from 2 to 3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/setup-dotnet/releases">actions/setup-dotnet's
releases</a>.</em></p>
<blockquote>
<h2>v3.0.0</h2>
<p>This major release includes the following
<strong>changes:</strong></p>
<ul>
<li><a
href="https://redirect.github.com/actions/setup-dotnet/issues/219">#219</a>
New input <code>dotnet-quality</code> was added in <a
href="https://redirect.github.com/actions/setup-dotnet/pull/315">#315</a>:</li>
</ul>
<pre lang="yaml"><code>    - uses: actions/setup-dotnet@v3
      with:
        dotnet-version: '6.0.x'
        dotnet-quality: 'preview'
    - run: dotnet build &lt;my project&gt;
</code></pre>
<p>More in detail <a
href="https://github.com/actions/setup-dotnet#using-the-dotnet-quality-input">here</a>.</p>
<ul>
<li><a
href="https://redirect.github.com/actions/setup-dotnet/issues/241">#241</a>
The output variable <code>dotnet-version</code> which contains the
installed by the action SDK version was added in <a
href="https://redirect.github.com/actions/setup-dotnet/pull/324">#324</a>:</li>
</ul>
<pre lang="yaml"><code>    - uses: actions/setup-dotnet@v3
      id: cp310
      with:
        dotnet-version: '3.1.422'
    - run: echo '${{ steps.cp310.outputs.dotnet-version }}' # outputs 3.1.422
</code></pre>
<p>More in detail <a
href="https://github.com/actions/setup-dotnet/tree/main#dotnet-version">here</a>.</p>
<ul>
<li>The <code>dotnet-version</code> syntax was updated and now it allows
to specify the prerelease version without using
<code>include-prerelease</code> input. The
<code>include-prerelease</code> input was cut out:</li>
</ul>
<pre lang="yaml"><code>    - uses: actions/setup-dotnet@v3
      with:
        dotnet-version: '5.0.0-preview.6'
</code></pre>
<p>More in detail <a
href="https://github.com/actions/setup-dotnet#supported-version-syntax">here</a>.</p>
<ul>
<li><a
href="https://redirect.github.com/actions/setup-dotnet/issues/251">#251</a>
The problem with out of support .NET version warnings was solved in <a
href="https://redirect.github.com/actions/setup-dotnet/pull/315">#315</a>.</li>
</ul>
<p><strong>Breaking changes</strong>:</p>
<ul>
<li>Installation paths for Windows and Ubuntu images were changed to
match the location of pre-installed SDKs. In more detail, read <a
href="https://github.com/actions/setup-dotnet/blob/main/docs/adrs/v3-setup-dotnet.md#breaking-changes">here</a>.</li>
</ul>
<h2>Add support for Windows-arm</h2>
<p>In scope of this release we <a
href="https://redirect.github.com/actions/setup-dotnet/pull/320">add
support for Windows-arm</a>. Besides, we change getInput to <a
href="https://redirect.github.com/actions/setup-dotnet/pull/250">getBooleanInput</a>
for include-prerelease.</p>
<h2>Package updates, support for global json file in a subdirectory,
installer scripts updates</h2>
<p>This release includes the following PRs:</p>
<ul>
<li>Adding support for the <code>global-json-file</code> input: <a
href="https://redirect.github.com/actions/setup-dotnet/issues/276">#276</a>
Example of usage:
<pre lang="yaml"><code>- uses: actions/setup-dotnet@v2
  with:
    global-json-file: csharp/global.json
- run: dotnet build &lt;my project&gt;
  working-directory: csharp
</code></pre>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="3447fd6a9f"><code>3447fd6</code></a>
feat: Cache NuGet global-packages folder (<a
href="https://redirect.github.com/actions/setup-dotnet/issues/303">#303</a>)</li>
<li><a
href="916351aac9"><code>916351a</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/setup-dotnet/issues/430">#430</a>
from akv-platform/remove-implicit-dependencies</li>
<li><a
href="1ad2e312fa"><code>1ad2e31</code></a>
Add missing dependency</li>
<li><a
href="e3f84b8f7a"><code>e3f84b8</code></a>
Install eslint-plugin-node</li>
<li><a
href="ba848a34bb"><code>ba848a3</code></a>
Update configuration files</li>
<li><a
href="aa983c550d"><code>aa983c5</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/setup-dotnet/issues/428">#428</a>
from akv-platform/add-latest-patch-syntax</li>
<li><a
href="b891376106"><code>b891376</code></a>
Merge branch 'main' into add-latest-patch-syntax</li>
<li><a
href="b05a3f26b3"><code>b05a3f2</code></a>
Fix review points, rebuild solution</li>
<li><a
href="5fdecd2063"><code>5fdecd2</code></a>
Increase amount of retries for Dotnet installation scripts tests (<a
href="https://redirect.github.com/actions/setup-dotnet/issues/427">#427</a>)</li>
<li><a
href="38b49fb717"><code>38b49fb</code></a>
Fix informational and debug messages</li>
<li>Additional commits viewable in <a
href="https://github.com/actions/setup-dotnet/compare/v2...v3">compare
view</a></li>
</ul>
</details>
<br />



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-01 09:16:28 +08:00
Dmitri Smirnov
f5b2d213eb
Fix nuget pipeline (#16553)
### Description
Address test class visibility.

### Motivation and Context
Fixes NuGetPackaging pipeline
2023-06-30 17:32:06 -07:00
petermcaughan
47f136e2d3
Speed Up Whisper Export (#16504)
### Description
Add a greedy option to the initializer deduplication process in the
Whisper export.

Currently, to detect shared initializers, ORT compares every initializer
against every other initializer (O(n^2) comparisons). In the comparison
operator, if the two initializers have different data types (e.g. raw_data
and int_64), both initializers are converted to a numpy array and the cast
results are compared. This cast happens in every such comparison, so its
cost is multiplied by the quadratic number of comparisons. This cast
operation is the bottleneck for the current Whisper export script.

The conversion to the numpy array is useful for detecting equal
initializer values across nodes of different data types (e.g.
recognizing a bias value of 0.0 is the same as a slice index of 0) but
isn't triggered when comparing initializers of the same data type (e.g.
weight value of 0.6 == weight value of 0.6). The latter case is where
the majority of utility is for Whisper, and so by eliminating our path
for comparing numpy arrays for initializers we save a lot of time for
minimal cost.

In other words, this PR adds an option to remove the ability to detect
shared initializers of different types (e.g. Slice Index and MatMul
Constant) while retaining the ability to deduplicate weights.
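
The trade-off described above can be sketched in plain Python. This is an illustrative stand-in, not the export script's code: initializers are modeled as hypothetical `(name, dtype, raw_bytes)` tuples, and `dedup_greedy` is an assumed name. It groups initializers by dtype and raw bytes in a single pass instead of comparing every pair with a cast.

```python
import struct


def dedup_greedy(initializers):
    """Map each duplicate initializer name to the first name with the same dtype and bytes.

    `initializers` is a list of hypothetical (name, dtype, raw_bytes) tuples.
    Equal values stored under different dtypes are never merged, which is the
    capability the greedy option gives up.
    """
    first_seen = {}     # (dtype, raw_bytes) -> name of the first occurrence
    replacements = {}   # duplicate name -> name to reuse instead
    for name, dtype, raw in initializers:
        key = (dtype, raw)
        if key in first_seen:
            replacements[name] = first_seen[key]
        else:
            first_seen[key] = name
    return replacements


inits = [
    ("w_a", "float32", struct.pack("<f", 0.6)),
    ("w_b", "float32", struct.pack("<f", 0.6)),   # same dtype and bytes: merged
    ("bias", "float32", struct.pack("<f", 0.0)),
    ("idx", "int64", struct.pack("<q", 0)),       # same value, other dtype: kept
]
print(dedup_greedy(inits))  # {'w_b': 'w_a'}
```

The single pass costs one dictionary lookup per initializer, at the price of missing the cross-dtype matches (such as `bias` and `idx` here) that the cast-based comparison would find.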



### Motivation and Context
- Current time to export Whisper-large is prohibitive.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-06-30 12:22:30 -07:00
Yulong Wang
708dec5d95
[js/webgpu] allow 0 sized tensor for tensor view (#16540)
### Description
Allow 0-sized tensors for tensor view.
2023-06-30 12:05:04 -07:00
satyajandhyala
3be6eb53c8
[JS/Web] Fixed the output indexing in the shader code when the output is 1-dim. (#16508)
### Description
Modified the indexing into outputIndices in the shader code. When the output
is 1-dim, outputIndices is not a vector, and indexing into it results in an
error.



### Motivation and Context
Fix the problem in the Reduce Ops implementation in WebGPU.
2023-06-30 09:42:38 -07:00
Mike Guo
aeaa1d650f
make optimized_model_path be in temp folder instead of source model folder for transformer optimization (#16531)
optimize_model generates a temporary model in the folder of the input
model. Most of the time, this is fine.

However, this breaks when the function runs against an input model mounted
from AzureML, because the mounted folder is read-only and optimize_model
fails when it tries to create the optimized model there. The workaround is
to copy the model to another temp folder before calling optimize_model, but
copying is painful, especially when the model is huge.

This PR exposes optimized_model_path at the optimize_model level so that
the caller can decide where to save the temporary model.
2023-06-30 20:19:51 +08:00
Scott McKay
2fd25de360
Use verbose logging in Android emulator in React Native CI (#16528)
### Description
Set emulator logging to verbose to see if it helps with intermittent
React Native CI failures when emulator crashes at startup


### Motivation and Context
2023-06-30 11:51:20 +10:00
Baiju Meswani
5b3447beef
Allow disable exceptions to work with ort-extensions (#16536)
2023-06-29 18:24:33 -07:00
JiCheng
0051497055
Fix Comments (#16513)
### Description
Address comments in #14040 


### Motivation and Context

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-06-30 10:37:57 +10:00
Edward Chen
05c4566fe9
[objc] Fix possible leak of OrtValue in initializer. (#16487)
Fix possible leak of OrtValue in initializer. There was a possible early return before ownership was transferred to the internal C++ Ort::Value.
2023-06-29 17:37:16 -07:00
pengwa
8fc3037ff4
Support SCELossInternal/SCELossInternalGrad run with larger sized input (#16363)
### Support SCELoss/SCELossGrad run with larger sized input

#### Motivation and Context: Run bigger batch size for Bloom model. 
For the Bloom560M model, ORT has the potential to run a bigger batch size,
from initially 6 to now 10. The input size of SCELoss/SCELossGrad is
Bsz * 1023 * 250680. When Bsz is bigger than 8, the total element count
cannot be represented by int32_t, which those kernels use to pass the total
element count. The silent overflow causes other, indirect exceptions, or
wrong results without any error.
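
The batch-size threshold can be checked with a few lines of arithmetic. The sketch below uses only the two dimensions quoted in the message (1023 and 250680). The `wrap_int32` helper is an illustrative assumption that models two's-complement wraparound, which C++ does not guarantee for signed overflow.

```python
INT32_MAX = 2**31 - 1
SEQ_LEN = 1023    # dimension quoted in the commit message
VOCAB = 250680    # dimension quoted in the commit message


def element_count(batch_size: int) -> int:
    """Total number of elements in the loss input for a given batch size."""
    return batch_size * SEQ_LEN * VOCAB


def wrap_int32(n: int) -> int:
    """Value a 32-bit signed counter would hold, assuming two's-complement wraparound."""
    return (n + 2**31) % 2**32 - 2**31


for bsz in (6, 8, 9, 10):
    n = element_count(bsz)
    print(f"Bsz={bsz}: {n} elements, fits in int32: {n <= INT32_MAX}, "
          f"wrapped value: {wrap_int32(n)}")
```

With these dimensions, a batch size of 8 still fits (2,051,565,120 elements), while 9 exceeds the int32 range (2,308,010,760 elements).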


#### Changes in this PR

- For SCELossInternal/SCELossGradInternal CUDA kernels, use uint64_t if
total element count is bigger than int32::max() to pass all element
count and element index for the ops mentioned above.
- For SCELossInternal/SCELossGradInternal CPU kernels, 
   - always use uint64_t to pass the element count. 
- update the Eigen functions involved in the two kernels'
implementations, to use `ptrdiff_t` to pass element count instead of
original `int`.
- Parallelize SCELossInternal/SCELossGradInternal CPU kernels,
otherwise, it is super slow when handling so many elements.
  
- Other changes needed:
- Add `CompareOrtValueNumerals` to compare two OrtValues with different
data types (float or float16) without the caller explicitly converting to
the lower-precision data type. The comparison is also done in parallel,
which reduces the comparison time for the large UT case from 22s to
~1.6s.
- The check in `IsResultCloselyMatch` is buggy for nan/inf cases, so fix
the bugs.
- The cross entropy tests run the CPU baseline in float, and the result is
then compared with the float16 results of CUDA runs. But there is a
precision issue when checking the results: the randomized input data is
represented in float, and the CPU uses it directly while CUDA uses a
float16 version of it, so the inputs themselves differ in precision. As
the test data count increases, the results fail even at a tolerance of 1e-2.
The fix is to generate the data in float16, convert it to float for the CPU
run, and use float16 directly for the CUDA runs. When comparing the outputs,
cast the CPU float results back to float16, then compare with the CUDA outputs.
- `RandomValueGenerator` takes about 20 seconds for the large size, so
`ParallelRandomValueGenerator` is added to generate random input in
parallel; it takes less than 2s to prepare the input data.
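
The input-precision fix in the cross entropy tests can be illustrated with the standard library alone. This is a sketch of the idea, not the test code: `to_float16` is an assumed helper built on `struct`'s half-precision format, and Python's 64-bit floats stand in for the CPU's float inputs.

```python
import random
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value, returned as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


random.seed(0)
raw = [random.uniform(-1.0, 1.0) for _ in range(1000)]

# Old scheme: the CPU baseline consumes `raw`, the float16 run consumes a rounded copy,
# so the two runs start from slightly different inputs.
old_max_input_diff = max(abs(x - to_float16(x)) for x in raw)

# New scheme: generate in float16 first, then widen for the CPU baseline.
half_inputs = [to_float16(x) for x in raw]
cpu_inputs = [float(h) for h in half_inputs]  # widening a float16 value is exact
new_max_input_diff = max(abs(c - h) for c, h in zip(cpu_inputs, half_inputs))

print(old_max_input_diff > 0.0, new_max_input_diff == 0.0)
```

Because widening is exact, both runs see bit-identical input values under the new scheme, and any remaining difference comes from the computation itself.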

#### Non-goals

`SoftmaxCrossEntropyLoss` && `SoftmaxCrossEntropyLossGrad` are not
covered in this PR.
2023-06-30 08:36:06 +08:00
Dmitri Smirnov
322237f482
[C#] Implement OrtValue APIs (#16206)
### Description

Expose the `OrtValue` class API as a first-class citizen.
Make it similar to the C++ API.
Enable safe direct native memory access.
Make string tensor manipulation more efficient.
Avoid intermediate structures such as `NamedOnnxValue`,
`DisposableNamedOnnxvalue`, etc.

Provide more examples with `IOBinding`, although the `OrtValue` API
potentially makes `IOBinding` redundant for most scenarios, since
`OrtValue` can be created on top of any memory.

Run all the pre-trained models now with the `OrtValue` API as well.
Obsolete the `OrtExternalMemory` class. Obsolete the IOBinding API that takes
`FixedBufferOnnxValue`.

### Motivation and Context
Make the API efficient and uniform with C++.

This aspires to address: 
https://github.com/microsoft/onnxruntime/issues/14918
https://github.com/microsoft/onnxruntime/issues/15381

Cc: @Craigacp
2023-06-29 08:59:23 -07:00
Edward Chen
9b2733de8e
[docs] Specify Objective-C max line length. (#16503)
Update coding standards doc to specify Objective-C max line length of 120 to be consistent with C++.
2023-06-28 16:58:23 -07:00
cao lei
0c5f492493
remove AllocatorMgr class (#16509)
### Description
Remove AllocatorManager class


### Motivation and Context
After the refactor PR #15833 is in, AllocatorManager class is not
referenced anymore.
2023-06-28 15:43:19 -07:00
Hariharan Seshadri
ff0894e540
Simplify gating check for CUDA Graph usage (#16491)
2023-06-28 15:25:34 -07:00
Baiju Meswani
efeb6672d6
Temporary optimizer support for ort format models in non minimal build (#16485)
2023-06-28 11:35:57 -07:00
Vrajang Parikh
960e320dff
Objective C Training API: TrainingSession (#16374)
### Description
- Implement Objective-C binding for `ORTTrainingSession`
- Add `ORTUtils` utility class to handle conversion between C++ and
Objective-C types
- Add test case for saving checkpoint
- Add unit test cases for `ORTTrainingSession`

### Motivation and Context
This PR is part of implementing Objective-C bindings for training API.
It implements objective-c binding for training session. The objective-C
API closely resembles the C++ API.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-06-28 09:13:56 -07:00
Christian Bourjau
6dd4e4801a
Allow custom operator functions to safely propagate errors through the C-API (#16479)
### Description
This PR implements a backward-compatible way to define custom operators
with fallible compute functions. The C++ API templated gained an
optional `Fallible` argument. Closes #14287

### Motivation and Context
#14287 contains more context. The gist is that the current C-API defines
compute operations of custom operators as functions returning `void`
rather than an `OrtStatusPtr`. Currently, errors are often propagated
across the C-ABI using C++ exceptions. That is very unsafe and undefined
behavior. Moreover, it is difficult for languages other than C++ to use
this approach even if they wanted to. A C-compliant sound and safe way
to propagate errors allows for non-C++ fallible custom operators.

### An example in action
https://github.com/cbourjau/ort-custom-op/pull/6/files is a
demonstration of how this PR can be used to write safe and fallible
custom operators in Rust.
2023-06-28 08:16:32 -07:00
cloudhan
15f16ef36e
[ROCm] Add DecoderMaskedMultiHeadAttention (#16362)
Reuse MultiHeadAttention to implement DecoderMaskedMultiHeadAttention.
2023-06-28 19:18:07 +08:00
Baiju Meswani
cbfbe210a8
Fix bug that accidentally disabled training op tests (#16488)
2023-06-27 18:39:54 -07:00
Yi Zhang
fb7e1f133f
[Fix] TSA Upload failed in nuget pipeline. (#16476)
### Description
Partially revert PR #16244.


### Motivation and Context
The npm pipeline can't be triggered if the nuget pipeline status is warning.


### Test Run

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=321873&view=logs&s=b17bed5b-cc14-5026-390a-fb2feea063f2
2023-06-28 06:40:49 +08:00
cao lei
e5270e3b4f
shared allocator for on device training (#16432)
### Description
New logic to share allocators among module, optimizer and eval sessions
for Training scenario



### Motivation and Context
Previously, on-device training shared allocators by sharing the EP. With the
new mechanism for sharing allocators, we need to explicitly register the
allocator in the environment.

---------

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-06-27 15:10:42 -07:00
Ryan Hill
1001ec93a7
Ryanunderhill/beamscorer gpu (#16272)
### Description
Make BeamScorer run on the GPU vs the CPU.

Brief overview:
- Adds a CUDA 'CudaBeamSearchScorer' implementation of IBeamScorer
- Instead of a 'done' flag per beam, there is one single 'not done'
variable that is copied to the CPU every iteration
- Removes some of the extra CPU side buffers and parameters that are no
longer needed

Remaining future optimizations:
CPU copied beam indices is still used in the non
DecoderMaskedSelfAttention case. An extra kernel can be written to avoid
PickGptPasteState needing CPU copied beam indices (called from
UpdateGptFeeds).

### Motivation and Context
It's faster to keep the work on the GPU to avoid GPU->CPU->GPU copies of
data.
2023-06-27 15:08:44 -07:00
Adrian Lizarraga
f5e9625c36
[QNN EP] Properly skip HTP test on x64 (#16500)
### Description
Fixes a typo that prevents skipping a test that targets the QNN HTP
backend on Windows x64.



### Motivation and Context
- Windows x64 machines cannot load/run the QNN HTP backend. Therefore,
we need to skip such tests on Windows x64.
- Fixes the QNN_Nuget_Windows pipeline.
2023-06-27 14:59:46 -07:00
Rachel Guo
892b1b19ea
[js/rn] limit x86_64 arch in detox xcodebuild for react native e2e test (#16460)
### Description

As title.




### Motivation and Context

Works with local onnxruntime-c pod in js/rn/e2e test.
2023-06-27 09:45:04 -07:00
Michael Klimenko
c3db1d3628
Replace float_t with float (#16484)
A couple of places in onnxruntime used `float_t` data type alias as an
alternative to `float`. However, this is not entirely correct, since
`float_t` is an implementation-defined type alias, which may be `float`,
`double`, `long double` or some other implementation-defined data type,
depending on the state of the internal `FLT_EVAL_METHOD` macro:
https://en.cppreference.com/w/c/numeric/math/float_t

On most major platforms and compilers (clang, GCC, MSVC) this is only a
cosmetic change and will not lead to any changes. However, icpx compiler
(and legacy icc) tends to substitute `float_t` with `long double`,
resulting in a linker error (unresolved reference) to the base onnx
library, that only contains the `ParseData` function for `float` and
`double` as in
[here](9264e09367/onnx/defs/tensor_proto_util.cc (L133-L134)).

Overall, this PR cleans up the implementation-defined behaviour and
enables building onnxruntime with icpx.
2023-06-27 09:28:38 -07:00
guyang3532
4768ac5f30
Fix onnxruntime-CI-nightly-ort-pipeline Failure (#16495)
The image for the onnxruntime-CI-nightly-ort-pipeline is too old.
The ort package in the image is older than the latest test code in the nightly
CI, which causes the nightly CI to fail.
2023-06-27 23:19:23 +08:00
pengwa
a49bb85cfe
Manage ORTModule configurations consistently (#16396)
### Manage ORTModule options

Move all env vars that are used to switch features ON/OFF into runtime
options for consistent management.


Note: the feature switches are assigned in 2 phases: default values first,
then overwritten by env vars (if specified by users). So env vars take the
highest priority when both phases explicitly give a value for one
feature.
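
The two-phase assignment can be sketched as follows. The switch names, the `"1"` convention for "on", and the `resolve_options` function are all hypothetical and chosen only for illustration; they are not the actual ORTModule option names or parsing rules.

```python
import os

# Hypothetical feature switches and their default values (illustrative names only).
DEFAULTS = {"EXAMPLE_FEATURE_A": True, "EXAMPLE_FEATURE_B": False}


def resolve_options(environ=os.environ):
    """Resolve feature switches: defaults first, then env var overrides."""
    options = dict(DEFAULTS)                    # phase 1: default values
    for name in DEFAULTS:
        raw = environ.get(name)
        if raw is not None:                     # phase 2: env var wins if set
            options[name] = raw.strip() == "1"  # assumed convention: "1" means on
    return options


print(resolve_options({}))
print(resolve_options({"EXAMPLE_FEATURE_A": "0", "EXAMPLE_FEATURE_B": "1"}))
```

Keeping the resolution in one function gives a single place where the priority order is defined, which matches the stated goal of managing the options consistently.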



### Motivation and Context
2023-06-27 19:19:36 +08:00
pengwa
403bebfb51
Use PadAndUnflatten to replace GatherGrad for restore (#16429)
### Use PadAndUnflatten to replace GatherGrad for restore




### Motivation and Context
2023-06-27 15:07:20 +08:00
cloudhan
ae6da03438
ROCm decoding (#16339)
2023-06-27 13:19:39 +08:00
Yi Zhang
6e9541046e
extend react native ci timeout limit (#16469)
### Description

### Motivation and Context
2 consecutive runs in npm pipeline failed due to time out
2023-06-27 08:44:03 +08:00
guyang3532
eb4e6d2062
Support Mul and Sub in padding elimination (#16478)
### Description
Support Mul and Sub in padding elimination
2023-06-27 07:43:29 +08:00
Edward Chen
4a331ef667
Rework CPU MeanVarianceNormalization kernel to support arbitrary axes. (#16420)
2023-06-26 15:29:50 -07:00
Yifan Li
e2c214d81f
[TensorRT EP] TRT 8.6 minor version update (#16475)
### Description
* Minor version update: TRT 8.6.0.12->8.6.1.6
  * CI pipeline ymls/dockerfiles are updated
* cgmanifest.json/deps.txt/download-deps.yml are updated; Win trt
binaries uploaded to [win img
307029](https://aiinfra.visualstudio.com/AI%20Infra%20Management/_build/results?buildId=307029&view=results)
* Re-enable unit tests which failed in 8.6.0 and regained support
in 8.6.1
2023-06-26 10:44:27 -07:00
Baiju Meswani
1f60414bc2
Load CheckpointState from a buffer (#16457)
2023-06-26 09:18:38 -07:00
Yifan Li
efe0af3720
[TensorRT EP] Fix nullptr check (#16468)
### Description
Fix the nullptr check so that it checks whether the engine/context
actually exists.
(Currently, it checks the address of the unique_ptr, which is never
null. Thanks @jslhcl for pointing that out.)

> A quick recall of struct
[trt_state](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.h#L104):
> ```
> std::unique_ptr<nvinfer1::ICudaEngine>* engine = nullptr;
> std::unique_ptr<nvinfer1::IExecutionContext>* context = nullptr;
> ```


### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/15982
The incorrect check could not stop the TRT EP from loading an incompatible
engine cache as intended, which triggers an unhandled exception.
2023-06-26 09:02:59 -07:00
Adam Louly
c55c6255e0
Eliminate safe nodes that are followed by a shape node. (#16065)
### Description
Eliminate Cast operator if Shape is the next one.

### Motivation and Context
#### Cast
When working with onnx opset 15 and above, the shape operator now
accepts all types of variables.
This change is documented in the [onnx
Changelog](https://github.com/onnx/onnx/blob/main/docs/Changelog.md#Shape-15).

As a result, casting variables right before the shape operation becomes
unnecessary.
Removing these unnecessary casts will improve the graph and potentially
provide better performance gains.


## Results
On :
torchrun examples/onnxruntime/training/language-modeling/run_clm.py
--model_name_or_path gpt2 --do_train --overwrite_output_dir --output_dir
./outputs/ --seed 1337 --fp16 True --per_device_train_batch_size 4
--num_train_epochs 1 --dataset_name wikitext --dataset_config_name
wikitext-2-raw-v1 --learning_rate 2e-5 --report_to none --optim
adamw_ort_fused

without changes:
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     3.2981
  train_runtime            = 0:02:13.29
  train_samples            =       2318
  train_samples_per_second =      17.39
  train_steps_per_second   =      4.351

With my changes:
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     3.2981
  train_runtime            = 0:02:08.98
  train_samples            =       2318
  train_samples_per_second =     17.971
  train_steps_per_second   =      4.497

We see around 3% gain.
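
The roughly 3% figure can be reproduced from the two metric blocks above. This is a small sketch; the `seconds` helper is introduced here only to parse the `train_runtime` format.

```python
def seconds(hms: str) -> float:
    """Convert an 'H:MM:SS.ss' runtime string to seconds."""
    h, m, s = hms.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)


before = seconds("0:02:13.29")   # train_runtime without changes
after = seconds("0:02:08.98")    # train_runtime with changes

runtime_reduction = (before - after) / before
throughput_gain = 17.971 / 17.39 - 1   # train_samples_per_second ratio

print(f"runtime reduction: {runtime_reduction:.2%}")
print(f"throughput gain:   {throughput_gain:.2%}")
```

Both measures land slightly above 3%, consistent with the summary line.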

---------

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-06-26 16:35:07 +08:00
Scott McKay
48eff09664
Fix file list for test of build with IO debug (#16474)
### Description
Update file list to adjust for recent changes to test infra. 


### Motivation and Context
2023-06-26 16:36:22 +10:00
PeixuanZuo
7e211f0e03
[ROCm] Move mount data step into docker container (#16471)
Some CI jobs may be interrupted unexpectedly without executing the umount
data step. The data left on the host device then causes `device or resource
busy` errors and makes subsequent CI jobs fail.

Move the mount data step into the docker container, so that the host machine
is not left occupied when CI jobs exit abnormally.
2023-06-26 10:25:06 +08:00
Guenther Schmuelling
8971af72af
fix webnn build (#16464)
remove unused variable.
2023-06-25 11:09:58 -07:00
zhijiang
9206b7cdc6
Zhijxu/cast propagation softmax (#16408)
Enhance cast propagation so that softmax can be computed in fp16 when the
data flow is cast-to-fp32 > softmax > cast-to-fp16.

This optimization can save GPU memory and improve performance.
2023-06-25 10:28:54 +08:00
Tianlei Wu
9407c3270c
GPT-2 attention fusion for transformers >= 4.27 (#16461)
### Description
Before transformers 4.27, the causal mask uses the uint8 data type, so there
is an extra Cast node to convert it to bool. This adds a pattern without the
Cast node to support attention fusion for GPT-2 models exported
with transformers >= 4.27.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/16453
2023-06-23 15:38:35 -07:00
Hector Li
a8c313dec4
[QNN EP] Python script to modify Onnx model to make it aligned with converted QNN model (#16423)
Python script to modify Onnx model to make it aligned with converted QNN
model

### Description
The Onnxruntime QNN EP supports context binary files generated by the QNN tool chain. However, a QNN-generated context binary file uses channel-last layout and 8-bit or 16-bit data for inputs and outputs. This script gets the QNN model input and output information from the QNN-converted model_net.json file and inserts Cast and Transpose nodes into the Onnx model if required.
2023-06-23 11:00:51 -07:00
Ryan Hill
d5b606d50d
Remove now duplicated symbol (#16458)
### Description
Change #16161 broke rocm by duplicating this symbol. This removes the
duplicate to unblock the tests.
2023-06-23 09:21:03 -07:00
Chen Fu
5c125b4366
Cfu revertamx (#16455)
### Description

This is to revert two PRs that aim at reducing AMX toolchain
requirements. Unfortunately we still have some pipeline issues.

https://github.com/microsoft/onnxruntime/pull/16390
https://github.com/microsoft/onnxruntime/pull/16086

### Motivation and Context

Looks like gcc link time optimization does not work very well with
inline assembly in the above PRs.
2023-06-23 09:20:23 -07:00
Rachel Guo
04dbdc96bf
[js/rn] Fix React Native CI pipeline E2E test (#16447)
### Description

Based on this kindly provided quick fix:
https://github.com/microsoft/onnxruntime/pull/16411

See more description in the above linked pr about bumping AGP version,
etc.

Also fixed import header file path in detox e2e test.

### Motivation and Context

Good build:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1041757&view=logs&j=de302ec2-2305-57e0-e8c6-cd89c569f2a3&t=9894c870-b8ce-548d-51ff-8f44d21a4117&l=18
2023-06-22 14:33:49 -07:00
Baiju Meswani
10ba1e270c
Minimal Build for On-Device Training (#16326)
🛠️ __Changes in this pull request:__

This pull request introduces two significant changes to the project:

- Changing on device training checkpoint format: The current
implementation stores the on device training checkpoint as a sequence of
tensors in multiple files inside a checkpoint folder, which can be
inefficient in terms of storage and performance. In this PR, I have
modified the checkpoint format to utilize the flatbuffer table to save
the checkpoint to a single file, providing a more compact and efficient
representation. The changes around this are twofold:
- Add the checkpoint flatbuffer schema that will generate the necessary
checkpoint source files.
- Update the checkpoint saving and loading functionality to use the new
format.

- Adding support for onnxruntime minimal build: To support scenarios
where binary size is a constraint, I made changes to ensure that the
training build can work well with the minimal build.

🔍 __Open Issues:__
- In order to extract the optimizer type, the existing implementation
re-loaded the onnx optimizer model and parsed it. This is no longer
possible, since the model format can either be onnx or ort. One idea is
to do the same for ort format optimizer model. This needs some
investigation.
- Changes to the offline tooling to generate ort format training
artifacts.
- End-to-end training example showcasing the use of the minimal training
build.
- Add support for export model for inferencing in a minimal build.
2023-06-22 12:27:23 -07:00
dependabot[bot]
97f4484df9
Bump actions/setup-python from 3 to 4 (#16404)
2023-06-22 18:12:11 +00:00
Yi Zhang
8e8840f1de
Enable Web CI on Linux (#16419)
### Description
1. Enable Web ci on Linux

### Motivation and Context
1. Speed up web CI: the duration can be reduced from 160 minutes to 130
minutes, a time saving of roughly 20%.
The total computation time is 455 minutes now. Moved to Linux, it could
be reduced to 336 minutes.
2. It's the first step to enable compilation cache for emscripten.
3. Per Yulong's request, build_web stages still use the Windows pool.


![image](https://github.com/microsoft/onnxruntime/assets/16190118/c9496408-74bd-45ea-b4ae-a4dd2a574d17)


https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1038382&view=results
2023-06-22 15:42:58 +08:00
Pranav Sharma
a270d8407e
Allow saving of large models after optimization (github issue 12882) (#16440)
### Description
Allow saving of large models after optimization.

### Motivation and Context
Addresses https://github.com/microsoft/onnxruntime/issues/12882
2023-06-21 22:46:26 -07:00