### Description
Add ConvTranspose support for WebGPU
### Description
Clean up unused parameters in `ORT_UNUSED_PARAMETER`.
### Motivation and Context
Clean up the unused parameters in `ORT_UNUSED_PARAMETER` that were introduced by #15833.
### Description
Added WebGPU/JSEP Split operator support.
### Description
- Adds support for 1D (rank 3) convolutions to QNN EP
- Implements 1D convolutions as 2D convolutions with height == 1.
Reshape nodes are added at the inputs and outputs as necessary.
- Adds more unit tests for Conv and ConvTranspose (2D and 1D).
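The reshape approach in the second bullet can be illustrated with a small pure-Python sketch. This is not the QNN EP code: the helper names `conv2d_single`, `conv1d_via_conv2d` and `conv1d_direct` are made up for illustration, and the sketch assumes a single channel, stride 1 and no padding.

```python
def conv2d_single(image, kernel):
    """Valid 2D cross-correlation of one channel (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv1d_via_conv2d(signal, kernel):
    """Run a 1D convolution as a 2D convolution with height == 1."""
    image = [list(signal)]       # reshape [W] -> [1, W]
    kernel_2d = [list(kernel)]   # reshape [K] -> [1, K]
    out_2d = conv2d_single(image, kernel_2d)
    return out_2d[0]             # reshape [1, W - K + 1] -> [W - K + 1]


def conv1d_direct(signal, kernel):
    """Reference 1D implementation used to check the reshaped version."""
    k = len(kernel)
    return [sum(signal[p + j] * kernel[j] for j in range(k))
            for p in range(len(signal) - k + 1)]


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
print(conv1d_via_conv2d(signal, kernel) == conv1d_direct(signal, kernel))
```

The two reshapes correspond to the Reshape nodes that the description says are added at the inputs and outputs.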
### Motivation and Context
Allow more models to run on QNN EP.
### Description
Add missing L1Reduce and L2Reduce operator kernels.
The `GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct.
Avoiding the ctor invocation substantially improves the launch time and
gives better performance during decoding. This yields a <7% e2e time
reduction for Whisper large.
### Description
The Windows GPU Reduced Ops CI pipeline is broken because a second
template type was introduced in registered kernels, which breaks the
Python code that checks the registrations. This PR addresses the issue on
the Python side by keeping a single type equal to the concatenation of
the two types.
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
- Update some documentation comments.
- Use onnxruntime_training.h as the umbrella header so training API docs are included in generated docs.
- Fix static analysis build.
This PR optimizes the BiasGelu/BiasGeluGrad CUDA kernels with three changes:
- Use Erf instead of Normcdf for half-precision compute
- Change the CUDA thread organization for the BiasGelu kernel instead of using
the binary elementwise template
- Add vectorized support

Using BiasGelu(A[256, 128, 768] + B[768]) on a V100 as an example, the perf
numbers below are in microseconds:
- Before the change: FW 246.37, BW 292.77
- Using Erf only: FW 152.86, BW 238.98
- All of the above changes: FW 132.45, BW 199.14

For Hugging Face's bertweet-base model, the changes reduce the step time
(FW+BW) from 324.71766 ms to 316.42552 ms, which is 1.026x faster.
Erf is used for half data only, because evaluation shows that Normcdf is
faster for float on CUDA. I didn't check the perf for BFloat16 or on AMD,
so those are kept unchanged.
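The Erf and Normcdf formulations compute the same function, which is why one can be swapped for the other. The following is a minimal numerical sketch of that identity in plain Python, not the CUDA kernel; the function names are illustrative, and `bias_gelu` assumes the bias is broadcast along the last dimension as in the A[256, 128, 768] + B[768] example.

```python
import math


def normcdf(x):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gelu_normcdf(x):
    """Gelu expressed with the normal CDF: x * Phi(x)."""
    return x * normcdf(x)


def gelu_erf(x):
    """Gelu expressed with erf: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_gelu(rows, bias):
    """Add a bias along the last dimension, then apply Gelu elementwise."""
    return [[gelu_erf(value + b) for value, b in zip(row, bias)]
            for row in rows]


# The two formulations agree up to floating-point rounding.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(gelu_normcdf(x) - gelu_erf(x)) < 1e-12
```

Which formulation is faster depends on the data type and hardware, as the measurements above indicate; the sketch only shows that the results are interchangeable.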
### Description
Split out the more basic changes from #15552 for easier review.
Re-organize to clarify the structure
- Separate out generic base functionality from ORT specific components
- pass in handlers for internal ORT ops to Optimize
- Split out layout transformation from transpose optimization
- Separate out level 1 transpose optimizer
- Cleanup some naming to try and clarify things like an optimizer vs.
general optimization code
Most of the changes are from this movement of code.
Two implementation changes:
- The extended handlers are queried first in GetHandler.
  - This allows the extended handlers to override the default behaviour for an
ONNX operator.
- Simplify the Optimize function to remove OptimizerMode.
  - `can_modify_node` is used instead of `mode` and
`ignore_assigned_nodes`, and a long description of the current usage is
added. I don't _think_ that changes the current behavior and hopefully
clarifies what happens and when, and makes the base transpose optimizer
implementation more generic.
### Motivation and Context
Create a cleaner separation so that EP-specific logic can be added next,
to cleanly handle cases where an EP requires additional layout-sensitive
behaviour (e.g. its Resize implementation only handles one layout).
### Description
Introduce an API that allows users to gain access to a string tensor
element buffer of a requested length in bytes,
so they can quickly load any UTF-8 data.
### Motivation and Context
Useful for testing and otherwise.
### Description
C API for custom ops does not support float 8 types. This PR changes
that.
### Motivation and Context
The list of operators supporting float 8 is very limited. It should be
extended to custom ops to let developers add customized operators for
these specific types.
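In #16339, the `ORT_ENFORCE(cuda_device_arch_ >= 530` check (throws) was
changed to `ORT_RETURN_IF` (returns a Status). `ORT_ENFORCE` fails when its
condition is false, whereas `ORT_RETURN_IF` fails when its condition is
true, so the check ended up inverted. This PR fixes the problem.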
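The difference between the two macro styles can be modelled in a few lines of Python. This is only a sketch of the semantics: `enforce`, `return_if` and the two `check_arch_*` functions are hypothetical stand-ins, not ORT code, and the "inverted" variant shows one way such a bug can arise when a condition is carried over unchanged.

```python
class EnforceError(Exception):
    """Raised by the enforce-style helper."""


def enforce(condition, message):
    """Enforce-style check: raises when the condition is FALSE."""
    if not condition:
        raise EnforceError(message)


def return_if(condition, message):
    """Return-if-style check: reports failure when the condition is TRUE."""
    return ("FAIL", message) if condition else ("OK", "")


def check_arch_inverted(arch):
    # Condition copied unchanged from the enforce-style call: this now
    # rejects exactly the devices that should be accepted.
    return return_if(arch >= 530, "unsupported device architecture")


def check_arch_fixed(arch):
    # The condition has to be negated when switching helper styles.
    return return_if(arch < 530, "unsupported device architecture")


print(check_arch_inverted(700)[0])  # FAIL
print(check_arch_fixed(700)[0])     # OK
```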
### Description
Enable support for building iOS packages/CocoaPods with training API
- Add `Training` Package variant and config files in current iOS
packaging utilities to enable creation of training packages
### Motivation and Context
This PR introduces a new `Training` variant in the
`build_and_assemble_ios_pods.py` script, which allows creating pods for
iOS with the training API enabled.
A sample command to build training pods:
```
python3 tools/ci_build/github/apple/build_and_assemble_ios_pods.py --variant Training \
--build-settings-file tools/ci_build/github/apple/default_full_ios_training_framework_build_settings.json \
-b=-- path_to_protoc_exe=<path/to/protoc>
```
Note: build settings file should have `--enable_training` as a build
parameter.
Simply adding training packaging increases the duration of the Azure
pipeline for packaging by 70 minutes. To address this issue, we need to
parallelize pod creation. In order not to further strain the pipeline,
the changes for training packaging will be added in another PR, which
optimizes the packaging pipeline.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Adds support for adding external initializers or overriding initializers
in session options from Java.
### Motivation and Context
We want to instantiate large models from Java without filesystem access.
cc @yuslepukhin
### Description
Always use 'typescript' from the /js/ folder. This allows all NPM packages
to use the same TypeScript version.
- Remove 'typescript' from /js/react_native/package.json and use the one
from /js/package.json.
- Remove the unused '@types/fs-extra'.
### Description
Add Concat operator
Fix memory leak issue which comes from TRT EP's allocator object not
being released upon destruction.
Following is the log from valgrind:
```
==1911860== 100,272 (56 direct, 100,216 indirect) bytes in 1 blocks are definitely lost in loss record 1,751 of 1,832
==1911860== at 0x483CFA3: operator new(unsigned long) (vg_replace_malloc.c:472)
==1911860== by 0x315DC2: std::_MakeUniq<onnxruntime::OrtAllocatorImplWrappingIAllocator>::__single_object std::make_unique<onnxruntime::OrtAllocatorImplWrappingIAllocator, std::shared_ptr<onnxruntime::IAllocator> >(std::shared_ptr<onnxruntime::IAllocator>&&) (unique_ptr.h:857)
==1911860== by 0x30EE7B: OrtApis::KernelContext_GetAllocator(OrtKernelContext const*, OrtMemoryInfo const*, OrtAllocator**) (custom_ops.cc:121)
==1911860== by 0x660D115: onnxruntime::TensorrtExecutionProvider::Compile(std::vector<onnxruntime::IExecutionProvider::FusedNodeAndGraph, std::allocator<onnxruntime::IExecutionProvider::FusedNodeAndGraph> > const&, std::vector<onnxruntime::NodeComputeInfo, std::allocator<onnxruntime::NodeComputeInfo> >&)::{lambda(void*, OrtApi const*, OrtKernelContext*)#3}::operator()(void*, OrtApi const*, OrtKernelContext*) const (tensorrt_execution_provider.cc:2223)
```
This issue happens after this [EP allocator
refactor](https://github.com/microsoft/onnxruntime/pull/15833)
The ONNX exporter in DORT has been moved to PyTorch as a formal
feature. We therefore switch to consuming the exporter from PyTorch
instead of maintaining two duplicates.
### Description
Unlike most ORT classes `SessionOptions` and `RunOptions` don't trigger
native library loading of the JNI binding and ORT when the classes are
initialized (after class loading). This was initially because I thought
that loading an inner class would trigger the static initialization of
the outer class, but this is not true. So if you create a
`SessionOptions` instance before referencing `OrtEnvironment` then you
won't trigger library loading and you'll get an error saying it couldn't
link the native method that creates a `SessionOptions` object.
Note this doesn't prevent users from creating a `SessionOptions` and
modifying it before the `OrtEnvironment` is created, which can still
cause issues. It would be a breaking API change to modify the
`SessionOptions` constructor to take an environment, and it wouldn't
mirror the way it works in the C API which requires this by convention
rather than API design, but we can discuss making that modification
later.
### Motivation and Context
Reduces the occurrence of mysterious Java library loading errors. Helps
with #16434.
### Description
As title.
ONNX definition:
2e9a6757ad/onnx/defs/math/defs.cc (L330)
### Motivation and Context
CoreML has support and some super resolution models use it.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Bumps [actions/setup-dotnet](https://github.com/actions/setup-dotnet)
from 2 to 3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/setup-dotnet/releases">actions/setup-dotnet's
releases</a>.</em></p>
<blockquote>
<h2>v3.0.0</h2>
<p>This major release includes the following
<strong>changes:</strong></p>
<ul>
<li><a
href="https://redirect.github.com/actions/setup-dotnet/issues/219">#219</a>
New input <code>dotnet-quality</code> was added in <a
href="https://redirect.github.com/actions/setup-dotnet/pull/315">#315</a>:</li>
</ul>
<pre lang="yaml"><code>- uses: actions/setup-dotnet@v3
  with:
    dotnet-version: '6.0.x'
    dotnet-quality: 'preview'
- run: dotnet build &lt;my project&gt;
</code></pre>
<p>More in detail <a
href="https://github.com/actions/setup-dotnet#using-the-dotnet-quality-input">here</a>.</p>
<ul>
<li><a
href="https://redirect.github.com/actions/setup-dotnet/issues/241">#241</a>
The output variable <code>dotnet-version</code> which contains the
installed by the action SDK version was added in <a
href="https://redirect.github.com/actions/setup-dotnet/pull/324">#324</a>:</li>
</ul>
<pre lang="yaml"><code>- uses: actions/setup-dotnet@v3
  id: cp310
  with:
    dotnet-version: '3.1.422'
- run: echo '${{ steps.cp310.outputs.dotnet-version }}' # outputs 3.1.422
</code></pre>
<p>More in detail <a
href="https://github.com/actions/setup-dotnet/tree/main#dotnet-version">here</a>.</p>
<ul>
<li>The <code>dotnet-version</code> syntax was updated and now it allows
to specify the prerelease version without using
<code>include-prerelease</code> input. The
<code>include-prerelease</code> input was cut out:</li>
</ul>
<pre lang="yaml"><code>- uses: actions/setup-dotnet@v3
  with:
    dotnet-version: '5.0.0-preview.6'
</code></pre>
<p>More in detail <a
href="https://github.com/actions/setup-dotnet#supported-version-syntax">here</a>.</p>
<ul>
<li><a
href="https://redirect.github.com/actions/setup-dotnet/issues/251">#251</a>
The problem with out of support .NET version warnings was solved in <a
href="https://redirect.github.com/actions/setup-dotnet/pull/315">#315</a>.</li>
</ul>
<p><strong>Breaking changes</strong>:</p>
<ul>
<li>Installation paths for Windows and Ubuntu images were changed to
match the location of pre-installed SDKs. In more detail, read <a
href="https://github.com/actions/setup-dotnet/blob/main/docs/adrs/v3-setup-dotnet.md#breaking-changes">here</a>.</li>
</ul>
<h2>Add support for Windows-arm</h2>
<p>In scope of this release we <a
href="https://redirect.github.com/actions/setup-dotnet/pull/320">add
support for Windows-arm</a>. Besides, we change getInput to <a
href="https://redirect.github.com/actions/setup-dotnet/pull/250">getBooleanInput</a>
for include-prerelease.</p>
<h2>Package updates, support for global json file in a subdirectory,
installer scripts updates</h2>
<p>This release includes the following PRs:</p>
<ul>
<li>Adding support for the <code>global-json-file</code> input: <a
href="https://redirect.github.com/actions/setup-dotnet/issues/276">#276</a>
Example of usage:
<pre lang="yaml"><code>- uses: actions/setup-dotnet@v2
  with:
    global-json-file: csharp/global.json
- run: dotnet build &lt;my project&gt;
  working-directory: csharp
</code></pre>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="3447fd6a9f"><code>3447fd6</code></a>
feat: Cache NuGet global-packages folder (<a
href="https://redirect.github.com/actions/setup-dotnet/issues/303">#303</a>)</li>
<li><a
href="916351aac9"><code>916351a</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/setup-dotnet/issues/430">#430</a>
from akv-platform/remove-implicit-dependencies</li>
<li><a
href="1ad2e312fa"><code>1ad2e31</code></a>
Add missing dependency</li>
<li><a
href="e3f84b8f7a"><code>e3f84b8</code></a>
Install eslint-plugin-node</li>
<li><a
href="ba848a34bb"><code>ba848a3</code></a>
Update configuration files</li>
<li><a
href="aa983c550d"><code>aa983c5</code></a>
Merge pull request <a
href="https://redirect.github.com/actions/setup-dotnet/issues/428">#428</a>
from akv-platform/add-latest-patch-syntax</li>
<li><a
href="b891376106"><code>b891376</code></a>
Merge branch 'main' into add-latest-patch-syntax</li>
<li><a
href="b05a3f26b3"><code>b05a3f2</code></a>
Fix review points, rebuild solution</li>
<li><a
href="5fdecd2063"><code>5fdecd2</code></a>
Increase amount of retries for Dotnet installation scripts tests (<a
href="https://redirect.github.com/actions/setup-dotnet/issues/427">#427</a>)</li>
<li><a
href="38b49fb717"><code>38b49fb</code></a>
Fix informational and debug messages</li>
<li>Additional commits viewable in <a
href="https://github.com/actions/setup-dotnet/compare/v2...v3">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Add a greedy option to the initializer deduplication process in the
Whisper export.
Currently, to detect shared initializers, ORT compares every initializer
against every other initializer (n^2 comparisons). In the comparison
operator, if the two initializers have different data types (e.g. raw_data
and int_64), both initializers are converted to NumPy arrays and the cast
results are compared. This cast happens in every such comparison, so its
cost is multiplied by the quadratic number of comparisons when finding
shared initializers. This cast operation is the bottleneck for the current
Whisper export script.
The conversion to the numpy array is useful for detecting equal
initializer values across nodes of different data types (e.g.
recognizing a bias value of 0.0 is the same as a slice index of 0) but
isn't triggered when comparing initializers of the same data type (e.g.
weight value of 0.6 == weight value of 0.6). The latter case is where
the majority of utility is for Whisper, and so by eliminating our path
for comparing numpy arrays for initializers we save a lot of time for
minimal cost.
In other words, this PR adds an option to remove the ability to detect
shared initializers of different types (e.g. Slice Index and MatMul
Constant) while retaining the ability to deduplicate weights.
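The trade-off can be sketched as follows. This is not the implementation in the export script: it swaps the pairwise comparison for a dictionary lookup keyed on exact type, shape and bytes, purely to show what "same-type only" deduplication keeps and what it gives up. The tuple layout and the function name are hypothetical.

```python
import struct
from collections import defaultdict


def find_shared_initializers_greedy(initializers):
    """Group initializers whose dtype, shape and raw bytes are all identical.

    `initializers` is a list of (name, dtype, shape, raw_bytes) tuples, a
    hypothetical stand-in for ONNX initializers. Values stored with
    different data types never land in the same group, which is the
    trade-off the greedy option accepts.
    """
    groups = defaultdict(list)
    for name, dtype, shape, raw_bytes in initializers:
        key = (dtype, tuple(shape), bytes(raw_bytes))
        groups[key].append(name)
    # Only groups with more than one member are shared.
    return [names for names in groups.values() if len(names) > 1]


weights = struct.pack("<2f", 0.6, 0.6)
example = [
    ("encoder.weight", "float32", [2], weights),
    ("decoder.weight", "float32", [2], weights),
    ("bias_zero", "float32", [1], struct.pack("<f", 0.0)),
    ("slice_index_zero", "int64", [1], struct.pack("<q", 0)),
]
shared = find_shared_initializers_greedy(example)
print(shared)  # [['encoder.weight', 'decoder.weight']]
```

The duplicated weights are found, while the float bias of 0.0 and the int64 slice index of 0 are left as separate initializers.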
### Motivation and Context
Current time to export Whisper-large is prohibitive.
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
### Description
Modified indexing into `outputIndices` in the shader code. When the output
is 1-D, `outputIndices` is not a vector, so indexing into it results in an
error.
### Motivation and Context
Fix the problem in the Reduce ops implementation in WebGPU.
`optimize_model` generates a temporary model in the folder of the current
model. Most of the time this is fine.
However, it breaks when the function runs against an input model mounted
from AzureML, because the mounted folder is read-only and `optimize_model`
fails when creating the optimized model there. The workaround is to copy
the model to another temporary folder before calling `optimize_model`, but
copying the model is painful, especially when the model is huge.
This PR exposes `optimized_model_path` at the `optimize_model` level so
that the caller can decide where to save the temporary model.
### Description
Set emulator logging to verbose to see whether it helps diagnose the
intermittent React Native CI failures where the emulator crashes at startup.
### Description
Address comments in #14040
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Support running SCELoss/SCELossGrad with larger inputs
#### Motivation and Context: Run bigger batch size for Bloom model.
For the Bloom560M model, ORT has the potential to run a bigger batch size,
from initially 6 to now 10. The input size of SCELoss/SCELossGrad is
Bsz * 1023 * 250680. When Bsz is bigger than 8, the total element count
cannot be represented by int32_t, which those kernels use to pass the total
element count. The overflow is silent, and it either causes other, indirect
exceptions or produces wrong results without any error.
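The threshold can be checked with plain integer arithmetic. In this sketch the constant names are illustrative, and `wrap_int32` models the two's-complement wraparound that a silent narrowing to 32 bits commonly produces; it is not taken from the kernels.

```python
INT32_MAX = 2**31 - 1
ELEMENTS_PER_SAMPLE = 1023 * 250680  # per-sample size from the description


def element_count(batch_size):
    """Total number of input elements for a given batch size."""
    return batch_size * ELEMENTS_PER_SAMPLE


def wrap_int32(value):
    """Model storing `value` in a 32-bit signed integer with wraparound."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


for batch_size in (6, 8, 9, 10):
    count = element_count(batch_size)
    print(batch_size, count, count <= INT32_MAX, wrap_int32(count))
```

Batch sizes up to 8 fit in int32_t; from 9 upwards the count exceeds `INT32_MAX` and the wrapped value becomes negative.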
#### Changes in this PR
- For the SCELossInternal/SCELossGradInternal CUDA kernels, use uint64_t to
pass the total element count and element index when the total element count
is bigger than int32::max().
- For the SCELossInternal/SCELossGradInternal CPU kernels:
  - Always use uint64_t to pass the element count.
  - Update the Eigen functions involved in the two kernels'
implementations to use `ptrdiff_t` instead of the original `int` to pass
the element count.
- Parallelize the SCELossInternal/SCELossGradInternal CPU kernels;
otherwise they are very slow when handling this many elements.
- Other changes needed:
  - Add `CompareOrtValueNumerals` to compare two OrtValues with different
data types (float or float16) without the caller explicitly converting to
the lower-precision data type. The comparison is also done in parallel,
which reduces the comparison time for the large UT case from 22s to
~1.6s.
  - The `IsResultCloselyMatch` check is buggy for nan/inf cases, so fix
the bugs.
  - The cross entropy tests run the CPU baseline in float, and the result
is then compared with the float16 results of the CUDA runs. However, there
is a precision issue in that check: the randomized input data is
represented in float, the CPU uses it directly, but CUDA uses a float16
version of it, so the inputs themselves differ in precision. As the test
data count increases, this makes the results fail even at a tolerance of
1e-2. The fix is to generate the data in float16, convert it to float for
the CPU run, and use the float16 data directly for the CUDA runs. When
comparing the outputs, cast the CPU float output back to float16 and then
compare it with the CUDA outputs.
  - `RandomValueGenerator` takes about 20 seconds for the large size, so
`ParallelRandomValueGenerator` is added to generate random input in
parallel; it takes less than 2s to prepare the input data.
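The test-data fix described above can be sketched with the standard library, which supports IEEE half precision through the `struct` format code `e`. The helper names are hypothetical and this is not the ORT test code; it only shows why generating in float16 first removes the input mismatch.

```python
import random
import struct


def to_float16(value):
    """Round a Python float to the nearest float16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def make_inputs(count, seed=0):
    """Generate random data that is exactly representable in float16."""
    rng = random.Random(seed)
    half_inputs = [to_float16(rng.uniform(-1.0, 1.0)) for _ in range(count)]
    # Widening float16 to a wider float type is exact, so the CPU baseline
    # sees the same values as the float16 run.
    cpu_inputs = [float(v) for v in half_inputs]
    return half_inputs, cpu_inputs


def outputs_match(cpu_output, half_output, tolerance=1e-2):
    """Cast the CPU result down to float16 before comparing."""
    return all(abs(to_float16(c) - h) <= tolerance
               for c, h in zip(cpu_output, half_output))


half_inputs, cpu_inputs = make_inputs(1000)
# Generating in a wider type first would not give this guarantee:
print(to_float16(0.1) == 0.1)     # False
print(half_inputs == cpu_inputs)  # True
```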
#### Non-goals
`SoftmaxCrossEntropyLoss` and `SoftmaxCrossEntropyLossGrad` are not
covered in this PR.
### Description
Expose the `OrtValue` class API as a first-class citizen.
Make it similar to the C++ API.
Enable safe direct native memory access.
Make string tensor manipulation more efficient.
Avoid intermediate structures such as `NamedOnnxValue`,
`DisposableNamedOnnxValue`, etc.
Provide more examples with `IOBinding`, although the `OrtValue` API
potentially makes `IOBinding` redundant for most scenarios, since an
`OrtValue` can be created on top of any memory.
Run all the pre-trained models with the `OrtValue` API as well.
Obsolete the `OrtExternalMemory` class. Obsolete the IOBinding API that takes
`FixedBufferOnnxValue`.
### Motivation and Context
Make the API efficient and uniform with C++.
This aspires to address:
- https://github.com/microsoft/onnxruntime/issues/14918
- https://github.com/microsoft/onnxruntime/issues/15381

Cc: @Craigacp
### Description
Remove AllocatorManager class
### Motivation and Context
Now that the refactoring PR #15833 is in, the `AllocatorManager` class is
no longer referenced.