### Description
Training C# bindings (ReleaseTrainingSession and ReleaseCheckpointState)
broke after an API order change in Training C API. This PR fixes this
issue.
### Motivation and Context
Bug Fix for Training C# bindings
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Removes the workaround introduced in #12063, which disabled DML command
list reuse for Xbox builds.
The ID3D12CommandList created in FusedGraphKernel takes points to an
ID3D12CommandAllocator that is local to the `BuildReusableCommandList`
function. On PC it would seem the command list is keeping the command
allocator alive, but this is highly suspect logic that definitely
doesn't work on Xbox. I find no documentation indicating this logic
should work (a section on [reference
counting](https://learn.microsoft.com/en-us/windows/win32/direct3d12/recording-command-lists-and-bundles#reference-counting)
makes it clear command lists take no refs on D3D objects passed as args
to its APIs; however, it's unclear if this also applies to its
construction).
A second (small) change is constructing the command list straight into
`ID3D12GraphicsCommandList` and removing an unnecessary QI.
**Description**: Fixes these TSA issues (no actual bugs fixed, but just
changing code to make TSA happy)
To fix 1944982 and 1944973 I changed DeleteOnUnloadPtr to not use 'new'
and to just use placement new to go into a fixed buffer. This required
changing the rocm usage of it also (probably a separate TSA bug on that
one that I don't have)
1944982 Ryan Hill [prefast:Warning]: C26426 (in
onnxruntime/core/providers/cuda/tensor/cast_op.cc)
Global initializer calls a non-constexpr function 'operator new' (i.22).
1944973 Ryan Hill [prefast:Warning]: C26426 (in
onnxruntime/core/providers/cuda/cuda_execution_provider_info.cc)
Global initializer calls a non-constexpr function 'operator new' (i.22).
1944929 Ryan Hill [prefast:Warning]: C26436 (in
onnxruntime/core/providers/cuda/cuda_provider_factory.cc)
The type 'struct onnxruntime::ProviderInfo_CUDA_Impl' with a virtual
function needs either public virtual or protected non-virtual destructor
(c.35).
**Description**: This PR adds support for "XNNPACK EP" in ORTWeb and
changes the behavior of how ORTWeb deals with "backends", or "EPs" in
API.
**Background**: Term "backend" is introduced in ONNX.js to representing
a TypeScript type which implements a "backend" interface, which is a
similar but different concept to ORT's EP (execution provider). There
was 3 backends in ONNX.js: "cpu", "wasm" and "webgl".
When ORT Web is launched, the concept is derived to help users to
integrate smoothly. Technically, when "wasm" backend is used, users need
to also specify "EP" in the session options. Considering it may get
complicated and confused for users to figure out the difference between
"backend" and "EP", the JS API hide the "backend" concept and made a
mapping between names, backends and EPs:
"webgl" (Name) <==> "onnxjsBackend" (Backend)
"wasm" (Name) <==> "wasmBackend" (Backend) <==> "CPU" (EP)
**Details**:
The following changes are applied in this PR:
1. allow multi-registration for backends using the same name. This is
for use scenarios where both "onnxruntime-node" and "onnxruntime-web"
are consumed in a Node.js App ( so "cpu" will be registered twice in
this scenario. )
2. re-assign priority values to backends. I give 100 as base to "cpu"
for node and react_native, and 10 as base to "cpu" in web.
3. add "cpu", "xnnpack" as new names of backends.
4. update onnxruntime wasm exported functions to support EP
registration.
5. update implementations in ort web to handle execution providers in
session options.
6. add '--use_xnnpack' as default build flag for ort-web
### Description
fix XNNPACK on WebAssembly SIMD.
Flag "-msimd128" need to be applied to every source file when compiling
WASM SIMD. Currently only a part of the source files are compiled with
this flag so we get inconsistent result for
`sizeof(xnn_f32_minmax_params)` because the type definition include a
`#ifdef` for `__wasm_simd128__`. The inconsistency causes writing
garbage data to a stack variable and eventually cause the crash.
XNNPACK libraries are C libraries so need to apply the build flags not
only to `CMAKE_CXX_FLAGS` but also to `CMAKE_C_FLAGS`.
Reduce binary size for minimal Android builds.
- reduce places where Status objects are created in KernelTypeStrResolver::LoadFromOrtFormat()
- remove some unused parameters (in a base minimal build) and code in graph_partitioner.cc
### Description
This updates the oneDNN library used by oneDNN ep from version 2.6 to
version 2.7
### Motivation and Context
This brings in the many improvements incorporated into the oneDNN
library to the oneDNN execution provider.
Signed-off-by: George Nash <george.nash@intel.com>
Update clang-tidy config to prepare for creating a CI workflow to run
clang-tidy.
Added clangtidy check in CI
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Kernels like Attention, BatchNormalization15, etc, can be implemented by
using multiple DML APIs. This PR paves the path for graph-based kernel
implementation.
As part of this PR, every kernel in DML EP will now wrap their
DML_OPERATOR_DESC into a graph and send it to FusedGraphKernel.
FusedGraphKernel will stich this smaller graph into its main DML_GRAPH.
All onnxconformance test and Winml model tests passed.
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
<!-- Describe your changes. -->
A fix for parity issue in huggingface bart model with beam search
https://github.com/microsoft/onnxruntime/pull/12779
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add handling for variadic inputs/outputs in a function.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#13121
**Description**:
Use the onnx headers to find the latest opset for each operator. This
allows the script to detect optimizers with
`graph_utils::IsSupportedOptypeVersionAndDomain` calls that need
updating when run during the update of the onnx commit id. Without this
change issues are not detected until a new kernel is registered.
**Motivation and Context**
Detect optimizers that need updates as part of the ONNX update process.
For below code in some transformers models:
```
fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads, 3, self.head_dim)
return fused_qkv[..., 0, :], fused_qkv[..., 1, :], fused_qkv[..., 2, :]
```
The exported graph will contains 3 Gather nodes, currently ORT's
GatherGrad CUDA implementation is slow. This pattern can be fused to use
one Split, so that we can launch less kernels for the compute, the perf
of Split/Concat (for grad) is also better than Gather/GatherGrad.
In a real example, one GatherGrad will take 15ms and there are 3 for
each layer in the graph, after the fusion, one Concat takes only 35us.
The total time of a step is improved from 1.5s to 0.4s.
### Description
<!-- Describe your changes. -->
fix migraphx ci pipeline failed problem.
Disabled MIGraphX pipeline now. It will be Enabled when this PR merge.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
We fix iGPU Unit and Python tests with this PR
We add packaging pip pkg to build Many Linux DockerFile
### Motivation and Context
This change is required to make sure iGPU Unit Test/Python Tests with OV
are fixed
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: shamaksx <shamax.kshirsagar@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: pratiksha <pratikshax.bapusaheb.vanse@intel.com>
Co-authored-by: pratiksha <mohsinx.mohammad@intel.com>
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: nmaajidk <n.maajid.khan@intel.com>
Co-authored-by: Mateusz Tabaka <mateusz.tabaka@intel.com>
Previously OnnxSequence would flatten out a list of tensors into a
single output array assuming they were all scalar values. This doesn't
accurately represent the semantics of an ONNX sequence, but was what the
semantics appeared to be years ago when I first wrote that class. This
PR changes it so that the `getValue` method on `OnnxSequence` unwraps
the sequence and returns `List<? extends OnnxValue>` allowing the user
to process the individual ONNX values separately. It's done this way
rather than returning a multidimensional array for a tensor and a Java
map for a map as multidimensional arrays are very inefficient in Java
and best practice when operating with a OnnxTensor in Java is to use a
`java.nio.ByteBuffer`. So allowing users to access each `OnnxTensor`s
individually allows them to control how the data is materialised on the
Java heap.
**Description**: Describe your changes.
This allow us quickly launch a microbench session by, for example:
`python skip_layer_norm_test.py 8 128 128 float32 `
Change ROCm to use tunable GEMM. It is not enabled in this PR. This will drastically improve GEMM performance in some shapes and dtypes configuration. This will benefit the overall performance for BERT inference and hopefully, training, when enabled.
### Description
<!-- Describe your changes. -->
As title
-Split long OpBuilder and OpSupportChecker files into individual
operator files.
-Add OpBuilder/SupportChecker registry factories.
-Combine the functionality of op_builder and op_support_checker into one
op_builder.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The NNAPI OPBuilder was splitted into OPBuilder (For EP::Compile) and
OPSupportChecker (for EP::GetCapability)
At the time it was reasonable choice, but OPBuilder/OPSupportChecker
share some logic and has to use addition helper.
Clean up now to make NNAPI OPBuilder/OPSupportChecker into single
OPBuilder (similar to what CoreML EP has)
### Description
<!-- Describe your changes. -->
Update React Native documentation to reflect change to use full ORT. Fix
broken links.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
ORT v1.13 uses the full ORT package. Instructions for performing a
custom build did not cover this.
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
**Description**: Describe your changes.
add SkipLayerNorm vectorize regular case
1. when hidden size <= 1024, SkipLayerNormTunable op can use both small
case and regular case
2. when hidden size > 1024, SkipLayerNormTunable op can only use regular
case.
**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.
This PR is to fix https://github.com/microsoft/onnxruntime/issues/12930
and https://github.com/microsoft/onnxruntime/issues/12579.
In detail:
- For CPU EP, since current impl of SimplifiedLayerNormalization doesn't
support input and scale having different data types, so if the sub-graph
contains Cast Op, the sub-graph will not fused, this guarantee that both
inputs and output data type will be same
- For CUDA EP, add (fp16, float) support to (T,V) type constraints all
combinations of fp16 and float can be supported in the impl
With the fix, the original model can be run with
SimplifiedLayerNormalization, which also helps to improve the perf.