### Description
Implement Objective-C binding for `ORTCheckPoint`. Additionally,
- Modify `onnxruntime_objectivec.cmake` to include the training headers
and sources only when the training flag is enabled
- Enable the Objective-C binding for `orttraining-mac-ci-pipeline`
### Motivation and Context
This PR is part of implementing the Objective-C bindings for the training API.
It implements the Objective-C binding for the `ORTCheckPoint` class. The
Objective-C API closely resembles the C++ API.
**Note**: The test for saving a checkpoint is skipped because it requires a
training session. It will be added when the Objective-C binding for
`ORTTrainingSession` is added.
### Description
The name of the flag we set when compiling the JNI binding to enable the CoreML EP changed at some point in the past. This PR fixes it by updating the flag in the JNI. I also added a quick smoke test for the CoreML provider to make sure it doesn't crash and can be enabled.
### Motivation and Context
All the EPs should work as expected in Java. Fixes #16230.
- Fix flatbuffers flatc warning, unused-but-set-variable.
- Address `-Wshorten-64-to-32` warnings (fix in our code, allow in dependencies' code).
- Update CI builds to use Xcode 14.3.
- Update minimum iOS version to 12.0.
- Update Mac hosted agents to macOS 13 where possible.
### Description
Increases the allowable accuracy tolerance for a specific Conv op test on the QNN
CPU backend (Windows x64).
### Motivation and Context
Allow QNN NuGet pipeline to run. PR
https://github.com/microsoft/onnxruntime/pull/15975 introduced a failing
test on Windows x64.
We implemented a number of new ops and data types to support running
the Segment Anything model on the Chromium WebNN DML backend (POC) in a forked
branch: https://github.com/honry/onnxruntime/tree/stable-diffusion
In this PR, we migrate the changes from the forked branch to the main branch,
including:
- 22 new ops
- New tensor data types: bool, int32, uint32, uint64, int64, float16 (as
JavaScript hasn't shipped Float16Array, we use Uint16Array as a
workaround)
- Handle empty input tensors and duplicated outputs
- Fixed some nits
1. Add new test data GetSelfAttentionData_WithPastAndPresent_HeadSize8_NoMask_NoRelPosBias, also added non-biased data
2. Add new test data GetCrossAttentionData_DiffSequenceLengths_HeadSize8, also added non-biased data
3. Disabled the new tests for the CUDA EP because QKV is not correctly transposed.
### Description
The tensor creation code now allows the creation of boolean tensors from
non-direct `ByteBuffer` instances. It previously only allowed them from
arrays and direct `ByteBuffer` instances and this fixes that
inconsistency. The boolean tensor test has been updated to cover all
three cases.
### Motivation and Context
Fixes #15509.
MIGraphX CI
- Change docker container user name to `onnxruntimedev`
ROCm CI
- Build the docker image in every job instead of using a prebuilt image.
- Every job creates a container with only one GPU using the command
`docker run -it --device=/dev/kfd --device=/dev/dri/renderDxxx`
- Remove tests that are unstable or use outdated interfaces.
- Enable training ortmodule test.
### Description
Detect the fake tensor mode if one has already been created. This follows the
example in PyTorch:
86c7652503/torch/_inductor/compile_fx.py (L280)
### Motivation and Context
As of torch nightly 6/2/23, when trying to run a torch dynamo graph on
the ORT backend, we observe
```
E torch._dynamo.exc.BackendCompilerFailed: backend='compiler_fn' raised:
E AssertionError: Mixing fake modes NYI
E
E
E You can suppress this exception and fall back to eager by setting:
E import torch._dynamo
E torch._dynamo.config.suppress_errors = True
```
The issue is that `ort_backend.py` creates a new fake tensor mode even
though one has already been created by torch.
### Description
The proposed fix is to store the result of AsBlockSparse() in a variable
to ensure the object isn't destroyed until the end of the current scope.
### Motivation and Context
"own_buffer_tensor" is a temporary object that is destroyed at the end
of the expression and causes a compile error.
### Description
1. Avoid taking a dependency on dl.fedoraproject.org.
The website is not very stable, and our build pipelines often fail to fetch
packages from there.
2. Update manylinux to the latest version.
Fixes the top concerns of #13119 by
* using `onnxruntime::AllocatorDefaultAlloc` instead of `malloc`
* setting `MLAS_DEFAULT_PREFERRED_BUFFER_ALIGNMENT=64`, which cascades that
value to several members and functions not directly related to MLAS.
### Motivation and Context
* Fixes the top concerns of #13119. Otherwise, alignment is to 16 bytes, circa
1990s 👴
* Does not yet enable flexible alignment. Instead, it is fixed at 64 bytes (64 x 8
bits = 512 bits) for modern NN hardware like AVX-512
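
As a rough illustration of what a fixed 64-byte alignment means, the sketch below rounds sizes up to the next multiple of 64 bytes. The helper names `align_up` and `is_aligned` are hypothetical and are not part of ONNX Runtime or MLAS; only the value 64 comes from the description above.

```python
# Mirrors MLAS_DEFAULT_PREFERRED_BUFFER_ALIGNMENT=64 from the description (bytes).
PREFERRED_ALIGNMENT = 64


def align_up(value: int, alignment: int = PREFERRED_ALIGNMENT) -> int:
    """Round `value` up to the next multiple of `alignment` (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def is_aligned(address: int, alignment: int = PREFERRED_ALIGNMENT) -> bool:
    """Return True if `address` is a multiple of `alignment`."""
    return address % alignment == 0


# A 100-byte request would occupy two 64-byte blocks.
padded_size = align_up(100)
```

With the previous 16-byte alignment, the same 100-byte request would be padded to `align_up(100, 16)`, i.e. 112 bytes.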
### Description
- Updates QDQ transformer to handle QDQ logical operators (Equal, Less,
LessOrEqual, Greater, GreaterOrEqual).
- Expects 2 DQ inputs and no Qs in the output, which is boolean.
### Motivation and Context
This is needed to enable QDQ models with logical comparison operators to
run on QNN EP.
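
The pattern described above (two dequantized inputs feeding a comparison whose boolean output is not re-quantized) can be sketched in plain Python. This is only an illustration of the data flow under simplifying assumptions (per-tensor scale and zero point, flat lists instead of tensors); the names `dequantize` and `qdq_compare` are hypothetical and do not come from the QDQ transformer.

```python
import operator
from typing import Callable, Dict, List, Sequence

# The logical operators listed in the description, mapped to Python comparisons.
COMPARISONS: Dict[str, Callable[[float, float], bool]] = {
    "Equal": operator.eq,
    "Less": operator.lt,
    "LessOrEqual": operator.le,
    "Greater": operator.gt,
    "GreaterOrEqual": operator.ge,
}


def dequantize(values: Sequence[int], scale: float, zero_point: int) -> List[float]:
    """DequantizeLinear for a per-tensor scale: (x - zero_point) * scale."""
    return [(v - zero_point) * scale for v in values]


def qdq_compare(op: str,
                a_q: Sequence[int], a_scale: float, a_zp: int,
                b_q: Sequence[int], b_scale: float, b_zp: int) -> List[bool]:
    """DQ(a), DQ(b) -> comparison. The boolean output has no Q node."""
    a = dequantize(a_q, a_scale, a_zp)
    b = dequantize(b_q, b_scale, b_zp)
    return [COMPARISONS[op](x, y) for x, y in zip(a, b)]


result = qdq_compare("Less", [10, 20], 0.5, 0, [5, 30], 1.0, 0)
```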
### Description
1. Add a unit test for the cached QNN context binary
2. Minor change: set the model path to "" if model_path is not available,
since the model could be loaded from a buffer instead of an ONNX file
### Motivation and Context
Support more scenarios.
---------
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
### Description
1. Add a Memory Profiling build job
2. Remove the no-absl build job since the feature will be removed
3. Simplify post-merge-jobs.yml by unifying the pool names
### Motivation and Context
To catch build errors in #16124
### Description
This PR adds an implementation of the `Unsqueeze` operator to WebGPU JSEP.
The implementation follows the [operator
schema](https://github.com/onnx/onnx/blob/main/docs/Operators.md#Unsqueeze).
To implement the `Unsqueeze` operator in the same fashion as the
`Squeeze`, I added the `ComputeOutputShape()` method to the
`UnsqueezeBase` class and made some slight modifications. Please let me
know if it is a bad idea and if I should move this method to the JS
implementation.
I also uncommented test case lines in the `suite-test-list.jsonc` file
for both Squeeze and Unsqueeze operators following @hariharans29's
[comment](https://github.com/microsoft/onnxruntime/pull/16024#issuecomment-1565113633).
### How was it tested
1. I created a model with only one operator:
```Python
import onnx.helper
node = onnx.helper.make_node(
"Unsqueeze",
inputs=["T", "axes"],
outputs=["y"],
)
graph = onnx.helper.make_graph([node], "test", [onnx.helper.make_tensor_value_info("T", 1, [3, 4, 5]), onnx.helper.make_tensor_value_info("axes", 7, [2])], [onnx.helper.make_tensor_value_info("y", 1, [3, 1, 4, 5, 1])])
onnx.save(onnx.helper.make_model(graph), "unsqueeze.onnx")
```
2. I compiled the runtime using @fs-eire's
[instructions](https://gist.github.com/fs-eire/a55b2c7e10a6864b9602c279b8b75dce).
3. I ran the test models in the browser using this minimal setup:
```HTML
<html>
<script src=".\dist\ort.webgpu.min.js"></script>
<script>
async function run() {
const session = await ort.InferenceSession.create('unsqueeze.onnx', {executionProviders: ['webgpu']});
console.log(session);
const input = new ort.Tensor('float32', new Float32Array(60), [3, 4, 5]);
const dim = new ort.Tensor('int64', [1n, 4n], [2]);
const output = await session.run({ "T": input, "axes": dim });
console.log(output);
}
run();
</script>
</html>
```
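
For reference, the output shape expected from the test model above can be derived with a few lines of Python. This is a minimal sketch of the shape rule in the ONNX `Unsqueeze` schema, not the `ComputeOutputShape()` method from this PR; the function name `unsqueeze_output_shape` is hypothetical.

```python
from typing import List, Sequence


def unsqueeze_output_shape(input_shape: Sequence[int], axes: Sequence[int]) -> List[int]:
    """Insert a dimension of size 1 at every position listed in `axes`.

    Axes refer to positions in the *output* shape and may be negative.
    """
    out_rank = len(input_shape) + len(axes)
    normalized = set()
    for axis in axes:
        if not -out_rank <= axis < out_rank:
            raise ValueError(f"axis {axis} is out of range for output rank {out_rank}")
        if axis < 0:
            axis += out_rank
        if axis in normalized:
            raise ValueError(f"axis {axis} is listed more than once")
        normalized.add(axis)

    # Walk the output positions, consuming input dims where no 1 is inserted.
    remaining = iter(input_shape)
    return [1 if i in normalized else next(remaining) for i in range(out_rank)]


# Same input shape and axes as in the browser test above.
shape = unsqueeze_output_shape([3, 4, 5], [1, 4])
```

For the `[3, 4, 5]` input and axes `[1, 4]` used in the test, this yields `[3, 1, 4, 5, 1]`, which matches the output shape declared in the model.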
### Motivation and Context
Improve operator coverage for WebGPU JSEP.
For TunableOp, some instances may perform very badly and take a long
time during the profiling process.
Add the `tunable_op_max_tuning_duration_ms` parameter to limit the maximum tuning
time.
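
One way such a time budget could be applied is sketched below. This is an assumption-laden illustration, not the TunableOp implementation: the function `pick_fastest`, its `clock` parameter, and the policy of stopping between candidates are all hypothetical, and the actual semantics of `tunable_op_max_tuning_duration_ms` may differ.

```python
import time
from typing import Callable, Dict, Optional


def pick_fastest(candidates: Dict[str, Callable[[], None]],
                 max_tuning_duration_ms: float,
                 clock: Callable[[], float] = time.perf_counter) -> Optional[str]:
    """Time each candidate once and return the name of the fastest one.

    Profiling stops as soon as the budget is exhausted, but at least one
    candidate is always measured so that a result is available.
    """
    deadline = clock() + max_tuning_duration_ms / 1000.0
    best_name: Optional[str] = None
    best_time = float("inf")
    for name, run in candidates.items():
        if best_name is not None and clock() >= deadline:
            break  # Budget exhausted: keep the best candidate seen so far.
        start = clock()
        run()
        elapsed = clock() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

The trade-off in this sketch is that a slow early candidate can use up the budget and hide a faster later one, which is the price of bounding the tuning time.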
### Consolidate ORTModule logging
There are a few improvements to ORTModule logging:
- All ORTModule logging uses the logger that is initialized in
`ortmodule.py`.
- Manage all export logs the same way, e.g. use
`_logger.suppress_os_stream_output(log_level=self._debug_options.logging.log_level)`
to control whether export-related logs are suppressed. If any warnings or
errors are suppressed, `self._warning_log_detected_during_export` is
set to True; then, when we log the ORTModule feature matrix, we also
tell users that some logs were suppressed.
- Downgrade some warnings. We have had some warnings for years, and it looks like
many models trigger them by default with no action we can actually take, so
downgrade them to make user logging cleaner.
- PyTorch export requires updating the signature of custom export functions;
otherwise, `_symbolic_context_handler` complains with warnings,
so update the custom export function adaptation for PyTorch versions >=1.13.
- Add an ORTModule feature matrix summary. **This is supposed to be the only
place users see our logs by default** (unless they use INFO or
VERBOSE). The ON/OFF states of features are shown clearly in case users
want to try some features that are OFF. This log only shows up on rank
0 (if there are multiple ranks); the intention is for users to see
useful and clean output from ORTModule by default.


- `reinitialize_ortmodule` in utils.py is only used by ortmodule.py, so
move it into ortmodule.py. utils.py then no longer depends on
`orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py`,
and `_custom_op_symbolic_registry.py` can call functions defined in
utils.py (without a circular import).
### Description
Fix an issue where FusedMatMulOpTest.FloatTypeTransposeBatch fails on GPUs with TF32 support.
Authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description
Disable webpack's polyfills for Node's `global`, `__filename` and
`__dirname` in the web build. These polyfills confuse the
environment detection generated by Emscripten.
See https://webpack.js.org/configuration/node/
### Description
This PR fixes the build break that occurs when onnxruntime_ENABLE_MEMORY_PROFILE
is on.
### Motivation and Context
The build break is tracked in
https://github.com/microsoft/onnxruntime/issues/16124, which this PR fixes.
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
Add tests for convolution with padding and for convolution with large inputs and outputs.
### Motivation and Context
This is mainly to check the CPU vs QNN EP output mismatch for models.
`./onnxruntime_test_all --gtest_filter=*.TestQDQConvU8U8S32*`
Failed tests with mismatch.
[ FAILED ] 2 tests, listed below:
[ FAILED ] QnnHTPBackendTests.TestQDQConvU8U8S32_large_input1_padding_bias_initializer
[ FAILED ] QnnHTPBackendTests.TestQDQConvU8U8S32_large_input2_bias_initializer
`./onnxruntime_test_all --gtest_filter=*.TestCPUConvf32_*`
[ FAILED ] QnnCPUBackendTests.TestCPUConvf32_large_input1_pad_bias_initializer
### Description
This change adds a new instance function (method) to type
`InferenceSession` to allow users to manually release an inference
session instance.
#16131 depends on this change to work correctly.
Change the ORTModule test because the ROCm EP behaves differently from the CUDA EP.
The warning from torch `The first argument to symbolic functions is
deprecated in 1.13 and will be removed in the future. Please annotate
treat the first argument (g) as GraphContext and use context information
from the object instead.` appears twice on ROCm EP.
On ROCm EP, the log is shown as below:
```
The first argument to symbolic functions is deprecated in 1.13 and will be removed in the future. Please annotate treat the first argument (g) as GraphContext and use context information from the object instead.
The first argument to symbolic functions is deprecated in 1.13 and will be removed in the future. Please annotate treat the first argument (g) as GraphContext and use context information from the object instead.
User Module's attribute name _torch_module collides with ORTModule's attribute name. User Module's attribute may not be returned when trying to retrieve the attribute through ORTModule.
User Module's attribute name load_state_dict collides with ORTModule's attribute name. User Module's method may not be called upon invocation through ORTModule.
```
For older versions of custom ops, the optional and variadic callbacks are
null pointers, so this change adds conditions that guard their usage.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, and FloatE5M2FNUZ
as described in PR https://github.com/onnx/onnx/pull/4805. It uses the CUDA
API to cast float/half to float8 if CUDA>=11.8, and a custom implementation
if CUDA<11.8.
* It implements Cast, QuantizeLinear, and DequantizeLinear for all types on
CPU, and only for types FloatE4M3FN and FloatE5M2 on CUDA.
* It extends the supported types for Shape, Reshape, Identity, and the
control flow operators If, Loop, and Scan.
* It implements Equal(19).
* The Cast, QuantizeLinear, and DequantizeLinear operators now support a
parameter `saturate`, which is only valid for float 8 types. It is true by
default. In that case, any value out of range is converted into the
maximum float 8 value. If false, the result is infinite.
* QuantizeLinear and DequantizeLinear now support multiple scales on CUDA
(and ROCm by extension): the scale is a 1D tensor with one scale per channel.
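
To make the "one scale per channel" case concrete, here is a minimal pure-Python sketch of per-channel dequantization using the ONNX `DequantizeLinear` formula `y = (x - zero_point) * scale`. It assumes the channel axis is the first dimension and uses plain numbers in nested lists in place of quantized tensors; the function name `dequantize_per_channel` is hypothetical and is unrelated to the CUDA kernels in this PR.

```python
from typing import List, Optional, Sequence


def dequantize_per_channel(x: Sequence[Sequence[float]],
                           scales: Sequence[float],
                           zero_points: Optional[Sequence[float]] = None) -> List[List[float]]:
    """Dequantize a 2D input whose rows are channels.

    `scales` is a 1D sequence with one scale per channel (row).
    `zero_points` defaults to 0 for every channel when omitted.
    """
    if zero_points is None:
        zero_points = [0] * len(scales)
    if not (len(x) == len(scales) == len(zero_points)):
        raise ValueError("need exactly one scale and one zero point per channel")
    return [[(value - zp) * scale for value in row]
            for row, scale, zp in zip(x, scales, zero_points)]


# Two channels, each with its own scale.
y = dequantize_per_channel([[1, 2], [3, 4]], [0.5, 2.0])
```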
### Motivation and Context
Supports the latest ONNX version.
Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)
---------
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
### Description
* Add aggregated op-kernel correlation information in profiler explorer
when running inference session.
* Add filtering feature so that we can focus on model runs of interest
(excluding warmup steps, etc.)
### Description
This PR adds an implementation of the `Squeeze` operator to WebGPU JSEP.
The implementation follows the [operator
schema](https://github.com/onnx/onnx/blob/main/docs/Operators.md#Squeeze)
and allows one or two inputs.
### How was it tested
1. I created two models. Without `axes`:
```Python
import onnx.helper
node = onnx.helper.make_node(
"Squeeze",
inputs=["T"],
outputs=["y"],
)
graph = onnx.helper.make_graph([node], "test", [onnx.helper.make_tensor_value_info("T", 1, [3, 1, 4, 5])],
[onnx.helper.make_tensor_value_info("y", 1, [3, 4, 5])])
onnx.save(onnx.helper.make_model(graph), "squeeze.onnx")
```
And with `axes`:
```Python
import onnx.helper
node = onnx.helper.make_node(
"Squeeze",
inputs=["T", "axes"],
outputs=["y"],
)
graph = onnx.helper.make_graph([node], "test", [onnx.helper.make_tensor_value_info("T", 1, [3, 1, 4, 5]), onnx.helper.make_tensor_value_info("axes", 7, [1])], [onnx.helper.make_tensor_value_info("y", 1, [3, 4, 5])])
onnx.save(onnx.helper.make_model(graph), "squeeze-dim.onnx")
```
2. I compiled the runtime using @fs-eire's
[instructions](https://gist.github.com/fs-eire/a55b2c7e10a6864b9602c279b8b75dce).
3. I ran the test models in the browser using this minimal setup:
```HTML
<html>
<script src=".\dist\ort.webgpu.min.js"></script>
<script>
async function run() {
const session = await ort.InferenceSession.create('squeeze-dim.onnx', {executionProviders: ['webgpu']});
console.log(session);
const input = new ort.Tensor('float32', new Float32Array(60), [3, 1, 4, 5]);
const dim = new ort.Tensor('int64', [-3n], [1]);
const output = await session.run({ "T": input, "axes": dim });
console.log(output);
}
run();
</script>
</html>
```
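
The two test models above (with and without `axes`) should both produce the shape `[3, 4, 5]`. The sketch below reproduces that shape rule from the ONNX `Squeeze` schema in plain Python; it is an illustration only, and the function name `squeeze_output_shape` is hypothetical rather than taken from the JSEP implementation.

```python
from typing import List, Optional, Sequence


def squeeze_output_shape(input_shape: Sequence[int],
                         axes: Optional[Sequence[int]] = None) -> List[int]:
    """Remove dimensions of size 1.

    Without `axes`, every dimension of size 1 is removed. With `axes`,
    only the listed dimensions are removed, and each must have size 1.
    Negative axes count from the end of the input shape.
    """
    rank = len(input_shape)
    if axes is None:
        return [dim for dim in input_shape if dim != 1]

    normalized = set()
    for axis in axes:
        if not -rank <= axis < rank:
            raise ValueError(f"axis {axis} is out of range for rank {rank}")
        if axis < 0:
            axis += rank
        if input_shape[axis] != 1:
            raise ValueError(f"cannot squeeze axis {axis} with size {input_shape[axis]}")
        normalized.add(axis)
    return [dim for i, dim in enumerate(input_shape) if i not in normalized]


# Same shapes as the two test models: one input, then two inputs with axes = [-3].
without_axes = squeeze_output_shape([3, 1, 4, 5])
with_axes = squeeze_output_shape([3, 1, 4, 5], [-3])
```

The axis `-3` used in the browser test normalizes to `1` for a rank-4 input, which is the dimension of size 1.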
### Motivation and Context
Improve operator coverage for WebGPU JSEP.
### Description
Add 2 new QNN CIs to tools/python/run_CIs_for_external_pr.py
### Motivation and Context
Update the tool so that it runs all current CIs.
This addresses a DML performance regression introduced by the constant
sharing pass.
The constant sharing pass identifies small initializer tensors that
contain identical values and merges them. This can
cause DML to treat those tensors as non-constant and skip certain
optimizations.
To prevent this, there is now an element count threshold below which the
DML EP still applies these optimizations, even though this results in
duplicate work uploading and pre-processing the common tensor at
multiple operators.
### Description
The file include/onnxruntime/core/providers/cuda/cuda_provider_options.h
is a C++ file. It is not for C.
Even before this commit, this header file was not compatible with C compilers, because it contains:
```
onnxruntime::ArenaExtendStrategy arena_extend_strategy;
```
This file is intended to be an internal header. It should not be included in onnxruntime_c_api.h and should not be used with the public C APIs. Users can only obtain an instance of OrtCUDAProviderOptionsV2 via CreateCUDAProviderOptions. That way, we can add new members to this struct without breaking binary compatibility.
Since it is an internal header, we can safely use C++ syntax there.
### Description
ExecutionProvider API refactor: detach the allocator from the EP by creating
a local CPU allocator instead.
### Motivation and Context
This PR is a refactor that creates a local CPU allocator instead of
getting the allocator from the ExecutionProvider. The final goal is to
fully detach allocators from the ExecutionProvider and keep them at the
session level, indexed by OrtDevice.