### Description
This PR improves the `TreeNodeElementId` hash function by employing the
[Elegant Pairing function](http://szudzik.com/ElegantPairing.pdf). In a
few words, the Elegant Pairing function maps two non-negative integers to
a non-negative integer that is uniquely associated with that pair. This
drastically reduces collisions and therefore the time required to create
a session for a large tree ensemble model.
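As an illustration of the idea, here is a minimal Python sketch of Szudzik's pairing as described in the linked paper. It is not the code in this PR: the name `elegant_pair` is made up for the example, and a C++ version would also have to consider overflow of fixed-width integers, which Python's arbitrary-precision integers hide.

```python
def elegant_pair(x: int, y: int) -> int:
    """Map two non-negative integers to one unique non-negative integer."""
    if x < 0 or y < 0:
        raise ValueError("both inputs must be non-negative")
    # Szudzik's definition: y*y + x when x < y, otherwise x*x + x + y.
    return y * y + x if x < y else x * x + x + y


# Every pair in a 100 x 100 grid gets its own code, so no two pairs collide.
pairs = [(x, y) for x in range(100) for y in range(100)]
codes = {elegant_pair(x, y) for x, y in pairs}
assert len(codes) == len(pairs)
```

Because distinct pairs never share a code, any collisions that remain in a hash table come only from reducing the code to a bucket index.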
### Motivation and Context
We use ONNX Runtime to serve our models as part of a Triton backend. We
noticed that it took around 2 minutes to load a large tree ensemble
model (around 5k trees with around 3 million nodes in total). After
investigating the issue, it was clear that the `TreeNodeElementId` hash
function was not able to map keys to buckets of the C++ `unordered_map`
without a significant number of collisions (in some cases 700 items per
bucket).
The following timings show the improvement obtained by the proposed
change. We used the `onnx_test_runner` command.

#### Before
```
$> time ./onnx_test_runner -v ~/folder_with_model
result:
Models: 1
Total test cases: 0
Succeeded: 0
Not implemented: 0
Failed: 0
Stats by Operator type:
Not implemented(0):
Failed:
Failed Test Cases:
real 0m55.695s
user 0m52.919s
sys 0m0.760s
```
#### After
```
$> time ./onnx_test_runner -v ~/folder_with_model
result:
Models: 1
Total test cases: 0
Succeeded: 0
Not implemented: 0
Failed: 0
Stats by Operator type:
Not implemented(0):
Failed:
Failed Test Cases:
real 0m17.152s
user 0m14.318s
sys 0m0.619s
```
### Description
Updating README.md to add vitisai for onnxruntime_perftest
### Motivation and Context
The perftest tool supports vitisai, but README.md does not list it. This
has caused some internal confusion about whether vitisai is supported.
See https://github.com/microsoft/onnxruntime/pull/15673 for context.
### Fix slice upstream - (MatMul) [ShapeInferenceError] Incompatible dimensions
```
2023-07-22 14:58:16.918478478 [I:onnxruntime:Default, constant_sharing.cc:256 ApplyImpl] Total shared scalar initializer count: 10
2023-07-22 14:58:16.919494252 [W:onnxruntime:Default, graph.cc:108 MergeShapeInfo] Error merging shape info for output. 'onnx::Cast_424' source:{-1,31,-1,-1} target:{-1,32,-1,-1}. Falling back to lenient merge.
2023-07-22 14:58:16.921014114 [W:onnxruntime:Default, graph.cc:108 MergeShapeInfo] Error merging shape info for output. 'onnx::MatMul_425' source:{-1,31,-1,-1} target:{-1,32,-1,-1}. Falling back to lenient merge.
Traceback (most recent call last):
File "examples/onnxruntime/training/language-modeling/run_clm.py", line 594, in <module>
main()
File "examples/onnxruntime/training/language-modeling/run_clm.py", line 542, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 454, in train
return inner_training_loop(
File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 755, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 363, in compute_loss
return model_with_loss(dict_inputs, return_outputs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1724, in forward
loss = self.module(*inputs, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 384, in _forward
return ortmodule._torch_module.forward(*inputs, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 364, in _forward
return torch_module_ort._execution_manager(torch_module_ort.is_training()).forward(*inputs, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 345, in forward
self._fallback_manager.handle_exception(
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 157, in handle_exception
raise exception
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 280, in forward
self._build_graph(graph_transformer_config)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_logger.py", line 218, in wrapper
result = func(graph_execution_manager, *args, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 360, in _build_graph
super()._build_graph(graph_transformer_config)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 186, in _build_graph
self._graph_builder.build(config)
RuntimeError: /bert_ort/pengwa/onnxruntime/orttraining/orttraining/python/orttraining_pybind_state.cc:823 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const onnxruntime::training::TrainingGraphTransformerConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Node (MatMul_403) Op (MatMul) [ShapeInferenceError] Incompatible dimensions
```
The `Slice` op was incorrectly given an `axis` attribute; this change
passes `axes` as an input instead.
### Motivation and Context
Sometimes, the ONNX exporter generates rank- or shape-dependent
sub-graphs. Thus, errors can occur when running the ONNX model with
different inputs. This PR
([78e736d](78e736d857))
addresses this problem by
- if needed, exporting multiple ONNX models with different inputs for
the same GraphModule.
- implementing a naive mechanism to determine whether existing ONNX
models (and the associated InferenceSession) can be reused.
On the other hand, in the second commit
[b5a9b5f](b5a9b5f849),
this PR also enables dynamic shapes in DORT by
- passing `dynamic_shapes = True` to the exporter (see how
`DEFAULT_DYNAMIC_BACKEND` is created)
- calling `torch._dynamo.optimize(dynamic_ort_aot, dynamic=True)` (see
how `dynamic_ort_aot` is created).
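The reuse check described in the first list can be pictured roughly as a cache keyed by an input signature. The sketch below is an assumption about the general shape of such a mechanism, not DORT's actual code: `input_signature`, `SessionCache`, and the choice of (rank, dtype) as the key are all hypothetical, and `export_fn` stands in for exporting a GraphModule and creating an InferenceSession.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# One (shape, dtype) pair per model input, e.g. ((2, 3), "float32").
InputSpec = Tuple[Tuple[int, ...], str]
Signature = Tuple[Tuple[int, str], ...]


def input_signature(inputs: Sequence[InputSpec]) -> Signature:
    """Hypothetical cache key: the rank and dtype of every input."""
    return tuple((len(shape), dtype) for shape, dtype in inputs)


class SessionCache:
    """Keeps one exported model/session per distinct input signature."""

    def __init__(self, export_fn: Callable[[Sequence[InputSpec]], Any]) -> None:
        self._export_fn = export_fn  # stand-in for export + session creation
        self._sessions: Dict[Signature, Any] = {}

    def get(self, inputs: Sequence[InputSpec]) -> Any:
        key = input_signature(inputs)
        if key not in self._sessions:
            # No compatible model yet: export a new one for these inputs.
            self._sessions[key] = self._export_fn(inputs)
        return self._sessions[key]


exports: List[Sequence[InputSpec]] = []


def fake_export(inputs: Sequence[InputSpec]) -> str:
    """Records each export so the reuse behaviour can be observed."""
    exports.append(inputs)
    return f"session-{len(exports)}"


cache = SessionCache(fake_export)
first = cache.get([((2, 3), "float32")])
second = cache.get([((4, 5), "float32")])    # same rank and dtype: reused
third = cache.get([((2, 3, 4), "float32")])  # different rank: new export
```

Keying on rank rather than the full shape is only one possible policy; it fits the case where shapes are dynamic but rank-dependent sub-graphs still force a separate export.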
/builds/devtechproviz/dl/ort-builder/onnxruntime/onnxruntime/python/onnxruntime_pybind_state.cc:388:14:
error: missing initializer for member
'OrtTensorRTProviderOptionsV2::trt_cuda_graph_enable'
[-Werror=missing-field-initializers]
388 | 0};
|
### Description
### Motivation and Context
The current TRT EP supports models that have nested control flow ops
(multiple levels of subgraphs). However, it fails when a subgraph uses
an outer scope value that is defined several levels up in the top-level
graph; in this case, the outer scope value is an input of the top-level
graph. The outer scope values are not properly handled during the TRT
EP's subgraph reconstruction stage, which then fails at
`graph.resolve()`.
ORT gets capability from EPs bottom-up, meaning the innermost subgraph
is handled first. The TRT EP reconstructs each subgraph level by level,
and the following modifications are made to fix the outer scope values
issue:
- `SetGraphOuterScopeValuesAndInputs()` and `SetAllGraphInputs()` are
added to handle outer scope values and add those values as graph inputs
if needed, so that `graph.resolve()` succeeds.
- Change to use `GetNodeArgIncludingParentGraphs` so that when creating
the fused TRT node for some subgraphs in
`Graph::CreateFusedSubGraphNode()`, it can get the NodeArgs for outer
scope values from the top-level graph.
This PR fixes https://github.com/microsoft/onnxruntime/issues/16217
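To make the notion of an outer scope value concrete, here is a small Python sketch that works on names only. It does not use any ORT or TRT EP types: the `Graph` dataclass and `outer_scope_values` are hypothetical, and the recursion merely mirrors the bottom-up order described above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Graph:
    """Hypothetical stand-in for a graph: value names only."""
    defined: Set[str]  # inputs, initializers and node outputs of this graph
    used: Set[str]     # values consumed by nodes of this graph
    subgraphs: List["Graph"] = field(default_factory=list)


def outer_scope_values(graph: Graph) -> Set[str]:
    """Values needed by this graph or its subgraphs but defined further up."""
    needed = set(graph.used)
    for sub in graph.subgraphs:
        # Innermost graphs are resolved first; whatever they cannot
        # satisfy locally bubbles up to the enclosing graph.
        needed |= outer_scope_values(sub)
    return needed - graph.defined


# "X" is a top-level graph input that is only consumed two levels down.
innermost = Graph(defined={"t"}, used={"X", "t"})
middle = Graph(defined={"m"}, used={"m"}, subgraphs=[innermost])
top = Graph(defined={"X"}, used=set(), subgraphs=[middle])
```

In this toy model `middle` never touches `X` itself, yet `X` still has to be visible to it so that `innermost` can resolve; that is the situation the PR describes.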
### Disable large index tests due to limited GPU mem
Recently, the following two tests have been failing because of
insufficient GPU memory; it is unclear which other programs are using the
GPU at the same time. Disable them for now to unblock the required CI.
```
1: [ FAILED ] 2 tests, listed below:
1: [ FAILED ] CrossEntropyTest.SoftmaxCrossEntropyLossInternal_LargeSizeTensorUInt64Index
1: [ FAILED ] CrossEntropyTest.SoftmaxCrossEntropyLossInternalGrad_LargeSizeTensorUInt64Index
2023-07-23T02:15:39.7559251Z 1: [ RUN ] CrossEntropyTest.SoftmaxCrossEntropyLossInternal_LargeSizeTensorUInt64Index
2023-07-23T02:16:53.0904576Z 1: 2023-07-23 02:16:53.089586592 [E:onnxruntime:SoftmaxCrossEntropyLossInternal, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running SoftmaxCrossEntropyLossInternal node. Name:'node1' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* **onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 4294973440**
2023-07-23T02:16:53.0905775Z 1:
2023-07-23T02:16:53.0906087Z 1: /onnxruntime_src/onnxruntime/test/providers/base_tester.cc:323: Failure
2023-07-23T02:16:53.0906698Z 1: Expected equality of these values:
2023-07-23T02:16:53.0907086Z 1: expect_result
2023-07-23T02:16:53.0907564Z 1: Which is: 4-byte object <00-00 00-00>
2023-07-23T02:16:53.0973055Z 1: ExpectResult::kExpectFailure
2023-07-23T02:16:53.0973984Z 1: Which is: 4-byte object <01-00 00-00>
2023-07-23T02:16:53.0975375Z 1: Run failed but expected success: Non-zero status code returned while running SoftmaxCrossEntropyLossInternal node. Name:'node1' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 4294973440
2023-07-23T02:16:53.0976198Z 1:
2023-07-23T02:16:53.0976483Z 1: Google Test trace:
2023-07-23T02:16:53.0976818Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.0977229Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.0977639Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 2345
2023-07-23T02:16:53.0978035Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 5678
2023-07-23T02:16:53.0978441Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 1234
2023-07-23T02:16:53.1303810Z 1: /onnxruntime_src/orttraining/orttraining/test/training_ops/cuda/cross_entropy_test.cc:443: Failure
2023-07-23T02:16:53.1304644Z 1: Expected equality of these values:
2023-07-23T02:16:53.1304974Z 1: ret.first
2023-07-23T02:16:53.1305685Z 1: Which is: 4-byte object <04-00 00-00>
2023-07-23T02:16:53.1306030Z 1: COMPARE_RESULT::SUCCESS
2023-07-23T02:16:53.1306414Z 1: Which is: 4-byte object <00-00 00-00>
2023-07-23T02:16:53.1306754Z 1: Unsupported compare with CompareOrtValueNumerals.
2023-07-23T02:16:53.1307487Z 1: Google Test trace:
2023-07-23T02:16:53.1307848Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.1308252Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.1308652Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 2345
2023-07-23T02:16:53.1309068Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 5678
2023-07-23T02:16:53.1309460Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 1234
2023-07-23T02:16:53.1309889Z 1: /onnxruntime_src/orttraining/orttraining/test/training_ops/cuda/cross_entropy_test.cc:443: Failure
2023-07-23T02:16:53.1310239Z 1: Expected equality of these values:
2023-07-23T02:16:53.1310527Z 1: ret.first
2023-07-23T02:16:53.1310893Z 1: Which is: 4-byte object <04-00 00-00>
2023-07-23T02:16:53.1311208Z 1: COMPARE_RESULT::SUCCESS
2023-07-23T02:16:53.1311600Z 1: Which is: 4-byte object <00-00 00-00>
2023-07-23T02:16:53.1311921Z 1: Unsupported compare with CompareOrtValueNumerals.
2023-07-23T02:16:53.1312229Z 1: Google Test trace:
2023-07-23T02:16:53.1312556Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.1312951Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.1313362Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 2345
2023-07-23T02:16:53.1313749Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 5678
2023-07-23T02:16:53.1314156Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 1234
2023-07-23T02:16:53.4476437Z 1: [ FAILED ] CrossEntropyTest.SoftmaxCrossEntropyLossInternal_LargeSizeTensorUInt64Index (73692 ms)
```
### Motivation and Context
### Description
1. use the pool with VS2022
2. upgrade System.Memory to 4.5.5
### Motivation and Context
Solve the build error while using VS2022:
`[Failure] Msbuild failed when processing the file
'D:\a\_work\1\s\csharp\src\Microsoft.ML.OnnxRuntime\Microsoft.ML.OnnxRuntime.csproj'
with message: Method not found: 'System.ReadOnlySpan`1<Char>
Microsoft.IO.Path.GetFileName(System.ReadOnlySpan`1<Char>)'`
Ref:
https://stackoverflow.com/questions/73399777/azure-build-failing-due-to-method-not-found-system-readonlyspan1char-micros
### Description
These changes allow downloading a prebuilt protoc compiler when building
the WebAssembly version on macOS.
Otherwise the build tries to build a js/wasm version of protoc and
throws an error while executing it: "protoc.js permission denied"
### Motivation and Context
I need to switch between my main working computer and a PC to make
changes to the WebAssembly build. I would like not to do that anymore.
### Description
Allocating a new GPUBuffer in every `session.run` is inefficient.
Allocation should happen only in the first run; in the following runs,
we should try to reuse those buffers.
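The reuse pattern can be sketched as a simple size-keyed pool. This is only an illustration of the idea, not the WebGPU code in this PR: `BufferPool` and `run` are hypothetical names, and a Python `bytearray` stands in for a GPUBuffer.

```python
from typing import Dict, List


class BufferPool:
    """Hands out buffers by size and takes them back for later reuse."""

    def __init__(self) -> None:
        self._free: Dict[int, List[bytearray]] = {}
        self.allocations = 0  # number of real allocations performed

    def acquire(self, size: int) -> bytearray:
        bucket = self._free.get(size)
        if bucket:
            return bucket.pop()  # reuse a previously released buffer
        self.allocations += 1
        return bytearray(size)   # stand-in for creating a GPUBuffer

    def release(self, buffer: bytearray) -> None:
        self._free.setdefault(len(buffer), []).append(buffer)


def run(pool: BufferPool, sizes: List[int]) -> None:
    """Stand-in for one session run that needs buffers of these sizes."""
    buffers = [pool.acquire(size) for size in sizes]
    for buffer in buffers:
        pool.release(buffer)


pool = BufferPool()
run(pool, [1024, 4096, 1024])
after_first_run = pool.allocations
run(pool, [1024, 4096, 1024])
after_second_run = pool.allocations  # unchanged: everything was reused
```

A pool like this trades memory for speed, since released buffers stay alive between runs.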
### Motivation and Context
- This PR is for performance: mobilenetv2 improves from 12.9 ms to
9.58 ms.
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ #16789
Bump ruff to 0.0.278 and fix new lint errors. I added `noqa` to all
existing RUF012 errors (the rule requires mutable class variables to be
annotated with `ClassVar`), as well as to all PERF issues.
Signed-off-by: Justin Chu <justinchu@microsoft.com>
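For readers unfamiliar with RUF012, the following sketch shows the two ways of dealing with it mentioned above. The `Exporter` class and its attributes are invented for the example and do not come from the ONNX Runtime code base.

```python
from typing import ClassVar, Dict, List


class Exporter:
    # Annotated with ClassVar: the dict is explicitly shared by all instances.
    registry: ClassVar[Dict[str, int]] = {}

    # Mutable default without ClassVar: the rule flags it, so it is suppressed.
    legacy_names: List[str] = []  # noqa: RUF012

    def register(self, name: str) -> None:
        self.registry[name] = len(self.registry)


a, b = Exporter(), Exporter()
a.register("Add")
# b sees the entry too, because the dict lives on the class, not the instance.
shared = b.registry
```

The annotation does not change runtime behaviour; it documents that the sharing is intentional, which is what the lint rule asks for.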
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #16789
* __->__ #16788
This change fixes the N802 lint errors by renaming the test case to use
snake case.
### Description
The Java API currently only supports fp16 output tensors which it
automatically casts to floats on the way out. This PR adds support for
creating fp16 and bf16 tensors (from `java.nio.Buffer` objects or as the
output of models, creation from Java short arrays is not supported),
along with efficient methods for casting `FloatBuffer` into
`ShortBuffer` filled with fp16 or bf16 values and vice versa.
The fp16 conversions use a trick to pull in the efficient conversion
methods added to Java 20, falling back to ports of the MLAS methods
otherwise. The Java 20 methods can be special cased by the C2 JIT
compiler to emit the single instruction on x86 and ARM which converts
fp32<->fp16, or the vectorized versions thereof, so they should be quite
a bit faster than the MLAS ported one.
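The conversions themselves come down to bit manipulation of the IEEE 754 encodings. The Python sketch below illustrates that, storing each 16-bit value as an unsigned integer much as a `ShortBuffer` stores it as a short. It is neither the Java API added by this PR nor a port of the MLAS routines: the function names are hypothetical, and round-to-nearest-even for bf16 is an assumption about the rounding mode.

```python
import struct


def float_to_fp16_bits(value: float) -> int:
    """Encode a float as IEEE 754 half precision and return the 16 bits."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def fp16_bits_to_float(bits: int) -> float:
    """Decode 16 half-precision bits back into a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def float_to_bf16_bits(value: float) -> int:
    """Keep the top 16 bits of the fp32 encoding, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: set a mantissa bit so the result stays a NaN after truncation.
        return (bits >> 16) | 0x0040
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_float(bits: int) -> float:
    """bf16 is the upper half of fp32, so decoding is a 16-bit shift."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


one_fp16 = float_to_fp16_bits(1.0)  # 0x3C00
one_bf16 = float_to_bf16_bits(1.0)  # 0x3F80
```

bf16 keeps the fp32 exponent range and gives up mantissa bits, while fp16 has a narrower exponent range, so the two formats need different conversion code.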
### Motivation and Context
fp16 and bf16 are increasingly popular formats and we've had several
requests for this functionality. Fixes #7003.
cc @yuslepukhin @cassiebreviu
---------
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
### Description
1) Added Sequence and Map convenience APIs to create input Sequences
and Maps and to visit the outputs.
2) Address an OrtValue design issue that arises when values are created
on top of managed memory and those OrtValues are used to create
Sequences and Maps. We should retain the original managed instances that
keep the memory pinned. We opt to keep track of those and dispose of
them within the instance of OrtValue that represents a Map or a
Sequence.
3) Set `LangVersion` to default per the [MS Versioning
Docs](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/configure-language-version).
### Motivation and Context
1) When writing code examples, use of the Map and Sequence APIs proved
to be cumbersome.
2) It is a bug that we should address, as managed memory can be moved by
the GC, leading to intermittent crashes.
3) Make use of the latest C# language features.
### Description
Add op support for LayerNorm, Asin, Sign.
Enable QDQ node unit support for Sin Op
---------
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
### Description
`torch.norm` is deprecated, as mentioned in issue #16751. This PR
replaces the call to `torch.norm` with the alternatives suggested by the
torch documentation.
### Description
A [previous PR](https://github.com/microsoft/onnxruntime/pull/16531)
added a temporary directory to save the model optimizations after
loading a model into an `InferenceSession`. Many models that have an
external data file, however, require the data file to be in the same
directory as the ONNX model file. Because the model is saved in a
temporary directory and the data is saved in another directory, loading
the model from the temporary directory raises a `FileNotFoundError`.
This PR fixes this error by saving the external data file in the same
directory that the optimized model is located in.
### Motivation and Context
This PR fixes a bug with using a temporary directory while running the
optimizer for models that have an external data file.
This pull request contains a few changes:
1. Adds support for string ort values.
2. Fixes the training minimal build (that was broken with #16601) by
putting custom op registration behind #ifdefs
3. Fixes the iOS pod package generation (that was again broken with
#16601) by explicitly providing paths to be copied during pod creation.
### Description
- Updates the default QNN SDK to 2.12 for CI pipelines
- Adds a disabled InstanceNormalization test for regression on QNN SDK
2.12
- Cleans up logs for unsupported ops.
### Motivation and Context
Test with the latest QNN SDK.
Otherwise, an unsupported version of gtest/gmock will be found at
/opt/conda/include for ROCm builds. Though this issue was initially
found for ROCm builds, the issue is generic. onnxruntime requires a
specific version of googletest and should not rely on locating
googletest using find_package.
The ROCm error was:
```
In file included from /opt/conda/include/gmock/gmock-spec-builders.h:75,
from /opt/conda/include/gmock/gmock-generated-function-mockers.h:47,
from /opt/conda/include/gmock/gmock-function-mocker.h:39,
from /opt/conda/include/gmock/gmock.h:61,
from /stage/onnxruntime/onnxruntime/test/util/test_utils.cc:17:
/opt/conda/include/gmock/gmock-matchers.h: In instantiation of ‘bool testing::internal::PointwiseMatcher<TupleMatcher, RhsContainer>::Impl<LhsContainer>::
MatchAndExplain(LhsContainer, testing::MatchResultListener*) const [with LhsContainer = const gsl::span<const float>&; TupleMatcher = testing::internal::
FloatingEq2Matcher<float>; RhsContainer = gsl::span<const float>]’:
/opt/conda/include/gmock/gmock-matchers.h:2303:10: required from here
/opt/conda/include/gmock/gmock-matchers.h:2312:48: error: no type named ‘const_iterator’ in ‘testing::internal::PointwiseMatcher<testing::internal::
FloatingEq2Matcher<float>, gsl::span<const float> >::Impl<const gsl::span<const float>&>::LhsStlContainer’ {aka ‘class gsl::span<const float>’}
```
### Description
Support Op Pad for WebNN EP. It aims to support three modes (constant,
reflect and edge). For now, only constant can be tested with Chrome
Canary.
### Motivation and Context
Support more models like SD1.5-VAE-encode.
### Description
Comment out ORT-Nightly feed in NuGet.config to see if that makes the
Secure Supply Chain Analysis CI step happy.
Add info to readme on manually adding feed and using it.
### Motivation and Context
### Description
This PR includes changes to the documentation in the _readmeOV.rst_ file
and changes to the dockerfile that enable building ORT with the latest
OpenVINO 2023.0.0.
### Motivation and Context
Modified the dockerfile to incorporate the latest version of OpenVINO
(2023.0.0) for building ONNX Runtime.
The changes in this PR aim to improve the overall user experience by
providing accurate and up-to-date documentation while leveraging the
latest OpenVINO 2023.0.0.
### Description
This change upgrades a lot of dependencies. There are 2 motivations for
this change:
- fix the security issue reported by dependabot (protobufjs Prototype
Pollution vulnerability -
https://github.com/advisories/GHSA-h755-8qp9-cq85)
- resolve the requirement of using ONNX IR_VERSION 9 (#16638)
This requires:
- upgrade protobufjs to v7.2.4
- upgrade library 'onnx-proto' to consume latest ONNX release (v1.14.0).
Problems:
- protobufjs v7.2.4 depends on long.js v5, which does not work well with
typescript (commonjs).
- onnx-proto depends on this fix with a new release of long.js
- long.js is in maintenance and it takes longer than expected to put in
new changes
Solutions:
- use a patch script in `preprepare` to copy type declarations to make
long.js work with typescript (commonjs)
- generate onnx protobuf JS/TS files and put them under
js/web/lib/onnxjs/ort-schema/protobuf folder
- remove 'onnx-proto' from dependency.
- apply fixes to generated onnx.d.ts
### Description
- Fixes support for ArgMin/ArgMax to QNN CPU and HTP backends.
- Adds Q/DQ node unit selection logic.
- Handles casting int64 output to uint32 when necessary.
- Adds unit tests for ArgMax/ArgMin.
### Motivation and Context
QNN EP did not actually support ArgMin/ArgMax. Unit tests revealed that
the existing translation was not sufficient to support these ops.
### Description
Fix some issues found in GPT-NeoX graph fusion:
(1) GPT-NeoX uses float16 weights. The step that runs onnxruntime with
opt_level==1 uses the CPU provider. Since most operators do not have
fp16 implementations in the CPU EP, extra Cast nodes are added to upcast
to fp32.
(2) When an Add is shared by two LayerNormalization children,
SkipLayerNormalization fusion might produce an invalid graph.
(3) Reshape fusion might be missed because some parts check only for
initializers, not Constant nodes.
This PR adds a check for whether the model uses FP16; it outputs a
warning when `use_gpu` is not True and uses the GPU provider for graph
optimization when `use_gpu=True`.
There are several global configs used by DORT.
```py
DEFAULT_ONNX_EXPORTER_OPTIONS = torch.onnx._internal.exporter.ResolvedExportOptions(
    torch.onnx._internal.exporter.ExportOptions()
)
# TODO(wechi): This line must generate result identical to the call of
# _create_onnx_supports_op_overload_table(...) inside
# create_onnx_friendly_decomposition_table(...) in
# torch/onnx/_internal/fx/decomposition_table.py.
_SUPPORT_DICT = torch.onnx._internal.fx.decomposition_table._create_onnx_supports_op_overload_table(
    DEFAULT_ONNX_EXPORTER_OPTIONS.onnx_registry
)  # type: ignore
_EXTRA_SUPPORT_DICT: Dict[str, Any] = {
    "getattr": None,
    "_operator.getitem": None,
}
DORT_DECOMPOSITION_TABLE = DEFAULT_ONNX_EXPORTER_OPTIONS.decomposition_table
```
We can see that all but `_EXTRA_SUPPORT_DICT` are deduced from the ONNX
exporter's options. As there are many ways to configure the ONNX
exporter's options, we decided to move these variables to `OrtBackend`'s
`__init__` so that the construction of `OrtBackend` becomes more
flexible (especially for enabling dynamic shape or not).
GemmSoftmaxGemmTunable is occasionally broken by large numerical errors.
The root cause is that CK's Strided Batched Gemm has a larger error
under a specific initialization distribution
`(multinormal_distribution)`.
The Generic (Gemm1 + Softmax + Gemm2) implementation is one instance of
GemmSoftmaxGemmTunable. Gemm1 and Gemm2 in the Generic implementation
are TunableOps when tuning is enabled. In some cases
GemmSoftmaxGemmTunable selects the Generic implementation while Gemm1 or
Gemm2 selects the CK implementation, so the result of
GemmSoftmaxGemmTunable is affected by CK.
- Loosen the tolerance.
- Add `GemmSoftmaxGemmPermuteGenericNestedTunable` to test the Generic
implementation with tuning enabled.
### Description
Replace the constructor function `MLFloat16()` with the public member
function `FromBits()` in the file
`onnxruntime/core/providers/cann/cann_common.cc`
### Motivation and Context
PR [#16506](https://github.com/microsoft/onnxruntime/pull/16506) changed
the public constructor function `MLFloat16(uint16_t x)` to private, and
added a public function `MLFloat16::FromBits(uint16_t x)` in the file
`include/onnxruntime/core/framework/float16.h`, which broke the CANN CI.
This PR aligns the CANN behavior with the modified class `MLFloat16`.
### Description
MAUI test app with tooling to add model and generated or provided input
test data.
The app will load the model and validate the output. It can also run a
specified number of iterations to provide basic performance information.
<img width="401" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/979079/daf3af13-fb22-4cbb-9159-486b483a7485">
### Motivation and Context
Primarily to make it easier to test an arbitrary model on iOS. A MAUI
app allows testing on all platforms.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
#16506 causes almost every translation unit on Linux to emit the following complaint:
```
[1175/1235] Building CXX object CMakeFiles/onnxruntime_test_all.dir/home/guangyunhan/onnxruntime/orttraining/orttraining/test/training_ops/cuda/softmax_test.cc.o
In file included from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/float16.h:18,
from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/data_types.h:17,
from /home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/tensor.h:17,
from /home/guangyunhan/onnxruntime/onnxruntime/test/common/tensor_op_test_utils.h:16,
from /home/guangyunhan/onnxruntime/onnxruntime/test/providers/compare_provider_test_utils.h:7,
from /home/guangyunhan/onnxruntime/orttraining/orttraining/test/training_ops/cuda/softmax_test.cc:4:
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h: In instantiation of ‘static constexpr uint16_t onnxruntime_float16::Float16Impl<Derived>::ToUint16Impl(float) [with Derived = onnxruntime::MLFloat16; uint16_t = short unsigned int]’:
/home/guangyunhan/onnxruntime/include/onnxruntime/core/framework/float16.h:42:66: required from here
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h:241:7: note: ‘union onnxruntime_float16::detail::float32_bits’ has no user-provided default constructor
241 | union float32_bits {
| ^~~~~~~~~~~~
/home/guangyunhan/onnxruntime/include/onnxruntime/core/session/onnxruntime_float16.h:242:16: note: and the implicitly-defined constructor does not initialize ‘unsigned int onnxruntime_float16::detail::float32_bits::u’
242 | unsigned int u;
| ^
```
This PR shuts the compiler up.