### Description
Disable the cache to save disk space for training_x64_debug.
### Motivation and Context
This is a first step to mitigate the disk space shortage in training_x64_debug.
Two modifications:
- After [TRT 8.5](https://github.com/microsoft/onnxruntime/pull/13867) was merged, we can manually set a timeout and have the TRT EP run only a small portion of the unit tests (`onnxruntime_SKIP_AND_PERFORM_FILTERED_TENSORRT_TESTS=ON`), because the additional TRT kernel overhead introduced by TRT 8.5 significantly increases test time. This PR modifies the check condition so that TensorRT CIs (which can enable the builder placeholder) still run most of the unit tests.
- Exclude the TRT EP from the [Resize Opset 18](https://github.com/microsoft/onnxruntime/pull/13890) unit tests, since TensorRT 8.5 supports operators only up to opset 17.
### Description
Allows the PostAnalysis@2 task for Windows CI jobs to continue even if
an error is encountered.
### Motivation and Context
This is a temporary workaround that enables the
`Windows_Packaging_CPU_x86_default` job within the Zip-Nuget-Java-NodeJS
packaging pipeline to finish. A recent update to dotnet 6 has broken the
PostAnalysis task for this job.
This task was originally added by
https://github.com/microsoft/onnxruntime/pull/13694
### Description
Adds the C APIs below to support custom ops that wrap an entire model to
be run with an external runtime. The current SNPE EP is an example of an
EP that could be ported to a custom op wrapper. For example: the custom op
stores the serialized SNPE DLC binary as a string attribute, the SNPE
model is built when the kernel is created, and the model is run with SNPE
APIs on each call to the kernel's compute method.
#### C APIs
| API | Description | Why |
| --- | --- | --- |
| `KernelInfo_GetInputCount` | Gets the number of inputs from `OrtKernelInfo`. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputCount` | Gets the number of outputs from `OrtKernelInfo`. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputName` | Gets an input's name. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputName` | Gets an output's name. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetInputTypeInfo` | Gets the type/shape information for an input. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfo_GetOutputTypeInfo` | Gets the type/shape information for an output. | Query I/O characteristics during kernel creation<sup>1</sup> |
| `KernelInfoGetAttribute_tensor` | Gets an `OrtValue` tensor stored as an attribute in the graph node. | Extract serialized models, weights, etc. |
| `GetSessionConfigEntry` | Gets a session configuration value. | Need to be able to read session-time configurations from within a custom op |
| `HasSessionConfigEntry` | Checks whether a session configuration entry exists. | Need to be able to read session-time configurations from within a custom op |
#### Why so many KernelInfo APIs?<sup>1</sup>
Similar APIs currently exist for `OrtKernelContext`, but not
`OrtKernelInfo`. Note that `OrtKernelContext` is passed to the custom op
on call to its kernel's compute() function. However, `OrtKernelInfo` is
available on kernel creation, which occurs when the session is created.
Having these APIs available from `OrtKernelInfo` allows an operator to
trade off compute time against session-creation time, and vice versa.
Operators that must build expensive state may prefer to do so at
session-creation time instead of at compute time.
SNPE is an example of an EP that needs to be able to query `KernelInfo`
for the name, type, and shape of inputs and outputs in order to build
the model from the serialized DLC data. This is an expensive operation.
Other providers (e.g., OpenVINO) are able to query I/O info from the
serialized model, so they do not strictly need these APIs. However, the
APIs can still be used to validate the expected I/O characteristics.
Additionally, several of our CPU contrib ops currently use the same
internal version of these KernelInfo APIs (Ex:
[qlinear_softmax](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc#L71)).
If custom ops are also meant to be a test bed for future ops, then all
custom ops (not just runtime wrappers) would benefit from the addition
of these public KernelInfo APIs (IMO).
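As a rough sketch (not code from this PR), the snippet below shows how a wrapper kernel could use these APIs at creation time, assuming the C++ wrappers (`Ort::ConstKernelInfo` and friends) expose the C APIs listed in the table above; the `WrapperKernel` type and its members are hypothetical.
```c++
#include <cstdint>
#include <string>
#include <vector>
#include "onnxruntime_cxx_api.h"

// Hypothetical kernel for a custom op that wraps an external runtime.
// It queries I/O characteristics once, at session creation, so the expensive
// model build does not have to happen on every Compute() call.
struct WrapperKernel {
  explicit WrapperKernel(const OrtKernelInfo* info) {
    Ort::ConstKernelInfo kernel_info{info};

    const size_t num_inputs = kernel_info.GetInputCount();    // KernelInfo_GetInputCount
    const size_t num_outputs = kernel_info.GetOutputCount();  // KernelInfo_GetOutputCount
    (void)num_outputs;

    for (size_t i = 0; i < num_inputs; ++i) {
      input_names_.push_back(kernel_info.GetInputName(i));        // KernelInfo_GetInputName
      Ort::TypeInfo type_info = kernel_info.GetInputTypeInfo(i);  // KernelInfo_GetInputTypeInfo
      input_shapes_.push_back(type_info.GetTensorTypeAndShapeInfo().GetShape());
    }
    // A runtime wrapper would typically also read a serialized model from a
    // node attribute (KernelInfoGetAttribute_tensor) and build its engine here.
  }

  std::vector<std::string> input_names_;
  std::vector<std::vector<int64_t>> input_shapes_;
};
```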
#### Example of usage in a custom OP
From
`onnxruntime/test/testdata/custom_op_openvino_wrapper_library/openvino_wrapper.h`
```c++
struct CustomOpOpenVINO : Ort::CustomOpBase<CustomOpOpenVINO, KernelOpenVINO> {
  explicit CustomOpOpenVINO(Ort::ConstSessionOptions session_options);

  CustomOpOpenVINO(const CustomOpOpenVINO&) = delete;
  CustomOpOpenVINO& operator=(const CustomOpOpenVINO&) = delete;

  void* CreateKernel(const OrtApi& api, const OrtKernelInfo* info) const;

  constexpr const char* GetName() const noexcept {
    return "OpenVINO_Wrapper";
  }

  constexpr const char* GetExecutionProviderType() const noexcept {
    return "CPUExecutionProvider";
  }

  // IMPORTANT: In order to wrap a generic runtime-specific model, the custom operator
  // must have a non-homogeneous variadic input and output.
  constexpr size_t GetInputTypeCount() const noexcept {
    return 1;
  }

  constexpr size_t GetOutputTypeCount() const noexcept {
    return 1;
  }

  constexpr ONNXTensorElementDataType GetInputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr ONNXTensorElementDataType GetOutputType(size_t /* index */) const noexcept {
    return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetInputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr OrtCustomOpInputOutputCharacteristic GetOutputCharacteristic(size_t /* index */) const noexcept {
    return INPUT_OUTPUT_VARIADIC;
  }

  constexpr bool GetVariadicInputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  constexpr bool GetVariadicOutputHomogeneity() const noexcept {
    return false;  // heterogeneous
  }

  std::vector<std::string> GetSessionConfigKeys() const { return {"device_type"}; }

 private:
  std::unordered_map<std::string, std::string> session_configs_;
};
```
#### How to create a session:
```c++
Ort::Env env;
Ort::SessionOptions session_opts;
Ort::CustomOpConfigs custom_op_configs;
// Create local session config entries for the custom op.
custom_op_configs.AddConfig("OpenVINO_Wrapper", "device_type", "CPU");
// Register custom op library and pass in the custom op configs (optional).
session_opts.RegisterCustomOpsLibrary(lib_name, custom_op_configs);
Ort::Session session(env, model_path.data(), session_opts);
```
### Motivation and Context
Allows creation of simple "wrapper" EPs outside of the main ORT code
base.
### Description
Fix a security warning in the GemmInt8 CUDA kernel.
### Motivation and Context
It is for the issue:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/11158/
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
PartitionIntoStreams was incorrectly using `std::string` instead of
`PathString` for the config file argument when ORT_ENABLE_STREAM was not
defined.
Also incorporates changes from #14291 to fix build and test issues.
### Motivation and Context
Fix build error on Windows due to mismatched type.
### Description
The DML EP provider factory verifies the adapter id is a real GPU (not
some software emulation like WARP which would be quite slow or basic
display driver which lacks D3D compute ability), but the automated tests
sometimes erratically get run on a variety of ADO cloud machines that
lack a GPU or are in a bad state such that Windows fell back to software
emulation. In such cases, you end up reaching the `!IsSoftwareAdapter`
check in the provider factory ([line
132](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/dml_provider_factory.cc#L132))
and seeing in the pipeline logs E_INVALIDARG. Let's return a more
immediately enlightening error code like
ERROR_GRAPHICS_INVALID_DISPLAY_ADAPTER rather than just E_INVALIDARG.
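As a rough illustration (not the actual provider-factory code), a minimal sketch of the error-code change described above; the helper name `ValidateHardwareAdapter` and its plumbing are hypothetical.
```c++
#include <windows.h>

// Hypothetical helper showing the change described above: a software adapter
// (e.g., WARP or a basic display driver) is rejected with an error code that
// is self-explanatory in pipeline logs rather than a bare E_INVALIDARG.
HRESULT ValidateHardwareAdapter(bool is_software_adapter) {
  if (is_software_adapter) {
    return HRESULT_FROM_WIN32(ERROR_GRAPHICS_INVALID_DISPLAY_ADAPTER);
  }
  return S_OK;
}
```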
### Motivation and Context
- *Why is this change required? What problem does it solve?* Pipeline noise.
- *If it fixes an open issue, please link to the issue here.* N/A.
### Description
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/12843
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Fix the error:
"/onnxruntime_src/onnxruntime/core/providers/cuda/test/greedy_search_top_one.cc:34:9: error: typedef ‘using VK = struct std::pair<float, int>’ locally defined but not used [-Werror=unused-local-typedefs]
   34 | using VK = std::pair<float, int32_t>"
### Description
Updates the TensorRT and CANN EPs to use murmurhash3 from core/framework via
the provider bridge.
### Motivation and Context
A failure in a packaging pipeline required us to temporarily duplicate
murmurhash3 code for the TensorRT EP. This PR removes the duplicate
code. This is what is happening:
The original version of this code conditionally included a murmurhash
function for TensorRT only (not cuda) in the provider bridge. The
packaging pipeline selectively [copies binaries from two separate
builds](https://github.com/microsoft/onnxruntime/blob/main/tools/ci_build/github/linux/extract_and_bundle_gpu_package.sh)
(a cuda-only build and a tensorrt build) into a single libs directory.
These are the files within the resulting libs directory:
- onnxruntime.so (copied from tensorrt build, implements murmurhash in
provider bridge host)
- onnxruntime_providers_shared.so (copied from tensorrt build)
- onnxruntime_providers_tensorrt.so (copied from tensorrt build)
- onnxruntime_providers_cuda.so (copied from **cuda-only build**,
expects a provider host w/o murmurhash)
The [squeezenet
example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/c_cxx/squeezenet)
crashed when onnxruntime_providers_cuda.so was loaded because the cuda
ep tried to call functions from a `ProviderHost` object that did not
match what was actually implemented by onnxruntime.so.
I've confirmed that we _can_ prevent the crash by modifying the pipeline
to use the onnxruntime_providers_cuda.so file from the tensorrt build
(instead of the file from the cuda-only build). However, I don't think
that is necessarily correct. Instead, I think we should try to make sure
that the provider bridge exposes the same interface to any EP libraries
that can potentially coexist in the same application (like cuda and
tensorrt). Failing that, there's probably something we can do to
generate a better error message when an EP detects that the Provider
Host implements an unexpected interface.
Note that the above applies to the Windows build in the packaging
pipeline as well. I used the onnxruntime branch
[adrianl/test-trt-cuda-bridge-packaging-pipeline](https://github.com/microsoft/onnxruntime/tree/adrianl/test-trt-cuda-bridge-packaging-pipeline)
along with the onnxruntime-inference-examples branch
[adrianl/squeezenet_ld_debug](https://github.com/microsoft/onnxruntime-inference-examples/tree/adrianl/squeezenet_ld_debug)
to test that copying the onnxruntime_providers_cuda.so file from the
tensorrt build gets rid of the crash.
For the following failures, the folder containing convert_to_onnx should be
specified for the import in the source-code case:
FAILED test_gpt2_to_onnx.py::TestGpt2ConvertToOnnx::test_auto_mixed_precision
FAILED test_gpt2_to_onnx.py::TestGpt2ConvertToOnnx::test_stage1 - TypeError: ...
FAILED test_gpt2_to_onnx.py::TestGpt2ConvertToOnnx::test_stage2 - TypeError: ...
For the failure below, SkipLayerNormalization is fused:
FAILED test_optimizer.py::TestModelOptimization::test_huggingface_openaigpt_fusion
### Description
This change fixes a bug where, when running ORT with NCCL collective
operations on AMD, the NCCL operations cannot be traced.
### Motivation and Context
NCCL operations are missing from roctracer because roctracer uses a
whitelist of which APIs can be traced, and NCCL uses the hipExtLaunchKernel
API, which is not included in that whitelist. This fix adds
hipExtLaunchKernel to the whitelist so that NCCL operations can be traced.
### Description
When there are two QDQ pairs back to back, we want to delete one Q and one
DQ node.
For example:
Q->DQ->Q->DQ =====> Q->DQ
### Motivation and Context
### Description
Add a compilation cache to the Linux CPU Aten pipeline.
The pipeline can now complete in 6 minutes at best.
### Motivation and Context
1. Accelerate the pipeline.
2. It's the shortest pipeline that uses a Docker image. I'll use it to try
moving the storage of the Linux Docker image from ACR to the ADO pipeline cache.
### Description
If a user installs the Python debug libraries on Windows, the ORT Python
project file attempts to use the debug Python lib, which conflicts with a
pragma in pyconfig.h that wants the release lib (due to pybind11
undefining _DEBUG).
Explicitly use the release lib instead of Python::Module so the build
doesn't break.
### Motivation and Context
Fix obtuse build break.
The subgraph index in the TRT engine name keeps increasing when multiple
sessions are created for the same model, which causes the TRT engine not to
be reused and a new engine to be created each time. The issue is that
trt_model_id_generator_ is defined globally.
This PR makes the following changes and improvements:
1. Define the subgraph index as a local variable so it won't be shared
across sessions.
2. Decouple the subgraph index from the hash id generator.
3. Call the hash id generator once at the beginning of GetCapability, since
the hash id is shared between TRT subgraphs and there is no need to call it
for each subgraph.
Fixes https://github.com/microsoft/onnxruntime/issues/14269
### Description
Add a new install_shared_deps.sh
### Motivation and Context
Azcopy, Ninja, Node.js, and CCache are all needed, but their installation
steps are currently copied everywhere.
### Description
This code change allows the QLinearConv operator to sync batches into a
single parallel section. This makes the tasks of all batches available for
threads to execute. It is an alternative to the existing method, which
parallelizes the tasks of individual images separately and forces threads
to wait for an entire image's tasks to complete before continuing.
### Motivation and Context
For int8 convolution models where multiple batches are being utilized,
this patch delivers an inference improvement of up to 41% and 39% for
Mobilenet_edtpu (U8S8) and Resnet50 (U8S8), respectively, on systems with
higher core counts. The patch delivers the highest benefit on systems
with higher thread counts and when utilizing large batch sizes.
| Model | Metric | Batch 2 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 |
| -- | -- | -- | -- | -- | -- | -- | -- |
| resnet50 | % Gain | 22% | 25% | 32% | 36% | 33% | 32% |
Optimize top-1 computation in GreedySearch.
For vocabulary size 50k on A100:
- batch size 1: from 220us to 10.4us.
- batch size 4: from 230us to 11.5us.
For example, when generating 50 tokens, this saves 50 * 0.2ms = 10ms.
### Description
Implement Resize opset 18.
This PR depends on https://github.com/microsoft/onnxruntime/pull/13765.
### Motivation and Context
### Description
Complete some missing parts of test cases for the Python bindings.
### Motivation and Context
Some test cases, like test_training_module_checkpoint and the optimizer
step test, were not completed before because we had no access to the
parameters to check whether they change after the optimizer step or whether
the checkpointed parameters remain the same.
Now that we have access to the vector of parameters through the exposed
get_contiguous_parameters() method, we can complete the tests.
### Description
Bug fixed: quantized models cannot be loaded into ort.InferenceSession
when DedicatedQDQPair is True in the extra_options of QDQQuantizer.
Solution: add a postfix to the node names of dedicated QDQ pairs, similar
to what is already done for their tensor names.
### Motivation and Context
Loading a quantized model fails when `DedicatedQDQPair` is set to `True`
in `extra_options`, raising the error below:
```
Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from mobilenetv2-opset10-quantized-dedicated.onnx failed:This is an invalid model. Error: two nodes with same node name (489_QuantizeLinear).
```
After visualizing the quantized model with Netron, we can see that both of
the dedicated QDQ pairs for tensor 489 have the same node name,
"489_QuantizeLinear". The cause is that in QDQQuantizer there is no unique
postfix for the node names of dedicated QDQ pairs.
<img width="1171" alt="image"
src="https://user-images.githubusercontent.com/12782861/212010296-f8cc05ce-c20e-4189-a692-aaf4bbac3a29.png">
Therefore, I add a postfix to the node names of QDQ pairs, similar to what
is done for tensor names. After this modification, the quantized model can
be loaded successfully and the dedicated QDQ pairs have different node names. 👌🏻
<img width="1037" alt="image"
src="https://user-images.githubusercontent.com/12782861/212010594-78eba39d-eab6-4d77-9ecd-b55f5303bcf4.png">
### Description
Add bindings for Android and iOS.
### Motivation and Context
Enable mobile apps to link against the ort-extensions library and register the custom ops with ORT.
### Description
If the number of trees is >= 100 and the batch size is >= 2000,
parallelization by trees becomes slower than parallelization by rows.
However, applying parallelization by trees over smaller chunks of data is
still better than parallelization by rows. The following script was used to
measure the performance with different thresholds:
[plot_gexternal_lightgbm_reg_per.zip](https://github.com/microsoft/onnxruntime/files/10149092/plot_gexternal_lightgbm_reg_per.zip)
The graphs discussed below were produced by this script.
* //N means parallelization by rows
* //T means parallelization by trees
* //T-128 means parallelization by trees every batch of 128 rows.
* //T-1024 means parallelization by trees every batch of 1024 rows.
The graphs show that parallelization by trees is better than
parallelization by rows on small batches only. It is also better to split
the input tensor into chunks of 128 rows and parallelize by trees on every
chunk. The proposed change implements that optimization.
It applies the same idea even when there is only one thread, and it makes
sure only one thread is used when the user requests only one.

```python
import pandas
import matplotlib.pyplot as plt

filenames = [
    ("//N", "plot_gexternal_lightgbm_reg_per_N.csv"),
    ("//T", "plot_gexternal_lightgbm_reg_per_T.csv"),
    ("//T-128", "plot_gexternal_lightgbm_reg_per_128.csv"),
    ("//T-1024", "plot_gexternal_lightgbm_reg_per_1024.csv"),
]

# Load each benchmark CSV and prefix its batch columns with the run name.
dfs = []
for name, filename in filenames:
    df = pandas.read_csv(filename)
    for c in df.columns:
        if "batch" in c:
            df[f"-{name}-{c}"] = df[c]
    dfs.append(df)

# Merge the renamed columns of every run into a single dataframe.
df = dfs[0][["N"]].copy()
for _df in dfs:
    for c in _df.columns:
        if c[0] == "-":
            df[c] = _df[c].copy()

# One subplot per tree count, comparing the parallelization strategies.
fig, ax = plt.subplots(1, 3, figsize=(14, 6))
Ts = [50, 500, 2000]
ga = df.set_index("N")
for i, nt in enumerate(Ts):
    cs = [c for c in ga.columns if c.endswith(f"-{nt}")]
    ga[cs].plot(ax=ax[i], title=f"Trees={nt}", logy=True, logx=True)
```
The monothread implementation also gains performance by looping over the
data in the inner loop.
### Motivation and Context
Performance.
Signed-off-by: xadupre <xadupre@microsoft.com>
### Description
Use pytest-xdist to distribute tests across multiple CPUs to speed up
test execution.
Use pytest-rerunfailures to rerun failed tests in case of a pytest-xdist
crash.
`pytest -n 16` can reduce pytest time from 80 minutes to 20 minutes.
### Motivation and Context
Currently, the kernel explorer pytest run in the ROCm CI takes nearly 1 hour
20 minutes. It will take even longer as we add more TunableOps in the future.
### Description
Change tolerance for tests involving MNIST and cuda to try and fix flaky
CI tests.
Errors from CI:
ModelTests/ModelTest.Run/cuda__models_zoo_opset8_MNIST_model
expected 4.0755 (40826a83), got 4.06948 (40823938), diff: 0.00601721,
tol=0.0050755 idx=4. 2 of 10 differ
ModelTests/ModelTest.Run/cuda__models_zoo_opset7_MNIST_model
expected 7.89851 (40fcc09e), got 7.88879 (40fc70f8), diff: 0.00972271,
tol=0.00889851 idx=4. 4 of 10 differ
ModelTests/ModelTest.Run/cuda__models_zoo_opset12_MNIST12_mnist12
expected -5.50068 (c0b00595), got -5.49023 (c0afaff0), diff: 0.0104547,
tol=0.00650068 idx=1. 1 of 10 differ
Use an rtol of 1e-2 if CUDA is enabled. Use the same for OpenVINO for
simplicity.
```
>>> expected = np.array([4.0755, 7.89851, -5.50068], dtype=np.float32)
>>> actual = np.array([4.06948, 7.88879, -5.49023], dtype=np.float32)
>>> np.isclose(expected, actual, rtol=1e-2, atol=1e-3)
array([ True, True, True])
```
Whitespace changes are from clang-format.
### Motivation and Context
CI fails semi-frequently causing unnecessary re-runs.
### Description
Right now, prepacking code is not compiled when training is enabled. Our
partners want a single build of ORT which can do both optimized inference
and training on device. This PR enables prepacking code in a training build
and controls whether it is enabled using the already existing session
option kOrtSessionOptionsConfigDisablePrepacking.
For inference scenarios, prepacking will be turned on by default, and this
behavior remains the same after this PR.
For training scenarios, prepacking will be disabled by default, and if the
user explicitly enables it then an error will be thrown.
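For reference, a minimal sketch (not part of this PR) of how an application can control prepacking through that session option; the literal config key string below is an assumption, so prefer the `kOrtSessionOptionsConfigDisablePrepacking` constant from `onnxruntime_session_options_config_keys.h` in real code.
```c++
#include "onnxruntime_cxx_api.h"

int main() {
  Ort::Env env;
  Ort::SessionOptions session_options;

  // Assumed string value of kOrtSessionOptionsConfigDisablePrepacking; set it
  // to "1" to turn prepacking off (it is on by default in inference builds).
  session_options.AddConfigEntry("session.disable_prepacking", "1");

  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}
```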
### Motivation and Context
Enable both optimized inference and on-device training in a single build.
For on-device training, use the flag --enable_training_apis.
### Description
Skip tests for opset18 models that we haven't implemented kernels for
yet.
Slice was checked in today so those failures should go away.
- Resize: #13890 (all Resize failures are fixed by this PR, as confirmed in the output
[here](https://dev.azure.com/aiinfra/530acbc4-21bc-487d-8cd8-348ff451d2ff/_apis/build/builds/264725/logs/729))
- Col2Im: #12311
- ScatterND and ScatterElements: #14224
- Pad (should also fix CenterCropPad failures): #14219
- Bitwise ops: #14197
- Optional: unknown whether we intend to support this in 1.14
Not sure about SoftPlus as that is failing due to `Could not find an
implementation for Exp(1)`. ORT supports Exp from opset 6 and on, and it
seems incorrect for the test model created for opset 18 to be using a
version of Exp that is so old. Would have expected it to use the latest
- Exp(13). @liqunfu is this something that requires a fix to the ONNX
model?
### Motivation and Context
Fix pipeline
### Description
1. Add an optional input to pass in a seed.
2. Add two UTs: one for top_p=0.5 and another for top_p=0.01 (which creates
a greedy-search result, in convert_generation.py).
3. Fix a bug in the CPU kernel.
### Motivation and Context
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
This PR registers ScatterND-16 with the DML EP.
- A CPU fallback is added if the reduction attribute is in use, as this is
not yet supported by DML.
Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>