### Description
* Build cuda nhwc ops by default.
* Deprecate `--enable_cuda_nhwc_ops` in build.py and add
`--disable_cuda_nhwc_ops` option
Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops
will be disabled automatically.
### Motivation and Context
In general, NHWC is faster than NCHW for convolution in Nvidia GPUs with
Tensor Cores, and this could improve performance for vision models.
This is the first step to prefer NHWC for CUDA in 1.21 release. Next
step is to do some tests on popular vision models. If it help in most
models and devices, set `prefer_nhwc=1` as default cuda provider option.
### Description
Based on https://github.com/microsoft/onnxruntime/pull/9700, and extend
it to ArgMin as well.
This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.
Key changes include:
#### Enhancements to `ArgMax` and `ArgMin` Operators:
* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215)
[[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659)
[[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827)
[[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939)
[[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180)
* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle
fallback to CPU when the `select_last_index` attribute is set to 1, as
CUDA does not support this attribute.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674)
#### Macro and Kernel Registration Improvements:
* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.
[[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29)
[[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46)
* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.
#### Safety Checks:
* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.
[[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99)
[[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117)
#### Testing Enhancements:
* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.
[[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360)
[[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699)
### Motivation and Context
Improve CUDA kernel coverage for stable diffusion model and hence
improve its performance on CUDA
### Description
* Fix `NumericLimits<float>` that used infinity as max, which is not
consistent with `std::numeric_limits<float>::max()`
In Windows, (float)(1e+300) is used for INFINITY, which causes compiler
error in Visual Studio 2022 v17.12 Preview 5.
* Rename `NumericLimits<T>::Min` to Lowest to be consistent with
std::numeric_limits
* Fix topk implementation: use `NumericLimits<CudaT>` instead of
`NumericLimits<T>` in kernel. That could avoid defining a confusing
defintion of `NumericLimits<MLFloat16>` that returns half instead of
MLFloat16.
* Use CUDART_MAX_NORMAL_FP16 if possible. It sets bits value directly,
which is faster than converting float to half.
Note that NumericLimits does not support __nv_bfloat16 and _nv_fp8_e4m3
and __nv_fp8_e5m2 right now.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22728
This PR fixes a bug that occurs when searching for compatible `MLTensor`
in the cache. We were missing checking the number of dimensions in the
shape. This would mean that a cached buffer of shape `[1]` could match
for `[1, 1, 256, 256]`.
This PR also adds better handling when attempting to force an `MLTensor`
to a different shape.
In current implementation, all the staging buffers for weights uploading
are destroyed after first batch of kernel execution. It requires a lot
of memory as all the staging buffers couldn't be reused. It also hurts
the startup time (weights uploading only happens in session creation),
as weights uploading is delayed to a very late time.
This PR uses a very aggressive way to submit queue and destroy staging
buffers, so that the related GPU memory could be reused as much as
possible, though the real situation depends on the WebGPU and driver
implementation. The aggressive queue submission also moves GPU
operations to a very early time, which helps the startup time.
Some buffer uploading benchmarks are composed to compare multiple
solutions, regarding to the memory and time consumption. Benchmarks can
be found at
https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html,
while detailed test data can be found at
https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit.
I also tested phi3.5 on 2 machines, first inference time improved from
5141ms to 3579ms and from 4327ms to 2947ms separately.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Many GRU tests were being skipped due to an error in
MLOperatorAuthorImpl.cpp. The issue was caused by activation function
names not being capitalized (e.g., ‘sigmoid’), while The AttrValue was
using mixed cases (e.g., ‘Sigmoid’, ‘LeakyRelu’), which resulted in an
‘unsupported activation function’ error in
DMLOperatorRecurrentNeuralNetwork.cpp.
This PR fixes the issue by making the DML EP activation function name
case-insensitive, and capitalizing the activation function names in the
tests.
ref PR: https://github.com/microsoft/onnxruntime/pull/15914
ref bug: https://dev.azure.com/microsoft/OS/_workitems/edit/44571772
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: nums11 <numsmt2@gmail.com>
#22031
For reduce related ops, we should increase workgroupSize to improve
parallelism if only one workgroup is dispatched.
The total ReduceMean time becomes 8.98 ms from 77.79 ms on my iGPUs.
Replace use of `gsl::narrow` with `narrow` to build for WebNN @snnn
### Description
Building for WebNN with exceptions disabled cannot use `gsl::narrow`.
Replace with `narrow`
### Motivation and Context
Address issue #22712
Revert to err logging instead of LOGS_DEFAULT macro due to issue seen
during testing. "onnxruntime::logging::LoggingManager::DefaultLogger
Attempt to use DefaultLogger but none has been registered."
### Description
Revert part of PR suggestion to prevent crash for scenario seen in
#22699. Previously we had tested w/o this macro
### Motivation and Context
Previous PR #22699 it was suggested to use LOGS_DEFAULT() but that does
not work during early init. Safer to use std::cerr instead like the
original PR had it.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Refactor the cmake code that is related to delay loading. Provide a
cmake option to control if delay loading should be enabled or not.
Disabling the option when python is enabled, due to a known issue.
### Motivation and Context
ONNX Runtime's python package depends on DirectML.dll, but supposedly
the DLL should be delay loaded.
This PR only refactor the code. It doesn't change the behavior.
BUG #22031
The total Gemm time in demucs model becomes 181.14 ms from over 1000 ms
on my iGPUs.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change enhances the Node.js binding with the following features:
- support WebGPU EP
- lazy initialization of `OrtEnv`
- being able to initialize ORT with default log level setting from
`ort.env.logLevel`.
- session options:
- `enableProfiling` and `profileFilePrefix`: support profiling.
- `externalData`: explicit external data (optional in Node.js binding)
- `optimizedModelFilePath`: allow dumping optimized model for diagnosis
purpose
- `preferredOutputLocation`: support IO binding.
======================================================
`Tensor.download()` is not implemented in this PR.
Build pipeline update is not included in this PR.
WebNN doesn't provide dedicate op for SimplifiedLayerNormalization, use
a couple of WebNN ops to emulate it in WebNN EP.
X --> Pow --> ReduceMean --> Add --> Sqrt --> Div -> Mul
- Pass inputs to WebNN directly, WebNN will handle the broadcasting
- If `zero_point` is not provided, make a WebNN Constant with 0 values
and same shape as `scale` input
### Description
<!-- Describe your changes. -->
Set SDL's git submodule to false.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* Previous job's SDL logs:It has 'git submodule sync' command, which
means 'git submodule sync synchronizes all submodules while git
submodule sync'
* After set sdl git submodules to false, the logs don't have 'git
submodule sync' command.
Adjust max chunk size to fix error limit check from DX12 for large
resources that are CPU accessible.
### Description
Current agility SDK restricts CPU visible buffers to 0xFFFF0000 bytes or
slightly smaller than 4GiB. Verified restriction is still in latest
Agility SDK 1.614.1.
### Motivation and Context
Allocation of Resources 4GiB or larger fail in DX12 verification layer.
---------
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
Add I/O binding example using onnx data type in python API summary. The
API is available since 1.20 release.
### Motivation and Context
Follow up of https://github.com/microsoft/onnxruntime/pull/22306 to add
some documentation.
### Description
Fixes:
(1) cpu kernel: applying scale before bias and mask like other MHA ops
(2) cpu kernel: correct offset during appending past to present.
(3) cuda kernel: apply mask if provided; fix output_qk offset.
Add DMMHA unit tests
### Description
Make ETW provider registration non-fatal and not throw an exception
Needs to work under build with exceptions enabled & --disable_exceptions
### Motivation and Context
ORT should not crash
Addresses #22475. Private tested by filer of that issue
Use suggest-changes@v2
(https://github.com/parkerbxyz/suggest-changes/issues/36#issuecomment-2447605058)
to post suggested changes as comments instead of requested changes to
streamline the review process.
- Also updated the script to `set +e` to ignore exit code only for the
linter run. So that if there is errors in dependency installation we can
still get signals.
### Description
The local build of the native library was being included by almost every
project, but is only needed to run tests. Due to the multiple inclusions
attempting to use a pre-built package was clashing with any local builds
that were available.
Create a helper file to include either a local built of a pre-built
package and include that in the two test projects.
Cleanup various miscellaous things.
### Motivation and Context
Create setup to simplify running on-device tests with the nuget
packages.
### Description
<!-- Describe your changes. -->
BUG #22031
In the demucs model, there are lots of MatMul ops with shapes like
below:
`input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32,
output[0]: [3448,1,1536] | float32`
We can see that for this kind of shape, the batch size is a big value,
but M = 1. Our current algorithm is based on [M, N] to partition tiles,
which is not efficient for such kind of shapes. This PR reshapes the
inputs to improve the matmul performance.
Before: [3448,1,512] x [512,1536] = [3448,1,1536]
After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536] , then the output
can be reshaped to [3448, 1, 1536]
The overall MatMul time in demucs model becomes 1778.45 ms from 4418.17
ms on my iGPUs.
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
- cast
- argmax
- gelu
- cast
- LayerNorm
- GroupNorm
- InstanceNorm
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Now, we need to build cuda and dml in one package.
But CUDA EP and DML EP can't run in one process.
It will throw the exception of `the GPU device instance has been
suspended`
So the issue is CUDA EP and DML EP coexist in compile time but can't
exist in run time.
This PR is to split cuda ep test and dml ep test in all unit tests.
The solution is to use 2 environment variable, NO_CUDA_TEST and
NO_DML_TEST, in CI.
For example, if NO_CUDA_TEST is set, the DefaultCudaExecutionProvider
will be nullptr, and the test will not run with CUDA EP.
In debugging, the CUDAExecutionProvider will not be called.
I think, as long as cuda functions, like cudaSetDevice, are not called,
DML EP tests can pass.
Disabled java test of testDIrectML because it doesn't work now even
without CUDA EP.
Since opset 18, 'scales' and 'sizes' constant inputs can be 2D tensors,
transpose for 2D tensors are not supported at current implementation,
fix it by only allowing 4D constant inputs.
### Description
The CastNonStringTester test in CastOpTest was failing due to bitwise
mismatches when casting other types to bool. This was caused by bool
being represented as uint8 in DML. Added a clipping step in
DmlOperatorCast to ensure correct bitwise matching after casting to bool
ref: https://dev.azure.com/microsoft/OS/_workitems/edit/44572678
### Motivation and Context
### Description
Consolidate the gpu data transfer in CUDA, ROCm and Migraphx EP.
(1) Remove some redundant stream synchronize on default stream according
to spec of cudaMemcpy
(2) consolidate CUDA, ROCm and MigrphaX to try use same logic.
### Motivation
This is a follow up on reviewing
https://github.com/microsoft/onnxruntime/pull/22589.
### Context
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior
##### cudaMemcpy()
* For transfers from pageable host memory to device memory, a stream
sync is performed before the copy is initiated. The function will return
once the pageable buffer has been copied to the staging memory for DMA
transfer to device memory, **but the DMA to final destination may not
have completed**.
* For transfers from pinned host memory to device memory, the function
is synchronous with respect to the host.
* For transfers from device to either pageable or pinned host memory,
the function returns only once the copy has completed.
* For transfers from device memory to device memory, **no host-side
synchronization is performed**.
* For transfers from any host memory to any host memory, the function is
fully synchronous with respect to the host.
#### cudaMemcpyAsync
* For transfers between device memory and pageable host memory, the
function might be synchronous with respect to host.
* For transfers from any host memory to any host memory, the function is
fully synchronous with respect to the host.
* If pageable memory must first be staged to pinned memory, the driver
may synchronize with the stream and stage the copy into pinned memory.
* For all other transfers, the function should be fully asynchronous.
https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html
##### hipMemcpyAsync()
If host or dest are not pinned, the memory copy will be performed
synchronously. For best performance, use hipHostMalloc to allocate host
memory that is transferred asynchronously.
on HCC hipMemcpyAsync does not support overlapped H2D and D2H copies.
For hipMemcpy, the copy is always performed by the device associated
with the specified stream.
##### hipMemcpy()
For hipMemcpy, the copy is always performed by the current device (set
by hipSetDevice).
https://github.com/ROCm/ROCm/blob/roc-5.7.x/tools/autotag/templates/rocm_changes/5.6.1.md
ROCm 5.6.1 release note: hipMemcpy device-to-device (intra device) is
now asynchronous with respect to the host
### Description
We'll build CUDA EP and DML EP in one package.
As a result, USE_DML and USE_CUDA will coexist.
We can't use predefined macros to check EP any more
### Motivation and Context
Other changes are in test code, so I make this change of core runtime
into one PR.
### Description
Distinguish between DML and the generic 'GPU' term. This is needed for
packaging DML EP in the same ORT GPU pkg.
### Motivation and Context
Customer requirement.
### Description
This change adds a cache of `MLContext`s keyed by their options to the
`WebNNBackend`. This makes is so that multiple `InferenceSession`s
create with the same options will share the same context.
### Motivation and Context
Since `MLTensor`s are tied `MLContext`s, developer can't easily share
tensors between `InferenceSession` (outside of manually an `MLContext`
and specifying the `context` options). This leads strange behaviors such
as,
```js
const sessionsA = ort.InferenceSession.create(urlA, {
executionProviders: ["webnn"],
preferredOutputLocation: "ml-buffer",
});
const sessionsB = ort.InferenceSession.create(urlB, {
executionProviders: ["webnn"],
});
const temp = await sessionA.run({/* arguments */});
const result = await sessionB.run({"input":temp["output"]}); // ERROR: Failed to execute 'dispatch' on 'MLContext': Invalid inputs: The context of MLGraph doesn't match the context of the MLTensor with name "input".
```
We encountered this behavior when updating the transformers.js version
in the developer preview demos. microsoft/webnn-developer-preview#46