### Description
<!-- Describe your changes. -->
BUG #22031
In the demucs model, there are lots of MatMul ops with shapes like
below:
`input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32,
output[0]: [3448,1,1536] | float32`
We can see that for this kind of shape, the batch size is a big value,
but M = 1. Our current algorithm is based on [M, N] to partition tiles,
which is not efficient for such kind of shapes. This PR reshapes the
inputs to improve the matmul performance.
Before: [3448,1,512] x [512,1536] = [3448,1,1536]
After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536] , then the output
can be reshaped to [3448, 1, 1536]
The overall MatMul time in demucs model becomes 1778.45 ms from 4418.17
ms on my iGPUs.
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
- cast
- argmax
- gelu
- cast
- LayerNorm
- GroupNorm
- InstanceNorm
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Now, we need to build cuda and dml in one package.
But CUDA EP and DML EP can't run in one process.
It will throw the exception of `the GPU device instance has been
suspended`
So the issue is CUDA EP and DML EP coexist in compile time but can't
exist in run time.
This PR is to split cuda ep test and dml ep test in all unit tests.
The solution is to use 2 environment variable, NO_CUDA_TEST and
NO_DML_TEST, in CI.
For example, if NO_CUDA_TEST is set, the DefaultCudaExecutionProvider
will be nullptr, and the test will not run with CUDA EP.
In debugging, the CUDAExecutionProvider will not be called.
I think, as long as cuda functions, like cudaSetDevice, are not called,
DML EP tests can pass.
Disabled java test of testDIrectML because it doesn't work now even
without CUDA EP.
Since opset 18, 'scales' and 'sizes' constant inputs can be 2D tensors,
transpose for 2D tensors are not supported at current implementation,
fix it by only allowing 4D constant inputs.
### Description
The CastNonStringTester test in CastOpTest was failing due to bitwise
mismatches when casting other types to bool. This was caused by bool
being represented as uint8 in DML. Added a clipping step in
DmlOperatorCast to ensure correct bitwise matching after casting to bool
ref: https://dev.azure.com/microsoft/OS/_workitems/edit/44572678
### Motivation and Context
### Description
Consolidate the gpu data transfer in CUDA, ROCm and Migraphx EP.
(1) Remove some redundant stream synchronize on default stream according
to spec of cudaMemcpy
(2) consolidate CUDA, ROCm and MigrphaX to try use same logic.
### Motivation
This is a follow up on reviewing
https://github.com/microsoft/onnxruntime/pull/22589.
### Context
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior
##### cudaMemcpy()
* For transfers from pageable host memory to device memory, a stream
sync is performed before the copy is initiated. The function will return
once the pageable buffer has been copied to the staging memory for DMA
transfer to device memory, **but the DMA to final destination may not
have completed**.
* For transfers from pinned host memory to device memory, the function
is synchronous with respect to the host.
* For transfers from device to either pageable or pinned host memory,
the function returns only once the copy has completed.
* For transfers from device memory to device memory, **no host-side
synchronization is performed**.
* For transfers from any host memory to any host memory, the function is
fully synchronous with respect to the host.
#### cudaMemcpyAsync
* For transfers between device memory and pageable host memory, the
function might be synchronous with respect to host.
* For transfers from any host memory to any host memory, the function is
fully synchronous with respect to the host.
* If pageable memory must first be staged to pinned memory, the driver
may synchronize with the stream and stage the copy into pinned memory.
* For all other transfers, the function should be fully asynchronous.
https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html
##### hipMemcpyAsync()
If host or dest are not pinned, the memory copy will be performed
synchronously. For best performance, use hipHostMalloc to allocate host
memory that is transferred asynchronously.
on HCC hipMemcpyAsync does not support overlapped H2D and D2H copies.
For hipMemcpy, the copy is always performed by the device associated
with the specified stream.
##### hipMemcpy()
For hipMemcpy, the copy is always performed by the current device (set
by hipSetDevice).
https://github.com/ROCm/ROCm/blob/roc-5.7.x/tools/autotag/templates/rocm_changes/5.6.1.md
ROCm 5.6.1 release note: hipMemcpy device-to-device (intra device) is
now asynchronous with respect to the host
### Description
We'll build CUDA EP and DML EP in one package.
As a result, USE_DML and USE_CUDA will coexist.
We can't use predefined macros to check EP any more
### Motivation and Context
Other changes are in test code, so I make this change of core runtime
into one PR.
### Description
Distinguish between DML and the generic 'GPU' term. This is needed for
packaging DML EP in the same ORT GPU pkg.
### Motivation and Context
Customer requirement.
### Description
This change adds a cache of `MLContext`s keyed by their options to the
`WebNNBackend`. This makes is so that multiple `InferenceSession`s
create with the same options will share the same context.
### Motivation and Context
Since `MLTensor`s are tied `MLContext`s, developer can't easily share
tensors between `InferenceSession` (outside of manually an `MLContext`
and specifying the `context` options). This leads strange behaviors such
as,
```js
const sessionsA = ort.InferenceSession.create(urlA, {
executionProviders: ["webnn"],
preferredOutputLocation: "ml-buffer",
});
const sessionsB = ort.InferenceSession.create(urlB, {
executionProviders: ["webnn"],
});
const temp = await sessionA.run({/* arguments */});
const result = await sessionB.run({"input":temp["output"]}); // ERROR: Failed to execute 'dispatch' on 'MLContext': Invalid inputs: The context of MLGraph doesn't match the context of the MLTensor with name "input".
```
We encountered this behavior when updating the transformers.js version
in the developer preview demos. microsoft/webnn-developer-preview#46
### Description
[DML EP] Update DML to 1.15.4
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We want the customer to use the latest DirectML.
BUG #22031
Optimize below two situations:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid transpose if not necessary.
The overall time of demucs model becomes 106.36 ms from 154.60 ms on my
dGPUs with this PR and PR #22577
### Description
Issue can happen with multiple sessions and when ETW captureState /
rundown is triggered.
Resolves use after free issue.
Tested with local unit test creating/destroying multiple sessions while
continually enabling & disabling ETW. This currently requires Admin
prompt so not checking in
### Motivation and Context
ORT should not crash
### Description
<!-- Describe your changes. -->
Allow some classes to be default constructed.
The effect is the same as constructing it with nullptr.
Make default ctor visible from the base classes.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Multiple customers complained that when storing Ort::Value
in an instance of std::vector, vector can not be resized.
We enable that with allowing it default constructed.
### Description
<!-- Describe your changes. -->
* Leverage template `common-variables.yml` and reduce usage of hardcoded
trt_version
8391b24447/tools/ci_build/github/azure-pipelines/templates/common-variables.yml (L2-L7)
* Among all CI yamls, this PR reduces usage of hardcoding trt_version
from 40 to 6, by importing trt_version from `common-variables.yml`
* Apply TRT 10.5 and re-enable control flow op test
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- Reduce usage of hardcoding trt_version among all CI ymls
### Next refactor PR
will work on reducing usage of hardcoding trt_version among
`.dockerfile`, `.bat` and remaining 2 yml files
(download_win_gpu_library.yml & set-winenv.yml, which are step-template
yaml that can't import variables)
### Description
1. Add pipauth to more ADO pipeline. (We will use a private ADO feed to
fetch python packages in these pipeline, to improve security)
2. Enforce codeSignValidation(CSV).
### Motivation and Context
Fulfill some internal compliance requirements.
### Description
(1) tokenizer.max_model_input_sizes was deprecated. Use
tokenizer.model_max_length to replace it.
(2) onnx opset updated to 16 instead of 11/12 for models.
(3) Update a few comments related to torch installation.
(4) Test gpu instead of cpu in dev_benchmark.cmd.
### Motivation and Context
Update bert benchmark script so that it can run with latest huggingface
transformers package.
### Description
<!-- Describe your changes. -->
hipblasLt library is released with rocm6.x, and current onnxruntime's
code need some modifications to match new hipblasLt API.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add support for softmaxcrossentropy loss. This is already enabled on our
ROCm Fork of the MIGraphX EP
### Motivation and Context
Adds support for the SoftmaxCrossEntropyLoss operator and removes the
filtering of inputs here.
### Description
<!-- Describe your changes. -->
Test case failing sometimes and passing other times.
### Motivation and Context
Prevent unnecessary CI build failures requiring manually rerunning tests
### Description
Upgrade python from 3.9 to 3.10 in ROCm and MigraphX docker files and CI
pipelines. Upgrade ROCm version to 6.2.3 in most places except ROCm CI,
see comment below.
Some improvements/upgrades on ROCm/Migraphx docker or pipeline:
* rocm 6.0/6.1.3 => 6.2.3
* python 3.9 => 3.10
* Ubuntu 20.04 => 22.04
* Also upgrade ml_dtypes, numpy and scipy packages.
* Fix message "ROCm version from ..." with correct file path in
CMakeList.txt
* Exclude some NHWC tests since ROCm EP lacks support for NHWC
convolution.
#### ROCm CI Pipeline:
ROCm 6.1.3 is kept in the pipeline for now.
- Failed after upgrading to ROCm 6.2.3: `HIPBLAS_STATUS_INVALID_VALUE ;
GPU=0 ; hostname=76123b390aed ;
file=/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_execution_provider.cc
; line=170 ; expr=hipblasSetStream(hipblas_handle_, stream);` . It need
further investigation.
- cupy issues:
(1) It currently supports numpy < 1.27, might not work with numpy 2.x.
So we locked numpy==1.26.4 for now.
(2) cupy support of ROCm 6.2 is still in progress:
https://github.com/cupy/cupy/issues/8606.
Note that miniconda issues: its libstdc++.so.6 and libgcc_s.so.1 might
have conflict with the system ones. So we created links to use the
system ones.
#### MigraphX CI pipeline
MigraphX CI does not use cupy, and we are able to use ROCm 6.2.3 and
numpy 2.x in the pipeline.
#### Other attempts
Other things that I've tried which might help in the future:
Attempt to use a single docker file for both ROCm and Migraphx:
https://github.com/microsoft/onnxruntime/pull/22478
Upgrade to ubuntu 24.04 and python 3.12, and use venv like
[this](27903e7ff1/tools/ci_build/github/linux/docker/rocm-ci-pipeline-env.Dockerfile).
### Motivation and Context
In 1.20 release, ROCm nuget packaging pipeline will use 6.2:
https://github.com/microsoft/onnxruntime/pull/22461.
This upgrades rocm to 6.2.3 in CI pipelines to be consistent.
### Description
Replace `hipMemcpy` with `hipMemcpyWithStream`
### Motivation and Context
`hipMemcpy` uses default stream, which may be out of synchronization
with the current stream when ORT_ENABLE_STREAM is defined.
### Description
<!-- Describe your changes. -->
**Changes applied to maven related signing:**
* Windows sha256 file encoded by utf8(no BOM)
* powershell script task used latest version, previous 5.1 version only
supports utf8 with BOM.
* Windows sha256 file content in format 'sha256value
*filename.extension'.
* Linux sha256 file content in format 'sha256value *filename.extension'.
**More information about powershell encoding:**
Windows powershell encoding reference: [about_Character_Encoding -
PowerShell | Microsoft
Learn](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.4)
- for version 5.1, it only has 'UTF8 Uses UTF-8 (with BOM).'
- for version v7.1 and higher, it has:
utf8: Encodes in UTF-8 format (no BOM).
utf8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
utf8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)
### Description
part of https://github.com/microsoft/onnxruntime/issues/21448
This change is intend to save CPU memory during model load for
inference.
Added session option save_prepacked_constant_initializers, with
save_prepacked_constant_initializers turn on:
1. optimize model with inference session, prepacked external initializer
will be saved into data file.
2. load optimized model and external data file with prepacked
initializer, no prepack is needed
3. run inference with optimized model and data file
Tested with model Phi-3-mini-instruct-onnx,
with ORT 1.12.0:

with this change:

Peak memory usage dropped from **5.438 GB to 2.726GB**.
This change takes advantage of ORT loads external initializer with mmap
on CPU. Prepack will use extra memory on heap, omit prepack process can
save this part of memory (roughly same size as external initializers).
next step:
Change all the kernels on CPU with PrePack method implemented and test
properly. Will do in next PR.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR add dependency to the global-agent package, and use it in JSEP
scripts that download files from network (i.e. `js/scripts/utils.ts` and
`js/web/script/pull-prebuilt-wasm-artifacts.ts`), so that user can make
these script use network proxy by setting environment variable
GLOBAL_AGENT_HTTPS_PROXY.
### Description
<!-- Describe your changes. -->
Should make the binary size report more stable as changes < 4K can occur
when a padding boundary is crossed.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->