### Description
It always has been out of memory in training CUDA 12.2 packaging
pipeline
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1308&_a=summary
since the PR #19910
I tried other CPU agents for example, D64as_v5(256G memory) and
D32as_v4(128G memory and 256 G SSD temp storage), which are still out of
memory like the below image

But it works on T4, though T4 only has 4 vCPUs, 28G memory and 180G temp
storage, and it takes much more time.
### Motivation and Context
Restore CUDA 12.2 training packaging pipeline first.
More time is needed to investigate the root cause
### Other Clues.
These 2 compilation steps take nearly 6 minutes with Cuda 12.2 on T4
And it runs out of memory on CPU machine. @ajindal1
cuda12.2 on T4
```
2024-03-14T05:39:08.7726865Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-03-14T05:45:01.3223393Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o
2024-03-14T05:46:07.9218003Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim96_fp16_sm80.cu.o
2024-03-14T05:52:59.2387051Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu.o
```
But they could be finished in about one minute with Cuda 11.8 on CPU
```
cuda11.8 on CPU
2024-04-09T11:34:35.0849836Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-04-09T11:35:53.6648154Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o
cuda11.8 on GPU
024-03-13T12:16:33.4102477Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-03-13T12:19:58.8268272Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o
```
### Description
Update QNN python packages to use QNN SDK version 2.19.2.
### Motivation and Context
Our CI builds already use QNN SDK version 2.19.2. We should make sure
the ort-nightly-qnn python packages are also built with the same QNN SDK
version.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Except [Python-CUDA-Packaging
pipeline](https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1299&_a=summary),
all windows cuda packaging jobs have been running well now.
After comparison, enable_lto isn't added in the pipeline, which might be
one root cause of the random hang.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and
D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX
headers through the DirectX-Headers repo.
Note that this includes an update to cgamanifest.json for onnx-tensorrt
which is triggered during re-generation due to a prior changes to
deps.txt.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Bump spotless and the Gradle wrapper to 6.25.0 and 8.6 respectively to
allow compiling ORT on Java 21. The build still targets Java 8.
I'm not sure if there will be CI changes necessary to use this PR,
specifically for the Gradle version as I don't know if that is cached
somewhere earlier in the CI build process.
The new Gradle version adds a warning that using `--source` and
`--target` to select the Java language version is obsolete which is
annoying, we can fix it if we decide to only allow building on newer
versions of Java, while still supporting running on Java 8.
### Motivation and Context
Java 21 is the latest LTS release of Java and ORT should be able to
build on it.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
reactor win-ci.yml to solve the random hang issue in more GPU workflows,
move nugget-zip packages and python cuda12 packages building to CPU
machine.
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
Address build issues and source code discrepancies.
Fix cuda_test_provider gtest argument stack corruption.
### Motivation and Context
`OpTester` class that is widely used for kernel testing is not
suitable for testing internal classes for EPs that are built as shared
objects.
Currently, CUDA EP tests run only on Linux.
We want to enable testing and developments on Windows,
and create a usable pattern for testing of other EPs internals.
Alternatives considered:
Abstracting EP unit tests into separate test executable such as
`onnxruntime_test_all`.
This alternative was rejected as it would create a lot more changes in
the established patterns,
and potentially interfere with CUDA functionality with more complex
source code maintanence.
### Description
In #20073, I use pin onnx version to unblock the whole PR CI.
In fact, we could use the onnx that installed by building source code,
that the onnx version is controlled by deps.txt.
For some history reason, DML stage installed onnx from pypi. Now, the
onnx can be installed as other stages.
add an option to skip installing onnx in win-ci-prebuild-step
### Description
Make Windows GPU Packaging stage in Python Packaging pipeline run on CPU
machine as well
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Test Link
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=430961&view=results
### Description
Update Web CI to use data dir under Agent.TempDirectory
This change fixes the random failure caused by unstable access to karma
temp directory (which is under AppData\Local\Temp) on CI pipeline
### Description
Add NPU to list of device supported.
Added changes for Support to OV 2024.0
Nuget packages removes packaging of OpenVINO DLL
Bug Fixes with Python API
Reverted Dockerfiles not being maintained.
### Motivation and Context
NPU Device has been introduced by Intel in latest client systems
OpenVINO 2024.0 release is out.
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Ubuntu <ubuntu@ubuntu-118727.iind.intel.com>
Co-authored-by: hmamidix <hemax.sowjanya.mamidi@intel.com>
Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
### Description
1. Move building on CPU machine.
2. Optimize the pipeline
3. Since there isn't official ONNX package for python 12, the python 12
test stage uses the packages built with ONNX source in build stage.
### Motivation and Context
1. Resolve the random hang in compilation
4. Save a lot of GPU resources.
---------
### Description
<!-- Describe your changes. -->
### Motivation and Context
downloading deps is not needed in test stage
remove it to reduce random downloading errors
### Description
<!-- Describe your changes. -->
the crash caused by the neural_speed turns out to be a very corn case.
Turn it on by default.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
MAUI on macOS uses mac-catalyst which requires a different native
binary.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
The docker image name was fixed, but the docker argument was different
in different job.
It would trigger rebuilding the docker image almost every time!!!
### Description
Fix a few warnings in typedoc (for generating JS API):
```
[warning] The signature TrainingSession.loadParametersBuffer has an @param with name "buffer", which was not used.
[warning] NonTensorType, defined in ./lib/onnx-value.ts, is referenced by OnnxValue but not included in the documentation.
[warning] TensorFactory, defined in ./lib/tensor-factory.ts, is referenced by Tensor but not included in the documentation.
[warning] ExternalDataFileType, defined in ./lib/onnx-model.ts, is referenced by InferenceSession.SessionOptions.externalData but not included in the documentation.
[warning] TensorToDataUrlOptions, defined in ./lib/tensor-conversion.ts, is referenced by Tensor.toDataURL.toDataURL.options but not included in the documentation.
[warning] TensorToImageDataOptions, defined in ./lib/tensor-conversion.ts, is referenced by Tensor.toImageData.toImageData.options but not included in the documentation.
[warning] Failed to resolve link to "GpuBufferType" in comment for Env.WebGpuFlags.adapter.
[warning] Failed to resolve link to "GpuBufferType" in comment for Env.WebGpuFlags.device.
```
Changes highlighted:
- Merge `CoreMlExecutionProviderOption` and
`CoreMLExecutionProviderOption`. They expose 2 set of different options
for React-native and ORT nodejs binding. This should be fixed in future.
- Fix a few inconsistency of names between JSDoc and parameters
- Fix broken type links
- Exclude trace functions
### Description
<!-- Describe your changes. -->
* Add concurrency test to EP Perf CI panel (impl. by onnx_test_runner)
* Model: FasterRCNN-10 model within CI image
* `-c` param configurable via CI panel when kicking off CI tasks
* Auto-replicate test input/outputs according to `-c` param
* By default, the model test will be executed in 100 iterations (~2min
added to T4 CI task load overall)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
To monitor potential concurrency issues of ORT-TRT
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers
### Motivation and Context
This PR includes changes that will make ort handle 2GB+ checkpoints.
To do that we need to upgrade flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945
- Modified the commitHash and the hash for the new version
- Removed the patch for rust generator's unused variable warning as it
is no longer producing this - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.
---------
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
### Description
1. Use stage to organize the pipeline and split building and testing
2. Move compilation on CPU machine
3. test stage can leverage existing artifacts
4. check wheel size, it gives warning if the size above 300M
5. docker image name wasn't change even the argument changed, which
caused the docker image was always rebuilt. So update the docker image
name according to the argument can save the docker build time.
Pipeline duration reduced by 60% (2 hours -> 50 minutes)
Compilation time reduced by 75% (1.5hours -> 20 minutes)
GPU time reduced by 87% ( 8 hours to 1 hours)
for debugging, the GPU time could be reduced by above 95%, because we
can choose run only one test stage and skip building.
### Motivation and Context
Make the pipeline efficient.
Optimized
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=424177&view=results
Curent
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=422393&view=results
---------
### Description
Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download
TRT binaries in every build. Now all the other build jobs are already
doing this. This is the only one left.
Similar to #19909
### Motivation and Context
As a follow up of #19118
### Description
Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download
TRT binaries in every build. Now all the other build jobs are already
doing this. This is the only one left.
### Motivation and Context
As a follow up of #19118
### Description
the `npm test` flags are difficult to memorize, because they are
different to the `ort.env` flags. This change makes those flags align
with ort JS API. eg. `--wasm-enable-proxy` became `--wasm.proxy`.
Old flags are marked as deprecated except `-x` (as a shortcut of
`--wasm.numThreads`)
### Description
Check the onnx node tests and model tests worked
### Motivation and Context
onnx node test data and model data are mount in one dir.
And onnxruntime_test_all search the dir and load the data.
If the dir does exist or there's some change in onnxruntime_test_all,
those tests may not be executed.
For example, all onnx node test data is 32M. It's hardly for us aware of
the regression.
So I add the simple check to ensure those tests are executed.
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
--extra-index-url is not allowed by injected Secure Supply Chain Step in
packaging pipelines.
```
> Starting Multifeed Python Security Analysis:
##[warning]tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml - Found "extra-index-url". (https://aka.ms/cfs/pypi)
```
And those 2 packages can be installed from PyPI as well now.
Co-authored-by: Yi Zhang <your@email.com>
Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled.
- training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build.
- tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build.
Added a test to cover this configuration.
### Description
As a follow up of #19788, remove more remaining Windows ARM32 build
jobs.
### Motivation and Context
Our nuget packaging pipeline is failing because it could not find an
artifact for Win ARM32.
```
##[error]Artifact onnxruntime-training-win-arm was not found for build 421397.
```
Deprecation of Win ARM32 was announced by Windows team in January 2023.
We should follow it.
### Description
* Update name of existing dockerfiles and add support to test latest
TensorRT EA binary located in the image
* Add cuda 12.3/cuDNN 9/TensorRT 8.6 dockerfile
* Add detail to CI prompts and configs
Instruction to test latest TRT via BIN:
1. Select `BIN` in TensorRT Version
2. In Variables, update related tarCudaVersion, **clear**
tarCudnnVersion (not required in latest TRT tar binary) , and path to
binary.
### Description
* Add tag to distinguish if TRT `builtin` or `oss` parser is being used
* `oss` tag will be inserted with onnx-tensorrt commit id, to indicate
which version oss parser is
### Validate
DB entry before/after this PR
(during test, `builtin` or `oss_{commit_id}` tag was inserted in the
database entries):
### Motivation and Context
To distinguish perf results using builtin/oss parser in the database,
this parser tag is needed.
In future, results using different parsers will be listed in different
Perf Dashboard pages.
### Description
<!-- Describe your changes. -->
Address warnings so all the ORT projects build with /W4 on Windows.
Mainly
- unused parameters
- variables shadowing other ones
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19588 started on this.