### Description
[DirectML EP] Add DML EP registration for Col2Im operator
### Motivation and Context
Add Col2Im support for opset 18.
This operator is implemented as the DirectML Fold operator.
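For readers unfamiliar with the operator, here is a minimal CPU sketch of what Col2Im computes. It is not the DirectML Fold implementation and not code from this PR; the function name `Col2ImSketch` is hypothetical, and the sketch assumes a single image, stride 1, no padding and no dilation.

```cpp
#include <cstddef>
#include <vector>

// Reference col2im sketch (hypothetical helper, not the DML EP code).
// `columns` is row-major with shape [C * KH * KW, L], where
// L = (H - KH + 1) * (W - KW + 1) is the number of sliding-block positions.
// Returns an image of shape [C, H, W]; values from overlapping blocks are summed.
std::vector<float> Col2ImSketch(const std::vector<float>& columns,
                                std::size_t C, std::size_t H, std::size_t W,
                                std::size_t KH, std::size_t KW) {
  const std::size_t outH = H - KH + 1;  // block positions along height
  const std::size_t outW = W - KW + 1;  // block positions along width
  const std::size_t L = outH * outW;
  std::vector<float> image(C * H * W, 0.0f);
  for (std::size_t c = 0; c < C; ++c) {
    for (std::size_t kh = 0; kh < KH; ++kh) {
      for (std::size_t kw = 0; kw < KW; ++kw) {
        // Row of `columns` holding this (channel, kernel offset) pair.
        const std::size_t row = (c * KH + kh) * KW + kw;
        for (std::size_t oh = 0; oh < outH; ++oh) {
          for (std::size_t ow = 0; ow < outW; ++ow) {
            const std::size_t l = oh * outW + ow;
            // Scatter-add the column value back to its image position.
            image[(c * H + oh + kh) * W + (ow + kw)] += columns[row * L + l];
          }
        }
      }
    }
  }
  return image;
}
```

With a 3x3 single-channel image and a 2x2 block, an all-ones input produces the overlap counts of each pixel, so the center pixel receives 4 and each corner receives 1.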
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
Update resource creation flag to avoid D3D12 WARNING
### Description
Update the DML DX12 allocator to use D3D12_RESOURCE_STATE_COMMON to avoid
DX12 warning messages.
### Motivation and Context
When DirectML is created with the debug layer enabled, warnings are emitted
when ORT creates resources.
---------
Co-authored-by: Christian Larson <28911437+chrilaMSFT@users.noreply.github.com>
### Description
1. Expand input datatype support for Resize with uint8/int8.
2. Update the logic that computes the output shape of the Resize op. roiRange
is removed to align with how the tests compute the output shape and to work
around the size assertion in MLOperatorAuthorImpl.cpp:
`m_inputDimensions[i] * roiRange * scale` -> `m_inputDimensions[i] *
scale`
3. Disable 4 tests because of a result mismatch. The DML results with
float32 and uint8/int8 match each other, so the problem is likely in the
Resize implementation, which is out of the scope of this PR.
`ResizeOpTest.NhwcResizeOpLinearDownSampleTest_tf_crop_and_resize_without_extrapolation_uint8
ResizeOpTest.NhwcResizeOpLinearDownSampleTest_tf_crop_and_resize_without_extrapolation_int8
ResizeOpTest.NhwcResizeOpLinearDownSampleTest_4DBilinear_pytorch_half_pixel_uint8
ResizeOpTest.NhwcResizeOpLinearDownSampleTest_4DBilinear_pytorch_half_pixel_int8`
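The output shape change in item 2 can be sketched roughly as follows. This is an illustration only, not the DML EP code: the helper name `ComputeResizeOutputShape` is hypothetical, it assumes one scale per dimension, and the use of `floor` follows the ONNX Resize specification rather than anything stated in this PR.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the DML EP code): each output dimension is
// input * scale, with no roiRange factor.
// Assumes scales.size() == inputDimensions.size().
std::vector<uint32_t> ComputeResizeOutputShape(
    const std::vector<uint32_t>& inputDimensions,
    const std::vector<float>& scales) {
  std::vector<uint32_t> outputDimensions(inputDimensions.size());
  for (std::size_t i = 0; i < inputDimensions.size(); ++i) {
    // Assumption: round down, as the ONNX Resize spec describes.
    outputDimensions[i] = static_cast<uint32_t>(
        std::floor(static_cast<float>(inputDimensions[i]) * scales[i]));
  }
  return outputDimensions;
}
```

For example, an input of shape {1, 1, 4, 4} with scales {1, 1, 2, 0.5} would give {1, 1, 8, 2} under these assumptions.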
[Cherry pick Reviewed]
Re-add changes which were merged out...
---------
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Sheil Kumar <smk2007@gmail.com>
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
---------
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
### Description
This PR also includes:
- 8b0a55e7cc DML constant pow operator
- 7520974970 Enable custom heaps based on query-
---------
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
[Cherry Pick Reviewed]
DML EP Implementation for
[QLinearAveragePool](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QLinearAveragePool)
```
Note: Google Test filter = *QLinear*Pool*
[==========] Running 72 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 36 tests from QLinearGlobalAveragePool
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x1x32x32
[ OK ] QLinearGlobalAveragePool.Nhwc_1x1x32x32 (410 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x32x32x1
[ OK ] QLinearGlobalAveragePool.Nchw_1x32x32x1 (641 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x256x8x8
[ OK ] QLinearGlobalAveragePool.Nhwc_1x256x8x8 (156 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x8x8x256
[ OK ] QLinearGlobalAveragePool.Nchw_1x8x8x256 (134 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x255x7x7
[ OK ] QLinearGlobalAveragePool.Nhwc_1x255x7x7 (160 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x7x7x255
[ OK ] QLinearGlobalAveragePool.Nchw_1x7x7x255 (145 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x255x8x8
[ OK ] QLinearGlobalAveragePool.Nhwc_1x255x8x8 (148 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x8x8x255
[ OK ] QLinearGlobalAveragePool.Nchw_1x8x8x255 (129 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x256x7x7
[ OK ] QLinearGlobalAveragePool.Nhwc_1x256x7x7 (134 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x7x7x256
[ OK ] QLinearGlobalAveragePool.Nchw_1x7x7x256 (131 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_3x256x8x8
[ OK ] QLinearGlobalAveragePool.Nhwc_3x256x8x8 (159 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_3x8x8x256
[ OK ] QLinearGlobalAveragePool.Nchw_3x8x8x256 (168 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_3x255x7x7
[ OK ] QLinearGlobalAveragePool.Nhwc_3x255x7x7 (139 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_3x7x7x255
[ OK ] QLinearGlobalAveragePool.Nchw_3x7x7x255 (170 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_3x255x8x8
[ OK ] QLinearGlobalAveragePool.Nhwc_3x255x8x8 (155 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_3x8x8x255
[ OK ] QLinearGlobalAveragePool.Nchw_3x8x8x255 (156 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_3x256x7x7
[ OK ] QLinearGlobalAveragePool.Nhwc_3x256x7x7 (133 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_3x7x7x256
[ OK ] QLinearGlobalAveragePool.Nchw_3x7x7x256 (149 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x1x32x32_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_1x1x32x32_S8 (131 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x32x32x1_S8
[ OK ] QLinearGlobalAveragePool.Nchw_1x32x32x1_S8 (127 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x256x8x8_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_1x256x8x8_S8 (153 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x8x8x256_S8
[ OK ] QLinearGlobalAveragePool.Nchw_1x8x8x256_S8 (129 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x255x7x7_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_1x255x7x7_S8 (133 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x7x7x255_S8
[ OK ] QLinearGlobalAveragePool.Nchw_1x7x7x255_S8 (135 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x255x8x8_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_1x255x8x8_S8 (129 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x8x8x255_S8
[ OK ] QLinearGlobalAveragePool.Nchw_1x8x8x255_S8 (152 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_1x256x7x7_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_1x256x7x7_S8 (140 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_1x7x7x256_S8
[ OK ] QLinearGlobalAveragePool.Nchw_1x7x7x256_S8 (133 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_3x256x8x8_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_3x256x8x8_S8 (135 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_3x8x8x256_S8
[ OK ] QLinearGlobalAveragePool.Nchw_3x8x8x256_S8 (147 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_3x255x7x7_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_3x255x7x7_S8 (156 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_3x7x7x255_S8
[ OK ] QLinearGlobalAveragePool.Nchw_3x7x7x255_S8 (155 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_3x255x8x8_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_3x255x8x8_S8 (138 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_3x8x8x255_S8
[ OK ] QLinearGlobalAveragePool.Nchw_3x8x8x255_S8 (155 ms)
[ RUN ] QLinearGlobalAveragePool.Nhwc_3x256x7x7_S8
[ OK ] QLinearGlobalAveragePool.Nhwc_3x256x7x7_S8 (144 ms)
[ RUN ] QLinearGlobalAveragePool.Nchw_3x7x7x256_S8
[ OK ] QLinearGlobalAveragePool.Nchw_3x7x7x256_S8 (139 ms)
[----------] 36 tests from QLinearGlobalAveragePool (5968 ms total)
[----------] 36 tests from QLinearPoolTest
[ RUN ] QLinearPoolTest.AveragePool1D_ExcludePadPixel
[ OK ] QLinearPoolTest.AveragePool1D_ExcludePadPixel (480 ms)
[ RUN ] QLinearPoolTest.AveragePool1D_IncludePadPixel
[ OK ] QLinearPoolTest.AveragePool1D_IncludePadPixel (481 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_ExcludePadPixel
[ OK ] QLinearPoolTest.AveragePool2D_ExcludePadPixel (512 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_IncludePadPixel
[ OK ] QLinearPoolTest.AveragePool2D_IncludePadPixel (455 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_MultiChannel
[ OK ] QLinearPoolTest.AveragePool2D_MultiChannel (463 ms)
[ RUN ] QLinearPoolTest.AveragePool3D_ExcludePadPixel
[ OK ] QLinearPoolTest.AveragePool3D_ExcludePadPixel (448 ms)
[ RUN ] QLinearPoolTest.AveragePool3D_IncludePadPixel
[ OK ] QLinearPoolTest.AveragePool3D_IncludePadPixel (458 ms)
[ RUN ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_nhwc
[ OK ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_nhwc (171 ms)
[ RUN ] QLinearPoolTest.AveragePool1D_IncludePadPixel_nhwc
[ OK ] QLinearPoolTest.AveragePool1D_IncludePadPixel_nhwc (169 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_nhwc
[ OK ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_nhwc (152 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_IncludePadPixel_nhwc
[ OK ] QLinearPoolTest.AveragePool2D_IncludePadPixel_nhwc (660 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_MultiChannel_nhwc
[ OK ] QLinearPoolTest.AveragePool2D_MultiChannel_nhwc (150 ms)
[ RUN ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_nhwc
[ OK ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_nhwc (145 ms)
[ RUN ] QLinearPoolTest.AveragePool3D_IncludePadPixel_nhwc
[ OK ] QLinearPoolTest.AveragePool3D_IncludePadPixel_nhwc (146 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_BigImage
[ OK ] QLinearPoolTest.AveragePool2D_BigImage (505 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_BigImage_nhwc
[ OK ] QLinearPoolTest.AveragePool2D_BigImage_nhwc (161 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_Global
[ OK ] QLinearPoolTest.AveragePool2D_Global (481 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_Global_nhwc
[ OK ] QLinearPoolTest.AveragePool2D_Global_nhwc (152 ms)
[ RUN ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_S8
[ OK ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_S8 (461 ms)
[ RUN ] QLinearPoolTest.AveragePool1D_IncludePadPixel_S8
[ OK ] QLinearPoolTest.AveragePool1D_IncludePadPixel_S8 (448 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_S8
[ OK ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_S8 (471 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_IncludePadPixel_S8
[ OK ] QLinearPoolTest.AveragePool2D_IncludePadPixel_S8 (473 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_MultiChannel_S8
[ OK ] QLinearPoolTest.AveragePool2D_MultiChannel_S8 (1507 ms)
[ RUN ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_S8
[ OK ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_S8 (477 ms)
[ RUN ] QLinearPoolTest.AveragePool3D_IncludePadPixel_S8
[ OK ] QLinearPoolTest.AveragePool3D_IncludePadPixel_S8 (493 ms)
[ RUN ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool1D_ExcludePadPixel_nhwc_S8 (158 ms)
[ RUN ] QLinearPoolTest.AveragePool1D_IncludePadPixel_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool1D_IncludePadPixel_nhwc_S8 (146 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool2D_ExcludePadPixel_nhwc_S8 (146 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_IncludePadPixel_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool2D_IncludePadPixel_nhwc_S8 (158 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_MultiChannel_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool2D_MultiChannel_nhwc_S8 (157 ms)
[ RUN ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool3D_ExcludePadPixel_nhwc_S8 (145 ms)
[ RUN ] QLinearPoolTest.AveragePool3D_IncludePadPixel_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool3D_IncludePadPixel_nhwc_S8 (147 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_BigImage_S8
[ OK ] QLinearPoolTest.AveragePool2D_BigImage_S8 (537 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_BigImage_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool2D_BigImage_nhwc_S8 (173 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_Global_S8
[ OK ] QLinearPoolTest.AveragePool2D_Global_S8 (457 ms)
[ RUN ] QLinearPoolTest.AveragePool2D_Global_nhwc_S8
[ OK ] QLinearPoolTest.AveragePool2D_Global_nhwc_S8 (150 ms)
[----------] 36 tests from QLinearPoolTest (12914 ms total)
[----------] Global test environment tear-down
[==========] 72 tests from 2 test suites ran. (18885 ms total)
[ PASSED ] 72 tests.
memleakdbg:
----- No memory leaks detected -----
```
Co-authored-by: Adrian Tsai <adtsai@microsoft.com>
### Description
[Cherry Pick Reviewed]
```
[ OK ] QLinearConcatS8.ExpectFail_WrongZeroPointType_1 (372 ms)
[ RUN ] QLinearConcatS8.InputOne_Dynamic
[ OK ] QLinearConcatS8.InputOne_Dynamic (255 ms)
[ RUN ] QLinearConcatS8.InputOne_Const
[ OK ] QLinearConcatS8.InputOne_Const (255 ms)
[----------] 11 tests from QLinearConcatS8 (3385 ms total)
[----------] Global test environment tear-down
[==========] 21 tests from 3 test suites ran. (9355 ms total)
[ PASSED ] 21 tests.
```
[#16971](https://github.com/microsoft/onnxruntime/pull/16971)
Co-authored-by: Xiang Zhang <xianz@microsoft.com>
### Description
If we fail to calculate the buffer size (due to overflow) we currently
return a nullptr. This is inconsistent as an actual memory allocation
failure throws. An overflow would typically be due to bad input so an
exception makes more sense given that.
Change to throw so code using MakeUniquePtr* and AllocArray* doesn't
need to check for nullptr.
Add some extra info to the log message to help debugging.
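The behavior described above can be sketched as a checked size computation that throws instead of signalling failure through a nullptr. This is a minimal illustration, not the ORT allocator code: the name `CheckedBufferSize` and the exception type are assumptions made for the example.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the ORT implementation): computes
// count * elementSize and throws on overflow, so callers never
// receive a nullptr that stands in for "size calculation failed".
inline std::size_t CheckedBufferSize(std::size_t count, std::size_t elementSize) {
  if (elementSize != 0 &&
      count > std::numeric_limits<std::size_t>::max() / elementSize) {
    // Include the inputs in the message to help locate the bad value.
    throw std::overflow_error(
        "buffer size overflow: count=" + std::to_string(count) +
        ", element size=" + std::to_string(elementSize));
  }
  return count * elementSize;
}
```

With this shape, an overflow and an out-of-memory condition both surface as exceptions, so calling code needs a single error path.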
### Motivation and Context
Should help with #18905 by avoiding the invalid attempted usage of a
nullptr from the allocation. Extra info _might_ help with figuring out
where the overflow is coming from which is the real issue.
### Description
Previously, shape and strides were added unconditionally even when they were
not used. This PR fixes this issue and only adds shape and strides when
they are required.
This is useful when some shapes are not used as uniforms because the program
depends on type instead of rank.
### Description
Add trilinear interpolation to Resize and make the activation_params attribute optional for FuseConv.
### Fix build when flash attention and memory efficient attention are disabled
In a customer environment with a CUDA version lower than 11.6, both flash
attention and memory efficient attention are turned off according to
e8f33b54ba/cmake/CMakeLists.txt (L701).
As a result, the condition check in
e8f33b54ba/cmake/external/cutlass.cmake (L1)
returns false, and the cutlass library is not built.
```
Turn off flash attention since CUDA compiler version < 11.6
```
However, the kernels in
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/moe/ft_moe
depend on cutlass to build, so the build fails with an error like this:
```
[ 77%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
23 | #include "cutlass/array.h"
| ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
23 | #include "cutlass/array.h"
| ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
23 | #include "cutlass/array.h"
| ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
23 | #include "cutlass/array.h"
| ^~~~~~~~~~~~~~~~~
compilation terminated.
fatal : Could not open input file /tmp/tmpxft_00044da3_00000000-11_moe_gemm_kernels_fp16_fp16.compute_60.cpp1.ii
make[2]: *** [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:6290: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2210: CMakeFiles/onnxruntime_providers_cuda.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
Traceback (most recent call last):
File "/tmp/onnxruntime/tools/ci_build/build.py", line 2746, in <module>
sys.exit(main())
File "/tmp/onnxruntime/tools/ci_build/build.py", line 2639, in main
build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
File "/tmp/onnxruntime/tools/ci_build/build.py", line 1527, in build_targets
run_subprocess(cmd_args, env=env)
File "/tmp/onnxruntime/tools/ci_build/build.py", line 824, in run_subprocess
return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
File "/tmp/onnxruntime/tools/python/util/run.py", line 49, in run
completed_process = subprocess.run(
File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
```
### Motivation and Context
To summarize, the Linux CUDA build fails in two cases:
1. The user builds with a CUDA version < 11.6.
2. The user explicitly disables flash attention and memory efficient
attention with onnxruntime_USE_FLASH_ATTENTION and
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION.
### Description
This makes a minimal change to address a crash caused by the PadFusion
pass. This pass assumed that the "pads" attribute of a child node
existed, and it now skips when it's missing.
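The guard described above can be sketched as follows. This is an illustration under stated assumptions, not the PadFusion source: `NodeSketch` and `TryFusePads` are hypothetical names, attributes are modeled as a plain map, and merging by element-wise addition is an assumption about how the fused pads are combined.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node; only integer-list attributes are modeled.
struct NodeSketch {
  std::map<std::string, std::vector<int64_t>> attributes;
};

// Returns false (fusion skipped) when the child has no "pads" attribute,
// instead of assuming the attribute exists.
bool TryFusePads(const std::vector<int64_t>& padNodePads, NodeSketch& child) {
  auto it = child.attributes.find("pads");
  if (it == child.attributes.end()) {
    return false;  // missing attribute: skip rather than crash
  }
  std::vector<int64_t>& childPads = it->second;
  if (childPads.size() != padNodePads.size()) {
    return false;  // shapes do not line up: skip (assumption for this sketch)
  }
  // Assumption: the fused padding is the element-wise sum of both pads.
  for (std::size_t i = 0; i < childPads.size(); ++i) {
    childPads[i] += padNodePads[i];
  }
  return true;
}
```

The design point is that a missing attribute becomes an ordinary "no fusion" result, so the optimizer pass continues with the graph unchanged.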
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
### Description
1. Update download-artifacts to flex-downloadartifacts to make it easier
to debug.
2. Move the native files into the Gpu.Windows and Gpu.Linux packages.
Onnxruntime-Gpu has a dependency on them.
3. Update the package validation as well.
4. Add 2 stages to run E2E tests for GPU.Windows and GPU.Linux.
### Motivation and Context
The single Onnxruntime.Gpu package has already exceeded the NuGet size
limit.
We split the package into smaller packages so that they can be
published.
For compatibility, the user can install or upgrade Onnxruntime.Gpu,
which will install Gpu.Windows and Gpu.Linux automatically.
The user can also install only Gpu.Windows or Gpu.Linux directly.
### Test Link
1. In ORT_NIGHTLY
2. Install the preview version in nuget-int. (nuget source:
https://apiint.nugettest.org/v3/index.json)
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
All ORT-CUDAFp16 model tests failed because the latest `onnxmltools`
1.12.0 removed `onnxconverter-common` from its dependencies, and the
ep perf env needs it to test models with the CUDA EP under fp16.
Add the `onnxconverter-common` dependency to the env to fix this.
Bumps [transformers](https://github.com/huggingface/transformers) from
4.30.0 to 4.36.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/huggingface/transformers/releases">transformers's
releases</a>.</em></p>
<blockquote>
<h2>v4.36: Mixtral, Llava/BakLlava, SeamlessM4T v2, AMD ROCm, F.sdpa
wide-spread support</h2>
<h2>New model additions</h2>
<h3>Mixtral</h3>
<p>Mixtral is the new open-source model from Mistral AI announced by the
blogpost <a href="https://mistral.ai/news/mixtral-of-experts/">Mixtral
of Experts</a>. The model has been proven to have comparable
capabilities to Chat-GPT according to the benchmark results shared on
the release blogpost.</p>
<!-- raw HTML omitted -->
<p>The architecture is a sparse Mixture of Experts with a Top-2 routing
strategy, similar to the <code>NllbMoe</code> architecture in transformers.
You can use it through the <code>AutoModelForCausalLM</code> interface:</p>
<pre lang="py"><code>>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B", torch_dtype=torch.float16, device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B")
>>> prompt = "My favourite condiment is"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
>>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
</code></pre>
<p>The model is compatible with existing optimisation tools such as Flash
Attention 2, <code>bitsandbytes</code> and the PEFT library. The checkpoints
are released under the <a
href="https://huggingface.co/mistralai"><code>mistralai</code></a>
organisation on the Hugging Face Hub.</p>
<h3>Llava / BakLlava</h3>
<p>Llava is an open-source chatbot trained by fine-tuning LLaMA/Vicuna
on GPT-generated multimodal instruction-following data. It is an
auto-regressive language model, based on the transformer architecture.
In other words, it is a multi-modal version of LLMs fine-tuned for chat
/ instructions.</p>
<!-- raw HTML omitted -->
<p>The Llava model was proposed in <a
href="https://arxiv.org/pdf/2310.03744">Improved Baselines with Visual
Instruction Tuning</a> by Haotian Liu, Chunyuan Li, Yuheng Li and Yong
Jae Lee.</p>
<ul>
<li>[<code>Llava</code>] Add Llava to transformers by <a
href="https://github.com/younesbelkada"><code>@younesbelkada</code></a>
in <a
href="https://redirect.github.com/huggingface/transformers/issues/27662">#27662</a></li>
<li>[LLaVa] Some improvements by <a
href="https://github.com/NielsRogge"><code>@NielsRogge</code></a> in <a
href="https://redirect.github.com/huggingface/transformers/issues/27895">#27895</a></li>
</ul>
<p>The integration also includes <a
href="https://github.com/SkunkworksAI/BakLLaVA"><code>BakLlava</code></a>
which is a Llava model trained with Mistral backbone.</p>
<p>The model is compatible with the <code>"image-to-text"</code>
pipeline:</p>
<pre lang="py"><code>from transformers import pipeline
from PIL import Image
import requests
model_id = "llava-hf/llava-1.5-7b-hf"
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="14666775a2"><code>1466677</code></a>
Release: v4.36.0</li>
<li><a
href="accccdd008"><code>accccdd</code></a>
[<code>Add Mixtral</code>] Adds support for the Mixtral MoE (<a
href="https://redirect.github.com/huggingface/transformers/issues/27942">#27942</a>)</li>
<li><a
href="0676d992a5"><code>0676d99</code></a>
[<code>from_pretrained</code>] Make from_pretrained fast again (<a
href="https://redirect.github.com/huggingface/transformers/issues/27709">#27709</a>)</li>
<li><a
href="9f18cc6df0"><code>9f18cc6</code></a>
Fix SDPA dispatch & make SDPA CI compatible with torch<2.1.1 (<a
href="https://redirect.github.com/huggingface/transformers/issues/27940">#27940</a>)</li>
<li><a
href="7ea21f1f03"><code>7ea21f1</code></a>
[LLaVa] Some improvements (<a
href="https://redirect.github.com/huggingface/transformers/issues/27895">#27895</a>)</li>
<li><a
href="5e620a92cf"><code>5e620a9</code></a>
Fix <code>SeamlessM4Tv2ModelIntegrationTest</code> (<a
href="https://redirect.github.com/huggingface/transformers/issues/27911">#27911</a>)</li>
<li><a
href="e96c1de191"><code>e96c1de</code></a>
Skip <code>UnivNetModelTest::test_multi_gpu_data_parallel_forward</code>
(<a
href="https://redirect.github.com/huggingface/transformers/issues/27912">#27912</a>)</li>
<li><a
href="8d8970efdd"><code>8d8970e</code></a>
[BEiT] Fix test (<a
href="https://redirect.github.com/huggingface/transformers/issues/27934">#27934</a>)</li>
<li><a
href="235be08569"><code>235be08</code></a>
[DETA] fix backbone freeze/unfreeze function (<a
href="https://redirect.github.com/huggingface/transformers/issues/27843">#27843</a>)</li>
<li><a
href="df5c5c62ae"><code>df5c5c6</code></a>
Fix typo (<a
href="https://redirect.github.com/huggingface/transformers/issues/27918">#27918</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/huggingface/transformers/compare/v4.30.0...v4.36.0">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [github/issue-labeler](https://github.com/github/issue-labeler)
from 3.2 to 3.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/github/issue-labeler/releases">github/issue-labeler's
releases</a>.</em></p>
<blockquote>
<h2>v3.3</h2>
<h2>What's Changed</h2>
<ul>
<li>feat(config): support reading from local file if it exists by <a
href="https://github.com/lrstanley"><code>@lrstanley</code></a> in <a
href="https://redirect.github.com/github/issue-labeler/pull/48">github/issue-labeler#48</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/lrstanley"><code>@lrstanley</code></a>
made their first contribution in <a
href="https://redirect.github.com/github/issue-labeler/pull/48">github/issue-labeler#48</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/github/issue-labeler/compare/v3.2...v3.3">https://github.com/github/issue-labeler/compare/v3.2...v3.3</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="6bea9ed491"><code>6bea9ed</code></a>
feat(config): support reading from local file if it exists (<a
href="https://redirect.github.com/github/issue-labeler/issues/48">#48</a>)</li>
<li>See full diff in <a
href="https://github.com/github/issue-labeler/compare/v3.2...v3.3">compare
view</a></li>
</ul>
</details>
<br />
You can trigger a rebase of this PR by commenting `@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
> **Note**
> Automatic rebases have been disabled on this pull request as it has
been open for over 30 days.
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [transformers](https://github.com/huggingface/transformers) from
4.35.2 to 4.36.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/huggingface/transformers/releases">transformers's
releases</a>.</em></p>
<blockquote>
<h2>v4.36: Mixtral, Llava/BakLlava, SeamlessM4T v2, AMD ROCm, F.sdpa
wide-spread support</h2>
<h2>New model additions</h2>
<h3>Mixtral</h3>
<p>Mixtral is the new open-source model from Mistral AI, announced in the
blog post <a href="https://mistral.ai/news/mixtral-of-experts/">Mixtral
of Experts</a>. According to the benchmark results shared in that post,
the model's capabilities are comparable to ChatGPT's.</p>
<p>The architecture is a sparse Mixture of Experts with a Top-2 routing
strategy, similar to the <code>NllbMoe</code> architecture in transformers.
You can use it through the <code>AutoModelForCausalLM</code> interface:</p>
<pre lang="py"><code>>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>>
>>> # Load the weights in half precision; device_map="auto" places them on the available devices.
>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B", torch_dtype=torch.float16, device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B")
>>>
>>> prompt = "My favourite condiment is"
>>>
>>> # Move the tokenized inputs to the device that holds the model's first parameters.
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>>
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
</code></pre>
<p>The model is compatible with existing optimisation tools such as Flash
Attention 2, <code>bitsandbytes</code> and the PEFT library. The checkpoints
are released under the <a
href="https://huggingface.co/mistralai"><code>mistralai</code></a>
organisation on the Hugging Face Hub.</p>
<h3>Llava / BakLlava</h3>
<p>Llava is an open-source chatbot trained by fine-tuning LLaMA/Vicuna
on GPT-generated multimodal instruction-following data. It is an
auto-regressive language model based on the transformer architecture.
In other words, it is a multi-modal version of LLMs fine-tuned for chat
and instructions.</p>
<p>The Llava model was proposed in <a
href="https://arxiv.org/pdf/2310.03744">Improved Baselines with Visual
Instruction Tuning</a> by Haotian Liu, Chunyuan Li, Yuheng Li and Yong
Jae Lee.</p>
<ul>
<li>[<code>Llava</code>] Add Llava to transformers by <a
href="https://github.com/younesbelkada"><code>@younesbelkada</code></a>
in <a
href="https://redirect.github.com/huggingface/transformers/issues/27662">#27662</a></li>
<li>[LLaVa] Some improvements by <a
href="https://github.com/NielsRogge"><code>@NielsRogge</code></a> in <a
href="https://redirect.github.com/huggingface/transformers/issues/27895">#27895</a></li>
</ul>
<p>The integration also includes <a
href="https://github.com/SkunkworksAI/BakLLaVA"><code>BakLlava</code></a>
which is a Llava model trained with Mistral backbone.</p>
<p>The model is compatible with the <code>"image-to-text"</code>
pipeline:</p>
<pre lang="py"><code>from transformers import pipeline
from PIL import Image
import requests
model_id = "llava-hf/llava-1.5-7b-hf"
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="14666775a2"><code>1466677</code></a>
Release: v4.36.0</li>
<li><a
href="accccdd008"><code>accccdd</code></a>
[<code>Add Mixtral</code>] Adds support for the Mixtral MoE (<a
href="https://redirect.github.com/huggingface/transformers/issues/27942">#27942</a>)</li>
<li><a
href="0676d992a5"><code>0676d99</code></a>
[<code>from_pretrained</code>] Make from_pretrained fast again (<a
href="https://redirect.github.com/huggingface/transformers/issues/27709">#27709</a>)</li>
<li><a
href="9f18cc6df0"><code>9f18cc6</code></a>
Fix SDPA dispatch & make SDPA CI compatible with torch<2.1.1 (<a
href="https://redirect.github.com/huggingface/transformers/issues/27940">#27940</a>)</li>
<li><a
href="7ea21f1f03"><code>7ea21f1</code></a>
[LLaVa] Some improvements (<a
href="https://redirect.github.com/huggingface/transformers/issues/27895">#27895</a>)</li>
<li><a
href="5e620a92cf"><code>5e620a9</code></a>
Fix <code>SeamlessM4Tv2ModelIntegrationTest</code> (<a
href="https://redirect.github.com/huggingface/transformers/issues/27911">#27911</a>)</li>
<li><a
href="e96c1de191"><code>e96c1de</code></a>
Skip <code>UnivNetModelTest::test_multi_gpu_data_parallel_forward</code>
(<a
href="https://redirect.github.com/huggingface/transformers/issues/27912">#27912</a>)</li>
<li><a
href="8d8970efdd"><code>8d8970e</code></a>
[BEiT] Fix test (<a
href="https://redirect.github.com/huggingface/transformers/issues/27934">#27934</a>)</li>
<li><a
href="235be08569"><code>235be08</code></a>
[DETA] fix backbone freeze/unfreeze function (<a
href="https://redirect.github.com/huggingface/transformers/issues/27843">#27843</a>)</li>
<li><a
href="df5c5c62ae"><code>df5c5c6</code></a>
Fix typo (<a
href="https://redirect.github.com/huggingface/transformers/issues/27918">#27918</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/huggingface/transformers/compare/v4.35.2...v4.36.0">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
- Wrap usage of `kENABLE_TACTIC_HEURISTIC` in version-checking macros.
- Use `delete` instead of the deprecated `destroy()` functions on TRT objects.
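
The two changes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `FakeTrtObject` is a hypothetical stand-in for a TensorRT interface class, the `NV_TENSORRT_MAJOR`/`NV_TENSORRT_MINOR` values are defined locally only so the snippet builds without the TensorRT headers, and the 8.5 threshold in the guard is an assumption made for the example.

```cpp
#include <cassert>
#include <memory>

// Stand-ins for the macros that TensorRT's NvInferVersion.h defines in a real build.
#ifndef NV_TENSORRT_MAJOR
#define NV_TENSORRT_MAJOR 8
#define NV_TENSORRT_MINOR 6
#endif

// Hypothetical stand-in for a TensorRT interface class; it counts live instances.
struct FakeTrtObject {
  static int live_count;
  FakeTrtObject() { ++live_count; }
  virtual ~FakeTrtObject() { --live_count; }
};
int FakeTrtObject::live_count = 0;

// Returns true when the version-guarded branch is compiled in.
// The 8.5 threshold is an assumption made for this sketch.
bool TacticHeuristicGuardEnabled() {
#if NV_TENSORRT_MAJOR > 8 || (NV_TENSORRT_MAJOR == 8 && NV_TENSORRT_MINOR >= 5)
  // A real build would set the kENABLE_TACTIC_HEURISTIC builder flag here.
  return true;
#else
  return false;
#endif
}

// Releases the object through its destructor instead of a destroy() member.
int LiveCountAfterScopedUse() {
  {
    std::unique_ptr<FakeTrtObject> obj = std::make_unique<FakeTrtObject>();
    assert(FakeTrtObject::live_count == 1);
  }  // unique_ptr's default deleter calls delete here
  return FakeTrtObject::live_count;
}
```

Holding the object in a `std::unique_ptr` with the default deleter means no custom `destroy()`-calling deleter is needed, and the preprocessor guard keeps the flag out of builds against versions where it may not exist.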
### Motivation and Context
- Removes usages of deprecated TRT APIs.
Signed-off-by: Kevin Chen <kevinch@nvidia.com>