onnxruntime/tools/ci_build/github/azure-pipelines
Yi Zhang 14d7872ce9
Reuse T4 for Cuda12.2 training packaging pipeline. (#20244)
### Description
The training CUDA 12.2 packaging pipeline
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1308&_a=summary
has consistently run out of memory since PR #19910.
I tried other CPU agents, for example D64as_v5 (256 GB memory) and
D32as_v4 (128 GB memory and 256 GB SSD temp storage), and they still run out of
memory, as shown in the image below.

![image](https://github.com/microsoft/onnxruntime/assets/16190118/5acde9ef-674f-4b6d-a1b3-b54647645083)


The build does succeed on T4, even though a T4 agent has only 4 vCPUs, 28 GB
memory and 180 GB temp storage, but it takes much longer.

### Motivation and Context
Restore the CUDA 12.2 training packaging pipeline first.
More time is needed to investigate the root cause.


### Other Clues
These two compilation steps take about 6 to 7 minutes each with CUDA 12.2 on T4,
and the same build runs out of memory on a CPU machine. @ajindal1
CUDA 12.2 on T4:
```
2024-03-14T05:39:08.7726865Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-03-14T05:45:01.3223393Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o

2024-03-14T05:46:07.9218003Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim96_fp16_sm80.cu.o
2024-03-14T05:52:59.2387051Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu.o

```

With CUDA 11.8 on a CPU machine, the same compilation finishes in about one minute:
```
cuda11.8 on CPU
2024-04-09T11:34:35.0849836Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-04-09T11:35:53.6648154Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o

cuda11.8 on GPU
2024-03-13T12:16:33.4102477Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o
2024-03-13T12:19:58.8268272Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o
```
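
The step durations quoted above can be estimated from the gaps between consecutive log timestamps. The following is a minimal sketch and not part of the pipeline: `parse_ts` and `gap_seconds` are hypothetical helper names, the object paths are abbreviated with `...`, and it assumes each gap approximates one compilation step, which holds only loosely when the build runs jobs in parallel.

```python
from datetime import datetime


def parse_ts(line: str) -> datetime:
    """Parse the leading timestamp of a build log line."""
    stamp = line.split()[0].rstrip("Z")  # e.g. 2024-03-14T05:39:08.7726865
    head, frac = stamp.split(".")
    # strptime's %f accepts at most 6 fractional digits, so drop the 7th.
    return datetime.strptime(f"{head}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


def gap_seconds(first: str, second: str) -> float:
    """Seconds elapsed between the timestamps of two log lines."""
    return (parse_ts(second) - parse_ts(first)).total_seconds()


# Timestamps copied from the logs above; object paths abbreviated with "...".
cuda122_t4 = (
    "2024-03-14T05:39:08.7726865Z [ 90%] Building CUDA object .../flash_fwd_split_hdim32_fp16_sm80.cu.o",
    "2024-03-14T05:45:01.3223393Z [ 90%] Building CUDA object .../flash_fwd_split_hdim64_bf16_sm80.cu.o",
)
cuda118_cpu = (
    "2024-04-09T11:34:35.0849836Z [ 90%] Building CUDA object .../flash_fwd_split_hdim32_fp16_sm80.cu.o",
    "2024-04-09T11:35:53.6648154Z [ 90%] Building CUDA object .../flash_fwd_split_hdim64_bf16_sm80.cu.o",
)

for label, (start, end) in (("cuda12.2 on T4", cuda122_t4),
                            ("cuda11.8 on CPU", cuda118_cpu)):
    print(f"{label}: {gap_seconds(start, end) / 60:.1f} min")
```

With these timestamps the gap is just under 6 minutes for CUDA 12.2 on T4 and a little over 1 minute for CUDA 11.8 on CPU, matching the figures given in the description.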
2024-04-10 09:21:40 +08:00
..
nodejs/templates Fix training and macos ci pipelines (#20034) 2024-03-26 12:20:11 -07:00
nuget/templates Fix training and macos ci pipelines (#20034) 2024-03-26 12:20:11 -07:00
stages enable lto in Python-CUDA-Packaging Pipline (#20164) 2024-04-01 15:42:28 +08:00
templates Reuse T4 for Cuda12.2 training packaging pipeline. (#20244) 2024-04-10 09:21:40 +08:00
triggers
android-arm64-v8a-QNN-crosscompile-ci-pipeline.yml [QNN EP] Update default QNN SDK to 2.19.2.240210 (#19546) 2024-02-16 16:59:43 -08:00
android-x86_64-crosscompile-ci-pipeline.yml Change "onnxruntime-Linux-CPU-For-Android-CI" machine pool to "onnxruntime-Ubuntu2204-AMD-CPU" (#19698) 2024-02-28 19:36:26 -08:00
bigmodels-ci-pipeline.yml Remove --extra-index-url (#19885) 2024-03-13 09:45:22 -07:00
binary-size-checks-pipeline.yml
build-perf-test-binaries-pipeline.yml
c-api-noopenmp-packaging-pipelines.yml Split more windows GPU workflow into 2 stages, building and testing, to make them more stable (#20080) 2024-03-28 12:55:44 +08:00
clean-build-docker-image-cache-pipeline.yml
cuda-packaging-pipeline.yml Split more windows GPU workflow into 2 stages, building and testing, to make them more stable (#20080) 2024-03-28 12:55:44 +08:00
linux-ci-pipeline.yml Check whether required tests are executed. (#19884) 2024-03-13 09:59:57 -07:00
linux-cpu-aten-pipeline.yml Fix a build issue: /MP was not enabled correctly (#19190) 2024-01-29 12:45:38 -08:00
linux-cpu-eager-pipeline.yml Fix a build issue: /MP was not enabled correctly (#19190) 2024-01-29 12:45:38 -08:00
linux-cpu-minimal-build-ci-pipeline.yml Change "onnxruntime-Linux-CPU-For-Android-CI" machine pool to "onnxruntime-Ubuntu2204-AMD-CPU" (#19698) 2024-02-28 19:36:26 -08:00
linux-dnnl-ci-pipeline.yml
linux-gpu-ci-pipeline.yml Enable CUDA EP unit testing on Windows (#20039) 2024-03-27 13:32:36 -07:00
linux-gpu-tensorrt-ci-pipeline.yml Fix a build issue: /MP was not enabled correctly (#19190) 2024-01-29 12:45:38 -08:00
linux-gpu-tensorrt-daily-perf-pipeline.yml [EP Perf] Add concurrency test (#19804) 2024-03-15 07:41:21 -07:00
linux-migraphx-ci-pipeline.yml [ROCm] Remove MPI dependency and collectives to use NCCL (#19830) 2024-03-19 17:35:18 -07:00
linux-multi-gpu-tensorrt-ci-pipeline.yml
linux-openvino-ci-pipeline.yml Ort openvino npu 1.17 master (#19966) 2024-03-21 18:44:00 -07:00
linux-qnn-ci-pipeline.yml [QNN EP] Update default QNN SDK to 2.19.2.240210 (#19546) 2024-02-16 16:59:43 -08:00
mac-ci-pipeline.yml
mac-coreml-ci-pipeline.yml Switch a portion of CI/packaging jobs to MacOS12 (#19908) 2024-03-19 14:54:58 -07:00
mac-ios-ci-pipeline.yml Switch a portion of CI/packaging jobs to MacOS12 (#19908) 2024-03-19 14:54:58 -07:00
mac-ios-packaging-pipeline.yml Switch a portion of CI/packaging jobs to MacOS12 (#19908) 2024-03-19 14:54:58 -07:00
mac-objc-static-analysis-ci-pipeline.yml Fix training and macos ci pipelines (#20034) 2024-03-26 12:20:11 -07:00
mac-react-native-ci-pipeline.yml Change "onnxruntime-Linux-CPU-For-Android-CI" machine pool to "onnxruntime-Ubuntu2204-AMD-CPU" (#19698) 2024-02-28 19:36:26 -08:00
npm-packaging-pipeline.yml
nuget-cuda-publishing-pipeline.yml
orttraining-linux-ci-pipeline.yml Fix a build issue: /MP was not enabled correctly (#19190) 2024-01-29 12:45:38 -08:00
orttraining-linux-gpu-ci-pipeline.yml
orttraining-linux-gpu-ortmodule-distributed-test-ci-pipeline.yml
orttraining-linux-nightly-ortmodule-test-pipeline.yml
orttraining-mac-ci-pipeline.yml
orttraining-pai-ci-pipeline.yml
orttraining-py-packaging-pipeline-cpu.yml [Fix] Error Python Packaging Pipeline (Training CPU) (#19992) 2024-03-20 09:02:50 -07:00
orttraining-py-packaging-pipeline-cuda.yml Reuse T4 for Cuda12.2 training packaging pipeline. (#20244) 2024-04-10 09:21:40 +08:00
orttraining-py-packaging-pipeline-cuda12.yml Reuse T4 for Cuda12.2 training packaging pipeline. (#20244) 2024-04-10 09:21:40 +08:00
orttraining-py-packaging-pipeline-rocm.yml
post-merge-jobs.yml Fix training and macos ci pipelines (#20034) 2024-03-26 12:20:11 -07:00
publish-nuget.yml
py-cuda-package-test-pipeline.yml
py-cuda-packaging-pipeline.yml Refactor Python CUDA packaging pipeline to fix random hangs in building (#19989) 2024-03-22 09:16:00 +08:00
py-cuda-publishing-pipeline.yml
py-package-build-pipeline.yml
py-package-test-pipeline.yml Fix training and macos ci pipelines (#20034) 2024-03-26 12:20:11 -07:00
py-packaging-pipeline.yml [QNN EP] Build x64 python wheel for QNN EP (#19499) 2024-02-12 20:54:04 -08:00
qnn-ep-nuget-packaging-pipeline.yml [QNN EP] Update default QNN SDK to 2.19.2.240210 (#19546) 2024-02-16 16:59:43 -08:00
web-ci-pipeline.yml
win-ci-fuzz-testing.yml Fix Fuzz Testing CI (#19228) 2024-01-22 15:44:57 -08:00
win-ci-pipeline.yml Install ONNX by buildling source code in Windows DML stage (#20079) 2024-03-27 12:29:34 -07:00
win-gpu-ci-pipeline.yml Enable CUDA EP unit testing on Windows (#20039) 2024-03-27 13:32:36 -07:00
win-gpu-reduce-op-ci-pipeline.yml
win-gpu-tensorrt-ci-pipeline.yml Fix a build issue: /MP was not enabled correctly (#19190) 2024-01-29 12:45:38 -08:00
win-qnn-arm64-ci-pipeline.yml [QNN EP] Update default QNN SDK to 2.19.2.240210 (#19546) 2024-02-16 16:59:43 -08:00
win-qnn-ci-pipeline.yml [QNN EP] Update default QNN SDK to 2.19.2.240210 (#19546) 2024-02-16 16:59:43 -08:00