mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-05-31 23:27:43 +00:00
### Description It always has been out of memory in training CUDA 12.2 packaging pipeline https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1308&_a=summary since the PR #19910 I tried other CPU agents for example, D64as_v5(256G memory) and D32as_v4(128G memory and 256 G SSD temp storage), which are still out of memory like the below image  But it works on T4, though T4 only has 4 vCPUs, 28G memory and 180G temp storage, and it takes much more time. ### Motivation and Context Restore CUDA 12.2 training packaging pipeline first. More time is needed to investigate the root cause ### Other Clues. These 2 compilation steps take nearly 6 minutes with Cuda 12.2 on T4 And it runs out of memory on CPU machine. @ajindal1 cuda12.2 on T4 ``` 2024-03-14T05:39:08.7726865Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o 2024-03-14T05:45:01.3223393Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o 2024-03-14T05:46:07.9218003Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim96_fp16_sm80.cu.o 2024-03-14T05:52:59.2387051Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu.o ``` But they could be finished in about one minute with Cuda 11.8 on CPU ``` cuda11.8 on CPU 2024-04-09T11:34:35.0849836Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o 2024-04-09T11:35:53.6648154Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o cuda11.8 on GPU 024-03-13T12:16:33.4102477Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o 2024-03-13T12:19:58.8268272Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o ``` |
||
|---|---|---|
| .. | ||
| nodejs/templates | ||
| nuget/templates | ||
| stages | ||
| templates | ||
| triggers | ||
| android-arm64-v8a-QNN-crosscompile-ci-pipeline.yml | ||
| android-x86_64-crosscompile-ci-pipeline.yml | ||
| bigmodels-ci-pipeline.yml | ||
| binary-size-checks-pipeline.yml | ||
| build-perf-test-binaries-pipeline.yml | ||
| c-api-noopenmp-packaging-pipelines.yml | ||
| clean-build-docker-image-cache-pipeline.yml | ||
| cuda-packaging-pipeline.yml | ||
| linux-ci-pipeline.yml | ||
| linux-cpu-aten-pipeline.yml | ||
| linux-cpu-eager-pipeline.yml | ||
| linux-cpu-minimal-build-ci-pipeline.yml | ||
| linux-dnnl-ci-pipeline.yml | ||
| linux-gpu-ci-pipeline.yml | ||
| linux-gpu-tensorrt-ci-pipeline.yml | ||
| linux-gpu-tensorrt-daily-perf-pipeline.yml | ||
| linux-migraphx-ci-pipeline.yml | ||
| linux-multi-gpu-tensorrt-ci-pipeline.yml | ||
| linux-openvino-ci-pipeline.yml | ||
| linux-qnn-ci-pipeline.yml | ||
| mac-ci-pipeline.yml | ||
| mac-coreml-ci-pipeline.yml | ||
| mac-ios-ci-pipeline.yml | ||
| mac-ios-packaging-pipeline.yml | ||
| mac-objc-static-analysis-ci-pipeline.yml | ||
| mac-react-native-ci-pipeline.yml | ||
| npm-packaging-pipeline.yml | ||
| nuget-cuda-publishing-pipeline.yml | ||
| orttraining-linux-ci-pipeline.yml | ||
| orttraining-linux-gpu-ci-pipeline.yml | ||
| orttraining-linux-gpu-ortmodule-distributed-test-ci-pipeline.yml | ||
| orttraining-linux-nightly-ortmodule-test-pipeline.yml | ||
| orttraining-mac-ci-pipeline.yml | ||
| orttraining-pai-ci-pipeline.yml | ||
| orttraining-py-packaging-pipeline-cpu.yml | ||
| orttraining-py-packaging-pipeline-cuda.yml | ||
| orttraining-py-packaging-pipeline-cuda12.yml | ||
| orttraining-py-packaging-pipeline-rocm.yml | ||
| post-merge-jobs.yml | ||
| publish-nuget.yml | ||
| py-cuda-package-test-pipeline.yml | ||
| py-cuda-packaging-pipeline.yml | ||
| py-cuda-publishing-pipeline.yml | ||
| py-package-build-pipeline.yml | ||
| py-package-test-pipeline.yml | ||
| py-packaging-pipeline.yml | ||
| qnn-ep-nuget-packaging-pipeline.yml | ||
| web-ci-pipeline.yml | ||
| win-ci-fuzz-testing.yml | ||
| win-ci-pipeline.yml | ||
| win-gpu-ci-pipeline.yml | ||
| win-gpu-reduce-op-ci-pipeline.yml | ||
| win-gpu-tensorrt-ci-pipeline.yml | ||
| win-qnn-arm64-ci-pipeline.yml | ||
| win-qnn-ci-pipeline.yml | ||