onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-11 17:48:34 +00:00

History

Yi Zhang 14d7872ce9 Reuse T4 for Cuda12.2 training packaging pipeline. (#20244 ) ### Description It always has been out of memory in training CUDA 12.2 packaging pipeline https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1308&_a=summary since the PR #19910 I tried other CPU agents for example, D64as_v5(256G memory) and D32as_v4(128G memory and 256 G SSD temp storage), which are still out of memory like the below image ![image](https://github.com/microsoft/onnxruntime/assets/16190118/5acde9ef-674f-4b6d-a1b3-b54647645083) But it works on T4, though T4 only has 4 vCPUs, 28G memory and 180G temp storage, and it takes much more time. ### Motivation and Context Restore CUDA 12.2 training packaging pipeline first. More time is needed to investigate the root cause ### Other Clues. These 2 compilation steps take nearly 6 minutes with Cuda 12.2 on T4 And it runs out of memory on CPU machine. @ajindal1 cuda12.2 on T4 ``` 2024-03-14T05:39:08.7726865Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o 2024-03-14T05:45:01.3223393Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o 2024-03-14T05:46:07.9218003Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim96_fp16_sm80.cu.o 2024-03-14T05:52:59.2387051Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu.o ``` But they could be finished in about one minute with Cuda 11.8 on CPU ``` cuda11.8 on CPU 2024-04-09T11:34:35.0849836Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o 2024-04-09T11:35:53.6648154Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o cuda11.8 on GPU 024-03-13T12:16:33.4102477Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim32_fp16_sm80.cu.o 2024-03-13T12:19:58.8268272Z [ 90%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim64_bf16_sm80.cu.o ```		2024-04-10 09:21:40 +08:00
..
github	Reuse T4 for Cuda12.2 training packaging pipeline. (#20244 )	2024-04-10 09:21:40 +08:00
__init__.py
amd_hipify.py	[ROCm] Add SkipGroupNorm for ROCm EP (#19303 )	2024-02-21 11:08:48 +08:00
build.py	increase cl mpcount since Compilation is moved on CPU machine (#20116 )	2024-03-28 13:30:33 +08:00
clean_docker_image_cache.py	Bump ruff to 0.3.2 and black to 24 (#19878 )	2024-03-13 10:00:32 -07:00
compile_triton.py
coverage.py
gen_def.py
get_docker_image.py	Bump ruff to 0.3.2 and black to 24 (#19878 )	2024-03-13 10:00:32 -07:00
logger.py
op_registration_utils.py	Bump ruff to 0.3.2 and black to 24 (#19878 )	2024-03-13 10:00:32 -07:00
op_registration_validator.py	Bump ruff to 0.3.2 and black to 24 (#19878 )	2024-03-13 10:00:32 -07:00
patch_manylinux.py
policheck_exclusions.xml
reduce_op_kernels.py
replace_urls_in_deps.py
requirements-transformers-test.txt	GQA Rotary and Packed QKV with Flash (#18906 )	2024-01-23 16:34:26 -08:00
set-trigger-rules.py	Add VP test in Stable diffusion pipeline (#19300 )	2024-01-29 09:33:58 -08:00
update_tsaoptions.py
upload_python_package_to_azure_storage.py