onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-11 17:48:34 +00:00

Tianlei Wu a46e49b439 Unblock migraphx and linux GPU training ci pipelines (#21662)	2024-08-08 19:44:15 -07:00

### Description

* Fix migraphx build error caused by https://github.com/microsoft/onnxruntime/pull/21598: Add a conditional compile on the code block that depends on ROCm >= 6.2. Note that the pipeline uses ROCm 6.0.

Unblock the orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed and orttraining-amd-gpu-ci-pipeline pipelines:

* Disable a model test in linux GPU training ci pipelines that fails because of https://github.com/microsoft/onnxruntime/pull/19470: Sometimes the cudnn frontend throws an exception that the cudnn graph does not support a Conv node of the keras_lotus_resnet3D model on V100 GPU. Note that the same test does not throw an exception in other GPU pipelines. The failure might be related to cudnn 8.9 and the V100 GPU used in the pipeline (Ampere GPUs and cuDNN 9.x do not have the issue). The actual fix requires fallback logic, which will take time to implement, so we temporarily disable the test in training pipelines.
* Force install torch for cuda 11.8. (The docker image has torch 2.4.0 for cuda 12.1 to build the torch extension, which is not compatible with cuda 11.8.) Note that this is a temporary workaround. A more elegant fix is to make sure the right torch version is installed in the docker build step, which might need updates to install_python_deps.sh and the corresponding requirements.txt.
* Skip test_gradient_correctness_conv1d since it causes a segmentation fault. The root cause needs more investigation (maybe due to the cudnn frontend as well).
* Skip test_aten_attention since it causes an assert failure. The root cause needs more investigation (maybe due to the torch version).
* Skip orttraining_ortmodule_distributed_tests.py since it fails with an error that the compiler for the torch extension does not support c++17. One possible fix is to set the following compile argument inside setup.py of the extension fused_adam: extra_compile_args['cxx'] = ['-std=c++17']. However, due to the urgency of unblocking the pipelines, just disable the test for now.
* Skip test_softmax_bf16_large. For some reason, torch.cuda.is_bf16_supported() returns True on V100 with torch 2.3.1, so the test was run in CI, but V100 does not support bf16 natively.
* Fix typo of deterministic
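
The bf16 item above could be guarded with an explicit capability check instead of relying on `torch.cuda.is_bf16_supported()`. The following is a minimal sketch, not the change made in this commit: the helper name `has_native_bf16` is hypothetical, and it assumes native bf16 arithmetic starts at CUDA compute capability 8.0 (Ampere), while V100 (Volta) reports 7.0. In a real test the tuple would come from `torch.cuda.get_device_capability()`.

```python
def has_native_bf16(capability):
    """Return True if a (major, minor) CUDA compute capability has native bf16.

    Assumption: native bf16 arithmetic is available from compute
    capability 8.0 (Ampere) onward. This helper is hypothetical and is
    not part of onnxruntime or torch.
    """
    major, _minor = capability
    return major >= 8


# Illustrative capability tuples (hard-coded here so the sketch runs without a GPU).
V100_CAPABILITY = (7, 0)  # Volta
A100_CAPABILITY = (8, 0)  # Ampere

if __name__ == "__main__":
    for name, cap in (("V100", V100_CAPABILITY), ("A100", A100_CAPABILITY)):
        print(name, "native bf16:", has_native_bf16(cap))
```

A test such as test_softmax_bf16_large could then be skipped when the check returns False, which avoids running it on hardware that only emulates bf16.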
github	Unblock migraphx and linux GPU training ci pipelines (#21662)	2024-08-08 19:44:15 -07:00
requirements	Replace inline pip install with pip install from requirements*.txt (#21106)	2024-07-22 12:39:10 -07:00
__init__.py
amd_hipify.py
build.py	Update ruff and clang-format versions (#21479)	2024-07-24 11:50:11 -07:00
clean_docker_image_cache.py	Bump ruff to 0.3.2 and black to 24 (#19878)	2024-03-13 10:00:32 -07:00
compile_triton.py
coverage.py
gen_def.py	Update ruff and clang-format versions (#21479)	2024-07-24 11:50:11 -07:00
get_docker_image.py	Fix docker image layer caching to avoid redundant docker building and transient connection exceptions. (#21612)	2024-08-06 21:37:09 +08:00
logger.py
op_registration_utils.py	Bump ruff to 0.3.2 and black to 24 (#19878)	2024-03-13 10:00:32 -07:00
op_registration_validator.py	Bump ruff to 0.3.2 and black to 24 (#19878)	2024-03-13 10:00:32 -07:00
patch_manylinux.py
policheck_exclusions.xml
reduce_op_kernels.py	Update ruff and clang-format versions (#21479)	2024-07-24 11:50:11 -07:00
replace_urls_in_deps.py	Update ruff and clang-format versions (#21479)	2024-07-24 11:50:11 -07:00
set-trigger-rules.py	Separating all GPU stages into different Pipelines (#21521)	2024-07-26 14:54:45 -07:00
update_tsaoptions.py
upload_python_package_to_azure_storage.py	Update ruff and clang-format versions (#21479)	2024-07-24 11:50:11 -07:00