pytorch/.github/workflows
Dmitry Nikolaev d4871750d9 [ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673)
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips or fixes several tests that failed on MI300, as observed in https://github.com/pytorch/pytorch/pull/140989

Skipped due to the unsupported Float8_e4m3fn data type on MI300 (the test code still needs to be updated to use data types supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2
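
An architecture-based skip like the ones listed above could look roughly like the sketch below. It is not the code from this PR: `skip_if_arch`, `current_gpu_arch`, the `GPU_ARCH` environment variable, and the `LoopOrderingSketch` class are hypothetical names used only for illustration, and `gfx942` is assumed to be the architecture string reported by MI300-class GPUs.

```python
import functools
import os
import unittest

# Assumption: MI300-class GPUs report the "gfx942" architecture string.
MI300_ARCHS = ("gfx942",)


def current_gpu_arch():
    # Hypothetical lookup: a real test suite would query the GPU runtime.
    # Here the value comes from an environment variable so the sketch
    # runs on any machine.
    return os.environ.get("GPU_ARCH", "")


def skip_if_arch(archs, reason):
    """Skip the decorated test at call time if the GPU arch is in `archs`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if current_gpu_arch() in archs:
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


class LoopOrderingSketch(unittest.TestCase):
    @skip_if_arch(MI300_ARCHS, "Float8_e4m3fn is not supported on MI300")
    def test_fp8_cast_and_t(self):
        # Placeholder body standing in for the real FP8 test.
        self.assertTrue(True)
```

The architecture check happens when the test runs rather than at import time, so the same test module can be collected on any runner and the skip only takes effect on the matching hardware.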

Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1

Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda

Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)

Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda

Features:
- inductor/test_fp8.py: declares a new function that converts FP8 datatypes to their ROCm-supported FP8 equivalents. This keeps test names identical on CUDA and ROCm and makes it possible to enable the Inductor FP8 tests on CPU
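
The idea behind such a conversion function can be sketched as follows. This is an illustration under assumptions, not the helper from the PR: the name `to_rocm_fp8_dtype` is hypothetical, the sketch works on dtype names as strings so that it runs without PyTorch installed, and the mapping assumes that MI300 uses the `fnuz` FP8 variants (`float8_e4m3fnuz`, `float8_e5m2fnuz`) in place of `float8_e4m3fn` and `float8_e5m2`.

```python
# Assumed mapping from the FP8 dtype names used in CUDA tests to the
# "fnuz" variants assumed to be supported on MI300.
_ROCM_FP8_EQUIVALENTS = {
    "float8_e4m3fn": "float8_e4m3fnuz",
    "float8_e5m2": "float8_e5m2fnuz",
}


def to_rocm_fp8_dtype(dtype_name: str, is_rocm: bool) -> str:
    """Return the dtype name a test should actually use on this platform.

    Hypothetical helper: on non-ROCm platforms the name is returned
    unchanged; on ROCm, known FP8 names are swapped for their assumed
    equivalents and every other dtype passes through untouched.
    """
    if not is_rocm:
        return dtype_name
    return _ROCM_FP8_EQUIVALENTS.get(dtype_name, dtype_name)
```

Because the test is still parametrized with the original dtype and the swap happens inside the test body, the generated test name does not change between CUDA and ROCm runs.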

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony

Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-09 05:18:57 +00:00
_bazel-build-test.yml Don't pass credentials explicitly to sccache (#140611) 2024-11-14 04:44:55 +00:00
_binary-build-linux.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
_binary-test-linux.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
_binary-upload.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
_docs.yml
_linux-build.yml Use the build environment as sccache prefix instead of workflow name (#144112) 2025-01-03 17:33:03 +00:00
_linux-test.yml Use the build environment as sccache prefix instead of workflow name (#144112) 2025-01-03 17:33:03 +00:00
_mac-build.yml Update to upload-artifacts and download-artifacts to v4 (#139808) 2024-11-06 05:57:41 +00:00
_mac-test-mps.yml [BE] Get rid of malfet/checkout@silent-checkout (#143516) 2024-12-19 00:36:36 +00:00
_mac-test.yml [Monitor] Enable non-perf linux test monitor (#142168) 2024-12-11 01:10:43 +00:00
_rocm-test.yml Use sccache 0.9.0 on ROCm build job (#144125) 2025-01-04 08:56:48 +00:00
_runner-determinator.yml Fix unused Python variables outside torch/ and test/ (#136359) 2024-12-11 17:10:23 +00:00
_win-build.yml Upload sccache stats into benchmark database with build step time (#140839) 2024-11-21 22:38:45 +00:00
_win-test.yml [Utilization Monitor] input to disable utilization monitor (#140857) 2024-11-18 23:26:03 +00:00
_xpu-test.yml [Utilization Monitor] input to disable utilization monitor (#140857) 2024-11-18 23:26:03 +00:00
assigntome-docathon.yml
auto_request_review.yml
build-almalinux-images.yml Refactor conda-builder -> almalinux-builder (#140157) 2024-11-09 16:06:40 +00:00
build-libtorch-images.yml [ROCm] upgrade nightly wheels to rocm6.3 - 1 of 2 (docker images) (#142151) 2024-12-13 16:21:17 +00:00
build-magma-linux.yml Build magma tarball for cuda 126 (#140143) 2024-11-12 23:42:26 +00:00
build-magma-windows.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
build-manywheel-images-s390x.yml S390x update builder image (#132983) 2024-11-11 16:14:06 +00:00
build-manywheel-images.yml [ROCm] upgrade nightly wheels to rocm6.3 - 1 of 2 (docker images) (#142151) 2024-12-13 16:21:17 +00:00
build-triton-wheel.yml [ROCm] upgrade nightly wheels to rocm6.3 - 2 of 2 (binaries) (#143613) 2024-12-23 19:47:30 +00:00
check-labels.yml
check_mergeability_ghstack.yml Remove most rockset references (#139922) 2024-11-12 21:17:43 +00:00
cherry-pick.yml Remove most rockset references (#139922) 2024-11-12 21:17:43 +00:00
close-nonexistent-disable-issues.yml
create_release.yml [BE] Get rid of malfet/checkout@silent-checkout (#143516) 2024-12-19 00:36:36 +00:00
delete_old_branches.yml
docathon-sync-label.yml
docker-builds.yml Update inductor jobs to use CUDA 12.4 (#142177) 2024-12-09 16:18:38 +00:00
docker-release.yml Use validate-docker-images workflow from test-infra (#143081) 2024-12-12 00:24:27 +00:00
generated-linux-aarch64-binary-manywheel-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-linux-binary-libtorch-cxx11-abi-main.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-linux-binary-libtorch-cxx11-abi-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-linux-binary-libtorch-pre-cxx11-main.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-linux-binary-libtorch-pre-cxx11-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-linux-binary-manywheel-main.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-linux-binary-manywheel-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-linux-s390x-binary-manywheel-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-macos-arm64-binary-libtorch-cxx11-abi-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-macos-arm64-binary-wheel-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-windows-binary-libtorch-debug-main.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-windows-binary-libtorch-debug-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-windows-binary-libtorch-release-main.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-windows-binary-libtorch-release-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
generated-windows-binary-wheel-nightly.yml Remove builder repo from workflows and scripts (#143776) 2024-12-24 14:11:51 +00:00
inductor-micro-benchmark-x86.yml Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995) 2024-12-05 14:52:43 +00:00
inductor-micro-benchmark.yml Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118) 2025-01-03 17:45:41 +00:00
inductor-perf-compare.yml Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118) 2025-01-03 17:45:41 +00:00
inductor-perf-test-nightly-aarch64.yml [Monitor] Enable non-perf linux test monitor (#142168) 2024-12-11 01:10:43 +00:00
inductor-perf-test-nightly-macos.yml [Monitor] Enable non-perf linux test monitor (#142168) 2024-12-11 01:10:43 +00:00
inductor-perf-test-nightly-x86.yml [Monitor] Enable non-perf linux test monitor (#142168) 2024-12-11 01:10:43 +00:00
inductor-perf-test-nightly.yml Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118) 2025-01-03 17:45:41 +00:00
inductor-periodic.yml Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118) 2025-01-03 17:45:41 +00:00
inductor-rocm.yml Run inductor-rocm workflow on ciflow/inductor (#143205) 2024-12-17 20:09:48 +00:00
inductor-unittest.yml ir.ExternKernel: correctly handle kwarg default arguments (#141371) 2025-01-03 16:05:31 +00:00
inductor.yml Update inductor jobs to use CUDA 12.4 (#142177) 2024-12-09 16:18:38 +00:00
lint-autoformat.yml
lint-bc.yml Add ciflow/inductor-cu126 label (#141377) 2024-11-22 23:14:24 +00:00
lint.yml Fix missing tests on test tool lint job (#143052) 2024-12-12 20:29:32 +00:00
linux-aarch64.yml [CI] Run aarch64 tests on Graviton3 (#143129) 2024-12-13 07:39:22 +00:00
llm_td_retrieval.yml
mac-mps.yml Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995) 2024-12-05 14:52:43 +00:00
nightly-s3-uploads.yml Some workflows to use oidc instead of AWS keys (#142264) 2024-12-10 19:40:23 +00:00
nightly.yml Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995) 2024-12-05 14:52:43 +00:00
nitpicker.yml
periodic.yml Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118) 2025-01-03 17:45:41 +00:00
pull.yml [ROCm] Use linux.rocm.gpu.2 for 2-GPU and linux.rocm.gpu.4 for 4-GPU runners (#143769) 2024-12-24 08:04:00 +00:00
revert.yml
rocm.yml [ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673) 2025-01-09 05:18:57 +00:00
runner-determinator-validator.yml
runner_determinator_script_sync.yaml
s390.yml Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995) 2024-12-05 14:52:43 +00:00
scorecards.yml Update to upload-artifacts and download-artifacts to v4 (#139808) 2024-11-06 05:57:41 +00:00
slow.yml [ROCm] Use linux.rocm.gpu.2 for 2-GPU and linux.rocm.gpu.4 for 4-GPU runners (#143769) 2024-12-24 08:04:00 +00:00
stale.yml
target-determination-indexer.yml
target_determination.yml Update to upload-artifacts and download-artifacts to v4 (#139808) 2024-11-06 05:57:41 +00:00
test-check-binary.yml Add check_binary workflow to pytorch/pytorch (#143201) 2024-12-13 19:30:10 +00:00
torchbench.yml Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995) 2024-12-05 14:52:43 +00:00
trunk.yml [ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673) 2025-01-09 05:18:57 +00:00
trymerge.yml
tryrebase.yml
unstable-periodic.yml
unstable.yml Update ET pin for #6744 (#140199) 2024-11-11 21:40:12 +00:00
update-viablestrict.yml
update_pytorch_labels.yml
upload-test-stats-while-running.yml Continuous job for pulling artifacts and doing upload (#140453) 2024-11-20 20:41:52 +00:00
upload-test-stats.yml Remove most rockset references (#139922) 2024-11-12 21:17:43 +00:00
upload-torch-dynamo-perf-stats.yml Add macos perf run to the dashboard upload (#141999) 2024-12-04 01:08:13 +00:00
upload_test_stats_intermediate.yml Some workflows to use oidc instead of AWS keys (#142264) 2024-12-10 19:40:23 +00:00
weekly.yml
xpu.yml Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995) 2024-12-05 14:52:43 +00:00