* Will enable us to target `periodic`/distributed CI jobs to 4-GPU runners using a different label `linux.rocm.gpu.4`
* Use 2-GPU runners for `trunk`, `pull` and `slow` as well (in addition to `inductor-rocm`). This currently changes nothing, since all our MI2xx runners have both the `linux.rocm.gpu` and `linux.rocm.gpu.2` labels, but that will change in the future (see next point)
* Continue to use the `linux.rocm.gpu` label for any job that doesn't need more than one GPU, e.g. binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml`
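The label scheme above can be read as "pick the smallest runner class that has enough GPUs". A minimal sketch of that rule follows; the helper `rocm_runner_label` and the `ROCM_LABELS` table are hypothetical names used for illustration only and are not part of the workflows:

```python
# Hypothetical mapping from GPUs per runner to the labels named in this PR.
ROCM_LABELS = {
    1: "linux.rocm.gpu",
    2: "linux.rocm.gpu.2",
    4: "linux.rocm.gpu.4",
}


def rocm_runner_label(gpus_needed: int) -> str:
    """Return the label of the smallest runner class with enough GPUs."""
    if gpus_needed < 1:
        raise ValueError("a job needs at least one GPU")
    for size in sorted(ROCM_LABELS):
        if gpus_needed <= size:
            return ROCM_LABELS[size]
    raise ValueError(f"no ROCm runner label offers {gpus_needed} GPUs")
```

Under this sketch a distributed job that needs 4 GPUs resolves to `linux.rocm.gpu.4`, while a binary test job that needs one GPU stays on `linux.rocm.gpu`.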
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769
Approved by: https://github.com/jeffdaily
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376
To enable more balanced sharding, had to push 148ae19935
Added `@xfailIfSM89` to the following tests:
- test_fp8_pattern_2
- test_original_aten_preserved_split_addmm
- test_sparse_semi_structured_scaled_mm
- test_sparse_semi_structured_scaled_mm_fp8
- test_sparse_fp8fp8_mm
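For readers unfamiliar with this kind of marker, here is a minimal sketch of how a capability-conditional expected-failure decorator can be built on top of `unittest.expectedFailure`. It is not the actual `@xfailIfSM89` implementation: `current_capability` and `xfail_if_capability` are hypothetical names, and the device query is hard-coded so the sketch runs without a GPU.

```python
import unittest


def current_capability():
    # Hypothetical stand-in for a real device query; hard-coded to (8, 9),
    # i.e. SM89, the compute capability of the NVIDIA L4.
    return (8, 9)


def xfail_if_capability(capability):
    """Mark a test as an expected failure only on the given compute capability."""
    def decorator(test_fn):
        if current_capability() == capability:
            return unittest.expectedFailure(test_fn)
        return test_fn
    return decorator


class ExampleTest(unittest.TestCase):
    @xfail_if_capability((8, 9))
    def test_known_bad_on_sm89(self):
        # Placeholder for a check that is known to fail on SM89.
        self.assertEqual(1, 2)

    def test_unaffected(self):
        self.assertEqual(1, 1)
```

An expected failure is still executed, so the suite reports an unexpected success once the underlying problem is fixed, which is the usual reason to prefer it over a plain skip.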
Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`
Skipped the following inductor tests (which either flakily OOM or time out):
- test_reduction_fn_std_float64
- test_reduction_fn_var_mean_float64
- test_multi_output_unbacked_custom_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
Fixes #107302
This is a clone and fix for #139199.
This PR is a small step toward overall NumPy 2 support.
It adds a new CI job for testing with NumPy 2 with one test file only.
More tests to be fixed and added later in follow-up pull requests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140586
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Minor: adds a linter that ensures that all jobs that run on pull_request, schedule, push, etc. either have an `if: github.repository_owner == 'pytorch'` check or depend on a job that has that check
There is also a setting in GitHub repos that can disable all workflows for that repo
A lot of these checks are unnecessary because many jobs use reusable workflows that already have the check. However, this is a one-time change, so I'm not that bothered
Unfortunately I can't put the check at the workflow level, which would make this better
Lots of weird string parsing
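The rule the linter enforces can be sketched as follows. This is not the linter added in this PR (which, as noted, relies on string parsing); it is a simplified illustration that assumes the workflow has already been parsed into a plain dict, and `jobs_missing_owner_check` is a hypothetical name. It also assumes that depending on any one guarded job is enough.

```python
REPO_CHECK = "github.repository_owner == 'pytorch'"


def jobs_missing_owner_check(workflow: dict) -> list[str]:
    """Return the jobs that neither carry the owner check nor need a job that does."""
    jobs = workflow.get("jobs", {})

    def is_guarded(name, seen):
        # Unknown jobs and dependency cycles are treated as unguarded.
        if name in seen or name not in jobs:
            return False
        job = jobs[name]
        if REPO_CHECK in str(job.get("if", "")):
            return True
        needs = job.get("needs", [])
        if isinstance(needs, str):  # `needs` may be a single job name or a list
            needs = [needs]
        return any(is_guarded(dep, seen | {name}) for dep in needs)

    return sorted(name for name in jobs if not is_guarded(name, frozenset()))
```

For example, a workflow where `build` has the check, `test` needs `build`, and `upload` has neither would report only `upload`.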
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138849
Approved by: https://github.com/malfet
Updates all references to runner determinator workflow (`_runner-determinator.yml`) from current cloned version to main version.
This enables the team to push updates to this workflow, such as bug fixes or improvements, and have them immediately reflected on all open PRs. This avoids potentially breaking situations, lets us move fast, and makes recovery from bugs quick and simple.
From:
```
jobs:
  get-label-type:
    uses: ./.github/workflows/_runner-determinator.yml
```
To:
```
jobs:
  get-label-type:
    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137791
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/zxiiro
The test runs all of its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it breaks it up, so the individual tests can finish and no longer need to be marked as slow.
Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` error. Parametrizing may fix this issue as well (pending CI testing)
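As an illustration of the idea, the sketch below turns a single loop over 512 combinations into 512 independently named test methods using only the standard library. The nine boolean flags and the `check` function are hypothetical placeholders; the actual test and the parametrization helper it uses are not shown in this description.

```python
import itertools
import unittest

# Hypothetical parameters: nine boolean flags give 2**9 == 512 combinations.
FLAGS = [f"flag{i}" for i in range(9)]


def check(config):
    # Placeholder for the real per-combination assertion.
    return isinstance(config, dict) and len(config) == len(FLAGS)


class CombinationTest(unittest.TestCase):
    pass


def _make_test(values):
    def test(self):
        self.assertTrue(check(dict(zip(FLAGS, values))))
    return test


# Register one test method per combination, so each can run, fail,
# or time out on its own instead of inside one long sequential loop.
for values in itertools.product([False, True], repeat=len(FLAGS)):
    suffix = "".join("1" if v else "0" for v in values)
    setattr(CombinationTest, f"test_combo_{suffix}", _make_test(values))
```

Because each combination is its own test, a test runner can also distribute them across shards rather than assigning the whole loop to one worker.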
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
Changes jobs to go back to using the default AMI.
Note: This is only a cleanup PR. It does NOT introduce any behavior changes in CI
Now that the default variant uses the Amazon 2023 AMI and has been shown to be stable for a week, it's time to remove the explicit amz2023 references and go back to using the default variant.
After a week or two, when this is rolled out to most people, we can remove the variants from scale config as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134355
Approved by: https://github.com/jeanschmidt
These tests keep failing on the Linux Amazon 2023 AMI. The distributed team is looking into them; until then, this PR disables the tests in order to unblock the AMI upgrade.
Examples of the failures:
Failure 1: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175
```
FAILED [90.0880s] distributed/test_c10d_nccl.py::NCCLTraceTestDumpOnTimeout::test_timeout_dumps_timing_enabled_False - AssertionError: None mismatch: None is not -6
```
Failure 2: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963494
```
____ NCCLTraceTestTimeoutDumpOnStuckRanks.test_timeout_dumps_on_stuck_ranks ____
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/distributed/test_c10d_nccl.py", line 4214, in test_timeout_dumps_on_stuck_ranks
    self.assertEqual(self._wait_process(0, timeout=90), -6)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3721, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: None mismatch: None is not -6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133355
Approved by: https://github.com/kit1980, https://github.com/wconstab
This PR only adds running the benchmarks on the PR itself and printing the results. Follow-up diffs will add checking out head~1, running the benchmarks there, and comparing the two.
To access the results, go to the pr_time_benchmarks test and inspect the logs.
You should see:
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```
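The result line above appears to have the shape `benchmark,metric,value`. A minimal sketch of how such output could be parsed and compared against a baseline run follows; the comparison step is not part of this PR, and `parse_results` and `relative_change` are hypothetical helper names.

```python
import csv
import io


def parse_results(text):
    """Parse `benchmark,metric,value` lines into {(benchmark, metric): int}."""
    results = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        name, metric, value = row
        results[(name, metric)] = int(value)
    return results


def relative_change(before, after):
    """Relative change per key present in both runs (0.1 means +10%)."""
    shared = before.keys() & after.keys()
    return {key: (after[key] - before[key]) / before[key] for key in shared}
```

Parsing the line shown above yields one entry keyed by `("update_hint_regression", "instruction_count")` with the value `27971461254`.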
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
This is already represented in `trunk.yml`, so it seems redundant to include this level of testing in `pull.yml`.
I've been observing a large spike in our usage of `g3.4xlarge`, which seems to correspond to these builds in particular, so this removes them from `pull.yml` since they are already covered in `trunk.yml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132537
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
Migrate all pull jobs to the new Amazon 2023 AMI runner type.
Exceptions:
- Distributed tests are still on the old AMI since they had some weird [test failures](https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175). Will debug those separately.
- Ported over a couple of trunk and slow jobs that share `sync-tag`s with the pull jobs and so need to be on the same AMI
Revert plan, in case something starts breaking when we run these new AMIs at a larger scale:
- If specific jobs start failing consistently, we bring those jobs back to the old AMI
- If the failure is more widespread, revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131250
Approved by: https://github.com/malfet, https://github.com/atalman