1. Tags Docker images using docker pull/tag/push for the current release (a sketch of the tagging loop follows this list)
2. Sets the RELEASE_VERSION_TAG var and regenerates the workflows using the new Docker tag
3. Removes the conda token setting and the binary-test release changes, as these are already automated
4. Pins unstable and disabled jobs (automated in https://github.com/pytorch/pytorch/pull/111675)
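A minimal sketch of what step 1 amounts to, assuming a `RELEASE_VERSION` environment variable like the one in the test run below; the image list and function name are illustrative, not the actual script contents:
```python
import os
import subprocess

# Illustrative subset of the builder images; the real list is longer.
IMAGES = [
    "pytorch/manylinux-builder:cuda11.8",
    "pytorch/libtorch-cxx11-builder:cuda11.8",
    "pytorch/conda-builder:cpu",
]

def tag_release_images(release_version: str, dry_run: bool = True) -> None:
    for image in IMAGES:
        source = f"{image}-main"
        target = f"{image}-{release_version}"
        print(f"Tagging {source} to {target} , dry_run: {'enabled' if dry_run else 'disabled'}")
        if dry_run:
            continue
        # Pull the current main image, retag it for the release, and push.
        subprocess.check_call(["docker", "pull", source])
        subprocess.check_call(["docker", "tag", source, target])
        subprocess.check_call(["docker", "push", target])

if __name__ == "__main__":
    tag_release_images(os.environ.get("RELEASE_VERSION", "2.2"))
```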
Test:
```
RELEASE_VERSION=2.2 ./scripts/release/apply-release-changes.sh
Tagging pytorch/manylinux-builder:cuda11.8-main to pytorch/manylinux-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:cuda12.1-main to pytorch/manylinux-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cuda11.8-main to pytorch/libtorch-cxx11-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cuda12.1-main to pytorch/libtorch-cxx11-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:rocm5.6-main to pytorch/manylinux-builder:rocm5.6-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:rocm5.7-main to pytorch/manylinux-builder:rocm5.7-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:rocm5.6-main to pytorch/libtorch-cxx11-builder:rocm5.6-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:rocm5.7-main to pytorch/libtorch-cxx11-builder:rocm5.7-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:cpu-main to pytorch/manylinux-builder:cpu-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cpu-main to pytorch/libtorch-cxx11-builder:cpu-2.2 , dry_run: enabled
Tagging pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main to pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-2.2 , dry_run: enabled
Tagging pytorch/manylinuxaarch64-builder:cpu-aarch64-main to pytorch/manylinuxaarch64-builder:cpu-aarch64-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cuda11.8-main to pytorch/conda-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cuda12.1-main to pytorch/conda-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cpu-main to pytorch/conda-builder:cpu-2.2 , dry_run: enabled
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-manywheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-manywheel-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-release-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-debug-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml
```
Result of pinning unstable and disabled jobs:
```
# The link to the published list of disabled jobs
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json?versionid=kKJlAXdrUbk3CilXbKu.6OwNTGQB8a.B"
# and unstable jobs
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json?versionid=vzaicOxSsh55iXBXwgGrW6dFeVtPfrhr"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114355
Approved by: https://github.com/malfet
Move the pytest cache downloading into the build step and store it in additional CI files so that it stays consistent during sharding.
Only the build environment is taken into account now, instead of the test config as well, since we might not have the test config at build time. This makes the cache key less specific, but that may actually be better, since tests are likely to fail across the same test config. (It might even be worth not looking at the build environment at all, but that's a different topic.)
Each cache upload should only include information from the current run. Do not merge the current cache with the downloaded cache during upload (this shouldn't matter anyway, since the downloaded cache won't exist at that time).
From what I can tell of the S3 retention policy, pytest cache files will be deleted after 30 days (cc @ZainRizvi to confirm), so we never have to worry about space or pulling old versions.
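A minimal sketch of the key change, with a hypothetical helper name (not the actual function in the repo):
```python
# Hypothetical sketch: key the pytest cache on the build environment only,
# not on (build environment, test config), since the test config may be
# unknown at build time.
def pytest_cache_key(pr_identifier: str, build_environment: str) -> str:
    # Before: f"pytest_cache/{pr_identifier}/{build_environment}/{test_config}"
    return f"pytest_cache/{pr_identifier}/{build_environment}"
```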
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113804
Approved by: https://github.com/ZainRizvi
Abort merges invoked with `-r` if there is nothing to rebase
Make `rebase_onto`/`rebase_ghstack_onto` return False if the rebase is a no-op, and abort the merge in that case.
Remove the `-e` option from both the trymerge and tryrebase workflows, as one should never report failures on workflow dispatch.
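A minimal sketch of the no-op detection, assuming hypothetical helper names (the real functions live in `.github/scripts/tryrebase.py`):
```python
import subprocess

def rev(repo_dir: str, ref: str) -> str:
    # Resolve a ref to its commit SHA.
    return subprocess.check_output(
        ["git", "-C", repo_dir, "rev-parse", "--verify", ref], text=True
    ).strip()

def rebase_onto(repo_dir: str, onto_branch: str, pr_ref: str) -> bool:
    """Rebase pr_ref onto onto_branch; return False if nothing changed."""
    before = rev(repo_dir, pr_ref)
    subprocess.check_call(["git", "-C", repo_dir, "rebase", onto_branch, pr_ref])
    after = rev(repo_dir, pr_ref)
    # If the head commit did not move, there was nothing to rebase, and the
    # caller should abort instead of pushing an identical branch.
    return before != after
```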
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113960
Approved by: https://github.com/clee2000
The bot that creates the issue was changed, but the search was not, so it wasn't finding old PRs and was just making new ones.
This PR makes it reuse PRs again instead of making a new one every time (a sketch of the fixed search is below).
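A minimal sketch of the idea, assuming the GitHub search API and hypothetical names; the key point is that the search must filter on the bot account that currently authors the PRs:
```python
import requests

def find_existing_pr(repo: str, title: str, bot_login: str, token: str):
    # If the bot login changes, the author: qualifier must change with it;
    # otherwise old PRs are never found and duplicates pile up.
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f'repo:{repo} is:pr is:open author:{bot_login} in:title "{title}"'},
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return items[0]["number"] if items else None
```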
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113961
Approved by: https://github.com/huydhn
In https://github.com/pytorch/pytorch/pull/110362, the failure was flaky, but merge bot treated it as an actual failure. This is a regression after https://github.com/pytorch/test-infra/pull/4604, where the name returned by Dr.CI now includes the workflow name. For example, the name is `trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)` in the JSON response:
```
{"FAILED": [], "FLAKY": [{"workflowId": 6372581477, "id": 17297638807, "name": "trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)", "jobName": "macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)", "conclusion": "failure", "completed_at": "2023-10-01T22:18:28Z", "html_url": "https://github.com/pytorch/pytorch/actions/runs/6372581477/job/17297638807", "head_branch": "ciflow/trunk/110362", "pr_number": 110362, "head_sha": "03f51e36dedf234931006d1db61677b229c9a119", "failure_captures": ["Failure: There is only 4671284KB free space left in /, which is less than the minimum requirement of"], "failure_line": "Failure: There is only 4671284KB free space left in /, which is less than the minimum requirement of 6291456KB for macOS", "time": "2023-10-01T22:17:53.847751Z"}], "BROKEN_TRUNK": [], "UNSTABLE": []}
```
I update merge bot to handle this better by considering the workflow name, the job name, and the combined full name (see the sketch below).
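A minimal sketch of the matching logic, with hypothetical function and parameter names:
```python
def job_matches(classified_name: str, workflow_name: str, job_name: str) -> bool:
    # Dr.CI may return the bare job name or the "workflow / job" full name,
    # so compare the classified name against both, plus the combination.
    candidates = {workflow_name, job_name, f"{workflow_name} / {job_name}"}
    return classified_name in candidates
```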
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110661
Approved by: https://github.com/clee2000
One should never edit `gql_mocks.json` by hand, as otherwise it no longer validates mergebot behavior against actual GitHub data, but rather against a snapshot of that data frozen in time.
Unfortunately, GitHub started to delete checkrun statuses on older PRs, so some tests need to be updated.
For example, https://github.com/pytorch/pytorch/pull/77700/checks, committed on May 19th 2022, has no checks at the time of writing (Sep 28th 2023).
Deleted `test_checksuites_pagination`, as its checks are gone and it tests the same functionality as `test_get_checkruns_many_runs`, which was updated to use a more recent PR.
Deleted `test_get_classifications_pending_unstable`, because what it wants to test is inherently unreliable and must be rewritten using a different mechanism.
Disabled `test_internal_changes`, as the mechanism is broken at the moment; see https://github.com/pytorch/pytorch/issues/110218
Updated `test_pr_dependencies_ghstack` and `test_pr_dependencies` to generate `msg` using `pr.get_body()` rather than hardcoding the text (which was updated after the test was committed).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110221
Approved by: https://github.com/clee2000, https://github.com/huydhn
After https://github.com/pytorch/test-infra/pull/4589, we can now query Dr.CI to get the list of flaky failures there. This change queries the Dr.CI API endpoint and checks whether a failure is flaky using the `is_flaky` function (a rough sketch follows the list below).
Because the change is relatively large, I'm breaking it down into several smaller PRs in this order:
* [x] This PR queries Dr.CI and adds `is_flaky` check
* [ ] Clean up the flaky rules logic because it has already been implemented on Dr. CI
* [ ] Clean up the broken trunk logic for the same reason
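A rough sketch of the query-and-check flow in the first PR; the endpoint URL, request shape, and response handling below are assumptions for illustration, not the documented API:
```python
import os
from typing import Any, Dict, List

import requests

# Assumed endpoint and request shape, for illustration only.
DRCI_URL = "https://www.torchci.com/api/drci/drci"

def fetch_drci_classifications(pr_num: int) -> Dict[str, List[Dict[str, Any]]]:
    resp = requests.post(
        DRCI_URL,
        data={"prNumber": pr_num},
        headers={"Authorization": os.environ["DRCI_BOT_KEY"]},
    )
    resp.raise_for_status()
    return resp.json()

def is_flaky(job_id: int, classifications: Dict[str, List[Dict[str, Any]]]) -> bool:
    # A failure counts as flaky if Dr.CI listed the job under "FLAKY".
    return any(job.get("id") == job_id for job in classifications.get("FLAKY", []))
```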
### Testing
* Create a new `drci_mocks.json` file to capture the JSON response from the Dr.CI API endpoint. The API requires `DRCI_BOT_KEY`.
* `pytest -v test_trymerge.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110054
Approved by: https://github.com/clee2000
Enables two ruff rules derived from pylint:
* PLR1722 replaces any exit() calls with sys.exit(). The exit() builtin is only meant for REPL use and may not always be available, so this rule always uses the version from the sys module, which is more reliable.
* PLW3301 replaces nested min/max calls with a single flattened call (i.e. `min(a, min(b, c))` => `min(a, b, c)`), which is more idiomatic and more efficient (examples of both rules below).
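For illustration, minimal before/after examples of both rules (the snippets are my own, not from the PR):
```python
import sys

def fail(msg: str) -> None:
    print(msg, file=sys.stderr)
    # PLR1722: before this rule, code might call exit(1); the builtin exit()
    # is meant for the REPL, so sys.exit() is used instead.
    sys.exit(1)

# PLW3301: nested min/max calls flattened into a single call.
a, b, c = 3, 1, 2
lowest = min(a, b, c)  # before: min(a, min(b, c))
```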
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109461
Approved by: https://github.com/ezyang
Update to the ROCm triton pinned commit for the 2.1 branch cutoff.
As part of this, we are updating `build_triton_wheel.py` and `build-triton-wheel.yml` to support building ROCm triton wheels through pytorch/manylinux-rocm. This avoids slowly downloading RPM libraries for ROCm into the CPU manylinux builder image, and it avoids maintaining a conditional file with hard-coded repositories from radeon.org for every ROCm release.
This new approach will allow us to build wheels faster and in a more easily maintainable way.
This PR also brings in a required change, as Triton on ROCm requires `device_type` to be set to `hip` so we can pass the correct device type down to Triton (https://github.com/ROCmSoftwarePlatform/triton/pull/284).
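As a rough illustration of the builder-image change (the image names, tags, and function are hypothetical, not the actual workflow logic):
```python
# Hypothetical sketch: pick the manylinux builder image per device so ROCm
# wheels build inside an image that already ships the ROCm libraries.
def builder_image(device: str, rocm_version: str = "5.6") -> str:
    if device == "rocm":
        return f"pytorch/manylinux-rocm:{rocm_version}"
    return "pytorch/manylinux-builder:cpu-main"
```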
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107600
Approved by: https://github.com/jansel, https://github.com/jithunnair-amd
I noticed a curious case on https://github.com/pytorch/pytorch/pull/107508 where there was one broken trunk failure and the PR was merged with `merge -ic`. Because the failure had been classified as unrelated, I expected to see a no-op force merge here. However, it showed up as a force merge with failures.

The record on Rockset reveals https://github.com/pytorch/pytorch/pull/107508 has:
* 0 broken trunk checks (unexpected; this should be 1, as Dr. CI clearly says so)
* 1 ignore current check (unexpected; this should be 0, and the failure should be counted as broken trunk instead)
* 3 unstable ROCm jobs (expected)
It turns out that ignore current takes precedence over the flaky and broken trunk classifications. This might have been the expectation in the past, but I think that's no longer the case: the bot should be consistent with what is shown on Dr.CI. The change here makes the flaky, unstable, and broken trunk classifications take precedence over ignore current. Basically, we only need to ignore new or unrecognized failures that have not yet been classified (see the sketch below).
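A minimal sketch of the new precedence, assuming the inputs are sets of job names (the function and parameter names are illustrative):
```python
# Flaky, unstable, and broken trunk classifications win over "ignore
# current", so `-ic` only ignores failures nothing else has classified.
def classify_failure(name, unstable, flaky, broken_trunk, ignore_current):
    if name in unstable:
        return "UNSTABLE"
    if name in flaky:
        return "FLAKY"
    if name in broken_trunk:
        return "BROKEN_TRUNK"
    if name in ignore_current:  # checked last, after the Dr.CI classes
        return "IGNORE_CURRENT_CHECK"
    return "FAILED"
```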
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107761
Approved by: https://github.com/clee2000
* Enables PIE807 + PIE810. PIE807 flags reimplementing the list builtin with a lambda, and PIE810 fuses multiple startswith/endswith calls into a single call (I applied the autofixes for this before we had ruff enabled). Both rules are illustrated below.
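Minimal before/after examples of the two rules (my own snippets, not from the PR):
```python
from collections import defaultdict

# PIE807: do not reimplement the list builtin with a lambda.
d = defaultdict(list)  # before: defaultdict(lambda: [])

# PIE810: fuse multiple startswith/endswith calls into one tuple argument.
name = "generated-linux-binary-manywheel-nightly.yml"
# before: name.startswith("generated-linux") or name.startswith("generated-macos")
if name.startswith(("generated-linux", "generated-macos")):
    print("generated binary workflow")
```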
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106218
Approved by: https://github.com/albanD
* Encourage people to use -i instead of -f for mergebot
* Add additional info for when rebase fails due to lacking permissions (see the dryrun below, and the sketch after it)
<details><summary>dryrun</summary>
````
csl@csl-mbp ~/zzzzzzzz/pytorch [csl/errormsgs] $
(forpytorch) python3 .github/scripts/tryrebase.py 106089 --branch viable/strict --dry-run
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify refs/remotes/origin/viable/strict
@pytorchbot started a rebase job onto [refs/remotes/origin/viable/strict](7c97c943fb). Check the current status [here](None)
+ git -C /Users/csl/zzzzzzzz/pytorch fetch origin pull/106089/head:pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rebase refs/remotes/origin/viable/strict pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify refs/remotes/origin/viable/strict
+ git -C /Users/csl/zzzzzzzz/pytorch push --dry-run -f https://github.com/Lightning-Sandbox/pytorch.git pull/106089/head:fix/spaces
stdout:
remote: Permission to Lightning-Sandbox/pytorch.git denied to clee2000.
fatal: unable to access 'https://github.com/Lightning-Sandbox/pytorch.git/': The requested URL returned error: 403
stderr:
Rebase failed due to Command `git -C /Users/csl/zzzzzzzz/pytorch push --dry-run -f https://github.com/Lightning-Sandbox/pytorch.git pull/106089/head:fix/spaces` returned non-zero exit code 128
```
remote: Permission to Lightning-Sandbox/pytorch.git denied to clee2000.
fatal: unable to access 'https://github.com/Lightning-Sandbox/pytorch.git/': The requested URL returned error: 403
```
This is likely because the author did not allow edits from maintainers on the PR or because the repo has additional permissions settings that mergebot does not qualify.
````
</details>
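A minimal sketch of how such an error message could be assembled, with hypothetical names; the actual logic lives in `.github/scripts/tryrebase.py`:
```python
import subprocess

PERMISSIONS_HINT = (
    "This is likely because the author did not allow edits from maintainers "
    "on the PR or because the repo has additional permission settings that "
    "mergebot does not satisfy."
)

def push_rebased_branch(repo_dir: str, remote_url: str, refspec: str) -> None:
    # Run the push, and append the permissions hint when it is rejected
    # with a 403 / permission-denied error.
    result = subprocess.run(
        ["git", "-C", repo_dir, "push", "-f", remote_url, refspec],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        message = f"Rebase failed due to push error:\n{result.stderr.strip()}"
        if "403" in result.stderr or "denied" in result.stderr:
            message += f"\n{PERMISSIONS_HINT}"
        raise RuntimeError(message)
```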
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106150
Approved by: https://github.com/huydhn
At the moment, we only record the list of pending and failed checks in the Rockset merge records. This is enough to compute the force merge KPI(s), but isn't enough for more in-depth analysis of what happened at the time of the merge:
* If the number of `ok_failed_checks` is less than `ok_failed_checks_threshold`, the list of `failed_checks` is (expectedly) empty, so Rockset would only record an empty list.
* We support retries in PRs, so the classifications on Dr.CI could differ from what the dev observed at the time of the merge if a retry completed successfully.
### Testing
`python .github/scripts/trymerge.py --comment-id 1654010315 106095 --dry-run` (need to comment out some of the code to actually write a test record to Rockset), then manually verify it with
```
SELECT
    *
FROM
    commons.merges
WHERE
    pr_num = 106095
```
to see that `ignore_current_checks`, `broken_trunk_checks`, `flaky_checks`, and `unstable_checks` show up correctly.
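For reference, a hypothetical shape of the resulting record (the field names follow the columns above; the values are made up):
```python
merge_record = {
    "pr_num": 106095,
    "failed_checks": [],  # may still be empty when under the threshold
    "pending_checks": [],
    "ignore_current_checks": ["trunk / some-job"],
    "broken_trunk_checks": [],
    "flaky_checks": ["pull / another-job"],
    "unstable_checks": ["rocm / unstable-job"],
}
```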
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106162
Approved by: https://github.com/clee2000