Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
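The equivalence behind steps 1 to 4 can be checked with a minimal sketch. The file path below is a made-up example rather than one taken from the PR, and the snippet assumes a POSIX-style path.

```python
import os
from pathlib import Path

# Hypothetical file path, used only for illustration.
path = "/repo/tools/linter/adapters/example.py"

# Old style: one nested os.path.dirname call per directory level.
old_style = os.path.dirname(os.path.dirname(os.path.dirname(path)))

# Steps 2 and 3: make the path absolute first, then chain .parent.
chained = Path(path).absolute().parent.parent.parent

# Step 4: N chained .parent accesses become .parents[N - 1].
indexed = Path(path).absolute().parents[2]

# All three spellings name the same directory.
assert str(chained) == old_style == str(indexed)
```

Note that `.parents` is zero-indexed, so `.parents[0]` is the same directory as `.parent`.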
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
This PR combines the Manylinux 2_28 and Manylinux 2014 builds of triton under one workflow. This is required to support the torch cpu, cuda 11.8, and cuda 12.4 wheels built with Manylinux 2014, and the torch cuda 12.6 wheels built with Manylinux 2_28.
Manylinux 2014 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl``
Manylinux 2_28 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141704
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
Bump the Triton pin to the release candidate commit for Triton 3.2.
A few changes beyond the pin bump itself are needed:
* Remove the script that adds a git version hash suffix to the Triton wheel, since as of https://github.com/triton-lang/triton/pull/4812 Triton adds that itself
* Add `pybind11` to the Triton build setup, since Triton now depends on it
* Use manylinux-2.28 for the Triton wheel builder, and use clang+lld for building to pick up the right glibc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139206
Approved by: https://github.com/malfet, https://github.com/atalman
Co-authored-by: Andrey Talman <atalman@fb.com>
Remove most references to rockset:
* replace comments and docs with a generic "backend database"
* Delete `upload_to_rockset`, so we no longer need to install the package.
* Do not upload perf stats to rockset as well (we should be completely on DynamoDB now right @huydhn?)
According to VSCode, it went from 41 -> 7 instances of "rockset" in the repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139922
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
Publish current state of s390x builder image to allow reproducing worker setup.
Also, if this image gets published to docker repository later, it'd be possible to download published image instead of building it into worker image in https://github.com/pytorch/pytorch/blob/main/.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile#L66, which should allow improving restart time at the cost of additional runtime overhead.
Compared to first attempt to merge:
- default docker repository settings are added to all runners. Changes are mirrored in this PR.
- job is moved into separate workflow file.
- the job no longer attempts to update limits on s390x. Limits should be set up properly on the host, and they cannot be updated from the worker because it runs in a container. The worker container also currently has no sudo installed or configured and no systemd running.
- github token is now passed once via named pipe instead of environment variable. This should increase security of tokens.
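The named-pipe hand-off mentioned in the last item can be illustrated with a minimal sketch. This is not the script from the PR: the function names and the FIFO file name are hypothetical, and the code assumes a POSIX system where `os.mkfifo` is available.

```python
import os
import tempfile
import threading


def send_token(fifo_path: str, token: str) -> None:
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(fifo_path, "w") as fifo:
        fifo.write(token)


def receive_token(fifo_path: str) -> str:
    # The reader sees EOF once the writer closes its end of the pipe.
    with open(fifo_path) as fifo:
        return fifo.read().strip()


def demo(token: str) -> str:
    # Hypothetical driver: the writer runs in a thread, the reader in the caller.
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "token.fifo")
        os.mkfifo(fifo_path, mode=0o600)  # owner-only read/write permissions
        writer = threading.Thread(target=send_token, args=(fifo_path, token))
        writer.start()
        received = receive_token(fifo_path)
        writer.join()
        return received
```

Unlike an environment variable, a value passed this way is not part of the process environment, so it is not inherited by child processes.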
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
This is a first step towards removing the builds' dependency on conda.
Currently we build magma as a conda package in a pytorch conda channel, implemented in a1b372dbda/magma.
This commit adapts the logic from pytorch/builder as follows:
- use pytorch/manylinux-cuda<cuda-version> as base image
- apply patches and invoke the build.sh script directly (no longer through conda build)
- store license and build files along with the built artifact, in an info subfolder
- create a tarball file which resembles that created by conda, without any conda-specific metadata
A new matrix workflow is added, which runs the build for each supported cuda version and uploads the binaries to the pytorch S3 bucket.
For the upload, define an upload.sh script, which will be used by the magma windows job as well, to upload to `s3://ossci-*` buckets.
The build runs on PR and push; on PRs, the upload runs in DRY_RUN mode.
Fixes #139397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139888
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/seemethere
My sales pitch: I need to ssh into the runner from time to time on my PR to debug issues, but it's well-known that LF runners don't support SSH login anymore. So, the proposed fix here is to introduce a new label called ~no-runner-determinator~ `no-runner-experiments` that can be attached to the PR. Whenever `.github/scripts/runner_determinator.py` runs on a PR and sees this label, it will not apply any logic and just straight up use an empty prefix.
### Testing
With the label:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number "139947"
INFO : Opt-out runner determinator because #139947 has no-runner-determinator label
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::
```
Without the label:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number "139947"
INFO : Based on rollout percentage of 95%, enabling experiment lf.
INFO : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```
Running on a trunk commit without a PR number will use the regular logic:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number ""
INFO : Based on rollout percentage of 95%, enabling experiment lf.
INFO : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```
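The behavior exercised by the three runs above can be summarized in a minimal sketch. The function below is hypothetical and is not the code in `runner_determinator.py`; it assumes the PR labels and the regularly computed prefix are already available.

```python
# Label name taken from the description above.
OPT_OUT_LABEL = "no-runner-experiments"


def get_runner_prefix(pr_labels, computed_prefix):
    """Return the runner prefix, honoring the opt-out label.

    pr_labels: labels attached to the PR (empty for trunk commits without a PR).
    computed_prefix: prefix chosen by the regular experiment logic, e.g. "lf.".
    """
    if OPT_OUT_LABEL in pr_labels:
        # Opted out: skip all experiment logic and use an empty prefix.
        return ""
    return computed_prefix
```

For example, `get_runner_prefix({"no-runner-experiments"}, "lf.")` yields an empty prefix, while a PR without the label keeps `"lf."`.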
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140054
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
Since cuda 12.4 binaries are now the default binaries on pypi, the pytorch_extra_install_requirements need to use 12.4.
This would need to be cherry-picked to release 2.5 branch to avoid injecting these versions into metadata during pypi promotion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458
Approved by: https://github.com/malfet
This is only a minor patch that I hope will change how I talk to contributors when lint fails, so that I can tell them to read the logs about lintrunner. There have been too many times when I have had to click the "approve all workflows" just for lint to fail again cuz the developer is manually applying every fix and using CI to test. I understand there are times when lintrunner doesn't work, but I'd like most contributors to at least give it a whirl once to start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138232
Approved by: https://github.com/kit1980, https://github.com/Skylion007
Adds a `default` tag to experiment configurations, which allows excluding some experiments from the random draw by default:
```
experiments:
  lf:
    rollout_perc: 25
  otherExp:
    rollout_perc: 25
    default: false
---
```
and includes the configuration to filter what experiments are of interest for a particular workflow (comma separated):
```
get-test-label-type:
  name: get-test-label-type
  uses: ./.github/workflows/_runner-determinator.yml
  with:
    ...
    check_experiments: "awsa100"
```
The end goal is to enable us to run multiple experiments that are independent of one another. For example, while we still run the LF infra experiment, we want to migrate other runners using the current solution. An immediate use case is the A100 instances, which we want to migrate to AWS.
During the migration period, those new instances will be labeled both `awsa100.linux.gcp.a100` and `linux.aws.a100`. Once the experiment ends, we will remove the first, confusing label.
```
jobs:
  get-build-label-type:
    name: get-build-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
  get-test-label-type:
    name: get-test-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
      check_experiments: "awsa100"
  linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
    name: cuda12.1-py3.10-gcc9-sm80
    uses: ./.github/workflows/_linux-build.yml
    needs:
      - get-build-label-type
      - get-test-label-type
    with:
      runner_prefix: "${{ needs.get-build-label-type.outputs.label-type }}"
      ...
      test-matrix: |
        { include: [
          { config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
          ...
        ]}
      ...
```
```
experiments:
  lf:
    rollout_perc: 50
  awsa100:
    rollout_perc: 50
    default: false
```
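The interaction between `default` and `check_experiments` can be sketched as follows. This is an illustration under assumptions, not the code in `runner_determinator.py`: the function name, the `draw` parameter, and the assumption that experiments default to `default: true` are all hypothetical.

```python
import random


def pick_experiments(experiments, check_experiments=(), draw=None):
    """Return the names of the experiments enabled for this run.

    experiments: mapping of name -> settings, e.g. {"lf": {"rollout_perc": 50}}.
    check_experiments: names the workflow explicitly asks for (may be empty).
    draw: hypothetical hook returning a number in [0, 100), for testability.
    """
    if draw is None:
        draw = lambda: random.uniform(0, 100)

    enabled = []
    for name, settings in experiments.items():
        if check_experiments:
            # The workflow named specific experiments: consider only those.
            if name not in check_experiments:
                continue
        elif not settings.get("default", True):
            # No explicit list: skip experiments marked `default: false`.
            continue
        # Random draw against the rollout percentage.
        if draw() < settings.get("rollout_perc", 0):
            enabled.append(name)
    return enabled
```

With the configuration above, a workflow that sets no `check_experiments` can only ever draw `lf`, while one that sets `check_experiments: "awsa100"` only considers `awsa100`.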
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137614
Approved by: https://github.com/malfet