Commit graph

103 commits

Author SHA1 Message Date
Catherine Lee
06b52dd103 TD outside of test job (#118250)
Give TD its own job so that each shard can get the results from this one job artifact and they will always be in sync with each other; we no longer need to worry about consistency issues.

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move the TD calculation to its own file that will create a JSON file with the final results (a minimal sketch of that output follows this list)
* TD is now job/build env agnostic
* TD will rank all tests, including those that a given test job may not want to run (e.g. it will rank distributed tests along with default tests, even though these tests never run on the same machine together)
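As a rough illustration of what a build-env-agnostic ranking artifact could look like (the file name, schema, and helper below are hypothetical, not the actual implementation):

```python
# Hypothetical sketch: rank every test file and dump the full ordering to a
# JSON artifact that each test shard can later download and slice as needed.
import json

def write_td_rankings(ranked_tests, path="td_results.json"):
    """ranked_tests: list of test file names, most relevant first."""
    results = {
        "version": 1,
        # every test is ranked, even ones a given job will never run
        "ranked_tests": [
            {"test_file": name, "rank": i} for i, name in enumerate(ranked_tests)
        ],
    }
    with open(path, "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    write_td_rankings(["test_ops.py", "distributed/test_c10d_nccl.py", "test_nn.py"])
```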
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
Catherine Lee
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
Catherine Lee
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked: the test logs print while the tests run, are interleaved, and are really long)

Adds settings for no timeout (the step timeout still applies; this only removes the ~30 min timeout per shard of a test file) and for not piping logs / extra-verbose test logs (good for debugging deadlocks, but results in very long and possibly interleaved logs).

Also allows these to be set via the PR body if the label name appears in brackets, e.g. [label name], as in the test above.
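A minimal sketch of how bracketed label names could be pulled out of a PR body (the helper and the set of recognized labels here are illustrative, not the actual filter script):

```python
# Hypothetical sketch: scan a PR body for [label-name] tokens so CI settings
# can be toggled from the PR description as well as from real labels.
import re

# illustrative label names; only ci-verbose-test-logs is mentioned above
KNOWN_CI_LABELS = {"ci-verbose-test-logs", "ci-no-timeout", "ci-no-test-timeout"}

def labels_from_pr_body(body: str) -> set:
    found = set(re.findall(r"\[([A-Za-z0-9_\-]+)\]", body or ""))
    return found & KNOWN_CI_LABELS

print(labels_from_pr_body("Test [ci-verbose-test-logs] please"))
# -> {'ci-verbose-test-logs'}
```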

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
Catherine Lee
2bdc2a68cb [ez][td] Fix for emit metrics can't find JOB_NAME (#116748)
After #113884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116748
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-04 05:31:25 +00:00
Catherine Lee
b5578cb08b [ez] Remove unittest retries (#115460)
Pytest is used in CI now for reruns, and I doubt people are using the env vars when running locally. IMO removing this code makes the run function easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115460
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:46:09 +00:00
Nikita Shulga
2f875c74bf Print ghcr docker pull during build/test (#114510)
To make debugging easier for external devs.

Test plan: Copy and run command from [`Use the following to pull public copy of the image`](https://github.com/pytorch/pytorch/actions/runs/7012511180/job/19077533416?pr=114510#step:6:9):
```
docker pull ghcr.io/pytorch/ci-image:pytorch-linux-jammy-py3.8-gcc11-0d0042fd2e432ea07301ad6f6a474d36a581f0dc

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114510
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-11-28 04:38:17 +00:00
Catherine Lee
dab272eed8 [td] Consistent pytest cache (#113804)
Move the pytest cache downloading into the build step and store it in additional ci files so that it stays consistent during sharding.

Only the build env is taken into account now instead of also the test config, since we might not have the test config at build time. This makes the cache less specific, but I also think it might be better, since tests that fail in one test config are likely to fail in the others too (I also think it might be worth not even looking at build env, but that's a different topic).

Each cache upload should only include information from the current run. Do not merge the current cache with the downloaded cache during upload (this shouldn't matter anyway, since the downloaded cache won't exist at that time).

From what I can tell of the S3 retention policy, pytest cache files will be deleted after 30 days (cc @ZainRizvi to confirm), so we never have to worry about space or pulling old versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113804
Approved by: https://github.com/ZainRizvi
2023-11-17 23:45:47 +00:00
Zain Rizvi
9a9232956f Include job name in the emitted metrics (#113884)
What it says in the title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113884
Approved by: https://github.com/clee2000
2023-11-16 21:26:49 +00:00
Catherine Lee
6e73ae2022 [ci][ez] Add job_id to emit_metrics (#113099)
As in title.

Also print the job id in the step since I'm struggling to find it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113099
Approved by: https://github.com/seemethere
2023-11-08 10:32:41 +00:00
Huy Do
f6f81a5969 Update get-workflow-job-id to also return job name (#112103)
Then we can use this job name in `filter-test-configs` if it's available.  This addresses the issue in which `filter-test-configs` on GitHub runners (MacOS x86) couldn't find the runner log to get the job name.  This is expected because GitHub runners are isolated, so a job should not be able to access runner logs, which could contain information from other jobs.

This allows all missing features depending on running `filter-test-configs` on GitHub runners:
* Rerun disabled tests and memory leak check. For example, this would help avoid closing https://github.com/pytorch/pytorch/issues/110980#issuecomment-1779806466 early with the disabled test running properly on MacOS x86
* MacOS x86 jobs can now be disabled or marked as unstable

I keep the current logic to parse the log as a fallback because it's working fine on self-hosted runners.  That also handles the case where `get-workflow-job-id` fails.  Also I move the rest of `get-workflow-job-id` up before the test step like https://github.com/pytorch/pytorch/pull/111483
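For context, a minimal sketch of the API-based lookup described above (the endpoint and response fields are the public GitHub Actions API; the helper itself is illustrative, not the actual script):

```python
# Hypothetical sketch: ask the GitHub API for all jobs in the current workflow
# run and pick the one whose runner name matches this runner, returning both
# its id and its name; callers fall back to parsing the runner log on failure.
import json
import os
import urllib.request

def get_job_id_and_name(repo, run_id, runner_name, token):
    url = f"https://api.github.com/repos/{repo}/actions/runs/{run_id}/jobs?per_page=100"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        jobs = json.load(resp)["jobs"]
    for job in jobs:
        if job.get("runner_name") == runner_name:
            return job["id"], job["name"]
    return None, None

if __name__ == "__main__":
    print(get_job_id_and_name(
        "pytorch/pytorch",
        os.environ.get("GITHUB_RUN_ID", ""),
        os.environ.get("RUNNER_NAME", ""),
        os.environ.get("GITHUB_TOKEN", ""),
    ))
```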

### Testing

Spot checks some jobs to confirm they have the correct names:

* MacOS M1 test job https://github.com/pytorch/pytorch/actions/runs/6648305319/job/18065275722?pr=112103#step:10:8
* MacOS x86 build job https://github.com/pytorch/pytorch/actions/runs/6648306305/job/18065138137?pr=112103#step:9:14
* Linux test job https://github.com/pytorch/pytorch/actions/runs/6648300991/job/18065354503?pr=112103#step:13:7
* Windows test job https://github.com/pytorch/pytorch/actions/runs/6648305319/job/18065599500?pr=112103#step:12:7
* MacOS x86 test job https://github.com/pytorch/pytorch/actions/runs/6648306305/job/18066312801#step:10:8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112103
Approved by: https://github.com/clee2000
2023-10-26 16:42:46 +00:00
Catherine Lee
102fbd402c [ci] Move step to get workflow job id before test step in linux (#111483)
We’ve been struggling to get the job id since 9/28/2023 12:03 pm.  Before this we had almost 0 problems getting the job id, but after, we get a lot of `Recieved status code '502' when attempting to retrieve https://api.github.com/repos/pytorch/pytorch/actions/runs/6551579728/jobs?per_page=100:\n", 'Bad Gateway\n\nheaders=Server: GitHub.com\nDate: Tue, 17 Oct 2023 20:32:52 GMT\nContent-Type: application/json\nContent-Length: 32\nETag: "652eed15-20"\nVary: Accept-Encoding, Accept, X-Requested-With\nX-GitHub-Request-Id: EC62:7EE0:166AAF5:2D51A8E:652EEF6A\nconnection: close\n\n`, e.g. https://github.com/pytorch/pytorch/actions/runs/6551579728/job/17793898278#step:18:22

Recently, it has been happening around 1/4 of the time, possibly more. I think this happens almost only on linux.

I believe this is somehow caused by a test, since distributed tests seems to be disproportionately affected, so I move the step to get the job id before the test step.  This also has the benefit of the test step being able to get the job id now if we want it.

Regardless of whether this works or not, it's a pretty harmless change that might make things easier in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111483
Approved by: https://github.com/huydhn
2023-10-18 20:54:06 +00:00
Huy Do
6e8079e00f Fix timeout value for memory leak check job (#111386)
Fixes https://github.com/pytorch/pytorch/pull/110193 as it doesn't work as expected:

* I forgot the timeout on the test step
* Also MacOS test job wasn't covered

### Testing

The job timeout is set correctly to 600 https://github.com/pytorch/pytorch/actions/runs/6541825177/job/17764485473#step:14:7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111386
Approved by: https://github.com/clee2000
2023-10-17 18:25:02 +00:00
Catherine Lee
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest. Zip the log into an artifact. The line listing all the test names is really long, but if you view the source of the raw logs it will not wrap, so it will only be one line. The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs. Do not zip. (A minimal sketch of this split follows the list.)
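A rough sketch of the success/failure split described above (the file names and helper are hypothetical, not the actual run_test.py logic):

```python
# Hypothetical sketch: capture a test run's output to a file, print it only on
# failure/rerun, and zip it into an artifact instead when the run succeeds.
import subprocess
import zipfile

def run_quietly(cmd, log_path="test.log", artifact="logs.zip"):
    with open(log_path, "w") as log:
        ret = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode
    if ret != 0:
        # failure/rerun: dump the full log so it shows up in the CI output
        print(open(log_path).read())
    else:
        # success: keep the console quiet and ship the log as an artifact instead
        with zipfile.ZipFile(artifact, "a") as zf:
            zf.write(log_path)
    return ret

if __name__ == "__main__":
    run_quietly(["python", "-m", "pytest", "test_ops.py"])
```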

Also
* change log artifact name

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
Huy Do
7827ae2864 Increase job timeout limit when running with memory leak check (#110193)
This fixes the daily timeout of ROCm jobs when running with memory leak check turned on.  I wanted to use something like `inputs.timeout-minutes * 2`, but that syntax, unfortunately, isn't supported in GitHub Actions YAML.  So I decided to just double the current timeout value of 300 minutes to make it 600 minutes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110193
Approved by: https://github.com/clee2000
2023-10-02 18:01:49 +00:00
Mark Saroufim
6268ab2c2d torchbench pin upd: hf auth token, clip, whisper, llamav2, sd (#106009)
Includes stable diffusion, whisper, llama7b and clip

To get this to work I had to pass the HF auth token to all CI jobs; GitHub does not pass secrets from parent to child workflows automatically. There's a likelihood HF will rate-limit us; in that case please revert this PR and I'll work on adding a cache next - cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet

Something upstream changed in torchbench too, where `hf_Bert` and `hf_Bert_large` are now both failing on some dynamic-shape-looking error which I'm not sure how to debug yet, so for now, although it felt a bit gross, I added a skip since others are building on top of this work @ezyang

`llamav2_7b_16h` cannot pass the accuracy checks because it OOMs when deep-cloning extra inputs; this seems to mean it does not need to show up in the expected-numbers CSV. Will figure this out when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
2023-08-03 16:28:40 +00:00
Huy Do
0e85c224f8 Use shareable calculate-docker-image GHA (#105372)
Switch from PyTorch `calculate-docker-image` GHA to its shareable version on test-infra https://github.com/pytorch/test-infra/pull/4397.

I will clean up PyTorch `calculate-docker-image` GHA in a separate PR after landing this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105372
Approved by: https://github.com/malfet
2023-07-19 05:02:01 +00:00
Catherine Lee
c16a28860f Reenable disabled tests by pr body (#103790)
Query for the list of reenabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels (so token usage should stay the same), move the code and tests related to parsing reenabled issues to the filter test config step, and remove the old code that fetched the PR body and commit message.  `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be reenabled.

For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work

More testing via:

```
python3 ".github/scripts/filter_test_configs.py" --workflow "pull" --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)" --test-matrix "{ include: [
    { config: "default", shard: 1, num_shards: 1 },
  ]}
  " --pr-number "" --tag "" --event-name "push" --schedule "" --branch ""
```

and

```
python3 ".github/scripts/filter_test_configs.py" --workflow "pull" --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)" --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}" --pr-number "103790" --tag "" --event-name "pull_request" --schedule "" --branch ""
```
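As a side note, a minimal sketch of how closing-keyword issue references could be turned into the comma-separated `REENABLED_ISSUES` value (the regex and helper are illustrative, not the actual parser):

```python
# Hypothetical sketch: collect issue numbers referenced via closing keywords
# ("Fixes #NNN", "Closes #NNN", ...) in a PR body and join them into the
# comma-separated REENABLED_ISSUES string.
import re

CLOSING_KEYWORDS = r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)"

def reenabled_issues(pr_body: str) -> str:
    numbers = re.findall(rf"{CLOSING_KEYWORDS}\s+#(\d+)", pr_body or "", re.IGNORECASE)
    return ",".join(numbers)

print(reenabled_issues("For testing: Fixes #103789"))  # -> 103789
```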
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
2023-06-22 19:47:11 +00:00
PyTorch MergeBot
58d11159bd Revert "Reenable disabled tests by pr body (#103790)"
This reverts commit 2237b4ad75.

Reverted https://github.com/pytorch/pytorch/pull/103790 on behalf of https://github.com/huydhn due to I think we tested it on PR but missed the logic in trunk where there is no PR number ([comment](https://github.com/pytorch/pytorch/pull/103790#issuecomment-1601890299))
2023-06-22 01:26:46 +00:00
Catherine Lee
2237b4ad75 Reenable disabled tests by pr body (#103790)
Query for the list of reenabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels (so token usage should stay the same), move the code and tests related to parsing reenabled issues to the filter test config step, and remove the old code that fetched the PR body and commit message.  `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be reenabled.

For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
2023-06-22 01:10:31 +00:00
Zain Rizvi
c3d3165f16 Enable uploading metrics and upload Test Reordering metrics to dynamodb (#102691)
Added a feature to upload test statistics to DynamoDB and Rockset using a new function `emit_metric` in `tools/stats/upload_stats_lib.py`.

Added metrics to measure test reordering effectiveness in `tools/testing/test_selections.py`.
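A rough sketch of what an `emit_metric`-style helper could look like (the table name, field names, and env-var plumbing below are assumptions, not the actual `upload_stats_lib` implementation):

```python
# Hypothetical sketch: attach CI context (repo, workflow, job) to a named
# metric and write it to a DynamoDB table.  Table and field names are made up.
import os
import time
import uuid

import boto3

def emit_metric(metric_name: str, metrics: dict) -> None:
    item = {
        "metric_id": str(uuid.uuid4()),
        "metric_name": metric_name,
        "timestamp": int(time.time()),
        "repo": os.environ.get("GITHUB_REPOSITORY", ""),
        "workflow": os.environ.get("GITHUB_WORKFLOW", ""),
        "job_name": os.environ.get("JOB_NAME", ""),  # later added in #113884
        "job_id": os.environ.get("JOB_ID", ""),      # later added in #113099
        **metrics,
    }
    boto3.resource("dynamodb").Table("torchci-metrics").put_item(Item=item)

# emit_metric("td_reordering", {"num_tests_reordered": 42})
```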
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102691
Approved by: https://github.com/malfet
2023-06-12 23:01:53 +00:00
Huy Do
04c1c2b791 Try to build the Docker image if it doesn't exist (#102562)
There is a bug in the test workflow where it could fail to find the new Docker image when the image hasn't yet become available on ECR, for example e71ab21422.  This is basically a race condition where the test job starts before the docker-build workflow has finished successfully.  The fix here is to make sure that the test job has the opportunity to build the image if it doesn't exist, the same as what the build workflow does atm.  Once the docker-build workflow finishes pushing the new image to ECR, that can then be used instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102562
Approved by: https://github.com/PaliC
2023-05-31 20:50:27 +00:00
Bin Bao
2a14652879 [CI] Introduce dashboard-tag to pass dashboard run configs (#101320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101320
Approved by: https://github.com/huydhn
2023-05-12 23:26:16 +00:00
Zain Rizvi
96f46316c9 Preserve PyTest Cache across job runs (#100522)
Preserves the PyTest cache from one job run to the next.  In a later PR, this will be used to change the order in which we actually run those tests

The process is:
1. Before running tests, check S3 to see if there is an uploaded cache from any shard of the current job
2. If there are, download them all and merge their contents. Put the merged cache in the default .pytest_cache folder
3. After running the tests, merge the now-current .pytest_cache folder with the cache previously downloaded for the current shard. This will make the merged cache contain all tests that have ever failed for the given PR in the current shard
4. Upload the resulting cache file back to S3

The S3 folder has a retention policy of 30 days, after which the uploaded cache files will get auto-deleted.
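A minimal sketch of the download-and-merge idea for the last-failed portion of the cache (the paths and merge policy are simplified assumptions, not the actual tooling):

```python
# Hypothetical sketch: merge the "lastfailed" files from several downloaded
# pytest caches into the local .pytest_cache, so the union of every test that
# has failed on any shard of this PR is available for reordering.
import json
from pathlib import Path

LASTFAILED = Path("v/cache/lastfailed")  # where pytest's cache plugin keeps it

def merge_lastfailed(downloaded_caches, target=Path(".pytest_cache")):
    merged = {}
    for cache_dir in downloaded_caches:
        f = Path(cache_dir) / LASTFAILED
        if f.exists():
            merged.update(json.loads(f.read_text()))  # union of failed test ids
    out = target / LASTFAILED
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(merged, indent=2))

# merge_lastfailed(["downloaded_cache_shard_1", "downloaded_cache_shard_2"])
```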
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100522
Approved by: https://github.com/huydhn
2023-05-10 18:37:28 +00:00
Nikita Shulga
7ff71a3a48 Populate download.pytorch.org IP to container (#100475)
Follow-up after https://github.com/pytorch/pytorch/pull/100436 to disable download.pytorch.org access over IPv6 due to access problems.

Why not copy `/etc/hosts` from the host to the container? Because it would break container IP resolution in distributed tests, which rely on `socket.gethostbyname(socket.gethostname())` to work.
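For reference, this is the resolution pattern the distributed tests depend on (just illustrating the standard-library call, not test code):

```python
# The distributed tests resolve the container's own hostname to an IP address;
# replacing /etc/hosts wholesale can break this lookup inside the container.
import socket

hostname = socket.gethostname()
print(hostname, "->", socket.gethostbyname(hostname))
```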

### <samp>🤖 Generated by Copilot at 756d0b1</samp>

Propagate `download.pytorch.org` IP address to docker containers in `test-pytorch-binary` action and workflow. This fixes DNS issues when downloading PyTorch binaries inside the containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100475
Approved by: https://github.com/huydhn
2023-05-02 22:08:06 +00:00
PyTorch MergeBot
e8a1d0be3e Revert "Mount /etc/hosts into container (#100475)"
This reverts commit 99ded8bbce.

Reverted https://github.com/pytorch/pytorch/pull/100475 on behalf of https://github.com/malfet due to Breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/100475#issuecomment-1532097309))
2023-05-02 20:23:32 +00:00
Nikita Shulga
99ded8bbce Mount /etc/hosts into container (#100475)
Follow-up after https://github.com/pytorch/pytorch/pull/100436 to disable download.pytorch.org access over IPv6 due to access problems.

### <samp>🤖 Generated by Copilot at 55c9443</samp>

This pull request improves the network configuration of the test-pytorch-binary GitHub action and workflow by mounting the host's `/etc/hosts` file into the container. This enables the container to resolve hostname aliases consistently with the host machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100475
Approved by: https://github.com/huydhn
2023-05-02 17:34:07 +00:00
Nikita Shulga
2418b94576 Rename default branch to main (#99210)
Mostly `s/@master/@main/` in numerous `.yml` files.

Keep `master` in `weekly.yml` as it refers to `xla` repo and in `test_trymerge.py` as it refers to a branch PR originates from.
2023-04-16 18:48:14 -07:00
Catherine Lee
06ad8d6d5f Remove filter step (#98969)
remove filter steps from linux, rocm, and mac tests

there are still some filter jobs in other places like bazel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98969
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-04-15 00:08:20 +00:00
Huy Do
4563adacc5 Update the use of nvidia-smi for GPU healthcheck (#98036)
This goes together with https://github.com/pytorch/test-infra/pull/3967 to:

* Provide a more accurate health check command with `nvidia-smi`
* Avoid running the check in the edge case when `nvidia-smi` doesn't even exist due to GitHub outage, i.e. https://github.com/pytorch/pytorch/actions/runs/4591098682/jobs/8107204277
* Also check for the number of GPU as part of the health check. The number of GPUs needs to be a power of 2 on a healthy runner.  Fixes https://github.com/pytorch/test-infra/issues/4000

### Testing

Luckily, the PR picked up the broken runner https://github.com/pytorch/pytorch/actions/runs/4640688249/jobs/8213191715, and the script correctly detected that the runner had only 3 of 4 GPUs and shut it down.
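A minimal sketch of the power-of-two GPU-count check described above (the real healthcheck lives in the workflow/test-infra scripts; this helper is illustrative):

```python
# Hypothetical sketch: count the GPUs that nvidia-smi can see and flag the
# runner as unhealthy if the count is not a power of two (e.g. 3 of 4 GPUs).
import subprocess
import sys

def gpu_count() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True, check=True
    ).stdout
    return len([line for line in out.splitlines() if line.strip()])

def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

if __name__ == "__main__":
    n = gpu_count()
    if not is_power_of_two(n):
        print(f"Unhealthy runner: found {n} GPU(s)")
        sys.exit(1)
    print(f"Healthy runner: found {n} GPU(s)")
```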
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98036
Approved by: https://github.com/weiwangmeta
2023-04-08 00:53:20 +00:00
Wei Wang
d4dbdee528 Update _linux-test.yml (#98317)
Skip "setup-ssh" for now for A100 runners from GCP, as they frequently encounter issues like "connect ETIMEDOUT 173.231.16.75:443" (about 10 occurrences every day).

Examples for just today so far:
| Timestamp | Workflow | Job |
| -- | -- | -- |
| 2023-04-04T15:07:50.916331Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4609056040/jobs/8146321650 |
| 2023-04-04T15:03:56.914692Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4609010125/jobs/8146217819 |
| 2023-04-04T14:39:58.004717Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4608784966/jobs/8145641764 |
| 2023-04-04T14:19:28.854825Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4608561116/jobs/8145147916 |
| 2023-04-04T06:15:39.241848Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604422106/jobs/8135687673 |
| 2023-04-04T06:10:21.056131Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604406947/jobs/8135611094 |
| 2023-04-04T05:34:50.908482Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604198332/jobs/8135201048 |
| 2023-04-04T03:04:36.628201Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4603162241/jobs/8133620905 |
| 2023-04-04T01:49:27.119830Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4600897505/jobs/8132760483 |
| 2023-04-04T01:18:06.141437Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4602745871/jobs/8132387930 |
| 2023-04-04T00:38:30.610770Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4602537869/jobs/8131938265 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98317
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-04-07 01:51:02 +00:00
Catherine Lee
38207a9e53 [ci][easy] Only print remaining logs if test step ran (#97713)
It sometimes spits out leftover logs from a previous run on the Windows ephemeral runner, but this might have been fixed by now.  I get a bit annoyed when the step runs even though it obviously isn't going to be useful since the test step didn't run.

always() is needed to ensure that it runs on test step failure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97713
Approved by: https://github.com/huydhn
2023-03-30 23:03:41 +00:00
Huy Do
f92cae4849 Fix a grep-itself bug when checking for GPU healthcheck (#97929)
The logic works (https://github.com/pytorch/pytorch/actions/runs/4558327458), but it also greps itself because `set -x` is set (ugh, debug messages).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97929
Approved by: https://github.com/malfet, https://github.com/weiwangmeta
2023-03-30 08:14:01 +00:00
PyTorch MergeBot
b093dfaefa Revert "Fix a grep-itself bug when checking for GPU healthcheck (#97929)"
This reverts commit f40b2ed59c.

Reverted https://github.com/pytorch/pytorch/pull/97929 on behalf of https://github.com/huydhn due to Rework to get rid of grep completely
2023-03-30 07:52:20 +00:00
Huy Do
f40b2ed59c Fix a grep-itself bug when checking for GPU healthcheck (#97929)
The logic works (https://github.com/pytorch/pytorch/actions/runs/4558327458), but it also greps itself because `set -x` is set (ugh, debug messages).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97929
Approved by: https://github.com/malfet, https://github.com/weiwangmeta
2023-03-30 00:25:43 +00:00
Huy Do
099b2801db Stop runner service when its GPU crashes (#97585)
Per the title, I'm looking for a way to take the runner out of service when its GPU crashes and can't recover.  Taking the faulty runner out of service would prevent future jobs from being assigned to it, as they would surely fail.

This is based on the observation that GPU crashes usually happen in the middle of the test or in the next `setup-nvidia` step.  This only happens on G5 runners with A10G GPUs, so the suspicion is that this is a hardware failure.  Updating to the newer NVIDIA driver (525.85.06) might or might not help with the issue (https://github.com/pytorch/pytorch/pull/96904), so I'm preparing this PR as a preemptive measure.  Here are the symptoms when the GPU crashes:

* The test fails with a "No CUDA GPUs are available" error when initializing CUDA.  For examples:
  * https://github.com/pytorch/pytorch/actions/runs/4506110581/jobs/7932832519
  * https://github.com/pytorch/pytorch/actions/runs/4507220502/jobs/7935084759
* Calling nvidia-smi times out after 60 seconds.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4496201282/jobs/7910938448
* Running nvidia-smi fails with an "unable to determine the device handle for GPU" unknown error
  * https://github.com/pytorch/pytorch/actions/runs/4546343549/jobs/8015359600
* Running `docker --gpus all` fails with an error response from the daemon, while the command `nvidia-container-cli` fails with `detection error: nvml error: unknown error`
  * https://github.com/pytorch/pytorch/actions/runs/4545579871/jobs/8013667872

I assume that an offline runner with a stopped runner service would be torn down and recycled properly by the infra scaling process.

### Testing
https://github.com/pytorch/pytorch/actions/runs/4517112069/jobs/7956204805.  When it runs, the code fetches the service name from the `${{ RUNNER_WORKSPACE }}/../../.service` file and issues `sudo systemctl stop ${RUNNER_SERVICE_NAME}` to stop the self-hosted runner service.

The job will show its status as `The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.`
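A rough Python equivalent of that stop sequence (the real implementation is a shell step in the workflow; this simply mirrors the description above under that assumption):

```python
# Hypothetical sketch: read the self-hosted runner's service name from the
# .service file two levels above the runner workspace and stop that service,
# so no further jobs get scheduled onto a runner whose GPU has crashed.
import os
import subprocess
from pathlib import Path

def stop_runner_service() -> None:
    workspace = Path(os.environ["RUNNER_WORKSPACE"])
    service_name = (workspace / ".." / ".." / ".service").read_text().strip()
    subprocess.run(["sudo", "systemctl", "stop", service_name], check=True)

# stop_runner_service()  # only meaningful on a self-hosted runner
```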
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97585
Approved by: https://github.com/jeanschmidt
2023-03-29 21:17:13 +00:00
Huy Do
2806fa4470 Use the latest NVIDIA driver from setup-nvidia (#97840)
This goes with https://github.com/pytorch/test-infra/pull/3949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97840
Approved by: https://github.com/ZainRizvi
2023-03-29 21:14:27 +00:00
Bin Bao
c55d1a6049 [CI] Experiment with a newer CUDA driver (#96904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96904
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
2023-03-24 17:05:18 +00:00
Wei Wang
9320cae1da Add GPU frequency lock option to inductor workflows running on A100 (#97465)
Fixes #97459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97465
Approved by: https://github.com/xuzhao9
2023-03-24 05:15:21 +00:00
Xuehai Pan
4b0e2e2cc6 Use official NVML Python bindings (#93925)
Use the official NVML Python binding package [`nvidia-ml-py`](https://pypi.org/project/nvidia-ml-py), which is maintained by the NVIDIA NVML team.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93925
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/ptrblck
2023-02-07 05:27:36 +00:00
Jane Xu
0ecb071fc4 [BE][CI] change references from .jenkins to .ci (#92624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92624
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-01-30 22:50:07 +00:00
Catherine Lee
27ab1dfc28 Remove print_test_stats, test_history, s3_stat_parser (#92841)
Pritam Damania no longer uses it (and is no longer with FB), and I don't know who else has interest in this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92841
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/seemethere
2023-01-27 18:11:42 +00:00
Catherine Lee
00f3e0d8c9 [ci] Set step level timeout (#93084)
Not super important, but it is nice for the logs because the logs now say "the action timed out" instead of "the action was cancelled".  It also makes the job status "failure" instead of "cancelled"

also adds timeout minutes as an input for rocm and mac tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93084
Approved by: https://github.com/huydhn
2023-01-27 17:52:33 +00:00
Zain Rizvi
92fbb35bff Upload failures shouldn't fail a CI that passed tests (#92996)
This'll reduce some flakiness we've been seeing recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92996
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-01-25 19:23:51 +00:00
Huy Do
5610766044 Mark test monitoring as an optional process (#92658)
This is an optional step that is OK to ignore when PyPI becomes flaky.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92658
Approved by: https://github.com/clee2000
2023-01-20 18:59:56 +00:00
Huy Do
ce43fc586f Register sccache epilogue before starting sccache (#92587)
Fixing flaky XLA test jobs where sccache fails to start with a timeout error, for example:

* https://github.com/pytorch/pytorch/actions/runs/3953719143/jobs/6770489428
* https://github.com/pytorch/pytorch/actions/runs/3952860712/jobs/6769339620
* https://github.com/pytorch/pytorch/actions/runs/3946315315/jobs/6754126326

XLA test job actually builds XLA as part of the test ~~, so it needs sccache~~

* Register sccache epilogue before starting sccache, so that any errors when starting sccache can be printed
* Add `-e SKIP_SCCACHE_INITIALIZATION=1` to the `_linux_test` workflow; this is the same flag used in the `_linux_build` workflow. Quoted the reason from the build script:

> sccache --start-server seems to hang forever on self hosted runners for GHA so let's just go ahead and skip the --start-server altogether since it seems as though sccache still gets used even when the sscache server isn't started explicitly

* Also fix the code alignment in `.jenkins/pytorch/common-build.sh`
* We don't even use sccache in XLA test job, but there is an S3 cache used by bazel there (`XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92587
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2023-01-19 16:14:31 +00:00
Catherine Lee
e67f5ab6cc Print and zip remaining test logs (#91510)
When CI times out or gets cancelled, the code to print and delete logs for currently running tests doesn't get run, which makes it hard to debug what's going on. So print the logs in a new step and also zip them into the usage-log zip (which should probably get a name change at some point).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91510
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-09 17:31:36 +00:00
Edward Z. Yang
ffd0b15a49 Add support for keep-going label (#90902)
This makes run_test.py keep going even on failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90902
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 06:47:06 +00:00
PyTorch MergeBot
82a191313e Revert "Add support for keep-going label (#90902)"
This reverts commit 855f4b7d24.

Reverted https://github.com/pytorch/pytorch/pull/90902 on behalf of https://github.com/huydhn due to This change breaks trunk where, unlike PR, there is no label
2022-12-16 05:07:49 +00:00
Edward Z. Yang
855f4b7d24 Add support for keep-going label (#90902)
This makes run_test.py keep going even on failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90902
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 04:03:52 +00:00
Wei Wang
1439ebd899 Enable inductor perf test on GCP A100 (#90322)
This PR tries to enable inductor performance nightly testing on A100 runners provided by GCP. Currently these GCP runners are created and maintained using scripts in https://github.com/fairinternal/pytorch-gha-infra/pull/82.
For some reason the artifacts cannot (and do not need to) be uploaded to S3, so this adds a use-gha parameter to _linux-test.yml to avoid creating a new but mostly identical _linux-test.yml.

Workflow test results: https://github.com/pytorch/pytorch/actions/runs/3642340544/jobs/6149691109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90322
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/desertfire
2022-12-13 17:47:01 +00:00