Commit graph

103 commits

Author SHA1 Message Date
Catherine Lee
06b52dd103 TD outside of test job (#118250)
Give TD its own job so that each shard can get the results from this one job artifact and they will always be in sync with each other; we no longer need to worry about consistency issues.

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move the TD calculation to its own file that will create a JSON file with the final results (a minimal sketch of that output follows this list)
* TD is now job/build env agnostic
* TD will rank all tests, including those that a given test job may not want to run (e.g. it will rank distributed tests along with default tests, even though these tests never run on the same machine together)
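As a rough illustration of what a build-env-agnostic ranking artifact could look like (the file name, schema, and helper below are hypothetical, not the actual implementation):

```python
# Hypothetical sketch: rank every test file and dump the full ordering to a
# JSON artifact that each test shard can later download and slice as needed.
import json

def write_td_rankings(ranked_tests, path="td_results.json"):
    """ranked_tests: list of test file names, most relevant first."""
    results = {
        "version": 1,
        # every test is ranked, even ones a given job will never run
        "ranked_tests": [
            {"test_file": name, "rank": i} for i, name in enumerate(ranked_tests)
        ],
    }
    with open(path, "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    write_td_rankings(["test_ops.py", "distributed/test_c10d_nccl.py", "test_nn.py"])
```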
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
Catherine Lee
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
Catherine Lee
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked: the test logs print while the tests run, are interleaved, and are really long)

Adds settings for no timeout (the step timeout still applies; this only removes the ~30 min timeout per shard of a test file) and for not piping logs / extra-verbose test logs (good for debugging deadlocks, but results in very long and possibly interleaved logs).

Also allows these to be set via the PR body if the label name appears in brackets, e.g. [label name], as in the test above.
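A minimal sketch of how bracketed label names could be pulled out of a PR body (the helper and the set of recognized labels here are illustrative, not the actual filter script):

```python
# Hypothetical sketch: scan a PR body for [label-name] tokens so CI settings
# can be toggled from the PR description as well as from real labels.
import re

# illustrative label names; only ci-verbose-test-logs is mentioned above
KNOWN_CI_LABELS = {"ci-verbose-test-logs", "ci-no-timeout", "ci-no-test-timeout"}

def labels_from_pr_body(body: str) -> set:
    found = set(re.findall(r"\[([A-Za-z0-9_\-]+)\]", body or ""))
    return found & KNOWN_CI_LABELS

print(labels_from_pr_body("Test [ci-verbose-test-logs] please"))
# -> {'ci-verbose-test-logs'}
```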

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
Catherine Lee
2bdc2a68cb [ez][td] Fix for emit metrics can't find JOB_NAME (#116748)
After #113884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116748
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-04 05:31:25 +00:00
Catherine Lee
b5578cb08b [ez] Remove unittest retries (#115460)
Pytest is used in CI now for reruns, and I doubt people are using the env vars when running locally. IMO removing this code makes the run function easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115460
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:46:09 +00:00
Nikita Shulga
2f875c74bf Print ghcr docker pull during build/test (#114510)
To make debugging easier for external devs.

Test plan: Copy and run command from [`Use the following to pull public copy of the image`](https://github.com/pytorch/pytorch/actions/runs/7012511180/job/19077533416?pr=114510#step:6:9):
```
docker pull ghcr.io/pytorch/ci-image:pytorch-linux-jammy-py3.8-gcc11-0d0042fd2e432ea07301ad6f6a474d36a581f0dc

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114510
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-11-28 04:38:17 +00:00
Catherine Lee
dab272eed8 [td] Consistent pytest cache (#113804)
Move the pytest cache downloading into the build step and store it in additional ci files so that it stays consistent during sharding.

Only the build env is taken into account now instead of also the test config, since we might not have the test config at build time. This makes the cache less specific, but I also think it might be better, since tests that fail in one test config are likely to fail in the others too (I also think it might be worth not even looking at build env, but that's a different topic).

Each cache upload should only include information from the current run. Do not merge the current cache with the downloaded cache during upload (this shouldn't matter anyway, since the downloaded cache won't exist at that time).

From what I can tell of the S3 retention policy, pytest cache files will be deleted after 30 days (cc @ZainRizvi to confirm), so we never have to worry about space or pulling old versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113804
Approved by: https://github.com/ZainRizvi
2023-11-17 23:45:47 +00:00
Zain Rizvi
9a9232956f Include job name in the emitted metrics (#113884)
What it says in the title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113884
Approved by: https://github.com/clee2000
2023-11-16 21:26:49 +00:00
Catherine Lee
6e73ae2022 [ci][ez] Add job_id to emit_metrics (#113099)
As in title.

Also print the job id in the step since I'm struggling to find it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113099
Approved by: https://github.com/seemethere
2023-11-08 10:32:41 +00:00
Huy Do
f6f81a5969 Update get-workflow-job-id to also return job name (#112103)
Then we can use this job name in `filter-test-configs` if it's available.  This addresses the issue in which `filter-test-configs` on GitHub runners (MacOS x86) couldn't find the runner log to get the job name.  This is expected because GitHub runners are isolated, so a job should not be able to access runner logs, which could contain information from other jobs.

This allows all missing features depending on running `filter-test-configs` on GitHub runners:
* Rerun disabled tests and memory leak check. For example, this would help avoid closing https://github.com/pytorch/pytorch/issues/110980#issuecomment-1779806466 early with the disabled test running properly on MacOS x86
* MacOS x86 jobs can now be disabled or marked as unstable

I keep the current logic to parse the log as a fallback because it's working fine on self-hosted runners.  That also handles the case where `get-workflow-job-id` fails.  Also I move the rest of `get-workflow-job-id` up before the test step like https://github.com/pytorch/pytorch/pull/111483
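For context, a minimal sketch of the API-based lookup described above (the endpoint and response fields are the public GitHub Actions API; the helper itself is illustrative, not the actual script):

```python
# Hypothetical sketch: ask the GitHub API for all jobs in the current workflow
# run and pick the one whose runner name matches this runner, returning both
# its id and its name; callers fall back to parsing the runner log on failure.
import json
import os
import urllib.request

def get_job_id_and_name(repo, run_id, runner_name, token):
    url = f"https://api.github.com/repos/{repo}/actions/runs/{run_id}/jobs?per_page=100"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        jobs = json.load(resp)["jobs"]
    for job in jobs:
        if job.get("runner_name") == runner_name:
            return job["id"], job["name"]
    return None, None

if __name__ == "__main__":
    print(get_job_id_and_name(
        "pytorch/pytorch",
        os.environ.get("GITHUB_RUN_ID", ""),
        os.environ.get("RUNNER_NAME", ""),
        os.environ.get("GITHUB_TOKEN", ""),
    ))
```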

### Testing

Spot checks some jobs to confirm they have the correct names:

* MacOS M1 test job https://github.com/pytorch/pytorch/actions/runs/6648305319/job/18065275722?pr=112103#step:10:8
* MacOS x86 build job https://github.com/pytorch/pytorch/actions/runs/6648306305/job/18065138137?pr=112103#step:9:14
* Linux test job https://github.com/pytorch/pytorch/actions/runs/6648300991/job/18065354503?pr=112103#step:13:7
* Windows test job https://github.com/pytorch/pytorch/actions/runs/6648305319/job/18065599500?pr=112103#step:12:7
* MacOS x86 test job https://github.com/pytorch/pytorch/actions/runs/6648306305/job/18066312801#step:10:8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112103
Approved by: https://github.com/clee2000
2023-10-26 16:42:46 +00:00
Catherine Lee
102fbd402c [ci] Move step to get workflow job id before test step in linux (#111483)
We’ve been struggling to get the job id since 9/28/2023 12:03 pm.  Before this we had almost 0 problems getting the job id, but after, we get a lot of `Recieved status code '502' when attempting to retrieve https://api.github.com/repos/pytorch/pytorch/actions/runs/6551579728/jobs?per_page=100:\n", 'Bad Gateway\n\nheaders=Server: GitHub.com\nDate: Tue, 17 Oct 2023 20:32:52 GMT\nContent-Type: application/json\nContent-Length: 32\nETag: "652eed15-20"\nVary: Accept-Encoding, Accept, X-Requested-With\nX-GitHub-Request-Id: EC62:7EE0:166AAF5:2D51A8E:652EEF6A\nconnection: close\n\n`, e.g. https://github.com/pytorch/pytorch/actions/runs/6551579728/job/17793898278#step:18:22

Recently, it has been happening around 1/4 of the time, possibly more. I think this happens almost only on linux.

I believe this is somehow caused by a test, since distributed tests seems to be disproportionately affected, so I move the step to get the job id before the test step.  This also has the benefit of the test step being able to get the job id now if we want it.

Regardless of whether this works or not, it's a pretty harmless change that might make things easier in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111483
Approved by: https://github.com/huydhn
2023-10-18 20:54:06 +00:00
Huy Do
6e8079e00f Fix timeout value for memory leak check job (#111386)
Fixes https://github.com/pytorch/pytorch/pull/110193 as it doesn't work as expected:

* I forgot the timeout on the test step
* Also MacOS test job wasn't covered

### Testing

The job timeout is set correctly to 600 https://github.com/pytorch/pytorch/actions/runs/6541825177/job/17764485473#step:14:7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111386
Approved by: https://github.com/clee2000
2023-10-17 18:25:02 +00:00
Catherine Lee
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest. Zip the log into an artifact. The line listing all the test names is really long, but if you view the source of the raw logs it will not wrap, so it will only be one line. The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs. Do not zip. (A minimal sketch of this split follows the list.)
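A rough sketch of the success/failure split described above (the file names and helper are hypothetical, not the actual run_test.py logic):

```python
# Hypothetical sketch: capture a test run's output to a file, print it only on
# failure/rerun, and zip it into an artifact instead when the run succeeds.
import subprocess
import zipfile

def run_quietly(cmd, log_path="test.log", artifact="logs.zip"):
    with open(log_path, "w") as log:
        ret = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode
    if ret != 0:
        # failure/rerun: dump the full log so it shows up in the CI output
        print(open(log_path).read())
    else:
        # success: keep the console quiet and ship the log as an artifact instead
        with zipfile.ZipFile(artifact, "a") as zf:
            zf.write(log_path)
    return ret

if __name__ == "__main__":
    run_quietly(["python", "-m", "pytest", "test_ops.py"])
```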

Also
* change log artifact name

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
Huy Do
7827ae2864 Increase job timeout limit when running with memory leak check (#110193)
This fixes the daily timeout of ROCm jobs when running with memory leak check turned on.  I wanted to use something like `inputs.timeout-minutes * 2`, but that syntax, unfortunately, isn't supported in GitHub Actions YAML.  So I decided to just double the current timeout value of 300 minutes to make it 600 minutes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110193
Approved by: https://github.com/clee2000
2023-10-02 18:01:49 +00:00
Mark Saroufim
6268ab2c2d torchbench pin upd: hf auth token, clip, whisper, llamav2, sd (#106009)
Includes stable diffusion, whisper, llama7b and clip

To get this to work I had to pass the HF auth token to all CI jobs; GitHub does not pass secrets from parent to child workflows automatically. There's a likelihood HF will rate-limit us; in that case please revert this PR and I'll work on adding a cache next - cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet

Something upstream changed in torchbench too, where `hf_Bert` and `hf_Bert_large` are now both failing on some dynamic-shape-looking error which I'm not sure how to debug yet, so for now, although it felt a bit gross, I added a skip since others are building on top of this work @ezyang

`llamav2_7b_16h` cannot pass the accuracy checks because it OOMs when deep-cloning extra inputs; this seems to mean it does not need to show up in the expected-numbers CSV. Will figure this out when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
2023-08-03 16:28:40 +00:00
Huy Do
0e85c224f8 Use shareable calculate-docker-image GHA (#105372)
Switch from PyTorch `calculate-docker-image` GHA to its shareable version on test-infra https://github.com/pytorch/test-infra/pull/4397.

I will clean up PyTorch `calculate-docker-image` GHA in a separate PR after landing this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105372
Approved by: https://github.com/malfet
2023-07-19 05:02:01 +00:00
Catherine Lee
c16a28860f Reenable disabled tests by pr body (#103790)
Query for the list of reenabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels (so token usage should stay the same), move the code and tests related to parsing reenabled issues to the filter test config step, and remove the old code that fetched the PR body and commit message.  `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be reenabled.

For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work

More testing via:

```
python3 ".github/scripts/filter_test_configs.py" --workflow "pull" --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)" --test-matrix "{ include: [
    { config: "default", shard: 1, num_shards: 1 },
  ]}
  " --pr-number "" --tag "" --event-name "push" --schedule "" --branch ""
```

and

```
python3 ".github/scripts/filter_test_configs.py" --workflow "pull" --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)" --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}" --pr-number "103790" --tag "" --event-name "pull_request" --schedule "" --branch ""
```
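As a side note, a minimal sketch of how closing-keyword issue references could be turned into the comma-separated `REENABLED_ISSUES` value (the regex and helper are illustrative, not the actual parser):

```python
# Hypothetical sketch: collect issue numbers referenced via closing keywords
# ("Fixes #NNN", "Closes #NNN", ...) in a PR body and join them into the
# comma-separated REENABLED_ISSUES string.
import re

CLOSING_KEYWORDS = r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)"

def reenabled_issues(pr_body: str) -> str:
    numbers = re.findall(rf"{CLOSING_KEYWORDS}\s+#(\d+)", pr_body or "", re.IGNORECASE)
    return ",".join(numbers)

print(reenabled_issues("For testing: Fixes #103789"))  # -> 103789
```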
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
2023-06-22 19:47:11 +00:00
PyTorch MergeBot
58d11159bd Revert "Reenable disabled tests by pr body (#103790)"
This reverts commit 2237b4ad75.

Reverted https://github.com/pytorch/pytorch/pull/103790 on behalf of https://github.com/huydhn due to I think we tested it on PR but missed the logic in trunk where there is no PR number ([comment](https://github.com/pytorch/pytorch/pull/103790#issuecomment-1601890299))
2023-06-22 01:26:46 +00:00
Catherine Lee
2237b4ad75 Reenable disabled tests by pr body (#103790)
Query for the list of reenabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels (so token usage should stay the same), move the code and tests related to parsing reenabled issues to the filter test config step, and remove the old code that fetched the PR body and commit message.  `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be reenabled.

For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
2023-06-22 01:10:31 +00:00
Zain Rizvi
c3d3165f16 Enable uploading metrics and upload Test Reordering metrics to dynamodb (#102691)
Added a feature to upload test statistics to DynamoDB and Rockset using a new function `emit_metric` in `tools/stats/upload_stats_lib.py`.

Added metrics to measure test reordering effectiveness in `tools/testing/test_selections.py`.
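A rough sketch of what an `emit_metric`-style helper could look like (the table name, field names, and env-var plumbing below are assumptions, not the actual `upload_stats_lib` implementation):

```python
# Hypothetical sketch: attach CI context (repo, workflow, job) to a named
# metric and write it to a DynamoDB table.  Table and field names are made up.
import os
import time
import uuid

import boto3

def emit_metric(metric_name: str, metrics: dict) -> None:
    item = {
        "metric_id": str(uuid.uuid4()),
        "metric_name": metric_name,
        "timestamp": int(time.time()),
        "repo": os.environ.get("GITHUB_REPOSITORY", ""),
        "workflow": os.environ.get("GITHUB_WORKFLOW", ""),
        "job_name": os.environ.get("JOB_NAME", ""),  # later added in #113884
        "job_id": os.environ.get("JOB_ID", ""),      # later added in #113099
        **metrics,
    }
    boto3.resource("dynamodb").Table("torchci-metrics").put_item(Item=item)

# emit_metric("td_reordering", {"num_tests_reordered": 42})
```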
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102691
Approved by: https://github.com/malfet
2023-06-12 23:01:53 +00:00
Huy Do
04c1c2b791 Try to build the Docker image if it doesn't exist (#102562)
There is a bug in the test workflow where it could fail to find the new Docker image when the image hasn't yet become available on ECR, for example e71ab21422.  This is basically a race condition where the test job starts before the docker-build workflow has finished successfully.  The fix here is to make sure that the test job has the opportunity to build the image if it doesn't exist, the same as what the build workflow does atm.  Once the docker-build workflow finishes pushing the new image to ECR, that can then be used instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102562
Approved by: https://github.com/PaliC
2023-05-31 20:50:27 +00:00
Bin Bao
2a14652879 [CI] Introduce dashboard-tag to pass dashboard run configs (#101320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101320
Approved by: https://github.com/huydhn
2023-05-12 23:26:16 +00:00
Zain Rizvi
96f46316c9 Preserve PyTest Cache across job runs (#100522)
Preserves the PyTest cache from one job run to the next.  In a later PR, this will be used to change the order in which we actually run those tests

The process is:
1. Before running tests, check S3 to see if there is an uploaded cache from any shard of the current job
2. If there are, download them all and merge their contents. Put the merged cache in the default .pytest_cache folder
3. After running the tests, merge the now-current .pytest_cache folder with the cache previously downloaded for the current shard. This will make the merged cache contain all tests that have ever failed for the given PR in the current shard
4. Upload the resulting cache file back to S3

The S3 folder has a retention policy of 30 days, after which the uploaded cache files will get auto-deleted.
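A minimal sketch of the download-and-merge idea for the last-failed portion of the cache (the paths and merge policy are simplified assumptions, not the actual tooling):

```python
# Hypothetical sketch: merge the "lastfailed" files from several downloaded
# pytest caches into the local .pytest_cache, so the union of every test that
# has failed on any shard of this PR is available for reordering.
import json
from pathlib import Path

LASTFAILED = Path("v/cache/lastfailed")  # where pytest's cache plugin keeps it

def merge_lastfailed(downloaded_caches, target=Path(".pytest_cache")):
    merged = {}
    for cache_dir in downloaded_caches:
        f = Path(cache_dir) / LASTFAILED
        if f.exists():
            merged.update(json.loads(f.read_text()))  # union of failed test ids
    out = target / LASTFAILED
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(merged, indent=2))

# merge_lastfailed(["downloaded_cache_shard_1", "downloaded_cache_shard_2"])
```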
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100522
Approved by: https://github.com/huydhn
2023-05-10 18:37:28 +00:00
Nikita Shulga
7ff71a3a48 Populate download.pytorch.org IP to container (#100475)
Follow-up after https://github.com/pytorch/pytorch/pull/100436 to disable download.pytorch.org access over IPv6 due to access problems.

Why not copy `/etc/hosts` from the host to the container? Because it would break container IP resolution in distributed tests, which rely on `socket.gethostbyname(socket.gethostname())` to work.
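For reference, this is the resolution pattern the distributed tests depend on (just illustrating the standard-library call, not test code):

```python
# The distributed tests resolve the container's own hostname to an IP address;
# replacing /etc/hosts wholesale can break this lookup inside the container.
import socket

hostname = socket.gethostname()
print(hostname, "->", socket.gethostbyname(hostname))
```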

### <samp>🤖 Generated by Copilot at 756d0b1</samp>

Propagate `download.pytorch.org` IP address to docker containers in `test-pytorch-binary` action and workflow. This fixes DNS issues when downloading PyTorch binaries inside the containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100475
Approved by: https://github.com/huydhn
2023-05-02 22:08:06 +00:00
PyTorch MergeBot
e8a1d0be3e Revert "Mount /etc/hosts into container (#100475)"
This reverts commit 99ded8bbce.

Reverted https://github.com/pytorch/pytorch/pull/100475 on behalf of https://github.com/malfet due to Breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/100475#issuecomment-1532097309))
2023-05-02 20:23:32 +00:00
Nikita Shulga
99ded8bbce Mount /etc/hosts into container (#100475)
Follow-up after https://github.com/pytorch/pytorch/pull/100436 to disable download.pytorch.org access over IPv6 due to access problems.

### <samp>🤖 Generated by Copilot at 55c9443</samp>

This pull request improves the network configuration of the test-pytorch-binary GitHub action and workflow by mounting the host's `/etc/hosts` file into the container. This enables the container to resolve hostname aliases consistently with the host machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100475
Approved by: https://github.com/huydhn
2023-05-02 17:34:07 +00:00
Nikita Shulga
2418b94576 Rename default branch to main (#99210)
Mostly `s/@master/@main/` in numerous `.yml` files.

Keep `master` in `weekly.yml` as it refers to `xla` repo and in `test_trymerge.py` as it refers to a branch PR originates from.
2023-04-16 18:48:14 -07:00
Catherine Lee
06ad8d6d5f Remove filter step (#98969)
remove filter steps from linux, rocm, and mac tests

there are still some filter jobs in other places like bazel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98969
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-04-15 00:08:20 +00:00
Huy Do
4563adacc5 Update the use of nvidia-smi for GPU healthcheck (#98036)
This goes together with https://github.com/pytorch/test-infra/pull/3967 to:

* Provide a more accurate health check command with `nvidia-smi`
* Avoid running the check in the edge case when `nvidia-smi` doesn't even exist due to GitHub outage, i.e. https://github.com/pytorch/pytorch/actions/runs/4591098682/jobs/8107204277
* Also check for the number of GPU as part of the health check. The number of GPUs needs to be a power of 2 on a healthy runner.  Fixes https://github.com/pytorch/test-infra/issues/4000

### Testing

Luckily, the PR picked up the broken runner https://github.com/pytorch/pytorch/actions/runs/4640688249/jobs/8213191715, and the script correctly detected that the runner had only 3 of 4 GPUs and shut it down.
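A minimal sketch of the power-of-two GPU-count check described above (the real healthcheck lives in the workflow/test-infra scripts; this helper is illustrative):

```python
# Hypothetical sketch: count the GPUs that nvidia-smi can see and flag the
# runner as unhealthy if the count is not a power of two (e.g. 3 of 4 GPUs).
import subprocess
import sys

def gpu_count() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True, check=True
    ).stdout
    return len([line for line in out.splitlines() if line.strip()])

def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

if __name__ == "__main__":
    n = gpu_count()
    if not is_power_of_two(n):
        print(f"Unhealthy runner: found {n} GPU(s)")
        sys.exit(1)
    print(f"Healthy runner: found {n} GPU(s)")
```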
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98036
Approved by: https://github.com/weiwangmeta
2023-04-08 00:53:20 +00:00
Wei Wang
d4dbdee528 Update _linux-test.yml (#98317)
Skip "setup-ssh" for now for A100 runners from GCP, as they frequently encounter issues like "connect ETIMEDOUT 173.231.16.75:443" (about 10 occurrences every day).

Examples for just today so far:
| Timestamp | Workflow | Job |
| -- | -- | -- |
| 2023-04-04T15:07:50.916331Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4609056040/jobs/8146321650 |
| 2023-04-04T15:03:56.914692Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4609010125/jobs/8146217819 |
| 2023-04-04T14:39:58.004717Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4608784966/jobs/8145641764 |
| 2023-04-04T14:19:28.854825Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4608561116/jobs/8145147916 |
| 2023-04-04T06:15:39.241848Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604422106/jobs/8135687673 |
| 2023-04-04T06:10:21.056131Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604406947/jobs/8135611094 |
| 2023-04-04T05:34:50.908482Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604198332/jobs/8135201048 |
| 2023-04-04T03:04:36.628201Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4603162241/jobs/8133620905 |
| 2023-04-04T01:49:27.119830Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4600897505/jobs/8132760483 |
| 2023-04-04T01:18:06.141437Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4602745871/jobs/8132387930 |
| 2023-04-04T00:38:30.610770Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4602537869/jobs/8131938265 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98317
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-04-07 01:51:02 +00:00
Catherine Lee
38207a9e53 [ci][easy] Only print remaining logs if test step ran (#97713)
It sometimes spits out leftover logs from a previous run on the Windows ephemeral runner, but this might have been fixed by now.  I get a bit annoyed when the step runs even though it obviously isn't going to be useful since the test step didn't run.

always() is needed to ensure that it runs on test step failure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97713
Approved by: https://github.com/huydhn
2023-03-30 23:03:41 +00:00
Huy Do
f92cae4849 Fix a grep-itself bug when checking for GPU healthcheck (#97929)
The logic works (https://github.com/pytorch/pytorch/actions/runs/4558327458), but it also greps itself because `set -x` is set (ugh, debug messages).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97929
Approved by: https://github.com/malfet, https://github.com/weiwangmeta
2023-03-30 08:14:01 +00:00
PyTorch MergeBot
b093dfaefa Revert "Fix a grep-itself bug when checking for GPU healthcheck (#97929)"
This reverts commit f40b2ed59c.

Reverted https://github.com/pytorch/pytorch/pull/97929 on behalf of https://github.com/huydhn due to Rework to get rid of grep completely
2023-03-30 07:52:20 +00:00
Huy Do
f40b2ed59c Fix a grep-itself bug when checking for GPU healthcheck (#97929)
The logic works (https://github.com/pytorch/pytorch/actions/runs/4558327458), but it also greps itself because `set -x` is set (ugh, debug messages).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97929
Approved by: https://github.com/malfet, https://github.com/weiwangmeta
2023-03-30 00:25:43 +00:00
Huy Do
099b2801db Stop runner service when its GPU crashes (#97585)
Per the title, I'm looking for a way to take the runner out of service when its GPU crashes and can't recover.  Taking the faulty runner out of service would prevent future jobs from being assigned to it, as they would surely fail.

This is based on the observation that GPU crashes usually happen in the middle of the test or in the next `setup-nvidia` step.  This only happens on G5 runners with A10G GPUs, so the suspicion is that this is a hardware failure.  Updating to the newer NVIDIA driver (525.85.06) might or might not help with the issue (https://github.com/pytorch/pytorch/pull/96904), so I'm preparing this PR as a preemptive measure.  Here are the symptoms when the GPU crashes:

* The test fails with a "No CUDA GPUs are available" error when initializing CUDA.  For examples:
  * https://github.com/pytorch/pytorch/actions/runs/4506110581/jobs/7932832519
  * https://github.com/pytorch/pytorch/actions/runs/4507220502/jobs/7935084759
* Calling nvidia-smi times out after 60 seconds.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4496201282/jobs/7910938448
* Running nvidia-smi fails with an "unable to determine the device handle for GPU" unknown error
  * https://github.com/pytorch/pytorch/actions/runs/4546343549/jobs/8015359600
* Running `docker --gpus all` fails with an error response from the daemon, while the command `nvidia-container-cli` fails with `detection error: nvml error: unknown error`
  * https://github.com/pytorch/pytorch/actions/runs/4545579871/jobs/8013667872

I assume that an offline runner with a stopped runner service would be torn down and recycled properly by the infra scaling process.

### Testing
https://github.com/pytorch/pytorch/actions/runs/4517112069/jobs/7956204805.  When it runs, the code fetches the service name from the `${{ RUNNER_WORKSPACE }}/../../.service` file and issues `sudo systemctl stop ${RUNNER_SERVICE_NAME}` to stop the self-hosted runner service.

The job will show its status as `The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.`
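A rough Python equivalent of that stop sequence (the real implementation is a shell step in the workflow; this simply mirrors the description above under that assumption):

```python
# Hypothetical sketch: read the self-hosted runner's service name from the
# .service file two levels above the runner workspace and stop that service,
# so no further jobs get scheduled onto a runner whose GPU has crashed.
import os
import subprocess
from pathlib import Path

def stop_runner_service() -> None:
    workspace = Path(os.environ["RUNNER_WORKSPACE"])
    service_name = (workspace / ".." / ".." / ".service").read_text().strip()
    subprocess.run(["sudo", "systemctl", "stop", service_name], check=True)

# stop_runner_service()  # only meaningful on a self-hosted runner
```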
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97585
Approved by: https://github.com/jeanschmidt
2023-03-29 21:17:13 +00:00
Huy Do
2806fa4470 Use the latest NVIDIA driver from setup-nvidia (#97840)
This goes with https://github.com/pytorch/test-infra/pull/3949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97840
Approved by: https://github.com/ZainRizvi
2023-03-29 21:14:27 +00:00
Bin Bao
c55d1a6049 [CI] Experiment with a newer CUDA driver (#96904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96904
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
2023-03-24 17:05:18 +00:00
Wei Wang
9320cae1da Add GPU frequency lock option to inductor workflows running on A100 (#97465)
Fixes #97459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97465
Approved by: https://github.com/xuzhao9
2023-03-24 05:15:21 +00:00
Xuehai Pan
4b0e2e2cc6 Use official NVML Python bindings (#93925)
Use the official NVML Python binding package [`nvidia-ml-py`](https://pypi.org/project/nvidia-ml-py), which is maintained by the NVIDIA NVML team.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93925
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/ptrblck
2023-02-07 05:27:36 +00:00
Jane Xu
0ecb071fc4 [BE][CI] change references from .jenkins to .ci (#92624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92624
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-01-30 22:50:07 +00:00
Catherine Lee
27ab1dfc28 Remove print_test_stats, test_history, s3_stat_parser (#92841)
Pritam Damania no longer uses it (and is no longer with FB), and I don't know who else has interest in this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92841
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/seemethere
2023-01-27 18:11:42 +00:00
Catherine Lee
00f3e0d8c9 [ci] Set step level timeout (#93084)
Not super important, but it is nice for the logs because the logs now say "the action timed out" instead of "the action was cancelled".  It also makes the job status "failure" instead of "cancelled"

also adds timeout minutes as an input for rocm and mac tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93084
Approved by: https://github.com/huydhn
2023-01-27 17:52:33 +00:00
Zain Rizvi
92fbb35bff Upload failures shouldn't fail a CI that passed tests (#92996)
This'll reduce some flakiness we've been seeing recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92996
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-01-25 19:23:51 +00:00
Huy Do
5610766044 Mark test monitoring as an optional process (#92658)
This is an optional step that is OK to ignore when PyPI becomes flaky.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92658
Approved by: https://github.com/clee2000
2023-01-20 18:59:56 +00:00
Huy Do
ce43fc586f Register sccache epilogue before starting sccache (#92587)
Fixing flaky XLA test jobs where sccache fails to start with a timeout error, for example:

* https://github.com/pytorch/pytorch/actions/runs/3953719143/jobs/6770489428
* https://github.com/pytorch/pytorch/actions/runs/3952860712/jobs/6769339620
* https://github.com/pytorch/pytorch/actions/runs/3946315315/jobs/6754126326

XLA test job actually builds XLA as part of the test ~~, so it needs sccache~~

* Register sccache epilogue before starting sccache, so that any errors when starting sccache can be printed
* Add `-e SKIP_SCCACHE_INITIALIZATION=1` to the `_linux_test` workflow; this is the same flag used in the `_linux_build` workflow. Quoted the reason from the build script:

> sccache --start-server seems to hang forever on self hosted runners for GHA so let's just go ahead and skip the --start-server altogether since it seems as though sccache still gets used even when the sscache server isn't started explicitly

* Also fix the code alignment in `.jenkins/pytorch/common-build.sh`
* We don't even use sccache in XLA test job, but there is an S3 cache used by bazel there (`XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92587
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2023-01-19 16:14:31 +00:00
Catherine Lee
e67f5ab6cc Print and zip remaining test logs (#91510)
When CI times out or gets cancelled, the code to print and delete logs for currently running tests doesn't get run, which makes it hard to debug what's going on. So print the logs in a new step and also zip them into the usage-log zip (which should probably get a name change at some point).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91510
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-09 17:31:36 +00:00
Edward Z. Yang
ffd0b15a49 Add support for keep-going label (#90902)
This makes run_test.py keep going even on failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90902
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 06:47:06 +00:00
PyTorch MergeBot
82a191313e Revert "Add support for keep-going label (#90902)"
This reverts commit 855f4b7d24.

Reverted https://github.com/pytorch/pytorch/pull/90902 on behalf of https://github.com/huydhn due to This change breaks trunk where, unlike PR, there is no label
2022-12-16 05:07:49 +00:00
Edward Z. Yang
855f4b7d24 Add support for keep-going label (#90902)
This makes run_test.py keep going even on failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90902
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 04:03:52 +00:00
Wei Wang
1439ebd899 Enable inductor perf test on GCP A100 (#90322)
This PR tries to enable inductor performance nightly testing on A100 runners provided by GCP. Currently these GCP runners are created and maintained using scripts in https://github.com/fairinternal/pytorch-gha-infra/pull/82.
For some reason the artifacts cannot (and do not need to) be uploaded to S3, so this adds a use-gha parameter to _linux-test.yml to avoid creating a new but mostly identical _linux-test.yml.

Workflow test results: https://github.com/pytorch/pytorch/actions/runs/3642340544/jobs/6149691109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90322
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/desertfire
2022-12-13 17:47:01 +00:00