This is a follow-up of https://github.com/pytorch/pytorch/pull/144112#pullrequestreview-2528451214. After leaving https://github.com/pytorch/pytorch/pull/144112 running for more than a week, all build jobs were fine, but I failed to see any improvement in build time.
So, let's try @malfet's suggestion of removing the prefix altogether to keep it simple. After this lands, I will circle back to see if there are any improvements. Otherwise, it's still a simple BE change, I guess.
Here is the query I'm using to gather build time data for reference:
```
with jobs as (
select
id,
name,
DATE_DIFF('minute', created_at, completed_at) as duration,
DATE_TRUNC('week', created_at) as bucket
from
workflow_job
where
name like '%/ build'
and html_url like concat('%', {repo: String }, '%')
and conclusion = 'success'
and created_at >= (CURRENT_TIMESTAMP() - INTERVAL 6 MONTHS)
),
aggregated_jobs_in_bucket as (
select
--groupArray(duration) as durations,
--quantiles(0.9)(duration),
avg(duration),
bucket
from
jobs
group by
bucket
)
select
*
from
aggregated_jobs_in_bucket
order by
bucket desc
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144704
Approved by: https://github.com/clee2000
Periodically run testsuite for s390x
**Dependencies update**
Package z3-solver is updated from version 4.12.2.0 to version 4.12.6.0. This is a minor version update, so no functional change is expected.
The reason for the update is the build on s390x. PyPI doesn't provide a binary build of z3-solver for s390x for either version 4.12.2.0 or 4.12.6.0. Unfortunately, version 4.12.2.0 fails to build with the newer gcc used on the s390x builders, but those errors are fixed in version 4.12.6.0, so this minor version bump fixes the build on s390x.
```
# pip3 install z3-solver==4.12.2.0
...
In file included from /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:53:
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp: In member function ‘void* region::allocate(size_t)’:
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/tptr.h:29:62: error: ‘uintptr_t’ does not name a type
29 | #define ALIGN(T, PTR) reinterpret_cast<T>(((reinterpret_cast<uintptr_t>(PTR) >> PTR_ALIGNMENT) + \
| ^~~~~~~~~
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:82:22: note: in expansion of macro ‘ALIGN’
82 | m_curr_ptr = ALIGN(char *, new_curr_ptr);
| ^~~~~
/tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:57:1: note: ‘uintptr_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
56 | #include "util/page.h"
+++ |+#include <cstdint>
57 |
```
**Python paths update**
On AlmaLinux 8 s390x, old paths:
```
python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'
/usr/lib/python3.12/site-packages
```
Total result is `/usr/lib/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages`
New paths:
```
python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))'
/usr/local/lib64/python3.12/site-packages;/usr/local/lib/python3.12/site-packages;/usr/lib64/python3.12/site-packages;/usr/lib/python3.12/site-packages;/usr/local/lib64/python3.12/site-packages/torch;/usr/local/lib/python3.12/site-packages/torch;/usr/lib64/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages/torch
```
```
# python -c 'import torch ; print(torch)'
<module 'torch' from '/usr/local/lib64/python3.12/site-packages/torch/__init__.py'>
```
`pip3 install dist/*.whl` installs torch into `/usr/local/lib64/python3.12/site-packages`, and with the old paths it is later not found by CMake:
```
CMake Error at CMakeLists.txt:9 (find_package):
By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "Torch", but
CMake did not find one.
```
https://github.com/pytorch/pytorch/actions/runs/10994060107/job/30521868178?pr=125401
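For illustration, a minimal Python sketch of how the new prefix list could be assembled and handed to CMake (the cmake invocation here is an assumption; the actual wiring lives in the CI build scripts):
```
# Sketch only: build the prefix list from site.getsitepackages() and pass it to
# CMake so find_package(Torch) can locate the torch/ subdirectory.
import site
import subprocess

site_dirs = site.getsitepackages()
prefix = ";".join(site_dirs + [d + "/torch" for d in site_dirs])
subprocess.run(["cmake", f"-DCMAKE_PREFIX_PATH={prefix}", ".."], check=True)
```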
**Builders availability**
Build took 60 minutes
Tests took: 150, 110, 65, 55, 115, 85, 50, 70, 105, 110 minutes (split into 10 shards)
60 + 150 + 110 + 65 + 55 + 115 + 85 + 50 + 70 + 105 + 110 = 975 minutes used. Let's double it. It would be 1950 minutes.
We have 20 machines * 24 hours = 20 * 24 * 60 = 20 * 1440 = 28800 minutes
We currently run 5 nightly binaries builds, each on average 90 minutes build, 15 minutes test, 5 minutes upload, 110 minutes total for each, 550 minutes total. Doubling would be 1100 minutes.
That leaves 28800 - 1100 = 27700 minutes total. Periodic tests would use 1950 minutes, which leaves 25750 minutes.
Nightly binaries build + nightly tests = 3050 minutes.
25750 / 3050 = 8.44. So we could do both 8 more times for additional CI runs for any reason. And that is with pretty good safety margin.
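The same estimate in a few lines of Python, as a sanity check (numbers copied from above):
```
# All numbers are minutes; doubling is the safety margin mentioned above.
build = 60
test_shards = [150, 110, 65, 55, 115, 85, 50, 70, 105, 110]
periodic = 2 * (build + sum(test_shards))  # 975 doubled -> 1950
nightly = 2 * 5 * (90 + 15 + 5)            # 5 nightly binary builds, doubled -> 1100
capacity = 20 * 24 * 60                    # 20 machines * 24 hours -> 28800
spare = capacity - nightly - periodic      # 25750
print(spare / (nightly + periodic))        # ~8.44 additional full runs
```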
**Skip test_tensorexpr**
On s390x, pytorch is built without llvm.
Even if it were built with LLVM, LLVM currently doesn't support the needed features on s390x, and the test fails with errors like:
```
JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
unknown file: Failure
C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
```
**Disable cpp/static_runtime_test on s390x**
Quantization is not fully supported on s390x in pytorch yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125401
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
This is an attempt to improve cache usage for jobs in non-pull workflows like periodic, slow, or inductor, as we are seeing build timeouts there from time to time, for example https://github.com/pytorch/pytorch/actions/runs/12553928804. The build timeout never happens in pull or trunk AFAICT because those workflows are more up to date with the cache content coming from the PR itself.
Logically, the same build should use the same cache regardless of the workflows. We have many examples where the same build, for example [linux-focal-cuda12.4-py3.10-gcc9-sm86](https://github.com/search?q=repo%3Apytorch%2Fpytorch+linux-focal-cuda12.4-py3.10-gcc9-sm86&type=code), is split between different workflows and, thus, uses different caches.
I could gather some sccache stats from CH in the meantime to try to prove the improvement before and after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144112
Approved by: https://github.com/malfet
# Overview
Currently monitor.py produces error-only results, so this PR introduces a disable-monitor option in all *-test.yml workflows. We would also like to explore how the monitor code affects benchmark results.
# Next steps
- fix the monitor.py
- enable non-benchmark tests with monitor
- investigate benchmark test behavior with monitor background job
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140857
Approved by: https://github.com/huydhn
It is possible to write to Meta's internal in-memory database Scuba via the Scribe Graph API: https://www.internalfb.com/intern/wiki/Scribe/users/Knowledge_Base/Interacting_with_Scribe_categories/Graph_API/ This is currently being used by pytorch/benchmark repo to upload torchbench performance results.
I want to make this API generally available to all jobs running on CI in a semi-trusted context. To talk to Scribe, you need a secret access token. I have initially configured an environment prod-branch-main which contains `SCRIBE_GRAPHQL_ACCESS_TOKEN`, and switched a single class of jobs (linux-test) to use this environment when they are running on the main branch. Because we require approvals for running CI on untrusted contributions, we could potentially allow all jobs to run in this environment, including jobs on PRs, but I don't need this for my use case (per-PR benchmark result reporting, and miscellaneous statistics on main.)
If this works, I'll push out this environment to the rest of our test jobs.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133536
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/albanD
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals" is
present in the PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
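As a rough illustration of the mechanism, a conftest.py hook along these lines would do it (this hook is an assumption, not the actual PyTorch wiring):
```
# Sketch only: if TEST_SHOWLOCALS is set, enable pytest's --showlocals behaviour
# so local variables are printed in failure tracebacks.
import os

def pytest_configure(config):
    if os.environ.get("TEST_SHOWLOCALS", "0") == "1":
        config.option.showlocals = True
```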
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
Since ARC runners use dind-rootless mode, setting the ulimit in the docker run command is not possible, as the dind-rootless container does not have sufficient permissions to do that.
This change looks like it came from a migration from another CI system, so perhaps it's not necessary anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122629
Approved by: https://github.com/jeanschmidt
ARC Runners will provide working Nvidia drivers through the host configuration so this step is no longer necessary in the workflow as the ARC container is not able to install packages at the host level.
Also simplify the setup-linux condition for whether we are running in ARC, as we can achieve the same result via the hashFiles() function without needing an extra shell step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122890
Approved by: https://github.com/seemethere, https://github.com/jeanschmidt
Namely, it adds the `s3-bucket` argument to the following workflows, with the default value set to `gha-artifacts`:
- _docs
- _linux-test workflows
- download-build-artifacts
- pytest-cache-download
- upload-test-artifacts
This prerequisite is required in order to start migrating asset storage to other S3 buckets; it is one of the required steps to migrate to ARC and move our assets away from our S3 to the Linux Foundation S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121907
Approved by: https://github.com/malfet
Give TD its own job so that each shard can get the results from this one job's artifact; shards will always be in sync with each other, and we no longer need to worry about consistency issues.
* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
* Cannot do cpp test discovery before building pytorch
* Move TD calculation to its own file that will create a JSON file with the final results (see the sketch after this list)
* TD is now job/build env agnostic
* TD will rank all tests, including those that a test job may not want to run (e.g. it will rank distributed tests along with default tests, even though those tests never run on the same machine together)
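A hypothetical sketch of how a test shard could consume that shared artifact (the file name and shape here are assumptions, not the actual TD output):
```
# Sketch only: every shard loads the same ranking artifact written by the TD
# job, so all shards see an identical ordering of tests.
import json

def load_test_ranking(path="td_results.json"):  # artifact name is an assumption
    with open(path) as f:
        return json.load(f)
```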
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long)
Adds settings for no timeout (the step timeout still applies; this only gets rid of the ~30 min timeout per shard of a test file) and for not piping logs / extra verbose test logs (good for debugging deadlocks, but results in very long and possibly interleaved logs).
Also allows these to be set via the PR body if the label name is in brackets, e.g. [label name], as in the test above.
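For illustration, picking bracketed keywords like `[ci-verbose-test-logs]` out of a PR body could look like the minimal sketch below (the helper and regex are hypothetical; the real logic lives in the test config filter scripts):
```
# Sketch only: collect [bracketed] keywords from a PR body.
import re

def keywords_from_pr_body(body):
    return {m.lower() for m in re.findall(r"\[([a-z0-9-]+)\]", body, re.IGNORECASE)}
```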
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
Move the pytest cache downloading into the build step and store it in the additional CI files so that it stays consistent during sharding.
Only the build env is taken into account now instead of also the test config, since we might not have the test config at build time. This makes the cache less specific, but I also think it might be better since tests are likely to fail across test configs (I also think it might be worth not even looking at the build env, but that's a different topic).
Each cache upload should only include information from the current run. Do not merge the current cache with the downloaded cache during upload (this shouldn't matter anyway since the downloaded cache won't exist at that time).
From what I can tell of the S3 retention policy, pytest cache files will be deleted after 30 days (cc @ZainRizvi to confirm), so we never have to worry about space or pulling old versions.
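A minimal sketch of the keying idea, assuming a simple key format (the actual key construction lives in the pytest cache scripts):
```
# Sketch only: key the pytest cache on the build environment alone, so every
# test shard of the same build downloads the same cache.
def pytest_cache_key(build_env):
    return f"pytest-cache-{build_env}"
```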
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113804
Approved by: https://github.com/ZainRizvi
We’ve been struggling to get the job id since 9/28/2023 12:03 pm. Before that we had almost no problems getting the job id, but after, we get a lot of `Recieved status code '502' when attempting to retrieve https://api.github.com/repos/pytorch/pytorch/actions/runs/6551579728/jobs?per_page=100:\n", 'Bad Gateway\n\nheaders=Server: GitHub.com\nDate: Tue, 17 Oct 2023 20:32:52 GMT\nContent-Type: application/json\nContent-Length: 32\nETag: "652eed15-20"\nVary: Accept-Encoding, Accept, X-Requested-With\nX-GitHub-Request-Id: EC62:7EE0:166AAF5:2D51A8E:652EEF6A\nconnection: close\n\n` errors, e.g. https://github.com/pytorch/pytorch/actions/runs/6551579728/job/17793898278#step:18:22
Recently, it has been happening around 1/4 of the time, possibly more. I think this happens almost only on Linux.
I believe this is somehow caused by a test, since distributed tests seem to be disproportionately affected, so I moved the step that gets the job id to before the test step. This also has the benefit that the test step can now get the job id if we want it.
Regardless of whether this works or not, it's a pretty harmless change that might make things easier in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111483
Approved by: https://github.com/huydhn
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest. Zip the log into an artifact. The line listing all the test names is really long, but if you view the source of the raw logs it will not wrap, so it will only be one line. The log classifier can also be configured to ignore this line. This gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [ 9%]`
* for failures/reruns, print logs. Do not zip.
Also
* change log artifact name
Examples of various logs:
a074db0f7f failures
1b439e24c4 failures
Possibly controversial, haha.
Should I include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
This fixes the daily timeout of ROCm jobs when running with memory leak check turned on. I want to use something like `inputs.timeout-minutes * 2`, but that syntax, unfortunately, isn't supported in GitHub Actions YAML. So I decided to just double the current timeout value of 300 minutes to make it 600 minutes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110193
Approved by: https://github.com/clee2000
Includes stable diffusion, whisper, llama7b and clip
To get this to work, I had to pass the HF auth token to all CI jobs; GitHub does not pass secrets from parent to child workflows automatically. There's a chance HF will rate limit us; in that case, please revert this PR and I'll work on adding a cache next - cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet
Something upstream changed in torchbench too, where `hf_Bert` and `hf_Bert_large` are now both failing on some dynamic-shape-looking error which I'm not sure how to debug yet, so for now, though it felt a bit gross, I added a skip since others are building on top of this work @ezyang
`llamav2_7b_16h` cannot pass the accuracy checks because it OOMs on deep-cloning extra inputs; this seems to mean it doesn't need to show up in the expected numbers csv. Will figure this out when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
Query for the list of re-enabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels, so token usage should stay the same; move the code and tests related to parsing re-enabled issues to the filter test config step; remove the old code that got the PR body and commit message. `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be re-enabled.
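For illustration, consuming `REENABLED_ISSUES` downstream could look like this hypothetical sketch (the real parsing lives in the CI scripts):
```
# Sketch only: parse the comma-separated issue numbers from the environment.
import os

reenabled = {
    int(num)
    for num in os.environ.get("REENABLED_ISSUES", "").split(",")
    if num.strip().isdigit()
}
```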
For testing: Fixes #103789
Check that 103789 shows up in the list of ignored disabled issues
Sanity check that test-config labels still work
More testing via `python3 ".github/scripts/filter_test_configs.py" --workflow "pull" --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)" --test-matrix "{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
" --pr-number "" --tag "" --event-name "push" --schedule "" --branch ""`
and
`python3 ".github/scripts/filter_test_configs.py" --workflow "pull" --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)" --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}" --pr-number "103790" --tag "" --event-name "pull_request" --schedule "" --branch ""`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn