# PyTorch CI Stats
We track various stats about each CI job.
- Jobs upload their artifacts to an intermediate data store (either GitHub
  Actions artifacts or S3, depending on what permissions the job has). Example:
  `a9f6a35a33/.github/workflows/_linux-build.yml` (L144-L151)
- When a workflow completes, a `workflow_run` event triggers
  `upload-test-stats.yml`. `upload-test-stats` downloads the raw stats from the
  intermediate data store and uploads them as JSON to Rockset, our metrics
  backend.
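Conceptually, the upload step turns raw per-test timing reports into flat JSON documents that a metrics backend can ingest. A minimal sketch of that transformation (the `flatten_report` helper and the field names here are illustrative, not the actual `upload_test_stats.py` API):

```python
import json

def flatten_report(report: dict) -> list[dict]:
    """Flatten a nested {suite: {test: seconds}} timing report into
    one JSON-serializable record per test case."""
    records = []
    for suite, cases in report.items():
        for test, seconds in cases.items():
            records.append({"suite": suite, "test": test, "time_s": seconds})
    return records

# Example: two suites with one test case each.
report = {"test_nn": {"test_conv": 1.5}, "test_ops": {"test_add": 0.2}}
docs = flatten_report(report)
print(json.dumps(docs))
```

Each record is self-describing, so the backend can index and query individual test cases without knowing the original report layout.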
```mermaid
graph LR
    J1[Job with AWS creds<br>e.g. linux, win] --raw stats--> S3[(AWS S3)]
    J2[Job w/o AWS creds<br>e.g. mac] --raw stats--> GHA[(GH artifacts)]
    S3 --> uts[upload-test-stats.yml]
    GHA --> uts
    uts --json--> R[(Rockset)]
```
Why this weird indirection? Because writing to Rockset requires special permissions which, for security reasons, we do not want to give to pull request CI. Instead, we implemented GitHub's recommended pattern for cases like this.
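The pattern in question is GitHub's `workflow_run` trigger: the privileged upload workflow runs in the context of the base repository (where it can hold write credentials), fires only after the untrusted CI workflow finishes, and consumes its artifacts rather than its code. A minimal trigger sketch (the workflow names listed are illustrative):

```yaml
# upload-test-stats.yml (trigger section only)
on:
  workflow_run:
    # Fire after these CI workflows finish, in the base repo's context,
    # so this workflow can use secrets that PR CI must never see.
    workflows: [pull, trunk]
    types:
      - completed
```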
For more details about what stats we export, check out
`upload-test-stats.yml`.