135 commits

Author SHA1 Message Date
Yang Wang
6d4f5f7688 [Utilization][Usage Log] Add data model for record (#145114)
Add a data model for consistency and to make future data model changes easier.

The data model will be used during the post-test-process pipeline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114
Approved by: https://github.com/huydhn
2025-01-23 19:04:41 +00:00
Aleksei Nikiforov
53e2408015 Improve cleanup of cancelled jobs on s390x for tests too (#144968)
Follow up to https://github.com/pytorch/pytorch/pull/144149
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144968
Approved by: https://github.com/huydhn
2025-01-20 12:56:07 +00:00
Huy Do
b221f88fc1 Leave SCCACHE_S3_KEY_PREFIX empty to share the cache among all build jobs (#144704)
This is a follow-up of https://github.com/pytorch/pytorch/pull/144112#pullrequestreview-2528451214.  After leaving https://github.com/pytorch/pytorch/pull/144112 running for more than a week, all build jobs were fine, but I failed to see any improvement in build time.

So, let's try @malfet's suggestion by removing the prefix altogether to keep it simple.  After this lands, I will circle back on this to see if there are any improvements.  Otherwise, it's still a simple BE change I guess.

Here is the query I'm using to gather build time data for reference:

```
with jobs as (
    select
        id,
        name,
        DATE_DIFF('minute', created_at, completed_at) as duration,
        DATE_TRUNC('week', created_at) as bucket
    from
        workflow_job
    where
        name like '%/ build'
        and html_url like concat('%', {repo: String }, '%')
        and conclusion = 'success'
        and created_at >= (CURRENT_TIMESTAMP() - INTERVAL 6 MONTHS)
),
aggregated_jobs_in_bucket as (
    select
        --groupArray(duration) as durations,
        --quantiles(0.9)(duration),
        avg(duration),
        bucket
    from
        jobs
    group by
        bucket
)
select
    *
from
    aggregated_jobs_in_bucket
order by
    bucket desc
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144704
Approved by: https://github.com/clee2000
2025-01-14 02:19:38 +00:00
Aleksei Nikiforov
4143312e67 S390x ci periodic tests (#125401)
Periodically run testsuite for s390x

**Dependencies update**
Package z3-solver is updated from version 4.12.2.0 to version 4.12.6.0. This is a minor version update, so no functional change is expected.
The reason for the update is the build on s390x: PyPI doesn't provide a binary build of z3-solver 4.12.2.0 or 4.12.6.0 for s390x. Unfortunately, version 4.12.2.0 fails to build with the newer gcc used on s390x builders, but those errors are fixed in version 4.12.6.0, so this minor version bump fixes the build on s390x.

```
# pip3 install z3-solver==4.12.2.0
...
      In file included from /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:53:
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp: In member function ‘void* region::allocate(size_t)’:
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/tptr.h:29:62: error: ‘uintptr_t’ does not name a type
         29 | #define ALIGN(T, PTR) reinterpret_cast<T>(((reinterpret_cast<uintptr_t>(PTR) >> PTR_ALIGNMENT) + \
            |                                                              ^~~~~~~~~
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:82:22: note: in expansion of macro ‘ALIGN’
         82 |         m_curr_ptr = ALIGN(char *, new_curr_ptr);
            |                      ^~~~~
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:57:1: note: ‘uintptr_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
         56 | #include "util/page.h"
        +++ |+#include <cstdint>
         57 |
```

**Python paths update**
On AlmaLinux 8 s390x, old paths:
```
python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'
/usr/lib/python3.12/site-packages
```

Total result is `/usr/lib/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages`

New paths:
```
python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))'
/usr/local/lib64/python3.12/site-packages;/usr/local/lib/python3.12/site-packages;/usr/lib64/python3.12/site-packages;/usr/lib/python3.12/site-packages;/usr/local/lib64/python3.12/site-packages/torch;/usr/local/lib/python3.12/site-packages/torch;/usr/lib64/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages/torch
```
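The one-liner above can be read more easily when unrolled. The following is a minimal sketch of the same path construction; the function name `torch_search_paths` is hypothetical and only used for illustration.

```python
import site


def torch_search_paths() -> str:
    """Join every site-packages directory, then each directory's torch/ subfolder."""
    site_dirs = site.getsitepackages()
    # Same suffix as in the one-liner: append "/torch" to every site-packages path.
    torch_dirs = [d + "/torch" for d in site_dirs]
    return ";".join(site_dirs + torch_dirs)


print(torch_search_paths())
```

The output depends on the interpreter and distribution, so no expected value is shown here.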

```
# python -c 'import torch ; print(torch)'
<module 'torch' from '/usr/local/lib64/python3.12/site-packages/torch/__init__.py'>
```

`pip3 install dist/*.whl` installs torch into `/usr/local/lib64/python3.12/site-packages`, and later it's not found by cmake with old paths:

```
CMake Error at CMakeLists.txt:9 (find_package):
  By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Torch", but
  CMake did not find one.
```

https://github.com/pytorch/pytorch/actions/runs/10994060107/job/30521868178?pr=125401

**Builders availability**
Build took 60 minutes
Tests took: 150, 110, 65, 55, 115, 85, 50, 70, 105, 110 minutes (split into 10 shards)

60 + 150 + 110 + 65 + 55 + 115 + 85 + 50 + 70 + 105 + 110 = 975 minutes used. Let's double it. It would be 1950 minutes.

We have 20 machines * 24 hours = 20 * 24 * 60 = 20 * 1440 = 28800 minutes

We currently run 5 nightly binaries builds, each on average 90 minutes build, 15 minutes test, 5 minutes upload, 110 minutes total for each, 550 minutes total. Doubling would be 1100 minutes.

That leaves 28800 - 1100 = 27700 minutes total. Periodic tests would use 1950 minutes, which leaves 25750 minutes.

Nightly binaries build + nightly tests = 3050 minutes.

25750 / 3050 = 8.44. So we could do both 8 more times for additional CI runs for any reason. And that is with pretty good safety margin.
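
The capacity estimate above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic as stated in this message; the variable names are illustrative, and the factor of 2 is the doubling applied above as a safety margin.

```python
# Re-check of the builder capacity estimate; all figures come from the text above.
build_minutes = 60
test_shard_minutes = [150, 110, 65, 55, 115, 85, 50, 70, 105, 110]

periodic_once = build_minutes + sum(test_shard_minutes)  # one periodic run
periodic_budget = 2 * periodic_once                      # doubled for safety

capacity = 20 * 24 * 60  # 20 machines, minutes per day

nightly_once = 5 * (90 + 15 + 5)   # 5 nightly binary builds: build + test + upload
nightly_budget = 2 * nightly_once  # doubled for safety

remaining = capacity - nightly_budget - periodic_budget
extra_runs = remaining / (nightly_budget + periodic_budget)

print(periodic_once, periodic_budget, nightly_budget, remaining, round(extra_runs, 2))
# prints: 975 1950 1100 25750 8.44
```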

**Skip test_tensorexpr**
On s390x, PyTorch is built without LLVM.
Even if it were built with LLVM, LLVM currently doesn't support the required features on s390x, and the test fails with errors like:
```
JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
unknown file: Failure
C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
```
**Disable cpp/static_runtime_test on s390x**

Quantization is not fully supported on s390x in pytorch yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125401
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-10 18:21:07 +00:00
Huy Do
cbdc70ae07 Use the build environment as sccache prefix instead of workflow name (#144112)
This is an attempt to improve cache usage for jobs in non-pull workflows like periodic, slow, or inductor as we are seeing build timeout there from time to time, for example https://github.com/pytorch/pytorch/actions/runs/12553928804.  The build timeout never happens in pull or trunk AFAICT because they are more up to date with the cache content coming from the PR itself.

Logically, the same build should use the same cache regardless of the workflows.  We have many examples where the same build, for example [linux-focal-cuda12.4-py3.10-gcc9-sm86](https://github.com/search?q=repo%3Apytorch%2Fpytorch+linux-focal-cuda12.4-py3.10-gcc9-sm86&type=code), is split between different workflows and, thus, uses different caches.

I could gather some sccache stats from CH in the meantime to try to prove the improvement before and after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144112
Approved by: https://github.com/malfet
2025-01-03 17:33:03 +00:00
Huy Do
c15638d803 Enable swap on all Linux jobs (#143316)
A swapfile on Linux runner has been prepared by https://github.com/pytorch/test-infra/pull/6058.  So this PR does 2 things:

* Start using the swapfile on all Linux build and test jobs
* Testing the rollout https://github.com/pytorch-labs/pytorch-gha-infra/pull/582

### Testing

Run `swapon` inside the container and the swapfile shows up correctly:

```
jenkins@259dfb0a314c:~/workspace$ swapon
NAME      TYPE SIZE USED PRIO
/swapfile file   3G 256K   -2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143316
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2024-12-17 02:12:24 +00:00
Yang Wang
2b105de2c1 [Monitor] Enable non-perf linux test monitor (#142168)
# Overview
Enable monitoring for non-perf Linux tests

# Other
- Move the monitoring step right before the build artifact step in mac_test.yml; note that monitoring is not enabled for this test yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142168
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-12-11 01:10:43 +00:00
Huy Do
1a7055cb73 Record PR time benchmark results in JSON format (#140493)
I'm trying to make these benchmark results available in the OSS benchmark database, so that people can query them from outside.  The first step is to also record the results in the JSON format compatible with the database schema defined in https://github.com/pytorch/test-infra/pull/5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140493
Approved by: https://github.com/laithsakka
2024-11-20 18:54:01 +00:00
PyTorch MergeBot
a4e8ca789a Revert "Record PR time benchmark results in JSON format (#140493)"
This reverts commit 783cd9c8dd.

Reverted https://github.com/pytorch/pytorch/pull/140493 on behalf of https://github.com/huydhn due to I think I missed something in the workflow setup as the test is failing in non-test CI jobs ([comment](https://github.com/pytorch/pytorch/pull/140493#issuecomment-2487360455))
2024-11-20 04:04:07 +00:00
Huy Do
783cd9c8dd Record PR time benchmark results in JSON format (#140493)
I'm trying to make these benchmark results available in the OSS benchmark database, so that people can query them from outside.  The first step is to also record the results in the JSON format compatible with the database schema defined in https://github.com/pytorch/test-infra/pull/5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140493
Approved by: https://github.com/laithsakka
2024-11-20 01:48:00 +00:00
Yang Wang
175ba9fed6 [Utilization Monitor] input to disable utilization monitor (#140857)
# Overview
Currently monitor.py produces error-only results. This PR introduces a disable-monitor option for all *-test.yml workflows. We would also like to explore how the monitor code affects benchmark results.

# next steps
- fix the monitor.py
- enable non-benchmark tests with monitor
- investigate benchmark test behavior with monitor background job

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140857
Approved by: https://github.com/huydhn
2024-11-18 23:26:03 +00:00
Nikita Shulga
99c8d5af27 Don't pass credentials explicitly to sccache (#140611)
sccache-0.2.14 can query it through IMDSv1 and sccache-0.8.2 can do it through v2 (or maybe just use the trust relationship between host and bucket).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140611
Approved by: https://github.com/wdvr
2024-11-14 04:44:55 +00:00
Aleksei Nikiforov
057f0dca78 Don't use sudo to checkout sources (#140263)
Move this part out of https://github.com/pytorch/pytorch/pull/125401 and try using it for all architectures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140263
Approved by: https://github.com/zxiiro, https://github.com/huydhn
2024-11-12 14:29:17 +00:00
Nikita Shulga
ac6b6c6f98 [BE][CI] Use pip3 instead of pip (#140185)
On modern distros (see this oldie but goodie: https://launchpad.net/ubuntu/focal/+package/python-is-python3 ), the `pip` alias might be missing or might even point to a Python 2 installation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140185
Approved by: https://github.com/wdvr, https://github.com/huydhn, https://github.com/seemethere
2024-11-08 23:15:02 +00:00
Nikita Shulga
c81d4fd0a8 Upgrade sccache to v0.8.2 for CPU targets (#121323)
This essentially reverts https://github.com/pytorch/pytorch/pull/95997 but switches to builds from source to official mozilla's sccache repo for CPU builds, except PCH one, see https://github.com/pytorch/pytorch/issues/139188
- Define `SCCACHE_REGION` for the jobs that need it.
- Enable aarch64 builds to use sccache, which allows one to do incremental rebuilds under 10 min, see https://github.com/pytorch/pytorch/actions/runs/11565944328/job/32197278296

Fixes https://github.com/pytorch/pytorch/issues/121559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121323
Approved by: https://github.com/atalman
2024-10-29 19:54:36 +00:00
Catherine Lee
cc93c1e5e4 Upload artifacts during test run (#125799)
Zip and upload artifacts while run_test is running
Upgrade boto3 because I get errors about not having `botocore.vendored.six.move` if I don't
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125799
Approved by: https://github.com/huydhn
2024-10-22 16:48:57 +00:00
Jean Schmidt
466623fb51 [CI] Support for CI GPU test and benchmark on containers (#137169)
Renames the ARC references to container, and adds the changes required so that CI jobs requiring a GPU can run on containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137169
Approved by: https://github.com/huydhn
2024-10-02 17:10:59 +00:00
Jean Schmidt
e3fd4d796f [CI] Skip sccache for nvcc builds when building for A100 (#137170)
There is an unknown issue with nvcc builds and sccache; it crashes with:

```
      /opt/cache/bin/sccache /usr/local/cuda-12.1/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_py_EXPORTS -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/asmjit/src -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/cpuinfo/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.1/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -D_GLIBCXX_USE_CXX11_ABI=1 --expt-relaxed-constexpr -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -MD -MT CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o -MF CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o.d -x cu -c /tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/src/sparse_ops/sparse_index_select.cu -o CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o
      sccache: error: failed to execute compile
      sccache: caused by: error reading compile response from server
      sccache: caused by: Failed to read response header
      sccache: caused by: failed to fill whole buffer
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137170
Approved by: https://github.com/huydhn
2024-10-02 17:07:24 +00:00
Edward Z. Yang
32e057636c Enable scribe environment for compile-time benchmarks if requested. (#133891)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133891
Approved by: https://github.com/malfet
2024-08-21 18:02:54 +00:00
Edward Z. Yang
432638f521 Remove useless environment in reusable workflow (#133659)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133659
Approved by: https://github.com/Skylion007
2024-08-19 20:44:17 +00:00
Edward Z. Yang
99cf567714 Make SCRIBE_GRAPHQL_ACCESS_TOKEN available to test jobs running on main (#133536)
It is possible to write to Meta's internal in-memory database Scuba via the Scribe Graph API: https://www.internalfb.com/intern/wiki/Scribe/users/Knowledge_Base/Interacting_with_Scribe_categories/Graph_API/ This is currently being used by pytorch/benchmark repo to upload torchbench performance results.

I want to make this API generally available to all jobs running on CI in a semi-trusted context. To talk to Scribe, you need a secret access token. I have initially configured an environment prod-branch-main which contains `SCRIBE_GRAPHQL_ACCESS_TOKEN`, and switched a single class of jobs (linux-test) to use this environment when they are running on the main branch. Because we require approvals for running CI on untrusted contributions, we could potentially allow all jobs to run in this environment, including jobs on PRs, but I don't need this for my use case (per-PR benchmark result reporting, and miscellaneous statistics on main.)

If this works, I'll push out this environment to the rest of our test jobs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133536
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/albanD
2024-08-15 19:53:17 +00:00
Xuehai Pan
5cc34f61d1 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals`, or if "ci-test-showlocals"
is present in the PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
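
As an illustration of the behaviour described above, the sketch below maps the environment variable to pytest's standard `--showlocals` flag. The function name `pytest_extra_args` and the rule for which values count as "enabled" are assumptions for this example; the message does not say how the test runner reads `TEST_SHOWLOCALS`.

```python
import os


def pytest_extra_args(env=None):
    """Return extra pytest flags; --showlocals is added only if TEST_SHOWLOCALS is set."""
    env = os.environ if env is None else env
    args = []
    # Assumption: any non-empty value other than "0" enables the option.
    if env.get("TEST_SHOWLOCALS", "0") not in ("", "0"):
        args.append("--showlocals")
    return args


print(pytest_extra_args({"TEST_SHOWLOCALS": "1"}))  # prints: ['--showlocals']
print(pytest_extra_args({}))                        # prints: []
```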

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
2024-07-29 18:53:14 +00:00
PyTorch MergeBot
06fe99a097 Revert "[CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)"
This reverts commit dfa18bf3f3.

Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))
2024-07-29 13:09:41 +00:00
Xuehai Pan
dfa18bf3f3 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals`, or if "ci-test-showlocals"
is present in the PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
2024-07-29 07:40:42 +00:00
PyTorch MergeBot
ee140a198f Revert "[Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591)"
This reverts commit 03e8a4cf45.

Reverted https://github.com/pytorch/pytorch/pull/128591 on behalf of https://github.com/atalman due to Contains release only changes should not be landed ([comment](https://github.com/pytorch/pytorch/pull/128591#issuecomment-2168308233))
2024-06-14 15:51:00 +00:00
Xia, Weiwen
03e8a4cf45 [Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591)
Port #127592 from main to release/2.4

------
Fixes #127402

- Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py
- Add checks of mutation for QLinearPointwiseBinaryPT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592
Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128591
Approved by: https://github.com/jgong5, https://github.com/Chillee
2024-06-14 09:31:38 +00:00
Catherine Lee
61be8843c9 [TD] Use label to configure td on distributed for rollout (#122976)
Gate TD on distributed behind label

TODO:
auto add label to certain people's prs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122976
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-08 15:53:55 +00:00
Thanh Ha
5ecfe58cfb Remove ulimit setting for ARC dind-rootless (#122629)
Since ARC runners use dind-rootless mode, setting the ulimit in the docker run command is not possible, as the dind-rootless container does not have sufficient permissions to do that.

This change looks like it was coming from a migration from another CI system so perhaps it's not necessary anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122629
Approved by: https://github.com/jeanschmidt
2024-04-04 14:18:58 +00:00
Thanh Ha
a5cf9a5800 [CI] Do not install Nvidia drivers in ARC (#122890)
ARC Runners will provide working Nvidia drivers through the host configuration so this step is no longer necessary in the workflow as the ARC container is not able to install packages at the host level.

Also simplify the setup-linux condition on whether we are running in ARC, as we can achieve the same result without needing an extra shell step via the hashFiles() function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122890
Approved by: https://github.com/seemethere, https://github.com/jeanschmidt
2024-04-03 21:47:16 +00:00
PyTorch MergeBot
676a77177e Revert "[BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)"
This reverts commit 4cbf963894.

Reverted https://github.com/pytorch/pytorch/pull/121908 on behalf of https://github.com/jeanschmidt due to this is due to OIDC can't work on forked PR due to token write permissions can't be shared ([comment](https://github.com/pytorch/pytorch/pull/121908#issuecomment-2004707582))
2024-03-18 19:03:11 +00:00
Jean Schmidt
4cbf963894 [BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)
Switch to use LF S3 bucket for pull on linux-jammy-py3_9-gcc and docs jobs. This is required to migrate to ARC and move to use LF resources.

Depends on https://github.com/pytorch/pytorch/pull/121907
Follow up issue https://github.com/pytorch/pytorch/issues/121919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121908
Approved by: https://github.com/malfet
2024-03-15 09:09:53 +00:00
Jean Schmidt
4c3a052acf [BE] Add S3 bucket argument to number of workflows (#121907)
Namely, it adds the `s3-bucket` argument (with the default value set to `gha-artifacts`) to the following workflows:
- _docs
- _linux-test workflows
- download-build-artifacts
- pytest-cache-download
- upload-test-artifacts

This prerequisite is required in order to start migrating to other S3 buckets for asset storage. It is one of the required steps to migrate to ARC and move our assets away from our S3 to the Linux Foundation S3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121907
Approved by: https://github.com/malfet
2024-03-14 17:57:05 +00:00
Catherine Lee
06b52dd103 TD outside of test job (#118250)
Give TD its own job so that each shard can get the results from this one job artifact; they will always be in sync with each other, and we no longer need to worry about consistency issues.

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move TD calculation to own file that will create a json file with the final results
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (ex it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
Catherine Lee
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
Catherine Lee
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked; the test logs are printed while running, are interleaved, and are really long)

Settings for no timeout (step timeout still applies, only gets rid of ~30 min timeout for shard of test file) and no piping logs/extra verbose test logs (good for debugging deadlocks but results in very long and possibly interleaved logs).

Also allows these to be set via pr body if the label name is in brackets ex [label name] or the test above.
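
A minimal sketch of how a bracketed label name could be detected in a PR body is shown below. The function name `settings_from_pr_body` and the second label in the usage example are hypothetical; the actual parsing in the CI scripts may differ.

```python
import re


def settings_from_pr_body(pr_body, known_labels):
    """Return the known label names that appear in square brackets in the PR body."""
    # Collect every "[...]" group that contains no nested brackets.
    bracketed = {m.strip() for m in re.findall(r"\[([^\[\]]+)\]", pr_body)}
    return sorted(bracketed & set(known_labels))


body = "Test [ci-verbose-test-logs] (this worked)"
print(settings_from_pr_body(body, {"ci-verbose-test-logs", "some-other-label"}))
# prints: ['ci-verbose-test-logs']
```

Matching only against a set of known labels keeps unrelated bracketed text, such as a title tag like `[ez]`, from being treated as a setting.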

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
Catherine Lee
2bdc2a68cb [ez][td] Fix for emit metrics can't find JOB_NAME (#116748)
After #113884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116748
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-04 05:31:25 +00:00
Catherine Lee
b5578cb08b [ez] Remove unittest retries (#115460)
Pytest is used in CI now for reruns and I doubt people are using the env vars when running locally.  imo removing this code makes the run function easier to read
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115460
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:46:09 +00:00
Nikita Shulga
2f875c74bf Print ghcr docker pull during build/test (#114510)
To make debugging easier for external devs

Test plan: Copy and run command from [`Use the following to pull public copy of the image`](https://github.com/pytorch/pytorch/actions/runs/7012511180/job/19077533416?pr=114510#step:6:9):
```
docker pull ghcr.io/pytorch/ci-image:pytorch-linux-jammy-py3.8-gcc11-0d0042fd2e432ea07301ad6f6a474d36a581f0dc

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114510
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-11-28 04:38:17 +00:00
Catherine Lee
dab272eed8 [td] Consistent pytest cache (#113804)
Move the pytest cache downloading into the build step and store it in additional ci files so that it stays consistent during sharding.

Only build env is taken into account now instead of also test config since we might not have the test config during build time, making it less specific, but I also think this might be better since tests are likely to fail across the same test config (I also think it might be worth not even looking at build env but that's a different topic)

Each cache upload should only include information from the current run.  Do not merge current cache with downloaded cache during upload (shouldn't matter anyways since the downloaded cache won't exist at the time)

From what I can tell of the S3 retention policy, pytest cache files will be deleted after 30 days (cc @ZainRizvi to confirm), so we never have to worry about space or pulling old versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113804
Approved by: https://github.com/ZainRizvi
2023-11-17 23:45:47 +00:00
Zain Rizvi
9a9232956f Include job name in the emitted metrics (#113884)
What it says in the title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113884
Approved by: https://github.com/clee2000
2023-11-16 21:26:49 +00:00
Catherine Lee
6e73ae2022 [ci][ez] Add job_id to emit_metrics (#113099)
As in title.

Also print the job id in the step since I'm struggling to find it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113099
Approved by: https://github.com/seemethere
2023-11-08 10:32:41 +00:00
Huy Do
f6f81a5969 Update get-workflow-job-id to also return job name (#112103)
Then we can use this job name in `filter-test-configs` if it's available.  This addresses the issue in which `filter-test-configs` on GitHub runners (MacOS x86) couldn't find the runner log to get the job name.  This is expected because GitHub runners are isolated, so a job should not be able to access runner logs, which could contain information from other jobs.

This enables all the missing features that depend on running `filter-test-configs` on GitHub runners:
* Rerun disabled tests and memory leak check. For example, this would help avoid closing https://github.com/pytorch/pytorch/issues/110980#issuecomment-1779806466 early with the disabled test running properly on MacOS x86
* MacOS x86 jobs can now be disabled or marked as unstable

I keep the current logic to parse the log as a fallback because it's working fine on self-hosted runners.  That also handles the case where `get-workflow-job-id` fails.  Also I move the rest of `get-workflow-job-id` up before the test step like https://github.com/pytorch/pytorch/pull/111483
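
The fallback order described above can be sketched as follows. This is an illustration only: `resolve_job_name` and its parameters are hypothetical names, and the real logic lives in the workflow and `filter-test-configs` scripts.

```python
from typing import Callable, Optional


def resolve_job_name(
    reported_name: Optional[str],
    parse_runner_log: Callable[[], Optional[str]],
) -> Optional[str]:
    """Prefer the name from get-workflow-job-id; otherwise fall back to log parsing."""
    if reported_name:
        return reported_name
    try:
        # Fallback path: works on self-hosted runners where the log is readable.
        return parse_runner_log()
    except OSError:
        # Assumption: an unreadable runner log surfaces as an OSError.
        return None


print(resolve_job_name("linux / test", lambda: None))  # prints: linux / test
print(resolve_job_name(None, lambda: "name-from-log"))  # prints: name-from-log
```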

### Testing

Spot check some jobs to confirm they have the correct names:

* MacOS M1 test job https://github.com/pytorch/pytorch/actions/runs/6648305319/job/18065275722?pr=112103#step:10:8
* MacOS x86 build job https://github.com/pytorch/pytorch/actions/runs/6648306305/job/18065138137?pr=112103#step:9:14
* Linux test job https://github.com/pytorch/pytorch/actions/runs/6648300991/job/18065354503?pr=112103#step:13:7
* Windows test job https://github.com/pytorch/pytorch/actions/runs/6648305319/job/18065599500?pr=112103#step:12:7
* MacOS x86 test job https://github.com/pytorch/pytorch/actions/runs/6648306305/job/18066312801#step:10:8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112103
Approved by: https://github.com/clee2000
2023-10-26 16:42:46 +00:00
Catherine Lee
102fbd402c [ci] Move step to get workflow job id before test step in linux (#111483)
We’ve been struggling to get the job id since 9/28/2023 12:03 pm.  Before this we had almost 0 problems getting job id, but after, we get a lot of `Recieved status code '502' when attempting to retrieve https://api.github.com/repos/pytorch/pytorch/actions/runs/6551579728/jobs?per_page=100:\n", 'Bad Gateway\n\nheaders=Server: GitHub.com\nDate: Tue, 17 Oct 2023 20:32:52 GMT\nContent-Type: application/json\nContent-Length: 32\nETag: "652eed15-20"\nVary: Accept-Encoding, Accept, X-Requested-With\nX-GitHub-Request-Id: EC62:7EE0:166AAF5:2D51A8E:652EEF6A\nconnection: close\n\n` ex https://github.com/pytorch/pytorch/actions/runs/6551579728/job/17793898278#step:18:22

Recently, it has been happening around 1/4 of the time, possibly more. I think this happens almost exclusively on Linux.

I believe this is somehow caused by a test, since distributed tests seem to be disproportionately affected, so I move the step to get the job id before the test step.  This also has the benefit of the test step being able to get the job id now if we want it.

Regardless of whether this works or not, it's a pretty harmless change that might make things easier in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111483
Approved by: https://github.com/huydhn
2023-10-18 20:54:06 +00:00
Huy Do
6e8079e00f Fix timeout value for memory leak check job (#111386)
Fixes https://github.com/pytorch/pytorch/pull/110193 as it doesn't work as expected:

* I forgot the timeout on the test step
* Also MacOS test job wasn't covered

### Testing

The job timeout is set correctly to 600 https://github.com/pytorch/pytorch/actions/runs/6541825177/job/17764485473#step:14:7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111386
Approved by: https://github.com/clee2000
2023-10-17 18:25:02 +00:00
Catherine Lee
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest.  Zip the log into an artifact.  The line listing all the test names is really long, but if you view the raw logs, it does not wrap, so it is only one line.  The log classifier can also be configured to ignore this line.  Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs.  Do not zip.
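
The policy in the two bullets above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the code in this PR: `select_log_output`, `zip_log`, and the `is_summary` predicate are hypothetical names, and how a "what tests ran" line is recognized is left to the caller.

```python
import io
import zipfile
from typing import Callable, Iterable, List, Tuple


def select_log_output(
    lines: Iterable[str],
    failed: bool,
    is_summary: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Return (lines to print, whether to zip the full log).

    Hypothetical helper: failures/reruns print everything and skip zipping,
    successes print only the summary lines and zip the full log.
    """
    lines = list(lines)
    if failed:
        return lines, False
    return [line for line in lines if is_summary(line)], True


def zip_log(name: str, lines: Iterable[str]) -> bytes:
    """Compress the full log into an in-memory zip archive (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, "\n".join(lines))
    return buffer.getvalue()
```

With a predicate such as `lambda line: line.startswith("Running")`, a passing run would print only the matching lines and the caller would upload the bytes from `zip_log` as the artifact.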

Also
* change log artifact name

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
Huy Do
7827ae2864 Increase job timeout limit when running with memory leak check (#110193)
This fixes the daily timeout of ROCm jobs when running with memory leak check turned on.  I wanted to use something like `inputs.timeout-minutes * 2` but that syntax, unfortunately, isn't supported in GitHub Actions YAML.  So I decided to just double the current timeout value of 300 minutes to make it 600 minutes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110193
Approved by: https://github.com/clee2000
2023-10-02 18:01:49 +00:00
Mark Saroufim
6268ab2c2d torchbench pin upd: hf auth token, clip, whisper, llamav2, sd (#106009)
Includes stable diffusion, whisper, llama7b and clip

To get this to work I had to pass the HF auth token in to all CI jobs, since GitHub does not pass secrets from parent to child workflows automatically. There's a likelihood HF will rate limit us, in which case please revert this PR and I'll work on adding a cache next - cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet

Something upstream changed in torchbench too: `hf_Bert` and `hf_Bert_large` are now both failing with what looks like a dynamic-shape error, which I'm not sure how to debug yet. It felt a bit gross, but for now I added a skip since others are building on top of this work @ezyang

`llamav2_7b_16h` cannot pass the accuracy checks because it OOMs on deepcloning extra inputs. This seems to mean it does not need to show up in the expected numbers csv; will figure this out when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
2023-08-03 16:28:40 +00:00
Huy Do
0e85c224f8 Use shareable calculate-docker-image GHA (#105372)
Switch from PyTorch `calculate-docker-image` GHA to its shareable version on test-infra https://github.com/pytorch/test-infra/pull/4397.

I will clean up PyTorch `calculate-docker-image` GHA in a separate PR after landing this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105372
Approved by: https://github.com/malfet
2023-07-19 05:02:01 +00:00
Catherine Lee
c16a28860f Reenable disabled tests by pr body (#103790)
Query for the list of reenabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels so token usage should stay the same, move code and tests related to parsing reenabled issues to the filter test config step, and remove the old code to get the PR body and commit message.  `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be reenabled.
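
As a rough illustration of the idea (not the actual code in `filter_test_configs.py`), pulling issue numbers out of a PR body and formatting them for `REENABLED_ISSUES` could look like the sketch below. The regex, `parse_reenabled_issues`, and `to_env_value` are my own assumptions; in particular I assume only GitHub-style closing keywords followed by `#<number>` count.

```python
import re
from typing import List

# Assumed pattern: a closing keyword (close/fix/resolve and variants)
# followed by whitespace and "#<issue number>".
CLOSING_KEYWORD_RE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE
)


def parse_reenabled_issues(pr_body: str) -> List[str]:
    """Return referenced issue numbers in order of appearance, without duplicates."""
    seen = {}
    for number in CLOSING_KEYWORD_RE.findall(pr_body or ""):
        seen.setdefault(number, None)
    return list(seen)


def to_env_value(issues: List[str]) -> str:
    """Format issue numbers as the comma-separated REENABLED_ISSUES value."""
    return ",".join(issues)
```

For a PR body containing `Fixes #103789`, this sketch yields the single issue number `103789`; an empty or missing body yields an empty list, which covers the trunk case where there is no PR.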

For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work

More testing via `python3 ".github/scripts/filter_test_configs.py"     --workflow "pull"     --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)"     --test-matrix "{ include: [
    { config: "default", shard: 1, num_shards: 1 },
  ]}
  "     --pr-number ""     --tag ""     --event-name "push"     --schedule ""     --branch ""`
 and
 `python3 ".github/scripts/filter_test_configs.py"     --workflow "pull"     --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)"     --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}"     --pr-number "103790"     --tag ""     --event-name "pull_request"     --schedule ""     --branch ""`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
2023-06-22 19:47:11 +00:00
PyTorch MergeBot
58d11159bd Revert "Reenable disabled tests by pr body (#103790)"
This reverts commit 2237b4ad75.

Reverted https://github.com/pytorch/pytorch/pull/103790 on behalf of https://github.com/huydhn due to I think we tested it on PR but missed the logic in trunk where there is no PR number ([comment](https://github.com/pytorch/pytorch/pull/103790#issuecomment-1601890299))
2023-06-22 01:26:46 +00:00