12 commits

Author SHA1 Message Date
Huy Do
9f39123d18 Allow to continue when fail to configure Windows Defender (#103454)
Windows Defender will soon be removed from the AMI.  Without the service, the step fails with the following error:

```
Set-MpPreference : Invalid class
At C:\actions-runner\_work\_temp\1f029685-bb66-496d-beb8-19268ecbe44a.ps1:5 char:1
+ Set-MpPreference -DisableRealtimeMonitoring $True
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : MetadataError: (MSFT_MpPreference:root\Microsoft\...FT_MpPreference) [Set-MpPreference],
    CimException
    + FullyQualifiedErrorId : HRESULT 0x80041010,Set-MpPreference
```

For example, https://github.com/pytorch/pytorch-canary/actions/runs/5267043497/jobs/9521809176.  This is expected as the service is completely removed.
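
The "allow to continue" behavior can be illustrated outside the workflow. This is a minimal Python sketch, not the actual change made in this PR: `run_optional_step` is a hypothetical helper that treats both a missing executable and a non-zero exit code as a logged, non-fatal outcome, so later steps still run.

```python
import subprocess
import sys


def run_optional_step(command):
    """Run a setup command; report failure instead of aborting the job."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable itself is missing, e.g. the tool was removed from the image.
        print(f"skipped: {command[0]} not found", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Log the error but let the caller carry on with the remaining steps.
        print(
            f"ignored failure ({result.returncode}): {result.stderr.strip()}",
            file=sys.stderr,
        )
        return False
    return True
```

The return value lets a caller record that the step was skipped without turning it into a job failure.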

Here are all the places where `Set-MpPreference` is used according to https://github.com/search?type=code&q=org%3Apytorch+Set-MpPreference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103454
Approved by: https://github.com/atalman
2023-06-15 18:30:58 +00:00
PyTorch MergeBot
43127f19f1 Revert "Allow disable binary build jobs on CI (#100754)"
This reverts commit 4c3b52a5a9.

Reverted https://github.com/pytorch/pytorch/pull/100754 on behalf of https://github.com/huydhn due to The subset of Windows binary jobs running only in trunk fails because the runners do not have Python setup ([comment](https://github.com/pytorch/pytorch/pull/100754#issuecomment-1539586399))
2023-05-09 07:15:32 +00:00
Huy Do
4c3b52a5a9 Allow disable binary build jobs on CI (#100754)
Given the recent outage w.r.t. binary workflows running on CI, I want to close the gap between them and regular CI jobs.  The first part is to add the same filter step used by regular CI jobs so that oncalls can disable the job if needed.

* Nightly runs are excluded as they include the step to publish nightly binaries.  Allowing oncalls to disable this part requires more thought, so this covers only CI binary build and test jobs.
* As binary jobs don't have a concept of a test matrix config, which is a required parameter to the filter script, I use a pseudo input with the test config `default` there.

### Testing

* https://github.com/pytorch/pytorch/issues/100758.  The job is skipped in https://github.com/pytorch/pytorch/actions/runs/4911034089/jobs/8768782689
* https://github.com/pytorch/pytorch/issues/100759.  The job is skipped in https://github.com/pytorch/pytorch/actions/runs/4911033966/jobs/8768713669

Note that Windows binary jobs no longer run on PRs after https://github.com/pytorch/pytorch/pull/100638, and macOS binary jobs only run nightly.  So there are only Linux jobs left.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100754
Approved by: https://github.com/ZainRizvi
2023-05-09 06:53:34 +00:00
Huy Do
1d5577b601 No need to run Windows binary build for every PR (#100638)
Per the discussion with @malfet, there is no need to run the Windows binary build for every PR. We will keep it running in trunk (on push), though, just in case.

This also moves the workflow back from unstable after the symlink copy fix in 860d444515.

Another data point to back this up is the high correlation between the Windows binary debug and release builds and the Windows CPU CI job.  The numbers are:

* `libtorch-cpu-shared-with-deps-debug` and `win-vs2019-cpu-py3` have a 0.95 correlation
* `libtorch-cpu-shared-with-deps-release` and `win-vs2019-cpu-py3` have the same 0.95 correlation

The rest is noise, eh?
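
The commit message does not say how the correlation was computed. As a rough sketch, assuming a Pearson correlation over per-commit pass/fail outcomes (1 = passed, 0 = failed), it could look like this. The outcome lists below are made up for illustration and are not the real CI data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up outcomes per commit: 1 = job passed, 0 = job failed.
binary_build = [1, 1, 0, 1, 0, 1, 1, 0]
cpu_ci_job = [1, 1, 0, 1, 0, 1, 1, 1]
print(round(pearson(binary_build, cpu_ci_job), 2))
```

A value close to 1 means the two jobs almost always pass and fail together, which is the argument for not running both on every PR.
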
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100638
Approved by: https://github.com/atalman
2023-05-04 21:57:39 +00:00
Huy Do
478a5ddd8a Mark Windows CPU jobs as unstable (#100581)
After https://github.com/pytorch/pytorch/pull/100377, something removes the VS2019 installation on the non-ephemeral runner.  I think moving this to unstable is a nicer way to gather signals in trunk without completely disabling the job or reverting https://github.com/pytorch/pytorch/pull/100377 (for the Nth time).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100581
Approved by: https://github.com/clee2000, https://github.com/malfet
2023-05-03 21:43:43 +00:00
Jean Schmidt
2ac6ee7f12 Migrate jobs: windows.4xlarge->windows.4xlarge.nonephemeral (#100548)
This reopens PR https://github.com/pytorch/pytorch/pull/100377

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to non-ephemeral.

Possible impacts include breakages or misbehaving CI jobs that put the runners in a bad state. Other possible impacts relate to resource exhaustion as CI trash piles up on those instances, especially disk space, though memory might be a contender too.

As a middle-of-the-road approach, nonephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

The instance definitions can be found here: https://github.com/pytorch/test-infra/pull/4072

This is the first step in a multi-step approach where we will migrate away from all ephemeral Windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach follows:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)
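
The first two steps amount to a label rewrite in the workflow files. A minimal sketch of that rewrite, assuming plain text substitution is enough (the PR itself may have edited templates and regenerated files instead): `migrate_runner_labels` and `MIGRATIONS` are hypothetical names, while the labels come from the list above.

```python
import re

# Label pairs taken from the first two steps of the phased plan.
MIGRATIONS = {
    "windows.4xlarge": "windows.4xlarge.nonephemeral",
    "windows.8xlarge.nvidia.gpu": "windows.8xlarge.nvidia.gpu.nonephemeral",
}


def migrate_runner_labels(workflow_text):
    """Replace ephemeral runner labels, leaving already-migrated ones alone."""
    for old, new in MIGRATIONS.items():
        # Require that the label is not part of a longer dotted name on either side.
        pattern = r"(?<![\w.])" + re.escape(old) + r"(?![\w.])"
        workflow_text = re.sub(pattern, new, workflow_text)
    return workflow_text
```

The lookarounds make the rewrite idempotent, so running it again on an already migrated file changes nothing, and `windows.g5.4xlarge.nvidia.gpu` is left untouched.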

The reasoning for this phased approach is to reduce the scope of possible contenders to investigate if particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 7caac545b1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100548
Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
2023-05-03 15:47:18 +00:00
PyTorch MergeBot
543b7ebb50 Revert "Migrate jobs from windows.4xlarge windows.4xlarge.nonephemeral instances (#100377)"
This reverts commit 7caac545b1.

Reverted https://github.com/pytorch/pytorch/pull/100377 on behalf of https://github.com/malfet due to This is not the PR I've reviewed ([comment](https://github.com/pytorch/pytorch/pull/100377#issuecomment-1532148086))
2023-05-02 21:05:53 +00:00
Jean Schmidt
7caac545b1 Migrate jobs from windows.4xlarge windows.4xlarge.nonephemeral instances (#100377)
This reopens PR [100091](https://github.com/pytorch/pytorch/pull/100091)

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to non-ephemeral.

Possible impacts include breakages or misbehaving CI jobs that put the runners in a bad state. Other possible impacts relate to resource exhaustion as CI trash piles up on those instances, especially disk space, though memory might be a contender too.

As a middle-of-the-road approach, nonephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

The instance definitions can be found here: https://github.com/pytorch/test-infra/pull/4072

This is the first step in a multi-step approach where we will migrate away from all ephemeral Windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach follows:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to reduce the scope of possible contenders to investigate if particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
2023-05-02 20:41:12 +00:00
PyTorch MergeBot
e5291e633f Revert "Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100091)"
This reverts commit 1183eecbf1.

Reverted https://github.com/pytorch/pytorch/pull/100091 on behalf of https://github.com/huydhn due to CPU jobs start failing in trunk due to some error in MSVC setup
2023-04-26 19:17:58 +00:00
Jean Schmidt
1183eecbf1 Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100091)
2023-04-26 18:32:50 +02:00
Huy Do
06f19fdbe5 Turn off Windows Defender in temp folder on binary build workflow (#99389)
This issue started to show up recently (https://github.com/pytorch/pytorch/actions/runs/4724983231/jobs/8385139626), and I'm pretty sure that the root cause is Windows Defender, as I did a similar fix on Windows CI a while ago (https://github.com/pytorch/pytorch/pull/96931).  Without this, the Windows binary build could fail flakily when Windows Defender chooses to delete or quarantine a file in the temp folder.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99389
Approved by: https://github.com/weiwangmeta
2023-04-18 16:45:38 +00:00
Nikita Shulga
2418b94576 Rename default branch to main (#99210)
Mostly `s/@master/@main` in numerous `.yml` files.

Keep `master` in `weekly.yml` as it refers to `xla` repo and in `test_trymerge.py` as it refers to a branch PR originates from.
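
A rough sketch of the substitution described above, assuming it is applied to file contents held in memory: `rename_default_branch` and `KEEP_MASTER` are hypothetical names, and the exclusion list only mirrors the two files named in this message.

```python
import re

# Files the commit message says keep `master`; matched on the basename only.
KEEP_MASTER = {"weekly.yml", "test_trymerge.py"}


def rename_default_branch(files):
    """Return a copy of {path: text} with `@master` refs rewritten to `@main`."""
    updated = {}
    for path, text in files.items():
        basename = path.rsplit("/", 1)[-1]
        if basename in KEEP_MASTER:
            updated[path] = text
        else:
            # Only touch `@master` when it ends the ref, not e.g. `@master-foo`.
            updated[path] = re.sub(r"@master(?![\w-])", "@main", text)
    return updated
```

The input mapping is left unmodified, so the result can be diffed against it before anything is written back.
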
2023-04-16 18:48:14 -07:00
Renamed from .github/workflows/generated-windows-binary-libtorch-debug-master.yml