Disable core dump when rerunning disabled tests (#104131)

Fixes https://github.com/pytorch/pytorch/issues/103612

Finding a way to dynamically stop generating core dumps on the Linux runners is harder than I thought.  The recommended solution is to set a custom script in `/proc/sys/kernel/core_pattern`, as documented in https://man7.org/linux/man-pages/man5/core.5.html, so that we could stop generating core files once the free disk space drops below a certain threshold.  However, AFAICT this is not yet supported inside a Docker container (https://stackoverflow.com/questions/59986788).
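
For reference, the `core_pattern` approach could look roughly like the sketch below. This is not part of this PR and is untested on the CI runners. The handler path, `CORE_DIR`, `MIN_FREE_KB`, and the function names are all hypothetical, and registering the handler requires root on the host.

```shell
#!/usr/bin/env bash
# Hypothetical core_pattern handler (a sketch, not what this PR does).
# It would be registered on the host, as root, with something like:
#   echo '|/usr/local/bin/core_handler.sh %e %p' > /proc/sys/kernel/core_pattern
# The script path, CORE_DIR, and MIN_FREE_KB are illustrative assumptions.

CORE_DIR="${CORE_DIR:-/var/crash}"
MIN_FREE_KB="${MIN_FREE_KB:-10485760}"  # 10 GiB, an arbitrary threshold

# Print the free space (in KiB) of the filesystem holding directory $1.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if directory $1 has at least $2 KiB free.
has_room() {
  [ "$(free_kb "$1")" -ge "$2" ]
}

# The kernel streams the core image to the handler's stdin.
# $1 is the executable name (%e), $2 is the PID (%p).
handle_core() {
  if has_room "$CORE_DIR" "$MIN_FREE_KB"; then
    cat > "$CORE_DIR/core.$1.$2"
  else
    cat > /dev/null  # drain and discard the dump to protect disk space
  fi
}

# Only act when invoked with the two arguments from the pattern above.
if [ "$#" -ge 2 ]; then
  handle_core "$1" "$2"
fi
```

The threshold check runs per crash, so dumps would resume automatically once space is freed.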

In addition, when the runner runs out of space, none of the subsequent cleanup steps can run.  The next job on that runner will also fail because nothing can be set up, e.g. https://github.com/pytorch/pytorch/actions/runs/5357044327/jobs/9717914230

So this is only a limited fix: it stops core dumps from being generated while rerunning disabled tests, because a crashing test is run multiple times there and generates a core file on each run.

### Testing

```
ulimit -c 0
kill -3 PID
```

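Check that no core file is generated afterwards.  Note that `PID` must be a process started from the same shell, since `ulimit` only applies to that shell and its children.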
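
The manual check above can also be scripted. The following is a minimal sketch under a few assumptions: `check_no_core` is a hypothetical helper name, `core_pattern` is assumed to name a plain file relative to the crashing process's working directory, and it sends SIGABRT instead of `kill -3` (SIGQUIT) because bash ignores SIGQUIT and non-interactive shells start background jobs with it ignored.

```shell
#!/usr/bin/env bash
# Sketch: crash a throwaway process with core dumps disabled, then verify
# that nothing was written to its working directory.
# Assumes core_pattern is a plain relative file name, so a dump (if any)
# would land in the crashing process's current directory.

check_no_core() {
  local workdir status=0 leftovers
  workdir=$(mktemp -d) || return 2
  (
    cd "$workdir" || exit 2
    ulimit -c 0              # applies to this subshell and its children only
    sh -c 'kill -s ABRT $$'  # the child shell aborts itself
  ) 2>/dev/null || status=$?
  leftovers=$(ls -A "$workdir")
  rm -rf "$workdir"
  echo "crash exit status: $status"  # 134 = 128 + SIGABRT (6) on Linux
  # Pass only if the process really crashed and left no files behind.
  [ "$status" -eq 134 ] && [ -z "$leftovers" ]
}

if check_no_core; then
  echo "no core file generated"
else
  echo "unexpected result: core file found or process did not crash"
fi
```

Running the crash in a subshell keeps the `ulimit -c 0` setting from leaking into the calling shell.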
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104131
Approved by: https://github.com/kit1980, https://github.com/malfet
Huy Do 2023-06-24 02:29:50 +00:00 committed by PyTorch MergeBot
parent 75dab587ef
commit 202a9108f7


@@ -58,6 +58,19 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
  export VALGRIND=OFF
fi
if [[ "${PYTORCH_TEST_RERUN_DISABLED_TESTS}" == "1" ]]; then
  # When rerunning disable tests, do not generate core dumps as it could consume
  # the runner disk space when crashed tests are run multiple times. Running out
  # of space is a nasty issue because there is no space left to even download the
  # GHA to clean up the disk
  ulimit -c 0
  # Note that by piping the core dump to a script set in /proc/sys/kernel/core_pattern
  # as documented in https://man7.org/linux/man-pages/man5/core.5.html, we could
  # dynamically stop generating more core file when the disk space drops below a
  # certain threshold. However, this is not supported inside Docker container atm
fi
# Get fully qualified path using realpath
if [[ "$BUILD_ENVIRONMENT" != *bazel* ]]; then
  CUSTOM_TEST_ARTIFACT_BUILD_DIR=$(realpath "${CUSTOM_TEST_ARTIFACT_BUILD_DIR:-"build/custom_test_artifacts"}")