pytorch/torch/distributed
Felix Su feb4818bc9 [SJD] adding kill logic for current process when killing a worker (#141060)
Summary:
We have seen cases where some workers do not receive stop signals, meaning the watchdog is not stopped accordingly. This diff introduces logic to kill the current pid alongside the worker pid.

One thing to note: if the worker pid to be killed either does not exist or cannot be killed for some reason, the current pid will also not be killed. That seems acceptable, since the watchdog loop will simply attempt to kill the worker pid again on its next iteration (see the sketch after this commit header).

Test Plan: An experiment in the next diff shows this works.

Differential Revision: D65837085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060
Approved by: https://github.com/gag1jain
2024-12-18 00:13:02 +00:00
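A minimal sketch of the kill sequence described above (illustrative only; the helper name `_kill_worker_and_self` and its error handling are assumptions, not the actual torchelastic watchdog code):

```python
import logging
import os
import signal

log = logging.getLogger(__name__)

def _kill_worker_and_self(worker_pid: int) -> None:
    # Hypothetical sketch of the behavior described above: reap the
    # expired worker first, then kill the current process as well.
    try:
        os.kill(worker_pid, signal.SIGKILL)
    except ProcessLookupError:
        # Worker pid no longer exists; skip killing ourselves. The
        # watchdog loop will retry this worker pid on its next iteration.
        log.info("worker pid=%s does not exist, skipping", worker_pid)
        return
    except PermissionError:
        # Worker could not be killed; same retry-next-iteration behavior.
        log.warning("no permission to kill worker pid=%s", worker_pid)
        return
    # Only after the worker is successfully killed do we take down the
    # current pid, so a failed worker kill never strands the watchdog.
    os.kill(os.getpid(), signal.SIGKILL)
```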
_composable [FSDP2] Fix backward-compatible imports (#142419) 2024-12-09 23:56:32 +00:00
_shard Add support for other backends in get_preferred_device (#132118) 2024-12-16 18:30:41 +00:00
_sharded_tensor
_sharding_spec
_symmetric_memory [fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160) 2024-12-17 01:07:27 +00:00
_tensor
_tools [FSDP2] Move to public torch.distributed.fsdp (#141868) 2024-12-07 01:24:28 +00:00
algorithms Deprecate torch._utils.is_compiling() (#127690) 2024-12-08 22:55:36 +00:00
autograd
benchmarks
checkpoint [DSD] Fix strict=False case for DDP (#143038) 2024-12-12 21:15:21 +00:00
elastic [SJD] adding kill logic for current process when killing a worker (#141060) 2024-12-18 00:13:02 +00:00
examples
fsdp [FSDP2] Clamp reduce_dtype in lazy init (#143297) 2024-12-17 00:25:08 +00:00
launcher [BE] replace incorrect .. note:: invocations (#142868) 2024-12-11 19:58:18 +00:00
nn
optim [BE] replace incorrect .. note:: invocations (#142868) 2024-12-11 19:58:18 +00:00
pipelining [pipelining] fix backward_one_chunk when the output of the model is a… (#142237) 2024-12-12 20:59:35 +00:00
rpc remove allow-untyped-defs for distributed/rpc/_testing/__init__.py (#143271) 2024-12-16 02:35:37 +00:00
tensor [BE] typing for decorators - distributed/_tensor/ops/utils (#142139) 2024-12-16 21:19:33 +00:00
__init__.py
_checkpointable.py
_composable_state.py
_functional_collectives.py AsyncCollectiveTensor: fix _are_we_tracing() in dynamo (#142075) 2024-12-05 02:01:18 +00:00
_functional_collectives_impl.py
_state_dict_utils.py
argparse_util.py
c10d_logger.py
collective_utils.py
constants.py
CONTRIBUTING.md
device_mesh.py Use new group instead of split group on non-CUDA device (#141469) 2024-12-13 05:11:33 +00:00
distributed_c10d.py Register Intel distributed Backend (XCCL) in PyTorch distributed package (#141856) 2024-12-10 01:58:06 +00:00
launch.py
logging_handlers.py
remote_device.py
rendezvous.py
run.py
utils.py