pytorch/torch/distributed
Felix Su feb4818bc9 [SJD] adding kill logic for current process when killing a worker (#141060)
Summary:
We have seen cases where some workers do not receive stop signals, meaning the watchdog is not stopped accordingly. This diff introduces logic to kill the current pid alongside the worker pid.

One thing to note: if the worker pid to be killed either does not exist or cannot be killed for some reason, the current pid will also not be killed. That seems acceptable, since the watchdog loop will simply attempt to kill the worker pid again on its next iteration (see the sketch after this commit header).

Test Plan: An experiment in the next diff shows this works.

Differential Revision: D65837085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060
Approved by: https://github.com/gag1jain
2024-12-18 00:13:02 +00:00
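A minimal sketch of the kill sequence described above (illustrative only; the helper name `_kill_worker_and_self` and its error handling are assumptions, not the actual torchelastic watchdog code):

```python
import logging
import os
import signal

log = logging.getLogger(__name__)

def _kill_worker_and_self(worker_pid: int) -> None:
    # Hypothetical sketch of the behavior described above: reap the
    # expired worker first, then kill the current process as well.
    try:
        os.kill(worker_pid, signal.SIGKILL)
    except ProcessLookupError:
        # Worker pid no longer exists; skip killing ourselves. The
        # watchdog loop will retry this worker pid on its next iteration.
        log.info("worker pid=%s does not exist, skipping", worker_pid)
        return
    except PermissionError:
        # Worker could not be killed; same retry-next-iteration behavior.
        log.warning("no permission to kill worker pid=%s", worker_pid)
        return
    # Only after the worker is successfully killed do we take down the
    # current pid, so a failed worker kill never strands the watchdog.
    os.kill(os.getpid(), signal.SIGKILL)
```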
_composable [FSDP2] Fix backward-compatible imports (#142419) 2024-12-09 23:56:32 +00:00
_shard Add support for other backends in get_preferred_device (#132118) 2024-12-16 18:30:41 +00:00
_sharded_tensor
_sharding_spec
_symmetric_memory [fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160) 2024-12-17 01:07:27 +00:00
_tensor
_tools [FSDP2] Move to public torch.distributed.fsdp (#141868) 2024-12-07 01:24:28 +00:00
algorithms Deprecate torch._utils.is_compiling() (#127690) 2024-12-08 22:55:36 +00:00
autograd
benchmarks
checkpoint [DSD] Fix strict=False case for DDP (#143038) 2024-12-12 21:15:21 +00:00
elastic [SJD] adding kill logic for current process when killing a worker (#141060) 2024-12-18 00:13:02 +00:00
examples
fsdp [FSDP2] Clamp reduce_dtype in lazy init (#143297) 2024-12-17 00:25:08 +00:00
launcher [BE] replace incorrect .. note:: invocations (#142868) 2024-12-11 19:58:18 +00:00
nn
optim [BE] replace incorrect .. note:: invocations (#142868) 2024-12-11 19:58:18 +00:00
pipelining [pipelining] fix backward_one_chunk when the output of the model is a… (#142237) 2024-12-12 20:59:35 +00:00
rpc remove allow-untyped-defs for distributed/rpc/_testing/__init__.py (#143271) 2024-12-16 02:35:37 +00:00
tensor [BE] typing for decorators - distributed/_tensor/ops/utils (#142139) 2024-12-16 21:19:33 +00:00
__init__.py
_checkpointable.py
_composable_state.py
_functional_collectives.py AsyncCollectiveTensor: fix _are_we_tracing() in dynamo (#142075) 2024-12-05 02:01:18 +00:00
_functional_collectives_impl.py
_state_dict_utils.py
argparse_util.py
c10d_logger.py
collective_utils.py
constants.py
CONTRIBUTING.md
device_mesh.py Use new group instead of split group on non-CUDA device (#141469) 2024-12-13 05:11:33 +00:00
distributed_c10d.py Register Intel distributed Backend (XCCL) in PyTorch distributed package (#141856) 2024-12-10 01:58:06 +00:00
launch.py
logging_handlers.py
remote_device.py
rendezvous.py
run.py
utils.py