pytorch/torch/distributed
Shuqiang Zhang 367213a608 [c10] add an option to pg_config split share (#130877)
Summary:
context is: #129865
We want to give users an option not to share comms resources so that
comm ops can overlap
Test Plan:
Augmented UT

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877
Approved by: https://github.com/fduwjj
2024-07-18 19:03:00 +00:00
..
_composable [Traceable FSDP2] Preserve fsdp.set_ op through lowering; Add unit test for multiple .set_ into same primal; Add unit test for FSDP2 module layer reuse (#130786) 2024-07-17 23:25:42 +00:00
_shard typing proxy_tensor.py (#129182) 2024-07-12 23:17:09 +00:00
_sharded_tensor
_sharding_spec
_spmd typing proxy_tensor.py (#129182) 2024-07-12 23:17:09 +00:00
_symmetric_memory [SymmetricMemory] make sure different subgroups with the same name use different store prefixes (#130756) 2024-07-16 20:21:05 +00:00
_tensor [export] Fully support extension op in serialization/deserialization. (#130851) 2024-07-18 16:47:53 +00:00
_tools [BE]: Update flake8-comprehensions and enable C420 (#130699) 2024-07-16 13:47:49 +00:00
algorithms [BE] Make ActivationWrapper an abstract class (#129808) 2024-07-02 04:29:43 +00:00
autograd
benchmarks [Ez][BE]: Enable new stable ruff rules (#129825) 2024-07-02 14:47:10 +00:00
checkpoint DSD for TorchTune LoRA (#129635) 2024-07-18 17:00:35 +00:00
elastic elastic/store: use wait instead of get for barrier (#130148) 2024-07-08 19:53:42 +00:00
examples
fsdp [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206) 2024-07-14 08:17:52 +00:00
launcher
nn
optim Pass torch.load(weights_only=) internally to avoid FutureWarning (#130663) 2024-07-16 01:24:38 +00:00
pipelining Added zb1p schedule (#130210) 2024-07-14 17:32:59 +00:00
rpc [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
tensor Fix loss_parallel with BF16 logits (#130550) 2024-07-12 15:47:38 +00:00
__init__.py
_checkpointable.py
_composable_state.py
_functional_collectives.py
_functional_collectives_impl.py
_state_dict_utils.py [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199) 2024-07-11 17:30:28 +00:00
argparse_util.py
c10d_logger.py [DCP] Fix duplicated logging messages when enable both c10d and dcp l… (#130423) 2024-07-11 13:43:39 +00:00
collective_utils.py
constants.py
CONTRIBUTING.md
device_mesh.py [DeviceMesh][Reland] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495) (#130685) 2024-07-15 20:05:26 +00:00
distributed_c10d.py [c10] add an option to pg_config split share (#130877) 2024-07-18 19:03:00 +00:00
launch.py
logging_handlers.py
remote_device.py [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206) 2024-07-14 08:17:52 +00:00
rendezvous.py
run.py
utils.py [Reland][PT-D] Relaxed contract to allow Sequence[nn.Module] (#127773) (#130947) 2024-07-17 22:40:13 +00:00