pytorch/test/distributed/elastic
Kurman Karabukaev d62b025efc [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743)
Summary:

1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating the master_addr/master_port values, the logic is now encapsulated by the *rdzv_handler*: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
    - Depending on the implementation, handlers can either:
         - point to an existing store (and are expected to set `use_agent_store` to true, see point 1). Client code relies on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared.
         - build args that `torch.distributed.init_process_group` can use to bootstrap by creating a new store.
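
The client-side decision described in point 2 can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `uses_agent_store` and `store_action` are hypothetical, and the assumption that the variable carries the string "True" is mine. Only the `TORCHELASTIC_USE_AGENT_STORE` name comes from the summary above.

```python
import os
from typing import Mapping, Optional


def uses_agent_store(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the env says the agent's store is shared.

    Assumption: the value is the string "True" (compared case-insensitively).
    """
    env = os.environ if env is None else env
    value = env.get("TORCHELASTIC_USE_AGENT_STORE", "")
    return value.strip().lower() == "true"


def store_action(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: name the path a worker would take."""
    if uses_agent_store(env):
        # Shared case: connect to the store the rdzv handler already created.
        return "connect-to-existing-store"
    # Default case: bootstrap a new store from master_addr/master_port.
    return "create-new-store"


shared_env = {"TORCHELASTIC_USE_AGENT_STORE": "True"}
default_env = {}
shared_action = store_action(shared_env)
default_action = store_action(default_env)
```

Passing the environment as a mapping keeps the check easy to test without mutating `os.environ`.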

Additional points:

- When the TCPStore is shared, it should be wrapped in a PrefixStore to scope its namespace for other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility.
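
The two points above can be illustrated with a small, torch-free sketch. Every class here (`DictStore`, `PrefixedStore`, `RendezvousResult`) and the function `next_rendezvous_sketch` is a hypothetical stand-in, not the real `TCPStore`, `PrefixStore`, or `RendezvousInfo`; the `/` key separator is also an assumption. The sketch only shows why prefix scoping avoids key collisions on a shared store and why a result object is easier to extend than a tuple.

```python
from dataclasses import dataclass
from typing import Dict


class DictStore:
    """In-memory stand-in for a shared key-value store (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


class PrefixedStore:
    """Scopes every key under a prefix so users of one store do not collide."""

    def __init__(self, prefix: str, store: DictStore) -> None:
        self._prefix = prefix
        self._store = store

    def set(self, key: str, value: bytes) -> None:
        self._store.set(f"{self._prefix}/{key}", value)

    def get(self, key: str) -> bytes:
        return self._store.get(f"{self._prefix}/{key}")


@dataclass(frozen=True)
class RendezvousResult:
    """Hypothetical result object; new fields can be added without breaking callers."""

    store: PrefixedStore
    rank: int
    world_size: int


def next_rendezvous_sketch(shared: DictStore, rank: int, world_size: int) -> RendezvousResult:
    # Return one object instead of a (store, rank, world_size) tuple.
    return RendezvousResult(PrefixedStore("rdzv", shared), rank, world_size)


shared = DictStore()
info = next_rendezvous_sketch(shared, rank=0, world_size=2)
info.store.set("status", b"ready")

# A second consumer of the same underlying store, scoped under another prefix.
other = PrefixedStore("user", shared)
other.set("status", b"training")
```

Callers read `info.rank` and `info.world_size` by name, so adding another field later does not break tuple unpacking at every call site.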

Why:
- Reduce moving parts
   - easier to swap implementation
   - improve tractability
   - addressing perf/debug-ability will benefit all use cases
Test Plan: CI

Differential Revision: D57055235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
2024-05-22 18:24:11 +00:00
agent/server/test [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743) 2024-05-22 18:24:11 +00:00
events
metrics [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261) 2024-04-17 19:29:34 +00:00
multiprocessing [torch/distributed] Bugfix: wait for all child procs to exit before c… (#125969) 2024-05-15 00:13:08 +00:00
rendezvous [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743) 2024-05-22 18:24:11 +00:00
timer Fix AttributeError when doing mock patch for FileTimerServerTest.test_expired_timers (#125144) 2024-05-01 12:08:04 +00:00
utils elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982) 2024-04-27 02:21:44 +00:00