mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
Summary:
1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating `master_addr`/`master_port` values, the logic is now encapsulated by the *rdzv_handler*: `RendezvousInfo` carries a `RendezvousStoreInfo` object that handlers must return.
   - Depending on the implementation, handlers can either:
     - point to an existing store (in which case they are expected to set `use_agent_store` to true, see point 1). Client code relies on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know whether the store is shared.
     - build args that `torch.distributed.init_process_group` can use to bootstrap by creating a new store.
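The handler contract described in points 1 and 2 can be sketched roughly as follows. This is a minimal, torch-free illustration, not the code from this change: `StoreInfoSketch`, `RendezvousInfoSketch`, `SharedStoreHandlerSketch`, `worker_env`, and all field names are hypothetical stand-ins, and the way the env variable value is encoded (`str(bool)`) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class StoreInfoSketch:
    """Hypothetical stand-in for `RendezvousStoreInfo` (field names assumed)."""
    master_addr: str
    master_port: int


@dataclass
class RendezvousInfoSketch:
    """Hypothetical stand-in for `RendezvousInfo`: store, rank, world size, store info."""
    store: Any
    rank: int
    world_size: int
    bootstrap_store_info: StoreInfoSketch


class SharedStoreHandlerSketch:
    """Hypothetical handler that points workers at a store it already owns."""

    def __init__(self, store: Any, addr: str, port: int, rank: int, world_size: int) -> None:
        self._store = store
        self._addr = addr
        self._port = port
        self._rank = rank
        self._world_size = world_size

    @property
    def use_agent_store(self) -> bool:
        # True signals that the existing store may be shared with client code.
        return True

    def next_rendezvous(self) -> RendezvousInfoSketch:
        # The handler, not the agent, decides which address and port to hand out.
        return RendezvousInfoSketch(
            store=self._store,
            rank=self._rank,
            world_size=self._world_size,
            bootstrap_store_info=StoreInfoSketch(self._addr, self._port),
        )


def worker_env(handler: SharedStoreHandlerSketch, info: RendezvousInfoSketch) -> Dict[str, str]:
    """Build the env a worker would read (value encoding is an assumption)."""
    return {
        "MASTER_ADDR": info.bootstrap_store_info.master_addr,
        "MASTER_PORT": str(info.bootstrap_store_info.master_port),
        "TORCHELASTIC_USE_AGENT_STORE": str(handler.use_agent_store),
    }


handler = SharedStoreHandlerSketch(store={}, addr="127.0.0.1", port=29500, rank=0, world_size=2)
info = handler.next_rendezvous()
env = worker_env(handler, info)
```

A handler taking the second route would return `False` from `use_agent_store` and fill the store info with the address and port that a newly created store should use.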
Additional points:
- When a TCPStore is shared, it should be wrapped in a PrefixStore to scope its key namespace so it does not collide with other use cases.
- The `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple, for extensibility.
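The namespacing idea behind wrapping a shared store in a prefix can be illustrated with a small, torch-free sketch. `DictStoreSketch` and `PrefixedStoreSketch` are hypothetical stand-ins for a key-value store and a prefix wrapper, and the `/` separator and the prefix names are assumptions made for the example.

```python
from typing import Dict


class DictStoreSketch:
    """Hypothetical in-memory key-value store standing in for a shared TCPStore."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self.data[key] = value

    def get(self, key: str) -> bytes:
        return self.data[key]


class PrefixedStoreSketch:
    """Hypothetical wrapper that prepends a prefix to every key it touches."""

    def __init__(self, prefix: str, store: DictStoreSketch) -> None:
        self._prefix = prefix
        self._store = store

    def set(self, key: str, value: bytes) -> None:
        self._store.set(f"{self._prefix}/{key}", value)

    def get(self, key: str) -> bytes:
        return self._store.get(f"{self._prefix}/{key}")


shared = DictStoreSketch()
rdzv_view = PrefixedStoreSketch("rdzv", shared)      # keys used during rendezvous
worker_view = PrefixedStoreSketch("worker", shared)  # keys used by client code

# Both views write the same logical key without overwriting each other.
rdzv_view.set("rank", b"0")
worker_view.set("rank", b"7")
```

Because each view writes under its own prefix, one underlying store can serve several consumers, which is what makes sharing it safe.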
Why:
- Reduce moving parts
- Make it easier to swap implementations
- Improve tractability
- Addressing perf/debug-ability will benefit all use cases
Test Plan: CI
Differential Revision: D57055235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
| Name |
|---|
| .. |
| agent/server/test |
| events |
| metrics |
| multiprocessing |
| rendezvous |
| timer |
| utils |