pytorch

mirror of https://github.com/saymrwulf/pytorch.git synced 2026-05-15 21:00:47 +00:00

History

Tristan Rice dc4c75ba72 elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982 ) Summary: This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts. This uses 2 approaches for different areas: * local rank assignment: each worker does 1 set and 1 get, local ranks are assigned on the rank 0 host in a O(n) operation which reduces total store operations to be linear with number of workers. * exit_barrier: use a counter and a final flag so each worker has to do max 1 set, 1 get and 1 add. At 4000 hosts we see torchelastic be able to run in as little as 10 seconds down from 373 seconds. Test Plan: This is testing using many small tests running on a remote cluster. {D56549942} ``` torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1 ``` Differential Revision: D56605193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982 Approved by: https://github.com/kiukchung, https://github.com/kurman		2024-04-27 02:21:44 +00:00
..
data
__init__.py
distributed_test.py
logging_test.py
util_test.py