pytorch/torch/distributed/_symmetric_memory
Luca Wehrstedt 3ee655e4d4 [async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846)
There's a sleep that is issued in order to "nudge" CUDA into making the right scheduling decision, but it is issued on iteration number 2. However, when the world size is 2, we never reach that iteration, which led to suboptimal scheduling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145846
Approved by: https://github.com/yifuwang
2025-01-30 18:26:34 +00:00
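The bug can be illustrated with a minimal sketch. This is a hypothetical reconstruction of the loop structure described in the commit message, not PyTorch's actual fused matmul+reduce-scatter implementation; the names `run_pipeline` and `nudge_iteration` are assumptions for illustration only.

```python
import time

def run_pipeline(world_size, nudge_iteration=2):
    """Simulate the per-chunk work loop and report whether the
    scheduling 'nudge' (a short sleep) was ever issued."""
    nudged = False
    for it in range(world_size):  # one iteration per rank's chunk
        if it == nudge_iteration:
            # The "nudge": a tiny sleep meant to steer CUDA toward
            # the desired scheduling. With world_size == 2 the loop
            # only runs iterations 0 and 1, so this branch is never
            # reached, which is the bug the PR fixes.
            time.sleep(0.001)
            nudged = True
        # ... issue the matmul / communication work for this chunk ...
    return nudged
```

Calling `run_pipeline(2)` returns `False` (the nudge never fires), while `run_pipeline(4)` returns `True`, reproducing why only two-rank runs were scheduled suboptimally.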
__init__.py [async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846) 2025-01-30 18:26:34 +00:00