pytorch/test/cpp
Luca Wehrstedt 8f4cfaa9db Fix race condition in TP agent (#58753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753

TSAN was (rightfully!) detecting and complaining about a race: upon init, the TP agent exchanges the device maps between nodes using RPC requests (thereby reading the set of devices) and then sets the reverse device maps (thereby possibly modifying that set). In other words, the set of devices could be read and written simultaneously, without synchronization.

One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyway.
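
For illustration, here is a minimal sketch of that approach (not the actual PyTorch implementation): gather every worker's device maps over the already-created process group, then derive the reverse maps before the agent is constructed. `exchange_device_maps` and the device-map layout are hypothetical; the sketch assumes a process group (e.g. gloo) is already initialized.

```python
import torch.distributed as dist

def exchange_device_maps(my_name, my_device_maps, group):
    """my_device_maps: {peer_name: {local_device: remote_device}} (assumed layout)."""
    world_size = dist.get_world_size(group=group)
    gathered = [None] * world_size
    # all_gather_object is an existing torch.distributed collective; every
    # rank contributes its (name, device_maps) pair and receives all pairs.
    dist.all_gather_object(gathered, (my_name, my_device_maps), group=group)

    # Build the reverse device maps: for every peer whose map targets this
    # worker, invert the {local_device: remote_device} mapping.
    reverse_device_maps = {}
    for peer_name, device_maps in gathered:
        peer_map = device_maps.get(my_name)
        if peer_map is not None:
            reverse_device_maps[peer_name] = {
                remote: local for local, remote in peer_map.items()
            }
    return gathered, reverse_device_maps
```

With both the forward and reverse maps known before the agent exists, its device set can be computed once up front and stored as `const`, so no mutex is needed.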

Since there's now a lot more logic in Python, I've moved (and restructured) all the safety checks there and removed them from C++.
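
As a hypothetical example of the kind of safety check that fits naturally in Python (the function name is illustrative, not the commit's actual code): a device map must be invertible, or the reverse map built above is ill-defined.

```python
def check_device_map_invertible(peer_name, device_map):
    # Reject maps that send two local devices to the same remote device,
    # since such a map cannot be inverted into a reverse device map.
    seen = set()
    for local_device, remote_device in device_map.items():
        if remote_device in seen:
            raise ValueError(
                f"Device map for {peer_name} maps multiple local devices "
                f"to remote device {remote_device}"
            )
        seen.add(remote_device)
```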
ghstack-source-id: 130583775

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D28603754

fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
2021-06-04 06:53:42 -07:00
api Back out "[pytorch][PR] ENH Adds dtype to nn.functional.one_hot" (#59080) 2021-05-27 15:40:52 -07:00
common
dist_autograd Fix distributed autograd gradients synchronization (#57792) 2021-05-09 17:32:59 -07:00
jit Make JIT not assume that the device is CUDA. (#54238) 2021-06-03 22:21:27 -07:00
lite_interpreter_runtime [Pytorch Delegated Backend] Save function name in debug info (#57481) 2021-05-25 13:19:02 -07:00
rpc Fix race condition in TP agent (#58753) 2021-06-04 06:53:42 -07:00
tensorexpr [NNC] Fix loopnest.cache_accesses for reduce ops (fixed #59002) (#59136) 2021-06-03 21:04:14 -07:00
__init__.py