pytorch/test/cpp
Luca Wehrstedt 8f4cfaa9db Fix race condition in TP agent (#58753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753

TSAN was (rightfully!) detecting and complaining about a race: upon init, the TP agent exchanges the device maps between nodes using RPC requests (thereby reading the set of devices) and then sets the reverse device maps (thereby possibly modifying that set). In other words, the set of devices could be read and written simultaneously, without synchronization.

One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyway.
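
For illustration, here is a minimal sketch of that approach (not the actual PyTorch implementation): gather every worker's device maps over the already-created process group, then derive the reverse maps before the agent is constructed. `exchange_device_maps` and the device-map layout are hypothetical; the sketch assumes a process group (e.g. gloo) is already initialized.

```python
import torch.distributed as dist

def exchange_device_maps(my_name, my_device_maps, group):
    """my_device_maps: {peer_name: {local_device: remote_device}} (assumed layout)."""
    world_size = dist.get_world_size(group=group)
    gathered = [None] * world_size
    # all_gather_object is an existing torch.distributed collective; every
    # rank contributes its (name, device_maps) pair and receives all pairs.
    dist.all_gather_object(gathered, (my_name, my_device_maps), group=group)

    # Build the reverse device maps: for every peer whose map targets this
    # worker, invert the {local_device: remote_device} mapping.
    reverse_device_maps = {}
    for peer_name, device_maps in gathered:
        peer_map = device_maps.get(my_name)
        if peer_map is not None:
            reverse_device_maps[peer_name] = {
                remote: local for local, remote in peer_map.items()
            }
    return gathered, reverse_device_maps
```

With both the forward and reverse maps known before the agent exists, its device set can be computed once up front and stored as `const`, so no mutex is needed.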

Since there's now a lot more logic in Python, I've moved (and restructured) all the safety checks there and removed them from C++.
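
As a hypothetical example of the kind of safety check that fits naturally in Python (the function name is illustrative, not the commit's actual code): a device map must be invertible, or the reverse map built above is ill-defined.

```python
def check_device_map_invertible(peer_name, device_map):
    # Reject maps that send two local devices to the same remote device,
    # since such a map cannot be inverted into a reverse device map.
    seen = set()
    for local_device, remote_device in device_map.items():
        if remote_device in seen:
            raise ValueError(
                f"Device map for {peer_name} maps multiple local devices "
                f"to remote device {remote_device}"
            )
        seen.add(remote_device)
```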
ghstack-source-id: 130583775

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D28603754

fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
2021-06-04 06:53:42 -07:00
api Back out "[pytorch][PR] ENH Adds dtype to nn.functional.one_hot" (#59080) 2021-05-27 15:40:52 -07:00
common
dist_autograd Fix distributed autograd gradients synchronization (#57792) 2021-05-09 17:32:59 -07:00
jit Make JIT not assume that the device is CUDA. (#54238) 2021-06-03 22:21:27 -07:00
lite_interpreter_runtime [Pytorch Delegated Backend] Save function name in debug info (#57481) 2021-05-25 13:19:02 -07:00
rpc Fix race condition in TP agent (#58753) 2021-06-04 06:53:42 -07:00
tensorexpr [NNC] Fix loopnest.cache_accesses for reduce ops (fixed #59002) (#59136) 2021-06-03 21:04:14 -07:00
__init__.py