Commit graph

6 commits

Author SHA1 Message Date
Wanchao Liang
a4f0f8b1e9 [distributed] add base processgroup::options (#53662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53662

Add a base processgroup::options so that we can do inheritance and
provide
a universal option API in python

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26968856

Pulled By: wanchaol

fbshipit-source-id: 858f4b61b27aecb1943959bba68f8c14114f67d8
2021-03-17 18:40:04 -07:00
Shen Li
c7b1979b6b Use Store collect and verify names in all RPC agents (#53209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53209

closes #40048

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26791524

Pulled By: mrshenli

fbshipit-source-id: fc75589f9707014334fcfae6f05af3c04217783b
2021-03-07 16:51:46 -08:00
Wanchao Liang
553ccccc54 [c10d] switch ProcessGroup to be managed by intrusive_ptr (#47343)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47343

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24723418

Pulled By: wanchaol

fbshipit-source-id: 0463819b96c53b12bdbb3905431110d7b21beb77
2020-11-12 07:36:23 -08:00
Pritam Damania
bf85642c4c Remove lock from GraphTask::set_exception_without_signal. (#45867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45867

In most cases the lock ordering was hold a lock in local autograd and
then hold a lock in DistAutogradContext.

In case of `set_exception_without_signal` the lock order was in reverse and as
a result we saw potential deadlock issues in our TSAN tests. To fix this, I
removed the lock and instead just used std::atomic exchange.

In addition to this, I fixed TestE2E to ensure that we use the appropriate
timeout.

TestE2EProcessGroup was flaky for these two reasons and now is fixed.
ghstack-source-id: 113592709

Test Plan: waitforbuildbot.

Reviewed By: albanD

Differential Revision: D24120962

fbshipit-source-id: 12447b84ceae772b91e9a183c90d1e6340f44e66
2020-10-05 20:02:29 -07:00
generatedunixname89002005287564@sandcastle1415.cln1.facebook.com
1dd658f28f [Codemod][GleanFbcode] Remove dead includes in caffe2/test (#43953)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43953

Reviewed By: malfet

Differential Revision: D23445556

fbshipit-source-id: 89cd6833aa06f35c5d3c99d698abb08cd61ae4ab
2020-09-01 21:48:28 -07:00
Pritam Damania
ff6e560301 Add C++ end to end test for RPC and distributed autograd. (#36893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893

Adding an end to end test for running a simple training loop in C++
for the distributed RPC framework.

The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.

As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167

Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D21112208

fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
2020-07-15 12:59:19 -07:00