mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
Add note in RPC docs about retries. (#73601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73601

Some users had questions about how the RPC framework deals with failures and
whether we retry. Adding a note about this to our docs to elaborate on our
current behavior and why we chose that approach.
ghstack-source-id: 150359866

Test Plan: view docs.

Reviewed By: mrshenli

Differential Revision: D34560199

fbshipit-source-id: ee33ceed7fa706270d4ca5c8fcff7535583490ff
(cherry picked from commit 954a906240cc40aacf08ca13f6554a35303a678a)
parent 2ab9702955
commit 71aa3ab020
1 changed files with 12 additions and 0 deletions
@@ -190,6 +190,18 @@ Example::
     :members:
     :inherited-members:
 
+.. note ::
+  The RPC framework does not automatically retry any
+  :meth:`~torch.distributed.rpc.rpc_sync`,
+  :meth:`~torch.distributed.rpc.rpc_async` and
+  :meth:`~torch.distributed.rpc.remote` calls. The reason being that there is
+  no way the RPC framework can determine whether an operation is idempotent or
+  not and whether it is safe to retry. As a result, it is the application's
+  responsibility to deal with failures and retry if necessary. RPC communication
+  is based on TCP and as a result failures could happen due to network failures
+  or intermittent network connectivity issues. In such scenarios, the application
+  needs to retry appropriately with reasonable backoffs to ensure the network
+  isn't overwhelmed by aggressive retries.
 
 .. _rref:
 
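
The note added in this commit leaves retries to the application. As an illustration only, here is a minimal sketch of what an application-side retry with backoff might look like. The helper name ``call_with_retries`` and all of its parameters are hypothetical and not part of ``torch.distributed.rpc``. The sketch also assumes that RPC failures surface as ``RuntimeError``, which should be checked against the errors your application actually sees, and that the wrapped call is one the application knows to be idempotent.

```python
import random
import time


def call_with_retries(fn, *, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(RuntimeError,), sleep=time.sleep):
    """Call fn() and retry failed attempts with exponential backoff and jitter.

    Hypothetical helper, not a PyTorch API. Only wrap calls that are safe to
    repeat, since the RPC framework cannot tell whether a call is idempotent.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Double the delay ceiling on every failure, capped at max_delay,
            # and pick a random delay below it so that many callers do not
            # retry in lockstep and overwhelm the network.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Hypothetical usage, assuming rpc.init_rpc() has already been called and
# that "worker1" is a placeholder for one of your own worker names:
#
#   import torch
#   import torch.distributed.rpc as rpc
#
#   result = call_with_retries(
#       lambda: rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 1))
#   )
```

The ``sleep`` parameter exists only so that the delay can be replaced in tests. The same wrapper could be applied around waiting on the future returned by ``rpc_async``, under the same idempotency assumption.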