Add note in RPC docs about retries. (#73601)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73601

Some users had questions about how the RPC framework deals with
failures and whether we retry. Adding a note about this to our docs to
elaborate on our current behavior and why we chose that approach.
ghstack-source-id: 150359866

Test Plan: view docs.

Reviewed By: mrshenli

Differential Revision: D34560199

fbshipit-source-id: ee33ceed7fa706270d4ca5c8fcff7535583490ff
(cherry picked from commit 954a906240cc40aacf08ca13f6554a35303a678a)
This commit is contained in:
Pritam Damania 2022-03-02 16:11:28 -08:00 committed by PyTorch MergeBot
parent 2ab9702955
commit 71aa3ab020


@@ -190,6 +190,18 @@ Example::
:members:
:inherited-members:

.. note ::
  The RPC framework does not automatically retry
  :meth:`~torch.distributed.rpc.rpc_sync`,
  :meth:`~torch.distributed.rpc.rpc_async`, or
  :meth:`~torch.distributed.rpc.remote` calls, because it cannot determine
  whether an operation is idempotent and therefore safe to retry. As a result,
  it is the application's responsibility to handle failures and retry if
  necessary. RPC communication is based on TCP, so calls can fail because of
  network failures or intermittent connectivity issues. In such scenarios, the
  application should retry with reasonable backoffs so that aggressive retries
  do not overwhelm the network.

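As an illustration of application-level retries with backoff, here is a minimal sketch. The helper name ``retry_with_backoff`` and all of its parameters are hypothetical and not part of ``torch.distributed.rpc``. It assumes that transient failures surface as ``RuntimeError``, and that the wrapped operation is idempotent. The application should adjust ``retry_on`` to whatever exceptions it considers retryable.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(RuntimeError,), sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts calls have failed.

    Hypothetical helper: only wrap operations that are safe to repeat.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            # Exponential backoff capped at max_delay, with random jitter
            # so that many callers do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

With the RPC framework initialized, a call could then be wrapped as ``retry_with_backoff(lambda: rpc.rpc_sync("worker1", torch.add, args=(x, y)))``, where ``"worker1"``, ``x``, and ``y`` are placeholders. Exceptions outside ``retry_on`` propagate immediately, which keeps non-transient errors from being retried.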
.. _rref: