diff --git a/docs/source/rpc.rst b/docs/source/rpc.rst
index 2e801f3b69c..89f146bfd68 100644
--- a/docs/source/rpc.rst
+++ b/docs/source/rpc.rst
@@ -190,6 +190,18 @@ Example::
     :members:
     :inherited-members:
 
+.. note ::
+    The RPC framework does not automatically retry any
+    :meth:`~torch.distributed.rpc.rpc_sync`,
+    :meth:`~torch.distributed.rpc.rpc_async` and
+    :meth:`~torch.distributed.rpc.remote` calls. The framework cannot
+    determine whether an operation is idempotent, and therefore whether it
+    is safe to retry, so it is the application's responsibility to handle
+    failures and retry if necessary. RPC communication is based on TCP, so
+    calls can fail because of network failures or intermittent connectivity
+    issues. In such scenarios, the application should retry with
+    reasonable backoffs so that the network isn't overwhelmed by
+    aggressive retries.
 
 .. _rref:
 
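The application-side retry that the note describes could look roughly like the sketch below. It is a minimal illustration, not part of the patch or of the RPC framework: the helper name `call_with_backoff`, its parameters, and the default delays are all hypothetical, and the default set of retryable exception types is an assumption that an application should adjust to the failures it actually observes. It should only wrap calls the application knows are safe to run more than once.

```python
import random
import time


def call_with_backoff(fn, *args, retries=3, base_delay=0.5, max_delay=8.0,
                      retry_on=(RuntimeError, ConnectionError),
                      sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on failure with exponential backoff.

    Hypothetical helper: only wrap operations that are idempotent, because a
    failed call may still have executed on the remote side.
    """
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last failure to the caller
            # Upper bound doubles on every attempt and is capped at max_delay.
            bound = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random time in [0, bound] so that many callers do not
            # retry in lockstep.
            sleep(random.uniform(0, bound))
```

With an initialized RPC agent, a call such as `rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 1))` could then be issued as `call_with_backoff(rpc.rpc_sync, "worker1", torch.add, args=(torch.ones(2), 1))`, where the worker name is a placeholder. Randomizing the delay rather than using a fixed schedule is a design choice that spreads retries from many callers over time.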