Add note in RPC docs about retries. (#73601)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73601

Some users had questions about how the RPC framework deals with failures and whether we retry. Adding a note about this to our docs to elaborate on our current behavior and why we chose that approach.

ghstack-source-id: 150359866

Test Plan: view docs.

Reviewed By: mrshenli

Differential Revision: D34560199

fbshipit-source-id: ee33ceed7fa706270d4ca5c8fcff7535583490ff
(cherry picked from commit 954a906240cc40aacf08ca13f6554a35303a678a)
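
The note added by this change leaves retries to the application. As an illustration only, here is a minimal sketch of an application-level retry wrapper with exponential backoff and jitter. The helper name ``call_with_retries`` and all of its parameters are hypothetical and are not part of ``torch.distributed.rpc``. The sketch also assumes that transient failures surface as ``RuntimeError``, which should be adjusted to whatever exceptions the application actually observes.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      retry_on=(RuntimeError,), sleep=time.sleep):
    """Call the zero-argument callable fn, retrying with backoff on failure.

    Hypothetical helper (an assumption for illustration). Use it only for
    operations that are safe to execute more than once, since the RPC
    framework cannot tell whether a failed call already ran remotely.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Exponential backoff: base_delay, 2 * base_delay, ... capped
            # at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Scale by a random factor so that many callers do not retry
            # in lockstep and overwhelm the network.
            sleep(delay * random.uniform(0.5, 1.0))
```

Assuming ``rpc.init_rpc`` has already been called and a worker named ``"worker1"`` exists, an idempotent call could be wrapped as ``call_with_retries(lambda: rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 1)))``. Non-idempotent operations need application-specific handling instead, such as deduplication on the callee.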
2026-05-14 20:57:59 +00:00 · 2022-03-02 16:11:28 -08:00 · 71aa3ab020
commit 71aa3ab020
parent 2ab9702955
1 changed files with 12 additions and 0 deletions
--- a/docs/source/rpc.rst
+++ b/docs/source/rpc.rst
@@ -190,6 +190,18 @@ Example::
    :members:
    :inherited-members:

+.. note ::
+  The RPC framework does not automatically retry any
+  :meth:`~torch.distributed.rpc.rpc_sync`,
+  :meth:`~torch.distributed.rpc.rpc_async`, or
+  :meth:`~torch.distributed.rpc.remote` calls, because it has no way to
+  determine whether an operation is idempotent and therefore safe to
+  retry. As a result, it is the application's responsibility to handle
+  failures and retry if necessary. RPC communication is based on TCP, so
+  calls can fail because of network failures or intermittent
+  connectivity issues. In such scenarios, the application should retry
+  with reasonable backoff so that aggressive retries do not overwhelm
+  the network.

 .. _rref: