From 71aa3ab02086ea626f7879ed3faabcd0f37ccb29 Mon Sep 17 00:00:00 2001
From: Pritam Damania <pritam.damania@fb.com>
Date: Wed, 2 Mar 2022 16:11:28 -0800
Subject: [PATCH] Add note in RPC docs about retries. (#73601)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73601

Some users had questions about how the RPC framework deals with
failures and whether we retry. Adding a note about this to our docs to
elaborate on our current behavior and why we chose that approach.
ghstack-source-id: 150359866

Test Plan: view docs.

Reviewed By: mrshenli

Differential Revision: D34560199

fbshipit-source-id: ee33ceed7fa706270d4ca5c8fcff7535583490ff
(cherry picked from commit 954a906240cc40aacf08ca13f6554a35303a678a)
---
 docs/source/rpc.rst | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/docs/source/rpc.rst b/docs/source/rpc.rst
index 2e801f3b69c..89f146bfd68 100644
--- a/docs/source/rpc.rst
+++ b/docs/source/rpc.rst
@@ -190,6 +190,18 @@ Example::
     :members:
     :inherited-members:
 
+.. note ::
+  The RPC framework does not automatically retry any
+  :meth:`~torch.distributed.rpc.rpc_sync`,
+  :meth:`~torch.distributed.rpc.rpc_async` and
+  :meth:`~torch.distributed.rpc.remote` calls. The reason being that there is
+  no way the RPC framework can determine whether an operation is idempotent or
+  not and whether it is safe to retry. As a result, it is the application's
+  responsibility to deal with failures and retry if necessary. RPC communication
+  is based on TCP and as a result failures could happen due to network failures
+  or intermittent network connectivity issues. In such scenarios, the application
+  needs to retry appropriately with reasonable backoffs to ensure the network
+  isn't overwhelmed by aggressive retries.
 
 .. _rref: