From 71aa3ab02086ea626f7879ed3faabcd0f37ccb29 Mon Sep 17 00:00:00 2001
From: Pritam Damania
Date: Wed, 2 Mar 2022 16:11:28 -0800
Subject: [PATCH] Add note in RPC docs about retries. (#73601)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73601

Some users had questions about how the RPC framework deals with failures and
whether we retry. Adding a note about this to our docs to elaborate on our
current behavior and why we chose that approach.
ghstack-source-id: 150359866

Test Plan: view docs.

Reviewed By: mrshenli

Differential Revision: D34560199

fbshipit-source-id: ee33ceed7fa706270d4ca5c8fcff7535583490ff
(cherry picked from commit 954a906240cc40aacf08ca13f6554a35303a678a)
---
 docs/source/rpc.rst | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/docs/source/rpc.rst b/docs/source/rpc.rst
index 2e801f3b69c..89f146bfd68 100644
--- a/docs/source/rpc.rst
+++ b/docs/source/rpc.rst
@@ -190,6 +190,18 @@ Example::
     :members:
     :inherited-members:
 
+.. note ::
+    The RPC framework does not automatically retry any
+    :meth:`~torch.distributed.rpc.rpc_sync`,
+    :meth:`~torch.distributed.rpc.rpc_async` and
+    :meth:`~torch.distributed.rpc.remote` calls. The reason being that there is
+    no way the RPC framework can determine whether an operation is idempotent or
+    not and whether it is safe to retry. As a result, it is the application's
+    responsibility to deal with failures and retry if necessary. RPC communication
+    is based on TCP and as a result failures could happen due to network failures
+    or intermittent network connectivity issues. In such scenarios, the application
+    needs to retry appropriately with reasonable backoffs to ensure the network
+    isn't overwhelmed by aggressive retries.
 
 .. _rref:
 
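
Reviewer aside (not part of the patch): the note leaves the retry policy to the
application. A minimal sketch of what an application-side retry with backoff
could look like is shown here. The helper name ``call_with_retries``, its
parameters and defaults, and the choice to treat ``RuntimeError`` as the
transient failure type are all assumptions made for illustration; none of them
come from the RPC framework or from this patch. It uses capped exponential
backoff with jitter, and it should only wrap calls the application knows to be
idempotent.

```python
import random
import time


def call_with_retries(fn, *args, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      retry_on=(RuntimeError,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying failures with capped exponential backoff.

    Hypothetical helper: the name, defaults, and the retry_on exception types
    are illustrative assumptions, not part of torch.distributed.rpc.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # The backoff ceiling doubles each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Sleep a random time in [0, ceiling] so that many callers
            # do not all retry at the same instant.
            sleep(random.uniform(0, ceiling))
```

With an initialized RPC agent, a call the application knows to be idempotent
could then be wrapped as
``call_with_retries(rpc.rpc_sync, "worker1", torch.add, args=(x, y))``, where
``"worker1"``, ``x`` and ``y`` are placeholders. Which exception types actually
indicate a transient network failure depends on the backend and should be
checked before relying on the ``retry_on`` default.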