From aef4317ec89b810f41b0a6d71bd689c2634c53ea Mon Sep 17 00:00:00 2001
From: Tristan Rice
Date: Tue, 15 Oct 2024 21:17:03 +0000
Subject: [PATCH] [c10d] socket: retry connection timeout failures (#138003)

This will retry connection timeout failures up to the timeout duration.

Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to IPv4 for the remainder of the connection timeout.

The connection timeout here is not the same as the c10d timeout, which appears to be higher. We could adjust the Linux timeout directly, but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.

Example failure:

```
socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure
```

From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html

```
ETIMEDOUT
    Timeout while attempting connection. The server may be too
    busy to accept new connections. Note that for IP sockets the
    timeout may be very long when syncookies are enabled on the
    server.
```

Test plan:

CI for backwards compatibility

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138003
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
---
 torch/csrc/distributed/c10d/socket.cpp | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/torch/csrc/distributed/c10d/socket.cpp b/torch/csrc/distributed/c10d/socket.cpp
index cb7b6329f77..320251b587c 100644
--- a/torch/csrc/distributed/c10d/socket.cpp
+++ b/torch/csrc/distributed/c10d/socket.cpp
@@ -920,6 +920,11 @@ SocketConnectOp::ConnectResult SocketConnectOp::tryConnect(
             addr,
             err);
 
+        return ConnectResult::Retry;
+      } else if (err == std::errc::timed_out) {
+        C10D_WARNING(
+            "The server socket on {} has timed out, will retry.", addr, err);
+
         return ConnectResult::Retry;
       } else {
         recordError(