pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp
Rohan Varma e0e832c2ba [c10d] Provide failure reason from ProcessGroup when aborting NCCL comm (#64241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241

When things go wrong, PG NCCL aborts NCCL communicators via `ncclCommAbort`. One issue is that the error is often set to `ncclSystemError` (see https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176) even when that is not the true cause: the actual problem may be that some prior work timed out, the communicator was aborted on another rank, etc.

This causes a lot of confusion when debugging jobs with a large number of processes, because the current message for `ncclSystemError` is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22

The fix is to pass a string exception message from PG NCCL down to `NCCLUtils`, which then tries to raise that message as the actual failure reason instead of the confusing `ncclSystemError` message.
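The pattern can be sketched in isolation, without NCCL or c10. This is a minimal illustration, not the PyTorch implementation: `SketchComm`, `abort`, and `handle` are hypothetical names, `std::optional` and `std::runtime_error` stand in for `c10::optional` and `TORCH_CHECK`, and keeping only the first recorded reason is an assumption of the sketch.

```cpp
#include <mutex>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a communicator wrapper; not the real NCCLComm.
class SketchComm {
 public:
  explicit SketchComm(int rank) : rank_(rank) {}

  // Mark the communicator as aborted and optionally record why.
  // Assumption of this sketch: the first recorded reason wins.
  void abort(std::optional<std::string> reason = std::nullopt) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (aborted_) {
      return;
    }
    aborted_ = true;
    failureReason_ = std::move(reason);
  }

  // Return a dummy handle, or throw with the stored reason if aborted.
  int handle() const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (aborted_) {
      std::string msg =
          "communicator was aborted on rank " + std::to_string(rank_) + ".";
      if (failureReason_) {
        msg += " Original reason for failure was: " + *failureReason_;
      }
      throw std::runtime_error(msg);
    }
    return rank_;  // placeholder for the real communicator handle
  }

 private:
  mutable std::mutex mutex_;
  int rank_;
  bool aborted_ = false;
  std::optional<std::string> failureReason_;
};
```

A caller that detects a timeout would call something like `comm.abort("watchdog timeout")`, so that a later `comm.handle()` on the same rank reports the timeout rather than a generic error.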

Test Plan: CI

Reviewed By: pallab-zz, cbalioglu

Differential Revision: D30658855

fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1
2021-09-08 09:19:24 -07:00

57 lines
1.5 KiB
C++

#include <c10d/NCCLUtils.hpp>
#ifdef USE_C10D_NCCL
#include <mutex>
namespace c10d {
ncclComm_t NCCLComm::getNcclComm() {
  std::unique_lock<std::mutex> lock(mutex_);
  if (aborted_) {
    auto commFailureMsg = commFailureReason_ != c10::nullopt
        ? c10::str(" Original reason for failure was: ", *commFailureReason_)
        : "";
    TORCH_CHECK(
        false,
        c10::str(
            "NCCL communicator was aborted on rank ",
            rank_,
            ". ",
            commFailureMsg));
  }
  return ncclComm_;
}
std::string getNcclVersion() {
  static std::once_flag ncclGetVersionFlag;
  static std::string versionString;
  std::call_once(ncclGetVersionFlag, []() {
    int version;
    ncclResult_t status = ncclGetVersion(&version);
    // can't compute the version if call did not return successfully or version
    // code < 100 (corresponding to 0.1.0)
    if (status != ncclSuccess || version < 100) {
      versionString = "Unknown NCCL version";
    } else {
      // NCCL encodes versions up to 2.8.x as major * 1000 + minor * 100 +
      // patch, and 2.9 onward as major * 10000 + minor * 100 + patch.
      const int majorBase = version < 2900 ? 1000 : 10000;
      const int minorBase = 100;
      auto ncclMajor = version / majorBase;
      auto ncclMinor = (version % majorBase) / minorBase;
      auto ncclPatch =
          version % (ncclMajor * majorBase + ncclMinor * minorBase);
      versionString = std::to_string(ncclMajor) + "." +
          std::to_string(ncclMinor) + "." + std::to_string(ncclPatch);
    }
  });
  return versionString;
}
std::string ncclGetErrorWithVersion(ncclResult_t error) {
  return std::string(ncclGetErrorString(error)) + ", NCCL version " +
      getNcclVersion();
}
} // namespace c10d
#endif // USE_C10D_NCCL