pytorch/test/cpp/rpc/test_e2e_tensorpipe.cpp
Pritam Damania 8b501dfd98 Fix memory leak in TensorPipeAgent. (#50564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50564

When an RPC was sent, the associated future was stored in two maps:
pendingResponseMessage_ and timeoutMap_. Once the response was received, the
entry was removed only from pendingResponseMessage_ and not from timeoutMap_.
The pollTimeoutRpcs method then eventually removed the entry from timeoutMap_
after the timeout duration had passed.

However, in scenarios with a large timeout and a large number of in-flight
RPCs, it is very easy for timeoutMap_ to grow without bound. This was
discovered in https://github.com/pytorch/pytorch/issues/50522.

To fix this issue, I've added code to clean up timeoutMap_ as well once we
receive a response.
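The bookkeeping pattern behind the fix can be sketched as follows. This is a minimal standalone illustration, not the actual TensorPipeAgent code: the FakeAgent type, the plain-int payload, the steady_clock-keyed timeoutMap_, and the markResponseReceived helper are all assumptions for the sake of the example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

// Minimal sketch of the leak and its fix. A real agent tracks futures;
// here we track plain ints keyed by message id.
struct FakeAgent {
  using Clock = std::chrono::steady_clock;

  // Message id -> pending response (stand-in for a future).
  std::unordered_map<uint64_t, int> pendingResponseMessage_;
  // Expiration time -> message ids that expire at that time.
  std::map<Clock::time_point, std::vector<uint64_t>> timeoutMap_;

  void sendRequest(uint64_t messageId, Clock::time_point expiry) {
    // Every outgoing RPC is tracked in BOTH maps.
    pendingResponseMessage_[messageId] = 0;
    timeoutMap_[expiry].push_back(messageId);
  }

  void markResponseReceived(uint64_t messageId, Clock::time_point expiry) {
    pendingResponseMessage_.erase(messageId);
    // The fix: also remove the entry from timeoutMap_. Without this,
    // the entry lingers until the timeout poller reaps it, so with a
    // large timeout and many RPCs the map grows without bound.
    auto it = timeoutMap_.find(expiry);
    if (it != timeoutMap_.end()) {
      auto& ids = it->second;
      ids.erase(std::remove(ids.begin(), ids.end(), messageId), ids.end());
      if (ids.empty()) {
        timeoutMap_.erase(it);
      }
    }
  }
};
```

The test below asserts exactly this invariant: after all responses arrive, both numPendingResponses() and timeoutMapSize() must be zero.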
ghstack-source-id: 119925182

Test Plan:
1) Unit test added.
2) Tested with repro in https://github.com/pytorch/pytorch/issues/50522

Closes: https://github.com/pytorch/pytorch/issues/50522

Reviewed By: mrshenli

Differential Revision: D25919650

fbshipit-source-id: a0a42647e706d598fce2ca2c92963e540b9d9dbb
2021-01-18 16:34:28 -08:00


#include <gtest/gtest.h>
#include "e2e_test_base.h"
#include <c10d/ProcessGroupGloo.hpp>
#include <torch/csrc/distributed/rpc/request_callback_no_python.h>
#include <torch/csrc/distributed/rpc/tensorpipe_agent.h>
#include <torch/torch.h>
namespace torch {
namespace distributed {
namespace rpc {

#ifdef USE_TENSORPIPE

class TestE2ETensorPipe : public TestE2EBase {
 protected:
  void buildRpcAgent() override {
    c10d::ProcessGroupGloo::Options options;
    options.devices.push_back(
        ::c10d::ProcessGroupGloo::createDeviceForHostname(serverAddress));
    float rpcTimeout = 30;

    // Initialize server rpc agent.
    auto pg = c10::make_intrusive<c10d::ProcessGroupGloo>(
        store, 0, numWorkers, options);

    TensorPipeRpcBackendOptions opts(
        /*numWorkerThreads=*/std::max(16U, std::thread::hardware_concurrency()),
        /*transports=*/nullopt,
        /*channels=*/nullopt,
        /*rpc_timeout=*/rpcTimeout,
        /*init_method=*/"unused");

    rpcAgent = std::make_shared<TensorPipeAgent>(
        store,
        "worker",
        0,
        numWorkers,
        pg,
        opts,
        std::make_unique<RequestCallbackNoPython>());
  }
};

// End to end training loop test in C++ so that we can run LSAN on this test to
// catch memory leaks. Enabling LSAN with python multiprocessing has been
// challenging and we don't have a good solution yet.
TEST_F(TestE2ETensorPipe, TestTrainingLoop) {
  runTrainingLoop();
  // Ensure the tensorpipe internal state is cleaned up.
  auto tensorpipeAgent = std::static_pointer_cast<TensorPipeAgent>(rpcAgent);
  // Wait a while for async RPCs to propagate through (ex: dist autograd
  // cleanup).
  while (tensorpipeAgent->numPendingResponses() != 0) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  ASSERT_EQ(0, tensorpipeAgent->numPendingResponses());
  ASSERT_EQ(0, tensorpipeAgent->timeoutMapSize());
}

#endif

} // namespace rpc
} // namespace distributed
} // namespace torch