pytorch/test/cpp/c10d
fduwjj ae7df51232 [c10d] Fix CudaEventCache for dangling references (#144496)
As reported in https://github.com/pytorch/pytorch/issues/143470, `CudaEventCache` can end up holding dangling references. This change fixes that.
1. We add a unit test that reproduces the problem described in the issue.
2. Instead of converting variables to shared pointers as suggested in the issue, we make the cache itself a shared pointer. If the thread that creates the cache dies before all events are recycled, the cache stays alive until the last CudaEvent is deleted. (Thanks to @kwen2501 for the suggestion.)
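
The lifetime idea in point 2 can be illustrated with a minimal, CUDA-free sketch. It is not the actual c10d code: `FakeEvent`, `EventCache`, and `cache_outlives_creator_thread` are hypothetical names, and the per-thread cache is an assumption made for the demo. Each event's deleter holds a `std::shared_ptr` to the cache, so the cache cannot be destroyed while any event is still outstanding.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a CUDA event; no CUDA runtime is needed here.
struct FakeEvent {
  int id;
};

// Hypothetical cache: recycles events, and is itself owned by shared_ptr.
class EventCache : public std::enable_shared_from_this<EventCache> {
 public:
  // One cache per thread (an assumption of this sketch).
  static std::shared_ptr<EventCache> get() {
    static thread_local std::shared_ptr<EventCache> cache =
        std::make_shared<EventCache>();
    return cache;
  }

  // Hands out an event whose deleter returns it to the pool.
  std::shared_ptr<FakeEvent> create() {
    FakeEvent* ev = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!free_.empty()) {
        ev = free_.back();
        free_.pop_back();
      } else {
        ev = new FakeEvent{next_id_++};
      }
    }
    // Capturing `self` keeps the cache alive as long as the event exists.
    auto self = shared_from_this();
    return std::shared_ptr<FakeEvent>(ev, [self](FakeEvent* e) {
      std::lock_guard<std::mutex> lock(self->mutex_);
      self->free_.push_back(e);
    });
  }

  ~EventCache() {
    for (FakeEvent* e : free_) {
      delete e;
    }
  }

 private:
  std::mutex mutex_;
  std::vector<FakeEvent*> free_;  // raw pointers, so no ownership cycle
  int next_id_ = 0;
};

// Returns true if the cache survives its creating thread while an event is
// outstanding, and is destroyed once that last event is released.
bool cache_outlives_creator_thread() {
  std::shared_ptr<FakeEvent> event;
  std::weak_ptr<EventCache> observer;

  std::thread worker([&] {
    auto cache = EventCache::get();
    observer = cache;
    event = cache->create();
  });
  worker.join();  // the thread-local owner of the cache is gone now

  const bool alive_after_thread_exit = !observer.expired();
  event.reset();  // the last event returns to the pool and releases the cache
  const bool freed_after_last_event = observer.expired();
  return alive_after_thread_exit && freed_after_last_event;
}
```

Calling `assert(cache_outlives_creator_thread());` from a test exercises the scenario. If the deleter captured a raw pointer or reference to the cache instead, returning the event after the worker thread exits would touch a destroyed object, which is the kind of dangling reference the issue describes.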

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496
Approved by: https://github.com/kwen2501
2025-01-15 05:11:48 +00:00
..
example
BackoffTest.cpp
CMakeLists.txt
CUDATest.cu
CUDATest.hpp
FileStoreTest.cpp
HashStoreTest.cpp
ProcessGroupGlooAsyncTest.cpp
ProcessGroupGlooTest.cpp
ProcessGroupMPITest.cpp
ProcessGroupNCCLErrorsTest.cpp
ProcessGroupNCCLTest.cpp [c10d] Fix CudaEventCache for dangling references (#144496) 2025-01-15 05:11:48 +00:00
ProcessGroupUCCTest.cpp
StoreTestCommon.hpp
TCPStoreTest.cpp
TestUtils.hpp