pytorch/test/cpp
Tristan Rice f33bcbe5fd c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds log messages when a lock in NCCLUtils or ProcessGroupNCCL cannot be acquired within 30s.

This is motivated by some deadlocks we're seeing, where it's unclear whether the cause is in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` instances with `std::timed_mutex`, and using `std::condition_variable_any` where a condition variable has to wait on one of those mutexes.
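
As a rough illustration of the idea (not the actual `C10D_LOCK_GUARD` implementation, which is not shown here), a timed lock acquisition with logging could look like the sketch below. The helper names `lock_with_logging` and `contended_example` are hypothetical, and the short millisecond timeout only stands in for the 30s mentioned above.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical helper, NOT PyTorch's C10D_LOCK_GUARD: keeps retrying
// try_lock_for and logs every time the timeout elapses without success.
// Returns how many failed attempts were logged before the lock was acquired.
template <typename Rep, typename Period>
int lock_with_logging(
    std::unique_lock<std::timed_mutex>& lock,
    std::chrono::duration<Rep, Period> timeout,
    const std::string& name) {
  int timeouts = 0;
  while (!lock.try_lock_for(timeout)) {
    ++timeouts;
    std::cerr << "still waiting to acquire lock '" << name << "'\n";
  }
  return timeouts;
}

// Demonstrates the logging path: the calling thread holds the mutex while a
// second thread tries to acquire it with a short timeout.
int contended_example() {
  std::timed_mutex mu;
  std::unique_lock<std::timed_mutex> holder(mu);  // locked by this thread
  int timeouts = 0;
  std::thread waiter([&] {
    std::unique_lock<std::timed_mutex> lock(mu, std::defer_lock);
    timeouts = lock_with_logging(lock, std::chrono::milliseconds(10), "mu");
    assert(lock.owns_lock());
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  holder.unlock();  // lets the waiter finally acquire the mutex
  waiter.join();
  return timeouts;
}
```

`std::mutex` has no timed acquisition, which is why a sketch like this needs `std::timed_mutex`. `std::condition_variable` only accepts `std::unique_lock<std::mutex>`, so any condition variable waiting on a `std::timed_mutex` has to be a `std::condition_variable_any`.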

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-28 01:40:42 +00:00
..
aoti_abi_check [AOTI] Fix complex64 not defined (#132810) 2024-08-08 18:08:23 +00:00
aoti_inference [ROCm][Inductor] Enable AOT Inductor CPP UTs for ROCm (#131521) 2024-08-08 19:49:56 +00:00
api [structural binding][12/N] Replace std::tie with structural binding (#131031) 2024-08-14 00:51:34 +00:00
c10d c10d/logging: add C10D_LOCK_GUARD (#134131) 2024-08-28 01:40:42 +00:00
common
dist_autograd
jit [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200) 2024-08-15 15:50:19 +00:00
lazy
lite_interpreter_runtime Add None return type to init -- tests (#132352) 2024-08-01 15:44:51 +00:00
monitor
profiler
rpc
tensorexpr
__init__.py