pytorch/test/cpp
Shuqiang Zhang bfd5bb0c44 [c10d] only PG0 should dump when monitoring thread timed out (#125356)
Summary:
We found that some dumps are missing when the monitoring thread times out.
This is likely because multiple PGs could still dump the same records
at the same time, so we should allow only PG0 to actually dump.
Test Plan:
 unit test
python test/run_test.py --cpp --verbose -i cpp/ProcessGroupNCCLErrorsTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125356
Approved by: https://github.com/c-p-i-o
2024-05-04 00:43:20 +00:00
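
The idea in the commit summary above (only PG0 writes the dump on a monitoring-thread timeout) can be sketched as follows. This is a minimal illustration, not the actual ProcessGroupNCCL code: `FakeProcessGroup`, `DumpSink`, `dumpOnMonitorTimeout`, and the convention that `uid == 0` identifies PG0 are all hypothetical names and assumptions introduced here.

```cpp
#include <vector>

// Hypothetical stand-in for a process group; NOT the real ProcessGroupNCCL class.
struct FakeProcessGroup {
  int uid;  // assumption: uid 0 identifies the default group (PG0)
};

// Hypothetical shared sink that records which groups wrote a dump.
struct DumpSink {
  std::vector<int> writers;
};

// Called when a group's monitoring thread times out.
// Only PG0 writes; every other group skips the dump so the same
// records are not written by several groups at once.
bool dumpOnMonitorTimeout(const FakeProcessGroup& pg, DumpSink& sink) {
  if (pg.uid != 0) {
    return false;  // not PG0: leave the dump to PG0
  }
  sink.writers.push_back(pg.uid);
  return true;
}
```

Under these assumptions, calling `dumpOnMonitorTimeout` for groups 0, 1, and 2 against one `DumpSink` leaves a single entry in `writers`, from group 0.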
..
aoti_abi_check [AOTI] Add more ABI-compatiblity unit test (#123900) 2024-04-23 16:06:40 +00:00
aoti_inference [AOTI] Add ABI-compatiblity tests (#123848) 2024-04-19 00:51:24 +00:00
api UFMT formatting on test/autograd test/ao test/cpp test/backends (#123369) 2024-04-05 18:51:38 +00:00
c10d [c10d] only PG0 should dump when monitoring thread timed out (#125356) 2024-05-04 00:43:20 +00:00
common [AOTI] Add ABI-compatiblity tests (#123848) 2024-04-19 00:51:24 +00:00
dist_autograd Remove unneeded linking in CMake targets (#109192) 2023-09-15 19:43:25 +00:00
jit Driver folder check (#117548) 2024-05-03 09:10:11 +00:00
lazy test_lazy: skip HashTest.Scalar (#112747) 2023-11-16 01:22:58 +00:00
lite_interpreter_runtime [2/N] Cleanup header inclusions in torch_cpu by iwyu (#109964) 2023-11-19 20:56:32 +00:00
monitor
profiler [c10] Move profiler clock to libc10 for timestamps (#111972) 2023-10-27 16:18:40 +00:00
rpc Revert " [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)" 2024-04-30 00:37:53 +00:00
tensorexpr [ROCm] remove HCC references (#111975) 2023-10-26 02:39:10 +00:00
__init__.py