pytorch/test/cpp
Chirag Pandya a83e745356 [BE] split seq_id to collective_seq_id and p2p_seq_id (#125727)
Summary:
Split `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea is that a collective that goes to all machines should have an identical `collective_seq_id` on every one of them, which makes it easier to spot when one of the machines isn't handling a collective operation.
As a next step, we can attempt to match up p2p operations to ensure that the sender(s) and receiver(s) are in sync.
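
A minimal sketch of the kind of check this split enables, assuming each rank reports the last `collective_seq_id` it recorded. The function name `find_lagging_ranks` and the rank-to-id dictionary are hypothetical illustrations, not part of c10d or this PR:

```python
from typing import Dict, List


def find_lagging_ranks(last_collective_seq_id: Dict[int, int]) -> List[int]:
    """Return the ranks whose last collective_seq_id is behind the newest one.

    Hypothetical helper: the input maps rank -> last collective_seq_id
    observed on that rank. If every rank took part in the same collectives,
    all values are identical and the result is empty.
    """
    if not last_collective_seq_id:
        return []
    newest = max(last_collective_seq_id.values())
    # Any rank below the newest id has not handled at least one collective.
    return sorted(
        rank for rank, seq in last_collective_seq_id.items() if seq < newest
    )


if __name__ == "__main__":
    # Made-up example values: rank 2 stopped one collective early.
    print(find_lagging_ranks({0: 7, 1: 7, 2: 6, 3: 7}))
```

Because p2p operations no longer advance the collective counter, a comparison like this is not skewed by ranks that issue different numbers of sends and receives.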

Resolves issue: https://github.com/pytorch/pytorch/issues/125173

Test Plan:
Unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727
Approved by: https://github.com/zdevito
2024-05-21 03:26:49 +00:00
aoti_abi_check
aoti_inference
api Use object identity for deepcopy memo (#126126) 2024-05-17 00:06:26 +00:00
c10d [BE] split seq_id to collective_seq_id and p2p_seq_id (#125727) 2024-05-21 03:26:49 +00:00
common
dist_autograd
jit [codemod] c10:optional -> std::optional (#126135) 2024-05-14 19:35:51 +00:00
lazy [codemod] c10:optional -> std::optional (#126135) 2024-05-14 19:35:51 +00:00
lite_interpreter_runtime
monitor
profiler
rpc Revert " [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)" 2024-04-30 00:37:53 +00:00
tensorexpr [codemod] c10:optional -> std::optional (#126135) 2024-05-14 19:35:51 +00:00
__init__.py