pytorch/torch/csrc/distributed
PyTorch MergeBot f62d8b2a0f ProcessGroupWrapper log full rank fingerprint mismatches (#79901)
### Current Error Message:
```
Detected mismatch between collectives on ranks. Rank 0 is running collective:
CollectiveFingerPrint(OpType=ALLREDUCE, …), but Rank 1 is running collective: REDUCE.
```

### Ops Mismatch, New Error Message (full fingerprint, including tensor shapes, dtypes, and device types):
```
Detected mismatch between collectives on ranks. Rank 0 is running collective:
CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float,
TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu,
layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))),

but Rank 1 is running collective:
CollectiveFingerPrint(OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float,
TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu,
layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
```

### Shape Mismatch, New Error Message
```
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running collective:
CollectiveFingerPrint(OpType=SCATTER, TensorShape=[1], TensorDtypes=Long,
TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false
(default), pinned_memory=false (default), memory_format=(nullopt))),

but Rank 1 is running collective:
CollectiveFingerPrint(OpType=SCATTER, TensorShape=[2], TensorDtypes=Long,
TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false
(default), pinned_memory=false (default), memory_format=(nullopt))).
```
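
Both cases above reduce to comparing one fingerprint per rank and reporting the first rank that differs. The following Python sketch illustrates that comparison only. `Fingerprint` and `check_fingerprints` are hypothetical names, not the C++ `CollectiveFingerPrint` API, and the sketch compares everything in a single process, whereas the real check runs across ranks.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fingerprint:
    # Hypothetical stand-in for the C++ CollectiveFingerPrint; the fields
    # mirror the ones printed in the error messages above.
    op_type: str
    tensor_shape: Tuple[int, ...]
    tensor_dtype: str
    device: str


def check_fingerprints(fingerprints: List[Fingerprint]) -> None:
    """Raise RuntimeError if any rank's fingerprint differs from rank 0's.

    The list index is treated as the rank (an assumption of this sketch).
    """
    reference = fingerprints[0]
    for rank, fp in enumerate(fingerprints[1:], start=1):
        if fp != reference:
            # Print both fingerprints in full, so a shape or dtype
            # difference is visible even when the op type matches.
            raise RuntimeError(
                "Detected mismatch between collectives on ranks. "
                f"Rank 0 is running collective: {reference}, "
                f"but Rank {rank} is running collective: {fp}."
            )
```

Because the dataclass compares all fields, an op mismatch (`ALLREDUCE` vs `REDUCE`) and a shape mismatch (`[1]` vs `[2]`) are caught by the same equality check.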

Changes:
- Update the fingerprint deserialize function to read tensor shapes, so the full fingerprint can be logged for each rank
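
For shapes to show up in a deserialized fingerprint, the shape has to travel with the rest of the serialized data. The Python sketch below shows one way this could work. It assumes hypothetical `serialize`/`deserialize` helpers and a flat integer layout of `[op, tensor count, then dtype, device, ndim, dims... per tensor]`, none of which is taken from the actual C++ code.

```python
from typing import List, Tuple

# One tensor is described as (dtype_code, device_code, shape).
# The integer codes are placeholders chosen for this sketch.
TensorMeta = Tuple[int, int, Tuple[int, ...]]


def serialize(op_code: int, tensors: List[TensorMeta]) -> List[int]:
    """Flatten a fingerprint into ints: op, tensor count, then for each
    tensor its dtype, device, ndim, followed by ndim dimension sizes."""
    out = [op_code, len(tensors)]
    for dtype_code, device_code, shape in tensors:
        out += [dtype_code, device_code, len(shape), *shape]
    return out


def deserialize(data: List[int]) -> Tuple[int, List[TensorMeta]]:
    """Inverse of serialize. Reads ndim first so it knows how many
    dimension sizes belong to each tensor."""
    op_code, num_tensors = data[0], data[1]
    pos = 2
    tensors: List[TensorMeta] = []
    for _ in range(num_tensors):
        dtype_code, device_code, ndim = data[pos:pos + 3]
        pos += 3
        shape = tuple(data[pos:pos + ndim])
        pos += ndim
        tensors.append((dtype_code, device_code, shape))
    return op_code, tensors
```

Storing `ndim` ahead of the dimension sizes lets tensors of different rank share one flat buffer, at the cost of a variable-length encoding.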

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79901
Approved by: https://github.com/rohan-varma
2022-06-28 18:30:38 +00:00
autograd canonicalize includes of form <aten/src/ATen/...> 2022-06-16 17:46:45 +00:00
c10d ProcessGroupWrapper log full rank fingerprint mismatches (#79901) 2022-06-28 18:30:38 +00:00
rpc [lint] autoformat test/cpp and torch/csrc 2022-06-11 21:11:16 +00:00