Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for NCCL backend, because we'd only check if backend is NCCL and then move tensors to CUDA.
Instead, check whether it is a wrapped PG and, if so, check whether the PG that is wrapped is NCCL.
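The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `FakeProcessGroup`, `FakeWrapper`, `wrapped_pg`, and `is_nccl_backend` are hypothetical stand-ins, not the real `torch.distributed` classes or attributes.

```python
# Hypothetical stand-ins; these are NOT the real torch.distributed classes.
class FakeProcessGroup:
    def __init__(self, backend_name):
        self.backend_name = backend_name


class FakeWrapper:
    """Mimics a debug-mode wrapper that holds another process group."""

    def __init__(self, wrapped_pg):
        self.wrapped_pg = wrapped_pg


def is_nccl_backend(pg):
    # If the group is a wrapper, inspect the group it wraps instead of
    # the wrapper itself.
    if isinstance(pg, FakeWrapper):
        pg = pg.wrapped_pg
    return pg.backend_name == "nccl"
```

With this shape, a wrapped NCCL group and a bare NCCL group both take the "move tensors to CUDA" path, which is the behavior the summary asks for.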
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66167
Sometimes, due to desync, the PG wrapper's monitored barrier fails. In
this case it would be useful to print information about the collective that
was trying to run, along with the actual error.
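One way to picture the idea is to catch the barrier failure and re-raise it with the collective's description attached. The sketch below is only an assumption about the general pattern. `run_monitored_barrier`, `barrier_fn`, and `collective_info` are hypothetical names, and the message format is invented for illustration.

```python
def run_monitored_barrier(barrier_fn, collective_info):
    """Run barrier_fn; on failure, append the collective's info to the error."""
    try:
        barrier_fn()
    except RuntimeError as err:
        # Keep the original error text and add what was being run.
        raise RuntimeError(
            f"{err}\nCollective being run: {collective_info}"
        ) from err
```

Chaining with `from err` keeps the original exception available to callers that want the underlying cause.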
ghstack-source-id: 140037653
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D31353021
fbshipit-source-id: e2a515326c9314c98119978d5566eb5431cca96c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66166
These methods should be private.
ghstack-source-id: 139782587
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D31353020
fbshipit-source-id: 583fb315cc2cacc37df3d29cd5793b42558930b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60237
Closes https://github.com/pytorch/pytorch/issues/58711
This diff refactors the collective consistency checking in `ProcessGroupWrapper` as described in the above issue. In particular, we no longer run separate verification checks (`all_gather`s) for shapes, op type, etc. Instead, we implement a function `serialize_fingerprint` to serialize all this data into a single tensor and only verify that.
This is a lot more extensible: the developer does not need to add separate `all_gather` calls in order to verify additional data in the future. We could also provide a mechanism that allows data that needs to be verified to be "registered" in the `CollectiveFingerPrint` struct, making it even easier to add additional data. We can consider doing this if there are significant additions to `ProcessGroupWrapper`.
We now also check tensor dtypes and device types for consistency. Tests are refactored/added accordingly.
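The single-fingerprint approach can be illustrated with a small sketch. The layout below (op code, tensor count, then dtype, device type, and shape per tensor) and the code tables are assumptions made for illustration, not the actual encoding used by the C++ `serialize_fingerprint`. Plain lists stand in for the tensor that would be gathered across ranks.

```python
# Hypothetical integer codes; the real implementation uses its own enums.
OP_CODES = {"broadcast": 0, "allreduce": 1, "allgather": 2}
DTYPE_CODES = {"float32": 0, "float64": 1, "int64": 2}
DEVICE_CODES = {"cpu": 0, "cuda": 1}


def serialize_fingerprint(op, tensors):
    """Flatten op type and per-tensor metadata into one list of ints.

    tensors is a list of (dtype, device, shape) tuples.
    """
    data = [OP_CODES[op], len(tensors)]
    for dtype, device, shape in tensors:
        data.append(DTYPE_CODES[dtype])
        data.append(DEVICE_CODES[device])
        data.append(len(shape))  # rank, so shapes of any length can be decoded
        data.extend(shape)
    return data


def fingerprints_match(gathered):
    """True if every rank's serialized fingerprint is identical."""
    return all(fp == gathered[0] for fp in gathered)
```

Verifying a new property then means appending one more field in `serialize_fingerprint`, with no additional gather step.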
ghstack-source-id: 132520261
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D28597287
fbshipit-source-id: b09f14f628df9e2457623ba81fc13fd4e214f3c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543
Now that c10d is part of libtorch, it would be nice if the sources all lived in one place.
ghstack-source-id: 132306292
Test Plan: It builds
Reviewed By: cbalioglu
Differential Revision: D29062002
fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00
Renamed from torch/lib/c10d/ProcessGroupWrapper.cpp