This PR is a mitigation for an internal out-of-memory issue on GPU0 that happened during comm abort; it was tested internally and confirmed to fix the OOM.

Note: this is intended as a mitigation only. The ideal fix belongs in the NCCL comm library, which should set the right CUDA context before any CUDA call and restore it to its exact previous state afterward. The problematic call path is ncclCommDestroy/ncclCommAbort -> commReclaim -> commDestroySync (https://fburl.com/code/pori1tka). In commDestroySync, NCCL decides that the "current device context" is not the same as the comm's device context, so it tries to:

1) save the current context,
2) set the comm's device context,
3) clean things up, and
4) restore the "previously stored context" with another cudaSetDevice.

If the thread's current device at abort time is still the default device 0, that save/restore round-trip can lazily initialize a CUDA context on GPU0, which is presumably where the extra memory went.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127363
Approved by: https://github.com/wconstab
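Below is a minimal sketch of the mitigation idea, assuming a plain NCCL/CUDA setup: make the comm's own device current before aborting, so commDestroySync sees a matching "current device" and skips the detour through device 0. This is not the actual PR diff, and the helper name abortCommOnOwnDevice is hypothetical; only ncclCommCuDevice, ncclCommAbort, cudaGetDevice, and cudaSetDevice are real APIs.

```cpp
// Sketch of the mitigation, NOT the actual PR diff.
// abortCommOnOwnDevice is a hypothetical helper name.
#include <cuda_runtime.h>
#include <nccl.h>

// Abort a communicator with its own device made current first, so
// NCCL's commDestroySync sees a matching "current device" and skips
// the extra cudaSetDevice round-trip that can lazily create a CUDA
// context on GPU0.
static ncclResult_t abortCommOnOwnDevice(ncclComm_t comm) {
  int commDev = -1;
  // Query the device this communicator was created on.
  ncclResult_t res = ncclCommCuDevice(comm, &commDev);
  if (res != ncclSuccess) {
    return res;
  }

  int prevDev = -1;
  if (cudaGetDevice(&prevDev) != cudaSuccess) {
    return ncclUnhandledCudaError;
  }

  // Pin the comm's device as current before the abort.
  if (cudaSetDevice(commDev) != cudaSuccess) {
    return ncclUnhandledCudaError;
  }

  res = ncclCommAbort(comm);

  // Restore the caller's device regardless of the abort result.
  cudaSetDevice(prevDev);
  return res;
}
```

Restoring the caller's previous device on exit keeps the helper transparent to its caller, mirroring the save/restore discipline the commit message says NCCL itself should follow.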