mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
Should fix #13362 and fix #83790 I think I've discovered the root cause of the intermittent nccl link failures. If we look at the variable name in the redefinition error: ``` _02021d91_11_sendrecv_cu_0bc7b9c8_11152 ``` this is the name of the file being compiled + some form of unique ID. As part of NCCL's build process, the same file is compiled multiple times with different macro definitions depending on which operator and dtype are being compiled, e.g. ``` nvcc -DNCCL_OP=0 -DNCCL_TYPE=0 -dc sendrecv.cu -o sendrecv_sum_i8.o ``` Since the filename parts are the same, then if the unique IDs also happen to collide then the entire identifier will collide and the link fails. So the fix here is to generate a unique `.cu` file for each object file. I've implemented this as a `.patch` file that gets applied from our cmake code, but if we instead fork nccl that would be cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84245 Approved by: https://github.com/janeyx99, https://github.com/malfet |
||
|---|---|---|
| .. | ||
| External | ||
| Modules | ||
| Modules_CUDA_fix | ||
| public | ||
| Allowlist.cmake | ||
| BuildVariables.cmake | ||
| Caffe2Config.cmake.in | ||
| Caffe2ConfigVersion.cmake.in | ||
| cmake_uninstall.cmake.in | ||
| Codegen.cmake | ||
| DebugHelper.cmake | ||
| Dependencies.cmake | ||
| FlatBuffers.cmake | ||
| GoogleTestPatch.cmake | ||
| IncludeSource.cpp.in | ||
| iOS.cmake | ||
| Metal.cmake | ||
| MiscCheck.cmake | ||
| ProtoBuf.cmake | ||
| ProtoBufPatch.cmake | ||
| Summary.cmake | ||
| TorchConfig.cmake.in | ||
| TorchConfigVersion.cmake.in | ||
| VulkanCodegen.cmake | ||
| VulkanDependencies.cmake | ||