pytorch/cmake/Modules_CUDA_fix
Rong Rong (AI Infra) ebd142e94b initial commit to enable fast_nvcc (#49773)
Summary:
Draft to enable fast_nvcc.
* cleaned up some non-standard usages
* added fall-back to wrap_nvcc
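The fall-back mentioned above can be pictured as "try the fast path, and rerun the ordinary wrapper if it fails". The following is only a minimal sketch of that pattern, not the logic actually used by fast_nvcc; `run_with_fallback` and the two stand-in commands are hypothetical names introduced for illustration.

```python
import subprocess
import sys

def run_with_fallback(fast_cmd, fallback_cmd):
    """Run fast_cmd; if it exits non-zero, run fallback_cmd instead.

    Returns "fast" or "fallback" to report which path produced the result.
    """
    fast = subprocess.run(fast_cmd)
    if fast.returncode == 0:
        return "fast"
    # check=True raises CalledProcessError if the fallback fails as well.
    subprocess.run(fallback_cmd, check=True)
    return "fallback"

# Stand-in commands (assumptions): a real build would invoke the fast
# compiler driver and the ordinary nvcc wrapper here.
OK_CMD = [sys.executable, "-c", "pass"]
FAILING_CMD = [sys.executable, "-c", "raise SystemExit(1)"]

print(run_with_fallback(FAILING_CMD, OK_CMD))
```

A drawback of this pattern, as the follow-up list notes, is that an unsupported build pays for the failed fast attempt before the fallback starts.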

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49773

Test Plan:
Configuration to enable fast nvcc:
  - Install and enable `ccache`, but delete the `.ccache/` folder before each build.
  - `TORCH_CUDA_ARCH_LIST=6.0;6.1;6.2;7.0;7.5`
  - Toggle the `USE_FAST_NVCC=ON/OFF` cmake option and run `cmake --build` to compare build times.

Initial statistics for a full compilation:
* `cmake --build . -- -j $(nproc)`:
  - fast NVCC:
```
        real    48m55.706s
        user    1559m14.218s
        sys     318m41.138s
```
  - normal NVCC:
```
        real    43m38.723s
        user    1470m28.131s
        sys     90m46.879s
```
* `cmake --build . -- -j $(( $(nproc) / 4 ))`:
  - fast NVCC:
```
        real    53m44.173s
        user    1130m18.323s
        sys     71m32.385s
```
  - normal NVCC:
```
        real    81m53.768s
        user    858m45.402s
        sys     61m15.539s
```
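* Conclusion: fast NVCC provides little gain when the build already uses every CPU core (`-j $(nproc)`); it is in fact **even worse** (about 49 min vs. 44 min wall time), probably because of thread switching. With a quarter of the cores, fast NVCC is faster (about 54 min vs. 82 min wall time).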
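The wall-time and system-time comparisons above can be recomputed directly from the `time` output. This is a small sketch that only converts the reported figures to seconds and takes ratios; the helper names (`to_seconds`, `wall_ratio`, `sys_ratio`) are introduced here for illustration.

```python
import re

def to_seconds(stamp):
    """Convert a `time`-style duration such as '48m55.706s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# (real, user, sys) copied from the measurements above.
RUNS = {
    ("nproc", "fast"): ("48m55.706s", "1559m14.218s", "318m41.138s"),
    ("nproc", "normal"): ("43m38.723s", "1470m28.131s", "90m46.879s"),
    ("nproc/4", "fast"): ("53m44.173s", "1130m18.323s", "71m32.385s"),
    ("nproc/4", "normal"): ("81m53.768s", "858m45.402s", "61m15.539s"),
}

def wall_ratio(jobs):
    """Fast-NVCC wall time divided by normal-NVCC wall time (< 1 is faster)."""
    fast_real = to_seconds(RUNS[(jobs, "fast")][0])
    normal_real = to_seconds(RUNS[(jobs, "normal")][0])
    return fast_real / normal_real

def sys_ratio(jobs):
    """Fast-NVCC system time divided by normal-NVCC system time."""
    fast_sys = to_seconds(RUNS[(jobs, "fast")][2])
    normal_sys = to_seconds(RUNS[(jobs, "normal")][2])
    return fast_sys / normal_sys

for jobs in ("nproc", "nproc/4"):
    print(f"-j {jobs}: wall x{wall_ratio(jobs):.2f}, sys x{sys_ratio(jobs):.2f}")
```

With these figures the wall-time ratio is about 1.12 at `-j $(nproc)` (fast NVCC is slower) and about 0.66 at a quarter of the cores (fast NVCC is faster), while system time at `-j $(nproc)` grows by roughly 3.5 times.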

Initial statistics for a partial recompile (editing a single `.cu` file):

* `cmake --build . -- -j $(nproc)`:
  - fast NVCC:
```
[2021-01-13 18:10:24] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
[2021-01-13 18:11:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
```
  - normal NVCC:
```
[2021-01-13 17:35:40] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
[2021-01-13 17:38:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
```
* Conclusion: Effective compilation time for a single-`.cu`-file modification dropped from about 2 min 30 s to about 40 s when compiling for multiple architectures. This is roughly a **4X** speedup with fast NVCC, approaching the theoretical limit of 5X when compiling 5 gencode architectures at the same time.
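The elapsed times can be read off the log timestamps, taking the gap between the "Building NVCC" line and the following "Linking" line as an approximation of the compile time for that one file. A short sketch of the arithmetic (the helper name `elapsed_seconds` is introduced here for illustration):

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def elapsed_seconds(start, end):
    """Seconds between two log timestamps in the format used above."""
    begin = datetime.strptime(start, LOG_TIME_FORMAT)
    finish = datetime.strptime(end, LOG_TIME_FORMAT)
    return (finish - begin).total_seconds()

# Timestamps copied from the two log excerpts above.
fast_seconds = elapsed_seconds("2021-01-13 18:10:24", "2021-01-13 18:11:08")
normal_seconds = elapsed_seconds("2021-01-13 17:35:40", "2021-01-13 17:38:08")
speedup = normal_seconds / fast_seconds

print(f"fast: {fast_seconds:.0f}s, normal: {normal_seconds:.0f}s, "
      f"speedup: {speedup:.2f}x")
```

The timestamps give 44 s and 148 s, a ratio of about 3.4; the rounded figures quoted in the conclusion (40 s and 2 min 30 s) give a ratio closer to 3.75. The 5X bound assumes the five per-architecture compilations dominate the build and run fully in parallel.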

Follow up PRs:
- Add a better fallback mechanism that detects up front whether a build is supported by fast_nvcc, instead of dry-running, failing, and then falling back.
- Add performance instrumentation to measure total compile time against the critical-path time of the parallel tasks.
- Figure out why `-j $(nproc)` incurs significantly more sys overhead than normal nvcc (`sys 318m41.138s` vs `sys 90m46.879s`); the guess is context switching, but this is not confirmed.
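The second follow-up item distinguishes total compile time (the sum of all task durations) from critical-path time (the longest dependency chain, which bounds wall time even with unlimited parallelism). As an illustrative sketch only: the task graph and durations below are hypothetical and are not measurements from this PR.

```python
from functools import lru_cache

# Hypothetical task graph: name -> (duration in seconds, prerequisites).
ARCH_TASKS = ["compile_sm_60", "compile_sm_61", "compile_sm_62",
              "compile_sm_70", "compile_sm_75"]
TASKS = {"preprocess": (5.0, [])}
for arch_task in ARCH_TASKS:
    TASKS[arch_task] = (20.0, ["preprocess"])
TASKS["fatbinary"] = (2.0, ARCH_TASKS)
TASKS["host_compile"] = (10.0, ["fatbinary"])

def total_work(tasks):
    """Sum of all task durations: the time a fully serial run would take."""
    return sum(duration for duration, _ in tasks.values())

def critical_path(tasks):
    """Length of the longest dependency chain through the task graph."""
    @lru_cache(maxsize=None)
    def finish_time(name):
        duration, deps = tasks[name]
        # A task can start only after its slowest prerequisite finishes.
        return duration + max((finish_time(dep) for dep in deps), default=0.0)

    return max(finish_time(name) for name in tasks)

work = total_work(TASKS)
span = critical_path(TASKS)
print(f"total work: {work:.0f}s, critical path: {span:.0f}s, "
      f"best-case speedup: {work / span:.2f}x")
```

Comparing the two numbers shows how much of the serial time parallel execution could remove in the best case; the remainder is inherently sequential.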

Reviewed By: malfet

Differential Revision: D25692758

Pulled By: walterddr

fbshipit-source-id: c244d07b9b71f146e972b6b3682ca792b38c4457
2021-01-19 14:50:54 -08:00
upstream initial commit to enable fast_nvcc (#49773) 2021-01-19 14:50:54 -08:00
FindCUDA.cmake
FindCUDNN.cmake
README.md

This ./upstream subfolder contains fixes for FindCUDA that are introduced in later versions of CMake but cause generator expression errors in earlier CMake versions. Specifically:

  1. A problem where a generator expression for include directories was passed to NVCC with the generator expression itself prefixed by -I. Because the NNPACK include directory generator expression expands to multiple directories, the second and later ones were not prefixed by -I, causing NVCC to return an error. First fixed in CMake 3.7 (see Kitware/CMake@7ded655f).

  2. Windows VS2017 fixes that allow the ccbin path to be defined differently for VS2017 than for earlier versions of Visual Studio. First introduced on the CMake master branch after version 3.10.1 (see Kitware/CMake@bc88329e).
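The first problem above comes down to where the -I prefix is applied relative to list expansion. The sketch below models it in Python rather than CMake: the include directories are hypothetical, and the two functions only imitate the before and after behavior described in item 1, not CMake's actual implementation.

```python
# Hypothetical include directories that one generator expression expands to.
INCLUDE_DIRS = ["/opt/nnpack/include", "/opt/pthreadpool/include"]

def prefix_whole_expression(include_dirs):
    """Model of the bug: -I is attached to the unexpanded expression.

    The expression then expands to a ;-separated list, so only the first
    directory ends up carrying the -I prefix.
    """
    expanded = ";".join(include_dirs)
    return ("-I" + expanded).split(";")

def prefix_each_directory(include_dirs):
    """Model of the fix: expand first, then prefix every directory."""
    return ["-I" + directory for directory in include_dirs]

print(prefix_whole_expression(INCLUDE_DIRS))
print(prefix_each_directory(INCLUDE_DIRS))
```

In the first result the second directory reaches the compiler as a bare path, which matches the NVCC error described in item 1.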

The downside of using these fixes is that ./upstream/CMakeInitializeConfigs.cmake, which defines some new CMake variables (added in Kitware/CMake@48f7e2d3), must be included before ./upstream/FindCUDA.cmake to support older CMake versions. A wrapper ./FindCUDA.cmake is created to do this automatically, and to allow submodules to use these fixes because we can't patch their CMakeLists.txt.

If you need to update files under the ./upstream folder, we recommend that you issue PRs against the CMake mainline branch and then backport them here for compatibility with earlier CMake versions.