Summary: draft enabling fast_nvcc.
* cleaned up some non-standard usages
* added fall-back to wrap_nvcc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49773

Test Plan:
Configuration to enable fast NVCC:
- install and enable `ccache`, but delete the `.ccache/` folder before each build.
- `TORCH_CUDA_ARCH_LIST=6.0;6.1;6.2;7.0;7.5`
- toggle the `USE_FAST_NVCC=ON/OFF` CMake config and run `cmake --build` to measure the build time.

Initial statistics for a full compilation:
* `cmake --build . -- -j $(nproc)`:
  - fast NVCC:
    ```
    real    48m55.706s
    user    1559m14.218s
    sys     318m41.138s
    ```
  - normal NVCC:
    ```
    real    43m38.723s
    user    1470m28.131s
    sys     90m46.879s
    ```
* `cmake --build . -- -j $(($(nproc) / 4))`:
  - fast NVCC:
    ```
    real    53m44.173s
    user    1130m18.323s
    sys     71m32.385s
    ```
  - normal NVCC:
    ```
    real    81m53.768s
    user    858m45.402s
    sys     61m15.539s
    ```
* Conclusion: fast NVCC provides little gain when the build already uses full CPU parallelism; in fact it is **even worse**, presumably because of thread switching.

Initial statistics for a partial recompile (editing .cu files):
* `cmake --build . -- -j $(nproc)`:
  - fast NVCC:
    ```
    [2021-01-13 18:10:24] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
    [2021-01-13 18:11:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
    ```
  - normal NVCC:
    ```
    [2021-01-13 17:35:40] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
    [2021-01-13 17:38:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
    ```
* Conclusion: effective compilation time for a single .cu file modification drops from 2min30sec to only 40sec when compiling for multiple architectures. This is a **4x** speedup from fast NVCC, approaching the theoretical limit of 5x when compiling 5 gencode architectures at the same time.

Follow-up PRs:
- add a better fallback mechanism that detects up front whether a build is supported by fast_nvcc, instead of dry-running and then failing over to wrap_nvcc.
- add performance instrumentation to compare total compile time against the critical-path time of the parallel tasks.
- figure out why `-j $(nproc)` incurs significant sys overhead (`sys 318m41.138s` vs `sys 90m46.879s`) compared to normal NVCC; the guess is context switching, but this is unconfirmed.

Reviewed By: malfet

Differential Revision: D25692758

Pulled By: walterddr

fbshipit-source-id: c244d07b9b71f146e972b6b3682ca792b38c4457
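For reference, a minimal shell sketch of the test-plan setup above; the ccache cache location (`~/.ccache`) and running from an existing PyTorch build directory are assumptions, not part of the original test plan:

```
# Start each timed build with a cold ccache (default cache dir assumed).
rm -rf ~/.ccache
export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.5"

# Toggle fast NVCC at configure time (use OFF for the baseline run),
# then time the build at the two parallelism levels measured above.
cmake -DUSE_FAST_NVCC=ON ..
time cmake --build . -- -j "$(nproc)"
time cmake --build . -- -j "$(($(nproc) / 4))"
```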
This `./upstream` subfolder contains fixes for `FindCUDA` that were introduced in
later versions of CMake; without them, earlier CMake versions produce generator
expression errors. Specifically:
- a problem where a generator expression for include directories was passed to NVCC, with the generator expression itself prefixed by `-I`. As the NNPACK include directory generator expression expands to multiple directories, the second and later ones were not prefixed by `-I`, causing NVCC to return an error. First fixed in CMake 3.7 (see Kitware/CMake@7ded655f).
- Windows VS2017 fixes that allow one to define the ccbin path differently between earlier versions of Visual Studio and VS2017. First introduced after the 3.10.1 release, on the master branch (see Kitware/CMake@bc88329e).
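As an illustration of the first fix, the following CMake sketch shows how a single `-I` prefix breaks once the generator expression expands to more than one directory. The target name `nnpack` and variable names are hypothetical, and the `$<JOIN:...>` pattern is a common remedy rather than the verbatim upstream patch:

```
set(incs "$<TARGET_PROPERTY:nnpack,INTERFACE_INCLUDE_DIRECTORIES>")

# Broken: if the property holds "/path/a;/path/b", this single flag
# expands to "-I/path/a;/path/b" -- only the first directory gets -I,
# and NVCC rejects the bare second path.
set(nvcc_flags "-I${incs}")

# A common generator-expression remedy: JOIN re-inserts "-I" between
# the expanded elements, producing "-I/path/a;-I/path/b" (a CMake list
# of two properly prefixed flags) when evaluated, e.g. in a custom
# command.
set(nvcc_flags "$<$<BOOL:${incs}>:-I$<JOIN:${incs},$<SEMICOLON>-I>>")
```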
The downside of using these fixes is that `./upstream/CMakeInitializeConfigs.cmake`,
which defines some new CMake variables (added in
Kitware/CMake@48f7e2d3),
must be included before `./upstream/FindCUDA.cmake` to support older CMake
versions. A wrapper `./FindCUDA.cmake` is created to do this automatically, and
to allow submodules to use these fixes, since we can't patch their
`CMakeLists.txt`.
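In essence, the wrapper boils down to the following sketch (a simplification, not the verbatim file in this folder):

```
# Load the new CMake variables first, then the patched module, so that
# older CMake versions see both in the right order.
include("${CMAKE_CURRENT_LIST_DIR}/upstream/CMakeInitializeConfigs.cmake")
include("${CMAKE_CURRENT_LIST_DIR}/upstream/FindCUDA.cmake")
```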
If you need to update files under the `./upstream` folder, we recommend you first
issue PRs against the CMake mainline branch,
and then backport them here for earlier CMake compatibility.