Summary: draft enabling fast_nvcc.
* cleaned up some non-standard usages
* added fall-back to wrap_nvcc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49773

Test Plan:
Configuration to enable fast NVCC:
- install and enable `ccache`, but delete the `.ccache/` folder before each build.
- `TORCH_CUDA_ARCH_LIST=6.0;6.1;6.2;7.0;7.5`
- toggle the `USE_FAST_NVCC=ON/OFF` CMake config and run `cmake --build` to measure the build time.

Initial statistics for a full compilation:
* `cmake --build . -- -j $(nproc)`:
  - fast NVCC:
    ```
    real    48m55.706s
    user    1559m14.218s
    sys     318m41.138s
    ```
  - normal NVCC:
    ```
    real    43m38.723s
    user    1470m28.131s
    sys     90m46.879s
    ```
* `cmake --build . -- -j $(($(nproc) / 4))`:
  - fast NVCC:
    ```
    real    53m44.173s
    user    1130m18.323s
    sys     71m32.385s
    ```
  - normal NVCC:
    ```
    real    81m53.768s
    user    858m45.402s
    sys     61m15.539s
    ```
* Conclusion: fast NVCC provides little gain when the build already uses full CPU parallelism; in fact it is **even worse**, presumably because of thread switching.

Initial statistics for a partial recompile (editing .cu files):
* `cmake --build . -- -j $(nproc)`:
  - fast NVCC:
    ```
    [2021-01-13 18:10:24] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
    [2021-01-13 18:11:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
    ```
  - normal NVCC:
    ```
    [2021-01-13 17:35:40] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
    [2021-01-13 17:38:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
    ```
* Conclusion: effective compilation time for a single .cu file modification drops from 2min30sec to only 40sec when compiling for multiple architectures. This is a **4x** speedup from fast NVCC, approaching the theoretical limit of 5x when compiling 5 gencode architectures at the same time.

Follow-up PRs:
- add a better fallback mechanism that detects up front whether a build is supported by fast_nvcc, instead of dry-running and then failing over to wrap_nvcc.
- add performance instrumentation to compare total compile time against the critical-path time of the parallel tasks.
- figure out why `-j $(nproc)` incurs significant sys overhead (`sys 318m41.138s` vs `sys 90m46.879s`) compared to normal NVCC; the guess is context switching, but this is unconfirmed.

Reviewed By: malfet

Differential Revision: D25692758

Pulled By: walterddr

fbshipit-source-id: c244d07b9b71f146e972b6b3682ca792b38c4457
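For reference, a minimal shell sketch of the test-plan setup above; the ccache cache location (`~/.ccache`) and running from an existing PyTorch build directory are assumptions, not part of the original test plan:

```
# Start each timed build with a cold ccache (default cache dir assumed).
rm -rf ~/.ccache
export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.5"

# Toggle fast NVCC at configure time (use OFF for the baseline run),
# then time the build at the two parallelism levels measured above.
cmake -DUSE_FAST_NVCC=ON ..
time cmake --build . -- -j "$(nproc)"
time cmake --build . -- -j "$(($(nproc) / 4))"
```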
This `./upstream` subfolder contains fixes for `FindCUDA` that were introduced in
later versions of CMake; without them, earlier CMake versions produce generator
expression errors. Specifically:
- a problem where a generator expression for include directories was passed to NVCC, with the generator expression itself prefixed by `-I`. As the NNPACK include directory generator expression expands to multiple directories, the second and later ones were not prefixed by `-I`, causing NVCC to return an error. First fixed in CMake 3.7 (see Kitware/CMake@7ded655f).
- Windows VS2017 fixes that allow one to define the ccbin path differently between earlier versions of Visual Studio and VS2017. First introduced after the 3.10.1 release, on the master branch (see Kitware/CMake@bc88329e).
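As an illustration of the first fix, the following CMake sketch shows how a single `-I` prefix breaks once the generator expression expands to more than one directory. The target name `nnpack` and variable names are hypothetical, and the `$<JOIN:...>` pattern is a common remedy rather than the verbatim upstream patch:

```
set(incs "$<TARGET_PROPERTY:nnpack,INTERFACE_INCLUDE_DIRECTORIES>")

# Broken: if the property holds "/path/a;/path/b", this single flag
# expands to "-I/path/a;/path/b" -- only the first directory gets -I,
# and NVCC rejects the bare second path.
set(nvcc_flags "-I${incs}")

# A common generator-expression remedy: JOIN re-inserts "-I" between
# the expanded elements, producing "-I/path/a;-I/path/b" (a CMake list
# of two properly prefixed flags) when evaluated, e.g. in a custom
# command.
set(nvcc_flags "$<$<BOOL:${incs}>:-I$<JOIN:${incs},$<SEMICOLON>-I>>")
```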
The downside of using these fixes is that `./upstream/CMakeInitializeConfigs.cmake`,
which defines some new CMake variables (added in
Kitware/CMake@48f7e2d3),
must be included before `./upstream/FindCUDA.cmake` to support older CMake
versions. A wrapper `./FindCUDA.cmake` is created to do this automatically, and
to allow submodules to use these fixes, since we can't patch their
`CMakeLists.txt`.
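In essence, the wrapper boils down to the following sketch (a simplification, not the verbatim file in this folder):

```
# Load the new CMake variables first, then the patched module, so that
# older CMake versions see both in the right order.
include("${CMAKE_CURRENT_LIST_DIR}/upstream/CMakeInitializeConfigs.cmake")
include("${CMAKE_CURRENT_LIST_DIR}/upstream/FindCUDA.cmake")
```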
If you need to update files under the `./upstream` folder, we recommend you first
issue PRs against the CMake mainline branch,
and then backport them here for earlier CMake compatibility.