pytorch/caffe2
Wei Zhang 1d4e996b87 Separate parameter downloading tasks from training tasks and run them in a different group
Summary:
At the end of distributed training, the trainer needs to download the parameters back from the parameter servers in order to save the model. Currently, this parameter download happens at the end of the job's epoch task group, which creates several problems when checkpointing is enabled for distributed training:

1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks run to collect the parameters, but the model is not saved until the true end of training, so every download except the last one wastes resources.
2. After trainer0 downloads the parameters, they take up a lot of memory, so trainer0 can easily run out of memory during the next training epoch.

Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group.
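
The ordering described above can be sketched in plain Python. This is only an illustration of the control flow, not the caffe2 API: JobSketch, run_job, and the three task functions are hypothetical names introduced here, and the only assumption is that the epoch group repeats once per epoch while the download and exit groups each run once at the end.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A task is modeled as a callable that appends its name to a shared log.
Task = Callable[[List[str]], None]


@dataclass
class JobSketch:
    """Hypothetical stand-in for a job made of ordered task groups."""
    epoch_group: List[Task] = field(default_factory=list)
    download_group: List[Task] = field(default_factory=list)
    exit_group: List[Task] = field(default_factory=list)


def train(log: List[str]) -> None:
    log.append("train")


def download_params(log: List[str]) -> None:
    log.append("download")


def save_model(log: List[str]) -> None:
    log.append("save")


def run_job(job: JobSketch, num_epochs: int) -> List[str]:
    """Run the epoch group once per epoch, then download, then exit."""
    log: List[str] = []
    for _ in range(num_epochs):
        for task in job.epoch_group:
            task(log)
    # The download group sits between the epoch loop and the exit group,
    # so parameters are fetched once, right before the model is saved.
    for task in job.download_group:
        task(log)
    for task in job.exit_group:
        task(log)
    return log
```

With three epochs, the download task in this sketch runs once instead of three times, and nothing it fetched is held in memory while a later epoch trains.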

Reviewed By: azzolini

Differential Revision: D6765393

fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
2018-01-22 14:04:12 -08:00
..
binaries avoid auto's in the lambdas in OSS build 2017-12-08 12:05:02 -08:00
contrib Check for result in queue only after background process is terminated 2018-01-12 18:06:47 -08:00
core Checking performance flags during init. 2018-01-22 14:04:11 -08:00
cuda_rtc
db
distributed
experiments
image
mkl Add op in MKLDNN 2018-01-21 08:21:43 -08:00
mobile Add vulkanSymbolWrapperReset function 2018-01-12 21:18:06 -08:00
mpi
observers Update observer when attached to RNN ops 2017-12-14 10:04:20 -08:00
operators Moved mask-rcnn inference operators to open source caffe2. 2018-01-19 16:20:14 -08:00
perfkernels Add FusedEmbeddingLookup 2018-01-19 15:44:34 -08:00
proto Remove unused field in tensor proto 2017-11-13 17:25:15 -08:00
python Separate parameter downloading tasks from training tasks and run them in a different group 2018-01-22 14:04:12 -08:00
queue Misc Windows lint 2017-12-23 20:07:27 -08:00
sgd RowWiseSparseAdam operator 2018-01-16 19:39:31 -08:00
share NNPACK: Use new bindings and custom thread pool 2018-01-11 10:48:12 -08:00
test
transforms
utils Fix the Macro definition for E in cpuid.h; #undef E 2018-01-19 15:44:32 -08:00
video update the video input op in caffe2 2018-01-19 09:52:25 -08:00
CMakeLists.txt Adapting conda build to work for ubuntu and adding a flag to control precedence of Anaconda include dirs 2018-01-11 12:01:04 -08:00