Tensors and Dynamic neural networks in Python with strong GPU acceleration
Wei Zhang 1d4e996b87 Separate parameter downloading tasks from training tasks and run them in a different group
Summary:
At the end of distributed training, the trainer needs to download the parameters back from the parameter servers in order to save the model. Currently, this parameter download happens at the end of the job's epoch task group, which creates several problems when checkpointing is enabled for distributed training:

1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks run to collect parameters, but the model is not saved until the true end of training, so every download except the last one wastes resources.
2. After trainer0 downloads the parameters, they take up a lot of memory, so trainer0 can easily run out of memory in the next training epoch.

Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group.
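
The ordering described above can be illustrated with a minimal, framework-free sketch. It does not use the Caffe2 API: the `TaskGroup` class, the `run_job` helper, and the `download_group` name are hypothetical stand-ins chosen for illustration. It only shows that the epoch group runs once per epoch while the download group runs a single time, just before the exit group.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskGroup:
    """Hypothetical stand-in for a group of tasks that run together."""
    name: str
    tasks: List[Callable[[], None]] = field(default_factory=list)

    def run(self) -> None:
        for task in self.tasks:
            task()


def run_job(epoch_group: TaskGroup, download_group: TaskGroup,
            exit_group: TaskGroup, num_epochs: int) -> None:
    """Run the epoch group once per epoch, then download once, then exit."""
    for _ in range(num_epochs):
        epoch_group.run()      # training only; no parameter download here
    download_group.run()       # parameters are fetched a single time
    exit_group.run()           # the model is saved at the true end of training


log: List[str] = []
epoch_group = TaskGroup("epoch_group", [lambda: log.append("train")])
download_group = TaskGroup("download_group", [lambda: log.append("download_params")])
exit_group = TaskGroup("exit_group", [lambda: log.append("save_model")])

run_job(epoch_group, download_group, exit_group, num_epochs=3)
print(log)
```

With three epochs, the log records three `train` entries followed by exactly one `download_params` and one `save_model`, so trainer0 never holds the downloaded parameters in memory while an epoch is still training.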

Reviewed By: azzolini

Differential Revision: D6765393

fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
2018-01-22 14:04:12 -08:00
.github Add placeholders for issues/pull requests 2017-12-11 14:35:25 -08:00
.jenkins Semi-automatically generate scripts out of our tutorials 2018-01-19 22:36:47 -08:00
.travis Run build_android.sh in Jenkins 2017-11-21 15:53:38 -08:00
caffe/proto cmake: relative paths for install() 2017-08-22 09:52:09 -07:00
caffe2 Separate parameter downloading tasks from training tasks and run them in a different group 2018-01-22 14:04:12 -08:00
cmake Checking performance flags during init. 2018-01-22 14:04:11 -08:00
conda Adapting conda build to work for ubuntu and adding a flag to control precedence of Anaconda include dirs 2018-01-11 12:01:04 -08:00
docker Add doxygen and graphviz to Jenkins docker base. 2018-01-19 15:05:45 -08:00
docs Build doxygen docs with cmake and fix catalog generation 2018-01-18 18:47:59 -08:00
modules Enable the detectron module in cmake 2018-01-18 10:21:22 -08:00
scripts Adding a separate script for anaconda builds 2018-01-18 16:03:45 -08:00
third_party Bump gloo 2018-01-04 17:49:21 -08:00
.gitattributes Fix linguist detection with gitattribute overrides 2017-10-23 17:03:07 -07:00
.gitignore Misc Windows lint 2017-12-23 20:07:27 -08:00
.gitmodules Adding zstd to build 2017-11-13 22:18:44 -08:00
.travis.yml disable travis webhook as we are moving to jenkins as CI 2018-01-02 14:42:15 -08:00
appveyor.yml Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
CMakeLists.txt Checking performance flags during init. 2018-01-22 14:04:11 -08:00
LICENSE Re-license to Apache 2017-09-28 16:22:00 -07:00
Makefile
NOTICE Re-license to Apache 2017-09-28 16:22:00 -07:00
README.md Remove request for proposal link from README.md 2018-01-04 09:11:05 -08:00
release-notes.md
setup.py OSError will be raised in setup.py if "git" is not installed 2018-01-22 14:04:10 -08:00
VERSION_NUMBER Add setup.py 2017-11-17 12:22:52 -08:00

Caffe2


Caffe2 is a lightweight, modular, and scalable deep learning framework. Building on the original Caffe, Caffe2 is designed with expression, speed, and modularity in mind.

Questions and Feedback

Please use GitHub issues (https://github.com/caffe2/caffe2/issues) to ask questions, report bugs, and request new features.

Please participate in our survey (https://www.surveymonkey.com/r/caffe2). We will send you information about new releases and special developer events/webinars.

License

Caffe2 is released under the Apache 2.0 license. See the NOTICE file for details.

Further Resources on Caffe2.ai