pytorch/caffe2/python
Wei Zhang 1d4e996b87 Separate parameter downloading tasks from training tasks and run them in a different group
Summary:
At the end of distributed training, the trainer needs to download the parameters from the parameter servers in order to save the model. Currently, this download happens at the end of the job's epoch task group, which creates several problems when checkpointing is enabled for distributed training:

1. When checkpointing is enabled, we run multiple training epochs. The model download tasks run at the end of every epoch to collect parameters, but the model is not saved until training actually ends, so these per-epoch downloads waste resources.
2. After trainer0 downloads the parameters, they occupy a lot of memory, so trainer0 can easily run out of memory during the next training epoch.

Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group.
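
The ordering described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not the caffe2 API: `run_job`, `train_one_epoch`, `download_parameters`, and `save_model` are hypothetical names, and the "parameter server" is just a dictionary. It assumes the download group runs exactly once, after the last epoch and before the exit group.

```python
# Toy model of the task-group ordering; plain Python, NOT the caffe2 API.
# All function names here are hypothetical and exist only for illustration.

def run_job(num_epochs, epoch_group, download_group, exit_group):
    """Run the epoch group once per epoch, then download once, then exit."""
    events = []
    for epoch in range(num_epochs):
        epoch_group(epoch)          # training only; no parameter download here
        events.append("epoch")
    params = download_group()       # runs once, after the final epoch
    events.append("download")
    exit_group(params)              # save the model from the downloaded copy
    events.append("exit")
    return events


server_params = {"w": 0.0}          # stands in for the parameter servers
saved = {}                          # stands in for the saved model
download_calls = []                 # records how often the download ran


def train_one_epoch(epoch):
    server_params["w"] += 1.0       # pretend one epoch updates the weights


def download_parameters():
    download_calls.append(1)
    return dict(server_params)      # the trainer's local copy


def save_model(params):
    saved.update(params)


events = run_job(3, train_one_epoch, download_parameters, save_model)
```

In this sketch the trainer holds a local copy of the parameters only after the last epoch, which corresponds to the two problems listed above: no repeated downloads, and no downloaded copy sitting in memory while training continues.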

Reviewed By: azzolini

Differential Revision: D6765393

fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
2018-01-22 14:04:12 -08:00
docs Build doxygen docs with cmake and fix catalog generation 2018-01-18 18:47:59 -08:00
examples Checking for positive epoch size before running epoch 2018-01-18 11:48:35 -08:00
helpers Add if and while ops to brew 2017-12-05 17:33:34 -08:00
layers add dense regularization 2018-01-08 13:03:17 -08:00
mint Re-license to Apache 2017-09-28 16:22:00 -07:00
mkl Add op in MKLDNN 2018-01-21 08:21:43 -08:00
modeling Added inverted FP16 Initializer 2017-10-27 10:20:04 -07:00
models Fix pool op custom path issue 2, wrongful routing to global pooling 2018-01-09 00:54:45 -08:00
operator_test Implement fused 8bit rowwise sparse lengths reductions 2018-01-19 15:44:35 -08:00
predictor Record workflow run id for inference. 2017-12-18 15:33:19 -08:00
rnn GRU cell: add linear_before_reset boolean parameter 2018-01-08 13:22:56 -08:00
test Fix occasional test timeouts 2018-01-19 20:08:58 -08:00
_import_c_extension.py Re-license to Apache 2017-09-28 16:22:00 -07:00
allcompare_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
attention.py Re-license to Apache 2017-09-28 16:22:00 -07:00
benchmark_generator.py Re-license to Apache 2017-09-28 16:22:00 -07:00
binarysize.py Re-license to Apache 2017-09-28 16:22:00 -07:00
brew.py Add if and while ops to brew 2017-12-05 17:33:34 -08:00
brew_test.py Add if and while ops to brew 2017-12-05 17:33:34 -08:00
build.py Expose CMake options in the binary 2017-10-04 02:33:02 -07:00
cached_reader.py Cached reader 2017-11-15 12:38:49 -08:00
caffe_translator.py Fix for wrong newline in caffe_translator.py (Crop layer translation) 2018-01-12 16:17:53 -08:00
caffe_translator_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
checkpoint.py Separate parameter downloading tasks from training tasks and run them in a different group 2018-01-22 14:04:12 -08:00
checkpoint_test.py Separate parameter downloading tasks from training tasks and run them in a different group 2018-01-22 14:04:12 -08:00
CMakeLists.txt Fix OSS build 2017-12-21 19:04:25 -08:00
cnn.py Re-license to Apache 2017-09-28 16:22:00 -07:00
context.py Re-license to Apache 2017-09-28 16:22:00 -07:00
context_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
control.py Re-license to Apache 2017-09-28 16:22:00 -07:00
control_ops_grad.py Backpropagation for While op 2017-12-18 16:03:45 -08:00
control_ops_util.py Backpropagation for While op 2017-12-18 16:03:45 -08:00
control_test.py Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" 2017-10-12 20:21:52 -07:00
convnet_benchmarks.py Re-license to Apache 2017-09-28 16:22:00 -07:00
convnet_benchmarks_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
core.py Backpropagation for While op 2017-12-18 16:03:45 -08:00
core_gradients_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
core_test.py Py3 test fixes 2017-12-05 10:34:41 -08:00
crf.py Re-license to Apache 2017-09-28 16:22:00 -07:00
data_parallel_model.py Allow shifting of activations / ops to other GPUs in data parallel model 2017-11-29 21:17:00 -08:00
data_parallel_model_test.py Skip DeviceShiftTest if host has < 4 GPU devices 2017-12-03 16:02:05 -08:00
data_parallel_model_utils.py Allow shifting of activations / ops to other GPUs in data parallel model 2017-11-29 21:17:00 -08:00
data_workers.py move print to logger 2017-11-17 18:03:44 -08:00
data_workers_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
dataio.py Adding a time limit reader 2018-01-02 11:33:53 -08:00
dataio_test.py Adding a time limit reader 2018-01-02 11:33:53 -08:00
dataset.py Re-license to Apache 2017-09-28 16:22:00 -07:00
db_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
device_checker.py Re-license to Apache 2017-09-28 16:22:00 -07:00
dlpack.h Support for DLPack in Python op 2017-12-21 17:02:16 -08:00
dyndep.py Re-license to Apache 2017-09-28 16:22:00 -07:00
embedding_generation_benchmark.py Re-license to Apache 2017-09-28 16:22:00 -07:00
experiment_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
extension_loader.py Re-license to Apache 2017-09-28 16:22:00 -07:00
functional.py Re-license to Apache 2017-09-28 16:22:00 -07:00
functional_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
fused_8bit_rowwise_conversion_ops_test.py Add float32 <-> fused_rowwise_8bit conversion Caffe2 operators 2018-01-19 15:44:33 -08:00
gradient_check_test.py Backpropagation for While op 2017-12-18 16:03:45 -08:00
gradient_checker.py Re-license to Apache 2017-09-28 16:22:00 -07:00
gru_cell.py fix gru_cell bug 2018-01-12 15:34:23 -08:00
hsm_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
hypothesis_test.py Increase lower bound of values for values in div test 2018-01-22 09:06:12 -08:00
hypothesis_test_util.py Ensure indices list in sparse optimizer tests is unique 2018-01-03 12:19:14 -08:00
layer_model_helper.py enable setting model initialization seed 2018-01-11 14:04:03 -08:00
layer_model_instantiator.py add dense regularization 2018-01-08 13:03:17 -08:00
layer_parameter_sharing_test.py Add shape checks and print more info in parameter sharing 2017-10-27 01:22:06 -07:00
layer_test_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
layers_test.py testSparseLookup 2018-01-19 09:27:20 -08:00
lengths_reducer_fused_8bit_rowwise_ops_test.py Implement fused 8bit rowwise sparse lengths reductions 2018-01-19 15:44:35 -08:00
lengths_reducer_rowwise_8bit_ops_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
lstm_benchmark.py Re-license to Apache 2017-09-28 16:22:00 -07:00
memonger.py Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" 2017-10-12 20:21:52 -07:00
memonger_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
mkl_test_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
model_device_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
model_helper.py add sanity check to model_helper.TensorProtosDBInput 2017-11-21 10:28:25 -08:00
modifier_context.py Re-license to Apache 2017-09-28 16:22:00 -07:00
mpi_python.cc Upgrade to 2.2.1 2017-10-22 13:26:56 -07:00
muji.py Re-license to Apache 2017-09-28 16:22:00 -07:00
muji_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
net_builder.py Minor documentation fix in NetBuilder 2017-11-15 16:22:22 -08:00
net_builder_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
net_drawer.py Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" 2017-10-12 20:21:52 -07:00
net_printer.py Separate parameter downloading tasks from training tasks and run them in a different group 2018-01-22 14:04:12 -08:00
net_printer_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
observer_test.py Attach observers to operators inside step net 2017-11-14 15:06:38 -08:00
optimizer.py RowWiseSparseAdam operator 2018-01-16 19:39:31 -08:00
optimizer_context.py Re-license to Apache 2017-09-28 16:22:00 -07:00
optimizer_test.py Support RMSProp in Caffe2. 2017-11-08 16:43:18 -08:00
optimizer_test_util.py momentum sgd 2017-11-03 16:17:17 -07:00
parallel_workers.py Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" 2017-10-12 20:21:52 -07:00
parallel_workers_test.py Add shutdown_fun to parallel_workers 2017-10-10 12:02:24 -07:00
parallelize_bmuf_distributed_test.py BMUF cpu support 2017-11-19 23:41:25 -08:00
pipeline.py Re-license to Apache 2017-09-28 16:22:00 -07:00
pipeline_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
predictor_constants.py Re-license to Apache 2017-09-28 16:22:00 -07:00
pybind_state.cc Support for DLPack in Python op 2017-12-21 17:02:16 -08:00
pybind_state.h Support for DLPack in Python op 2017-12-21 17:02:16 -08:00
pybind_state_dlpack.cc Support for DLPack in Python op 2017-12-21 17:02:16 -08:00
pybind_state_dlpack.h Support for DLPack in Python op 2017-12-21 17:02:16 -08:00
pybind_state_gpu.cc Remove Set/GetDefaultGPUID and move to use current gpu id instead. 2018-01-19 18:03:21 -08:00
pybind_state_mkl.cc Re-license to Apache 2017-09-28 16:22:00 -07:00
python_op_test.py Throw Python exception from PythonOp instead of logging 2017-11-20 09:03:17 -08:00
queue_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
record_queue.py Re-license to Apache 2017-09-28 16:22:00 -07:00
recurrent.py Remove scoping assertion because it is not useful and causing errors 2017-12-11 18:03:45 -08:00
regularizer.py add dense regularization 2018-01-08 13:03:17 -08:00
regularizer_context.py Re-license to Apache 2017-09-28 16:22:00 -07:00
regularizer_test.py add dense regularization 2018-01-08 13:03:17 -08:00
rnn_cell.py Add ElmanCell and ElmanRNN 2018-01-18 12:14:02 -08:00
schema.py add struct get method 2017-12-19 12:35:56 -08:00
schema_test.py add struct get method 2017-12-19 12:35:56 -08:00
scope.py truthy check for empty string in NameScope() 2018-01-19 21:34:09 -08:00
scope_test.py Add a EmptyDeviceScope (i.e. allow setting CurrentDeviceScope() to None) 2017-11-02 11:25:48 -07:00
session.py Re-license to Apache 2017-09-28 16:22:00 -07:00
session_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
sparse_to_dense_mask_test.py Skip negative indices 2017-10-09 16:09:50 -07:00
task.py Re-license to Apache 2017-09-28 16:22:00 -07:00
test_util.py Re-license to Apache 2017-09-28 16:22:00 -07:00
text_file_reader.py Re-license to Apache 2017-09-28 16:22:00 -07:00
timeout_guard.py Re-license to Apache 2017-09-28 16:22:00 -07:00
toy_regression_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
tt_core.py Re-license to Apache 2017-09-28 16:22:00 -07:00
tt_core_test.py Re-license to Apache 2017-09-28 16:22:00 -07:00
utils.py Re-license to Apache 2017-09-28 16:22:00 -07:00
visualize.py Re-license to Apache 2017-09-28 16:22:00 -07:00
workspace.py Remove Set/GetDefaultGPUID and move to use current gpu id instead. 2018-01-19 18:03:21 -08:00
workspace_test.py Remove Set/GetDefaultGPUID and move to use current gpu id instead. 2018-01-19 18:03:21 -08:00