Summary: At the end of distributed training, the trainer needs to download the parameters back from the parameter servers so the model can be saved. Currently, this download happens at the end of the job's epoch task group, which creates two problems when checkpointing is enabled for distributed training:

1. With checkpointing enabled, we run multiple training epochs. The model download tasks run at the end of every epoch to collect the parameters, but the model is not saved until the true end of training, so most of that work is wasted.
2. After trainer0 downloads the parameters, they occupy a large amount of memory, so trainer0 can easily run out of memory during the next training epoch.

Our solution is to insert a parameter download task group between the job's epoch_group and its exit_group, so the download runs exactly once, after all epochs have finished. A sketch of this ordering follows below.

Reviewed By: azzolini

Differential Revision: D6765393

fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
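To make the scheduling concrete, here is a minimal sketch of the task-group ordering described above. The `TaskGroup` and `Job` classes and all method names are hypothetical stand-ins, not the actual Caffe2 API; the point is only the placement of the download group between the epoch loop and the exit group.

```python
# Hypothetical sketch: TaskGroup, Job, and every method below are
# illustrative names, not the real Caffe2 checkpoint/session API.

class TaskGroup:
    """A named collection of tasks that run together."""

    def __init__(self, name):
        self.name = name
        self.tasks = []

    def add(self, fn):
        self.tasks.append(fn)

    def run(self):
        for task in self.tasks:
            task()


class Job:
    """A distributed training job expressed as a sequence of task groups."""

    def __init__(self, num_epochs):
        self.num_epochs = num_epochs
        self.epoch_group = TaskGroup("epoch")
        # New: a dedicated download group that runs exactly once after all
        # epochs finish, instead of downloading at the end of every epoch.
        self.download_group = TaskGroup("download")
        self.exit_group = TaskGroup("exit")

    def run(self):
        for epoch in range(self.num_epochs):
            self.epoch_group.run()   # train (and checkpoint) one epoch
        self.download_group.run()    # fetch params from parameter servers once
        self.exit_group.run()        # save the model, clean up


# Usage: the download task no longer repeats per epoch and its memory
# cost is paid only after training is done.
job = Job(num_epochs=3)
job.epoch_group.add(lambda: print("train one epoch"))
job.download_group.add(lambda: print("download params from parameter servers"))
job.exit_group.add(lambda: print("save model"))
job.run()
```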
Top-level directories and files in the repository:

- binaries
- contrib
- core
- cuda_rtc
- db
- distributed
- experiments
- image
- mkl
- mobile
- mpi
- observers
- operators
- perfkernels
- proto
- python
- queue
- sgd
- share
- test
- transforms
- utils
- video
- CMakeLists.txt