Summary: At the end of distributed training, the trainer needs to download the parameters back from the parameter servers so the model can be saved. Currently, this download happens at the end of the job's epoch task group, which creates two problems when checkpointing is enabled for distributed training:

1. With checkpointing enabled, we run multiple training epochs. The model download tasks run at the end of every epoch to collect the parameters, but the model is not saved until the true end of training, so most of that work is wasted.
2. After trainer0 downloads the parameters, they occupy a large amount of memory, so trainer0 can easily run out of memory during the next training epoch.

Our solution is to insert a parameter download task group between the job's epoch_group and its exit_group, so the download runs exactly once, after all epochs have finished. A sketch of this ordering follows below.

Reviewed By: azzolini

Differential Revision: D6765393

fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
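To make the scheduling concrete, here is a minimal sketch of the task-group ordering described above. The `TaskGroup` and `Job` classes and all method names are hypothetical stand-ins, not the actual Caffe2 API; the point is only the placement of the download group between the epoch loop and the exit group.

```python
# Hypothetical sketch: TaskGroup, Job, and every method below are
# illustrative names, not the real Caffe2 checkpoint/session API.

class TaskGroup:
    """A named collection of tasks that run together."""

    def __init__(self, name):
        self.name = name
        self.tasks = []

    def add(self, fn):
        self.tasks.append(fn)

    def run(self):
        for task in self.tasks:
            task()


class Job:
    """A distributed training job expressed as a sequence of task groups."""

    def __init__(self, num_epochs):
        self.num_epochs = num_epochs
        self.epoch_group = TaskGroup("epoch")
        # New: a dedicated download group that runs exactly once after all
        # epochs finish, instead of downloading at the end of every epoch.
        self.download_group = TaskGroup("download")
        self.exit_group = TaskGroup("exit")

    def run(self):
        for epoch in range(self.num_epochs):
            self.epoch_group.run()   # train (and checkpoint) one epoch
        self.download_group.run()    # fetch params from parameter servers once
        self.exit_group.run()        # save the model, clean up


# Usage: the download task no longer repeats per epoch and its memory
# cost is paid only after training is done.
job = Job(num_epochs=3)
job.epoch_group.add(lambda: print("train one epoch"))
job.download_group.add(lambda: print("download params from parameter servers"))
job.exit_group.add(lambda: print("save model"))
job.run()
```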
Top-level directories and files in the repository:

- binaries
- contrib
- core
- cuda_rtc
- db
- distributed
- experiments
- image
- mkl
- mobile
- mpi
- observers
- operators
- perfkernels
- proto
- python
- queue
- sgd
- share
- test
- transforms
- utils
- video
- CMakeLists.txt