mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Summary: At the end of distributed training, trainer needs to download the parameters back from parameter servers for saving the model. Currently, this parameter downloading happens at the end of job's epoch task group, which creates several problems when checkpointing is enabled for distributed training: 1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks will run to collect parameters, but we won't save the model until the true end of training, so there is a big waste of resource. 2. After trainer0 downloads the parameters, these parameters take a lot of memory, so trainer0 can easily run out of memory in the next epoch of training. Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group. Reviewed By: azzolini Differential Revision: D6765393 fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49 |
||
|---|---|---|
| .github | ||
| .jenkins | ||
| .travis | ||
| caffe/proto | ||
| caffe2 | ||
| cmake | ||
| conda | ||
| docker | ||
| docs | ||
| modules | ||
| scripts | ||
| third_party | ||
| .gitattributes | ||
| .gitignore | ||
| .gitmodules | ||
| .travis.yml | ||
| appveyor.yml | ||
| CMakeLists.txt | ||
| LICENSE | ||
| Makefile | ||
| NOTICE | ||
| README.md | ||
| release-notes.md | ||
| setup.py | ||
| VERSION_NUMBER | ||
Caffe2
Caffe2 is a lightweight, modular, and scalable deep learning framework. Building on the original Caffe, Caffe2 is designed with expression, speed, and modularity in mind.
Questions and Feedback
Please use Github issues (https://github.com/caffe2/caffe2/issues) to ask questions, report bugs, and request new features.
Please participate in our survey (https://www.surveymonkey.com/r/caffe2). We will send you information about new releases and special developer events/webinars.
License
Caffe2 is released under the Apache 2.0 license. See the NOTICE file for details.