Summary:
(Work in progress). This diff will allow shifting of activations to other GPUs, in case the model does not fit into memory. To see the API, check the code in data_parallel_model_test, which tests shifting two activations from gpu 0 and 1 to gpu 4, and from gpu 2 and 3 to gpu 5.
I will need to further test on ResNets, and probably add copy operations to handle device change points.
Reviewed By: asaadaldien
Differential Revision: D5591674
fbshipit-source-id: eb12d23651a56d64fa4db91090c6474218705270
Summary:
resnet50 trainer will save the 'optimizer_iteration' blob in checkpoints, but loads it in GPU context. This fails because AtomicIter/Iter expect the blob to be in CPU context. So manually reset the optimizer_iteration in CPU context.
I am thinking of making the iter-operators automatically do this switch, but in the meantime this unbreaks the trainer.
Reviewed By: sf-wind
Differential Revision: D6232626
fbshipit-source-id: da7c183a87803e008f94c86b6574b879c3b76438
Summary:
My commit bab5bc broke things with fp16 compute, as I had tested it only with the null-input, which actually produced fp32 data (even though dtype was given as float16). Also, I had confused the concepts of "float16 compute" and fp16 data. Issue #1408.
This fixes those issues, tested with both Volta and M40 GPUs. Basically restored much of the previous code and fixed the null input to do FloatToHalf.
Reviewed By: pietern
Differential Revision: D6211849
fbshipit-source-id: 5b41cffdd605f61a438a4c34c56972ede9eee28e
Summary: Allow the GEMMs in the FC/FCGradient Op to do FP16 compute instead of FP32 if the appropriate op flag is set.
Reviewed By: asaadaldien
Differential Revision: D5839777
fbshipit-source-id: 8051daedadf72bf56c298c1cf830b019b7019f43
Summary: On CPU, no need to replicate parameters. So try using only one copy (cpu_0) for parameters. Made resnet50_trainer use shared model in cpu mode.
Reviewed By: wesolwsk
Differential Revision: D5812181
fbshipit-source-id: 93254733edbc4a62bd74a629a68f5fa23f7e96ea
Summary:
This is useful for pure throughput tests where
we don't care about training a real model.
Reviewed By: akyrola
Differential Revision: D5834293
fbshipit-source-id: dab528c9269fb713e6f6b42457966219c06e0a35
Summary: Otherwise weights, biases are not created and test creation fails
Reviewed By: gsethi523
Differential Revision: D5836438
fbshipit-source-id: 32a75313b6b9ebecbfaa43ebd39f19c8eaba8cd1
Summary:
Before this change there were two ways for machines to rendezvous for a
distributed run: shared file system or Redis. If you're using an MPI
cluster it is much more convenient to simply execute mpirun and expect
the "right thing (tm)" to happen. This change adds the "mpi_rendezvous"
option to the CreateCommonWorld operator. If this is set, the common
world size and rank will be pulled from the MPI context and Gloo
rendezvous takes place using MPI. Note that this does NOT mean the MPI
BTL is used; MPI is only used for rendezvous.
Closes https://github.com/caffe2/caffe2/pull/1190
Reviewed By: akyrola
Differential Revision: D5796060
Pulled By: pietern
fbshipit-source-id: f8276908d3f3afef2ac88594ad377e38c17d0226
Summary:
These arguments control which Gloo transport (TCP or IB) and which
network interface is used for the common world. If not specified, it
defaults to using TCP and the network interface for the IP that the
machine's hostname resolves to.
The valid values for the transport argument are "tcp" and "ibverbs".
For ibverbs to work, Gloo must have been compiled with ibverbs
support. If Gloo is built as part of Caffe2 (sourced from the
third_party directory), then you can pass -DUSE_IBVERBS=ON to CMake to
enable ibverbs support in Gloo.
Closes https://github.com/caffe2/caffe2/pull/1177
Reviewed By: akyrola
Differential Revision: D5789729
Pulled By: pietern
fbshipit-source-id: 0dea1a115c729e54c5c1f9fdd5fb29c14a834a82
Summary:
This brings it up to par with how the RedisStoreHandler
works. The store handler configuration does not have to change and
only the run ID parameter changes across runs.
This was inconsistent and came up in https://github.com/caffe2/caffe2/issues/984.
Reviewed By: Yangqing
Differential Revision: D5539299
fbshipit-source-id: 3b5f31c6549b46c24bbd70ebc0bec150eac8b76c
Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it will overwrite the training model's params, and with DPM, those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is ok to have init_params=True if you never run the param_init_net...).
Reviewed By: asaadaldien
Differential Revision: D5509963
fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f
Summary:
CPU version of data parallel model. The great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs).
Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and being a bit lazy). Can improve later.
Reviewed By: wesolwsk
Differential Revision: D5277350
fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
Summary: I broke resnet50 when switching to use optimizer, which uses LR per parameter. This only happens after each epoch, and I did not test patiently enough. For a stop-gap, while asaadaldien works on a better solution, just fetch the lr of a conv1_w param.
Reviewed By: asaadaldien
Differential Revision: D5207552
fbshipit-source-id: f3474cd5eb0e291a59880e2834375491883fddfc
Summary: replace hand made sgd with build_sgd
Reviewed By: salexspb
Differential Revision: D5186331
fbshipit-source-id: 3c7b4b370e29a1344b95819766463bae3812c9a6
Summary:
Add add_weight_decay to optimizer + test.
In D5142973 I accidentally removed weight decay from resnet50 trainer, so this restores it.
Reviewed By: asaadaldien
Differential Revision: D5173594
fbshipit-source-id: c736d8955eddff151632ae6be11afde0883f7531
Summary:
This diff does two things:
- add support for optimizer to data_parallel_model. User can supply optimizer_builder_fun instead of param_update_builder_fun. The latter is called for each GPU separately with proper namescope and devicescope, while the optimizer builder is called only once and adds optimizers to the whole model.
- use MomentumSGDUpdate instead of MomentumSGD + WeightedSum. This brings major perf benefits.
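The second point can be illustrated with a minimal sketch. It assumes the non-Nesterov form of the update and uses plain Python scalars in place of tensors; the function names are illustrative only and are not the Caffe2 operator signatures. The fused version computes the same result in a single step, which is presumably where the performance benefit comes from.

```python
def momentum_sgd(grad, m, lr, momentum):
    # Assumed non-Nesterov form: the adjusted gradient doubles as the new momentum.
    adjusted = lr * grad + momentum * m
    return adjusted, adjusted


def weighted_sum(x0, w0, x1, w1):
    # Generic weighted sum of two inputs.
    return w0 * x0 + w1 * x1


def two_step_update(param, grad, m, lr, momentum):
    # Old style: compute the adjusted gradient, then apply it in a second step.
    adjusted, m_new = momentum_sgd(grad, m, lr, momentum)
    return weighted_sum(param, 1.0, adjusted, -1.0), m_new


def fused_update(param, grad, m, lr, momentum):
    # New style: momentum update and parameter update in one step.
    m_new = lr * grad + momentum * m
    return param - m_new, m_new
```

Both functions return the same (param, momentum) pair for the same inputs; only the number of steps differs.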
Changes resnet50 trainer to use optimizer.
This relies on D5133652
Reviewed By: dzhulgakov
Differential Revision: D5142973
fbshipit-source-id: 98e1114f5fae6c657314b3296841ae2dad0dc0e2
Summary:
Major improvements. Before we only synced "params" and "computed params" of model after initialization and after loading a checkpoint. But actually we want to sync all blobs that are generated in the param_init_net. For example the _momentum blobs were missed by the previous implementation and had to be manually included in checkpoint finalization.
I also added GetCheckpointParams() to data_parallel_model because it is now fully general. Also added a unit test.
Reviewed By: andrewwdye
Differential Revision: D5093689
fbshipit-source-id: 8154ded0c73cd6a0f54ee024dc5f2c6826ed7e42
Summary:
Update rnn_cell.py and char_rnn.py example with new `brew` model.
- Deprecated CNNModelHelper
- replace all helper functions with brew helper functions
- Use `model.net.<SingleOp>` format to create bare bone Operator for better clarity.
Reviewed By: salexspb
Differential Revision: D5062963
fbshipit-source-id: 254f7b9059a29621027d2b09e932f3f81db2e0ce
Summary: new resnet building with brew
Reviewed By: akyrola
Differential Revision: D4945418
fbshipit-source-id: d90463834cbba2c35d625053ba8812e192df0adf
Summary:
Script caffe2/caffe2/python/examples/resnet50_trainer.py can be used to train a ResNet-50 model with Imagenet data (or similar).
However, currently the script does not actually save the model, so it is kind of useless.
Task 1: After each epoch, save the model in a file "<filename>_X.mdl" where X is the epoch number and <filename> is given as a command line parameter. By default, use "resnet50_model" as filename.
Task 2: Add functionality to restore the model from a previous file:
- add a command line parameter "load_model", which user can use to specify a filename.
- if this parameter is set, load the model parameters from the previous file
Reviewed By: prigoyal
Differential Revision: D4984340
fbshipit-source-id: 333e92679ba52a7effe9917fdfc2d55d652b868f
Summary: printing resnet training loss and accuracy for each batch so that people will have a better idea of what is going on
Reviewed By: pietern
Differential Revision: D4945390
fbshipit-source-id: 0fcd60f4735e81641355aba6e6cbf0e57e886e38
Summary: This is the nice way to re-use RNN layers for training and for inference.
Reviewed By: salexspb
Differential Revision: D4825894
fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
Summary:
A few fixes in this commit: the epoch size is now rounded
down to the closest integer multiple of the global batch size (batch
per GPU * GPUs per host * hosts per run). The num_shards and shard_id
parameters are now passed to CreateDB so multiple processes actually
train on different subsets of data. The LR step size is scaled by the
number of hosts in the run. The test accuracy is only determined after
each epoch instead of after every so many iterations.
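
The epoch size rounding described above can be sketched as follows. This is a minimal illustration, not the trainer's actual code; the function and parameter names are hypothetical.

```python
def round_epoch_size(epoch_size, batch_per_gpu, gpus_per_host, num_hosts):
    """Round epoch_size down to an integer multiple of the global batch size."""
    global_batch_size = batch_per_gpu * gpus_per_host * num_hosts
    if global_batch_size <= 0:
        raise ValueError("global batch size must be positive")
    return (epoch_size // global_batch_size) * global_batch_size


# Hypothetical example: 32 images per GPU, 8 GPUs per host, 4 hosts
# gives a global batch size of 1024.
rounded = round_epoch_size(1281167, 32, 8, 4)
```

With these example numbers every epoch consists of a whole number of global batches, so no host ends an epoch with a partial batch.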
Differential Revision: D4871505
fbshipit-source-id: d2703dc7cf1e0f76710d9d7c09cd362a42fe0598
Summary:
Use data_parallel_model for seq2seq multi-gpu training. The main reason for complexity here is that GatherOp hasn't yet been implemented on GPU.
This diff also adds a better clipping procedure - clip by global norm rather than by absolute value.
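
The difference between the two clipping procedures can be sketched in plain Python. This is an illustration under the assumption that gradients are lists of floats; it is not the code in this diff, and the function names are hypothetical.

```python
import math


def clip_by_value(grads, limit):
    # Clamp every element independently; this can change the gradient direction.
    return [[max(-limit, min(limit, g)) for g in grad] for grad in grads]


def clip_by_global_norm(grads, max_norm):
    # Compute one L2 norm over all gradients and rescale them together,
    # which preserves the overall gradient direction.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return [list(grad) for grad in grads]
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads]
```

For gradients [[3.0], [4.0]] and a limit of 1.0, clipping by value yields [[1.0], [1.0]], while clipping by global norm keeps the 3:4 ratio between the two gradients.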
Differential Revision: D4778691
fbshipit-source-id: bff184dae02ecc227413fef51f48a4726e5d3825
Summary: D4734505 part 2. Remove more instances of the batch_size parameter
Reviewed By: urikz
Differential Revision: D4736906
fbshipit-source-id: fc9d374e9308017d61c427890364c5ab9cec2edf
Summary:
Make it use Gloo and optionally use Redis for rendezvous (where a
shared filesystem is not available).
Differential Revision: D4709943
fbshipit-source-id: 59cc7a14316c7b634417ea5161a75fab3c19f2fa
Summary: UNK needs to be indexed in the vocabulary for validation to work. Default args now result in training loss decreasing.
Reviewed By: urikz
Differential Revision: D4703393
fbshipit-source-id: e4d6ad100daf8392f8ba1e502f9ecf39bb8ce24a
Summary: We should be using the vocabulary built on the training data, and corpus_eval as data for the evaluation phase.
Reviewed By: urikz
Differential Revision: D4700382
fbshipit-source-id: ca1dd043a28f9bb585faad050c82fb12c1cdf6cc
Summary:
TSIA
This change also fixes an undefined attribute error after running 20
iterations of the resnet50 example trainer.
Differential Revision: D4692794
fbshipit-source-id: b98efdfeb078c5ba89d2a86837f3c672e1eade5f
Summary:
OSS implementation of seq2seq model in Caffe2. The script uses the Seq2SeqModelCaffe2 class to build and run the model. It takes in training data in the form of a text file with one sentence on each line, builds a vocabulary, generates batches based on batch size and runs the net for a configurable number of epochs. It prints the total scalar loss at the end of each epoch.
All FBLearner and neural_mt type system dependencies have been removed. Unimplemented and unnecessary methods have been removed to make the script simpler.
fblearner/flow/projects/langtech/translation/neural_mt/model_util_caffe2.py has been moved to caffe2/caffe2/python/examples/seq2seq_util.py and remains unchanged
Potential TODOs:
- Get the model running in GPU. Only GatherOp does not have a corresponding GPU implementation. Try adding CopyGPUToCPU before and CopyCPUToGPU after Gather, and use CUDA DeviceOption.
- Add evaluation on test data with suitable metric (perplexity? bleu?)
Reviewed By: urikz
Differential Revision: D4653333
fbshipit-source-id: 1c7d970ebc86afe23fad4d48854296bf54eb0f77
Summary:
It could be that only the first item in the batch was really used, in case the rest of the memory was 0. Or, if the memory there held a big positive integer, then the whole sequence was used. So whether the rest of the batch was used depended on our luck :)
Reviewed By: Yangqing
Differential Revision: D4599569
fbshipit-source-id: ae89cee796bbcbc232e4abcab71dee360b0d8bc6
Summary:
Input has to be arranged in such a way that the j-th example of
batch i goes right before the j-th example of batch i+1 in the text.
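
One way to get this arrangement is to split the text into batch_size contiguous chunks and let example j of every batch walk through chunk j. The sketch below assumes a plain string as input; the function name and parameters are hypothetical and not taken from the example script.

```python
def make_batches(text, batch_size, seq_length):
    """Split text so example j of batch i+1 continues example j of batch i."""
    chunk_len = len(text) // batch_size      # characters assigned to each row j
    num_batches = chunk_len // seq_length    # full batches that fit in a chunk
    batches = []
    for i in range(num_batches):
        batch = []
        for j in range(batch_size):
            start = j * chunk_len + i * seq_length
            batch.append(text[start:start + seq_length])
        batches.append(batch)
    return batches


batches = make_batches("abcdefghijkl", batch_size=2, seq_length=3)
```

Here "abc" (example 0 of batch 0) is directly followed in the text by "def" (example 0 of batch 1), so recurrent state carried across batches stays consistent per example.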
Reviewed By: urikz
Differential Revision: D4519553
fbshipit-source-id: 9dd80658e0c4d9ff0f97a7904cbb164f267fe39f