Commit graph

25 commits

Author SHA1 Message Date
Pieter Noordhuis
8c9f4d8c3b Add throughput information to resnet50_trainer
Summary:
TSIA

Makes it easier for throughput debugging.

Differential Revision: D4879634

fbshipit-source-id: 8d479d51b0ec51ad3d86ad5500fc3095400cf095
2017-04-12 17:46:14 -07:00
Pieter Noordhuis
c907c7c7dc Update resnet50_trainer example
Summary:
A few fixes in this commit: the epoch size is now rounded
down to the nearest integer multiple of the global batch size (batch
per GPU * GPUs per host * hosts per run). The num_shards and shard_id
parameters are now passed to CreateDB so multiple processes actually
train on different subsets of data. The LR step size is scaled by the
number of hosts in the run. The test accuracy is only determined after
each epoch instead of after every so many iterations.
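
For illustration, a minimal sketch of that rounding (plain Python; the numbers and names are made up, not taken from the trainer):

```python
# Illustrative sketch of rounding the epoch size down to an integer
# multiple of the global batch size.
batch_per_gpu = 32
gpus_per_host = 8
hosts_per_run = 4

global_batch_size = batch_per_gpu * gpus_per_host * hosts_per_run
requested_epoch_size = 1281167  # e.g. the ImageNet training set size

# Round down so every epoch is a whole number of global batches.
epoch_size = (requested_epoch_size // global_batch_size) * global_batch_size
iters_per_epoch = epoch_size // global_batch_size
```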

Differential Revision: D4871505

fbshipit-source-id: d2703dc7cf1e0f76710d9d7c09cd362a42fe0598
2017-04-12 14:03:51 -07:00
Pieter Noordhuis
26d301fbe4 Configurable CuDNN workspace limit in resnet50_trainer
Summary: TSIA

Reviewed By: Yangqing, bwasti

Differential Revision: D4835477

fbshipit-source-id: a0083188fe91a56c5f910c7dda46412e38632d7e
2017-04-05 10:50:00 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Yury Zemlyanskiy
0c47d345df Multi-gpu training for OSS seq2seq
Summary:
Use data_parallel_model for seq2seq multi-gpu training. The main reason for complexity here is that GatherOp hasn't yet been implemented on GPU.

This diff also adds a better clipping procedure: clip by global norm rather than by absolute value.
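
For intuition, a rough NumPy sketch of the two clipping schemes (this is not the diff's Caffe2 code):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Scale all gradients by one factor so their joint L2 norm does not
    # exceed max_norm; the gradient direction is preserved.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-6))
    return [g * scale for g in grads]

def clip_by_value(grads, clip):
    # Element-wise clipping to [-clip, clip]; this distorts the gradient
    # direction, which is why clipping by global norm is preferred.
    return [np.clip(g, -clip, clip) for g in grads]
```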

Differential Revision: D4778691

fbshipit-source-id: bff184dae02ecc227413fef51f48a4726e5d3825
2017-03-27 17:32:39 -07:00
James Reed
33f41c06c0 Remove more instances of batch_size
Summary: D4734505 part 2. Remove more instances of the batch_size parameter.

Reviewed By: urikz

Differential Revision: D4736906

fbshipit-source-id: fc9d374e9308017d61c427890364c5ab9cec2edf
2017-03-19 22:31:30 -07:00
Pieter Noordhuis
92101aa87a Update resnet50 example
Summary:
Make it use Gloo and optionally use Redis for rendezvous (where a
shared filesystem is not available).

Differential Revision: D4709943

fbshipit-source-id: 59cc7a14316c7b634417ea5161a75fab3c19f2fa
2017-03-15 08:18:50 -07:00
Deepak Gopinath
a1d63da6af Adding UNK to vocab | Changing default params
Summary: UNK needs to be indexed in the vocabulary for validation to work. Default args now result in the training loss decreasing.

Reviewed By: urikz

Differential Revision: D4703393

fbshipit-source-id: e4d6ad100daf8392f8ba1e502f9ecf39bb8ce24a
2017-03-13 22:17:48 -07:00
Deepak Gopinath
001ac5d751 Fix to use appropriate corpus and vocab in eval
Summary: We should be using the vocabulary built on the training data, and corpus_eval as data for the evaluation phase.

Reviewed By: urikz

Differential Revision: D4700382

fbshipit-source-id: ca1dd043a28f9bb585faad050c82fb12c1cdf6cc
2017-03-13 14:31:27 -07:00
Pieter Noordhuis
6729d81418 Specify which GPUs to use in resnet50 example
Summary:
TSIA

This change also fixes an undefined attribute error after running 20
iterations of the resnet50 example trainer.

Differential Revision: D4692794

fbshipit-source-id: b98efdfeb078c5ba89d2a86837f3c672e1eade5f
2017-03-12 22:33:15 -07:00
Deepak Gopinath
57ecd20197 seq2seq open source implementation
Summary:
OSS implementation of the seq2seq model in Caffe2. The script uses the Seq2SeqModelCaffe2 class to build and run the model. It takes in training data in the form of a text file with one sentence per line, builds a vocabulary, generates batches based on batch size, and runs the net for a configurable number of epochs. It prints the total scalar loss at the end of each epoch.

All FBLearner and neural_mt type system dependencies have been removed. Unimplemented and unnecessary methods have been removed to make the script simpler.
fblearner/flow/projects/langtech/translation/neural_mt/model_util_caffe2.py has been moved to caffe2/caffe2/python/examples/seq2seq_util.py and remains unchanged.

Potential TODOs:
  - Get the model running on GPU. Only GatherOp does not have a corresponding GPU implementation. Try adding CopyGPUToCPU before and CopyCPUToGPU after Gather, and use a CUDA DeviceOption (see the sketch after this list).
  - Add evaluation on test data with suitable metric (perplexity? bleu?)
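
A hedged sketch of the copy-around-Gather idea from the first TODO; this is my reading of the suggestion rather than code from this diff, and all blob names are made up:

```python
from caffe2.python import core
from caffe2.proto import caffe2_pb2

net = core.Net("gather_on_cpu")
gpu = core.DeviceOption(caffe2_pb2.CUDA, 0)
cpu = core.DeviceOption(caffe2_pb2.CPU)

with core.DeviceScope(gpu):
    # Assume 'data' and 'indices' already live on the GPU.
    data_cpu = net.CopyGPUToCPU("data", "data_cpu")
    indices_cpu = net.CopyGPUToCPU("indices", "indices_cpu")
with core.DeviceScope(cpu):
    # Run Gather on the CPU, where it is implemented.
    gathered_cpu = net.Gather([data_cpu, indices_cpu], "gathered_cpu")
with core.DeviceScope(gpu):
    # Copy the result back so downstream GPU ops can consume it.
    gathered = net.CopyCPUToGPU(gathered_cpu, "gathered")
```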

Reviewed By: urikz

Differential Revision: D4653333

fbshipit-source-id: 1c7d970ebc86afe23fad4d48854296bf54eb0f77
2017-03-09 16:18:08 -08:00
Ahmed Taei
4f0e7730a9 Distributed Multi-GPU resnet50
Summary: Use filesystem rendezvous for distributed multi-GPU training.

Differential Revision: D4664945

fbshipit-source-id: 7b6767323e94bc4e7fa25ef3eba65b38abb79341
2017-03-08 11:39:29 -08:00
Alexander Sidorov
95262032d8 Char RNN bug fix for batching
Summary:
It could be that only the first item
in the batch was really used, in the case where the rest of the memory was 0. Or, if
the memory there held a big positive integer, then the whole sequence was used. So whether the rest of the batch was used depended on our luck :)

Reviewed By: Yangqing

Differential Revision: D4599569

fbshipit-source-id: ae89cee796bbcbc232e4abcab71dee360b0d8bc6
2017-02-22 17:34:30 -08:00
Alexander Sidorov
2727317384 char-rnn: add comments
Summary: Just some comments

Reviewed By: pietern

Differential Revision: D4544518

fbshipit-source-id: b517023bf5e9712a2bf96ae15a709553e5ee6032
2017-02-10 12:20:58 -08:00
Alexander Sidorov
98f66fd282 Char-rnn: fix batching
Summary:
Inputs have to be arranged in such a way that the j-th example of
batch i comes right before the j-th example of batch i+1 in the text.
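
A small illustrative sketch (not the tutorial's code) of that layout: split the text into batch_size contiguous streams and take one character from each stream per batch:

```python
def make_batches(text, batch_size):
    # Split the text into batch_size contiguous streams; batch i then takes
    # position i from every stream, so the j-th slot of batch i is followed
    # in the original text by the j-th slot of batch i+1.
    stream_len = len(text) // batch_size
    streams = [
        text[j * stream_len:(j + 1) * stream_len] for j in range(batch_size)
    ]
    return [
        [streams[j][i] for j in range(batch_size)] for i in range(stream_len)
    ]

batches = make_batches("to be or not to be, that is the question", 4)
```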

Reviewed By: urikz

Differential Revision: D4519553

fbshipit-source-id: 9dd80658e0c4d9ff0f97a7904cbb164f267fe39f
2017-02-10 10:07:32 -08:00
Alexander Sidorov
e676f4411b GPU support for RecurrentOp + Char RNN example
Summary: With a batch size of 32 and otherwise default parameters I get 70 iterations per second vs. 40 on CPU. Batching still doesn't produce good loss; I am going to work on this in a separate diff.

Reviewed By: urikz

Differential Revision: D4516566

fbshipit-source-id: d0611534747beb2cd935a8607a283369378e4a6c
2017-02-09 22:54:53 -08:00
Aapo Kyrola
1c7886701e lr_scale to loss_scale
Summary:
As per the discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling the LR is not the same as scaling the loss, since LR scaling will also affect the weight decay (which is implemented by modifying the gradient, which thus is not yet correctly 'averaged'). Actually prigoyal tried to convince me earlier that loss scaling is the way to go, but I was then not convinced :/.
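
To make the weight-decay argument concrete, a tiny numeric sketch (plain Python, not Caffe2 code; the values are made up):

```python
num_gpus = 4
lr, weight_decay = 0.1, 1e-4
w = 1.0
avg_grad = 0.5                      # the "true" averaged data gradient
summed_grad = num_gpus * avg_grad   # what summing across devices yields

# Scale the LR by 1/num_gpus: the weight decay term gets divided by num_gpus too.
step_lr_scaled = (lr / num_gpus) * (summed_grad + weight_decay * w)

# Scale the loss by 1/num_gpus instead: the data gradient is averaged and the
# full weight decay is applied, as intended.
step_loss_scaled = lr * (avg_grad + weight_decay * w)

assert step_lr_scaled != step_loss_scaled  # they differ only in the decay term
```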

So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. But that is not only a bad thing, since it will bring awareness to this change. I will announce this in the FB groups.

In this diff I modified all my models to work correctly.

Reviewed By: Yangqing

Differential Revision: D4507002

fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
2017-02-03 07:44:40 -08:00
Alexander Sidorov
2ce3cfefe1 Char-RNN Tutorial
Summary:
This learns Shakespeare and then generates samples one character at a time. We want this to be an example of using our LSTM and RNNs in general.

Now it takes 4 ms to run the training net with the current parameters (with batch size = 1). I don't have data on how much time each operator takes yet. But the overall Python loop doesn't seem to have much influence: with 1000 fake iterations in run_net, it took 4 s per iteration, as expected.

Future work:

* fixing convergence for batching
* profiling on operator level
* trying it out with GPUs
* benchmarking against existing char-rnn implementations
* stacking LSTMs (one LSTM is different from two; one needs to take care of scoping)

Reviewed By: urikz

Differential Revision: D4430612

fbshipit-source-id: b36644fed9844683f670717d57f8527c25ad285c
2017-02-02 15:44:32 -08:00
Aapo Kyrola
95b3309a87 Gradient Input memory sharing using memonger blob sharing
Summary:
This diff brings us roughly to par with Torch on ResNet memory usage. At batch size 32, ResNet-50 previously took 7497 MiB; after this change, 5010 MiB. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs.

In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed because there are allocations during the first iteration's backward pass due to gradient sharing, and these would otherwise cause NCCL to deadlock.

The sharing of gradient buffers requires inferring which gradients can share memory (i.e. that they are not used concurrently). The previous memonger code uses a topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on ResNet-50, so it is clearly fast enough.
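
For intuition only, a generic last-use/liveness sketch of the sharing idea; this is not the DFS-based algorithm this diff introduces, and it assumes a fixed execution order:

```python
def assign_shared_buffers(ops):
    # ops: list of (inputs, outputs) blob-name tuples in execution order.
    # Two blobs may share a buffer if their live ranges do not overlap.
    last_use = {}
    for t, (inputs, outputs) in enumerate(ops):
        for blob in inputs + outputs:
            last_use[blob] = t

    free, assignment, num_buffers = [], {}, 0
    for t, (inputs, outputs) in enumerate(ops):
        for blob in outputs:
            if blob not in assignment:
                if free:
                    assignment[blob] = free.pop()   # reuse a dead blob's buffer
                else:
                    assignment[blob] = num_buffers  # allocate a new buffer
                    num_buffers += 1
        for blob in set(inputs) | set(outputs):
            if blob in assignment and last_use[blob] == t:
                free.append(assignment[blob])       # blob is dead after this op
    return assignment, num_buffers

# Example: g1 and g3 never coexist, so they end up in the same buffer.
ops = [(["x"], ["g1"]), (["g1"], ["g2"]), (["g2"], ["g3"])]
assignment, n = assign_shared_buffers(ops)
```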

Module data_parallel_model supports this feature natively.

Reviewed By: prigoyal

Differential Revision: D4363209

fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1
2017-01-09 19:44:23 -08:00
Aapo Kyrola
e8dc09064e exhaustive_search=True
Summary: For some reason I had been disabling the exhaustive search heuristic for cuDNN in the xray/resnet trainers. On BigBasin, this gives a 10% perf boost; on BigSur, maybe 5%.
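
As far as I recall, this is wired up through the model's cuDNN arg_scope; treat the exact key names below as an assumption rather than a quote from this diff:

```python
from caffe2.python import model_helper

# Assumed arg_scope keys for the cuDNN-backed conv ops; adjust if the
# actual trainer uses different names.
train_arg_scope = {
    'order': 'NCHW',
    'use_cudnn': True,
    'cudnn_exhaustive_search': True,
}
train_model = model_helper.ModelHelper(name="resnet50", arg_scope=train_arg_scope)
```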

Reviewed By: prigoyal

Differential Revision: D4338654

fbshipit-source-id: 3974dd612f5d4f4dc8b2febccb59664d3f276c3e
2016-12-15 22:59:27 -08:00
Aapo Kyrola
68cfc52452 MomentumSGDUpdate -- version of MomentumSGD with update.
Summary:
It gives a significant perf boost to do the parameter update inside MomentumSGD, instead of with a separate WeightedSum op.
To ensure backwards compatibility, I made it a separate op.
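
Roughly, the fused op folds the WeightedSum-style parameter update into the momentum step; a plain-Python sketch of the idea (not the op's exact semantics):

```python
def momentum_sgd(grad, moment, lr, momentum=0.9):
    # Original pattern: MomentumSGD produces the adjusted gradient ...
    moment_new = momentum * moment + lr * grad
    adjusted_grad = moment_new
    return adjusted_grad, moment_new

def momentum_sgd_update(grad, moment, param, lr, momentum=0.9):
    # ... and a separate WeightedSum applies it to the parameter. The fused
    # op does both in a single pass over the parameters.
    adjusted_grad, moment_new = momentum_sgd(grad, moment, lr, momentum)
    return adjusted_grad, moment_new, param - adjusted_grad
```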

Also added a unit test.

Reviewed By: prigoyal

Differential Revision: D4262446

fbshipit-source-id: 38e7ee6d7677b398658ac7fe9b7a59b569e033f4
2016-12-15 12:01:29 -08:00
Aapo Kyrola
e65eeff665 LMDB example
Summary:
This example writes an LMDB database of (random) image data and labels. Then it reads them back using Caffe2's TensorProtosDBInput and validates that the checksums match. This example shows how to coerce image data into TensorProtos and be happy.

Before this, there was no clear example of how to create databases for Caffe2.
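
A hedged sketch of the writing half of such an example, assuming the standard caffe2_pb2 TensorProtos message and the python lmdb bindings (tensor shapes and the DB name are made up):

```python
import numpy as np
import lmdb
from caffe2.proto import caffe2_pb2

env = lmdb.open("image_db", map_size=1 << 30)
with env.begin(write=True) as txn:
    for i in range(16):
        protos = caffe2_pb2.TensorProtos()

        # Image tensor: CHW float data.
        img = protos.protos.add()
        img.dims.extend([3, 32, 32])
        img.data_type = caffe2_pb2.TensorProto.FLOAT
        img.float_data.extend(np.random.rand(3 * 32 * 32).tolist())

        # Scalar integer label.
        label = protos.protos.add()
        label.data_type = caffe2_pb2.TensorProto.INT32
        label.int32_data.append(int(np.random.randint(10)))

        txn.put("{:08d}".format(i).encode("ascii"), protos.SerializeToString())
env.close()
```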

Differential Revision: D4263614

fbshipit-source-id: 21e08066899095b4efcc2d23dbc3ede81e75914a
2016-12-05 11:53:26 -08:00
Aapo Kyrola
3410939459 pass learning rate scaling factor to parameter update builder function
Summary:
When refactoring data_parallel_model, the division of the LR by the number of devices was dropped, and thus we ended up effectively multiplying the gradients by the number of devices. Thus, we need to scale the LR by 1/numgpus.
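
A tiny numeric sketch of that reasoning (plain Python; the numbers are made up):

```python
num_gpus = 8
per_gpu_grad = 0.25
base_lr = 0.1

# data_parallel_model sums gradients across devices rather than averaging them.
summed_grad = num_gpus * per_gpu_grad

# Scaling the LR by 1/num_gpus restores the single-GPU step size.
assert abs((base_lr / num_gpus) * summed_grad - base_lr * per_gpu_grad) < 1e-12
```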

Created a test to confirm that data_parallel_model produces exactly same results on different number of gpus, given the total batch size.

Reviewed By: prigoyal

Differential Revision: D4248907

fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
2016-12-05 11:53:26 -08:00
Aapo Kyrola
b9f1555b6a remove unused function from resnet50_trainer
Summary: Just noticed that I had duplicate code in the example imagenet trainer. Removed the function.

Differential Revision: D4223070

fbshipit-source-id: 443a9401bf7e425f7a3a13a44c9d0f7e21e72303
2016-11-29 15:18:37 -08:00
Yangqing Jia
589398950f fbsync at f5a877
2016-11-18 15:41:06 -08:00