Summary:
This diff adds eval nets to the layer model helper. It should be useful for
cases where train/eval nets need some extra input (usually some supervision)
for training/evaluation, e.g. various sampled layers.
Differential Revision: D4769453
fbshipit-source-id: 7a8ec7024051eab73b8869ec21e20b5f10fd9acb
Summary:
We should resize the workspace vector only when it needs to grow. Otherwise we end up destroying and recreating workspaces constantly when the sequence length varies.
Modified the lstm_benchmark test to randomize sequence length.
This provides a big perf improvement to the machine translation pipeline. Look at the recurrent network op runtimes below.
WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156] 136.271 ms/iter ( 120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156] 190.074 ms/iter ( 156.828 ms/iter) RecurrentNetworkGradient
WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156] 375.369 ms/iter ( 249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156] 278.892 ms/iter ( 227.29 ms/iter) RecurrentNetworkGradient
With the LSTM benchmark, we get about a 2x speedup.
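A minimal sketch of the grow-only policy (illustrative Python only; the actual change is in the C++ RecurrentNetwork op, and these names are hypothetical):
```
def ensure_workspaces(workspaces, seq_len, create_workspace):
    """Grow the per-timestep workspace list only when seq_len exceeds the
    current capacity; shorter sequences reuse existing workspaces instead
    of destroying and recreating them every iteration."""
    while len(workspaces) < seq_len:
        workspaces.append(create_workspace())
    # Intentionally never shrink: extra workspaces are kept for later reuse.
    return workspaces[:seq_len]
```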
Reviewed By: jamesr66a
Differential Revision: D4789354
fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
Summary: PadImage has no kernel parameters, which results in the pads_ parameters not being set (left at 0). I added a test case too.
Differential Revision: D4785230
fbshipit-source-id: fd475e7c41208e07fa7a363def9a45c6f82cddfe
Summary: This is useful for testing RNN cells.
Reviewed By: dzhulgakov
Differential Revision: D4720641
fbshipit-source-id: baa7df43357ed8af72ede64be3e0a642a40472df
Summary:
Instead of doing gemms in a for-loop (which is not parallelized), it is much better to do the batched matmuls using CUDA 8's new strided-batched version of gemm.
With the MT team's test, we get a 5-10% improvement in overall walltime, so this is a significant improvement:
----
Without batched gemm:
I0328 10:46:48.118605 58068 prof_dag_net.cc:136] 424.757 ms/iter ( 283.878 ms/iter) RecurrentNetwork
I0328 10:46:48.118609 58068 prof_dag_net.cc:136] 352.603 ms/iter ( 265.85 ms/iter) RecurrentNetworkGradient
With batched gemm:
I0328 10:53:48.169996 85617 prof_dag_net.cc:136] 407.438 ms/iter ( 269.564 ms/iter) RecurrentNetwork
I0328 10:53:48.169999 85617 prof_dag_net.cc:136] 322.393 ms/iter ( 287.625 ms/iter) RecurrentNetworkGradient
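Conceptually, the change replaces a loop of independent per-slice gemms with one batched call. A NumPy sketch of the equivalence (the real implementation uses cuBLAS's strided-batched gemm, not NumPy):
```
import numpy as np

batch, m, k, n = 8, 4, 5, 3
A = np.random.randn(batch, m, k).astype(np.float32)
B = np.random.randn(batch, k, n).astype(np.float32)

# For-loop of independent gemms (one call per slice, not parallelized across slices).
looped = np.stack([A[i] @ B[i] for i in range(batch)])

# Single batched matmul over the leading dimension (what strided-batched gemm provides).
batched = np.matmul(A, B)

assert np.allclose(looped, batched, atol=1e-5)
```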
Reviewed By: jamesr66a
Differential Revision: D4788272
fbshipit-source-id: 210e8b94c1e036b6ef0f039ce000d455258651f4
Summary:
This is pretty tricky to explain, but we can just use
backward_links. This way the whole cell would use a blob from the
states_grad tensor instead of having its own blob. This should also
save a bit of memory.
Differential Revision: D4770798
fbshipit-source-id: 673f85b2c2fdf42c47feeaa24d1e2bf086f012f9
Summary: Creates SparseMomentumSGDUpdate, a sparse version of MomentumSGDUpdate, to make that optimization method (via in-place updating operator) compatible with GradientSlices.
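A rough NumPy sketch of the intended semantics: the dense momentum update is applied only to the rows named by the gradient slice's indices (assuming the standard non-Nesterov form; the real op also handles the learning-rate blob and in-place outputs):
```
import numpy as np

def sparse_momentum_sgd_update(param, moment, grad_values, indices, lr, momentum):
    # Update only the sliced rows; all other rows of param/moment stay untouched.
    adjusted = lr * grad_values + momentum * moment[indices]
    moment[indices] = adjusted
    param[indices] -= adjusted
    return param, moment
```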
Differential Revision: D4784973
fbshipit-source-id: e6330f471a4d5f53589a6ac245e38f256ca7f354
Summary:
The `SamplingTrain` layer is a wrapper around another layer that subclasses `SamplingTrainableMixin`. When instantiated in the training context, `SamplingTrain` produces the sparse output of the wrapped layer; the output can be paired with `indices` to create a Map schema. When instantiated in the prediction context, the full output of the wrapped layer is produced.
This is like the SampledFC function in model helper, https://fburl.com/gi9g1awh, with the ability to be instantiated in both the training and prediction contexts.
I'd like to get consensus on whether we should introduce the `SamplingTrain` layer and the accompanying mixin. This can probably be accomplished in some other way, but I think this is not too bad.
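A schematic of the wrapper behaviour (hypothetical names; the real layer builds Caffe2 nets rather than calling the wrapped layer directly):
```
class SamplingTrain:
    """Wraps a layer that supports producing a sampled subset of its output."""

    def __init__(self, wrapped_layer, context):
        self.wrapped = wrapped_layer
        self.context = context

    def forward(self, inputs, indices=None):
        if self.context == "training":
            # Sparse output: only the sampled rows, paired with their indices
            # so the result can be viewed as a Map-like (indices -> values) record.
            values = self.wrapped.sampled_output(inputs, indices)
            return indices, values
        # Prediction context: produce the full output of the wrapped layer.
        return self.wrapped.full_output(inputs)
```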
Reviewed By: xianjiec
Differential Revision: D4689887
fbshipit-source-id: 7be8a52d82f3a09a053378146262df1047ab26a8
Summary:
Use data_parallel_model for seq2seq multi-GPU training. The main reason for the complexity here is that GatherOp hasn't yet been implemented on GPU.
This diff also adds a better clipping procedure: clip by global norm rather than by absolute value.
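Global-norm clipping rescales all gradients by a single factor when their combined L2 norm exceeds the threshold, instead of clamping each element independently. A NumPy sketch of the idea:
```
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # Combined L2 norm across all gradient tensors.
    global_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if global_norm > clip_norm:
        scale = clip_norm / global_norm
        grads = [g * scale for g in grads]
    return grads
```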
Differential Revision: D4778691
fbshipit-source-id: bff184dae02ecc227413fef51f48a4726e5d3825
Summary:
To evaluate from checkpoints, we need to load a model from the checkpoints.
However, the checkpoints store many more blobs than the model needs. This
function enables the model builder to load only the blobs associated with the
model into the workspace. After that, the model builder can evaluate the model
from the populated workspace.
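The idea, in a hedged Python sketch (the helper name and checkpoint representation are illustrative, not the actual builder API):
```
def load_model_blobs_from_checkpoint(model_blob_names, checkpoint_blobs, workspace):
    """Feed only the checkpoint blobs that the model actually references,
    ignoring the many extra blobs (optimizer state, counters, ...) the
    checkpoint also stores."""
    wanted = set(model_blob_names)
    for name, value in checkpoint_blobs.items():
        if name in wanted:
            workspace.FeedBlob(name, value)
```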
Reviewed By: azzolini
Differential Revision: D4751414
fbshipit-source-id: a7a420228d681fc2dcfd8573cf69a97b1abc2ef3
Summary: Currently, we cannot have layer constants because layer params are required to have a gradient and an optimizer. Global constants don't cut it here because each one can only be added once; therefore, a layer that adds any global constant can only be used once.
Differential Revision: D4773212
fbshipit-source-id: 5b60d31f3c1602afb04b61f6d30b8e3e06ed2de3
Summary:
D4690225 added support for nested field name lookup in nested
`schema.Struct`s. It would throw a KeyError when trying to access a nested
`List`'s field. Writing the lookup recursively avoids the need to enumerate
all complex field types in the lookup.
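The recursive idea, sketched on plain nested containers (the real code walks schema.Field objects, not dicts):
```
def lookup(field, dotted_name):
    """Resolve 'a.b.c' by recursing one component at a time, so any field
    type that knows its own children works without being special-cased."""
    head, _, rest = dotted_name.partition('.')
    child = field[head]
    return lookup(child, rest) if rest else child

nested = {'ids': {'values': [1, 2, 3], 'lengths': [3]}}
assert lookup(nested, 'ids.lengths') == [3]
```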
Differential Revision: D4719755
fbshipit-source-id: 37c87a32d730f0f45f72fb20894da3e32f820999
Summary: Creating PackSegments and UnpackSegments GPU operators using GPUFallbackOp for now. The op mainly does copying of blobs, so this is a reasonable solution until we have a native CUDA op.
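What PackSegments computes, as a NumPy reference (the GPU version simply falls back to the CPU logic via GPUFallbackOp; zero padding is assumed here):
```
import numpy as np

def pack_segments(lengths, data):
    """Pack a flat [sum(lengths), ...] tensor into [num_segments, max_len, ...],
    zero-padding segments shorter than the longest one."""
    max_len = int(max(lengths))
    out = np.zeros((len(lengths), max_len) + data.shape[1:], dtype=data.dtype)
    offset = 0
    for i, l in enumerate(lengths):
        out[i, :l] = data[offset:offset + l]
        offset += l
    return out

packed = pack_segments([2, 1], np.arange(6, dtype=np.float32).reshape(3, 2))
# packed.shape == (2, 2, 2); the second segment's second row is zero padding.
```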
Reviewed By: pietern
Differential Revision: D4761589
fbshipit-source-id: dd483b9e34ecb6b53925405e5b4c24859c549606
Summary: Allow drilling down on data throughput overall and per field.
Reviewed By: dzhulgakov
Differential Revision: D4622168
fbshipit-source-id: 1462bb2fac05824fda0c02f4f5f0b8713893e650
Summary:
Use AddNet and AddBlobs to add net and blobs to meta_net_def.
This is a codemod and does not change the functionality.
It is for preparation of the protobuf change.
Depends on: D4770648
Reviewed By: salexspb
Differential Revision: D4771110
fbshipit-source-id: 00cecb2105f2c332bd50c3c51b9a10e1004fa90f
Summary:
This was a nasty one to track down. This was the error message:
```
E0323 14:47:46.138900 2870 context_gpu.h:126] Encountered CUDA error: an illegal memory access was encountered
F0323 14:47:46.139143 2870 operator.h:176] Computation on device returned error in operator
input: "x_gpu_2" output: "loss" name: "" type: "AveragedLoss" device_option { device_type: 1 cuda_gpu_id: 1 }
```
Closes https://github.com/caffe2/caffe2/pull/220
Differential Revision: D4771086
Pulled By: Yangqing
fbshipit-source-id: f2d0f39f1647c84d97d9745f8a0305a389bfbc41
Summary:
Codemod to use a separate function, in preparation for a protobuf change later on.
It does not change the functionality.
Reviewed By: salexspb
Differential Revision: D4770648
fbshipit-source-id: d8090f45d31ffa5ca1dca47297fb7c196f34d8a6
Summary: We accumulate the values of this blob (param_grad) in another special internal blob anyway.
Differential Revision: D4768643
fbshipit-source-id: a9d08b7eafd25f278a8db722f9cdb1d0064b852a
Summary: Apart from copying gradient blobs for inputs with initial_cell_input, we needed to perform a similar operation for external parameters used by the step net
Reviewed By: salexspb
Differential Revision: D4752259
fbshipit-source-id: 13ee48cf583ed86221a4cc1cc9f57f5c3a7d2450
Summary:
currently the output schema and blobs are named "field_i", which is
bad for debugging. This diff allows us to specify output names.
Reviewed By: kennyhorror
Differential Revision: D4744949
fbshipit-source-id: 8ac4d3c75cacbb4c9b5f55793ac969fe1cf20467
Summary:
Add a ConvNd interface for Nd convolution and keep Conv for 2d convolution.
I added _BaseConv to share code between ConvNd and Conv.
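Roughly, the refactor looks like this (purely illustrative class sketch, not the actual helper code):
```
class _BaseConv:
    """Shared argument handling for the convolution helpers."""
    def _common_args(self, kernels, strides, pads):
        return dict(kernels=kernels, strides=strides, pads=pads)

class Conv(_BaseConv):
    def build(self, kernel, stride=1, pad=0):
        # 2d convolution keeps the familiar scalar kernel/stride/pad interface.
        return self._common_args([kernel] * 2, [stride] * 2, [pad] * 4)

class ConvNd(_BaseConv):
    def build(self, kernels, strides, pads):
        # Nd convolution takes explicit per-dimension lists.
        return self._common_args(kernels, strides, pads)
```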
Reviewed By: Yangqing
Differential Revision: D4660822
fbshipit-source-id: 8339421351ce9a36ce5a165f7fa455cfcc61733d
Summary:
This completes the fix that viswanathgs started in an earlier diff but did not
cover the full Caffe convention. It should have proper guards for all the stuff
that Caffe implies, either supporting it or throwing an explicit exception.
Reviewed By: viswanathgs
Differential Revision: D4751751
fbshipit-source-id: 474e921c33840cff333a631b7b19f881b39ebccd
Summary: This didn't work for the reason specified in the comments. Also some cleanup in the unit tests; inference now uses a custom workspace to run the cell net on.
Reviewed By: urikz
Differential Revision: D4742670
fbshipit-source-id: 04165c029fddec5ae31b20b207faf06d2fa20816
Summary:
aaronmarkham this solves your Windows build issue. Basically:
(1) VS 2017 does not have CUDA support yet, and we will be waiting on NVidia to do so.
(2) VS 2015 and 2017 need different cmake generator strings.
This PR shows how to determine those and also updates appveyor to do contbuild guard for the following 3 settings:
- VS2015 without cuda
- VS2017 without cuda
- VS2015 with cuda
Closes https://github.com/caffe2/caffe2/pull/210
Differential Revision: D4745007
Pulled By: Yangqing
fbshipit-source-id: 50952552843abd0eb6f4145d9f132daeee3a6794
Summary: Created `BatchDistillLRLoss` layer and added support for it in DPer2.
Differential Revision: D4718333
fbshipit-source-id: b873954ea704daafed94ac65fef47a20d56858e2
Summary: D4734505 part 2. Remove more instances of the batch_size parameter
Reviewed By: urikz
Differential Revision: D4736906
fbshipit-source-id: fc9d374e9308017d61c427890364c5ab9cec2edf
Summary: Reshape based on tensor shapes in the graph rather than based on a passed-in batch_size parameter
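The pattern, in a hedged Caffe2 Python sketch: derive the target shape from a tensor already in the graph (via the Shape op) and feed it to Reshape as its second input, instead of baking a batch_size constant into the net. Blob names here are illustrative:
```
from caffe2.python import core

net = core.Net("reshape_from_graph")
# Shape of a reference blob, computed at run time rather than passed in.
ref_shape = net.Shape(["encoder_outputs"], "ref_shape")
# Reshape accepts the new shape as a second input; no batch_size argument needed.
reshaped, old_shape = net.Reshape(
    ["attention_weights", ref_shape], ["attention_reshaped", "old_shape"]
)
```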
Reviewed By: urikz
Differential Revision: D4734505
fbshipit-source-id: d9c23d85be84f61124106e752ef2b4f6945e2a07
Summary: we don't use this one any more except in a few tests
Reviewed By: urikz
Differential Revision: D4731401
fbshipit-source-id: c5c28b7594e3251f501fc28455dfc9bd2093a836
Summary: Adding synchronous optimization on GPUs to the translation training pipeline, via data_parallel_model.Parallelize_GPU, which needs to be updated so there is some way of performing sparse parameter updates (e.g., on embedding tables), whether on GPU or CPU.
Reviewed By: urikz
Differential Revision: D4631914
fbshipit-source-id: 9cdd655f7dbda3f9b2733d459228b3e097892441
Summary: This adds a nearest neighbor interpolation resizing operator to caffe2. CPU only, NCHW only, no gradients. Also adds torch2caffe support. This is probably not optimal in terms of performance, but it works.
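A NumPy reference of nearest-neighbor resizing in NCHW layout (reference only; the operator itself is the C++ CPU implementation):
```
import numpy as np

def resize_nearest_nchw(x, height_scale, width_scale):
    n, c, h, w = x.shape
    out_h, out_w = int(h * height_scale), int(w * width_scale)
    # Map each output row/column back to its nearest source index.
    rows = np.minimum((np.arange(out_h) / height_scale).astype(int), h - 1)
    cols = np.minimum((np.arange(out_w) / width_scale).astype(int), w - 1)
    return x[:, :, rows[:, None], cols[None, :]]

x = np.random.randn(1, 3, 4, 4).astype(np.float32)
y = resize_nearest_nchw(x, 2.0, 2.0)  # -> shape (1, 3, 8, 8)
```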
Reviewed By: ajtulloch
Differential Revision: D4724244
fbshipit-source-id: b8295061141fb513da84acf91fdfd67264119059
Summary:
1. migrate the basic mtml model to dper 2
2. test dper 2 mtml model
3. test all optimizers
Reviewed By: kittipatv
Differential Revision: D4680215
fbshipit-source-id: 7aac5c59bdac22fcad8ed869b98e9e62dca1d337
Summary: A layer that takes a (label, prediction) pair and outputs the L2 loss.
Reviewed By: kittipatv
Differential Revision: D4702111
fbshipit-source-id: 09f2ede44d1b548e61096de741f1b2aa0b66bbcb