Summary:
Add a pointwise `IsMemberOf` operator to Caffe2.
The original idea was to call it `In`, but I think that is not as clear.
I used `UnaryElementwiseWithArgsOp` at some point, but it made the code a bit harder to read without adding any functionality.
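A minimal numpy sketch of the pointwise membership semantics (the function name and value-list argument here are illustrative, not the op's actual schema):

```python
import numpy as np

def is_member_of(x, values):
    # Pointwise membership test: out[i] is True iff x[i] appears in `values`.
    # A sketch of the semantics only; the real op runs elementwise in C++.
    value_set = set(values)
    out = np.array([v in value_set for v in x.flat], dtype=bool)
    return out.reshape(x.shape)

X = np.array([[0, 3], [5, 7]])
mask = is_member_of(X, [3, 5])
assert mask.tolist() == [[False, True], [True, False]]
```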
Reviewed By: ender-wieczorek
Differential Revision: D4912655
fbshipit-source-id: 716b66bb51468dd59db5f76f23d78cda85961b58
Summary:
Two new operators to pack and unpack a dataset, so that we can
re-use other operators that do not understand the schema format. The immediate
use case is to combine them with a partition operator.
Packing works by splitting the input into separate tensors, putting them in a
vector and wrapping that in a shared_ptr (as opposed to a unique_ptr, so we can
copy it).
Unpack takes the packed input and concatenates it back into the original.
I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating on all the arrays and provides short,
single-purpose functions that (at least for me) are easier to understand.
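The pack/unpack round trip can be sketched as follows; a Python list stands in for the shared_ptr-wrapped vector of tensors, and the per-row splitting granularity is an assumption:

```python
import numpy as np

def pack(tensor):
    # Split the input into per-row tensors held in a plain list (standing in
    # for the shared_ptr<vector<Tensor>>, which is copyable unlike unique_ptr).
    return [row.copy() for row in tensor]

def unpack(packed):
    # Concatenate the packed rows back into the original tensor.
    return np.stack(packed)

data = np.arange(6).reshape(3, 2)
assert (unpack(pack(data)) == data).all()
```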
Reviewed By: dzhulgakov
Differential Revision: D4918002
fbshipit-source-id: ecbf9196ed25e886a94383961176b8c84dde2d2f
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If it is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward-only mode and, instead of creating a workspace for each step, cycles through a single private workspace.
Note: we could avoid a lot of the work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that nicely requires more refactoring than I wanted to do now. Thus we still create the backward step nets etc., but just don't pass them to the op.
This can be used to create more efficient inference models. You can also sanitize existing inference nets by removing the backward_step_net argument to get the same benefits.
Reviewed By: salexspb
Differential Revision: D4916482
fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
Summary:
Rename ModelHelperBase to ModelHelper.
This is the result of running:
find . -type f -exec sed -i 's/ModelHelperBase/ModelHelper/g' {} +
fbgs for ModelHelperBase gave 19 results; there are 20 instances here because I added one test in model_helpers_test.py.
Reviewed By: salexspb
Differential Revision: D4928337
fbshipit-source-id: bc4c12b60b90c167e717de50ea9fe17521e142e3
Summary:
This was getting too messy again, so I am cleaning it up even more. One thing I added here: the input sequence is no longer generated randomly. Ideally we would do this for all other inputs as well; random inputs were reported to be an issue when hypothesis finds bad examples, because it can make the test run very long.
I also tuned the ranges a bit so the test finishes faster. On my devgpu, the whole test took 600 seconds before and now takes 39 seconds.
One more important thing: we want to test all combinations of the settings in the for loop, while hypothesis only provides random tensor inputs.
Differential Revision: D4902956
fbshipit-source-id: ceb02d6761406b3192101d3b255abe90b2866770
Summary:
CUDA version of PRelu and its gradient. The forward pass is straightforward; the backward pass requires a reduction over the weights.
tsaizhenling, please patch this and test.
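The forward/backward behavior can be sketched in numpy (scalar-slope case; function names are illustrative). The `da` line is the reduction over the weight that makes the CUDA backward pass non-trivial:

```python
import numpy as np

def prelu_forward(x, a):
    # y = x for x > 0, a * x otherwise.
    return np.where(x > 0, x, a * x)

def prelu_backward(x, a, dy):
    dx = np.where(x > 0, dy, a * dy)
    # Weight gradient: a reduction over every element where x <= 0.
    da = np.sum(dy * np.where(x > 0, 0.0, x))
    return dx, da

x = np.array([-2.0, -1.0, 3.0])
assert np.allclose(prelu_forward(x, 0.1), [-0.2, -0.1, 3.0])
```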
Differential Revision: D4931630
fbshipit-source-id: 1238e7d536e41480713865ced91aaef88f4feef5
Summary:
Simple FindOp for CPU and GPU, which searches for a list of unordered needles in an unordered index. The CPU version might be faster if we first sorted the index/needles, but we can come back to that later.
The CUDA op is also somewhat brute-force, but quite parallel. Since the index and the queries are smallish, at least in the use case currently in mind (the Machine Translation team's word-candidate search), I think this is a sufficient start.
Note that this is much simpler than the Index class of ops, which also allows modifying the index, etc. Since CUDA ops are more complex to implement for the full Index functionality, I decided to make a separate op with this very simple functionality.
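The brute-force semantics can be sketched as follows (the missing-value convention is an assumption):

```python
import numpy as np

def find(index, needles, missing_value=-1):
    # For each needle, return its position in `index`, or `missing_value`
    # if it is absent. A hash map stands in for the brute-force scan.
    positions = {v: i for i, v in enumerate(index)}
    return np.array([positions.get(n, missing_value) for n in needles])

assert (find([10, 20, 30], [30, 5, 10]) == [2, -1, 0]).all()
```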
Differential Revision: D4910131
fbshipit-source-id: 6df35c9e3c71d5392a500d5b98fd708ab0c8e587
Summary: Work in progress for improving the performance of the TransposeOp on CPU. This is used extensively for inference in several neural MT systems, so optimizing this function is worthwhile and will reduce request latency.
Differential Revision: D4913075
fbshipit-source-id: fa2742829291d91f3eba00fdfe7d6c0dae83e206
Summary: This is needed for the completeness of random negative sampling. When the pool size is 0, we want to generate an empty indices tensor.
Reviewed By: xianjiec
Differential Revision: D4906866
fbshipit-source-id: 75d66a92d15d60bb37bcd1075d324f28069c4fa0
Summary:
Due to the massive dependencies I did not update the version number - under
the same major version number (2017) the API is compatible, so there is no need to
rebuild all the dependencies.
This will unblock the Caffe2 Intel pull request on MKLDNN.
Differential Revision: D4906463
fbshipit-source-id: 0f74436ac3a05605e35b8b649c3e8b5c1c69b500
Summary: unit test using hypothesis for unmask operator
Reviewed By: ender-wieczorek
Differential Revision: D4904075
fbshipit-source-id: 874d3756ec703ab2cc82f24f7160b4254bf791f1
Summary: This will be used to generate random indices input to `Gather`
Reviewed By: xianjiec
Differential Revision: D4904591
fbshipit-source-id: 8d858631e3d640be2cec12f1566cbf195e6aad4b
Summary:
Two new operators to pack and unpack a dataset, so that we can
re-use other operators that do not understand the schema format. The immediate
use case is to combine them with a partition operator.
Packing works by splitting the input into separate tensors, putting them in a
vector and wrapping that in a shared_ptr (as opposed to a unique_ptr, so we can
copy it).
Unpack takes the packed input and concatenates it back into the original.
I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating on all the arrays and provides short,
single-purpose functions that (at least for me) are easier to understand.
Reviewed By: dzhulgakov
Differential Revision: D4870606
fbshipit-source-id: dc29428de5c96cc3039af2885d9e4b026d9f482d
Summary: This is a nicer way to re-use RNN layers for both training and inference.
Reviewed By: salexspb
Differential Revision: D4825894
fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
Summary:
This is from discussion with dzhulgakov : as a step towards revisiting the
core.Net autonaming, we will first guard against accidental overwrites of
existing networks in the workspace.
ajtulloch since we are doing Predictors in mobile, this should be safe right?
azzolini - I assume this would be safe, but would love to get your approval.
akyrola - would this hurt xray?
Reviewed By: dzhulgakov
Differential Revision: D4897725
fbshipit-source-id: aa41271927ad6671f07a53b9505283623f8c49e5
Summary:
Added the possibility to pass 'tiles' and 'axis' as inputs
to the Tile operator, as opposed to arguments. If provided, the input
values override the argument values.
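The override behavior can be sketched as follows (Tile semantics assumed: the whole tensor is repeated `tiles` times along `axis`, and the parameter names here are illustrative):

```python
import numpy as np

def tile(x, tiles_arg=1, axis_arg=0, tiles_input=None, axis_input=None):
    # Input blobs, when present, override the operator arguments.
    tiles = int(tiles_input) if tiles_input is not None else tiles_arg
    axis = int(axis_input) if axis_input is not None else axis_arg
    # Repeat the whole tensor `tiles` times along `axis`.
    return np.concatenate([x] * tiles, axis=axis)

x = np.array([[1, 2]])
# tiles_input=2 wins over tiles_arg=3.
assert (tile(x, tiles_arg=3, tiles_input=2) == [[1, 2], [1, 2]]).all()
```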
Differential Revision: D4794432
fbshipit-source-id: a7e38f4f925a4cedf530924bd426c3bb08b5aad8
Summary:
Implement a new op, ElementwiseLinear.
Given an input X of size (N x D), a of size D and b of size D,
the op computes Y of size (N x D), where Y_{nd} = X_{nd} * a_d + b_d.
Typically this op is followed by the SigmoidCrossEntropyWithLogits op for multi-label classification problems.
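The formula above is plain row-wise broadcasting:

```python
import numpy as np

# Y[n, d] = X[n, d] * a[d] + b[d], via numpy broadcasting over the rows.
def elementwise_linear(X, a, b):
    return X * a + b

X = np.ones((2, 3))
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 0.5, 1.0])
assert np.allclose(elementwise_linear(X, a, b), [[1.0, 2.5, 4.0]] * 2)
```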
Differential Revision: D4892220
fbshipit-source-id: 77bffc5fbe03d48b3d83ab785f7c24a71c952aec
Summary:
This allows us to do in-place relu and also corrects the previous
inconsistency between the cudnn impl and the non-cudnn impl.
This implementation butchers the cudnn interface, in the sense that we pass
in the output instead of the input for the gradient pass. We do have a
gradient checker to guard this situation, so we should be safe.
Reviewed By: asaadaldien
Differential Revision: D4889426
fbshipit-source-id: 081f8fe06de78413b5786086bfd5ae6c8128cd6e
Summary: Add an option to bias the forget gate one way or another by adding in some float value before the sigmoid is applied.
Differential Revision: D4880712
fbshipit-source-id: 1306a97c29fb31630838b2f96597a46e952d940a
Summary:
CopyCPUToGPU and CopyGPUToCPU need to handle gradients that arrive sparse. Added a unit test and fixed the gradient makers to create copies for both values and indices.
This becomes less important once the GPU sparse parameter update ops land, but it is nevertheless good to fix.
Reviewed By: dzhulgakov
Differential Revision: D4882327
fbshipit-source-id: aafd2df46b3e1bcb30b52b1edf40fad8271f1f88
Summary:
These GPU paths are probably even buggier than the CPU paths for sparse gradients with duplicate indices. Both paths cause multiple momentum updates in a single iteration, but only the GPU path is non-deterministic. Depending on how we decide to address the issues on the CPU path, pooyadavoodi has a good idea for how to match dense behavior with the sparse GPU ops.
Closes https://github.com/caffe2/caffe2/pull/254
Reviewed By: bwasti
Differential Revision: D4871680
Pulled By: dzhulgakov
fbshipit-source-id: 220be57a0f699a22ea85ed4f7022d92d362d06b3
Summary: making the name a bit clearer
Reviewed By: xianjiec
Differential Revision: D4866940
fbshipit-source-id: 3e0f7067a9d3ba89cb038d85c1991e541f1e439c
Summary:
A length-aware gather operator. This will be used for random negative sampling; see the task for details.
It should be equivalent to:
LengthsToRange + Gather + Reshape + GatherRanges
which is pretty complicated.
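A numpy sketch of the intended semantics, with input names assumed: ITEMS is the flat item list, LENGTHS gives each row's item count, and INDICES selects which rows' items to concatenate.

```python
import numpy as np

def lengths_gather(items, lengths, indices):
    # Row i owns items[offsets[i] : offsets[i] + lengths[i]];
    # the output concatenates the items of the selected rows in order.
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    return np.concatenate(
        [items[offsets[i]:offsets[i] + lengths[i]] for i in indices])

items = np.array([1, 2, 3, 4, 5, 6])
lengths = np.array([2, 1, 3])  # rows: [1,2], [3], [4,5,6]
assert (lengths_gather(items, lengths, [2, 0]) == [4, 5, 6, 1, 2]).all()
```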
Differential Revision: D4846023
fbshipit-source-id: 8d9b7ff3eddc75a7ab147cd1c2a12f377652df93
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so that they do not need to be stored in the step workspaces. This is done by modifying the backward step to automatically include all operators needed to produce the outputs that are to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace. Making it a class field would not work, since the lifecycle of the blob does not match the lifecycle of the operator.
For basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.
For basic LSTM, the memory saving is wonderful: each forward workspace only holds 4 bytes (for the timestep).
I also modified the neural_mt LSTM cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.
Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.
Reviewed By: urikz
Differential Revision: D4853890
fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
Summary:
Add the necessary ops for feature processing:
* logit op
* replace-NaN op
* batch one-hot op
Reviewed By: kittipatv
Differential Revision: D4840869
fbshipit-source-id: 197123ea5608d54f0b5ac7899973a077a6a86775
Summary:
Added SumSqrElements, since it lets us avoid the large temporary blob that is needed when doing Sqr + SumElements.
Also moved it to reduction_ops, because utility_ops has grown too big.
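The fusion is just the following equivalence; the fused op computes the sum in one pass instead of materializing the squared tensor:

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)
fused = np.sum(x * x)            # what SumSqrElements computes in one pass
two_step = np.sum(np.square(x))  # Sqr + SumElements needs a temp blob
assert np.isclose(fused, two_step)
```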
Reviewed By: jamesr66a
Differential Revision: D4844172
fbshipit-source-id: 032eec45e24d6724f0d5fb83f4ec1c771d1146e5
Summary:
The PiecewiseLinearTransform op passes the transform parameters (bounds, slopes, intercepts) via operator args. This diff adds support for passing these parameters through input blobs instead.
The purpose is to allow us to create a model calibration net that can be exported when saving the model.
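The transform itself can be sketched as follows, regardless of whether bounds/slopes/intercepts arrive as args or as input blobs (exact boundary handling is an assumption):

```python
import numpy as np

def piecewise_linear(x, bounds, slopes, intercepts):
    # For x in [bounds[i], bounds[i+1]): y = slopes[i] * x + intercepts[i].
    # Out-of-range inputs are clamped to the first/last piece.
    i = np.clip(np.searchsorted(bounds, x, side='right') - 1,
                0, len(slopes) - 1)
    return slopes[i] * x + intercepts[i]

bounds = np.array([0.0, 1.0, 2.0])
slopes = np.array([1.0, 2.0])
intercepts = np.array([0.0, -1.0])
assert piecewise_linear(1.5, bounds, slopes, intercepts) == 2.0
```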
Reviewed By: dragonxlwang
Differential Revision: D4777086
fbshipit-source-id: 0d157154860f61ec6ecfab95aea80beed54aa5c6
Summary: This is like LengthsToSegmentIds + Gather without the intermediate segment-IDs blob. I only realized that after I wrote the whole thing. That combination is not obvious, so let's just check this in.
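A numpy sketch of the fused behavior, assuming the combination repeats row i of DATA LENGTHS[i] times (LengthsToSegmentIds would produce the intermediate segment-ID blob, e.g. [0, 0, 1] for lengths [2, 1], which Gather then consumes):

```python
import numpy as np

def lengths_repeat(data, lengths):
    # Row i is emitted lengths[i] times, with no intermediate ID blob.
    return np.repeat(data, lengths, axis=0)

data = np.array([[1, 2], [3, 4]])
assert (lengths_repeat(data, [2, 1]) == [[1, 2], [1, 2], [3, 4]]).all()
```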
Reviewed By: xianjiec
Differential Revision: D4847591
fbshipit-source-id: a1c480f16b317763866af13c83b3aaaeb6a60751
Summary:
1. CPU/GPU implementation of SumReduceLikeOp:
[SRLOp](matrix A, matrix B) -> C
where C has the same shape as B, and each of its values is the reduce-sum of the corresponding A elements.
2. Make SumReduceLikeOp (part of) the gradient of Add/Mul/Sub and provide unit tests.
===Update for Translation Team===
3. Passed Tests:
$ buck test caffe2/caffe2/python/operator_test:recurrent_network_test
$ buck test fblearner/flow/tests/langtech/translation/neural_mt:seq2seq_model_caffe2
$ buck test fblearner/flow/tests/langtech/translation/neural_mt:seq2seq_ensemble_beam_model_caffe2
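The reduction in item 1 can be sketched in numpy (axis handling is an assumption, based on the standard broadcast-gradient reduction pattern):

```python
import numpy as np

def sum_reduce_like(A, B):
    # Reduce-sum A down to the shape of B: the reduction needed when the
    # gradient of a broadcast Add/Mul/Sub must be accumulated back to the
    # smaller operand's shape.
    out = A
    while out.ndim > B.ndim:          # sum away leading axes
        out = out.sum(axis=0)
    for axis, dim in enumerate(B.shape):
        if dim == 1 and out.shape[axis] != 1:  # sum away broadcast axes
            out = out.sum(axis=axis, keepdims=True)
    return out

A = np.ones((4, 3))   # e.g. gradient of a broadcast Add
B = np.zeros(3)       # bias-shaped operand
assert (sum_reduce_like(A, B) == [4.0, 4.0, 4.0]).all()
```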
Reviewed By: Yangqing
Differential Revision: D4711302
fbshipit-source-id: 0865abde871b3046b367599731593dae03f0775a
Summary: Put the size of the input tensor vector into the output blob
Reviewed By: xianjiec
Differential Revision: D4849556
fbshipit-source-id: 0929319e1705b027874d41a90a9159b335d93545
Summary: When only_loss=True is enabled, the softmax output buffer is shared with the gradient buffer (which is of the same size). Added tests for this. GPU version only for now.
Reviewed By: salexspb
Differential Revision: D4843991
fbshipit-source-id: 834d2a1b357d784e4d64efe484f893442201ad6a
Summary: Added support for the axis argument in the cudnn version of softmax, and added cudnn tests to softmax_ops_test.
Reviewed By: urikz
Differential Revision: D4835409
fbshipit-source-id: 9150b969237e38daebff961fee3c36759f834ac4
Summary: NanCheck is an in-place operator for GPU that checks the input for any NaN or inf values. The operator fails and prints diagnostic information (input tensor dims and values) if it detects these erroneous values. This should help us to narrow down our numerical instability issues in the NMT models, and it might help others as well.
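A sketch of the op's contract (pass-through on clean inputs, loud failure with the tensor's dims otherwise; the error format is illustrative):

```python
import numpy as np

def nan_check(x):
    # Identity op that fails loudly if the input contains NaN or inf.
    if not np.isfinite(x).all():
        raise ValueError(
            "NanCheck failed: tensor of shape %s contains NaN/inf" % (x.shape,))
    return x

assert (nan_check(np.array([1.0, 2.0])) == [1.0, 2.0]).all()
```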
Differential Revision: D4818141
fbshipit-source-id: e5aa9762089c58ce160270446007c7a91a7a85e5
Summary:
Following jamesr66a's brilliant observation, this diff fixes the non-CUDNN versions of Softmax. The op did not take into account that blocks can run in parallel, and thus could overwrite each others values, particularly the "row max" that is important for numerical stability
So in this diff:
1) SoftmaxOp now shares all the code with SoftmaxWithLoss, that had better implementation
+ Strengthen the test case and renaming of file.
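For reference, the serial computation the parallel kernels must reproduce, including the per-row max subtraction that the races were corrupting:

```python
import numpy as np

def softmax(x):
    # Subtract each row's max before exponentiating; this is the value a
    # racing block must not clobber, and it is what keeps exp() in range.
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1000.0, 1001.0]])  # would overflow exp() without the shift
assert np.allclose(softmax(x).sum(axis=1), 1.0)
```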
Reviewed By: jamesr66a
Differential Revision: D4832929
fbshipit-source-id: 4a1bfa2106ceb65ec75f5b868323ee1e7a3457fb
Summary:
Two new features for RecurrentNetwork:
1. Ability to specify longer (for a few steps) initial state
2. Ability to link more than one step of external blob to internal one.
Some motivation for these changes is provided in the unit test
Reviewed By: salexspb
Differential Revision: D4816230
fbshipit-source-id: 5ae6fed53b3b08a6ce4547ff1d0cb773dab42af0
Summary: The PadImage op supports cropping along the H/W dimensions by using negative pads; but currently passing negative values for pad attributes throws an error in ConvPoolOpBase, which PadImage inherits from. Modify ConvPoolOpBase to accept negative pad values for non-conv, non-pool ops. Also add a python operator test for cropping
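The negative-pad-as-crop semantics, sketched in 1-D for one spatial axis (the mixed-sign handling is an assumption):

```python
import numpy as np

def pad_or_crop(x, pad_left, pad_right):
    # Non-negative pads behave as usual; a negative pad crops that side.
    if pad_left >= 0 and pad_right >= 0:
        return np.pad(x, (pad_left, pad_right))
    lo = -pad_left if pad_left < 0 else 0
    hi = len(x) + pad_right if pad_right < 0 else len(x)
    out = x[lo:hi]
    # Mixed signs: crop one side, then pad the other.
    return np.pad(out, (max(pad_left, 0), max(pad_right, 0)))

assert (pad_or_crop(np.array([1, 2, 3, 4]), -1, -1) == [2, 3]).all()
```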
Reviewed By: ajtulloch
Differential Revision: D4817118
fbshipit-source-id: 5ea5203e8072cc34fe14938e534b157d0ad55f6b
Summary:
Uses the cudnnTransformTensor function. It works by shuffling the strides according to the transpose axes. Significant speedup over the current GPU version.
Also moves the transpose test under utility_ops, because hypothesis_test is too big.
Reviewed By: jamesr66a
Differential Revision: D4810993
fbshipit-source-id: 82577c4ced1389e70bd5992820ae4d8297a3817f
Summary:
This is an initial (read: unoptimized) implementation of GatherOp on GPU.
Closes https://github.com/caffe2/caffe2/pull/209
Differential Revision: D4809676
Pulled By: Yangqing
fbshipit-source-id: bc36fa02e9964370ca845e9cc13344e5f3dbf176
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This speeds things up significantly: in a quick test with lstm_benchmark and nvprof, the total time of RowMaxKernel dropped from 1.2s to 0.28s.
Also added SoftmaxWithLoss to lstm_benchmark.
Reviewed By: jamesr66a
Differential Revision: D4800629
fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5