Summary:
RFC. This is a naive implementation of a Rebatching Queue for the MultiTask effort. Full disclaimer: I'm very new to Caffe/machine learning and I'm doing dodgy science here (under Dmytro's supervision), so please be extra tough on this review so I can learn best practices :)
Differential Revision: D4871970
fbshipit-source-id: 924820ef0fce45b5e2bdabeec9885cbafa23a880
Summary:
Implement NormalizeOp for GPU using CUDA, and rewrite the gradient to be a function of the output
so it is more efficient, especially for the CUDA implementation.
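For intuition, a minimal numpy sketch of L2 normalization with the gradient written in terms of the output (function names are illustrative, not the actual Caffe2/CUDA code):

```python
import numpy as np

def normalize(x, axis=-1):
    # Forward: L2-normalize x along `axis`; also return the norm,
    # which the backward pass still needs.
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / norm, norm

def normalize_grad(gy, y, norm, axis=-1):
    # Gradient expressed through the output y:
    #   gx = (gy - y * <y, gy>) / ||x||
    dot = np.sum(gy * y, axis=axis, keepdims=True)
    return (gy - y * dot) / norm

np.random.seed(0)
x = np.random.randn(4, 8)
gy = np.random.randn(4, 8)
y, norm = normalize(x)
gx = normalize_grad(gy, y, norm)

# Spot-check one coordinate against a central finite difference.
eps = 1e-6
xp = x.copy(); xp[0, 0] += eps
xm = x.copy(); xm[0, 0] -= eps
num = (np.sum(normalize(xp)[0] * gy) - np.sum(normalize(xm)[0] * gy)) / (2 * eps)
assert abs(num - gx[0, 0]) < 1e-4
```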
Reviewed By: akyrola
Differential Revision: D4971300
fbshipit-source-id: e0ab66462000988aaf1f26010ea550533d107167
Summary: Only a CPU impl is available at the moment. Wrote simple CUDA kernels.
Reviewed By: akyrola
Differential Revision: D4577736
fbshipit-source-id: c2540aa9d332fcdeac46cc7f89aab164d107d7a8
Summary: Implement CPU and GPU gradient for Leaky ReLU op.
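A numpy sketch of the Leaky ReLU forward and gradient semantics (illustrative only, not the actual op code):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Forward: identity for non-negative inputs, alpha * x otherwise.
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(gy, x, alpha=0.01):
    # Backward: gradient is 1 where x >= 0 and alpha elsewhere.
    return gy * np.where(x >= 0, 1.0, alpha)

x = np.array([-2.0, 0.0, 3.0])
y = leaky_relu(x, alpha=0.1)
gx = leaky_relu_grad(np.ones_like(x), x, alpha=0.1)
```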
Differential Revision: D4943905
fbshipit-source-id: 541f13cd5f274a18b69ecf1362722b1bc0105ad9
Summary:
Instance norm failed grad check in some cases that needed a smaller step size. Decreased the step size, but also increased the threshold slightly.
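The kind of check involved, sketched as a central finite difference on a toy function (the function, step size, and threshold here are illustrative, not the actual Caffe2 gradient checker):

```python
import numpy as np

def numeric_grad(f, x, i, step=1e-4):
    # Central-difference estimate of df/dx_i. Too large a step loses
    # accuracy on curved functions; too small amplifies rounding error,
    # which is why step size and pass/fail threshold must be tuned together.
    xp = x.copy(); xp[i] += step
    xm = x.copy(); xm[i] -= step
    return (f(xp) - f(xm)) / (2 * step)

f = lambda v: np.sum(v ** 3)
x = np.array([0.5, -0.2, 1.5])
analytic = 3 * x ** 2
for i in range(len(x)):
    assert abs(numeric_grad(f, x, i) - analytic[i]) < 1e-3  # threshold
```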
Related diff: D4627379
Reviewed By: kennyhorror
Differential Revision: D4941827
fbshipit-source-id: d6f565340da92af40bfee90627960a3356c69412
Summary:
This is a naive layering approach until we have a better one. It could be C++-based and support diagonal execution. Not integrating into the main LSTM API yet as this might be revised a bit. I would like to land this so we can compare against the current implementation in the benchmark, and also use it as an example of how LSTMs could be combined (as some folks are doing similar things with some variations).
Later we can make LSTM() support the API of layered_LSTM(), and also change it under the hood so that it stacks cells into a bigger cell instead. This way, if we make the RNN op use a kind of DAG net, the RNN op can provide more parallelism in stacked cells.
Reviewed By: urikz
Differential Revision: D4936015
fbshipit-source-id: b1e25f12d985dda582f0c67d9a02508027e5497f
Summary:
This is useful when the data has standalone sequences which are
not connected to each other by any meaningful context.
Reviewed By: yqwangustc
Differential Revision: D4835164
fbshipit-source-id: f95626acc26acc3eba3bca7efb08ed1dbdb36c83
Summary:
ScaleGradient is a helper operator that does no actual numerical computation;
in the gradient computation phase it scales the gradient that is
backpropagated through it.
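A tiny sketch of the semantics (illustrative class, not the Caffe2 operator itself):

```python
import numpy as np

class ScaleGradient:
    """Identity in the forward pass; multiplies the incoming
    gradient by `scale` in the backward pass."""
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        # No numerical computation: pass the input through unchanged.
        return x

    def backward(self, gy):
        # Scale the gradient flowing back through this op.
        return gy * self.scale

op = ScaleGradient(0.1)
x = np.ones(3)
y = op.forward(x)
gx = op.backward(np.ones(3))
```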
Differential Revision: D4920719
fbshipit-source-id: 0e1e0888f79594be874fdbdda5ccef7389064c50
Summary:
LengthsTile fans rows out from 1 to multiple; the gradient op is simply the reverse,
adding the fanned-out rows of gradients back together into one.
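A numpy sketch of the forward op and its gradient (names are illustrative):

```python
import numpy as np

def lengths_tile(data, lengths):
    # Forward: row i of `data` is repeated lengths[i] times.
    return np.repeat(data, lengths, axis=0)

def lengths_tile_grad(gy, lengths):
    # Backward: sum each fanned-out group of gradient rows back into one.
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    return np.stack([gy[offsets[i]:offsets[i + 1]].sum(axis=0)
                     for i in range(len(lengths))])

data = np.array([[1.0, 2.0], [3.0, 4.0]])
lengths = np.array([2, 3])
tiled = lengths_tile(data, lengths)              # 5 rows
gx = lengths_tile_grad(np.ones((5, 2)), lengths)
# Each input row receives lengths[i] copies of the upstream gradient.
```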
Reviewed By: kittipatv
Differential Revision: D4943375
fbshipit-source-id: deae9984e849974a0d484a10b94efdb1d30941cc
Summary:
Added optional support for sharing activation blobs as well. Making this change revealed a non-optimal implementation in the blob sharing: we need to prefer reusing free blobs that are already shared by many other blobs; otherwise memory usage can increase as the pool of 'free blobs' grows.
Also, my first version only passed "free blobs" (i.e. blobs in the recycling pool) down the first branch when operators forked. Now we pass the blobs that were not used by the first branch down the second branch, and so on.
Also added support for blob size information in the heuristic. This uses the shape inference mechanism.
I also had to make some small tweaks:
- use the Sum() operator as a way to match shapes of blobs that otherwise had unknown shapes. This is related to the Sum() operator that is added to combine multiple incoming gradient inputs (with _autosplit gradients).
- a couple of random shape inference fixes
This reduces the Resnet-50 memory usage on a 64 batch from 9.45 GB to 8.5 GB.
For a 32 batch, the memory usage is 4330 MiB, down from 4800 MB, compared to Torch's 6856 MiB (thanks prigoyal for checking this for me).
This is unfortunately quite a lot to review...
Reviewed By: asaadaldien
Differential Revision: D4393909
fbshipit-source-id: 9c7c94125f96512bea80463ebcb63c215ef95ff9
Summary:
Add a pointwise `IsMemberOf` operator to Caffe2.
The original idea was to call it `In`, but I think that is not as clear.
I used `UnaryElementwiseWithArgsOp` at some point, but it made the code a bit more difficult to read without bringing any feature.
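The pointwise semantics, sketched in numpy (illustrative helper, not the operator code):

```python
import numpy as np

def is_member_of(x, values):
    # Pointwise membership test: out[i] is True iff x[i] is in `values`.
    return np.isin(x, list(values))

x = np.array([0, 2, 5, 7])
out = is_member_of(x, {2, 7, 9})
```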
Reviewed By: ender-wieczorek
Differential Revision: D4912655
fbshipit-source-id: 716b66bb51468dd59db5f76f23d78cda85961b58
Summary:
Two new operators to pack and unpack a dataset. This is so that we can
re-use other operators that do not understand the schema format. The immediate
use case is to use it with a partition operator.
Packing works by splitting the input into separate tensors, putting them in a
vector, and wrapping that in a shared_ptr (as opposed to a unique_ptr, so we can
copy).
Unpack takes the packed input and concatenates it back into the original.
I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating on all the arrays and provides short,
single-purpose functions that, at least for me, are easier to understand.
Reviewed By: dzhulgakov
Differential Revision: D4918002
fbshipit-source-id: ecbf9196ed25e886a94383961176b8c84dde2d2f
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If this is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward-only mode and, instead of creating a workspace for each step, cycles through a single private workspace.
Note: we could avoid doing a lot of work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that nicely requires more refactoring than I wanted to do now. Thus, we still create the backward step nets etc., but just don't pass them to the op.
This can be used to create more efficient inference models. You can also sanitize existing inference nets by removing the backward_step_net argument to get the benefits.
Reviewed By: salexspb
Differential Revision: D4916482
fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
Summary:
Rename ModelHelperBase to ModelHelper.
This is the result of running:
find . -type f -exec sed -i 's/ModelHelperBase/ModelHelper/g' {} +
We had 19 results when running fbgs ModelHelperBase; there are 20 instances here because I added one test in model_helpers_test.py.
Reviewed By: salexspb
Differential Revision: D4928337
fbshipit-source-id: bc4c12b60b90c167e717de50ea9fe17521e142e3
Summary:
This is getting too messy again, so cleaning it up even more. One thing I added here: not calling random to generate the input sequence. Ideally we would do this for all other inputs; this was reported to be an issue when hypothesis finds bad examples, as it can make the test run very long.
I also tuned the ranges a bit so the test finishes faster. On my devgpu, the whole test took 600 seconds before and now takes 39 seconds.
One more important thing: we want to test all combinations of the things that are in the for loop, while the things provided by hypothesis are just random tensor inputs.
Differential Revision: D4902956
fbshipit-source-id: ceb02d6761406b3192101d3b255abe90b2866770
Summary:
CUDA version of PRelu and its gradient. The forward pass is straightforward; the backward pass requires a reduction over the weights.
tsaizhenling, please patch this and test.
Differential Revision: D4931630
fbshipit-source-id: 1238e7d536e41480713865ced91aaef88f4feef5
Summary:
A simple FindOp for CPU and GPU which searches for a list of unordered needles in an unordered index. The CPU version might be faster if we first sorted the index / needles, but we can get back to that later.
The CUDA op is also somewhat brute-force, but quite parallel. Since the index and the queries are smallish, at least in the use case currently in mind (the Machine Translation team's word candidate search), I think this is a sufficient start.
Note that this is much simpler than the Index class of ops, which allow modifying the index, etc. Since CUDA ops are more complex to implement for the full Index functionality, I decided to make a separate op with this very simple functionality.
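A sketch of the lookup semantics (hash-map based, so neither side needs sorting; the `missing_value` convention here is an assumption for illustration):

```python
import numpy as np

def find(index, needles, missing_value=-1):
    # For each needle, return its position in `index`,
    # or `missing_value` if it is absent. A hash map gives
    # O(N + M) without sorting either side.
    pos = {v: i for i, v in enumerate(index)}
    return np.array([pos.get(n, missing_value) for n in needles])

index = np.array([30, 10, 20])
out = find(index, np.array([10, 40, 30]))
```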
Differential Revision: D4910131
fbshipit-source-id: 6df35c9e3c71d5392a500d5b98fd708ab0c8e587
Summary: Work in progress for improving the performance of the TransposeOp on CPU. This is used extensively for inference in several neural MT systems, so optimizing this function is worthwhile and will reduce request latency.
Differential Revision: D4913075
fbshipit-source-id: fa2742829291d91f3eba00fdfe7d6c0dae83e206
Summary: This is needed for the completeness of random negative sampling. When the pool size is 0, we want to generate an empty indices tensor.
Reviewed By: xianjiec
Differential Revision: D4906866
fbshipit-source-id: 75d66a92d15d60bb37bcd1075d324f28069c4fa0
Summary:
Due to the massive dependencies I did not update the version number: under
the same big version number (2017) the API is compatible, so there is no need
to rebuild all the dependencies.
This will unblock the Caffe2 Intel pull request on MKLDNN.
Differential Revision: D4906463
fbshipit-source-id: 0f74436ac3a05605e35b8b649c3e8b5c1c69b500
Summary: unit test using hypothesis for unmask operator
Reviewed By: ender-wieczorek
Differential Revision: D4904075
fbshipit-source-id: 874d3756ec703ab2cc82f24f7160b4254bf791f1
Summary: This will be used to generate random indices input to `Gather`
Reviewed By: xianjiec
Differential Revision: D4904591
fbshipit-source-id: 8d858631e3d640be2cec12f1566cbf195e6aad4b
Summary:
Two new operators to pack and unpack a dataset. This is so that we can
re-use other operators that do not understand the schema format. The immediate
use case is to use it with a partition operator.
Packing works by splitting the input into separate tensors, putting them in a
vector, and wrapping that in a shared_ptr (as opposed to a unique_ptr, so we can
copy).
Unpack takes the packed input and concatenates it back into the original.
I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating on all the arrays and provides short,
single-purpose functions that, at least for me, are easier to understand.
Reviewed By: dzhulgakov
Differential Revision: D4870606
fbshipit-source-id: dc29428de5c96cc3039af2885d9e4b026d9f482d
Summary: This is a nice way to re-use RNN layers for both training and inference.
Reviewed By: salexspb
Differential Revision: D4825894
fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
Summary:
This is from a discussion with dzhulgakov: as a step towards revisiting the
core.Net autonaming, we will first guard against accidental overwrites of
existing networks in the workspace.
ajtulloch since we are doing Predictors in mobile, this should be safe right?
azzolini - I assume this would be safe, but would love to get your approval.
akyrola - would this hurt xray?
Reviewed By: dzhulgakov
Differential Revision: D4897725
fbshipit-source-id: aa41271927ad6671f07a53b9505283623f8c49e5
Summary:
Added the possibility to provide 'tiles' and 'axis' as inputs,
as opposed to arguments, for the Tile operator. If provided, the input
values override the argument values.
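A sketch of the override behavior (the argument/input names mirror the description above; the function itself is illustrative):

```python
import numpy as np

def tile(data, tiles=1, axis=0, tiles_input=None, axis_input=None):
    # If the optional runtime inputs are given, they override the
    # `tiles` / `axis` arguments.
    if tiles_input is not None:
        tiles = int(tiles_input)
    if axis_input is not None:
        axis = int(axis_input)
    # Repeat the whole tensor `tiles` times along `axis`.
    return np.concatenate([data] * tiles, axis=axis)

x = np.array([[1, 2], [3, 4]])
by_args = tile(x, tiles=2, axis=0)                       # shape (4, 2)
by_input = tile(x, tiles=3, axis=1, tiles_input=np.array(2))  # input wins: (2, 4)
```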
Differential Revision: D4794432
fbshipit-source-id: a7e38f4f925a4cedf530924bd426c3bb08b5aad8
Summary:
Implement a new op, ElementwiseLinear.
Given an input X of size (N x D), a of size D, and b of size D,
the op computes Y of size (N x D) where Y_{nd} = X_{nd} * a_d + b_d.
Typically this op is followed by the SigmoidCrossEntropyWithLogits op for multi-label classification problems.
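The formula above maps directly onto numpy broadcasting (a one-line sketch, not the op code):

```python
import numpy as np

def elementwise_linear(X, a, b):
    # Y[n, d] = X[n, d] * a[d] + b[d]; a and b broadcast across rows.
    return X * a + b

np.random.seed(1)
N, D = 3, 4
X = np.random.randn(N, D)
a = np.random.randn(D)
b = np.random.randn(D)
Y = elementwise_linear(X, a, b)
```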
Differential Revision: D4892220
fbshipit-source-id: 77bffc5fbe03d48b3d83ab785f7c24a71c952aec
Summary:
This allows us to do in-place ReLU and also corrects the previous
inconsistency between the cudnn impl and the non-cudnn impl.
This implementation butchers the cudnn interface, in the sense that we pass
in the output instead of the input for the gradient pass. We do have a
gradient checker to guard against this situation, so we should be safe.
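Why passing the output works: for ReLU the mask x > 0 equals y > 0, so the gradient can be computed from the output alone, which is exactly what makes the in-place version possible. A numpy sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def relu_grad_from_output(gy, y):
    # Since y = max(x, 0), the set where x > 0 is the set where y > 0,
    # so the input x is not needed for the backward pass.
    return gy * (y > 0)

x = np.array([-1.0, 0.5, 2.0])
y = relu(x)
gx = relu_grad_from_output(np.ones(3), y)
```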
Reviewed By: asaadaldien
Differential Revision: D4889426
fbshipit-source-id: 081f8fe06de78413b5786086bfd5ae6c8128cd6e
Summary: Add an option to bias the forget gate one way or the other by adding a float value before the sigmoid is applied.
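The effect of such a bias, sketched on a hypothetical forget-gate pre-activation (a positive bias shifts the gate toward remembering, a negative one toward forgetting):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation of the forget gate (e.g. Wx + Uh + b).
pre = np.array([-0.5, 0.0, 0.5])
gate_default = sigmoid(pre)
gate_biased = sigmoid(pre + 1.0)  # forget bias = 1.0, applied before the sigmoid
```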
Differential Revision: D4880712
fbshipit-source-id: 1306a97c29fb31630838b2f96597a46e952d940a
Summary:
CopyGPUToCPU and CopyCPUToGPU need to handle gradients that arrive sparse. Added a unit test and fixed the gradient makers to create copies for both values and indices.
This becomes less important once the GPU sparse parameter update ops land, but it is good to fix nevertheless.
Reviewed By: dzhulgakov
Differential Revision: D4882327
fbshipit-source-id: aafd2df46b3e1bcb30b52b1edf40fad8271f1f88
Summary:
These GPU paths are probably even buggier than the CPU paths for sparse gradients with duplicate indices. Both paths cause multiple momentum updates in a single iteration, but only the GPU path is non-deterministic. Depending on how we decide to address the issues on the CPU path, pooyadavoodi has a good idea for how to match dense behavior with the sparse GPU ops.
Closes https://github.com/caffe2/caffe2/pull/254
Reviewed By: bwasti
Differential Revision: D4871680
Pulled By: dzhulgakov
fbshipit-source-id: 220be57a0f699a22ea85ed4f7022d92d362d06b3
Summary: making the name a bit clearer
Reviewed By: xianjiec
Differential Revision: D4866940
fbshipit-source-id: 3e0f7067a9d3ba89cb038d85c1991e541f1e439c
Summary:
A length-aware gather operator. This will be used for random negative sampling; see the task for details.
It should be equivalent to:
LengthsToRange + Gather + Reshape + GatherRanges
That's pretty complicated.
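A numpy sketch of what "length-aware gather" means here: gather whole length-delimited groups by group index and concatenate them (the helper name is illustrative):

```python
import numpy as np

def lengths_gather(items, lengths, indices):
    # `lengths` partitions `items` into groups; gather the groups
    # selected by `indices` and concatenate them.
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    return np.concatenate([items[offsets[i]:offsets[i + 1]] for i in indices])

items = np.arange(6)           # groups: [0 1] [2] [3 4 5]
lengths = np.array([2, 1, 3])
out = lengths_gather(items, lengths, np.array([2, 0]))
```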
Differential Revision: D4846023
fbshipit-source-id: 8d9b7ff3eddc75a7ab147cd1c2a12f377652df93
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so that they don't need to be stored in the step workspaces. This is done by modifying the backward step to automatically include all operators needed to produce the outputs that are to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace; making it a class field won't work since the lifecycle of the blob does not match the lifecycle of the operator.
For basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.
For basic LSTM, the memory saving is wonderful: each forward workspace only holds 4 bytes (for the timestep).
I also modified the neural_mt LSTM cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.
Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.
Reviewed By: urikz
Differential Revision: D4853890
fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
Summary:
Add necessary ops for feature processing:
* logit op
* replace-NaN op
* batch one-hot op
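Minimal numpy sketches of the three ops' semantics (function names and the clipping epsilon are illustrative, not the Caffe2 schemas):

```python
import numpy as np

def logit(p, eps=1e-6):
    # Log-odds; clipping keeps the result finite at p = 0 and p = 1.
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def replace_nan(x, value=0.0):
    # Replace NaN entries with a fixed value.
    return np.where(np.isnan(x), value, x)

def batch_one_hot(indices, num_classes):
    # One-hot encode a batch of integer labels.
    out = np.zeros((len(indices), num_classes))
    out[np.arange(len(indices)), indices] = 1.0
    return out
```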
Reviewed By: kittipatv
Differential Revision: D4840869
fbshipit-source-id: 197123ea5608d54f0b5ac7899973a077a6a86775