Commit graph

181 commits

Author SHA1 Message Date
Janusz Kudelka
ee7b3c9b2b caffe2: rebatching queue for MultiTask
Summary:
RFC. This is a naive implementation of a Rebatching Queue for the MultiTask
effort. Full disclaimer: I'm very new to Caffe/machine learning and I'm doing
dodgy science here (under Dmytro's supervision), so please be extra tough on
this review so I can learn best practices :)

Differential Revision: D4871970

fbshipit-source-id: 924820ef0fce45b5e2bdabeec9885cbafa23a880
2017-05-02 15:22:46 -07:00
Ahmed Taei
561255218a NormalizeOP CUDA implementation
Summary:
Implement NormalizeOP for GPU using CUDA, and re-write the gradient to be a function of the output
so it is more efficient, especially for the CUDA implementation.

Reviewed By: akyrola

Differential Revision: D4971300

fbshipit-source-id: e0ab66462000988aaf1f26010ea550533d107167
2017-05-01 09:25:30 -07:00
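The gradient-as-a-function-of-the-output trick this commit describes can be sketched in NumPy (a hypothetical illustration, not the actual Caffe2 CUDA kernel): for y = x/||x||, the input gradient is dx = (dy - y * (y . dy)) / ||x||, which needs only the output y and the norm, not x itself.

```python
import numpy as np

def normalize_forward(x):
    # L2-normalize along the last axis: y = x / ||x||
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / norm, norm

def normalize_backward(y, norm, dy):
    # Gradient written in terms of the output y rather than the input x:
    # dx = (dy - y * sum(y * dy)) / ||x||
    return (dy - y * np.sum(y * dy, axis=-1, keepdims=True)) / norm
```

Expressing the backward pass through the output avoids re-reading the input and recomputing the norm, which is why it helps a CUDA kernel in particular.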
Viswanath Sivakumar
6e1333fe92 CUDA operators for DotProduct and DotProductGradient
Summary: Only a CPU impl was available at the moment. Wrote simple CUDA kernels.

Reviewed By: akyrola

Differential Revision: D4577736

fbshipit-source-id: c2540aa9d332fcdeac46cc7f89aab164d107d7a8
2017-04-28 19:47:00 -07:00
Ying Zhang
d223d71703 Add shape inference function for RoiPool.
Summary: As the title.

Reviewed By: akyrola

Differential Revision: D4960241

fbshipit-source-id: d5f7d7c2eea72a75f810aa2f532965fff48f8388
2017-04-28 17:03:29 -07:00
Kevin Matzen
6bb43ee41e leaky relu gradient op
Summary: Implement CPU and GPU gradient for Leaky ReLU op.

Differential Revision: D4943905

fbshipit-source-id: 541f13cd5f274a18b69ecf1362722b1bc0105ad9
2017-04-28 10:06:23 -07:00
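For reference, the Leaky ReLU gradient this commit implements has a simple closed form; a minimal NumPy sketch (illustrative only, with the conventional slope parameter `alpha` assumed):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = x for x > 0, alpha * x otherwise
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, dy, alpha=0.01):
    # Pass the gradient through unchanged where x > 0, scale by alpha elsewhere
    return np.where(x > 0, dy, alpha * dy)
```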
Kevin Matzen
482ffccd76 Make instance norm grad test less flaky
Summary:
Instance norm failed the grad check in some cases that needed a smaller step size. Decreased the step size, but also increased the threshold slightly.

Related diff: D4627379

Reviewed By: kennyhorror

Differential Revision: D4941827

fbshipit-source-id: d6f565340da92af40bfee90627960a3356c69412
2017-04-27 22:35:10 -07:00
Xianjie Chen
726ded4758 add box cox transform op
Summary: As described in the title.

Reviewed By: kittipatv

Differential Revision: D4949042

fbshipit-source-id: 06b8828d8fbe2a88f6798c5d19a702ebaf6def70
2017-04-27 22:06:43 -07:00
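The commit message gives no detail, but the standard Box-Cox transform the op is presumably named after is (x^lambda - 1)/lambda for lambda != 0 and ln(x) for lambda = 0; a hedged NumPy sketch of that standard formula (not the Caffe2 C++ code):

```python
import numpy as np

def box_cox(x, lam):
    # Standard Box-Cox transform, defined for positive x:
    # (x^lambda - 1) / lambda when lambda != 0, ln(x) when lambda == 0
    x = np.asarray(x, dtype=np.float64)
    if lam == 0.0:
        return np.log(x)
    return (np.power(x, lam) - 1.0) / lam
```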
Alexander Sidorov
bf50599c70 Layered LSTM (naive version)
Summary:
This is a naive layering approach until we have a better
one. It could be C++ based and support diagonal execution. Not integrating into the main LSTM API yet as this might be revised a bit. Would like to land so we can compare against the current implementation in the benchmark and also use this as an example of how LSTMs can be combined (as some folks are doing similar things with some variations).

Later we can make LSTM() support the API of layered_LSTM() and also change it under the hood so it stacks cells into a bigger cell instead. This way, if we make the RNN op use a kind of DAG net, the RNN op can provide more parallelism in stacked cells.

Reviewed By: urikz

Differential Revision: D4936015

fbshipit-source-id: b1e25f12d985dda582f0c67d9a02508027e5497f
2017-04-27 19:16:58 -07:00
Mathieu Baudet
1aadf4324b Add row-wise broadcasting to "Where" operator
Summary: Add row-wise mode to `Where` (D4901402), similar to `RowMul`.

Reviewed By: ender-wieczorek

Differential Revision: D4928221

fbshipit-source-id: 3443e559cd366e48c2f6a3f379aeefb7921264ee
2017-04-27 12:31:54 -07:00
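A plausible reading of the row-wise mode, sketched in NumPy (hypothetical; the actual operator is C++): the condition holds one entry per row and selects whole rows of `left` or `right`, analogous to how `RowMul` broadcasts a per-row multiplier.

```python
import numpy as np

def row_where(condition, left, right):
    # Row-wise Where: condition has one entry per row of the (N, D) inputs;
    # reshaping to (N, 1) broadcasts the choice across all D columns.
    cond = np.asarray(condition).reshape(-1, 1)
    return np.where(cond, left, right)
```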
Alexander Sidorov
ad6204eb0b LSTM: support dropping hidden / cell states when sequence
Summary:
This is useful when data has standalone sequences which are
not connected to each other by any meaningful context

Reviewed By: yqwangustc

Differential Revision: D4835164

fbshipit-source-id: f95626acc26acc3eba3bca7efb08ed1dbdb36c83
2017-04-27 11:47:29 -07:00
Jeffrey Dunn
9f9a2da1a1 Revert D4920719: [dper2][operator] ScaleGradientOp
Summary: This reverts commit 0e1e0888f79594be874fdbdda5ccef7389064c50

Differential Revision: D4920719

fbshipit-source-id: 1ca9dc329eaffeb2932267d631506bb124d4e7ae
2017-04-26 09:34:47 -07:00
Huazhong Ning
e42c14e819 ScaleGradientOp
Summary:
ScaleGradient is a helper operator that does no actual numerical computation,
and in the gradient computation phase scales the gradient from being computed
through it.

Differential Revision: D4920719

fbshipit-source-id: 0e1e0888f79594be874fdbdda5ccef7389064c50
2017-04-25 21:46:45 -07:00
Yang Yang
5692969e8f add gradient for LengthsTileOp
Summary:
LengthsTile goes from one row to multiple rows; the gradient op is simply the reverse,
adding the fanned-out rows of gradients back together into one

Reviewed By: kittipatv

Differential Revision: D4943375

fbshipit-source-id: deae9984e849974a0d484a10b94efdb1d30941cc
2017-04-25 14:31:15 -07:00
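The fan-out/sum relationship described above can be sketched in NumPy (an illustration of the semantics, not the Caffe2 implementation):

```python
import numpy as np

def lengths_tile(data, lengths):
    # Forward: repeat row i of `data` lengths[i] times
    return np.repeat(data, lengths, axis=0)

def lengths_tile_grad(dy, lengths):
    # Backward: reverse the fan-out by summing each group of
    # lengths[i] gradient rows back into a single row
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    return np.stack([dy[offsets[i]:offsets[i + 1]].sum(axis=0)
                     for i in range(len(lengths))])
```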
Aapo Kyrola
f82a510be6 share forward activation blobs + pass unused free blobs down all branches + use shape inference
Summary:
Added optional support for sharing activation blobs as well. Doing this change revealed a non-optimal implementation in the blob sharing: we need to reuse free blobs by preferring those blobs that are already shared by many other blobs. Otherwise the memory usage can increase when the pool of 'free blobs' grows.

Also, my first version only passed "free blobs" (i.e. blobs in the recycling pool) down the first branch when operators forked. Now we pass the blobs that were not used by the first branch down the second branch, and so on.

Also added support for blob size information in the heuristic. This uses the shape inference mechanism.

I had to also do some small tweaks:
- use Sum() operator as a way to match shapes of blobs that had otherwise unknown shapes. This is related to the Sum() operator that is added to combine multiple incoming gradient inputs (with _autosplit gradients).
- a couple of random shape inference fixes

This reduces the Resnet-50 memory usage on a 64 batch from 9.45 GB to 8.5 GB.
For a 32 batch, the memory usage is 4330 MiB, down from 4800 MB, compared to Torch's 6856 MiB (thanks prigoyal for checking this for me).

This is unfortunately quite a bunch to review...

Reviewed By: asaadaldien

Differential Revision: D4393909

fbshipit-source-id: 9c7c94125f96512bea80463ebcb63c215ef95ff9
2017-04-25 14:23:25 -07:00
Ahmed Taei
2533671a97 Support 3D&1D SpatialBatchNorm in cuDNN
Differential Revision: D4941087

fbshipit-source-id: 4adbf1f8990c7356f8effd8b0e1ae286fce6558c
2017-04-24 22:16:19 -07:00
Mathieu Baudet
081001a176 "IsMemberOf" operator
Summary:
Add a pointwise `IsMemberOf` operator to Caffe2.

The original idea for the name was `In`, but I think that is not as clear.

I used `UnaryElementwiseWithArgsOp` at some point, but it was making the code a bit more difficult to read without bringing any feature.

Reviewed By: ender-wieczorek

Differential Revision: D4912655

fbshipit-source-id: 716b66bb51468dd59db5f76f23d78cda85961b58
2017-04-24 18:18:49 -07:00
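Semantically, the pointwise `IsMemberOf` operator described here behaves like NumPy's `isin`; a minimal sketch of that assumed behavior (not the Caffe2 code):

```python
import numpy as np

def is_member_of(x, values):
    # Elementwise membership test: True where x[i] is in `values`
    return np.isin(x, list(values))
```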
Mathieu Baudet
24ff90ee6b "Where" operator
Summary: Adding a pointwise `Where(condition, left, right)` operator to Caffe2.

Reviewed By: ender-wieczorek

Differential Revision: D4901402

fbshipit-source-id: a33682e77b2e7367050a94eeb4e10b7e5de9f955
2017-04-24 18:18:48 -07:00
Janusz Kudelka
902409be56 caffe2: datasets pack/unpack
Summary:
Two new operators to pack and unpack a dataset. This is so that we can
re-use other operators that do not understand the schema format. The immediate
use-case is to use it with a partition operator.

Packing works by splitting the input into separate tensors, putting them in a
vector and wrapping in a shared_ptr (as opposed to a unique_ptr, so we can
copy).

Unpack takes the packed input and concatenates it back to the original.

I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating with all the arrays and provides
short, purpose-specific functions that, at least for me, are easier to
understand.

Reviewed By: dzhulgakov

Differential Revision: D4918002

fbshipit-source-id: ecbf9196ed25e886a94383961176b8c84dde2d2f
2017-04-24 16:09:39 -07:00
Aapo Kyrola
9cb901caf0 Forward-only rnns
Summary:
Added a forward_only option to recurrent_net and RNNCells. If this is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward-only mode and does not create workspaces for each step, but instead cycles
through a single private workspace.

Note: we could avoid doing a lot of work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that nicely requires
more refactoring than I wanted to do now. Thus, we still create the backward step nets etc., but just don't pass them to the op.

This can be used to create more efficient inference models. You can also sanitize existing inference nets and remove the backward_step_net argument to
get the benefits.

Reviewed By: salexspb

Differential Revision: D4916482

fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
2017-04-24 15:52:27 -07:00
Yiming Wu
bef6e45f8b rename ModelHelperBase
Summary:
rename ModelHelperBase to Model.

This is the result of running:

  find . -type f -exec sed -i 's/ModelHelperBase/ModelHelper/g' {} +

We had 19 results when running fbgs for ModelHelperBase. There are 20 instances here because I added one test in model_helpers_test.py.

Reviewed By: salexspb

Differential Revision: D4928337

fbshipit-source-id: bc4c12b60b90c167e717de50ea9fe17521e142e3
2017-04-24 15:52:26 -07:00
Alexander Sidorov
4f77a49ddd refactor LSTM test to avoid copy pasta, improve speed 1.5x and provide better coverage
Summary:
This was getting too messy again, so I am cleaning it up even more. One thing I added here: not calling random to generate the input sequence. Ideally we would do this for all other inputs. Random generation was reported to be an issue when hypothesis finds bad examples; it can make the test run very long.

I also tuned the ranges a bit so the test finishes faster. On my devgpu, the whole test took 600 seconds before and now takes 39 seconds.

One more important thing: we want to test all combinations of the things in the for loop, while the values provided by hypothesis are just random tensor inputs.

Differential Revision: D4902956

fbshipit-source-id: ceb02d6761406b3192101d3b255abe90b2866770
2017-04-24 15:52:26 -07:00
Aapo Kyrola
41f4198344 CUDA version of PRelu/Gradient + Fix Gradient for dW
Summary:
CUDA version of PRelu and its gradient. The forward pass is straightforward; the backward pass requires a reduction over the weights.

tsaizhenling, please patch this and test.

Differential Revision: D4931630

fbshipit-source-id: 1238e7d536e41480713865ced91aaef88f4feef5
2017-04-24 15:52:25 -07:00
Luke Yeager
09bb91022a Fix tests for ops without a CUDA backend
Summary:
*See https://github.com/caffe2/caffe2/pull/227*

* Logit
* ReplaceNaN
* BatchOneHot
Closes https://github.com/caffe2/caffe2/pull/277

Differential Revision: D4915268

Pulled By: Yangqing

fbshipit-source-id: 77ccb2e7d03e6953e8ca60646987a02868d0ef5b
2017-04-24 15:52:25 -07:00
Aapo Kyrola
b82f9e9ea7 FindOp
Summary:
Simple FindOp for CPU and GPU which searches a list of unordered needles in an unordered index. The CPU version might be faster if we first sorted the index/needles, but we can get back to that later.

CUDA op is also kind of brutish, but pretty parallel. Since the index and the queries are smallish at least in the use case currently in mind (Machine Translation's team word candidate search), I think this is a sufficient start.

Note that this is much simpler than the Index-class of ops which allow modifying the index etc. Since CUDA ops are more complex to implement for the full Index functionality, I decided to make a separate op with this very simple functionality.

Differential Revision: D4910131

fbshipit-source-id: 6df35c9e3c71d5392a500d5b98fd708ab0c8e587
2017-04-24 15:52:25 -07:00
James Reed
01c76bf830 Optimize TransposeOp by using strided access pattern, bulk memory transfer, and other profile-guided optimizations
Summary: Work in progress for improving the performance of the TransposeOp on CPU. This is used extensively for inference in several neural MT systems, so optimizing this function is worthwhile and will reduce request latency.

Differential Revision: D4913075

fbshipit-source-id: fa2742829291d91f3eba00fdfe7d6c0dae83e206
2017-04-20 18:31:40 -07:00
Kittipat Virochsiri
e5e3ec1498 fix unit test
Summary: CUDA is not implemented

Reviewed By: xianjiec

Differential Revision: D4917368

fbshipit-source-id: dc41a76cf569018896cf457c0e3358ce840e198e
2017-04-19 17:22:00 -07:00
Yiming Wu
4ad3a4fc8b Revert D4794432: Added tiles and axis as input parameters to Tile Operator
Summary: This reverts commit a7e38f4f925a4cedf530924bd426c3bb08b5aad8

Differential Revision: D4794432

fbshipit-source-id: 05b2b0d101ebd917527e94ef8a74e63ab40942a4
2017-04-19 14:17:25 -07:00
Kittipat Virochsiri
883ff96f74 Allow UniformIntFill to produce empty tensor
Summary: This is needed for the completeness of random negative sampling. When the pool size is 0, we want to generate an empty indices tensor.

Reviewed By: xianjiec

Differential Revision: D4906866

fbshipit-source-id: 75d66a92d15d60bb37bcd1075d324f28069c4fa0
2017-04-19 13:03:23 -07:00
Yangqing Jia
41620f86c9 Update IntelComposerXE to 2017.2.274
Summary:
Due to the massive dependencies I did not update the version number - under
the same major version number (2017) the API is compatible, so there is no need to
rebuild all the dependencies.

This will unblock the Caffe2 Intel pull request on MKLDNN.

Differential Revision: D4906463

fbshipit-source-id: 0f74436ac3a05605e35b8b649c3e8b5c1c69b500
2017-04-19 10:07:09 -07:00
Shenxiu Liu
8492c411e8 Caffe2 unit test for unmask
Summary: unit test using hypothesis for unmask operator

Reviewed By: ender-wieczorek

Differential Revision: D4904075

fbshipit-source-id: 874d3756ec703ab2cc82f24f7160b4254bf791f1
2017-04-18 21:06:18 -07:00
Dmytro Dzhulgakov
580e192151 Revert D4870606: caffe2: datasets pack/unpack
Summary: This reverts commit dc29428de5c96cc3039af2885d9e4b026d9f482d

Differential Revision: D4870606

fbshipit-source-id: 1d05912b1a9e35e84b0c163c7b018db125ce060f
2017-04-18 16:47:05 -07:00
Kittipat Virochsiri
009bbc9983 Allow UniformFill/UniformIntFill to take parameters from input blobs
Summary: This will be used to generate random indices input to `Gather`

Reviewed By: xianjiec

Differential Revision: D4904591

fbshipit-source-id: 8d858631e3d640be2cec12f1566cbf195e6aad4b
2017-04-18 14:31:03 -07:00
Janusz Kudelka
34269a6fda caffe2: datasets pack/unpack
Summary:
Two new operators to pack and unpack a dataset. This is so that we can
re-use other operators that do not understand the schema format. The immediate
use-case is to use it with a partition operator.

Packing works by splitting the input into separate tensors, putting them in a
vector and wrapping in a shared_ptr (as opposed to a unique_ptr, so we can
copy).

Unpack takes the packed input and concatenates it back to the original.

I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating with all the arrays and provides
short, purpose-specific functions that, at least for me, are easier to
understand.

Reviewed By: dzhulgakov

Differential Revision: D4870606

fbshipit-source-id: dc29428de5c96cc3039af2885d9e4b026d9f482d
2017-04-18 13:31:10 -07:00
Yury Zemlyanskiy
4bf559eddb RNNCell, LSTMCell, LSTMWithAttentionCell
Summary: This is a nicer way to re-use RNN layers for training and for inference.

Reviewed By: salexspb

Differential Revision: D4825894

fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
2017-04-18 00:47:20 -07:00
Yangqing Jia
cf317d1106 create_net: explicitly specify if one wants to overwrite the network.
Summary:
This is from discussion with dzhulgakov : as a step towards revisiting the
core.Net autonaming, we will first guard against accidental overwrites of
existing networks in the workspace.

ajtulloch since we are doing Predictors in mobile, this should be safe right?

azzolini - I assume this would be safe, but would love to get your approval.

akyrola - would this hurt xray?

Reviewed By: dzhulgakov

Differential Revision: D4897725

fbshipit-source-id: aa41271927ad6671f07a53b9505283623f8c49e5
2017-04-17 21:46:53 -07:00
Romain Cledat
20330fe3f4 Added tiles and axis as input parameters to Tile Operator
Summary:
Added the possibility to add 'tiles' and 'axis' as input
as opposed to arguments for the Tile Operator. If provided, the input
values will override the argument values

Differential Revision: D4794432

fbshipit-source-id: a7e38f4f925a4cedf530924bd426c3bb08b5aad8
2017-04-17 15:31:20 -07:00
Zhicheng Yan
25035e8b3b ElementwiseLinearOp
Summary:
Implement a new op ElementwiseLinear.
Given inputs X of size (N x D), a of size D, and b of size D,
the op computes Y of size (N x D) where Y_{nd} = X_{nd} * a_d + b_d.
Typically, this op is followed by the SigmoidCrossEntropyWithLogits op for multi-label classification problems.

Differential Revision: D4892220

fbshipit-source-id: 77bffc5fbe03d48b3d83ab785f7c24a71c952aec
2017-04-17 14:18:27 -07:00
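The formula in the summary maps directly onto broadcasting; a NumPy sketch of Y_{nd} = X_{nd} * a_d + b_d (illustration only, not the Caffe2 operator code):

```python
import numpy as np

def elementwise_linear(X, a, b):
    # Y[n, d] = X[n, d] * a[d] + b[d]; the 1-D vectors a and b
    # broadcast across the N rows of X
    return X * a + b
```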
Aapo Kyrola
4db7bec686 CUDA version of SigmoidCrossEntropyWithLogits
Summary: CUDA versions of SigmoidCrossEntropyWithLogits/Gradient.

Reviewed By: jay-mahadeokar

Differential Revision: D4891254

fbshipit-source-id: cabad908026e30d9a0721cad092ba948659ab917
2017-04-14 16:07:33 -07:00
Yangqing Jia
d65892b7f2 Change back the function signature of relu gradient to only use
Summary:
This allows us to do in-place relu and also corrects the previous error of
inconsistency between the cudnn impl and the non-cudnn impl.

This implementation butchers the cudnn interface, in the sense that we pass
in the output instead of the input for the gradient pass. We do have a
gradient checker to guard this situation, so we should be safe.

Reviewed By: asaadaldien

Differential Revision: D4889426

fbshipit-source-id: 081f8fe06de78413b5786086bfd5ae6c8128cd6e
2017-04-13 22:08:09 -07:00
James Reed
e8cc5563fe Add an optional forget bias argument to LSTMUnit
Summary: Add an option to bias the forget gate one way or the other by adding a float value before the sigmoid is applied.

Differential Revision: D4880712

fbshipit-source-id: 1306a97c29fb31630838b2f96597a46e952d940a
2017-04-13 21:49:17 -07:00
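The forget-bias option amounts to shifting the gate pre-activation before the sigmoid; a NumPy sketch with hypothetical names (not the LSTMUnit code itself):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(pre_activation, forget_bias=0.0):
    # Add the bias to the pre-activation before the sigmoid; a positive
    # bias pushes the gate toward 1, keeping memory early in training
    return sigmoid(pre_activation + forget_bias)
```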
Aapo Kyrola
69f42e3f70 make CopyGPUToCPU/CPUToGPU handle sparse gradients
Summary:
CopyGPUToCPU and CopyCPUToGPU need to handle gradients that arrive in sparse form. Added a unit test and fixed the gradient makers to create copies for both values and indices.

This becomes less important once the GPU sparse parameter update ops land, but it is nevertheless good to fix.

Reviewed By: dzhulgakov

Differential Revision: D4882327

fbshipit-source-id: aafd2df46b3e1bcb30b52b1edf40fad8271f1f88
2017-04-13 17:16:26 -07:00
Luke Yeager
8bd0522c20 Add tests and GPU impls for sparse optimizers
Summary:
These GPU paths are probably even buggier than the CPU paths for sparse gradients with duplicate indices. Both paths cause multiple momentum updates in a single iteration, but only the GPU path is non-deterministic. Depending on how we decide to address the issues on the CPU path, pooyadavoodi has a good idea for how to match dense behavior with the sparse GPU ops.
Closes https://github.com/caffe2/caffe2/pull/254

Reviewed By: bwasti

Differential Revision: D4871680

Pulled By: dzhulgakov

fbshipit-source-id: 220be57a0f699a22ea85ed4f7022d92d362d06b3
2017-04-13 11:07:40 -07:00
Yiming Wu
83f360887f new SumReduceLike op CPU/GPU implementation and doc
Summary:
new SumReduceLikeOp CPU/GPU implementation and doc. Unit tests and NMT team tests passed. Some benchmark results here:

  shape(A) = [100, 1000, 100]
  shape(B) = [1000]

  0.36684 ms/iter (0.00122679 ms/iter) SumReduceLike
  0.246593 ms/iter (0.00151116 ms/iter) ReduceBackSum
  0.202563 ms/iter (0.00511932 ms/iter) ReduceFrontSum
  // This means that we are faster than back+front sum now

  shape(A) = [32, 32, 100]
  shape(B) = [32, 100]

  0.0253826 ms/iter (0.00257504 ms/iter) ReduceFrontSum
  0.0233368 ms/iter (0.00118283 ms/iter) SumReduceLike

  shape(A) = [32, 32, 100]
  shape(B) = [32, 32]

  0.0276206 ms/iter (0.00691918 ms/iter) ReduceBackSum
  0.0254768 ms/iter (0.00325529 ms/iter) SumReduceLike

Reviewed By: Yangqing

Differential Revision: D4873222

fbshipit-source-id: 736b1537998f4289876bc53d38607b8052e89c70
2017-04-13 10:28:46 -07:00
Kittipat Virochsiri
05002442eb Renaming DuplicateOp to LengthsTileOp
Summary: making the name a bit clearer

Reviewed By: xianjiec

Differential Revision: D4866940

fbshipit-source-id: 3e0f7067a9d3ba89cb038d85c1991e541f1e439c
2017-04-12 22:04:20 -07:00
Kittipat Virochsiri
f5ac83b060 LengthsGatherOp
Summary:
Length-aware gather operator. This will be used for random negative sampling. See the task for details.

This should be equivalent to:

LengthsToRange + Gather + Reshape + GatherRanges

That's pretty complicated.

Differential Revision: D4846023

fbshipit-source-id: 8d9b7ff3eddc75a7ab147cd1c2a12f377652df93
2017-04-12 12:01:35 -07:00
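Based on the summary, `LengthsGather` plausibly gathers whole length-delimited segments by index, equivalent to the LengthsToRange + Gather + Reshape + GatherRanges chain it replaces; a hedged NumPy sketch of that reading:

```python
import numpy as np

def lengths_gather(items, lengths, indices):
    # Gather whole length-delimited groups: for each i in `indices`,
    # emit the lengths[i] consecutive rows that belong to group i
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    out = [items[offsets[i]:offsets[i] + lengths[i]] for i in indices]
    return np.concatenate(out), np.asarray(lengths)[indices]
```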
Ahmed Taei
75c2168966 Generalize PoolingOp(CUDA) to compute 1D, 2D and 3D pooling.
Summary: Extend the MaxPooling & AveragePooling CUDA ops to compute 1D, 2D & 3D pooling.

Differential Revision: D4866699

fbshipit-source-id: 9bf2d970f2df2b87194a539fc60c07ac19fa1042
2017-04-12 09:16:45 -07:00
Ahmed Taei
09bfc8043b Generalize PoolingOp(CPU) to compute 1D, 2D and 3D pooling.
Summary: Extend the op to compute 1D, 2D & 3D pooling.

Differential Revision: D4828691

fbshipit-source-id: 87540e82ed20d1361476cfbc43a708de9ca7a88e
2017-04-11 18:18:21 -07:00
Aapo Kyrola
1e5140aa76 option to recompute blobs backward pass with massive memory savings
Summary:
This diff adds an option to recurrent_net to mark some cell blobs to be recomputed on the backward step, so they don't need to be stored in the step workspace. This is done by modifying the backward step to automatically include all operators that are needed to produce the output that is to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace. Making it a class field won't work, since the lifecycle of the blob does not match the lifecycle of the operator.

For basic LSTM, the performance hit is quite modest (about 15% with one setting), but your mileage may vary. For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.

For basic LSTM, the memory saving is wonderful: each forward workspace only has 4 bytes (for timestep).

I also modified the neural_mt LSTM Cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.

Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.

Reviewed By: urikz

Differential Revision: D4853890

fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
2017-04-11 13:03:48 -07:00
Xianjie Chen
70e9c08f27 feature processing ops
Summary:
add necessary ops for feature processing
* logit op
* replace nan
* batch one hot op

Reviewed By: kittipatv

Differential Revision: D4840869

fbshipit-source-id: 197123ea5608d54f0b5ac7899973a077a6a86775
2017-04-11 07:07:51 -07:00
Aapo Kyrola
22584b546a Revert D4711302: SumReduceLikeOp CPU/GPU implementation
Summary: This reverts commit 0865abde871b3046b367599731593dae03f0775a

Differential Revision: D4711302

fbshipit-source-id: 6c22e683544f6627142fc9970a781ec98f682cad
2017-04-10 23:01:26 -07:00