Commit graph

145 commits

Author SHA1 Message Date
Zhicheng Yan
25035e8b3b ElementwiseLinearOp
Summary:
Implement a new op ElementwiseLinear.
Given inputs X of size (N x D), a of size D and b of size D,
the op computes Y of size (N x D) where Y_{nd} = X_{nd} * a_d + b_d.
Typically, this op is followed by the SigmoidCrossEntropyWithLogits op for multi-label classification problems.
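The formula above can be sketched in plain Python (a hypothetical reference implementation for illustration, not the actual Caffe2 kernel):

```python
def elementwise_linear(X, a, b):
    # Y[n][d] = X[n][d] * a[d] + b[d]; X is N x D, a and b are length-D vectors
    return [[x * ai + bi for x, ai, bi in zip(row, a, b)] for row in X]

X = [[1.0, 2.0], [3.0, 4.0]]
a = [10.0, 100.0]
b = [0.5, 0.25]
print(elementwise_linear(X, a, b))  # [[10.5, 200.25], [30.5, 400.25]]
```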

Differential Revision: D4892220

fbshipit-source-id: 77bffc5fbe03d48b3d83ab785f7c24a71c952aec
2017-04-17 14:18:27 -07:00
Aapo Kyrola
4db7bec686 CUDA version of SigmoidCrossEntropyWithLogits
Summary: CUDA versions of SigmoidCrossEntropyWithLogits/Gradient.
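For reference, one common numerically stable formulation of sigmoid cross-entropy with logits is max(x, 0) - x*z + log(1 + exp(-|x|)), which avoids overflow for large |x|. A plain-Python sketch (an illustration of the math, not necessarily the exact kernel used here):

```python
import math

def sigmoid_cross_entropy_with_logits(x, z):
    # Per-element loss for logit x and binary label z, computed in a
    # numerically stable way (never exponentiates a positive number).
    return max(x, 0) - x * z + math.log1p(math.exp(-abs(x)))

# Mathematically equal to -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x))
print(sigmoid_cross_entropy_with_logits(2.0, 1.0))
```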

Reviewed By: jay-mahadeokar

Differential Revision: D4891254

fbshipit-source-id: cabad908026e30d9a0721cad092ba948659ab917
2017-04-14 16:07:33 -07:00
Yangqing Jia
d65892b7f2 Change back the function signature of relu gradient to only use
Summary:
This allows us to do in-place relu and also corrects the previous
inconsistency between the cudnn impl and the non-cudnn impl.

This implementation butchers the cudnn interface, in the sense that we pass
in the output instead of the input for the gradient pass. We do have a
gradient checker to guard this situation, so we should be safe.

Reviewed By: asaadaldien

Differential Revision: D4889426

fbshipit-source-id: 081f8fe06de78413b5786086bfd5ae6c8128cd6e
2017-04-13 22:08:09 -07:00
James Reed
e8cc5563fe Add an optional forget bias argument to LSTMUnit
Summary: Add an option to bias the forget gate one way or another by adding in some float value before the sigmoid is applied.
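The effect of the forget bias can be sketched in plain Python (a hypothetical illustration; the real LSTMUnit gate computation involves the full cell state update):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forget_gate(f_preactivation, forget_bias=0.0):
    # Adding a positive forget_bias before the sigmoid pushes the gate
    # toward 1.0 (i.e. "remember more"), which helps early in training.
    return sigmoid(f_preactivation + forget_bias)

print(forget_gate(0.0))       # 0.5
print(forget_gate(0.0, 1.0))  # ~0.731
```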

Differential Revision: D4880712

fbshipit-source-id: 1306a97c29fb31630838b2f96597a46e952d940a
2017-04-13 21:49:17 -07:00
Aapo Kyrola
69f42e3f70 make CopyGPUToCPU/CPUToGPU handle sparse gradients
Summary:
CopyGPUToCPU and CopyCPUToGPU need to handle gradients that arrive in sparse form. Added a unit test and fixed the gradient makers to create copies for both values and indices.

This becomes less important once the GPU sparse parameter update ops land, but it is nevertheless good to fix.

Reviewed By: dzhulgakov

Differential Revision: D4882327

fbshipit-source-id: aafd2df46b3e1bcb30b52b1edf40fad8271f1f88
2017-04-13 17:16:26 -07:00
Luke Yeager
8bd0522c20 Add tests and GPU impls for sparse optimizers
Summary:
These GPU paths are probably even buggier than the CPU paths for sparse gradients with duplicate indices. Both paths cause multiple momentum updates in a single iteration, but only the GPU path is non-deterministic. Depending on how we decide to address the issues on the CPU path, pooyadavoodi has a good idea for how to match dense behavior with the sparse GPU ops.
Closes https://github.com/caffe2/caffe2/pull/254

Reviewed By: bwasti

Differential Revision: D4871680

Pulled By: dzhulgakov

fbshipit-source-id: 220be57a0f699a22ea85ed4f7022d92d362d06b3
2017-04-13 11:07:40 -07:00
Yiming Wu
83f360887f new SumReduceLike op CPU/GPU implementation and doc
Summary:
new SumReduceLikeOp CPU/GPU implementation and doc. Unit tests and NMT team tests passed. Some benchmark results here:

  shape(A) = [100, 1000, 100]
  shape(B) = [1000]

  0.36684 ms/iter (0.00122679 ms/iter) SumReduceLike
  0.246593 ms/iter (0.00151116 ms/iter) ReduceBackSum
  0.202563 ms/iter (0.00511932 ms/iter) ReduceFrontSum
  // This means that we are faster than back+front sum now

  shape(A) = [32, 32, 100]
  shape(B) = [32, 100]

  0.0253826 ms/iter (0.00257504 ms/iter) ReduceFrontSum
  0.0233368 ms/iter (0.00118283 ms/iter) SumReduceLike

  shape(A) = [32, 32, 100]
  shape(B) = [32, 32]

  0.0276206 ms/iter (0.00691918 ms/iter) ReduceBackSum
  0.0254768 ms/iter (0.00325529 ms/iter) SumReduceLike

Reviewed By: Yangqing

Differential Revision: D4873222

fbshipit-source-id: 736b1537998f4289876bc53d38607b8052e89c70
2017-04-13 10:28:46 -07:00
Kittipat Virochsiri
05002442eb Renaming DuplicateOp to LengthsTileOp
Summary: making the name a bit clearer

Reviewed By: xianjiec

Differential Revision: D4866940

fbshipit-source-id: 3e0f7067a9d3ba89cb038d85c1991e541f1e439c
2017-04-12 22:04:20 -07:00
Kittipat Virochsiri
f5ac83b060 LengthsGatherOp
Summary:
Length-aware gather operator. This will be used for random negative sampling. See the task for details.

This should be equivalent to:

LengthsToRange + Gather + Reshape + GatherRanges

That's pretty complicated.
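A plain-Python sketch of the segment-gather semantics implied above (a hypothetical reference implementation assuming the op takes a flat items blob, per-segment lengths, and segment indices to gather; not the actual Caffe2 kernel):

```python
def lengths_gather(items, lengths, indices):
    # Compute the starting offset of each segment in the flat items list.
    offsets, acc = [], 0
    for length in lengths:
        offsets.append(acc)
        acc += length
    # Concatenate the selected segments in the order given by indices.
    out = []
    for i in indices:
        out.extend(items[offsets[i]:offsets[i] + lengths[i]])
    return out

items = [1, 2, 3, 4, 5, 6]
lengths = [2, 3, 1]  # segments: [1, 2], [3, 4, 5], [6]
print(lengths_gather(items, lengths, [2, 0]))  # [6, 1, 2]
```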

Differential Revision: D4846023

fbshipit-source-id: 8d9b7ff3eddc75a7ab147cd1c2a12f377652df93
2017-04-12 12:01:35 -07:00
Ahmed Taei
75c2168966 Generalize PoolingOp(CUDA) to compute 1D, 2D and 3D pooling.
Summary: Extend the MaxPooling & AveragePooling CUDA ops to compute 1D, 2D & 3D pooling.

Differential Revision: D4866699

fbshipit-source-id: 9bf2d970f2df2b87194a539fc60c07ac19fa1042
2017-04-12 09:16:45 -07:00
Ahmed Taei
09bfc8043b Generalize PoolingOp(CPU) to compute 1D, 2D and 3D pooling.
Summary: Extend the op to compute 1D, 2D & 3D pooling.

Differential Revision: D4828691

fbshipit-source-id: 87540e82ed20d1361476cfbc43a708de9ca7a88e
2017-04-11 18:18:21 -07:00
Aapo Kyrola
1e5140aa76 option to recompute blobs backward pass with massive memory savings
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so they don't need to be stored in the step workspace. This is done by modifying the backward step to automatically include all operators needed to produce the output that is to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace. Making it a class field won't work, since the lifecycle of the blob does not match the lifecycle of the operator.

For basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.

For basic LSTM, the memory saving is wonderful: each forward workspace only has 4 bytes (for timestep).

I also modified the neural_mt LSTM Cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.

Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.

Reviewed By: urikz

Differential Revision: D4853890

fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
2017-04-11 13:03:48 -07:00
Xianjie Chen
70e9c08f27 feature processing ops
Summary:
add necessary ops for feature processing
* logit op
* replace nan
* batch one hot op

Reviewed By: kittipatv

Differential Revision: D4840869

fbshipit-source-id: 197123ea5608d54f0b5ac7899973a077a6a86775
2017-04-11 07:07:51 -07:00
Aapo Kyrola
22584b546a Revert D4711302: SumReduceLikeOp CPU/GPU implementation
Summary: This reverts commit 0865abde871b3046b367599731593dae03f0775a

Differential Revision: D4711302

fbshipit-source-id: 6c22e683544f6627142fc9970a781ec98f682cad
2017-04-10 23:01:26 -07:00
Aapo Kyrola
092c1440a2 SumSqrElements
Summary:
Added SumSqrElements, which lets us avoid the large temporary blob needed when doing Sqr + SumElements.

Also moved it to reduction_ops, because utility_ops has grown too big.

Reviewed By: jamesr66a

Differential Revision: D4844172

fbshipit-source-id: 032eec45e24d6724f0d5fb83f4ec1c771d1146e5
2017-04-10 16:16:52 -07:00
Huazhong Ning
d1af311224 PiecewiseLinearTransformOp supports passing params from input blobs.
Summary:
The PiecewiseLinearTransformOp passes the transform parameters (bounds, slopes, intercepts) via operator args. This diff adds support for passing these parameters through input blobs.

The purpose is to allow us to create a model calibration net that can be exported when saving the model.
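A plain-Python sketch of a piecewise linear transform parameterized by bounds, slopes, and intercepts (a hypothetical reference; the clamping behavior at the boundaries is an assumption, and the real op operates on whole blobs):

```python
import bisect

def piecewise_linear(x, bounds, slopes, intercepts):
    # bounds has one more entry than slopes/intercepts; piece j covers
    # [bounds[j], bounds[j+1]). Inputs outside the overall range are
    # clamped to it (an assumption for this sketch).
    x = min(max(x, bounds[0]), bounds[-1])
    j = min(bisect.bisect_right(bounds, x) - 1, len(slopes) - 1)
    return slopes[j] * x + intercepts[j]

bounds = [0.0, 1.0, 2.0]
slopes = [1.0, 2.0]
intercepts = [0.0, -1.0]
print(piecewise_linear(0.5, bounds, slopes, intercepts))  # 0.5
print(piecewise_linear(1.5, bounds, slopes, intercepts))  # 2.0
```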

Reviewed By: dragonxlwang

Differential Revision: D4777086

fbshipit-source-id: 0d157154860f61ec6ecfab95aea80beed54aa5c6
2017-04-08 11:02:35 -07:00
Kittipat Virochsiri
d8b9e787c2 DuplicateOp
Summary: This is like LengthsToSegmentIds + Gather without the intermediate segment IDs blob. I only realized that after I wrote the whole thing. That combination is not obvious, so just check this in?
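The tiling semantics (later renamed LengthsTile, per the commit above this one) can be sketched in plain Python (a hypothetical reference, not the Caffe2 implementation):

```python
def lengths_tile(data, lengths):
    # Repeat data[i] lengths[i] times -- equivalent to
    # LengthsToSegmentIds + Gather without materializing the segment IDs.
    out = []
    for row, n in zip(data, lengths):
        out.extend([row] * n)
    return out

print(lengths_tile(['a', 'b'], [2, 3]))  # ['a', 'a', 'b', 'b', 'b']
```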

Reviewed By: xianjiec

Differential Revision: D4847591

fbshipit-source-id: a1c480f16b317763866af13c83b3aaaeb6a60751
2017-04-08 00:01:59 -07:00
Yiming Wu
dc5a34200f SumReduceLikeOp CPU/GPU implementation
Summary:
1. CPU/GPU implementation of SumReduceLikeOp.

[SRLOp](matrix A, matrix B) -> C

where C has the same shape as B; each of its elements is the reduced sum of the corresponding elements of A.

2. Make SumReduceLikeOp (part of) the gradient of Add/Mul/Sub and provide unittests

===Update for Translation Team===
3. Passed Tests:
$ buck test caffe2/caffe2/python/operator_test:recurrent_network_test
$ buck test fblearner/flow/tests/langtech/translation/neural_mt:seq2seq_model_caffe2
$ buck test fblearner/flow/tests/langtech/translation/neural_mt:seq2seq_ensemble_beam_model_caffe2
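A simplified sketch of the reduction described in item 1 (a hypothetical plain-Python reference for the 2-D case where B's shape matches A's trailing dimension; the real op handles general shapes and axes):

```python
def sum_reduce_like_2d(A, b_len):
    # Reduce the leading dimension of N x D matrix A so the result has
    # the shape of a length-D vector B: C[d] = sum over n of A[n][d].
    return [sum(A[n][d] for n in range(len(A))) for d in range(b_len)]

A = [[1, 2, 3], [4, 5, 6]]
print(sum_reduce_like_2d(A, 3))  # [5, 7, 9]
```

This is exactly the reduction needed for the gradient of a broadcast Add/Mul/Sub (item 2): the gradient flowing into the smaller operand must be summed over the broadcast dimensions.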

Reviewed By: Yangqing

Differential Revision: D4711302

fbshipit-source-id: 0865abde871b3046b367599731593dae03f0775a
2017-04-07 15:19:24 -07:00
Kittipat Virochsiri
8482cf9823 TensorVectorSizeOp
Summary: Put the size of the input tensor vector into the output blob

Reviewed By: xianjiec

Differential Revision: D4849556

fbshipit-source-id: 0929319e1705b027874d41a90a9159b335d93545
2017-04-07 14:46:19 -07:00
Aapo Kyrola
23183b9642 memory-saving only_loss argument for SoftmaxWithLoss
Summary: When only_loss=True is enabled, the softmax output buffer is shared with the gradient buffer (which is of the same size). Added tests for this. GPU version only for now.

Reviewed By: salexspb

Differential Revision: D4843991

fbshipit-source-id: 834d2a1b357d784e4d64efe484f893442201ad6a
2017-04-06 13:04:31 -07:00
Jerry Pan
76abd9a8ac Caffe2: consolidate AveragedLoss with SumElementsOp
Summary: Caffe2: consolidate AveragedLoss with SumElementsOp

Differential Revision: D4781561

fbshipit-source-id: 6734adb9dd81d4cad1819a5f8fe736de2477cb72
2017-04-06 10:35:01 -07:00
Aapo Kyrola
cf201ebac8 support axis for cudnn softmax
Summary: Added axis support for the cudnn version of softmax, and added cudnn tests to softmax_ops_test.

Reviewed By: urikz

Differential Revision: D4835409

fbshipit-source-id: 9150b969237e38daebff961fee3c36759f834ac4
2017-04-05 14:06:03 -07:00
James Reed
320b598ff1 Add NanCheckOp, an operator that checks for NaNs and inf's on both the forward and backward pass.
Summary: NanCheck is an in-place operator for GPU that checks the input for any NaN or inf values. The operator fails and prints diagnostic information (input tensor dims and values) if it detects these erroneous values. This should help us to narrow down our numerical instability issues in the NMT models, and it might help others as well.
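The pass-through check can be sketched in plain Python (a hypothetical illustration of the idea; the real op is a GPU kernel and prints tensor dims and values on failure):

```python
import math

def nan_check(tensor_name, values):
    # Pass-through op: return the input unchanged, but fail loudly with
    # diagnostic information if any value is NaN or inf.
    for i, v in enumerate(values):
        if math.isnan(v) or math.isinf(v):
            raise ValueError(f"NanCheck failed: {tensor_name}[{i}] = {v}")
    return values

print(nan_check("x", [1.0, 2.0]))  # [1.0, 2.0]
```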

Differential Revision: D4818141

fbshipit-source-id: e5aa9762089c58ce160270446007c7a91a7a85e5
2017-04-05 13:07:59 -07:00
Aapo Kyrola
ecd3bda44e Fix Softmax for CUDA
Summary:
Following jamesr66a's brilliant observation, this diff fixes the non-CUDNN versions of Softmax. The op did not take into account that blocks can run in parallel and thus could overwrite each other's values, particularly the "row max" that is important for numerical stability.

So in this diff:
1) SoftmaxOp now shares all its code with SoftmaxWithLoss, which had the better implementation.

Also strengthened the test case and renamed the file.
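The role of the per-row max is the standard max-subtraction trick for numerically stable softmax, sketched here in plain Python (an illustration of the math, not the CUDA kernel):

```python
import math

def stable_softmax_row(row):
    # Subtracting the row max before exponentiating keeps every argument
    # to exp() at or below zero, so exp() can never overflow. If parallel
    # blocks clobber this max, the result is silently wrong or inf/NaN.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

# Works even for logits that would overflow a naive exp():
print(stable_softmax_row([1000.0, 1001.0, 1002.0]))
```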

Reviewed By: jamesr66a

Differential Revision: D4832929

fbshipit-source-id: 4a1bfa2106ceb65ec75f5b868323ee1e7a3457fb
2017-04-05 10:07:54 -07:00
Yury Zemlyanskiy
5f263c6175 RecurrentNetwork and variable length links
Summary:
Two new features for RecurrentNetwork:
1. Ability to specify a longer (by a few steps) initial state
2. Ability to link more than one step of an external blob to an internal one.

Some motivation for these changes is provided in the unit test

Reviewed By: salexspb

Differential Revision: D4816230

fbshipit-source-id: 5ae6fed53b3b08a6ce4547ff1d0cb773dab42af0
2017-04-04 19:46:53 -07:00
Jon Morton
0e5b2fd016 Support cropping with negative pad sizes in PadImage
Summary: The PadImage op supports cropping along the H/W dimensions by using negative pads, but currently passing negative values for pad attributes throws an error in ConvPoolOpBase, which PadImage inherits from. Modify ConvPoolOpBase to accept negative pad values for non-conv, non-pool ops. Also add a Python operator test for cropping.
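The pad-as-crop convention can be sketched in one dimension in plain Python (a hypothetical illustration of the semantics; the real op works on NCHW tensors along H/W):

```python
def pad_or_crop_1d(row, pad_left, pad_right, value=0):
    # Positive pads add `value` entries on that side;
    # negative pads crop that many entries from that side.
    if pad_left >= 0:
        row = [value] * pad_left + row
    else:
        row = row[-pad_left:]
    if pad_right >= 0:
        row = row + [value] * pad_right
    else:
        row = row[:pad_right]
    return row

print(pad_or_crop_1d([1, 2, 3, 4], 1, 0))    # [0, 1, 2, 3, 4]
print(pad_or_crop_1d([1, 2, 3, 4], -1, -1))  # [2, 3]
```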

Reviewed By: ajtulloch

Differential Revision: D4817118

fbshipit-source-id: 5ea5203e8072cc34fe14938e534b157d0ad55f6b
2017-04-03 23:47:54 -07:00
Aapo Kyrola
e13e9c1302 cuDNN version of TransposeOp
Summary:
Uses the cudnnTransformTensor function. It works by shuffling the strides according to the transpose axes. Significant speedup over the current GPU version.
Also moves the transpose test under utility_ops, because hypothesis_test is too big.

Reviewed By: jamesr66a

Differential Revision: D4810993

fbshipit-source-id: 82577c4ced1389e70bd5992820ae4d8297a3817f
2017-04-03 13:33:10 -07:00
Pooya Davoodi
a2593ea0c2 Add GatherOp for GPU, and update its tests.
Summary:
This is an initial (read: unoptimized) implementation of GatherOp on GPU.
Closes https://github.com/caffe2/caffe2/pull/209

Differential Revision: D4809676

Pulled By: Yangqing

fbshipit-source-id: bc36fa02e9964370ca845e9cc13344e5f3dbf176
2017-03-31 13:20:09 -07:00
Aapo Kyrola
8421bf7c60 Faster softmaxWithLoss rowMaxKernel
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This speeds things up significantly: in a quick test with lstm_benchmark and nvprof, the total time of RowMaxKernel dropped from 1.2 s to 0.28 s.

Also added SoftmaxWithLoss to the lstm_benchmark.

Reviewed By: jamesr66a

Differential Revision: D4800629

fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
2017-03-30 15:49:46 -07:00
Luke Yeager
d76a814c93 Fixes for ops without a CUDA backend
Summary:
All of these tests fail with some variant of `Cannot create operator of type 'X' on the device 'CUDA'` (see commit messages).
Closes https://github.com/caffe2/caffe2/pull/227

Differential Revision: D4797060

Pulled By: Yangqing

fbshipit-source-id: 5feaa8e949098bfc1254d4c7449a2744e552f925
2017-03-29 14:36:09 -07:00
Aapo Kyrola
1ed746df45 BatchMatMulOp: use cuBLAS batched strided gemm for CUDA
Summary:
Instead of doing gemms in a for-loop (which is not parallelized), it is much better to do the batched matmuls using CUDA 8's new batched strided version of gemm.

With the MT team's test, we get a 5-10% improvement in overall walltime, so it is a significant improvement:

----

Without batched gemm:

I0328 10:46:48.118605 58068 prof_dag_net.cc:136]    424.757 ms/iter (   283.878 ms/iter) RecurrentNetwork
I0328 10:46:48.118609 58068 prof_dag_net.cc:136]    352.603 ms/iter (    265.85 ms/iter) RecurrentNetworkGradient

With batched gemm:
I0328 10:53:48.169996 85617 prof_dag_net.cc:136]    407.438 ms/iter (   269.564 ms/iter) RecurrentNetwork
I0328 10:53:48.169999 85617 prof_dag_net.cc:136]    322.393 ms/iter (   287.625 ms/iter) RecurrentNetworkGradient

Reviewed By: jamesr66a

Differential Revision: D4788272

fbshipit-source-id: 210e8b94c1e036b6ef0f039ce000d455258651f4
2017-03-28 11:54:09 -07:00
Alexander Sidorov
242bff8480 RNN: avoid copy for gradients of inputs to the rnn cell and save more memory!
Summary:
This is pretty tricky to explain, but we can just use backward_links.
This way the whole cell uses a blob from the states_grad tensor instead
of having its own blob. This should also save a bit of memory.

Differential Revision: D4770798

fbshipit-source-id: 673f85b2c2fdf42c47feeaa24d1e2bf086f012f9
2017-03-28 10:02:25 -07:00
Jerry Pan
78f0b35949 Caffe2: CUDA implementation for LeakyReluOp
Summary: Caffe2: CUDA implementation for LeakyReluOp

Reviewed By: asaadaldien

Differential Revision: D4782336

fbshipit-source-id: 402eace695307b62740c918660d9e521217e928a
2017-03-28 08:48:25 -07:00
James Cross
b41449b680 SparseMomentumSGDUpdateOp
Summary: Creates SparseMomentumSGDUpdate, a sparse version of MomentumSGDUpdate, to make that optimization method (via in-place updating operator) compatible with GradientSlices.

Differential Revision: D4784973

fbshipit-source-id: e6330f471a4d5f53589a6ac245e38f256ca7f354
2017-03-28 07:47:46 -07:00
Deepak Gopinath
6aee34b666 Registering GPU version of PackSegments using GPUFallbackOp
Summary: Creating PackSegments and UnpackSegments GPU operators using GPUFallbackOp for now. The op mainly copies blobs, so this is a reasonable solution until we have a CUDA op.

Reviewed By: pietern

Differential Revision: D4761589

fbshipit-source-id: dd483b9e34ecb6b53925405e5b4c24859c549606
2017-03-24 16:01:53 -07:00
Luke Yeager
0ade0578b1 Reset workspace after each test in copy_ops_test
Summary:
This was a nasty one to track down. This was the error message:
```
E0323 14:47:46.138900  2870 context_gpu.h:126] Encountered CUDA error: an illegal memory access was encountered
F0323 14:47:46.139143  2870 operator.h:176] Computation on device returned error in operator
input: "x_gpu_2" output: "loss" name: "" type: "AveragedLoss" device_option { device_type: 1 cuda_gpu_id: 1 }
```
Closes https://github.com/caffe2/caffe2/pull/220

Differential Revision: D4771086

Pulled By: Yangqing

fbshipit-source-id: f2d0f39f1647c84d97d9745f8a0305a389bfbc41
2017-03-24 12:20:34 -07:00
Ahmed Aly
99bfd36a04 CRF layer in caffe2
Summary:
This is implementation of a CRF layer in caffe2 according to this paper: https://arxiv.org/abs/1603.01360
Currently this implementation works only for batch_size = 1

Reference implementations:

- Tensorflow:
 63a21e0540/tensorflow/contrib/crf/python/ops/crf.py

- Theano:
https://github.com/glample/tagger/blob/master/model.py#L286

Differential Revision: D4644004

fbshipit-source-id: bf0801fd8562d11dca3fefe371c3d85e1dd69ccc
2017-03-23 22:02:02 -07:00
Alexander Sidorov
d7b2aebf2c Support for Sum in cell net as first operator
Summary: This didn't work for a reason specified in the comments. Also some cleanup in the unit tests; inference now uses a custom workspace to run the cell net.

Reviewed By: urikz

Differential Revision: D4742670

fbshipit-source-id: 04165c029fddec5ae31b20b207faf06d2fa20816
2017-03-21 18:32:18 -07:00
Ahmed Taei
e41d35909a Conv-ND NCHW CPU/CUDA implementation
Summary: Migrate the caffe1 ConvNd implementation to caffe2.

Reviewed By: Yangqing

Differential Revision: D4659868

fbshipit-source-id: 14b178af3faa2c0b12e5a9f7aa76c1d8945419ea
2017-03-20 14:01:07 -07:00
James Reed
33f41c06c0 Remove more instances of batch_size
Summary: D4734505 part 2. Remove more instances of the batch_size parameter

Reviewed By: urikz

Differential Revision: D4736906

fbshipit-source-id: fc9d374e9308017d61c427890364c5ab9cec2edf
2017-03-19 22:31:30 -07:00
James Reed
17da5856ed Remove batch_size parameter from attention and LSTMWithAttention interfaces
Summary: Reshape based on tensor shapes in the graph rather than based on a passed-in batch_size parameter

Reviewed By: urikz

Differential Revision: D4734505

fbshipit-source-id: d9c23d85be84f61124106e752ef2b4f6945e2a07
2017-03-19 18:16:28 -07:00
Yury Zemlyanskiy
d1424c3265 Revert D4702086: Remove batch_size parameter from attention and LSTMWithAttention interfaces
Summary: This reverts commit c4c1d8425cd36c1e86695918eaba2667c27e9601

Differential Revision: D4702086

fbshipit-source-id: 4620610b182bb84b9297b5de32782761ae89d20b
2017-03-17 17:36:47 -07:00
Alexander Sidorov
f97d7949d0 Remove legacy LSTM, cleanup tests
Summary: we don't use this one any more except a few tests

Reviewed By: urikz

Differential Revision: D4731401

fbshipit-source-id: c5c28b7594e3251f501fc28455dfc9bd2093a836
2017-03-17 16:33:53 -07:00
James Cross
79c3a3af54 add gpu support for caffe2-seq2seq
Summary: Adding synchronous optimization on GPUs to the translation training pipeline, via data_parallel_model.Parallelize_GPU, which needs to be updated so there is some way of performing sparse parameter updates (e.g., on embedding tables), whether on GPU or CPU.

Reviewed By: urikz

Differential Revision: D4631914

fbshipit-source-id: 9cdd655f7dbda3f9b2733d459228b3e097892441
2017-03-17 05:19:14 -07:00
Jon Morton
1513b1de6b Add ResizeNearest operator
Summary: This adds a nearest neighbor interpolation resizing operator to caffe2. CPU only, NCHW only, no gradients. Also adds torch2caffe support. This is probably not optimal in terms of performance, but it works.

Reviewed By: ajtulloch

Differential Revision: D4724244

fbshipit-source-id: b8295061141fb513da84acf91fdfd67264119059
2017-03-16 18:49:01 -07:00
James Reed
cc2e915461 Implement TopK op in caffe2
Reviewed By: salexspb, urikz

Differential Revision: D4718439

fbshipit-source-id: e6866eb7bb586f2716662cd4b65961bdd9914525
2017-03-16 17:32:20 -07:00
James Reed
10d95bd0f0 Remove batch_size parameter from attention and LSTMWithAttention interfaces
Summary: Reshape based on tensor shapes in the graph rather than based on a passed-in batch_size parameter

Reviewed By: urikz

Differential Revision: D4702086

fbshipit-source-id: c4c1d8425cd36c1e86695918eaba2667c27e9601
2017-03-16 11:47:52 -07:00
Luke Yeager
7773a2d643 Bugfix: type not being set when inferring types+shapes
Summary:
/cc akyrola

I basically just copied all the `ShapeCall` stuff as `TypeCall`. Is there a better way?
Closes https://github.com/caffe2/caffe2/pull/187

Differential Revision: D4699312

Pulled By: Yangqing

fbshipit-source-id: 92f736ffe4127b00b5821acb1eb359771975fdd7
2017-03-15 18:48:40 -07:00
Luke Yeager
014d1fe5c4 Allow test discovery in caffe2/python/
Summary:
These are all essentially no-op changes which allow for nose-style (or pytest-style) test discovery.

With this patch, you can use any of these methods to discover and run tests under `caffe2/python`:
```
python -m unittest discover -p '*test*.py' caffe2/python/
python -m nose caffe2/python/
python -m pytest caffe2/python/
```

Future work:

* Get all of the tests to pass
  * Some seem to be testing operations which don't have GPU implementations
  * I get a segfault unless I set `CUDA_VISIBLE_DEVICES=0`
  * Some tests are flaky
* Allow test discovery throughout the whole project (e.g. the `experiments/` dir)
Closes https://github.com/caffe2/caffe2/pull/199

Reviewed By: pietern

Differential Revision: D4704504

Pulled By: Yangqing

fbshipit-source-id: 8f5687ec9c8aa873dfaff30dbf44272bc38a206b
2017-03-14 18:16:41 -07:00
Ahmed Taei
a745981c94 ReduceBack{Sum|Mean}Op CPU & GPU implementation
Summary:
Implement ReduceBackSum & ReduceBackMean with gradients for CPU & GPU contexts.
The reduction happens over the last dimensions; for example, if the input is an
M x N matrix, ReduceBackSum produces an M x 1 vector containing the
row-wise sums.
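The 2-D case described in the summary can be sketched in plain Python (a hypothetical reference for illustration; the real ops handle arbitrary trailing dimensions and gradients):

```python
def reduce_back_sum(A):
    # Sum over the last dimension: an M x N matrix becomes a length-M vector.
    return [sum(row) for row in A]

def reduce_back_mean(A):
    # Same reduction, averaged over the last dimension.
    return [sum(row) / len(row) for row in A]

print(reduce_back_sum([[1, 2, 3], [4, 5, 6]]))   # [6, 15]
print(reduce_back_mean([[1.0, 3.0], [4.0, 6.0]]))  # [2.0, 5.0]
```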

Differential Revision: D4689768

fbshipit-source-id: 5b0482d4341867ecf23526dc6c4d544420e7d8f7
2017-03-13 16:19:58 -07:00