Summary:
Add a pointwise `IsMemberOf` operator to Caffe2.
The original idea was to call it `In`, but I think that is not as clear.
I used `UnaryElementwiseWithArgsOp` at some point, but it made the code a bit harder to read without adding any functionality.
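A minimal numpy sketch of the pointwise membership semantics (the function name and value-list argument here are illustrative, not the op's actual schema):

```python
import numpy as np

def is_member_of(x, values):
    # Pointwise membership test: out[i] is True iff x[i] appears in `values`.
    # A sketch of the semantics only; the real op runs elementwise in C++.
    value_set = set(values)
    out = np.array([v in value_set for v in x.flat], dtype=bool)
    return out.reshape(x.shape)

X = np.array([[0, 3], [5, 7]])
mask = is_member_of(X, [3, 5])
assert mask.tolist() == [[False, True], [True, False]]
```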
Reviewed By: ender-wieczorek
Differential Revision: D4912655
fbshipit-source-id: 716b66bb51468dd59db5f76f23d78cda85961b58
Summary:
Two new operators to pack and unpack a dataset, so that we can
re-use other operators that do not understand the schema format. The immediate
use case is to combine them with a partition operator.
Packing works by splitting the input into separate tensors, putting them in a
vector and wrapping that in a shared_ptr (as opposed to a unique_ptr, so we can
copy it).
Unpack takes the packed input and concatenates it back into the original.
I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating on all the arrays and provides short,
single-purpose functions that (at least for me) are easier to understand.
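The pack/unpack round trip can be sketched as follows; a Python list stands in for the shared_ptr-wrapped vector of tensors, and the per-row splitting granularity is an assumption:

```python
import numpy as np

def pack(tensor):
    # Split the input into per-row tensors held in a plain list (standing in
    # for the shared_ptr<vector<Tensor>>, which is copyable unlike unique_ptr).
    return [row.copy() for row in tensor]

def unpack(packed):
    # Concatenate the packed rows back into the original tensor.
    return np.stack(packed)

data = np.arange(6).reshape(3, 2)
assert (unpack(pack(data)) == data).all()
```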
Reviewed By: dzhulgakov
Differential Revision: D4918002
fbshipit-source-id: ecbf9196ed25e886a94383961176b8c84dde2d2f
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If it is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward-only mode and, instead of creating a workspace for each step, cycles through a single private workspace.
Note: we could avoid a lot of the work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that nicely requires more refactoring than I wanted to do now. Thus we still create the backward step nets etc., but just don't pass them to the op.
This can be used to create more efficient inference models. You can also sanitize existing inference nets by removing the backward_step_net argument to get the same benefits.
Reviewed By: salexspb
Differential Revision: D4916482
fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
Summary:
Rename ModelHelperBase to ModelHelper.
This is the result of running:
find . -type f -exec sed -i 's/ModelHelperBase/ModelHelper/g' {} +
fbgs for ModelHelperBase gave 19 results; there are 20 instances here because I added one test in model_helpers_test.py.
Reviewed By: salexspb
Differential Revision: D4928337
fbshipit-source-id: bc4c12b60b90c167e717de50ea9fe17521e142e3
Summary:
This was getting too messy again, so I am cleaning it up even more. One thing I added here: the input sequence is no longer generated randomly. Ideally we would do this for all other inputs as well; random inputs were reported to be an issue when hypothesis finds bad examples, because it can make the test run very long.
I also tuned the ranges a bit so the test finishes faster. On my devgpu, the whole test took 600 seconds before and now takes 39 seconds.
One more important thing: we want to test all combinations of the settings in the for loop, while hypothesis only provides random tensor inputs.
Differential Revision: D4902956
fbshipit-source-id: ceb02d6761406b3192101d3b255abe90b2866770
Summary:
CUDA version of PRelu and its gradient. The forward pass is straightforward; the backward pass requires a reduction over the weights.
tsaizhenling, please patch this and test.
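The forward/backward behavior can be sketched in numpy (scalar-slope case; function names are illustrative). The `da` line is the reduction over the weight that makes the CUDA backward pass non-trivial:

```python
import numpy as np

def prelu_forward(x, a):
    # y = x for x > 0, a * x otherwise.
    return np.where(x > 0, x, a * x)

def prelu_backward(x, a, dy):
    dx = np.where(x > 0, dy, a * dy)
    # Weight gradient: a reduction over every element where x <= 0.
    da = np.sum(dy * np.where(x > 0, 0.0, x))
    return dx, da

x = np.array([-2.0, -1.0, 3.0])
assert np.allclose(prelu_forward(x, 0.1), [-0.2, -0.1, 3.0])
```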
Differential Revision: D4931630
fbshipit-source-id: 1238e7d536e41480713865ced91aaef88f4feef5
Summary:
Simple FindOp for CPU and GPU, which searches for a list of unordered needles in an unordered index. The CPU version might be faster if we first sorted the index/needles, but we can come back to that later.
The CUDA op is also somewhat brute-force, but quite parallel. Since the index and the queries are smallish, at least in the use case currently in mind (the Machine Translation team's word-candidate search), I think this is a sufficient start.
Note that this is much simpler than the Index class of ops, which also allows modifying the index, etc. Since CUDA ops are more complex to implement for the full Index functionality, I decided to make a separate op with this very simple functionality.
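The brute-force semantics can be sketched as follows (the missing-value convention is an assumption):

```python
import numpy as np

def find(index, needles, missing_value=-1):
    # For each needle, return its position in `index`, or `missing_value`
    # if it is absent. A hash map stands in for the brute-force scan.
    positions = {v: i for i, v in enumerate(index)}
    return np.array([positions.get(n, missing_value) for n in needles])

assert (find([10, 20, 30], [30, 5, 10]) == [2, -1, 0]).all()
```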
Differential Revision: D4910131
fbshipit-source-id: 6df35c9e3c71d5392a500d5b98fd708ab0c8e587
Summary: Work in progress for improving the performance of the TransposeOp on CPU. This is used extensively for inference in several neural MT systems, so optimizing this function is worthwhile and will reduce request latency.
Differential Revision: D4913075
fbshipit-source-id: fa2742829291d91f3eba00fdfe7d6c0dae83e206
Summary: This is needed for the completeness of random negative sampling. When the pool size is 0, we want to generate an empty indices tensor.
Reviewed By: xianjiec
Differential Revision: D4906866
fbshipit-source-id: 75d66a92d15d60bb37bcd1075d324f28069c4fa0
Summary:
Due to the massive dependencies I did not update the version number - under
the same major version number (2017) the API is compatible, so there is no need to
rebuild all the dependencies.
This will unblock the Caffe2 Intel pull request on MKLDNN.
Differential Revision: D4906463
fbshipit-source-id: 0f74436ac3a05605e35b8b649c3e8b5c1c69b500
Summary: unit test using hypothesis for unmask operator
Reviewed By: ender-wieczorek
Differential Revision: D4904075
fbshipit-source-id: 874d3756ec703ab2cc82f24f7160b4254bf791f1
Summary: This will be used to generate random indices input to `Gather`
Reviewed By: xianjiec
Differential Revision: D4904591
fbshipit-source-id: 8d858631e3d640be2cec12f1566cbf195e6aad4b
Summary:
Two new operators to pack and unpack a dataset, so that we can
re-use other operators that do not understand the schema format. The immediate
use case is to combine them with a partition operator.
Packing works by splitting the input into separate tensors, putting them in a
vector and wrapping that in a shared_ptr (as opposed to a unique_ptr, so we can
copy it).
Unpack takes the packed input and concatenates it back into the original.
I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating on all the arrays and provides short,
single-purpose functions that (at least for me) are easier to understand.
Reviewed By: dzhulgakov
Differential Revision: D4870606
fbshipit-source-id: dc29428de5c96cc3039af2885d9e4b026d9f482d
Summary: This is a nicer way to re-use RNN layers for both training and inference.
Reviewed By: salexspb
Differential Revision: D4825894
fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
Summary:
This is from discussion with dzhulgakov : as a step towards revisiting the
core.Net autonaming, we will first guard against accidental overwrites of
existing networks in the workspace.
ajtulloch since we are doing Predictors in mobile, this should be safe right?
azzolini - I assume this would be safe, but would love to get your approval.
akyrola - would this hurt xray?
Reviewed By: dzhulgakov
Differential Revision: D4897725
fbshipit-source-id: aa41271927ad6671f07a53b9505283623f8c49e5
Summary:
Added the possibility to pass 'tiles' and 'axis' as inputs
to the Tile operator, as opposed to arguments. If provided, the input
values override the argument values.
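The override behavior can be sketched as follows (Tile semantics assumed: the whole tensor is repeated `tiles` times along `axis`, and the parameter names here are illustrative):

```python
import numpy as np

def tile(x, tiles_arg=1, axis_arg=0, tiles_input=None, axis_input=None):
    # Input blobs, when present, override the operator arguments.
    tiles = int(tiles_input) if tiles_input is not None else tiles_arg
    axis = int(axis_input) if axis_input is not None else axis_arg
    # Repeat the whole tensor `tiles` times along `axis`.
    return np.concatenate([x] * tiles, axis=axis)

x = np.array([[1, 2]])
# tiles_input=2 wins over tiles_arg=3.
assert (tile(x, tiles_arg=3, tiles_input=2) == [[1, 2], [1, 2]]).all()
```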
Differential Revision: D4794432
fbshipit-source-id: a7e38f4f925a4cedf530924bd426c3bb08b5aad8
Summary:
Implement a new op, ElementwiseLinear.
Given an input X of size (N x D), a of size D and b of size D,
the op computes Y of size (N x D), where Y_{nd} = X_{nd} * a_d + b_d.
Typically this op is followed by the SigmoidCrossEntropyWithLogits op for multi-label classification problems.
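The formula above is plain row-wise broadcasting:

```python
import numpy as np

# Y[n, d] = X[n, d] * a[d] + b[d], via numpy broadcasting over the rows.
def elementwise_linear(X, a, b):
    return X * a + b

X = np.ones((2, 3))
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 0.5, 1.0])
assert np.allclose(elementwise_linear(X, a, b), [[1.0, 2.5, 4.0]] * 2)
```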
Differential Revision: D4892220
fbshipit-source-id: 77bffc5fbe03d48b3d83ab785f7c24a71c952aec
Summary:
This allows us to do in-place relu and also corrects the previous
inconsistency between the cudnn impl and the non-cudnn impl.
This implementation butchers the cudnn interface, in the sense that we pass
in the output instead of the input for the gradient pass. We do have a
gradient checker to guard this situation, so we should be safe.
Reviewed By: asaadaldien
Differential Revision: D4889426
fbshipit-source-id: 081f8fe06de78413b5786086bfd5ae6c8128cd6e
Summary: Add an option to bias the forget gate one way or another by adding in some float value before the sigmoid is applied.
Differential Revision: D4880712
fbshipit-source-id: 1306a97c29fb31630838b2f96597a46e952d940a
Summary:
CopyCPUToGPU and CopyGPUToCPU need to handle gradients that arrive sparse. Added a unit test and fixed the gradient makers to create copies for both values and indices.
This becomes less important once the GPU sparse parameter update ops land, but it is nevertheless good to fix.
Reviewed By: dzhulgakov
Differential Revision: D4882327
fbshipit-source-id: aafd2df46b3e1bcb30b52b1edf40fad8271f1f88
Summary:
These GPU paths are probably even buggier than the CPU paths for sparse gradients with duplicate indices. Both paths cause multiple momentum updates in a single iteration, but only the GPU path is non-deterministic. Depending on how we decide to address the issues on the CPU path, pooyadavoodi has a good idea for how to match dense behavior with the sparse GPU ops.
Closes https://github.com/caffe2/caffe2/pull/254
Reviewed By: bwasti
Differential Revision: D4871680
Pulled By: dzhulgakov
fbshipit-source-id: 220be57a0f699a22ea85ed4f7022d92d362d06b3
Summary: making the name a bit clearer
Reviewed By: xianjiec
Differential Revision: D4866940
fbshipit-source-id: 3e0f7067a9d3ba89cb038d85c1991e541f1e439c
Summary:
A length-aware gather operator. This will be used for random negative sampling; see the task for details.
It should be equivalent to:
LengthsToRange + Gather + Reshape + GatherRanges
which is pretty complicated.
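A numpy sketch of the intended semantics, with input names assumed: ITEMS is the flat item list, LENGTHS gives each row's item count, and INDICES selects which rows' items to concatenate.

```python
import numpy as np

def lengths_gather(items, lengths, indices):
    # Row i owns items[offsets[i] : offsets[i] + lengths[i]];
    # the output concatenates the items of the selected rows in order.
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    return np.concatenate(
        [items[offsets[i]:offsets[i] + lengths[i]] for i in indices])

items = np.array([1, 2, 3, 4, 5, 6])
lengths = np.array([2, 1, 3])  # rows: [1,2], [3], [4,5,6]
assert (lengths_gather(items, lengths, [2, 0]) == [4, 5, 6, 1, 2]).all()
```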
Differential Revision: D4846023
fbshipit-source-id: 8d9b7ff3eddc75a7ab147cd1c2a12f377652df93
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so that they do not need to be stored in the step workspaces. This is done by modifying the backward step to automatically include all operators needed to produce the outputs that are to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace. Making it a class field would not work, since the lifecycle of the blob does not match the lifecycle of the operator.
For basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.
For basic LSTM, the memory saving is wonderful: each forward workspace only holds 4 bytes (for the timestep).
I also modified the neural_mt LSTM cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.
Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.
Reviewed By: urikz
Differential Revision: D4853890
fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
Summary:
Add the necessary ops for feature processing:
* logit op
* replace-NaN op
* batch one-hot op
Reviewed By: kittipatv
Differential Revision: D4840869
fbshipit-source-id: 197123ea5608d54f0b5ac7899973a077a6a86775
Summary:
Added SumSqrElements, since it lets us avoid the large temporary blob that is needed when doing Sqr + SumElements.
Also moved it to reduction_ops, because utility_ops has grown too big.
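The fusion is just the following equivalence; the fused op computes the sum in one pass instead of materializing the squared tensor:

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)
fused = np.sum(x * x)            # what SumSqrElements computes in one pass
two_step = np.sum(np.square(x))  # Sqr + SumElements needs a temp blob
assert np.isclose(fused, two_step)
```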
Reviewed By: jamesr66a
Differential Revision: D4844172
fbshipit-source-id: 032eec45e24d6724f0d5fb83f4ec1c771d1146e5
Summary:
The PiecewiseLinearTransform op passes the transform parameters (bounds, slopes, intercepts) via operator args. This diff adds support for passing these parameters through input blobs instead.
The purpose is to allow us to create a model calibration net that can be exported when saving the model.
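The transform itself can be sketched as follows, regardless of whether bounds/slopes/intercepts arrive as args or as input blobs (exact boundary handling is an assumption):

```python
import numpy as np

def piecewise_linear(x, bounds, slopes, intercepts):
    # For x in [bounds[i], bounds[i+1]): y = slopes[i] * x + intercepts[i].
    # Out-of-range inputs are clamped to the first/last piece.
    i = np.clip(np.searchsorted(bounds, x, side='right') - 1,
                0, len(slopes) - 1)
    return slopes[i] * x + intercepts[i]

bounds = np.array([0.0, 1.0, 2.0])
slopes = np.array([1.0, 2.0])
intercepts = np.array([0.0, -1.0])
assert piecewise_linear(1.5, bounds, slopes, intercepts) == 2.0
```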
Reviewed By: dragonxlwang
Differential Revision: D4777086
fbshipit-source-id: 0d157154860f61ec6ecfab95aea80beed54aa5c6
Summary: This is like LengthsToSegmentIds + Gather without the intermediate segment-IDs blob. I only realized that after I wrote the whole thing. That combination is not obvious, so let's just check this in.
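A numpy sketch of the fused behavior, assuming the combination repeats row i of DATA LENGTHS[i] times (LengthsToSegmentIds would produce the intermediate segment-ID blob, e.g. [0, 0, 1] for lengths [2, 1], which Gather then consumes):

```python
import numpy as np

def lengths_repeat(data, lengths):
    # Row i is emitted lengths[i] times, with no intermediate ID blob.
    return np.repeat(data, lengths, axis=0)

data = np.array([[1, 2], [3, 4]])
assert (lengths_repeat(data, [2, 1]) == [[1, 2], [1, 2], [3, 4]]).all()
```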
Reviewed By: xianjiec
Differential Revision: D4847591
fbshipit-source-id: a1c480f16b317763866af13c83b3aaaeb6a60751
Summary:
1. CPU/GPU implementation of SumReduceLikeOp:
[SRLOp](matrix A, matrix B) -> C
where C has the same shape as B, and each of its values is the reduce-sum of the corresponding A elements.
2. Make SumReduceLikeOp (part of) the gradient of Add/Mul/Sub and provide unit tests.
===Update for Translation Team===
3. Passed Tests:
$ buck test caffe2/caffe2/python/operator_test:recurrent_network_test
$ buck test fblearner/flow/tests/langtech/translation/neural_mt:seq2seq_model_caffe2
$ buck test fblearner/flow/tests/langtech/translation/neural_mt:seq2seq_ensemble_beam_model_caffe2
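The reduction in item 1 can be sketched in numpy (axis handling is an assumption, based on the standard broadcast-gradient reduction pattern):

```python
import numpy as np

def sum_reduce_like(A, B):
    # Reduce-sum A down to the shape of B: the reduction needed when the
    # gradient of a broadcast Add/Mul/Sub must be accumulated back to the
    # smaller operand's shape.
    out = A
    while out.ndim > B.ndim:          # sum away leading axes
        out = out.sum(axis=0)
    for axis, dim in enumerate(B.shape):
        if dim == 1 and out.shape[axis] != 1:  # sum away broadcast axes
            out = out.sum(axis=axis, keepdims=True)
    return out

A = np.ones((4, 3))   # e.g. gradient of a broadcast Add
B = np.zeros(3)       # bias-shaped operand
assert (sum_reduce_like(A, B) == [4.0, 4.0, 4.0]).all()
```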
Reviewed By: Yangqing
Differential Revision: D4711302
fbshipit-source-id: 0865abde871b3046b367599731593dae03f0775a
Summary: Put the size of the input tensor vector into the output blob
Reviewed By: xianjiec
Differential Revision: D4849556
fbshipit-source-id: 0929319e1705b027874d41a90a9159b335d93545
Summary: When only_loss=True is enabled, the softmax output buffer is shared with the gradient buffer (which is of the same size). Added tests for this. GPU version only for now.
Reviewed By: salexspb
Differential Revision: D4843991
fbshipit-source-id: 834d2a1b357d784e4d64efe484f893442201ad6a
Summary: Added support for the axis argument in the cudnn version of softmax, and added cudnn tests to softmax_ops_test.
Reviewed By: urikz
Differential Revision: D4835409
fbshipit-source-id: 9150b969237e38daebff961fee3c36759f834ac4
Summary: NanCheck is an in-place operator for GPU that checks the input for any NaN or inf values. The operator fails and prints diagnostic information (input tensor dims and values) if it detects these erroneous values. This should help us to narrow down our numerical instability issues in the NMT models, and it might help others as well.
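A sketch of the op's contract (pass-through on clean inputs, loud failure with the tensor's dims otherwise; the error format is illustrative):

```python
import numpy as np

def nan_check(x):
    # Identity op that fails loudly if the input contains NaN or inf.
    if not np.isfinite(x).all():
        raise ValueError(
            "NanCheck failed: tensor of shape %s contains NaN/inf" % (x.shape,))
    return x

assert (nan_check(np.array([1.0, 2.0])) == [1.0, 2.0]).all()
```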
Differential Revision: D4818141
fbshipit-source-id: e5aa9762089c58ce160270446007c7a91a7a85e5
Summary:
Following jamesr66a's brilliant observation, this diff fixes the non-CUDNN versions of Softmax. The op did not take into account that blocks can run in parallel, and thus could overwrite each others values, particularly the "row max" that is important for numerical stability
So in this diff:
1) SoftmaxOp now shares all the code with SoftmaxWithLoss, that had better implementation
+ Strengthen the test case and renaming of file.
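For reference, the serial computation the parallel kernels must reproduce, including the per-row max subtraction that the races were corrupting:

```python
import numpy as np

def softmax(x):
    # Subtract each row's max before exponentiating; this is the value a
    # racing block must not clobber, and it is what keeps exp() in range.
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1000.0, 1001.0]])  # would overflow exp() without the shift
assert np.allclose(softmax(x).sum(axis=1), 1.0)
```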
Reviewed By: jamesr66a
Differential Revision: D4832929
fbshipit-source-id: 4a1bfa2106ceb65ec75f5b868323ee1e7a3457fb
Summary:
Two new features for RecurrentNetwork:
1. Ability to specify longer (for a few steps) initial state
2. Ability to link more than one step of external blob to internal one.
Some motivation for these changes is provided in the unit test
Reviewed By: salexspb
Differential Revision: D4816230
fbshipit-source-id: 5ae6fed53b3b08a6ce4547ff1d0cb773dab42af0
Summary: The PadImage op supports cropping along the H/W dimensions by using negative pads; but currently passing negative values for pad attributes throws an error in ConvPoolOpBase, which PadImage inherits from. Modify ConvPoolOpBase to accept negative pad values for non-conv, non-pool ops. Also add a python operator test for cropping
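The negative-pad-as-crop semantics, sketched in 1-D for one spatial axis (the mixed-sign handling is an assumption):

```python
import numpy as np

def pad_or_crop(x, pad_left, pad_right):
    # Non-negative pads behave as usual; a negative pad crops that side.
    if pad_left >= 0 and pad_right >= 0:
        return np.pad(x, (pad_left, pad_right))
    lo = -pad_left if pad_left < 0 else 0
    hi = len(x) + pad_right if pad_right < 0 else len(x)
    out = x[lo:hi]
    # Mixed signs: crop one side, then pad the other.
    return np.pad(out, (max(pad_left, 0), max(pad_right, 0)))

assert (pad_or_crop(np.array([1, 2, 3, 4]), -1, -1) == [2, 3]).all()
```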
Reviewed By: ajtulloch
Differential Revision: D4817118
fbshipit-source-id: 5ea5203e8072cc34fe14938e534b157d0ad55f6b
Summary:
Uses the cudnnTransformTensor function. It works by shuffling the strides according to the transpose axes. Significant speedup over the current GPU version.
Also moves the transpose test under utility_ops, because hypothesis_test is too big.
Reviewed By: jamesr66a
Differential Revision: D4810993
fbshipit-source-id: 82577c4ced1389e70bd5992820ae4d8297a3817f
Summary:
This is an initial (read: unoptimized) implementation of GatherOp on GPU.
Closes https://github.com/caffe2/caffe2/pull/209
Differential Revision: D4809676
Pulled By: Yangqing
fbshipit-source-id: bc36fa02e9964370ca845e9cc13344e5f3dbf176
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This speeds things up significantly: in a quick test with lstm_benchmark and nvprof, the total time of RowMaxKernel dropped from 1.2s to 0.28s.
Also added SoftmaxWithLoss to lstm_benchmark.
Reviewed By: jamesr66a
Differential Revision: D4800629
fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5