pytorch/caffe2/python/operator_test
Aapo Kyrola 1ed746df45 BatchMatMulOp: use cuBLAS batched strided gemm for CUDA
Summary:
Instead of doing gemms in a for-loop (which is not parallelized), it is much better to do the batched matmuls using CUDA 8's new batched-strided version of gemm.
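
To illustrate the idea outside of Caffe2, here is a minimal NumPy sketch (not the operator's actual code) comparing a per-matrix loop with a single batched call. The helper names `batch_matmul_loop` and `batch_matmul_batched` are hypothetical. The sketch assumes contiguous inputs, so each matrix in the batch sits a fixed number of elements after the previous one; that fixed offset is what a strided batched gemm relies on.

```python
import numpy as np


def batch_matmul_loop(a, b):
    """Hypothetical helper: one matmul per batch element, issued sequentially."""
    out = np.empty((a.shape[0], a.shape[1], b.shape[2]), dtype=a.dtype)
    for i in range(a.shape[0]):
        out[i] = a[i] @ b[i]
    return out


def batch_matmul_batched(a, b):
    """Hypothetical helper: a single batched call over the leading dimension."""
    return np.matmul(a, b)


rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3, 5)).astype(np.float32)  # batch of 4 (3x5) matrices
b = rng.standard_normal((4, 5, 2)).astype(np.float32)  # batch of 4 (5x2) matrices

loop_result = batch_matmul_loop(a, b)
batched_result = batch_matmul_batched(a, b)

# Both formulations compute the same products.
assert np.allclose(loop_result, batched_result, rtol=1e-4, atol=1e-5)

# For a contiguous batch, consecutive matrices are a constant number of
# elements apart (rows * cols), which is the "stride" between batch entries.
stride_a = a.strides[0] // a.itemsize
stride_b = b.strides[0] // b.itemsize
print(stride_a, stride_b)
```

On a GPU the benefit comes from replacing many small kernel launches with one call; the NumPy version only demonstrates that the two formulations are numerically equivalent.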

With the MT team's test, we get a 5-10% improvement in overall walltime, so it is a significant improvement:

----

Without batched gemm:

I0328 10:46:48.118605 58068 prof_dag_net.cc:136]    424.757 ms/iter (   283.878 ms/iter) RecurrentNetwork
I0328 10:46:48.118609 58068 prof_dag_net.cc:136]    352.603 ms/iter (    265.85 ms/iter) RecurrentNetworkGradient

With batched gemm:
I0328 10:53:48.169996 85617 prof_dag_net.cc:136]    407.438 ms/iter (   269.564 ms/iter) RecurrentNetwork
I0328 10:53:48.169999 85617 prof_dag_net.cc:136]    322.393 ms/iter (   287.625 ms/iter) RecurrentNetworkGradient
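
As a quick check on the log lines above, the relative change per operator can be computed from the first ms/iter column. This is only a sketch: it assumes the first column is the figure of interest and ignores the parenthesized values, whose meaning the log does not state. The helper `percent_reduction` is a hypothetical name. These are per-operator numbers, so they need not match the overall walltime figure quoted in the summary.

```python
def percent_reduction(before_ms, after_ms):
    """Relative reduction in ms/iter, as a percentage of the 'before' value."""
    return 100.0 * (before_ms - after_ms) / before_ms


# (without batched gemm, with batched gemm), first ms/iter column of the logs.
timings = {
    "RecurrentNetwork": (424.757, 407.438),
    "RecurrentNetworkGradient": (352.603, 322.393),
}

for name, (before, after) in timings.items():
    print(f"{name}: {percent_reduction(before, after):.1f}% lower ms/iter")
```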

Reviewed By: jamesr66a

Differential Revision: D4788272

fbshipit-source-id: 210e8b94c1e036b6ef0f039ce000d455258651f4
2017-03-28 11:54:09 -07:00
..
activation_ops_test.py Caffe2: CUDA implementation for LeakyReluOp 2017-03-28 08:48:25 -07:00
atomic_ops_test.py
checkpoint_test.py
conv_test.py Conv-ND NCHW CPU/CUDA implementation 2017-03-20 14:01:07 -07:00
conv_transpose_test.py
copy_ops_test.py Reset workspace after each test in copy_ops_test 2017-03-24 12:20:34 -07:00
cosine_embedding_criterion_op_test.py
counter_ops_test.py AtomicCounter to return previous value on Reset. 2017-02-02 14:59:30 -08:00
crf_test.py CRF layer in caffe2 2017-03-23 22:02:02 -07:00
cross_entropy_ops_test.py delete redundant comment lines. 2017-02-24 11:04:36 -08:00
dataset_ops_test.py NextScopedBlob with well-defined behavior and respect namescope 2017-02-16 17:16:36 -08:00
duplicate_operands_test.py
elementwise_op_broadcast_test.py
elementwise_ops_test.py Sqr op and gradient 2017-03-07 03:03:07 -08:00
emptysample_ops_test.py
extend_tensor_op_test.py
fc_operator_test.py Test for FC operator + fix for docs 2017-01-27 10:44:24 -08:00
filler_ops_test.py add exception for empty shape param 2017-03-10 00:33:59 -08:00
gather_ops_test.py
gather_ranges_op_test.py
given_tensor_fill_op_test.py support fill bool tensors in GivenTensorFill 2017-03-02 20:18:59 -08:00
group_conv_test.py Make all convolution operators allow optional bias term 2016-12-21 15:14:24 -08:00
hsm_test.py Generate huffman tree 2017-01-19 16:14:23 -08:00
index_ops_test.py Change the schema of IndexLoad & IndexFreeze so that state change is captured by the framework 2017-02-14 10:05:12 -08:00
instance_norm_test.py instance norm test fix 2017-02-25 14:31:42 -08:00
margin_ranking_criterion_op_test.py
matmul_op_test.py BatchMatMulOp: use cuBLAS batched strided gemm for CUDA 2017-03-28 11:54:09 -07:00
mkl_conv_op_test.py MKL convolution operator 2017-01-23 09:59:30 -08:00
mkl_packed_fc_op_test.py MKL convolution operator 2017-01-23 09:59:30 -08:00
mkl_speed_test.py MKL convolution operator 2017-01-23 09:59:30 -08:00
momentum_sgd_test.py SparseMomentumSGDUpdateOp 2017-03-28 07:47:46 -07:00
mpi_test.py
one_hot_ops_test.py
pack_ops_test.py Registering GPU version of PackSegments using GPUFallbackOp 2017-03-24 16:01:53 -07:00
partition_ops_test.py
piecewise_linear_transform_test.py PiecewiseLinearTransformOp transform binary predictions specially 2017-02-15 16:00:44 -08:00
pooling_test.py Unit test for big batch size avg pooling 2017-01-18 19:29:20 -08:00
pow_op_test.py CUDA version of elementwise power + rename to Pow + gradient 2017-03-07 10:20:40 -08:00
python_op_test.py
rank_loss_operator_test.py Normalize rank loss gradient to avoid convergence issues when the number of pairs is really large 2016-12-21 17:29:24 -08:00
record_queue_test.py
recurrent_network_test.py RNN: avoid copy for gradients of inputs to the rnn cell and save more memory! 2017-03-28 10:02:25 -07:00
reduce_ops_test.py ReduceBack{Sum|Mean}Op CPU & GPU implementation 2017-03-13 16:19:58 -07:00
relu_op_test.py
reshape_ops_test.py Allow test discovery in caffe2/python/ 2017-03-14 18:16:41 -07:00
resize_op_test.py Add ResizeNearest operator 2017-03-16 18:49:01 -07:00
segment_ops_test.py Allow test discovery in caffe2/python/ 2017-03-14 18:16:41 -07:00
sequence_ops_test.py add gpu support for caffe2-seq2seq 2017-03-17 05:19:14 -07:00
shape_inference_test.py Bugfix: type not being set when inferring types+shapes 2017-03-15 18:48:40 -07:00
softmax_ops_test.py add soft label functionality to softmax with loss op 2017-02-10 09:01:53 -08:00
sparse_gradient_checker_test.py
sparse_ops_test.py
spatial_bn_op_test.py
square_root_divide_op_test.py
stats_ops_test.py Performance counters 2017-02-21 16:31:24 -08:00
string_ops_test.py
text_file_reader_test.py
tile_op_test.py Caffe2: Tile operator 2017-02-28 23:17:26 -08:00
top_k_test.py Implement TopK op in caffe2 2017-03-16 17:32:20 -07:00
unique_uniform_fill_op_test.py UniqueUniformFillOp 2017-02-15 16:00:44 -08:00
utility_ops_test.py Add gradient operator for SumElements 2017-03-07 20:03:07 -08:00