Mirror of https://github.com/saymrwulf/pytorch.git, synced 2026-05-15 21:00:47 +00:00
Summary: Instead of doing gemms in a for-loop (which is not parallelized), it is much better to do the batched matmuls using CUDA 8's new strided batched version of gemm. With the MT team's test, we get a 5-10% improvement in overall walltime, so it is a significant improvement:

Without batched gemm:

I0328 10:46:48.118605 58068 prof_dag_net.cc:136] 424.757 ms/iter ( 283.878 ms/iter) RecurrentNetwork
I0328 10:46:48.118609 58068 prof_dag_net.cc:136] 352.603 ms/iter ( 265.85 ms/iter) RecurrentNetworkGradient

With batched gemm:

I0328 10:53:48.169996 85617 prof_dag_net.cc:136] 407.438 ms/iter ( 269.564 ms/iter) RecurrentNetwork
I0328 10:53:48.169999 85617 prof_dag_net.cc:136] 322.393 ms/iter ( 287.625 ms/iter) RecurrentNetworkGradient

Reviewed By: jamesr66a
Differential Revision: D4788272
fbshipit-source-id: 210e8b94c1e036b6ef0f039ce000d455258651f4
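The idea behind the change can be sketched in NumPy: a batch of independent GEMMs done one at a time in a Python loop is mathematically identical to a single batched matmul over the leading dimension, which is what a strided batched gemm (e.g. cuBLAS's `cublasSgemmStridedBatched`) computes in one call on the GPU. The shapes below are arbitrary illustration values, not taken from the commit:

```python
import numpy as np

# Hypothetical batch of independent (M x K) @ (K x N) products, like the
# per-step projections inside a recurrent network. Shapes are illustrative.
B, M, K, N = 8, 4, 5, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((B, M, K))
W = rng.standard_normal((B, K, N))

# Un-batched version: one gemm per batch element in a serial for-loop.
loop_out = np.stack([A[i] @ W[i] for i in range(B)])

# Batched version: one matmul broadcast over the leading (batch) dimension,
# analogous to a single strided batched gemm call on the device.
batched_out = np.matmul(A, W)

# Both produce the same (B, M, N) result; only the dispatch differs.
assert np.allclose(loop_out, batched_out)
```

The win on the GPU comes purely from dispatch: one kernel launch covering all B products instead of B sequential launches, letting small matrices fill the device in parallel.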
Files in this directory:

- activation_ops_test.py
- atomic_ops_test.py
- checkpoint_test.py
- conv_test.py
- conv_transpose_test.py
- copy_ops_test.py
- cosine_embedding_criterion_op_test.py
- counter_ops_test.py
- crf_test.py
- cross_entropy_ops_test.py
- dataset_ops_test.py
- duplicate_operands_test.py
- elementwise_op_broadcast_test.py
- elementwise_ops_test.py
- emptysample_ops_test.py
- extend_tensor_op_test.py
- fc_operator_test.py
- filler_ops_test.py
- gather_ops_test.py
- gather_ranges_op_test.py
- given_tensor_fill_op_test.py
- group_conv_test.py
- hsm_test.py
- index_ops_test.py
- instance_norm_test.py
- margin_ranking_criterion_op_test.py
- matmul_op_test.py
- mkl_conv_op_test.py
- mkl_packed_fc_op_test.py
- mkl_speed_test.py
- momentum_sgd_test.py
- mpi_test.py
- one_hot_ops_test.py
- pack_ops_test.py
- partition_ops_test.py
- piecewise_linear_transform_test.py
- pooling_test.py
- pow_op_test.py
- python_op_test.py
- rank_loss_operator_test.py
- record_queue_test.py
- recurrent_network_test.py
- reduce_ops_test.py
- relu_op_test.py
- reshape_ops_test.py
- resize_op_test.py
- segment_ops_test.py
- sequence_ops_test.py
- shape_inference_test.py
- softmax_ops_test.py
- sparse_gradient_checker_test.py
- sparse_ops_test.py
- spatial_bn_op_test.py
- square_root_divide_op_test.py
- stats_ops_test.py
- string_ops_test.py
- text_file_reader_test.py
- tile_op_test.py
- top_k_test.py
- unique_uniform_fill_op_test.py
- utility_ops_test.py