pytorch/caffe2/core
Aapo Kyrola 631971e459 threaded RNN executor for CPU, multi-stream executor CUDA
Summary:
Special executor for RNNs which can exploit parallelism over timesteps. For CPU we use multi-threading, achieving roughly a 3x improvement on 4-layer LSTMs.
With CUDA, perf improvements are more modest, but the structure allows for optimizing it further. For CUDA, we use multiple streams and events if there is parallelism
over timesteps. In my experiments, using more than 2 streams did not help, though.

Flag --caffe2_rnn_executor can be used to switch the executor off.
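
To illustrate the kind of parallelism the summary refers to, here is a minimal sketch, not the executor added in this commit. It assumes a stacked RNN in which the cell at (layer l, timestep t) depends only on (l-1, t) and (l, t-1), so all cells with the same l + t can run concurrently. CellStep, RunSequential and RunWavefront are hypothetical names for illustration, and one thread per cell is used only for brevity; a real executor would more likely reuse a fixed set of worker threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using Grid = std::vector<std::vector<std::uint64_t>>;

// Hypothetical stand-in for one RNN cell step: mixes the output of the layer
// below (same timestep) with this layer's previous timestep. Unsigned
// arithmetic is used so overflow wraps instead of being undefined.
inline std::uint64_t CellStep(std::uint64_t below, std::uint64_t prev) {
  return below * 31u + prev * 17u + 1u;
}

// Sequential reference: visit cells layer by layer, timestep by timestep.
Grid RunSequential(const std::vector<std::uint64_t>& input, std::size_t layers) {
  const std::size_t T = input.size();
  Grid out(layers, std::vector<std::uint64_t>(T, 0));
  for (std::size_t l = 0; l < layers; ++l) {
    for (std::size_t t = 0; t < T; ++t) {
      const std::uint64_t below = (l == 0) ? input[t] : out[l - 1][t];
      const std::uint64_t prev = (t == 0) ? 0 : out[l][t - 1];
      out[l][t] = CellStep(below, prev);
    }
  }
  return out;
}

// Wavefront version: cells on the same anti-diagonal (l + t == d) have no
// dependencies on each other, so each diagonal is run in parallel and joined
// before the next diagonal starts.
Grid RunWavefront(const std::vector<std::uint64_t>& input, std::size_t layers) {
  const std::size_t T = input.size();
  Grid out(layers, std::vector<std::uint64_t>(T, 0));
  if (layers == 0 || T == 0) return out;
  for (std::size_t d = 0; d + 1 < layers + T; ++d) {
    const std::size_t l_begin = (d >= T) ? d - T + 1 : 0;
    const std::size_t l_end = std::min(d, layers - 1);
    std::vector<std::thread> workers;
    for (std::size_t l = l_begin; l <= l_end; ++l) {
      const std::size_t t = d - l;
      workers.emplace_back([&out, &input, l, t] {
        assert(l < out.size() && t < out[l].size());
        // Both inputs were written on earlier diagonals, which are joined.
        const std::uint64_t below = (l == 0) ? input[t] : out[l - 1][t];
        const std::uint64_t prev = (t == 0) ? 0 : out[l][t - 1];
        out[l][t] = CellStep(below, prev);  // each thread writes its own cell
      });
    }
    for (auto& w : workers) w.join();
  }
  return out;
}
```

In this sketch a 4-layer network has at most 4 cells per diagonal, so the available parallelism is bounded by the number of layers.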

Reviewed By: salexspb

Differential Revision: D5749304

fbshipit-source-id: d6f76b3e16598be5b4e8188aff031671ebafaa4c
2017-09-06 12:26:30 -07:00
allocator.cc move memory allocators to allocator.{h,cc} 2017-08-16 01:35:20 -07:00
allocator.h move memory allocators to allocator.{h,cc} 2017-08-16 01:35:20 -07:00
asan.h
blob.h Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
blob_gpu_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
blob_serialization.cc comment out unused parameters 2017-07-21 15:14:43 -07:00
blob_serialization.h comment out unused parameters 2017-07-21 15:14:43 -07:00
blob_serialization_gpu.cc
blob_serializer_base.h comment out unused parameters 2017-07-21 15:14:43 -07:00
blob_stats.cc Allow to query the blob size in bytes for perf stats 2017-03-22 18:09:55 -07:00
blob_stats.h Allow to query the blob size in bytes for perf stats 2017-03-22 18:09:55 -07:00
blob_test.cc Early design for a general Event abstraction cross-devices. 2017-08-18 15:46:51 -07:00
CMakeLists.txt
common.cc Allow caffe2 to detect if cuda lib has been linked, and also fix oss build error. 2017-08-23 18:41:15 -07:00
common.h Update the speed benchmark code 2017-09-01 23:16:39 -07:00
common_cudnn.cc
common_cudnn.h set stream for cudnn handle correctly in cudnn wrapper 2017-09-01 18:07:07 -07:00
common_gpu.cc Do CaffeCudaSetDevice and CaffeCudaGetDevice 2017-08-25 18:20:14 -07:00
common_gpu.h Do CaffeCudaSetDevice and CaffeCudaGetDevice 2017-08-25 18:20:14 -07:00
common_omp.h Remove openmp parallel for in caffe2 2017-02-16 22:05:10 -08:00
context.h Early design for a general Event abstraction cross-devices. 2017-08-18 15:46:51 -07:00
context_gpu.cu better default settings for CUB 2017-08-29 19:11:08 -07:00
context_gpu.h Do CaffeCudaSetDevice and CaffeCudaGetDevice 2017-08-25 18:20:14 -07:00
context_gpu_test.cc Do CaffeCudaSetDevice and CaffeCudaGetDevice 2017-08-25 18:20:14 -07:00
context_test.cc Change Allocator interface to return deleter 2017-07-17 15:26:27 -07:00
db.cc comment out unused parameters 2017-07-21 15:14:43 -07:00
db.h factored out DBExists function 2017-07-24 11:21:27 -07:00
event.cc Add event as a first-class citizen of the OperatorBase interface. 2017-08-21 13:30:53 -07:00
event.h Add event as a first-class citizen of the OperatorBase interface. 2017-08-21 13:30:53 -07:00
event_gpu.cc Do CaffeCudaSetDevice and CaffeCudaGetDevice 2017-08-25 18:20:14 -07:00
event_gpu_test.cc Add event as a first-class citizen of the OperatorBase interface. 2017-08-21 13:30:53 -07:00
event_test.cc Early design for a general Event abstraction cross-devices. 2017-08-18 15:46:51 -07:00
flags.cc Update the speed benchmark code 2017-09-01 23:16:39 -07:00
flags.h fixed gflags 2.2.0 error and image_input_op.h 2017-05-24 10:09:17 -07:00
graph.cc Common Subexpression Elimination 2017-08-18 16:31:48 -07:00
graph.h Common Subexpression Elimination 2017-08-18 16:31:48 -07:00
graph_test.cc Fix travis tests, by splitting DummyOp to GraphDummyOp and TransformDummyOp 2017-08-18 11:17:28 -07:00
init.cc Add a guard function to check Caffe2 linking setup. 2017-04-21 03:38:37 -07:00
init.h
init_omp.cc Fix more MKL build issues 2017-08-25 14:01:01 -07:00
init_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
logging.cc Clean up binary build cmake script 2017-05-24 13:47:26 -07:00
logging.h comment out unused parameters 2017-07-21 15:14:43 -07:00
logging_is_google_glog.h remove unused parameters in logging_is_google_glog.h and operator.h 2017-07-13 16:24:15 -07:00
logging_is_not_google_glog.h Re-apply windows diff D4657831 2017-03-07 11:02:12 -08:00
logging_test.cc CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera 2017-07-16 21:40:12 -07:00
macros.h Make extension loader properly handle visibility. 2017-03-30 14:38:38 -07:00
macros.h.in cmake: generate macros.h with configure_file() 2017-08-22 14:22:36 -07:00
memonger.cc fix memonger for RecurrentNetworks 2017-08-25 16:01:25 -07:00
memonger.h Rewrite memonger DAG in C++. 2017-08-16 16:17:15 -07:00
net.cc code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
net.h code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
net_async_dag_gpu.cc code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
net_async_dag_gpu.h code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
net_dag.cc code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
net_dag.h code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
net_simple.cc added gflop annotation to TEST_benchmark 2017-08-31 14:18:20 -07:00
net_simple.h code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
net_singlethread_async_gpu.cc code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
net_test.cc Allow caffe2 to detect if cuda lib has been linked, and also fix oss build error. 2017-08-23 18:41:15 -07:00
observer.h Detailed per-operator tracking for all nets 2017-07-16 14:48:09 -07:00
observer_test.cc code cleanup: separate the several net implementations to separate files. 2017-08-21 22:07:48 -07:00
operator.cc added gflop annotation to TEST_benchmark 2017-08-31 14:18:20 -07:00
operator.h threaded RNN executor for CPU, multi-stream executor CUDA 2017-09-06 12:26:30 -07:00
operator_gpu_test.cc Do not run operator gpu tests if there is not gpu 2017-08-23 11:32:41 -07:00
operator_gradient.h improve error for non-existing/vs. sparse or dense gradient 2017-07-24 08:56:02 -07:00
operator_schema.cc added gflop annotation to TEST_benchmark 2017-08-31 14:18:20 -07:00
operator_schema.h added gflop annotation to TEST_benchmark 2017-08-31 14:18:20 -07:00
operator_schema_test.cc Strip Operator Schema in mobile build 2017-08-22 13:31:08 -07:00
operator_test.cc Allow caffe2 to detect if cuda lib has been linked, and also fix oss build error. 2017-08-23 18:41:15 -07:00
parallel_net_test.cc Add linter for enforcing caffe operator documentation 2017-07-24 15:27:47 -07:00
plan_executor.cc Tuning number of parameter servers based on performance estimation job 2017-08-30 18:03:59 -07:00
plan_executor.h Tuning number of parameter servers based on performance estimation job 2017-08-30 18:03:59 -07:00
predictor.cc Sync mobile codebase changes back to fbcode 2017-07-18 17:54:41 -07:00
predictor.h Sync mobile codebase changes back to fbcode 2017-07-18 17:54:41 -07:00
predictor_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
qtensor.cc move qtensor to open source 2017-03-08 18:02:39 -08:00
qtensor.h Change Allocator interface to return deleter 2017-07-17 15:26:27 -07:00
qtensor_serialization.cc QTensor serialization/deserialization 2017-03-09 00:01:12 -08:00
qtensor_serialization.h QTensor serialization/deserialization 2017-03-09 00:01:12 -08:00
registry.h lengths_reducer_ops refactoring. 2017-07-25 12:08:26 -07:00
registry_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
scope_guard.h
static_tracepoint.h Add USDT for operator execution 2017-02-06 08:44:42 -08:00
static_tracepoint_elfx86.h Add USDT for operator execution 2017-02-06 08:44:42 -08:00
stats.cc Average and time spent counters 2017-03-24 13:34:27 -07:00
stats.h Average and time spent counters 2017-03-24 13:34:27 -07:00
stats_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
tensor.cc allow querying tensor device + tool to validate that all ops have tensors from correct devices (GPUs) 2017-07-01 09:16:37 -07:00
tensor.h added gflop annotation to TEST_benchmark 2017-08-31 14:18:20 -07:00
timer.h
timer_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
transform.cc ApplyTransformIfFaster 2017-08-17 15:36:51 -07:00
transform.h ApplyTransformIfFaster 2017-08-17 15:36:51 -07:00
transform_test.cc Fix travis tests, by splitting DummyOp to GraphDummyOp and TransformDummyOp 2017-08-18 11:17:28 -07:00
typeid.cc Halfway into windows port 2017-02-13 09:46:18 -08:00
typeid.h Add CUDA implementation of BooleanUnmask and fixed some bugs in the test 2017-08-01 16:51:40 -07:00
typeid_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
types.cc Add CUDA implementation of BooleanUnmask and fixed some bugs in the test 2017-08-01 16:51:40 -07:00
types.h Halfway into windows port 2017-02-13 09:46:18 -08:00
workspace.cc Forward blobs into workspace 2017-08-22 18:45:56 -07:00
workspace.h Forward blobs into workspace 2017-08-22 18:45:56 -07:00
workspace_test.cc fix windows build 2017-07-26 03:50:20 -07:00