pytorch/caffe2/utils
Aapo Kyrola 631971e459 threaded RNN executor for CPU, multi-stream executor CUDA
Summary:
Special executor for RNNs that can exploit parallelism over timesteps. For CPU we use multi-threading, achieving roughly a 3x improvement on 4-layer LSTMs.
With CUDA, perf improvements are more modest, but the structure allows for optimizing it further. For CUDA, we use multiple streams and events if there is parallelism
over timesteps. In my experiments, using more than 2 streams was not beneficial, though.

Flag --caffe2_rnn_executor can be used to switch the executor off.
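
To illustrate the kind of parallelism over timesteps described in the summary, here is a minimal sketch of wavefront scheduling over a stacked RNN. It is not the executor from this commit: it assumes that cell (layer, t) depends only on cell (layer - 1, t) and cell (layer, t - 1), and the names Grid, RunCell and RunWavefront are hypothetical. Cells on the same anti-diagonal (layer + t constant) have no dependencies on each other, so each diagonal can run on separate threads.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical grid of RNN cells. depth[layer * timesteps + t] stores the
// length of the dependency chain ending at that cell; 0 means "not run yet".
struct Grid {
  int layers;
  int timesteps;
  std::vector<int> depth;

  Grid(int l, int t) : layers(l), timesteps(t), depth(l * t, 0) {}
  int& at(int layer, int t) { return depth[layer * timesteps + t]; }
};

// Stand-in for one cell step. A real cell would run the LSTM operators;
// this one only checks that its inputs are ready and records its depth.
void RunCell(Grid& g, int layer, int t) {
  int below = 0;
  int prev = 0;
  if (layer > 0) {
    below = g.at(layer - 1, t);
    assert(below > 0);  // same timestep, layer below, must be done
  }
  if (t > 0) {
    prev = g.at(layer, t - 1);
    assert(prev > 0);  // same layer, previous timestep, must be done
  }
  g.at(layer, t) = std::max(below, prev) + 1;
}

// Runs the grid one anti-diagonal at a time, one thread per cell, and
// returns the number of waves. Joining the threads before the next wave
// guarantees that every dependency has been written.
int RunWavefront(Grid& g) {
  int waves = 0;
  for (int d = 0; d <= g.layers + g.timesteps - 2; ++d) {
    const int first = std::max(0, d - (g.timesteps - 1));
    const int last = std::min(d, g.layers - 1);
    std::vector<std::thread> workers;
    for (int layer = first; layer <= last; ++layer) {
      workers.emplace_back(RunCell, std::ref(g), layer, d - layer);
    }
    for (auto& w : workers) {
      w.join();
    }
    ++waves;
  }
  return waves;
}
```

Under these assumptions, a grid with L layers and T timesteps finishes in L + T - 1 waves instead of L * T sequential cell steps, which is why deeper stacks leave more room for speedup. Spawning a thread per cell is only for brevity; a thread pool would be the more realistic choice.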

Reviewed By: salexspb

Differential Revision: D5749304

fbshipit-source-id: d6f76b3e16598be5b4e8188aff031671ebafaa4c
2017-09-06 12:26:30 -07:00
threadpool Use std::{thread,mutex,condition_variable} instead of raw pthreads in WorkersPool 2017-09-05 12:33:13 -07:00
cast.h Support fp16 output from ImageInputOp 2017-04-28 14:50:47 -07:00
cblas.h Fix more MKL build issues 2017-08-25 14:01:01 -07:00
CMakeLists.txt Add stack traces on fatal signals 2017-05-22 10:34:32 -07:00
conversions.h Disable -Wstrict-aliasing when including cuda_fp16.h 2017-08-17 14:15:32 -07:00
cpu_neon.h
cpuid.cc Move cpuid ctor to .cc 2017-07-26 23:37:14 -07:00
cpuid.h Move cpuid ctor to .cc 2017-07-26 23:37:14 -07:00
cpuid_test.cc Add proper cpuid support. 2017-07-23 17:21:50 -07:00
fatal_signal_asan_no_sig_test.cc Disable stacktrace on fatal signal by default 2017-05-31 12:54:04 -07:00
fixed_divisor.h
fixed_divisor_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
GpuBitonicSort.cuh Implement TopKOp for GPU 2017-06-17 08:47:38 -07:00
GpuDefs.cuh CUDA 9 support 2017-08-06 11:50:17 -07:00
GpuScanUtils.cuh CUDA 9 support 2017-08-06 11:50:17 -07:00
math-detail.h comment out unused parameters 2017-07-21 15:14:43 -07:00
math.h New math.h functions required by YellowFin 2017-08-25 18:09:34 -07:00
math_cpu.cc EnsureDense/SparseToDense for CUDA 2017-09-01 09:33:05 -07:00
math_gpu.cu EnsureDense/SparseToDense for CUDA 2017-09-01 09:33:05 -07:00
math_gpu_test.cc New math.h functions required by YellowFin 2017-08-25 18:09:34 -07:00
math_test.cc fix bad conversion to float in cpu_half2float 2017-05-17 15:57:42 -07:00
murmur_hash3.cc Fix race in FileStoreHandler 2017-02-03 09:59:45 -08:00
murmur_hash3.h Fix race in FileStoreHandler 2017-02-03 09:59:45 -08:00
proto_utils.cc threaded RNN executor for CPU, multi-stream executor CUDA 2017-09-06 12:26:30 -07:00
proto_utils.h threaded RNN executor for CPU, multi-stream executor CUDA 2017-09-06 12:26:30 -07:00
proto_utils_test.cc codemod: use <> includes for gtest headers 2017-03-28 00:50:54 -07:00
signal_handler.cc Sync mobile codebase changes back to fbcode 2017-07-18 17:54:41 -07:00
signal_handler.h Disable stacktrace on fatal signal by default 2017-05-31 12:54:04 -07:00
simple_queue.h comment out unused parameters 2017-07-21 15:14:43 -07:00
simple_queue_test.cc Early design for a general Event abstraction cross-devices. 2017-08-18 15:46:51 -07:00
smart_tensor_printer.cc guard against apple platforms 2017-04-24 21:19:30 -07:00
smart_tensor_printer.h caffe2: smart_tensor_printer 2017-04-24 15:52:26 -07:00
smart_tensor_printer_test.cc Disable smart_tensor_printer_test on OSX 2017-06-16 05:50:46 -07:00
string_utils.cc Adding changes that enable MSVC build 2017-03-01 16:47:58 -08:00
string_utils.h Added editDistance helper to caffe2 operators 2017-02-28 13:31:56 -08:00
thread_pool.h fix thread_pool.h 2017-08-08 10:32:08 -07:00
zmq_helper.h