pytorch/test
Richard Zou f1420adfe3 Move at::chunk into the graph fuser (#10178)
Summary:
... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026

This is done through the following:

1) Absorb starting chunks into FusionGroup as a part of the graph fuser
pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes if an input (of the original graph) will be chunked.
3) When launching a kernel, use the `std::vector<ConcatDesc>` to chunk an
input tensor on the CPU. This chunk directly takes an at::Tensor and creates
four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors.
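The idea behind step 3 can be sketched in plain Python (a minimal stand-in; `chunk_descs` and the (offset, length) tuples are illustrative, not the actual ConcatDesc/TensorInfo types): instead of materializing intermediate tensors, precompute lightweight descriptors of each chunk and hand the kernel views into the original buffer.

```python
def chunk_descs(numel, chunks):
    """Describe an even chunking of a flat buffer as (offset, length)
    pairs -- loosely analogous to emitting TensorInfo structs straight
    into the kernel argument list instead of allocating intermediate
    tensor objects."""
    assert numel % chunks == 0, "sketch assumes an evenly divisible size"
    size = numel // chunks
    return [(i * size, size) for i in range(chunks)]

# A 4-way chunk of a 16-element buffer costs four small tuples,
# not four freshly initialized tensors.
print(chunk_descs(16, 4))  # [(0, 4), (4, 4), (8, 4), (12, 4)]
```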

- Expect test and correctness test checking that a single chunk is fused
  by the graph fuser
- Correctness test for a variety of chunks (dimension = beginning,
  middle, end) and tensors (contiguous, non-contiguous, edge case of
  splitSize = 1) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel and
  correctness test.
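The single-chunk pattern these tests target can be mimicked without PyTorch (a plain-Python stand-in, not the real test code): a chunk whose outputs feed only pointwise ops, which is exactly the shape the fuser can absorb into one kernel.

```python
def chunk2_pointwise(x):
    # Stand-in for `a, b = x.chunk(2); a * b + b` on a flat list:
    # the chunk feeds only pointwise ops, so the graph fuser can
    # pull it into the surrounding FusionGroup.
    half = len(x) // 2
    a, b = x[:half], x[half:2 * half]
    return [ai * bi + bi for ai, bi in zip(a, b)]

print(chunk2_pointwise([1, 2, 3, 4]))  # [6, 12]
```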

cc zdevito apaszke

Benchmark: LSTM forward pass, 1 layer, hidden size and input size 512, sequence length 100, requires_grad=False on all inputs and weights (lower is better).

After changes:
```
thnn    cudnn   jit
8.8468  6.5797  9.3470
```

Before changes:
```
thnn    cudnn   jit
9.9221  6.6539  11.2550
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10178

Differential Revision: D9382661

Pulled By: zou3519

fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
2018-08-18 16:10:11 -07:00
bottleneck
cpp/api Make torch::Tensor -> at::Tensor (#10516) 2018-08-15 21:25:12 -07:00
cpp_extensions Creates CUDAContext (#9435) 2018-07-20 12:56:15 -07:00
custom_operator Build mechanism for custom operators (#10226) 2018-08-16 18:56:17 -07:00
data
error_messages
expect Move at::chunk into the graph fuser (#10178) 2018-08-18 16:10:11 -07:00
ffi/src
onnx ATen layer norm symbolic (#10513) 2018-08-15 08:28:52 -07:00
optim
common.py enable unit tests and other changes (#10266) 2018-08-06 14:54:01 -07:00
common_cuda.py
common_nn.py Add CELU activation to pytorch (#8551) 2018-08-01 07:54:44 -07:00
run_test.py improve use of ROCm libraries, enable more tests, small fixes (#10406) 2018-08-13 11:39:43 -07:00
test_autograd.py improve use of ROCm libraries, enable more tests, small fixes (#10406) 2018-08-13 11:39:43 -07:00
test_c10d.py fixed c10d test (#10557) 2018-08-15 17:22:38 -07:00
test_cpp_extensions.py
test_cuda.py Fix corner case with torch.multinomial (#9960) 2018-08-15 13:25:39 -07:00
test_dataloader.py improve use of ROCm libraries, enable more tests, small fixes (#10406) 2018-08-13 11:39:43 -07:00
test_distributed.py Fix Python lint errors. (#10441) 2018-08-11 21:08:50 -07:00
test_distributed_trap.py
test_distributions.py remove implicit conversion from gpu to cpu (#10553) 2018-08-16 12:10:39 -07:00
test_indexing.py Re-enable empty n-dimensional empty tensor and fix parallel CPU on empty tensors (#10077) 2018-07-31 16:43:45 -07:00
test_jit.py Move at::chunk into the graph fuser (#10178) 2018-08-18 16:10:11 -07:00
test_legacy_nn.py Add CTC loss (#9628) 2018-07-31 11:09:48 -07:00
test_multiprocessing.py Correctly share CUDA Parameters. (#10220) 2018-08-10 13:54:56 -07:00
test_nccl.py
test_nn.py Fix dropout fused kernel applied in eval mode (#10621) 2018-08-17 14:54:42 -07:00
test_optim.py improve use of ROCm libraries, enable more tests, small fixes (#10406) 2018-08-13 11:39:43 -07:00
test_sparse.py set coalesced=false at sparse transpose() and removed transpose invariants (#10496) 2018-08-14 21:25:37 -07:00
test_torch.py Fix bincount for empty input (#9757) 2018-08-15 20:55:59 -07:00
test_utils.py improve use of ROCm libraries, enable more tests, small fixes (#10406) 2018-08-13 11:39:43 -07:00