Commit graph

1165 commits

Frank Seide
6bde0ca6d3 T66557700 Support default argument values of a method (#48863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48863

Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).
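
A minimal sketch of what this enables (names are illustrative; assumes `torch.jit.ScriptModule._save_for_lite_interpreter` and `mobile::Module::run_method`):

```python
import torch

class Scale(torch.nn.Module):
    def forward(self, x, factor: float = 2.0):
        return x * factor

m = torch.jit.script(Scale())
m._save_for_lite_interpreter("scale.ptl")
# A C++ caller can now omit `factor` when invoking the method:
#   module.run_method("forward", x)  // falls back to the default 2.0
```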

Test Plan:
buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation

buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg

Reviewed By: raziel, iseeyuan

Differential Revision: D25152559

fbshipit-source-id: bbf52f1fbdbfbc6f8fa8b65ab524b1cd4648f9c0
2020-12-16 15:55:03 -08:00
Igor Gitman
1b6d18aa7c Adding support for CuDNN-based LSTM with projections (#47725)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46213

I haven't updated the documentation yet; I will add those changes soon. There are a few other things I didn't do, but I want to clarify whether I should.

1. I didn't expose projections in the C++ API (torch/csrc/api/src/nn/modules/rnn.cpp). Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in the "lstm_cell" and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if the cell is used separately, projections can easily be added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added a check that projections are not supported for quantized LSTMs to the quantized_lstm_<data/input> functions. But I didn't add any checks to the LSTMCell code. It seems that since I disabled projections in the "lstm_cell" function, they should also not be available for quantized models through any API other than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks to other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add a check for the CuDNN version and disable projections in that case? If so, what would be the best way to do that?
5. Currently I added the projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights, and thus I had to add additional if-s in various places. An alternative would be the "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption would hold. But then I would need to split the loop in the get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases, I would still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.
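
For reference, a minimal sketch of the Python-side API this adds (assuming the `proj_size` keyword introduced here):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, proj_size=5)
x = torch.randn(7, 3, 10)   # (seq_len, batch, input_size)
h0 = torch.randn(2, 3, 5)   # hidden state uses proj_size, not hidden_size
c0 = torch.randn(2, 3, 20)  # cell state still uses hidden_size
out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape)            # torch.Size([7, 3, 5])
```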

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47725

Reviewed By: zou3519

Differential Revision: D25449794

Pulled By: ngimel

fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
2020-12-16 11:27:02 -08:00
Scott Wolchok
22c6dafd33 [PyTorch] Use plain old function pointer for RecordFunctionCallback (reapply) (#49408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118665808

Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629

Reviewed By: malfet

Differential Revision: D25563207

fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
2020-12-15 19:16:01 -08:00
Mikhail Zolotukhin
38a59a67f3 [JIT] Support multiple outputs in subgraph matcher. (#48992)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48992

Differential Revision: D25388100

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Pulled By: ZolotukhinM

fbshipit-source-id: d95713af2220cf4f99ac92f59f8e5b902f2f3822
2020-12-15 13:09:24 -08:00
Mike Ruberry
25bc906281 Revert D25135415: [PyTorch] Use plain old function pointer for RecordFunctionCallback
Test Plan: revert-hammer

Differential Revision:
D25135415 (7e23ee1598)

Original commit changeset: 5e92dc79da64

fbshipit-source-id: 45b1634a100084c84dca158a1f16ca760fef6988
2020-12-14 21:04:27 -08:00
Scott Wolchok
7e23ee1598 [PyTorch] Use plain old function pointer for RecordFunctionCallback (#48629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25135415

fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
2020-12-14 20:08:16 -08:00
Scott Wolchok
900aa4ee97 [PyTorch] remove convenience RecordFunctionCallback interface (#48620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620

In preparation for storing bare function pointer (8 bytes)
instead of std::function (32 bytes).
ghstack-source-id: 118568242

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25132183

fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
2020-12-14 20:03:15 -08:00
Bram Wasti
6b78644623 [te] Add BitCast to the IR (#49184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49184

Adds BitCasting to NNC. This will enable fast approximation algorithms implemented directly in TensorExpressions.
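
For intuition, the kind of trick a BitCast node lets a fused kernel express, shown here with the classic fast inverse square root in plain Python (illustrative only, not NNC code):

```python
import struct

def fast_rsqrt(x: float) -> float:
    i = struct.unpack("<I", struct.pack("<f", x))[0]  # bitcast float -> uint32
    i = 0x5F3759DF - (i >> 1)                         # magic-constant estimate
    y = struct.unpack("<f", struct.pack("<I", i))[0]  # bitcast uint32 -> float
    return y * (1.5 - 0.5 * x * y * y)                # one Newton refinement

print(fast_rsqrt(4.0))  # ~0.499, exact answer is 0.5
```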

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr

Reviewed By: bertmaher

Differential Revision: D25466476

fbshipit-source-id: f063ab29ba7bab2dcce463e499f2d4a16bdc1f0e
2020-12-11 16:12:20 -08:00
Shijun Kong
f965b0fcfb Expose run_async function on torch::jit::Method (#48607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48607

This change builds on top of https://github.com/pytorch/pytorch/pull/46865, further exposing the async interface to `torch::jit::Method`.

Added a unit test for the new `run_async`.

Test Plan: `buck test caffe2/test/cpp/jit/...`

Reviewed By: dzhulgakov

Differential Revision: D25219726

fbshipit-source-id: 89743c82a0baa1affe0254c1e2dbf873de8e5c76
2020-12-11 11:17:58 -08:00
generatedunixname89002005325676
dcd1e3d78d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D25490983

fbshipit-source-id: b24a11214a485a4a24ccf7da1e72715b450d3a81
2020-12-11 08:43:24 -08:00
Nick Gibson
5469aa5e7f [NNC] Add a non functional Tensor kind (#48750)
Summary:
Adds the CompoundTensor, a specialisation of the NNC Tensor which allows arbitrary production statements. This will allow lowering of aten ops into specific NNC IR patterns (which don't need to be functional), letting us shortcut to the optimized form of common patterns.

This is part 1 of trying to clean up the lowering of aten::cat so it is easier to optimize.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48750

Reviewed By: tugsbayasgalan

Differential Revision: D25433517

Pulled By: nickgg

fbshipit-source-id: de13c4719f8f87619ab254e5f324f13b5be1c9da
2020-12-10 19:43:50 -08:00
Elias Ellison
3b57be176e [NNC] Preserve strided output (#48264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48264

Preserves the strided representation of NNC Tensor outputs by transforming them into the right layout at the end of the kernel.

Fix for https://github.com/pytorch/pytorch/issues/45604

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D25286213

Pulled By: eellison

fbshipit-source-id: 64d94ac463741e2568a1c9d44174e15ea26e511f
2020-12-10 12:19:51 -08:00
Elias Ellison
413caa7fd2 [NNC] Compute Tensor Output Properties in initialization (#47813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47813

We have some code paths that at kernel invocation seem to handle dynamic sizes, but I'm not sure how well they work, because other parts of our code base assume that tensor shapes are always fully specified. https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/kernel.cpp#L1572

As with some other PRs in the stack, I think it would be good to remove features that aren't turned on or actively being worked on while they are unused.

I initially did this PR to try to speed up perf. I couldn't observe much of a speed up, so we can decide to keep or drop this PR if we want.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25286212

Pulled By: eellison

fbshipit-source-id: 4ae66e0af88d649dd4e592bc78686538c2fdbaeb
2020-12-10 12:19:45 -08:00
Bram Wasti
195b92bfa6 Revert D25441716: [te] Add BitCast to the IR
Test Plan: revert-hammer

Differential Revision:
D25441716 (3384145418)

Original commit changeset: c97b871697bc

fbshipit-source-id: e6eff02e28e1ae8c826dd2cfed79f869839ed2ba
2020-12-10 09:31:35 -08:00
Bram Wasti
3384145418 [te] Add BitCast to the IR
Summary: Adds BitCasting to NNC. This will enable fast approximation algorithms implemented directly in TensorExpressions.

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr

Reviewed By: bertmaher

Differential Revision: D25441716

fbshipit-source-id: c97b871697bc5931d09cda4a9cb0a81bb420f4e2
2020-12-10 09:25:46 -08:00
Nick Gibson
c5bc6b40ab [NNC] Dead Store Elimination (#49030)
Summary:
Adds a new optimization method to LoopNest which eliminates stores that do not contribute to any output. It's unlikely any of the lowerings of aten operators produce these stores yet, but this creates some wiggle room for transformations in the future.
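
For intuition, the kind of store this pass removes, sketched in Python (the real pass operates on NNC IR statements):

```python
def f(n):
    tmp = [0] * n
    for i in range(n):
        tmp[i] = i * 2   # dead stores: tmp never contributes to the output
    out = [0] * n
    for i in range(n):
        out[i] = i + 1
    return out           # dead store elimination can drop the tmp loop entirely
```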

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49030

Reviewed By: tugsbayasgalan

Differential Revision: D25434538

Pulled By: nickgg

fbshipit-source-id: fa1ead82e6f7440cc783c6116b23d0b7a5b5db4b
2020-12-09 18:49:53 -08:00
Bert Maher
492580b855 [te] Remove vestigial __init__.py from test/cpp/tensorexpr (#49061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49061

We don't use the python harness for cpp tests anymore.
ghstack-source-id: 118140485

Test Plan: Careful thinking.

Reviewed By: navahgar

Differential Revision: D25410290

fbshipit-source-id: 879e3c6fb296298d567e1d70b18bde96b5cac90d
2020-12-09 10:09:46 -08:00
Mikhail Zolotukhin
2b70bcd014 [TensorExpr] Enable inlining for output tensors too. (#48967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48967

We previously didn't inline output tensors, which resulted in correctness issues like #48533. This PR allows inlining for output tensors too; this could result in duplicated computations, but we can address that later once correctness is ensured.

Performance results on FastRNNS:
Before the fix:
```
Benchmarking LSTMs...
            name          avg_fwd          std_fwd          avg_bwd          std_bwd
           cudnn            10.09          0.05431            17.55           0.2108
            aten            21.52           0.1276             26.7            1.471
             jit            13.25           0.8748            22.47             1.73
      jit_premul            11.43           0.3226            19.43            2.231
 jit_premul_bias            11.84           0.2245            20.33            2.205
      jit_simple            13.27           0.9906            22.15           0.9724
  jit_multilayer            13.38           0.8748            22.82             1.01
              py            33.55            4.837            46.41            6.333
```
After the fix:
```
Benchmarking LSTMs...
            name          avg_fwd          std_fwd          avg_bwd          std_bwd
           cudnn            10.09          0.05979            17.45           0.1987
            aten            21.21            0.144            26.43           0.7356
             jit            13.01           0.2925            23.21           0.8454
      jit_premul             11.4           0.3905            19.62            2.448
 jit_premul_bias            11.85           0.2461            20.29           0.6592
      jit_simple            13.08           0.8533            22.81            1.315
  jit_multilayer            12.93           0.1095            23.57            1.459
              py            31.21            2.783            44.63            6.073
```

Differential Revision: D25383949

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Pulled By: ZolotukhinM

fbshipit-source-id: 16f5727475109a278499bef7905f6aad18c8527a
2020-12-08 13:24:40 -08:00
Peter Bell
5180caeeb4 Remove deprecated spectral ops from torch namespace (#48594)
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175

This removes the 4 deprecated spectral functions: `torch.{fft,rfft,ifft,irfft}`. The `torch.fft` module is also now imported by default.

The actual `at::native` functions are still used in `torch.stft`, so they can't be fully removed yet. But they will be once https://github.com/pytorch/pytorch/issues/47601 has been merged.
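
The replacements live in the `torch.fft` module, e.g.:

```python
import torch

x = torch.randn(8)
X = torch.fft.fft(x)    # replaces the removed torch.fft(x, 1)
r = torch.fft.rfft(x)   # replaces the removed torch.rfft(x, 1)
x2 = torch.fft.ifft(X)  # replaces the removed torch.ifft(X, 1)
```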

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48594

Reviewed By: heitorschueroff

Differential Revision: D25298929

Pulled By: mruberry

fbshipit-source-id: e36737fe8192fcd16f7e6310f8b49de478e63bf0
2020-12-05 04:12:32 -08:00
Chen Lai
416dc68341 [Pytorch][Annotation] Update inlined callstack with module instance info (#47416)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47416

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D24752846

Pulled By: cccclai

fbshipit-source-id: 94d3c18c56161d1de3a16bb7c93502fedf71644c
2020-12-03 10:44:46 -08:00
Scott Wolchok
d1df4038ff [PyTorch] Make RecordFunctionCallback::should_run_ a function pointer (#48274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48274

The std::function-ness of it was used only for tests. (std::function is huge at 32 bytes, and not particularly efficient.)
ghstack-source-id: 117498491

Test Plan: CI

Reviewed By: dzhulgakov

Differential Revision: D25102077

fbshipit-source-id: fd941ddf32235a9659a1a17609c27cc5cb446a54
2020-12-01 13:02:25 -08:00
Bert Maher
adb4fd3f2f [te] Fix comparison ops on booleans (#48384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48384

As title

Test Plan: buck test //caffe2/test:jit -- test_binary_ops

Reviewed By: asuhan

Differential Revision: D25115773

fbshipit-source-id: c5f8ee21692bcf0d78f099789c0fc7c457a1e4a2
2020-11-30 18:21:35 -08:00
ProGamerGov
d6ddd78eb0 Fix multiple spelling and grammar mistakes (#48592)
Summary:
I found a number of spelling & grammatical mistakes in the repository. Previously I had these fixes submitted individually, but I saw that a single word change was apparently too small for a PR to be merged. Hopefully this new PR has a sufficient number of changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48592

Reviewed By: ejguan

Differential Revision: D25224216

Pulled By: mrshenli

fbshipit-source-id: 2af3db2aee486563efd0dffc4e8f777306a73e44
2020-11-30 15:18:44 -08:00
Ilia Cherniavskii
f7a8bf2855 Use libkineto in profiler (#46470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470

Adding ability to use Kineto (CUPTI) to profile CUDA kernels

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Reviewed By: Chillee

Differential Revision: D25142223

Pulled By: ilia-cher

fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
Elias Ellison
a00ba63023 Disable old fuser internally (#48322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48322

Disable old fuser internally. I would like to find where we are inadvertently setting the old fuser, but in the meantime I would like to land a diff that I know will 100% cause it not to be run, and verify that it fixes the issue.

Test Plan: sandcastle

Reviewed By: ZolotukhinM

Differential Revision: D25126202

fbshipit-source-id: 5a4d0742f5f829e536f50e7ede1256c94dd05232
2020-11-21 00:42:23 -08:00
Erjia Guan
c542614e53 Implement C++ ModuleDict (#47707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47707

Fixes #45896

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24872641

Pulled By: ejguan

fbshipit-source-id: 3d1dc9148ba3bcf66ab9c44ddb5774060bbc365d
2020-11-19 08:07:51 -08:00
Nikolay Korovaiko
0d8ddb5ec2 Make softmax and log_softmax handle negative dims, add tests (#48156)
Summary:
Make softmax and log_softmax handle negative dims, add tests
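
A quick illustration of the semantics being handled (negative dims count from the last dimension):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4)
# dim=-1 is equivalent to dim=2 for a 3-d tensor
assert torch.allclose(F.softmax(x, dim=-1), F.softmax(x, dim=2))
assert torch.allclose(F.log_softmax(x, dim=-3), F.log_softmax(x, dim=0))
```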

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48156

Reviewed By: bertmaher

Differential Revision: D25059788

Pulled By: Krovatkin

fbshipit-source-id: 985963e7df400857c9774660c76be7d56201a1ad
2020-11-19 01:38:14 -08:00
Scott Wolchok
bef460a803 [PyTorch] Return raw ptr from ThreadLocalDebugInfo::get() (#47796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47796

`ThreadLocalDebugInfo::get()` is a hot function. For example, it is called by `DefaultCPUAllocator::allocate()`. Most callers do not even bother to keep the returned `shared_ptr` around, proving that they have no lifetime issues currently. For the rest, it appears that the only way that the returned pointer could become invalid is if they then called a function that swapped out `ThreadLocalDebugInfo` using `ThreadLocalStateGuard`. There are very few such paths, and it doesn't look like any current callers of `ThreadLocalDebugInfo::get()` needed a `shared_ptr` at all.
ghstack-source-id: 116979577

Test Plan:
1) reviewers to double-check audit of safety
2) run framework overhead benchmarks

Reviewed By: dzhulgakov

Differential Revision: D24902978

fbshipit-source-id: d684737cc2568534cac7cd3fb8d623b971c2fd28
2020-11-18 20:37:17 -08:00
Scott Wolchok
4c9eb57914 [PyTorch] Narrow Device to 2 bytes by narrowing DeviceType and DeviceIndex (#47023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023

DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
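
A rough illustration of the resulting layout (a ctypes mock, not the actual c10 definition; prior field widths assumed):

```python
import ctypes

class Device(ctypes.Structure):
    # one byte each, mirroring int8_t-backed DeviceType and DeviceIndex
    _fields_ = [("type", ctypes.c_int8), ("index", ctypes.c_int8)]

print(ctypes.sizeof(Device))  # 2 bytes, down from 4 with 2-byte fields
```
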
ghstack-source-id: 116901430

Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect

Reviewed By: dzhulgakov

Differential Revision: D24605460

fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
2020-11-18 19:39:40 -08:00
Bert Maher
07657b6001 [tensorexpr] Switch cpp tests to pure gtest (#48160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160

We no longer use the custom c++ test infra anyways, so move to pure
gtest.

Fixes #45703
ghstack-source-id: 116977283

Test Plan: `buck test //caffe2/test/cpp/tensorexpr`

Reviewed By: navahgar, nickgg

Differential Revision: D25046618

fbshipit-source-id: da34183d87465f410379048148c28e1623618553
2020-11-18 12:23:34 -08:00
Nick Gibson
aabc87cd04 [NNC] Fix HalfChecker when half present but unused (#48068)
Summary:
Fixes an internally reported issue in the tensorexpr fuser when using FP16 on CUDA. The HalfChecker analysis, which determines whether we need to define the Half type, searches the IR for expressions that use Half. If one of the parameters is of type Half but it (or any other Half expr) is not used in the IR, we'll return a false negative. Fix this by adding the parameter list to the HalfChecker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48068

Reviewed By: ZolotukhinM

Differential Revision: D25009680

Pulled By: nickgg

fbshipit-source-id: 24fddef06821f130db3d3f45d6d041c7f34a6ab0
2020-11-17 12:07:57 -08:00
Bert Maher
af37f8f810 [pytorch][te] Do not merge Tensor[] variant of aten::where into fusion group (#48063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48063

The TE fuser does not know how to construct a list of Tensors.

Test Plan: new unit test

Reviewed By: eellison

Differential Revision: D25007234

fbshipit-source-id: 1a8ffdf5ffecb39a727357799ed32df8f53150d6
2020-11-16 22:41:10 -08:00
Bram Wasti
43a9d6fb6e [TorchScript] Support user defined classes as constants (#5062)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45556

User defined classes can be used as constants.  This is useful when freezing and removing the module from the graph.

Test Plan: waitforsadcastle

Reviewed By: eellison

Differential Revision: D23994974

fbshipit-source-id: 5b4a5c91158aa7f22df39d71f2658afce1d29317
2020-11-16 20:52:02 -08:00
Nick Gibson
957e45a97c [NNC] Support vectorization of reductions (#47924)
Summary:
Add support for ReduceOp in the Vectorizer, which allows vectorization of reductions. Only non-reduce axes can be vectorized currently; we'd need either to automatically pull out the RHS of reductions (better as a separate transform, I think) or special handling of vector reduce in the LLVM codegen (tricky, maybe not useful?) to make vectorizing reduce axes work.

There was a disabled LLVM test for this case which I reenabled with a bit of massaging, and added a few more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47924

Reviewed By: bertmaher

Differential Revision: D24963464

Pulled By: nickgg

fbshipit-source-id: 91d91e9e2696555ab5690b154984b1ce48359d51
2020-11-16 10:43:53 -08:00
Mike Ruberry
013e6a3d9d Revert D24698027: Fix auto exponent issue for torch.pow
Test Plan: revert-hammer

Differential Revision:
D24698027 (8ef7ccd669)

Original commit changeset: f23fdb65c925

fbshipit-source-id: 9a67a2c6310c9e4fdefbb421a8cd4fa41595bc9a
2020-11-15 03:58:44 -08:00
anjali411
8ef7ccd669 Fix auto exponent issue for torch.pow (#47024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47024

Fixes https://github.com/pytorch/pytorch/issues/46936

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#47024 Fix auto exponent issue for torch.pow**

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24698027

Pulled By: anjali411

fbshipit-source-id: f23fdb65c925166243593036e08214c4f041a63d
2020-11-14 22:50:12 -08:00
Bert Maher
5adf840259 [pytorch][te][easy] Remove KernelScope from fusion pass tests (#47952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47952

We don't actually generate a TE kernel so no need to use the
arena-allocation guard.

Test Plan:
```
buck test //caffe2/test/cpp/tensorexpr -- FuserPass
```

Reviewed By: ZolotukhinM

Differential Revision: D24967107

fbshipit-source-id: 302f65b2fcff704079e8b51b942b7b3baff95585
2020-11-14 20:25:01 -08:00
Bert Maher
6b8d20c023 [pytorch][te] Don't start TE fusion groups with an unknown-typed result (#47884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47884

We need to know output types of everything in a fusion group to ensure
that we generate correctly-typed tensors.  We were incorrectly starting a
fusion group with an unknown-typed output.

Test Plan:
New unit tests:
```
buck test //caffe2/test:jit //caffe2/test/cpp/tensorexpr:tensorexpr
```

Reviewed By: eellison

Differential Revision: D24932786

fbshipit-source-id: 83978a951f32c1207bbc3555a7d3bd94fe4e70fb
2020-11-13 10:52:53 -08:00
Nick Gibson
eab809377d [NNC] Remove all deferred expansion from Reductions (#47709)
Summary:
Refactors the ReduceOp node to remove the last remaining deferred functionality: completing the interaction between the accumulator buffer and the body. This fixes two issues with reductions:
1. Nodes inside the interaction could not be visited or modified, meaning we could generate bad code when the interaction was complex.
2. The accumulator load was created at expansion time and so could not be modified in some ways (i.e., vectorization couldn't act on these loads).

This simplifies reduction logic quite a bit, but there's a bit more involved in the rfactor transform.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47709

Reviewed By: ZolotukhinM

Differential Revision: D24904220

Pulled By: nickgg

fbshipit-source-id: 159e5fd967d2d1f8697cfa96ce1bb5fc44920a40
2020-11-12 20:17:52 -08:00
Nick Gibson
76ff557de7 [NNC] add hazard analysis to Bounds Inference (#47684)
Summary:
Adds a helper function to the Bounds Inference / Memory Analysis infrastructure which returns the kind of hazard found between two Stmts (e.g. Blocks or Loops). E.g.
```
for (int x = 0; x < 10; ++x) {
  A[x] = x * 2;
}
for (int x = 0; x < 10; ++x) {
 B[x] = A[x] / 2;
}
```
The two loops have a `ReadAfterWrite` hazard, while in this example:
```
for (int x = 0; x < 10; ++x) {
  A[x] = x * 2;
}
for (int x = 0; x < 10; ++x) {
 A[x] = B[x] / 2;
}
```
The loops have a `WriteAfterWrite` hazard.

This isn't 100% of what we need for loop fusion; for example, we don't check the strides of the loops to see if they match.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47684

Reviewed By: malfet

Differential Revision: D24873587

Pulled By: nickgg

fbshipit-source-id: 991149e5942e769612298ada855687469a219d62
2020-11-12 11:34:31 -08:00
Yanan Cao
00a3add425 [TorchBind] Support using lambda function as TorchBind constructor (#47819)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47819

Reviewed By: wanchaol

Differential Revision: D24910065

Pulled By: gmagogsfm

fbshipit-source-id: ad5b4f67b0367e44fe486d31a060d9ad1e0cf568
2020-11-12 09:29:34 -08:00
Wanchao Liang
553ccccc54 [c10d] switch ProcessGroup to be managed by intrusive_ptr (#47343)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47343

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24723418

Pulled By: wanchaol

fbshipit-source-id: 0463819b96c53b12bdbb3905431110d7b21beb77
2020-11-12 07:36:23 -08:00
Wanchao Liang
665ac2f7b0 [reland] [c10d] switch Store to be managed by intrusive_ptr (#47808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47808

reland https://github.com/pytorch/pytorch/pull/47074

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905246

fbshipit-source-id: edeb7e6e486570ce889f12512e9dc02061d6cc03
2020-11-11 22:53:20 -08:00
Wanchao Liang
1f946e942d Revert D24667128: [c10d] switch Store to be managed by intrusive_ptr
Test Plan: revert-hammer

Differential Revision:
D24667128 (0cfe3451d4)

Original commit changeset: 9b6024c31c85

fbshipit-source-id: d8ddf9eb2fccef5023e05698e0c4662708fe4945
2020-11-11 10:49:58 -08:00
Wanchao Liang
0cfe3451d4 [c10d] switch Store to be managed by intrusive_ptr (#47074)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47074

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24667128

Pulled By: wanchaol

fbshipit-source-id: 9b6024c31c851b7c3243540f460ae57323da523b
2020-11-10 23:36:44 -08:00
Nick Gibson
0b30a8d007 [NNC] Simplify and fix some bugs in Bounds Inference (#47450)
Summary:
Refactors NNC bounds inference to use the dependency analysis added in https://github.com/pytorch/pytorch/issues/46952. This ends up being a pretty good simplification because we no longer need the complicated bound merging code that we used to determine contiguous ranges. There were no usages of that code and the memory dependency analyzer is closer to what we want for those use cases anyway.

Added tests for a few cases uncovered by the existing bounds inference test; much of the coverage for this feature is in tests of its uses: rfactor, computeAt and cacheAccesses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47450

Reviewed By: heitorschueroff

Differential Revision: D24834458

Pulled By: nickgg

fbshipit-source-id: f93e40b09c0745dcc46c7e34359db594436d04f0
2020-11-09 21:37:04 -08:00
Martin Yuan
a1fef453b6 Support extra files in _load_for_mobile (#47425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47425

Extra files can be exported in a lite interpreter model, but they could not be loaded. This PR adds the capability to load extra files from a lite interpreter model. Because extra_files is a default argument, it should not affect the existing usage of _load_for_mobile. It's a simple assembly of a generic unordered_map; no additional dependency should be introduced, and the size overhead should be small (to be tested).
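
For context, a sketch of the export side that produces such extra files (the `_extra_files` keyword here is assumed to mirror `torch.jit.save`):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = torch.jit.script(M())
# Attach an extra file at export; this PR makes _load_for_mobile able to
# read "metadata.json" back out when loading the model in C++.
m._save_for_lite_interpreter("m.ptl", _extra_files={"metadata.json": '{"v": 1}'})
```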

Test Plan: Imported from OSS

Reviewed By: kwanmacher

Differential Revision: D24770266

Pulled By: iseeyuan

fbshipit-source-id: 7e8bd301ce734dbbf36ae56c9decb045aeb801ce
2020-11-06 20:26:54 -08:00
Raghavan Raman
8eb228a7f3 Add support for log_softmax (#47409)
Summary:
This diff adds support for `log_softmax` op in NNC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47409

Reviewed By: ejguan

Differential Revision: D24750203

Pulled By: navahgar

fbshipit-source-id: c4dacc7f62f9df65ae467f0d578ea03d3698273d
2020-11-06 13:29:27 -08:00
Nick Gibson
0edc6a39c8 [NNC] Read/Write Dependency analysis (#46952)
Summary:
Adds a new piece of infrastructure to the NNC fused-kernel generation compiler, which builds a dependency graph of the reads and writes to memory regions in a kernel.

It can be used to generate graphs like this from the GEMM benchmark (note this only represents the memory hierarchy, not the compute hierarchy):

![image](https://user-images.githubusercontent.com/701287/97368797-e99d5600-1868-11eb-9a7e-ceeb91ce72b8.png)

Or to answer questions like this:
```
Tensor* c = Compute(...);
Tensor* d = Compute(...);
LoopNest loop({d});
MemDependencyChecker analyzer;
loop.root_stmt()->accept(analyzer);
if (analyzer.dependsDirectly(loop.getLoopStmtsFor(d)[0], loop.getLoopStmtsFor(c)[0])) {
  // do something, maybe computeInline
}
```

Or this:
```
Tensor* d = Compute(...);
LoopNest loop({d});
MemDependencyChecker analyzer(loop.getInputs(), loop.getOutputs());
const Buf* output = d->buf();
for (const Buf* input : loop.getInputs()) {
  if (!analyzer.dependsIndirectly(output, input)) {
    // signal that this input is unused
  }
}
```

This is a monster of a diff, and I apologize. I've tested it as well as possible for now, but it's not hooked up to anything yet so should not affect any current usages of the NNC fuser.

**How it works:**

Similar to the registerizer, the MemDependencyChecker walks the IR aggregating memory accesses into scopes, then merges those scopes into their parent scope and tracks which writes are responsible for the last write to a particular region of memory, adding dependency links where that region is used.

This relies on a bunch of math on symbolic contiguous regions which I've pulled out into its own file (bounds_overlap.h/cpp). Sometimes this won't be able to infer dependence with 100% accuracy, but I think it should always be conservative: it may occasionally add false positives, but I'm aware of no false negatives.

The hardest part of the analysis is determining when a Load inside a For loop depends on a Store that is lower in the IR from a previous iteration of the loop. This depends on a whole bunch of factors, including whether or not we should consider loop iteration order. The analyzer makes this setting configurable. For example, this loop:
```
for (int x = 0; x < 10; ++x) {
 A[x] = B[x] + 1;
}
```

has no inter loop dependence, since each iteration uses a distinct slice of both A and B. But this one:

```
for (int x = 0; x < 10; ++x) {
 A[0] = A[0] + B[x];
}
```

has a self-loop dependence between the Load and the Store of A. This applies to many cases that are not reductions as well. In this example:

```
for (int x = 0; x < 10; ++x) {
  A[x] = A[x+1] + x;
}
```

Whether or not it has self-loop dependence depends on if we are assuming the execution order is fixed (or whether this loop could later be parallelized). If the read from `A[x+1]` always comes before the write to that same region then it has no dependence.

The analyzer can correctly handle dynamic shapes, but we may need more test coverage of real world usages of dynamic shapes. I unit test some simple and pathological cases, but coverage could be better.

**Next Steps:**

Since the PR was already so big I didn't actually hook it up anywhere, but I had planned on rewriting bounds inference based on the dependency graph. Will do that next.

There are a few gaps in this code which could be filled in later if we need them:
* Upgrading the bound math to work with write strides, which will reduce false positive dependencies.
* Better handling of Conditions, reducing false positive dependencies when a range is written in both branches of a Cond.
* Support for AtomicAdd node added in Cuda codegen.

**Testing:**

See new unit tests; I've tried to be verbose about what is being tested. I ran the python tests, but there shouldn't be any way for this work to affect them yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46952

Reviewed By: ejguan

Differential Revision: D24730346

Pulled By: nickgg

fbshipit-source-id: 654c67c71e9880495afd3ae0efc142e95d5190df
2020-11-04 19:52:20 -08:00
Gaoxiang Liu
735f8cc6c2 [DI] Allow explicit taskLauncher for torchscript interpreter (#46865)
Summary:
By default, TorchScript execution is single-threaded and uses the caller's thread pool. For the use case of distributed inference, we want a way to customize the behavior so that the TorchScript interpreter can be executed elsewhere. This diff allows an explicit taskLauncher for the TorchScript interpreter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46865

Test Plan:
unit test passes.

fbshipit-source-id: 1d7b003926c0d1f8facc53206efb960cff8897ac

Fixes #{issue number}

Reviewed By: houseroad

Differential Revision: D24616102

Pulled By: garroud

fbshipit-source-id: 79202b62f92d0b0baf72e4bf7aa3f05e0da91d59
2020-11-04 17:07:55 -08:00