Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48863
Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).
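As an illustration, here is a minimal sketch of the intended call pattern (the model path and the defaulted trailing argument of the scripted method are hypothetical):
```
#include <ATen/ATen.h>
#include <torch/csrc/jit/mobile/import.h>
#include <torch/csrc/jit/mobile/module.h>

int main() {
  // Hypothetical model whose scripted method is
  //   def forward(self, x, scale: float = 1.0)
  torch::jit::mobile::Module m = torch::jit::_load_for_mobile("model.ptl");

  // With default-argument support the trailing `scale` can be omitted;
  // previously the lite interpreter required every argument explicitly.
  c10::IValue out = m.forward({at::ones({2, 2})});
  return 0;
}
```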
Test Plan:
buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation
buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg
Reviewed By: raziel, iseeyuan
Differential Revision: D25152559
fbshipit-source-id: bbf52f1fbdbfbc6f8fa8b65ab524b1cd4648f9c0
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46213
I didn't update the documentation yet; I will add those changes soon. There are a few other things I didn't do, but I want to clarify whether I should:
1. I didn't expose projections in the C++ API (torch/csrc/api/src/nn/modules/rnn.cpp). Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in the "lstm_cell" and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit: if the cell is used separately, the projection can easily be added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added a check to the quantized_lstm_<data/input> functions that projections are not supported for quantized LSTMs, but I didn't add any checks to the LSTMCell code. Since I disabled projections in the "lstm_cell" function, they should also not be available for quantized models through any API other than quantized_lstm_<data/input>. Please let me know if I'm wrong and I will add checks in other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add the check for CuDNN version and disable projections in that case? If so, what will be the best way to do that?
5. Currently I added the projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights, so I had to add additional if statements in various places. An alternative would be the "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption would hold. But then I would need to split the loop in the get_parameters function in aten/src/ATen/native/cudnn/RNN.cpp, and in some cases I would still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47725
Reviewed By: zou3519
Differential Revision: D25449794
Pulled By: ngimel
fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408
Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118665808
Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629
Reviewed By: malfet
Differential Revision: D25563207
fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629
Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240
Test Plan: CI
Reviewed By: dhruvbird
Differential Revision: D25135415
fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620
In preparation for storing a bare function pointer (8 bytes)
instead of a std::function (32 bytes).
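For context, a small standalone comparison (sizes are implementation-defined; the 8-vs-32 figures above match a typical 64-bit libstdc++ build):
```
#include <cstdio>
#include <functional>

using RawCallback = void (*)();

int main() {
  // std::function carries a manager pointer, an invoker pointer, and inline
  // storage for small callables, which is why it dwarfs a bare function pointer.
  std::printf("sizeof(std::function<void()>) = %zu\n", sizeof(std::function<void()>));
  std::printf("sizeof(RawCallback)           = %zu\n", sizeof(RawCallback));
  return 0;
}
```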
ghstack-source-id: 118568242
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D25132183
fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49184
Adds BitCasting to NNC. This will enable fast approximation algorithms implemented directly in TensorExpressions
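As a plain C++ illustration (not the NNC API) of why bitcasting matters for fast approximations, here is the classic fast reciprocal square root, which only works because a float's bits can be reinterpreted as an integer and back:
```
#include <cstdint>
#include <cstring>

// Bitcast float -> uint32, tweak the bits, bitcast back, then refine with one
// Newton-Raphson step. A BitCast node lets this kind of pattern be expressed
// directly in a TensorExpression.
float fast_rsqrt(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));   // bitcast: float -> uint32
  bits = 0x5f3759df - (bits >> 1);        // magic-constant initial guess
  float y;
  std::memcpy(&y, &bits, sizeof(y));      // bitcast: uint32 -> float
  return y * (1.5f - 0.5f * x * y * y);   // one Newton-Raphson refinement
}
```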
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr
Reviewed By: bertmaher
Differential Revision: D25466476
fbshipit-source-id: f063ab29ba7bab2dcce463e499f2d4a16bdc1f0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48607
This change builds on top of
https://github.com/pytorch/pytorch/pull/46865
further exposing the async interface to `torch::jit::Method`.
Added a unit test for the new `run_async`.
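A hedged sketch of the new entry point (the exact `run_async` signature is an assumption inferred from this description; the model path is hypothetical):
```
#include <torch/script.h>

int main() {
  torch::jit::Module m = torch::jit::load("model.pt");
  torch::jit::Method forward = m.get_method("forward");

  std::vector<c10::IValue> inputs{torch::ones({2, 2})};
  // Kick off execution without blocking the caller; assumed to return a
  // c10::ivalue::Future that can be waited on later.
  auto future = forward.run_async(inputs);

  future->wait();
  c10::IValue result = future->value();
  return 0;
}
```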
Test Plan: `buck test caffe2/test/cpp/jit/...`
Reviewed By: dzhulgakov
Differential Revision: D25219726
fbshipit-source-id: 89743c82a0baa1affe0254c1e2dbf873de8e5c76
Summary:
Adds the CompoundTensor, a specialisation of the NNC Tensor which allows arbitrary production statements. This will allow lowering of aten ops into specific NNC IR patterns (which don't need to be functional), letting us shortcut to the optimized form of common patterns.
This is part 1 of trying to clean up the lowering of aten::cat so it is easier to optimize.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48750
Reviewed By: tugsbayasgalan
Differential Revision: D25433517
Pulled By: nickgg
fbshipit-source-id: de13c4719f8f87619ab254e5f324f13b5be1c9da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48264
Preserves the strided representation of NNC Tensor outputs by transforming them into the right layout at the end of the kernel.
Fix for https://github.com/pytorch/pytorch/issues/45604
Test Plan: Imported from OSS
Reviewed By: nikithamalgifb
Differential Revision: D25286213
Pulled By: eellison
fbshipit-source-id: 64d94ac463741e2568a1c9d44174e15ea26e511f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47813
We have some code paths that at kernel invocation seem to handle dynamic sizes, but I'm not sure how well they work, because other parts of our code base assume that tensor shapes are always fully specified. https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/kernel.cpp#L1572
As with some other PRs in the stack, I think it would be good to remove features that aren't turned on or actively being worked on while they are not used.
I initially did this PR to try to speed up perf. I couldn't observe much of a speedup, so we can decide to keep or drop this PR if we want.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25286212
Pulled By: eellison
fbshipit-source-id: 4ae66e0af88d649dd4e592bc78686538c2fdbaeb
Summary: Adds BitCasting to NNC. This will enable fast approximation algorithms implemented directly in TensorExpressions
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr
Reviewed By: bertmaher
Differential Revision: D25441716
fbshipit-source-id: c97b871697bc5931d09cda4a9cb0a81bb420f4e2
Summary:
Adds a new optimization method to LoopNest which eliminates stores that do not contribute to any output. It's unlikely any of the lowerings of aten operators produce these stores yet, but this creates some wiggle room for transformations in the future.
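For illustration, a hypothetical before/after of the kind of store this pass can drop (buffer names made up; B is a temporary that no output ever reads):
```
// Before:
for (int i = 0; i < 10; ++i) {
  A[i] = i * 2;   // A feeds an output: kept
  B[i] = i * 3;   // B never contributes to an output: dead store
}

// After eliminating dead stores:
for (int i = 0; i < 10; ++i) {
  A[i] = i * 2;
}
```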
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49030
Reviewed By: tugsbayasgalan
Differential Revision: D25434538
Pulled By: nickgg
fbshipit-source-id: fa1ead82e6f7440cc783c6116b23d0b7a5b5db4b
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175
This removes the 4 deprecated spectral functions: `torch.{fft,rfft,ifft,irfft}`. `torch.fft` is also now imported by default.
The actual `at::native` functions are still used in `torch.stft`, so they can't be fully removed yet, but they will be once https://github.com/pytorch/pytorch/issues/47601 has been merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48594
Reviewed By: heitorschueroff
Differential Revision: D25298929
Pulled By: mruberry
fbshipit-source-id: e36737fe8192fcd16f7e6310f8b49de478e63bf0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48274
The std::function-ness of it was used only for tests. (std::function is huge at 32 bytes, and not particularly efficient.)
ghstack-source-id: 117498491
Test Plan: CI
Reviewed By: dzhulgakov
Differential Revision: D25102077
fbshipit-source-id: fd941ddf32235a9659a1a17609c27cc5cb446a54
Summary:
I found a number of spelling & grammatical mistakes in the repository. Previously I submitted these fixes individually, but a single-word change was apparently too small for a PR to be merged. Hopefully this new PR has a sufficient number of changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48592
Reviewed By: ejguan
Differential Revision: D25224216
Pulled By: mrshenli
fbshipit-source-id: 2af3db2aee486563efd0dffc4e8f777306a73e44
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48322
Disable old fuser internally. I would like to find where we are inadvertently setting the old fuser, but in the meantime I would like to land a diff that I know will 100% cause it not to be run, and verify that it fixes the issue.
Test Plan: sandcastle
Reviewed By: ZolotukhinM
Differential Revision: D25126202
fbshipit-source-id: 5a4d0742f5f829e536f50e7ede1256c94dd05232
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47796
`ThreadLocalDebugInfo::get()` is a hot function. For example, it is called by `DefaultCPUAllocator::allocate()`. Most callers do not even bother to keep the returned `shared_ptr` around, proving that they have no lifetime issues currently. For the rest, it appears that the only way that the returned pointer could become invalid is if they then called a function that swapped out `ThreadLocalDebugInfo` using `ThreadLocalStateGuard`. There are very few such paths, and it doesn't look like any current callers of `ThreadLocalDebugInfo::get()` needed a `shared_ptr` at all.
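A hedged sketch of the call pattern described above (treat the exact signature and the DebugInfoKind value as assumptions): the pointer is consumed immediately and never stored, so per-call shared_ptr refcounting on hot paths like allocate() buys nothing.
```
#include <c10/util/ThreadLocalDebugInfo.h>

void peek_debug_info() {
  // A raw pointer is enough here: nothing outlives the current scope, and no
  // ThreadLocalStateGuard swap can occur while we stay on this code path.
  if (auto* info =
          c10::ThreadLocalDebugInfo::get(c10::DebugInfoKind::PROFILER_STATE)) {
    (void)info;  // inspect the state in place
  }
}
```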
ghstack-source-id: 116979577
Test Plan:
1) reviewers to double-check audit of safety
2) run framework overhead benchmarks
Reviewed By: dzhulgakov
Differential Revision: D24902978
fbshipit-source-id: d684737cc2568534cac7cd3fb8d623b971c2fd28
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023
DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
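An illustrative sketch (not the exact c10 definitions) of the packing this enables:
```
#include <cstdint>

// With both fields narrowed to one byte, a Device fits in two bytes.
enum class DeviceType : std::int8_t { CPU = 0, CUDA = 1 /* ... */ };
using DeviceIndex = std::int8_t;  // machines have nowhere near 255 devices

struct Device {
  DeviceType type;
  DeviceIndex index;
};
static_assert(sizeof(Device) == 2, "two one-byte fields pack into two bytes");
```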
ghstack-source-id: 116901430
Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect
Reviewed By: dzhulgakov
Differential Revision: D24605460
fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160
We no longer use the custom C++ test infra anyway, so move to pure
gtest.
Fixes #45703
ghstack-source-id: 116977283
Test Plan: `buck test //caffe2/test/cpp/tensorexpr`
Reviewed By: navahgar, nickgg
Differential Revision: D25046618
fbshipit-source-id: da34183d87465f410379048148c28e1623618553
Summary:
Fixes an internally reported issue in the tensorexpr fuser when using FP16 on CUDA. The HalfChecker analysis, which determines whether we need to define the Half type, searches the IR for expressions that use Half. If one of the parameters is of type Half but it (or any other Half expr) is not used in the IR, we'll return a false negative. Fix this by adding the parameter list to the HalfChecker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48068
Reviewed By: ZolotukhinM
Differential Revision: D25009680
Pulled By: nickgg
fbshipit-source-id: 24fddef06821f130db3d3f45d6d041c7f34a6ab0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48063
The TE fuser does not know how to construct a list of Tensors.
Test Plan: new unit test
Reviewed By: eellison
Differential Revision: D25007234
fbshipit-source-id: 1a8ffdf5ffecb39a727357799ed32df8f53150d6
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45556
User defined classes can be used as constants. This is useful when freezing and removing the module from the graph.
Test Plan: waitforsadcastle
Reviewed By: eellison
Differential Revision: D23994974
fbshipit-source-id: 5b4a5c91158aa7f22df39d71f2658afce1d29317
Summary:
Add support for ReduceOp in the Vectorizer, which allows vectorization of reductions. Only non-reduce axes can be vectorized currently; to vectorize reduce axes we'd need either to automatically pull out the RHS of reductions (better as a separate transform, I think) or special handling of vector reduce in the LLVM codegen (tricky, maybe not useful?).
There was a disabled LLVM test for this case which I reenabled with a bit of massaging, and added a few more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47924
Reviewed By: bertmaher
Differential Revision: D24963464
Pulled By: nickgg
fbshipit-source-id: 91d91e9e2696555ab5690b154984b1ce48359d51
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47952
We don't actually generate a TE kernel so no need to use the
arena-allocation guard.
Test Plan:
```
buck test //caffe2/test/cpp/tensorexpr -- FuserPass
```
Reviewed By: ZolotukhinM
Differential Revision: D24967107
fbshipit-source-id: 302f65b2fcff704079e8b51b942b7b3baff95585
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47884
We need to know output types of everything in a fusion group to ensure
that we generate correctly-typed tensors. We were incorrectly starting a
fusion group with an unknown-typed output.
Test Plan:
New unit tests:
```
buck test //caffe2/test:jit //caffe2/test/cpp/tensorexpr:tensorexpr
```
Reviewed By: eellison
Differential Revision: D24932786
fbshipit-source-id: 83978a951f32c1207bbc3555a7d3bd94fe4e70fb
Summary:
Refactors the ReduceOp node to remove the last remaining deferred functionality: completing the interaction between the accumulator buffer and the body. This fixes two issues with reductions:
1. Nodes inside the interaction could not be visited or modified, meaning we could generate bad code when the interaction was complex.
2. The accumulator load was created at expansion time and so could not be modified in some ways (i.e. vectorization couldn't act on these loads).
This simplifies the reduction logic quite a bit, but there's a bit more involved in the rfactor transform.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47709
Reviewed By: ZolotukhinM
Differential Revision: D24904220
Pulled By: nickgg
fbshipit-source-id: 159e5fd967d2d1f8697cfa96ce1bb5fc44920a40
Summary:
Adds a helper function to the Bounds Inference / Memory Analysis infrastructure which returns the kind of hazard found between two Stmts (e.g. Blocks or Loops). For example:
```
for (int i = 0; i < 10; ++i) {
  A[x] = i * 2;
}
for (int j = 0; j < 10; ++j) {
  B[x] = A[x] / 2;
}
```
The two loops have a `ReadAfterWrite` hazard, while in this example:
```
for (int i = 0; i < 10; ++i) {
  A[x] = i * 2;
}
for (int j = 0; j < 10; ++j) {
  A[x] = B[x] / 2;
}
```
The loops have a `WriteAfterWrite` hazard.
This isn't 100% of what we need for loop fusion; for example, we don't check the strides of the loops to see if they match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47684
Reviewed By: malfet
Differential Revision: D24873587
Pulled By: nickgg
fbshipit-source-id: 991149e5942e769612298ada855687469a219d62
Summary:
Refactors NNC bounds inference to use the dependency analysis added in https://github.com/pytorch/pytorch/issues/46952. This ends up being a pretty good simplification because we no longer need the complicated bound merging code that we used to determine contiguous ranges. There were no usages of that code and the memory dependency analyzer is closer to what we want for those use cases anyway.
Added tests for a few cases uncovered by the existing bounds inference test; much of the coverage for this feature is in tests of its uses: rfactor, computeAt and cacheAccesses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47450
Reviewed By: heitorschueroff
Differential Revision: D24834458
Pulled By: nickgg
fbshipit-source-id: f93e40b09c0745dcc46c7e34359db594436d04f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47425
Extra files can be exported in a lite interpreter model, but they could not be loaded. This PR adds the capability to load extra files from a lite interpreter model. Because extra_files is a default argument, it should not affect the existing usage of _load_for_mobile. The implementation is a simple assembly of a generic unordered_map, so no additional dependency should be introduced and the size overhead should be small (to be tested).
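A hedged sketch of the intended usage (the extra-file key and model path are hypothetical; pre-populate the map with the names you want read back):
```
#include <string>
#include <torch/csrc/jit/mobile/import.h>

int main() {
  // Ask the loader to extract this entry from the model archive.
  torch::jit::ExtraFilesMap extra_files{{"metadata.json", ""}};

  auto module = torch::jit::_load_for_mobile("model.ptl", c10::nullopt, extra_files);

  // After loading, the map values hold the file contents.
  const std::string& metadata = extra_files.at("metadata.json");
  (void)metadata;
  return 0;
}
```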
Test Plan: Imported from OSS
Reviewed By: kwanmacher
Differential Revision: D24770266
Pulled By: iseeyuan
fbshipit-source-id: 7e8bd301ce734dbbf36ae56c9decb045aeb801ce
Summary:
This diff adds support for `log_softmax` op in NNC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47409
Reviewed By: ejguan
Differential Revision: D24750203
Pulled By: navahgar
fbshipit-source-id: c4dacc7f62f9df65ae467f0d578ea03d3698273d
Summary:
Adds a new piece of infrastructure to the NNC fused-kernel generation compiler, which builds a dependency graph of the reads and writes to memory regions in a kernel.
It can be used to generate dependency graphs, for example for the GEMM benchmark (note this only represents the memory hierarchy, not the compute hierarchy).
Or to answer questions like this:
```
Tensor* c = Compute(...);
Tensor* d = Compute(...);
LoopNest loop({d});
MemDependencyChecker analyzer;
loop.root_stmt()->accept(&analyzer);
if (analyzer.dependsDirectly(loop.getLoopStmtsFor(d)[0], loop.getLoopStmtsFor(c)[0])) {
  // do something, maybe computeInline
}
```
Or this:
```
Tensor* d = Compute(...);
LoopNest loop({d});
MemDependencyChecker analyzer(loop.getInputs(), loop.getOutputs());
loop.root_stmt()->accept(&analyzer);
const Buf* output = d->buf();
for (const Buf* input : loop.getInputs()) {
  if (!analyzer.dependsIndirectly(output, input)) {
    // signal that this input is unused
  }
}
```
This is a monster of a diff, and I apologize. I've tested it as well as possible for now, but it's not hooked up to anything yet so should not affect any current usages of the NNC fuser.
**How it works:**
Similar to the registerizer, the MemDependencyChecker walks the IR aggregating memory accesses into scopes, then merges those scopes into their parent scope and tracks which writes are responsible for the last write to a particular region of memory, adding dependency links where that region is used.
This relies on a bunch of math on symbolic contiguous regions, which I've pulled out into its own file (bounds_overlap.h/cpp). Sometimes this won't be able to infer dependence with 100% accuracy, but it should always be conservative: it may occasionally add false positives, but I'm aware of no false negatives.
The hardest part of the analysis is determining when a Load inside a For loop depends on a Store that is lower in the IR from a previous iteration of the loop. This depends on a whole bunch of factors, including whether or not we should consider loop iteration order. The analyzer comes with configuration of this setting. For example this loop:
```
for (int x = 0; x < 10; ++x) {
  A[x] = B[x] + 1;
}
```
has no inter-loop dependence, since each iteration uses a distinct slice of both A and B. But this one:
```
for (int x = 0; x < 10; ++x) {
  A[0] = A[0] + B[x];
}
```
has a self-loop dependence between the Load and the Store of A. This applies to many cases that are not reductions as well. In this example:
```
for (int x = 0; x < 10; ++x) {
  A[x] = A[x+1] + x;
}
```
Whether or not it has a self-loop dependence depends on whether we assume the execution order is fixed (or whether this loop could later be parallelized). If the read from `A[x+1]` always comes before the write to that same region, then there is no dependence.
The analyzer can correctly handle dynamic shapes, but we may need more test coverage of real world usages of dynamic shapes. I unit test some simple and pathological cases, but coverage could be better.
**Next Steps:**
Since the PR was already so big I didn't actually hook it up anywhere, but I had planned on rewriting bounds inference based on the dependency graph. Will do that next.
There are a few gaps in this code which could be filled in later if we need them:
* Upgrading the bound math to work with write strides, which will reduce false positive dependencies.
* Better handling of Conditions, reducing false positive dependencies when a range is written in both branches of a Cond.
* Support for AtomicAdd node added in Cuda codegen.
**Testing:**
See new unit tests, I've tried to be verbose about what is being tested. I ran the python tests but there shouldn't be any way for this work to affect them yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46952
Reviewed By: ejguan
Differential Revision: D24730346
Pulled By: nickgg
fbshipit-source-id: 654c67c71e9880495afd3ae0efc142e95d5190df
Summary:
By default, TorchScript execution is single threaded and uses the caller's thread pool. For the use case of distributed inference, we want a way to customize this behavior so that the TorchScript interpreter can be executed elsewhere. This diff allows passing an explicit taskLauncher to the TorchScript interpreter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46865
Test Plan:
Unit tests pass.
fbshipit-source-id: 1d7b003926c0d1f8facc53206efb960cff8897ac
Reviewed By: houseroad
Differential Revision: D24616102
Pulled By: garroud
fbshipit-source-id: 79202b62f92d0b0baf72e4bf7aa3f05e0da91d59